skm: selective k-means.

Overview

A set of algorithm for solving minimize the sum of the column minimal of a subset k rows from a given matrix. This is equivalent to solve a k-means problem when the matrix is square and the matrix entries are euclidean distance. The skm package provides a set of c-level class and solvers to solve the problem with rectangluar matrix, and with arbitrary distance metric defined, and an r-level wrapper over one particular c-level solver.

An introduction example

Let’s first understand what does minimize the sum of the column minimal of a subset k rows from a given matrix mean via a concrete small example. Imagine that there are 3 internet service providers in the area where you owned 5 houses, and the matrix below shows the monthly charges for internet access from each service providers to each of your house.

#- an introduction example

mm <- matrix(
  data = c(
    1,  5,  8,  5, 8,
    3, 13,  3, 13, 3,
    2,  2, 21,  2, 21
  ), 
  nrow = 3, ncol = 5, byrow = TRUE, 
  dimnames = list(
    paste0("s", 1:3), paste0("t", 1:5)
  ))

mm
##    t1 t2 t3 t4 t5
## s1  1  5  8  5  8
## s2  3 13  3 13  3
## s3  2  2 21  2 21

Of course, you would like to save the cost by paying as less as possible. If it is possible to use all 3 service providers, you can select s1 for t1, s3 for t2, s2 for t3, s3 for t4 and s2 for t5, and pay sum of minimial of each column from the given matrix. However, what if you can select only 1 service provider? Then, you have to find the k = 1 row such that can minimize the sum of the column minimial of k = 1 rows from the matrix. This is as straightforward as finding the k = 1 row that gives the minimial row sum, which is the row s1. What if you can select 2 service providers? Then, you want to find the k = 2 rows such that can minimize the sum of the column minimial of k = 2 rows from the matrix. And, the problem is start to get interest as it becomes a combinatorial optimization problem. It is best to select row s2 and s3, after screening all combinations of s1 and s2 which leads to a total cost of 17, s2 and s3 which leads to a total cost of 12, and s3 and s1 which leads to a total cost of 22.

Ok, that sounds simple. Now, what if you have 51 candidate locations, one in each U.S. states, and you have 28,844 target houses, one in each U.S. zip codes, from each candidate locations to each target houses there is a cost associated with distance and time in transit, and you want select 10 from the 51 candidate locations to build warehouses so that you can minimize the cost? Of course, you can brute force to go over all choose(51, 10) = 12,777,711,870 possible combinations. How about using skm to find k = 10 candidate locations (rows) that minimize the overall cost (distance) from the closet warehouse to each target houses (sum of the column minimial)?

skm package design

Let’s concretize the example above with datasets come with skm package. There are two datasets in skm: zip2012 is a data.table with latitude, longitude, population and average income for each 28,844 U.S. zip code, and source_zip_list is a character vector of 51 zip, one for each U.S. states. The zip in the source_zip_list are the candidate locations which often being the center of the states that minimize the population weighed average distance to other zip within the state. And, the objective is to find 10 locations from the source_zip_list, which minimize the overall population weighed average distance to all 28,844 zip codes.

#- a zip find example
library(skm) # also load data.table, magrittr and Rcpp

#- create source - target - population weighted distance matrix

#- source
dsrc <- skm::zip2012[zip %in% skm::source_zip_list, .(s = zip, s_lat = lat, s_lng = lng)]

#- target
ddst <- skm::zip2012[ , .(t = zip, t_lat = lat, t_lng = lng, p_pop = p_pop)]

#- create source - target - population weighted distance data.table

#- CJ.data.table extends CJ from data.table to cross join data.table - see yg::CJ.dt
CJ.data.table <- function(X, Y) {

  stopifnot(is.data.table(X), is.data.table(Y))

  k = NULL

  X = X[, c(k = 1L, .SD)]
  setkey(X, k)

  Y = Y[, c(k = 1L, .SD)]
  setkey(Y, NULL)

  return( X[Y, allow.cartesian=TRUE][, k := NULL][] )

}

ddzt <- CJ.data.table(dsrc, ddst) %>% setorder(s, t) %>%
  .[ , `:=`(d = skm::dist_wlatlng(s_lat, s_lng, t_lat, t_lng), w = p_pop)]

#- ddzt has 1,471,044 row each with source zip in s, target zip in t, 
#- and population weighted distance in d

#- convert source - target - population weighted distance data.table into matrix
dmtx <- ddzt %>%
  .[ , `:=`(
    wd = d * w
  )] %>%
  data.table::dcast(
    s ~ t, value.var = "wd"
  )

sname <- dmtx[["s"]]

set(dmtx, i = NULL, j = 1L, value = NULL) # dmtx[ , `:=`(s = NULL)]

tname <- names(dmtx)

dmtx <- as.matrix(dmtx)

rownames(dmtx) <- sname
colnames(dmtx) <- tname

A quick overview of the dmtx:

round(dmtx[1L:4L, 1L:10L] * 10L^4L, 4L)
##        00501   01001   01002  01004   01005   01007  01008  01009   01010
## 02124 2.3732 48.0274 51.5983 2.3159  8.7798 37.0822 4.6501 2.9263  8.4073
## 02886 1.6754 38.7864 50.5262 2.2677 10.0092 34.5004 4.0737 2.5861  6.8087
## 03052 2.5970 46.4727 43.3135 1.9440  7.1070 33.2183 4.2145 2.7594  8.6559
## 04038 3.8379 92.0654 95.9728 4.3075 19.9072 75.0100 7.8733 6.1977 19.4622
##        01011
## 02124 4.1610
## 02886 3.7950
## 03052 3.6516
## 04038 6.8101

Thus, the objective is to find k = 10 rows (source locations) from the matrix dmtx to minimize the sum of each column minimial (the distance from closest source to each target location). The skm package aims solve a selective k-means problem like this one. Let’s define a selective k-means problem as finding k rows in an m x n matrix such that the sum of each column minimial is minimized. In the scenairo when m == n and each cell value is a valid distance metric, this is equivalent to a k-means problem. The selective k-means extends the k-means problem in the sense that it is possible to have m != n, often with m < n which implies we want to limit the search within a small set of candidates. Also, the selective k-means extends the k-means problem in the sense that the instance in row set can be instantce not seen in the column set, e.g, zip in source set s not necessary in target set t.

The skm solves the selective k-means problems using a method similar to the standard k-means algorithm, e.g., start with an initial solution, repeat with an assignment (expectation) step and an update (minimization) step until converge to an (often local) optimial. The available solvers are discussed one by one in the next subsections. The most important solvers are the C-level solver skm_mls_cpp and the R-level solver skm_mls, and also the C-level solver skm_gbp_cpp for extreme large problems.

C-level class and solver

skm_sgl_cpp

The function skm_sgl_cpp takes four inputs: the distance matrix x (double matrix), a single fixed initialization s_init (integer vector of row indicies matrix x, indexed from 0), a set of row indicies must be included in the final solution s_must (integer vector of row indicies matrix x, indexed from 0), and maximum number of iteration max_it (integer scalar), and returns an skmSolution instance.

The skm implements a class skmSolution to store the solution from various solvers. An instance of skmSolution has two member variables, an optimial objective o which is a double scalar value, an optimial source set s which is an integer vector of row indicies indexed from 0.

The function skm_sgl_cpp uses s_init as a single fixed initial point to find the optimial solution, which will be returned as an skmSolution instance.

#- select 10 from 51 candidate locations to provide service for 28,444 houses 
#- with the objective of minimizing the population weighted average distances
#- note that the row indicies are indexed from 0 following the cpp convention
m0 <- skm::skm_sgl_cpp(
  x = dmtx, s_init = c(0L, 1L, 2L, 3L, 5L, 8L, 13L, 21L, 34L, 4L), s_must = integer(0), max_it = 1000L
)

m0$o # objective: sum of column minimial
## [1] 360.6712
m0$s # selected source: row index set indexed from 0 so c(0L, 4L) implies rownames(dmtx)[c(1L, 5L)] = c("02124", "05452")
##       [,1]
##  [1,]   19
##  [2,]   13
##  [3,]   45
##  [4,]    7
##  [5,]    8
##  [6,]    1
##  [7,]    0
##  [8,]    4
##  [9,]    2
## [10,]    3
sname[m0$s + 1L]
##  [1] "37076" "25404" "89102" "11226" "19120" "02886" "02124" "05452" "03052"
## [10] "04038"
#- must have candidate locations rownames(dmtx)[c(5L, 11L)] = c("05452", "20011") in the solution
m1 <- skm::skm_sgl_cpp(
  x = dmtx, s_init = c(0L, 1L, 2L, 3L, 5L, 8L, 13L, 21L, 34L, 4L), s_must = c(4L, 10L), max_it = 1000L
)

m1$o
## [1] 363.1209
sname[m1$s + 1L]
##  [1] "05452" "20011" "37076" "89102" "11226" "19120" "02886" "02124" "03052"
## [10] "04038"

skm_rgi_cpp

The function skm_rgi_cpp is built on skm_sgl_cpp, in which skm_rgi_cpp takes an integer k and initialized the search for optimial solution with a random s_init, rather than a fixed manual specified s_init as in skm_sgl_cpp.

m2 <- skm::skm_rgi_cpp(
  x = dmtx, k = 10L, s_must = integer(0), max_it = 1000L
)

m2$o
## [1] 323.0713
sname[m2$s + 1L]
##  [1] "25404" "72764" "89102" "94596" "80219" "85210" "98006" "97214" "96701"
## [10] "99523"

skm_rgs_cpp

The function skm_rgs_cpp is built on skm_rgi_cpp, in which skm_rgs_cpp takes an extra argument indicates group g. The skm_rgs_cpp function generates the random initial solution s_init using a stratified sampling strategy with resepect to the grouping defined by g, rather than a simple random s_init in skm_rgi_cpp.

#- g is important when the size of s is large
#- g should be an integer vector same length as s, value indicate groups
gname <- sname %>% vapply(substr, "", 1L, 1L) %>% unname %>% as.integer()

m3 <- skm::skm_rgs_cpp(
  x = dmtx, k = 10L, g = gname, s_must = integer(0), max_it = 1000L
)

m3$o
## [1] 221.6097
sname[m3$s + 1L]
##  [1] "30093" "07017" "20120" "60629" "94596" "77084" "40214" "50317" "02124"
## [10] "87121"

skm_mlp_cpp

The function skm_mlp_cpp is built on skm_rgi_cpp, in which skm_mlp_cpp takes an extra argument indicates number of random initializations max_at, e.g., number of attempt to initialize a different s_init to find optimial solution. This aims to overcome the local optimial issue with multiple random initial points. The skm_mlp_cpp will return an Rcpp::List which in addition to the objective o and selected source s, it also includes all the objectives olist (double vector) and the selected source slist (integer matrix) reached along with each random initialized solution.

#- multiple random initial points to overcome local optimial issue
m4 <- skm::skm_mlp_cpp(
  x = dmtx, k = 10L, s_must = integer(0), max_it = 1000L, max_at = 4L
)

m4$o
## [1] 203.9813
sname[m4$s + 1L]
##  [1] "30093" "11226" "60629" "73008" "44130" "20011" "89102" "94596" "98006"
## [10] "96701"
m4$o_list # optimial objective find with each of the 4 random initial points
##          [,1]
## [1,] 203.9813
## [2,] 238.2370
## [3,] 218.4519
## [4,] 266.2851
#- str(m4$s_list) is num [1:4, 1:10] but not a matrix?
apply(m4$s_list, 2L, function(s) { sname[s + 1L] }) # selected sources obtained with each of the 4 random initial points
##      [,1]    [,2]    [,3]    [,4]    [,5]    [,6]    [,7]    [,8]    [,9]   
## [1,] "30093" "11226" "60629" "73008" "44130" "20011" "89102" "94596" "98006"
## [2,] "07017" "60629" "73008" "94596" "35242" "27513" "33186" "63129" "02124"
## [3,] "07017" "60629" "30093" "77084" "29615" "89102" "94596" "80219" "96701"
## [4,] "46227" "70506" "20120" "07017" "94596" "50317" "02124" "87121" "05452"
##      [,10]  
## [1,] "96701"
## [2,] "05452"
## [3,] "99523"
## [4,] "99523"

skm_mls_cpp

The function skm_mls_cpp combines skm_rgs_cpp and skm_mlp_cpp in the sense it solves selective k-mean problem with multiple random initial points and the initial solution s_init are initialized using stratified sampling on source s with respect to grouping g. This is the core C-level solver that should be called in most cases.

#- multiple random initial points initialized using stratified sampling w.r.t g
m5 <- skm::skm_mls_cpp(
  x = dmtx, k = 10L, g = gname, s_must = integer(0), max_it = 1000L, max_at = 4L
)

m5$o
## [1] 213.8434
sname[m5$s + 1L]
##  [1] "30093" "07017" "20120" "60629" "77084" "46227" "89102" "66062" "94596"
## [10] "02124"
m5$o_list
##          [,1]
## [1,] 234.2461
## [2,] 217.4452
## [3,] 213.8434
## [4,] 217.6622
apply(m5$s_list, 2L, function(s) { sname[s + 1L] })
##      [,1]    [,2]    [,3]    [,4]    [,5]    [,6]    [,7]    [,8]    [,9]   
## [1,] "30093" "07017" "20120" "73008" "89102" "60629" "40214" "02124" "55391"
## [2,] "19120" "30093" "89102" "60629" "77084" "44130" "02124" "33186" "55391"
## [3,] "30093" "07017" "20120" "60629" "77084" "46227" "89102" "66062" "94596"
## [4,] "46227" "20120" "06492" "77084" "19120" "89102" "33186" "50317" "80219"
##      [,10]  
## [1,] "96701"
## [2,] "96701"
## [3,] "02124"
## [4,] "97214"

skmRpl_mlp_cpp

The function skm_mlp_cpp and skm_mls_cpp solving the selective k-means with multiple random initial points. These multiple runs are running in serial. The function skmRpl_mlp_cpp, which is built on skmRpl_rgi_cpp, implements the same algorithm as skm_mlp_cpp while having multiple runs running in parallel. This solver often works fine, however, it is not stable at this moment and could crash R sometimes, mostly due to my lack of knowledge on RcppParallel. Thus, I didn’t go forward for skmRpl_mls_cpp yet. Let me know if you have good suggestions.

R-level class and solver

skm_mls

The function skm_mls is an R-level function wrapping over skm_mls_cpp with the following features.

At first, the input data x is expected to be a data.table instead of matrix. The data.table x is expected to have four columns: source indicator s (character or integer), target indicator t (character or integer), and the distance metric d (double), and an optional column for weight w. The column name for source, target, distance and weight can be specified using s_column, t_column, d_column and w_column, with default s, t, d, and NULL. If weight w is not NULL then the weighted average distance will be minimized rather than the simple average distance.

Also, it can take a set of k all at once, e.g. k = c(1L:10L), rather than a single k. The group structure can be specified using argument s_ggrp. The argument s_ggrp is expected a data.table contains source to group mapping. The argument auto_create_ggrp is a boolean indicator indicating whether to create group structure using the first letter of s when s_ggrp is integer(0L), default is TRUE. And, the argument max_at and max_it are inherited from skm_mls_cpp.

s_ggrp <- data.table(s = sname, g = gname)
s_ggrp[c(5L, 8L, 13L, 21L, 34L)]
##         s     g
##    <char> <int>
## 1:  05452     0
## 2:  11226     1
## 3:  21215     2
## 4:  38632     3
## 5:  66062     6

At last, the argument extra_immaculatism is an indicator indicating whether making extra runs for improving solution consistency, e.g., when specifying k = c(9L, 10L), the algorithm will first solve k = 9, and then when solve k = 10, it will have several extra runs with initial points using the solution found in k = 9 with one additional source. The argument extra_at defines how many such extra runs.

The function skm_mls will return a more meaningful data.table compared to the skm_mls_cpp returned Rcpp::List. The function skm_mls returns data.table that has five columns: the objective name o (character) which is the input value of d_colname, the weighting scale w (character) which is the input value of w_colname (character), the number of sources k (integer), the source list s (character vector), and the objective distance d (double).

# ddzt <- ddzt[ , .(s, t, d, w)]

m6 <- skm::skm_mls(
  x = ddzt, k = 1L:10L, s_colname = "s", t_colname = "t", d_colname = "d", w_colname = "w",
  s_ggrp = integer(0L), s_must = integer(0L), max_it = 1000L, max_at = 1000L, 
  auto_create_ggrp = TRUE, extra_immaculatism = TRUE, extra_at = 200L
)

m6
##          o      w     k                                       s        d
##     <char> <char> <int>                                  <list>    <num>
##  1:      d      w     1                                   40214 802.9782
##  2:      d      w     2                             40214,89102 502.7013
##  3:      d      w     3                       37076,07017,89102 388.2517
##  4:      d      w     4                 07017,35242,60629,89102 324.8525
##  5:      d      w     5           07017,60629,30093,89102,77084 277.4736
##  6:      d      w     6     07017,60629,30093,89102,77084,97214 249.6806
##  7:      d      w     7 07017,60629,30093,89102,77084,33186,... 231.1486
##  8:      d      w     8 07017,60629,30093,77084,89102,33186,... 215.0671
##  9:      d      w     9 07017,60629,30093,77084,66062,89102,... 200.7363
## 10:      d      w    10 07017,60629,30093,20120,77084,66062,... 188.0782

C-level solver for selective k-means problem of extreme large size

skm_gdp_cpp

The skm package also ships with a function skm_gdp_cpp, which implements a greedy algorithm for solving selective k-means of extreme large size, e.g., a matrix with thousands of rows and millions of columns. The function skm_gdp_cpp runs fast and returns simply a row index vector indexed from 0, so it requires a little bit more effort to obtain the objective and etc.

m7 <- skm::skm_gdp_cpp(
  x = dmtx, k = 10L
)
## skm_gdp_cpp: optimize at it <0> ...
## skm_gdp_cpp: optimize at it <1> ...
## skm_gdp_cpp: optimize at it <2> ...
## skm_gdp_cpp: optimize at it <3> ...
## skm_gdp_cpp: optimize at it <4> ...
## skm_gdp_cpp: optimize at it <5> ...
## skm_gdp_cpp: optimize at it <6> ...
## skm_gdp_cpp: optimize at it <7> ...
## skm_gdp_cpp: optimize at it <8> ...
## skm_gdp_cpp: optimize at it <9> ...
m7
##       [,1]
##  [1,]   21
##  [2,]   45
##  [3,]    6
##  [4,]   38
##  [5,]   16
##  [6,]   31
##  [7,]   48
##  [8,]   17
##  [9,]   46
## [10,]   33
# obtain the objective population weighted average distance from the row index vector m7
sum(apply(dmtx[m7 + 1L, ], 2L, min))
## [1] 194.0329

Some practical examples

Optimal Warehouse Locator

A shiny application that use skm to find the optimal locations for building warehouses: OWL - Optimal Warehouse Locator.