corss validation of missing value

Last updated: 2017-02-02

Code version: c78d381

3-fold CV

take 3-fold CV as example. We want to divide the data \(Y\) in to 3 parts to do 3-fold CV. We randomly divide the the columns into 3 sets and the rows into 3 sets as well. The data matrix \(Y\) is divided into 9 partition (by row and column permutation):

\[Y = \begin{pmatrix} Y_{11} & Y_{12} & Y_{13} \\ Y_{21} & Y_{22} & Y_{23} \\ Y_{31} & Y_{32} & Y_{33} \end{pmatrix}\]

Then \(Y_{(1)} = \{ Y_{11}, Y_{22} , Y_{33}\}\), \(Y_{(2)} = \{ Y_{12}, Y_{23} , Y_{31}\}\) and \(Y_{(3)} = \{ Y_{13}, Y_{21} , Y_{32}\}\) are orthogonal to each other. Then the data matrix \(Y\) is marked as: \[Y = \begin{pmatrix} Y_{(1)} & Y_{(2)} & Y_{(3)} \\ Y_{(3)} & Y_{(1)} & Y_{(2)} \\ Y_{(2)} & Y_{(3)} & Y_{(1)} \end{pmatrix}\]

For complete data set, we would use scheme 1. The imputation methods we consider are based on the model with low rank structure: \(LF^T\), so each \(Y_{(i)}\) in scheme 1 contains equally one set of indices of \(L_{.k}\) and \(F_{k.}\). This also helps to make sure that all \(i\)’s and \(j\)’s are included into each \(Y_{-(i)}\). Otherwise, we can’t make prediction when some times whole columns or rows are missing.

\[Y= \begin{pmatrix} Y_{11} & Y_{12} \\ Y_{21} & Y_{22} \end{pmatrix}= \begin{bmatrix} L^{(1)}F^{(1)} & L^{(1)}F^{(2)} \\ L^{(2)}F^{(1)} & L^{(2)}F^{(2)} \end{bmatrix} + E = \left[ \begin{array}{c} L^{(1)} \\ L^{(2)} \end{array} \right] \times \left[ \begin{array}{cc} F^{(1)}& F^{(2)} \end{array} \right] + E\]

In this case, \(Y_{(1)} = \{ Y_{11}, Y_{22} \}\), \(Y_{(2)} = \{ Y_{12}, Y_{21} \}\), and each of \(L^{(1)},F^{(1)},L^{(2)},F^{(2)}\) shows up in both \(Y_{(1)}\) and \(Y_{(2)}\) so the information contains in each \(Y_{(i)}\) are in balance.

5-fold CV

start from five fold and can easily extend to 10 fold or K -fold.

code

To generate the cross validation data for missing value, I randomly pick the missing possition for each fold.

the code is like the following:

cross_plot = function(Y,k_fold = 5){
  N = dim(Y)[1]
  P = dim(Y)[2]
  colindex = matrix(sample(P,P),ncol = k_fold)
  rowindex = matrix(sample(N,N),ncol = k_fold)
  
  missing= array(0,dim = c(k_fold,N,P))
  foldindex = array(0,dim = c(k_fold,k_fold,2))
  for(i in 1:k_fold){
    for(j in 1:k_fold){
      foldindex[i,j,1] = i
      foldindex[i,j,2] = (i+j) %% k_fold
    }
  }
  foldindex[which(foldindex == 0)] = k_fold
  for(i in 1:k_fold){
    missing[i, , ] = Y
    for(j in 1:k_fold){
      missing[i,rowindex[,foldindex[j,i,1]],colindex[,foldindex[j,i,2]]] = NA
    }
  }
  par(mfrow = c(2,3), mar = (c(5,4,4,2)-1.9))
  image(Y,main = "data Y")
  for(i in 1:k_fold){
    image(missing[i,,],main = "missing data")
  }
  return(missing)
}

block design

Starting from simplest case, where Y is \(5 \times 5\) matrix.

N = 5
P = 5
Y = matrix(rep(1,N*P),ncol =  P)
g = cross_plot(Y)

permute rows and columns

for larger matrix

N = 50
P = 100
Y = matrix(rep(1,N*P),ncol =  P)
g = cross_plot(Y)

## Session Information

sessionInfo()

R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.12.3 (unknown)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] workflowr_0.3.0 rmarkdown_1.3  

loaded via a namespace (and not attached):
 [1] backports_1.0.5 magrittr_1.5    rprojroot_1.2   htmltools_0.3.5
 [5] tools_3.3.0     yaml_2.1.14     Rcpp_0.12.8     stringi_1.1.1  
 [9] knitr_1.15.1    git2r_0.18.0    stringr_1.1.0   digest_0.6.11  
[13] evaluate_0.10

This R Markdown site was created with workflowr