# The Copenhagen Chemometrics Group

## 7. MISSING DATA

**Contents**

- Small intro to the theory of missing data
- Check the influence of missing data

**Data used**

claus.mat contains fluorescence excitation emission data from five samples containing tryptophan, phenylalanine, and tyrosine.

**Purpose**

Learning about proper preprocessing

**Information **

R. Bro, PARAFAC: Tutorial & applications. *Chemom. Intell. Lab. Syst*., 1997, **38**, 149-171.

### 1. Small intro to the theory of missing data

**Missing values should be treated with care in any model **

Simply setting the values to zero is sometimes suggested, but this is a *very* dangerous approach. The missing elements may just as well be set to 1237 or some other value.

There is nothing special about zero (is there? Could zero be a reasonable number when an element is missing?). Another approach is to impute the missing elements from an ANOVA model or something similar. While better than simply setting the elements to zero, this is still not a good approach.

**In two-way PCA and any-way PLS**

estimated through NIPALS-like algorithms the approach normally advocated for in chemometrics is to simply skip the missing elements in the appropriate inner products of the algorithm. This approach has been shown to work well for a small amount of randomly missing data, but also to be problematic in some cases.

**A better way, though, to handle missing data**

follows from the idea that the model is estimated by optimizing the loss function *only* considering non-missing data. This is a more sensible way of handling randomly missing data. The loss function for any model of incomplete data can thus be stated

Where **X** is a matrix containing the data and **M** the model (both unfolded).

The structure (and constraints) of **M** are given by the specific model being estimated.

The matrix **W** contains weights that are either one if corresponding to an existing element or zero if corresponding to a missing element. If weighted regression is desired **W** is changed accordingly keeping the zero elements at zero. The natural way to estimate the model with missing data is thus by a weighted regression approach.

Below the result of applying two different PCA algorithms to one sample from Claus.mat (with emission below excitation set to missing) are shown. Three components should be appropriate based on knowledge of fluorescence data. NIPALS cannot handle the structured missing data and give much more significant components as seen to the right in the plot. Using least squares fitting as indicated above, the missing data are simply not considered, and hence (as there are many elements available for estimating the scores and loadings) the model is well modelled. All components above three are noisy and low-variance.

**Another approach for handling incomplete data**

is to impute the missing data iteratively during the estimation of the model parameters. The missing data are initially replaced with either sensible or random elements.

A standard algorithm is used for estimating the model parameters using all data.

After each iteration the model of **X** is calculated, and the missing elements are replaced with the model estimates. The iterations and replacements are continued until no changes occur in the estimates of the missing elements and the overall convergence criterion is fulfilled. It is easy to see, that when the algorithm has converged, the elements replacing the missing elements will have zero residual.

**How then, do these two approaches compare?** *Henk Kiers* has shown that the two approaches give identical results, which can also be realized by considering data imputation more closely. As residuals corresponding to missing elements will be zero they do not influence the parameters of the model, which is the same as saying they should have zero weight in the loss function. Algorithmically, however, there are some differences. **Consider two competing algorithms**

for estimating a model of data with missing elements; one where the parameters are updated using weighted least squares regression with zero weights for missing elements and one where ordinary least squares regression and data imputation is used. Using direct weighted least squares regression instead of ordinary least squares regression is computationally more costly per iteration, and will therefore slow down the algorithm. Using iterative data imputation on the other hand often requires more iterations due to the data imputation (typically 30-100% more iterations). **It is difficult to say which method is preferable**

as this is dependent on implementation, size of the data, and size of the computer.

Data imputation has the advantage of being easy to implement, also for problems which are otherwise difficult to estimate under a weighted loss function.

### 2. Check the influence of missing data

**It is very easy to investigate the effect of missing data**

Simply take a full data set and set some of the elements to missing and you can compare with the results on the full data set. Here, we will try to illustrate the effect of missing data and their treatment in a number of ways.

**Test the difference between **

setting missing elements to zero or to NaN (Not a Number) in the *N*-way toolbox. When set to NaN, the missing elements are excluded in a least squares sense as indicated above.

`load claus`

K = prod(size(X)); %The number of elements in X

[idd,id]=sort(rand(K,1)); %Make a random vector with K elements

%Choose (random) position of the first

%of these as the ones to set to zero.

%View it to see how it works and also

%try a similar thing on a small array

%x=rand(4,3) to understand what is

%done in these lines.

`id=id(1:round(1.5*length(id)/2)); `

`%Pick approximately 75% of the elements as missing`

`XZero = X; %Make a new data array`

XZero(id)=0; %Set missing elements to zero

XNaN = X; %Make a new data array

XNaN(id)=NaN; %Set missing elements to NaN`%Fit a model to each of the data sets`

m0 = parafac(X,3); %Golden standard

m1 = parafac(XZero,3);

m2 = parafac(XNaN,3);

**Which of the models**

provide the best fit?

Try to plot the loadings of the different models and see if you can explain the results:`figure`

`plotfac(m0);`

`figure`

plotfac(m1);`figure`

plotfac(m2);

**Possibly plot the raw data **to gain more insight.`figure`

for i=1:5,subplot(2,3,i),mesh(squeeze(X(i,:,:))),axis tight,end`figure`

for i=1:5,subplot(2,3,i),mesh(squeeze(XZero(i,:,:))),axis tight,end`figure`

for i=1:5,subplot(2,3,i),mesh(squeeze(XNaN(i,:,:))),axis tight,end