# The Copenhagen Chemometrics Group

## 6. PREPROCESSING

**Contents**

- Centering
- Investigate the effect of centering
- Scaling

**Data used**

`claus.mat` contains fluorescence excitation-emission data from five samples containing tryptophan, phenylalanine, and tyrosine.

**Purpose**

Learning about proper preprocessing

**Information:**

R. Bro, PARAFAC: Tutorial & applications. *Chemom. Intell. Lab. Syst.*, 1997, **38**, 149-171.

Also see the paper:

“Centering and Scaling in Component Analysis”

**Preprocessing of higher-order arrays** is more complicated than in the two-way case, though understandable in light of the multilinear variation presumed to be an acceptable model of the data. Centering serves the same purpose as in two-way analysis, namely to remove constant terms in the data that may otherwise at best require an extra component and at worst make modeling impossible.

**All models described here are** implicitly based on the assumption that the data are ratio-scale (interval-scale with a natural origin), i.e., that there is a natural zero which really does correspond to zero (no presence means no signal), and that the measurements are otherwise proportional, such that doubling the amount of a phenomenon doubles its contribution to the data.

If the data are not approximately ratio-scale, then centering the data is mandatory. In two-way analysis, centering is almost always performed, but it is by no means always needed.

**Centering is performed to** make the data compatible with the structural model. Scaling, on the other hand, is a way of making the data compatible with the least squares loss function normally used.

Scaling does not change the structural model of the data, but only the weight paid to errors of specific elements in the estimation.

Scaling is dramatically simpler than using a weighted loss function and is therefore to be preferred if approximately homoscedastic data can be obtained by scaling. Centering and scaling are described below using three-way arrays.
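The distinction between structure and weighting can be checked numerically in the two-way case. The following is a minimal sketch, assuming made-up matrices `A`, `B`, weights `w`, and a residual pattern `E` (none of which come from the amino acid data): scaling the rows leaves the number of components unchanged, and an ordinary least squares loss on the scaled data equals a weighted loss on the original data.

```matlab
% Hypothetical two-component bilinear data: X = A*B' (4 samples x 3 variables).
A = [1 0; 2 1; 3 1; 4 3];
B = [1 2; 0 1; 2 1];
X = A*B';

% Scale each row by an assumed inverse uncertainty w(i).
w  = [1; 0.5; 2; 0.1];
Xs = diag(w)*X;                 % identical to (diag(w)*A)*B'

% The structural model is unchanged: still two components.
rankBefore = rank(X);
rankAfter  = rank(Xs);

% Any candidate model M of X corresponds to diag(w)*M for the scaled data.
E = 0.1*[1 -2 0; 3 1 -1; 0 2 1; -1 1 2];   % arbitrary residual pattern
M = X - E;

% Plain least squares on scaled data ...
lossScaled   = sum(sum((Xs - diag(w)*M).^2));
% ... equals weighted least squares on the original data.
lossWeighted = sum(sum(((w.^2)*ones(1,3)).*(X - M).^2));
```

Here `lossScaled` and `lossWeighted` agree, which is the sense in which scaling only changes the weight paid to individual errors.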

### 1. Centering

**Centering, e.g., the first mode of an array can be done by** unfolding the array to an *I* × *JK* matrix and then centering this matrix as in ordinary two-way analysis:

$$x_{ijk}^{\text{cent}} = x_{ijk} - \frac{1}{I}\sum_{i=1}^{I} x_{ijk}$$

**This is often referred to as single-centering.**

The centering shown above is also called centering *across* the first mode, a terminology suggested by ten Berge. The centering can of course be applied to any of the modes, depending on the problem. If centering is to be performed across more than one mode, it has to be done by first centering one mode and then centering the outcome of this centering.
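Sequential centering across two modes can be sketched as follows. This is a minimal illustration on a small made-up array (the sizes and values are arbitrary assumptions), written with `repmat` rather than the toolbox routines so that each step is visible.

```matlab
% Small hypothetical 3 x 4 x 2 array (values are arbitrary).
I = 3; J = 4; K = 2;
X = reshape(1:I*J*K, [I J K]).^2;

% Step 1: center across the first mode (subtract the mean over i).
Xc1 = X - repmat(mean(X,1), [I 1 1]);

% Step 2: center the OUTCOME of step 1 across the second mode.
Xc12 = Xc1 - repmat(mean(Xc1,2), [1 J 1]);

% The means across both modes are now zero (up to rounding).
m1 = mean(Xc12,1);
m2 = mean(Xc12,2);
maxMean1 = max(abs(m1(:)));
maxMean2 = max(abs(m2(:)));
```

The second centering does not undo the first, so the order of the two steps does not matter for the final result.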

If two centerings are performed in this way, it is often referred to as double-centering. Triple-centering means centering across all three modes, one at a time. It turns out that centering one mode at a time is the only appropriate way of centering with respect to the assumptions of PARAFAC or any other multilinear model. Centering one mode at a time essentially removes any constant levels in that particular mode. Centering, for example, matrices instead of columns will destroy the multilinear behavior of the data, because more constant levels are introduced than eliminated. The same holds for other kinds of centering.
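The effect of centering matrices instead of columns can be illustrated on synthetic data. The sketch below assumes a noise-free two-component trilinear array built from made-up loadings `A`, `B`, `C`, and uses the rank of the *J* × *IK* unfolding as a stand-in for the number of second-mode profiles a model would need. It is an illustration under these assumptions, not a general proof.

```matlab
% Hypothetical noise-free two-component trilinear array (I=4, J=5, K=3).
A = [1 2; 2 1; 3 5; 4 2];
B = [1 0; 2 1; 3 3; 2 5; 1 2];
C = [1 3; 2 2; 4 1];
[I,R] = size(A); J = size(B,1); K = size(C,1);
Z = zeros(J*K,R);
for r = 1:R
    Z(:,r) = kron(C(:,r),B(:,r));   % column-wise Khatri-Rao product
end
X1 = A*Z';                          % I x JK unfolding of the array

% Proper: center across the first mode (column means of X1).
Xok  = X1 - ones(I,1)*mean(X1,1);
% Improper: subtract one mean per sample, i.e. center whole J x K matrices.
Xbad = X1 - mean(X1,2)*ones(1,J*K);

% Rank of the J x IK unfolding of each version.
unfold2 = @(M) reshape(permute(reshape(M,[I J K]),[2 1 3]), J, I*K);
rankRaw = rank(unfold2(X1));
rankOk  = rank(unfold2(Xok));
rankBad = rank(unfold2(Xbad));
```

With these numbers the proper centering keeps the rank at two, whereas the improper centering adds a constant profile to the second mode and raises the rank to three.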

For instance, if it is known that the true model consists of one PARAFAC term (a trilinear component) and an overall level, it may seem feasible to estimate a PARAFAC model on the original data with the grand level subtracted. However, even though the mathematical structure might theoretically be true, the subtraction of the grand level introduces some artifacts in the data that are not easily described by the PARAFAC model. In this case, even though the grand level has been subtracted, *two components* are still necessary to describe the data.

This shows that the preprocessing has not achieved its goal of simplifying the subsequent model. If, on the other hand, the data are centered across one mode, the data can be modeled by a one-component model. Another possibility is to estimate a two-component model while constraining one component to have constant loadings in each mode, thus reflecting the grand level. This provides a model with a unique estimate of the grand level (see box below).
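A small numerical sketch of the grand-level case, assuming made-up loading vectors `a`, `b`, `c`, an assumed level `g`, and that the grand level is estimated by the overall mean of the data. The rank of the unfolded matrix is used as a lower bound on the number of components needed.

```matlab
% Hypothetical data: one trilinear component plus a grand level g.
a = [1; 2; 4; 3];  b = [1; 3; 2];  c = [2; 1; 5; 3; 1];
I = numel(a); J = numel(b); K = numel(c);
g  = 10;
X1 = a*kron(c,b)' + g;              % I x JK unfolding of the array

% Option 1: subtract the grand mean of the data.
Xgrand = X1 - mean(X1(:));
% Option 2: center across the first mode.
Xcent  = X1 - ones(I,1)*mean(X1,1);

rankRaw   = rank(X1);       % component plus offset
rankGrand = rank(Xgrand);   % a constant level is still present
rankCent  = rank(Xcent);    % offset removed, one component left

% The grand mean also contains the mean of the trilinear part,
% so subtracting it leaves a non-zero level behind.
gLeft = g - mean(X1(:));
```

In this example the grand mean overshoots the true level, so `Xgrand` still needs two components, while `Xcent` needs only one.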

### 2. Investigate the effect of centering

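**Fit an appropriate three-component PARAFAC model** to the amino acid data and look at the loadings.

How much variance does the model describe?

```matlab
load claus                % provides X and the axes ExAx, EmAx used below
model = parafac(X,3);     % three-component PARAFAC model
plotfac(model)            % plot the loadings
```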

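**Center the data across samples**, i.e., as in ordinary two-way analysis:

```matlab
x1 = reshape(X,5,201*61);                     % unfold to samples x (emission*excitation)
meanx1 = mean(x1);                            % mean of each column
centeredx1 = x1-ones(5,1)*meanx1;             % subtract the column means
centeredx1 = reshape(centeredx1,[5 201 61]);  % fold back to a three-way array
```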

**You can also use the** m-file `nprocess` for this (`centeredx1 = nprocess(X,[1 0 0],[0 0 0]);`), but we avoid that here to get some insight into preprocessing and how it works. For real data analysis, though, it is always a good idea to use `nprocess` to avoid problematic preprocessing.

**Look at the centered data compared to the uncentered data**

```matlab
figure
subplot(1,2,1),
mesh(ExAx,EmAx,squeeze(X(4,:,:))),axis tight            % sample 4, raw
title('Uncentered')
subplot(1,2,2),
mesh(ExAx,EmAx,squeeze(centeredx1(4,:,:))),axis tight   % sample 4, centered
title('Centered')
```

**Upon centering**, any offsets that are constant across samples should be removed, and hence a three-component model should still be valid. Fit an appropriate three-component PARAFAC model to the centered data and look at the loadings. How much variance does the model describe?

Do the loadings look like the ones obtained earlier? Which ones differ a little/a lot?

```matlab
model1 = parafac(centeredx1,3);
figure
plotfac(model1)
```

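**Now try to do an incorrect centering across two modes**.

For example:

```matlab
meanx1b = mean(x1');                        % one mean per sample (over all JK elements)
centeredx1b = x1'-ones(201*61,1)*meanx1b;   % subtract it from that sample's whole matrix
centeredx1b = centeredx1b';
centeredx1b=reshape(centeredx1b,size(X));   % fold back to a three-way array
model2 = parafac(centeredx1b,3);
figure
plotfac(model2)
```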

**How much variance does the model describe?** **Why?**

Do the loadings look like the ones obtained earlier? Why (not)?

### 3. Scaling

**If the uncertainties of the individual data elements are known**, it can be feasible to use these in the decomposition. If the uncertainty of a given variable remains almost the same over all other modes, it will suffice to scale the array accordingly. After scaling, an unconstrained model is estimated from the scaled array. If the uncertainties also vary within specific variables, or if an iteratively re-weighted approach is desired for robustness, then the model must be estimated using a weighted loss function.

**Scaling in multi-way analysis has to be done** taking the trilinear model into account. It is not appropriate, as it is for centering, to scale the unfolded array column-wise; rather, whole slabs or submatrices of the array should be scaled. If variable *j* of the second mode is to be scaled (compared to the rest of the variables in the second mode), it is necessary to scale all columns where variable *j* occurs by the same scalar. This means that whole matrices instead of columns have to be scaled. For a four-way array, three-way arrays would have to be scaled. Mathematically, scaling within the first mode can be described as

$$x_{ijk}^{\text{scal}} = \frac{x_{ijk}}{s_i},$$

where setting

$$s_i = \sqrt{\sum_{j=1}^{J}\sum_{k=1}^{K} x_{ijk}^{2}}$$

will scale to unit squared variation.

**The scaling shown above** is referred to as scaling *within* the first mode. When scaling within several modes is desired, the situation is more complicated because scaling one mode affects the scale of the other modes. If scaling to norm one is desired within several modes, this has to be done iteratively until convergence.
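Scaling within one mode, and iterative scaling within two modes, might be sketched as below. The array is made up, and one detail is an assumption of this sketch rather than something stated in the text: the iterative part scales each slab to unit *mean* square instead of unit sum of squares, so that the targets for the two modes are mutually compatible when *I* and *J* differ.

```matlab
% Hypothetical positive 3 x 4 x 2 array with one slab on a much larger scale.
I = 3; J = 4; K = 2;
X = reshape(1:I*J*K, [I J K]);
X(2,:,:) = 100*X(2,:,:);

% Scaling within the first mode: every element of slab i is divided by s(i).
X1  = reshape(X, I, J*K);
s   = sqrt(sum(X1.^2, 2));                  % root sum of squares per slab
Xs  = reshape(X1 ./ (s*ones(1,J*K)), [I J K]);
ss1 = sum(reshape(Xs, I, J*K).^2, 2);       % sum of squares per slab, now one

% Scaling within modes 1 and 2: alternate until the factors settle at one.
Y = X;
for it = 1:500
    s1 = sqrt(sum(sum(Y.^2, 2), 3) / (J*K));   % I x 1, RMS of each mode-1 slab
    Y  = Y ./ repmat(s1, [1 J K]);
    s2 = sqrt(sum(sum(Y.^2, 1), 3) / (I*K));   % 1 x J, RMS of each mode-2 slab
    Y  = Y ./ repmat(s2, [I 1 K]);
    if max(abs(s1(:) - 1)) < 1e-10, break, end
end
ms1 = sum(sum(Y.^2, 2), 3) / (J*K);   % mean square per mode-1 slab
ms2 = sum(sum(Y.^2, 1), 3) / (I*K);   % mean square per mode-2 slab
```

After the loop, both `ms1` and `ms2` are one to within the tolerance, which a single pass over each mode would not achieve.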

**Another complicating issue** is the interdependence of centering and scaling. Scaling within one mode disturbs prior centering across the same mode, but not across other modes. Centering across one mode disturbs scaling within all modes. Hence only centering across arbitrary modes or scaling within one mode is straightforward, and furthermore not all combinations of iterative scaling and centering will converge.

**In practice**, it need not influence the outcome much if an iterative approach is not used. Scaling to a sum of squares of one is arbitrary anyway, and it may be just as reasonable to scale once within the modes of interest, e.g., by variances, thereby at least mostly equalizing any huge differences in scale. Centering can then be performed after scaling, which ensures that the modes to be centered are indeed centered.
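The interdependence can be checked on a small made-up array. In this sketch the helper names `center1`, `center2`, `scale1`, and `maxabs` are hypothetical conveniences defined on the spot, not toolbox functions.

```matlab
% Hypothetical 4 x 3 x 2 array (arbitrary values).
I = 4; J = 3; K = 2;
X = reshape(1:I*J*K, [I J K]).^2;

center1 = @(Y) Y - repmat(mean(Y,1), [size(Y,1) 1 1]);   % across mode 1
center2 = @(Y) Y - repmat(mean(Y,2), [1 size(Y,2) 1]);   % across mode 2
scale1  = @(Y) Y ./ repmat(sqrt(sum(sum(Y.^2,2),3)), ...
                           [1 size(Y,2) size(Y,3)]);     % within mode 1
maxabs  = @(Y) max(abs(Y(:)));

% Scaling within mode 1 after centering across mode 1: centering is lost.
Ya = scale1(center1(X));
lostCentering = maxabs(mean(Ya,1));     % clearly non-zero

% Scaling within mode 1 after centering across mode 2: centering survives.
Yb = scale1(center2(X));
keptCentering = maxabs(mean(Yb,2));     % zero up to rounding

% Practical order: scale once, then center.
Yc = center1(scale1(X));
finalCentering = maxabs(mean(Yc,1));    % zero up to rounding
slabSS = sum(sum(Yc.^2,2),3);           % no longer exactly one
```

The last case shows the trade-off accepted in practice: the data end up exactly centered, while the scaling is only approximately preserved.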

**The appropriate centering and scaling procedures** can most easily be summarized in a figure where the array is shown unfolded to a matrix (Figure 1). Centering must be done across the columns of this matrix, while scaling should be done within the rows of this matrix. Note that the common approach of scaling the columns of a data matrix would not be appropriate for the above unfolded data. The consequence of such a scaling could be that more components are necessary than if proper scaling is used, and that the resulting model will be more difficult to interpret (see box).
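The consequence of column-wise scaling can be illustrated on synthetic data. The sketch assumes a noise-free two-component trilinear array built from made-up loadings and compares column-wise scaling of the unfolded matrix with scaling within the second mode, again using the rank of the *J* × *IK* unfolding as a rough indicator of how many components would be needed.

```matlab
% Hypothetical noise-free two-component trilinear array (I=4, J=5, K=3).
A = [1 2; 2 1; 3 5; 4 2];
B = [1 0; 2 1; 3 3; 2 5; 1 2];
C = [1 3; 2 2; 4 1];
[I,R] = size(A); J = size(B,1); K = size(C,1);
Z = zeros(J*K,R);
for r = 1:R
    Z(:,r) = kron(C(:,r),B(:,r));   % column-wise Khatri-Rao product
end
X1 = A*Z';                          % I x JK unfolding
X3 = reshape(X1,[I J K]);           % the three-way array itself

% Common two-way habit: scale every column of the unfolded matrix.
Xcol = reshape(X1 ./ (ones(I,1)*sqrt(sum(X1.^2,1))), [I J K]);

% Scaling within the second mode: one scalar per variable j for a whole slab.
sj    = sqrt(sum(sum(X3.^2,1),3));  % 1 x J
Xslab = X3 ./ repmat(sj,[I 1 K]);

unfold2  = @(Y) reshape(permute(Y,[2 1 3]), J, I*K);
rankRaw  = rank(unfold2(X3));
rankCol  = rank(unfold2(Xcol));     % structure broken, rank goes up
rankSlab = rank(unfold2(Xslab));    % structure kept, rank unchanged
```

Slab-wise scaling only rescales the rows of `B`, so the trilinear structure survives, whereas column-wise scaling applies a different factor for every combination of *j* and *k*.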