# The Copenhagen Chemometrics Group

## 4. MULTI-WAY REGRESSION – N-PLS

**Contents**

- Load data
- Compare unfold-PCA and PARAFAC
- Compare unfold- and three-way PLS regression
- Summary

**Data used**

brod.mat (courtesy of Magni Martens, KVL). Eight judges assessed ten breads with respect to 11 attributes. The data are in the matrix **X**. The samples are pair-wise replicates. The salt content of each bread is also known.

**Purpose**

Using N-PLS on a simple data set to learn about the method and to see the importance of using a model of appropriate structure.

**Information**

R. Bro, Multi-way calibration. Multi-linear PLS, *J. Chemom.*, 1996, **10**(1), 47-62.

**Prerequisites**

Be sure to understand the basics of handling multi-way arrays in MATLAB (Chapter 1). You should also know your two-way PLS.

### 1. Load data

Load the data (`load brod`) and use `whos` to see which variables have been loaded.

**Unlike the fluorescence data described previously** there is no similar fundamental theory for how sensory data ideally behave. Thus, a trilinear model cannot be justified here as a hard model, as it can for well-behaved spectral data.

However, even more significant is the idea of *latent variables*. That is, if we assume that all assessors use the same latent or basic types of sensations, only in different proportions, then this is exactly what the trilinear model states.

**If the data are unfolded** such that the breads are the row-mode and the attributes and judges are the column-mode (see figure above), then the data can be written as **X** = [**X**_{1} **X**_{2} ... **X**_{8}], where **X**_{k} holds the assessments of the *k*th judge. If we make an *F*-component two-way PCA model of this matrix we obtain the approximation **X** = **TP**^{T}, where **T** is the 10 by *F* score matrix pertaining to the breads and **P** is the 88 by *F* loading matrix pertaining to both assessors and attributes.

As seen from the figure the loading elements for each assessor are not directly related to the loading elements of the other assessors. The first eleven rows of **P** correspond to the first assessor etc. Thus, in unfolding we impose no relation between different assessors. Each assessor is assumed to have his or her own idiosyncratic perception.
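The unfolding and the block structure of **P** can be sketched in a few lines of base MATLAB. This is only an illustration under stated assumptions: the array is filled with random numbers standing in for the bread data (it is not brod.mat), PCA is computed with `svd`, and the variable names `Xunf`, `Xc` and `Pk` are hypothetical.

```matlab
% Synthetic stand-in for the sensory array: 10 breads x 11 attributes x 8 judges.
% (Random numbers, NOT the brod.mat data.)
rng(0);
X = randn(10, 11, 8);

% Unfold so that breads are rows and (attribute, judge) pairs are columns.
% MATLAB stores arrays column-major, so the 11 attributes of judge 1 come
% first, then those of judge 2, and so on: Xunf = [X1 X2 ... X8].
Xunf = reshape(X, 10, 11*8);

% Centre every column (remove the mean over breads).
% Implicit expansion requires MATLAB R2016b or later.
Xc = Xunf - mean(Xunf, 1);

% F-component PCA via the singular value decomposition.
F = 2;
[U, S, V] = svd(Xc, 'econ');
T = U(:, 1:F) * S(1:F, 1:F);      % 10 x F scores (breads)
P = V(:, 1:F);                    % 88 x F loadings (attributes and judges)

% The loadings of judge k occupy rows (k-1)*11+1 : k*11 of P.
k = 3;
Pk = P((k-1)*11 + (1:11), :);     % 11 x F block belonging to judge k
```

Nothing in the decomposition ties the block `Pk` of one judge to the block of another, which is the point made above about unfolding.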

**In maintaining the structure of the three-way data** we obtain instead the PARAFAC model **X**_{k} = **AD**_{k}**B**^{T}, where **A** is the 10 by *F* score matrix pertaining to the breads (similar to **T** above), **B** is the 11 by *F* loading matrix pertaining to the attributes and **D**_{k} is a diagonal matrix holding the *k*th row of **C**, which is the 8 by *F* loading matrix for the judges.

By writing the trilinear model as above it is seen that, for sensory data, the model imposed on the data is that all assessors use the same basic types of sensations given by **B**, but each assessor uses these latent variables in different proportions. For example, the *k*th assessor uses the first component with a relative magnitude of *c*_{(k,1)}, which is the first diagonal element in **D**_{k}.
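The slab-wise form of the trilinear model can be checked numerically. The sketch below is an illustration on random factor matrices (not the bread data): it builds an array from **X**_{k} = **AD**_{k}**B**^{T} and confirms that the same array, once unfolded, equals **A** times the transpose of a column-wise Kronecker (Khatri-Rao) product of **C** and **B**. The names `Z` and `err` are assumptions made for this example.

```matlab
rng(1);
I = 10; J = 11; K = 8; F = 2;
A = randn(I, F);    % bread scores
B = randn(J, F);    % attribute loadings
C = randn(K, F);    % judge loadings

% Build the array slab by slab: X_k = A * D_k * B'.
X = zeros(I, J, K);
for k = 1:K
    Dk = diag(C(k, :));           % k-th row of C on the diagonal
    X(:, :, k) = A * Dk * B';
end

% Same model in unfolded form: Xunf = A * Z', where column f of Z is
% kron(C(:,f), B(:,f)).
Z = zeros(J*K, F);
for f = 1:F
    Z(:, f) = kron(C(:, f), B(:, f));
end
Xunf = reshape(X, I, J*K);

% Difference between the two constructions (rounding error only).
err = norm(Xunf - A * Z', 'fro');
```

Each judge's slab shares the same **B**; only the diagonal weights in `Dk` differ between judges.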

**The trilinear model underlying both PARAFAC and N-PLS** is more restricted than unfolding models. Therefore the fit of a trilinear model will by definition be no higher than the fit of a corresponding bilinear model. However, the bilinear model is often overly flexible and the increased fit is to a large extent attributable to fitting the noise of the data. In the trilinear model it is much more difficult to overfit, since any variation incorporated in the model must be consistent over all assessors (or similar).

In the bilinear model above each component uses 98 (10+88) parameters while a trilinear component uses only 29 (10+11+8) parameters. This clearly illustrates that for the bilinear model to be suitable there must be a large deviation from the trilinear model. Otherwise the increased number of parameters will only fit noise. And to the degree that the trilinear model is only approximately correct, incorporating an additional trilinear component is still by far more parsimonious than using a bilinear component.
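The parameter counts quoted above follow directly from the array dimensions, and the gap widens with every added component. A small sketch of the arithmetic (variable names are hypothetical):

```matlab
I = 10; J = 11; K = 8;            % breads, attributes, judges
F = 1:3;                          % number of components

nBilinear  = F * (I + J*K);       % unfold-PCA:  10 + 88 per component
nTrilinear = F * (I + J + K);     % PARAFAC:     10 + 11 + 8 per component

disp([F; nBilinear; nTrilinear]);
```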

### 2. Compare unfold-PCA and PARAFAC

**In order to first explore the data** one may use either PARAFAC or a PCA model on the unfolded data.

In this exercise it will be shown that the PARAFAC model is easier to interpret than the unfold model. For convenience you do not have to consider how many components to use. Two components are suitable for both PCA and PARAFAC.

**Task**

Estimate a two-component two-way PCA of the unfolded centered data. Try to interpret the scores and loadings. Estimate a two-component PARAFAC model of the centered data. Interpret. For which model are the replicates located closest to each other in the score plot? Why?
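To make the two models in this task concrete, the following is a minimal sketch in base MATLAB, run on synthetic, nearly trilinear data rather than on the bread data. It centers across the bread mode, fits a two-component PCA to the unfolded array, and fits a two-component PARAFAC model with a bare-bones alternating least squares loop. The loop is an assumption made for illustration (random start, no constraints, simple stopping rule); for the exercise itself a dedicated PARAFAC routine would normally be used.

```matlab
rng(2);
I = 10; J = 11; K = 8; F = 2;

% Synthetic trilinear data plus a little noise (NOT the bread data).
A0 = randn(I, F); B0 = randn(J, F); C0 = randn(K, F);
X = zeros(I, J, K);
for k = 1:K
    X(:, :, k) = A0 * diag(C0(k, :)) * B0';
end
X = X + 0.05 * randn(I, J, K);

% Centre across the bread mode (implicit expansion, R2016b or later).
X = X - mean(X, 1);

% One unfolding per mode.
X1 = reshape(X, I, J*K);                      % breads in rows
X2 = reshape(permute(X, [2 1 3]), J, I*K);    % attributes in rows
X3 = reshape(permute(X, [3 1 2]), K, I*J);    % judges in rows

% Unfold-PCA: percentage of variation explained by F components.
ssX = sum(X(:).^2);
s = svd(X1);
explPCA = 100 * sum(s(1:F).^2) / ssX;

% Khatri-Rao product: column f is kron(U(:,f), V(:,f)).
kr = @(U, V) cell2mat(arrayfun(@(f) kron(U(:, f), V(:, f)), ...
                      1:size(U, 2), 'UniformOutput', false));

% PARAFAC by alternating least squares from a random start.
B = randn(J, F); C = randn(K, F);
fitOld = Inf;
for it = 1:2000
    Z = kr(C, B);  A = X1 * Z / (Z' * Z);     % update bread scores
    Z = kr(C, A);  B = X2 * Z / (Z' * Z);     % update attribute loadings
    Z = kr(B, A);  C = X3 * Z / (Z' * Z);     % update judge loadings
    res = X1 - A * kr(C, B)';
    fit = sum(res(:).^2);                     % residual sum of squares
    if abs(fitOld - fit) < 1e-10 * ssX, break; end
    fitOld = fit;
end
explPARAFAC = 100 * (1 - fit / ssX);
```

On any data set `explPCA` is at least as large as `explPARAFAC`, because the trilinear model is a constrained special case of the bilinear one. The score plots for the task would use the columns of `A` for PARAFAC and the first `F` left singular vectors (scaled by the singular values) for unfold-PCA.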

### 3. Compare unfold- and three-way PLS regression

**As an example of calibration with very noisy data** you must try to predict the salt content from the sensory data. This is not interesting from a sensory point of view, but it provides a good illustration of the use of data with relatively low signal-to-noise ratio and significant model mis-specification (the data are definitely not perfectly trilinear as e.g. fluorescence data are).

**Task**

Use unfold-PLS as well as three-way PLS to build a regression model for salt using every second sample. Predict the salt content of the remaining samples. You'll need a MATLAB m-file for PLS calibration in this exercise (for example from the PLS_Toolbox). Alternatively, you can use the trilinear PLS model but reshape the three-way array so that one variable mode holds all variables (dimension 88) and the other variable mode has dimension 1.

Use the plotting functions in MATLAB (or in the PLS_Toolbox) to plot the scores and loadings to investigate the models.
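The data handling for this task can be sketched as follows. The example uses synthetic stand-ins for the sensory array and the salt values (not brod.mat), picks every second sample for calibration, reshapes the array to samples by 88 by 1 as described above, and fits an unfold-PLS model with a minimal NIPALS PLS1 loop written in base MATLAB. The PLS1 loop and all variable names are assumptions made for illustration; a toolbox routine would normally replace it, and the three-way model would come from a dedicated N-PLS routine.

```matlab
rng(3);
X = randn(10, 11, 8);                          % synthetic array, NOT brod.mat
y = mean(X(:, 1, :), 3) + 0.1 * randn(10, 1);  % hypothetical "salt" values

cal = 1:2:10;                                  % every second sample
tst = 2:2:10;                                  % remaining samples

% Samples x 88 x 1. MATLAB drops the trailing singleton dimension, so the
% result is an ordinary 5 x 88 matrix, yet size(Xcal, 3) still returns 1.
Xcal = reshape(X(cal, :, :), numel(cal), 88, 1);
Xtst = reshape(X(tst, :, :), numel(tst), 88, 1);
ycal = y(cal);  ytst = y(tst);

% Centre with calibration means only (implicit expansion, R2016b or later).
mx = mean(Xcal, 1);  my = mean(ycal);
Xc = Xcal - mx;      yc = ycal - my;

% Minimal PLS1 (NIPALS) with nLV latent variables.
nLV = 2;
W = zeros(88, nLV);  P = zeros(88, nLV);  q = zeros(nLV, 1);
E = Xc;  f = yc;
for a = 1:nLV
    w = E' * f;  w = w / norm(w);              % weight vector
    t = E * w;                                 % scores
    p = E' * t / (t' * t);                     % X loadings
    q(a) = f' * t / (t' * t);                  % inner regression coefficient
    E = E - t * p';                            % deflate X
    f = f - t * q(a);                          % deflate y
    W(:, a) = w;  P(:, a) = p;
end
b = W / (P' * W) * q;                          % regression vector (88 x 1)

% Predict the left-out samples and compute the prediction error.
ypred = (Xtst - mx) * b + my;
rmsep = sqrt(mean((ytst - ypred).^2));
```

With the real data, the same `cal` and `tst` index vectors would be applied to the three-way array before calling a trilinear PLS routine, so that both models are judged on identical test samples.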

### 4. Summary

**In this chapter** it has been shown that multi-way models are not solely applicable in spectral analysis. In fact, for these data the gain in using multi-way models is quite pronounced because the robustness and interpretability of the multi-way structure is even more important when the data are noisy. This is apparent in the closeness of the replicate breads in the score plots, in the interpretability of the loading plots and in the predictions obtained in PLS models.