The Copenhagen Chemometrics Group
4. MULTI-WAY REGRESSION – N-PLS
- Load data
- Compare unfold-PCA and PARAFAC
- Compare unfold- and three-way PLS regression
brod.mat (courtesy of Magni Martens, KVL). Eight judges assessed ten breads with respect to 11 attributes. The data are in the array X. The samples are pair-wise replicates. The salt content of each bread is also known.
Using N-PLS on a simple data set to learn about the method and to see the importance of using a model of appropriate structure.
R. Bro, Multi-way calibration. Multi-linear PLS, J. Chemom., 1996, 10(1), 47-62.
Be sure to understand the basics of handling multi-way arrays in MATLAB (Chapter 1).
You should know your two-way PLS.
1. Load data
Get the data
Load the data (load brod) and use whos to see which variables are there.
Unlike the fluorescence data described previously, there is no similar fundamental theory for how sensory data ideally behave.
Thus, using a trilinear model in this case cannot be justified as a hard model, as it can for well-behaved spectral data.
However, even more significant is the idea of latent variables. That is, if we assume that all assessors use the same latent or basic types of sensations, only in different proportions, then this is exactly what the trilinear model states.
If the data are unfolded such that the breads are the row mode and the attributes and judges are the column mode (see figure above), then the data can be written as X = [X_1 X_2 ... X_8], where X_k holds the assessments of the kth judge. If we make an F-component two-way PCA model of this matrix, we obtain the approximation X = TP^T, where T is the 10 by F score matrix pertaining to the breads and P is the 88 by F loading matrix pertaining to both assessors and attributes.
As seen from the figure the loading elements for each assessor are not directly related to the loading elements of the other assessors. The first eleven rows of P correspond to the first assessor etc. Thus, in unfolding we impose no relation between different assessors. Each assessor is assumed to have his or her own idiosyncratic perception.
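The unfolding and the two-way PCA described above can be sketched in a few lines of base MATLAB. This is only an illustration on a random placeholder array of the same size as the bread data, and it assumes that the modes are ordered breads by attributes by judges (check size(X) for the real data). The names Xu, Xc, T, P and P1 are chosen here and do not come from the data file.

```matlab
% Placeholder array of the same size as the bread data (NOT the real data).
% Assumed mode order: breads x attributes x judges.
rng(1);
X = randn(10, 11, 8);
F = 2;                                   % number of components

% Unfold: judge k ends up in columns (k-1)*11+1 : k*11.
Xu = reshape(X, 10, 88);

% Centre each column across the breads.
Xc = Xu - repmat(mean(Xu, 1), 10, 1);

% Two-way PCA via the singular value decomposition.
[U, S, V] = svd(Xc, 'econ');
T = U(:, 1:F) * S(1:F, 1:F);             % 10 x F scores (breads)
P = V(:, 1:F);                           % 88 x F loadings (attributes and judges)

% Loadings belonging to the first judge: the first eleven rows of P.
P1 = P(1:11, :);
```

Nothing in this computation ties the eleven rows of one judge to those of another, which is the point made in the text.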
In maintaining the structure of the three-way data we obtain instead the PARAFAC model X_k = A D_k B^T, where A is the 10 by F score matrix pertaining to the breads (similar to T above), B is the 11 by F loading matrix pertaining to the attributes, and D_k is a diagonal matrix holding the kth row of C, which is the 8 by F loading matrix for the judges.
By writing the trilinear model as above it is seen that for sensory data, the model imposed on the data is that all assessors use the same basic types of sensations, given by B, but each assessor uses these latent variables in different proportions. For example, the kth assessor uses the first component with a relative magnitude of c(k,1), which is the first diagonal element of D_k.
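The slab-wise form of the trilinear model can be checked numerically. The sketch below builds a three-way array from random placeholder factor matrices (not estimates from the bread data) and verifies that the unfolded array is a bilinear model whose loading vectors are Kronecker products of a judge loading and an attribute loading. The mode order breads by attributes by judges is an assumption, and the variable names are illustrative.

```matlab
% Random placeholder factor matrices (NOT estimated from the bread data).
rng(2);
F = 2;
A = randn(10, F);    % breads
B = randn(11, F);    % attributes
C = randn(8,  F);    % judges

% Build the array slab by slab: X_k = A * D_k * B'.
X = zeros(10, 11, 8);
for k = 1:8
    Dk = diag(C(k, :));          % kth row of C on the diagonal
    X(:, :, k) = A * Dk * B';    % assessments of judge k
end

% The same model seen as a bilinear model of the unfolded array:
% column f of the 88 x F loading matrix is kron(C(:,f), B(:,f)).
P = zeros(88, F);
for f = 1:F
    P(:, f) = kron(C(:, f), B(:, f));
end
Xu = reshape(X, 10, 88);
maxdiff = max(max(abs(Xu - A * P')));   % zero up to rounding error
```

In other words, the trilinear model is an unfold model in which each 88-element loading vector is restricted to this Kronecker structure.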
The trilinear model underlying both PARAFAC and N-PLS is more restricted than unfolding models. Therefore the fit of a trilinear model will by definition be no higher than the fit of a corresponding bilinear model. However, the bilinear model is often overly flexible, and the increased fit is to a large extent attributable to fitting the noise in the data. In the trilinear model it is much more difficult to overfit, since any variation incorporated in the model must be consistent over all assessors (or similar).
In the bilinear model above, each component uses 98 (10+88) parameters, while a trilinear component uses only 29 (10+11+8) parameters. This clearly illustrates that for the bilinear model to be suitable there must be a large deviation from the trilinear model. Otherwise the increased number of parameters will only fit noise. And to the degree that the trilinear model is only approximately correct, incorporating an additional trilinear component is still far more parsimonious than using a bilinear component.
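The parameter counts above generalize directly to any array size and number of components. A small sketch, using the same simple counting as the text (no correction for scaling or orthogonality), with illustrative variable names:

```matlab
I = 10; J = 11; K = 8;          % breads, attributes, judges
F = 2;                          % number of components

nBilinear  = F * (I + J*K);     % unfold model: scores plus 88 loadings
nTrilinear = F * (I + J + K);   % trilinear model: one loading vector per mode

fprintf('bilinear: %d, trilinear: %d\n', nBilinear, nTrilinear);
% prints "bilinear: 196, trilinear: 58"
```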
2. Compare unfold-PCA and PARAFAC
In order to first explore the data, one may use either PARAFAC or a PCA model on the unfolded data.
In this exercise it will be shown that the PARAFAC model is easier to interpret than the unfold model. For convenience you do not have to consider how many components to use. Two components are suitable for both PCA and PARAFAC.
Estimate a two-component two-way PCA model of the unfolded, centered data. Try to interpret the scores and loadings. Estimate a two-component PARAFAC model of the centered data. Interpret. For which model are the replicates located most closely in the score plot? Why?
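If no PARAFAC routine is at hand, the comparison can be tried with the bare-bones sketch below. It is not the algorithm of any particular toolbox: it is a plain alternating least squares loop with a random start, a fixed number of iterations and no convergence check, run on placeholder data (a trilinear signal plus noise) rather than the bread data. The mode order breads by attributes by judges and all variable names are assumptions made for the illustration.

```matlab
% Placeholder data (NOT the real bread data): trilinear signal plus noise.
rng(3);
I = 10; J = 11; K = 8; F = 2;
A0 = randn(I, F); B0 = randn(J, F); C0 = randn(K, F);
X = zeros(I, J, K);
for k = 1:K
    X(:, :, k) = A0 * diag(C0(k, :)) * B0';
end
X = X + 0.5 * randn(I, J, K);

% Centre across the bread mode.
X = X - repmat(mean(X, 1), [I 1 1]);

% The three unfoldings used by alternating least squares.
X1 = reshape(X, I, J*K);                      % breads     x (attributes, judges)
X2 = reshape(permute(X, [2 1 3]), J, I*K);    % attributes x (breads, judges)
X3 = reshape(permute(X, [3 1 2]), K, I*J);    % judges     x (breads, attributes)

% Column-wise Kronecker (Khatri-Rao) product.
kr = @(U, V) cell2mat(arrayfun(@(f) kron(U(:, f), V(:, f)), ...
                      1:size(U, 2), 'UniformOutput', false));

% Unfold-PCA with F components.
[U, S, V] = svd(X1, 'econ');
T = U(:, 1:F) * S(1:F, 1:F);                  % PCA scores
P = V(:, 1:F);                                % PCA loadings
fitPCA = 100 * (1 - norm(X1 - T*P', 'fro')^2 / norm(X1, 'fro')^2);

% Bare-bones PARAFAC by alternating least squares, random start.
A = randn(I, F); B = randn(J, F); C = randn(K, F);
for it = 1:500
    A = X1 / kr(C, B)';     % least squares update of the bread scores
    B = X2 / kr(C, A)';     % least squares update of the attribute loadings
    C = X3 / kr(B, A)';     % least squares update of the judge loadings
end
fitPARAFAC = 100 * (1 - norm(X1 - A*kr(C, B)', 'fro')^2 / norm(X1, 'fro')^2);
```

The percentage fit of the PARAFAC model cannot exceed that of the PCA model, because the PARAFAC model is a restricted version of it. The score matrices T and A can be compared with, for example, plot(T(:,1), T(:,2), 'o') and plot(A(:,1), A(:,2), 'o').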
3. Compare unfold- and three-way PLS regression
As an example of calibration with very noisy data, you must try to predict the salt content from the sensory data. This is not interesting from a sensory point of view, but it provides a good illustration of the use of data with a relatively low signal-to-noise ratio and significant model mis-specification (the data are definitely not perfectly trilinear as, e.g., fluorescence data are).
Use unfold-PLS as well as three-way PLS to build a regression model for salt using every second sample. Predict the salt content of the remaining samples. You will need a MATLAB m-file for PLS calibration in this exercise (for example from the PLS Toolbox). Alternatively, you can use the trilinear PLS model on an array in which one variable mode holds all 88 variables and the other variable mode has dimension 1.
How to do it
Use the plotting functions in MATLAB (or in the PLS_Toolbox) to plot the scores and loadings to investigate the models.
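The difference between the two regression models is easiest to see in the first component. The sketch below is not a complete PLS implementation and does not replace a toolbox routine: it computes only one component, uses random placeholder data and placeholder salt values (so the prediction errors carry no meaning), and follows the weight step commonly described for trilinear PLS with one dependent variable (Bro, 1996), where the weights come from a singular value decomposition of the covariance between X and y. Further components, which require deflation, are left out. The mode order breads by attributes by judges and all variable names are assumptions.

```matlab
% Placeholder data (NOT the real bread data or salt values).
rng(4);
I = 10; J = 11; K = 8;
X = randn(I, J, K);
salt = kron((1:5)', [1; 1]) + 0.1 * randn(I, 1);   % pair-wise replicates

cal = 1:2:I;    % every second sample for calibration
tst = 2:2:I;    % remaining samples for prediction

% Unfold and centre with the calibration means.
Xu = reshape(X, I, J*K);
mx = mean(Xu(cal, :), 1);
my = mean(salt(cal));
Xc = Xu(cal, :) - repmat(mx, numel(cal), 1);
Xt = Xu(tst, :) - repmat(mx, numel(tst), 1);
yc = salt(cal) - my;

% Unfold-PLS: one free weight per column (88 weights).
z = Xc' * yc;
wUnfold = z / norm(z);

% Trilinear PLS: the weight vector is forced to be kron(wK, wJ).
Z = reshape(z, J, K);           % attributes x judges
[u, s, v] = svd(Z);
wJ = u(:, 1);                   % attribute weights (11)
wK = v(:, 1);                   % judge weights (8)
wTri = kron(wK, wJ);

% One-component regression and prediction for both weight vectors.
tU = Xc * wUnfold;  bU = (tU' * yc) / (tU' * tU);
tT = Xc * wTri;     bT = (tT' * yc) / (tT' * tT);
predUnfold = Xt * wUnfold * bU + my;
predTri    = Xt * wTri    * bT + my;

rmsepUnfold = sqrt(mean((salt(tst) - predUnfold).^2));
rmsepTri    = sqrt(mean((salt(tst) - predTri).^2));
```

The unfold weight vector always gives scores with at least as high a covariance with salt in the calibration set, since it is unconstrained. Whether that extra flexibility helps or hurts the predictions for the left-out samples is what the exercise asks you to examine on the real data.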
In this chapter it has been shown that multi-way models are not solely applicable in spectral analysis.
For these data the gain from using multi-way models is quite pronounced, because the robustness and interpretability of the multi-way structure are even more important when the data are noisy.
This is apparent in the closeness of the replicate breads in the score plots, in the interpretability of the loading plots, and in the predictions obtained with the PLS models.