# The Copenhagen Chemometrics Group

## 2. Basic PARAFAC modeling

**Contents**

- Fit a PARAFAC model
- Determining the proper number of components
- Checking convergence
- Summary

**Data used:**

claus.mat contains fluorescence excitation emission data from five samples containing: tryptophan, phenylalanine, and tyrosine.

**Purpose:**

Learning to fit and interpret a PARAFAC model

**Information:**

R. Bro, PARAFAC: Tutorial & applications. *Chemom. Intell. Lab. Syst*., 1997, **38**, 149-171.

**Prerequisites:**

Be sure to understand the basics of handling multi-way arrays in MATLAB (Chapter 1), and have a vague idea about what PARAFAC is.

### 1. Fit a PARAFAC model

Fluorescence data of dilute samples behave approximately according to a PARAFAC model (*A mathematical model of fluorescence excitation-emission data*). For these data it is known that a three-component PARAFAC model should be adequate since there are three analytes in the samples and each can ideally be described by one PARAFAC component.

**Task:**

Fit a three-component PARAFAC model and investigate the model. To fit the PARAFAC model simply type

`[Factors] = parafac(X,3);`

and to convert the output parameters to scores and loadings type

`[A,B,C] = fac2let(Factors);`

The output `Factors` is actually a cell array (type `help cell`) of size 1×3. Cell element one holds A, element two holds B, and element three holds C, and you can access them directly, e.g. by typing `A = Factors{1}`. Note the curly brackets used for cells.

You can also plot all loadings by entering `plotfac(Factors)`.

The scores/loadings in A are the sample mode loadings, the loadings in B are the emission mode loadings and the loadings in C are the excitation mode loadings.

For example, use `plot(B)` or `plot(EmAx,B)` to investigate the emission mode loadings. Use `plot(ExAx,C)` to plot the excitation mode loadings.

See a graphical explanation/illustration of the PARAFAC model of these data.

From the loading plots of B and C, it can be verified that the PARAFAC model is unique. These loadings are not just orthogonal parameters as in ordinary PCA; they closely resemble the true pure spectra of the three chemical analytes.

So what does unique mean? It means that the scores and loadings cannot be rotated.

In PCA you can rotate the scores and loadings and still the model will fit the data just as well as before. Hence, there is no way that you can estimate, for example, pure spectra from PCA, because there is an infinity of different scores and loadings that all give the same fit.

**In PARAFAC this is not so**.

The only thing that can be varied is the order and scale of the components. There is no way to tell from the decomposition whether component one is rightfully the first, second or fifth component. So the order of the components is unidentified. There is also no way to define the scale. This is intrinsic to bi- and multilinear models. If **X** = **tp**′, then we can equally well let **t** be twice the original **t** if we just let **p** be half the original **p**.
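The scale indeterminacy is easy to verify numerically; here is a minimal sketch in plain MATLAB (no toolbox functions needed):

```matlab
% Scale indeterminacy in a bilinear model X = t*p'
t = [1; 2; 3];        % scores
p = [0.5; 0.25];      % loadings
X1 = t*p';            % model of the data

t2 = 2*t;             % double the scores ...
p2 = p/2;             % ... and halve the loadings
X2 = t2*p2';

max(abs(X1(:) - X2(:)))   % 0 -- the two models are identical
```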

**What is the implication of these two unidentifiabilities**?

That there is no intrinsic order means that the specific components must be identified after fitting. There is no way to guarantee that, say, tyrosine will come out as component one.

(*in second-order calibration, tricks can be used to automatically identify which factor is the analyte of interest if a pure sample is included. If the scores of that sample are forced to be zero for all but one factor, then naturally this factor must be the one pertaining to the analyte of interest*).

That the scale is unidentified means that we can only determine spectra, concentrations etc. up to a scaling. In the above example we can agree that the second and third mode loadings are pure emission and excitation spectra. Then the scores (amounts of the latent variables) must be concentrations, because the latent variables in this case are real chemical variables. So we can find the concentrations in the samples from the PARAFAC model. **But what is the scale? mM or M?**

Because of the scale indeterminacy there is no way to tell. This is natural: we cannot measure the absorption of a fluid and deduce the concentration from that alone. We need a reference to fix the scale. So if the concentration is known in one sample, we can scale the corresponding column so that this score equals the known concentration. Then the remaining scores of that factor will be estimates of the concentrations in the other samples.

**Task:**

Evaluate if the scores provide reasonable estimates of the concentrations.

Assume that you know the concentrations in, say, the fifth sample. If the model is unique this should provide sufficient information to estimate the concentrations in the first four samples.

Use the reference concentrations (given in the matrix `y` and described in the file `readme`) to scale the scores such that estimates of the concentrations of tryptophan are obtained, using the known concentration of tryptophan in the fifth sample (`y(5,1)`).

Do the same for tyrosine and phenylalanine.
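As a rough sketch of this scaling step (the column index `k` is a placeholder: because of the order indeterminacy, you must first inspect the loadings to see which component corresponds to tryptophan):

```matlab
[Factors] = parafac(X,3);
[A,B,C] = fac2let(Factors);

k = 1;                        % hypothetical: the column of A identified as tryptophan
scale = y(5,1) / A(5,k);      % known tryptophan concentration in sample five
conc  = scale * A(:,k);       % estimated tryptophan concentrations, samples 1-5
```

The estimates in `conc(1:4)` can then be compared with the reference values in `y` to judge how well the unique model recovers the concentrations.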

This is the basis for the so-called second-order calibration.

### 2. Determining the proper number of components

**It is difficult to decide the best rank**, i.e., the column-dimension of the loading matrices, of a PARAFAC model. However, it is by far easier than for two-way problems because of the uniqueness properties of PARAFAC. In general, the same tools are available as are used for two-way analysis: looking at histograms of residuals, assessing the interpretability of the solution, comparing residual variation to the intrinsic noise-level, and using resampling techniques such as cross-validation. But there are also additional very powerful tools that we will discuss after the first exercises.

**Task:**

Investigate the fits of a one-, two- and three-component PARAFAC model of the data.

Try also to estimate a four-component model. Note that an increased number of iterations is usually necessary when fitting the four-component model. This is a typical sign of either too many components being extracted or an ill-defined problem (highly correlated loadings).

**An important difference** between two-way PCA and the multi-way PARAFAC model is that the PARAFAC model is not *nested*. Hence the parameters of a three-component model do not equal the parameters of a two-component model plus one additional component. The reason is that the components are not required to be orthogonal, and hence not mutually independent. Therefore, every model has to be calculated specifically with all its components.

When estimating the model, note some of the options available in the parafac m-file (type `help parafac`). The input options make it possible, e.g., to have the interim loadings plotted during the iterations (`options(3) = 2`), or to allow only, say, 100 iterations (`options(6) = 100`).

Hints: *fitting the PARAFAC model*, *converting the model parameters to loadings*, *calculating the model of the data*, *calculating the model fit*.
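Roughly, these steps could be carried out as follows. This sketch assumes the N-way toolbox helper `krb` (the column-wise Khatri-Rao product) is available for reconstructing the unfolded data; the fit can equally well be computed from the residuals in any other way:

```matlab
% Sketch: fit models with 1-4 components and compute the percentage of
% variation explained by each (assumes the N-way toolbox is on the path).
[I,J,K] = size(X);
for Fac = 1:4
    Factors  = parafac(X,Fac);
    [A,B,C]  = fac2let(Factors);
    Xhat     = A*krb(C,B)';                % model of the unfolded I x (J*K) data
    res      = reshape(X,I,J*K) - Xhat;    % residuals
    fit(Fac) = 100*(1 - sum(res(:).^2)/sum(X(:).^2));
end
fit   % percent of variation explained, one value per model
```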

**In addition to the above-mentioned tools**, one may use split-half analysis to determine the proper number of components. If two sample sets are available that should describe the same common variation, then for the proper number of components they should give the same loading vectors up to scaling and permutation. If this is not the case, it is a clear indication that too many components are being extracted. A newer approach, the core consistency diagnostic (CORCONDIA), is very helpful in determining the right number of components. **The details will not be given here**, but the principle is as follows.

For a given number of components, fit the PARAFAC model to the data and use the solution to calculate the so-called core consistency. If the PARAFAC model is valid, the core consistency is close to 100%. If the data cannot be approximately described by a trilinear model, or too many components are used, the core consistency will be close to zero (or even negative).

If the consistency is in between (around 50%), the model is unstable, in which case imposing valid constraints may help stabilize it. In practice the core consistency decreases slowly with an increasing number of components and then drops sharply when the correct number of components is exceeded. The number of components corresponding to the last high consistency value should be chosen.

**Important**

The core consistency is just a tool; it does not prove anything. It is based on assumptions and, like any other tool, may be misleading from time to time. So do not argue for a specific model using only the core consistency as an argument. Use several tools and visualizations to provide evidence that you are using the right number of components.

**Task**

Use different tools for assessing the proper number of components for the fluorescence data. For models of dimension one to five:

- Look at residuals (mesh plots and the like; check the size of the residuals compared to the data and the known error)
- Look at scree-plots (sum-of-squared residuals versus the number of components)
- Look at model parameters (loading vectors)
- Use the core consistency diagnostic (`corcond`)
- Use cross-validation (`ncrossdecomp`)
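Such a scan might look roughly as follows (the exact calling convention of `corcond` is an assumption here; check it with `help corcond`):

```matlab
% Scan 1-5 components and collect the core consistency of each model
for Fac = 1:5
    Factors = parafac(X,Fac);
    cc(Fac) = corcond(X,Factors);   % core consistency in percent (assumed signature)
end
cc   % typically high up to the correct number of components, then a sharp drop
```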

The multi-way toolbox also provides an m-file called `pftest` for checking the number of components (type `help pftest`). The approach adopted there is more or less similar to the one mentioned here, but additionally each model corresponding to a given number of components is fitted several times. This provides a means for assessing whether the model converges to the same minimum every time. This is also useful in determining the proper number of components, since if too many components are used the number of local minima typically increases dramatically.

### 3. Checking convergence of the algorithm

**The algorithm for fitting the PARAFAC model** is a so-called alternating least squares algorithm. It is iterative and stops when the relative difference in fit between two successive iterations is below a certain limit. For most types of data this limit can be set to 10⁻⁶ (the default in the algorithm), which ensures that the model is correct and that not too many iterations are used. For some data the model is very difficult to fit, and a lower convergence criterion may therefore be needed. Note that one need not check convergence for the N-PLS and Tucker algorithms, as those are very efficient.

**To assess convergence the following list may be used:**

- Fit the models several times using random initialization (why?) by setting `Options(2) = 2;`.
- If all models have the same fit (i.e. loss function value), the models have converged (this is common).
- If all but a small fraction of the fitted models have the same (and best) fit, the model has converged, and the few models with lower fit may be discarded as accidental local minima.
- If all models have different fit values, the model is difficult to fit (maybe too many components) and the convergence criterion has to be lowered.
- If the models converge to a few different but distinct fit-values, i.e. there are several models with the same fit values, then there are multiple local minima, which is a tricky situation. Likely, it is possible to circumvent this either by using some additional constraints (e.g., non-negativity) or by otherwise slightly re-specifying the model.
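A sketch of such a multi-start check; `Options(2) = 2` selects random initialization as above, and I assume here that the third output of `parafac` is the loss function value (verify the exact output order with `help parafac`):

```matlab
% Fit the same three-component model ten times from random starts and
% compare the loss function values.
Options    = zeros(1,6);
Options(2) = 2;                                 % random initialization
for run = 1:10
    [Factors,it,err] = parafac(X,3,Options);    % assumed output order
    loss(run) = err;
end
loss   % identical values across runs indicate convergence to the same minimum
```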

**Task**

Fit a three-component model several times and see if the estimated model converges. If so, the fits should be identical every time. This demonstrates how to assess convergence. Do the same for a four- or a five-component model. Likely you will see that the fit differs from model to model if you are using a random initialization. Lower the convergence criterion to circumvent this. You can use `pftest` for testing convergence.

### 4. Summary

In this chapter it has been shown how to fit the PARAFAC model and how to find the appropriate number of components. Fitting the PARAFAC model is simply done by typing

`[Factors] = parafac(X,Fac);`

where `Fac` is the number of components. The parameters of the model are held in the cell array `Factors` and may be converted into ordinary scores and loadings by means of `fac2let`. In the case of a three-way model this would read

`[A,B,C] = fac2let(Factors);`

The scores and loadings of a PARAFAC model may be assessed and used just as in ordinary two-way analysis (e.g., use scores for building regression models, plot loadings, and make bi-plots for interpretation).

It has also been discussed that using the file `pftest` one may very quickly get a good indication of the appropriate number of components to use. Further exploration and interpretation of feasible models may then help the final decision.

Some important aspects that differ from ordinary two-way analysis have been discussed:

- PARAFAC models are *not* nested; hence one may not fit a model by augmenting an existing model with an extra component.
- Unlike many other commonly used decomposition models, the PARAFAC algorithm can occasionally have difficulties in finding the true least squares model. This is mostly the case when too many components are extracted or when the model is otherwise mis-specified.
- The PARAFAC model is unique, which means that it may be used, e.g., for resolving spectra. However, a unique PARAFAC model does not automatically imply that the parameters are estimates of the true underlying parameters; only when these exist and the model is correctly specified can that be the case. Hence, in the above example with fluorescence data, even a two-component model is unique, but it does not reflect pure spectra, since there are three chemical analytes in the samples.