The Copenhagen Chemometrics Group

3. ADVANCED PARAFAC MODELING

Contents

Introduction
Second-order calibration
Using constraints
Fitting new samples
Summary

Data used:
claus.mat contains fluorescence excitation emission data from five samples containing: tryptophan, phenylalanine, and tyrosine.

Purpose:
Learning to use PARAFAC modeling for more advanced things.

Information:
R. Bro, PARAFAC: Tutorial & applications.
Chemom. Intell. Lab. Syst., 1997, 38, 149-171.

Prerequisites:
Be sure to understand the basics of handling multi-way arrays in MATLAB (Chapter 1). You should be acquainted with concepts such as nestedness, uniqueness, convergence, constraints and second-order calibration.

1. Introduction

In this chapter some of the more advanced aspects of PARAFAC modeling will be described. There are two main ways of seeing the PARAFAC model:
A) as a hard model for parameter estimation
B) as a soft model for decomposing multi-collinear data.

A)
Due to its uniqueness properties the PARAFAC model is ideally suited for curve resolution and certain kinds of calibration problems. Since the PARAFAC model is unique and coincides with several physical models (fluorescence spectroscopy, spectrally detected chromatography etc.) it is possible to decompose such data into (chemically) meaningful parameters. As shown in the previous chapter, fully overlapping fluorescence data may be decomposed into score and loading vectors that are estimates of excitation and emission spectra and concentrations of chemical analytes. Thus, the PARAFAC model can perform mathematical chromatography on mixture data enabling identification and quantification of specific analytes.

B)
Seeing the PARAFAC model as a generalization of two-way PCA it is apparent that the PARAFAC model may be used for extracting latent variables regardless of its uniqueness properties. The thus obtained scores and loadings may be used in complete analogy with ordinary two-way models for exploration and quantitative analysis such as regression and classification.

In this chapter
we will elaborate on some of the properties relating to the uniqueness of the PARAFAC model as well as on how to improve a given model by incorporating a priori knowledge directly into the model.

2. Second-order calibration

Explore the concept of second-order calibration using PARAFAC.
Use for example sample number one, X(1,:), as a standard of pure tryptophan (calibration set). Use this standard to estimate the amount of tryptophan in sample two, three, four, and five. You can do this separately or simultaneously. For estimating the concentration of tryptophan in the fourth and fifth sample, create a data set of the first (standard), fourth and fifth sample as

Xnew = X([1 4 5],:,:);

Run PARAFAC with an appropriate number of components
(three in this case – but not in all other cases – why?) and use the scores plus the known concentration of tryptophan in sample one to estimate the concentration.
Compare with the reference concentrations (how).

Compare the predictions obtained with PARAFAC
with those that can be achieved with GRAM or DTLD
(type help gram or help dtld to see how to fit the models).
The difference is usually not dramatic, unless there are very few samples.

3. Using constraints

Constraining a model can sometimes be helpful.
For example, resolution of spectra may be wanted. To ensure that the estimated spectra make sense it may be reasonable to estimate the spectra under non-negativity constraints as most spectral parameters are known to be non-negative. Constraints can for example help to

Obtain parameters that do not contradict with a priori knowledge
(Ex.: Require chromatographic profiles to have but one peak)
Obtain unique solution where otherwise a non-unique model would be obtained
(Ex.: Use selective channels in data to obtain uniqueness)
Test hypotheses (Ex.: Investigate if tryptophan is present in sample)
Avoiding degeneracy and numerical problems
(Ex.: Enabling a PARAFAC model of data otherwise inappropriate for the model)
Speed up algorithms
(Ex.: Use truncated bases to reexpress problem by a smaller problem)
Enable quantitative analysis of qualitative data
(Ex.: Incorporate sex and job type in a model for predicting income)

Some argue that
constraining, e.g., the PARAFAC model is superfluous, as the structural model in itself should be unique. However, there are several good reasons for using constraints.

Firstly, not all models are unique like the PARAFAC model. Secondly, even though the model is unique, the model may not provide a completely satisfactory description of the data. Rayleigh scatter in fluorescence spectroscopy is but one instance where slight model inadequacy can cause the model parameters to be misleading. Constraints can be helpful in preventing that.

In other situations numerical problems or intrinsic ill-conditioning can make a model problematic to fit. At a more general level constraints may be applied simply because they are known to be valid. This can give better estimates of model parameters and of the data.

A constrained model will fit the data poorer than an unconstrained model
but if the constrained model is more interpretable and realistic this may justify the decrease in fit. Applying constraints should be done carefully considering the appropriateness beforehand, considering why the unconstrained model is unsatisfactory, and critically evaluating the effect afterwards. In some cases there is confidence in that the constraint is appropriate.

Sometimes an intuitive approach is used for imposing constraints.
Using non-negativity constraints as an example, a common approach is to use an unconstrained estimation procedure and then subsequently set negative values to zero. Naturally this will lead to a feasible solution, i.e., a solution that is strictly non-negative. This approach can not be recommended for several reasons.

First of all, an estimate obtained from such an approach will have no well-defined optimality property. This can make it difficult to distinguish between problems pertaining to the algorithm, the model, and the data. That is, if the fitted model is unsatisfactory, it becomes more difficult to assess the cause of the problem, because an additional source of error has been introduced, namely the properties of the constraint.

Task
Make a three-component PARAFAC model of sample four and five,
with and without non-negativity. Such restrictions are set in the input const.
Do the estimated emission and excitation spectra change?
Do the same using sample one and four.

See the help in the PARAFAC m-file for how to impose constraints

Task
Repeat the models investigated previously in the second-order calibration problems and use non-negativity on all modes.
Do the parameters change? Do the predictions?

When analyzing spectral data there is often a strong rationale behind using the PARAFAC model with its intrinsic uniqueness properties. Unfortunately, the PARAFAC model is sometimes very difficult to fit and it would be nice to be able to speed up the algorithm.
If the data analyzed are not spectral or similar data it is often of little practical importance whether the determined loading vectors are orthogonal or not. With respect to relative interpretation and for quantitative purposes such as using the scores in a subsequent regression model, the primary aim is to span the systematic variation, and one may therefore e.g. use orthogonality constrained models in order to avoid numerical algorithmic problems.

Task
To exemplify how orthogonality constraints work, try to fit a five-component model several times. With and without orthogonality constraints. Compare the time consumptions.

How to do it

4. Fitting new samples

When a PARAFAC model is built
it is possible to find the scores of new samples with the given model
(This is currently not implemented properly for Tucker models).

The way new scores are found is following

	What to do	Example
1	Fit the model to the calibration data	`calmodel = parafac(Xcal,Fac);`
2	Make a cell array. which holds initial values for the model for the test data. The loadings should equal the calibration model loadings and the scores should be random of appropriate size (number of samples in test data times number of factors)	`Itest = size(Xtest,1); %Number of samples in test set` `Loads = calmodel;` `Loads{1} = rand(Itest,Fac);`
3	Fit a PARAFAC model to the test data. Make sure that the initial loadings are input And That the loadings in the variable modes are fixed.	`Opt = []; %Or any other chosen option` `const = [0 0 0] %Change to [2 0 0] if you want nonnegative scores. In the second and third mode, nothing can be changed in the` `prediction because B and C are fixed.` `Fix = [0 1 1];` `%Defines that B and C should maintain their initial values(Hence no estimation of these` `and the output B and C will equal` `the ones from` `the calibration model)` `testmodel=parafac(X,Fac,Opt,const,Loads,Fix);`

Notes
Do not change the sign or scales of the loadings. When fitting the new samples, the loadings are normalized, so to stay compatible, the originally output loadings must be used when predicting the new samples.

5. Summary

In this chapter
some of the advanced aspects of PARAFAC modeling have been discussed. Most notably second-order calibration which, in its most extreme version, allows quantification in situations where there is only one pure standard. Thus, no knowledge of the interferences and no knowledge of the normal variation. Of course this is not an optimal situation but it illustrates the power of second-order calibration.

Additionally
some remarks have been made on the use of constraints in modeling. If additional knowledge is available on data this may be incorporated directly into the model by means of constraints. This can help, e.g., in providing more sensible and robust models.

In some situations
PARAFAC is too restricted to model the data in a sensible way. If for example, there are severe retention time shifts in a chromatographic data, PARAFAC will not be able to model this because it assumes that the elution profile of a given analyte has the same shape in all experiments. In these situations, the PARAFAC2 model can be helpful. An algorithm for fitting PARAFAC2 can be found under our downloads page: Algorithms.
An introductory exercise is given here.

Previously