The Copenhagen Chemometrics Group
3. ADVANCED PARAFAC MODELING
- Second-order calibration
- Using constraints
- Fitting new samples
claus.mat contains fluorescence excitation emission data from five samples containing: tryptophan, phenylalanine, and tyrosine.
Learning to use PARAFAC modeling for more advanced things.
R. Bro, PARAFAC: Tutorial & applications.
Chemom. Intell. Lab. Syst., 1997, 38, 149-171.
Be sure to understand the basics of handling multi-way arrays in MATLAB (Chapter 1). You should be acquainted with concepts such as nestedness, uniqueness, convergence, constraints and second-order calibration.
In this chapter some of the more advanced aspects of PARAFAC modeling will be described. There are two main ways of seeing the PARAFAC model:
A) as a hard model for parameter estimation
B) as a soft model for decomposing multi-collinear data.
Due to its uniqueness properties the PARAFAC model is ideally suited for curve resolution and certain kinds of calibration problems. Since the PARAFAC model is unique and coincides with several physical models (fluorescence spectroscopy, spectrally detected chromatography etc.) it is possible to decompose such data into (chemically) meaningful parameters. As shown in the previous chapter, fully overlapping fluorescence data may be decomposed into score and loading vectors that are estimates of excitation and emission spectra and concentrations of chemical analytes. Thus, the PARAFAC model can perform mathematical chromatography on mixture data enabling identification and quantification of specific analytes.
Seeing the PARAFAC model as a generalization of two-way PCA it is apparent that the PARAFAC model may be used for extracting latent variables regardless of its uniqueness properties. The thus obtained scores and loadings may be used in complete analogy with ordinary two-way models for exploration and quantitative analysis such as regression and classification.
In this chapter
we will elaborate on some of the properties relating to the uniqueness of the PARAFAC model as well as on how to improve a given model by incorporating a priori knowledge directly into the model.
2. Second-order calibration
Explore the concept of second-order calibration using PARAFAC.
Use for example sample number one,
X(1,:), as a standard of pure tryptophan (calibration set). Use this standard to estimate the amount of tryptophan in sample two, three, four, and five. You can do this separately or simultaneously. For estimating the concentration of tryptophan in the fourth and fifth sample, create a data set of the first (standard), fourth and fifth sample as
Xnew = X([1 4 5],:,:);
Run PARAFAC with an appropriate number of components
(three in this case – but not in all other cases – why?) and use the scores plus the known concentration of tryptophan in sample one to estimate the concentration.
Compare with the reference concentrations (how).
Compare the predictions obtained with PARAFAC
with those that can be achieved with GRAM or DTLD
help gram or
help dtld to see how to fit the models).
The difference is usually not dramatic, unless there are very few samples.
3. Using constraints
Constraining a model can sometimes be helpful.
For example, resolution of spectra may be wanted. To ensure that the estimated spectra make sense it may be reasonable to estimate the spectra under non-negativity constraints as most spectral parameters are known to be non-negative. Constraints can for example help to
- Obtain parameters that do not contradict with a priori knowledge
(Ex.: Require chromatographic profiles to have but one peak)
- Obtain unique solution where otherwise a non-unique model would be obtained
(Ex.: Use selective channels in data to obtain uniqueness)
- Test hypotheses (Ex.: Investigate if tryptophan is present in sample)
- Avoiding degeneracy and numerical problems
(Ex.: Enabling a PARAFAC model of data otherwise inappropriate for the model)
- Speed up algorithms
(Ex.: Use truncated bases to reexpress problem by a smaller problem)
- Enable quantitative analysis of qualitative data
(Ex.: Incorporate sex and job type in a model for predicting income)
Some argue that
constraining, e.g., the PARAFAC model is superfluous, as the structural model in itself should be unique. However, there are several good reasons for using constraints.
Firstly, not all models are unique like the PARAFAC model. Secondly, even though the model is unique, the model may not provide a completely satisfactory description of the data. Rayleigh scatter in fluorescence spectroscopy is but one instance where slight model inadequacy can cause the model parameters to be misleading. Constraints can be helpful in preventing that.
In other situations numerical problems or intrinsic ill-conditioning can make a model problematic to fit. At a more general level constraints may be applied simply because they are known to be valid. This can give better estimates of model parameters and of the data.
A constrained model will fit the data poorer than an unconstrained model
but if the constrained model is more interpretable and realistic this may justify the decrease in fit. Applying constraints should be done carefully considering the appropriateness beforehand, considering why the unconstrained model is unsatisfactory, and critically evaluating the effect afterwards. In some cases there is confidence in that the constraint is appropriate.
Sometimes an intuitive approach is used for imposing constraints.
Using non-negativity constraints as an example, a common approach is to use an unconstrained estimation procedure and then subsequently set negative values to zero. Naturally this will lead to a feasible solution, i.e., a solution that is strictly non-negative. This approach can not be recommended for several reasons.
First of all, an estimate obtained from such an approach will have no well-defined optimality property. This can make it difficult to distinguish between problems pertaining to the algorithm, the model, and the data. That is, if the fitted model is unsatisfactory, it becomes more difficult to assess the cause of the problem, because an additional source of error has been introduced, namely the properties of the constraint.
Make a three-component PARAFAC model of sample four and five,
with and without non-negativity. Such restrictions are set in the input const.
Do the estimated emission and excitation spectra change?
Do the same using sample one and four.
See the help in the
PARAFAC m-file for how to impose constraints
Repeat the models investigated previously in the second-order calibration problems and use non-negativity on all modes.
Do the parameters change? Do the predictions?
When analyzing spectral data there is often a strong rationale behind using the PARAFAC model with its intrinsic uniqueness properties. Unfortunately, the PARAFAC model is sometimes very difficult to fit and it would be nice to be able to speed up the algorithm.
If the data analyzed are not spectral or similar data it is often of little practical importance whether the determined loading vectors are orthogonal or not. With respect to relative interpretation and for quantitative purposes such as using the scores in a subsequent regression model, the primary aim is to span the systematic variation, and one may therefore e.g. use orthogonality constrained models in order to avoid numerical algorithmic problems.
To exemplify how orthogonality constraints work, try to fit a five-component model several times. With and without orthogonality constraints. Compare the time consumptions.
4. Fitting new samples
When a PARAFAC model is built
it is possible to find the scores of new samples with the given model
(This is currently not implemented properly for Tucker models).
The way new scores are found is following
|What to do||Example|
Fit the model to the calibration data
Make a cell array.
which holds initial values for the model for the test data.
The loadings should equal the calibration model loadings and the scores should be random of appropriate size (number of samples in test data times number of factors)
Fit a PARAFAC model to the test data.
Make sure that the initial loadings are input
That the loadings in the variable modes are fixed.
Do not change the sign or scales of the loadings. When fitting the new samples, the loadings are normalized, so to stay compatible, the originally output loadings must be used when predicting the new samples.
In this chapter
some of the advanced aspects of PARAFAC modeling have been discussed. Most notably second-order calibration which, in its most extreme version, allows quantification in situations where there is only one pure standard. Thus, no knowledge of the interferences and no knowledge of the normal variation. Of course this is not an optimal situation but it illustrates the power of second-order calibration.
some remarks have been made on the use of constraints in modeling. If additional knowledge is available on data this may be incorporated directly into the model by means of constraints. This can help, e.g., in providing more sensible and robust models.
In some situations
PARAFAC is too restricted to model the data in a sensible way. If for example, there are severe retention time shifts in a chromatographic data, PARAFAC will not be able to model this because it assumes that the elution profile of a given analyte has the same shape in all experiments. In these situations, the PARAFAC2 model can be helpful. An algorithm for fitting PARAFAC2 can be found under our downloads page: Algorithms.
An introductory exercise is given here.