More on this “Split-half analysis”

R.A. Harshman advocates using split-half analysis to determine the proper rank of unique models.

The split-half analysis is a type of jack-knife analysis in which different subsets of the data are analyzed independently.

Due to the uniqueness of, e.g., the PARAFAC model, the same result (the same loadings) will be obtained in the non-split modes from models of any suitable subset of the data, provided the correct number of components is chosen. If too many or too few components are chosen, the model parameters will differ when the model is fitted to different data sets.

Even though the model may be unique, the model parameters will be dependent on the specific sampling as the amount of underlying phenomena present in the data set determines which linear combination of the intrinsic set of profiles and the noise will give a unique solution for the specific model at hand.

To judge whether two models are equal, the indeterminacy of multilinear models has to be respected: the order and scale of the components may change between fits unless they are fixed algorithmically.
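One way to compare two sets of loadings while respecting these indeterminacies is sketched below. It is a minimal illustration, not a prescribed procedure: the function names `congruence` and `match_components` are hypothetical, the similarity measure (the cosine between columns, often called Tucker's congruence coefficient) is an assumed choice, and taking absolute values to ignore sign is a simplification that treats each mode on its own.

```python
import itertools
import numpy as np

def congruence(a, b):
    """Cosine between every column of a and every column of b (scale-free)."""
    a = a / np.linalg.norm(a, axis=0)
    b = b / np.linalg.norm(b, axis=0)
    return a.T @ b

def match_components(load_1, load_2):
    """Find the column order of load_2 that best matches load_1.

    Returns the permutation and the absolute congruence of each matched pair.
    Brute force over permutations, so only suitable for a few components.
    """
    phi = np.abs(congruence(load_1, load_2))  # abs() ignores sign flips
    n = phi.shape[0]
    best = max(itertools.permutations(range(n)),
               key=lambda p: sum(phi[i, p[i]] for i in range(n)))
    return list(best), np.array([phi[i, best[i]] for i in range(n)])

# Synthetic check: the same loadings, reordered and rescaled (assumed data).
rng = np.random.default_rng(0)
load_1 = rng.random((50, 3))
load_2 = load_1[:, [2, 0, 1]] * np.array([2.0, -0.5, 3.0])
perm, phi = match_components(load_1, load_2)
```

Here `perm[i]` is the column of `load_2` matched to column `i` of `load_1`. Matched congruences close to one in the non-split modes suggest that the two models agree; how close is close enough is left to the analyst.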

If a model is stable in a split-half sense, this is a clear indication that the model is real: it captures essential variation that pertains not only to the specific samples analyzed.

If, on the other hand, some components are not stable in a split-half sense, this indicates that they may not be real and hence that the model is not valid. It may also happen, though, that the phenomenon reflected in the non-stable component is simply present only in specific subsets.

Therefore non-stability in a split-half analysis is not always as conclusive as stability. When performing a split-half experiment it has to be decided which mode to split.

Splitting should be performed in a mode with a sufficient number of independent variables/samples. With a high-dimensional spectral mode, an obvious idea would be to split in that mode, but the collinearity of the variables in this mode would impede sound results.

If the spectra behave additively the two data sets would in practice be identical, hence split-half analysis would not be possible.
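The effect can be illustrated with a small synthetic sketch. Everything in it is assumed for illustration: two smooth Gaussian-shaped spectra, random positive concentrations, and an interleaved split of the wavelength axis into even- and odd-indexed channels.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two smooth, overlapping spectra on a fine wavelength grid (assumed shapes).
wl = np.linspace(0.0, 1.0, 200)
spectra = np.stack([np.exp(-((wl - c) / 0.1) ** 2) for c in (0.3, 0.6)], axis=1)

# Additive (bilinear) mixtures: 10 samples x 200 wavelengths.
conc = rng.random((10, 2))
X = conc @ spectra.T

# Split the spectral mode into two interleaved halves.
even, odd = X[:, 0::2], X[:, 1::2]

# Relative difference between the two halves.
rel_diff = np.linalg.norm(even - odd) / np.linalg.norm(even)
```

Because neighbouring wavelengths are strongly collinear, `rel_diff` is small: the two halves carry essentially the same information, so agreement between their models says little about the validity of the model.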

To avoid that an unlucky splitting of the samples causes some phenomena to be absent in certain groups, the following approach is often adopted.

The samples are divided into two groups, A and B. If the samples are presumed to have some kind of correlation in time, the sets are constructed contiguously, i.e., A consists of the first half of the samples and B of the last half. It may happen by accident that one of these sets does not contain information on all latent phenomena.

To ensure, or at least make it more likely, that the sets to be analyzed cover the complete variation, two more sets are generated, C and D. The set C is made from the first halves of A and B, and the set D consists of the last halves of A and B. The four sets are independent within each pair: A and B share no samples, and neither do C and D.
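The construction of the four sets can be sketched in terms of sample indices. The function name `split_half_sets` is hypothetical, and the sketch assumes the samples are already in their natural (e.g., time) order.

```python
import numpy as np

def split_half_sets(n_samples):
    """Return index arrays for the sets A, B, C and D described above."""
    idx = np.arange(n_samples)
    a, b = idx[: n_samples // 2], idx[n_samples // 2:]   # contiguous halves
    a1, a2 = a[: len(a) // 2], a[len(a) // 2:]           # halves of A
    b1, b2 = b[: len(b) // 2], b[len(b) // 2:]           # halves of B
    return {
        "A": a,
        "B": b,
        "C": np.concatenate([a1, b1]),  # first halves of A and B
        "D": np.concatenate([a2, b2]),  # last halves of A and B
    }

sets = split_half_sets(8)
```

With eight samples, A holds samples 0 to 3, B holds 4 to 7, C holds 0, 1, 4, 5 and D holds 2, 3, 6, 7. Each index array can be used to slice the sample mode of the data array before fitting a separate model.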

A model is fitted to each of the data sets, and if the solution replicates over set A and B or over set C and D, correctness of the solution is empirically verified.

The split-half approach may also sometimes be used for verifying whether non-(tri)-linearities are present. If spectral data are modeled and there are indications of non-linearities in certain wavelength areas, this may be verified by making separate models of different wavelength areas.

If no non-linearities are present, the same scores should be obtained in the different areas, possibly using sub-models of different dimensions if some phenomena are only present in certain areas.

If the same scores are not obtained, it could indicate dissimilar interrelations in different areas, hence nonlinearities.