GC-MS datasets are simulated over short retention-time intervals. The simulated datasets vary in rank, with F = 3, 4, or 5 underlying components, and in noise level, with values of 0.001, 0.01, and 0.1.

For each dataset, the loading matrices A, Bk, and C are simulated to represent the pure components, and a simulated residual array E is added to represent instrumental noise. Matrix A (150 × F) contains the mass spectra, which are extracted from the real GC-MS dataset described by Tian et al. [1]. Each elution profile matrix Bk (k = 1, …, 50) is a 70 × F matrix of chromatographic peaks generated by a Gaussian function. To simulate retention time shift, the peaks in every Bk are shifted randomly by up to 7 points along the retention time axis. The matrix C (50 × F) contains the relative concentrations, drawn as uniformly distributed random numbers. To observe the effects of noise, the error array E contains normally distributed random numbers at one of three noise levels: 0.001, 0.01, or 0.1. These noise levels cover a large range of signal-to-noise ratios; for example, a noise level of 0.1 means that the standard deviation of the added residuals E equals 10% of the standard deviation of the simulated pure components. The data array, with dimensions 150 × 70 × 50, is generated from the simulated loading matrices and the noise residual array in accordance with the PARAFAC2 model equation.
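The construction described above can be sketched in a few lines. The following is a minimal illustration in plain Python, not the code behind simdata_gcms. It assumes the usual PARAFAC2 slice form X_k = A D_k Bk^T + E_k, where D_k is the diagonal matrix holding row k of C. Several details are assumptions made only for this sketch: the function name simulate_gcms, random placeholder spectra in A (the actual data use real spectra), the peak centers and width, one integer shift between 0 and 7 per component, and reading "standard deviation of the simulated pure components" as the standard deviation of the noise-free array.

```python
import math
import random


def simulate_gcms(n_mz=150, n_rt=70, n_samples=50, rank=3, noise=0.1,
                  max_shift=7, seed=0):
    """Return (X, A, B, C), where X[i][j][k] is the simulated intensity."""
    rng = random.Random(seed)

    # A (n_mz x rank): placeholder non-negative "spectra". The source takes
    # these from a real GC-MS dataset; random values are an assumption here.
    A = [[rng.random() for _ in range(rank)] for _ in range(n_mz)]

    # C (n_samples x rank): uniformly distributed relative concentrations.
    C = [[rng.random() for _ in range(rank)] for _ in range(n_samples)]

    # B[k] (n_rt x rank): Gaussian peaks. Centers and width are assumptions.
    width = 3.0
    base_centers = [n_rt * (f + 1) / (rank + 2) for f in range(rank)]
    B = []
    for _ in range(n_samples):
        # Assumed shift model: one integer shift in [0, max_shift] per component.
        centers = [c + rng.randint(0, max_shift) for c in base_centers]
        B.append([[math.exp(-0.5 * ((j - centers[f]) / width) ** 2)
                   for f in range(rank)] for j in range(n_rt)])

    # Noise-free array from the slice model X_k = A D_k Bk^T.
    X = [[[sum(A[i][f] * C[k][f] * B[k][j][f] for f in range(rank))
           for k in range(n_samples)] for j in range(n_rt)]
         for i in range(n_mz)]

    # Scale the noise so that std(E) is about noise * std(noise-free data).
    values = [x for plane in X for row in plane for x in row]
    mean = math.fsum(values) / len(values)
    signal_std = math.sqrt(
        math.fsum((v - mean) ** 2 for v in values) / len(values))
    sigma = noise * signal_std
    for plane in X:
        for row in plane:
            for k in range(n_samples):
                row[k] += rng.gauss(0.0, sigma)
    return X, A, B, C
```

Calling simulate_gcms(150, 70, 50, 3, 0.1) in this sketch returns a nested list X indexed as X[mass channel][retention time][sample], together with A, the list of 50 matrices Bk, and C. Setting noise to 0 returns the noise-free model array, which is a convenient check that the three loading matrices combine as intended.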

For each combination of rank and noise level, 100 simulated datasets are generated. This results in a total of 900 datasets.
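The total follows from 3 ranks × 3 noise levels × 100 replicates. As a small illustration, the design can be enumerated as below; the variable names are chosen only for this sketch and do not come from the distributed code.

```python
from itertools import product

ranks = (3, 4, 5)
noise_levels = (0.001, 0.01, 0.1)
n_replicates = 100

# One entry per simulated dataset: (rank, noise level, replicate index).
design = [(f, noise, rep)
          for f, noise in product(ranks, noise_levels)
          for rep in range(n_replicates)]

print(len(design))  # 900
```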

The function for generating the data can be found here. As a usage example, to simulate GC-MS data with three components and a noise level of 0.1, with dimensions 150 × 70 × 50, run the following:

>> [X1,A,B,C] = simdata_gcms(150,70,50,3,0.1);

Reference

1. Tian K, Wu L, Min S, Bro R. Geometric search: A new approach for fitting PARAFAC2 models on GC-MS data. Talanta. 2018;185:378-386. https://doi.org/10.1016/j.talanta.2018.03.088