EP2541585A1

EP2541585A1 - Computer-assisted structure identification

Info

Publication number: EP2541585A1
Application number: EP11005180A
Authority: EP
Inventors: designation of the inventor has not yet been filed
Original assignee: Philip Morris Products SA
Current assignee: Philip Morris Products SA
Priority date: 2011-06-27
Filing date: 2011-06-27
Publication date: 2013-01-02

Abstract

The invention relates to a method for analysing mass spectral data obtained from a sample in GCxGC (2-dimensional) mass spectrometry, comprising: (a) comparing mass spectral data of an analyte with mass spectral data of candidate compounds of known structure in a library; (b) identifying a plurality of candidate compounds from the library based on similarities of mass spectral data; (c) predicting, for each candidate compound, a value of at least one analytical property using a quantitative model based on a plurality of molecular descriptors; and (d) calculating a match score for each candidate compound based on the value predicted in step (c) and a measured value of the analytical property for the analyte.

Description

The present invention relates to an automated, computer-assisted method for identifying compounds according to mass spectral and chromatographic data obtained from a sample. In particular, the invention relates to methods for identifying compounds using two dimensional gas chromatography-mass spectrometry (GCxGC-MS), and processes for automating the interpretation of the mass spectral and chromatographic data obtained from such a method.
Mass spectrometry is an analytical tool that can be used to determine the molecular weights of chemical compounds and of their fragments by detecting the ionized compounds and fragments according to their mass-to-charge ratio (m/z). The molecular ions are generated by inducing either a loss or a gain of a charge by the chemical compounds, such as via electron ejection, protonation, or deprotonation. The fragment ions are generated by collision-induced or energy-induced dissociation. The resulting data are usually presented as a spectrum, a plot with m/z ratio on the x-axis and abundance of ions on the y-axis. Thus, this spectrum shows the distribution of m/z values in the population of ions being analyzed. This distribution is characteristic for a given compound. Therefore, if the sample is a pure compound or contains only a few compounds, mass spectrometry can reveal the identity of the compound(s) in the sample.
A complex sample usually contains too many chemical compounds to be analyzed meaningfully by mass spectrometry alone, because ionization of different chemical compounds may result in ions with the same m/z value. The more chemical compounds a sample contains, the more likely ions of the same m/z values will be generated from different compounds. Therefore, a complex sample is typically resolved to some extent prior to mass spectrometry, such as by liquid chromatography (LC), gas chromatography (GC), or capillary electrophoresis. For analysis of volatile compounds, gas chromatography is advantageously coupled with mass spectrometry (GC-MS). Several ionization methods are available in GC-MS, one of the most common being electron impact (EI), in which molecules are ionized by bombardment with electrons emitted by a filament.
During the sample separation step (chromatography), the chemical compounds in the sample are separated based on how long they stay in the sample separation system (column). Once a chemical compound exits the sample separation system, it enters a mass spectrometer system, and the ionization/ion separation/detection process begins as described above. For each compound, the time it remains in the sample separation system before it produces signal(s) in the mass spectrum is a function of its structure and is referred to as the retention time (RT). However, retention time is also specific to the instrument being used, and especially the column specifications in a gas chromatograph.
Without exact replication of the instrumentation on which RT is first measured, RTs of the same sample measured later may not match the RTs specified in the original chromatographic method or the computerized method files (including calibration and event tables) and can lead to misidentified peaks. One solution is the "relative retention" approach which utilizes retention indices (RI) or Kovats indices (KI) that circumvent problems associated with discrepancies in RT due to instrument-to-instrument or column-to-column variation. Methods to predict Kovats indices (KI) based on molecular structure and associated features are known in the art. Models which predict KI based on such factors are known as Quantitative Structure-Property Relationship (QSPR) models. See, for example, Mihaleva et al., (2009) Bioinformatics 6:787-794; Garjani-Nejad et al., (2004) Journal of Chromatography A, 1028:287-295; Seeley and Seeley, (2007) Journal of Chromatography A, 1172:72-83. This type of procedure converts the actual retention times of detected peaks into a number that is normalized to multiple reference compounds. This is especially useful for comparing retention times to databases and libraries for identification of individual components. Such libraries provide large numbers of known compounds, and a match between the data obtained experimentally by GC-MS and a compound in a library can assist in identification of the compound.
In order to increase the resolution of the GC-MS, a "second dimension" of GC can be added, for instance by coupling the GC column to a second GC column (often referred to as 2DGC-MS or GCxGC-MS, and used interchangeably here with the terms GCxGC-TOF or GCxGC-TOF-MS). See Venkatramani and Phillips, J. Microcolumn Sep. (1993) 5:511-516. Peaks of interest are diverted from the first column into the second column for further separation, which then feeds into the mass spectrometry system. However, even GCxGC-MS relies on structural correlation with compound libraries to make identifications of unknown compounds. The libraries of compounds most widely used for structural identification, such as the NIST library, contain retention index information for only 9% of the compounds having mass spectral data.
The use of RI or KI data allows structural assignments derived from comparison with library data to be refined. However, in order to achieve an acceptable level of confidence in the identification of an unknown compound, the assignment must be interpreted by the user, and compared to a reference standard by mass spectrometry to confirm the proposed structure. This approach has a number of disadvantages, including the need to repeat the process manually, which is inefficient; the limited size of Kovats Indices libraries; the lack of standardization, due to the need for manual intervention; all of which leads to reduced levels of confidence in the identification process.
In the traditional approach to identify the structure of a compound, mass spectral data generated by gas chromatography-electron impact ionization-mass spectrometry (GC-EI-MS) are compared with commercially available mass spectral data libraries (Figure 1). Using this procedure, the identification has only a low confidence level. In order to increase the level of confidence, a manual verification and interpretation of the mass spectral library search is carried out and the experimental retention time, or the Kovats index, is compared to database entries (e.g., NIST Retention Index library). Finally, for compound identification, a confirmation with reference standards is required. However, owing to the fact that this is very costly and time demanding, it is currently carried out only for a limited number of compounds.
There is a great need, therefore, for an improved procedure for interpreting GC-MS data which will allow greater levels of automation in structure identification and greater levels of confidence in the result.

Summary of the Invention

In a first aspect, there is provided a method for analysing mass spectral data obtained from a sample in two dimensional gas chromatography-mass spectrometry (GCxGC-MS), comprising:

(a) comparing mass spectral data obtained from a sample comprising an analyte with mass spectral data of candidate compounds of known structure in a library;
(b) identifying a plurality of candidate compounds from the library based on similarities of mass spectral data;
(c) predicting, for each candidate compound, a value of at least one analytical property using a quantitative model based on a plurality of molecular descriptors; and
(d) calculating a match score for each candidate compound based on the value predicted in step (c) and a measured value of the analytical property for the analyte.

In various embodiments of the method, within step (c), an analytical property score is derived from the predicted value of the analytical property of a candidate compound and a measured value of the analyte. In step (d), the measured value of the analytical property for the analyte can be the spectral similarity value as determined by algorithms in the software provided by NIST. The predicted value of an analytical property of a candidate compound is calculated according to a quantitative model based on a plurality of molecular descriptors. Accordingly, in one embodiment, the quantitative model of step (c) can be established by:

(i) providing a set of training compounds of known structure and a set of test compounds of known structure, and optionally a set of validation compounds of known structure;
(ii) generating a measured value of an analytical property for each training compound, each test compound, and each validation compound;
(iii) for each training compound, computing a set of molecular descriptors based on chemical structure and properties;
(iv) selecting a set of molecular descriptors from the set of molecular descriptors for use in a quantitative model of the analytical property, by using a genetic algorithm;
(v) generating a plurality of proposed quantitative models using the selected set of molecular descriptors;
(vi) evaluating each proposed quantitative model by computing a predicted value of the analytical property for each test compound;
(vii) selecting the quantitative model according to the root mean square error (RMSE) and/or the squared correlation (r²) on the measured value and the predicted value of the analytical property for each test compound; and optionally
(viii) selecting the quantitative model according to the squared correlation (r²) on the measured value and the predicted value of the analytical property for each validation compound.
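
Steps (vi) and (vii) above can be sketched as follows. This is a minimal illustration only, not the implementation used in the invention: the function names and the dictionary of per-model predictions are hypothetical, and the choice to rank by RMSE first and use r² only to break ties is an assumption, since the text allows RMSE and/or r².

```python
import math


def rmse(measured, predicted):
    """Root mean square error between measured and predicted values."""
    n = len(measured)
    return math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / n)


def squared_correlation(measured, predicted):
    """Squared Pearson correlation (r^2) between measured and predicted values."""
    n = len(measured)
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    s_mp = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    s_mm = sum((m - mean_m) ** 2 for m in measured)
    s_pp = sum((p - mean_p) ** 2 for p in predicted)
    return s_mp ** 2 / (s_mm * s_pp)


def select_model(measured, predictions_by_model):
    """Return the name of the proposed model with the lowest test-set RMSE.

    Assumption: ties on RMSE are broken by the higher r^2.
    predictions_by_model maps a (hypothetical) model name to its predictions
    for the test compounds, in the same order as `measured`.
    """
    return min(
        predictions_by_model,
        key=lambda name: (
            rmse(measured, predictions_by_model[name]),
            -squared_correlation(measured, predictions_by_model[name]),
        ),
    )
```

Note that a model with a constant offset can have a perfect r² and still a poor RMSE, which is one reason to look at both quantities.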

In various embodiments, the genetic algorithm used in step (iv) preferably comprises

(p) generating a plurality of candidate solutions using a combination of two or more molecular descriptors in a machine learning algorithm such as but not limited to multiple linear regression, k-nearest neighbour method, or support vector regression;
(q) scoring each candidate solution according to a fitness function based on the cross validation squared correlation (q²) of the training compounds;
(r) generating new candidate solutions by recombining and/or mutating the candidate solutions that produce an improved cross validation squared correlation; and
(s) repeating steps (q) and (r) for a finite number of times, for example, from 10 to 50 generations.

Candidate solutions generated by different machine learning algorithms can be compared to identify the best performing solutions.
The establishment of a quantitative model for one or more analytical properties is performed at least once for each particular setup of the GCxGC-MS separation system (e.g., after a change of column specification, temperature profile, or mobile phase) or of the mass spectrometry system. After the quantitative models have been established for an experimental setup, it is not necessary to establish them again each time data of an analyte generated with this particular setup are analyzed.
The analytical property score, which is a function of each analytical property, is preferably calculated as a quadratic function, where for analytical property P, $y = \frac{1}{-\left((exp_p - (exp_p - n_1 \times SEP)) \times (exp_p - (exp_p + n_1 \times SEP))\right)} \times -\left((pre_p - (exp_p - n_1 \times SEP)) \times (pre_p - (exp_p + n_1 \times SEP))\right)$
where exp_p = measured value of the property obtained by experiments, pre_p = predicted value of the property, and SEP = standard error of prediction. If the predicted and experimentally obtained measured values are identical, y = 1. The SEP is calculated according to the following formula, using the STEYX function of Microsoft Excel 2003: $\sqrt{\frac{1}{n - 2} \left[\sum {(y - \overline{y})}^{2} - \frac{{\left[\sum (x - \overline{x}) (y - \overline{y})\right]}^{2}}{\sum {(x - \overline{x})}^{2}}\right]}$

where x is a value of a sample, y is the predicted value of x for the sample and n is the number of samples.
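
The analytical property score can be illustrated with a short sketch. It assumes the quadratic has its zeros at exp_p − n1 × SEP and exp_p + n1 × SEP and equals 1 when the predicted and measured values coincide, as stated above. The function names are hypothetical, and the multiplier n1 is not given a value in the text, so it is passed explicitly here.

```python
import math


def standard_error_of_prediction(x, y):
    """Standard error of the predicted y for each x in a linear regression.

    This is the quantity given by the SEP formula in the text (the same
    quantity Excel's STEYX returns). Requires at least three data points.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("x and y must have the same length, at least 3")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # max() guards against a tiny negative value caused by rounding.
    return math.sqrt(max(0.0, s_yy - s_xy ** 2 / s_xx) / (n - 2))


def property_score(exp_p, pre_p, sep, n1):
    """Quadratic analytical property score.

    Equals 1 when pre_p == exp_p and 0 when pre_p is n1 * SEP away from
    exp_p; it becomes negative for larger deviations (no clamping is
    applied, since the text does not describe any).
    """
    lower = exp_p - n1 * sep
    upper = exp_p + n1 * sep
    return ((pre_p - lower) * (pre_p - upper)) / ((exp_p - lower) * (exp_p - upper))
```

For example, with a measured Kovats index of 1000, SEP = 10 and an illustrative n1 = 3, a predicted value of 1000 scores 1, while a predicted value of 970 or 1030 scores 0.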
In step (d) of the method, a spectral similarity value obtained from mass spectral database comparison can be used to generate a numerical value, wherein the spectral similarity value and the analytical property score(s) are combined. This numerical value is referred to herein as a match score, also referred to as the computer-assisted structure identification (CASI) score in the figures. In a preferred embodiment, the match score is calculated using a hyperbolic equation. The concept of the present invention differs from those used in currently available methods, in which analytical property values are used as a filter to select or deselect candidate compounds.
Optionally, for each query relating to a sample, the highest and second-highest match scores can be compared by dividing the highest score by the second-highest to generate a discrimination function, where a greater difference between the two scores generates a higher discrimination function. The higher the discrimination function, the higher the confidence score that can be assigned to each query. A confidence score can be calculated by multiplying the highest match score by the discrimination function value.
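
As a worked illustration of the discrimination function and confidence score described above, the following sketch takes the match scores of all candidate compounds for one query. The function name is hypothetical, and the handling of fewer than two candidates or a non-positive second score is an assumption, since the text does not address these cases.

```python
def confidence_score(match_scores):
    """Return (discrimination, confidence) for the match scores of one query.

    discrimination = highest match score / second-highest match score
    confidence     = highest match score * discrimination
    """
    ranked = sorted(match_scores, reverse=True)
    if len(ranked) < 2:
        raise ValueError("at least two candidate match scores are needed")
    best, second = ranked[0], ranked[1]
    if second <= 0:
        raise ValueError("second-highest match score must be positive")
    discrimination = best / second
    return discrimination, best * discrimination
```

With match scores of 0.9, 0.6 and 0.3, the discrimination function is 0.9 / 0.6 = 1.5 and the confidence score is 0.9 × 1.5 = 1.35.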
In preferred embodiments of the method, step (c) comprises predicting values of multiple analytical properties for each candidate compound. In one embodiment, a match score is derived from the spectral similarity obtained from the mass spectral database comparison, and a function of at least two analytical properties derived using a plurality of molecular descriptors. In another embodiment, a match score is derived from the spectral similarity value obtained from the mass spectral database comparison, and an analytical property score wherein the analytical property is the relative second dimension retention time derived by using a plurality of molecular descriptors.
Preferred analytical properties useful in the present invention include a Kovats index, a boiling point and a relative second dimension retention time (2D-rel RT) index. If the predicted analytical properties used in the method of the invention comprise a Kovats index and a 2D-rel RT, the Kovats index and the 2D-rel RT are preferably calculated using different molecular descriptors. Preferably, all three preferred analytical properties are used.
The Kovats indices of compounds are predicted using a linear equation comprising a plurality of coefficients, each multiplied by the value of a molecular descriptor. The equation is preferably obtained by using a test data set and a genetic algorithm to select the molecular descriptors from a plurality of possible molecular descriptors, and a linear regression or k nearest neighbors learning algorithm to correlate the selected molecular descriptors with the value to predict.
The boiling points of compounds can be predicted based on experimentally determined Kovats Indices. The boiling points of candidate compounds are calculated on the basis of their individual chemical structures using software packages known in the art, such as but not limited to ACD/PhysChem from ACD/Labs (Toronto, Canada).
In methods known in the art, the second dimension retention times are absolute second dimension retention times and there is no known available method for calculating relative 2D retention times. The challenge for developing a relative model is to define a reference system that is accessible for all second dimension peaks. This problem is solved by referring to a reference system based on a function of hypothetical deuterated n-alkanes. Deuterated or isotopically labelled compounds are used in a reference system for controlling retention times or internal standard-based quantification. Although other substances can be used as reference compounds, the n-alkanes are preferably used as a class of substances for generating a hypothetic 2D-RT reference system because this class of compounds does not have any known complex interaction with the stationary phase in the column of the second dimension separation system. Therefore this reference system adjusts for systemic shifts (such as different column length and gas flow), but not for analyte-stationary phase shifts, as these shifts are individual to compounds. Therefore adjusting for systemic shifts is the preferred method with regard to robustness on adjusting the complete compound space. In one embodiment of the invention, the first dimension of the GCxGC-MS is separated in a non-polar environment and the second dimension is separated in a polar environment.
In accordance with the present invention, a relative second dimension retention time of a compound is advantageously calculated as a retention time relative to a hypothetical n-alkane, whose first dimension retention time is derived from the regression function based on a series of deuterated n-alkane reference standards. The relative second dimension retention time of a compound is calculated as follows: $2D\text{-}rel\,RT_{Comp} = \frac{abs\,2D\,RT_{Comp}}{2D\,RT_{hypothetical\ n\text{-}alkane}}$

where 2D-rel RT_Comp is the relative second dimension retention time of the compound; abs 2D RT_Comp is the measured absolute second dimension retention time of the compound; and 2D RT_hypothetical n-alkane is calculated for each compound that elutes between deuterated n-alkane standard compound 1 and compound 2: $2D\,RT_{hypothetical\ n\text{-}alkane} = \frac{2D\,RT_{dA2} - 2D\,RT_{dA1}}{1D\,RT_{dA2} - 1D\,RT_{dA1}} \times 1D\,RT_{Comp} + \left(2D\,RT_{dA2} - \frac{2D\,RT_{dA2} - 2D\,RT_{dA1}}{1D\,RT_{dA2} - 1D\,RT_{dA1}} \times 1D\,RT_{dA2}\right)$

where dA1 and dA2 are deuterated n-alkane 1 and deuterated n-alkane 2, respectively; 1D RT is the first dimension retention time and 2D RT is the second dimension retention time of the respective molecules.
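
The calculation can be sketched in a few lines. This is an illustration under a stated assumption: the hypothetical n-alkane is taken to lie on the straight line through the two deuterated n-alkane standards that bracket the compound in the first dimension. The function names and the tuple layout are hypothetical.

```python
def hypothetical_alkane_2d_rt(rt1_comp, alkane_1, alkane_2):
    """Second dimension RT of the hypothetical n-alkane at the compound's 1D RT.

    alkane_1 and alkane_2 are (1D RT, 2D RT) pairs of the two deuterated
    n-alkane standards between which the compound elutes. The value is
    obtained by linear interpolation between the two standards (assumption).
    """
    (rt1_a1, rt2_a1), (rt1_a2, rt2_a2) = alkane_1, alkane_2
    slope = (rt2_a2 - rt2_a1) / (rt1_a2 - rt1_a1)
    intercept = rt2_a2 - slope * rt1_a2
    return slope * rt1_comp + intercept


def relative_2d_rt(rt1_comp, abs_rt2_comp, alkane_1, alkane_2):
    """Relative 2D RT: absolute 2D RT divided by that of the hypothetical n-alkane."""
    return abs_rt2_comp / hypothetical_alkane_2d_rt(rt1_comp, alkane_1, alkane_2)
```

For example, if the two standards elute at (600 s, 1.0 s) and (800 s, 1.2 s), a compound with a first dimension retention time of 700 s is referenced to a hypothetical n-alkane at 1.1 s, so an absolute second dimension retention time of 2.2 s gives a relative value of 2.0.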
In the above-described method, neither the absolute nor the relative second dimension retention times of candidate compounds are available. To use the relative second dimension retention time as an analytical property, a quantitative model is established using a set of training compounds, test compounds and optionally validation compounds.
The above-described methods are automated in Java and are available as a web service. The descriptors for the prediction models were calculated using the Dragon software. RapidMiner was used to apply the predictive retention models. Analytical scientists provide the software with mass spectra files, KIs and 2D relative retention times. First, each mass spectrum of the compound to identify is searched in various mass spectra databases using NIST MS Search and the first 100 hits are returned. Structures are standardized and structural duplicates are removed using Pipeline Pilot 8. For each hit, the KI, the relative retention time for the second dimension and the boiling point (BP) are calculated using predictive models. The final match score is calculated using a function that takes into account the match factor of NIST MS Search and the difference between each predicted value and the corresponding experimental value of the compound to identify.
In a second aspect of the invention, there is provided a method for calculating a relative second dimension retention time in GCxGC-MS (2-dimensional gas chromatography coupled to mass spectrometry) for a compound comprising the steps of:

(a) defining a reference system based on a function of deuterated n-alkanes that gives the hypothetical retention time of the reference for a range of retention times;
(b) transforming measured values of absolute second dimension retention times for a plurality of training compounds of known molecular structure into the reference system to calculate relative second dimension retention times for the training compounds;
(c) using the relative second dimension retention times for the training compounds to generate a quantitative structure-property relationship model of relative second dimension retention time based on a plurality of molecular descriptors;
(d) using the quantitative model to predict a relative second dimension retention time of the compound.

The quantitative model of relative second dimension retention time is established by:

(i) providing a set of training compounds of known structure and a set of test compounds of known structure, and optionally a set of validation compounds of known structure;
(ia) generating the measured value of the absolute second dimension retention time for each training compound, each test compound, and each validation compound in a specific experimental set up, and transforming these into the reference system to calculate relative second dimension retention times;
(ii) for each training compound, computing a set of molecular descriptors based on chemical structure and properties;
(iii) selecting a set of molecular descriptors from the set of molecular descriptors for use in a quantitative model of relative second dimension retention time, by using a genetic algorithm;
(iv) generating a plurality of proposed quantitative models using the selected set of molecular descriptors;
(v) evaluating each proposed quantitative model by computing a predicted value of relative second dimension retention time for each test compound;
(vi) selecting the quantitative model according to the root mean square error (RMSE) and/or the squared correlation (r²) on the calculated value from step (ia) and the predicted value of the relative second dimension retention time for each test compound; and optionally
(vii) selecting the quantitative model according to the squared correlation (r²) on the calculated value and the predicted value of the relative second dimension retention time for each validation compound.

Preferably, the genetic algorithm used in this aspect of the invention comprises:

(p) generating a plurality of candidate solutions using a combination of two or more molecular descriptors in a machine learning algorithm such as but not limited to multiple linear regression, k-nearest neighbour method, or support vector regression;
(q) scoring each candidate solution according to a fitness function based on the cross validation squared correlation (q²) of the training compounds;
(r) generating new candidate solutions by recombining and/or mutating the candidate solutions that produce an improved cross validation squared correlation; and
(s) repeating steps (q) and (r) for a finite number of times, for example, 10 to 50 generations.

Advantageously, the relative second dimension retention times used in the first aspect of the invention are predicted by the method of the second aspect of the invention.
Optionally, the results obtained from the computer-assisted methods of the invention based on chromatographic and mass spectral data generated by GCxGC-MS can be further enhanced by using the accurate mass data obtained from gas chromatograph-atmospheric pressure chemical ionization-mass spectrometry (GC-APCI-MS). Data generated by the two techniques can be matched by using a duplicate retention index system based on an additional reference system of deuterated fatty acid methyl esters.
In a third aspect, the invention provides methods for confirming the match of a test compound to a candidate compound identified in a database of two-dimensional gas chromatography mass spectrometry. The methods comprise analysis of the same sample by gas chromatography with atmospheric pressure chemical ionization and time-of-flight mass spectrometry (GC-APCI-TOF-MS, GC-APCI-TOF, or GC-APCI-MS) and comparing the theoretical monoisotopic mass with the accurate mass measured by GC-APCI-TOF-MS. The prerequisite for the confirmatory method is to match the retention indices of the two different chromatographic systems. The Kovats index system from GCxGC-TOF-MS analysis, which is based on deuterated n-alkanes, is bridged to another retention index system based on deuterated fatty acid methyl esters (FAMEs). The system based on deuterated FAMEs is used because deuterated n-alkanes are not ionizable by the ion source of the GC-APCI-TOF-MS.
The Kovats index systems are established by generation of a Kovats index system for the GCxGC-TOF-MS system based on deuterated n-alkanes; analysis of deuterated FAMEs using the GCxGC-TOF-MS system and determination of the Kovats indices of the FAMEs; analysis of deuterated FAMEs using the GC-APCI-TOF-MS system and generation of a retention index system for the GC-APCI-TOF-MS system based on deuterated FAMEs; and bridging of the retention index system for the GC-APCI-TOF-MS system based on deuterated FAMEs with the Kovats index system based on n-alkanes by using the Kovats indices of the deuterated FAMEs for the GCxGC-TOF-MS system.
Accordingly, the invention provides methods comprising the steps of:

(a) measuring Kovats indices of analytes relative to a first set of reference compounds in GCxGC-TOF-MS;
(b) measuring Kovats indices of a second set of reference compounds relative to the first set of reference compounds in GCxGC-TOF-MS;
(c) measuring absolute retention times of the second set of reference compounds in a GC-APCI-TOF-MS; and
(d) using the Kovats indices of the second set of reference compounds measured in step (b) to derive by linear regression a function for converting the Kovats indices of the analytes measured in step (a) into estimated absolute retention times of the analytes in the GC-APCI-TOF-MS.

The function of step (d) is derived by linear regression for each retention time range where an analyte is detected between two adjacent reference compounds of the second set of reference compounds. The function is: $RT_{analytes\ in\ GC\text{-}APCI\text{-}TOF\text{-}MS} = a \times KI_{analytes\ in\ GCxGC\text{-}TOF\text{-}MS} + b,$

where a is a coefficient and b is constant for a specific time range.
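
A minimal sketch of this conversion is given below. It assumes that each retention time range is bounded by exactly two adjacent reference compounds of the second set, so that the regression line for the range is simply the line through those two points. The function name and the layout of the reference list are hypothetical.

```python
from bisect import bisect_right


def ki_to_apci_rt(ki_analyte, reference_points):
    """Estimate the GC-APCI-TOF-MS retention time of an analyte from its KI.

    reference_points is a list of (KI measured in GCxGC-TOF-MS, absolute RT
    measured in GC-APCI-TOF-MS) pairs for the second set of reference
    compounds. The coefficients a and b are derived separately for the
    range between the two adjacent references that bracket the analyte.
    """
    refs = sorted(reference_points)
    kis = [ki for ki, _ in refs]
    if len(refs) < 2 or not kis[0] <= ki_analyte <= kis[-1]:
        raise ValueError("analyte KI must lie between two reference compounds")
    # Index of the upper bracketing reference, kept inside the list bounds.
    upper = min(max(bisect_right(kis, ki_analyte), 1), len(refs) - 1)
    (ki_1, rt_1), (ki_2, rt_2) = refs[upper - 1], refs[upper]
    a = (rt_2 - rt_1) / (ki_2 - ki_1)
    b = rt_1 - a * ki_1
    return a * ki_analyte + b
```

Because a and b are recomputed per range, the overall mapping is piecewise linear and passes exactly through every reference compound.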
The method further comprises comparing the molecular masses of the analytes with the molecular masses of the respective candidate compounds for each of the analytes.
In one embodiment, the method further comprises:

(e) measuring the absolute retention times of the analytes in the GC-APCI-TOF-MS;
(f) using the function calculated in step (d) to convert the absolute retention times measured in step (e) into calculated Kovats indices in the GC-APCI-TOF-MS for the analytes; and
(g) comparing the Kovats indices calculated in step (f) with the measured Kovats indices from step (a).

Preferably, the first set of reference compounds are deuterated n-alkanes. Preferably, the second set of reference compounds are deuterated fatty acid methyl esters.

Brief Description of the Drawings

Preferred embodiments of the present invention will now be described with reference to the accompanying drawings, in which:

Figure 1 illustrates a traditional approach for compound structure identification using GC-MS (NO: no compound identified with medium confidence; YES: compound identified with medium confidence);
Figure 2 illustrates the CASI approach for compound structure identification using GCxGC-MS system including use of GC-APCI-MS to confirm the results;
Figure 3 illustrates a process used to build the Kovats index and relative second dimension retention time models;
Figure 4 shows a correlation of predicted and experimental values of Kovats Indices for a set of validation compounds;
Figure 5 shows a correlation between boiling point (BP) predicted from Kovats Indices and BP predicted from chemical structures by the ACD/Labs PhysChem software for the set of validation compounds (r² = 0.934);
Figure 6 shows a correlation between predicted retention times and experimental retention times for the external test set of the GCxGC-MS system second column retention time model;
Figure 7 shows a contribution equation of a theoretical scoring module (e.g. KIFIT...);
Figure 8 shows the result of CASI for Geranylgeraniol as presented by the computer system of the present invention;
Figure 9 shows the position of the correct hit (i.e. structure candidate) for the 71 mass spectra to identify;
Figure 10 shows an embodiment of a computer system according to the present invention;
Figure 11 is a contingency table showing the true/false positives and true/false negatives rate for CASI and NIST search;
Figure 12 shows a preferred embodiment of the CASI software architecture;
Figure 13 shows web interface output in which, for each structure to identify, the structure candidate with the highest score is selected by default; and
Figure 14 shows web interface output in which the user can change the selection.

Detailed Description of the Invention

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although any methods, devices and materials similar or equivalent to those described herein can be used in the practice or testing of the invention, the preferred methods, devices and materials are now described.
All publications cited in this specification, including patent publications, are indicative of the level of ordinary skill in the art to which this invention pertains and are incorporated herein by reference in their entireties.
A high-throughput computer-assisted system for analyzing GCxGC-MS data, referred to as Computer-Assisted Structure Identification (CASI) is provided in this invention. The CASI system accelerates and standardizes the identification of compound structures, whilst assuring the reproducibility, and enables higher confidence for correct assignment of mass spectra to the right compounds. The concept of CASI is based on several steps of spectral searches and their matches to the parameters that are predicted on-the-fly.
Firstly, mass spectra are searched for candidate compounds and their associated match factors using an algorithm of National Institute of Standards and Technology (NIST, Gaithersburg, MD, USA) MS Search in the NIST 08 and WILEY 9th ed. Mass Spectra databases. Secondly, we have developed Quantitative Structure-Property Relationship (QSPR) models that predict analytical properties to enhance the confidence in compound identification. Two analytical properties, Kovats indices for first dimension (1D) separation and relative retention times for second dimension (2D) separation, are predicted by using these models. Preferably, the Kovats indices and relative 2D RT are calculated using different molecular descriptors. In addition, a third analytical property, the boiling point of a compound, is derived from the measured 1D RT of an analyte and is matched to computationally predicted boiling points of the candidate compounds. The boiling points are calculated by software known in the art, such as ACD/PhysChem software. Finally, the CASI system combines the matching results of NIST MS Search and all parameters predicted in QSPR models to produce a match score, also referred to as a CASI score (Figure 2). Optionally, the discriminatory power is calculated for each identified compound to measure confidence of the assignment. Optionally, the proposed chemical structure is confirmed by GC-APCI-TOF.

Models for prediction of analytical properties

All QSPR models for the development of CASI are built under the same principles. Compounds of known structure are split randomly into a training set (in this example, 90 compounds) and a test set (in this example, 35 compounds). In addition, in this example, 35 different compounds are used as a validation set. Without limitation, 50 to 500 compounds can be used for training. A different distribution of compounds between the sets could be chosen for model establishment. Chemical structures represented in computer-readable format are prepared using software known in the art, in this case, Pipeline Pilot 8.0.1 (Accelrys, Inc. San Diego, California, USA). During the preparation, salts are stripped from the compounds' structures using a predefined list, largest fragments are kept, bases are deprotonated and acids are protonated, charges of functional groups are standardized, hydrogens are added, canonical tautomers are generated, and 2D coordinates are generated. Then the duplicate structures are removed.
Molecular descriptors for all the compounds are computed by software known in the art, in this case, Dragon (Talete srl, Milano, Italy). A full description of the molecular descriptors can be found in "Molecular Descriptors for Chemoinformatics" by Roberto Todeschini and Viviana Consonni, WILEY - VCH, 2009 in the Series of Methods and Principles in Medicinal Chemistry - Volume 41 (Eds. R. Mannhold, H. Kubinyi, H. Timmerman). All two-dimensional molecular descriptors (2489 in total for the version of software used in this example) are chosen to be calculated. Descriptors that are correlated to another descriptor at >= 0.97 are considered redundant and are deselected. The 321 remaining descriptors are used in the next step.
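The redundancy filter described above can be sketched as follows. This is a minimal illustration and not the Dragon implementation: the greedy keep-first order, the use of the absolute Pearson correlation, and the toy descriptor names `d1` to `d3` are all assumptions made for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant column: treated as uncorrelated (assumption)
    return sxy / math.sqrt(sxx * syy)

def drop_correlated(columns, threshold=0.97):
    """Keep a descriptor only if |r| < threshold against every descriptor kept so far.

    `columns` maps a descriptor name to its values, one value per compound.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical descriptor values for five compounds.
descriptors = {
    "d1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "d2": [2.0, 4.0, 6.0, 8.0, 10.0],  # exactly 2 * d1, so r = 1 and it is dropped
    "d3": [5.0, 1.0, 4.0, 2.0, 3.0],   # weakly correlated with d1, so it is kept
}
selected = drop_correlated(descriptors)
```

Which member of a correlated pair is retained depends on the iteration order, so a real workflow may prefer a deterministic ordering of the descriptor list.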
To construct a predictive model, a set of predictive descriptors is selected in RapidMiner 5 (Rapid-I GmbH, Dortmund, Germany). Other similar data mining software platforms known in the art can also be used. Several molecular descriptor selection experiments using forward selection and a genetic algorithm were tried. The performance of forward selection is acceptable, but this method has the drawback of becoming trapped in local minima. Stochastic methods like genetic algorithms generally perform better. For this reason, genetic algorithms are used to select molecular descriptors.
The implementation of genetic algorithms in the systems of the invention uses roulette-wheel selection and two-point crossover. Each string of molecular descriptors, referred to as a "chromosome", contains a predefined number of "genes", and each gene codes for a descriptor. Generally, we select between 2 and 15 descriptors. The genes are not binary, but contain the position of the corresponding descriptor in a list. This allows using a minimum number of descriptors. The fitness function sets the subset of descriptors in the "Select Attributes" nodes of the RapidMiner process, executes it, and gets the root mean squared error of the training set as the fitness score. The mutation rate was set to 0.1, the number of chromosomes per generation was set to 20 to 40, preferably 30, and the number of generations was set to 100 to 300, preferably 200. The two best chromosomes survive at each generation.
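The genetic algorithm described above can be sketched in a few lines. This is an illustrative sketch only, not the RapidMiner-based implementation: `toy_error` is a hypothetical stand-in for executing the modelling process and reading back the training-set RMSE, and converting an error into a roulette-wheel weight as `1 / error` is an assumption, since the text does not say how the RMSE is mapped to a selection probability.

```python
import random

def two_point_crossover(a, b, rng):
    """Swap the segment between two cut points of two equal-length parents."""
    i, j = sorted(rng.sample(range(len(a) + 1), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

def mutate(chrom, n_descriptors, rate, rng):
    """Replace each gene by a random descriptor position with probability `rate`."""
    return [rng.randrange(n_descriptors) if rng.random() < rate else g for g in chrom]

def run_ga(error_fn, n_descriptors, n_genes, pop_size=30, generations=200,
           mutation_rate=0.1, n_elite=2, seed=0):
    """Return the best chromosome and the best error seen at each generation."""
    rng = random.Random(seed)
    # Genes are descriptor positions in a list, not bits.
    pop = [[rng.randrange(n_descriptors) for _ in range(n_genes)]
           for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        errors = [error_fn(c) for c in pop]
        history.append(min(errors))
        ranked = [c for _, c in sorted(zip(errors, pop), key=lambda t: t[0])]
        # Assumption: lower error means a larger slice of the roulette wheel.
        weights = [1.0 / (1e-9 + e) for e in errors]
        nxt = [list(c) for c in ranked[:n_elite]]  # the two best survive unchanged
        while len(nxt) < pop_size:
            p1, p2 = rng.choices(pop, weights=weights, k=2)  # roulette-wheel selection
            for child in two_point_crossover(p1, p2, rng):
                nxt.append(mutate(child, n_descriptors, mutation_rate, rng))
        pop = nxt[:pop_size]
    return min(pop, key=error_fn), history

# Hypothetical fitness: pretend positions 3, 17 and 42 are the informative descriptors.
TARGET = {3, 17, 42}

def toy_error(chrom):
    return float(len(TARGET - set(chrom)))

best, history = run_ga(toy_error, n_descriptors=50, n_genes=3)
```

Because the elite chromosomes are copied without mutation, the best error recorded in `history` can never increase from one generation to the next.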
In an exemplary workflow using RapidMiner, data preparation consists of a node which selects a subset of attributes, normalization by Z-transformation, and separation of the data set into a training set (75%) and a test set (25%). Then a linear regression is applied to the training set, and the learned model is applied to both the training set and the test set. In addition, leave-one-out cross validation is carried out on the training set. Various learning algorithms are used to build the models for prediction of KI and relative second dimension retention time, such as but not limited to k-Nearest Neighbors (k-NN), Multi Linear Regression (MLR) and Support Vector Regression (SVR). For each learning algorithm, from 2 to 15 descriptors are used to generate the models. At the end of the modeling run, the best model is kept for each value to predict. This process is described in Figure 3.
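To make the normalization and cross-validation steps concrete, the sketch below applies a Z-transformation and computes a leave-one-out q² for a k-nearest-neighbour regressor. It is a simplified illustration under stated assumptions: q² is taken as 1 - PRESS / SS_tot (a common definition that the text does not spell out), the population standard deviation is used, the data are invented, and the scaling is done once on the whole set rather than inside each fold.

```python
import math

def z_transform(rows):
    """Scale each column to zero mean and unit (population) standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

def knn_predict(train_x, train_y, query, k=2):
    """Average the targets of the k training rows closest to `query`."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return sum(train_y[i] for i in order[:k]) / k

def loo_q2(x, y, predict):
    """q2 = 1 - PRESS / SS_tot; each point is predicted by a model that never saw it."""
    press = 0.0
    for i in range(len(x)):
        rest_x, rest_y = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        press += (y[i] - predict(rest_x, rest_y, x[i])) ** 2
    mean_y = sum(y) / len(y)
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    return 1.0 - press / ss_tot

# Invented data: two descriptors and a target value for six compounds.
raw_x = [[1, 10], [2, 20], [3, 30], [4, 40], [5, 50], [6, 60]]
target = [100.0, 200.0, 300.0, 400.0, 500.0, 600.0]

q2 = loo_q2(z_transform(raw_x), target, knn_predict)
```

In this toy set the interior points are predicted exactly by their two neighbours, while the two end points are each off by 150, which is why q² stays well below 1.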

Kovats indices model

In this example of prediction of KI, the genetic algorithm (GA) was combined with three different learning algorithms. The results are presented in Table 1.

Table 1. Results of the best models for KI with multi linear regression, k-nearest neighbors and support vector machine regression. Q2 values were obtained with leave-one-out cross validation for MLR and 10-fold cross validation for kNN, and the RMSE value was obtained by 5-fold cross validation for SVR. The result shown in bold is selected as the best solution.

	GA - MLR	GA - kNN	GA-epsilon SVR (linear kernel)
KI Q2	0.988	0.972	0.979 (C = 1.9 x 10^-3)
R2 (test set)	0.982	0.956	0.957

The best results were obtained with a genetic algorithm - linear model using 15 descriptors. Exemplary descriptors are presented in Table 2; these or any other suitable descriptors may be used. The results obtained with this linear model are r² on training set = 0.991, q² for leave one out on training set = 0.988 and r² test set = 0.982. The r² on the external test set is similarly high (r² = 0.985, see Figure 4).

Table 2. Descriptors used in the selected KI model.

Coefficient	Descriptor	Description
236.746	nSK	Number of non-H atoms.
- 140.487	TI1	First Mohar index TI1.
+ 60.674	Wap	All-path Wiener index.
- 57.063	Jhetm	Balaban-type index from mass weighted distance matrix.
+ 54.075	PW4	Path/walk 4-Randic shape index.
117.349	AAC	Mean information index on atomic composition.
+ 67.819	ATS6v	Broto-Moreau autocorrelation of a topological structure - lag 6 / weighted by atomic van der Waals volumes.
+ 149.892	EEig10x	Eigenvalue 10 from edge adj. matrix weighted by edge degrees.
- 101.933	EEig10d	Eigenvalue 10 from edge adj. matrix weighted by dipole moments.
+ 69.663	BEHe3	Highest eigenvalue n. 3 of Burden matrix / weighted by atomic Sanderson electronegativities.
- 58.337	nCrq	Number of ring quaternary C(sp3).
- 7.834	C-034	Fragment R-CR..X
+ 49.347	Hy	Hydrophilic factor.
- 44.028	Inflammat-80	Ghose-Viswanadhan-Wendoloski anti-inflammatory-like index at 80 %.
+ 283.204	F02[C-C]	Frequency of C-C at topological distance 2.
1609.956

In another example of prediction of KI, a genetic algorithm - linear model using 12 descriptors is used. Exemplary descriptors are presented in Table 3 below. This linear model yielded r2 training set = 0.992, q2 leave one out = 0.999 and r2 test set = 0.983. r2 on external test set.

Table 3

Coefficient	Descriptor	Description
2490.980	nSK	Number of non-H atoms.
- 3470.745	nC	Number of C atoms
-48.955	nR06	Number of 6 membered rings
-48.134	Qindex	Quadratic index
-211.303	DELS	Molecular electrotopological variation
-45.839	SRW09	Self-returning walk count of order 9
-63.030	CIC3	Complementary information content (neighbourhood symmetry of 3-order)
+328.644	ATS1p	Broto-Moreau autocorrelation of topological structure - lag 1/weighted by atomic polarizabilities
+25.916	EEig15x	Eigenvalue 15 from edge adj. matrix weighted by edge degrees.
-31.625	JGI6	Mean topological charge index of order 6
-59.809	B01[C-Si]	Presence/absence of C-Si at topological distance 1
+1539.797	F01[C-C]	Frequency of C-C at topological distance 1
+1561.023

Boiling Point Model

In this example, the correlation between the boiling point (calculated with ACD/Labs ACD/PhysChem) and the boiling point calculated from Kovats index values is: r² training set = 0.955, r² test set = 0.910 and r² validation set = 0.934 (Figure 5). The equation obtained is: $BP = 0.1468 \times KI + 47.402$
In another example, the correlation between the boiling point (calculated with ACD/Labs ACD/PhysChem) and the boiling point calculated from Kovats index values is: r2 training set = 0.902, q2 leave one out = 0.899, r2 test set = 0.891 and r2 validation set = 0.934 (Figure 3). The equation obtained is: $BP = 0.1464 \times KI + 47.2755$
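As a worked example of the first regression equation above, the sketch below converts a Kovats index into a boiling point estimate. The function name is hypothetical, and the input value of 1600 is chosen only because an n-alkane with 16 carbon atoms has a Kovats index of 1600 by definition; the text does not state the unit of BP, so none is assumed here.

```python
def boiling_point_from_ki(ki, slope=0.1468, intercept=47.402):
    """Apply BP = 0.1468 x KI + 47.402 (coefficients of the first example above)."""
    return slope * ki + intercept

# 0.1468 * 1600 + 47.402 = 234.88 + 47.402 = 282.282
bp_estimate = boiling_point_from_ki(1600.0)
```

Passing `slope=0.1464` and `intercept=47.2755` reproduces the second example instead.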

Relative second dimension retention time model

For the relative second dimension retention time of the GCxGC-MS, we used genetic algorithms with three different learning algorithms. The results are presented in Table 4.

Table 4. Results of the best models for 2DRT with multi linear regression, k-nearest neighbors and support vector machine regression. Q2 values were obtained with leave-one-out cross validation for MLR and 10-fold cross validation for kNN, and the RMSE value was obtained by 5-fold cross validation for SVR. The result shown in bold is selected as the best solution.

	GA - MLR	GA - kNN	GA-epsilon SVR (linear kernel)
2DRT Q2	0.861	0.841	0.840 (C = 3.8)
R2 (test set)	0.750	0.673	0.827

One of the best models was obtained by using genetic algorithms and support vector regression analysis. The results obtained are q² leave one out = 0.840, r² test set = 0.827 and r² validation set = 0.849. The model is less accurate than the KI model. This can be explained by the fact that the variance of the experimentally measured second dimension retention times (and hence of the 2D relative RT) is higher than for the KI and, in addition, the relation between the structures and the retention times is not linear. However, with r² = 0.849 for the external test set, the model has a good accuracy. In this example, the model uses 8 descriptors as presented in Table 5.

Table 5. Descriptors used for the 2DRT model.

Descriptor	Description
Wap	All-path Wiener index.
AMW	Average molecular weight.
X0Av	Average valence connectivity index chi-0.
nRCO	Number of ketones (aliphatic).
ZM2V	Second Zagreb index by valence vertex degrees.
JGI3	Mean topological charge index of order 3.
X0A	Average connectivity index chi-0.
piPC10	Molecular multiple path count of order 10.

In another example, wherein the second dimension of the GCxGC-MS set up is polar, one of the best models was obtained by using genetic algorithms and 2 nearest neighbors analysis. The results yielded q2 leave one out = 0.899, r2 test set = 0.816 and r2 validation set = 0.811. The model is less accurate than the KI model. This can be explained by the fact that the reproducibility of the experimental measurements is lower, and that the relation between the structures and the retention times is not linear. However, with a value of r2 = 0.811 for the external test set, the model has a good accuracy. In this particular example, the model uses 14 descriptors as presented in Table 6.

Table 6- Descriptors used in the GCxGC-TOF second column retention time model

Descriptors	Description
AMW	Average molecular weight.
MSD	Mean square distance index (Balaban).
BLI	Kier benzene-likeness index.
PW5	Path/walk 5 - Randic shape index.
ICR	Radial centric information index.
piPC04	Molecular multiple path count of order 4.
X0Av	Average valence connectivity index chi-0.
AAC	Mean information index on atomic composition.
ATS5m	Broto-Moreau autocorrelation of a topological structure - lag 5 / weighted by atomic masses.
GATS2v	Geary autocorrelation - lag 2 / weighted by atomic van der Waals volumes.
BEHe1	Highest eigenvalue n. 1 of Burden matrix / weighted by atomic Sanderson electronegativities
F06[Si-Si]	Frequency of Si-Si at topological distance 6.
F09[C-O]	Frequency of C-O at topological distance 9.
F10[C-Si]	Frequency of C-Si at topological distance 10.

Calculation of a match score

Scores are calculated from the spectral similarity value (in this example, the NIST MS Search match factor), the predicted KI, the predicted second dimension relative retention time of the GCxGC-TOF and the predicted boiling point, using a hyperbolic equation. The general principle is that the similarity of the experimental MS to the library MS is multiplied by analytical property scores derived from each analytical property (KI, BP ...). The analytical property scores (KIFIT, BPFIT...) are normalized from 0 (no similarity) to 1 (perfect match). The scores are based on a quadratic equation via polynomial factorization of the type: $a x^{2} + bx + c = a (x - \alpha) (x - \beta)$
Using KI as an example of one of the analytical properties, the terms of the equation are as follows, where the coefficient a normalizes the curve so that the score equals 1 when KI_Pre = KI_Exp: $a = \frac{1}{(KI_{Exp} - (KI_{Exp} - (n_{KI} \times SEP_{KI}))) \times (KI_{Exp} - (KI_{Exp} + (n_{KI} \times SEP_{KI})))}$
$(x - \alpha) = (KI_{Pre} - (KI_{Exp} - (n_{KI} \times SEP_{KI})))$
$(x - \beta) = (KI_{Pre} - (KI_{Exp} + (n_{KI} \times SEP_{KI})))$
The complete equation is: $hyp_{KI} = \frac{1}{(KI_{Exp} - (KI_{Exp} - (n_{KI} \times SEP_{KI}))) \times (KI_{Exp} - (KI_{Exp} + (n_{KI} \times SEP_{KI})))} \times (KI_{Pre} - (KI_{Exp} - (n_{KI} \times SEP_{KI}))) \times (KI_{Pre} - (KI_{Exp} + (n_{KI} \times SEP_{KI})))$
$[\text{if } hyp_{KI} < 0 \text{ then } y = 0]$
With:

hyp_KI: hyperbolic equation which is used to correct the value of NIST Match Factor in the CASI score.
KI_Pre: predicted Kovats Index
KI_Exp: measured Kovats Index
n_KI: factor (for curve fitting) = e.g. n_KI for Kovats Index
SEP_KI: standard error of prediction

Curve Analysis:

Maximum: if KI_Pre = KI_Exp, y = 1
Zero-crossing1: KI_Pre = KI_Exp - n_KI x SEP_KI
Zero-crossing2: KI_Pre = KI_Exp + n_KI x SEP_KI

A graphical interpretation of the derived hyperbolic equation is shown in Figure 7.
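The behaviour listed in the curve analysis (a maximum of 1 at KI_Pre = KI_Exp, zero crossings at KI_Exp ± n_KI x SEP_KI, and negative values set to 0) can be reproduced with the short sketch below. It uses the algebraically equivalent form 1 - ((KI_Pre - KI_Exp) / (n_KI x SEP_KI))², which is an assumption derived from that curve analysis; the function name and the example values n = 3 and SEP = 20 are hypothetical.

```python
def hyp_score(predicted, measured, n, sep):
    """Property score: 1 for a perfect match, 0 at or beyond measured +/- n * sep."""
    half_width = n * sep                      # distance from the maximum to each zero crossing
    y = 1.0 - ((predicted - measured) / half_width) ** 2
    return max(0.0, y)                        # "if hyp < 0 then y = 0"

# Hypothetical values: measured KI 1500, n_KI = 3, SEP_KI = 20 (zero crossings at 1440 and 1560).
at_maximum = hyp_score(1500.0, 1500.0, 3, 20.0)
halfway = hyp_score(1530.0, 1500.0, 3, 20.0)
at_crossing = hyp_score(1440.0, 1500.0, 3, 20.0)
far_away = hyp_score(1700.0, 1500.0, 3, 20.0)
```

A larger n widens the curve, so the same prediction error is penalized less.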
An exemplary formula for combining the three analytical property scores and the spectral similarity value to calculate a match score, is as follows: $CASI Score = NIST MF \times {hyp}_{KI} \times {hyp}_{2 DRT} \times {hyp}_{BP}$
For each query of an analyte, the candidate compounds are ranked according to decreasing CASI scores. CASI score is calculated according to the above-described equation. The hit with the highest value is selected by default.
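The multiplication and ranking step can be illustrated as follows. The candidate names, match factors and property scores are invented for the example, and the dictionary layout is only one possible representation.

```python
def casi_score(nist_mf, hyp_ki, hyp_2drt, hyp_bp):
    """CASI Score = NIST MF x hyp_KI x hyp_2DRT x hyp_BP."""
    return nist_mf * hyp_ki * hyp_2drt * hyp_bp

# Invented candidates: (hyp_KI, hyp_2DRT, hyp_BP) are scores between 0 and 1.
candidates = [
    {"name": "candidate A", "mf": 900, "hyp": (0.20, 0.90, 0.80)},
    {"name": "candidate B", "mf": 850, "hyp": (0.95, 0.90, 0.90)},
    {"name": "candidate C", "mf": 800, "hyp": (0.00, 1.00, 1.00)},
]

# Rank by decreasing CASI score; the first entry is the default selection.
ranked = sorted(candidates,
                key=lambda c: casi_score(c["mf"], *c["hyp"]),
                reverse=True)
ranked_names = [c["name"] for c in ranked]
```

Candidate A has the highest match factor but a poor KI score, so it drops below candidate B, while candidate C is eliminated because one property score is 0.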

Score optimization

In calculating the CASI score, each of the three analytical property scores has four parameters. However, only n_x, which defines the value at which the hyperbolic curve crosses the X axis, has to be established. n_x contributes to the shape of the hyperbolic curve, and thus to the weight of each analytical property score in the final CASI score.
A grid search procedure is provided to establish optimal values for n_KI, n_2DrelRT and n_BP. A solution's score is generated by using every possible combination of integer values between 1 and 50 for each of n_KI, n_2DrelRT and n_BP. The solution's score is the number of correct hits sorted first for the training set and the test set. The solution with the highest number of correct hits is selected. The algorithm can be described as follows:

for n_KI in 1 .. 50
- for n_2DRT in 1 .. 50
  - for n_BP in 1 .. 50
    - compute the CASI score for the compounds in the training set and in the test set using the current values of n_KI, n_2DRT and n_BP.
    - count the number of correct hits for this iteration.
select the values of the solution with the greatest number of correct hits.
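The grid search above can be sketched in runnable form as follows. This is an illustration under assumptions rather than the CASI implementation: the property score uses the form 1 - ((predicted - measured) / (n x SEP))² clipped at 0, ties between parameter combinations keep the first combination found, and the query data, SEP values and dictionary keys are all invented.

```python
from itertools import product

def hyp(predicted, measured, n, sep):
    """Property score: 1 at a perfect match, 0 beyond measured +/- n * sep."""
    return max(0.0, 1.0 - ((predicted - measured) / (n * sep)) ** 2)

def casi(cand, measured, n, sep):
    """Match factor multiplied by the three property scores."""
    score = cand["mf"]
    for prop in ("ki", "rt2d", "bp"):
        score *= hyp(cand[prop], measured[prop], n[prop], sep[prop])
    return score

def correct_first(queries, n, sep):
    """Number of queries whose top-scoring candidate is the known structure."""
    hits = 0
    for q in queries:
        top = max(q["candidates"], key=lambda c: casi(c, q["measured"], n, sep))
        hits += top["correct"]
    return hits

def grid_search(queries, sep, values=range(1, 51)):
    """Try every integer combination and keep the one with the most correct hits."""
    best_n, best_hits = None, -1
    for n_ki, n_2drt, n_bp in product(values, repeat=3):
        n = {"ki": n_ki, "rt2d": n_2drt, "bp": n_bp}
        hits = correct_first(queries, n, sep)
        if hits > best_hits:  # assumption: ties keep the first combination found
            best_n, best_hits = n, hits
    return best_n, best_hits

# Invented data: one query with a wrong candidate (higher MF) and the correct one.
SEP = {"ki": 20.0, "rt2d": 0.05, "bp": 10.0}
QUERIES = [{
    "measured": {"ki": 1500.0, "rt2d": 1.00, "bp": 270.0},
    "candidates": [
        {"mf": 900, "ki": 1560.0, "rt2d": 1.02, "bp": 272.0, "correct": False},
        {"mf": 800, "ki": 1510.0, "rt2d": 1.02, "bp": 272.0, "correct": True},
    ],
}]

best_n, best_hits = grid_search(QUERIES, SEP)
```

With a narrow KI window the wrong candidate scores 0 and the correct one ranks first; with a very wide window (for example n_KI = 50) the higher match factor of the wrong candidate wins, which is the effect the grid search is meant to tune.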

The selected n_KI, n_2DrelRT and n_BP parameters will be used in the final validation step of the configuration in CASI.

Validation

To validate the performance of the methods of the invention, a set of 71 molecules whose identities are known is used. Results are shown in Figure 9. Some of these molecules are present in the validation set used to validate the models, but none of them is present in the training set or the test set. The results obtained by using the CASI system are better than those obtained using the NIST match factor alone: 51 correct hits are ranked first and 14 correct hits are ranked in second position. Using the NIST Match Factor, 50 correct hits are ranked first but only 9 correct hits are ranked in second position. The ranking of correct structures with the CASI score is compared to the ranking using the NIST Match Factor in Table 7.

Table 7. Comparison of the position of correct hits by ranking based on CASI score and ranking based on NIST Match Factor. CASI score performs better than NIST Match Factor in terms of ranking of correct hits.

Position of correct hits	1	2	3	4	5	6	7	10	20
Frequency with CASI score	51	14	3	2	1				
Frequency with NIST Match Factor	50	9	4	2	2	1	1	1	1
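The comparison can be summarized as the share of the 71 validation molecules whose correct structure appears in the first or second position. The sketch below uses only the counts stated in the text (51 and 14 for the CASI score, 50 and 9 for the NIST Match Factor); the variable and function names are hypothetical.

```python
TOTAL = 71  # validation molecules of known identity

# (hits ranked first, hits ranked second), as stated in the text.
casi_counts = (51, 14)
nist_counts = (50, 9)

def top2_fraction(counts, total=TOTAL):
    """Fraction of molecules whose correct structure is ranked first or second."""
    return sum(counts) / total

casi_top2 = top2_fraction(casi_counts)  # 65 / 71
nist_top2 = top2_fraction(nist_counts)  # 59 / 71
```

On these figures the correct structure is among the top two hits for 65 molecules with the CASI score and for 59 molecules with the NIST Match Factor.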
By analyzing the true/false positive and true/false negative rates shown in the contingency table (Figure 11), the rate of false positive structural assignments is reduced significantly for the CASI score compared to the NIST MS search. Accordingly, with the CASI score every 9th structural assignment is a wrong assignment, whereas with the NIST MS search every 3rd structural assignment is a false one.
An illustrative example of the advantage of the CASI score is hentriacontane, which is ranked in 20th position with NIST MF but in 2nd position with the CASI score, because of the accurate prediction of the KI. Another example, presented in Figure 8, is geranylgeraniol: the CASI score as well as the NIST Match Factor rank the correct hit in first position, but the CASI score gives a much higher discriminatory power.
These results show that the CASI system improves the confidence and increases the throughput of structure identification.
The results obtained from the CASI system can be confirmed by the use of GC-APCI-TOF-MS. A sample comprising analytes is combined with deuterated n-alkanes and deuterated fatty acid methyl esters (FAMEs) and divided into two aliquots. One aliquot is analyzed by GCxGC-TOF-MS, wherein the Kovats indices of the FAMEs and analytes are determined using deuterated n-alkanes as the reference system. The other aliquot is analyzed in a GC-APCI-MS, wherein the absolute retention times of the FAMEs are determined. By applying the above-described methods for bridging the retention index systems, the deviation of the Kovats index was found to be less than 1% between both systems and the mass deviation was found to be less than 1 mDa for the GC-APCI-TOF-MS.
The ability to confirm proposed structures using accurate masses measured by GC-APCI-TOF-MS was tested. The method is used to confirm the proposed structures of 155 compounds present in cigarette smoke. 120 of the 155 compounds are ionizable in the GC-APCI-TOF-MS. 106 compounds are detected within the retention time index window and 85 compounds are confirmed automatically.

The CASI system

Figure 10 is a block diagram of a computer system for analysing mass spectral data in GCxGC mass spectrometry. The system includes a web interface 1000, a match score generator engine 2100, a structural candidate search engine 2200 which accesses a structural candidate database 2210, a descriptor selection and model generation engine 2300 and a descriptor computation engine 2400. The system further includes a chemical structure generator 3100 which accesses a name-to-structure database 3200. The components of the system may be software applications operating on a single server or may be distributed over multiple computing systems communicating via network interfaces including wireless communication systems. However, in the embodiment shown in Figure 10, the match score generator engine 2100, structural candidate search engine 2200, descriptor selection and model generation engine 2300 and descriptor computation engine 2400 are interconnected software applications operating on a match score server 2000, on which structural candidate database 2210 is also stored. The chemical structure generator 3100 and name-to-structure database 3200 operate on a second server 3000, although they may also operate on match score server 2000.
Input data 100 is input via web interface 1000. The input data may be in the form of a JDX file and comprises mass spectra data from a sample, and may further include experimental values for analytical properties such as Kovats index data, boiling point data and 2D retention time data. The web interface 1000 may communicate with the match score generator engine 2100 via SOAP (Simple Object Access Protocol).
The computer system operates in two modes, a training mode and an analysis mode. The training mode may be run at any time, but it is necessary to run the computer system in training mode every time the mass spectrometer experimental set up is changed. In the training mode, the input data are mass spectrometer data and measured values of an analytical property such as Kovats index, for a set of known compounds.
For each of the known compounds, the chemical structure in computer readable form is generated by the chemical structure generator 3100 which accesses the name-to-structure database 3200. The chemical structure generator 3100 may be Pipeline Pilot 7.5.1 software, and the database 3200 may be an ACD database.
For all of the known compounds, molecular descriptors are calculated by descriptor computation engine 2400, which may be the Dragon software package. The known compounds are divided into a training set and a test set. For the training set, descriptor selection and model generation engine 2300, which may be RapidMiner software, selects a set of predictive descriptors using forward selection and a genetic algorithm as described in detail above to construct a predictive model for predicting values of an analytical property, such as Kovats indices or 2D retention time, for the training compound structures. The predictive model is verified using the test set, as described in more detail above, and a model is selected.
In the analysis mode, the input data 100 is mass spectrometry data from a sample. The structural candidate search engine 2200 carries out a search in structural candidate database 2210 by comparing the mass spectra data from the sample with mass spectra data in the database 2210, to generate a number of structural candidate compounds based on similarity of the mass spectra data with the data in the database 2210. The selected candidate compounds may be, for example, the top 100 matches. The search engine may be an NIST MS search algorithm, and the database 2210 may be the NIST 08 and WILEY 9th ed Mass Spectra databases. The list of structural candidates is made available for the user to view via web interface 1000. Each candidate has a match factor indicative of the similarity of the mass spectra data for the sample with the data in the database 2210 for the candidate. The match factor is generated by the structural candidate search engine 2200, and may also be displayed to the user via the web interface 1000 for each structural candidate.
For each of the structural candidates, the chemical structure in computer readable form is generated by the chemical structure generator 3100 which accesses the name-to-structure database 3200. The chemical structure generator 3100 may be Pipeline Pilot 7.5.1 software, and the database 3200 may be an ACD database.
For all of the structural candidates, molecular descriptors are calculated by descriptor computation engine 2400, which may be the Dragon software package.
The model generated by the descriptor selection and model generation engine 2300 in the training mode is then used to predict the analytical property, such as Kovats index or 2D retention time, for the candidate structures. The descriptor selection and model generation engine 2300 supplies the model to the match score generator engine 2100 which calculates predicted values of one or more analytical properties based on the model. The predicted values may be communicated to the user via web interface 1000.
The match score generator engine 2100 calculates a match score for each candidate compound based on the match factors generated by the structural candidate search engine 2200, the predicted values of the analytical properties predicted by the model provided by the descriptor selection and model generation engine 2300, and measured values of the analytical properties of the sample which were included in input data 100. The match score generator engine 2100 may calculate a CASI score in accordance with the method described above. The match scores may also be communicated to a user via web interface 1000.
The web interface 1000 may display the results to the user in the form of a table, listing the structural candidates, the match factors generated by the structural candidate search engine 2200, the predicted values of the analytical properties generated by the model generation engine 2300, and the match score. The table may be sorted to rank the structural candidates by their match scores.
Once a model for predicting an analytical property has been generated by descriptor selection and model generation engine 2300 in the training mode, there is no need to generate a model again for a new set of input data, i.e. a new sample for identification and a new set of structural candidates, provided the experimental set up has not changed. If the experimental set up is changed, it is necessary to generate a new model by running the system in the training mode. Therefore, the descriptor selection and model generation engine 2300 supplies the selected model to the match score generator 2100, which, in the analysis mode, applies the model to the structural candidates to generate predicted values for the analytical property. In this way, in the analysis mode, access to the descriptor selection and model generation engine 2300 is not required. Access to the descriptor selection and model generation engine 2300 is only required in the training mode for generation of a new model. The descriptor selection and model generation engine 2300 may thus be provided on a separate computing device, e.g. a server, which is only accessed in the training mode.
A preferred embodiment of the software architecture is illustrated in Figure 12.
Oracle Application Express is used for the development of the web interface 1000. A SOAP interface allows Oracle Application Express to communicate with the match score generator engine 2100, which is developed in Java and runs in Tomcat. RapidMiner is used as the descriptor selection and model generation engine 2300 and is integrated by Java API. Java is used to implement the match score generator engine 2100 mainly because RapidMiner can be easily integrated in Java.
The structural candidate search engine 2200 comprises NIST MS Search and is integrated by command line. The chemical structure generator 3100 is Pipeline Pilot and is integrated with a Java API. It is used to convert names of the hits to structures (using ACD/Labs name-to-structure and an internet connection to ChEMBL), to standardize the structures, to compute boiling points (ACD/Labs PhysChem Batch) and to move data from CASI to a chemical registry database. The descriptor computation engine 2400 comprises Dragon and is integrated by command line. In addition to these software modules, standard Java libraries are used: Log4J for logging error messages, Hibernate for mapping the objects to the Oracle database and JUnit for the unit tests.
Figures 13 and 14 illustrate outputs of the web interface 1000. For a given analysis, all compounds to identify are presented with the structure candidate having the best score (Figure 13). Structure candidates can be browsed and the selection can be changed (Figure 14). All structure candidates (Hits) for the compound to identify (Query, in this case 1-Pentene, 2,3-dimethyl) are listed with their predicted properties. The one with the best score is selected by default. The user can change the selection and add comments, which will be inserted with the selected structure into a chemical registration system.

Claims

A method for analysing mass spectral data obtained from a sample in GCxGC (2-dimensional) mass spectrometry, comprising:
(a) comparing mass spectral data of an analyte with mass spectral data of candidate compounds of known structure in a library;

(b) identifying a plurality of candidate compounds from the library based on similarities of mass spectral data;

(c) predicting, for each candidate compound, a value of at least one analytical property using a quantitative model based on a plurality of molecular descriptors; and

(d) calculating a match score for each candidate compound based on the value predicted in step (c) and a measured value of the analytical property for the analyte.
The method of claim 1, wherein step (c) comprises predicting, for each candidate compound, values of a plurality of analytical properties, wherein the predicted analytical properties include at least one of a Kovats index, a boiling point and a relative second dimension retention time.
The method of claim 1 or 2, wherein the relative second dimension retention time of the analyte is a function of the absolute second dimension retention time of the compound and the second dimension retention time of a hypothetical deuterated n-alkane, wherein the second dimension retention time of a hypothetical deuterated n-alkane is calculated according to a linear regression on the absolute first dimension retention times and absolute second dimension retention times of a series of deuterated n-alkanes.
The method of any one of the preceding claims, wherein the match score is additionally based on the similarity of mass spectral data in step (b).
The method of claim 1, wherein the quantitative model of step (c) is obtained by using a test data set and a genetic algorithm to select the molecular descriptors from a plurality of possible molecular descriptors, and using a machine learning algorithm selected from linear regression, support vector regression, or k nearest neighbours method to correlate the selected molecular descriptors with the value to predict.
The method of claim 1, wherein said quantitative model of step (c) is the product of a method for establishing a quantitative model, which comprises the following steps:
(i) providing a set of training compounds of known structure and a set of test compounds of known structure, and optionally a set of validation compounds of known structure;

(ii) generating a measured value of an analytic property for each training compound, each test compound, and each validation compound;

(iii) for each training compound, computing a set of molecular descriptors based on chemical structure and properties;

(iv) selecting a set of molecular descriptors from the set of molecular descriptors for use in a quantitative model of the analytical property, by using a genetic algorithm;

(v) generating a plurality of proposed quantitative models using the selected set of molecular descriptors;

(vi) evaluating each proposed quantitative model by computing a predicted value of the analytical property for each test compound;

(vii) selecting the quantitative model according to the root mean square error (RMSE) and/or the squared correlation (r²) on the measured value and the predicted value of the analytical property for each test compound; and optionally

(viii) selecting the quantitative model according to the root mean square error (RMSE) and/or the squared correlation (r²) on the measured value and the predicted value of the analytical property for each validation compound.
The method of claim 6, wherein using the genetic algorithm of (iv) comprises
(p) generating a plurality of candidate solutions using a combination of two or more molecular descriptors in a machine learning algorithm selected from multiple linear regression, k-nearest neighbour method, or support vector regression;

(r) scoring each candidate solution according to a fitness function based on the cross validation squared correlation (q²) of the training compounds;

(s) generating new candidate solutions by recombining and/or mutating the candidate solutions that produce an increased cross validation squared correlation; and

(t) repeating step (r) and (s) for a finite number of times.
The method of any one of the preceding claims, further comprising verifying a candidate structure by a method comprising the steps of:
(A) measuring Kovats indices of analytes relative to a first set of reference compounds in GCxGC-TOF-MS;

(B) measuring Kovats indices of a second set of reference compounds relative to the first set of reference compounds in GCxGC-TOF-MS;

(C) measuring absolute retention times of the second set of reference compounds in a GC-APCI-TOF-MS; and

(D) using the Kovats indices of the second set of reference compounds measured in step (B) to derive by linear regression a function for converting the Kovats indices of the analytes measured in step (A) into estimated absolute retention times of the analytes in the GC-APCI-TOF-MS.
The method of claim 8, further comprising:
(E) measuring the absolute retention times of the analytes in the GC-APCI-TOF-MS;

(F) using the function calculated in step (D) to convert the absolute retention times measured in step (E) into calculated Kovats indices in the GC-APCI-TOF-MS for the analytes; and

(G) comparing the Kovats indices calculated in step (F) with the measured Kovats indices from step (A).
The method of claim 8 or 9, wherein the function of step (D) is derived by linear regression for each retention time range where an analyte is detected between two adjacent reference compounds of the second set of reference compounds, wherein the function is: $\text{RT analytes in GC-APCI-TOF-MS} = a \times (\text{KI analytes in GCxGC-TOF-MS}) + b,$

where a is a coefficient and b is constant for a specific time range.
The method of any one of claims 8 to 10, further comprising comparing the molecular masses of the analytes with the molecular masses of the respective candidate compounds for each of the analytes.
The method of any one of claims 8 to 11, wherein the first set of reference compounds are deuterated n-alkanes and the second set of reference compounds are deuterated fatty acid methyl esters.
A method of calculating a predicted relative second dimension retention time in a GCxGC-MS (2-dimensional gas chromatography coupled to mass spectrometry) for a molecular structure comprising the steps of:
(a) defining a reference system based on a function of hypothetical deuterated n-alkanes;

(b) transforming measured values of absolute second dimension retention times for a plurality of training compounds of known molecular structure into the reference system to calculate relative second dimension retention times for the training compounds;

(c) using the relative second dimension retention times for the training compounds to generate a quantitative model of relative second dimension retention time based on a plurality of molecular descriptors;

(d) using the quantitative model to predict a relative second dimension retention time of the molecular structure.
A computer system programmed to carry out the method of any one of claims 1 to 19, operatively connected to a GCxGC (2-dimensional) mass spectrometer.