WO2013166406A1

WO2013166406A1 - Methods of distinguishing between similar compositions

Info

Publication number: WO2013166406A1
Application number: PCT/US2013/039497
Authority: WO
Inventors: David A. FRIEDENBERG; Theodore P. KLUPINSKI; Erich D. STROZIER; Douglas D. MOONEY; Cheryl A. DINGUS; Eugene Anthony ZARATE
Original assignee: Battelle Memorial Institute
Priority date: 2012-05-04
Filing date: 2013-05-03
Publication date: 2013-11-07

Abstract

Methods of determining the degree of difference between two similar compositions are disclosed. Mass spectra from multiple samples of the two compositions are obtained using two-dimensional gas chromatography coupled with time-of-flight mass spectrometry. That data is processed to obtain a dataset. A random forest algorithm is used to analyze the dataset and create a classifier that distinguishes between the two compositions. The samples are then classified as originating from the first composition or the second composition using the classifier to create a confusion matrix. A p-value is determined from the confusion matrix and then compared to a selected alpha value -associated with a given confidence level- to determine the degree of difference between the first composition and the second composition

Description

METHODS OF DISTINGUISHING BETWEEN SIMILAR COMPOSITIONS

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application claims priority to U.S. Provisional Patent Application Serial No. 61/643,080, filed on May 4, 2012, and to U.S. Provisional Patent Application Serial No. 61/655,027, filed on June 4, 2012. The disclosure of each application is hereby fully incorporated by reference in its entirety.

BACKGROUND

[0002] The present disclosure relates to methods of comparing two particular compositions to determine whether the two compositions can be distinguished from each other, or in other words the degree of difference or similarity between the two compositions. Such methods are useful in many different scenarios.

[0003] As one non-limiting example of an application of such methods, the Family Smoking Prevention and Tobacco Control Act requires the chemical characterization of tobacco products for the demonstration of "Substantial Equivalence" for new tobacco products marketed after February 15, 2007. A manufacturer that wishes to introduce a new tobacco product must either submit a Substantial Equivalence report that compares the new product to a predicate product (i.e. under Section 905(j)) or complete the full new product application process (i.e. under Section 910(c)). The relevant characteristics for determining Substantial Equivalence are "the materials, ingredients, design, composition, heating source, or other features of a tobacco product." The determination of Substantial Equivalence can be envisioned as a hierarchical process in which simple, inexpensive comparisons can be applied first to identify tobacco products that are clearly not equivalent to one another. For example, tobacco products that have different designs or different heating sources can often be readily distinguished from one another. Tobacco products can also be analyzed using routine targeted chemical analyses to determine their potential differences in characteristics such as nicotine content and pH. If two given tobacco products yield similar results when analyzed by such tests, the next step would be to perform a more comprehensive chemical analysis that can detect many natural components found in tobacco leaves as well as additives used in commercial tobacco products. The methods of the present disclosure would be useful in this respect.

[0004] As another non-limiting example, the methods can be used to authenticate the source of a foodstuff such as coffee beans or olive oil, or of a drug formulation. The GCxGC-TOFMS can detect adulterants or other contaminants whose specific identity is not known and use their presence to distinguish between sources. Standard quality tests may be insufficient for screening such contaminants, particularly when the identity is unknown. These methods can also be used for the analysis of other compositions that contain numerous organic compounds to characterize their composition in a comprehensive manner that includes many of the important components for samples derived from plants, animals, or petroleum/petrochemicals.

[0005] It would be desirable to provide methods that can determine the degree of difference or similarity between two compositions or sources.

BRIEF DESCRIPTION

[0006] The present disclosure relates to methods of processing large quantities of data to determine the degree of difference or similarity between two similar compositions that differ in their source. Briefly, several samples of the two different compositions are analyzed to create a dataset containing information on the presence and/or relative concentration of chemical compounds in each sample. The dataset is then analyzed using a random forest algorithm to create a classifier that distinguishes between the two compositions. Each sample is then classified using the classifier as coming from one of the two compositions. The results are used to create a confusion matrix, and a p-value is determined. The p-value can be compared to a selected alpha value to determine the degree of different between the two compositions.

[0007] These and other non-limiting aspects and/or objects of the disclosure are more particularly described below. BRIEF DESCRIPTION OF THE DRAWINGS

[0008] The following is a brief description of the drawings, which are presented for the purposes of illustrating the exemplary embodiments disclosed herein and not for the purposes of limiting the same.

[0009] FIG. 1 is a schematic diagram of an apparatus for two-dimensional gas chromatography coupled with time-of-flight mass spectrometry (GCxGC-TOFMS).

[0010] FIG. 2 is an example of a classification tree.

[0011] FIG. 3 is a flowchart illustrating the methods of the present disclosure.

DETAILED DESCRIPTION

[0012] A more complete understanding of the processes and apparatuses disclosed herein can be obtained by reference to the accompanying drawings. These figures are merely schematic representations based on convenience and the ease of demonstrating the existing art and/or the present development, and are, therefore, not intended to indicate relative size and dimensions of the assemblies or components thereof.

[0013] Although specific terms are used in the following description for the sake of clarity, these terms are intended to refer only to the particular structure of the embodiments selected for illustration in the drawings, and are not intended to define or limit the scope of the disclosure. In the drawings and the following description below, it is to be understood that like numeric designations refer to components of like function. In the following specification and the claims which follow, reference will be made to a number of terms which shall be defined to have the following meanings.

[0014] The singular forms "a," "an," and "the" include plural referents unless the context clearly dictates otherwise.

[0015] Numerical values in the specification and claims of this application should be understood to include numerical values which are the same when reduced to the same number of significant figures and numerical values which differ from the stated value by less than the experimental error of conventional measurement technique of the type described in the present application to determine the value. [0016] All ranges disclosed herein are inclusive of the recited endpoint and independently combinable (for example, the range of "from 2 grams to 10 grams" is inclusive of the endpoints, 2 grams and 10 grams, and all the intermediate values).

[0017] As used herein, approximating language may be applied to modify any quantitative representation that may vary without resulting in a change in the basic function to which it is related. Accordingly, a value modified by a term or terms, such as "about" and "substantially," may not be limited to the precise value specified, in some cases. The modifier "about" should also be considered as disclosing the range defined by the absolute values of the two endpoints. For example, the expression "from about 2 to about 4" also discloses the range "from 2 to 4."

[0018] Presented herein are methods and approaches for determining the degree of difference between a first composition and a second composition. Such methods can be useful for showing that the two compositions are either substantially equivalent or substantially different. This can be done according to the presence/absence and relative concentrations of the chemical compounds in samples obtained from the two compositions. The present disclosure contemplates the use of two-dimensional gas chromatography coupled with time-of-flight mass spectrometry (GCxGC-TOFMS) as a chemical analysis technique. The data obtained using this chemical analysis technique is then analyzed using a random forest algorithm as a statistical pattern recognition technique.

[0019] Generally, a plurality of samples from the first composition and the second composition are evaluated using two-dimensional gas chromatography coupled with time-of-flight mass spectrometry to create a datafile for each sample. The datafiles are then processed to obtain a dataset, the dataset containing entries corresponding to the presence or relative concentration of chemical compounds in each sample. The dataset is then analyzed using a random forest algorithm to create a classifier that distinguishes between the first composition and the second composition. The classifier is used to classify each sample as originating from the first composition or the second composition and create a confusion matrix. A p-value is determined based on the confusion matrix. The p-value is then compared to a selected alpha value to determine the degree of difference between the first composition and the second composition. [0020] In other words, the source of each sample is known (i.e. from the first composition or the second composition), but the accuracy of the classifier determines whether the two compositions can be considered to be substantially equivalent or substantially different. If the classifier is not very accurate in identifying the correct source of each sample, this provides strong evidence that the two compositions are very similar.

[0021] Two-dimensional gas chromatography coupled with time-of-f light mass spectrometry (GCxGC-TOFMS) offers substantially greater component separation and identification capability than other traditional analytical chemistry techniques. Gas chromatography is also especially well-suited for analyzing mixtures of volatile and semi-volatile compounds. Generally, an organic solvent such as acetone should be used.

[0022] Two-dimensional gas chromatography employs two gas chromatography columns instead of only one such column. A sample is injected into a first column, and the eluent from the first column is then injected onto a second column. The second column has a different separation mechanism. For example, in some embodiments herein, the first column is a non-polar column and the second column is a polar column. Other variations are also possible, such as running the two columns at different temperatures. The second column should run much faster than the first column. Put another way, the retention time on the first column should be greater than the retention time on the second column. One or more modulators are located between the first column and the second column. The modulator acts as a gate or interface between the two columns, and controls the flow of analytes from the first column to the second column.

[0023] FIG. 1 shows a schematic using a gas chromatograph (GC) 1 equipped with one type of two-stage modulator. Generally, the first modulator stage 20 operates by trapping/immobilizing eluent from the first dimension GC column 10 in place. This collected eluent is periodically released to the second modulator stage 30. The second modulator stage 30 releases the eluent as a narrow band into the second dimension GC column 40 to start the secondary separation. The first modulator stage 20 and the second modulator stage 30 are out of phase with each other, so that the first column 10 and the second column 40 are isolated from each other. The eluent from the second column is sent to the time-of-flight mass spectrometer 50 for analysis. The resulting output can be represented as a three-dimensional graph, with the first column retention time on the x-axis, the second column retention time on the y-axis, and the signal intensity on the z-axis. When two-dimensional gas chromatography methods are carefully designed, they can provide substantial increases in chromatographic separation in comparison with single-dimension gas chromatography techniques. The separation of chemical components by two mechanisms (e.g., by boiling point in the first dimension, and by polarity in the second dimension) expands the chromatographic space in which compounds can be separated from one another and thus increases the ability to resolve trace-level compounds that may otherwise be obscured.

[0024] Time-of-flight mass spectra can be acquired at very high rates with sensitivity approaching quadrupole selective ion monitoring (SIM), but have the advantage of being collected in full-scan mode. The full-scan mass spectra can be matched against library spectra to provide tentative identifications of unknown compounds in the absence of analytical standards. They also allow for the use of deconvolution software to further separate interfering or overlapping component peaks.

[0025] Samples from the first composition and second composition are evaluated using GCxGC-TOFMS (previously described) to create a datafile for each sample. In this regard, multiple samples from each composition are recommended to provide sufficient intra-source variability for the resulting dataset. Compositions that are organic liquids (e.g., synthetic pesticides) can be analyzed as received or diluted in an appropriate solvent prior to analysis. Compositions in other forms will often need to be processed to allow for introduction to a GCxGC-TOFMS instrument. For example, a tobacco product sample can be extracted in an organic solvent, and the extract is used for analysis.

[0026] The data collected from the GCxGC-TOFMS for the multiple samples is referred to herein as a dataset. Generally speaking, the dataset contains many peaks, and for each peak has the sample from which the peak was measured, the retention time on the first column, the retention time on the second column, and the signal intensity for each of up to 996 ion channels. The dataset may contain several hundred to several thousand peaks.

[0027] The information in the dataset can be used to tentatively identify a chemical compound for each peak, for example by comparing the information to a mass spectral reference library. In addition, the peaks in the dataset can be filtered to remove known artifacts, such as column siloxane bleed and injection solvent. This information can then be arranged in different ways. For example, one way is to create a list of all compounds identified across all samples and then, for each sample, tabulate whether a given compound is present or absent. These variables are referred to as "In/Out" variables.

[0028] Another approach can be used to account for the fact that a single chemical compound may sometimes exhibit multiple peaks, especially if present at a high concentration. In this regard, the first-dimension retention time (i.e. the retention time of the first column) is typically very long. The second-dimension retention time (i.e. the retention time of the second column) is typically very short, for example around three seconds. The first-dimension retention time is generally accurate to within six seconds. Strong peaks are typically represented across much of the second-dimension retention time. To accommodate this expected analytical variability, for a particular compound, the retention time pair corresponding to the largest peak can be located. A rectangle can then be drawn around this peak, and the sum of all peaks for the same compound found within six seconds of the base first-dimension retention time and within the second-dimension retention time are added together. In other words, all peaks within a rectangle 12 seconds wide by 3 seconds tall are summed together. In practice, the distribution of peaks within this rectangle often has a roughly oval shape, and the variables created using this summing approach can be referred to as "Oval Area" variables. This analysis also allows for a compound that may be present from multiple sources but at different levels. This also filters extra peaks due to peak tailing or column overload. Evaluation can be done by the difference in mean oval area for two groups divided by the pooled variance.

[0029] As a result, a dataset can be created that contains entries corresponding to the presence of chemical compounds in each sample (when e.g. In/Out variables are calculated) or that contains entries corresponding to the relative concentration of chemical compounds in each sample. The various steps that are taken to convert the GCxGC-TOFMS datafiles into this dataset are referred to herein as "processing".

[0030] Next, the dataset is classified using the random forest algorithm to create a classifier that distinguishes between the first composition and the second composition. The random forest algorithm, particularly the Balanced Random Forest algorithm, when applied to GCxGC-TOFMS, provides unique advantages in the ability to attribute a given sample of a known material to a specific source. Random Forest classification techniques are especially well suited for data sets with many variables and few observations because they do not require initial variable reduction and do not over-fit the data.

[0031] The random forest algorithm is described in Breiman, L, "Random Forests", Machine Learning, Vol. 45, No. 1 , pp. 5-32 (2001 ). Generally, many classification trees are used to classify observations into groups using a set of predictor variables. Each tree is created using a randomly selected subset of the data with the added restriction that only a subset of possible predictor variables can be used at each split in the tree. By using only some of the data and some of the predictor variables in each tree, the forest will consist of a large number of different trees. FIG. 2 illustrates an example of a classification tree. Here, data has been collected for samples from seven different sources which are labeled S1 through S7. For each source, a dataset has been created that indicates the presence or absence of six different compounds which are labeled C1 through C6. At each node, one of the compounds is used to split up the sources based on the presence/absence of the compound. The splits continue until all samples are classified. Here, in FIG. 2 for example, starting at the top, if compound C1 is present in the sample, then the sample came from source S1. If C1 and C2 are absent, then the sample came from source S2. This example of a classification tree shows one way to perfectly separate the data, though there may be others.

[0032] In general, a single classification tree will often fail to completely capture all of the available information concerning which compounds can distinguish between different sources. The random forest algorithm is an ensemble approach that uses multiple classification trees, with the ensemble "voting" for the final classification of a given sample, as well as indicating the relative importance of each compound to the overall algorithm. Each tree is built from a random sample of the data in the dataset. Generally, the random forest algorithm can be described as follows.

[0033] The total number of entries in the dataset is N. Each tree receives n entries randomly selected with replacement from the dataset. The number of variables in the dataset is M. A number m of input variables are used to determine the decision at a node. The number m should usually be much lower than M. At each node, randomly select the variables on which to base the decision at that node, and calculate the best split based on those variables. The tree is fully grown until the entries are fully separated. The quality of prediction of this tree can then be estimated by using the tree to predict the classification of the remaining entries in the dataset.

[0034] To classify a sample using the Random Forest, each tree in the forest classifies the sample independently and votes for the predicted classification. The Random Forest classification is the classification for which the most trees voted. If the sample being classified was in the data set used to create the tree, only trees that did not use that sample get to vote. This ensures a degree of cross-validation.

[0035] In particular embodiments, a balanced random forest algorithm is used. This is a variation on the random forest algorithm, where a stratified random sample is used for each tree instead of a simple random sample. In a stratified random sample, the entries in the dataset are divided into smaller groups known as strata based on shared attributes or characteristics. A random sample from each stratum is taken. In a balanced random forest (BRF), each source has its own stratum, and each tree sees a random sample of the same size from each stratum regardless of the relative sizes of the strata in the overall dataset. This can be beneficial in cases where one stratum may be more prevalent in the dataset than another, a situation often referred to as unbalanced classes. In some cases, especially with small sample sizes, unbalanced datasets can lead to classifiers that are biased towards the largest class. The balanced random forest algorithm can be employed to mitigate this effect. The balanced random forest ensures, in other words, that all of the possible different sources are equally represented in every tree of the forest. [0036] The results obtained from classifying the dataset using the random forest algorithm is referred to herein as a classifier. The classifier contains information that permits one to decide whether an unknown sample is closer to the first composition or the second composition. The classifier can also be described as providing rules that can be used to make such a decision. Such rules may be simple or complicated. For example, again referring to FIG. 2, the classifier may identify whether a given compound is present or absent for a possible source. Put another way, the methods can be used to create a classifier that distinguishes between the two compositions.

[0037] Following the creation of the classifier, the classifier is used to classify each sample as originating from the first composition or the second composition. The results can then be used to create a confusion matrix. A confusion matrix is an n*-n table (here, a 2x2 table) in which the row labels indicate the true source of the sample and the column labels indicate the predicted source of the sample based on the classifier. An example of a confusion matrix is illustrated below as Table 1 :

Table 1. First Example of Confusion Matrix

[0038] In Table 1 , 12 samples from each of Composition A and Composition B were analyzed, for a total of 24 samples. Of the 12 samples known to come from Composition A, the classifier correctly attributed five of those samples as coming from Composition A and incorrectly attributed seven of the samples as coming from Composition B. Similarly, of the 12 samples known to come from Composition B, the classifier incorrectly attributed eight of those samples as coming from Composition A and correctly attributed four of the samples as coming from Composition B.

[0039] Next, a p-value is determined based on a hypothesis test of independence between rows and columns of the confusion matrix. The p-value is then compared to a selected alpha value to determine the degree of difference between the first composition and the second composition. The p-value is the probability of obtaining a confusion matrix that is at least as extreme as the one that was actually observed, assuming that the null hypothesis of independence between the rows and columns is true. One "rejects the null hypothesis" when the p-value is less than the alpha value a, which is associated with the confidence level. The confidence level can be expressed as 1 -a, where a is the percent chance of rejecting the null hypothesis when the null hypothesis is true. The confidence level can be from 0% to 100%. The confidence level is commonly set at 95%.

[0040] In specific embodiments, the null hypothesis used here is that the rows and columns of the table are independent. For a confusion matrix, this means that the predicted compositions generated by the classifier have no relationship to the true composition.

[0041] In particular embodiments, the p-value is obtained using Fisher's exact test (right tail). The "right tail" specification indicates the "greater than" alternative hypothesis is being used, so that the null hypothesis is rejected only when the classifier generated from the Balanced Random Forest algorithm is correctly classifying the origin of the samples. In such embodiments, a high p-value indicates the first composition and the second composition are substantially equivalent. A low p-value indicates the first composition and the second composition are substantially different.

[0042] Methods for calculating the p-value for Fisher's exact test are well-known. For example, the p-value (right tail) for Table 1 is p=0.9502. Because the p-value is greater than a=0.05, this indicates the null hypothesis cannot be rejected because there is not enough evidence of a positive relationship between the rows and columns. In other words, there is not enough evidence to refute that the rows and columns are independent, or the classifier cannot distinguish the two compositions apart. This result would suggest that the organic contents of the two compositions are substantially equivalent.

[0043] Another example of a confusion matrix is illustrated below as Table 2:

Table 2. Second Example of Confusion Matrix

[0044] The p-value (right tail) for Table 2 is 4 x 10^"7. Because this result is less than a=0.05, the null hypothesis should be rejected, which indicates that the rows and columns are dependent. In other words, the classifier can distinguish the two compositions based on the GCxGC-TOFMS results. This suggests that the two compositions are substantially different at a confidence level of 95%.

[0045] As noted above, the p-value is compared to a selected alpha value to determine the degree of difference between the first composition and the second composition. There are essentially only two degrees of difference between the first and second compositions, i.e. substantially equivalent or substantially different. The confidence level reflects the degree of certainty of this conclusion, and is usually a value between 90% and 100%. The determination of the degree of difference is intended to capture the fact that different tests can be used to obtain the p-value, and that the meaning of a given p-value changes depending on the null hypothesis being tested. For example, Barnard's exact test or a Chi-squared test could be used instead of Fisher's exact test to obtain the p-value as well.

[0046] A related method that could be used to calculate the p-value is a Permutation test as outlined in Ojala and Garriga (Permutation tests for studying classifier performance. Journal of Machine Learning Research 2010, vol. 11 (June), pp. 1833- 1863). In such a test, the null hypothesis is that there is no class structure in the data. The null distribution is estimated by repeatedly permuting the class labels and running the classifier. As above, the procedure generates a p-value which can be used in the same manner as the p-value from Fisher's exact test.

[0047] These methods for determining the degree of difference between a first composition and a second composition have several applications. For example, when the compositions are foodstuffs, agricultural products, or chemicals, the methods can be used to distinguish the two compositions, for example by their purity or their source. This could be useful, for example, in distinguishing coffee beans based on where they were grown (i.e. their source) or how they were grown (e.g. with pesticides or organically). For drug formulations, the methods could distinguish between those containing pure drug and those that have been diluted with fillers or contaminants, and could identify potentially toxic hazards.

[0048] FIG. 3 is a flowchart illustrating the methods of the present disclosure. In step 1310, two-dimensional gas chromatography coupled with time-of-flight mass spectrometry is used on multiple samples to create a datafile for each sample. In step 1320, the datafiles are processed to obtain a dataset. The dataset contains entries corresponding to the presence and/or relative concentration of chemical compounds in each of the samples. Next, in step 1330 the dataset is analyzed using a random forest algorithm to create a classifier that distinguishes between the two compositions. In step 1340, each sample is classified using the classifier as originating from the first composition or the second composition, and a confusion matrix is created. In step 1350, a p-value is determined from the confusion matrix. In step 1360, the p-value is compared to a selected alpha value to determine the degree of difference between the first composition and the second composition.

[0049] The methods of the present disclosure may be implemented on one or more general purpose computers, special purpose computers), a programmed microprocessor or microcontroller and peripheral integrated circuit elements, an ASIC or other integrated circuit, a digital signal processor, a hardwired electronic or logic circuit such as a discrete element circuit, a programmable logic device such as a PLD, PLA, FPGA, Graphical card CPU (GPU), or PAL, or the like. In general, any device, capable of implementing a finite state machine that is in turn capable of implementing the methods described herein, can be used. The methods of the present disclosure are generally implemented by a computer system having a processor, by execution of software processing instructions which are stored in memory. The computer system may include a computer server, workstation, personal computer, combination thereof, or any other computing device. The computer system may further include hardware, software, and/or any suitable combination thereof, configured to interact with an associated user, a networked device, networked storage, remote devices, or the like. The processor may also control the overall operations of the computer system and other components, such as the GCxGC-TOFMS apparatus of FIG. 1.

[0050] The computer system may also include one or more interface devices for communicating with external devices or to receive external input, such as a computer monitor, a keyboard or touch or writable screen, a mouse, trackball, or the like, for communicating user input information and command selections to the processor. The various components of the computer system may be all connected by a data/control bus.

[0051] The memory used in the computer system may represent any type of non- transitory computer readable medium such as random access memory (RAM), read only memory (ROM), magnetic disk or tape, optical disk, flash memory, or holographic memory. In some embodiments, the memory is a combination of random access memory and read only memory. The processor and memory can be combined in a single chip. Other mass storage device(s), for example, magnetic storage drives, a hard disk drive, optical storage devices, flash memory devices, or a suitable combination thereof, can also be used to provide the memory. The memory is also used to store the data processed in the method as well as the instructions for performing the exemplary method.

[0052] The digital processor can be, for example, a single core processor, a dual core processor (or more generally a multiple core processor), a digital processor and cooperating math coprocessor, a digital controller, or the like. The digital processor executes instructions stored in memory 108 for performing the methods outlined above. [0053] The term "software," as used herein, is intended to encompass any collection or set of instructions executable by a computer or other digital system so as to configure the computer or other digital system to perform the task that is the intent of the software. The term "software" as used herein is intended to encompass such instructions stored in a storage medium such as RAM, a hard disk, optical disk, or so forth, and is also intended to encompass so-called "firmware" that is software stored on a ROM or so forth. Such software may be organized in various ways, and may include software components organized as libraries, Internet-based programs stored on a remote server or so forth, source code, interpretive code, object code, directly executable code, and so forth. It is contemplated that the software may invoke system-level code or calls to other software residing on a server or other location to perform certain functions.

[0054] The methods illustrated in may be implemented in a computer program product that may be executed on a computer. The computer program product may comprise a non-transitory computer-readable recording medium on which a control program is recorded (stored), such as a disk, hard drive, or the like. Common forms of non-transitory computer-readable media include, for example, floppy disks, flexible disks, hard disks, magnetic tape, or any other magnetic storage medium, CD-ROM, DVD, or any other optical medium, a RAM, a PROM, an EPROM, a FLASH-EPROM, or other memory chip or cartridge, or any other tangible medium from which a computer can read and use.

[0055] Alternatively, the methods may be implemented in transitory media, such as a transmittable carrier wave in which the control program is embodied as a data signal using transmission media, such as acoustic or light waves, such as those generated during radio wave and infrared data communications, and the like.

[0056] The following example is for purposes of further illustrating the present disclosure. The example is merely illustrative and is not intended to limit the methods of the present disclosure to the materials, conditions, or process parameters set forth therein. EXAMPLE

[0057] Multiple replicates of cigarettes from four different brands were tested by extracting the tobacco from the cigarettes with methylene chloride, then analyzing the extracts using GCxGC-TOFMS. The data were processed using a Random Forest algorithm with the intent to attribute given samples to specific cigarette brands according to the presence and relative concentrations of certain organic compounds. The potential for sample attribution (also described as "fingerprinting") using data from GCxGC-TOFMS analysis could be an important technique used to assess Substantial Equivalence of tobacco products. In addition, the data were used to provide tentative identities for specific compounds that may be important in distinguishing between two given brands of tobacco products.

[0058] Cigarette Selection:

[0059] The study was designed to test four cigarette brands, representing one pair of brands expected to be similar to one another and a second pair of brands expected to be similar to one another but different from either brand in the first pair. The four brands, listed in Table 3, include Marlboro and Newport, two of the most popular brands in the world. The source numbers were arbitrary and provided for data recording in the present study.

Table 3. Cigarette brands selected for investigation.

[0060] It was expected that the Marlboro and Marlboro Gold Pack brands would be similar to one another. The manufacturer, Philip Morris USA, lists on its website ingredients present in specific cigarette brands at levels of at least 0.1 % by weight, and the lists for Marlboro and Marlboro Gold Pack are identical. Philip Morris USA also lists on its website more than 100 ingredients that may be found in any of its cigarette brands at levels no more than 0.1 % by weight, but these ingredients are not correlated with specific brands. Thus, it was hypothetically possible that Marlboro and Marlboro Gold Pack might differ in the presence or relative concentration of any of these ingredients, but the major components are reported to be the same by the manufacturer.

[0061] It was expected that the Newport and Newport Non-Menthol brands would differ in the presence and absence, respectively, of menthol, but that they might be similar to one another in many other ways. No composition information for these cigarettes was found on the website of the manufacturer, Lorillard, Inc. However, given that Newport and Newport Non-Menthol are the only two brands using the Newport name, it was a reasonable inference that the most significant differences between these brands might be in the flavorings rather than in the tobacco, which include many natural components that can be detected by GC*GC-TOFMS.

[0062] It was expected that the Marlboro and Marlboro Gold Pack brands would each differ substantially from the Newport and Newport Non-Menthol brands because they are from different manufacturers. Possible reasons for such company-specific differences include the selection of types and geographic origins of tobacco, the manufacturing process, and the use of proprietary flavorings and additives.

[0063] In a study intended to assess differences among test samples from different groups, it is important to take measures to ensure that differences within samples from a single group are not artificially minimized due to the experimental design. Thus, for each cigarette brand, multiple packs (boxes) of cigarettes were purchased from three different geographic locations. The lot-specific Batch designations are shown in Table 4.

Table 4. Sources of cigarettes tested.

[0064] Sample Preparation:

[0065] Within 5 days after being purchased, all cigarette packs (still in their original plastic seals) were placed at ambient laboratory temperature (21 ± 3 °C) prior to sample extraction. The moisture content of the cigarettes was not necessarily identical across all packs, but this was not relevant for the present study in which non-combusted tobacco is extracted using an organic solvent, the moisture content of the cigarettes prior to extraction is irrelevant. All batches were stored at ambient laboratory temperature for 2-6 days before sample extraction.

[0066] To extract tobacco from the cigarettes in a given Batch, two packs from the Batch were opened, and 12 to 13 cigarettes were taken from each pack. A disposable scalpel was used to slice the side of each cigarette, and the tobacco plug was removed, leaving behind the paper and filter. The tobacco plugs - numbering 24 or 26 for the Batch - were combined, and the tobacco was homogenized while chopping or grinding using one of the three methods listed in Table 5. To account for variability associated with sample preparation, one Batch from each of the four sources was processed using each method. The use of three different methods was a deliberate precaution against the possibility of method-related cross-contamination occurring among different Batches analyzed on the same day. In practice, there was no evidence of significant cross- contamination, as described in the discussion of the results.

Table 5. Methods for homogenizing tobacco

Batches Method

[0067] The homogenized tobacco from a single Batch of cigarettes was divided into four aliquots, each with a mass of 2.3 grams to 2.6 grams, large enough to fill an 11 mL stainless steel cell used for Accelerated Solvent Extraction (ASE). The ASE cells were then loaded onto an ASE instrument (ASE 200 Accelerated Solvent Extractor, Dionex Corporation, Sunnyvale, CA) for extraction of the tobacco samples, using the conditions listed in Table 6. The first day of extractions included all samples from Batches A, J, R, and X as well as 4 solvent blanks (empty 1 1 mL cells that were processed identically to the tobacco-containing cells). The second day of extractions included all samples from Batches B, K, S, and Y and 4 solvent blanks. The third day of extractions included all samples from Batches C, L, T, and Z and 4 solvent blanks. The sample preparation process thus yielded a total of 60 samples. The use of 12 samples per Source was important to provide a data subset that can represent the variability associated with repeatable processes such as sample extraction and instrument analysis. Each extract was spiked with a concentrated solution of six isotopically-labeled internal standard (I.S.) compounds (acenaphthene-d10, chrysene-d12, 1 ,4-dichlorobenzene-d4, naphthalene-d8, perylenes-d12, and phenanthrene-d10), yielding a concentration of 200 ng/mL for each I.S. The 60 samples were then analyzed by GCxGC-TOFMS.

Table 6. Conditions used for Accelerated Solvent Extraction (ASE).

[0068] Methylene chloride was selected as the solvent after testing of methylene chloride, ethyl acetate, acetone, tetrahydrofuran, and various combinations thereof to determine their effect on the peaks of the resulting extracts. The differences among the solvents in numbers of peaks were judged to be relatively minor in that each gave a data set large enough to be useful for the purposes of sample attribution. Therefore, methlyene chloride was selected as the solvent to use with test samples for two reasons. First, either acetone or ethyl acetate could potentially react with components of tobacco products through aldol-type reactions, which could be especially likely under the alkaline conditions expected from the extraction of tobacco. Although such reactions could hypothetical^ yield a larger number of compounds that could be useful in attributing a given sample to a specific Source, it was expected that reaction kinetics would be poorly reproducible during the ASE extraction process, in which a sample is treated with a high-temperature solvent for only 5-10 minutes at a time. Thus, reactive processes during sample preparation present a technical risk of generating spurious results and should be avoided to the extent possible. Second, methylene chloride was used as an extraction solvent in previous studies on the organic contents of tobacco products.

[0069] GCxGC-TOFMS Analysis:

[0070] Sample analyses were performed using a Leco Pegasus 4D GCxGC-TOFMS (Leco, St Joseph, Ml). Instrument specifications and method conditions are provided in Table 7. The 4 extracts from a given Batch were separated into different groups that were analyzed on different days, and randomization was applied to determine the sample injection sequence on each day.

Table 7.

[0071] Data acquisition, chromatographic peak-finding, and mass spectral deconvolution were performed using Leco ChromaTOF v3.21 software. Tentative component identifications were performed by automated matching of deconvoluted component mass spectra with the National Institute of Standards and Technology (NIST) 05 Mass Spectral Library. Peak tables (i.e. datafiles) were generated showing the retention time, peak height, name of the compound providing the best match in the mass spectral library, molecular formula and CAS# (Chemical Abstract Services registry number) of the best-match compound, and match quality. For a tobacco extract, a peak table typically included several thousand peaks.

[0072] Data Preprocessing for Statistical Analysis:

[0073] Prior to application of the statistical pattern recognition techniques, GC*GC- TOFMS sample datafiles were filtered for known analysis system artifacts such as column siloxane bleed and injection solvent. After filtering, sample results included several hundred to several thousand component peaks. Due to the large amount of data acquired, manual inspection of all spectra and identity verification of all retained components was not feasible; however, the compound names and CAS# data were useful tags applied by the software to components, regardless of whether the components were the indicated compounds. Thus, the term "compound," when used throughout the discussion of the GCxGC-TOFMS results, refers to the identity assigned by the software, rather than the actual identity of the given peak.

[0074] The Oval Area method proved to be useful in controlling data artifacts and gave stronger results than using the maximum peak value or other sums with various thresholds. The Oval Area variables, defined for each specific CAS# as the ratio of the Oval Area divided by the sum of all peak areas assigned to the I.S. acenaphthene-d10 from the same injection, were used as input to the random forest algorithm.

[0075] Statistical Analysis:

[0076] A random forest algorithm was used to create a classifier, and the 60 samples were then classified as coming from Source 1 , Source 2, Source 3, Source 4, or as being a solvent blank.

[0077] The occurrences of correct and incorrect classifications for the 60 samples, representing 48 samples from tobacco extracts and 12 solvent blanks, are shown in a confusion matrix (Table 8). The confusion matrix is used to illustrate the results of sample classification, with row labels used to indicate the true sample identity and column labels used to indicate the predicted sample classification. If a sample is classified correctly, the result for that sample will be tallied in the cell for row and column x (i.e., on the matrix diagonal). If a sample is classified incorrectly, the result for that sample will be tallied in another cell (i.e., off the matrix diagonal). The classification rate is calculated simply as the sum of all tallies on the matrix diagonal divided by the sum of all tallies in the matrix; in this case, the classification rate is 75% (45/60).

Table 8. Confusion matrix

Predicted Sample Classification

[0078] The results in Table 8 indicate that all samples of Sources 3 and 4 - Newport and Newport Non-Menthol, respectively - were classified correctly. Thus, the organic contents of each of these brands can be used to distinguish that brand from the other brand from the same manufacturer, as well as either brand from the other manufacturer. All misclassifications occurred between Sources 1 and 2 - Marlboro and Marlboro Gold Pack, respectively. For these two Sources, the misclassifications actually outnumbered the correct classifications. Thus, the GCxGC-TOFMS data from Marlboro and Marlboro Gold Pack appeared to be equivalent to one another, suggesting that these two brands may be equivalent in their organic contents. [0079] For a Substantial Equivalence report, the manufacturer must compare the new product to a predicate product. That is, a 2x2 table is required, as opposed to the 5x5 table illustrated in Table 8. In light of this fact, the balanced random forest algorithm was applied to the data from Sources 1 and 2 only (see Table 9), and the observed classification rate was only 38% (9/24). These results suggested that the frequent misclassifications between Sources 1 and 2 reported in Table 9 were not artifacts of including the data from these two Sources as part of a larger data set.

Table 9.

[0080] The results from Table 9 may indicate that the organic contents of Sources 1 and 2 are equivalent to one another. The extent of similarity between the two Sources was quantified using Fisher's exact test on the confusion matrix. Fisher's exact test is used to test the null hypothesis that the rows and columns of the table are independent. For a confusion matrix, that would mean that the predictions have no relationship to the truth. In this application, the "greater than" alternative hypothesis will be used so that the null hypothesis is rejected only when the BRF method is classifying the samples correctly (i.e. right tail). A large p-value for Fisher's exact test would indicate that the null hypothesis cannot be rejected with great confidence because there is no strong positive relationship between the predictions and the truth. This result would suggest that the organic contents of the two Sources are Substantially Equivalent. A small p-value for Fisher's exact test would indicate that the null hypothesis can be rejected with a high degree of confidence because there is a strong positive relationship between the predictions and the truth. This latter result would suggest that the organic contents of the two Sources are "Substantially Different."

[0081] The evaluation of Fisher's exact test for the results in Table 9 yielded a p-value of 0.9502, suggesting that the organic contents of Sources 1 and 2 are Substantially Equivalent. By contrast, for a confusion matrix including data from Sources 3 and 4 only (Table 10), evaluating Fisher's exact test yielded a p-value of 4x10^-7, suggesting that the organic contents of these two Sources were Substantially Different. When applying the Permutation test instead, the p-values were calculated as 0.625 and less than 0.01 for Tables 9 and 10, respectively. Thus, the conclusions are identical for these datasets whether using Fisher's exact test or the Permutation test.

Table 10.

[0082] The confidence level that would be used to distinguish Substantially Different results from Substantially Equivalent results was assigned separately from the statistical evaluations. Such an approach can give useful results on the equivalence between tobacco products within the limits of those compounds that can be detected by GC*GC- TOFMS analysis and the power of sample attribution using balanced random forest (BRF). [0083] The sample attribution results using samples from all four tobacco Sources were used to address the topic of potential cross-contamination during homogenization of the tobacco. When all data from the first day of sample preparation (Batches A, J, R, and X) were excluded, the classification rate was 78% (31/40), and misclassifications were again seen only between Sources 1 and 2 (see Table 11 ). The similarity with the results shown in Table 8 suggested that there was no significant cross-contamination from the homogenization method used on the first day of sample preparation.

Table 11.

[0084] The Simple Importance Measure:

[0085] An approach designated as "Simple Importance" was employed to determine the variables that could potentially be important in the separation of two given Sources. The Simple Importance measure, which is independent of the BRF method, was not used for sample attribution. Instead, Simple Importance was employed to provide supplemental information about the tobacco products that may be interesting.

[0086] The first step to calculate Simple Importance was to determine the mean of all reported Oval Area variables for a particular CAS# in a given Source. In general, the further apart the means were from two different Sources, the more influential that variable was. The extent of the difference must be tempered, however, with the noise or variability present in the data. Thus, Simple Importance is the ratio of the squared difference between the means for the two Sources to the noise present, as expressed in Equation (1 ). The means are squared because the denominator is the variance rather than the standard deviation.

where, for the yth of m CAS#'s, is Simple Importance; μ_;- and μγ' are the means of the Oval Area variables for Sources a and b, respectively; and of is the pooled variance of the Oval Area variable.

[0087] Compounds Possibly Associated with Source Distinctions:

[0088] It was of interest— but not essential to the study design— to investigate some compounds having high values for Simple Importance, which quantifies the extent of difference in observed responses for a single compound between two given Sources. This pursuit is independent of sample attribution, which was achieved through separate means, as described in the previous section. For each Source pair not including a Solvent Blank, the ten CAS#'s with the greatest Simple Importance values were reported below in Tables D1-D6. The GCxGC-TOFMS data for selected results from these lists were then investigated.

Table D1. Ten CAS#'s with highest Simple Importance measures for

Source 1 vs. Source 3.

^Λ no CAS# assigned in NIST 05 Mass Spectral Library

Table D2. Ten CAS#'s with highest Simple Importance measures for

Source 1 vs. Source 4.

Table D3. Ten CAS#'s with highest Simple Importance measures for

Source 2 vs. Source 3.

Table D4. Ten CAS#'s with highest Simple Importance measures for

Source 2 vs. Source 4.

Table D5. Ten CAS#'s with highest Simple Importance measures for

Source 3 vs. Source 4.

no CAS# assigned in NIST 05 Mass Spectral Library

Table D6. Ten CAS#'s with highest Simple Importance measures for

Source 1 vs. Source 2.

[0089] Sources 1 and 2 vs. Sources 3 and 4

[0090] From the 40 collective entries in Tables D1-D4, there were three CAS#'s that appear on each table. These three CAS#'s may correspond to compounds that distinguish the Marlboro and Marlboro Gold Pack brands from the Newport and Newport Non-Menthol brands. Assessments of the GCxGC-TOFMS data for these CAS#'s are provided in Table 12. These results suggest that caffeine may be a distinctive component of the Marlboro and Marlboro Gold Pack brands, while ethyl vanillin may be a distinctive component of the Newport and Newport Non-Menthol brands. Neither compound is likely to be found in tobacco plants. Thus, they may be present as or from tobacco product additives used by the respective manufacturers. For example, the caffeine tentatively identified in the Marlboro and Marlboro Gold Pack brands could result from the addition of coffee extract to these tobacco products.

Table 12. Sources 1 and 2 vs. Sources 3 and 4.

[0091] Source 3 vs. Source 4

[0092] The three CAS#'s having the highest Simple Importance values in Table D5 may correspond to compounds that distinguish the Newport brand from the Newport Non-Menthol brand. Assessments of the GCxGC-TOFMS data for these CAS#'s are provided in Table 13. These results suggest that the three indicated alkyl esters may be additives found in the Newport brand but not in the Newport Non-Menthol brand. It is surprising that menthol (CAS# 89-78-1 ) is not among the compounds listed in Table D5, but the related ketone 5-methyl-2-(1-methylethyl)-cyclohexanone (CAS# 10458-14-7) is included. Table 13. Source 3 vs. Source 4.

[0093] Source 1 vs. Source 2:

[0094] The organic contents of the Marlboro brand and the Marlboro Gold Pack brand were assigned as Substantially Equivalent according to the sample attribution results. Table D6 was necessarily populated with CAS#'s and Simple Importance values for comparing these two brands, but they should not provide meaningful information about differences in compositions if the Sources are truly not different. It is noted also that the Simple Importance values in Table D6 are lower than those reported in Tables D1-D5. The first three CAS#'s listed in Table D6 are not believed to represent actual compounds present in the samples according to evaluations of the GCxGC-TOFMS data, as summarized in Table 14. The assignment of CAS# 1235-74-1 to multiple peaks with different retention times suggests that the relatively high value of Simple Importance for this CAS# reported in Table D6 may be a spurious result. In summary, there was no indication that any compound-specific results contradicted the assessment already made about the Substantial Equivalence of the organic contents of Sources 1 and 2. Table 14. Source 1 vs. Source 2.

[0095] To summarize, the results from applying the balanced random forest (BRF) algorithm to the GCxGC-TOFMS data for the purpose of sample attribution showed a classification rate of 75% across all 60 samples, representing 5 Sources and 12 replicates per Source. The misclassifications, however, were seen only between the Marlboro and Marlboro Gold Pack brands. No samples from these two brands were predicted to belong to any other Source, and all predictions for samples from other Sources were made with 100% accuracy. These results were consistent with the a priori expectations that the Newport and Newport Non-Menthol brands are not chemically equivalent to one another, nor to either the Marlboro brand or the Marlboro Gold Pack brand.

[0096] The present disclosure has been described with reference to exemplary embodiments. Obviously, modifications and alterations will occur to others upon reading and understanding the preceding detailed description. It is intended that the present disclosure be construed as including all such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

CLAIMS:

1. A method for determining the degree of difference between a first composition and a second composition, comprising:

evaluating a plurality of samples from the first composition and the second composition using two-dimensional gas chromatography coupled with time-of-flight mass spectrometry to create a datafile for each sample;

processing each datafile to obtain a dataset, the dataset containing entries corresponding to the presence or relative concentration of chemical compounds in each sample;

analyzing the dataset using a random forest algorithm to create a classifier that distinguishes between the first composition and the second composition;

classifying each sample as originating from the first composition or the second composition using the classifier to create a confusion matrix;

determining a p-value based on the confusion matrix; and

comparing the p-value to a selected alpha value to determine the degree of difference between the first composition and the second composition.

2. The method of claim 1 , wherein the p-value is obtained by Fisher's exact test (right tail) and the null hypothesis is that the predicted composition made by the classifier has no relationship to the true composition.

3. The method of claim 2, wherein the confusion matrix is a 2x2 table with known composition identity defining each row and predicted composition identity defining each column.

4. The method of claim 2, wherein a high p-value relative to the selected alpha value indicates the first composition and the second composition are substantially equivalent.

5. The method of claim 2, wherein a low p-value relative to the selected alpha value indicates the first composition and the second composition are substantially different.

6. The method of claim 2, wherein a confidence level calculated as 1 minus the selected alpha value is a value between 90% and 100%.

7. The method of claim 1 , wherein the processing occurs by summing the response of all peaks for a given chemical compound within an oval area defined by a first-dimension retention time and a second-dimension retention time.

8. The method of claim 1 , wherein the processing occurs by identifying whether a peak for a given chemical compound is present or absent in the dataset.

9. The method of claim 1 , wherein the first composition and the second composition are foodstuffs, agricultural products, or chemicals.

10. The method of claim 1 , wherein the first composition and the second composition differ in their purity or source.

11. The method of claim 1 , wherein the first composition and the second composition are coffee beans, olive oil, or drug formulations.

12. The method of claim 1 , wherein each sample contains an organic solvent.

13. The method of claim 1 , wherein the two-dimensional gas chromatography is performed using a first non-polar column and a second polar column.

14. The method of claim 13, wherein a diameter of the first column is greater than a diameter of the second column.

15. The method of claim 13, wherein a length of the first column is greater than a length of the second column.

16. The method of claim 13, wherein one or more modulators is present between the first column and the second column.

17. The method of claim 13, wherein a retention time of the first column is accurate to within 6 seconds.

18. The method of claim 13, wherein a retention time range of the second column is about 3 seconds.

19. The method of claim 1 , wherein the p-value is obtained by Barnard's exact test or a Chi-squared test or a Permutation test, and the null hypothesis is that the predicted composition made by the classifier has no relationship to the true composition.