WO2004042635A1

WO2004042635A1 - Method and system for predicting constituent yields in tobacco smoke using a multivariate regression model

Info

Publication number: WO2004042635A1
Application number: PCT/GB2003/004776
Authority: WO
Inventors: Paul David Case; Nigel David Warren
Original assignee: British American Tobacco (Investments) Limited
Priority date: 2002-11-08
Filing date: 2003-11-04
Publication date: 2004-05-21
Also published as: GB0226062D0; AU2003282218A1; EP1559054A1

Abstract

The concentrations or yields of a first set of components in a particular tobacco smoke, such as the Hoffmann analytes, are predicted on the basis of a statistical model. This model is derived from a multivariate regression analysis that relates the concentrations of the first set of components across a range of tobacco smokes to the yields of a second set of components. Typically the second set of components includes gases and other substances such as carbon monoxide, whose concentration can be determined relatively easily. Thus the (unknown) concentrations of Hoffmann analytes in a particular tobacco smoke can be predicted by first measuring the yields of the second set of components in the particular tobacco smoke to be investigated, and then using the multivariate regression model to predict the concentrations of the first set of components from the measured concentrations of the second set of components.

Description

METHOD AND SYSTEM FOR PREDICTING CONSTITUENT YIELDS IN TOBACCO SMOKE USING A MULTIVARIATE REGRESSION MODEL

Field of the Invention

The present invention relates to a method and system for ascertaining the concentrations of compounds in tobacco smoke.

Background of the Invention

Approximately 4800 individual constituents have been reported in mainstream cigarette smoke (Ref: 1). A small percentage of these have been categorised as being of interest because they have been reported as having some biological activity (Ref: 2). These constituents are often referred to as the Hoffmann analytes (Ref: 3), and are typically present in both mainstream and sidestream cigarette smoke (MS and SS).

A knowledge of the concentrations or yields of the various components in cigarette smoke, especially the Hoffmann analytes, is desirable both for the purposes of product development and enhancement, and also to help address the regulatory requirements of different countries. These yields are dependent not only upon the particular cigarette in question, but also upon how it is smoked.

This variability is in part a reflection of the fact that there is a range of production mechanisms for the various smoke components. For example, components such as cadmium may be carried into tobacco directly from the soil, and will therefore be affected by the plant environment and other growing conditions. Certain compounds, such as nicotine, occur naturally in tobacco and may be distilled into the cigarette smoke. Many other compounds are created by combustion and pyrolysis processes in the burning zone of a lit cigarette, and will be drawn into the mainstream smoke during puffing, or carried into the air around the cigarette through convection to appear as sidestream smoke. MS particulate matter is collected during smoking onto a filter pad. When dried and with nicotine removed, this is known as NFDPM - nicotine free dry particulate matter, or more generally 'tar'.

It is known that the overall yield of smoke components is affected by the puff volume, the puff duration, and puffing interval. A more intense smoking regime tends to increase the yield of all smoke components (Ref: 4). The ISO standard machine smoking regime (Ref: 5) used in smoke studies specifies a volume and duration of 35 millilitres and 2 seconds respectively, taken once every 60 seconds (expressed as a 35/2/60 regime). Certain government regulators, however, believe that it is more realistic to smoke cigarettes according to a more intense regime, such as a volume of 55 millilitres, a duration of 2 seconds and an interval of 30 seconds (i.e. a 55/2/30 regime). This in turn impacts the resulting smoke components. Accordingly, a wide range of yields is possible, dependent upon the actual smoking regime used. Moreover, some regimes demand that the ventilation holes in the cigarette, used to dilute the smoke available to the smoker, have to be occluded. This inactivates one of the major design features used to produce low yield cigarettes, and can therefore substantially increase smoke component yields.

The laboratory measurement of smoke constituent yields is an expensive and time-consuming task. This is particularly so given that some components are present only in low concentration, and/or require sophisticated analysis equipment to be used in order to obtain reliable yield figures. This problem is significantly exacerbated by having to make measurements for a number of smoking regimes, and by natural variability across the range of available cigarettes.

One known technique to address this situation is through the use of statistical models to predict the effects of changing the smoking regime. In this approach, the yield of a given smoke constituent is modelled using a multivariate regression analysis as being dependent on the puff interval, volume, and duration. For example, in one particular investigation (Ref: 6), it was found that the % change in carbon monoxide yield, relative to the ISO standard 35/2/60 regime, can be represented as:

$$\%\,\text{change} = 91.72442 + 6.137745\,V - 72.14508\,D - 4.586835\,I - 0.029889\,V^2 + 0.014039\,I^2 - 0.02054\,I V + 1.1940498\,I D$$

where V is puff volume, D is duration, and I is interval. According to this model, a move to the 55/2/30 smoking regime (for example) will lead to a 107% increase in carbon monoxide yield.
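As an illustrative check only, the carbon monoxide formula can be evaluated numerically. The sketch below assumes that the less legible terms read as 4.586835 I, 0.014039 I² and 1.1940498 I D; the function name `co_percent_change` is hypothetical and is not taken from Ref. 6. Under that reading the formula returns roughly 107 for the 55/2/30 regime, in line with the figure quoted in the text, and a value of about 1 for the 35/2/60 baseline.

```python
def co_percent_change(volume_ml, duration_s, interval_s):
    """Percentage change in CO yield relative to the ISO 35/2/60 regime.

    Coefficients are copied from the formula in the text; the reading of the
    I, I^2 and I*D terms is an assumption made for this sketch.
    """
    v, d, i = volume_ml, duration_s, interval_s
    return (
        91.72442
        + 6.137745 * v
        - 72.14508 * d
        - 4.586835 * i
        - 0.029889 * v ** 2
        + 0.014039 * i ** 2
        - 0.02054 * i * v
        + 1.1940498 * i * d
    )


# Intense regime (55 ml puff, 2 s duration, 30 s interval).
intense = co_percent_change(55, 2, 30)
# ISO baseline (35 ml puff, 2 s duration, 60 s interval).
baseline = co_percent_change(35, 2, 60)
print(round(intense, 1), round(baseline, 1))
```

The small non-zero baseline value is simply what the fitted polynomial gives at 35/2/60; it is not forced through zero.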

Statistical models have also been developed to allow for additional factors concerning the manner in which a cigarette is smoked. Thus the yield of a particular smoke constituent during static burn (i.e. when the cigarette is not actively being smoked) has been related to the yields at various other smoking regimes, such as the 35/2/60 regime mentioned above (Ref: 7). In a similar manner, the impact on smoke yields of blocking cigarette filter vents has also been statistically modelled (Ref: 8).

Other examples of the use of multivariate regression analysis outside the tobacco industry are known from documents US 5614718, WO 97/14953, and WO 02/10742. In US 5614718, the gas concentrations in the headspace of a bottle containing a carbonated beverage are statistically modelled, based on the infrared absorption spectrum of the headspace. In WO 97/14953, a regression model is used to predict the physical properties, such as density and viscosity, of a hydrocarbon residue (e.g. bitumen), based on its infrared spectrum. In both of these documents, the modelling is in effect being used to relate one physical property of a substance (its concentration, density, etc.) to another physical property of the substance (its infrared absorption/transmission). In contrast, WO 02/10742 tries to predict the pharmacokinetic properties of a substance, such as metabolism and elimination, based on a statistical (rather than chemical) analysis of the properties of a range of substances having a related structure.

However, none of these documents is directly relevant to the problem outlined above, namely the need in the tobacco industry to be able to ascertain reliably and comprehensively the concentrations of the many components in tobacco smoke.

Summary of the Invention

Accordingly, the invention provides a method for determining the yields of constituents in tobacco smoke according to the appended claims. In one embodiment, the method comprises the steps of: providing a multivariate regression model relating the yields of a first set of constituents across a range of tobacco smokes to the yields of a second set of constituents in the range of tobacco smokes; measuring the yields of the second set of constituents in the particular tobacco smoke to be investigated; and using the multivariate regression model to predict the yields of the first set of constituents from the measured yields of the second set of constituents. Note that this approach, based on statistical modelling, obviates or at least reduces the need for actual laboratory measurements, thereby saving time and money.

In a preferred embodiment, the first set of constituents comprises the Hoffmann analytes. In one implementation, the model is specifically developed for a subset of these analytes of particular interest, namely pyridine, acrolein, ammonia, NNK (nornicotine ketone), acetone and cadmium. Nevertheless, it will be appreciated that the model can be readily extended to incorporate any other smoke constituent(s) of interest.

The second set of constituents preferably comprises tar, nicotine, and carbon monoxide (in one particular implementation, these are the only constituents in the second set). These constituents have the advantage that their yields can be accessed relatively easily, in part because they are the most frequently measured, and so access to the relevant equipment and information is most straightforward. However, other embodiments may use a different set of predictor compounds, as appropriate.

In the preferred embodiment, the regression model used involves both interaction terms between the different constituents in the second set of constituents, and also polynomial terms. It is found that such a model is generally able to represent most of the variance in the observations, and accordingly provides good predictive powers. In order to increase further the accuracy and reliability of the predictions, further information can be utilised, beyond the concentrations of the second set of constituents. For example, account can be taken of blending information and/or smoking regime. There are a variety of mechanisms whereby this can be achieved. One possibility with regard to blending information is that a separate regression model could be developed for each blend. Alternatively, if the blend information is suitably parameterised, then this can be incorporated as an additional set of predictor variables into the main regression model.

One way of parameterising blend information is to use the percentage contents of the different blend components. However, in general more accuracy can be obtained by using chemical and/or physical parameters that are sensitive even to variations within a particular blend. Thus potential chemical parameters to be used include total nitrogen content, total reducing sugars content, and chlorogenic acid content, while physical parameters that could be used include percentage of lamina (leafy) content and percentage of expanded tobacco content (reflecting the fill characteristics of the tobacco). It will be appreciated that one or more of these physical and chemical parameters may be used in combination with each other, or with any other suitable parameters, in order to obtain the best fitting model for the particular circumstances.

With regard to the smoking regime, there are a variety of ways in which this can be accommodated. For example, there may be separate regression models for each smoking regime, or a single overall model may be developed that incorporates any smoking regime. A further possibility is that a first model predicts the concentrations of the first set of constituents for some set smoking regime, and these can then be transformed if necessary to a different smoking regime using a second regression model. For example, the second set of constituents (CO/tar/nicotine) may be measured at a first smoking regime, and their values transformed to a second smoking regime using the methodology of Ref. 6 (for example). This will then allow the yields of the first set of constituents to be predicted for the second smoking regime using the regression model described above.

Although the regression model has been developed in the context of tobacco smoke from cigarettes, it can be applied to various other tobacco smokes as appropriate, such as from cigars, pipe tobacco, and so on, as well as from cigarettes and other products based on tobacco substitutes (whether natural or otherwise). Note that in such cases it will typically be most appropriate to derive new regression parameters, based on the overall statistical structure(s) described herein.

One embodiment of the invention further provides a computer program for implementing the methods described above. Such a computer program is typically provided on a medium, whether in the form of a physical storage device (such as a CD-ROM or DVD), or as a signal transmission, for example over the Internet.

Another embodiment of the invention provides a system for determining the yields of a first set of constituents in a particular tobacco smoke to be investigated. The system includes a multivariate regression model relating the yields of the first set of constituents across a range of tobacco smokes to yields of a second set of constituents in said range of tobacco smokes. An input is provided for receiving from a system user measured yields of the second set of constituents in the particular tobacco smoke to be investigated. The system further includes at least one processor for using the multivariate regression model to predict the yields of at least one of the first set of constituents from the measured yields of the second set of constituents, and these predicted yields can then be provided to a system user via a suitable output facility.

It will be appreciated that the system and computer program embodiments will generally benefit from the same preferred features as described above in relation to the method embodiments.

Brief Description of the Drawings

Various preferred embodiments in accordance with the present invention will now be described in detail by way of example only with reference to the following drawings:

Figure 1 is a graph illustrating the relationship between pyridine yield and carbon monoxide yield;

Figures 2-7 are graphs illustrating for various Hoffmann analytes the predicted yield using the model of Table 2 against the actual (measured) yield;

Figure 8 is another graph illustrating the predicted yield against the actual measured yield, this time for the Hoffmann analyte resorcinol;

Figure 9 illustrates the same data as in Figure 2 (for pyridine), only broken down by cigarette source and smoking regime;

Figures 10-15 are further graphs illustrating for various Hoffmann analytes the predicted yield against the actual (measured) yield, using the model of Table 4;

Figure 16 is a flowchart illustrating the development and use of a regression model in accordance with one embodiment of the present invention; and

Figure 17 is a schematic diagram illustrating a system that uses a regression model in accordance with one embodiment of the present invention.

In all diagrams the predicted values appear on the dependent axis (Y axis) and the measured (actual) values appear on the independent axis (X axis).

Detailed Description

A) OVERVIEW

This section provides a brief overview of the observations and statistical analysis used to derive the model for predicting the yields of smoke constituents. These aspects are then revisited in more detail in the following sections.

The analysis described herein is based on statistical modelling of forty or so tobacco smoke constituents from the Canadian Regulatory List (the 'Hoffmann' list of analytes - see Ref. 2) using multivariate regression. The analysis leads to a model for predicting the yields of these constituents (per cigarette), based on the yields of more regularly measured components such as tar, nicotine and carbon monoxide (including interactions between these).

The model was developed using mainstream tobacco smoke constituent yields, based on about 58 different cigarette brands being smoked at ISO and intense smoking regimes. In total, 96 observations were made available, taken across different laboratories and at different points in time.

Statistically significant relationships are derived between various Hoffmann analytes and a combination of tar, carbon monoxide and nicotine, particularly when a model allowing for interactions between these components is adopted. A general attempt was made to minimise the complexity of the models, with the number of terms in the model being reduced where possible, providing that more than 90% of the variance in the data could still be accounted for.

To develop the models further, allowance can be made for differences in Hoffmann analyte yields between cigarette brands manufactured with a range of designs and features (e.g. blend style). This is discussed in more detail below.

B) STATISTICAL MODELLING THEORY

A simple model to explain the variance between two variables is the 'straight line' relationship (a linear regression), expressed as: V = aW + b + e. Here variable V is expressed as a linear combination of terms containing an independent variable, W, with coefficient, a, and an offset or intercept, b. Unexplained variance in the results is expressed as an error, e.

More complex, multidimensional models can be used to relate a variable to several independent variables or 'predictors', each with its own coefficient. This can then be expressed as: V = aW + bX + cY + d + e. A further increase of sophistication occurs in models that include terms that reflect interactions between variables, e.g.: V = aW + bX + cWX + d + e. Here the third term (cWX) represents such an interaction. Still more complex behaviour can be modelled by the use of polynomial terms, e.g.: V = aW + bX + cXW + dW² + fX² + e.
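A minimal sketch of how a model containing an interaction term might be fitted by ordinary least squares is given below. It is offered for illustration only and is not the procedure used in the present investigation; the helper names and the synthetic, noise-free data are assumptions, and the normal equations are solved directly in plain Python simply to keep the example self-contained.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Move the row with the largest entry in this column to the pivot position.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back-substitution on the upper-triangular system.
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - known) / aug[r][r]
    return solution


def fit_least_squares(rows, targets):
    """Return coefficients minimising squared error, via (X^T X) beta = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear(xtx, xty)


# Synthetic observations generated from V = 2W + 3X + 0.5WX + 1 (hypothetical).
points = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5), (2, 6)]
design = [[w, x, w * x, 1.0] for w, x in points]  # columns: W, X, WX, offset
values = [2 * w + 3 * x + 0.5 * w * x + 1 for w, x in points]

a, b, c, d = fit_least_squares(design, values)
print(round(a, 6), round(b, 6), round(c, 6), round(d, 6))
```

Polynomial terms can be handled the same way by appending columns such as W² and X² to each design row; the fit remains linear in the coefficients.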

Regression analysis of a given data set normally attempts to first fit a simple model to the data set. Then, if this cannot explain more than, say, 95% of the variance in the data, a more complex model is investigated. This process is normally best achieved using an analysis of the 'residuals' (the variations between the predicted and actual values), generally by looking at one or more graphical plots of the residuals. If the residuals appear to be non-random, then a judgement can be made as to what polynomial and/or interaction term(s) might lead to a better fit.
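The numerical side of this check can be sketched as follows. The function names, the example values and the use of the 95% figure as a default threshold are assumptions for illustration; the graphical inspection of residuals described above is a matter of judgement and is not reproduced here.

```python
def residuals(actual, predicted):
    """Differences between the actual and the predicted values."""
    return [a - p for a, p in zip(actual, predicted)]


def variance_explained(actual, predicted):
    """Fraction of the variance in `actual` accounted for by `predicted`."""
    mean = sum(actual) / len(actual)
    ss_res = sum(r * r for r in residuals(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def needs_more_complex_model(actual, predicted, threshold=0.95):
    """True when the fit explains less than the chosen share of the variance."""
    return variance_explained(actual, predicted) < threshold


# Hypothetical measured and fitted values, for illustration only.
measured = [1.0, 2.0, 3.0, 4.0]
fitted = [1.1, 1.9, 3.2, 3.8]
print(variance_explained(measured, fitted))
print(needs_more_complex_model(measured, fitted))
```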

Note that modelling in this manner seeks to provide a mathematical 'best fit' to the data collected within the domain of observation. It does not generally rely upon or necessarily directly reflect the underlying physical processes involved. Of course, if particular physical information is available, then this can be included in the model as appropriate.

In the present investigation, models of the form:

$$H_i = z + aT + bC + cN + dTC + fTN + gCN + hTCN + jT^2 + kC^2 + mN^2 + e \qquad \text{(Model A)}$$

were used to define the relationships between the more common tobacco smoke constituents and those appearing in the Hoffmann list of analytes. Here H_i represents a particular Hoffmann analyte (as determined by index i), T represents the associated tar yield, N the nicotine yield, and C the carbon monoxide yield; z, a, b, c, d, f, g, h, j, k and m are the regression coefficients; and e is the residual error in prediction.

By performing a progressive, stepwise, regression, a minimum number of terms was used for the prediction of any particular analyte, consistent with maximising the variance accountable in the regression equation. More particularly, the analysis proceeded by starting with the full regression model (Model A), and selectively removing terms that did not contribute to the accountable variance. The selection of terms was done by hand, rather than automatically within a program, since this tended to lead to results that were not only statistically superior, but also more plausible from a physical/chemical perspective (for example, one would not expect coefficients a, b, and c to all be eliminated from Model A above, given the broad tendency of all smoke yields to rise with a more intense smoking regime).
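Although the terms were selected by hand in the present investigation, the underlying idea of removing terms that do not contribute to the accountable variance can be illustrated with an automated stand-in. The sketch below is an assumption-laden simplification: the function names, the tolerance of 0.001 on the loss in explained variance, and the synthetic data are all hypothetical, and no physical or chemical plausibility check is applied.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - known) / aug[r][r]
    return solution


def r_squared(columns, names, y):
    """Fit y on the named columns plus an intercept; return variance explained."""
    rows = [[columns[n][i] for n in names] + [1.0] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    beta = solve_linear(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in rows]
    mean = sum(y) / len(y)
    ss_res = sum((t - f) ** 2 for t, f in zip(y, fitted))
    ss_tot = sum((t - mean) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot


def backward_eliminate(columns, y, tolerance=0.001):
    """Start from all terms; repeatedly drop the term whose removal costs least."""
    kept = list(columns)
    while len(kept) > 1:
        full = r_squared(columns, kept, y)
        loss = {
            name: full - r_squared(columns, [k for k in kept if k != name], y)
            for name in kept
        }
        weakest = min(loss, key=loss.get)
        if loss[weakest] > tolerance:
            break  # every remaining term contributes to the explained variance
        kept.remove(weakest)
    return kept


# Synthetic data (hypothetical): y depends on x1 and x2 only; x3 is irrelevant.
columns = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [2, 1, 4, 3, 6, 5, 8, 7],
    "x3": [1, 0, 0, 1, 1, 0, 1, 0],
}
y = [2 * a + 3 * b + 1 for a, b in zip(columns["x1"], columns["x2"])]
selected = backward_eliminate(columns, y)
print(selected)
```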

Note that some experimentation was done using a different regression model from that of Model A above. In particular, various powers other than 2 (e.g. 1.5, 2.5, and 3) were tried for the regression polynomial for certain portions of the data. However, since these other powers did not lead to improved regressions, only the results from Model A are presented herein.

C) DATA COLLATION

The data used in the regression analysis came from a variety of sources, primarily representing data that had been prepared for submission to regulatory authorities in Australia, Canada and Brazil (thus the data relate to brands marketed in those three countries). At least some (if not all) of the data is in the public domain, and can be accessed from the web-sites of the relevant regulatory agencies.

In total, 96 observations were used for the analysis. Of these, 16 were for submission in Australia, 60 were for Canada and the remainder were for Brazil. Data for all 3 countries included the Kentucky reference cigarette, 1R4F. Where a more intense smoking regime was used, this included the occlusion of filter ventilation holes if necessary. Note that the observations were made at various times and at various laboratories across the world (i.e. measurements for use in a particular country were not necessarily made in the country concerned).

D) DATA ANALYSIS

The regression analysis was performed using the statistical software package available from Minitab Inc., of Pennsylvania, USA (see www.minitab.com), running on standard personal computers with a Windows 95 operating system (from Microsoft Corporation). A multilinear regression was employed to generate a 'best fit' formula using all the elements in Model A above, or a subset of those elements, where appropriate (as discussed above). Where a subset was used, this consisted of the minimum subset such that a maximum regression coefficient was obtained, and a maximum "F-ratio" achieved in the accompanying analysis of variance (ANOVA).

Note that no attempt was made to identify any particular brand as an outlier, in the sense of being consistently removed from the main cluster of data points for multiple smoke components. (This might suggest some flaw with the relevant cigarette and/or experimental set-up, and so lead to a potentially tighter regression if this data could be excised). Rather, the data set for each smoke component was analysed independently from the others.

However, initial screening of the various observations could be introduced, if any observations were believed to be particularly problematic.

E) RESULTS

Table 1 shows a full correlation matrix for tar, nicotine, carbon monoxide and a representative sample of six of the Hoffmann analytes. In general there is a high degree of correlation between all of the components in this matrix. However, to illustrate that a high correlation between two variables does not necessarily imply high predictability, the corresponding values of pyridine and carbon monoxide (correlation coefficient 0.96) were plotted for all 96 data points - in effect, a univariate regression - and are shown in Fig 1. In general, predictability falls with increasing distance from the origin.

Figure 1 also shows the origin by country and smoking regime (ISO or intense) of the various data points, and it is apparent that the data for different countries are not perfectly coincident. There are several possible explanations for this. One likely factor is that different blend styles ranging from Green Virginia to US Blended have been incorporated into the data set. The range of blends available for marketing varies significantly from one country to another, dependent on local growing conditions, consumer preferences, market conditions, and so on. Consequently, blend variations are likely to show up on a country basis. In addition, different countries can require different smoking regimes for regulatory submissions. Furthermore, variations between laboratories may be another factor, e.g. using different analytical methodologies, different measurement equipment, and so on. Again, this is likely to show up on a country basis, since different laboratories are typically used to obtain measurements for different countries.

Table 1: Correlation Matrix for nine smoke constituents used in this study

The data relating to the specific Hoffmann analytes and the corresponding tar, nicotine and carbon monoxide yields were entered into the Minitab package and subjected to a stepwise multilinear regression analysis, as described above in the Data Analysis section. Each analyte was treated separately to discover the set of predictive coefficients that provided a maximum accountability of the variance. Table 2 lists the value of each of these coefficients in tabulated form. Parameters which made no useful contribution to the overall predictive equation, i.e. those that did not reduce the regression coefficient upon removal, are marked with an asterisk. The data for tar, nicotine, CO are expressed in milligrams per cigarette, the data for pyridine, acrolein, ammonia and resorcinol are expressed in micrograms per cigarette, and the data for NNK and cadmium are expressed in nanograms per cigarette.

Table 2: Coefficients used in predictive modelling of six smoke constituents

* = Factor did not contribute significantly to the overall variance accounted for by the regression.

Figures 2-7 serve to give a visual impression of how well the regression formulae from Table 2 fit the measured data. (All the Figures are plotted using the same units as given above for Table 2). In each case, the measured values are plotted against the values predicted from the regression equation. In these plots therefore, a perfect prediction equation will result in a line of slope equal to unity, and an intercept equal to zero. However even under these circumstances, there may still be (random) variance about this line.

Note that the remaining variance can be attributed to two main causes. The first is limitations in the underlying regression model. Thus the yields of particular components may be determined by factors not considered in the model, such as blend, smoking regime, and so on. This is discussed further below. A second reason is that the observations used to create the model may themselves be somewhat inaccurate. (Indeed, one of the main motivations for developing the model is precisely because such observations are hard to perform reliably). In this context, better observations should allow a more accurate regression model to be developed.

Table 3 gives the slope and intercept from regressing the predicted values on measured values for each of the analytes in question. In general there is very good agreement, as verified from the associated regression coefficients, slope and intercept from the fitted equations.
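The slope and intercept check described here amounts to a simple linear regression of predicted on measured values. A minimal sketch is given below; the function name and the example numbers are hypothetical and are not taken from Table 3. A perfect prediction equation would give a slope of one and an intercept of zero.

```python
def slope_and_intercept(measured, predicted):
    """Regress predicted (Y) on measured (X) by ordinary least squares."""
    n = len(measured)
    mean_x = sum(measured) / n
    mean_y = sum(predicted) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(measured, predicted))
    sxx = sum((x - mean_x) ** 2 for x in measured)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical yields, for illustration only.
measured_yields = [1.0, 2.0, 3.0, 4.0]
predicted_yields = [1.1, 1.9, 3.2, 3.8]
slope, intercept = slope_and_intercept(measured_yields, predicted_yields)
print(round(slope, 3), round(intercept, 3))
```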

Table 3: Results of Regressing Predicted on Actual values for six analytes

It should be emphasised that it cannot be determined in advance that Model A will provide an accurate explanation of the observations. To illustrate this, consider one more Hoffmann analyte, namely resorcinol. The measured values for resorcinol were predicted using similar regression techniques to those applied to the representative six analytes considered above. The analysis revealed a best-fit equation to be:

Resorcinol = 0.398 - 0.0124*Tar*Tar - 0.0084*CO*CO + 0.0197*Tar*CO + 0.109*Tar*Nic - 0.00153*Tar*Nic*CO

In this case a regression coefficient of 0.908 was the best that could be achieved. When the predicted values were plotted against the actual values a slope of 0.931 was found together with an intercept of 0.185, which are appreciably different from the ideal values of one and zero respectively.
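By way of a worked example, the resorcinol best-fit equation can be evaluated directly. The coefficients below are those quoted in the equation; the function name and the input yields (tar 10 mg, nicotine 0.8 mg and carbon monoxide 10 mg per cigarette) are hypothetical values chosen only to show the arithmetic, with the result in micrograms per cigarette following the units stated for Table 2.

```python
def predict_resorcinol(tar_mg, nic_mg, co_mg):
    """Evaluate the resorcinol best-fit equation quoted in the text."""
    return (
        0.398
        - 0.0124 * tar_mg * tar_mg
        - 0.0084 * co_mg * co_mg
        + 0.0197 * tar_mg * co_mg
        + 0.109 * tar_mg * nic_mg
        - 0.00153 * tar_mg * nic_mg * co_mg
    )


# Hypothetical input yields, not taken from the data set.
resorcinol_ug = predict_resorcinol(tar_mg=10.0, nic_mg=0.8, co_mg=10.0)
print(round(resorcinol_ug, 4))
```

Given the regression coefficient of 0.908 reported for this analyte, any such prediction should be treated as less reliable than those for the six representative analytes.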

Figure 8 shows the predicted values for resorcinol plotted against the actual values. A comparison of this graph with those in Figures 2-7 clearly confirms that the prediction for this compound is less accurate. This is most apparent from the spread of data around the line of best fit - more than for any of the other constituents previously described. In this case, it is therefore appropriate to look for different chemical, physical or blend components to explain the behaviour of resorcinol over the three countries and two smoking regimes used for the data set. One justification for this is that as discussed with relation to Figure 1 the data sets from different countries are not coincident, i.e. data from different countries and different smoking regimes do not all appear to fall on the same 'straight line' on the graph when a simple regression is used (univariate analysis).

Figure 9 illustrates the same data as Figure 2, this time broken down by country and smoking regime. It demonstrates that using the multiple regression model from Table 2 to predict pyridine gives a tighter general fit to the observations (compared to the univariate model of Figure 1). Note that there is more overlap between the different data sets than in Figure 1, although it seems likely that the model could perhaps still be further improved, such as by adding one or more predictors to account for blending factors.

The regression analysis so far described concerns a predictive model for determining the yields of a subset of six of the forty or so Hoffmann analytes, based on the mainstream tobacco smoke constituents, in particular the primary predictors of tar, nicotine and carbon monoxide. A minimum number of coefficients were utilised from a complex model, providing more than 90% accountability of the variance in the data could be maintained. The results from the model are statistically significant.

The analysis has been extended to generally encompass the complete set of Hoffmann analytes. Table 4 presents the regression coefficients for a further six representative analytes:

Table 4: Coefficients used in predictive modelling of a further six smoke constituents

F-Ratio: 1084, 379, 210, 149, 80, 108

Regression Coefficient: 98.3 %, 96.0 %, 93.8 %, 91.7 %, 84.5 %, 91.6 %

The above regression formulae are plotted in Figures 10-15, in the same manner that Figures 2-7 depicted the regressions detailed in Table 2. Note that in Figures 10-15, the observations are again broken down by data source, with filled circles from Brazil, crosses from Canada, and open circles from Australia. For both Canada and Australia, the lighter symbols are for the ISO regime, while the darker symbols are for a more intense 55/2/30 regime. (The Brazilian data is for an ISO regime).

In order to verify the accuracy of the regression model, it was used to predict Hoffmann analyte yields, based on tar, nicotine and carbon monoxide measurements for a set of observations that were not used to create the model in the first place. The predicted yields were then compared with actually observed Hoffmann analyte yields for this set of measurements.
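A comparison of this kind can be sketched as a percentage error between predicted and measured yields for observations held back from model fitting. The function names and the numbers below are hypothetical and do not reproduce Table 5.

```python
def percent_error(predicted, measured):
    """Signed error of the prediction as a percentage of the measured value."""
    return 100.0 * (predicted - measured) / measured


def compare_holdout(predictions, measurements):
    """Return {analyte: percent error} for analytes present in both inputs."""
    return {
        name: percent_error(predictions[name], measurements[name])
        for name in predictions
        if name in measurements
    }


# Hypothetical held-back observation, for illustration only.
predicted = {"pyridine": 10.5, "acrolein": 57.0}
measured = {"pyridine": 10.0, "acrolein": 60.0}
errors = compare_holdout(predicted, measured)
for analyte, error in errors.items():
    print(analyte, round(error, 1))
```

For analytes measured close to their detection limit, a large percentage error may reflect the measurement as much as the model.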

The results of this comparison are presented in Table 5 for two cigarettes (identified schematically as Brand A and Brand B). It will be noted that in general the predictions are very reasonable, although with certain exceptions, the most noticeable deficiency being for cadmium. However, it is difficult to measure this metal analytically with any great accuracy, and in addition its concentration is highly dependent on soil origin. With regard to the predictions for NNN (a nitrosamine, nitrosonornicotine), the values given are close to the analytical detection limits, and so great accuracy (in percentage terms) cannot be expected. Overall, the accuracy of the predictions is generally very acceptable.

Table 5: Predicted and measured results from an 'unknown' data set

Figure 16 presents a flowchart of the operations performed to develop and then utilise a regression model such as described herein. Those steps above the dashed line (indicated as phase A) relate to the creation of the model, whereas those steps below the dashed line (phase B) relate to its deployment.

The method commences with the acquisition of the data sets (step 1010). These are complete in the sense that they generally include concentrations for all the constituents of interest. In other words, the observations contain yields for both the predictors (to be), namely tar, carbon monoxide and nicotine, and also the components to be predicted, in this case the Hoffmann analytes.

A regression analysis is now performed for this data set (step 1020), allowing a set of regression coefficients, such as listed in Tables 2 and 4, to be generated. This permits a regression model to be created on the basis of these coefficients (step 1030) that can then be used for predictive purposes. Note that a separate regression analysis is actually performed for each constituent of interest (i.e. for each column of Tables 2 and 4). Likewise, a separate model is generated for each constituent of interest (based on the regression coefficients found in step 1020).
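The per-constituent fitting of steps 1020 and 1030 might be sketched as below. Several assumptions apply: the fit is plain ordinary least squares solved through the normal equations, the design is a reduced first-order one (intercept, tar, carbon monoxide, nicotine) rather than the full set of interaction and squared terms, and all names (`fit_all`, `first_order_terms`, the analyte labels) and data are hypothetical. The yields are generated from known coefficients so that the fit can be checked.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Predictors = Tuple[float, float, float]  # (tar, carbon monoxide, nicotine)


def first_order_terms(t: float, c: float, n: float) -> List[float]:
    """Reduced design row: intercept, tar, carbon monoxide, nicotine."""
    return [1.0, t, c, n]


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    size = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for k in range(col, size + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * size
    for r in range(size - 1, -1, -1):  # back substitution
        tail = sum(m[r][k] * x[k] for k in range(r + 1, size))
        x[r] = (m[r][size] - tail) / m[r][r]
    return x


def fit_one(design: List[List[float]], y: Sequence[float]) -> List[float]:
    """Ordinary least squares through the normal equations (X'X) b = X'y."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


def fit_all(observations: Sequence[Predictors],
            yields: Dict[str, Sequence[float]],
            terms: Callable[[float, float, float], List[float]] = first_order_terms,
            ) -> Dict[str, List[float]]:
    """One independent fit per constituent, all sharing the same design matrix."""
    design = [terms(t, c, n) for t, c, n in observations]
    return {name: fit_one(design, y) for name, y in yields.items()}


# Synthetic check: yields generated from known coefficients, then recovered.
observations = [(10, 11, 0.9), (5, 6, 0.5), (1, 2, 0.1),
                (12, 10, 1.1), (8, 9, 0.6), (3, 5, 0.4)]
yields = {
    "analyte_1": [1 + 2 * t + 3 * c + 4 * n for t, c, n in observations],
    "analyte_2": [5 - 1 * t + 0.5 * c + 2 * n for t, c, n in observations],
}
coefficients = fit_all(observations, yields)
```

A different `terms` function could supply the interaction and squared terms in the same way, although a numerically sturdier solver than the normal equations would probably be preferable for a larger design.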

In phase B, where the regression models are deployed, concentrations of a subset of components are measured, corresponding to the predictor variables utilised in the regression model (step 1040). In the particular embodiments disclosed herein, these components are carbon monoxide, tar and nicotine, but it will be appreciated that other embodiments may use different regression predictors. The measured concentrations are then input into the model (step 1050), where predicted concentrations for the other components are calculated (step 1060). In the particular embodiment disclosed herein these other components are the Hoffmann analytes, but again it will be recognised that a regression model could be developed for other smoke components. Finally, the predictions from the models are output to the user (step 1070). Note that although there is a separate underlying regression model for each component to be predicted, this may be transparent to the user. Thus in the system described in relation to Figure 17 below, the regression models for each component (Hoffmann analyte) are run in parallel. This allows the user to enter the measured concentrations just once (for step 1050); these are then passed within the system to each of the regression models for calculation of a predicted concentration for their respective constituent (for step 1060); and then the results are combined into a single output (step 1070) that lists the predicted concentrations for all the relevant components. Consequently, from the perspective of the user, the system functions as a single overall model that handles all the Hoffmann analytes together, although at a more detailed level, there is actually an independent statistical model for each analyte.
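The single-entry, combined-output behaviour of steps 1050 to 1070 can be illustrated with a short sketch. The name `predict_all`, the analyte labels and the toy models are hypothetical, and the models are evaluated one after another here rather than concurrently, which gives the same result because they are independent.

```python
from typing import Callable, Dict

Model = Callable[[float, float, float], float]  # (tar, CO, nicotine) -> yield


def predict_all(models: Dict[str, Model],
                tar: float, carbon_monoxide: float, nicotine: float,
                ) -> Dict[str, float]:
    """Pass one set of measurements to every per-analyte model and collect results."""
    return {name: model(tar, carbon_monoxide, nicotine)
            for name, model in models.items()}


# Toy stand-ins for fitted models; the coefficients are arbitrary placeholders.
models: Dict[str, Model] = {
    "analyte_1": lambda t, c, n: 1.0 + 2.0 * t,
    "analyte_2": lambda t, c, n: 0.5 * c + 4.0 * n,
}
report = predict_all(models, tar=10.0, carbon_monoxide=12.0, nicotine=1.0)
```

The user supplies the three measurements once, and `report` holds one predicted value per analyte, mirroring the single combined output of step 1070.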

Figure 17 is a schematic diagram of a system 2000 used for determining the predicted yields of smoke components (generally in accordance with the method of Figure 16). The system receives laboratory measurements (2010) of the yields of the regression variables (carbon monoxide, tar, nicotine). These are measured via analytical chemistry techniques, as per conventional observations. The input of these measurements into the system 2000 (shown schematically by arrow 2015) will typically be performed by a user entering the data by hand, but could also be performed automatically by a data link between the appropriate measurement apparatus and system 2000.

System 2000 represents any computer or similar machine suitable for running a regression model, such as a personal computer, a workstation, a mainframe, and so on. In addition, system 2000 may be implemented as a distributed system, for example with a user client connected over a network such as the Internet or an intranet to a server. The client would then be responsible for interaction with the user (the data input/output steps 1050 and 1070 in Figure 16), while the model itself would run on the server (for step 1060).

Included on system 2000 is a regression model 2020. In one particular embodiment, the model 2020 is implemented by a program running on a personal computer system 2000 using the Windows platform (from Microsoft Corporation). This program incorporates the various data parameters needed to define the regression model. The program code is generally stored on a hard disk of the computer. In use, the program is loaded into computer random access memory (RAM) for execution by the system processor. Note that rather than being stored on the hard disk or other form of fixed storage, part or all of the program code may also be stored on a removable storage medium, typically an optical (CD ROM, DVD, etc) or magnetic (floppy disk, tape, etc) device. Alternatively, the program may be downloaded as part of a transmission signal over a network, such as a local area network (LAN), the Internet, and so on. The data parameters used to define the regression model will also be loaded in like manner to the program code, in other words from a hard drive, a removable storage medium, or downloaded over a network. Typically these data parameters are stored with or inside the program, but may alternatively be stored independently, for example in a separate file structure.

The system 2000, and more particularly the model 2020, accepts the measured concentrations, and then uses these as input to the model to generate the predicted concentrations 2025. These can then be output to the user via the screen, via an output data file, and so on.
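As a hedged illustration of how model 2020 might turn measured concentrations into a predicted concentration, the sketch below evaluates the model form recited in claims 8 and 27, with T, C and N the measured tar, carbon monoxide and nicotine yields. The function name `predict_yield`, the ordering of the coefficient list and the placeholder coefficients are assumptions made for the example; real coefficients would come from the fitted tables.

```python
from typing import Sequence


def predict_yield(coeffs: Sequence[float], t: float, c: float, n: float) -> float:
    """Evaluate X = z + aT + bC + cN + dTC + fTN + gCN + hTCN + jT^2 + kC^2 + mN^2.

    coeffs is ordered (z, a, b, c, d, f, g, h, j, k, m); the residual e is
    unknown at prediction time and is therefore left out.
    """
    if len(coeffs) != 11:
        raise ValueError("expected 11 coefficients: z, a, b, c, d, f, g, h, j, k, m")
    terms = [1.0, t, c, n, t * c, t * n, c * n, t * c * n, t * t, c * c, n * n]
    return sum(k * x for k, x in zip(coeffs, terms))


# Placeholder coefficients (all ones) used only to show the arithmetic.
example = predict_yield([1.0] * 11, t=2.0, c=3.0, n=1.0)
```

With every coefficient set to one, the result is simply the sum of the eleven terms: 1 + 2 + 3 + 1 + 6 + 2 + 3 + 6 + 4 + 9 + 1 = 38.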

It will be appreciated that in general a regression model has a restricted range of applicability, corresponding to the span of the observations from which it was originally derived. Thus it is unrealistic to expect the model to give reliable predictions for concentrations or combinations of concentrations that are very different from those of the original data. To this end, model 2020 includes some input filtering of the measured concentrations 2015. If these fall outside the range of realistic prediction, then the model can either refuse to give an output, or may accompany any output by a flag to indicate that the accuracy of the output should not be relied upon.
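One simple form such input filtering could take is a per-variable range check against the model-building data, sketched below. This is an assumption about the filtering, not a description of the actual implementation: the names `training_ranges` and `out_of_range` and the observations are hypothetical, and the check looks at each predictor separately, so it would not catch an unusual combination of values that are individually in range.

```python
from typing import Dict, List, Sequence, Tuple

NAMES = ("tar", "carbon monoxide", "nicotine")


def training_ranges(observations: Sequence[Tuple[float, float, float]]
                    ) -> Dict[str, Tuple[float, float]]:
    """Minimum and maximum of each predictor over the model-building data."""
    columns = list(zip(*observations))
    return {name: (min(col), max(col)) for name, col in zip(NAMES, columns)}


def out_of_range(ranges: Dict[str, Tuple[float, float]],
                 measured: Tuple[float, float, float]) -> List[str]:
    """Names of the measured predictors lying outside the training span."""
    flagged = []
    for name, value in zip(NAMES, measured):
        low, high = ranges[name]
        if not low <= value <= high:
            flagged.append(name)
    return flagged


# Placeholder observations, not values from the data sets described herein.
ranges = training_ranges([(1.0, 2.0, 0.1), (12.0, 14.0, 1.1), (6.0, 7.0, 0.5)])
ok = out_of_range(ranges, (5.0, 6.0, 0.4))        # inside every range
suspect = out_of_range(ranges, (20.0, 6.0, 0.4))  # tar above the training span
```

A caller could refuse to predict when the returned list is non-empty, or could attach the list to the output as the reliability flag.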

As previously indicated, the relationships between the Hoffmann analytes and the corresponding primary cigarette smoke components, namely tar, nicotine and carbon monoxide (as defined by model A), can be enhanced to try to accommodate the differences between various cigarette brands. Thus in broad terms, tobacco can be classified into three main types, Virginia, Burley, and Oriental, dependent upon the plant type and growing conditions, and the way in which the tobacco is processed (aged, cured, etc.), with the various brands being made up of blends of these different types.

One possibility is therefore to develop the regression model to take into account the blend mix of a given cigarette, in terms of the percentage contribution of each of the above three main types of tobacco. A more accurate approach however is to parameterise the blend information based on certain physical and/or chemical properties, and then to use these properties within the regression model. For example, a regression model can be developed to incorporate one or more properties such as:

total nitrogen content,

total reducing sugars content,

chlorogenic acid content,

percentage expanded tobacco value (this indicates fill value for a given amount of tobacco, and reflects processing characteristics), and

percentage lamina content (this specifies the proportion of leafy tobacco, as against other plant components such as stalk).

The above parameters have been found generally to provide a good characterisation of a given blend, and are also sensitive to variations within a given broad type of tobacco (for example, different varieties and forms of Virginia tobacco). They therefore offer the potential of greater accuracy than a model based simply on blend components.

As an example of this approach, a regression fit was performed using reducing sugars, lamina, and expanded tobacco content for certain Hoffmann analytes (in addition to the earlier predictors of tar, nicotine, and carbon monoxide). This led to some appreciable improvements in the regression coefficients (r²) compared to those previously obtained:

For formaldehyde: from 89.1% to 94.1%

For 4-aminobiphenyl: from 84.5% to 93.3%

For hydroquinone: from 98.3% to 99.5%

Note that when incorporating blending information into the model, completely new sets of regression coefficients are obtained (rather than simply adding new coefficients for the blending parameters into the pre-existing sets).
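For reference, the sketch below shows how an r² figure of the kind quoted above can be computed from observed and fitted yields. It assumes that r² denotes the usual coefficient of determination; the function name `r_squared` and the numbers are illustrative only.

```python
from typing import Sequence


def r_squared(observed: Sequence[float], fitted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(observed) / len(observed)
    ss_total = sum((y - mean) ** 2 for y in observed)
    ss_residual = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    if ss_total == 0:
        raise ValueError("observed yields are all identical; r-squared is undefined")
    return 1.0 - ss_residual / ss_total


perfect = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])    # fit reproduces the data
mean_only = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])  # fit is just the mean
```

Multiplying the result by 100 gives a percentage; a fit that reproduces the data exactly scores 100%, and one that does no better than the mean scores 0%.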

It will be appreciated that the regression model described herein can be extended further in various ways, for example to predict constituents apart from the Hoffmann analytes, or to allow for specific smoking regime parameters, such as described in Refs 6-8. As regards the latter aspect, this may potentially be approached in three different ways. Firstly, a separate regression model could be developed for each smoking regime of interest. Secondly, a regression model could be developed to transform a set of yields from one smoking regime to another (this sort of model is presented in Refs 6-8). This regression model could then be used in conjunction with a model that predicted concentrations for some nominal standard smoking regime. (This would be a model as described above, but produced only from data relevant to the nominal standard smoking regime.) Thirdly, one overall model could be developed. In this case, the input parameters would be not only concentrations of carbon monoxide, tar and nicotine, but also the particular smoking regime of interest. The resultant output would then be appropriate for that smoking regime. Each approach has merits and demerits in terms of complexity and power, and each is constrained by the available input data, as will be understood by the skilled person.
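For the third approach, the smoking regime would have to be supplied to the model in numeric form alongside the three measured yields. The sketch below shows one hypothetical way of building such an input vector from a regime label written like "55/2/30". The three numbers are assumed here to be puff volume, puff duration and puff interval; that reading, and the names `parse_regime` and `model_inputs`, are assumptions of the example rather than anything specified above.

```python
from typing import List, Tuple


def parse_regime(label: str) -> Tuple[float, float, float]:
    """Split a label such as "55/2/30" into its three numeric parts."""
    parts = label.split("/")
    if len(parts) != 3:
        raise ValueError(f"expected three '/'-separated numbers, got {label!r}")
    first, second, third = (float(p) for p in parts)
    return first, second, third


def model_inputs(tar: float, carbon_monoxide: float, nicotine: float,
                 regime_label: str) -> List[float]:
    """Measured yields followed by the numeric regime parameters."""
    return [tar, carbon_monoxide, nicotine, *parse_regime(regime_label)]


# Placeholder yields; only the shape of the input vector matters here.
inputs = model_inputs(10.0, 12.0, 0.9, "55/2/30")
```

A single overall model would then be fitted on vectors of this shape, whereas the first approach would instead keep one set of coefficients per regime label.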

It will also be appreciated that although the regression model described herein has been developed in the context of cigarettes, it can also be applied to various other tobacco smokes as appropriate (such as from cigars, pipe tobacco, and so on). Similarly, it may be applied to tobacco substitutes (whether from a natural or artificial source, or some combination thereof) that are intended for consumption in cigarettes and the like. Note that in such cases it may be appropriate to develop a new model (i.e. with new regression parameters), although nevertheless still based on the statistical structures described herein.

In conclusion, it will be appreciated that the different embodiments described above are by way of illustration only, and not by way of limitation. Thus various modifications and enhancements of these embodiments will be apparent to the skilled person, and remain within the scope of the invention as specified by the following claims and their equivalents.

REFERENCES

1 Dube, M.F., Green, C.R., Recent Advan. Tob. Sci. 8 (1982) 48

2 Hoffmann, D., Hoffmann, I., El-Bayoumy, K., "The Less Harmful Cigarette: A Controversial Issue. A Tribute to Ernst L. Wynder", Chemical Research in Toxicology 14 (2001) 7

3 Hoffmann, D., Hoffmann, I., "Tobacco Smoke Components", Beiträge zur Tabakforschung International 18 (1998) 49

4 Borgerding, M.F., "The FTC Method in 1997 - What alternative Smoking Condition(s) does the Future Hold?", Recent Advan. Tob. Sci. 23 (1997) 75

5 International Organisation for Standardisation: Routine analytical cigarette-smoking machine. Definitions and standard conditions; ISO 3308, 1991; replaced by revised edition in 2000

6 Case, P.D., and Bishop A., "Alternative Smoking Regimes (1): - the Role of Statistically Based Experiments in Providing Predictive Tools". Paper presented at CORESTA Congress, Lisbon, Portugal, September 2000.

7 Case, P.D., "Alternative Smoking Regimes (2): - the Role of Static Burn Rate Calculations in Predicting the Yields of Various Mainstream Smoke Components". Paper presented at CORESTA Congress, Lisbon, Portugal, September 2000.

8 Case, P.D., "Alternative Smoking Regimes: - Techniques for Predicting Mainstream Yields with Varying Degrees of Cigarette Filter Vent Blocking". Paper presented at CORESTA Smoke and Technology meeting, Xian, China, September 2001.

Claims

1. A method for determining yields of a first set of constituents in a particular tobacco smoke to be investigated, said method comprising the steps of: providing a multivariate regression model relating the yields of said first set of constituents across a range of tobacco smokes to yields of a second set of constituents in said range of tobacco smokes; measuring the yields of said second set of constituents in the particular tobacco smoke to be investigated; and using the multivariate regression model to predict the yield of at least one of said first set of constituents from said measured yields of said second set of constituents.

2. The method of claim 1, wherein the first set of constituents comprises the Hoffmann analytes.

3. The method of claim 2, wherein the first set of constituents comprises one or more constituents selected from the group of: pyridine, acrolein, ammonia, NNK, acetone and cadmium.

4. The method of any preceding claim, wherein the second set of constituents comprises tar, nicotine, and carbon monoxide.

5. The method of claim 4, wherein the second set of constituents consists of tar, nicotine, and carbon monoxide.

6. The method of any preceding claim, wherein said model involves interaction terms between different constituents in said second set of constituents.

7. The method of claim 6, wherein said model further involves polynomial terms for constituents in said second set of constituents.

8. The method of claim 1, wherein the model provided has the form:

X = z + aT + bC + cN + dTC + fTN + gCN + hTCN + jT² + kC² + mN² + e

where X represents the predicted yield of one of the first set of constituents, T represents the measured tar yield, N represents the measured nicotine yield, C represents the measured carbon monoxide yield; z, a, b, c, d, f, g, h, j, k and m are the regression coefficients; and e is the residual error in prediction.

9. The method of claim 8, wherein the regression model is substantially as specified in Tables 2 and 4.

10. The method of any preceding claim, further comprising the step of allowing for blending in the predicted yields of said first set of constituents.

11. The method of claim 10, wherein said step of allowing for blending comprises incorporating into the multivariate regression model parameters representative of the chemical content of the tobacco.

12. The method of claim 11, wherein said parameters include at least one of the following: total nitrogen content, total reducing sugars content, and chlorogenic acid content.

13. The method of any of claims 10 to 12, wherein said step of allowing for blending comprises incorporating into the multivariate regression model parameters representative of the physical properties of the tobacco.

14. The method of claim 13, wherein said parameters representative of the physical properties of the tobacco include at least one of the following: percentage of lamina content and percentage of expanded tobacco content.

15. The method of any preceding claim, further comprising the step of allowing for smoking regime in the predicted yields of said first set of constituents.

16. The method of claim 15, wherein the smoking regime is allowed for by incorporating at least one parameter representative of smoking regime into the multivariate regression model.

17. The method of any preceding claim, wherein said tobacco smoke arises from combustion of a tobacco substitute.

18. A computer program for implementing the method of any preceding claim.

19. A computer program product comprising instructions encoded in a medium, said instructions when loaded into a computer system causing it to perform a method for determining yields of a first set of constituents in a particular tobacco smoke to be investigated using a multivariate regression model relating the yields of said first set of constituents across a range of tobacco smokes to yields of a second set of constituents in said range of tobacco smokes, said method comprising: receiving as input measured yields of said second set of constituents in the particular tobacco smoke to be investigated; and using the multivariate regression model to predict the yield of at least one of said first set of constituents from said measured yields of said second set of constituents.

20. A system for determining yields of a first set of constituents in a particular tobacco smoke to be investigated, said system comprising: a multivariate regression model relating the yields of said first set of constituents across a range of tobacco smokes to yields of a second set of constituents in said range of tobacco smokes; an input for receiving from a system user measured yields of said second set of constituents in the particular tobacco smoke to be investigated; at least one processor for using the multivariate regression model to predict the yield of at least one of said first set of constituents from said measured yields of said second set of constituents; and an output facility for providing the at least one predicted yield to a system user.

21. The system of claim 20, wherein the first set of constituents comprises the Hoffmann analytes.

22. The system of claim 21, wherein the first set of constituents comprises one or more constituents selected from the group of: pyridine, acrolein, ammonia, NNK, acetone and cadmium.

23. The system of any of claims 20 to 22, wherein the second set of constituents comprises tar, nicotine, and carbon monoxide.

24. The system of claim 23, wherein the second set of constituents consists of tar, nicotine, and carbon monoxide.

25. The system of any of claims 20 to 24, wherein said model involves interaction terms between different constituents in said second set of constituents.

26. The system of claim 25, wherein said model further involves polynomial terms for constituents in said second set of constituents.

27. The system of claim 20, wherein the model provided has the form:

X = z + aT + bC + cN + dTC + fTN + gCN + hTCN + jT² + kC² + mN² + e

where X represents the predicted yield of one of the first set of constituents, T represents the measured tar yield, N represents the measured nicotine yield, C represents the measured carbon monoxide yield; z, a, b, c, d, f, g, h, j, k and m are the regression coefficients; and e is the residual error in prediction.

28. The system of claim 27, wherein the regression model is substantially as specified in Tables 2 and 4.

29. The system of any of claims 20 to 28, wherein said system further allows for blending in the predicted yields of said first set of constituents.

30. The system of claim 29, wherein the blending is allowed for by incorporating into the multivariate regression model parameters representative of the chemical content of the tobacco.

31. The system of claim 30, wherein said parameters include at least one of the following: total nitrogen content, total reducing sugars content, and chlorogenic acid content.

32. The system of any of claims 29 to 31, wherein blending is allowed for by incorporating into the multivariate regression model parameters representative of the physical properties of the tobacco.

33. The system of claim 32, wherein said parameters representative of the physical properties of the tobacco include at least one of the following: percentage of lamina content and percentage of expanded tobacco content.

34. The system of any of claims 20 to 33, wherein smoking regime is allowed for in the predicted yields of said first set of constituents.

35. The system of claim 34, wherein the smoking regime is allowed for by incorporating at least one parameter representative of smoking regime into the multivariate regression model.

36. The system of any of claims 20 to 35, wherein said tobacco smoke arises from combustion of a tobacco substitute.

37. A method for determining the yields of constituents in tobacco smoke substantially as described herein with reference to the accompanying drawings.

38. A system for determining the yields of constituents in tobacco smoke substantially as described herein with reference to the accompanying drawings.