EP1305600A2 - Combinative multivariate calibration that enhances prediction ability through removal of over-modeled regions
- Publication number
- EP1305600A2 (application EP01952581A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- matrix
- factors
- calibration
- residual
- algorithm
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G—PHYSICS
- G01—MEASURING; TESTING
- G01N—INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
- G01N21/00—Investigating or analysing materials by the use of optical means, i.e. using sub-millimetre waves, infrared, visible or ultraviolet light
- G01N21/17—Systems in which incident light is modified in accordance with the properties of the material investigated
- G01N21/25—Colour; Spectral properties, i.e. comparison of effect of material on the light at two or more different wavelengths or wavelength bands
- G01N21/31—Investigating relative effect of material at wavelengths characteristic of specific elements or molecules, e.g. atomic absorption spectrometry
- G01N21/35—Investigating relative effect of material at wavelengths characteristic of specific elements or molecules, e.g. atomic absorption spectrometry using infrared light
- G01N21/359—Investigating relative effect of material at wavelengths characteristic of specific elements or molecules, e.g. atomic absorption spectrometry using infrared light using near infrared light
Definitions
- Wavelength selection is again employed using X 3 as the input matrix (15).
- steps 1 to 3 must go through n iterations (20) to generate T_n and V_n.
- The final scores matrix, T_all, is generated by concatenating all of the Ts:
- loading vectors may not be concatenated since the vectors are different lengths.
- Table 1 below provides a MATLAB program implementing the invented iterative, combinative PCR algorithm.
- % Pre a Matlab file named PCR_Noise_xx that contains the variables % X_cali matrix of calibration spectra in column format
- % column 1 counter (related to varying noise, matrix size, or
- % other parameters such as size of the matrix may be varied by
- T2 = X_pred' * W; % generate new score matrix (for prediction) based upon prediction spectra
- SEC = generate_SE_c(Y_cali_ref, Y_cali);
- SEP = generate_SE_c(Y_pred_ref, Y_pred);
- T_all_p = [ones(m,1) T_all_p]; % 1's to allow nonzero offset
- An alternative embodiment of the modified PCR algorithm applies PCR with a set number of factors to all spectral regions requiring that number of factors.
- a separate PCR model with a second number of factors may be applied to individual wavelengths or spectral regions requiring that number of factors.
- the process may be repeated such that one to n factors are applied for each spectral region or wavelength.
- Scores may be concatenated as above with calibration coefficients being generated and predictions being performed as in steps 8 and 9.
- computer simulated near-IR spectral data sets of serum are utilized to demonstrate the feasibility of the combinative PCR algorithm described above.
- Phantom Serum Spectra Generation: Near-IR absorbance spectra of water, albumin, triglycerides, cholesterol, glucose and urea at a concentration of 1 g/dL, 37.0 °C and a 1 mm pathlength were generated from spectra collected in transmission mode on a NICOLET 860 IR Spectrometer, supplied by Nicolet Instrument Corporation of Madison, WI, followed by multivariate curve resolution analysis. The pure component spectra were used to generate phantom serum spectra by summing the absorbances of the constituent components, where the concentration of each constituent was randomly selected from the concentration ranges in Table 2, below. Noise proportional to the resulting spectral absorbance at each wavelength was then added, with the standard deviation of the added noise being a percentage of the total absorbance, thus yielding spectra with increased noise levels at higher absorbance levels.
- Prediction spectra consisted of 80 independent spectra at each of 30 noise levels, where one standard deviation of the noise level varied from 0.0002 to 0.006 times the absorbance level.
- The SEPs reported for noise levels 0.0002 to 0.006 are the mean SEPs of the 50 independent prediction sets at each of the noise levels.
- the conventional PCR algorithm was utilized to analyze the generated spectral data sets.
- wavelength optimization through wavelength selection was performed on the calibration and monitoring data sets, with the standard error of the monitoring data set used as a metric for optimization.
- Wavelength optimization for the standard PCR algorithm resulted in the spectral ranges of 1100 to 1862 and 1978 to 2218 nm being selected, which, as would be expected, corresponds to removal of the high water absorbance regions about 1900 and 2500 nm.
- wavelength optimization was again performed on the calibration and monitoring data sets, with the standard error of prediction of the monitoring data set used as the metric for optimization.
- wavelength optimization was performed with each PCR iteration.
- a total of three PCR factors were utilized with the spectral regions 1100 to 1886 and 1980 to 2378nm.
- standard PCR utilized a long wavelength cutoff of 2218nm that excluded the sharply increasing water absorbance band that leads to higher noise levels, while the first iteration of the modified PCR algorithm had a long wavelength cutoff at 2378nm that includes more of this high noise region.
- traditional PCR removed a larger region about the 1950nm water absorbance band compared to the first iteration of the modified PCR algorithm.
- the modified PCR algorithm is able to incorporate these noisy regions, since only a three-factor model is utilized in these regions.
- the square of the resulting mean residual spectrum is plotted in Figure 3.
- the large square of the residual serves as one basis for wavelength selection.
- the loadings provide a second basis for wavelength selection.
- Figure 4 the three spectral loadings generated in the first iteration of PCR are presented.
- the first loading 41 is dominated by water while the second spectral loading 42 shows considerable structure in the combination band region corresponding to protein.
- the third loading 43 shows considerable noise about 1950 and 2350nm.
- the second PCR iteration utilized an additional four factors yielding a total of seven factors for the remaining spectral regions.
- the remaining spectral region continues to include the glucose absorbance band located at 2272nm that was excluded from the traditional PCR algorithm.
- the standard PCR algorithm generated its optimal prediction abilities based largely upon the three glucose absorbance bands in the first overtone region from 1500 to 1800nm.
- the first overtone includes smaller and broader absorption features that require additional factors to be properly modeled.
- the large number of factors required for this region plus the limitation of a fixed number of factors for all wavelengths imposed by conventional PCR, necessitated the removal of some of the glucose containing information in the combination band region to avoid later factors adding undue amounts of noise into the model from the combination band region.
- the calibration exemplifies several well- known phenomena. It is generally known that signal, noise and pathlength considerations often dictate that the signal-rich combination band spectral region from 2000 to 2500nm be excluded from multivariate analyses that include the first overtone region from 1450 to 1900nm. Many additional factors are required to fully model the smaller and more overlapped analytical signals in the first overtone region, with the result that the combination band spectral region from 2000 to 2500nm would be over- modeled were it to be included. Such limitation is dictated by traditional multivariate methods that require a single number of factors for the entire spectral region being analyzed. Thus, the inventive PCR algorithm allows signal in a noisy region such as the combination band to be analyzed at the same time as signal in a more complex region like the first overtone. As the following discussion reveals, applying the inventive algorithm leads to smaller standard errors of prediction (SEP).
- SEP standard errors of prediction
- Prediction Using the calibrations developed with the conventional PCR algorithm and the inventive PCR algorithm, prediction results were obtained on the generated prediction sets. As previously indicated, in the prediction data sets, 50 sets of 80 spectra were generated at each of 30 noise levels from 0.0002 to 0.006 times the absorbance level. The mean SEP for the original PCR 51 and modified PCR algorithm 52 are presented in Figure 5 for each of the noise levels tested. In all cases, the mean SEP of the modified PCR algorithm was lower than that of the standard PCR algorithm. In comparing the traditional PCR results to the modified PCR results, it is evident that percent relative increase in error by the traditional PCR algorithm varied from 17 to 45%, Figure 6.
- Figures 5 and 6 clearly show that the inventive PCR algorithm resulted in smaller prediction errors compared to the conventional PCR algorithm.
- the two PCR approaches produce similar results.
- the two PCR approaches perform similarly.
- the observed SEP versus the number of factors used to model the system classically decreases with initial factors to a local minimum and then increases slowly with additional factors. This is due to the calibration model modeling the analytical signal with early factors and increasingly modeling the spectral noise with later factors to the point of over-modeling the system.
- the observed SEP may be broken up into two components according to equation 2:
- SEP is the standard error of prediction
- SE_sig is the standard error of the signal
- SE_uncert is the standard error due to the modeled uncertainty. If no noise or uncorrelated components of the sample were present in the model or in subsequent prediction spectra, the standard error of the modeled signal would continue to decrease with an increasing number of factors until the system was completely modeled, as shown by the dashed and dotted line 70 in Figure 7. However, in generation of the calibration model both signal and uncertainty (noise) are modeled into the system. Additional factors increase the uncertainty modeled into the system. This results in the classical decrease in the SEP with early factors as the signal is modeled, a local minimum in the SEP and finally an increase in the SEP as the system is over-modeled with additional factors.
- the invention finds particular utility in various spectroscopy applications; for example, predicting concentration of analytes such as glucose from noninvasive near-IR spectra performed on live subjects. While the invention has been described herein with respect to near-IR spectroscopy, the methods of the invention are equally applicable to data matrices of any kind.
- spectroscopic techniques may include UV/VIS/NIR/IR as well as techniques such as AA/NMR/MS.
- the invention is not limited to spectroscopic techniques but may include chromatographic techniques such as GC/LC or combinations of chromatographic and spectroscopic techniques such as GC/MS or GC/IR. Additionally, the invention finds application in almost any field that relies on multivariate analysis techniques, the social sciences, for example.
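The phantom serum simulation and the SEP comparison described above can be sketched as follows. This is a minimal illustration in Python (not the MATLAB of Table 1), under several assumptions: concentrations are drawn uniformly from each range, the standard error is taken as a root-mean-square difference (the patent's `generate_SE_c` routine is not shown here), and the relative increase is expressed relative to the modified algorithm's error. The pure-component values and concentration ranges are hypothetical placeholders, not the data of Table 2.

```python
import math
import random

def phantom_spectrum(pure_spectra, conc_ranges, noise_fraction, rng):
    """Sum concentration-scaled pure-component absorbances, then add noise whose
    standard deviation is noise_fraction times the total absorbance."""
    # Assumption: uniform sampling within each concentration range.
    conc = {name: rng.uniform(lo, hi) for name, (lo, hi) in conc_ranges.items()}
    n_wl = len(next(iter(pure_spectra.values())))
    clean = [sum(conc[name] * pure_spectra[name][j] for name in pure_spectra)
             for j in range(n_wl)]
    # Noise proportional to absorbance: noisier where absorbance is higher.
    noisy = [a + rng.gauss(0.0, noise_fraction * abs(a)) for a in clean]
    return conc, clean, noisy

def standard_error(y_ref, y_pred):
    """Assumed definition: root-mean-square difference of reference and predicted values."""
    n = len(y_ref)
    return math.sqrt(sum((p - r) ** 2 for r, p in zip(y_ref, y_pred)) / n)

def percent_relative_increase(se_standard, se_modified):
    """Assumed definition: error increase of standard PCR relative to modified PCR."""
    return 100.0 * (se_standard - se_modified) / se_modified

rng = random.Random(0)
# Hypothetical 3-wavelength pure-component spectra at 1 g/dL.
pure = {"water": [0.50, 1.20, 0.80], "glucose": [0.01, 0.03, 0.02]}
# Hypothetical concentration ranges in g/dL (not Table 2).
ranges = {"water": (90.0, 100.0), "glucose": (0.05, 0.40)}
conc, clean, noisy = phantom_spectrum(pure, ranges, 0.002, rng)
```

Calling `phantom_spectrum` once per sample and per noise level would build one prediction set; `standard_error` would then be evaluated per set and averaged over the independent sets.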
Abstract
A novel multivariate model for analysis of absorbance spectra allows for each wavelength or spectral region to be modeled with just enough factors to fully model the analytical signal without the incorporation of noise by using excess factors. Each wavelength or spectral region is modeled utilizing its own number of factors independently of other wavelengths or spectral regions. An iterative combinative PCR algorithm allows a different number of factors to be applied to different wavelengths. In an exemplary embodiment, a three-factor model is applied over a given spectral region. The residual of the three-factor model is calculated and used as the input for an additional five-factor model. Prior to the additional five factors being applied, some of the wavelengths are removed. This leads to a three-factor model over the first region and an eight-factor model over the second region.
Description
COMBINATIVE MULTIVARIATE CALIBRATION THAT ENHANCES PREDICTION ABILITY THROUGH REMOVAL OF OVER-MODELED REGIONS
BACKGROUND OF THE INVENTION
FIELD OF THE INVENTION
The invention relates to multivariate analysis of spectral signals. More particularly, the invention relates to a method of multivariate analysis of a spectral signal that allows for a wavelength or spectral region to be modeled with just enough factors to fully model the analytical signal without the incorporation into the model of noise by using excess factors.
DESCRIPTION OF RELATED ART
Multivariate analysis is a well-established tool for extracting a spectroscopic signal, usually quite small, of a target analyte from a data matrix in the presence of noise, instrument variations, environmental effects and interfering components. Various methods and devices have been described that employ multivariate analysis to determine an analyte signal. For example, R. Barnes, J. Brasch, D. Purdy, W. Loughheed, Non-invasive determination of analyte concentration in body of mammals, U.S. Patent No. 5,379,764 (January 10, 1995) describe a method in which a subject is irradiated with NIR (near-IR) radiation and the resulting absorption spectrum is analyzed using multivariate techniques to obtain a value for analyte concentration.
J. Ivaldi, D. Tracy, R. Hoult, R. Spragg, Method and apparatus for comparing spectra, U.S. Patent No. 5,308,982 (May 3, 1994) describe a method and apparatus in which a matrix model is derived from the measured spectrum of an analyte and interferents. A spectrum is generated for an unknown sample. Multiple linear least squares regression is then utilized to fit the model to the sample spectrum and compute a concentration for the analyte in the sample spectrum.
L. Nygaard, T. Lapp, B. Arnvidarson, Method of determining urea in milk, U.S. Patent No. 5,252,829 (October 12, 1993) describe a method and apparatus for measuring the concentration of urea in a milk sample using an infrared attenuation measuring technique. Multivariate techniques are carried out to determine spectral contributions of known components using partial least squares algorithms, principal component regression, multiple linear regression or artificial neural network learning.
M. Robinson, K. Ward, R. Eaton, D. Haaland, Method of and apparatus for determining the similarity of a biological analyte from a model constructed from known biological fluids, U.S. Patent No. 4,975,581 (December 4, 1990) describe a method and apparatus for determining analyte concentration in a biological sample based on a comparison of infrared energy absorption between a known analyte concentration and a sample. The comparison is performed using partial least squares analysis or other multivariate techniques.
However, multivariate techniques such as principal component regression (PCR) and partial least squares regression (PLS) have some inherent disadvantages. One well-documented problem with multivariate analysis is that noise in the data may be incorporated into the model. This is especially true when too many factors are employed in the development of the model. The modeling of this noise results in subsequent prediction matrices with erroneously high error levels. See, for example, H. Martens, T. Naes, Multivariate Calibration, John Wiley & Sons, 1989; or K. Beebe, B. Kowalski, An Introduction to Multivariate Calibration and Analysis, Anal. Chem. 59, 1007A-1017A (1987). Complicating this issue is the fact that a few factors may fully model a given spectral region, while additional factors may be required to model another set of wavelengths.
For example, a few factors may model a region having:
A. a high degree of co-linearity;
B. high signal to noise ratio;
C. minor or readily modeled instrument variations;
D. low contribution of environmental effects; or
E. a minimal number of readily modeled interfering signals.
Other regions may require a much larger number of factors in order to sufficiently model the analytical signal. This may be the case when:
A. the data are not fully linear;
B. the signal to noise ratio is low;
C. instrument drift changes spectral response over time; or
D. a large number of spectrally interfering components are present.
In traditional applications of multivariate techniques such as PCR or PLS, a single number of factors is applied over an entire spectrum. This means that for a given region within the spectrum, selection of the appropriate number of spectral factors to adequately model the signal will result in all other spectral regions using the same number of factors. In many cases, another spectral region would be optimally modeled with a different number of factors than the first spectral region. Thus, a compromise between wavelength selection and the number of factors to incorporate into the model becomes necessary. There exists, therefore, a need in the art for an algorithm that allows the number of factors for each wavelength or spectral region to be chosen independently of the number of factors utilized to model a different wavelength or spectral region.
An iterative, combinative PCR algorithm allows a different number of factors to be applied to different wavelengths or spectral regions during modeling of a matrix of calibration spectra. In an exemplary embodiment, a three-factor model is applied over a given spectral region, wherein concentrations of a target analyte are known. The residual of the three-factor model is calculated and used as the input for an additional five-factor model. Prior to the additional five factors being applied, some of the wavelengths are removed, with the result that a three-factor model is applied over the first spectral region and an eight-factor model over the second region. This analysis of residuals may be repeated such that a one- to n-factor model may be applied to any given wavelength; that is, any number of factors may be employed to model any given frequency. The scores matrices of the individual models are concatenated to arrive at a final scores matrix for the entire calibration matrix. Principal component regression is employed to regress the calibration matrix against the vector of analyte concentrations to derive a calibration model, the model comprising a vector of calibration coefficients.
A method of predicting concentration of a target analyte from a prediction data set comprising a matrix of sample spectra applies the above calibration to a final scores matrix for the sample matrix to generate a vector of predicted values for target analyte concentration. The sample matrix is iteratively modeled in the same fashion as the calibration matrix, with the final scores matrix being a concatenation of the individual scores matrices for the various spectral regions or wavelengths.
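The prediction step can be illustrated with a short Python sketch (Python is used here instead of the MATLAB of Table 1). It assumes that sample scores are obtained by projecting the spectra onto loadings saved from calibration, that the second-stage scores are computed from the first-stage residual restricted to the second spectral region, and that the coefficient vector starts with an offset term. The names `P1`, `P2`, `second_region` and `coeffs` are hypothetical and do not come from the patent.

```python
def project(X, loadings):
    """Scores for each spectrum: its dot product with each loading vector."""
    return [[sum(x * p for x, p in zip(row, load)) for load in loadings] for row in X]

def predict(X, P1, second_region, P2, coeffs):
    """Predict analyte concentration from sample spectra X (samples x wavelengths).

    P1: first-stage loadings over all selected wavelengths.
    P2: second-stage loadings over the wavelengths listed in second_region.
    coeffs: calibration coefficients, offset first, then one per score column.
    """
    T1 = project(X, P1)
    # Residual spectra after removing what the first-stage model explains.
    E = [[x - sum(t * p[j] for t, p in zip(t_row, P1)) for j, x in enumerate(row)]
         for row, t_row in zip(X, T1)]
    # Keep only the second spectral region before applying the further factors.
    E_sub = [[row[j] for j in second_region] for row in E]
    T2 = project(E_sub, P2)
    # Concatenate the scores and prepend a 1 so the model can have a nonzero offset.
    T_all = [[1.0] + a + b for a, b in zip(T1, T2)]
    return [sum(c * t for c, t in zip(coeffs, row)) for row in T_all]

# Tiny hand-checkable case: one factor per stage, three wavelengths.
y_hat = predict([[1.0, 5.0, 2.0]], [[1.0, 0.0, 0.0]], [1, 2], [[0.0, 1.0]], [0.5, 2.0, 3.0])
print(y_hat)  # [8.5]
```

In the hand-checkable case the first score is 1.0, the residual in the second region is [5.0, 2.0], the second score is 2.0, and the prediction is 0.5 + 2.0 * 1.0 + 3.0 * 2.0 = 8.5.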
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 is a schematic diagram of the steps of an iterative, combinative PCR algorithm, according to the invention;
Figure 2 shows an assortment of exemplary spectra from a set of spectra generated for a calibration data set, according to the invention;
Figure 3 provides a plot of the square of a mean residual spectrum calculated from a set of residual calibration spectra, according to the invention;
Figure 4 shows a plot of the first three loadings from an initial PCR iteration according to the invention;
Figure 5 plots SEP (Standard Error of Prediction) versus noise for Standard PCR and Modified PCR, according to the invention;
Figure 6 plots the relative error level of Standard PCR and Modified PCR as a function of noise, according to the invention; and
Figure 7 plots SEP (standard error of prediction) and uncertainty error as a function of the number of factors used to model a signal, according to the invention.
SUMMARY OF THE INVENTION
DETAILED DESCRIPTION
The invention provides an iterative, combinative PCR (Principal Component Regression) algorithm that allows each wavelength or spectral region of a spectral signal to be modeled with just enough factors to fully model the analytical signal, without excess factors incorporating noise into the model. Each wavelength or spectral region may utilize its own number of factors independently of other wavelengths or spectral regions. A novel multivariate model incorporating the invented algorithm is also provided.
The iterative, combinative PCR algorithm allows a different number of factors to be applied to different wavelengths, or regions, of a spectral signal. In the exemplary embodiment of the invention that follows, a three-factor model is applied over a given spectral region. The exemplary embodiment is provided only as a description, and is not intended to be limiting. The residual of the three-factor model is calculated and used as input for a further five-factor model. However, prior to the additional five factors being applied, some of the wavelengths used for the three-factor model are removed. The removed wavelengths constitute a first spectral region. The remaining wavelengths constitute a second spectral region. Thus, a three-factor model is applied over the first region and an eight-factor model (three plus five factors) over the second region. Those skilled in the art will realize that this analysis of residuals can be repeated such that a one to n factor model may be applied to any given wavelength or spectral region. Stated another way, any number of factors may be employed to model any given frequency or spectral region.
Referring now to Figure 1, a block diagram illustrating the several steps of the invented algorithm is provided. $X_1$ is a matrix of absorbance spectra comprising i samples and k variables, or in this case, wavelengths; y is a vector of concentrations of a target analyte, where the concentrations are independently determined (i samples x 1). Presented below are the several steps of the invented algorithm in overview. A detailed discussion of the algorithm follows further below. Wavelength selection is initially performed on $X_1$ (10, Figure 1).
1. Decompose $X_1 = T_u V_u^T$ using a standard PCA function to generate a matrix of loading vectors $V_u$, followed by regression to generate a scores matrix $T_u$; (11)
2. a. Set $T_1 = T_u(:, 1:z_n)$, where u = the number of factors in an optimal model, z = the number of factors used for an iteration and n denotes the current iteration. In the current embodiment, $z_1 = 3$, thus a selection of three factors is used for the initial spectral range; however, more or fewer factors may be employed in this step; (12)
b. Set $V_1 = V_u(:, 1:z_n)$; (13)
3. a. Compute residual spectra $X_2 = X_1 - T_1 V_1^T$; (14)
b. Undesirable wavelengths are removed from $X_2$ (15). Several methods of selecting the wavelengths to be removed are indicated below; following wavelength selection, $X_2$ now serves as the new input matrix.
Steps 1 to 3 above are repeated, as shown below:
4. Decompose $X_2 = T_u V_u^T$ using standard chemometric techniques known to those skilled in the art; (11)
5. Set $T_2 = T_u(:, 1:z_n)$ (12) and $V_2 = V_u(:, 1:z_n)$ (13). In the current iteration, where n = 2, $z_2 = 5$, thus five factors are used in the second spectral range for purposes of description. In actual practice, more or fewer factors could be used;
6. a. Compute residual spectra: $X_3 = X_2 - T_2 V_2^T$; (14) and
b. Wavelength selection is again employed using $X_3$ as the input matrix (15).
Several approaches to wavelength selection may be utilized:
A. remove noisy regions by squaring the residual and removing large value areas;
B. remove regions with large noise characteristics; or
C. remove regions where the raw absorbance spectra had high absorbance, since noise is related to absorbance.
The steps can be repeated as many times as are required such that $T_\alpha$ and $V_\alpha$ are generated, where α = the total number of iterations. As Figure 1 shows, where u = the number of factors in an optimal model (19), steps 1 to 3 must go through α iterations (20) to generate $T_\alpha$ and $V_\alpha$.
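7. The final scores matrix, $T_{all}$, is generated by concatenating all of the $T_n$:
$T_{all} = [T_1 + T_2 + T_3 + \ldots + T_\alpha]$ (16), where "+" denotes column-wise concatenation of the scores matrices, i.e. $[T_1\ T_2\ T_3\ \ldots\ T_\alpha]$;
It should be noted that the loading vectors cannot be concatenated, since the vectors have different lengths.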
8. Calculate calibration coefficients by performing standard multiple linear regression (17), as below:
$y = Ts$
$T^T y = T^T T s$
$(T^T T)^{-1} T^T y = s$
$s = (T^T T)^{-1} T^T y$, where s is a vector of coefficients.
9. Predict analyte concentration (prediction data set):
a. Compute $T'_1 = X' V_1$, where $V_1$ is obtained in the calibration set;
b. Compute the residual, $X'_2 = X' - T'_1 V_1^T$;
c. Perform the wavelength selection utilized on the calibration matrix; repeat sub-steps a and b as needed;
a. Compute $T'_2 = X'_2 V_2$;
b. Compute the residual, $X'_3 = X'_2 - T'_2 V_2^T$;
c. Perform wavelength selection as in calibration; repeat until $T'_\alpha$ is reached;
a. Concatenate all values for $T'_n$: $T'_{all} = [T'_1 + T'_2 + T'_3 + \ldots + T'_\alpha]$;
b. Predict concentration of target analyte, $y_{pred} = (T'_{all})s$.
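As an illustrative sketch only (it is not the Table 1 program), steps 1 through 7 and the scores computation of step 9 could be written in Python roughly as follows. The function names, the samples-by-wavelengths row layout, the 0-based column indices and the use of power iteration as a stand-in for "a standard PCA function" are all assumptions made for this sketch.

```python
import math
import random


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def top_loadings(x, n_factors, n_iter=300, seed=0):
    """Assumed stand-in for a standard PCA routine: leading loading vectors
    of x (samples x wavelengths) by power iteration with deflation.
    Returns a wavelengths x n_factors matrix V."""
    rng = random.Random(seed)
    work = [row[:] for row in x]
    k = len(work[0])
    vectors = []
    for _ in range(n_factors):
        v = [rng.random() + 0.1 for _ in range(k)]
        for _ in range(n_iter):
            t = [sum(r[j] * v[j] for j in range(k)) for r in work]            # X v
            w = [sum(r[j] * ti for r, ti in zip(work, t)) for j in range(k)]  # X^T X v
            norm = math.sqrt(sum(c * c for c in w))
            if norm < 1e-12:      # nothing left to model in this matrix
                break
            v = [c / norm for c in w]
        norm_v = math.sqrt(sum(c * c for c in v))
        v = [c / norm_v for c in v]
        t = [sum(r[j] * v[j] for j in range(k)) for r in work]
        work = [[r[j] - ti * v[j] for j in range(k)] for r, ti in zip(work, t)]
        vectors.append(v)
    return transpose(vectors)


def residual(x, t, v):
    """X_{n+1} = X_n - T_n V_n^T."""
    fitted = matmul(t, transpose(v))
    return [[a - b for a, b in zip(rx, rf)] for rx, rf in zip(x, fitted)]


def calibrate_scores(x, factors, keep):
    """factors[n] is z_n; keep[n] lists the columns of the current matrix
    retained before iteration n. Returns (T_all, [V_1, V_2, ...])."""
    current, t_all, loadings = x, [[] for _ in x], []
    for z, cols in zip(factors, keep):
        current = [[row[j] for j in cols] for row in current]  # wavelength selection
        v = top_loadings(current, z)                           # V_n
        t = matmul(current, v)                                 # T_n = X_n V_n
        t_all = [a + b for a, b in zip(t_all, t)]              # concatenate scores
        loadings.append(v)
        current = residual(current, t, v)
    return t_all, loadings


def predict_scores(x_new, loadings, keep):
    """Step 9: reuse the calibration loadings and wavelength selections."""
    current, t_all = x_new, [[] for _ in x_new]
    for v, cols in zip(loadings, keep):
        current = [[row[j] for j in cols] for row in current]
        t = matmul(current, v)
        t_all = [a + b for a, b in zip(t_all, t)]
        current = residual(current, t, v)
    return t_all
```

Applying the wavelength selection at the start of each pass mirrors the order used in the Table 1 listing, where each range is applied before the corresponding PCA call.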
Table 1 below provides a MATLAB program implementing the invented iterative, combinative PCR algorithm.
TABLE 1
%
% Modified PCR Algorithm
%
% This algorithm demonstrates a modification to the PCR algorithm that can
% yield lower prediction errors than the standard PCR algorithm. This
% is accomplished through iterative removal of noisy spectral regions
% after removal of signal by earlier factors
%
% In this script, this is demonstrated for spectral matrices of varying
% noise levels
%
% Pre: a Matlab file named PCR_Noise_xx that contains the variables
%   X_cali      matrix of calibration spectra in column format
%   X_pred      matrix of prediction spectra in column format
%   Y_cali_ref  vector of reference concentrations for the
%               calibration spectra in row format
%   Y_pred_ref  vector of reference concentrations for the
%               prediction spectra in row format
%
% Post: matrix named 'summary' containing 4 columns
%   column 1    counter (related to varying noise, matrix size, or
%               other varying parameter)
%   column 2    minimum SEP from PCR algorithm for each matrix
%               tested
%   column 3    minimum SEP from modified PCR algorithm for each
%               matrix tested
%   column 4    percent reduction in SEP (PCR modified vs. PCR)
%
% Modified_PCR_Script
%
clear all
close all
% Model Input
% In this script, noise levels in the input matrices tested were varied
% these values are proportional to the tested noise levels
% other parameters such as size of the matrix may be varied by
% modifying this input
noise_init = 0.0004;
noise_step = 0.0002;
noise_final = 0.0046;
counter = 1;
counter2 = 1;

% step through X matrices with increasing noise levels
for noise = noise_init:noise_step:noise_final
    eval(['load PCR_Noise_' int2str(counter2)])

    % copy of raw matrices for use with modified PCR algorithm
    X_cali_1 = X_cali;
    X_pred_1 = X_pred;
    % User Input for original PCR algorithm
    pcr_init = 1;               % Initial number of PCR factors utilized
    pcr_final = 25;             % Final number of PCR factors utilized
    range_0 = [1 382 440 560];  % Wavelength selection: X-axis
                                % indices of X matrix to be included
                                % in analysis

    % User Input for modified PCR algorithm
    num_range = 6;              % Number of iterations of PCR algorithm
    pcr_1 = 3;                  % Factors employed in 1st iteration
    pcr_2 = 4;                  % Factors employed in 2nd iteration
    pcr_3 = 5;                  % etc...
    % pcr_4 = ?;                (value missing from the source text)
    pcr_5 = 2;
    pcr_6 = 16;
    range_1 = [1 394 441 640];  % Wavelength selection: X-axis
                                % indices of X matrix to be
                                % included in analysis
    range_2 = [1 410 440 545];  % Wavelength selection: X-axis
                                % indices of resulting X matrix to
                                % be included in analysis
    range_3 = [1 507];          % etc...
    range_4 = [1 395 410 507];
    range_5 = [1 147 184 366 440 480];
    range_6 = [1 148 175 310];

    % Sum total number of factors (total_pcr) utilized
    if (num_range == 1)
        total_pcr = pcr_1;
    elseif (num_range == 2)
        total_pcr = pcr_1 + pcr_2;
    elseif (num_range == 3)
        total_pcr = pcr_1 + pcr_2 + pcr_3;
    elseif (num_range == 4)
        total_pcr = pcr_1 + pcr_2 + pcr_3 + pcr_4;
    elseif (num_range == 5)
        total_pcr = pcr_1 + pcr_2 + pcr_3 + pcr_4 + pcr_5;
    elseif (num_range == 6)
        total_pcr = pcr_1 + pcr_2 + pcr_3 + pcr_4 + pcr_5 + pcr_6;
    end
    % Original PCR
    %
    % Calibration
    % Wavelength Selection (next 5 lines)
    temp_X_cali = [];
    for ii = 1:2:length(range_0)
        temp_X_cali = [temp_X_cali; X_cali(range_0(ii):range_0(ii+1), :)];
    end
    X_cali = temp_X_cali;
    [W, vl] = pca(X_cali, pcr_final);   % generate loadings (W)
    T = X_cali' * W;                    % compute scores (T)
    X_resi = X_cali - W*T';             % determine residuals

    % Calibration Model
    [m,n] = size(T);
    T = [ones(m,1) T];   % add column of 1's to allow offset to be nonzero
    for ii = 1:pcr_final
        % the +1 is due to the addition of the column of 1's to the scores
        % matrix
        % generate regression model yielding vector of coefficients 'a',
        % done for each number of factors utilized
        a = inv(T(:,1:ii+1)'*T(:,1:ii+1))*T(:,1:ii+1)'*Y_cali_ref;
        Y_cali(:,ii) = T(:,1:ii+1)*a(1:ii+1,1);   % generate fit (Y_cali)
    end
    % Prediction
    % Wavelength Selection (next 5 lines): same as Calibration
    temp_X_pred = [];
    for ii = 1:2:length(range_0)
        temp_X_pred = [temp_X_pred; X_pred(range_0(ii):range_0(ii+1), :)];
    end
    X_pred = temp_X_pred;
    T2 = X_pred' * W;      % generate new score matrix (for prediction)
                           % based upon prediction spectra
    [m,n] = size(T2);
    T2 = [ones(m,1) T2];   % add column of 1's to allow nonzero offset
    for ii = 1:pcr_final
        % regenerate regression model yielding vector of coefficients 'a'
        % using original scores (T)
        a = inv(T(:,1:ii+1)'*T(:,1:ii+1))*T(:,1:ii+1)'*Y_cali_ref;
        % generate predicted values utilizing vector of coefficients 'a'
        % and new score matrix (T2)
        Y_pred(:,ii) = T2(:,1:ii+1)*a(1:ii+1,1);
    end

    % Generate Standard Error of Calibration and Prediction (SEC & SEP)
    SEC = generate_SE_c(Y_cali_ref, Y_cali);
    SEP = generate_SE_c(Y_pred_ref, Y_pred);
    %
    % Modified PCR
    %
    for round = 1:num_range   % Iterative PCR
        % spectra prior to range selection (for plotting)
        eval(['X_cali_' int2str(round) '_orig = X_cali_' int2str(round) ';'])
        % Wavelength Selection (next 6 lines)
        temp_X_cali = [];
        eval(['jj = length(range_' int2str(round) ');'])
        for ii = 1:2:jj
            eval(['temp_X_cali = [temp_X_cali; X_cali_' int2str(round) '(range_' int2str(round) '(ii):range_' int2str(round) '(ii+1), :)];'])
        end
        eval(['X_cali_' int2str(round) ' = temp_X_cali;'])
        % perform pca: Generate loadings W
        eval(['[W' int2str(round) ', vl] = pca(X_cali_' int2str(round) ', pcr_' int2str(round) ');'])
        % compute scores
        eval(['T' int2str(round) ' = X_cali_' int2str(round) ''' * W' int2str(round) ';'])
        % Calculate residuals for use in next iteration
        eval(['X_cali_' int2str(round + 1) ' = X_cali_' int2str(round) ' - W' int2str(round) '*T' int2str(round) ''';'])
    end
    % Concatenate Score Matrices
    scores = '';
    for ii = 1:num_range
        eval(['scores = [scores ''T' int2str(ii) ' ''];'])
    end
    eval(['T_all = [' scores '];'])   % T_all = [T1 T2 T3 ...]
    [m,n] = size(T_all);
    T_all = [ones(m,1) T_all];        % column of 1's to allow nonzero offset

    % Calibration Model
    for ii = 1:(total_pcr)
        % the +1 is due to the addition of the column of 1's
        % to the scores matrix
        % regenerate regression model yielding vector of coefficients 'a'
        a = inv(T_all(:,1:ii+1)'*T_all(:,1:ii+1))*T_all(:,1:ii+1)'*Y_cali_ref;
        % generate fit (Y_cali)
        Y_cali_modified(:,ii) = T_all(:,1:ii+1)*a(1:ii+1,1);
    end

    for round = 1:num_range   % Iterative PCR
        % Wavelength Selection (next 6 lines)
        eval(['temp_X_pred_' int2str(round) ' = [];'])
        eval(['jj = length(range_' int2str(round) ');'])
        for ii = 1:2:jj
            eval(['temp_X_pred_' int2str(round) ' = [temp_X_pred_' int2str(round) '; X_pred_' int2str(round) '(range_' int2str(round) '(ii):range_' int2str(round) '(ii+1), :)];'])
        end
        eval(['X_pred_' int2str(round) ' = temp_X_pred_' int2str(round) ';'])
        % compute scores for prediction
        eval(['T' int2str(round) '_p = X_pred_' int2str(round) ''' * W' int2str(round) ';'])
        % determine residuals
        eval(['X_pred_' int2str(round+1) ' = X_pred_' int2str(round) ' - W' int2str(round) '*T' int2str(round) '_p'';'])
    end
    % Concatenate Scores
    scores = '';
    for ii = 1:num_range
        eval(['scores = [scores ''T' int2str(ii) '_p ''];'])
    end
    eval(['T_all_p = [' scores '];'])
    [m,n] = size(T_all_p);
    T_all_p = [ones(m,1) T_all_p];   % 1's to allow nonzero offset

    % Predictions
    for ii = 1:total_pcr
        % regenerate regression model yielding vector of coefficients 'a'
        a = inv(T_all(:,1:ii+1)'*T_all(:,1:ii+1))*T_all(:,1:ii+1)'*Y_cali_ref;
        % generate predicted values utilizing 'a' and new score matrix
        Y_pred_modified(:,ii) = T_all_p(:,1:ii+1)*a(1:ii+1,1);
    end

    % Generate SEC & SEP
    SEC_new = generate_SE_c(Y_cali_ref, Y_cali_modified);
    SEP_new = generate_SE_c(Y_pred_ref, Y_pred_modified);

    % Generate Summary Matrix
    percent_gain = (min(SEP) - min(SEP_new)) / min(SEP) * 100;
    summary(counter,1) = counter2;
    summary(counter,2) = min(SEP);       % minimum SEP from PCR
    summary(counter,3) = min(SEP_new);   % minimum SEP from modified PCR
    summary(counter,4) = percent_gain;
    clear X* Y* T W
    counter = counter + 1;
    counter2 = counter2 + 1;
end   % noise
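The wavelength-selection vectors in Table 1 (range_0, range_1 and so on) read as flat lists of 1-based, inclusive start/stop index pairs. A minimal Python sketch of that selection, with a hypothetical helper name and assuming the spectra are held as a list of per-wavelength rows, might look like this:

```python
def select_by_pairs(rows, pairs):
    """Keep the rows whose 1-based index lies in any inclusive
    [start, stop] pair; `pairs` is a flat list such as [1, 382, 440, 560]."""
    if len(pairs) % 2:
        raise ValueError("pairs must hold start/stop values in twos")
    kept = []
    for start, stop in zip(pairs[0::2], pairs[1::2]):
        kept.extend(rows[start - 1:stop])   # shift to 0-based, stop inclusive
    return kept
```

Because each later range is applied to the output of the previous selection, the indices in range_2 onward refer to positions in the already reduced matrix, not to the original axis.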
An alternative embodiment of the modified PCR algorithm applies PCR with a set number of factors to all spectral regions requiring that number of factors. A separate PCR model with a second number of factors may be applied to individual wavelengths or spectral regions requiring that number of factors. The process may be repeated such that one to n factors are applied for each spectral region or wavelength. Scores may be concatenated as above with calibration coefficients being generated and predictions being performed as in steps 8 and 9.
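The regression and prediction of steps 8 and 9 can be sketched in Python as below. This is an illustration under stated assumptions: the helper names are hypothetical, the leading column of ones follows the offset term used in the Table 1 listing rather than the formula of step 8, and the normal equations are solved by plain Gaussian elimination instead of an explicit matrix inverse.

```python
def solve(a, b):
    """Solve a s = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("scores matrix is rank deficient")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    s = [0.0] * n
    for r in range(n - 1, -1, -1):
        s[r] = (m[r][n] - sum(m[r][c] * s[c] for c in range(r + 1, n))) / m[r][r]
    return s


def fit_coefficients(t_all, y):
    """Step 8: s = (T^T T)^-1 T^T y, with an added column of ones (offset)."""
    t = [[1.0] + list(row) for row in t_all]
    cols = list(zip(*t))
    tt_t = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    tt_y = [sum(p * q for p, q in zip(ci, y)) for ci in cols]
    return solve(tt_t, tt_y)


def predict(t_all_new, s):
    """Step 9: y_pred = T'_all s, where s[0] is the offset."""
    return [s[0] + sum(c * v for c, v in zip(s[1:], row)) for row in t_all_new]
```

For example, `fit_coefficients([[0.0], [1.0], [2.0], [3.0]], [1.0, 3.0, 5.0, 7.0])` recovers an offset near 1 and a slope near 2.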
In the following example, computer simulated near-IR spectral data sets of serum are utilized to demonstrate the feasibility of the combinative PCR algorithm described above.
EXAMPLE
Phantom Serum Spectra Generation: Near-IR absorbance spectra of water, albumin, triglycerides, cholesterol, glucose and urea with a concentration of 1 g/dL at 37.0 °C and a 1 mm pathlength were generated from spectra collected on a NICOLET 860 IR Spectrometer, supplied by Nicolet Instrument Corporation of Madison, WI, in transmission mode, followed by multivariate curve resolution analysis. The pure component spectra were used to generate phantom serum spectra by adding together the absorbances of the constituent components, where the concentration of each constituent was randomly selected from the concentration ranges in Table 2, below. Noise proportional to the resulting spectral absorbance at each wavelength was then added, with the standard deviation of the added noise being a percentage of the total absorbance, thus yielding spectra with increased noise levels at higher absorbance levels.
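The generation procedure just described can be sketched in Python as follows. The function names are hypothetical, and two details are assumptions not stated in the text: concentrations are drawn uniformly within each range, and the added noise is Gaussian.

```python
import random


def random_concentrations(ranges, rng):
    """Draw one concentration per constituent (uniform draw is an assumption)."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}


def phantom_spectrum(pure_spectra, concentrations, noise_fraction, rng):
    """pure_spectra maps constituent -> absorbance per 1 g/dL at each wavelength.
    The clean spectrum is the concentration-weighted sum; noise with standard
    deviation noise_fraction * absorbance is then added at each wavelength
    (Gaussian noise is an assumption)."""
    n = len(next(iter(pure_spectra.values())))
    clean = [
        sum(concentrations[name] * spec[j] for name, spec in pure_spectra.items())
        for j in range(n)
    ]
    return [a + rng.gauss(0.0, noise_fraction * abs(a)) for a in clean]
```

With `noise_fraction = 0.002` this corresponds to the noise level quoted for the calibration and monitoring sets; passing a seeded `random.Random` instance keeps a generated set reproducible.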
TABLE 2 - Concentration Range of Constituents Used in Serum Phantom.
Calibration, Monitoring and Prediction Set Generation: A total of 60 spectra were generated for a calibration data set and 20 spectra were generated for a monitoring data set utilized in the wavelength selection process. The standard deviation of the applied noise levels for the calibration, and monitoring data sets is 0.002 times the absorbance level at each wavelength. A random selection of these generated spectra is presented in Figure 2. The well-known near-IR water absorbance bands are observed at 1450, 1950 and 2500 nm. Upon detailed inspection, some of the fat bands at 1167, 1210, 1391, 1413, 1724, 1760, 2123, 2144, 2307, 2347 and 2380nm may be identified along with some of the protein absorbance bands at 1690, 1730, 2170 and 2285 nm. The glucose, cholesterol and urea absorbance bands are not obvious in the raw spectra.
Prediction spectra consisted of 80 independent spectra at each of 30 noise levels where one standard deviation of the noise level varied from 0.0002 to 0.006 times the absorbance. For each of these 30 noise levels, 50 sets of 80 independent spectra were generated. In no case were parameters optimized with the prediction data sets. In addition, no human feedback from the predictions was utilized in optimization of the parameters for building the model with the calibration and monitoring data sets. Results from predictions represent the first and only analysis of the data. The thirty separate standard errors of prediction (SEP) reported at noise levels from 0.0002 to 0.006 are the mean SEPs of the 50 independent prediction sets at each of the noise levels.
Calibration: For comparative purposes, the conventional PCR algorithm was utilized to analyze the generated spectral data sets. For the standard PCR algorithm, wavelength optimization through wavelength selection was performed on the calibration and monitoring data sets, with the standard error of the monitoring data set used as a metric for optimization. Wavelength optimization for the standard PCR algorithm resulted in the spectral ranges of 1100 to 1862 and 1978 to 2218 nm being selected, which, as would be expected, corresponds to removal of the high water absorbance regions about 1900 and 2500 nm.
For the inventive PCR algorithm, wavelength optimization was again performed on the calibration and monitoring data sets, with the standard error of prediction of the monitoring data set used as the metric for optimization. However, using the inventive PCR algorithm, wavelength optimization was performed with each PCR iteration. In the first iteration, a total of three PCR factors were utilized with the spectral regions 1100 to 1886 and 1980 to 2378 nm. Notably, standard PCR utilized a long wavelength cutoff of 2218 nm that excluded the sharply increasing water absorbance band that leads to higher noise levels, while the first iteration of the modified PCR algorithm had a long wavelength cutoff at 2378 nm that includes more of this high noise region. Likewise, traditional PCR removed a larger region about the 1950 nm water absorbance band compared to the first iteration of the modified PCR algorithm. The modified PCR algorithm is able to incorporate these noisy regions, since only a three-factor model is utilized in these regions.
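The index pairs range_0 and range_1 in Table 1 line up with the wavelength limits quoted for standard PCR and for the first modified-PCR iteration if the wavelength axis starts at 1100 nm and advances in 2 nm steps. That spacing is an inference from the numbers and is not stated in the text; the short Python check below, with hypothetical helper names, makes the assumption explicit.

```python
START_NM = 1100  # assumed first wavelength on the axis
STEP_NM = 2      # assumed spacing, inferred from the quoted ranges


def index_to_nm(index):
    """Convert a 1-based axis index to a wavelength in nm."""
    return START_NM + (index - 1) * STEP_NM


def pairs_to_nm(pairs):
    """Convert a flat list of start/stop indices to (start_nm, stop_nm) tuples."""
    return [(index_to_nm(a), index_to_nm(b)) for a, b in zip(pairs[0::2], pairs[1::2])]
```

Under this assumption, `pairs_to_nm([1, 382, 440, 560])` gives the 1100 to 1862 and 1978 to 2218 nm limits, and `pairs_to_nm([1, 394, 441, 640])` gives 1100 to 1886 and 1980 to 2378 nm.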
Prior to the second iteration of PCR, a new wavelength selection was performed on the residual spectra from the first PCR iteration. The mean residual of the 60 residual calibration spectra was calculated using Equation 1, where $X_1$ is the data matrix used in the initial PCR iteration and $T_1$ and $V_1$ are the scores and loadings for the initial 3-factor model:
$Resi = \mathrm{mean}(X_1 - T_1 V_1^T)$. (1)
The square of the resulting mean residual spectrum is plotted in Figure 3. In regions 31 , 32, 33 where the square of the mean residual is high, noise is dominant. The large square of the residual serves as one basis for wavelength selection. The loadings provide a second basis for wavelength selection. In Figure 4, the three spectral loadings generated in the first
iteration of PCR are presented. The first loading 41 is dominated by water while the second spectral loading 42 shows considerable structure in the combination band region corresponding to protein. However, the third loading 43 shows considerable noise about 1950 and 2350nm. After combining this loading information with the square of the mean residual information prior to the second PCR iteration, the spectral region from 1980 to 2070nm was removed due to the high noise level, as was the region from 2280 to 2378nm. The second PCR iteration utilized an additional four factors yielding a total of seven factors for the remaining spectral regions. Notably, the remaining spectral region continues to include the glucose absorbance band located at 2272nm that was excluded from the traditional PCR algorithm. The standard PCR algorithm generated its optimal prediction abilities based largely upon the three glucose absorbance bands in the first overtone region from 1500 to 1800nm. However, the first overtone includes smaller and broader absorption features that require additional factors to be properly modeled. The large number of factors required for this region, plus the limitation of a fixed number of factors for all wavelengths imposed by conventional PCR, necessitated the removal of some of the glucose containing information in the combination band region to avoid later factors adding undue amounts of noise into the model from the combination band region.
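One of the two selection criteria above, the squared mean residual, lends itself to a small Python sketch. The helper names and the fixed threshold are assumptions for illustration; in the text the regions were chosen by combining this information with inspection of the loadings, not by a single cutoff.

```python
def squared_mean_residual(residual_rows):
    """Column-wise mean of the residual spectra (samples x wavelengths),
    squared, following the form of Equation 1."""
    n = len(residual_rows)
    return [(sum(col) / n) ** 2 for col in zip(*residual_rows)]


def noisy_regions(values, threshold):
    """Return inclusive (start, stop) 0-based index runs where values
    exceed the (assumed) threshold."""
    runs, start = [], None
    for i, v in enumerate(values):
        if v > threshold and start is None:
            start = i
        elif v <= threshold and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(values) - 1))
    return runs
```

The returned runs are candidates for removal before the next PCR iteration.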
The start and stop wavelength pairs utilized in the modified PCR algorithm for each PCR iteration, along with the total number of factors for each iteration, are summarized in Table 3, below. Later iterations gradually widen the removal of the water absorbance bands at 1950 and 2500 nm. In addition, the entire combination band and the peak of the water absorbance band at 1450 nm are removed after the first 3 iterations.
Table 3: Wavelength Selection at each PCR Iteration for Calibration and Monitoring Spectra.
For those skilled in the art of analyzing near-IR spectra of aqueous solutions having small analytical signals, the calibration exemplifies several well-known phenomena. It is generally known that signal, noise and pathlength considerations often dictate that the signal-rich combination band spectral region from 2000 to 2500 nm be excluded from multivariate analyses that include the first overtone region from 1450 to 1900 nm. Many additional factors are required to fully model the smaller and more overlapped analytical signals in the first overtone region, with the result that the combination band spectral region from 2000 to 2500 nm would be over-modeled were it to be included. Such limitation is dictated by traditional multivariate methods that require a single number of factors for the entire spectral region being analyzed. By contrast, the inventive PCR algorithm allows signal in a noisy region such as the combination band to be analyzed at the same time as signal in a more complex region like the first overtone. As the following discussion reveals, applying the inventive algorithm leads to smaller standard errors of prediction (SEP).
Prediction: Using the calibrations developed with the conventional PCR algorithm and the inventive PCR algorithm, prediction results were obtained
on the generated prediction sets. As previously indicated, in the prediction data sets, 50 sets of 80 spectra were generated at each of 30 noise levels from 0.0002 to 0.006 times the absorbance level. The mean SEP for the original PCR 51 and modified PCR algorithm 52 are presented in Figure 5 for each of the noise levels tested. In all cases, the mean SEP of the modified PCR algorithm was lower than that of the standard PCR algorithm. In comparing the traditional PCR results to the modified PCR results, it is evident that percent relative increase in error by the traditional PCR algorithm varied from 17 to 45%, Figure 6. Figures 5 and 6 clearly show that the inventive PCR algorithm resulted in smaller prediction errors compared to the conventional PCR algorithm. At low noise levels where the spectral matrix and analytical signal is readily modeled, the two PCR approaches produce similar results. As well, at very high noise levels where no analytical signal is modeled (2 times the standard error approximating the mean of the analytical concentration) the two PCR approaches perform similarly.
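For reference, the comparison metric can be sketched in Python. The source does not define the generate_SE_c helper called in Table 1, so the root-mean-square form of the standard error used here is an assumption; the percent reduction follows the percent_gain expression in Table 1.

```python
import math


def standard_error(y_ref, y_pred):
    """Root-mean-square difference between reference and predicted
    concentrations (assumed definition of SEC/SEP)."""
    if not y_ref or len(y_ref) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(y_ref, y_pred)) / len(y_ref))


def percent_reduction(sep_standard, sep_modified):
    """Percent reduction in SEP of modified PCR relative to standard PCR."""
    return (sep_standard - sep_modified) / sep_standard * 100.0
```

A positive `percent_reduction` means the modified algorithm produced the lower SEP.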
The observed SEP versus the number of factors used to model the system classically decreases with initial factors to a local minimum and then increases slowly with additional factors. This is due to the calibration model modeling the analytical signal with early factors and increasingly modeling the spectral noise with later factors to the point of over-modeling the system. The observed SEP may be broken up into two components according to equation 2:
$SEP = SE_{sig} + SE_{uncert}$ (2)
where SEP is the standard error of prediction, $SE_{sig}$ is the standard error of the signal and $SE_{uncert}$ is the standard error due to the modeled uncertainty. If no noise or uncorrelated components of the sample were present in the model or in subsequent prediction spectra, the standard error of the modeled signal would continue to decrease with an increasing number of factors until the system was completely modeled as shown by the dashed
and dotted line 70 in Figure 7. However, in generation of the calibration model both signal and uncertainty (noise) are modeled into the system. Additional factors increase the uncertainty modeled into the system. This results in the classical decrease in the SEP with early factors as the signal is modeled, a local minimum in the SEP and finally an increase in the SEP as the system is over-modeled with additional factors.
The cumulative removal of spectral regions that are fully modeled, as in the inventive algorithm, results in additional factors incorporating less uncertainty into the model, as is represented by the dotted 72a, b versus solid 71a, b lines in Figure 7. The modeled uncertainty thus does not increase as rapidly with additional factors. Since the SEP depends on the modeled signal as well as the modeled uncertainty, the resulting SEP also decreases. Thus, the new algorithm provides a lower local minimum in the SEP with additional factors.
Other embodiments of the invention are possible. For example, the approach of PCR followed by wavelength selection, with the subsequent residuals used as input for additional PCR models, is readily extended to other multivariate techniques such as partial least squares (PLS). Additionally, residuals from one multivariate method may be used as inputs for another multivariate method.
The invention finds particular utility in various spectroscopy applications; for example, predicting concentration of analytes such as glucose from noninvasive near-IR spectra performed on live subjects. While the invention has been described herein with respect to near-IR spectroscopy, the methods of the invention are equally applicable to data matrices of any kind. In the chemical arts, spectroscopic techniques may include UV/VIS/NIR/IR as well as techniques such as AA/NMR/MS. Furthermore, the invention is not limited to spectroscopic techniques but may include chromatographic techniques such as GC/LC or combinations of chromatographic and spectroscopic techniques such as GC/MS or GC/IR. Additionally, the invention finds application in almost any field that relies on multivariate analysis techniques, the social sciences, for example.
Although the invention has been described herein with reference to certain preferred embodiments, one skilled in the art will readily appreciate that other applications may be substituted for those set forth herein without departing from the spirit and scope of the present invention. Accordingly, the invention should only be limited by the Claims included below.
Claims
1. A method of developing a calibration for predicting concentration of a target analyte in sample spectra, said method employing factor-based multivariate techniques, wherein a specific number of factors models at least one region of a spectrum independently of the number of factors used to model other spectral regions, thus minimizing prediction error due to over-modeling of noisy spectral regions, said method comprising the steps of:
A. providing a matrix of calibration spectra;
B. modeling at least one spectral region using a first selected number of factors;
C. subtracting said modeled regions from said spectral matrix, so that a residual matrix is generated;
D. modeling at least one region from said residual matrix using a further selected number of factors, wherein said further selected number of factors may be equal to or different from said first selected number;
E. repeating steps B through D, using said residual matrix as an input matrix for step B, until the number of factors employed for all iterations is equal to an optimal number of factors required to model said entire spectrum.
2. The method of Claim 1, wherein said multivariate techniques include Partial Least Squares (PLS) analysis.
3. The method of Claim 1 , wherein said multivariate techniques include Principal Component Regression (PCR).
4. The method of Claim 3, wherein said factors are selected from a matrix of loading vectors and a matrix of scores, said matrix of calibration spectra = $X_1$, said matrix of loading vectors = T, and said scores matrix = V, so that
$X_1 = TV^T$,
and wherein Principal Component Analysis (PCA) is employed to decompose $X_1$ to generate T, and wherein $X_1$ comprises i samples and k wavelengths.
5. The method of Claim 4, wherein said scores matrix, V, is generated by regressing said matrix of loading vectors, T.
6. The method of Claim 5, wherein the number of factors used in the current iteration = $z_n$, with z = the number of factors used for an iteration and n = the current iteration, so that said first number of factors = $z_1$, and wherein step B comprises the steps of:
A. setting $T_n = T_u(:, 1:z_n)$; and
B. setting $V_n = V_u(:, 1:z_n)$, where u = the number of factors in an optimal model.
7. The method of Claim 6, wherein step C comprises the step of:
A. calculating a residual spectral matrix $X_2$ from $X_1$ according to:
$X_{n+1} = X_n - T_n V_n^T$.
8. The method of Claim 7, wherein step H further comprises the step of: A. eliminating at least one high-noise region from said residual spectral matrix.
9. The method of Claim 8, wherein step I comprises the optional step of:
A. calculating a mean residual spectrum according to: $Resi = \mathrm{mean}(X_n - T_n V_n^T)$.
10. The method of Claim 9, wherein step I further comprises any of the steps of: A. squaring said mean residual spectrum and removing at least one large value area; and B. removing at least one region where raw spectra have high absorbance, wherein noise is related to absorbance.
11. The method of Claim 10, further comprising the step of:
A. generating a final scores matrix by concatenating all previous scores matrices according to: $T_{all} = [T_1 + T_2 + T_3 + \ldots + T_\alpha]$.
12. The method of Claim 11 , further comprising the step of:
A. generating a vector of calibration coefficients, s, using Multiple Linear Regression, wherein said vector s constitutes said calibration, according to: $y = Ts$, where y is a vector of i concentrations;
$T^T y = T^T T s$; $(T^T T)^{-1} T^T y = s$; and $s = (T^T T)^{-1} T^T y$.
13. A method of predicting concentration of a target analyte from a prediction data set based on a multivariate calibration, said calibration developed using a combinative, iterative PCR (Principal Component Regression) algorithm, wherein a specific number of factors models at least one region of a spectrum independently of the number of factors used to model other spectral regions, so that prediction error due to over-modeling of noisy spectral regions is minimized, said method comprising the steps of:
A. providing a prediction data set, said prediction data set comprising a matrix of sample spectra;
B. providing a calibration data set, said calibration data set comprising a matrix of calibration spectra;
C. generating said calibration by modeling said calibration data set according to said PCR algorithm; and
D. applying said calibration to said prediction data set, so that a prediction of a target analyte concentration is produced.
14. The method of Claim 13, wherein step C comprises the steps of:
A. modeling at least one spectral region from said calibration matrix using a first selected number of factors;
B. subtracting said modeled regions from said calibration matrix, so that a residual matrix is generated;
C. modeling at least one region from said residual matrix using a further selected number of factors, wherein said further selected number may be equal to or different from said first selected number;
D. repeating steps E through G, using said residual matrix as an input matrix for step B, until the number of factors employed for all iterations is equal to an optimal number of factors required to model said entire spectrum.
15. The method of Claim 14, wherein said factors are selected from a matrix of loading vectors and a matrix of scores, said matrix of calibration spectra = $X_1$, said matrix of loading vectors = T, and said scores matrix = V, so that
$X_1 = TV^T$,
and wherein Principal Component Analysis (PCA) is employed to decompose $X_1$ to generate T, and wherein $X_1$ comprises i samples and k wavelengths.
16. The method of Claim 15, wherein said scores matrix, V, is generated by regressing said matrix of loading vectors, T.
17. The method of Claim 16, wherein the number of factors used in the current iteration = $z_n$, with $n$ = the current iteration and $z$ = the number of factors used in an iteration, and wherein step E comprises the steps of:
A. setting $T_n = T_u(:, 1:z_n)$; and
B. setting $V_n = V_u(:, 1:z_n)$, where $u$ = the number of factors in an optimal model.
18. The method of Claim 17, wherein step F comprises the step of:
A. calculating a residual spectral matrix $X_2$ from $X_1$ according to:
$X_{n+1} = X_n - T_n V_n^T$.
19. The method of Claim 18, wherein step F further comprises the step of:
A. eliminating at least one high-noise region from said residual spectral matrix.
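A single pass through Claims 17 to 19 might look like the sketch below. It assumes that "eliminating a high-noise region" means dropping wavelength columns from the residual, and that the columns to keep are supplied by the caller. The function `pcr_iteration` and the argument `keep_cols` are hypothetical names, not part of the source.

```python
import numpy as np

def pcr_iteration(X_n, z_n, u, keep_cols):
    """One pass: model X_n with z_n of u factors, subtract, drop noisy columns."""
    # Decompose the current matrix; rows are samples, columns are wavelengths.
    U, sing, Vt = np.linalg.svd(X_n, full_matrices=False)
    T_u = U[:, :u] * sing[:u]        # factors of the u-factor model
    V_u = Vt[:u].T
    T_n = T_u[:, :z_n]               # T_n = T_u(:, 1:z_n)
    V_n = V_u[:, :z_n]               # V_n = V_u(:, 1:z_n)
    X_next = X_n - T_n @ V_n.T       # X_{n+1} = X_n - T_n V_n^T
    # Keep only the columns judged low-noise (assumed to be given).
    return T_n, V_n, X_next[:, keep_cols]

rng = np.random.default_rng(1)
X1 = rng.normal(size=(6, 10))
T1, V1, X2 = pcr_iteration(X1, z_n=2, u=4, keep_cols=np.arange(7))
```

The residual returned here is orthogonal to the factors just removed, so a later iteration models only variation that the first `z_n` factors left unexplained.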
20. The method of Claim 19, wherein step L comprises the optional step of:
A. calculating a mean residual spectrum according to: $\mathrm{Res}_i = \mathrm{mean}(X_n - T_n V_n^T)$.
21. The method of Claim 20, wherein step L comprises any of the steps of:
A. squaring said mean residual spectrum and removing at least one large value area; and
B. removing at least one region where raw spectra have high absorbance, wherein noise is related to absorbance.
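One way to read Claims 20 and 21 is sketched below: average the residual over samples, square it, and keep the wavelengths whose squared mean residual stays small. The claims do not say how a "large value area" is identified, so the median-based threshold and the `factor` parameter are assumptions, as is the function name `low_noise_columns`.

```python
import numpy as np

def low_noise_columns(X_n, T_n, V_n, factor=3.0):
    """Return indices of wavelengths whose squared mean residual is small."""
    # Mean residual spectrum: average over samples (rows), one value per wavelength.
    res = (X_n - T_n @ V_n.T).mean(axis=0)
    res_sq = res ** 2
    # Hypothetical rule: flag anything above `factor` times the median as large.
    threshold = factor * np.median(res_sq)
    return np.flatnonzero(res_sq <= threshold)

# Toy residual: uniformly small except for one noisy wavelength (column 4).
X_n = np.full((5, 6), 0.01)
X_n[:, 4] = 1.0
T_n = np.zeros((5, 1))
V_n = np.zeros((6, 1))
keep = low_noise_columns(X_n, T_n, V_n)
```

In this toy case only column 4 exceeds the threshold, so `keep` contains every other column index. The absorbance-based removal in step B of Claim 21 would need a separate rule applied to the raw spectra.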
22. The method of Claim 21, further comprising the step of:
A. generating a final scores matrix by concatenating all previous scores matrices according to:
$T_{all} = [T_1 + T_2 + T_3 + T_n]$.
23. The method of Claim 22, further comprising the step of:
A. generating a vector of calibration coefficients, $s$, using Multiple Linear Regression, wherein said vector $s$ constitutes said calibration model, according to: $y = Ts$, where $y$ is a vector of $i$ concentrations;
$T^T y = T^T T s$;
$(T^T T)^{-1} T^T y = s$; and $s = (T^T T)^{-1} T^T y$.
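The regression step of Claims 22 and 23 can be sketched as follows, reading the bracketed expression for the final scores matrix as column-wise concatenation (an assumption based on the word "concatenating"). The function name `calibration_coefficients` and the synthetic data are hypothetical.

```python
import numpy as np

def calibration_coefficients(score_blocks, y):
    """Concatenate per-iteration scores and solve y = T s by the normal equations."""
    T_all = np.hstack(score_blocks)                     # i x (z_1 + z_2 + ...)
    # s = (T^T T)^-1 T^T y, computed without forming the explicit inverse.
    s = np.linalg.solve(T_all.T @ T_all, T_all.T @ y)
    return T_all, s

rng = np.random.default_rng(2)
T1 = rng.normal(size=(10, 2))
T2 = rng.normal(size=(10, 1))
s_true = np.array([1.0, -2.0, 0.5])
y = np.hstack([T1, T2]) @ s_true
T_all, s = calibration_coefficients([T1, T2], y)
```

Because `y` is constructed exactly from `s_true`, the solver recovers it. For ill-conditioned score matrices, `np.linalg.lstsq` is usually a more stable way to solve the same least-squares problem.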
24. The method of Claim 23, wherein step D comprises the steps of:
A. generating a scores matrix $T'$ for said matrix of sample spectra, where said sample matrix = $X'$, according to: $T'_1 = X' V_1$;
A. computing a residual matrix for said sample matrix according to:
A. removing at least one high noise region from said residual of said sample matrix;
B. repeating steps R through T, using said residual of said sample matrix with said high noise regions removed as an input matrix for step R, until a total number of factors employed for all iterations is equal to the said optimal number of factors employed for said calibration set.
25. The method of Claim 24, wherein step D further comprises the step of:
A. generating a final scores matrix, $T'_{all}$, for said samples matrix by concatenating all previous scores matrices according to: $T'_{all} = [T'_1 + T'_2 + T'_3 + T'_n]$.
26. The method of Claim 25, wherein step D further comprises the step of:
A. predicting concentration of said target analyte, $y_{pred}$, according to: $y_{pred} = (T'_{all})s$.
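The prediction steps of Claims 24 to 26 can be sketched as a loop that reuses the matrices $V_n$ and the retained columns from calibration. The sketch assumes that each $V_n$ has orthonormal columns, that the residual is formed as the input minus $T'_n V_n^T$, and that concatenation is column-wise. The names `predict`, `V_list`, and `keep_list` are hypothetical.

```python
import numpy as np

def predict(X_new, V_list, keep_list, s):
    """Project new spectra through each calibration stage, then apply s."""
    X_n = X_new
    blocks = []
    for V_n, keep in zip(V_list, keep_list):
        T_n = X_n @ V_n                      # scores for this stage: T'_n = X'_n V_n
        blocks.append(T_n)
        X_n = (X_n - T_n @ V_n.T)[:, keep]   # residual with noisy columns removed
    T_all = np.hstack(blocks)                # concatenated scores
    return T_all @ s                         # y_pred = T'_all s

# Toy calibration artifacts (assumed): axis-aligned factors for easy checking.
rng = np.random.default_rng(3)
X_new = rng.normal(size=(4, 6))
V_list = [np.eye(6)[:, :2], np.eye(3)[:, :1]]
keep_list = [[2, 3, 4], [0, 1, 2]]
s = np.array([1.0, 2.0, 3.0])
y_pred = predict(X_new, V_list, keep_list, s)
```

With these axis-aligned factors the scores are simply the first three columns of `X_new`, which makes the result easy to verify by hand.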
27. An iterative, combinative PCR (Principal Component Regression) algorithm for modeling a data set, said data set comprising a data matrix, said data matrix comprising a plurality of samples and a plurality of variables, wherein a specific number of factors models at least one region of said data matrix independently of the number of factors used to model other regions of said data matrix, thus minimizing over-modeling of noisy regions present in said data matrix, said algorithm comprising the steps of:
A. providing said data matrix;
B. modeling at least one region of said data matrix, a region constituting at least one of said plurality of variables, using a selected number of factors;
C. subtracting said modeled regions from said data matrix, so that a residual matrix is generated;
D. modeling at least one further region from said residual matrix using a further selected number of factors, wherein said further selected number may be equal to or different from said first number;
E. repeating steps B through D, using said residual matrix as an input matrix for step B, until the number of factors employed for all iterations is equal to an optimal number of factors required to model said entire data set.
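The loop in Claim 27, including its stopping condition, can be sketched as below. The per-iteration factor counts and retained columns are passed in as a `plan`, which is a hypothetical structure; the claims do not prescribe how these choices are made. The SVD-based decomposition without mean-centering is likewise an assumption.

```python
import numpy as np

def combinative_pcr_scores(X, plan, u):
    """Iterate model/subtract/trim until the factors used sum to u.

    plan: list of (z_n, keep_cols) pairs, one per iteration (hypothetical).
    """
    if sum(z for z, _ in plan) != u:
        raise ValueError("factors across iterations must total u")
    X_n, T_list, V_list = X, [], []
    for z_n, keep in plan:
        U, sing, Vt = np.linalg.svd(X_n, full_matrices=False)
        T_n = U[:, :z_n] * sing[:z_n]        # model the current matrix with z_n factors
        V_n = Vt[:z_n].T
        T_list.append(T_n)
        V_list.append(V_n)
        X_n = (X_n - T_n @ V_n.T)[:, keep]   # residual feeds the next iteration
    return T_list, V_list

rng = np.random.default_rng(4)
X = rng.normal(size=(8, 12))
# Two iterations: 2 factors on all 12 variables, then 1 factor on the first 8.
T_list, V_list = combinative_pcr_scores(
    X, [(2, np.arange(8)), (1, np.arange(8))], u=3)
# Without trimming, the same loop reproduces ordinary 3-factor PCA scores.
T_plain, _ = combinative_pcr_scores(
    X, [(2, np.arange(12)), (1, np.arange(12))], u=3)
```

When no columns are removed, the concatenated scores match a conventional three-factor decomposition up to sign. Any difference from ordinary PCR therefore comes from the variables dropped between iterations.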
28. The algorithm of Claim 27, wherein said factors are selected from a matrix of loading vectors and a matrix of scores, said data matrix = $X_1$, said matrix of loading vectors = $T$, and said scores matrix = $V$, so that
$X_1 = TV^T$,
and wherein Principal Component Analysis (PCA) is employed to decompose $X_1$ to generate $T$, and wherein $X_1$ comprises $i$ samples and $k$ variables.
29. The algorithm of Claim 28, wherein said scores matrix, V, is generated by regressing said matrix of loading vectors, T.
30. The algorithm of Claim 29, wherein the number of factors used in the current iteration = $z_n$, with $z$ = the number of factors used in an iteration, and $n$ = the current iteration, so that said first number of factors = $z_1$, and wherein step B comprises the steps of:
A. setting $T_n = T_u(:, 1:z_n)$; and
B. setting $V_n = V_u(:, 1:z_n)$, where $u$ = the number of factors in an optimal model.
31. The algorithm of Claim 30, wherein step C comprises the step of:
A. calculating a residual matrix $X_2$ from $X_1$ according to:
$X_{n+1} = X_n - T_n V_n^T$.
32. The algorithm of Claim 31, wherein step H further comprises the step of:
A. eliminating at least one high-noise region from said residual matrix.
33. The algorithm of Claim 32, wherein step I comprises the optional step of:
A. calculating a mean residual matrix according to:
$\mathrm{Res}_i = \mathrm{mean}(X_n - T_n V_n^T)$.
34. The algorithm of Claim 33, wherein step I further comprises the step of:
A. squaring said mean residual matrix and removing at least one large value region.
35. The algorithm of Claim 33, further comprising the step of:
A. generating a final scores matrix by concatenating all previous scores matrices according to:
$T_{all} = [T_1 + T_2 + T_3 + T_n]$.
36. The algorithm of Claim 35, further comprising the step of:
A. generating a vector of calibration coefficients, $s$, using Multiple Linear Regression, wherein said vector, $s$, constitutes a calibration, according to: $y = Ts$, where $y$ is a vector of $i$ known values related to a predetermined parameter within said data set, said data set constituting a calibration data set;
$T^T y = T^T T s$; $(T^T T)^{-1} T^T y = s$; and $s = (T^T T)^{-1} T^T y$.
37. The algorithm of Claim 36, further comprising the step of:
A. applying said calibration to a prediction data set, said prediction data set constituting a matrix, wherein actual values for said predetermined parameter are unknown, to generate a vector, $y_{pred}$, of predicted values for said parameter.
38. The algorithm of Claim 37, wherein step N comprises the steps of:
A. generating a scores matrix $T'$ for said matrix of prediction data, where said matrix of prediction data = $X'$, according to: $T'_1 = X' V_1$;
A. computing a residual matrix for said matrix $X'$ according to:
$X'_2 = X' - T'_1 V_1^T$;
A. removing at least one high noise region from said residual of said matrix $X'$; and
B. repeating steps O through Q, using said generated residual with said high noise regions removed as an input matrix for step V, until a total number of factors employed for all iterations is equal to the said optimal number of factors employed for said calibration set.
39. The algorithm of Claim 38, wherein step N further comprises the step of: A. generating a final scores matrix, T'aU, for said prediction data set by concatenating all previous scores matrices according to:
40. The algorithm of Claim 39, wherein step N further comprises the step of:
A. generating said vector, $y_{pred}$, according to: $y_{pred} = (T'_{all})s$.
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/630,201 US6871169B1 (en) | 1997-08-14 | 2000-08-01 | Combinative multivariate calibration that enhances prediction ability through removal of over-modeled regions |
US630201 | 2000-08-01 | ||
PCT/US2001/021703 WO2002010726A2 (en) | 2000-08-01 | 2001-07-09 | Combinative multivariate calibration that enhances prediction ability through removal of over-modeled regions |
Publications (1)
Publication Number | Publication Date |
---|---|
EP1305600A2 (en) | 2003-05-02 |
Family
ID=24526210
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP01952581A Withdrawn EP1305600A2 (en) | 2000-08-01 | 2001-07-09 | Combinative multivariate calibration that enhances prediction ability through removal of over-modeled regions |
Country Status (4)
Country | Link |
---|---|
EP (1) | EP1305600A2 (en) |
AU (1) | AU2001273314A1 (en) |
TW (1) | TW576918B (en) |
WO (1) | WO2002010726A2 (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102192889B (en) * | 2010-03-08 | 2012-11-21 | 上海富科思分析仪器有限公司 | Correction method for UV-visible absorption spectrum of fiber in-situ medicine leaching degree/releasing degree tester |
CN105203498A (en) * | 2015-09-11 | 2015-12-30 | 天津工业大学 | Near infrared spectrum variable selection method based on LASSO |
CN113703282B (en) * | 2021-08-02 | 2022-09-06 | 联芯集成电路制造(厦门)有限公司 | Method for correcting thermal expansion of mask |
CN113607683B (en) * | 2021-08-09 | 2024-09-06 | 天津九光科技发展有限责任公司 | Automatic modeling method for near infrared spectrum quantitative analysis |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DK39792D0 (en) * | 1992-03-25 | 1992-03-25 | Foss Electric As | PROCEDURE FOR DETERMINING A COMPONENT |
US5379764A (en) * | 1992-12-09 | 1995-01-10 | Diasense, Inc. | Non-invasive determination of analyte concentration in body of mammals |
US6040578A (en) * | 1996-02-02 | 2000-03-21 | Instrumentation Metrics, Inc. | Method and apparatus for multi-spectral analysis of organic blood analytes in noninvasive infrared spectroscopy |
-
2001
- 2001-07-09 EP EP01952581A patent/EP1305600A2/en not_active Withdrawn
- 2001-07-09 WO PCT/US2001/021703 patent/WO2002010726A2/en active Application Filing
- 2001-07-09 AU AU2001273314A patent/AU2001273314A1/en not_active Abandoned
- 2001-07-11 TW TW90117009A patent/TW576918B/en not_active IP Right Cessation
Non-Patent Citations (1)
Title |
---|
See references of WO0210726A3 * |
Also Published As
Publication number | Publication date |
---|---|
WO2002010726A3 (en) | 2002-04-25 |
WO2002010726A2 (en) | 2002-02-07 |
AU2001273314A1 (en) | 2002-02-13 |
TW576918B (en) | 2004-02-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase | Free format text: ORIGINAL CODE: 0009012 |
 | 17P | Request for examination filed | Effective date: 20030213 |
 | AK | Designated contracting states | Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE TR |
 | AX | Request for extension of the european patent | Extension state: AL LT LV MK RO SI |
 | RAP1 | Party data changed (applicant data changed or rights of an application transferred) | Owner name: SENSYS MEDICAL, INC. |
 | 17Q | First examination report despatched | Effective date: 20080630 |
 | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
 | 18D | Application deemed to be withdrawn | Effective date: 20081111 |