CN102854151B

CN102854151B - Chemometrics method for classifying sample sets in spectrum analysis

Info

Publication number: CN102854151B
Application number: CN201210375066.XA
Authority: CN
Inventors: 陈华舟
Original assignee: Guilin University of Technology
Current assignee: Guilin University of Technology
Priority date: 2012-10-06
Filing date: 2012-10-06
Publication date: 2014-07-30
Anticipated expiration: 2032-10-06
Also published as: CN102854151A

Abstract

The invention discloses a chemometrics method for partitioning sample sets in spectral analysis. The correlation coefficient between the reference chemical value and the spectral absorbance is calculated at each wavelength, and the wavelength point with the highest correlation coefficient over the full spectral range is found. The reference chemical values and the spectral data of each sample are normalized. Based on the normalized data, the two samples with the maximum and minimum reference chemical values and the two samples with the maximum and minimum absorbance are placed in the calibration set, and the four samples with the corresponding second-largest and second-smallest values are placed in the prediction set. The remaining samples are randomly partitioned a sufficient number of times; for each partition, the correlation coefficient between chemical value and absorbance is calculated separately for the calibration set and the prediction set, and a partition whose calibration-set correlation coefficient is sufficiently close to its prediction-set correlation coefficient is selected for establishing the spectral analysis model. The method provides a good data foundation for optimizing spectral analysis models.

Description

A chemometrics method for partitioning sample sets in spectral analysis

Technical field

The present invention relates to the technical field of sample set partitioning in spectral analysis, and in particular to a chemometrics method for dividing samples into a calibration set and a prediction set.

Background art

Spectral analysis is a method for determining the chemical composition of a substance and its content from the substance's spectrum, by qualitative or quantitative analysis; it is fast, sensitive, and non-invasive. Spectral analyses in current use mainly include infrared, ultraviolet, and Raman spectral analysis. Near-infrared (NIR) spectral analysis in particular, being fast and simple, non-destructive, capable of real-time online measurement and of detecting multiple components simultaneously, has application advantages in fields such as environment, food, agriculture, and biomedicine.

Spectral analysis requires all samples to be analyzed to be divided into a calibration set and a prediction set. A calibration model is first established from the reference chemical values (C) and spectral absorbances (A) of the calibration set samples. The model is then applied to the spectral absorbances of the prediction set samples to calculate predicted component contents, and the predictive performance of the model is evaluated by comparing these predicted values with the reference chemical values of the prediction set samples. The calibration model is thus established, optimized, and evaluated on the basis of the samples' spectral absorbance and reference chemical value data. During spectral measurement, however, factors such as the experimental environment, operator skill, and instrument accuracy may introduce noise such as drift and tilt into the spectral absorbance. Likewise, conventional chemical measurement methods inevitably introduce system noise, environmental noise, operational noise, and so on into the chemical values. The resulting measurement errors make it difficult for the calibration model to achieve the desired predictive performance.

Experiments show that, because of these various noises, different partitions of the calibration set and prediction set can cause large changes in the model's predictive performance, and the model parameters (such as the spectral band used, the smoothing mode, and the number of PLS factors) are also affected. To find a good partition and improve the correlation of the model, a key research topic in spectral analysis is how to select, during the partitioning, a wavelength point with a higher signal-to-noise ratio and to use it as the basis for a good partition of the calibration set and prediction set.

Summary of the invention

The technical problem to be solved by the invention is to provide a chemometrics method for partitioning samples for spectral analysis, which prepares good data for the model optimization of spectral analysis. The method is applicable to spectral analysis fields such as ultraviolet-visible (UV), near-infrared (NIR), mid-infrared (MIR), and Raman, and has been verified in the NIR analysis of pomelo peel pectin, the NIR analysis of soil organic matter total nitrogen, the MIR analysis of the chemical oxygen demand of wastewater, and the MIR analysis of blood hemoglobin.

The specific steps are as follows:

1) Data normalization

A) Normalization of the reference chemical values

$$C_{m} = \frac{1}{N} \sum_{j=1}^{N} C_{j} \tag{1}$$

$$\operatorname{norm}(C_{j}) = \frac{C_{j}}{\sqrt{\sum_{j=1}^{N} (C_{j} - C_{m})^{2}}} \overset{\Delta}{=} C_{n}(j), \quad j = 1, 2, \ldots, N \tag{2}$$

B) Normalization of the spectral data

$$A_{i,m} = \frac{1}{N} \sum_{j=1}^{N} A_{ij}, \quad i = 1, 2, \ldots, P \tag{3}$$

$$\operatorname{norm}(A_{ij}) = \frac{A_{ij}}{\sqrt{\sum_{j=1}^{N} (A_{ij} - A_{i,m})^{2}}}, \quad i = 1, 2, \ldots, P, \quad j = 1, 2, \ldots, N \tag{4}$$

$$|A_{j}| = \sqrt{\sum_{i=1}^{P} (\operatorname{norm}(A_{ij}))^{2}} \overset{\Delta}{=} A_{n}(j), \quad j = 1, 2, \ldots, N \tag{5}$$

where N is the number of samples and P is the number of wavelength points; C_j is the reference chemical value of sample j, C_m is the mean reference chemical value of all samples, and C_n(j) = norm(C_j) is the reference chemical value of sample j after normalization; A_ij is the absorbance of sample j at the i-th wavelength, A_{i,m} is the mean absorbance of all N samples at the i-th wavelength, and norm(A_ij) is the normalized absorbance of sample j at the i-th wavelength; A_n(j) = |A_j| is the modulus of the absorbance vector of sample j;

After the above normalization of the reference chemical values and absorbances, each sample has a corresponding C_n(j) and A_n(j). In accordance with the Lambert-Beer law, a regression based on the C_n(j) and A_n(j) of all samples (j = 1, 2, ..., N) gives the predicted chemical value C'_n(j) of each sample. The regression deviation of the normalized data (RDND) of each sample is then calculated, followed by the mean of RDND over all samples, RDND_ave;

$$\mathrm{RDND}(j) = |C'_{n}(j) - C_{n}(j)| \tag{6}$$
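Step 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not spell out the regression, so a straight-line least-squares fit of C_n(j) on A_n(j) is assumed here, and the function names (`normalize_reference`, `spectrum_norms`, `rdnd`) are hypothetical. The spectra are stored as `A[j][i]` (sample j, wavelength i), which is the transpose of the A_ij index order used in the formulas.

```python
import math

def normalize_reference(C):
    """Eq. (1)-(2): C_n(j) = C_j / sqrt(sum_j (C_j - C_m)^2)."""
    C_m = sum(C) / len(C)
    scale = math.sqrt(sum((c - C_m) ** 2 for c in C))
    return [c / scale for c in C]

def spectrum_norms(A):
    """Eq. (3)-(5): modulus A_n(j) of each sample's normalized spectrum.

    A[j][i] is the absorbance of sample j at wavelength point i.
    """
    N, P = len(A), len(A[0])
    scales = []
    for i in range(P):
        column = [A[j][i] for j in range(N)]
        mean_i = sum(column) / N                      # A_{i,m}, Eq. (3)
        scales.append(math.sqrt(sum((a - mean_i) ** 2 for a in column)))
    return [math.sqrt(sum((A[j][i] / scales[i]) ** 2 for i in range(P)))
            for j in range(N)]

def rdnd(C_n, A_n):
    """Eq. (6) per sample, plus the mean RDND_ave.

    ASSUMPTION: C'_n(j) comes from an ordinary least-squares line
    C_n ~ slope * A_n + intercept; the source only says "regression".
    """
    N = len(C_n)
    mean_a, mean_c = sum(A_n) / N, sum(C_n) / N
    sxx = sum((a - mean_a) ** 2 for a in A_n)
    sxy = sum((a - mean_a) * (c - mean_c) for a, c in zip(A_n, C_n))
    slope = sxy / sxx
    intercept = mean_c - slope * mean_a
    deviations = [abs(slope * a + intercept - c) for a, c in zip(A_n, C_n)]
    return deviations, sum(deviations) / N
```

A wavelength point whose absorbance is identical for every sample would make the scale in Eq. (4) zero; the sketch does not guard against that case.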

2) Partition of the extreme-value and second-extreme-value samples

To ensure that the calibration prediction model has a good correlation, in principle the 2 samples with the maximum and minimum C_n(j) and the 2 samples with the maximum and minimum A_n(j) should be placed in the calibration set, and the 2 samples with the second-largest and second-smallest C_n(j) and the 2 samples with the second-largest and second-smallest A_n(j) should be placed in the prediction set. Some of the samples selected in this way may coincide, however, and must be handled accordingly. The specific procedure is as follows:

The 2 samples with the maximum and minimum C_n(j) and the 2 samples with the maximum and minimum A_n(j) form the extreme-value set, denoted SZ; the 2 samples with the second-largest and second-smallest C_n(j) and the 2 samples with the second-largest and second-smallest A_n(j) form the second-extreme-value set, denoted SC. Initially assume that the samples within SZ and within SC are all distinct, so that each set nominally contains 4 samples. The partition of the extreme-value samples is then determined by considering the intersection of SZ and SC:

If SZ ∩ SC is the empty set, SZ and SC share no samples; all samples of SZ are placed in the calibration set and all samples of SC are placed in the prediction set. The number of repeated samples within SZ, s₁, and the number of repeated samples within SC, s₂, are then recorded, with s₁, s₂ ∈ {0, 1, 2};

If SZ ∩ SC is not the empty set, the number of samples in SZ ∩ SC is recorded as s₃, with s₃ = 1, 2, 3, 4, and the RDND of each sample in SZ ∩ SC is compared with RDND_ave: if the RDND of a sample is greater than RDND_ave, the sample is placed in the calibration set; otherwise it is placed in the prediction set. All samples in SZ ∩ Cs(SC) are then placed in the calibration set and all samples in Cs(SZ) ∩ SC are placed in the prediction set, and the numbers of repeated samples within SZ ∩ Cs(SC) and within Cs(SZ) ∩ SC are recorded as s₁ and s₂ respectively, with s₁, s₂ ∈ {0, 1, 2}; here Cs denotes the complement operator;
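The two cases of step 2 can be handled by one routine, since the empty-intersection case is simply the case s₃ = 0. The sketch below is illustrative only: the function name `partition_extremes` and its argument layout are hypothetical, sample indices are 0-based, and it assumes at least 4 samples with no tied values of C_n(j) or A_n(j).

```python
def partition_extremes(C_n, A_n, rdnd, rdnd_ave):
    """Assign the extreme and second-extreme samples (step 2).

    Returns (calibration, prediction, s1, s2, s3) where the first two
    are sets of 0-based sample indices.
    ASSUMPTION: len(C_n) >= 4 and no ties in C_n or A_n.
    """
    def ascending(values):
        return sorted(range(len(values)), key=lambda j: values[j])

    c_order, a_order = ascending(C_n), ascending(A_n)
    # SZ: max/min of C_n, then max/min of A_n (4 slots, may repeat).
    SZ = [c_order[-1], c_order[0], a_order[-1], a_order[0]]
    # SC: second-largest/second-smallest of C_n, then of A_n.
    SC = [c_order[-2], c_order[1], a_order[-2], a_order[1]]

    shared = set(SZ) & set(SC)                        # SZ ∩ SC
    s3 = len(shared)
    only_sz = [j for j in SZ if j not in shared]      # SZ ∩ Cs(SC)
    only_sc = [j for j in SC if j not in shared]      # Cs(SZ) ∩ SC
    s1 = len(only_sz) - len(set(only_sz))             # repeats inside SZ part
    s2 = len(only_sc) - len(set(only_sc))             # repeats inside SC part

    calibration, prediction = set(only_sz), set(only_sc)
    for j in shared:
        # Shared samples go to calibration only if RDND exceeds the mean.
        if rdnd[j] > rdnd_ave:
            calibration.add(j)
        else:
            prediction.add(j)
    return calibration, prediction, s1, s2, s3
```

With these definitions the number of samples left unassigned is N - 8 + s₁ + s₂ + s₃, matching the count used in step 3.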

3) Partition principle for the remaining samples

After the extreme-value samples have been partitioned, the number of remaining samples is N - 8 + s₁ + s₂ + s₃. The remaining samples are partitioned on the principle of highest correlation: the correlation coefficient R(i) between the spectral data at each wavelength point i and the reference chemical values is calculated,

$$R(i) = \frac{\sum_{j=1}^{N} (C_{j} - C_{m})(A_{ij} - A_{i,m})}{\sqrt{\sum_{j=1}^{N} (C_{j} - C_{m})^{2} \sum_{j=1}^{N} (A_{ij} - A_{i,m})^{2}}}, \quad i = 1, 2, \ldots, P \tag{7}$$

From all wavelength points, the maximum R_note = max{R(i), i = 1, 2, ..., P} is found, and the index i_note of the wavelength point at which R_note occurs is recorded;
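As an illustration of Eq. (7) and the search for R_note, the following sketch computes the correlation coefficient at every wavelength point and returns the best one. The names `pearson` and `best_wavelength` are hypothetical, the spectra are stored as `A[j][i]` (sample j, wavelength i), and the returned index is 0-based, whereas the text counts wavelength points from 1.

```python
import math

def pearson(x, y):
    """Correlation coefficient of two equal-length sequences, as in Eq. (7)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mean_x) ** 2 for a in x)
                    * sum((b - mean_y) ** 2 for b in y))
    return num / den

def best_wavelength(C, A):
    """Return (i_note, R_note) with a 0-based wavelength index.

    Follows the text literally by maximizing R(i) itself, not |R(i)|,
    so a strongly negative correlation is never selected.
    """
    P = len(A[0])
    R = [pearson(C, [row[i] for row in A]) for i in range(P)]
    i_note = max(range(P), key=lambda i: R[i])
    return i_note, R[i_note]
```

Raw (unnormalized) values can be passed in, because the correlation coefficient is unchanged by the scaling in Eq. (2) and Eq. (4).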

The remaining samples are randomly partitioned a sufficient number of times. For each partition, the spectral data {A_note} at the i_note-th wavelength point are taken together with the reference chemical values of the samples, and the correlation coefficients R_Cset and R_Pset are calculated within the calibration set and within the prediction set, respectively;

$$R_{Cset} = \frac{\sum_{j=1}^{L} (C_{L(j)} - C_{Lm})(A_{note,L(j)} - A_{note,Lm})}{\sqrt{\sum_{j=1}^{L} (C_{L(j)} - C_{Lm})^{2} \sum_{j=1}^{L} (A_{note,L(j)} - A_{note,Lm})^{2}}} \tag{8}$$

$$R_{Pset} = \frac{\sum_{j=1}^{K} (C_{K(j)} - C_{Km})(A_{note,K(j)} - A_{note,Km})}{\sqrt{\sum_{j=1}^{K} (C_{K(j)} - C_{Km})^{2} \sum_{j=1}^{K} (A_{note,K(j)} - A_{note,Km})^{2}}} \tag{9}$$

where L and K are the numbers of samples in the calibration set and the prediction set respectively, with L + K = N; C_Lm and C_Km are the mean chemical values of the calibration set and prediction set samples respectively; A_{note,L(j)} is the spectral datum of the j-th calibration set sample at the i_note-th wavelength point, and A_{note,Lm} is the mean spectral datum of the calibration set samples at the i_note-th wavelength point; A_{note,K(j)} is the spectral datum of the j-th prediction set sample at the i_note-th wavelength point, and A_{note,Km} is the mean spectral datum of the prediction set samples at the i_note-th wavelength point;

The absolute deviation between R_Cset and R_Pset, called the absolute offset of correlation coefficients (AOC), is then calculated:

$$\mathrm{AOC} = |R_{Cset} - R_{Pset}| \tag{10}$$

So that the calibration set and the prediction set are similar, AOC < 10^-5 is used as the criterion for selecting a partition with which to establish the NIR spectral analysis model;

With this partitioning method, all samples to be analyzed are divided into a calibration set and a prediction set in a ratio of 2:1; the calculation follows the above process, and a partition satisfying AOC < 10^-5 is selected.
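The random search of step 3 might be sketched as follows. This is a minimal illustration, not the patented implementation: the function `search_partition`, its parameters, the trial limit, and the fixed random seed are all assumptions introduced here. The text only requires "a sufficient number" of random partitions and the criterion AOC < 10^-5, with R_Cset and R_Pset computed over the whole sets, including the extreme-value samples assigned in step 2.

```python
import math
import random

def pearson(x, y):
    """Correlation coefficient, the common form of Eq. (8) and Eq. (9)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mean_x) ** 2 for a in x)
                    * sum((b - mean_y) ** 2 for b in y))
    return num / den

def search_partition(C, a_note, cal_fixed, pred_fixed, n_cal,
                     tol=1e-5, max_trials=100000, seed=0):
    """Randomly split the remaining samples until AOC < tol.

    C          reference chemical values, one per sample
    a_note     absorbance of each sample at wavelength point i_note
    cal_fixed  indices already placed in the calibration set (step 2)
    pred_fixed indices already placed in the prediction set (step 2)
    n_cal      target calibration set size L
    Returns (calibration, prediction, aoc), or None if no trial succeeds.
    max_trials and seed are illustrative choices, not from the source.
    """
    rng = random.Random(seed)
    fixed = set(cal_fixed) | set(pred_fixed)
    rest = [j for j in range(len(C)) if j not in fixed]
    n_extra = n_cal - len(cal_fixed)   # remaining slots in the calibration set
    for _ in range(max_trials):
        rng.shuffle(rest)
        cal = list(cal_fixed) + rest[:n_extra]
        pred = list(pred_fixed) + rest[n_extra:]
        r_cset = pearson([C[j] for j in cal], [a_note[j] for j in cal])
        r_pset = pearson([C[j] for j in pred], [a_note[j] for j in pred])
        aoc = abs(r_cset - r_pset)     # Eq. (10)
        if aoc < tol:
            return sorted(cal), sorted(pred), aoc
    return None
```

Returning `None` when the trial budget is exhausted leaves the caller to decide whether to run more trials; the text itself does not say what to do if no partition meets the criterion.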

The chemometrics method of the invention applies noise reduction, normalization, and correlation processing to the spectral data and reference chemical value data of all samples, and then partitions the samples. Using the normalized, noise-reduced data, spectral absorbance and chemical values are evaluated jointly, and the partition is based on the spectral data at the most highly correlated wavelength point, so that the calibration model has a higher coefficient of determination. At the same time, comparing the internal correlation coefficients of the calibration set and the prediction set ensures that the two sets have a certain degree of similarity in correlation after partitioning. A calibration model established under such a partition can achieve good predictive performance. In this sense, the chemometrics method proposed by the invention provides a good data basis for the model optimization of spectral analysis. The method is applicable to data modeling optimization and model validation systems for infrared, ultraviolet, Raman, and other spectral analyses, and provides good data preparation for model optimization processes such as the selection of continuous wavebands, the selection of discrete wavelength combinations, and the selection of peaks in original and derivative spectra.

Brief description of the drawings

Fig. 1 is the workflow diagram of the chemometrics method for partitioning sample sets in spectral analysis according to the invention.

In the figure, the detailed flow of the method for partitioning the extreme-value and second-extreme-value samples is described in Fig. 2.

Fig. 2 is the flow chart of the method for partitioning the extreme-value and second-extreme-value samples; it is a sub-diagram of Fig. 1.

Fig. 3 is the correlation coefficient plot used to find the most highly correlated spectral data point in the embodiment of the invention.

In the figure, the full spectral range is 10000-4000 cm^-1, covering the visible and near-infrared regions. The correlation coefficient is calculated from the spectral data at each wavelength point together with the reference chemical values of the samples, and the most highly correlated spectral data point is found at 8058 cm^-1. The samples are partitioned on the basis of the spectral data at this point and the chemical values, which ensures to some extent that the calibration model has a higher correlation.

Detailed description of the embodiments

Example:

Taking the near-infrared analysis of pomelo peel pectin as an example, there are 118 pomelo peel samples (N = 118), and spectral values at 3114 wavelength points (P = 3114) are measured for each sample in a spectral experiment. In a ratio of approximately 2:1, 78 samples are assigned to the calibration set (L = 78) and 40 samples to the prediction set (K = 40). The samples are partitioned using the method of the invention, with the following specific steps:

1) Data normalization

A) Normalization of the reference chemical values

From the known pectin chemical values of the 118 pomelo peel samples (numbered 1 to 118), the mean C_m = 4.987 (%) is first calculated by formula (1); using this mean, the normalized chemical value C_n(j) of each sample is then calculated by formula (2).

B) Normalization of the spectral data

From the known spectral data of the 118 pomelo peel samples at 3114 wavelength points, the mean spectral value A_{i,m} at each wavelength is calculated by formula (3); using this mean, the normalized spectral value norm(A_ij) of each sample at each wavelength is calculated by formula (4). The spectral values of each sample at all wavelength points are then treated as that sample's spectral value vector, and the modulus A_n(j) of the vector of each sample is calculated by formula (5).

After the above normalization of the reference chemical values and absorbances, each sample has a corresponding C_n(j) and A_n(j). In accordance with the Lambert-Beer law, a regression based on the C_n(j) and A_n(j) of all samples (j = 1, 2, ..., 118) gives the predicted chemical value C'_n(j) of each sample. The regression deviation of the normalized data, RDND, of each sample is then calculated by formula (6), followed by the mean of RDND over all samples, RDND_ave.

2) Partition of the extreme-value samples

Initially assume that the samples within SZ and within SC are all distinct, so that each set nominally contains 4 samples. The 2 samples with the maximum and minimum C_n(j) and the 2 samples with the maximum and minimum A_n(j) form the extreme-value set SZ, i.e. SZ = {98, 66, 98, 16}; the 2 samples with the second-largest and second-smallest C_n(j) and the 2 samples with the second-largest and second-smallest A_n(j) form the second-extreme-value set SC, i.e. SC = {85, 81, 85, 13}.

Clearly, SZ ∩ SC is the empty set (SZ and SC share no samples), so all samples of SZ are placed in the calibration set and all samples of SC are placed in the prediction set. The number of repeated samples within SZ is recorded as s₁ = 1, and the number of repeated samples within SC as s₂ = 1.
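The counts in this example can be checked with a few lines of arithmetic. The snippet below uses only the sample numbers listed for SZ and SC, the sample count N = 118, and the remaining-sample formula N - 8 + s₁ + s₂ + s₃ from the description; the variable names are chosen here for illustration.

```python
# Sample numbers as listed in the example (sample 98 and sample 85 each repeat).
SZ = [98, 66, 98, 16]
SC = [85, 81, 85, 13]
N = 118

s1 = len(SZ) - len(set(SZ))        # repeated samples inside SZ
s2 = len(SC) - len(set(SC))        # repeated samples inside SC
s3 = len(set(SZ) & set(SC))        # samples shared by SZ and SC
remaining = N - 8 + s1 + s2 + s3   # samples left for the random partition

print(s1, s2, s3, remaining)       # prints: 1 1 0 112
```

Because three distinct samples enter each set at this stage, the stated set sizes of 78 and 40 leave 75 calibration slots and 37 prediction slots to be filled from the 112 remaining samples.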

3) Partition principle for the remaining samples

After the extreme-value samples have been partitioned, 112 samples remain. The remaining samples are partitioned on the principle of highest correlation: the correlation coefficient R(i) between the spectral data at each wavelength point i and the reference chemical values is calculated by formula (7), and the maximum over all wavelength points is found to be R_note = max{R(i), i = 1, 2, ..., 3114} = 0.6332. R_note occurs at 8058 cm^-1, which corresponds to wavelength point index i_note = 1009.

The remaining 112 samples are randomly partitioned a sufficient number of times. For each partition, the spectral data {A_note} at the 1009th wavelength point are taken together with the reference chemical values {C_note} of the samples, and the correlation coefficients R_Cset and R_Pset are calculated within the calibration set and within the prediction set by formulas (8) and (9) respectively. The absolute deviation (AOC) between R_Cset and R_Pset is then calculated by formula (10), and a partition with AOC < 10^-5 is selected for establishing the NIR spectral analysis model.

For comparison, the samples were also partitioned by a completely random method. Using partial least squares (PLS), near-infrared calibration models were established from the calibration set and prediction set sample data obtained by the partitioning method of the invention and by the completely random method, and the predictive performance of the models was compared (see Table 1). The results show that partitioning the calibration set and prediction set samples with the method of the invention improves the model's predictions and improves the near-infrared detection capability.

Table 1. Comparison of the two partitioning methods based on the PLS model

Note: RMSEC is the prediction deviation for the calibration set samples

RMSEP is the prediction deviation for the prediction set samples

R_c is the prediction correlation coefficient for the calibration set samples

R_p is the prediction correlation coefficient for the prediction set samples
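The table notes describe RMSEC and RMSEP only as prediction deviations. As a hedged illustration, the sketch below assumes the common root-mean-square form with division by the number of samples; the source does not state which formula was used, and the function name `rmse` is hypothetical.

```python
import math

def rmse(predicted, reference):
    """Root-mean-square deviation between predicted and reference values.

    ASSUMPTION: plain division by the sample count n. Some chemometrics
    texts divide by n - 1 or by n minus the number of model factors.
    Applied to calibration set samples this gives RMSEC; applied to
    prediction set samples it gives RMSEP.
    """
    n = len(reference)
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(predicted, reference)) / n)
```

R_c and R_p would then be the correlation coefficients between the predicted and reference values within each set.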

Claims

1. A chemometrics method for partitioning sample sets in spectral analysis, characterized in that the specific steps are:

1) Data normalization

A) Normalization of the reference chemical values

$$C_{m} = \frac{1}{N} \sum_{j=1}^{N} C_{j} \tag{1}$$

$$\operatorname{norm}(C_{j}) = \frac{C_{j}}{\sqrt{\sum_{j=1}^{N} (C_{j} - C_{m})^{2}}} \overset{\Delta}{=} C_{n}(j), \quad j = 1, 2, \ldots, N \tag{2}$$

B) Normalization of the spectral data

$$A_{i,m} = \frac{1}{N} \sum_{j=1}^{N} A_{ij}, \quad i = 1, 2, \ldots, P \tag{3}$$

$$\operatorname{norm}(A_{ij}) = \frac{A_{ij}}{\sqrt{\sum_{j=1}^{N} (A_{ij} - A_{i,m})^{2}}}, \quad i = 1, 2, \ldots, P, \quad j = 1, 2, \ldots, N \tag{4}$$

$$|A_{j}| = \sqrt{\sum_{i=1}^{P} (\operatorname{norm}(A_{ij}))^{2}} \overset{\Delta}{=} A_{n}(j), \quad j = 1, 2, \ldots, N \tag{5}$$

$$\mathrm{RDND}(j) = |C'_{n}(j) - C_{n}(j)| \tag{6}$$

2) Partition of the extreme-value and second-extreme-value samples

3) Partition principle for the remaining samples

$$R(i) = \frac{\sum_{j=1}^{N} (C_{j} - C_{m})(A_{ij} - A_{i,m})}{\sqrt{\sum_{j=1}^{N} (C_{j} - C_{m})^{2} \sum_{j=1}^{N} (A_{ij} - A_{i,m})^{2}}}, \quad i = 1, 2, \ldots, P \tag{7}$$

$$R_{Cset} = \frac{\sum_{j=1}^{L} (C_{L(j)} - C_{Lm})(A_{note,L(j)} - A_{note,Lm})}{\sqrt{\sum_{j=1}^{L} (C_{L(j)} - C_{Lm})^{2} \sum_{j=1}^{L} (A_{note,L(j)} - A_{note,Lm})^{2}}} \tag{8}$$

$$R_{Pset} = \frac{\sum_{j=1}^{K} (C_{K(j)} - C_{Km})(A_{note,K(j)} - A_{note,Km})}{\sqrt{\sum_{j=1}^{K} (C_{K(j)} - C_{Km})^{2} \sum_{j=1}^{K} (A_{note,K(j)} - A_{note,Km})^{2}}} \tag{9}$$

$$\mathrm{AOC} = |R_{Cset} - R_{Pset}| \tag{10}$$