CN104091089B

CN104091089B - A PLS modeling method for infrared spectral data

Info

Publication number: CN104091089B
Application number: CN201410362602.1A
Authority: CN
Inventors: 陈孝敬
Original assignee: Wenzhou University
Current assignee: Wenzhou University
Priority date: 2014-07-28
Filing date: 2014-07-28
Publication date: 2016-04-27
Anticipated expiration: 2034-07-28
Also published as: CN104091089A

Abstract

The invention discloses a partial least squares (PLS) modeling method for infrared spectral data. The weight coefficient of the PLS model of each interval is determined from the errors of the interval PLS models together with the correlations between those errors, so that the resulting fused PLS model has minimum error. The method makes the best use of the spectral information in every interval; it is simple, visual, and computationally light, and it finds the characteristic wavelength intervals quickly. Because the weight coefficients are determined by considering both the error of each model participating in the fusion and the correlations between these errors, the fused model is ensured to have minimum error.

Description

A PLS modeling method for infrared spectral data

Technical field

The invention belongs to the field of infrared spectral identification, and specifically relates to a data processing method that improves the performance of partial least squares (PLS) modeling of infrared spectra.

Background art

For infrared spectral data with few samples and many variables, PLS models handle the variable collinearity and curse-of-dimensionality problems that other modeling methods run into, and are therefore widely used in infrared spectral identification. Although PLS can model the full spectrum directly, theory and a large body of experiments show that wavelength selection is still an effective way to improve a PLS model. Wavelength selection means screening characteristic wavelengths or bands by some method before modeling. Because irrelevant or nonlinear variables are removed, a model built after wavelength selection is simpler than a full-spectrum model and has better predictive ability and robustness. Interval PLS (iPLS) is a commonly used wavelength selection method. Its advantages are that it is simple, visual, and computationally light, and that it finds characteristic wavelength intervals quickly. Its drawback is that it uses the spectral information of only one interval and may lose useful spectral information in the other intervals. How to make the best use of the spectral information in every interval is therefore a problem that urgently needs solving.

Summary of the invention

The technical problem to be solved by the invention is to provide a PLS modeling method for infrared spectral data that addresses the above deficiencies of the prior art.

To solve the above technical problem, the invention adopts the following technical solution: a PLS modeling method for infrared spectral data, comprising the following steps:

1) Set the maximum number of intervals max_int_no, the maximum number of latent variables max_lv_no, and the cross-validation fold counts k₁ and k₂, where k₁ and k₂ are both at least 2;

2) For each interval number int_no, where 1 ≤ int_no ≤ max_int_no, compute the cross-validation error of the corresponding fused PLS model by steps 2.1 to 2.2:

2.1) Divide the spectral matrix X of the infrared spectral sample set evenly into int_no intervals X_i. Each interval has

l = \left[ \frac{m}{\mathrm{int\_no}} \right]

columns, where m is the number of columns of X and [·] denotes rounding to an integer. The i-th interval X_i corresponds to columns [(i-1) × l + 1] to (i × l) of X, with 1 ≤ i ≤ int_no;

2.2) For each latent variable number lv_no, where 1 ≤ lv_no ≤ max_lv_no, compute the cross-validation error of the fused PLS model,

\hat{f}_{\mathrm{int\_no}}^{\mathrm{lv\_no}},

by steps 2.2.1 to 2.2.5;

2.2.1) With int_no intervals and lv_no latent variables, use k₁-fold cross-validation to compute the cross-validation error S(e_i) of the PLS model of each interval, that is, the standard deviation of its prediction residual

e_{i} = y - \hat{y}_{i}, \quad i = 1, 2, \ldots, \mathrm{int\_no},

where y is the actual value of the dependent variable matrix in the infrared spectral sample set, \hat{y}_{i} is the predicted value of the dependent variable matrix obtained by k₁-fold cross-validation from the PLS model with lv_no latent variables of the i-th interval, e_i is the corresponding prediction residual matrix, and n is the number of samples in the sample set;

2.2.2) With int_no intervals and lv_no latent variables, compute the correlation between the prediction residual matrices of the interval PLS models,

r_{ij} = \frac{\operatorname{cov}(e_{i}, e_{j})}{S(e_{i}) \, S(e_{j})},

where

\operatorname{cov}(e_{i}, e_{j}) = \frac{1}{n} \langle e_{i}, e_{j} \rangle, \quad i, j = 1, 2, \ldots, \mathrm{int\_no};

2.2.3) Solve the following problem by nonlinear optimization,

f = \min \left( \sum_{i=1}^{\mathrm{int\_no}} \omega_{i}^{2} S^{2}(e_{i}) + 2 \sum_{i=1}^{\mathrm{int\_no}} \sum_{p > i}^{\mathrm{int\_no}} \omega_{i} \omega_{p} r_{ip} S(e_{i}) S(e_{p}) \right)

\text{s.t.} \quad \begin{cases} \sum_{i=1}^{\mathrm{int\_no}} \omega_{i} = 1 \\ 0 \leq \omega_{i} \leq 1 \end{cases}

This yields the combination coefficients ω = [ω₁, ..., ω_int_no]′ of the interval PLS models for int_no intervals and lv_no latent variables;

2.2.4) With int_no intervals and lv_no latent variables, use k₂-fold cross-validation to compute the prediction residual matrix of the PLS model of each interval,

e_{2i} = y - \hat{y}_{2i},

where \hat{y}_{2i} is the predicted value of the dependent variable matrix obtained by k₂-fold cross-validation from the PLS model with lv_no latent variables of the i-th interval, and calculate

\hat{f}_{\mathrm{int\_no}}^{\mathrm{lv\_no}} = \sum_{i=1}^{\mathrm{int\_no}} \omega_{i}^{2} S^{2}(e_{2i}) + 2 \sum_{i=1}^{\mathrm{int\_no}} \sum_{p > i}^{\mathrm{int\_no}} \omega_{i} \omega_{p} r_{ip} S(e_{2i}) S(e_{2p});

2.2.5) Select the minimum of \hat{f}_{\mathrm{int\_no}}^{\mathrm{lv\_no}} over lv_no as the cross-validation error of the fused PLS model for int_no intervals, denoted \hat{f}_{\mathrm{int\_no}};

3) Select the minimum \hat{f}_{\mathrm{int\_no}} over all interval numbers. The interval number int_bt, latent variable number lv_bt, and combination coefficients ω_bt corresponding to this minimum are taken as the optimal model parameters;

4) Construct the fused PLS model from the optimal model parameters: divide the spectral matrix X evenly into int_bt intervals; the fused PLS model is

y^{*} = \sum_{g=1}^{\mathrm{int\_bt}} \omega_{\mathrm{bt},g} \left( x_{g} b_{g} + c_{g} \right)

where ω_bt,g is the g-th component of ω_bt; y* is the fused PLS model's predicted value of the dependent variable of a sample; b_g and c_g are, respectively, the partial least squares regression coefficients and the intercept obtained from interval X_g and the dependent variable matrix Y with lv_bt latent variables; and x_g is the infrared spectral data of the sample in the g-th interval.

The fused PLS model of the invention is a weighted combination of several member models. Each member model is the PLS model of one interval, so the number of intervals equals the number of member models. The specific form of the i-th member model is determined by the spectral data of the i-th interval and the latent variables extracted.
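The objective minimized in step 2.2.3 and evaluated in step 2.2.4 is a quadratic form in the weights. When S and r are the standard deviations and correlations of the same residuals, it equals the variance of the weighted sum of the member residuals. The following is a minimal Python sketch of evaluating it for given weights; the function and argument names are illustrative and do not come from the patent.

```python
def fused_error_variance(weights, stds, corr):
    """Evaluate the fusion objective for given combination coefficients.

    weights: combination coefficients omega_i (assumed to sum to 1)
    stds:    residual standard deviations S(e_i)
    corr:    residual correlations r_ip as a list of lists
    """
    n = len(weights)
    total = 0.0
    for i in range(n):
        # Diagonal term: omega_i^2 * S^2(e_i)
        total += weights[i] ** 2 * stds[i] ** 2
        for p in range(i + 1, n):
            # Cross term, counted twice because r_ip = r_pi
            total += 2.0 * weights[i] * weights[p] * corr[i][p] * stds[i] * stds[p]
    return total
```

Putting all weight on one member model reduces the objective to that model's own error variance, which is the single-interval case that iPLS uses.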

Compared with the prior art, the invention has the following beneficial effects: the method makes the best use of the spectral information in every interval; it is simple, visual, and computationally light, and it finds the characteristic wavelength intervals quickly. Because the weight coefficients are determined by considering both the error of each model participating in the fusion and the correlations between these errors, the fused model is ensured to have minimum error.

Brief description of the drawings

Fig. 1 is a flow chart of the method of the invention.

Embodiment

The invention is further described below with reference to an example.

The spectral data are the spectra data set shipped with MATLAB 2012a; the samples are gasoline, and the dependent variable is the octane number of each sample. The original data set contains 60 samples, and the spectral variable length of each sample is 700. For ease of illustration, this example uses only spectral variables 1 to 6 of samples 1 to 6 as the spectral data matrix of the sample set. The sample set data used in this example consist of the spectral data matrix X and the dependent variable matrix Y, given below.

X = \begin{bmatrix}
-0.0502 & -0.0459 & -0.0422 & -0.0372 & -0.0333 & -0.0312 \\
-0.0442 & -0.0396 & -0.0357 & -0.0309 & -0.0267 & -0.0239 \\
-0.0469 & -0.0413 & -0.0370 & -0.0315 & -0.0265 & -0.0233 \\
-0.0467 & -0.0422 & -0.0386 & -0.0345 & -0.0302 & -0.0277 \\
-0.0509 & -0.0451 & -0.0410 & -0.0364 & -0.0327 & -0.0315 \\
-0.0481 & -0.0427 & -0.0388 & -0.0340 & -0.0301 & -0.0277
\end{bmatrix}

Y = \begin{bmatrix} 85.3000 \\ 85.2500 \\ 88.4500 \\ 83.4000 \\ 87.9000 \\ 85.5000 \end{bmatrix}

The specific implementation steps of the invention are as follows:

Step 1, parameter setting: set the maximum number of intervals max_int_no = 2, the maximum number of latent variables max_lv_no = 2, and the cross-validation fold counts k₁ = 4 and k₂ = 6. These parameters can be adjusted as needed; the values here are chosen only to make the modeling steps easy to follow.

Step 2: for each interval number int_no, where 1 ≤ int_no ≤ max_int_no, compute the cross-validation error of the corresponding fused PLS model by steps 2.1 to 2.2. The case int_no = 2 is described below:

Step 2.1: divide the spectral matrix X of the infrared spectral sample set evenly into int_no intervals X_i. Each interval has l = [6/2] = 3 columns: X₁ corresponds to columns 1 to 3 of the spectral data matrix X, and X₂ to columns 4 to 6. X₁ and X₂ are as follows.

X_{1} = \begin{bmatrix}
-0.0502 & -0.0459 & -0.0422 \\
-0.0442 & -0.0396 & -0.0357 \\
-0.0469 & -0.0413 & -0.0370 \\
-0.0467 & -0.0422 & -0.0386 \\
-0.0509 & -0.0451 & -0.0410 \\
-0.0481 & -0.0427 & -0.0388
\end{bmatrix}

X_{2} = \begin{bmatrix}
-0.0372 & -0.0333 & -0.0312 \\
-0.0309 & -0.0267 & -0.0239 \\
-0.0315 & -0.0265 & -0.0233 \\
-0.0345 & -0.0302 & -0.0277 \\
-0.0364 & -0.0327 & -0.0315 \\
-0.0340 & -0.0301 & -0.0277
\end{bmatrix}
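The equal-width column split of step 2.1 can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, the rounding is assumed to be rounding down, and the handling of leftover columns (ignored here) is not specified in the text.

```python
def split_intervals(X, int_no):
    """Split the columns of X (a list of rows) into int_no equal-width intervals.

    Assumes [.] in the column count means rounding down; columns beyond
    int_no * l, if any, are ignored in this sketch.
    """
    m = len(X[0])          # number of spectral variables (columns)
    l = m // int_no        # columns per interval
    return [[row[i * l:(i + 1) * l] for row in X] for i in range(int_no)]


# First two rows of the example matrix X.
X = [
    [-0.0502, -0.0459, -0.0422, -0.0372, -0.0333, -0.0312],
    [-0.0442, -0.0396, -0.0357, -0.0309, -0.0267, -0.0239],
]
X1, X2 = split_intervals(X, 2)
```

Here X1 holds columns 1 to 3 and X2 holds columns 4 to 6 of each row, matching the layout of the two interval matrices in the example.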

Step 2.2: for each latent variable number lv_no, where 1 ≤ lv_no ≤ max_lv_no, compute the cross-validation error of the fused PLS model by steps 2.2.1 to 2.2.5. The case lv_no = 2 is described below.

Step 2.2.1: with int_no intervals and lv_no latent variables, use k₁-fold cross-validation to compute the cross-validation error of the PLS model of each interval.

For the first interval, the k₁-fold cross-validation predictions \hat{y}_{1} of the dependent variable matrix from the PLS model with lv_no latent variables, the prediction residual matrix e₁, and the standard deviation S(e₁) of e₁ are as follows:

\hat{y}_{1} = \begin{bmatrix} 83.5924 \\ 86.7554 \\ 85.2694 \\ 82.1904 \\ 89.5037 \\ 87.4816 \end{bmatrix}, \quad e_{1} = \begin{bmatrix} 1.7076 \\ -1.5054 \\ 3.1806 \\ 1.2096 \\ -1.6037 \\ -1.9816 \end{bmatrix},

S(e₁) = 2.1490.

For the second interval, the k₁-fold cross-validation predictions \hat{y}_{2} of the dependent variable matrix from the PLS model with lv_no latent variables, the prediction residual matrix e₂, and the standard deviation S(e₂) of e₂ are as follows:

\hat{y}_{2} = \begin{bmatrix} 80.1147 \\ 86.4685 \\ 86.9897 \\ 86.9970 \\ 81.4383 \\ 84.1570 \end{bmatrix}, \quad e_{2} = \begin{bmatrix} 5.1853 \\ -1.2185 \\ 1.4603 \\ -3.5970 \\ 6.4617 \\ 1.3430 \end{bmatrix},

S(e₂) = 3.7823.
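In k-fold cross-validation every sample is held out exactly once, so each member model yields one out-of-sample prediction per sample and hence a full residual vector. The text does not say how samples are assigned to folds; the sketch below assumes contiguous folds whose sizes differ by at most one, and the function name is hypothetical.

```python
def kfold_indices(n, k):
    """Return k (train, test) index pairs covering samples 0..n-1.

    Folds are contiguous blocks whose sizes differ by at most one
    (an assumption; the fold assignment is not given in the text).
    """
    base, extra = divmod(n, k)
    folds, start = [], 0
    for f in range(k):
        size = base + (1 if f < extra else 0)
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds


folds = kfold_indices(6, 4)  # k1 = 4 folds over the 6 example samples
```

For each fold, a PLS model would be fitted on the `train` rows of the interval matrix and used to predict the `test` rows; concatenating the test predictions gives the cross-validated prediction vector.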

Step 2.2.2: with int_no intervals and lv_no latent variables, compute the correlations between the prediction residual matrices of the interval PLS models:

r₁₁ = 1.0000, r₁₂ = -0.0900, r₂₁ = -0.0900, r₂₂ = 1.0000.
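The residual statistics of steps 2.2.1 and 2.2.2 can be recomputed from the listed residual vectors. The sketch below assumes the mean-centered sample standard deviation (n - 1 denominator) and the Pearson correlation coefficient; with these choices the results appear to agree with the printed S(e₁), S(e₂), and r₁₂ to about three decimal places. Function names are illustrative.

```python
import math

# Prediction residuals of the two interval models, copied from the example.
e1 = [1.7076, -1.5054, 3.1806, 1.2096, -1.6037, -1.9816]
e2 = [5.1853, -1.2185, 1.4603, -3.5970, 6.4617, 1.3430]


def sample_std(e):
    """Standard deviation with mean-centering and an n - 1 denominator."""
    mean = sum(e) / len(e)
    return math.sqrt(sum((v - mean) ** 2 for v in e) / (len(e) - 1))


def correlation(a, b):
    """Pearson correlation coefficient of two equal-length residual vectors."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


S1, S2, r12 = sample_std(e1), sample_std(e2), correlation(e1, e2)
print(round(S1, 4), round(S2, 4), round(r12, 4))
```

The normalization constant cancels in the correlation as long as the covariance and the standard deviations use the same denominator, so r₁₂ does not depend on whether n or n - 1 is used.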

Step 2.2.3: solve the following problem by nonlinear optimization,

f = \min \left( \sum_{i=1}^{\mathrm{int\_no}} \omega_{i}^{2} S^{2}(e_{i}) + 2 \sum_{i=1}^{\mathrm{int\_no}} \sum_{p > i}^{\mathrm{int\_no}} \omega_{i} \omega_{p} r_{ip} S(e_{i}) S(e_{p}) \right)

\text{s.t.} \quad \begin{cases} \sum_{i=1}^{\mathrm{int\_no}} \omega_{i} = 1 \\ 0 \leq \omega_{i} \leq 1 \end{cases}

This yields the combination coefficients of the interval PLS models for int_no intervals and lv_no latent variables:

ω = [0.7376 0.2624]′.
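For exactly two member models the constrained problem has a closed-form solution, which gives a quick check on the optimizer output. The sketch below is not the general nonlinear optimization named in the text; it substitutes ω₂ = 1 - ω₁, sets the derivative to zero, and clips the result to [0, 1]. The function name is hypothetical.

```python
def two_model_weights(s1, s2, r12):
    """Closed-form minimizer of the fusion objective for two member models.

    Minimizes w^2*s1^2 + (1-w)^2*s2^2 + 2*w*(1-w)*r12*s1*s2 over w in [0, 1].
    """
    c = r12 * s1 * s2                      # covariance term r12 * S1 * S2
    denom = s1 ** 2 + s2 ** 2 - 2.0 * c    # variance of e1 - e2, never negative
    if denom == 0.0:
        return 0.5, 0.5                    # objective is flat in w; any split is optimal
    w1 = (s2 ** 2 - c) / denom             # stationary point of the quadratic
    w1 = min(1.0, max(0.0, w1))            # enforce 0 <= w <= 1
    return w1, 1.0 - w1


# Values from steps 2.2.1 and 2.2.2 of the example.
w1, w2 = two_model_weights(2.1490, 3.7823, -0.0900)
```

Because the objective is a convex quadratic in a single variable once the sum constraint is substituted, clipping the stationary point to the interval gives the constrained minimum. With more than two intervals a general constrained solver is needed.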

Step 2.2.4: with int_no intervals and lv_no latent variables, use k₂-fold cross-validation to compute the prediction residual matrix of the PLS model of each interval:

\hat{y}_{21} = \begin{bmatrix} 83.5924 \\ 86.7554 \\ 85.2694 \\ 82.1904 \\ 89.5037 \\ 87.4816 \end{bmatrix}, \quad \hat{y}_{22} = \begin{bmatrix} 80.1147 \\ 86.4685 \\ 86.9897 \\ 86.9970 \\ 81.4383 \\ 84.1570 \end{bmatrix}, \quad e_{21} = \begin{bmatrix} 1.7076 \\ -1.5054 \\ 3.1806 \\ 1.2096 \\ -1.6037 \\ -1.9816 \end{bmatrix}, \quad e_{22} = \begin{bmatrix} 5.1853 \\ -1.2185 \\ 1.4603 \\ -3.5970 \\ 6.4617 \\ 1.3430 \end{bmatrix}

Here \hat{y}_{21} and \hat{y}_{22} are the predicted values of the dependent variable matrix from the PLS models with lv_no latent variables of the first and second intervals, respectively, and e_{21} and e_{22} are the corresponding prediction residual matrices.

Calculate

\hat{f}_{\mathrm{int\_no}}^{\mathrm{lv\_no}} = \sum_{i=1}^{\mathrm{int\_no}} \omega_{i}^{2} S^{2}(e_{2i}) + 2 \sum_{i=1}^{\mathrm{int\_no}} \sum_{p > i}^{\mathrm{int\_no}} \omega_{i} \omega_{p} r_{ip} S(e_{2i}) S(e_{2p}) = 1.7250

Step 2.2.5: select the minimum of \hat{f}_{\mathrm{int\_no}}^{\mathrm{lv\_no}} over lv_no as the cross-validation error of the fused PLS model for int_no intervals, denoted \hat{f}_{\mathrm{int\_no}}. In this example,

\hat{f}_{1}^{1} = 2.4440, \quad \hat{f}_{1}^{2} = 2.7208, \quad \hat{f}_{2}^{1} = 2.1265, \quad \hat{f}_{2}^{2} = 1.7250.

Therefore, for interval number int_no = 1 the cross-validation error of the fused PLS model is \hat{f}_{1} = 2.4440, and for interval number int_no = 2 it is

\hat{f}_{2} = 1.7250.

Step 3: select the minimum cross-validation error of the fused PLS model over all interval numbers int_no (1 ≤ int_no ≤ max_int_no). In this example \hat{f}_{2} = 1.7250 is the minimum, and the corresponding optimal model parameters are: interval number int_bt = 2, latent variable number lv_bt = 2, and combination coefficients ω_bt = [0.7376 0.2624]′.
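The two-level selection of steps 2.2.5 and 3 amounts to a minimum over lv_no for each int_no, followed by a minimum over int_no. A minimal Python sketch using the four error values of this example (variable names are illustrative):

```python
# Cross-validation errors f_hat[(int_no, lv_no)] taken from the example.
f_hat = {(1, 1): 2.4440, (1, 2): 2.7208, (2, 1): 2.1265, (2, 2): 1.7250}

# Step 2.2.5: for each interval number keep the smallest error over lv_no.
best_per_int = {}
for (int_no, lv_no), err in f_hat.items():
    if int_no not in best_per_int or err < best_per_int[int_no][1]:
        best_per_int[int_no] = (lv_no, err)

# Step 3: pick the interval number whose best error is smallest overall.
int_bt = min(best_per_int, key=lambda k: best_per_int[k][1])
lv_bt, f_min = best_per_int[int_bt]
```

In a full implementation the combination coefficients obtained for each (int_no, lv_no) pair would be stored alongside the error, so that ω_bt can be read off once int_bt and lv_bt are known.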

Step 4: construct the fused PLS model from the optimal model parameters. c₁ = 64.4 and b₁ = [-2120.4 443.4 1565.1]′ are, respectively, the intercept and the partial least squares regression coefficients of X₁ against Y with 2 latent variables. c₂ = 105.8 and b₂ = [596.3 1404.7 -1544.9]′ are, respectively, the intercept and the partial least squares regression coefficients of X₂ against Y with 2 latent variables. The final fused PLS model is

y = 0.7376 × (x₁b₁ + 64.4) + 0.2624 × (x₂b₂ + 105.8).

The complete spectral data x of a sample consist of x₁ and x₂, that is, x = [x₁ x₂], where x₁ is the spectral data in the first interval and x₂ is the spectral data in the second interval. y is the fused PLS model's predicted value of the dependent variable of the sample.
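Applying the fused model to one sample is a weighted sum of the member model predictions. The sketch below is illustrative only: the function name is hypothetical, and it assumes that 64.4 and 105.8 are the intercepts, as they appear in the final model equation, with the remaining three values per interval taken as the regression coefficients.

```python
def fused_predict(x_parts, coefs, intercepts, weights):
    """y* = sum_g w_g * (x_g . b_g + c_g) for one sample split into intervals."""
    y = 0.0
    for x_g, b_g, c_g, w_g in zip(x_parts, coefs, intercepts, weights):
        y += w_g * (sum(xi * bi for xi, bi in zip(x_g, b_g)) + c_g)
    return y


# Assumption: 64.4 and 105.8 are the intercepts (as in the final model
# equation) and the other three values per interval are the coefficients.
b1, c1 = [-2120.4, 443.4, 1565.1], 64.4
b2, c2 = [596.3, 1404.7, -1544.9], 105.8
x1 = [-0.0502, -0.0459, -0.0422]   # first sample, interval 1
x2 = [-0.0372, -0.0333, -0.0312]   # first sample, interval 2
y_star = fused_predict([x1, x2], [b1, b2], [c1, c2], [0.7376, 0.2624])
```

Because the weights sum to 1, the fused prediction always lies between the smallest and largest member prediction for the same sample.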

Claims

1. A PLS modeling method for infrared spectral data, characterized by comprising the following steps:

2) for each interval number int_no, where 1 ≤ int_no ≤ max_int_no, computing the cross-validation error of the corresponding fused PLS model according to steps 2.1) and 2.2):

2.2) for each latent variable number lv_no, where 1 ≤ lv_no ≤ max_lv_no, computing the cross-validation error of the fused PLS model according to steps 2.2.1) to 2.2.5):

\operatorname{cov}(e_{i}, e_{j}) = \frac{1}{n} \langle e_{i}, e_{j} \rangle, \quad i, j = 1, 2, \ldots, \mathrm{int\_no};

2.2.3) solving the following problem by nonlinear optimization:

f = \min \left( \sum_{i=1}^{\mathrm{int\_no}} \omega_{i}^{2} S^{2}(e_{i}) + 2 \sum_{i=1}^{\mathrm{int\_no}} \sum_{p > i}^{\mathrm{int\_no}} \omega_{i} \omega_{p} r_{ip} S(e_{i}) S(e_{p}) \right)

\text{s.t.} \quad \begin{cases} \sum_{i=1}^{\mathrm{int\_no}} \omega_{i} = 1 \\ 0 \leq \omega_{i} \leq 1 \end{cases};

\hat{f}_{\mathrm{int\_no}}^{\mathrm{lv\_no}} = \sum_{i=1}^{\mathrm{int\_no}} \omega_{i}^{2} S^{2}(e_{2i}) + 2 \sum_{i=1}^{\mathrm{int\_no}} \sum_{p > i}^{\mathrm{int\_no}} \omega_{i} \omega_{p} r_{ip} S(e_{2i}) S(e_{2p});

y^{*} = \sum_{g=1}^{\mathrm{int\_bt}} \omega_{\mathrm{bt},g} \left( x_{g} b_{g} + c_{g} \right)