CN111220565B

CN111220565B - CPLS-based infrared spectrum measuring instrument calibration migration method

Info

Publication number: CN111220565B
Application number: CN202010045812.3A
Authority: CN
Inventors: 赵煜辉; 刘晓东; 李雪晶; 芦鹏程
Original assignee: Northeastern University Qinhuangdao Branch
Current assignee: Northeastern University Qinhuangdao Branch
Priority date: 2020-01-16
Filing date: 2020-01-16
Publication date: 2022-07-29
Anticipated expiration: 2040-01-16
Also published as: CN111220565A

Abstract

The invention relates to the technical field of transfer learning in machine learning, and provides a CPLS-based infrared spectrum measuring instrument calibration migration method. First, a source domain data set {X_m, Y} and a target domain data set {X_s, Y} are collected and centered to obtain a centered source domain data set {X_{m_center}, Y_center} and a centered target domain data set {X_{s_center}, Y_center}. Then, based on the CPLS algorithm, principal component analysis is performed on the matrices X_{m_center} and Y_center, and on the matrix X_{s_center}. Next, the transfer matrices M_{trans_pre} and M_trans are calculated. Finally, the substance concentration variables of the object under test are predicted. The invention can eliminate the random noise in the master instrument's measurements, improve data utilization and modeling accuracy, and reduce time complexity.

Description

CPLS-based infrared spectrum measuring instrument calibration migration method

Technical Field

The invention relates to the technical field of transfer learning in machine learning, in particular to a CPLS-based infrared spectrum measuring instrument calibration migration method.

Background

Near infrared spectroscopy (NIRS) analysis offers simple instrument operation, fast data analysis, low cost and no sample contamination, and is widely used in many fields. In production, however, calibration models built with near infrared spectroscopy can become invalid because measurement conditions and instrument hardware performance are unstable.

The main goal of transfer learning is to extract classification or regression knowledge from one or more tasks in the source domain and apply that knowledge to tasks in the target domain. If the knowledge of one task is successfully transferred to another, a model of the new task can be obtained without many new samples. Using knowledge learned in one or more source domains improves learning performance in the target domain and addresses problems such as missing target domain labels, high labeling cost and a time-consuming learning process.

Calibration migration refers to transferring a multivariate calibration model between different measuring instruments or measuring states. The method uses the linear relationship between spectral data from different sources to convert spectra measured on a new instrument or in a new state, so that the original model can be used directly to predict new samples. Migration can be applied between related fields rather than only within the same field, and transfers useful information between domains. The original model thus remains valid, or the original information is used to speed up modeling. This avoids collecting a large number of target domain samples or rebuilding a model for the target domain, and thereby greatly reduces cost.

Existing calibration migration methods suffer from low prediction accuracy and limited applicability. Consider PLS-based calibration migration. Partial least squares (PLS) is one of the algorithms commonly used for data information extraction and process monitoring. It extracts the feature information with maximum correlation between the process variables and the quality variables, and divides the process and quality variables into a principal component subspace and a residual subspace, thereby compressing and extracting the data. However, the PLS algorithm first extracts the principal components of the process variables and of the quality variables separately using principal component analysis, with no correlation between the two sets of principal components. It assumes by default that all process variables act on the quality variables, ignoring the state information of internal variables. In many cases, because the process data lack excitation, there are many unmeasured process and quality disturbances, and when the residual information of the quality variables changes, alarms fail, resulting in poor PLS prediction output. In fact, monitoring changes in quality variable information is more important than monitoring the process variables. In addition, the optimization objective used to build the PLS model maximizes the correlation between the principal components of the process and quality variables without constraining the residuals, so the residual variances of the process and quality variables are not guaranteed to be minimal, and a large amount of information may be left in the residuals. Moreover, near infrared spectrum modeling currently handles large volumes of data, the serial partial least squares algorithm has high time complexity, and training and testing take a long time.

Disclosure of Invention

In view of the problems in the prior art, the invention provides a CPLS-based infrared spectrum measuring instrument calibration migration method that can eliminate the random noise in the master instrument's measurements, improve data utilization and modeling accuracy, and reduce time complexity.

The technical scheme of the invention is as follows:

A CPLS-based infrared spectrum measuring instrument calibration migration method, characterized by comprising the following steps:

Step 1: let the infrared spectrum measurement master instrument correspond to the source domain and the infrared spectrum measurement slave instrument correspond to the target domain; collect the spectrum of each sample with the master instrument and with the slave instrument to obtain a master spectrum and a slave spectrum respectively; extract spectral data from the master spectrum and from the slave spectrum at intervals of a nm within a wavelength range; and collect the values of the substance concentration variables of each sample, obtaining the source domain data set {X_m, Y} and the target domain data set {X_s, Y};

where X_m = (X_m1, X_m2, ..., X_mi, ..., X_mI)^T, X_mi = (x_mi1, x_mi2, ..., x_mij, ..., x_miJ), X_s = (X_s1, X_s2, ..., X_si, ..., X_sI)^T, X_si = (x_si1, x_si2, ..., x_sij, ..., x_siJ); x_mij and x_sij are the j-th spectral data points of the i-th sample in the master spectrum and the slave spectrum respectively, i = 1, 2, ..., I, j = 1, 2, ..., J, I is the total number of samples, and J is the total number of extracted spectral data points; Y = (Y_1, Y_2, ..., Y_i, ..., Y_I)^T, Y_i = (y_i1, y_i2, ..., y_ik, ..., y_iK), y_ik is the value of the k-th substance concentration variable of the i-th sample, k = 1, 2, ..., K, and K is the total number of substance concentration variables;

Step 2: center the source domain data set and the target domain data set to obtain the centered source domain data set {X_{m_center}, Y_center} and the centered target domain data set {X_{s_center}, Y_center};

Step 3: based on the CPLS algorithm, perform principal component analysis on the matrices X_{m_center} and Y_center:

Step 3.1: based on the PLS algorithm, establish the calibration model Y_center = X_{m_center} B on the data set {X_{m_center}, Y_center}; calculate the coefficient matrix B, the score matrix T and loading matrix P of X_{m_center}, and the score matrix U and loading matrix Q of Y_center; introduce the matrix R such that T = X_{m_center} R; and determine the number l of latent variables;

Step 3.2: calculate the predictable substance concentration variable as

$$\hat{Y} = X_{m\_center} R Q^T \tag{1}$$

Perform singular value decomposition on the predictable substance concentration variable to obtain

$$\hat{Y} = U_c D_c V_c^T = U_c Q_c^T \tag{2}$$

where $U_c$ is the left singular matrix, $D_c$ is the diagonal matrix of singular values, $V_c$ is the right singular matrix and is orthogonal, and $Q_c = V_c D_c^T$ contains the $l_c$ non-zero singular values in descending order and the corresponding right singular vectors;

From equation (2),

$$U_c = \hat{Y} V_c D_c^{-1} = X_{m\_center} R Q^T V_c D_c^{-1} = X_{m\_center} R_c \tag{3}$$

which gives

$$R_c = R Q^T V_c D_c^{-1} \tag{4}$$

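Step 3.3: calculate the unpredictable substance concentration variable as

$$\tilde{Y}_c = Y_{center} - U_c Q_c^T \tag{5}$$

Perform principal component extraction on the unpredictable substance concentration variable with $l_y$ principal components to obtain

$$\tilde{Y}_c = T_y P_y^T + \tilde{Y} \tag{6}$$

where $\tilde{Y}$ is the output residual matrix of $\tilde{Y}_c$;

from equation (6), the score matrix is obtained as

$$T_y = \tilde{Y}_c P_y$$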

Step 3.4: by projecting onto the orthogonal complement of the space spanned by $R_c$, obtain the input variable unrelated to the substance concentration variables as

$$\tilde{X}_c = X_{m\_center} - U_c R_c^* \tag{7}$$

where $R_c^* = (R_c^T R_c)^{-1} R_c^T$;

Perform principal component extraction on the input variable unrelated to the substance concentration variables with $l_x$ principal components to obtain

$$\tilde{X}_c = T_x P_x^T + \tilde{X} \tag{8}$$

where $\tilde{X}$ is the input residual matrix of $\tilde{X}_c$;

from equation (8), the score matrix is obtained as

$$T_x = \tilde{X}_c P_x$$

Step 3.5: from steps 3.1 to 3.4, the principal components of X_{m_center} and Y_center extracted by the PLS algorithm are X_{m_pre} = T P^T and Y_pre = U Q^T respectively, and the residuals of X_{m_center} and Y_center are X_{m_res_c} = X_{m_center} - X_{m_pre} and Y_{res_c} = Y_center - Y_pre respectively, that is, X_{m_center} = X_{m_pre} + X_{m_res_c} and Y_center = Y_pre + Y_{res_c};

Step 4: apply the same method as in step 3 to the matrix X_{s_center} to perform principal component analysis, obtaining the residual X_{s_res_c} of X_{s_center};

Step 5: calculate the score T_{m_pre} = X_{m_center} R of the source domain data set after principal components are extracted from the master spectrum by the PLS algorithm, and the score T_{s_pre} = X_{s_center} R of the target domain data set after principal components are extracted from the slave spectrum by the PLS algorithm, and from T_{m_pre} and T_{s_pre} calculate the transfer matrix M_{trans_pre} by the least squares method; calculate the score T_m = X_{m_res_c} P of the source domain data set after principal components are extracted from the residual of the master spectrum, and the score T_s = X_{s_res_c} P of the target domain data set after principal components are extracted from the residual of the slave spectrum, and from T_m and T_s calculate the transfer matrix M_trans by the least squares method;

Step 6: predicting the substance concentration variable of the measured object:

Step 6.1: collect the spectrum of the object under test with the infrared spectrum measurement slave instrument, and extract spectral data by the same method as in step 1 to obtain the matrix X_{s_test} formed by the J spectral data points of the object under test;

Step 6.2: perform principal component analysis on X_{s_test} based on the CPLS algorithm, obtaining the residual X_{s_res_c_test} of X_{s_test};

Step 6.3: the matrix of predicted substance concentration variables of the object under test is Y_{test_predict} = (X_{s_test} * R * M_{trans_pre} * P^T + X_{s_res_c_test} * R * M_trans * P^T) * B.

Further, in step 1, the sample is grain, the spectral data is absorbance, and the substance concentration variables include moisture content, oil content, protein content, and starch content of the grain.

The invention has the beneficial effects that:

The invention first extracts principal components from the source domain data set and the target domain data set based on the CPLS algorithm, then extracts principal components once more from the residuals, and calculates the transfer matrices on the basis of these two extractions. This eliminates the random noise in the master instrument's measurements, improves data utilization and modeling accuracy, reduces time complexity, and speeds up training and testing.

Drawings

Fig. 1 is a flow chart of the calibration migration method of the infrared spectroscopic measuring instrument based on CPLS of the present invention.

Fig. 2 is a flow chart of the CPLS-based principal component analysis of the source domain data set in the calibration migration method of the CPLS-based infrared spectroscopic measuring instrument of the present invention.

Fig. 3 is a flow chart of solving a transfer matrix in the calibration migration method of the infrared spectroscopic measuring instrument based on CPLS.

Fig. 4 is a flowchart of predicting the substance concentration variable of the measured object in the calibration migration method of the infrared spectroscopic measuring instrument based on CPLS according to the present invention.

FIG. 5 is a graphical representation of cross-validation error of oil on a corn data set as a function of principal component number in accordance with an embodiment.

FIG. 6 is a graph showing the fitting results of mp6spec to mp5spec in the embodiment.

FIG. 7 is a graph showing the fitting results of m5spec-mp5spec in the embodiment.

Detailed Description

The invention will be further described with reference to the accompanying drawings and specific embodiments.

The invention provides a CPLS-based infrared spectrum measuring instrument calibration migration method. In data processing, PLS extracts principal components from X and Y only once, but the residuals of X and Y usually still contain useful information; because this extraction is insufficient, the resulting model has a large error. The concurrent partial least squares (CPLS) algorithm was therefore proposed: on the basis of PLS it extracts principal components once more from the residuals, so that the model error is smaller and the fitted linear relationship is closer to the real situation. In practice, however, acquiring samples is expensive and time-consuming, so transfer learning is introduced on the basis of CPLS, and prediction on the target domain test set is accomplished by establishing a mapping between the standard sets of the source domain and the target domain.

The CPLS algorithm adopted by the invention is a further improvement on the PLS algorithm. It additionally performs principal component analysis on the process variable information that is unrelated to the quality variables and on the quality variable information that cannot be predicted, dividing the data into 5 subspaces: the subspace of information relating process variables and quality variables (the correlated principal component subspace), the process variable principal component subspace, the process variable residual subspace, the quality variable principal component subspace, and the quality variable residual subspace.

The CPLS model achieves three goals: (1) scores directly related to the predictable variation of the output are extracted from the standard PLS projection, and these score vectors constitute the covariation subspace (CVS); (2) the unpredicted output variation is further projected onto an output principal subspace (OPS) and an output residual subspace (ORS), so that abnormal changes in these subspaces can be monitored; (3) input variation unrelated to the predicted output is further projected onto an input principal subspace (IPS) and an input residual subspace (IRS), so that abnormal changes in these subspaces can be monitored.

The CPLS algorithm divides the process variable data into two main parts: information related to the quality variables and information unrelated to the quality variables. The quality variable data are likewise divided into two main parts: information that is predictable from the process variables and information that is not. The CPLS-based monitoring method thus provides a complete monitoring framework that can monitor the process variables, the quality variables and the remaining portions of information.

As shown in fig. 1, the calibration migration method of the infrared spectroscopic measuring instrument based on CPLS of the present invention includes the following steps:

Step 1: let the infrared spectrum measurement master instrument correspond to the source domain and the infrared spectrum measurement slave instrument correspond to the target domain; collect the spectrum of each sample with the master instrument and with the slave instrument to obtain a master spectrum and a slave spectrum respectively; extract spectral data from the master spectrum and from the slave spectrum at intervals of a nm within a wavelength range; and collect the values of the substance concentration variables of each sample, obtaining the source domain data set {X_m, Y} and the target domain data set {X_s, Y};

where X_m = (X_m1, X_m2, ..., X_mi, ..., X_mI)^T, X_mi = (x_mi1, x_mi2, ..., x_mij, ..., x_miJ), X_s = (X_s1, X_s2, ..., X_si, ..., X_sI)^T, X_si = (x_si1, x_si2, ..., x_sij, ..., x_siJ); x_mij and x_sij are the j-th spectral data points of the i-th sample in the master spectrum and the slave spectrum respectively, i = 1, 2, ..., I, j = 1, 2, ..., J, I is the total number of samples, and J is the total number of extracted spectral data points; Y = (Y_1, Y_2, ..., Y_i, ..., Y_I)^T, Y_i = (y_i1, y_i2, ..., y_ik, ..., y_iK), y_ik is the value of the k-th substance concentration variable of the i-th sample, k = 1, 2, ..., K, and K is the total number of substance concentration variables.

In this embodiment, the sample is corn, a grain; the spectral data are absorbances; and the substance concentration variables are the moisture, oil, protein and starch contents of the corn. The data measured on the same I = 80 samples by three spectroscopic instruments constitute the corn data set. The spectra were measured by the infrared spectrum measuring instruments m5, mp5 and mp6 at intervals of a = 2 nm over the wavelength range 1100-2498 nm, giving J = 700 attributes. The master-slave spectrum pair of the first experiment is m5spec-mp6spec: the spectrum measured by m5 is taken as the master spectrum and the corresponding spectral data set as the initial source domain data set; because the spectrum measured by mp6 differs significantly from that measured by m5, it is selected as the slave spectrum and the corresponding spectral data set as the initial target domain data set. Five more experiments were then carried out on mp5spec-mp6spec, mp6spec-mp5spec, m5spec-mp5spec, mp5spec-m5spec and mp6spec-m5spec, in that order.

In this embodiment, the Kennard-Stone (KS) algorithm is used to split the corn data set. First, 20% of the data in the initial source domain data set and in the initial target domain data set are extracted as test samples, 16 samples each. The test samples of the target domain are used to test the calibration migration model. The remaining 80% of the data in each initial data set are then taken as training samples, 64 samples each. A reference model is established with the training samples of the source domain and used to predict the migrated samples of the target domain; a standard model of the target domain is established with the training samples of the target domain for comparison with the performance of the migration models. Then, using the KS algorithm, 20% of the data are extracted from the source domain training samples and from the target domain training samples to form the source domain standard sample set and the target domain standard sample set, which serve as the source domain data set {X_m, Y} and the target domain data set {X_s, Y} used in the method of the invention to establish the transfer relationship between source domain samples and target domain samples.
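The sample selection described above can be illustrated with a minimal NumPy sketch of the Kennard-Stone rule (start from the two most distant samples, then repeatedly add the sample farthest from its nearest already-selected neighbour). The function name `kennard_stone`, the random stand-in data and the choice of 13 standard samples (roughly 20% of 64) are assumptions for illustration; the patent does not give its KS implementation or the exact standard-set size.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Return indices of n_select rows of X chosen by the Kennard-Stone rule."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of samples")
    # Pairwise Euclidean distances between all samples.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Start with the two samples that are farthest apart.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(n) if k not in selected]
    while len(selected) < n_select:
        # For each remaining sample, distance to its nearest selected sample.
        nearest = dist[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(nearest))]
        selected.append(pick)
        remaining.remove(pick)
    return selected

# Hypothetical stand-in for 64 training spectra with 700 points each.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(64, 700))
standard_idx = kennard_stone(X_train, n_select=13)  # roughly 20% of 64 (assumed)
```

The same indices would be applied to the master and slave spectra so that the two standard sets contain the same physical samples.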

Step 2: center the source domain data set and the target domain data set, that is, compute the mean of each column and subtract it from the original data of that column, obtaining the centered source domain data set {X_{m_center}, Y_center} and the centered target domain data set {X_{s_center}, Y_center}. This effectively avoids bias caused by large differences in numerical magnitude.
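As a small illustration of the centering in step 2, the sketch below subtracts column means with NumPy. The helper name `center_columns` and the random stand-in matrix are assumptions for illustration only.

```python
import numpy as np

def center_columns(X, mean=None):
    """Subtract column means from X; return the centered matrix and the means used."""
    X = np.asarray(X, dtype=float)
    if mean is None:
        mean = X.mean(axis=0)
    return X - mean, mean

# Hypothetical stand-in for a small standard set (13 samples, 6 spectral points).
rng = np.random.default_rng(0)
X_m = rng.normal(loc=5.0, size=(13, 6))
X_m_center, x_m_mean = center_columns(X_m)
```

Returning the means makes it possible to apply the same shift to later data if that is desired; whether the method re-centers test spectra with stored means is not stated in the text.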

Step 3: as shown in Fig. 2, perform principal component analysis on the matrices X_{m_center} and Y_center based on the CPLS algorithm:

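Step 3.2: calculate the predictable substance concentration variable as

$$\hat{Y} = X_{m\_center} R Q^T \tag{1}$$

Perform singular value decomposition (SVD) on the predictable substance concentration variable to obtain

$$\hat{Y} = U_c D_c V_c^T = U_c Q_c^T \tag{2}$$

where $U_c$ is the left singular matrix, $D_c$ is the diagonal matrix of singular values, $V_c$ is the orthogonal right singular matrix, and $Q_c = V_c D_c^T$;

From equation (2),

$$U_c = \hat{Y} V_c D_c^{-1} = X_{m\_center} R Q^T V_c D_c^{-1} = X_{m\_center} R_c \tag{3}$$

which gives

$$R_c = R Q^T V_c D_c^{-1} \tag{4}$$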

Step 3.3: calculate the unpredictable substance concentration variable as

$$\tilde{Y}_c = Y_{center} - U_c Q_c^T \tag{5}$$

Perform principal component extraction (PCA) on the unpredictable substance concentration variable with $l_y$ principal components to obtain

$$\tilde{Y}_c = T_y P_y^T + \tilde{Y} \tag{6}$$

where $\tilde{Y}$ is the output residual matrix of $\tilde{Y}_c$;

from equation (6), the score matrix is obtained as

$$T_y = \tilde{Y}_c P_y$$

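Step 3.4: by projecting onto the orthogonal complement of the space spanned by $R_c$, obtain the input variable unrelated to the substance concentration variables as

$$\tilde{X}_c = X_{m\_center} - U_c R_c^* \tag{7}$$

where $R_c^* = (R_c^T R_c)^{-1} R_c^T$;

Perform principal component extraction on the input variable unrelated to the substance concentration variables with $l_x$ principal components to obtain

$$\tilde{X}_c = T_x P_x^T + \tilde{X} \tag{8}$$

where $\tilde{X}$ is the input residual matrix of $\tilde{X}_c$;

from equation (8), the score matrix is obtained as

$$T_x = \tilde{X}_c P_x$$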

The CPLS algorithm flow shows that X_{m_center} and Y_center are each divided into three parts: the principal components extracted by the PLS algorithm, the principal components extracted from the residual, and the unpredictable error. Compared with the PLS algorithm, CPLS adds the extraction of principal components from the residual, which improves data utilization.
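The decomposition in step 3 can be sketched in NumPy as follows. This is only an illustrative sketch under stated assumptions: the helper names (`pls`, `pca`, `cpls`), the SVD-based PLS variant, the random stand-in data and the component counts are not taken from the patent, and the formulas follow the commonly published form of concurrent PLS (predictable part X R Q^T, its SVD, then PCA on the two remainders).

```python
import numpy as np

def pls(X, Y, n_components):
    """Minimal PLS on centered X, Y. Returns R (scores T = X @ R), P and Q."""
    Xd, Yd = X.copy(), Y.copy()
    W, P, Q = [], [], []
    for _ in range(n_components):
        # Weight vector: dominant left singular vector of X^T Y.
        w = np.linalg.svd(Xd.T @ Yd, full_matrices=False)[0][:, 0]
        t = Xd @ w
        p = Xd.T @ t / (t @ t)
        q = Yd.T @ t / (t @ t)
        Xd = Xd - np.outer(t, p)  # deflate X
        Yd = Yd - np.outer(t, q)  # deflate Y
        W.append(w); P.append(p); Q.append(q)
    W, P, Q = (np.column_stack(m) for m in (W, P, Q))
    R = W @ np.linalg.inv(P.T @ W)
    return R, P, Q

def pca(A, k):
    """Scores, loadings and residual of a k-component PCA (A assumed centered)."""
    Vt = np.linalg.svd(A, full_matrices=False)[2]
    P = Vt[:k].T
    T = A @ P
    return T, P, A - T @ P.T

def cpls(X, Y, l, l_y, l_x, tol=1e-10):
    R, P, Q = pls(X, Y, l)
    Y_hat = X @ R @ Q.T                       # predictable output
    U, d, Vt = np.linalg.svd(Y_hat, full_matrices=False)
    l_c = int(np.sum(d > tol * d[0]))         # number of non-zero singular values
    U_c, D_c, V_c = U[:, :l_c], np.diag(d[:l_c]), Vt[:l_c].T
    Q_c = V_c @ D_c
    R_c = R @ Q.T @ V_c @ np.linalg.inv(D_c)  # so that U_c = X @ R_c
    Y_tilde_c = Y - U_c @ Q_c.T               # unpredictable output
    T_y, P_y, Y_res = pca(Y_tilde_c, l_y)
    R_c_pinv = np.linalg.inv(R_c.T @ R_c) @ R_c.T
    X_tilde_c = X - U_c @ R_c_pinv            # input part unrelated to the output
    T_x, P_x, X_res = pca(X_tilde_c, l_x)
    return dict(R=R, P=P, Q=Q, U_c=U_c, Q_c=Q_c, R_c=R_c, R_c_pinv=R_c_pinv,
                T_y=T_y, P_y=P_y, Y_res=Y_res, T_x=T_x, P_x=P_x, X_res=X_res,
                X_tilde_c=X_tilde_c)

# Hypothetical centered data: 30 samples, 8 spectral points, 2 concentrations.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 8))
Y = X @ rng.normal(size=(8, 2)) + 0.1 * rng.normal(size=(30, 2))
X, Y = X - X.mean(axis=0), Y - Y.mean(axis=0)
out = cpls(X, Y, l=3, l_y=1, l_x=2)
```

By construction the pieces add back up to the centered data, and the output-unrelated input part has no component along R_c, which is a convenient sanity check when adapting the sketch.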

Step 4: apply the same method as in step 3 to the matrix X_{s_center} to perform principal component analysis, obtaining the residual X_{s_res_c} of X_{s_center}.

In this embodiment, the optimal number of principal components for the PLS algorithm is selected as follows: 10-fold cross-validation is used, and, taking oil as an example, Fig. 5 shows how the cross-validation error of the oil content model on the target domain training set of the corn data set varies with the number of principal components. As Fig. 5 shows, the cross-validation error for oil on the corn data set reaches its global minimum when the number of principal components is 12, so the optimal number of principal components for oil is set to 12. The optimal number of principal components for the other three constituents is selected in the same way.
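The selection procedure above can be sketched as a 10-fold cross-validation loop over candidate component counts. The sketch below uses a minimal single-response PLS written in NumPy and random stand-in data; the function names, the fold assignment by random permutation and the candidate range 1 to 6 are assumptions for illustration and do not reproduce the corn experiment.

```python
import numpy as np

def pls_coefficients(X, y, n_components):
    """Regression vector of a minimal single-response PLS on centered data."""
    Xd, yd = X.copy(), y.copy()
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xd.T @ yd
        w /= np.linalg.norm(w)
        t = Xd @ w
        p = Xd.T @ t / (t @ t)
        c = yd @ t / (t @ t)
        Xd = Xd - np.outer(t, p)  # deflate X
        yd = yd - c * t           # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P = np.column_stack(W), np.column_stack(P)
    return W @ np.linalg.inv(P.T @ W) @ np.array(q)

def cv_rmse(X, y, n_components, n_folds=10, seed=0):
    """Root mean squared cross-validation error for one component count."""
    idx = np.random.default_rng(seed).permutation(len(y))
    sq_err = 0.0
    for test_idx in np.array_split(idx, n_folds):
        train_idx = np.setdiff1d(idx, test_idx)
        # Center with training-fold statistics only.
        x_mean, y_mean = X[train_idx].mean(axis=0), y[train_idx].mean()
        b = pls_coefficients(X[train_idx] - x_mean, y[train_idx] - y_mean, n_components)
        pred = (X[test_idx] - x_mean) @ b + y_mean
        sq_err += np.sum((y[test_idx] - pred) ** 2)
    return np.sqrt(sq_err / len(y))

# Hypothetical training set: 64 samples, 10 spectral points, one constituent.
rng = np.random.default_rng(2)
X_train = rng.normal(size=(64, 10))
y_train = X_train @ rng.normal(size=10) + 0.05 * rng.normal(size=64)
errors = [cv_rmse(X_train, y_train, a) for a in range(1, 7)]
best_n_components = int(np.argmin(errors)) + 1
```

Plotting `errors` against the component count would give a curve of the kind described for Fig. 5.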

Step 5: as shown in Fig. 3, use the least squares method to establish transfer matrices that map the latent structure of the target domain to the latent structure of the source domain: calculate the score T_{m_pre} = X_{m_center} R of the source domain data set after principal components are extracted from the master spectrum by the PLS algorithm, and the score T_{s_pre} = X_{s_center} R of the target domain data set after principal components are extracted from the slave spectrum by the PLS algorithm, and from T_{m_pre} and T_{s_pre} calculate the transfer matrix M_{trans_pre} by the least squares method; calculate the score T_m = X_{m_res_c} P of the source domain data set after principal components are extracted from the residual of the master spectrum, and the score T_s = X_{s_res_c} P of the target domain data set after principal components are extracted from the residual of the slave spectrum, and from T_m and T_s calculate the transfer matrix M_trans by the least squares method.
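A least-squares transfer matrix of the kind used in step 5 can be sketched with `numpy.linalg.lstsq`. The sketch assumes the matrix M is defined so that slave scores times M approximate the master scores, which matches the stated direction (target domain to source domain); the function name and the synthetic scores are illustrative assumptions.

```python
import numpy as np

def transfer_matrix(T_slave, T_master):
    """Least-squares M minimizing ||T_slave @ M - T_master||_F."""
    M, *_ = np.linalg.lstsq(T_slave, T_master, rcond=None)
    return M

# Hypothetical scores: 13 standard samples, 3 latent variables.
rng = np.random.default_rng(3)
T_s_pre = rng.normal(size=(13, 3))
M_true = rng.normal(size=(3, 3))
T_m_pre = T_s_pre @ M_true            # noise-free master scores for the check
M_trans_pre = transfer_matrix(T_s_pre, T_m_pre)
```

The same helper would be called a second time with the residual scores T_s and T_m to obtain M_trans.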

Step 6: as shown in Fig. 4, predict the substance concentration variables of the object under test:

Step 6.1: collect the spectrum of the object under test with the infrared spectrum measurement slave instrument, and extract spectral data by the same method as in step 1 to obtain the matrix X_{s_test} formed by the J spectral data points of the object under test;

Step 6.3: the matrix of predicted substance concentration variables of the object under test is Y_{test_predict} = (X_{s_test} * R * M_{trans_pre} * P^T + X_{s_res_c_test} * R * M_trans * P^T) * B.
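The prediction formula of step 6.3 is a chain of matrix products, which the following sketch writes out with explicit shapes. All matrices here are random placeholders with assumed dimensions (5 test samples, J = 8 spectral points, l = 3 latent variables, K = 2 concentrations); only the order of the products is taken from the text.

```python
import numpy as np

def predict_concentrations(X_s_test, X_s_res_c_test, R, P, M_trans_pre, M_trans, B):
    """Y = (X_s_test R M_trans_pre P^T + X_s_res_c_test R M_trans P^T) B."""
    X_main = X_s_test @ R @ M_trans_pre @ P.T      # transferred PLS part
    X_resid = X_s_res_c_test @ R @ M_trans @ P.T   # transferred residual part
    return (X_main + X_resid) @ B

# Placeholder matrices with assumed shapes.
rng = np.random.default_rng(4)
n, J, l, K = 5, 8, 3, 2
X_s_test = rng.normal(size=(n, J))
X_s_res_c_test = rng.normal(size=(n, J))
R, P = rng.normal(size=(J, l)), rng.normal(size=(J, l))
M_trans_pre, M_trans = rng.normal(size=(l, l)), rng.normal(size=(l, l))
B = rng.normal(size=(J, K))
Y_test_predict = predict_concentrations(
    X_s_test, X_s_res_c_test, R, P, M_trans_pre, M_trans, B)
```

Each row of `Y_test_predict` holds the K predicted concentration variables for one test sample.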

In this example, the data is predicted using a model, and the prediction error RMSEP results for different master-slave instrument combinations in the corn data set are shown in table 1 below:

TABLE 1

Analysis of Table 1 shows that the invention generally performs better between the spectra mp5spec and mp6spec than for the other two groups of instrument pairs. This is because mp5spec and mp6spec are highly similar to each other, while both differ considerably from m5spec, so transfer learning between mp5spec and mp6spec is more meaningful and yields smaller errors. With mp6spec as the master spectrum and mp5spec as the slave spectrum, the prediction errors for moisture, oil, protein and starch are essentially the smallest of the six experiments, while the errors for migration between m5spec and mp5spec or mp6spec are the largest of the six.
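For reference, the RMSEP values compared above follow the usual definition, the square root of the mean squared difference between reference and predicted values, computed per constituent. The sketch below uses made-up numbers purely to show the calculation; they are not results from Table 1.

```python
import numpy as np

def rmsep(y_true, y_pred):
    """Root mean squared error of prediction, one value per column."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0))

# Made-up reference and predicted values for two constituents.
y_ref = np.array([[1.0, 2.0], [3.0, 4.0]])
y_hat = np.array([[1.0, 3.0], [3.0, 5.0]])
errors = rmsep(y_ref, y_hat)  # first column matches exactly, second is off by 1
```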

Figs. 6 and 7 show the fitting results for mp6spec-mp5spec and m5spec-mp5spec in this embodiment. Comparing the two figures clearly shows which fit is better. For transfer learning between mp6spec and mp5spec, the spectra are more similar and the fit is better, with most points falling on or near the fitted line. For m5spec and mp5spec, all points fall below the fitted line. The transfer learning effect for mp6spec-mp5spec is therefore clearly better than that for m5spec-mp5spec, and migration between m5spec and mp5spec is not worthwhile because the prediction is poor.

Since the spectrum pair mp6spec-mp5spec gives the best migration effect, this pair is chosen for comparison with other algorithms, namely Multiplicative Scatter Correction (MSC), Canonical Correlation Analysis (CCA), Slope and Bias Correction (SBC) and Piecewise Direct Standardization (PDS). Table 2 shows the RMSEP comparison of the algorithms for mp6spec-m5spec in the corn data set. As Table 2 shows, the CPLS-based infrared spectrum measuring instrument calibration migration method of the invention performs very well overall: it is far superior to the MSC, CCA and PDS algorithms in predicting all four constituents; compared with the SBC algorithm, it predicts moisture and oil better and differs little in predicting protein and starch.

TABLE 2

In summary, six groups of experiments were run on the corn data set and the results were compared with the MSC, CCA, SBC and PDS algorithms. The prediction performance of the CPLS algorithm combined with transfer learning is similar to that of the SBC algorithm and far better than that of the MSC, CCA and PDS algorithms. The method therefore eliminates the random noise in the master instrument's measurements and improves data utilization and modeling accuracy.

It is to be understood that the above-described embodiments are only a few embodiments of the present invention, and not all embodiments. The above examples are only for explaining the present invention and do not constitute a limitation to the scope of protection of the present invention. All other embodiments, which can be derived by those skilled in the art from the above-described embodiments without any creative effort, namely all modifications, equivalents, improvements and the like made within the spirit and principle of the present application, fall within the protection scope of the present invention claimed.

Claims

1. A CPLS-based infrared spectrum measuring instrument calibration migration method is characterized by comprising the following steps:

where X_m = (X_m1, X_m2, ..., X_mi, ..., X_mI)^T, X_mi = (x_mi1, x_mi2, ..., x_mij, ..., x_miJ), X_s = (X_s1, X_s2, ..., X_si, ..., X_sI)^T, X_si = (x_si1, x_si2, ..., x_sij, ..., x_siJ); x_mij and x_sij are the j-th spectral data points of the i-th sample in the master spectrum and the slave spectrum respectively, i = 1, 2, ..., I, j = 1, 2, ..., J, I is the total number of samples, and J is the total number of extracted spectral data points; Y = (Y_1, Y_2, ..., Y_i, ..., Y_I)^T, Y_i = (y_i1, y_i2, ..., y_ik, ..., y_iK), y_ik is the value of the k-th substance concentration variable of the i-th sample, k = 1, 2, ..., K, and K is the total number of substance concentration variables;

Step 3: based on the CPLS algorithm, perform principal component analysis on the matrices X_{m_center} and Y_center:

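Step 3.2: calculate the predictable substance concentration variable as

$$\hat{Y} = X_{m\_center} R Q^T \tag{1}$$

Perform singular value decomposition on the predictable substance concentration variable to obtain

$$\hat{Y} = U_c D_c V_c^T = U_c Q_c^T \tag{2}$$

where $U_c$ is the left singular matrix, $D_c$ is the diagonal matrix of singular values, $V_c$ is the orthogonal right singular matrix, and $Q_c = V_c D_c^T$;

From equation (2),

$$U_c = \hat{Y} V_c D_c^{-1} = X_{m\_center} R Q^T V_c D_c^{-1} = X_{m\_center} R_c \tag{3}$$

which gives

$$R_c = R Q^T V_c D_c^{-1} \tag{4}$$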

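Step 3.3: calculate the unpredictable substance concentration variable as

$$\tilde{Y}_c = Y_{center} - U_c Q_c^T \tag{5}$$

Perform principal component extraction on the unpredictable substance concentration variable with $l_y$ principal components to obtain

$$\tilde{Y}_c = T_y P_y^T + \tilde{Y} \tag{6}$$

where $\tilde{Y}$ is the output residual matrix of $\tilde{Y}_c$;

from equation (6), the score matrix is obtained as

$$T_y = \tilde{Y}_c P_y$$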

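Step 3.4: by projecting onto the orthogonal complement of the space spanned by $R_c$, obtain the input variable unrelated to the substance concentration variables as

$$\tilde{X}_c = X_{m\_center} - U_c R_c^* \tag{7}$$

where $R_c^* = (R_c^T R_c)^{-1} R_c^T$;

Perform principal component extraction on the input variable unrelated to the substance concentration variables with $l_x$ principal components to obtain

$$\tilde{X}_c = T_x P_x^T + \tilde{X} \tag{8}$$

where $\tilde{X}$ is the input residual matrix of $\tilde{X}_c$;

from equation (8), the score matrix is obtained as

$$T_x = \tilde{X}_c P_x$$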

Step 6: predicting the substance concentration variable of the measured object:

2. The CPLS-based Infrared Spectroscopy measurement instrument calibration migration method according to claim 1, wherein in the step 1, the sample is grain, the spectral data is absorbance, and the substance concentration variables comprise moisture content, oil content, protein content and starch content of grain.