CN111563436B

CN111563436B - Infrared spectrum measuring instrument calibration migration method based on CT-CDD

Info

Publication number: CN111563436B
Application number: CN202010348512.2A
Authority: CN
Inventors: 赵煜辉; 刘晓东; 芦鹏程; 赵子恒
Original assignee: Northeastern University Qinhuangdao Branch
Current assignee: Northeastern University Qinhuangdao Branch
Priority date: 2020-04-28
Filing date: 2020-04-28
Publication date: 2022-04-08
Anticipated expiration: 2040-04-28
Also published as: CN111563436A

Abstract

The invention relates to the technical field of transfer learning in machine learning, and provides a CT-CDD-based calibration migration method for infrared spectrum measuring instruments. First, a source domain data set $\{X^m, y^m\}$ and a target domain data set $\{X^s\}$ are collected, and a source domain calibration set is divided out using the KS algorithm and centered. Then, a PLS calibration model is established on the centered source domain calibration set. Next, the characteristic spectrum $T^m$ of the master instrument and the pseudo-characteristic spectrum of the slave instrument are calculated. Using OLS and the data set $\{T^m, y^m\}$, the cluster number $K$ is determined by cross validation; $\{T^m, y^m\}$ and the slave pseudo-characteristic spectra are clustered separately; for each sub-dataset, a kth OLS model is established and a transformation matrix $M_k$ is calculated. Finally, the substance concentration variable of the measured object set is predicted. The invention does not need standard samples to construct the migration model, and can greatly improve the precision and efficiency of calibration migration for infrared spectrum measuring instruments.

Description

Infrared spectrum measuring instrument calibration migration method based on CT-CDD

Technical Field

The invention relates to the technical field of transfer learning in machine learning, and in particular to a CT-CDD-based calibration migration method for infrared spectrum measuring instruments.

Background

The near infrared region is the band of electromagnetic radiation between visible light and mid-infrared light, and was the first non-visible region to be discovered, by William Herschel in the 19th century. In October 1985, the American Society for Testing and Materials (ASTM) defined the near infrared spectral region as the wavelength range 780-2526 nm, corresponding to wave numbers 12820-3959 cm^-1. By the 1950s, near infrared spectroscopy was already being applied in some fields. In the 1960s that followed, however, interest in the near infrared region gradually diminished, except among some users with specific analysis applications, owing to the constant emergence of novel analysis techniques together with some drawbacks of near infrared spectroscopy itself.

Research on near infrared spectroscopy then entered a long silent period. As research on chemometrics grew and the manufacturing techniques for spectroscopic instruments continued to improve, infrared spectroscopy advanced further in the mid-1980s. Unlike traditional analysis techniques, near infrared spectroscopy is an indirect analysis technique: information such as the content of a substance cannot be obtained directly, and a calibration model must be established from known samples in order to predict the concentration information of unknown samples and so complete a quantitative or qualitative analysis. The analysis process of near infrared spectroscopy is shown in FIG. 1, and the main steps are as follows:

(1) selecting representative samples to form a calibration set, and testing the near infrared spectrum of the calibration set samples, wherein the collected calibration set samples need to be representative in the process;

(2) after collecting the calibration set of samples, measuring the concentration information of the substance of interest in the sample by standard analytical chemistry means;

(3) selecting a proper algorithm to model the measured calibration set spectra against the corresponding substance concentration information. This step is the core of near infrared quantitative analysis: a calibration model is established from the preprocessed near infrared spectra and the concentration information, the parameters of the model are generally determined through cross validation, and finally the performance of the model is checked;

(4) after the multivariate calibration model is established, the near infrared spectrum of the current test sample can be measured, and the substance content of the test sample is predicted by using the established calibration model.

Modern near infrared spectroscopy analysis technology already has abundant theoretical basis and technical practical experience. Unlike other analytical techniques, near infrared spectroscopy involves the theoretical knowledge of many different disciplines such as spectroscopy, chemometrics, and computer technology.

Near infrared spectroscopy has many advantages in that it can measure the chemical composition and properties of a sample in a matter of minutes. The method can simultaneously analyze various components of the sample only by completing one-time acquisition and measurement of the near infrared spectrum on the sample to be detected, and can reach more than ten indexes at most. The near infrared spectrum analysis technology can be directly analyzed after simple pretreatment is carried out on the sample, the sample is not damaged, and nondestructive detection is realized; does not need to use any chemical reagent, greatly reduces the analysis cost, does not cause pollution to the environment, and belongs to the 'green analysis' technology. The near infrared spectrum mainly reflects the information of chemical bonds of organic molecules containing hydrogen groups C-H, O-H, N-H, S-H and the like in a sample, and is very suitable for quantitative or qualitative analysis of hydrogen-containing organic matters. The analysis range of the near infrared spectrum analysis technology comprises most organic mixtures and compounds, and the unique advantages of the near infrared spectrum analysis technology make the application field of the technology extremely wide, and the technology has an indispensable effect in many industries, and is used for measuring the component content of substances in the agricultural field, such as the content of moisture or protein in corn; in the field of medicine, measurement of component contents of medicines, biological, food, environmental tests, and the like.

Machine learning and data mining techniques have achieved significant success in many areas of knowledge engineering, including classification, regression and clustering. A traditional machine learning method assumes that the training data and the test data follow the same distribution, so that a model built on the training data can be used to predict the test data. In practical application scenarios, however, the two distributions often differ, and in some cases training data are expensive or impossible to collect. If the distributions of the training data and the test data differ significantly, the predictions for the test data deviate greatly from the actual results, and most statistical models have to be rebuilt from newly collected training data. In such cases it is desirable to transfer knowledge between task domains; this approach is called transfer learning. Transfer learning improves a learner in one domain by transferring information from a related domain.

Multivariate calibration is a very useful tool for extracting chemical information from spectral signals, and an established multivariate calibration model is crucial for many analytical measurements. It has been applied to a variety of analytical techniques, and its importance is especially evident in near infrared (NIR) spectroscopy. Constructing a robust calibration model usually requires a large investment of manpower and material resources. Problems arise when samples are measured on different instruments or under different environmental conditions. Even for the same sample, the spectral matrices measured by two different instruments differ, and so do the models built on them. A model built on one instrument generally cannot predict the spectra measured by a second instrument. One way to solve this problem is to re-measure every sample and build a new model from the newly acquired spectra, but this is not practical, since establishing a robust calibration model costs considerable time and money. An acceptable alternative that avoids this expense is model migration. In the field of machine learning this way of handling the problem is called transfer learning and, more specifically, the case where the tasks are the same but the domains differ is called domain adaptation. In the field of chemometrics it is called calibration migration.

Most calibration migration methods construct the migration model from a set of standard samples, which must be measured on both the master instrument and the slave instrument, and various such methods have been proposed. For example, Direct Standardization (DS) and Piecewise Direct Standardization (PDS) correct the spectral differences between the master and slave instruments using a set of standard samples. In DS, each wavelength of the master is related to all wavelengths of the slave. In PDS, each wavelength of the master is related to a wavelength window of the slave, and a banded migration matrix is finally formed from the regression coefficients of each window. Experimental results are consistent with the assumption, made in various migration methods, that the spectral correlation between master and slave is limited to a small region. The key to PDS is the choice of the window size and of the number of standard samples; because a separate regression model is built for every window, the computational load is large. PDS is one of the most widely used migration methods and is often used as a baseline for new techniques. Slope and bias correction (SBC) assumes a linear relationship between the predicted values of different instruments: first, the regression coefficients between the spectra and the response values are calculated; then the predicted values for the master and slave instruments are calculated from these regression coefficients; finally, a linear fit is made between the predicted values. Liang et al. proposed a calibration migration method based on canonical correlation analysis that successfully corrects the differences between different spectra: first, a PLS model is constructed from the calibration set of the master instrument; part of the calibration sets of the master and slave instruments is selected as standard samples; features are extracted from each by canonical correlation analysis (CCA); the relationship between the master spectra and the slave spectra is then established by ordinary least squares (OLS), and the spectral differences are corrected. In addition, other calibration migration methods have been proposed, such as spectral regression (SR), transfer by orthogonal projection (TOP), single wavelength standardization (SWS), multi-spectral calibration migration based on independent component analysis, generalized least squares weighting (GLSW), and other methods that require standard samples.
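The SBC procedure described above can be illustrated with a minimal sketch. It assumes that the predictions of the master model on the standard samples are available for both instruments; the function names `fit_sbc` and `apply_sbc`, the toy coefficients and the simulated instrument distortion are all hypothetical and chosen only for illustration.

```python
import numpy as np

def fit_sbc(pred_master, pred_slave):
    """Fit slope and bias so that slope * pred_slave + bias approximates pred_master."""
    slope, bias = np.polyfit(pred_slave, pred_master, deg=1)
    return slope, bias

def apply_sbc(pred_slave, slope, bias):
    """Correct slave-instrument predictions with the fitted slope and bias."""
    return slope * np.asarray(pred_slave) + bias

# Toy standard samples (made-up numbers, not data from the patent).
rng = np.random.default_rng(0)
X_master = rng.normal(size=(10, 4))
beta = np.array([0.5, -1.0, 2.0, 0.3])      # hypothetical master regression coefficients
X_slave = 1.1 * X_master + 0.2              # hypothetical instrument difference

pred_master = X_master @ beta               # master model applied to master spectra
pred_slave = X_slave @ beta                 # master model applied to slave spectra
slope, bias = fit_sbc(pred_master, pred_slave)
corrected = apply_sbc(pred_slave, slope, bias)
```

Because the simulated distortion is exactly linear, the corrected slave predictions coincide with the master predictions here; with real spectra the fit would only be approximate.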

As can be seen from the above, in the prior art, many methods have been used to develop a relatively stable calibration model, but changes in environmental conditions and adjustments of a measurement instrument all cause poor prediction performance of the calibration model and even cause model failure, so that it is necessary to migrate to a spectrum to be measured by using the relevant knowledge of the established calibration model to help the spectrum to be measured to predict so as to save a lot of overhead. In the existing calibration migration method capable of remarkably improving the predictive performance of the model, a standard sample is mostly needed to be used for constructing the migration model. The standard sample should closely match the sample from which the calibration model was constructed and must exhibit sufficient variability to account for differences between the two instruments. The volatility and reactivity of the components make it a great challenge to maintain the integrity of the standard sample. Even more, in some practical applications it is difficult or even impossible to obtain standard samples, i.e. to measure their spectra simultaneously on the master and slave instruments. Although there are a small number of calibration migration methods that do not require standard samples, the prediction performance of these methods is very different from that of the migration methods with standard samples.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides the CT-CDD-based infrared spectrum measuring instrument calibration migration method, which does not need to use a standard sample to construct a migration model, and can greatly improve the precision and efficiency of infrared spectrum measuring instrument calibration migration.

The technical scheme of the invention is as follows:

a CT-CDD-based infrared spectrum measuring instrument calibration migration method is characterized by comprising the following steps:

Step 1: let the infrared spectrum measurement master instrument correspond to the source domain and the infrared spectrum measurement slave instrument correspond to the target domain; collect the spectrum of each sample with the master instrument and with the slave instrument to obtain the master spectrum and the slave spectrum respectively; extract spectral data from the master spectrum and the slave spectrum at intervals of $a$ nm within a wavelength range; collect the substance concentration variable value of each sample; and obtain the source domain data set $\{X^m, y^m\}$ and the target domain data set $\{X^s\}$;

wherein $X^m$ and $X^s$ are the master spectral matrix and the slave spectral matrix respectively; $x_i^m$ and $x_i^s$ are the master spectral vector and the slave spectral vector of the $i$th sample, $i = 1, 2, \ldots, I$, where $I$ is the total number of samples; $x_{ij}^m$ and $x_{ij}^s$ are the $j$th master spectral data point and the $j$th slave spectral data point of the $i$th sample, $j = 1, 2, \ldots, J$, where $J$ is the total number of extracted spectral data points; $y^m$ is the matrix of substance concentration variable values, and $y_i^m$ is the value of the substance concentration variable for the $i$th sample;

Step 2: use the KS algorithm to divide the source domain data set $\{X^m, y^m\}$ into a source domain calibration set and a source domain test set;

Step 3: center the source domain calibration set to obtain the centered source domain calibration set;

Step 4: based on the PLS algorithm, establish a calibration model on the centered source domain calibration set, and calculate its weight matrix $W^m$, loading matrix $P^m$ and regression coefficient matrix $\beta^m$;

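Step 5: construct the migration model:

Step 5.1: calculate the characteristic spectrum matrix of the infrared spectrum measurement master instrument

$$T^m = X^m W^m \left(P^{m\top} W^m\right)^{-1}$$

and calculate the pseudo-characteristic spectrum matrix of the infrared spectrum measurement slave instrument, written here as $T^s$, by projecting the slave spectral matrix onto the same PLS subspace of the master instrument:

$$T^s = X^s W^m \left(P^{m\top} W^m\right)^{-1}$$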
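Steps 3 to 5.1 can be sketched as follows. This is only an illustration under stated assumptions: PLS is implemented with a plain NIPALS PLS1 loop (the text does not say which PLS variant is used), the slave spectra are centered with the master calibration mean (also an assumption), and the function name `pls1_nipals` and the toy data are hypothetical.

```python
import numpy as np

def pls1_nipals(X, y, n_components):
    """NIPALS PLS1 on already-centered X (n x J) and y (n,)."""
    Xk, yk = X.copy(), y.copy()
    weights, loadings, scores, y_loadings = [], [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w = w / np.linalg.norm(w)          # weight vector
        t = Xk @ w                         # score vector
        tt = t @ t
        p = Xk.T @ t / tt                  # X loading vector
        c = (yk @ t) / tt                  # y loading (scalar)
        Xk = Xk - np.outer(t, p)           # deflate X
        yk = yk - c * t                    # deflate y
        weights.append(w); loadings.append(p); scores.append(t); y_loadings.append(c)
    W = np.column_stack(weights)
    P = np.column_stack(loadings)
    T = np.column_stack(scores)
    beta = W @ np.linalg.inv(P.T @ W) @ np.array(y_loadings)
    return W, P, beta, T

# Toy master and slave spectra (made-up numbers).
rng = np.random.default_rng(0)
X_m = rng.normal(size=(30, 8))
y_m = X_m @ rng.normal(size=8) + 0.01 * rng.normal(size=30)
X_s = 1.05 * X_m + 0.1                     # hypothetical slave-instrument distortion

# Step 3: centering of the master calibration set.
x_mean, y_mean = X_m.mean(axis=0), y_m.mean()
Xc, yc = X_m - x_mean, y_m - y_mean

# Step 4: PLS model of the master instrument.
W, P, beta, T_nipals = pls1_nipals(Xc, yc, n_components=3)

# Step 5.1: project both instruments into the master PLS subspace.
R = W @ np.linalg.inv(P.T @ W)
T_m = Xc @ R                               # characteristic spectra of the master
T_s = (X_s - x_mean) @ R                   # pseudo-characteristic spectra of the slave
```

The projection `Xc @ R` reproduces the NIPALS score matrix, which is why the transpose of the loading matrix is needed inside the inverse: both `P` and `W` have shape J x A, so only `P.T @ W` is a square A x A matrix.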

Step 5.2: for each cluster number $L \in L^*$, use the k-means clustering algorithm to cluster the characteristic spectrum vectors of the data set $\{T^m, y^m\}$, dividing $\{T^m, y^m\}$ into $L$ sub-datasets $\{T_{0_l}^m, y_{0_l}^m\}$, $l = 1, 2, \ldots, L$; based on the OLS algorithm, establish an initial least squares model on the $l$th sub-dataset $\{T_{0_l}^m, y_{0_l}^m\}$, $l = 1, 2, \ldots, L$; calculate the cross validation error $RMSECV_L$ of the $L$ initial least squares models under each cluster number, and take the cluster number corresponding to $\min\{RMSECV_L \mid L \in L^*\}$ as the final cluster number $K$;

wherein $L^*$ is the set of candidate cluster numbers, $T_{0_l}^m$ is the $l$th initial sub-characteristic spectrum matrix, $y_{0_l}^m$ is the matrix of substance concentration variable values of the samples corresponding to $T_{0_l}^m$, and $\beta_{0_l}$ is the $l$th initial regression coefficient matrix;
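The selection of the cluster number in step 5.2 can be sketched as below. Several details are assumptions, since the text does not specify them: a plain Lloyd's k-means with a fixed seed, OLS models with an intercept, 5-fold cross validation, and assignment of each held-out sample to the nearest cluster center. All function names and the toy data are hypothetical.

```python
import numpy as np

def nearest_center(T, centers):
    """Index of the closest center for every row of T."""
    d2 = ((T[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(T, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = T[rng.choice(len(T), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = nearest_center(T, centers)
        new_centers = np.array([
            T[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, nearest_center(T, centers)

def fit_ols(T, y):
    """OLS with intercept; returns [intercept, coefficients...]."""
    if len(y) == 0:                       # guard against an empty cluster
        return np.zeros(T.shape[1] + 1)
    design = np.column_stack([np.ones(len(T)), T])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

def rmsecv(T, y, n_clusters, n_folds=5, seed=0):
    """Cross-validated RMSE of cluster-wise OLS models for one cluster number."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(T)), n_folds)
    sq_err = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(T)), held_out)
        centers, labels = kmeans(T[train], n_clusters, seed=seed)
        betas = [fit_ols(T[train][labels == l], y[train][labels == l])
                 for l in range(n_clusters)]
        for i, l in zip(held_out, nearest_center(T[held_out], centers)):
            pred = betas[l][0] + T[i] @ betas[l][1:]
            sq_err.append((pred - y[i]) ** 2)
    return float(np.sqrt(np.mean(sq_err)))

def choose_cluster_number(T, y, candidates):
    """Return the candidate with the smallest RMSECV, plus all scores."""
    scores = {L: rmsecv(T, y, L) for L in candidates}
    return min(scores, key=scores.get), scores

# Toy score data with two groups that follow different linear relations.
rng = np.random.default_rng(1)
T_a = rng.normal(loc=[-5.0, 0.0], scale=1.0, size=(40, 2))
T_b = rng.normal(loc=[5.0, 0.0], scale=1.0, size=(40, 2))
T = np.vstack([T_a, T_b])
y = np.concatenate([T_a[:, 1], -T_b[:, 1]]) + 0.05 * rng.normal(size=80)
K, scores = choose_cluster_number(T, y, candidates=[1, 2, 3])
```

On this toy data a single global model cannot fit both groups, so the cross-validation error drops sharply once more than one cluster is allowed.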

Step 5.3: according to the cluster number $K$, use the k-means clustering algorithm to cluster the characteristic spectrum vectors of the data set $\{T^m, y^m\}$, dividing $\{T^m, y^m\}$ into $K$ sub-datasets $\{T_k^m, y_k^m\}$, $k = 1, 2, \ldots, K$; likewise, according to the cluster number $K$, use the k-means clustering algorithm to cluster the pseudo-characteristic spectrum vectors of the slave instrument, dividing the slave pseudo-characteristic spectrum matrix, written here as $T^s$, into $K$ sub-datasets $T_k^s$, $k = 1, 2, \ldots, K$;

wherein the characteristic spectrum vectors and the pseudo-characteristic spectrum vectors are the row vectors of $T^m$ and $T^s$ respectively; $T_k^m$ and $T_k^s$ are the $k$th sub-characteristic spectrum matrix and the $k$th sub-pseudo-characteristic spectrum matrix respectively; and $y_k^m$ is the matrix of substance concentration variable values of the samples corresponding to $T_k^m$;

Step 5.4: based on the OLS algorithm, establish the $k$th least squares model on the $k$th sub-dataset $\{T_k^m, y_k^m\}$, and calculate the $k$th regression coefficient matrix $\beta_k$;

Step 5.5: calculate the $k$th transformation matrix $M_k$ from $\Sigma_k^m$ and $\Sigma_k^s$, which are the covariance matrices of the $k$th sub-characteristic spectrum matrix $T_k^m$ and the $k$th sub-pseudo-characteristic spectrum matrix $T_k^s$ respectively;
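One way to obtain a transformation matrix from the two covariance matrices is covariance matching with symmetric matrix square roots. The sketch below is an assumption about the form of the solution, not the formula of the text (which is not reproduced here): score vectors are rows, the correction is a right multiplication, and the helper names `sym_matrix_power` and `transformation_matrix` are hypothetical.

```python
import numpy as np

def sym_matrix_power(S, power):
    """Power of a symmetric positive definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(S)
    vals = np.clip(vals, 1e-12, None)      # guard against tiny negative eigenvalues
    return (vecs * vals ** power) @ vecs.T

def transformation_matrix(T_m_k, T_s_k):
    """M_k such that cov(T_s_k @ M_k) equals cov(T_m_k) (rows are samples)."""
    cov_m = np.cov(T_m_k, rowvar=False)
    cov_s = np.cov(T_s_k, rowvar=False)
    return sym_matrix_power(cov_s, -0.5) @ sym_matrix_power(cov_m, 0.5)

# Toy cluster data (made-up numbers).
rng = np.random.default_rng(0)
T_m_k = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.3, 0.0],
                                              [0.0, 1.0, 0.5],
                                              [0.0, 0.0, 0.7]])
T_s_k = rng.normal(size=(150, 3)) @ np.array([[1.0, 0.0, 0.0],
                                              [0.4, 1.5, 0.0],
                                              [0.0, 0.2, 0.9]])
M_k = transformation_matrix(T_m_k, T_s_k)
T_s_corrected = (T_s_k - T_s_k.mean(axis=0)) @ M_k   # centered, then corrected
```

After the correction, the sample covariance of the slave cluster equals that of the master cluster, which is the property the covariance-matching condition asks for.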

Step 6: predict the substance concentration variable of the measured object set:

Step 6.1: collect the spectrum of each measured object in the measured object set with the infrared spectrum measurement slave instrument, and extract the spectral data by the same method as in step 1 to obtain the slave spectral matrix of the measured object set;

Step 6.2: calculate the pseudo-characteristic spectrum matrix of the measured object set under the infrared spectrum measurement slave instrument;

Step 6.3: according to the cluster number $K$, use the k-means clustering algorithm to divide the pseudo-characteristic spectrum matrix of the measured object set into $K$ sub-datasets, $k = 1, 2, \ldots, K$, the $k$th of which is the $k$th sub-pseudo-characteristic spectrum matrix of the measured object set;

Step 6.4: use the $k$th transformation matrix $M_k$ to transform and correct the $k$th sub-pseudo-characteristic spectrum matrix of the measured object set, obtaining the $k$th transformation-corrected sub-pseudo-characteristic spectrum matrix;

Step 6.5: calculate the matrix of predicted values of the substance concentration variable of the measured objects corresponding to the $k$th transformation-corrected sub-pseudo-characteristic spectrum matrix.
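Steps 6.4 and 6.5 can be sketched as a cluster-wise correction followed by a cluster-wise prediction. The sketch assumes that the correction is a right multiplication by $M_k$ and that the prediction is the corrected scores multiplied by $\beta_k$; both are assumptions, and the function name `predict_by_cluster` and the toy numbers are hypothetical.

```python
import numpy as np

def predict_by_cluster(T_s, labels, M_list, beta_list):
    """Predict y for each row of T_s using the matrices of its own cluster."""
    y_hat = np.empty(len(T_s))
    for k, (M_k, beta_k) in enumerate(zip(M_list, beta_list)):
        rows = labels == k
        corrected = T_s[rows] @ M_k          # step 6.4: transformation correction
        y_hat[rows] = corrected @ beta_k     # step 6.5: cluster-wise prediction
    return y_hat

# Toy example with two clusters (made-up numbers).
T_s = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
labels = np.array([0, 1, 0])
M_list = [np.eye(2), 2.0 * np.eye(2)]
beta_list = [np.array([1.0, 1.0]), np.array([3.0, 4.0])]
y_hat = predict_by_cluster(T_s, labels, M_list, beta_list)
```

Every sample is handled only by the transformation matrix and regression coefficients of the cluster it was assigned to, which is what lets each cluster carry its own correction.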

The invention has the beneficial effects that:

The invention performs calibration migration by correcting the data distribution difference of the PLS subspace (CT-CDD). Specifically, a PLS model of the master instrument is established; the spectra of the master instrument and the slave instrument are projected into the PLS subspace; the latent variables of the different spectra are subjected to cluster analysis separately; a regression model between the latent variables and the concentration information of the master instrument is established by ordinary least squares; the characteristic spectra with the closest data distribution between the two instruments are found; and a conversion function is calculated for each, so that when the substance concentration variable of a measured object is predicted, the prediction result is corrected by the corresponding conversion function. No standard sample is needed to construct the migration model at any point in the process, and the precision and efficiency of infrared spectrum calibration migration are greatly improved. This solves the problems of the prior art, in which the calibration migration methods that significantly improve the predictive performance of the model need standard samples to construct the migration model, standard samples are difficult or even impossible to obtain and their integrity is difficult to guarantee, and the few calibration migration methods that need no standard samples have poor prediction performance.

Drawings

FIG. 1 is a schematic diagram of an analysis process of near infrared spectroscopy.

FIG. 2 is a flow chart of the CT-CDD-based infrared spectroscopy measurement apparatus calibration migration method of the present invention.

FIG. 3 is a diagram showing the spectral differences between different instruments in the first and second embodiments.

Fig. 4 is a schematic diagram of the prediction results of the CT-CDD-based infrared spectrum measuring instrument calibration migration method of the present invention and five other calibration migration methods on M5 × MP5 in the first embodiment.

Fig. 5 is a schematic diagram of the prediction results of the CT-CDD-based infrared spectrum measuring instrument calibration migration method of the present invention and five other calibration migration methods on M5 × MP6 in the first embodiment.

Fig. 6 is a schematic diagram of the prediction results of the CT-CDD-based infrared spectrum measuring instrument calibration migration method of the present invention and five other calibration migration methods on MP5 × MP6 in the first embodiment.

Fig. 7 is a schematic diagram of the prediction results of the CT-CDD-based infrared spectroscopic measuring instrument calibration migration method and five other calibration migration methods according to the second embodiment of the present invention on B1 × B2.

Fig. 8 is a schematic diagram of the prediction results of the CT-CDD-based infrared spectroscopic measurement apparatus calibration migration method and five other calibration migration methods according to the second embodiment of the present invention on B1 × B3.

Fig. 9 is a schematic diagram of the prediction results of the CT-CDD-based infrared spectroscopic measuring instrument calibration migration method and five other calibration migration methods according to the second embodiment of the present invention on B3 × B2.

Detailed Description

The invention will be further described with reference to the accompanying drawings and specific embodiments.

Addressing the technical problems of the prior art, namely that standard samples are required to construct the migration model, that standard samples are difficult or even impossible to obtain, that the integrity of standard samples is difficult to guarantee, and that calibration migration methods without standard samples predict poorly, and addressing the high dimensionality and multicollinearity of spectral data, the invention provides a standard-free calibration migration method using transfer learning from machine learning. The performance of the CT-CDD of the invention was compared with the prediction performance of SBC, PDS, CCACT, TCR and CTAI on two near infrared spectral data sets. Without standard samples, the prediction performance obtained by the method is superior to that of the classical calibration migration methods that use standard samples.

Example one

As shown in FIG. 2, the CT-CDD-based infrared spectroscopy measurement instrument calibration migration method of the present invention comprises the following steps:

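Step 1: let the infrared spectrum measurement master instrument correspond to the source domain and the slave instrument correspond to the target domain; collect the master spectrum and the slave spectrum of each sample, extract the spectral data at intervals of $a$ nm within a wavelength range, collect the substance concentration variable value of each sample, and obtain the source domain data set $\{X^m, y^m\}$ and the target domain data set $\{X^s\}$, where $x_i^m$ and $x_i^s$ are the master spectral vector and the slave spectral vector of the $i$th sample, $i = 1, 2, \ldots, I$, $I$ is the total number of samples, and $y_i^m$ is the value of the substance concentration variable for the $i$th sample.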

In the first embodiment, the samples are corn and the spectral data are absorbances. The substance concentration variable can be moisture content, oil content, protein content or starch content; in this embodiment, moisture content is used to verify the method of the invention. The data measured on the same I = 80 samples using three near infrared spectroscopic measuring instruments (M5, MP5, MP6) constitute the corn data set. The instruments M5, MP5 and MP6 measure the spectrum at intervals of a = 2 nm over the wavelength range 1100-2498 nm, giving J = 700 channels. The spectral differences among M5-MP5, M5-MP6 and MP5-MP6 are shown in FIG. 3A, FIG. 3B and FIG. 3C respectively.

Step 2: use the KS algorithm to divide the source domain data set $\{X^m, y^m\}$ into a source domain calibration set and a source domain test set.

In this first embodiment, the Kennard-Stone (KS) algorithm divides the 80 corn samples into two groups: the source domain data of the first group of 64 samples constitute the source domain calibration set, and the source domain data of the second group of 16 samples constitute the source domain test set.
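A minimal sketch of a Kennard-Stone split is given below. It uses the common formulation with Euclidean distances (start from the two most distant samples, then repeatedly add the sample whose distance to its nearest selected sample is largest); the text does not state which distance or variant is used, and the function name `kennard_stone` and the random toy spectra are hypothetical.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Return (selected, remaining) sample indices chosen by Kennard-Stone."""
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Start with the two samples that are farthest apart.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_select:
        # Distance from each remaining sample to its nearest selected sample.
        min_d = dist[np.ix_(remaining, selected)].min(axis=1)
        nxt = remaining[int(np.argmax(min_d))]
        selected.append(nxt)
        remaining.remove(nxt)
    return selected, remaining

# Split 80 toy "spectra" into 64 calibration and 16 test samples.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(80, 5))
cal_idx, test_idx = kennard_stone(X_toy, n_select=64)

# Tiny deterministic example on a line: points 0, 1, 2 and 10.
small_sel, small_rest = kennard_stone([[0.0], [1.0], [2.0], [10.0]], n_select=3)
```

In the small example the two extreme points are picked first and the point at 2 is added next, because it lies farther from its nearest selected neighbor than the point at 1 does.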

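Step 3: center the source domain calibration set to obtain the centered source domain calibration set.

Step 4: based on the PLS algorithm, establish a calibration model on the centered source domain calibration set, and calculate its weight matrix $W^m$, loading matrix $P^m$ and regression coefficient matrix $\beta^m$.

Step 5: construct the migration model:

Step 5.1: calculate the characteristic spectrum matrix of the master instrument, $T^m = X^m W^m \left(P^{m\top} W^m\right)^{-1}$, and the pseudo-characteristic spectrum matrix of the slave instrument, written here as $T^s$.

Step 5.2: for each cluster number $L \in L^*$, use the k-means clustering algorithm to divide the data set $\{T^m, y^m\}$ into $L$ sub-datasets, $l = 1, 2, \ldots, L$; based on the OLS algorithm, establish an initial least squares model on the $l$th sub-dataset, $l = 1, 2, \ldots, L$; calculate the cross validation error $RMSECV_L$ under each cluster number, and take the cluster number with the smallest $RMSECV_L$ as the final cluster number $K$; wherein $L^*$ is the set of candidate cluster numbers, and each sub-dataset consists of the $l$th initial sub-characteristic spectrum matrix and the matrix of substance concentration variable values of the corresponding samples.

Step 5.3: according to the cluster number $K$, use the k-means clustering algorithm to divide $\{T^m, y^m\}$ into $K$ sub-datasets $\{T_k^m, y_k^m\}$, $k = 1, 2, \ldots, K$, and to divide $T^s$ into $K$ sub-datasets $T_k^s$, $k = 1, 2, \ldots, K$; wherein the characteristic spectrum vectors and the pseudo-characteristic spectrum vectors are the row vectors of $T^m$ and $T^s$ respectively, and $y_k^m$ is the matrix of substance concentration variable values of the samples corresponding to $T_k^m$.

Step 5.4: based on the OLS algorithm, establish the $k$th least squares model on the $k$th sub-dataset $\{T_k^m, y_k^m\}$, and calculate the $k$th regression coefficient matrix $\beta_k$.

Step 5.5: calculate the $k$th transformation matrix $M_k$ from $\Sigma_k^m$ and $\Sigma_k^s$, which are the covariance matrices of $T_k^m$ and $T_k^s$ respectively.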

In constructing the least squares models, for each cluster of the slave instrument the master instrument model whose characteristic spectra are closest is found, and the transformation matrices are calculated cluster by cluster. The specific reasoning is as follows:

The master instrument and the slave instrument each correspond to one domain. A domain consists of two main parts: the input space $X$ and its corresponding marginal probability distribution $p(X)$. The relative entropy, or KL divergence, can represent the distance between the data distributions of the two domains, and is expressed by equation (1):

$$KL(p\,\|\,q) = \int p(x)\,\ln\frac{p(x)}{q(x)}\,dx \qquad (1)$$

wherein $p$ and $q$ are the probability density functions of the data distributions of the source domain and the target domain respectively.

$p(x)$ cannot be obtained directly, but suppose that a finite set of training points $x_n$, $n = 1, \ldots, N$, drawn from $p(x)$ has been observed. The expectation with respect to $p(x)$ can then be approximated by a finite sum over these points:

$$KL(p\,\|\,q) \approx \frac{1}{N}\sum_{n=1}^{N}\bigl[\ln p(x_n) - \ln q(x_n)\bigr] \qquad (2)$$
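The finite-sum approximation of the relative entropy can be checked numerically on a case with a known answer. The sketch below uses two one-dimensional Gaussians, which are an illustrative assumption and not the spectral distributions of the text; the function names are hypothetical.

```python
import math
import random

def log_gauss(x, mean, std):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - (x - mean) ** 2 / (2.0 * std ** 2)

def kl_finite_sum(samples, log_p, log_q):
    """Approximate KL(p || q) by averaging ln p(x) - ln q(x) over samples drawn from p."""
    return sum(log_p(x) - log_q(x) for x in samples) / len(samples)

random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(20000)]   # points drawn from p = N(0, 1)
estimate = kl_finite_sum(
    samples,
    lambda x: log_gauss(x, 0.0, 1.0),    # p = N(0, 1)
    lambda x: log_gauss(x, 1.0, 1.0),    # q = N(1, 1)
)
exact = 0.5   # closed form for two unit-variance Gaussians whose means differ by 1
```

With 20000 sample points the estimate lies close to the closed-form value, which illustrates why a finite set of observed points can stand in for the unknown density when comparing two distributions.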

Given the labeled spectra $\{X^m, y^m\}$ of the master instrument and the unlabeled spectra $\{X^s\}$ of the slave instrument, the aim is to predict the output for the spectra to be measured on the slave instrument.

The spectra measured by different instruments differ, so the data distributions of the two instruments differ. Equation (3) expresses the distance between the data distributions in absolute-value form, where $X^m$ and $X^s$ are random vectors following the respective data distributions:

$$KL(P\,\|\,Q) \approx \left|\ln P(X^m) - \ln Q(X^s)\right| \qquad (3)$$

wherein $P$ and $Q$ are the probability density functions of the data distributions of the source domain and the target domain respectively.

Considering that the spectral data are multicollinear, all the spectra are mapped to the PLS subspace of the master instrument, which reduces the dimension of the data and simplifies model (3). The characteristic spectrum of the master instrument and the pseudo-characteristic spectrum of the slave instrument, written here as $T^s$, are calculated using equation (4):

$$T^m = X^m W^m \left(P^{m\top} W^m\right)^{-1}, \qquad T^s = X^s W^m \left(P^{m\top} W^m\right)^{-1} \qquad (4)$$

wherein $T^m$ and $T^s$ are each a matrix composed of the $A$ extracted score vectors.

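The KL distance between the data distributions of the two instruments then becomes:

$$KL(P\,\|\,Q) \approx \left|\ln P(t^m) - \ln Q(t^s)\right| \qquad (5)$$

wherein $t^m$ and $t^s$ are random vectors following the data distributions of $T^m$ and of the slave pseudo-characteristic spectrum matrix $T^s$ respectively.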

The spectral data of the master instrument and the slave instrument are treated as mixtures. After clustering, the data distribution of each cluster is assumed to be a single Gaussian distribution, so that the data distributions of the characteristic spectra of the master instrument and the slave instrument are each made up of per-cluster Gaussian distributions. The distribution difference in formula (6) can be reduced by correcting the mean and the covariance of each cluster separately. The data are first centered to correct the mean. For the $i$th cluster, $P_i$ and $Q_i$ denote the $i$th Gaussian distributions of the two instruments, $f_i$ is the $i$th transfer function, and $t_i^m$ and $t_i^s$ are two random vectors that follow the data distributions of the $i$th characteristic spectra of the two instruments respectively. Equation (5) can then be written cluster by cluster in the form of equation (6).

Assume that there exists a linear transformation matrix $M_i$ that realizes the transfer function $f_i$, such that the distance between the corrected slave spectrum and the master spectrum is minimal; the relative entropy (6) can then be rewritten as equation (7), with $f_i$ expressed through $M_i$. The linear transformation matrix $M_i$ in equation (7) is solved as follows.

Each group of clustered data is approximately normally distributed with mean 0. With $t$ written as a column vector of dimension $A$, the probability density function of the master characteristic spectrum is given by equation (8.1), and that of the slave instrument by equation (8.2):

$$P_i(t^m) = \frac{1}{(2\pi)^{A/2}\,|\Sigma_i^m|^{1/2}} \exp\!\left(-\tfrac{1}{2}\, t^{m\top} (\Sigma_i^m)^{-1} t^m\right) \qquad (8.1)$$

$$Q_i(t^s) = \frac{1}{(2\pi)^{A/2}\,|\Sigma_i^s|^{1/2}} \exp\!\left(-\tfrac{1}{2}\, t^{s\top} (\Sigma_i^s)^{-1} t^s\right) \qquad (8.2)$$

wherein $\Sigma_i^m$ and $\Sigma_i^s$ are the covariance matrices of the $i$th cluster of the master instrument and the slave instrument respectively.

Through the function $f_i$, the random vector $t^s$ of the slave instrument can be converted into the random vector $t^m$ of the master instrument, as expressed by equation (9). Supposing that $M_i$ is a non-singular matrix, the random vector of the master instrument can conversely be converted into the random vector of the slave instrument using equation (10). According to the properties of probability density functions, the probability density function of the master instrument can be transformed by means of equation (9), which gives equation (11): the probability density function obtained after the slave characteristic spectrum has been transformed to the master instrument by the transformation matrix $M_i$. Expanding it gives equation (12). Equation (12) must be the same as equation (8.1), so the covariances of the two are the same, and $M_i$ is obtained by solving this covariance-matching condition.
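The covariance-matching condition can be made concrete with a short worked derivation. This is a sketch under assumptions that the text does not confirm: score vectors are treated as column vectors, the transfer function is taken to be $f_i(t^s) = M_i t^s$, and $\Sigma_i^m$, $\Sigma_i^s$ denote the covariance matrices of the $i$th master and slave clusters (symbols chosen here for illustration).

```latex
\begin{aligned}
t^s &\sim \mathcal{N}\!\left(0,\ \Sigma_i^s\right), \qquad t^m = M_i\, t^s
  \;\Longrightarrow\; t^m \sim \mathcal{N}\!\left(0,\ M_i \Sigma_i^s M_i^{\top}\right), \\[4pt]
M_i \Sigma_i^s M_i^{\top} &= \Sigma_i^m
  \qquad \text{(the transformed slave cluster must have the master covariance)}, \\[4pt]
M_i &= (\Sigma_i^m)^{1/2}\, (\Sigma_i^s)^{-1/2}
  \qquad \text{(one solution, using symmetric matrix square roots)}, \\[4pt]
\text{check:}\quad M_i \Sigma_i^s M_i^{\top}
  &= (\Sigma_i^m)^{1/2} (\Sigma_i^s)^{-1/2}\, \Sigma_i^s\, (\Sigma_i^s)^{-1/2} (\Sigma_i^m)^{1/2}
   = \Sigma_i^m .
\end{aligned}
```

The solution is not unique: $(\Sigma_i^m)^{1/2}\, Q\, (\Sigma_i^s)^{-1/2}$ satisfies the same condition for any orthogonal matrix $Q$. If the score vectors are instead stored as rows of a matrix and corrected by right multiplication, the transpose $(\Sigma_i^s)^{-1/2} (\Sigma_i^m)^{1/2}$ plays the same role.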

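Step 6: predict the substance concentration variable of the measured object set: collect the spectrum of each measured object with the slave instrument and extract the spectral data as in step 1; calculate the pseudo-characteristic spectrum matrix of the measured object set; according to the cluster number $K$, use the k-means clustering algorithm to divide it into $K$ sub-datasets, $k = 1, 2, \ldots, K$; use the $k$th transformation matrix $M_k$ to transform and correct the $k$th sub-pseudo-characteristic spectrum matrix; and calculate the predicted values of the substance concentration variable from the $k$th transformation-corrected sub-pseudo-characteristic spectrum matrix, $k = 1, 2, \ldots, K$.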

In this embodiment, the substance concentration variable of the measured object set is predicted using the CT-CDD-based calibration migration method of the invention and, for comparison, the existing SBC, PDS, CCACT, TCR and CTAI calibration migration methods. The PLS model of the master instrument of the invention is built on the calibration set; for the other migration methods that require standard samples, a number of standard samples are selected from the calibration set using the Kennard-Stone method. For the SBC, PDS, CCACT and CTAI algorithms, PLS is adopted as the base algorithm: a multivariate calibration model is established from the spectral data of the master instrument and used as the reference model to predict the samples to be measured on the slave instrument.

The parameter selection criteria for the different migration methods are similar to CTAI. And the optimal number of latent variables of the PLS model takes values in the range [1,15], is determined by cross validation through ten folds, and is selected according to the minimum cross validation error criterion.

The PLS modeling method and parameter optimization of the primary instruments of SBC, PDS are the same as CTAI. The window size in the PDS is from 3, 16 is searched by the increment of 2, parameters are selected through 5-fold cross validation, the RMSECV of each window model is respectively calculated, and the window with the minimum RMSECV is selected as the optimal parameter; PDS performs poorly on wheat data sets and, in window selection, an F-test is used to determine the optimal window size.
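The grid search described above (candidate windows starting at 3 in steps of 2, 5-fold cross-validation, minimum RMSECV) can be sketched as follows. This is only an illustration of the selection loop: the nearest-neighbour averager is a toy stand-in and is not PDS, the data are synthetic, and the function names are hypothetical. Reading "from 3 to 16 in increments of 2" as the windows 3, 5, ..., 15 is also an assumption.

```python
import math
import random

def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def rmsecv(X, y, fit, predict, k):
    """Pooled root mean square error over k held-out folds."""
    sq_err = 0.0
    for fold in kfold_indices(len(y), k):
        held = set(fold)
        train = [i for i in range(len(y)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        sq_err += sum((predict(model, X[i]) - y[i]) ** 2 for i in fold)
    return math.sqrt(sq_err / len(y))

def select_min_rmsecv(candidates, score):
    """Return the candidate with the smallest score, plus all scores."""
    scores = {c: score(c) for c in candidates}
    best = min(scores, key=scores.get)
    return best, scores

def make_toy_model(width):
    """Toy stand-in (NOT PDS): average of the `width` nearest training targets."""
    def fit(X, y):
        return list(zip(X, y))
    def predict(model, x):
        nearest = sorted(model, key=lambda p: abs(p[0] - x))[:width]
        return sum(v for _, v in nearest) / len(nearest)
    return fit, predict

# Synthetic 1-D data, for illustration only.
rng = random.Random(1)
X = [i / 10 for i in range(80)]
y = [math.sin(x) + rng.gauss(0, 0.1) for x in X]

windows = list(range(3, 17, 2))  # 3, 5, ..., 15
best_window, window_scores = select_min_rmsecv(
    windows, lambda w: rmsecv(X, y, *make_toy_model(w), k=5)
)
```

The same `select_min_rmsecv` loop applies to the latent-variable search by passing `range(1, 16)` as candidates and `k=10`.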

In this embodiment, the root mean square error (RMSE) is used as the indicator for parameter selection and model evaluation. RMSEC denotes the training error on the calibration set, RMSECV the cross-validation error, and RMSEP the prediction error on the test set. The RMSE is calculated as

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$

where $\hat{y}_i$ is the predicted value, $y_i$ is the measured value, and n is the number of samples.
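As a small worked example of this definition, the sketch below computes the RMSE of three predictions. The function name `rmse` and the numbers are illustrative only and do not come from the experiments.

```python
import math

def rmse(y_pred, y_true):
    """Root mean square error between predicted and measured values."""
    if len(y_pred) != len(y_true) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - t) ** 2 for p, t in zip(y_pred, y_true))
    return math.sqrt(squared / len(y_true))

# One prediction is off by 2, the other two are exact:
# RMSE = sqrt((0 + 0 + 4) / 3)
example = rmse([1.0, 2.0, 4.0], [1.0, 2.0, 2.0])
```

RMSEC, RMSECV, and RMSEP use this same formula and differ only in which samples supply the predicted and measured values.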

Table 1 lists the RMSEC, RMSEP, RMSECVmin, and LV of the PLS models of the three instruments M5, MP5, and MP6 on the maize dataset, where RMSECVmin is the minimum cross-validation error and LV is the number of latent variables at which it is obtained. For instrument M5, the RMSECVmin, RMSEC, and RMSEP of the PLS model are 0.01066, 0.00599, and 0.00764, respectively; the three errors differ little, so the model is relatively stable and neither over-fits nor under-fits. For instrument MP5 the corresponding values are 0.13035, 0.09458, and 0.12445; as with M5, there is no under-fitting or over-fitting, and the same conclusion holds for the MP6 model. The parameters were selected by 10-fold cross-validation, and the optimal numbers of latent variables under the lowest-RMSECV criterion are 14, 15, and 10 for M5, MP5, and MP6, respectively. Because it is important that the master instrument's model predicts well, this embodiment selects the instrument with the better prediction performance as the master. Table 1 shows that the prediction error of MP6 > the prediction error of MP5 > the prediction error of M5, so the model tests are performed on the three combinations M5-MP5, M5-MP6, and MP5-MP6, in each of which the first instrument is the master and the second is the slave.

TABLE 1

Instrument	Reference value	RMSEC	RMSEP	RMSECV_min	LV
M5	Moisture content	0.00599	0.00764	0.01066	14
MP5	Moisture content	0.09458	0.12445	0.13035	15
MP6	Moisture content	0.09991	0.15637	0.14775	10

CT-CDD was compared with five calibration migration methods: SBC, PDS, CCACT, TCR, and CTAI. In CT-CDD, the number of clusters is determined by ten-fold cross-validation. The maize dataset contains 80 samples, so the maximum number of sub-models after clustering is set to 3; otherwise the calculated migration matrix is rank-deficient, which would make the final prediction results infinite. Because the number of samples is limited, when the number of clusters is large the clustered characteristic spectra do not contain enough samples to establish a stable model. In the first embodiment, calculation shows that the minimum cross-validation error is obtained when the number of clusters is 2.
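The rank problem mentioned above can be illustrated with a short check. The sketch assumes that the migration matrix of each cluster is built from that cluster's sample covariance matrix; a covariance estimated from n samples has rank at most n - 1, so a cluster with too few samples relative to the number of latent variables gives a singular matrix. The helper name `covariance_is_full_rank` and the sample sizes are hypothetical.

```python
import numpy as np

def covariance_is_full_rank(T):
    """True if the sample covariance of T (rows are samples) is non-singular."""
    n_samples, n_features = T.shape
    if n_samples < 2:
        return False
    C = np.atleast_2d(np.cov(T, rowvar=False))
    return bool(np.linalg.matrix_rank(C) == n_features)

rng = np.random.default_rng(0)
small_cluster = rng.normal(size=(10, 14))  # fewer samples than latent variables
large_cluster = rng.normal(size=(40, 14))  # enough samples

small_ok = covariance_is_full_rank(small_cluster)
large_ok = covariance_is_full_rank(large_cluster)
```

A check of this kind could be used to discard candidate cluster numbers before cross-validation, which is in the spirit of the upper limit on the number of sub-models stated above.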

The prediction errors of CT-CDD and the five other calibration migration methods are shown in Table 2. In Table 2, N is the number of standard samples for the migration methods that require standard samples, a is the optimal window size in PDS, and b is the dimension of the corresponding optimal subspace in TCR.

As can be seen from Table 2:

(1) For the spectrum transfer from instrument MP5 to instrument M5: SBC reached its lowest RMSEP (0.28872) with 35 standard samples, PDS reached its lowest RMSEP (0.18828) with 5 standard samples, and CCACT reached its lowest RMSEP (0.18699) with 25 standard samples. The RMSEP of CT-CDD (0.15024) is lower than the minimum RMSEP of each of SBC, PDS, and CCACT, and also lower than that of TCR (0.47391) and CTAI (0.17511).

(2) For the spectrum transfer from MP6 to M5, the lowest RMSEPs obtained by SBC, PDS, CCACT were 0.33240, 0.27901, 0.17862, respectively, with CT-CDD having lower RMSEP than the other five methods.

(3) For the spectrum transfer from MP6 to MP5, the lowest RMSEPs for SBC, PDS, CCACT were 0.20481, 0.18409 and 0.13722, respectively, the prediction errors for TCR and CTAI were 0.46124 and 0.16563, respectively, and the CT-CDD again reached the smallest RMSEP (0.12357).

These three groups of experiments show that the CT-CDD model generally achieves the best prediction performance and has better generalization capability.

TABLE 2

Fig. 4, 5, and 6 show the relationship between the predicted and measured values obtained by the 6 calibration migration methods on the combinations M5-MP5, M5-MP6, and MP5-MP6, respectively. When the difference between the predicted and measured concentrations is zero, the sample point lies on a straight line. For each calibration migration method that requires standard samples, the experiment with the number of standard samples giving its best prediction performance is selected for comparison, so that the good prediction performance of CT-CDD is demonstrated more fully.

The prediction results of CT-CDD, CTAI, TCR, CCACT, SBC, and PDS on M5-MP5 are shown in Fig. 4A, 4B, 4C, 4D, 4E, and 4F, respectively; those on M5-MP6 are shown in Fig. 5A, 5B, 5C, 5D, 5E, and 5F, respectively; and those on MP5-MP6 are shown in Fig. 6A, 6B, 6C, 6D, 6E, and 6F, respectively. Fig. 4 shows that the sample points of CT-CDD lie closest to a straight line, while TCR and SBC fit poorly in this group of experiments. Fig. 5 shows that, in the spectrum transfer from instrument MP6 to instrument M5, CT-CDD is closer to a straight line than the other five methods; SBC and TCR again give the worst prediction performance, and PDS, CCACT, and CTAI have smaller prediction errors than those two but still perform worse than CT-CDD. Fig. 6 leads to the same conclusion for the spectrum transfer from instrument MP6 to instrument MP5 as Fig. 4 and 5, confirming that CT-CDD achieves the best prediction performance. Compared with the other five methods, CT-CDD therefore gives more satisfactory results and the best prediction performance.

Example two

In the second embodiment, the sample is wheat. The wheat dataset is the "shootout" dataset published by the International Diffuse Reflectance Conference (IDRC) in 2016. It contains data from 3 different NIR spectrometers (B1, B2, B3) measured on the same 248 samples, and protein content is selected as the substance concentration variable. The infrared spectra were measured with NIR spectrometers B1, B2, and B3 at intervals of 0.5 nm over the 570-1100 nm wavelength range; the spectral differences between B1-B2, B1-B3, and B3-B2 are shown in Fig. 3D, Fig. 3E, and Fig. 3F, respectively.

In this example, the Kennard-Stone (KS) algorithm divides the 248 wheat samples into two groups: the source domain data of the first group of 198 samples constitute the source domain calibration set, and the source domain data of the second group of 50 samples constitute the source domain test set.
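A minimal sketch of the Kennard-Stone selection rule is given below for illustration. It uses the usual formulation (start from the two mutually farthest samples, then repeatedly add the sample whose distance to its nearest already-selected sample is largest) with Euclidean distance; the patent text does not state the distance measure or tie-breaking, so both are assumptions, as is treating the selected samples as the calibration set and the rest as the test set. The function name `kennard_stone` is hypothetical.

```python
import math

def kennard_stone(X, n_select):
    """Return the indices of n_select rows of X chosen by the Kennard-Stone rule."""
    n = len(X)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and len(X)")
    dist = [[math.dist(a, b) for b in X] for a in X]

    # Start with the two samples that are farthest apart.
    i0, j0 = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda ij: dist[ij[0]][ij[1]],
    )
    selected = [i0, j0]
    remaining = set(range(n)) - {i0, j0}
    # min_dist[i]: distance from sample i to its nearest selected sample.
    min_dist = {i: min(dist[i][i0], dist[i][j0]) for i in remaining}

    while len(selected) < n_select:
        # Pick the sample farthest from the selected set (lowest index on ties).
        nxt = max(remaining, key=lambda i: (min_dist[i], -i))
        selected.append(nxt)
        remaining.remove(nxt)
        del min_dist[nxt]
        for i in remaining:
            min_dist[i] = min(min_dist[i], dist[i][nxt])
    return selected

# Toy 1-D example: 0 and 10 are farthest apart; 3 is then farthest from both.
toy = [[0.0], [1.0], [2.0], [3.0], [10.0]]
calibration_idx = kennard_stone(toy, 3)
test_idx = [i for i in range(len(toy)) if i not in calibration_idx]
```

For the split described above, the call would be `kennard_stone(spectra, 198)` on the 248 wheat spectra, with the 50 unselected samples forming the test set.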

Table 3 lists the RMSEC, RMSEP, RMSECVmin, and LV of the PLS models of the three instruments B1, B2, and B3 on the wheat dataset. The PLS model established on instrument B1 has RMSECVmin, RMSEC, and RMSEP of 0.50337, 0.32880, and 0.33254, respectively, and shows neither over-fitting nor under-fitting. The same is observed for instruments B2 and B3, so none of the three PLS models over-fits or under-fits, which indicates that the optimal number of latent variables was chosen reasonably. For the wheat dataset, the parameter selection criteria of the PLS model are similar to those used for the corn dataset, with the maximum number of latent variables set to 15. Table 3 shows that the prediction error of instrument B1 < the prediction error of instrument B3 < the prediction error of instrument B2, so the model performance tests are performed on the three combinations B1-B2, B1-B3, and B3-B2.

TABLE 3

Instrument	Reference value	RMSEC	RMSEP	RMSECV_min	LV
B1	Protein	0.32880	0.33254	0.50337	15
B2	Protein	0.21636	0.83755	0.32441	15
B3	Protein	0.30288	0.51567	0.43896	15

In CT-CDD, the number of clusters of the characteristic spectra is determined in the same way as for the maize dataset. Because the wheat dataset has a relatively sufficient number of samples, the rank deficiency that arose when calculating the migration matrix on the corn dataset does not occur. The number of clusters is searched between 2 and 5. In the second embodiment, calculation shows that the minimum cross-validation error is obtained when the number of clusters is 2.

The other calibration migration methods use parameter selection criteria similar to those used for the corn dataset. The optimal window sizes of PDS are shown in Table 4. When B1 is the master and B2 is the slave, the optimal window sizes for PDS are 11, 15, respectively. When B1 is the master and B3 is the slave, the optimal window sizes for PDS are 15, 11, 5, respectively. When B3 is the master and B2 is the slave, the optimal window sizes for PDS are 7, 15, respectively. In TCR, the optimal dimensions of the subspace are 7, 12, 21, respectively.

As can be seen from Table 4:

(1) When instrument B1 is the master and instrument B2 is the slave, SBC gives its lowest RMSEP (0.45225) with 5 standard samples. For PDS and CCACT, the RMSEP decreases significantly as the number of standard samples increases; both reach their lowest RMSEP with 35 standard samples, 0.47222 and 0.80448, respectively. CT-CDD achieves a lower RMSEP (0.43007) than SBC, PDS, and CCACT. The prediction errors of TCR and CTAI are 0.86884 and 0.41419, respectively. In this experimental group, the prediction performance of CT-CDD is therefore second only to that of CTAI.

(2) When instrument B1 is the master and instrument B3 is the slave, the lowest RMSEPs of SBC, PDS, and CCACT over the different numbers of standard samples are 0.79919, 0.41235, and 0.83440, respectively. The RMSEPs of TCR and CTAI are 0.72987 and 0.68215, respectively. The RMSEP of CT-CDD (0.35160) is significantly lower than those of the other calibration migration methods, so CT-CDD achieves the best prediction performance.

(3) When instrument B3 is the master and instrument B2 is the slave, the lowest RMSEPs of SBC, PDS, and CCACT are 0.47177, 0.33707, and 0.75119, respectively. The prediction errors of TCR and CTAI are 0.63708 and 0.38446, respectively. CT-CDD again achieves the best prediction performance (RMSEP 0.31856). SBC, PDS, and CCACT require standard samples and TCR requires reference values from the slave instrument, whereas CT-CDD achieves better prediction performance without standard samples. This makes CT-CDD the more practical approach.

TABLE 4

Fig. 7, 8, and 9 show the predicted and measured values obtained by the 6 calibration migration methods on the combinations B1-B2, B1-B3, and B3-B2, respectively.

The prediction results of CT-CDD, CTAI, TCR, CCACT, SBC, and PDS on B1-B2 are shown in Fig. 7A, 7B, 7C, 7D, 7E, and 7F, respectively; those on B1-B3 are shown in Fig. 8A, 8B, 8C, 8D, 8E, and 8F, respectively; and those on B3-B2 are shown in Fig. 9A, 9B, 9C, 9D, 9E, and 9F, respectively. Fig. 7C and 7D clearly show that, for TCR and CCACT, the correlation between the predicted and measured values is poor and the prediction error of the model is large. Fig. 7A shows that CT-CDD fits better and that the prediction error of the corresponding model is smaller. Fig. 8 shows that CT-CDD and PDS achieve good fits, while the other four methods fit relatively poorly. Fig. 9 shows that, for TCR and CCACT, the correlation between the measured substance concentration and the predicted result is relatively poor; the other four methods fit better, and CT-CDD fits best. CT-CDD thus gives more satisfactory results than the other five migration methods.

The first and second embodiments tested the CT-CDD-based calibration migration method of the present invention on two NIR datasets, with CTAI, TCR, CCACT, SBC, and PDS as comparison methods. CT-CDD achieved the lowest RMSEP in five of the six instrument combinations and was second only to CTAI in the remaining one (B1-B2). The results show that CT-CDD successfully corrects the differences between spectra measured on different instruments. SBC, PDS, and CCACT require standard samples to establish the migration model, and TCR requires a small number of reference values for slave instrument samples. Both requirements are expensive in practice and sometimes cannot be met at all. The CT-CDD-based method of the present invention is therefore an economical and efficient calibration migration method when standard samples are unavailable in practical applications.

The calibration migration method of the present invention is standard-free: by correcting the data distribution differences in the PLS subspace (CT-CDD), it seeks a transfer function that reduces the distance between the data distributions of the master and slave instruments when the slave instrument's data are projected into this space. Because the data distribution of the characteristic spectra is a mixture distribution, the spectra are clustered and the distance of each sub-distribution between the two instruments is minimized by its own transfer function. The present invention preserves the important properties of both instruments in the same PLS subspace and eliminates the multicollinearity of the spectra, so that the difference between the master instrument's features and the slave instrument's pseudo-features can be reduced more accurately. The differences in data distribution are further corrected by correcting the mean and variance of each part of the latent variables from the different instruments.

It is to be understood that the above-described embodiments are only some, not all, of the embodiments of the present invention. They serve only to explain the present invention and do not limit its scope of protection. All other embodiments that those skilled in the art can derive from the above embodiments without creative effort, that is, all modifications, equivalents, and improvements made within the spirit and principle of the present application, fall within the scope of protection claimed by the present invention.