WO2021073541A1 - Method for selecting a calibration set and a verification set based on spectral similarity, and modeling method - Google Patents

Method for selecting a calibration set and a verification set based on spectral similarity, and modeling method

Info

Publication number
WO2021073541A1
Authority
WO
WIPO (PCT)
Prior art keywords
samples
sample
verification
independent test
calibration
Prior art date
Application number
PCT/CN2020/120950
Other languages
English (en)
French (fr)
Inventor
聂磊
孙越
臧恒昌
曾英姿
刘肖雁
苏美
袁萌
王林林
姜红
楚广诣
Original Assignee
山东大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 山东大学
Priority to US17/289,657 (published as US20210404952A1)
Publication of WO2021073541A1

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N21/00 Investigating or analysing materials by the use of optical means, i.e. using sub-millimetre waves, infrared, visible or ultraviolet light
    • G01N21/17 Systems in which incident light is modified in accordance with the properties of the material investigated
    • G01N21/25 Colour; Spectral properties, i.e. comparison of effect of material on the light at two or more different wavelengths or wavelength bands
    • G01N21/27 Colour; Spectral properties, i.e. comparison of effect of material on the light at two or more different wavelengths or wavelength bands using photo-electric detection; circuits for computing concentration
    • G01N21/274 Calibration, base line adjustment, drift correction
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N21/00 Investigating or analysing materials by the use of optical means, i.e. using sub-millimetre waves, infrared, visible or ultraviolet light
    • G01N21/17 Systems in which incident light is modified in accordance with the properties of the material investigated
    • G01N21/25 Colour; Spectral properties, i.e. comparison of effect of material on the light at two or more different wavelengths or wavelength bands
    • G01N21/31 Investigating relative effect of material at wavelengths characteristic of specific elements or molecules, e.g. atomic absorption spectrometry
    • G01N21/35 Investigating relative effect of material at wavelengths characteristic of specific elements or molecules, e.g. atomic absorption spectrometry using infrared light
    • G01N21/359 Investigating relative effect of material at wavelengths characteristic of specific elements or molecules, e.g. atomic absorption spectrometry using infrared light using near infrared light
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N2201/00 Features of devices classified in G01N21/00
    • G01N2201/12 Circuits of general importance; Signal processing
    • G01N2201/127 Calibration; base line adjustment; drift compensation
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N2201/00 Features of devices classified in G01N21/00
    • G01N2201/12 Circuits of general importance; Signal processing
    • G01N2201/129 Using chemometrical methods

Definitions

  • The invention belongs to the technical field of predicting unknown samples, and in particular relates to a method for selecting a calibration set and a verification set based on spectral similarity, and to a modeling method.
  • NIR: near-infrared spectroscopy
  • The division of the sample set is very important for near-infrared spectroscopy analysis.
  • The selection of the calibration set and the verification set is a key step that affects the performance of the model.
  • The samples of the calibration set need to be representative and contain as much sample information as possible, while the verification set provides feedback on the quality of the model. If the samples in the verification set cannot reflect the predictive ability of the model for unknown samples well, the performance of the model cannot be guaranteed. Therefore, establishing a new sample set division method with better model performance and stronger predictive ability is one of the key research directions in current near-infrared spectroscopy analysis.
  • KS: Kennard-Stone method
  • SPXY: sample set partitioning based on joint x-y distances
  • The KS method selects representative samples into the calibration set by calculating the Euclidean distance between samples, so that the calibration set contains more sample information. However, this method may also select abnormal samples into the calibration set, and it is not specifically targeted at the unknown samples to be predicted.
  • The SPXY method builds on the KS method by also taking the reference value variable (Y) into account and giving the spectral space and the reference value space the same weight, so that the multi-dimensional vector space is covered effectively.
  • Y: reference value variable
  • The present invention provides a method for selecting a calibration set and a verification set based on spectral similarity, and a modeling method, which have better prediction performance for unknown samples.
  • One or more embodiments of the present invention provide the following technical solutions:
  • A method for selecting a calibration set and a verification set based on spectral similarity, including the following steps:
  • A plurality of reference values are also determined for the original sample to obtain the original sample reference value matrix.
  • Abnormal value detection is also performed on the original sample spectrum matrix; the abnormal values are eliminated, and the corresponding reference values in the original sample reference value matrix are eliminated as well.
  • For each sample in the verification set, the spectral similarity between it and each remaining sample in the original sample is calculated separately, and the n samples with the highest similarity are obtained and written into the calibration set.
  • The spectral similarity between samples is calculated using the Euclidean distance.
  • Modeling is performed separately for different values of n, and the value of n is optimized based on model performance to obtain optimized model parameter values.
  • One or more embodiments provide a modeling method based on the above selection method: the reference value matrix corresponding to the calibration set is obtained, and each reference value in the reference value matrix is modeled separately against the spectrum matrix.
  • The method further includes:
  • The performance of the model is evaluated based on the independent test set.
  • The method further includes comprehensively evaluating the performance of the model based on the calibration set, the verification set, and the independent test set.
  • The method of the present invention for dividing the calibration set and the verification set starts from the data used to test the performance of the model (that is, the independent test set, which is treated as unknown samples for testing the model after it is established). Based on spectral similarity, samples whose spectra are similar to the independent test set are selected into the verification set, so that the prediction results on the verification set reflect the predictive ability for unknown samples. Samples whose spectra are similar to the verification set are then selected into the calibration set. This ensures that the established model is targeted at the unknown samples; compared with currently common methods, the reported results show better modeling performance and stronger predictive ability for unknown samples.
  • The selection of the calibration set and the verification set also involves the selection of the number of samples.
  • The present invention optimizes the number of samples in the calibration set, so that a better prediction effect can be achieved with a smaller number of samples.
  • Figure 1 is a flowchart of the method for selecting a calibration set and a verification set and of the modeling method involved in one or more embodiments of the present invention;
  • Figure 2 shows the original near-infrared spectra of all samples in Example 1;
  • Figure 3 is a principal component projection diagram of Example 1 after removing abnormal samples;
  • Figure 4 shows the variation of the verification set RMSEV and the independent test set RMSEP in Example 1;
  • Figure 5 shows the variation of the correlation coefficient R_v of the verification set and the correlation coefficient R_p of the independent test set in Example 1;
  • Figure 6 shows the original near-infrared spectra of all samples in Example 2;
  • Figure 7 is a principal component projection diagram of Example 2 after removing abnormal samples;
  • Figure 8 shows the variation of the verification set RMSEV and the independent test set RMSEP in Example 2;
  • Figure 9 shows the variation of the correlation coefficient R_v of the verification set and the correlation coefficient R_p of the independent test set in Example 2;
  • Figure 10 shows the variation of the verification set RMSEV and the independent test set RMSEP in Example 3;
  • Figure 11 shows the variation of the correlation coefficient R_v of the verification set and the correlation coefficient R_p of the independent test set in Example 3;
  • Figure 12 shows the variation of the verification set RMSEV and the independent test set RMSEP in Example 4;
  • Figure 13 shows the variation of the correlation coefficient R_v of the verification set and the correlation coefficient R_p of the independent test set in Example 4.
  • A preferred embodiment of the present invention discloses a method for selecting a calibration set and a verification set for near-infrared quantitative modeling. Taking the published corn data as an example, the number of measured samples is 80, including sample repetitions. As shown in Figure 1, the method includes the following steps:
  • Step 1: Perform near-infrared spectroscopy on the original sample to obtain the original sample spectrum matrix X;
  • Step 2: Use the reference method to determine the reference values of the original sample to obtain the original sample reference value matrix Y;
  • The quality index components of corn (water, oil, protein and starch) are taken as the reference values, and the reference value matrix Y is constructed, with each column representing one parameter.
  • Step 3: Perform abnormal value detection on the original sample spectrum matrix X, remove the abnormal values, and remove the corresponding reference values from the reference value matrix Y;
  • The original spectra of the samples are shown in Figure 2.
  • The abnormal samples are eliminated.
  • 3 abnormal samples are detected, and 77 samples remain after the elimination.
  • The principal component projection diagram of the near-infrared spectra of the samples after removing the abnormal values is shown in Figure 3. It can be seen from Figure 3 that the remaining samples pass the Hotelling T² test (they lie inside the ellipse), and there are no abnormal samples.
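The abnormal-sample check described above can be sketched as follows. This is only an illustrative sketch, assuming that the Hotelling T² statistic is computed on the first two principal component scores; the function name `hotelling_t2`, the synthetic data and the number of components are assumptions, and the critical limit that defines the ellipse is deliberately not computed here.

```python
import numpy as np

def hotelling_t2(X, n_components=2):
    """Hotelling T^2 of each row of X in the space of the first PCA scores."""
    Xc = X - X.mean(axis=0)                            # mean-center each variable
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # PCA via SVD
    scores = U[:, :n_components] * S[:n_components]    # PCA scores
    score_var = scores.var(axis=0, ddof=1)             # variance of each component
    return np.sum(scores ** 2 / score_var, axis=1)

# Synthetic stand-in for a spectrum matrix (40 samples x 50 variables).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 50))
X[7] += 8.0   # hypothetical abnormal sample: large baseline offset

t2 = hotelling_t2(X, n_components=2)
suspect = int(np.argmax(t2))   # sample with the largest T^2
```

Samples whose T² exceeds a chosen critical limit would be treated as abnormal and removed together with their rows in Y.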
  • Step 4: Randomly select m samples to form an independent test set, which is used to simulate the unknown samples that need to be predicted;
  • After the abnormal values are removed from the original spectrum matrix X, a certain number of samples are drawn to form an independent test set, which simulates the unknown samples that need to be predicted.
  • The corresponding spectrum matrix is denoted X_t and the corresponding reference values are denoted Y_t; the samples of X_t and Y_t correspond one-to-one. The sample size of the independent test set should be determined according to actual needs; it should generally not exceed that of the calibration set and is comparable to that of the verification set, and its reference value range should generally fall within the reference value range of the calibration set samples.
  • The ratio of the calibration set to the verification set is usually 2:1, 3:1 or 4:1; if the independent test set is also considered, the ratio of the calibration set, the verification set and the independent test set can be 4:1:1, 6:1:1 or 8:1:1.
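The random draw of Step 4 can be sketched as below. The function name `draw_independent_test_set` and the seed are assumptions made for illustration; the sizes (77 samples after outlier removal, m = 10) follow the corn example in the text.

```python
import numpy as np

def draw_independent_test_set(n_samples, m, seed=None):
    """Randomly pick m sample indices as the independent test set."""
    rng = np.random.default_rng(seed)
    test_idx = np.sort(rng.choice(n_samples, size=m, replace=False))
    # Everything not drawn stays available for the verification and calibration sets.
    remaining_idx = np.setdiff1d(np.arange(n_samples), test_idx)
    return test_idx, remaining_idx

test_idx, remaining_idx = draw_independent_test_set(77, 10, seed=0)
```

The same indices would be applied to both X and Y so that X_t and Y_t stay in one-to-one correspondence.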
  • Step 5: For each sample in the independent test set, calculate the spectral similarity between it and each remaining sample in the original sample, obtain the samples with the highest spectral similarity to it, remove duplicates, write them into the verification set, and obtain the corresponding spectrum matrix X_v and parameter value matrix Y_v;
  • The verification set samples can be selected according to the principle of spectral similarity, so that the prediction results on the verification set indirectly reflect the predictive ability for the unknown samples to be tested.
  • The specific method is as follows: taking each sample in the independent test set as a reference, calculate the Euclidean distance D_i between its spectrum and the spectrum X_i of each remaining sample, and sort the distances. The smaller the distance, the more similar the spectrum of that remaining sample is to the spectrum of the independent test set sample. Performing this calculation on each sample in the independent test set in turn, each sample in the independent test set finds its g most similar samples among the remaining samples.
  • For each independent test set sample, the g most similar samples are selected to form the verification set; redundant duplicate samples are then removed, which gives the final verification set.
  • The corresponding spectrum matrix is denoted X_v and the corresponding reference values are denoted Y_v, where g ≥ 1 is a positive integer.
  • In this example, the remaining 67 samples are divided: the Euclidean distance D_i between each sample in the spectrum matrix X_t of the independent test set and the spectrum X_i of each remaining sample is calculated and sorted, the most similar samples found for the independent test set X_t are pooled, and redundant duplicate samples are removed to form the final verification set X_v.
  • The number of samples in X_v is about 8-10.
  • The corresponding reference value matrix is denoted Y_v.
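The selection in Step 5 can be sketched as follows. This is an illustrative sketch rather than the patent's own code: the function name `select_similar` and the toy two-variable "spectra" are assumptions, and the function returns indices into the pool of remaining samples.

```python
import numpy as np

def select_similar(X_ref, X_pool, k):
    """Union of the k nearest pool samples (Euclidean distance) of every reference sample."""
    # Pairwise distances, shape (n_ref, n_pool).
    d = np.linalg.norm(X_ref[:, None, :] - X_pool[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]   # k most similar pool samples per reference
    return np.unique(nearest)                # sorted indices, duplicates removed

# Toy data: 5 remaining samples and 2 independent test samples.
X_remaining = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0], [10.0, 10.0]])
X_t = np.array([[0.1, 0.0], [5.4, 5.0]])

val_idx_g1 = select_similar(X_t, X_remaining, k=1)
val_idx_g2 = select_similar(X_t, X_remaining, k=2)
X_v = X_remaining[val_idx_g1]   # spectra of the verification set for g = 1
```

Because duplicates are removed, the verification set can contain fewer than g × m samples when several test samples share the same nearest neighbours.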
  • Step 6: For each sample in X_v (denoted X_v,i), calculate the spectral similarity between it and each remaining sample in the original sample, obtain the samples with the highest similarity, remove duplicates, write them into the calibration set, and obtain the corresponding spectrum matrix X_c and parameter value matrix Y_c;
  • The calibration set samples are selected in a similar way.
  • For a sample in the verification set, the Euclidean distance D_k between it and the spectrum X_k of each remaining sample is calculated and sorted. This calculation is performed on each sample in the verification set X_v in turn; the closest n samples of each verification set sample are then selected as calibration set samples, and redundant duplicate samples are removed, which gives the selected calibration set X_c.
  • The remaining samples are the candidates for the calibration set.
  • The number of samples in the calibration set is determined by optimizing the number of most similar samples selected for each sample in the verification set.
  • The number of samples in X_c ranges from about 20 (or 18) to 67 minus the number of samples in X_v (i.e. 57-59), and the corresponding reference value matrix is denoted Y_c.
  • Taking each verification set sample as the unit of observation, the nearest n samples are selected from the remaining samples as calibration set samples.
  • The calibration set samples selected in this way are similar to the verification set and indirectly similar to the independent test set, so that a calibration model more targeted at the unknown samples is established.
  • The maximum value of n is the value at which all remaining samples are selected into the calibration set, and the minimum value of n should be twice the number of samples in the verification set.
  • With different n, the number of calibration set samples selected for each verification set sample differs. A larger calibration set does not necessarily give the best modeling effect: it may contain abnormal samples, duplicate samples, or samples that are poorly similar to the verification set, whose information may interfere with the modeling. If the calibration set is too small, it contains relatively little sample information and may not cover the distribution space of the unknown samples to be tested. The size of n therefore needs to be optimized, which amounts to optimizing the number of samples in the calibration set.
  • The value of n at this point is taken as the optimized number of closest calibration set samples selected for each verification set sample.
  • The spectrum matrix corresponding to the selected calibration set is denoted X_c, and the corresponding reference values are denoted Y_c.
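A possible shape for the optimization of n is sketched below, assuming that the criterion is the minimum RMSEV on the verification set. To keep the sketch short and self-contained, an ordinary least-squares fit is used as a stand-in for the PLS model of the text; the names `select_similar`, `optimize_n` and the synthetic data are assumptions.

```python
import numpy as np

def select_similar(X_ref, X_pool, k):
    """Union of the k nearest pool samples (Euclidean distance) of every reference sample."""
    d = np.linalg.norm(X_ref[:, None, :] - X_pool[None, :, :], axis=2)
    return np.unique(np.argsort(d, axis=1)[:, :k])

def fit_linear(X, y):
    """Least-squares fit with intercept (stand-in for PLS, not the patent's model)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return coef[0] + X @ coef[1:]

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def optimize_n(X_v, y_v, X_pool, y_pool, n_values):
    """For each n, build the calibration set, fit a model and score it by RMSEV."""
    results = {}
    for n in n_values:
        idx = select_similar(X_v, X_pool, n)              # calibration set for this n
        coef = fit_linear(X_pool[idx], y_pool[idx])
        results[n] = (rmse(y_v, predict_linear(coef, X_v)), idx)
    best_n = min(results, key=lambda n: results[n][0])    # smallest RMSEV wins
    return best_n, results

# Synthetic stand-in data: 60 candidate samples, 8 verification samples, 3 variables.
rng = np.random.default_rng(1)
b = np.array([1.0, -2.0, 0.5])
X_pool = rng.normal(size=(60, 3))
y_pool = X_pool @ b + 0.1 * rng.normal(size=60)
X_v = rng.normal(size=(8, 3))
y_v = X_v @ b + 0.1 * rng.normal(size=8)

n_values = list(range(2, 11))
best_n, results = optimize_n(X_v, y_v, X_pool, y_pool, n_values)
X_c, y_c = X_pool[results[best_n][1]], y_pool[results[best_n][1]]
```

Since the n nearest neighbours of a sample always include its n-1 nearest neighbours, the calibration set can only grow as n increases, so scanning n from small to large sweeps the calibration set size monotonically.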
  • another embodiment of the present invention further provides a model establishment and evaluation method, which specifically includes:
  • Steps 1-6: Refer to the previous embodiment to obtain the verification set and the calibration set, together with their corresponding spectrum matrices and parameter value matrices;
  • Step 7: Modeling based on the calibration set: each parameter in the parameter value matrix is modeled against the spectrum matrix to obtain a calibration model;
  • The calibration set spectra X_c and the moisture content matrix Y_c are correlated by the partial least squares (PLS) method to establish the relationship model between Y_c and X_c, as follows:
  • The model parameters are obtained, namely the regression coefficient B_pls.
  • The modeling settings (i.e. the model parameters to be solved) are determined by the minimum RMSEV of the verification set X_v. Modeling is performed with the optimized number of latent variables.
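The equation of the relationship model is missing from this extract. As a hedged illustration only, the sketch below fits a single-response PLS model with the classical NIPALS PLS1 algorithm and returns a regression coefficient vector playing the role of B_pls; the function names, the synthetic data and the choice of algorithm variant are assumptions, not the patent's stated implementation.

```python
import numpy as np

def pls1_fit(X, y, n_latent):
    """NIPALS PLS1: returns (B_pls, x_mean, y_mean) for a single response y."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean          # centered working copies
    W, P, q = [], [], []
    for _ in range(n_latent):
        w = Xk.T @ yk
        w = w / np.linalg.norm(w)            # weight vector
        t = Xk @ w                           # scores
        tt = t @ t
        p = Xk.T @ t / tt                    # X loadings
        qa = yk @ t / tt                     # y loading
        Xk = Xk - np.outer(t, p)             # deflate X
        yk = yk - qa * t                     # deflate y
        W.append(w); P.append(p); q.append(qa)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    B = W @ np.linalg.solve(P.T @ W, q)      # regression coefficients
    return B, x_mean, y_mean

def pls1_predict(model, X):
    B, x_mean, y_mean = model
    return (X - x_mean) @ B + y_mean

# Synthetic stand-in for a calibration set (30 samples, 4 variables).
rng = np.random.default_rng(2)
X_c = rng.normal(size=(30, 4))
y_c = X_c @ np.array([0.8, -1.2, 0.3, 2.0]) + 0.1 * rng.normal(size=30)

model_1 = pls1_fit(X_c, y_c, n_latent=1)
model_full = pls1_fit(X_c, y_c, n_latent=4)
y_c_fit = pls1_predict(model_full, X_c)
```

In practice the number of latent variables would be scanned and the value giving the smallest RMSEV on the verification set kept, in line with the text above.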
  • Step 8: Optimize the model based on the verification set. Specifically: substitute the verification set into the calibration model, solve for the fitted values of the reference values, and adjust and optimize the model parameters based on the fitted values and the actual values;
  • Step 9: Evaluate the performance of the model based on the independent test set. Specifically: substitute the independent test set into the optimized model, solve for the fitted values of the reference values, and calculate the root mean square error (RMSEP) and the correlation coefficient (R_p) from the fitted values and the actual values to evaluate the performance of the model.
  • RMSEP: root mean square error of the independent test set
  • R_p: correlation coefficient of the independent test set
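The two evaluation indices can be computed as in the following sketch. It assumes that the root mean square error divides by the number of samples and that the correlation coefficient is the Pearson coefficient between actual and fitted values; neither convention is spelled out in the text, and the numbers are made up for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between actual and fitted values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def corr_coef(y_true, y_pred):
    """Pearson correlation coefficient between actual and fitted values."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])

# Hypothetical reference values and fitted values of an independent test set.
y_t = np.array([10.0, 10.5, 11.0, 9.5])
y_t_fit = np.array([10.1, 10.4, 11.2, 9.4])

rmsep = rmse(y_t, y_t_fit)
r_p = corr_coef(y_t, y_t_fit)
```

The same two functions give RMSEC and R_c or RMSEV and R_v when applied to the calibration set or the verification set.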
  • The model evaluation in steps 8-9 above can also adopt a comprehensive evaluation method, including:
  • Step 8: Re-substitute the spectral data of the calibration set, verification set and independent test set samples into the calibration model, and calculate the fitted values of each sample set, as follows:
  • Step 9: Calculate the root mean square error (RMSEC) and correlation coefficient (R_c) of the calibration set from its fitted values Y_c^f; calculate the root mean square error (RMSEV) and correlation coefficient (R_v) of the verification set from its fitted values Y_v^f; finally, calculate the root mean square error (RMSEP) and correlation coefficient (R_p) of the independent test set from its fitted values Y_t^f;
  • Step 10: Jointly evaluate the performance of the model based on the above parameters.
  • Because the independent test set is a certain number of randomly selected samples, it is subject to chance.
  • RMSEP and R_p can be calculated on the same independent test set, so as to evaluate the performance of the model objectively.
  • Preprocessing steps may also be applied to the calibration set, verification set, and independent test set.
  • The specific preprocessing method is not limited here. In the following specific embodiments, no preprocessing is used, and the original spectrum matrix is used directly for modeling. If a preprocessing method is used, it should be the same for the calibration set, the verification set and the independent test set.
  • The R_p value of the independent test set is close to the R_v value of the verification set, so it is appropriate to select samples similar to the independent test set as the verification set samples, which indirectly reflect the predictive ability for unknown samples.
  • The RPD values of all components are greater than 3.0, indicating that the model has good predictive ability.
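The text does not define RPD. A common definition, assumed here, is the ratio of the standard deviation of the reference values to the RMSEP; the sketch below uses the sample standard deviation (ddof=1) and made-up values.

```python
import numpy as np

def rpd(y_true, y_pred):
    """Ratio of the standard deviation of the reference values to the RMSEP (assumed definition)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    rmsep = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return float(np.std(y_true, ddof=1) / rmsep)

# Hypothetical reference values and predictions of an independent test set.
y_t = np.array([9.0, 10.0, 11.0, 12.0])
y_t_fit = np.array([9.1, 9.9, 11.1, 11.9])
rpd_value = rpd(y_t, y_t_fit)
```

Under this definition a larger RPD means the prediction error is small relative to the spread of the reference values.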
  • The invention can be used for the selection of sample sets and gives better results.
  • Optimizing the calibration set based on the verification set means, to a certain extent, optimizing the prediction performance of the calibration set model for the independent test set: the spectra of the verification set and the independent test set samples are very similar, and the calibration set samples are in turn selected for their similarity to the verification set, so the prediction of the unknown samples (that is, the independent test set) is more targeted.
  • Table 3 lists the reference value ranges of each component for the calibration set, verification set, and independent test set divided by the various methods. The ranges are averages over 10 test runs.
  • For all three division methods, the reference value ranges of the four components in the calibration set samples include the reference value ranges of the verification set and the independent test set.
  • Ideally, calibration set range > verification set range > independent test set range; at a minimum, the range of the calibration set should be greater than the range of the verification set. If this is not satisfied, the selection range of the calibration set samples can be expanded further until the relationship holds.
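The range condition above can be checked with a few lines of code. The helper name `covers` and the numeric reference values are hypothetical and serve only to show the check.

```python
def covers(outer, inner):
    """True if the value range of `outer` includes the value range of `inner`."""
    return min(outer) <= min(inner) and max(inner) <= max(outer)

# Hypothetical reference values of one component in the three sets.
y_c = [9.4, 9.9, 10.3, 10.8, 11.0]   # calibration set
y_v = [9.6, 10.2, 10.7]              # verification set
y_t = [9.8, 10.5]                    # independent test set

calibration_covers_verification = covers(y_c, y_v)
verification_covers_test = covers(y_v, y_t)
```

If the first check fails, more candidate samples would be admitted to the calibration set, as described above.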
  • The independent test set samples can be regarded as unknown samples, whose corresponding Y_t is not known in advance.
  • X is the near-infrared spectrum matrix of the samples, measured by a Fourier transform near-infrared spectrometer (Antaris II, Thermo Fisher, USA).
  • Y is the matrix of four quality index components, namely tanshinone IIA (TSIIA), cryptotanshinone (CTS), tanshinone I (TSI) and salvianolic acid B (SAB). The original spectra of the samples are shown in Figure 6.
  • TSIIA: tanshinone IIA
  • CTS: cryptotanshinone
  • TSI: tanshinone I
  • SAB: salvianolic acid B
  • The number of samples in X_v is about 10-15, and the remaining samples are the candidates for the calibration set.
  • The number of samples in the calibration set is determined by optimizing the number of most similar samples selected for each sample in the verification set.
  • The number of samples in X_c is between 10 and 87 (or 92).
  • Partial least squares is used to establish the correlation model between X and each column of Y. The root mean square error of the calibration set (RMSEC), of the verification set (RMSEV) and of the independent test set (RMSEP), together with the corresponding correlation coefficients, namely the calibration set correlation coefficient (R_c), the verification set correlation coefficient (R_v) and the prediction set correlation coefficient (R_p), are used jointly to evaluate the performance of the model. Because the independent test set is a certain number of randomly selected samples, it is subject to chance. To evaluate the various data set division methods objectively, the procedure was repeated 10 times in parallel, each time randomly selecting the same number of samples as the independent test set, and the average values of the above indices were calculated for comparison. The results are shown in Table 4.
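The repeated random draws can be organized as in the sketch below. The function `average_over_repeats` and the placeholder evaluation (predicting every test sample by the mean of the other samples) are assumptions made only to keep the example runnable; in the procedure described above, each repeat would instead run the full division, modeling and evaluation pipeline.

```python
import numpy as np

def average_over_repeats(evaluate_once, n_samples, m, n_repeats=10, seed=0):
    """Draw a random independent test set n_repeats times and average the indices."""
    rng = np.random.default_rng(seed)
    runs = []
    for _ in range(n_repeats):
        test_idx = rng.choice(n_samples, size=m, replace=False)
        runs.append(evaluate_once(test_idx))       # dict of indices for this draw
    return {key: float(np.mean([run[key] for run in runs])) for key in runs[0]}

# Placeholder data and evaluation; not the patent's PLS pipeline.
y = np.linspace(0.0, 1.0, 117)                     # hypothetical reference values

def mean_predictor(test_idx):
    mask = np.ones(len(y), dtype=bool)
    mask[test_idx] = False
    pred = y[mask].mean()                          # predict by the mean of the rest
    return {"RMSEP": float(np.sqrt(np.mean((y[test_idx] - pred) ** 2)))}

summary = average_over_repeats(mean_predictor, n_samples=117, m=15, n_repeats=10)
```

Fixing the seed makes the ten draws reproducible, so different division methods can be compared on the same independent test sets.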
  • Each component of Salvia miltiorrhiza is modeled well.
  • The correlation coefficients of the calibration set, the verification set and the independent test set are all above 0.95, and the root mean square errors are small.
  • Every evaluation index of the verification set is better than that of the independent test set, because the calibration set samples are selected for their similarity to the verification set and the model is optimized on the verification set samples.
  • The RPD values of all components are greater than 3.0, indicating that the model has good predictive ability.
  • The present invention optimizes the number of calibration set samples, which is reduced to about 50 or somewhat more than 60 samples (see Table 4); this reduces the actual workload.
  • The variation of the root mean square errors and of the corresponding correlation coefficients of the verification set and the independent test set is shown in Figure 8 and Figure 9, respectively. Figure 8 shows that the root mean square errors follow a consistent trend. Figure 9 shows that the correlation coefficients also change consistently: although the magnitudes of the changes differ, the trend is the same. Therefore, the verification set can stand in for the independent test set to illustrate the predictive ability of the model.
  • The 117 samples were also divided by the Kennard-Stone (KS) method and the SPXY method, using the same independent test set.
  • The number of samples in the verification set is the same as in the method of the present invention.
  • The remaining samples are used as the calibration set, and the verification set is likewise used to optimize the calibration set. The results are shown in Table 5.
  • The R_p value and RMSEP of this method are better than those of the KS method and the SPXY method.
  • For each component, R_p is the largest and RMSEP is the smallest of the three methods. Since the three methods use the same independent test set, this indicates that the calibration model obtained by this method has the strongest predictive ability for that independent test set.
  • Table 6 lists the reference value ranges of the four components for the calibration set, verification set, and independent test set divided by the various methods. The ranges are averages over 10 test runs.
  • For the other methods, the calibration set samples also cover the reference value range of the verification set samples and of the independent test set samples.
  • X is the near-infrared spectrum matrix of the samples.
  • Y is the matrix of the four component quality indices. Moisture is taken as the example; the same steps are applied to the other components. First, the abnormal samples are eliminated: the Hotelling T² method detects 3 abnormal samples, leaving a total of 77 samples after elimination, and 10 samples are randomly selected as the independent test set X_t.
  • After optimization, a PLS model is established between the X matrix and the Y matrix, and the evaluation parameters are calculated, including the root mean square error of the calibration set (RMSEC), of the verification set (RMSEV) and of the independent test set (RMSEP), and the corresponding correlation coefficients, namely the calibration set correlation coefficient (R_c), the verification set correlation coefficient (R_v) and the prediction set correlation coefficient (R_p). Because the independent test set is a certain number of randomly selected samples, it is subject to chance. To evaluate the various data set division methods objectively, the procedure was repeated 10 times in parallel, each time randomly selecting the same number of samples as the independent test set, and the average values of the above indices were calculated for comparison. The results are shown in Table 7.
  • Table 9 lists the reference value ranges of the four components for the calibration set, verification set, and independent test set divided by the various methods. The ranges are averages over 10 test runs.
  • X is the near-infrared spectrum matrix of the samples.
  • Y is the matrix of four quality index components.
  • CTS: cryptotanshinone
  • TSI: tanshinone I
  • SAB: salvianolic acid B
  • The same steps are applied to the other components. First, abnormal samples are removed using the Hotelling T² method: three abnormal samples were detected, leaving a total of 117 samples after removal. Fifteen samples are randomly selected as the independent test set X_t.
  • The root mean square errors of the verification set and the independent test set show a consistent trend, as do the correlation coefficients, and the correlation coefficients and root mean square errors move in opposite directions. The verification set samples can therefore be used to reflect the prediction error for unknown samples, so that the model is optimized better and the prediction of unknown samples is more targeted.
  • The details are shown in Figure 12 and Figure 13.
  • Table 12 lists the reference value ranges of the four components for the calibration set, verification set, and independent test set divided by the various methods. The ranges are averages over 10 test runs.
  • The method of the present invention for dividing the calibration set and the verification set starts from the data used to test the performance of the model (that is, the independent test set, which is used as unknown samples to test the model after it is established) and is based on spectral similarity.
  • Samples whose spectra are similar to the independent test set are selected as the verification set, and the prediction results on the verification set are used to reflect the predictive ability for unknown samples.
  • Samples whose spectra are similar to the verification set are then selected as the calibration set, which ensures that the established model is targeted at the unknown samples; compared with currently common methods, the reported results show better modeling performance and stronger predictive ability for unknown samples.
  • The selection of the verification set and the calibration set also involves the selection of the number of samples.
  • The present invention optimizes the number of samples in the calibration set, so that a better prediction effect can be achieved with a smaller number of samples.
  • Each module or step of the present invention described above can be implemented by a general-purpose computing device. Alternatively, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; or they can be fabricated separately as individual integrated circuit modules, or multiple modules or steps among them can be fabricated as a single integrated circuit module.
  • The present invention is not limited to any specific combination of hardware and software.

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biochemistry (AREA)
  • Immunology (AREA)
  • Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Algebra (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Biology (AREA)
  • Operations Research (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computing Systems (AREA)
  • Computer Hardware Design (AREA)
  • Evolutionary Computation (AREA)
  • Geometry (AREA)
  • Investigating Or Analysing Materials By Optical Means (AREA)

Abstract

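A method for selecting a calibration set and a validation set based on spectral similarity, and a modeling method. The selection method comprises: performing near-infrared spectral measurement on the original samples to obtain an original sample spectral matrix; randomly drawing m samples as an independent test set; for each sample in the independent test set, calculating the spectral similarity between that sample and each of the remaining original samples, and writing the g most similar samples into the validation set; and, for each sample in the validation set, calculating the spectral similarity between that sample and each of the remaining original samples, and writing the n most similar samples into the calibration set. A model built on the validation set and calibration set chosen in this way can predict unknown samples more accurately.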

Description

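Method for selecting a calibration set and a validation set based on spectral similarity, and modeling method
Technical Field
The present invention belongs to the technical field of predicting unknown articles, and in particular relates to a method for selecting a calibration set and a validation set based on spectral similarity and to a modeling method.
Background
The statements in this section merely provide background information related to the present disclosure and do not necessarily constitute prior art.
Near-infrared (NIR) spectroscopy is a rapidly developing analytical technique that is non-destructive, pollution-free and highly reproducible. With the development of chemometrics and computer technology, it has been widely applied to agricultural products, petrochemistry, pharmaceuticals, the environment, process control, and clinical and biomedical work. A major feature of the technique is that chemometrics is needed to relate the spectral information of the samples to the corresponding reference values (such as content or origin) and build a model; unknown samples are then predicted with the model, which achieves the purpose of the analysis.
To build an accurate model, the existing sample set has to be partitioned: the model is built on the calibration set, and the validation set is used to help evaluate it. How the calibration set and the validation set are chosen therefore has a great influence on the applicability and predictive ability of the model. The calibration samples must be representative and contain as much sample information as possible, while the validation set provides feedback on the quality of the model; if the validation samples do not reflect well the model's ability to predict unknown samples, the performance of the model cannot be guaranteed. Establishing a new partitioning method that yields better model performance and stronger predictive ability is therefore one of the key research directions in current NIR analysis.
To the inventors' knowledge, two classical and widely used sample-set partitioning methods exist in NIR analysis: the Kennard-Stone (KS) method and the SPXY method. The KS method selects representative samples into the calibration set by computing Euclidean distances between samples, so that the calibration set contains more sample information; however, it may also select abnormal samples into the calibration set, and it is not targeted at the unknown samples to be predicted. The SPXY method builds on KS by also taking the reference-value variable (Y) into account, giving the spectra and the reference values the same weight in their respective spaces so that the multidimensional vector space is covered effectively. With both methods, however, it is hard to tell whether unknown samples will be predicted well.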
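For readers unfamiliar with the KS method mentioned above, the following is a minimal sketch of its max-min selection rule, not code from the patent. The function name `kennard_stone` and the choice to start from the two most distant samples are assumptions made for illustration; the sketch always returns at least two indices.

```python
import numpy as np

def kennard_stone(X, k):
    """Return indices of k samples chosen by the Kennard-Stone max-min rule."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all spectra.
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    # Start from the two most distant samples.
    i, j = np.unravel_index(np.argmax(d), d.shape)
    chosen = [int(i), int(j)]
    while len(chosen) < k:
        rest = [r for r in range(len(X)) if r not in chosen]
        # Distance from each remaining sample to its nearest chosen sample.
        nearest = d[np.ix_(rest, chosen)].min(axis=1)
        # Add the sample that is farthest from everything chosen so far.
        chosen.append(rest[int(np.argmax(nearest))])
    return chosen
```

Because the rule always reaches for the most remote sample, an outlier left in the data tends to be selected early, which is the weakness the background section points out.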
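Summary of the Invention
To overcome the above deficiencies of the prior art, the present invention provides a method for selecting a calibration set and a validation set based on spectral similarity and a modeling method, which give better predictive performance for unknown samples.
To achieve the above object, one or more embodiments of the present invention provide the following technical solutions:
A method for selecting a calibration set and a validation set based on spectral similarity, comprising the following steps:
performing near-infrared spectral measurement on the original samples to obtain an original sample spectral matrix;
Further, a plurality of reference values are also measured for the original samples to obtain an original sample reference value matrix.
Further, after the original sample spectral matrix and the original sample reference value matrix are obtained, outlier detection is performed on the spectral matrix, the outliers are removed, and the corresponding reference values are removed from the reference value matrix.
randomly drawing m samples as an independent test set to simulate unknown samples;
for each sample in the independent test set, calculating the spectral similarity between that sample and each of the remaining original samples, and writing the g most similar samples into the validation set;
for each sample in the validation set, calculating the spectral similarity between that sample and each of the remaining original samples, and writing the n most similar samples into the calibration set.
Further, the spectral similarity between samples is calculated using the Euclidean distance.
Further, with N denoting the number of original samples after outlier removal, m, g and n satisfy: g ≤ n ≤ (N - m).
Further, models are built for different values of n, and the value of n is optimized based on model performance to obtain the optimized model parameter values.
One or more embodiments provide a modeling method based on the above selection method: the reference value matrix corresponding to the calibration set is obtained, and each reference value in the matrix is separately modeled against the spectral matrix.
Further, the method also comprises:
optimizing the model parameters based on the validation set;
optimizing the sample composition of the calibration set based on the validation set;
evaluating model performance based on the independent test set.
Further, the method also comprises a comprehensive evaluation of model performance based on the calibration set, the validation set and the independent test set.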
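The above one or more technical solutions have the following beneficial effects:
The partitioning method of the present invention starts from the data used to test model performance (that is, the independent test set, which is treated as unknown samples once the model has been built). Using spectral similarity as the criterion, samples whose spectra are similar to the independent test set are selected into the validation set, so that the prediction performance on the validation set indirectly reflects the ability to predict unknown samples. Then, based on the validation set, samples whose spectra are similar to the validation set are selected into the calibration set. This ensures that the model is built specifically for the unknown samples; compared with currently common methods, it can be clearly shown that its modeling performance for unknown samples is better and its predictive ability stronger.
Selecting the calibration set and the validation set also involves choosing the number of samples. The present invention optimizes the number of calibration set samples, so that a better prediction effect can be achieved with fewer samples.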
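Brief Description of the Drawings
The accompanying drawings, which form a part of the present invention, are provided for further understanding of the invention. The illustrative embodiments and their descriptions explain the invention and do not unduly limit it.
Figure 1 is a flowchart of the calibration set and validation set selection method and the modeling method according to one or more embodiments of the present invention;
Figure 2 shows the raw near-infrared spectra of all samples in Example 1;
Figure 3 is the principal component projection plot of Example 1 after removal of abnormal samples;
Figure 4 shows the variation of the validation set RMSEV and the independent test set RMSEP in Example 1;
Figure 5 shows the variation of the validation set correlation coefficient R_v and the independent test set correlation coefficient R_p in Example 1;
Figure 6 shows the raw near-infrared spectra of all samples in Example 2;
Figure 7 is the principal component projection plot of Example 2 after removal of abnormal samples;
Figure 8 shows the variation of the validation set RMSEV and the independent test set RMSEP in Example 2;
Figure 9 shows the variation of the validation set correlation coefficient R_v and the independent test set correlation coefficient R_p in Example 2;
Figure 10 shows the variation of the validation set RMSEV and the independent test set RMSEP in Example 3;
Figure 11 shows the variation of the validation set correlation coefficient R_v and the independent test set correlation coefficient R_p in Example 3;
Figure 12 shows the variation of the validation set RMSEV and the independent test set RMSEP in Example 4;
Figure 13 shows the variation of the validation set correlation coefficient R_v and the independent test set correlation coefficient R_p in Example 4.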
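Detailed Description
It should be noted that the following detailed description is exemplary and is intended to further explain the present invention. Unless otherwise specified, all technical and scientific terms used herein have the meanings commonly understood by a person of ordinary skill in the art to which the invention belongs.
The terms used here only describe specific embodiments and are not intended to limit the exemplary embodiments of the invention. As used here, unless the context clearly indicates otherwise, singular forms are intended to include plural forms; in addition, the terms "comprising" and/or "including", when used in this specification, indicate the presence of features, steps, operations, devices, components and/or combinations thereof.
Where no conflict arises, the embodiments of the present invention and the features in the embodiments may be combined with one another.
A preferred embodiment of the present invention discloses a method for selecting a calibration set and a validation set for near-infrared quantitative modeling. Taking the public corn data set as an example, 80 samples were measured in total, including replicates. As shown in Figure 1, the method comprises the following steps:
Step 1: perform near-infrared spectral measurement on the original samples to obtain the original sample spectral matrix X;
Step 2: measure the reference values of the original samples by a reference method to obtain the original sample reference value matrix Y;
In this embodiment, four quality-index components are chosen as reference values for corn: moisture, oil, protein and starch. They form the reference value matrix Y, in which each column represents one parameter.
Step 3: perform outlier detection on the original sample spectral matrix X, remove the outliers, and remove the corresponding reference values from the matrix Y;
The raw spectra of the samples are shown in Figure 2. Abnormal samples are removed first: the Hotelling T² method detected 3 abnormal samples, leaving 77 samples after removal. The principal component projection plot of the near-infrared spectra after outlier removal is shown in Figure 3. As Figure 3 shows, the remaining samples pass the Hotelling T² test (they lie inside the ellipse), so no abnormal samples remain.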
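The text does not state how many principal components or which confidence level were used for the Hotelling T² screening, so the following is only a sketch under assumptions: the function name `hotelling_t2` and the default `n_components=2` are illustrative, and the cut-off against which the returned values are compared (typically derived from an F distribution) is left to the caller.

```python
import numpy as np

def hotelling_t2(X, n_components=2):
    """Hotelling T^2 of each sample in the space of the first PCA scores."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                           # mean-center each wavelength
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # PCA scores
    var = scores.var(axis=0, ddof=1)                  # variance of each score column
    # Sum of squared scores, each scaled by the variance of its component.
    return (scores ** 2 / var).sum(axis=1)
```

Samples whose T² exceeds the chosen limit fall outside the ellipse in a score plot such as Figure 3 and would be removed from both X and Y.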
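Step 4: randomly draw m samples to form the independent test set, which simulates the unknown samples to be predicted;
From the spectral matrix X after outlier removal, a certain number of samples are drawn to form the independent test set, which simulates the unknown samples to be predicted. The corresponding spectral matrix is denoted X_t and the corresponding reference values Y_t; the samples of X_t and Y_t correspond one to one. The number of independent test samples should be set according to actual needs. In general it should not exceed the number of calibration samples and should be comparable to the number of validation samples, and the range of its reference values should generally lie within the reference-value range of the calibration samples.
In this embodiment 10 samples (m = 10) are randomly drawn as the independent test set. In general, if only a calibration set and a validation set are used, common ratios between them are 2:1, 3:1 or 4:1; if an independent test set is also considered, the calibration set, validation set and independent test set can be split in ratios such as 4:1:1, 6:1:1 or 8:1:1.
Step 5: for each sample in the independent test set, calculate the spectral similarity between it and each of the remaining original samples, take the samples with the highest spectral similarity to it, remove duplicates, write them into the validation set, and obtain the corresponding spectral matrix X_v and reference value matrix Y_v;
Spectral information is easy and quick to obtain, and it is assumed that only the spectra of the independent test samples have been measured. The validation samples can therefore be selected by spectral similarity, so that the prediction performance on the validation set indirectly reflects the ability to predict the unknown samples. The procedure is as follows. Taking each sample in the independent test set as a reference, the Euclidean distance D_i between it and the spectrum X_i of each remaining sample is calculated and the distances are sorted; the smaller the distance, the more similar the spectra of the two samples. Repeating this for every sample in the independent test set, each test sample finds its g most similar samples among the remaining samples. Depending on the actual number of samples and the modeling requirements, the g most similar samples of every independent test sample are collected and duplicates are removed to give the final validation set; its spectra are denoted X_v and its reference values Y_v, where g is a positive integer with g ≥ 1.
In this embodiment the remaining 67 samples are partitioned. The Euclidean distance D_i between each sample of the independent test set spectral matrix X_t and each remaining spectrum X_i is calculated and sorted, the single most similar sample (g = 1) is selected for each sample in X_t, and duplicates are removed to form the final validation set X_v. X_v contains roughly 8 to 10 samples, and the corresponding reference value matrix is denoted Y_v. Because the validation samples are those most similar to the independent test set, they can stand in for the independent test samples when giving feedback on the model, which leads to better prediction.
The Euclidean distance between each sample of the independent test set X_t and a remaining sample X_i is calculated as $D_i = \sqrt{\sum (X_i - X_{t,j})^2}$, where the sum runs over the spectral variables. With X_{t,j} denoting one sample of the independent test set taken as the unit of observation, the remaining sample X_i with the smallest Euclidean distance to it is the sample most similar to X_{t,j}.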
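The selection in Step 5 can be sketched as follows. This is an illustrative reading of the text rather than the authors' implementation; the helper name `nearest_union` is hypothetical, and indices refer to rows of the pool of remaining samples.

```python
import numpy as np

def nearest_union(X_ref, X_pool, k):
    """For each row of X_ref take its k nearest rows of X_pool (Euclidean
    distance) and return the sorted, de-duplicated union of pool indices."""
    X_ref = np.asarray(X_ref, dtype=float)
    X_pool = np.asarray(X_pool, dtype=float)
    # d[j, i] = distance between reference sample j and pool sample i.
    d = np.sqrt(((X_ref[:, None, :] - X_pool[None, :, :]) ** 2).sum(axis=2))
    nearest = np.argsort(d, axis=1, kind="stable")[:, :k]
    return sorted(set(nearest.ravel().tolist()))
```

Calling `nearest_union(X_t, X_remaining, g)` gives the validation set. Because two test samples can share a nearest neighbour, the result may hold fewer than m·g samples, which is why the text reports about 8 to 10 validation samples for m = 10 and g = 1.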
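Step 6: for the remaining original samples, calculate the spectral similarity between each of them and each sample in X_v, take the samples with the highest similarity to each sample in X_v (denoted X_{v,i}), remove duplicates, write them into the calibration set, and obtain the corresponding spectral matrix X_c and reference value matrix Y_c;
Once the validation samples have been obtained, the calibration samples are selected in a similar way. Taking each sample in the validation set X_v as a reference, the Euclidean distance D_k between it and the spectrum X_k of each remaining sample is calculated and sorted. This is repeated for every sample in X_v; then, for each validation sample, its n nearest samples are taken as calibration samples, and duplicates are removed to give the selected calibration set X_c.
In this embodiment the remaining samples are the candidates for the calibration set. The size of the calibration set is determined by optimizing n, the number of most similar samples selected for each validation sample. X_c contains between about 20 (or 18) samples and 67 minus the number of samples in X_v (that is, 57 to 59), and the corresponding reference value matrix is denoted Y_c.
The Euclidean distance between each sample of the validation set X_v and a remaining sample X_k is calculated as $D_k = \sqrt{\sum (X_k - X_{v,i})^2}$. With X_{v,i} denoting one sample of the validation set taken as the unit of observation, the n remaining samples nearest to it are taken as calibration samples.
Calibration samples selected in this way are similar to the validation set and therefore indirectly similar to the independent test set, so the calibration model is built in a way that is targeted at the unknown samples. The maximum value of n is the value at which all remaining samples are selected into the calibration set, and the minimum value of n should be twice the number of validation samples. Different values of n give different numbers of calibration samples per validation sample. More calibration samples do not necessarily give the best model: they may include abnormal or duplicate samples, or samples that are only weakly similar to the validation samples, which can interfere with modeling. Too few calibration samples, on the other hand, carry less information and may fail to cover the distribution of the unknown samples. The value of n therefore has to be optimized, which amounts to optimizing the number of calibration samples. A model is built for each value of n, and n is chosen from the resulting validation set RMSEV and R_v: the smaller the RMSEV and the larger the R_v, the better the model. The value of n at the optimum is taken as the optimized number of nearest calibration samples per validation sample; the spectral matrix of the calibration set selected with it is denoted X_c and the corresponding reference values Y_c.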
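The optimization of n described above can be sketched as a simple search over candidate values. To keep the example short and dependency-free, ordinary least squares with an intercept is used here as a stand-in for the PLS model of the text; that substitution, and the names `nearest_union`, `rmse` and `choose_n`, are assumptions made for illustration only.

```python
import numpy as np

def nearest_union(X_ref, X_pool, k):
    """Union of the k nearest pool rows (Euclidean) for each reference row."""
    d = np.sqrt(((X_ref[:, None, :] - X_pool[None, :, :]) ** 2).sum(axis=2))
    return sorted(set(np.argsort(d, axis=1, kind="stable")[:, :k].ravel().tolist()))

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)))

def choose_n(X_v, y_v, X_pool, y_pool, n_values):
    """Return (n, RMSEV, calibration indices) for the n with the lowest RMSEV."""
    X_v, X_pool = np.asarray(X_v, float), np.asarray(X_pool, float)
    y_v, y_pool = np.asarray(y_v, float), np.asarray(y_pool, float)
    best = None
    for n in n_values:
        idx = nearest_union(X_v, X_pool, n)           # candidate calibration set
        A = np.c_[np.ones(len(idx)), X_pool[idx]]     # stand-in model: OLS
        coef, *_ = np.linalg.lstsq(A, y_pool[idx], rcond=None)
        err = rmse(y_v, np.c_[np.ones(len(X_v)), X_v] @ coef)
        if best is None or err < best[1]:
            best = (n, err, idx)
    return best
```

The text selects n on both RMSEV and R_v; this sketch uses RMSEV alone, which is the simpler of the two criteria to automate.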
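Based on the validation set and calibration set partitioning method of the above embodiment, a further embodiment of the present invention provides a model building and evaluation method, which comprises:
Steps 1 to 6: as in the previous embodiment, obtain the validation set and the calibration set together with their spectral matrices and reference value matrices;
Step 7: build the model on the calibration set: each parameter in the reference value matrix is separately modeled against the spectral matrix to obtain a calibration model;
Taking the moisture content as an example, according to the partitioning result, the calibration set spectra X_c and the moisture content matrix Y_c are related by partial least squares (PLS) to give the following model between Y_c and X_c:
$Y_c = X_c B_{pls}$  (1);
The model parameters, that is, the regression coefficients B_pls, are obtained from equation (1).
Preferably, in Step 7 the number of latent variables of the modeling method (the model parameter to be determined) is chosen as the value giving the minimum RMSEV on the validation set X_v. All models are built with the optimized number of latent variables.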
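As an illustration of Step 7, the sketch below fits a single-response PLS model with the NIPALS algorithm and picks the number of latent variables by minimum RMSEV. It is not the authors' code: the function names `pls1_fit` and `choose_lv` are hypothetical, and mean-centering (which adds an intercept `b0` not written in equation (1)) is an assumption.

```python
import numpy as np

def pls1_fit(X, y, n_lv):
    """Fit PLS1 by NIPALS; return regression vector B and intercept b0."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)          # weight vector
        t = Xk @ w                      # scores
        tt = t @ t
        p = Xk.T @ t / tt               # X loadings
        qk = yk @ t / tt                # y loading
        Xk = Xk - np.outer(t, p)        # deflate X
        yk = yk - qk * t                # deflate y
        W.append(w); P.append(p); q.append(qk)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    B = W @ np.linalg.solve(P.T @ W, q)
    return B, y_mean - x_mean @ B

def choose_lv(X_c, y_c, X_v, y_v, max_lv):
    """Return the number of latent variables with the lowest RMSEV."""
    errs = []
    for a in range(1, max_lv + 1):
        B, b0 = pls1_fit(X_c, y_c, a)
        resid = np.asarray(y_v, float) - (np.asarray(X_v, float) @ B + b0)
        errs.append(float(np.sqrt(np.mean(resid ** 2))))
    return int(np.argmin(errs)) + 1, errs
```

With as many latent variables as the rank of the centered X_c, this fit coincides with ordinary least squares; fewer latent variables give the regularized models usually preferred for spectra.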
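Step 8: optimize the model based on the validation set. Specifically, the validation set is substituted into the calibration model to obtain the fitted reference values, and the model parameters are adjusted and optimized based on the fitted and actual values;
$Y_v^f = X_v B_{pls}$  (3);
Step 9: evaluate model performance based on the independent test set. Specifically, the independent test set is substituted into the optimized model to obtain the fitted reference values, and the root-mean-square error (RMSEP) and correlation coefficient (R_p) are computed from the fitted and actual values to evaluate the model.
A person skilled in the art will understand that, as an alternative, the model evaluation in Steps 8 and 9 above can also be carried out as a comprehensive evaluation, comprising:
Step 8: substitute the spectral data of the calibration set, validation set and independent test set back into the calibration model and calculate the fitted values of each set, as follows:
$Y_c^f = X_c B_{pls}$  (2);
$Y_v^f = X_v B_{pls}$  (3);
$Y_t^f = X_t B_{pls}$  (4);
Step 9: calculate the root-mean-square error (RMSEC) and correlation coefficient (R_c) from the calibration set fitted values Y_c^f; calculate the validation set root-mean-square error (RMSEV) and correlation coefficient (R_v) from the validation set fitted values Y_v^f; and finally calculate the independent test set root-mean-square error (RMSEP) and correlation coefficient (R_p) from the independent test set fitted values Y_t^f;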
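The two statistics used in Step 9 can be computed as below. The text does not spell out the formulas, so the usual definitions are assumed: the root-mean-square error of fitted against reference values and the Pearson correlation coefficient. The function names are illustrative.

```python
import numpy as np

def rmse(y_ref, y_fit):
    """Root-mean-square error between reference and fitted values."""
    y_ref, y_fit = np.asarray(y_ref, float), np.asarray(y_fit, float)
    return float(np.sqrt(np.mean((y_ref - y_fit) ** 2)))

def corr(y_ref, y_fit):
    """Pearson correlation coefficient between reference and fitted values."""
    return float(np.corrcoef(y_ref, y_fit)[0, 1])
```

Applied to (Y_c, Y_c^f), (Y_v, Y_v^f) and (Y_t, Y_t^f) in turn, these give RMSEC and R_c, RMSEV and R_v, and RMSEP and R_p.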
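Step 10: evaluate the performance of the model from all of the above parameters together.
Because the independent test set is a randomly drawn set of samples, its composition is partly a matter of chance. To evaluate the partitioning methods objectively, the same number of samples is drawn in 10 parallel repetitions of the experiment, and the averages of the above indicators are calculated.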
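The repetition scheme can be sketched as a small wrapper. The callable `run_once` is a hypothetical placeholder for one complete experiment (random draw of the test set, partitioning, modeling, evaluation) that returns a dictionary of metrics; nothing about its internals is taken from the patent.

```python
import numpy as np

def average_over_repeats(run_once, n_repeats=10, seed=0):
    """Run an experiment n_repeats times with one shared random generator
    and return the mean of every metric it reports."""
    rng = np.random.default_rng(seed)
    results = [run_once(rng) for _ in range(n_repeats)]
    return {k: float(np.mean([r[k] for r in results])) for k in results[0]}
```

Passing the generator into `run_once` keeps the ten random draws different from one another while leaving the whole study reproducible from a single seed.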
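In practice, to better assess the effect of the method of this embodiment, it is compared with models built using other sample-set selection methods (for example, KS and SPXY). The RMSEP and R_p of the predictions can be calculated on the same independent test set, so that model performance is evaluated objectively.
A person skilled in the art will understand that a preprocessing step for the calibration set, validation set and independent test set may also be included before modeling. The specific preprocessing method is not limited here; in the following examples no preprocessing is used and the models are built directly on the raw spectral matrices. If preprocessing is used, the same method must be applied to the calibration set, the validation set and the independent test set.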
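Example 1
The results of the models built in this example for the four components of the corn data are given in Table 1, where Lv is the number of latent factors, N_c the number of calibration samples and N_v the number of validation samples.
Table 1. Summary of prediction results for the corn components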
Figure PCTCN2020120950-appb-000001
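As Table 1 shows (smaller RMSEC, RMSEV and RMSEP are better, and larger R_c, R_v and R_p are better), good models were obtained for all corn components. The calibration set correlation coefficients R_c all exceed 0.95, indicating good model performance and a good fit, with only about 40 samples used as the calibration set. The validation set correlation coefficients R_v also all exceed 0.95, indicating that the model predicts the validation samples well. For the randomly chosen independent test set, all components except oil reach R_p values above 0.95 with small RMSEP values, and the R_p of the independent test set is close to the R_v of the validation set. Selecting samples similar to the independent test set as validation samples, so as to reflect indirectly the ability to predict unknown samples, is therefore a sound approach. In addition, the RPD values of all components are greater than 3.0, indicating good predictive ability. The present invention can be used for sample-set selection with good results.
To determine whether a validation set selected by spectral similarity evaluates the model in a way that is close to the independent test set, the root-mean-square errors and correlation coefficients of the validation set X_v and the independent test set X_t were compared over the course of optimizing the calibration set size. The results are shown in Figures 4 and 5.
As Figure 4 shows, the validation set RMSEV and the independent test set RMSEP follow essentially the same trend: when RMSEV reaches its minimum, RMSEP is also comparatively small. In Figure 5 the overall trends of R_v and R_p also agree. Selecting samples spectrally close to the unknown samples as the validation set is therefore a feasible way to reflect the predictive performance of the model indirectly. Optimizing the calibration set on the validation set also, to some extent, optimizes its performance on the independent test set, because the validation spectra are very similar to those of the independent test set and the calibration samples are in turn chosen to be similar to the validation set. Prediction of the unknown samples (that is, the independent test set) is therefore more targeted.
To evaluate the performance of the proposed method, it is compared with the common Kennard-Stone (KS) and SPXY methods. The same number of validation samples as in the present method is selected, the remaining samples form the calibration set, and the same independent test set is used to compare the modeling performance and predictive ability of the different methods. The results are given in Table 2.
Table 2. Comparison of the predictive ability of models built with the different partitioning methods (average of 10 repetitions)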
Figure PCTCN2020120950-appb-000002
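As the comparison in Table 2 shows, for the independent test set the present method has a slightly higher RMSEP than SPXY but the largest correlation coefficient R_p, outperforming the other two methods. In particular, for oil, which is intrinsically harder to model, it gives better predictions with a smaller prediction error. Moreover, taken together with Table 1, the present method uses only about 40 samples as the calibration set, whereas KS and SPXY use all samples left after removing the validation set and the independent test set (about 57). The present method thus uses fewer calibration samples and gives a better calibration model.
Table 3 lists the reference-value ranges of each component for the calibration set, validation set and independent test set produced by each method. Each range is the average over the 10 experiments.
Table 3. Reference-value ranges of the data sets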
Figure PCTCN2020120950-appb-000003
Figure PCTCN2020120950-appb-000004
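As Table 3 shows, for all three partitioning methods the reference-value ranges of the four components in the calibration set cover those of the validation set and the independent test set. In theory, calibration set range > validation set range > independent test set range. In general the calibration set range should be larger than the validation set range; if it is not, the range from which calibration samples are drawn can be widened until the relationship holds. The independent test samples can be regarded as unknown samples whose Y_t is not known in advance.
Example 2
Taking the medicinal material Salvia miltiorrhiza (Danshen) as an example, 120 samples were measured in total, including replicates. X is the near-infrared spectral matrix of the samples, measured with a Fourier-transform near-infrared spectrometer (Antaris II, Thermo Fisher, USA), and Y is the matrix of four quality-index components: tanshinone IIA (TSIIA), cryptotanshinone (CTS), tanshinone I (TSI) and salvianolic acid B (SAB). The raw spectra of the samples are shown in Figure 6. Each component is used as an analyte to evaluate the new partitioning method; the description below takes tanshinone IIA as the example, and the other components follow the same steps. Abnormal samples are removed first: the Hotelling T² method detected 3 abnormal samples, leaving 117 samples after removal. The principal component analysis plot after outlier removal is shown in Figure 7. Fifteen samples are randomly drawn as the independent test set X_t.
The remaining 102 samples are partitioned. For each sample in the independent test set X_t the single most similar sample is selected, and duplicates are removed to form the final validation set X_v, which contains roughly 10 to 15 samples. The remaining samples are the candidates for the calibration set, whose size is determined by optimizing n, the number of most similar samples selected for each validation sample; X_c contains between 10 and 87 (or 92) samples.
According to the partitioning result, partial least squares (PLS) models relating X and Y are built, and model performance is evaluated jointly by the calibration set root-mean-square error (RMSEC), the validation set root-mean-square error (RMSEV), the independent test set root-mean-square error (RMSEP) and the corresponding correlation coefficients: calibration set (R_c), validation set (R_v) and prediction set (R_p). Because the independent test set is a randomly drawn set of samples, its composition is partly a matter of chance. To evaluate the partitioning methods objectively, the same number of samples is drawn at random as the independent test set in 10 parallel repetitions, and the averages of the above indicators are compared. The results are given in Table 4.
Table 4. Summary of prediction results for the Salvia miltiorrhiza components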
Figure PCTCN2020120950-appb-000005
Figure PCTCN2020120950-appb-000006
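As Table 4 shows, good models were obtained for all Salvia miltiorrhiza components. The correlation coefficients of the calibration set, validation set and independent test set all exceed 0.95, with small root-mean-square errors. The validation set indicators are somewhat better than those of the independent test set because the calibration samples were chosen to be close to the validation set and were optimized on it. The RPD values of all components are greater than 3.0, indicating good predictive ability. In addition, the present invention optimizes the number of calibration samples, which is reduced to about 50 or 60 (see Table 4), lowering the actual workload.
The variation of the root-mean-square errors and of the correlation coefficients of the validation set and the independent test set for different numbers of calibration samples is shown in Figures 8 and 9, respectively. Figure 8 shows that the root-mean-square errors follow the same trend, and Figure 9 shows that the correlation coefficients also vary together; although the magnitudes differ, the trends are the same. The validation set can therefore stand in for the independent test set in describing the predictive ability of the model.
For comparison, the 117 samples were also partitioned with the Kennard-Stone (KS) and SPXY methods. The same independent test set was used, the number of validation samples was the same as in the present method, the remaining samples formed the calibration set, and the validation set was again used to optimize the calibration set. The results are given in Table 5.
Table 5. Comparison of the predictive ability of models built with the different partitioning methods (average of 10 repetitions)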
Figure PCTCN2020120950-appb-000007
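As Table 5 shows, both the R_p and the RMSEP of the present method are better than those of KS and SPXY: for every component R_p is the largest and RMSEP the smallest of the three methods. Since the three methods use the same independent test set, this indicates that the calibration model obtained with the present partitioning has the strongest predictive ability for that test set. Taken together with Table 4, because the present invention optimizes the number of calibration samples, it uses fewer samples than KS and SPXY (which use all samples left after removing the validation set and the independent test set, about 87, as calibration samples) while giving better model performance and predictive ability.
Table 6 lists the reference-value ranges of the four components for the calibration set, validation set and independent test set produced by each method. Each range is the average over the 10 experiments.
Table 6. Reference-value ranges of the data sets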
Figure PCTCN2020120950-appb-000008
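As Table 6 shows, except for the KS method with tanshinone I, the calibration samples of every method cover the reference-value range of the validation samples, and the calibration samples also cover the independent test samples.
Example 3
Taking the public corn data as an example, 80 samples were measured in total. X is the near-infrared spectral matrix of the samples and Y is the matrix of four component quality indices. Moisture is used for the description, and the other components follow the same steps. Abnormal samples are removed first: the Hotelling T² method detected 3 abnormal samples, leaving 77 samples after removal, and 10 samples are randomly drawn as the independent test set X_t.
The remaining 67 samples are partitioned. Here the number of validation samples is changed in order to examine how this affects the model performance obtained with the various partitioning methods. For each independent test sample, the 2 samples (g = 2) with the smallest Euclidean distance are placed in the validation set, which then contains roughly 14 to 20 samples. The remaining samples are used to optimize the number of calibration samples. A PLS model is built between the X and Y matrices and the following parameters are calculated: the calibration set root-mean-square error (RMSEC), the validation set root-mean-square error (RMSEV), the independent test set root-mean-square error (RMSEP) and the corresponding correlation coefficients: calibration set (R_c), validation set (R_v) and prediction set (R_p). Because the independent test set is a randomly drawn set of samples, its composition is partly a matter of chance. To evaluate the partitioning methods objectively, the same number of samples is drawn at random as the independent test set in 10 parallel repetitions, and the averages of the above indicators are compared. The results are given in Table 7.
Table 7. Summary of prediction results for the corn components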
Figure PCTCN2020120950-appb-000009
Figure PCTCN2020120950-appb-000010
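As Table 7 shows, good models are still obtained for the corn components. After optimization the calibration set contains about 40 samples (see Table 7). For most components the calibration and validation set correlation coefficients exceed 0.95, and the RPD values of all components are greater than 3.0, indicating good predictive ability. The method can therefore be used to partition the sample set and yields good model performance and prediction.
During optimization of the number of calibration samples, the root-mean-square errors of the validation set and the independent test set follow the same trend, as do the correlation coefficients. The validation samples can therefore serve as a reflection of the prediction error for unknown samples and help the model predict unknown samples better; see Figures 10 and 11.
For comparison, the 67 samples were also partitioned with the Kennard-Stone (KS) and SPXY methods. The same independent test samples as in the present method were used, the number of validation samples was the same, the remaining samples formed the calibration set, and the validation samples were again used to optimize the calibration model. The results are given in Table 8.
Table 8. Comparison of the predictive ability of models built with the different partitioning methods (average of 10 repetitions)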
Figure PCTCN2020120950-appb-000011
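As Table 8 shows, for oil and protein the proposed method outperforms KS and SPXY, with the smallest RMSEP and the largest R_p. For moisture, which is intrinsically well modeled and leaves little room for improvement, RMSEP is still the smallest and R_p still the largest. For starch, in terms of R_p the present method is better than KS and slightly worse than SPXY, though the difference is small, while its RMSEP is the lowest, giving the smallest prediction error. Compared with Example 1, selecting the 2 nearest samples for each independent test sample in this example gives slightly worse results, possibly because the validation samples include redundant information or because the selected samples are less similar, which interferes with modeling to some extent.
Table 9 lists the reference-value ranges of the four components for the calibration set, validation set and independent test set produced by each method. Each range is the average over the 10 experiments.
Table 9. Reference-value ranges of the data sets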
Figure PCTCN2020120950-appb-000012
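As Table 9 shows, for all components and all methods the reference-value range of the calibration samples covers the validation set and the independent test set.
Example 4
Taking the medicinal material Salvia miltiorrhiza as an example, 120 samples were measured in total, including replicates. X is the near-infrared spectral matrix of the samples and Y is the matrix of four quality-index components. Tanshinone IIA (TSIIA) is used to describe the method; cryptotanshinone (CTS), tanshinone I (TSI) and salvianolic acid B (SAB) follow the same steps. Abnormal samples are removed first: the Hotelling T² method detected 3 abnormal samples, leaving 117 samples after removal. Fifteen samples are randomly drawn as the independent test set X_t.
The remaining 102 samples are partitioned. Here the number of validation samples is changed in order to examine how this affects the model performance obtained with the various partitioning methods. For each independent test sample, the 2 samples with the smallest Euclidean distance are placed in the validation set, which then contains roughly 20 to 30 samples. The remaining samples are used to optimize the number of calibration samples. A PLS model is built between the X and Y matrices and the following parameters are calculated: the calibration set root-mean-square error (RMSEC), the validation set root-mean-square error (RMSEV), the independent test set root-mean-square error (RMSEP) and the corresponding correlation coefficients: calibration set (R_c), validation set (R_v) and prediction set (R_p). Because the independent test set is a randomly drawn set of samples, its composition is partly a matter of chance. To evaluate the partitioning methods objectively, the experiment is repeated 10 times in parallel, each time drawing the same number of samples at random as the independent test set, and the averages of the above indicators are compared. The results are given in Table 10.
Table 10. Summary of prediction results for the Salvia miltiorrhiza components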
Figure PCTCN2020120950-appb-000013
Figure PCTCN2020120950-appb-000014
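As Table 10 shows, good models are still obtained for the Salvia miltiorrhiza components. After optimization the calibration set contains about 60 samples (see Table 10). For all components the correlation coefficients of the calibration set, validation set and independent test set reach 0.95 or more, the independent test set RMSEP is small, and the RPD values are all clearly greater than 3.0, indicating good modeling performance and predictive ability.
During optimization of the calibration set size, the root-mean-square errors of the validation set and the independent test set follow the same trend, as do the correlation coefficients, and the correlation coefficients and root-mean-square errors move in opposite directions. The validation samples can therefore serve as a reflection of the prediction error for unknown samples, so that the model is better optimized and the prediction of unknown samples is more targeted; see Figures 12 and 13.
For comparison, the 102 samples were also partitioned with the Kennard-Stone (KS) and SPXY methods. The same independent test samples as in the present method were used, the number of validation samples was the same, the remaining samples formed the calibration set, and the validation samples were used to optimize the calibration model. The results are given in Table 11.
Table 11. Comparison of the predictive ability of models built with the different partitioning methods (average of 10 repetitions)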
Figure PCTCN2020120950-appb-000015
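As Table 11 shows, the present invention slightly improves the modeling and predictive ability for most components. Apart from CTS, whose R_p improves considerably, the R_p values of the other components differ little; the improvement is small, but the present method is still the best of the three. The RMSEP for SAB is slightly worse, but its R_p is still the highest. Compared with Example 2, one more nearest sample is selected for each independent test sample; this may introduce redundant information or lower the similarity of the selected samples, so the modeling results are not as good as in Example 2.
Table 12 lists the reference-value ranges of the four components for the calibration set, validation set and independent test set produced by each method. Each range is the average over the 10 experiments.
Table 12. Reference-value ranges of the data sets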
Figure PCTCN2020120950-appb-000016
Figure PCTCN2020120950-appb-000017
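As Table 12 shows, in the sample sets selected by the present method for cryptotanshinone the validation set range slightly exceeds the calibration set range. In practice, however, the samples to be analyzed will not necessarily fall within the calibration set range, so the present method may be more widely applicable. For the other components, the reference-value range of the calibration samples covers the validation set and the independent test set for every method.
The above one or more embodiments have the following technical effects:
The partitioning method of the present invention starts from the data used to test model performance (that is, the independent test set, which serves as unknown samples once the model has been built). Using spectral similarity as the criterion, samples whose spectra are similar to the independent test set are selected as the validation set, so that the prediction performance on the validation set indirectly reflects the ability to predict unknown samples. Then, based on the validation set, samples whose spectra are similar to the validation set are selected as the calibration set. This ensures that the model is built specifically for the unknown samples; compared with currently common methods, it can be clearly shown that its modeling performance for unknown samples is better and its predictive ability stronger.
Selecting the validation set and the calibration set also involves choosing the number of samples. The present invention optimizes the number of calibration samples, so that a better prediction effect can be achieved with fewer samples.
A person skilled in the art will understand that the modules or steps of the present invention described above can be implemented by a general-purpose computing device. Alternatively, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; or they can be fabricated separately as individual integrated circuit modules, or several of the modules or steps can be fabricated as a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.
The above are only preferred embodiments of the present invention and do not limit it; a person skilled in the art may make various modifications and changes. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Although specific embodiments of the present invention have been described above with reference to the drawings, they do not limit its scope of protection. A person skilled in the art will understand that modifications or variations that can be made on the basis of the technical solution of the present invention without creative effort remain within its scope of protection.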

Claims (8)

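  1. A method for selecting a calibration set and a validation set based on spectral similarity, characterized by comprising the following steps:
    performing near-infrared spectral measurement on the original samples to obtain an original sample spectral matrix;
    performing outlier detection on the original sample spectral matrix and removing the outliers;
    randomly drawing m samples as an independent test set;
    for each sample in the independent test set, calculating the spectral similarity between that sample and each of the remaining original samples, and writing the g most similar samples into the validation set;
    for each sample in the validation set, calculating the spectral similarity between that sample and each of the remaining original samples, and writing the n most similar samples into the calibration set;
    wherein, with N denoting the number of original samples after outlier removal, m, g and n satisfy: g ≤ n ≤ (N - m).
  2. The method for selecting a calibration set and a validation set based on spectral similarity according to claim 1, characterized in that a plurality of reference values are also measured for the original samples to obtain an original sample reference value matrix.
  3. The method for selecting a calibration set and a validation set based on spectral similarity according to claim 2, characterized in that, after the original sample spectral matrix and the original sample reference value matrix are obtained, outlier detection is performed on the original sample spectral matrix, the outliers are removed, and the corresponding reference values are removed from the original sample reference value matrix.
  4. The method for selecting a calibration set and a validation set based on spectral similarity according to claim 1, characterized in that the spectral similarity between samples is calculated using the Euclidean distance.
  5. The method for selecting a calibration set and a validation set based on spectral similarity according to claim 1, characterized in that models are built for different values of n, and the value of n is optimized based on model performance to obtain the optimized model parameter values.
  6. A modeling method based on the method for selecting a calibration set and a validation set according to any one of claims 1 to 5, characterized in that the reference value matrix corresponding to the calibration set is obtained, and each reference value in the reference value matrix is separately modeled against the spectral matrix.
  7. The modeling method according to claim 6, characterized in that the method further comprises:
    optimizing the model parameters based on the validation set;
    evaluating model performance based on the independent test set.
  8. The modeling method according to claim 6, characterized in that the method further comprises a comprehensive evaluation of model performance based on the calibration set, the validation set and the independent test set.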
PCT/CN2020/120950 2019-10-17 2020-10-14 Method for selection of calibration set and validation set based on spectral similarity and modeling WO2021073541A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/289,657 US20210404952A1 (en) 2019-10-17 2020-10-14 Method for selection of calibration set and validation set based on spectral similarity and modeling

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910986971.0 2019-10-17
CN201910986971.0A CN110687072B (zh) 2019-10-17 2019-10-17 Method for selection of calibration set and validation set based on spectral similarity and modeling

Publications (1)

Publication Number Publication Date
WO2021073541A1 true WO2021073541A1 (zh) 2021-04-22

Family

ID=69113453

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/120950 WO2021073541A1 (zh) 2019-10-17 2020-10-14 Method for selection of calibration set and validation set based on spectral similarity and modeling

Country Status (3)

Country Link
US (1) US20210404952A1 (zh)
CN (1) CN110687072B (zh)
WO (1) WO2021073541A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
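CN114184577A (zh) * 2021-11-30 2022-03-15 中国科学院西北高原生物研究所 Parameter selection method for a near-infrared quantitative detection model and quantitative detection method
WO2024011687A1 (zh) * 2022-07-14 2024-01-18 广东辛孚科技有限公司 Method and device for establishing a rapid evaluation model of oil physical properties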

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
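CN110687072B (zh) * 2019-10-17 2020-12-01 山东大学 Method for selection of calibration set and validation set based on spectral similarity and modeling
CN111667889B (zh) * 2020-07-20 2022-03-01 山东中医药大学 Method for predicting the content of quality markers in Salvia miltiorrhiza
CN112285056B (zh) * 2020-10-14 2022-02-08 山东大学 Method for personalized calibration set selection and modeling for spectral samples
CN113094892A (zh) * 2021-04-02 2021-07-09 辽宁石油化工大学 Petroleum concentration prediction method based on data elimination and local partial least squares
CN113762208B (zh) * 2021-09-22 2023-07-28 山东大学 Method for converting between near-infrared spectra and characteristic spectra and application thereof
CN114783539A (zh) * 2022-04-28 2022-07-22 山东大学 Method and system for analyzing traditional Chinese medicine components based on spectral clustering
CN115952402B (zh) * 2022-09-29 2023-06-27 南京林业大学 Method for selecting a standard sample set for near-infrared model transfer based on the binary dragonfly algorithm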

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6529767B1 (en) * 2000-09-01 2003-03-04 Spectron Tech Co., Ltd. Method and apparatus for measuring skin moisture by using near infrared reflectance spectroscopy
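CN103411893A (zh) * 2013-07-29 2013-11-27 陕西步长制药有限公司 Near-infrared spectroscopy detection method for Naoxintong capsules
CN105486663A (zh) * 2016-02-29 2016-04-13 上海交通大学 Method for detecting the stable carbon isotope ratio of soil using near-infrared spectroscopy
CN106770005A (zh) * 2016-11-25 2017-05-31 山东大学 Method for partitioning a calibration set and a validation set for near-infrared spectral analysis
CN109324014A (zh) * 2018-10-08 2019-02-12 华东理工大学 Adaptive near-infrared method for rapid prediction of crude oil properties
CN110687072A (zh) * 2019-10-17 2020-01-14 山东大学 Method for selecting a calibration set and a validation set based on spectral similarity, and modeling method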

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
MY107650A (en) * 1990-10-12 1996-05-30 Exxon Res & Engineering Company Method of estimating property and / or composition data of a test sample
SE512540C2 (sv) * 1998-06-22 2000-04-03 Umetri Ab Method and device for calibration of input data
RU2266523C1 (ru) * 2004-07-27 2005-12-20 Общество с ограниченной ответственностью ООО "ВИНТЕЛ" Method for creating independent multivariate calibration models
US20170258331A1 (en) * 2014-09-08 2017-09-14 Shimadzu Corporation Imaging device
CN107179293A (zh) * 2017-06-23 2017-09-19 南京富岛信息工程有限公司 Method for evaluating the uncertainty of oil properties

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
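CN114184577A (zh) * 2021-11-30 2022-03-15 中国科学院西北高原生物研究所 Parameter selection method for a near-infrared quantitative detection model and quantitative detection method
CN114184577B (zh) * 2021-11-30 2023-08-22 中国科学院西北高原生物研究所 Parameter selection method for a near-infrared quantitative detection model and quantitative detection method
WO2024011687A1 (zh) * 2022-07-14 2024-01-18 广东辛孚科技有限公司 Method and device for establishing a rapid evaluation model of oil physical properties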

Also Published As

Publication number Publication date
CN110687072B (zh) 2020-12-01
CN110687072A (zh) 2020-01-14
US20210404952A1 (en) 2021-12-30

Similar Documents

Publication Publication Date Title
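WO2021073541A1 (zh) Method for selection of calibration set and validation set based on spectral similarity and modeling
WO2016000088A1 (zh) Hyperspectral band extraction method based on the optimal index-correlation coefficient method
JP2006268558A (ja) Data processing method and program
Shen et al. Local partial least squares based on global PLS scores
CN107563448B (zh) Sample space clustering partition method based on near-infrared spectral analysis
CN105431854B (zh) Method and apparatus for analyzing biological samples
CN112285056B (zh) Method for personalized calibration set selection and modeling for spectral samples
Mao et al. Modeling research on wheat protein content measurement using near-infrared reflectance spectroscopy and optimized radial basis function neural network
US20120227043A1 (en) Optimization of Data Processing Parameters
CN111309577A (zh) Method for constructing a Spark-oriented batch application execution time prediction model
Wu et al. Variety identification of Chinese cabbage seeds using visible and near-infrared spectroscopy
WO2023207453A1 (zh) Method and system for analyzing traditional Chinese medicine components based on spectral clustering
Zhang et al. Application of swarm intelligence algorithms to the characteristic wavelength selection of soil moisture content
Wang et al. A multi-kernel channel attention combined with convolutional neural network to identify spectral information for tracing the origins of rice samples
Wang et al. SVM classification method of waxy corn seeds with different vitality levels based on hyperspectral imaging
Bell et al. MIPHENO: data normalization for high throughput metabolite analysis
CN113125377B (zh) Method and device for detecting diesel properties based on near-infrared spectroscopy
CN114062305B (zh) Method and system for single-kernel variety identification based on near-infrared spectroscopy and a 1D-In-Resnet network
CN106950193B (zh) Near-infrared spectral variable selection method based on self-weighted variable combination cluster analysis
CN111220565B (zh) CPLS-based calibration transfer method for infrared spectrum measuring instruments
Liu et al. Sample selection method using near‐infrared spectral information entropy as similarity criterion for constructing and updating peach firmness and soluble solids content prediction models
CN110632024B (zh) Quantitative analysis method, apparatus, device and storage medium based on infrared spectroscopy
CN111062118B (zh) Multi-layer soft-sensor modeling system and method based on neural-network predictive layering
JP2023521757A (ja) Use of a genetic algorithm to determine a model for identifying sample attributes based on Raman spectra
CN111222736A (zh) Ammunition storage reliability evaluation method based on a hybrid relevance vector machine model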

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20876421

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20876421

Country of ref document: EP

Kind code of ref document: A1