CN113159114A - High-dimensional data dimension reduction cross validation analysis method based on application in NIR data - Google Patents

Info

Publication number
CN113159114A
Authority
CN
China
Prior art keywords
data
model
factors
error
training set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110257625.6A
Other languages
Chinese (zh)
Inventor
潘晓光
潘柠
焦璐璐
张娜
陈亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanxi Sanyouhe Smart Information Technology Co Ltd
Original Assignee
Shanxi Sanyouhe Smart Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanxi Sanyouhe Smart Information Technology Co Ltd filed Critical Shanxi Sanyouhe Smart Information Technology Co Ltd
Priority to CN202110257625.6A priority Critical patent/CN113159114A/en
Publication of CN113159114A publication Critical patent/CN113159114A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N21/00 Investigating or analysing materials by the use of optical means, i.e. using sub-millimetre waves, infrared, visible or ultraviolet light
    • G01N21/17 Systems in which incident light is modified in accordance with the properties of the material investigated
    • G01N21/25 Colour; Spectral properties, i.e. comparison of effect of material on the light at two or more different wavelengths or wavelength bands
    • G01N21/31 Investigating relative effect of material at wavelengths characteristic of specific elements or molecules, e.g. atomic absorption spectrometry
    • G01N21/35 Investigating relative effect of material at wavelengths characteristic of specific elements or molecules, e.g. atomic absorption spectrometry using infrared light
    • G01N21/359 Investigating relative effect of material at wavelengths characteristic of specific elements or molecules, e.g. atomic absorption spectrometry using infrared light using near infrared light
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Biochemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Immunology (AREA)
  • Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention belongs to the technical field of NIR data processing and particularly relates to a high-dimensional data dimension reduction cross-validation analysis method based on application in NIR data, comprising the following steps: detecting data abnormal values, preprocessing data, fitting a training data model, cross validation, calculating a prediction error, and outputting a model. The step of detecting data abnormal values adopts an unsupervised-learning abnormal value detection method to detect local abnormal values, removes possible abnormal values, and corrects the data. The step of preprocessing data uses a multiplicative scattering correction method to correct the correlation and the rising trend among the data, and then uses an SG-filter to smooth the data. The step of fitting a training data model runs the partial least squares method and outputs related indexes such as the contribution rates corresponding to different factors. The step of cross validation runs cross validation and selects an appropriate number of factors for each of the 4 dependent variables. The step of calculating a prediction error predicts the 2018 data with the trained model, calculates the error RMSEP, and gives an analysis result. The step of outputting a model outputs the model with the lowest error.

Description

High-dimensional data dimension reduction cross validation analysis method based on application in NIR data
Technical Field
The invention belongs to the technical field of NIR data processing, and particularly relates to a high-dimensional data dimension reduction cross-validation analysis method based on NIR data.
Background
At present, NIR spectral data are high-dimensional data. In general, NIR data have too many variables relative to the number of samples, and because the variables are highly correlated, a model rarely gives a good prediction result. To solve this problem, researchers need methods that reduce the dimension of the data reasonably and scientifically.
Cause of problems or defects: existing high-dimensional data dimension reduction methods mainly extract useful information from the original data and then reduce the number of variables; they mainly comprise principal component analysis and the partial least squares method. However, a main problem in using partial least squares is how to select the number of factors so that the model fits best and the error is small. Existing research proposes selecting the number of factors whose contribution rate exceeds 80%. However, the contribution rate grows as the number of factors grows, so a more scientific basis is needed to determine the number of factors reasonably.
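The difference between the two selection rules discussed above can be illustrated with a small sketch. It assumes that the "contribution rate" is read as a cumulative explained-variance fraction; the function names and all numbers below are hypothetical toy values chosen for illustration and are not results from the invention.

```python
from itertools import accumulate

def factors_by_threshold(contribution_rates, threshold=0.80):
    """Smallest number of factors whose cumulative contribution reaches the threshold."""
    for m, cumulative in enumerate(accumulate(contribution_rates), start=1):
        if cumulative >= threshold:
            return m
    return len(contribution_rates)

def factors_by_min_error(errors):
    """Number of factors (1-based) with the lowest validation error."""
    return min(range(1, len(errors) + 1), key=lambda m: errors[m - 1])

# Toy values only: per-factor contribution rates and validation errors.
rates = [0.62, 0.21, 0.08, 0.05, 0.02]
validation_rmse = [0.90, 0.61, 0.52, 0.48, 0.57]

m_threshold = factors_by_threshold(rates)          # stops as soon as 80% is reached
m_min_error = factors_by_min_error(validation_rmse)  # looks at prediction error instead
print(m_threshold, m_min_error)
```

Because the cumulative contribution can only grow with the number of factors, the threshold rule says nothing about where prediction error starts to rise again, which is the gap the cross-validation steps are meant to close.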
Disclosure of Invention
Aiming at the problems that the number of factors of the model cannot be reasonably determined and that the prediction result is poor, the invention provides a method that selects a reasonable number of factors, has the minimum data error, and represents the data characteristics better.
In order to solve the technical problems, the invention adopts the technical scheme that:
A high-dimensional data dimension reduction cross-validation analysis method based on application in NIR data comprises the following steps:
S100, detecting data abnormal values: detecting local abnormal values with an unsupervised-learning abnormal value detection method, removing possible abnormal values, and correcting the data;
S200, preprocessing data: correcting the correlation and the continuously rising trend among the data with a multiplicative scattering correction method, and then smoothing the data with an SG-filter;
S300, fitting a training data model: running the partial least squares method and outputting the contribution rates corresponding to different factors and other related indexes;
S400, cross validation: running cross validation and selecting an appropriate number of factors for each of the 4 dependent variables;
S500, calculating a prediction error: predicting the 2018 data with the trained model, calculating the error RMSEP, and giving an analysis result;
S600, outputting a model: outputting the model with the lowest error.
In the step of preprocessing data, the multiplicative scattering correction method is used to address the high internal correlation, and the method of Martens is adopted to rotate the spectral data so that they lie close to the mean.
In the step of preprocessing data, the SG-filter method is used to smooth the data. Assuming that x_j is the central value of the smoothing window, that the length of the smoothing window is 2m + 1, and that i ∈ [-m, m], x_j is calculated by the formula
Figure BDA0002968625150000021
In the step of fitting a training data model, the data are divided into a training set and a test set, the partial least squares method is then run on the training set, and finally the contribution rates corresponding to different factors are output.
In the step of cross validation, the data in the training set are divided into a cross training set containing n-1 data and a cross test set containing only 1 datum, and a loop statement is then run to output the training set error when the number of factors is 1, which is calculated as
Figure BDA0002968625150000022
The loop statement is then rerun with the number of factors m = 2, ..., M, and the prediction errors corresponding to the different numbers of factors are output. Finally, a scatter diagram of the RMSEP is drawn, and the number of factors with the minimum prediction error is found and used for the independent variables with which the model is built.
In the step of cross validation, the data in the training set are divided into a cross training set containing the data of 44 manufacturers and a cross test set containing the data of only 1 manufacturer, and a loop statement is then run to output the training set error when the number of factors is 1, which is calculated as
Figure BDA0002968625150000031
The loop statement is then rerun with the number of factors m = 2, ..., M, and the prediction errors corresponding to the different numbers of factors are output. Finally, a scatter diagram of the RMSEP is drawn, and the number of factors with the minimum prediction error is found and used for the independent variables with which the model is built.
Drawings
FIG. 1 is a system flow diagram of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A high-dimensional data dimension reduction cross-validation analysis method based on application in NIR data, as shown in FIG. 1, includes the following steps:
S100, detecting data abnormal values: detecting local abnormal values with an unsupervised-learning abnormal value detection method, removing possible abnormal values, and correcting the data;
S200, preprocessing data: correcting the correlation and the continuously rising trend among the data with a multiplicative scattering correction method, and then smoothing the data with an SG-filter;
S300, fitting a training data model: running the partial least squares method and outputting the contribution rates corresponding to different factors and other related indexes;
S400, cross validation: running cross validation and selecting an appropriate number of factors for each of the 4 dependent variables;
S500, calculating a prediction error: predicting the 2018 data with the trained model, calculating the error RMSEP, and giving an analysis result;
S600, outputting a model: outputting the model with the lowest error.
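Step S500 relies on the RMSEP, whose formula is reproduced in this text only as a figure reference. As a hedged aid to the reader, the usual definition of the root mean square error of prediction is given below; the symbols n_test (number of test samples), y_i (measured value), and ŷ_i (predicted value) are notation assumed here, not taken from the original figures.

```latex
\mathrm{RMSEP} = \sqrt{\frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} \left( y_i - \hat{y}_i \right)^2}
```

Under this definition the RMSEP has the same unit as the dependent variable, so it would be computed separately for each of the 4 dependent variables.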
Further, in the step of detecting data abnormal values, the given NIR high-dimensional spectral data are the 2016-2018 spectral data provided by 45 manufacturers, in which 125 independent spectral variables are used to measure 4 groups of dependent oil variables. The 2016 and 2017 data are used to train the model, and the 2018 data are used as test data. The data set size is n, there are 4 Y variables and 125 independent variables, and the number of partial least squares factors is m, with m between 1 and M. In high-dimensional data, especially high-dimensional spectral data, the variables are highly correlated, the data usually contain many variables, and the noise and fluctuation of the data under different spectra are extremely large. Abnormal values therefore need to be detected and the data corrected before the model is finally built.
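The text does not name a specific detector. One common unsupervised method for local abnormal values is the local outlier factor (LOF), so the sketch below assumes LOF is an acceptable stand-in. The function name, the neighbourhood size k, the cutoff of 1.5, and the synthetic data are all hypothetical.

```python
import numpy as np

def local_outlier_factor(X, k=5):
    """Return the LOF score of each row of X (scores well above 1 suggest an outlier)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Pairwise Euclidean distances; a point is never its own neighbour.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]              # indices of the k nearest neighbours
    k_dist = d[np.arange(n), nn[:, -1]]            # distance to the k-th neighbour
    # Reachability distance from each point p to each neighbour o: max(k_dist(o), d(p, o)).
    reach = np.maximum(d[np.arange(n)[:, None], nn], k_dist[nn])
    lrd = 1.0 / (reach.mean(axis=1) + 1e-12)       # local reachability density
    return lrd[nn].mean(axis=1) / lrd

rng = np.random.default_rng(0)
spectra = rng.normal(0.0, 1.0, size=(30, 125))     # synthetic stand-in for 125 spectral variables
spectra[7] += 8.0                                  # plant one clearly abnormal sample
scores = local_outlier_factor(spectra, k=5)
keep = scores < 1.5                                # hypothetical cutoff, to be tuned on real data
clean = spectra[keep]
```

The cutoff is a judgment call; on real spectra it would normally be chosen by inspecting the score distribution rather than fixed in advance.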
Further, in the step of preprocessing data, the multiplicative scattering correction method is used to address the high internal correlation, and the method of Martens is adopted to rotate the spectral data so that they lie close to the mean. The reference formula is as follows:
Figure BDA0002968625150000041
where x_im is the i-th data value of the m-th variable (spectrum),
Figure BDA0002968625150000043
is the mean of the m-th spectrum,
Figure BDA0002968625150000044
is the data processed by the multiplicative scattering correction method.
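Since the correction formula survives here only as figure references, the following is a minimal sketch of the widely used form of multiplicative scattering correction, not necessarily the exact formula of the invention. It assumes each spectrum is regressed on the mean spectrum and then corrected with the fitted intercept and slope; the function name and the synthetic spectra are hypothetical.

```python
import numpy as np

def msc(spectra, reference=None):
    """Multiplicative scatter correction: regress each spectrum on a reference, then undo the fit."""
    spectra = np.asarray(spectra, dtype=float)
    ref = spectra.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(spectra)
    for i, row in enumerate(spectra):
        # Least-squares fit row ≈ slope * ref + intercept.
        slope, intercept = np.polyfit(ref, row, deg=1)
        corrected[i] = (row - intercept) / slope
    return corrected, ref

# Synthetic spectra: the same shape with different additive and multiplicative scatter.
wavelengths = np.linspace(0.0, 1.0, 125)
base = np.sin(2 * np.pi * wavelengths) + 2.0
raw = np.vstack([0.8 * base + 0.3, 1.0 * base, 1.3 * base - 0.2])
corrected, ref = msc(raw)
```

In this toy case every corrected spectrum coincides with the mean spectrum, which matches the description of pulling the spectral data close to the mean. When the method is applied to test data, the reference spectrum from the training data would normally be passed in through the `reference` argument.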
Further, in the step of preprocessing data, the SG-filter method is used to smooth the data, because the data processed in the previous step still contain much noise. A simple way to remove data noise is derivative smoothing, and the SG-filter is a relatively mature method for smoothing data. Assuming that x_j is the central value of the smoothing window, that the length of the smoothing window is 2m + 1, that i ∈ [-m, m], and that C_i is the derived weight of each x_(j+i), the calculation formula for x_j is as follows:
Figure BDA0002968625150000042
This smoothing method assigns different weights to the points in a smoothing window and fits the window with a least-squares curve. For each smoothing window a least-squares curve that minimizes the error can be found, and substituting the data into it gives an estimate of the middle point x_j.
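The least-squares construction described above can be sketched as follows. The sketch assumes a polynomial fit of order 2 and edge padding at the ends of the spectrum; neither choice is stated in the text, and the function names are hypothetical. The returned weights play the role of the normalized C_i.

```python
import numpy as np

def savgol_weights(m, polyorder=2):
    """Weights w_i, i in [-m, m], so that sum_i w_i * x[j+i] is the fitted value at the window centre."""
    offsets = np.arange(-m, m + 1)
    # Design matrix of the polynomial fit: columns are offsets**0, offsets**1, ...
    A = np.vander(offsets, polyorder + 1, increasing=True).astype(float)
    # Row 0 of the pseudo-inverse maps window values to the constant term,
    # which is the value of the fitted polynomial at offset 0.
    return np.linalg.pinv(A)[0]

def savgol_smooth(x, m=2, polyorder=2):
    """Smooth a 1-D signal with a window of length 2m + 1 (edges padded with the end values)."""
    x = np.asarray(x, dtype=float)
    w = savgol_weights(m, polyorder)
    padded = np.pad(x, m, mode="edge")
    # Reversing w turns the convolution into the sliding weighted sum over x[j+i].
    return np.convolve(padded, w[::-1], mode="valid")

t = np.arange(20, dtype=float)
quadratic = 0.5 * t**2 - 3.0 * t + 1.0
smoothed = savgol_smooth(quadratic, m=2, polyorder=2)
```

For m = 2 and order 2 the weights are the classic (-3, 12, 17, 12, -3)/35, and a signal that is exactly quadratic passes through unchanged away from the padded edges.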
Further, in the step of fitting a training data model, the data are divided into a training set and a test set, for example, 2016-. The partial least squares method is then run on the training set, and finally the contribution rates corresponding to different factors are output, where the number of factors is m, with a value between 1 and M.
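A minimal sketch of this fitting step follows, assuming a single dependent variable per model (PLS1 by the NIPALS procedure), fitted once for each of the 4 dependent variables. The text does not specify the PLS variant; the "contribution rate" is read here as the fraction of variance each factor explains. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def pls1_fit(X, y, n_factors):
    """NIPALS PLS1: return coefficients, centring terms, and per-factor contribution rates."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean                  # centred residual matrices
    total_x, total_y = (E ** 2).sum(), (f ** 2).sum()
    W, P, q, x_contrib, y_contrib = [], [], [], [], []
    for _ in range(n_factors):
        w = E.T @ f
        w /= np.linalg.norm(w)                     # weight vector
        t = E @ w                                  # scores of this factor
        tt = t @ t
        p = E.T @ t / tt                           # X loadings
        c = f @ t / tt                             # y loading
        E = E - np.outer(t, p)                     # deflate X
        f = f - c * t                              # deflate y
        W.append(w); P.append(p); q.append(c)
        x_contrib.append(tt * (p @ p) / total_x)   # share of X variance explained
        y_contrib.append(c ** 2 * tt / total_y)    # share of y variance explained
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)         # regression coefficients
    return beta, x_mean, y_mean, np.array(x_contrib), np.array(y_contrib)

rng = np.random.default_rng(0)
latent = rng.normal(size=(40, 3))                  # synthetic data with 3 hidden factors
X = latent @ rng.normal(size=(3, 125)) + 0.05 * rng.normal(size=(40, 125))
y = latent @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.normal(size=40)
beta, x_mean, y_mean, x_contrib, y_contrib = pls1_fit(X, y, n_factors=5)
cumulative_y = np.cumsum(y_contrib)
fitted = (X - x_mean) @ beta + y_mean
r2 = 1.0 - ((y - fitted) ** 2).sum() / ((y - y.mean()) ** 2).sum()
```

The cumulative contribution of y equals the in-sample R² and never decreases as factors are added, which is why the contribution rate alone cannot indicate when to stop adding factors.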
Further, in the step of cross validation, the data in the training set are divided into a cross training set containing n-1 data and a cross test set containing only 1 datum, and a loop statement is then run to output the training set error when the number of factors is 1, as follows. With the number of factors m = 1 and n = 1, the first datum x_1 is the cross test set and the remaining data are the cross training set; a model is fitted on the cross training set; this model is used to predict the dependent variable value of the cross test set x_1, and the predicted value is output; then n = n + 1. The training set error when the number of factors is 1 is then calculated as
Figure BDA0002968625150000051
The loop statement is then rerun with the number of factors m = 2, ..., M, and the prediction errors corresponding to the different numbers of factors are output. Finally, a scatter diagram of the RMSEP is drawn, and the number of factors with the minimum prediction error is found and used for the independent variables with which the model is built.
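The leave-one-out loop described above can be sketched as follows, assuming a PLS1 model per dependent variable and a root-mean-square error over the held-out predictions. The helper names, the synthetic data, and the upper limit of 6 factors are hypothetical; the scatter diagram is omitted and the error values are simply collected in a dictionary.

```python
import numpy as np

def pls1_coef(X, y, n_factors):
    """NIPALS PLS1 on already-centred data; returns the regression coefficients."""
    E, f = X.copy(), y.copy()
    W, P, q = [], [], []
    for _ in range(n_factors):
        w = E.T @ f
        w /= np.linalg.norm(w)
        t = E @ w
        tt = t @ t
        p, c = E.T @ t / tt, f @ t / tt
        E, f = E - np.outer(t, p), f - c * t       # deflate X and y
        W.append(w); P.append(p); q.append(c)
    W, P = np.array(W).T, np.array(P).T
    return W @ np.linalg.solve(P.T @ W, np.array(q))

def loo_rmse(X, y, n_factors):
    """Leave-one-out error: each sample is predicted by a model fitted on the other n-1."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        train = np.arange(n) != i                  # cross training set: all but sample i
        xm, ym = X[train].mean(axis=0), y[train].mean()
        beta = pls1_coef(X[train] - xm, y[train] - ym, n_factors)
        preds[i] = (X[i] - xm) @ beta + ym         # prediction for the cross test sample
    return float(np.sqrt(np.mean((y - preds) ** 2)))

rng = np.random.default_rng(1)
latent = rng.normal(size=(30, 3))
X = latent @ rng.normal(size=(3, 20)) + 0.05 * rng.normal(size=(30, 20))
y = latent @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.normal(size=30)

max_factors = 6                                    # plays the role of M
errors = {m: loo_rmse(X, y, m) for m in range(1, max_factors + 1)}
best_m = min(errors, key=errors.get)               # number of factors with the smallest error
```

Centring is recomputed inside every fold so that the held-out sample never influences the model that predicts it.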
Further, in the step of cross validation, the data in the training set are divided into a cross training set containing the data of 44 manufacturers and a cross test set containing the data of only 1 manufacturer, and a loop statement is then run to output the training set error when the number of factors is 1, as follows. With the number of factors m = 1, the manufacturers are numbered X_1, X_2, ..., X_45; the data of X_1 are set as the cross test set and the remaining data as the cross training set; a model is fitted on the cross training set; this model is used to predict the dependent variable values of the cross test set, and the predicted values are output; the cross test set is then changed to X_2, ..., X_45 in turn, and steps (5b1)-(5b3) are run iteratively. The training set error when the number of factors is 1 is then calculated as
Figure BDA0002968625150000052
The loop statement is then rerun with the number of factors m = 2, ..., M, and the prediction errors corresponding to the different numbers of factors are output. Finally, a scatter diagram of the RMSEP is drawn, and the number of factors with the minimum prediction error is found and used for the independent variables with which the model is built.
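The manufacturer-wise variant differs from leave-one-out only in how the folds are formed. The sketch below assumes a PLS1 model per dependent variable and pools the squared errors of all held-out manufacturers before taking the root. The helper names, the two samples per manufacturer, and the synthetic data are hypothetical.

```python
import numpy as np

def pls1_coef(X, y, n_factors):
    """NIPALS PLS1 on already-centred data; returns the regression coefficients."""
    E, f = X.copy(), y.copy()
    W, P, q = [], [], []
    for _ in range(n_factors):
        w = E.T @ f
        w /= np.linalg.norm(w)
        t = E @ w
        tt = t @ t
        p, c = E.T @ t / tt, f @ t / tt
        E, f = E - np.outer(t, p), f - c * t       # deflate X and y
        W.append(w); P.append(p); q.append(c)
    W, P = np.array(W).T, np.array(P).T
    return W @ np.linalg.solve(P.T @ W, np.array(q))

def leave_one_group_out_rmse(X, y, groups, n_factors):
    """Hold out all samples of one manufacturer at a time and pool the squared errors."""
    squared_errors = []
    for g in np.unique(groups):
        test = groups == g                         # cross test set: one manufacturer
        xm, ym = X[~test].mean(axis=0), y[~test].mean()
        beta = pls1_coef(X[~test] - xm, y[~test] - ym, n_factors)
        preds = (X[test] - xm) @ beta + ym
        squared_errors.extend((preds - y[test]) ** 2)
    return float(np.sqrt(np.mean(squared_errors)))

rng = np.random.default_rng(2)
groups = np.repeat(np.arange(45), 2)               # 45 manufacturers, 2 synthetic samples each
latent = rng.normal(size=(90, 3))
X = latent @ rng.normal(size=(3, 20)) + 0.05 * rng.normal(size=(90, 20))
y = latent @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.normal(size=90)

errors = {m: leave_one_group_out_rmse(X, y, groups, m) for m in range(1, 6)}
best_m = min(errors, key=errors.get)
```

Holding out whole manufacturers keeps samples from the same source out of both sides of a fold, so the error estimate reflects prediction for an unseen manufacturer rather than for an unseen sample from a known one.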
Further, in the step of outputting a model, as the example conclusion in the figure shows, the error of the model finally obtained by cross-validation modeling is lower than that of the model built on the 80% principle. Moreover, leave-one-out cross validation is not the more accurate choice; multi-fold cross validation is, that is, the model built after cross validation with the data divided by manufacturer has a lower prediction error.
Although only the preferred embodiments of the present invention have been described in detail, the present invention is not limited to the above embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art, and all changes are encompassed in the scope of the present invention.

Claims (6)

1. A high-dimensional data dimension reduction cross-validation analysis method based on application in NIR data, characterized by comprising the following steps:
S100, detecting data abnormal values: detecting local abnormal values with an unsupervised-learning abnormal value detection method, removing possible abnormal values, and correcting the data;
S200, preprocessing data: correcting the correlation and the continuously rising trend among the data with a multiplicative scattering correction method, and then smoothing the data with an SG-filter;
S300, fitting a training data model: running the partial least squares method and outputting the contribution rates corresponding to different factors and other related indexes;
S400, cross validation: running cross validation and selecting an appropriate number of factors for each of the 4 dependent variables;
S500, calculating a prediction error: predicting the 2018 data with the trained model, calculating the error RMSEP, and giving an analysis result;
S600, outputting a model: outputting the model with the lowest error.
2. The high-dimensional data dimension reduction cross-validation analysis method based on application in NIR data as claimed in claim 1, wherein: in the S200 step of preprocessing data, the multiplicative scattering correction method is used to address the high internal correlation, and the method of Martens is adopted to rotate the spectral data so that they lie close to the mean.
3. The high-dimensional data dimension reduction cross-validation analysis method based on application in NIR data as claimed in claim 2, wherein: in the S200 step of preprocessing data, the SG-filter method is used to smooth the data; assuming that x_j is the central value of the smoothing window, that the length of the smoothing window is 2m + 1, and that i ∈ [-m, m], x_j is calculated by the formula
Figure FDA0002968625140000011
4. The high-dimensional data dimension reduction cross-validation analysis method based on application in NIR data as claimed in claim 3, wherein: in the S300 step of fitting a training data model, the data are divided into a training set and a test set, the partial least squares method is then run on the training set, and finally the contribution rates corresponding to different factors are output.
5. The method of claim 4, wherein: in the S400 step of cross validation, the data in the training set are divided into a cross training set containing n-1 data and a cross test set containing only 1 datum, and a loop statement is then run to output the training set error when the number of factors is 1, which is calculated as
Figure FDA0002968625140000021
The loop statement is then rerun with the number of factors m = 2, ..., M, and the prediction errors corresponding to the different numbers of factors are output. Finally, a scatter diagram of the RMSEP is drawn, and the number of factors with the minimum prediction error is found and used for the independent variables with which the model is built.
6. The method of claim 5, wherein: in the S400 step of cross validation, the data in the training set are divided into a cross training set containing the data of 44 manufacturers and a cross test set containing the data of only 1 manufacturer, and a loop statement is then run to output the training set error when the number of factors is 1, which is calculated as
Figure FDA0002968625140000022
The loop statement is then rerun with the number of factors m = 2, ..., M, and the prediction errors corresponding to the different numbers of factors are output. Finally, a scatter diagram of the RMSEP is drawn, and the number of factors with the minimum prediction error is found and used for the independent variables with which the model is built.
CN202110257625.6A 2021-03-09 2021-03-09 High-dimensional data dimension reduction cross validation analysis method based on application in NIR data Pending CN113159114A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110257625.6A CN113159114A (en) 2021-03-09 2021-03-09 High-dimensional data dimension reduction cross validation analysis method based on application in NIR data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110257625.6A CN113159114A (en) 2021-03-09 2021-03-09 High-dimensional data dimension reduction cross validation analysis method based on application in NIR data

Publications (1)

Publication Number Publication Date
CN113159114A true CN113159114A (en) 2021-07-23

Family

ID=76886758

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110257625.6A Pending CN113159114A (en) 2021-03-09 2021-03-09 High-dimensional data dimension reduction cross validation analysis method based on application in NIR data

Country Status (1)

Country Link
CN (1) CN113159114A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114117763A (en) * 2021-11-18 2022-03-01 浙江理工大学 Model screening method based on shift-cross verification method
CN114117763B (en) * 2021-11-18 2024-05-03 浙江理工大学 Model screening method based on shift-one cross-validation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination