CN113159114A

CN113159114A - High-dimensional data dimension reduction cross validation analysis method based on application in NIR data

Info

Publication number: CN113159114A
Application number: CN202110257625.6A
Authority: CN
Inventors: 潘晓光; 潘柠; 焦璐璐; 张娜; 陈亮
Original assignee: Shanxi Sanyouhe Smart Information Technology Co Ltd
Current assignee: Shanxi Sanyouhe Smart Information Technology Co Ltd
Priority date: 2021-03-09
Filing date: 2021-03-09
Publication date: 2021-07-23

Abstract

The invention belongs to the technical field of NIR data processing, and particularly relates to a high-dimensional data dimension reduction cross validation analysis method based on NIR data, which comprises the following steps: detecting data abnormal values, preprocessing data, fitting a training data model, cross validation, calculating a prediction error, and outputting a model. In detecting data abnormal values, an unsupervised-learning abnormal value detection method is used to detect local abnormal values, possible abnormal values are removed, and the data are corrected. In preprocessing data, a multiplicative scattering correction method is used to correct the correlation and the rising trend among the data, and the data are then smoothed with an SG-filter. In fitting the training data model, the partial least squares method is run and related indexes, such as the contribution rates corresponding to different numbers of factors, are output. In cross validation, cross validation is run and an appropriate number of factors is selected for each of the 4 dependent variables. In calculating the prediction error, the trained model is used to predict the 2018 data, the error RMSEP is calculated, and an analysis result is given. In outputting the model, the model with the lowest error is output.

Description

High-dimensional data dimension reduction cross validation analysis method based on application in NIR data

Technical Field

The invention belongs to the technical field of NIR data processing, and particularly relates to a high-dimensional data dimension reduction cross-validation analysis method based on NIR data.

Background

At present, NIR spectral data are high-dimensional data: NIR data generally have too many variables relative to a small number of samples, and because the variables are highly correlated, a model rarely gives a good prediction result. To solve this problem, researchers need methods that reduce the dimension and the data volume reasonably and scientifically.

Cause of problems or defects: existing high-dimensional data dimension reduction methods mainly extract useful information from the original data and then reduce the number of variables; the main ones are principal component analysis and partial least squares. When partial least squares is used, however, a main problem is how to select the number of factors so that the model fits best with a small error. Existing research proposes selecting the number of factors whose contribution rate exceeds 80%. However, the contribution rate grows as the number of factors grows, so a more scientific basis is needed for determining the number of factors reasonably.

Disclosure of Invention

Aiming at the problems that the number of factors of the model cannot be reasonably determined and that the prediction result is poor, the invention provides a method that selects a reasonable number of factors, minimizes the data error, and better represents the data characteristics.

In order to solve the technical problems, the invention adopts the technical scheme that:

A high-dimensional data dimension reduction cross validation analysis method based on application in NIR data comprises the following steps:

S100, detecting data abnormal values: an unsupervised-learning abnormal value detection method is used to detect local abnormal values, possible abnormal values are removed, and the data are corrected;

S200, preprocessing data: a multiplicative scattering correction method is used to correct the correlation and the continuously rising trend among the data, and the data are then smoothed with an SG-filter;

S300, fitting a training data model: the partial least squares method is run, and the contribution rates and other related indexes corresponding to different numbers of factors are output;

S400, cross validation: cross validation is run, and an appropriate number of factors is selected for each of the 4 dependent variables;

S500, calculating a prediction error: the trained model is used to predict the 2018 data, the error RMSEP is calculated, and an analysis result is given;

S600, outputting a model: the model with the lowest error is output.

In the preprocessing of data, a multiplicative scattering correction method is used to address the high internal correlation: following Martens, the spectral data are rotated so that they lie close to the mean value.

In the preprocessing of data, the SG-filter method is used to smooth the data. Assuming that x_j is the central value of the smoothing window, that the length of the smoothing window is 2m + 1, and that i ∈ [-m, m], the smoothed value of x_j is calculated from the points x_(j+i) in the window.

In fitting the training data model, the data are divided into a training set and a test set, the partial least squares method is then run on the training set, and finally the contribution rates corresponding to different numbers of factors are output.

In the cross validation, the data in the training set are divided into a cross training set containing n-1 data and a cross test set containing only 1 datum, a loop statement is run, and the training set error when the number of factors is 1 is calculated and output.

The loop statement is then run again with the number of factors m = 2, …, M, and the prediction error corresponding to each number of factors is output. Finally, a scatter diagram of the RMSEP is drawn, and the number of factors with the minimum prediction error is used to build the model.

In the cross validation, the data in the training set are divided into a cross training set containing the data of 44 manufacturers and a cross test set containing the data of only 1 manufacturer, a loop statement is run, and the training set error when the number of factors is 1 is calculated and output.

The loop statement is then run again with the number of factors m = 2, …, M, and the prediction error corresponding to each number of factors is output. Finally, a scatter diagram of the RMSEP is drawn, and the number of factors with the minimum prediction error is used to build the model.
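For reference, the root mean square error of prediction is conventionally defined as below, and the selection rule described above amounts to taking the number of factors that minimizes it. The symbols are assumptions added for illustration and are not taken from the patent: y_i is the observed value of held-out sample i, \hat{y}_i^{(m)} is its prediction from a model with m factors, n is the number of held-out predictions, and m^{*} is the selected number of factors.

```latex
\mathrm{RMSEP}(m) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i^{(m)}\bigr)^{2}},
\qquad
m^{*} = \arg\min_{1 \le m \le M} \mathrm{RMSEP}(m)
```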

Drawings

FIG. 1 is a system flow diagram of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

A high-dimensional data dimension reduction cross-validation analysis method based on NIR data, as shown in fig. 1, includes the following steps:

S600, outputting a model: the model with the lowest error is output.

Further, in the step of detecting data abnormal values, the given NIR high-dimensional spectral data are the 2016-2018 high-dimensional spectral data provided by 45 manufacturers, and the 125 independent variables of the spectral data are used to measure 4 groups of dependent oil variables. The 2016 and 2017 data are taken as training data, and the 2018 data are taken as test data. The data set size is n, there are 4 Y variables and 125 independent variables, and the number of partial least squares factors is m, where m ranges from 1 to M. In high-dimensional data, especially high-dimensional spectral data, the data are highly correlated and usually contain many variables, and the noise and fluctuation of the data under different spectra are extremely large. Abnormal values therefore need to be detected, and the data corrected, before the model is built.
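The patent does not name the detector. A minimal sketch, assuming that the unsupervised detection of "local abnormal values" is something like the Local Outlier Factor, might look as follows. The function name, the neighbour count k, the threshold 1.5, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np

def local_outlier_factor(X, k=5):
    """Return one LOF score per row of X; scores well above 1 suggest an outlier."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Pairwise Euclidean distances; a point is never its own neighbour.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)
    # Indices of the k nearest neighbours and the k-distance of every point.
    nn = np.argsort(dist, axis=1)[:, :k]
    k_dist = dist[np.arange(n), nn[:, -1]]
    # Reachability distance of p from neighbour o: max(k_dist(o), d(p, o)).
    reach = np.maximum(k_dist[nn], dist[np.arange(n)[:, None], nn])
    # Local reachability density: inverse of the mean reachability distance.
    lrd = 1.0 / reach.mean(axis=1)
    # LOF: mean density of the neighbours relative to the point's own density.
    return lrd[nn].mean(axis=1) / lrd

rng = np.random.default_rng(0)
spectra = rng.normal(size=(60, 125))   # synthetic stand-in for 125-variable spectra
spectra[0] += 8.0                      # one deliberately shifted sample
scores = local_outlier_factor(spectra, k=5)
keep = scores < 1.5                    # hypothetical threshold
cleaned = spectra[keep]
```

The threshold would normally be tuned on the data at hand; the sketch assumes no duplicate rows, since identical points give a zero reachability distance.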

Further, in the step of preprocessing data, a multiplicative scattering correction method is used to address the high internal correlation: following Martens, the spectral data are rotated so that they lie close to the mean value. In the reference formula, x_im is the i-th data value of the m-th variable (spectrum), the mean is that of the m-th spectrum, and the result is the data processed by the multiplicative scattering correction method.
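As an illustration, the common textbook form of multiplicative scatter correction regresses each spectrum on a reference (mean) spectrum and then removes the fitted offset and slope. The sketch below assumes that form; it is not necessarily the exact formula of the patent, and the function name and synthetic spectra are hypothetical.

```python
import numpy as np

def msc(spectra, reference=None):
    """Multiplicative scatter correction; rows are samples, columns are wavelengths."""
    spectra = np.asarray(spectra, dtype=float)
    ref = spectra.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(spectra)
    for i, row in enumerate(spectra):
        # Fit row ≈ intercept + slope * ref by least squares.
        slope, intercept = np.polyfit(ref, row, deg=1)
        # Remove the additive offset and the multiplicative scatter.
        corrected[i] = (row - intercept) / slope
    return corrected, ref

wavelengths = np.linspace(0.0, 1.0, 125)
base = np.sin(2 * np.pi * wavelengths) + 2.0      # synthetic underlying spectrum
raw = np.vstack([a + b * base for a, b in [(0.1, 0.9), (0.5, 1.2), (-0.2, 1.0)]])
corrected, ref = msc(raw)
```

Returning the reference spectrum lets the same reference be reused when correcting the test data, so that training and test spectra are mapped onto a common scale.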

Further, in the step of preprocessing data, the data are smoothed with the SG-filter method, because the data processed in the previous step still contain much noise. A simple way to remove data noise is derivative smoothing, and the SG-filter is a relatively mature method for smoothing data. Assume that x_j is the central value of the smoothing window, that the length of the smoothing window is 2m + 1, that i ∈ [-m, m], and that C_i is the derived weight of each x_(j+i); the smoothed value of x_j is then calculated from the points x_(j+i) in the window, weighted by C_i.

The smoothing method assigns different weights to the points in a smoothing window and fits the window with a least squares curve: for each smoothing window, the least squares curve that minimizes the error is found, and substituting the data into it gives an estimate of the middle point x_j.
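The least squares fit described above can be sketched directly. The example assumes a polynomial of order 2 as the fitted curve and leaves the m points at each edge unsmoothed; both are simplifying choices for illustration, and the function names are hypothetical.

```python
import numpy as np

def savgol_weights(m, polyorder):
    """Weights C_i, i = -m..m, of a least squares polynomial fit evaluated at the window centre."""
    i = np.arange(-m, m + 1, dtype=float)
    # Vandermonde matrix with columns i**0, i**1, ..., i**polyorder.
    A = np.vander(i, polyorder + 1, increasing=True)
    # Row 0 of the pseudo-inverse yields the fitted constant term, i.e. the curve value at i = 0.
    return np.linalg.pinv(A)[0]

def savgol_smooth(x, m=5, polyorder=2):
    """Smooth a 1-D signal; the m points at each edge are copied unchanged."""
    x = np.asarray(x, dtype=float)
    c = savgol_weights(m, polyorder)
    out = x.copy()
    for j in range(m, len(x) - m):
        # Weighted sum of the 2m + 1 points centred on x_j.
        out[j] = np.dot(c, x[j - m : j + m + 1])
    return out

t = np.linspace(0.0, 1.0, 125)
rng = np.random.default_rng(0)
noisy = np.sin(2 * np.pi * t) + 0.05 * rng.normal(size=t.size)
smoothed = savgol_smooth(noisy, m=5, polyorder=2)
```

Because the weights depend only on the window length and the polynomial order, they are computed once and reused for every window position.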

Further, in the step of fitting the training data model, the data are divided into a training set and a test set; for example, the 2016 and 2017 data are used as the training set and the 2018 data as the test set. The partial least squares method is then run on the training set, and finally the contribution rates corresponding to different numbers of factors are output, where the number of factors is m and takes values between 1 and M.
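A minimal sketch of this step, assuming a single dependent variable fitted with the NIPALS form of partial least squares (PLS1) and taking the "contribution rate" to be the cumulative share of X variance explained, is given below. The function name, the synthetic data, and that reading of "contribution rate" are assumptions for illustration.

```python
import numpy as np

def pls1_fit(X, y, n_factors):
    """NIPALS PLS1: returns coefficients, centring terms, and the cumulative
    share of X variance explained after each factor."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean
    total_ss = (E ** 2).sum()
    W, P, q, cumulative = [], [], [], []
    for _ in range(n_factors):
        w = E.T @ f
        w /= np.linalg.norm(w)          # weight vector
        t = E @ w                       # scores
        tt = t @ t
        p = E.T @ t / tt                # X loadings
        q_a = (f @ t) / tt              # y loading
        E = E - np.outer(t, p)          # deflate X
        f = f - q_a * t                 # deflate y
        W.append(w)
        P.append(p)
        q.append(q_a)
        cumulative.append(1.0 - (E ** 2).sum() / total_ss)
    W, P = np.array(W).T, np.array(P).T
    beta = W @ np.linalg.solve(P.T @ W, np.array(q))
    return beta, x_mean, y_mean, np.array(cumulative)

rng = np.random.default_rng(0)
latent = rng.normal(size=(50, 3))
X_train = latent @ rng.normal(size=(3, 125)) + 0.05 * rng.normal(size=(50, 125))
y_train = latent @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.normal(size=50)

beta, x_mean, y_mean, contribution = pls1_fit(X_train, y_train, n_factors=6)
for m, rate in enumerate(contribution, start=1):
    print(f"factors={m}  cumulative contribution rate={rate:.3f}")
```

The printed rates never decrease as factors are added, which is why a fixed threshold such as 80% gives no natural stopping point on its own.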

Further, in the step of cross validation, the data in the training set are divided into a cross training set containing n-1 data and a cross test set containing only 1 datum, a loop statement is run, and the training set error when the number of factors is 1 is output, as follows. With the number of factors m = 1 and the index n = 1, the first datum x₁ is taken as the cross test set and the remaining data as the cross training set; a model is fitted on the cross training set; the model is used to predict the dependent variable value of the cross test set x₁, and the predicted value is output; then n = n + 1 and the steps are repeated. The training set error when the number of factors is 1 is then calculated.
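The leave-one-out loop can be sketched as follows, repeated for each candidate number of factors so that the one with the smallest error is chosen. The sketch assumes a single dependent variable and a compact NIPALS PLS1 model; the function names, the synthetic data, and the upper limit of 8 factors are hypothetical.

```python
import numpy as np

def pls1_predict(X_tr, y_tr, X_te, n_factors):
    """Fit NIPALS PLS1 on the training rows and predict the test rows."""
    x_mean, y_mean = X_tr.mean(axis=0), y_tr.mean()
    E, f = X_tr - x_mean, y_tr - y_mean
    W, P, q = [], [], []
    for _ in range(n_factors):
        w = E.T @ f
        w /= np.linalg.norm(w)
        t = E @ w
        tt = t @ t
        p = E.T @ t / tt
        q_a = (f @ t) / tt
        E, f = E - np.outer(t, p), f - q_a * t
        W.append(w)
        P.append(p)
        q.append(q_a)
    W, P = np.array(W).T, np.array(P).T
    beta = W @ np.linalg.solve(P.T @ W, np.array(q))
    return (X_te - x_mean) @ beta + y_mean

def loo_rmse(X, y, n_factors):
    """Leave-one-out cross validation error for a given number of factors."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        train = np.arange(n) != i       # sample i is the cross test set
        preds[i] = pls1_predict(X[train], y[train], X[i:i + 1], n_factors)[0]
    return np.sqrt(np.mean((y - preds) ** 2))

rng = np.random.default_rng(1)
latent = rng.normal(size=(40, 3))
X = latent @ rng.normal(size=(3, 30)) + 0.05 * rng.normal(size=(40, 30))
y = latent @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.normal(size=40)

max_factors = 8                         # hypothetical upper limit M
errors = [loo_rmse(X, y, m) for m in range(1, max_factors + 1)]
best_m = int(np.argmin(errors)) + 1     # number of factors with the smallest error
```

With 4 dependent variables, the same loop would be run once per variable, giving each its own number of factors.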

Further, in the step of cross validation, the data in the training set are divided into a cross training set containing the data of 44 manufacturers and a cross test set containing the data of only 1 manufacturer, a loop statement is run, and the training set error when the number of factors is 1 is output, as follows. With the number of factors m = 1, the manufacturers are numbered X₁, X₂, …, X₄₅. (5b1) X₁ is set as the cross test set, and the remaining data are taken as the cross training set; (5b2) a model is fitted on the cross training set; (5b3) the model is used to predict the dependent variable values of the cross test set, and the predicted values are output. The cross test set is then set to X₂, …, X₄₅ in turn, and steps (5b1)-(5b3) are run iteratively. The training set error when the number of factors is 1 is then calculated.
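The manufacturer-wise split can be sketched independently of the model. In the example below the splitting logic follows the description, while the model is a placeholder: ordinary least squares stands in for partial least squares to keep the example short. The function names, the group sizes, and the synthetic data are hypothetical.

```python
import numpy as np

def leave_one_group_out(groups):
    """Yield (train_mask, test_mask) pairs, holding out one group label at a time."""
    groups = np.asarray(groups)
    for g in np.unique(groups):
        test = groups == g
        yield ~test, test

def grouped_cv_rmse(X, y, groups, fit_predict):
    """RMSE over all held-out predictions; fit_predict(X_tr, y_tr, X_te) returns y_hat."""
    preds = np.empty(len(y), dtype=float)
    for train, test in leave_one_group_out(groups):
        preds[test] = fit_predict(X[train], y[train], X[test])
    return np.sqrt(np.mean((y - preds) ** 2))

def least_squares_predict(X_tr, y_tr, X_te):
    """Placeholder model (least squares with intercept), standing in for PLS."""
    A = np.column_stack([np.ones(len(X_tr)), X_tr])
    coef, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    return np.column_stack([np.ones(len(X_te)), X_te]) @ coef

rng = np.random.default_rng(2)
manufacturer = np.repeat(np.arange(45), 4)   # 45 hypothetical manufacturers, 4 samples each
X = rng.normal(size=(180, 5))
y = X @ np.array([1.0, 0.5, -1.0, 0.0, 2.0]) + 0.1 * rng.normal(size=180)

rmse = grouped_cv_rmse(X, y, manufacturer, least_squares_predict)
n_folds = sum(1 for _ in leave_one_group_out(manufacturer))
```

Holding out whole manufacturers keeps samples from the same source out of both sides of a split, so the error reflects prediction for a manufacturer the model has not seen.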

Further, in the step of outputting the model, as the example conclusion in the figure shows, the error obtained by the final cross validation modeling is lower than that of the model established by the 80% principle. Moreover, leave-one-out cross validation is not the more accurate of the two: the model established after cross validation with the data divided according to manufacturer has the lower prediction error.

Although only the preferred embodiments of the present invention have been described in detail, the present invention is not limited to the above embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art, and all changes are encompassed in the scope of the present invention.

Claims

1. A high-dimensional data dimension reduction cross validation analysis method based on application in NIR data, characterized by comprising the following steps:

S600, outputting a model: the model with the lowest error is output.

2. The method for high-dimensional data dimension reduction cross-validation analysis based on NIR data as claimed in claim 1, wherein: in the S200 preprocessing of data, a multiplicative scattering correction method is used to address the high internal correlation, and, following Martens, the spectral data are rotated so that they lie close to the mean value.

3. The method for high-dimensional data dimension reduction cross-validation analysis based on NIR data as claimed in claim 2, wherein: in the S200 preprocessing of data, the SG-filter method is used to smooth the data; assuming that x_j is the central value of the smoothing window, that the length of the smoothing window is 2m + 1, and that i ∈ [-m, m], the smoothed value of x_j is calculated from the points x_(j+i) in the window.

4. The method for high-dimensional data dimension reduction cross-validation analysis based on NIR data as claimed in claim 3, wherein: in the S300 fitting of the training data model, the data are divided into a training set and a test set, the partial least squares method is then run on the training set, and finally the contribution rates corresponding to different numbers of factors are output.

5. The method of claim 4, wherein: in the S400 cross validation, the data in the training set are divided into a cross training set containing n-1 data and a cross test set containing only 1 datum, a loop statement is run, and the training set error when the number of factors is 1 is calculated and output.

6. The method of claim 5, wherein: in the S400 cross validation, the data in the training set are divided into a cross training set containing the data of 44 manufacturers and a cross test set containing the data of only 1 manufacturer, a loop statement is run, and the training set error when the number of factors is 1 is calculated and output.