CN111427866A - Modeling variable selection method based on correlation and principal component analysis - Google Patents
- Publication number
- CN111427866A CN111427866A CN202010234827.4A CN202010234827A CN111427866A CN 111427866 A CN111427866 A CN 111427866A CN 202010234827 A CN202010234827 A CN 202010234827A CN 111427866 A CN111427866 A CN 111427866A
- Authority
- CN
- China
- Prior art keywords
- variables
- variable
- information
- component analysis
- modeling
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/211—Schema design and management
- G06F16/212—Schema design and management with details for data modelling support
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/2135—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
Abstract
The invention relates to a modeling variable selection method based on correlation and principal component analysis. The method uses information entropy theory to calculate correlation information coefficients among the influencing factors and eliminate redundant variables, then uses principal component analysis to extract the principal components of the remaining variables and reduce the number of modeling variables. It comprises the following steps: step 1, calculating the correlation information coefficient between each candidate variable and the target variable using information entropy theory; step 2, rejecting candidate variables with small correlation information coefficients; step 3, extracting the principal components of the remaining candidate variables by principal component analysis; and step 4, obtaining the final modeling variables. The method reduces the number of modeling variables while retaining as much of their information as possible, which aids in establishing a model of the object.
Description
Technical Field
The invention relates to the technical field of data-driven modeling, in particular to a modeling variable selection method based on correlation and principal component analysis.
Background
In recent years, with the continuous development of computer and database technologies, data-driven modeling has attracted increasing attention. However, because of the complexity of the modeled object, the candidate variables potentially related to the target variable are often numerous: some are correlated or coupled with one another, some do not reflect the system output, and some carry noise, irrelevant information, or redundancy. Bringing every possible candidate variable into the model greatly increases the modeling time and the uncertainty of the model, weakens the model's generalization ability, and reduces its accuracy. The candidate variables therefore need to be screened in order to reduce the computational burden and improve the performance of the resulting model.
Disclosure of Invention
The invention aims to provide a modeling variable selection method based on correlation and principal component analysis that reduces the number of modeling variables while retaining as much of their information as possible, thereby aiding the establishment of an object model.
To solve the above technical problem, the invention adopts the following technical scheme:
A modeling variable selection method based on correlation and principal component analysis, characterized in that: the method calculates the correlation information coefficients among the influencing factors by using information entropy theory, eliminates redundant variables, and then extracts the principal components of the remaining variables by principal component analysis to reduce the number of modeling variables; the specific steps are as follows:
step 1, calculating the correlation information coefficient between each candidate variable and the target variable by using information entropy theory;
step 2, rejecting candidate variables with small correlation information coefficients;
step 3, extracting the principal components of the remaining candidate variables by principal component analysis;
step 4, obtaining the final modeling variables.
In the step 1, the value range of the candidate variable X is {x1, x2, …, xn} with corresponding probability distribution {p(x1), p(x2), …, p(xn)}, and its information entropy H(X) is defined as

H(X) = −Σi p(xi) log p(xi)    (1)
the information entropy is a measure of uncertainty, and the smaller the uncertainty is, the smaller the information entropy is;
between the candidate variable X and the target variable Y, when the event Y = yj occurs, the amount of information about xi obtained from yj, denoted I(xi; yj), is defined as:

I(xi; yj) = log [ p(xi | yj) / p(xi) ]    (2)
the average mutual information I (X; Y) between X and Y is defined as
In the formulas: p(xi | yj) denotes the conditional probability and p(xi, yj) the joint probability; the average mutual information represents the amount of information shared between the two random variables;
the correlation information coefficient between the candidate variable X and the target variable Y can then be obtained from equations (1) and (3):

IR = I(X; Y) / √(H(X) · H(Y))    (4)
According to the definition of the correlation information coefficient, 0 ≤ IR ≤ 1, and the stronger the association between X and Y, the larger IR.
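As a concrete illustration, the step-1 quantities can be computed for discrete (or pre-discretized) samples. The following is a minimal Python sketch; the function names are illustrative, and normalizing IR by √(H(X)·H(Y)) is an assumption, since the exact form of the patent's equation (4) is not reproduced in this text:

```python
import numpy as np
from collections import Counter

def entropy(x):
    """Shannon entropy H(X) of a discrete sample, eq. (1)."""
    n = len(x)
    return -sum((c / n) * np.log2(c / n) for c in Counter(x).values())

def mutual_information(x, y):
    """Average mutual information I(X; Y), eq. (3), estimated from the
    empirical joint and marginal distributions."""
    n = len(x)
    pxy = Counter(zip(x, y))          # joint counts of (xi, yj) pairs
    px, py = Counter(x), Counter(y)   # marginal counts
    mi = 0.0
    for (xi, yi), c in pxy.items():
        p_joint = c / n
        mi += p_joint * np.log2(p_joint / ((px[xi] / n) * (py[yi] / n)))
    return mi

def correlation_info_coefficient(x, y):
    """Correlation information coefficient IR in [0, 1]; the normalization
    by sqrt(H(X) * H(Y)) is an assumption, not taken from the patent."""
    hx, hy = entropy(x), entropy(y)
    if hx == 0.0 or hy == 0.0:
        return 0.0
    return mutual_information(x, y) / np.sqrt(hx * hy)
```

With this normalization, IR equals 1 when X determines Y exactly and 0 when the two are independent, matching the stated range 0 ≤ IR ≤ 1; step 2 then reduces to keeping only candidates whose coefficient reaches the threshold It.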
In the step 2, a correlation information coefficient threshold It is set; if the correlation information coefficient IR between a candidate variable X and the target variable Y calculated in step 1 satisfies IR < It, that candidate variable is eliminated.
In the step 3, the specific steps for extracting the principal components of the remaining candidate variables by principal component analysis are as follows:
step 3.1, forming a sample matrix Z from the candidate variables that satisfy the threshold condition, wherein each column is an observation sample z and each row represents one variable dimension;
step 3.2, subtracting the column mean from each column of the original matrix to generate the standardized matrix Zb;
step 3.3, calculating the covariance matrix Zc of the standardized matrix Zb;
step 3.4, calculating the eigenvalues λi of the covariance matrix Zc and the corresponding eigenvectors ui, wherein i = 1, 2, …, n and n is the number of eigenvalues of Zc;
step 3.5, arranging the eigenvalues in descending order and calculating the cumulative contribution rate of the first m principal components according to formula (5):

η(m) = (Σ(i=1..m) λi) / (Σ(i=1..n) λi)    (5)

in the formula: m = 1, 2, …, n − 1;
step 3.6, constructing the transformation matrix T from the eigenvectors corresponding to the first k largest eigenvalues, wherein η(k) > 85% is required and k ≤ m;
T=(u1,u2,…,uk);
step 3.7, calculating Zk, the projection of the standardized matrix Zb onto the transformation matrix T, to obtain the first k principal components and achieve dimensionality reduction; Zk is the principal-component extraction of the remaining candidate variables.
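The steps above can be sketched in NumPy. For convenience this sketch stores observation samples in rows and candidate variables in columns (an orientation assumption; the text stores samples in columns), and the final projection of the centred data onto T is the standard principal-component transform:

```python
import numpy as np

def pca_extract(Z, eta_min=0.85):
    """Steps 3.1-3.7: principal-component extraction.
    Rows of Z are observation samples, columns are candidate variables."""
    # Step 3.2: centre each variable to get the standardized matrix Zb
    Zb = Z - Z.mean(axis=0)
    # Step 3.3: covariance matrix Zc of the standardized data
    Zc = np.cov(Zb, rowvar=False)
    # Step 3.4: eigenvalues and eigenvectors of Zc (eigh: Zc is symmetric)
    vals, vecs = np.linalg.eigh(Zc)
    # Step 3.5: sort eigenvalues in descending order, cumulative rate, eq. (5)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    eta = np.cumsum(vals) / vals.sum()
    # Step 3.6: keep the first k components with eta(k) above the threshold
    k = int(np.searchsorted(eta, eta_min) + 1)
    T = vecs[:, :k]                    # transformation matrix T = (u1, ..., uk)
    # Step 3.7: project onto T to obtain the first k principal components Zk
    Zk = Zb @ T
    return Zk, T, eta
```

For example, on a matrix whose two columns are perfectly correlated, the first principal component already carries 100 % of the variance, so k = 1 and Zk has a single column.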
The modeling variable selection method based on correlation and principal component analysis has the following beneficial effects: the method is based on mathematical statistics and reduces the reliance on understanding the physical relationship between the candidate variables and the target variable, so it is convenient to popularize.
Drawings
FIG. 1 is a flowchart of the method for selecting modeling variables based on correlation and principal component analysis according to the present invention.
Detailed Description
The invention is further described below with reference to the drawings and specific preferred embodiments.
A modeling variable selection method based on correlation and principal component analysis is characterized in that: the method comprises the following steps of calculating the related information coefficient among the influence factors by using an information entropy theory, eliminating redundant variables, extracting residual variable pivot elements by using a principal component analysis method, and reducing the number of modeling variables, wherein the specific steps are as follows:
Step 1, calculating the correlation information coefficient between each candidate variable and the target variable by using information entropy theory;
the value range of the candidate variable X is {x1, x2, …, xn} with corresponding probability distribution {p(x1), p(x2), …, p(xn)}, and its information entropy H(X) is defined as

H(X) = −Σi p(xi) log p(xi)    (1)
the information entropy is a measure of uncertainty, and the smaller the uncertainty is, the smaller the information entropy is;
between the candidate variable X and the target variable Y, when the event Y = yj occurs, the amount of information about xi obtained from yj, denoted I(xi; yj), is defined as:

I(xi; yj) = log [ p(xi | yj) / p(xi) ]    (2)
the average mutual information I(X; Y) between X and Y is defined as

I(X; Y) = Σi Σj p(xi, yj) log [ p(xi | yj) / p(xi) ]    (3)
In the formulas: p(xi | yj) denotes the conditional probability and p(xi, yj) the joint probability; the average mutual information represents the amount of information shared between the two random variables;
the correlation information coefficient between the candidate variable X and the target variable Y can then be obtained from equations (1) and (3):

IR = I(X; Y) / √(H(X) · H(Y))    (4)
According to the definition of the correlation information coefficient, 0 ≤ IR ≤ 1, and the stronger the association between X and Y, the larger IR.
Step 2, rejecting candidate variables with small correlation information coefficients;
a correlation information coefficient threshold It is set; if the correlation information coefficient IR between a candidate variable X and the target variable Y calculated in step 1 satisfies IR < It, that candidate variable is eliminated.
Step 3, extracting the principal components of the remaining candidate variables by principal component analysis.
Step 3.1, forming a sample matrix Z by the variables to be selected meeting the threshold condition, wherein each column is an observation sample Z, and each row represents one-dimensional data;
step 3.2, subtracting the column mean from each column of the original matrix to generate the standardized matrix Zb;
step 3.3, calculating the covariance matrix Zc of the standardized matrix Zb;
step 3.4, calculating the eigenvalues λi of the covariance matrix Zc and the corresponding eigenvectors ui, wherein i = 1, 2, …, n and n is the number of eigenvalues of Zc;
step 3.5, arranging the eigenvalues in descending order and calculating the cumulative contribution rate of the first m principal components according to formula (5):

η(m) = (Σ(i=1..m) λi) / (Σ(i=1..n) λi)    (5)

in the formula: m = 1, 2, …, n − 1;
step 3.6, constructing the transformation matrix T from the eigenvectors corresponding to the first k largest eigenvalues, wherein η(k) > 85% is required and k ≤ m;
T=(u1,u2,…,uk);
step 3.7, calculating Zk, the projection of the standardized matrix Zb onto the transformation matrix T, to obtain the first k principal components and achieve dimensionality reduction; Zk is the principal-component extraction of the remaining candidate variables.
Step 4, obtaining the final modeling variables.
Further, the method is illustrated by processing thermal-state test data of a thermal power generating unit's boiler with the modeling variable selection method based on correlation and principal component analysis; the specific implementation steps are as follows:
the modeling candidate variables obtained according to the thermal experimental data of the boiler comprise 28 physical quantities such as fuel quantity, air quantity, coal feeder opening degree, coal quality and the like, and the target variable is NOx emission concentration. The thermal test data are shown in table 1:
TABLE 1
As shown in Table 1, the thermal test data were collected under 12 working conditions, with the first 28 columns being the candidate variables and the last column being the target variable.
Step 1: calculating the correlation information coefficient between each candidate variable and the target variable using information entropy theory;
the calculation of the correlation coefficient matrix between the 28 candidate variables (R1-R28) and the target variable (R29) is shown in table 2:
TABLE 2
Step 2: eliminating candidate variables with small correlation information coefficients;
setting a related information coefficient threshold ItAnd (4) deleting nine candidate variable variables of R6, R10, R11, R17, R18, R19, R25, R27 and R28 according to the correlation calculation result in the step 1.
Step 3: extracting the principal components of the remaining candidate variables by principal component analysis.
Principal component analysis is performed on the remaining nineteen candidate variables. Principal components are extracted from the nineteen variables according to a cumulative contribution rate greater than 85%, and the final input and output variables are obtained as shown in Table 3.
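For illustration, the embodiment's pipeline can be sketched end to end. The sketch below substitutes synthetic random data for the boiler data of Table 1 (which is not reproduced here), a histogram-based mutual-information estimate for the correlation information coefficient, and a quantile-based choice of the threshold It; all three substitutions are assumptions made only so the sketch is runnable:

```python
import numpy as np

def mi_hist(x, y, bins=4):
    """Mutual information between two continuous series, estimated by
    binning -- a simple stand-in for the patent's entropy calculations."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float((pxy[nz] * np.log2(pxy[nz] / np.outer(px, py)[nz])).sum())

rng = np.random.default_rng(0)
# Synthetic stand-in for Table 1: 12 working conditions x 28 candidates
X = rng.normal(size=(12, 28))
y = 2.0 * X[:, 0] + X[:, 3] - 0.5 * X[:, 7] + rng.normal(scale=0.1, size=12)

# Steps 1-2: score each candidate against the target, drop those below It
scores = np.array([mi_hist(X[:, j], y) for j in range(X.shape[1])])
I_t = np.quantile(scores, 0.3)          # threshold choice is an assumption
X_kept = X[:, scores >= I_t]

# Step 3: PCA keeping > 85 % cumulative contribution rate, eq. (5)
Zb = X_kept - X_kept.mean(axis=0)
vals, vecs = np.linalg.eigh(np.cov(Zb, rowvar=False))
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]
k = int(np.searchsorted(np.cumsum(vals) / vals.sum(), 0.85) + 1)

# Step 4: final modeling variables (12 conditions x k principal components)
X_model = Zb @ vecs[:, :k]
```

The shape of X_model shows the reduction: the 28 original candidates shrink first by thresholding and then to k principal components, mirroring the 28-to-nineteen-to-principal-components path of the embodiment.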
TABLE 3
According to the data in Table 3, the error between the predicted and measured values of the model obtained by the modeling algorithm is shown in Table 4. It can be seen that a model built from the modeling input variables extracted by this method has high accuracy and fully meets the requirements of engineering application.
TABLE 4
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above-mentioned embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may be made by those skilled in the art without departing from the principle of the invention.
Claims (4)
1. A modeling variable selection method based on correlation and principal component analysis, characterized in that: the method calculates the correlation information coefficients among the influencing factors by using information entropy theory, eliminates redundant variables, and then extracts the principal components of the remaining variables by principal component analysis to reduce the number of modeling variables; the specific steps are as follows:
step 1, calculating the correlation information coefficient between each candidate variable and the target variable by using information entropy theory;
step 2, rejecting candidate variables with small correlation information coefficients;
step 3, extracting the principal components of the remaining candidate variables by principal component analysis;
step 4, obtaining the final modeling variables.
2. The method of claim 1, characterized in that: in the step 1, the value range of the candidate variable X is {x1, x2, …, xn} with corresponding probability distribution {p(x1), p(x2), …, p(xn)}, and its information entropy H(X) is defined as

H(X) = −Σi p(xi) log p(xi)    (1)
the information entropy is a measure of uncertainty, and the smaller the uncertainty is, the smaller the information entropy is;
between the candidate variable X and the target variable Y, when the event Y = yj occurs, the amount of information about xi obtained from yj, denoted I(xi; yj), is defined as

I(xi; yj) = log [ p(xi | yj) / p(xi) ]    (2)
The average mutual information I(X; Y) between X and Y is defined as

I(X; Y) = Σi Σj p(xi, yj) log [ p(xi | yj) / p(xi) ]    (3)
In the formulas: p(xi | yj) denotes the conditional probability and p(xi, yj) the joint probability; the average mutual information represents the amount of information shared between the two random variables;
the correlation information coefficient between the candidate variable X and the target variable Y can then be obtained from equations (1) and (3):

IR = I(X; Y) / √(H(X) · H(Y))    (4)
According to the definition of the correlation information coefficient, 0 ≤ IR ≤ 1, and the stronger the association between X and Y, the larger IR.
3. The method of claim 2, characterized in that: in the step 2, a correlation information coefficient threshold It is set; if the correlation information coefficient IR between a candidate variable X and the target variable Y calculated in step 1 satisfies IR < It, that candidate variable is eliminated.
4. A modeling variable selection method based on correlation and principal component analysis as claimed in claim 3, characterized in that: in the step 3, the specific steps of extracting the principal components of the remaining candidate variables by principal component analysis are as follows:
step 3.1, forming a sample matrix Z from the candidate variables that satisfy the threshold condition, wherein each column is an observation sample z and each row represents one variable dimension;
step 3.2, subtracting the column mean from each column of the original matrix to generate the standardized matrix Zb;
step 3.3, calculating the covariance matrix Zc of the standardized matrix Zb;
step 3.4, calculating the eigenvalues λi of the covariance matrix Zc and the corresponding eigenvectors ui, wherein i = 1, 2, …, n and n is the number of eigenvalues of Zc;
step 3.5, arranging the eigenvalues in descending order and calculating the cumulative contribution rate of the first m principal components according to formula (5):

η(m) = (Σ(i=1..m) λi) / (Σ(i=1..n) λi)    (5)

in the formula: m = 1, 2, …, n − 1;
step 3.6, constructing the transformation matrix T from the eigenvectors corresponding to the first k largest eigenvalues, wherein η(k) > 85% is required and k ≤ m;
T=(u1,u2,…,uk);
step 3.7, calculating Zk, the projection of the standardized matrix Zb onto the transformation matrix T, to obtain the first k principal components and achieve dimensionality reduction; Zk is the principal-component extraction of the remaining candidate variables.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010234827.4A CN111427866A (en) | 2020-03-30 | 2020-03-30 | Modeling variable selection method based on correlation and principal component analysis |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111427866A true CN111427866A (en) | 2020-07-17 |
Family
ID=71549133
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010234827.4A Withdrawn CN111427866A (en) | 2020-03-30 | 2020-03-30 | Modeling variable selection method based on correlation and principal component analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111427866A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117520824A (en) * | 2024-01-03 | 2024-02-06 | 浙江省白马湖实验室有限公司 | Information entropy-based distributed optical fiber data characteristic reconstruction method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | ||
Application publication date: 20200717 |