CN111427866A - Modeling variable selection method based on correlation and principal component analysis - Google Patents
- Publication number
- CN111427866A CN111427866A CN202010234827.4A CN202010234827A CN111427866A CN 111427866 A CN111427866 A CN 111427866A CN 202010234827 A CN202010234827 A CN 202010234827A CN 111427866 A CN111427866 A CN 111427866A
- Authority
- CN
- China
- Prior art keywords
- variables
- variable
- information
- component analysis
- modeling
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/211—Schema design and management
- G06F16/212—Schema design and management with details for data modelling support
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/2135—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
Abstract
The invention relates to a modeling variable selection method based on correlation and principal component analysis. The method uses information entropy theory to calculate correlation information coefficients among the influencing factors and eliminate redundant variables, then uses principal component analysis to extract the principal components of the remaining variables and reduce the number of modeling variables. It comprises the following steps: step 1, calculating the correlation information coefficient between each candidate variable and the target variable using information entropy theory; step 2, rejecting candidate variables with small correlation information coefficients; step 3, extracting the principal components of the remaining candidate variables by principal component analysis; and step 4, obtaining the final modeling variables. The method reduces the number of modeling variables while retaining as much of their information as possible, which aids in establishing a model of the object.
Description
Technical Field
The invention relates to the technical field of data-driven modeling, in particular to a modeling variable selection method based on correlation and principal component analysis.
Background
In recent years, with the continuous development of computer and database technologies, data-driven modeling has attracted increasing attention. However, because of the complexity of the modeled object, the candidate variables potentially related to the target variable are often numerous: some are correlated or coupled with one another, some do not reflect the system output, and some carry noise, irrelevant information, or redundancy. Bringing every possible candidate variable into the model greatly increases the modeling time and the uncertainty of the model, weakens the model's generalization ability, and reduces its accuracy. The candidate variables therefore need to be screened in order to reduce the computational burden and improve the performance of the resulting model.
Disclosure of Invention
The invention aims to provide a modeling variable selection method based on correlation and principal component analysis that reduces the number of modeling variables while retaining as much of their information as possible, thereby aiding the establishment of an object model.
To solve the above technical problem, the invention adopts the following technical scheme:
A modeling variable selection method based on correlation and principal component analysis, characterized in that: the method calculates the correlation information coefficients among the influencing factors by using information entropy theory, eliminates redundant variables, and then extracts the principal components of the remaining variables by principal component analysis to reduce the number of modeling variables; the specific steps are as follows:
step 1, calculating the correlation information coefficient between each candidate variable and the target variable by using information entropy theory;
step 2, rejecting candidate variables with small correlation information coefficients;
step 3, extracting the principal components of the remaining candidate variables by principal component analysis;
step 4, obtaining the final modeling variables.
In the step 1, the value range of the candidate variable X is {x1, x2, …, xn} with corresponding probability distribution {p(x1), p(x2), …, p(xn)}, and its information entropy H(X) is defined as

H(X) = −Σi p(xi) log p(xi)    (1)
the information entropy is a measure of uncertainty, and the smaller the uncertainty is, the smaller the information entropy is;
between the candidate variable X and the target variable Y, when the event Y = yj occurs, the amount of information about xi obtained from yj, denoted I(xi; yj), is defined as:

I(xi; yj) = log [ p(xi | yj) / p(xi) ]    (2)
the average mutual information I (X; Y) between X and Y is defined as
In the formulas: p(xi | yj) denotes the conditional probability and p(xi, yj) the joint probability; the average mutual information represents the amount of information shared between the two random variables;
the correlation information coefficient between the candidate variable X and the target variable Y can then be obtained from equations (1) and (3):

IR = I(X; Y) / √(H(X) · H(Y))    (4)
According to the definition of the correlation information coefficient, 0 ≤ IR ≤ 1, and the stronger the association between X and Y, the larger IR.
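As a concrete illustration, the step-1 quantities can be computed for discrete (or pre-discretized) samples. The following is a minimal Python sketch; the function names are illustrative, and normalizing IR by √(H(X)·H(Y)) is an assumption, since the exact form of the patent's equation (4) is not reproduced in this text:

```python
import numpy as np
from collections import Counter

def entropy(x):
    """Shannon entropy H(X) of a discrete sample, eq. (1)."""
    n = len(x)
    return -sum((c / n) * np.log2(c / n) for c in Counter(x).values())

def mutual_information(x, y):
    """Average mutual information I(X; Y), eq. (3), estimated from the
    empirical joint and marginal distributions."""
    n = len(x)
    pxy = Counter(zip(x, y))          # joint counts of (xi, yj) pairs
    px, py = Counter(x), Counter(y)   # marginal counts
    mi = 0.0
    for (xi, yi), c in pxy.items():
        p_joint = c / n
        mi += p_joint * np.log2(p_joint / ((px[xi] / n) * (py[yi] / n)))
    return mi

def correlation_info_coefficient(x, y):
    """Correlation information coefficient IR in [0, 1]; the normalization
    by sqrt(H(X) * H(Y)) is an assumption, not taken from the patent."""
    hx, hy = entropy(x), entropy(y)
    if hx == 0.0 or hy == 0.0:
        return 0.0
    return mutual_information(x, y) / np.sqrt(hx * hy)
```

With this normalization, IR equals 1 when X determines Y exactly and 0 when the two are independent, matching the stated range 0 ≤ IR ≤ 1; step 2 then reduces to keeping only candidates whose coefficient reaches the threshold It.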
In the step 2, a correlation information coefficient threshold It is set; if the correlation information coefficient IR between a candidate variable X and the target variable Y calculated in step 1 satisfies IR < It, that candidate variable is eliminated.
In the step 3, the specific steps for extracting the principal components of the remaining candidate variables by principal component analysis are as follows:
step 3.1, forming a sample matrix Z from the candidate variables that satisfy the threshold condition, wherein each column is an observation sample z and each row represents one variable dimension;
step 3.2, subtracting the column mean from each column of the original matrix to generate the standardized matrix Zb;
step 3.3, calculating the covariance matrix Zc of the standardized matrix Zb;
step 3.4, calculating the eigenvalues λi of the covariance matrix Zc and the corresponding eigenvectors ui, wherein i = 1, 2, …, n and n is the number of eigenvalues of Zc;
step 3.5, arranging the eigenvalues in descending order and calculating the cumulative contribution rate of the first m principal components according to formula (5):

η(m) = (Σ(i=1..m) λi) / (Σ(i=1..n) λi)    (5)

in the formula: m = 1, 2, …, n − 1;
step 3.6, constructing the transformation matrix T from the eigenvectors corresponding to the first k largest eigenvalues, wherein η(k) > 85% is required and k ≤ m;
T=(u1,u2,…,uk);
step 3.7, calculating Zk, the projection of the standardized matrix Zb onto the transformation matrix T, to obtain the first k principal components and achieve dimensionality reduction; Zk is the principal-component extraction of the remaining candidate variables.
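The steps above can be sketched in NumPy. For convenience this sketch stores observation samples in rows and candidate variables in columns (an orientation assumption; the text stores samples in columns), and the final projection of the centred data onto T is the standard principal-component transform:

```python
import numpy as np

def pca_extract(Z, eta_min=0.85):
    """Steps 3.1-3.7: principal-component extraction.
    Rows of Z are observation samples, columns are candidate variables."""
    # Step 3.2: centre each variable to get the standardized matrix Zb
    Zb = Z - Z.mean(axis=0)
    # Step 3.3: covariance matrix Zc of the standardized data
    Zc = np.cov(Zb, rowvar=False)
    # Step 3.4: eigenvalues and eigenvectors of Zc (eigh: Zc is symmetric)
    vals, vecs = np.linalg.eigh(Zc)
    # Step 3.5: sort eigenvalues in descending order, cumulative rate, eq. (5)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    eta = np.cumsum(vals) / vals.sum()
    # Step 3.6: keep the first k components with eta(k) above the threshold
    k = int(np.searchsorted(eta, eta_min) + 1)
    T = vecs[:, :k]                    # transformation matrix T = (u1, ..., uk)
    # Step 3.7: project onto T to obtain the first k principal components Zk
    Zk = Zb @ T
    return Zk, T, eta
```

For example, on a matrix whose two columns are perfectly correlated, the first principal component already carries 100 % of the variance, so k = 1 and Zk has a single column.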
The modeling variable selection method based on correlation and principal component analysis has the following beneficial effects: the method is based on mathematical statistics and reduces the reliance on understanding the physical relationship between the candidate variables and the target variable, so it is convenient to popularize.
Drawings
FIG. 1 is a flowchart of the method for selecting modeling variables based on correlation and principal component analysis according to the present invention.
Detailed Description
The invention is further described below with reference to the drawings and specific preferred embodiments.
A modeling variable selection method based on correlation and principal component analysis is characterized in that: the method comprises the following steps of calculating the related information coefficient among the influence factors by using an information entropy theory, eliminating redundant variables, extracting residual variable pivot elements by using a principal component analysis method, and reducing the number of modeling variables, wherein the specific steps are as follows:
Step 1, calculating the correlation information coefficient between each candidate variable and the target variable by using information entropy theory;
the value range of the candidate variable X is {x1, x2, …, xn} with corresponding probability distribution {p(x1), p(x2), …, p(xn)}, and its information entropy H(X) is defined as

H(X) = −Σi p(xi) log p(xi)    (1)
the information entropy is a measure of uncertainty, and the smaller the uncertainty is, the smaller the information entropy is;
between the candidate variable X and the target variable Y, when the event Y = yj occurs, the amount of information about xi obtained from yj, denoted I(xi; yj), is defined as:

I(xi; yj) = log [ p(xi | yj) / p(xi) ]    (2)
the average mutual information I(X; Y) between X and Y is defined as

I(X; Y) = Σi Σj p(xi, yj) log [ p(xi | yj) / p(xi) ]    (3)
In the formulas: p(xi | yj) denotes the conditional probability and p(xi, yj) the joint probability; the average mutual information represents the amount of information shared between the two random variables;
the correlation information coefficient between the candidate variable X and the target variable Y can then be obtained from equations (1) and (3):

IR = I(X; Y) / √(H(X) · H(Y))    (4)
According to the definition of the correlation information coefficient, 0 ≤ IR ≤ 1, and the stronger the association between X and Y, the larger IR.
Step 2, rejecting candidate variables with small correlation information coefficients;
a correlation information coefficient threshold It is set; if the correlation information coefficient IR between a candidate variable X and the target variable Y calculated in step 1 satisfies IR < It, that candidate variable is eliminated.
Step 3, extracting the principal components of the remaining candidate variables by principal component analysis.
Step 3.1, forming a sample matrix Z by the variables to be selected meeting the threshold condition, wherein each column is an observation sample Z, and each row represents one-dimensional data;
step 3.2, subtracting the column mean from each column of the original matrix to generate the standardized matrix Zb;
step 3.3, calculating the covariance matrix Zc of the standardized matrix Zb;
step 3.4, calculating the eigenvalues λi of the covariance matrix Zc and the corresponding eigenvectors ui, wherein i = 1, 2, …, n and n is the number of eigenvalues of Zc;
step 3.5, arranging the eigenvalues in descending order and calculating the cumulative contribution rate of the first m principal components according to formula (5):

η(m) = (Σ(i=1..m) λi) / (Σ(i=1..n) λi)    (5)

in the formula: m = 1, 2, …, n − 1;
step 3.6, constructing the transformation matrix T from the eigenvectors corresponding to the first k largest eigenvalues, wherein η(k) > 85% is required and k ≤ m;
T=(u1,u2,…,uk);
step 3.7, calculating Zk, the projection of the standardized matrix Zb onto the transformation matrix T, to obtain the first k principal components and achieve dimensionality reduction; Zk is the principal-component extraction of the remaining candidate variables.
Step 4, obtaining the final modeling variables.
Further, the method is illustrated by processing thermal-state test data of a thermal power generating unit's boiler with the modeling variable selection method based on correlation and principal component analysis; the specific implementation steps are as follows:
the modeling candidate variables obtained according to the thermal experimental data of the boiler comprise 28 physical quantities such as fuel quantity, air quantity, coal feeder opening degree, coal quality and the like, and the target variable is NOx emission concentration. The thermal test data are shown in table 1:
TABLE 1
As shown in Table 1, the thermal test data were collected under 12 working conditions, with the first 28 columns being the candidate variables and the last column being the target variable.
Step 1: calculating the correlation information coefficient between each candidate variable and the target variable using information entropy theory;
the calculation of the correlation coefficient matrix between the 28 candidate variables (R1-R28) and the target variable (R29) is shown in table 2:
TABLE 2
Step 2: eliminating candidate variables with small correlation information coefficients;
setting a related information coefficient threshold ItAnd (4) deleting nine candidate variable variables of R6, R10, R11, R17, R18, R19, R25, R27 and R28 according to the correlation calculation result in the step 1.
Step 3: extracting the principal components of the remaining candidate variables by principal component analysis.
Principal component analysis is performed on the remaining nineteen candidate variables. Principal components are extracted from the nineteen variables according to a cumulative contribution rate greater than 85%, and the final input and output variables are obtained as shown in Table 3.
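For illustration, the embodiment's pipeline can be sketched end to end. The sketch below substitutes synthetic random data for the boiler data of Table 1 (which is not reproduced here), a histogram-based mutual-information estimate for the correlation information coefficient, and a quantile-based choice of the threshold It; all three substitutions are assumptions made only so the sketch is runnable:

```python
import numpy as np

def mi_hist(x, y, bins=4):
    """Mutual information between two continuous series, estimated by
    binning -- a simple stand-in for the patent's entropy calculations."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float((pxy[nz] * np.log2(pxy[nz] / np.outer(px, py)[nz])).sum())

rng = np.random.default_rng(0)
# Synthetic stand-in for Table 1: 12 working conditions x 28 candidates
X = rng.normal(size=(12, 28))
y = 2.0 * X[:, 0] + X[:, 3] - 0.5 * X[:, 7] + rng.normal(scale=0.1, size=12)

# Steps 1-2: score each candidate against the target, drop those below It
scores = np.array([mi_hist(X[:, j], y) for j in range(X.shape[1])])
I_t = np.quantile(scores, 0.3)          # threshold choice is an assumption
X_kept = X[:, scores >= I_t]

# Step 3: PCA keeping > 85 % cumulative contribution rate, eq. (5)
Zb = X_kept - X_kept.mean(axis=0)
vals, vecs = np.linalg.eigh(np.cov(Zb, rowvar=False))
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]
k = int(np.searchsorted(np.cumsum(vals) / vals.sum(), 0.85) + 1)

# Step 4: final modeling variables (12 conditions x k principal components)
X_model = Zb @ vecs[:, :k]
```

The shape of X_model shows the reduction: the 28 original candidates shrink first by thresholding and then to k principal components, mirroring the 28-to-nineteen-to-principal-components path of the embodiment.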
TABLE 3
According to the data in Table 3, the error between the predicted and measured values of the model obtained by the modeling algorithm is shown in Table 4. It can be seen that a model built from the modeling input variables extracted by this method has high accuracy and fully meets the requirements of engineering application.
TABLE 4
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above-mentioned embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may be made by those skilled in the art without departing from the principle of the invention.
Claims (4)
1. A modeling variable selection method based on correlation and principal component analysis, characterized in that: the method calculates the correlation information coefficients among the influencing factors by using information entropy theory, eliminates redundant variables, and then extracts the principal components of the remaining variables by principal component analysis to reduce the number of modeling variables; the specific steps are as follows:
step 1, calculating the correlation information coefficient between each candidate variable and the target variable by using information entropy theory;
step 2, rejecting candidate variables with small correlation information coefficients;
step 3, extracting the principal components of the remaining candidate variables by principal component analysis;
step 4, obtaining the final modeling variables.
2. The method of claim 1, characterized in that: in the step 1, the value range of the candidate variable X is {x1, x2, …, xn} with corresponding probability distribution {p(x1), p(x2), …, p(xn)}, and its information entropy H(X) is defined as

H(X) = −Σi p(xi) log p(xi)    (1)
the information entropy is a measure of uncertainty, and the smaller the uncertainty is, the smaller the information entropy is;
between the candidate variable X and the target variable Y, when the event Y = yj occurs, the amount of information about xi obtained from yj, denoted I(xi; yj), is defined as

I(xi; yj) = log [ p(xi | yj) / p(xi) ]    (2)
The average mutual information I(X; Y) between X and Y is defined as

I(X; Y) = Σi Σj p(xi, yj) log [ p(xi | yj) / p(xi) ]    (3)
In the formulas: p(xi | yj) denotes the conditional probability and p(xi, yj) the joint probability; the average mutual information represents the amount of information shared between the two random variables;
the correlation information coefficient between the candidate variable X and the target variable Y can then be obtained from equations (1) and (3):

IR = I(X; Y) / √(H(X) · H(Y))    (4)
According to the definition of the correlation information coefficient, 0 ≤ IR ≤ 1, and the stronger the association between X and Y, the larger IR.
3. The method of claim 2, characterized in that: in the step 2, a correlation information coefficient threshold It is set; if the correlation information coefficient IR between a candidate variable X and the target variable Y calculated in step 1 satisfies IR < It, that candidate variable is eliminated.
4. A modeling variable selection method based on correlation and principal component analysis as claimed in claim 3, characterized in that: in the step 3, the specific steps of extracting the principal components of the remaining candidate variables by principal component analysis are as follows:
step 3.1, forming a sample matrix Z from the candidate variables that satisfy the threshold condition, wherein each column is an observation sample z and each row represents one variable dimension;
step 3.2, subtracting the column mean from each column of the original matrix to generate the standardized matrix Zb;
step 3.3, calculating the covariance matrix Zc of the standardized matrix Zb;
step 3.4, calculating the eigenvalues λi of the covariance matrix Zc and the corresponding eigenvectors ui, wherein i = 1, 2, …, n and n is the number of eigenvalues of Zc;
step 3.5, arranging the eigenvalues in descending order and calculating the cumulative contribution rate of the first m principal components according to formula (5):

η(m) = (Σ(i=1..m) λi) / (Σ(i=1..n) λi)    (5)

in the formula: m = 1, 2, …, n − 1;
step 3.6, constructing the transformation matrix T from the eigenvectors corresponding to the first k largest eigenvalues, wherein η(k) > 85% is required and k ≤ m;
T=(u1,u2,…,uk);
step 3.7, calculating Zk, the projection of the standardized matrix Zb onto the transformation matrix T, to obtain the first k principal components and achieve dimensionality reduction; Zk is the principal-component extraction of the remaining candidate variables.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010234827.4A CN111427866A (en) | 2020-03-30 | 2020-03-30 | Modeling variable selection method based on correlation and principal component analysis |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111427866A true CN111427866A (en) | 2020-07-17 |
Family
ID=71549133
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010234827.4A Withdrawn CN111427866A (en) | 2020-03-30 | 2020-03-30 | Modeling variable selection method based on correlation and principal component analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111427866A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117520824A (en) * | 2024-01-03 | 2024-02-06 | 浙江省白马湖实验室有限公司 | Information entropy-based distributed optical fiber data characteristic reconstruction method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | ||
Application publication date: 20200717 |