CN111427866A - Modeling variable selection method based on correlation and principal component analysis - Google Patents


Info

Publication number
CN111427866A
Authority
CN
China
Prior art keywords
variables
variable
information
component analysis
modeling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010234827.4A
Other languages
Chinese (zh)
Inventor
李逗
孙栓柱
孙和泰
周春蕾
王林
孙彬
黄翔
李春岩
杨晨琛
潘苗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Fangtian Power Technology Co Ltd
Original Assignee
Jiangsu Fangtian Power Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-03-30
Filing date
2020-03-30
Publication date
2020-07-17
Application filed by Jiangsu Fangtian Power Technology Co Ltd
Priority to CN202010234827.4A
Publication of CN111427866A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/211 Schema design and management
    • G06F16/212 Schema design and management with details for data modelling support
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis

Abstract

The invention relates to a modeling variable selection method based on correlation and principal component analysis. The method uses information entropy theory to calculate the correlation information coefficients among the influencing factors and eliminates redundant variables, then uses principal component analysis to extract the principal components of the remaining variables, reducing the number of modeling variables. It comprises the following steps: step 1, calculating the correlation information coefficient between each candidate variable and the target variable using information entropy theory; step 2, rejecting the candidate variables with small correlation information coefficients; step 3, extracting the principal components of the remaining candidate variables using principal component analysis; and step 4, obtaining the final modeling variables. The method reduces the number of modeling variables while retaining as much of their information as possible, which helps in building a model of the object.

Description

Modeling variable selection method based on correlation and principal component analysis
Technical Field
The invention relates to the technical field of data-driven modeling, in particular to a modeling variable selection method based on correlation and principal component analysis.
Background
In recent years, with the continuous development of computer and database technologies, data-driven modeling has attracted increasing attention. However, because the modeled object is complex, the candidate variables that may be related to the target variable are often numerous. Some of them are correlated and coupled with one another, while others do not reflect the system output and are noisy, irrelevant, or redundant. If every possible candidate variable is included in the model, the modeling time and the uncertainty of the model increase greatly, the generalization capability of the model weakens, and the model accuracy falls. The candidate variables therefore need to be screened in order to reduce the amount of computation and improve the performance of the resulting model.
Disclosure of Invention
The invention aims to provide a modeling variable selection method based on correlation and principal component analysis, which can reduce the number of modeling variables and provide help for establishing an object model on the premise of keeping modeling variable information as much as possible.
To solve the above technical problem, the invention adopts the following technical solution:
A modeling variable selection method based on correlation and principal component analysis, characterized in that information entropy theory is used to calculate the correlation information coefficients among the influencing factors and redundant variables are eliminated; principal component analysis is then used to extract the principal components of the remaining variables, reducing the number of modeling variables. The specific steps are as follows:
step 1, calculate the correlation information coefficient between each candidate variable and the target variable using information entropy theory;
step 2, reject the candidate variables with small correlation information coefficients;
step 3, extract the principal components of the remaining candidate variables using principal component analysis;
step 4, obtain the final modeling variables.
In step 1, the candidate variable $X$ takes values in $\{x_1, x_2, \ldots, x_n\}$ with the corresponding probability distribution $\{p(x_1), p(x_2), \ldots, p(x_n)\}$, and its information entropy $H(X)$ is defined as

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i) \tag{1}$$

where $0 \le p(x_i) \le 1$ and

$$\sum_{i=1}^{n} p(x_i) = 1.$$

Information entropy is a measure of uncertainty: the smaller the uncertainty, the smaller the information entropy.

Between the candidate variable $X$ and the target variable $Y$, when the event $Y = y_j$ occurs, the amount of information about $x_i$ obtained from $y_j$, denoted $I(x_i; y_j)$, is defined as

$$I(x_i; y_j) = \log \frac{p(x_i \mid y_j)}{p(x_i)} \tag{2}$$

The average mutual information $I(X;Y)$ between $X$ and $Y$ is defined as

$$I(X;Y) = \sum_{i} \sum_{j} p(x_i, y_j) \log \frac{p(x_i \mid y_j)}{p(x_i)} \tag{3}$$

where $p(x_i \mid y_j)$ is the conditional probability and $p(x_i, y_j)$ is the joint probability. The average mutual information represents the amount of information shared between two random variables.

The correlation information coefficient $I_R$ between the candidate variable $X$ and the target variable $Y$ is obtained from equations (1) and (3):

[Equation (4): definition of $I_R$; formula image not reproduced]

By the definition of the correlation information coefficient, $0 \le I_R \le 1$; the greater the degree of association between $X$ and $Y$, the larger $I_R$.
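The quantities in step 1 can be sketched in Python for discrete samples. This is a minimal illustration rather than the patent's implementation: probabilities are estimated by relative frequencies, natural logarithms are used, and because equation (4) is not reproduced in this text, the normalization in `correlation_information_coefficient` (mutual information divided by the geometric mean of the two entropies) is an assumption chosen only because it lies in [0, 1]. All function names are hypothetical.

```python
import math
from collections import Counter


def entropy(xs):
    """Shannon entropy H(X) of a sequence of discrete values (natural log)."""
    n = len(xs)
    return -sum((c / n) * math.log(c / n) for c in Counter(xs).values())


def mutual_information(xs, ys):
    """Average mutual information I(X;Y) estimated from paired discrete samples."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    total = 0.0
    for (x, y), c in pxy.items():
        p_xy = c / n
        # p(x|y) / p(x) is the same ratio as p(x,y) / (p(x) * p(y)).
        total += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return total


def correlation_information_coefficient(xs, ys):
    """ASSUMED normalization I(X;Y) / sqrt(H(X) * H(Y)); not taken from the source."""
    hx, hy = entropy(xs), entropy(ys)
    if hx == 0.0 or hy == 0.0:
        return 0.0  # a constant variable shares no information
    return mutual_information(xs, ys) / math.sqrt(hx * hy)
```

With this assumed normalization, a variable that is an exact copy of the target scores 1 and a statistically independent one scores 0.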
In step 2, a correlation information coefficient threshold $I_t$ is set; if the correlation information coefficient between the candidate variable $X$ and the target variable $Y$ calculated in step 1 satisfies $I_R < I_t$, the candidate variable is eliminated.
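The elimination rule in step 2 amounts to a simple filter. The sketch below assumes the coefficients are already available as a mapping from variable name to $I_R$; the function name, the coefficient values, and the threshold of 0.4 are all made up for illustration.

```python
def select_by_threshold(coefficients, threshold):
    """Keep the candidate variables whose coefficient is not below the threshold."""
    # A variable is eliminated when I_R < I_t, so it is kept when I_R >= I_t.
    return [name for name, i_r in coefficients.items() if i_r >= threshold]


# Hypothetical coefficients for three candidate variables.
coefficients = {"R1": 0.72, "R2": 0.15, "R3": 0.48}
kept = select_by_threshold(coefficients, threshold=0.4)
```

Here `kept` retains R1 and R3, and R2 is eliminated.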
In step 3, the principal components of the remaining candidate variables are extracted by principal component analysis as follows:
step 3.1, form a sample matrix $Z$ from the candidate variables that meet the threshold condition, where each column is an observation sample $z$ and each row represents one dimension of the data;
step 3.2, subtract the column mean from each column of the original matrix to generate the standardized matrix $Z_b$;
step 3.3, calculate the covariance matrix $Z_c$ of the standardized matrix $Z_b$;
step 3.4, calculate the eigenvalues $\lambda_i$ of the covariance matrix $Z_c$ and the corresponding eigenvectors $u_i$, where $i = 1, 2, \ldots, n$ and $n$ is the number of eigenvalues of $Z_c$;
step 3.5, sort the eigenvalues in descending order and calculate the cumulative contribution rate of the first $m$ principal components according to equation (5):

$$\eta(m) = \frac{\sum_{i=1}^{m} \lambda_i}{\sum_{i=1}^{n} \lambda_i} \tag{5}$$

where $m = 1, 2, \ldots, n-1$;
step 3.6, construct the transformation matrix $T$ from the eigenvectors corresponding to the $k$ largest eigenvalues, where $\eta(k) > 85\%$ is required and $k = m$:

$$T = (u_1, u_2, \ldots, u_k);$$

step 3.7, calculate $Z_k$ to obtain the first $k$ principal components, which achieves the dimensionality reduction; $Z_k$ is the set of principal components extracted from the remaining candidate variables.
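Steps 3.2 to 3.7 can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the patent's code: the input is laid out with rows as observations and columns as variables (so that subtracting column means centers each variable), the cumulative contribution rate is taken as the running sum of sorted eigenvalues divided by their total, and the projection `Zb @ T` is the usual PCA form, which the source text does not spell out. The function name `pca_reduce` is hypothetical.

```python
import numpy as np


def pca_reduce(Z, min_contribution=0.85):
    """PCA reduction for Z with rows = observations, columns = variables (assumed)."""
    Z = np.asarray(Z, dtype=float)
    Zb = Z - Z.mean(axis=0)                   # step 3.2: subtract each column's mean
    Zc = np.cov(Zb, rowvar=False)             # step 3.3: covariance between variables
    eigvals, eigvecs = np.linalg.eigh(Zc)     # step 3.4: eigh handles symmetric matrices
    order = np.argsort(eigvals)[::-1]         # step 3.5: sort eigenvalues descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    eta = np.cumsum(eigvals) / eigvals.sum()  # cumulative contribution rate
    # step 3.6: smallest k whose cumulative contribution strictly exceeds the bound
    k = int(np.argmax(eta > min_contribution)) + 1
    T = eigvecs[:, :k]                        # transformation matrix (u_1, ..., u_k)
    Zk = Zb @ T                               # step 3.7: assumed projection form
    return Zk, T, eta
```

The function expects at least two variables; with strongly correlated columns, a single component is usually enough to pass the 85% bound.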
The modeling variable selection method based on correlation and principal component analysis has the following beneficial effect: because it rests on mathematical statistics, it reduces the need to understand the physical relationship between the candidate variables and the target variable, which makes the method easy to generalize.
Drawings
FIG. 1 is a flowchart of the method for selecting modeling variables based on correlation and principal component analysis according to the present invention.
Detailed Description
The invention is further described below with reference to the drawings and specific preferred embodiments.
A modeling variable selection method based on correlation and principal component analysis, characterized in that information entropy theory is used to calculate the correlation information coefficients among the influencing factors and redundant variables are eliminated; principal component analysis is then used to extract the principal components of the remaining variables, reducing the number of modeling variables. The specific steps are as follows:
Step 1, calculate the correlation information coefficient between each candidate variable and the target variable using information entropy theory.
The candidate variable $X$ takes values in $\{x_1, x_2, \ldots, x_n\}$ with the corresponding probability distribution $\{p(x_1), p(x_2), \ldots, p(x_n)\}$, and its information entropy $H(X)$ is defined as

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i) \tag{1}$$

where $0 \le p(x_i) \le 1$ and

$$\sum_{i=1}^{n} p(x_i) = 1.$$

Information entropy is a measure of uncertainty: the smaller the uncertainty, the smaller the information entropy.

Between the candidate variable $X$ and the target variable $Y$, when the event $Y = y_j$ occurs, the amount of information about $x_i$ obtained from $y_j$, denoted $I(x_i; y_j)$, is defined as

$$I(x_i; y_j) = \log \frac{p(x_i \mid y_j)}{p(x_i)} \tag{2}$$

The average mutual information $I(X;Y)$ between $X$ and $Y$ is defined as

$$I(X;Y) = \sum_{i} \sum_{j} p(x_i, y_j) \log \frac{p(x_i \mid y_j)}{p(x_i)} \tag{3}$$

where $p(x_i \mid y_j)$ is the conditional probability and $p(x_i, y_j)$ is the joint probability. The average mutual information represents the amount of information shared between two random variables.

The correlation information coefficient $I_R$ between the candidate variable $X$ and the target variable $Y$ is obtained from equations (1) and (3):

[Equation (4): definition of $I_R$; formula image not reproduced]

By the definition of the correlation information coefficient, $0 \le I_R \le 1$; the greater the degree of association between $X$ and $Y$, the larger $I_R$.
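As a sanity check on the definitions in step 1, the worked example below evaluates the entropy and the average mutual information for two hypothetical binary variables. It assumes the standard forms $H(X) = -\sum_i p(x_i)\log p(x_i)$ and $I(X;Y) = \sum_i\sum_j p(x_i, y_j)\log\frac{p(x_i \mid y_j)}{p(x_i)}$; the distributions are chosen for illustration and do not come from the patent's data.

```latex
% Case A: Y is an exact copy of X, with p(x_1) = p(x_2) = 1/2
H(X) = -\tfrac12\log\tfrac12 - \tfrac12\log\tfrac12 = \log 2
% p(x_i, y_j) = 1/2 if i = j, else 0; p(x_i | y_i) = 1
I(X;Y) = \tfrac12\log\frac{1}{1/2} + \tfrac12\log\frac{1}{1/2} = \log 2 = H(X)

% Case B: X and Y independent, so p(x_i | y_j) = p(x_i)
I(X;Y) = \sum_i\sum_j p(x_i, y_j)\log 1 = 0
```

The two cases bracket the range: the mutual information is zero for independent variables and equals the entropy when one variable fully determines the other, which is consistent with a normalized coefficient lying between 0 and 1.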
Step 2, reject the candidate variables with small correlation information coefficients.
A correlation information coefficient threshold $I_t$ is set; if the correlation information coefficient between the candidate variable $X$ and the target variable $Y$ calculated in step 1 satisfies $I_R < I_t$, the candidate variable is eliminated.
Step 3, extract the principal components of the remaining candidate variables using principal component analysis.
Step 3.1, form a sample matrix $Z$ from the candidate variables that meet the threshold condition, where each column is an observation sample $z$ and each row represents one dimension of the data;
step 3.2, subtract the column mean from each column of the original matrix to generate the standardized matrix $Z_b$;
step 3.3, calculate the covariance matrix $Z_c$ of the standardized matrix $Z_b$;
step 3.4, calculate the eigenvalues $\lambda_i$ of the covariance matrix $Z_c$ and the corresponding eigenvectors $u_i$, where $i = 1, 2, \ldots, n$ and $n$ is the number of eigenvalues of $Z_c$;
step 3.5, sort the eigenvalues in descending order and calculate the cumulative contribution rate of the first $m$ principal components according to equation (5):

$$\eta(m) = \frac{\sum_{i=1}^{m} \lambda_i}{\sum_{i=1}^{n} \lambda_i} \tag{5}$$

where $m = 1, 2, \ldots, n-1$;
step 3.6, construct the transformation matrix $T$ from the eigenvectors corresponding to the $k$ largest eigenvalues, where $\eta(k) > 85\%$ is required and $k = m$:

$$T = (u_1, u_2, \ldots, u_k);$$

step 3.7, calculate $Z_k$ to obtain the first $k$ principal components, which achieves the dimensionality reduction; $Z_k$ is the set of principal components extracted from the remaining candidate variables.
Step 4, obtain the final modeling variables.
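The choice of $k$ in steps 3.5 and 3.6 can be illustrated with a small worked example. The eigenvalues below are hypothetical, and the cumulative contribution rate is assumed to be the sum of the first $m$ sorted eigenvalues divided by the sum of all eigenvalues.

```latex
% Hypothetical eigenvalues, already sorted in descending order (n = 4)
\lambda = (6,\ 2.5,\ 1,\ 0.5), \qquad \sum_{i=1}^{4}\lambda_i = 10

\eta(1) = \frac{6}{10} = 0.60
\eta(2) = \frac{6 + 2.5}{10} = 0.85
\eta(3) = \frac{6 + 2.5 + 1}{10} = 0.95

% \eta(2) = 85\% does not satisfy the strict inequality \eta(k) > 85\%, so k = 3
T = (u_1, u_2, u_3)
```

In this example the strict inequality matters: two components reach exactly 85% and are not enough, so three eigenvectors form the transformation matrix.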
Further, taking the processing of thermal test data from a thermal power unit boiler as an example, the modeling variable selection method based on correlation and principal component analysis is implemented as follows:
The candidate modeling variables obtained from the boiler thermal test data comprise 28 physical quantities, such as fuel quantity, air quantity, coal feeder opening, and coal quality; the target variable is the NOx emission concentration. The thermal test data are shown in Table 1:
TABLE 1 (thermal test data; table image not reproduced)
As Table 1 shows, the thermal test data were collected under 12 working conditions; the first 28 columns are the candidate variables and the last column is the target variable.
Step 1: calculate the correlation information coefficient between each candidate variable and the target variable using information entropy theory.
The calculated correlation coefficient matrix between the 28 candidate variables (R1-R28) and the target variable (R29) is shown in Table 2:
TABLE 2 (correlation coefficient matrix; table image not reproduced)
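The test quantities in this example are continuous measurements, so the probabilities in the entropy and mutual information definitions have to be estimated somehow, and the text does not say how. One common option, shown here purely as an assumption, is equal-width binning with `numpy.histogram2d`; the function name and the bin count are hypothetical, and with only 12 working conditions any such estimate is coarse.

```python
import numpy as np


def binned_mutual_information(x, y, bins=4):
    """Estimate I(X;Y) for continuous samples via equal-width binning (assumed choice)."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = counts / counts.sum()             # joint probabilities p(x_i, y_j)
    p_x = p_xy.sum(axis=1, keepdims=True)    # marginal p(x_i), shape (bins, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)    # marginal p(y_j), shape (1, bins)
    nonzero = p_xy > 0                       # empty cells contribute nothing
    ratio = p_xy[nonzero] / (p_x @ p_y)[nonzero]
    return float(np.sum(p_xy[nonzero] * np.log(ratio)))
```

The result depends on the number of bins, so the same binning should be used for every candidate variable when their coefficients are compared against a single threshold.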
Step 2: eliminate the candidate variables with small correlation information coefficients.
A correlation information coefficient threshold $I_t$ is set and, according to the correlation results from step 1, nine candidate variables are deleted: R6, R10, R11, R17, R18, R19, R25, R27 and R28.
Step 3: extract the principal components of the remaining candidate variables using principal component analysis.
Principal component analysis is performed on the remaining nineteen candidate variables. Principal components are extracted from the nineteen variables according to the criterion that the cumulative contribution rate exceeds 85%; the resulting input and output variables are shown in Table 3.
TABLE 3 (final input and output variables; table image not reproduced)
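One property of this example is worth keeping in mind: with 12 working conditions and nineteen remaining variables, the sample covariance matrix of the centered data can have at most 11 non-zero eigenvalues, so at most 11 principal components carry any variance. The sketch below checks this on synthetic random data of the same shape; it does not use the patent's measurements, and rows are assumed to be working conditions.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(12, 19))            # synthetic stand-in: 12 conditions x 19 variables
Zb = Z - Z.mean(axis=0)                  # center each variable
Zc = np.cov(Zb, rowvar=False)            # 19 x 19 covariance matrix
eigvals = np.linalg.eigvalsh(Zc)[::-1]   # eigenvalues in descending order
rank = int(np.linalg.matrix_rank(Zc))    # number of non-negligible eigenvalues
```

Centering removes one degree of freedom from the 12 samples, which is why the rank is bounded by 11 rather than 12.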
A model was built from the data in Table 3 using a modeling algorithm, and the errors between its predicted values and the measured values are shown in Table 4. The model built from the input variables extracted by this method is highly accurate and fully meets the requirements of engineering applications.
TABLE 4 (errors between model predictions and measured values; table image not reproduced)
The above is only a preferred embodiment of the present invention. The protection scope of the invention is not limited to this embodiment, and all technical solutions that follow the idea of the invention fall within its protection scope. It should be noted that, for those of ordinary skill in the art, improvements and refinements made without departing from the principle of the invention are also regarded as falling within the protection scope of the invention.

Claims (4)

1. A modeling variable selection method based on correlation and principal component analysis, characterized in that information entropy theory is used to calculate the correlation information coefficients among the influencing factors and redundant variables are eliminated; principal component analysis is then used to extract the principal components of the remaining variables, reducing the number of modeling variables. The specific steps are as follows:
step 1, calculate the correlation information coefficient between each candidate variable and the target variable using information entropy theory;
step 2, reject the candidate variables with small correlation information coefficients;
step 3, extract the principal components of the remaining candidate variables using principal component analysis;
step 4, obtain the final modeling variables.
2. The modeling variable selection method based on correlation and principal component analysis according to claim 1, characterized in that: in step 1, the candidate variable $X$ takes values in $\{x_1, x_2, \ldots, x_n\}$ with the corresponding probability distribution $\{p(x_1), p(x_2), \ldots, p(x_n)\}$, and its information entropy $H(X)$ is defined as

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i) \tag{1}$$

where $0 \le p(x_i) \le 1$ and

$$\sum_{i=1}^{n} p(x_i) = 1.$$

Information entropy is a measure of uncertainty: the smaller the uncertainty, the smaller the information entropy.

Between the candidate variable $X$ and the target variable $Y$, when the event $Y = y_j$ occurs, the amount of information about $x_i$ obtained from $y_j$, denoted $I(x_i; y_j)$, is defined as

$$I(x_i; y_j) = \log \frac{p(x_i \mid y_j)}{p(x_i)} \tag{2}$$

The average mutual information $I(X;Y)$ between $X$ and $Y$ is defined as

$$I(X;Y) = \sum_{i} \sum_{j} p(x_i, y_j) \log \frac{p(x_i \mid y_j)}{p(x_i)} \tag{3}$$

where $p(x_i \mid y_j)$ is the conditional probability and $p(x_i, y_j)$ is the joint probability. The average mutual information represents the amount of information shared between two random variables.

The correlation information coefficient $I_R$ between the candidate variable $X$ and the target variable $Y$ is obtained from equations (1) and (3):

[Equation (4): definition of $I_R$; formula image not reproduced]

By the definition of the correlation information coefficient, $0 \le I_R \le 1$; the greater the degree of association between $X$ and $Y$, the larger $I_R$.
3. The modeling variable selection method based on correlation and principal component analysis according to claim 2, characterized in that: in step 2, a correlation information coefficient threshold $I_t$ is set; if the correlation information coefficient between the candidate variable $X$ and the target variable $Y$ calculated in step 1 satisfies $I_R < I_t$, the candidate variable is eliminated.
4. The modeling variable selection method based on correlation and principal component analysis according to claim 3, characterized in that: in step 3, the principal components of the remaining candidate variables are extracted by principal component analysis as follows:
step 3.1, form a sample matrix $Z$ from the candidate variables that meet the threshold condition, where each column is an observation sample $z$ and each row represents one dimension of the data;
step 3.2, subtract the column mean from each column of the original matrix to generate the standardized matrix $Z_b$;
step 3.3, calculate the covariance matrix $Z_c$ of the standardized matrix $Z_b$;
step 3.4, calculate the eigenvalues $\lambda_i$ of the covariance matrix $Z_c$ and the corresponding eigenvectors $u_i$, where $i = 1, 2, \ldots, n$ and $n$ is the number of eigenvalues of $Z_c$;
step 3.5, sort the eigenvalues in descending order and calculate the cumulative contribution rate of the first $m$ principal components according to equation (5):

$$\eta(m) = \frac{\sum_{i=1}^{m} \lambda_i}{\sum_{i=1}^{n} \lambda_i} \tag{5}$$

where $m = 1, 2, \ldots, n-1$;
step 3.6, construct the transformation matrix $T$ from the eigenvectors corresponding to the $k$ largest eigenvalues, where $\eta(k) > 85\%$ is required and $k = m$:

$$T = (u_1, u_2, \ldots, u_k);$$

step 3.7, calculate $Z_k$ to obtain the first $k$ principal components, which achieves the dimensionality reduction; $Z_k$ is the set of principal components extracted from the remaining candidate variables.
CN202010234827.4A 2020-03-30 2020-03-30 Modeling variable selection method based on correlation and principal component analysis Withdrawn CN111427866A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010234827.4A CN111427866A (en) 2020-03-30 2020-03-30 Modeling variable selection method based on correlation and principal component analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010234827.4A CN111427866A (en) 2020-03-30 2020-03-30 Modeling variable selection method based on correlation and principal component analysis

Publications (1)

Publication Number Publication Date
CN111427866A true CN111427866A (en) 2020-07-17

Family

ID=71549133

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010234827.4A Withdrawn CN111427866A (en) 2020-03-30 2020-03-30 Modeling variable selection method based on correlation and principal component analysis

Country Status (1)

Country Link
CN (1) CN111427866A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117520824A (en) * 2024-01-03 2024-02-06 浙江省白马湖实验室有限公司 Information entropy-based distributed optical fiber data characteristic reconstruction method

Similar Documents

Publication Publication Date Title
CN109165664B (en) Attribute-missing data set completion and prediction method based on generation of countermeasure network
CN112101480A (en) Multivariate clustering and fused time sequence combined prediction method
CN111428201B (en) Prediction method for time series data based on empirical mode decomposition and feedforward neural network
CN112329865B (en) Data anomaly identification method and device based on self-encoder and computer equipment
CN111753044A (en) Regularization-based language model for removing social bias and application
Zhu et al. Network inference from consensus dynamics with unknown parameters
Chen et al. Optimal control for multistage uncertain random dynamic systems with multiple time delays
CN110688585A (en) Personalized movie recommendation method based on neural network and collaborative filtering
Xie et al. Data transformation for geometrically distributed quality characteristics
CN111427866A (en) Modeling variable selection method based on correlation and principal component analysis
CN110738363A (en) photovoltaic power generation power prediction model and construction method and application thereof
CN114169091A (en) Method for establishing prediction model of residual life of engineering mechanical part and prediction method
CN112765746B (en) Turbine blade top gas-thermal performance uncertainty quantification system based on polynomial chaos
Ismail et al. Principal component regression with artificial neural network to improve prediction of electricity demand.
CN116245107B (en) Electric power audit text entity identification method, device, equipment and storage medium
CN117131449A (en) Data management-oriented anomaly identification method and system with propagation learning capability
CN109345055B (en) Large power grid static stability boundary feature extraction method and system based on space-time sequence
CN113095596B (en) Photovoltaic power prediction method based on multi-stage Gate-SA-TCN
Zheng et al. Quantized minimum error entropy with fiducial points for robust regression
Zhang et al. Rare event simulation for large-scale structures with local nonlinearities
CN112016004A (en) Multi-granularity information fusion-based job crime screening system and method
Lou et al. Improved Transformer with Parallel Encoders for Image Captioning
Pan Time series data anomaly detection based on LSTM-GAN
Karavarsamis Two-stage approaches to the analysis of occupancy data i: The homogeneous case
Jian et al. Semi-supervised Bi-dictionary learning using smooth representation-based label propagation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200717