CN116401564A

CN116401564A - PCA-based redundant variable screening improvement method and device

Info

Publication number: CN116401564A
Application number: CN202310299411.4A
Authority: CN
Inventors: 岳喜超; 王勇; 刘蔚
Original assignee: Shanghai Yunjian Information Technology Co ltd; Shanghai Electric Power University
Current assignee: Shanghai Yunjian Information Technology Co ltd; Shanghai Electric Power University
Priority date: 2023-03-24
Filing date: 2023-03-24
Publication date: 2023-07-07

Abstract

The invention discloses a PCA-based redundant variable screening improvement method and device. The method comprises the following steps: collecting historical working data of the characteristics of a gas flowmeter and preprocessing the historical working data; performing target variable clustering on the preprocessed data and, in combination with feature selection, calculating a first key variable screening index Q₁ and a second key variable screening index Q₂; calculating a third key variable screening index Q_f based on the key variable screening indexes Q₁ and Q₂, and completing feature selection according to the third key variable screening index Q_f to obtain the screened key variables; and inputting the key variables into a machine learning classification algorithm for testing to obtain the actual prediction effect. In the invention, the original data are standardized so that the contribution of each feature is balanced and the influence of differing dimensions (units) is eliminated; additional factors are taken into account in feature selection; and the ratio of the second key variable screening index to the first key variable screening index is used as the final key variable screening index, thereby improving prediction accuracy.

Description

PCA-based redundant variable screening improvement method and device

Technical Field

The invention relates to the technical field of data mining, in particular to a redundant variable screening improvement method and device based on PCA.

Background

In recent years, with the continuous development of information technology, the explosive growth of data has led to ever higher data complexity and an increasing variety of data types, resulting in the "curse of dimensionality". Traditional data mining techniques face significant challenges when processing high-dimensional data, and their demands on resources and time keep increasing. Dimension reduction of the feature data lowers the dimensionality of the data and improves the performance of the algorithm. Methods for reducing the data dimension are mainly divided into feature transformation methods and feature selection methods.

Data dimension reduction algorithms are widely applied in fields such as geography, medicine and simulation, and feature selection algorithms have long been studied extensively by researchers at home and abroad. A feature selection method selects, from the original feature set, the feature subset that scores best on an evaluation criterion, which helps researchers carry out classification and regression tasks and improves the accuracy and efficiency of data classification. Ren Dong et al. calculate the Pearson correlation coefficients of the features to judge the strength of their relationships, determine an optimal threshold for feature extraction, and then evaluate the result in classification experiments on models such as K-nearest neighbors, decision trees and random forests, obtaining good results. Chen Liang et al. convert the continuous optimization based on sine and cosine functions into a binary optimization for feature selection, thereby mapping individual positions to feature subsets, effectively selecting an optimal feature subset, reducing the feature dimension and improving data classification accuracy. However, the algorithm requires too many iterations and does not come close to the optimal solution. Su Weixing et al. introduce correlation-based feature selection (CFS) to obtain the optimal feature subset and thereby reduce the data dimension, and select partial least squares regression (PLSR) as the core modeling algorithm, which effectively addresses the harm caused by multicollinearity among variables.

Li Jingxing et al. analyze the relevance and redundancy of features using the maximal information coefficient to obtain a Markov blanket representation set of the class attribute and a suboptimal feature subset, which improves classification accuracy in the test stage and achieves a marked dimension reduction effect. Li Xinqian et al. remove irrelevant features using mutual information, obtain the number of clusters through a particle swarm algorithm, and finally combine, from each cluster, the feature with the highest mutual information with the category into a feature subset, which effectively reduces the correlation among features and improves classification performance. Wang Lichun et al. preprocess a high-dimensional imbalanced data set by combining the SMOTE algorithm with random undersampling, and introduce a clustering algorithm to improve SMOTE, so that every evaluation index is higher when processing high-dimensional imbalanced data. However, the overall running time of the algorithm has no significant advantage over other algorithms. Xu Zhaozhao et al. calculate the information gain ratio of each feature, divide the features into equal-density groups according to their information density, and finally search the equal-density feature groups with a grouping evolutionary genetic algorithm, obtaining good results on UCI medical data sets. However, the effect on high-dimensional small-sample data is not ideal.

Various methods exist for solving such problems, such as the redundant variable screening algorithm based on principal component analysis (PCA). However, that algorithm still requires human intervention when selecting key variables: the selection must be made by skilled personnel and involves a degree of randomness, so the prediction accuracy is unstable in the subsequent machine learning model training stage.

Disclosure of Invention

This section is intended to outline some aspects of embodiments of the invention and to briefly introduce some preferred embodiments. Some simplifications or omissions may be made in this section as well as in the description summary and in the title of the application, to avoid obscuring the purpose of this section, the description summary and the title of the invention, which should not be used to limit the scope of the invention.

The present invention has been made in view of the above-described problems occurring in the prior art.

Therefore, the invention provides a PCA-based redundant variable screening improvement method and device, which solve the problem in existing feature selection algorithms that selecting key variables by expert experience leads to low prediction accuracy of the machine learning model.

In order to solve the technical problems, the invention provides the following technical scheme:

in a first aspect, an embodiment of the present invention provides a redundant variable screening improvement method based on PCA, including:

collecting historical working data of the characteristics of the gas flowmeter, and preprocessing the historical working data;

performing target variable clustering on the preprocessed data and, in combination with feature selection, calculating a first key variable screening index Q₁ and a second key variable screening index Q₂;

calculating a third key variable screening index Q_f based on the key variable screening indexes Q₁ and Q₂, and completing feature selection according to the third key variable screening index Q_f to obtain the screened key variables;

and inputting the key variables into a machine learning classification algorithm for testing to obtain an actual prediction effect.

As a preferred embodiment of the PCA-based redundant variable screening improvement method of the present invention, wherein: the collected gas flowmeter features include: flowmeter temperature and flowmeter pressure;

the preprocessing comprises: preprocessing the data by a data standardization method; the data are a collected public data set.
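As a non-limiting illustration of the data standardization step, the following Python sketch applies z-score standardization to each feature column. The choice of z-score scaling, the function name `standardize` and the sample readings are assumptions made for illustration only; the source states only that a data standardization method is used.

```python
from statistics import mean, pstdev

def standardize(column):
    """Return the z-scores of one feature column (zero mean, unit variance)."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation
    if sigma == 0:
        # A constant column has no spread; map it to zeros instead of dividing by 0.
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]

# Hypothetical readings, for illustration only (not data from the source).
temperature = [20.0, 22.0, 24.0, 26.0]
pressure = [101.0, 103.0, 99.0, 105.0]

z_temperature = standardize(temperature)
z_pressure = standardize(pressure)
```

After this step the two features are expressed on a common, unit-free scale, which is what allows their correlations and variances to be compared in the later steps.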

As a preferred embodiment of the PCA-based redundant variable screening improvement method of the present invention, wherein: performing variable clustering on the preprocessed data comprises:

extracting a principal component P_z for each class, and calculating the Pearson correlation coefficient between each variable x_i in each class class_z and the principal component P_z of that class;

the variable x_i for which this Pearson correlation coefficient is largest is the most representative variable within its class class_z, and the variable corresponding to the maximum value is selected;

simultaneously calculating the Pearson correlation coefficient between each variable x_i in each class class_z and the principal components of the other classes; the variable x_i for which this Pearson correlation coefficient is smallest is the most weakly correlated with the other principal components and is therefore the most representative within its class class_z, and the variable corresponding to the minimum value is selected.

As a preferred embodiment of the PCA-based redundant variable screening improvement method of the present invention, wherein: calculating the first key variable screening index Q₁ comprises: based on the variable with the maximum similarity and the variable with the minimum similarity, denoting the correlation coefficient between x_i and P_z as R;

the first key variable screening index Q₁ is expressed as:

$$Q_1^i = \frac{1 - R_{own}^2}{1 - R_{nearest}^2}$$

wherein R_{own}^2 is the square of the correlation coefficient between each variable and the principal component of the group in which it is located, R_{nearest}^2 is the square of the maximum correlation coefficient between the variable and the principal components of all other groups, and Q_1^i is the Q₁ index of the i-th target variable.

As a preferred embodiment of the PCA-based redundant variable screening improvement method of the present invention, wherein: calculating the second key variable screening index Q₂ comprises: for the selected variable x_i of each class that is closest to P_z, calculating the information entropy and the variance of the variable x_i, which are used to assist in key variable screening; the second key variable screening index Q₂ is expressed as:

wherein Q_2^i is the Q₂ index of the i-th target variable, e_target is the information entropy of the target variable, S²_{x_i} is the variance of the variable x_i, and k is the sample size.

As a preferred embodiment of the PCA-based redundant variable screening improvement method of the present invention, wherein: calculating the third key variable screening index Q_f comprises:

when the value of the first key variable screening index Q₁ decreases and the value of the second key variable screening index Q₂ increases, the representativeness of the variable x_i within its group is enhanced;

the final weight is set as the ratio Q_f of the second key variable screening index Q₂ to the first key variable screening index Q₁; as the value of the third key variable screening index Q_f increases, the representativeness of the variable x_i within its group is enhanced, and the final key variables are screened out accordingly.

As a preferred embodiment of the PCA-based redundant variable screening improvement method of the present invention, wherein: the third key variable screening index Q_f is expressed as:

$$Q_f^i = \frac{Q_2^i}{Q_1^i}$$

wherein Q_f^i is the Q_f index of the i-th target variable.

In a second aspect, an embodiment of the present invention provides a redundant variable screening improvement apparatus based on PCA, including:

the data acquisition module is used for acquiring historical working data of the characteristics of the gas flowmeter and preprocessing the historical working data;

the variable clustering module is used for performing target variable clustering on the preprocessed data and, in combination with feature selection, calculating a first key variable screening index Q₁ and a second key variable screening index Q₂;

the calculation module is used for calculating a third key variable screening index Q_f based on the key variable screening indexes Q₁ and Q₂, and completing feature selection according to the third key variable screening index Q_f to obtain the screened key variables;

and the learning prediction module is used for inputting the key variables into a machine learning classification algorithm for testing to obtain an actual prediction effect.

In a third aspect, embodiments of the present invention provide a computing device comprising:

a memory and a processor;

the memory is configured to store computer-executable instructions that, when executed by the processor, cause the processor to implement the PCA-based redundant variable screening improvement method according to any embodiment of the invention.

In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the PCA-based redundant variable screening improvement method.

Compared with the prior art, the invention has the following beneficial effects: first, the original data are standardized, which balances the contribution of each feature, eliminates the influence of differing dimensions (units) and makes the data comparable; second, the variances of the original variables and the entropy of the target variable are used to calculate the second key variable screening index, so that more factors are considered in feature selection; finally, the ratio of the second key variable screening index to the first key variable screening index is used as the final key variable screening index, so that features are extracted from the original data more effectively and prediction accuracy is improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. Wherein:

FIG. 1 is a flow chart of a PCA-based redundant variable screening improvement method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of the variable clustering structure of the PCA-based redundant variable screening improvement method according to an embodiment of the present invention;

FIG. 3 is a graph comparing the prediction accuracy of the conventional method and of the method of the present invention according to an embodiment of the present invention;

FIG. 4 is a graph comparing the prediction precision of the conventional method and of the method of the present invention according to an embodiment of the present invention;

FIG. 5 is a graph comparing the prediction recall of the conventional method and of the method of the present invention according to an embodiment of the present invention;

FIG. 6 is a graph comparing the prediction F1 score of the conventional method and of the method of the present invention according to an embodiment of the present invention.

Detailed Description

So that the manner in which the above recited objects, features and advantages of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to the embodiments, some of which are illustrated in the appended drawings. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.

Further, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic can be included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.

While the embodiments of the present invention have been illustrated and described in detail in the drawings, the cross-sectional view of the device structure is not to scale in the general sense for ease of illustration, and the drawings are merely exemplary and should not be construed as limiting the scope of the invention. In addition, the three-dimensional dimensions of length, width and depth should be included in actual fabrication.

Also in the description of the present invention, it should be noted that the orientation or positional relationship indicated by the terms "upper, lower, inner and outer", etc. are based on the orientation or positional relationship shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first, second, or third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

The terms "mounted, connected, and coupled" should be construed broadly in this disclosure unless otherwise specifically indicated and defined, such as: can be fixed connection, detachable connection or integral connection; it may also be a mechanical connection, an electrical connection, or a direct connection, or may be indirectly connected through an intermediate medium, or may be a communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.

Example 1

Referring to fig. 1-2, in one embodiment of the present invention, a redundant variable screening improvement method based on PCA is provided, including:

S101, collecting historical working data of the characteristics of a gas flowmeter, and preprocessing the historical working data;

specifically, the collected gas flowmeter features include: flowmeter temperature and flowmeter pressure;

S102, performing target variable clustering on the preprocessed data and, in combination with feature selection, calculating a first key variable screening index Q₁ and a second key variable screening index Q₂;

Further, performing variable clustering on the preprocessed data comprises:

extracting a principal component P_z for each class, and calculating the Pearson correlation coefficient between each variable x_i in each class class_z and the principal component P_z of that class;

the variable x_i for which this Pearson correlation coefficient is largest is the most representative variable within its class class_z, and the variable corresponding to the maximum value is selected;

simultaneously calculating the Pearson correlation coefficient between each variable x_i in each class class_z and the principal components of the other classes; the variable x_i for which this Pearson correlation coefficient is smallest is the most weakly correlated with the other principal components and is therefore the most representative within its class class_z, and the variable corresponding to the minimum value is selected.
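As a non-limiting illustration of the clustering step above, the following Python sketch extracts the first principal component of one class and computes the Pearson correlation coefficient between each variable of the class and that component. The use of power iteration on the class correlation matrix, the helper names and the sample values are assumptions made for illustration; the source does not state how the principal component is computed.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def standardize(col):
    """Z-score one column so that every variable has equal weight in the PCA."""
    n = len(col)
    mu = sum(col) / n
    sd = sqrt(sum((v - mu) ** 2 for v in col) / n)
    return [(v - mu) / sd for v in col]

def class_principal_component(columns, iterations=500):
    """Scores of the first principal component P_z of one class of variables.

    Power iteration on the correlation matrix of the class; an illustrative
    choice, not a method stated in the source.
    """
    z = [standardize(c) for c in columns]
    m = len(z)
    corr = [[pearson(z[i], z[j]) for j in range(m)] for i in range(m)]
    # Asymmetric start vector, so that symmetric inputs do not start
    # orthogonal to the dominant eigenvector.
    w = [1.0 / (i + 1) for i in range(m)]
    for _ in range(iterations):
        w = [sum(corr[i][j] * w[j] for j in range(m)) for i in range(m)]
        norm = sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    samples = len(z[0])
    return [sum(w[i] * z[i][s] for i in range(m)) for s in range(samples)]

# Hypothetical class with three variables (values are illustrative only).
class_1 = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x5": [1.1, 1.9, 3.2, 3.8, 5.1],
    "x8": [2.0, 1.0, 4.0, 3.0, 5.0],
}
p_1 = class_principal_component(list(class_1.values()))
r_own = {name: pearson(col, p_1) for name, col in class_1.items()}

# The variable with the largest |correlation| to P_z is the most representative.
representative = max(r_own, key=lambda name: abs(r_own[name]))
```

The same `pearson` helper could be applied between a variable and the principal components of the other classes to find the variable that is most weakly correlated with them.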

FIG. 2 is a schematic diagram of the variable clustering structure of the PCA-based redundant variable screening improvement method of the present invention. Referring to FIG. 2, in an alternative embodiment, the input variables are assumed to be X = {x₁, x₂, …, x₁₅}. The linear correlation among the variables is calculated to obtain the variable correlation matrix R:

$$R = (r_{ij})_{15 \times 15}$$

where r_{ij} is the linear correlation coefficient between x_i and x_j.

The original variables are divided into λ classes according to their correlation coefficients. Assuming λ = 4, the classes are class₁, class₂, class₃ and class₄, and the variables are distributed among the classes as follows:

S201: class₁ = {x₁, x₅, x₈};

S202: class₂ = {x₂, x₃, x₆, x₁₀};

S203: class₃ = {x₄, x₁₁, x₁₂, x₁₃, x₁₅};

S204: class₄ = {x₇, x₉, x₁₄}.
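As a non-limiting illustration of the correlation matrix and of the division into classes, the following Python sketch builds the Pearson correlation matrix of a small set of variables and groups them greedily. The greedy threshold rule, the threshold value 0.8, the helper names and the sample values are all assumptions made for illustration; the source does not specify the rule by which variables are assigned to the λ classes.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def correlation_matrix(data):
    """Correlation matrix R as a nested dict: R[p][q] = r between p and q."""
    names = list(data)
    return {p: {q: pearson(data[p], data[q]) for q in names} for p in names}

def group_by_correlation(data, threshold=0.8):
    """Hypothetical greedy grouping (an assumption, not the source's rule).

    A variable joins the first class whose seed (first) variable it correlates
    with at |r| >= threshold; otherwise it starts a new class.
    """
    r = correlation_matrix(data)
    classes = []
    for name in data:
        for members in classes:
            if abs(r[name][members[0]]) >= threshold:
                members.append(name)
                break
        else:
            classes.append([name])
    return classes

# Hypothetical variables, for illustration only.
data = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [1.2, 1.9, 3.1, 4.2, 4.8],
    "c": [3.0, 1.0, 4.0, 1.0, 5.0],
}
R = correlation_matrix(data)
classes = group_by_correlation(data)
```

With these sample values, "a" and "b" are strongly correlated and fall into the same class, while "c" starts a class of its own.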

Further, calculating the first key variable screening index Q₁ comprises: based on the variable with the maximum similarity and the variable with the minimum similarity, denoting the correlation coefficient between x_i and P_z as R;

the first key variable screening index Q₁ is expressed as:

$$Q_1^i = \frac{1 - R_{own}^2}{1 - R_{nearest}^2}$$

wherein R_{own}^2 is the square of the correlation coefficient between each variable and the principal component of the group in which it is located, R_{nearest}^2 is the square of the maximum correlation coefficient between the variable and the principal components of all other groups, and Q_1^i is the Q₁ index of the i-th target variable.
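As a non-limiting illustration, the following Python sketch computes a first screening index from the correlation of a variable with the principal component of its own group and with those of the other groups. It assumes that Q₁ takes the form of the 1 − R² ratio commonly used in principal-component-based variable clustering, (1 − R²_own) / (1 − R²_nearest); this form, the function name `q1_index` and the sample correlations are assumptions, chosen because they agree with the statement that a smaller Q₁ indicates a more representative variable.

```python
def q1_index(r_own, r_others):
    """Assumed form of Q1: (1 - R_own^2) / (1 - R_nearest^2).

    r_own    -- correlation of the variable with its own group's principal component
    r_others -- correlations of the variable with the other groups' principal components
    """
    r2_own = r_own ** 2
    # The "nearest" other group is the one with the largest squared correlation.
    r2_nearest = max(r ** 2 for r in r_others)
    if r2_nearest >= 1.0:
        raise ValueError("variable is perfectly correlated with another group")
    return (1.0 - r2_own) / (1.0 - r2_nearest)

# Hypothetical correlations, for illustration only.
q1 = q1_index(r_own=0.9, r_others=[0.3, -0.5, 0.1])
```

Under this assumed form, a variable that is strongly tied to its own group and weakly tied to every other group obtains a small Q₁.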

Further, calculating the second key variable screening index Q₂ comprises: for the selected variable x_i of each class that is closest to P_z, calculating the information entropy and the variance of the variable x_i, which are used to assist in key variable screening;

the second key variable screening index Q₂ is expressed as:

wherein Q_2^i is the Q₂ index of the i-th target variable, e_target is the information entropy of the target variable, S²_{x_i} is the variance of the variable x_i, and k is the sample size.

It should be noted that the key variable screening index Q₂, which incorporates information entropy, is introduced to assist in key variable screening. Information entropy is a measure of uncertainty: the greater the uncertainty of the target variable, the larger its entropy and the more information it contains; the smaller the uncertainty of the target variable, the smaller its entropy and the less information it contains. The entropy value can be used to judge the randomness and degree of disorder of an event, and also the degree of dispersion of an index; the greater the dispersion of an index, the greater the influence (weight) of that variable on the comprehensive evaluation.

The information entropy of the target variable is expressed as:

$$e_{target} = -\frac{1}{\ln k} \sum_{i=1}^{k} y_{ij} \ln y_{ij}$$

where k is the sample size and y_{ij} is the proportion of the j-th index of the target variable.
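As a non-limiting illustration, the following Python sketch computes a normalized information entropy for a target variable from the proportions of its sample values. It assumes the form used in the entropy weight method, e = −(1 / ln k) Σ y ln y, with non-negative sample values (for example after min-max scaling) and with 0·ln 0 treated as 0; the function name and the sample values are hypothetical.

```python
from math import log

def target_entropy(values):
    """Normalized entropy in [0, 1] of a sequence of non-negative values.

    Each value is turned into its proportion y of the column total, then
    e = -(1 / ln k) * sum(y * ln y), where k is the number of samples.
    """
    k = len(values)
    total = sum(values)
    proportions = [v / total for v in values]
    # Terms with y == 0 contribute nothing (the limit of y * ln y is 0).
    s = sum(p * log(p) for p in proportions if p > 0)
    return -s / log(k)

# Hypothetical target columns, for illustration only.
e_uniform = target_entropy([1.0, 1.0, 1.0, 1.0])       # evenly spread values
e_concentrated = target_entropy([4.0, 0.0, 0.0, 0.0])  # all mass in one sample
```

Evenly spread proportions give the maximum normalized entropy of 1, while proportions concentrated in a single sample give an entropy of 0.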

It should also be noted that variance measures the degree of dispersion of a random variable or a set of data, and can also be used to infer the degree of variation of a variable. The larger the variance of an input variable, the more dispersed its values and the greater its degree of variation, the more information it provides, and the more strongly that factor level or interaction affects the target variable.

The variance is calculated as:

$$S_{x_i}^2 = \frac{1}{k-1} \sum_{j=1}^{k} \left(x_{ij} - \bar{x}_i\right)^2$$

wherein S²_{x_i} is the sample variance of the input variable x_i, k is the sample size, and \bar{x}_i is the sample mean.

Specifically, the analysis of information entropy and variance shows that the larger the value of e_target, the more information is contained.

The larger S²_{x_i} is, the greater the degree of variation of the input variable x_i and the more it affects the target variable. To assess the importance of each input variable, the ratio of the variance of that input variable to the sum of the variances of all variables is calculated; the larger this ratio, the more information the input variable contains and the more representative it is.
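As a non-limiting illustration of the variance ratio described above, the following Python sketch computes the sample variance of each input variable and its share of the sum of all variances. The function name `variance_shares` and the sample columns are hypothetical, and the use of the k − 1 denominator (sample variance) is an assumption.

```python
from statistics import variance

def variance_shares(columns):
    """Map each variable name to its variance divided by the sum of all variances."""
    variances = {name: variance(col) for name, col in columns.items()}  # k - 1 denominator
    total = sum(variances.values())
    return {name: v / total for name, v in variances.items()}

# Hypothetical input variables, for illustration only.
columns = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],  # sample variance 2.5
    "x2": [0.0, 0.0, 0.0, 0.0, 5.0],  # sample variance 5.0
    "x3": [1.0, 3.0, 5.0, 7.0, 9.0],  # sample variance 10.0
}
shares = variance_shares(columns)
most_variable = max(shares, key=shares.get)
```

Note that this ratio is only meaningful on the raw or range-scaled data: after z-score standardization every variable has the same variance and all shares become equal.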

S103, calculating a third key variable screening index Q_f based on the key variable screening indexes Q₁ and Q₂, and completing feature selection according to the third key variable screening index Q_f to obtain the screened key variables;

in an alternative embodiment, the screened data are input into machine learning classification algorithms such as K-nearest neighbors, random forest and neighbourhood component analysis to obtain the actual prediction effect.

Further, the third key variable screening index Q_f is calculated as the ratio of the second key variable screening index Q₂ to the first key variable screening index Q₁, and is expressed as:

$$Q_f^i = \frac{Q_2^i}{Q_1^i}$$

wherein Q_f^i is the Q_f index of the i-th target variable.
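As a non-limiting illustration, the following Python sketch forms the third index as the ratio Q_f = Q₂ / Q₁ and keeps, for each class, the variable with the largest Q_f. Selecting exactly one variable per class, the function name and the numerical Q₁ and Q₂ values are assumptions made for illustration; only the class memberships are taken from the example classes class₁ and class₄ of this embodiment.

```python
def select_key_variables(classes, q1, q2):
    """Return, for each class, the member variable with the largest Q_f = Q2 / Q1."""
    selected = {}
    for class_name, members in classes.items():
        q_f = {var: q2[var] / q1[var] for var in members}
        selected[class_name] = max(q_f, key=q_f.get)
    return selected

classes = {
    "class1": ["x1", "x5", "x8"],
    "class4": ["x7", "x9", "x14"],
}
# Hypothetical index values, for illustration only.
q1 = {"x1": 0.25, "x5": 0.40, "x8": 0.10, "x7": 0.20, "x9": 0.50, "x14": 0.30}
q2 = {"x1": 0.50, "x5": 0.60, "x8": 0.10, "x7": 0.10, "x9": 0.50, "x14": 0.60}

key_variables = select_key_variables(classes, q1, q2)
```

Because Q_f grows when Q₂ increases or Q₁ decreases, ranking by Q_f replaces the manual choice of a representative variable with a single numerical criterion.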

S104, inputting the key variables into a machine learning classification algorithm for testing to obtain an actual prediction effect.
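As a non-limiting illustration of the testing step, the following Python sketch classifies a sample described by two screened key variables with a minimal K-nearest-neighbor rule. The implementation, the value k = 3, the class labels "normal" and "fault" and the sample points are all hypothetical and merely stand in for whichever classification algorithm is actually used.

```python
from collections import Counter
from math import dist

def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical standardized samples of two key variables, for illustration only.
train_x = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
train_y = ["normal", "normal", "normal", "fault", "fault", "fault"]

prediction_a = knn_predict(train_x, train_y, (0.5, 0.5))
prediction_b = knn_predict(train_x, train_y, (5.5, 5.5))
```

Comparing such predictions with the true labels of held-out samples yields the actual prediction effect of the screened key variables.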

The foregoing is a schematic scheme of a redundant variable screening improvement method based on PCA of this embodiment. It should be noted that, the technical solution of the PCA-based redundant variable screening improvement device and the technical solution of the PCA-based redundant variable screening improvement method described above belong to the same concept, and details of the technical solution of the PCA-based redundant variable screening improvement device in this embodiment, which are not described in detail, can be referred to the description of the technical solution of the PCA-based redundant variable screening improvement method described above.

In this embodiment, a redundant variable screening improvement device based on PCA includes:

the variable clustering module is used for performing variable clustering on the preprocessed data, performing feature selection, and calculating a first key variable screening index Q₁ and a second key variable screening index Q₂;

the calculation module is used for calculating a third key variable screening index Q_f based on the key variable screening indexes Q₁ and Q₂, and completing feature selection according to the third key variable screening index Q_f to obtain the screened data;

the learning prediction module is used for inputting the data into a machine learning classification algorithm for testing to obtain the actual prediction effect.

This embodiment also provides a computing device suitable for carrying out the PCA-based redundant variable screening improvement method, comprising:

a memory and a processor; the memory is configured to store computer executable instructions and the processor is configured to execute the computer executable instructions to implement the PCA-based redundant variable screening improvement method as set forth in the above embodiments.

The computer device may be a terminal comprising a processor, a memory, a communication interface, a display screen and input means connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.

The present embodiment also provides a storage medium having stored thereon a computer program which, when executed by a processor, implements a redundant variable screening improvement method based on PCA as set forth in the above embodiments.

The storage medium according to the present embodiment belongs to the same inventive concept as the PCA-based redundant variable screening improvement method according to the above embodiment; technical details not described in detail in the present embodiment can be found in the above embodiment, and the present embodiment has the same advantageous effects as the above embodiment.

Example 2

Referring to FIGS. 3 to 6, in this embodiment, key variables of an ultrasonic flowmeter data set are screened using both the conventional principal-component-based redundant variable screening method and the method of the present invention, and the results are verified on the K-nearest neighbor (KNN), random forest (RF) and neighbourhood component analysis (NCA) classification algorithms, respectively.

To reduce experimental error, 8 different data partitions are set up, and the classification accuracy, precision, recall and F1 score of each algorithm are compared. The accuracy results are shown in FIG. 3; experimental verification on the 8 data partitions shows that the recognition accuracy is effectively improved.

To further demonstrate the performance advantage of the method, precision, recall and F1 score are also used to compare the classification performance of the algorithms, as shown in FIGS. 4 to 6. The comparison shows that the method outperforms the conventional redundant variable screening method on all three metrics, and that its precision, recall and F1 score improve to varying degrees under all three classical classification algorithms.
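As a non-limiting illustration of the four evaluation metrics compared above, the following Python sketch computes accuracy, precision, recall and F1 score for a binary classification result. The function name, the choice of 1 as the positive class and the label sequences are hypothetical and are not experimental data from this embodiment.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, f1) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Hypothetical labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
accuracy, precision, recall, f1 = classification_metrics(y_true, y_pred)
```

In an evaluation over several data partitions, these values would be computed per partition and then averaged.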

For visual evaluation of the improvement effect of the algorithm, average values of 8 experimental results are respectively taken for comparison, and the comparison results are shown in tables 1-4:

table 1 classification accuracy of ultrasonic flowmeter dataset after dimension reduction by KNN, RF, NCA algorithm

As can be seen from Table 1, the average accuracy of the method of the present invention over the three classical classification algorithms is about 76%, which is about a 7% improvement over the conventional method.

TABLE 2 Classification accuracy of ultrasonic flowmeter dataset after dimension reduction by KNN, RF, NCA algorithm

As can be seen from Table 2, the average accuracy of the method of the present invention over the three classification algorithms is about 80%, which is improved by about 4% compared to the conventional method.

Table 3 classified recall rate of ultrasonic flow meter dataset after dimension reduction via KNN, RF, NCA algorithm

As can be seen from Table 3, the average recall of the method of the present invention over the three classification algorithms is about 78%, which is an improvement of about 6% over the conventional method.

Table 4 f1-score of the KNN, RF, and NCA algorithms on the ultrasonic flowmeter data set after dimension reduction

As can be seen from Table 4, the average f1-score of the method of the present invention over the three classification algorithms is about 79%, which is an improvement of about 5% over the conventional method.
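The averaging behind Tables 1-4 can be sketched as below. The helper names `average_runs` and `improvement_points` are hypothetical. The text does not say whether the reported improvements are percentage-point differences or relative gains; this sketch assumes percentage points.

```python
from statistics import mean


def average_runs(runs):
    """Average each metric over a list of per-run metric dictionaries."""
    if not runs:
        raise ValueError("at least one run is required")
    return {key: mean(run[key] for run in runs) for key in runs[0]}


def improvement_points(baseline, improved):
    """Difference between two averaged results, in percentage points.

    Assumption: the improvement is an absolute difference of the averages.
    """
    return {key: 100.0 * (improved[key] - baseline[key]) for key in baseline}
```

In the setting described here, `runs` would hold one dictionary per data division condition (8 in total) for a given classifier and screening method.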

As can be seen from the comparison, the accuracy, the precision, the recall rate, and the f1-score are all improved by the method, which shows that the method effectively improves the performance of the principal-component-based redundant variable screening algorithm. Furthermore, the method adds several consideration indexes to the original PCA-based redundant variable screening algorithm, which addresses the low prediction precision of machine learning models caused by human intervention, extracts the features of the original data better, and improves the prediction precision.
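As stated in the abstract, the final key variable screening index is the ratio of the second key variable screening index to the first. A minimal sketch of that step is given below, assuming Q₁ and Q₂ have already been computed for each candidate variable. The function names, and the choice to rank larger Q_f values first, are assumptions made for illustration and are not stated in this section.

```python
def third_index(q1, q2):
    """Compute Q_f = Q2 / Q1 for each candidate variable."""
    if len(q1) != len(q2):
        raise ValueError("q1 and q2 must have the same length")
    if any(value == 0 for value in q1):
        raise ValueError("Q1 must be non-zero for every variable")
    return [b / a for a, b in zip(q1, q2)]


def rank_by_index(names, q_f, descending=True):
    """Order variable names by Q_f.

    The ranking direction is an assumption; set descending=False to invert it.
    """
    order = sorted(range(len(names)), key=lambda i: q_f[i], reverse=descending)
    return [names[i] for i in order]


# Hypothetical index values, used only to show the call pattern.
q_f = third_index([0.5, 0.25], [1.0, 1.0])
ranking = rank_by_index(["flowmeter temperature", "flowmeter pressure"], q_f)
```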

It should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that the technical solution of the present invention may be modified or equivalently substituted without departing from its spirit and scope, and all such modifications and substitutions are intended to be covered by the claims of the present invention.

Claims

1. A redundant variable screening improvement method based on PCA, comprising:

2. The PCA-based redundant variable screening improvement method of claim 1, wherein:

the collected gas flow meter features include: flowmeter temperature, flowmeter pressure;

3. The PCA-based redundant variable screening improvement method of claim 2, wherein performing target variable clustering on the preprocessed data comprises:

4. The PCA-based redundant variable screening improvement method of claim 3, wherein calculating the first key variable screening index Q₁ comprises: based on the variable with the maximum similarity and the variable with the minimum similarity, denoting the correlation coefficient of x_i and P_z as R;

the first key variable screening index Q₁ is expressed as:

wherein

is the Q₁ index of the i-th target variable.

5. The PCA-based redundant variable screening improvement method of claim 4, wherein calculating the second key variable screening index Q₂ comprises: based on the variable x_i selected in each class that is closest to P_z, calculating the information entropy of the variable x_i and the variance of the variable x_i, which are used to assist in key variable screening;

the second key variable screening index Q₂ is expressed as:

wherein

corresponds to the variable x_i, and K is the sample size.

6. The PCA-based redundant variable screening improvement method of claim 4 or 5, wherein calculating the third key variable screening index Q_f comprises:

7. The PCA-based redundant variable screening improvement method of claim 6, further comprising: the third key variable screening index Q_f is expressed as:

wherein

is the Q_f index of the i-th target variable.

8. A PCA-based redundant variable screening improvement device, characterized by comprising:

a calculation module for calculating a third key variable screening index Q_f based on the key variable screening indexes Q₁ and Q₂, and completing feature selection according to the third key variable screening index Q_f to obtain the screened key variables;

9. A computing device, comprising:

a memory and a processor;

the memory is configured to store computer-executable instructions that, when executed by the processor, perform the steps of the PCA-based redundant variable screening improvement method of any one of claims 1 to 7.

10. A computer-readable storage medium storing computer-executable instructions which, when executed by a processor, perform the steps of the PCA-based redundant variable screening improvement method of any one of claims 1 to 7.