CN116401564A - PCA-based redundant variable screening improvement method and device - Google Patents


Info

Publication number
CN116401564A
Authority
CN
China
Prior art keywords
variable
screening
key
index
pca
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310299411.4A
Other languages
Chinese (zh)
Inventor
岳喜超
王勇
刘蔚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Yunjian Information Technology Co ltd
Shanghai Electric Power University
Original Assignee
Shanghai Yunjian Information Technology Co ltd
Shanghai Electric Power University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Yunjian Information Technology Co ltd, Shanghai Electric Power University filed Critical Shanghai Yunjian Information Technology Co ltd
Priority to CN202310299411.4A priority Critical patent/CN116401564A/en
Publication of CN116401564A publication Critical patent/CN116401564A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/15Correlation function computation including computation of convolution operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Medical Informatics (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a PCA-based redundant variable screening improvement method and device. The method comprises the following steps: collecting historical working data of the gas flowmeter characteristics and preprocessing the historical working data; performing target variable clustering on the preprocessed data and, in combination with feature selection, calculating a first key variable screening index $Q_1$ and a second key variable screening index $Q_2$; calculating a third key variable screening index $Q_f$ based on the key variable screening indexes $Q_1$ and $Q_2$, and completing feature selection according to the third key variable screening index $Q_f$ to obtain the screened key variables; and inputting the key variables into a machine learning classification algorithm for testing to obtain the actual prediction effect. The invention standardizes the original data, which balances the contribution of each feature and eliminates the influence of measurement dimensions; it adds further considerations to feature selection; and it uses the ratio of the second key variable screening index to the first key variable screening index as the final key variable screening index, thereby improving prediction accuracy.

Description

PCA-based redundant variable screening improvement method and device
Technical Field
The invention relates to the technical field of data mining, in particular to a redundant variable screening improvement method and device based on PCA.
Background
In recent years, with the continuous development of information technology, the explosive growth of data has led to ever higher data complexity and a growing variety of data types, resulting in the "curse of dimensionality". Traditional data mining techniques face significant challenges in processing high-dimensional data, with increasing demands on resources and time. Reducing the dimension of the feature data can improve algorithm performance. Methods for reducing data dimension are mainly divided into feature transformation methods and feature selection methods.
Data dimension reduction algorithms are widely applied in fields such as geography, medicine and simulation, and feature selection algorithms have long been studied extensively by researchers at home and abroad. A feature selection method uses a feature selection algorithm to select, from the original feature set, the feature subset that scores best under an evaluation criterion, which helps researchers perform classification and regression tasks and improves the accuracy and efficiency of data classification. One study calculates the Pearson correlation coefficients of the features to judge the strength of their relationships, determines an optimal threshold for extracting features, and then evaluates classification on models such as K-nearest neighbors, decision trees and random forests, with good results. Chen Liang et al. convert continuous optimization based on sine and cosine functions into binary optimization for feature selection, mapping individual positions to feature subsets, which effectively selects an optimal feature subset, reduces the feature dimension and improves data classification accuracy; however, the algorithm requires too many iterations and does not come close to the optimal solution. Another study introduces correlation-based feature selection (CFS) to obtain an optimal feature subset for dimension reduction and selects partial least squares regression (PLSR) as the core modeling algorithm, which effectively mitigates the harm caused by multiple correlations among variables.
Li Jingxing et al. analyze the relevance and redundancy of features using the maximal information coefficient as the measure, obtaining a Markov blanket representation set of the class attribute and a suboptimal feature subset; this improves classification accuracy in the test stage and achieves a marked dimension reduction effect. Li Xinqian et al. remove irrelevant features with a mutual information method, obtain the number of clusters with a particle swarm algorithm, and finally combine, from each cluster, the feature with the highest mutual information with the category into a feature subset, which effectively reduces the correlation among features and improves the classification performance of the algorithm. Wang Lichun et al. preprocess high-dimensional imbalanced data sets by combining the SMOTE algorithm with random undersampling and introduce a clustering algorithm to improve SMOTE, so that every evaluation index is higher when processing high-dimensional imbalanced data; however, the overall run time of the algorithm has no significant advantage over other algorithms. Xu Zhaozhao et al. calculate the information gain ratio of each feature, group the features into equal-density groups according to their information density, and finally search the equal-density feature groups with a grouping evolution genetic algorithm, obtaining good results on UCI medical data sets; however, the effect on high-dimensional small-sample data is not ideal.
There are various methods for solving such problems, such as the redundant variable screening algorithm based on principal component analysis (PCA). However, that algorithm still requires human intervention when selecting key variables: skilled personnel must make the selection, which introduces a degree of randomness, so the prediction accuracy is unstable in the subsequent machine learning model training stage.
Disclosure of Invention
This section is intended to outline some aspects of embodiments of the invention and to briefly introduce some preferred embodiments. Some simplifications or omissions may be made in this section as well as in the description summary and in the title of the application, to avoid obscuring the purpose of this section, the description summary and the title of the invention, which should not be used to limit the scope of the invention.
The present invention has been made in view of the above-described problems occurring in the prior art.
Therefore, the invention provides a PCA-based redundant variable screening improvement method and device, which solve the problem that, in existing feature selection algorithms, selecting key variables by expert experience leads to low prediction accuracy of the machine learning model.
In order to solve the technical problems, the invention provides the following technical scheme:
in a first aspect, an embodiment of the present invention provides a redundant variable screening improvement method based on PCA, including:
collecting historical working data of the characteristics of the gas flowmeter, and preprocessing the historical working data;
performing target variable clustering on the preprocessed data and, in combination with feature selection, calculating a first key variable screening index $Q_1$ and a second key variable screening index $Q_2$;
calculating a third key variable screening index $Q_f$ based on the key variable screening indexes $Q_1$ and $Q_2$, and completing feature selection according to the third key variable screening index $Q_f$ to obtain the screened key variables;
and inputting the key variables into a machine learning classification algorithm for testing to obtain an actual prediction effect.
As a preferred embodiment of the PCA-based redundant variable screening improvement method of the present invention, wherein: the collected gas flow meter features include: flowmeter temperature, flowmeter pressure;
the preprocessing comprises: preprocessing the data by a data standardization method; the data is a collected public data set.
As a preferred embodiment of the PCA-based redundant variable screening improvement method of the present invention, wherein: variable clustering is carried out on the preprocessed data, and the variable clustering method comprises the following steps:
extracting a principal component $P_z$ for each class, and calculating the Pearson correlation coefficient between each variable $x_i$ in each $\mathrm{class}_z$ and the principal component $P_z$ of that class;
when the Pearson correlation coefficient is at its maximum, the variable $x_i$ is the most representative in its $\mathrm{class}_z$ group, and the variable corresponding to the maximum is selected;
simultaneously calculating the Pearson correlation coefficient between each variable $x_i$ in each $\mathrm{class}_z$ and the principal components $P_z$ of the other classes; when this Pearson correlation coefficient is at its minimum, the variable $x_i$ is most weakly correlated with the other principal components and is the most representative in its $\mathrm{class}_z$ group, and the variable corresponding to the minimum is selected.
As a preferred embodiment of the PCA-based redundant variable screening improvement method of the present invention, wherein: calculating the first key variable screening index $Q_1$ comprises: based on the variable with the maximum similarity and the variable with the minimum similarity, denoting the correlation coefficient between $x_i$ and $P_z$ as $R$;
the first key variable screening index $Q_1$ is expressed as:
[Equation image BDA0004144454020000031 (formula for $Q_1$) not reproduced in this text]
wherein the terms of the formula are, in order: the square of the correlation coefficient between each variable and the principal component of the group in which it is located; the square of the maximum correlation coefficient between the variable and the principal components of all other groups; and $Q_1^i$, the $Q_1$ index of the $i$-th target variable.
As a preferred embodiment of the PCA-based redundant variable screening improvement method of the present invention, wherein: calculating the second key variable screening index $Q_2$ comprises: for the variable $x_i$ selected in each class as nearest to $P_z$, calculating the information entropy and the variance of the variable $x_i$, the information entropy being used to assist key variable screening; the second key variable screening index $Q_2$ is expressed as:
[Equation image BDA0004144454020000035 (formula for $Q_2$) not reproduced in this text]
wherein $Q_2^i$ is the $Q_2$ index of the $i$-th target variable, $e_{target}$ is the information entropy of the target variable, $S_{x_i}^2$ is the variance of the variable $x_i$, and $k$ is the sample size.
As a preferred embodiment of the PCA-based redundant variable screening improvement method of the present invention, wherein: calculating a third key variable screening index Q f Comprising:
when the value of the first key variable screening index $Q_1$ decreases and the value of the second key variable screening index $Q_2$ increases, the representativeness of the variable $x_i$ within its group is enhanced;
the final weight is set as the ratio $Q_f$ of the second key variable screening index $Q_2$ to the first key variable screening index $Q_1$; as the value of the third key variable screening index $Q_f$ increases, the representativeness of the variable $x_i$ within its group is enhanced, and the final key variables are screened out accordingly.
As a preferred embodiment of the PCA-based redundant variable screening improvement method of the present invention, wherein: the third key variable screening index $Q_f$ is expressed as:
$$Q_f^i = \frac{Q_2^i}{Q_1^i}$$
wherein $Q_f^i$ is the $Q_f$ index of the $i$-th target variable.
In a second aspect, an embodiment of the present invention provides a redundant variable screening improvement apparatus based on PCA, including:
the data acquisition module is used for acquiring historical working data of the characteristics of the gas flowmeter and preprocessing the historical working data;
the variable clustering module is used for performing target variable clustering on the preprocessed data and, in combination with feature selection, calculating a first key variable screening index $Q_1$ and a second key variable screening index $Q_2$;
the calculation module is used for calculating a third key variable screening index $Q_f$ based on the key variable screening indexes $Q_1$ and $Q_2$, and completing feature selection according to the third key variable screening index $Q_f$ to obtain the screened key variables;
and the learning prediction module is used for inputting the key variables into a machine learning classification algorithm for testing to obtain an actual prediction effect.
In a third aspect, embodiments of the present invention provide a computing device comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions that, when executed by the processor, cause the processor to implement the PCA-based redundant variable screening improvement method according to any embodiment of the invention.
In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the PCA-based redundant variable screening improvement method.
Compared with the prior art, the invention has the following beneficial effects: first, the original data is standardized, which balances the contribution of each feature, eliminates the influence of measurement dimensions and makes the data comparable; second, the variances of the original variables and the entropy of the target variable are used to calculate the second key variable screening index, which adds further considerations to feature selection; and finally, the ratio of the second key variable screening index to the first key variable screening index is taken as the final key variable screening index, so that features can be extracted from the original data more effectively and prediction accuracy is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. Wherein:
FIG. 1 is a flow chart of a PCA-based redundant variable screening improvement method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the variable clustering structure of a PCA-based redundant variable screening improvement method according to an embodiment of the present invention;
FIG. 3 is a graph comparing the prediction accuracy of the conventional method and the method of the present invention according to an embodiment of the present invention;
FIG. 4 is a graph comparing the prediction precision of the conventional method and the method of the present invention according to an embodiment of the present invention;
FIG. 5 is a graph comparing the prediction recall of the conventional method and the method of the present invention according to an embodiment of the present invention;
FIG. 6 is a graph comparing the predicted F1 score of the conventional method and the method of the present invention according to an embodiment of the present invention.
Detailed Description
So that the manner in which the above recited objects, features and advantages of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to the embodiments, some of which are illustrated in the appended drawings. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.
Further, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic can be included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
While the embodiments of the present invention have been illustrated and described in detail in the drawings, the cross-sectional view of the device structure is not to scale in the general sense for ease of illustration, and the drawings are merely exemplary and should not be construed as limiting the scope of the invention. In addition, the three-dimensional dimensions of length, width and depth should be included in actual fabrication.
Also in the description of the present invention, it should be noted that the orientation or positional relationship indicated by the terms "upper, lower, inner and outer", etc. are based on the orientation or positional relationship shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first, second, or third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
The terms "mounted, connected, and coupled" should be construed broadly in this disclosure unless otherwise specifically indicated and defined, such as: can be fixed connection, detachable connection or integral connection; it may also be a mechanical connection, an electrical connection, or a direct connection, or may be indirectly connected through an intermediate medium, or may be a communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
Example 1
Referring to fig. 1-2, in one embodiment of the present invention, a redundant variable screening improvement method based on PCA is provided, including:
s101, collecting historical working data of characteristics of a gas flowmeter, and preprocessing the historical working data;
specifically, the collected gas flow meter features include: flowmeter temperature, flowmeter pressure;
the preprocessing comprises: preprocessing the data by a data standardization method; the data is a collected public data set.
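The standardization step in S101 can be illustrated with a minimal sketch. The text only says "a data standardization method", so z-score scaling is an assumption here, and the function name `standardize`, the column names and the readings are hypothetical.

```python
from statistics import fmean, pstdev

def standardize(columns):
    """Z-score each column: subtract its mean and divide by its standard deviation."""
    scaled = {}
    for name, values in columns.items():
        mu = fmean(values)
        sigma = pstdev(values)
        # A constant column has zero spread; map it to zeros instead of dividing by zero.
        scaled[name] = [0.0 if sigma == 0 else (v - mu) / sigma for v in values]
    return scaled

# Hypothetical flowmeter readings; names and numbers are illustrative only.
raw = {
    "temperature": [20.0, 22.0, 24.0, 26.0],
    "pressure": [101.0, 103.0, 105.0, 107.0],
}
scaled = standardize(raw)
```

After scaling, both columns have zero mean and unit spread, so neither dominates a later correlation or distance calculation merely because of its measurement unit.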
S102, performing target variable clustering on the preprocessed data and, in combination with feature selection, calculating a first key variable screening index $Q_1$ and a second key variable screening index $Q_2$;
Further, variable clustering is performed on the preprocessed data, including:
extracting a principal component $P_z$ for each class, and calculating the Pearson correlation coefficient between each variable $x_i$ in each $\mathrm{class}_z$ and the principal component $P_z$ of that class;
when the Pearson correlation coefficient is at its maximum, the variable $x_i$ is the most representative in its $\mathrm{class}_z$ group, and the variable corresponding to the maximum is selected;
simultaneously calculating the Pearson correlation coefficient between each variable $x_i$ in each $\mathrm{class}_z$ and the principal components $P_z$ of the other classes; when this Pearson correlation coefficient is at its minimum, the variable $x_i$ is most weakly correlated with the other principal components and is the most representative in its $\mathrm{class}_z$ group, and the variable corresponding to the minimum is selected.
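The first of these steps, correlating every variable with the principal component of its own class, can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the principal component is obtained by plain power iteration on the covariance matrix, the squared correlation is compared because the sign of a principal component is arbitrary, and the helper names and the three sample variables are hypothetical.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def first_pc_scores(columns, iterations=200):
    """Scores of the first principal component, found by power iteration."""
    n, m = len(columns[0]), len(columns)
    centred = []
    for col in columns:
        mean = sum(col) / n
        centred.append([v - mean for v in col])
    cov = [[sum(p * q for p, q in zip(ci, cj)) / (n - 1) for cj in centred]
           for ci in centred]
    w = [1.0] * m  # starting direction; assumed not orthogonal to the leading eigenvector
    for _ in range(iterations):
        w = [sum(cov[i][j] * w[j] for j in range(m)) for i in range(m)]
        norm = sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
    # Project every centred sample onto the leading direction.
    return [sum(w[j] * centred[j][t] for j in range(m)) for t in range(n)]

# Hypothetical class of three correlated variables (illustrative numbers only).
class_z = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [1.2, 1.9, 3.2, 3.8, 5.1],
    "c": [2.0, 1.0, 4.0, 3.0, 5.0],
}
scores = first_pc_scores(list(class_z.values()))
r2_own = {name: pearson(col, scores) ** 2 for name, col in class_z.items()}
representative = max(r2_own, key=r2_own.get)
```

The same `pearson` helper applied to the principal component scores of the other classes would give the cross-class correlations used in the last step.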
FIG. 2 is a schematic view of the variable clustering structure of the PCA-based redundant variable screening improvement method of the present invention. Referring to FIG. 2, in an alternative embodiment, the input variables are assumed to be $X = \{x_1, x_2, \ldots, x_{15}\}$, and the linear correlations among the variables are calculated to obtain the variable correlation matrix $R$:
[Equation image BDA0004144454020000071 (variable correlation matrix $R$) not reproduced in this text]
The original variables are divided into $\lambda$ classes according to their correlation coefficients. Assuming $\lambda = 4$, the classes are $\mathrm{class}_1$, $\mathrm{class}_2$, $\mathrm{class}_3$ and $\mathrm{class}_4$, and the variables are distributed among them as follows:
S201: $\mathrm{class}_1 = \{x_1, x_5, x_8\}$;
S202: $\mathrm{class}_2 = \{x_2, x_3, x_6, x_{10}\}$;
S203: $\mathrm{class}_3 = \{x_4, x_{11}, x_{12}, x_{13}, x_{15}\}$;
S204: $\mathrm{class}_4 = \{x_7, x_9, x_{14}\}$.
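As a small illustration of the correlation matrix on which the class division is based, the sketch below computes pairwise Pearson coefficients for three made-up variables. The variable values and the helper names are hypothetical, and the rule that actually assigns variables to the $\lambda$ classes is not specified in the text, so it is not shown.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def correlation_matrix(variables):
    """Square matrix whose (i, j) entry is the correlation of variables i and j."""
    return [[pearson(u, v) for v in variables] for u in variables]

# Illustrative data only.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 8.0, 9.8]   # nearly proportional to x1
x3 = [5.0, 3.0, 4.0, 1.0, 2.0]   # trends opposite to x1
R = correlation_matrix([x1, x2, x3])
```

Here `x1` and `x2` would be natural candidates for the same class, while `x3` is negatively correlated with both.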
Further, calculating the first key variable screening index $Q_1$ comprises: based on the variable with the maximum similarity and the variable with the minimum similarity, denoting the correlation coefficient between $x_i$ and $P_z$ as $R$;
the first key variable screening index $Q_1$ is expressed as:
[Equation image BDA0004144454020000072 (formula for $Q_1$) not reproduced in this text]
wherein the terms of the formula are, in order: the square of the correlation coefficient between each variable and the principal component of the group in which it is located; the square of the maximum correlation coefficient between the variable and the principal components of all other groups; and $Q_1^i$, the $Q_1$ index of the $i$-th target variable.
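Because the formula image for $Q_1$ is not available in this text, the following sketch only illustrates one plausible form. It assumes the widely used "1 minus R squared" ratio from variable clustering, built from the two squared correlations named above; that form agrees with the later statement that a smaller $Q_1$ indicates a more representative variable, but it is an assumption, and the function name `q1_index` is hypothetical.

```python
def q1_index(r_own, r_others):
    """Assumed form: (1 - R_own^2) / (1 - max R_other^2)."""
    r2_own = r_own ** 2
    # Largest squared correlation with the principal component of any other group.
    r2_nearest = max(r ** 2 for r in r_others)
    if r2_nearest >= 1.0:
        raise ValueError("variable is perfectly correlated with another group's component")
    return (1.0 - r2_own) / (1.0 - r2_nearest)

# Illustrative correlations: strong with its own group, weak with the others.
strong = q1_index(0.9, [0.3, -0.5, 0.1])
weak = q1_index(0.6, [0.3, -0.5, 0.1])
```

Under this assumed form, the variable that is tightly tied to its own group (`strong`) receives the smaller index.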
Further, calculating the second key variable screening index $Q_2$ comprises: for the variable $x_i$ selected in each class as nearest to $P_z$, calculating the information entropy and the variance of the variable $x_i$, the information entropy being used to assist key variable screening;
the second key variable screening index $Q_2$ is expressed as:
[Equation image BDA0004144454020000076 (formula for $Q_2$) not reproduced in this text]
wherein $Q_2^i$ is the $Q_2$ index of the $i$-th target variable, $e_{target}$ is the information entropy of the target variable, $S_{x_i}^2$ is the variance of the variable $x_i$, and $k$ is the sample size.
It should be noted that the key variable screening index $Q_2$, which incorporates information entropy, is introduced here. Information entropy is a measure of uncertainty and is used to assist key variable screening. The greater the uncertainty of the target variable, the larger its entropy and the more information it contains; the smaller the uncertainty, the smaller the entropy and the less information it contains. The entropy value can be used to judge the randomness and degree of disorder of an event, and also the degree of dispersion of an index: the more dispersed the index, the greater the influence (weight) of the variable on the comprehensive evaluation.
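The information entropy of the target variable is expressed as:
[Equation image BDA0004144454020000082 (formula for the information entropy $e_{target}$) not reproduced in this text]
where $k$ is the sample size and $y_{ij}$ is the proportion of the $j$-th index of the target variable.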
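Since the entropy formula itself is not reproduced, the sketch below shows one common way such an entropy is computed. It assumes the entropy-weight convention: the values are converted to proportions, the Shannon entropy of those proportions is taken, and the result is divided by $\ln k$ so that it lies between 0 and 1. The normalization and the function name `normalized_entropy` are assumptions, not statements of the patented formula.

```python
from math import log

def normalized_entropy(values):
    """Shannon entropy of the proportions of non-negative values, scaled by ln(k)."""
    k = len(values)
    total = sum(values)
    proportions = [v / total for v in values]
    # Terms with p == 0 contribute nothing (the limit of p * ln p as p -> 0 is 0).
    h = -sum(p * log(p) for p in proportions if p > 0)
    return h / log(k)

even = normalized_entropy([1.0, 1.0, 1.0, 1.0])      # maximally uncertain
peaked = normalized_entropy([1.0, 0.0, 0.0, 0.0])    # no uncertainty
mixed = normalized_entropy([3.0, 1.0])
```

This matches the qualitative description above: the evenly spread case has the largest entropy and the fully concentrated case has none.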
It should also be noted that variance measures the degree of dispersion of a random variable or a set of data, and can also be used to infer the degree of variation of a variable. The larger the variance of an input variable, the more dispersed its values, the greater its degree of variation, the more information it provides, and the more the factor level or interaction will affect the target variable.
The variance calculation is expressed as:
[Equation image BDA0004144454020000083 (formula for the sample variance $S_{x_i}^2$) not reproduced in this text]
wherein $S_{x_i}^2$ is the sample variance of the input variable $x_i$, $k$ is the sample size, and the remaining term is the sample mean.
Specifically, the analysis of information entropy and variance shows that the larger the value of $e_{target}$, the more information is contained, and the larger the variance $S_{x_i}^2$, the greater the degree of variation of the input variable $x_i$ and the greater its influence on the target variable. To assess the importance of each input variable, the ratio of its variance to the sum of the variances of all variables is calculated; the larger this ratio, the more information the input variable contains and the more representative it is.
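The variance ratio described here can be sketched as follows. The use of the unbiased sample variance (divisor $k - 1$) from Python's `statistics.variance` is an assumption, and the function name `variance_shares` and the two sample columns are hypothetical.

```python
from statistics import variance

def variance_shares(columns):
    """Share of each variable's sample variance in the total over all variables."""
    variances = {name: variance(values) for name, values in columns.items()}
    total = sum(variances.values())
    return {name: v / total for name, v in variances.items()}

# Illustrative data: "b" varies twice as widely as "a".
shares = variance_shares({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
})
```

One point worth keeping in mind: after z-score standardization every variable has the same variance, so all shares would be equal. The sketch therefore assumes the variances are taken on data in which they still differ.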
S103, calculating a third key variable screening index $Q_f$ based on the key variable screening indexes $Q_1$ and $Q_2$, and completing feature selection according to the third key variable screening index $Q_f$ to obtain the screened key variables;
in an alternative embodiment, the screened data is input into machine learning classification algorithms such as K-nearest neighbors, random forest and neighborhood components analysis to obtain the actual prediction effect.
Further, a third key variable screening index Q is calculated f Comprising:
when the value of the first key variable screening index $Q_1$ decreases and the value of the second key variable screening index $Q_2$ increases, the representativeness of the variable $x_i$ within its group is enhanced;
the final weight is set as the ratio $Q_f$ of the second key variable screening index $Q_2$ to the first key variable screening index $Q_1$; as the value of the third key variable screening index $Q_f$ increases, the representativeness of the variable $x_i$ within its group is enhanced, and the final key variables are screened out accordingly.
Still further, the third key variable screening index $Q_f$ is expressed as:
$$Q_f^i = \frac{Q_2^i}{Q_1^i}$$
wherein $Q_f^i$ is the $Q_f$ index of the $i$-th target variable.
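The ratio and the resulting screening can be illustrated with a short sketch. The ratio of $Q_2$ to $Q_1$ follows the text; keeping exactly one variable per class, namely the one with the largest $Q_f$, is an assumption about how the screening is carried out. The class memberships reuse S201 and S204 above, while the index values and the function name `select_key_variables` are invented for illustration.

```python
def select_key_variables(classes, q1, q2):
    """For each class, keep the variable with the largest Q_f = Q_2 / Q_1."""
    selected = {}
    for class_name, members in classes.items():
        qf = {v: q2[v] / q1[v] for v in members}
        selected[class_name] = max(qf, key=qf.get)
    return selected

classes = {"class_1": ["x1", "x5", "x8"], "class_4": ["x7", "x9", "x14"]}
# Hypothetical index values, not taken from the patent.
q1 = {"x1": 0.25, "x5": 0.5, "x8": 0.4, "x7": 0.2, "x9": 0.8, "x14": 0.5}
q2 = {"x1": 0.1, "x5": 0.3, "x8": 0.1, "x7": 0.05, "x9": 0.1, "x14": 0.2}
selected = select_key_variables(classes, q1, q2)
```

Because the rule is a fixed computation, it needs no manual choice of key variables, which is the kind of human intervention the method aims to avoid.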
S104, inputting the key variables into a machine learning classification algorithm for testing to obtain an actual prediction effect.
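Step S104 can be sketched with a tiny K-nearest-neighbors classifier written from scratch. It is a toy illustration only: the two-feature samples, the labels "normal" and "fault", the helper names and the choice of $k = 3$ are all hypothetical, and a real evaluation would use a held-out test set of realistic size together with the precision, recall and F1 measures compared in the figures.

```python
from collections import Counter
from math import dist

def knn_predict(train_X, train_y, query, k=3):
    """Majority label among the k training points closest to the query."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train_X, train_y, test_X, test_y, k=3):
    """Fraction of test samples whose predicted label matches the true label."""
    hits = sum(knn_predict(train_X, train_y, x, k) == y for x, y in zip(test_X, test_y))
    return hits / len(test_y)

# Each sample holds the values of two screened key variables (illustrative numbers).
train_X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (1.0, 0.9), (0.8, 1.1), (1.2, 1.0)]
train_y = ["normal", "normal", "normal", "fault", "fault", "fault"]
test_X = [(0.1, 0.1), (0.9, 1.0)]
test_y = ["normal", "fault"]
score = accuracy(train_X, train_y, test_X, test_y)
```

Running the same evaluation once with the screened key variables and once with the variables chosen by the conventional method gives the kind of comparison shown in FIG. 3 to FIG. 6.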
The foregoing is a schematic scheme of a redundant variable screening improvement method based on PCA of this embodiment. It should be noted that, the technical solution of the PCA-based redundant variable screening improvement device and the technical solution of the PCA-based redundant variable screening improvement method described above belong to the same concept, and details of the technical solution of the PCA-based redundant variable screening improvement device in this embodiment, which are not described in detail, can be referred to the description of the technical solution of the PCA-based redundant variable screening improvement method described above.
In this embodiment, a redundant variable screening improvement device based on PCA includes:
the data acquisition module is used for acquiring historical working data of the characteristics of the gas flowmeter and preprocessing the historical working data;
the variable clustering module is used for performing variable clustering on the preprocessed data, performing feature selection, and calculating a first key variable screening index Q_1 and a second key variable screening index Q_2;
the calculation module is used for calculating a third key variable screening index Q_f based on the key variable screening indices Q_1 and Q_2, and completing feature selection according to the third key variable screening index Q_f to obtain the screened data;
and the learning prediction module is used for inputting the data into a machine learning classification algorithm for testing to obtain an actual prediction effect.
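One way to picture how the four modules hand data to one another is the sketch below. The class name `ScreeningDevice`, its field names, and the stub callables are hypothetical; they only mirror the order of the modules listed above and do not reproduce any actual implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ScreeningDevice:
    acquire: Callable[[], Any]       # data acquisition module: collect and preprocess
    cluster: Callable[[Any], Any]    # variable clustering module: produce Q_1 and Q_2
    calculate: Callable[[Any], Any]  # calculation module: derive Q_f and screen
    predict: Callable[[Any], Any]    # learning prediction module: test a classifier

    def run(self) -> Any:
        """Chain the four modules in the order described for the device."""
        data = self.acquire()
        indices = self.cluster(data)
        screened = self.calculate(indices)
        return self.predict(screened)


# Stub modules standing in for the real processing steps.
device = ScreeningDevice(
    acquire=lambda: [1.0, 2.0, 3.0],
    cluster=lambda data: {"q1": 0.5, "q2": 1.0},
    calculate=lambda idx: idx["q2"] / idx["q1"],
    predict=lambda screened: f"score={screened}",
)
result = device.run()
print(result)
```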
This embodiment also provides a computing device suitable for the PCA-based redundant variable screening improvement method, comprising:
a memory and a processor; the memory is configured to store computer executable instructions and the processor is configured to execute the computer executable instructions to implement the PCA-based redundant variable screening improvement method as set forth in the above embodiments.
The computer device may be a terminal comprising a processor, a memory, a communication interface, a display screen and input means connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
The present embodiment also provides a storage medium having stored thereon a computer program which, when executed by a processor, implements a redundant variable screening improvement method based on PCA as set forth in the above embodiments.
The storage medium of this embodiment belongs to the same inventive concept as the PCA-based redundant variable screening improvement method of the above embodiment; technical details not described in this embodiment can be found in the above embodiment, and this embodiment has the same advantageous effects as the above embodiment.
Example 2
Referring to Figs. 3 to 6, in this embodiment key variables are screened from the ultrasonic flowmeter data set using both a conventional principal-component-based redundant variable screening method and the method of the present invention, and the results are verified with the K-nearest neighbor (KNN), random forest (RF), and neighborhood component analysis (NCA) classification algorithms, respectively.
To reduce experimental error, 8 different data partitions are set, and the accuracy, precision, recall, and f1-score of each algorithm are compared. The experimental results are shown in Fig. 3; across the 8 data partitions, the recognition accuracy is effectively improved.
To further demonstrate the performance advantage of the method, the classification performance of the algorithms is compared experimentally in terms of precision, recall, and f1-score, as shown in Figs. 4 to 6. The comparison shows that the method outperforms the conventional redundant variable screening method on all three metrics, and that precision, recall, and f1-score improve to varying degrees under the three classical classification algorithms.
To evaluate the improvement intuitively, the averages of the 8 experimental results are compared; the results are shown in Tables 1 to 4:
Table 1. Classification accuracy of the KNN, RF, and NCA algorithms on the ultrasonic flowmeter dataset after dimension reduction
Figure BDA0004144454020000111
As can be seen from Table 1, the average accuracy of the method of the present invention over the three classical classification algorithms is about 76%, which is about a 7% improvement over the conventional method.
Table 2. Classification precision of the KNN, RF, and NCA algorithms on the ultrasonic flowmeter dataset after dimension reduction
Figure BDA0004144454020000112
As can be seen from Table 2, the average precision of the method of the present invention over the three classification algorithms is about 80%, which is an improvement of about 4% over the conventional method.
Table 3. Classification recall of the KNN, RF, and NCA algorithms on the ultrasonic flowmeter dataset after dimension reduction
Figure BDA0004144454020000113
As can be seen from Table 3, the average recall of the method of the present invention over the three classification algorithms is about 78%, which is an improvement of about 6% over the conventional method.
Table 4. f1-score of the KNN, RF, and NCA algorithms on the ultrasonic flowmeter dataset after dimension reduction
Figure BDA0004144454020000114
As can be seen from Table 4, the average f1 score of the method of the present invention over the three classification algorithms is about 79%, which is an improvement of about 5% over the conventional method.
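Averaged figures of the kind reported in Tables 1 to 4 could be produced along the following lines. This is a minimal sketch for a two-class case with made-up labels; the helper names `binary_metrics` and `average_metrics` are hypothetical, only two partitions are shown instead of the 8 used in the experiment, and how the metrics were averaged across classes in the experiment is not stated here.

```python
from statistics import mean


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and f1-score for one data partition."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


def average_metrics(runs):
    """Average each metric over all data partitions."""
    return {key: mean(run[key] for run in runs) for key in runs[0]}


# Two hypothetical partitions: (true labels, predicted labels).
partitions = [
    ([1, 1, 0, 0], [1, 0, 0, 0]),
    ([1, 0, 1, 0], [1, 1, 1, 0]),
]
runs = [binary_metrics(y_true, y_pred) for y_true, y_pred in partitions]
averaged = average_metrics(runs)
print(averaged)
```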
As the comparison shows, the method improves accuracy, precision, recall, and f1-score by about 4% to 7% on average over the conventional method, which demonstrates that it improves the performance of the principal-component-based redundant variable screening algorithm. Furthermore, the method adds several evaluation indices to the original PCA-based redundant variable screening algorithm, addresses the low prediction accuracy of machine learning models caused by human intervention, extracts the features of the original data more effectively, and improves prediction accuracy.
It should be noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that the technical solution of the present invention may be modified or substituted without departing from the spirit and scope of the technical solution of the present invention, which is intended to be covered in the scope of the claims of the present invention.

Claims (10)

1. A redundant variable screening improvement method based on PCA, comprising:
collecting historical working data of the characteristics of the gas flowmeter, and preprocessing the historical working data;
performing target variable clustering on the preprocessed data in combination with feature selection, and calculating a first key variable screening index Q_1 and a second key variable screening index Q_2;
calculating a third key variable screening index Q_f based on the key variable screening indices Q_1 and Q_2, and completing feature selection according to the third key variable screening index Q_f to obtain the screened key variables;
and inputting the key variables into a machine learning classification algorithm for testing to obtain an actual prediction effect.
2. The PCA-based redundant variable screening improvement method of claim 1, wherein:
the collected gas flowmeter features include: flowmeter temperature and flowmeter pressure;
the preprocessing comprises: preprocessing the data by a data standardization method, the data being a collected public data set.
3. The PCA-based redundant variable screening improvement method of claim 2, wherein performing variable clustering on the preprocessed data comprises:
extracting one principal component P_z for each class, and calculating the Pearson correlation coefficient between each variable x_i in each class z and the principal component P_z of that class;
when the Pearson correlation coefficient is at its maximum, the variable x_i is the most representative within its class z, and the variable corresponding to the maximum is selected;
simultaneously calculating the Pearson correlation coefficient between each variable x_i in each class z and the other principal components P_z; when the Pearson correlation coefficient is at its minimum, the correlation between the variable x_i and the other principal components P_z is the weakest and the variable x_i is the most representative within its class z, and the variable corresponding to the minimum is selected.
4. The PCA-based redundant variable screening improvement method of claim 3, wherein calculating the first key variable screening index Q_1 comprises: based on the variable with the maximum similarity and the variable with the minimum similarity, denoting the correlation coefficient between x_i and P_z as R;
the first key variable screening index Q_1 is expressed as:
Figure FDA0004144454010000011
wherein (Figure FDA0004144454010000012) is the square of the correlation coefficient between each variable and the principal component of the group in which it is located, (Figure FDA0004144454010000013) is the square of the maximum correlation coefficient between the variable and the principal components of all other groups, and (Figure FDA0004144454010000014) is the Q_1 index of the i-th target variable.
5. The PCA-based redundant variable screening improvement method of claim 4, wherein calculating the second key variable screening index Q_2 comprises: based on the variable x_i selected in each class as closest to P_z, calculating the information entropy of the variable x_i and the variance of the variable x_i, which are used to assist key variable screening;
the second key variable screening index Q_2 is expressed as:
Figure FDA0004144454010000021
wherein (Figure FDA0004144454010000022) is the Q_2 index of the i-th target variable, e_target is the information entropy of the target variable, (Figure FDA0004144454010000023) pertains to the variable x_i, and K is the sample size.
6. The PCA-based redundant variable screening improvement method of claim 4 or 5, wherein calculating the third key variable screening index Q_f comprises:
when the value of the first key variable screening index Q_1 decreases and the value of the second key variable screening index Q_2 increases, the representativeness of the variable x_i within its group is enhanced;
setting the final weight as the ratio Q_f of the second key variable screening index Q_2 to the first key variable screening index Q_1; when the value of the third key variable screening index Q_f increases, the representativeness of the variable x_i within its group is enhanced, and the final key variables are screened out accordingly.
7. The PCA-based redundant variable screening improvement method of claim 6, wherein the third key variable screening index Q_f is expressed as:
$$Q_f^{(i)} = \frac{Q_2^{(i)}}{Q_1^{(i)}}$$
where $Q_f^{(i)}$ is the Q_f index of the i-th target variable.
8. A PCA-based redundant variable screening improvement device, comprising:
the data acquisition module is used for acquiring historical working data of the characteristics of the gas flowmeter and preprocessing the historical working data;
the variable clustering module is used for performing target variable clustering on the preprocessed data in combination with feature selection, and calculating a first key variable screening index Q_1 and a second key variable screening index Q_2;
the calculation module is used for calculating a third key variable screening index Q_f based on the key variable screening indices Q_1 and Q_2, and completing feature selection according to the third key variable screening index Q_f to obtain the screened key variables;
and the learning prediction module is used for inputting the key variables into a machine learning classification algorithm for testing to obtain an actual prediction effect.
9. A computing device, comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions that, when executed by the processor, perform the steps of the PCA-based redundant variable screening improvement method of any one of claims 1 to 7.
10. A computer readable storage medium storing computer executable instructions which when executed by a processor perform the steps of the PCA based redundant variable screening improvement method of any one of claims 1 to 7.
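The correlation-based selection recited in claim 3 and the auxiliary statistics recited in claim 5 can be sketched as follows. This is a minimal illustration under several assumptions: the principal component scores of each class are taken as given rather than computed, absolute correlation values are compared because the sign of a principal component is arbitrary, and entropy is estimated from an equal-width histogram with a hypothetical `bins` parameter. All function names and data are illustrative, and the sketch does not combine the statistics into Q_1 or Q_2, whose formulas are given only as figures.

```python
import math
from collections import Counter
from statistics import pvariance


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def most_representative(variables, own_pc):
    """Variable with the largest |r| against its own class component."""
    return max(variables, key=lambda name: abs(pearson(variables[name], own_pc)))


def least_related_to_others(variables, other_pcs):
    """Variable whose strongest |r| against other-class components is smallest."""
    def strongest(name):
        return max(abs(pearson(variables[name], pc)) for pc in other_pcs)
    return min(variables, key=strongest)


def shannon_entropy(values, bins=4):
    """Shannon entropy (bits) of a variable after equal-width binning."""
    low, high = min(values), max(values)
    if high == low:
        return 0.0
    width = (high - low) / bins
    counts = Counter(min(int((v - low) / width), bins - 1) for v in values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


# Hypothetical class with two variables, its own component, and one other component.
variables = {"x1": [1.1, 1.9, 3.2, 3.9, 5.1], "x2": [5.0, 1.0, 4.0, 2.0, 3.0]}
own_pc = [1.0, 2.0, 3.0, 4.0, 5.0]
other_pcs = [[1.0, -1.0, 0.0, 1.0, -1.0]]

best = most_representative(variables, own_pc)
weakest = least_related_to_others(variables, other_pcs)
sample = [0, 0, 1, 1, 2, 2, 3, 3]
entropy = shannon_entropy(sample)
variance = pvariance(sample)
print(best, weakest, entropy, variance)
```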
CN202310299411.4A 2023-03-24 2023-03-24 PCA-based redundant variable screening improvement method and device Pending CN116401564A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310299411.4A CN116401564A (en) 2023-03-24 2023-03-24 PCA-based redundant variable screening improvement method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310299411.4A CN116401564A (en) 2023-03-24 2023-03-24 PCA-based redundant variable screening improvement method and device

Publications (1)

Publication Number Publication Date
CN116401564A true CN116401564A (en) 2023-07-07

Family

ID=87009575

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310299411.4A Pending CN116401564A (en) 2023-03-24 2023-03-24 PCA-based redundant variable screening improvement method and device

Country Status (1)

Country Link
CN (1) CN116401564A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117556256A (en) * 2023-11-16 2024-02-13 南京小裂变网络科技有限公司 Private domain service label screening system and method based on big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination