CN113761026A - Feature selection method, device, equipment and storage medium based on conditional mutual information - Google Patents

Feature selection method, device, equipment and storage medium based on conditional mutual information Download PDF

Info

Publication number
CN113761026A
CN113761026A CN202111021982.9A CN202111021982A CN113761026A CN 113761026 A CN113761026 A CN 113761026A CN 202111021982 A CN202111021982 A CN 202111021982A CN 113761026 A CN113761026 A CN 113761026A
Authority
CN
China
Prior art keywords
feature
feature set
mutual information
features
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111021982.9A
Other languages
Chinese (zh)
Inventor
马晓忱
孙博
吕闫
李理
石上丘
罗雅迪
程文帅
郑乐
冷喜武
常乃超
吴迪
章昊
王吉文
李端超
叶海峰
刘辉
马金辉
胡海琴
陈伟
李智
李顺
朱刚刚
王维坤
樊锐轶
高志
张秀丽
刘志良
刘国瑞
杨旋
余志国
李英
孙珂
周明
李杨月
汪春燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
China Electric Power Research Institute Co Ltd CEPRI
State Grid Shanxi Electric Power Co Ltd
State Grid Hebei Electric Power Co Ltd
State Grid Anhui Electric Power Co Ltd
State Grid Hubei Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
China Electric Power Research Institute Co Ltd CEPRI
State Grid Shanxi Electric Power Co Ltd
State Grid Hebei Electric Power Co Ltd
State Grid Anhui Electric Power Co Ltd
State Grid Hubei Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, China Electric Power Research Institute Co Ltd CEPRI, State Grid Shanxi Electric Power Co Ltd, State Grid Hebei Electric Power Co Ltd, State Grid Anhui Electric Power Co Ltd, State Grid Hubei Electric Power Co Ltd filed Critical State Grid Corp of China SGCC
Priority to CN202111021982.9A priority Critical patent/CN113761026A/en
Publication of CN113761026A publication Critical patent/CN113761026A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of data mining, and discloses a feature selection method, a device, equipment and a storage medium based on conditional mutual information, wherein the method comprises the following steps: acquiring a data set to form a candidate feature set F; calculating mutual information of each candidate feature and the category attribute C in the candidate feature set F, and putting the selected features into a feature set S; setting a threshold value, and entering a cycle until the threshold value is met; training a model of the selected feature set S through a classifier, predicting the category by using the trained model, and calculating the prediction accuracy; and changing the weight coefficient, repeatedly screening the feature set S, calculating the prediction accuracy, and selecting the feature set S with the highest accuracy as a final output feature set. The invention can more efficiently and quickly select the characteristics and improve the precision and efficiency of data mining.

Description

Feature selection method, device, equipment and storage medium based on conditional mutual information
Technical Field
The invention belongs to the technical field of feature selection (feature selection) in data mining, and particularly relates to a feature selection method, a feature selection device, feature selection equipment and a storage medium based on conditional mutual information.
Background
Data mining refers to the process of algorithmically searching a large amount of data for information hidden therein. There is a large amount of data in production life that can be widely used, and there is an urgent need to convert such data into useful information and knowledge. Data mining is to analyze a large amount of data to find out its rules, thereby obtaining required information and knowledge. The method has many applications in the power system, such as transient stability assessment, fault diagnosis, load prediction and the like. The data mining process mainly comprises three stages of data preparation and preprocessing, data mining and result expression and interpretation. The feature selection is a data preprocessing stage which is very important for data mining, a small number of main attributes are selected from a large number of features and used as input attributes in the data mining of the next stage, and the precision and the efficiency of the data mining can be effectively improved.
Principle of feature extraction: given n samples, each sample has a features, and each sample has a corresponding label or a category (category attribute C), the feature selection is to select a features that help to determine the category of the sample from the a features. The quantity a of the selected features and the composition mode of the a features have direct influence on the result of data mining. In terms of the number of features, if the number of selected features is too small, sufficient useful information cannot be contained, and sufficiently high accuracy cannot be achieved. If too many selected features influence the calculation speed, weak correlation or even irrelevant features can be introduced, too much noise is introduced in the data mining process, and the generalization capability of the obtained rule is reduced. In the case where the number a is determined, the selected feature set is optimal only if the feature set can provide the category information to which the sample belongs to the greatest extent.
How to select the optimal a features is a core problem of feature selection. There are many methods for feature selection, and one of them is a feature selection method based on information theory. In the feature selection method based on the information theory, each feature is generally taken as a random variable, and then the feature is selected through an algorithm.
"entropy" in thermodynamics is used to describe the degree of disorder of a molecular state, with the higher the degree of disorder, the higher the entropy value; the lower the degree of misordering, the lower the entropy value. While "Entropy of information" (Entropy) is used to characterize the uncertainty of a random variable, the greater the uncertainty of the random variable, the more information is needed to determine it.
The existing feature selection method based on mutual information still has certain limitations.
In order to adapt to the characteristics of data diversity and high dimension under a big data environment and improve the quality of the integrated data, the method becomes a problem needing to be researched in the field of feature selection at present.
Disclosure of Invention
The invention aims to provide a feature selection method, a device, equipment and a storage medium based on conditional mutual information, which aim to solve the technical problems that how to reduce the calculation complexity under the condition of not reducing the feature selection precision so as to adapt to the characteristics of data diversity and high dimension under a big data environment and improve the quality of the integrated data; the method considers the importance degree between the selected feature set and the candidate feature set, and efficiently removes the data redundancy. The invention can more efficiently and quickly select the characteristics and improve the precision and efficiency of data mining.
In order to achieve the purpose, the invention adopts the following technical scheme:
in a first aspect, the present invention provides a feature selection method based on conditional mutual information, including the following steps:
acquiring an input data set to form a candidate feature set F;
putting the initially selected features into a feature set S, and deleting the selected features from a candidate feature set F;
calculating condition mutual information, and calculating an evaluation standard J based on the condition mutual information and the weight coefficient alpha; screening and selecting features based on the evaluation criteria J, putting the selected features into a feature set S, and deleting the selected features from the candidate feature set F; repeating the screening of the selected features until a threshold is met;
training a model of the selected feature set S through a classifier, predicting the category by using the trained model, and calculating the prediction accuracy;
and changing the value of the weight coefficient alpha, repeatedly screening the feature set S and calculating the prediction accuracy, and selecting the feature set S with the highest accuracy as a final output feature set.
The invention further improves the following steps: candidate features with zero variance have been deleted from the candidate feature set F.
The core of the invention is that: the specific formula of the calculation evaluation criterion J is as follows:
Figure BDA0003241833260000031
in the formula: alpha represents CMI (C; f)k|fset) The weight of (c); (1-. alpha.) denotes CMI (C; f)set|fk) The weight of (c); MI (C; f)k) Represents mutual information of the candidate features and the category attribute C, CMI (C; f. ofk|fset) And condition mutual information representing candidate characteristics and category attributes.
The invention further improves the following steps: the step of putting the initially selected features into the feature set S specifically includes: and calculating mutual information of each candidate feature and the category attribute C in the candidate feature set F, and putting the feature with the maximum mutual information of the candidate feature and the category attribute C into a feature set S as an initially selected feature.
The invention further improves the following steps: the step of putting the initially selected features into the feature set S specifically includes: and putting the preset key power transmission section characteristics in the power system into the characteristic set S as the characteristics of initial selection.
The invention further improves the following steps: before the step of putting the initially selected features into the feature set S, the method further includes: initializing the selected feature set S as an empty set, and setting an initial weight coefficient alpha; the value range of the weight coefficient alpha is as follows: alpha is more than or equal to 0 and less than or equal to 1.
The invention further improves the following steps: the threshold is the number of selected features screened; the threshold is less than the number of candidate features in the candidate feature set F.
The invention further improves the following steps: the classifier is a decision tree or SVM.
In a second aspect, the present invention provides a feature selection apparatus based on conditional mutual information, including:
the acquisition module is used for acquiring a data set to form a candidate feature set F;
the initial selection module is used for putting the initially selected features into a feature set S; deleting the selected features from the candidate feature set F;
the cyclic selection module is used for calculating condition mutual information and calculating an evaluation standard J based on the condition mutual information and the weight coefficient; screening and selecting features based on the evaluation criteria J, putting the selected features into a feature set S, and deleting the selected features from the candidate feature set F; repeating the screening and selecting of the features until a preset threshold is met;
the training module is used for training the model of the selected feature set S through the classifier, predicting the category by using the trained model and calculating the prediction accuracy;
and the calculation output module is used for changing the value of the weight coefficient alpha, repeatedly screening the feature set S and calculating the prediction accuracy, and selecting the feature set S with the highest accuracy as a final output feature set.
The invention further improves the following steps: the specific formula of the loop selection module for calculating the evaluation standard J is as follows:
Figure BDA0003241833260000041
in the formula: alpha represents CMI (C; f)k|fset) The weight of (c); (1-. alpha.) denotes CMI (C; f)set|fk) The weight of (c); MI (C; f)k) Represents mutual information of the candidate features and the category attribute C, CMI (C; f. ofk|fset) And condition mutual information representing candidate characteristics and category attributes.
The invention further improves the following steps: the acquisition module is further used for initializing the selected feature set S as an empty set and setting an initial weight coefficient alpha; the value range of the weight coefficient alpha is as follows: alpha is more than or equal to 0 and less than or equal to 1.
The invention further improves the following steps: putting the initially selected features into a feature set S, specifically including: and putting the candidate feature and the feature with the maximum mutual information of the category attribute C into a feature set S as the initially selected feature.
The invention further improves the following steps: putting the initially selected features into a feature set S, specifically including: and putting the preset key power transmission section characteristics in the power system into the characteristic set S as the characteristics of initial selection.
The invention further improves the following steps: the threshold is the number of selected features screened; the threshold is less than the number of candidate features in the candidate feature set F.
In a third aspect, the present invention provides an electronic device, which includes a processor and a memory, wherein the processor is configured to execute a computer program stored in the memory to implement the feature selection method based on conditional mutual information.
In a fourth aspect, the present invention provides a computer-readable storage medium storing at least one instruction, which when executed by a processor, implements the method for feature selection based on conditional mutual information.
Compared with the prior art, the invention has the following beneficial effects:
the invention relates to a characteristic selection method, a device, equipment and a storage medium based on conditional mutual information, which are an improvement on the existing mutual information method; the invention has the innovation points that a new evaluation index based on dynamic condition mutual information is provided, and the WCMI can balance the condition mutual information of candidate characteristics and the condition mutual information of selected characteristics, select the characteristics most related to the category attributes and remove redundancy; by adopting the method, a small number of features can be selected from the input data to be used as the input features of data mining, and the accuracy and efficiency of data mining can be improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention, and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a flow chart illustrating the steps of a feature selection method based on conditional mutual information according to the present invention;
FIG. 2 is a graph comparing the accuracy of the present method with other methods;
FIG. 3 is a block diagram of a feature selection apparatus based on conditional mutual information according to the present invention;
fig. 4 is a block diagram of an electronic device according to the present invention.
Detailed Description
The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings. It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
The following detailed description is exemplary in nature and is intended to provide further details of the invention. Unless otherwise defined, all technical terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention.
Referring to fig. 1, the present invention provides a feature selection method based on conditional mutual information, including the following steps:
s1, reading sample data, and forming a set of candidate features as a candidate feature set F; each sample has corresponding classification, and the class characteristics in the samples are taken out separately to form a class attribute C; initializing a feature set S as an empty set; and calculating the variance of each feature in the candidate feature set F, removing the feature with the zero variance in the candidate feature set F, and updating the candidate feature set F. The data generated by sample data through an arithmetic example is assumed to have n samples, each sample has a physical quantity, n samples of a certain physical quantity form a matrix of n rows and 1 columns, the matrix is a characteristic vector corresponding to the physical quantity, and a large matrix of n rows and a columns formed by a physical quantity forming a characteristic vector is a characteristic set. Data aggregation is simply the process of transforming the produced data into a matrix.
S2, calculating mutual information of each candidate feature and the category attribute C in the candidate feature set F, putting the candidate feature with the maximum mutual information or the manually selected key feature into the feature set S as an initial selection feature, deleting the selected candidate feature from the candidate feature set F, and setting an initial weight coefficient alpha, wherein the value range of the parameter alpha is as follows: alpha is more than or equal to 0 and less than or equal to 1. The optimal alpha can be determined through algorithms such as artificial intelligence and the like, the method comprises the steps of substituting an initial parameter alpha into an evaluation standard to select a feature set S, dividing all samples into a training set and a test set, training a classifier model through the training set, and verifying the prediction accuracy through the test set. After the accuracy calculation is completed, the value of the parameter alpha is changed through the super-parameter adjustment, and the next calculation is carried out. And the steps are circulated until the parameter alpha and the feature set S which can enable the prediction accuracy to be the highest are selected. Each sample has a corresponding feature quantity and class attribute. For example, when the power system produces simulation data, each sample has a physical quantity value and whether the sample is stable, and whether the sample is stable is a value of a category attribute, generally, the stability is 1, and the instability is 0; thus, whether the sample is stable or not is changed into a numerical value, and when the feature set is generated, the classes of the samples are separately formed into a matrix C with n rows and one column.
And S3, setting a threshold K, wherein K is the finally selected feature quantity. And entering a loop, and repeating the steps a and b until the selected characteristic quantity reaches a threshold value.
a. And calculating condition mutual information, substituting the condition mutual information into a calculation evaluation standard J, and selecting the feature with the maximum value in the J as the next selected feature.
b. And (4) putting the selected features into a feature set S, deleting the selected features from the candidate feature set F, and performing next calculation of the evaluation criterion J by using the updated set.
And S4, calculating the accuracy of the selected feature set S through an SVM classifier.
S5, changing the value of alpha, repeating the steps S3 and S4, and selecting the feature set S with the highest accuracy.
The evaluation criterion in step S3 is a new evaluation index proposed by the present invention, which changes the conditional mutual information CMI (C; f) of the selected feature set by dynamically adjusting the parametersk|fset) And conditional mutual information CMI (C; f. ofset|fk) The weight of (c). The specific formula is as follows:
Figure BDA0003241833260000071
in the formula: alpha represents CMI (C; f)k|fset) The weight of (1) is a parameter to be adjustedCounting; (1-. alpha.) denotes CMI (C; f)set|fk) The weight of (c). MI (C; f)k) Represents the degree of correlation of the candidate feature with the category attribute C, CMI (C; f. ofk|fset) Representing known selected features fsetIn the case of (2), candidate feature fkDegree of correlation with the category attribute C.
The method adopted by the present invention is different from the existing method in step S3. And adopting a new formula as an evaluation index J for measuring the mutual information importance degree of the condition.
The invention takes into account conditional mutual information CMI (C; f) of the selected feature setk|fset) And candidate feature condition mutual information CMI (C; f. ofset|fk) And the computational complexity is reduced. The evaluation index adopted by the invention aims to adapt to the characteristics of high-dimensional data under the background of big data, reduce the operation complexity, improve the quality of the selected feature subset and improve the efficiency and the precision of data mining.
Example 1
Referring to fig. 1, the present invention provides a feature selection method based on conditional mutual information, including the following steps:
s1, reading data, and forming a set of candidate features as a candidate feature set F; initializing a feature set S as an empty set; and calculating the variance of each feature in the candidate feature set F, removing the feature with the zero variance in the candidate feature set F, and updating the candidate feature set F.
S2, calculating mutual information of each candidate feature and the category attribute C in the candidate feature set F updated in the step S1, putting the candidate feature or the experience key feature with the maximum mutual information into the feature set S, taking the candidate feature or the experience key feature as an initial selection feature, and deleting the selected candidate feature from the candidate feature set F; updating the feature set S and the candidate feature set F; setting a weight coefficient alpha, wherein the value range of the weight coefficient alpha is as follows: alpha is more than or equal to 0 and less than or equal to 1.
3. And setting a threshold value K, wherein K is the finally selected characteristic quantity. And entering a loop, and repeating the steps a and b until the selected characteristic reaches the threshold value K.
a. And calculating condition mutual information, substituting the condition mutual information into a calculation evaluation standard J, and selecting the feature with the maximum value in the J as the next selected feature.
b. B, putting the next selected feature determined in the step a into a feature set S, and deleting the next selected feature from a candidate feature set F;
the feature set S and the candidate feature set F are updated until a threshold is met.
4. And calculating the accuracy of the selected feature set S through an SVM classifier.
5. And changing the value of alpha, repeating the steps 3 and 4, and selecting the feature set S with the highest accuracy.
The evaluation standard in step 3 is a new evaluation index provided by the invention, which changes the conditional mutual information CMI (C; f) of the selected feature set by dynamically adjusting parametersk|fset) And conditional mutual information CMI (C; f. ofset|fk) The weight of (c). The specific formula is as follows:
Figure BDA0003241833260000081
in the formula: alpha represents CMI (C; f)k|fset) The weight of (2) is a parameter to be adjusted; (1-. alpha.) denotes CMI (C; f)set|fk) The weight of (c). MI (C; f)k) Represents the degree of correlation of the candidate feature with the category attribute C, CMI (C; f. ofk|fset) Representing known selected features fsetIn the case of (2), candidate feature fkDegree of correlation with the category attribute C.
The method adopted by the invention is different from the existing method in step 3. And adopting a new formula as an evaluation index J for measuring the mutual information importance degree of the condition. The invention takes into account conditional mutual information CMI (C; f) of the selected feature setk|fset) And candidate feature condition mutual information CMI (C; f. ofset|fk) And the computational complexity is reduced. The evaluation index adopted by the invention aims to adapt to the characteristics of high-dimensional data under the background of big data, reduce the operation complexity, improve the quality of the selected feature subset, and improve the efficiency and the precision of data mining。
Example 2
The invention takes a test system standardized by IEEE10 machine 39 nodes in a certain area as an example to carry out verification experiment; simulation data of a 10-machine 39-node power system was taken as an input sample. The active power P, the reactive power Q, the node voltage amplitude V and the node voltage phase angle theta of the line are 170 candidate characteristics. The category attribute C contains two categories, stable and unstable. There were 3000 samples, 1500 stable samples and 1500 unstable samples. During model training and prediction, samples are randomly divided into 2000 training samples and 1000 prediction samples. In the experiment, a Relief algorithm and a mutual information selection algorithm are set as a comparison, and 30 features are selected for the three algorithms to predict whether the stability is stable or not and compare the accuracy. To reduce errors, each set of experiments was run through 10 training sets and random partitions of the test set, and the accuracy was calculated and averaged as the final result.
The specific process is shown in fig. 1, and comprises the following steps:
1. setting the input data set as a candidate feature set F, calculating the variance of each feature, and removing the feature with the variance being zero. And setting the selected feature set S as an empty set.
The standardized test system of the IEEE10 machine 39 node in a certain area is simulated, and 3000 sample data are generated, wherein the sample data comprise 1500 stable samples and 1500 unstable samples. Using the generated data as input, 170 features of all P, Q, V, θ are put into the feature set F. The class corresponding to each sample, i.e. whether this sample is stable (stable to 1 and unstable to 0) is put into the class attribute C. And (3) performing variance calculation on each feature in the feature set F, excluding the feature with the variance of zero, wherein 11 features are found in the calculation, the variance of zero, and deleting the 11 features from the set F, wherein 159 features exist in the set F. And setting the to-be-selected set S as an empty set.
2. And calculating mutual information of each candidate feature and the category attribute C, and putting the feature with the maximum mutual information or the manually selected key feature into the set S to be used as the initial selection feature. The selected feature is deleted from F, and an initial value of the weight coefficient α is set.
For the previous stepThe feature set F generated in the step calculates mutual information MI (C; F) of 159 features and the category attribute C in the set F respectivelyk) And manually setting five characteristics with larger mutual information as key sections, putting the key sections into the set S as initial characteristics, and deleting the five characteristics from the set F. At this point, there are 5 selected features in the set S and 154 candidate features in the candidate feature set F. The initial value of the weight coefficient α is set to 0.5, and the loop calculation is ready to be performed.
3. And setting the threshold K to be 30 and entering a loop. Steps a, b are repeated until the selected feature reaches the threshold 30.
a. And calculating condition mutual information, substituting the condition mutual information into a calculation evaluation standard J, and selecting the feature with the maximum value in the J as the next selected feature.
b. This feature is put into the set S and deleted from F, and the updated set is subjected to the next calculation of the evaluation criterion J.
And (3) calculating an evaluation criterion J according to a formula (10) for the candidate feature set F and the selected feature set S updated in the step 2. For 154 feature calculation in F, values of 154 evaluation criteria J need to be calculated, a feature corresponding to the maximum value in the evaluation criteria J is selected as the optimal feature selected in this cycle, and the feature is put into the set S and deleted from F. At this time, the candidate set F is updated to the remaining 153 features, and the selected feature set is updated to 6 selected features, and then the next cycle is entered. And (c) through circulation, continuously repeating the steps a and b, and finally when the selected feature set S has 30 features, reaching a threshold value, terminating the circulation and outputting the S set.
4. And training the selected set S through an SVM classifier, predicting the category by using the trained model, and calculating the prediction accuracy.
And taking the set S obtained in the last step as input, and training by adopting an SVM classifier. 3000 sample sets of the set S are randomly divided into 2000 training sets X _ train and 1000 test sets X _ test. And the corresponding class attributes are divided into 2000 training sample classes Y _ train and 1000 testing sample classes Y _ test.
And calling an SVM model, substituting the training sample X _ train and the test sample Y _ train into the model for fitting training, and calling a function to adjust the hyper-parameters to obtain the model with the best effect. And substituting the test set X _ test into the model, performing class prediction on the test samples, and predicting whether the samples are stable, wherein the stable value is 1 and the unstable value is 0.
And comparing the predicted result with the category attribute Y _ test of the prediction set, calculating the accuracy and storing. In order to reduce calculation errors, the process of dividing the training set, training the model, predicting and calculating the accuracy is repeated for 10 times, and the average value of the accuracy of 10 times is taken as a final result.
5. And changing the value of alpha, repeating the steps 3 and 4, and selecting the set with the highest accuracy.
And (3) repeating the steps 3 and 4 through super-parameter adjustment, and selecting the set S with the highest accuracy as a final output feature set. And after the feature selection is finished, outputting a feature set S as an input feature of data mining.
In the experiment, 30 optimal features are also taken out from the Relief algorithm and the mutual information algorithm as an S set, and the final accuracy result is shown in FIG. 2. The accuracy of class prediction of the method is obviously higher than that of other two methods, and the feature selection method based on the dynamic condition mutual information can effectively improve the precision and speed of data mining.
Example 3
Referring to fig. 3, the present embodiment provides a feature selection apparatus based on conditional mutual information, including:
the acquisition module is used for acquiring an input data set to form a candidate feature set F; calculating the variance of each candidate feature in the candidate feature set F, removing the candidate features with the variance of zero, and updating the candidate feature set F; initializing the selected feature set S as an empty set; setting an initial weight coefficient alpha;
the initial selection module is used for calculating mutual information between each candidate feature and the category attribute C in the updated candidate feature set F, and putting the selected features into the feature set S to serve as initial selection features; deleting the selected features from the candidate feature set F, and updating the selected feature set S and the candidate feature set F;
and the cycle selection module is used for setting a threshold value and entering a cycle: calculating condition mutual information, substituting into a calculation evaluation standard J, and selecting the characteristic with the maximum value in J as the next selection characteristic; putting the selected features into a feature set S, deleting the selected features from the candidate feature set F, and calculating the next evaluation standard J by the updated feature set S; repeating the loop until the threshold is met;
the training module is used for training the model of the selected feature set S through the classifier, predicting the category by using the trained model and calculating the prediction accuracy;
and the calculation output module is used for changing the value of the weight coefficient alpha, repeatedly screening the feature set S, and selecting the feature set S with the highest accuracy as a final output feature set.
Example 4
Referring to fig. 4, the present invention further provides an electronic device 100 based on a feature selection method of conditional mutual information; the electronic device 100 comprises a memory 101, at least one processor 102, a computer program 103 stored in the memory 101 and executable on the at least one processor 102, and at least one communication bus 104.
The memory 101 may be configured to store the computer program 103, and the processor 102 implements the method steps of the feature selection method based on conditional mutual information according to any one of embodiments 1 to 4 by running or executing the computer program stored in the memory 101 and calling the data stored in the memory 101. The memory 101 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the electronic apparatus 100, and the like. In addition, the memory 101 may include a non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other non-volatile solid state storage device.
The at least one Processor 102 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. The processor 102 may be a microprocessor or the processor 102 may be any conventional processor or the like, and the processor 102 is a control center of the electronic device 100 and connects various parts of the whole electronic device 100 by various interfaces and lines.
The memory 101 in the electronic device 100 stores a plurality of instructions to implement a feature selection method based on conditional mutual information, and the processor 102 can execute the plurality of instructions to implement:
s1, acquiring an input data set to form a candidate feature set F; calculating the variance of each candidate feature in the candidate feature set F, removing the candidate features with the variance of zero, and updating the candidate feature set F; initializing the selected feature set S as an empty set; setting an initial weight coefficient alpha;
s2, calculating mutual information between each candidate feature and the category attribute C in the candidate feature set F updated in the step S1, and putting the selected features into the feature set S to serve as initial selected features; deleting the selected features from the candidate feature set F, and updating the selected feature set S and the candidate feature set F;
s3, setting a threshold value, and entering a loop: calculating condition mutual information, substituting into a calculation evaluation standard J, and selecting the characteristic with the maximum value in J as the next selection characteristic; putting the selected features into a feature set S, deleting the selected features from the candidate feature set F, and calculating the next evaluation standard J by the updated feature set S; repeating step S3 until the threshold is met;
s4, training a model of the selected feature set S through a classifier, predicting the category by using the trained model, and calculating the prediction accuracy;
and changing the value of the weight coefficient alpha, repeating the steps S3 and S4, and selecting the feature set S with the highest accuracy as the final output feature set.
Example 5
The modules/units integrated by the electronic device 100 may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as separate products. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, and Read-Only Memory (ROM).
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.

Claims (10)

1. The feature selection method based on the conditional mutual information is characterized by comprising the following steps of:
acquiring a data set to form a candidate feature set F;
putting the initially selected features into a feature set S, and deleting the selected features from a candidate feature set F;
calculating condition mutual information, and calculating an evaluation standard J based on the condition mutual information and the weight coefficient alpha; screening and selecting features based on the evaluation criteria J, putting the selected features into a feature set S, and deleting the selected features from the candidate feature set F; repeating the screening and selecting of the features until a preset threshold is met;
training a model of the selected feature set S through a classifier, predicting the category by using the trained model, and calculating the prediction accuracy;
and changing the value of the weight coefficient alpha, repeatedly screening the feature set S and calculating the prediction accuracy, and selecting the feature set S with the highest accuracy as a final output feature set.
2. The method for selecting features based on conditional mutual information as claimed in claim 1, wherein the specific formula of the evaluation criterion J is:
Figure FDA0003241833250000011
in the formula: alpha represents CMI (C; f)k|fset) The weight of (c); (1-. alpha.) denotes CMI (C; f)set|fk) The weight of (c); MI (C; f)k) Represents mutual information of the candidate features and the category attribute C, CMI (C; f. ofk|fset) And condition mutual information representing candidate characteristics and category attributes.
3. The method for selecting features based on conditional mutual information according to claim 1, wherein the step of putting the initially selected features into a feature set S specifically comprises:
and calculating mutual information of each candidate feature and the category attribute C in the candidate feature set F, and putting the feature with the maximum mutual information of the candidate feature and the category attribute C into the feature set S as the initially selected feature, or putting the preset key power transmission section feature in the power system into the feature set S as the initially selected feature.
4. The method for selecting features based on conditional mutual information according to claim 1, wherein the step of placing the initially selected features into the feature set S is preceded by the steps of: initializing the selected feature set S as an empty set, and setting an initial weight coefficient alpha; the value range of the alpha is as follows: alpha is more than or equal to 0 and less than or equal to 1.
5. The method of claim 1, wherein the threshold is the number of selected features to be filtered.
6. A feature selection apparatus based on conditional mutual information, comprising:
the acquisition module is used for acquiring a data set to form a candidate feature set F;
the initial selection module is used for putting the initially selected features into the feature set S and deleting the selected features from the candidate feature set F;
the cyclic selection module is used for calculating condition mutual information and calculating an evaluation standard J based on the condition mutual information and the weight coefficient alpha; screening and selecting features based on the evaluation criteria J, putting the selected features into a feature set S, and deleting the selected features from the candidate feature set F; repeating the screening and selecting of the features until a preset threshold is met;
the training module is used for training the model of the selected feature set S through the classifier, predicting the category by using the trained model and calculating the prediction accuracy;
and the calculation output module is used for changing the value of the weight coefficient alpha, repeatedly screening the feature set S and calculating the prediction accuracy, and selecting the feature set S with the highest accuracy as a final output feature set.
7. The device for selecting features based on conditional mutual information as claimed in claim 6, wherein the loop selection module calculates the evaluation criterion J by the following formula:
Figure FDA0003241833250000021
in the formula: alpha represents CMI (C; f)k|fset) The weight of (c); (1-. alpha.) denotes CMI (C; f)set|fk) The weight of (c); MI (C; f)k) Represents mutual information of the candidate features and the category attribute C, CMI (C; f. ofk|fset) And condition mutual information representing candidate characteristics and category attributes.
8. The device for selecting features based on conditional mutual information according to claim 6, wherein the obtaining module is further configured to initialize the selected feature set S as an empty set, and set an initial weight coefficient α; the value range of the weight coefficient alpha is as follows: alpha is more than or equal to 0 and less than or equal to 1.
9. An electronic device, characterized in that the electronic device comprises a processor and a memory, the processor is used for executing a computer program stored in the memory to realize the feature selection method based on conditional mutual information according to any one of claims 1 to 5.
10. A computer-readable storage medium storing at least one instruction which, when executed by a processor, implements the method for feature selection based on conditional mutual information according to any one of claims 1 to 5.
CN202111021982.9A 2021-09-01 2021-09-01 Feature selection method, device, equipment and storage medium based on conditional mutual information Pending CN113761026A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111021982.9A CN113761026A (en) 2021-09-01 2021-09-01 Feature selection method, device, equipment and storage medium based on conditional mutual information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111021982.9A CN113761026A (en) 2021-09-01 2021-09-01 Feature selection method, device, equipment and storage medium based on conditional mutual information

Publications (1)

Publication Number Publication Date
CN113761026A true CN113761026A (en) 2021-12-07

Family

ID=78792490

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111021982.9A Pending CN113761026A (en) 2021-09-01 2021-09-01 Feature selection method, device, equipment and storage medium based on conditional mutual information

Country Status (1)

Country Link
CN (1) CN113761026A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115239485A (en) * 2022-08-16 2022-10-25 苏州大学 Credit evaluation method and system based on forward iteration constraint scoring feature selection
CN115840885A (en) * 2023-02-23 2023-03-24 青岛创新奇智科技集团股份有限公司 Feature selection method and device for deep synthesis features

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115239485A (en) * 2022-08-16 2022-10-25 苏州大学 Credit evaluation method and system based on forward iteration constraint scoring feature selection
CN115840885A (en) * 2023-02-23 2023-03-24 青岛创新奇智科技集团股份有限公司 Feature selection method and device for deep synthesis features
CN115840885B (en) * 2023-02-23 2023-05-09 青岛创新奇智科技集团股份有限公司 Feature selection method and device for depth synthesis features

Similar Documents

Publication Publication Date Title
KR20210032140A (en) Method and apparatus for performing pruning of neural network
CN109325516B (en) Image classification-oriented ensemble learning method and device
CN111882040A (en) Convolutional neural network compression method based on channel number search
CN108897829A (en) Modification method, device and the storage medium of data label
JP2023523029A (en) Image recognition model generation method, apparatus, computer equipment and storage medium
CN113761026A (en) Feature selection method, device, equipment and storage medium based on conditional mutual information
US20200090076A1 (en) Non-transitory computer-readable recording medium, prediction method, and learning device
CN107783998A (en) The method and device of a kind of data processing
TWI710970B (en) Unsupervised model evaluation method, device, server and readable storage medium
CN109948680A (en) The classification method and system of medical record data
CN113011529A (en) Training method, device and equipment of text classification model and readable storage medium
CN116150125A (en) Training method, training device, training equipment and training storage medium for structured data generation model
CN111831805A (en) Model creation method and device, electronic equipment and readable storage device
CN113706285A (en) Credit card fraud detection method
CN116628600A (en) Unbalanced data sampling method and device based on random forest
CN116756662A (en) Yield prediction method and system for optimizing random forest based on Harris eagle algorithm
CN111079843A (en) Training method based on RBF neural network
CN106776600A (en) The method and device of text cluster
CN106445960A (en) Data clustering method and device
CN112308122B (en) High-dimensional vector space sample rapid searching method and device based on double trees
KR20160001375A (en) Apparatus and method for learning and classification of decision tree
CN111026661B (en) Comprehensive testing method and system for software usability
CN113827980A (en) Loss user prediction method and device and computer readable storage medium
CN114254764B (en) Feedback-based machine learning model searching method, system, equipment and medium
CN118378074B (en) Method and system for scheduling sorting algorithm in sparse matrix solving process

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination