CN105260371B

CN105260371B - Feature selection method and device

Info

Publication number: CN105260371B
Application number: CN201410342523.4A
Authority: CN
Inventors: 张世明; 袁明轩; 曾嘉
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2014-07-17
Filing date: 2014-07-17
Publication date: 2018-12-07
Anticipated expiration: 2034-07-17
Also published as: CN105260371A

Abstract

Embodiments of the present invention provide a feature selection method and device, relating to the field of data mining technology. An optimal feature subset is determined based on the correlations between feature variables, which improves the validity and computational efficiency of feature selection on high-dimensional data. The feature selection method provided by the embodiments includes: calculating the correlations between the feature variables in an original data set, and the correlation between each feature variable in the original data set and a prediction target feature variable; obtaining a strongly correlated feature subset and a weakly correlated feature subset according to these correlations; and determining, as the optimal feature subset of the prediction target feature variable, the set consisting of all feature variables in the strongly correlated feature subset and those feature variables in the weakly correlated feature subset that are directly related to a feature variable in the strongly correlated feature subset.

Description

Feature selection method and device

Technical field

The present invention relates to the field of data mining technology, and in particular to a feature selection method and device.

Background art

For high-dimensional data, such as space remote sensing data, biological data, network data and financial market transaction data, both the volume and the dimensionality of the data grow exponentially. This brings a "blessing of dimensionality": the rich information contained in high-dimensional data may open up new possibilities for solving problems. It also brings the "curse of dimensionality": in a high-dimensional space the Euclidean distances between points are almost identical, which makes pattern recognition and rule discovery in high-dimensional data extremely difficult. Therefore, to avoid the curse of dimensionality, feature selection needs to be performed on high-dimensional data.

Fig. 1 is a schematic flowchart of the basic feature selection procedure in the prior art. As shown in Fig. 1, it includes the following steps: S101. randomly generate a feature subset from the original data set; S102. evaluate the feature subset using an evaluation function; S103. compare the evaluation result with a stopping criterion and judge whether the evaluation result is better than the stopping criterion; if so, perform step S104; if not, repeat steps S101-S103; S104. verify the validity of the feature subset and determine the feature subset to be the optimal feature subset. In this basic procedure, the quality of the initial feature subset generated from the data set directly affects the number of iterations of the whole procedure, and a randomly generated feature subset in particular makes the iteration converge slowly. Moreover, the evaluation criterion for feature subsets is difficult to determine, so a result that is not accurate enough is easily judged to be the optimal feature subset. Consequently, feature selection in the prior art generally has low computational efficiency, and the optimal feature subset it selects is not accurate enough.

Summary of the invention

Embodiments of the present invention provide a feature selection method and device that address the problem of how to select a more accurate optimal feature subset from the original feature set, and that improve the validity and computational efficiency of feature selection on high-dimensional data.

To achieve the above objectives, the present invention adopts the following technical solutions:

In a first aspect, an embodiment of the present invention provides a feature selection method, including:

calculating the correlations between the feature variables in an original data set, and the correlation between each feature variable in the original data set and a prediction target feature variable, where the original data set contains N-dimensional feature variables, the N-dimensional feature variables include N-1 dimensions of the feature variables and the prediction target feature variable, and N is a positive integer;

obtaining a strongly correlated feature subset and a weakly correlated feature subset according to the correlations between the feature variables in the original data set and the correlation between each feature variable in the original data set and the prediction target feature variable, where the feature variables in the strongly correlated feature subset are the feature variables in the original data set that are directly related to the prediction target feature variable, and the feature variables in the weakly correlated feature subset are the feature variables in the original data set that are indirectly related to the prediction target feature variable;

determining, as the optimal feature subset of the prediction target feature variable, the set consisting of all feature variables in the strongly correlated feature subset and those feature variables in the weakly correlated feature subset that are directly related to a feature variable in the strongly correlated feature subset.

With reference to the first aspect, in a first possible implementation of the first aspect, the original data set further contains M groups of data, and the M groups of data include a training data set, where each group of data contains the data corresponding to the N-dimensional feature variables acquired at the same moment, and M is a positive integer;

correspondingly, the calculating of the correlations between the feature variables in the original data set, and of the correlation between each feature variable in the original data set and the prediction target feature variable, includes:

calculating, according to the data in the training data set, the correlations between the feature variables in the original data set, and the correlation between each feature variable in the original data set and the prediction target feature variable.

With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, the M groups of data further include an evaluation data set and a test data set;

correspondingly, the obtaining of a strongly correlated feature subset and a weakly correlated feature subset according to the correlations between the feature variables in the original data set and the correlation between each feature variable in the original data set and the prediction target feature variable includes:

obtaining a classification model according to the correlations between the feature variables in the original data set, the correlation between each feature variable in the original data set and the prediction target feature variable, the evaluation data set and the test data set;

obtaining the strongly correlated feature subset and the weakly correlated feature subset according to the classification model.

With reference to the second possible implementation of the first aspect, in a third possible implementation of the first aspect, the obtaining of a classification model according to the correlations between the feature variables in the original data set, the correlation between each feature variable in the original data set and the prediction target feature variable, the evaluation data set and the test data set includes:

establishing an initial Bayesian network model according to the correlations between the feature variables in the original data set and the correlation between each feature variable in the original data set and the prediction target feature variable, where the initial Bayesian network model contains nodes and directed edges, each node represents a feature variable, and each directed edge represents the correlation between the two nodes it connects;

iteratively training the initial Bayesian network model using the evaluation data set to obtain a stable Bayesian network model, where the stable Bayesian network model is a Bayesian network model containing irreversible directed edges;

testing the stable Bayesian network model using the test data set, and, if the topology of the stable Bayesian network model remains unchanged, determining the stable Bayesian network model to be the classification model.

With reference to the first aspect or any one of the first to third possible implementations of the first aspect, in a fourth possible implementation of the first aspect, the determining, as the optimal feature subset of the prediction target feature variable, of the set consisting of all feature variables in the strongly correlated feature subset and those feature variables in the weakly correlated feature subset that are directly related to a feature variable in the strongly correlated feature subset includes:

selecting a first feature variable from the weakly correlated feature subset, adding the first feature variable to a current prediction model, and judging whether the prediction accuracy of the current prediction model after the first feature variable is added is greater than the prediction accuracy of the current prediction model, where the first feature variable is the feature variable in the weakly correlated feature subset that has the greatest correlation with the prediction target feature variable, the current prediction model is an initial prediction model or an updated initial prediction model, and the initial prediction model is a prediction model established with the feature variables in the strongly correlated feature subset as inputs;

if so, updating the current prediction model, deleting the first feature variable from the weakly correlated feature subset, and adding it to a first set;

if not, leaving the current prediction model unchanged and deleting the first feature variable from the weakly correlated feature subset;

repeating the above process until no feature variable remains in the weakly correlated feature subset;

determining the set consisting of the feature variables in the strongly correlated feature subset and the feature variables in the first set to be the optimal feature subset of the prediction target feature variable.
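The procedure above amounts to a greedy forward selection over the weakly correlated feature variables. A minimal Python sketch follows, assuming a hypothetical `evaluate` callable that trains a prediction model on a given list of input features and returns its prediction accuracy. The function names, the feature names and the toy accuracy values are illustrative assumptions and are not taken from the described method.

```python
def greedy_select(strong, weak_ranked, evaluate):
    """Greedy forward selection over weakly correlated features.

    strong:      features of the strongly correlated subset (always kept)
    weak_ranked: weakly correlated features, sorted by decreasing
                 correlation with the prediction target
    evaluate:    hypothetical callable, list of features -> accuracy
    """
    current = list(strong)            # inputs of the current prediction model
    best = evaluate(current)          # accuracy of the initial prediction model
    first_set = []                    # weak features that improved the accuracy
    for feature in weak_ranked:       # each weak feature is tried exactly once
        accuracy = evaluate(current + [feature])
        if accuracy > best:           # keep the feature and update the model
            current.append(feature)
            best = accuracy
            first_set.append(feature)
        # otherwise the feature is dropped and the model stays unchanged
    return current, first_set, best


# Toy stand-in for model training: a fixed base accuracy plus a made-up
# per-feature gain (assumed values, for illustration only).
GAIN = {"w1": 0.03, "w2": 0.05, "w3": -0.02}

def toy_evaluate(features):
    return 0.70 + sum(GAIN.get(f, 0.0) for f in features)

optimal, first_set, accuracy = greedy_select(
    ["s1", "s2", "s3", "s4", "s5"], ["w2", "w3", "w1"], toy_evaluate)
```

With these toy values, `w2` and `w1` are retained and `w3` is discarded, so the resulting subset holds the five strong features plus two weak ones.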

With reference to the fourth possible implementation of the first aspect, in a fifth possible implementation of the first aspect, the prediction model is a neural network model;

correspondingly, the establishing of a prediction model with the feature variables in the strongly correlated feature subset as inputs includes:

constructing a neural network model with the feature variables in the strongly correlated feature subset as input units, where the neural network model includes an input layer, a hidden layer and an output layer, and the input layer and the hidden layer, as well as the hidden layer and the output layer, are connected by connection weight functions.
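As an illustration of the three-layer structure described above, the following Python sketch runs a forward pass through one input layer, one hidden layer and one output layer. The layer sizes, the random initial weights and the tanh activation are assumptions made for the example; the text only states that adjacent layers are connected by connection weights.

```python
import math
import random

def init_weights(n_in, n_out, rng):
    # One row of connection weights per unit in the receiving layer.
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]

def forward(inputs, w_hidden, w_output):
    # Input layer -> hidden layer (the tanh activation is an assumption).
    hidden = [math.tanh(sum(w * x for w, x in zip(row, inputs)))
              for row in w_hidden]
    # Hidden layer -> output layer (linear output, also an assumption).
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_output]

rng = random.Random(0)
W_HIDDEN = init_weights(5, 3, rng)   # 5 strongly correlated features as inputs
W_OUTPUT = init_weights(3, 1, rng)   # 1 output unit: the prediction target
prediction = forward([0.2, 0.4, 0.1, 0.9, 0.5], W_HIDDEN, W_OUTPUT)
```

Adding a weakly correlated feature to such a model corresponds to adding one input unit, that is, one more column in `W_HIDDEN`, and retraining.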

In a second aspect, an embodiment of the present invention provides a feature selection device, including:

a computing module, configured to calculate the correlations between the feature variables in an original data set, and the correlation between each feature variable in the original data set and a prediction target feature variable, where the original data set contains N-dimensional feature variables, the N-dimensional feature variables include N-1 dimensions of the feature variables and the prediction target feature variable, and N is a positive integer;

an obtaining module, configured to obtain a strongly correlated feature subset and a weakly correlated feature subset according to the correlations between the feature variables in the original data set and the correlation between each feature variable in the original data set and the prediction target feature variable, as calculated by the computing module, where the feature variables in the strongly correlated feature subset are the feature variables in the original data set that are directly related to the prediction target feature variable, and the feature variables in the weakly correlated feature subset are the feature variables in the original data set that are indirectly related to the prediction target feature variable;

a determining module, configured to determine, as the optimal feature subset of the prediction target feature variable, the set consisting of all feature variables in the strongly correlated feature subset obtained by the obtaining module and those feature variables in the weakly correlated feature subset that are directly related to a feature variable in the strongly correlated feature subset.

With reference to the second aspect, in a first possible implementation of the second aspect, the original data set further contains M groups of data, and the M groups of data include a training data set, where each group of data contains the data corresponding to the N-dimensional feature variables acquired at the same moment, and M is a positive integer;

correspondingly, the computing module is specifically configured to:

With reference to the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the M groups of data further include an evaluation data set and a test data set;

correspondingly, the obtaining module is specifically configured to:

obtain a classification model according to the correlations between the feature variables in the original data set, the correlation between each feature variable in the original data set and the prediction target feature variable, the evaluation data set and the test data set;

With reference to the second possible implementation of the second aspect, in a third possible implementation of the second aspect, the obtaining module is specifically configured to:

With reference to the second aspect or any one of the first to third possible implementations of the second aspect, in a fourth possible implementation of the second aspect, the determining module is specifically configured to:

With reference to the fourth possible implementation of the second aspect, in a fifth possible implementation of the second aspect, the prediction model is a neural network model;

correspondingly, the determining module is specifically configured to:

Embodiments of the present invention provide a feature selection method and device. First, the correlations between the feature variables in the original data set, and the correlation between each feature variable in the original data set and the prediction target feature variable, are calculated. Then, a strongly correlated feature subset and a weakly correlated feature subset are obtained according to these correlations. Finally, the set consisting of all feature variables in the strongly correlated feature subset and those feature variables in the weakly correlated feature subset that are directly related to a feature variable in the strongly correlated feature subset is determined to be the optimal feature subset of the prediction target feature variable. In this way, feature selection is performed according to the correlations between feature variables, which avoids the large number of operations, the heavy computational load and the inaccurate optimal feature subset caused by randomly selecting feature subsets from a high-dimensional original data set, and improves the computational efficiency and validity of feature selection on high-dimensional data.

Brief description of the drawings

To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Apparently, the accompanying drawings in the following description show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of the basic feature selection procedure in the prior art;

Fig. 2 is a flowchart of a feature selection method according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a Bayesian network model according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a neural network model according to an embodiment of the present invention;

Fig. 5 is a structural diagram of a feature selection device according to an embodiment of the present invention;

Fig. 6 is a structural diagram of a feature selection device according to an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments. Apparently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.

Embodiment one

Fig. 2 is a flowchart of a feature selection method according to an embodiment of the present invention. As shown in Fig. 2, the method may include the following steps:

201. Calculate the correlations between the feature variables in the original data set, and the correlation between each feature variable in the original data set and the prediction target feature variable.

Here, a feature variable is a description of a certain feature of an entity such as a process, an event or a state. The prediction target feature variable is a preset "phenomenon" that needs to be described by a combination of multiple feature variables, and is itself a specific feature variable.

The original data set contains N-dimensional feature variables and M groups of data, where N and M are positive integers. The N-dimensional feature variables include N-1 dimensions of the feature variables and the prediction target feature variable, and each group of data contains the data corresponding to the N-dimensional feature variables acquired at the same moment. The prediction target feature variable is described by the combination of the N-1 dimensions of feature variables; that is, in practice, when the specific value of the prediction target feature variable needs to be determined, it can be determined jointly from the data corresponding to the N-1 dimensions of feature variables. It should be noted that the prediction target feature variable may be any one of the N-dimensional feature variables.

For example, the original data set contains feature variables such as weather, temperature, humidity, air pressure, wind force, rainfall and radiation intensity, together with the measurement data corresponding to these feature variables. If the feature variable "weather" is taken as the prediction target feature variable, the weather can be described by the combination of feature variables such as temperature, humidity, air pressure, wind force, rainfall and radiation intensity. Alternatively, the feature variable "temperature" can be taken as the prediction target feature variable and described by the combination of feature variables such as weather, humidity, air pressure, wind force, rainfall and radiation intensity.

The specific values of N and M are not limited in the embodiments of the present invention. Preferably, the original data set is high-dimensional data, which may be obtained by real-time acquisition over a period of time or read from a database in which the original data set is stored in advance. For example, Table 1 shows an original data set containing N-dimensional feature variables and M groups of data.

Table 1

Because the original data set contains a relatively large number of groups, calculating the correlations between the feature variables in the original data set, and the correlation between each feature variable in the original data set and the prediction target feature variable, from the data of all groups would require a very large amount of complex computation. Therefore, preferably, M₁ groups of the M groups of data may be taken to form the training data set, where M₁ is less than M.

For example, if the original data set has 20 groups of data containing the N-dimensional feature variables, the training data set may be the set consisting of groups 1-10.

Preferably, the correlations between the feature variables in the original data set, and the correlation between each feature variable in the original data set and the prediction target feature variable, may be calculated using statistical algorithms such as the chi-square test and hypothesis testing, together with expert experience.
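One way to obtain such a correlation score for two discrete feature variables is Pearson's chi-square statistic computed from their contingency table. The sketch below is a plain-Python illustration; the function name and the sample values are assumptions, and converting the statistic into a significance decision (degrees of freedom, p-value) is left out.

```python
from collections import Counter

def chi_square(xs, ys):
    """Pearson chi-square statistic for two equally long discrete samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal length")
    n = len(xs)
    observed = Counter(zip(xs, ys))   # joint counts per (x, y) cell
    row = Counter(xs)                 # marginal counts of the first variable
    col = Counter(ys)                 # marginal counts of the second variable
    stat = 0.0
    for x in row:
        for y in col:
            expected = row[x] * col[y] / n   # count expected under independence
            stat += (observed[(x, y)] - expected) ** 2 / expected
    return stat

# Made-up measurements for three feature variables (illustration only).
weather = ["rain", "rain", "sun", "sun"]
humidity = ["high", "high", "low", "low"]
wind = ["weak", "strong", "weak", "strong"]
dependent = chi_square(weather, humidity)
independent = chi_square(weather, wind)
```

A larger statistic indicates a stronger association: with these samples `dependent` is 4.0 and `independent` is 0.0.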

202. Obtain a strongly correlated feature subset and a weakly correlated feature subset according to the correlations between the feature variables in the original data set and the correlation between each feature variable in the original data set and the prediction target feature variable.

Here, the feature variables in the strongly correlated feature subset are the feature variables in the original data set that have a direct correlation with the prediction target feature variable, and the feature variables in the weakly correlated feature subset are the feature variables in the original data set that have an indirect correlation with the prediction target feature variable. A feature variable having a direct correlation with the prediction target feature variable is a feature variable whose correlation with the prediction target feature variable is greater than a first preset threshold, that is, a feature variable that has a significant influence on the prediction target feature variable. A feature variable having an indirect correlation with the prediction target feature variable is a feature variable whose correlation with an intermediate feature variable is greater than a second preset threshold, that is, a feature variable that has a significant influence on the intermediate feature variable and, through it, an indirect influence on the prediction target feature variable. The intermediate feature variable is a feature variable in the strongly correlated feature subset or in the weakly correlated feature subset.

The first preset threshold and the second preset threshold are set as needed, and are not limited in the embodiments of the present invention.

In step 201, the correlations between the feature variables in the original data set, and the correlation between each feature variable in the original data set and the prediction target feature variable, are calculated from only part of the original data set, so they may not reflect the true correlations very accurately. In addition, the specific values of the first preset threshold and the second preset threshold are difficult to determine. Dividing the feature variables directly according to the correlations calculated in step 201 may therefore yield an inaccurate strongly correlated feature subset and weakly correlated feature subset. Therefore, preferably, in this step M₂ groups of the M groups of data may be taken to form the evaluation data set and M₃ groups to form the test data set, and the N-dimensional feature variables in the original data set are classified by the following method to obtain the strongly correlated feature subset and the weakly correlated feature subset:

Here, M₂ is less than M and M₃ is less than M. Preferably, the M groups of data may be randomly divided into mutually independent sets of M₁ groups, M₂ groups and M₃ groups of data, where M₁, M₂ and M₃ may or may not be equal. For example, if the original data set has 20 groups of data containing the N-dimensional feature variables, the training data set may include groups 1-10, the evaluation data set may include groups 11, 14 and 18-20, and the test data set may include groups 12, 13 and 15-17.
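The random division into mutually disjoint training, evaluation and test sets can be sketched as follows. The function name, the fixed seed and the group sizes 10/5/5 are assumptions chosen to mirror the 20-group example.

```python
import random

def split_groups(m, m1, m2, m3, seed=0):
    """Randomly split group indices 1..m into three disjoint index lists."""
    if min(m1, m2, m3) < 0 or m1 + m2 + m3 > m:
        raise ValueError("group sizes must be non-negative and sum to at most m")
    indices = list(range(1, m + 1))
    random.Random(seed).shuffle(indices)          # seeded for reproducibility
    training = sorted(indices[:m1])               # M1 groups
    evaluation = sorted(indices[m1:m1 + m2])      # M2 groups
    test = sorted(indices[m1 + m2:m1 + m2 + m3])  # M3 groups
    return training, evaluation, test

training, evaluation, test = split_groups(20, 10, 5, 5)
```

Because the three lists are slices of one shuffled permutation, no group can appear in more than one set.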

Preferably, the classification model may be a Bayesian network model.

For example, the obtaining of a classification model according to the correlations between the feature variables in the original data set, the correlation between each feature variable in the original data set and the prediction target feature variable, the evaluation data set and the test data set includes:

establishing an initial Bayesian network model according to the correlations between the feature variables in the original data set and the correlation between each feature variable in the original data set and the prediction target feature variable, where the Bayesian network model is a directed acyclic graph (DAG) containing nodes and directed edges, each node represents a feature variable, and each directed edge represents the correlation between the two nodes it connects;

iteratively training the initial Bayesian network model using the evaluation data set to obtain a stable Bayesian network model, where the stable Bayesian network model is a Bayesian network model containing irreversible directed edges;

testing the stable Bayesian network model using the test data set, and, if the topology of the stable Bayesian network model remains unchanged, determining the stable Bayesian network model to be the classification model.

When the classification model is a Bayesian network model, the obtaining of the strongly correlated feature subset and the weakly correlated feature subset according to the classification model may include:

taking the set of feature variables corresponding to the nodes in the Markov blanket of the Bayesian network model as the strongly correlated feature subset;

taking the set of feature variables corresponding to the nodes in the Bayesian network model that reach the target node through at least two directed edges and are not included in the Markov blanket as the weakly correlated feature subset, where the target node is the node corresponding to the prediction target feature variable;

taking the set of feature variables corresponding to the nodes in the Bayesian network model that cannot reach the target node through any directed edge as the uncorrelated feature subset.

Here, the Markov blanket of the Bayesian network model consists of the parent nodes of the target node, the child nodes of the target node, and the spouse nodes of the target node. A parent node of the target node is a node that directly influences the target node; a child node of the target node is a node that the target node directly influences; a spouse node of the target node is a node that shares at least one common parent node or one common child node with the target node.

For example, Fig. 3 is a structural diagram of a Bayesian network model according to an embodiment of the present invention. As shown in Fig. 3, the Bayesian network model contains 13 nodes, each node represents one feature variable, and the node labeled T is the target node. The correlations between the variables can be observed from Fig. 3: the target node depends on node 3 and node 4, the target node influences node 9 and node 10, and node 7 is a spouse node of the target node T. The Markov blanket of the target node T therefore contains node 3, node 4, node 7, node 9 and node 10, and the strongly correlated feature subset is the set of the corresponding feature variables 3, 4, 7, 9 and 10. Node 1 can reach the target node T through the path 1-4-T, and node 2 can reach the target node T through the path 2-4-T, so the weakly correlated feature subset is the set of feature variable 1 and feature variable 2, corresponding to node 1 and node 2. The set of feature variables 5, 6, 8, 11 and 12, corresponding to nodes 5, 6, 8, 11 and 12, is the uncorrelated feature subset.
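The partition read off Fig. 3 can be reproduced programmatically from a list of directed edges. In the Python sketch below the edge list is an assumption: only the edges into and out of T and the paths 1-4-T and 2-4-T come from the description, while the remaining edges (including the one that makes node 7 a spouse) are invented to complete a 13-node graph. A spouse is taken here to be a node that shares a child with the target, which is the usual Markov blanket convention and is narrower than the wording used above.

```python
from collections import deque

# Hypothetical edge list (parent, child) loosely modelled on Fig. 3.
FIG3_LIKE_EDGES = [
    (3, "T"), (4, "T"), ("T", 9), ("T", 10),   # stated in the description
    (1, 4), (2, 4),                            # paths 1-4-T and 2-4-T
    (7, 9),                                    # assumed: makes node 7 a spouse
    (5, 6), (6, 8), (9, 11), (10, 12),         # assumed filler edges
]

def partition_features(edges, target):
    """Return (strong, weak, uncorrelated) node sets for the target node."""
    parents, children, nodes = {}, {}, set()
    for u, v in edges:
        nodes.update((u, v))
        children.setdefault(u, set()).add(v)
        parents.setdefault(v, set()).add(u)
    # Markov blanket: parents, children and co-parents of the children.
    spouses = set()
    for child in children.get(target, set()):
        spouses |= parents.get(child, set())
    strong = (parents.get(target, set()) | children.get(target, set())
              | spouses) - {target}
    # Backward breadth-first search: shortest directed distance to the target.
    dist = {target: 0}
    queue = deque([target])
    while queue:
        v = queue.popleft()
        for u in parents.get(v, ()):
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    # Weak: reaches the target over at least two edges, outside the blanket.
    weak = {n for n, d in dist.items() if d >= 2} - strong
    # Uncorrelated: everything else, i.e. no directed path to the target.
    uncorrelated = nodes - strong - weak - {target}
    return strong, weak, uncorrelated

strong, weak, uncorrelated = partition_features(FIG3_LIKE_EDGES, "T")
```

With this assumed edge list the three sets match the example: nodes 3, 4, 7, 9 and 10 are strong, nodes 1 and 2 are weak, and nodes 5, 6, 8, 11 and 12 are uncorrelated.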

Since the characteristic variable that the strong correlation character subset obtained in step 202 includes is and prediction target signature variable The characteristic variable having a direct impact, the characteristic variable for including in weak correlated characteristic subset may have shadow with prediction target signature variable It rings, i.e., the characteristic variable that weak correlated characteristic subset includes not exclusively is redundancy feature variable, the spy that uncorrelated features subset includes Sign variable and prediction target signature variable do not have any relationship, are redundancy feature variables, so, in order to guarantee optimal characteristics The accuracy of collection needs directly to delete the characteristic variable for including in uncorrelated features subset, retains strong phase according to following step 203 The Partial Feature variable in all characteristic variables and weak correlated characteristic subset in character subset is closed as in optimal feature subset Characteristic variable.

203. Determine, as the optimal feature subset of the prediction target feature variable, the set consisting of all feature variables included in the strongly correlated feature subset and those feature variables in the weakly correlated feature subset that are directly related to feature variables in the strongly correlated feature subset.

Preferably, the optimal feature subset of the prediction target signature variable can be determined by following methods:

Here, updating the current prediction model means retaining the first feature variable at the input of the prediction model and setting the updated prediction model as the current model.

For example, the strongly correlated feature subset includes 5 feature variables, and the weakly correlated feature subset includes 3 feature variables: feature variable 1, feature variable 2 and feature variable 3. Ranked by the magnitude of their correlation with the prediction target feature variable, from largest to smallest, these three are feature variable 2, feature variable 3 and feature variable 1. First, an initial prediction model is established using the strongly correlated feature subset; its input consists of 5 nodes in one-to-one correspondence with those feature variables, and the prediction accuracy of the initial prediction model is E. Feature variable 2 is then added to the input of the prediction model, and the prediction model with feature variable 2 added is trained to obtain a prediction accuracy E'. If E' > E, feature variable 2 is retained at the input of the prediction model; the input of the current prediction model then has 6 nodes, and its prediction accuracy is E'. Next, feature variable 3 is added to the current prediction model. If the prediction accuracy after the addition satisfies E'' < E', feature variable 3 is deleted; the current model is still the 6-node prediction model, and its prediction accuracy is still E'. Finally, feature variable 1 is added to the current prediction model. If the prediction accuracy after the addition satisfies E''' > E', the 6-node prediction model is updated to the prediction model with feature variable 1 added; that is, the prediction model with 7 input nodes becomes the current model, with prediction accuracy E'''. At this point the weakly correlated feature subset contains no more feature variables, so the set consisting of the 5 feature variables of the strongly correlated feature subset together with feature variable 2 and feature variable 1 of the weakly correlated feature subset is determined as the optimal feature subset of the prediction target feature variable.
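The procedure in this example amounts to a greedy forward pass over the weakly correlated feature variables. A minimal Python sketch follows. The function name `select_optimal_subset`, the `evaluate` callback (which stands in for training the prediction model and measuring its prediction accuracy) and the toy accuracy figures are all hypothetical and are not taken from the specification.

```python
def select_optimal_subset(strong, weak_ranked, evaluate):
    """Greedy forward selection over the weakly correlated features.

    strong      -- features of the strongly correlated subset (always kept)
    weak_ranked -- weakly correlated features, most correlated first
    evaluate    -- hypothetical callback: list of features -> accuracy
    """
    current = list(strong)
    best = evaluate(current)            # accuracy E of the initial model
    first_set = []                      # weak features that were retained
    for feature in weak_ranked:
        trial = current + [feature]
        accuracy = evaluate(trial)
        if accuracy > best:             # keep the feature, update the model
            current, best = trial, accuracy
            first_set.append(feature)
        # otherwise the feature is dropped and the current model is unchanged
    return current, first_set, best

# Toy stand-in for model training: integer accuracy in percent (assumed values).
GAINS = {"w2": 5, "w3": -2, "w1": 3}

def toy_evaluate(features):
    return 80 + sum(GAINS.get(f, 0) for f in features)

strong_subset = ["s1", "s2", "s3", "s4", "s5"]
optimal, retained, accuracy = select_optimal_subset(
    strong_subset, ["w2", "w3", "w1"], toy_evaluate
)
print(optimal, retained, accuracy)
```

With these assumed gains the sketch retains the second and first weak features and discards the third, mirroring the sequence E, E', E'', E''' in the example. In practice each call to `evaluate` would retrain the prediction model, so the cost grows with the size of the weakly correlated feature subset.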

Preferably, the prediction model is a neural network model.

For example, Fig. 4 shows a neural network model provided in an embodiment of the present invention. As shown in Fig. 4, the neural network model includes an input layer, two hidden layers and an output layer. u_j,k is the connection weight function between the input layer and the first hidden layer, V_k,1 is the connection weight function between the first hidden layer and the second hidden layer, and W₁ is the connection weight function between the second hidden layer and the output layer. The nodes of the input layer correspond one-to-one to the feature variables included in the strongly correlated feature subset. Each group of sample data enters the neural network through the input layer and is passed to the first hidden layer through u_j,k; the data of the first hidden layer is passed to the second hidden layer through V_k,1; and, after the hidden layers have acted on it, the result is passed to the output layer through W₁.
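The data flow through this network can be illustrated with a small forward pass in plain Python. This is a sketch under stated assumptions: the connection weight functions are modelled as plain weight matrices named `u`, `v` and `w`, the hidden layers use a tanh activation, and the layer sizes (5 inputs, 4 and 3 hidden units, 1 output) are chosen arbitrarily. None of these details are given in the specification.

```python
import math
import random

def dense(inputs, weights, activation=None):
    """One fully connected layer; weights[j][k] links input j to unit k."""
    outputs = []
    for k in range(len(weights[0])):
        total = sum(inputs[j] * weights[j][k] for j in range(len(inputs)))
        outputs.append(activation(total) if activation else total)
    return outputs

def forward(x, u, v, w):
    """Input layer -> first hidden layer -> second hidden layer -> output."""
    h1 = dense(x, u, math.tanh)     # input layer to first hidden layer
    h2 = dense(h1, v, math.tanh)    # first to second hidden layer
    return dense(h2, w)[0]          # second hidden layer to single output

def random_matrix(rows, cols, rng):
    """Hypothetical helper: small random initial weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
u = random_matrix(5, 4, rng)   # 5 inputs, one per strongly correlated feature
v = random_matrix(4, 3, rng)
w = random_matrix(3, 1, rng)
sample = [0.2, -0.1, 0.7, 0.0, 0.4]   # one group of sample data (made up)
prediction = forward(sample, u, v, w)
print(prediction)
```

Adding a weakly correlated feature variable to the model corresponds to appending one row to `u` and one value to each input sample, after which the weights would be retrained.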

An embodiment of the present invention provides a feature selection method. First, the correlation between the feature variables in a raw data set, and the correlation between each feature variable in the raw data set and a prediction target feature variable, are calculated. Then, a strongly correlated feature subset and a weakly correlated feature subset are obtained according to those correlations. Finally, the set consisting of all feature variables in the strongly correlated feature subset and those feature variables in the weakly correlated feature subset that are directly related to feature variables in the strongly correlated feature subset is determined as the optimal feature subset of the prediction target feature variable. Because feature selection is performed according to the correlation between feature variables, this avoids the large number of operations, the heavy computational load and the inaccurate optimal feature subset that result from randomly selecting feature subsets from a high-dimensional raw data set, and improves the efficiency and effectiveness of feature selection on high-dimensional data.

Embodiment two

Fig. 5 is a structure diagram of a feature selection device 50 provided in an embodiment of the present invention. As shown in Fig. 5, the device may include:

A computing module 501, configured to calculate the correlation between the feature variables in a raw data set, and the correlation between each feature variable in the raw data set and a prediction target feature variable.

Preferably, the raw data set is high-dimensional data. It may be obtained by real-time acquisition over a period of time, or read from a database in which the raw data set has been stored in advance.

An obtaining module 502, configured to obtain a strongly correlated feature subset and a weakly correlated feature subset according to the correlation between the feature variables in the raw data set, and the correlation between each feature variable in the raw data set and the prediction target feature variable, as calculated by the computing module.

Here, the feature variables included in the strongly correlated feature subset are the feature variables in the raw data set that are directly related to the prediction target feature variable, and the feature variables included in the weakly correlated feature subset are the feature variables in the raw data set that are indirectly related to the prediction target feature variable.

A determining module 503, configured to determine, as the optimal feature subset of the prediction target feature variable, the set consisting of all feature variables included in the strongly correlated feature subset obtained by the obtaining module and those feature variables in the weakly correlated feature subset that are directly related to feature variables in the strongly correlated feature subset.

Because the raw data set contains a relatively large number of groups, calculating the correlation between the feature variables in the raw data set, and the correlation between each feature variable in the raw data set and the prediction target feature variable, from the data of all groups would involve a very large amount of computation and a complex operation. Therefore, further, M₁ groups of the M groups of data may be taken to form a training data set, where M₁ is less than M. The computing module 501 is specifically configured to:

calculate, according to the data in the training data set, the correlation between the feature variables in the raw data set, and the correlation between each feature variable in the raw data set and the prediction target feature variable.
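As a rough illustration of computing pairwise correlations on the training data set only, the sketch below uses the Pearson correlation coefficient. This passage does not name a correlation measure, so Pearson is purely an assumption made for the example; the names `pearson` and `correlation_table` and the sample values are likewise hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        return 0.0                      # a constant column carries no signal
    return cov / (norm_x * norm_y)

def correlation_table(groups, names):
    """Correlation for every unordered pair of variables.

    groups -- training groups; each group holds one value per variable
    names  -- variable names, in the same order as the values in a group
    """
    columns = {name: [g[i] for g in groups] for i, name in enumerate(names)}
    return {
        (a, b): pearson(columns[a], columns[b])
        for a in names for b in names if a < b
    }

# Made-up training groups: two feature variables and the prediction target y.
training_groups = [(1, 2, 10), (2, 4, 8), (3, 6, 6), (4, 8, 4)]
table = correlation_table(training_groups, ["x1", "x2", "y"])
print(table)
```

The table covers both kinds of correlation mentioned above: pairs of feature variables, and each feature variable paired with the prediction target.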

Because the computing module 501 calculates the correlation between the feature variables in the raw data set, and the correlation between each feature variable in the raw data set and the prediction target feature variable, from only part of the data in the raw data set, the result may not reflect those correlations very accurately. In addition, the specific values of the first preset threshold and the second preset threshold are difficult to determine. If the obtaining module 502 classified the feature subsets directly according to the correlations calculated by the computing module 501, the strongly correlated feature subset and the weakly correlated feature subset obtained might be inaccurate. Therefore, further, M₂ groups of the M groups of data may be taken to form an evaluation data set, and M₃ groups to form a test data set. The obtaining module 502 is specifically configured to:

Preferably, the classification model may be a Bayesian network model.

Correspondingly, the obtaining module 502 is specifically configured to:

test the stable Bayesian network model using the test data set, and, if the topology of the stable Bayesian network model remains unchanged, determine the stable Bayesian network model as the classification model;

take, as the weakly correlated feature subset, the set of feature variables corresponding to nodes in the Bayesian network model that reach the target node through at least two directed edges but are not included in the Markov blanket, where the target node is the node corresponding to the prediction target feature variable.
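The division of the M groups of data into a training data set (M₁ groups), an evaluation data set (M₂ groups) and a test data set (M₃ groups) could be sketched as follows. The specification does not say whether the three sets are disjoint or how the groups are chosen; the sketch assumes disjoint sets drawn after a seeded shuffle, and the function name `split_groups` is hypothetical.

```python
import random

def split_groups(groups, m1, m2, m3, seed=0):
    """Split M groups into training, evaluation and test data sets.

    Assumes the three sets are disjoint, so m1 + m2 + m3 must not exceed M.
    """
    if m1 + m2 + m3 > len(groups):
        raise ValueError("m1 + m2 + m3 must not exceed the number of groups")
    shuffled = list(groups)
    random.Random(seed).shuffle(shuffled)   # reproducible random order
    training = shuffled[:m1]
    evaluation = shuffled[m1:m1 + m2]
    test = shuffled[m1 + m2:m1 + m2 + m3]
    return training, evaluation, test

# Made-up example: M = 10 groups, identified here only by an index.
all_groups = list(range(10))
training, evaluation, test = split_groups(all_groups, m1=6, m2=2, m3=2)
print(training, evaluation, test)
```

Keeping the evaluation and test sets separate from the training set means the Bayesian network is refined and then checked on data that did not contribute to the initial correlation estimates.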

The feature variables in the strongly correlated feature subset obtained by the obtaining module 502 directly influence the prediction target feature variable. The feature variables in the weakly correlated feature subset may influence the prediction target feature variable; that is, they are not all redundant feature variables. The feature variables in the uncorrelated feature subset have no relationship with the prediction target feature variable and are redundant feature variables. Therefore, to ensure the accuracy of the optimal feature subset, further, the determining module 503 is specifically configured to:

select a first feature variable from the weakly correlated feature subset, add the first feature variable to the current prediction model, and judge whether the prediction accuracy of the current prediction model after the first feature variable is added is greater than the prediction accuracy of the current prediction model before the addition, where the first feature variable is the feature variable in the weakly correlated feature subset that has the greatest correlation with the prediction target feature variable, and the initial value of the current prediction model is the prediction model established with the feature variables in the strongly correlated feature subset as its input.

Preferably, the prediction model is a neural network model.

An embodiment of the present invention provides a feature selection device 50, which calculates the correlation between the feature variables in a raw data set, and the correlation between each feature variable in the raw data set and a prediction target feature variable; obtains a strongly correlated feature subset and a weakly correlated feature subset according to those correlations; and determines, as the optimal feature subset of the prediction target feature variable, the set consisting of all feature variables in the strongly correlated feature subset and those feature variables in the weakly correlated feature subset that are directly related to feature variables in the strongly correlated feature subset. Because feature selection is performed according to the correlation between feature variables, this avoids the large number of operations, the heavy computational load and the inaccurate optimal feature subset that result from randomly selecting feature subsets from a high-dimensional raw data set, and improves the efficiency and effectiveness of feature selection on high-dimensional data.

Embodiment three

Fig. 6 is a structure diagram of a feature selection device 60 provided in an embodiment of the present invention. As shown in Fig. 6, the device may include a processor 601, a memory 602, a communication unit 603, and at least one communication bus 604 configured to implement connection and mutual communication between these components.

The processor 601 may be a central processing unit (CPU).

The memory 602 may be a volatile memory, such as a random-access memory (RAM); or a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); or a combination of the above kinds of memory. The memory 602 provides instructions and data to the processor 601.

The communication unit 603 is configured to transmit data to and from external network elements.

The processor 601 is configured to calculate the correlation between the feature variables in a raw data set, and the correlation between each feature variable in the raw data set and a prediction target feature variable.

The processor 601 is further configured to obtain a strongly correlated feature subset and a weakly correlated feature subset according to the calculated correlation between the feature variables in the raw data set, and the calculated correlation between each feature variable in the raw data set and the prediction target feature variable.

The processor 601 is further configured to determine, as the optimal feature subset of the prediction target feature variable, the set consisting of all feature variables included in the obtained strongly correlated feature subset and those feature variables in the weakly correlated feature subset that are directly related to feature variables in the strongly correlated feature subset.

Because the raw data set contains a relatively large number of groups, calculating the correlation between the feature variables in the raw data set, and the correlation between each feature variable in the raw data set and the prediction target feature variable, from the data of all groups would involve a very large amount of computation and a complex operation. Therefore, further, M₁ groups of the M groups of data may be taken to form a training data set, where M₁ is less than M. The processor 601 is specifically configured to:

Because the processor 601 calculates the correlation between the feature variables in the raw data set, and the correlation between each feature variable in the raw data set and the prediction target feature variable, from only part of the data in the raw data set, the result may not reflect those correlations very accurately. In addition, the specific values of the first preset threshold and the second preset threshold are difficult to determine. If the processor 601 classified the feature subsets directly according to the correlations it calculated, the strongly correlated feature subset and the weakly correlated feature subset obtained might also be inaccurate. Therefore, further, M₂ groups of the M groups of data may be taken to form an evaluation data set, and M₃ groups to form a test data set. The processor 601 is specifically configured to:

Preferably, the classification model may be a Bayesian network model.

Correspondingly, the processor 601 is specifically configured to:

The feature variables in the strongly correlated feature subset obtained by the processor 601 directly influence the prediction target feature variable. The feature variables in the weakly correlated feature subset may influence the prediction target feature variable; that is, they are not all redundant feature variables. The feature variables in the uncorrelated feature subset have no relationship with the prediction target feature variable and are redundant feature variables. Therefore, to ensure the accuracy of the optimal feature subset, further, the processor 601 is specifically configured to:

Preferably, the prediction model is a neural network model.

An embodiment of the present invention provides a feature selection device 60, which calculates the correlation between the feature variables in a raw data set, and the correlation between each feature variable in the raw data set and a prediction target feature variable; obtains a strongly correlated feature subset and a weakly correlated feature subset according to those correlations; and determines, as the optimal feature subset of the prediction target feature variable, the set consisting of all feature variables in the strongly correlated feature subset and those feature variables in the weakly correlated feature subset that are directly related to feature variables in the strongly correlated feature subset. Because feature selection is performed according to the correlation between feature variables, this avoids the large number of operations, the heavy computational load and the inaccurate optimal feature subset that result from randomly selecting feature subsets from a high-dimensional raw data set, and improves the efficiency and effectiveness of feature selection on high-dimensional data.

In the several embodiments provided in this application, it should be understood that the disclosed system, device and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into units is merely a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices or units, and may be electrical or take other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.

The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to perform some of the steps of the methods of the embodiments of the present invention. The storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM for short), a random access memory (Random Access Memory, RAM for short), a magnetic disk or an optical disc.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of their technical features, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A feature selection method, characterized by comprising:

calculating the correlation between the feature variables in a raw data set, and the correlation between each feature variable in the raw data set and a prediction target feature variable; wherein the raw data set includes N-dimensional feature variables, the N-dimensional feature variables comprise N-1 dimensions of the feature variables and the prediction target feature variable, and N is a positive integer;

obtaining a strongly correlated feature subset and a weakly correlated feature subset according to the correlation between the feature variables in the raw data set, and the correlation between each feature variable in the raw data set and the prediction target feature variable; wherein the feature variables included in the strongly correlated feature subset are the feature variables in the raw data set that are directly related to the prediction target feature variable, and the feature variables included in the weakly correlated feature subset are the feature variables in the raw data set that are indirectly related to the prediction target feature variable;

determining, as the optimal feature subset of the prediction target feature variable, the set consisting of all feature variables included in the strongly correlated feature subset and those feature variables in the weakly correlated feature subset that are directly related to feature variables in the strongly correlated feature subset.

2. The feature selection method according to claim 1, characterized in that the raw data set further includes M groups of data, the M groups of data include a training data set, each group of data includes the data corresponding to the N-dimensional feature variables collected at the same moment, and M is a positive integer;

correspondingly, the calculating of the correlation between the feature variables in the raw data set, and the correlation between each feature variable in the raw data set and the prediction target feature variable, comprises:

calculating, according to the data in the training data set, the correlation between the feature variables in the raw data set, and the correlation between each feature variable in the raw data set and the prediction target feature variable.

3. The feature selection method according to claim 2, characterized in that the M groups of data further include an evaluation data set and a test data set;

correspondingly, the obtaining of a strongly correlated feature subset and a weakly correlated feature subset according to the correlation between the feature variables in the raw data set, and the correlation between each feature variable in the raw data set and the prediction target feature variable, comprises:

obtaining a classification model according to the correlation between the feature variables in the raw data set, the correlation between each feature variable in the raw data set and the prediction target feature variable, the evaluation data set and the test data set;

4. The feature selection method according to claim 3, characterized in that the obtaining of a classification model according to the correlation between the feature variables in the raw data set, the correlation between each feature variable in the raw data set and the prediction target feature variable, the evaluation data set and the test data set comprises:

establishing an initial Bayesian network model according to the correlation between the feature variables in the raw data set, and the correlation between each feature variable in the raw data set and the prediction target feature variable; wherein the initial Bayesian network model includes nodes and directed edges, each node represents a feature variable, and each directed edge represents the correlation between the two nodes it connects;

iteratively training the initial Bayesian network model using the evaluation data set to obtain a stable Bayesian network model; wherein the stable Bayesian network model is a Bayesian network model that includes irreversible directed edges;

testing the stable Bayesian network model using the test data set, and, if the topology of the stable Bayesian network model remains unchanged, determining the stable Bayesian network model as the classification model.

5. The feature selection method according to any one of claims 1 to 4, characterized in that the determining, as the optimal feature subset of the prediction target feature variable, of the set consisting of all feature variables included in the strongly correlated feature subset and those feature variables in the weakly correlated feature subset that are directly related to feature variables in the strongly correlated feature subset comprises:

selecting a first feature variable from the weakly correlated feature subset, adding the first feature variable to a current prediction model, and judging whether the prediction accuracy of the current prediction model after the first feature variable is added is greater than the prediction accuracy of the current prediction model before the addition; wherein the first feature variable is the feature variable in the weakly correlated feature subset that has the greatest correlation with the prediction target feature variable, the current prediction model is an initial prediction model or the initial prediction model after updating, and the initial prediction model is a prediction model established with the feature variables in the strongly correlated feature subset as its input;

if so, updating the current prediction model, deleting the first feature variable from the weakly correlated feature subset, and adding it to a first set;

if not, leaving the current prediction model unchanged, and deleting the first feature variable from the weakly correlated feature subset;

determining the set consisting of the feature variables in the strongly correlated feature subset and the feature variables in the first set as the optimal feature subset of the prediction target feature variable.

6. The feature selection method according to claim 5, characterized in that the prediction model is a neural network model;

the neural network model is constructed with the feature variables included in the strongly correlated feature subset as its input elements; wherein the neural network model includes an input layer, a hidden layer and an output layer, and the input layer and the hidden layer, and the hidden layer and the output layer, are connected by connection weight functions.

7. A feature selection device, characterized by comprising:

a computing module, configured to calculate the correlation between the feature variables in a raw data set, and the correlation between each feature variable in the raw data set and a prediction target feature variable; wherein the raw data set includes N-dimensional feature variables, the N-dimensional feature variables comprise N-1 dimensions of the feature variables and the prediction target feature variable, and N is a positive integer;

an obtaining module, configured to obtain a strongly correlated feature subset and a weakly correlated feature subset according to the correlation between the feature variables in the raw data set, and the correlation between each feature variable in the raw data set and the prediction target feature variable, as calculated by the computing module; wherein the feature variables included in the strongly correlated feature subset are the feature variables in the raw data set that are directly related to the prediction target feature variable, and the feature variables included in the weakly correlated feature subset are the feature variables in the raw data set that are indirectly related to the prediction target feature variable;

a determining module, configured to determine, as the optimal feature subset of the prediction target feature variable, the set consisting of all feature variables included in the strongly correlated feature subset obtained by the obtaining module and those feature variables in the weakly correlated feature subset that are directly related to feature variables in the strongly correlated feature subset.

8. The feature selection device according to claim 7, characterized in that the raw data set further includes M groups of data, the M groups of data include a training data set, each group of data includes the data corresponding to the N-dimensional feature variables collected at the same moment, and M is a positive integer;

correspondingly, the computing module is specifically configured to:

9. The feature selection device according to claim 8, characterized in that the M groups of data further include an evaluation data set and a test data set;

correspondingly, the obtaining module is specifically configured to:

10. The feature selection device according to claim 9, characterized in that the obtaining module is specifically configured to:

11. The feature selection device according to any one of claims 7 to 10, characterized in that the determining module is specifically configured to:

12. The feature selection device according to claim 11, characterized in that the prediction model is a neural network model;

correspondingly, the determining module is specifically configured to: