CN108875795A - Feature selection algorithm based on Relief and mutual information - Google Patents

Feature selection algorithm based on Relief and mutual information

Info

Publication number
CN108875795A
CN108875795A CN201810519640.1A
Authority
CN
China
Prior art keywords
feature
subset
formula
mutual information
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810519640.1A
Other languages
Chinese (zh)
Inventor
王红滨
褚慈
谢晓东
王勇军
原明旗
王念滨
周连科
秦帅
李浩然
白云鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN201810519640.1A priority Critical patent/CN108875795A/en
Publication of CN108875795A publication Critical patent/CN108875795A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a feature selection algorithm based on Relief and mutual information, belonging to the field of computer algorithms. The steps of the invention are as follows: (1) set the optimal feature subset to the empty set and initialize the weight of the optimal feature subset; (2) select a feature from all features in the data that does not yet belong to the optimal feature subset, put it into a candidate optimal feature subset, and compute the weight of the current candidate optimal feature subset with a compound feature evaluation criterion; (3) evaluate the weight of the candidate optimal feature subset and replace the optimal subset where appropriate; (4) remove unqualified candidate features; (5) if features remain to be selected, return to (2) and continue; otherwise, the algorithm terminates. The method addresses the problems that the Relief feature selection algorithm can only handle two-class problems and cannot handle redundant features, and proposes a feature selection algorithm based on an improved Relief weight that achieves high classification accuracy while remaining computationally efficient.

Description

Feature selection algorithm based on Relief and mutual information
Technical field
The present invention relates to an improved feature selection algorithm based on Relief and mutual information, and belongs to the field of computer algorithms.
Background art
Feature selection algorithms are broadly divided into Filter, Wrapper, Embedded, and Hybrid classes. Filter-class feature selection is widely used because it is computationally efficient. The most representative Filter-class algorithm is the Relief feature selection algorithm, whose idea is simple and whose computation is efficient. However, it can only handle two-class problems, which limits its applicability, and it cannot handle redundant features. Hybrid-class feature selection algorithms combine the advantages of the other classes and are therefore also widely used. Among them, algorithms that use mutual information as the evaluation criterion have received much attention. Because they use mutual information as the evaluation criterion, they are able to handle redundant features. However, computing the mutual information between features requires first computing the probability distributions, or even the probability densities, of the features. Such algorithms therefore carry a high computational load and low computational efficiency.
Summary of the invention
The present invention provides a feature selection algorithm based on Relief and mutual information. Its purpose is to solve two problems: the Relief algorithm can only handle two-class problems in feature selection, and algorithms that use mutual information as the evaluation criterion have high computational complexity.
The purpose of the present invention is accomplished in the following way:
Step 1: set the optimal feature subset to the empty set, and set the weight of the optimal feature subset to the minimum value of the integer type;
Step 2: select a feature from all features in the data that does not belong to the optimal feature subset, put it into the candidate optimal feature subset, and compute the weight of the current candidate optimal feature subset with the compound feature evaluation criterion;
Step 3: if the weight of the candidate optimal feature subset is greater than the weight of the optimal feature subset computed last time, update the optimal feature subset weight to the weight of the current candidate optimal feature subset, and take the current candidate feature subset as the optimal feature subset;
Step 4: if the weight of the candidate optimal feature subset is less than the weight of the optimal feature subset computed last time, remove this feature from the candidate features of the current data;
Step 5: if there are still features to be selected, return to Step 2 and continue; otherwise, the algorithm terminates.
Compared with the prior art, the advantages of the invention are:
The method provided by the invention addresses the problems that the Relief feature selection algorithm can only handle two-class problems and cannot handle redundant features, and proposes a feature selection algorithm based on an improved Relief weight. By modifying the Relief weight, the algorithm can evaluate a group of features as a subset, removing the limitation of the Relief algorithm to two-class problems. For the inability to handle redundant features, the invention uses mutual information as the evaluation criterion to resolve redundancy. Because mutual information is computationally expensive as an evaluation criterion, the invention computes it with the quadratic Renyi entropy, which solves this problem. On the premise of computing mutual information with the quadratic Renyi entropy, an evaluation criterion based on mutual information is proposed, which better resolves the redundancy and correlation between features. Finally, the improved Relief weight is combined with the mutual-information evaluation criterion computed with the quadratic Renyi entropy, and a feature selection algorithm with a mixed correlation measure is proposed, which achieves high classification accuracy while remaining computationally efficient.
Brief description of the drawings
Fig. 1 is the flow diagram of the invention;
Fig. 2 shows the dimensionality-reduction effect of the FSIRW feature selection algorithm on different data sets;
Fig. 3 compares the accuracy of the FSIRW feature selection algorithm of the invention with that of the full feature set on different data sets;
Fig. 4 compares the accuracy of the FSIRQ algorithm of the invention with other algorithms on different data sets;
Fig. 5 compares the execution time of the FSIRQ algorithm of the invention with other algorithms on different data sets.
Detailed description of the embodiments
The present invention is further described below with reference to the drawings:
The feature selection algorithm based on Relief and mutual information provided by the invention is realized by the following steps and is shown intuitively in the flow diagram of Fig. 1:
Step 1: set the optimal feature subset to the empty set, and set the weight of the optimal feature subset to the minimum value of the integer type;
Step 2: select a feature from all features in the data that does not belong to the optimal feature subset, put it into the candidate optimal feature subset, and compute the weight of the current candidate optimal feature subset with the compound feature evaluation criterion;
Step 3: if the weight of the candidate optimal feature subset is greater than the weight of the optimal feature subset computed last time, update the optimal feature subset weight to the weight of the current candidate optimal feature subset, and take the current candidate feature subset as the optimal feature subset;
Step 4: if the weight of the candidate optimal feature subset is less than the weight of the optimal feature subset computed last time, remove this feature from the candidate features of the current data;
Step 5: if there are still features to be selected, return to Step 2 and continue; otherwise, the algorithm terminates (a sketch of this loop is given below).
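The five steps above can be condensed into a short sketch. This is a minimal illustration rather than the patent's implementation: `subset_weight` stands in for the compound evaluation criterion (the improved Relief weight combined with QJMI) described later, and all names are illustrative.

```python
import sys

def select_features(all_features, subset_weight):
    best_subset = []                # Step 1: optimal subset starts empty
    best_weight = -sys.maxsize      # Step 1: weight = minimum integer value
    remaining = list(all_features)
    while remaining:                # Step 5: loop while features remain
        feature = remaining.pop(0)  # Step 2: take an unselected feature
        candidate = best_subset + [feature]
        weight = subset_weight(candidate)
        if weight > best_weight:    # Step 3: keep a candidate that improves
            best_weight = weight    #         the subset weight
            best_subset = candidate
        # Step 4: a feature that does not improve the weight has already been
        # removed from the candidate pool by pop() and is never reconsidered.
    return best_subset
```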
The primary objects and technical effects of the method provided by the invention:
Data in today's society are increasingly high-dimensional, and researchers at home and abroad are actively studying effective feature selection algorithms to reduce the dimensionality of data. Scholars have proposed many excellent algorithms with different ideas and evaluation criteria, each with its own characteristics. On the basis of these algorithms, the present invention addresses the fact that the Relief algorithm can only handle two-class problems and that algorithms using mutual information as the evaluation criterion have low computational efficiency. It proposes an improved Relief weight and an evaluation function that computes mutual information with the quadratic Renyi entropy, selects the essence of both, and combines the two evaluation criteria into the feature selection algorithm based on Relief and mutual information. The main points are as follows:
(1) Feature selection algorithm based on Relief.
Let O = {X_n | 1 ≤ n ≤ N} be a data set of feature samples over the complete set of N original features, with original feature set F = {F_1, F_2, ..., F_N}. Let S ⊆ F be the set of selected features, with N_S = |S| the number of selected features; the selected features thus form an N_S-dimensional feature subspace, denoted subspaceS. Let M be the number of labeled sample data in the database. Each feature is a random variable, and each labeled sample in the database takes a value in the corresponding dimension, i.e., a value of the corresponding feature. Let x_n^{(m)} denote the specific value of the feature random variable X_n in labeled sample m (1 ≤ m ≤ M). A labeled sample m can therefore be represented as the vector x^{(m)} = (x_1^{(m)}, ..., x_N^{(m)}) in the space formed by the complete set of N original features; likewise, x_S^{(m)} denotes the data point of sample m restricted to the N_S selected features in subspaceS. In addition, the class information of a sample is written with the letter c, where c^{(m)} denotes the class of sample m.
In its improvement of the Relief algorithm, the invention first redefines the distance between two points in the N_S-dimensional feature subspace subspaceS formed by the selected feature subset, as formula (1) below.
Wherein, in formula (1), 1 ≤ m_1, m_2 ≤ M, d_M represents the Manhattan distance between the two input vectors, and dist_max has the meaning shown in formula (2).
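Consistent with this description, a Manhattan distance normalized by its maximum over the sample pairs, formulas (1) and (2) take a form such as:

$$d_S\big(x_S^{(m_1)}, x_S^{(m_2)}\big) = \frac{d_M\big(x_S^{(m_1)}, x_S^{(m_2)}\big)}{dist_{max}} \tag{1}$$

$$dist_{max} = \max_{1 \le m_1, m_2 \le M} d_M\big(x_S^{(m_1)}, x_S^{(m_2)}\big) \tag{2}$$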
By formulas (1) and (2), the nearest-neighbor data point of the same class as x_S^{(m)} is now defined as x_S^{NH(m)}, and the nearest-neighbor data point of a different class is defined as x_S^{NM(m)}. Therefore, in the N_S-dimensional feature subspace subspaceS formed by the selected feature subset, the difference w_S^{NH}(m) between x_S^{(m)} and x_S^{NH(m)} is shown in formula (3).
The difference w_S^{NM}(m) between x_S^{(m)} and x_S^{NM(m)} is shown in formula (4).
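Following the classical Relief convention for nearest hits and misses, formulas (3) and (4) can be written as:

$$w_S^{NH}(m) = d_S\big(x_S^{(m)}, x_S^{NH(m)}\big) \tag{3}$$

$$w_S^{NM}(m) = d_S\big(x_S^{(m)}, x_S^{NM(m)}\big) \tag{4}$$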
By formulas (3) and (4), the feature-subset weight formula (5), computed from a single sample m for the current feature subset S in the N_S-dimensional feature subspace subspaceS, is obtained as follows.
Finally, formula (5) yields the feature weight formula (6) of the currently selected feature subset S, as follows.
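In the Relief convention, where a sample rewards subsets that keep it far from its nearest miss and close to its nearest hit, formulas (5) and (6) would read:

$$w_S(m) = w_S^{NM}(m) - w_S^{NH}(m) \tag{5}$$

$$w(S) = \frac{1}{M} \sum_{m=1}^{M} w_S(m) \tag{6}$$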
The FSIRW (Feature Selection algorithm based on Improved Relief Weight) feature selection algorithm improves the feature evaluation weight of the classical Relief algorithm and therefore has the ability to evaluate feature subsets. In the subset-evaluation part, the FSIRW algorithm first randomly selects M labeled samples from the data set. For each sampled point x_S^{(m)} in the current feature subspace, it searches for the nearest same-class sample point x_S^{NH(m)} and the nearest different-class sample point x_S^{NM(m)} in the subspace; it then computes the differences w_S^{NH}(m) and w_S^{NM}(m) by formulas (3) and (4), brings them into formula (5) to obtain w_S(m) for the current sample, and finally, after all sampled points have been processed, accumulates the results via formula (6) to obtain the feature weight w(S) of the feature subset under the current subspace.
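A minimal sketch of this weight computation, assuming the normalized-Manhattan reconstructions of formulas (1) through (6) above; the function names, fixed seed, and vectorized distance computation are illustrative, not taken from the patent.

```python
import numpy as np

def fsirw_weight(X, y, subset, n_samples):
    """Weight w(S) of a feature subset per formulas (3)-(6) as reconstructed
    above. X: (M, N) data matrix, y: (M,) labels, subset: selected columns."""
    rng = np.random.default_rng(0)  # fixed seed, for the sketch only
    Xs = X[:, subset]
    # Pairwise Manhattan distances in subspaceS, normalized by the maximum
    # distance (formulas (1)-(2) as reconstructed above).
    d = np.abs(Xs[:, None, :] - Xs[None, :, :]).sum(axis=-1)
    d = d / d.max()
    total = 0.0
    for m in rng.choice(len(X), size=n_samples, replace=False):
        same = (y == y[m])
        dm = d[m].copy()
        dm[m] = np.inf                              # exclude the sample itself
        w_nh = np.min(np.where(same, dm, np.inf))   # formula (3): nearest hit
        w_nm = np.min(np.where(~same, dm, np.inf))  # formula (4): nearest miss
        total += w_nm - w_nh                        # formula (5)
    return total / n_samples                        # formula (6): average
```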
An evaluation function alone does not make a feature selection method complete; a feature search strategy is also required. The FSIRW method, improved from the classical Relief filter-type feature selection algorithm, likewise needs a corresponding feature search strategy for completeness. In a high-dimensional feature space, however, exhaustively searching for the minimal optimal feature subset is itself NP-hard: an original feature space with N features has 2^N - 1 non-empty feature subsets. The exhaustive method therefore cannot be used for the feature search of FSIRW, and a local search strategy is both the only correct choice and a necessity. To obtain, in as low-dimensional a feature subspace as possible, an optimal feature subset with strong class-discrimination ability that meets the demand, the feature search part of the FSIRW algorithm uses a sequential forward search strategy.
(2) Feature evaluation function based on mutual information. Within the scope of the Shannon entropy, computing the mutual information between features requires first computing the probability distribution p(x) and the joint probability distribution P_12(x_i, y_i) of the corresponding features, and may even require the probability density function of the sample features. Computing probability density functions and joint probability densities is a high-load, low-efficiency process, so computing mutual information with Shannon-entropy methods is complex and inefficient. If, however, the feature selection method is based on another well-known information entropy, the Renyi entropy, the problems of computing mutual information within the Shannon framework can be solved, in particular the problem of high computational complexity. The mutual information calculation formula (7) based on the Renyi entropy is as follows.
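Matching the symbol descriptions in the next paragraph, formula (7) is the usual entropy decomposition of mutual information:

$$I_{R_2}(X;Y) = H_{R_2}(X) + H_{R_2}(Y) - H_{R_2}(X,Y) \tag{7}$$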
Wherein, in formula (7), I_{R_2}(X;Y) denotes the mutual information based on the Renyi entropy, H_{R_2}(X) denotes the information entropy of the quadratic Renyi entropy, and H_{R_2}(X,Y) denotes the joint information entropy based on the quadratic Renyi entropy.
The Renyi entropy is a broader definition that extends the Shannon entropy: in the limit of its degree q → 1, the Renyi entropy is equivalent to the more widely known Shannon entropy. Because the Renyi entropy is a more general definition of information entropy, choosing different values of the degree q yields Renyi entropies suited to different scenarios. The quadratic mutual information based on the quadratic Renyi entropy, in particular, has effective applications. The calculation formula (8) of the Renyi entropy is as follows; from the formula it can be seen that the Renyi entropy in fact extends the Shannon entropy by adding an additional parameter q.
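The standard definition referred to here is:

$$H_q(X) = \frac{1}{1-q} \log \int p(x)^q \, dx \tag{8}$$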
When the mutual information between features is computed using the quadratic Renyi information entropy (i.e., with the parameter q = 2 in the Renyi entropy formula (8)), it can be computed directly from the original data set, without requiring the probability density function. That is, the information potential V(X) = ∫ p(x)^2 dx can be computed directly by formula (9), where the function G(x, h) in formula (9) is a Gaussian kernel. The Gaussian kernel formula (10) is as follows, where D is the number of features of the feature set, k and j index samples, and the feature dimension n of a sample equals D.
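In the standard information-theoretic-learning form, a Parzen-window estimate with Gaussian kernels (the patent's exact constants may differ), formulas (9) and (10) read:

$$V(X) = \int p(x)^2 \, dx \approx \frac{1}{M^2} \sum_{k=1}^{M} \sum_{j=1}^{M} G\big(x_k - x_j,\, 2h^2\big) \tag{9}$$

$$G(x, h) = \frac{1}{(2\pi h)^{D/2}} \exp\!\left(-\frac{\lVert x \rVert^2}{2h}\right) \tag{10}$$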
Therefore, by using the data samples in place of the complex evaluation of the probability density integral, the quadratic entropy of the Renyi entropy can be expressed in the form of formula (11), as follows.
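With the information potential above, the quadratic entropy is simply:

$$H_2(X) = -\log V(X) = -\log \frac{1}{M^2} \sum_{k=1}^{M} \sum_{j=1}^{M} G\big(x_k - x_j,\, 2h^2\big) \tag{11}$$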
Similarly, the joint information entropy based on the Renyi entropy can be expressed in the form of formula (12).
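Under the same kernel estimate, with product kernels over the two variables (again an assumed standard form), formula (12) reads:

$$H_2(X,Y) = -\log \frac{1}{M^2} \sum_{k=1}^{M} \sum_{j=1}^{M} G\big(x_k - x_j,\, 2h^2\big)\, G\big(y_k - y_j,\, 2h^2\big) \tag{12}$$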
The expression in formula (12) can be further derived into the form shown in formula (13) below. From formula (13) it can be seen that, when mutual information is computed on the basis of the Renyi entropy, there is no need to first solve for the probability density or probability-distribution function of each feature; the value of the mutual information between two sample features can be estimated directly from the sample data. This overcomes the high computational load and slow speed of the Shannon entropy, which requires the probability densities of the features to be computed.
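Combining formulas (7), (11), and (12), with V(X,Y) denoting the joint information potential estimated in formula (12), gives the sample-based estimate:

$$I_{R_2}(X;Y) = -\log V(X) - \log V(Y) + \log V(X,Y) \tag{13}$$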
From the derivation above, when computing the quadratic mutual information of the Renyi information entropy, simply using the data samples in place of the complex probability-density integral evaluation greatly reduces the amount and difficulty of computation, and overcomes the Shannon-entropy drawback of needing to compute the probability density functions of the features. The present invention therefore computes the mutual information between samples using the Renyi quadratic mutual information.
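A minimal sketch of this estimator under the kernel forms assumed above; the bandwidth `h` and the function names are illustrative, and the patent's exact constants may differ.

```python
import numpy as np

def info_potential(x, h=1.0):
    # V(X) = integral of p(x)^2, estimated with Gaussian kernels as in
    # formula (9) above; x is a 1-D array of samples of one feature.
    diff = x[:, None] - x[None, :]
    return np.mean(np.exp(-diff**2 / (4 * h**2)) / np.sqrt(4 * np.pi * h**2))

def joint_potential(x, y, h=1.0):
    # Joint information potential V(X, Y) with product kernels, formula (12).
    dx = x[:, None] - x[None, :]
    dy = y[:, None] - y[None, :]
    return np.mean(np.exp(-(dx**2 + dy**2) / (4 * h**2)) / (4 * np.pi * h**2))

def renyi_quadratic_mi(x, y, h=1.0):
    # Formula (13): I(X;Y) = -log V(X) - log V(Y) + log V(X,Y).
    return (-np.log(info_potential(x, h))
            - np.log(info_potential(y, h))
            + np.log(joint_potential(x, y, h)))
```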
The QJMI (Quadratic Joint Mutual Information) feature selection evaluation function proposed in the invention is based on the quadratic mutual information of the Renyi entropy. By using the quadratic Renyi mutual information, the value of the mutual information between features is computed directly from the data of the data set, avoiding the need, as when computing Shannon-entropy mutual information, to compute the probability distributions or probability density functions of the corresponding features. When the redundancy and relevance of a feature are judged by mutual information, if adding a new feature to the selected features yields a new feature subset with a larger mutual information with the final output than the currently selected subset, while the selected features and the new feature have low redundancy, then that feature is a desired feature that should be added to the selected feature set. From this idea, the invention proposes the QJMI evaluation function. The evaluation function considers all features in the candidate feature set X_C, inspecting one by one the relationship between each candidate feature and the selected feature subset, and selects the candidate feature X_C with the maximum value of the criterion into the selected feature subset. The QJMI evaluation function formula (14) is as follows.
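Following the two-part description below, a weighted relevance term minus a redundancy term over the selected subset S, formula (14) takes a form such as the following, where the weighting factor β and the averaging over S are assumptions rather than the patent's exact expression:

$$QJMI(X_C) = \beta\, I_{R_2}(X_C;\, Y) - \frac{1}{|S|} \sum_{X_s \in S} I_{R_2}(X_C;\, X_s) \tag{14}$$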
When combined with a feature selection method, this function can assess each possible candidate feature subset and select the subset with the maximum quadratic mutual information. However, evaluating all possible feature subsets with this criterion is impractical: the operation would be very time-consuming and the computational load very large. Most feature selection algorithms instead add candidate features one by one to the selected feature subset, judge the relationship between the candidate feature and the selected subset, and, according to the evaluation criterion, add the best-fitting candidate to the selected subset. The benefit is twofold: on the one hand, the correlation between the candidate feature and the selected subset is considered and redundant features can be prevented; on the other hand, features that on their own are far from the selected subset but have important value when used in combination can be brought in, preventing their omission. All of the above should therefore be taken into account so that the algorithm performs better when the QJMI evaluation criterion is applied. The algorithm thus iterates the selection of features until the stopping criterion is reached.
Furthermore, at the beginning of an algorithm applying the QJMI evaluation function, the selected feature subset should be made empty; at this point the QJMI criterion only needs to consider the relationship between the features in the candidate set and the output, without considering the interaction with the selected feature subset. In subsequent calculations, the QJMI criterion consists of two parts: the first part inspects and extracts the correlation between a candidate feature and the output under the premise of the currently selected feature subset, and weights this part; the second part judges the correlation between the candidate feature and the selected feature subset. The calculated value of the first part minus that of the second constitutes the whole criterion. By using the QJMI criterion, one can on the one hand ensure a strong correlation between the candidate feature and the final output, and merging the two parts yields more correlation information than simply adding their respective information values; on the other hand, the criterion avoids adding features with redundant properties to the output features, guaranteeing low redundancy among the features selected by the evaluation function. Meanwhile, because the QJMI evaluation function computes the mutual information between samples with the quadratic-Renyi-entropy formula (13), the evaluation function is also fast.
In the present invention, the feature selection algorithm based on the improved Relief weight and the evaluation function based on mutual information are used in a combined, progressive relationship. Finally, the improved Relief weight is combined with the QJMI evaluation criterion proposed by the invention to obtain the final feature selection algorithm based on Relief and mutual information. The algorithm judges each feature in the candidate feature set by two criteria, distance and the improved joint mutual information function, and adds features that bring a gain to the total subset weight to the selected subset until the stopping condition is reached. By using the compound evaluation criterion, the algorithm considers the features in the feature set comprehensively, and the resulting feature subset has better discriminative ability.
The technical effects of the invention are:
In order to obtain a more accurate feature subset, the feature selection method of the invention evaluates features from two aspects, distance and information entropy. For the distance-based evaluation, the Relief idea is adopted, and the improved weight allows a group of features to be evaluated as a subset; this preserves computational efficiency while removing the original algorithm's restriction to two-class problems. At the same time, the improved Manhattan-distance-based formula further reduces the computational load and improves the efficiency of the algorithm. As shown in Fig. 2, on different data sets the proposed feature selection algorithm based on the improved Relief weight effectively completes the dimensionality reduction of the features, achieving a considerable reduction of the feature dimension. While completing the dimensionality reduction, Fig. 3 shows that on classification problems the algorithm based on the improved Relief weight achieves classification accuracy equal to or even higher than that obtained with the full feature set. This demonstrates that the proposed algorithm effectively eliminates redundant and noisy features from the data, so that classifying with the selected feature subset gives better accuracy than classifying with the full feature set. For the comprehensive evaluation based on mutual information and distance, the proposed feature selection algorithm based on Relief and mutual information uses a mixed-correlation evaluation criterion and therefore obtains better results. From Fig. 4, except for the JMI algorithm on the Ionosphere data set, whose accuracy exceeds that of the proposed mixed-correlation algorithm, the proposed FSIRQ algorithm has better accuracy on the remaining data sets and in comparison with the other algorithms. Because the evaluation criterion of the proposed algorithm considers both the distance between features and the information redundancy and relevance between features, the accuracy advantage of the algorithm meets expectations. The higher accuracy of JMI on the Ionosphere data set arises because the QJMI criterion penalizes redundancy preferentially, so FSIRQ selects a smaller feature subset on that data set, which affects the final accuracy; the difference is small overall, and the influence on the final classification accuracy stays within a reasonable range, showing that FSIRQ is excellent in terms of accuracy. The accuracy of the ReliefF feature selection algorithm comes last in Fig. 4, because that algorithm only considers a distance-based evaluation criterion, and the inherent drawback of ReliefF is that it cannot recognize redundant features; the subset it selects is therefore not as outstanding as those of the other algorithms, causing ReliefF to have the worst accuracy among the compared algorithms on these four data sets. Thus, from Fig. 4, the proposed FSIRQ feature selection algorithm has better computational accuracy than feature selection algorithms that consider a single evaluation criterion.
In terms of execution efficiency, Fig. 5 shows the advantage of the proposed FSIRQ feature selection algorithm well. From Fig. 5, the ReliefF algorithm has the fastest execution speed on every data set, because ReliefF, like Relief, is computationally simple: its complexity depends only on the data set size and the number of iterations, so it has good execution efficiency. Besides using distance as an evaluation criterion, the proposed FSIRQ algorithm also needs to compute the redundancy and relevance between features from the information-entropy perspective, so its execution time is not as efficient as ReliefF's. Compared with the MRMR and JMI algorithms, which also need to measure redundancy and relevance, however, FSIRQ has a large advantage. This is because, in the QJMI evaluation criterion used to evaluate redundancy and relevance, FSIRQ computes the mutual information between features using the quadratic Renyi information entropy, obtaining the mutual information directly from the sample data, and avoids the complex process in MRMR and JMI of precomputing the probability distributions and probability densities of the features. Its computational load is therefore far below that of MRMR and JMI, so that FSIRQ achieves good computation speed while still considering the redundancy and relevance between features.
In summary, from the comprehensive comparison of Fig. 4 and Fig. 5, the proposed FSIRQ algorithm, by using a mixed-correlation evaluation criterion in feature selection, attains high execution efficiency while guaranteeing high accuracy. In terms of the combination of execution speed and accuracy, the FSIRQ algorithm is better able than the other algorithms to balance accuracy and execution speed, meeting the expectations of the invention and showing its superiority.

Claims (4)

1. A feature selection algorithm based on Relief and mutual information, characterized in that the steps are as follows:
Step 1: set the optimal feature subset to the empty set, and set the weight of the optimal feature subset to the minimum value of the integer type;
Step 2: select a feature from all features in the data that does not belong to the optimal feature subset, put it into the candidate optimal feature subset, and compute the weight of the current candidate optimal feature subset with the compound feature evaluation criterion;
Step 3: if the weight of the candidate optimal feature subset is greater than the weight of the optimal feature subset computed last time, update the optimal feature subset weight to the weight of the current candidate optimal feature subset, and take the current candidate feature subset as the optimal feature subset;
Step 4: if the weight of the candidate optimal feature subset is less than the weight of the optimal feature subset computed last time, remove this feature from the candidate features of the current data;
Step 5: if there are still features to be selected, return to Step 2 and continue; otherwise, the algorithm terminates.
2. The feature selection algorithm based on Relief and mutual information according to claim 1, characterized in that the feature subset weight is computed as follows: first, the distance between two points in the N_S-dimensional feature subspace subspaceS formed by the selected feature subset is redefined as formula (1),
wherein, in formula (1), 1 ≤ m_1, m_2 ≤ M, d_M represents the Manhattan distance between the two input vectors, and dist_max has the meaning shown in formula (2),
by formulas (1) and (2), the nearest-neighbor data point of the same class as x_S^{(m)} is now defined as x_S^{NH(m)}, and the nearest-neighbor data point of a different class is defined as x_S^{NM(m)}; therefore, in the N_S-dimensional feature subspace subspaceS formed by the selected feature subset, the difference w_S^{NH}(m) between x_S^{(m)} and x_S^{NH(m)} is shown in formula (3),
and the difference w_S^{NM}(m) between x_S^{(m)} and x_S^{NM(m)} is shown in formula (4),
by formulas (3) and (4), the feature-subset weight formula (5), computed from a single sample m for the current feature subset S in the N_S-dimensional feature subspace subspaceS formed by the selected feature subset, is obtained,
finally, formula (5) yields the feature weight formula (6) of the currently selected feature subset S; the feature search part of the FSIRW (Feature Selection algorithm based on Improved Relief Weight) feature selection algorithm based on the improved Relief weight performs the search for features using a sequential forward search strategy.
3. The feature selection algorithm based on Relief and mutual information according to claim 1, characterized in that the weight evaluation process of the feature subset is as follows:
a feature evaluation function based on mutual information is used; within the scope of the Shannon entropy, computing the mutual information between features requires first computing the probability distribution p(x) and the joint probability distribution P_12(x_i, y_i) of the corresponding features; the feature selection method is instead based on the Renyi-entropy mutual information, solving the problems of computing mutual information within the Shannon framework, where the mutual information calculation formula (7) based on the Renyi entropy is as follows,
wherein, in formula (7), I_{R_2}(X;Y) denotes the mutual information based on the Renyi entropy, H_{R_2}(X) denotes the information entropy of the quadratic Renyi entropy, and H_{R_2}(X,Y) denotes the joint information entropy based on the quadratic Renyi entropy;
the Renyi entropy in fact extends the Shannon entropy by adding an additional parameter q, and the calculation formula (8) of the Renyi entropy is as follows,
when the mutual information between features is computed using the quadratic Renyi information entropy (i.e., with the parameter q = 2 in the Renyi entropy formula (8)), it can be computed directly from the original data set, so the information potential V(X) = ∫ p(x)^2 dx can be computed directly by formula (9), where the function G(x, h) in formula (9) is a Gaussian kernel; the Gaussian kernel formula (10) is as follows, where D is the number of features of the feature set, k and j index samples, and the feature dimension n of a sample equals D,
therefore, by using the data samples in place of the complex evaluation of the probability density integral, the quadratic entropy of the Renyi entropy can be expressed in the form of formula (11), as follows,
similarly, the joint information entropy based on the Renyi entropy can be expressed in the form of formula (12),
the expression in formula (12) is further derived into the form shown in formula (13); during the computation of mutual information based on the Renyi entropy, the value of the mutual information between two sample features can be estimated directly from the sample data.
4. The feature selection algorithm based on Relief and mutual information according to claim 3, characterized in that:
the feature evaluation function is the QJMI (Quadratic Joint Mutual Information) feature selection evaluation function, based on the quadratic mutual information of the Renyi entropy; by using the quadratic Renyi mutual information, the value of the mutual information between features is computed directly from the data of the data set; when the redundancy and relevance of a feature are judged by mutual information, if adding a new feature to the selected features yields a new feature subset with a larger mutual information with the final output than the currently selected feature subset, while the selected features and the new feature have low redundancy, then that feature is a desired feature that should be added to the selected feature set in feature selection; the evaluation function considers all features in the candidate feature set X_C, inspecting one by one the relationship between each candidate feature and the selected feature subset; the evaluation criterion selects the candidate feature X_C with the maximum value from the candidate set into the selected feature subset; the QJMI evaluation function formula (14) is as follows,
when combined with a feature selection method, the function can assess each possible candidate feature subset and select the feature subset with the maximum quadratic mutual information,
at the beginning of an algorithm applying the QJMI evaluation function, the selected feature subset should be made empty; at this point the QJMI evaluation criterion only needs to consider the relationship between the features in the candidate set and the output, without considering the interaction with the selected feature subset; in subsequent calculations, the QJMI evaluation criterion consists of two parts:
the first part inspects and extracts the correlation between a candidate feature and the output under the premise of the currently selected feature subset, and weights this part;
the second part judges the correlation between the candidate feature and the selected feature subset;
the calculated value of the first part minus the calculated value of the second part constitutes the whole evaluation criterion.
CN201810519640.1A 2018-05-28 2018-05-28 Feature selection algorithm based on Relief and mutual information Pending CN108875795A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810519640.1A CN108875795A (en) 2018-05-28 2018-05-28 Feature selection algorithm based on Relief and mutual information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810519640.1A CN108875795A (en) 2018-05-28 2018-05-28 Feature selection algorithm based on Relief and mutual information

Publications (1)

Publication Number Publication Date
CN108875795A true CN108875795A (en) 2018-11-23

Family

ID=64335022

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810519640.1A Pending CN108875795A (en) 2018-05-28 2018-05-28 Feature selection algorithm based on Relief and mutual information

Country Status (1)

Country Link
CN (1) CN108875795A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635852A (en) * 2018-11-26 2019-04-16 汉纳森(厦门)数据股份有限公司 A kind of building of user's portrait and clustering method based on multidimensional property
CN110363229A (en) * 2019-06-27 2019-10-22 岭南师范学院 A kind of characteristics of human body's parameter selection method combined based on improvement RReliefF and mRMR
CN110598760A (en) * 2019-08-26 2019-12-20 华北电力大学(保定) Unsupervised characteristic selection method for transformer vibration data
CN111259947A (en) * 2020-01-13 2020-06-09 国网浙江省电力有限公司信息通信分公司 Power system fault early warning method and system based on multi-mode learning
CN111275127A (en) * 2020-02-13 2020-06-12 西安理工大学 Dynamic characteristic selection method based on conditional mutual information
CN111898637A (en) * 2020-06-28 2020-11-06 南京工程学院 Feature selection algorithm based on Relieff-DDC
CN112651416A (en) * 2019-10-11 2021-04-13 中移动信息技术有限公司 Feature selection method, device, apparatus, and medium
CN113707330A (en) * 2021-07-30 2021-11-26 电子科技大学 Mongolian medicine syndrome differentiation model construction method, system and method

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635852B (en) * 2018-11-26 2021-03-23 汉纳森(厦门)数据股份有限公司 User portrait construction and clustering method based on multi-dimensional attributes
CN109635852A (en) * 2018-11-26 2019-04-16 汉纳森(厦门)数据股份有限公司 A kind of building of user's portrait and clustering method based on multidimensional property
CN110363229A (en) * 2019-06-27 2019-10-22 岭南师范学院 A kind of characteristics of human body's parameter selection method combined based on improvement RReliefF and mRMR
CN110598760A (en) * 2019-08-26 2019-12-20 华北电力大学(保定) Unsupervised characteristic selection method for transformer vibration data
CN110598760B (en) * 2019-08-26 2023-10-24 华北电力大学(保定) Unsupervised feature selection method for vibration data of transformer
CN112651416A (en) * 2019-10-11 2021-04-13 中移动信息技术有限公司 Feature selection method, device, apparatus, and medium
CN111259947A (en) * 2020-01-13 2020-06-09 国网浙江省电力有限公司信息通信分公司 Power system fault early warning method and system based on multi-mode learning
CN111275127A (en) * 2020-02-13 2020-06-12 西安理工大学 Dynamic characteristic selection method based on conditional mutual information
CN111275127B (en) * 2020-02-13 2024-01-09 河马互联网信息科技(深圳)有限公司 Dynamic feature selection method based on condition mutual information
CN111898637A (en) * 2020-06-28 2020-11-06 南京工程学院 Feature selection algorithm based on Relieff-DDC
CN111898637B (en) * 2020-06-28 2022-09-02 南京工程学院 Feature selection algorithm based on Relieff-DDC
CN113707330A (en) * 2021-07-30 2021-11-26 电子科技大学 Mongolian medicine syndrome differentiation model construction method, system and method
CN113707330B (en) * 2021-07-30 2023-04-28 电子科技大学 Construction method of syndrome differentiation model of Mongolian medicine, syndrome differentiation system and method of Mongolian medicine

Similar Documents

Publication Publication Date Title
CN108875795A (en) Feature selection algorithm based on Relief and mutual information
CN103679132B (en) A kind of nude picture detection method and system
US20200401842A1 (en) Human Hairstyle Generation Method Based on Multi-Feature Retrieval and Deformation
CN106897821A (en) A kind of transient state assesses feature selection approach and device
CN103810299A (en) Image retrieval method on basis of multi-feature fusion
CN108319855A (en) A kind of malicious code sorting technique based on depth forest
CN101706950A (en) High-performance implementation method for multi-scale segmentation of remote sensing images
CN109858566A (en) A method of it being added to the scorecard of mould dimension based on multilayered model building
CN110458201A (en) A kind of remote sensing image object-oriented classification method and sorter
CN110533116A (en) Based on the adaptive set of Euclidean distance at unbalanced data classification method
CN106250909A (en) A kind of based on the image classification method improving visual word bag model
CN103366151B (en) Hand-written character recognition method and equipment
CN112102317A (en) Multi-phase liver lesion detection method and system based on anchor-frame-free
CN110019779A (en) A kind of file classification method, model training method and device
Zuo et al. Correlation-driven direct sampling method for geostatistical simulation and training image evaluation
Bruzzese et al. DESPOTA: DEndrogram slicing through a pemutation test approach
CN109919320B (en) Triplet network learning method based on semantic hierarchy
CN106156803A (en) A kind of lazy traditional decision-tree based on Hellinger distance
CN101625725A (en) Artificial immunization non-supervision image classification method based on manifold distance
CN111914912B (en) Cross-domain multi-view target identification method based on twin condition countermeasure network
CN109284409A (en) Picture group geographic positioning based on extensive streetscape data
CN108564009A (en) A kind of improvement characteristic evaluation method based on mutual information
CN115984559A (en) Intelligent sample selection method and related device
CN106980872A (en) K arest neighbors sorting techniques based on polling committee
CN103793714A (en) Multi-class discriminating device, data discrimination device, multi-class discriminating method and data discriminating method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination