CN106991447A - An embedded multi-label dynamic feature selection algorithm - Google Patents

An embedded multi-label dynamic feature selection algorithm

Info

Publication number
CN106991447A
CN106991447A (application CN201710222600.6A)
Authority
CN
China
Prior art keywords
attribute
feature
correlation
data
characteristic
Prior art date
2017-04-06
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710222600.6A
Other languages
Chinese (zh)
Inventor
黄金杰 (Huang Jinjie)
孔庆达 (Kong Qingda)
潘晓真 (Pan Xiaozhen)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
2017-04-06
Publication date
2017-07-28
Application filed by Harbin University of Science and Technology
Priority to CN201710222600.6A
Publication of CN106991447A
Legal status: Pending

Classifications

    • G06F18/285: Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G06F18/2431: Classification techniques relating to the number of classes; multiple classes
    (all under G: Physics; G06: Computing, calculating or counting; G06F: Electric digital data processing; G06F18/00: Pattern recognition; G06F18/20: Analysing)

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an embedded multi-label dynamic feature selection method. To remedy the deficiencies of traditional multi-label feature selection algorithms, the proposed method (ML_NIFS) accounts both for the correlations inside the multi-label attribute set and for the fact that the information-entropy estimates used in the feature evaluation criterion change continually during selection. Experimental verification shows that the proposed algorithm effectively reduces the dimensionality of the attribute data and improves the subsequent classification performance.

Description

An embedded multi-label dynamic feature selection algorithm
Technical field
The present invention relates to the field of pattern recognition, and specifically to an embedded multi-label dynamic feature selection method.
Background technology
Traditional mutual-information measures are widely used in feature dimensionality reduction because they are fast and comparatively efficient on high-dimensional attribute data. With the rapid development of science and technology, however, many technical fields, such as computer network communication and biochemical and pharmaceutical engineering, are moving toward multi-label data. The multi-label classification problem is to build a classification model according to the characteristics of multi-label data and to judge the class attributes of unknown data according to a decision criterion, assigning a sample to several class labels at once. The fundamental difference between single-label and multi-label classification is that in the single-label problem a sample can belong to only one class label, whereas in the multi-label problem a sample may belong to several class labels at the same time; this matches the character of today's rapidly growing information data, and the problem has therefore attracted wide attention.
Like traditional single-label classification, multi-label classification faces the "curse of dimensionality", which severely degrades the classification ability of multi-label classifiers. Dimensionality-reduction techniques for feature attributes lower the feature dimension and improve classifier accuracy; the techniques applicable to single-label classification can likewise be applied to multi-label classification to achieve attribute reduction. Feature dimensionality reduction is generally divided into feature selection and feature extraction, and feature selection is further divided, by its evaluation criterion, into filter, wrapper, and embedded methods. The present invention studies multi-label feature selection.
Multi-label feature selection algorithms currently develop along two basic directions: data transformation and algorithm adaptation. Transformation-based research converts the multi-label data into single-label data and applies a single-label feature selection algorithm repeatedly to achieve multi-label feature selection. Adaptation-based research modifies and improves single-label feature selection algorithms so that they fit multi-label data. Common algorithms at this stage include transformation-based SVM feature selection and KNN algorithms, which do not consider the dependencies inside the label attributes; feature selection algorithms based on mutual information, by contrast, can analyse the correlation between attributes well using information theory. The conventional mutual-information criterion for the correlation between two variables is nevertheless inadequate: it considers only the correlation between a feature and the classes and between a feature and the already selected features, while in fact the sample data keep being recognized as features are selected, so the estimate of the information entropy behaves as a continually changing, dynamic quantity.
Based on the above considerations, the present invention proposes an embedded multi-label dynamic feature selection algorithm (ML_NIFS). Computed through mutual information, the algorithm considers not only the correlation between feature attributes and label attributes but also the correlation and redundancy among the feature attributes, and in addition the correlations among the label attributes inside the multi-label set. The proposed embedded dynamic multi-label feature selection algorithm removes the sample data already recognized by an embedded classifier, thereby keeping the information-entropy estimate accurate and current.
Summary of the invention
The object of the present invention is to provide an embedded multi-label dynamic feature selection method that solves the problems raised in the background above. To achieve this object, the present invention provides the following technical scheme; specifically, the embedded multi-label dynamic feature selection method comprises the following steps.
The traditional feature selection method based on mutual information is introduced first.
1. Preprocessing the data set
Real-world databases are highly susceptible to noisy data, missing values, and inconsistent data. Many data-preprocessing techniques exist at this stage, generally divided into data cleaning, data integration, data transformation, and data reduction. Data cleaning removes noise from the data, corrects inconsistencies, and fills in missing sample values; data transformation (normalization) improves the precision and validity of algorithms that involve distance metrics. For example, when one wants the data to follow a specific distribution, or wants each feature mapped into a specific interval, a data transformation is required. Here, preprocessing consists of three parts. First, noisy data, inconsistent data, and missing values in the data set are handled. Second, attribute data completely unrelated to the classes are deleted from the data set. Third, the attribute data are norm-normalized so that each norm equals 1:
\hat{f}_i = f_i / \|f_i\|    (1)
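As a concrete illustration, the following is a minimal sketch of this preprocessing step under stated assumptions: the data sit in a NumPy matrix with samples as rows, and the set of class-irrelevant columns is supplied as input (the text does not fix how it is found).

```python
# A minimal preprocessing sketch; X is assumed to be a NumPy array of
# shape (n_samples, n_features), irrelevant_cols a hypothetical input.
import numpy as np

def preprocess(X, irrelevant_cols=()):
    # Step 2: delete attributes completely unrelated to the classes.
    X = np.delete(X, list(irrelevant_cols), axis=1)
    # Step 3: norm-normalize each feature column f_i (equation (1)).
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0  # guard: leave all-zero columns unchanged
    return X / norms
```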
2. Background on mutual information
The goal of feature selection is to pick out the feature attributes most valuable for classification. The key problem to be solved is measurement: the measure must account for the correlation between the attribute set and the class labels, the redundancy within the attribute set, and the dependencies inside the label attribute set. To discuss these correlation problems, the mutual information of information theory is chosen as the measurement tool. The relevant theory and rules of calculation are described below.
Information entropy is a vital concept of information theory; it characterizes the degree of uncertainty of a variable, its purpose being to represent the amount of information content:
H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)    (2)
where p(x_i) is the probability that variable X takes the value x_i. The uncertainty of X is thus represented by the entropy H(X), whose value depends on the probability distribution of the variable; entropy therefore effectively overcomes the interference of some noisy data.
Conditional entropy is the remaining uncertainty of one variable given that another variable is known, that is, the strength of the dependence of one variable on the other; the dependence of a random variable X on another random variable Y can therefore be characterized by the conditional entropy:
H(X|Y) = -\sum_{j=1}^{m} p(y_j) \sum_{i=1}^{n} p(x_i|y_j) \log_2 p(x_i|y_j)    (3)
where p(x_i) is the prior probability of variable X and p(x_i|y_j) is the posterior probability of X given Y.
Mutual information characterizes the mutual dependence between two random variables, expressing how much information the two variables share. A mutual information of 0, the minimum, means the two variables share no information; a large value means they share a great deal. It is defined as:
I(X;Y) = H(X) - H(X|Y)    (4)
Mutual information reflects the correlation between two random variables very effectively and expresses the closeness of that correlation as a number. However, the growth pattern of information must also be considered when computing the mutual information of two variables: selecting features directly by the magnitude of mutual information favours features with many distinct values. Mutual information is therefore normalized, and the symmetric uncertainty SU is used to measure the correlation between a pair of variables:
SU(X,Y) = 2 I(X;Y) / (H(X) + H(Y))    (5)
From formula (5), the SU correlation value ranges from 0 to 1. If SU is 0, X and Y are uncorrelated, that is, independent; if SU is 1, X and Y are strongly correlated. If X and Y represent attribute information and class information respectively, a larger SU means the feature is more relevant to classification; if X and Y represent two attribute informations, a larger SU means stronger redundancy between the two features.
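The entropy, mutual information, and SU of equations (2) to (5) are straightforward to compute for discrete variables; the sketch below uses plain frequency estimates, and the function names are ours rather than the patent's.

```python
# A sketch of equations (2)-(5) for discrete (or discretized) variables.
import math
from collections import Counter

def entropy(x):                                # equation (2), base-2 logs
    n = len(x)
    return -sum(c / n * math.log2(c / n) for c in Counter(x).values())

def conditional_entropy(x, y):                 # equation (3)
    n = len(y)
    return sum(c_y / n * entropy([xi for xi, yi in zip(x, y) if yi == y_val])
               for y_val, c_y in Counter(y).items())

def mutual_information(x, y):                  # equation (4)
    return entropy(x) - conditional_entropy(x, y)

def su(x, y):                                  # equation (5), value in [0, 1]
    hx, hy = entropy(x), entropy(y)
    return 0.0 if hx + hy == 0 else 2 * mutual_information(x, y) / (hx + hy)
```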
3. Measures based on mutual information
By the theory of mutual information, the redundancy between two single feature attributes, the correlation between a single feature attribute and a single label attribute, and the correlation between two single label attributes can be computed by the following formulas:
Redundancy(X_i;X_j) = SU(X_i,X_j)    (6)
Correlation(X_i;Y_j) = SU(X_i,Y_j)    (7)
Correlation(Y_i;Y_j) = SU(Y_i,Y_j)    (8)
From the formulas above, the redundancy between a single feature attribute and a set of feature attributes can be computed by summing the redundancies between the single attribute and each attribute in the set and taking the average:
Redundancy(X_i;X) = (1/|X|) \sum_{X_j \in X} Redundancy(X_i;X_j)    (9)
where |X| is the number of feature attributes in the feature attribute set and X_j is a feature attribute in the set.
Since the algorithm is applied to multi-label feature selection, the correlation between a single feature attribute and the set of label attributes is defined as:
Correlation(X_i;Y) = (1/|Y|) \sum_{Y_j \in Y} Correlation(X_i;Y_j)    (10)
where |Y| is the number of label attributes in the label attribute set and Y_j is a label attribute in the set.
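A short sketch of the set-level measures (9) and (10), built on the su() function above; `selected` and `labels` are lists of value sequences, one per attribute.

```python
def redundancy_set(x_i, selected):             # equation (9)
    if not selected:                           # empty set: no redundancy
        return 0.0
    return sum(su(x_i, x_j) for x_j in selected) / len(selected)

def correlation_labels(x_i, labels):           # equation (10)
    return sum(su(x_i, y_j) for y_j in labels) / len(labels)
```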
This embedded multi-label dynamic feature selection algorithm considers not only the mutual relations among feature attributes and the correlation between feature attributes and label attributes, but also the influence on feature selection of the correlations inside the set of label attributes. In general, if the class attribute of some label is strongly correlated with the class attributes of other labels, then a feature attribute selected for that label will likewise have good classification performance for the other, strongly correlated label attributes. The correlation among label attributes is therefore obtained from the following formula:
W(Y_i) = (1/(|Y|-1)) \sum_{Y_j \in Y, j \neq i} Correlation(Y_i,Y_j)    (11)
where |Y| is the number of label attributes in the label attribute set, Y_j is a label attribute in the set, and W(Y_i) is the average correlation of Y_i within the multi-label attribute set. The larger the value, the more correlated label attributes this label attribute possesses in the set; a feature attribute beneficial to classifying this label then likewise has a positive influence on the more strongly correlated label attributes.
Based on the above considerations and combining formulas (10) and (11), the correlation measure can be expressed as:
CCorrelation(X_i;Y) = (1/|Y|) \sum_{Y_j \in Y} (Correlation(X_i;Y_j) + W(Y_j))    (12)
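A sketch of equations (11) and (12): the label-label weight W(Y_i) is folded into the combined feature-label measure CCorrelation; both reuse the helpers above.

```python
def label_weight(labels, i):                   # equation (11)
    if len(labels) < 2:                        # a single label has no peers
        return 0.0
    return sum(su(labels[i], y_j)
               for j, y_j in enumerate(labels) if j != i) / (len(labels) - 1)

def ccorrelation(x_i, labels):                 # equation (12)
    return sum(su(x_i, y_j) + label_weight(labels, j)
               for j, y_j in enumerate(labels)) / len(labels)
```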
4. Feature ranking and feature selection
In the ML_NIFS algorithm, the correlation between each feature attribute and the multi-label attributes and the redundancy between the feature attribute and the ranked feature attribute set are computed and combined into the evaluation criterion of the feature, and the features are then ranked by this criterion:
W(X_i) = CCorrelation(X_i;Y) - Redundancy(X_i;H)    (13)
where H is the set of feature attributes already ranked, X_i is a candidate feature attribute, CCorrelation(X_i;Y) is the correlation between the feature attribute and the set of multi-label attributes, and Redundancy(X_i;H) is the redundancy between X_i and the ranked feature attribute set.
Feature selection is the process of choosing among the ranked features. In multi-label feature selection, the common practice is to set a selection threshold according to the subsequent classification algorithm and the feature evaluation criterion, and to select features by that threshold. From the point of view of classification ability, features ranked near the front of the sorted sequence have stronger correlation with the multi-label attributes and lower redundancy with the other feature attributes, and thus contribute more to classification. At the same time, the features should be considered as a whole, taking the set of feature attributes as the object of analysis. The correlation between a subset of the ranked feature attribute set H and the multi-label attribute set can be obtained from formula (10); the correlation is computed as:
Correlation(H;Y) = (1/|H|) \sum_{X_i \in H} (1/|Y|) \sum_{Y_j \in Y} Correlation(X_i;Y_j)    (14)
where H is the ranked candidate feature set, Y is the multi-label attribute set, |Y| is the number of labels in the multi-label attribute set, and |H| is the number of feature attributes in the ranked feature set.
Following the ranked order of the feature attributes, the average value of the correlation over the ranked prefixes is computed by formula (15):
Correlation_avg(H;Y) = (1/|H|) \sum_{j=1}^{|H|} Correlation(H_j;Y)    (15)
where H_j denotes the first j feature attributes in the ranking. If Correlation(H_j;Y) is greater than Correlation_avg(H;Y) and Correlation(H_{j+1};Y) is less than Correlation_avg(H;Y), then these j feature attributes are the selected feature attributes.
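Under the prefix-average reading of (15) just given, the ranking and selection steps might look as follows; rank_features and select_prefix are our names, and the greedy re-scoring at each step is an assumption consistent with the embodiment described later.

```python
# Sketch of criterion (13) and the prefix rule (14)-(15).
def rank_features(features, labels):
    order, remaining = [], list(range(len(features)))
    while remaining:
        scores = {i: ccorrelation(features[i], labels)               # eq. (13)
                     - redundancy_set(features[i],
                                      [features[k] for k in order])
                  for i in remaining}
        best = max(scores, key=scores.get)
        order.append(best)
        remaining.remove(best)
    return order

def correlation_subset(subset, labels):                              # eq. (14)
    return sum(correlation_labels(x, labels) for x in subset) / len(subset)

def select_prefix(order, features, labels):
    corr = [correlation_subset([features[i] for i in order[:j + 1]], labels)
            for j in range(len(order))]
    avg = sum(corr) / len(corr)                                      # eq. (15)
    for j in range(len(corr) - 1):
        if corr[j] > avg and corr[j + 1] < avg:                      # crossing point
            return order[:j + 1]
    return order
```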
5. Embedded dynamic mutual information computation
Measurement based on mutual information starts from a sound estimate of the probability distribution of each feature over the sample data set. Once the sample data are fixed, the distribution of a feature over that set is uniquely determined. But as features keep being selected, more and more of the samples in the set become recognized, so the quantities entering the mutual information computation change; continuing to use the traditional static mutual-information computation then produces a large error, because the already recognized sample data supply "deceptive information" to the evaluation of the features not yet selected.
For the dynamic feature selection proposed in the algorithm, the main research question is how to recognize the samples that can already be identified by the selected features, remove those data from the data set, and recompute the information entropy anew from the remaining samples. While the algorithm runs, an embedded classifier is chosen to perform this recognition; here an embedded KNN classifier recognizes the identifiable samples, and the sample data recognized by the KNN classifier are deleted from the sample data set. Without changing the correlation between features and classes, this reduces the number of samples in the data set and the dimensionality of the features.
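A sketch of the embedded KNN step under stated assumptions: X_sel is a numeric matrix of the selected features, Y a binary label matrix; k and the majority-vote threshold are our choices. A sample every one of whose labels is already predicted correctly counts as recognized and is dropped.

```python
import numpy as np

def prune_recognized(X_sel, Y, k=5):
    keep = []
    for i in range(len(X_sel)):
        d = np.linalg.norm(X_sel - X_sel[i], axis=1)     # Euclidean distance
        nn = np.argsort(d)[1:k + 1]                      # k nearest, excluding i
        votes = (Y[nn].mean(axis=0) >= 0.5).astype(int)  # majority vote per label
        if not np.array_equal(votes, Y[i]):              # some label still wrong,
            keep.append(i)                               # so not yet recognized
    return keep  # indices of the samples to retain
```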
Brief description of the drawings
Fig. 1 The embedded multi-label dynamic feature selection method based on mutual information
Fig. 2 Average precision of classification with the selected features, classifier parameter = 1
Fig. 3 Coverage of classification with the selected features, classifier parameter = 1
Fig. 4 Ranking loss of classification with the selected features, classifier parameter = 1
Fig. 5 Average precision of classification with the selected features, classifier parameter = 0.8
Fig. 6 Coverage of classification with the selected features, classifier parameter = 0.8
Fig. 7 Ranking loss of classification with the selected features, classifier parameter = 0.8
Detailed description of the embodiments
The feature set is divided into two parts, the selected feature set and the candidate feature set, denoted H and X respectively. The multi-label attributes are denoted Y and the sample data set is denoted O.
First, the feature attribute with the highest correlation according to formula (12) is selected and added to the feature set H, and at the same time removed from the candidate feature attribute set X.
Then, by the Euclidean distance d of formula (16), the k nearest samples of the sample to be classified are found; these k nearest samples constitute its neighbour data set, where (Y_NN)_i denotes the multi-label class results of the i-th sample in the neighbour set and N is the number of samples in the data set.
Each label attribute of the sample is then judged from the attribute data in the neighbour data set by the majority-vote criterion. The KNN classifier is applied repeatedly to judge the samples of the data set under each label class and to check whether each sample is classified correctly; if a sample is classified correctly for every label attribute, it is deleted from the data sample set.
Next, the information entropy of the remaining feature attributes in the candidate set X is recomputed anew on the new sample data set, and the feature attribute that maximizes formula (13) is added to the feature set H and simultaneously removed from the candidate feature attribute set X.
Finally, steps (2) and (3) are repeated until all feature attributes have been ranked, or until the number of samples in the data set falls below the k of the KNN classifier.
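Putting the previous sketches together, an end-to-end loop of this embodiment might look as follows; X and Y are NumPy arrays, features are treated as discrete values, and ml_nifs is our name for the driver, not the patent's.

```python
def ml_nifs(X, Y, k=5):
    selected = []
    remaining = list(range(X.shape[1]))
    rows = np.arange(X.shape[0])                 # indices of surviving samples
    while remaining and len(rows) > k:
        labels = [tuple(Y[rows, j]) for j in range(Y.shape[1])]
        scores = {i: ccorrelation(tuple(X[rows, i]), labels)      # eq. (13)
                     - redundancy_set(tuple(X[rows, i]),
                                      [tuple(X[rows, s]) for s in selected])
                  for i in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
        # drop the samples the embedded KNN already recognizes
        keep = prune_recognized(X[rows][:, selected], Y[rows], k)
        rows = rows[keep]
    return selected
```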
It is obvious to a person skilled in the art that the invention is not restricted to the details of the exemplary embodiments above, and that the invention can be realized in other specific forms without departing from its spirit or essential attributes. The embodiments should therefore be regarded in every respect as exemplary and non-restrictive; the scope of the invention is defined by the appended claims rather than by the description above, and all changes falling within the meaning and range of equivalency of the claims are intended to be embraced in the invention. No reference sign in a claim should be construed as limiting the claim concerned.
Moreover, although this specification is described in terms of embodiments, not every embodiment contains only one independent technical scheme; this manner of narration serves clarity only. Those skilled in the art should take the specification as a whole; the technical solutions in the various embodiments may also be combined appropriately to form other embodiments comprehensible to those skilled in the art.

Claims (2)

1. An embedded multi-label dynamic feature selection method, characterized by comprising the following steps. The traditional feature selection method based on mutual information is introduced first.
1. Preprocessing the data set
Real-world databases are highly susceptible to noisy data, missing values, and inconsistent data. Many data-preprocessing techniques exist at this stage, generally divided into data cleaning, data integration, data transformation, and data reduction. Data cleaning removes noise from the data, corrects inconsistencies, and fills in missing sample values; data transformation (normalization) improves the precision and validity of algorithms that involve distance metrics. For example, when one wants the data to follow a specific distribution, or wants each feature mapped into a specific interval, a data transformation is required. Here, preprocessing consists of three parts. First, noisy data, inconsistent data, and missing values in the data set are handled. Second, attribute data completely unrelated to the classes are deleted from the data set. Third, the attribute data are norm-normalized so that each norm equals 1:
\hat{f}_i = f_i / \|f_i\|    (1)
2. Background on mutual information
The goal of feature selection is to pick out the feature attributes most valuable for classification. The key problem to be solved is measurement: the measure must account for the correlation between the attribute set and the class labels, the redundancy within the attribute set, and the dependencies inside the label attribute set. To discuss these correlation problems, the mutual information of information theory is chosen as the measurement tool. The relevant theory and rules of calculation are described below.
Information entropy is a vital concept of information theory; it characterizes the degree of uncertainty of a variable, its purpose being to represent the amount of information content:
H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)    (2)
where p(x_i) is the probability that variable X takes the value x_i. The uncertainty of X is thus represented by the entropy H(X), whose value depends on the probability distribution of the variable; entropy therefore effectively overcomes the interference of some noisy data.
Conditional entropy is the remaining uncertainty of one variable given that another variable is known, that is, the strength of the dependence of one variable on the other; the dependence of a random variable X on another random variable Y can therefore be characterized by the conditional entropy:
H(X|Y) = -\sum_{j=1}^{m} p(y_j) \sum_{i=1}^{n} p(x_i|y_j) \log_2 p(x_i|y_j)    (3)
where p(x_i) is the prior probability of variable X and p(x_i|y_j) is the posterior probability of X given Y.
Mutual information characterizes the mutual dependence between two random variables, expressing how much information the two variables share. A mutual information of 0, the minimum, means the two variables share no information; a large value means they share a great deal. It is defined as:
I(X;Y) = H(X) - H(X|Y)    (4)
Mutual information reflects the correlation between two random variables very effectively and expresses the closeness of that correlation as a number. However, the growth pattern of information must also be considered when computing the mutual information of two variables: selecting features directly by the magnitude of mutual information favours features with many distinct values. Mutual information is therefore normalized, and the symmetric uncertainty SU is used to measure the correlation between a pair of variables:
SU(X,Y) = 2 I(X;Y) / (H(X) + H(Y))    (5)
From formula (5), the SU correlation value ranges from 0 to 1. If SU is 0, X and Y are uncorrelated, that is, independent; if SU is 1, X and Y are strongly correlated. If X and Y represent attribute information and class information respectively, a larger SU means the feature is more relevant to classification; if X and Y represent two attribute informations, a larger SU means stronger redundancy between the two features.
3. Measures based on mutual information
By the theory of mutual information, the redundancy between two single feature attributes, the correlation between a single feature attribute and a single label attribute, and the correlation between two single label attributes can be computed by the following formulas:
Redundancy(X_i;X_j) = SU(X_i,X_j)    (6)
Correlation(X_i;Y_j) = SU(X_i,Y_j)    (7)
Correlation(Y_i;Y_j) = SU(Y_i,Y_j)    (8)
From the formulas above, the redundancy between a single feature attribute and a set of feature attributes can be computed by summing the redundancies between the single attribute and each attribute in the set and taking the average:
Redundancy(X_i;X) = (1/|X|) \sum_{X_j \in X} Redundancy(X_i;X_j)    (9)
where |X| is the number of feature attributes in the feature attribute set and X_j is a feature attribute in the set.
Since the algorithm is applied to multi-label feature selection, the correlation between a single feature attribute and the set of label attributes is defined as:
Correlation(X_i;Y) = (1/|Y|) \sum_{Y_j \in Y} Correlation(X_i;Y_j)    (10)
where |Y| is the number of label attributes in the label attribute set and Y_j is a label attribute in the set.
This embedded multi-label dynamic feature selection algorithm considers not only the mutual relations among feature attributes and the correlation between feature attributes and label attributes, but also the influence on feature selection of the correlations inside the set of label attributes. In general, if the class attribute of some label is strongly correlated with the class attributes of other labels, then a feature attribute selected for that label will likewise have good classification performance for the other, strongly correlated label attributes. The correlation among label attributes is therefore obtained from the following formula:
W(Y_i) = (1/(|Y|-1)) \sum_{Y_j \in Y, j \neq i} Correlation(Y_i,Y_j)    (11)
where |Y| is the number of label attributes in the label attribute set, Y_j is a label attribute in the set, and W(Y_i) is the average correlation of Y_i within the multi-label attribute set. The larger the value, the more correlated label attributes this label attribute possesses in the set; a feature attribute beneficial to classifying this label then likewise has a positive influence on the more strongly correlated label attributes.
Based on the above considerations and combining formulas (10) and (11), the correlation measure can be expressed as:
CCorrelation(X_i;Y) = (1/|Y|) \sum_{Y_j \in Y} (Correlation(X_i;Y_j) + W(Y_j))    (12)
4. Feature ranking and feature selection
In the ML_NIFS algorithm, the correlation between each feature attribute and the multi-label attributes and the redundancy between the feature attribute and the ranked feature attribute set are computed and combined into the evaluation criterion of the feature, and the features are then ranked by this criterion:
W(X_i) = CCorrelation(X_i;Y) - Redundancy(X_i;H)    (13)
where H is the set of feature attributes already ranked, X_i is a candidate feature attribute, CCorrelation(X_i;Y) is the correlation between the feature attribute and the set of multi-label attributes, and Redundancy(X_i;H) is the redundancy between X_i and the ranked feature attribute set.
Feature selection is the process of choosing among the ranked features. In multi-label feature selection, the common practice is to set a selection threshold according to the subsequent classification algorithm and the feature evaluation criterion, and to select features by that threshold. From the point of view of classification ability, features ranked near the front of the sorted sequence have stronger correlation with the multi-label attributes and lower redundancy with the other feature attributes, and thus contribute more to classification. At the same time, the features should be considered as a whole, taking the set of feature attributes as the object of analysis. The correlation between a subset of the ranked feature attribute set H and the multi-label attribute set can be obtained from formula (10); the correlation is computed as:
Correlation(H;Y) = (1/|H|) \sum_{X_i \in H} (1/|Y|) \sum_{Y_j \in Y} Correlation(X_i;Y_j)    (14)
where H is the ranked candidate feature set, Y is the multi-label attribute set, |Y| is the number of labels in the multi-label attribute set, and |H| is the number of feature attributes in the ranked feature set.
Following the ranked order of the feature attributes, the average value of the correlation over the ranked prefixes is computed by formula (15):
Correlation_avg(H;Y) = (1/|H|) \sum_{j=1}^{|H|} Correlation(H_j;Y)    (15)
where H_j denotes the first j feature attributes in the ranking. If Correlation(H_j;Y) is greater than Correlation_avg(H;Y) and Correlation(H_{j+1};Y) is less than Correlation_avg(H;Y), then these j feature attributes are the selected feature attributes.
5. Embedded dynamic mutual information computation
Measurement based on mutual information starts from a sound estimate of the probability distribution of each feature over the sample data set. Once the sample data are fixed, the distribution of a feature over that set is uniquely determined. But as features keep being selected, more and more of the samples in the set become recognized, so the quantities entering the mutual information computation change; continuing to use the traditional static mutual-information computation then produces a large error, because the already recognized sample data supply "deceptive information" to the evaluation of the features not yet selected.
For the dynamic feature selection proposed in the algorithm, the main research question is how to recognize the samples that can already be identified by the selected features, remove those data from the data set, and recompute the information entropy anew from the remaining samples. While the algorithm runs, an embedded classifier is chosen to perform this recognition; here an embedded KNN classifier recognizes the identifiable samples, and the sample data recognized by the KNN classifier are deleted from the sample data set. Without changing the correlation between features and classes, this reduces the number of samples in the data set and the dimensionality of the features.
2. The embedded multi-label dynamic feature selection method according to claim 1, characterized in that: the correlation between each feature attribute and the multi-label attributes is computed, the redundancy between the feature attribute and the feature attribute set is computed, and the correlation between the feature attribute and the multi-label attributes is combined with the redundancy between the feature attribute and the feature attribute set into the evaluation criterion of the feature, by which the features are then ranked:
W(X_i) = CCorrelation(X_i;Y) - Redundancy(X_i;H)    (16)
Feature selection is the process of choosing among the ranked features. In multi-label feature selection, the common practice is to set a selection threshold according to the subsequent classification algorithm and the feature evaluation criterion, and to select features by that threshold. From the point of view of classification ability, features ranked near the front of the sorted sequence have stronger correlation with the multi-label attributes and lower redundancy with the other feature attributes, and thus contribute more to classification. At the same time, the features should be considered as a whole, taking the set of feature attributes as the object of analysis. The correlation between a subset of the ranked feature attribute set H and the multi-label attribute set can thus be obtained.
The correlation is computed as:
Correlation(H;Y) = (1/|H|) \sum_{X_i \in H} (1/|Y|) \sum_{Y_j \in Y} Correlation(X_i;Y_j)    (17)
Following the ranked order of the feature attributes, the average value of the correlation over the ranked prefixes is computed by formula (18):
Correlation_avg(H;Y) = (1/|H|) \sum_{j=1}^{|H|} Correlation(H_j;Y)    (18)
where H_j denotes the first j feature attributes in the ranking. If Correlation(H_j;Y) is greater than Correlation_avg(H;Y) and Correlation(H_{j+1};Y) is less than Correlation_avg(H;Y), then these j feature attributes are the selected feature attributes.
With the improved embedded multi-label dynamic feature selection method, through the theory of mutual information in information theory, the mutual-information-based multi-label dynamic feature selection algorithm described by the present invention reasonably analyses the correlation between feature attributes, the correlation between class attributes, and the mutual relation between feature attributes and class attributes, and carries out dynamic feature selection through the dynamic mutual-information computation. The experimental results, analysed by three evaluation criteria, namely classification precision, coverage, and ranking loss, show that the feature selection algorithm obtains a smaller feature subset, reduces the feature dimensionality, steadily improves the classification performance, and has good stability.
CN201710222600.6A 2017-04-06 2017-04-06 An embedded multi-label dynamic feature selection algorithm Pending CN106991447A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710222600.6A CN106991447A (en) 2017-04-06 2017-04-06 An embedded multi-label dynamic feature selection algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710222600.6A CN106991447A (en) 2017-04-06 2017-04-06 An embedded multi-label dynamic feature selection algorithm

Publications (1)

Publication Number Publication Date
CN106991447A 2017-07-28

Family

ID=59415377

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710222600.6A Pending CN106991447A (en) 2017-04-06 2017-04-06 An embedded multi-label dynamic feature selection algorithm

Country Status (1)

Country Link
CN (1) CN106991447A (en)


Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108805162A (en) * 2018-04-25 2018-11-13 河南师范大学 A kind of saccharomycete multiple labeling feature selection approach and device based on particle group optimizing
CN108595381A (en) * 2018-04-27 2018-09-28 厦门尚为科技股份有限公司 Health status evaluation method, device and readable storage medium storing program for executing
CN109754000A (en) * 2018-12-21 2019-05-14 昆明理工大学 A kind of semi-supervised multi-tag classification method based on dependency degree
CN109784668A (en) * 2018-12-21 2019-05-21 国网江苏省电力有限公司南京供电分公司 A kind of sample characteristics dimension-reduction treatment method for electric power monitoring system unusual checking
CN110135469A (en) * 2019-04-24 2019-08-16 北京航空航天大学 It is a kind of to improve the characteristic filter method and device selected based on correlative character
CN110390353B (en) * 2019-06-28 2021-08-06 苏州浪潮智能科技有限公司 Biological identification method and system based on image processing
CN112148764B (en) * 2019-06-28 2024-05-07 北京百度网讯科技有限公司 Feature screening method, device, equipment and storage medium
CN110390353A (en) * 2019-06-28 2019-10-29 苏州浪潮智能科技有限公司 A kind of biometric discrimination method and system based on image procossing
CN112148764A (en) * 2019-06-28 2020-12-29 北京百度网讯科技有限公司 Feature screening method, device, equipment and storage medium
CN110334546A (en) * 2019-07-08 2019-10-15 辽宁工业大学 Difference privacy high dimensional data based on principal component analysis optimization issues guard method
CN110334546B (en) * 2019-07-08 2021-11-23 辽宁工业大学 Difference privacy high-dimensional data release protection method based on principal component analysis optimization
CN110851720A (en) * 2019-11-11 2020-02-28 北京百度网讯科技有限公司 Information recommendation method and device and electronic equipment
CN111027636B (en) * 2019-12-18 2020-09-29 山东师范大学 Unsupervised feature selection method and system based on multi-label learning
CN111027636A (en) * 2019-12-18 2020-04-17 山东师范大学 Unsupervised feature selection method and system based on multi-label learning
WO2022022683A1 (en) * 2020-07-31 2022-02-03 中兴通讯股份有限公司 Feature selection method and device, network device and computer-readable storage medium
CN112651703A (en) * 2020-12-02 2021-04-13 淮阴工学院 Dynamic reminding method for informing item processing deadline of OA (office automation) system of colleges and universities
CN112632368A (en) * 2020-12-02 2021-04-09 淮阴工学院 Method for notifying, issuing, personalized recommending and attention reminding of OA (office automation) system of colleges and universities
CN112765347A (en) * 2020-12-31 2021-05-07 浙江省方大标准信息有限公司 Mandatory standard automatic identification method, system and device
CN113518063A (en) * 2021-03-01 2021-10-19 广东工业大学 Network intrusion detection method and system based on data enhancement and BilSTM
CN113065428A (en) * 2021-03-21 2021-07-02 北京工业大学 Automatic driving target identification method based on feature selection


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20170728)