CN106991447A - An embedded dynamic feature selection algorithm for multi-class attribute labels - Google Patents
An embedded dynamic feature selection algorithm for multi-class attribute labels
- Publication number
- CN106991447A (application CN201710222600.6A)
- Authority
- CN
- China
- Prior art keywords
- attribute
- feature
- correlation
- data
- characteristic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F18/285: Pattern recognition; selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system
- G06F18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F18/22: Pattern recognition; matching criteria, e.g. proximity measures
- G06F18/2431: Pattern recognition; classification techniques relating to the number of classes; multiple classes
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an embedded dynamic feature selection method for multi-class attribute labels. It remedies the shortcomings of traditional feature selection algorithms for multi-class attribute labels by proposing an embedded dynamic feature selection method (ML_NIFS) that takes into account both the correlations inside the multi-label attribute set and the fact that the computation of information entropy in the feature selection evaluation criterion changes continually. Verification shows that the proposed algorithm effectively reduces the dimensionality of the attribute data and improves the subsequent classification effect.
Description
Technical field
The present invention relates to the field of pattern recognition, and specifically to an embedded dynamic feature selection method for multi-class attribute labels.
Background art
Traditional mutual-information metrics are widely used in feature dimensionality reduction algorithms because they are fast and comparatively efficient when processing high-dimensional attribute data. With the rapid development of science and technology, however, many technical fields, such as computer network communication and biochemical/medical engineering, are developing towards multi-class attribute label (multi-label) data types. Multi-label classification builds a classification model according to the type characteristics of the multi-label data and judges the class attributes of unknown data by a criterion, dividing each sample into multiple class labels at once. The fundamental difference between single-label and multi-label classification is that in single-label classification a sample can belong to only one class label, whereas in multi-label classification a sample may belong to multiple class labels, which closely matches the characteristics of today's rapidly developing information data. Multi-label classification has therefore attracted wide attention.
Like traditional single-label classification, multi-label classification faces the problem of the "curse of dimensionality", which likewise drastically degrades the classification ability of multi-label classifiers. Dimensionality reduction of characteristic attributes can lower the dimension of the characteristic attributes and improve the classification accuracy of the classifier; the techniques applicable to single-label classification can equally be applied to multi-label classification to achieve attribute reduction. Feature dimensionality reduction is generally divided into feature selection and feature extraction, and feature selection is further divided, according to its evaluation criterion, into filter, wrapper and embedded methods. The present invention mainly studies multi-label feature selection.
Multi-label feature selection algorithms currently develop along two basic directions: data transformation and algorithm adaptation. Feature selection research based on data transformation converts the multi-label data into single-label class attributes and then applies a single-label feature selection algorithm repeatedly to achieve multi-label feature selection. Feature selection research based on algorithm adaptation modifies and improves a single-label feature selection algorithm so that it fits multi-label attribute data. Common algorithms at this stage include SVM feature selection based on data transformation and KNN algorithms; these do not take the correlations inside the label attributes into account. Feature selection algorithms based on mutual information can analyse the correlations between attributes well using the relevant knowledge of information theory. However, the conventional mutual-information evaluation of the correlation between two variables is still unsatisfactory: it considers only the correlation between features and classes and between candidate features and already-selected features, whereas the sample data are continually determined by the chosen features, so the estimate of the information entropy appears in a continually changing dynamic process.
Based on the above considerations, the present invention proposes an embedded dynamic feature selection algorithm for multi-class attribute labels (ML_NIFS). Through mutual information computation, the algorithm considers not only the correlation between characteristic attributes and label attributes but also the correlation and redundancy among characteristic attributes, and additionally the correlations inside the multi-label attribute set, i.e. between label attributes. The proposed embedded dynamic multi-label feature selection algorithm removes the already-recognized sample data through an embedded classifier, thereby ensuring the accuracy and timeliness of the information entropy estimation.
Content of the invention
The object of the present invention is to provide an embedded dynamic feature selection method for multi-class attribute labels, so as to solve the problems raised in the background art above. To achieve the above object, the present invention provides the following technical scheme; specifically, the method comprises the following steps.
The traditional feature selection method based on mutual information is introduced first.
1. Preprocessing of the data set
Real-world databases are highly susceptible to noise, missing values and inconsistent data, and a large number of data preprocessing techniques exist, generally divided into data cleaning, data integration, data transformation and data reduction. Data cleaning removes noise from the data, corrects inconsistencies and fills in missing values of the sample data; data transformation (data normalization) improves the precision and validity of algorithms that involve distance metrics. For example, whenever the data are expected to follow a particular distribution, or each data feature is to be mapped into a specific interval, a data transformation is required. For this paper, the preprocessing of the data set consists of three parts. First, noise, inconsistencies and missing values in the data set are handled. Second, attribute data completely irrelevant to classification are deleted from the data set. Third, the attribute data are norm-normalized so that each norm equals 1:
x' = x / ||x||    (1)
2. Relevant knowledge of mutual information
The goal of feature selection is to select the characteristic attributes most valuable for classification. The key problem to solve is measurement: the measure must take into account the correlation between the attribute set and the class labels, the redundancy within the attribute set, and the correlations inside the label attribute set. To discuss this correlation problem, mutual information from information theory is chosen as the measurement module for analysing the correlations between attributes. The relevant theory and rules of information theory are described below.
Information entropy is a vital concept of information theory; it characterizes the degree of uncertainty of a variable and measures the amount of information content:
H(X) = -Σ_i p(x_i) log p(x_i)    (2)
where p(x_i) denotes the probability that variable X takes the value x_i. The uncertainty of X is represented by the entropy H(X), whose value depends on the probability distribution of the variable; for this reason entropy effectively overcomes the interference of some noisy data.
Conditional entropy is the degree of uncertainty of one variable given that another variable is known, i.e. the strength of one variable's dependence on the other. The dependence of a random variable X on another random variable Y can therefore be characterized by the conditional entropy:
H(X|Y) = -Σ_j p(y_j) Σ_i p(x_i|y_j) log p(x_i|y_j)    (3)
where p(x_i) denotes the prior probability of X and p(x_i|y_j) denotes the posterior probability of X given Y.
Mutual information characterizes the degree of interdependence between two random variables, i.e. the amount of information they share. A mutual information of 0, its minimum, means the two variables share no information; a larger value means the two variables share more information. It is defined as:
I(X;Y) = H(X) - H(X|Y)    (4)
Mutual information reflects the correlation between two random variables very effectively and expresses the tightness of that correlation numerically. However, the growth pattern of information must also be considered when computing the mutual information of two random variables: selecting features directly by the magnitude of mutual information would favour features with many values. The mutual information is therefore normalized; the symmetric uncertainty SU used to measure the degree of correlation between characteristic variables is:
SU(X, Y) = 2·I(X;Y) / (H(X) + H(Y))    (5)
Formula (5) shows that SU ranges from 0 to 1. SU = 0 means there is no correlation between X and Y, i.e. X and Y are independent; SU = 1 means X and Y are strongly correlated. If X and Y represent attribute information and class information respectively, a larger SU means the feature is more strongly relevant to the classification; if X and Y represent two attribute informations, a larger SU means strong redundancy between the two features or attributes.
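The following minimal sketch, assuming discrete-valued variables given as equal-length sequences, implements formulas (2)-(5); all function names are illustrative:

```python
import numpy as np
from collections import Counter

def entropy(x):
    """Information entropy, formula (2): H(X) = -sum_i p(x_i) log2 p(x_i)."""
    n = len(x)
    return -sum((c / n) * np.log2(c / n) for c in Counter(x).values())

def conditional_entropy(x, y):
    """Conditional entropy, formula (3): H(X|Y) = sum_j p(y_j) H(X | Y = y_j)."""
    n = len(y)
    total = 0.0
    for y_val, count in Counter(y).items():
        x_given_y = [xi for xi, yi in zip(x, y) if yi == y_val]
        total += (count / n) * entropy(x_given_y)
    return total

def mutual_information(x, y):
    """Mutual information, formula (4): I(X;Y) = H(X) - H(X|Y)."""
    return entropy(x) - conditional_entropy(x, y)

def symmetric_uncertainty(x, y):
    """Symmetric uncertainty, formula (5): SU = 2 I(X;Y) / (H(X) + H(Y)), in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0.0:          # both variables constant: no shared information
        return 0.0
    return 2.0 * mutual_information(x, y) / (hx + hy)
```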
3. Measurement based on mutual information
From mutual information theory, the redundancy between two characteristic attributes, the correlation between a characteristic attribute and a single label class attribute, and the correlation between two label class attributes can be calculated by the following formulas:
Redundancy(X_i; X_j) = SU(X_i, X_j)    (6)
Correlation(X_i; Y_j) = SU(X_i, Y_j)    (7)
Correlation(Y_i; Y_j) = SU(Y_i, Y_j)    (8)
From the formulas above, the redundancy between a single characteristic attribute and a set of characteristic attributes can be computed by averaging the redundancy between the single attribute and every attribute in the set:
Redundancy(X_i; X) = (1/|X|) Σ_{X_j∈X} SU(X_i, X_j)    (9)
where |X| denotes the number of characteristic attributes in the set and X_j denotes a characteristic attribute in the set.
Since the algorithm is intended as a multi-label feature selection algorithm, the correlation between a single characteristic attribute and the set formed by the multi-label class attributes is defined as:
Correlation(X_i; Y) = (1/|Y|) Σ_{Y_j∈Y} SU(X_i, Y_j)    (10)
where |Y| denotes the number of label class attributes in the label class attribute set and Y_j denotes a label class attribute in the set.
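Building on the sketch above, formulas (9) and (10) simply average SU over a set; a minimal rendering, with feature and label columns passed as sequences:

```python
def redundancy(x_i, feature_set):
    """Formula (9): average SU between candidate x_i and every already-selected feature."""
    if len(feature_set) == 0:
        return 0.0
    return sum(symmetric_uncertainty(x_i, x_j) for x_j in feature_set) / len(feature_set)

def correlation_to_labels(x_i, labels):
    """Formula (10): average SU between feature x_i and every label class attribute."""
    return sum(symmetric_uncertainty(x_i, y_j) for y_j in labels) / len(labels)
```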
This embedded dynamic feature selection algorithm for multi-class attribute labels considers not only the mutual relations among characteristic attributes and the correlation between characteristic attributes and label class attributes, but also the influence on feature selection of the correlations inside the multi-label class attribute set. In general, if the class attribute of a certain label has strong correlation with the class attributes of other labels, then the characteristic attributes selected for that label class attribute will likewise have good classification performance for the other, strongly correlated label class attributes. The correlation among label attributes can therefore be solved by the following formula:
W(Y_i) = (1/(|Y|-1)) Σ_{Y_j∈Y, j≠i} SU(Y_i, Y_j)    (11)
where |Y| denotes the number of label class attributes in the label class attribute set and Y_j denotes a label class attribute in the set. W(Y_i) is the average correlation of Y_i within the multi-label class attribute set; the larger its value, the more correlated label class attributes this label possesses in the label class attribute set. Characteristic attributes beneficial to the classification of this label class attribute then likewise have a positive influence on the more strongly correlated label class attributes.
Based on the considerations above, combining the correlation measure of formula (10) with the label weights of formula (11) yields the combined correlation measure CCorrelation(X_i; Y) of formula (12).
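A sketch of the label weights of formula (11) follows. Since the exact combined form of formula (12) is not recoverable from the text, the label-weighted average below is an assumed plausible reading, not the patent's definitive equation:

```python
def label_weights(labels):
    """Formula (11): W(Y_i), the average SU between label Y_i and the other labels."""
    m = len(labels)
    if m < 2:
        return [1.0] * m        # a single label has no peers to correlate with
    return [sum(symmetric_uncertainty(labels[i], labels[j])
                for j in range(m) if j != i) / (m - 1)
            for i in range(m)]

def ccorrelation(x_i, labels, weights):
    """An assumed reading of formula (12): label-weighted average of SU(x_i, Y_j)."""
    return sum(w * symmetric_uncertainty(x_i, y)
               for w, y in zip(weights, labels)) / len(labels)
```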
4. Feature ordering and feature selection
In the ML_NIFS algorithm, the correlation between a characteristic attribute and the multi-label class attributes and the redundancy between the characteristic attribute and the already-selected attribute set are both computed and combined into the evaluation criterion of the feature, by which the features are then ranked:
W(X_i) = CCorrelation(X_i; Y) - Redundancy(X_i; H)    (13)
where H is the set of already-ranked characteristic attributes, X_i is the candidate characteristic attribute, CCorrelation(X_i; Y) is the correlation between the characteristic attribute and the multi-label class attribute set, and Redundancy(X_i; H) is the redundancy between X_i and the ranked attribute set.
Feature selection is the process of selecting from the features ranked by feature ordering. In feature selection algorithms for multi-label class attributes the usual method is to set a feature selection threshold according to the subsequent classification algorithm and the feature evaluation criterion, and to carry out feature selection by that threshold. This algorithm instead selects features from the viewpoint of classification ability: in the ranked feature sequence, the features placed earlier have stronger correlation with the multi-label class attributes and lower redundancy among each other, and therefore have a better classification effect. At the same time, the wholeness of the characteristic attributes is considered, taking multiple characteristic attributes together as the object of analysis. From formula (10), the correlation between a subset of the ranked characteristic attribute set H and the multi-label class attribute set can be obtained; the correlation is computed as:
Correlation(H; Y) = (1/(|H|·|Y|)) Σ_{X_i∈H} Σ_{Y_j∈Y} SU(X_i, Y_j)    (14)
where H denotes the candidate feature set, Y the multi-label class attributes, |Y| the number of labels of the multi-label class attribute set, and |H| the number of characteristic attributes in the ranked feature set.
Following the ranking order, the average degree of correlation is computed by formula (15):
Correlation_avg(H; Y) = (1/|H|) Σ_{j=1..|H|} Correlation(H_j; Y)    (15)
where H_j denotes the first j ranked characteristic attributes. If Correlation(H_j; Y) is greater than Correlation_avg(H; Y) and Correlation(H_{j+1}; Y) is less than Correlation_avg(H; Y), then these j characteristic attributes are the selected feature attributes.
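Under the same assumptions and reusing the helper functions of the earlier sketches, the evaluation criterion of formula (13) and the prefix-based stopping rule of formulas (14)-(15) might be rendered as follows, with ranked_features a list of feature columns in ranking order:

```python
def criterion(x_i, selected_features, labels, weights):
    """Formula (13): W(X_i) = CCorrelation(X_i; Y) - Redundancy(X_i; H)."""
    return (ccorrelation(x_i, labels, weights)
            - redundancy(x_i, selected_features))

def choose_prefix(ranked_features, labels):
    """Formulas (14)-(15): keep the first j ranked features at the point where the
    prefix correlation crosses from above to below its average."""
    if not ranked_features:
        return []

    def corr_set(subset):                     # formula (14)
        return sum(correlation_to_labels(f, labels) for f in subset) / len(subset)

    prefix = [corr_set(ranked_features[:j + 1]) for j in range(len(ranked_features))]
    avg = sum(prefix) / len(prefix)           # formula (15)
    j = len(ranked_features)
    for k in range(len(prefix) - 1):
        if prefix[k] > avg and prefix[k + 1] < avg:
            j = k + 1
            break
    return ranked_features[:j]
```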
5. Embedded dynamic mutual information computation
In a measurement module based on mutual information, we first compute reasonably the probability distribution of the features over the sample data set: once the sample data are fixed, the probability of a feature over that sample set is uniquely determined. As features keep being selected, however, more and more samples of the data set become recognizable, so the quantities entering the computation of mutual information change; still using the traditional mutual-information computation would then produce a large error, because the already-recognized sample data supply "deceptive information" to the computation for the unselected features.
The main content of the dynamic feature selection proposed in the algorithm is therefore how to recognize the sample data that are already identifiable by the selected features, remove them from the data set, and recompute the information entropy from the remaining samples. A classifier is embedded into the running algorithm to perform this sample recognition; here an embedded KNN classifier is chosen, and the samples recognized by the KNN classifier are deleted from the sample data set, which reduces the number of samples of the data set and the dimension of the features without changing the correlation between features and classes.
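A sketch of the embedded KNN recognition step, using scikit-learn's KNeighborsClassifier as one concrete choice of embedded classifier; predicting on the training samples themselves (so each sample can vote for itself) is an assumption the text does not settle:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def prune_recognized_samples(X_selected, Y, k=5):
    """Return a keep-mask that drops samples the embedded KNN classifier already
    labels correctly under every label, using only the features selected so far.

    X_selected: (n_samples, n_selected) array; Y: (n_samples, n_labels) 0/1 array.
    """
    n_samples = X_selected.shape[0]
    recognized = np.ones(n_samples, dtype=bool)
    for j in range(Y.shape[1]):
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_selected, Y[:, j])
        # Majority vote of the k nearest samples under each label class.
        recognized &= knn.predict(X_selected) == Y[:, j]
    return ~recognized   # True = keep (sample not yet fully recognized)
```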
Brief description of the drawings
Fig. 1 The mutual-information-based embedded dynamic feature selection method for multi-class attributes
Fig. 2 Average precision of classification using the selected features, classifier parameter = 1
Fig. 3 Coverage of classification using the selected features, classifier parameter = 1
Fig. 4 Ranking loss of classification using the selected features, classifier parameter = 1
Fig. 5 Average precision of classification using the selected features, classifier parameter = 0.8
Fig. 6 Coverage of classification using the selected features, classifier parameter = 0.8
Fig. 7 Ranking loss of classification using the selected features, classifier parameter = 0.8
Embodiment
The feature set is divided into two parts, the selected feature set and the candidate feature set, denoted by H and X respectively. The multi-label class attributes are denoted by Y and the sample data set by O.
First, the characteristic attribute with the highest correlation according to formula (12) is selected and added to the feature set H, and it is simultaneously removed from the characteristic attribute set X.
Then the k nearest samples of the sample to be classified are searched by the Euclidean distance
d(x_a, x_b) = sqrt(Σ_i (x_{a,i} - x_{b,i})²)    (16)
and these k nearest samples constitute the neighbour data set, where (Y_NN)_i denotes the class results of the multiple labels in the i-th multi-label neighbour sample and N is the number of samples in the set.
The attribute data in the neighbour data set are used, with the majority-voting criterion applied to each label class attribute in turn, to judge the class attribute of the sample data. The KNN classifier is applied repeatedly to judge the samples of the sample set under each label class and to check whether each sample is classified correctly; if every label class attribute is classified correctly, the sample data are deleted from the data sample set.
Then the information entropy of the remaining characteristic attributes in the feature set X is recomputed on the new sample data set, and by formula (13) the characteristic attribute maximizing formula (13) is added to the feature set H and simultaneously removed from the characteristic attribute set X.
Finally, steps (2) and (3) are repeated until all characteristic attributes have been ranked, or until the number of data samples falls below the k of the KNN classifier.
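Tying the steps together, a sketch of the whole ML_NIFS loop under the assumptions above (discrete features, a 0/1 label matrix, and the helper functions from the earlier sketches):

```python
import numpy as np

def ml_nifs(X, Y, k=5):
    """Sketch of the overall embedded dynamic loop of the embodiment steps.

    X: (n_samples, n_features) discrete attribute data; Y: (n_samples, n_labels)
    0/1 label matrix. Returns selected feature indices in ranking order.
    """
    X, Y = np.asarray(X), np.asarray(Y)
    candidates = list(range(X.shape[1]))
    selected = []
    keep = np.ones(X.shape[0], dtype=bool)

    def current_labels():
        return [Y[keep, j] for j in range(Y.shape[1])]

    labels = current_labels()
    weights = label_weights(labels)

    # Step 1: seed with the feature most correlated with the label set (formula (12)).
    first = max(candidates, key=lambda f: ccorrelation(X[keep, f], labels, weights))
    selected.append(first)
    candidates.remove(first)

    while candidates and keep.sum() > k:
        # Step 2: delete samples the embedded KNN already classifies correctly.
        mask = prune_recognized_samples(X[keep][:, selected], Y[keep], k=k)
        keep[np.where(keep)[0][~mask]] = False
        if keep.sum() <= k:                  # too few samples left for KNN
            break
        labels = current_labels()
        weights = label_weights(labels)
        # Step 3: recompute entropies on the remaining samples and add the feature
        # maximizing W(X_i) = CCorrelation - Redundancy (formula (13)).
        selected_cols = [X[keep, f] for f in selected]
        best = max(candidates,
                   key=lambda f: criterion(X[keep, f], selected_cols, labels, weights))
        selected.append(best)
        candidates.remove(best)
    return selected
```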
It is obvious to a person skilled in the art that the invention is not restricted to the details of the above exemplary embodiments and that the present invention may be realized in other specific forms without departing from its spirit or essential attributes. The embodiments should therefore be regarded in every respect as exemplary and non-restrictive; the scope of the present invention is defined by the appended claims rather than by the description above, and all changes falling within the meaning and range of equivalency of the claims are intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim involved. Moreover, although this specification is described in terms of embodiments, not every embodiment contains only one independent technical scheme; this way of narration is adopted merely for clarity, and those skilled in the art should take the specification as a whole. The technical schemes of the various embodiments may also be suitably combined to form other embodiments understandable to those skilled in the art.
Claims (2)
1. An embedded dynamic feature selection method for multi-class attribute labels, characterized by comprising the following steps. The traditional feature selection method based on mutual information is introduced first.
1. Preprocessing of the data set
Real-world databases are highly susceptible to noise, missing values and inconsistent data, and a large number of data preprocessing techniques exist, generally divided into data cleaning, data integration, data transformation and data reduction. Data cleaning removes noise, corrects inconsistencies and fills in missing values of the sample data; data transformation (data normalization) improves the precision and validity of algorithms involving distance metrics, for example whenever the data are expected to follow a particular distribution or each data feature is to be mapped into a specific interval. The preprocessing of the data set consists of three parts: first, noise, inconsistencies and missing values in the data set are handled; second, attribute data completely irrelevant to classification are deleted from the data set; third, the attribute data are norm-normalized so that each norm equals 1:
x' = x / ||x||    (1)
2. Relevant knowledge of mutual information
The goal of feature selection is to select the characteristic attributes most valuable for classification; the key problem to solve is measurement, which must take into account the correlation between the attribute set and the class labels, the redundancy within the attribute set, and the correlations inside the label attribute set. To discuss this correlation problem, mutual information from information theory is chosen as the measurement module for analysing the correlations between attributes. The relevant theory and rules of information theory are described below.
Information entropy is a vital concept of information theory; it characterizes the degree of uncertainty of a variable and measures the amount of information content:
H(X) = -Σ_i p(x_i) log p(x_i)    (2)
where p(x_i) denotes the probability that variable X takes the value x_i. The uncertainty of X is represented by the entropy H(X), whose value depends on the probability distribution of the variable, so entropy effectively overcomes the interference of some noisy data.
Conditional entropy is the degree of uncertainty of one variable given that another is known, i.e. the strength of one variable's dependence on the other; the dependence of a random variable X on another random variable Y is characterized by the conditional entropy:
H(X|Y) = -Σ_j p(y_j) Σ_i p(x_i|y_j) log p(x_i|y_j)    (3)
where p(x_i) denotes the prior probability of X and p(x_i|y_j) the posterior probability of X given Y.
Mutual information characterizes the degree of interdependence between two random variables, i.e. the amount of information they share: a mutual information of 0, its minimum, means the two variables share no information, while a larger value means they share more. It is defined as:
I(X;Y) = H(X) - H(X|Y)    (4)
Mutual information reflects the correlation between two random variables very effectively and expresses its tightness numerically, but the growth pattern of information must also be considered: selecting features directly by the magnitude of mutual information would favour features with many values, so the mutual information is normalized, and the symmetric uncertainty SU measuring the degree of correlation between characteristic variables is:
SU(X, Y) = 2·I(X;Y) / (H(X) + H(Y))    (5)
Formula (5) shows that SU ranges from 0 to 1: SU = 0 means X and Y are uncorrelated, i.e. independent; SU = 1 means X and Y are strongly correlated. If X and Y represent attribute information and class information respectively, a larger SU means the feature is more relevant to the classification; if X and Y represent two attribute informations, a larger SU means strong redundancy between the two features or attributes.
3. Measurement based on mutual information
From mutual information theory, the redundancy between two characteristic attributes, the correlation between a characteristic attribute and a single label class attribute, and the correlation between two label class attributes are calculated as:
Redundancy(X_i; X_j) = SU(X_i, X_j)    (6)
Correlation(X_i; Y_j) = SU(X_i, Y_j)    (7)
Correlation(Y_i; Y_j) = SU(Y_i, Y_j)    (8)
From the formulas above, the redundancy between a single characteristic attribute and a set of characteristic attributes is computed by averaging the redundancy between that attribute and every attribute in the set:
Redundancy(X_i; X) = (1/|X|) Σ_{X_j∈X} SU(X_i, X_j)    (9)
where |X| is the number of characteristic attributes in the set and X_j is a characteristic attribute of the set.
Since the algorithm is intended as a multi-label feature selection algorithm, the correlation between a single characteristic attribute and the set formed by the multi-label class attributes is defined as:
Correlation(X_i; Y) = (1/|Y|) Σ_{Y_j∈Y} SU(X_i, Y_j)    (10)
where |Y| is the number of label class attributes in the label set and Y_j is a label class attribute of the set.
This embedded dynamic feature selection algorithm for multi-class attribute labels considers not only the correlations among characteristic attributes and between characteristic attributes and label class attributes, but also the influence on feature selection of the correlations inside the multi-label class attribute set: in general, if the class attribute of a certain label is strongly correlated with the class attributes of other labels, the characteristic attributes selected for that label also perform well in classifying the strongly correlated labels. The correlation among label attributes is therefore:
W(Y_i) = (1/(|Y|-1)) Σ_{Y_j∈Y, j≠i} SU(Y_i, Y_j)    (11)
where |Y| is the number of label class attributes in the label set and Y_j a label class attribute of the set; W(Y_i) is the average correlation of Y_i within the multi-label class attribute set, and the larger its value, the more correlated label class attributes this label possesses. Characteristic attributes beneficial to classifying this label then likewise have a positive influence on the strongly correlated label class attributes.
Based on the considerations above, combining the correlation of formula (10) with the label weights of formula (11) yields the combined correlation measure CCorrelation(X_i; Y) of formula (12).
4. Feature ordering and feature selection
In the ML_NIFS algorithm, the correlation between a characteristic attribute and the multi-label class attributes and the redundancy between the characteristic attribute and the already-selected attribute set are computed and combined into the evaluation criterion of the feature, by which the features are ranked:
W(X_i) = CCorrelation(X_i; Y) - Redundancy(X_i; H)    (13)
where H is the set of already-ranked characteristic attributes, X_i the candidate characteristic attribute, CCorrelation(X_i; Y) the correlation between the characteristic attribute and the multi-label class attribute set, and Redundancy(X_i; H) the redundancy between X_i and the ranked attribute set.
Feature selection is the process of selecting from the ranked features. In multi-label feature selection the usual method is to set a selection threshold according to the subsequent classification algorithm and the feature evaluation criterion, and to select features by that threshold. This algorithm instead selects from the viewpoint of classification ability: in the ranked feature sequence, features placed earlier are more strongly correlated with the multi-label class attributes and less redundant with one another, and therefore classify better. At the same time the wholeness of the characteristic attributes is considered, analysing multiple characteristic attributes together as a whole. From formula (10), the correlation between a subset of the ranked attribute set H and the multi-label class attribute set is:
Correlation(H; Y) = (1/(|H|·|Y|)) Σ_{X_i∈H} Σ_{Y_j∈Y} SU(X_i, Y_j)    (14)
where H is the candidate feature set, Y the multi-label class attributes, |Y| the number of labels, and |H| the number of characteristic attributes in the ranked feature set.
Following the ranking order, the average degree of correlation is computed as:
Correlation_avg(H; Y) = (1/|H|) Σ_{j=1..|H|} Correlation(H_j; Y)    (15)
where H_j denotes the first j ranked characteristic attributes. If Correlation(H_j; Y) is greater than Correlation_avg(H; Y) and Correlation(H_{j+1}; Y) is less than Correlation_avg(H; Y), these j characteristic attributes are the selected feature attributes.
5. Embedded dynamic mutual information computation
In a measurement module based on mutual information, we first compute reasonably the probability distribution of the features over the sample data set: once the sample data are fixed, the probability of a feature over that sample set is uniquely determined. As features keep being selected, however, more and more samples of the data set become recognizable, so the quantities entering the computation of mutual information change; still using the traditional mutual-information computation would then produce a large error, because the already-recognized sample data supply "deceptive information" to the computation for the unselected features.
The main content of the proposed dynamic feature selection is therefore how to recognize the sample data already identifiable by the selected features, remove them from the data set, and recompute the information entropy from the remaining samples. A classifier is embedded into the running algorithm to perform this sample recognition; here an embedded KNN classifier is chosen, and the samples recognized by the KNN classifier are deleted from the sample data set, reducing the number of samples and the feature dimension without changing the correlation between features and classes.
2. The embedded dynamic feature selection method for multi-class attribute labels according to claim 1, characterized in that: the correlation between a characteristic attribute and the multi-label class attributes and the redundancy between the characteristic attribute and the already-selected attribute set are computed and combined into the evaluation criterion of the feature, by which the features are ranked:
W(X_i) = CCorrelation(X_i; Y) - Redundancy(X_i; H)    (16)
Feature selection is the process of selecting from the ranked features. In multi-label feature selection the usual method is to set a selection threshold according to the subsequent classification algorithm and the feature evaluation criterion, and to select features by that threshold. This method instead selects from the viewpoint of classification ability: in the ranked feature sequence, features placed earlier are more strongly correlated with the multi-label class attributes and less redundant with one another, and therefore classify better. At the same time the wholeness of the characteristic attributes is considered, analysing multiple characteristic attributes together as a whole, giving the correlation between a subset of the ranked attribute set H and the multi-label class attribute set:
Correlation(H; Y) = (1/(|H|·|Y|)) Σ_{X_i∈H} Σ_{Y_j∈Y} SU(X_i, Y_j)    (17)
Following the ranking order, the average degree of correlation is computed by formula (18):
Correlation_avg(H; Y) = (1/|H|) Σ_{j=1..|H|} Correlation(H_j; Y)    (18)
where H_j denotes the first j ranked characteristic attributes; if Correlation(H_j; Y) is greater than Correlation_avg(H; Y) and Correlation(H_{j+1}; Y) is less than Correlation_avg(H; Y), these j characteristic attributes are the selected feature attributes.
In the improved embedded dynamic feature selection method for multi-class attribute labels, by means of the relevant theory of mutual information in information theory, the mutual-information-based multi-label dynamic feature selection algorithm described by the present invention reasonably analyses the correlations between characteristic attributes, between characteristic attributes and class attributes, and among the class attributes themselves, and performs dynamic feature selection through the dynamic mutual information computation. The experimental results, analysed under three evaluation criteria (classification precision, classification coverage and classification ranking loss), show that the feature selection algorithm obtains a smaller feature subset, reduces the feature dimension, steadily improves the classification effect, and has good stability.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710222600.6A CN106991447A (en) | 2017-04-06 | 2017-04-06 | A kind of embedded multi-class attribute tags dynamic feature selection algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106991447A true CN106991447A (en) | 2017-07-28 |
Family
ID=59415377
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710222600.6A Pending CN106991447A (en) | 2017-04-06 | 2017-04-06 | A kind of embedded multi-class attribute tags dynamic feature selection algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106991447A (en) |
- 2017-04-06 — CN CN201710222600.6A patent/CN106991447A/en active Pending
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108805162A (en) * | 2018-04-25 | 2018-11-13 | 河南师范大学 | A kind of saccharomycete multiple labeling feature selection approach and device based on particle group optimizing |
CN108595381A (en) * | 2018-04-27 | 2018-09-28 | 厦门尚为科技股份有限公司 | Health status evaluation method, device and readable storage medium storing program for executing |
CN109754000A (en) * | 2018-12-21 | 2019-05-14 | 昆明理工大学 | A kind of semi-supervised multi-tag classification method based on dependency degree |
CN109784668A (en) * | 2018-12-21 | 2019-05-21 | 国网江苏省电力有限公司南京供电分公司 | A kind of sample characteristics dimension-reduction treatment method for electric power monitoring system unusual checking |
CN110135469A (en) * | 2019-04-24 | 2019-08-16 | 北京航空航天大学 | It is a kind of to improve the characteristic filter method and device selected based on correlative character |
CN110390353B (en) * | 2019-06-28 | 2021-08-06 | 苏州浪潮智能科技有限公司 | Biological identification method and system based on image processing |
CN112148764B (en) * | 2019-06-28 | 2024-05-07 | 北京百度网讯科技有限公司 | Feature screening method, device, equipment and storage medium |
CN110390353A (en) * | 2019-06-28 | 2019-10-29 | 苏州浪潮智能科技有限公司 | A kind of biometric discrimination method and system based on image procossing |
CN112148764A (en) * | 2019-06-28 | 2020-12-29 | 北京百度网讯科技有限公司 | Feature screening method, device, equipment and storage medium |
CN110334546A (en) * | 2019-07-08 | 2019-10-15 | 辽宁工业大学 | Difference privacy high dimensional data based on principal component analysis optimization issues guard method |
CN110334546B (en) * | 2019-07-08 | 2021-11-23 | 辽宁工业大学 | Difference privacy high-dimensional data release protection method based on principal component analysis optimization |
CN110851720A (en) * | 2019-11-11 | 2020-02-28 | 北京百度网讯科技有限公司 | Information recommendation method and device and electronic equipment |
CN111027636B (en) * | 2019-12-18 | 2020-09-29 | 山东师范大学 | Unsupervised feature selection method and system based on multi-label learning |
CN111027636A (en) * | 2019-12-18 | 2020-04-17 | 山东师范大学 | Unsupervised feature selection method and system based on multi-label learning |
WO2022022683A1 (en) * | 2020-07-31 | 2022-02-03 | 中兴通讯股份有限公司 | Feature selection method and device, network device and computer-readable storage medium |
CN112651703A (en) * | 2020-12-02 | 2021-04-13 | 淮阴工学院 | Dynamic reminding method for informing item processing deadline of OA (office automation) system of colleges and universities |
CN112632368A (en) * | 2020-12-02 | 2021-04-09 | 淮阴工学院 | Method for notifying, issuing, personalized recommending and attention reminding of OA (office automation) system of colleges and universities |
CN112765347A (en) * | 2020-12-31 | 2021-05-07 | 浙江省方大标准信息有限公司 | Mandatory standard automatic identification method, system and device |
CN113518063A (en) * | 2021-03-01 | 2021-10-19 | 广东工业大学 | Network intrusion detection method and system based on data enhancement and BilSTM |
CN113065428A (en) * | 2021-03-21 | 2021-07-02 | 北京工业大学 | Automatic driving target identification method based on feature selection |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106991447A (en) | A kind of embedded multi-class attribute tags dynamic feature selection algorithm | |
CN106845717B (en) | Energy efficiency evaluation method based on multi-model fusion strategy | |
CN106971205A (en) | A kind of embedded dynamic feature selection method based on k nearest neighbor Mutual Information Estimation | |
CN107766883A (en) | A kind of optimization random forest classification method and system based on weighted decision tree | |
CN101504654B (en) | Method for implementing automatic database schema matching | |
Olteanu et al. | On-line relational and multiple relational SOM | |
CN102324038B (en) | Plant species identification method based on digital image | |
CN110674407A (en) | Hybrid recommendation method based on graph convolution neural network | |
CN107766418A (en) | A kind of credit estimation method based on Fusion Model, electronic equipment and storage medium | |
CN106991446A (en) | A kind of embedded dynamic feature selection method of the group policy of mutual information | |
CN101256631B (en) | Method and apparatus for character recognition | |
CN106326913A (en) | Money laundering account determination method and device | |
CN110533116A (en) | Based on the adaptive set of Euclidean distance at unbalanced data classification method | |
CN102750286A (en) | Novel decision tree classifier method for processing missing data | |
Wang et al. | Design of the Sports Training Decision Support System Based on the Improved Association Rule, the Apriori Algorithm. | |
CN110389932A (en) | Electric power automatic document classifying method and device | |
CN114491082A (en) | Plan matching method based on network security emergency response knowledge graph feature extraction | |
Tseng et al. | A pre-processing method to deal with missing values by integrating clustering and regression techniques | |
CN114254093A (en) | Multi-space knowledge enhanced knowledge graph question-answering method and system | |
Berahmand et al. | SDAC-DA: Semi-Supervised Deep Attributed Clustering Using Dual Autoencoder | |
CN111339258B (en) | University computer basic exercise recommendation method based on knowledge graph | |
CN103678709B (en) | Recommendation system attack detection method based on time series data | |
Zhang et al. | Robust Detection of Lead-Lag Relationships in Lagged Multi-Factor Models | |
CN103886007A (en) | Mutual constraint based fuzzy data classification method | |
CN113705920A (en) | Generation method of water data sample set for thermal power plant and terminal equipment |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20170728 |