CN107016073B - Text classification feature selection method - Google Patents
Text classification feature selection method
- Publication number: CN107016073B (application CN201710181572.8A)
- Authority: CN (China)
- Prior art keywords: feature, indicate, sel, degree, exc
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
Abstract
The present invention provides a text classification feature selection method that reduces feature dimensionality and classification complexity while improving classification accuracy. The method comprises: obtaining a feature set S and a target category C, computing the degree of association R_c(x^(i)) between each feature x^(i) in S and C, and sorting S in descending order of R_c(x^(i)); computing the redundancy R_x and synergy S_x between every pair of features in S, combining them with the degree of association R_c(x^(i)) between each feature and the target category to compute the feature's sensitivity Sen, comparing Sen against a preset threshold th and, using the descending sort of S, dividing S into a candidate set S_sel and an exclusion set S_exc according to th; computing the sensitivity Sen of the features in S_sel and S_exc, comparing it against the preset threshold th, and adjusting S_sel and S_exc according to th. The present invention is applicable to the field of machine-learning text classification.
Description
Technical field
The present invention relates to the field of machine-learning text classification, and in particular to a text classification feature selection method.
Background art
With the continuous expansion of the Internet, the information resources it aggregates keep growing. To manage and use these resources effectively, content-based information retrieval and data mining have long attracted attention. Text classification is an important foundation of information retrieval and text data mining; its main task is to assign documents of unknown class, based on their content, to one or more previously given categories. However, two characteristics of the problem, the large number of training samples and the high dimensionality of the feature vectors, make text classification a machine-learning problem with very high time and space complexity. Feature selection is therefore needed to reduce the feature dimensionality while preserving classification performance as far as possible.
Feature selecting is an important process of data preprocessing, in common text classification feature selection approach, card
Side examine (Chi-Square) by establishing null hypothesis, it is assumed that word is uncorrelated to target category, selection deviate hypothesis degree greatly
Word is as feature.But whether there is certain word in its statistical documents, but regardless of occurring several times, this make it to low-frequency word
It is partial.Mutual information (Mutual Information) method is selected by measuring the presence of word to target category bring information content
Select feature.But it only considered the degree of association between word and target category, ignore dependence that may be present between word and word.
TF-IDF (Term Frequency-Inverse Document Frequency) method comprehensively considers what word occurred hereof
The distribution of frequency and word in All Files is to assess the significance level of word, to carry out Feature Selection.But it is only simple
The word for thinking that text frequency is small it is more important and word that text frequency is big is more useless, therefore precision is not very high.Furthermore
There are also the feature selection approach such as information gain, odds ratio, text weight evidence, expectation cross entropy, they all only considered mostly
The degree of correlation between degree of correlation or word and word between word and target category is easy to appear dimensionality reduction degree not enough or classification is smart
Spend not high problem.
Summary of the invention
The technical problem to be solved by the present invention is to provide a text classification feature selection method that addresses the high feature dimensionality or low classification precision of the prior art.
To solve the above technical problem, an embodiment of the present invention provides a text classification feature selection method, comprising:
Step 1: obtain the feature set S and the target category C, compute the degree of association R_c(x^(i)) between each feature x^(i) in S and C, and sort S in descending order of R_c(x^(i));
Step 2: compute the redundancy R_x and synergy S_x between every pair of features in S, combine them with the degree of association R_c(x^(i)) between each feature and the target category to compute the feature's sensitivity Sen, compare Sen against a preset threshold th and, using the descending sort of S, divide S into a candidate set S_sel and an exclusion set S_exc according to th;
Step 3: compute the sensitivity Sen of the features in S_sel and S_exc, compare it against the preset threshold th, and adjust S_sel and S_exc according to th.
Further, step 1 comprises:
Step 11: for each feature x^(i) in the feature set S, compute the degree of association R_c(x^(i)) between x^(i) and the target category C according to the formula R_c(x^(i)) = I(x^(i); C), where I(x^(i); C) denotes the mutual information between x^(i) and C;
Step 12: sort the features in S from largest to smallest R_c(x^(i)), obtaining the sorted feature set S;
where x^(i) denotes the i-th feature in S and R_c(x^(i)) denotes the degree of association between x^(i) and C.
Further, I(x^(i); C) is expressed as:
I(x^(i); C) = Σ_k p(x^(i), c_k) · log( p(x^(i)|c_k) / p(x^(i)) )
where c_k denotes the k-th class of the target category C, p(x^(i), c_k) denotes the probability that feature x^(i) and class c_k occur simultaneously, p(x^(i)|c_k) denotes the probability that x^(i) occurs within class c_k, and p(x^(i)) denotes the probability that x^(i) occurs in the feature set S.
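As an illustration, the formula above can be evaluated directly from the three probabilities it names. The following Python sketch assumes those probabilities have already been estimated; the function and argument names are mine, and base-2 logarithms are assumed (the patent does not fix a base):

```python
import math

def relevance(p_xc, p_c, p_x):
    """R_c(x) = I(x; C) = sum_k p(x, c_k) * log2(p(x | c_k) / p(x)).

    p_xc: list of joint probabilities p(x, c_k), one per class k.
    p_c:  list of class priors p(c_k), used to form p(x | c_k).
    p_x:  marginal probability p(x).
    """
    total = 0.0
    for p_joint, p_class in zip(p_xc, p_c):
        if p_joint > 0:  # zero-probability terms contribute nothing
            p_cond = p_joint / p_class  # p(x | c_k)
            total += p_joint * math.log2(p_cond / p_x)
    return total
```

For a word independent of the class, p(x, c_k) = p(x)·p(c_k) for every k and the relevance is 0; words skewed toward one class score higher.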
Further, the redundancy R_x is expressed as:
R_x(x^(i); x^(j)) = min(0, IG(x^(i); x^(j); C)), i ≠ j
where IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in S, and R_x(x^(i); x^(j)) denotes the redundancy between x^(i) and x^(j), taken as the smaller of 0 and the correlation gain.
Further, the synergy S_x is expressed as:
S_x(x^(i); x^(j)) = max(0, IG(x^(i); x^(j); C)), i ≠ j
where IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in S, and S_x(x^(i); x^(j)) denotes the synergy between x^(i) and x^(j), taken as the larger of 0 and the correlation gain.
Further, IG(x^(i); x^(j); C) is expressed as:
IG(x^(i); x^(j); C) = I((x^(i), x^(j)); C) - I(x^(i); C) - I(x^(j); C)
where I(x^(i); C) denotes the mutual information between x^(i) and the target category C; I(x^(j); C) denotes the mutual information between x^(j) and C; and I((x^(i), x^(j)); C) denotes the mutual information between the feature pair (x^(i), x^(j)) and C.
Further, I((x^(i), x^(j)); C) is expressed as:
I((x^(i), x^(j)); C) = Σ_k p(x^(i), x^(j), c_k) · log( p((x^(i), x^(j))|c_k) / p(x^(i), x^(j)) )
where c_k denotes the k-th class of the target category C, p(x^(i), x^(j), c_k) denotes the probability that features x^(i), x^(j) and class c_k occur simultaneously, p((x^(i), x^(j))|c_k) denotes the probability that x^(i) and x^(j) occur simultaneously within class c_k, and p(x^(i), x^(j)) denotes the probability that x^(i) and x^(j) occur simultaneously in the feature set S.
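A minimal sketch of how the correlation gain splits into redundancy and synergy, assuming the three mutual-information values have already been computed (the function names are mine):

```python
def interaction_gain(i_pair_c, i_x_c, i_y_c):
    """IG(x; y; C) = I((x, y); C) - I(x; C) - I(y; C)."""
    return i_pair_c - i_x_c - i_y_c

def redundancy(ig):
    """R_x = min(0, IG): overlapping information shows up as negative gain."""
    return min(0.0, ig)

def synergy(ig):
    """S_x = max(0, IG): the pair tells more about C than its parts do."""
    return max(0.0, ig)
```

Note that each IG value lands entirely in exactly one of the two quantities, so R_x + S_x = IG for every feature pair.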
Further, step 2 comprises:
Step 21: add the first feature of S to the candidate set S_sel and set the exclusion set S_exc to the empty set, i.e. S_sel = {x^(1)}, S_exc = {}; the first feature has the largest degree of association R_c;
Step 22: starting from the second feature of S, denoted x^(i), compute the redundancy R_x and synergy S_x between x^(i) and every feature in S_sel, and combine them with the degree of association R_c(x^(i)) to compute the sensitivity Sen(x^(i));
Step 23: compare Sen(x^(i)) with the preset threshold th; if Sen(x^(i)) > th, add x^(i) to S_sel, otherwise add it to S_exc;
Step 24: if x^(i) is the last feature in S, the division ends; otherwise set x^(i) to the next feature in S and return to step 22.
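Steps 21-24 amount to a single pass over the relevance-sorted feature list. A sketch, with the sensitivity computation abstracted as a caller-supplied function `sen(x, s_sel)` (the name is mine):

```python
def split_features(features, sen, th):
    """Divide a relevance-sorted feature list into a candidate set S_sel
    and an exclusion set S_exc (steps 21-24).

    features: features sorted by descending R_c, so features[0] is kept.
    sen:      callable sen(x, s_sel) returning the sensitivity Sen(x)
              of x against the current candidate set.
    th:       the preset threshold.
    """
    s_sel = [features[0]]   # step 21: the most relevant feature is always kept
    s_exc = []
    for x in features[1:]:  # steps 22-24: threshold each remaining feature
        if sen(x, s_sel) > th:
            s_sel.append(x)
        else:
            s_exc.append(x)
    return s_sel, s_exc
```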
Further, the sensitivity Sen(x^(i)) is expressed as:
Sen(x^(i)) = R_c(x^(i)) + α·min(R_x(x^(i); x^(j))) + β·max(S_x(x^(i); x^(j))), j ≠ i
where α and β are the weights of the redundancy R_x and the synergy S_x respectively, min(R_x(x^(i); x^(j))) denotes the minimum redundancy between x^(i) and the remaining features, max(S_x(x^(i); x^(j))) denotes the maximum synergy between x^(i) and the remaining features, Sen(x^(i)) denotes the sensitivity of x^(i) to the target category C, and R_c(x^(i)) denotes the degree of association between x^(i) and C.
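The sensitivity combines the three quantities linearly. A sketch using the embodiment's default weights α = β = 0.5 (argument names are mine):

```python
def sensitivity(r_c, r_x_vals, s_x_vals, alpha=0.5, beta=0.5):
    """Sen(x) = R_c(x) + alpha * min_j R_x(x; x_j) + beta * max_j S_x(x; x_j).

    r_c:      degree of association between x and the target category.
    r_x_vals: redundancy of x against each other feature (all <= 0).
    s_x_vals: synergy of x against each other feature (all >= 0).
    """
    return r_c + alpha * min(r_x_vals) + beta * max(s_x_vals)
```

Because R_x is never positive and S_x never negative, the worst redundancy can only subtract from the relevance while the best synergy can only add to it.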
Further, step 3 comprises:
Step 31: set the pending set S_tbd to empty, i.e. S_tbd = {}; let x^(k) be the first feature in the exclusion set S_exc and x^(m) the first feature in the candidate set S_sel;
Step 32: for the excluded feature x^(k), compute the maximum synergy between the candidate feature x^(m) and all features of S other than x^(m), i.e. max(S_x(x^(m); x^(i))), x^(i) ∈ S, i ≠ m;
Step 33: if the feature achieving this maximum synergy with x^(m) is x^(k), add x^(m) to the pending set S_tbd;
Step 34: if x^(m) is the last feature in S_sel and S_tbd is empty, go to step 36; if S_tbd is not empty, let x^(j) be the first feature in S_tbd and go to step 35; if x^(m) is not the last feature in S_sel, set x^(m) to the next feature in S_sel and return to step 32;
Step 35: for the feature x^(j) in S_tbd, update its sensitivity as follows:
Sen(x^(j)) = R_c(x^(j)) + α·min(R_x(x^(j); x^(n))) + β·max(S_x(x^(j); x^(n))), x^(n) ∈ S, n ≠ j, n ≠ k
Compare Sen(x^(j)) with the preset threshold th; if Sen(x^(j)) < th, remove x^(k) from S_exc, add it to the candidate set S_sel, and go to step 36; otherwise, if x^(j) is the last element in S_tbd, go directly to step 36; otherwise set x^(j) to the next element in S_tbd and return to step 35;
Step 36: if x^(k) is the last element in S_exc, return the current candidate set S_sel and exclusion set S_exc as the result of the final feature selection; otherwise set x^(k) to the next element in S_exc and return to step 31.
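Steps 31-36 can be sketched as the following pass over the exclusion set. The two helpers, `strongest_partner(x)` (the feature with which x has maximal synergy over all of S) and `sen_without(x, k)` (Sen(x) recomputed with feature k left out), abstract the formulas above; their names are mine:

```python
def adjust_sets(s_sel, s_exc, strongest_partner, sen_without, th):
    """Adjustment pass (steps 31-36): an excluded feature x_k is rescued
    back into the candidate set when some candidate depends on it, i.e.
    x_k is that candidate's strongest synergy partner and the candidate's
    sensitivity would fall below th if x_k were truly removed.
    """
    for x_k in list(s_exc):            # step 31: iterate over the exclusion set
        # steps 32-33: candidates whose strongest synergy partner is x_k
        s_tbd = [x_m for x_m in s_sel if strongest_partner(x_m) == x_k]
        # steps 34-35: does any pending feature drop below th without x_k?
        if any(sen_without(x_j, x_k) < th for x_j in s_tbd):
            s_exc.remove(x_k)
            s_sel.append(x_k)          # step 36 then continues with the next x_k
    return s_sel, s_exc
```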
The advantageous effects of the above technical solution of the present invention are as follows:
In the above scheme, from the feature set S and the target category C, the degree of association R_c(x^(i)) between each feature and the target category and the redundancy R_x and synergy S_x between features are computed, and from these the sensitivity Sen of each feature; the features are screened against the preset threshold th, the feature set is divided into a candidate set and an exclusion set, and the two sets are then further adjusted and optimized in subsequent processing. In this way the relationships both between features and the target category and among the features themselves are taken into account; features are selected by degree of association, redundancy and synergy, the features that play a key role in classification are retained, the feature dimensionality and classification complexity are reduced, and the classification accuracy can be improved.
Brief description of the drawings
Fig. 1 is a flow diagram of the text classification feature selection method provided by an embodiment of the present invention;
Fig. 2 is a detailed flow diagram of the text classification feature selection method provided by an embodiment of the present invention;
Fig. 3 is a flow diagram of dividing the candidate set and the exclusion set in the feature selection method provided by an embodiment of the present invention;
Fig. 4 is a flow diagram of adjusting the candidate set and the exclusion set in the feature selection method provided by an embodiment of the present invention.
Detailed description
To make the technical problem to be solved by the present invention, the technical solution and the advantages clearer, they are described in detail below in conjunction with the accompanying drawings and specific embodiments.
The present invention provides a text classification feature selection method for the problem of high feature dimensionality or low classification precision in the prior art.
As shown in Fig. 1, the text classification feature selection method provided by an embodiment of the present invention comprises:
Step 1: obtain the feature set S and the target category C, compute the degree of association R_c(x^(i)) between each feature x^(i) in S and C, and sort S in descending order of R_c(x^(i));
Step 2: compute the redundancy R_x and synergy S_x between every pair of features in S, combine them with the degree of association R_c(x^(i)) between each feature and the target category to compute the feature's sensitivity Sen, compare Sen against the preset threshold th and, using the descending sort of S, divide S into a candidate set S_sel and an exclusion set S_exc according to th;
Step 3: compute the sensitivity Sen of the features in S_sel and S_exc, compare it against the preset threshold th, and adjust S_sel and S_exc according to th.
In the text classification feature selection method of the embodiment of the present invention, from the feature set S and the target category C, the degree of association R_c(x^(i)) between each feature and the target category and the redundancy R_x and synergy S_x between features are computed, and from these the sensitivity Sen of each feature; the features are screened against the preset threshold th, the feature set is divided into a candidate set and an exclusion set, and the two sets are then further adjusted and optimized in subsequent processing. In this way the relationships both between features and the target category and among the features themselves are taken into account; features are selected by degree of association, redundancy and synergy, the features that play a key role in classification are retained, the feature dimensionality and classification complexity are reduced, and the classification accuracy can be improved.
In the present embodiment, as shown in Fig. 2, the feature set S = (x^(1), x^(2), ..., x^(n)) and the target category C must first be input.
In the present embodiment, the feature set S denotes the set of all features in the text classification process (a single feature, i.e. a word vector, is denoted x^(i)), i.e. S = (x^(1), x^(2), ..., x^(n)), where n is the number of features in S. The feature x^(i) denotes the column vector formed by the number of occurrences of the corresponding word in each text file. The target category C denotes the column vector formed by the class of each text file; the target category C is the set of categories.
In the present embodiment, the degree of association R_c(x^(i)) between the feature x^(i) and the target category C is the mutual information between x^(i) and C.
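The feature vectors described above are simply per-word occurrence counts across the files. A small sketch of building them from tokenised files (the function name is mine, and the patent does not prescribe a tokeniser):

```python
from collections import Counter

def build_feature_set(tokenized_files):
    """Return (vocab, X), where column i of X is the feature x^(i):
    X[j][i] = number of occurrences of word vocab[i] in file j."""
    vocab = sorted({w for doc in tokenized_files for w in doc})
    counts = [Counter(doc) for doc in tokenized_files]
    X = [[c[w] for w in vocab] for c in counts]  # Counter returns 0 for absent words
    return vocab, X
```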
In the present embodiment, as an alternative embodiment, computing the degree of association R_c(x^(i)) between each feature x^(i) in S and the target category C and sorting S in descending order of R_c(x^(i)) (step 1) comprises:
Step 11: for each feature x^(i) in the feature set S, compute the degree of association R_c(x^(i)) between x^(i) and the target category C according to the formula R_c(x^(i)) = I(x^(i); C), where I(x^(i); C) denotes the mutual information between x^(i) and C;
Step 12: sort the features in S from largest to smallest R_c(x^(i)), obtaining the sorted feature set S;
where x^(i) denotes the i-th feature in S and R_c(x^(i)) denotes the degree of association between x^(i) and C.
In the present embodiment:
I(x^(i); C) = Σ_k p(x^(i), c_k) · log( p(x^(i)|c_k) / p(x^(i)) )
where I(x^(i); C) denotes the mutual information between x^(i) and the target category C, c_k denotes the k-th class of C, p(x^(i), c_k) denotes the probability that x^(i) and c_k occur simultaneously, p(x^(i)|c_k) denotes the probability that x^(i) occurs within class c_k, and p(x^(i)) denotes the probability that x^(i) occurs in the feature set S.
In the present embodiment, preferably, the probability p(x^(i), c_k) that feature x^(i) and class c_k occur simultaneously is approximated by the frequency with which the word corresponding to x^(i) in the c_k-class files occurs relative to all files, where x^(i)_j denotes the j-th element of x^(i) (the number of times the corresponding word occurs in the j-th file) and x^(i)_{m,k} denotes the m-th element of x^(i) whose target category is c_k (the number of times the word occurs in the m-th c_k-class file).
In the present embodiment, preferably, the probability p(x^(i)|c_k) that x^(i) occurs within class c_k is approximated by the frequency with which the corresponding word occurs in the c_k-class files.
In the present embodiment, preferably, the probability p(x^(i)) that x^(i) occurs in the feature set S is approximated by the frequency with which the corresponding word occurs in all files.
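The frequency approximations above can be implemented as below. The exact normalisations are my reading of the text (the original formulas are not reproduced in this copy), so treat this as one plausible sketch rather than the patent's definition:

```python
import numpy as np

def term_probabilities(X, y, k):
    """Frequency estimates for one class c_k over a count matrix.

    X: (n_files, n_terms) array, X[j, i] = count of word i in file j.
    y: class label of each file.  Returns, per term i:
      p_joint[i] ~ p(x_i, c_k): occurrences in c_k files / all occurrences,
      p_cond[i]  ~ p(x_i | c_k): occurrences in c_k files / occurrences of
                   all terms in c_k files,
      p_marg[i]  ~ p(x_i): occurrences in all files / all occurrences.
    """
    X = np.asarray(X, dtype=float)
    in_k = X[np.asarray(y) == k]        # rows of the c_k-class files
    total = X.sum()
    p_joint = in_k.sum(axis=0) / total
    p_cond = in_k.sum(axis=0) / in_k.sum()
    p_marg = X.sum(axis=0) / total
    return p_joint, p_cond, p_marg
```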
In the present embodiment, as yet another alternative embodiment, as shown in Fig. 3, computing the redundancy R_x and synergy S_x between every pair of features in S, combining them with the degree of association R_c(x^(i)) to compute each feature's sensitivity Sen, comparing Sen with the preset threshold th, and dividing S into the candidate set S_sel and the exclusion set S_exc according to th (step 2) comprises:
Step 21: add the first feature of S to the candidate set S_sel and set the exclusion set S_exc to the empty set, i.e. S_sel = {x^(1)}, S_exc = {}; the first feature has the largest degree of association R_c;
Step 22: starting from the second feature of S, denoted x^(i), compute the redundancy R_x and synergy S_x between x^(i) and every feature in S_sel, and combine them with the degree of association R_c(x^(i)) to compute the sensitivity Sen(x^(i));
Step 23: compare Sen(x^(i)) with the preset threshold th; if Sen(x^(i)) > th, add x^(i) to S_sel, otherwise add it to S_exc;
Step 24: if x^(i) is the last feature in S, the division ends; otherwise set x^(i) to the next feature in S and return to step 22.
In a specific embodiment of the foregoing text classification feature selection method, further, the redundancy R_x is expressed as:
R_x(x^(i); x^(j)) = min(0, IG(x^(i); x^(j); C)), i ≠ j
where IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in S, and R_x(x^(i); x^(j)) denotes the redundancy between x^(i) and x^(j), taken as the smaller of 0 and the correlation gain.
In a specific embodiment of the foregoing text classification feature selection method, further, the synergy S_x is expressed as:
S_x(x^(i); x^(j)) = max(0, IG(x^(i); x^(j); C)), i ≠ j
where IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in S, and S_x(x^(i); x^(j)) denotes the synergy between x^(i) and x^(j), taken as the larger of 0 and the correlation gain.
In a specific embodiment of the foregoing text classification feature selection method, further, IG(x^(i); x^(j); C) is expressed as:
IG(x^(i); x^(j); C) = I((x^(i), x^(j)); C) - I(x^(i); C) - I(x^(j); C)
where I(x^(i); C) and I(x^(j); C) are computed by the same formula as the mutual information between a feature and the target category C given above; I(x^(i); C) denotes the mutual information between x^(i) and C, I(x^(j); C) denotes the mutual information between x^(j) and C, and I((x^(i), x^(j)); C) denotes the mutual information between the feature pair (x^(i), x^(j)) and C.
In a specific embodiment of the foregoing text classification feature selection method, further, I((x^(i), x^(j)); C) is expressed as:
I((x^(i), x^(j)); C) = Σ_k p(x^(i), x^(j), c_k) · log( p((x^(i), x^(j))|c_k) / p(x^(i), x^(j)) )
where c_k denotes the k-th class of the target category C, p(x^(i), x^(j), c_k) denotes the probability that features x^(i), x^(j) and class c_k occur simultaneously, p((x^(i), x^(j))|c_k) denotes the probability that x^(i) and x^(j) occur simultaneously within class c_k, and p(x^(i), x^(j)) denotes the probability that x^(i) and x^(j) occur simultaneously in the feature set S.
In the present embodiment, preferably, the probability p(x^(i), x^(j), c_k) that features x^(i), x^(j) and class c_k occur simultaneously is approximated by the frequency with which the words corresponding to x^(i) and x^(j) occur simultaneously in the c_k-class files relative to all files, where min(x^(i)_{m,k}, x^(j)_{m,k}) denotes the smaller of the m-th c_k-class elements of x^(i) and x^(j) (i.e. the smaller of the occurrence counts of the two words in the m-th c_k-class file).
In the present embodiment, preferably, the probability p((x^(i), x^(j))|c_k) that x^(i) and x^(j) occur simultaneously within class c_k is approximated by the frequency with which the two corresponding words occur simultaneously in the c_k-class files.
In the present embodiment, preferably, the probability p(x^(i), x^(j)) that x^(i) and x^(j) occur simultaneously in the feature set S is approximated by the frequency with which the two corresponding words occur simultaneously in all files.
In a specific embodiment of the foregoing text classification feature selection method, further, the sensitivity Sen(x^(i)) is expressed as:
Sen(x^(i)) = R_c(x^(i)) + α·min(R_x(x^(i); x^(j))) + β·max(S_x(x^(i); x^(j))), j ≠ i
where α and β are the weights of the redundancy R_x and the synergy S_x respectively, min(R_x(x^(i); x^(j))) denotes the minimum redundancy between x^(i) and the remaining features, max(S_x(x^(i); x^(j))) denotes the maximum synergy between x^(i) and the remaining features, Sen(x^(i)) denotes the sensitivity of x^(i) to the target category C, and R_c(x^(i)) denotes the degree of association between x^(i) and C.
In the present embodiment, as shown in Fig. 4, as an alternative embodiment, computing the sensitivity Sen of the features in the candidate set S_sel and the exclusion set S_exc, comparing it with the preset threshold th, and adjusting S_sel and S_exc according to th (step 3) comprises:
Step 31: set the pending set S_tbd to empty, i.e. S_tbd = {}; let x^(k) be the first feature in the exclusion set S_exc and x^(m) the first feature in the candidate set S_sel;
Step 32: for the excluded feature x^(k), compute the maximum synergy between the candidate feature x^(m) and all features of S other than x^(m), i.e. max(S_x(x^(m); x^(i))), x^(i) ∈ S, i ≠ m;
Step 33: if the feature achieving this maximum synergy with x^(m) is x^(k), add x^(m) to the pending set S_tbd;
Step 34: if x^(m) is the last feature in S_sel and S_tbd is empty, go to step 36; if S_tbd is not empty, let x^(j) be the first feature in S_tbd and go to step 35; if x^(m) is not the last feature in S_sel, set x^(m) to the next feature in S_sel and return to step 32;
Step 35: for the feature x^(j) in S_tbd, update its sensitivity as follows:
Sen(x^(j)) = R_c(x^(j)) + α·min(R_x(x^(j); x^(n))) + β·max(S_x(x^(j); x^(n))), x^(n) ∈ S, n ≠ j, n ≠ k
Compare Sen(x^(j)) with the preset threshold th; if Sen(x^(j)) < th, remove x^(k) from S_exc, add it to the candidate set S_sel, and go to step 36; otherwise, if x^(j) is the last element in S_tbd, go directly to step 36; otherwise set x^(j) to the next element in S_tbd and return to step 35;
Step 36: if x^(k) is the last element in S_exc, return the current candidate set S_sel and exclusion set S_exc as the result of the final feature selection; otherwise set x^(k) to the next element in S_exc and return to step 31.
In the present embodiment, according to steps 31-36, the sensitivity Sen of the features in the candidate set S_sel and the exclusion set S_exc is computed and compared with the preset threshold th, and S_sel and S_exc are adjusted according to th, yielding a new candidate set S_sel and exclusion set S_exc; this mitigates the influence that the removal of a feature may have on the classification result.
In the present embodiment, the weight α of the redundancy R_x may default to 0.5; the weight β of the synergy S_x may default to 0.5; the preset threshold th defaults to 0.01. The weight α of the redundancy R_x, the weight β of the synergy S_x and the preset threshold th are optimized and updated by a genetic algorithm during subsequent training and testing.
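The patent optimises (α, β, th) with a genetic algorithm whose details are not given here. As a minimal stand-in, the sketch below uses plain random search starting from the stated defaults (α = β = 0.5, th = 0.01); the function names, search ranges and trial count are my assumptions:

```python
import random

def tune_parameters(evaluate, n_trials=50, seed=0):
    """Search for (alpha, beta, th) maximising classification accuracy.

    evaluate: callable (alpha, beta, th) -> accuracy on held-out data.
    Random search is a stand-in for the patent's genetic algorithm.
    """
    rng = random.Random(seed)
    best = (0.5, 0.5, 0.01)             # embodiment defaults
    best_score = evaluate(*best)
    for _ in range(n_trials):
        cand = (rng.random(), rng.random(), rng.uniform(0.0, 0.1))
        score = evaluate(*cand)
        if score > best_score:          # keep only strict improvements
            best, best_score = cand, score
    return best, best_score
```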
The above are preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art can make several improvements and refinements without departing from the principles of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.
Claims (7)
1. A text classification feature selection method, characterized by comprising:
Step 1: obtaining a feature set S and a target category C, computing the degree of association R_c(x^(i)) between each feature x^(i) in S and C, and sorting S in descending order of R_c(x^(i));
Step 2: computing the redundancy R_x and synergy S_x between every pair of features in S, combining them with the degree of association R_c(x^(i)) between each feature and the target category to compute the feature's sensitivity Sen, comparing Sen with a preset threshold th and, using the descending sort of S, dividing S into a candidate set S_sel and an exclusion set S_exc according to th;
Step 3: computing the sensitivity Sen of the features in S_sel and S_exc, comparing it with the preset threshold th, and adjusting S_sel and S_exc according to th;
wherein the redundancy R_x is expressed as:
R_x(x^(i); x^(j)) = min(0, IG(x^(i); x^(j); C)), i ≠ j
where IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in S, and R_x(x^(i); x^(j)) denotes the redundancy between x^(i) and x^(j), taken as the smaller of 0 and the correlation gain;
wherein the synergy S_x is expressed as:
S_x(x^(i); x^(j)) = max(0, IG(x^(i); x^(j); C)), i ≠ j
where IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in S, and S_x(x^(i); x^(j)) denotes the synergy between x^(i) and x^(j), taken as the larger of 0 and the correlation gain;
wherein the sensitivity Sen(x^(i)) is expressed as:
Sen(x^(i)) = R_c(x^(i)) + α·min(R_x(x^(i); x^(j))) + β·max(S_x(x^(i); x^(j))), j ≠ i
where α and β are the weights of the redundancy R_x and the synergy S_x respectively, min(R_x(x^(i); x^(j))) denotes the minimum redundancy between x^(i) and the remaining features, max(S_x(x^(i); x^(j))) denotes the maximum synergy between x^(i) and the remaining features, Sen(x^(i)) denotes the sensitivity of x^(i) to the target category C, and R_c(x^(i)) denotes the degree of association between x^(i) and C.
2. The text classification feature selection approach according to claim 1, characterized in that step 1 comprises:
Step 11: for each feature x^(i) in feature set S, calculate the degree of association R_c(x^(i)) between feature x^(i) and the target category C according to the formula R_c(x^(i)) = I(x^(i); C), where I(x^(i); C) denotes the mutual information between feature x^(i) and target category C;
Step 12: sort the features in feature set S from largest to smallest R_c(x^(i)) to obtain the sorted feature set S;
where x^(i) denotes the i-th feature in feature set S and R_c(x^(i)) denotes the degree of association between feature x^(i) and target category C.
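Steps 11 and 12 amount to scoring each feature by its mutual information with the target category and sorting in descending order; a small Python sketch (the dictionary of precomputed mutual-information scores is an assumed input):

```python
def rank_by_relevance(features, mi_with_c):
    """Sort features in descending order of Rc(x_i) = I(x_i; C).

    mi_with_c maps each feature to its precomputed mutual information
    with the target category C.
    """
    return sorted(features, key=lambda f: mi_with_c[f], reverse=True)
```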
3. The text classification feature selection approach according to claim 2, characterized in that I(x^(i); C) is expressed as:

I(x^(i); C) = Σ_k p(x^(i), c_k) · log( p(x^(i) | c_k) / p(x^(i)) )

where c_k denotes the k-th class of target category C, p(x^(i), c_k) denotes the probability that feature x^(i) and class c_k occur together, p(x^(i) | c_k) denotes the probability that feature x^(i) appears in class c_k, and p(x^(i)) denotes the probability that feature x^(i) appears in feature set S.
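A direct Python transcription of this mutual-information sum, assuming the per-class probabilities have already been estimated elsewhere (e.g. from document counts); the argument layout is illustrative:

```python
import math

def mutual_information(p_joint, p_cond, p_x):
    """I(x; C) = sum_k p(x, c_k) * log(p(x | c_k) / p(x)).

    p_joint -- p(x, c_k) for each class c_k
    p_cond  -- p(x | c_k) for each class c_k
    p_x     -- marginal probability p(x)
    """
    return sum(pj * math.log(pc / p_x) for pj, pc in zip(p_joint, p_cond))
```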
4. The text classification feature selection approach according to claim 1, characterized in that IG(x^(i); x^(j); C) is expressed as:

IG(x^(i); x^(j); C) = I((x^(i), x^(j)); C) − I(x^(i); C) − I(x^(j); C)

where I(x^(i); C) denotes the mutual information between feature x^(i) and target category C; I(x^(j); C) denotes the mutual information between feature x^(j) and target category C; and I((x^(i), x^(j)); C) denotes the mutual information between the feature pair (x^(i), x^(j)) and target category C.
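The correlation gain and the redundancy/synergy split it induces (claims 1 and 4) fit in a few lines; the three mutual-information inputs are assumed precomputed:

```python
def interaction_gain(i_pair, i_xi, i_xj):
    """IG(x_i; x_j; C) = I((x_i, x_j); C) - I(x_i; C) - I(x_j; C).

    Returns (IG, Rx, Sx): a negative gain becomes redundancy
    Rx = min(0, IG), a positive gain becomes synergy Sx = max(0, IG).
    """
    ig = i_pair - i_xi - i_xj
    return ig, min(0.0, ig), max(0.0, ig)
```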
5. The text classification feature selection approach according to claim 4, characterized in that I((x^(i), x^(j)); C) is expressed as:

I((x^(i), x^(j)); C) = Σ_k p(x^(i), x^(j), c_k) · log( p((x^(i), x^(j)) | c_k) / p(x^(i), x^(j)) )

where c_k denotes the k-th class of target category C, p(x^(i), x^(j), c_k) denotes the probability that feature x^(i), feature x^(j) and class c_k occur simultaneously, p((x^(i), x^(j)) | c_k) denotes the probability that features x^(i) and x^(j) occur together in class c_k, and p(x^(i), x^(j)) denotes the probability that features x^(i) and x^(j) occur together in feature set S.
6. The text classification feature selection approach according to claim 1, characterized in that step 2 comprises:
Step 21: add the first feature in feature set S to the candidate set S_sel and set the exclusion set S_exc to the empty set, i.e. S_sel = {x^(1)}, S_exc = {}; the first feature is the one with the largest degree of association R_c(x^(i));
Step 22: starting from the second feature in feature set S, denoted x^(i), calculate the redundancy R_x and the synergy S_x between feature x^(i) and all features in the candidate set S_sel, and, combining the degree of association R_c(x^(i)) between the feature and the target category, calculate the sensitivity Sen(x^(i)) of feature x^(i);
Step 23: compare the sensitivity Sen(x^(i)) with the preset threshold th; if Sen(x^(i)) > th, add feature x^(i) to the candidate set S_sel; otherwise, add feature x^(i) to the exclusion set S_exc;
Step 24: if x^(i) is the last feature in feature set S, the division ends; otherwise, set x^(i) to the next feature in feature set S and return to step 22.
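Steps 21 through 24 describe a single forward pass over the relevance-sorted features; a compact Python sketch, under the assumption that the pairwise correlation gains ig[i][j] are precomputed and index 0 is the most relevant feature:

```python
def partition(n, r_c, ig, alpha, beta, th):
    """Split features 0..n-1 (sorted by descending relevance) into a
    candidate set S_sel and an exclusion set S_exc.

    r_c[i]   -- degree of association Rc(x_i) with the target category
    ig[i][j] -- correlation gain IG(x_i; x_j; C)
    """
    s_sel, s_exc = [0], []                            # step 21: seed with the top feature
    for i in range(1, n):                             # steps 22-24: walk the rest in order
        rx = min(min(0.0, ig[i][j]) for j in s_sel)   # worst redundancy vs. S_sel
        sx = max(max(0.0, ig[i][j]) for j in s_sel)   # best synergy vs. S_sel
        sen = r_c[i] + alpha * rx + beta * sx         # sensitivity Sen(x_i)
        (s_sel if sen > th else s_exc).append(i)      # step 23: threshold test
    return s_sel, s_exc
```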
7. The text classification feature selection approach according to claim 1, characterized in that step 3 comprises:
Step 31: set the pending set S_tbd to the empty set, i.e. S_tbd = {}; let x^(k) be the first feature in the exclusion set S_exc, and let x^(m) be the first feature in the candidate set S_sel;
Step 32: for feature x^(k) in the exclusion set S_exc, calculate the maximum synergy between feature x^(m) in the candidate set S_sel and all features in feature set S other than x^(m), i.e. max(S_x(x^(m); x^(i))), x^(i) ∈ S, i ≠ m;
Step 33: if the feature giving feature x^(m) its maximum synergy is x^(k), add x^(m) to the pending set S_tbd;
Step 34: if feature x^(m) is the last feature in the candidate set S_sel and the pending set S_tbd is empty, go to step 36; if the pending set S_tbd is not empty, let x^(j) be the first feature in the pending set S_tbd and go to step 35; if feature x^(m) is not the last feature in the candidate set S_sel, set x^(m) to the next feature in the candidate set S_sel and return to step 32;
Step 35: for feature x^(j) in the pending set S_tbd, update the sensitivity of feature x^(j) as follows:

Sen(x^(j)) = R_c(x^(j)) + α·min(R_x(x^(j); x^(n))) + β·max(S_x(x^(j); x^(n))), x^(n) ∈ S, n ≠ j, n ≠ k

Compare the sensitivity Sen(x^(j)) with the preset threshold th; if Sen(x^(j)) < th and …, remove feature x^(k) from the exclusion set S_exc, add it to the candidate set S_sel, and go to step 36; otherwise, if feature x^(j) is the last element in the pending set S_tbd, go directly to step 36; otherwise, set x^(j) to the next element in the pending set S_tbd and return to step 35;
Step 36: if feature x^(k) is the last element in the exclusion set S_exc, return the current candidate set S_sel and exclusion set S_exc as the result of the final feature selection; otherwise, set x^(k) to the next element in the exclusion set S_exc and return to step 31.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710181572.8A CN107016073B (en) | 2017-03-24 | 2017-03-24 | A kind of text classification feature selection approach |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107016073A CN107016073A (en) | 2017-08-04 |
CN107016073B true CN107016073B (en) | 2019-06-28 |
Family
ID=59445053
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710181572.8A Active CN107016073B (en) | 2017-03-24 | 2017-03-24 | A kind of text classification feature selection approach |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107016073B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109934251B (en) * | 2018-12-27 | 2021-08-06 | 国家计算机网络与信息安全管理中心广东分中心 | Method, system and storage medium for recognizing text in Chinese language |
CN111612385B (en) * | 2019-02-22 | 2024-04-16 | 北京京东振世信息技术有限公司 | Method and device for clustering articles to be distributed |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105184323A (en) * | 2015-09-15 | 2015-12-23 | 广州唯品会信息科技有限公司 | Feature selection method and system |
CN105260437A (en) * | 2015-09-30 | 2016-01-20 | 陈一飞 | Text classification feature selection method and application thereof to biomedical text classification |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8473451B1 (en) * | 2004-07-30 | 2013-06-25 | At&T Intellectual Property I, L.P. | Preserving privacy in natural language databases |
2017-03-24: CN201710181572.8A patent/CN107016073B/en, status Active
Non-Patent Citations (1)
Title |
---|
Research on Feature Selection in Chinese Text Classification; Zhou Qian et al.; Journal of Chinese Information Processing (中文信息学报); 2004-12-31; Vol. 18, No. 3; pp. 17-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20210042664A1 (en) | Model training and service recommendation | |
US11074442B2 (en) | Identification of table partitions in documents with neural networks using global document context | |
RU2679209C2 (en) | Processing of electronic documents for invoices recognition | |
US9058327B1 (en) | Enhancing training of predictive coding systems through user selected text | |
US11170249B2 (en) | Identification of fields in documents with neural networks using global document context | |
CN104834940A (en) | Medical image inspection disease classification method based on support vector machine (SVM) | |
CN105069141A (en) | Construction method and construction system for stock standard news library | |
CN105653701B (en) | Model generating method and device, word assign power method and device | |
CN110019790A (en) | Text identification, text monitoring, data object identification, data processing method | |
CN105893362A (en) | A method for acquiring knowledge point semantic vectors and a method and a system for determining correlative knowledge points | |
CN108090178A (en) | A kind of text data analysis method, device, server and storage medium | |
CN103778206A (en) | Method for providing network service resources | |
CN110827131A (en) | Tax payer credit evaluation method based on distributed automatic feature combination | |
CN107016073B (en) | A kind of text classification feature selection approach | |
CN110110143B (en) | Video classification method and device | |
CN107341152B (en) | Parameter input method and device | |
CN110210506A (en) | Characteristic processing method, apparatus and computer equipment based on big data | |
CN103218420B (en) | A kind of web page title extracting method and device | |
CN106202349A (en) | Web page classifying dictionary creation method and device | |
CN105095826B (en) | A kind of character recognition method and device | |
Wang et al. | Multi-level Class Token Transformer with Cross TokenMixer for Hyperspectral Images Classification | |
US20230134218A1 (en) | Continuous learning for document processing and analysis | |
US20230138491A1 (en) | Continuous learning for document processing and analysis | |
US20230177251A1 (en) | Method, device, and system for analyzing unstructured document | |
CN113641823B (en) | Text classification model training, text classification method, device, equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||