CN107016073B - Text classification feature selection method

Text classification feature selection method

Info

Publication number
CN107016073B
Authority
CN
China
Prior art date
Legal status
Active
Application number
CN201710181572.8A
Other languages
Chinese (zh)
Other versions
CN107016073A (en)
Inventor
张晓彤
余伟伟
刘喆
王璇
Current Assignee
University of Science and Technology Beijing USTB
Original Assignee
University of Science and Technology Beijing USTB
Priority date
Filing date
Publication date
Application filed by University of Science and Technology Beijing USTB
Priority to CN201710181572.8A
Publication of CN107016073A
Application granted
Publication of CN107016073B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a text classification feature selection method that can reduce feature dimensionality and classification complexity while improving classification accuracy. The method includes: obtaining a feature set S and a target category C, calculating the relevance R_c(x^(i)) between each feature x^(i) in the feature set S and the target category C, and sorting the feature set S in descending order of R_c(x^(i)); calculating the redundancy R_x and synergy S_x between every two features in the feature set S, calculating the sensitivity Sen of each feature in combination with the relevance R_c(x^(i)) between the feature and the target category, comparing it with a preset threshold th, and, combining the descending ordering of the feature set S, dividing the feature set S into a candidate set S_sel and an exclusion set S_exc according to the threshold th; calculating the sensitivity Sen of the features in the candidate set S_sel and the exclusion set S_exc, comparing it with the preset threshold th, and adjusting the candidate set S_sel and the exclusion set S_exc according to the threshold th. The present invention is applicable to the field of machine-learning text classification.

Description

Text classification feature selection method
Technical field
The present invention relates to the field of machine-learning text classification, and in particular to a text classification feature selection method.
Background technique
With the continuous expansion of the Internet, the information resources gathered on it keep growing. In order to manage and use these information resources effectively and conveniently, content-based information retrieval and data mining have long attracted attention. Text classification technology is an important foundation of information retrieval and text data mining; its main task is to assign texts and documents of unknown class, according to their content, to one or more previously given categories. However, two major characteristics of text classification, the large number of training samples and the high vector dimensionality, make it a machine-learning problem with very high time and space complexity. It is therefore necessary to perform feature selection, reducing the feature dimensionality while preserving classification performance as far as possible.
Feature selection is an important step of data preprocessing. Among common text classification feature selection methods, the chi-square test (Chi-Square) establishes a null hypothesis that a word is uncorrelated with the target category and selects as features the words that deviate most from this hypothesis. However, it only counts whether a word appears in a document, regardless of how many times it appears, which biases it toward low-frequency words. The mutual information (Mutual Information) method selects features by measuring how much information the presence of a word brings about the target category. However, it only considers the association between a word and the target category and ignores the dependencies that may exist between words. The TF-IDF (Term Frequency-Inverse Document Frequency) method assesses the importance of a word by jointly considering the frequency of the word in a file and the distribution of the word across all files, and screens features accordingly. However, it simply assumes that words with a small text frequency are more important and words with a large text frequency are more useless, so its precision is not very high. In addition, there are feature selection methods such as information gain, odds ratio, text weight evidence, and expected cross entropy; most of them consider only the correlation between words and the target category or the correlation between words, and they are prone to insufficient dimensionality reduction or low classification precision.
Summary of the invention
The technical problem to be solved by the present invention is to provide a text classification feature selection method, so as to solve the problems of high feature dimensionality and low classification precision in the prior art.
In order to solve the above technical problems, an embodiment of the present invention provides a text classification feature selection method, comprising:
Step 1: obtaining a feature set S and a target category C, calculating the relevance R_c(x^(i)) between each feature x^(i) in the feature set S and the target category C, and sorting the feature set S in descending order of R_c(x^(i));
Step 2: calculating the redundancy R_x and synergy S_x between every two features in the feature set S, calculating the sensitivity Sen of each feature in combination with the relevance R_c(x^(i)) between the feature and the target category, comparing it with a preset threshold th, and, combining the descending ordering of the feature set S, dividing the feature set S into a candidate set S_sel and an exclusion set S_exc according to the threshold th;
Step 3: calculating the sensitivity Sen of the features in the candidate set S_sel and the exclusion set S_exc, comparing it with the preset threshold th, and adjusting the candidate set S_sel and the exclusion set S_exc according to the threshold th.
Further, the step 1 includes:
Step 11: for each feature x^(i) in the feature set S, calculating the relevance R_c(x^(i)) between the feature x^(i) and the target category C according to the formula R_c(x^(i)) = I(x^(i); C), wherein I(x^(i); C) denotes the mutual information between the feature x^(i) and the target category C;
Step 12: sorting the features in the feature set S from large to small according to R_c(x^(i)), obtaining the sorted feature set S;
wherein x^(i) denotes the i-th feature in the feature set S, and R_c(x^(i)) denotes the relevance between the feature x^(i) and the target category C.
Further, the I(x^(i); C) is expressed as:
I(x^(i); C) = Σ_k p(x^(i), c_k) · log [ p(x^(i) | c_k) / p(x^(i)) ]
wherein c_k denotes the k-th class of the target category C, p(x^(i), c_k) denotes the probability that the feature x^(i) and the class c_k occur simultaneously, p(x^(i) | c_k) denotes the probability that the feature x^(i) appears within the class c_k, and p(x^(i)) denotes the probability that the feature x^(i) appears in the feature set S.
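For illustration only (not the patent's reference code), the relevance computation of step 1 can be sketched in Python as below, assuming a documents-by-terms count matrix X (NumPy array) and a label array y; the probability ratios follow one reading of the count approximations described later in the embodiment, and all names are illustrative.

```python
import numpy as np

def mutual_information(X, y, i):
    """Approximate I(x^(i); C) from word counts (illustrative helper)."""
    counts = X[:, i].astype(float)
    total = counts.sum()
    if total == 0:
        return 0.0
    mi = 0.0
    p_xi = total / X.sum()                    # p(x^(i)): the word's share of all occurrences
    for ck in np.unique(y):
        in_class = counts[y == ck].sum()
        if in_class == 0:
            continue
        p_joint = in_class / total            # p(x^(i), c_k): class-k share of the word's occurrences
        p_cond = in_class / X[y == ck].sum()  # p(x^(i) | c_k): word frequency within class k
        mi += p_joint * np.log(p_cond / p_xi)
    return mi

def relevance_ranking(X, y):
    """R_c for every feature, plus the feature order of step 1 (descending R_c)."""
    Rc = np.array([mutual_information(X, y, i) for i in range(X.shape[1])])
    order = np.argsort(-Rc)
    return Rc, order
```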
Further, the redundancy R_x is expressed as:
R_x(x^(i); x^(j)) = min(0, IG(x^(i); x^(j); C)), i ≠ j
wherein IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in the feature set S, R_x(x^(i); x^(j)) denotes the redundancy between the feature x^(i) and the feature x^(j), and the value of R_x(x^(i); x^(j)) is the smaller of 0 and the correlation gain.
Further, the synergy S_x is expressed as:
S_x(x^(i); x^(j)) = max(0, IG(x^(i); x^(j); C)), i ≠ j
wherein IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in the feature set S, S_x(x^(i); x^(j)) denotes the synergy between the feature x^(i) and the feature x^(j), and the value of S_x(x^(i); x^(j)) is the larger of 0 and the correlation gain.
Further, the IG(x^(i); x^(j); C) is expressed as:
IG(x^(i); x^(j); C) = I((x^(i), x^(j)); C) − I(x^(i); C) − I(x^(j); C)
wherein I(x^(i); C) denotes the mutual information between the feature x^(i) and the target category C; I(x^(j); C) denotes the mutual information between the feature x^(j) and the target category C; and I((x^(i), x^(j)); C) denotes the mutual information between the feature pair (x^(i), x^(j)) and the target category C.
Further, the I((x^(i), x^(j)); C) is expressed as:
I((x^(i), x^(j)); C) = Σ_k p(x^(i), x^(j), c_k) · log [ p((x^(i), x^(j)) | c_k) / p(x^(i), x^(j)) ]
wherein c_k denotes the k-th class of the target category C, p(x^(i), x^(j), c_k) denotes the probability that the feature x^(i), the feature x^(j) and the class c_k occur simultaneously, p((x^(i), x^(j)) | c_k) denotes the probability that the feature x^(i) and the feature x^(j) occur simultaneously within the class c_k, and p(x^(i), x^(j)) denotes the probability that the feature x^(i) and the feature x^(j) occur simultaneously in the feature set S.
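A minimal sketch of the correlation gain and the derived redundancy and synergy, under the same assumptions as the previous sketch (it reuses `mutual_information` from there); representing the feature pair by the element-wise minimum of the two count columns follows the approximation described later in the embodiment and is an interpretive choice.

```python
import numpy as np

def pair_mutual_information(X, y, i, j):
    """Approximate I((x^(i), x^(j)); C) using element-wise minimum counts."""
    counts = np.minimum(X[:, i], X[:, j]).astype(float)
    total = counts.sum()
    if total == 0:
        return 0.0
    mi = 0.0
    p_pair = total / X.sum()                  # p(x^(i), x^(j)): co-occurrence share overall
    for ck in np.unique(y):
        in_class = counts[y == ck].sum()
        if in_class == 0:
            continue
        p_joint = in_class / total            # p(x^(i), x^(j), c_k)
        p_cond = in_class / X[y == ck].sum()  # p((x^(i), x^(j)) | c_k)
        mi += p_joint * np.log(p_cond / p_pair)
    return mi

def interaction_gain(X, y, i, j):
    """IG(x^(i); x^(j); C) = I((x^(i), x^(j)); C) - I(x^(i); C) - I(x^(j); C)."""
    return (pair_mutual_information(X, y, i, j)
            - mutual_information(X, y, i)
            - mutual_information(X, y, j))

def redundancy_and_synergy(X, y, i, j):
    """R_x = min(0, IG) and S_x = max(0, IG)."""
    ig = interaction_gain(X, y, i, j)
    return min(0.0, ig), max(0.0, ig)
```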
Further, the step 2 includes:
Step 21: adding the first feature in the feature set S to the candidate set S_sel and setting the exclusion set S_exc to the empty set, i.e. S_sel = {x^(1)}, S_exc = {}, the first feature being the one with the largest relevance R_c(x^(i));
Step 22: starting from the second feature of the feature set S, denoted x^(i), calculating the redundancy R_x and synergy S_x between the feature x^(i) and all features in the candidate set S_sel, and calculating the sensitivity Sen(x^(i)) of the feature x^(i) in combination with the relevance R_c(x^(i)) between the feature and the target category;
Step 23: comparing the sensitivity Sen(x^(i)) with the preset threshold th; if Sen(x^(i)) > th, adding the feature x^(i) to the candidate set S_sel; otherwise adding the feature x^(i) to the exclusion set S_exc;
Step 24: if x^(i) is the last feature in the feature set S, ending the division; otherwise, setting x^(i) to the next feature in the feature set S and returning to step 22.
Further, the sensitivity Sen(x^(i)) is expressed as:
Sen(x^(i)) = R_c(x^(i)) + α·min(R_x(x^(i); x^(j))) + β·max(S_x(x^(i); x^(j))), j ≠ i
wherein α and β are the weights of the redundancy R_x and the synergy S_x respectively, min(R_x(x^(i); x^(j))) denotes the minimum redundancy between the feature x^(i) and the remaining features, max(S_x(x^(i); x^(j))) denotes the maximum synergy between the feature x^(i) and the remaining features, Sen(x^(i)) denotes the sensitivity of the feature x^(i) to the target category C, and R_c(x^(i)) denotes the relevance between the feature x^(i) and the target category C.
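A minimal sketch of this sensitivity and of the partition in steps 21-24, assuming R_c, R_x and S_x have been precomputed (for instance with the helpers sketched above) as a vector Rc and matrices Rx, Sx; per step 22, the minimum and maximum are taken over the current candidate set.

```python
def sensitivity(i, others, Rc, Rx, Sx, alpha=0.5, beta=0.5):
    """Sen(x^(i)) = R_c(x^(i)) + alpha*min R_x(x^(i); x^(j)) + beta*max S_x(x^(i); x^(j))."""
    js = [j for j in others if j != i]
    return Rc[i] + alpha * min(Rx[i, j] for j in js) + beta * max(Sx[i, j] for j in js)

def partition(order, Rc, Rx, Sx, th=0.01, alpha=0.5, beta=0.5):
    """Divide the features into a candidate set S_sel and an exclusion set S_exc."""
    S_sel = [order[0]]                 # step 21: the most relevant feature seeds S_sel
    S_exc = []
    for i in order[1:]:                # steps 22-24: compare Sen with the threshold
        if sensitivity(i, S_sel, Rc, Rx, Sx, alpha, beta) > th:
            S_sel.append(i)
        else:
            S_exc.append(i)
    return S_sel, S_exc
```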
Further, the step 3 includes:
Step 31: letting the pending set S_tbd be empty, i.e. S_tbd = {}; letting x^(k) be the first feature in the exclusion set S_exc and x^(m) be the first feature in the candidate set S_sel;
Step 32: for the feature x^(k) in the exclusion set S_exc, calculating the maximum synergy between the feature x^(m) in the candidate set S_sel and all features in the feature set S other than x^(m), i.e. max(S_x(x^(m); x^(i))), x^(i) ∈ S, i ≠ m;
Step 33: if the feature corresponding to the maximum synergy of the feature x^(m) is x^(k), adding x^(m) to the pending set S_tbd;
Step 34: if the feature x^(m) is the last feature in the candidate set S_sel and the pending set S_tbd is empty, going to step 36; if the pending set S_tbd is not empty, letting x^(j) be the first feature in the pending set S_tbd and going to step 35; if the feature x^(m) is not the last feature in the candidate set S_sel, setting the feature x^(m) to the next feature in the candidate set S_sel and returning to step 32;
Step 35: for the feature x^(j) in the pending set S_tbd, updating the sensitivity of the feature x^(j) as follows:
Sen(x^(j)) = R_c(x^(j)) + α·min(R_x(x^(j); x^(n))) + β·max(S_x(x^(j); x^(n))), x^(n) ∈ S, n ≠ j, n ≠ k
comparing the sensitivity Sen(x^(j)) of the feature x^(j) with the preset threshold th; if Sen(x^(j)) < th and …, removing the feature x^(k) from the exclusion set S_exc, adding it to the candidate set S_sel, and going to step 36; otherwise, if the feature x^(j) is the last element in the pending set S_tbd, going directly to step 36; otherwise, setting the feature x^(j) to the next element in the pending set S_tbd and returning to step 35;
Step 36: if the feature x^(k) is the last element in the exclusion set S_exc, returning the current candidate set S_sel and exclusion set S_exc as the result of the final feature selection; otherwise, setting the feature x^(k) to the next element in the exclusion set S_exc and returning to step 31.
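Steps 31-36 can be sketched as below, again over precomputed Rc, Rx, Sx. The second condition of step 35 is not legible in the source text, so this sketch implements only the Sen(x^(j)) < th test and is therefore an approximation of the described adjustment.

```python
def adjust(S_sel, S_exc, Rc, Rx, Sx, th=0.01, alpha=0.5, beta=0.5):
    """Move excluded features back into S_sel when a candidate depends on them."""
    n = len(Rc)
    for k in list(S_exc):                             # steps 31/36: walk the exclusion set
        S_tbd = []
        for m in S_sel:                               # steps 32-34: collect candidates whose
            others = [i for i in range(n) if i != m]  # strongest synergy partner is x^(k)
            best = max(others, key=lambda i: Sx[m, i])
            if best == k:
                S_tbd.append(m)
        for j in S_tbd:                               # step 35: re-score without x^(k)
            rest = [i for i in range(n) if i != j and i != k]
            sen = (Rc[j] + alpha * min(Rx[j, i] for i in rest)
                         + beta * max(Sx[j, i] for i in rest))
            if sen < th:                              # (plus the condition omitted in the source)
                S_exc.remove(k)
                S_sel.append(k)
                break
    return S_sel, S_exc
```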
The advantageous effects of the above technical solutions of the present invention are as follows:
In the above scheme, from the feature set S and the target category C, the relevance R_c(x^(i)) between each feature and the target category and the redundancy R_x and synergy S_x between features are calculated, so as to compute the sensitivity Sen of each feature; the features are screened according to the preset threshold th, the feature set is divided into a candidate set and an exclusion set, and the candidate set and the exclusion set are further adjusted and optimized in the subsequent process. In this way, the relations between features and the target category as well as among the features themselves are comprehensively considered, features are selected by relevance, redundancy and synergy, and the features that play a decisive role in classification are retained, which helps to reduce feature dimensionality and classification complexity and improves classification accuracy.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the text classification feature selection method provided by an embodiment of the present invention;
Fig. 2 is a detailed flowchart of the text classification feature selection method provided by an embodiment of the present invention;
Fig. 3 is a flowchart of dividing the candidate set and the exclusion set in the feature selection method provided by an embodiment of the present invention;
Fig. 4 is a flowchart of adjusting the candidate set and the exclusion set in the feature selection method provided by an embodiment of the present invention.
Specific embodiment
To make the technical problems to be solved by the present invention, the technical solutions and the advantages clearer, a detailed description is given below in conjunction with the accompanying drawings and specific embodiments.
Aiming at the problems of high feature dimensionality and low classification precision in the prior art, the present invention provides a text classification feature selection method.
As shown in Fig. 1, the text classification feature selection method provided by an embodiment of the present invention comprises:
Step 1: obtaining a feature set S and a target category C, calculating the relevance R_c(x^(i)) between each feature x^(i) in the feature set S and the target category C, and sorting the feature set S in descending order of R_c(x^(i));
Step 2: calculating the redundancy R_x and synergy S_x between every two features in the feature set S, calculating the sensitivity Sen of each feature in combination with the relevance R_c(x^(i)) between the feature and the target category, comparing it with a preset threshold th, and, combining the descending ordering of the feature set S, dividing the feature set S into a candidate set S_sel and an exclusion set S_exc according to the threshold th;
Step 3: calculating the sensitivity Sen of the features in the candidate set S_sel and the exclusion set S_exc, comparing it with the preset threshold th, and adjusting the candidate set S_sel and the exclusion set S_exc according to the threshold th.
In the text classification feature selection method described in the embodiment of the present invention, from the feature set S and the target category C, the relevance R_c(x^(i)) between each feature and the target category and the redundancy R_x and synergy S_x between features are calculated, so as to compute the sensitivity Sen of each feature; the features are screened according to the preset threshold th, the feature set is divided into a candidate set and an exclusion set, and the candidate set and the exclusion set are further adjusted and optimized in the subsequent process. In this way, the relations between features and the target category as well as among the features themselves are comprehensively considered, features are selected by relevance, redundancy and synergy, and the features that play a decisive role in classification are retained, which helps to reduce feature dimensionality and classification complexity and improves classification accuracy.
In this embodiment, as shown in Fig. 2, the feature set S = (x^(1), x^(2), ..., x^(n)) and the target category C need to be input first in order to obtain them.
In this embodiment, the feature set S denotes the set of all features used during text classification (a single feature is denoted x^(i), i.e. a word vector), i.e. S = (x^(1), x^(2), ..., x^(n)), where n denotes the number of features in the feature set S; the feature x^(i) denotes the column vector formed by the numbers of occurrences of the word corresponding to the feature in each text file; the target category C denotes the column vector formed by the class corresponding to each text file, and the target category C is the set of classes.
In this embodiment, the relevance R_c(x^(i)) between the feature x^(i) and the target category C is the mutual information between the feature x^(i) and the target category C.
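For illustration, a documents-by-terms count matrix of the kind described here can be built with scikit-learn's CountVectorizer; the texts and labels below are placeholders, not data from the patent.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = ["first sample document", "second sample document", "third sample"]
labels = np.array([0, 1, 0])                    # target category C: one class per file

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts).toarray()   # rows: files, columns: features x^(i)
words = vectorizer.get_feature_names_out()      # the word behind each feature column
```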
In this embodiment, as an alternative embodiment, the calculation of the relevance R_c(x^(i)) between each feature x^(i) in the feature set S and the target category C, and the descending sort of the feature set S according to R_c(x^(i)) (step 1), includes:
Step 11: for each feature x^(i) in the feature set S, calculating the relevance R_c(x^(i)) between the feature x^(i) and the target category C according to the formula R_c(x^(i)) = I(x^(i); C), wherein I(x^(i); C) denotes the mutual information between the feature x^(i) and the target category C;
Step 12: sorting the features in the feature set S from large to small according to R_c(x^(i)), obtaining the sorted feature set S;
wherein x^(i) denotes the i-th feature in the feature set S, and R_c(x^(i)) denotes the relevance between the feature x^(i) and the target category C.
In this embodiment, the mutual information is expressed as:
I(x^(i); C) = Σ_k p(x^(i), c_k) · log [ p(x^(i) | c_k) / p(x^(i)) ]
wherein I(x^(i); C) denotes the mutual information between the feature x^(i) and the target category C, c_k denotes the k-th class of the target category C, p(x^(i), c_k) denotes the probability that the feature x^(i) and the class c_k occur simultaneously, p(x^(i) | c_k) denotes the probability that the feature x^(i) appears within the class c_k, and p(x^(i)) denotes the probability that the feature x^(i) appears in the feature set S.
In this embodiment, preferably, the probability p(x^(i), c_k) that the feature x^(i) and the class c_k occur simultaneously is approximated by the frequency with which the word corresponding to the feature x^(i) in files of class c_k occurs among all files, that is:
p(x^(i), c_k) ≈ Σ_m x^(i)_(c_k, m) / Σ_j x^(i)_j
wherein x^(i)_j denotes the j-th element of x^(i) (i.e. the number of occurrences of the word corresponding to the feature x^(i) in the j-th file), and x^(i)_(c_k, m) denotes the m-th element of x^(i) whose corresponding target category is c_k (i.e. the number of occurrences of the word corresponding to the feature x^(i) in the m-th file of class c_k).
In this embodiment, preferably, the probability p(x^(i) | c_k) that the feature x^(i) appears within the class c_k is approximated by the frequency with which the word corresponding to the feature x^(i) occurs in the files of class c_k.
In this embodiment, preferably, the probability p(x^(i)) that the feature x^(i) appears in the feature set S is approximated by the frequency with which the word corresponding to the feature x^(i) occurs in all files.
In this embodiment, as yet another alternative embodiment, as shown in Fig. 3, the calculation of the redundancy R_x and synergy S_x between every two features in the feature set S, the calculation of the sensitivity Sen of each feature in combination with the relevance R_c(x^(i)) between the feature and the target category, its comparison with the preset threshold th, and the division of the feature set S into a candidate set S_sel and an exclusion set S_exc according to the threshold th (step 2), includes:
Step 21: adding the first feature in the feature set S to the candidate set S_sel and setting the exclusion set S_exc to the empty set, i.e. S_sel = {x^(1)}, S_exc = {}, the first feature being the one with the largest relevance R_c(x^(i));
Step 22: starting from the second feature of the feature set S, denoted x^(i), calculating the redundancy R_x and synergy S_x between the feature x^(i) and all features in the candidate set S_sel, and calculating the sensitivity Sen(x^(i)) of the feature x^(i) in combination with the relevance R_c(x^(i)) between the feature and the target category;
Step 23: comparing the sensitivity Sen(x^(i)) with the preset threshold th; if Sen(x^(i)) > th, adding the feature x^(i) to the candidate set S_sel; otherwise adding the feature x^(i) to the exclusion set S_exc;
Step 24: if x^(i) is the last feature in the feature set S, ending the division; otherwise, setting x^(i) to the next feature in the feature set S and returning to step 22.
In a specific embodiment of the aforementioned text classification feature selection method, further, the redundancy R_x is expressed as:
R_x(x^(i); x^(j)) = min(0, IG(x^(i); x^(j); C)), i ≠ j
wherein IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in the feature set S, R_x(x^(i); x^(j)) denotes the redundancy between the feature x^(i) and the feature x^(j), and the value of R_x(x^(i); x^(j)) is the smaller of 0 and the correlation gain.
In a specific embodiment of the aforementioned text classification feature selection method, further, the synergy S_x is expressed as:
S_x(x^(i); x^(j)) = max(0, IG(x^(i); x^(j); C)), i ≠ j
wherein IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in the feature set S, S_x(x^(i); x^(j)) denotes the synergy between the feature x^(i) and the feature x^(j), and the value of S_x(x^(i); x^(j)) is the larger of 0 and the correlation gain.
In a specific embodiment of the aforementioned text classification feature selection method, further, the IG(x^(i); x^(j); C) is expressed as:
IG(x^(i); x^(j); C) = I((x^(i), x^(j)); C) − I(x^(i); C) − I(x^(j); C)
wherein I(x^(i); C) and I(x^(j); C) are calculated by the same formula as the mutual information between the feature x^(i) and the target category C given above; I(x^(i); C) denotes the mutual information between the feature x^(i) and the target category C; I(x^(j); C) denotes the mutual information between the feature x^(j) and the target category C; and I((x^(i), x^(j)); C) denotes the mutual information between the feature pair (x^(i), x^(j)) and the target category C.
In a specific embodiment of the aforementioned text classification feature selection method, further, the I((x^(i), x^(j)); C) is expressed as:
I((x^(i), x^(j)); C) = Σ_k p(x^(i), x^(j), c_k) · log [ p((x^(i), x^(j)) | c_k) / p(x^(i), x^(j)) ]
wherein c_k denotes the k-th class of the target category C, p(x^(i), x^(j), c_k) denotes the probability that the feature x^(i), the feature x^(j) and the class c_k occur simultaneously, p((x^(i), x^(j)) | c_k) denotes the probability that the feature x^(i) and the feature x^(j) occur simultaneously within the class c_k, and p(x^(i), x^(j)) denotes the probability that the feature x^(i) and the feature x^(j) occur simultaneously in the feature set S.
In this embodiment, preferably, the probability p(x^(i), x^(j), c_k) that the feature x^(i), the feature x^(j) and the class c_k occur simultaneously is approximated by the frequency with which the words corresponding to the features x^(i) and x^(j) in files of class c_k occur simultaneously among all files, that is:
p(x^(i), x^(j), c_k) ≈ Σ_m min(x^(i)_(c_k, m), x^(j)_(c_k, m)) / Σ_l min(x^(i)_l, x^(j)_l)
wherein min(x^(i)_(c_k, m), x^(j)_(c_k, m)) denotes the smaller of the m-th elements of x^(i) and x^(j) whose corresponding target category is c_k (i.e. the smaller of the numbers of occurrences of the two corresponding words in the m-th file of class c_k).
In this embodiment, preferably, the probability p((x^(i), x^(j)) | c_k) that the features x^(i) and x^(j) occur simultaneously within the class c_k is approximated by the frequency with which the corresponding words occur simultaneously in the files of class c_k.
In this embodiment, preferably, the probability p(x^(i), x^(j)) that the features x^(i) and x^(j) occur simultaneously in the feature set S is approximated by the frequency with which the corresponding words occur simultaneously in all files.
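A small worked example of these pairwise count approximations, with illustrative numbers: the pair (x^(i), x^(j)) is represented by the element-wise minimum of the two count columns, and the probabilities are ratios of those counts (one plausible reading of the lost formulas).

```python
import numpy as np

xi = np.array([3, 0, 2, 1])      # occurrences of word i in four files
xj = np.array([1, 2, 2, 0])      # occurrences of word j in the same files
y = np.array([0, 0, 1, 1])       # class of each file

pair = np.minimum(xi, xj)        # co-occurrence counts: [1, 0, 2, 0]
total = pair.sum()               # 3 co-occurrences over all files
in_c1 = pair[y == 1].sum()       # 2 co-occurrences inside class 1
p_joint = in_c1 / total          # p(x^(i), x^(j), c_1) is approximately 2/3
```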
In a specific embodiment of the aforementioned text classification feature selection method, further, the sensitivity Sen(x^(i)) is expressed as:
Sen(x^(i)) = R_c(x^(i)) + α·min(R_x(x^(i); x^(j))) + β·max(S_x(x^(i); x^(j))), j ≠ i
wherein α and β are the weights of the redundancy R_x and the synergy S_x respectively, min(R_x(x^(i); x^(j))) denotes the minimum redundancy between the feature x^(i) and the remaining features, max(S_x(x^(i); x^(j))) denotes the maximum synergy between the feature x^(i) and the remaining features, Sen(x^(i)) denotes the sensitivity of the feature x^(i) to the target category C, and R_c(x^(i)) denotes the relevance between the feature x^(i) and the target category C.
In this embodiment, as shown in Fig. 4, as an alternative embodiment, the calculation of the sensitivity Sen of the features in the candidate set S_sel and the exclusion set S_exc, its comparison with the preset threshold th, and the adjustment of the candidate set S_sel and the exclusion set S_exc according to the threshold th (step 3), includes:
Step 31: letting the pending set S_tbd be empty, i.e. S_tbd = {}; letting x^(k) be the first feature in the exclusion set S_exc and x^(m) be the first feature in the candidate set S_sel;
Step 32: for the feature x^(k) in the exclusion set S_exc, calculating the maximum synergy between the feature x^(m) in the candidate set S_sel and all features in the feature set S other than x^(m), i.e. max(S_x(x^(m); x^(i))), x^(i) ∈ S, i ≠ m;
Step 33: if the feature corresponding to the maximum synergy of the feature x^(m) is x^(k), adding x^(m) to the pending set S_tbd;
Step 34: if the feature x^(m) is the last feature in the candidate set S_sel and the pending set S_tbd is empty, going to step 36; if the pending set S_tbd is not empty, letting x^(j) be the first feature in the pending set S_tbd and going to step 35; if the feature x^(m) is not the last feature in the candidate set S_sel, setting the feature x^(m) to the next feature in the candidate set S_sel and returning to step 32;
Step 35: for the feature x^(j) in the pending set S_tbd, updating the sensitivity of the feature x^(j) as follows:
Sen(x^(j)) = R_c(x^(j)) + α·min(R_x(x^(j); x^(n))) + β·max(S_x(x^(j); x^(n))), x^(n) ∈ S, n ≠ j, n ≠ k
comparing the sensitivity Sen(x^(j)) of the feature x^(j) with the preset threshold th; if Sen(x^(j)) < th and …, removing the feature x^(k) from the exclusion set S_exc, adding it to the candidate set S_sel, and going to step 36; otherwise, if the feature x^(j) is the last element in the pending set S_tbd, going directly to step 36; otherwise, setting the feature x^(j) to the next element in the pending set S_tbd and returning to step 35;
Step 36: if the feature x^(k) is the last element in the exclusion set S_exc, returning the current candidate set S_sel and exclusion set S_exc as the result of the final feature selection; otherwise, setting the feature x^(k) to the next element in the exclusion set S_exc and returning to step 31.
In this embodiment, according to steps 31-36, the sensitivity Sen of the features in the candidate set S_sel and the exclusion set S_exc is calculated and compared with the preset threshold th, and the candidate set S_sel and the exclusion set S_exc are adjusted according to the threshold th, yielding a new candidate set S_sel and exclusion set S_exc; this reduces the influence that removing or adding features has on the classification result.
In this embodiment, the default value of the weight α of the redundancy R_x may be 0.5; the default value of the weight β of the synergy S_x may be 0.5; and the preset threshold th defaults to 0.01. The weight α of the redundancy R_x, the weight β of the synergy S_x and the preset threshold th are optimized and updated by a genetic algorithm in the subsequent training and testing process.
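Putting the illustrative sketches above together with the defaults stated here (α = β = 0.5, th = 0.01) gives roughly the following pipeline; every function and variable name comes from those sketches, not from the patent itself.

```python
import numpy as np

Rc, order = relevance_ranking(X, labels)                  # step 1: relevance and ordering
n = X.shape[1]
Rx = np.zeros((n, n))
Sx = np.zeros((n, n))
for i in range(n):                                        # pairwise R_x and S_x
    for j in range(i + 1, n):
        r, s = redundancy_and_synergy(X, labels, i, j)
        Rx[i, j] = Rx[j, i] = r
        Sx[i, j] = Sx[j, i] = s

S_sel, S_exc = partition(order, Rc, Rx, Sx, th=0.01)      # step 2: divide into S_sel / S_exc
S_sel, S_exc = adjust(S_sel, S_exc, Rc, Rx, Sx, th=0.01)  # step 3: adjust the two sets
print([words[i] for i in S_sel])                          # surviving feature words
```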
The above are preferred embodiments of the present invention. It should be noted that, for those of ordinary skill in the art, several improvements and refinements may also be made without departing from the principles of the present invention, and these improvements and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims (7)

1. A text classification feature selection method, characterized by comprising:
Step 1: obtaining a feature set S and a target category C, calculating the relevance R_c(x^(i)) between each feature x^(i) in the feature set S and the target category C, and sorting the feature set S in descending order of R_c(x^(i));
Step 2: calculating the redundancy R_x and synergy S_x between every two features in the feature set S, calculating the sensitivity Sen of each feature in combination with the relevance R_c(x^(i)) between the feature and the target category, comparing it with a preset threshold th, and, combining the descending ordering of the feature set S, dividing the feature set S into a candidate set S_sel and an exclusion set S_exc according to the threshold th;
Step 3: calculating the sensitivity Sen of the features in the candidate set S_sel and the exclusion set S_exc, comparing it with the preset threshold th, and adjusting the candidate set S_sel and the exclusion set S_exc according to the threshold th;
wherein the redundancy R_x is expressed as:
R_x(x^(i); x^(j)) = min(0, IG(x^(i); x^(j); C)), i ≠ j
wherein IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in the feature set S, R_x(x^(i); x^(j)) denotes the redundancy between the feature x^(i) and the feature x^(j), and the value of R_x(x^(i); x^(j)) is the smaller of 0 and the correlation gain;
wherein the synergy S_x is expressed as:
S_x(x^(i); x^(j)) = max(0, IG(x^(i); x^(j); C)), i ≠ j
wherein IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in the feature set S, S_x(x^(i); x^(j)) denotes the synergy between the feature x^(i) and the feature x^(j), and the value of S_x(x^(i); x^(j)) is the larger of 0 and the correlation gain;
wherein the sensitivity Sen(x^(i)) is expressed as:
Sen(x^(i)) = R_c(x^(i)) + α·min(R_x(x^(i); x^(j))) + β·max(S_x(x^(i); x^(j))), j ≠ i
wherein α and β are the weights of the redundancy R_x and the synergy S_x respectively, min(R_x(x^(i); x^(j))) denotes the minimum redundancy between the feature x^(i) and the remaining features, max(S_x(x^(i); x^(j))) denotes the maximum synergy between the feature x^(i) and the remaining features, Sen(x^(i)) denotes the sensitivity of the feature x^(i) to the target category C, and R_c(x^(i)) denotes the relevance between the feature x^(i) and the target category C.
2. The text classification feature selection method according to claim 1, characterized in that the step 1 includes:
Step 11: for each feature x^(i) in the feature set S, calculating the relevance R_c(x^(i)) between the feature x^(i) and the target category C according to the formula R_c(x^(i)) = I(x^(i); C), wherein I(x^(i); C) denotes the mutual information between the feature x^(i) and the target category C;
Step 12: sorting the features in the feature set S from large to small according to R_c(x^(i)), obtaining the sorted feature set S;
wherein x^(i) denotes the i-th feature in the feature set S, and R_c(x^(i)) denotes the relevance between the feature x^(i) and the target category C.
3. The text classification feature selection method according to claim 2, characterized in that the I(x^(i); C) is expressed as:
I(x^(i); C) = Σ_k p(x^(i), c_k) · log [ p(x^(i) | c_k) / p(x^(i)) ]
wherein c_k denotes the k-th class of the target category C, p(x^(i), c_k) denotes the probability that the feature x^(i) and the class c_k occur simultaneously, p(x^(i) | c_k) denotes the probability that the feature x^(i) appears within the class c_k, and p(x^(i)) denotes the probability that the feature x^(i) appears in the feature set S.
4. The text classification feature selection method according to claim 1, characterized in that the IG(x^(i); x^(j); C) is expressed as:
IG(x^(i); x^(j); C) = I((x^(i), x^(j)); C) − I(x^(i); C) − I(x^(j); C)
wherein I(x^(i); C) denotes the mutual information between the feature x^(i) and the target category C; I(x^(j); C) denotes the mutual information between the feature x^(j) and the target category C; and I((x^(i), x^(j)); C) denotes the mutual information between the feature pair (x^(i), x^(j)) and the target category C.
5. The text classification feature selection method according to claim 4, characterized in that the I((x^(i), x^(j)); C) is expressed as:
I((x^(i), x^(j)); C) = Σ_k p(x^(i), x^(j), c_k) · log [ p((x^(i), x^(j)) | c_k) / p(x^(i), x^(j)) ]
wherein c_k denotes the k-th class of the target category C, p(x^(i), x^(j), c_k) denotes the probability that the feature x^(i), the feature x^(j) and the class c_k occur simultaneously, p((x^(i), x^(j)) | c_k) denotes the probability that the feature x^(i) and the feature x^(j) occur simultaneously within the class c_k, and p(x^(i), x^(j)) denotes the probability that the feature x^(i) and the feature x^(j) occur simultaneously in the feature set S.
6. The text classification feature selection method according to claim 1, characterized in that the step 2 includes:
Step 21: adding the first feature in the feature set S to the candidate set S_sel and setting the exclusion set S_exc to the empty set, i.e. S_sel = {x^(1)}, S_exc = {}, the first feature being the one with the largest relevance R_c(x^(i));
Step 22: starting from the second feature of the feature set S, denoted x^(i), calculating the redundancy R_x and synergy S_x between the feature x^(i) and all features in the candidate set S_sel, and calculating the sensitivity Sen(x^(i)) of the feature x^(i) in combination with the relevance R_c(x^(i)) between the feature and the target category;
Step 23: comparing the sensitivity Sen(x^(i)) with the preset threshold th; if Sen(x^(i)) > th, adding the feature x^(i) to the candidate set S_sel; otherwise adding the feature x^(i) to the exclusion set S_exc;
Step 24: if x^(i) is the last feature in the feature set S, ending the division; otherwise, setting x^(i) to the next feature in the feature set S and returning to step 22.
7. The text classification feature selection method according to claim 1, characterized in that the step 3 includes:
Step 31: letting the pending set S_tbd be empty, i.e. S_tbd = {}; letting x^(k) be the first feature in the exclusion set S_exc and x^(m) be the first feature in the candidate set S_sel;
Step 32: for the feature x^(k) in the exclusion set S_exc, calculating the maximum synergy between the feature x^(m) in the candidate set S_sel and all features in the feature set S other than x^(m), i.e. max(S_x(x^(m); x^(i))), x^(i) ∈ S, i ≠ m;
Step 33: if the feature corresponding to the maximum synergy of the feature x^(m) is x^(k), adding x^(m) to the pending set S_tbd;
Step 34: if the feature x^(m) is the last feature in the candidate set S_sel and the pending set S_tbd is empty, going to step 36; if the pending set S_tbd is not empty, letting x^(j) be the first feature in the pending set S_tbd and going to step 35; if the feature x^(m) is not the last feature in the candidate set S_sel, setting the feature x^(m) to the next feature in the candidate set S_sel and returning to step 32;
Step 35: for the feature x^(j) in the pending set S_tbd, updating the sensitivity of the feature x^(j) as follows:
Sen(x^(j)) = R_c(x^(j)) + α·min(R_x(x^(j); x^(n))) + β·max(S_x(x^(j); x^(n))), x^(n) ∈ S, n ≠ j, n ≠ k
comparing the sensitivity Sen(x^(j)) of the feature x^(j) with the preset threshold th; if Sen(x^(j)) < th and …, removing the feature x^(k) from the exclusion set S_exc, adding it to the candidate set S_sel, and going to step 36; otherwise, if the feature x^(j) is the last element in the pending set S_tbd, going directly to step 36; otherwise, setting the feature x^(j) to the next element in the pending set S_tbd and returning to step 35;
Step 36: if the feature x^(k) is the last element in the exclusion set S_exc, returning the current candidate set S_sel and exclusion set S_exc as the result of the final feature selection; otherwise, setting the feature x^(k) to the next element in the exclusion set S_exc and returning to step 31.
CN201710181572.8A 2017-03-24 2017-03-24 Text classification feature selection method Active CN107016073B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710181572.8A CN107016073B (en) Text classification feature selection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710181572.8A CN107016073B (en) Text classification feature selection method

Publications (2)

Publication Number Publication Date
CN107016073A CN107016073A (en) 2017-08-04
CN107016073B true CN107016073B (en) 2019-06-28

Family

ID=59445053

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710181572.8A Active CN107016073B (en) Text classification feature selection method

Country Status (1)

Country Link
CN (1) CN107016073B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109934251B (en) * 2018-12-27 2021-08-06 国家计算机网络与信息安全管理中心广东分中心 Method, system and storage medium for recognizing text in Chinese language
CN111612385B (en) * 2019-02-22 2024-04-16 北京京东振世信息技术有限公司 Method and device for clustering articles to be distributed

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105184323A (en) * 2015-09-15 2015-12-23 广州唯品会信息科技有限公司 Feature selection method and system
CN105260437A (en) * 2015-09-30 2016-01-20 陈一飞 Text classification feature selection method and application thereof to biomedical text classification

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8473451B1 (en) * 2004-07-30 2013-06-25 At&T Intellectual Property I, L.P. Preserving privacy in natural language databases

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105184323A (en) * 2015-09-15 2015-12-23 广州唯品会信息科技有限公司 Feature selection method and system
CN105260437A (en) * 2015-09-30 2016-01-20 陈一飞 Text classification feature selection method and application thereof to biomedical text classification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on feature selection in Chinese text classification (中文文本分类中的特征选择研究); Zhou Qian et al.; Journal of Chinese Information Processing (中文信息学报); 2004-12-31; Vol. 18, No. 3; pp. 17-23

Also Published As

Publication number Publication date
CN107016073A (en) 2017-08-04

Similar Documents

Publication Publication Date Title
US20210042664A1 (en) Model training and service recommendation
US11074442B2 (en) Identification of table partitions in documents with neural networks using global document context
RU2679209C2 (en) Processing of electronic documents for invoices recognition
US9058327B1 (en) Enhancing training of predictive coding systems through user selected text
US11170249B2 (en) Identification of fields in documents with neural networks using global document context
CN104834940A (en) Medical image inspection disease classification method based on support vector machine (SVM)
CN105069141A (en) Construction method and construction system for stock standard news library
CN105653701B (en) Model generating method and device, word assign power method and device
CN110019790A (en) Text identification, text monitoring, data object identification, data processing method
CN105893362A (en) A method for acquiring knowledge point semantic vectors and a method and a system for determining correlative knowledge points
CN108090178A (en) A kind of text data analysis method, device, server and storage medium
CN103778206A (en) Method for providing network service resources
CN110827131A (en) Tax payer credit evaluation method based on distributed automatic feature combination
CN107016073B (en) A kind of text classification feature selection approach
CN110110143B (en) Video classification method and device
CN107341152B (en) Parameter input method and device
CN110210506A (en) Characteristic processing method, apparatus and computer equipment based on big data
CN103218420B (en) A kind of web page title extracting method and device
CN106202349A (en) Web page classifying dictionary creation method and device
CN105095826B (en) A kind of character recognition method and device
Wang et al. Multi-level Class Token Transformer with Cross TokenMixer for Hyperspectral Images Classification
US20230134218A1 (en) Continuous learning for document processing and analysis
US20230138491A1 (en) Continuous learning for document processing and analysis
US20230177251A1 (en) Method, device, and system for analyzing unstructured document
CN113641823B (en) Text classification model training, text classification method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant