CN107016073B - Text classification feature selection method

Text classification feature selection method

Info

Publication number
CN107016073B
Authority
CN
China
Prior art date
Legal status
Active
Application number
CN201710181572.8A
Other languages
Chinese (zh)
Other versions
CN107016073A (en)
Inventor
张晓彤
余伟伟
刘喆
王璇
Current Assignee
University of Science and Technology Beijing USTB
Original Assignee
University of Science and Technology Beijing USTB
Priority date
Filing date
Publication date
Application filed by University of Science and Technology Beijing USTB
Priority to CN201710181572.8A
Publication of CN107016073A
Application granted
Publication of CN107016073B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a text classification feature selection method that can reduce feature dimensionality and classification complexity while improving classification accuracy. The method includes: obtaining a feature set S and a target category C, calculating the relevance R_c(x^(i)) between each feature x^(i) in the feature set S and the target category C, and sorting the feature set S in descending order of R_c(x^(i)); calculating the redundancy R_x and synergy S_x between every two features in the feature set S, calculating the sensitivity Sen of each feature in combination with the relevance R_c(x^(i)) between the feature and the target category, comparing it with a preset threshold th, and, combining the descending ordering of the feature set S, dividing the feature set S into a candidate set S_sel and an exclusion set S_exc according to the threshold th; calculating the sensitivity Sen of the features in the candidate set S_sel and the exclusion set S_exc, comparing it with the preset threshold th, and adjusting the candidate set S_sel and the exclusion set S_exc according to the threshold th. The present invention is applicable to the field of machine-learning text classification.

Description

Text classification feature selection method
Technical field
The present invention relates to the field of machine-learning text classification, and in particular to a text classification feature selection method.
Background technique
With the continuous expansion of the Internet, the information resources gathered on it keep growing. In order to manage and use these information resources effectively and conveniently, content-based information retrieval and data mining have long attracted attention. Text classification technology is an important foundation of information retrieval and text data mining; its main task is to assign texts and documents of unknown class, according to their content, to one or more previously given categories. However, two major characteristics of text classification, the large number of training samples and the high vector dimensionality, make it a machine-learning problem with very high time and space complexity. It is therefore necessary to perform feature selection, reducing the feature dimensionality while preserving classification performance as far as possible.
Feature selection is an important step of data preprocessing. Among common text classification feature selection methods, the chi-square test (Chi-Square) establishes a null hypothesis that a word is uncorrelated with the target category and selects as features the words that deviate most from this hypothesis. However, it only counts whether a word appears in a document, regardless of how many times it appears, which biases it toward low-frequency words. The mutual information (Mutual Information) method selects features by measuring how much information the presence of a word brings about the target category. However, it only considers the association between a word and the target category and ignores the dependencies that may exist between words. The TF-IDF (Term Frequency-Inverse Document Frequency) method assesses the importance of a word by jointly considering the frequency of the word in a file and the distribution of the word across all files, and screens features accordingly. However, it simply assumes that words with a small text frequency are more important and words with a large text frequency are more useless, so its precision is not very high. In addition, there are feature selection methods such as information gain, odds ratio, text weight evidence, and expected cross entropy; most of them consider only the correlation between words and the target category or the correlation between words, and they are prone to insufficient dimensionality reduction or low classification precision.
Summary of the invention
The technical problem to be solved by the present invention is to provide a text classification feature selection method, so as to solve the problems of high feature dimensionality and low classification precision in the prior art.
In order to solve the above technical problems, an embodiment of the present invention provides a text classification feature selection method, comprising:
Step 1: obtaining a feature set S and a target category C, calculating the relevance R_c(x^(i)) between each feature x^(i) in the feature set S and the target category C, and sorting the feature set S in descending order of R_c(x^(i));
Step 2: calculating the redundancy R_x and synergy S_x between every two features in the feature set S, calculating the sensitivity Sen of each feature in combination with the relevance R_c(x^(i)) between the feature and the target category, comparing it with a preset threshold th, and, combining the descending ordering of the feature set S, dividing the feature set S into a candidate set S_sel and an exclusion set S_exc according to the threshold th;
Step 3: calculating the sensitivity Sen of the features in the candidate set S_sel and the exclusion set S_exc, comparing it with the preset threshold th, and adjusting the candidate set S_sel and the exclusion set S_exc according to the threshold th.
Further, the step 1 includes:
Step 11: for each feature x^(i) in the feature set S, calculating the relevance R_c(x^(i)) between the feature x^(i) and the target category C according to the formula R_c(x^(i)) = I(x^(i); C), wherein I(x^(i); C) denotes the mutual information between the feature x^(i) and the target category C;
Step 12: sorting the features in the feature set S from large to small according to R_c(x^(i)), obtaining the sorted feature set S;
wherein x^(i) denotes the i-th feature in the feature set S, and R_c(x^(i)) denotes the relevance between the feature x^(i) and the target category C.
Further, the I(x^(i); C) is expressed as:
I(x^(i); C) = Σ_k p(x^(i), c_k) · log [ p(x^(i) | c_k) / p(x^(i)) ]
wherein c_k denotes the k-th class of the target category C, p(x^(i), c_k) denotes the probability that the feature x^(i) and the class c_k occur simultaneously, p(x^(i) | c_k) denotes the probability that the feature x^(i) appears within the class c_k, and p(x^(i)) denotes the probability that the feature x^(i) appears in the feature set S.
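For illustration only (not the patent's reference code), the relevance computation of step 1 can be sketched in Python as below, assuming a documents-by-terms count matrix X (NumPy array) and a label array y; the probability ratios follow one reading of the count approximations described later in the embodiment, and all names are illustrative.

```python
import numpy as np

def mutual_information(X, y, i):
    """Approximate I(x^(i); C) from word counts (illustrative helper)."""
    counts = X[:, i].astype(float)
    total = counts.sum()
    if total == 0:
        return 0.0
    mi = 0.0
    p_xi = total / X.sum()                    # p(x^(i)): the word's share of all occurrences
    for ck in np.unique(y):
        in_class = counts[y == ck].sum()
        if in_class == 0:
            continue
        p_joint = in_class / total            # p(x^(i), c_k): class-k share of the word's occurrences
        p_cond = in_class / X[y == ck].sum()  # p(x^(i) | c_k): word frequency within class k
        mi += p_joint * np.log(p_cond / p_xi)
    return mi

def relevance_ranking(X, y):
    """R_c for every feature, plus the feature order of step 1 (descending R_c)."""
    Rc = np.array([mutual_information(X, y, i) for i in range(X.shape[1])])
    order = np.argsort(-Rc)
    return Rc, order
```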
Further, the redundancy R_x is expressed as:
R_x(x^(i); x^(j)) = min(0, IG(x^(i); x^(j); C)), i ≠ j
wherein IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in the feature set S, R_x(x^(i); x^(j)) denotes the redundancy between the feature x^(i) and the feature x^(j), and the value of R_x(x^(i); x^(j)) is the smaller of 0 and the correlation gain.
Further, the synergy S_x is expressed as:
S_x(x^(i); x^(j)) = max(0, IG(x^(i); x^(j); C)), i ≠ j
wherein IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in the feature set S, S_x(x^(i); x^(j)) denotes the synergy between the feature x^(i) and the feature x^(j), and the value of S_x(x^(i); x^(j)) is the larger of 0 and the correlation gain.
Further, the IG(x^(i); x^(j); C) is expressed as:
IG(x^(i); x^(j); C) = I((x^(i), x^(j)); C) − I(x^(i); C) − I(x^(j); C)
wherein I(x^(i); C) denotes the mutual information between the feature x^(i) and the target category C; I(x^(j); C) denotes the mutual information between the feature x^(j) and the target category C; and I((x^(i), x^(j)); C) denotes the mutual information between the feature pair (x^(i), x^(j)) and the target category C.
Further, the I((x^(i), x^(j)); C) is expressed as:
I((x^(i), x^(j)); C) = Σ_k p(x^(i), x^(j), c_k) · log [ p((x^(i), x^(j)) | c_k) / p(x^(i), x^(j)) ]
wherein c_k denotes the k-th class of the target category C, p(x^(i), x^(j), c_k) denotes the probability that the feature x^(i), the feature x^(j) and the class c_k occur simultaneously, p((x^(i), x^(j)) | c_k) denotes the probability that the feature x^(i) and the feature x^(j) occur simultaneously within the class c_k, and p(x^(i), x^(j)) denotes the probability that the feature x^(i) and the feature x^(j) occur simultaneously in the feature set S.
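A minimal sketch of the correlation gain and the derived redundancy and synergy, under the same assumptions as the previous sketch (it reuses `mutual_information` from there); representing the feature pair by the element-wise minimum of the two count columns follows the approximation described later in the embodiment and is an interpretive choice.

```python
import numpy as np

def pair_mutual_information(X, y, i, j):
    """Approximate I((x^(i), x^(j)); C) using element-wise minimum counts."""
    counts = np.minimum(X[:, i], X[:, j]).astype(float)
    total = counts.sum()
    if total == 0:
        return 0.0
    mi = 0.0
    p_pair = total / X.sum()                  # p(x^(i), x^(j)): co-occurrence share overall
    for ck in np.unique(y):
        in_class = counts[y == ck].sum()
        if in_class == 0:
            continue
        p_joint = in_class / total            # p(x^(i), x^(j), c_k)
        p_cond = in_class / X[y == ck].sum()  # p((x^(i), x^(j)) | c_k)
        mi += p_joint * np.log(p_cond / p_pair)
    return mi

def interaction_gain(X, y, i, j):
    """IG(x^(i); x^(j); C) = I((x^(i), x^(j)); C) - I(x^(i); C) - I(x^(j); C)."""
    return (pair_mutual_information(X, y, i, j)
            - mutual_information(X, y, i)
            - mutual_information(X, y, j))

def redundancy_and_synergy(X, y, i, j):
    """R_x = min(0, IG) and S_x = max(0, IG)."""
    ig = interaction_gain(X, y, i, j)
    return min(0.0, ig), max(0.0, ig)
```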
Further, the step 2 includes:
Step 21: adding the first feature in the feature set S to the candidate set S_sel and setting the exclusion set S_exc to the empty set, i.e. S_sel = {x^(1)}, S_exc = {}, the first feature being the one with the largest relevance R_c(x^(i));
Step 22: starting from the second feature of the feature set S, denoted x^(i), calculating the redundancy R_x and synergy S_x between the feature x^(i) and all features in the candidate set S_sel, and calculating the sensitivity Sen(x^(i)) of the feature x^(i) in combination with the relevance R_c(x^(i)) between the feature and the target category;
Step 23: comparing the sensitivity Sen(x^(i)) with the preset threshold th; if Sen(x^(i)) > th, adding the feature x^(i) to the candidate set S_sel; otherwise adding the feature x^(i) to the exclusion set S_exc;
Step 24: if x^(i) is the last feature in the feature set S, ending the division; otherwise, setting x^(i) to the next feature in the feature set S and returning to step 22.
Further, the sensitivity Sen(x^(i)) is expressed as:
Sen(x^(i)) = R_c(x^(i)) + α·min(R_x(x^(i); x^(j))) + β·max(S_x(x^(i); x^(j))), j ≠ i
wherein α and β are the weights of the redundancy R_x and the synergy S_x respectively, min(R_x(x^(i); x^(j))) denotes the minimum redundancy between the feature x^(i) and the remaining features, max(S_x(x^(i); x^(j))) denotes the maximum synergy between the feature x^(i) and the remaining features, Sen(x^(i)) denotes the sensitivity of the feature x^(i) to the target category C, and R_c(x^(i)) denotes the relevance between the feature x^(i) and the target category C.
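A minimal sketch of this sensitivity and of the partition in steps 21-24, assuming R_c, R_x and S_x have been precomputed (for instance with the helpers sketched above) as a vector Rc and matrices Rx, Sx; per step 22, the minimum and maximum are taken over the current candidate set.

```python
def sensitivity(i, others, Rc, Rx, Sx, alpha=0.5, beta=0.5):
    """Sen(x^(i)) = R_c(x^(i)) + alpha*min R_x(x^(i); x^(j)) + beta*max S_x(x^(i); x^(j))."""
    js = [j for j in others if j != i]
    return Rc[i] + alpha * min(Rx[i, j] for j in js) + beta * max(Sx[i, j] for j in js)

def partition(order, Rc, Rx, Sx, th=0.01, alpha=0.5, beta=0.5):
    """Divide the features into a candidate set S_sel and an exclusion set S_exc."""
    S_sel = [order[0]]                 # step 21: the most relevant feature seeds S_sel
    S_exc = []
    for i in order[1:]:                # steps 22-24: compare Sen with the threshold
        if sensitivity(i, S_sel, Rc, Rx, Sx, alpha, beta) > th:
            S_sel.append(i)
        else:
            S_exc.append(i)
    return S_sel, S_exc
```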
Further, the step 3 includes:
Step 31: letting the pending set S_tbd be empty, i.e. S_tbd = {}; letting x^(k) be the first feature in the exclusion set S_exc and x^(m) be the first feature in the candidate set S_sel;
Step 32: for the feature x^(k) in the exclusion set S_exc, calculating the maximum synergy between the feature x^(m) in the candidate set S_sel and all features in the feature set S other than x^(m), i.e. max(S_x(x^(m); x^(i))), x^(i) ∈ S, i ≠ m;
Step 33: if the feature corresponding to the maximum synergy of the feature x^(m) is x^(k), adding x^(m) to the pending set S_tbd;
Step 34: if the feature x^(m) is the last feature in the candidate set S_sel and the pending set S_tbd is empty, going to step 36; if the pending set S_tbd is not empty, letting x^(j) be the first feature in the pending set S_tbd and going to step 35; if the feature x^(m) is not the last feature in the candidate set S_sel, setting the feature x^(m) to the next feature in the candidate set S_sel and returning to step 32;
Step 35: for the feature x^(j) in the pending set S_tbd, updating the sensitivity of the feature x^(j) as follows:
Sen(x^(j)) = R_c(x^(j)) + α·min(R_x(x^(j); x^(n))) + β·max(S_x(x^(j); x^(n))), x^(n) ∈ S, n ≠ j, n ≠ k
comparing the sensitivity Sen(x^(j)) of the feature x^(j) with the preset threshold th; if Sen(x^(j)) < th and …, removing the feature x^(k) from the exclusion set S_exc, adding it to the candidate set S_sel, and going to step 36; otherwise, if the feature x^(j) is the last element in the pending set S_tbd, going directly to step 36; otherwise, setting the feature x^(j) to the next element in the pending set S_tbd and returning to step 35;
Step 36: if the feature x^(k) is the last element in the exclusion set S_exc, returning the current candidate set S_sel and exclusion set S_exc as the result of the final feature selection; otherwise, setting the feature x^(k) to the next element in the exclusion set S_exc and returning to step 31.
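Steps 31-36 can be sketched as below, again over precomputed Rc, Rx, Sx. The second condition of step 35 is not legible in the source text, so this sketch implements only the Sen(x^(j)) < th test and is therefore an approximation of the described adjustment.

```python
def adjust(S_sel, S_exc, Rc, Rx, Sx, th=0.01, alpha=0.5, beta=0.5):
    """Move excluded features back into S_sel when a candidate depends on them."""
    n = len(Rc)
    for k in list(S_exc):                             # steps 31/36: walk the exclusion set
        S_tbd = []
        for m in S_sel:                               # steps 32-34: collect candidates whose
            others = [i for i in range(n) if i != m]  # strongest synergy partner is x^(k)
            best = max(others, key=lambda i: Sx[m, i])
            if best == k:
                S_tbd.append(m)
        for j in S_tbd:                               # step 35: re-score without x^(k)
            rest = [i for i in range(n) if i != j and i != k]
            sen = (Rc[j] + alpha * min(Rx[j, i] for i in rest)
                         + beta * max(Sx[j, i] for i in rest))
            if sen < th:                              # (plus the condition omitted in the source)
                S_exc.remove(k)
                S_sel.append(k)
                break
    return S_sel, S_exc
```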
The advantageous effects of the above technical solutions of the present invention are as follows:
In the above scheme, from the feature set S and the target category C, the relevance R_c(x^(i)) between each feature and the target category and the redundancy R_x and synergy S_x between features are calculated, so as to compute the sensitivity Sen of each feature; the features are screened according to the preset threshold th, the feature set is divided into a candidate set and an exclusion set, and the candidate set and the exclusion set are further adjusted and optimized in the subsequent process. In this way, the relations between features and the target category as well as among the features themselves are comprehensively considered, features are selected by relevance, redundancy and synergy, and the features that play a decisive role in classification are retained, which helps to reduce feature dimensionality and classification complexity and improves classification accuracy.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the text classification feature selection method provided by an embodiment of the present invention;
Fig. 2 is a detailed flowchart of the text classification feature selection method provided by an embodiment of the present invention;
Fig. 3 is a flowchart of dividing the candidate set and the exclusion set in the feature selection method provided by an embodiment of the present invention;
Fig. 4 is a flowchart of adjusting the candidate set and the exclusion set in the feature selection method provided by an embodiment of the present invention.
Specific embodiment
To make the technical problems to be solved by the present invention, the technical solutions and the advantages clearer, a detailed description is given below in conjunction with the accompanying drawings and specific embodiments.
Aiming at the problems of high feature dimensionality and low classification precision in the prior art, the present invention provides a text classification feature selection method.
As shown in Fig. 1, the text classification feature selection method provided by an embodiment of the present invention comprises:
Step 1: obtaining a feature set S and a target category C, calculating the relevance R_c(x^(i)) between each feature x^(i) in the feature set S and the target category C, and sorting the feature set S in descending order of R_c(x^(i));
Step 2: calculating the redundancy R_x and synergy S_x between every two features in the feature set S, calculating the sensitivity Sen of each feature in combination with the relevance R_c(x^(i)) between the feature and the target category, comparing it with a preset threshold th, and, combining the descending ordering of the feature set S, dividing the feature set S into a candidate set S_sel and an exclusion set S_exc according to the threshold th;
Step 3: calculating the sensitivity Sen of the features in the candidate set S_sel and the exclusion set S_exc, comparing it with the preset threshold th, and adjusting the candidate set S_sel and the exclusion set S_exc according to the threshold th.
In the text classification feature selection method described in the embodiment of the present invention, from the feature set S and the target category C, the relevance R_c(x^(i)) between each feature and the target category and the redundancy R_x and synergy S_x between features are calculated, so as to compute the sensitivity Sen of each feature; the features are screened according to the preset threshold th, the feature set is divided into a candidate set and an exclusion set, and the candidate set and the exclusion set are further adjusted and optimized in the subsequent process. In this way, the relations between features and the target category as well as among the features themselves are comprehensively considered, features are selected by relevance, redundancy and synergy, and the features that play a decisive role in classification are retained, which helps to reduce feature dimensionality and classification complexity and improves classification accuracy.
In this embodiment, as shown in Fig. 2, the feature set S = (x^(1), x^(2), ..., x^(n)) and the target category C need to be input first in order to obtain them.
In this embodiment, the feature set S denotes the set of all features used during text classification (a single feature is denoted x^(i), i.e. a word vector), i.e. S = (x^(1), x^(2), ..., x^(n)), where n denotes the number of features in the feature set S; the feature x^(i) denotes the column vector formed by the numbers of occurrences of the word corresponding to the feature in each text file; the target category C denotes the column vector formed by the class corresponding to each text file, and the target category C is the set of classes.
In this embodiment, the relevance R_c(x^(i)) between the feature x^(i) and the target category C is the mutual information between the feature x^(i) and the target category C.
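For illustration, a documents-by-terms count matrix of the kind described here can be built with scikit-learn's CountVectorizer; the texts and labels below are placeholders, not data from the patent.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = ["first sample document", "second sample document", "third sample"]
labels = np.array([0, 1, 0])                    # target category C: one class per file

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts).toarray()   # rows: files, columns: features x^(i)
words = vectorizer.get_feature_names_out()      # the word behind each feature column
```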
In this embodiment, as an alternative embodiment, the calculation of the relevance R_c(x^(i)) between each feature x^(i) in the feature set S and the target category C, and the descending sort of the feature set S according to R_c(x^(i)) (step 1), includes:
Step 11: for each feature x^(i) in the feature set S, calculating the relevance R_c(x^(i)) between the feature x^(i) and the target category C according to the formula R_c(x^(i)) = I(x^(i); C), wherein I(x^(i); C) denotes the mutual information between the feature x^(i) and the target category C;
Step 12: sorting the features in the feature set S from large to small according to R_c(x^(i)), obtaining the sorted feature set S;
wherein x^(i) denotes the i-th feature in the feature set S, and R_c(x^(i)) denotes the relevance between the feature x^(i) and the target category C.
In this embodiment, the mutual information is expressed as:
I(x^(i); C) = Σ_k p(x^(i), c_k) · log [ p(x^(i) | c_k) / p(x^(i)) ]
wherein I(x^(i); C) denotes the mutual information between the feature x^(i) and the target category C, c_k denotes the k-th class of the target category C, p(x^(i), c_k) denotes the probability that the feature x^(i) and the class c_k occur simultaneously, p(x^(i) | c_k) denotes the probability that the feature x^(i) appears within the class c_k, and p(x^(i)) denotes the probability that the feature x^(i) appears in the feature set S.
In this embodiment, preferably, the probability p(x^(i), c_k) that the feature x^(i) and the class c_k occur simultaneously is approximated by the frequency with which the word corresponding to the feature x^(i) in files of class c_k occurs among all files, that is:
p(x^(i), c_k) ≈ Σ_m x^(i)_(c_k, m) / Σ_j x^(i)_j
wherein x^(i)_j denotes the j-th element of x^(i) (i.e. the number of occurrences of the word corresponding to the feature x^(i) in the j-th file), and x^(i)_(c_k, m) denotes the m-th element of x^(i) whose corresponding target category is c_k (i.e. the number of occurrences of the word corresponding to the feature x^(i) in the m-th file of class c_k).
In this embodiment, preferably, the probability p(x^(i) | c_k) that the feature x^(i) appears within the class c_k is approximated by the frequency with which the word corresponding to the feature x^(i) occurs in the files of class c_k.
In this embodiment, preferably, the probability p(x^(i)) that the feature x^(i) appears in the feature set S is approximated by the frequency with which the word corresponding to the feature x^(i) occurs in all files.
In this embodiment, as yet another alternative embodiment, as shown in Fig. 3, the calculation of the redundancy R_x and synergy S_x between every two features in the feature set S, the calculation of the sensitivity Sen of each feature in combination with the relevance R_c(x^(i)) between the feature and the target category, its comparison with the preset threshold th, and the division of the feature set S into a candidate set S_sel and an exclusion set S_exc according to the threshold th (step 2), includes:
Step 21: adding the first feature in the feature set S to the candidate set S_sel and setting the exclusion set S_exc to the empty set, i.e. S_sel = {x^(1)}, S_exc = {}, the first feature being the one with the largest relevance R_c(x^(i));
Step 22: starting from the second feature of the feature set S, denoted x^(i), calculating the redundancy R_x and synergy S_x between the feature x^(i) and all features in the candidate set S_sel, and calculating the sensitivity Sen(x^(i)) of the feature x^(i) in combination with the relevance R_c(x^(i)) between the feature and the target category;
Step 23: comparing the sensitivity Sen(x^(i)) with the preset threshold th; if Sen(x^(i)) > th, adding the feature x^(i) to the candidate set S_sel; otherwise adding the feature x^(i) to the exclusion set S_exc;
Step 24: if x^(i) is the last feature in the feature set S, ending the division; otherwise, setting x^(i) to the next feature in the feature set S and returning to step 22.
In a specific embodiment of the aforementioned text classification feature selection method, further, the redundancy R_x is expressed as:
R_x(x^(i); x^(j)) = min(0, IG(x^(i); x^(j); C)), i ≠ j
wherein IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in the feature set S, R_x(x^(i); x^(j)) denotes the redundancy between the feature x^(i) and the feature x^(j), and the value of R_x(x^(i); x^(j)) is the smaller of 0 and the correlation gain.
In a specific embodiment of the aforementioned text classification feature selection method, further, the synergy S_x is expressed as:
S_x(x^(i); x^(j)) = max(0, IG(x^(i); x^(j); C)), i ≠ j
wherein IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in the feature set S, S_x(x^(i); x^(j)) denotes the synergy between the feature x^(i) and the feature x^(j), and the value of S_x(x^(i); x^(j)) is the larger of 0 and the correlation gain.
In a specific embodiment of the aforementioned text classification feature selection method, further, the IG(x^(i); x^(j); C) is expressed as:
IG(x^(i); x^(j); C) = I((x^(i), x^(j)); C) − I(x^(i); C) − I(x^(j); C)
wherein I(x^(i); C) and I(x^(j); C) are calculated by the same formula as the mutual information between the feature x^(i) and the target category C given above; I(x^(i); C) denotes the mutual information between the feature x^(i) and the target category C; I(x^(j); C) denotes the mutual information between the feature x^(j) and the target category C; and I((x^(i), x^(j)); C) denotes the mutual information between the feature pair (x^(i), x^(j)) and the target category C.
In a specific embodiment of the aforementioned text classification feature selection method, further, the I((x^(i), x^(j)); C) is expressed as:
I((x^(i), x^(j)); C) = Σ_k p(x^(i), x^(j), c_k) · log [ p((x^(i), x^(j)) | c_k) / p(x^(i), x^(j)) ]
wherein c_k denotes the k-th class of the target category C, p(x^(i), x^(j), c_k) denotes the probability that the feature x^(i), the feature x^(j) and the class c_k occur simultaneously, p((x^(i), x^(j)) | c_k) denotes the probability that the feature x^(i) and the feature x^(j) occur simultaneously within the class c_k, and p(x^(i), x^(j)) denotes the probability that the feature x^(i) and the feature x^(j) occur simultaneously in the feature set S.
In this embodiment, preferably, the probability p(x^(i), x^(j), c_k) that the feature x^(i), the feature x^(j) and the class c_k occur simultaneously is approximated by the frequency with which the words corresponding to the features x^(i) and x^(j) in files of class c_k occur simultaneously among all files, that is:
p(x^(i), x^(j), c_k) ≈ Σ_m min(x^(i)_(c_k, m), x^(j)_(c_k, m)) / Σ_l min(x^(i)_l, x^(j)_l)
wherein min(x^(i)_(c_k, m), x^(j)_(c_k, m)) denotes the smaller of the m-th elements of x^(i) and x^(j) whose corresponding target category is c_k (i.e. the smaller of the numbers of occurrences of the two corresponding words in the m-th file of class c_k).
In this embodiment, preferably, the probability p((x^(i), x^(j)) | c_k) that the features x^(i) and x^(j) occur simultaneously within the class c_k is approximated by the frequency with which the corresponding words occur simultaneously in the files of class c_k.
In this embodiment, preferably, the probability p(x^(i), x^(j)) that the features x^(i) and x^(j) occur simultaneously in the feature set S is approximated by the frequency with which the corresponding words occur simultaneously in all files.
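A small worked example of these pairwise count approximations, with illustrative numbers: the pair (x^(i), x^(j)) is represented by the element-wise minimum of the two count columns, and the probabilities are ratios of those counts (one plausible reading of the lost formulas).

```python
import numpy as np

xi = np.array([3, 0, 2, 1])      # occurrences of word i in four files
xj = np.array([1, 2, 2, 0])      # occurrences of word j in the same files
y = np.array([0, 0, 1, 1])       # class of each file

pair = np.minimum(xi, xj)        # co-occurrence counts: [1, 0, 2, 0]
total = pair.sum()               # 3 co-occurrences over all files
in_c1 = pair[y == 1].sum()       # 2 co-occurrences inside class 1
p_joint = in_c1 / total          # p(x^(i), x^(j), c_1) is approximately 2/3
```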
In a specific embodiment of the aforementioned text classification feature selection method, further, the sensitivity Sen(x^(i)) is expressed as:
Sen(x^(i)) = R_c(x^(i)) + α·min(R_x(x^(i); x^(j))) + β·max(S_x(x^(i); x^(j))), j ≠ i
wherein α and β are the weights of the redundancy R_x and the synergy S_x respectively, min(R_x(x^(i); x^(j))) denotes the minimum redundancy between the feature x^(i) and the remaining features, max(S_x(x^(i); x^(j))) denotes the maximum synergy between the feature x^(i) and the remaining features, Sen(x^(i)) denotes the sensitivity of the feature x^(i) to the target category C, and R_c(x^(i)) denotes the relevance between the feature x^(i) and the target category C.
In this embodiment, as shown in Fig. 4, as an alternative embodiment, the calculation of the sensitivity Sen of the features in the candidate set S_sel and the exclusion set S_exc, its comparison with the preset threshold th, and the adjustment of the candidate set S_sel and the exclusion set S_exc according to the threshold th (step 3), includes:
Step 31: letting the pending set S_tbd be empty, i.e. S_tbd = {}; letting x^(k) be the first feature in the exclusion set S_exc and x^(m) be the first feature in the candidate set S_sel;
Step 32: for the feature x^(k) in the exclusion set S_exc, calculating the maximum synergy between the feature x^(m) in the candidate set S_sel and all features in the feature set S other than x^(m), i.e. max(S_x(x^(m); x^(i))), x^(i) ∈ S, i ≠ m;
Step 33: if the feature corresponding to the maximum synergy of the feature x^(m) is x^(k), adding x^(m) to the pending set S_tbd;
Step 34: if the feature x^(m) is the last feature in the candidate set S_sel and the pending set S_tbd is empty, going to step 36; if the pending set S_tbd is not empty, letting x^(j) be the first feature in the pending set S_tbd and going to step 35; if the feature x^(m) is not the last feature in the candidate set S_sel, setting the feature x^(m) to the next feature in the candidate set S_sel and returning to step 32;
Step 35: for the feature x^(j) in the pending set S_tbd, updating the sensitivity of the feature x^(j) as follows:
Sen(x^(j)) = R_c(x^(j)) + α·min(R_x(x^(j); x^(n))) + β·max(S_x(x^(j); x^(n))), x^(n) ∈ S, n ≠ j, n ≠ k
comparing the sensitivity Sen(x^(j)) of the feature x^(j) with the preset threshold th; if Sen(x^(j)) < th and …, removing the feature x^(k) from the exclusion set S_exc, adding it to the candidate set S_sel, and going to step 36; otherwise, if the feature x^(j) is the last element in the pending set S_tbd, going directly to step 36; otherwise, setting the feature x^(j) to the next element in the pending set S_tbd and returning to step 35;
Step 36: if the feature x^(k) is the last element in the exclusion set S_exc, returning the current candidate set S_sel and exclusion set S_exc as the result of the final feature selection; otherwise, setting the feature x^(k) to the next element in the exclusion set S_exc and returning to step 31.
In this embodiment, according to steps 31-36, the sensitivity Sen of the features in the candidate set S_sel and the exclusion set S_exc is calculated and compared with the preset threshold th, and the candidate set S_sel and the exclusion set S_exc are adjusted according to the threshold th, yielding a new candidate set S_sel and exclusion set S_exc; this reduces the influence that removing or adding features has on the classification result.
In this embodiment, the default value of the weight α of the redundancy R_x may be 0.5; the default value of the weight β of the synergy S_x may be 0.5; and the preset threshold th defaults to 0.01. The weight α of the redundancy R_x, the weight β of the synergy S_x and the preset threshold th are optimized and updated by a genetic algorithm in the subsequent training and testing process.
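Putting the illustrative sketches above together with the defaults stated here (α = β = 0.5, th = 0.01) gives roughly the following pipeline; every function and variable name comes from those sketches, not from the patent itself.

```python
import numpy as np

Rc, order = relevance_ranking(X, labels)                  # step 1: relevance and ordering
n = X.shape[1]
Rx = np.zeros((n, n))
Sx = np.zeros((n, n))
for i in range(n):                                        # pairwise R_x and S_x
    for j in range(i + 1, n):
        r, s = redundancy_and_synergy(X, labels, i, j)
        Rx[i, j] = Rx[j, i] = r
        Sx[i, j] = Sx[j, i] = s

S_sel, S_exc = partition(order, Rc, Rx, Sx, th=0.01)      # step 2: divide into S_sel / S_exc
S_sel, S_exc = adjust(S_sel, S_exc, Rc, Rx, Sx, th=0.01)  # step 3: adjust the two sets
print([words[i] for i in S_sel])                          # surviving feature words
```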
The above are preferred embodiments of the present invention. It should be noted that, for those of ordinary skill in the art, several improvements and refinements may also be made without departing from the principles of the present invention, and these improvements and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims (7)

1. A text classification feature selection method, characterized by comprising:
Step 1: obtaining a feature set S and a target category C, calculating the relevance R_c(x^(i)) between each feature x^(i) in the feature set S and the target category C, and sorting the feature set S in descending order of R_c(x^(i));
Step 2: calculating the redundancy R_x and synergy S_x between every two features in the feature set S, calculating the sensitivity Sen of each feature in combination with the relevance R_c(x^(i)) between the feature and the target category, comparing it with a preset threshold th, and, combining the descending ordering of the feature set S, dividing the feature set S into a candidate set S_sel and an exclusion set S_exc according to the threshold th;
Step 3: calculating the sensitivity Sen of the features in the candidate set S_sel and the exclusion set S_exc, comparing it with the preset threshold th, and adjusting the candidate set S_sel and the exclusion set S_exc according to the threshold th;
wherein the redundancy R_x is expressed as:
R_x(x^(i); x^(j)) = min(0, IG(x^(i); x^(j); C)), i ≠ j
wherein IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in the feature set S, R_x(x^(i); x^(j)) denotes the redundancy between the feature x^(i) and the feature x^(j), and the value of R_x(x^(i); x^(j)) is the smaller of 0 and the correlation gain;
wherein the synergy S_x is expressed as:
S_x(x^(i); x^(j)) = max(0, IG(x^(i); x^(j); C)), i ≠ j
wherein IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in the feature set S, S_x(x^(i); x^(j)) denotes the synergy between the feature x^(i) and the feature x^(j), and the value of S_x(x^(i); x^(j)) is the larger of 0 and the correlation gain;
wherein the sensitivity Sen(x^(i)) is expressed as:
Sen(x^(i)) = R_c(x^(i)) + α·min(R_x(x^(i); x^(j))) + β·max(S_x(x^(i); x^(j))), j ≠ i
wherein α and β are the weights of the redundancy R_x and the synergy S_x respectively, min(R_x(x^(i); x^(j))) denotes the minimum redundancy between the feature x^(i) and the remaining features, max(S_x(x^(i); x^(j))) denotes the maximum synergy between the feature x^(i) and the remaining features, Sen(x^(i)) denotes the sensitivity of the feature x^(i) to the target category C, and R_c(x^(i)) denotes the relevance between the feature x^(i) and the target category C.
2. The text classification feature selection method according to claim 1, characterized in that the step 1 includes:
Step 11: for each feature x^(i) in the feature set S, calculating the relevance R_c(x^(i)) between the feature x^(i) and the target category C according to the formula R_c(x^(i)) = I(x^(i); C), wherein I(x^(i); C) denotes the mutual information between the feature x^(i) and the target category C;
Step 12: sorting the features in the feature set S from large to small according to R_c(x^(i)), obtaining the sorted feature set S;
wherein x^(i) denotes the i-th feature in the feature set S, and R_c(x^(i)) denotes the relevance between the feature x^(i) and the target category C.
3. The text classification feature selection method according to claim 2, characterized in that the I(x^(i); C) is expressed as:
I(x^(i); C) = Σ_k p(x^(i), c_k) · log [ p(x^(i) | c_k) / p(x^(i)) ]
wherein c_k denotes the k-th class of the target category C, p(x^(i), c_k) denotes the probability that the feature x^(i) and the class c_k occur simultaneously, p(x^(i) | c_k) denotes the probability that the feature x^(i) appears within the class c_k, and p(x^(i)) denotes the probability that the feature x^(i) appears in the feature set S.
4. The text classification feature selection method according to claim 1, characterized in that the IG(x^(i); x^(j); C) is expressed as:
IG(x^(i); x^(j); C) = I((x^(i), x^(j)); C) − I(x^(i); C) − I(x^(j); C)
wherein I(x^(i); C) denotes the mutual information between the feature x^(i) and the target category C; I(x^(j); C) denotes the mutual information between the feature x^(j) and the target category C; and I((x^(i), x^(j)); C) denotes the mutual information between the feature pair (x^(i), x^(j)) and the target category C.
5. The text classification feature selection method according to claim 4, characterized in that the I((x^(i), x^(j)); C) is expressed as:
I((x^(i), x^(j)); C) = Σ_k p(x^(i), x^(j), c_k) · log [ p((x^(i), x^(j)) | c_k) / p(x^(i), x^(j)) ]
wherein c_k denotes the k-th class of the target category C, p(x^(i), x^(j), c_k) denotes the probability that the feature x^(i), the feature x^(j) and the class c_k occur simultaneously, p((x^(i), x^(j)) | c_k) denotes the probability that the feature x^(i) and the feature x^(j) occur simultaneously within the class c_k, and p(x^(i), x^(j)) denotes the probability that the feature x^(i) and the feature x^(j) occur simultaneously in the feature set S.
6. The text classification feature selection method according to claim 1, characterized in that the step 2 includes:
Step 21: adding the first feature in the feature set S to the candidate set S_sel and setting the exclusion set S_exc to the empty set, i.e. S_sel = {x^(1)}, S_exc = {}, the first feature being the one with the largest relevance R_c(x^(i));
Step 22: starting from the second feature of the feature set S, denoted x^(i), calculating the redundancy R_x and synergy S_x between the feature x^(i) and all features in the candidate set S_sel, and calculating the sensitivity Sen(x^(i)) of the feature x^(i) in combination with the relevance R_c(x^(i)) between the feature and the target category;
Step 23: comparing the sensitivity Sen(x^(i)) with the preset threshold th; if Sen(x^(i)) > th, adding the feature x^(i) to the candidate set S_sel; otherwise adding the feature x^(i) to the exclusion set S_exc;
Step 24: if x^(i) is the last feature in the feature set S, ending the division; otherwise, setting x^(i) to the next feature in the feature set S and returning to step 22.
7. The text classification feature selection method according to claim 1, characterized in that the step 3 includes:
Step 31: letting the pending set S_tbd be empty, i.e. S_tbd = {}; letting x^(k) be the first feature in the exclusion set S_exc and x^(m) be the first feature in the candidate set S_sel;
Step 32: for the feature x^(k) in the exclusion set S_exc, calculating the maximum synergy between the feature x^(m) in the candidate set S_sel and all features in the feature set S other than x^(m), i.e. max(S_x(x^(m); x^(i))), x^(i) ∈ S, i ≠ m;
Step 33: if the feature corresponding to the maximum synergy of the feature x^(m) is x^(k), adding x^(m) to the pending set S_tbd;
Step 34: if the feature x^(m) is the last feature in the candidate set S_sel and the pending set S_tbd is empty, going to step 36; if the pending set S_tbd is not empty, letting x^(j) be the first feature in the pending set S_tbd and going to step 35; if the feature x^(m) is not the last feature in the candidate set S_sel, setting the feature x^(m) to the next feature in the candidate set S_sel and returning to step 32;
Step 35: for the feature x^(j) in the pending set S_tbd, updating the sensitivity of the feature x^(j) as follows:
Sen(x^(j)) = R_c(x^(j)) + α·min(R_x(x^(j); x^(n))) + β·max(S_x(x^(j); x^(n))), x^(n) ∈ S, n ≠ j, n ≠ k
comparing the sensitivity Sen(x^(j)) of the feature x^(j) with the preset threshold th; if Sen(x^(j)) < th and …, removing the feature x^(k) from the exclusion set S_exc, adding it to the candidate set S_sel, and going to step 36; otherwise, if the feature x^(j) is the last element in the pending set S_tbd, going directly to step 36; otherwise, setting the feature x^(j) to the next element in the pending set S_tbd and returning to step 35;
Step 36: if the feature x^(k) is the last element in the exclusion set S_exc, returning the current candidate set S_sel and exclusion set S_exc as the result of the final feature selection; otherwise, setting the feature x^(k) to the next element in the exclusion set S_exc and returning to step 31.
CN201710181572.8A 2017-03-24 2017-03-24 Text classification feature selection method Active CN107016073B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710181572.8A CN107016073B (en) Text classification feature selection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710181572.8A CN107016073B (en) Text classification feature selection method

Publications (2)

Publication Number Publication Date
CN107016073A CN107016073A (en) 2017-08-04
CN107016073B true CN107016073B (en) 2019-06-28

Family

ID=59445053

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710181572.8A Active CN107016073B (en) Text classification feature selection method

Country Status (1)

Country Link
CN (1) CN107016073B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109934251B (en) * 2018-12-27 2021-08-06 国家计算机网络与信息安全管理中心广东分中心 Method, system and storage medium for recognizing text in Chinese language
CN111612385B (en) * 2019-02-22 2024-04-16 北京京东振世信息技术有限公司 Method and device for clustering articles to be distributed

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105184323A (en) * 2015-09-15 2015-12-23 广州唯品会信息科技有限公司 Feature selection method and system
CN105260437A (en) * 2015-09-30 2016-01-20 陈一飞 Text classification feature selection method and application thereof to biomedical text classification

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8473451B1 (en) * 2004-07-30 2013-06-25 At&T Intellectual Property I, L.P. Preserving privacy in natural language databases

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105184323A (en) * 2015-09-15 2015-12-23 广州唯品会信息科技有限公司 Feature selection method and system
CN105260437A (en) * 2015-09-30 2016-01-20 陈一飞 Text classification feature selection method and application thereof to biomedical text classification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on feature selection in Chinese text classification (中文文本分类中的特征选择研究); Zhou Qian et al.; Journal of Chinese Information Processing (中文信息学报); 2004-12-31; Vol. 18, No. 3; pp. 17-23

Also Published As

Publication number Publication date
CN107016073A (en) 2017-08-04

Similar Documents

Publication Publication Date Title
US20210042664A1 (en) Model training and service recommendation
US11074442B2 (en) Identification of table partitions in documents with neural networks using global document context
RU2679209C2 (en) Processing of electronic documents for invoices recognition
US9058327B1 (en) Enhancing training of predictive coding systems through user selected text
US11170249B2 (en) Identification of fields in documents with neural networks using global document context
CN104834940A (en) Medical image inspection disease classification method based on support vector machine (SVM)
CN105069141A (en) Construction method and construction system for stock standard news library
CN105653701B (en) Model generating method and device, word assign power method and device
CN110019790A (en) Text identification, text monitoring, data object identification, data processing method
CN105893362A (en) A method for acquiring knowledge point semantic vectors and a method and a system for determining correlative knowledge points
CN108090178A (en) A kind of text data analysis method, device, server and storage medium
CN103778206A (en) Method for providing network service resources
CN110827131A (en) Tax payer credit evaluation method based on distributed automatic feature combination
CN107016073B (en) A kind of text classification feature selection approach
CN110110143B (en) Video classification method and device
CN107341152B (en) Parameter input method and device
CN110210506A (en) Characteristic processing method, apparatus and computer equipment based on big data
CN103218420B (en) A kind of web page title extracting method and device
CN106202349A (en) Web page classifying dictionary creation method and device
CN105095826B (en) A kind of character recognition method and device
Wang et al. Multi-level Class Token Transformer with Cross TokenMixer for Hyperspectral Images Classification
US20230134218A1 (en) Continuous learning for document processing and analysis
US20230138491A1 (en) Continuous learning for document processing and analysis
US20230177251A1 (en) Method, device, and system for analyzing unstructured document
CN113641823B (en) Text classification model training, text classification method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant