CN108491429A

CN108491429A - A kind of feature selection approach based on document frequency and word frequency statistics between class in class

Info

Publication number: CN108491429A
Application number: CN201810131876.8A
Authority: CN
Inventors: 邵雄凯; 赵婧; 刘建舟; 王春枝; 华满; 阳邹; 陈亮亮
Original assignee: Hubei University of Technology
Current assignee: Hubei University of Technology
Priority date: 2018-02-09
Filing date: 2018-02-09
Publication date: 2018-09-04

Abstract

The invention discloses a kind of feature selection approach based on document frequency and word frequency statistics between class in class, consider concentration degree between the document frequency of Feature Words, the class of word frequency and Feature Words, dispersion degree in class, constructs the feature selecting valuation functions based on document frequency and word frequency statistics (DFCTFS) between class in class；Original feature space of the training set after Text Pretreatment is chosen using feature selecting valuation functions proposed by the present invention in each classification of training set to the feature dictionary of a certain proportion of Feature Words composition category, and the feature dictionary of training set is then the union of training set feature dictionary of all categories.The present invention proposes a kind of feature selection approach based on document frequency and word frequency statistics (DFCTFS) between class in class, the Feature Words that achievable feature selecting goes out integrated distribution Mr. Yu's class document and is uniformly distributed and frequently occurs in such document, improve the effect of Chinese Text Categorization.

Description

A kind of feature selection approach based on document frequency and word frequency statistics between class in class

Technical field

The invention belongs to Technology for Chinese text categorization fields, are related to a kind of feature selection approach, and in particular to one kind is based on In class between class document frequency and word frequency statistics feature selection approach.

Background technology

Chinese Text Categorization Integral Thought is substantially：Text Pretreatment, feature selecting establish text representation model, use Sorting algorithm is classified, disaggregated model assessment.Feature selecting is the committed step of Chinese Text Categorization, it refers to from the original of higher-dimension A part of important feature is selected in feature space, a lower dimensional space is formed, to improve nicety of grading and classification effectiveness.

Traditional feature selection approach has：Document frequency (DF), mutual information (MI), information gain (IG), chi-square statistics amount (CHI) etc..The way of feature selecting is usually that a valuation functions is selected to calculate n original characteristic item, to calculating The value of gained arranges in descending order, is selected from primitive character item set containing the more preceding P characteristic item of information content.

CHI and IG is proved to be the preferable two kinds of feature selecting sides of text classification effect in traditional feature selection approach Method.CHI is with Feature Words t and classification C_iPremised on independently of each other, the value (i.e. extent of deviation) between the two variables is calculated, if The value being calculated is bigger (i.e. deviation is larger), then Feature Words t and classification C_iIt is more related.But traditional CHI methods there is Deficiency, (1) do not consider word frequency distribution of the Feature Words in of all categories, only considered the document frequency of Feature Words, cause CHI may Select the Feature Words that document frequency is high but word frequency is low.(2) Feature Words with classification negative correlation may be selected.

IG be used for text feature selecting when, measurement be the appearance of some word whether to judging whether a text belongs to The number of such information content provided, information content is weighed by entropy.IG does not consider the entropy of document when any feature as and examines The difference of the entropy of document, the difference indicate the reduction degree of information uncertainty after worry this feature.Information uncertainty reduces journey Degree is bigger, and corresponding information gain is bigger, and the information which provides is more, and the lexical item is more important.But traditional IG methods There is deficiency, (1) does not consider word frequency distribution of the Feature Words in of all categories；(2) interference of Feature Words negative correlation；(3) only Global feature selecting (refer to all categories in training set and all use identical characteristic set) can be done, and local feature can not be done It selects (referring to each classification in training set has the characteristic set of oneself).

Training set after pretreatment and feature selecting by forming feature dictionary.CHI feature selection approach is assessed according to CHI Function obtains each Feature Words in the CHI values of each classification of training set, uses being averaged for CHI value of the Feature Words in all categories The CHI values of value or maximum value as this feature word in entire training set are arranged all Feature Words by CHI value descendings, choosing Take feature dictionary of a certain proportion of Feature Words as entire training set.IG feature selection approach is obtained according to IG valuation functions To IG value of each Feature Words in entire training set, all Feature Words are arranged by IG value descendings, choose a certain proportion of feature Feature dictionary of the word as entire training set.

The deficiency of comprehensive analysis CHI and IG, it can be deduced that, the feature selecting key in text classification is to select concentration Distribution Mr. Yu's class document and the Feature Words for being uniformly distributed and frequently occurring in such document.Therefore, the present invention considers spy Concentration degree between the document frequency of word, the class of word frequency and Feature Words, dispersion degree in class are levied, is proposed a kind of based on document frequency between class in class With word frequency statistics (Document Frequency of within-class and between-class and Term Frequency Statistics, DFCTFS) feature selection approach, improve the precision of classification.

Invention content

It is excellent the purpose of the present invention is to provide a kind of feature selection approach based on document frequency and word frequency statistics between class in class Change feature selecting as a result, improve Chinese Text Categorization precision.

The technical solution adopted in the present invention is：1. a kind of feature selecting based on document frequency and word frequency statistics between class in class Method, which is characterized in that include the following steps：

Step 1：Text in training set is indicated after segmenting, removing stop words by lexical item, is denoted as original feature space.It is defeated Enter the original all Feature Words of training set, wherein Feature Words are denoted as t in original feature space_k, 0≤k≤N, N are that primitive character is empty Between middle Feature Words sum；

Step 2：Consider concentration degree between the document frequency of Feature Words, the class of word frequency and Feature Words, dispersion degree in class, structure The feature selecting valuation functions based on document frequency and word frequency statistics between class in class are produced, for calculating in class document frequency and word between class Frequency statistical value DFCTFS；

Step 3：According to gained original feature space, one Feature Words of construction, classification two-dimensional matrix, wherein row represents spy Word is levied, row represent classification, and the element in matrix is DFCTFS values；

Step 4：According to the DFCTFS values of training set each Feature Words in of all categories, to the Feature Words in each classification of training set Carry out descending arrangement；

Step 5：The total number N for obtaining Feature Words in total classification number M in training set and training set, takes a certain proportion of spy Word is levied, is denoted as numWords, then the middle Feature Words number num selected of all categories is numWords divided by M；

Step 6：It all chooses according to gained num values in step 5 during training set is of all categories and presses DFCTFS value descendings in the category The feature dictionary of the preceding num Feature Words composition categories after arrangement；

Step 7：Obtain the feature dictionary of training set, the union of gained feature dictionary as of all categories；

Step 8：Establish text representation model；

According to feature dictionary, the weight of the corresponding Feature Words of every text in training set is calculated, after training set vectorization A two-dimensional matrix is formed, a text is represented per a line, each row represent a Feature Words in feature dictionary；

Step 9：Classified using sorting algorithm；Classifier training is carried out using sorting algorithm to training set, obtains classification mould Type；

Step 10：Classifier performance is assessed；

It for test set, is indicated by lexical item after segmenting, removing stop words, and it is corresponding to calculate every text in test set The weight of Feature Words, will form a two-dimensional matrix after test set vectorization, a text, Mei mono- Lie Daibiaote are represented per a line Levy a Feature Words in dictionary；

The disaggregated model obtained using training, classifies to test set, utilizes recall rate, accuracy rate, F1 values, realization pair The performance evaluation of grader.

The beneficial effects of the present invention are：There is deficiencies for traditional CHI and IG feature selection approach, i.e., do not consider feature Word frequency distribution of the word in of all categories, the Feature Words for causing it that document frequency may be selected high, and have ignored document frequency it is low but The contribution degree of the higher Feature Words of word frequency；The interference of classification negative correlation Feature Words；And IG can only do global feature selecting, and Local feature selecting can not be done.The present invention consider concentration degree between the classes of the document frequency of Feature Words, word frequency and Feature Words, Dispersion degree in class, feature selecting valuation functions of the construction based on document frequency and word frequency statistics (DFCTFS) between class in class, it is intended to select The Feature Words for selecting out integrated distribution Mr. Yu's class document and being uniformly distributed and frequently occur in such document, to improve text point The precision of class.

Description of the drawings

Fig. 1：The flow chart of the embodiment of the present invention；

Fig. 2：Use the Chinese Text Categorization overall flow figure after the present invention；

Fig. 3：The comparison result of CHI, IG and DFCTFS proposed by the present invention in classification recall rate in the embodiment of the present invention Figure；

Fig. 4：The comparison result of CHI, IG and DFCTFS proposed by the present invention on classification accuracy in the embodiment of the present invention Figure；

Fig. 5：The comparison result figure of CHI, IG and DFCTFS proposed by the present invention in classification F1 values in the embodiment of the present invention；

Fig. 6：The comparison knot of CHI, IG and DFCTFS proposed by the present invention on whole classifying quality in the embodiment of the present invention Fruit is schemed.

Specific implementation mode

Understand for the ease of those of ordinary skill in the art and implement the present invention, with reference to the accompanying drawings and embodiments to this hair Bright to be described in further detail, implementation example described herein is merely to illustrate and explain the present invention, and is not used to limit The present invention.

See Fig. 1 and Fig. 2, a kind of feature selecting side based on document frequency and word frequency statistics between class in class provided by the invention Method includes the following steps：

Wherein, it is based on the feature selecting valuation functions of document frequency and word frequency statistics between class in class：

In formula, DFCTFS (t_k,C_i) indicate Feature Words t_kIn classification C_iIn class in document frequency and word frequency statistics value between class DFCTFS, DF (t_k,C_i) indicate Feature Words t_kIn classification C_iThe textual data of middle appearance, DF (t_k) indicate Feature Words t_kIn training set institute There are the textual data summation occurred in classification, DF (t, C_i) indicate classification C_iIn the summation of textual data that occurs of all Feature Words, TF (t_k,C_i) indicate Feature Words t_kIn classification C_iThe number of middle appearance, numDocs_iIndicate classification C_iTextual data, M indicate classification number.

Specific implementation includes the following steps：

Step 3.1：For each classification in training set, statistical nature word t_kIn C_iThe textual data DF occurred in classification (t_k,C_i) and number TF (t_k,C_i)；Wherein, k=1...N, N are characterized word sum；I=1...M, M are classification number；

Step 3.2：According to t_k, C_iTwo-dimensional matrix corresponding position is navigated to, is united using based on document frequency and word frequency between class in class The feature selecting valuation functions of meter calculate C_iThe Feature Words t of classification_kDFCTFS values, to construct training set Feature Words, The two-dimensional matrix of the N*M of classification.

Each classification in this implementation is each classification in training text corpus.The language material used in the present embodiment Library is the Chinese language material that Fudan University's computerized information is put in order with technology system international data center center natural language processing group Library, the text in the corpus is that classified is other, and the text set of each classification is stored using a file, the present embodiment It is middle selected 8 classifications therein (but the present invention class categories be not limited to only be 8 classifications, can be by adjusting in experiment The parameter setting of classification corresponds to the classification number in the corpus selected, it can be achieved that the text classification of different classes of number is tested).In The resource that literary text corpus is existing, has put in order, can be used directly, can download on the net.

Step 7：Obtain the feature dictionary of training set, the union of gained feature dictionary as of all categories；Union ensures special Levy the uniqueness of word in dictionary；

Step 8：Establish text representation model；

Wherein vector space model is most widely used, and main realization approach is, according to feature dictionary, to calculate in training set The weight of the corresponding Feature Words of every text, most-often used weighing computation method are TF-IDF (term frequency-inverse document frequencies), i.e., A two-dimensional matrix being formed after training set vectorization, a text being represented per a line, each row represent one in feature dictionary A Feature Words.

In formula, TFIDF (w_ik) indicate Feature Words t_kIn text d_iIn weight be w_ik, c_ikIndicate Feature Words t_kIn text d_i The number of middle appearance, N indicate that Feature Words sum, D indicate training set text sum, n_kIndicate Feature Words t_kThe textual data of appearance, β It is a constant term.

Step 10：Classifier performance is assessed；

Wherein, recall rate, accuracy rate, the calculation formula of F1 values are：

Recall rate

Accuracy rate

F1 values

Macro recall rate

Macro accuracy rate

Macro F1 values

In formula, M is classification number, and A expressions are judged to belonging to such and belong to such, B expressions be judged to belonging to such but not Belong to such, C expressions are judged to being not belonging to such but belong to such, and D expressions are judged to being not belonging to such and are not belonging to such；R_i Indicate the recall rate of classification i, P_iIndicate the accuracy rate of classification i, F1_iIndicate the F1 values of classification i.

A kind of feature selection approach based on document frequency and word frequency statistics between class in class proposed by the present invention, and it is traditional CHI with IG feature selection approach is compared, and the recall rate, accuracy rate, F1 values of classification is improved to a certain extent, below by way of reality Test explanation.Chinese word segmentation in experiment uses the ICTCLAS Chinese word segmentings that Inst. of Computing Techn. Academia Sinica researches and develops System.Feature selecting takes the 5% of training set Feature Words total number.That sorting algorithm is selected is SVM, is to be based on Taiwan Univ.'s woods intelligence The tool boxes LIBSVM of the exploitations such as benevolence professor.Chinese Corpus uses Fudan University's computerized information and state of technology system The Chinese corpus that border database hub natural language processing group arranges.Select sport therein, history, space, politics, ring Border, economy, art, computer, totally 8 classifications.The selection situation of text wherein of all categories is as shown in the table：

The selection situation of training set and test set in 1 corpus of table

It is programmed and is realized using Java, experiment porch MyEclipse, following 64 bit manipulations of Window7 of configuration of server System, processor are Intel (R) Core (TM) i5-2450M CPU 2.50GHz 2.5GHz, inside save as 4.00GB.

Method proposed by the present invention and tradition CHI, IG feature selection approach are in the knot of recall rate, accuracy rate, F1 values of classifying Fruit comparison is as shown in table 2：

The experimental result of conventional method CHI, IG and DFCTFS proposed by the present invention ratio in 2 text classification of table

Compared with

The comparison of table 3 conventional method CHI, IG and DFCTFS proposed by the present invention on whole classifying quality

By analyzing table 2, it can be deduced that DFCTFS feature selectings proposed in this paper, in the classification of 8 selected classifications It is better than traditional CHI and IG on the overall trend of effect.By analyzing table 3, it can be deduced that DFCTFS features proposed in this paper Selection method has been respectively increased 2.11%, 1.54% in the macro recall rate of classification compared with CHI, IG, divides in macro accuracy rate 2.11%, 1.36% is not improved, and 2.12%, 1.5% has been respectively increased in macro F1 values.In summary experimental result, can be with Show that the classifying quality of DFCTFS feature selectings proposed in this paper compared with traditional CHI, IG, is improved to some extent, says Effectiveness of the invention is illustrated.

It is the comparison result figure of CHI, IG and DFCTFS proposed by the present invention in classification recall rate see Fig. 3；Fig. 3 is more Intuitively illustrate the classification of DFCTFS proposed by the present invention and traditional feature selection approach CHI, IG 8 classifications selected by experiment It compares, is improved to some extent in recall rate.

It is the comparison result figure of CHI, IG and DFCTFS proposed by the present invention on classification accuracy see Fig. 4；Fig. 4 is more Intuitively illustrate the classification of DFCTFS proposed by the present invention and traditional feature selection approach CHI, IG 8 classifications selected by experiment It compares, is improved to some extent in accuracy rate.

It is the comparison result figure of middle CHI, IG and DFCTFS proposed by the present invention in classification F1 values see Fig. 5；Fig. 5 is more Intuitively illustrate the classification of DFCTFS proposed by the present invention and traditional feature selection approach CHI, IG 8 classifications selected by experiment It compares, is improved to some extent in F1 values.

It is the comparison result figure of CHI, IG and DFCTFS proposed by the present invention on whole classifying quality see Fig. 6；Fig. 6 More intuitively illustrate DFCTFS proposed by the present invention with traditional feature selection approach CHI, IG in the whole of selected 8 classifications of experiment On body classifying quality, by macro recall rate, macro accuracy rate, macro F1 values are compared, and are improved to some extent.

It should be understood that the part that this specification does not elaborate belongs to the prior art.The ordinary skill of this field Personnel under the inspiration of the present invention, in the case where not departing from the ambit that the claims in the present invention are protected, can also make replacement Or deformation, it each falls within protection scope of the present invention, it is of the invention range is claimed to be determined by the appended claims.

Claims

1. a kind of feature selection approach based on document frequency and word frequency statistics between class in class, which is characterized in that include the following steps：

Step 1：Text in training set is indicated after segmenting, removing stop words by lexical item, is denoted as original feature space；Input instruction Practice and collect original all Feature Words, wherein Feature Words are denoted as t in original feature space_k, 0≤k≤N, N are in original feature space Feature Words sum；

Step 2：Consider concentration degree between the document frequency of Feature Words, the class of word frequency and Feature Words, dispersion degree in class, constructs Based on the feature selecting valuation functions of document frequency and word frequency statistics between class in class, for calculating, document frequency and word frequency are united between class in class Evaluation DFCTFS；

Step 3：According to gained original feature space, one Feature Words of construction, classification two-dimensional matrix, wherein row represents feature Word, row represent classification, and the element in matrix is DFCTFS values；

Step 4：According to the DFCTFS values of training set each Feature Words in of all categories, the Feature Words in each classification of training set are carried out Descending arranges；

Step 5：The total number N for obtaining Feature Words in total classification number M in training set and training set, takes a certain proportion of feature Word is denoted as numWords, then the middle Feature Words number num selected of all categories is numWords divided by M；

Step 6：It all chooses in the category and is arranged by DFCTFS value descendings according to gained num values in step 5 during training set is of all categories The feature dictionary of the Feature Words composition category of preceding num afterwards；

Step 7：Obtain the feature dictionary of training set, the union of gained feature dictionary as of all categories.

2. the feature selection approach according to claim 1 based on document frequency and word frequency statistics between class in class, feature exist In being based on document frequency and the feature selecting valuation functions of word frequency statistics between class in class described in step 2：

In formula, DFCTFS (t_k,C_i) indicate Feature Words t_kIn classification C_iIn class in document frequency and word frequency statistics value DFCTFS between class, DF(t_k,C_i) indicate Feature Words t_kIn classification C_iThe textual data of middle appearance, DF (t_k) indicate Feature Words t_kIn training set all categories The textual data summation of middle appearance, DF (t, C_i) indicate classification C_iIn the summation of textual data that occurs of all Feature Words, TF (t_k,C_i) Indicate Feature Words t_kIn classification C_iThe number of middle appearance, numDocs_iIndicate classification C_iTextual data, M indicate classification number.

3. the feature selection approach according to claim 1 based on document frequency and word frequency statistics between class in class, feature exist In the specific implementation of step 3 includes the following steps：

Step 3.1：For each classification in training set, statistical nature word t_kIn C_iTextual data DF (the t occurred in classification_k, C_i) and number TF (t_k,C_i)；Wherein, k=1...N, N are characterized word sum；I=1...M, M are classification number；

Step 3.2：According to t_k, C_iTwo-dimensional matrix corresponding position is navigated to, using based on document frequency between class in class and word frequency statistics Feature selecting valuation functions calculate C_iThe Feature Words t of classification_kDFCTFS values, to construct Feature Words, the classification of training set N*M two-dimensional matrix.

4. the feature selecting side based on document frequency and word frequency statistics between class in class according to claim 1-3 any one Method, which is characterized in that the efficiency assessment of the feature selection approach includes the following steps：

Step 8：Establish text representation model；

According to feature dictionary, the weight of the corresponding Feature Words of every text in training set is calculated, will be formed after training set vectorization One two-dimensional matrix, a text is represented per a line, and each row represent a Feature Words in feature dictionary；

Step 9：Classified using sorting algorithm；Classifier training is carried out using sorting algorithm to training set, obtains disaggregated model；

Step 10：Classifier performance is assessed；

It for test set, is indicated by lexical item after segmenting, removing stop words, and calculates the corresponding feature of every text in test set The weight of word, will form a two-dimensional matrix after test set vectorization, represent a text per a line, each row represent Feature Words A Feature Words in library；

The disaggregated model obtained using training, classifies to test set, using recall rate, accuracy rate, F1 values, realizes to classification The performance evaluation of device.

5. the feature selection approach according to claim 4 based on document frequency and word frequency statistics between class in class, feature exist In calculating term weight function formula described in step 8 is：

In formula, TFIDF (w_ik) indicate Feature Words t_kIn text d_iIn weight be w_ik, c_ikIndicate Feature Words t_kIn text d_iIn go out Existing number, N indicate that Feature Words sum, D indicate training set text sum, n_kIndicate Feature Words t_kThe textual data of appearance, β are one A constant term.

6. the feature selection approach according to claim 4 based on document frequency and word frequency statistics between class in class, feature exist In recall rate described in step 10, accuracy rate, the calculation formula of F1 values are：

Recall rate

Accuracy rate

F1 values

Macro recall rate

Macro accuracy rate

Macro F1 values

In formula, M is classification number, and A expressions are judged to belonging to such and belong to such, and B expressions are judged to belonging to such but be not belonging to Such, C expressions are judged to being not belonging to such but belong to such, and D expressions are judged to being not belonging to such and are not belonging to such；R_iIt indicates The recall rate of classification i, P_iIndicate the accuracy rate of classification i, F1_iIndicate the F1 values of classification i.