CN110688481A - Text classification feature selection method based on chi-square statistic and IDF - Google Patents

Text classification feature selection method based on chi-square statistic and IDF Download PDF

Info

Publication number
CN110688481A
Authority
CN
China
Prior art keywords
text
word
chi
idf
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910821594.5A
Other languages
Chinese (zh)
Inventor
李帅
郑少波
杨玉龙
冯建巩
朱义杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou Aerospace Institute of Measuring and Testing Technology
Original Assignee
Guizhou Aerospace Institute of Measuring and Testing Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou Aerospace Institute of Measuring and Testing Technology filed Critical Guizhou Aerospace Institute of Measuring and Testing Technology
Priority to CN201910821594.5A priority Critical patent/CN110688481A/en
Publication of CN110688481A publication Critical patent/CN110688481A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification

Abstract

The disclosure relates to a text classification feature selection method based on the chi-square statistic and IDF, comprising the following steps: dividing the collected text into two parts, a training text set and a test text set, and labeling the training text set by category; performing word segmentation on the training text set to obtain feature words, and calculating, for each feature word t, its chi-square statistic, the number of texts Dt in which it appears, its IDF value and its CHI-IDF(t, c) value. The invention has the advantages that the method is simple to implement, and that combining the characteristics of the chi-square statistic and IDF yields an improved CHI-IDF feature selection method which improves on the feature selection effect of the traditional chi-square statistic and effectively raises the selection probability of feature words.

Description

Text classification feature selection method based on chi-square statistic and IDF
Technical Field
The invention relates to a text classification feature selection method based on chi-square statistic and IDF.
Background
At present, self-media outlets, news applications and information distribution platforms or APPs keep multiplying and the volume of information grows continuously; how to manage this information effectively and extract valuable information from it has become an important subject. As an important text analysis and mining technology, text classification is widely used for processing and mining text information.
Text classification is a supervised text processing method: text categories are set in advance, a classification algorithm model is selected, and a text set is used for training, thereby achieving the goal of classifying texts. Text classification is an important direction in the field of natural language processing and is widely applied to text information processing and mining, information retrieval, public opinion analysis and the like. In the overall classification process, the training text set is usually converted into a vector representation and then used to train the classification algorithm model; feature words are selected to reduce the dimension of the feature vectors when the training text set is vectorized, because the higher the vector dimension, the greater the computational cost and the weaker the generalization capability of the model. Text classification therefore requires reducing the vector dimension while ensuring that the selected feature words carry good weights and strong class discrimination for classification.
The essence of feature word selection is vector dimension reduction, and the contribution of a feature word to classification is mainly reflected in its classification weight. Feature word selection computes a classification weight for each feature word, keeps the feature words with large weights and removes those with smaller weights. Current feature selection methods mainly include document frequency, information gain and the chi-square statistic, but they share a significant defect: the selection probability of effective feature words is not high, which affects classification accuracy.
Disclosure of Invention
The invention aims to provide a text classification feature selection method based on chi-square statistic and IDF, which effectively improves the selection probability of feature words.
In order to solve the above technical problem, the invention adopts the following technical scheme: a text classification feature selection method based on the chi-square statistic and IDF, characterized by comprising the following steps: dividing the collected text into two parts, a training text set and a test text set, and labeling the training text set by category;
performing word segmentation on the obtained training text set to obtain feature words t;
calculating the chi-square statistic of each feature word t according to the following formula:
χ²(t, c) = N(AD − BC)² / [(A + B)(C + D)(A + C)(B + D)]
where A denotes the number of documents that belong to category c and contain the word t, B denotes the number of documents that do not belong to category c but contain the word t, C denotes the number of documents that belong to category c but do not contain the word t, D denotes the number of documents that neither belong to category c nor contain the word t, and N denotes the total number of documents in the training corpus, i.e., N = A + B + C + D;
calculating the number of texts Dt in which the feature word t appears;
calculating the IDF value of the feature word t;
calculating the CHI-IDF(t, c) value of each feature word and selecting a certain number of feature words according to the text scale;
converting the training text set into a vector model composed of the selected feature words, and then training a classification algorithm model to obtain a text classifier;
and converting the test text set into a vector model composed of the feature words, inputting it into the text classifier for classification, and then obtaining the predicted category of each test text.
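For illustration only, the scoring part of these steps can be sketched in Python as follows. The sketch assumes that CHI-IDF(t, c) is the product of the chi-square statistic and the IDF value (consistent with the embodiments described below), and plain whitespace splitting stands in for the Chinese word segmentation described later; it is not an authoritative implementation of the claimed method.

```python
# Minimal sketch of the CHI-IDF scoring steps (illustrative assumptions only).
import math
from collections import Counter, defaultdict

def chi_square(A, B, C, D):
    """Chi-square statistic of one word/category pair from the 2x2 counts."""
    N = A + B + C + D
    denom = (A + B) * (C + D) * (A + C) * (B + D)
    return 0.0 if denom == 0 else N * (A * D - B * C) ** 2 / denom

def chi_idf_scores(texts, labels):
    """CHI-IDF(t, c) for every word t and category c of a labeled text set.
    Assumes CHI-IDF = chi-square * IDF; texts are whitespace-segmented here."""
    doc_words = [set(text.split()) for text in texts]
    N = len(doc_words)                                     # total number of texts
    Dt = Counter(w for words in doc_words for w in words)  # texts containing each word
    scores = defaultdict(dict)
    for c in set(labels):
        in_c = [words for words, y in zip(doc_words, labels) if y == c]
        out_c = [words for words, y in zip(doc_words, labels) if y != c]
        for t in Dt:
            A = sum(t in words for words in in_c)    # in category c, contains t
            B = sum(t in words for words in out_c)   # not in c, contains t
            C = len(in_c) - A                        # in c, lacks t
            D = len(out_c) - B                       # not in c, lacks t
            scores[c][t] = chi_square(A, B, C, D) * math.log2(N / Dt[t])
    return scores
```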
Compared with the prior art, the invention has the following beneficial technical effects:
The method is simple to implement: the collected text is divided into a training text set and a test text set, the training text set is labeled by category, word segmentation is performed on the training text set to obtain feature words, and the chi-square statistic, the number of texts Dt in which each feature word t appears, the IDF value and the CHI-IDF(t, c) value are calculated. Analysis of the chi-square feature extraction method shows that the chi-square statistic tends to overlook feature items with low document frequency, whereas the inverse document frequency IDF selects such low-document-frequency feature words well; combining the two therefore improves feature selection.
Drawings
FIG. 1 is a flow chart of the text classification feature extraction method based on chi-squared statistics and IDF of the present invention.
Fig. 2 is a schematic diagram of an embodiment of the method shown in fig. 1.
Detailed Description
The present invention will be described in further detail with reference to specific embodiments, but these examples are only illustrative and do not limit the scope of the present invention.
Referring to fig. 1 and fig. 2, in the present invention the collected text is divided into two parts, a training text set and a test text set, and the training text set is labeled by category;
performing word segmentation on the obtained training text set to obtain feature words t;
calculating the chi-square statistic of each feature word t according to the following formula:
χ²(t, c) = N(AD − BC)² / [(A + B)(C + D)(A + C)(B + D)]
where A denotes the number of documents that belong to category c and contain the word t, B denotes the number of documents that do not belong to category c but contain the word t, C denotes the number of documents that belong to category c but do not contain the word t, D denotes the number of documents that neither belong to category c nor contain the word t, and N denotes the total number of documents in the training corpus, i.e., N = A + B + C + D;
calculating the number of texts Dt in which the feature word t appears;
calculating the IDF value of the feature word t;
calculating the CHI-IDF(t, c) value of each feature word and selecting a certain number of feature words according to the text scale;
converting the training text set into a vector model composed of the selected feature words, and then training a classification algorithm model to obtain a text classifier;
and converting the test text set into a vector model composed of the feature words, inputting it into the text classifier for classification, and then obtaining the predicted category of each test text.
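As a concrete illustration of the quantities A, B, C, D and the chi-square formula above (the numbers are hypothetical, not taken from the patent), the snippet below counts the four parameters for one word and one category in a toy corpus of four segmented texts and evaluates the formula:

```python
# Toy illustration of how A, B, C, D are counted for one word and one category.
docs = [("sports",  {"match", "ball"}),    # (category label, set of segmented words)
        ("sports",  {"match", "score"}),
        ("finance", {"stock", "match"}),
        ("finance", {"stock", "bank"})]

t, c = "match", "sports"
A = sum(1 for label, words in docs if label == c and t in words)      # 2
B = sum(1 for label, words in docs if label != c and t in words)      # 1
C = sum(1 for label, words in docs if label == c and t not in words)  # 0
D = sum(1 for label, words in docs if label != c and t not in words)  # 1
N = A + B + C + D                                                     # 4
chi2 = N * (A * D - B * C) ** 2 / ((A + B) * (C + D) * (A + C) * (B + D))
print(A, B, C, D, chi2)   # 2 1 0 1 -> chi2 = 4 * (2 - 0)**2 / (3 * 1 * 2 * 2) = 1.333...
```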
In one embodiment, the IDF value of the feature word t is calculated according to the formula
IDF(t) = log₂(D / Dt)
wherein D represents the total number of texts, Dt represents the number of texts containing the word t, and the logarithm is taken to base 2.
In one embodiment, the CHI-IDF(t, c) value of each feature word is calculated as the product of its chi-square statistic and its IDF value, i.e.
CHI-IDF(t, c) = χ²(t, c) × IDF(t).
In one embodiment, the training text set is obtained through a web crawler, the Modern Chinese Corpus of the State Language Commission, and the Sogou corpus.
In one embodiment, the word segmentation process is performed using the ICTCLAS package of the Chinese Academy of Sciences.
In one embodiment, the word segmentation process further comprises removing stop words and abnormal symbols.
In one embodiment, obtaining the feature words t comprises: obtaining the participle set W = {w1, w2, w3, …, wn} of the training text set and the text category set C = {c1, c2, c3, …, cm}, where wn represents a feature word in the participle set and cm represents a category of the training text set.
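The preprocessing and set construction described in the above embodiments can be sketched as follows. The jieba segmenter is used here only as a readily available stand-in for the ICTCLAS package named in the text, and the stop-word list is a tiny illustrative placeholder:

```python
# Sketch of segmentation, cleaning and building the sets W and C (assumptions:
# jieba replaces ICTCLAS; the stop-word list is illustrative, not exhaustive).
import re
import jieba

STOP_WORDS = {"的", "了", "是", "在"}

def preprocess(text):
    """Segment a text, then remove stop words and abnormal symbols."""
    tokens = jieba.lcut(text)
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [t for t in tokens if re.fullmatch(r"[\w\u4e00-\u9fff]+", t)]
    return tokens

def build_sets(train_texts, train_labels):
    """Participle set W = {w1, ..., wn} and category set C = {c1, ..., cm}."""
    W = set()
    for text in train_texts:
        W.update(preprocess(text))
    C = set(train_labels)
    return W, C
```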
In one embodiment, calculating the CHI-IDF(t, c) value of each feature word and selecting a certain number of feature words according to the text scale specifically includes the following steps:
calculating the CHI-IDF(t, c) value of each feature word and sorting the feature words by this value;
and selecting a certain number of feature words as the text representation vector space according to the text scale and the total number of segmented words.
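One possible realization of this ranking and truncation step is sketched below; the per-category scores dictionary matches the earlier scoring sketch, and the 10% ratio used to tie the number of kept feature words to the corpus scale is an assumption rather than a value given in the patent:

```python
def choose_k(total_segmented_words, ratio=0.1, minimum=100):
    """Assumed rule linking the number of kept feature words to the text scale."""
    return max(minimum, int(total_segmented_words * ratio))

def top_k_features(chi_idf_scores, k):
    """Sort feature words by CHI-IDF(t, c) and keep the k highest per category."""
    selected = set()
    for category, scores in chi_idf_scores.items():
        ranked = sorted(scores, key=scores.get, reverse=True)
        selected.update(ranked[:k])
    return sorted(selected)
```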
As a specific embodiment, the text classification feature selection method based on the chi-square statistic and IDF of the present invention specifically includes the following steps:
Step (1): acquiring a training text set through a web crawler, the Modern Chinese Corpus of the State Language Commission, the Sogou corpus and the like;
Step (2): performing word segmentation on the acquired training text set using the ICTCLAS package of the Chinese Academy of Sciences;
Step (3): processing the word segmentation result by removing stop words, abnormal symbols and the like;
Step (4): repeating steps (1) to (3) until all texts have been processed;
Step (5): obtaining the participle set W = {w1, w2, w3, …, wn} of the training text set and the text category set C = {c1, c2, c3, …, cm}, where wn represents a feature word in the participle set and cm represents a category of the training text set;
Step (6): calculating the counting parameters of each feature word t, where A represents the number of documents that belong to category c and contain the word t, B represents the number of documents that do not belong to category c but contain the word t, C represents the number of documents that belong to category c but do not contain the word t, and D represents the number of documents that neither belong to category c nor contain the word t;
Step (7): repeating step (6) until the counting parameters of all feature words have been obtained;
Step (8): using the counting parameters obtained in step (7) to calculate the chi-square statistic of the feature word t, which reflects the degree of correlation between the feature word t and the category c (the higher the value, the stronger the correlation), according to the formula
χ²(t, c) = N(AD − BC)² / [(A + B)(C + D)(A + C)(B + D)]
where N represents the total number of texts in the training corpus, i.e., N = A + B + C + D;
Step (9): calculating the number of texts Dt in which the feature word t appears;
Step (10): repeating step (9) until the number of texts in which each feature word appears has been calculated;
Step (11): calculating the IDF value of the feature word t according to the formula
IDF(t) = log₂(D / Dt)
where D represents the total number of texts, Dt represents the number of texts containing the word t, and the logarithm is taken to base 2;
Step (12): combining the chi-square statistic with the IDF method:
CHI-IDF(t, c) = χ²(t, c) × IDF(t)
where the meaning of each symbol is consistent with steps (5), (6), (8) and (11).
The invention improves on the basic chi-square statistic. By introducing the IDF (inverse document frequency) method, it remedies the chi-square statistic's insufficient selection of low-frequency feature words; compared with the unimproved chi-square feature extraction method, it improves the ability to select feature words and thereby improves the text classification effect.
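Plugging hypothetical counts (not taken from the patent) into steps (6) to (12) shows how the three quantities combine, again assuming the product form of CHI-IDF:

```python
import math

# Hypothetical counts for one feature word t and one category c (steps (6)-(8)).
A, B, C, D = 40, 10, 60, 290
N = A + B + C + D                                   # 400 training texts in total
chi2 = N * (A * D - B * C) ** 2 / ((A + B) * (C + D) * (A + C) * (B + D))

# Steps (9)-(11): the word appears in Dt = A + B = 50 of the 400 texts;
# the "D" in the IDF formula is the total text count, i.e. N here.
Dt = A + B
idf = math.log2(N / Dt)                             # log2(400 / 50) = 3.0

# Step (12): combine the two scores (assumed to be a simple product).
chi_idf = chi2 * idf
print(round(chi2, 2), idf, round(chi_idf, 2))       # 92.19 3.0 276.57
```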
As a modified embodiment, the text classification feature selection method based on the chi-square statistic and IDF of the present invention includes the following steps:
Step (1): dividing the collected text into two parts, a training text set and a test text set, and labeling the training text set by category;
Step (2): performing word segmentation, stop-word removal and abnormal-symbol removal on the training text set, specifically:
Step 2.1: performing word segmentation on the training text set using the ICTCLAS package of the Chinese Academy of Sciences;
Step 2.2: removing stop words and abnormal symbols from the word segmentation result;
Step (3): calculating the chi-square statistic of each feature word;
Step (4): calculating the IDF value of each feature word;
Step (5): calculating the CHI-IDF(t, c) value of each feature word and selecting a certain number of feature words according to the text scale, specifically:
Step 5.1: calculating the CHI-IDF(t, c) value of each feature word and sorting the feature words by this value;
Step 5.2: selecting a certain number of feature words as the text representation vector space according to the text scale and the total number of segmented words;
Step (6): converting the training texts into vector models composed of the selected feature words, and then training a classification algorithm model to obtain a text classifier;
Step (7): converting the test text set into vector models composed of the feature words, inputting them into the text classifier trained in step (6) for classification, and then obtaining the predicted category of each test text.
The invention has the following beneficial technical effects:
The method is simple to implement: the collected text is divided into a training text set and a test text set, the training text set is labeled by category, word segmentation is performed on the training text set to obtain feature words, and the chi-square statistic, the number of texts Dt in which each feature word t appears, the IDF value and the CHI-IDF(t, c) value are calculated. Analysis of the chi-square feature extraction method shows that the chi-square statistic tends to overlook feature items with low document frequency, whereas the inverse document frequency IDF selects such low-document-frequency feature words well; combining the two therefore improves feature selection.
While the invention has been described with reference to preferred embodiments, it is not limited thereto, and obviously the embodiments cannot be enumerated exhaustively here. Those skilled in the art may make variations and modifications using the design and content of the above disclosed embodiments without departing from the spirit and scope of the present invention; therefore, any simple modification, parameter change or adaptation of the above embodiments based on the substance of the present invention shall fall within the protection scope of the present invention.

Claims (8)

1. A text classification feature selection method based on chi-square statistic and IDF is characterized by comprising the following steps:
dividing the collected text into two parts, wherein one part is a training text set and the other part is a test text set, and labeling the training text set by category;
performing word segmentation on the obtained training text set to obtain feature words t;
calculating the chi-square statistic of each feature word t according to the following formula:
χ²(t, c) = N(AD − BC)² / [(A + B)(C + D)(A + C)(B + D)]
wherein A denotes the number of documents that belong to category c and contain the word t, B denotes the number of documents that do not belong to category c but contain the word t, C denotes the number of documents that belong to category c but do not contain the word t, D denotes the number of documents that neither belong to category c nor contain the word t, and N denotes the total number of documents in the training corpus, i.e., N = A + B + C + D;
calculating the number of texts Dt in which the feature word t appears;
calculating the IDF value of the feature word t;
calculating the CHI-IDF(t, c) value of each feature word and selecting a certain number of feature words according to the text scale;
converting the training text set into a vector model composed of the selected feature words, and then training a classification algorithm model to obtain a text classifier;
and converting the test text set into a vector model composed of the feature words, inputting it into the text classifier for classification, and then obtaining the predicted category of each test text.
2. The method of claim 1, wherein the IDF value of the feature word t is calculated according to the formula:
IDF(t) = log₂(D / Dt)
wherein D represents the total number of texts, Dt represents the number of texts containing the word t, and the logarithm is taken to base 2.
3. The method of claim 2, wherein the CHI-IDF(t, c) value of each feature word is calculated according to the formula:
CHI-IDF(t, c) = χ²(t, c) × IDF(t).
4. The method of claim 3, wherein the training text set is obtained from a web crawler, the Modern Chinese Corpus of the State Language Commission, and the Sogou corpus.
5. The method as claimed in claim 3, wherein the word segmentation process is performed using the ICTCLAS package of the Chinese Academy of Sciences.
6. The method of claim 5, wherein the word segmentation process further comprises: removing stop words and abnormal symbols.
7. The method of claim 6, wherein obtaining the feature words t comprises: obtaining the participle set W = {w1, w2, w3, …, wn} of the training text set and the text category set C = {c1, c2, c3, …, cm}, wherein wn represents a feature word in the participle set and cm represents a category of the training text set.
8. The text classification feature selection method based on the chi-square statistic and IDF as claimed in claim 1, wherein calculating the CHI-IDF(t, c) value of each feature word and selecting a certain number of feature words according to the text scale specifically comprises the following steps:
calculating the CHI-IDF(t, c) value of each feature word and sorting the feature words by this value;
and selecting a certain number of feature words as the text representation vector space according to the text scale and the total number of segmented words.
CN201910821594.5A 2019-09-02 2019-09-02 Text classification feature selection method based on chi-square statistic and IDF Pending CN110688481A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910821594.5A CN110688481A (en) 2019-09-02 2019-09-02 Text classification feature selection method based on chi-square statistic and IDF

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910821594.5A CN110688481A (en) 2019-09-02 2019-09-02 Text classification feature selection method based on chi-square statistic and IDF

Publications (1)

Publication Number Publication Date
CN110688481A true CN110688481A (en) 2020-01-14

Family

ID=69108803

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910821594.5A Pending CN110688481A (en) 2019-09-02 2019-09-02 Text classification feature selection method based on chi-square statistic and IDF

Country Status (1)

Country Link
CN (1) CN110688481A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112328787A (en) * 2020-11-04 2021-02-05 中国平安人寿保险股份有限公司 Text classification model training method and device, terminal equipment and storage medium
CN115345229A (en) * 2022-08-08 2022-11-15 航天神舟智慧系统技术有限公司 Fire-fighting risk dimension determination method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103995876A (en) * 2014-05-26 2014-08-20 上海大学 Text classification method based on chi square statistics and SMO algorithm
CN105512311A (en) * 2015-12-14 2016-04-20 北京工业大学 Chi square statistic based self-adaption feature selection method
CN109543037A (en) * 2018-11-21 2019-03-29 南京安讯科技有限责任公司 A kind of article classification method based on improved TF-IDF

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Yao Binxiu et al.: "A CRS-KNN Text Classification Algorithm Based on Canopy and Rough Sets", Computer Engineering and Applications *
Xu Guanhua et al.: "A Survey of Text Feature Extraction Methods", Software Guide *
Li Shuai, Chen Xiaorong: "A BPNN Short Text Classification Method with an Improved Chi-Square Statistic", Journal of Guizhou University (Natural Sciences) *
Liang Wuqi et al.: "A Category-Based CHI Feature Selection Method", Journal of Anhui Radio & TV University *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112328787A (en) * 2020-11-04 2021-02-05 中国平安人寿保险股份有限公司 Text classification model training method and device, terminal equipment and storage medium
CN112328787B (en) * 2020-11-04 2024-02-20 中国平安人寿保险股份有限公司 Text classification model training method and device, terminal equipment and storage medium
CN115345229A (en) * 2022-08-08 2022-11-15 航天神舟智慧系统技术有限公司 Fire-fighting risk dimension determination method

Similar Documents

Publication Publication Date Title
WO2017167067A1 (en) Method and device for webpage text classification, method and device for webpage text recognition
US20200311113A1 (en) Method and device for extracting core word of commodity short text
CN103995876A (en) Text classification method based on chi square statistics and SMO algorithm
CN104881458B (en) A kind of mask method and device of Web page subject
CN108710611B (en) Short text topic model generation method based on word network and word vector
CN106372111B (en) Local feature point screening method and system
WO2012156774A1 (en) Method and apparatus for detecting visual words which are representative of a specific image category
CN109993216B (en) Text classification method and device based on K nearest neighbor KNN
CN110826618A (en) Personal credit risk assessment method based on random forest
CN109271517A (en) IG TF-IDF Text eigenvector generates and file classification method
CN107357895B (en) Text representation processing method based on bag-of-words model
CN108228541A (en) The method and apparatus for generating documentation summary
CN110688481A (en) Text classification feature selection method based on chi-square statistic and IDF
CN103914551A (en) Method for extending semantic information of microblogs and selecting features thereof
CN107526792A (en) A kind of Chinese question sentence keyword rapid extracting method
CN109902173B (en) Chinese text classification method
CN113111645B (en) Media text similarity detection method
US11960521B2 (en) Text classification system based on feature selection and method thereof
CN112183093A (en) Enterprise public opinion analysis method, device, equipment and readable storage medium
CN109376241B (en) DenseNet-based telephone appeal text classification algorithm for power field
CN103034657A (en) Document abstract generating method and device
Lin et al. Classifying textual components of bilingual documents with decision-tree support vector machines
CN114266249A (en) Mass text clustering method based on birch clustering
WO2022105178A1 (en) Keyword extraction method and related device
CN111709463B (en) Feature selection method based on index synergy measurement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
Application publication date: 20200114