CN110688481A - Text classification feature selection method based on chi-square statistic and IDF - Google Patents

Text classification feature selection method based on chi-square statistic and IDF Download PDF

Info

Publication number
CN110688481A
Authority
CN
China
Prior art keywords
text
word
chi
idf
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910821594.5A
Other languages
Chinese (zh)
Inventor
李帅
郑少波
杨玉龙
冯建巩
朱义杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou Aerospace Institute of Measuring and Testing Technology
Original Assignee
Guizhou Aerospace Institute of Measuring and Testing Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou Aerospace Institute of Measuring and Testing Technology filed Critical Guizhou Aerospace Institute of Measuring and Testing Technology
Priority to CN201910821594.5A priority Critical patent/CN110688481A/en
Publication of CN110688481A publication Critical patent/CN110688481A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification

Abstract

The disclosure relates to a text classification feature selection method based on the chi-square statistic and IDF, comprising the following steps: dividing the collected text into two parts, a training text set and a test text set, and labeling the training text set by category; performing word segmentation on the training text set to obtain feature words, and calculating, for each feature word t, its chi-square statistic, the number of texts Dt in which it appears, its IDF value and its CHI-IDF(t, c) value. The invention has the advantages that the method is simple to implement, and that combining the characteristics of the chi-square statistic and IDF yields an improved CHI-IDF feature selection method which improves on the feature selection effect of the traditional chi-square statistic and effectively raises the selection probability of feature words.

Description

Text classification feature selection method based on chi-square statistic and IDF
Technical Field
The invention relates to a text classification feature selection method based on chi-square statistic and IDF.
Background
At present, self-media outlets, news applications and information distribution platforms or APPs keep multiplying and the volume of information grows continuously; how to manage this information effectively and extract valuable information from it has become an important subject. As an important text analysis and mining technology, text classification is widely used for processing and mining text information.
Text classification is a supervised text processing method: text categories are set in advance, a classification algorithm model is selected, and a text set is used for training, thereby achieving the goal of classifying texts. Text classification is an important direction in the field of natural language processing and is widely applied to text information processing and mining, information retrieval, public opinion analysis and the like. In the overall classification process, the training text set is usually converted into a vector representation and then used to train the classification algorithm model; feature words are selected to reduce the dimension of the feature vectors when the training text set is vectorized, because the higher the vector dimension, the greater the computational cost and the weaker the generalization capability of the model. Text classification therefore requires reducing the vector dimension while ensuring that the selected feature words carry good weights and strong class discrimination for classification.
The essence of feature word selection is vector dimension reduction, and the contribution of a feature word to classification is mainly reflected in its classification weight. Feature word selection computes a classification weight for each feature word, keeps the feature words with large weights and removes those with smaller weights. Current feature selection methods mainly include document frequency, information gain and the chi-square statistic, but they share a significant defect: the selection probability of effective feature words is not high, which affects classification accuracy.
Disclosure of Invention
The invention aims to provide a text classification feature selection method based on chi-square statistic and IDF, which effectively improves the selection probability of feature words.
In order to solve the above technical problem, the invention adopts the following technical scheme: a text classification feature selection method based on the chi-square statistic and IDF, characterized by comprising the following steps: dividing the collected text into two parts, a training text set and a test text set, and labeling the training text set by category;
performing word segmentation on the obtained training text set to obtain feature words t;
calculating the chi-square statistic of each feature word t according to the following formula:
χ²(t, c) = N(AD − BC)² / [(A + B)(C + D)(A + C)(B + D)]
where A denotes the number of documents that belong to category c and contain the word t, B denotes the number of documents that do not belong to category c but contain the word t, C denotes the number of documents that belong to category c but do not contain the word t, D denotes the number of documents that neither belong to category c nor contain the word t, and N denotes the total number of documents in the training corpus, i.e., N = A + B + C + D;
calculating the number of texts Dt in which the feature word t appears;
calculating the IDF value of the feature word t;
calculating the CHI-IDF(t, c) value of each feature word and selecting a certain number of feature words according to the text scale;
converting the training text set into a vector model composed of the selected feature words, and then training a classification algorithm model to obtain a text classifier;
and converting the test text set into a vector model composed of the feature words, inputting it into the text classifier for classification, and then obtaining the predicted category of each test text.
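For illustration only, the scoring part of these steps can be sketched in Python as follows. The sketch assumes that CHI-IDF(t, c) is the product of the chi-square statistic and the IDF value (consistent with the embodiments described below), and plain whitespace splitting stands in for the Chinese word segmentation described later; it is not an authoritative implementation of the claimed method.

```python
# Minimal sketch of the CHI-IDF scoring steps (illustrative assumptions only).
import math
from collections import Counter, defaultdict

def chi_square(A, B, C, D):
    """Chi-square statistic of one word/category pair from the 2x2 counts."""
    N = A + B + C + D
    denom = (A + B) * (C + D) * (A + C) * (B + D)
    return 0.0 if denom == 0 else N * (A * D - B * C) ** 2 / denom

def chi_idf_scores(texts, labels):
    """CHI-IDF(t, c) for every word t and category c of a labeled text set.
    Assumes CHI-IDF = chi-square * IDF; texts are whitespace-segmented here."""
    doc_words = [set(text.split()) for text in texts]
    N = len(doc_words)                                     # total number of texts
    Dt = Counter(w for words in doc_words for w in words)  # texts containing each word
    scores = defaultdict(dict)
    for c in set(labels):
        in_c = [words for words, y in zip(doc_words, labels) if y == c]
        out_c = [words for words, y in zip(doc_words, labels) if y != c]
        for t in Dt:
            A = sum(t in words for words in in_c)    # in category c, contains t
            B = sum(t in words for words in out_c)   # not in c, contains t
            C = len(in_c) - A                        # in c, lacks t
            D = len(out_c) - B                       # not in c, lacks t
            scores[c][t] = chi_square(A, B, C, D) * math.log2(N / Dt[t])
    return scores
```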
Compared with the prior art, the invention has the following beneficial technical effects:
The method is simple to implement: the collected text is divided into a training text set and a test text set, the training text set is labeled by category, word segmentation is performed on the training text set to obtain feature words, and the chi-square statistic, the number of texts Dt in which each feature word t appears, the IDF value and the CHI-IDF(t, c) value are calculated. Analysis of the chi-square feature extraction method shows that the chi-square statistic tends to overlook feature items with low document frequency, whereas the inverse document frequency IDF selects such low-document-frequency feature words well; combining the two therefore improves feature selection.
Drawings
FIG. 1 is a flow chart of the text classification feature extraction method based on chi-squared statistics and IDF of the present invention.
Fig. 2 is a schematic diagram of an embodiment of the method shown in fig. 1.
Detailed Description
The present invention will be described in further detail with reference to specific embodiments, but these examples are only illustrative and do not limit the scope of the present invention.
Referring to fig. 1 and fig. 2, in the present invention the collected text is divided into two parts, a training text set and a test text set, and the training text set is labeled by category;
performing word segmentation on the obtained training text set to obtain feature words t;
calculating the chi-square statistic of each feature word t according to the following formula:
χ²(t, c) = N(AD − BC)² / [(A + B)(C + D)(A + C)(B + D)]
where A denotes the number of documents that belong to category c and contain the word t, B denotes the number of documents that do not belong to category c but contain the word t, C denotes the number of documents that belong to category c but do not contain the word t, D denotes the number of documents that neither belong to category c nor contain the word t, and N denotes the total number of documents in the training corpus, i.e., N = A + B + C + D;
calculating the number of texts Dt in which the feature word t appears;
calculating the IDF value of the feature word t;
calculating the CHI-IDF(t, c) value of each feature word and selecting a certain number of feature words according to the text scale;
converting the training text set into a vector model composed of the selected feature words, and then training a classification algorithm model to obtain a text classifier;
and converting the test text set into a vector model composed of the feature words, inputting it into the text classifier for classification, and then obtaining the predicted category of each test text.
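As a concrete illustration of the quantities A, B, C, D and the chi-square formula above (the numbers are hypothetical, not taken from the patent), the snippet below counts the four parameters for one word and one category in a toy corpus of four segmented texts and evaluates the formula:

```python
# Toy illustration of how A, B, C, D are counted for one word and one category.
docs = [("sports",  {"match", "ball"}),    # (category label, set of segmented words)
        ("sports",  {"match", "score"}),
        ("finance", {"stock", "match"}),
        ("finance", {"stock", "bank"})]

t, c = "match", "sports"
A = sum(1 for label, words in docs if label == c and t in words)      # 2
B = sum(1 for label, words in docs if label != c and t in words)      # 1
C = sum(1 for label, words in docs if label == c and t not in words)  # 0
D = sum(1 for label, words in docs if label != c and t not in words)  # 1
N = A + B + C + D                                                     # 4
chi2 = N * (A * D - B * C) ** 2 / ((A + B) * (C + D) * (A + C) * (B + D))
print(A, B, C, D, chi2)   # 2 1 0 1 -> chi2 = 4 * (2 - 0)**2 / (3 * 1 * 2 * 2) = 1.333...
```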
In one embodiment, the IDF value of the feature word t is calculated according to the formula
IDF(t) = log₂(D / Dt)
wherein D represents the total number of texts, Dt represents the number of texts containing the word t, and the logarithm is taken to base 2.
In one embodiment, the CHI-IDF(t, c) value of each feature word is calculated as the product of its chi-square statistic and its IDF value, i.e.
CHI-IDF(t, c) = χ²(t, c) × IDF(t).
In one embodiment, the training text set is obtained through a web crawler, the Modern Chinese Corpus of the State Language Commission, and the Sogou corpus.
In one embodiment, the word segmentation process is performed using the ICTCLAS package of the Chinese Academy of Sciences.
In one embodiment, the word segmentation process further comprises removing stop words and abnormal symbols.
In one embodiment, obtaining the feature words t comprises: obtaining the participle set W = {w1, w2, w3, …, wn} of the training text set and the text category set C = {c1, c2, c3, …, cm}, where wn represents a feature word in the participle set and cm represents a category of the training text set.
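The preprocessing and set construction described in the above embodiments can be sketched as follows. The jieba segmenter is used here only as a readily available stand-in for the ICTCLAS package named in the text, and the stop-word list is a tiny illustrative placeholder:

```python
# Sketch of segmentation, cleaning and building the sets W and C (assumptions:
# jieba replaces ICTCLAS; the stop-word list is illustrative, not exhaustive).
import re
import jieba

STOP_WORDS = {"的", "了", "是", "在"}

def preprocess(text):
    """Segment a text, then remove stop words and abnormal symbols."""
    tokens = jieba.lcut(text)
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [t for t in tokens if re.fullmatch(r"[\w\u4e00-\u9fff]+", t)]
    return tokens

def build_sets(train_texts, train_labels):
    """Participle set W = {w1, ..., wn} and category set C = {c1, ..., cm}."""
    W = set()
    for text in train_texts:
        W.update(preprocess(text))
    C = set(train_labels)
    return W, C
```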
In one embodiment, calculating the CHI-IDF(t, c) value of each feature word and selecting a certain number of feature words according to the text scale specifically includes the following steps:
calculating the CHI-IDF(t, c) value of each feature word and sorting the feature words by this value;
and selecting a certain number of feature words as the text representation vector space according to the text scale and the total number of segmented words.
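One possible realization of this ranking and truncation step is sketched below; the per-category scores dictionary matches the earlier scoring sketch, and the 10% ratio used to tie the number of kept feature words to the corpus scale is an assumption rather than a value given in the patent:

```python
def choose_k(total_segmented_words, ratio=0.1, minimum=100):
    """Assumed rule linking the number of kept feature words to the text scale."""
    return max(minimum, int(total_segmented_words * ratio))

def top_k_features(chi_idf_scores, k):
    """Sort feature words by CHI-IDF(t, c) and keep the k highest per category."""
    selected = set()
    for category, scores in chi_idf_scores.items():
        ranked = sorted(scores, key=scores.get, reverse=True)
        selected.update(ranked[:k])
    return sorted(selected)
```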
As a specific embodiment, the text classification feature selection method based on the chi-square statistic and IDF of the present invention specifically includes the following steps:
Step (1): acquiring a training text set through a web crawler, the Modern Chinese Corpus of the State Language Commission, the Sogou corpus and the like;
Step (2): performing word segmentation on the acquired training text set using the ICTCLAS package of the Chinese Academy of Sciences;
Step (3): processing the word segmentation result by removing stop words, abnormal symbols and the like;
Step (4): repeating steps (1) to (3) until all texts have been processed;
Step (5): obtaining the participle set W = {w1, w2, w3, …, wn} of the training text set and the text category set C = {c1, c2, c3, …, cm}, where wn represents a feature word in the participle set and cm represents a category of the training text set;
Step (6): calculating the counting parameters of each feature word t, where A represents the number of documents that belong to category c and contain the word t, B represents the number of documents that do not belong to category c but contain the word t, C represents the number of documents that belong to category c but do not contain the word t, and D represents the number of documents that neither belong to category c nor contain the word t;
Step (7): repeating step (6) until the counting parameters of all feature words have been obtained;
Step (8): using the counting parameters obtained in step (7) to calculate the chi-square statistic of the feature word t, which reflects the degree of correlation between the feature word t and the category c (the higher the value, the stronger the correlation), according to the formula
χ²(t, c) = N(AD − BC)² / [(A + B)(C + D)(A + C)(B + D)]
where N represents the total number of texts in the training corpus, i.e., N = A + B + C + D;
Step (9): calculating the number of texts Dt in which the feature word t appears;
Step (10): repeating step (9) until the number of texts in which each feature word appears has been calculated;
Step (11): calculating the IDF value of the feature word t according to the formula
IDF(t) = log₂(D / Dt)
where D represents the total number of texts, Dt represents the number of texts containing the word t, and the logarithm is taken to base 2;
Step (12): combining the chi-square statistic with the IDF method:
CHI-IDF(t, c) = χ²(t, c) × IDF(t)
where the meaning of each symbol is consistent with steps (5), (6), (8) and (11).
The invention improves on the basic chi-square statistic. By introducing the IDF (inverse document frequency) method, it remedies the chi-square statistic's insufficient selection of low-frequency feature words; compared with the unimproved chi-square feature extraction method, it improves the ability to select feature words and thereby improves the text classification effect.
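Plugging hypothetical counts (not taken from the patent) into steps (6) to (12) shows how the three quantities combine, again assuming the product form of CHI-IDF:

```python
import math

# Hypothetical counts for one feature word t and one category c (steps (6)-(8)).
A, B, C, D = 40, 10, 60, 290
N = A + B + C + D                                   # 400 training texts in total
chi2 = N * (A * D - B * C) ** 2 / ((A + B) * (C + D) * (A + C) * (B + D))

# Steps (9)-(11): the word appears in Dt = A + B = 50 of the 400 texts;
# the "D" in the IDF formula is the total text count, i.e. N here.
Dt = A + B
idf = math.log2(N / Dt)                             # log2(400 / 50) = 3.0

# Step (12): combine the two scores (assumed to be a simple product).
chi_idf = chi2 * idf
print(round(chi2, 2), idf, round(chi_idf, 2))       # 92.19 3.0 276.57
```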
As a modified embodiment, the text classification feature selection method based on the chi-square statistic and IDF of the present invention includes the following steps:
Step (1): dividing the collected text into two parts, a training text set and a test text set, and labeling the training text set by category;
Step (2): performing word segmentation, stop-word removal and abnormal-symbol removal on the training text set, specifically:
Step 2.1: performing word segmentation on the training text set using the ICTCLAS package of the Chinese Academy of Sciences;
Step 2.2: removing stop words and abnormal symbols from the word segmentation result;
Step (3): calculating the chi-square statistic of each feature word;
Step (4): calculating the IDF value of each feature word;
Step (5): calculating the CHI-IDF(t, c) value of each feature word and selecting a certain number of feature words according to the text scale, specifically:
Step 5.1: calculating the CHI-IDF(t, c) value of each feature word and sorting the feature words by this value;
Step 5.2: selecting a certain number of feature words as the text representation vector space according to the text scale and the total number of segmented words;
Step (6): converting the training texts into vector models composed of the selected feature words, and then training a classification algorithm model to obtain a text classifier;
Step (7): converting the test text set into vector models composed of the feature words, inputting them into the text classifier trained in step (6) for classification, and then obtaining the predicted category of each test text.
The invention has the following beneficial technical effects:
The method is simple to implement: the collected text is divided into a training text set and a test text set, the training text set is labeled by category, word segmentation is performed on the training text set to obtain feature words, and the chi-square statistic, the number of texts Dt in which each feature word t appears, the IDF value and the CHI-IDF(t, c) value are calculated. Analysis of the chi-square feature extraction method shows that the chi-square statistic tends to overlook feature items with low document frequency, whereas the inverse document frequency IDF selects such low-document-frequency feature words well; combining the two therefore improves feature selection.
While the invention has been described with reference to preferred embodiments, it is not limited thereto, and obviously the embodiments cannot be enumerated exhaustively here. Those skilled in the art may make variations and modifications using the design and content of the above disclosed embodiments without departing from the spirit and scope of the present invention; therefore, any simple modification, parameter change or adaptation of the above embodiments based on the substance of the present invention shall fall within the protection scope of the present invention.

Claims (8)

1. A text classification feature selection method based on chi-square statistic and IDF is characterized by comprising the following steps:
dividing the collected text into two parts, wherein one part is a training text set and the other part is a test text set, and labeling the training text set by category;
performing word segmentation on the obtained training text set to obtain feature words t;
calculating the chi-square statistic of each feature word t according to the following formula:
χ²(t, c) = N(AD − BC)² / [(A + B)(C + D)(A + C)(B + D)]
wherein A denotes the number of documents that belong to category c and contain the word t, B denotes the number of documents that do not belong to category c but contain the word t, C denotes the number of documents that belong to category c but do not contain the word t, D denotes the number of documents that neither belong to category c nor contain the word t, and N denotes the total number of documents in the training corpus, i.e., N = A + B + C + D;
calculating the number of texts Dt in which the feature word t appears;
calculating the IDF value of the feature word t;
calculating the CHI-IDF(t, c) value of each feature word and selecting a certain number of feature words according to the text scale;
converting the training text set into a vector model composed of the selected feature words, and then training a classification algorithm model to obtain a text classifier;
and converting the test text set into a vector model composed of the feature words, inputting it into the text classifier for classification, and then obtaining the predicted category of each test text.
2. The method of claim 1, wherein the IDF value of the feature word t is calculated according to the formula:
IDF(t) = log₂(D / Dt)
wherein D represents the total number of texts, Dt represents the number of texts containing the word t, and the logarithm is taken to base 2.
3. The method of claim 2, wherein the CHI-IDF(t, c) value of each feature word is calculated according to the formula:
CHI-IDF(t, c) = χ²(t, c) × IDF(t).
4. The method of claim 3, wherein the training text set is obtained from a web crawler, the Modern Chinese Corpus of the State Language Commission, and the Sogou corpus.
5. The method as claimed in claim 3, wherein the word segmentation process is performed using the ICTCLAS package of the Chinese Academy of Sciences.
6. The method of claim 5, wherein the word segmentation process further comprises: removing stop words and abnormal symbols.
7. The method of claim 6, wherein obtaining the feature words t comprises: obtaining the participle set W = {w1, w2, w3, …, wn} of the training text set and the text category set C = {c1, c2, c3, …, cm}, wherein wn represents a feature word in the participle set and cm represents a category of the training text set.
8. The text classification feature selection method based on the chi-square statistic and IDF as claimed in claim 1, wherein calculating the CHI-IDF(t, c) value of each feature word and selecting a certain number of feature words according to the text scale specifically comprises the following steps:
calculating the CHI-IDF(t, c) value of each feature word and sorting the feature words by this value;
and selecting a certain number of feature words as the text representation vector space according to the text scale and the total number of segmented words.
CN201910821594.5A 2019-09-02 2019-09-02 Text classification feature selection method based on chi-square statistic and IDF Pending CN110688481A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910821594.5A CN110688481A (en) 2019-09-02 2019-09-02 Text classification feature selection method based on chi-square statistic and IDF

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910821594.5A CN110688481A (en) 2019-09-02 2019-09-02 Text classification feature selection method based on chi-square statistic and IDF

Publications (1)

Publication Number Publication Date
CN110688481A true CN110688481A (en) 2020-01-14

Family

ID=69108803

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910821594.5A Pending CN110688481A (en) 2019-09-02 2019-09-02 Text classification feature selection method based on chi-square statistic and IDF

Country Status (1)

Country Link
CN (1) CN110688481A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112328787A (en) * 2020-11-04 2021-02-05 中国平安人寿保险股份有限公司 Text classification model training method and device, terminal equipment and storage medium
CN115345229A (en) * 2022-08-08 2022-11-15 航天神舟智慧系统技术有限公司 Fire-fighting risk dimension determination method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103995876A (en) * 2014-05-26 2014-08-20 上海大学 Text classification method based on chi square statistics and SMO algorithm
CN105512311A (en) * 2015-12-14 2016-04-20 北京工业大学 Chi square statistic based self-adaption feature selection method
CN109543037A (en) * 2018-11-21 2019-03-29 南京安讯科技有限责任公司 A kind of article classification method based on improved TF-IDF

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Yao Binxiu et al.: "A CRS-KNN Text Classification Algorithm Based on Canopy and Rough Sets", Computer Engineering and Applications *
Xu Guanhua et al.: "A Survey of Text Feature Extraction Methods", Software Guide *
Li Shuai, Chen Xiaorong: "A BPNN Short Text Classification Method with an Improved Chi-Square Statistic", Journal of Guizhou University (Natural Sciences) *
Liang Wuqi et al.: "A Category-Based CHI Feature Selection Method", Journal of Anhui Radio & TV University *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112328787A (en) * 2020-11-04 2021-02-05 中国平安人寿保险股份有限公司 Text classification model training method and device, terminal equipment and storage medium
CN112328787B (en) * 2020-11-04 2024-02-20 中国平安人寿保险股份有限公司 Text classification model training method and device, terminal equipment and storage medium
CN115345229A (en) * 2022-08-08 2022-11-15 航天神舟智慧系统技术有限公司 Fire-fighting risk dimension determination method

Similar Documents

Publication Publication Date Title
WO2017167067A1 (en) Method and device for webpage text classification, method and device for webpage text recognition
US20200311113A1 (en) Method and device for extracting core word of commodity short text
CN103995876A (en) Text classification method based on chi square statistics and SMO algorithm
CN104881458B (en) A kind of mask method and device of Web page subject
CN108710611B (en) Short text topic model generation method based on word network and word vector
CN106372111B (en) Local feature point screening method and system
WO2012156774A1 (en) Method and apparatus for detecting visual words which are representative of a specific image category
CN109993216B (en) Text classification method and device based on K nearest neighbor KNN
CN110826618A (en) Personal credit risk assessment method based on random forest
CN109271517A (en) IG TF-IDF Text eigenvector generates and file classification method
CN107357895B (en) Text representation processing method based on bag-of-words model
CN108228541A (en) The method and apparatus for generating documentation summary
CN110688481A (en) Text classification feature selection method based on chi-square statistic and IDF
CN103914551A (en) Method for extending semantic information of microblogs and selecting features thereof
CN107526792A (en) A kind of Chinese question sentence keyword rapid extracting method
CN109902173B (en) Chinese text classification method
CN113111645B (en) Media text similarity detection method
US11960521B2 (en) Text classification system based on feature selection and method thereof
CN112183093A (en) Enterprise public opinion analysis method, device, equipment and readable storage medium
CN109376241B (en) DenseNet-based telephone appeal text classification algorithm for power field
CN103034657A (en) Document abstract generating method and device
Lin et al. Classifying textual components of bilingual documents with decision-tree support vector machines
CN114266249A (en) Mass text clustering method based on birch clustering
WO2022105178A1 (en) Keyword extraction method and related device
CN111709463B (en) Feature selection method based on index synergy measurement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
Application publication date: 20200114