CN105701084A - Characteristic extraction method of text classification on the basis of mutual information - Google Patents
- Publication number
- CN105701084A CN105701084A CN201511018702.3A CN201511018702A CN105701084A CN 105701084 A CN105701084 A CN 105701084A CN 201511018702 A CN201511018702 A CN 201511018702A CN 105701084 A CN105701084 A CN 105701084A
- Authority
- CN
- China
- Prior art keywords
- text
- classification
- feature
- document
- log
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a feature extraction method for text classification based on mutual information. Text preprocessing mainly comprises removing document markup, removing stop words, word segmentation, part-of-speech tagging, word-frequency statistics, data cleaning, and the like, after which feature words are extracted according to a feature algorithm. In the text classification stage, model parameters are trained on the vectorized training set by a support vector machine algorithm, and the texts to be classified undergo machine-learning classification. By applying the scheme of the invention, noise features can be effectively kept out of the machine-learning pipeline during feature extraction for text classification, the precision of text classification is improved, the scale of the feature library is greatly reduced, and memory occupation is lowered.
Description
Technical field
The invention belongs to the technical field of natural language processing, and specifically relates to a feature extraction method for text classification based on mutual information.
Background technology
With the rapid development of the Internet, multimedia, and storage technology, more and more information (especially multimedia information) is generated, propagated, and accumulated. The Internet makes information dissemination easier, and individual users can easily find and download the information they want; large-capacity hard disks can store ever more of it. Even without counting the resources on the World Wide Web, the number of documents accumulated on a personal computer may reach tens of gigabytes. How to manage this information effectively and use it conveniently is a major problem for individual users. Statistics show that although multimedia information on the Internet keeps growing, text will remain the most important information source for the foreseeable future. Accordingly, the development of text information processing technology has not stagnated because of the rapid growth of multimedia information but, on the contrary, is flourishing. Text classification technology is a powerful means of organizing and managing text information. Text classification appeared as early as the 1960s, but only after the 1990s did it gradually become a research hotspot. Machine learning has become the main processing method: it automatically learns the characteristics of each class from a pre-classified text set and builds an automatic classifier, saving labor while achieving good results. Most current research therefore focuses on machine-learning-based text classification methods.
The basic task of text classification is to determine the relation between a document and the given classes according to the document's content, that is, to find in the given class set the class best suited to the current document. The connection between document and class can be regarded as a mapping; since a document may belong to several classes, the mapping can be one-to-one or one-to-many. The mapping rule is determined by learning from a given training document set and class set, and it varies with the learning method. When the system encounters a new document, the mapping rule determines the class the document belongs to. The difficulty of text classification is that the content of text is natural language, which makes it hard for a computer to process text semantically. At present, scholars apply methods from statistical analysis, machine learning, data mining, and related fields: by classifying text information based on its content, they automatically build user-friendly text classification systems, which can greatly reduce the human resources an organization spends on sorting documents and help users find the required information quickly. How to effectively keep noise features out of the machine-learning pipeline is therefore one of the most important research directions for improving the precision of text classification.
Summary of the invention
It is an object of the invention to overcome the deficiencies of the prior art and provide a feature extraction method for mutual-information-based text classification that can effectively keep noise features out of the machine-learning pipeline and improve the precision of text classification.
To solve the above technical problem, the invention adopts the following technical solution: a feature extraction method for text classification based on mutual information, comprising the following steps:
(a) Preprocess the training text:
Build a stop-word dictionary and a training text set; segment the training texts in the data set into words; after segmentation, filter out the stop words according to the stop-word dictionary; and POS-tag the segmented text;
(b) Perform feature extraction on the preprocessed text:
From the text preprocessed in step (a), compute the mutual information between each remaining term and each class according to formulas (1) and (2).
Formula (1) is:
I(U;C) = Σ_{et∈{1,0}} Σ_{ec∈{1,0}} P(U=et, C=ec) · log2[ P(U=et, C=ec) / (P(U=et) · P(C=ec)) ] …………(1)
where U is a term variable and C a class variable, both binary random variables: U takes the value et = 1 when the document contains term t and et = 0 otherwise; C takes the value ec = 1 when the document belongs to class c and ec = 0 otherwise.
Under maximum-likelihood estimation, all the probabilities above are computed from document counts; the practical formula is then as follows.
Formula (2) is:
I(U;C) = (N11/N)·log2[(N·N11)/(N1.·N.1)] + (N01/N)·log2[(N·N01)/(N0.·N.1)] + (N10/N)·log2[(N·N10)/(N1.·N.0)] + (N00/N)·log2[(N·N00)/(N0.·N.0)] …………(2)
where Nxy denotes the number of documents with et = x and ec = y, N1. = N10 + N11, N.1 = N11 + N01, and N = N00 + N01 + N10 + N11;
For each class, compute the mutual information between the class and each term, and select the k terms with the largest values;
Delete the terms repeated across classes; this screening yields the feature words;
(c) Assign weights to the feature words:
With the feature words obtained in step (b), count the frequency of each feature word in each document, the total number of documents, and the number of documents containing each feature word, and compute the weight of each feature according to formula (3).
Formula (3), the TF-IDF computing formula, is:
w(ti, d) = tf(ti, d) · log(N/ni + 0.01) / sqrt( Σj [tf(tj, d) · log(N/nj + 0.01)]² ) …………(3)
where tf(ti, d) is the frequency of feature (term) ti in document d, N is the total number of documents, ni is the number of documents containing term ti, 0.01 is a constant (its value generally taken as 0.01), log(N/ni + 0.01) is the inverse document frequency, and the denominator is a normalization factor; based on the training text set, the feature evaluation function TF-IDF scores each feature word ti;
(d) SVM model training and prediction
Vectorize each document, converting it into a term vector; the first dimension of the vector represents the document's class, and dimensions 2 through K hold the feature words and their weights; feed these vectors into the SVM model, train the model parameters, and then perform text prediction.
Detailed description of the invention
A specific embodiment of the invention is described below.
The feature extraction method for mutual-information-based text classification provided by the invention comprises the following steps:
1) Crawl a number of articles of each class from the Internet as the training data set of the text classification system;
2) Preprocess the training text: segment the training data set into words. The segmentation tool used is jieba ("stutter" segmentation), an open-source Chinese word segmentation module developed in Python. Then filter out stop words according to the stop-word dictionary, and POS-tag the segmented text with the jieba module.
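The preprocessing above can be sketched in Python. The patent names jieba for segmentation and POS tagging; the sketch below assumes the tokens have already been produced as (word, POS-tag) pairs (e.g. by jieba.posseg.cut) and shows only the stop-word filtering plus the noun/verb screen used in step 3). The function name and sample data are illustrative, not from the patent.

```python
def filter_tokens(tagged_tokens, stopwords, keep_pos=("n", "v")):
    """Drop stop words and keep only tokens whose POS tag starts with
    one of keep_pos (jieba tags nouns as 'n*' and verbs as 'v*')."""
    return [word for word, pos in tagged_tokens
            if word not in stopwords and pos[:1] in keep_pos]

# Hypothetical tagged output for a short sentence; with jieba installed
# this list would come from: jieba.posseg.cut(text)
tagged = [("文本", "n"), ("的", "uj"), ("分类", "n"), ("进行", "v"), ("是", "v")]
print(filter_tokens(tagged, stopwords={"的", "是"}))  # ['文本', '分类', '进行']
```

The particle "的" is dropped by the POS screen, and the stop word "是" by the dictionary, leaving only content-bearing nouns and verbs as feature candidates.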
3) Perform feature extraction on the preprocessed text: from the text preprocessed in step 2), keep only the words whose part of speech is noun or verb; this is the initial feature extraction. Then compute the mutual information between each remaining term and each class according to formulas (1) and (2) above,
where U is a term variable and C a class variable, both binary random variables: U takes the value et = 1 when the document contains term t and et = 0 otherwise, and C takes the value ec = 1 when the document belongs to class c and ec = 0 otherwise. Under maximum-likelihood estimation, all the probabilities are computed from the counts of terms over documents and classes, giving the practical formula (2),
where Nxy denotes the number of documents with et = x and ec = y. For example, N10 counts documents that contain term t (et = 1) but do not belong to class c (ec = 0); N1. = N10 + N11 is the number of documents containing term t; N.1 = N11 + N01 is the number of documents belonging to class c; and N = N00 + N01 + N10 + N11 is the total number of documents.
For each class, compute the mutual information between the class and each term and select the k terms with the largest values. Two classes may of course choose the same feature word, so repeated terms are removed. The result is the final set of feature words.
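The maximum-likelihood estimate of formula (2) can be computed directly from the four document counts. The sketch below is a minimal illustration; the function and variable names are the editor's own, not from the patent.

```python
import math

def mutual_information(n11, n10, n01, n00):
    """I(U;C) estimated from document counts (formula (2)):
    n11 = docs containing the term and in the class,
    n10 = containing the term, not in the class,
    n01 = not containing the term, in the class,
    n00 = neither."""
    n = n11 + n10 + n01 + n00
    n1_, n0_ = n11 + n10, n01 + n00   # marginals over et
    n_1, n_0 = n11 + n01, n10 + n00   # marginals over ec
    mi = 0.0
    for nxy, nx, ny in ((n11, n1_, n_1), (n01, n0_, n_1),
                        (n10, n1_, n_0), (n00, n0_, n_0)):
        if nxy:  # 0 * log(0) is taken as 0
            mi += (nxy / n) * math.log2(n * nxy / (nx * ny))
    return mi

# A term occurring in exactly the documents of one balanced class
# carries a full bit of information about that class:
print(mutual_information(50, 0, 0, 50))    # 1.0
# A term independent of the class carries none:
print(mutual_information(25, 25, 25, 25))  # 0.0
```

Per step 3), this score would be evaluated for every (term, class) pair; the k highest-scoring terms per class are kept and then deduplicated across classes.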
4) Assign weights to the feature words: with the feature words obtained, count the frequency of each feature word in each document, the total number of documents, and the number of documents containing each feature word, then compute the weight of each feature according to TF-IDF. TF-IDF is a statistical method for assessing the importance of a word to one document within a collection of N documents or a corpus.
The TF-IDF computing formula is:
w(ti, d) = tf(ti, d) · log(N/ni + 0.01) / sqrt( Σj [tf(tj, d) · log(N/nj + 0.01)]² ) …………(3)
where tf(ti, d) is the frequency of feature (term) ti in document d, N is the total number of documents, ni is the number of documents containing term ti, 0.01 is a constant (its value generally taken as 0.01), log(N/ni + 0.01) is the inverse document frequency, and the denominator is a normalization factor. Based on the training text set, the feature evaluation function TF-IDF scores each feature word ti.
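Formula (3) can be sketched as follows: weight each feature by term frequency times smoothed inverse document frequency, then length-normalize the document vector (the normalization denominator described above). Function and parameter names are illustrative.

```python
import math

def tfidf_weights(tfs, dfs, n_docs):
    """Weight the features of one document by formula (3):
    tf * log(N/df + 0.01), then normalize the vector to unit length.
    tfs[i] = frequency of feature i in this document,
    dfs[i] = number of documents containing feature i."""
    raw = [tf * math.log(n_docs / df + 0.01) for tf, df in zip(tfs, dfs)]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw] if norm else raw

# A frequent, rare term (tf=3, df=5) outweighs a common one (tf=1, df=80):
w = tfidf_weights(tfs=[3, 1, 0], dfs=[5, 80, 40], n_docs=100)
```

The 0.01 constant keeps the weight of a term appearing in every document slightly above zero rather than exactly zero, which is the usual motivation for this smoothed IDF variant.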
5) SVM model training and prediction: the support vector machine method is built on the VC-dimension theory and structural-risk-minimization principle of statistical learning theory. It seeks the optimal trade-off between model complexity and learning capacity given the limited sample information, so as to obtain the best generalization ability.
Vectorize each document, converting it into a term vector. The first dimension of the vector represents the document's class, and dimensions 2 through K hold the feature words and their weights (as described in step 3). Feed these vectors into the libSVM model, train the model parameters, and then perform text prediction. The model returns two results, label and score: label is the predicted class label, and score is the degree of membership of the sample in that class; the larger the score, the higher the confidence that the sample belongs to the class.
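The vector layout described above (class label first, then feature-index:weight pairs) matches the sparse text format that libSVM reads for training. A minimal serializer, with illustrative names, might look like:

```python
def to_libsvm_line(label, weights):
    """Serialize one document as a libSVM line: '<label> <i>:<w> ...'
    with 1-based feature indices; zero weights are omitted (sparse)."""
    pairs = " ".join(f"{i}:{w:g}" for i, w in enumerate(weights, 1) if w)
    return f"{label} {pairs}".rstrip()

print(to_libsvm_line(2, [0.5, 0.0, 0.25]))  # 2 1:0.5 3:0.25
```

Lines in this format can be written to a file and passed to libSVM's svm-train; the training and the label/score prediction described above are then handled by the libSVM tools themselves.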
In light of the disclosure and teachings of the above description, those skilled in the art may modify and amend the above embodiment. The invention is therefore not limited to the specific embodiment disclosed and described above; some modifications and changes to the invention shall also fall within the scope of the claims of the invention. In addition, although certain specific terms are used in this specification, they are intended only for convenience of description and do not constitute any limitation on the invention.
Claims (1)
1. A feature extraction method for text classification based on mutual information, characterized in that it comprises the following steps:
(a) Preprocess the training text:
Build a stop-word dictionary and a training text set; segment the training texts in the data set into words; after segmentation, filter out the stop words according to the stop-word dictionary; and POS-tag the segmented text;
(b) Perform feature extraction on the preprocessed text:
From the text preprocessed in step (a), compute the mutual information between each remaining term and each class according to formulas (1) and (2).
Formula (1) is:
I(U;C) = Σ_{et∈{1,0}} Σ_{ec∈{1,0}} P(U=et, C=ec) · log2[ P(U=et, C=ec) / (P(U=et) · P(C=ec)) ]
where U is a term variable and C a class variable, both binary random variables: U takes the value et = 1 when the document contains term t and et = 0 otherwise; C takes the value ec = 1 when the document belongs to class c and ec = 0 otherwise;
Under maximum-likelihood estimation, all the probabilities above are computed from document counts; the practical formula is then as follows.
Formula (2) is:
I(U;C) = (N11/N)·log2[(N·N11)/(N1.·N.1)] + (N01/N)·log2[(N·N01)/(N0.·N.1)] + (N10/N)·log2[(N·N10)/(N1.·N.0)] + (N00/N)·log2[(N·N00)/(N0.·N.0)]
where Nxy denotes the number of documents with et = x and ec = y, and N = N00 + N01 + N10 + N11;
For each class, compute the mutual information between the class and each term, and select the k terms with the largest values;
Delete the terms repeated across classes; this screening yields the feature words;
(c) Assign weights to the feature words:
With the feature words obtained in step (b), count the frequency of each feature word in each document, the total number of documents, and the number of documents containing each feature word, and compute the weight of each feature according to formula (3).
Formula (3), the TF-IDF computing formula, is:
w(ti, d) = tf(ti, d) · log(N/ni + 0.01) / sqrt( Σj [tf(tj, d) · log(N/nj + 0.01)]² )
where tf(ti, d) is the frequency of feature (term) ti in document d, N is the total number of documents, ni is the number of documents containing term ti, 0.01 is a constant (its value generally taken as 0.01), log(N/ni + 0.01) is the inverse document frequency, and the denominator is a normalization factor; based on the training text set, the feature evaluation function TF-IDF scores each feature word ti;
(d) SVM model training and prediction
Vectorize each document, converting it into a term vector; the first dimension of the vector represents the document's class, and dimensions 2 through K hold the feature words and their weights; feed these vectors into the SVM model, train the model parameters, and then perform text prediction.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201511018702.3A CN105701084A (en) | 2015-12-28 | 2015-12-28 | Characteristic extraction method of text classification on the basis of mutual information |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105701084A true CN105701084A (en) | 2016-06-22 |
Family
ID=56225995
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201511018702.3A Pending CN105701084A (en) | 2015-12-28 | 2015-12-28 | Characteristic extraction method of text classification on the basis of mutual information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105701084A (en) |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106294542A (en) * | 2016-07-25 | 2017-01-04 | 北京市信访矛盾分析研究中心 | A kind of letters and calls data mining methods of marking and system |
CN106502394A (en) * | 2016-10-18 | 2017-03-15 | 哈尔滨工业大学深圳研究生院 | Term vector computational methods and device based on EEG signals |
CN106557465A (en) * | 2016-11-15 | 2017-04-05 | 科大讯飞股份有限公司 | A kind of preparation method and device of word weight classification |
CN106709370A (en) * | 2016-12-31 | 2017-05-24 | 北京明朝万达科技股份有限公司 | Long word identification method and system based on text contents |
CN106776562A (en) * | 2016-12-20 | 2017-05-31 | 上海智臻智能网络科技股份有限公司 | A kind of keyword extracting method and extraction system |
CN106844424A (en) * | 2016-12-09 | 2017-06-13 | 宁波大学 | A kind of file classification method based on LDA |
CN106951498A (en) * | 2017-03-15 | 2017-07-14 | 国信优易数据有限公司 | Text clustering method |
CN107092644A (en) * | 2017-03-07 | 2017-08-25 | 重庆邮电大学 | A kind of Chinese Text Categorization based on MPI and Adaboost.MH |
CN107193804A (en) * | 2017-06-02 | 2017-09-22 | 河海大学 | A kind of refuse messages text feature selection method towards word and portmanteau word |
CN107562814A (en) * | 2017-08-14 | 2018-01-09 | 中国农业大学 | A kind of earthquake emergency and the condition of a disaster acquisition of information sorting technique and system |
CN107562928A (en) * | 2017-09-15 | 2018-01-09 | 南京大学 | A kind of CCMI text feature selections method |
CN107633882A (en) * | 2017-09-11 | 2018-01-26 | 合肥工业大学 | Mix the minimally invasive medical service system and its aid decision-making method under cloud framework |
CN107766323A (en) * | 2017-09-06 | 2018-03-06 | 淮阴工学院 | A kind of text feature based on mutual information and correlation rule |
CN108108346A (en) * | 2016-11-25 | 2018-06-01 | 广东亿迅科技有限公司 | The theme feature word abstracting method and device of document |
CN108874832A (en) * | 2017-05-15 | 2018-11-23 | 腾讯科技(深圳)有限公司 | Target, which is commented on, determines method and device |
CN109165284A (en) * | 2018-08-22 | 2019-01-08 | 重庆邮电大学 | A kind of financial field human-computer dialogue intension recognizing method based on big data |
CN109873755A (en) * | 2019-03-02 | 2019-06-11 | 北京亚鸿世纪科技发展有限公司 | A kind of refuse messages classification engine based on variant word identification technology |
CN110069630A (en) * | 2019-03-20 | 2019-07-30 | 重庆信科设计有限公司 | A kind of improved mutual information feature selection approach |
CN110413789A (en) * | 2019-07-31 | 2019-11-05 | 广西师范大学 | A kind of exercise automatic classification method based on SVM |
CN111104449A (en) * | 2019-12-18 | 2020-05-05 | 福州市勘测院 | Multisource city space-time standard address fusion method based on geographic space portrait mining |
CN113157912A (en) * | 2020-12-24 | 2021-07-23 | 航天科工网络信息发展有限公司 | Text classification method based on machine learning |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101067808A (en) * | 2007-05-24 | 2007-11-07 | 上海大学 | Text key word extracting method |
CN101404036A (en) * | 2008-11-07 | 2009-04-08 | 西安交通大学 | Keyword abstraction method for PowerPoint electronic demonstration draft |
CN101777347A (en) * | 2009-12-07 | 2010-07-14 | 中国科学院自动化研究所 | Model complementary Chinese accent identification method and system |
CN102033964A (en) * | 2011-01-13 | 2011-04-27 | 北京邮电大学 | Text classification method based on block partition and position weight |
CN102662923A (en) * | 2012-04-23 | 2012-09-12 | 天津大学 | Entity instance leading method based on machine learning |
CN103279478A (en) * | 2013-04-19 | 2013-09-04 | 国家电网公司 | Method for extracting features based on distributed mutual information documents |
CN103559174A (en) * | 2013-09-30 | 2014-02-05 | 东软集团股份有限公司 | Semantic emotion classification characteristic value extraction method and system |
CN103678274A (en) * | 2013-04-15 | 2014-03-26 | 南京邮电大学 | Feature extraction method for text categorization based on improved mutual information and entropy |
CN103793385A (en) * | 2012-10-29 | 2014-05-14 | 深圳市世纪光速信息技术有限公司 | Textual feature extracting method and device |
CN105183813A (en) * | 2015-08-26 | 2015-12-23 | 山东省计算中心(国家超级计算济南中心) | Mutual information based parallel feature selection method for document classification |
- 2015-12-28 CN CN201511018702.3A patent/CN105701084A/en active Pending
Non-Patent Citations (2)
Title |
---|
YAN XU ET AL: "A study on mutual information-based feature selection for text categorization", Journal of Computational Information Systems *
LIU Haifeng et al.: "An improved text feature selection based on mutual information" (in Chinese), Computer Engineering and Applications *
Cited By (33)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106294542B (en) * | 2016-07-25 | 2018-03-30 | 北京市信访矛盾分析研究中心 | A kind of letters and calls data mining methods of marking and system |
CN106294542A (en) * | 2016-07-25 | 2017-01-04 | 北京市信访矛盾分析研究中心 | A kind of letters and calls data mining methods of marking and system |
CN106502394A (en) * | 2016-10-18 | 2017-03-15 | 哈尔滨工业大学深圳研究生院 | Term vector computational methods and device based on EEG signals |
CN106502394B (en) * | 2016-10-18 | 2019-06-25 | 哈尔滨工业大学深圳研究生院 | Term vector calculation method and device based on EEG signals |
CN106557465A (en) * | 2016-11-15 | 2017-04-05 | 科大讯飞股份有限公司 | A kind of preparation method and device of word weight classification |
CN106557465B (en) * | 2016-11-15 | 2020-06-02 | 科大讯飞股份有限公司 | Method and device for obtaining word weight categories |
CN108108346B (en) * | 2016-11-25 | 2021-12-24 | 广东亿迅科技有限公司 | Method and device for extracting theme characteristic words of document |
CN108108346A (en) * | 2016-11-25 | 2018-06-01 | 广东亿迅科技有限公司 | The theme feature word abstracting method and device of document |
CN106844424A (en) * | 2016-12-09 | 2017-06-13 | 宁波大学 | A kind of file classification method based on LDA |
CN106844424B (en) * | 2016-12-09 | 2020-11-03 | 宁波大学 | LDA-based text classification method |
CN106776562A (en) * | 2016-12-20 | 2017-05-31 | 上海智臻智能网络科技股份有限公司 | A kind of keyword extracting method and extraction system |
CN106776562B (en) * | 2016-12-20 | 2020-07-28 | 上海智臻智能网络科技股份有限公司 | Keyword extraction method and extraction system |
CN106709370A (en) * | 2016-12-31 | 2017-05-24 | 北京明朝万达科技股份有限公司 | Long word identification method and system based on text contents |
CN106709370B (en) * | 2016-12-31 | 2019-10-29 | 北京明朝万达科技股份有限公司 | A kind of long word recognition method and system based on content of text |
CN107092644A (en) * | 2017-03-07 | 2017-08-25 | 重庆邮电大学 | A kind of Chinese Text Categorization based on MPI and Adaboost.MH |
CN106951498A (en) * | 2017-03-15 | 2017-07-14 | 国信优易数据有限公司 | Text clustering method |
CN108874832A (en) * | 2017-05-15 | 2018-11-23 | 腾讯科技(深圳)有限公司 | Target, which is commented on, determines method and device |
CN107193804A (en) * | 2017-06-02 | 2017-09-22 | 河海大学 | A kind of refuse messages text feature selection method towards word and portmanteau word |
CN107193804B (en) * | 2017-06-02 | 2019-03-29 | 河海大学 | A kind of refuse messages text feature selection method towards word and portmanteau word |
CN107562814A (en) * | 2017-08-14 | 2018-01-09 | 中国农业大学 | A kind of earthquake emergency and the condition of a disaster acquisition of information sorting technique and system |
CN107766323A (en) * | 2017-09-06 | 2018-03-06 | 淮阴工学院 | A kind of text feature based on mutual information and correlation rule |
CN107766323B (en) * | 2017-09-06 | 2021-08-31 | 淮阴工学院 | Text feature extraction method based on mutual information and association rule |
CN107633882B (en) * | 2017-09-11 | 2019-05-14 | 合肥工业大学 | Mix the minimally invasive medical service system and its aid decision-making method under cloud framework |
CN107633882A (en) * | 2017-09-11 | 2018-01-26 | 合肥工业大学 | Mix the minimally invasive medical service system and its aid decision-making method under cloud framework |
CN107562928B (en) * | 2017-09-15 | 2019-11-15 | 南京大学 | A kind of CCMI text feature selection method |
CN107562928A (en) * | 2017-09-15 | 2018-01-09 | 南京大学 | A kind of CCMI text feature selections method |
CN109165284A (en) * | 2018-08-22 | 2019-01-08 | 重庆邮电大学 | A kind of financial field human-computer dialogue intension recognizing method based on big data |
CN109873755A (en) * | 2019-03-02 | 2019-06-11 | 北京亚鸿世纪科技发展有限公司 | A kind of refuse messages classification engine based on variant word identification technology |
CN109873755B (en) * | 2019-03-02 | 2021-01-01 | 北京亚鸿世纪科技发展有限公司 | Junk short message classification engine based on variant word recognition technology |
CN110069630A (en) * | 2019-03-20 | 2019-07-30 | 重庆信科设计有限公司 | A kind of improved mutual information feature selection approach |
CN110413789A (en) * | 2019-07-31 | 2019-11-05 | 广西师范大学 | A kind of exercise automatic classification method based on SVM |
CN111104449A (en) * | 2019-12-18 | 2020-05-05 | 福州市勘测院 | Multisource city space-time standard address fusion method based on geographic space portrait mining |
CN113157912A (en) * | 2020-12-24 | 2021-07-23 | 航天科工网络信息发展有限公司 | Text classification method based on machine learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105701084A (en) | Characteristic extraction method of text classification on the basis of mutual information | |
CN107463607B (en) | Method for acquiring and organizing upper and lower relations of domain entities by combining word vectors and bootstrap learning | |
US9195647B1 (en) | System, methods, and data structure for machine-learning of contextualized symbolic associations | |
Fatima et al. | Text Document categorization using support vector machine | |
Alghamdi et al. | Arabic web pages clustering and annotation using semantic class features | |
CN104573030A (en) | Textual emotion prediction method and device | |
CN107463715A (en) | English social media account number classification method based on information gain | |
Banik et al. | Survey on text-based sentiment analysis of bengali language | |
Dung | Natural language understanding | |
Rabbimov et al. | Uzbek news categorization using word embeddings and convolutional neural networks | |
Sigit et al. | Comparison of Classification Methods on Sentiment Analysis of Political Figure Electability Based on Public Comments on Online News Media Sites | |
Liu | Automatic argumentative-zoning using word2vec | |
Chader et al. | Sentiment analysis in google play store: Algerian reviews case | |
CN110705285A (en) | Government affair text subject word bank construction method, device, server and readable storage medium | |
Zhang et al. | Grasp the implicit features: Hierarchical emotion classification based on topic model and SVM | |
CN115713085A (en) | Document theme content analysis method and device | |
Patra et al. | Multimodal mood classification-a case study of differences in hindi and western songs | |
CN103793491B (en) | Chinese news story segmentation method based on flexible semantic similarity measurement | |
Imran et al. | Twitter Sentimental Analysis using Machine Learning Approaches for SemeVal Dataset | |
Rohman et al. | Automatic detection of argument components in text using multinomial Nave Bayes clasiffier | |
Jiménez et al. | On Extracting Information from Semi-structured Deep Web Documents | |
Yu et al. | Automatic Sentiment Analysis System for Myanmar News | |
Li et al. | Predicting abstract keywords by word vectors | |
Franciscus et al. | Beyond word-cloud: A graph model derived from beliefs | |
Chen et al. | Incremental Patent Semantic Annotation Based on Keyword Extraction and List Extraction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20160622 |
|
RJ01 | Rejection of invention patent application after publication |