CN102622373A - Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm - Google Patents


Info

Publication number
CN102622373A
Authority
CN
China
Prior art keywords
idf
classification
characteristic
module
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011100338086A
Other languages
Chinese (zh)
Other versions
CN102622373B (en)
Inventor
缪建明
丁泽亚
张全
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Original Assignee
Institute of Acoustics CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS filed Critical Institute of Acoustics CAS
Priority to CN2011100338086A priority Critical patent/CN102622373B/en
Publication of CN102622373A publication Critical patent/CN102622373A/en
Application granted granted Critical
Publication of CN102622373B publication Critical patent/CN102622373B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a statistical text classification method based on the TF*IDF algorithm. The method proposes a new feature-vector weighting scheme (TF*IDF*CIV) that introduces the concept information value (CIV) as a variable in the TF*IDF method, so that the concept information carried by a feature term enters the weight calculation. The weighting formula is:

W_ij = tf_ij × log(N / n_i) × [sim(c_i, C) + 1]

where the shared-concept count sim(c_i, C) is the number of concepts in the concept set c_i of feature term t_i that match concepts in the class-concept set C. The TF*IDF method is widely used to compute feature-vector weights, but it cannot represent the relatedness between feature terms and ignores the influence of their semantic association on the weights; the CIV term compensates for this deficiency. Experiments show that the new method effectively improves the accuracy of the whole text classification system.

Description

A statistical text classification system and method based on the TF*IDF algorithm
Technical field
The present invention relates to the field of computer science and technology, and in particular to a new method and device for computing feature-vector weights for text classification.
Background technology
With the rapid development and popularization of Internet and computer technology, large amounts of textual information now exist in computer-readable form, and automatic text classification by computer has emerged accordingly. At present, text classification technology is widely applied in research fields such as document indexing, harmful-content detection, topic identification, automatic summarization, and intelligent information retrieval.
Automatic classification dates back to the late 1950s, when H. P. Luhn carried out pioneering research in the field. In 1961 Maron published the first paper on automatic classification, and many well-known information scientists, such as Sparck Jones and Salton, subsequently conducted fruitful research in the area. In the 1980s text classification systems relied mainly on knowledge engineering: domain experts manually extracted a set of logical rules from their experience of classifying a given text collection, and these expert rules served as the basis for computer text classification. From the 1990s onward, statistical and machine-learning methods were introduced into automatic text classification; they achieved great success, gradually replaced the knowledge-engineering approach, and quickly became the mainstream. Machine-learning methods, however, take little account of the semantic information of a text; combining semantic analysis and concept networks with machine learning has yielded better classification results, with notable advantages in accuracy and stability. The text classification process is essentially as follows: the system uses training samples for feature selection and classifier-parameter training; a sample to be classified is formalized according to the selected features; it is then fed to the classifier for class judgment, which finally yields the class of the input sample.
At present, statistics-based text classification methods include: the naive Bayes classifier, the support vector machine (SVM) method based on the vector space model, the k-nearest-neighbor (kNN) method, neural networks (NNet), decision-tree classification, fuzzy classifiers, the Rocchio method, and Boosting algorithms. According to results reported by Yiming Yang of CMU, the SVM method based on the vector space model performs best; most of the other methods likewise require a feature vector of the text to be built first. The most common way to build the feature vector is the TF*IDF method (TF: term frequency; IDF: inverse document frequency), together with the various improved calculations built on it.
The document vector space model uses contextual information to quantitatively describe the semantic features of words and measures the semantic similarity between words by the distance between their vectors, effectively avoiding the data-sparseness problem that is unavoidable in traditional statistical methods. However, the vector space model treats each word component of the vector as an independent feature term and ignores the relatedness between feature terms, which keeps the accuracy of classifiers using the TF*IDF method from being satisfactory.
Summary of the invention
The object of the present invention is to overcome the low accuracy of text classifiers based on the TF*IDF algorithm — which arises because the current TF*IDF algorithm does not consider semantic similarity between words when computing feature-term weights — and to provide a statistical text classification system and method based on the TF*IDF algorithm.
To achieve the above object, the statistical text classification method based on the TF*IDF algorithm provided by the invention comprises the following steps:
1) collect corpus material and divide it into a training corpus and a test corpus;
2) classify and preprocess the training corpus;
3) extract the vocabulary of each domain from the training corpus, and extract the overall vocabulary at the same time;
4) summarize the concepts belonging to each class of the training corpus, use a concept dictionary to extract the concept set of each class, and form a class-concept set library used to compute the concept information value CIV;
5) perform feature selection on the test corpus to obtain feature-vector tables of different sizes;
6) use the feature-vector weighting algorithm (TF*IDF*CIV) to compute the weight of each feature word in the feature-vector table, by the following formula:
W_ij = tf_ij × log(N / n_i) × [sim(c_i, C) + 1];
wherein sim(c_i, C) + 1 is the concept information parameter CIV, and the shared-concept count sim(c_i, C) is the number of concepts in the concept set c_i corresponding to any feature term t_i of the feature-vector table of step 5) that match concepts in the class-concept set C of step 4);
7) construct the corresponding text classifiers, use them to classify the test corpus, and obtain the classification results;
8) use an evaluation function to compute the performance parameters of the various classifiers, and determine the optimal feature-vector table from the classifier evaluation results.
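The weighting of step 6) can be sketched in Python as follows (a minimal sketch; the function and argument names are illustrative, not from the patent):

```python
import math

def tfidf_civ(tf_ij, n_total_docs, n_docs_with_term, shared_concepts):
    """TF*IDF*CIV weight: W_ij = tf_ij * log(N / n_i) * [sim(c_i, C) + 1].

    shared_concepts is sim(c_i, C): how many concepts the term's concept
    set c_i has in common with the class-concept set C.
    """
    idf = math.log(n_total_docs / n_docs_with_term)
    civ = shared_concepts + 1  # +1 keeps a nonzero weight when no concept is shared
    return tf_ij * idf * civ

# a term occurring 3 times in a document, present in 600 of 6000 training
# documents, and sharing 2 concepts with the class-concept set
w = tfidf_civ(3, 6000, 600, 2)  # = 3 * log(10) * 3
```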
In the above scheme, the preprocessing of step 2) is: removing unwanted information such as hyperlinks and advertisements from the web-page text, and segmenting the text into words.
In the above scheme, the feature selection of step 5) uses the information-gain method, which comprises the following substeps:
5-1) extract the vocabulary; after preprocessing, compute the information-gain value of each segmented word as a candidate feature, the information gain being the difference between the entropy of the documents when the feature is not considered and the entropy after the feature is considered, computed as follows:
Gain(t_i) = Entropy(S) − ExpectedEntropy(S|t_i)
= {−Σ_{j=1..M} P(C_j) · log P(C_j)} − {P(t_i) · [−Σ_{j=1..M} P(C_j|t_i) · log P(C_j|t_i)] + P(t̄_i) · [−Σ_{j=1..M} P(C_j|t̄_i) · log P(C_j|t̄_i)]}
wherein P(C_j) denotes the probability that a document of class C_j appears in the corpus, P(t_i) the probability that a document in the corpus contains feature term t_i, P(C_j|t_i) the conditional probability that a document belongs to class C_j given that it contains t_i, P(t̄_i) the probability that a document in the corpus does not contain t_i, P(C_j|t̄_i) the conditional probability that a document belongs to class C_j given that it does not contain t_i, and M the number of classes;
5-2) choose words of different counts from the information-gain ranking as the feature vector.
Based on the above method, the present invention also provides a statistical text classification system based on the TF*IDF algorithm, said system comprising:
a corpus collection and preprocessing module for collecting the training and test corpora from the Internet and preprocessing the corpus by removing hyperlinks, advertisements, and similar information and performing word segmentation;
a feature selection module for extracting the vocabulary of the corpus and selecting feature words of different counts according to the feature selection algorithm to form feature-word tables;
a feature-weight computing module for computing the feature weights;
a classification module for classifying the corpus texts; and
a classification optimization module for comparing the different classification results and finding the feature-word count that gives the best classification effect; characterized in that
the system further comprises a concept dictionary module and a class-concept library module;
the concept dictionary module stores the class information to which each concept belongs;
the class-concept library module stores the aggregate concept information of the different classes;
the feature-weight computing module uses the class-concept set information C obtained from the class-concept library module and applies the TF*IDF*CIV algorithm to compute the weights of the feature words of different counts;
wherein the formula of the TF*IDF*CIV algorithm is:
W_ij = tf_ij × log(N / n_i) × [sim(c_i, C) + 1]
where the shared-concept count sim(c_i, C) is the number of concepts in the concept set c_i corresponding to feature term t_i that match concepts in the class-concept set C in the concept library module.
The advantage of the present invention is that it adjusts the variables of the TF*IDF method by introducing the concept information value CIV: the concept information corresponding to each feature term is computed and used to adjust the weights of the different feature vectors, so that the relatedness between concepts compensates for the deficiency of the vector space model. Experiments prove that the improved method raises classification accuracy by 6.5 percentage points, which fully demonstrates its validity.
Description of drawings
Fig. 1 is a flow chart of the statistical text classification method based on the TF*IDF algorithm of the present invention;
Fig. 2 is a block diagram of the statistical text classification system based on the TF*IDF algorithm of the present invention.
Embodiment
The present invention is further described below with reference to the drawings and a specific embodiment.
To make effective use of knowledge-engineering knowledge within statistical text classification, the present invention provides a new method of computing vector weights. The method introduces a CIV variable into the TF*IDF method, yielding the improved TF*IDF*CIV (Term Frequency, Inverse Document Frequency, Concept Information Value) method. Experiments prove that the method effectively improves text-classification evaluation measures such as precision, recall, and F1.
As shown in Fig. 1, the flow of the statistical text classification method based on the TF*IDF algorithm is as follows:
Step 1: collect corpus material from the Internet, one part as the training corpus and the other as the test corpus. 16,000 texts collected from portal websites were downloaded, of which 6,000 serve as the training corpus, belonging to three classes with the following document counts: urban market management (1,019), disaster events (2,215), and other (2,766); the remaining 10,000 serve as the test corpus.
Step 2: divide the training corpus into classes, remove unwanted hyperlinks, advertisements, and similar information, and segment the body text into words to obtain each document's word-string sequence.
Step 3: take each document from the training corpus and extract its words to form the overall vocabulary; at the same time, summarize the concepts belonging to each class, use the concept dictionary to extract each class's concept set, and form the concept-set library of the three classes, used to compute the concept information value CIV.
Step 4: compute the information gain according to the amount of information between each word and the classified texts, and select different thresholds to obtain feature-vector tables of different sizes (1000, 2000, 3000, 4000, 5000, 6000).
Step 5: use the feature-vector weighting method TF*IDF*CIV to compute the feature-word weights, where the concept information value is the shared-concept count sim(c_i, C) + 1, and sim(c_i, C) is the number of concepts in the concept set c_i of feature term t_i that match concepts in the class-concept set C.
Step 6: construct the corresponding text classifiers.
Step 7: classify the test texts and obtain the classification results under feature vectors of different sizes.
Step 8: compute the performance evaluation parameters of the classifiers.
Step 9: determine the optimal number of feature vectors for the system according to the value of the evaluation function.
As shown in Fig. 2, the statistical text classification system based on the TF*IDF algorithm comprises:
a corpus collection and preprocessing module for collecting the training and test corpora from the Internet and preprocessing the corpus by removing hyperlinks, advertisements, and similar information and performing word segmentation;
a feature selection module for extracting the vocabulary of the corpus and selecting feature words of different counts according to the feature selection algorithm to form feature-word tables;
a feature-weight computing module for computing the feature weights;
a classification module for classifying the corpus texts; and
a classification optimization module for comparing the different classification results and finding the feature-word count that gives the best classification effect; characterized in that
the system further comprises a concept dictionary module and a class-concept library module;
the concept dictionary module stores the class information to which each concept belongs;
the class-concept library module stores the aggregate concept information of the different classes;
the feature-weight computing module uses the class-concept set information C obtained from the class-concept library module and applies the TF*IDF*CIV algorithm to compute the weights of the feature words of different counts;
wherein the formula of the TF*IDF*CIV algorithm is:
W_ij = tf_ij × log(N / n_i) × [sim(c_i, C) + 1]
where the shared-concept count sim(c_i, C) is the number of concepts in the concept set c_i corresponding to feature term t_i that match concepts in the class-concept set C in the concept library module.
The detailed issues involved in the technical scheme of the present invention are explained below:
1. Corpus selection:
Download a sufficient corpus from portal websites, set aside part of it as the training corpus, and classify it by category. The division into classes should be as reasonable as possible, and the data of the classes as balanced as possible.
2. Feature selection:
In the vector space model, the feature items representing a text may be characters, words, phrases, or even multiple elements such as "concepts". Here we adopt words, which experiments have proved to be the most common and most effective feature items for text classification. Many feature selection methods exist; common ones include the document frequency (DF) method, the information gain (IG) method, the χ² statistic (CHI) method, and the mutual information (MI) method. The main task of feature selection is to settle two questions: first, which words to choose as feature items; second, how many words to choose. We use the information-gain method, with the following steps:
1) Extract the vocabulary. After the preprocessing stages such as word segmentation are finished, compute the information-gain value of each word as a candidate feature. The information-gain method weighs the importance of a feature item t_i by how much information it provides for the overall classification, and decides accordingly whether to keep it. The information gain of a feature item t_i is the difference between the information provided for the overall classification with and without the feature, where the amount of information is measured by entropy; that is, the information gain is the entropy of the documents when no feature is considered minus the entropy of the documents after this feature is considered:
Gain(t_i) = Entropy(S) − ExpectedEntropy(S|t_i)
= {−Σ_{j=1..M} P(C_j) · log P(C_j)} − {P(t_i) · [−Σ_{j=1..M} P(C_j|t_i) · log P(C_j|t_i)] + P(t̄_i) · [−Σ_{j=1..M} P(C_j|t̄_i) · log P(C_j|t̄_i)]}
In the formula, P(C_j) denotes the probability that a document of class C_j appears in the corpus, P(t_i) the probability that a document in the corpus contains feature term t_i, P(C_j|t_i) the conditional probability that a document belongs to class C_j given that it contains t_i, P(t̄_i) the probability that a document in the corpus does not contain t_i, P(C_j|t̄_i) the conditional probability that a document belongs to class C_j given that it does not contain t_i, and M the number of classes.
2) Choose words of different counts from the information-gain ranking as the feature vector (for convenience of later computation, integer multiples of 100 are generally chosen as the total number of features, e.g. 1000, 2000, 3000, 4000, 5000, 6000).
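The two substeps above can be sketched as follows (a sketch under the assumption that documents are given as bags of words with class labels; all names are illustrative):

```python
import math
from collections import Counter

def information_gain(docs, term):
    """Gain(t_i) = Entropy(S) - ExpectedEntropy(S|t_i) over labeled documents.

    docs: list of (set_of_words, class_label) pairs.
    """
    def entropy(labels):
        total = len(labels)
        return -sum((c / total) * math.log(c / total)
                    for c in Counter(labels).values())

    all_labels = [label for _, label in docs]
    with_t = [label for words, label in docs if term in words]
    without_t = [label for words, label in docs if term not in words]
    p_t = len(with_t) / len(docs)
    gain = entropy(all_labels)
    if with_t:                       # entropy of documents containing the term
        gain -= p_t * entropy(with_t)
    if without_t:                    # entropy of documents not containing it
        gain -= (1 - p_t) * entropy(without_t)
    return gain

def select_features(docs, vocabulary, k):
    """Keep the k words with the highest information gain as the feature vector."""
    return sorted(vocabulary, key=lambda t: information_gain(docs, t), reverse=True)[:k]
```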
3. Feature-weight computation:
The TF*IDF weighting method was proposed by Salton in 1973 and is defined as follows: the weight W_ij of feature term t_i in text D_j is:
W_ij = tf_ij × log(N / n_i)
where tf_ij denotes the frequency of feature term t_i in training text D_j, n_i is the number of documents in the training set containing t_i, and N is the total number of documents in the training set. That is, the weight W_ij of feature term t_i in text D_j equals its total frequency in document D_j multiplied by the logarithm of its inverse document frequency over the entire document collection.
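Salton's base weighting can be sketched as (a minimal sketch; names are illustrative):

```python
import math

def tfidf(tf_ij, n_total_docs, n_docs_with_term):
    """Classic TF*IDF: W_ij = tf_ij * log(N / n_i)."""
    return tf_ij * math.log(n_total_docs / n_docs_with_term)

# a term occurring 5 times in a document and appearing in 100 of 10000 training docs
w = tfidf(5, 10000, 100)  # = 5 * log(100)
```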
Our proposed improvement to the TF*IDF method consists mainly in introducing the CIV variable, which effectively computes the concept information corresponding to a feature term and uses it to adjust the feature-vector weights. The reasons are as follows:
In methods that represent documents with the vector space model, two factors have proved crucial to obtaining effective feature-term weights: first, the frequency of the feature term in a single document; second, the distribution of the feature term over the entire document collection. The TF*IDF method uses the absolute term frequency TF for the first factor; but some feature terms have a very high frequency yet very weak discriminating power (for example, many everyday words), while others have a lower frequency but very strong discriminating power, so an adjustment on top of TF is needed, and the IDF variable is introduced. The second factor is expressed by the inverse document frequency IDF, whose weight varies inversely with the number of documents containing the feature; in the extreme case, only a feature occurring in a single document attains the highest IDF value. The fewer documents a feature term occurs in, the larger its weight. The second factor in fact considers the distribution of feature terms over the whole class: feature terms with good discriminating power naturally have larger IDF values.
However, the vector space model, widely used in text representation, has an inherent defect in how it represents text. The model uses contextual information to quantitatively describe the semantic features of vocabulary and measures semantic similarity between words by the distance between vectors, avoiding the data-sparseness problem; but it treats each word component of the vector as an independent feature item and isolates the relatedness between words. This defect inevitably carries over into the TF*IDF feature-weight computation. We therefore introduce a third variable, the concept information value CIV, to remedy it. From the viewpoint of set theory, we hold that the information shared by two concept sets can be computed from the number of equal concepts in the two sets: the more identical concepts two sets contain, the more information they share, the larger their semantic information, and the greater their semantic similarity. We therefore compute the concept information of a feature term from the number of concepts common to the concept set c_i corresponding to the feature term's word and the concept set C corresponding to the class. Because the two sets may share no concepts at all, the concept information value is defined as the shared-concept count sim(c_i, C) + 1. This yields the following feature-word weight formula:
W_ij = tf_ij × log(N / n_i) × [sim(c_i, C) + 1]
The variables in the formula are as follows: W_ij is the weight of feature term t_i in text D_j; tf_ij is the frequency of t_i in training text D_j; n_i is the number of documents in the training set containing t_i; N is the total number of documents in the training set; sim(c_i, C) is the number of shared concepts between the concept set c_i corresponding to feature term t_i and the concept set C corresponding to the class. By this method, every document in the training corpus obtains the weight values of its corresponding feature-term words. For example, the concept set C of the disaster class can be reduced by experts to a total of 33 concepts. If the feature vector contains the feature term "fire", whose word corresponds to the concept-symbol string {3228~, 3d01}, both concepts lie in the disaster-class concept set C, so sim(c_i, C) for the feature term "fire" is 2 and its CIV value is 3. If instead the feature vector contains the feature term "working as one man", whose concept-symbol string is {43e01 (cooperation), j60c43 (fully)}, neither concept is included in the concept set C, so sim(c_i, C) for "working as one man" is 0 and its concept information value CIV is 1.
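The "fire" example above amounts to a set intersection; a minimal sketch follows (the concept symbols are simplified stand-ins for the patent's concept-dictionary strings):

```python
def civ(term_concepts, class_concepts):
    """Concept information value CIV = sim(c_i, C) + 1, where sim(c_i, C)
    is the number of concepts shared by the term's concept set c_i and
    the class-concept set C."""
    return len(term_concepts & class_concepts) + 1

# a toy disaster-class concept set C (the real set in the experiment has 33 concepts)
C = {"3228", "3d01", "43e02"}

assert civ({"3228", "3d01"}, C) == 3      # "fire": both concepts shared, sim = 2
assert civ({"43e01", "j60c43"}, C) == 1   # "working as one man": no concept shared
```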
4. Classifier construction:
The k-nearest-neighbor method is adopted, with the decision rule:
y(x̄, C_j) = Σ_{d_i ∈ kNN} sim(x̄, d̄_i) · y(d̄_i, C_j) − b_j
In the formula, y(d̄_i, C_j) takes the value 0 or 1: the value 1 means that training document d̄_i belongs to class C_j, and 0 that it does not; sim(x̄, d̄_i) denotes the similarity between the test document x̄ and the training document d̄_i; b_j is the threshold of the binary decision.
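The decision rule can be sketched as follows, with cosine similarity standing in for sim(x̄, d̄_i) — a common choice for vector-space models, though the patent does not fix one; all names here are illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as dicts term -> weight."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_score(x, training_docs, target_class, k, b):
    """y(x, C_j) = sum over the k nearest neighbours of sim(x, d_i) * y(d_i, C_j) - b_j;
    assign x to class C_j when the score is positive.

    training_docs: list of (sparse_vector, class_label) pairs.
    """
    neighbours = sorted(training_docs, key=lambda d: cosine(x, d[0]), reverse=True)[:k]
    score = sum(cosine(x, vec) for vec, label in neighbours if label == target_class)
    return score - b

x = {"fire": 2.0, "flood": 1.0}
train = [({"fire": 1.0}, "disaster"),
         ({"flood": 1.0}, "disaster"),
         ({"price": 1.0}, "market")]
assert knn_score(x, train, "disaster", k=2, b=0.5) > 0  # both neighbours are disaster docs
```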
5. Evaluation function:
For different purposes, many functions for evaluating text-classifier performance have been proposed, including recall, precision, the F-measure, micro- and macro-averaging, the break-even point, and 11-point average precision. At present the two internationally popular functions for judging text-classification effectiveness are the micro-F1 and macro-F1 measures [Yang, 1997]. The former is the more widely used, and is defined as follows:
precision: p = Ncr / Nc; recall: r = Ncr / Ns
F1 measure: F1 = 2rp / (r + p)
where Nc is the number of texts classified into the class, Nr the number of texts rejected from the class, Ncr the number of texts correctly classified into the class, and Ns the number of texts that should be classified into the class.
For the feature-vector tables of different sizes, the one whose F1 value under the evaluation function is best gives the optimal feature-word count of this classification method on this corpus.
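Micro-averaging pools the per-class counts before taking precision and recall; a sketch using the Nc/Ncr/Ns counts defined above (the function name is illustrative):

```python
def micro_f1(per_class_counts):
    """Micro-averaged F1 from per-class (Ncr, Nc, Ns) triples:
    Ncr = correctly classified, Nc = classified into the class,
    Ns = texts that should be in the class."""
    ncr = sum(t[0] for t in per_class_counts)
    nc = sum(t[1] for t in per_class_counts)
    ns = sum(t[2] for t in per_class_counts)
    p = ncr / nc   # pooled precision
    r = ncr / ns   # pooled recall
    return 2 * r * p / (r + p)

# two classes: (correct, classified, should-be)
f1 = micro_f1([(8, 10, 10), (6, 10, 10)])  # p = r = 0.7 -> F1 = 0.7
```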
6. Experimental results:
The corpus used in this experiment comprises 16,000 texts collected from portal websites on the Internet, of which 6,000 serve as the training corpus, belonging to three classes with the following document counts: urban market management (1,019), disaster events (2,215), and other (2,766); the remaining 10,000 serve as the test corpus.
The following table shows the experimental results:
[Table: F1 of the TF*IDF and TF*IDF*CIV classifiers at feature counts N = 1000–6000]
The experiments show that the classification effect of the TF*IDF*CIV algorithm is better than that of the TF*IDF algorithm at every feature-item count; at N = 5000 the F1 value of classification improves by 6.5%, which fully proves the validity of the method. It should be noted that the embodiments introduced above are illustrative and not restrictive. Those skilled in the art will understand that any modification of, or equivalent substitution for, the technical scheme of the present invention that does not depart from its spirit and scope shall be encompassed within the scope of the claims of the present invention.

Claims (6)

1. A statistical text classification method based on the TF*IDF algorithm, the method comprising the following steps:
1) collecting corpus material and dividing it into a training corpus and a test corpus;
2) classifying and preprocessing the training corpus;
3) extracting the vocabulary of each domain from the training corpus, and extracting the overall vocabulary at the same time;
4) summarizing the concepts belonging to each class of the training corpus, using a concept dictionary to extract the concept set of each class, and forming a class-concept set library C used to compute the concept information value CIV;
5) performing feature selection on the test corpus to obtain feature-vector tables of different sizes;
6) using the feature-vector weighting algorithm (TF*IDF*CIV) to compute the weight of each feature word in the feature-vector table by the following formula:
W_ij = tf_ij × log(N / n_i) × [sim(c_i, C) + 1];
wherein sim(c_i, C) + 1 is the concept information parameter CIV, and the shared-concept count sim(c_i, C) is the number of concepts in the concept set c_i corresponding to any feature term t_i of the feature-vector table of step 5) that match concepts in the class-concept set C of step 4);
7) constructing the corresponding text classifiers, using them to classify the test corpus, and obtaining the classification results;
8) using an evaluation function to compute the performance parameters of the various classifiers, and determining the optimal feature-vector table from the classifier evaluation results.
2. The statistical text classification method based on the TF*IDF algorithm according to claim 1, characterized in that the preprocessing of step 2) is: removing unwanted hyperlinks and advertising information from the web-page text, and segmenting the text into words.
3. The statistical text classification method based on the TF*IDF algorithm according to claim 1 or 2, wherein the feature selection of step 5) adopts the information gain method, the information gain method comprising the following substeps:
5-1) extracting the vocabulary and, after preprocessing, calculating the information gain value of each segmented word as a candidate feature, the information gain value being the difference between the entropy of the documents when the feature is not considered and the entropy of the documents after the feature is considered, computed as:
Gain(t_i) = Entropy(S) − ExpectedEntropy(S_{t_i})
= {−Σ_{j=1..M} P(C_j) · log P(C_j)} − {P(t_i) · [−Σ_{j=1..M} P(C_j|t_i) · log P(C_j|t_i)] + P(t̄_i) · [−Σ_{j=1..M} P(C_j|t̄_i) · log P(C_j|t̄_i)]};
wherein P(C_j) denotes the probability that a document of class C_j appears in the corpus; P(t_i) denotes the probability that a document in the corpus contains the feature item t_i; P(C_j|t_i) denotes the conditional probability that a document belongs to class C_j given that it contains the feature item t_i; P(t̄_i) denotes the probability that a document in the corpus does not contain the feature item t_i; P(C_j|t̄_i) denotes the conditional probability that a document belongs to class C_j given that it does not contain the feature item t_i; and M denotes the number of classes;
5-2) ranking the words by information gain value and selecting different numbers of top-ranked words as feature vectors.
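The information gain of substep 5-1) can be sketched in Python as follows, computing Gain(t_i) = Entropy(S) − ExpectedEntropy(S_{t_i}) from labeled training documents. Function and parameter names are illustrative, not from the patent.

```python
import math
from collections import Counter

def information_gain(labels, has_term):
    """Information gain of a feature word t_i.

    labels   -- class label C_j of each training document
    has_term -- parallel booleans: does the document contain t_i?
    """
    def entropy(subset):
        # -sum P(C_j) * log2 P(C_j) over the classes in the subset
        n = len(subset)
        if n == 0:
            return 0.0
        return -sum((c / n) * math.log(c / n, 2)
                    for c in Counter(subset).values())

    n = len(labels)
    with_t = [l for l, h in zip(labels, has_term) if h]
    without_t = [l for l, h in zip(labels, has_term) if not h]
    p_t = len(with_t) / n  # P(t_i)
    # Entropy(S) minus the expected entropy after splitting on t_i
    return entropy(labels) - (p_t * entropy(with_t)
                              + (1 - p_t) * entropy(without_t))
```

A term that perfectly separates the classes yields a gain equal to the full entropy of the label set, while a term distributed independently of the classes yields a gain of zero.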
4. The statistical text classification method based on the TF*IDF algorithm according to claim 1, wherein the classifier adopts the k-nearest neighbor method.
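Claim 4 only names the k-nearest neighbor method; a minimal sketch follows, assuming cosine similarity between weight vectors as the distance measure (the patent does not specify one) and majority vote among the k nearest training documents.

```python
import math
from collections import Counter

def cosine(u, v):
    # cosine similarity of two equal-length weight vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_classify(train, test_vec, k=3):
    """Label a test document by majority vote among the k training
    vectors most similar to it.

    train    -- list of (weight_vector, label) pairs
    test_vec -- weight vector of the document to classify
    """
    neighbors = sorted(train, key=lambda p: cosine(p[0], test_vec),
                       reverse=True)[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]
```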
5. The statistical text classification method based on the TF*IDF algorithm according to claim 1, wherein the evaluation function adopts the micro-F1 measure function.
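The micro-F1 measure of claim 5 can be sketched as follows, using the standard micro-averaging convention (pool the true-positive, false-positive and false-negative counts over all classes before computing precision and recall); the function name and input layout are illustrative.

```python
def micro_f1(per_class_counts):
    """Micro-averaged F1 over all classes.

    per_class_counts -- iterable of (tp, fp, fn) triples, one per class
    """
    # pool the counts across classes, then compute a single P/R/F1
    tp = sum(c[0] for c in per_class_counts)
    fp = sum(c[1] for c in per_class_counts)
    fn = sum(c[2] for c in per_class_counts)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because the counts are pooled first, micro-F1 weights every document equally, so large classes dominate the score; this makes it a natural criterion for choosing the best feature vector table in step 8).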
6. A statistical text classification system based on the TF*IDF algorithm, the system comprising: a corpus collection and preprocessing module, a feature selection module, a feature weight calculation module, a classification module, and a classification optimization module;
the corpus collection and preprocessing module is used to collect the training corpus and test corpus from the Internet, to remove hyperlinks and advertising information from the corpus, and to perform word segmentation preprocessing;
the feature selection module is used to extract the vocabulary of the corpus and to select feature words of different numbers from it according to a feature selection algorithm, forming feature vocabularies;
the feature weight calculation module is used to calculate feature weights;
the classification module is used to classify the corpus texts; and
the classification optimization module is used to compare different classification results and find the number of feature words that yields the best classification performance; characterized in that
the system further comprises: a concept dictionary module and a category concept library module;
the concept dictionary is used to store the category information to which each concept belongs;
the category concept library module is used to store the total concept set information C under the different categories;
the feature weight calculation module uses the concept set information C obtained from the category concept library module and adopts the TF*IDF*CIV algorithm to calculate the weights of the feature words of different numbers;
wherein the formula of the TF*IDF*CIV algorithm is:
W_ij = tf_ij × log(N / n_i) × [sim(c_i, C) + 1];
and the shared-concept count sim(c_i, C) in this formula is the number of concepts in the concept set c_i corresponding to the feature item t_i that match concepts in the category concept set C of the category concept library module.
CN2011100338086A 2011-01-31 2011-01-31 Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm Expired - Fee Related CN102622373B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011100338086A CN102622373B (en) 2011-01-31 2011-01-31 Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011100338086A CN102622373B (en) 2011-01-31 2011-01-31 Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm

Publications (2)

Publication Number Publication Date
CN102622373A true CN102622373A (en) 2012-08-01
CN102622373B CN102622373B (en) 2013-12-11

Family

ID=46562296

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011100338086A Expired - Fee Related CN102622373B (en) 2011-01-31 2011-01-31 Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm

Country Status (1)

Country Link
CN (1) CN102622373B (en)

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103106275A (en) * 2013-02-08 2013-05-15 西北工业大学 Text classification character screening method based on character distribution information
CN103425735A (en) * 2013-06-06 2013-12-04 深圳市宜搜科技发展有限公司 Establishing method and system based on website subject term inquiry
CN103593339A (en) * 2013-11-29 2014-02-19 哈尔滨工业大学深圳研究生院 Electronic-book-oriented semantic space representing method and system
CN103927302A (en) * 2013-01-10 2014-07-16 阿里巴巴集团控股有限公司 Text classification method and system
CN104424308A (en) * 2013-09-04 2015-03-18 中兴通讯股份有限公司 Web page classification standard acquisition method and device and web page classification method and device
WO2015135452A1 (en) * 2014-03-14 2015-09-17 Tencent Technology (Shenzhen) Company Limited Text information processing method and apparatus
CN105045812A (en) * 2015-06-18 2015-11-11 上海高欣计算机系统有限公司 Text topic classification method and system
CN106844424A (en) * 2016-12-09 2017-06-13 宁波大学 A kind of file classification method based on LDA
CN107085608A (en) * 2017-04-21 2017-08-22 上海喆之信息科技有限公司 A kind of effective network hotspot monitoring system
CN107194617A (en) * 2017-07-06 2017-09-22 北京航空航天大学 A kind of app software engineers soft skill categorizing system and method
CN104391835B (en) * 2014-09-30 2017-09-29 中南大学 Feature Words system of selection and device in text
CN107301248A (en) * 2017-07-19 2017-10-27 百度在线网络技术(北京)有限公司 Term vector construction method and device, computer equipment, the storage medium of text
CN107562814A (en) * 2017-08-14 2018-01-09 中国农业大学 A kind of earthquake emergency and the condition of a disaster acquisition of information sorting technique and system
CN108228546A (en) * 2018-01-19 2018-06-29 北京中关村科金技术有限公司 A kind of text feature, device, equipment and readable storage medium storing program for executing
CN108376130A (en) * 2018-03-09 2018-08-07 长安大学 A kind of objectionable text information filtering feature selection approach
CN108461111A (en) * 2018-03-16 2018-08-28 重庆医科大学 Chinese medical treatment text duplicate checking method and device, electronic equipment, computer read/write memory medium
CN108509410A (en) * 2017-02-27 2018-09-07 广东神马搜索科技有限公司 Text semantic similarity calculating method, device and user terminal
CN108509407A (en) * 2017-02-27 2018-09-07 广东神马搜索科技有限公司 Text semantic similarity calculating method, device and user terminal
CN108509471A (en) * 2017-05-19 2018-09-07 苏州纯青智能科技有限公司 A kind of Chinese Text Categorization
CN108898274A (en) * 2018-05-30 2018-11-27 国网浙江省电力有限公司宁波供电公司 A kind of power scheduling log defect classification method
CN110019817A (en) * 2018-12-04 2019-07-16 阿里巴巴集团控股有限公司 A kind of detection method, device and the electronic equipment of text in video information
CN110287328A (en) * 2019-07-03 2019-09-27 广东工业大学 A kind of file classification method, device, equipment and computer readable storage medium
CN111259649A (en) * 2020-01-19 2020-06-09 深圳壹账通智能科技有限公司 Interactive data classification method and device of information interaction platform and storage medium
CN112269880A (en) * 2020-11-04 2021-01-26 吾征智能技术(北京)有限公司 Sweet text classification matching system based on linear function
CN113032573A (en) * 2021-04-30 2021-06-25 《中国学术期刊(光盘版)》电子杂志社有限公司 Large-scale text classification method and system combining theme semantics and TF-IDF algorithm

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6704905B2 (en) * 2000-12-28 2004-03-09 Matsushita Electric Industrial Co., Ltd. Text classifying parameter generator and a text classifier using the generated parameter

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
张运良, 张全: "Research on Automatic Text Classification Based on the Sentence-Category Vector Space Model", Computer Engineering (《计算机工程》) *
缪建明, 张全, 赵金仿: "Automatic Chinese Text Classification Based on Article Title Information", Computer Engineering (《计算机工程》) *
蔡银珊, 黄英铭: "Automatic Web Page Classification Based on an Improved TF-IDF Feature Weighting Algorithm", Journal of Mianyang Normal University (《绵阳师范学院学报》) *
赵金仿, 赵艳, 缪建明: "Web Page Information Extraction and the Implementation of Automatic Text Classification", Computer Technology and Development (《计算机技术与发展》) *

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103927302A (en) * 2013-01-10 2014-07-16 阿里巴巴集团控股有限公司 Text classification method and system
CN103927302B (en) * 2013-01-10 2017-05-31 阿里巴巴集团控股有限公司 A kind of file classification method and system
CN103106275B (en) * 2013-02-08 2016-02-10 西北工业大学 The text classification Feature Selection method of feature based distributed intelligence
CN103106275A (en) * 2013-02-08 2013-05-15 西北工业大学 Text classification character screening method based on character distribution information
CN103425735A (en) * 2013-06-06 2013-12-04 深圳市宜搜科技发展有限公司 Establishing method and system based on website subject term inquiry
CN103425735B (en) * 2013-06-06 2017-08-11 深圳市宜搜科技发展有限公司 A kind of method for building up and system based on website subject term inquiry
CN104424308A (en) * 2013-09-04 2015-03-18 中兴通讯股份有限公司 Web page classification standard acquisition method and device and web page classification method and device
CN103593339A (en) * 2013-11-29 2014-02-19 哈尔滨工业大学深圳研究生院 Electronic-book-oriented semantic space representing method and system
WO2015135452A1 (en) * 2014-03-14 2015-09-17 Tencent Technology (Shenzhen) Company Limited Text information processing method and apparatus
US10262059B2 (en) 2014-03-14 2019-04-16 Tencent Technology (Shenzhen) Company Limited Method, apparatus, and storage medium for text information processing
CN104391835B (en) * 2014-09-30 2017-09-29 中南大学 Feature Words system of selection and device in text
CN105045812A (en) * 2015-06-18 2015-11-11 上海高欣计算机系统有限公司 Text topic classification method and system
CN105045812B (en) * 2015-06-18 2019-01-29 上海高欣计算机系统有限公司 The classification method and system of text subject
CN106844424B (en) * 2016-12-09 2020-11-03 宁波大学 LDA-based text classification method
CN106844424A (en) * 2016-12-09 2017-06-13 宁波大学 A kind of file classification method based on LDA
CN108509407A (en) * 2017-02-27 2018-09-07 广东神马搜索科技有限公司 Text semantic similarity calculating method, device and user terminal
CN108509407B (en) * 2017-02-27 2022-03-18 阿里巴巴(中国)有限公司 Text semantic similarity calculation method and device and user terminal
CN108509410B (en) * 2017-02-27 2022-08-05 阿里巴巴(中国)有限公司 Text semantic similarity calculation method and device and user terminal
CN108509410A (en) * 2017-02-27 2018-09-07 广东神马搜索科技有限公司 Text semantic similarity calculating method, device and user terminal
CN107085608A (en) * 2017-04-21 2017-08-22 上海喆之信息科技有限公司 A kind of effective network hotspot monitoring system
CN108509471A (en) * 2017-05-19 2018-09-07 苏州纯青智能科技有限公司 A kind of Chinese Text Categorization
CN107194617A (en) * 2017-07-06 2017-09-22 北京航空航天大学 A kind of app software engineers soft skill categorizing system and method
CN107301248B (en) * 2017-07-19 2020-07-21 百度在线网络技术(北京)有限公司 Word vector construction method and device of text, computer equipment and storage medium
CN107301248A (en) * 2017-07-19 2017-10-27 百度在线网络技术(北京)有限公司 Term vector construction method and device, computer equipment, the storage medium of text
CN107562814A (en) * 2017-08-14 2018-01-09 中国农业大学 A kind of earthquake emergency and the condition of a disaster acquisition of information sorting technique and system
CN108228546A (en) * 2018-01-19 2018-06-29 北京中关村科金技术有限公司 A kind of text feature, device, equipment and readable storage medium storing program for executing
CN108376130A (en) * 2018-03-09 2018-08-07 长安大学 A kind of objectionable text information filtering feature selection approach
CN108461111A (en) * 2018-03-16 2018-08-28 重庆医科大学 Chinese medical treatment text duplicate checking method and device, electronic equipment, computer read/write memory medium
CN108898274A (en) * 2018-05-30 2018-11-27 国网浙江省电力有限公司宁波供电公司 A kind of power scheduling log defect classification method
CN110019817A (en) * 2018-12-04 2019-07-16 阿里巴巴集团控股有限公司 A kind of detection method, device and the electronic equipment of text in video information
CN110287328A (en) * 2019-07-03 2019-09-27 广东工业大学 A kind of file classification method, device, equipment and computer readable storage medium
CN111259649A (en) * 2020-01-19 2020-06-09 深圳壹账通智能科技有限公司 Interactive data classification method and device of information interaction platform and storage medium
CN112269880A (en) * 2020-11-04 2021-01-26 吾征智能技术(北京)有限公司 Sweet text classification matching system based on linear function
CN112269880B (en) * 2020-11-04 2024-02-09 吾征智能技术(北京)有限公司 Sweet text classification matching system based on linear function
CN113032573A (en) * 2021-04-30 2021-06-25 《中国学术期刊(光盘版)》电子杂志社有限公司 Large-scale text classification method and system combining theme semantics and TF-IDF algorithm
CN113032573B (en) * 2021-04-30 2024-01-23 同方知网数字出版技术股份有限公司 Large-scale text classification method and system combining topic semantics and TF-IDF algorithm

Also Published As

Publication number Publication date
CN102622373B (en) 2013-12-11

Similar Documents

Publication Publication Date Title
CN102622373B (en) Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm
CN101819601B (en) Method for automatically classifying academic documents
CN104750844B (en) Text eigenvector based on TF-IGM generates method and apparatus and file classification method and device
CN100533441C (en) Two-stage combined file classification method based on probability subject
CN104951548B (en) A kind of computational methods and system of negative public sentiment index
Al Qadi et al. Arabic text classification of news articles using classical supervised classifiers
CN102194013A (en) Domain-knowledge-based short text classification method and text classification system
CN107577785A (en) A kind of level multi-tag sorting technique suitable for law identification
CN101763431A (en) PL clustering method based on massive network public sentiment information
CN105760493A (en) Automatic work order classification method for electricity marketing service hot spot 95598
CN103886108B (en) The feature selecting and weighing computation method of a kind of unbalanced text set
CN104391835A (en) Method and device for selecting feature words in texts
CN101587493A (en) Text classification method
CN103995876A (en) Text classification method based on chi square statistics and SMO algorithm
CN109522544A (en) Sentence vector calculation, file classification method and system based on Chi-square Test
CN103473262A (en) Automatic classification system and automatic classification method for Web comment viewpoint on the basis of association rule
CN105975518A (en) Information entropy-based expected cross entropy feature selection text classification system and method
CN106156372A (en) The sorting technique of a kind of internet site and device
CN102629272A (en) Clustering based optimization method for examination system database
CN106570076A (en) Computer text classification system
CN101976270A (en) Uncertain reasoning-based text hierarchy classification method and device
CN104809229A (en) Method and system for extracting text characteristic words
CN109376235A (en) The feature selection approach to be reordered based on document level word frequency
CN105224689A (en) A kind of Dongba document sorting technique
CN111708865A (en) Technology forecasting and patent early warning analysis method based on improved XGboost algorithm

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20131211

Termination date: 20170131

CF01 Termination of patent right due to non-payment of annual fee