CN102622373A - Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm - Google Patents


Info

Publication number
CN102622373A
Authority
CN
China
Prior art keywords
idf
classification
characteristic
module
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011100338086A
Other languages
Chinese (zh)
Other versions
CN102622373B (en)
Inventor
缪建明
丁泽亚
张全
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Original Assignee
Institute of Acoustics CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS filed Critical Institute of Acoustics CAS
Priority to CN2011100338086A priority Critical patent/CN102622373B/en
Publication of CN102622373A publication Critical patent/CN102622373A/en
Application granted granted Critical
Publication of CN102622373B publication Critical patent/CN102622373B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a statistical text classification method based on the TF*IDF algorithm. The method proposes a new feature-vector weighting scheme (TF*IDF*CIV) that introduces the concept information value (CIV) as a variable in the TF*IDF method, so that the concept information carried by a feature term enters the weight calculation. The weighting formula is:

W_ij = tf_ij × log(N / n_i) × [sim(c_i, C) + 1]

where the shared-concept count sim(c_i, C) is the number of concepts in the concept set c_i of feature term t_i that match concepts in the class-concept set C. The TF*IDF method is widely used to compute feature-vector weights, but it cannot represent the relatedness between feature terms and ignores the influence of their semantic association on the weights; the CIV term compensates for this deficiency. Experiments show that the new method effectively improves the accuracy of the whole text classification system.

Description

A statistical text classification system and method based on the TF*IDF algorithm
Technical field
The present invention relates to the field of computer science and technology, and in particular to a new method and device for computing feature-vector weights for text classification.
Background technology
With the rapid development and popularization of Internet and computer technology, large amounts of textual information now exist in computer-readable form, and automatic text classification by computer has emerged accordingly. At present, text classification technology is widely applied in research fields such as document indexing, harmful-content detection, topic identification, automatic summarization, and intelligent information retrieval.
Automatic classification dates back to the late 1950s, when H. P. Luhn carried out pioneering research in the field. In 1961 Maron published the first paper on automatic classification, and many well-known information scientists, such as Sparck Jones and Salton, subsequently conducted fruitful research in the area. In the 1980s text classification systems relied mainly on knowledge engineering: domain experts manually extracted a set of logical rules from their experience of classifying a given text collection, and these expert rules served as the basis for computer text classification. From the 1990s onward, statistical and machine-learning methods were introduced into automatic text classification; they achieved great success, gradually replaced the knowledge-engineering approach, and quickly became the mainstream. Machine-learning methods, however, take little account of the semantic information of a text; combining semantic analysis and concept networks with machine learning has yielded better classification results, with notable advantages in accuracy and stability. The text classification process is essentially as follows: the system uses training samples for feature selection and classifier-parameter training; a sample to be classified is formalized according to the selected features; it is then fed to the classifier for class judgment, which finally yields the class of the input sample.
At present, statistics-based text classification methods include: the naive Bayes classifier, the support vector machine (SVM) method based on the vector space model, the k-nearest-neighbor (kNN) method, neural networks (NNet), decision-tree classification, fuzzy classifiers, the Rocchio method, and Boosting algorithms. According to results reported by Yiming Yang of CMU, the SVM method based on the vector space model performs best; most of the other methods likewise require a feature vector of the text to be built first. The most common way to build the feature vector is the TF*IDF method (TF: term frequency; IDF: inverse document frequency), together with the various improved calculations built on it.
The document vector space model uses contextual information to quantitatively describe the semantic features of words and measures the semantic similarity between words by the distance between their vectors, effectively avoiding the data-sparseness problem that is unavoidable in traditional statistical methods. However, the vector space model treats each word component of the vector as an independent feature term and ignores the relatedness between feature terms, which keeps the accuracy of classifiers using the TF*IDF method from being satisfactory.
Summary of the invention
The object of the present invention is to overcome the low accuracy of text classifiers based on the TF*IDF algorithm — which arises because the current TF*IDF algorithm does not consider semantic similarity between words when computing feature-term weights — and to provide a statistical text classification system and method based on the TF*IDF algorithm.
To achieve the above object, the statistical text classification method based on the TF*IDF algorithm provided by the invention comprises the following steps:
1) collect corpus material and divide it into a training corpus and a test corpus;
2) classify and preprocess the training corpus;
3) extract the vocabulary of each domain from the training corpus, and extract the overall vocabulary at the same time;
4) summarize the concepts belonging to each class of the training corpus, use a concept dictionary to extract the concept set of each class, and form a class-concept set library used to compute the concept information value CIV;
5) perform feature selection on the test corpus to obtain feature-vector tables of different sizes;
6) use the feature-vector weighting algorithm (TF*IDF*CIV) to compute the weight of each feature word in the feature-vector table, by the following formula:
W_ij = tf_ij × log(N / n_i) × [sim(c_i, C) + 1];
wherein sim(c_i, C) + 1 is the concept information parameter CIV, and the shared-concept count sim(c_i, C) is the number of concepts in the concept set c_i corresponding to any feature term t_i of the feature-vector table of step 5) that match concepts in the class-concept set C of step 4);
7) construct the corresponding text classifiers, use them to classify the test corpus, and obtain the classification results;
8) use an evaluation function to compute the performance parameters of the various classifiers, and determine the optimal feature-vector table from the classifier evaluation results.
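The weighting of step 6) can be sketched in Python as follows (a minimal sketch; the function and argument names are illustrative, not from the patent):

```python
import math

def tfidf_civ(tf_ij, n_total_docs, n_docs_with_term, shared_concepts):
    """TF*IDF*CIV weight: W_ij = tf_ij * log(N / n_i) * [sim(c_i, C) + 1].

    shared_concepts is sim(c_i, C): how many concepts the term's concept
    set c_i has in common with the class-concept set C.
    """
    idf = math.log(n_total_docs / n_docs_with_term)
    civ = shared_concepts + 1  # +1 keeps a nonzero weight when no concept is shared
    return tf_ij * idf * civ

# a term occurring 3 times in a document, present in 600 of 6000 training
# documents, and sharing 2 concepts with the class-concept set
w = tfidf_civ(3, 6000, 600, 2)  # = 3 * log(10) * 3
```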
In the above scheme, the preprocessing of step 2) is: removing unwanted information such as hyperlinks and advertisements from the web-page text, and segmenting the text into words.
In the above scheme, the feature selection of step 5) uses the information-gain method, which comprises the following substeps:
5-1) extract the vocabulary; after preprocessing, compute the information-gain value of each segmented word as a candidate feature, the information gain being the difference between the entropy of the documents when the feature is not considered and the entropy after the feature is considered, computed as follows:
Gain(t_i) = Entropy(S) − ExpectedEntropy(S|t_i)
= {−Σ_{j=1..M} P(C_j) · log P(C_j)} − {P(t_i) · [−Σ_{j=1..M} P(C_j|t_i) · log P(C_j|t_i)] + P(t̄_i) · [−Σ_{j=1..M} P(C_j|t̄_i) · log P(C_j|t̄_i)]}
wherein P(C_j) denotes the probability that a document of class C_j appears in the corpus, P(t_i) the probability that a document in the corpus contains feature term t_i, P(C_j|t_i) the conditional probability that a document belongs to class C_j given that it contains t_i, P(t̄_i) the probability that a document in the corpus does not contain t_i, P(C_j|t̄_i) the conditional probability that a document belongs to class C_j given that it does not contain t_i, and M the number of classes;
5-2) choose words of different counts from the information-gain ranking as the feature vector.
Based on the above method, the present invention also provides a statistical text classification system based on the TF*IDF algorithm, said system comprising:
a corpus collection and preprocessing module for collecting the training and test corpora from the Internet and preprocessing the corpus by removing hyperlinks, advertisements, and similar information and performing word segmentation;
a feature selection module for extracting the vocabulary of the corpus and selecting feature words of different counts according to the feature selection algorithm to form feature-word tables;
a feature-weight computing module for computing the feature weights;
a classification module for classifying the corpus texts; and
a classification optimization module for comparing the different classification results and finding the feature-word count that gives the best classification effect; characterized in that
the system further comprises a concept dictionary module and a class-concept library module;
the concept dictionary module stores the class information to which each concept belongs;
the class-concept library module stores the aggregate concept information of the different classes;
the feature-weight computing module uses the class-concept set information C obtained from the class-concept library module and applies the TF*IDF*CIV algorithm to compute the weights of the feature words of different counts;
wherein the formula of the TF*IDF*CIV algorithm is:
W_ij = tf_ij × log(N / n_i) × [sim(c_i, C) + 1]
where the shared-concept count sim(c_i, C) is the number of concepts in the concept set c_i corresponding to feature term t_i that match concepts in the class-concept set C in the concept library module.
The advantage of the present invention is that it adjusts the variables of the TF*IDF method by introducing the concept information value CIV: the concept information corresponding to each feature term is computed and used to adjust the weights of the different feature vectors, so that the relatedness between concepts compensates for the deficiency of the vector space model. Experiments prove that the improved method raises classification accuracy by 6.5 percentage points, which fully demonstrates its validity.
Description of drawings
Fig. 1 is a flow chart of the statistical text classification method based on the TF*IDF algorithm of the present invention;
Fig. 2 is a block diagram of the statistical text classification system based on the TF*IDF algorithm of the present invention.
Embodiment
The present invention is further described below with reference to the drawings and a specific embodiment.
To make effective use of knowledge-engineering knowledge within statistical text classification, the present invention provides a new method of computing vector weights. The method introduces a CIV variable into the TF*IDF method, yielding the improved TF*IDF*CIV (Term Frequency, Inverse Document Frequency, Concept Information Value) method. Experiments prove that the method effectively improves text-classification evaluation measures such as precision, recall, and F1.
As shown in Fig. 1, the flow of the statistical text classification method based on the TF*IDF algorithm is as follows:
Step 1: collect corpus material from the Internet, one part as the training corpus and the other as the test corpus. 16,000 texts collected from portal websites were downloaded, of which 6,000 serve as the training corpus, belonging to three classes with the following document counts: urban market management (1,019), disaster events (2,215), and other (2,766); the remaining 10,000 serve as the test corpus.
Step 2: divide the training corpus into classes, remove unwanted hyperlinks, advertisements, and similar information, and segment the body text into words to obtain each document's word-string sequence.
Step 3: take each document from the training corpus and extract its words to form the overall vocabulary; at the same time, summarize the concepts belonging to each class, use the concept dictionary to extract each class's concept set, and form the concept-set library of the three classes, used to compute the concept information value CIV.
Step 4: compute the information gain according to the amount of information between each word and the classified texts, and select different thresholds to obtain feature-vector tables of different sizes (1000, 2000, 3000, 4000, 5000, 6000).
Step 5: use the feature-vector weighting method TF*IDF*CIV to compute the feature-word weights, where the concept information value is the shared-concept count sim(c_i, C) + 1, and sim(c_i, C) is the number of concepts in the concept set c_i of feature term t_i that match concepts in the class-concept set C.
Step 6: construct the corresponding text classifiers.
Step 7: classify the test texts and obtain the classification results under feature vectors of different sizes.
Step 8: compute the performance evaluation parameters of the classifiers.
Step 9: determine the optimal number of feature vectors for the system according to the value of the evaluation function.
As shown in Fig. 2, the statistical text classification system based on the TF*IDF algorithm comprises:
a corpus collection and preprocessing module for collecting the training and test corpora from the Internet and preprocessing the corpus by removing hyperlinks, advertisements, and similar information and performing word segmentation;
a feature selection module for extracting the vocabulary of the corpus and selecting feature words of different counts according to the feature selection algorithm to form feature-word tables;
a feature-weight computing module for computing the feature weights;
a classification module for classifying the corpus texts; and
a classification optimization module for comparing the different classification results and finding the feature-word count that gives the best classification effect; characterized in that
the system further comprises a concept dictionary module and a class-concept library module;
the concept dictionary module stores the class information to which each concept belongs;
the class-concept library module stores the aggregate concept information of the different classes;
the feature-weight computing module uses the class-concept set information C obtained from the class-concept library module and applies the TF*IDF*CIV algorithm to compute the weights of the feature words of different counts;
wherein the formula of the TF*IDF*CIV algorithm is:
W_ij = tf_ij × log(N / n_i) × [sim(c_i, C) + 1]
where the shared-concept count sim(c_i, C) is the number of concepts in the concept set c_i corresponding to feature term t_i that match concepts in the class-concept set C in the concept library module.
The detailed issues involved in the technical scheme of the present invention are explained below:
1. Corpus selection:
Download a sufficient corpus from portal websites, set aside part of it as the training corpus, and classify it by category. The division into classes should be as reasonable as possible, and the data of the classes as balanced as possible.
2. Feature selection:
In the vector space model, the feature items representing a text may be characters, words, phrases, or even multiple elements such as "concepts". Here we adopt words, which experiments have proved to be the most common and most effective feature items for text classification. Many feature selection methods exist; common ones include the document frequency (DF) method, the information gain (IG) method, the χ² statistic (CHI) method, and the mutual information (MI) method. The main task of feature selection is to settle two questions: first, which words to choose as feature items; second, how many words to choose. We use the information-gain method, with the following steps:
1) Extract the vocabulary. After the preprocessing stages such as word segmentation are finished, compute the information-gain value of each word as a candidate feature. The information-gain method weighs the importance of a feature item t_i by how much information it provides for the overall classification, and decides accordingly whether to keep it. The information gain of a feature item t_i is the difference between the information provided for the overall classification with and without the feature, where the amount of information is measured by entropy; that is, the information gain is the entropy of the documents when no feature is considered minus the entropy of the documents after this feature is considered:
Gain(t_i) = Entropy(S) − ExpectedEntropy(S|t_i)
= {−Σ_{j=1..M} P(C_j) · log P(C_j)} − {P(t_i) · [−Σ_{j=1..M} P(C_j|t_i) · log P(C_j|t_i)] + P(t̄_i) · [−Σ_{j=1..M} P(C_j|t̄_i) · log P(C_j|t̄_i)]}
In the formula, P(C_j) denotes the probability that a document of class C_j appears in the corpus, P(t_i) the probability that a document in the corpus contains feature term t_i, P(C_j|t_i) the conditional probability that a document belongs to class C_j given that it contains t_i, P(t̄_i) the probability that a document in the corpus does not contain t_i, P(C_j|t̄_i) the conditional probability that a document belongs to class C_j given that it does not contain t_i, and M the number of classes.
2) Choose words of different counts from the information-gain ranking as the feature vector (for convenience of later computation, integer multiples of 100 are generally chosen as the total number of features, e.g. 1000, 2000, 3000, 4000, 5000, 6000).
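The two substeps above can be sketched as follows (a sketch under the assumption that documents are given as bags of words with class labels; all names are illustrative):

```python
import math
from collections import Counter

def information_gain(docs, term):
    """Gain(t_i) = Entropy(S) - ExpectedEntropy(S|t_i) over labeled documents.

    docs: list of (set_of_words, class_label) pairs.
    """
    def entropy(labels):
        total = len(labels)
        return -sum((c / total) * math.log(c / total)
                    for c in Counter(labels).values())

    all_labels = [label for _, label in docs]
    with_t = [label for words, label in docs if term in words]
    without_t = [label for words, label in docs if term not in words]
    p_t = len(with_t) / len(docs)
    gain = entropy(all_labels)
    if with_t:                       # entropy of documents containing the term
        gain -= p_t * entropy(with_t)
    if without_t:                    # entropy of documents not containing it
        gain -= (1 - p_t) * entropy(without_t)
    return gain

def select_features(docs, vocabulary, k):
    """Keep the k words with the highest information gain as the feature vector."""
    return sorted(vocabulary, key=lambda t: information_gain(docs, t), reverse=True)[:k]
```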
3. Feature-weight computation:
The TF*IDF weighting method was proposed by Salton in 1973 and is defined as follows: the weight W_ij of feature term t_i in text D_j is:
W_ij = tf_ij × log(N / n_i)
where tf_ij denotes the frequency of feature term t_i in training text D_j, n_i is the number of documents in the training set containing t_i, and N is the total number of documents in the training set. That is, the weight W_ij of feature term t_i in text D_j equals its total frequency in document D_j multiplied by the logarithm of its inverse document frequency over the entire document collection.
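Salton's base weighting can be sketched as (a minimal sketch; names are illustrative):

```python
import math

def tfidf(tf_ij, n_total_docs, n_docs_with_term):
    """Classic TF*IDF: W_ij = tf_ij * log(N / n_i)."""
    return tf_ij * math.log(n_total_docs / n_docs_with_term)

# a term occurring 5 times in a document and appearing in 100 of 10000 training docs
w = tfidf(5, 10000, 100)  # = 5 * log(100)
```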
Our proposed improvement to the TF*IDF method consists mainly in introducing the CIV variable, which effectively computes the concept information corresponding to a feature term and uses it to adjust the feature-vector weights. The reasons are as follows:
In methods that represent documents with the vector space model, two factors have proved crucial to obtaining effective feature-term weights: first, the frequency of the feature term in a single document; second, the distribution of the feature term over the entire document collection. The TF*IDF method uses the absolute term frequency TF for the first factor; but some feature terms have a very high frequency yet very weak discriminating power (for example, many everyday words), while others have a lower frequency but very strong discriminating power, so an adjustment on top of TF is needed, and the IDF variable is introduced. The second factor is expressed by the inverse document frequency IDF, whose weight varies inversely with the number of documents containing the feature; in the extreme case, only a feature occurring in a single document attains the highest IDF value. The fewer documents a feature term occurs in, the larger its weight. The second factor in fact considers the distribution of feature terms over the whole class: feature terms with good discriminating power naturally have larger IDF values.
However, the vector space model, widely used in text representation, has an inherent defect in how it represents text. The model uses contextual information to quantitatively describe the semantic features of vocabulary and measures semantic similarity between words by the distance between vectors, avoiding the data-sparseness problem; but it treats each word component of the vector as an independent feature item and isolates the relatedness between words. This defect inevitably carries over into the TF*IDF feature-weight computation. We therefore introduce a third variable, the concept information value CIV, to remedy it. From the viewpoint of set theory, we hold that the information shared by two concept sets can be computed from the number of equal concepts in the two sets: the more identical concepts two sets contain, the more information they share, the larger their semantic information, and the greater their semantic similarity. We therefore compute the concept information of a feature term from the number of concepts common to the concept set c_i corresponding to the feature term's word and the concept set C corresponding to the class. Because the two sets may share no concepts at all, the concept information value is defined as the shared-concept count sim(c_i, C) + 1. This yields the following feature-word weight formula:
W_ij = tf_ij × log(N / n_i) × [sim(c_i, C) + 1]
The variables in the formula are as follows: W_ij is the weight of feature term t_i in text D_j; tf_ij is the frequency of t_i in training text D_j; n_i is the number of documents in the training set containing t_i; N is the total number of documents in the training set; sim(c_i, C) is the number of shared concepts between the concept set c_i corresponding to feature term t_i and the concept set C corresponding to the class. By this method, every document in the training corpus obtains the weight values of its corresponding feature-term words. For example, the concept set C of the disaster class can be reduced by experts to a total of 33 concepts. If the feature vector contains the feature term "fire", whose word corresponds to the concept-symbol string {3228~, 3d01}, both concepts lie in the disaster-class concept set C, so sim(c_i, C) for the feature term "fire" is 2 and its CIV value is 3. If instead the feature vector contains the feature term "working as one man", whose concept-symbol string is {43e01 (cooperation), j60c43 (fully)}, neither concept is included in the concept set C, so sim(c_i, C) for "working as one man" is 0 and its concept information value CIV is 1.
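The "fire" example above amounts to a set intersection; a minimal sketch follows (the concept symbols are simplified stand-ins for the patent's concept-dictionary strings):

```python
def civ(term_concepts, class_concepts):
    """Concept information value CIV = sim(c_i, C) + 1, where sim(c_i, C)
    is the number of concepts shared by the term's concept set c_i and
    the class-concept set C."""
    return len(term_concepts & class_concepts) + 1

# a toy disaster-class concept set C (the real set in the experiment has 33 concepts)
C = {"3228", "3d01", "43e02"}

assert civ({"3228", "3d01"}, C) == 3      # "fire": both concepts shared, sim = 2
assert civ({"43e01", "j60c43"}, C) == 1   # "working as one man": no concept shared
```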
4. Classifier construction:
The k-nearest-neighbor method is adopted, with the decision rule:
y(x̄, C_j) = Σ_{d_i ∈ kNN} sim(x̄, d̄_i) · y(d̄_i, C_j) − b_j
In the formula, y(d̄_i, C_j) takes the value 0 or 1: the value 1 means that training document d̄_i belongs to class C_j, and 0 that it does not; sim(x̄, d̄_i) denotes the similarity between the test document x̄ and the training document d̄_i; b_j is the threshold of the binary decision.
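The decision rule can be sketched as follows, with cosine similarity standing in for sim(x̄, d̄_i) — a common choice for vector-space models, though the patent does not fix one; all names here are illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as dicts term -> weight."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_score(x, training_docs, target_class, k, b):
    """y(x, C_j) = sum over the k nearest neighbours of sim(x, d_i) * y(d_i, C_j) - b_j;
    assign x to class C_j when the score is positive.

    training_docs: list of (sparse_vector, class_label) pairs.
    """
    neighbours = sorted(training_docs, key=lambda d: cosine(x, d[0]), reverse=True)[:k]
    score = sum(cosine(x, vec) for vec, label in neighbours if label == target_class)
    return score - b

x = {"fire": 2.0, "flood": 1.0}
train = [({"fire": 1.0}, "disaster"),
         ({"flood": 1.0}, "disaster"),
         ({"price": 1.0}, "market")]
assert knn_score(x, train, "disaster", k=2, b=0.5) > 0  # both neighbours are disaster docs
```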
5. Evaluation function:
For different purposes, many functions for evaluating text-classifier performance have been proposed, including recall, precision, the F-measure, micro- and macro-averaging, the break-even point, and 11-point average precision. At present the two internationally popular functions for judging text-classification effectiveness are the micro-F1 and macro-F1 measures [Yang, 1997]. The former is the more widely used, and is defined as follows:
precision: p = Ncr / Nc; recall: r = Ncr / Ns
F1 measure: F1 = 2rp / (r + p)
where Nc is the number of texts classified into the class, Nr the number of texts rejected from the class, Ncr the number of texts correctly classified into the class, and Ns the number of texts that should be classified into the class.
For the feature-vector tables of different sizes, the one whose F1 value under the evaluation function is best gives the optimal feature-word count of this classification method on this corpus.
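Micro-averaging pools the per-class counts before taking precision and recall; a sketch using the Nc/Ncr/Ns counts defined above (the function name is illustrative):

```python
def micro_f1(per_class_counts):
    """Micro-averaged F1 from per-class (Ncr, Nc, Ns) triples:
    Ncr = correctly classified, Nc = classified into the class,
    Ns = texts that should be in the class."""
    ncr = sum(t[0] for t in per_class_counts)
    nc = sum(t[1] for t in per_class_counts)
    ns = sum(t[2] for t in per_class_counts)
    p = ncr / nc   # pooled precision
    r = ncr / ns   # pooled recall
    return 2 * r * p / (r + p)

# two classes: (correct, classified, should-be)
f1 = micro_f1([(8, 10, 10), (6, 10, 10)])  # p = r = 0.7 -> F1 = 0.7
```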
6. Experimental results:
The corpus used in this experiment comprises 16,000 texts collected from portal websites on the Internet, of which 6,000 serve as the training corpus, belonging to three classes with the following document counts: urban market management (1,019), disaster events (2,215), and other (2,766); the remaining 10,000 serve as the test corpus.
The following table shows the experimental results:
[Table: F1 of the TF*IDF and TF*IDF*CIV classifiers at feature counts N = 1000–6000]
The experiments show that the classification effect of the TF*IDF*CIV algorithm is better than that of the TF*IDF algorithm at every feature-item count; at N = 5000 the F1 value of classification improves by 6.5%, which fully proves the validity of the method. It should be noted that the embodiments introduced above are illustrative and not restrictive. Those skilled in the art will understand that any modification of, or equivalent substitution for, the technical scheme of the present invention that does not depart from its spirit and scope shall be encompassed within the scope of the claims of the present invention.

Claims (6)

1. A statistical text classification method based on the TF*IDF algorithm, the method comprising the following steps:
1) collecting corpus material and dividing it into a training corpus and a test corpus;
2) classifying and preprocessing the training corpus;
3) extracting the vocabulary of each domain from the training corpus, and extracting the overall vocabulary at the same time;
4) summarizing the concepts belonging to each class of the training corpus, using a concept dictionary to extract the concept set of each class, and forming a class-concept set library C used to compute the concept information value CIV;
5) performing feature selection on the test corpus to obtain feature-vector tables of different sizes;
6) using the feature-vector weighting algorithm (TF*IDF*CIV) to compute the weight of each feature word in the feature-vector table by the following formula:
W_ij = tf_ij × log(N / n_i) × [sim(c_i, C) + 1];
wherein sim(c_i, C) + 1 is the concept information parameter CIV, and the shared-concept count sim(c_i, C) is the number of concepts in the concept set c_i corresponding to any feature term t_i of the feature-vector table of step 5) that match concepts in the class-concept set C of step 4);
7) constructing the corresponding text classifiers, using them to classify the test corpus, and obtaining the classification results;
8) using an evaluation function to compute the performance parameters of the various classifiers, and determining the optimal feature-vector table from the classifier evaluation results.
2. The statistical text classification method based on the TF*IDF algorithm according to claim 1, characterized in that the preprocessing of step 2) is: removing unwanted hyperlinks and advertising information from the web-page text, and segmenting the text into words.
3. The statistical text classification method based on the TF*IDF algorithm according to claim 1 or 2, wherein the feature selection of step 5) adopts the information gain method, the information gain method comprising the following substeps:
5-1) extracting the vocabulary and, after preprocessing, calculating the information gain value of each segmented word as a candidate feature, the information gain value being the difference between the entropy of the documents when the feature is not considered and the entropy of the documents after the feature is considered, computed as:
Gain(t_i) = Entropy(S) − ExpectedEntropy(S_{t_i})
= {−Σ_{j=1..M} P(C_j) · log P(C_j)} − {P(t_i) · [−Σ_{j=1..M} P(C_j|t_i) · log P(C_j|t_i)] + P(t̄_i) · [−Σ_{j=1..M} P(C_j|t̄_i) · log P(C_j|t̄_i)]};
wherein P(C_j) denotes the probability that a document of class C_j appears in the corpus; P(t_i) denotes the probability that a document in the corpus contains the feature item t_i; P(C_j|t_i) denotes the conditional probability that a document belongs to class C_j given that it contains the feature item t_i; P(t̄_i) denotes the probability that a document in the corpus does not contain the feature item t_i; P(C_j|t̄_i) denotes the conditional probability that a document belongs to class C_j given that it does not contain the feature item t_i; and M denotes the number of classes;
5-2) ranking the words by information gain value and selecting different numbers of top-ranked words as feature vectors.
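The information gain of substep 5-1) can be sketched in Python as follows, computing Gain(t_i) = Entropy(S) − ExpectedEntropy(S_{t_i}) from labeled training documents. Function and parameter names are illustrative, not from the patent.

```python
import math
from collections import Counter

def information_gain(labels, has_term):
    """Information gain of a feature word t_i.

    labels   -- class label C_j of each training document
    has_term -- parallel booleans: does the document contain t_i?
    """
    def entropy(subset):
        # -sum P(C_j) * log2 P(C_j) over the classes in the subset
        n = len(subset)
        if n == 0:
            return 0.0
        return -sum((c / n) * math.log(c / n, 2)
                    for c in Counter(subset).values())

    n = len(labels)
    with_t = [l for l, h in zip(labels, has_term) if h]
    without_t = [l for l, h in zip(labels, has_term) if not h]
    p_t = len(with_t) / n  # P(t_i)
    # Entropy(S) minus the expected entropy after splitting on t_i
    return entropy(labels) - (p_t * entropy(with_t)
                              + (1 - p_t) * entropy(without_t))
```

A term that perfectly separates the classes yields a gain equal to the full entropy of the label set, while a term distributed independently of the classes yields a gain of zero.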
4. The statistical text classification method based on the TF*IDF algorithm according to claim 1, wherein the classifier adopts the k-nearest neighbor method.
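Claim 4 only names the k-nearest neighbor method; a minimal sketch follows, assuming cosine similarity between weight vectors as the distance measure (the patent does not specify one) and majority vote among the k nearest training documents.

```python
import math
from collections import Counter

def cosine(u, v):
    # cosine similarity of two equal-length weight vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_classify(train, test_vec, k=3):
    """Label a test document by majority vote among the k training
    vectors most similar to it.

    train    -- list of (weight_vector, label) pairs
    test_vec -- weight vector of the document to classify
    """
    neighbors = sorted(train, key=lambda p: cosine(p[0], test_vec),
                       reverse=True)[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]
```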
5. The statistical text classification method based on the TF*IDF algorithm according to claim 1, wherein the evaluation function adopts the micro-F1 measure function.
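The micro-F1 measure of claim 5 can be sketched as follows, using the standard micro-averaging convention (pool the true-positive, false-positive and false-negative counts over all classes before computing precision and recall); the function name and input layout are illustrative.

```python
def micro_f1(per_class_counts):
    """Micro-averaged F1 over all classes.

    per_class_counts -- iterable of (tp, fp, fn) triples, one per class
    """
    # pool the counts across classes, then compute a single P/R/F1
    tp = sum(c[0] for c in per_class_counts)
    fp = sum(c[1] for c in per_class_counts)
    fn = sum(c[2] for c in per_class_counts)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because the counts are pooled first, micro-F1 weights every document equally, so large classes dominate the score; this makes it a natural criterion for choosing the best feature vector table in step 8).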
6. A statistical text classification system based on the TF*IDF algorithm, the system comprising: a corpus collection and preprocessing module, a feature selection module, a feature weight calculation module, a classification module, and a classification optimization module;
the corpus collection and preprocessing module is used to collect the training corpus and test corpus from the Internet, to remove hyperlinks and advertising information from the corpus, and to perform word segmentation preprocessing;
the feature selection module is used to extract the vocabulary of the corpus and to select feature words of different numbers from it according to a feature selection algorithm, forming feature vocabularies;
the feature weight calculation module is used to calculate feature weights;
the classification module is used to classify the corpus texts; and
the classification optimization module is used to compare different classification results and find the number of feature words that yields the best classification performance; characterized in that
the system further comprises: a concept dictionary module and a category concept library module;
the concept dictionary is used to store the category information to which each concept belongs;
the category concept library module is used to store the total concept set information C under the different categories;
the feature weight calculation module uses the concept set information C obtained from the category concept library module and adopts the TF*IDF*CIV algorithm to calculate the weights of the feature words of different numbers;
wherein the formula of the TF*IDF*CIV algorithm is:
W_ij = tf_ij × log(N / n_i) × [sim(c_i, C) + 1];
and the shared-concept count sim(c_i, C) in this formula is the number of concepts in the concept set c_i corresponding to the feature item t_i that match concepts in the category concept set C of the category concept library module.
CN2011100338086A 2011-01-31 2011-01-31 Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm Expired - Fee Related CN102622373B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011100338086A CN102622373B (en) 2011-01-31 2011-01-31 Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011100338086A CN102622373B (en) 2011-01-31 2011-01-31 Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm

Publications (2)

Publication Number Publication Date
CN102622373A true CN102622373A (en) 2012-08-01
CN102622373B CN102622373B (en) 2013-12-11

Family

ID=46562296

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011100338086A Expired - Fee Related CN102622373B (en) 2011-01-31 2011-01-31 Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm

Country Status (1)

Country Link
CN (1) CN102622373B (en)

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103106275A (en) * 2013-02-08 2013-05-15 西北工业大学 Text classification character screening method based on character distribution information
CN103425735A (en) * 2013-06-06 2013-12-04 深圳市宜搜科技发展有限公司 Establishing method and system based on website subject term inquiry
CN103593339A (en) * 2013-11-29 2014-02-19 哈尔滨工业大学深圳研究生院 Electronic-book-oriented semantic space representing method and system
CN103927302A (en) * 2013-01-10 2014-07-16 阿里巴巴集团控股有限公司 Text classification method and system
CN104424308A (en) * 2013-09-04 2015-03-18 中兴通讯股份有限公司 Web page classification standard acquisition method and device and web page classification method and device
WO2015135452A1 (en) * 2014-03-14 2015-09-17 Tencent Technology (Shenzhen) Company Limited Text information processing method and apparatus
CN105045812A (en) * 2015-06-18 2015-11-11 上海高欣计算机系统有限公司 Text topic classification method and system
CN106844424A (en) * 2016-12-09 2017-06-13 宁波大学 A kind of file classification method based on LDA
CN107085608A (en) * 2017-04-21 2017-08-22 上海喆之信息科技有限公司 A kind of effective network hotspot monitoring system
CN107194617A (en) * 2017-07-06 2017-09-22 北京航空航天大学 A kind of app software engineers soft skill categorizing system and method
CN104391835B (en) * 2014-09-30 2017-09-29 中南大学 Feature Words system of selection and device in text
CN107301248A (en) * 2017-07-19 2017-10-27 百度在线网络技术(北京)有限公司 Term vector construction method and device, computer equipment, the storage medium of text
CN107562814A (en) * 2017-08-14 2018-01-09 中国农业大学 A kind of earthquake emergency and the condition of a disaster acquisition of information sorting technique and system
CN108228546A (en) * 2018-01-19 2018-06-29 北京中关村科金技术有限公司 A kind of text feature, device, equipment and readable storage medium storing program for executing
CN108376130A (en) * 2018-03-09 2018-08-07 长安大学 A kind of objectionable text information filtering feature selection approach
CN108461111A (en) * 2018-03-16 2018-08-28 重庆医科大学 Chinese medical treatment text duplicate checking method and device, electronic equipment, computer read/write memory medium
CN108509410A (en) * 2017-02-27 2018-09-07 广东神马搜索科技有限公司 Text semantic similarity calculating method, device and user terminal
CN108509407A (en) * 2017-02-27 2018-09-07 广东神马搜索科技有限公司 Text semantic similarity calculating method, device and user terminal
CN108509471A (en) * 2017-05-19 2018-09-07 苏州纯青智能科技有限公司 A kind of Chinese Text Categorization
CN108898274A (en) * 2018-05-30 2018-11-27 国网浙江省电力有限公司宁波供电公司 A kind of power scheduling log defect classification method
CN110019817A (en) * 2018-12-04 2019-07-16 阿里巴巴集团控股有限公司 A kind of detection method, device and the electronic equipment of text in video information
CN110287328A (en) * 2019-07-03 2019-09-27 广东工业大学 A kind of file classification method, device, equipment and computer readable storage medium
CN111259649A (en) * 2020-01-19 2020-06-09 深圳壹账通智能科技有限公司 Interactive data classification method and device of information interaction platform and storage medium
CN112269880A (en) * 2020-11-04 2021-01-26 吾征智能技术(北京)有限公司 Sweet text classification matching system based on linear function
CN113032573A (en) * 2021-04-30 2021-06-25 《中国学术期刊(光盘版)》电子杂志社有限公司 Large-scale text classification method and system combining theme semantics and TF-IDF algorithm

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6704905B2 (en) * 2000-12-28 2004-03-09 Matsushita Electric Industrial Co., Ltd. Text classifying parameter generator and a text classifier using the generated parameter

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
张运良, 张全: "Research on Automatic Text Classification Based on the Sentence-Category Vector Space Model", Computer Engineering (《计算机工程》) *
缪建明, 张全, 赵金仿: "Automatic Chinese Text Classification Based on Article Title Information", Computer Engineering (《计算机工程》) *
蔡银珊, 黄英铭: "Automatic Web Page Classification Based on an Improved TF-IDF Feature Weighting Algorithm", Journal of Mianyang Normal University (《绵阳师范学院学报》) *
赵金仿, 赵艳, 缪建明: "Web Page Information Extraction and the Implementation of Automatic Text Classification", Computer Technology and Development (《计算机技术与发展》) *

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103927302A (en) * 2013-01-10 2014-07-16 阿里巴巴集团控股有限公司 Text classification method and system
CN103927302B (en) * 2013-01-10 2017-05-31 阿里巴巴集团控股有限公司 A kind of file classification method and system
CN103106275B (en) * 2013-02-08 2016-02-10 西北工业大学 The text classification Feature Selection method of feature based distributed intelligence
CN103106275A (en) * 2013-02-08 2013-05-15 西北工业大学 Text classification character screening method based on character distribution information
CN103425735A (en) * 2013-06-06 2013-12-04 深圳市宜搜科技发展有限公司 Establishing method and system based on website subject term inquiry
CN103425735B (en) * 2013-06-06 2017-08-11 深圳市宜搜科技发展有限公司 A kind of method for building up and system based on website subject term inquiry
CN104424308A (en) * 2013-09-04 2015-03-18 中兴通讯股份有限公司 Web page classification standard acquisition method and device and web page classification method and device
CN103593339A (en) * 2013-11-29 2014-02-19 哈尔滨工业大学深圳研究生院 Electronic-book-oriented semantic space representing method and system
WO2015135452A1 (en) * 2014-03-14 2015-09-17 Tencent Technology (Shenzhen) Company Limited Text information processing method and apparatus
US10262059B2 (en) 2014-03-14 2019-04-16 Tencent Technology (Shenzhen) Company Limited Method, apparatus, and storage medium for text information processing
CN104391835B (en) * 2014-09-30 2017-09-29 中南大学 Feature Words system of selection and device in text
CN105045812A (en) * 2015-06-18 2015-11-11 上海高欣计算机系统有限公司 Text topic classification method and system
CN105045812B (en) * 2015-06-18 2019-01-29 上海高欣计算机系统有限公司 The classification method and system of text subject
CN106844424B (en) * 2016-12-09 2020-11-03 宁波大学 LDA-based text classification method
CN106844424A (en) * 2016-12-09 2017-06-13 宁波大学 A kind of file classification method based on LDA
CN108509407A (en) * 2017-02-27 2018-09-07 广东神马搜索科技有限公司 Text semantic similarity calculating method, device and user terminal
CN108509407B (en) * 2017-02-27 2022-03-18 阿里巴巴(中国)有限公司 Text semantic similarity calculation method and device and user terminal
CN108509410B (en) * 2017-02-27 2022-08-05 阿里巴巴(中国)有限公司 Text semantic similarity calculation method and device and user terminal
CN108509410A (en) * 2017-02-27 2018-09-07 广东神马搜索科技有限公司 Text semantic similarity calculating method, device and user terminal
CN107085608A (en) * 2017-04-21 2017-08-22 上海喆之信息科技有限公司 A kind of effective network hotspot monitoring system
CN108509471A (en) * 2017-05-19 2018-09-07 苏州纯青智能科技有限公司 A kind of Chinese Text Categorization
CN107194617A (en) * 2017-07-06 2017-09-22 北京航空航天大学 A kind of app software engineers soft skill categorizing system and method
CN107301248B (en) * 2017-07-19 2020-07-21 百度在线网络技术(北京)有限公司 Word vector construction method and device of text, computer equipment and storage medium
CN107301248A (en) * 2017-07-19 2017-10-27 百度在线网络技术(北京)有限公司 Term vector construction method and device, computer equipment, the storage medium of text
CN107562814A (en) * 2017-08-14 2018-01-09 中国农业大学 A kind of earthquake emergency and the condition of a disaster acquisition of information sorting technique and system
CN108228546A (en) * 2018-01-19 2018-06-29 北京中关村科金技术有限公司 A kind of text feature, device, equipment and readable storage medium storing program for executing
CN108376130A (en) * 2018-03-09 2018-08-07 长安大学 A kind of objectionable text information filtering feature selection approach
CN108461111A (en) * 2018-03-16 2018-08-28 重庆医科大学 Chinese medical treatment text duplicate checking method and device, electronic equipment, computer read/write memory medium
CN108898274A (en) * 2018-05-30 2018-11-27 国网浙江省电力有限公司宁波供电公司 A kind of power scheduling log defect classification method
CN110019817A (en) * 2018-12-04 2019-07-16 阿里巴巴集团控股有限公司 A kind of detection method, device and the electronic equipment of text in video information
CN110287328A (en) * 2019-07-03 2019-09-27 广东工业大学 A kind of file classification method, device, equipment and computer readable storage medium
CN111259649A (en) * 2020-01-19 2020-06-09 深圳壹账通智能科技有限公司 Interactive data classification method and device of information interaction platform and storage medium
CN112269880A (en) * 2020-11-04 2021-01-26 吾征智能技术(北京)有限公司 Sweet text classification matching system based on linear function
CN112269880B (en) * 2020-11-04 2024-02-09 吾征智能技术(北京)有限公司 Sweet text classification matching system based on linear function
CN113032573A (en) * 2021-04-30 2021-06-25 《中国学术期刊(光盘版)》电子杂志社有限公司 Large-scale text classification method and system combining theme semantics and TF-IDF algorithm
CN113032573B (en) * 2021-04-30 2024-01-23 同方知网数字出版技术股份有限公司 Large-scale text classification method and system combining topic semantics and TF-IDF algorithm

Also Published As

Publication number Publication date
CN102622373B (en) 2013-12-11

Similar Documents

Publication Publication Date Title
CN102622373B (en) Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm
CN101819601B (en) Method for automatically classifying academic documents
CN104750844B (en) Text eigenvector based on TF-IGM generates method and apparatus and file classification method and device
CN100533441C (en) Two-stage combined file classification method based on probability subject
CN104951548B (en) A kind of computational methods and system of negative public sentiment index
Al Qadi et al. Arabic text classification of news articles using classical supervised classifiers
CN102194013A (en) Domain-knowledge-based short text classification method and text classification system
CN107577785A (en) A kind of level multi-tag sorting technique suitable for law identification
CN101763431A (en) PL clustering method based on massive network public sentiment information
CN105760493A (en) Automatic work order classification method for electricity marketing service hot spot 95598
CN103886108B (en) The feature selecting and weighing computation method of a kind of unbalanced text set
CN104391835A (en) Method and device for selecting feature words in texts
CN101587493A (en) Text classification method
CN103995876A (en) Text classification method based on chi square statistics and SMO algorithm
CN109522544A (en) Sentence vector calculation, file classification method and system based on Chi-square Test
CN103473262A (en) Automatic classification system and automatic classification method for Web comment viewpoint on the basis of association rule
CN105975518A (en) Information entropy-based expected cross entropy feature selection text classification system and method
CN106156372A (en) The sorting technique of a kind of internet site and device
CN102629272A (en) Clustering based optimization method for examination system database
CN106570076A (en) Computer text classification system
CN101976270A (en) Uncertain reasoning-based text hierarchy classification method and device
CN104809229A (en) Method and system for extracting text characteristic words
CN109376235A (en) The feature selection approach to be reordered based on document level word frequency
CN105224689A (en) A kind of Dongba document sorting technique
CN111708865A (en) Technology forecasting and patent early warning analysis method based on improved XGboost algorithm

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20131211

Termination date: 20170131

CF01 Termination of patent right due to non-payment of annual fee