CN102622373A - Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm - Google Patents
Abstract
The present invention relates to a statistical text classification method based on the TF*IDF algorithm. The method proposes a new feature-vector weighting scheme, TF*IDF*CIV, which introduces a concept information value (CIV) variable into the TF*IDF method so that the amount of conceptual information carried by a feature term is taken into account when computing its weight:

W_ij = tf_ij × log(N / n_i) × (sim(c_i, C) + 1)

Here the shared-concept count sim(c_i, C) is the number of concepts in the concept set c_i corresponding to feature term t_i that match concepts in the class concept set C. The widely used TF*IDF method cannot represent the relatedness between feature terms and therefore ignores the influence of their semantic relatedness on the weights; the new method compensates for this deficiency. Experiments show that the new method effectively improves the accuracy of the whole text classification system.
Description
Technical field
The present invention relates to the field of computer science and technology, and in particular to a new method and device for computing feature-vector weights for text classification.
Background art
With the rapid development and popularization of Internet and computer technology, large amounts of text began to exist in computer-readable form, and automatic text classification by computer arose accordingly. Today, text classification technology is widely used in research fields such as document indexing, harmful-content detection, topic identification, automatic summarization, and intelligent information retrieval.
Automatic classification began at the end of the 1950s, when H. P. Luhn carried out pioneering research in this field. In 1961, Maron published the first paper on automatic classification, and many well-known information scientists such as Sparck Jones and Salton subsequently carried out fruitful research in the field. In the 1980s, text classification systems were dominated by knowledge-engineering methods: based on domain experts' classification experience with a given text collection, a set of logical rules was manually extracted and used as the basis for classification by computer, and the technical characteristics and performance of such expert-rule systems were then analyzed. From the 1990s onward, statistical and machine-learning methods were introduced into automatic text classification; they achieved great success, gradually replaced the knowledge-engineering methods, and quickly became the mainstream. Machine-learning methods, however, take little account of the semantic information of the text; combining semantic analysis and concept networks with machine learning has yielded better classification results, with notable advantages in accuracy and stability. The text classification process is roughly as follows: the system performs feature selection and classifier-parameter training on training samples; a sample to be classified is formalized according to the selected features; it is then fed to the classifier for category judgment, which finally yields the sample's category.
Current statistics-based text classification methods include: the naive Bayes method (Bayesian classifier), the support vector machine method (SVM), the k-nearest-neighbor method (kNN), the neural network method (NNet), decision-tree classification, the fuzzy classifier, the Rocchio method, and Boosting algorithms. According to results reported by Yiming Yang of CMU, the support vector machine method based on the vector space model performs best; most of the other methods likewise require the feature vectors of the text to be built first. The most common way to build feature vectors is the TF*IDF (TF: Term Frequency, IDF: Inverse Document Frequency) method, together with the various improved computation methods built on it.
The document vector space model uses contextual information to quantitatively describe the semantic features of words, and measures the semantic similarity between words by the distance between their vectors, effectively avoiding the data-sparseness problem that is unavoidable in traditional statistical methods. However, the vector space model treats each word component of the vector as an independent feature term and ignores the relatedness between feature terms, which keeps the accuracy of TF*IDF-based classifiers unsatisfactory.
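The vector distance the model relies on is typically the cosine of the angle between two term-weight vectors; a minimal sketch (the function name and vectors are illustrative, not part of the invention):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length term-weight vectors:
    values near 1 mean the documents are semantically similar, values
    near 0 mean they share no weighted terms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Two documents with identical weight vectors score 1.0; documents with disjoint vocabularies score 0.0.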
Summary of the invention
The object of the present invention is to overcome the low accuracy of TF*IDF-based text classifiers, caused by the current TF*IDF algorithm's failure to consider the semantic similarity between words when computing feature-term weights, and to provide a statistical text classification system and method based on the TF*IDF algorithm.
To achieve the above object, the statistical text classification method based on the TF*IDF algorithm provided by the invention comprises the following steps:
1) collect corpus material and divide it into a training corpus and a test corpus;
2) classify and preprocess the training corpus;
3) extract the vocabulary of each field from the training corpus, and extract an overall vocabulary at the same time;
4) summarize the concepts belonging to each category of the training corpus, use a concept dictionary to extract the concept set of each category, and form a class-concept-set library used to compute the concept information value CIV;
5) perform feature selection on the test corpus to obtain feature-vector tables of different sizes;
6) use the feature-vector weighting algorithm (TF*IDF*CIV) to compute the weight of each feature word in the feature-vector table, with the formula:

W_ij = tf_ij × log(N / n_i) × (sim(c_i, C) + 1)

where sim(c_i, C) + 1 is the concept information value CIV, and the shared-concept count sim(c_i, C) is the number of concepts in the concept set c_i of any feature term t_i of the feature-vector table of step 5) that match concepts in the class concept set C of step 4);
7) construct the corresponding text classifiers, use them to classify the test corpus, and obtain the classification results;
8) use an evaluation function to compute the performance-evaluation parameters of the classifiers, and determine the optimal feature-vector table from the evaluation results.
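The weighting of step 6) can be sketched directly; the function name and the sample numbers below are illustrative, not taken from the patent:

```python
import math

def tfidf_civ(tf_ij, n_i, N, shared_concepts):
    """TF*IDF*CIV weight of a feature term: its frequency tf_ij in the
    document, times the log of its inverse document frequency N/n_i,
    times the concept information value CIV = sim(c_i, C) + 1."""
    civ = shared_concepts + 1
    return tf_ij * math.log(N / n_i) * civ

# e.g. a term occurring 4 times in a document, present in 20 of 1000
# training documents, and sharing 2 concepts with the class concept set:
w = tfidf_civ(4, 20, 1000, 2)
```

A term whose concept set shares nothing with the class concept set gets CIV = 1 and thus falls back to its plain TF*IDF weight.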
In the above technical scheme, the preprocessing of step 2) is: remove unwanted information such as hyperlinks and advertisements from the web-page text, and perform word segmentation on the text.
In the above technical scheme, the feature selection of step 5) adopts the information gain method, which comprises the following substeps:
5-1) extract the vocabulary and, after preprocessing, compute the information gain value of each segmented word as a candidate feature; the information gain value is the difference between the entropy of the documents when no feature is considered and the entropy of the documents after the feature is considered, computed as:

IG(t_i) = −∑_{j=1}^{M} P(C_j) log P(C_j) + P(t_i) ∑_{j=1}^{M} P(C_j | t_i) log P(C_j | t_i) + P(t̄_i) ∑_{j=1}^{M} P(C_j | t̄_i) log P(C_j | t̄_i)

where P(C_j) is the probability that a document of class C_j occurs in the corpus, P(t_i) is the probability that a document in the corpus contains feature term t_i, P(C_j | t_i) is the conditional probability that a document belongs to class C_j given that it contains t_i, P(t̄_i) is the probability that a document does not contain t_i, P(C_j | t̄_i) is the conditional probability that a document belongs to class C_j given that it does not contain t_i, and M is the number of classes;
5-2) choose words of different numbers, ranked by information gain value, as the feature vectors.
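Substep 5-1) can be sketched as follows; the representation of documents as token sets and all sample data are illustrative assumptions:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a class-label distribution, in bits."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(docs, labels, term):
    """IG(t_i): entropy of the class distribution minus the conditional
    entropy given the presence or absence of `term`.  `docs` is a list
    of token sets, `labels` the class label of each document."""
    n = len(docs)
    with_term = [l for d, l in zip(docs, labels) if term in d]
    without_term = [l for d, l in zip(docs, labels) if term not in d]
    h_cond = 0.0
    if with_term:
        h_cond += len(with_term) / n * entropy(with_term)
    if without_term:
        h_cond += len(without_term) / n * entropy(without_term)
    return entropy(labels) - h_cond
```

A term that perfectly separates the classes gains the full class entropy; a term spread evenly across classes gains nearly nothing.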
Based on the above method, the present invention also provides a statistical text classification system based on the TF*IDF algorithm, the system comprising:
a corpus collection and preprocessing module, used to collect training and test corpora from the Internet and to preprocess them (removal of hyperlinks, advertisements, and similar information, and word segmentation);
a feature selection module, used to extract the vocabulary of the corpus and, according to a feature selection algorithm, select feature words of different numbers from it to form feature vocabularies;
a feature weight computation module, used to compute feature weights;
a classification module, used to classify the corpus texts; and
a classification optimization module, used to compare the different classification results and find the feature-word count that gives the best classification effect;
characterized in that the system further comprises a concept dictionary module and a class concept library module;
the concept dictionary module is used to store the category information of concepts;
the class concept library module is used to store the aggregate concept information of the different categories;
the feature weight computation module uses the concept-set information C obtained from the class concept library module and adopts the TF*IDF*CIV algorithm to compute the weights of the feature words of different numbers;
where the formula of the TF*IDF*CIV algorithm is:

W_ij = tf_ij × log(N / n_i) × (sim(c_i, C) + 1)

and the shared-concept count sim(c_i, C) is the number of concepts in the concept set c_i of feature term t_i that match concepts in the class concept set C of the concept library module.
The advantage of the present invention is that it adjusts the TF*IDF method by introducing the concept information value CIV: the amount of conceptual information of each feature term is computed and used to adjust the weights of the different feature vectors, using the relatedness between concepts to compensate for the deficiency of the vector space model. Experiments prove that the improved method raises classification accuracy by 6.5 percentage points, which fully demonstrates its validity.
Description of drawings
Fig. 1 is the flow chart of the statistical text classification method based on the TF*IDF algorithm of the present invention;
Fig. 2 is the module structure diagram of the statistical text classification system based on the TF*IDF algorithm of the present invention.
Embodiment
The present invention is described further below in conjunction with the accompanying drawings and a specific embodiment.
To make effective use of knowledge-engineering knowledge in statistics-based text classification, the present invention provides a new method of computing vector weights. The method introduces the CIV variable into the TF*IDF method, yielding the improved TF*IDF*CIV (Term Frequency, Inverse Document Frequency, Concept Information Value) method. Experiments prove that this method effectively improves text-classification evaluation indices such as precision, recall, and the F1 measure.
As shown in Fig. 1, the flow of the statistical text classification method based on the TF*IDF algorithm is as follows:
Step 1: collect corpus material from the Internet, part as training corpus and part as test corpus. 16,000 texts collected from portal websites were downloaded, of which 6,000 served as training corpus, belonging to three categories with the following counts per field: urban-market management (1,019), disaster events (2,215), and other (2,766); the remaining 10,000 served as test corpus.
Step 2: divide the training corpus into categories, remove unwanted hyperlinks, advertisements, and similar information, and segment the body text into words, obtaining the word-string sequence of each document.
Step 3: take each document from the training corpus and extract its words to form the overall vocabulary; at the same time, summarize the concepts belonging to each category, use the concept dictionary to extract the concept set of each category, and form the concept-set library of the three categories, used to compute the concept information value CIV.
Step 4: compute the information gain according to the amount of information between words and the classified texts, and select different thresholds to obtain feature-vector tables of different sizes (1000, 2000, 3000, 4000, 5000, 6000).
Step 5: use the feature-vector weighting method TF*IDF*CIV to compute the weights of the feature words, where the concept information value is the shared-concept count sim(c_i, C) + 1, and sim(c_i, C) is the number of concepts in the concept set c_i of a feature term that match concepts in the class concept set C.
Step 6: construct the corresponding text classifiers.
Step 7: classify the test texts and obtain the classification results under feature vectors of different sizes.
Step 8: compute the performance-evaluation parameters of the classifiers.
Step 9: judge the optimal feature-vector count of the system according to the value of the evaluation function.
As shown in Fig. 2, the statistical text classification system based on the TF*IDF algorithm comprises:
a corpus collection and preprocessing module, used to collect training and test corpora from the Internet and to preprocess them (removal of hyperlinks, advertisements, and similar information, and word segmentation);
a feature selection module, used to extract the vocabulary of the corpus and, according to a feature selection algorithm, select feature words of different numbers from it to form feature vocabularies;
a feature weight computation module, used to compute feature weights;
a classification module, used to classify the corpus texts; and
a classification optimization module, used to compare the different classification results and find the feature-word count that gives the best classification effect;
the system further comprising a concept dictionary module and a class concept library module;
the concept dictionary module is used to store the category information of concepts;
the class concept library module is used to store the aggregate concept information of the different categories;
the feature weight computation module uses the concept-set information C obtained from the class concept library module and adopts the TF*IDF*CIV algorithm to compute the weights of the feature words of different numbers;
where the formula of the TF*IDF*CIV algorithm is:

W_ij = tf_ij × log(N / n_i) × (sim(c_i, C) + 1)

and the shared-concept count sim(c_i, C) is the number of concepts in the concept set c_i of feature term t_i that match concepts in the class concept set C of the concept library module.
The detailed problems involved in the technical scheme of the present invention are described below:
1. Corpus selection:
Download sufficient corpus material from the portal websites, divide off part of it as training corpus, and classify it by category. The division into categories should be as reasonable as possible, and the amount of data per category as balanced as possible.
2. Feature selection:
In the vector space model, the feature items representing a text may be characters, words, phrases, or even elements such as "concepts". Here we adopt words, which experiments have proved to be the most effective and most commonly used feature items for text classification. Many feature selection methods exist; common ones include the document frequency (DF) method, the information gain (IG) method, the χ² statistic (CHI) method, and the mutual information (MI) method. The main task of feature selection is to settle two questions: first, which words to choose as feature items; second, how many words to choose as feature items. We adopt the information gain method, with the following concrete steps:
1) Extract the vocabulary. After the preprocessing stage (word segmentation, etc.) is finished, compute the information gain value of each word as a candidate feature. The information gain method measures the importance of a feature term t_i by how much information it provides for the classification as a whole, and thereby decides whether to keep it. The information gain of a feature term t_i is the difference between the amount of information that can be provided for the classification with and without the term, where the amount of information is measured by entropy. That is, information gain is the entropy of the documents when no feature is considered minus the entropy of the documents after the feature is considered:

IG(t_i) = −∑_{j=1}^{M} P(C_j) log P(C_j) + P(t_i) ∑_{j=1}^{M} P(C_j | t_i) log P(C_j | t_i) + P(t̄_i) ∑_{j=1}^{M} P(C_j | t̄_i) log P(C_j | t̄_i)

In the formula, P(C_j) is the probability that a document of class C_j occurs in the corpus, P(t_i) is the probability that a document in the corpus contains feature term t_i, P(C_j | t_i) is the conditional probability that a document belongs to class C_j given that it contains t_i, P(t̄_i) is the probability that a document does not contain t_i, P(C_j | t̄_i) is the conditional probability that a document belongs to class C_j given that it does not contain t_i, and M is the number of classes.
2) Choose words of different numbers, ranked by information gain value, as the feature vectors (for convenience of later computation, multiples of 100 are chosen as feature counts, e.g., 1000, 2000, 3000, 4000, 5000, 6000).
3. Feature weight computation
The TF*IDF weighting method was proposed by Salton in 1973. It defines the weight W_ij of feature term t_i in text D_j as:

W_ij = tf_ij × log(N / n_i)

where tf_ij is the frequency with which feature term t_i occurs in training text D_j, n_i is the number of documents in the training set containing t_i, and N is the total number of documents in the training set. That is, the weight W_ij of feature term t_i in text D_j equals its total frequency in document D_j multiplied by the logarithm of its inverse document frequency over the entire document collection.
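The classic weight defined above is a one-liner, sketched here for comparison with the CIV-adjusted version (names illustrative):

```python
import math

def tfidf(tf_ij, n_i, N):
    """Salton's TF*IDF weight W_ij = tf_ij * log(N / n_i): the term's
    frequency in document D_j times the log of its inverse document
    frequency over the N training documents."""
    return tf_ij * math.log(N / n_i)
```

Note the limiting behavior the text describes: a term appearing in every document (n_i = N) gets weight 0, while rarer terms get larger IDF factors.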
Our improvement of the TF*IDF method consists mainly in introducing the CIV variable, which effectively computes the amount of conceptual information corresponding to a feature term and uses it to adjust the weight of the feature vector. The reasons are as follows:
In methods that represent documents with the vector space model, two factors have proved crucial to obtaining effective feature-term weights: first, the frequency with which a feature term occurs in a single document; second, the distribution of the feature term across the entire document collection. The TF*IDF method uses the absolute term frequency TF for the first factor; but some feature terms have very high frequency yet very weak discriminating power (many everyday words, for example), while others have low frequency but strong discriminating power, so an adjustment on top of TF is needed, and the IDF variable is introduced. The second factor is represented by the inverse document frequency IDF: the IDF value varies inversely with the number of documents containing the feature; in the extreme case, a feature that occurs in only one document has the highest IDF value. The fewer documents a feature term appears in, the larger its weight. The second factor in effect considers the distribution of feature terms over the whole collection, and feature terms with good discriminating power naturally have larger IDF values.
However, the vector space model, widely used in text representation, has an inherent defect. It uses contextual information to quantitatively describe the semantic features of words and measures semantic similarity by the distance between vectors, avoiding the data-sparseness problem, but it treats each word component of the vector as an independent feature item and isolates the relatedness between words. This defect is necessarily carried into the TF*IDF feature-weight computation. We therefore introduce a third variable, the concept information value CIV, to remedy it. From the viewpoint of set theory, we hold that the amount of information shared by two concept sets can be computed from the number of equal concepts in the two sets: the more identical concepts the two sets contain, the more information they share, the greater their semantic information, and the greater their semantic similarity. We thus compute the conceptual information of both sides from the number of concepts shared between the concept set c_i corresponding to a feature term's word and the concept set C corresponding to the category. Since the two sets may share no concepts at all, the concept information value is defined as the shared-concept count plus one, sim(c_i, C) + 1. We therefore obtain the following feature-word weight formula:

W_ij = tf_ij × log(N / n_i) × (sim(c_i, C) + 1)
The variables in the formula are as follows: W_ij is the weight of feature term t_i in text D_j; tf_ij is the frequency with which t_i occurs in training text D_j; n_i is the number of documents in the training set containing t_i; N is the total number of documents in the training set; and sim(c_i, C) is the number of concepts shared between the concept set c_i corresponding to feature term t_i and the concept set C corresponding to the category. By this method, every document in the training corpus obtains the weight values of its corresponding feature-term words. For example, the concept set C of the disaster class can be reduced by experts to a total of 33 concepts. If the feature vector contains the term "fire", whose corresponding concept string comprises two concepts, and both of those concepts occur in the disaster-class concept set C, then for the feature term "fire" the value of sim(c_i, C) is 2 and the CIV value is 3. If instead the feature vector contains the term "working as one man", whose corresponding concept string is {cooperation, fully}, and neither of these two concepts is included in the concept set C, then for that feature term sim(c_i, C) is 0 and its concept information value CIV is 1.
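The worked example above amounts to a set intersection plus one; a sketch with invented concept identifiers standing in for the concept-dictionary symbol strings:

```python
def civ(term_concepts, class_concepts):
    """Concept information value: the number of concepts shared between
    a term's concept set c_i and the class concept set C, plus one (so
    terms with no overlap keep a neutral factor of 1)."""
    return len(set(term_concepts) & set(class_concepts)) + 1

disaster_C = {"fire", "burn", "flood", "earthquake"}  # stand-in for the 33 expert concepts
print(civ({"fire", "burn"}, disaster_C))              # 2 shared concepts -> CIV = 3
print(civ({"cooperation", "fully"}, disaster_C))      # 0 shared concepts -> CIV = 1
```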
4. Classifier construction
The k-nearest-neighbor method is adopted, with the decision rule:

y(x, C_j) = ∑_{d_i ∈ kNN(x)} sim(x, d_i) · y(d_i, C_j) − b_j

In the formula, y(d_i, C_j) takes the value 0 or 1: the value 1 means that training document d_i belongs to class C_j, and the value 0 means that it does not; sim(x, d_i) is the similarity between test document x and training document d_i; and b_j is the threshold of the binary decision.
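The decision rule can be sketched as follows, assuming the k nearest neighbors and their similarities have already been found (all names illustrative):

```python
def knn_decide(similarities, memberships, b_j):
    """kNN decision for one class C_j: sum the similarity-weighted
    memberships y(d_i, C_j) of the k nearest training documents and
    compare the score against the threshold b_j.  memberships[i] is
    1 if training document d_i belongs to C_j, else 0."""
    score = sum(s * y for s, y in zip(similarities, memberships))
    return score - b_j > 0  # assign the test document to C_j if positive
```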
5. Evaluation function
For different purposes, many functions for evaluating text-classifier performance have been proposed, including recall, precision, the F-measure, micro- and macro-averages, the break-even point, and 11-point average precision. The two evaluation functions currently popular internationally are the micro-F1 measure and the macro-F1 measure [Yang, 1997]. The former is the more widely used, and it is defined as follows:

F1 measure: F1 = 2rp / (r + p), with precision p = Ncr / Nc and recall r = Ncr / Ns

where Nc is the number of texts the classifier assigned to classes, Nr is the number of texts rejected, Ncr is the number of correctly classified texts among those assigned, and Ns is the number of texts that should have been assigned.
For the feature term vectors of different sizes, the group whose evaluation function F1 is optimal gives the optimal feature-word count of this classification method on this corpus.
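The counts above combine into precision, recall, and F1 as a short function (names illustrative):

```python
def micro_f1(Nc, Ns, Ncr):
    """F1 from the counts defined above: Nc texts assigned by the
    classifier, Ns texts that should have been assigned, and Ncr
    of the assignments correct."""
    p = Ncr / Nc  # precision
    r = Ncr / Ns  # recall
    return 2 * r * p / (r + p)
```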
6. Experimental results
The corpus used in this experiment comprises 16,000 texts collected from portal websites on the Internet, of which 6,000 serve as training corpus, belonging to three categories with the following counts per field: urban-market management (1,019), disaster events (2,215), and other (2,766); the remaining 10,000 serve as test corpus.
The table below shows the results of this experiment.
As can be seen from the experiment, the classification effect of the TF*IDF*CIV algorithm is better than that of the TF*IDF algorithm at every feature-item count; when N = 5000, the F1 value of the classification improves by 6.5%, which fully proves the validity of the method. It should be noted that the embodiment introduced above is illustrative, not restrictive. Those skilled in the art will understand that any modification of, or equivalent substitution for, the technical scheme of the present invention that does not depart from its spirit and scope shall be encompassed within the scope of the claims of the present invention.
Claims (6)
1. A statistical text classification method based on the TF*IDF algorithm, the method comprising the following steps:
1) collecting corpus material and dividing it into a training corpus and a test corpus;
2) classifying and preprocessing the training corpus;
3) extracting the vocabulary of each field from the training corpus, and extracting an overall vocabulary at the same time;
4) summarizing the concepts belonging to each category of the training corpus, using a concept dictionary to extract the concept set of each category, and forming a class-concept-set library C, the concept-set library C being used to compute the concept information value CIV;
5) performing feature selection on the test corpus to obtain feature-vector tables of different sizes;
6) using the feature-vector weighting algorithm (TF*IDF*CIV) to compute the weight of each feature word of the feature-vector table, with the formula:

W_ij = tf_ij × log(N / n_i) × (sim(c_i, C) + 1)

wherein sim(c_i, C) + 1 is the concept information value CIV, and the shared-concept count sim(c_i, C) is the number of concepts in the concept set c_i of any feature term t_i of the feature-vector table of step 5) that match concepts in the class concept set C of step 4);
7) constructing the corresponding text classifiers, using them to classify the test corpus, and obtaining the classification results;
8) using an evaluation function to compute the performance-evaluation parameters of the classifiers, and determining the optimal feature-vector table from the evaluation results.
2. The statistical text classification method based on the TF*IDF algorithm according to claim 1, characterized in that the preprocessing of step 2) is: removing unwanted hyperlinks and advertising information from the web-page text, and performing word segmentation on the text.
3. The statistical text classification method based on the TF*IDF algorithm according to claim 1 or 2, characterized in that the feature selection of step 5) adopts the information gain method, which comprises the following substeps:
5-1) extracting the vocabulary and, after preprocessing, computing the information gain value of each segmented word as a candidate feature, the information gain value being the difference between the entropy of the documents when no feature is considered and the entropy of the documents after the feature is considered, computed as:

IG(t_i) = −∑_{j=1}^{M} P(C_j) log P(C_j) + P(t_i) ∑_{j=1}^{M} P(C_j | t_i) log P(C_j | t_i) + P(t̄_i) ∑_{j=1}^{M} P(C_j | t̄_i) log P(C_j | t̄_i)

wherein P(C_j) is the probability that a document of class C_j occurs in the corpus, P(t_i) is the probability that a document in the corpus contains feature term t_i, P(C_j | t_i) is the conditional probability that a document belongs to class C_j given that it contains t_i, P(t̄_i) is the probability that a document does not contain t_i, P(C_j | t̄_i) is the conditional probability that a document belongs to class C_j given that it does not contain t_i, and M is the number of classes;
5-2) choosing words of different numbers, ranked by information gain value, as the feature vectors.
4. The statistical text classification method based on the TF*IDF algorithm according to claim 1, characterized in that the classifier adopts the k-nearest-neighbor method.
5. The statistical text classification method based on the TF*IDF algorithm according to claim 1, characterized in that the evaluation function adopts the micro-F1 measure function.
6. A statistical text classification system based on the TF*IDF algorithm, the system comprising a corpus collection and preprocessing module, a feature selection module, a feature weight computation module, a classification module, and a classification optimization module;
the corpus collection and preprocessing module is used to collect training and test corpora from the Internet and to perform hyperlink removal, advertisement removal, and word segmentation preprocessing on the corpora;
the feature selection module is used to extract the vocabulary of the corpus and, according to a feature selection algorithm, select feature words in different numbers to form a feature word list;
the feature weight computation module is used to compute the feature weights;
the classification module is used to classify the corpus texts; and
the classification optimization module is used to compare the different classification results and find the number of feature words that yields the best classification effect;
characterized in that the system further comprises a concept dictionary module and a class concept library module;
the concept dictionary module is used to store the category information to which each concept belongs;
the class concept library module is used to store the aggregate concept set information C for each category; and
the feature weight computation module uses the concept set information C obtained from the class concept library module and adopts the TF*IDF*CIV algorithm to compute the weights of the feature words in different numbers.
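The exact TF*IDF*CIV formula appears only as an image in the original publication and is not reproduced here. The sketch below illustrates the general idea stated in the abstract: scaling a standard TF*IDF weight by a conceptual-information-amount factor derived from sim(ci, C), the number of concepts in the feature's concept set ci that also appear in the class concept set C. How sim enters the final weight (the `1 + sim` smoothing here) is an assumption of this sketch, not the claimed formula.

```python
import math

def tf_idf_civ(tf, df, n_docs, concept_set, class_concepts):
    """Illustrative TF*IDF*CIV weight for one feature term in one class.

    tf: term frequency of the feature in the document.
    df: number of corpus documents containing the feature.
    n_docs: total number of documents in the corpus.
    concept_set: concept set ci of the feature (from the concept dictionary).
    class_concepts: concept set C of the class (from the class concept library).
    """
    idf = math.log(n_docs / df) if df else 0.0
    sim = len(concept_set & class_concepts)  # sim(ci, C): shared concept count
    civ = 1 + sim  # assumed smoothing: zero overlap degrades to plain TF*IDF
    return tf * idf * civ
```

With this smoothing, a feature whose concepts never overlap the class concept set keeps its ordinary TF*IDF weight, while semantically related features are boosted in proportion to their concept overlap.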
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2011100338086A CN102622373B (en) | 2011-01-31 | 2011-01-31 | Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102622373A true CN102622373A (en) | 2012-08-01 |
CN102622373B CN102622373B (en) | 2013-12-11 |
Family
ID=46562296
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2011100338086A Expired - Fee Related CN102622373B (en) | 2011-01-31 | 2011-01-31 | Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102622373B (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6704905B2 (en) * | 2000-12-28 | 2004-03-09 | Matsushita Electric Industrial Co., Ltd. | Text classifying parameter generator and a text classifier using the generated parameter |
Non-Patent Citations (4)
Title |
---|
ZHANG YUNLIANG, ZHANG QUAN: "Research on Automatic Text Classification Based on the Sentence-Category Vector Space Model", Computer Engineering * |
MIAO JIANMING, ZHANG QUAN, ZHAO JINFANG: "Automatic Chinese Text Classification Based on Article Title Information", Computer Engineering * |
CAI YINSHAN, HUANG YINGMING: "Automatic Web Page Classification Based on an Improved TF-IDF Feature Weight Algorithm", Journal of Mianyang Normal University * |
ZHAO JINFANG, ZHAO YAN, MIAO JIANMING: "Web Page Information Extraction and the Implementation of Automatic Text Classification", Computer Technology and Development * |
Cited By (36)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103927302A (en) * | 2013-01-10 | 2014-07-16 | 阿里巴巴集团控股有限公司 | Text classification method and system |
CN103927302B (en) * | 2013-01-10 | 2017-05-31 | 阿里巴巴集团控股有限公司 | A kind of file classification method and system |
CN103106275B (en) * | 2013-02-08 | 2016-02-10 | 西北工业大学 | The text classification Feature Selection method of feature based distributed intelligence |
CN103106275A (en) * | 2013-02-08 | 2013-05-15 | 西北工业大学 | Text classification character screening method based on character distribution information |
CN103425735A (en) * | 2013-06-06 | 2013-12-04 | 深圳市宜搜科技发展有限公司 | Establishing method and system based on website subject term inquiry |
CN103425735B (en) * | 2013-06-06 | 2017-08-11 | 深圳市宜搜科技发展有限公司 | A kind of method for building up and system based on website subject term inquiry |
CN104424308A (en) * | 2013-09-04 | 2015-03-18 | 中兴通讯股份有限公司 | Web page classification standard acquisition method and device and web page classification method and device |
CN103593339A (en) * | 2013-11-29 | 2014-02-19 | 哈尔滨工业大学深圳研究生院 | Electronic-book-oriented semantic space representing method and system |
WO2015135452A1 (en) * | 2014-03-14 | 2015-09-17 | Tencent Technology (Shenzhen) Company Limited | Text information processing method and apparatus |
US10262059B2 (en) | 2014-03-14 | 2019-04-16 | Tencent Technology (Shenzhen) Company Limited | Method, apparatus, and storage medium for text information processing |
CN104391835B (en) * | 2014-09-30 | 2017-09-29 | 中南大学 | Feature Words system of selection and device in text |
CN105045812A (en) * | 2015-06-18 | 2015-11-11 | 上海高欣计算机系统有限公司 | Text topic classification method and system |
CN105045812B (en) * | 2015-06-18 | 2019-01-29 | 上海高欣计算机系统有限公司 | The classification method and system of text subject |
CN106844424B (en) * | 2016-12-09 | 2020-11-03 | 宁波大学 | LDA-based text classification method |
CN106844424A (en) * | 2016-12-09 | 2017-06-13 | 宁波大学 | A kind of file classification method based on LDA |
CN108509407A (en) * | 2017-02-27 | 2018-09-07 | 广东神马搜索科技有限公司 | Text semantic similarity calculating method, device and user terminal |
CN108509407B (en) * | 2017-02-27 | 2022-03-18 | 阿里巴巴(中国)有限公司 | Text semantic similarity calculation method and device and user terminal |
CN108509410B (en) * | 2017-02-27 | 2022-08-05 | 阿里巴巴(中国)有限公司 | Text semantic similarity calculation method and device and user terminal |
CN108509410A (en) * | 2017-02-27 | 2018-09-07 | 广东神马搜索科技有限公司 | Text semantic similarity calculating method, device and user terminal |
CN107085608A (en) * | 2017-04-21 | 2017-08-22 | 上海喆之信息科技有限公司 | A kind of effective network hotspot monitoring system |
CN108509471A (en) * | 2017-05-19 | 2018-09-07 | 苏州纯青智能科技有限公司 | A kind of Chinese Text Categorization |
CN107194617A (en) * | 2017-07-06 | 2017-09-22 | 北京航空航天大学 | A kind of app software engineers soft skill categorizing system and method |
CN107301248B (en) * | 2017-07-19 | 2020-07-21 | 百度在线网络技术(北京)有限公司 | Word vector construction method and device of text, computer equipment and storage medium |
CN107301248A (en) * | 2017-07-19 | 2017-10-27 | 百度在线网络技术(北京)有限公司 | Term vector construction method and device, computer equipment, the storage medium of text |
CN107562814A (en) * | 2017-08-14 | 2018-01-09 | 中国农业大学 | A kind of earthquake emergency and the condition of a disaster acquisition of information sorting technique and system |
CN108228546A (en) * | 2018-01-19 | 2018-06-29 | 北京中关村科金技术有限公司 | A kind of text feature, device, equipment and readable storage medium storing program for executing |
CN108376130A (en) * | 2018-03-09 | 2018-08-07 | 长安大学 | A kind of objectionable text information filtering feature selection approach |
CN108461111A (en) * | 2018-03-16 | 2018-08-28 | 重庆医科大学 | Chinese medical treatment text duplicate checking method and device, electronic equipment, computer read/write memory medium |
CN108898274A (en) * | 2018-05-30 | 2018-11-27 | 国网浙江省电力有限公司宁波供电公司 | A kind of power scheduling log defect classification method |
CN110019817A (en) * | 2018-12-04 | 2019-07-16 | 阿里巴巴集团控股有限公司 | A kind of detection method, device and the electronic equipment of text in video information |
CN110287328A (en) * | 2019-07-03 | 2019-09-27 | 广东工业大学 | A kind of file classification method, device, equipment and computer readable storage medium |
CN111259649A (en) * | 2020-01-19 | 2020-06-09 | 深圳壹账通智能科技有限公司 | Interactive data classification method and device of information interaction platform and storage medium |
CN112269880A (en) * | 2020-11-04 | 2021-01-26 | 吾征智能技术(北京)有限公司 | Sweet text classification matching system based on linear function |
CN112269880B (en) * | 2020-11-04 | 2024-02-09 | 吾征智能技术(北京)有限公司 | Sweet text classification matching system based on linear function |
CN113032573A (en) * | 2021-04-30 | 2021-06-25 | 《中国学术期刊(光盘版)》电子杂志社有限公司 | Large-scale text classification method and system combining theme semantics and TF-IDF algorithm |
CN113032573B (en) * | 2021-04-30 | 2024-01-23 | 同方知网数字出版技术股份有限公司 | Large-scale text classification method and system combining topic semantics and TF-IDF algorithm |
Also Published As
Publication number | Publication date |
---|---|
CN102622373B (en) | 2013-12-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102622373B (en) | Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm | |
CN101819601B (en) | Method for automatically classifying academic documents | |
CN104750844B (en) | Text eigenvector based on TF-IGM generates method and apparatus and file classification method and device | |
CN100533441C (en) | Two-stage combined file classification method based on probability subject | |
CN104951548B (en) | A kind of computational methods and system of negative public sentiment index | |
Al Qadi et al. | Arabic text classification of news articles using classical supervised classifiers | |
CN102194013A (en) | Domain-knowledge-based short text classification method and text classification system | |
CN107577785A (en) | A kind of level multi-tag sorting technique suitable for law identification | |
CN101763431A (en) | PL clustering method based on massive network public sentiment information | |
CN105760493A (en) | Automatic work order classification method for electricity marketing service hot spot 95598 | |
CN103886108B (en) | The feature selecting and weighing computation method of a kind of unbalanced text set | |
CN104391835A (en) | Method and device for selecting feature words in texts | |
CN101587493A (en) | Text classification method | |
CN103995876A (en) | Text classification method based on chi square statistics and SMO algorithm | |
CN109522544A (en) | Sentence vector calculation, file classification method and system based on Chi-square Test | |
CN103473262A (en) | Automatic classification system and automatic classification method for Web comment viewpoint on the basis of association rule | |
CN105975518A (en) | Information entropy-based expected cross entropy feature selection text classification system and method | |
CN106156372A (en) | The sorting technique of a kind of internet site and device | |
CN102629272A (en) | Clustering based optimization method for examination system database | |
CN106570076A (en) | Computer text classification system | |
CN101976270A (en) | Uncertain reasoning-based text hierarchy classification method and device | |
CN104809229A (en) | Method and system for extracting text characteristic words | |
CN109376235A (en) | The feature selection approach to be reordered based on document level word frequency | |
CN105224689A (en) | A kind of Dongba document sorting technique | |
CN111708865A (en) | Technology forecasting and patent early warning analysis method based on improved XGboost algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20131211 Termination date: 20170131 |