CN103106275A - Text classification feature screening method based on feature distribution information

Text classification feature screening method based on feature distribution information

Info

Publication number
CN103106275A
CN103106275A (application numbers CN2013100505834A / CN201310050583A)
Authority
CN
China
Prior art keywords
feature
character
classification
document
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013100505834A
Other languages
Chinese (zh)
Other versions
CN103106275B (en)
Inventor
李思男
李战怀
李宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN201310050583.4A
Publication of CN103106275A
Application granted
Publication of CN103106275B
Legal status: Expired - Fee Related
Anticipated expiration

Abstract

The invention discloses a text classification feature screening method based on feature distribution information, which addresses the poor accuracy of existing text classification feature screening methods. The technical scheme is as follows: first preprocess each document of the document set; represent the whole document collection as a vector space model (VSM); construct a feature dictionary; count the document frequency DF(t, C_i), i.e. the number of documents in each class C_i containing the feature word t; calculate a normalized tf*idf value for each class C_i, then calculate the intra-class dispersion DIntra and the average inter-class dispersion DInterAvg of the feature word in each class C_i; calculate the weight w_i(t) of each feature word t_k in each class C_i of the text feature space; finally, sort all feature words in descending order of their weight over the whole document set and, during feature screening, preferentially keep the top-ranked feature words. Building on a term distribution framework, the method applies that framework to the feature screening process and improves the efficiency and accuracy of text classification.

Description

Text classification feature selection method based on feature distribution information
Technical field
The present invention relates to text classification feature selection methods, and in particular to a text classification feature selection method based on feature distribution information.
Background technology
With the development of communication technology and networks, large numbers of electronic documents such as news articles, e-mails, and microblog posts are generated on the Internet every day. Automatic text classification, as an efficient method for classifying and managing large document collections, is widely used in many fields.
With the explosive growth of information, the main problem facing automatic text classification is how to handle the high-dimensional text vector feature spaces produced by large amounts of text data. An overly high-dimensional feature space harms text classification in two ways: (1) many mature methods cannot be optimized in high-dimensional spaces and therefore cannot be applied to text classification; (2) because the classifier is trained on a training set, an overly high-dimensional vector space inevitably causes overfitting [1]. Moreover, most dimensions of the text vector space are irrelevant to the classification and may even introduce noise that degrades classification precision [2]. Text feature screening selects, according to some feature screening algorithm, a subset of representative text features from the original feature space to form a new, lower-dimensional feature space, thereby achieving dimensionality reduction. It is an effective way to solve the problem of excessively high feature-space dimensionality in text classification. The purpose of text feature screening is to improve both the effectiveness of text classification and the execution efficiency of the algorithms. Many experiments have shown that, in most cases, aggressively reducing the feature space yields a large performance gain at only a small loss in classification precision [3].
Existing feature screening algorithms for text classification mainly include document frequency (DF), information gain (IG), information gain ratio (GR), the chi-square test (CHI), mutual information (MI), and the Gini index [3, 4]. Several of the techniques that perform well in text classification are briefly introduced below:
Document frequency (DF): for a given feature t, the document frequency is the number of documents in the collection that contain t. Its basic assumption is that rare features are unhelpful for class prediction, or at least do not affect overall performance. Its advantages are simple implementation and low computational cost, so feature selection is fast and works reasonably well in practice. Its drawback is that a feature rare in the collection may not be rare within a particular class and may still carry important class information; simply discarding it may hurt classification, so DF should not be used to reject features aggressively.
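As an illustration, a minimal Python sketch of DF-based screening over a toy tokenized corpus (the corpus and the threshold of 2 are hypothetical, not taken from the patent):

```python
# Minimal document-frequency (DF) screening sketch over a toy corpus.
from collections import Counter

docs = [
    (["yao", "great", "talent", "basketbal", "game"], "PE"),
    (["plai", "game", "basketbal", "playground"], "PE"),
    (["enjoi", "music", "concert"], "MUSIC"),
]

df = Counter()
for tokens, _label in docs:
    df.update(set(tokens))  # a document contributes at most 1 per term

kept = sorted(t for t, n in df.items() if n >= 2)  # threshold of 2 is arbitrary
print(kept)  # ['basketbal', 'game']
```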
Information gain (IG): information gain is an entropy-based criterion. For a given feature t, it compares the information content of the system with and without the feature; the difference is the amount of information the feature contributes, i.e. the gain [5]. Information gain accounts for both the presence and the absence of a feature. On imbalanced data sets, experiments show that for rare classes the contribution made by modeling a feature's absence is often far smaller than the interference it introduces compared with modeling its presence.
Information gain ratio (GR): information gain has been shown to be biased in many settings. When an attribute takes many distinct values in the training set, information gain tends to prefer that attribute; the information gain ratio corrects this shortcoming [6].
Chi-square test (CHI): the chi-square test is a method commonly used in mathematical statistics to test the independence of two variables. Its basic idea is to judge the correctness of a theory by the deviation between observed values and theoretical values [7, 8].
Experiments in text classification show that the chi-square test is among the best criteria for feature selection. However, it records only whether feature t occurs in a text, not how many times it occurs, which gives it a tendency to exaggerate the importance of low-frequency words; this is the well-known "low-frequency word defect" of the chi-square test.
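For reference, a minimal sketch of the chi-square statistic in its standard 2x2 contingency form from the feature-selection literature; the toy counts below are hypothetical, and this is not the weighting scheme of the present invention:

```python
# Chi-square statistic for a term/class pair, standard 2x2 contingency form.
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """a: docs in the class containing t, b: docs outside it containing t,
    c: docs in the class without t, d: docs outside it without t."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

# A term present in 2 of 3 in-class docs and 0 of 3 out-of-class docs.
print(chi_square(a=2, b=0, c=1, d=3))  # 3.0
```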
The present invention builds on a term distribution framework [9], improves the computation of inter-class dispersion, and applies the framework to the feature screening process.
List of references:
[1] Jieming Yang, Yuanning Liu, Xiaodong Zhu et al., "A new feature selection based on comprehensive measurement both in inter-category and intra-category for text categorization", Information Processing & Management, Volume 48, Issue 4, 2012, pp. 741-754.
[2] Wenqian Shang, Houkuan Huang, Haibin Zhu et al., "A novel feature selection algorithm for text classification", Expert Systems with Applications, Volume 33, Issue 1, 2007, pp. 1-5.
[3] Monica Rogati and Yiming Yang, "High-performing feature selection for text classification", in Proceedings of the Eleventh International Conference on Information and Knowledge Management (CIKM '02), ACM, New York, NY, USA, 2002, pp. 659-661.
[4] Yang, Y. and Pedersen, J.O., "A Comparative Study on Feature Selection in Text Categorization", in Proceedings of the 14th International Conference on Machine Learning, Nashville, USA, 1997, pp. 412-420.
[5] Forman, G., "An Extensive Empirical Study of Feature Selection Metrics for Text Classification", Journal of Machine Learning Research, 3, 2003, pp. 1289-1305.
[6] Tatsunori Mori, Miwa Kikuchi and Kazufumi Yoshida, "Term Weighting Method based on Information Gain Ratio for Summarizing Documents Retrieved by IR Systems", Journal of Natural Language Processing, 9(4), 2001, pp. 3-32.
[7] Zheng, Z. and Srihari, R., "Optimally Combining Positive and Negative Features for Text Classification", ICML 2003 Workshop on Learning from Imbalanced Data Sets, 2003.
[8] Luigi Galavotti, Fabrizio Sebastiani et al., "Feature Selection and Negative Evidence in Automated Text Classification", in Proceedings of the 4th European Conference on Research and Advanced Technology for Digital Libraries (ECDL '00), 2000.
[9] V. Lertnattee and T. Theeramunkong, "Improving centroid-based text classification using term-distribution-based weighting and feature selection", in Proceedings of INTECH-01, 2nd International Conference on Intelligent Technologies, Bangkok, Thailand, 2001, pp. 349-355.
Summary of the invention
To overcome the poor accuracy of existing text classification feature selection methods, the invention provides a text classification feature selection method based on feature distribution information. On the basis of a term distribution framework, the method improves the computation of inter-class dispersion and applies the framework to the feature screening process. It makes full use of the tf*idf, intra-class distribution, and inter-class distribution information of text features to reflect more objectively the importance of a term in the text, thereby selecting terms that represent the text well, achieving the purpose of feature screening, and improving the efficiency and accuracy of text classification. The method reaches high classification accuracy while selecting relatively few terms and converges quickly; its improvement to the inter-class distribution measure also makes it applicable to skewed data sets.
The technical solution adopted by the present invention to solve the technical problem is a text classification feature selection method based on feature distribution information, characterized by comprising the following steps:
1. Perform word segmentation, stop-word removal, and stemming on each document in the document set.
2. Represent the whole document collection as a vector space model.
3. Extract all feature words from the document collection and construct the feature dictionary.
4. For each feature word t in the text feature space, count the frequency TF(t, d_j) with which it appears in every document d_j and the frequency TF(t, C_i) with which it appears in each class C_i; at the same time, count the number of documents DF(t, C_i) in each class C_i that contain t.
5. Using the statistics obtained in step 4, for each feature word t_k, first calculate its normalized tf*idf value for each class C_i, then calculate its intra-class dispersion DIntra and average inter-class dispersion DInterAvg for each class C_i.
6. Using the information obtained in steps 4 and 5, compute the weight w_i(t) of each feature word t_k in each class C_i of the text feature space with the following formula:
w_i(t) = tf*idf × DInterAvg × (1 − DIntra)
7. Sum the weights of feature word t_k over all classes to obtain its weight in the whole document set, i.e. the TDFS value of t_k:
TDFS(t) = Σ_{i=1}^{NC} w_i(t)
8. Sort all feature words in descending order of their weight in the whole document set; when screening features, preferentially keep the top-ranked feature words.
The beneficial effects of the invention are as follows: on the basis of a term distribution framework, the method improves the computation of inter-class dispersion and applies the framework to the feature screening process. It makes full use of the tf*idf, intra-class distribution, and inter-class distribution information of text features to reflect more objectively the importance of a term in the text, thereby selecting terms that represent the text well, achieving the purpose of feature screening, and improving the efficiency and accuracy of text classification. The method reaches high classification accuracy while selecting relatively few terms and converges quickly; its improvement to the inter-class distribution measure also makes it applicable to skewed data sets.
The present invention is described in detail below with reference to the drawings and an embodiment.
Description of drawings
Fig. 1 is the flowchart of the text classification feature selection method based on feature distribution information of the present invention.
Embodiment
The concrete steps of the method of the invention are as follows:
1. Concepts relevant to the present invention.
Tf*idf (term frequency–inverse document frequency): a statistical measure used to assess how important a word is to a document within a document collection or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases with the frequency at which it appears across the corpus.
Intra-class dispersion (intra-class distribution): the distribution of a feature word within the documents of a given class. If the word is evenly distributed over the documents of the class, its intra-class dispersion in that class is low; conversely, if it is concentrated in a few documents and does not appear in the rest, its intra-class dispersion in that class is high.
Inter-class dispersion (inter-class distribution): the distribution of a feature word over the classes of the whole document set. If the word is evenly distributed over the documents of all classes, its inter-class dispersion in the document set is low; conversely, if it is concentrated in one or a few classes and does not occur in the others, its inter-class dispersion in the document set is high.
Average inter-class dispersion (average inter-class distribution): a concept proposed by the present invention as an improvement on inter-class dispersion. Inter-class dispersion uses the total frequency of a feature word in the documents of each class to measure its distribution over the classes; if the numbers of documents in different classes differ greatly, i.e. the data set is skewed, the feature words of classes with fewer documents are drowned out by the classes with more documents. The improved average inter-class dispersion instead uses the average frequency of the feature word per document in each class, so it is unaffected by data skew and can accurately reflect the inter-class distribution of the feature word.
2. Properties relevant to the present invention.
Property 1: the more times a feature word occurs in the documents of a class, the better it indicates the class of a document, and the larger its weight.
Table 1. Sample of the original document set after removal of junk information such as formatting

Numbering | Original document | Classification
1 | Yao has great talent in basketball games. | PE
2 | We are playing a game about basketball in the playground. | PE
3 | We are enjoying the music at the concert. | MUSIC
4 | Music is an art and everybody may enjoy it. | MUSIC
5 | Playing basketball is my favorite sport. | PE
6 | Listening to the music is my hobby. | MUSIC
For example, in the document set shown in Table 1, the feature word basketball occurs 3 times in the PE documents, so its weight is larger, while talent occurs only once in that class, so its weight is smaller.
Property 2: the lower the intra-class dispersion of a feature word, the better it indicates the class of a document, and the larger its weight.
Table 2. The document collection after text preprocessing

Numbering | Training document | Classification
1 | yao ha great talent basketbal game | PE
2 | we plai game about basketbal playground | PE
3 | we enjoi music concert | MUSIC
4 | music art everybodi mai enjoi | MUSIC
5 | plai basketbal my favorit sport | PE
6 | listen music my hobbi | MUSIC
Table 3. All feature words of the sample document collection sorted by TDFS value in descending order

Feature word | TDFS value
music | 0.554
basketbal | 0.489
enjoi | 0.489
game | 0.394
plai | 0.394
concert | 0.158
art | 0.158
everybodi | 0.158
mai | 0.158
hobbi | 0.158
listen | 0.158
great | 0.140
ha | 0.140
talent | 0.140
yao | 0.140
playground | 0.140
favorit | 0.140
sport | 0.140
For example, in the document set shown in Table 1, the feature word basketball is evenly distributed over the PE documents, with one occurrence per document, so its intra-class dispersion is low; the word is present broadly and evenly in PE documents, indicates the class well, and gets a larger weight. The feature word talent occurs in only one PE document and in neither of the other two, so its intra-class dispersion is high; a document containing it cannot be reliably judged to belong to PE, and its computed weight is accordingly lower.
Property 3: the higher the inter-class dispersion of a feature word, the better it indicates the class of a document, and the larger its weight.
For example, in the document set shown in Table 1, the feature word basketball occurs only in PE documents and never in MUSIC documents, so its inter-class dispersion is very high; it indicates the class well and gets a larger weight. The feature word my occurs once in each of the PE and MUSIC classes, an even inter-class distribution, so its inter-class dispersion is very low; it cannot represent a class well, and its weight is correspondingly low.
Property 4: the higher the average inter-class dispersion of a feature word, the better it indicates the class of a document, and the larger its weight.
Table 4. First example of the average inter-class dispersion property
Table 5. Second example of the average inter-class dispersion property
[table images not preserved: Tables 4 and 5 list the per-class document counts and frequencies of a feature word t for the two examples discussed below]
For example, Tables 4 and 5 give two examples. Suppose the distribution of a feature word t over the document set is as shown in Table 4. Because the total frequency of t is identical in classes A and B (2 in each), the inter-class dispersion of t is 0, yet t clearly still has some representative value for class B; only because the document counts of classes A and B differ so much is the feature word t of the document-poor class B drowned out by the document-rich class A. For the example in Table 5, the inter-class dispersion of t likewise comes out as 0 (the total frequency of t is 1000 in both classes A and B), yet the importance of t for distinguishing the two classes is evident. This shows that when the data set is skewed, plain inter-class dispersion cannot bring out representative feature words in the classes with fewer documents. Measured by average inter-class dispersion instead, the distribution of t over classes A and B is very uneven and the average inter-class dispersion is high, so t receives a reasonable weight; the adverse effect of data skew is eliminated while the original function of inter-class dispersion is retained.
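A small numeric sketch of the Table 4 situation: the class sizes |A| = 100 and |B| = 2 are assumptions, since the original table survives only as an image, and the coefficient-of-variation form follows the reconstruction given in the detailed steps below:

```python
# Plain inter-class dispersion vs. the average-based measure under skew.
from statistics import mean, stdev

total_tf = {"A": 2, "B": 2}    # total frequency of t per class (from the text)
n_docs = {"A": 100, "B": 2}    # hypothetical, strongly skewed class sizes

totals = list(total_tf.values())
print(stdev(totals) / mean(totals))  # 0.0: plain inter-class dispersion is blind

avgs = [total_tf[c] / n_docs[c] for c in total_tf]
print(stdev(avgs) / mean(avgs))      # ~1.36: the average-based measure reacts
```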
For a given document set D, the detailed process by which the present invention screens the attributes of the document set is as follows:
1. Parse all the documents in the document set, discard useless structural markup and the like, and extract the main information of each document, such as its title and body.
Documents may contain structural markup that appears identically in every document; such markers, together with content irrelevant to text classification such as timestamps, are filtered out first.
[table image not preserved: examples of the structural markup filtered out at this stage]
2. Preprocess the body text and extract the terms that constitute the text feature space.
After the parsing of step 1, the content of each document in the set is available (see Table 1). Each document is then preprocessed by tokenizing, stop-word removal, and stemming, which yields a set of words. Each word in the set is called a text feature item (term), and all the terms together constitute the text feature space (term space). Table 2 shows the result of preprocessing the documents of Table 1.
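A minimal preprocessing sketch using NLTK (an assumed toolchain): NLTK's default PorterStemmer applies extensions and its English stop list is more aggressive than the one behind Table 2, so a few outputs differ (e.g. "play" versus Table 2's "plai", and "about" is dropped here but retained in Table 2):

```python
# Tokenize, remove stop words, and stem one document.
import nltk
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

STOP = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text: str) -> list[str]:
    tokens = [w.strip(".,!?").lower() for w in text.split()]
    return [stemmer.stem(w) for w in tokens if w and w not in STOP]

print(preprocess("We are playing a game about basketball in the playground."))
# -> ['play', 'game', 'basketbal', 'playground'] (modulo stemmer/stop-list variant)
```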
3. Extract all feature words from the document collection and construct the feature dictionary.
For all documents in the set, after the processing of step 2, collect every feature word occurring in the document set into a feature dictionary, which serves as the basis of feature screening.
4. For each feature word t in the text feature space, count the frequency TF(t, d_j) with which it appears in every document d_j and the frequency TF(t, C_i) with which it appears in each class C_i; at the same time, count the number of documents DF(t, C_i) in each class C_i that contain t.
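A sketch of these step-4 statistics over part of the Table 2 corpus (variable names are illustrative):

```python
# Count TF(t, d_j), TF(t, C_i) and DF(t, C_i).
from collections import Counter, defaultdict

docs = [
    (["yao", "ha", "great", "talent", "basketbal", "game"], "PE"),
    (["we", "plai", "game", "about", "basketbal", "playground"], "PE"),
    (["plai", "basketbal", "my", "favorit", "sport"], "PE"),
    (["we", "enjoi", "music", "concert"], "MUSIC"),
]

tf_doc = [Counter(tokens) for tokens, _ in docs]  # TF(t, d_j)
tf_class = defaultdict(Counter)                   # TF(t, C_i)
df_class = defaultdict(Counter)                   # DF(t, C_i)
for tokens, label in docs:
    tf_class[label].update(tokens)
    df_class[label].update(set(tokens))

print(tf_doc[0]["basketbal"])       # 1
print(tf_class["PE"]["basketbal"])  # 3
print(df_class["PE"]["basketbal"])  # 3
```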
5. Using the statistics obtained in step 4, compute for each feature word its normalized tf*idf value, its intra-class dispersion, and its average inter-class dispersion.
(1) Normalized tf*idf, computed as:
tf*idf(t, C_i) = ( TF(t, C_i) / n ) × log( ND / n_t + L )
n_t = Σ_{j=1}^{NC} DF(t, C_j)
In the formula, n is the total number of feature-word occurrences in class C_i, NC is the number of classes, ND is the total number of documents, and L is a constant determined by experiment, usually 0.1 or 0.01. The normalized tf*idf value avoids the computational bias that overly long documents would otherwise introduce.
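A sketch of this normalized tf*idf under the reconstructed reading above (the original formula is preserved only as an image, so the exact form is an assumption):

```python
# Normalized tf*idf: tf = TF(t, C_i)/n, idf = log(ND/n_t + L).
import math

def norm_tfidf(tf_t_ci: int, n: int, nd: int, n_t: int, l: float = 0.1) -> float:
    """tf_t_ci: TF(t, C_i); n: total term occurrences in C_i;
    nd: total number of documents ND; n_t: sum of DF(t, C_j); l: smoothing L."""
    return (tf_t_ci / n) * math.log(nd / n_t + l)

# basketball in PE for the Table 2 corpus: TF(t, C_i) = 3, the PE class holds
# 17 term occurrences, the set has 6 documents, and 3 of them contain the term.
print(norm_tfidf(tf_t_ci=3, n=17, nd=6, n_t=3))  # ~0.13
```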
(2) Intra-class dispersion, computed as:
DIntra = sqrt( Σ_{j=1}^{|C_i|} [ TF(t, d_j) − TF(t, C_i)/|C_i| ]² / (|C_i| − 1) ) / ( TF(t, C_i) / |C_i| )
that is, the sample standard deviation of the frequency of t over the documents of class C_i, divided by its mean frequency in C_i.
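A sketch of this intra-class dispersion as the coefficient of variation described above (the square root is part of the reconstruction and is an assumption):

```python
# Intra-class dispersion: std of per-document frequency over its mean.
from statistics import mean, stdev

def d_intra(per_doc_tf: list[int]) -> float:
    m = mean(per_doc_tf)
    return stdev(per_doc_tf) / m if m else 0.0

print(d_intra([1, 1, 1]))  # basketball in PE: perfectly even -> 0.0
print(d_intra([1, 0, 0]))  # talent in PE: concentrated -> ~1.73
```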
(3) Average inter-class dispersion, computed as:
DInterAvg = sqrt( Σ_{i=1}^{NC} [ TF(t, C_i)/|C_i| − Σ_{j=1}^{NC} TF(t, C_j)/ND ]² / (NC − 1) ) / ( Σ_{j=1}^{NC} TF(t, C_j)/ND )
that is, the standard deviation of the per-document average frequency of t over the classes, divided by its average per-document frequency over the whole document set.
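A sketch of the average inter-class dispersion under the same reconstruction:

```python
# Deviation of per-document class averages TF(t, C_i)/|C_i| from the grand
# mean sum_j TF(t, C_j)/ND, normalized by that grand mean.
def d_inter_avg(class_tf: dict[str, int], class_size: dict[str, int], nd: int) -> float:
    grand = sum(class_tf.values()) / nd
    avgs = [class_tf[c] / class_size[c] for c in class_tf]
    var = sum((a - grand) ** 2 for a in avgs) / (len(avgs) - 1)
    return var ** 0.5 / grand if grand else 0.0

# basketball: 3 occurrences in PE (3 docs), none in MUSIC (3 docs), ND = 6.
print(d_inter_avg({"PE": 3, "MUSIC": 0}, {"PE": 3, "MUSIC": 3}, nd=6))  # ~1.41
```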
6. Using the results of step 5, calculate the weight of term t in each class:
w_i(t) = tf*idf × DInterAvg × (1 − DIntra)
7. Sum the weights of term t over all the classes to obtain the weight of the term in the whole document set, i.e. its TDFS value:
TDFS(t) = Σ_{i=1}^{NC} w_i(t)
8. Sort the TDFS values of all terms in the document set in descending order; the higher a term ranks, the higher its value for the document set and the larger its role in document classification.
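Putting steps 6-8 together, a sketch of the final weighting, TDFS aggregation, and ranking (the per-class numbers below are illustrative placeholders, not values computed in the patent):

```python
# Combine per-class quantities into w_i(t), sum into TDFS(t), rank terms.
feature_stats = {
    # term -> {class: (tfidf, DInterAvg, DIntra)}
    "basketbal": {"PE": (0.13, 1.41, 0.00), "MUSIC": (0.00, 1.41, 0.00)},
    "talent":    {"PE": (0.04, 1.41, 1.73), "MUSIC": (0.00, 1.41, 0.00)},
}

def tdfs(per_class: dict) -> float:
    return sum(t * inter * (1.0 - intra) for t, inter, intra in per_class.values())

scores = {term: tdfs(pc) for term, pc in feature_stats.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # screening keeps the top-ranked terms: ['basketbal', 'talent']
```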

Claims (1)

1. A text classification feature selection method based on feature distribution information, characterized by comprising the following steps:
(1). performing word segmentation, stop-word removal, and stemming on each document in the document set;
(2). representing the whole document collection as a vector space model;
(3). extracting all feature words from the document collection and constructing the feature dictionary;
(4). counting, for each feature word t in the text feature space, the frequency TF(t, d_j) with which it appears in every document d_j and the frequency TF(t, C_i) with which it appears in each class C_i, and at the same time counting the number of documents DF(t, C_i) in each class C_i that contain t;
(5). according to the information obtained in step (4), for each feature word t_k, first calculating its normalized tf*idf value for each class C_i, then calculating its intra-class dispersion DIntra and average inter-class dispersion DInterAvg for each class C_i;
(6). according to the information obtained in steps (4) and (5), calculating the weight w_i(t) of each feature word t_k in each class C_i of the text feature space with the following formula:
w_i(t) = tf*idf × DInterAvg × (1 − DIntra)
and summing the weights of feature word t_k over all classes to obtain its weight in the whole document set, i.e. the TDFS value of t_k:
TDFS(t) = Σ_{i=1}^{NC} w_i(t)
(7). sorting all feature words in descending order of their weight in the whole document set and, when screening features, preferentially keeping the top-ranked feature words.
CN201310050583.4A 2013-02-08 2013-02-08 Text classification feature selection method based on feature distribution information Expired - Fee Related CN103106275B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310050583.4A CN103106275B (en) 2013-02-08 2013-02-08 Text classification feature selection method based on feature distribution information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310050583.4A CN103106275B (en) 2013-02-08 2013-02-08 Text classification feature selection method based on feature distribution information

Publications (2)

Publication Number Publication Date
CN103106275A 2013-05-15
CN103106275B CN103106275B (en) 2016-02-10

Family

ID=48314130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310050583.4A Expired - Fee Related CN103106275B (en) Text classification feature selection method based on feature distribution information

Country Status (1)

Country Link
CN (1) CN103106275B (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462556A (en) * 2014-12-25 2015-03-25 北京奇虎科技有限公司 Method and device for recommending question and answer page related questions
CN104915327A (en) * 2014-03-14 2015-09-16 腾讯科技(深圳)有限公司 Text information processing method and device
CN105045812A (en) * 2015-06-18 2015-11-11 上海高欣计算机系统有限公司 Text topic classification method and system
CN106054857A (en) * 2016-05-27 2016-10-26 大连楼兰科技股份有限公司 Maintenance decision tree/word vector-based fault remote diagnosis platform
CN106055439A (en) * 2016-05-27 2016-10-26 大连楼兰科技股份有限公司 Remote fault diagnostic system and method based on maintenance and decision trees/term vectors
CN106227768A (en) * 2016-07-15 2016-12-14 国家计算机网络与信息安全管理中心 A kind of short text opining mining method based on complementary language material
CN106940703A (en) * 2016-01-04 2017-07-11 腾讯科技(北京)有限公司 Pushed information roughing sort method and device
CN106997345A (en) * 2017-03-31 2017-08-01 成都数联铭品科技有限公司 The keyword abstraction method of word-based vector sum word statistical information
CN106997344A (en) * 2017-03-31 2017-08-01 成都数联铭品科技有限公司 Keyword abstraction system
CN107329999A (en) * 2017-06-09 2017-11-07 江西科技学院 Document classification method and device
CN107844553A (en) * 2017-10-31 2018-03-27 山东浪潮通软信息科技有限公司 A kind of file classification method and device
CN108153872A (en) * 2017-12-25 2018-06-12 佛山市车品匠汽车用品有限公司 A kind of method and apparatus of the Internet web page information filtering
CN108491429A (en) * 2018-02-09 2018-09-04 湖北工业大学 A kind of feature selection approach based on document frequency and word frequency statistics between class in class
CN108776654A (en) * 2018-05-30 2018-11-09 昆明理工大学 One kind being based on improved simhash transcription comparison methods
CN110210559A (en) * 2019-05-31 2019-09-06 北京小米移动软件有限公司 Object screening technique and device, storage medium
CN110442678A (en) * 2019-07-24 2019-11-12 中智关爱通(上海)科技股份有限公司 A kind of text words weighing computation method and system, storage medium and terminal
CN111881668A (en) * 2020-08-06 2020-11-03 成都信息工程大学 Improved TF-IDF calculation model based on chi-square statistics and TF-CRF

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101290626A (en) * 2008-06-12 2008-10-22 昆明理工大学 Text categorization feature selection and weight computation method based on field knowledge
CN101587493A (en) * 2009-06-29 2009-11-25 中国科学技术大学 Text classification method
CN102622373A (en) * 2011-01-31 2012-08-01 中国科学院声学研究所 Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101290626A (en) * 2008-06-12 2008-10-22 昆明理工大学 Text categorization feature selection and weight computation method based on field knowledge
CN101587493A (en) * 2009-06-29 2009-11-25 中国科学技术大学 Text classification method
CN102622373A (en) * 2011-01-31 2012-08-01 中国科学院声学研究所 Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Liu Haifeng et al., "An Improved Feature Selection Method for Text Classification", Information Science *
Zhang Yu et al., "An Improved Feature Weighting Algorithm", Computer Engineering *
Xu Fengya et al., "Research on Improved Feature Weighting Algorithms in Automatic Text Classification", Computer Engineering and Applications *

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104915327A (en) * 2014-03-14 2015-09-16 腾讯科技(深圳)有限公司 Text information processing method and device
WO2015135452A1 (en) * 2014-03-14 2015-09-17 Tencent Technology (Shenzhen) Company Limited Text information processing method and apparatus
US10262059B2 (en) 2014-03-14 2019-04-16 Tencent Technology (Shenzhen) Company Limited Method, apparatus, and storage medium for text information processing
CN104915327B (en) * 2014-03-14 2019-01-29 腾讯科技(深圳)有限公司 A kind of processing method and processing device of text information
CN104462556B (en) * 2014-12-25 2018-02-23 北京奇虎科技有限公司 Question and answer page relevant issues recommend method and apparatus
CN104462556A (en) * 2014-12-25 2015-03-25 北京奇虎科技有限公司 Method and device for recommending question and answer page related questions
CN105045812A (en) * 2015-06-18 2015-11-11 上海高欣计算机系统有限公司 Text topic classification method and system
CN105045812B (en) * 2015-06-18 2019-01-29 上海高欣计算机系统有限公司 The classification method and system of text subject
CN106940703B (en) * 2016-01-04 2020-09-11 腾讯科技(北京)有限公司 Pushed information rough selection sorting method and device
CN106940703A (en) * 2016-01-04 2017-07-11 腾讯科技(北京)有限公司 Pushed information roughing sort method and device
CN106055439B (en) * 2016-05-27 2019-09-27 大连楼兰科技股份有限公司 Based on maintenance decision tree/term vector Remote Fault Diagnosis system and method
CN106055439A (en) * 2016-05-27 2016-10-26 大连楼兰科技股份有限公司 Remote fault diagnostic system and method based on maintenance and decision trees/term vectors
CN106054857A (en) * 2016-05-27 2016-10-26 大连楼兰科技股份有限公司 Maintenance decision tree/word vector-based fault remote diagnosis platform
CN106054857B (en) * 2016-05-27 2019-12-24 大连楼兰科技股份有限公司 Maintenance decision tree/word vector-based fault remote diagnosis platform
CN106227768B (en) * 2016-07-15 2019-09-03 国家计算机网络与信息安全管理中心 A kind of short text opining mining method based on complementary corpus
CN106227768A (en) * 2016-07-15 2016-12-14 国家计算机网络与信息安全管理中心 A kind of short text opining mining method based on complementary language material
CN106997344A (en) * 2017-03-31 2017-08-01 成都数联铭品科技有限公司 Keyword abstraction system
CN106997345A (en) * 2017-03-31 2017-08-01 成都数联铭品科技有限公司 The keyword abstraction method of word-based vector sum word statistical information
CN107329999B (en) * 2017-06-09 2020-10-20 江西科技学院 Document classification method and device
CN107329999A (en) * 2017-06-09 2017-11-07 江西科技学院 Document classification method and device
CN107844553A (en) * 2017-10-31 2018-03-27 山东浪潮通软信息科技有限公司 A kind of file classification method and device
CN108153872A (en) * 2017-12-25 2018-06-12 佛山市车品匠汽车用品有限公司 A kind of method and apparatus of the Internet web page information filtering
CN108491429A (en) * 2018-02-09 2018-09-04 湖北工业大学 A kind of feature selection approach based on document frequency and word frequency statistics between class in class
CN108776654A (en) * 2018-05-30 2018-11-09 昆明理工大学 One kind being based on improved simhash transcription comparison methods
CN110210559A (en) * 2019-05-31 2019-09-06 北京小米移动软件有限公司 Object screening technique and device, storage medium
CN110210559B (en) * 2019-05-31 2021-10-08 北京小米移动软件有限公司 Object screening method and device and storage medium
CN110442678A (en) * 2019-07-24 2019-11-12 中智关爱通(上海)科技股份有限公司 A kind of text words weighing computation method and system, storage medium and terminal
CN110442678B (en) * 2019-07-24 2022-03-29 中智关爱通(上海)科技股份有限公司 Text word weight calculation method and system, storage medium and terminal
CN111881668A (en) * 2020-08-06 2020-11-03 成都信息工程大学 Improved TF-IDF calculation model based on chi-square statistics and TF-CRF
CN111881668B (en) * 2020-08-06 2023-06-30 成都信息工程大学 TF-IDF computing device based on chi-square statistics and TF-CRF improvement

Also Published As

Publication number Publication date
CN103106275B (en) 2016-02-10

Similar Documents

Publication Publication Date Title
CN103106275A (en) Text classification character screening method based on character distribution information
CN103778214B (en) A kind of item property clustering method based on user comment
CN102622373B (en) Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm
CN101587493B (en) Text classification method
US9875294B2 (en) Method and apparatus for classifying object based on social networking service, and storage medium
CN106156372B (en) A kind of classification method and device of internet site
CN103473262B (en) A kind of Web comment viewpoint automatic classification system based on correlation rule and sorting technique
CN106095996A (en) Method for text classification
CN105550269A (en) Product comment analyzing method and system with learning supervising function
CN102332025A (en) Intelligent vertical search method and system
CN103995876A (en) Text classification method based on chi square statistics and SMO algorithm
CN101937436B (en) Text classification method and device
CN103886108B (en) The feature selecting and weighing computation method of a kind of unbalanced text set
CN105975518B (en) Expectation cross entropy feature selecting Text Classification System and method based on comentropy
CN104361037B (en) Microblogging sorting technique and device
US10387805B2 (en) System and method for ranking news feeds
CN102929873A (en) Method and device for extracting searching value terms based on context search
CN103810264A (en) Webpage text classification method based on feature selection
CN106446931A (en) Feature extraction and classification method and system based on support vector data description
CN107133282B (en) Improved evaluation object identification method based on bidirectional propagation
CN104463601A (en) Method for detecting users who score maliciously in online social media system
CN101763431A (en) PL clustering method based on massive network public sentiment information
CN106484919A (en) A kind of industrial sustainability sorting technique based on webpage autonomous word and system
CN105389505A (en) Shilling attack detection method based on stack type sparse self-encoder
CN105787662A (en) Mobile application software performance prediction method based on attributes

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160210

Termination date: 20200208

CF01 Termination of patent right due to non-payment of annual fee