CN103106275B - Text classification feature selection method based on feature distribution information - Google Patents

Text classification feature selection method based on feature distribution information Download PDF

Info

Publication number
CN103106275B
Authority
CN
China
Prior art keywords
feature
class
document
feature words
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201310050583.4A
Other languages
Chinese (zh)
Other versions
CN103106275A (en)
Inventor
李思男
李战怀
李宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN201310050583.4A priority Critical patent/CN103106275B/en
Publication of CN103106275A publication Critical patent/CN103106275A/en
Application granted granted Critical
Publication of CN103106275B publication Critical patent/CN103106275B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Abstract

The invention discloses a text classification feature selection method based on feature distribution information, solving the technical problem of the poor accuracy of existing text classification feature selection methods. The technical scheme first preprocesses each document in the document set, then expresses the whole document collection as a vector space model and constructs the feature dictionary; it counts the number of documents DF(t, C_i) in each class C_i that contain feature word t; for each feature word it calculates the normalized tf*idf value for each class C_i, then calculates the word's intra-class distribution DIntra within each class C_i and its average inter-class distribution DInterAvg; it calculates the weight w_i(t) of each feature word t of the text feature space in category C_i; finally it sorts all feature words in descending order of their weights over the whole document set and, when performing feature selection, preferentially retains the top-ranked feature words. On the basis of the feature distribution scheme, the method applies the improved scheme in the feature selection process, improving text classification efficiency and accuracy.

Description

Text classification feature selection method based on feature distribution information
Technical field
The present invention relates to a text classification feature selection method, and in particular to a text classification feature selection method based on feature distribution information.
Background technology
With the development of communication technology and networks, a large number of electronic documents such as news, mail, and microblogs are generated on the Internet every day. Automatic text classification, as an efficient method for the classified management of large document collections, is widely used in many fields.
With the explosive growth of information, the main problem facing automatic text classification is how to handle the high-dimensional text vector feature space produced by large amounts of text data. An overly high-dimensional text vector feature space has two adverse effects on text classification methods: (1) many relatively mature methods cannot be optimized in a high-dimensional space and therefore cannot be applied to text classification; (2) because the classifier is obtained by training on a training set, an excessively high-dimensional text vector space inevitably leads to overfitting [1]. In the text vector space, most dimensions are irrelevant to text classification, and some even introduce noise that degrades classification precision [2]. Text feature selection chooses, according to some screening algorithm, a subset of more representative text features from the original feature space to form a new feature space of lower dimensionality, achieving the goal of dimensionality reduction. It is an effective way to solve the problem of the excessively high dimensionality of the text vector feature space in text classification. The aims of text feature selection are to improve the efficiency of text classification work and the execution efficiency of the algorithm. Many experiments show that, in most cases, aggressively reducing the feature space yields a large performance boost at the cost of only a small loss in classification precision [3].
Existing text classification feature selection algorithms mainly include document frequency (DF), information gain (IG), information gain ratio (GR), the chi-square test (CHI), mutual information (MI), and the Gini index [3,4]. Several of the better-performing techniques for text classification are briefly introduced below:
Document frequency (DF): document frequency refers to the number of documents in the document collection that contain a given feature t. Its basic assumption is that rare features are unhelpful for class prediction, or do not affect overall performance. Advantages of document frequency: its implementation is simple and its computational cost is small, so feature selection is fast, and the practical effect is also good. Disadvantage: a feature that is rare overall may not be rare within a certain class of texts and may still carry important classification information; simply weeding it out may harm classification, so DF should not be used to reject features in large quantities.
Information gain (IG): information gain is an entropy-based evaluation method. For a given feature t, it measures how much information the system has when the feature is and is not considered; the difference between the two is the amount of information the feature brings to the system, i.e., the gain [5]. Information gain considers both the presence and the absence of a feature; on imbalanced data sets with rare categories, experiments show that the contribution of a feature's absence to judging the text category is often far smaller than the interference its consideration introduces.
Information gain ratio (GR): information gain has been shown to be biased in many results. Because attributes with many distinct values are learned too thoroughly on the training set, information gain tends to prefer such attributes; the information gain ratio remedies this shortcoming of information gain [6].
Chi-square test (CHI): the chi-square test is a common method in mathematical statistics for testing the independence of two variables; its most basic idea is to judge the correctness of a theory by the deviation between observed values and theoretical values [7,8].
Experiments in text classification show that, when used for feature selection, the chi-square test is among the best-performing methods. However, it only counts whether feature t appears in a text, not how many times it appears, which gives it a certain exaggerating effect for low-frequency words; this is the chi-square test's well-known "low-frequency word defect".
The present invention improves the inter-class distribution computation on the basis of the feature distribution scheme [9] and applies this scheme in the feature selection process.
List of references:
[1] Jieming Yang, Yuanning Liu, Xiaodong Zhu et al., A new feature selection based on comprehensive measurement both in inter-category and intra-category for text categorization. Information Processing & Management, Volume 48, Issue 4, 2012, pp. 741-754.
[2] Wenqian Shang, Houkuan Huang, Haibin Zhu et al., A novel feature selection algorithm for text classification. Expert Systems with Applications, Volume 33, Issue 1, 2007, pp. 1-5.
[3] Monica Rogati and Yiming Yang, High-performing feature selection for text classification. In Proceedings of the Eleventh International Conference on Information and Knowledge Management (CIKM '02), ACM, New York, NY, USA, 2002, pp. 659-661.
[4] Yang, Y., Pedersen, J.O., A Comparative Study on Feature Selection in Text Categorization. In Proceedings of the 14th International Conference on Machine Learning, Nashville, USA, 1997, pp. 412-420.
[5] Forman, G., An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3, 2003, pp. 1289-1305.
[6] Tatsunori Mori, Miwa Kikuchi and Kazufumi Yoshida, Term Weighting Method based on Information Gain Ratio for Summarizing Documents Retrieved by IR Systems. Journal of Natural Language Processing, 9(4), 2001, pp. 3-32.
[7] Zheng, Z., Srihari, R., Optimally Combining Positive and Negative Features for Text Classification. ICML 2003 Workshop on Learning from Imbalanced Data Sets, 2003.
[8] Luigi Galavotti, Fabrizio Sebastiani et al., Feature Selection and Negative Evidence in Automated Text Classification. In Proceedings of the 4th European Conference on Research and Advanced Technology for Digital Libraries (ECDL '00), 2000.
[9] V. Lertnattee, T. Theeramunkong, Improving centroid-based text classification using term-distribution-based weighting and feature selection. In Proceedings of INTECH-01, 2nd International Conference on Intelligent Technologies, Bangkok, Thailand, 2001, pp. 349-355.
Summary of the invention
To overcome the poor accuracy of existing text classification feature selection methods, the invention provides a text classification feature selection method based on feature distribution information. On the basis of the feature distribution scheme, the method improves the inter-class distribution computation and applies the scheme in the feature selection process. The method makes full use of the tf*idf information of text features together with their intra-class and inter-class distribution information, reflecting the importance of feature terms in the text more objectively, so that the feature terms that best represent the text are selected, the goal of feature selection is reached, and text classification efficiency and accuracy can be improved. The method achieves high classification accuracy while selecting fewer feature terms and converges quickly; the improvement to the inter-class distribution also makes the method applicable to skewed data sets.
The technical solution adopted by the present invention to solve the technical problem is a text classification feature selection method based on feature distribution information, characterized by comprising the following steps:
1. Perform word segmentation, stop-word removal, and stemming on each document in the document set.
2. Express the whole document collection as a vector space model.
3. Extract all feature words from the document collection and construct the feature dictionary.
4. For each feature word t in the text feature space, count the frequency TF(t, d_j) with which it appears in every document d_j and the frequency TF(t, C_i) with which it appears in each class C_i; at the same time count the number of documents DF(t, C_i) in each class C_i that contain feature word t.
5. Using the information obtained in step 4, for each feature word t, first calculate the normalized tf*idf value for each class C_i, then calculate the intra-class distribution DIntra of the feature word within each class C_i and its average inter-class distribution DInterAvg.
6. Using the information obtained in steps 4 and 5, calculate the weight w_i(t) of each feature word t of the text feature space in category C_i by the following formula:
w_i(t) = tf*idf * DInterAvg * (1 - DIntra)
Summing the weights of feature word t over all classes gives the weight of the feature word in the whole document set, i.e., the TDFS value of feature word t:
TDFS(t) = \sum_{i=1}^{NC} w_i(t)
7. Sort all feature words in descending order of their weights over the whole document set; when performing feature selection, preferentially retain the top-ranked feature words.
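A minimal Python sketch of how steps 6 and 7 combine, assuming the per-class quantities tf*idf, DInterAvg, and DIntra have already been computed as in the detailed description below; all names are illustrative, not part of the patent.

def class_weight(tfidf: float, d_inter_avg: float, d_intra: float) -> float:
    # w_i(t) = tf*idf * DInterAvg * (1 - DIntra), for one feature word in one class
    return tfidf * d_inter_avg * (1.0 - d_intra)

def tdfs(per_class_weights: list[float]) -> float:
    # TDFS(t): sum of w_i(t) over all NC classes
    return sum(per_class_weights)

def screen(tdfs_by_feature: dict[str, float], k: int) -> list[str]:
    # keep the k top-ranked feature words by TDFS value
    return sorted(tdfs_by_feature, key=tdfs_by_feature.get, reverse=True)[:k]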
The beneficial effects of the invention are as follows: on the basis of the feature distribution scheme, the method improves the inter-class distribution computation and applies the scheme in the feature selection process. The method makes full use of the tf*idf information of text features together with their intra-class and inter-class distribution information, reflecting the importance of feature terms in the text more objectively, so that the feature terms that best represent the text are selected, the goal of feature selection is reached, and text classification efficiency and accuracy are improved. The method achieves high classification accuracy while selecting fewer feature terms and converges quickly; the improvement to the inter-class distribution also makes the method applicable to skewed data sets.
The present invention is described in detail below with reference to the drawings and embodiments.
Accompanying drawing explanation
Fig. 1 is the flow chart of the text classification feature selection method based on feature distribution information of the present invention.
Embodiment
The specific steps of the method of the invention are as follows:
1. Concepts related to the present invention.
tf*idf (term frequency - inverse document frequency): a statistical method used to assess the importance of a word to a document within a document collection or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but at the same time decreases in inverse proportion to the frequency with which it appears in the corpus.
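As an illustration of this definition only (the classic tf*idf, not the invention's normalized per-class variant given later), a small sketch over documents from Table 2:

import math

corpus = [
    "yao ha great talent basketbal game".split(),
    "we plai game about basketbal playground".split(),
    "we enjoi music concert".split(),
]

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)             # frequency within the document
    df = sum(1 for d in corpus if term in d)    # documents containing the term
    return tf * math.log(len(corpus) / df) if df else 0.0

print(tf_idf("talent", corpus[0], corpus))  # ~0.183: present here, rare elsewhere
print(tf_idf("we", corpus[1], corpus))      # ~0.068: spread across the corpus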
Intra-class distribution (intra-class dispersion): the distribution of a feature word among the documents of a certain class. If the word is evenly distributed over the documents of the class, its intra-class distribution in that class is low; conversely, if it is concentrated in a few documents and absent from the rest, its intra-class distribution in that class is high.
Inter-class distribution: the distribution of a feature word over all the classes of the whole document set. If the word is evenly distributed over the documents of every class, its inter-class distribution over the whole document set is low; conversely, if it appears concentrated in one or a few classes and not in the others, its inter-class distribution over the whole document set is high.
Average inter-class distribution: this concept is proposed by the present invention as an improvement of the inter-class distribution. The inter-class distribution uses the total word frequency of a feature word in each class to measure its distribution over the classes; if the numbers of documents in different classes differ greatly, i.e. the data set is skewed, this lets the classes with more documents drown out the feature words of the classes with fewer documents. The improved average inter-class distribution instead measures the distribution by the average frequency with which the word occurs per document within each class, so it is unaffected by data skew and accurately reflects the inter-class distribution of the feature word.
2. Properties related to the present invention.
Property 1: the more times a feature word occurs in the documents of a certain class, the better it characterizes the class of a document, and the larger its weight.
Table 1. Sample of the original document set after removal of formatting and other junk information
Numbering Original document Classification
1 Yao has great talent in basketball games. PE
2 We are playing a game about basketball in the playground. PE
3 We are enjoying the music at the concert. MUSIC
4 Music is an art and everybody may enjoy it. MUSIC
5 Playing basketball is my favorite sport. PE
6 Listening to the music is my hobby. MUSIC
For example, in the document set shown in Table 1, the feature word basketball occurs 3 times in the PE class documents, so its weight is relatively large, while talent occurs only once in that class, so its weight is smaller.
Property 2: the lower the intra-class distribution of a feature word, the better it characterizes the class of a document, and the larger its weight.
Table 2. The document collection after text preprocessing
Numbering Training document Classification
1 yao ha great talent basketbal game PE
2 we plai game about basketbal playground PE
3 we enjoi music concert MUSIC
4 music art everybodi mai enjoi MUSIC
5 plai basketbal my favorit sport PE
6 listen music my hobbi MUSIC
Table 3. All feature words of the sample document set sorted in descending order of TDFS value
Feature Words TDFS value Feature Words TDFS value
music 0.554 hobbi 0.158
basketbal 0.489 listen 0.158
enjoi 0.489 great 0.140
game 0.394 ha 0.140
plai 0.394 talent 0.140
concert 0.158 yao 0.140
art 0.158 playground 0.140
everybodi 0.158 favorit 0.140
mai 0.158 sport 0.140
For example, in the document set shown in Table 1, the feature word basketball is evenly distributed in the PE class documents, with one occurrence per document, so its intra-class distribution is low. This indicates that the word is widely and evenly present in the PE class documents, characterizes the class of a document well, and receives a larger weight. The feature word talent occurs in only one of the PE documents and in neither of the other two, so its intra-class distribution is high; a document containing the word may well not belong to class PE, and the computed weight is accordingly lower.
Property 3: the higher the inter-class distribution of a feature word, the better it characterizes the class of a document, and the larger its weight.
For example, in the document set shown in Table 1, the feature word basketball occurs only in PE class documents and never in MUSIC class documents; its inter-class distribution is very high, it characterizes the class of a document well, and its weight is larger. The feature word my occurs once each in the PE and MUSIC classes and is evenly distributed between the classes, so its inter-class distribution is very low; it poorly represents any class, and its weight is accordingly low.
Property 4: the higher the average inter-class distribution of a feature word, the better it characterizes the class of a document, and the larger its weight.
Table 4. Example 1 for the average inter-class distribution property
Table 5. Example 2 for the average inter-class distribution property
Example: Tables 4 and 5 give two examples. Suppose the distribution of a certain feature word t in the document set is as shown in Table 4. Because the total word frequency of t in the two classes A and B is identical (both 2), the inter-class distribution of the feature word is 0, yet t is clearly representative for class B; only because the gap in document counts between classes A and B is excessive is the class-B feature word t, belonging to the class with fewer documents, drowned out by class A, the class with more documents. In the example of Table 5, the computed inter-class distribution of feature word t is likewise 0 (the total frequency of t is 1000 in each of classes A and B), yet the importance of t for distinguishing the two classes is apparent. It can be seen that when the data set is skewed and the inter-class distribution is used to measure feature words, representative feature words of the classes with fewer documents cannot be highlighted. If the average inter-class distribution is used instead, feature word t is very unevenly distributed between classes A and B and its average inter-class distribution is high, so t receives a reasonable weight; the adverse effect of data skew is eliminated while the original function of the inter-class distribution is retained.
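A numeric sketch of the Table 4 scenario. The text states only the total frequencies (2 and 2); the class sizes |A| = 100 and |B| = 2 below are hypothetical stand-ins for the "excessive gap" it describes.

tf_per_class = {"A": 2, "B": 2}       # TF(t, C_i): identical totals
docs_per_class = {"A": 100, "B": 2}   # |C_i|: skewed document counts

# Inter-class view on raw totals: no spread at all, so t looks useless.
print(max(tf_per_class.values()) - min(tf_per_class.values()))  # 0

# Average inter-class view: per-document rates differ by a factor of 50,
# so t is clearly characteristic of the small class B.
print({c: tf_per_class[c] / docs_per_class[c] for c in tf_per_class})
# {'A': 0.02, 'B': 1.0}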
For a given document set D, the detailed process by which the present invention screens the attributes of the document set is as follows:
1. Parse all documents in the document set, discard useless structural markers, and extract the main information of each document, such as title and content.
Some structural marker information may exist in the documents, appearing identically in every document; such markers, together with content irrelevant to text classification such as timestamps, are filtered out first.
2. Preprocess the text content and extract the feature terms that form the text feature space.
For all documents in the document set, the parsing of step 1 yields the content information of each document, as in Table 1. Each document in the set is then preprocessed: after tokenizing, stop-word removal, and stemming, a set of words is obtained. Each word in the set is called a text feature term, and all feature terms together constitute the text feature space (term space). The result of preprocessing the documents of Table 1 is shown in Table 2.
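A preprocessing sketch, assuming NLTK is installed. The stems in Table 2 ("plai", "enjoi", "everybodi") match the original Porter algorithm, so that mode is chosen here; the inline stop-word list is a stand-in for whatever list the authors actually used.

from nltk.stem.porter import PorterStemmer

STOP_WORDS = {"the", "a", "an", "is", "are", "to", "and", "at", "in", "it"}
stemmer = PorterStemmer(mode=PorterStemmer.ORIGINAL_ALGORITHM)

def preprocess(text):
    tokens = [w.strip(".,!?") for w in text.lower().split()]   # tokenize
    tokens = [w for w in tokens if w and w not in STOP_WORDS]  # drop stop words
    return [stemmer.stem(w) for w in tokens]                   # stem

print(preprocess("We are playing a game about basketball in the playground."))
# ['we', 'plai', 'game', 'about', 'basketbal', 'playground']   (cf. Table 2)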
3. Extract all feature words from the document collection and construct the feature dictionary.
For all documents in the document set, after the processing of step 2, collect all feature words occurring in the document set to form the feature dictionary, which serves as the basis of feature selection.
4. For each feature word t in the text feature space, count the frequency TF(t, d_j) with which it appears in every document d_j and the frequency TF(t, C_i) with which it appears in each class C_i; at the same time count the number of documents DF(t, C_i) in each class C_i that contain feature word t.
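A sketch of these statistics over the preprocessed corpus of Table 2; variable names are illustrative. Each document is a (token list, class) pair.

from collections import Counter, defaultdict

docs = [
    ("yao ha great talent basketbal game".split(), "PE"),
    ("we plai game about basketbal playground".split(), "PE"),
    ("we enjoi music concert".split(), "MUSIC"),
    ("music art everybodi mai enjoi".split(), "MUSIC"),
    ("plai basketbal my favorit sport".split(), "PE"),
    ("listen music my hobbi".split(), "MUSIC"),
]

tf_doc = [Counter(tokens) for tokens, _ in docs]  # TF(t, d_j)
tf_class = defaultdict(Counter)                   # TF(t, C_i)
df_class = defaultdict(Counter)                   # DF(t, C_i)
for tokens, label in docs:
    tf_class[label].update(tokens)                # every occurrence counts
    df_class[label].update(set(tokens))           # each document counts once

print(tf_class["PE"]["basketbal"], df_class["PE"]["basketbal"])  # 3 3
print(tf_class["PE"]["game"], df_class["PE"]["game"])            # 2 2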
5. Using the statistics obtained in step 4, calculate for each feature word the normalized tf*idf value, the intra-class distribution, and the average inter-class distribution.
(1) tf*idf: the computation uses the total document frequency of feature word t over all classes,
n_t = \sum_{j=1}^{NC} DF(t, C_j)
In the formula, n represents the number of all feature words occurring in class C_i, and L is a constant obtained by experimental testing, usually 0.1 or 0.01. The normalized tf*idf value avoids the computation bias that over-long documents would otherwise introduce.
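Only n_t and the constant L survive of the tf*idf formula here, so the following sketch is one plausible reading, not a verbatim reproduction of the patent's equation: a per-class tf*idf with the idf term smoothed by L and normalized by the class's total token count to offset over-long classes. It builds on tf_class and df_class from the sketch above.

import math

def n_t(term, df_class, classes):
    # n_t = sum over classes of DF(t, C_j): documents containing t anywhere
    return sum(df_class[c][term] for c in classes)

def tfidf_class(term, label, tf_class, df_class, classes, nd, L=0.01):
    df = n_t(term, df_class, classes)
    if df == 0:
        return 0.0
    raw = tf_class[label][term] * math.log(nd / df + L)
    return raw / sum(tf_class[label].values())   # length normalization (assumed)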
(2) Intra-class distribution: the computation formula is as follows:
DIntra = \frac{\sum_{j=1}^{|C_i|}\left[TF(t,d_j) - TF(t,C_i)/|C_i|\right]^2/(|C_i|-1)}{TF(t,C_i)/|C_i|}
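A direct transcription of the DIntra formula as reconstructed above: the sample variance of TF(t, d_j) within class C_i divided by the mean TF(t, C_i)/|C_i|. The guards for singleton classes and absent terms are an assumption; the patent text does not specify them.

def d_intra(term, label, docs, tf_class):
    counts = [tokens.count(term) for tokens, lab in docs if lab == label]
    n, total = len(counts), tf_class[label][term]  # |C_i| and TF(t, C_i)
    if n < 2 or total == 0:
        return 0.0
    mean = total / n
    var = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return var / mean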
(3) Average inter-class distribution: the computation formula is as follows:
DInterAvg = \frac{\sum_{i=1}^{NC}\left[TF(t,C_i)/|C_i| - \sum_{j=1}^{NC}TF(t,C_j)/ND\right]^2/(NC-1)}{\sum_{j=1}^{NC}TF(t,C_j)/ND}
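The corresponding transcription for DInterAvg: the variance of the per-class averages TF(t, C_i)/|C_i| around the corpus-wide average per document, divided by that corpus-wide average, so skewed class sizes no longer drown out small classes.

def d_inter_avg(term, docs, tf_class, classes):
    nd = len(docs)                                             # ND
    sizes = {c: sum(1 for _, lab in docs if lab == c) for c in classes}
    overall = sum(tf_class[c][term] for c in classes) / nd     # average per document
    if overall == 0 or len(classes) < 2:
        return 0.0
    var = sum((tf_class[c][term] / sizes[c] - overall) ** 2
              for c in classes) / (len(classes) - 1)
    return var / overall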
6. Using the results of step 5, calculate the weight of feature word t in each class. The computation formula is as follows:
w_i(t) = tf*idf * DInterAvg * (1 - DIntra)
7. Sum the weights of feature word t over all classes to obtain the weight of the feature word in the whole document set, i.e., its TDFS value. The computation formula is as follows:
TDFS(t) = \sum_{i=1}^{NC} w_i(t)
8. Sort the TDFS values of all feature words in the document set in descending order; the higher a feature word ranks, the larger its value in the document set and the greater its role in document classification.
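An end-to-end sketch of steps 4-8 over the Table 2 corpus, reusing docs, tf_class, df_class, and the functions sketched above. Since the tf*idf normalization is an assumption, the values need not reproduce Table 3 exactly; what the sketch demonstrates is the descending TDFS ranking.

classes = {"PE", "MUSIC"}
nd = len(docs)
vocab = sorted({t for tokens, _ in docs for t in tokens})

tdfs_by_term = {}
for term in vocab:
    tdfs_by_term[term] = sum(
        tfidf_class(term, label, tf_class, df_class, classes, nd)
        * d_inter_avg(term, docs, tf_class, classes)
        * (1.0 - d_intra(term, label, docs, tf_class))   # w_i(t)
        for label in classes
    )

for term in sorted(tdfs_by_term, key=tdfs_by_term.get, reverse=True)[:5]:  # step 8
    print(f"{term:12s} {tdfs_by_term[term]:.3f}")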

Claims (1)

1. A text classification feature selection method based on feature distribution information, characterized by comprising the following steps:
(1) performing word segmentation, stop-word removal, and stemming on each document in the document set;
(2) expressing the whole document collection as a vector space model;
(3) extracting all feature words from the document collection and constructing the feature dictionary;
(4) counting, for each feature word t in the text feature space, the frequency TF(t, d_j) with which it appears in every document d_j and the frequency TF(t, C_i) with which it appears in each class C_i, and at the same time counting the number of documents DF(t, C_i) in each class C_i that contain feature word t;
(5) according to the information obtained in step (4), for each feature word t, first calculating the normalized tf*idf value for each class C_i, then calculating the intra-class distribution DIntra of the feature word within each class C_i and its average inter-class distribution DInterAvg;
(6) according to the information obtained in steps (4) and (5), calculating the weight w_i(t) of each feature word t of the text feature space in category C_i by the following formula:
w_i(t) = tf*idf * DInterAvg * (1 - DIntra)
where the average inter-class distribution is computed as:
DInterAvg = \frac{\sum_{i=1}^{NC}\left[TF(t,C_i)/|C_i| - \sum_{j=1}^{NC}TF(t,C_j)/ND\right]^2/(NC-1)}{\sum_{j=1}^{NC}TF(t,C_j)/ND}
summing the weights of feature word t over all classes gives the weight of the feature word in the whole document set, i.e., the TDFS value of feature word t:
TDFS(t) = \sum_{i=1}^{NC} w_i(t)
(7) sorting all feature words in descending order of their weights over the whole document set and, when performing feature selection, preferentially retaining the top-ranked feature words.
CN201310050583.4A 2013-02-08 2013-02-08 Text classification feature selection method based on feature distribution information Expired - Fee Related CN103106275B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310050583.4A CN103106275B (en) 2013-02-08 2013-02-08 Text classification feature selection method based on feature distribution information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310050583.4A CN103106275B (en) 2013-02-08 2013-02-08 Text classification feature selection method based on feature distribution information

Publications (2)

Publication Number Publication Date
CN103106275A CN103106275A (en) 2013-05-15
CN103106275B true CN103106275B (en) 2016-02-10

Family

ID=48314130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310050583.4A Expired - Fee Related CN103106275B (en) 2013-02-08 2013-02-08 Text classification feature selection method based on feature distribution information

Country Status (1)

Country Link
CN (1) CN103106275B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104915327B (en) * 2014-03-14 2019-01-29 腾讯科技(深圳)有限公司 A kind of processing method and processing device of text information
CN104462556B (en) * 2014-12-25 2018-02-23 北京奇虎科技有限公司 Question and answer page relevant issues recommend method and apparatus
CN105045812B (en) * 2015-06-18 2019-01-29 上海高欣计算机系统有限公司 The classification method and system of text subject
CN106940703B (en) * 2016-01-04 2020-09-11 腾讯科技(北京)有限公司 Pushed information rough selection sorting method and device
CN106054857B (en) * 2016-05-27 2019-12-24 大连楼兰科技股份有限公司 Maintenance decision tree/word vector-based fault remote diagnosis platform
CN106055439B (en) * 2016-05-27 2019-09-27 大连楼兰科技股份有限公司 Based on maintenance decision tree/term vector Remote Fault Diagnosis system and method
CN106227768B (en) * 2016-07-15 2019-09-03 国家计算机网络与信息安全管理中心 A kind of short text opinion mining method based on complementary corpus
CN106997345A (en) * 2017-03-31 2017-08-01 成都数联铭品科技有限公司 The keyword abstraction method of word-based vector sum word statistical information
CN106997344A (en) * 2017-03-31 2017-08-01 成都数联铭品科技有限公司 Keyword abstraction system
CN107329999B (en) * 2017-06-09 2020-10-20 江西科技学院 Document classification method and device
CN107844553B (en) * 2017-10-31 2021-07-27 浪潮通用软件有限公司 Text classification method and device
CN108153872A (en) * 2017-12-25 2018-06-12 佛山市车品匠汽车用品有限公司 A kind of method and apparatus of the Internet web page information filtering
CN108491429A (en) * 2018-02-09 2018-09-04 湖北工业大学 A kind of feature selection approach based on intra-class and inter-class document frequency and word frequency statistics
CN108776654A (en) * 2018-05-30 2018-11-09 昆明理工大学 A kind of text duplicate-comparison method based on improved simhash
CN110210559B (en) * 2019-05-31 2021-10-08 北京小米移动软件有限公司 Object screening method and device and storage medium
CN110442678B (en) * 2019-07-24 2022-03-29 中智关爱通(上海)科技股份有限公司 Text word weight calculation method and system, storage medium and terminal
CN111881668B (en) * 2020-08-06 2023-06-30 成都信息工程大学 TF-IDF computing device based on chi-square statistics and TF-CRF improvement

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101290626A (en) * 2008-06-12 2008-10-22 昆明理工大学 Text categorization feature selection and weight computation method based on field knowledge
CN101587493A (en) * 2009-06-29 2009-11-25 中国科学技术大学 Text classification method
CN102622373A (en) * 2011-01-31 2012-08-01 中国科学院声学研究所 Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101290626A (en) * 2008-06-12 2008-10-22 昆明理工大学 Text categorization feature selection and weight computation method based on field knowledge
CN101587493A (en) * 2009-06-29 2009-11-25 中国科学技术大学 Text classification method
CN102622373A (en) * 2011-01-31 2012-08-01 中国科学院声学研究所 Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
An improved feature weighting algorithm; Zhang Yu et al.; Computer Engineering; 2011-03-31; Vol. 37, No. 5; main text, p. 211, left column, paragraph 4 to second-to-last paragraph *
An improved feature selection method in text classification; Liu Haifeng et al.; Information Science; 2007-10-31; Vol. 25, No. 10; full text *
Research on improved feature weighting algorithms in automatic text classification; Xu Fengya et al.; Computer Engineering and Applications; 2005-01-31; main text, p. 181, left column, paragraph 1 to p. 183, left column, fourth-to-last paragraph *

Also Published As

Publication number Publication date
CN103106275A (en) 2013-05-15

Similar Documents

Publication Publication Date Title
CN103106275B (en) Text classification feature selection method based on feature distribution information
CN102332025B (en) Intelligent vertical search method and system
CN102622373B (en) Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm
CN104239539B (en) A kind of micro-blog information filter method merged based on much information
CN104615608B (en) A kind of data mining processing system and method
CN106156372B (en) A kind of classification method and device of internet site
CN102298646B (en) Method and device for classifying subjective text and objective text
CN101937436B (en) Text classification method and device
CN101609472B (en) Keyword evaluation method and device based on platform for questions and answers
CN103473262B (en) A kind of Web comment viewpoint automatic classification system based on correlation rule and sorting technique
CN106202518A (en) Based on CHI and the short text classification method of sub-category association rule algorithm
CN105573887B (en) The method for evaluating quality and device of search engine
CN105912625A (en) Linked data oriented entity classification method and system
CN103886108B (en) The feature selecting and weighing computation method of a kind of unbalanced text set
CN108268554A (en) A kind of method and apparatus for generating filtering junk short messages strategy
CN105373606A (en) Unbalanced data sampling method in improved C4.5 decision tree algorithm
CN102841946A (en) Commodity data retrieval sequencing and commodity recommendation method and system
CN107145560A (en) A kind of file classification method and device
CN101763431A (en) PL clustering method based on massive network public sentiment information
Vidinli et al. New query suggestion framework and algorithms: A case study for an educational search engine
CN105389505A (en) Shilling attack detection method based on stack type sparse self-encoder
CN109271517A (en) IG TF-IDF Text eigenvector generates and file classification method
CN106484919A (en) A kind of industrial sustainability sorting technique based on webpage autonomous word and system
CN105787662A (en) Mobile application software performance prediction method based on attributes
CN103914551A (en) Method for extending semantic information of microblogs and selecting features thereof

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160210

Termination date: 20200208