CN101290626A - Text categorization feature selection and weight computation method based on field knowledge - Google Patents


Info

Publication number
CN101290626A
Authority
CN
China
Prior art keywords
field
feature
text
speech
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2008100585170A
Other languages
Chinese (zh)
Other versions
CN100583101C (en)
Inventor
余正涛
韩露
向凤红
万舟
熊新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN200810058517A
Publication of CN101290626A
Application granted
Publication of CN100583101C
Expired - Fee Related
Anticipated expiration

Abstract

The invention relates to the technical field of artificial intelligence, and in particular to a text classification feature selection and weight computation method based on domain knowledge. The method combines sample statistics with domain terms to construct a domain classification feature space, uses the knowledge relations within the domain to compute the similarity between terms, and adjusts the corresponding feature weights of the classification feature vectors accordingly. A support vector machine learning algorithm is then used to build a domain text classification model and perform domain text classification. Text classification experiments on the Yunnan tourism domain versus non-tourism domains show that the classification accuracy of the method is 4 percentage points higher than that of the improved TFIDF feature weighting method.

Description

Text classification feature selection and weight computation method based on domain knowledge
Technical field
The present invention relates to the field of artificial intelligence, and in particular to a text classification feature selection and weight computation method based on domain knowledge.
Background
Text classification is a hot topic in current natural language processing research. Determining whether a text belongs to a specific domain is a key problem in research on vertical search engines, question answering systems, and similar applications. Feature selection is usually the most important part of text classification, and it directly affects classification accuracy. Conventional feature selection methods mostly use evaluation functions such as document frequency (DF), information gain (IG), mutual information (MI), and the chi-square statistic (CHI) to extract features. These methods are all statistical: they typically draw on a large corpus and obtain the feature space through statistical computation and dimensionality reduction. As a result, some of the selected statistical features may contribute little to classification and can even lower accuracy. In domain text classification, moreover, domain terms often appear in the text and discriminate strongly between domain and non-domain texts. Under conventional feature selection methods, however, these features, which matter a great deal to classification, may receive low weights or even be discarded as noise, which greatly reduces classification accuracy.
Summary of the invention
The object of the present invention is to provide a domain text classification feature selection and weight computation method based on domain knowledge relations.
The present invention proposes and implements a domain text classification feature selection and weight computation method based on domain knowledge relations. The method combines sample statistics with domain terms to construct a domain classification feature space, uses the knowledge relations within the domain to compute the similarity between terms, and adjusts the weights of the corresponding features in the classification feature vectors accordingly. A support vector machine learning algorithm is used to build a domain text classification model and perform domain text classification. Text classification experiments on the Yunnan tourism domain versus non-tourism domains show that the classification accuracy of this method is 4 percentage points higher than that of the improved TFIDF method.
The technical scheme of the invention is as follows:
The steps for performing text classification with the text classification feature selection and weight computation method based on domain knowledge are:
(1) Experimental corpus collection:
Domain texts and non-domain texts are collected as the training corpus and the test corpus. For training, the experiment uses 700 Yunnan tourism documents retrieved at random from the web as domain training texts, and 700 documents from the Fudan University corpus (70 each from the environment, computer, transportation, education, economy, military, sports, medicine, art, and politics categories) as non-domain training texts. For testing, it uses 200 Yunnan tourism documents retrieved at random from the web as domain test texts, and 200 documents from the Fudan University corpus (20 each from the same ten categories) as non-domain test texts.
(2) Text preprocessing:
Text preprocessing includes word segmentation, stop word removal, term frequency statistics, document frequency statistics, and so on. The text is first segmented into Chinese words using the word segmentation system interface of the Institute of Computing Technology, Chinese Academy of Sciences; on that basis, domain words are segmented and tagged with the help of a domain dictionary. After segmentation, frequently occurring stop words such as "how" are removed. The documents are then scanned to count each word's term frequency, its document frequency within the domain, and its document frequency outside the domain.
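The counting step at the end of preprocessing can be sketched as follows. This is a minimal illustration, assuming the documents are already segmented and stripped of stop words; the function name `count_statistics` and the toy English tokens are hypothetical and do not come from the patent.

```python
from collections import Counter

def count_statistics(docs):
    """docs: list of (tokens, is_domain) pairs, where tokens is a list of
    already-segmented words and is_domain marks a domain text."""
    term_freq = []          # one Counter of word frequencies per document
    df_domain = Counter()   # number of domain documents containing each word
    df_other = Counter()    # number of non-domain documents containing each word
    for tokens, is_domain in docs:
        term_freq.append(Counter(tokens))
        target = df_domain if is_domain else df_other
        target.update(set(tokens))  # a document counts once per distinct word
    return term_freq, df_domain, df_other

# Toy data (assumption): two domain texts and one non-domain text.
docs = [
    (["lijiang", "old", "town", "lijiang"], True),
    (["lijiang", "snow", "mountain"], True),
    (["stock", "market", "town"], False),
]
term_freq, df_domain, df_other = count_statistics(docs)
```

The three results correspond to the term frequency, in-domain document frequency, and out-of-domain document frequency that the later weighting steps rely on.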
(3) TFIDF feature weight computation method:
After preprocessing, document frequency (DF) is first used to remove low-frequency words, and 1000 feature words are selected to form the classification feature space. Feature word weights are computed with the improved TFIDF method proposed by Associate Professor Zhang Yufang et al. of the College of Computer Science, Chongqing University, in "Improvement and Application of the TFIDF Method Based on Text Classification", published in Computer Engineering in 2006: TFIDF = TF × log(m ÷ (m + k) × N), where TF is the term frequency of a feature item, m is the document frequency of the feature item within the domain, k is its document frequency outside the domain, and N is the total number of documents.
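The improved TFIDF weight can be written as a small function. This is only a sketch: the text does not state the base of the logarithm (the natural logarithm is assumed here), and returning zero when m is 0 is an assumption added to avoid log(0), not something the patent specifies.

```python
import math

def improved_tfidf(tf, m, k, n_docs):
    """Weight = TF * log(m / (m + k) * N).

    tf     -- term frequency of the feature item in the document
    m      -- document frequency of the item within the domain
    k      -- document frequency of the item outside the domain
    n_docs -- total number of documents (N)
    """
    if m == 0:
        return 0.0  # assumption: an item never seen in domain texts gets zero weight
    return tf * math.log(m / (m + k) * n_docs)

# Hypothetical numbers: the item occurs 3 times in the document and appears in
# 50 domain and 50 non-domain documents out of the 1400 training documents.
w = improved_tfidf(tf=3, m=50, k=50, n_docs=1400)
```

Because m ÷ (m + k) grows as an item concentrates in domain documents, items that appear mostly inside the domain receive larger weights than items spread evenly across both classes.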
(4) Feature selection and feature weight computation method with domain term expansion (DTFIDF):
The DTFIDF method adds all domain terms that appear in the domain dictionary directly to the classification feature space, and computes feature weights with the improved TFIDF method.
(5) Feature selection and feature weight computation method using domain knowledge (WTFIDF): after the feature space is obtained with the DF method, the correlation between domain terms and feature words is used to adjust the feature word weights. Adjusting the weights within a feature space of limited size improves the text classification results.
The weight adjustment uses the HowNet-based word semantic similarity computation method proposed by Professor Liu Qun et al. of the Institute of Computing Technology, Chinese Academy of Sciences, in "Word Semantic Similarity Computation Based on HowNet", published at the 3rd Chinese Lexical Semantics Workshop:
$$\mathrm{Sim}(S_1, S_2) = \sum_{i=1}^{4} \beta_i \prod_{j=1}^{i} \mathrm{Sim}_j(S_1, S_2)$$
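To make the structure of the similarity formula concrete, the sketch below evaluates the weighted sum of running products for given partial similarities. The function name and the β values in the example are illustrative assumptions; the patent text does not list the β values or say how each partial similarity is obtained.

```python
def combined_similarity(partial_sims, betas):
    """Evaluate the sum over i of beta_i * (Sim_1 * ... * Sim_i).

    partial_sims -- the partial similarities Sim_1..Sim_4, in order
    betas        -- the weights beta_1..beta_4, in order
    """
    total = 0.0
    running_product = 1.0
    for beta, sim_j in zip(betas, partial_sims):
        running_product *= sim_j      # product of Sim_1 .. Sim_i
        total += beta * running_product
    return total

# Illustrative weights that sum to 1 (an assumption, not taken from the patent).
betas = (0.5, 0.2, 0.17, 0.13)
identical = combined_similarity((1.0, 1.0, 1.0, 1.0), betas)
partial = combined_similarity((0.8, 0.5, 1.0, 1.0), betas)
```

Because each term multiplies in all earlier partial similarities, a low first partial similarity caps the contribution of every later term.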
The weight of a feature word is computed with the following formula:
[Formula image A20081005851700062]
where TFIDF is the weight of the feature word in the feature space before weight adjustment; TFn is the term frequency of the n-th domain word occurring in the text whose similarity to the feature word is greater than γ; m is the in-domain document frequency of the domain word occurring in the text; k is the out-of-domain document frequency of that domain word; N is the total number of documents; and Sim(S1, S2) is the similarity between the domain word and the feature word.
(6) Domain text classification model construction:
Classification algorithm SVM:
The support vector machine (SVM) algorithm is used for domain text classification. SVM is a machine learning model based on statistical learning that shows distinctive advantages on small-sample, nonlinear, and high-dimensional pattern recognition problems. Its effectiveness on small-sample classification problems has been verified in text classification, handwriting recognition, natural language processing, and other areas.
The principle of SVM is to map the input vector X into a high-dimensional feature space through a nonlinear mapping (kernel function) chosen in advance, and to construct the optimal separating hyperplane in that space so that the two classes of samples are separated without error and the margin between the two classes is maximized. The former ensures minimal empirical risk, and the latter minimizes the confidence interval in the generalization bound (that is, minimizes the structural risk of the classifier). In this way, a problem that is not linearly separable in the original space becomes linearly separable in the high-dimensional space.
Text vector representation and classification:
Before training and classification, each document is converted into a form the computer can process. A text is represented in the form <label> <index1>:<value1> <index2>:<value2> ... Here <label> is the target value of the training data set; for classification it is an integer identifying a class. In the experiment, the target value of domain texts, that is, Yunnan tourism texts, is set to +1, and the target value of non-domain texts, comprising the ten categories from the Fudan University corpus, is set to -1. <index> is an integer starting from 1 that need not be consecutive and indicates which feature item appears in the document; <value> is a real number, here set to the weight of that feature item. With the methods above, a feature vector representing the text can be constructed for each training and test text, and training and classification are carried out through the LIBSVM interface from National Taiwan University.
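A line in this sparse format can be produced as sketched below. The helper name `to_libsvm_line` is hypothetical; the sketch assumes the feature weights are held in a dictionary keyed by 1-based feature index, writes indices in ascending order, and omits zero weights.

```python
def to_libsvm_line(label, weights):
    """Serialize one document as '<label> <index>:<value> ...'.

    label   -- +1 for a domain text, -1 for a non-domain text
    weights -- dict mapping 1-based feature index to feature weight
    """
    parts = [f"{label:+d}"]
    for index in sorted(weights):          # ascending feature indices
        value = weights[index]
        if value != 0:                     # zero-weight features are left out
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)

# Hypothetical weights for a domain document.
line = to_libsvm_line(+1, {7: 0.5, 2: 1.25, 40: 0.0})
```

Only the features that actually occur in a document appear on its line, which keeps the files compact when the feature space has a thousand or more dimensions.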
Text classification experiments with the method of the present invention on the Yunnan tourism domain versus non-tourism domains show that the domain text classification feature selection and weight computation method based on domain knowledge relations achieves a classification accuracy 4 percentage points higher than the improved TFIDF method.
Description of the drawings
Fig. 1 is a flow chart of the text classification feature selection and weight computation method based on domain knowledge according to the present invention.
Embodiment
The method proposed above was verified experimentally in the Yunnan tourism domain. The specific steps, shown in Fig. 1, are as follows:
Step a1: The training corpus consists of 700 Yunnan tourism documents as domain training texts and 700 documents from the Fudan University corpus (70 each from the environment, computer, transportation, education, economy, military, sports, medicine, art, and politics categories) as non-domain training texts. The test corpus consists of 200 Yunnan tourism documents as domain test texts and 200 documents from the Fudan University corpus (20 each from the same ten categories) as non-domain test texts.
Step a2: Text preprocessing includes word segmentation, stop word removal, term frequency statistics, document frequency statistics, and so on. The text is first segmented into Chinese words using the word segmentation system interface of the Institute of Computing Technology, Chinese Academy of Sciences; on that basis, domain words are segmented and tagged with the help of a domain dictionary. After segmentation, frequently occurring stop words such as "how" are removed. The documents are then scanned to count each word's term frequency, its document frequency within the domain, and its document frequency outside the domain.
Step a3: Different feature space selection and feature weight computation methods are used to select the feature space and compute the feature weights.
(1) TFIDF feature weight computation method: document frequency (DF) is first used to remove low-frequency words, and 1000 feature words are selected to form the classification feature space. Feature word weights are computed with the TFIDF method as improved by Associate Professor Zhang Yufang of the College of Computer Science, Chongqing University: TFIDF = TF × log(m ÷ (m + k) × N), where TF is the term frequency of a feature item, m is the document frequency of the feature item within the domain, k is its document frequency outside the domain, and N is the total number of documents.
With this method, domain terms that occur infrequently but discriminate strongly between domain and non-domain texts are likely to be ignored, or to receive very small weights, during feature selection and weight computation.
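DF-based selection of the feature space can be sketched as follows. The cutoff `min_df` is a hypothetical parameter standing in for the unspecified low-frequency threshold, and ties are broken alphabetically only to make the result deterministic; neither detail comes from the patent.

```python
from collections import Counter

def select_features_by_df(tokenized_docs, max_features=1000, min_df=2):
    """Keep the words with the highest document frequency.

    tokenized_docs -- list of token lists, one per document
    max_features   -- size of the feature space (1000 in the experiment)
    min_df         -- assumed threshold below which a word counts as low-frequency
    """
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # count each word once per document
    kept = [(word, count) for word, count in df.items() if count >= min_df]
    kept.sort(key=lambda item: (-item[1], item[0]))  # high DF first, then A-Z
    return [word for word, _ in kept[:max_features]]

# Toy corpus (assumption): "c" and "d" each occur in one document only.
features = select_features_by_df([["a", "b"], ["a", "c"], ["a", "b", "d"]])
```

A rare domain term behaves like "c" or "d" here: it falls below the cutoff and drops out of the feature space, which is the weakness the following two methods address.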
(2) Feature selection and feature weight computation method with domain term expansion (DTFIDF):
The DTFIDF method adds all domain terms that appear in the domain dictionary directly to the classification feature space.
The feature space is thus formed by merging the feature words obtained after removing low-frequency words by document frequency (DF) with the domain terms in the domain dictionary, and feature word weights are computed with the TFIDF method. This method ensures that domain terms with high discriminating power are not removed during feature space selection, but it increases the dimensionality of the feature space and makes the data sparse, which may affect classification performance to some extent.
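The merge that forms the DTFIDF feature space can be sketched as below. The function name and the 1-based index assignment are assumptions chosen to match the sparse vector format used for classification.

```python
def expand_feature_space(df_features, domain_terms):
    """Merge DF-selected feature words with domain dictionary terms.

    Returns a dict mapping each feature word to a 1-based feature index.
    DF-selected words keep their order; new domain terms are appended.
    """
    features = list(df_features)
    seen = set(features)
    for term in domain_terms:
        if term not in seen:    # skip terms that DF selection already kept
            features.append(term)
            seen.add(term)
    return {word: i + 1 for i, word in enumerate(features)}

# Toy example (assumption): one dictionary term is already a feature word.
index = expand_feature_space(["lijiang", "town"], ["yulong xueshan", "lijiang"])
```

Every dictionary term that was not already selected adds one dimension, which is the source of the sparsity noted above.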
(3) Feature selection and feature weight computation method using domain knowledge (WTFIDF):
After low-frequency words are removed by document frequency (DF) to obtain the feature space, the correlation between domain terms and feature words is used to adjust the feature word weights. Adjusting the weights within a feature space of limited size improves the text classification results.
In this method, feature word weights are adjusted by computing the similarity between feature words and domain terms with the help of HowNet. HowNet is a general-purpose common-sense knowledge resource that describes the concepts represented by Chinese and English words and reveals the relations between concepts and between the attributes of concepts. Using the rules of HowNet's concept description language KDML, 2012 concepts in the Yunnan tourism domain were described precisely. For example, the concepts "Yulong Xueshan" and "Lijiang" are described as follows:
NO.=141008
W_C=Yulong Xueshan
G_C=N
E_C=is very beautiful
W_E=Yulongxueshan
G_E=N
E_E=~is a beautiful place
DEF=PLACE|place, PROPERNAME|proper name, (SCENE|scenic spot), (LIJIANG|Lijiang), (YUNNAN|Yunnan);
NO.=141001
W_C=Lijiang
G_C=N
E_C=~very beautiful
W_E=Lijiang
G_E=N
E_E=~is beautiful place
DEF=PLACE|place, PROPERNAME|proper name, CITY|city, (YUNNAN|Yunnan);
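Records of this KEY=VALUE shape are straightforward to read into a dictionary. The parser below is a hypothetical sketch, not part of HowNet or of the patent; it assumes one field per line and splits the DEF field on commas, and the sample record uses English glosses for illustration.

```python
def parse_record(text):
    """Parse one concept record of KEY=VALUE lines into a dict.

    The DEF field is split into a list of sememe strings. Parentheses and the
    trailing semicolon are stripped, so the distinction that the parentheses
    mark in the original notation is lost in this simplified form.
    """
    record = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition("=")
        record[key.strip()] = value.strip()
    sememes = [part.strip(" ();") for part in record.get("DEF", "").split(",")]
    record["DEF"] = [s for s in sememes if s]
    return record

sample = """
NO.=141001
W_E=Lijiang
G_E=N
DEF=PLACE|place, PROPERNAME|proper name, CITY|city, (YUNNAN|Yunnan);
"""
rec = parse_record(sample)
```

Once records are in this form, two concepts can be compared field by field, for instance by looking at which sememes their DEF lists share.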
By " knowing net " conceptual description method, contact set up in field vocabulary in " knowing net ".To not have selected low frequency field term as the feature speech, the contribution of text classification is embodied in feature space these field terms that neutralize to be had on the weight of feature speech of correlativity.As waiting these not have selected field term, the contribution of text classification is embodied in feature speech of " Lijing " or the like these process weights adjustment as the feature speech with " Yulong Xueshan ".The weight method of adjustment has adopted the Chinese Academy of Sciences to calculate professor Liu Qun of institute and has waited the lexical semantic similarity calculating method based on " knowing net " that proposes in " the lexical semantic similarity based on " knowing net " is calculated " that is published in " the 3rd Chinese lexical semantics symposial "
$$\mathrm{Sim}(S_1, S_2) = \sum_{i=1}^{4} \beta_i \prod_{j=1}^{i} \mathrm{Sim}_j(S_1, S_2)$$
The weight of a feature word is computed with the following formula:
[Formula image A20081005851700092]
where TFIDF is the weight of the feature word in the feature space before weight adjustment; TFn is the term frequency of the n-th domain word occurring in the text whose similarity to the feature word is greater than γ; m is the in-domain document frequency of the domain word occurring in the text; k is the out-of-domain document frequency of that domain word; N is the total number of documents; and Sim(S1, S2) is the similarity between the domain word and the feature word.
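The first part of the adjustment, finding the domain words in a text whose similarity to a feature word exceeds γ, can be sketched as follows. The adjusted-weight formula itself is available here only as an image, so the sketch stops at collecting the inputs that the formula's legend names; the function name, the value of `gamma`, and the toy similarity table are all assumptions.

```python
def related_domain_words(feature_word, domain_word_counts, similarity, gamma=0.8):
    """Collect the domain words that would take part in the weight adjustment.

    feature_word       -- a word from the feature space
    domain_word_counts -- dict: domain word occurring in the text -> its term frequency
    similarity         -- function (domain_word, feature_word) -> value in [0, 1]
    gamma              -- similarity threshold (the value is an assumption)

    Returns a sorted list of (domain_word, term_frequency, similarity) tuples.
    """
    related = []
    for word, tf in domain_word_counts.items():
        sim = similarity(word, feature_word)
        if sim > gamma:  # only sufficiently similar domain words contribute
            related.append((word, tf, sim))
    return sorted(related)

# Toy similarity table standing in for a HowNet-based measure (assumption).
table = {("yulong xueshan", "lijiang"): 0.9, ("dianchi", "lijiang"): 0.3}
sim_fn = lambda a, b: table.get((a, b), 0.0)
hits = related_domain_words("lijiang", {"yulong xueshan": 2, "dianchi": 1}, sim_fn)
```

Each returned tuple carries the term frequency and similarity that the legend above refers to as TFn and Sim(S1, S2).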
Step a4: Domain text classification model construction
Before training and classification, each document is converted into a form the computer can process. A text is represented in the form <label> <index1>:<value1> <index2>:<value2> ... Here <label> is the target value of the training data set; for classification it is an integer identifying a class. In the experiment, the target value of domain texts, that is, Yunnan tourism texts, is set to +1, and the target value of non-domain texts, comprising the ten categories from the Fudan University corpus, is set to -1. <index> is an integer starting from 1 that need not be consecutive and indicates which feature item appears in the document; <value> is a real number, here set to the weight of that feature item. With the methods above, a feature vector representing the text can be constructed for each training and test text, and training and classification are carried out through the LIBSVM interface from National Taiwan University.
Step a5: The text classification model is used to run experiments on the Yunnan tourism domain.
The experiment uses the DF method to select the feature space, taking the 1000 words with the highest document frequency as the feature space. Feature space selection and feature weight computation are carried out with the improved TFIDF, DTFIDF, and WTFIDF methods respectively. A two-class classifier is trained in the experiment to separate domain texts from non-domain texts.
Table 1 gives the text classification results obtained with the different feature spaces and feature weight computation methods.
The data show that with the TFIDF method, the classification accuracy for domain texts is 90.5%. With the DTFIDF method, the classification accuracy for domain texts is 3 percentage points higher than with the TFIDF method, and the classification accuracy over all texts is 1.75 percentage points higher than with the improved TFIDF method. With the WTFIDF method, the classification accuracy for domain texts is 7.5 percentage points higher than with the TFIDF method, and the classification accuracy over all texts is 4 percentage points higher than with the improved TFIDF method. The accuracy for non-domain texts, however, does not improve noticeably. These data indicate that the proposed text classification feature selection and weight computation method using domain knowledge substantially improves the accuracy of domain text classification.
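The three accuracy figures discussed above (domain texts, non-domain texts, all texts) can be computed from gold and predicted labels as sketched below. The function name and the toy labels are hypothetical and are not the experiment's data.

```python
def accuracy_report(gold, predicted):
    """Return accuracy on domain texts (+1), non-domain texts (-1), and all texts."""
    def accuracy(pairs):
        pairs = list(pairs)
        if not pairs:
            return float("nan")  # no examples of this class
        return sum(g == p for g, p in pairs) / len(pairs)

    pairs = list(zip(gold, predicted))
    return {
        "domain": accuracy((g, p) for g, p in pairs if g == +1),
        "non_domain": accuracy((g, p) for g, p in pairs if g == -1),
        "overall": accuracy(pairs),
    }

# Toy labels (assumption): one domain text is misclassified as non-domain.
gold = [+1, +1, +1, +1, -1, -1, -1, -1]
pred = [+1, +1, +1, -1, -1, -1, -1, -1]
report = accuracy_report(gold, pred)
```

When the two classes have the same number of test texts, as in the experiment, the overall accuracy is the mean of the two per-class accuracies.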
Analysis of the experiments and example data above shows the following. When only the TFIDF method is used to select feature words, some low-frequency feature words of the tourism domain are not selected, so that after texts containing domain words are converted to vector form, some dimensions with strong discriminating power are ignored and the classification results are unsatisfactory. With the DTFIDF method, the discriminating dimensions in texts containing domain words are represented, and classification improves. However, introducing the domain words enlarges the feature space and makes the data sparse, which also affects classification performance to some degree. With the WTFIDF method, the dimensionality of the feature space stays fixed, and the contribution of domain words that do not appear in the feature space is reflected in the weights of the feature words correlated with them, so classification accuracy improves. This shows that the text classification feature selection and weight computation method based on domain knowledge is practical for classifying domain and non-domain texts.

Claims (4)

1. A text classification feature selection and weight computation method based on domain knowledge, characterized in that it is carried out according to the following steps:
(1) collecting domain texts and non-domain texts as the training corpus and the test corpus;
(2) text preprocessing: word segmentation, stop word removal, term frequency statistics, and document frequency statistics; the text is first segmented into Chinese words using the word segmentation system interface of the Institute of Computing Technology, Chinese Academy of Sciences, and on that basis domain words are segmented and tagged with the help of a domain dictionary; after segmentation, frequently occurring stop words such as "how" are removed, and the documents are then scanned to count each word's term frequency, its document frequency within the domain, and its document frequency outside the domain;
(3) removing words whose DF value is below a certain threshold to select the classification feature space, and computing feature weights with the TFIDF method; after preprocessing, document frequency is first used to remove low-frequency words, and 1000 feature words are selected to form the classification feature space; feature word weights are computed with the improved method TFIDF = TF × log(m ÷ (m + k) × N), where TF is the term frequency of a feature item, m is the document frequency of the feature item within the domain, k is its document frequency outside the domain, and N is the total number of documents;
(4) on the basis of the feature space selected in step (3), expanding domain terms into the feature space to form the classification feature space, and computing feature weights with the improved TFIDF method; that is, all domain terms appearing in the domain dictionary are added directly to the classification feature space;
(5) on the basis of step (3), selecting the classification feature space, and computing and adjusting feature weights with the improved TFIDF method in combination with domain knowledge relations; that is, after the feature space is obtained with the DF method, the correlation between domain terms and feature words in HowNet is used to adjust the feature word weights, and adjusting the weights within a feature space of limited size improves the text classification results;
(6) using the different feature space selection and feature weight computation methods together with the SVM machine learning algorithm to train a text classifier, build a domain text classification model, and verify it experimentally on domain text classification.
2. The text classification feature selection and weight computation method based on domain knowledge according to claim 1, characterized in that the use of the improved TFIDF method in combination with domain knowledge relations in step (5) comprises computing the similarity between the domain terms that occur in the text but not in the feature space and the feature words in the feature space, and adjusting the weights of the feature words whose similarity exceeds a certain threshold.
3. The text classification feature selection and weight computation method based on domain knowledge according to claim 1, characterized in that the adjustment of feature word weights in step (5), which uses the correlation between domain terms and feature words in HowNet, employs the word semantic similarity computation method:
$$\mathrm{Sim}(S_1, S_2) = \sum_{i=1}^{4} \beta_i \prod_{j=1}^{i} \mathrm{Sim}_j(S_1, S_2),$$
The weight of a feature word is computed with the following formula:
where TFIDF is the weight of the feature word in the feature space before weight adjustment; TFn is the term frequency of the n-th domain term occurring in the text whose similarity to the feature word is greater than γ; m is the in-domain document frequency of the domain term occurring in the text; k is the out-of-domain document frequency of that domain term; N is the total number of documents; and Sim(S1, S2) is the similarity between the domain term and the feature word.
4. The text classification feature selection and weight computation method based on domain knowledge according to claim 1, characterized in that, in training the text classifier in step (6), a domain text classification model is built for each of the three feature space selection and feature weight computation methods described in steps (3), (4), and (5).
CN200810058517A 2008-06-12 2008-06-12 Text categorization feature selection and weight computation method based on field knowledge Expired - Fee Related CN100583101C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200810058517A CN100583101C (en) 2008-06-12 2008-06-12 Text categorization feature selection and weight computation method based on field knowledge

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200810058517A CN100583101C (en) 2008-06-12 2008-06-12 Text categorization feature selection and weight computation method based on field knowledge

Publications (2)

Publication Number Publication Date
CN101290626A true CN101290626A (en) 2008-10-22
CN100583101C CN100583101C (en) 2010-01-20

Family

ID=40034884

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200810058517A Expired - Fee Related CN100583101C (en) 2008-06-12 2008-06-12 Text categorization feature selection and weight computation method based on field knowledge

Country Status (1)

Country Link
CN (1) CN100583101C (en)

Cited By (61)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101819601A (en) * 2010-05-11 2010-09-01 同方知网(北京)技术有限公司 Method for automatically classifying academic documents
CN101477798B (en) * 2009-02-17 2011-01-05 北京邮电大学 Method for analyzing and extracting audio data of set scene
CN102033964A (en) * 2011-01-13 2011-04-27 北京邮电大学 Text classification method based on block partition and position weight
WO2011057497A1 (en) * 2009-11-10 2011-05-19 腾讯科技(深圳)有限公司 Method and device for mining and evaluating vocabulary quality
CN101609472B (en) * 2009-08-13 2011-08-17 腾讯科技(深圳)有限公司 Keyword evaluation method and device based on platform for questions and answers
CN102184402A (en) * 2011-05-17 2011-09-14 哈尔滨工程大学 Feature selection method
CN102200981A (en) * 2010-03-25 2011-09-28 三星电子(中国)研发中心 Feature selection method and feature selection device for hierarchical text classification
CN102279890A (en) * 2011-09-02 2011-12-14 苏州大学 Sentiment word extracting and collecting method based on micro blog
CN102289522A (en) * 2011-09-19 2011-12-21 北京金和软件股份有限公司 Method of intelligently classifying texts
CN102332012A (en) * 2011-09-13 2012-01-25 南方报业传媒集团 Chinese text sorting method based on correlation study between sorts
CN102360383A (en) * 2011-10-15 2012-02-22 西安交通大学 Method for extracting text-oriented field term and term relationship
CN102411583A (en) * 2010-09-20 2012-04-11 阿里巴巴集团控股有限公司 Method and device for matching texts
CN102567308A (en) * 2011-12-20 2012-07-11 上海电机学院 Information processing feature extracting method
CN102629282A (en) * 2012-05-03 2012-08-08 湖南神州祥网科技有限公司 Website classification method, device and system
CN102662952A (en) * 2012-03-02 2012-09-12 成都康赛电子科大信息技术有限责任公司 Chinese text parallel data mining method based on hierarchy
CN102081601B (en) * 2009-11-27 2013-01-09 北京金山软件有限公司 Field word identification method and device
CN102929860A (en) * 2012-10-12 2013-02-13 浙江理工大学 Chinese clause emotion polarity distinguishing method based on context
CN102955791A (en) * 2011-08-23 2013-03-06 句容今太科技园有限公司 Searching and classifying service system for network information
CN102135961B (en) * 2010-01-22 2013-03-20 北京金山软件有限公司 Method and device for determining domain feature words
CN103106275A (en) * 2013-02-08 2013-05-15 西北工业大学 Text classification character screening method based on character distribution information
CN103226578A (en) * 2013-04-02 2013-07-31 浙江大学 Method for identifying websites and finely classifying web pages in medical field
CN103324692A (en) * 2013-06-04 2013-09-25 北京大学 Classified knowledge acquiring method and device
CN103902570A (en) * 2012-12-27 2014-07-02 腾讯科技(深圳)有限公司 Text classification feature extraction method, classification method and device
CN104035996A (en) * 2014-06-11 2014-09-10 华东师范大学 Domain concept extraction method based on Deep Learning
CN104182463A (en) * 2014-07-21 2014-12-03 安徽华贞信息科技有限公司 Semantic-based text classification method
CN104268144A (en) * 2014-08-12 2015-01-07 华东师范大学 Electronic medical record query statement constructing method
CN104794187A (en) * 2015-04-13 2015-07-22 西安理工大学 Feature selection method based on entry distribution
CN104809131A (en) * 2014-01-27 2015-07-29 董靖 Automatic classification system and method of electronic documents
CN104965867A (en) * 2015-06-08 2015-10-07 南京师范大学 Text event classification method based on CHI feature selection
CN104991891A (en) * 2015-07-28 2015-10-21 北京大学 Short text feature extraction method
CN105045913A (en) * 2015-08-14 2015-11-11 北京工业大学 Text classification method based on WordNet and latent semantic analysis
CN105205090A (en) * 2015-05-29 2015-12-30 湖南大学 Web page text classification algorithm research based on web page link analysis and support vector machine
CN105224689A (en) * 2015-10-30 2016-01-06 北京信息科技大学 Dongba document classification method
CN105760471A (en) * 2016-02-06 2016-07-13 北京工业大学 Classification method for two types of texts based on multiconlitron
CN105787004A (en) * 2016-02-22 2016-07-20 浪潮软件股份有限公司 Text classification method and device
CN106095949A (en) * 2016-06-14 2016-11-09 东北师范大学 Personalized digital library resource recommendation method and system based on hybrid recommendation
CN106156083A (en) * 2015-03-31 2016-11-23 联想(北京)有限公司 Domain knowledge processing method and device
CN106326458A (en) * 2016-06-02 2017-01-11 广西智度信息科技有限公司 Method for classifying city management cases based on text classification
CN106445907A (en) * 2015-08-06 2017-02-22 北京国双科技有限公司 Domain lexicon generation method and apparatus
CN106569993A (en) * 2015-10-10 2017-04-19 中国移动通信集团公司 Method and device for mining hypernym-hyponym relation between domain-specific terms
CN106649563A (en) * 2016-11-10 2017-05-10 新华三技术有限公司 Method and device for constructing lexicon of website classification
CN106649253A (en) * 2015-11-02 2017-05-10 涂悦 Auxiliary control method and system based on post verification
CN106844424A (en) * 2016-12-09 2017-06-13 宁波大学 Text classification method based on LDA
CN107145560A (en) * 2017-05-02 2017-09-08 北京邮电大学 Text classification method and device
CN107292193A (en) * 2017-05-25 2017-10-24 北京北信源软件股份有限公司 Method and system for realizing leakage prevention
CN107402916A (en) * 2017-07-17 2017-11-28 广州特道信息科技有限公司 Word segmentation method and device for Chinese text
CN107480126A (en) * 2017-07-10 2017-12-15 广东华联建设投资管理股份有限公司 Intelligent identification method for engineering material category
WO2018028326A1 (en) * 2016-08-08 2018-02-15 华为技术有限公司 Model updating method and apparatus
CN107861939A (en) * 2017-09-30 2018-03-30 昆明理工大学 Domain entity disambiguation method fusing word vectors and topic models
CN108268457A (en) * 2016-12-30 2018-07-10 广东精点数据科技股份有限公司 Text classification method and device based on SVM
CN109408642A (en) * 2018-08-30 2019-03-01 昆明理工大学 Domain entity attribute relation extraction method based on distance supervision
CN109947939A (en) * 2019-01-30 2019-06-28 中兴飞流信息科技有限公司 Text classification method, electronic equipment and computer readable storage medium
CN110751285A (en) * 2018-07-23 2020-02-04 第四范式(北京)技术有限公司 Training method and system and prediction method and system of neural network model
CN110765781A (en) * 2019-12-11 2020-02-07 沈阳航空航天大学 Man-machine collaborative construction method for domain term semantic knowledge base
CN111090753A (en) * 2018-10-24 2020-05-01 马上消费金融股份有限公司 Training method of classification model, classification method, device and computer storage medium
CN111177389A (en) * 2019-12-30 2020-05-19 佰聆数据股份有限公司 NLP technology-based classification method, system and storage medium for power charge notification and customer appeal collection
CN111324722A (en) * 2020-05-15 2020-06-23 支付宝(杭州)信息技术有限公司 Method and system for training word weight model
CN111444310A (en) * 2019-12-02 2020-07-24 北京中科院软件中心有限公司 Method and system for constructing manufacturing field term library
CN111694948A (en) * 2019-03-12 2020-09-22 北京京东尚科信息技术有限公司 Text classification method and system, electronic equipment and storage medium
US11321527B1 (en) 2021-01-21 2022-05-03 International Business Machines Corporation Effective classification of data based on curated features
US11727312B2 (en) 2019-09-03 2023-08-15 International Business Machines Corporation Generating personalized recommendations to address a target problem

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108573047A (en) * 2018-04-18 2018-09-25 广东工业大学 Training method and device for an automatic Chinese document classification model

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6192360B1 (en) * 1998-06-23 2001-02-20 Microsoft Corporation Methods and apparatus for classifying text and for building a text classifier
GB2362238A (en) * 2000-05-12 2001-11-14 Applied Psychology Res Ltd Automatic text classification
US6990496B1 (en) * 2000-07-26 2006-01-24 Koninklijke Philips Electronics N.V. System and method for automated classification of text by time slicing
US7062498B2 (en) * 2001-11-02 2006-06-13 Thomson Legal Regulatory Global Ag Systems, methods, and software for classifying text from judicial opinions and other documents

Cited By (87)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101477798B (en) * 2009-02-17 2011-01-05 北京邮电大学 Method for analyzing and extracting audio data of set scene
CN101609472B (en) * 2009-08-13 2011-08-17 腾讯科技(深圳)有限公司 Keyword evaluation method and device based on platform for questions and answers
WO2011057497A1 (en) * 2009-11-10 2011-05-19 腾讯科技(深圳)有限公司 Method and device for mining and evaluating vocabulary quality
US8645418B2 (en) 2009-11-10 2014-02-04 Tencent Technology (Shenzhen) Company Limited Method and apparatus for word quality mining and evaluating
RU2517368C2 (en) * 2009-11-10 2014-05-27 Тенсент Текнолоджи (Шэньчжэнь) Компани Лимитед Method and apparatus for determining and evaluating significance of words
CN102081601B (en) * 2009-11-27 2013-01-09 北京金山软件有限公司 Field word identification method and device
CN102135961B (en) * 2010-01-22 2013-03-20 北京金山软件有限公司 Method and device for determining domain feature words
CN102200981B (en) * 2010-03-25 2013-07-17 三星电子(中国)研发中心 Feature selection method and feature selection device for hierarchical text classification
CN102200981A (en) * 2010-03-25 2011-09-28 三星电子(中国)研发中心 Feature selection method and feature selection device for hierarchical text classification
CN101819601A (en) * 2010-05-11 2010-09-01 同方知网(北京)技术有限公司 Method for automatically classifying academic documents
CN102411583A (en) * 2010-09-20 2012-04-11 阿里巴巴集团控股有限公司 Method and device for matching texts
CN102411583B (en) * 2010-09-20 2013-09-18 阿里巴巴集团控股有限公司 Method and device for matching texts
CN102033964A (en) * 2011-01-13 2011-04-27 北京邮电大学 Text classification method based on block partition and position weight
CN102184402A (en) * 2011-05-17 2011-09-14 哈尔滨工程大学 Feature selection method
CN102955791A (en) * 2011-08-23 2013-03-06 句容今太科技园有限公司 Searching and classifying service system for network information
CN102279890A (en) * 2011-09-02 2011-12-14 苏州大学 Sentiment word extracting and collecting method based on micro blog
CN102332012A (en) * 2011-09-13 2012-01-25 南方报业传媒集团 Chinese text sorting method based on correlation study between sorts
CN102332012B (en) * 2011-09-13 2014-10-22 南方报业传媒集团 Chinese text sorting method based on correlation study between sorts
CN102289522B (en) * 2011-09-19 2014-08-13 北京金和软件股份有限公司 Method of intelligently classifying texts
CN102289522A (en) * 2011-09-19 2011-12-21 北京金和软件股份有限公司 Method of intelligently classifying texts
CN102360383B (en) * 2011-10-15 2013-07-31 西安交通大学 Method for extracting text-oriented field term and term relationship
CN102360383A (en) * 2011-10-15 2012-02-22 西安交通大学 Method for extracting text-oriented field term and term relationship
CN102567308A (en) * 2011-12-20 2012-07-11 上海电机学院 Information processing feature extracting method
CN102662952B (en) * 2012-03-02 2015-04-15 成都康赛信息技术有限公司 Chinese text parallel data mining method based on hierarchy
CN102662952A (en) * 2012-03-02 2012-09-12 成都康赛电子科大信息技术有限责任公司 Chinese text parallel data mining method based on hierarchy
CN102629282A (en) * 2012-05-03 2012-08-08 湖南神州祥网科技有限公司 Website classification method, device and system
CN102929860B (en) * 2012-10-12 2015-05-13 浙江理工大学 Chinese clause emotion polarity distinguishing method based on context
CN102929860A (en) * 2012-10-12 2013-02-13 浙江理工大学 Chinese clause emotion polarity distinguishing method based on context
CN103902570A (en) * 2012-12-27 2014-07-02 腾讯科技(深圳)有限公司 Text classification feature extraction method, classification method and device
CN103902570B (en) * 2012-12-27 2018-11-09 腾讯科技(深圳)有限公司 Text classification feature extraction method, classification method and device
CN103106275B (en) * 2013-02-08 2016-02-10 西北工业大学 Text classification feature screening method based on feature distribution information
CN103106275A (en) * 2013-02-08 2013-05-15 西北工业大学 Text classification character screening method based on character distribution information
CN103226578B (en) * 2013-04-02 2015-11-04 浙江大学 Method for identifying websites and finely classifying web pages in medical field
CN103226578A (en) * 2013-04-02 2013-07-31 浙江大学 Method for identifying websites and finely classifying web pages in medical field
CN103324692B (en) * 2013-06-04 2016-05-18 北京大学 Classified knowledge acquiring method and device
CN103324692A (en) * 2013-06-04 2013-09-25 北京大学 Classified knowledge acquiring method and device
CN104809131B (en) * 2014-01-27 2021-06-25 董靖 Automatic classification system and method for electronic documents
CN104809131A (en) * 2014-01-27 2015-07-29 董靖 Automatic classification system and method of electronic documents
CN104035996A (en) * 2014-06-11 2014-09-10 华东师范大学 Domain concept extraction method based on Deep Learning
CN104035996B (en) * 2014-06-11 2017-06-16 华东师范大学 Field concept abstracting method based on Deep Learning
CN104182463A (en) * 2014-07-21 2014-12-03 安徽华贞信息科技有限公司 Semantic-based text classification method
CN104268144A (en) * 2014-08-12 2015-01-07 华东师范大学 Electronic medical record query statement constructing method
CN106156083B (en) * 2015-03-31 2020-02-21 联想(北京)有限公司 Domain knowledge processing method and device
CN106156083A (en) * 2015-03-31 2016-11-23 联想(北京)有限公司 Domain knowledge processing method and device
CN104794187A (en) * 2015-04-13 2015-07-22 西安理工大学 Feature selection method based on entry distribution
CN105205090A (en) * 2015-05-29 2015-12-30 湖南大学 Web page text classification algorithm research based on web page link analysis and support vector machine
CN104965867A (en) * 2015-06-08 2015-10-07 南京师范大学 Text event classification method based on CHI feature selection
CN104991891A (en) * 2015-07-28 2015-10-21 北京大学 Short text feature extraction method
CN104991891B (en) * 2015-07-28 2018-03-30 北京大学 Short text feature extraction method
CN106445907A (en) * 2015-08-06 2017-02-22 北京国双科技有限公司 Domain lexicon generation method and apparatus
CN105045913A (en) * 2015-08-14 2015-11-11 北京工业大学 Text classification method based on WordNet and latent semantic analysis
CN105045913B (en) * 2015-08-14 2018-08-28 北京工业大学 Text classification method based on WordNet and latent semantic analysis
CN106569993A (en) * 2015-10-10 2017-04-19 中国移动通信集团公司 Method and device for mining hypernym-hyponym relation between domain-specific terms
CN105224689A (en) * 2015-10-30 2016-01-06 北京信息科技大学 Dongba document classification method
CN106649253B (en) * 2015-11-02 2019-03-22 涂悦 Auxiliary control method and system based on post verification
CN106649253A (en) * 2015-11-02 2017-05-10 涂悦 Auxiliary control method and system based on post verification
CN105760471A (en) * 2016-02-06 2016-07-13 北京工业大学 Classification method for two types of texts based on multiconlitron
CN105760471B (en) * 2016-02-06 2019-04-19 北京工业大学 Classification method for two types of texts based on multiconlitron
CN105787004A (en) * 2016-02-22 2016-07-20 浪潮软件股份有限公司 Text classification method and device
CN106326458A (en) * 2016-06-02 2017-01-11 广西智度信息科技有限公司 Method for classifying city management cases based on text classification
CN106095949A (en) * 2016-06-14 2016-11-09 东北师范大学 Personalized digital library resource recommendation method and system based on hybrid recommendation
WO2018028326A1 (en) * 2016-08-08 2018-02-15 华为技术有限公司 Model updating method and apparatus
CN106649563A (en) * 2016-11-10 2017-05-10 新华三技术有限公司 Method and device for constructing lexicon of website classification
CN106649563B (en) * 2016-11-10 2022-02-25 新华三技术有限公司 Website classification dictionary construction method and device
CN106844424A (en) * 2016-12-09 2017-06-13 宁波大学 Text classification method based on LDA
CN108268457A (en) * 2016-12-30 2018-07-10 广东精点数据科技股份有限公司 Text classification method and device based on SVM
CN107145560B (en) * 2017-05-02 2021-01-29 北京邮电大学 Text classification method and device
CN107145560A (en) * 2017-05-02 2017-09-08 北京邮电大学 Text classification method and device
CN107292193A (en) * 2017-05-25 2017-10-24 北京北信源软件股份有限公司 Method and system for realizing leakage prevention
CN107480126A (en) * 2017-07-10 2017-12-15 广东华联建设投资管理股份有限公司 Intelligent identification method for engineering material category
CN107480126B (en) * 2017-07-10 2021-04-13 华联世纪工程咨询股份有限公司 Intelligent identification method for engineering material category
CN107402916A (en) * 2017-07-17 2017-11-28 广州特道信息科技有限公司 Word segmentation method and device for Chinese text
CN107861939A (en) * 2017-09-30 2018-03-30 昆明理工大学 Domain entity disambiguation method fusing word vectors and topic models
CN110751285B (en) * 2018-07-23 2024-01-23 第四范式(北京)技术有限公司 Training method and system and prediction method and system for neural network model
CN110751285A (en) * 2018-07-23 2020-02-04 第四范式(北京)技术有限公司 Training method and system and prediction method and system of neural network model
CN109408642A (en) * 2018-08-30 2019-03-01 昆明理工大学 Domain entity attribute relation extraction method based on distance supervision
CN109408642B (en) * 2018-08-30 2021-07-16 昆明理工大学 Domain entity attribute relation extraction method based on distance supervision
CN111090753A (en) * 2018-10-24 2020-05-01 马上消费金融股份有限公司 Training method of classification model, classification method, device and computer storage medium
CN109947939A (en) * 2019-01-30 2019-06-28 中兴飞流信息科技有限公司 Text classification method, electronic equipment and computer readable storage medium
CN111694948A (en) * 2019-03-12 2020-09-22 北京京东尚科信息技术有限公司 Text classification method and system, electronic equipment and storage medium
US11727312B2 (en) 2019-09-03 2023-08-15 International Business Machines Corporation Generating personalized recommendations to address a target problem
CN111444310A (en) * 2019-12-02 2020-07-24 北京中科院软件中心有限公司 Method and system for constructing manufacturing field term library
CN110765781A (en) * 2019-12-11 2020-02-07 沈阳航空航天大学 Man-machine collaborative construction method for domain term semantic knowledge base
CN110765781B (en) * 2019-12-11 2023-07-14 沈阳航空航天大学 Man-machine collaborative construction method for domain term semantic knowledge base
CN111177389A (en) * 2019-12-30 2020-05-19 佰聆数据股份有限公司 NLP technology-based classification method, system and storage medium for power charge notification and customer appeal collection
CN111324722A (en) * 2020-05-15 2020-06-23 支付宝(杭州)信息技术有限公司 Method and system for training word weight model
US11321527B1 (en) 2021-01-21 2022-05-03 International Business Machines Corporation Effective classification of data based on curated features

Also Published As

Publication number Publication date
CN100583101C (en) 2010-01-20

Similar Documents

Publication Publication Date Title
CN100583101C (en) Text categorization feature selection and weight computation method based on field knowledge
CN108628971B (en) Text classification method, text classifier and storage medium for unbalanced data set
Ma et al. Label embedding for zero-shot fine-grained named entity typing
CN107066553B (en) Short text classification method based on convolutional neural network and random forest
CN104750844B (en) Method and apparatus for generating text feature vectors based on TF-IGM, and text classification method and device
CN102930063B (en) Feature item selection and weight calculation based text classification method
CN110287328B (en) Text classification method, device and equipment and computer readable storage medium
CN109960799B (en) Short text-oriented optimization classification method
CN106776713A (en) Massive short text clustering method based on word vector semantic analysis
CN107608999A (en) Question classification method suitable for automatic question-answering systems
CN104391942A (en) Short text characteristic expanding method based on semantic atlas
CN106528642A (en) TF-IDF feature extraction based short text classification method
CN107145560B (en) Text classification method and device
CN104391835A (en) Method and device for selecting feature words in texts
CN108763348A (en) Improved classification method extending short-text word feature vectors
CN106599054A (en) Method and system for title classification and push
CN103886108A (en) Feature selection and weight calculation method of imbalance text set
CN113255340B (en) Theme extraction method and device for scientific and technological requirements and storage medium
CN107463703A (en) English social media account number classification method based on information gain
CN103020167A (en) Chinese text classification method for computer
CN107463715A (en) English social media account number classification method based on information gain
CN108090178A (en) Text data analysis method, device, server and storage medium
CN102411592B (en) Text classification method and device
CN106203508A (en) Image classification method based on Hadoop platform
Campbell et al. Content+ context networks for user classification in twitter

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 2010-01-20

Termination date: 2012-06-12