CN101290626A

CN101290626A - Text categorization feature selection and weight computation method based on field knowledge

Info

Publication number: CN101290626A
Application number: CNA2008100585170A
Authority: CN
Inventors: 余正涛; 韩露; 向凤红; 万舟; 熊新
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2008-06-12
Filing date: 2008-06-12
Publication date: 2008-10-22
Anticipated expiration: 2028-06-12
Also published as: CN100583101C

Abstract

The invention relates to the technical field of artificial intelligence, and in particular to a text classification feature selection and weight calculation method based on domain knowledge. The method combines sample statistics with domain terms to construct a domain classification feature space, uses the knowledge relations within the domain to calculate the similarity between terms, and adjusts the corresponding feature weights of the classification feature vectors accordingly. The method then uses the support vector machine learning algorithm to build a domain text classification model and thereby classify domain texts. Text classification experiments on the Yunnan tourism domain and non-tourism domains show that the classification accuracy of the method is 4 percentage points higher than that of the improved TFIDF feature weighting method.

Description

Text classification feature selection and weight calculation method based on domain knowledge

Technical field

The present invention relates to the field of artificial intelligence, and in particular to a text classification feature selection and weight calculation method based on domain knowledge.

Background technology

Text classification is a hot topic in current natural language processing research. Determining whether a text belongs to a specific domain is a key problem for applications such as vertical search engines and question answering systems. Feature selection is usually the most important part of text classification, because it directly affects classification accuracy. Conventional feature selection methods mostly extract features with evaluation functions such as document frequency (DF), information gain (IG), mutual information (MI), and the chi-square statistic (CHI). These methods are all statistical: they typically use a large corpus to obtain the feature space and select it through statistical computation and dimensionality reduction. As a result, some of the selected statistical features may contribute little to classification and can even reduce accuracy. In domain text classification, moreover, domain terms frequently appear in the text and discriminate strongly between domain and non-domain texts. With conventional feature selection methods, however, these features, which matter greatly to classification, may receive low weights or even be removed as noise, which greatly harms classification accuracy.

Summary of the invention

The object of the present invention is to provide a domain text classification feature selection and weight calculation method based on domain knowledge relations.

The present invention proposes and implements a domain text classification feature selection and weight calculation method based on domain knowledge relations. The method combines sample statistics with domain terms to construct the domain classification feature space, uses the knowledge relations within the domain to calculate the similarity between terms, and adjusts the corresponding feature weights of the classification feature vector accordingly. It then uses the support vector machine learning algorithm to build a domain text classification model and classify domain texts. Text classification experiments on the Yunnan tourism domain and non-tourism domains show that the classification accuracy of this method is 4 percentage points higher than that of the improved TFIDF method.

The technical scheme of the invention is as follows:

Text classification with the feature selection and weight calculation method based on domain knowledge proceeds in the following steps:

(1) Collection of the experimental corpus:

Domain texts and non-domain texts are collected as the training corpus and the test corpus. For training, the experiment uses 700 Yunnan tourism documents retrieved at random from the web as domain training texts and 700 documents from the Fudan University corpus (70 each from the environment, computer, transportation, education, economy, military, sports, medicine, art, and politics categories) as non-domain training texts. For testing, it uses 200 Yunnan tourism documents retrieved at random from the web as domain test texts and 200 documents from the Fudan University corpus (20 each from the same ten categories) as non-domain test texts.

(2) Text preprocessing:

Text preprocessing includes word segmentation, stop-word removal, word frequency statistics, and document frequency statistics. The text is first segmented into Chinese words using the word segmentation system interface of the Institute of Computing Technology, Chinese Academy of Sciences; on this basis, domain words are segmented and tagged with the help of the domain dictionary. After segmentation, stop words that occur frequently in the text (for example "how") are removed. The documents are then scanned to count, for each word, its word frequency, its document frequency within the domain, and its document frequency outside the domain.
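The counting part of this step can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the documents have already been segmented into word lists by an external tool, and the function name `corpus_statistics` and the toy English data are hypothetical.

```python
from collections import Counter


def corpus_statistics(domain_docs, non_domain_docs, stop_words):
    """Count word frequency and in-domain / out-of-domain document frequency.

    Each document is a list of already-segmented words (segmentation is
    assumed to have been done beforehand by an external tool).
    """
    word_freq = Counter()      # total occurrences of each word
    df_domain = Counter()      # number of domain documents containing the word
    df_non_domain = Counter()  # number of non-domain documents containing it

    for docs, df in ((domain_docs, df_domain), (non_domain_docs, df_non_domain)):
        for doc in docs:
            words = [w for w in doc if w not in stop_words]
            word_freq.update(words)
            df.update(set(words))  # a document counts once per distinct word
    return word_freq, df_domain, df_non_domain


# Toy data, for illustration only.
domain = [["lijiang", "the", "old", "town"], ["lijiang", "snow", "mountain"]]
non_domain = [["the", "stock", "market"], ["snow", "the", "forecast"]]
word_freq, m, k = corpus_statistics(domain, non_domain, stop_words={"the"})
```

Here `m` and `k` hold the in-domain and out-of-domain document frequencies that the weight formula in step (3) refers to.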

(3) TFIDF feature weight calculation method:

After text preprocessing, low-frequency words are first removed by document frequency (DF), and 1000 feature words are selected to form the classification feature space. Feature word weights are calculated with the improved TFIDF method proposed by Associate Professor Zhang Yufang et al. of the College of Computer Science, Chongqing University, in "Improvement and Application of the TFIDF Method Based on Text Classification", published in "Computer Engineering" in 2006: TFIDF = TF × log((m ÷ (m + k)) × N), where TF is the word frequency of a feature item, m is the document frequency of the feature item within the domain, k is its document frequency outside the domain, and N is the total number of documents.
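As a rough sketch, the improved TFIDF weight can be computed as below. The natural logarithm and the zero weight returned when m = 0 are assumptions made for this example (the text does not state the log base or how that case is handled), and the function name is hypothetical.

```python
import math


def improved_tfidf(tf, m, k, n_docs):
    """TFIDF = TF * log((m / (m + k)) * N), as given in the text.

    tf     -- word frequency of the feature item
    m      -- document frequency of the item within the domain
    k      -- document frequency of the item outside the domain
    n_docs -- total number of documents (N)
    """
    if m == 0:
        # Assumption: an item never seen in a domain document gets weight 0,
        # since log(0) is undefined.
        return 0.0
    return tf * math.log(m / (m + k) * n_docs)  # natural log is an assumption


# With the 1400 training documents of the experiment (700 + 700):
only_in_domain = improved_tfidf(tf=3, m=70, k=0, n_docs=1400)
evenly_spread = improved_tfidf(tf=3, m=70, k=70, n_docs=1400)
```

An item concentrated in domain documents (`only_in_domain`) receives a larger weight than one spread evenly over both classes (`evenly_spread`). Because m + k cannot exceed N, the argument of the logarithm is at least m, so the weight is non-negative whenever m ≥ 1.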

(4) Feature selection and feature weight calculation with expanded domain terms (DTFIDF):

The DTFIDF method expands all domain terms that appear in the domain dictionary directly into the classification feature space and calculates feature weights with the improved TFIDF method.

(5) Feature selection and feature weight calculation using domain knowledge (WTFIDF): After the feature space is obtained with the DF method, the weights of the feature words are adjusted according to the correlation between domain terms and feature words. Adjusting the feature word weights within a feature space of limited size improves the text classification results.

The weight adjustment uses the HowNet-based lexical semantic similarity calculation method proposed by Professor Liu Qun et al. of the Institute of Computing Technology, Chinese Academy of Sciences, in "Lexical Semantic Similarity Computation Based on HowNet", published at the 3rd Chinese Lexical Semantics Workshop:

Sim(S_1, S_2) = \sum_{i=1}^{4} \beta_i \prod_{j=1}^{i} Sim_j(S_1, S_2)
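The nested sum and product in this formula can be evaluated with a running product, as in the sketch below. The text does not define the four partial similarities Sim_j or give values for the weights β_i, so the sketch treats them as inputs; the numeric values and the function name are placeholders for illustration.

```python
def combined_similarity(partial_sims, betas):
    """Evaluate Sim = sum_{i=1..4} beta_i * prod_{j=1..i} Sim_j.

    partial_sims -- the partial similarities Sim_1..Sim_4, each in [0, 1]
    betas        -- the weights beta_1..beta_4
    """
    if len(partial_sims) != len(betas):
        raise ValueError("need one weight per partial similarity")
    total = 0.0
    running_product = 1.0
    for beta, sim in zip(betas, partial_sims):
        running_product *= sim           # prod_{j <= i} Sim_j
        total += beta * running_product  # add the i-th term of the sum
    return total


# Placeholder weights (summing to 1) and partial similarities.
BETAS = (0.5, 0.2, 0.17, 0.13)
identical = combined_similarity((1.0, 1.0, 1.0, 1.0), BETAS)
half_first = combined_similarity((0.5, 1.0, 1.0, 1.0), BETAS)
```

Because every term contains Sim_1 as a factor, a low first partial similarity scales down the whole result, while later partial similarities affect only the later terms.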

The weight of a feature word is calculated with the following formula:

where TFIDF is the weight of the feature word in the feature space before weight adjustment; TFn is the word frequency of the n-th domain word that occurs in the text and whose similarity to the feature word is greater than γ; m is the in-domain document frequency of the domain word occurring in the text; k is its out-of-domain document frequency; N is the total number of documents; and Sim(S₁, S₂) is the similarity between the domain word and the feature word.
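The selection of the domain words that enter the adjustment can be sketched as follows. The sketch only gathers, for one feature word, the domain words in a text whose similarity exceeds γ, together with their word frequencies TFn; it deliberately does not combine them into an adjusted weight, because the combining formula is not reproduced in this text. The function name, the `similarity` callback, and the toy similarity table are hypothetical.

```python
def contributing_domain_terms(feature_word, text_term_freq, feature_space,
                              similarity, gamma):
    """Return (term, TF_n, Sim) for each domain term that occurs in the text,
    is not itself in the feature space, and has similarity > gamma with
    feature_word."""
    selected = []
    for term, tf in sorted(text_term_freq.items()):
        if term in feature_space:
            continue  # terms already in the feature space carry their own weight
        sim = similarity(term, feature_word)
        if sim > gamma:
            selected.append((term, tf, sim))
    return selected


# Toy similarity table standing in for a HowNet-based similarity function.
TOY_SIM = {("yulong xueshan", "lijiang"): 0.8, ("erhai", "lijiang"): 0.3}


def toy_similarity(a, b):
    return TOY_SIM.get((a, b), 0.0)


terms = contributing_domain_terms(
    "lijiang",
    text_term_freq={"yulong xueshan": 2, "erhai": 1, "lijiang": 4},
    feature_space={"lijiang"},
    similarity=toy_similarity,
    gamma=0.5,
)
```

With γ = 0.5, only "yulong xueshan" qualifies in this toy example; "erhai" falls below the threshold and "lijiang" is already a feature word.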

(6) Construction of the domain text classification model:

Classification algorithm SVM:

The support vector machine (SVM) algorithm is used for domain text classification. SVM is a machine learning model based on statistical learning theory, and it shows many distinctive advantages in small-sample, nonlinear, and high-dimensional pattern recognition problems. Its effectiveness on small-sample classification problems has been verified in areas such as text classification, handwriting recognition, and natural language processing.

The principle of SVM is to map the input vector X into a high-dimensional feature space through a nonlinear mapping (kernel function) chosen in advance, and to construct the optimal separating hyperplane in that space, so that the two classes of samples are separated without error and the margin between the two classes is maximized. The former guarantees minimum empirical risk; the latter minimizes the confidence interval in the generalization bound (that is, it minimizes the structural risk of the classifier). In this way, a problem that is not linearly separable in the original space becomes linearly separable in the high-dimensional space.
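In standard notation, the error-free, maximum-margin hyperplane described above is the solution of the hard-margin SVM problem below. The symbols are not from the source: φ is the nonlinear mapping, w and b define the hyperplane, and y_i ∈ {+1, -1} is the class label of training sample x_i.

```latex
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i\bigl(w^{\top}\phi(x_i) + b\bigr) \ge 1,
\qquad i = 1,\dots,n
```

The constraints express separation without error, and minimizing ‖w‖ maximizes the margin between the two classes, which equals 2/‖w‖.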

Text vector representation and classification:

Before training and classification, each document must be expressed in a form the computer can process. A text is represented as <label> <index1>:<value1> <index2>:<value2> ... . Here <label> is the target value of the training data; for classification, it is an integer that identifies a class. In the experiment, the target value of domain texts (Yunnan tourism texts) is set to +1, and the target value of non-domain texts (texts from the ten categories of the Fudan University corpus) is set to -1. <index> is an integer starting from 1, and the indices need not be consecutive; it indicates which feature item occurs in the document. <value> is a real number, set here to the weight of that feature item. With the methods above, a feature vector representing the text can be constructed for every training and test text, and training and classification are carried out through the LIBSVM interface from National Taiwan University.

Text classification experiments on the Yunnan tourism domain and non-tourism domains show that text classification with the domain text classification feature selection and weight calculation method based on domain knowledge relations is 4 percentage points more accurate than with the improved TFIDF method.
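The sparse line format described here can be produced with a small helper such as the one below. This is an illustrative sketch: the function name is hypothetical, and omitting zero-weight features and writing indices in ascending order are choices made to match the sparse format that LIBSVM reads.

```python
def to_libsvm_line(label, weights):
    """Format one document as '<label> <index>:<value> ...'.

    label   -- +1 for a domain text, -1 for a non-domain text
    weights -- dict mapping 1-based feature indices to feature weights
    """
    parts = ["%+d" % label]
    for index in sorted(weights):  # indices are written in ascending order
        if index < 1:
            raise ValueError("feature indices start at 1")
        value = weights[index]
        if value != 0:  # zero-weight features are left out (sparse format)
            parts.append("%d:%g" % (index, value))
    return " ".join(parts)


line = to_libsvm_line(+1, {17: 1.25, 3: 0.5, 40: 0.0})
# line == "+1 3:0.5 17:1.25"
```

Leaving out zero-weight features is why the indices in a line may be non-consecutive. The `%g` format keeps about six significant digits, which is enough for an illustration but could be widened if full precision matters.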

Description of drawings

Fig. 1 is a flow chart of the text classification feature selection and weight calculation method based on domain knowledge according to the present invention.

Embodiment

The method proposed above was verified experimentally in the Yunnan tourism domain. The concrete steps, shown in Fig. 1, are as follows:

Step a1: The training corpus consists of 700 Yunnan tourism documents as domain training texts and 700 documents from the Fudan University corpus (70 each from the environment, computer, transportation, education, economy, military, sports, medicine, art, and politics categories) as non-domain training texts. The test corpus consists of 200 Yunnan tourism documents as domain test texts and 200 documents from the Fudan University corpus (20 each from the same ten categories) as non-domain test texts.

Step a2: Text preprocessing includes word segmentation, stop-word removal, word frequency statistics, and document frequency statistics. The text is first segmented into Chinese words using the word segmentation system interface of the Institute of Computing Technology, Chinese Academy of Sciences; on this basis, domain words are segmented and tagged with the help of the domain dictionary. After segmentation, stop words that occur frequently in the text (for example "how") are removed. The documents are then scanned to count, for each word, its word frequency, its document frequency within the domain, and its document frequency outside the domain.

Step a3: Feature space selection and feature weight calculation are carried out with different feature space selection and feature weight calculation methods.

(1) TFIDF feature weight calculation method: Low-frequency words are first removed by document frequency (DF), and 1000 feature words are selected to form the classification feature space. Feature word weights are calculated with the TFIDF method as improved by Associate Professor Zhang Yufang of the College of Computer Science, Chongqing University: TFIDF = TF × log((m ÷ (m + k)) × N), where TF is the word frequency of a feature item, m is the document frequency of the feature item within the domain, k is its document frequency outside the domain, and N is the total number of documents.

With this method, domain terms that occur infrequently but discriminate strongly in domain text classification are likely to be ignored or given very small weights during feature selection and weight calculation.

(2) Feature selection and feature weight calculation with expanded domain terms (DTFIDF):

The DTFIDF method expands all domain terms that appear in the domain dictionary directly into the classification feature space.

The feature space is thus formed by merging the feature words obtained after removing low-frequency words by document frequency (DF) with the domain terms in the domain dictionary, and feature word weights are calculated with the TFIDF method. With this method, domain terms with high class discrimination are not removed when the feature space is selected, but the dimension of the feature space increases, which causes data sparsity and may affect classification results to some extent.
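The two ways of forming the feature space (DF selection alone, and DF selection followed by expansion with the domain dictionary) can be sketched as below. The function names and toy data are hypothetical, and the sketch assumes that the document frequency used for ranking is counted over the whole training corpus, which the text does not state explicitly.

```python
def select_by_document_frequency(doc_freq, size):
    """Keep the `size` words with the highest document frequency."""
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(doc_freq, key=lambda w: (-doc_freq[w], w))
    return ranked[:size]


def expand_with_domain_terms(features, domain_dictionary):
    """Append every domain term that is not already a feature (DTFIDF space)."""
    seen = set(features)
    expanded = list(features)
    for term in sorted(domain_dictionary):
        if term not in seen:
            expanded.append(term)
            seen.add(term)
    return expanded


# Toy data; the experiment keeps 1000 words rather than 3.
doc_freq = {"travel": 50, "ticket": 30, "lijiang": 20, "rare": 1}
base_space = select_by_document_frequency(doc_freq, size=3)
dtfidf_space = expand_with_domain_terms(base_space, {"lijiang", "yulong xueshan"})
```

The expanded space keeps every DF-selected word and adds only the dictionary terms that were missing, so its dimension grows by the number of unselected domain terms. This growth is the source of the sparsity mentioned above.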

(3) Feature selection and feature weight calculation using domain knowledge (WTFIDF):

After low-frequency words are removed by document frequency (DF) to obtain the feature space, the weights of the feature words are adjusted according to the correlation between domain terms and feature words. Adjusting the feature word weights within a feature space of limited size improves the text classification results.

In this method, feature word weights are adjusted by using HowNet to calculate the similarity between feature words and domain terms. HowNet is a general common-sense knowledge resource; it describes the concepts represented by Chinese and English words and reveals the relations between concepts and between the attributes of concepts. Following the rules of KDML, the concept description language of HowNet, 2012 concepts in the Yunnan tourism domain were described precisely. For example, the concepts "Yulong Xueshan" and "Lijiang" are described as follows:

NO.=141008
W_C=Yulong Xueshan
G_C=N
E_C=is very beautiful
W_E=Yulongxueshan
G_E=N
E_E=～is a beautiful place

NO.=141001
W_C=Lijiang
G_C=N
E_C=～very beautiful
W_E=Lijiang
G_E=N
E_E=～is beautiful place
DEF=PLACE|place, PROPERNAME|proper name, CITY|city, (YUNNAN|Yunnan);

By " knowing net " conceptual description method, contact set up in field vocabulary in " knowing net ".To not have selected low frequency field term as the feature speech, the contribution of text classification is embodied in feature space these field terms that neutralize to be had on the weight of feature speech of correlativity.As waiting these not have selected field term, the contribution of text classification is embodied in feature speech of " Lijing " or the like these process weights adjustment as the feature speech with " Yulong Xueshan ".The weight method of adjustment has adopted the Chinese Academy of Sciences to calculate professor Liu Qun of institute and has waited the lexical semantic similarity calculating method based on " knowing net " that proposes in " the lexical semantic similarity based on " knowing net " is calculated " that is published in " the 3rd Chinese lexical semantics symposial "

Sim(S_1, S_2) = \sum_{i=1}^{4} \beta_i \prod_{j=1}^{i} Sim_j(S_1, S_2)

Step a4: Construction of the domain text classification model.

Step a5: Experiments are run in the Yunnan tourism domain with the text classification model.

The experiment selects the feature space with the DF method, taking the 1000 words with the highest document frequency as the feature space. Feature space selection and feature weight calculation are carried out with the improved TFIDF method, the DTFIDF method, and the WTFIDF method respectively. A two-class classifier is trained in the experiment to separate domain texts from non-domain texts.

Table 1 shows the text classification results obtained with the different feature spaces and feature weight calculation methods.

As the data above show, with the TFIDF method the classification accuracy on domain texts is 90.5%. With the DTFIDF method, the accuracy on domain texts is 3 percentage points higher than with the TFIDF method, and the accuracy over all texts is 1.75 percentage points higher than with the improved TFIDF method. With the WTFIDF method, the accuracy on domain texts is 7.5 percentage points higher than with the TFIDF method, and the accuracy over all texts is 4 percentage points higher than with the improved TFIDF method. The accuracy on non-domain texts, however, does not improve noticeably. These data show that the proposed text classification feature selection and weight calculation method using domain knowledge greatly improves the accuracy of domain text classification.
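The reported figures can be converted into document counts as a consistency check. The sketch below assumes that the gains are percentage points, that domain accuracy is measured on the 200 domain test documents, and that overall accuracy is measured on all 400 test documents; these readings are inferred from the corpus description, not stated with the results.

```python
DOMAIN_TEST_DOCS = 200
ALL_TEST_DOCS = 400  # 200 domain + 200 non-domain test documents


def as_documents(percentage_points, total):
    """Convert a percentage of `total` documents into a document count."""
    return round(percentage_points / 100 * total)


baseline_domain_correct = as_documents(90.5, DOMAIN_TEST_DOCS)

gains = {}
for method, domain_gain, overall_gain in (("DTFIDF", 3.0, 1.75),
                                          ("WTFIDF", 7.5, 4.0)):
    extra_domain = as_documents(domain_gain, DOMAIN_TEST_DOCS)
    extra_all = as_documents(overall_gain, ALL_TEST_DOCS)
    gains[method] = {
        "domain": extra_domain,                  # extra correct domain texts
        "all": extra_all,                        # extra correct texts overall
        "non_domain": extra_all - extra_domain,  # remainder is non-domain
    }
```

Under these assumptions the baseline classifies 181 of the 200 domain texts correctly, DTFIDF adds 6 domain texts and 7 texts overall, and WTFIDF adds 15 domain texts and 16 texts overall. In both cases only one additional non-domain text is classified correctly, which agrees with the statement that non-domain accuracy does not improve noticeably.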

The experiments and the analysis of the example data above show the following. When feature words are selected with the TFIDF method alone, some low-frequency feature words of the tourism domain are not selected; once texts containing these domain words are represented as vectors, some dimensions with strong class discrimination are ignored, and the classification results are unsatisfactory. With the DTFIDF method, the discriminative dimensions of texts containing domain words are represented, and classification improves. After the domain words are introduced, however, the feature space dimension grows, which causes data sparsity and limits classification performance to some degree. With the WTFIDF method, the feature space dimension stays fixed, and the contribution of domain words that do not appear in the feature space is reflected in the weights of the feature words correlated with them, so classification accuracy improves. This shows that the text classification feature selection and weight calculation method based on domain knowledge is practical for classifying domain and non-domain texts.

Claims

1. A text classification feature selection and weight calculation method based on domain knowledge, characterized in that it is carried out in the following steps:

(1) collecting domain texts and non-domain texts as the training corpus and the test corpus;

(2) preprocessing the texts: word segmentation, stop-word removal, word frequency statistics, and document frequency statistics; the text is first segmented into Chinese words using the word segmentation system interface of the Institute of Computing Technology, Chinese Academy of Sciences; on this basis, domain words are segmented and tagged with the help of the domain dictionary; after segmentation, stop words that occur frequently in the text (for example "how") are removed; the documents are then scanned to count, for each word, its word frequency, its document frequency within the domain, and its document frequency outside the domain;

(3) removing words whose DF value is below a certain threshold to obtain the classification feature space, and calculating feature weights with the TFIDF method; after text preprocessing, low-frequency words are first removed by document frequency and 1000 feature words are selected to form the classification feature space; feature word weights are calculated with the improved method TFIDF = TF × log((m ÷ (m + k)) × N), where TF is the word frequency of a feature item, m is the document frequency of the feature item within the domain, k is its document frequency outside the domain, and N is the total number of documents;

(4) selecting the feature space on the basis of step (3) and expanding domain terms into it to form the classification feature space, and calculating feature weights with the improved TFIDF method; that is, all domain terms that appear in the domain dictionary are expanded directly into the classification feature space;

(5) selecting the classification feature space on the basis of step (3), and calculating and adjusting feature weights with the improved TFIDF method combined with domain knowledge relations; that is, after the feature space is obtained with the DF method, the feature word weights are adjusted according to the correlation between domain terms and feature words in HowNet, and adjusting the feature word weights within a feature space of limited size improves the text classification results;

(6) using the different feature space selection and feature weight calculation methods with the SVM machine learning algorithm to train a text classifier, build the domain text classification model, and verify it experimentally by classifying domain texts.

2. The text classification feature selection and weight calculation method based on domain knowledge according to claim 1, characterized in that the use of the improved TFIDF method combined with domain knowledge relations in step (5) calculates the similarity between the feature words in the feature space and the domain terms that occur in the text but not in the feature space, and adjusts the weights of the feature words whose similarity exceeds a certain threshold.

3. The text classification feature selection and weight calculation method based on domain knowledge according to claim 1, characterized in that the adjustment of feature word weights in step (5) according to the correlation between domain terms and feature words in HowNet uses the lexical semantic similarity calculation method:

Sim(S_1, S_2) = \sum_{i=1}^{4} \beta_i \prod_{j=1}^{i} Sim_j(S_1, S_2),

where TFIDF is the weight of the feature word in the feature space before weight adjustment; TFn is the word frequency of the n-th domain term that occurs in the text and whose similarity to the feature word is greater than γ; m is the in-domain document frequency of the domain term occurring in the text; k is its out-of-domain document frequency; N is the total number of documents; and Sim(S₁, S₂) is the similarity between the domain term and the feature word.

4. The text classification feature selection and weight calculation method based on domain knowledge according to claim 1, characterized in that, in training the text classifier in step (6), a domain text classification model is built for each of the three feature space selection and feature weight calculation methods described in steps (3), (4), and (5).