CN102662976A - Text feature weighting method based on supervision - Google Patents

Text feature weighting method based on supervision

Info

Publication number
CN102662976A
CN102662976A
Authority
CN
China
Prior art keywords
documents
text
term
positive class
text feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012100638795A
Other languages
Chinese (zh)
Inventor
刘端阳
陆洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN2012100638795A priority Critical patent/CN102662976A/en
Publication of CN102662976A publication Critical patent/CN102662976A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a supervised text feature weighting method. Text feature extraction yields four document counts, a, b, c, and d: a is the number of documents that contain term t and belong to the positive class; b is the number of documents that do not contain term t but belong to the positive class; c is the number of documents that contain term t but do not belong to the positive class; and d is the number of documents that neither contain term t nor belong to the positive class. The counts a, b, c, and d sum to N, the total number of documents. A text feature weighting formula (I) is provided. The proposed supervised weighting method tf.ridf jointly considers how a term is distributed over the whole document collection and within each class, and effectively improves classification performance.

Description

A supervised text feature weighting method
Technical field
The present invention relates to text classification methods, and in particular to a text feature weighting method.
Background technology
With the spread of Internet applications, networks now store massive amounts of text, and people urgently need to mine useful information from it. Text mining is the process of extracting previously unknown, understandable, and ultimately usable information or knowledge from large volumes of text data. Text classification is the process of automatically assigning a text to a category under a given category system according to its content. An important component of text classification is text representation, and the most common representation is the vector space model (VSM). VSM is a statistical model that treats a document as a vector of features, where each feature can be a word or a phrase and carries a weight; the text classification problem is thereby converted into a vector matching problem in the vector space. Once texts are represented as feature vectors, common classification algorithms such as support vector machines and k-nearest neighbors can be applied.
A major issue in text representation is selecting the features that best represent a text and removing those with no representative value. The typical text classification pipeline is: word segmentation, stop-word removal, indexing, statistics, feature extraction, feature weighting, classifier training, and evaluation.
Feature weighting is an important step in this pipeline. Statistics-based feature weighting methods fall into two broad classes: supervised term weighting methods and unsupervised term weighting methods. The most widely used method at present is term frequency and inverse document frequency (tf.idf), which is an unsupervised method.
Many improvements to feature weighting, both supervised and unsupervised, have been proposed at home and abroad. Xue Xiaobing took the density of a word's distribution within a text, the text length, and the position where the word first appears as the main considerations for feature weighting. Unsupervised methods, however, ignore how differently a feature is distributed across the training documents of each category, and thus how it affects classification; supervised methods take exactly this into account. Li Kaiqi first pointed out the deficiencies of the tf.idf method and, building on a supervised basis, improved it by combining tf.idf with information gain. Man Lan then considered the proportion of a word between the positive and negative classes, arguing that the higher the positive-to-negative ratio, the better the word represents the positive class and the higher its weight should be.
Summary of the invention
In order to overcome the relatively poor classification performance of existing text feature weighting methods, the present invention provides a supervised text feature weighting method that improves classification performance.
The technical means adopted to solve the above technical problem are as follows:
A supervised text feature weighting method: after text feature extraction, four document counts a, b, c, and d are obtained, defined as follows:
a is the number of documents that contain term t and belong to the positive class;
b is the number of documents that do not contain term t but belong to the positive class;
c is the number of documents that contain term t but do not belong to the positive class;
d is the number of documents that neither contain term t nor belong to the positive class.
The counts a, b, c, and d sum to N, the total number of documents.
The text feature weighting formula is as follows:
ridf = \log_2\!\left(2 + \left(\frac{a}{\max(1,\,c)}\right)^{K}\right) \quad (1)
where K is defined as:
K = \begin{cases}\left(\frac{N}{a+c}\right)^{\frac{a-c}{|a-c|}} & (a \neq c)\\[1ex] \frac{N}{a+c} & (a = c)\end{cases} \quad (2)
Here ridf is the relevance-based inverse document frequency weight, which is combined with term frequency to perform feature weighting; the formula expresses how important a term is both over the whole collection and between classes.
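For concreteness, the following is a minimal Python sketch of equations (1) and (2). The function names and the document representation (each document as a set of terms, with a boolean class label) are our own illustrative choices, not part of the patent.

import math

def count_abcd(docs, labels, term):
    # docs: list of sets of terms (one set per document)
    # labels: parallel list of booleans, True = positive class
    # returns (a, b, c, d) as defined above; a + b + c + d == len(docs)
    a = b = c = d = 0
    for doc, positive in zip(docs, labels):
        if term in doc:
            if positive:
                a += 1
            else:
                c += 1
        else:
            if positive:
                b += 1
            else:
                d += 1
    return a, b, c, d

def ridf(a, c, n):
    # equations (1) and (2); assumes a + c > 0, i.e. the term occurs somewhere
    if a != c:
        # the exponent (a - c)/|a - c| is +1 when a > c and -1 when a < c
        k = (n / (a + c)) ** ((a - c) / abs(a - c))
    else:
        # when a == c the base below is 1, so the size of K is irrelevant
        k = n / (a + c)
    return math.log2(2 + (a / max(1, c)) ** k)

def tf_ridf(tf_td, a, c, n):
    # the combined tf.ridf weight of term t in document d
    return tf_td * ridf(a, c, n)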
The technical conception of the present invention is as follows. The traditional feature weighting method is:
tf \cdot idf = tf(t,\,d) \times \log_2\!\left(\frac{N}{n + 0.01}\right)
where tf(t, d) is the frequency of feature term t in document d: the higher the tf value, the better the word represents the document, so tf captures the relation between a word and a document. In the idf factor, n is the number of documents containing feature term t and N is the total number of documents. The idea behind idf is that the lower the fraction of all documents in which a word occurs, the better that word represents this kind of document, i.e., the stronger its ability to discriminate between classes; idf thus captures the relation of a word across documents. Although the tf.idf formula looks simple, in some settings it has better expressive ability than more complex text classification methods (such as semantics-based classification).
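As a reference point, a one-line sketch of this traditional weight (the variable names are ours):

import math

def tf_idf(tf_td, n_t, n):
    # tf(t, d) times log2(N / (n + 0.01)), where n_t documents contain t
    return tf_td * math.log2(n / (n_t + 0.01))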
However, the tf.idf formula also has shortcomings. Consider the six document distributions T1 to T6 shown in Fig. 1, where a, b, c, and d are the document counts defined above, summing to N.
In the three situations T1 to T3, idf assigns identical weights; clearly, however, T1 should contribute the most to classification.
To overcome this deficiency of traditional unsupervised feature weighting, scholars at home and abroad have proposed many new supervised weighting methods, such as the rf method proposed by Man Lan:
rf = \log_2\!\left(2 + \frac{a}{c}\right)
The rf formula substitutes for idf. Its basic idea is: the larger the ratio of a to c, the better the word distinguishes positive examples from counter-examples, so it should receive a higher weight; and when a = c, no matter how large a and c are, the word's ability to distinguish the positive and negative classes is always the same.
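A minimal sketch of rf follows; the max(1, c) guard against c = 0 is our addition so the sketch runs on any counts, and is a commonly used variant of the formula.

import math

def rf(a, c):
    # grows with the ratio a/c; equal nonzero a and c always give log2(3)
    return math.log2(2 + a / max(1, c))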
However, the rf formula abandons the original idea of idf, namely that the lower the fraction of all documents in which a word occurs, the more valuable that word is for representing its class. Compare situations T5 and T6: the word in T6 is clearly too widespread, so its representative ability is weaker than in T5.
From the above analysis, the design goals of the present invention are determined as: (1) consider the proportion of a word in each class, i.e., use a supervised method and make full use of the class-labelled training data; (2) retain the advantage of traditional feature weighting by also considering how the word is distributed over the whole document collection.
Based on these two goals, the design is as follows:
ridf = \log_2\!\left(2 + \left(\frac{a}{\max(1,\,c)}\right)^{K}\right)
where K is:
K = \begin{cases}\left(\frac{N}{a+c}\right)^{\frac{a-c}{|a-c|}} & (a \neq c)\\[1ex] \frac{N}{a+c} & (a = c)\end{cases}
Substituting ridf for idf has the following advantages:
1. When a is not equal to c, ridf takes into account both the factor N/(a+c) and the factor a/max(1, c); that is, it considers both the relation of the word within the classes and the relation of the word over the whole document collection.
2. When a = c, the base a/max(1, c) equals 1 (for nonzero counts), so the size of K cannot influence the final ridf value, which is consistent with the idea above.
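To illustrate both branches of K, here is a small worked example using the ridf sketch above; the counts are hypothetical and chosen only for illustration, not taken from Fig. 1.

# N = 100 documents in every case below; counts are purely illustrative
print(ridf(a=8, c=2, n=100))   # a > c: K = (100/10)**1  = 10,  ridf ≈ 20.00
print(ridf(a=2, c=8, n=100))   # a < c: K = (100/10)**-1 = 0.1, ridf ≈ 1.52
print(ridf(a=5, c=5, n=100))   # a = c: base is 1, ridf = log2(3) ≈ 1.58

Note how sharply the weight grows when a term is concentrated in the positive class: the exponent K amplifies the ratio a/max(1, c) by the overall rarity factor N/(a+c).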
Description of drawings
Fig. 1 is six kinds of document distribution plans.
Embodiment
The present invention is further described below with reference to the accompanying drawing.
Referring to Fig. 1, a supervised text feature weighting method: after text feature extraction, four document counts a, b, c, and d are obtained, defined as follows:
a is the number of documents that contain term t and belong to the positive class;
b is the number of documents that do not contain term t but belong to the positive class;
c is the number of documents that contain term t but do not belong to the positive class;
d is the number of documents that neither contain term t nor belong to the positive class.
The counts a, b, c, and d sum to N, the total number of documents.
The text feature weighting formula is as follows:
ridf = \log_2\!\left(2 + \left(\frac{a}{\max(1,\,c)}\right)^{K}\right) \quad (1)
where K is defined as:
K = \begin{cases}\left(\frac{N}{a+c}\right)^{\frac{a-c}{|a-c|}} & (a \neq c)\\[1ex] \frac{N}{a+c} & (a = c)\end{cases} \quad (2)
Here ridf is the relevance-based inverse document frequency weight, which is combined with term frequency to perform feature weighting; the formula expresses how important a term is both over the whole collection and between classes.
In the present embodiment, the corpus is the 20-category Chinese corpus provided by the Natural Language Processing Group of the International Database Center, Department of Computer Information and Technology, Fudan University; word segmentation uses ICTCLAS, the Chinese lexical analysis system developed by the Institute of Computing Technology, Chinese Academy of Sciences; and the classifier is Libsvm, developed by Professor Chih-Jen Lin's group at National Taiwan University.
1. First, perform Chinese word segmentation and part-of-speech tagging on the raw corpus.
2. Perform feature extraction: remove low-frequency words; remove unneeded parts of speech, keeping nouns, verbs, and adjectives; compute a weight for each word with the feature extraction formula and, given a preset threshold, delete the feature words whose weight falls below it.
3. Apply the weighting formula designed by the present invention for feature weighting, and set up comparison data sets processed identically except that the weighting formula is tf.idf or tf.rf.
4. Train the data with Libsvm using a linear kernel, and compare the results using precision, recall, and F-measure (see the sketch after this list).
5. The experimental data cover 10 groups of experiments, with the number of text features ranging from 1000 to 5000. The F-measure of tf.ridf is, respectively, 0.79, 0.843, 0.876, 0.80, 0.875, 0.91, 0.917, 0.947, 0.978, and 0.978; by contrast, the F-measure of tf.rf is 0.726, 0.746, 0.827, 0.77, 0.827, 0.854, 0.912, 0.933, 0.933, and 0.944. The classification performance of the present invention is thus better than that of tf.rf.
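For reference, the F-measure used in steps 4 and 5 is the standard weighted combination of precision and recall; a minimal sketch (the default beta = 1 gives the usual F1):

def f_measure(precision, recall, beta=1.0):
    # weighted harmonic combination of precision ratio and recall ratio
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_measure(0.9, 0.8))  # ≈ 0.847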

Claims (1)

1. A supervised text feature weighting method, characterized in that: after text feature extraction, four document counts a, b, c, and d are obtained, defined as follows:
a is the number of documents that contain term t and belong to the positive class;
b is the number of documents that do not contain term t but belong to the positive class;
c is the number of documents that contain term t but do not belong to the positive class;
d is the number of documents that neither contain term t nor belong to the positive class;
the counts a, b, c, and d sum to N, the total number of documents;
the text feature weighting formula is as follows:
ridf = \log_2\!\left(2 + \left(\frac{a}{\max(1,\,c)}\right)^{K}\right) \quad (1)
where K is defined as:
K = \begin{cases}\left(\frac{N}{a+c}\right)^{\frac{a-c}{|a-c|}} & (a \neq c)\\[1ex] \frac{N}{a+c} & (a = c)\end{cases} \quad (2)
where ridf is the relevance-based inverse document frequency weight, combined with term frequency to perform feature weighting; the formula expresses the importance of a term both over the whole text collection and between the texts of each class.
CN2012100638795A 2012-03-12 2012-03-12 Text feature weighting method based on supervision Pending CN102662976A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012100638795A CN102662976A (en) 2012-03-12 2012-03-12 Text feature weighting method based on supervision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2012100638795A CN102662976A (en) 2012-03-12 2012-03-12 Text feature weighting method based on supervision

Publications (1)

Publication Number Publication Date
CN102662976A true CN102662976A (en) 2012-09-12

Family

ID=46772467

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012100638795A Pending CN102662976A (en) 2012-03-12 2012-03-12 Text feature weighting method based on supervision

Country Status (1)

Country Link
CN (1) CN102662976A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103207893A (en) * 2013-03-13 2013-07-17 北京工业大学 Classification method of two types of texts on basis of vector group mapping
CN103207893B * 2013-03-13 2016-05-25 北京工业大学 Classification method for two classes of texts based on vector group mapping
CN105389307A (en) * 2015-12-02 2016-03-09 上海智臻智能网络科技股份有限公司 Statement intention category identification method and apparatus
CN105512104A (en) * 2015-12-02 2016-04-20 上海智臻智能网络科技股份有限公司 Dictionary dimension reducing method and device and information classifying method and device
CN106682411A (en) * 2016-12-22 2017-05-17 浙江大学 Method for converting physical examination diagnostic data into disease label
CN106682411B * 2016-12-22 2019-04-16 浙江大学 Method for converting physical examination diagnostic data into disease labels

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20120912