CN103106275A

CN103106275A - Text classification character screening method based on character distribution information

Info

Publication number: CN103106275A
Application number: CN2013100505834A
Authority: CN
Inventors: 李思男; 李战怀; 李宁
Original assignee: Northwestern Polytechnical University
Current assignee: Northwestern Polytechnical University
Priority date: 2013-02-08
Filing date: 2013-02-08
Publication date: 2013-05-15
Anticipated expiration: 2033-02-08
Also published as: CN103106275B

Abstract

The invention discloses a text classification character screening method based on character distribution information. The method is used for resolving the technical problems that an existing text classification character screening method is poor in accuracy. The technical scheme includes conducting preprocessing for each document of a document set firstly; enabling the whole document collection to be presented as a vector space modal (VSM); constructing a character dictionary; counting document frequency DF (t, Cj), comprising the character t, of each classification Ci; calculating a normalized tf*idf value of each classification Ci, and then calculating the dispersion D Intra and average inter-classification dispersion D Inter Avg of the character in each classification Ci; calculating the weight wi (t) of each character tk in each classification Ci of a text character space; and enabling all the characters to be arranged in a descending order mode according to the weight of all the characters in the whole document set, and preferentially keeping the characters having front orders during character screening. On the basis of a character distribution system, the method enables the character distribution system to be applied to the character screen process, and improves text classification efficiency and accuracy.

Description

Text classification Feature Selection method based on the feature distributed intelligence

Technical field

The present invention relates to a kind of text classification Feature Selection method, particularly relate to a kind of text classification Feature Selection method based on the feature distributed intelligence.

Background technology

Along with the development of communication technology and network, on the internet, the generations such as a large amount of electronic documents such as news, mail, microblogging there is every day.Text automatic classification is used in a lot of fields widely as a kind of method of efficiently large volume document being carried out Classification Management.

Along with the explosive increase of quantity of information, the subject matter that automatic Text Categorization faces is how to process the higher-dimension text vector feature space that a large amount of text datas produce.Too high text vector feature space will produce two adverse effects to file classification method: the method for (1) a lot of comparative maturities can't be optimized in higher dimensional space, and then can't be applied in text classification.(2) because sorter is to train by training set to get, dimension too high text vector space will inevitably cause the over-fitting phenomenon to occur ^[1]In the text vector space, most of dimension and text classification are also uncorrelated, and the more noise data that affects the text classification precision even adulterates ^[2]The text feature screening according to certain Feature Selection algorithm, is selected the more representative text feature of a part and is consisted of the feature space that new dimension is lower from original feature space, reach the purpose of dimensionality reduction.The method is the effective method that solves the too high problem of text classification Chinese version vector feature space dimension.The purpose of text feature screening is to improve the execution efficient of text classification work efficiency and algorithm.Much experiment showed, in most of the cases, initiatively feature space is approximately subtracted and can obtain very large performance boost under less nicety of grading loss ^[3]

Existing text classification Feature Selection algorithm mainly contains document frequency (DF), information gain (IG), information gain rate (GR), Chi-square Test (CHI), mutual information (MI) and Gini index etc. ^[3,4]The below to wherein several in text classification effect preferably technology briefly introduce:

Document frequency (DF): document frequency refers to comprise the number of documents of t for given feature t in collection of document.The prediction that is rare feature for classification of its basic assumption is not have helpfully, perhaps can not affect overall performance.The advantage of document frequency: because it is realized simply, calculated amount is little, so feature selecting speed is very fast, and actual effect is also good; Therefore shortcoming: rare feature may not be rare in a certain class text, may comprise important classification information yet, simply weeds out, and may affect the effect of classification, should not be with a large amount of rejecting feature of DF.

Information gain (IG): information gain is a kind of appraisal procedure based on entropy, a given feature t, consider and when not considering it quantity of information respectively be what, both differences are exactly the quantity of information that this feature is brought to system, namely gain ^[5]The appearance that information gain has been considered a feature whether, in unbalanced data centralization, for rare classification, experiment shows, consider the absent variable situation of feature to the contribution of judgement text categories often much smaller than considering the now interference that brings of situation of feature.

Information gain rate (GR): information gain is proved to be devious in a lot of results.Too abundant due to the more and different attribute of value for training set study causes Information Gain Method to be more prone to select this attribute, and the information gain rate has solved this shortcoming of information gain ^[6]

Chi-square Test (CHI): Chi-square Test is the method for a kind of two variable independence of check commonly used in mathematical statistics, and its most basic thought is exactly to determine the correctness of theory by observing actual value and the deviation of theoretical value ^[7,8]

During the experiment of text classification shows, during as feature selecting, the effect of Chi-square Test is a kind of of the best, but it has only added up whether occur feature t in text, but do not consider the number of times that feature t occurs in the text, therefore make it that low-frequency word is had and necessarily exaggerate effect, " the low-frequency word defective " that this namely Chi-square Test is famous.

The present invention is at the feature compartment system ^[9]The basis on, dispersion computing method between class are improved, with this system employs in the Feature Selection process.

List of references:

[1]Jieming Yang，Yuanning Liu，Xiaodong Zhu et al，A new feature selection based on comprehensive measurement both in inter-category and intra-category for text categorization，Information Processing&Management，Volume48，Issue4，2012，pp.741-754

[2]Wenqian Shang，Houkuan Huang and Haibin Zhu et al，A novel feature selection algorithm for text classification，Expert Systems with Applications，Volume33，Issuel，2007，pp.1-5

[3]Monica Rogati and Yiming Yang，High-performing feature selection for text classification.In Proceedings of the eleventh international conference on Information and knowledge management(CIKM′02).ACM，New York，NY，USA，2002，pp.659-661.

[4]Yang，Y.，Pedersen，J.O.，A Comparative Study on Feature Selection in Text Classification.In Proceedings of the14th international conference on machine learning，Nashville，USA，1997，pp.4l2-420.

[5]Forman，G.，An Extensive Empirical of Feature Selection Metrics for Text Classification.Journal of Machine Learning Research，3，2003，pp.1289-1305.

[6]Tatsunori Mori，Miwa Kikuchi and Kazufumi Yoshida，，Term Weighting Method based on Information Gain Ratio for Summarizing Documents retrieved by IR systems.Journal of Natural Language Processing，9(4)，2001，pp.3-32.

[7]Zheng，Z.，Srihari，R，Optimally Combining Positive and Negative Features for Text Classification.ICML2003Workshop on Learning from Imbalanced Data Sets，2003.

[8]Luigi Galavotti，Via Jacopo Nardi and Fabrizio Sebastiani et al，Feature Selection and Negative Evidence in Automated Text Classification.In Proceedings of the 4thEuropean Conference on Research and Advanced Technology for Digital Libraries(ECDL’00)，2000.

V.Lertnattee，T.Theeramunkong，Improving centroid-based text classification using term-distribution-based weighting and feature selection，In Proceedings of INTECH-01，2ndInternational Conference onIntelligent Technologies，Bangkok，Thailand，2001，pp.349-355.

Summary of the invention

In order to overcome the deficiency of existing text classification Feature Selection method poor accuracy, the invention provides a kind of text classification Feature Selection method based on the feature distributed intelligence.The method is improved dispersion computing method between class on the basis of feature compartment system, and the feature compartment system is applied in the Feature Selection process.The method takes full advantage of in the tf*idf information, class of text feature and distribution between class information, reflect more objectively the significance level of characteristic item in text, thereby select the characteristic item that can represent text feature, reach the Feature Selection purpose, can improve text classification efficient and accuracy rate.This method can be issued to higher classify accuracy in the situation of selecting less characteristic item, has advantages of simultaneously fast convergence rate, makes this method also can apply to skewed data set to the improvement of distribution between class.

The technical solution adopted for the present invention to solve the technical problems is: a kind of text classification Feature Selection method based on the feature distributed intelligence is characterized in comprising the following steps:

1. each piece document in document sets is carried out participle, removes stop words and get stem and process.

2. whole collection of document is expressed as vector space model.

3. extract all Feature Words, structural attitude dictionary from collection of document.

In statistics text feature space each Feature Words t at every piece of document d _jThe frequency TF of middle appearance (t, d _j), and at each class C _iThe frequency TF of middle appearance (t, C _i), add up simultaneously each class C _iThe number of files DF (t, the C that comprise Feature Words t _j).

5. the information that obtains according to step 4 is for each Feature Words t _k, at first calculate for each class C _iNormalized tf*idf value, then calculate this Feature Words at each class C _iDispersion DInterAvg between interior dispersion DIntra and average class.

6. the information that obtains according to step 4, step 5 step utilizes following formula to calculate each Feature Words t in the text feature space _kAt classification C _iIn weight w _i(t).

w _i(t)=tf*idf*DInterAvg*(1-DIntra)

With Feature Words t _kWeight summation in each classification is this Feature Words in the weight of whole document sets, i.e. Feature Words t _kThe TDFS value.

TDFS (t) = Σ_{i = 1}^{NC} w_{i} (t)

With whole Feature Words according to its weight descending sort in whole document sets, when carrying out Feature Selection, preferentially keep the forward Feature Words of ranking.

The invention has the beneficial effects as follows:, on the basis of feature compartment system dispersion computing method between class are improved due to the method, the feature compartment system is applied in the Feature Selection process.The method takes full advantage of in the tf*idf information, class of text feature and distribution between class information, reflect more objectively the significance level of characteristic item in text, thereby select the characteristic item that can represent text feature, reached the Feature Selection purpose, improved text classification efficient and accuracy rate.This method can be issued to higher classify accuracy in the situation of selecting less characteristic item, has advantages of simultaneously fast convergence rate, makes this method also can apply to skewed data set to the improvement of distribution between class.

Below in conjunction with drawings and Examples, the present invention is elaborated.

Description of drawings

Fig. 1 is the process flow diagram that the present invention is based on the text classification Feature Selection method of feature distributed intelligence.

Embodiment

The inventive method concrete steps are as follows:

1. the concept relevant with the present invention.

Tf*idf(Term frequency inverse document frequency): be a kind of statistical method, in order to assess a words for the significance level of a copy of it file in a file set or corpus.The number of times that the importance of words occurs hereof along with it increase that is directly proportional, but the decline that can be inversely proportional to along with the frequency that it occurs in corpus simultaneously.

Dispersion (Intra-class distribution) in class: refer to the distribution situation of a Feature Words in a certain class document, if be uniformly distributed in such each document, this Feature Words in such document in class dispersion lower; Otherwise, be distributed in a few pieces of documents if concentrate, all occur in all the other documents, this Feature Words in such document in class dispersion higher.

Dispersion between class (Inter-class distribution): refer to the distribution situation of a Feature Words in whole document sets is of all categories, if be uniformly distributed in all kinds of documents, between the class of this Feature Words in this whole document sets, dispersion is lower; Otherwise, occur if only concentrate in some or several classifications, and do not occur in other classifications, between the class of this Feature Words in whole document sets, dispersion is higher.

Dispersion (Average inter-class distribution) between average class: the present invention of this Objective Concept proposes, and is an improvement to dispersion concept between class.Because adopting the total word frequency of Feature Words in each class document, dispersion between class weighs its distribution situation in all kinds of, if different classes of middle number of documents difference is very big, be that data set exists deflection, use the method that the Feature Words in the less classification of number of documents is flooded by the larger classification of number of documents.Dispersion between the average class after improvement adopts the Feature Words word frequency that this Feature Words appears in average every piece of document in each classification to weigh its distribution situation in all kinds of, is not subjected to the impact of data skew, can accurately reflect the distribution between class situation of Feature Words.

2. the properties relevant with the present invention.

The number of times that the some Feature Words of character 1. occur in a certain class document is more, gets over the classification of energy specification documents, and weight is larger.

The original document set sample of the garbages such as table 1 removal form

Numbering	Original document	Classification
			1	Yao has great talent in basketball games.	PE
2	We are playing a game about basketball in the playground.	PE
			3	We are enj oying the music at the concert.	MUSIC
4	Music is an art and everybody may enjoy it.	MUSIC
			5	Playing basketball is my favorite sport.	PE
6	Listening to the music is my hobby.	MUSIC

For example: in the document sets shown in table 1, Feature Words basketball has occurred 3 times in PE class document, and weight is larger, and talent has only occurred in such once, and weight is less.

In the class of character 2. some Feature Words, dispersion is lower, gets over the classification of energy specification documents, and weight is larger.

Table 2 is through the pretreated collection of document of text

Numbering	The training document	Classification
			1	yao ha great talent basketbal game	PE
2	we plai game about basketbal playground	PE
			3	we enjoi music concert	MUSIC
4	music art everybodi mai enjoi	MUSIC
			5	plai basketbal my favorit sport	PE
6	listen music my hobbi	MUSIC

[0050]The descending sort of the whole Feature Words TDFS of table 3 sample collection of document value

Feature Words	The TDFS value	Feature Words	The TDFS value
				music	0.554	hobbi	0.158
basketbal	0.489	listen	0.158
				enjoi	0.489	great	0.140
game	0.394	ha	0.140
				plai	0.394	talent	0.140
concert	0.158	yao	0.140
				art	0.158	playground	0.140
everybodi	0.158	favorit	0.140
				mai	0.158	sport	0.140

For example: in the document sets shown in table 1, Feature Words basketball evenly distributes in PE class document, and every piece of document has a basketball, in class, dispersion is low, illustrate that this word extensively evenly is present in PE class document, the classification information of specification documents well, weight is larger.Feature Words talent only occurs in one piece of document of PE, and all without talent, so in the class of this Feature Words, dispersion is higher, the document that comprises this word is not that the possibility of PE classification is larger, thereby the weight that calculates is also lower in all the other two pieces.

Between the class of character 3. some Feature Words, dispersion is higher, gets over the classification of energy specification documents, and weight is larger.

For example: in the document sets shown in table 1, Feature Words basketball only occurs in PE class document, never occurs in MUSIC class document, and between this feature part of speech, dispersion is very high, the classification information of specification documents well, and weight is larger.And Feature Words my, each occurs once in PE class and MUSIC class, belongs to even distribution between class, so between class, dispersion is very low, can not represent preferably the feature of classification, thereby weight is also lower.

Between the average class of character 4. some Feature Words, dispersion is higher, gets over the classification of energy specification documents, and weight is larger.

Between the average class of table 4, dispersion character gives an example 1

Between the average class of table 5, dispersion character gives an example 2

For example: provide two examples in table 4, table 5, suppose that the distribution situation of a certain Feature Words t in document sets is as shown in table 4, due to the total word frequency of t in A, B two classes identical (being all 2), so between the class of this Feature Words, dispersion is 0, but obviously t still has necessarily representational for category-B, for no other reason than that the number of documents gap of A, B two classes is excessive, cause the less category-B Feature Words t of number of files to be flooded by the more category-A of number of files.Example shown in table 5 calculates equally that between the class of Feature Words t, dispersion is that in 0(A, B two classes, the total word frequency of t is all 1000), but t is apparent for the significance level of distinguishing A, B two classes.This shows in the situation that the data set deflection, use when between class, dispersion is weighed Feature Words, in the less classification of number of documents, representative Feature Words can't be highlighted.If use dispersion measurement between average class, Feature Words t is very inhomogeneous in the distribution of A, B two classes, and between average class, dispersion is higher, therefore can give rational weight for Feature Words t, eliminate the adverse effect that data skew brings, also have simultaneously the original function of dispersion between class.

For a given document sets D, the present invention is as follows to the detailed process that attribute in document sets screens:

1. parse documents is concentrated all documents, rejects useless structural identification etc., extracts the main information such as exercise question, content in document.

May there be some structural identification information (seeing the following form) in document, all occur in the same way in every piece of document, at first these signs be reached with the content of text classification is incoherent to fall as information filterings such as times.

2. content of text is carried out pre-service, extract characteristic item (term) and consist of the text feature space.

For documents all in document sets, after the parsing through the 1st step, can obtain the content information of each document, see Table 1.Each piece document in document sets is carried out pre-service: participle (tokenizing), remove stop words (stop words removal), get stem (stemming) and process after, can obtain a set that is consisted of by some words, each word in set is referred to as text feature item (term), all characteristic items have just consisted of text feature space (term space), and the document in table 1 is as shown in table 2 through obtaining result after pre-service.

For documents all in document sets, after the processing by the 2nd step, collect all Feature Words constitutive characteristic dictionaries that occur in document sets, as the basis of Feature Selection.

5. the statistical information that obtains according to the 4th step is for each Feature Words calculates dispersion between dispersion in normalized tf*idf value, class, average class.

(1) tf*idf: computing formula is as follows:

n_{t} = Σ_{j = 1}^{NC} DF (t, C_{j})

In formula, n represents C _iThe whole Feature Words numbers that occur in class.L is a constant, is got by experiment test, usually gets 0.1 or 0.01.The calculation deviation that normalized tf*idf value can avoid document overlength to bring.

(2) dispersion in class: computing formula is as follows:

DIntra = \frac{\sqrt{Σ_{j = 1}^{| C_{i} |} {[TF (t, d_{j}) - \frac{TF (t, C_{i})}{| C_{i} |}]}^{2} / (| C_{i} | - 1)}}{TF (t, C_{i}) / \sqrt{| C_{i} | - 1}}

(3) dispersion between average class: computing formula is as follows:

DInterAvg = \frac{\sqrt{Σ_{i = 1}^{NC} [TF (t, C_{i}) / | C_{i} | - Σ_{j = 1}^{NC} TF (t, C_{j}) / {ND]}^{2} / (NC - 1)}}{Σ_{j = 1}^{NC} TF (t, C_{j}) / ND}

6. use the result in the 5th step, calculate the weight of characteristic item t in each class.Computing formula is as follows:

wi(t)=tf*idf*DInterAvg*(1-DIntra)

7. the summation of the weight in of all categories with characteristic item t obtains the weight of this characteristic item in whole document sets, namely

The TDFS value.Computing formula is as follows:

TDFS (t) = Σ_{i = 1}^{NC} w_{i} (t)

8. in calculating document sets the TDFS value of all characteristic items is according to descending sort, and the more forward value of characteristic item in document sets of ranking is higher, and role is larger in document classification.

Claims

1. text classification Feature Selection method based on the feature distributed intelligence is characterized in that comprising the following steps:

(1). each piece document in document sets is carried out participle, removes stop words and get stem and process;

(2). whole collection of document is expressed as vector space model;

(3). extract all Feature Words, structural attitude dictionary from collection of document;

(4). in statistics text feature space, each Feature Words t is at every piece of document d _jThe frequency TF of middle appearance (t, d _j), and at each class C _iThe frequency TF of middle appearance (t, C _i), add up simultaneously each class C _iThe number of files DF (t, the C that comprise Feature Words t _j);

(5). the information that obtains according to step (4), for each Feature Words t _k, at first calculate for each class C _iNormalized tf*idf value, then calculate this Feature Words at each class C _iDispersion DInterAvg between interior dispersion DIntra and average class;

(6). according to the information that step (4), step (5) step obtains, utilize following formula to calculate each Feature Words t in the text feature space _kAt classification C _iIn weight w _i(t);

w _i(t)=tf*idf*DInterAvg*(1-DIntra)

With Feature Words t _kWeight summation in each classification is this Feature Words in the weight of whole document sets, i.e. Feature Words t _kThe TDFS value;

TDFS (t) = Σ_{i = 1}^{NC} w_{i} (t)

(7). whole Feature Words according to its weight descending sort in whole document sets, when carrying out Feature Selection, are preferentially kept the forward Feature Words of ranking.