CN103106275B

CN103106275B - The text classification Feature Selection method of feature based distributed intelligence

Info

Publication number: CN103106275B
Application number: CN201310050583.4A
Authority: CN
Inventors: 李思男; 李战怀; 李宁
Original assignee: Northwestern Polytechnical University
Current assignee: Northwestern Polytechnical University
Priority date: 2013-02-08
Filing date: 2013-02-08
Publication date: 2016-02-10
Anticipated expiration: 2033-02-08
Also published as: CN103106275A

Abstract

The invention discloses a kind of text classification Feature Selection method of feature based distributed intelligence, for solving the technical matters of existing text classification Feature Selection method poor accuracy.Technical scheme first carries out pre-service to each section of document in document sets; Again whole collection of document is expressed as vector space model; Structural attitude dictionary; Add up each class C _icomprise number of files DF (t, the C of Feature Words t _j); Calculate for each class C _inormalized tf*idf value, then calculate this Feature Words at each class C _iinterior dispersion DIntra and average inter _ class relationship DInterAvg; Calculate each Feature Words t in text feature space _kat classification C _iin weight w _i(t); By whole Feature Words according to its weight descending sort in whole document sets, when carrying out Feature Selection, preferentially retain the forward Feature Words of ranking.Feature compartment system, on the basis of feature compartment system, applies in Feature Selection process by the method, improves text classification efficiency and accuracy rate.

Description

The text classification Feature Selection method of feature based distributed intelligence

Technical field

The present invention relates to a kind of text classification Feature Selection method, particularly relate to a kind of text classification Feature Selection method of feature based distributed intelligence.

Background technology

Along with the development of communication technology and network, on the internet, a large amount of electronic documents is had every day as generations such as news, mail, microbloggings.Text automatic classification, as a kind of method of efficiently large volume document being carried out to Classification Management, is widely used in a lot of field.

Along with the explosive increase of quantity of information, the subject matter that automatic Text Categorization faces is the higher-dimension text vector feature space how processing the generation of a large amount of text data.Too high text vector feature space will produce two adverse effects to file classification method: the method for (1) a lot of comparative maturity cannot be optimized in higher dimensional space, and then cannot be applied in text classification.(2) because sorter is trained by training set and obtains, the too high text vector space of dimension will inevitably cause Expired Drugs to occur ^[1].In text vector space, most of dimension and text classification are also uncorrelated, and even adulterate the more noise data affecting text classification precision ^[2].Text feature screens, and according to certain Feature Selection algorithm, selects the more representative text feature of a part and forms the lower feature space of a new dimension, reach the object of dimensionality reduction from original feature space.The method is the effective method solving the too high problem of text classification Chinese version vector characteristics Spatial Dimension.The object of text feature screening to improve the execution efficiency of text classification work efficiency and algorithm.A lot of experiment proves, in most of the cases, initiatively about subtracts can obtain very large performance boost under the loss of less nicety of grading to feature space ^[3].

Existing text classification Feature Selection algorithm mainly contains document frequency (DF), information gain (IG), information gain-ratio (GR), Chi-square Test (CHI), mutual information (MI) and Gini index etc. ^[3,4].Below the good technology of several wherein effect in text classification is briefly introduced:

Document frequency (DF): document frequency refers to for given feature t, comprises the number of documents of t in collection of document.Its basic assumption is rare feature for class prediction is do not have helpful, or can not affect overall performance.The advantage of document frequency: because its realization is simple, calculated amount is little, so feature selecting speed is very fast, and actual effect is also good; Shortcoming: rare feature may not be rare in a certain class text, may contain important classification information yet, simply weed out, may affect the effect of classification, therefore should not by a large amount of rejecting feature of DF.

Information gain (IG): information gain is a kind of appraisal procedure based on entropy, a given feature t, when considering and do not consider it, quantity of information is respectively how many, and both differences are exactly the quantity of information that this feature is brought to system, i.e. gain ^[5].Whether information gain considers the appearance of a feature, and in imbalanced data sets, for rare classification, experiment shows, considers the absent variable situation of feature to the contribution judging text categories often much smaller than considering the interference that feature does not show situation and brings.

Information gain-ratio (GR): information gain is proved to be devious in a lot of result.Due to the more and different attribute of value for training set learn too abundant, cause Information Gain Method to be more prone to select this attribute, information gain-ratio solves this shortcoming of information gain ^[6].

Chi-square Test (CHI): Chi-square Test is the method for a kind of conventional inspection Two Variables independence in mathematical statistics, its most basic thought is exactly determine theoretical correctness by the deviation of observation actual value and theoretical value ^[7,8].

During the experiment of text classification shows, during as feature selecting, the effect of Chi-square Test is best one, but it has only added up in text whether occur feature t, but the number of times that feature t occurs in the text is not considered, therefore make it have low-frequency word and necessarily exaggerate effect, " low-frequency word defect " that this namely Chi-square Test is famous.

The present invention is at feature compartment system ^[9]basis on, inter _ class relationship computing method are improved, this system are applied in Feature Selection process.

List of references:

[1]JiemingYang，YuanningLiu，XiaodongZhuetal，Anewfeatureselectionbasedoncomprehensivemeasurementbothininter-categoryandintra-categoryfortextcategorization，InformationProcessing&Management，Volume48，Issue4，2012，pp.741-754

[2]WenqianShang，HoukuanHuangandHaibinZhuetal，Anovelfeatureselectionalgorithmfortextclassification，ExpertSystemswithApplications，Volume33，Issuel，2007，pp.1-5

[3]MonicaRogatiandYimingYang，High-performingfeatureselectionfortextclassification.InProceedingsoftheeleventhinternationalconferenceonInformationandknowledgemanagement(CIKM′02).ACM，NewYork，NY，USA，2002，pp.659-661.

[4]Yang，Y.，Pedersen，J.O.，AComparativeStudyonFeatureSelectioninTextClassification.InProceedingsofthe14thinternationalconferenceonmachinelearning，Nashville，USA，1997，pp.4l2-420.

[5]Forman，G.，AnExtensiveEmpiricalofFeatureSelectionMetricsforTextClassification.JournalofMachineLearningResearch，3，2003，pp.1289-1305.

[6]TatsunoriMori，MiwaKikuchiandKazufumiYoshida，，TermWeightingMethodbasedonInformationGainRatioforSummarizingDocumentsretrievedbyIRsystems.JournalofNaturalLanguageProcessing，9(4)，2001，pp.3-32.

[7]Zheng，Z.，Srihari，R，OptimallyCombiningPositiveandNegativeFeaturesforTextClassification.ICML2003WorkshoponLearningfromImbalancedDataSets，2003.

[8]LuigiGalavotti，ViaJacopoNardiandFabrizioSebastianietal，FeatureSelectionandNegativeEvidenceinAutomatedTextClassification.InProceedingsofthe4thEuropeanConferenceonResearchandAdvancedTechnologyforDigitalLibraries(ECDL’00)，2000.

V.Lertnattee，T.Theeramunkong，Improvingcentroid-basedtextclassificationusingterm-distribution-basedweightingandfeatureselection，InProceedingsofINTECH-01，2ndInternationalConferenceonIntelligentTechnologies，Bangkok，Thailand，2001，pp.349-355.

Summary of the invention

In order to overcome the deficiency of existing text classification Feature Selection method poor accuracy, the invention provides a kind of text classification Feature Selection method of feature based distributed intelligence.The method, on the basis of feature compartment system, is improved inter _ class relationship computing method, is applied in Feature Selection process by feature compartment system.The method takes full advantage of in the tf*idf information of text feature, class and distribution between class information, reflect characteristic item significance level in the text more objectively, thus select the characteristic item that can represent text feature, reach Feature Selection object, text classification efficiency and accuracy rate can be improved.This method can be issued to higher classify accuracy selecting the situation of less characteristic item, has the advantage of fast convergence rate simultaneously, makes this method also can apply to skewed data set to the improvement of distribution between class.

The technical solution adopted for the present invention to solve the technical problems is: a kind of text classification Feature Selection method of feature based distributed intelligence, is characterized in comprising the following steps:

1. in pair document sets, each section of document carries out participle, removes stop words and gets stem process.

2. whole collection of document is expressed as vector space model.

3. from collection of document, extract all Feature Words, structural attitude dictionary.

4. to add up in text feature space each Feature Words t at every section of document d _jfrequency TF (t, the d of middle appearance _j), and at each class C _ifrequency TF (t, the C of middle appearance _i), add up each class C simultaneously _icomprise number of files DF (t, the C of Feature Words t _j).

5. according to the information that step 4 obtains, for each Feature Words t _k, first calculate for each class C _inormalized tf*idf value, then calculate this Feature Words at each class C _iinterior dispersion DIntra and average inter _ class relationship DInterAvg.

6., according to the information that step 4, step 5 step obtain, utilize following formula to calculate each Feature Words t in text feature space _kat classification C _iin weight w _i(t).

w _i(t)=tf*idf*DInterAvg*(1-DIntra)

By Feature Words t _kweight summation in each category, is the weight of this Feature Words in whole document sets, i.e. Feature Words t _ktDFS value.

TDFS (t) = Σ_{i = 1}^{NC} w_{i} (t)

7. by whole Feature Words according to its weight descending sort in whole document sets, when carrying out Feature Selection, preferentially retain the forward Feature Words of ranking.

The invention has the beneficial effects as follows: because the method is on the basis of feature compartment system, inter _ class relationship computing method are improved, feature compartment system is applied in Feature Selection process.The method takes full advantage of in the tf*idf information of text feature, class and distribution between class information, reflect characteristic item significance level in the text more objectively, thus select the characteristic item that can represent text feature, reach Feature Selection object, improve text classification efficiency and accuracy rate.This method can be issued to higher classify accuracy selecting the situation of less characteristic item, has the advantage of fast convergence rate simultaneously, makes this method also can apply to skewed data set to the improvement of distribution between class.

Below in conjunction with drawings and Examples, the present invention is elaborated.

Accompanying drawing explanation

Fig. 1 is the process flow diagram of the text classification Feature Selection method that the present invention is based on feature distributed intelligence.

Embodiment

The inventive method concrete steps are as follows:

1. the concept relevant with the present invention.

Tf*idf(Termfrequencyinversedocumentfrequency): be a kind of statistical method, in order to assess the significance level of a words for a copy of it file in a file set or a corpus.The importance of words to be directly proportional increase along with the number of times that it occurs hereof, the decline but the frequency that can occur in corpus along with it is inversely proportional to simultaneously.

Within-cluster variance (Intra-classdistribution): refer to the distribution situation of a Feature Words in a certain class document, if be uniformly distributed in such each document, then this Feature Words within-cluster variance in such document is lower; Otherwise if integrated distribution is in a few sections of documents, all do not occur in all the other documents, then this Feature Words within-cluster variance in such document is higher.

Inter _ class relationship (Inter-classdistribution): refer to the distribution situation of a Feature Words in whole document sets is of all categories, if be uniformly distributed in all kinds of document, then the inter _ class relationship of this Feature Words in this whole document sets is lower; Otherwise occur if only concentrate in some or several classification, and do not occur in other classifications, then the inter _ class relationship of this Feature Words in whole document sets is higher.

Average inter _ class relationship (Averageinter-classdistribution): the present invention of this Objective Concept proposes, and is an improvement to inter _ class relationship concept.Because inter _ class relationship adopts the total word frequency of Feature Words in each class document to weigh its distribution situation in all kinds of, if different classes of middle number of documents very different, namely there is deflection in data set, use the method that the Feature Words in the less classification of number of documents can be made flood by the larger classification of number of documents.Average inter _ class relationship after improvement, adopt Feature Words in each classification on average every section of document occur that the word frequency of this Feature Words is to weigh its distribution situation in all kinds of, not by the impact of data skew, accurately can reflect the distribution between class situation of Feature Words.

2. the properties relevant with the present invention.

The number of times that the some Feature Words of character 1. occur in a certain class document is more, more the classification of energy specification documents, and weight is larger.

The original document set sample of the garbages such as form removed by table 1

Numbering	Original document	Classification
			1	Yao has great talent in basketball games.	PE
2	We are playing a game about basketball in the playground.	PE
			3	We are enj oying the music at the concert.	MUSIC
4	Music is an art and everybody may enjoy it.	MUSIC
			5	Playing basketball is my favorite sport.	PE
6	Listening to the music is my hobby.	MUSIC

Such as: in the document sets shown in table 1, Feature Words basketball has occurred 3 times in PE class document, weight is comparatively large, and talent only occurs in such that once, weight is less.

The within-cluster variance of the some Feature Words of character 2. is lower, more the classification of energy specification documents, and weight is larger.

Table 2 is through the collection of document of Text Pretreatment

Numbering	Training document	Classification
			1	yao ha great talent basketbal game	PE
2	we plai game about basketbal playground	PE
			3	we enjoi music concert	MUSIC
4	music art everybodi mai enjoi	MUSIC
			5	plai basketbal my favorit sport	PE
6	listen music my hobbi	MUSIC

The descending sort of table 3 sample document set whole Feature Words TDFS value

Feature Words	TDFS value	Feature Words	TDFS value
				music	0.554	hobbi	0.158
basketbal	0.489	listen	0.158
				enjoi	0.489	great	0.140
game	0.394	ha	0.140
				plai	0.394	talent	0.140
concert	0.158	yao	0.140
				art	0.158	playground	0.140
everybodi	0.158	favorit	0.140
				mai	0.158	sport	0.140

Such as: in the document sets shown in table 1, Feature Words basketball is uniformly distributed in PE class document, and every section of document has a basketball, within-cluster variance is low, illustrate that this word is extensively evenly present in PE class document, can the classification information of specification documents well, weight is larger.Feature Words talent only occurs in one of PE section of document, and all without talent in all the other two sections, so the within-cluster variance of this Feature Words is higher, the document comprising this word is not that the possibility of PE classification is comparatively large, and the weight thus calculated is also lower.

The inter _ class relationship of the some Feature Words of character 3. is higher, more the classification of energy specification documents, and weight is larger.

Such as: in the document sets shown in table 1, Feature Words basketball only occurs in PE class document, never occurs in MUSIC class document, and this Feature Words inter _ class relationship is very high, can the classification information of specification documents well, and weight is larger.And Feature Words my, in PE class and MUSIC class, each appearance is once, and belong between class and be uniformly distributed, so inter _ class relationship is very low, can not represent the feature of classification preferably, thus weight is also lower.

The average inter _ class relationship of the some Feature Words of character 4. is higher, more the classification of energy specification documents, and weight is larger.

The average inter _ class relationship character citing 1 of table 4

The average inter _ class relationship character citing 2 of table 5

Citing: provide two examples in table 4, table 5, suppose that the distribution situation of a certain Feature Words t in document sets is as shown in table 4, because t is in the total word frequency identical (be 2) of A, B two in class, so the inter _ class relationship of this Feature Words is 0, but obviously t is necessarily representational for still having category-B, for no other reason than that A, B two number of documents gaps of class are excessive, the category-B Feature Words t causing number of files less flood by the more category-A of number of files.Example shown in table 5, the inter _ class relationship calculating Feature Words t be equally 0(A, B two in class the total word frequency of t be all 1000), but t for distinguish A, B two the significance level of class be apparent.As can be seen here when data set deflection, when using inter _ class relationship to weigh Feature Words, Feature Words representative in the classification that number of documents is less cannot be highlighted.If use average inter _ class relationship to weigh, Feature Words t is very uneven in A, B two distributions of class, and average inter _ class relationship is higher, therefore can give rational weight for Feature Words t, eliminate the adverse effect that data skew brings, also there is the original function of inter _ class relationship simultaneously.

For given document sets D, the detailed process that the present invention screens attribute in document sets is as follows:

1. parse documents concentrates all documents, rejects useless structural identification etc., extracts the main information such as exercise question, content in document.

Some structural identification information (seeing the following form) may be there are in document, occur all in the same way in every section of document, first fall as information filterings such as times by these marks and with content of text classification is incoherent.

2. pair content of text carries out pre-service, extracts characteristic item (term) and forms text feature space.

For documents all in document sets, after the parsing of the 1st step, the content information of each document can be obtained, in table 1.Pre-service is carried out to each section of document in document sets: participle (tokenizing), remove stop words (stopwordsremoval), get stem (stemming) process after, a set be made up of some words can be obtained, each word in set is referred to as text feature item (term), all characteristic items just constitute text feature space (termspace), the document in table 1 are obtained result after pre-service as shown in table 2.

For documents all in document sets, after the process by the 2nd step, collect all Feature Words constitutive characteristic dictionaries occurred in document sets, as the basis of Feature Selection.

5. according to the statistical information that the 4th step obtains, for each Feature Words calculates normalized tf*idf value, within-cluster variance, average inter _ class relationship.

(1) tf*idf: computing formula is as follows:

n_{t} = Σ_{j = 1}^{NC} DF (t, C_{j})

In formula, n represents C _ithe whole Feature Words numbers occurred in class.L is a constant, is obtained by experiment test, usually gets 0.1 or 0.01.The calculation deviation that normalized tf*idf value can avoid document overlength to bring.

(2) within-cluster variance: computing formula is as follows:

DIntra = \frac{\sqrt{Σ_{j = 1}^{| C_{i} |} {[TF (t, d_{j}) - \frac{TF (t, C_{i})}{| C_{i} |}]}^{2} / (| C_{i} | - 1)}}{TF (t, C_{i}) / \sqrt{| C_{i} | - 1}}

(3) average inter _ class relationship: computing formula is as follows:

DInterAvg = \frac{\sqrt{Σ_{i = 1}^{NC} [TF (t, C_{i}) / | C_{i} | - Σ_{j = 1}^{NC} TF (t, C_{j}) / {ND]}^{2} / (NC - 1)}}{Σ_{j = 1}^{NC} TF (t, C_{j}) / ND}

6. use the result in the 5th step, calculate the weight of characteristic item t in each class.Computing formula is as follows:

wi(t)=tf*idf*DInterAvg*(1-DIntra)

7., by the weight summation of characteristic item t in of all categories, obtain the weight of this characteristic item in whole document sets, namely

TDFS value.Computing formula is as follows:

TDFS (t) = Σ_{i = 1}^{NC} w_{i} (t)

8. calculate the TDFS value of all characteristic items in document sets according to descending sort, the more forward value of characteristic item in document sets of ranking is higher, and in document classification, role is larger.

Claims

1. a text classification Feature Selection method for feature based distributed intelligence, is characterized in that comprising the following steps:

(1). each section of document in document sets is carried out to participle, removes stop words and get stem process;

(2). whole collection of document is expressed as vector space model;

(3). from collection of document, extract all Feature Words, structural attitude dictionary;

(4). in statistics text feature space, each Feature Words t is at every section of document d _jfrequency TF (t, the d of middle appearance _j), and at each class C _ifrequency TF (t, the C of middle appearance _i), add up each class C simultaneously _icomprise number of files DF (t, the C of Feature Words t _i);

(5). according to the information that step (4) obtains, for each Feature Words t, first calculate for each class C _inormalized tf*idf value, then calculate this Feature Words at each class C _iinterior dispersion DIntra and average inter _ class relationship DInterAvg;

(6). according to the information that step (4), step (5) step obtain, utilize following formula to calculate in text feature space each Feature Words t at classification C _iin weight w _i(t);

w _i(t)＝tf*idf*DInterAvg*(1-DIntra)

The computing formula of average inter _ class relationship is as follows:

D I n t e r A v g = \frac{\sqrt{Σ_{i = 1}^{N C} {[T F (t, C_{i}) / | C_{i} | - Σ_{j = 1}^{N C} T F (t, C_{j}) / N D]}^{2} / (N C - 1)}}{Σ_{j = 1}^{N C} T F (t, C_{j}) / N D}

By Feature Words t weight summation in each category, be the weight of this Feature Words in whole document sets, i.e. the TDFS value of Feature Words t:

T D F S (t) = Σ_{i = 1}^{N C} w_{i} (t)

(7). by whole Feature Words according to its weight descending sort in whole document sets, when carrying out Feature Selection, preferentially retain the forward Feature Words of ranking.