CN105893388A - Text feature extracting method based on inter-class distinctness and intra-class high representation degree - Google Patents
- Publication number
- CN105893388A CN105893388A CN201510014438.XA CN201510014438A CN105893388A CN 105893388 A CN105893388 A CN 105893388A CN 201510014438 A CN201510014438 A CN 201510014438A CN 105893388 A CN105893388 A CN 105893388A
- Authority
- CN
- China
- Prior art keywords
- class
- feature
- feature words
- text
- degree
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a text feature extracting method based on inter-class distinctness and intra-class high representation degree. The method comprises the following steps: preprocessing a training set text; calculating the class distinctness of each feature word through an improved feature selecting method so as to select feature words with more class representation, wherein the selected feature words are of high distinctness among different classes; and further screening the selected feature words which are of high class distinctness based on the intra-class distribution rate and information gain (IG) of the feature words. With the adoption of the method, the feature selection is carried out twice to select the feature words which are of high intra-class information entropy and high intra-class distribution rate, and thus the classifying efficiency and accuracy can be improved; in addition, the calculation is simple, so that the text classifying speed and accuracy can be improved.
Description
Technical field
The invention belongs to the field of text mining technology, and in particular relates to a text feature extraction method based on inter-class discrimination and intra-class high representation degree.
Background technology
In the current era of rapidly growing Internet information resources, text classification technology has emerged as an important means of organizing and managing text information, so that needed information and resources can be found more quickly and effectively. Text classification refers to the technique of assigning a text to one or more predefined categories according to its content or attributes. In the field of text classification, the most popular approach is to represent texts with the vector space model (VSM). To avoid the "curse of dimensionality" among feature items that arises when building the VSM space, an effective feature selection algorithm becomes especially important.
In text classification, traditional feature selection algorithms include the following. The DF algorithm (document frequency) focuses only on high-frequency words; it misses low-frequency words with high information entropy, and the words it selects do not necessarily characterize a particular category. The IG algorithm (information gain), owing to the particularity of its computation, often cannot select a sufficient number of feature words. The CHI algorithm (χ² statistic) considers the influence of a feature word on a particular category, but its computational cost is very large. The MI algorithm (mutual information) performs unstably in experimental environments.
Therefore, it is necessary to design a feature word selection algorithm that can select words with both strong inter-class discrimination and high representativeness within the category they belong to, while keeping the computational cost low.
Summary of the invention
To overcome the deficiencies of existing text classification feature selection methods, which cannot select feature words with high category representativeness and which incur a large computational cost, the present invention provides a text feature extraction method based on inter-class discrimination and intra-class high representation degree, with a small computational cost. The scheme comprises the following steps:
Step 1: obtain text collections of different categories as the corpus training set.
Step 2: preprocess the texts of the corpus training set, including Chinese word segmentation and stop-word removal;
Step 3: apply the text feature extraction method based on inter-class discrimination and intra-class high representation degree to perform feature selection, selecting N features (N is a preset threshold) as the text feature set of the corpus training set.
The text feature extraction method based on inter-class discrimination and intra-class high representation degree is specifically as follows:
First, calculate the class discrimination degree of each feature word and select the feature words with high class discrimination degree, comprising the following steps:
Step (1): determine the degree of association between each feature word and each preset category, computed as follows:
R_jk = (number of documents in category c_j that contain feature word t_k) / (number of documents in category c_j)
where R_jk denotes the degree of association between feature word t_k and text category c_j; the numerator is the number of documents in category c_j that contain t_k, and the denominator is the number of documents in category c_j.
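As an illustrative sketch (not part of the patent text), the association degree R_jk can be computed from a corpus in which each document is represented as a set of words; the structure `docs_by_class` is a hypothetical name introduced here:

```python
def association_degree(docs_by_class, term):
    """R_jk for one term: the fraction of documents in each class containing it.

    docs_by_class: dict mapping class label -> list of documents (sets of words).
    """
    return {cls: sum(1 for doc in docs if term in doc) / len(docs)
            for cls, docs in docs_by_class.items()}
```

For example, if 5 of 10 class-A documents contain a word, the returned value for class A is 0.5, matching the fractions in Table 2 below.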
Step (2): calculate the value of the class discrimination ability of feature word t_k within text category c_j, computed as follows:
Diff_jk = min(R_jk − R_ik)  (i ≠ j, i = 1~s, where s is the total number of categories)
Note that Diff_jk can be negative here; a negative value indicates that feature word t_k is less distributed in text category c_j than in some text category c_i.
Step (3): calculate the class discrimination degree of feature word t_k, computed as follows:
Diff_k = max{Diff_jk}  (j = 1~s)
and record the value of j for the Diff_jk corresponding to Diff_k, i.e., record the text category c_j that feature word t_k characterizes.
Step (4): set a preset threshold Q1 and take the feature words with Diff_k ≥ Q1 for further screening.
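The two quantities above can be sketched as follows (a minimal illustration, assuming the association degrees R_jk for one feature word have already been computed into a dict keyed by class):

```python
def class_discrimination(assoc):
    """Given assoc: dict class -> R_jk for one feature word, return
    (Diff_jk per class, the characterized class c_j, Diff_k)."""
    # Diff_jk = min over i != j of (R_jk - R_ik); may be negative
    diff = {j: min(assoc[j] - assoc[i] for i in assoc if i != j)
            for j in assoc}
    best = max(diff, key=diff.get)   # class c_j achieving Diff_k = max Diff_jk
    return diff, best, diff[best]
```

For word 3 of Table 2 below (R_A = 1.0, R_B = 0.3, R_C = 0.1) this yields Diff_k = 0.7, characterizing class A.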
Further, the selected feature words with high class discrimination degree are screened again by combining the intra-class distribution rate and information gain (IG) of each feature word, so as to select the feature words with high intra-class representation degree, comprising the following steps:
Step (1): for each selected feature word with high class discrimination degree, calculate its distribution rate in the category it characterizes. Assuming feature word t_k characterizes text category c_j, the distribution rate of t_k is computed as follows:
w_tk = (word frequency of t_k in category c_j) / (number of documents in category c_j)
Step (2): set a preset threshold Q2; when w_tk ≥ Q2, feature word t_k is treated as a high-frequency word and proceeds to step (3). Set a preset threshold Q3; when w_tk ≤ Q3, feature word t_k is a low-frequency word and proceeds to step (4) for further judgment.
Step (3): compute IG for the feature words with w_tk ≥ Q2 and set a preset threshold Q4; when IG(t_k) < Q4, feature word t_k is eliminated and is not selected into the text feature set of the corpus training set.
Step (4): compute IG for the feature words with w_tk ≤ Q3 and set a threshold Q5; when IG(t_k) ≥ Q5, t_k is an effective low-frequency word and is selected into the text feature set of the corpus training set.
Step (5): assuming the dimension of the text feature set of the corpus training set is N, if the number of feature words selected above is less than N, select additional feature words from those with Q3 < w_tk < Q2, from the highest weight downward, until the set is full.
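The decision logic of steps (2) through (5) can be sketched as below; the IG value is taken as an input since the patent uses standard information gain without restating its formula, and the function name is introduced here for illustration only:

```python
def second_stage_decision(w, ig, Q2, Q3, Q4, Q5):
    """Classify one candidate feature word by distribution rate w and IG.

    Returns 'selected', 'eliminated', or 'alternate' (mid-band words are
    alternates, used in step (5) to fill the feature set up to dimension N).
    """
    if w >= Q2:                       # high-frequency word: step (3)
        return "eliminated" if ig < Q4 else "selected"
    if w <= Q3:                       # low-frequency word: step (4)
        return "selected" if ig >= Q5 else "eliminated"
    return "alternate"                # Q3 < w < Q2: step (5)
```

The choice of returning a three-way label mirrors the method's structure: only mid-band words are deferred to the weight-ranked fill-up of step (5).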
The technical scheme provided by the present invention has the following benefits: it first calculates the class discrimination degree of each feature word to select the feature words with greater category representativeness, so that the selected words are highly discriminative between different categories; it then further screens the selected high-discrimination feature words by combining their intra-class distribution rate and information gain (IG), selecting the feature words with both high intra-class information entropy and high intra-class distribution rate. In addition, the computation of this scheme is simple, which can improve the speed and efficiency of text classification.
Brief description of the drawings
Fig. 1 is the flow chart of the text feature extraction method based on inter-class discrimination and intra-class high representation degree of the present invention.
Fig. 2 is the detailed algorithm flow diagram of selecting feature words with high inter-class discrimination in the present invention.
Fig. 3 is the detailed algorithm flow diagram of selecting feature words with high intra-class representation degree from the selected high-discrimination feature words in the present invention.
Detailed description of the embodiments
To make the purpose, technical scheme, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and actual cases.
Fig. 1 is the flow chart of the text feature extraction method based on inter-class discrimination and intra-class high representation degree of the present invention; the concrete functions are implemented as follows:
Step 1: first use a web crawler, or collect manually from the Internet, a certain number of representative articles from multiple fields; analyze and organize these articles and place them into the corpus training set by category, as the training sample set of the text classification system.
Step 2: to extract the words that can represent text features from the texts, perform word segmentation, stop-word removal, and other preprocessing.
Step 3: from the preprocessed texts, select the feature words with high class discrimination degree, as follows:
Fig. 2 is the detailed algorithm flow diagram of selecting feature words with high inter-class discrimination in the present invention; the algorithm is illustrated below with an example, as follows:
Assume there are 3 preset categories, class A, class B, and class C, each containing 10 articles belonging to that category. Assume feature word 1 appears in 5 of the 10 articles of class A, and also appears in 5 of the 10 articles of class B and in 5 of those of class C. Feature word 2 appears in 9 of the 10 articles of class A, in 8 of the 10 articles of class B, and in 1 of the 10 articles of class C. Feature word 3 appears in 10 of the 10 articles of class A, in 3 of the 10 articles of class B, and in 1 of the 10 articles of class C, as shown in Table 1 below:
| | A class (10) | B class (10) | C class (10) |
|---|---|---|---|
| Word 1 | 5 | 5 | 5 |
| Word 2 | 9 | 8 | 1 |
| Word 3 | 10 | 3 | 1 |

Table 1
Calculate the degree of association R_jk between each word and each preset category according to the association formula:
R_jk = (number of documents in category c_j that contain feature word t_k) / (number of documents in category c_j)
where R_jk denotes the degree of association between feature word t_k and text category c_j. The results are shown in Table 2 below.
| R_jk | A class | B class | C class |
|---|---|---|---|
| Word 1 | R_A1 = 1/2 | R_B1 = 1/2 | R_C1 = 1/2 |
| Word 2 | R_A2 = 9/10 | R_B2 = 8/10 | R_C2 = 1/10 |
| Word 3 | R_A3 = 10/10 | R_B3 = 3/10 | R_C3 = 1/10 |

Table 2
Calculate the value of the class discrimination ability of feature word t_k within text category c_j:
Diff_jk = min(R_jk − R_ik)  (i ≠ j, i = 1~s, where s is the total number of categories)
Diff_A1 = min{(1/2 − 1/2), (1/2 − 1/2)} = 0
Similarly, Diff_B1 = 0 and Diff_C1 = 0; the remaining Diff_jk are computed in the same way, as shown in Table 3 below:

| Diff_jk | A class | B class | C class |
|---|---|---|---|
| Word 1 | 0 | 0 | 0 |
| Word 2 | 1/10 | −1/10 | −8/10 |
| Word 3 | 7/10 | −7/10 | −9/10 |

Table 3
Calculate the class discrimination degree of feature word t_k, computed as follows:
Diff_k = max{Diff_jk}  (j = 1~s)
According to Table 3:
Diff_1 = Diff_A1 = Diff_B1 = Diff_C1 = 0
Diff_2 = Diff_A2 = 1/10
Diff_3 = Diff_A3 = 7/10
Assuming the preset threshold Q1 is 1/2, feature words 1 and 2 are now eliminated, while feature word 3 is selected and the category it represents is recorded, i.e., feature word 3 can represent class A.
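The example above can be reproduced with a short script (the variable names are introduced here for illustration); it yields exactly the fractions of Tables 2 and 3 and selects only word 3 under Q1 = 1/2:

```python
from fractions import Fraction as F

# Documents containing each word per class (Table 1), out of 10 per class
counts = {"word1": {"A": 5, "B": 5, "C": 5},
          "word2": {"A": 9, "B": 8, "C": 1},
          "word3": {"A": 10, "B": 3, "C": 1}}
n_docs = 10

# Table 2: association degrees R_jk
R = {w: {c: F(k, n_docs) for c, k in per.items()} for w, per in counts.items()}

# Diff_k = max over j of min over i != j of (R_jk - R_ik)
diff = {w: max(min(Rw[j] - Rw[i] for i in Rw if i != j) for j in Rw)
        for w, Rw in R.items()}

Q1 = F(1, 2)
selected = [w for w, d in diff.items() if d >= Q1]
print(diff["word3"], selected)   # 7/10 ['word3']
```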
Step 4: combine the intra-class distribution rate and information gain (IG) of each feature word to further screen the selected feature words with high class discrimination degree, and select the feature words with high intra-class representation degree.
Fig. 3 is the detailed algorithm flow diagram of selecting feature words with high intra-class representation degree from the selected high-discrimination feature words; the algorithm is illustrated below with an example, as follows:
Assume feature words 1, 2, and 3 are all feature words selected in step 3 that represent class A (class A contains 10 articles). Assume feature word 1 appears 100 times in total in the 10 articles of class A, feature word 2 appears 50 times in total, and feature word 3 appears 30 times in total.
Calculate the distribution rate of each feature word t_k according to the formula:
w_tk = (word frequency of t_k in category c_j) / (number of documents in category c_j)
that is:
w_1 = 100/10 = 10
w_2 = 50/10 = 5
w_3 = 30/10 = 3
Assuming the preset threshold Q2 is 7 and the preset threshold Q3 is 4:
For feature word 1 (w_1 = 10 ≥ Q2), compute IG and judge whether it is less than the preset threshold Q4; if so, it is eliminated, otherwise it is kept.
For feature word 2 (Q3 < w_2 = 5 < Q2), it is directly taken as an alternate.
For feature word 3 (w_3 = 3 ≤ Q3), compute IG and judge whether it is greater than or equal to the preset threshold Q5; if so, select the feature word, otherwise eliminate it.
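The banding in this example can be sketched as follows (hypothetical function name, with the IG test left abstract since its thresholds Q4 and Q5 are applied afterwards):

```python
def frequency_band(w, Q2, Q3):
    """Band a candidate feature word by its intra-class distribution rate w:
    high-frequency words face the IG-vs-Q4 test, low-frequency words the
    IG-vs-Q5 test, and mid-band words become alternates."""
    if w >= Q2:
        return "high"
    if w <= Q3:
        return "low"
    return "mid"

# Distribution rates of the three words from the example above
rates = {"word1": 100 / 10, "word2": 50 / 10, "word3": 30 / 10}
bands = {t: frequency_band(r, Q2=7, Q3=4) for t, r in rates.items()}
```

With Q2 = 7 and Q3 = 4 this reproduces the three branches described above: word 1 high, word 2 mid (alternate), word 3 low.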
Step 5: based on the above method, select N features (N is a preset threshold) as the text feature set of the corpus training set.
The process above is now illustrated with an application example in which the parameters are fixed.
Embodiment 1:
Assume there are 3 preset categories, class A, class B, and class C, each containing 10 articles belonging to that category. Assume feature word 1 appears in 5 of the 10 articles of class A, and also appears in 5 of the 10 articles of class B and of class C respectively. The distribution of the remaining feature words in each category is shown in Table 4 below:
| | A class (10) | B class (10) | C class (10) |
|---|---|---|---|
| Feature word 1 | 5 | 5 | 5 |
| Feature word 2 | 2 | 8 | 9 |
| Feature word 3 | 10 | 3 | 1 |
| Feature word 4 | 5 | 2 | 7 |
| Feature word 5 | 1 | 6 | 8 |
| Feature word 6 | 2 | 7 | 3 |

Table 4
According to Table 4, calculate the degree of association R_jk between each word and each preset category; the results are shown in Table 5 below:
| R_jk | A class (10) | B class (10) | C class (10) |
|---|---|---|---|
| Feature word 1 | R_A1 = 1/2 | R_B1 = 1/2 | R_C1 = 1/2 |
| Feature word 2 | R_A2 = 2/10 | R_B2 = 8/10 | R_C2 = 9/10 |
| Feature word 3 | R_A3 = 10/10 | R_B3 = 3/10 | R_C3 = 1/10 |
| Feature word 4 | R_A4 = 5/10 | R_B4 = 2/10 | R_C4 = 7/10 |
| Feature word 5 | R_A5 = 1/10 | R_B5 = 6/10 | R_C5 = 8/10 |
| Feature word 6 | R_A6 = 2/10 | R_B6 = 7/10 | R_C6 = 3/10 |

Table 5
Calculate the value Diff_jk of the class discrimination ability of each feature word t_k within each text category c_j; the results are shown in Table 6 below:

| Diff_jk | A class | B class | C class |
|---|---|---|---|
| Feature word 1 | 0 | 0 | 0 |
| Feature word 2 | −7/10 | −1/10 | 1/10 |
| Feature word 3 | 7/10 | −7/10 | −9/10 |
| Feature word 4 | −2/10 | −5/10 | 2/10 |
| Feature word 5 | −7/10 | −2/10 | 2/10 |
| Feature word 6 | −5/10 | 4/10 | −4/10 |

Table 6
Calculate the class discrimination degree of each feature word t_k. According to Table 6:
Diff_1 = Diff_A1 = Diff_B1 = Diff_C1 = 0
Diff_2 = Diff_C2 = 1/10
Diff_3 = Diff_A3 = 7/10
Diff_4 = Diff_C4 = 2/10
Diff_5 = Diff_C5 = 2/10
Diff_6 = Diff_B6 = 4/10
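Embodiment 1 up to this point can be checked end-to-end with a short script (the names are introduced here for illustration); with Q1 = 1/20 it eliminates feature word 1 and assigns words 2, 4, and 5 to class C, word 3 to class A, and word 6 to class B:

```python
from fractions import Fraction as F

# Documents containing each feature word per class (Table 4)
counts = {"w1": {"A": 5, "B": 5, "C": 5}, "w2": {"A": 2, "B": 8, "C": 9},
          "w3": {"A": 10, "B": 3, "C": 1}, "w4": {"A": 5, "B": 2, "C": 7},
          "w5": {"A": 1, "B": 6, "C": 8}, "w6": {"A": 2, "B": 7, "C": 3}}

survivors = {}   # feature word -> class it characterizes
for t, per in counts.items():
    R = {c: F(k, 10) for c, k in per.items()}                       # Table 5
    diff = {j: min(R[j] - R[i] for i in R if i != j) for j in R}    # Table 6
    best = max(diff, key=diff.get)                                  # Diff_k
    if diff[best] >= F(1, 20):                                      # threshold Q1
        survivors[t] = best
```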
Assuming the threshold Q1 is 1/20, feature word 1 is now eliminated, and the remaining feature words enter the next round of selection. In particular, feature words 2, 4, and 5 enter the next round as candidate feature words representing class C.
Assume feature word 2 appears 9 times in total in the 10 articles of class C, feature word 4 appears 40 times in total in the 10 articles of class C, and feature word 5 appears 20 times in total in the 10 articles of class C.
Calculate the distribution rate of each feature word t_k according to the formula, i.e.:
w_2 = 9/10 = 0.9
w_4 = 40/10 = 4
w_5 = 20/10 = 2
Assuming the preset threshold Q2 is 3 and the preset threshold Q3 is 1:
For feature word 2 (w_2 = 0.9 ≤ Q3), compute IG and judge whether it is greater than or equal to the preset threshold Q5; if so, select the feature word, otherwise eliminate it.
For feature word 4 (w_4 = 4 ≥ Q2), compute IG and judge whether it is less than the preset threshold Q4; if so, it is eliminated, otherwise it is kept.
For feature word 5 (Q3 < w_5 = 2 < Q2), it is directly taken as an alternate.
Assume that feature word 4 is now selected as a feature word representing class C. In the same way, feature words representing the other classes are selected; assume feature word 3 is selected to represent class A and feature word 6 to represent class B. If the preset VSM space dimension is now 3, the feature set used for training is full; if the VSM space dimension is 4, another word is selected from the alternate feature words.
The technical scheme provided by the embodiments of the present invention can select feature words with greater category representativeness and higher intra-class information content, and improve the speed of text classification.
Through the above description of the embodiments, those skilled in the art can understand the implementation of the present invention. The present invention can be realized by software programming, and the corresponding software program can be stored in a readable storage medium, such as an optical disc, a hard disk, or a removable storage medium.
The above are specific embodiments of the present invention, but they are not intended to limit the present invention. For those skilled in the art, any modification, equivalent substitution, improvement, and the like made without departing from the principle of the present invention shall be included within the protection scope of the present invention.
Claims (4)
1. A text feature extraction method based on inter-class discrimination and intra-class high representation degree, characterized by comprising the following steps:
Step 1: obtain text collections of different categories as the corpus training set.
Step 2: preprocess the texts of the corpus training set, including Chinese word segmentation and stop-word removal;
Step 3: apply the text feature extraction method based on inter-class discrimination and intra-class high representation degree to perform feature selection, selecting N features (N is a preset threshold) as the text feature set of the corpus training set.
2. The text feature extraction method based on inter-class discrimination and intra-class high representation degree as claimed in claim 1, characterized in that the method of step 3 comprises:
first calculating the class discrimination degree of each feature word and selecting the feature words with high class discrimination degree;
then combining the intra-class distribution rate and information gain (IG) of each feature word to further screen the selected feature words with high class discrimination degree, selecting the feature words with high intra-class representation degree.
3. The method as claimed in claim 2, characterized in that calculating the class discrimination degree of each feature word and selecting the feature words with high class discrimination degree specifically comprises the following steps:
Step (1): determine the degree of association between each feature word and each preset category, computed as follows:
R_jk = (number of documents in category c_j that contain feature word t_k) / (number of documents in category c_j)
where R_jk denotes the degree of association between feature word t_k and text category c_j.
Step (2): calculate the value of the class discrimination ability of feature word t_k within text category c_j, computed as follows:
Diff_jk = min(R_jk − R_ik)  (i ≠ j, i = 1~s, where s is the total number of categories)
Note that Diff_jk can be negative here; a negative value indicates that feature word t_k is less distributed in category c_j than in some category c_i.
Step (3): calculate the class discrimination degree of feature word t_k, computed as follows:
Diff_k = max{Diff_jk}  (j = 1~s)
and record the value of j for the Diff_jk corresponding to Diff_k, i.e., record the text category c_j that feature word t_k characterizes.
Step (4): set a preset threshold Q1 and take the feature words with Diff_k ≥ Q1 for further screening.
4. The method as claimed in claim 2, characterized in that combining the intra-class distribution rate and information gain (IG) of each feature word to further screen the selected feature words with high class discrimination degree specifically comprises the following steps:
Step (1): for each selected feature word with high class discrimination degree, calculate its distribution rate in the category it characterizes. Assuming feature word t_k characterizes text category c_j, the distribution rate of t_k is computed as follows:
w_tk = (word frequency of t_k in category c_j) / (number of documents in category c_j)
Step (2): set a preset threshold Q2; when w_tk ≥ Q2, feature word t_k is treated as a high-frequency word and proceeds to step (3). Set a preset threshold Q3; when w_tk ≤ Q3, feature word t_k is a low-frequency word and proceeds to step (4) for further judgment.
Step (3): compute IG for the feature words with w_tk ≥ Q2 and set a preset threshold Q4; when IG(t_k) < Q4, feature word t_k is eliminated and is not selected into the text feature set of the corpus training set.
Step (4): compute IG for the feature words with w_tk ≤ Q3 and set a threshold Q5; when IG(t_k) ≥ Q5, t_k is an effective low-frequency word and is selected into the text feature set of the corpus training set.
Step (5): assuming the dimension of the text feature set of the corpus training set is N, if the number of feature words selected above is less than N, select additional feature words from those with Q3 < w_tk < Q2, from the highest weight downward, until the set is full.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510014438.XA CN105893388B (en) | 2015-01-01 | 2015-01-01 | A kind of text feature based on characterization degree high in discrimination between class and class |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510014438.XA CN105893388B (en) | 2015-01-01 | 2015-01-01 | A kind of text feature based on characterization degree high in discrimination between class and class |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105893388A true CN105893388A (en) | 2016-08-24 |
CN105893388B CN105893388B (en) | 2019-08-23 |
Family
ID=56999237
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510014438.XA Active CN105893388B (en) | 2015-01-01 | 2015-01-01 | A kind of text feature based on characterization degree high in discrimination between class and class |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105893388B (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106503146A (en) * | 2016-10-21 | 2017-03-15 | 江苏理工学院 | The feature selection approach of computer version, characteristic of division system of selection and system |
CN106611057A (en) * | 2016-12-27 | 2017-05-03 | 上海利连信息科技有限公司 | Text classification feature selection approach for importance weighing |
CN106897428A (en) * | 2017-02-27 | 2017-06-27 | 腾讯科技(深圳)有限公司 | Text classification feature extracting method, file classification method and device |
CN107193804A (en) * | 2017-06-02 | 2017-09-22 | 河海大学 | A kind of refuse messages text feature selection method towards word and portmanteau word |
CN107590163A (en) * | 2016-07-06 | 2018-01-16 | 北京京东尚科信息技术有限公司 | The methods, devices and systems of text feature selection |
CN108346474A (en) * | 2018-03-14 | 2018-07-31 | 湖南省蓝蜻蜓网络科技有限公司 | The electronic health record feature selection approach of distribution within class and distribution between class based on word |
CN108363810A (en) * | 2018-03-09 | 2018-08-03 | 南京工业大学 | A kind of file classification method and device |
CN108491429A (en) * | 2018-02-09 | 2018-09-04 | 湖北工业大学 | A kind of feature selection approach based on document frequency and word frequency statistics between class in class |
CN108763344A (en) * | 2018-05-15 | 2018-11-06 | 南京邮电大学 | Based on information gain and maximal correlation minimal redundancy two-stage feature selection approach |
CN111061779A (en) * | 2019-12-16 | 2020-04-24 | 延安大学 | Data processing method and device based on big data platform |
US11526754B2 (en) | 2020-02-07 | 2022-12-13 | Kyndryl, Inc. | Feature generation for asset classification |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040042665A1 (en) * | 2002-08-30 | 2004-03-04 | Lockheed Martin Corporation | Method and computer program product for automatically establishing a classifiction system architecture |
CN102567308A (en) * | 2011-12-20 | 2012-07-11 | 上海电机学院 | Information processing feature extracting method |
- 2015-01-01 CN CN201510014438.XA patent/CN105893388B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040042665A1 (en) * | 2002-08-30 | 2004-03-04 | Lockheed Martin Corporation | Method and computer program product for automatically establishing a classifiction system architecture |
CN102567308A (en) * | 2011-12-20 | 2012-07-11 | 上海电机学院 | Information processing feature extracting method |
Non-Patent Citations (1)
Title |
---|
易军凯等: "基于类别区分度的文本特征选择算法研究", 《北京化工大学学报(自然科学版)》 * |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107590163A (en) * | 2016-07-06 | 2018-01-16 | 北京京东尚科信息技术有限公司 | The methods, devices and systems of text feature selection |
CN107590163B (en) * | 2016-07-06 | 2019-07-02 | 北京京东尚科信息技术有限公司 | The methods, devices and systems of text feature selection |
CN106503146A (en) * | 2016-10-21 | 2017-03-15 | 江苏理工学院 | The feature selection approach of computer version, characteristic of division system of selection and system |
CN106503146B (en) * | 2016-10-21 | 2019-06-07 | 江苏理工学院 | The feature selection approach of computer version |
CN106611057A (en) * | 2016-12-27 | 2017-05-03 | 上海利连信息科技有限公司 | Text classification feature selection approach for importance weighing |
CN106611057B (en) * | 2016-12-27 | 2019-08-13 | 上海利连信息科技有限公司 | The text classification feature selection approach of importance weighting |
CN106897428B (en) * | 2017-02-27 | 2022-08-09 | 腾讯科技(深圳)有限公司 | Text classification feature extraction method and text classification method and device |
CN106897428A (en) * | 2017-02-27 | 2017-06-27 | 腾讯科技(深圳)有限公司 | Text classification feature extracting method, file classification method and device |
CN107193804B (en) * | 2017-06-02 | 2019-03-29 | 河海大学 | A kind of refuse messages text feature selection method towards word and portmanteau word |
CN107193804A (en) * | 2017-06-02 | 2017-09-22 | 河海大学 | A kind of refuse messages text feature selection method towards word and portmanteau word |
CN108491429A (en) * | 2018-02-09 | 2018-09-04 | 湖北工业大学 | A kind of feature selection approach based on document frequency and word frequency statistics between class in class |
CN108363810B (en) * | 2018-03-09 | 2022-02-15 | 南京工业大学 | Text classification method and device |
CN108363810A (en) * | 2018-03-09 | 2018-08-03 | 南京工业大学 | A kind of file classification method and device |
CN108346474A (en) * | 2018-03-14 | 2018-07-31 | 湖南省蓝蜻蜓网络科技有限公司 | The electronic health record feature selection approach of distribution within class and distribution between class based on word |
CN108346474B (en) * | 2018-03-14 | 2021-09-28 | 湖南省蓝蜻蜓网络科技有限公司 | Electronic medical record feature selection method based on word intra-class distribution and inter-class distribution |
CN108763344B (en) * | 2018-05-15 | 2021-12-14 | 南京邮电大学 | Information gain and maximum correlation minimum redundancy two-stage feature selection method |
CN108763344A (en) * | 2018-05-15 | 2018-11-06 | 南京邮电大学 | Based on information gain and maximal correlation minimal redundancy two-stage feature selection approach |
CN111061779A (en) * | 2019-12-16 | 2020-04-24 | 延安大学 | Data processing method and device based on big data platform |
US11526754B2 (en) | 2020-02-07 | 2022-12-13 | Kyndryl, Inc. | Feature generation for asset classification |
US11748621B2 (en) | 2020-02-07 | 2023-09-05 | Kyndryl, Inc. | Methods and apparatus for feature generation using improved term frequency-inverse document frequency (TF-IDF) with deep learning for accurate cloud asset tagging |
Also Published As
Publication number | Publication date |
---|---|
CN105893388B (en) | 2019-08-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105893388A (en) | Text feature extracting method based on inter-class distinctness and intra-class high representation degree | |
Chen et al. | Application of one-class support vector machine to quickly identify multivariate anomalies from geochemical exploration data | |
CN104142918B (en) | Short text clustering and focus subject distillation method based on TF IDF features | |
WO2015085916A1 (en) | Data mining method | |
CN105893380A (en) | Improved text classification characteristic selection method | |
CN109165301A (en) | Video cover selection method, device and computer readable storage medium | |
CN110457577B (en) | Data processing method, device, equipment and computer storage medium | |
CN105760889A (en) | Efficient imbalanced data set classification method | |
CN104317784A (en) | Cross-platform user identification method and cross-platform user identification system | |
WO2018040997A1 (en) | System, method, and device for evaluating node of funnel model | |
CN104424308A (en) | Web page classification standard acquisition method and device and web page classification method and device | |
TW201833851A (en) | Risk control event automatic processing method and apparatus | |
CN110069545B (en) | Behavior data evaluation method and device | |
CN108090178A (en) | A kind of text data analysis method, device, server and storage medium | |
CN106294529A (en) | A kind of identification user's abnormal operation method and apparatus | |
CN105468731B (en) | A kind of preposition processing method of text emotion analysis signature verification | |
Ando et al. | Globalization and domestic operations: Applying the JC/JD method to Japanese manufacturing firms | |
CN110232156B (en) | Information recommendation method and device based on long text | |
CN105117466A (en) | Internet information screening system and method | |
SJ et al. | Impact of financial crisis in Asia | |
CN105574480A (en) | Information processing method and apparatus and terminal | |
CN105843608B (en) | A kind of APP user interface design pattern recommended method and system based on cluster | |
CN109800215A (en) | Method, apparatus, computer storage medium and the terminal of a kind of pair of mark processing | |
CN105224954A (en) | A kind of topic discover method removing the impact of little topic based on Single-pass | |
CN104751234B (en) | A kind of prediction technique and device of user's assets |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
PE01 | Entry into force of the registration of the contract for pledge of patent right | ||
PE01 | Entry into force of the registration of the contract for pledge of patent right |
Denomination of invention: A text feature extraction method based on inter class discrimination and high representation within class Effective date of registration: 20220407 Granted publication date: 20190823 Pledgee: Chengdu SME financing Company Limited by Guarantee Pledgor: CHENGDU WANGAN TECHNOLOGY DEVELOPMENT Co.,Ltd. Registration number: Y2022980003814 |