CN105512311A - Chi square statistic based self-adaption feature selection method - Google Patents

Chi square statistic based self-adaption feature selection method

Info

Publication number
CN105512311A
Authority
CN
China
Prior art keywords
text
feature
classification
chi
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510927759.9A
Other languages
Chinese (zh)
Other versions
CN105512311B (en)
Inventor
汪友生
樊存佳
王雨婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN201510927759.9A
Publication of CN105512311A
Application granted
Publication of CN105512311B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining

Abstract

The invention discloses an adaptive feature selection method based on the chi-square statistic, relating to the field of computer text data processing. First, the training text set and the test text set are preprocessed, including word segmentation and stop-word removal. Next, adaptive text feature selection based on the chi-square statistic is performed: a word frequency factor and an inter-class variance are defined and introduced into the CHI algorithm, and an appropriate scale factor is added to it. Finally, the scale factor is adjusted automatically in combination with the evaluation indexes of the classical KNN algorithm, so that the improved CHI adapts to different text corpora and a higher classification accuracy is guaranteed. Experimental results show that, compared with the conventional CHI method, classification accuracy improves on both balanced and unbalanced corpora.

Description

An adaptive feature selection method based on the chi-square statistic
Technical field
The present invention relates to the field of computer text data processing, and in particular to an adaptive text feature selection method based on the chi-square statistic (χ², CHI).
Background technology
In the current era of big data, mining the latent value of data is of great importance, and data mining, as the technology for discovering that value, has attracted wide attention. Text accounts for a considerable proportion of big data, and text classification, as an effective data mining method for organizing and managing text data, has gradually become a focus of attention. It is widely applied in information filtering, information organization and management, information retrieval, digital libraries, spam filtering, and so on. Text Classification (TC) is the process of automatically assigning a text of unknown class to one or more classes according to its content, under a predefined classification scheme. Common text classification methods include K-Nearest Neighbor (KNN), Naive Bayes (NB), and Support Vector Machine (SVM).
At present, the text classification process comprises preprocessing, feature dimensionality reduction, text representation, classifier training, and evaluation. The most common text representation is the vector space model, but the high dimensionality and sparsity of the vector space increase time and space complexity and strongly affect classification precision, so feature dimensionality reduction is essential: it directly determines the efficiency and accuracy of classification. Feature dimensionality reduction comprises two kinds of methods, feature extraction and feature selection. Linguistics-based feature extraction requires natural language processing techniques and has high computational complexity, whereas feature selection methods based on statistical theory have lower complexity and require little background knowledge, so feature selection is applied more widely. The basic idea of feature selection is to construct an evaluation function, score each feature of the feature set, sort all features by their scores, and select a given number of features as the final text feature set. Common feature selection methods include the chi-square statistic, Document Frequency (DF), Information Gain (IG), Mutual Information (MI), Expected Cross Entropy (ECE), and Weight of Evidence for Text (WE).
As one of the common text feature selection methods, the CHI method is simple to implement and has low time complexity, but it also has shortcomings that make its classification performance unsatisfactory. The deficiencies of the CHI algorithm are mainly two: first, CHI considers only the document frequency of a feature and ignores its term frequency, so the weight of low-frequency words is exaggerated; second, it exaggerates the weight of features that appear rarely in one class but often in other classes. Many researchers have improved the CHI algorithm to address these deficiencies, and the improvements can be summarized in two directions: first, introducing regulating parameters to reduce the dependence on low-frequency words, which however does not consider the positive and negative correlation between features and classes; second, introducing a scale factor that classifies features by their positive or negative correlation and assigns different weights, which improves the feature selection ability of the CHI model but requires the scale factor to be chosen by experience. Considering the deficiencies of the existing improved CHI algorithms, designing a CHI text feature selection method with high classification precision has important academic significance and practical value.
Summary of the invention
The object of the invention is to provide an improved CHI text feature selection method that raises the accuracy of text classification. On the one hand, a word frequency factor and an inter-class variance are introduced to reduce the dependence of CHI on low-frequency words, selecting features that occur frequently in a certain class and are evenly distributed within that class; on the other hand, an adaptive scale factor μ is introduced to classify features by their positive or negative correlation and assign different weights, reducing the error brought by choosing the scale factor manually.
The features of the present invention are as follows:
Step 1, download the training text set and test text set of the Chinese corpus published by Fudan University from the Internet;
Step 2, use the open-source Chinese Academy of Sciences word segmentation software ICTCLAS to preprocess the training and test text sets, including word segmentation and stop-word removal, obtaining the segmented training and test text sets;
Step 3, apply the CHI-based adaptive text feature selection method to the segmented training text set to obtain the feature dictionary corresponding to that training text set;
The computing formula of the traditional CHI text feature selection method is as follows:
\chi^2(t_k, C_i) = \frac{N(AD - BC)^2}{(A + C)(B + D)(A + B)(C + D)}
where N = A + B + C + D is the total number of training documents, A is the number of documents that contain feature t_k and belong to class C_i, B is the number of documents that contain t_k but do not belong to C_i, C is the number of documents that do not contain t_k but belong to C_i, and D is the number of documents that neither contain t_k nor belong to C_i.
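For illustration only (this sketch is not part of the patent text), the traditional CHI score can be computed from the four contingency counts as in the following Python sketch; the function name and the zero-denominator guard are our own additions:

def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Traditional CHI score of feature t_k for class C_i.

    a: documents containing t_k and belonging to C_i
    b: documents containing t_k and not belonging to C_i
    c: documents not containing t_k and belonging to C_i
    d: documents not containing t_k and not belonging to C_i
    """
    n = a + b + c + d  # total number of training documents
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # guard against a degenerate contingency table (our addition)
    return n * (a * d - b * c) ** 2 / denom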
The proposed CHI-based adaptive text feature selection method is formulated as follows:
\chi^2_{new}(t_k, C_i) = \left[\mu \cdot \chi^2(t_k, C_i)^{+} + (1 - \mu) \cdot \chi^2(t_k, C_i)^{-}\right] \cdot \alpha \cdot \beta
\chi^2(t_k, C_i)^{+} = \frac{N(AD - BC)^2}{(A + C)(B + D)(A + B)(C + D)}, \quad AD - BC > 0
\chi^2(t_k, C_i)^{-} = \frac{N(AD - BC)^2}{(A + C)(B + D)(A + B)(C + D)}, \quad AD - BC < 0
where μ is the adaptive factor, α is the word frequency factor, and β is the inter-class variance; α and β are defined as follows:
\alpha = \frac{tf(t_k, C_i)}{\sum_{i=1}^{m} tf(t_k, C_i)} = \frac{\sum_{j=1}^{n} tf(t_k, d_{ij})}{\sum_{i=1}^{m}\sum_{j=1}^{n} tf(t_k, d_{ij})}
where m is the total number of classes in the training set, tf(t_k, C_i) is the number of times feature t_k occurs in class C_i, and the denominator of the first fraction is the number of times t_k occurs in the whole training text set. If class C_i of the training set contains n documents d_{i1}, d_{i2}, …, d_{ij}, …, d_{in}, then tf(t_k, d_{ij}) is the number of times t_k occurs in the j-th document of class C_i, the numerator of the second fraction is the total number of times t_k occurs in class C_i, and its denominator is the total number of times t_k occurs in all documents of the whole training text set. The word frequency factor α is therefore the proportion of t_k's term frequency in class C_i to its term frequency in the whole training set. The larger α is, the more often the feature occurs in the given class and the less often (or hardly at all) it occurs in other classes, so the feature clearly has stronger class discrimination ability; the smaller α is, the less often the feature occurs in the given class and the more often it occurs in other classes, so the feature clearly has weaker class discrimination ability.
\beta = \left(df_i - \overline{df_i}\right) \cdot \sqrt{\left(df_i - \overline{df_i}\right)^2}
where m is the number of classes, df_i is the number of documents in class C_i that contain t_k, and \overline{df_i} is the average number of documents per class that contain t_k; β ≥ 0 indicates that the number of texts in which the feature word appears in the given class is greater than or equal to the mean, and β < 0 indicates that it is less than the mean. The β value measures how far the document frequency of the feature word in a certain class deviates from its mean document frequency over all classes. The larger β is, the more the number of documents containing feature word t_k in class C_i exceeds the average number over all classes, and such a feature has stronger class discrimination ability.
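To tie the parts of the improved score together, here is a hedged Python sketch; the helper names (word_freq_factor, interclass_beta, improved_chi) are our own, and interclass_beta follows the β formula as reconstructed above:

import math

def word_freq_factor(tf_in_class: float, tf_in_corpus: float) -> float:
    """alpha: share of t_k's corpus-wide term frequency that falls in class C_i."""
    return tf_in_class / tf_in_corpus if tf_in_corpus else 0.0

def interclass_beta(df_i: float, df_mean: float) -> float:
    """beta: signed squared deviation of df_i from the per-class mean."""
    dev = df_i - df_mean
    return dev * math.sqrt(dev * dev)  # equals dev * |dev|, so the sign is kept

def improved_chi(a: int, b: int, c: int, d: int,
                 alpha: float, beta: float, mu: float) -> float:
    """Adaptive CHI: the chi-square value goes to the positive branch (weight mu)
    when AD - BC > 0, to the negative branch (weight 1 - mu) when AD - BC < 0,
    and is then scaled by alpha and beta."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    chi2 = n * (a * d - b * c) ** 2 / denom
    sign = a * d - b * c
    weight = mu if sign > 0 else (1.0 - mu) if sign < 0 else 0.0
    return weight * chi2 * alpha * beta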
Step 4, each training text and each test text is represented as a vector over the feature words of the feature dictionary, where the weight of each dimension is computed as TFIDF = TF × IDF. TF (Term Frequency) is the number of times the feature occurs in a document; IDF (Inverse Document Frequency) is computed as IDF = log(M/n_k + 0.01), where M is the number of texts in the document collection and n_k is the number of documents containing the word;
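A minimal sketch of the step-4 weighting (not part of the patent text), assuming the feature dictionary from step 3 is an ordered list of words and reading the 0.01 smoothing term as sitting inside the logarithm, as the formula above suggests:

import math
from collections import Counter

def tfidf_vector(doc_tokens: list, vocabulary: list, doc_freq: dict, m_docs: int) -> list:
    """Represent one segmented document as a TFIDF vector over the feature dictionary.

    doc_tokens: tokens of the document after word segmentation
    vocabulary: feature words selected in step 3, in a fixed order
    doc_freq:   word -> number of documents containing it (n_k)
    m_docs:     total number of documents M in the collection
    """
    tf = Counter(doc_tokens)
    return [tf[w] * math.log(m_docs / doc_freq[w] + 0.01) if doc_freq.get(w) else 0.0
            for w in vocabulary]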
Step 5, carry out KNN classification;
Let the training text set be S and the test text be d; n is the feature vector dimension threshold, and K is set to 35.
Step 5.1, compute the similarity between test text d and every text in S using the cosine of the angle between their vectors;
Step 5.2, take the K texts with the largest similarity obtained in step 5.1 as the K nearest-neighbor texts of test text d;
Step 5.3, compute the weight of test text d for each class and assign d to the class with the largest weight.
Let the known class of training text d_i be C_j; the weight is computed as follows:
W(d, C_j) = \sum_{i=1}^{K} Sim(d, d_i)\, y(d_i, C_j)
where Sim(d, d_i) is the cosine similarity between test text d and the text d_i of known class, computed as follows:
Sim(d, d_i) = \frac{\sum_{j=1}^{n} X_j x_{ij}}{\sqrt{\sum_{j=1}^{n} X_j^2}\,\sqrt{\sum_{j=1}^{n} x_{ij}^2}}
where n is the feature vector dimension threshold, X_j is the weight of the j-th dimension of the text d to be classified (0 < j ≤ n), and x_{ij} is the weight of the j-th dimension of training text vector d_i.
y(d_i, C_j) is the class attribute function: y(d_i, C_j) = 1 if d_i belongs to class C_j, and y(d_i, C_j) = 0 otherwise.
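Steps 5.1 to 5.3 can be summarized in the following sketch (our own function names, not part of the patent text; K = 35 as above):

import math

def cosine_sim(x: list, y: list) -> float:
    """Cosine similarity between two TFIDF vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def knn_classify(test_vec: list, train_vecs: list, train_labels: list, k: int = 35):
    """Weighted KNN vote: each of the K nearest neighbors adds its similarity
    to the weight of its own class (y(d_i, C_j) selects that class)."""
    neighbors = sorted(zip((cosine_sim(test_vec, v) for v in train_vecs), train_labels),
                       reverse=True)[:k]
    weights = {}
    for sim, label in neighbors:
        weights[label] = weights.get(label, 0.0) + sim
    return max(weights, key=weights.get)  # class with the largest weight W(d, C_j)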
Step 6, compute the precision, recall, and F1 value of the KNN classification algorithm; the final value of the scale factor μ is obtained by setting a maximum threshold on the difference between the F1 values of two successive classification runs and a step length by which μ grows, so as to guarantee a higher classification accuracy.
Step 6.1, set the initial F1 value to 0 and the initial μ value to 0.5; ε = 0.0001 is the threshold on the difference between two successive F1 values, and τ = 0.05 is the step length by which the scale factor μ grows;
Step 6.2, repeat step 5 to obtain the value F1′, and compute the difference between the two successive F1 values, ΔF = |F1′ − F1|;
Step 6.3, if ΔF is less than ε, take the current scale factor μ; if ΔF is greater than or equal to ε, set μ′ = μ + τ and F1 = F1′, and repeat steps 6.2 and 6.3 until a suitable scale factor μ is obtained.
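A hedged sketch of the step-6 adjustment loop (not part of the patent text); evaluate_f1 stands in for a re-run of step 5 with the given μ and is assumed to return the resulting F1 value, and the iteration cap is our own safeguard:

def tune_mu(evaluate_f1, mu: float = 0.5, tau: float = 0.05,
            eps: float = 1e-4, max_iter: int = 100) -> float:
    """Grow mu by step tau until the F1 values of two successive runs differ
    by less than eps (steps 6.1 to 6.3)."""
    f1_prev = 0.0                   # step 6.1: initial F1 value is 0
    for _ in range(max_iter):       # cap added so the sketch always terminates
        f1 = evaluate_f1(mu)        # step 6.2: repeat step 5 and obtain F1'
        if abs(f1 - f1_prev) < eps:
            return mu               # step 6.3: Delta F < eps, keep the current mu
        f1_prev, mu = f1, mu + tau  # otherwise mu' = mu + tau, F1 = F1'
    return mu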
Compared with the prior art, the present invention has the following beneficial effects.
The present invention proposes an adaptive feature selection method based on the chi-square statistic, with the KNN algorithm chosen as the classifier for classifying the test texts. The overall flowchart is shown in Fig. 1, the flowchart for computing the scale factor μ is shown in Fig. 2, the accuracy indexes on the balanced corpus are given in Table 1, and the accuracy on the unbalanced corpus is given in Table 2. Compared with the traditional CHI method, on the one hand the word frequency factor and inter-class variance are introduced to reduce CHI's dependence on low-frequency words, selecting features that occur frequently in a certain class and are evenly distributed within that class; on the other hand the adaptive scale factor μ is introduced to classify features by their positive or negative correlation and assign different weights, and the method is applicable to corpora of different distributions, reducing the error brought by choosing the scale factor manually. As can be seen from Tables 1 and 2, compared with the traditional CHI method, the present invention improves classification precision on both the balanced corpus and the unbalanced corpus.
Brief description of the drawings
Fig. 1 is the flowchart of the overall process of the present invention.
Fig. 2 is the flowchart for computing the scale factor μ in the present invention.
Embodiment
The present invention is realized by the following technical means:
An adaptive text feature selection method based on the chi-square statistic. First, the training text set and the test text set are preprocessed, including word segmentation and stop-word removal. Next, adaptive text feature selection based on the chi-square statistic is performed: the word frequency factor α and the inter-class variance β are defined and introduced into the CHI algorithm, and a suitable scale factor μ is added to it. Finally, the scale factor μ is adjusted automatically in combination with the classical KNN algorithm, so that the improved CHI suits different corpora and a higher classification accuracy is guaranteed.
The above adaptive text feature selection method based on the chi-square statistic is applied to text classification through the following steps:
Step 1, download the training text set and test text set of the Chinese corpus published by Fudan University from the Internet;
Step 2, use the word segmentation software ICTCLAS to preprocess the training and test text sets, including word segmentation and stop-word removal, obtaining the segmented training and test text sets;
Step 3, apply the CHI-based adaptive text feature selection method to the segmented training text set to obtain the feature dictionary corresponding to that training text set;
The computing formula of the traditional CHI text feature selection method is as follows:
\chi^2(t_k, C_i) = \frac{N(AD - BC)^2}{(A + C)(B + D)(A + B)(C + D)} \quad (1)
where N = A + B + C + D is the total number of training documents, A is the number of documents that contain feature t_k and belong to class C_i, B is the number of documents that contain t_k but do not belong to C_i, C is the number of documents that do not contain t_k but belong to C_i, and D is the number of documents that neither contain t_k nor belong to C_i.
The proposed CHI-based adaptive text feature selection method is formulated as follows:
\chi^2_{new}(t_k, C_i) = \left[\mu \cdot \chi^2(t_k, C_i)^{+} + (1 - \mu) \cdot \chi^2(t_k, C_i)^{-}\right] \cdot \alpha \cdot \beta \quad (2)
\chi^2(t_k, C_i)^{+} = \frac{N(AD - BC)^2}{(A + C)(B + D)(A + B)(C + D)}, \quad AD - BC > 0 \quad (3)
\chi^2(t_k, C_i)^{-} = \frac{N(AD - BC)^2}{(A + C)(B + D)(A + B)(C + D)}, \quad AD - BC < 0
where μ is the adaptive factor, α is the word frequency factor, and β is the inter-class variance; α and β are defined as follows:
\alpha = \frac{tf(t_k, C_i)}{\sum_{i=1}^{m} tf(t_k, C_i)} = \frac{\sum_{j=1}^{n} tf(t_k, d_{ij})}{\sum_{i=1}^{m}\sum_{j=1}^{n} tf(t_k, d_{ij})} \quad (4)
where m is the total number of classes in the training set, tf(t_k, C_i) is the number of times feature t_k occurs in class C_i, and the denominator of the first fraction is the number of times t_k occurs in the whole training text set. If class C_i of the training set contains n documents d_{i1}, d_{i2}, …, d_{ij}, …, d_{in}, then tf(t_k, d_{ij}) is the number of times t_k occurs in the j-th document of class C_i, the numerator of the second fraction is the total number of times t_k occurs in class C_i, and its denominator is the total number of times t_k occurs in all documents of the whole training text set. The word frequency factor α is therefore the proportion of t_k's term frequency in class C_i to its term frequency in the whole training set. The larger α is, the more often the feature occurs in the given class and the less often (or hardly at all) it occurs in other classes, so the feature clearly has stronger class discrimination ability; the smaller α is, the less often the feature occurs in the given class and the more often it occurs in other classes, so the feature clearly has weaker class discrimination ability.
\beta = \left(df_i - \overline{df_i}\right) \cdot \sqrt{\left(df_i - \overline{df_i}\right)^2} \quad (5)
where m is the number of classes, df_i is the number of documents in class C_i that contain t_k, and \overline{df_i} is the average number of documents per class that contain t_k; β ≥ 0 indicates that the number of texts in which the feature word appears in the given class is greater than or equal to the mean, and β < 0 indicates that it is less than the mean. The β value measures how far the document frequency of the feature word in a certain class deviates from its mean document frequency over all classes. The larger β is, the more the number of documents containing feature word t_k in class C_i exceeds the average number over all classes, and such a feature has stronger class discrimination ability.
Step 4, each training text and each test text is represented as a vector over the feature words of the feature dictionary, where the weight of each dimension is computed as TFIDF = TF × IDF. TF (Term Frequency) is the number of times the feature occurs in a document; IDF (Inverse Document Frequency) is computed as IDF = log(M/n_k + 0.01), where M is the number of texts in the document collection and n_k is the number of documents containing the word;
Step 5, carry out KNN classification;
Let the training text set be S and the test text be d; n is the feature vector dimension threshold, and K is set to 35.
Compute the similarity between test text d and every text in S using the cosine of the angle between their vectors; take the K texts with the largest computed similarity as the K nearest neighbors of test text d; compute the weight of test text d for each class and assign d to the class with the largest weight.
Let the known class of training text d_i be C_j; the weight is computed as follows:
W(d, C_j) = \sum_{i=1}^{K} Sim(d, d_i)\, y(d_i, C_j) \quad (6)
where Sim(d, d_i) is the cosine similarity between test text d and the text d_i of known class, computed as follows:
Sim(d, d_i) = \frac{\sum_{j=1}^{n} X_j x_{ij}}{\sqrt{\sum_{j=1}^{n} X_j^2}\,\sqrt{\sum_{j=1}^{n} x_{ij}^2}} \quad (7)
where n is the feature vector dimension threshold, X_j is the weight of the j-th dimension of the text d to be classified (0 < j ≤ n), and x_{ij} is the weight of the j-th dimension of training text vector d_i.
y(d_i, C_j) is the class attribute function: y(d_i, C_j) = 1 if d_i belongs to class C_j, and y(d_i, C_j) = 0 otherwise.
Step 6, compute the precision, recall, and F1 value of the KNN classification algorithm; the value of the final scale factor is obtained by setting a maximum threshold on the difference between the F1 values of two successive classification runs and a step length by which the scale factor μ grows.
Set the initial F1 value to 0 and the initial μ value to 0.5; ε = 0.0001 is the threshold on the difference between two successive F1 values, and τ = 0.05 is the step length by which the scale factor μ grows.
Repeat step 5 to obtain the value F1′ and compute the difference between the two successive F1 values, ΔF = |F1′ − F1|. If ΔF is less than ε, take the current scale factor μ; if ΔF is greater than or equal to ε, set μ′ = μ + τ and F1 = F1′, and repeat this step until a suitable scale factor μ is obtained, guaranteeing a higher classification accuracy.
Table 1. Comparison of results before and after the algorithm improvement (balanced corpus) (%)
Table 2. Comparison of results before and after the algorithm improvement (unbalanced corpus) (%)

Claims (1)

1. An adaptive feature selection method based on the chi-square statistic, characterized by comprising the following steps:
Step 1, download the training text set and test text set of the Chinese corpus published by Fudan University from the Internet;
Step 2, use the word segmentation software ICTCLAS to preprocess the training and test text sets by word segmentation and stop-word removal, obtaining the segmented training and test text sets;
Step 3, apply the CHI-based adaptive text feature selection method to the segmented training text set to obtain the feature dictionary corresponding to that training text set;
The computing formula of the traditional CHI text feature selection method is as follows:
\chi^2(t_k, C_i) = \frac{N(AD - BC)^2}{(A + C)(B + D)(A + B)(C + D)}
where N = A + B + C + D is the total number of training documents, A is the number of documents that contain feature t_k and belong to class C_i, B is the number of documents that contain t_k but do not belong to C_i, C is the number of documents that do not contain t_k but belong to C_i, and D is the number of documents that neither contain t_k nor belong to C_i;
The proposed CHI-based adaptive text feature selection method is formulated as follows:
\chi^2_{new}(t_k, C_i) = \left[\mu \cdot \chi^2(t_k, C_i)^{+} + (1 - \mu) \cdot \chi^2(t_k, C_i)^{-}\right] \cdot \alpha \cdot \beta
\chi^2(t_k, C_i)^{+} = \frac{N(AD - BC)^2}{(A + C)(B + D)(A + B)(C + D)}, \quad AD - BC > 0
\chi^2(t_k, C_i)^{-} = \frac{N(AD - BC)^2}{(A + C)(B + D)(A + B)(C + D)}, \quad AD - BC < 0
where μ is the adaptive factor, α is the word frequency factor, and β is the inter-class variance; α and β are defined as follows:
\alpha = \frac{tf(t_k, C_i)}{\sum_{i=1}^{m} tf(t_k, C_i)} = \frac{\sum_{j=1}^{n} tf(t_k, d_{ij})}{\sum_{i=1}^{m}\sum_{j=1}^{n} tf(t_k, d_{ij})}
where m is the total number of classes in the training set, tf(t_k, C_i) is the number of times feature t_k occurs in class C_i, and the denominator of the first fraction is the number of times t_k occurs in the whole training text set; if class C_i of the training set contains n documents d_{i1}, d_{i2}, …, d_{ij}, …, d_{in}, then tf(t_k, d_{ij}) is the number of times t_k occurs in the j-th document of class C_i, the numerator of the second fraction is the total number of times t_k occurs in class C_i, and its denominator is the total number of times t_k occurs in all documents of the whole training text set; the word frequency factor α is therefore the proportion of t_k's term frequency in class C_i to its term frequency in the whole training set; the larger α is, the more often the feature occurs in the given class and the less often (or hardly at all) it occurs in other classes, so the feature clearly has stronger class discrimination ability; the smaller α is, the less often the feature occurs in the given class and the more often it occurs in other classes, so the feature clearly has weaker class discrimination ability;
\beta = \left(df_i - \overline{df_i}\right) \cdot \sqrt{\left(df_i - \overline{df_i}\right)^2}
where m is the number of classes, df_i is the number of documents in class C_i that contain t_k, and \overline{df_i} is the average number of documents per class that contain t_k; β ≥ 0 indicates that the number of texts in which the feature word appears in the given class is greater than or equal to the mean, and β < 0 indicates that it is less than the mean; the β value measures how far the document frequency of the feature word in a certain class deviates from its mean document frequency over all classes; the larger β is, the more the number of documents containing feature word t_k in class C_i exceeds the average number over all classes, and such a feature has stronger class discrimination ability;
Step 4, each training text and each test text is represented as a vector over the feature words of the feature dictionary, where the weight of each dimension is computed as TFIDF = TF × IDF; TF (Term Frequency) is the number of times the feature occurs in a document; IDF (Inverse Document Frequency) is computed as IDF = log(M/n_k + 0.01), where M is the number of texts in the document collection and n_k is the number of documents containing the word;
Step 5, carry out KNN classification;
Let the training text set be S and the test text be d; n is the feature vector dimension threshold, and K is set to 35;
Step 5.1, compute the similarity between test text d and every text in S using the cosine of the angle between their vectors;
Step 5.2, take the K texts with the largest similarity obtained in step 5.1 as the K nearest-neighbor texts of test text d;
Step 5.3, compute the weight of test text d for each class and assign d to the class with the largest weight;
Let the known class of training text d_i be C_j; the weight is computed as follows:
W(d, C_j) = \sum_{i=1}^{K} Sim(d, d_i)\, y(d_i, C_j)
where Sim(d, d_i) is the cosine similarity between test text d and the text d_i of known class, computed as follows:
Sim(d, d_i) = \frac{\sum_{j=1}^{n} X_j x_{ij}}{\sqrt{\sum_{j=1}^{n} X_j^2}\,\sqrt{\sum_{j=1}^{n} x_{ij}^2}}
where n is the feature vector dimension threshold, X_j is the weight of the j-th dimension of the text d to be classified (0 < j ≤ n), and x_{ij} is the weight of the j-th dimension of training text vector d_i;
y(d_i, C_j) is the class attribute function: y(d_i, C_j) = 1 if d_i belongs to class C_j, and y(d_i, C_j) = 0 otherwise;
Step 6, compute the precision, recall, and F1 value of the KNN classification algorithm; the final value of the scale factor μ is obtained by setting a maximum threshold on the difference between the F1 values of two successive classification runs and a step length by which μ grows, so as to guarantee a higher classification accuracy;
Step 6.1, set the initial F1 value to 0 and the initial μ value to 0.5; ε = 0.0001 is the threshold on the difference between two successive F1 values, and τ = 0.05 is the step length by which the scale factor μ grows;
Step 6.2, repeat step 5 to obtain the value F1′, and compute the difference between the two successive F1 values, ΔF = |F1′ − F1|;
Step 6.3, if ΔF is less than ε, take the current scale factor μ; if ΔF is greater than or equal to ε, set μ′ = μ + τ and F1 = F1′, and repeat steps 6.2 and 6.3 until a suitable scale factor μ is obtained, guaranteeing a higher classification accuracy.
CN201510927759.9A 2015-12-14 2015-12-14 An adaptive feature selection method based on chi-square statistics Expired - Fee Related CN105512311B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510927759.9A CN105512311B (en) 2015-12-14 2015-12-14 An adaptive feature selection method based on chi-square statistics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510927759.9A CN105512311B (en) 2015-12-14 2015-12-14 An adaptive feature selection method based on chi-square statistics

Publications (2)

Publication Number Publication Date
CN105512311A true CN105512311A (en) 2016-04-20
CN105512311B CN105512311B (en) 2019-02-26

Family

ID=55720291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510927759.9A Expired - Fee Related CN105512311B (en) 2015-12-14 2015-12-14 An adaptive feature selection method based on chi-square statistics

Country Status (1)

Country Link
CN (1) CN105512311B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090190839A1 (en) * 2008-01-29 2009-07-30 Higgins Derrick C System and method for handling the confounding effect of document length on vector-based similarity scores
CN103678274A (en) * 2013-04-15 2014-03-26 南京邮电大学 Feature extraction method for text categorization based on improved mutual information and entropy
CN103886108A (en) * 2014-04-13 2014-06-25 北京工业大学 Feature selection and weight calculation method of imbalance text set
CN104750844A (en) * 2015-04-09 2015-07-01 中南大学 Method and device for generating text characteristic vectors based on TF-IGM, method and device for classifying texts

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU Haifeng, "Optimized mutual information text feature selection method based on word frequency", Computer Engineering (《计算机工程》) *

Cited By (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021461A (en) * 2016-05-17 2016-10-12 深圳市中润四方信息技术有限公司 Text classification method and text classification system
CN108073567B (en) * 2016-11-16 2021-12-28 北京嘀嘀无限科技发展有限公司 Feature word extraction processing method, system and server
CN108073567A (en) * 2016-11-16 2018-05-25 北京嘀嘀无限科技发展有限公司 A kind of Feature Words extraction process method, system and server
CN108090088A (en) * 2016-11-23 2018-05-29 北京国双科技有限公司 Feature extracting method and device
CN106611057A (en) * 2016-12-27 2017-05-03 上海利连信息科技有限公司 Text classification feature selection approach for importance weighing
CN106611057B (en) * 2016-12-27 2019-08-13 上海利连信息科技有限公司 The text classification feature selection approach of importance weighting
CN107291837A (en) * 2017-05-31 2017-10-24 北京大学 A kind of segmenting method of the network text based on field adaptability
CN107291837B (en) * 2017-05-31 2020-04-03 北京大学 Network text word segmentation method based on field adaptability
CN107256214A (en) * 2017-06-30 2017-10-17 联想(北京)有限公司 A kind of junk information determination methods and device and a kind of server cluster
CN107256214B (en) * 2017-06-30 2020-09-25 联想(北京)有限公司 Junk information judgment method and device and server cluster
CN107577794A (en) * 2017-09-19 2018-01-12 北京神州泰岳软件股份有限公司 A kind of news category method and device
CN107577794B (en) * 2017-09-19 2019-07-05 北京神州泰岳软件股份有限公司 A kind of news category method and device
CN108197307A (en) * 2018-01-31 2018-06-22 湖北工业大学 The selection method and system of a kind of text feature
CN108491429A (en) * 2018-02-09 2018-09-04 湖北工业大学 A kind of feature selection approach based on document frequency and word frequency statistics between class in class
CN108376130A (en) * 2018-03-09 2018-08-07 长安大学 A kind of objectionable text information filtering feature selection approach
CN108346474A (en) * 2018-03-14 2018-07-31 湖南省蓝蜻蜓网络科技有限公司 The electronic health record feature selection approach of distribution within class and distribution between class based on word
CN108920545B (en) * 2018-06-13 2021-07-09 四川大学 Chinese emotion feature selection method based on extended emotion dictionary and chi-square model
CN108920545A (en) * 2018-06-13 2018-11-30 四川大学 The Chinese affective characteristics selection method of sentiment dictionary and Ka Fang model based on extension
CN109325511A (en) * 2018-08-01 2019-02-12 昆明理工大学 A kind of algorithm improving feature selecting
CN109543037A (en) * 2018-11-21 2019-03-29 南京安讯科技有限责任公司 A kind of article classification method based on improved TF-IDF
CN112256865A (en) * 2019-01-31 2021-01-22 青岛科技大学 Chinese text classification method based on classifier
CN112256865B (en) * 2019-01-31 2023-03-21 青岛科技大学 Chinese text classification method based on classifier
CN110069630A (en) * 2019-03-20 2019-07-30 重庆信科设计有限公司 A kind of improved mutual information feature selection approach
CN110705247A (en) * 2019-08-30 2020-01-17 山东科技大学 Based on x2-C text similarity calculation method
CN110688481A (en) * 2019-09-02 2020-01-14 贵州航天计量测试技术研究所 Text classification feature selection method based on chi-square statistic and IDF
CN111144106A (en) * 2019-12-20 2020-05-12 山东科技大学 Two-stage text feature selection method under unbalanced data set
CN111144106B (en) * 2019-12-20 2023-05-02 山东科技大学 Two-stage text feature selection method under unbalanced data set
CN111062212B (en) * 2020-03-18 2020-06-30 北京热云科技有限公司 Feature extraction method and system based on optimized TFIDF
CN111062212A (en) * 2020-03-18 2020-04-24 北京热云科技有限公司 Feature extraction method and system based on optimized TFIDF
CN112200259A (en) * 2020-10-19 2021-01-08 哈尔滨理工大学 Information gain text feature selection method and classification device based on classification and screening
CN113032564A (en) * 2021-03-22 2021-06-25 建信金融科技有限责任公司 Feature extraction method, feature extraction device, electronic equipment and storage medium
CN113032564B (en) * 2021-03-22 2023-05-30 建信金融科技有限责任公司 Feature extraction method, device, electronic equipment and storage medium
CN113515623A (en) * 2021-04-28 2021-10-19 西安理工大学 Feature selection method based on word frequency difference factor
CN113378567B (en) * 2021-07-05 2022-05-10 广东工业大学 Chinese short text classification method for improving low-frequency words
CN113378567A (en) * 2021-07-05 2021-09-10 广东工业大学 Chinese short text classification method for improving low-frequency words

Also Published As

Publication number Publication date
CN105512311B (en) 2019-02-26

Similar Documents

Publication Publication Date Title
CN105512311A (en) Chi square statistic based self-adaption feature selection method
CN105224695B (en) A kind of text feature quantization method and device and file classification method and device based on comentropy
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
CN105426426A (en) KNN text classification method based on improved K-Medoids
US10346257B2 (en) Method and device for deduplicating web page
Faguo et al. Research on short text classification algorithm based on statistics and rules
CN104750844A (en) Method and device for generating text characteristic vectors based on TF-IGM, method and device for classifying texts
CN103995876A (en) Text classification method based on chi square statistics and SMO algorithm
CN103324628A (en) Industry classification method and system for text publishing
CN105760889A (en) Efficient imbalanced data set classification method
Liliana et al. Indonesian news classification using support vector machine
CN102955857A (en) Class center compression transformation-based text clustering method in search engine
CN103678274A (en) Feature extraction method for text categorization based on improved mutual information and entropy
CN105975518A (en) Information entropy-based expected cross entropy feature selection text classification system and method
Fitriyani et al. The K-means with mini batch algorithm for topics detection on online news
CN109271517A (en) IG TF-IDF Text eigenvector generates and file classification method
Dan et al. Research of text categorization on Weka
Xu et al. An improved information gain feature selection algorithm for SVM text classifier
CN108920545B (en) Chinese emotion feature selection method based on extended emotion dictionary and chi-square model
CN108153899B (en) Intelligent text classification method
CN104809229B (en) A kind of text feature word extracting method and system
Cai et al. Application of an improved CHI feature selection algorithm
Zhang et al. A hot spot clustering method based on improved kmeans algorithm
Shen et al. A cross-database comparison to discover potential product opportunities using text mining and cosine similarity
Wei et al. The instructional design of Chinese text classification based on SVM

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190226

Termination date: 20211214

CF01 Termination of patent right due to non-payment of annual fee