CN105512311B - An adaptive feature selection method based on chi-square statistics - Google Patents

An adaptive feature selection method based on chi-square statistics

Info

Publication number
CN105512311B
CN105512311B CN201510927759.9A CN201510927759A CN105512311B
Authority
CN
China
Prior art keywords
text
classification
feature
indicate
chi
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201510927759.9A
Other languages
Chinese (zh)
Other versions
CN105512311A (en)
Inventor
汪友生
樊存佳
王雨婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201510927759.9A priority Critical patent/CN105512311B/en
Publication of CN105512311A publication Critical patent/CN105512311A/en
Application granted granted Critical
Publication of CN105512311B publication Critical patent/CN105512311B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 — Handling natural language data
    • G06F40/20 — Natural language analysis
    • G06F40/205 — Parsing
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 — Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 — Data mining

Abstract

An adaptive feature selection method based on chi-square statistics, relating to the field of computer text data processing. The training text set and the test text set are first preprocessed, including word segmentation and stop-word removal. Adaptive text feature selection based on chi-square statistics is then performed: a word-frequency factor and an inter-class variance are defined and introduced into the CHI algorithm, and a suitable scale factor is added to it. Finally, using the evaluation index of the classical KNN algorithm, the scale factor is adjusted automatically, so that the improved CHI fits different corpora and guarantees higher classification accuracy. Experimental results show that, compared with the traditional CHI method, the invention improves classification precision on both balanced and unbalanced corpora.

Description

An adaptive feature selection method based on chi-square statistics
Technical field
The present invention relates to the field of computer text data processing, and in particular to an adaptive text feature selection method based on chi-square statistics (χ², CHI).
Background art
In the current era of big data, mining the potential value of data is of great importance, and data mining, as the technology for discovering that value, has attracted wide attention. Text accounts for a considerable proportion of big data, and text classification, as an effective data mining method for organizing and managing text data, has increasingly become a focus of attention. It is widely used in information filtering, information organization and management, information retrieval, digital libraries, spam filtering and other fields. Text classification (Text Classification, TC) refers to the process of automatically assigning a text of unknown category to one or more categories of a previously given category system according to its content. Common text classification methods include K-nearest neighbors (K-Nearest-Neighbor, KNN), naive Bayes (Naive Bayes, NB) and support vector machines (Support Vector Machine, SVM).
At present, the text classification process includes preprocessing, feature dimensionality reduction, text representation, classifier learning and evaluation. The most commonly used text representation is the vector space model; the high dimensionality and sparsity of the vector space increase time and space complexity and strongly affect classification precision, so the feature dimensionality reduction step is crucial: it directly determines the efficiency and accuracy of classification. Feature dimensionality reduction comprises two families of methods, feature extraction (Feature Extraction) and feature selection (Feature Selection). Feature extraction based on linguistics requires natural language processing techniques and has high computational complexity, whereas feature selection methods based on statistical theory have lower complexity and do not require much background knowledge, so feature selection is more widely applied. The basic idea of feature selection is to construct an evaluation function, score each feature item of the feature set, rank all feature items by their evaluation scores, and select a certain number of features as the final text feature set. Common feature selection methods include chi-square statistics, document frequency (Document Frequency, DF), information gain (Information Gain, IG), mutual information (Mutual Information, MI), expected cross entropy (Expected Cross Entropy, ECE) and weight of text evidence (Weight of Evidence, WE).
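For illustration only, the ranking-and-selection step common to all of these evaluation functions can be sketched in Python as follows (the function and parameter names are illustrative, not taken from the patent):

```python
def select_features(scores, num_features):
    """Rank all feature items by their evaluation score and keep the top ones.

    scores: dict mapping each candidate feature word to its evaluation score.
    num_features: size of the final feature dictionary.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:num_features]
```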
As one of the commonly used text feature selection methods, CHI is simple to implement and has low time complexity; but it also has shortcomings that make the classification effect unsatisfactory. The deficiencies of the CHI algorithm mainly involve two aspects: first, CHI only considers the document frequency of a feature item and ignores its term frequency, which inflates the weight of low-frequency words; second, it amplifies the weight of feature items that occur rarely in a given category but often in other categories. Many researchers have improved the CHI algorithm, and the improvements fall into two groups. The first introduces several adjustment parameters to reduce the bias toward low-frequency words, but does not consider the positive and negative correlation between feature items and categories. The second introduces a scale factor and assigns different weights according to the positive or negative correlation, improving the feature selection ability of the CHI model, but the scale factor has to be chosen by experience. In view of the deficiencies of the existing improved CHI algorithms, designing a CHI text feature selection method with high classification precision has important academic significance and practical value.
Summary of the invention
The object of the present invention is to provide an improved CHI text feature selection method to raise text classification accuracy. On the one hand, a word-frequency factor and an inter-class variance are introduced to reduce the bias of CHI toward low-frequency words and to select feature items that occur with high frequency in a given category and are evenly distributed within it; on the other hand, an adaptive scale factor μ is introduced, assigning different weights according to the positive or negative correlation, so as to reduce the error caused by choosing the scale factor manually.
The invention is characterized by the following steps:
Step 1, download the Chinese corpus published by Fudan University from the Internet, consisting of a training text set and a test text set;
Step 2, preprocess the training text set and the test text set with the Chinese Academy of Sciences open-source segmentation software ICTCLAS, including word segmentation and stop-word removal, obtaining the segmented training text set and test text set;
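For illustration only, a minimal preprocessing sketch is given below; it uses the open-source jieba segmenter as a stand-in for ICTCLAS and assumes a hypothetical stopwords.txt file, neither of which is specified by the patent:

```python
import jieba  # open-source segmenter, used here as a stand-in for ICTCLAS

def load_stopwords(path="stopwords.txt"):
    """Load a stop-word list (hypothetical file, one word per line)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def preprocess(text, stopwords):
    """Segment a raw Chinese document and drop stop words and whitespace tokens."""
    return [tok for tok in jieba.lcut(text) if tok.strip() and tok not in stopwords]
```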
Step 3, perform feature selection on the segmented training text set with the CHI-based adaptive text feature selection method, obtaining the feature dictionary of the training text set;
The calculation formula of the traditional CHI text feature selection method is as follows:
χ²(tk, Ci) = N·(A·D − B·C)² / [(A + B)(C + D)(A + C)(B + D)]
where N = A + B + C + D is the total number of documents, A denotes the number of documents that contain feature tk and belong to category Ci, B denotes the number of documents that contain tk and do not belong to Ci, C denotes the number of documents that do not contain tk and belong to Ci, and D denotes the number of documents that do not contain tk and do not belong to Ci.
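For reference, a small Python sketch of this traditional CHI score, computed directly from the counts A, B, C and D defined above (the helper name is illustrative):

```python
def chi_square(A, B, C, D):
    """Traditional CHI score of feature tk for class Ci.

    A: documents containing tk and belonging to Ci
    B: documents containing tk and not belonging to Ci
    C: documents not containing tk and belonging to Ci
    D: documents not containing tk and not belonging to Ci
    """
    N = A + B + C + D
    denom = (A + B) * (C + D) * (A + C) * (B + D)
    return N * (A * D - B * C) ** 2 / denom if denom else 0.0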
The formula of the proposed CHI-based adaptive text feature selection method is as follows:
χnew²(tk, Ci) = [μ·χ²(tk, Ci)⁺ + (1 − μ)·χ²(tk, Ci)⁻]·α·β
where μ is the scale factor, χ²(tk, Ci)⁺ and χ²(tk, Ci)⁻ denote the CHI values of positively and negatively correlated feature items, α is the word-frequency factor and β is the inter-class variance; α is given by
α = tf(tk, Ci) / Σ_{i=1..m} tf(tk, Ci)
where m is the total number of categories in the training set, tf(tk, Ci) denotes the number of occurrences of feature item tk in category Ci, and Σ_{i=1..m} tf(tk, Ci) denotes the number of occurrences of tk in the whole training text set. Suppose category Ci in the training set contains p documents di1, di2, ..., dij, ..., dip; then tf(tk, dij) denotes the number of occurrences of tk in the j-th document of Ci, tf(tk, Ci) = Σ_{j=1..p} tf(tk, dij) is the total number of occurrences of tk in Ci, and the denominator above is the total number of occurrences of tk in all documents of the training text set. The word-frequency factor α is thus the ratio of the occurrences of tk within category Ci to the occurrences of tk in the whole training set. The larger α is, the more frequently the feature item occurs in the given category and the less often (or hardly at all) it occurs in other categories, so such a feature item has stronger class discrimination ability; the smaller α is, the less frequently the feature item occurs in the given category and the more often it occurs in other categories, so it has weaker class discrimination ability.
Here m denotes the number of categories, dfi is the number of documents in category Ci that contain tk, and d̄f = (1/m)·Σ_{i=1..m} dfi is the average number of documents per category that contain tk; dfi greater than or equal to d̄f means the feature word appears in more documents of the category than the average, and dfi below d̄f means it appears in fewer. The value of β measures how far the document frequency of the feature word in one category departs from the average document frequency over all categories. The larger β is, the more the number of documents containing tk in category Ci exceeds the average over all categories, and such a feature item has stronger class discrimination ability.
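A minimal Python sketch of the improved scoring step follows. Here chi_pos and chi_neg stand for χ²(tk, Ci)⁺ and χ²(tk, Ci)⁻, alpha follows the ratio given above, and because the exact expression for β is not reproduced in this text, beta_factor uses the deviation of the class document frequency from the per-class average as an assumed stand-in:

```python
def alpha_factor(tf_in_class, tf_in_corpus):
    """Word-frequency factor alpha: share of tk's occurrences that fall inside Ci."""
    return tf_in_class / tf_in_corpus if tf_in_corpus else 0.0

def beta_factor(df_in_class, df_per_class):
    """Inter-class variance factor beta.  The patent's exact expression is not
    reproduced above, so this uses the deviation of the class document frequency
    from the per-class average as an assumed stand-in."""
    mean_df = sum(df_per_class) / len(df_per_class)
    return df_in_class - mean_df

def adaptive_chi(chi_pos, chi_neg, mu, alpha, beta):
    """Improved score: positively / negatively correlated CHI parts weighted by the
    scale factor mu, then scaled by the word-frequency factor and beta."""
    return (mu * chi_pos + (1.0 - mu) * chi_neg) * alpha * beta
```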
Step 4, represent each training text and each test text as a vector over the feature words of the feature dictionary. The weight of each dimension is computed as TFIDF = TF × IDF, where TF (Term Frequency) is the term frequency, i.e. the number of times the feature item occurs in the document, and IDF (Inverse Document Frequency) is the inverse document frequency, IDF = log(M/nk + 0.01), where M is the number of texts in the document collection and nk is the number of documents containing the word;
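A short sketch of this TF-IDF weighting, using the same smoothing constant; the helper names and the doc_freq mapping are illustrative assumptions:

```python
import math

def tfidf_weight(tf, n_k, M):
    """Weight of one dimension: TF * IDF with IDF = log(M / n_k + 0.01), where M is
    the number of texts in the collection and n_k the number of documents containing
    the word."""
    return tf * math.log(M / n_k + 0.01)

def vectorize(tokens, feature_words, doc_freq, M):
    """Represent one (training or test) text as a TF-IDF vector over the feature dictionary."""
    return [tfidf_weight(tokens.count(w), doc_freq.get(w, 1), M) for w in feature_words]
```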
Step 5, KNN classification is carried out;
The training text set is S, the test text is d, n is the feature vector dimension threshold, and K is set to 35.
Step 5.1, compute the similarity between test text d and every text in S using the cosine of the angle between their vectors;
Step 5.2, select the K texts with the largest similarities obtained in step 5.1 as the K nearest-neighbor texts of test text d;
Step 5.3, compute the weight of test text d for each category and assign test text d to the category with the largest weight.
If the known category of training text di is Cj, the weight is computed as
W(d, Cj) = Σ_{di ∈ KNN(d)} Sim(d, di)·y(di, Cj)
where Sim(d, di) is the cosine similarity between test text d and the known-category text di:
Sim(d, di) = Σ_{j=1..n} Xj·xij / (sqrt(Σ_{j=1..n} Xj²)·sqrt(Σ_{j=1..n} xij²))
where n is the feature vector dimension threshold, Xj (0 < j ≤ n) denotes the weight of the j-th dimension of the text d to be classified, and xij denotes the weight of the j-th dimension of training text vector di.
y(di, Cj) is the category attribution function: y(di, Cj) = 1 if di belongs to category Cj, and 0 otherwise.
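Steps 5.1 to 5.3 can be sketched as follows; the function signatures are illustrative, and K defaults to 35 as in the text:

```python
import numpy as np

def cosine_sim(x, y):
    """Sim(d, di): cosine of the angle between two TF-IDF vectors (0 for zero vectors)."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    return float(np.dot(x, y) / (nx * ny)) if nx and ny else 0.0

def knn_classify(test_vec, train_vecs, train_labels, k=35):
    """Rank the training texts by similarity to the test text, keep the K nearest,
    accumulate Sim(d, di) * y(di, Cj) per class, and return the class with the
    largest weight."""
    sims = [cosine_sim(test_vec, tv) for tv in train_vecs]
    nearest = sorted(range(len(sims)), key=sims.__getitem__, reverse=True)[:k]
    weights = {}
    for i in nearest:
        weights[train_labels[i]] = weights.get(train_labels[i], 0.0) + sims[i]
    return max(weights, key=weights.get)
```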
Step 6, compute the precision, recall and F1 value of the KNN classification algorithm; by setting a maximum threshold on the difference between the F1 values of two consecutive classification runs and a step size by which the scale factor μ is increased, obtain the final value of the scale factor μ, so as to guarantee higher classification accuracy.
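The F1 value combines precision and recall in the usual way; a one-line helper for reference:

```python
def f1_score(precision, recall):
    """F1 value from precision and recall (0 when both are 0)."""
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
```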
Step 6.1, set the initial F1 value to 0 and the initial μ value to 0.5; ε = 0.0001 is the threshold on the difference between two consecutive F1 values, and τ = 0.05 is the step size by which the scale factor μ is increased;
Step 6.2, repeat step 5 to obtain F1′, and compute the difference between the two consecutive F1 values, ΔF = |F1′ − F1|;
Step 6.3, if ΔF is less than ε, the current scale factor μ is taken as the result; if ΔF is greater than or equal to ε, let μ = μ + τ and F1 = F1′, and repeat steps 6.2 and 6.3 until a suitable scale factor μ is obtained.
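A sketch of this adjustment loop, assuming a hypothetical evaluate_f1 callback that reruns steps 3 to 5 for a given μ and returns the F1 value; the cap of μ at 1.0 is an added safeguard, not part of the text:

```python
def tune_mu(evaluate_f1, mu=0.5, eps=1e-4, tau=0.05):
    """Increase mu by tau until two consecutive F1 values differ by less than eps."""
    f1_prev = 0.0
    while mu <= 1.0:  # mu is a proportion; the upper bound is an assumed safeguard
        f1_new = evaluate_f1(mu)
        if abs(f1_new - f1_prev) < eps:
            break
        f1_prev, mu = f1_new, mu + tau
    return mu
```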
Compared with the prior art, the present invention has the following advantages.
The present invention proposes an adaptive feature selection method based on chi-square statistics; the KNN algorithm is chosen as the classifier for the test texts. The overall process flow chart is shown in Fig. 1 and the flow chart for computing the scale factor μ in Fig. 2; the accuracy indices on the balanced corpus are listed in Table 1 and those on the unbalanced corpus in Table 2. Compared with the traditional CHI method, on the one hand the word-frequency factor and the inter-class variance are introduced to reduce the bias of CHI toward low-frequency words and to select feature items that occur with high frequency in a given category and are evenly distributed within it; on the other hand the adaptive scale factor μ is introduced, which assigns different weights according to the positive or negative correlation and makes the method suitable for corpora with different distributions, reducing the error caused by choosing the scale factor manually. As can be seen from Table 1 and Table 2, compared with the traditional CHI method, the invention improves classification precision on both the balanced and the unbalanced corpus.
Description of the drawings
Fig. 1 is the flow chart of the overall process of the present invention.
Fig. 2 is the flow chart for computing the scale factor μ in the present invention.
Specific embodiment
The present invention is realized by the following technical means:
An adaptive text feature selection method based on chi-square statistics. First, the training text set and the test text set are preprocessed, including word segmentation and stop-word removal. Second, the adaptive text feature selection based on chi-square statistics is carried out: the word-frequency factor α and the inter-class variance β are defined and introduced into the CHI algorithm, and a suitable scale factor μ is added to it. Finally, combined with the classical KNN algorithm, the scale factor μ is adjusted automatically so that the improved CHI fits different corpora and guarantees higher classification accuracy.
The above adaptive text feature selection method based on chi-square statistics is used for text classification and comprises the following steps:
Step 1, download the Chinese corpus published by Fudan University from the Internet, consisting of a training text set and a test text set;
Step 2, preprocess the training text set and the test text set with the segmentation software ICTCLAS, including word segmentation and stop-word removal, obtaining the segmented training text set and test text set;
Step 3, perform feature selection on the segmented training text set with the CHI-based adaptive text feature selection method, obtaining the feature dictionary of the training text set;
The calculation formula of the traditional CHI text feature selection method is as follows:
χ²(tk, Ci) = N·(A·D − B·C)² / [(A + B)(C + D)(A + C)(B + D)]
where N = A + B + C + D is the total number of documents, A denotes the number of documents that contain feature tk and belong to category Ci, B denotes the number of documents that contain tk and do not belong to Ci, C denotes the number of documents that do not contain tk and belong to Ci, and D denotes the number of documents that do not contain tk and do not belong to Ci.
The formula of the proposed CHI-based adaptive text feature selection method is as follows:
χnew²(tk, Ci) = [μ·χ²(tk, Ci)⁺ + (1 − μ)·χ²(tk, Ci)⁻]·α·β (2)
where μ is the scale factor, χ²(tk, Ci)⁺ and χ²(tk, Ci)⁻ denote the CHI values of positively and negatively correlated feature items, α is the word-frequency factor and β is the inter-class variance; α is given by
α = tf(tk, Ci) / Σ_{i=1..m} tf(tk, Ci)
where m is the total number of categories in the training set, tf(tk, Ci) denotes the number of occurrences of feature item tk in category Ci, and Σ_{i=1..m} tf(tk, Ci) denotes the number of occurrences of tk in the whole training text set. Suppose category Ci in the training set contains p documents di1, di2, ..., dij, ..., dip; then tf(tk, dij) denotes the number of occurrences of tk in the j-th document of Ci, tf(tk, Ci) = Σ_{j=1..p} tf(tk, dij) is the total number of occurrences of tk in Ci, and the denominator above is the total number of occurrences of tk in all documents of the training text set. The word-frequency factor α is thus the ratio of the occurrences of tk within category Ci to the occurrences of tk in the whole training set. The larger α is, the more frequently the feature item occurs in the given category and the less often (or hardly at all) it occurs in other categories, so such a feature item has stronger class discrimination ability; the smaller α is, the less frequently the feature item occurs in the given category and the more often it occurs in other categories, so it has weaker class discrimination ability.
Here m denotes the number of categories, dfi is the number of documents in category Ci that contain tk, and d̄f = (1/m)·Σ_{i=1..m} dfi is the average number of documents per category that contain tk; dfi greater than or equal to d̄f means the feature word appears in more documents of the category than the average, and dfi below d̄f means it appears in fewer. The value of β measures how far the document frequency of the feature word in one category departs from the average document frequency over all categories. The larger β is, the more the number of documents containing tk in category Ci exceeds the average over all categories, and such a feature item has stronger class discrimination ability.
Step 4, represent each training text and each test text as a vector over the feature words of the feature dictionary. The weight of each dimension is computed as TFIDF = TF × IDF, where TF (Term Frequency) is the term frequency, i.e. the number of times the feature item occurs in the document, and IDF (Inverse Document Frequency) is the inverse document frequency, IDF = log(M/nk + 0.01), where M is the number of texts in the document collection and nk is the number of documents containing the word;
Step 5, KNN classification is carried out;
The training text set is S, the test text is d, n is the feature vector dimension threshold, and K is set to 35.
The similarity between test text d and every text in S is computed using the cosine of the angle between their vectors; the K texts with the largest similarities are selected as the K nearest neighbors of test text d; the weight of test text d for each category is computed, and test text d is assigned to the category with the largest weight.
If the known category of training text di is Cj, the weight is computed as
W(d, Cj) = Σ_{di ∈ KNN(d)} Sim(d, di)·y(di, Cj)
where Sim(d, di) is the cosine similarity between test text d and the known-category text di:
Sim(d, di) = Σ_{j=1..n} Xj·xij / (sqrt(Σ_{j=1..n} Xj²)·sqrt(Σ_{j=1..n} xij²))
where n is the feature vector dimension threshold, Xj (0 < j ≤ n) denotes the weight of the j-th dimension of the text d to be classified, and xij denotes the weight of the j-th dimension of training text vector di.
y(di, Cj) is the category attribution function: y(di, Cj) = 1 if di belongs to category Cj, and 0 otherwise.
Step 6, compute the precision, recall and F1 value of the KNN classification algorithm; by setting a maximum threshold on the difference between the F1 values of two consecutive classification runs and a step size by which the scale factor μ is increased, obtain the final value of the scale factor.
Set the initial F1 value to 0 and the initial μ value to 0.5; ε = 0.0001 is the threshold on the difference between two consecutive F1 values, and τ = 0.05 is the step size by which the scale factor μ is increased.
Repeat step 5 to obtain F1′, and compute the difference between the two consecutive F1 values, ΔF = |F1′ − F1|; if ΔF is less than ε, the current scale factor μ is taken as the result; if ΔF is greater than or equal to ε, let μ = μ + τ and F1 = F1′, and repeat this iteration until a suitable scale factor μ is obtained, so as to guarantee higher classification accuracy.
Table 1: Comparison of results before and after the algorithm improvement (balanced corpus) (%)
Table 2: Comparison of results before and after the algorithm improvement (unbalanced corpus) (%)

Claims (1)

1. An adaptive feature selection method based on chi-square statistics, characterized by comprising the following steps:
Step 1, download the Chinese corpus published by Fudan University from the Internet, consisting of a training text set and a test text set;
Step 2, preprocess the training text set and the test text set with the segmentation software ICTCLAS, including word segmentation and stop-word removal, obtaining the segmented training text set and test text set;
Step 3, perform feature selection on the segmented training text set with the CHI-based adaptive text feature selection method, obtaining the feature dictionary of the training text set;
The calculation formula of the traditional CHI text feature selection method is as follows:
χ²(tk, Ci) = N·(A·D − B·C)² / [(A + B)(C + D)(A + C)(B + D)]
where N = A + B + C + D, A denotes the number of documents that contain feature tk and belong to category Ci, B denotes the number of documents that contain tk and do not belong to Ci, C denotes the number of documents that do not contain tk and belong to Ci, and D denotes the number of documents that do not contain tk and do not belong to Ci;
The formula of the proposed CHI-based adaptive text feature selection method is as follows:
χnew²(tk, Ci) = [μ·χ²(tk, Ci)⁺ + (1 − μ)·χ²(tk, Ci)⁻]·α·β
where μ is the scale factor, χ²(tk, Ci)⁺ and χ²(tk, Ci)⁻ denote the CHI values of positively and negatively correlated feature items, α is the word-frequency factor and β is the inter-class variance; α is given by
α = tf(tk, Ci) / Σ_{i=1..m} tf(tk, Ci)
where m is the total number of categories in the training set, tf(tk, Ci) denotes the number of occurrences of feature item tk in category Ci, and Σ_{i=1..m} tf(tk, Ci) denotes the number of occurrences of tk in the whole training text set; if category Ci in the training set contains p documents di1, di2, ..., dij, ..., dip, then tf(tk, dij) denotes the number of occurrences of tk in the j-th document of Ci, tf(tk, Ci) = Σ_{j=1..p} tf(tk, dij) is the total number of occurrences of tk in Ci, and the denominator above is the total number of occurrences of tk in all documents of the training text set; the word-frequency factor α is thus the ratio of the occurrences of tk within category Ci to the occurrences of tk in the whole training set; the larger α is, the more frequently the feature item occurs in the given category and the less often (or hardly at all) it occurs in other categories, so such a feature item has stronger class discrimination ability; the smaller α is, the less frequently the feature item occurs in the given category and the more often it occurs in other categories, so it has weaker class discrimination ability;
Here m denotes the number of categories, dfi is the number of documents in category Ci that contain tk, and d̄f = (1/m)·Σ_{i=1..m} dfi is the average number of documents per category that contain tk; dfi greater than or equal to d̄f means the feature word appears in more documents of the category than the average, and dfi below d̄f means it appears in fewer; β measures how far the document frequency of the feature word in one category departs from the average document frequency over all categories; the larger β is, the more the number of documents containing tk in category Ci exceeds the average over all categories, and such a feature item has stronger class discrimination ability;
Step 4, represent each training text and each test text as a vector over the feature words of the feature dictionary; the weight of each dimension is computed as TFIDF = TF × IDF, where TF (Term Frequency) is the term frequency, i.e. the number of times the feature item occurs in the document, and IDF (Inverse Document Frequency) is the inverse document frequency, IDF = log(M/nk + 0.01), where M is the number of texts in the document collection and nk is the number of documents containing the word;
Step 5, KNN classification is carried out;
The training text set is S, the test text is d, n is the feature vector dimension threshold, and K is set to 35;
Step 5.1, compute the similarity between test text d and every text in S using the cosine of the angle between their vectors;
Step 5.2, select the K texts with the largest similarities obtained in step 5.1 as the K nearest-neighbor texts of test text d;
Step 5.3, compute the weight of test text d for each category and assign test text d to the category with the largest weight;
If the known category of training text di is Cj, the weight is computed as
W(d, Cj) = Σ_{di ∈ KNN(d)} Sim(d, di)·y(di, Cj)
where Sim(d, di) is the cosine similarity between test text d and the known-category text di:
Sim(d, di) = Σ_{j=1..n} Xj·xij / (sqrt(Σ_{j=1..n} Xj²)·sqrt(Σ_{j=1..n} xij²))
where n is the feature vector dimension threshold, Xj (0 < j ≤ n) denotes the weight of the j-th dimension of the text d to be classified, and xij denotes the weight of the j-th dimension of training text vector di;
y(di, Cj) is the category attribution function: y(di, Cj) = 1 if di belongs to category Cj, and 0 otherwise;
Step 6, compute the precision, recall and F1 value of the KNN classification algorithm; by setting a maximum threshold on the difference between the F1 values of two consecutive classification runs and a step size by which the scale factor μ is increased, obtain the final value of the scale factor μ, so as to guarantee higher classification accuracy;
Step 6.1, set the initial F1 value to 0 and the initial μ value to 0.5; ε = 0.0001 is the threshold on the difference between two consecutive F1 values, and τ = 0.05 is the step size by which the scale factor μ is increased;
Step 6.2, repeat step 5 to obtain F1′, and compute the difference between the two consecutive F1 values, ΔF = |F1′ − F1|;
Step 6.3, if ΔF is less than ε, the current scale factor μ is taken as the result; if ΔF is greater than or equal to ε, let μ = μ + τ and F1 = F1′, and repeat steps 6.2 and 6.3 until a suitable scale factor μ is obtained, so as to guarantee higher classification accuracy.
CN201510927759.9A 2015-12-14 2015-12-14 An adaptive feature selection method based on chi-square statistics Expired - Fee Related CN105512311B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510927759.9A CN105512311B (en) 2015-12-14 2015-12-14 An adaptive feature selection method based on chi-square statistics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510927759.9A CN105512311B (en) 2015-12-14 2015-12-14 An adaptive feature selection method based on chi-square statistics

Publications (2)

Publication Number Publication Date
CN105512311A CN105512311A (en) 2016-04-20
CN105512311B true CN105512311B (en) 2019-02-26

Family

ID=55720291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510927759.9A Expired - Fee Related CN105512311B (en) 2015-12-14 2015-12-14 An adaptive feature selection method based on chi-square statistics

Country Status (1)

Country Link
CN (1) CN105512311B (en)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021461A (en) * 2016-05-17 2016-10-12 深圳市中润四方信息技术有限公司 Text classification method and text classification system
CN108073567B (en) * 2016-11-16 2021-12-28 北京嘀嘀无限科技发展有限公司 Feature word extraction processing method, system and server
CN108090088A (en) * 2016-11-23 2018-05-29 北京国双科技有限公司 Feature extracting method and device
CN106611057B (en) * 2016-12-27 2019-08-13 上海利连信息科技有限公司 The text classification feature selection approach of importance weighting
CN107291837B (en) * 2017-05-31 2020-04-03 北京大学 Network text word segmentation method based on field adaptability
CN107256214B (en) * 2017-06-30 2020-09-25 联想(北京)有限公司 Junk information judgment method and device and server cluster
CN107577794B (en) * 2017-09-19 2019-07-05 北京神州泰岳软件股份有限公司 A kind of news category method and device
CN108197307A (en) * 2018-01-31 2018-06-22 湖北工业大学 The selection method and system of a kind of text feature
CN108491429A (en) * 2018-02-09 2018-09-04 湖北工业大学 A kind of feature selection approach based on document frequency and word frequency statistics between class in class
CN108376130A (en) * 2018-03-09 2018-08-07 长安大学 A kind of objectionable text information filtering feature selection approach
CN108346474B (en) * 2018-03-14 2021-09-28 湖南省蓝蜻蜓网络科技有限公司 Electronic medical record feature selection method based on word intra-class distribution and inter-class distribution
CN108920545B (en) * 2018-06-13 2021-07-09 四川大学 Chinese emotion feature selection method based on extended emotion dictionary and chi-square model
CN109325511B (en) * 2018-08-01 2020-07-31 昆明理工大学 Method for improving feature selection
CN109543037A (en) * 2018-11-21 2019-03-29 南京安讯科技有限责任公司 A kind of article classification method based on improved TF-IDF
CN112256865B (en) * 2019-01-31 2023-03-21 青岛科技大学 Chinese text classification method based on classifier
CN110069630B (en) * 2019-03-20 2023-07-21 重庆信科设计有限公司 Improved mutual information feature selection method
CN110705247B (en) * 2019-08-30 2020-08-04 山东科技大学 Based on x2-C text similarity calculation method
CN110688481A (en) * 2019-09-02 2020-01-14 贵州航天计量测试技术研究所 Text classification feature selection method based on chi-square statistic and IDF
CN111144106B (en) * 2019-12-20 2023-05-02 山东科技大学 Two-stage text feature selection method under unbalanced data set
CN111062212B (en) * 2020-03-18 2020-06-30 北京热云科技有限公司 Feature extraction method and system based on optimized TFIDF
CN112200259A (en) * 2020-10-19 2021-01-08 哈尔滨理工大学 Information gain text feature selection method and classification device based on classification and screening
CN113032564B (en) * 2021-03-22 2023-05-30 建信金融科技有限责任公司 Feature extraction method, device, electronic equipment and storage medium
CN113515623B (en) * 2021-04-28 2022-12-06 西安理工大学 Feature selection method based on word frequency difference factor
CN113378567B (en) * 2021-07-05 2022-05-10 广东工业大学 Chinese short text classification method for improving low-frequency words

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9311390B2 (en) * 2008-01-29 2016-04-12 Educational Testing Service System and method for handling the confounding effect of document length on vector-based similarity scores

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103678274A (en) * 2013-04-15 2014-03-26 南京邮电大学 Feature extraction method for text categorization based on improved mutual information and entropy
CN103886108A (en) * 2014-04-13 2014-06-25 北京工业大学 Feature selection and weight calculation method of imbalance text set
CN104750844A (en) * 2015-04-09 2015-07-01 中南大学 Method and device for generating text characteristic vectors based on TF-IGM, method and device for classifying texts

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
An optimized mutual information text feature selection method based on word frequency; Liu Haifeng; Computer Engineering; 2014-12-31; pp. 179-182

Also Published As

Publication number Publication date
CN105512311A (en) 2016-04-20

Similar Documents

Publication Publication Date Title
CN105512311B (en) An adaptive feature selection method based on chi-square statistics
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
CN105426426B (en) A kind of KNN file classification methods based on improved K-Medoids
Huang et al. Naive Bayes classification algorithm based on small sample set
CN104142918B (en) Short text clustering and focus subject distillation method based on TF IDF features
CN110825877A (en) Semantic similarity analysis method based on text clustering
US20190278864A2 (en) Method and device for processing a topic
Faguo et al. Research on short text classification algorithm based on statistics and rules
CN108932311B (en) Method for detecting and predicting emergency
CN104750844A (en) Method and device for generating text characteristic vectors based on TF-IGM, method and device for classifying texts
CN103294817A (en) Text feature extraction method based on categorical distribution probability
CN111694958A (en) Microblog topic clustering method based on word vector and single-pass fusion
CN103995876A (en) Text classification method based on chi square statistics and SMO algorithm
CN107066555A (en) Towards the online topic detection method of professional domain
CN109271517A (en) IG TF-IDF Text eigenvector generates and file classification method
CN105975518A (en) Information entropy-based expected cross entropy feature selection text classification system and method
CN102243641A (en) Method for efficiently clustering massive data
Xu et al. An improved information gain feature selection algorithm for SVM text classifier
CN107908624A (en) A kind of K medoids Text Clustering Methods based on all standing Granule Computing
Dan et al. Research of text categorization on Weka
CN108153899B (en) Intelligent text classification method
CN106503146B (en) The feature selection approach of computer version
Abdul-Rahman et al. Exploring feature selection and support vector machine in text categorization
Zhang et al. A hot spot clustering method based on improved kmeans algorithm
Shen et al. A cross-database comparison to discover potential product opportunities using text mining and cosine similarity

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190226

Termination date: 20211214

CF01 Termination of patent right due to non-payment of annual fee