CN113515623A

CN113515623A - Feature selection method based on word frequency difference factor

Info

Publication number: CN113515623A
Application number: CN202110466347.5A
Authority: CN
Inventors: 周红芳; 李想; 马一鸣
Original assignee: Xian University of Technology
Current assignee: Xian University of Technology
Priority date: 2021-04-28
Filing date: 2021-04-28
Publication date: 2021-10-19
Anticipated expiration: 2041-04-28
Also published as: CN113515623B

Abstract

The invention discloses a feature selection method based on word frequency difference value factors, which comprises the steps of firstly selecting text type data sets with different document numbers and thousands or even more than ten thousands of feature numbers, and removing words with the number of the documents more than 25% or less than 3 of the total number; processing a data set which is not divided into a training set and a test set by adopting a 5-fold cross verification method; respectively finishing the dimensionality reduction processing on the data of the training set and the testing set according to the obtained optimal feature subset; training a classification model by adopting a naive Bayes algorithm and a support vector machine algorithm, and predicting to obtain a classification result; and evaluating the classification effect, wherein the higher the scores of the macro F1 and the micro F1 are, the better the classification effect is proved, and the better the performance of the feature selection algorithm is proved. According to the invention, when the relevance between the words and the categories is calculated, the influence of the document frequency and the word frequency on the importance of the words can be considered, the words with high category distinguishing capability are finally selected, and the accuracy and efficiency of classification are improved.

Description

Feature selection method based on word frequency difference factor

Technical Field

The invention belongs to the technical field of text classification, and particularly relates to a feature selection method based on word frequency difference value factors.

Background

The popularization of the internet and the development of information technology bring intelligent experience, greatly enrich life, and simultaneously improve the efficiency of daily learning and work. Nowadays, various information platforms or social software are emerging on the network, and these platforms generate massive data every second, wherein the data stored in the form of documents has important weight, such as personal information registered on e-commerce platforms (such as tianmao, kyoto, etc.), consumption records and evaluations of users, or user comments generated by music and video software, and e-mails, etc. In the face of mass data, it is difficult to extract valuable information in the mass data by manual means efficiently and accurately, and text type data must be processed by means of a machine learning algorithm and a natural language processing technology. Where text classification techniques are of paramount importance. The text classification can classify the text data in the data set according to a certain discrimination standard, so that valuable information is extracted from the text data, and the data processing efficiency is improved. Text classification techniques are widely used and have deep applications in the fields of medicine, biology, traffic management, finance, geographic information and the like.

The text classification mainly comprises the following three stages: preprocessing, feature selection and model training for classification. Since the text data is characterized by the words forming the text data, the phenomenon of 'dimension disaster' inevitably occurs during processing, and therefore, the feature selection must be performed on the data set before classification. The feature selection algorithm mainly comprises three types of filtering type, packaging type and embedding type. The invention relates to a filtering type feature selection algorithm based on word frequency and document frequency, which selects words highly related to categories as optimal features by calculating the score of each word in a document and sequencing the words according to the score, thereby achieving the purpose of reducing dimensions.

Most feature selection algorithms today are based on document frequency solving problems, common algorithms include maximum-minimum-ratio (MMR), CHI-square-test (CHI), chini coefficient (GINI), and Information Gain (IG). They have studied the number of documents in which a word appears in each category as an important point, but neglecting the number of occurrences of the word itself in an article has a great influence on the evaluation of its importance. The recently proposed trigonometric comparison metric algorithm (TCM) is an excellent feature selection algorithm based on document frequency, and solves the problems that a breakpoint exists in a denominator in a classic NDM algorithm and a high score is given to highly sparse words. However, this approach ignores the effect of word frequency on word importance. Therefore, the project group proposes a feature selection method combining word frequency and document frequency, which respectively calculates the average word frequency in the positive document and the negative document, and finds the difference between the two as the weight of the word in the word frequency level.

Disclosure of Invention

The invention aims to provide a feature selection method based on a word frequency difference factor, so that an algorithm can take the influence of document frequency and word frequency on the importance of the words and the categories into consideration when calculating the relevance of the words and the categories, finally selects the words with high category distinguishing capability, and improves the accuracy and efficiency of classification.

The invention adopts the technical scheme that a feature selection method based on word frequency difference value factors is implemented according to the following steps:

step 1, selecting text type data sets with different document numbers and thousands or even tens of thousands of characteristic numbers, and removing words which appear in the data sets and have the document numbers more than 25% or less than 3 of the total number; processing the data set which is not divided into the training set and the test set by adopting a 5-fold cross verification method;

step 2, setting the number of the optimal feature subset elements as C, calculating the score of each feature word of training set data by using a feature selection target function, performing descending order arrangement on the feature words according to the scores, and selecting the front C feature words to form the optimal feature subset; finally, respectively finishing the dimensionality reduction processing on the data of the training set and the test set according to the obtained optimal feature subset;

step 3, training by using the training set obtained in the step 2 and respectively adopting a naive Bayes classifier and a support vector machine classifier to train a classification model, and predicting the sample class of the test set processed in the step 2 to obtain a classification result;

and 4, evaluating the classification effect of the classifier by using the Macro-F1 and Micro-F1 evaluation indexes, wherein the higher the scores of the Macro-F1 and the Micro-F1 are, the better the classification effect is proved, and the better the performance of the feature selection algorithm is further proved.

The present invention is also characterized in that,

the step 2 is as follows:

step 2.1: calculating a word frequency difference value factor TDF of the training set document characteristic words,

2.2, calculating a positive document frequency influence factor of the training set data characteristic words and a triangular comparison measurement factor TCM score;

step 2.3: calculating the term t_iGlobal score of (TFTCM) (t)_i) Obtaining a feature set with weight;

2.4, performing descending ordering on the feature words of the training set according to the finally calculated global score, and selecting C feature words with the top rank to form an optimal feature subset with the size of C;

and 2.5, respectively processing the training set data and the test set data according to the optimal feature subset obtained in the step 2.4, namely deleting feature words which do not appear in the optimal feature subset in the document, reserving the feature words contained in the optimal feature subset, and obtaining the training set data and the test set data after dimension reduction.

Step 2.1 is specifically as follows:

step 2.1.1, calculate term t according to equation (1)_iIn document d_jFrequency of occurrence in, is denoted as tf_ijAnd calculating the term t according to the formulas (2) and (3)_iIn class C_kAverage word frequency in

Where k denotes a class number, tc_ijDenotes the term t_iIn document d_jNumber of occurrences in, N_jRepresenting a document d_jTotal number of words in (1), N_kRepresents class C_kTotal number of documents in, I (d)_j,C_k) For judging the document d_jWhether or not it belongs to class C_kWhen document d_jBelong to class C_kWhen I (d)_j,C_k) Is 1, otherwise is 0;

step 2.1.2, calculate term t according to equation (4)_iIn a state other than C_kAverage word frequency of all documents in class

Wherein N is the total number of documents in the data set, N_kIs of class C_kTotal number of documents in;

step 2.1.3, calculating the term t according to the formula (5)_iTerm frequency difference factor TDF (t)_i,c_k),

Step 2.2 is specifically as follows:

step 2.2.1, calculate term t according to equation (6)_iPositive class document frequency influence factor pos_ki，

Where tp is denoted as being in class c_kTerm of (A) t_iNumber of documents appearing, fn, in class c_kTerm of (A) t_iNumber of documents not present, fp representing non-c_kTerm t in class_iNumber of documents appearing, tn denotesIn a region other than c_kTerm t in class_iNumber of documents that do not appear;

step 2.2.2, calculate the term t separately_iIn class c_kTrue rate tpr and false positive rate fpr, true rate

False positive rate

The term t is calculated using the trigonometric comparison metric algorithm TCM, equation (7)_iTCM (t)_i,c_k) The score is obtained by the above-mentioned method,

TCM(t_i,c_k)＝(2 max(sin²θ,cos²θ)-1)^m|tpr-fpr| (7)

wherein θ represents the term t_iAnd the corresponding vector (tpr, fpr) forms an included angle with the coordinate axis closest to the vector, m is a parameter for controlling the influence of the off-axis angle on the integral score, and the integral effect is the best when m is 100.

Step 2.3 is specifically as follows:

calculating the term t according to equation (8)_iGlobal score of (TFTCM) (t)_i) And obtaining a feature set with weight values:

wherein k represents a class number, P (C)_k) Indicates belonging to class C_kThe number of documents in the entire data set.

The invention has the beneficial effects that the feature selection method based on the word frequency difference factor endows the words with higher word frequency in a certain class with higher weight by calculating the average word frequency and the corresponding difference value of the words in the positive class and the negative class. The TCM algorithm improved based on the word frequency difference factor fully considers the document frequency of the characteristic words and the influence of the word frequency on the importance of the characteristic words. The introduction of the positive document frequency influence factor enables the algorithm to pay more attention to the influence of the text document frequency of the words in the multi-classification task, namely, if the proportion of the number of the documents appearing in the positive class of a word is larger, the importance degree of the word is larger. The invention can well select the characteristic words with high category resolution capability, and improves the accuracy and efficiency of classification.

Drawings

FIG. 1 is a flow chart of a feature selection method based on word frequency difference factors according to the present invention;

2(a) -2 (d) are comparison results of Macro-F1 and Micro-F1 obtained when a naive Bayes classifier and a support vector machine classifier are used for classification under different feature word dimensions on a K1b data set in sequence according to the feature selection method based on the word frequency difference factor of the invention and the prior art;

3(a) -3 (d) are the comparison results of Macro-F1 and Micro-F1 obtained when the present invention and the prior art are classified by using a naive Bayes classifier and a support vector machine classifier under different feature word dimensions on a KDC data set in sequence;

4(a) -4 (d) are the comparison results of Macro-F1 and Micro-F1 obtained when the present invention and the prior art are classified by using a naive Bayes classifier and a support vector machine classifier under different feature word dimensions on an R8 data set in sequence;

5(a) -5 (d) are the comparison results of Macro-F1 and Micro-F1 obtained when the present invention and the prior art are classified by using a naive Bayes classifier and a support vector machine classifier under different feature word dimensions on an R52 data set in sequence;

6(a) -6 (d) are the comparison results of Macro-F1 and Micro-F1 obtained when the present invention and the prior art are classified by using a naive Bayes classifier and a support vector machine classifier under different feature word dimensions on an RE1 data set in sequence;

7(a) -7 (d) are the comparison results of Macro-F1 and Micro-F1 obtained when the invention and the prior art are classified by using a naive Bayes classifier and a support vector machine classifier under different feature word dimensions on a 20Newsgroups data set in sequence.

Detailed Description

The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.

The flow chart of the invention is shown in fig. 1, and the feature selection method based on the word frequency difference value factor is implemented according to the following steps:

the step 2 is as follows:

Step 2.1 is specifically as follows:

Step 2.2 is specifically as follows:

Where tp is denoted as being in class c_kTerm of (A) t_iNumber of documents appearing, fn, in class c_kTerm of (A) t_iNumber of documents not present, fp representing non-c_kTerm t in class_iNumber of documents present, tn, at not c_kTerm t in class_iNumber of documents that do not appear;

False positive rate

TCM(t_i,c_k)＝(2 max(sin²θ,cos²θ)-1)^m|tpr-fpr| (7)

Step 2.3 is specifically as follows:

On the basis of TCM algorithm, respectively calculating the appearance of each word in C_kClass and non-C_kAnd (4) average word frequency in the class document, and calculating the difference between the average word frequency and the class document to realize the correlation judgment of the word frequency level.

In comparative experiments, six data sets of K1b, KDC, R8, R52, RE1, 20Newsgroups were used for testing. Wherein K1b, RE1 is a text type dataset obtained from the university of minnesota Karypis laboratory; the KDC data set consists of a group of Kurdish text documents from different websites; r52 and R8 are obtained by processing a Reuters21578 data set, and the Reuters21578 is a classic text classification test set and is commonly used in the research fields of information retrieval, machine learning and the like; the 20Newsgroups dataset is composed of 20000 Newsgroups of 20Newsgroups extracted.

In order to verify the performance of the feature selection algorithm based on the word frequency difference factor, the invention is compared with five algorithms of maximum-minimum ratio (MMR), Chi-square test (CHI), Gini coefficient (GINI), Triangular Comparison Measure (TCM) and Information Gain (IG). From fig. 2(a) -2 (d), it can be seen that on the K1b data set, the Macro-F1 and Micro-F1 scores of the invention are superior to those of the comparison algorithm in most cases, accounting for 71.88%, and the performance is better. It can be seen from fig. 3(a) -3 (d) that the performance of the present invention is more stable on KDC datasets and that the highest values are achieved at multiple comparison points over other algorithms. From fig. 4(a) -4 (d), it can be seen that for the R8 dataset, whether using naive bayes or support vector machine classifiers, the invention achieved the highest Macro-F1 scores at the 5 points of the lower dimension, while performing less well on the high dimension points, for Micro-F1, the overall performance of the invention was more stable, with the highest values at the multiple contrast points. From fig. 5(a) -5 (d), it can be seen that on the R52 data set, when using the naive bayes classifier, the present invention achieves the highest score at most of the contrast points, with a proportion of 81.25%, and when using the support vector machine classifier, the performance of the present invention is optimal at a plurality of contrast points, although the performance is somewhat degraded. From fig. 6(a) -6 (d), it can be seen that the overall performance of the present invention performed well for the RE1 dataset, with the highest scores obtained at most of the comparison points, and the optimal percentage being 71.88%. From fig. 7(a) -7 (d), it can be seen that for the 20Newsgroups data set, the present invention achieves the best score at almost all comparison points, the optimal case is 90.63%, and the performance is significantly better than the comparison algorithm. The invention has stable overall performance and is a reliable feature selection algorithm.

Claims

1. The feature selection method based on the word frequency difference factor is characterized by being implemented according to the following steps:

2. The method for selecting features based on word frequency difference factors according to claim 1, wherein the step 2 specifically comprises the following steps:

3. The method for selecting features based on word frequency difference factors according to claim 2, wherein the step 2.1 is as follows:

step 2.1.1, calculate term t according to equation (1)_iIn document d_jFrequency of occurrence in, is denoted as tf_ijAnd calculating the term t according to the formulas (2) and (3)_iIn class C_kAverage word in (1)Frequency converter

Where k denotes a class number, tc_ijDenotes the term t_iIn document d_jNumber of occurrences in, N_jRepresenting a document d_jTotal number of words in (1), N_kRepresents class C_kTotal number of documents in, I (d)_j，C_k) For judging the document d_jWhether or not it belongs to class C_kWhen document d_jBelong to class C_kWhen I (d)_j，C_k) Is 1, otherwise is 0;

step 2.1.3, calculating the term t according to the formula (5)_iTerm frequency difference factor TDF (t)_i，c_k)，

4. The method for selecting features based on word frequency difference factors according to claim 3, wherein the step 2.2 is as follows:

False positive rate

The term t is calculated using the trigonometric comparison metric algorithm TCM, equation (7)_iTCM (t)_i，c_k) The score is obtained by the above-mentioned method,

TCM(t_i，c_k)＝(2max(sin²θ，coS²θ)-1)^m|tpr-fpr| (7)

5. The method for selecting features based on word frequency difference factors according to claim 4, wherein the step 2.3 is as follows: