CN113515623A - Feature selection method based on word frequency difference factor - Google Patents
Feature selection method based on word frequency difference factor
- Publication number
- CN113515623A (application CN202110466347.5A)
- Authority
- CN
- China
- Prior art keywords
- class
- term
- feature
- document
- words
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/353—Clustering; Classification into predefined classes
- G06F16/355—Class or cluster creation or modification
- G06F18/2113—Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/217—Validation; Performance evaluation; Active pattern learning techniques
- G06F18/2411—Classification techniques based on the proximity to a decision surface, e.g. support vector machines
- G06F18/24155—Bayesian classification
Abstract
The invention discloses a feature selection method based on a word frequency difference factor. First, text-type data sets with different numbers of documents and feature counts in the thousands or even tens of thousands are selected, and words that appear in more than 25% of the documents or in fewer than 3 documents are removed. Data sets not already divided into training and test sets are processed with 5-fold cross-validation. The training-set and test-set data are then reduced in dimensionality according to the obtained optimal feature subset. Classification models are trained with the naive Bayes and support vector machine algorithms, and classification results are obtained by prediction. Finally, the classification effect is evaluated: the higher the Macro-F1 and Micro-F1 scores, the better the classification effect and the better the performance of the feature selection algorithm. When calculating the relevance between words and categories, the invention takes into account the influence of both document frequency and word frequency on word importance, finally selects the words with high category-distinguishing capability, and improves the accuracy and efficiency of classification.
Description
Technical Field
The invention belongs to the technical field of text classification, and particularly relates to a feature selection method based on word frequency difference value factors.
Background
The popularization of the internet and the development of information technology have brought intelligent experiences, greatly enriched daily life, and improved the efficiency of learning and work. Today, the many information platforms and social applications on the network generate massive data every second, a large share of which is stored in document form: personal information registered on e-commerce platforms (such as Tmall and JD.com), users' consumption records and reviews, comments generated in music and video applications, e-mails, and so on. Faced with such volumes, it is difficult to extract valuable information efficiently and accurately by manual means, so text data must be processed with machine learning algorithms and natural language processing techniques, among which text classification is of paramount importance. Text classification assigns the text data in a data set to categories according to a discrimination standard, so that valuable information can be extracted and data processing efficiency improved. Text classification techniques are widely and deeply applied in medicine, biology, traffic management, finance, geographic information, and other fields.
Text classification mainly comprises three stages: preprocessing, feature selection, and model training for classification. Since text data are characterized by the words that form them, the "curse of dimensionality" inevitably arises during processing, so feature selection must be performed on the data set before classification. Feature selection algorithms fall into three types: filter, wrapper, and embedded. The invention is a filter-type feature selection algorithm based on word frequency and document frequency: it computes a score for each word in the documents, ranks the words by score, and selects the words highly related to the categories as the optimal features, thereby achieving dimensionality reduction.
Most feature selection algorithms today solve the problem on the basis of document frequency; common algorithms include the max-min ratio (MMR), the chi-square test (CHI), the Gini coefficient (GINI), and information gain (IG). They focus on the number of documents in which a word appears in each category, but neglect the number of occurrences of the word itself within an article, which strongly affects the evaluation of its importance. The recently proposed triangular comparison measure (TCM) is an excellent document-frequency-based feature selection algorithm: it resolves the breakpoint in the denominator of the classic NDM algorithm and its tendency to give high scores to highly sparse words. However, this approach still ignores the effect of word frequency on word importance. The project group therefore proposes a feature selection method combining word frequency and document frequency, which calculates the average word frequency in the positive-class and negative-class documents respectively, and takes the difference between the two as the weight of the word at the word-frequency level.
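The word-frequency side of this proposal can be illustrated with a small sketch (the toy corpus and class labels below are invented for the example; this is not the patented implementation itself):

```python
# Sketch of the word-frequency difference idea: a term that occurs often in
# positive-class documents but rarely elsewhere gets a large difference score.
# The toy corpus and labels are invented for illustration.

def term_freq(term, doc):
    # Relative frequency of `term` in a tokenized document.
    return doc.count(term) / len(doc)

def tdf(term, docs, labels, positive):
    # Average relative frequency in the positive class minus the average
    # in all remaining classes.
    pos = [term_freq(term, d) for d, c in zip(docs, labels) if c == positive]
    neg = [term_freq(term, d) for d, c in zip(docs, labels) if c != positive]
    return sum(pos) / len(pos) - sum(neg) / len(neg)

docs = [["goal", "match", "team", "goal"],
        ["match", "team", "coach"],
        ["stock", "market", "price"],
        ["market", "price", "trade"]]
labels = ["sport", "sport", "finance", "finance"]

print(round(tdf("goal", docs, labels, "sport"), 3))  # class-concentrated term
```

A term spread evenly over all classes would score near zero, which is exactly the behaviour the difference factor is meant to capture.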
Disclosure of Invention
The invention aims to provide a feature selection method based on a word frequency difference factor, so that the algorithm considers the influence of both document frequency and word frequency on word importance when calculating the relevance between words and categories, finally selects words with high category-distinguishing capability, and improves the accuracy and efficiency of classification.
The invention adopts the technical scheme that a feature selection method based on word frequency difference value factors is implemented according to the following steps:
Step 1: select text-type data sets with different numbers of documents and feature counts in the thousands or even tens of thousands, and remove words that appear in more than 25% of the documents or in fewer than 3 documents; process data sets not already divided into training and test sets with 5-fold cross-validation;
Step 2: set the number of elements in the optimal feature subset to C; compute the score of each feature word of the training-set data with the feature-selection objective function, sort the feature words in descending order of score, and select the top C feature words to form the optimal feature subset; finally, reduce the dimensionality of the training-set and test-set data according to the obtained optimal feature subset;
Step 3: train classification models on the training set obtained in step 2, using a naive Bayes classifier and a support vector machine classifier respectively, and predict the sample classes of the test set processed in step 2 to obtain the classification results;
Step 4: evaluate the classification effect of the classifiers with the Macro-F1 and Micro-F1 evaluation indexes; the higher the Macro-F1 and Micro-F1 scores, the better the classification effect, and thus the better the performance of the feature selection algorithm.
The present invention is also characterized in that,
Step 2 is specifically as follows:
Step 2.1: calculate the word frequency difference factor TDF of the feature words in the training-set documents;
Step 2.2: calculate the positive-class document frequency influence factor and the triangular comparison measure (TCM) score of the training-set feature words;
Step 2.3: calculate the global score TFTCM(t_i) of each term t_i, obtaining a weighted feature set;
Step 2.4: sort the training-set feature words in descending order of the final global score and select the C top-ranked feature words to form an optimal feature subset of size C;
Step 2.5: process the training-set and test-set data according to the optimal feature subset obtained in step 2.4, i.e. delete from each document the feature words that do not appear in the optimal feature subset and keep those that do, obtaining the dimension-reduced training-set and test-set data.
Step 2.1 is specifically as follows:
Step 2.1.1: calculate the frequency tf_ij with which term t_i appears in document d_j according to equation (1), and the average word frequency of term t_i in class C_k according to equations (2) and (3):

tf_ij = tc_ij / N_j (1)

I(d_j, C_k) = 1 if document d_j belongs to class C_k, otherwise 0 (2)

atf(t_i, C_k) = (1/N_k) · Σ_{j=1..N} tf_ij · I(d_j, C_k) (3)

where k denotes the class number, tc_ij denotes the number of occurrences of term t_i in document d_j, N_j denotes the total number of words in document d_j, N_k denotes the total number of documents in class C_k, and atf denotes the average word frequency;

Step 2.1.2: calculate the average word frequency of term t_i over all documents not in class C_k according to equation (4):

atf(t_i, ¬C_k) = (1/(N - N_k)) · Σ_{j=1..N} tf_ij · (1 - I(d_j, C_k)) (4)

where N is the total number of documents in the data set and N_k is the total number of documents in class C_k;

Step 2.1.3: calculate the word frequency difference factor TDF(t_i, C_k) of term t_i according to equation (5):

TDF(t_i, C_k) = atf(t_i, C_k) - atf(t_i, ¬C_k) (5)
Step 2.2 is specifically as follows:
Step 2.2.1: calculate the positive-class document frequency influence factor pos_ki of term t_i according to equation (6):

pos_ki = tp / (tp + fp) (6)

where tp denotes the number of documents in class C_k in which term t_i appears, fn denotes the number of documents in class C_k in which term t_i does not appear, fp denotes the number of documents outside class C_k in which term t_i appears, and tn denotes the number of documents outside class C_k in which term t_i does not appear;
Step 2.2.2: calculate the true positive rate tpr = tp / (tp + fn) and the false positive rate fpr = fp / (fp + tn) of term t_i for class C_k, then calculate the TCM score of term t_i with the triangular comparison measure, equation (7):

TCM(t_i, C_k) = (2·max(sin^2 θ, cos^2 θ) - 1)^m · |tpr - fpr| (7)

where θ denotes the angle between the vector (tpr, fpr) corresponding to term t_i and the coordinate axis closest to that vector, and m is a parameter controlling the influence of the off-axis angle on the overall score; the overall effect is best when m = 100.
Step 2.3 is specifically as follows:
Calculate the global score TFTCM(t_i) of term t_i according to equation (8), obtaining a weighted feature set:

TFTCM(t_i) = Σ_k P(C_k) · pos_ki · TDF(t_i, C_k) · TCM(t_i, C_k) (8)

where k denotes the class number and P(C_k) denotes the proportion of documents in the entire data set that belong to class C_k.
The beneficial effects of the invention are as follows. The feature selection method based on the word frequency difference factor gives a higher weight to words with higher word frequency in a given class by calculating the average word frequency of each word in the positive and negative classes and their difference. The TCM algorithm improved with the word frequency difference factor fully considers the influence of both the document frequency and the word frequency of a feature word on its importance. The positive-class document frequency influence factor makes the algorithm pay more attention to the positive-class document frequency of words in multi-class tasks: the larger the proportion of the documents containing a word that belong to the positive class, the greater the word's importance. The invention thus selects feature words with high category-distinguishing capability and improves the accuracy and efficiency of classification.
Drawings
FIG. 1 is a flow chart of a feature selection method based on word frequency difference factors according to the present invention;
FIGS. 2(a)-2(d) show the Macro-F1 and Micro-F1 comparison results of the feature selection method based on the word frequency difference factor of the invention and the prior art, obtained when classifying with a naive Bayes classifier and a support vector machine classifier under different feature-word dimensions on the K1b data set;
FIGS. 3(a)-3(d) show the corresponding comparison results on the KDC data set;
FIGS. 4(a)-4(d) show the corresponding comparison results on the R8 data set;
FIGS. 5(a)-5(d) show the corresponding comparison results on the R52 data set;
FIGS. 6(a)-6(d) show the corresponding comparison results on the RE1 data set;
FIGS. 7(a)-7(d) show the corresponding comparison results on the 20Newsgroups data set.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The flow chart of the invention is shown in fig. 1, and the feature selection method based on the word frequency difference value factor is implemented according to the following steps:
Step 1: select text-type data sets with different numbers of documents and feature counts in the thousands or even tens of thousands, and remove words that appear in more than 25% of the documents or in fewer than 3 documents; process data sets not already divided into training and test sets with 5-fold cross-validation;
Step 2: set the number of elements in the optimal feature subset to C; compute the score of each feature word of the training-set data with the feature-selection objective function, sort the feature words in descending order of score, and select the top C feature words to form the optimal feature subset; finally, reduce the dimensionality of the training-set and test-set data according to the obtained optimal feature subset;
Step 2 is specifically as follows:
Step 2.1: calculate the word frequency difference factor TDF of the feature words in the training-set documents;
Step 2.2: calculate the positive-class document frequency influence factor and the triangular comparison measure (TCM) score of the training-set feature words;
Step 2.3: calculate the global score TFTCM(t_i) of each term t_i, obtaining a weighted feature set;
Step 2.4: sort the training-set feature words in descending order of the final global score and select the C top-ranked feature words to form an optimal feature subset of size C;
Step 2.5: process the training-set and test-set data according to the optimal feature subset obtained in step 2.4, i.e. delete from each document the feature words that do not appear in the optimal feature subset and keep those that do, obtaining the dimension-reduced training-set and test-set data.
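Steps 2.4 and 2.5 can be sketched as follows (the scores and documents are invented toy values; in the method itself the scores come from steps 2.1-2.3):

```python
# Sketch of steps 2.4-2.5: rank feature words by a precomputed global score,
# keep the top C, and reduce each document to the selected vocabulary.
# The scores and documents below are invented toy values.

def reduce_dimensions(scores, docs, C):
    # Descending sort by score; ties broken alphabetically for determinism.
    subset = set(sorted(scores, key=lambda w: (-scores[w], w))[:C])
    return subset, [[w for w in d if w in subset] for d in docs]

scores = {"goal": 0.9, "team": 0.4, "price": 0.8, "the": 0.1}
docs = [["the", "goal", "team"], ["the", "price"]]
subset, reduced = reduce_dimensions(scores, docs, C=2)
print(sorted(subset), reduced)
```

The same `subset` computed on the training set is applied unchanged to the test set, so both end up in the same reduced feature space.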
Step 2.1 is specifically as follows:
Step 2.1.1: calculate the frequency tf_ij with which term t_i appears in document d_j according to equation (1), and the average word frequency of term t_i in class C_k according to equations (2) and (3):

tf_ij = tc_ij / N_j (1)

I(d_j, C_k) = 1 if document d_j belongs to class C_k, otherwise 0 (2)

atf(t_i, C_k) = (1/N_k) · Σ_{j=1..N} tf_ij · I(d_j, C_k) (3)

where k denotes the class number, tc_ij denotes the number of occurrences of term t_i in document d_j, N_j denotes the total number of words in document d_j, N_k denotes the total number of documents in class C_k, and atf denotes the average word frequency;

Step 2.1.2: calculate the average word frequency of term t_i over all documents not in class C_k according to equation (4):

atf(t_i, ¬C_k) = (1/(N - N_k)) · Σ_{j=1..N} tf_ij · (1 - I(d_j, C_k)) (4)

where N is the total number of documents in the data set and N_k is the total number of documents in class C_k;

Step 2.1.3: calculate the word frequency difference factor TDF(t_i, C_k) of term t_i according to equation (5):

TDF(t_i, C_k) = atf(t_i, C_k) - atf(t_i, ¬C_k) (5)
Step 2.2 is specifically as follows:
Step 2.2.1: calculate the positive-class document frequency influence factor pos_ki of term t_i according to equation (6):

pos_ki = tp / (tp + fp) (6)

where tp denotes the number of documents in class C_k in which term t_i appears, fn denotes the number of documents in class C_k in which term t_i does not appear, fp denotes the number of documents outside class C_k in which term t_i appears, and tn denotes the number of documents outside class C_k in which term t_i does not appear;
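The counts tp, fn, fp, tn can be derived from a labelled corpus as in this sketch (the toy documents and labels are invented, and the form pos = tp/(tp+fp) is an assumed reading of equation (6), which is not fully legible here):

```python
# Sketch of step 2.2.1: derive the contingency counts tp, fn, fp, tn of a term
# for one class from a toy labelled corpus, then the positive-class factor
# pos = tp/(tp+fp) (assumed form of equation (6)).

def contingency(term, docs, labels, positive):
    tp = sum(term in d for d, c in zip(docs, labels) if c == positive)
    fn = sum(term not in d for d, c in zip(docs, labels) if c == positive)
    fp = sum(term in d for d, c in zip(docs, labels) if c != positive)
    tn = sum(term not in d for d, c in zip(docs, labels) if c != positive)
    return tp, fn, fp, tn

docs = [{"goal", "team"}, {"team", "coach"}, {"stock", "price"}, {"price"}]
labels = ["sport", "sport", "finance", "finance"]
tp, fn, fp, tn = contingency("team", docs, labels, "sport")
pos = tp / (tp + fp)
print(tp, fn, fp, tn, pos)
```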
Step 2.2.2: calculate the true positive rate tpr = tp / (tp + fn) and the false positive rate fpr = fp / (fp + tn) of term t_i for class C_k, then calculate the TCM score of term t_i with the triangular comparison measure, equation (7):

TCM(t_i, C_k) = (2·max(sin^2 θ, cos^2 θ) - 1)^m · |tpr - fpr| (7)

where θ denotes the angle between the vector (tpr, fpr) corresponding to term t_i and the coordinate axis closest to that vector, and m is a parameter controlling the influence of the off-axis angle on the overall score; the overall effect is best when m = 100.
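A minimal sketch of the TCM score of equation (7), assuming θ is measured between the vector (tpr, fpr) and its nearest coordinate axis (the document counts below are invented):

```python
# Sketch of equation (7): the triangular comparison measure (TCM) of a term,
# from its document counts tp, fn, fp, tn for one class, with m = 100 as in
# the text. Counts are invented toy values.
import math

def tcm(tp, fn, fp, tn, m=100):
    tpr = tp / (tp + fn)   # true positive rate
    fpr = fp / (fp + tn)   # false positive rate
    # Angle between the vector (tpr, fpr) and its nearest coordinate axis.
    theta = math.atan2(min(tpr, fpr), max(tpr, fpr))
    s, c = math.sin(theta) ** 2, math.cos(theta) ** 2
    return (2 * max(s, c) - 1) ** m * abs(tpr - fpr)

# A term concentrated in the positive class scores high...
print(tcm(tp=40, fn=10, fp=2, tn=48))
# ...while a term spread evenly across classes scores zero.
print(tcm(tp=25, fn=25, fp=25, tn=25))
```

The (2·max(sin²θ, cos²θ) − 1)^m factor lies in [0, 1] and shrinks rapidly as (tpr, fpr) moves away from an axis, which is what penalizes words that occur comparably often in both classes.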
Step 2.3 is specifically as follows:
Calculate the global score TFTCM(t_i) of term t_i according to equation (8), obtaining a weighted feature set:

TFTCM(t_i) = Σ_k P(C_k) · pos_ki · TDF(t_i, C_k) · TCM(t_i, C_k) (8)

where k denotes the class number and P(C_k) denotes the proportion of documents in the entire data set that belong to class C_k.
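Equation (8) is not fully legible in this text; the sketch below assumes it combines the three per-class factors as a class-prior-weighted product, which may differ from the exact form in the patent (all factor values are invented toy numbers):

```python
# Sketch of step 2.3 under an assumed reading of equation (8): combine, per
# class, the word frequency difference factor TDF, the positive-class
# document-frequency factor pos, and the TCM score, weighted by the class
# prior P(C_k). The factor values below are invented.

def tftcm(per_class):
    # per_class: list of (prior, tdf, pos, tcm) tuples, one per class.
    return sum(prior * tdf * pos * tcm for prior, tdf, pos, tcm in per_class)

score = tftcm([(0.5, 0.25, 0.9, 0.46),    # class 1: toy factor values
               (0.5, -0.05, 0.1, 0.02)])  # class 2
print(score)
```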
Step 3, training by using the training set obtained in the step 2 and respectively adopting a naive Bayes classifier and a support vector machine classifier to train a classification model, and predicting the sample class of the test set processed in the step 2 to obtain a classification result;
and 4, evaluating the classification effect of the classifier by using the Macro-F1 and Micro-F1 evaluation indexes, wherein the higher the scores of the Macro-F1 and the Micro-F1 are, the better the classification effect is proved, and the better the performance of the feature selection algorithm is further proved.
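The two evaluation indexes of step 4 can be sketched in a few lines. Macro-F1 averages the per-class F1 scores equally, while Micro-F1 pools counts over all classes and, for single-label classification, reduces to accuracy (the toy labels below are invented):

```python
# Sketch of step 4: Macro-F1 and Micro-F1 for single-label classification.
# The true/predicted labels below are invented toy values.

def f1_scores(y_true, y_pred):
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    tp_all = 0
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        per_class.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
        tp_all += tp
    macro = sum(per_class) / len(classes)       # unweighted mean of class F1
    micro = tp_all / len(y_true)                # == accuracy when single-label
    return macro, micro

macro, micro = f1_scores(["a", "a", "b", "b"], ["a", "b", "b", "b"])
print(macro, micro)
```

In practice the same numbers can be obtained with scikit-learn's `f1_score` using `average='macro'` and `average='micro'`.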
On the basis of the TCM algorithm, the average word frequency of each word in the C_k-class documents and in the non-C_k-class documents is calculated respectively, and the difference between the two is computed, realizing the relevance judgment at the word-frequency level.
In comparative experiments, six data sets were used for testing: K1b, KDC, R8, R52, RE1, and 20Newsgroups. K1b and RE1 are text-type data sets obtained from the Karypis laboratory at the University of Minnesota; the KDC data set consists of a group of Kurdish text documents from different websites; R8 and R52 are derived from the Reuters-21578 data set, a classic text classification test collection commonly used in information retrieval, machine learning, and related research fields; the 20Newsgroups data set consists of roughly 20,000 newsgroup documents drawn from 20 newsgroups.
To verify the performance of the feature selection algorithm based on the word frequency difference factor, the invention was compared with five algorithms: max-min ratio (MMR), chi-square test (CHI), Gini coefficient (GINI), triangular comparison measure (TCM), and information gain (IG). FIGS. 2(a)-2(d) show that on the K1b data set the Macro-F1 and Micro-F1 scores of the invention are superior to those of the comparison algorithms in most cases (71.88% of the comparison points), indicating better performance. FIGS. 3(a)-3(d) show that the performance of the invention is more stable on the KDC data set, reaching the highest values at multiple comparison points. FIGS. 4(a)-4(d) show that on the R8 data set, with either the naive Bayes or the support vector machine classifier, the invention achieves the highest Macro-F1 scores at the 5 lower-dimensional points while performing less well at the high-dimensional points; for Micro-F1, the overall performance of the invention is more stable, with the highest values at multiple comparison points. FIGS. 5(a)-5(d) show that on the R52 data set the invention achieves the highest score at most comparison points (81.25%) when using the naive Bayes classifier; with the support vector machine classifier, performance degrades somewhat but remains optimal at several comparison points. FIGS. 6(a)-6(d) show that the overall performance on the RE1 data set is good, with the highest scores at most comparison points (optimal at 71.88%). FIGS. 7(a)-7(d) show that on the 20Newsgroups data set the invention achieves the best score at almost all comparison points (optimal at 90.63%), performing significantly better than the comparison algorithms. The overall performance of the invention is stable, making it a reliable feature selection algorithm.
Claims (5)
1. The feature selection method based on the word frequency difference factor is characterized by being implemented according to the following steps:
Step 1: select text-type data sets with different numbers of documents and feature counts in the thousands or even tens of thousands, and remove words that appear in more than 25% of the documents or in fewer than 3 documents; process data sets not already divided into training and test sets with 5-fold cross-validation;
Step 2: set the number of elements in the optimal feature subset to C; compute the score of each feature word of the training-set data with the feature-selection objective function, sort the feature words in descending order of score, and select the top C feature words to form the optimal feature subset; finally, reduce the dimensionality of the training-set and test-set data according to the obtained optimal feature subset;
Step 3: train classification models on the training set obtained in step 2, using a naive Bayes classifier and a support vector machine classifier respectively, and predict the sample classes of the test set processed in step 2 to obtain the classification results;
Step 4: evaluate the classification effect of the classifiers with the Macro-F1 and Micro-F1 evaluation indexes; the higher the Macro-F1 and Micro-F1 scores, the better the classification effect, and thus the better the performance of the feature selection algorithm.
2. The method for selecting features based on word frequency difference factors according to claim 1, wherein the step 2 specifically comprises the following steps:
Step 2.1: calculate the word frequency difference factor TDF of the feature words in the training-set documents;
Step 2.2: calculate the positive-class document frequency influence factor and the triangular comparison measure (TCM) score of the training-set feature words;
Step 2.3: calculate the global score TFTCM(t_i) of each term t_i, obtaining a weighted feature set;
Step 2.4: sort the training-set feature words in descending order of the final global score and select the C top-ranked feature words to form an optimal feature subset of size C;
Step 2.5: process the training-set and test-set data according to the optimal feature subset obtained in step 2.4, i.e. delete from each document the feature words that do not appear in the optimal feature subset and keep those that do, obtaining the dimension-reduced training-set and test-set data.
3. The method for selecting features based on word frequency difference factors according to claim 2, wherein the step 2.1 is as follows:
Step 2.1.1: calculate the frequency tf_ij with which term t_i appears in document d_j according to equation (1), and the average word frequency of term t_i in class C_k according to equations (2) and (3):

tf_ij = tc_ij / N_j (1)

I(d_j, C_k) = 1 if document d_j belongs to class C_k, otherwise 0 (2)

atf(t_i, C_k) = (1/N_k) · Σ_{j=1..N} tf_ij · I(d_j, C_k) (3)

where k denotes the class number, tc_ij denotes the number of occurrences of term t_i in document d_j, N_j denotes the total number of words in document d_j, N_k denotes the total number of documents in class C_k, and atf denotes the average word frequency;

Step 2.1.2: calculate the average word frequency of term t_i over all documents not in class C_k according to equation (4):

atf(t_i, ¬C_k) = (1/(N - N_k)) · Σ_{j=1..N} tf_ij · (1 - I(d_j, C_k)) (4)

where N is the total number of documents in the data set and N_k is the total number of documents in class C_k;

Step 2.1.3: calculate the word frequency difference factor TDF(t_i, C_k) of term t_i according to equation (5):

TDF(t_i, C_k) = atf(t_i, C_k) - atf(t_i, ¬C_k) (5)
4. The method for selecting features based on word frequency difference factors according to claim 3, wherein the step 2.2 is as follows:
Step 2.2.1: calculate the positive-class document frequency influence factor pos_ki of term t_i according to equation (6):

pos_ki = tp / (tp + fp) (6)

where tp denotes the number of documents in class C_k in which term t_i appears, fn denotes the number of documents in class C_k in which term t_i does not appear, fp denotes the number of documents outside class C_k in which term t_i appears, and tn denotes the number of documents outside class C_k in which term t_i does not appear;
step 2.2.2, calculating the true positive rate tpr = tp / (tp + fn) and the false positive rate fpr = fp / (fp + tn) of the term t_i in the class c_k, and calculating the TCM score of the term t_i with the trigonometric comparison measure (TCM) algorithm, equation (7):

TCM(t_i, c_k) = (2 max(sin²θ, cos²θ) - 1)^m |tpr - fpr| (7)

wherein θ denotes the angle between the vector (tpr, fpr) corresponding to the term t_i and the coordinate axis closest to that vector, and m is a parameter controlling the influence of the off-axis angle on the overall score; the overall effect is best when m = 100.
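Equation (7) can be implemented directly from the four document counts. In the sketch below, θ is obtained with `atan2` and folded onto the nearer coordinate axis, so it never exceeds 45 degrees; the function name and the zero-vector guard are additions, not part of the claim:

```python
import math

def tcm_score(tp, fn, fp, tn, m=100):
    """Trigonometric comparison measure of equation (7)."""
    tpr = tp / (tp + fn) if tp + fn else 0.0   # true positive rate
    fpr = fp / (fp + tn) if fp + tn else 0.0   # false positive rate
    if tpr == 0.0 and fpr == 0.0:
        return 0.0  # the zero vector has no direction
    # Angle of (tpr, fpr) from the tpr axis, folded to the nearest axis (<= pi/4).
    theta = math.atan2(fpr, tpr)
    theta = min(theta, math.pi / 2 - theta)
    return (2 * max(math.sin(theta) ** 2, math.cos(theta) ** 2) - 1) ** m * abs(tpr - fpr)

# A term present in most positive documents and few negative ones scores high;
# a term with tpr == fpr lies on the diagonal and scores zero.
print(tcm_score(tp=90, fn=10, fp=5, tn=95))
print(tcm_score(tp=50, fn=50, fp=50, tn=50))  # 0.0
```

Terms whose (tpr, fpr) vector hugs an axis keep the angular factor near 1, while terms near the diagonal are suppressed by both the angular factor and the |tpr - fpr| term, which is why a large m sharpens the selection.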
5. The method for selecting features based on word frequency difference factors according to claim 4, wherein the step 2.3 is as follows:
calculating, according to equation (8), the global score TFTCM(t_i) of the term t_i and obtaining the feature set with weight values,
wherein k denotes a class number and P(C_k) denotes the proportion of documents in the entire data set that belong to the class C_k.
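Equation (8) is likewise rendered as an image in the original record. A common way to combine per-class scores into a single global score is a P(C_k)-weighted sum, which the sketch below assumes; both the combination rule and the function name are hypothetical:

```python
def tftcm_global(per_class_scores, class_priors):
    """Hypothetical global score: sum over classes of P(C_k) * score(t_i, C_k).

    per_class_scores: the term's per-class score (e.g. a TDF/TCM combination),
    class_priors: P(C_k), the fraction of documents belonging to each class.
    """
    return sum(p * s for p, s in zip(class_priors, per_class_scores))

# Two classes: the term is informative mainly for class 0.
print(tftcm_global([0.8, 0.1], [0.4, 0.6]))
```

Ranking all candidate terms by this global score and keeping the top ones yields the weighted feature set that step 2.4 then searches over.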
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110466347.5A CN113515623B (en) | 2021-04-28 | 2021-04-28 | Feature selection method based on word frequency difference factor |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113515623A true CN113515623A (en) | 2021-10-19 |
CN113515623B CN113515623B (en) | 2022-12-06 |
Family
ID=78063918
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104750844A (en) * | 2015-04-09 | 2015-07-01 | 中南大学 | Method and device for generating text characteristic vectors based on TF-IGM, method and device for classifying texts |
CN104794187A (en) * | 2015-04-13 | 2015-07-22 | 西安理工大学 | Feature selection method based on entry distribution |
CN105512311A (en) * | 2015-12-14 | 2016-04-20 | 北京工业大学 | Chi square statistic based self-adaption feature selection method |
CN110069630A (en) * | 2019-03-20 | 2019-07-30 | 重庆信科设计有限公司 | A kind of improved mutual information feature selection approach |
CN111062212A (en) * | 2020-03-18 | 2020-04-24 | 北京热云科技有限公司 | Feature extraction method and system based on optimized TFIDF |
CN111709439A (en) * | 2020-05-06 | 2020-09-25 | 西安理工大学 | Feature selection method based on word frequency deviation rate factor |
US20210019422A1 (en) * | 2019-07-17 | 2021-01-21 | Vmware, Inc. | Feature selection using term frequency-inverse document frequency (tf-idf) model |
Non-Patent Citations (3)
Title |
---|
HONGFANG ZHOU et al.: "Feature Selection Based on Term Frequency Reordering of Document Level", IEEE Access, vol. 6 |
KYOUNGOK KIM et al.: "Trigonometric Comparison Measure: A Feature Selection Method", Data & Knowledge Engineering |
PAN Xiaoying et al.: "Text Feature Selection Algorithm Based on Difference Measure and Mutual Information", Journal of Xi'an University of Posts and Telecommunications |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | | |
SE01 | Entry into force of request for substantive examination | | |
GR01 | Patent grant | | |
CF01 | Termination of patent right due to non-payment of annual fee | | Granted publication date: 20221206 |