CN113515623A - Feature selection method based on word frequency difference factor - Google Patents

Feature selection method based on word frequency difference factor Download PDF

Info

Publication number
CN113515623A
CN113515623A CN202110466347.5A CN202110466347A CN113515623A CN 113515623 A CN113515623 A CN 113515623A CN 202110466347 A CN202110466347 A CN 202110466347A CN 113515623 A CN113515623 A CN 113515623A
Authority
CN
China
Prior art keywords
class
term
feature
document
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110466347.5A
Other languages
Chinese (zh)
Other versions
CN113515623B (en
Inventor
周红芳
李想
马一鸣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN202110466347.5A priority Critical patent/CN113515623B/en
Publication of CN113515623A publication Critical patent/CN113515623A/en
Application granted granted Critical
Publication of CN113515623B publication Critical patent/CN113515623B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211Selection of the most significant subset of features
    • G06F18/2113Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217Validation; Performance evaluation; Active pattern learning techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155Bayesian classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a feature selection method based on word frequency difference value factors, which comprises the steps of firstly selecting text type data sets with different document numbers and thousands or even more than ten thousands of feature numbers, and removing words with the number of the documents more than 25% or less than 3 of the total number; processing a data set which is not divided into a training set and a test set by adopting a 5-fold cross verification method; respectively finishing the dimensionality reduction processing on the data of the training set and the testing set according to the obtained optimal feature subset; training a classification model by adopting a naive Bayes algorithm and a support vector machine algorithm, and predicting to obtain a classification result; and evaluating the classification effect, wherein the higher the scores of the macro F1 and the micro F1 are, the better the classification effect is proved, and the better the performance of the feature selection algorithm is proved. According to the invention, when the relevance between the words and the categories is calculated, the influence of the document frequency and the word frequency on the importance of the words can be considered, the words with high category distinguishing capability are finally selected, and the accuracy and efficiency of classification are improved.

Description

Feature selection method based on word frequency difference factor
Technical Field
The invention belongs to the technical field of text classification, and particularly relates to a feature selection method based on word frequency difference value factors.
Background
The popularization of the internet and the development of information technology bring intelligent experience, greatly enrich life, and simultaneously improve the efficiency of daily learning and work. Nowadays, various information platforms or social software are emerging on the network, and these platforms generate massive data every second, wherein the data stored in the form of documents has important weight, such as personal information registered on e-commerce platforms (such as tianmao, kyoto, etc.), consumption records and evaluations of users, or user comments generated by music and video software, and e-mails, etc. In the face of mass data, it is difficult to extract valuable information in the mass data by manual means efficiently and accurately, and text type data must be processed by means of a machine learning algorithm and a natural language processing technology. Where text classification techniques are of paramount importance. The text classification can classify the text data in the data set according to a certain discrimination standard, so that valuable information is extracted from the text data, and the data processing efficiency is improved. Text classification techniques are widely used and have deep applications in the fields of medicine, biology, traffic management, finance, geographic information and the like.
The text classification mainly comprises the following three stages: preprocessing, feature selection and model training for classification. Since the text data is characterized by the words forming the text data, the phenomenon of 'dimension disaster' inevitably occurs during processing, and therefore, the feature selection must be performed on the data set before classification. The feature selection algorithm mainly comprises three types of filtering type, packaging type and embedding type. The invention relates to a filtering type feature selection algorithm based on word frequency and document frequency, which selects words highly related to categories as optimal features by calculating the score of each word in a document and sequencing the words according to the score, thereby achieving the purpose of reducing dimensions.
Most feature selection algorithms today are based on document frequency solving problems, common algorithms include maximum-minimum-ratio (MMR), CHI-square-test (CHI), chini coefficient (GINI), and Information Gain (IG). They have studied the number of documents in which a word appears in each category as an important point, but neglecting the number of occurrences of the word itself in an article has a great influence on the evaluation of its importance. The recently proposed trigonometric comparison metric algorithm (TCM) is an excellent feature selection algorithm based on document frequency, and solves the problems that a breakpoint exists in a denominator in a classic NDM algorithm and a high score is given to highly sparse words. However, this approach ignores the effect of word frequency on word importance. Therefore, the project group proposes a feature selection method combining word frequency and document frequency, which respectively calculates the average word frequency in the positive document and the negative document, and finds the difference between the two as the weight of the word in the word frequency level.
Disclosure of Invention
The invention aims to provide a feature selection method based on a word frequency difference factor, so that an algorithm can take the influence of document frequency and word frequency on the importance of the words and the categories into consideration when calculating the relevance of the words and the categories, finally selects the words with high category distinguishing capability, and improves the accuracy and efficiency of classification.
The invention adopts the technical scheme that a feature selection method based on word frequency difference value factors is implemented according to the following steps:
step 1, selecting text type data sets with different document numbers and thousands or even tens of thousands of characteristic numbers, and removing words which appear in the data sets and have the document numbers more than 25% or less than 3 of the total number; processing the data set which is not divided into the training set and the test set by adopting a 5-fold cross verification method;
step 2, setting the number of the optimal feature subset elements as C, calculating the score of each feature word of training set data by using a feature selection target function, performing descending order arrangement on the feature words according to the scores, and selecting the front C feature words to form the optimal feature subset; finally, respectively finishing the dimensionality reduction processing on the data of the training set and the test set according to the obtained optimal feature subset;
step 3, training by using the training set obtained in the step 2 and respectively adopting a naive Bayes classifier and a support vector machine classifier to train a classification model, and predicting the sample class of the test set processed in the step 2 to obtain a classification result;
and 4, evaluating the classification effect of the classifier by using the Macro-F1 and Micro-F1 evaluation indexes, wherein the higher the scores of the Macro-F1 and the Micro-F1 are, the better the classification effect is proved, and the better the performance of the feature selection algorithm is further proved.
The present invention is also characterized in that,
the step 2 is as follows:
step 2.1: calculating a word frequency difference value factor TDF of the training set document characteristic words,
2.2, calculating a positive document frequency influence factor of the training set data characteristic words and a triangular comparison measurement factor TCM score;
step 2.3: calculating the term tiGlobal score of (TFTCM) (t)i) Obtaining a feature set with weight;
2.4, performing descending ordering on the feature words of the training set according to the finally calculated global score, and selecting C feature words with the top rank to form an optimal feature subset with the size of C;
and 2.5, respectively processing the training set data and the test set data according to the optimal feature subset obtained in the step 2.4, namely deleting feature words which do not appear in the optimal feature subset in the document, reserving the feature words contained in the optimal feature subset, and obtaining the training set data and the test set data after dimension reduction.
Step 2.1 is specifically as follows:
step 2.1.1, calculate term t according to equation (1)iIn document djFrequency of occurrence in, is denoted as tfijAnd calculating the term t according to the formulas (2) and (3)iIn class CkAverage word frequency in
Figure BDA0003044143800000031
Figure BDA0003044143800000041
Figure BDA0003044143800000042
Figure BDA0003044143800000043
Where k denotes a class number, tcijDenotes the term tiIn document djNumber of occurrences in, NjRepresenting a document djTotal number of words in (1), NkRepresents class CkTotal number of documents in, I (d)j,Ck) For judging the document djWhether or not it belongs to class CkWhen document djBelong to class CkWhen I (d)j,Ck) Is 1, otherwise is 0;
step 2.1.2, calculate term t according to equation (4)iIn a state other than CkAverage word frequency of all documents in class
Figure BDA0003044143800000044
Figure BDA0003044143800000045
Wherein N is the total number of documents in the data set, NkIs of class CkTotal number of documents in;
step 2.1.3, calculating the term t according to the formula (5)iTerm frequency difference factor TDF (t)i,ck),
Figure BDA0003044143800000046
Step 2.2 is specifically as follows:
step 2.2.1, calculate term t according to equation (6)iPositive class document frequency influence factor poski
Figure BDA0003044143800000047
Where tp is denoted as being in class ckTerm of (A) tiNumber of documents appearing, fn, in class ckTerm of (A) tiNumber of documents not present, fp representing non-ckTerm t in classiNumber of documents appearing, tn denotesIn a region other than ckTerm t in classiNumber of documents that do not appear;
step 2.2.2, calculate the term t separatelyiIn class ckTrue rate tpr and false positive rate fpr, true rate
Figure BDA0003044143800000048
False positive rate
Figure BDA0003044143800000049
The term t is calculated using the trigonometric comparison metric algorithm TCM, equation (7)iTCM (t)i,ck) The score is obtained by the above-mentioned method,
TCM(ti,ck)=(2 max(sin2θ,cos2θ)-1)m|tpr-fpr| (7)
wherein θ represents the term tiAnd the corresponding vector (tpr, fpr) forms an included angle with the coordinate axis closest to the vector, m is a parameter for controlling the influence of the off-axis angle on the integral score, and the integral effect is the best when m is 100.
Step 2.3 is specifically as follows:
calculating the term t according to equation (8)iGlobal score of (TFTCM) (t)i) And obtaining a feature set with weight values:
Figure BDA0003044143800000051
wherein k represents a class number, P (C)k) Indicates belonging to class CkThe number of documents in the entire data set.
The invention has the beneficial effects that the feature selection method based on the word frequency difference factor endows the words with higher word frequency in a certain class with higher weight by calculating the average word frequency and the corresponding difference value of the words in the positive class and the negative class. The TCM algorithm improved based on the word frequency difference factor fully considers the document frequency of the characteristic words and the influence of the word frequency on the importance of the characteristic words. The introduction of the positive document frequency influence factor enables the algorithm to pay more attention to the influence of the text document frequency of the words in the multi-classification task, namely, if the proportion of the number of the documents appearing in the positive class of a word is larger, the importance degree of the word is larger. The invention can well select the characteristic words with high category resolution capability, and improves the accuracy and efficiency of classification.
Drawings
FIG. 1 is a flow chart of a feature selection method based on word frequency difference factors according to the present invention;
2(a) -2 (d) are comparison results of Macro-F1 and Micro-F1 obtained when a naive Bayes classifier and a support vector machine classifier are used for classification under different feature word dimensions on a K1b data set in sequence according to the feature selection method based on the word frequency difference factor of the invention and the prior art;
3(a) -3 (d) are the comparison results of Macro-F1 and Micro-F1 obtained when the present invention and the prior art are classified by using a naive Bayes classifier and a support vector machine classifier under different feature word dimensions on a KDC data set in sequence;
4(a) -4 (d) are the comparison results of Macro-F1 and Micro-F1 obtained when the present invention and the prior art are classified by using a naive Bayes classifier and a support vector machine classifier under different feature word dimensions on an R8 data set in sequence;
5(a) -5 (d) are the comparison results of Macro-F1 and Micro-F1 obtained when the present invention and the prior art are classified by using a naive Bayes classifier and a support vector machine classifier under different feature word dimensions on an R52 data set in sequence;
6(a) -6 (d) are the comparison results of Macro-F1 and Micro-F1 obtained when the present invention and the prior art are classified by using a naive Bayes classifier and a support vector machine classifier under different feature word dimensions on an RE1 data set in sequence;
7(a) -7 (d) are the comparison results of Macro-F1 and Micro-F1 obtained when the invention and the prior art are classified by using a naive Bayes classifier and a support vector machine classifier under different feature word dimensions on a 20Newsgroups data set in sequence.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The flow chart of the invention is shown in fig. 1, and the feature selection method based on the word frequency difference value factor is implemented according to the following steps:
step 1, selecting text type data sets with different document numbers and thousands or even tens of thousands of characteristic numbers, and removing words which appear in the data sets and have the document numbers more than 25% or less than 3 of the total number; processing the data set which is not divided into the training set and the test set by adopting a 5-fold cross verification method;
step 2, setting the number of the optimal feature subset elements as C, calculating the score of each feature word of training set data by using a feature selection target function, performing descending order arrangement on the feature words according to the scores, and selecting the front C feature words to form the optimal feature subset; finally, respectively finishing the dimensionality reduction processing on the data of the training set and the test set according to the obtained optimal feature subset;
the step 2 is as follows:
step 2.1: calculating a word frequency difference value factor TDF of the training set document characteristic words,
2.2, calculating a positive document frequency influence factor of the training set data characteristic words and a triangular comparison measurement factor TCM score;
step 2.3: calculating the term tiGlobal score of (TFTCM) (t)i) Obtaining a feature set with weight;
2.4, performing descending ordering on the feature words of the training set according to the finally calculated global score, and selecting C feature words with the top rank to form an optimal feature subset with the size of C;
and 2.5, respectively processing the training set data and the test set data according to the optimal feature subset obtained in the step 2.4, namely deleting feature words which do not appear in the optimal feature subset in the document, reserving the feature words contained in the optimal feature subset, and obtaining the training set data and the test set data after dimension reduction.
Step 2.1 is specifically as follows:
step 2.1.1, calculate term t according to equation (1)iIn document djFrequency of occurrence in, is denoted as tfijAnd calculating the term t according to the formulas (2) and (3)iIn class CkAverage word frequency in
Figure BDA0003044143800000071
Figure BDA0003044143800000072
Figure BDA0003044143800000081
Figure BDA0003044143800000082
Where k denotes a class number, tcijDenotes the term tiIn document djNumber of occurrences in, NjRepresenting a document djTotal number of words in (1), NkRepresents class CkTotal number of documents in, I (d)j,Ck) For judging the document djWhether or not it belongs to class CkWhen document djBelong to class CkWhen I (d)j,Ck) Is 1, otherwise is 0;
step 2.1.2, calculate term t according to equation (4)iIn a state other than CkAverage word frequency of all documents in class
Figure BDA0003044143800000083
Figure BDA0003044143800000084
Wherein N is the total number of documents in the data set, NkIs of class CkTotal number of documents in;
step 2.1.3, calculating the term t according to the formula (5)iTerm frequency difference factor TDF (t)i,ck),
Figure BDA0003044143800000085
Step 2.2 is specifically as follows:
step 2.2.1, calculate term t according to equation (6)iPositive class document frequency influence factor poski
Figure BDA0003044143800000086
Where tp is denoted as being in class ckTerm of (A) tiNumber of documents appearing, fn, in class ckTerm of (A) tiNumber of documents not present, fp representing non-ckTerm t in classiNumber of documents present, tn, at not ckTerm t in classiNumber of documents that do not appear;
step 2.2.2, calculate the term t separatelyiIn class ckTrue rate tpr and false positive rate fpr, true rate
Figure BDA0003044143800000087
False positive rate
Figure BDA0003044143800000088
The term t is calculated using the trigonometric comparison metric algorithm TCM, equation (7)iTCM (t)i,ck) The score is obtained by the above-mentioned method,
TCM(ti,ck)=(2 max(sin2θ,cos2θ)-1)m|tpr-fpr| (7)
wherein θ represents the term tiAnd the corresponding vector (tpr, fpr) forms an included angle with the coordinate axis closest to the vector, m is a parameter for controlling the influence of the off-axis angle on the integral score, and the integral effect is the best when m is 100.
Step 2.3 is specifically as follows:
calculating the term t according to equation (8)iGlobal score of (TFTCM) (t)i) And obtaining a feature set with weight values:
Figure BDA0003044143800000091
wherein k represents a class number, P (C)k) Indicates belonging to class CkThe number of documents in the entire data set.
Step 3, training by using the training set obtained in the step 2 and respectively adopting a naive Bayes classifier and a support vector machine classifier to train a classification model, and predicting the sample class of the test set processed in the step 2 to obtain a classification result;
and 4, evaluating the classification effect of the classifier by using the Macro-F1 and Micro-F1 evaluation indexes, wherein the higher the scores of the Macro-F1 and the Micro-F1 are, the better the classification effect is proved, and the better the performance of the feature selection algorithm is further proved.
On the basis of TCM algorithm, respectively calculating the appearance of each word in CkClass and non-CkAnd (4) average word frequency in the class document, and calculating the difference between the average word frequency and the class document to realize the correlation judgment of the word frequency level.
In comparative experiments, six data sets of K1b, KDC, R8, R52, RE1, 20Newsgroups were used for testing. Wherein K1b, RE1 is a text type dataset obtained from the university of minnesota Karypis laboratory; the KDC data set consists of a group of Kurdish text documents from different websites; r52 and R8 are obtained by processing a Reuters21578 data set, and the Reuters21578 is a classic text classification test set and is commonly used in the research fields of information retrieval, machine learning and the like; the 20Newsgroups dataset is composed of 20000 Newsgroups of 20Newsgroups extracted.
In order to verify the performance of the feature selection algorithm based on the word frequency difference factor, the invention is compared with five algorithms of maximum-minimum ratio (MMR), Chi-square test (CHI), Gini coefficient (GINI), Triangular Comparison Measure (TCM) and Information Gain (IG). From fig. 2(a) -2 (d), it can be seen that on the K1b data set, the Macro-F1 and Micro-F1 scores of the invention are superior to those of the comparison algorithm in most cases, accounting for 71.88%, and the performance is better. It can be seen from fig. 3(a) -3 (d) that the performance of the present invention is more stable on KDC datasets and that the highest values are achieved at multiple comparison points over other algorithms. From fig. 4(a) -4 (d), it can be seen that for the R8 dataset, whether using naive bayes or support vector machine classifiers, the invention achieved the highest Macro-F1 scores at the 5 points of the lower dimension, while performing less well on the high dimension points, for Micro-F1, the overall performance of the invention was more stable, with the highest values at the multiple contrast points. From fig. 5(a) -5 (d), it can be seen that on the R52 data set, when using the naive bayes classifier, the present invention achieves the highest score at most of the contrast points, with a proportion of 81.25%, and when using the support vector machine classifier, the performance of the present invention is optimal at a plurality of contrast points, although the performance is somewhat degraded. From fig. 6(a) -6 (d), it can be seen that the overall performance of the present invention performed well for the RE1 dataset, with the highest scores obtained at most of the comparison points, and the optimal percentage being 71.88%. From fig. 7(a) -7 (d), it can be seen that for the 20Newsgroups data set, the present invention achieves the best score at almost all comparison points, the optimal case is 90.63%, and the performance is significantly better than the comparison algorithm. The invention has stable overall performance and is a reliable feature selection algorithm.

Claims (5)

1. The feature selection method based on the word frequency difference factor is characterized by being implemented according to the following steps:
step 1, selecting text type data sets with different document numbers and thousands or even tens of thousands of characteristic numbers, and removing words which appear in the data sets and have the document numbers more than 25% or less than 3 of the total number; processing the data set which is not divided into the training set and the test set by adopting a 5-fold cross verification method;
step 2, setting the number of the optimal feature subset elements as C, calculating the score of each feature word of training set data by using a feature selection target function, performing descending order arrangement on the feature words according to the scores, and selecting the front C feature words to form the optimal feature subset; finally, respectively finishing the dimensionality reduction processing on the data of the training set and the test set according to the obtained optimal feature subset;
step 3, training by using the training set obtained in the step 2 and respectively adopting a naive Bayes classifier and a support vector machine classifier to train a classification model, and predicting the sample class of the test set processed in the step 2 to obtain a classification result;
and 4, evaluating the classification effect of the classifier by using the Macro-F1 and Micro-F1 evaluation indexes, wherein the higher the scores of the Macro-F1 and the Micro-F1 are, the better the classification effect is proved, and the better the performance of the feature selection algorithm is further proved.
2. The method for selecting features based on word frequency difference factors according to claim 1, wherein the step 2 specifically comprises the following steps:
step 2.1: calculating a word frequency difference value factor TDF of the training set document characteristic words,
2.2, calculating a positive document frequency influence factor of the training set data characteristic words and a triangular comparison measurement factor TCM score;
step 2.3: calculating the term tiGlobal score of (TFTCM) (t)i) Obtaining a feature set with weight;
2.4, performing descending ordering on the feature words of the training set according to the finally calculated global score, and selecting C feature words with the top rank to form an optimal feature subset with the size of C;
and 2.5, respectively processing the training set data and the test set data according to the optimal feature subset obtained in the step 2.4, namely deleting feature words which do not appear in the optimal feature subset in the document, reserving the feature words contained in the optimal feature subset, and obtaining the training set data and the test set data after dimension reduction.
3. The method for selecting features based on word frequency difference factors according to claim 2, wherein the step 2.1 is as follows:
step 2.1.1, calculate term t according to equation (1)iIn document djFrequency of occurrence in, is denoted as tfijAnd calculating the term t according to the formulas (2) and (3)iIn class CkAverage word in (1)Frequency converter
Figure FDA0003044143790000021
Figure FDA0003044143790000022
Figure FDA0003044143790000023
Figure FDA0003044143790000024
Where k denotes a class number, tcijDenotes the term tiIn document djNumber of occurrences in, NjRepresenting a document djTotal number of words in (1), NkRepresents class CkTotal number of documents in, I (d)j,Ck) For judging the document djWhether or not it belongs to class CkWhen document djBelong to class CkWhen I (d)j,Ck) Is 1, otherwise is 0;
step 2.1.2, calculate term t according to equation (4)iIn a state other than CkAverage word frequency of all documents in class
Figure FDA0003044143790000025
Figure FDA0003044143790000026
Wherein N is the total number of documents in the data set, NkIs of class CkTotal number of documents in;
step 2.1.3, calculating the term t according to the formula (5)iTerm frequency difference factor TDF (t)i,ck),
Figure FDA0003044143790000027
4. The method for selecting features based on word frequency difference factors according to claim 3, wherein the step 2.2 is as follows:
step 2.2.1, calculate term t according to equation (6)iPositive class document frequency influence factor poski
Figure FDA0003044143790000031
Where tp is denoted as being in class ckTerm of (A) tiNumber of documents appearing, fn, in class ckTerm of (A) tiNumber of documents not present, fp representing non-ckTerm t in classiNumber of documents present, tn, at not ckTerm t in classiNumber of documents that do not appear;
step 2.2.2, calculate the term t separatelyiIn class ckTrue rate tpr and false positive rate fpr, true rate
Figure FDA0003044143790000032
False positive rate
Figure FDA0003044143790000033
The term t is calculated using the trigonometric comparison metric algorithm TCM, equation (7)iTCM (t)i,ck) The score is obtained by the above-mentioned method,
TCM(ti,ck)=(2max(sin2θ,coS2θ)-1)m|tpr-fpr| (7)
wherein θ represents the term tiAnd the corresponding vector (tpr, fpr) forms an included angle with the coordinate axis closest to the vector, m is a parameter for controlling the influence of the off-axis angle on the integral score, and the integral effect is the best when m is 100.
5. The method for selecting features based on word frequency difference factors according to claim 4, wherein the step 2.3 is as follows:
calculating the term t according to equation (8)iGlobal score of (TFTCM) (t)i) And obtaining a feature set with weight values:
Figure FDA0003044143790000034
Figure FDA0003044143790000041
wherein k represents a class number, P (C)k) Indicates belonging to class CkThe number of documents in the entire data set.
CN202110466347.5A 2021-04-28 2021-04-28 Feature selection method based on word frequency difference factor Expired - Fee Related CN113515623B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110466347.5A CN113515623B (en) 2021-04-28 2021-04-28 Feature selection method based on word frequency difference factor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110466347.5A CN113515623B (en) 2021-04-28 2021-04-28 Feature selection method based on word frequency difference factor

Publications (2)

Publication Number Publication Date
CN113515623A true CN113515623A (en) 2021-10-19
CN113515623B CN113515623B (en) 2022-12-06

Family

ID=78063918

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110466347.5A Expired - Fee Related CN113515623B (en) 2021-04-28 2021-04-28 Feature selection method based on word frequency difference factor

Country Status (1)

Country Link
CN (1) CN113515623B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104750844A (en) * 2015-04-09 2015-07-01 中南大学 Method and device for generating text characteristic vectors based on TF-IGM, method and device for classifying texts
CN104794187A (en) * 2015-04-13 2015-07-22 西安理工大学 Feature selection method based on entry distribution
CN105512311A (en) * 2015-12-14 2016-04-20 北京工业大学 Chi square statistic based self-adaption feature selection method
CN110069630A (en) * 2019-03-20 2019-07-30 重庆信科设计有限公司 A kind of improved mutual information feature selection approach
CN111062212A (en) * 2020-03-18 2020-04-24 北京热云科技有限公司 Feature extraction method and system based on optimized TFIDF
CN111709439A (en) * 2020-05-06 2020-09-25 西安理工大学 Feature selection method based on word frequency deviation rate factor
US20210019422A1 (en) * 2019-07-17 2021-01-21 Vmware, Inc. Feature selection using term frequency-inverse document frequency (tf-idf) model

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104750844A (en) * 2015-04-09 2015-07-01 中南大学 Method and device for generating text characteristic vectors based on TF-IGM, method and device for classifying texts
CN104794187A (en) * 2015-04-13 2015-07-22 西安理工大学 Feature selection method based on entry distribution
CN105512311A (en) * 2015-12-14 2016-04-20 北京工业大学 Chi square statistic based self-adaption feature selection method
CN110069630A (en) * 2019-03-20 2019-07-30 重庆信科设计有限公司 A kind of improved mutual information feature selection approach
US20210019422A1 (en) * 2019-07-17 2021-01-21 Vmware, Inc. Feature selection using term frequency-inverse document frequency (tf-idf) model
CN111062212A (en) * 2020-03-18 2020-04-24 北京热云科技有限公司 Feature extraction method and system based on optimized TFIDF
CN111709439A (en) * 2020-05-06 2020-09-25 西安理工大学 Feature selection method based on word frequency deviation rate factor

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HONGFANG ZHOU等: "Feature Selection Based on Term Frequency Reordering of Document Level", 《 IEEE ACCESS ( VOLUME: 6)》 *
KYOUNGOK KIM等: "Trigonometric comparison measure: A feature selection method", 《DATA & KNOWLEDGE ENGINEERING》 *
潘晓英等: "基于差异度量和互信息的文本特征选择算法", 《西安邮电大学学报》 *

Also Published As

Publication number Publication date
CN113515623B (en) 2022-12-06

Similar Documents

Publication Publication Date Title
CN107577785B (en) Hierarchical multi-label classification method suitable for legal identification
Trstenjak et al. KNN with TF-IDF based framework for text categorization
CN111709439B (en) Feature selection method based on word frequency deviation rate factor
CN103617157A (en) Text similarity calculation method based on semantics
JP5094830B2 (en) Image search apparatus, image search method and program
CN110826618A (en) Personal credit risk assessment method based on random forest
CN104298715A (en) TF-IDF based multiple-index result merging and sequencing method
CN109657011A (en) A kind of data digging method and system screening attack of terrorism criminal gang
CN109376235B (en) Feature selection method based on document layer word frequency reordering
Rahardi et al. Sentiment analysis of Covid-19 vaccination using support vector machine in Indonesia
CN109783633A (en) Data analysis service procedural model recommended method
Christen et al. A review of the F-measure: Its history, properties, criticism, and alternatives
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
CN112417152A (en) Topic detection method and device for case-related public sentiment
Sitorus et al. Sensing trending topics in twitter for greater Jakarta area
CN113010884B (en) Real-time feature filtering method in intrusion detection system
CN116629258B (en) Structured analysis method and system for judicial document based on complex information item data
CN109783586B (en) Water army comment detection method based on clustering resampling
CN112417082A (en) Scientific research achievement data disambiguation filing storage method
CN113515623B (en) Feature selection method based on word frequency difference factor
CN116881451A (en) Text classification method based on machine learning
CN106991171A (en) Topic based on Intelligent campus information service platform finds method
Kesidis et al. Efficient cut-off threshold estimation for word spotting applications
CN112579783B (en) Short text clustering method based on Laplace atlas
CN111382273B (en) Text classification method based on feature selection of attraction factors

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20221206