CN105512311A - Chi-square statistic based adaptive feature selection method - Google Patents
- Publication number
- CN105512311A (application CN201510927759.9A)
- Authority
- CN
- China
- Prior art keywords
- text
- feature
- classification
- chi
- value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2216/00—Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
- G06F2216/03—Data mining
Abstract
The invention discloses an adaptive feature selection method based on the chi-square statistic, and relates to the field of computer text data processing. First, the training text set and the test text set are preprocessed, including word segmentation and stop-word removal. Then, adaptive text feature selection based on the chi-square statistic is performed: a word-frequency factor and an inter-class variance are defined and introduced into the CHI algorithm, and an appropriate scale factor is added to the CHI algorithm. Finally, the scale factor is adjusted automatically in combination with the evaluation indexes of the classical KNN algorithm, so that the improved CHI adapts to different text corpora and a higher classification accuracy is guaranteed. Experimental results show that, compared with the conventional CHI method, the classification accuracy on both balanced and unbalanced corpora is improved.
Description
Technical field
The present invention relates to the field of computer text data processing, and in particular to an adaptive text feature selection method based on the chi-square statistic (χ², CHI).
Background technology
In the current era of big data, mining the latent value of data is of great importance, and data mining, as the technology for discovering that latent value, has attracted wide attention. Text accounts for a considerable proportion of big data, and text classification, as an effective data mining method for organizing and managing text data, has gradually become a focus of attention. It is widely applied in information filtering, information organization and management, information retrieval, digital libraries, spam filtering, and so on. Text classification (Text Classification, TC) refers to the process of automatically assigning a text of unknown class to one or more classes, according to its content, under a predefined classification system. Common text classification methods include K-Nearest Neighbor (KNN), Naive Bayes (NB) and Support Vector Machine (SVM).
At present, the text classification process comprises preprocessing, feature dimensionality reduction, text representation, classifier training, and evaluation. The most common text representation is the vector space model; the high dimensionality and sparsity of the vector space increase time and space complexity and greatly affect classification precision, so feature dimensionality reduction is essential: it directly affects the efficiency and accuracy of classification. Feature dimensionality reduction mainly comprises two kinds of methods: feature extraction (Feature Extraction) and feature selection (Feature Selection). Linguistics-based feature extraction requires natural language processing techniques and has high computational complexity, whereas feature selection methods based on statistical theory have lower complexity and do not need much background knowledge, so feature selection is more widely applied. The basic idea of feature selection is to construct an evaluation function, score each feature of the feature set, sort all features by their scores, and select a given number of features as the final text feature set. Common feature selection methods include the chi-square statistic, Document Frequency (DF), Information Gain (IG), Mutual Information (MI), Expected Cross Entropy (ECE) and Weight of Evidence (WE).
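For illustration only (not part of the patent disclosure), the basic feature selection scheme described above, scoring every feature with an evaluation function, sorting by score, and keeping the top N, can be sketched in a few lines of Python; the term scores below are hypothetical placeholders for whatever evaluation function (CHI, IG, MI, ...) is used:

```python
from typing import Dict, List

def select_features(scores: Dict[str, float], num_features: int) -> List[str]:
    """Rank all candidate terms by their evaluation score and keep the top N."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:num_features]

# Hypothetical scores produced by some evaluation function.
scores = {"economy": 8.2, "the": 0.1, "football": 7.9, "and": 0.05}
print(select_features(scores, 2))  # -> ['economy', 'football']
```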
As one of the common text feature selection methods, the CHI method is simple to implement and has low time complexity; however, it also has many shortcomings, so that its classification effect is unsatisfactory. The deficiencies of the CHI algorithm lie mainly in two aspects: first, CHI considers only the document frequency of a feature and ignores its term frequency, so the weight of low-frequency words is exaggerated; second, it exaggerates the weight of features that occur in few documents of one class but often in other classes. Many researchers have improved the CHI algorithm to address these deficiencies, and the improvements fall into two categories. First, some introduce regulating parameters to reduce the dependence on low-frequency words, but do not consider the positive and negative correlation between features and classes. Second, some introduce a scale factor that classifies features by their positive or negative correlation and assigns them different weights so as to improve the selection ability of the CHI model, but the scale factor has to be chosen by experience. Considering the deficiencies of the current improved CHI algorithms, designing a CHI text feature selection method with high classification precision has important academic significance and practical value.
Summary of the invention
The object of the invention is to provide an improved CHI text feature selection method, thereby improving the accuracy of text classification. On the one hand, a word-frequency factor and an inter-class variance are introduced to reduce CHI's dependence on low-frequency words and to select features that occur frequently in a given class and are evenly distributed within that class; on the other hand, an adaptive scale factor μ is introduced to classify features by their positive or negative correlation and assign them different weights, reducing the error caused by choosing the scale factor manually.
The features of the present invention are as follows:
Step 1: download from the Internet the training text set and test text set of the Chinese corpus released by Fudan University;
Step 2: use the open-source Chinese Academy of Sciences word segmentation software ICTCLAS to preprocess the training and test text sets, including word segmentation and stop-word removal, to obtain the segmented training and test text sets;
Step 3: apply the adaptive text feature selection method based on CHI to the segmented training text set to obtain the feature dictionary corresponding to the training text set;
The computing formula of the traditional CHI text feature selection method is as follows:
χ²(t_k, C_i) = N(AD − BC)² / [(A + B)(C + D)(A + C)(B + D)]   (1)
wherein N = A + B + C + D is the total number of documents; A denotes the number of documents that contain feature t_k and belong to class C_i; B denotes the number of documents that contain t_k but do not belong to C_i; C denotes the number of documents that do not contain t_k but belong to C_i; and D denotes the number of documents that neither contain t_k nor belong to C_i.
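As an illustration (not part of the patent disclosure), the traditional CHI score can be computed directly from the four contingency counts A, B, C, D; the example counts below are hypothetical:

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Traditional CHI score from the 2x2 contingency counts.

    a: docs containing t_k and in C_i;    b: containing t_k, not in C_i;
    c: not containing t_k, in C_i;        d: neither.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

# Example: a term appearing in 30 of the 40 in-class documents
# and in only 10 of the 160 out-of-class documents.
print(chi_square(30, 10, 10, 150))  # -> 94.53125
```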
The proposed adaptive text feature selection method based on CHI is formulated as follows:
χ²_new(t_k, C_i) = [μ·χ²(t_k, C_i)⁺ + (1 − μ)·χ²(t_k, C_i)⁻]·α·β   (2)
Wherein, μ is the adaptive factor, α is the word-frequency factor and β is the inter-class variance; α and β are defined as follows:
α = tf(t_k, C_i) / Σ_{i=1..m} tf(t_k, C_i)
wherein m is the total number of classes in the training set, tf(t_k, C_i) denotes the number of times feature t_k occurs in class C_i, and Σ_{i=1..m} tf(t_k, C_i) denotes the number of times t_k occurs in the whole training text set. If class C_i of the training set contains n documents d_i1, d_i2, …, d_ij, …, d_in, then tf(t_k, d_ij) denotes the number of times t_k occurs in the j-th document of class C_i, tf(t_k, C_i) = Σ_{j=1..n} tf(t_k, d_ij) is the total number of times t_k occurs in class C_i, and Σ_{i=1..m} tf(t_k, C_i) is the total number of times t_k occurs in all documents of the whole training text set. The word-frequency factor α is therefore the ratio of the term frequency of t_k in class C_i to its term frequency in the whole training set. The larger α is, the more frequently the feature occurs in this class and the less frequently (or hardly at all) it occurs in other classes, so such a feature clearly has a stronger class discrimination ability; the smaller α is, the less frequently it occurs in this class and the more frequently in other classes, so such a feature clearly has a weaker class discrimination ability.
Wherein, m denotes the number of classes; df_i is the number of documents in class C_i that contain t_k; the mean d̄f = (1/m)·Σ_{i=1..m} df_i is the average number of documents per class that contain t_k. The positive case applies when the number of texts in which the feature word appears in a given class is greater than or equal to the mean d̄f, and the negative case when it is less than the mean. The value of β measures the degree to which the document frequency of the feature word in a given class deviates from the mean document frequency over all classes. The larger β is, the more the number of documents containing feature word t_k in class C_i exceeds the average number of documents containing t_k over all classes, and the stronger the class discrimination ability of such a feature.
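A minimal Python sketch of the improved score of equation (2) follows, for illustration only. Because the exact formula image for β is not reproduced in this text, the sketch substitutes the ratio of df_i to the class-average document frequency as an assumed stand-in for the inter-class-variance factor, and all numeric inputs are hypothetical:

```python
from typing import List

def improved_chi(chi_pos: float, chi_neg: float, mu: float,
                 tf_in_class: int, tf_total: int,
                 df_in_class: int, df_per_class: List[int]) -> float:
    """Equation (2): [mu*chi+ + (1-mu)*chi-] * alpha * beta.

    alpha: share of the term's total frequency that falls in this class.
    beta : taken here as df_i / mean(df) -- an assumed stand-in for the
           patent's inter-class-variance factor.
    """
    alpha = tf_in_class / tf_total if tf_total else 0.0
    mean_df = sum(df_per_class) / len(df_per_class)
    beta = df_in_class / mean_df if mean_df else 0.0  # assumed form
    return (mu * chi_pos + (1 - mu) * chi_neg) * alpha * beta

score = improved_chi(chi_pos=94.5, chi_neg=3.1, mu=0.5,
                     tf_in_class=120, tf_total=150,
                     df_in_class=30, df_per_class=[30, 5, 3, 2])
print(round(score, 3))  # -> 117.12
```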
Step 4: represent each training text and each test text as a vector over the feature words of the feature dictionary. The weight of each dimension is computed as TFIDF = TF × IDF, where TF (Term Frequency) is the number of times the feature occurs in the document and IDF (Inverse Document Frequency) is computed as IDF = log(M/(n_k + 0.01)), wherein M is the number of texts in the document collection and n_k is the number of documents containing the word;
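A short illustrative sketch of the TF-IDF weighting of step 4, using the patent's smoothed formula IDF = log(M/(n_k + 0.01)); the feature dictionary and document frequencies below are hypothetical:

```python
import math
from typing import Dict, List

def tfidf_vector(doc_terms: List[str], feature_words: List[str],
                 doc_freq: Dict[str, int], num_docs: int) -> List[float]:
    """Represent one text as a TF-IDF vector over the feature dictionary,
    with TFIDF = TF * log(M / (n_k + 0.01))."""
    vec = []
    for word in feature_words:
        tf = doc_terms.count(word)                       # TF in this document
        idf = math.log(num_docs / (doc_freq.get(word, 0) + 0.01))
        vec.append(tf * idf)
    return vec

# Hypothetical 3-word feature dictionary and a 100-document collection.
features = ["economy", "market", "football"]
dfs = {"economy": 20, "market": 35, "football": 5}
print(tfidf_vector(["economy", "market", "economy"], features, dfs, 100))
```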
Step 5: perform KNN classification;
The training text set is S, the test text is d, n is the feature-vector dimension threshold, and K is set to 35.
Step 5.1: compute the similarity between test text d and every text in S using the cosine of the vector angle;
Step 5.2: select the K texts with the largest similarity obtained in step 5.1 as the K nearest-neighbor texts of test text d;
Step 5.3: compute the weight of test text d for each class and assign d to the class with the largest weight.
If the known class of training text d_i is C_j, the weight is computed as follows:
W(d, C_j) = Σ_{d_i ∈ KNN(d)} Sim(d, d_i)·y(d_i, C_j)
wherein Sim(d, d_i) is the cosine similarity between test text d and known-class text d_i:
Sim(d, d_i) = Σ_{j=1..n} X_j·x_ij / sqrt( (Σ_{j=1..n} X_j²)·(Σ_{j=1..n} x_ij²) )
wherein n is the feature-vector dimension threshold, X_j denotes the weight of the j-th dimension of the text d to be classified (0 < j ≤ n), and x_ij denotes the weight of the j-th dimension of training text vector d_i.
y(d_i, C_j) is the class attribute function: y(d_i, C_j) = 1 if d_i belongs to class C_j, and 0 otherwise.
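Steps 5.1-5.3 can be sketched as below, for illustration only; the toy two-dimensional vectors stand in for real TF-IDF vectors, and a small k is used instead of the patent's K = 35:

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

def cosine(x: List[float], y: List[float]) -> float:
    """Cosine of the angle between two text vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def knn_classify(test_vec: List[float],
                 training: List[Tuple[List[float], str]], k: int) -> str:
    """Find the K most similar training texts, sum their similarities
    per class (y selects the known class), return the heaviest class."""
    sims = sorted(((cosine(test_vec, vec), label) for vec, label in training),
                  reverse=True)[:k]
    weight: Dict[str, float] = defaultdict(float)
    for sim, label in sims:
        weight[label] += sim
    return max(weight, key=weight.get)

training = [([1.0, 0.0], "sports"), ([0.9, 0.1], "sports"),
            ([0.0, 1.0], "economy")]
print(knn_classify([0.8, 0.2], training, k=3))  # -> sports
```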
Step 6: compute the precision, recall and F1 value of the KNN classification algorithm. By setting a maximum threshold on the difference between the F1 values of two successive classifications and a step size by which the scale factor μ grows, the final value of μ is obtained, so as to guarantee a higher classification accuracy.
Step 6.1: set the initial F1 value to 0 and the initial value of μ to 0.5; ε = 0.0001 is the threshold on the difference between two successive F1 values, and τ = 0.05 is the step size by which μ grows;
Step 6.2: repeat step 5 to obtain the value F1′ and compute the difference ΔF = |F1′ − F1| between the two successive runs;
Step 6.3: if ΔF is less than ε, the current scale factor μ is obtained; if ΔF is greater than or equal to ε, set μ′ = μ + τ and F1 = F1′, and repeat steps 6.2 and 6.3 until a suitable scale factor μ is obtained.
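The μ-tuning loop of step 6 can be sketched as below, for illustration only; `evaluate_f1` stands for re-running feature selection and KNN classification (steps 3-5) at a given μ, and the lambda in the example is a hypothetical F1 curve that saturates:

```python
from typing import Callable

def tune_mu(evaluate_f1: Callable[[float], float],
            mu: float = 0.5, epsilon: float = 0.0001,
            tau: float = 0.05) -> float:
    """Grow mu by tau until the F1 of two successive runs differs
    by less than epsilon (steps 6.1-6.3)."""
    f1 = 0.0
    while mu <= 1.0:
        f1_new = evaluate_f1(mu)
        if abs(f1_new - f1) < epsilon:
            return mu
        f1 = f1_new
        mu += tau
    return min(mu, 1.0)

# Hypothetical F1 curve that saturates at 0.9 once mu is large enough.
mu = tune_mu(lambda m: min(0.9, 0.7 + 0.3 * m))
print(round(mu, 2))  # -> 0.75
```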
Compared with the prior art, the present invention has the following beneficial effects.
The present invention proposes an adaptive feature selection method based on the chi-square statistic; the KNN algorithm is selected as the classification algorithm for classifying the test texts. The overall flowchart is shown in Fig. 1, the flowchart for calculating the scale factor μ is shown in Fig. 2, the accuracy indexes on the balanced corpus are given in Table 1, and the accuracy on the unbalanced corpus is given in Table 2. Compared with the traditional CHI method, on the one hand a word-frequency factor and an inter-class variance are introduced to reduce CHI's dependence on low-frequency words and to select features that occur frequently in a given class and are evenly distributed within that class; on the other hand, an adaptive scale factor μ is introduced to classify features by their positive or negative correlation and assign them different weights, and the method is applicable to corpora of different distributions, thereby reducing the error caused by choosing the scale factor manually. As can be seen from Tables 1 and 2, compared with the traditional CHI method, the present invention improves classification precision on both the balanced corpus and the unbalanced corpus.
Brief description of the drawings
Fig. 1 is the flowchart of the overall process of the present invention.
Fig. 2 is the flowchart of calculating the scale factor μ in the present invention.
Embodiment
The present invention is realized by the following technical means:
An adaptive text feature selection method based on the chi-square statistic. First, the training text set and the test text set are preprocessed, including word segmentation and stop-word removal; secondly, adaptive text feature selection based on the chi-square statistic is performed: the word-frequency factor α and the inter-class variance β are defined and introduced into the CHI algorithm, and a suitable scale factor μ is added to the CHI algorithm; finally, in combination with the classical KNN algorithm, the scale factor μ is adjusted automatically, so that the improved CHI is applicable to different corpora and a higher classification accuracy is guaranteed.
The above adaptive text feature selection method based on the chi-square statistic is used for text classification and comprises the following steps:
Step 1: download from the Internet the training text set and test text set of the Chinese corpus released by Fudan University;
Step 2: use the word segmentation software ICTCLAS to preprocess the training and test text sets, including word segmentation and stop-word removal, to obtain the segmented training and test text sets;
Step 3: apply the adaptive text feature selection method based on CHI to the segmented training text set to obtain the feature dictionary corresponding to the training text set;
The computing formula of the traditional CHI text feature selection method is as follows:
χ²(t_k, C_i) = N(AD − BC)² / [(A + B)(C + D)(A + C)(B + D)]   (1)
wherein N = A + B + C + D is the total number of documents; A denotes the number of documents that contain feature t_k and belong to class C_i; B denotes the number of documents that contain t_k but do not belong to C_i; C denotes the number of documents that do not contain t_k but belong to C_i; and D denotes the number of documents that neither contain t_k nor belong to C_i.
The proposed adaptive text feature selection method based on CHI is formulated as follows:
χ²_new(t_k, C_i) = [μ·χ²(t_k, C_i)⁺ + (1 − μ)·χ²(t_k, C_i)⁻]·α·β   (2)
Wherein, μ is the adaptive factor, α is the word-frequency factor and β is the inter-class variance; α and β are defined as follows:
α = tf(t_k, C_i) / Σ_{i=1..m} tf(t_k, C_i)
wherein m is the total number of classes in the training set, tf(t_k, C_i) denotes the number of times feature t_k occurs in class C_i, and Σ_{i=1..m} tf(t_k, C_i) denotes the number of times t_k occurs in the whole training text set. If class C_i of the training set contains n documents d_i1, d_i2, …, d_ij, …, d_in, then tf(t_k, d_ij) denotes the number of times t_k occurs in the j-th document of class C_i, tf(t_k, C_i) = Σ_{j=1..n} tf(t_k, d_ij) is the total number of times t_k occurs in class C_i, and Σ_{i=1..m} tf(t_k, C_i) is the total number of times t_k occurs in all documents of the whole training text set. The word-frequency factor α is therefore the ratio of the term frequency of t_k in class C_i to its term frequency in the whole training set. The larger α is, the more frequently the feature occurs in this class and the less frequently (or hardly at all) it occurs in other classes, so such a feature clearly has a stronger class discrimination ability; the smaller α is, the less frequently it occurs in this class and the more frequently in other classes, so such a feature clearly has a weaker class discrimination ability.
Wherein, m denotes the number of classes; df_i is the number of documents in class C_i that contain t_k; the mean d̄f = (1/m)·Σ_{i=1..m} df_i is the average number of documents per class that contain t_k. The positive case applies when the number of texts in which the feature word appears in a given class is greater than or equal to the mean d̄f, and the negative case when it is less than the mean. The value of β measures the degree to which the document frequency of the feature word in a given class deviates from the mean document frequency over all classes. The larger β is, the more the number of documents containing feature word t_k in class C_i exceeds the average number of documents containing t_k over all classes, and the stronger the class discrimination ability of such a feature.
Step 4: represent each training text and each test text as a vector over the feature words of the feature dictionary. The weight of each dimension is computed as TFIDF = TF × IDF, where TF (Term Frequency) is the number of times the feature occurs in the document and IDF (Inverse Document Frequency) is computed as IDF = log(M/(n_k + 0.01)), wherein M is the number of texts in the document collection and n_k is the number of documents containing the word;
Step 5: perform KNN classification;
The training text set is S, the test text is d, n is the feature-vector dimension threshold, and K is set to 35.
Compute the similarity between test text d and every text in S using the cosine of the vector angle; select the K texts with the largest similarity as the K nearest-neighbor texts of test text d; compute the weight of test text d for each class and assign d to the class with the largest weight.
If the known class of training text d_i is C_j, the weight is computed as follows:
W(d, C_j) = Σ_{d_i ∈ KNN(d)} Sim(d, d_i)·y(d_i, C_j)
wherein Sim(d, d_i) is the cosine similarity between test text d and known-class text d_i:
Sim(d, d_i) = Σ_{j=1..n} X_j·x_ij / sqrt( (Σ_{j=1..n} X_j²)·(Σ_{j=1..n} x_ij²) )
wherein n is the feature-vector dimension threshold, X_j denotes the weight of the j-th dimension of the text d to be classified (0 < j ≤ n), and x_ij denotes the weight of the j-th dimension of training text vector d_i.
y(d_i, C_j) is the class attribute function: y(d_i, C_j) = 1 if d_i belongs to class C_j, and 0 otherwise.
Step 6: compute the precision, recall and F1 value of the KNN classification algorithm. By setting a maximum threshold on the difference between the F1 values of two successive classifications and a step size by which the scale factor μ grows, the final value of the scale factor is obtained.
Set the initial F1 value to 0 and the initial value of μ to 0.5; ε = 0.0001 is the threshold on the difference between two successive F1 values, and τ = 0.05 is the step size by which μ grows.
Repeat step 5 to obtain the value F1′ and compute the difference ΔF = |F1′ − F1| between the two successive runs. If ΔF is less than ε, the current scale factor μ is obtained; if ΔF is greater than or equal to ε, set μ′ = μ + τ and F1 = F1′, and repeat this iteration until a suitable scale factor μ is obtained, so as to guarantee a higher classification accuracy.
Table 1. Comparison of results before and after the algorithm improvement (balanced corpus) (%)
Table 2. Comparison of results before and after the algorithm improvement (unbalanced corpus) (%)
Claims (1)
1. An adaptive feature selection method based on the chi-square statistic, characterized in that it comprises the following steps:
Step 1: download from the Internet the training text set and test text set of the Chinese corpus released by Fudan University;
Step 2: use the word segmentation software ICTCLAS to perform word segmentation and stop-word removal preprocessing on the training and test text sets, to obtain the segmented training and test text sets;
Step 3: apply the adaptive text feature selection method based on CHI to the segmented training text set to obtain the feature dictionary corresponding to the training text set;
The computing formula of the traditional CHI text feature selection method is as follows:
χ²(t_k, C_i) = N(AD − BC)² / [(A + B)(C + D)(A + C)(B + D)]   (1)
wherein N = A + B + C + D is the total number of documents; A denotes the number of documents that contain feature t_k and belong to class C_i; B denotes the number of documents that contain t_k but do not belong to C_i; C denotes the number of documents that do not contain t_k but belong to C_i; and D denotes the number of documents that neither contain t_k nor belong to C_i;
The proposed adaptive text feature selection method based on CHI is formulated as follows:
χ²_new(t_k, C_i) = [μ·χ²(t_k, C_i)⁺ + (1 − μ)·χ²(t_k, C_i)⁻]·α·β   (2)
Wherein, μ is the adaptive factor, α is the word-frequency factor and β is the inter-class variance; α and β are defined as follows:
α = tf(t_k, C_i) / Σ_{i=1..m} tf(t_k, C_i)
wherein m is the total number of classes in the training set, tf(t_k, C_i) denotes the number of times feature t_k occurs in class C_i, and Σ_{i=1..m} tf(t_k, C_i) denotes the number of times t_k occurs in the whole training text set; if class C_i of the training set contains n documents d_i1, d_i2, …, d_ij, …, d_in, then tf(t_k, d_ij) denotes the number of times t_k occurs in the j-th document of class C_i, tf(t_k, C_i) = Σ_{j=1..n} tf(t_k, d_ij) is the total number of times t_k occurs in class C_i, and Σ_{i=1..m} tf(t_k, C_i) is the total number of times t_k occurs in all documents of the whole training text set; the word-frequency factor α is the ratio of the term frequency of t_k in class C_i to its term frequency in the whole training set; the larger α is, the more frequently the feature occurs in this class and the less frequently (or hardly at all) it occurs in other classes, so such a feature clearly has a stronger class discrimination ability; the smaller α is, the less frequently it occurs in this class and the more frequently in other classes, so such a feature clearly has a weaker class discrimination ability;
Wherein, m denotes the number of classes; df_i is the number of documents in class C_i that contain t_k; the mean d̄f = (1/m)·Σ_{i=1..m} df_i is the average number of documents per class that contain t_k; the positive case applies when the number of texts in which the feature word appears in a given class is greater than or equal to the mean d̄f, and the negative case when it is less than the mean; the value of β measures the degree to which the document frequency of the feature word in a given class deviates from the mean document frequency over all classes; the larger β is, the more the number of documents containing feature word t_k in class C_i exceeds the average number of documents containing t_k over all classes, and the stronger the class discrimination ability of such a feature;
Step 4: represent each training text and each test text as a vector over the feature words of the feature dictionary. The weight of each dimension is computed as TFIDF = TF × IDF, where TF (Term Frequency) is the number of times the feature occurs in the document and IDF (Inverse Document Frequency) is computed as IDF = log(M/(n_k + 0.01)), wherein M is the number of texts in the document collection and n_k is the number of documents containing the word;
Step 5: perform KNN classification;
The training text set is S, the test text is d, n is the feature-vector dimension threshold, and K is set to 35;
Step 5.1: compute the similarity between test text d and every text in S using the cosine of the vector angle;
Step 5.2: select the K texts with the largest similarity obtained in step 5.1 as the K nearest-neighbor texts of test text d;
Step 5.3: compute the weight of test text d for each class and assign d to the class with the largest weight;
If the known class of training text d_i is C_j, the weight is computed as follows:
W(d, C_j) = Σ_{d_i ∈ KNN(d)} Sim(d, d_i)·y(d_i, C_j)
wherein Sim(d, d_i) is the cosine similarity between test text d and known-class text d_i:
Sim(d, d_i) = Σ_{j=1..n} X_j·x_ij / sqrt( (Σ_{j=1..n} X_j²)·(Σ_{j=1..n} x_ij²) )
wherein n is the feature-vector dimension threshold, X_j denotes the weight of the j-th dimension of the text d to be classified (0 < j ≤ n), and x_ij denotes the weight of the j-th dimension of training text vector d_i;
y(d_i, C_j) is the class attribute function: y(d_i, C_j) = 1 if d_i belongs to class C_j, and 0 otherwise;
Step 6: compute the precision, recall and F1 value of the KNN classification algorithm. By setting a maximum threshold on the difference between the F1 values of two successive classifications and a step size by which the scale factor μ grows, the final value of μ is obtained, so as to guarantee a higher classification accuracy;
Step 6.1: set the initial F1 value to 0 and the initial value of μ to 0.5; ε = 0.0001 is the threshold on the difference between two successive F1 values, and τ = 0.05 is the step size by which μ grows;
Step 6.2: repeat step 5 to obtain the value F1′ and compute the difference ΔF = |F1′ − F1| between the two successive runs;
Step 6.3: if ΔF is less than ε, the current scale factor μ is obtained; if ΔF is greater than or equal to ε, set μ′ = μ + τ and F1 = F1′, and repeat steps 6.2 and 6.3 until a suitable scale factor μ is obtained, so as to guarantee a higher classification accuracy.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510927759.9A CN105512311B (en) | 2015-12-14 | 2015-12-14 | A kind of adaptive features select method based on chi-square statistics |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510927759.9A CN105512311B (en) | 2015-12-14 | 2015-12-14 | A kind of adaptive features select method based on chi-square statistics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105512311A true CN105512311A (en) | 2016-04-20 |
CN105512311B CN105512311B (en) | 2019-02-26 |
Family
ID=55720291
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510927759.9A Expired - Fee Related CN105512311B (en) | 2015-12-14 | 2015-12-14 | A kind of adaptive features select method based on chi-square statistics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105512311B (en) |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106021461A (en) * | 2016-05-17 | 2016-10-12 | 深圳市中润四方信息技术有限公司 | Text classification method and text classification system |
CN106611057A (en) * | 2016-12-27 | 2017-05-03 | 上海利连信息科技有限公司 | Text classification feature selection approach for importance weighing |
CN107256214A (en) * | 2017-06-30 | 2017-10-17 | 联想(北京)有限公司 | A kind of junk information determination methods and device and a kind of server cluster |
CN107291837A (en) * | 2017-05-31 | 2017-10-24 | 北京大学 | A kind of segmenting method of the network text based on field adaptability |
CN107577794A (en) * | 2017-09-19 | 2018-01-12 | 北京神州泰岳软件股份有限公司 | A kind of news category method and device |
CN108073567A (en) * | 2016-11-16 | 2018-05-25 | 北京嘀嘀无限科技发展有限公司 | A kind of Feature Words extraction process method, system and server |
CN108090088A (en) * | 2016-11-23 | 2018-05-29 | 北京国双科技有限公司 | Feature extracting method and device |
CN108197307A (en) * | 2018-01-31 | 2018-06-22 | 湖北工业大学 | The selection method and system of a kind of text feature |
CN108346474A (en) * | 2018-03-14 | 2018-07-31 | 湖南省蓝蜻蜓网络科技有限公司 | The electronic health record feature selection approach of distribution within class and distribution between class based on word |
CN108376130A (en) * | 2018-03-09 | 2018-08-07 | 长安大学 | A kind of objectionable text information filtering feature selection approach |
CN108491429A (en) * | 2018-02-09 | 2018-09-04 | 湖北工业大学 | A kind of feature selection approach based on document frequency and word frequency statistics between class in class |
CN108920545A (en) * | 2018-06-13 | 2018-11-30 | 四川大学 | The Chinese affective characteristics selection method of sentiment dictionary and Ka Fang model based on extension |
CN109325511A (en) * | 2018-08-01 | 2019-02-12 | 昆明理工大学 | A kind of algorithm improving feature selecting |
CN109543037A (en) * | 2018-11-21 | 2019-03-29 | 南京安讯科技有限责任公司 | A kind of article classification method based on improved TF-IDF |
CN110069630A (en) * | 2019-03-20 | 2019-07-30 | 重庆信科设计有限公司 | A kind of improved mutual information feature selection approach |
CN110688481A (en) * | 2019-09-02 | 2020-01-14 | 贵州航天计量测试技术研究所 | Text classification feature selection method based on chi-square statistic and IDF |
CN110705247A (en) * | 2019-08-30 | 2020-01-17 | 山东科技大学 | Based on x2-C text similarity calculation method |
CN111062212A (en) * | 2020-03-18 | 2020-04-24 | 北京热云科技有限公司 | Feature extraction method and system based on optimized TFIDF |
CN111144106A (en) * | 2019-12-20 | 2020-05-12 | 山东科技大学 | Two-stage text feature selection method under unbalanced data set |
CN112200259A (en) * | 2020-10-19 | 2021-01-08 | 哈尔滨理工大学 | Information gain text feature selection method and classification device based on classification and screening |
CN112256865A (en) * | 2019-01-31 | 2021-01-22 | 青岛科技大学 | Chinese text classification method based on classifier |
CN113032564A (en) * | 2021-03-22 | 2021-06-25 | 建信金融科技有限责任公司 | Feature extraction method, feature extraction device, electronic equipment and storage medium |
CN113378567A (en) * | 2021-07-05 | 2021-09-10 | 广东工业大学 | Chinese short text classification method for improving low-frequency words |
CN113515623A (en) * | 2021-04-28 | 2021-10-19 | 西安理工大学 | Feature selection method based on word frequency difference factor |
- 2015-12-14: application CN201510927759.9A filed (CN); granted as CN105512311B; status: not active, Expired - Fee Related
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090190839A1 (en) * | 2008-01-29 | 2009-07-30 | Higgins Derrick C | System and method for handling the confounding effect of document length on vector-based similarity scores |
CN103678274A (en) * | 2013-04-15 | 2014-03-26 | 南京邮电大学 | Feature extraction method for text categorization based on improved mutual information and entropy |
CN103886108A (en) * | 2014-04-13 | 2014-06-25 | 北京工业大学 | Feature selection and weight calculation method for imbalanced text sets |
CN104750844A (en) * | 2015-04-09 | 2015-07-01 | 中南大学 | Method and device for generating text characteristic vectors based on TF-IGM, method and device for classifying texts |
Non-Patent Citations (1)
Title |
---|
Liu Haifeng (刘海峰), "Optimized mutual information text feature selection method based on word frequency", Computer Engineering (《计算机工程》) * |
Cited By (35)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106021461A (en) * | 2016-05-17 | 2016-10-12 | 深圳市中润四方信息技术有限公司 | Text classification method and text classification system |
CN108073567B (en) * | 2016-11-16 | 2021-12-28 | 北京嘀嘀无限科技发展有限公司 | Feature word extraction processing method, system and server |
CN108073567A (en) * | 2016-11-16 | 2018-05-25 | 北京嘀嘀无限科技发展有限公司 | Feature word extraction processing method, system and server |
CN108090088A (en) * | 2016-11-23 | 2018-05-29 | 北京国双科技有限公司 | Feature extracting method and device |
CN106611057A (en) * | 2016-12-27 | 2017-05-03 | 上海利连信息科技有限公司 | Importance-weighted feature selection method for text classification |
CN106611057B (en) * | 2016-12-27 | 2019-08-13 | 上海利连信息科技有限公司 | Importance-weighted feature selection method for text classification |
CN107291837A (en) * | 2017-05-31 | 2017-10-24 | 北京大学 | Network text word segmentation method based on domain adaptability |
CN107291837B (en) * | 2017-05-31 | 2020-04-03 | 北京大学 | Network text word segmentation method based on domain adaptability |
CN107256214A (en) * | 2017-06-30 | 2017-10-17 | 联想(北京)有限公司 | Junk information judgment method and device, and server cluster |
CN107256214B (en) * | 2017-06-30 | 2020-09-25 | 联想(北京)有限公司 | Junk information judgment method and device and server cluster |
CN107577794A (en) * | 2017-09-19 | 2018-01-12 | 北京神州泰岳软件股份有限公司 | News classification method and device |
CN107577794B (en) * | 2017-09-19 | 2019-07-05 | 北京神州泰岳软件股份有限公司 | News classification method and device |
CN108197307A (en) * | 2018-01-31 | 2018-06-22 | 湖北工业大学 | Text feature selection method and system |
CN108491429A (en) * | 2018-02-09 | 2018-09-04 | 湖北工业大学 | Feature selection method based on intra-class and inter-class document frequency and word frequency statistics |
CN108376130A (en) * | 2018-03-09 | 2018-08-07 | 长安大学 | Feature selection method for filtering objectionable text information |
CN108346474A (en) * | 2018-03-14 | 2018-07-31 | 湖南省蓝蜻蜓网络科技有限公司 | Electronic health record feature selection method based on intra-class and inter-class word distributions |
CN108920545B (en) * | 2018-06-13 | 2021-07-09 | 四川大学 | Chinese emotion feature selection method based on extended emotion dictionary and chi-square model |
CN108920545A (en) * | 2018-06-13 | 2018-11-30 | 四川大学 | Chinese emotion feature selection method based on extended emotion dictionary and chi-square model |
CN109325511A (en) * | 2018-08-01 | 2019-02-12 | 昆明理工大学 | An improved feature selection algorithm |
CN109543037A (en) * | 2018-11-21 | 2019-03-29 | 南京安讯科技有限责任公司 | An article classification method based on improved TF-IDF |
CN112256865A (en) * | 2019-01-31 | 2021-01-22 | 青岛科技大学 | Chinese text classification method based on classifier |
CN112256865B (en) * | 2019-01-31 | 2023-03-21 | 青岛科技大学 | Chinese text classification method based on classifier |
CN110069630A (en) * | 2019-03-20 | 2019-07-30 | 重庆信科设计有限公司 | An improved mutual information feature selection method |
CN110705247A (en) * | 2019-08-30 | 2020-01-17 | 山东科技大学 | Text similarity calculation method based on χ²-C |
CN110688481A (en) * | 2019-09-02 | 2020-01-14 | 贵州航天计量测试技术研究所 | Text classification feature selection method based on chi-square statistic and IDF |
CN111144106A (en) * | 2019-12-20 | 2020-05-12 | 山东科技大学 | Two-stage text feature selection method under unbalanced data set |
CN111144106B (en) * | 2019-12-20 | 2023-05-02 | 山东科技大学 | Two-stage text feature selection method under unbalanced data set |
CN111062212B (en) * | 2020-03-18 | 2020-06-30 | 北京热云科技有限公司 | Feature extraction method and system based on optimized TFIDF |
CN111062212A (en) * | 2020-03-18 | 2020-04-24 | 北京热云科技有限公司 | Feature extraction method and system based on optimized TFIDF |
CN112200259A (en) * | 2020-10-19 | 2021-01-08 | 哈尔滨理工大学 | Information gain text feature selection method and classification device based on classification and screening |
CN113032564A (en) * | 2021-03-22 | 2021-06-25 | 建信金融科技有限责任公司 | Feature extraction method, feature extraction device, electronic equipment and storage medium |
CN113032564B (en) * | 2021-03-22 | 2023-05-30 | 建信金融科技有限责任公司 | Feature extraction method, device, electronic equipment and storage medium |
CN113515623A (en) * | 2021-04-28 | 2021-10-19 | 西安理工大学 | Feature selection method based on word frequency difference factor |
CN113378567B (en) * | 2021-07-05 | 2022-05-10 | 广东工业大学 | Chinese short text classification method for improving low-frequency words |
CN113378567A (en) * | 2021-07-05 | 2021-09-10 | 广东工业大学 | Chinese short text classification method for improving low-frequency words |
Also Published As
Publication number | Publication date |
---|---|
CN105512311B (en) | 2019-02-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105512311A (en) | Chi square statistic based self-adaption feature selection method | |
CN105224695B (en) | Information-entropy-based text feature quantization method and device, and text classification method and device | |
CN107609121B (en) | News text classification method based on LDA and word2vec algorithm | |
CN105426426A (en) | KNN text classification method based on improved K-Medoids | |
US10346257B2 (en) | Method and device for deduplicating web page | |
Faguo et al. | Research on short text classification algorithm based on statistics and rules | |
CN104750844A (en) | Method and device for generating text characteristic vectors based on TF-IGM, method and device for classifying texts | |
CN103995876A (en) | Text classification method based on chi square statistics and SMO algorithm | |
CN103324628A (en) | Industry classification method and system for text publishing | |
CN105760889A (en) | Efficient imbalanced data set classification method | |
Liliana et al. | Indonesian news classification using support vector machine | |
CN102955857A (en) | Class center compression transformation-based text clustering method in search engine | |
CN103678274A (en) | Feature extraction method for text categorization based on improved mutual information and entropy | |
CN105975518A (en) | Information entropy-based expected cross entropy feature selection text classification system and method | |
Fitriyani et al. | The K-means with mini batch algorithm for topics detection on online news | |
CN109271517A (en) | IG TF-IDF Text eigenvector generates and file classification method | |
Dan et al. | Research of text categorization on Weka | |
Xu et al. | An improved information gain feature selection algorithm for SVM text classifier | |
CN108920545B (en) | Chinese emotion feature selection method based on extended emotion dictionary and chi-square model | |
CN108153899B (en) | Intelligent text classification method | |
CN104809229B (en) | A kind of text feature word extracting method and system | |
Cai et al. | Application of an improved CHI feature selection algorithm | |
Zhang et al. | A hot spot clustering method based on improved kmeans algorithm | |
Shen et al. | A cross-database comparison to discover potential product opportunities using text mining and cosine similarity | |
Wei et al. | The instructional design of Chinese text classification based on SVM |
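Many of the similar documents and citing patents above build on the classic chi-square (CHI) feature-selection score for text classification that this patent improves. As background, a minimal sketch of the standard score follows; the function name, variable names, and example counts are illustrative and not taken from any cited patent.

```python
def chi_square(A, B, C, D):
    """Classic CHI score chi^2(t, c) for a term t and class c, computed from
    the 2x2 document contingency table:
      A = docs in class c containing t,      B = docs outside c containing t,
      C = docs in class c without t,         D = docs outside c without t.
    """
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    if denom == 0:
        # Degenerate table (e.g. term absent everywhere): no association.
        return 0.0
    return N * (A * D - C * B) ** 2 / denom

# Illustrative counts: a term in 40 of 50 in-class docs but only 10 of 150
# out-of-class docs scores high; an evenly spread term scores zero.
informative = chi_square(40, 10, 10, 140)
uninformative = chi_square(25, 25, 25, 25)
```

In feature selection, each candidate term is scored this way against each class and the top-scoring terms are kept; the patent's contribution is to weight this base score with a word-frequency factor, inter-class variance, and an adaptively tuned scaling factor.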
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 2019-02-26; Termination date: 2021-12-14 |