CN105512311A - Chi square statistic based self-adaption feature selection method - Google Patents

Chi square statistic based self-adaption feature selection method

Info

Publication number
CN105512311A
Authority
CN
China
Prior art keywords
text
feature
classification
chi
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510927759.9A
Other languages
Chinese (zh)
Other versions
CN105512311B (en)
Inventor
汪友生
樊存佳
王雨婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN201510927759.9A
Publication of CN105512311A
Application granted
Publication of CN105512311B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining

Abstract

The invention discloses an adaptive feature selection method based on the chi-square statistic, relating to the field of computer text data processing. First, the training text set and the test text set are preprocessed, including word segmentation and stop-word removal. Next, adaptive text feature selection based on the chi-square statistic is performed: a word frequency factor and an inter-class variance are defined and introduced into the CHI algorithm, and an appropriate scale factor is added to it. Finally, the scale factor is adjusted automatically in combination with the evaluation indexes of the classical KNN algorithm, so that the improved CHI adapts to different text corpora and a higher classification accuracy is guaranteed. Experimental results show that, compared with the conventional CHI method, classification accuracy improves on both balanced and unbalanced corpora.

Description

An adaptive feature selection method based on the chi-square statistic
Technical field
The present invention relates to the field of computer text data processing, and in particular to an adaptive text feature selection method based on the chi-square statistic (χ², CHI).
Background technology
In the current era of big data, mining the latent value of data is of great importance, and data mining, as the technology for discovering that value, has attracted wide attention. Text accounts for a considerable proportion of big data, and text classification, as an effective data mining method for organizing and managing text data, has gradually become a focus of attention. It is widely applied in information filtering, information organization and management, information retrieval, digital libraries, spam filtering, and so on. Text Classification (TC) is the process of automatically assigning a text of unknown class to one or more classes according to its content, under a predefined classification scheme. Common text classification methods include K-Nearest Neighbor (KNN), Naive Bayes (NB), and Support Vector Machine (SVM).
At present, the text classification process comprises preprocessing, feature dimensionality reduction, text representation, classifier training, and evaluation. The most common text representation is the vector space model, but the high dimensionality and sparsity of the vector space increase time and space complexity and strongly affect classification precision, so feature dimensionality reduction is essential: it directly determines the efficiency and accuracy of classification. Feature dimensionality reduction comprises two kinds of methods, feature extraction and feature selection. Linguistics-based feature extraction requires natural language processing techniques and has high computational complexity, whereas feature selection methods based on statistical theory have lower complexity and require little background knowledge, so feature selection is applied more widely. The basic idea of feature selection is to construct an evaluation function, score each feature of the feature set, sort all features by their scores, and select a given number of features as the final text feature set. Common feature selection methods include the chi-square statistic, Document Frequency (DF), Information Gain (IG), Mutual Information (MI), Expected Cross Entropy (ECE), and Weight of Evidence for Text (WE).
As one of the common text feature selection methods, the CHI method is simple to implement and has low time complexity, but it also has shortcomings that make its classification performance unsatisfactory. The deficiencies of the CHI algorithm are mainly two: first, CHI considers only the document frequency of a feature and ignores its term frequency, so the weight of low-frequency words is exaggerated; second, it exaggerates the weight of features that appear rarely in one class but often in other classes. Many researchers have improved the CHI algorithm to address these deficiencies, and the improvements can be summarized in two directions: first, introducing regulating parameters to reduce the dependence on low-frequency words, which however does not consider the positive and negative correlation between features and classes; second, introducing a scale factor that classifies features by their positive or negative correlation and assigns different weights, which improves the feature selection ability of the CHI model but requires the scale factor to be chosen by experience. Considering the deficiencies of the existing improved CHI algorithms, designing a CHI text feature selection method with high classification precision has important academic significance and practical value.
Summary of the invention
The object of the invention is to provide an improved CHI text feature selection method that raises the accuracy of text classification. On the one hand, a word frequency factor and an inter-class variance are introduced to reduce the dependence of CHI on low-frequency words, selecting features that occur frequently in a certain class and are evenly distributed within that class; on the other hand, an adaptive scale factor μ is introduced to classify features by their positive or negative correlation and assign different weights, reducing the error brought by choosing the scale factor manually.
The features of the present invention are as follows:
Step 1, download the training text set and test text set of the Chinese corpus published by Fudan University from the Internet;
Step 2, use the open-source Chinese Academy of Sciences word segmentation software ICTCLAS to preprocess the training and test text sets, including word segmentation and stop-word removal, obtaining the segmented training and test text sets;
Step 3, apply the CHI-based adaptive text feature selection method to the segmented training text set to obtain the feature dictionary corresponding to that training text set;
The computing formula of the traditional CHI text feature selection method is as follows:
\chi^2(t_k, C_i) = \frac{N(AD - BC)^2}{(A + C)(B + D)(A + B)(C + D)}
where N = A + B + C + D is the total number of training documents, A is the number of documents that contain feature t_k and belong to class C_i, B is the number of documents that contain t_k but do not belong to C_i, C is the number of documents that do not contain t_k but belong to C_i, and D is the number of documents that neither contain t_k nor belong to C_i.
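For illustration only (this sketch is not part of the patent text), the traditional CHI score can be computed from the four contingency counts as in the following Python sketch; the function name and the zero-denominator guard are our own additions:

def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Traditional CHI score of feature t_k for class C_i.

    a: documents containing t_k and belonging to C_i
    b: documents containing t_k and not belonging to C_i
    c: documents not containing t_k and belonging to C_i
    d: documents not containing t_k and not belonging to C_i
    """
    n = a + b + c + d  # total number of training documents
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # guard against a degenerate contingency table (our addition)
    return n * (a * d - b * c) ** 2 / denom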
The proposed CHI-based adaptive text feature selection method is formulated as follows:
\chi^2_{new}(t_k, C_i) = \left[\mu \cdot \chi^2(t_k, C_i)^{+} + (1 - \mu) \cdot \chi^2(t_k, C_i)^{-}\right] \cdot \alpha \cdot \beta
\chi^2(t_k, C_i)^{+} = \frac{N(AD - BC)^2}{(A + C)(B + D)(A + B)(C + D)}, \quad AD - BC > 0
\chi^2(t_k, C_i)^{-} = \frac{N(AD - BC)^2}{(A + C)(B + D)(A + B)(C + D)}, \quad AD - BC < 0
where μ is the adaptive factor, α is the word frequency factor, and β is the inter-class variance; α and β are defined as follows:
\alpha = \frac{tf(t_k, C_i)}{\sum_{i=1}^{m} tf(t_k, C_i)} = \frac{\sum_{j=1}^{n} tf(t_k, d_{ij})}{\sum_{i=1}^{m}\sum_{j=1}^{n} tf(t_k, d_{ij})}
where m is the total number of classes in the training set, tf(t_k, C_i) is the number of times feature t_k occurs in class C_i, and the denominator of the first fraction is the number of times t_k occurs in the whole training text set. If class C_i of the training set contains n documents d_{i1}, d_{i2}, …, d_{ij}, …, d_{in}, then tf(t_k, d_{ij}) is the number of times t_k occurs in the j-th document of class C_i, the numerator of the second fraction is the total number of times t_k occurs in class C_i, and its denominator is the total number of times t_k occurs in all documents of the whole training text set. The word frequency factor α is therefore the proportion of t_k's term frequency in class C_i to its term frequency in the whole training set. The larger α is, the more often the feature occurs in the given class and the less often (or hardly at all) it occurs in other classes, so the feature clearly has stronger class discrimination ability; the smaller α is, the less often the feature occurs in the given class and the more often it occurs in other classes, so the feature clearly has weaker class discrimination ability.
\beta = \left(df_i - \overline{df_i}\right) \cdot \sqrt{\left(df_i - \overline{df_i}\right)^2}
where m is the number of classes, df_i is the number of documents in class C_i that contain t_k, and \overline{df_i} is the average number of documents per class that contain t_k; β ≥ 0 indicates that the number of texts in which the feature word appears in the given class is greater than or equal to the mean, and β < 0 indicates that it is less than the mean. The β value measures how far the document frequency of the feature word in a certain class deviates from its mean document frequency over all classes. The larger β is, the more the number of documents containing feature word t_k in class C_i exceeds the average number over all classes, and such a feature has stronger class discrimination ability.
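To tie the parts of the improved score together, here is a hedged Python sketch; the helper names (word_freq_factor, interclass_beta, improved_chi) are our own, and interclass_beta follows the β formula as reconstructed above:

import math

def word_freq_factor(tf_in_class: float, tf_in_corpus: float) -> float:
    """alpha: share of t_k's corpus-wide term frequency that falls in class C_i."""
    return tf_in_class / tf_in_corpus if tf_in_corpus else 0.0

def interclass_beta(df_i: float, df_mean: float) -> float:
    """beta: signed squared deviation of df_i from the per-class mean."""
    dev = df_i - df_mean
    return dev * math.sqrt(dev * dev)  # equals dev * |dev|, so the sign is kept

def improved_chi(a: int, b: int, c: int, d: int,
                 alpha: float, beta: float, mu: float) -> float:
    """Adaptive CHI: the chi-square value goes to the positive branch (weight mu)
    when AD - BC > 0, to the negative branch (weight 1 - mu) when AD - BC < 0,
    and is then scaled by alpha and beta."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    chi2 = n * (a * d - b * c) ** 2 / denom
    sign = a * d - b * c
    weight = mu if sign > 0 else (1.0 - mu) if sign < 0 else 0.0
    return weight * chi2 * alpha * beta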
Step 4, each training text and each test text is represented as a vector over the feature words of the feature dictionary, where the weight of each dimension is computed as TFIDF = TF × IDF. TF (Term Frequency) is the number of times the feature occurs in a document; IDF (Inverse Document Frequency) is computed as IDF = log(M/n_k + 0.01), where M is the number of texts in the document collection and n_k is the number of documents containing the word;
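A minimal sketch of the step-4 weighting (not part of the patent text), assuming the feature dictionary from step 3 is an ordered list of words and reading the 0.01 smoothing term as sitting inside the logarithm, as the formula above suggests:

import math
from collections import Counter

def tfidf_vector(doc_tokens: list, vocabulary: list, doc_freq: dict, m_docs: int) -> list:
    """Represent one segmented document as a TFIDF vector over the feature dictionary.

    doc_tokens: tokens of the document after word segmentation
    vocabulary: feature words selected in step 3, in a fixed order
    doc_freq:   word -> number of documents containing it (n_k)
    m_docs:     total number of documents M in the collection
    """
    tf = Counter(doc_tokens)
    return [tf[w] * math.log(m_docs / doc_freq[w] + 0.01) if doc_freq.get(w) else 0.0
            for w in vocabulary]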
Step 5, carry out KNN classification;
Let the training text set be S and the test text be d; n is the feature vector dimension threshold, and K is set to 35.
Step 5.1, compute the similarity between test text d and every text in S using the cosine of the angle between their vectors;
Step 5.2, take the K texts with the largest similarity obtained in step 5.1 as the K nearest-neighbor texts of test text d;
Step 5.3, compute the weight of test text d for each class and assign d to the class with the largest weight.
Let the known class of training text d_i be C_j; the weight is computed as follows:
W(d, C_j) = \sum_{i=1}^{K} Sim(d, d_i)\, y(d_i, C_j)
where Sim(d, d_i) is the cosine similarity between test text d and the text d_i of known class, computed as follows:
Sim(d, d_i) = \frac{\sum_{j=1}^{n} X_j x_{ij}}{\sqrt{\sum_{j=1}^{n} X_j^2}\,\sqrt{\sum_{j=1}^{n} x_{ij}^2}}
where n is the feature vector dimension threshold, X_j is the weight of the j-th dimension of the text d to be classified (0 < j ≤ n), and x_{ij} is the weight of the j-th dimension of training text vector d_i.
y(d_i, C_j) is the class attribute function: y(d_i, C_j) = 1 if d_i belongs to class C_j, and y(d_i, C_j) = 0 otherwise.
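Steps 5.1 to 5.3 can be summarized in the following sketch (our own function names, not part of the patent text; K = 35 as above):

import math

def cosine_sim(x: list, y: list) -> float:
    """Cosine similarity between two TFIDF vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def knn_classify(test_vec: list, train_vecs: list, train_labels: list, k: int = 35):
    """Weighted KNN vote: each of the K nearest neighbors adds its similarity
    to the weight of its own class (y(d_i, C_j) selects that class)."""
    neighbors = sorted(zip((cosine_sim(test_vec, v) for v in train_vecs), train_labels),
                       reverse=True)[:k]
    weights = {}
    for sim, label in neighbors:
        weights[label] = weights.get(label, 0.0) + sim
    return max(weights, key=weights.get)  # class with the largest weight W(d, C_j)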
Step 6, compute the precision, recall, and F1 value of the KNN classification algorithm; the final value of the scale factor μ is obtained by setting a maximum threshold on the difference between the F1 values of two successive classification runs and a step length by which μ grows, so as to guarantee a higher classification accuracy.
Step 6.1, set the initial F1 value to 0 and the initial μ value to 0.5; ε = 0.0001 is the threshold on the difference between two successive F1 values, and τ = 0.05 is the step length by which the scale factor μ grows;
Step 6.2, repeat step 5 to obtain the value F1′, and compute the difference between the two successive F1 values, ΔF = |F1′ − F1|;
Step 6.3, if ΔF is less than ε, take the current scale factor μ; if ΔF is greater than or equal to ε, set μ′ = μ + τ and F1 = F1′, and repeat steps 6.2 and 6.3 until a suitable scale factor μ is obtained.
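A hedged sketch of the step-6 adjustment loop (not part of the patent text); evaluate_f1 stands in for a re-run of step 5 with the given μ and is assumed to return the resulting F1 value, and the iteration cap is our own safeguard:

def tune_mu(evaluate_f1, mu: float = 0.5, tau: float = 0.05,
            eps: float = 1e-4, max_iter: int = 100) -> float:
    """Grow mu by step tau until the F1 values of two successive runs differ
    by less than eps (steps 6.1 to 6.3)."""
    f1_prev = 0.0                   # step 6.1: initial F1 value is 0
    for _ in range(max_iter):       # cap added so the sketch always terminates
        f1 = evaluate_f1(mu)        # step 6.2: repeat step 5 and obtain F1'
        if abs(f1 - f1_prev) < eps:
            return mu               # step 6.3: Delta F < eps, keep the current mu
        f1_prev, mu = f1, mu + tau  # otherwise mu' = mu + tau, F1 = F1'
    return mu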
Compared with the prior art, the present invention has the following beneficial effects.
The present invention proposes an adaptive feature selection method based on the chi-square statistic, with the KNN algorithm chosen as the classifier for classifying the test texts. The overall flowchart is shown in Fig. 1, the flowchart for computing the scale factor μ is shown in Fig. 2, the accuracy indexes on the balanced corpus are given in Table 1, and the accuracy on the unbalanced corpus is given in Table 2. Compared with the traditional CHI method, on the one hand the word frequency factor and inter-class variance are introduced to reduce CHI's dependence on low-frequency words, selecting features that occur frequently in a certain class and are evenly distributed within that class; on the other hand the adaptive scale factor μ is introduced to classify features by their positive or negative correlation and assign different weights, and the method is applicable to corpora of different distributions, reducing the error brought by choosing the scale factor manually. As can be seen from Tables 1 and 2, compared with the traditional CHI method, the present invention improves classification precision on both the balanced corpus and the unbalanced corpus.
Brief description of the drawings
Fig. 1 is the flowchart of the overall process of the present invention.
Fig. 2 is the flowchart for computing the scale factor μ in the present invention.
Embodiment
The present invention is realized by the following technical means:
An adaptive text feature selection method based on the chi-square statistic. First, the training text set and the test text set are preprocessed, including word segmentation and stop-word removal. Next, adaptive text feature selection based on the chi-square statistic is performed: the word frequency factor α and the inter-class variance β are defined and introduced into the CHI algorithm, and a suitable scale factor μ is added to it. Finally, the scale factor μ is adjusted automatically in combination with the classical KNN algorithm, so that the improved CHI suits different corpora and a higher classification accuracy is guaranteed.
The above adaptive text feature selection method based on the chi-square statistic is applied to text classification through the following steps:
Step 1, download the training text set and test text set of the Chinese corpus published by Fudan University from the Internet;
Step 2, use the word segmentation software ICTCLAS to preprocess the training and test text sets, including word segmentation and stop-word removal, obtaining the segmented training and test text sets;
Step 3, apply the CHI-based adaptive text feature selection method to the segmented training text set to obtain the feature dictionary corresponding to that training text set;
The computing formula of the traditional CHI text feature selection method is as follows:
\chi^2(t_k, C_i) = \frac{N(AD - BC)^2}{(A + C)(B + D)(A + B)(C + D)} \quad (1)
where N = A + B + C + D is the total number of training documents, A is the number of documents that contain feature t_k and belong to class C_i, B is the number of documents that contain t_k but do not belong to C_i, C is the number of documents that do not contain t_k but belong to C_i, and D is the number of documents that neither contain t_k nor belong to C_i.
The proposed CHI-based adaptive text feature selection method is formulated as follows:
\chi^2_{new}(t_k, C_i) = \left[\mu \cdot \chi^2(t_k, C_i)^{+} + (1 - \mu) \cdot \chi^2(t_k, C_i)^{-}\right] \cdot \alpha \cdot \beta \quad (2)
\chi^2(t_k, C_i)^{+} = \frac{N(AD - BC)^2}{(A + C)(B + D)(A + B)(C + D)}, \quad AD - BC > 0 \quad (3)
\chi^2(t_k, C_i)^{-} = \frac{N(AD - BC)^2}{(A + C)(B + D)(A + B)(C + D)}, \quad AD - BC < 0
where μ is the adaptive factor, α is the word frequency factor, and β is the inter-class variance; α and β are defined as follows:
\alpha = \frac{tf(t_k, C_i)}{\sum_{i=1}^{m} tf(t_k, C_i)} = \frac{\sum_{j=1}^{n} tf(t_k, d_{ij})}{\sum_{i=1}^{m}\sum_{j=1}^{n} tf(t_k, d_{ij})} \quad (4)
where m is the total number of classes in the training set, tf(t_k, C_i) is the number of times feature t_k occurs in class C_i, and the denominator of the first fraction is the number of times t_k occurs in the whole training text set. If class C_i of the training set contains n documents d_{i1}, d_{i2}, …, d_{ij}, …, d_{in}, then tf(t_k, d_{ij}) is the number of times t_k occurs in the j-th document of class C_i, the numerator of the second fraction is the total number of times t_k occurs in class C_i, and its denominator is the total number of times t_k occurs in all documents of the whole training text set. The word frequency factor α is therefore the proportion of t_k's term frequency in class C_i to its term frequency in the whole training set. The larger α is, the more often the feature occurs in the given class and the less often (or hardly at all) it occurs in other classes, so the feature clearly has stronger class discrimination ability; the smaller α is, the less often the feature occurs in the given class and the more often it occurs in other classes, so the feature clearly has weaker class discrimination ability.
\beta = \left(df_i - \overline{df_i}\right) \cdot \sqrt{\left(df_i - \overline{df_i}\right)^2} \quad (5)
where m is the number of classes, df_i is the number of documents in class C_i that contain t_k, and \overline{df_i} is the average number of documents per class that contain t_k; β ≥ 0 indicates that the number of texts in which the feature word appears in the given class is greater than or equal to the mean, and β < 0 indicates that it is less than the mean. The β value measures how far the document frequency of the feature word in a certain class deviates from its mean document frequency over all classes. The larger β is, the more the number of documents containing feature word t_k in class C_i exceeds the average number over all classes, and such a feature has stronger class discrimination ability.
Step 4, each training text and each test text is represented as a vector over the feature words of the feature dictionary, where the weight of each dimension is computed as TFIDF = TF × IDF. TF (Term Frequency) is the number of times the feature occurs in a document; IDF (Inverse Document Frequency) is computed as IDF = log(M/n_k + 0.01), where M is the number of texts in the document collection and n_k is the number of documents containing the word;
Step 5, carry out KNN classification;
Let the training text set be S and the test text be d; n is the feature vector dimension threshold, and K is set to 35.
Compute the similarity between test text d and every text in S using the cosine of the angle between their vectors; take the K texts with the largest computed similarity as the K nearest neighbors of test text d; compute the weight of test text d for each class and assign d to the class with the largest weight.
Let the known class of training text d_i be C_j; the weight is computed as follows:
W(d, C_j) = \sum_{i=1}^{K} Sim(d, d_i)\, y(d_i, C_j) \quad (6)
where Sim(d, d_i) is the cosine similarity between test text d and the text d_i of known class, computed as follows:
Sim(d, d_i) = \frac{\sum_{j=1}^{n} X_j x_{ij}}{\sqrt{\sum_{j=1}^{n} X_j^2}\,\sqrt{\sum_{j=1}^{n} x_{ij}^2}} \quad (7)
where n is the feature vector dimension threshold, X_j is the weight of the j-th dimension of the text d to be classified (0 < j ≤ n), and x_{ij} is the weight of the j-th dimension of training text vector d_i.
y(d_i, C_j) is the class attribute function: y(d_i, C_j) = 1 if d_i belongs to class C_j, and y(d_i, C_j) = 0 otherwise.
Step 6, compute the precision, recall, and F1 value of the KNN classification algorithm; the value of the final scale factor is obtained by setting a maximum threshold on the difference between the F1 values of two successive classification runs and a step length by which the scale factor μ grows.
Set the initial F1 value to 0 and the initial μ value to 0.5; ε = 0.0001 is the threshold on the difference between two successive F1 values, and τ = 0.05 is the step length by which the scale factor μ grows.
Repeat step 5 to obtain the value F1′ and compute the difference between the two successive F1 values, ΔF = |F1′ − F1|. If ΔF is less than ε, take the current scale factor μ; if ΔF is greater than or equal to ε, set μ′ = μ + τ and F1 = F1′, and repeat this step until a suitable scale factor μ is obtained, guaranteeing a higher classification accuracy.
Table 1. Comparison of results before and after the algorithm improvement (balanced corpus) (%)
Table 2. Comparison of results before and after the algorithm improvement (unbalanced corpus) (%)

Claims (1)

1. An adaptive feature selection method based on the chi-square statistic, characterized by comprising the following steps:
Step 1, download the training text set and test text set of the Chinese corpus published by Fudan University from the Internet;
Step 2, use the word segmentation software ICTCLAS to preprocess the training and test text sets by word segmentation and stop-word removal, obtaining the segmented training and test text sets;
Step 3, apply the CHI-based adaptive text feature selection method to the segmented training text set to obtain the feature dictionary corresponding to that training text set;
The computing formula of the traditional CHI text feature selection method is as follows:
\chi^2(t_k, C_i) = \frac{N(AD - BC)^2}{(A + C)(B + D)(A + B)(C + D)}
where N = A + B + C + D is the total number of training documents, A is the number of documents that contain feature t_k and belong to class C_i, B is the number of documents that contain t_k but do not belong to C_i, C is the number of documents that do not contain t_k but belong to C_i, and D is the number of documents that neither contain t_k nor belong to C_i;
The proposed CHI-based adaptive text feature selection method is formulated as follows:
\chi^2_{new}(t_k, C_i) = \left[\mu \cdot \chi^2(t_k, C_i)^{+} + (1 - \mu) \cdot \chi^2(t_k, C_i)^{-}\right] \cdot \alpha \cdot \beta
\chi^2(t_k, C_i)^{+} = \frac{N(AD - BC)^2}{(A + C)(B + D)(A + B)(C + D)}, \quad AD - BC > 0
\chi^2(t_k, C_i)^{-} = \frac{N(AD - BC)^2}{(A + C)(B + D)(A + B)(C + D)}, \quad AD - BC < 0
where μ is the adaptive factor, α is the word frequency factor, and β is the inter-class variance; α and β are defined as follows:
\alpha = \frac{tf(t_k, C_i)}{\sum_{i=1}^{m} tf(t_k, C_i)} = \frac{\sum_{j=1}^{n} tf(t_k, d_{ij})}{\sum_{i=1}^{m}\sum_{j=1}^{n} tf(t_k, d_{ij})}
where m is the total number of classes in the training set, tf(t_k, C_i) is the number of times feature t_k occurs in class C_i, and the denominator of the first fraction is the number of times t_k occurs in the whole training text set; if class C_i of the training set contains n documents d_{i1}, d_{i2}, …, d_{ij}, …, d_{in}, then tf(t_k, d_{ij}) is the number of times t_k occurs in the j-th document of class C_i, the numerator of the second fraction is the total number of times t_k occurs in class C_i, and its denominator is the total number of times t_k occurs in all documents of the whole training text set; the word frequency factor α is therefore the proportion of t_k's term frequency in class C_i to its term frequency in the whole training set; the larger α is, the more often the feature occurs in the given class and the less often (or hardly at all) it occurs in other classes, so the feature clearly has stronger class discrimination ability; the smaller α is, the less often the feature occurs in the given class and the more often it occurs in other classes, so the feature clearly has weaker class discrimination ability;
\beta = \left(df_i - \overline{df_i}\right) \cdot \sqrt{\left(df_i - \overline{df_i}\right)^2}
where m is the number of classes, df_i is the number of documents in class C_i that contain t_k, and \overline{df_i} is the average number of documents per class that contain t_k; β ≥ 0 indicates that the number of texts in which the feature word appears in the given class is greater than or equal to the mean, and β < 0 indicates that it is less than the mean; the β value measures how far the document frequency of the feature word in a certain class deviates from its mean document frequency over all classes; the larger β is, the more the number of documents containing feature word t_k in class C_i exceeds the average number over all classes, and such a feature has stronger class discrimination ability;
Step 4, each training text and each test text is represented as a vector over the feature words of the feature dictionary, where the weight of each dimension is computed as TFIDF = TF × IDF; TF (Term Frequency) is the number of times the feature occurs in a document; IDF (Inverse Document Frequency) is computed as IDF = log(M/n_k + 0.01), where M is the number of texts in the document collection and n_k is the number of documents containing the word;
Step 5, carry out KNN classification;
Let the training text set be S and the test text be d; n is the feature vector dimension threshold, and K is set to 35;
Step 5.1, compute the similarity between test text d and every text in S using the cosine of the angle between their vectors;
Step 5.2, take the K texts with the largest similarity obtained in step 5.1 as the K nearest-neighbor texts of test text d;
Step 5.3, compute the weight of test text d for each class and assign d to the class with the largest weight;
Let the known class of training text d_i be C_j; the weight is computed as follows:
W(d, C_j) = \sum_{i=1}^{K} Sim(d, d_i)\, y(d_i, C_j)
where Sim(d, d_i) is the cosine similarity between test text d and the text d_i of known class, computed as follows:
Sim(d, d_i) = \frac{\sum_{j=1}^{n} X_j x_{ij}}{\sqrt{\sum_{j=1}^{n} X_j^2}\,\sqrt{\sum_{j=1}^{n} x_{ij}^2}}
where n is the feature vector dimension threshold, X_j is the weight of the j-th dimension of the text d to be classified (0 < j ≤ n), and x_{ij} is the weight of the j-th dimension of training text vector d_i;
y(d_i, C_j) is the class attribute function: y(d_i, C_j) = 1 if d_i belongs to class C_j, and y(d_i, C_j) = 0 otherwise;
Step 6, compute the precision, recall, and F1 value of the KNN classification algorithm; the final value of the scale factor μ is obtained by setting a maximum threshold on the difference between the F1 values of two successive classification runs and a step length by which μ grows, so as to guarantee a higher classification accuracy;
Step 6.1, set the initial F1 value to 0 and the initial μ value to 0.5; ε = 0.0001 is the threshold on the difference between two successive F1 values, and τ = 0.05 is the step length by which the scale factor μ grows;
Step 6.2, repeat step 5 to obtain the value F1′, and compute the difference between the two successive F1 values, ΔF = |F1′ − F1|;
Step 6.3, if ΔF is less than ε, take the current scale factor μ; if ΔF is greater than or equal to ε, set μ′ = μ + τ and F1 = F1′, and repeat steps 6.2 and 6.3 until a suitable scale factor μ is obtained, guaranteeing a higher classification accuracy.
CN201510927759.9A 2015-12-14 2015-12-14 An adaptive feature selection method based on chi-square statistics Expired - Fee Related CN105512311B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510927759.9A CN105512311B (en) 2015-12-14 2015-12-14 An adaptive feature selection method based on chi-square statistics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510927759.9A CN105512311B (en) 2015-12-14 2015-12-14 An adaptive feature selection method based on chi-square statistics

Publications (2)

Publication Number Publication Date
CN105512311A true CN105512311A (en) 2016-04-20
CN105512311B CN105512311B (en) 2019-02-26

Family

ID=55720291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510927759.9A Expired - Fee Related CN105512311B (en) 2015-12-14 2015-12-14 An adaptive feature selection method based on chi-square statistics

Country Status (1)

Country Link
CN (1) CN105512311B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090190839A1 (en) * 2008-01-29 2009-07-30 Higgins Derrick C System and method for handling the confounding effect of document length on vector-based similarity scores
CN103678274A (en) * 2013-04-15 2014-03-26 南京邮电大学 Feature extraction method for text categorization based on improved mutual information and entropy
CN103886108A (en) * 2014-04-13 2014-06-25 北京工业大学 Feature selection and weight calculation method of imbalance text set
CN104750844A (en) * 2015-04-09 2015-07-01 中南大学 Method and device for generating text characteristic vectors based on TF-IGM, method and device for classifying texts

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU Haifeng, "Optimized mutual information text feature selection method based on word frequency", Computer Engineering (《计算机工程》) *

Cited By (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021461A (en) * 2016-05-17 2016-10-12 深圳市中润四方信息技术有限公司 Text classification method and text classification system
CN108073567B (en) * 2016-11-16 2021-12-28 北京嘀嘀无限科技发展有限公司 Feature word extraction processing method, system and server
CN108073567A (en) * 2016-11-16 2018-05-25 北京嘀嘀无限科技发展有限公司 A kind of Feature Words extraction process method, system and server
CN108090088A (en) * 2016-11-23 2018-05-29 北京国双科技有限公司 Feature extracting method and device
CN106611057A (en) * 2016-12-27 2017-05-03 上海利连信息科技有限公司 Text classification feature selection approach for importance weighing
CN106611057B (en) * 2016-12-27 2019-08-13 上海利连信息科技有限公司 The text classification feature selection approach of importance weighting
CN107291837A (en) * 2017-05-31 2017-10-24 北京大学 A kind of segmenting method of the network text based on field adaptability
CN107291837B (en) * 2017-05-31 2020-04-03 北京大学 Network text word segmentation method based on field adaptability
CN107256214A (en) * 2017-06-30 2017-10-17 联想(北京)有限公司 A kind of junk information determination methods and device and a kind of server cluster
CN107256214B (en) * 2017-06-30 2020-09-25 联想(北京)有限公司 Junk information judgment method and device and server cluster
CN107577794A (en) * 2017-09-19 2018-01-12 北京神州泰岳软件股份有限公司 A kind of news category method and device
CN107577794B (en) * 2017-09-19 2019-07-05 北京神州泰岳软件股份有限公司 A kind of news category method and device
CN108197307A (en) * 2018-01-31 2018-06-22 湖北工业大学 The selection method and system of a kind of text feature
CN108491429A (en) * 2018-02-09 2018-09-04 湖北工业大学 A kind of feature selection approach based on document frequency and word frequency statistics between class in class
CN108376130A (en) * 2018-03-09 2018-08-07 长安大学 A kind of objectionable text information filtering feature selection approach
CN108346474A (en) * 2018-03-14 2018-07-31 湖南省蓝蜻蜓网络科技有限公司 The electronic health record feature selection approach of distribution within class and distribution between class based on word
CN108920545B (en) * 2018-06-13 2021-07-09 四川大学 Chinese emotion feature selection method based on extended emotion dictionary and chi-square model
CN108920545A (en) * 2018-06-13 2018-11-30 四川大学 The Chinese affective characteristics selection method of sentiment dictionary and Ka Fang model based on extension
CN109325511A (en) * 2018-08-01 2019-02-12 昆明理工大学 A kind of algorithm improving feature selecting
CN109543037A (en) * 2018-11-21 2019-03-29 南京安讯科技有限责任公司 A kind of article classification method based on improved TF-IDF
CN112256865A (en) * 2019-01-31 2021-01-22 青岛科技大学 Chinese text classification method based on classifier
CN112256865B (en) * 2019-01-31 2023-03-21 青岛科技大学 Chinese text classification method based on classifier
CN110069630A (en) * 2019-03-20 2019-07-30 重庆信科设计有限公司 A kind of improved mutual information feature selection approach
CN110705247A (en) * 2019-08-30 2020-01-17 山东科技大学 Based on x2-C text similarity calculation method
CN110688481A (en) * 2019-09-02 2020-01-14 贵州航天计量测试技术研究所 Text classification feature selection method based on chi-square statistic and IDF
CN111144106A (en) * 2019-12-20 2020-05-12 山东科技大学 Two-stage text feature selection method under unbalanced data set
CN111144106B (en) * 2019-12-20 2023-05-02 山东科技大学 Two-stage text feature selection method under unbalanced data set
CN111062212B (en) * 2020-03-18 2020-06-30 北京热云科技有限公司 Feature extraction method and system based on optimized TFIDF
CN111062212A (en) * 2020-03-18 2020-04-24 北京热云科技有限公司 Feature extraction method and system based on optimized TFIDF
CN112200259A (en) * 2020-10-19 2021-01-08 哈尔滨理工大学 Information gain text feature selection method and classification device based on classification and screening
CN113032564A (en) * 2021-03-22 2021-06-25 建信金融科技有限责任公司 Feature extraction method, feature extraction device, electronic equipment and storage medium
CN113032564B (en) * 2021-03-22 2023-05-30 建信金融科技有限责任公司 Feature extraction method, device, electronic equipment and storage medium
CN113515623A (en) * 2021-04-28 2021-10-19 西安理工大学 Feature selection method based on word frequency difference factor
CN113378567B (en) * 2021-07-05 2022-05-10 广东工业大学 Chinese short text classification method for improving low-frequency words
CN113378567A (en) * 2021-07-05 2021-09-10 广东工业大学 Chinese short text classification method for improving low-frequency words

Also Published As

Publication number Publication date
CN105512311B (en) 2019-02-26

Similar Documents

Publication Publication Date Title
CN105512311A (en) Chi square statistic based self-adaption feature selection method
CN105224695B (en) A kind of text feature quantization method and device and file classification method and device based on comentropy
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
CN105426426A (en) KNN text classification method based on improved K-Medoids
US10346257B2 (en) Method and device for deduplicating web page
Faguo et al. Research on short text classification algorithm based on statistics and rules
CN104750844A (en) Method and device for generating text characteristic vectors based on TF-IGM, method and device for classifying texts
CN103995876A (en) Text classification method based on chi square statistics and SMO algorithm
CN103324628A (en) Industry classification method and system for text publishing
CN105760889A (en) Efficient imbalanced data set classification method
Liliana et al. Indonesian news classification using support vector machine
CN102955857A (en) Class center compression transformation-based text clustering method in search engine
CN103678274A (en) Feature extraction method for text categorization based on improved mutual information and entropy
CN105975518A (en) Information entropy-based expected cross entropy feature selection text classification system and method
Fitriyani et al. The K-means with mini batch algorithm for topics detection on online news
CN109271517A (en) IG TF-IDF Text eigenvector generates and file classification method
Dan et al. Research of text categorization on Weka
Xu et al. An improved information gain feature selection algorithm for SVM text classifier
CN108920545B (en) Chinese emotion feature selection method based on extended emotion dictionary and chi-square model
CN108153899B (en) Intelligent text classification method
CN104809229B (en) A kind of text feature word extracting method and system
Cai et al. Application of an improved CHI feature selection algorithm
Zhang et al. A hot spot clustering method based on improved kmeans algorithm
Shen et al. A cross-database comparison to discover potential product opportunities using text mining and cosine similarity
Wei et al. The instructional design of Chinese text classification based on SVM

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190226

Termination date: 20211214

CF01 Termination of patent right due to non-payment of annual fee