CN105512311B - An adaptive feature selection method based on chi-square statistics - Google Patents

An adaptive feature selection method based on chi-square statistics

Info

Publication number
CN105512311B
CN105512311B CN201510927759.9A CN201510927759A CN105512311B
Authority
CN
China
Prior art keywords
text
classification
feature
indicate
chi
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201510927759.9A
Other languages
Chinese (zh)
Other versions
CN105512311A (en)
Inventor
汪友生
樊存佳
王雨婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201510927759.9A priority Critical patent/CN105512311B/en
Publication of CN105512311A publication Critical patent/CN105512311A/en
Application granted granted Critical
Publication of CN105512311B publication Critical patent/CN105512311B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 — Handling natural language data
    • G06F40/20 — Natural language analysis
    • G06F40/205 — Parsing
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 — Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 — Data mining

Abstract

An adaptive feature selection method based on chi-square statistics, relating to the field of computer text data processing. The training text set and the test text set are first preprocessed, including word segmentation and stop-word removal. Adaptive text feature selection based on chi-square statistics is then performed: a word-frequency factor and an inter-class variance are defined and introduced into the CHI algorithm, and a suitable scale factor is added to it. Finally, using the evaluation index of the classical KNN algorithm, the scale factor is adjusted automatically, so that the improved CHI fits different corpora and guarantees higher classification accuracy. Experimental results show that, compared with the traditional CHI method, the invention improves classification precision on both balanced and unbalanced corpora.

Description

An adaptive feature selection method based on chi-square statistics
Technical field
The present invention relates to the field of computer text data processing, and in particular to an adaptive text feature selection method based on chi-square statistics (χ², CHI).
Background art
In the current era of big data, mining the potential value of data is of great importance, and data mining, as the technology for discovering that value, has attracted wide attention. Text accounts for a considerable proportion of big data, and text classification, as an effective data mining method for organizing and managing text data, has increasingly become a focus of attention. It is widely used in information filtering, information organization and management, information retrieval, digital libraries, spam filtering and other fields. Text classification (Text Classification, TC) refers to the process of automatically assigning a text of unknown category to one or more categories of a previously given category system according to its content. Common text classification methods include K-nearest neighbors (K-Nearest-Neighbor, KNN), naive Bayes (Naive Bayes, NB) and support vector machines (Support Vector Machine, SVM).
At present, the text classification process includes preprocessing, feature dimensionality reduction, text representation, classifier learning and evaluation. The most commonly used text representation is the vector space model; the high dimensionality and sparsity of the vector space increase time and space complexity and strongly affect classification precision, so the feature dimensionality reduction step is crucial: it directly determines the efficiency and accuracy of classification. Feature dimensionality reduction comprises two families of methods, feature extraction (Feature Extraction) and feature selection (Feature Selection). Feature extraction based on linguistics requires natural language processing techniques and has high computational complexity, whereas feature selection methods based on statistical theory have lower complexity and do not require much background knowledge, so feature selection is more widely applied. The basic idea of feature selection is to construct an evaluation function, score each feature item of the feature set, rank all feature items by their evaluation scores, and select a certain number of features as the final text feature set. Common feature selection methods include chi-square statistics, document frequency (Document Frequency, DF), information gain (Information Gain, IG), mutual information (Mutual Information, MI), expected cross entropy (Expected Cross Entropy, ECE) and weight of text evidence (Weight of Evidence, WE).
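For illustration only, the ranking-and-selection step common to all of these evaluation functions can be sketched in Python as follows (the function and parameter names are illustrative, not taken from the patent):

```python
def select_features(scores, num_features):
    """Rank all feature items by their evaluation score and keep the top ones.

    scores: dict mapping each candidate feature word to its evaluation score.
    num_features: size of the final feature dictionary.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:num_features]
```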
As one of the commonly used text feature selection methods, CHI is simple to implement and has low time complexity; but it also has shortcomings that make the classification effect unsatisfactory. The deficiencies of the CHI algorithm mainly involve two aspects: first, CHI only considers the document frequency of a feature item and ignores its term frequency, which inflates the weight of low-frequency words; second, it amplifies the weight of feature items that occur rarely in a given category but often in other categories. Many researchers have improved the CHI algorithm, and the improvements fall into two groups. The first introduces several adjustment parameters to reduce the bias toward low-frequency words, but does not consider the positive and negative correlation between feature items and categories. The second introduces a scale factor and assigns different weights according to the positive or negative correlation, improving the feature selection ability of the CHI model, but the scale factor has to be chosen by experience. In view of the deficiencies of the existing improved CHI algorithms, designing a CHI text feature selection method with high classification precision has important academic significance and practical value.
Summary of the invention
The object of the present invention is to provide an improved CHI text feature selection method to raise text classification accuracy. On the one hand, a word-frequency factor and an inter-class variance are introduced to reduce the bias of CHI toward low-frequency words and to select feature items that occur with high frequency in a given category and are evenly distributed within it; on the other hand, an adaptive scale factor μ is introduced, assigning different weights according to the positive or negative correlation, so as to reduce the error caused by choosing the scale factor manually.
The invention is characterized by the following steps:
Step 1, download the Chinese corpus published by Fudan University from the Internet, consisting of a training text set and a test text set;
Step 2, preprocess the training text set and the test text set with the Chinese Academy of Sciences open-source segmentation software ICTCLAS, including word segmentation and stop-word removal, obtaining the segmented training text set and test text set;
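For illustration only, a minimal preprocessing sketch is given below; it uses the open-source jieba segmenter as a stand-in for ICTCLAS and assumes a hypothetical stopwords.txt file, neither of which is specified by the patent:

```python
import jieba  # open-source segmenter, used here as a stand-in for ICTCLAS

def load_stopwords(path="stopwords.txt"):
    """Load a stop-word list (hypothetical file, one word per line)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def preprocess(text, stopwords):
    """Segment a raw Chinese document and drop stop words and whitespace tokens."""
    return [tok for tok in jieba.lcut(text) if tok.strip() and tok not in stopwords]
```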
Step 3, perform feature selection on the segmented training text set with the CHI-based adaptive text feature selection method, obtaining the feature dictionary of the training text set;
The calculation formula of the traditional CHI text feature selection method is as follows:
χ²(tk, Ci) = N·(A·D − B·C)² / [(A + B)(C + D)(A + C)(B + D)]
where N = A + B + C + D is the total number of documents, A denotes the number of documents that contain feature tk and belong to category Ci, B denotes the number of documents that contain tk and do not belong to Ci, C denotes the number of documents that do not contain tk and belong to Ci, and D denotes the number of documents that do not contain tk and do not belong to Ci.
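For reference, a small Python sketch of this traditional CHI score, computed directly from the counts A, B, C and D defined above (the helper name is illustrative):

```python
def chi_square(A, B, C, D):
    """Traditional CHI score of feature tk for class Ci.

    A: documents containing tk and belonging to Ci
    B: documents containing tk and not belonging to Ci
    C: documents not containing tk and belonging to Ci
    D: documents not containing tk and not belonging to Ci
    """
    N = A + B + C + D
    denom = (A + B) * (C + D) * (A + C) * (B + D)
    return N * (A * D - B * C) ** 2 / denom if denom else 0.0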
The formula of the proposed CHI-based adaptive text feature selection method is as follows:
χnew²(tk, Ci) = [μ·χ²(tk, Ci)⁺ + (1 − μ)·χ²(tk, Ci)⁻]·α·β
where μ is the scale factor, χ²(tk, Ci)⁺ and χ²(tk, Ci)⁻ denote the CHI values of positively and negatively correlated feature items, α is the word-frequency factor and β is the inter-class variance; α is given by
α = tf(tk, Ci) / Σ_{i=1..m} tf(tk, Ci)
where m is the total number of categories in the training set, tf(tk, Ci) denotes the number of occurrences of feature item tk in category Ci, and Σ_{i=1..m} tf(tk, Ci) denotes the number of occurrences of tk in the whole training text set. Suppose category Ci in the training set contains p documents di1, di2, ..., dij, ..., dip; then tf(tk, dij) denotes the number of occurrences of tk in the j-th document of Ci, tf(tk, Ci) = Σ_{j=1..p} tf(tk, dij) is the total number of occurrences of tk in Ci, and the denominator above is the total number of occurrences of tk in all documents of the training text set. The word-frequency factor α is thus the ratio of the occurrences of tk within category Ci to the occurrences of tk in the whole training set. The larger α is, the more frequently the feature item occurs in the given category and the less often (or hardly at all) it occurs in other categories, so such a feature item has stronger class discrimination ability; the smaller α is, the less frequently the feature item occurs in the given category and the more often it occurs in other categories, so it has weaker class discrimination ability.
Here m denotes the number of categories, dfi is the number of documents in category Ci that contain tk, and d̄f = (1/m)·Σ_{i=1..m} dfi is the average number of documents per category that contain tk; dfi greater than or equal to d̄f means the feature word appears in more documents of the category than the average, and dfi below d̄f means it appears in fewer. The value of β measures how far the document frequency of the feature word in one category departs from the average document frequency over all categories. The larger β is, the more the number of documents containing tk in category Ci exceeds the average over all categories, and such a feature item has stronger class discrimination ability.
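A minimal Python sketch of the improved scoring step follows. Here chi_pos and chi_neg stand for χ²(tk, Ci)⁺ and χ²(tk, Ci)⁻, alpha follows the ratio given above, and because the exact expression for β is not reproduced in this text, beta_factor uses the deviation of the class document frequency from the per-class average as an assumed stand-in:

```python
def alpha_factor(tf_in_class, tf_in_corpus):
    """Word-frequency factor alpha: share of tk's occurrences that fall inside Ci."""
    return tf_in_class / tf_in_corpus if tf_in_corpus else 0.0

def beta_factor(df_in_class, df_per_class):
    """Inter-class variance factor beta.  The patent's exact expression is not
    reproduced above, so this uses the deviation of the class document frequency
    from the per-class average as an assumed stand-in."""
    mean_df = sum(df_per_class) / len(df_per_class)
    return df_in_class - mean_df

def adaptive_chi(chi_pos, chi_neg, mu, alpha, beta):
    """Improved score: positively / negatively correlated CHI parts weighted by the
    scale factor mu, then scaled by the word-frequency factor and beta."""
    return (mu * chi_pos + (1.0 - mu) * chi_neg) * alpha * beta
```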
Step 4, represent each training text and each test text as a vector over the feature words of the feature dictionary. The weight of each dimension is computed as TFIDF = TF × IDF, where TF (Term Frequency) is the term frequency, i.e. the number of times the feature item occurs in the document, and IDF (Inverse Document Frequency) is the inverse document frequency, IDF = log(M/nk + 0.01), where M is the number of texts in the document collection and nk is the number of documents containing the word;
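A short sketch of this TF-IDF weighting, using the same smoothing constant; the helper names and the doc_freq mapping are illustrative assumptions:

```python
import math

def tfidf_weight(tf, n_k, M):
    """Weight of one dimension: TF * IDF with IDF = log(M / n_k + 0.01), where M is
    the number of texts in the collection and n_k the number of documents containing
    the word."""
    return tf * math.log(M / n_k + 0.01)

def vectorize(tokens, feature_words, doc_freq, M):
    """Represent one (training or test) text as a TF-IDF vector over the feature dictionary."""
    return [tfidf_weight(tokens.count(w), doc_freq.get(w, 1), M) for w in feature_words]
```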
Step 5, KNN classification is carried out;
The training text set is S, the test text is d, n is the feature vector dimension threshold, and K is set to 35.
Step 5.1, compute the similarity between test text d and every text in S using the cosine of the angle between their vectors;
Step 5.2, select the K texts with the largest similarities obtained in step 5.1 as the K nearest-neighbor texts of test text d;
Step 5.3, compute the weight of test text d for each category and assign test text d to the category with the largest weight.
If the known category of training text di is Cj, the weight is computed as
W(d, Cj) = Σ_{di ∈ KNN(d)} Sim(d, di)·y(di, Cj)
where Sim(d, di) is the cosine similarity between test text d and the known-category text di:
Sim(d, di) = Σ_{j=1..n} Xj·xij / (sqrt(Σ_{j=1..n} Xj²)·sqrt(Σ_{j=1..n} xij²))
where n is the feature vector dimension threshold, Xj (0 < j ≤ n) denotes the weight of the j-th dimension of the text d to be classified, and xij denotes the weight of the j-th dimension of training text vector di.
y(di, Cj) is the category attribution function: y(di, Cj) = 1 if di belongs to category Cj, and 0 otherwise.
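Steps 5.1 to 5.3 can be sketched as follows; the function signatures are illustrative, and K defaults to 35 as in the text:

```python
import numpy as np

def cosine_sim(x, y):
    """Sim(d, di): cosine of the angle between two TF-IDF vectors (0 for zero vectors)."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    return float(np.dot(x, y) / (nx * ny)) if nx and ny else 0.0

def knn_classify(test_vec, train_vecs, train_labels, k=35):
    """Rank the training texts by similarity to the test text, keep the K nearest,
    accumulate Sim(d, di) * y(di, Cj) per class, and return the class with the
    largest weight."""
    sims = [cosine_sim(test_vec, tv) for tv in train_vecs]
    nearest = sorted(range(len(sims)), key=sims.__getitem__, reverse=True)[:k]
    weights = {}
    for i in nearest:
        weights[train_labels[i]] = weights.get(train_labels[i], 0.0) + sims[i]
    return max(weights, key=weights.get)
```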
Step 6, compute the precision, recall and F1 value of the KNN classification algorithm; by setting a maximum threshold on the difference between the F1 values of two consecutive classification runs and a step size by which the scale factor μ is increased, obtain the final value of the scale factor μ, so as to guarantee higher classification accuracy.
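The F1 value combines precision and recall in the usual way; a one-line helper for reference:

```python
def f1_score(precision, recall):
    """F1 value from precision and recall (0 when both are 0)."""
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
```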
Step 6.1, set the initial F1 value to 0 and the initial μ value to 0.5; ε = 0.0001 is the threshold on the difference between two consecutive F1 values, and τ = 0.05 is the step size by which the scale factor μ is increased;
Step 6.2, repeat step 5 to obtain F1′, and compute the difference between the two consecutive F1 values, ΔF = |F1′ − F1|;
Step 6.3, if ΔF is less than ε, the current scale factor μ is taken as the result; if ΔF is greater than or equal to ε, let μ = μ + τ and F1 = F1′, and repeat steps 6.2 and 6.3 until a suitable scale factor μ is obtained.
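A sketch of this adjustment loop, assuming a hypothetical evaluate_f1 callback that reruns steps 3 to 5 for a given μ and returns the F1 value; the cap of μ at 1.0 is an added safeguard, not part of the text:

```python
def tune_mu(evaluate_f1, mu=0.5, eps=1e-4, tau=0.05):
    """Increase mu by tau until two consecutive F1 values differ by less than eps."""
    f1_prev = 0.0
    while mu <= 1.0:  # mu is a proportion; the upper bound is an assumed safeguard
        f1_new = evaluate_f1(mu)
        if abs(f1_new - f1_prev) < eps:
            break
        f1_prev, mu = f1_new, mu + tau
    return mu
```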
Compared with the prior art, the present invention has the following advantages.
The present invention proposes an adaptive feature selection method based on chi-square statistics; the KNN algorithm is chosen as the classifier for the test texts. The overall process flow chart is shown in Fig. 1 and the flow chart for computing the scale factor μ in Fig. 2; the accuracy indices on the balanced corpus are listed in Table 1 and those on the unbalanced corpus in Table 2. Compared with the traditional CHI method, on the one hand the word-frequency factor and the inter-class variance are introduced to reduce the bias of CHI toward low-frequency words and to select feature items that occur with high frequency in a given category and are evenly distributed within it; on the other hand the adaptive scale factor μ is introduced, which assigns different weights according to the positive or negative correlation and makes the method suitable for corpora with different distributions, reducing the error caused by choosing the scale factor manually. As can be seen from Table 1 and Table 2, compared with the traditional CHI method, the invention improves classification precision on both the balanced and the unbalanced corpus.
Description of the drawings
Fig. 1 is the flow chart of the overall process of the present invention.
Fig. 2 is the flow chart for computing the scale factor μ in the present invention.
Specific embodiment
The present invention is realized by the following technical means:
An adaptive text feature selection method based on chi-square statistics. First, the training text set and the test text set are preprocessed, including word segmentation and stop-word removal. Second, the adaptive text feature selection based on chi-square statistics is carried out: the word-frequency factor α and the inter-class variance β are defined and introduced into the CHI algorithm, and a suitable scale factor μ is added to it. Finally, combined with the classical KNN algorithm, the scale factor μ is adjusted automatically so that the improved CHI fits different corpora and guarantees higher classification accuracy.
The above adaptive text feature selection method based on chi-square statistics is used for text classification and comprises the following steps:
Step 1, download the Chinese corpus published by Fudan University from the Internet, consisting of a training text set and a test text set;
Step 2, preprocess the training text set and the test text set with the segmentation software ICTCLAS, including word segmentation and stop-word removal, obtaining the segmented training text set and test text set;
Step 3, perform feature selection on the segmented training text set with the CHI-based adaptive text feature selection method, obtaining the feature dictionary of the training text set;
The calculation formula of the traditional CHI text feature selection method is as follows:
χ²(tk, Ci) = N·(A·D − B·C)² / [(A + B)(C + D)(A + C)(B + D)]
where N = A + B + C + D is the total number of documents, A denotes the number of documents that contain feature tk and belong to category Ci, B denotes the number of documents that contain tk and do not belong to Ci, C denotes the number of documents that do not contain tk and belong to Ci, and D denotes the number of documents that do not contain tk and do not belong to Ci.
The formula of the proposed CHI-based adaptive text feature selection method is as follows:
χnew²(tk, Ci) = [μ·χ²(tk, Ci)⁺ + (1 − μ)·χ²(tk, Ci)⁻]·α·β (2)
where μ is the scale factor, χ²(tk, Ci)⁺ and χ²(tk, Ci)⁻ denote the CHI values of positively and negatively correlated feature items, α is the word-frequency factor and β is the inter-class variance; α is given by
α = tf(tk, Ci) / Σ_{i=1..m} tf(tk, Ci)
where m is the total number of categories in the training set, tf(tk, Ci) denotes the number of occurrences of feature item tk in category Ci, and Σ_{i=1..m} tf(tk, Ci) denotes the number of occurrences of tk in the whole training text set. Suppose category Ci in the training set contains p documents di1, di2, ..., dij, ..., dip; then tf(tk, dij) denotes the number of occurrences of tk in the j-th document of Ci, tf(tk, Ci) = Σ_{j=1..p} tf(tk, dij) is the total number of occurrences of tk in Ci, and the denominator above is the total number of occurrences of tk in all documents of the training text set. The word-frequency factor α is thus the ratio of the occurrences of tk within category Ci to the occurrences of tk in the whole training set. The larger α is, the more frequently the feature item occurs in the given category and the less often (or hardly at all) it occurs in other categories, so such a feature item has stronger class discrimination ability; the smaller α is, the less frequently the feature item occurs in the given category and the more often it occurs in other categories, so it has weaker class discrimination ability.
Here m denotes the number of categories, dfi is the number of documents in category Ci that contain tk, and d̄f = (1/m)·Σ_{i=1..m} dfi is the average number of documents per category that contain tk; dfi greater than or equal to d̄f means the feature word appears in more documents of the category than the average, and dfi below d̄f means it appears in fewer. The value of β measures how far the document frequency of the feature word in one category departs from the average document frequency over all categories. The larger β is, the more the number of documents containing tk in category Ci exceeds the average over all categories, and such a feature item has stronger class discrimination ability.
Step 4, represent each training text and each test text as a vector over the feature words of the feature dictionary. The weight of each dimension is computed as TFIDF = TF × IDF, where TF (Term Frequency) is the term frequency, i.e. the number of times the feature item occurs in the document, and IDF (Inverse Document Frequency) is the inverse document frequency, IDF = log(M/nk + 0.01), where M is the number of texts in the document collection and nk is the number of documents containing the word;
Step 5, KNN classification is carried out;
The training text set is S, the test text is d, n is the feature vector dimension threshold, and K is set to 35.
The similarity between test text d and every text in S is computed using the cosine of the angle between their vectors; the K texts with the largest similarities are selected as the K nearest neighbors of test text d; the weight of test text d for each category is computed, and test text d is assigned to the category with the largest weight.
If the known category of training text di is Cj, the weight is computed as
W(d, Cj) = Σ_{di ∈ KNN(d)} Sim(d, di)·y(di, Cj)
where Sim(d, di) is the cosine similarity between test text d and the known-category text di:
Sim(d, di) = Σ_{j=1..n} Xj·xij / (sqrt(Σ_{j=1..n} Xj²)·sqrt(Σ_{j=1..n} xij²))
where n is the feature vector dimension threshold, Xj (0 < j ≤ n) denotes the weight of the j-th dimension of the text d to be classified, and xij denotes the weight of the j-th dimension of training text vector di.
y(di, Cj) is the category attribution function: y(di, Cj) = 1 if di belongs to category Cj, and 0 otherwise.
Step 6, compute the precision, recall and F1 value of the KNN classification algorithm; by setting a maximum threshold on the difference between the F1 values of two consecutive classification runs and a step size by which the scale factor μ is increased, obtain the final value of the scale factor.
Set the initial F1 value to 0 and the initial μ value to 0.5; ε = 0.0001 is the threshold on the difference between two consecutive F1 values, and τ = 0.05 is the step size by which the scale factor μ is increased.
Repeat step 5 to obtain F1′, and compute the difference between the two consecutive F1 values, ΔF = |F1′ − F1|; if ΔF is less than ε, the current scale factor μ is taken as the result; if ΔF is greater than or equal to ε, let μ = μ + τ and F1 = F1′, and repeat this iteration until a suitable scale factor μ is obtained, so as to guarantee higher classification accuracy.
Table 1: Comparison of results before and after the algorithm improvement (balanced corpus) (%)
Table 2: Comparison of results before and after the algorithm improvement (unbalanced corpus) (%)

Claims (1)

1. An adaptive feature selection method based on chi-square statistics, characterized by comprising the following steps:
Step 1, download the Chinese corpus published by Fudan University from the Internet, consisting of a training text set and a test text set;
Step 2, preprocess the training text set and the test text set with the segmentation software ICTCLAS, including word segmentation and stop-word removal, obtaining the segmented training text set and test text set;
Step 3, perform feature selection on the segmented training text set with the CHI-based adaptive text feature selection method, obtaining the feature dictionary of the training text set;
The calculation formula of the traditional CHI text feature selection method is as follows:
χ²(tk, Ci) = N·(A·D − B·C)² / [(A + B)(C + D)(A + C)(B + D)]
where N = A + B + C + D, A denotes the number of documents that contain feature tk and belong to category Ci, B denotes the number of documents that contain tk and do not belong to Ci, C denotes the number of documents that do not contain tk and belong to Ci, and D denotes the number of documents that do not contain tk and do not belong to Ci;
The formula of the proposed CHI-based adaptive text feature selection method is as follows:
χnew²(tk, Ci) = [μ·χ²(tk, Ci)⁺ + (1 − μ)·χ²(tk, Ci)⁻]·α·β
where μ is the scale factor, χ²(tk, Ci)⁺ and χ²(tk, Ci)⁻ denote the CHI values of positively and negatively correlated feature items, α is the word-frequency factor and β is the inter-class variance; α is given by
α = tf(tk, Ci) / Σ_{i=1..m} tf(tk, Ci)
where m is the total number of categories in the training set, tf(tk, Ci) denotes the number of occurrences of feature item tk in category Ci, and Σ_{i=1..m} tf(tk, Ci) denotes the number of occurrences of tk in the whole training text set; if category Ci in the training set contains p documents di1, di2, ..., dij, ..., dip, then tf(tk, dij) denotes the number of occurrences of tk in the j-th document of Ci, tf(tk, Ci) = Σ_{j=1..p} tf(tk, dij) is the total number of occurrences of tk in Ci, and the denominator above is the total number of occurrences of tk in all documents of the training text set; the word-frequency factor α is thus the ratio of the occurrences of tk within category Ci to the occurrences of tk in the whole training set; the larger α is, the more frequently the feature item occurs in the given category and the less often (or hardly at all) it occurs in other categories, so such a feature item has stronger class discrimination ability; the smaller α is, the less frequently the feature item occurs in the given category and the more often it occurs in other categories, so it has weaker class discrimination ability;
Here m denotes the number of categories, dfi is the number of documents in category Ci that contain tk, and d̄f = (1/m)·Σ_{i=1..m} dfi is the average number of documents per category that contain tk; dfi greater than or equal to d̄f means the feature word appears in more documents of the category than the average, and dfi below d̄f means it appears in fewer; β measures how far the document frequency of the feature word in one category departs from the average document frequency over all categories; the larger β is, the more the number of documents containing tk in category Ci exceeds the average over all categories, and such a feature item has stronger class discrimination ability;
Step 4, represent each training text and each test text as a vector over the feature words of the feature dictionary; the weight of each dimension is computed as TFIDF = TF × IDF, where TF (Term Frequency) is the term frequency, i.e. the number of times the feature item occurs in the document, and IDF (Inverse Document Frequency) is the inverse document frequency, IDF = log(M/nk + 0.01), where M is the number of texts in the document collection and nk is the number of documents containing the word;
Step 5, KNN classification is carried out;
The training text set is S, the test text is d, n is the feature vector dimension threshold, and K is set to 35;
Step 5.1, compute the similarity between test text d and every text in S using the cosine of the angle between their vectors;
Step 5.2, select the K texts with the largest similarities obtained in step 5.1 as the K nearest-neighbor texts of test text d;
Step 5.3, compute the weight of test text d for each category and assign test text d to the category with the largest weight;
If the known category of training text di is Cj, the weight is computed as
W(d, Cj) = Σ_{di ∈ KNN(d)} Sim(d, di)·y(di, Cj)
where Sim(d, di) is the cosine similarity between test text d and the known-category text di:
Sim(d, di) = Σ_{j=1..n} Xj·xij / (sqrt(Σ_{j=1..n} Xj²)·sqrt(Σ_{j=1..n} xij²))
where n is the feature vector dimension threshold, Xj (0 < j ≤ n) denotes the weight of the j-th dimension of the text d to be classified, and xij denotes the weight of the j-th dimension of training text vector di;
y(di, Cj) is the category attribution function: y(di, Cj) = 1 if di belongs to category Cj, and 0 otherwise;
Step 6, compute the precision, recall and F1 value of the KNN classification algorithm; by setting a maximum threshold on the difference between the F1 values of two consecutive classification runs and a step size by which the scale factor μ is increased, obtain the final value of the scale factor μ, so as to guarantee higher classification accuracy;
Step 6.1, set the initial F1 value to 0 and the initial μ value to 0.5; ε = 0.0001 is the threshold on the difference between two consecutive F1 values, and τ = 0.05 is the step size by which the scale factor μ is increased;
Step 6.2, repeat step 5 to obtain F1′, and compute the difference between the two consecutive F1 values, ΔF = |F1′ − F1|;
Step 6.3, if ΔF is less than ε, the current scale factor μ is taken as the result; if ΔF is greater than or equal to ε, let μ = μ + τ and F1 = F1′, and repeat steps 6.2 and 6.3 until a suitable scale factor μ is obtained, so as to guarantee higher classification accuracy.
CN201510927759.9A 2015-12-14 2015-12-14 An adaptive feature selection method based on chi-square statistics Expired - Fee Related CN105512311B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510927759.9A CN105512311B (en) 2015-12-14 2015-12-14 An adaptive feature selection method based on chi-square statistics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510927759.9A CN105512311B (en) 2015-12-14 2015-12-14 An adaptive feature selection method based on chi-square statistics

Publications (2)

Publication Number Publication Date
CN105512311A CN105512311A (en) 2016-04-20
CN105512311B true CN105512311B (en) 2019-02-26

Family

ID=55720291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510927759.9A Expired - Fee Related CN105512311B (en) 2015-12-14 2015-12-14 An adaptive feature selection method based on chi-square statistics

Country Status (1)

Country Link
CN (1) CN105512311B (en)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021461A (en) * 2016-05-17 2016-10-12 深圳市中润四方信息技术有限公司 Text classification method and text classification system
CN108073567B (en) * 2016-11-16 2021-12-28 北京嘀嘀无限科技发展有限公司 Feature word extraction processing method, system and server
CN108090088A (en) * 2016-11-23 2018-05-29 北京国双科技有限公司 Feature extracting method and device
CN106611057B (en) * 2016-12-27 2019-08-13 上海利连信息科技有限公司 The text classification feature selection approach of importance weighting
CN107291837B (en) * 2017-05-31 2020-04-03 北京大学 Network text word segmentation method based on field adaptability
CN107256214B (en) * 2017-06-30 2020-09-25 联想(北京)有限公司 Junk information judgment method and device and server cluster
CN107577794B (en) * 2017-09-19 2019-07-05 北京神州泰岳软件股份有限公司 A kind of news category method and device
CN108197307A (en) * 2018-01-31 2018-06-22 湖北工业大学 The selection method and system of a kind of text feature
CN108491429A (en) * 2018-02-09 2018-09-04 湖北工业大学 A kind of feature selection approach based on document frequency and word frequency statistics between class in class
CN108376130A (en) * 2018-03-09 2018-08-07 长安大学 A kind of objectionable text information filtering feature selection approach
CN108346474B (en) * 2018-03-14 2021-09-28 湖南省蓝蜻蜓网络科技有限公司 Electronic medical record feature selection method based on word intra-class distribution and inter-class distribution
CN108920545B (en) * 2018-06-13 2021-07-09 四川大学 Chinese emotion feature selection method based on extended emotion dictionary and chi-square model
CN109325511B (en) * 2018-08-01 2020-07-31 昆明理工大学 Method for improving feature selection
CN109543037A (en) * 2018-11-21 2019-03-29 南京安讯科技有限责任公司 A kind of article classification method based on improved TF-IDF
CN112256865B (en) * 2019-01-31 2023-03-21 青岛科技大学 Chinese text classification method based on classifier
CN110069630B (en) * 2019-03-20 2023-07-21 重庆信科设计有限公司 Improved mutual information feature selection method
CN110705247B (en) * 2019-08-30 2020-08-04 山东科技大学 Based on x2-C text similarity calculation method
CN110688481A (en) * 2019-09-02 2020-01-14 贵州航天计量测试技术研究所 Text classification feature selection method based on chi-square statistic and IDF
CN111144106B (en) * 2019-12-20 2023-05-02 山东科技大学 Two-stage text feature selection method under unbalanced data set
CN111062212B (en) * 2020-03-18 2020-06-30 北京热云科技有限公司 Feature extraction method and system based on optimized TFIDF
CN112200259A (en) * 2020-10-19 2021-01-08 哈尔滨理工大学 Information gain text feature selection method and classification device based on classification and screening
CN113032564B (en) * 2021-03-22 2023-05-30 建信金融科技有限责任公司 Feature extraction method, device, electronic equipment and storage medium
CN113515623B (en) * 2021-04-28 2022-12-06 西安理工大学 Feature selection method based on word frequency difference factor
CN113378567B (en) * 2021-07-05 2022-05-10 广东工业大学 Chinese short text classification method for improving low-frequency words

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9311390B2 (en) * 2008-01-29 2016-04-12 Educational Testing Service System and method for handling the confounding effect of document length on vector-based similarity scores

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103678274A (en) * 2013-04-15 2014-03-26 南京邮电大学 Feature extraction method for text categorization based on improved mutual information and entropy
CN103886108A (en) * 2014-04-13 2014-06-25 北京工业大学 Feature selection and weight calculation method of imbalance text set
CN104750844A (en) * 2015-04-09 2015-07-01 中南大学 Method and device for generating text characteristic vectors based on TF-IGM, method and device for classifying texts

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
An optimized mutual information text feature selection method based on word frequency; Liu Haifeng; Computer Engineering; 2014-12-31; pp. 179-182

Also Published As

Publication number Publication date
CN105512311A (en) 2016-04-20

Similar Documents

Publication Publication Date Title
CN105512311B (en) An adaptive feature selection method based on chi-square statistics
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
CN105426426B (en) A kind of KNN file classification methods based on improved K-Medoids
Huang et al. Naive Bayes classification algorithm based on small sample set
CN104142918B (en) Short text clustering and focus subject distillation method based on TF IDF features
CN110825877A (en) Semantic similarity analysis method based on text clustering
US20190278864A2 (en) Method and device for processing a topic
Faguo et al. Research on short text classification algorithm based on statistics and rules
CN108932311B (en) Method for detecting and predicting emergency
CN104750844A (en) Method and device for generating text characteristic vectors based on TF-IGM, method and device for classifying texts
CN103294817A (en) Text feature extraction method based on categorical distribution probability
CN111694958A (en) Microblog topic clustering method based on word vector and single-pass fusion
CN103995876A (en) Text classification method based on chi square statistics and SMO algorithm
CN107066555A (en) Towards the online topic detection method of professional domain
CN109271517A (en) IG TF-IDF Text eigenvector generates and file classification method
CN105975518A (en) Information entropy-based expected cross entropy feature selection text classification system and method
CN102243641A (en) Method for efficiently clustering massive data
Xu et al. An improved information gain feature selection algorithm for SVM text classifier
CN107908624A (en) A kind of K medoids Text Clustering Methods based on all standing Granule Computing
Dan et al. Research of text categorization on Weka
CN108153899B (en) Intelligent text classification method
CN106503146B (en) The feature selection approach of computer version
Abdul-Rahman et al. Exploring feature selection and support vector machine in text categorization
Zhang et al. A hot spot clustering method based on improved kmeans algorithm
Shen et al. A cross-database comparison to discover potential product opportunities using text mining and cosine similarity

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190226

Termination date: 20211214

CF01 Termination of patent right due to non-payment of annual fee