CN105512311B - A kind of adaptive features select method based on chi-square statistics - Google Patents
- Publication number: CN105512311B (application CN201510927759.9A)
- Authority: CN (China)
- Legal status: Expired - Fee Related (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/00 — Information retrieval; database structures therefor; file system structures therefor
- G06F40/205 — Handling natural language data; natural language analysis; parsing
- G06F2216/03 — Indexing scheme relating to additional aspects of information retrieval; data mining
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
An adaptive feature selection method based on chi-square statistics, relating to the field of computer text data processing. First, the training text set and test text set are preprocessed, including word segmentation and stop-word removal. Then adaptive text feature selection based on chi-square statistics is carried out: a word-frequency factor and an inter-class variance are defined and introduced into the CHI algorithm, and a suitable scale factor is added. Finally, guided by the evaluation metrics of the classical KNN algorithm, the scale factor is adjusted automatically so that the improved CHI suits different corpora, guaranteeing a higher classification accuracy. Experimental results show that, compared with the traditional CHI method, the present invention improves classification precision on both balanced and imbalanced corpora.
Description
Technical field
The present invention relates to the field of computer text data processing, and in particular to an adaptive text feature selection method based on chi-square statistics (χ², CHI).
Background art
In the current era of big data, mining the latent value of data is of prime importance, and data mining, as the technology for discovering that value, has attracted great attention. Text accounts for a considerable share of big data, and text classification, as an effective way of organizing and managing text data, has increasingly become a focus of attention. It is widely applied in information filtering, information organization and management, information retrieval, digital libraries, spam filtering, and so on. Text Classification (TC) refers to automatically assigning a text of unknown class to one or more classes of a previously given classification system according to its content. Common text classification methods include K-Nearest-Neighbor (KNN), Naive Bayes (NB), and Support Vector Machine (SVM).
The text classification process comprises preprocessing, feature dimensionality reduction, text representation, classifier training, and evaluation. At present the most common text representation is the vector space model; the high dimensionality and sparsity of the vector space increase time and space complexity and strongly affect classification precision, so feature dimensionality reduction is crucial and directly determines the efficiency and accuracy of classification. Feature dimensionality reduction mainly comprises two methods: feature extraction (Feature Extraction) and feature selection (Feature Selection). Linguistics-based feature extraction requires natural-language-processing techniques and has high computational complexity, whereas feature selection methods based on statistical theory have lower complexity and need little background knowledge, so feature selection is more widely applied. The basic idea of feature selection is to construct an evaluation function, score each feature item of the feature set, rank all feature items by score, and select a certain number of features as the final text feature set. Common feature selection methods include chi-square statistics (CHI), Document Frequency (DF), Information Gain (IG), Mutual Information (MI), Expected Cross Entropy (ECE), and Weight of Evidence (WE).
As one of the common text feature selection methods, CHI is simple to implement and has low time complexity; however, it also has shortcomings that make the classification effect unsatisfactory. The deficiencies of the CHI algorithm fall into two aspects. First, CHI considers only the document frequency of a feature item and ignores its term frequency, so the weights of low-frequency words are amplified. Second, it amplifies the weight of feature items that appear rarely in one class but often in other classes. Many researchers have improved CHI; the improvements can be summarized in two directions. The first introduces several adjustment parameters to reduce the reliance on low-frequency words, but does not consider the positive/negative correlation between feature items and classes. The second introduces a scale factor and assigns different weights according to positive or negative correlation to improve the feature selection ability of the CHI model, but the scale factor must be chosen by experience. In view of the deficiencies of the various existing improved CHI algorithms, designing a CHI text feature selection method with high classification precision has important academic significance and practical value.
Summary of the invention
The object of the present invention is to provide an improved CHI text feature selection method that raises the accuracy of text classification. On the one hand, a word-frequency factor and an inter-class variance are introduced to reduce CHI's reliance on low-frequency words and to select feature items that appear frequently in a certain class and are evenly distributed within it; on the other hand, an adaptive scale factor μ is introduced that assigns different weights according to positive or negative correlation, reducing the error caused by choosing the scale factor manually.
The features of the invention are as follows:
Step 1: Download from the Internet the Chinese corpus published by Fudan University, comprising a training text set and a test text set.
Step 2: Preprocess the training and test text sets with the Chinese Academy of Sciences open-source segmentation software ICTCLAS, including word segmentation and stop-word removal, to obtain the segmented training and test text sets.
Step 3: Perform feature selection on the segmented training text set using the adaptive CHI-based text feature selection method, obtaining the feature dictionary of the training text set.
The traditional CHI text feature selection method is computed as follows:
χ²(t_k, C_i) = N·(A·D − B·C)² / [(A + B)·(C + D)·(A + C)·(B + D)],  N = A + B + C + D
where A is the number of documents that contain feature t_k and belong to class C_i, B the number that contain t_k but do not belong to C_i, C the number that do not contain t_k but belong to C_i, and D the number that neither contain t_k nor belong to C_i.
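As an illustrative sketch (not part of the patent text), the traditional CHI score above can be computed directly from the four document counts A, B, C, D; the function name below is our own choice:

```python
def chi_square(A, B, C, D):
    """Traditional CHI score for feature t_k and class C_i.

    A: documents containing t_k and belonging to C_i
    B: documents containing t_k but not belonging to C_i
    C: documents not containing t_k but belonging to C_i
    D: documents containing neither
    """
    N = A + B + C + D
    denom = (A + B) * (C + D) * (A + C) * (B + D)
    if denom == 0:
        return 0.0  # degenerate contingency table
    return N * (A * D - B * C) ** 2 / denom
```

When the feature and the class are independent (A·D = B·C) the score is 0, which matches the chi-square test of independence underlying the method.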
The proposed adaptive CHI-based text feature selection method uses the formula
χ²_new(t_k, C_i) = [μ·χ²⁺(t_k, C_i) + (1 − μ)·χ²⁻(t_k, C_i)]·α·β
where μ is the scale factor, χ²⁺ and χ²⁻ are the CHI values of positively and negatively correlated feature items respectively, α is the word-frequency factor, and β is the inter-class variance. The factor α is defined as
α(t_k, C_i) = tf(t_k, C_i) / Σ_{j=1..m} tf(t_k, C_j)
where m is the total number of classes in the training set, tf(t_k, C_i) is the number of occurrences of feature item t_k in class C_i, and the denominator is the number of occurrences of t_k in the entire training text set. If class C_i of the training set contains p documents d_i1, d_i2, ..., d_ij, ..., d_ip, then tf(t_k, d_ij) is the number of occurrences of t_k in the j-th document of C_i, tf(t_k, C_i) = Σ_{j=1..p} tf(t_k, d_ij) is the total number of occurrences of t_k in C_i, and the denominator is the total number of occurrences of t_k in all documents of the training text set. The word-frequency factor α therefore represents the proportion of t_k's term frequency in the training set that falls in class C_i. The larger α is, the more frequently the feature item appears in this class and the more rarely (or not at all) in other classes; such a feature item clearly has stronger class discrimination ability. The smaller α is, the less frequently the feature item appears in this class and the more frequently in other classes, and such a feature item clearly has weaker class discrimination ability.
The factor β is defined from document frequencies:
df̄ = (1/m)·Σ_{i=1..m} df_i,  β(t_k, C_i) = (df_i − df̄) / df̄
where m is the number of classes, df_i is the number of documents in class C_i that contain t_k, and df̄ is the average number of documents per class containing t_k. df_i − df̄ ≥ 0 indicates that the feature word appears in at least as many documents of this class as the average; df_i − df̄ < 0 indicates that it appears in fewer. The β value measures how far the document frequency of the feature word in one class departs from the average over all classes. The larger β is, the more the number of documents containing t_k in class C_i exceeds the average over all classes, and by a larger margin; such a feature item has stronger class discrimination ability.
Step 4: Represent each training text and each test text as a vector over the feature words of the feature dictionary. The weight of each dimension is computed as TFIDF = TF × IDF, where TF (Term Frequency) is the number of times the feature item occurs in the document, and IDF (Inverse Document Frequency) is computed as IDF = log(M/n_k + 0.01), with M the number of documents in the collection and n_k the number of documents containing the word.
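The weighting in step 4 can be sketched as follows (illustrative code of ours; the smoothing constant 0.01 is the one stated in the text):

```python
import math

def tfidf_vector(doc_tokens, vocab, doc_freq, M):
    """TF-IDF vector for one document, per the patent's weighting.

    doc_tokens: list of tokens in the document
    vocab: ordered feature dictionary (list of feature words)
    doc_freq: dict mapping word -> number of documents containing it (n_k)
    M: total number of documents in the collection
    """
    vec = []
    for w in vocab:
        tf = doc_tokens.count(w)                 # raw term frequency
        idf = math.log(M / doc_freq[w] + 0.01)   # IDF = log(M/n_k + 0.01)
        vec.append(tf * idf)
    return vec
```

Words appearing in fewer documents receive a larger IDF and thus a larger weight for the same term frequency.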
Step 5: Perform KNN classification.
Let the training text set be S, the test text be d, n the feature vector dimension threshold, and K = 35.
Step 5.1: Compute the similarity between test text d and every text in S using the cosine of the angle between their vectors.
Step 5.2: Select the K texts with the greatest similarity obtained in step 5.1 as the K nearest-neighbor texts of d.
Step 5.3: Compute the weight with which d belongs to each class and assign d to the class with the greatest weight.
If training text d_i has known class C_j, the weight is computed as
W(d, C_j) = Σ_{d_i ∈ KNN(d)} Sim(d, d_i)·y(d_i, C_j)
where Sim(d, d_i) is the cosine similarity between test text d and known-class text d_i:
Sim(d, d_i) = Σ_{j=1..n} X_j·x_ij / [√(Σ_{j=1..n} X_j²)·√(Σ_{j=1..n} x_ij²)]
with n the feature vector dimension threshold, X_j the weight of the j-th dimension of the text d to be classified (0 < j ≤ n), and x_ij the weight of the j-th dimension of training text vector d_i.
y(d_i, C_j) is the class attribute function:
y(d_i, C_j) = 1 if d_i belongs to class C_j, and 0 otherwise.
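Steps 5.1–5.3 amount to standard similarity-weighted KNN voting; a compact sketch (our code, with our own function names) under the assumption that texts are already TF-IDF vectors:

```python
import math
from collections import defaultdict

def cosine(u, v):
    # cosine of the angle between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(test_vec, train, K=35):
    """train: list of (vector, label) pairs.

    Each class is weighted by the summed similarity of its members
    among the K nearest neighbors, i.e. sum of Sim(d, d_i)*y(d_i, C_j).
    """
    sims = sorted(((cosine(test_vec, v), c) for v, c in train),
                  reverse=True)[:K]
    weights = defaultdict(float)
    for s, c in sims:
        weights[c] += s
    return max(weights, key=weights.get)
```

The patent fixes K = 35; the default here mirrors that, though any K up to the training set size works.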
Step 6: Compute the precision, recall, and F₁ value of the KNN classification algorithm. By setting a maximum threshold on the difference between the F₁ values of two successive classifications and a step by which the scale factor μ increases, the final value of μ is obtained, guaranteeing a higher classification accuracy.
Step 6.1: Set the initial F₁ value to 0 and the initial μ to 0.5; ε = 0.0001 is the threshold on the difference between two successive F₁ values, and τ = 0.05 is the step by which the scale factor μ increases.
Step 6.2: Repeat step 5 to obtain F₁′ and the difference ΔF = |F₁′ − F₁| between the two successive F₁ values.
Step 6.3: If ΔF < ε, take the current scale factor μ; if ΔF ≥ ε, set μ′ = μ + τ and F₁ = F₁′, and repeat steps 6.2 and 6.3 until a suitable scale factor μ is obtained.
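The μ-tuning loop of step 6 can be sketched as below. This is our own rendering; `evaluate_f1` stands for the whole pipeline of steps 3–5 (feature selection with the given μ, then KNN classification) returning the resulting F₁ value:

```python
def tune_mu(evaluate_f1, eps=1e-4, tau=0.05, mu=0.5):
    """Increase the scale factor mu in steps of tau until the F1 value
    of two successive runs differs by less than eps (steps 6.1-6.3)."""
    f1 = 0.0  # initial F1 per step 6.1
    while True:
        f1_new = evaluate_f1(mu)      # rerun classification with current mu
        if abs(f1_new - f1) < eps:    # F1 has stabilised
            return mu
        mu, f1 = mu + tau, f1_new     # mu' = mu + tau, F1 = F1'
```

Note the loop as specified terminates only when F₁ stabilises; in practice one would also cap μ at 1, since it is a convex-combination weight in the scoring formula.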
Compared with the prior art, the present invention has the following advantages. The invention proposes an adaptive feature selection method based on chi-square statistics, with KNN chosen as the classification algorithm for classifying the test texts. The overall processing flow is shown in Fig. 1 and the flow for computing the scale factor μ in Fig. 2; the accuracy figures for the balanced corpus are given in Table 1 and those for the imbalanced corpus in Table 2. Compared with the traditional CHI method, on the one hand the word-frequency factor and inter-class variance reduce CHI's reliance on low-frequency words and select feature items that appear frequently in a certain class and are evenly distributed within it; on the other hand the adaptive scale factor μ assigns different weights according to positive or negative correlation, making the method suitable for corpora of different distributions and reducing the error caused by choosing the scale factor manually. As can be seen from Tables 1 and 2, compared with the traditional CHI method the invention improves classification precision on both balanced and imbalanced corpora.
Description of the drawings
Fig. 1 is the flow chart of the overall process of the present invention.
Fig. 2 is the flow chart for computing the scale factor μ of the present invention.
Specific embodiment
The present invention is realized by the following technical means:
An adaptive text feature selection method based on chi-square statistics. First, the training text set and test text set are preprocessed, including word segmentation and stop-word removal. Second, adaptive text feature selection based on chi-square statistics is performed: the word-frequency factor α and inter-class variance β are defined and introduced into the CHI algorithm, and a suitable scale factor μ is added. Finally, in combination with the classical KNN algorithm, the scale factor μ is adjusted automatically so that the improved CHI suits different corpora, guaranteeing a higher classification accuracy.
The above adaptive text feature selection method based on chi-square statistics is used for text classification and comprises the following steps:
Step 1: Download from the Internet the Chinese corpus published by Fudan University, comprising a training text set and a test text set.
Step 2: Preprocess the training and test text sets with the segmentation software ICTCLAS, including word segmentation and stop-word removal, to obtain the segmented training and test text sets.
Step 3: Perform feature selection on the segmented training text set using the adaptive CHI-based text feature selection method, obtaining the feature dictionary of the training text set.
The traditional CHI text feature selection method is computed as follows:
χ²(t_k, C_i) = N·(A·D − B·C)² / [(A + B)·(C + D)·(A + C)·(B + D)],  N = A + B + C + D
where A is the number of documents that contain feature t_k and belong to class C_i, B the number that contain t_k but do not belong to C_i, C the number that do not contain t_k but belong to C_i, and D the number that neither contain t_k nor belong to C_i.
The proposed adaptive CHI-based text feature selection method uses the formula
χ²_new(t_k, C_i) = [μ·χ²⁺(t_k, C_i) + (1 − μ)·χ²⁻(t_k, C_i)]·α·β   (2)
where μ is the scale factor, χ²⁺ and χ²⁻ are the CHI values of positively and negatively correlated feature items respectively, α is the word-frequency factor, and β is the inter-class variance. The factor α is defined as
α(t_k, C_i) = tf(t_k, C_i) / Σ_{j=1..m} tf(t_k, C_j)
where m is the total number of classes in the training set, tf(t_k, C_i) is the number of occurrences of feature item t_k in class C_i, and the denominator is the number of occurrences of t_k in the entire training text set. If class C_i of the training set contains p documents d_i1, d_i2, ..., d_ij, ..., d_ip, then tf(t_k, d_ij) is the number of occurrences of t_k in the j-th document of C_i, tf(t_k, C_i) = Σ_{j=1..p} tf(t_k, d_ij) is the total number of occurrences of t_k in C_i, and the denominator is the total number of occurrences of t_k in all documents of the training text set. The word-frequency factor α therefore represents the proportion of t_k's term frequency in the training set that falls in class C_i. The larger α is, the more frequently the feature item appears in this class and the more rarely (or not at all) in other classes; such a feature item clearly has stronger class discrimination ability. The smaller α is, the less frequently the feature item appears in this class and the more frequently in other classes, and such a feature item clearly has weaker class discrimination ability.
The factor β is defined from document frequencies:
df̄ = (1/m)·Σ_{i=1..m} df_i,  β(t_k, C_i) = (df_i − df̄) / df̄
where m is the number of classes, df_i is the number of documents in class C_i that contain t_k, and df̄ is the average number of documents per class containing t_k. df_i − df̄ ≥ 0 indicates that the feature word appears in at least as many documents of this class as the average; df_i − df̄ < 0 indicates that it appears in fewer. The β value measures how far the document frequency of the feature word in one class departs from the average over all classes. The larger β is, the more the number of documents containing t_k in class C_i exceeds the average over all classes, and by a larger margin; such a feature item has stronger class discrimination ability.
Step 4: Represent each training text and each test text as a vector over the feature words of the feature dictionary. The weight of each dimension is computed as TFIDF = TF × IDF, where TF (Term Frequency) is the number of times the feature item occurs in the document, and IDF (Inverse Document Frequency) is computed as IDF = log(M/n_k + 0.01), with M the number of documents in the collection and n_k the number of documents containing the word.
Step 5: Perform KNN classification.
Let the training text set be S, the test text be d, n the feature vector dimension threshold, and K = 35.
Compute the similarity between test text d and every text in S using the cosine of the angle between their vectors; select the K texts with the greatest computed similarity as the K nearest neighbors of d; compute the weight with which d belongs to each class and assign d to the class with the greatest weight.
If training text d_i has known class C_j, the weight is computed as
W(d, C_j) = Σ_{d_i ∈ KNN(d)} Sim(d, d_i)·y(d_i, C_j)
where Sim(d, d_i) is the cosine similarity between test text d and known-class text d_i:
Sim(d, d_i) = Σ_{j=1..n} X_j·x_ij / [√(Σ_{j=1..n} X_j²)·√(Σ_{j=1..n} x_ij²)]
with n the feature vector dimension threshold, X_j the weight of the j-th dimension of the text d to be classified (0 < j ≤ n), and x_ij the weight of the j-th dimension of training text vector d_i.
y(d_i, C_j) is the class attribute function:
y(d_i, C_j) = 1 if d_i belongs to class C_j, and 0 otherwise.
Step 6: Compute the precision, recall, and F₁ value of the KNN classification algorithm. By setting a maximum threshold on the difference between the F₁ values of two successive classifications and a step by which the scale factor μ increases, the final value of the scale factor is obtained.
Set the initial F₁ value to 0 and the initial μ to 0.5; ε = 0.0001 is the threshold on the difference between two successive F₁ values, and τ = 0.05 is the step by which the scale factor μ increases.
Repeat step 5 to obtain F₁′ and the difference ΔF = |F₁′ − F₁| between the two successive F₁ values. If ΔF < ε, take the current scale factor μ; if ΔF ≥ ε, set μ′ = μ + τ and F₁ = F₁′, and repeat this iteration until a suitable scale factor μ is obtained, guaranteeing a higher classification accuracy.
Table 1. Comparison of results before and after the algorithm improvement (balanced corpus) (%)
Table 2. Comparison of results before and after the algorithm improvement (imbalanced corpus) (%)
Claims (1)
1. An adaptive feature selection method based on chi-square statistics, characterized by comprising the following steps:
Step 1: downloading from the Internet the Chinese corpus published by Fudan University, comprising a training text set and a test text set;
Step 2: preprocessing the training and test text sets with the segmentation software ICTCLAS, including word segmentation and stop-word removal, to obtain the segmented training and test text sets;
Step 3: performing feature selection on the segmented training text set using the adaptive CHI-based text feature selection method, obtaining the feature dictionary of the training text set;
the traditional CHI text feature selection method being computed as
χ²(t_k, C_i) = N·(A·D − B·C)² / [(A + B)·(C + D)·(A + C)·(B + D)],  N = A + B + C + D
wherein A is the number of documents that contain feature t_k and belong to class C_i, B the number that contain t_k but do not belong to C_i, C the number that do not contain t_k but belong to C_i, and D the number that neither contain t_k nor belong to C_i;
the proposed adaptive CHI-based text feature selection method using the formula
χ²_new(t_k, C_i) = [μ·χ²⁺(t_k, C_i) + (1 − μ)·χ²⁻(t_k, C_i)]·α·β
wherein μ is the scale factor, χ²⁺ and χ²⁻ are the CHI values of positively and negatively correlated feature items respectively, α is the word-frequency factor, and β is the inter-class variance, with
α(t_k, C_i) = tf(t_k, C_i) / Σ_{j=1..m} tf(t_k, C_j)
wherein m is the total number of classes in the training set, tf(t_k, C_i) is the number of occurrences of feature item t_k in class C_i, and the denominator is the number of occurrences of t_k in the entire training text set; if class C_i of the training set contains p documents d_i1, d_i2, ..., d_ij, ..., d_ip, then tf(t_k, d_ij) is the number of occurrences of t_k in the j-th document of C_i, tf(t_k, C_i) = Σ_{j=1..p} tf(t_k, d_ij) is the total number of occurrences of t_k in C_i, and the denominator is the total number of occurrences of t_k in all documents of the training text set; the word-frequency factor α therefore represents the proportion of t_k's term frequency in the training set that falls in class C_i; the larger α is, the more frequently the feature item appears in this class and the more rarely (or not at all) in other classes, and such a feature item clearly has stronger class discrimination ability; the smaller α is, the less frequently the feature item appears in this class and the more frequently in other classes, and such a feature item has weaker class discrimination ability;
the factor β being defined from document frequencies:
df̄ = (1/m)·Σ_{i=1..m} df_i,  β(t_k, C_i) = (df_i − df̄) / df̄
wherein m is the number of classes, df_i is the number of documents in class C_i that contain t_k, and df̄ is the average number of documents per class containing t_k; df_i − df̄ ≥ 0 indicates that the feature word appears in at least as many documents of this class as the average, and df_i − df̄ < 0 indicates that it appears in fewer; the β value measures how far the document frequency of the feature word in one class departs from the average over all classes; the larger β is, the more the number of documents containing t_k in class C_i exceeds the average over all classes, and by a larger margin, and such a feature item has stronger class discrimination ability;
Step 4: representing each training text and each test text as a vector over the feature words of the feature dictionary, the weight of each dimension being computed as TFIDF = TF × IDF, wherein TF (Term Frequency) is the number of times the feature item occurs in the document, and IDF (Inverse Document Frequency) is computed as IDF = log(M/n_k + 0.01), with M the number of documents in the collection and n_k the number of documents containing the word;
Step 5: performing KNN classification;
letting the training text set be S, the test text be d, n the feature vector dimension threshold, and K = 35;
Step 5.1: computing the similarity between test text d and every text in S using the cosine of the angle between their vectors;
Step 5.2: selecting the K texts with the greatest similarity obtained in step 5.1 as the K nearest-neighbor texts of d;
Step 5.3: computing the weight with which d belongs to each class and assigning d to the class with the greatest weight;
wherein, if training text d_i has known class C_j, the weight is computed as
W(d, C_j) = Σ_{d_i ∈ KNN(d)} Sim(d, d_i)·y(d_i, C_j)
wherein Sim(d, d_i) is the cosine similarity between test text d and known-class text d_i:
Sim(d, d_i) = Σ_{j=1..n} X_j·x_ij / [√(Σ_{j=1..n} X_j²)·√(Σ_{j=1..n} x_ij²)]
with n the feature vector dimension threshold, X_j the weight of the j-th dimension of the text d to be classified, 0 < j ≤ n, and x_ij the weight of the j-th dimension of training text vector d_i;
y(d_i, C_j) being the class attribute function: y(d_i, C_j) = 1 if d_i belongs to class C_j, and 0 otherwise;
Step 6: computing the precision, recall, and F₁ value of the KNN classification algorithm, and obtaining the final value of the scale factor μ by setting a maximum threshold on the difference between the F₁ values of two successive classifications and a step by which μ increases, guaranteeing a higher classification accuracy;
Step 6.1: setting the initial F₁ value to 0 and the initial μ to 0.5, wherein ε = 0.0001 is the threshold on the difference between two successive F₁ values and τ = 0.05 is the step by which the scale factor μ increases;
Step 6.2: repeating step 5 to obtain F₁′ and the difference ΔF = |F₁′ − F₁| between the two successive F₁ values;
Step 6.3: if ΔF < ε, taking the current scale factor μ; if ΔF ≥ ε, setting μ′ = μ + τ and F₁ = F₁′, and repeating steps 6.2 and 6.3 until a suitable scale factor μ is obtained, guaranteeing a higher classification accuracy.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510927759.9A CN105512311B (en) | 2015-12-14 | 2015-12-14 | A kind of adaptive features select method based on chi-square statistics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105512311A CN105512311A (en) | 2016-04-20 |
CN105512311B true CN105512311B (en) | 2019-02-26 |
Family
ID=55720291
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510927759.9A Expired - Fee Related CN105512311B (en) | 2015-12-14 | 2015-12-14 | A kind of adaptive features select method based on chi-square statistics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105512311B (en) |
Families Citing this family (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106021461A (en) * | 2016-05-17 | 2016-10-12 | 深圳市中润四方信息技术有限公司 | Text classification method and text classification system |
CN108073567B (en) * | 2016-11-16 | 2021-12-28 | 北京嘀嘀无限科技发展有限公司 | Feature word extraction processing method, system and server |
CN108090088A (en) * | 2016-11-23 | 2018-05-29 | 北京国双科技有限公司 | Feature extracting method and device |
CN106611057B (en) * | 2016-12-27 | 2019-08-13 | 上海利连信息科技有限公司 | The text classification feature selection approach of importance weighting |
CN107291837B (en) * | 2017-05-31 | 2020-04-03 | 北京大学 | Network text word segmentation method based on field adaptability |
CN107256214B (en) * | 2017-06-30 | 2020-09-25 | 联想(北京)有限公司 | Junk information judgment method and device and server cluster |
CN107577794B (en) * | 2017-09-19 | 2019-07-05 | 北京神州泰岳软件股份有限公司 | A kind of news category method and device |
CN108197307A (en) * | 2018-01-31 | 2018-06-22 | 湖北工业大学 | Text feature selection method and system |
CN108491429A (en) * | 2018-02-09 | 2018-09-04 | 湖北工业大学 | Feature selection method based on intra-class and inter-class document frequency and word frequency statistics |
CN108376130A (en) * | 2018-03-09 | 2018-08-07 | 长安大学 | Feature selection method for filtering objectionable text information |
CN108346474B (en) * | 2018-03-14 | 2021-09-28 | 湖南省蓝蜻蜓网络科技有限公司 | Electronic medical record feature selection method based on intra-class and inter-class word distributions |
CN108920545B (en) * | 2018-06-13 | 2021-07-09 | 四川大学 | Chinese emotion feature selection method based on an extended emotion dictionary and a chi-square model |
CN109325511B (en) * | 2018-08-01 | 2020-07-31 | 昆明理工大学 | Improved feature selection method |
CN109543037A (en) * | 2018-11-21 | 2019-03-29 | 南京安讯科技有限责任公司 | Article classification method based on improved TF-IDF |
CN109902173B (en) * | 2019-01-31 | 2020-10-27 | 青岛科技大学 | Chinese text classification method |
CN110069630B (en) * | 2019-03-20 | 2023-07-21 | 重庆信科设计有限公司 | Improved mutual information feature selection method |
CN110705247B (en) * | 2019-08-30 | 2020-08-04 | 山东科技大学 | Text similarity calculation method based on χ²-C |
CN110688481A (en) * | 2019-09-02 | 2020-01-14 | 贵州航天计量测试技术研究所 | Text classification feature selection method based on chi-square statistics and IDF |
CN111144106B (en) * | 2019-12-20 | 2023-05-02 | 山东科技大学 | Two-stage text feature selection method for unbalanced data sets |
CN111062212B (en) * | 2020-03-18 | 2020-06-30 | 北京热云科技有限公司 | Feature extraction method and system based on optimized TF-IDF |
CN112200259A (en) * | 2020-10-19 | 2021-01-08 | 哈尔滨理工大学 | Information-gain text feature selection method and classification device based on classification screening |
CN113032564B (en) * | 2021-03-22 | 2023-05-30 | 建信金融科技有限责任公司 | Feature extraction method and device, electronic equipment and storage medium |
CN113515623B (en) * | 2021-04-28 | 2022-12-06 | 西安理工大学 | Feature selection method based on a word frequency difference factor |
CN113378567B (en) * | 2021-07-05 | 2022-05-10 | 广东工业大学 | Chinese short text classification method with improved low-frequency word handling |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103678274A (en) * | 2013-04-15 | 2014-03-26 | 南京邮电大学 | Feature extraction method for text categorization based on improved mutual information and entropy |
CN103886108A (en) * | 2014-04-13 | 2014-06-25 | 北京工业大学 | Feature selection and weight calculation method for imbalanced text sets |
CN104750844A (en) * | 2015-04-09 | 2015-07-01 | 中南大学 | Method and device for generating text feature vectors based on TF-IGM, and method and device for text classification |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9311390B2 (en) * | 2008-01-29 | 2016-04-12 | Educational Testing Service | System and method for handling the confounding effect of document length on vector-based similarity scores |
- 2015-12-14: Chinese application CN201510927759.9A filed; granted as CN105512311B (status: Expired - Fee Related)
Non-Patent Citations (1)
Title |
---|
Liu Haifeng, "Optimized mutual information text feature selection method based on word frequency," Computer Engineering, 2014-12-31, pp. 179-182 |
Also Published As
Publication number | Publication date |
---|---|
CN105512311A (en) | 2016-04-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105512311B (en) | Adaptive feature selection method based on chi-square statistics | |
CN107609121B (en) | News text classification method based on LDA and word2vec algorithm | |
Rathi et al. | Sentiment analysis of tweets using machine learning approach | |
CN105426426B (en) | KNN text classification method based on improved K-Medoids | |
De Battisti et al. | A decade of research in statistics: A topic model approach | |
Huang et al. | Naive Bayes classification algorithm based on small sample set | |
CN104142918B (en) | Short text clustering and hot topic extraction method based on TF-IDF features | |
CN110825877A (en) | Semantic similarity analysis method based on text clustering | |
US20190278864A2 (en) | Method and device for processing a topic | |
Faguo et al. | Research on short text classification algorithm based on statistics and rules | |
CN108197144B (en) | Hot topic discovery method based on BTM and Single-pass | |
CN108932311B (en) | Method for detecting and predicting emergency | |
CN104750844A (en) | Method and device for generating text feature vectors based on TF-IGM, and method and device for text classification | |
CN103294817A (en) | Text feature extraction method based on categorical distribution probability | |
CN111694958A (en) | Microblog topic clustering method based on word vector and single-pass fusion | |
CN103995876A (en) | Text classification method based on chi-square statistics and the SMO algorithm | |
CN107066555A (en) | Online topic detection method for professional domains | |
CN109271517A (en) | IG-TF-IDF text feature vector generation and text classification method | |
CN102243641A (en) | Method for efficiently clustering massive data | |
Xu et al. | An improved information gain feature selection algorithm for SVM text classifier | |
CN107908624A (en) | K-medoids text clustering method based on full-coverage granular computing | |
Dan et al. | Research of text categorization on Weka | |
CN108153899B (en) | Intelligent text classification method | |
Abdul-Rahman et al. | Exploring feature selection and support vector machine in text categorization | |
Zhang et al. | A hot spot clustering method based on improved kmeans algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 2019-02-26; Termination date: 2021-12-14 |