CN108491429A - Feature selection method based on within-class and between-class document frequency and term frequency statistics - Google Patents

Feature selection method based on within-class and between-class document frequency and term frequency statistics

Info

Publication number
CN108491429A
Authority
CN
China
Prior art keywords
feature
class
classification
feature words
training set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810131876.8A
Other languages
Chinese (zh)
Inventor
邵雄凯
赵婧
刘建舟
王春枝
华满
阳邹
陈亮亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei University of Technology
Original Assignee
Hubei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei University of Technology filed Critical Hubei University of Technology
Priority to CN201810131876.8A
Publication of CN108491429A
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a feature selection method based on within-class and between-class document frequency and term frequency statistics. Considering the document frequency and term frequency of feature words together with their between-class concentration and within-class dispersion, it constructs a feature selection evaluation function based on within-class and between-class document frequency and term frequency statistics (DFCTFS). For each category of the training set, a certain proportion of feature words is selected from the original feature space of the text-preprocessed training set using the proposed evaluation function to form that category's feature dictionary; the feature dictionary of the training set is then the union of the per-category feature dictionaries. The proposed DFCTFS feature selection method selects feature words that are concentrated in the documents of a certain category and are evenly distributed and occur frequently within that category's documents, improving the effectiveness of Chinese text classification.

Description

Feature selection method based on within-class and between-class document frequency and term frequency statistics
Technical field
The invention belongs to the field of Chinese text classification technology and relates to a feature selection method, in particular to a feature selection method based on within-class and between-class document frequency and term frequency statistics.
Background art
The overall workflow of Chinese text classification is essentially: text preprocessing, feature selection, building a text representation model, classifying with a classification algorithm, and evaluating the classification model. Feature selection is the key step of Chinese text classification: it selects a subset of important features from the high-dimensional original feature space to form a lower-dimensional space, improving both classification accuracy and classification efficiency.
Traditional feature selection methods include document frequency (DF), mutual information (MI), information gain (IG), and the chi-square statistic (CHI). The usual practice of feature selection is to choose an evaluation function, compute its value for each of the N original feature items, sort the resulting values in descending order, and select from the original feature set the top P feature items carrying the most information.
Among the traditional feature selection methods, CHI and IG have been shown to be the two with the best text classification performance. CHI starts from the premise that the feature word t and the category C_i are mutually independent and computes the value (i.e., the degree of deviation) between the two variables; the larger the computed value (i.e., the greater the deviation), the more strongly t is correlated with C_i. However, the traditional CHI method has shortcomings: (1) it considers only the document frequency of a feature word and ignores its term frequency distribution across categories, so CHI may select feature words with high document frequency but low term frequency; (2) it may select feature words that are negatively correlated with a category.
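For reference, the conventional CHI statistic discussed here is usually written, in its standard textbook form, as:

χ²(t, C_i) = N × (A×D − B×C)² / [(A + C) × (B + D) × (A + B) × (C + D)]

where A is the number of documents in C_i that contain t, B the number of documents outside C_i that contain t, C the number of documents in C_i that do not contain t, D the number of documents outside C_i that do not contain t, and N the total number of documents.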
When IG is used for text feature selection, it measures how much information the presence or absence of a word contributes to deciding whether a text belongs to a category, with the amount of information measured by entropy. IG is the difference between the entropy of the documents without considering any feature and the entropy of the documents after considering the feature; this difference expresses how much the feature reduces the uncertainty of the information. The greater the reduction in uncertainty, the larger the corresponding information gain, the more information the word provides, and the more important the term is. However, the traditional IG method also has shortcomings: (1) it does not consider the term frequency distribution of feature words across categories; (2) it suffers interference from negatively correlated feature words; (3) it can only perform global feature selection (meaning all categories in the training set share the same feature set) and cannot perform local feature selection (meaning each category in the training set has its own feature set).
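Similarly for reference, the conventional information gain of a feature word t over categories C_1, ..., C_M is usually written as:

IG(t) = −Σ_{i=1..M} P(C_i) log P(C_i) + P(t) Σ_{i=1..M} P(C_i|t) log P(C_i|t) + P(t̄) Σ_{i=1..M} P(C_i|t̄) log P(C_i|t̄)

where t̄ denotes the absence of t: the first term is the entropy of the category distribution, and the remaining terms subtract the conditional entropy of the categories given the presence or absence of t.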
After preprocessing and feature selection, the training set yields a feature dictionary. The CHI feature selection method uses the CHI evaluation function to obtain each feature word's CHI value in each category of the training set, takes the average or the maximum of a word's per-category CHI values as its CHI value over the entire training set, sorts all feature words by CHI value in descending order, and selects a certain proportion of them as the feature dictionary of the entire training set. The IG feature selection method obtains each feature word's IG value over the entire training set from the IG evaluation function, sorts all feature words by IG value in descending order, and selects a certain proportion of them as the feature dictionary of the entire training set.
From a comprehensive analysis of the shortcomings of CHI and IG, it can be concluded that the key to feature selection in text classification is to select feature words that are concentrated in the documents of a particular category and are evenly distributed and occur frequently within that category's documents. The present invention therefore considers the document frequency and term frequency of feature words together with their between-class concentration and within-class dispersion, and proposes a feature selection method based on within-class and between-class document frequency and term frequency statistics (Document Frequency of within-class and between-class and Term Frequency Statistics, DFCTFS), improving classification accuracy.
Summary of the invention
The purpose of the present invention is to provide a feature selection method based on within-class and between-class document frequency and term frequency statistics that optimizes feature selection results and improves the accuracy of Chinese text classification.
The technical solution adopted by the present invention is a feature selection method based on within-class and between-class document frequency and term frequency statistics, characterized by comprising the following steps:
Step 1: Each text in the training set is segmented, stop words are removed, and the text is represented by its terms; the result is denoted the original feature space. All original feature words of the training set are input, where a feature word in the original feature space is denoted t_k, 0 ≤ k ≤ N, and N is the total number of feature words in the original feature space;
Step 2: Considering the document frequency and term frequency of feature words together with their between-class concentration and within-class dispersion, construct the feature selection evaluation function based on within-class and between-class document frequency and term frequency statistics, used to compute the within-class/between-class document frequency and term frequency statistic DFCTFS;
Step 3: From the resulting original feature space, construct a feature-word-by-category two-dimensional matrix, in which rows represent feature words, columns represent categories, and the elements of the matrix are DFCTFS values;
Step 4: According to the DFCTFS values of each feature word in each category of the training set, sort the feature words within each category in descending order;
Step 5: Obtain the total number of categories M in the training set and the total number N of feature words in the training set; take a certain proportion of the feature words, denoted numWords; the number of feature words selected in each category, num, is then numWords divided by M;
Step 6: For each category of the training set, according to the value of num obtained in Step 5, select the top num feature words after sorting by DFCTFS value within that category in descending order to form that category's feature dictionary;
Step 7: Obtain the feature dictionary of the training set as the union of the per-category feature dictionaries;
Step 8: Establish the text representation model;
According to the feature dictionary, compute the weight of each feature word for every text in the training set; after vectorization the training set forms a two-dimensional matrix in which each row represents a text and each column represents a feature word in the feature dictionary;
Step 9: Classify using a classification algorithm; train a classifier on the training set with the classification algorithm to obtain a classification model;
Step 10: Evaluate classifier performance;
The test set is likewise segmented, stop words are removed, and each text is represented by its terms; the weight of each feature word for every text in the test set is computed, and after vectorization the test set forms a two-dimensional matrix in which each row represents a text and each column represents a feature word in the feature dictionary;
Using the trained classification model, classify the test set and evaluate the classifier's performance with recall, precision, and F1 values.
The beneficial effects of the present invention are as follows. The traditional CHI and IG feature selection methods have shortcomings: they do not consider the term frequency distribution of feature words across categories, so they may select feature words with high document frequency while ignoring the contribution of feature words with low document frequency but high term frequency; they suffer interference from feature words negatively correlated with a category; and IG can only perform global feature selection, not local feature selection. The present invention considers the document frequency and term frequency of feature words together with their between-class concentration and within-class dispersion, and constructs a feature selection evaluation function based on within-class and between-class document frequency and term frequency statistics (DFCTFS), aiming to select feature words that are concentrated in the documents of a certain category and are evenly distributed and occur frequently within that category's documents, thereby improving the accuracy of text classification.
Description of the drawings
Fig. 1: Flow chart of an embodiment of the present invention;
Fig. 2: Overall flow chart of Chinese text classification using the present invention;
Fig. 3: Comparison of CHI, IG, and the proposed DFCTFS on classification recall in an embodiment of the present invention;
Fig. 4: Comparison of CHI, IG, and the proposed DFCTFS on classification precision in an embodiment of the present invention;
Fig. 5: Comparison of CHI, IG, and the proposed DFCTFS on classification F1 values in an embodiment of the present invention;
Fig. 6: Comparison of CHI, IG, and the proposed DFCTFS on overall classification performance in an embodiment of the present invention.
Detailed description of the embodiments
To make it easier for those of ordinary skill in the art to understand and implement the present invention, the invention is described in further detail below with reference to the accompanying drawings and embodiments. The embodiments described here serve only to illustrate and explain the present invention and are not intended to limit it.
Referring to Fig. 1 and Fig. 2, the feature selection method based on within-class and between-class document frequency and term frequency statistics provided by the present invention comprises the following steps:
Step 1: Each text in the training set is segmented, stop words are removed, and the text is represented by its terms; the result is denoted the original feature space. All original feature words of the training set are input, where a feature word in the original feature space is denoted t_k, 0 ≤ k ≤ N, and N is the total number of feature words in the original feature space;
Step 2: Considering the document frequency and term frequency of feature words together with their between-class concentration and within-class dispersion, construct the feature selection evaluation function based on within-class and between-class document frequency and term frequency statistics, used to compute the within-class/between-class document frequency and term frequency statistic DFCTFS;
Here, the feature selection evaluation function based on within-class and between-class document frequency and term frequency statistics is defined in terms of the following quantities:
DFCTFS(t_k, C_i) denotes the within-class/between-class document frequency and term frequency statistic DFCTFS of feature word t_k in category C_i; DF(t_k, C_i) denotes the number of texts in category C_i in which t_k appears; DF(t_k) denotes the total number of texts across all categories of the training set in which t_k appears; DF(t, C_i) denotes the sum over all feature words of the number of texts in category C_i in which they appear; TF(t_k, C_i) denotes the number of times t_k appears in category C_i; numDocs_i denotes the number of texts in category C_i; and M denotes the number of categories.
Step 3: From the resulting original feature space, construct a feature-word-by-category two-dimensional matrix, in which rows represent feature words, columns represent categories, and the elements of the matrix are DFCTFS values;
The specific implementation comprises the following steps:
Step 3.1: For each category in the training set, count the number of texts DF(t_k, C_i) in which feature word t_k appears in category C_i and its number of occurrences TF(t_k, C_i), where k = 1...N, N being the total number of feature words, and i = 1...M, M being the number of categories;
Step 3.2: Locate the corresponding position (t_k, C_i) in the two-dimensional matrix and compute the DFCTFS value of feature word t_k for category C_i using the feature selection evaluation function based on within-class and between-class document frequency and term frequency statistics, thereby constructing the N×M feature-word-by-category two-dimensional matrix of the training set.
Step 4: According to the DFCTFS values of each feature word in each category of the training set, sort the feature words within each category in descending order;
Each category in this embodiment is a category of the training text corpus. The corpus used in this embodiment is the Chinese corpus compiled by the natural language processing group of the International Database Center, Department of Computer Information and Technology, Fudan University. The texts in this corpus are already categorized, and the text set of each category is stored in its own folder. Eight of its categories were selected for this embodiment (the classification categories of the present invention are not limited to eight; by adjusting the category parameter setting in the experiment to match the number of categories in the chosen corpus, text classification experiments with different numbers of categories can be realized). This Chinese text corpus is an existing, already-compiled resource that can be used directly and can be downloaded from the Internet.
Step 5: Obtain the total number of categories M in the training set and the total number N of feature words in the training set; take a certain proportion of the feature words, denoted numWords; the number of feature words selected in each category, num, is then numWords divided by M;
Step 6: For each category of the training set, according to the value of num obtained in Step 5, select the top num feature words after sorting by DFCTFS value within that category in descending order to form that category's feature dictionary;
Step 7: Obtain the feature dictionary of the training set as the union of the per-category feature dictionaries; taking the union guarantees the uniqueness of the words in the feature dictionary;
Step 8: Establish the text representation model;
Among text representation models the vector space model is the most widely used. Its main realization is as follows: according to the feature dictionary, compute the weight of each feature word for every text in the training set, the most commonly used weighting method being TF-IDF (term frequency-inverse document frequency); after vectorization the training set forms a two-dimensional matrix in which each row represents a text and each column represents a feature word in the feature dictionary.
The weights are computed as:
TFIDF(w_ik) = [c_ik × log(D/n_k + β)] / sqrt( Σ_{k=1..N} [c_ik × log(D/n_k + β)]² )
Here, TFIDF(w_ik) indicates that the weight of feature word t_k in text d_i is w_ik; c_ik denotes the number of times t_k appears in text d_i; N denotes the total number of feature words; D denotes the total number of texts in the training set; n_k denotes the number of texts in which t_k appears; and β is a constant term.
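For illustration, a small sketch of this weighting as written above; the default β = 0.01 is a common choice in the literature, not a value fixed by the source, and all names are the sketch's own.

```python
import math
from collections import Counter

def tfidf_vector(text, dictionary, doc_freq, D, beta=0.01):
    """w_ik = c_ik * log(D / n_k + beta), cosine-normalized over the dictionary.
    text: token list of one document; dictionary: ordered feature words;
    doc_freq: word -> n_k (texts containing the word); D: training text count."""
    c = Counter(text)
    raw = [c[w] * math.log(D / doc_freq[w] + beta) if doc_freq.get(w) else 0.0
           for w in dictionary]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw] if norm else raw
```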
Step 9: Classify using a classification algorithm; train a classifier on the training set with the classification algorithm to obtain a classification model;
Step 10: Evaluate classifier performance;
The test set is likewise segmented, stop words are removed, and each text is represented by its terms; the weight of each feature word for every text in the test set is computed, and after vectorization the test set forms a two-dimensional matrix in which each row represents a text and each column represents a feature word in the feature dictionary;
Using the trained classification model, classify the test set and evaluate the classifier's performance with recall, precision, and F1 values.
The recall, precision, and F1 values are calculated as follows:
Recall: R = A / (A + C)
Precision: P = A / (A + B)
F1 value: F1 = 2 × P × R / (P + R)
Macro recall: MacroR = (1/M) × Σ_{i=1..M} R_i
Macro precision: MacroP = (1/M) × Σ_{i=1..M} P_i
Macro F1 value: MacroF1 = (1/M) × Σ_{i=1..M} F1_i
Here, M is the number of categories; A denotes texts judged to belong to the category that do belong to it; B denotes texts judged to belong to the category that do not belong to it; C denotes texts judged not to belong to the category that do belong to it; and D denotes texts judged not to belong to the category that do not belong to it. R_i denotes the recall of category i, P_i the precision of category i, and F1_i the F1 value of category i.
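As an illustration, a sketch that computes the per-category and macro metrics of Step 10 from predicted and true labels, following the formulas above (names are the sketch's own):

```python
def macro_metrics(y_true, y_pred, categories):
    """Per-category recall/precision/F1 and their macro averages (Step 10)."""
    R, P, F1 = {}, {}, {}
    for cat in categories:
        A = sum(1 for t, p in zip(y_true, y_pred) if t == cat and p == cat)  # true positives
        B = sum(1 for t, p in zip(y_true, y_pred) if t != cat and p == cat)  # false positives
        C = sum(1 for t, p in zip(y_true, y_pred) if t == cat and p != cat)  # false negatives
        R[cat] = A / (A + C) if A + C else 0.0
        P[cat] = A / (A + B) if A + B else 0.0
        F1[cat] = 2 * P[cat] * R[cat] / (P[cat] + R[cat]) if P[cat] + R[cat] else 0.0
    M = len(categories)
    return sum(R.values()) / M, sum(P.values()) / M, sum(F1.values()) / M
```

The three returned values correspond to MacroR, MacroP, and MacroF1 above.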
Compared with the traditional CHI and IG feature selection methods, the feature selection method based on within-class and between-class document frequency and term frequency statistics proposed by the present invention improves classification recall, precision, and F1 values to a certain extent, as the following experiments illustrate. Chinese word segmentation in the experiments uses the ICTCLAS Chinese word segmentation system developed by the Institute of Computing Technology, Chinese Academy of Sciences. Feature selection takes 5% of the total number of feature words in the training set. The classification algorithm is SVM, using the LIBSVM toolbox developed by Professor Lin Chih-Jen et al. of National Taiwan University. The Chinese corpus is the one compiled by the natural language processing group of the International Database Center, Department of Computer Information and Technology, Fudan University; eight of its categories were selected: sports, history, space, politics, environment, economy, art, and computer. The selection of texts for each category is shown in the following table:
Table 1: Selection of training set and test set texts from the corpus
The experiments were implemented in Java on the MyEclipse platform; the server was configured with a 64-bit Windows 7 operating system, an Intel(R) Core(TM) i5-2450M CPU at 2.50 GHz, and 4.00 GB of memory.
The results of the proposed method and the traditional CHI and IG feature selection methods on classification recall, precision, and F1 values are compared in Table 2:
Table 2: Comparison of experimental results of the traditional CHI and IG methods and the proposed DFCTFS in text classification
Table 3: Comparison of the traditional CHI and IG methods and the proposed DFCTFS on overall classification performance
Analysis of Table 2 shows that the DFCTFS feature selection proposed here outperforms traditional CHI and IG in the overall trend of classification performance across the eight selected categories. Analysis of Table 3 shows that, compared with CHI and IG respectively, the proposed DFCTFS feature selection method improves macro recall by 2.11% and 1.54%, macro precision by 2.11% and 1.36%, and macro F1 by 2.12% and 1.5%. Taken together, the experimental results show that the classification performance of the proposed DFCTFS feature selection improves to some extent over traditional CHI and IG, which illustrates the effectiveness of the invention.
Referring to Fig. 3, which compares CHI, IG, and the proposed DFCTFS on classification recall: Fig. 3 shows intuitively that, on the eight categories selected for the experiment, the proposed DFCTFS improves classification recall to some extent over the traditional feature selection methods CHI and IG.
Referring to Fig. 4, which compares CHI, IG, and the proposed DFCTFS on classification precision: Fig. 4 shows intuitively that, on the eight categories selected for the experiment, the proposed DFCTFS improves classification precision to some extent over the traditional feature selection methods CHI and IG.
Referring to Fig. 5, which compares CHI, IG, and the proposed DFCTFS on classification F1 values: Fig. 5 shows intuitively that, on the eight categories selected for the experiment, the proposed DFCTFS improves classification F1 values to some extent over the traditional feature selection methods CHI and IG.
Referring to Fig. 6, which compares CHI, IG, and the proposed DFCTFS on overall classification performance: Fig. 6 shows intuitively that, on the eight selected categories, the proposed DFCTFS improves on the traditional feature selection methods CHI and IG as measured by macro recall, macro precision, and macro F1 values.
It should be understood that the parts not elaborated in this specification belong to the prior art. Those of ordinary skill in the art may, under the inspiration of the present invention and without departing from the scope protected by the claims of the present invention, make substitutions or variations, all of which fall within the protection scope of the present invention; the claimed scope of the invention is determined by the appended claims.

Claims (6)

1. A feature selection method based on within-class and between-class document frequency and term frequency statistics, characterized by comprising the following steps:
Step 1: Each text in the training set is segmented, stop words are removed, and the text is represented by its terms; the result is denoted the original feature space; all original feature words of the training set are input, where a feature word in the original feature space is denoted t_k, 0 ≤ k ≤ N, and N is the total number of feature words in the original feature space;
Step 2: Considering the document frequency and term frequency of feature words together with their between-class concentration and within-class dispersion, construct the feature selection evaluation function based on within-class and between-class document frequency and term frequency statistics, used to compute the within-class/between-class document frequency and term frequency statistic DFCTFS;
Step 3: From the resulting original feature space, construct a feature-word-by-category two-dimensional matrix, in which rows represent feature words, columns represent categories, and the elements of the matrix are DFCTFS values;
Step 4: According to the DFCTFS values of each feature word in each category of the training set, sort the feature words within each category in descending order;
Step 5: Obtain the total number of categories M in the training set and the total number N of feature words in the training set; take a certain proportion of the feature words, denoted numWords; the number of feature words selected in each category, num, is then numWords divided by M;
Step 6: For each category of the training set, according to the value of num obtained in Step 5, select the top num feature words after sorting by DFCTFS value within that category in descending order to form that category's feature dictionary;
Step 7: Obtain the feature dictionary of the training set as the union of the per-category feature dictionaries.
2. The feature selection method based on within-class and between-class document frequency and term frequency statistics according to claim 1, characterized in that the feature selection evaluation function based on within-class and between-class document frequency and term frequency statistics described in step 2 is defined in terms of the following quantities:
DFCTFS(t_k, C_i) denotes the within-class/between-class document frequency and term frequency statistic DFCTFS of feature word t_k in category C_i; DF(t_k, C_i) denotes the number of texts in category C_i in which t_k appears; DF(t_k) denotes the total number of texts across all categories of the training set in which t_k appears; DF(t, C_i) denotes the sum over all feature words of the number of texts in category C_i in which they appear; TF(t_k, C_i) denotes the number of times t_k appears in category C_i; numDocs_i denotes the number of texts in category C_i; and M denotes the number of categories.
3. The feature selection method based on within-class and between-class document frequency and term frequency statistics according to claim 1, characterized in that the specific implementation of step 3 comprises the following steps:
Step 3.1: For each category in the training set, count the number of texts DF(t_k, C_i) in which feature word t_k appears in category C_i and its number of occurrences TF(t_k, C_i), where k = 1...N, N being the total number of feature words, and i = 1...M, M being the number of categories;
Step 3.2: Locate the corresponding position (t_k, C_i) in the two-dimensional matrix and compute the DFCTFS value of feature word t_k for category C_i using the feature selection evaluation function based on within-class and between-class document frequency and term frequency statistics, thereby constructing the N×M feature-word-by-category two-dimensional matrix of the training set.
4. The feature selection method based on within-class and between-class document frequency and term frequency statistics according to any one of claims 1-3, characterized in that the effectiveness evaluation of the feature selection method comprises the following steps:
Step 8: Establish the text representation model;
According to the feature dictionary, compute the weight of each feature word for every text in the training set; after vectorization the training set forms a two-dimensional matrix in which each row represents a text and each column represents a feature word in the feature dictionary;
Step 9: Classify using a classification algorithm; train a classifier on the training set with the classification algorithm to obtain a classification model;
Step 10: Evaluate classifier performance;
The test set is likewise segmented, stop words are removed, and each text is represented by its terms; the weight of each feature word for every text in the test set is computed, and after vectorization the test set forms a two-dimensional matrix in which each row represents a text and each column represents a feature word in the feature dictionary;
Using the trained classification model, classify the test set and evaluate the classifier's performance with recall, precision, and F1 values.
5. The feature selection method based on within-class and between-class document frequency and term frequency statistics according to claim 4, characterized in that the term weight calculation formula described in step 8 is:
TFIDF(w_ik) = [c_ik × log(D/n_k + β)] / sqrt( Σ_{k=1..N} [c_ik × log(D/n_k + β)]² )
Here, TFIDF(w_ik) indicates that the weight of feature word t_k in text d_i is w_ik; c_ik denotes the number of times t_k appears in text d_i; N denotes the total number of feature words; D denotes the total number of texts in the training set; n_k denotes the number of texts in which t_k appears; and β is a constant term.
6. The feature selection method based on within-class and between-class document frequency and term frequency statistics according to claim 4, characterized in that the recall, precision, and F1 values described in step 10 are calculated as:
Recall: R = A / (A + C)
Precision: P = A / (A + B)
F1 value: F1 = 2 × P × R / (P + R)
Macro recall: MacroR = (1/M) × Σ_{i=1..M} R_i
Macro precision: MacroP = (1/M) × Σ_{i=1..M} P_i
Macro F1 value: MacroF1 = (1/M) × Σ_{i=1..M} F1_i
Here, M is the number of categories; A denotes texts judged to belong to the category that do belong to it; B denotes texts judged to belong to the category that do not belong to it; C denotes texts judged not to belong to the category that do belong to it; and D denotes texts judged not to belong to the category that do not belong to it; R_i denotes the recall of category i, P_i the precision of category i, and F1_i the F1 value of category i.
CN201810131876.8A 2018-02-09 2018-02-09 Feature selection method based on within-class and between-class document frequency and term frequency statistics Pending CN108491429A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810131876.8A CN108491429A (en) 2018-02-09 2018-02-09 Feature selection method based on within-class and between-class document frequency and term frequency statistics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810131876.8A CN108491429A (en) 2018-02-09 2018-02-09 Feature selection method based on within-class and between-class document frequency and term frequency statistics

Publications (1)

Publication Number Publication Date
CN108491429A true CN108491429A (en) 2018-09-04

Family

ID=63340204

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810131876.8A Pending CN108491429A (en) Feature selection method based on within-class and between-class document frequency and term frequency statistics

Country Status (1)

Country Link
CN (1) CN108491429A (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109522549A * 2018-10-30 2019-03-26 云南电网有限责任公司信息中心 Corpus construction method based on Web collection and text feature balanced distribution
CN109558588A * 2018-11-09 2019-04-02 广东原昇信息科技有限公司 Feature extraction method for creative text of information flow material
CN109800296A * 2019-01-21 2019-05-24 四川长虹电器股份有限公司 Semantic fuzzy recognition method based on user real intention
CN110069630A * 2019-03-20 2019-07-30 重庆信科设计有限公司 Improved mutual information feature selection method
CN110096710A * 2019-05-09 2019-08-06 董云鹏 Article analysis and self-demonstration method
CN110110328A * 2019-04-26 2019-08-09 北京零秒科技有限公司 Text processing method and device
CN110135592A * 2019-05-16 2019-08-16 腾讯科技(深圳)有限公司 Classification effect determination method and device, intelligent terminal and storage medium
CN110609938A (en) * 2019-08-15 2019-12-24 平安科技(深圳)有限公司 Text hotspot discovery method and device and computer-readable storage medium
CN111090997A * 2019-12-20 2020-05-01 中南大学 Method and device for ranking feature terms of geological documents based on hierarchical terms
CN111310451A (en) * 2018-12-10 2020-06-19 北京沃东天骏信息技术有限公司 Sensitive dictionary generation method and device, storage medium and electronic equipment
CN111709439A (en) * 2020-05-06 2020-09-25 西安理工大学 Feature selection method based on word frequency deviation rate factor
CN113032564A (en) * 2021-03-22 2021-06-25 建信金融科技有限责任公司 Feature extraction method, feature extraction device, electronic equipment and storage medium
CN113157912A (en) * 2020-12-24 2021-07-23 航天科工网络信息发展有限公司 Text classification method based on machine learning
US11526754B2 (en) 2020-02-07 2022-12-13 Kyndryl, Inc. Feature generation for asset classification

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103106275A * 2013-02-08 2013-05-15 西北工业大学 Text classification feature screening method based on feature distribution information
CN104391835A * 2014-09-30 2015-03-04 中南大学 Method and device for selecting feature words in texts
CN105512311A * 2015-12-14 2016-04-20 北京工业大学 Adaptive feature selection method based on chi-square statistics
CN105893388A * 2015-01-01 2016-08-24 成都网安科技发展有限公司 Text feature extraction method based on inter-class distinctness and intra-class high representation degree

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103106275A * 2013-02-08 2013-05-15 西北工业大学 Text classification feature screening method based on feature distribution information
CN104391835A * 2014-09-30 2015-03-04 中南大学 Method and device for selecting feature words in texts
CN105893388A * 2015-01-01 2016-08-24 成都网安科技发展有限公司 Text feature extraction method based on inter-class distinctness and intra-class high representation degree
CN105512311A * 2015-12-14 2016-04-20 北京工业大学 Adaptive feature selection method based on chi-square statistics

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
万斌候, "文本分类中的特征降维方法研究" ("Research on Feature Dimensionality Reduction Methods in Text Classification"), China Master's Theses Full-text Database, Information Science and Technology *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109522549A * 2018-10-30 2019-03-26 云南电网有限责任公司信息中心 Corpus construction method based on Web collection and text feature balanced distribution
CN109522549B (en) * 2018-10-30 2022-06-10 云南电网有限责任公司信息中心 Corpus construction method based on Web collection and text feature balanced distribution
CN109558588A * 2018-11-09 2019-04-02 广东原昇信息科技有限公司 Feature extraction method for creative text of information flow material
CN109558588B (en) * 2018-11-09 2023-03-31 广东原昇信息科技有限公司 Feature extraction method for creative text of information flow material
CN111310451A (en) * 2018-12-10 2020-06-19 北京沃东天骏信息技术有限公司 Sensitive dictionary generation method and device, storage medium and electronic equipment
CN109800296A * 2019-01-21 2019-05-24 四川长虹电器股份有限公司 Semantic fuzzy recognition method based on user real intention
CN109800296B (en) * 2019-01-21 2022-03-01 四川长虹电器股份有限公司 Semantic fuzzy recognition method based on user real intention
CN110069630A * 2019-03-20 2019-07-30 重庆信科设计有限公司 Improved mutual information feature selection method
CN110110328B (en) * 2019-04-26 2023-09-01 北京零秒科技有限公司 Text processing method and device
CN110110328A * 2019-04-26 2019-08-09 北京零秒科技有限公司 Text processing method and device
CN110096710A * 2019-05-09 2019-08-06 董云鹏 Article analysis and self-demonstration method
CN110096710B (en) * 2019-05-09 2022-12-30 董云鹏 Article analysis and self-demonstration method
CN110135592B (en) * 2019-05-16 2023-09-19 腾讯科技(深圳)有限公司 Classification effect determining method and device, intelligent terminal and storage medium
CN110135592A * 2019-05-16 2019-08-16 腾讯科技(深圳)有限公司 Classification effect determination method and device, intelligent terminal and storage medium
CN110609938A (en) * 2019-08-15 2019-12-24 平安科技(深圳)有限公司 Text hotspot discovery method and device and computer-readable storage medium
CN111090997A * 2019-12-20 2020-05-01 中南大学 Method and device for ranking feature terms of geological documents based on hierarchical terms
US11526754B2 (en) 2020-02-07 2022-12-13 Kyndryl, Inc. Feature generation for asset classification
US11748621B2 (en) 2020-02-07 2023-09-05 Kyndryl, Inc. Methods and apparatus for feature generation using improved term frequency-inverse document frequency (TF-IDF) with deep learning for accurate cloud asset tagging
CN111709439A (en) * 2020-05-06 2020-09-25 西安理工大学 Feature selection method based on word frequency deviation rate factor
CN111709439B (en) * 2020-05-06 2023-10-20 深圳万知达科技有限公司 Feature selection method based on word frequency deviation rate factor
CN113157912A (en) * 2020-12-24 2021-07-23 航天科工网络信息发展有限公司 Text classification method based on machine learning
CN113032564B (en) * 2021-03-22 2023-05-30 建信金融科技有限责任公司 Feature extraction method, device, electronic equipment and storage medium
CN113032564A (en) * 2021-03-22 2021-06-25 建信金融科技有限责任公司 Feature extraction method, feature extraction device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN108491429A Feature selection method based on within-class and between-class document frequency and term frequency statistics
CN104391835B Method and device for selecting feature words in texts
CN102930063B Text classification method based on feature item selection and weight calculation
Basu et al. Support vector machines for text categorization
CN106202518B (en) Short text classification method based on CHI and sub-category association rule algorithm
Guo et al. Research and improvement of feature words weight based on TFIDF algorithm
US11568311B2 (en) Method and system to test a document collection trained to identify sentiments
CN106407406B (en) text processing method and system
CN107180191A Malicious code analysis method and system based on semi-supervised learning
CN104298715B Multi-index result merging and ranking method based on TF-IDF
CN103116637A (en) Text sentiment classification method facing Chinese Web comments
Zhang et al. An improved TF-IDF algorithm based on class discriminative strength for text categorization on desensitized data
Oza et al. Classification of aeronautics system health and safety documents
CN107220311A Document representation method using locally embedded topic modeling
CN106503153B Computer text classification system
CN109766911A Behavior prediction method
CN106570170A Integrated text classification and named entity recognition method and system based on deep recurrent neural network
CN109508374A Semi-supervised classification method for text data based on genetic algorithm
Chow et al. A new document representation using term frequency and vectorized graph connectionists with application to document retrieval
CN106776724A Exercise question classification method and system
Hirsch et al. Evolving Lucene search queries for text classification
Li et al. Customer Churn Combination Prediction Model Based on Convolutional Neural Network and Gradient Boosting Decision Tree
Kim et al. Exploring class enumeration in Bayesian growth mixture modeling based on conditional medians
CN107622129A Knowledge base organization method and device, and computer-readable storage medium
Patel et al. Automated text categorization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180904
