CN109165290A - Text feature selection method based on full-covering granular computing - Google Patents

Text feature selection method based on full-covering granular computing

Info

Publication number
CN109165290A
CN109165290A CN201810641512.4A
Authority
CN
China
Prior art keywords
word
feature
covering
text
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810641512.4A
Other languages
Chinese (zh)
Inventor
谢珺
邹雪君
靳红伟
续欣莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiyuan University of Technology
Original Assignee
Taiyuan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taiyuan University of Technology filed Critical Taiyuan University of Technology
Priority to CN201810641512.4A priority Critical patent/CN109165290A/en
Publication of CN109165290A publication Critical patent/CN109165290A/en
Pending legal-status Critical Current

Links

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A text feature selection method based on full-covering granular computing, comprising: 1) segmenting the sample text set, removing stop words, and performing part-of-speech tagging; 2) extending the TFIDF algorithm with position and part-of-speech factors carrying different weight coefficients, and computing the "document-word frequency" probabilities of the feature words; 3) generating feature word probabilities with the bLDA topic model to capture the semantic information of the feature words; 4) granulating the feature words and reducing them with the knowledge-reduction algorithm of full-covering granular computing to obtain the "document-word frequency" probabilities of the reduced feature word set; 5) combining the term weights computed by bLDA and by the improved TFIDF algorithm to obtain the "document-word frequency" probabilities of the reduced feature word set. The invention takes the part of speech, position and semantics of the feature words into account while removing feature words with weak expressive power, so that a more representative feature word set is selected and clustering precision is improved.

Description

Text feature selection method based on full-covering granular computing
Technical field
The invention belongs to the intersection of text mining and full-covering granular computing, and specifically relates to text feature selection and to the application of the knowledge reduction of the full-covering granular computing model to text feature selection.
Background technique
Text clustering is an important topic in pattern recognition, machine learning and data mining. Its main task is to group a collection of text objects into classes of similar objects, thereby clustering unknown text data. At present, text information is mainly given a structured representation with the vector space model, but this model suffers from the high dimensionality of the feature space and the sparsity of the data. A high-dimensional feature space not only increases the time and space complexity of computation; the many invalid and redundant features it contains also greatly reduce the quality of text clustering. An effective feature selection method is therefore essential in text clustering: it reduces the dimensionality of the feature vectors, removes redundant features, and retains features with strong class-discrimination and expressive power, improving the quality and robustness of clustering.
For the text feature selection problem, experts and scholars have proposed a series of solutions, but with respect to this key issue these methods still have some shortcomings, mainly:
1) Many scholars use methods such as information gain (IG), mutual information (MI) and chi-square statistics (CHI). These statistics-based methods can select effective features to a certain extent, but they ignore the semantic information of the text.
2) Some scholars perform feature selection with the LDA topic model, which captures the semantic information of the text, but the algorithm ignores the word frequency, word position and part of speech, and so does not match how text is actually expressed.
Therefore, the present invention specifically addresses the word frequency, position, part of speech and semantics of text feature words, retaining, during dimensionality reduction and without changing the text representation, the feature words with strong class-discrimination and expressive power.
Summary of the invention
To overcome the poor accuracy and weak expressiveness of existing feature selection methods, the invention proposes a text feature selection method based on full-covering granular computing.
A text feature selection method based on full-covering granular computing comprises the following steps:
Step 1: Obtain a news sample set of different categories and preprocess the title and body of each news text separately; the preprocessing comprises word segmentation, stop-word removal and part-of-speech tagging.
Step 2: Improve the TFIDF method, compute the "document-word frequency" probabilities of the feature words with the improved TFIDF method, and then reduce the feature words with the knowledge-reduction algorithm of full-covering granular computing.
Step 3: Compute the "document-word frequency" probabilities of the feature words with the bLDA topic model, combine them with the term weights computed by the TFIDF algorithm after reduction, obtain the final feature word weights and perform clustering.
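As an illustration of Step 1, here is a minimal preprocessing sketch in Python. It assumes the jieba segmenter; the stop-word file and its one-word-per-line format are illustrative assumptions, not part of the invention.

```python
import jieba.posseg as pseg  # segmentation with part-of-speech tagging

def load_stopwords(path):
    # One stop word per line (assumed file format).
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def preprocess(title, body, stopwords):
    """Segment title and body separately, drop stop words,
    and keep the part-of-speech tag of every remaining word."""
    def tokenize(text):
        return [(pair.word, pair.flag)  # (word, POS tag)
                for pair in pseg.cut(text)
                if pair.word.strip() and pair.word not in stopwords]
    return tokenize(title), tokenize(body)
```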
The TFIDF method is computed as follows:
w_{t,j} = ( tf_{t,j} × log(N / n_t) ) / sqrt( Σ_t ( tf_{t,j} × log(N / n_t) )² )
where tf_{t,j} is the frequency of word t in document j, N is the total number of documents, n_t is the number of documents containing word t, and the denominator is a normalization factor.
The improved TFIDF method is computed as follows:
w_{i,j} = tf_{i,j} × log(N / n_j)
where tf_{i,j} is computed as:
tf_{i,j} = λ_j × ( u_1 · t_j^title + u_2 · t_j^body ) / l
where λ_j is the part-of-speech weight coefficient of word j (different values of λ give the weight coefficients of nouns, verbs and other words), t_j^title and t_j^body are the frequencies of word j in the title and body of the i-th document, u_1 and u_2 are the weight coefficients of words in the title and body respectively, and l is the total number of words in the i-th document.
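The following sketch computes the improved weight for a single word under the reconstructed formula above; the concrete coefficients (u1, u2 and the POS_WEIGHT table for λ) are illustrative assumptions — the text only fixes the ordering noun > verb > other.

```python
import math

# Illustrative part-of-speech weights (lambda_j); assumed values.
POS_WEIGHT = {"n": 1.0, "v": 0.8, "other": 0.5}

def improved_tfidf(tf_title, tf_body, doc_len, df, n_docs, pos,
                   u1=0.7, u2=0.3):
    """Position- and POS-weighted term frequency times the IDF factor.
    tf_title, tf_body: frequency of the word in the title/body of
    document i; doc_len: total number of words l in document i;
    df: number of documents containing the word (n_j);
    n_docs: total number of documents N; pos: POS tag of the word."""
    lam = POS_WEIGHT.get(pos, POS_WEIGHT["other"])
    tf = lam * (u1 * tf_title + u2 * tf_body) / doc_len
    return tf * math.log(n_docs / df)  # normalization factor omitted
```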
The knowledge reduction of full-covering granular computing first granulates the text: each feature word is mapped to the set of documents in which it appears, as shown in Table 1 below.
Table 1. Text granulation relation (each row pairs a feature word t_i with the set of documents containing it, e.g. t_1 = {d_1, d_2, ...})
The basic definitions of the full-covering granular computing model are as follows:
Let C be a covering of the non-empty universe U and let P = {C_j : j = 1, ..., N} be a full covering. The center of the grain G_x, the center of the covering C, and the granular entropy of the covering are defined respectively as:
center_C(x) = ∩{ N_C(x) | x ∈ N_C(x), N_C(x) ∈ G_x },  Center(C) = { center_C(x) | x ∈ U }
I(C) = (1/|U|) Σ_{x∈U} ( 1 − |center_C(x)| / |U| )
The core of C is defined as the family of indispensable blocks:
core(C) = { C_j ∈ C | I(C − {C_j}) ≠ I(C), or C − {C_j} is no longer a covering of U }
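A runnable sketch of these definitions, under the assumption (used only so that the entropy stays defined for sub-families during reduction) that an element not covered by any block gets the whole universe as its center; the final assertions reproduce the worked example given in the embodiment below.

```python
def center(x, blocks, universe):
    """center_C(x): intersection of all blocks that contain x."""
    hits = [b for b in blocks if x in b]
    return frozenset.intersection(*hits) if hits else frozenset(universe)

def granular_entropy(blocks, universe):
    """I(C) = (1/|U|) * sum over x of (1 - |center_C(x)| / |U|)."""
    n = len(universe)
    return sum(1 - len(center(x, blocks, universe)) / n
               for x in universe) / n

def core(blocks, universe):
    """Blocks whose removal changes the granular entropy."""
    base = granular_entropy(blocks, universe)
    return [b for b in blocks
            if abs(granular_entropy([c for c in blocks if c is not b],
                                    universe) - base) > 1e-12]

# Worked example from the embodiment: U = {x1..x5}, C = {C1..C6}.
U = {"x1", "x2", "x3", "x4", "x5"}
C = [frozenset(s) for s in ({"x1"}, {"x2", "x3"}, {"x3", "x4"},
                            {"x3"}, {"x5"}, {"x1", "x5"})]
assert round(granular_entropy(C, U), 2) == 0.72
assert core(C, U) == [C[0], C[1], C[2], C[4]]  # {C1, C2, C3, C5}
```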
The knowledge reduction of full-covering granular computing is applied to the text as follows:
Step 1: Compute the center Center(D) of the feature word set D and its granular entropy I(D).
Step 2: Initialize the reduced feature word set core(D) = φ. For each document set D_i ∈ D, compute its importance Sig(D_i) in the feature word set D; if Sig(D_i) > 0, let core(D) = core(D) ∪ {D_i}.
Step 3: Check whether I(core(D)) = I(D). If it holds, stop: core(D) is the minimal reduction of the feature word set D. Otherwise, if I(core(D)) < I(D), go to Step 4.
Step 4: Let P = core(D).
Step 5: For each document set D_t ∈ D − P, compute its importance Sig_P(D_t) relative to the feature word set D, find the D_t with the largest Sig_P(D_t), and add it to P: P = P ∪ {D_t}.
Step 6: Check whether I(P) = I(D). If it holds, stop: P is a reduction of the feature word set D. Otherwise return to Step 5.
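A sketch of this reduction loop, reusing granular_entropy and core from the sketch above; the greedy tie-breaking and the floating-point tolerance are assumptions.

```python
def knowledge_reduction(blocks, universe):
    """Greedy knowledge reduction: start from the core, then repeatedly
    add the remaining block whose addition raises the granular entropy
    the most (its relative importance Sig_P), until I(P) = I(D)."""
    target = granular_entropy(blocks, universe)
    P = core(blocks, universe)
    rest = [b for b in blocks if b not in P]
    while rest and abs(granular_entropy(P, universe) - target) > 1e-12:
        best = max(rest, key=lambda b: granular_entropy(P + [b], universe))
        rest.remove(best)
        P.append(best)
    return P  # on the worked example this returns {C1, C2, C3, C5}
```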
In the bLDA topic model, the Gibbs sampling update for the topic assignment z_i of the i-th word draws the background topic (k = 0) or a regular topic (k = 1, ..., K) according to:
p(z_i = 0 | z_¬i, w) ∝ λ · (n_0t + β_t) / (Σ_t n_0t + Σ_t β_t)
p(z_i = k | z_¬i, w) ∝ (1 − λ) · (n_kt + β_t) / (Σ_t n_kt + Σ_t β_t) · (n_mk + α_k) / (Σ_k n_mk + Σ_k α_k),  k = 1, ..., K
where z_i is the topic variable of the i-th feature word; ¬i indicates that the i-th word is excluded from the counts; n_mt is the frequency of word t in document m; n_kt is the number of times word t is assigned to topic k (k ≠ 0) and n_0t the corresponding count for the background topic; n_mk is the number of words in document m assigned to topic k; K is the number of topics; V is the total number of words in the document set; λ is the prior probability of the background topic; β_t is the Dirichlet prior of word t; and α_k is the Dirichlet prior of topic k.
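The exact bLDA conditional cannot be fully recovered from the text, so the following sketch follows a common background-topic LDA formulation consistent with the variables listed above, with symmetric priors; treat it as an assumption rather than the patented formula.

```python
import numpy as np

def topic_conditional(t, m, n_kt, n_mk, n_0t, alpha, beta, lam):
    """p(z_i = k | z_{not i}, w) for a token of word t in document m,
    with k = 0 the background topic; all count arrays are assumed to
    already exclude the i-th token.
    n_kt: K x V word-topic counts; n_mk: M x K document-topic counts;
    n_0t: length-V background-topic word counts."""
    K, V = n_kt.shape
    p = np.empty(K + 1)
    # Background topic, chosen with prior probability lam.
    p[0] = lam * (n_0t[t] + beta) / (n_0t.sum() + V * beta)
    # Regular topics: smoothed word likelihood times topic proportion.
    word = (n_kt[:, t] + beta) / (n_kt.sum(axis=1) + V * beta)
    doc = (n_mk[m] + alpha) / (n_mk[m].sum() + K * alpha)
    p[1:] = (1 - lam) * word * doc
    return p / p.sum()
```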
Description of the drawings
Fig. 1 is a flow chart of the invention.
Specific embodiment
To make the purpose, technical scheme and advantages of the present invention clearer, the invention is described in further detail below with an actual case.
A number of news articles from multiple different fields are obtained from Sohu News with a web crawler; the articles are analyzed and collated, duplicate news and non-textual symbols are removed, and the result serves as the sample set.
In order to choose a representative feature word set from the text, the title and body of the sample set are segmented separately, stop words are removed, and part-of-speech tagging is performed.
The improved TFIDF method is used when computing the probabilities of the feature words; words at different positions and with different parts of speech are assigned different weight coefficients. For example, a news article can be represented as d_i = {t_i | t_i1, t_i2, t_i3, t_i4, ..., t_im}, where t_i denotes the set of words of the article, t_i1, t_i2, t_i3 are words in the title and the rest are words in the body. If t_i1 and t_i3 are nouns, t_i2 is a verb, t_i4 is a noun, t_i5 is a verb, and t_i6 is a word of another part of speech, then the weight ordering is t_i1, t_i3 > t_i2 > t_i4 > t_i5 > t_i6.
The result of the improved TFIDF computation can be expressed as a two-dimensional matrix whose rows index documents and whose columns index feature words. For example, in the matrix
W = ( 0.112  0  ... ; 0.108  ...  ... )
the entry 0.112 is the probability of word t11 in the first document, a 0 means that the word does not occur in that document (so t12 does not occur in the first document), and the probability of t11 in the second document is 0.108. Every entry of the matrix greater than 0 is set to 1, entries equal to 0 are left unchanged, and the matrix is then transposed; the example above becomes
W' = ( 1  1  ... ; 0  ...  ... )
whose rows now index feature words and whose columns index documents.
The above can be written as t_1 = {d_1, d_2, ...}, t_2 = {d_2, ...}, ..., t_V = {..., d_N}, which corresponds to the concepts of the full-covering granular computing model.
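A short numpy sketch of this granulation step, using the example values above (the entries shown as "..." in the text are filled with illustrative assumptions):

```python
import numpy as np

W = np.array([[0.112, 0.00],   # document 1: t11 present, t12 absent
              [0.108, 0.05]])  # document 2 (0.05 is an assumed value)
B = (W > 0).astype(int).T      # set entries > 0 to 1, then transpose
# Each row of B now marks the documents containing one feature word,
# i.e. the document set t_i = {d_j : word i occurs in document j}.
print(B)                       # [[1 1]
                               #  [0 1]]
```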
Taking the knowledge reduction of full-covering granular computing as an example, the reduction process is described in detail. Let the universe U = {x1, x2, x3, x4, x5} and the full covering C = {C1, C2, C3, C4, C5, C6}, where C1 = {x1}, C2 = {x2, x3}, C3 = {x3, x4}, C4 = {x3}, C5 = {x5}, C6 = {x1, x5}.
(1) The neighborhoods of the elements are N_C(x1) = C1 and C6, N_C(x2) = C2, N_C(x3) = C2, C3 and C4, N_C(x4) = C3, N_C(x5) = C5 and C6; the neighborhood systems are NS_C(x1) = {C1, C6} = {{x1}, {x1, x5}}, NS_C(x2) = {C2} = {{x2, x3}}, NS_C(x3) = {C2, C3, C4} = {{x2, x3}, {x3, x4}, {x3}}, NS_C(x4) = {C3} = {{x3, x4}}, NS_C(x5) = {C5, C6} = {{x5}, {x1, x5}}.
(2) The grains of U with their centers follow from the neighborhood systems: center_C(x1) = {x1}, center_C(x2) = {x2, x3}, center_C(x3) = {x3}, center_C(x4) = {x3, x4}, center_C(x5) = {x5}.
(3) The center of the covering C is Center(C) = {{x1}, {x2, x3}, {x3}, {x3, x4}, {x5}}, and the granular entropy is I(C) = (1/5) · [(1 − 1/5) + (1 − 2/5) + (1 − 1/5) + (1 − 2/5) + (1 − 1/5)] = 0.72.
(4) The importances of the basic grains in the covering C: C2 and C3 are indispensable (removing either leaves U uncovered), Sig(C1) = Sig(C5) = 0.72 − 0.68 = 0.04, and Sig(C4) = Sig(C6) = 0.
(5) The core is core(C) = {C1, C2, C3, C5}, with I(core(C)) = 0.72 = I(C).
core(C) is therefore the minimal reduction of the covering C; the reduction removes the absolutely redundant block C6 and the relatively redundant block C4.
The feature words obtained after reduction are treated in exactly the same way: they form the selected feature word set, and the probabilities of this set computed by the improved TFIDF method can be read off directly.
The bLDA topic model computes the "document-topic" and "topic-word" probabilities, from which the "document-word" probabilities are obtained. The probabilities of the feature words after reduction are selected, the feature word probabilities computed by the two methods are linearly weighted and normalized, and the resulting reduced "document-word" probability matrix is clustered, which verifies the effectiveness and feasibility of the method.
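A sketch of this final combination and clustering step; the combination weight alpha and the number of clusters are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def combine_and_cluster(p_tfidf, p_blda, alpha=0.5, n_clusters=5):
    """Linearly weight the two reduced 'document-word' probability
    matrices (rows: documents, columns: reduced feature words),
    normalize each row, then cluster the documents with k-means."""
    P = alpha * p_tfidf + (1 - alpha) * p_blda
    P = P / P.sum(axis=1, keepdims=True)  # row normalization
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(P)
```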

Claims (7)

1. A text feature selection method based on full-covering granular computing, characterized in that it comprises the following steps:
(1): obtaining a news sample set of different categories and preprocessing it, the preprocessing comprising word segmentation, stop-word removal and part-of-speech tagging;
(2): computing the "document-word frequency" probabilities of the feature words with the improved TFIDF method to obtain the "document-word frequency" matrix w, and then reducing the feature words with the knowledge-reduction algorithm of full-covering granular computing;
(3): computing the "document-word frequency" probabilities of the feature words with the bLDA topic model, combining them with the term weights computed by the TFIDF algorithm after reduction, obtaining the final feature word weights and performing clustering.
2. The text feature selection method based on full-covering granular computing of claim 1, characterized in that preprocessing the news sample set comprises segmenting the title and the body of each news text separately.
3. The text feature selection method based on full-covering granular computing of claim 1, characterized in that the formula of the improved TFIDF algorithm is as follows:
w_{i,j} = tf_{i,j} × log(N / n_j), where
tf_{i,j} = λ_j × ( u_1 · t_j^title + u_2 · t_j^body ) / l
where λ_j is the part-of-speech weight coefficient of word j (different values of λ give the weight coefficients of nouns, verbs and other words), t_j^title and t_j^body are the frequencies of word j in the title and body of the i-th document, u_1 and u_2 are the weight coefficients of words in the title and body respectively, and l is the total number of words in the i-th document.
4. The text feature selection method based on full-covering granular computing of claim 1, characterized in that the formula of the TFIDF algorithm is as follows:
w_{t,j} = ( tf_{t,j} × log(N / n_t) ) / sqrt( Σ_t ( tf_{t,j} × log(N / n_t) )² )
where tf_{t,j} is the frequency of word t in document j, N is the total number of documents, n_t is the number of documents containing word t, and the denominator is a normalization factor.
5. The text feature selection method based on full-covering granular computing of claim 1, characterized in that: when the "document-word frequency" probability p is greater than 0, the corresponding entry of the matrix w is set to 1, and when the "document-word frequency" probability p equals 0, the entry of the matrix w is set to 0, realizing the granulation of the documents.
6. The text feature selection method based on full-covering granular computing of any one of claims 1 to 5, characterized in that the full-covering granular computing model is as follows:
Let C be a covering of the non-empty universe U and let P = {C_j : j = 1, ..., n} be a full covering. The center of the grain G_x, the center of the covering C, and the granular entropy of the covering are defined respectively as:
center_C(x) = ∩{ N_C(x) | x ∈ N_C(x), N_C(x) ∈ G_x },  Center(C) = { center_C(x) | x ∈ U }
I(C) = (1/|U|) Σ_{x∈U} ( 1 − |center_C(x)| / |U| )
The core of C is defined as:
core(C) = { C_j ∈ C | I(C − {C_j}) ≠ I(C), or C − {C_j} is no longer a covering of U }
7. The text feature selection method based on full-covering granular computing of claim 4, characterized in that the feature reduction based on full-covering granular computing proceeds as follows:
(1): compute the center Center(D) of the feature word set D and its granular entropy I(D);
(2): initialize the reduced feature word set core(D) = φ; for each document set D_i ∈ D, compute its importance Sig(D_i) in the feature word set D, and if Sig(D_i) > 0, let core(D) = core(D) ∪ {D_i};
(3): check whether I(core(D)) = I(D); if it holds, stop: core(D) is the minimal reduction of the feature word set D; otherwise, if I(core(D)) < I(D), go to step (4);
(4): let P = core(D);
(5): for each document set D_t ∈ D − P, compute its importance Sig_P(D_t) relative to the feature word set D, find the D_t with the largest Sig_P(D_t), and add it to P: P = P ∪ {D_t};
(6): check whether I(P) = I(D); if it holds, stop: P is a reduction of the feature word set D; otherwise return to step (5).
CN201810641512.4A 2018-06-21 2018-06-21 Text feature selection method based on full-covering granular computing Pending CN109165290A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810641512.4A CN109165290A (en) 2018-06-21 2018-06-21 Text feature selection method based on full-covering granular computing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810641512.4A CN109165290A (en) 2018-06-21 2018-06-21 Text feature selection method based on full-covering granular computing

Publications (1)

Publication Number Publication Date
CN109165290A true CN109165290A (en) 2019-01-08

Family

ID=64897201

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810641512.4A Pending CN109165290A (en) Text feature selection method based on full-covering granular computing

Country Status (1)

Country Link
CN (1) CN109165290A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598192A (en) * 2019-06-28 2019-12-20 太原理工大学 Text feature reduction method based on neighborhood rough set
CN112052666A (en) * 2020-08-09 2020-12-08 中信银行股份有限公司 Expert determination method, device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20150103509A (en) * 2014-03-03 2015-09-11 고려대학교 산학협력단 Method for analyzing patent documents using a latent dirichlet allocation
CN107391660A (en) * 2017-07-18 2017-11-24 太原理工大学 A kind of induction division methods for sub-topic division
CN107908624A (en) * 2017-12-12 2018-04-13 太原理工大学 A kind of K medoids Text Clustering Methods based on all standing Granule Computing

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20150103509A (en) * 2014-03-03 2015-09-11 고려대학교 산학협력단 Method for analyzing patent documents using a latent dirichlet allocation
CN107391660A (en) * 2017-07-18 2017-11-24 太原理工大学 A kind of induction division methods for sub-topic division
CN107908624A (en) * 2017-12-12 2018-04-13 太原理工大学 A kind of K medoids Text Clustering Methods based on all standing Granule Computing

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LI Xiangdong et al., "A Text Feature Selection Method Based on a Weighted LDA Model and Multi-Granularity", New Technology of Library and Information Service *
LI Jingyue et al., "An Improved TFIDF Method for Extracting Keywords from Web Pages", Computer Applications and Software *
XU Huifang, "Research on Text Representation and Feature Extraction Based on the Full-Covering Granular Computing Model", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598192A (en) * 2019-06-28 2019-12-20 太原理工大学 Text feature reduction method based on neighborhood rough set
CN112052666A (en) * 2020-08-09 2020-12-08 中信银行股份有限公司 Expert determination method, device and storage medium
CN112052666B (en) * 2020-08-09 2024-05-17 中信银行股份有限公司 Expert determination method, device and storage medium

Similar Documents

Publication Publication Date Title
Abbas et al. Multinomial Naive Bayes classification model for sentiment analysis
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN111104794A (en) Text similarity matching method based on subject words
CN110162750B (en) Text similarity detection method, electronic device and computer readable storage medium
CN105426426B (en) A kind of KNN file classification methods based on improved K-Medoids
CN109960756B (en) News event information induction method
Yi et al. Topic modeling for short texts via word embedding and document correlation
CN102929861B (en) Method and system for calculating text emotion index
Zhang et al. Clustering sentences with density peaks for multi-document summarization
CN111090731A (en) Electric power public opinion abstract extraction optimization method and system based on topic clustering
CN110209808A (en) A kind of event generation method and relevant apparatus based on text information
CN105183833A (en) User model based microblogging text recommendation method and recommendation apparatus thereof
CN108875065B (en) Indonesia news webpage recommendation method based on content
CN110188349A (en) A kind of automation writing method based on extraction-type multiple file summarization method
CN110134777B (en) Question duplication eliminating method and device, electronic equipment and computer readable storage medium
CN103761264A (en) Concept hierarchy establishing method based on product review document set
CN104199845B (en) Line Evaluation based on agent model discusses sensibility classification method
CN110020312B (en) Method and device for extracting webpage text
Phu et al. A valence-totaling model for Vietnamese sentiment classification
Ayral et al. An automated domain specific stop word generation method for natural language text classification
CN115017303A (en) Method, computing device and medium for enterprise risk assessment based on news text
Lin et al. A simple but effective method for Indonesian automatic text summarisation
Gong et al. Few-shot learning for named entity recognition based on BERT and two-level model fusion
CN109165290A (en) A kind of text feature selection method based on all standing Granule Computing
CN108694176B (en) Document emotion analysis method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190108