CN109165290A - A text feature selection method based on full-coverage granular computing - Google Patents
A text feature selection method based on full-coverage granular computing Download PDF Info
- Publication number
- CN109165290A (Application CN201810641512.4A)
- Authority
- CN
- China
- Prior art keywords
- word
- feature
- standing
- text
- document
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A text feature selection method based on full-coverage granular computing, comprising: 1) segmenting the sample text set into words, removing stop words, and performing part-of-speech tagging; 2) extending the TFIDF algorithm with position and part-of-speech factors, each carrying a different weight coefficient, to calculate the "document-word frequency" probability of each feature word; 3) generating feature-word probabilities with the bLDA topic model to capture the semantic information of the feature words; 4) granulating the text over the feature words and reducing the feature-word set with the knowledge-reduction algorithm of full-coverage granular computing, yielding the "document-word frequency" probabilities of the reduced feature-word set; 5) combining the term weights computed by bLDA and by the improved TFIDF algorithm to obtain the "document-word frequency" probabilities of the reduced feature-word set. The present invention takes the part of speech, position, and semantics of feature words into account while removing feature words with weak semantic expressiveness, thereby selecting a more representative feature-word set and improving clustering precision.
Description
Technical field
The invention belongs to the intersection of text mining and full-coverage granular computing, and specifically relates to text feature selection and to the application of the knowledge reduction of the full-coverage granular computing model to text feature selection.
Background technique
Text clustering is an important topic in pattern recognition, machine learning, and data mining. It groups a set of text objects into classes of similar objects, thereby clustering unknown text data. At present, text information is mainly represented with the vector space model, but this model suffers from the high dimensionality of the feature space and from data sparsity. A high-dimensional feature space not only increases the time and space complexity of computation but also, because it contains many invalid and redundant features, greatly reduces the quality of text clustering. An effective feature selection method is therefore essential for text clustering: it reduces the dimensionality of the feature vectors and removes redundant features while retaining the features with the strongest class-discrimination ability and semantic expressiveness, improving both the quality and the robustness of clustering.
For the text feature selection problem, experts and scholars have proposed a series of solutions, but in solving this critical problem these methods still leave some issues open, mainly:
1) Many scholars use information gain (IG), mutual information (MI), chi-square statistics (CHI), and similar methods. These statistics-based methods can select effective features to a certain extent, but they ignore the semantic information of the text.
2) Some scholars perform feature selection with the LDA topic model, which captures the semantic information of the text, but this approach ignores word frequency, word position, and part of speech, and therefore does not match how text is actually expressed.
Therefore, the present invention specifically addresses the word frequency, position, part of speech, and semantics of text feature words, retaining, during dimensionality reduction and without changing the text representation, the feature words with the strongest class-discrimination ability and semantic expressiveness.
Summary of the invention
To remedy the poor accuracy and weak semantic expressiveness of existing feature selection methods, the invention proposes a text feature selection method based on full-coverage granular computing.
A text feature selection method based on full-coverage granular computing comprises the following steps:
Step 1: obtain news sample sets of different categories and preprocess the title and body of each news text separately; the preprocessing includes word segmentation, stop-word removal, and part-of-speech tagging;
Step 2: extend the TFIDF method into an improved TFIDF method, calculate the "document-word frequency" probability of each feature word with the improved TFIDF method, and then reduce the feature words with the knowledge-reduction algorithm of full-coverage granular computing;
Step 3: calculate the "document-word frequency" probability of the feature words with the bLDA topic model, combine it with the term weights computed by the TFIDF algorithm after reduction, obtain the final feature-word weights, and perform clustering.
The TFIDF method is given by the following formula:
where t_j denotes the frequency of word t in document m, N denotes the total number of documents, n_j denotes the number of documents containing word t, and the denominator is a normalization factor.
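The formula itself is rendered as an image in the original and is not reproduced here. A minimal sketch of the computation the variable definitions describe, assuming the natural logarithm and L2 normalization as the "normalization factor" (both are assumptions):

```python
import math

def tfidf_weights(doc_tfs, doc_dfs, n_docs):
    """TF-IDF weights for one document, L2-normalized.

    doc_tfs -- word frequencies t_j of each word in the document
    doc_dfs -- document frequencies n_j (number of documents containing each word)
    n_docs  -- total number of documents N
    """
    raw = [tf * math.log(n_docs / df) for tf, df in zip(doc_tfs, doc_dfs)]
    norm = math.sqrt(sum(w * w for w in raw)) or 1.0  # the normalization factor
    return [w / norm for w in raw]
```

A frequent word that appears in few documents thus receives a larger weight than an equally frequent word spread over many documents.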
The improved TFIDF method is given by the following formula:
where tf_{i,j} is given by:
where λ_j denotes the part-of-speech weight coefficient of word j (different values of λ give the weight coefficients of nouns, verbs, and other words respectively), t_k denotes the frequency of word j in the i-th document, u_1 and u_2 denote the weight coefficients of words in the title and body respectively, the two word-frequency terms denote the frequency of word j in the title and in the body respectively, and l denotes the total number of words in the i-th document.
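The improved term frequency combines the part-of-speech coefficient λ_j, the position coefficients u_1 and u_2, and the document length l. A sketch under assumed coefficient values (the names POS_WEIGHT, U_TITLE, and U_BODY and their numeric values are illustrative, not taken from the patent):

```python
import math

# Hypothetical coefficients -- the text defines lambda_j (part of speech)
# and u1/u2 (position) but does not give concrete values here.
POS_WEIGHT = {"noun": 1.2, "verb": 1.0, "other": 0.8}   # lambda_j by POS
U_TITLE, U_BODY = 2.0, 1.0                               # u1, u2

def improved_tf(pos, tf_title, tf_body, doc_len):
    """tf_{i,j}: POS- and position-weighted term frequency of word j in
    document i: lambda_j * (u1 * title_tf + u2 * body_tf) / l."""
    lam = POS_WEIGHT.get(pos, POS_WEIGHT["other"])
    return lam * (U_TITLE * tf_title + U_BODY * tf_body) / doc_len

def improved_tfidf(pos, tf_title, tf_body, doc_len, n_docs, n_j):
    """Improved TFIDF: the weighted tf times the usual idf term log(N/n_j)."""
    return improved_tf(pos, tf_title, tf_body, doc_len) * math.log(n_docs / n_j)
```

Under these coefficients a noun in the title outweighs an equally frequent word of another part of speech in the body, matching the weight ordering described in the embodiment.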
The knowledge reduction of full-coverage granular computing first granulates the text, as shown in Table 1 below:
Table 1: Text granulation relation table
The basic definitions of the full-coverage granular computing model are as follows:
Let C be a full covering of the non-empty universe U, with P = {C_j : j = 1, ..., n}. The center of the granule G_x, the center of the full covering C, and the granularity entropy of P are defined respectively as:
center_C(x) = ∩{ N_C(x) | x ∈ N_C(x), N_C(x) ∈ G_x },  center(C) = { center_C(x) | x ∈ U }
The core of C is defined as:
The knowledge reduction of full-coverage granular computing applied to text proceeds as follows:
Step 1: compute the center center(D) of the feature-word set D and its granularity entropy I(D).
Step 2: let the reduced feature-word set core(D) = φ; for each document set D_i ∈ D, compute its importance within the feature-word set D; if the importance is non-zero, then core(D) = core(D) ∪ {D_i}.
Step 3: check whether I(core(D)) = I(D); if so, stop: core(D) is the minimal reduction of the feature-word set D; otherwise, if I(core(D)) < I(D), execute Step 4.
Step 4: let P = core(D).
Step 5: for each document set D_t ∈ D − P, compute its relative importance Sig_P(D_t) with respect to the feature-word set D; find the D_t with the greatest relative importance and add it to P: P = P ∪ {D_t}.
Step 6: check whether I(P) = I(D); if so, stop: P is a reduction of the feature-word set D; otherwise return to Step 5.
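The six steps above can be sketched over the granulated representation, in which each feature word is the set of documents containing it. The entropy formula I(C) = 1 − Σ_x |center_C(x)| / |U|² is inferred from the worked example later in the description (it reproduces the stated value 0.72), and the importance test is taken to be "removal changes the entropy"; both are assumptions, not the patent's literal formulas:

```python
def centers(cover, universe):
    """center_C(x): intersection of all blocks of the covering containing x
    (convention: the whole universe when no block contains x)."""
    cent = {}
    for x in universe:
        c = set(universe)
        for block in cover:
            if x in block:
                c &= block
        cent[x] = frozenset(c)
    return cent

def entropy(cover, universe):
    """Inferred granularity entropy I(C) = 1 - sum_x |center_C(x)| / |U|^2."""
    n = len(universe)
    return 1 - sum(len(c) for c in centers(cover, universe).values()) / (n * n)

def reduce_cover(cover, universe):
    """Steps 1-6: start from the core (blocks whose removal changes the
    entropy), then greedily add the most important remaining block until
    the entropy of the full covering is recovered."""
    target = entropy(cover, universe)
    core = [b for b in cover
            if abs(entropy([c for c in cover if c is not b], universe) - target) > 1e-12]
    p = list(core)
    rest = [b for b in cover if b not in p]
    while abs(entropy(p, universe) - target) > 1e-12 and rest:
        # pick the candidate that brings I(P) closest to I(D)
        best = min(rest, key=lambda b: abs(entropy(p + [b], universe) - target))
        p.append(best)
        rest.remove(best)
    return p
```

On the worked example given later (U = {x1..x5}, C1..C6) this sketch recovers the stated reduction {C1, C2, C3, C5}.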
In the bLDA topic model, Gibbs sampling uses the following formula:
where z_i denotes the topic variable of the i-th feature word, ¬i means that the i-th token is excluded from the counts, n_{m,t} denotes the frequency of word t in document m, n_{k,t} denotes the number of times word t is assigned to topic k (k ≠ 0), n_{m,k} denotes the number of times topic k (k = 0) is assigned in document m, K denotes the number of topics, V denotes the total number of distinct words in the document set, λ denotes the prior probability of the background topic, β_t denotes the Dirichlet prior of word t, and α_k denotes the Dirichlet prior of topic k.
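The sampling step can be sketched with the standard collapsed-Gibbs conditional for LDA; the bLDA background topic (k = 0) with prior λ is not modeled here, so this is an approximation of the formula the text describes, not the patent's exact expression:

```python
def lda_conditional(k, n_mk, n_kt, n_k, alpha, beta, V):
    """Unnormalized p(z_i = k | z_-i, w) for a regular topic.
    All counts exclude the current token i (the '¬i' in the text):
    n_mk[k] -- topic-k assignments in the current document
    n_kt[k] -- assignments of the current word to topic k
    n_k[k]  -- total assignments to topic k; V is the vocabulary size."""
    return (n_mk[k] + alpha) * (n_kt[k] + beta) / (n_k[k] + V * beta)

def sample_probs(n_mk, n_kt, n_k, alpha, beta, V, K):
    """Normalized topic distribution for one token over the K regular topics
    (0-based indices here)."""
    weights = [lda_conditional(k, n_mk, n_kt, n_k, alpha, beta, V) for k in range(K)]
    total = sum(weights)
    return [w / total for w in weights]
```

With symmetric counts the conditional is uniform over topics, as expected of the formula's structure.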
Detailed description of the invention
Fig. 1 is the flow chart of the invention.
Specific embodiments
To make the purposes, technical schemes, and advantages of the present invention clearer, the invention is described in further detail below with an actual case.
A number of news articles from several different fields are obtained from Sohu News with a web crawler; the articles are analyzed and sorted, duplicate news items and non-textual symbols are removed, and the result serves as the sample set.
To choose a representative feature-word set from the text, the title and body of each sample are separately segmented into words, stop words are removed, and part-of-speech tagging is performed.
The improved TFIDF method is used when calculating feature-word probabilities: words at different positions and with different parts of speech are assigned different weight coefficients. For example, a news item can be represented as d_i = {t_i | t_i1, t_i2, t_i3, t_i4, ..., t_im}, where t_i denotes the set of words of that item, t_i1, t_i2, t_i3 denote words in the title, and the rest denote words in the body. If t_i1 and t_i3 are nouns, t_i2 is a verb, t_i4 is a noun, t_i5 is a verb, and t_i6 is a word of another part of speech, then the weight ordering is t_i1, t_i3 > t_i2 > t_i4 > t_i5 > t_i6.
The result of the improved TFIDF calculation can be expressed as a two-dimensional matrix in which rows denote document numbers and columns denote feature words. For example, in such a matrix the entry 0.112 denotes the probability of word t_11 in the first document; a 0 means the word does not occur in that document, so t_12 does not occur in the first document; and the probability of t_11 in the second document is 0.108. Every value greater than 0 in this two-dimensional matrix is set to 1, values equal to 0 are left unchanged, and the matrix is then transposed; after this transformation rows denote feature words and columns denote document numbers.
The above can be written as t_1 = {d_1, d_2, ...}, t_2 = {d_2, ...}, ..., t_V = {..., d_N}, which corresponds to the concepts of the full-coverage granular computing model.
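The binarization and transposition described above can be sketched as follows (the sample matrix is illustrative, not the matrix from the original figure):

```python
def granulate(doc_term):
    """Set every probability p > 0 to 1, leave values equal to 0 unchanged,
    then transpose: rows become feature words, columns become documents."""
    binary = [[1 if p > 0 else 0 for p in row] for row in doc_term]
    return [list(col) for col in zip(*binary)]
```

Each row of the result is then exactly the document set of one feature word, i.e. the granule t_i = {d_j : word i occurs in document j}.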
The reduction process is illustrated in detail with the knowledge reduction of full-coverage granular computing. Let the universe U = {x1, x2, x3, x4, x5} and the full covering C = {C1, C2, C3, C4, C5, C6}, where C1 = {x1}, C2 = {x2, x3}, C3 = {x3, x4}, C4 = {x3}, C5 = {x5}, C6 = {x1, x5}.
(1) The neighborhoods of the elements are N_C(x1) = C1 and C6, N_C(x2) = C2, N_C(x3) = C2, C3 and C4, N_C(x4) = C3, N_C(x5) = C5 and C6; the neighborhood systems are NS_C(x1) = {C1, C6} = {{x1}, {x1, x5}}, NS_C(x2) = {C2} = {{x2, x3}}, NS_C(x3) = {C2, C3, C4} = {{x2, x3}, {x3, x4}, {x3}}, NS_C(x4) = {C3} = {{x3, x4}}, NS_C(x5) = {C5, C6} = {{x5}, {x1, x5}};
(2) the granules of U with their centers are formed accordingly;
(3) the center of the full covering C is center(C) = {{x1}, {x2, x3}, {x3}, {x3, x4}, {x5}}, and the granularity entropy is I(C) = 0.72;
(4) the importance of each basic granule in the full covering C is computed;
(5) the core is core(C) = {C1, C2, C3, C5}, with I(core(C)) = 0.72 = I(C).
core(C) is the minimal reduction of the full covering C; the reduction removes the absolutely redundant C6 and the relatively redundant C4.
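The numbers in this example can be checked mechanically. The granularity entropy formula used below, I(C) = 1 − Σ_x |center_C(x)| / |U|², is inferred from the stated value 0.72 rather than quoted from the patent:

```python
U = {1, 2, 3, 4, 5}
cover = [{1}, {2, 3}, {3, 4}, {3}, {5}, {1, 5}]  # C1 .. C6

# center_C(x): intersection of the blocks containing x
center = {x: set.intersection(*[b for b in cover if x in b]) for x in U}
assert center == {1: {1}, 2: {2, 3}, 3: {3}, 4: {3, 4}, 5: {5}}

# granularity entropy, inferred as I(C) = 1 - sum|center| / |U|^2
entropy = 1 - sum(len(c) for c in center.values()) / len(U) ** 2
```

The center sets match item (3) of the example, and the entropy evaluates to 1 − 7/25 = 0.72, the value stated in the text.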
The feature words obtained after reduction constitute the selected feature-word set, whose probabilities under the improved TFIDF method can be read off directly.
The bLDA topic model computes the "document-topic" and "topic-word" probabilities, from which the "document-word" probability is obtained; the probabilities of the feature words after reduction are filtered out, the feature-word probabilities computed by the two methods are linearly weighted and then normalized, and the resulting "document-word" probability matrix after reduction is finally clustered, verifying the effectiveness and feasibility of the method.
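The final combination step, a linear weighting of the two "document-word" probability estimates followed by normalization, can be sketched as follows (the blend factor 0.5 is an assumed value; the text does not give the weighting coefficients):

```python
def combine(p_tfidf, p_blda, w=0.5):
    """Linearly weight the two 'document-word' probability matrices row by
    row, then normalize each row so it sums to 1."""
    out = []
    for a, b in zip(p_tfidf, p_blda):
        raw = [w * x + (1 - w) * y for x, y in zip(a, b)]
        s = sum(raw) or 1.0
        out.append([v / s for v in raw])
    return out
```

The normalized rows can then be fed to any clustering algorithm (the embodiment clusters them to evaluate the selected feature-word set).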
Claims (7)
1. A text feature selection method based on full-coverage granular computing, characterized in that it comprises the following steps:
(1): obtaining news sample sets of different categories and preprocessing them, the preprocessing including word segmentation, stop-word removal, and part-of-speech tagging;
(2): calculating the "document-word frequency" probability of the feature words with an improved TFIDF method to obtain a "document-word frequency" matrix w, and then reducing the feature words with the knowledge-reduction algorithm of full-coverage granular computing;
(3): calculating the "document-word frequency" probability of the feature words with the bLDA topic model, combining it with the term weights computed by the TFIDF algorithm after reduction, obtaining the final feature-word weights, and performing clustering.
2. The text feature selection method based on full-coverage granular computing according to claim 1, characterized in that the preprocessing of the news sample set segments the title and the body of each news text separately.
3. The text feature selection method based on full-coverage granular computing according to claim 1, characterized in that the formula of the improved TFIDF algorithm is as follows:
where λ_j denotes the part-of-speech weight coefficient of word j (different values of λ give the weight coefficients of nouns, verbs, and other words respectively), t_k denotes the frequency of word j in the i-th document, u_1 and u_2 denote the weight coefficients of words in the title and body respectively, the two word-frequency terms denote the frequency of word j in the title and in the body respectively, and l denotes the total number of words in the i-th document.
4. The text feature selection method based on full-coverage granular computing according to claim 1, characterized in that the formula of the TFIDF algorithm is as follows:
where t_j denotes the frequency of word t in document m, N denotes the total number of documents, n_j denotes the number of documents containing word t, and the denominator is a normalization factor.
5. The text feature selection method based on full-coverage granular computing according to claim 1, characterized in that: when the "document-word frequency" probability p is greater than 0, the corresponding entry of matrix w is 1, and when p is equal to 0, the entry is 0, thereby granulating the documents.
6. The text feature selection method based on full-coverage granular computing according to any one of claims 1 to 5, characterized in that the full-coverage granular computing model is as follows:
Let C be a full covering of the non-empty universe U, with P = {C_j : j = 1, ..., n}. The center of the granule G_x, the center of the full covering C, and the granularity entropy of P are defined respectively as:
center_C(x) = ∩{ N_C(x) | x ∈ N_C(x), N_C(x) ∈ G_x },  center(C) = { center_C(x) | x ∈ U }
The core of C is defined as:
7. The text feature selection method based on full-coverage granular computing according to claim 4, characterized in that the feature reduction based on full-coverage granular computing proceeds as follows:
(1): compute the center center(D) of the feature-word set D and its granularity entropy I(D).
(2): let the reduced feature-word set core(D) = φ; for each document set D_i ∈ D, compute its importance within the feature-word set D; if the importance is non-zero, then core(D) = core(D) ∪ {D_i}.
(3): check whether I(core(D)) = I(D); if so, stop: core(D) is the minimal reduction of the feature-word set D; otherwise, if I(core(D)) < I(D), execute step (4).
(4): let P = core(D).
(5): for each document set D_t ∈ D − P, compute its relative importance Sig_P(D_t) with respect to the feature-word set D; find the D_t with the greatest relative importance and add it to P: P = P ∪ {D_t}.
(6): check whether I(P) = I(D); if so, stop: P is a reduction of the feature-word set D; otherwise return to step (5).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810641512.4A CN109165290A (en) | 2018-06-21 | 2018-06-21 | A text feature selection method based on full-coverage granular computing |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109165290A true CN109165290A (en) | 2019-01-08 |
Family
ID=64897201
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810641512.4A Pending CN109165290A (en) | 2018-06-21 | 2018-06-21 | A text feature selection method based on full-coverage granular computing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109165290A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20150103509A (en) * | 2014-03-03 | 2015-09-11 | 고려대학교 산학협력단 | Method for analyzing patent documents using a latent dirichlet allocation |
CN107391660A (en) * | 2017-07-18 | 2017-11-24 | 太原理工大学 | A kind of induction division methods for sub-topic division |
CN107908624A (en) * | 2017-12-12 | 2018-04-13 | 太原理工大学 | A kind of K medoids Text Clustering Methods based on all standing Granule Computing |
Non-Patent Citations (3)
Title |
---|
Li Xiangdong et al.: "A Text Feature Selection Method Based on a Weighted LDA Model and Multiple Granularities", New Technology of Library and Information Service *
Li Jingyue et al.: "An Improved TFIDF Method for Extracting Keywords from Web Pages", Computer Applications and Software *
Xu Huifang: "Research on Text Representation and Feature Extraction Based on the Full-Coverage Granular Computing Model", China Master's Theses Full-text Database, Information Science and Technology *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110598192A (en) * | 2019-06-28 | 2019-12-20 | 太原理工大学 | Text feature reduction method based on neighborhood rough set |
CN112052666A (en) * | 2020-08-09 | 2020-12-08 | 中信银行股份有限公司 | Expert determination method, device and storage medium |
CN112052666B (en) * | 2020-08-09 | 2024-05-17 | 中信银行股份有限公司 | Expert determination method, device and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Abbas et al. | Multinomial Naive Bayes classification model for sentiment analysis | |
CN107133213B (en) | Method and system for automatically extracting text abstract based on algorithm | |
CN111104794A (en) | Text similarity matching method based on subject words | |
CN110162750B (en) | Text similarity detection method, electronic device and computer readable storage medium | |
CN105426426B (en) | A kind of KNN file classification methods based on improved K-Medoids | |
CN109960756B (en) | News event information induction method | |
Yi et al. | Topic modeling for short texts via word embedding and document correlation | |
CN102929861B (en) | Method and system for calculating text emotion index | |
Zhang et al. | Clustering sentences with density peaks for multi-document summarization | |
CN111090731A (en) | Electric power public opinion abstract extraction optimization method and system based on topic clustering | |
CN110209808A (en) | A kind of event generation method and relevant apparatus based on text information | |
CN105183833A (en) | User model based microblogging text recommendation method and recommendation apparatus thereof | |
CN108875065B (en) | Indonesia news webpage recommendation method based on content | |
CN110188349A (en) | A kind of automation writing method based on extraction-type multiple file summarization method | |
CN110134777B (en) | Question duplication eliminating method and device, electronic equipment and computer readable storage medium | |
CN103761264A (en) | Concept hierarchy establishing method based on product review document set | |
CN104199845B (en) | Line Evaluation based on agent model discusses sensibility classification method | |
CN110020312B (en) | Method and device for extracting webpage text | |
Phu et al. | A valence-totaling model for Vietnamese sentiment classification | |
Ayral et al. | An automated domain specific stop word generation method for natural language text classification | |
CN115017303A (en) | Method, computing device and medium for enterprise risk assessment based on news text | |
Lin et al. | A simple but effective method for Indonesian automatic text summarisation | |
Gong et al. | Few-shot learning for named entity recognition based on BERT and two-level model fusion | |
CN109165290A (en) | A text feature selection method based on full-coverage granular computing | |
CN108694176B (en) | Document emotion analysis method and device, electronic equipment and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20190108 |