CN109165290A - Text feature selection method based on full-covering granular computing - Google Patents

Text feature selection method based on full-covering granular computing

Info

Publication number
CN109165290A
CN109165290A CN201810641512.4A
Authority
CN
China
Prior art keywords
word
feature
covering
text
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810641512.4A
Other languages
Chinese (zh)
Inventor
谢珺
邹雪君
靳红伟
续欣莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiyuan University of Technology
Original Assignee
Taiyuan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taiyuan University of Technology filed Critical Taiyuan University of Technology
Priority to CN201810641512.4A priority Critical patent/CN109165290A/en
Publication of CN109165290A publication Critical patent/CN109165290A/en
Pending legal-status Critical Current

Links

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A text feature selection method based on full-covering granular computing, comprising: 1) segmenting the sample text set, removing stop words, and performing part-of-speech tagging; 2) extending the TFIDF algorithm with position and part-of-speech factors carrying different weight coefficients, and computing the "document-word frequency" probabilities of the feature words; 3) generating feature word probabilities with the bLDA topic model to capture the semantic information of the feature words; 4) granulating the feature words and reducing them with the knowledge-reduction algorithm of full-covering granular computing to obtain the "document-word frequency" probabilities of the reduced feature word set; 5) combining the term weights computed by bLDA and by the improved TFIDF algorithm to obtain the "document-word frequency" probabilities of the reduced feature word set. The invention takes the part of speech, position and semantics of the feature words into account while removing feature words with weak expressive power, so that a more representative feature word set is selected and clustering precision is improved.

Description

Text feature selection method based on full-covering granular computing
Technical field
The invention belongs to the intersection of text mining and full-covering granular computing, and specifically relates to text feature selection and to the application of the knowledge reduction of the full-covering granular computing model to text feature selection.
Background technique
Text clustering is an important topic in pattern recognition, machine learning and data mining. Its main task is to group a collection of text objects into classes of similar objects, thereby clustering unknown text data. At present, text information is mainly given a structured representation with the vector space model, but this model suffers from the high dimensionality of the feature space and the sparsity of the data. A high-dimensional feature space not only increases the time and space complexity of computation; the many invalid and redundant features it contains also greatly reduce the quality of text clustering. An effective feature selection method is therefore essential in text clustering: it reduces the dimensionality of the feature vectors, removes redundant features, and retains features with strong class-discrimination and expressive power, improving the quality and robustness of clustering.
For the text feature selection problem, experts and scholars have proposed a series of solutions, but with respect to this key issue these methods still have some shortcomings, mainly:
1) Many scholars use methods such as information gain (IG), mutual information (MI) and chi-square statistics (CHI). These statistics-based methods can select effective features to a certain extent, but they ignore the semantic information of the text.
2) Some scholars perform feature selection with the LDA topic model, which captures the semantic information of the text, but the algorithm ignores the word frequency, word position and part of speech, and so does not match how text is actually expressed.
Therefore, the present invention specifically addresses the word frequency, position, part of speech and semantics of text feature words, retaining, during dimensionality reduction and without changing the text representation, the feature words with strong class-discrimination and expressive power.
Summary of the invention
To overcome the poor accuracy and weak expressiveness of existing feature selection methods, the invention proposes a text feature selection method based on full-covering granular computing.
A text feature selection method based on full-covering granular computing comprises the following steps:
Step 1: Obtain a news sample set of different categories and preprocess the title and body of each news text separately; the preprocessing comprises word segmentation, stop-word removal and part-of-speech tagging.
Step 2: Improve the TFIDF method, compute the "document-word frequency" probabilities of the feature words with the improved TFIDF method, and then reduce the feature words with the knowledge-reduction algorithm of full-covering granular computing.
Step 3: Compute the "document-word frequency" probabilities of the feature words with the bLDA topic model, combine them with the term weights computed by the TFIDF algorithm after reduction, obtain the final feature word weights and perform clustering.
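As an illustration of Step 1, here is a minimal preprocessing sketch in Python. It assumes the jieba segmenter; the stop-word file and its one-word-per-line format are illustrative assumptions, not part of the invention.

```python
import jieba.posseg as pseg  # segmentation with part-of-speech tagging

def load_stopwords(path):
    # One stop word per line (assumed file format).
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def preprocess(title, body, stopwords):
    """Segment title and body separately, drop stop words,
    and keep the part-of-speech tag of every remaining word."""
    def tokenize(text):
        return [(pair.word, pair.flag)  # (word, POS tag)
                for pair in pseg.cut(text)
                if pair.word.strip() and pair.word not in stopwords]
    return tokenize(title), tokenize(body)
```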
The TFIDF method is computed as follows:
w_{t,j} = ( tf_{t,j} × log(N / n_t) ) / sqrt( Σ_t ( tf_{t,j} × log(N / n_t) )² )
where tf_{t,j} is the frequency of word t in document j, N is the total number of documents, n_t is the number of documents containing word t, and the denominator is a normalization factor.
The improved TFIDF method is computed as follows:
w_{i,j} = tf_{i,j} × log(N / n_j)
where tf_{i,j} is computed as:
tf_{i,j} = λ_j × ( u_1 · t_j^title + u_2 · t_j^body ) / l
where λ_j is the part-of-speech weight coefficient of word j (different values of λ give the weight coefficients of nouns, verbs and other words), t_j^title and t_j^body are the frequencies of word j in the title and body of the i-th document, u_1 and u_2 are the weight coefficients of words in the title and body respectively, and l is the total number of words in the i-th document.
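The following sketch computes the improved weight for a single word under the reconstructed formula above; the concrete coefficients (u1, u2 and the POS_WEIGHT table for λ) are illustrative assumptions — the text only fixes the ordering noun > verb > other.

```python
import math

# Illustrative part-of-speech weights (lambda_j); assumed values.
POS_WEIGHT = {"n": 1.0, "v": 0.8, "other": 0.5}

def improved_tfidf(tf_title, tf_body, doc_len, df, n_docs, pos,
                   u1=0.7, u2=0.3):
    """Position- and POS-weighted term frequency times the IDF factor.
    tf_title, tf_body: frequency of the word in the title/body of
    document i; doc_len: total number of words l in document i;
    df: number of documents containing the word (n_j);
    n_docs: total number of documents N; pos: POS tag of the word."""
    lam = POS_WEIGHT.get(pos, POS_WEIGHT["other"])
    tf = lam * (u1 * tf_title + u2 * tf_body) / doc_len
    return tf * math.log(n_docs / df)  # normalization factor omitted
```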
The knowledge reduction of full-covering granular computing first granulates the text: each feature word is mapped to the set of documents in which it appears, as shown in Table 1 below.
Table 1. Text granulation relation (each row pairs a feature word t_i with the set of documents containing it, e.g. t_1 = {d_1, d_2, ...})
The basic definitions of the full-covering granular computing model are as follows:
Let C be a covering of the non-empty universe U and let P = {C_j : j = 1, ..., N} be a full covering. The center of the grain G_x, the center of the covering C, and the granular entropy of the covering are defined respectively as:
center_C(x) = ∩{ N_C(x) | x ∈ N_C(x), N_C(x) ∈ G_x },  Center(C) = { center_C(x) | x ∈ U }
I(C) = (1/|U|) Σ_{x∈U} ( 1 − |center_C(x)| / |U| )
The core of C is defined as the family of indispensable blocks:
core(C) = { C_j ∈ C | I(C − {C_j}) ≠ I(C), or C − {C_j} is no longer a covering of U }
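A runnable sketch of these definitions, under the assumption (used only so that the entropy stays defined for sub-families during reduction) that an element not covered by any block gets the whole universe as its center; the final assertions reproduce the worked example given in the embodiment below.

```python
def center(x, blocks, universe):
    """center_C(x): intersection of all blocks that contain x."""
    hits = [b for b in blocks if x in b]
    return frozenset.intersection(*hits) if hits else frozenset(universe)

def granular_entropy(blocks, universe):
    """I(C) = (1/|U|) * sum over x of (1 - |center_C(x)| / |U|)."""
    n = len(universe)
    return sum(1 - len(center(x, blocks, universe)) / n
               for x in universe) / n

def core(blocks, universe):
    """Blocks whose removal changes the granular entropy."""
    base = granular_entropy(blocks, universe)
    return [b for b in blocks
            if abs(granular_entropy([c for c in blocks if c is not b],
                                    universe) - base) > 1e-12]

# Worked example from the embodiment: U = {x1..x5}, C = {C1..C6}.
U = {"x1", "x2", "x3", "x4", "x5"}
C = [frozenset(s) for s in ({"x1"}, {"x2", "x3"}, {"x3", "x4"},
                            {"x3"}, {"x5"}, {"x1", "x5"})]
assert round(granular_entropy(C, U), 2) == 0.72
assert core(C, U) == [C[0], C[1], C[2], C[4]]  # {C1, C2, C3, C5}
```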
The knowledge reduction of full-covering granular computing is applied to the text as follows:
Step 1: Compute the center Center(D) of the feature word set D and its granular entropy I(D).
Step 2: Initialize the reduced feature word set core(D) = φ. For each document set D_i ∈ D, compute its importance Sig(D_i) in the feature word set D; if Sig(D_i) > 0, let core(D) = core(D) ∪ {D_i}.
Step 3: Check whether I(core(D)) = I(D). If it holds, stop: core(D) is the minimal reduction of the feature word set D. Otherwise, if I(core(D)) < I(D), go to Step 4.
Step 4: Let P = core(D).
Step 5: For each document set D_t ∈ D − P, compute its importance Sig_P(D_t) relative to the feature word set D, find the D_t with the largest Sig_P(D_t), and add it to P: P = P ∪ {D_t}.
Step 6: Check whether I(P) = I(D). If it holds, stop: P is a reduction of the feature word set D. Otherwise return to Step 5.
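A sketch of this reduction loop, reusing granular_entropy and core from the sketch above; the greedy tie-breaking and the floating-point tolerance are assumptions.

```python
def knowledge_reduction(blocks, universe):
    """Greedy knowledge reduction: start from the core, then repeatedly
    add the remaining block whose addition raises the granular entropy
    the most (its relative importance Sig_P), until I(P) = I(D)."""
    target = granular_entropy(blocks, universe)
    P = core(blocks, universe)
    rest = [b for b in blocks if b not in P]
    while rest and abs(granular_entropy(P, universe) - target) > 1e-12:
        best = max(rest, key=lambda b: granular_entropy(P + [b], universe))
        rest.remove(best)
        P.append(best)
    return P  # on the worked example this returns {C1, C2, C3, C5}
```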
In the bLDA topic model, the Gibbs sampling update for the topic assignment z_i of the i-th word draws the background topic (k = 0) or a regular topic (k = 1, ..., K) according to:
p(z_i = 0 | z_¬i, w) ∝ λ · (n_0t + β_t) / (Σ_t n_0t + Σ_t β_t)
p(z_i = k | z_¬i, w) ∝ (1 − λ) · (n_kt + β_t) / (Σ_t n_kt + Σ_t β_t) · (n_mk + α_k) / (Σ_k n_mk + Σ_k α_k),  k = 1, ..., K
where z_i is the topic variable of the i-th feature word; ¬i indicates that the i-th word is excluded from the counts; n_mt is the frequency of word t in document m; n_kt is the number of times word t is assigned to topic k (k ≠ 0) and n_0t the corresponding count for the background topic; n_mk is the number of words in document m assigned to topic k; K is the number of topics; V is the total number of words in the document set; λ is the prior probability of the background topic; β_t is the Dirichlet prior of word t; and α_k is the Dirichlet prior of topic k.
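The exact bLDA conditional cannot be fully recovered from the text, so the following sketch follows a common background-topic LDA formulation consistent with the variables listed above, with symmetric priors; treat it as an assumption rather than the patented formula.

```python
import numpy as np

def topic_conditional(t, m, n_kt, n_mk, n_0t, alpha, beta, lam):
    """p(z_i = k | z_{not i}, w) for a token of word t in document m,
    with k = 0 the background topic; all count arrays are assumed to
    already exclude the i-th token.
    n_kt: K x V word-topic counts; n_mk: M x K document-topic counts;
    n_0t: length-V background-topic word counts."""
    K, V = n_kt.shape
    p = np.empty(K + 1)
    # Background topic, chosen with prior probability lam.
    p[0] = lam * (n_0t[t] + beta) / (n_0t.sum() + V * beta)
    # Regular topics: smoothed word likelihood times topic proportion.
    word = (n_kt[:, t] + beta) / (n_kt.sum(axis=1) + V * beta)
    doc = (n_mk[m] + alpha) / (n_mk[m].sum() + K * alpha)
    p[1:] = (1 - lam) * word * doc
    return p / p.sum()
```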
Description of the drawings
Fig. 1 is a flow chart of the invention.
Specific embodiment
To make the purpose, technical scheme and advantages of the present invention clearer, the invention is described in further detail below with an actual case.
A number of news articles from multiple different fields are obtained from Sohu News with a web crawler; the articles are analyzed and collated, duplicate news and non-textual symbols are removed, and the result serves as the sample set.
In order to choose a representative feature word set from the text, the title and body of the sample set are segmented separately, stop words are removed, and part-of-speech tagging is performed.
The improved TFIDF method is used when computing the probabilities of the feature words; words at different positions and with different parts of speech are assigned different weight coefficients. For example, a news article can be represented as d_i = {t_i | t_i1, t_i2, t_i3, t_i4, ..., t_im}, where t_i denotes the set of words of the article, t_i1, t_i2, t_i3 are words in the title and the rest are words in the body. If t_i1 and t_i3 are nouns, t_i2 is a verb, t_i4 is a noun, t_i5 is a verb, and t_i6 is a word of another part of speech, then the weight ordering is t_i1, t_i3 > t_i2 > t_i4 > t_i5 > t_i6.
The result of the improved TFIDF computation can be expressed as a two-dimensional matrix whose rows index documents and whose columns index feature words. For example, in the matrix
W = ( 0.112  0  ... ; 0.108  ...  ... )
the entry 0.112 is the probability of word t11 in the first document, a 0 means that the word does not occur in that document (so t12 does not occur in the first document), and the probability of t11 in the second document is 0.108. Every entry of the matrix greater than 0 is set to 1, entries equal to 0 are left unchanged, and the matrix is then transposed; the example above becomes
W' = ( 1  1  ... ; 0  ...  ... )
whose rows now index feature words and whose columns index documents.
The above can be written as t_1 = {d_1, d_2, ...}, t_2 = {d_2, ...}, ..., t_V = {..., d_N}, which corresponds to the concepts of the full-covering granular computing model.
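A short numpy sketch of this granulation step, using the example values above (the entries shown as "..." in the text are filled with illustrative assumptions):

```python
import numpy as np

W = np.array([[0.112, 0.00],   # document 1: t11 present, t12 absent
              [0.108, 0.05]])  # document 2 (0.05 is an assumed value)
B = (W > 0).astype(int).T      # set entries > 0 to 1, then transpose
# Each row of B now marks the documents containing one feature word,
# i.e. the document set t_i = {d_j : word i occurs in document j}.
print(B)                       # [[1 1]
                               #  [0 1]]
```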
Taking the knowledge reduction of full-covering granular computing as an example, the reduction process is described in detail. Let the universe U = {x1, x2, x3, x4, x5} and the full covering C = {C1, C2, C3, C4, C5, C6}, where C1 = {x1}, C2 = {x2, x3}, C3 = {x3, x4}, C4 = {x3}, C5 = {x5}, C6 = {x1, x5}.
(1) The neighborhoods of the elements are N_C(x1) = C1 and C6, N_C(x2) = C2, N_C(x3) = C2, C3 and C4, N_C(x4) = C3, N_C(x5) = C5 and C6; the neighborhood systems are NS_C(x1) = {C1, C6} = {{x1}, {x1, x5}}, NS_C(x2) = {C2} = {{x2, x3}}, NS_C(x3) = {C2, C3, C4} = {{x2, x3}, {x3, x4}, {x3}}, NS_C(x4) = {C3} = {{x3, x4}}, NS_C(x5) = {C5, C6} = {{x5}, {x1, x5}}.
(2) The grains of U with their centers follow from the neighborhood systems: center_C(x1) = {x1}, center_C(x2) = {x2, x3}, center_C(x3) = {x3}, center_C(x4) = {x3, x4}, center_C(x5) = {x5}.
(3) The center of the covering C is Center(C) = {{x1}, {x2, x3}, {x3}, {x3, x4}, {x5}}, and the granular entropy is I(C) = (1/5) · [(1 − 1/5) + (1 − 2/5) + (1 − 1/5) + (1 − 2/5) + (1 − 1/5)] = 0.72.
(4) The importances of the basic grains in the covering C: C2 and C3 are indispensable (removing either leaves U uncovered), Sig(C1) = Sig(C5) = 0.72 − 0.68 = 0.04, and Sig(C4) = Sig(C6) = 0.
(5) The core is core(C) = {C1, C2, C3, C5}, with I(core(C)) = 0.72 = I(C).
core(C) is therefore the minimal reduction of the covering C; the reduction removes the absolutely redundant block C6 and the relatively redundant block C4.
The feature words obtained after reduction are treated in exactly the same way: they form the selected feature word set, and the probabilities of this set computed by the improved TFIDF method can be read off directly.
The bLDA topic model computes the "document-topic" and "topic-word" probabilities, from which the "document-word" probabilities are obtained. The probabilities of the feature words after reduction are selected, the feature word probabilities computed by the two methods are linearly weighted and normalized, and the resulting reduced "document-word" probability matrix is clustered, which verifies the effectiveness and feasibility of the method.
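A sketch of this final combination and clustering step; the combination weight alpha and the number of clusters are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def combine_and_cluster(p_tfidf, p_blda, alpha=0.5, n_clusters=5):
    """Linearly weight the two reduced 'document-word' probability
    matrices (rows: documents, columns: reduced feature words),
    normalize each row, then cluster the documents with k-means."""
    P = alpha * p_tfidf + (1 - alpha) * p_blda
    P = P / P.sum(axis=1, keepdims=True)  # row normalization
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(P)
```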

Claims (7)

1. A text feature selection method based on full-covering granular computing, characterized in that it comprises the following steps:
(1): obtaining a news sample set of different categories and preprocessing it, the preprocessing comprising word segmentation, stop-word removal and part-of-speech tagging;
(2): computing the "document-word frequency" probabilities of the feature words with the improved TFIDF method to obtain the "document-word frequency" matrix w, and then reducing the feature words with the knowledge-reduction algorithm of full-covering granular computing;
(3): computing the "document-word frequency" probabilities of the feature words with the bLDA topic model, combining them with the term weights computed by the TFIDF algorithm after reduction, obtaining the final feature word weights and performing clustering.
2. The text feature selection method based on full-covering granular computing of claim 1, characterized in that preprocessing the news sample set comprises segmenting the title and the body of each news text separately.
3. The text feature selection method based on full-covering granular computing of claim 1, characterized in that the formula of the improved TFIDF algorithm is as follows:
w_{i,j} = tf_{i,j} × log(N / n_j), where
tf_{i,j} = λ_j × ( u_1 · t_j^title + u_2 · t_j^body ) / l
where λ_j is the part-of-speech weight coefficient of word j (different values of λ give the weight coefficients of nouns, verbs and other words), t_j^title and t_j^body are the frequencies of word j in the title and body of the i-th document, u_1 and u_2 are the weight coefficients of words in the title and body respectively, and l is the total number of words in the i-th document.
4. The text feature selection method based on full-covering granular computing of claim 1, characterized in that the formula of the TFIDF algorithm is as follows:
w_{t,j} = ( tf_{t,j} × log(N / n_t) ) / sqrt( Σ_t ( tf_{t,j} × log(N / n_t) )² )
where tf_{t,j} is the frequency of word t in document j, N is the total number of documents, n_t is the number of documents containing word t, and the denominator is a normalization factor.
5. The text feature selection method based on full-covering granular computing of claim 1, characterized in that: when the "document-word frequency" probability p is greater than 0, the corresponding entry of the matrix w is set to 1, and when the "document-word frequency" probability p equals 0, the entry of the matrix w is set to 0, realizing the granulation of the documents.
6. The text feature selection method based on full-covering granular computing of any one of claims 1 to 5, characterized in that the full-covering granular computing model is as follows:
Let C be a covering of the non-empty universe U and let P = {C_j : j = 1, ..., n} be a full covering. The center of the grain G_x, the center of the covering C, and the granular entropy of the covering are defined respectively as:
center_C(x) = ∩{ N_C(x) | x ∈ N_C(x), N_C(x) ∈ G_x },  Center(C) = { center_C(x) | x ∈ U }
I(C) = (1/|U|) Σ_{x∈U} ( 1 − |center_C(x)| / |U| )
The core of C is defined as:
core(C) = { C_j ∈ C | I(C − {C_j}) ≠ I(C), or C − {C_j} is no longer a covering of U }
7. The text feature selection method based on full-covering granular computing of claim 4, characterized in that the feature reduction based on full-covering granular computing proceeds as follows:
(1): compute the center Center(D) of the feature word set D and its granular entropy I(D);
(2): initialize the reduced feature word set core(D) = φ; for each document set D_i ∈ D, compute its importance Sig(D_i) in the feature word set D, and if Sig(D_i) > 0, let core(D) = core(D) ∪ {D_i};
(3): check whether I(core(D)) = I(D); if it holds, stop: core(D) is the minimal reduction of the feature word set D; otherwise, if I(core(D)) < I(D), go to step (4);
(4): let P = core(D);
(5): for each document set D_t ∈ D − P, compute its importance Sig_P(D_t) relative to the feature word set D, find the D_t with the largest Sig_P(D_t), and add it to P: P = P ∪ {D_t};
(6): check whether I(P) = I(D); if it holds, stop: P is a reduction of the feature word set D; otherwise return to step (5).
CN201810641512.4A 2018-06-21 2018-06-21 Text feature selection method based on full-covering granular computing Pending CN109165290A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810641512.4A CN109165290A (en) 2018-06-21 2018-06-21 Text feature selection method based on full-covering granular computing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810641512.4A CN109165290A (en) 2018-06-21 2018-06-21 Text feature selection method based on full-covering granular computing

Publications (1)

Publication Number Publication Date
CN109165290A true CN109165290A (en) 2019-01-08

Family

ID=64897201

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810641512.4A Pending CN109165290A (en) Text feature selection method based on full-covering granular computing

Country Status (1)

Country Link
CN (1) CN109165290A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598192A (en) * 2019-06-28 2019-12-20 太原理工大学 Text feature reduction method based on neighborhood rough set
CN112052666A (en) * 2020-08-09 2020-12-08 中信银行股份有限公司 Expert determination method, device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20150103509A (en) * 2014-03-03 2015-09-11 고려대학교 산학협력단 Method for analyzing patent documents using a latent dirichlet allocation
CN107391660A (en) * 2017-07-18 2017-11-24 太原理工大学 A kind of induction division methods for sub-topic division
CN107908624A (en) * 2017-12-12 2018-04-13 太原理工大学 A kind of K medoids Text Clustering Methods based on all standing Granule Computing

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20150103509A (en) * 2014-03-03 2015-09-11 고려대학교 산학협력단 Method for analyzing patent documents using a latent dirichlet allocation
CN107391660A (en) * 2017-07-18 2017-11-24 太原理工大学 A kind of induction division methods for sub-topic division
CN107908624A (en) * 2017-12-12 2018-04-13 太原理工大学 A kind of K medoids Text Clustering Methods based on all standing Granule Computing

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LI Xiangdong et al., "A Text Feature Selection Method Based on a Weighted LDA Model and Multi-Granularity", New Technology of Library and Information Service *
LI Jingyue et al., "An Improved TFIDF Method for Extracting Keywords from Web Pages", Computer Applications and Software *
XU Huifang, "Research on Text Representation and Feature Extraction Based on the Full-Covering Granular Computing Model", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598192A (en) * 2019-06-28 2019-12-20 太原理工大学 Text feature reduction method based on neighborhood rough set
CN112052666A (en) * 2020-08-09 2020-12-08 中信银行股份有限公司 Expert determination method, device and storage medium
CN112052666B (en) * 2020-08-09 2024-05-17 中信银行股份有限公司 Expert determination method, device and storage medium

Similar Documents

Publication Publication Date Title
Abbas et al. Multinomial Naive Bayes classification model for sentiment analysis
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN111104794A (en) Text similarity matching method based on subject words
CN110162750B (en) Text similarity detection method, electronic device and computer readable storage medium
CN105426426B (en) A kind of KNN file classification methods based on improved K-Medoids
CN109960756B (en) News event information induction method
Yi et al. Topic modeling for short texts via word embedding and document correlation
CN102929861B (en) Method and system for calculating text emotion index
Zhang et al. Clustering sentences with density peaks for multi-document summarization
CN111090731A (en) Electric power public opinion abstract extraction optimization method and system based on topic clustering
CN110209808A (en) A kind of event generation method and relevant apparatus based on text information
CN105183833A (en) User model based microblogging text recommendation method and recommendation apparatus thereof
CN108875065B (en) Indonesia news webpage recommendation method based on content
CN110188349A (en) A kind of automation writing method based on extraction-type multiple file summarization method
CN110134777B (en) Question duplication eliminating method and device, electronic equipment and computer readable storage medium
CN103761264A (en) Concept hierarchy establishing method based on product review document set
CN104199845B (en) Line Evaluation based on agent model discusses sensibility classification method
CN110020312B (en) Method and device for extracting webpage text
Phu et al. A valence-totaling model for Vietnamese sentiment classification
Ayral et al. An automated domain specific stop word generation method for natural language text classification
CN115017303A (en) Method, computing device and medium for enterprise risk assessment based on news text
Lin et al. A simple but effective method for Indonesian automatic text summarisation
Gong et al. Few-shot learning for named entity recognition based on BERT and two-level model fusion
CN109165290A (en) A kind of text feature selection method based on all standing Granule Computing
CN108694176B (en) Document emotion analysis method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190108