CN109739951A - Text feature extraction method based on an LDA topic model - Google Patents
Text feature extraction method based on an LDA topic model
- Publication number: CN109739951A (application number CN201811595082.3A)
- Authority
- CN
- China
- Prior art keywords
- feature
- theme feature
- probability distribution
- theme
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a text feature extraction method based on the LDA topic model. The method extracts K topic-feature probability distributions from a document with an LDA model, then computes feature similarity over them with a word2vec model; comparison against a threshold screens out the useful topic features and removes interference terms. By converting topic features into vectors with the word2vec model, the method compensates for the LDA model's neglect of the contextual relations of text, and a weighted-mean step reduces the dimensionality of the topic features. The method solves the problems that existing text-classification approaches ignore the contextual relations of text and ignore variation in the number of text features per document, which causes superfluous feature information to be extracted, thereby improving classification accuracy.
Description
Technical field
The present invention relates to the technical field of data processing, and more particularly to a text feature extraction method based on the LDA topic model.
Background technique
With the development of the Internet, a huge amount of knowledge is now published every day in forms such as blogs, news, and e-books. While the volume of information surges, and because information is mostly presented as text, it becomes increasingly difficult for people to absorb knowledge and understand this information. How to effectively handle this huge volume of information, refine its classification, and find what is valuable in it is a primary direction of current research.
Text feature extraction based on LDA has already been used for text classification in the prior art. The LDA (Latent Dirichlet Allocation) topic model is an unsupervised learning technique that can identify the latent topic-word information in massive document collections: given a value K, it outputs K topics for every document in the collection in the form of probability distributions. However, because different documents have different numbers of topics, uniformly taking K topic features introduces a small fraction of interfering features, which lowers accuracy. Using the obtained topic-feature distributions directly as classification features produces high-dimensional vectors that are unfavourable for computation. In addition, the LDA topic model assumes that all documents in a collection are mutually independent, and that all words within a document are mutually independent, so the processing of text information ignores the contextual relations of the text, and the extracted features ignore them as well.
In summary, existing text-classification methods ignore variation in the number of text features, which causes superfluous feature information to be extracted, and ignore the contextual relations of text, reducing classification accuracy.
Summary of the invention
To solve the problems that existing text-classification methods ignore variation in the number of text features, causing superfluous feature information to be extracted, and ignore the contextual relations of text, thereby reducing classification accuracy, the present invention provides a text feature extraction method based on the LDA topic model.
To achieve the above object of the invention, the following technical means are adopted:
A text feature extraction method based on the LDA topic model comprises the following steps:
S1. Pre-process the document, including performing Chinese word segmentation on it and removing stop words;
S2. Train an LDA model and a word2vec model respectively; the LDA (Latent Dirichlet Allocation) model is a well-known document-topic generative model, and word2vec is a well-known family of models used to produce word vectors.
S3. Input the pre-processed document of step S1 into the LDA model; the LDA model extracts from the document the probability distributions of K topic features, each distribution containing T words, where K and T are positive integers;
S4. For each of the K topic-feature probability distributions, compute similarities among its words with the word2vec model and screen the distributions, obtaining the probability distributions of r topic features;
S5. Vectorize the words of each of the r topic-feature probability distributions with the word2vec model, obtaining r topic-feature vectors;
S6. Use the r topic-feature vectors as the feature input.
In the above scheme, the document is processed by the LDA model to obtain K topic-feature probability distributions; feature similarity over these is then computed with word2vec, and comparison screens out the useful topic features and removes interference terms. By converting topic features into vectors with word2vec, the scheme compensates for the LDA model's neglect of the contextual relations of the text.
Preferably, step S1 uses the jieba segmenter to perform Chinese word segmentation on the document, and removes stop words by loading a stop-word list with jieba; jieba is a well-known Chinese word-segmentation tool.
Preferably, the probability distribution of step S3 is expressed as:
φ_k = {(w_t^k, p_t^k) : t = 1, …, T}, k = 1, …, K
where w_t^k denotes the t-th word of the k-th topic-feature probability distribution, and p_t^k denotes the distribution probability of the t-th word under the k-th topic-feature distribution, k ∈ [1, K], t ∈ [1, T].
Preferably, step S4 specifically comprises:
S41. For the first word w_1^k of the k-th topic-feature probability distribution, compute its similarity to each of w_2^k, …, w_T^k with the word2vec model, obtaining sim(w_1^k, w_2^k), …, sim(w_1^k, w_T^k), T − 1 values in total;
S42. Sum the T − 1 values and take their average: θ_k = (1/(T − 1)) · Σ_{t=2}^{T} sim(w_1^k, w_t^k); record θ = {θ_1, θ_2, …, θ_K};
S43. Set the threshold ε as the mean of the θ_k: ε = (1/K) · Σ_{k=1}^{K} θ_k;
S44. Compare each θ_k with the threshold ε: if θ_k ≥ ε, retain the corresponding topic-feature probability distribution as a useful one; if θ_k < ε, treat the corresponding topic-feature distribution as redundant information and discard it; here k ∈ [1, K] and there are K values θ_k in total.
The r topic-feature probability distributions finally retained are denoted φ'_1, φ'_2, …, φ'_r.
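The screening of steps S41 to S44 can be sketched as below. This is an illustrative implementation under stated assumptions: a real implementation would query a trained word2vec model for similarities, whereas here `sim` is a toy cosine-similarity stand-in over two hand-made 2-dimensional vectors per word, and the topic word lists are invented.

```python
# Sketch of steps S41-S44: keep topics whose average similarity between the
# first word and the remaining words is at least the mean over all topics.
import math

# Toy word vectors standing in for a trained word2vec model.
TOY_VECS = {
    "cat": (1.0, 0.0), "dog": (0.9, 0.1), "pet": (0.8, 0.2),
    "tax": (0.0, 1.0), "law": (0.1, 0.9),
}

def sim(a, b):
    """Cosine similarity between two toy word vectors (word2vec stand-in)."""
    va, vb = TOY_VECS[a], TOY_VECS[b]
    dot = sum(x * y for x, y in zip(va, vb))
    na = math.sqrt(sum(x * x for x in va))
    nb = math.sqrt(sum(x * x for x in vb))
    return dot / (na * nb)

def screen_topics(topics):
    """S41-S42: theta_k per topic; S43: epsilon = mean theta; S44: filter."""
    thetas = []
    for words in topics:
        first, rest = words[0], words[1:]
        thetas.append(sum(sim(first, w) for w in rest) / len(rest))
    eps = sum(thetas) / len(thetas)
    return [t for t, th in zip(topics, thetas) if th >= eps]

topics = [["cat", "dog", "pet"],   # coherent topic -> high theta, retained
          ["cat", "tax", "law"]]   # mixed topic -> low theta, discarded
print(screen_topics(topics))       # -> [['cat', 'dog', 'pet']]
```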
Preferably, step S5 specifically comprises:
S51. The word2vec model vectorizes the words of the r topic-feature probability distributions respectively, obtaining r × T vectors, denoted v_t^j, where v_t^j denotes the vector of the t-th word under the j-th topic-feature distribution, t ∈ [1, T], j ∈ [1, r];
S52. Combine the T vectors under each topic feature by a weighted average, the weight of each word being its corresponding distribution probability, so that each topic feature is represented by a single vector:
V_s = (Σ_{t=1}^{T} p_t^s · v_t^s) / (Σ_{t=1}^{T} p_t^s)
where s ∈ [1, r]; the resulting r topic-feature vectors are denoted V_1, V_2, …, V_r.
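The weighted average of steps S51 and S52 can be sketched as follows. The 2-dimensional vectors and probabilities below are invented toy values; a real word2vec model would supply vectors of a few hundred dimensions, and the division by the weight sum reflects one reasonable reading of "weighted average" here.

```python
# Sketch of steps S51-S52: represent one retained topic by the weighted mean
# of its T word vectors, weights being the words' distribution probabilities.
import numpy as np

def topic_vector(word_vecs, probs):
    """Weighted mean of the T word vectors of one topic (the vector V_s)."""
    word_vecs = np.asarray(word_vecs, dtype=float)   # shape (T, d)
    probs = np.asarray(probs, dtype=float)           # shape (T,)
    return probs @ word_vecs / probs.sum()           # shape (d,)

vecs = [[1.0, 0.0], [0.0, 1.0]]   # T = 2 toy word vectors
probs = [0.75, 0.25]              # their distribution probabilities
print(topic_vector(vecs, probs))  # -> [0.75 0.25]
```

Applying `topic_vector` to each of the r retained distributions yields the r topic-feature vectors V_1, …, V_r of step S6.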
Compared with the prior art, the beneficial effects of the technical solution of the present invention are:
The method extracts K topic-feature probability distributions from a document with the LDA model, then computes feature similarity over them with word2vec; comparison screens out the useful topic features, adapting to the differing number of dominant topic features of each input document, thereby effectively controlling the extraction of useful features and removing interference terms, so that the features are numerous but not miscellaneous and feature optimization is achieved. By converting topic features into vectors with word2vec, the method compensates for the LDA model's neglect of the contextual relations of the text, and the weighted-mean step reduces the dimensionality of the topic features. The method solves the problems that existing text-classification approaches ignore variation in the number of text features, causing superfluous feature information to be extracted, and ignore the contextual relations of text, improving classification accuracy.
Description of the drawings
Fig. 1 is the overall flow chart of the method of the present invention.
Fig. 2 is the probability distribution graph of the topic features in one embodiment of the invention.
Specific embodiment
The attached figures are only for illustrative purposes and shall not be understood as limiting the patent;
For better illustration of the embodiment, certain components in the drawings may be omitted, enlarged, or reduced, and do not represent the size of the actual product;
It will be understood by those skilled in the art that certain well-known structures and their descriptions may be omitted in the drawings.
The technical solution of the present invention is further described below with reference to the accompanying drawings and embodiments.
As shown in Fig. 1, a text feature extraction method based on the LDA topic model comprises the following steps:
S1. Pre-process the document, including performing Chinese word segmentation on it and removing stop words;
S2. Train an LDA model and a word2vec model respectively;
S3. Input the pre-processed document of step S1 into the LDA model; the LDA model extracts from the document the probability distributions of K topic features, each distribution containing T words, where K and T are positive integers. In this embodiment the number of topic features K is set to 50 and the number of main words T per distribution is 10; the probability distribution graph of the topic features of this embodiment is shown in Fig. 2;
S4. For each of the K topic-feature probability distributions, compute similarities among its words with the word2vec model and screen the distributions, obtaining the probability distributions of r topic features;
S5. Vectorize the words of each of the r topic-feature probability distributions with the word2vec model, obtaining r topic-feature vectors;
S6. Use the r topic-feature vectors as the feature input.
Here, step S1 uses the jieba segmenter to perform Chinese word segmentation on the document, and removes stop words by loading a stop-word list with jieba.
Here, the probability distribution of step S3 is expressed as:
φ_k = {(w_t^k, p_t^k) : t = 1, …, T}, k = 1, …, K
where w_t^k denotes the t-th word of the k-th topic-feature probability distribution, and p_t^k denotes the distribution probability of the t-th word under the k-th topic-feature distribution, k ∈ [1, K], t ∈ [1, T].
Here, step S4 specifically comprises:
S41. For the first word w_1^k of the k-th topic-feature probability distribution, compute its similarity to each of w_2^k, …, w_T^k with the word2vec model, obtaining sim(w_1^k, w_2^k), …, sim(w_1^k, w_T^k), T − 1 values in total;
S42. Sum the T − 1 values and take their average: θ_k = (1/(T − 1)) · Σ_{t=2}^{T} sim(w_1^k, w_t^k); record θ = {θ_1, θ_2, …, θ_K};
S43. Set the threshold ε as the mean of the θ_k: ε = (1/K) · Σ_{k=1}^{K} θ_k;
S44. Compare each θ_k with the threshold ε: if θ_k ≥ ε, retain the corresponding topic-feature probability distribution as a useful one; if θ_k < ε, treat the corresponding topic-feature distribution as redundant information and discard it; here k ∈ [1, K] and there are K values θ_k in total.
The r topic-feature probability distributions finally retained are denoted φ'_1, φ'_2, …, φ'_r.
Here, step S5 specifically comprises:
S51. The word2vec model vectorizes the words of the r topic-feature probability distributions respectively, obtaining r × T vectors, denoted v_t^j, where v_t^j denotes the vector of the t-th word under the j-th topic-feature distribution, t ∈ [1, T], j ∈ [1, r];
S52. Combine the T vectors under each topic feature by a weighted average, the weight of each word being its corresponding distribution probability, so that each topic feature is represented by a single vector:
V_s = (Σ_{t=1}^{T} p_t^s · v_t^s) / (Σ_{t=1}^{T} p_t^s)
where s ∈ [1, r]; the resulting r topic-feature vectors are denoted V_1, V_2, …, V_r.
The terms describing positional relationships in the drawings are only for illustration and shall not be understood as limiting this patent;
Obviously, the above embodiment of the present invention is merely an example for clearly illustrating the invention, and is not a restriction on the embodiments of the present invention. For those of ordinary skill in the art, other variations or changes in different forms can be made on the basis of the above description. It is neither necessary nor possible to exhaust all the embodiments here. Any modifications, equivalent replacements, and improvements made within the spirit and principle of the invention shall be included within the protection scope of the claims of the present invention.
Claims (5)
1. A text feature extraction method based on the LDA topic model, characterized by comprising the following steps:
S1. pre-processing the document, including performing Chinese word segmentation on it and removing stop words;
S2. training an LDA model and a word2vec model respectively;
S3. inputting the pre-processed document of step S1 into the LDA model, the LDA model extracting from the document the probability distributions of K topic features, each distribution containing T words, K and T being positive integers;
S4. for each of the K topic-feature probability distributions, computing similarities among its words with the word2vec model and screening the distributions, obtaining the probability distributions of r topic features;
S5. vectorizing the words of each of the r topic-feature probability distributions with the word2vec model, obtaining r topic-feature vectors;
S6. using the r topic-feature vectors as the feature input.
2. The text feature extraction method according to claim 1, characterized in that step S1 uses the jieba segmenter to perform Chinese word segmentation on the document and removes stop words by loading a stop-word list with jieba.
3. The text feature extraction method according to claim 1, characterized in that the probability distribution of step S3 is expressed as:
φ_k = {(w_t^k, p_t^k) : t = 1, …, T}, k = 1, …, K
where w_t^k denotes the t-th word of the k-th topic-feature probability distribution, and p_t^k denotes the distribution probability of the t-th word under the k-th topic-feature distribution, k ∈ [1, K], t ∈ [1, T].
4. The text feature extraction method according to claim 3, characterized in that step S4 specifically comprises:
S41. for the first word w_1^k of the k-th topic-feature probability distribution, computing its similarity to each of w_2^k, …, w_T^k with the word2vec model, obtaining sim(w_1^k, w_2^k), …, sim(w_1^k, w_T^k), T − 1 values in total;
S42. summing the T − 1 values and taking their average: θ_k = (1/(T − 1)) · Σ_{t=2}^{T} sim(w_1^k, w_t^k); recording θ = {θ_1, θ_2, …, θ_K};
S43. setting the threshold ε as the mean of the θ_k: ε = (1/K) · Σ_{k=1}^{K} θ_k;
S44. comparing each θ_k with the threshold ε: if θ_k ≥ ε, retaining the corresponding topic-feature probability distribution as a useful one; if θ_k < ε, treating the corresponding topic-feature distribution as redundant information and discarding it; wherein k ∈ [1, K] and there are K values θ_k in total;
the r topic-feature probability distributions finally retained being denoted φ'_1, φ'_2, …, φ'_r.
5. The text feature extraction method according to claim 4, characterized in that step S5 specifically comprises:
S51. the word2vec model vectorizing the words of the r topic-feature probability distributions respectively, obtaining r × T vectors, denoted v_t^j, where v_t^j denotes the vector of the t-th word under the j-th topic-feature distribution, t ∈ [1, T], j ∈ [1, r];
S52. combining the T vectors under each topic feature by a weighted average, the weight of each word being its corresponding distribution probability, so that each topic feature is represented by a single vector:
V_s = (Σ_{t=1}^{T} p_t^s · v_t^s) / (Σ_{t=1}^{T} p_t^s)
where s ∈ [1, r]; the resulting r topic-feature vectors being denoted V_1, V_2, …, V_r.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811595082.3A CN109739951A (en) | 2018-12-25 | 2018-12-25 | A kind of text feature based on LDA topic model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109739951A true CN109739951A (en) | 2019-05-10 |
Family
ID=66360275
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811595082.3A Pending CN109739951A (en) | 2018-12-25 | 2018-12-25 | A kind of text feature based on LDA topic model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109739951A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111159335A (en) * | 2019-12-12 | 2020-05-15 | 中国电子科技集团公司第七研究所 | Short text classification method based on pyramid pooling and LDA topic model |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104636456A (en) * | 2015-02-03 | 2015-05-20 | 大连理工大学 | Question routing method based on word vectors |
CN107122349A (en) * | 2017-04-24 | 2017-09-01 | 无锡中科富农物联科技有限公司 | A kind of feature word of text extracting method based on word2vec LDA models |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106599198B (en) | Image description method of multi-cascade junction cyclic neural network | |
CN109783818B (en) | Enterprise industry classification method | |
CN106055538B (en) | The automatic abstracting method of the text label that topic model and semantic analysis combine | |
CN106383877B (en) | Social media online short text clustering and topic detection method | |
CN108363816A (en) | Open entity relation extraction method based on sentence justice structural model | |
Mouhcine et al. | Recognition of cursive Arabic handwritten text using embedded training based on HMMs | |
CN105302884B (en) | Webpage mode identification method and visual structure learning method based on deep learning | |
CN110532395B (en) | Semantic embedding-based word vector improvement model establishing method | |
CN108710611A (en) | A kind of short text topic model generation method of word-based network and term vector | |
CN109446423B (en) | System and method for judging sentiment of news and texts | |
CN108154156B (en) | Image set classification method and device based on neural topic model | |
CN107480688A (en) | Fine granularity image-recognizing method based on zero sample learning | |
Elleuch et al. | Towards unsupervised learning for Arabic handwritten recognition using deep architectures | |
CN108733647A (en) | A kind of term vector generation method based on Gaussian Profile | |
CN111125370A (en) | Relation extraction method suitable for small samples | |
WO2023173555A1 (en) | Model training method and apparatus, text classification method and apparatus, device, and medium | |
Shi et al. | Multitask training with unlabeled data for end-to-end sign language fingerspelling recognition | |
Al-Hmouz et al. | Enhanced numeral recognition for handwritten multi-language numerals using fuzzy set-based decision mechanism | |
CN109739951A (en) | A kind of text feature based on LDA topic model | |
CN113779283A (en) | Fine-grained cross-media retrieval method with deep supervision and feature fusion | |
CN113204975A (en) | Sensitive character wind identification method based on remote supervision | |
CN110597982A (en) | Short text topic clustering algorithm based on word co-occurrence network | |
CN113221885B (en) | Hierarchical modeling method and system based on whole words and radicals | |
Yang et al. | Tb-CNN: joint tree-bank information for sentiment analysis using CNN | |
CN110377845B (en) | Collaborative filtering recommendation method based on interval semi-supervised LDA |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190510 |