CN109739951A - A text feature extraction method based on an LDA topic model - Google Patents

A text feature extraction method based on an LDA topic model

Info

Publication number
CN109739951A
Authority
CN
China
Prior art keywords
feature
topic feature
probability distribution
topic
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811595082.3A
Other languages
Chinese (zh)
Inventor
李卫军
黎浩炎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology
Priority to CN201811595082.3A
Publication of CN109739951A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text feature extraction method based on an LDA topic model. The method extracts K topic feature probability distributions from a document with an LDA model, computes feature similarities over them with a word2vec model, and screens out the useful topic features by comparison, removing interference features. By converting the topic features into vectors with the word2vec model, the method compensates for the LDA model's neglect of the contextual relations of text information, and the weighted averaging reduces the topic feature dimensionality. The method solves the problems that existing text classification methods ignore the contextual relations of text information and ignore the variation in the number of text features across documents, which causes redundant feature information to be extracted, and thereby improves classification accuracy.

Description

A text feature extraction method based on an LDA topic model
Technical field
The present invention relates to the technical field of data processing, and more particularly to a text feature extraction method based on an LDA topic model.
Background technique
With the development of the Internet, a huge amount of knowledge and information is now published every day in forms such as blogs, news, and e-books. Because this information is mostly presented as text, the difficulty people face in recognizing knowledge and understanding information grows along with the surge in information volume. How to handle this huge volume of information effectively, refine its classification, and find the valuable items in it is a primary direction of current research.
In the prior art, LDA-based text features have been used for text classification. The LDA (Latent Dirichlet Allocation) topic model is an unsupervised learning technique that can identify latent topic word information in massive document collections; according to a preset value K, it outputs K topics for every document in the collection in the form of probability distributions. However, since different documents contain different numbers of topics, uniformly taking K topic features for every document introduces a small fraction of interference features and lowers accuracy. Moreover, using the obtained topic feature distributions directly as classification features produces high-dimensional vectors, which is unfavorable for computation. In addition, the LDA topic model assumes either that all documents in the collection are mutually independent, or that all words in a document are mutually independent; it therefore ignores the context of the text when processing text information, so the extracted features also ignore contextual relations.
In summary, existing text classification methods ignore the variation in the number of text features, which causes redundant feature information to be extracted, and ignore the contextual relations of text information, reducing classification accuracy.
Summary of the invention
To solve the problems that existing text classification methods ignore the variation in the number of text features, causing redundant feature information to be extracted, and ignore the contextual relations of text information, reducing classification accuracy, the present invention provides a text feature extraction method based on an LDA topic model.
To achieve the above object of the invention, the following technical means are adopted:
A text feature extraction method based on an LDA topic model comprises the following steps:
S1. Preprocess the documents, including performing Chinese word segmentation on them and removing stop words;
S2. Train an LDA model and a word2vec model respectively; the LDA (Latent Dirichlet Allocation) model is a well-known document topic generation model, and the word2vec model is a well-known family of models for producing word vectors;
S3. Input the documents preprocessed in step S1 into the LDA model; the LDA model extracts from each document the probability distributions of K topic features, where the probability distribution of each topic feature contains T words; K and T are positive integers;
S4. For the words under each of the K topic feature probability distributions, perform similarity calculations with the word2vec model and then screen, obtaining the probability distributions of r topic features;
S5. The word2vec model vectorizes the words of the r topic feature probability distributions respectively, obtaining r topic feature vectors;
S6. Use the r topic feature vectors as the feature input.
In the above scheme, a document is processed by the LDA model; after the K topic feature probability distributions are obtained, word2vec computes feature similarities over them, and the comparison screens out the useful topic features and removes interference terms. By converting the topic features into vectors with word2vec, the scheme compensates for the LDA model's neglect of the contextual relations of text information.
Preferably, step S1 uses jieba to perform the Chinese word segmentation and loads a stop-word list through jieba to remove stop words, jieba being a well-known Chinese word segmentation tool.
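By way of illustration only, the following is a minimal Python sketch of the step S1 preprocessing; the stop-word file name and the helper function are assumptions for illustration, not part of the patent.

import jieba

def preprocess(text, stopword_path="stopwords.txt"):
    # load the stop-word list (assumed format: one word per line)
    with open(stopword_path, encoding="utf-8") as f:
        stopwords = set(line.strip() for line in f)
    # jieba.lcut performs Chinese word segmentation and returns a list of tokens
    words = jieba.lcut(text)
    # drop stop words and whitespace-only tokens
    return [w for w in words if w.strip() and w not in stopwords]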
Preferably, the probability distribution of step S3 is expressed as {(w_t^k, p_t^k) : t ∈ [1, T]}, where w_t^k denotes the t-th word of the k-th topic feature probability distribution and p_t^k denotes the distribution probability of the t-th word under the k-th topic feature distribution, k ∈ [1, K], t ∈ [1, T].
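By way of illustration, the topic feature probability distributions of step S3 could be obtained with gensim's LdaModel as sketched below; the corpus construction and all variable names are assumptions.

from gensim import corpora
from gensim.models import LdaModel

def extract_topics(tokenized_docs, K=50, T=10):
    # build a dictionary and a bag-of-words corpus from the preprocessed documents
    dictionary = corpora.Dictionary(tokenized_docs)
    bow = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = LdaModel(bow, id2word=dictionary, num_topics=K)
    # show_topic(k, topn=T) returns the T (word, probability) pairs of topic k,
    # i.e. the pairs (w_t^k, p_t^k) of the description
    return [lda.show_topic(k, topn=T) for k in range(K)]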
Preferably, step S4 specifically comprises:
S41. For the first word w_1^k in the k-th topic feature probability distribution, perform similarity calculations with each of w_2^k, …, w_T^k using the word2vec model, obtaining sim(w_1^k, w_2^k), …, sim(w_1^k, w_T^k), a total of T−1 values;
S42. Sum the T−1 values and take their average: θ_k = (1/(T−1)) · Σ_{t=2..T} sim(w_1^k, w_t^k); record θ = θ_1, θ_2, …, θ_K;
S43. Set the threshold ε to the mean of the θ_k: ε = (1/K) · Σ_{k=1..K} θ_k;
S44. Compare each θ_k with the threshold ε, where k ∈ [1, K] and there are K values θ_k in total. If θ_k is greater than or equal to ε, retain the corresponding topic feature probability distribution as a useful one; if θ_k is less than ε, regard the corresponding topic feature distribution as redundant information and discard it.
Finally, r topic feature probability distributions are retained, denoted {(w_t^j, p_t^j) : t ∈ [1, T]}, j ∈ [1, r].
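A sketch of the step S4 screening follows, assuming a gensim Word2Vec model w2v trained on the same corpus; skipping words that are absent from the word2vec vocabulary is an added assumption the patent does not specify.

import numpy as np

def screen_topics(topics, w2v):
    thetas = []
    for topic in topics:
        words = [w for w, _ in topic if w in w2v.wv]
        # T-1 similarities of the first word with each remaining word
        sims = [w2v.wv.similarity(words[0], w) for w in words[1:]] if words else []
        # theta_k: the average of the T-1 similarity values
        thetas.append(np.mean(sims) if sims else 0.0)
    eps = np.mean(thetas)  # threshold: the mean of all theta_k
    # retain only the topics whose theta_k reaches the threshold
    return [t for t, th in zip(topics, thetas) if th >= eps]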
Preferably, step S5 specifically comprises:
S51. The word2vec model vectorizes the words of the r topic feature probability distributions respectively, obtaining r × T vectors, denoted v_t^j, where v_t^j denotes the vector of the t-th word under the j-th topic feature distribution, t ∈ [1, T], j ∈ [1, r];
S52. Process the T vectors under each topic feature with a weighted average, the weight of each word being that word's corresponding distribution probability, so that each topic feature is represented by a single vector: V_s = Σ_{t=1..T} p_t^s · v_t^s / Σ_{t=1..T} p_t^s, s ∈ [1, r]; the r topic feature vectors are denoted V_1, V_2, …, V_r.
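A sketch of steps S51 and S52 under the same assumptions follows; because the original formula appears only as an image, the normalized probability-weighted average below is one reading of the "weighted average" of the description.

import numpy as np

def topic_vectors(retained_topics, w2v):
    vectors = []
    for topic in retained_topics:
        pairs = [(w, p) for w, p in topic if w in w2v.wv]
        if not pairs:
            continue  # skip topics with no in-vocabulary words
        total = sum(p for _, p in pairs)
        # V_s = sum_t p_t^s * v_t^s / sum_t p_t^s : one vector per topic feature
        V = sum(p * w2v.wv[w] for w, p in pairs) / total
        vectors.append(np.asarray(V))
    return vectors  # V_1, ..., V_r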
Compared with the prior art, the technical solution of the present invention has the following beneficial effects:
The method of the present invention processes a document with the LDA model; after the K topic feature probability distributions are obtained, word2vec computes feature similarities over them, and the comparison screens out the useful topic features. Because this adapts to the differing numbers of dominant topic features of each input document, the extraction of useful features is effectively controlled and interference terms are removed, so the features are plentiful yet clean, achieving feature optimization. By converting the topic features into vectors with word2vec, the method compensates for the LDA model's neglect of the contextual relations of text information, and the weighted averaging reduces the topic feature dimensionality. The method thus solves the problems that existing text classification methods ignore the variation in the number of text features, causing redundant feature information to be extracted, and ignore the contextual relations of text information, and it improves classification accuracy.
Description of the drawings
Fig. 1 is the overall flowchart of the method of the present invention.
Fig. 2 is the probability distribution graph of a topic feature in an embodiment of the present invention.
Specific embodiment
The accompanying drawings are for illustrative purposes only and shall not be construed as limiting this patent. To better illustrate the embodiment, certain components in the drawings may be omitted, enlarged, or reduced; they do not represent the size of the actual product. It will be understood by those skilled in the art that certain well-known structures and their descriptions may be omitted from the drawings.
The technical solution of the present invention is further described below with reference to the accompanying drawings and embodiments.
As shown in Fig. 1, a text feature extraction method based on an LDA topic model comprises the following steps:
S1. Preprocess the documents, including performing Chinese word segmentation on them and removing stop words;
S2. Train an LDA model and a word2vec model respectively;
S3. Input the documents preprocessed in step S1 into the LDA model; the LDA model extracts from each document the probability distributions of K topic features, where the probability distribution of each topic feature contains T words, K and T being positive integers. In this embodiment the number of topic features K is set to 50 and the number of main words T per distribution is 10; the probability distribution graph of a topic feature of this embodiment is shown in Fig. 2;
S4. For the words under each of the K topic feature probability distributions, perform similarity calculations with the word2vec model and then screen, obtaining the probability distributions of r topic features;
S5. The word2vec model vectorizes the words of the r topic feature probability distributions respectively, obtaining r topic feature vectors;
S6. Use the r topic feature vectors as the feature input.
Wherein, step S1 uses jieba to perform the Chinese word segmentation and loads a stop-word list through jieba to remove stop words.
Wherein, the probability distribution of step S3 is expressed as {(w_t^k, p_t^k) : t ∈ [1, T]}, where w_t^k denotes the t-th word of the k-th topic feature probability distribution and p_t^k denotes the distribution probability of the t-th word under the k-th topic feature distribution, k ∈ [1, K], t ∈ [1, T].
Wherein, step S4 specifically comprises:
S41. For the first word w_1^k in the k-th topic feature probability distribution, perform similarity calculations with each of w_2^k, …, w_T^k using the word2vec model, obtaining sim(w_1^k, w_2^k), …, sim(w_1^k, w_T^k), a total of T−1 values;
S42. Sum the T−1 values and take their average: θ_k = (1/(T−1)) · Σ_{t=2..T} sim(w_1^k, w_t^k); record θ = θ_1, θ_2, …, θ_K;
S43. Set the threshold ε to the mean of the θ_k: ε = (1/K) · Σ_{k=1..K} θ_k;
S44. Compare each θ_k with the threshold ε, where k ∈ [1, K] and there are K values θ_k in total. If θ_k is greater than or equal to ε, retain the corresponding topic feature probability distribution as a useful one; if θ_k is less than ε, regard the corresponding topic feature distribution as redundant information and discard it.
Finally, r topic feature probability distributions are retained, denoted {(w_t^j, p_t^j) : t ∈ [1, T]}, j ∈ [1, r].
Wherein, step S5 specifically comprises:
S51. The word2vec model vectorizes the words of the r topic feature probability distributions respectively, obtaining r × T vectors, denoted v_t^j, where v_t^j denotes the vector of the t-th word under the j-th topic feature distribution, t ∈ [1, T], j ∈ [1, r];
S52. Process the T vectors under each topic feature with a weighted average, the weight of each word being that word's corresponding distribution probability, so that each topic feature is represented by a single vector: V_s = Σ_{t=1..T} p_t^s · v_t^s / Σ_{t=1..T} p_t^s, s ∈ [1, r]; the r topic feature vectors are denoted V_1, V_2, …, V_r.
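For completeness, a minimal end-to-end sketch combining the sketches above under this embodiment's settings (K = 50, T = 10); the raw document list and the Word2Vec training parameters are assumptions.

from gensim.models import Word2Vec

raw_docs = ["...", "..."]                           # the document collection
docs = [preprocess(d) for d in raw_docs]            # S1: segment and remove stop words
w2v = Word2Vec(docs, vector_size=100, min_count=1)  # S2: train word2vec
topics = extract_topics(docs, K=50, T=10)           # S2/S3: train LDA, K distributions
retained = screen_topics(topics, w2v)               # S4: keep the r useful topics
features = topic_vectors(retained, w2v)             # S5: r topic feature vectors
# S6: `features` serves as the feature input, e.g. to a downstream classifier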
The positional relationships described in the drawings are for illustrative purposes only and shall not be construed as limiting this patent.
Obviously, the above embodiment of the present invention is merely an example given to clearly illustrate the invention and is not a limitation on the embodiments of the present invention. For those of ordinary skill in the art, other variations or changes in different forms may be made on the basis of the above description. There is no need, and no way, to exhaust all embodiments. Any modification, equivalent replacement, or improvement made within the spirit and principle of the invention shall be included within the protection scope of the claims of the present invention.

Claims (5)

1. A text feature extraction method based on an LDA topic model, characterized by comprising the following steps:
S1. Preprocess the documents, including performing Chinese word segmentation on them and removing stop words;
S2. Train an LDA model and a word2vec model respectively;
S3. Input the documents preprocessed in step S1 into the LDA model; the LDA model extracts from each document the probability distributions of K topic features, where the probability distribution of each topic feature contains T words; K and T are positive integers;
S4. For the words under each of the K topic feature probability distributions, perform similarity calculations with the word2vec model and then screen, obtaining the probability distributions of r topic features;
S5. The word2vec model vectorizes the words of the r topic feature probability distributions respectively, obtaining r topic feature vectors;
S6. Use the r topic feature vectors as the feature input.
2. The text feature extraction method according to claim 1, characterized in that step S1 uses jieba to perform Chinese word segmentation on the documents and loads a stop-word list through jieba to remove stop words.
3. The text feature extraction method according to claim 1, characterized in that the probability distribution of step S3 is expressed as:
{(w_t^k, p_t^k) : t ∈ [1, T]}, where w_t^k denotes the t-th word of the k-th topic feature probability distribution and p_t^k denotes the distribution probability of the t-th word under the k-th topic feature distribution, k ∈ [1, K], t ∈ [1, T].
4. The text feature extraction method according to claim 3, characterized in that step S4 specifically comprises:
S41. For the first word w_1^k in the k-th topic feature probability distribution, perform similarity calculations with each of w_2^k, …, w_T^k using the word2vec model, obtaining sim(w_1^k, w_2^k), …, sim(w_1^k, w_T^k), a total of T−1 values;
S42. Sum the T−1 values and take their average: θ_k = (1/(T−1)) · Σ_{t=2..T} sim(w_1^k, w_t^k); record θ = θ_1, θ_2, …, θ_K;
S43. Set the threshold ε to the mean of the θ_k: ε = (1/K) · Σ_{k=1..K} θ_k;
S44. Compare each θ_k with the threshold ε, where k ∈ [1, K] and there are K values θ_k in total; if θ_k is greater than or equal to ε, retain the corresponding topic feature probability distribution as a useful one; if θ_k is less than ε, regard the corresponding topic feature distribution as redundant information and discard it;
finally, r topic feature probability distributions are retained, denoted {(w_t^j, p_t^j) : t ∈ [1, T]}, j ∈ [1, r].
5. The text feature extraction method according to claim 4, characterized in that step S5 specifically comprises:
S51. The word2vec model vectorizes the words of the r topic feature probability distributions respectively, obtaining r × T vectors, denoted v_t^j, where v_t^j denotes the vector of the t-th word under the j-th topic feature distribution, t ∈ [1, T], j ∈ [1, r];
S52. Process the T vectors under each topic feature with a weighted average, the weight of each word being that word's corresponding distribution probability, so that each topic feature is represented by a single vector: V_s = Σ_{t=1..T} p_t^s · v_t^s / Σ_{t=1..T} p_t^s, s ∈ [1, r]; the r topic feature vectors are denoted V_1, V_2, …, V_r.
CN201811595082.3A 2018-12-25 2018-12-25 A text feature extraction method based on an LDA topic model Pending CN109739951A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811595082.3A CN109739951A (en) A text feature extraction method based on an LDA topic model

Publications (1)

Publication Number Publication Date
CN109739951A true CN109739951A (en) 2019-05-10

Family

ID=66360275

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811595082.3A Pending CN109739951A (en) A text feature extraction method based on an LDA topic model

Country Status (1)

Country Link
CN (1) CN109739951A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111159335A (en) * 2019-12-12 2020-05-15 中国电子科技集团公司第七研究所 Short text classification method based on pyramid pooling and LDA topic model

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104636456A (en) * 2015-02-03 2015-05-20 大连理工大学 Question routing method based on word vectors
CN107122349A (en) * 2017-04-24 2017-09-01 无锡中科富农物联科技有限公司 A kind of feature word of text extracting method based on word2vec LDA models

Similar Documents

Publication Publication Date Title
CN106599198B (en) Image description method of multi-cascade junction cyclic neural network
CN109783818B (en) Enterprise industry classification method
CN106055538B (en) The automatic abstracting method of the text label that topic model and semantic analysis combine
CN106383877B (en) Social media online short text clustering and topic detection method
CN108363816A (en) Open entity relation extraction method based on sentence justice structural model
Mouhcine et al. Recognition of cursive Arabic handwritten text using embedded training based on HMMs
CN105302884B (en) Webpage mode identification method and visual structure learning method based on deep learning
CN110532395B (en) Semantic embedding-based word vector improvement model establishing method
CN108710611A (en) A kind of short text topic model generation method of word-based network and term vector
CN109446423B (en) System and method for judging sentiment of news and texts
CN108154156B (en) Image set classification method and device based on neural topic model
CN107480688A (en) Fine granularity image-recognizing method based on zero sample learning
Elleuch et al. Towards unsupervised learning for Arabic handwritten recognition using deep architectures
CN108733647A (en) A kind of term vector generation method based on Gaussian Profile
CN111125370A (en) Relation extraction method suitable for small samples
WO2023173555A1 (en) Model training method and apparatus, text classification method and apparatus, device, and medium
Shi et al. Multitask training with unlabeled data for end-to-end sign language fingerspelling recognition
Al-Hmouz et al. Enhanced numeral recognition for handwritten multi-language numerals using fuzzy set-based decision mechanism
CN109739951A (en) A text feature extraction method based on an LDA topic model
CN113779283A (en) Fine-grained cross-media retrieval method with deep supervision and feature fusion
CN113204975A (en) Sensitive character wind identification method based on remote supervision
CN110597982A (en) Short text topic clustering algorithm based on word co-occurrence network
CN113221885B (en) Hierarchical modeling method and system based on whole words and radicals
Yang et al. Tb-CNN: joint tree-bank information for sentiment analysis using CNN
CN110377845B (en) Collaborative filtering recommendation method based on interval semi-supervised LDA

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20190510)