CN109739951A - Text feature extraction method based on an LDA topic model - Google Patents
Text feature extraction method based on an LDA topic model
- Publication number: CN109739951A (application number CN201811595082.3A)
- Authority
- CN
- China
- Prior art keywords
- feature
- theme feature
- probability distribution
- theme
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a text feature extraction method based on the LDA topic model. The method extracts K topic-feature probability distributions from a document with an LDA model, then computes feature similarity over them with a word2vec model; comparison against a threshold screens out the useful topic features and removes interference terms. By converting topic features into vectors with the word2vec model, the method compensates for the LDA model's neglect of the contextual relations of text, and a weighted-mean step reduces the dimensionality of the topic features. The method solves the problems that existing text-classification approaches ignore the contextual relations of text and ignore variation in the number of text features per document, which causes superfluous feature information to be extracted, thereby improving classification accuracy.
Description
Technical field
The present invention relates to the technical field of data processing, and more particularly to a text feature extraction method based on the LDA topic model.
Background technique
With the development of the Internet, a huge amount of knowledge is now published every day in forms such as blogs, news, and e-books. While the volume of information surges, and because information is mostly presented as text, it becomes increasingly difficult for people to absorb knowledge and understand this information. How to effectively handle this huge volume of information, refine its classification, and find what is valuable in it is a primary direction of current research.
Text feature extraction based on LDA has already been used for text classification in the prior art. The LDA (Latent Dirichlet Allocation) topic model is an unsupervised learning technique that can identify the latent topic-word information in massive document collections: given a value K, it outputs K topics for every document in the collection in the form of probability distributions. However, because different documents have different numbers of topics, uniformly taking K topic features introduces a small fraction of interfering features, which lowers accuracy. Using the obtained topic-feature distributions directly as classification features produces high-dimensional vectors that are unfavourable for computation. In addition, the LDA topic model assumes that all documents in a collection are mutually independent, and that all words within a document are mutually independent, so the processing of text information ignores the contextual relations of the text, and the extracted features ignore them as well.
In summary, existing text-classification methods ignore variation in the number of text features, which causes superfluous feature information to be extracted, and ignore the contextual relations of text, reducing classification accuracy.
Summary of the invention
To solve the problems that existing text-classification methods ignore variation in the number of text features, causing superfluous feature information to be extracted, and ignore the contextual relations of text, thereby reducing classification accuracy, the present invention provides a text feature extraction method based on the LDA topic model.
To achieve the above object of the invention, the following technical means are adopted:
A text feature extraction method based on the LDA topic model comprises the following steps:
S1. Pre-process the document, including performing Chinese word segmentation on it and removing stop words;
S2. Train an LDA model and a word2vec model respectively; the LDA (Latent Dirichlet Allocation) model is a well-known document-topic generative model, and word2vec is a well-known family of models used to produce word vectors.
S3. Input the pre-processed document of step S1 into the LDA model; the LDA model extracts from the document the probability distributions of K topic features, each distribution containing T words, where K and T are positive integers;
S4. For each of the K topic-feature probability distributions, compute similarities among its words with the word2vec model and screen the distributions, obtaining the probability distributions of r topic features;
S5. Vectorize the words of each of the r topic-feature probability distributions with the word2vec model, obtaining r topic-feature vectors;
S6. Use the r topic-feature vectors as the feature input.
In the above scheme, the document is processed by the LDA model to obtain K topic-feature probability distributions; feature similarity over these is then computed with word2vec, and comparison screens out the useful topic features and removes interference terms. By converting topic features into vectors with word2vec, the scheme compensates for the LDA model's neglect of the contextual relations of the text.
Preferably, step S1 uses the jieba segmenter to perform Chinese word segmentation on the document, and removes stop words by loading a stop-word list with jieba; jieba is a well-known Chinese word-segmentation tool.
Preferably, the probability distribution of step S3 is expressed as:
φ_k = {(w_t^k, p_t^k) : t = 1, …, T}, k = 1, …, K
where w_t^k denotes the t-th word of the k-th topic-feature probability distribution, and p_t^k denotes the distribution probability of the t-th word under the k-th topic-feature distribution, k ∈ [1, K], t ∈ [1, T].
Preferably, step S4 specifically comprises:
S41. For the first word w_1^k of the k-th topic-feature probability distribution, compute its similarity to each of w_2^k, …, w_T^k with the word2vec model, obtaining sim(w_1^k, w_2^k), …, sim(w_1^k, w_T^k), T − 1 values in total;
S42. Sum the T − 1 values and take their average: θ_k = (1/(T − 1)) · Σ_{t=2}^{T} sim(w_1^k, w_t^k); record θ = {θ_1, θ_2, …, θ_K};
S43. Set the threshold ε as the mean of the θ_k: ε = (1/K) · Σ_{k=1}^{K} θ_k;
S44. Compare each θ_k with the threshold ε: if θ_k ≥ ε, retain the corresponding topic-feature probability distribution as a useful one; if θ_k < ε, treat the corresponding topic-feature distribution as redundant information and discard it; here k ∈ [1, K] and there are K values θ_k in total.
The r topic-feature probability distributions finally retained are denoted φ'_1, φ'_2, …, φ'_r.
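The screening of steps S41 to S44 can be sketched as below. This is an illustrative implementation under stated assumptions: a real implementation would query a trained word2vec model for similarities, whereas here `sim` is a toy cosine-similarity stand-in over two hand-made 2-dimensional vectors per word, and the topic word lists are invented.

```python
# Sketch of steps S41-S44: keep topics whose average similarity between the
# first word and the remaining words is at least the mean over all topics.
import math

# Toy word vectors standing in for a trained word2vec model.
TOY_VECS = {
    "cat": (1.0, 0.0), "dog": (0.9, 0.1), "pet": (0.8, 0.2),
    "tax": (0.0, 1.0), "law": (0.1, 0.9),
}

def sim(a, b):
    """Cosine similarity between two toy word vectors (word2vec stand-in)."""
    va, vb = TOY_VECS[a], TOY_VECS[b]
    dot = sum(x * y for x, y in zip(va, vb))
    na = math.sqrt(sum(x * x for x in va))
    nb = math.sqrt(sum(x * x for x in vb))
    return dot / (na * nb)

def screen_topics(topics):
    """S41-S42: theta_k per topic; S43: epsilon = mean theta; S44: filter."""
    thetas = []
    for words in topics:
        first, rest = words[0], words[1:]
        thetas.append(sum(sim(first, w) for w in rest) / len(rest))
    eps = sum(thetas) / len(thetas)
    return [t for t, th in zip(topics, thetas) if th >= eps]

topics = [["cat", "dog", "pet"],   # coherent topic -> high theta, retained
          ["cat", "tax", "law"]]   # mixed topic -> low theta, discarded
print(screen_topics(topics))       # -> [['cat', 'dog', 'pet']]
```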
Preferably, step S5 specifically comprises:
S51. The word2vec model vectorizes the words of the r topic-feature probability distributions respectively, obtaining r × T vectors, denoted v_t^j, where v_t^j denotes the vector of the t-th word under the j-th topic-feature distribution, t ∈ [1, T], j ∈ [1, r];
S52. Combine the T vectors under each topic feature by a weighted average, the weight of each word being its corresponding distribution probability, so that each topic feature is represented by a single vector:
V_s = (Σ_{t=1}^{T} p_t^s · v_t^s) / (Σ_{t=1}^{T} p_t^s)
where s ∈ [1, r]; the resulting r topic-feature vectors are denoted V_1, V_2, …, V_r.
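The weighted average of steps S51 and S52 can be sketched as follows. The 2-dimensional vectors and probabilities below are invented toy values; a real word2vec model would supply vectors of a few hundred dimensions, and the division by the weight sum reflects one reasonable reading of "weighted average" here.

```python
# Sketch of steps S51-S52: represent one retained topic by the weighted mean
# of its T word vectors, weights being the words' distribution probabilities.
import numpy as np

def topic_vector(word_vecs, probs):
    """Weighted mean of the T word vectors of one topic (the vector V_s)."""
    word_vecs = np.asarray(word_vecs, dtype=float)   # shape (T, d)
    probs = np.asarray(probs, dtype=float)           # shape (T,)
    return probs @ word_vecs / probs.sum()           # shape (d,)

vecs = [[1.0, 0.0], [0.0, 1.0]]   # T = 2 toy word vectors
probs = [0.75, 0.25]              # their distribution probabilities
print(topic_vector(vecs, probs))  # -> [0.75 0.25]
```

Applying `topic_vector` to each of the r retained distributions yields the r topic-feature vectors V_1, …, V_r of step S6.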
Compared with the prior art, the beneficial effects of the technical solution of the present invention are:
The method extracts K topic-feature probability distributions from a document with the LDA model, then computes feature similarity over them with word2vec; comparison screens out the useful topic features, adapting to the differing number of dominant topic features of each input document, thereby effectively controlling the extraction of useful features and removing interference terms, so that the features are numerous but not miscellaneous and feature optimization is achieved. By converting topic features into vectors with word2vec, the method compensates for the LDA model's neglect of the contextual relations of the text, and the weighted-mean step reduces the dimensionality of the topic features. The method solves the problems that existing text-classification approaches ignore variation in the number of text features, causing superfluous feature information to be extracted, and ignore the contextual relations of text, improving classification accuracy.
Description of the drawings
Fig. 1 is the overall flow chart of the method of the present invention.
Fig. 2 is the probability distribution graph of the topic features in one embodiment of the invention.
Specific embodiment
The attached figures are only for illustrative purposes and shall not be understood as limiting the patent;
For better illustration of the embodiment, certain components in the drawings may be omitted, enlarged, or reduced, and do not represent the size of the actual product;
It will be understood by those skilled in the art that certain well-known structures and their descriptions may be omitted in the drawings.
The technical solution of the present invention is further described below with reference to the accompanying drawings and embodiments.
As shown in Fig. 1, a text feature extraction method based on the LDA topic model comprises the following steps:
S1. Pre-process the document, including performing Chinese word segmentation on it and removing stop words;
S2. Train an LDA model and a word2vec model respectively;
S3. Input the pre-processed document of step S1 into the LDA model; the LDA model extracts from the document the probability distributions of K topic features, each distribution containing T words, where K and T are positive integers. In this embodiment the number of topic features K is set to 50 and the number of main words T per distribution is 10; the probability distribution graph of the topic features of this embodiment is shown in Fig. 2;
S4. For each of the K topic-feature probability distributions, compute similarities among its words with the word2vec model and screen the distributions, obtaining the probability distributions of r topic features;
S5. Vectorize the words of each of the r topic-feature probability distributions with the word2vec model, obtaining r topic-feature vectors;
S6. Use the r topic-feature vectors as the feature input.
Here, step S1 uses the jieba segmenter to perform Chinese word segmentation on the document, and removes stop words by loading a stop-word list with jieba.
Here, the probability distribution of step S3 is expressed as:
φ_k = {(w_t^k, p_t^k) : t = 1, …, T}, k = 1, …, K
where w_t^k denotes the t-th word of the k-th topic-feature probability distribution, and p_t^k denotes the distribution probability of the t-th word under the k-th topic-feature distribution, k ∈ [1, K], t ∈ [1, T].
Here, step S4 specifically comprises:
S41. For the first word w_1^k of the k-th topic-feature probability distribution, compute its similarity to each of w_2^k, …, w_T^k with the word2vec model, obtaining sim(w_1^k, w_2^k), …, sim(w_1^k, w_T^k), T − 1 values in total;
S42. Sum the T − 1 values and take their average: θ_k = (1/(T − 1)) · Σ_{t=2}^{T} sim(w_1^k, w_t^k); record θ = {θ_1, θ_2, …, θ_K};
S43. Set the threshold ε as the mean of the θ_k: ε = (1/K) · Σ_{k=1}^{K} θ_k;
S44. Compare each θ_k with the threshold ε: if θ_k ≥ ε, retain the corresponding topic-feature probability distribution as a useful one; if θ_k < ε, treat the corresponding topic-feature distribution as redundant information and discard it; here k ∈ [1, K] and there are K values θ_k in total.
The r topic-feature probability distributions finally retained are denoted φ'_1, φ'_2, …, φ'_r.
Here, step S5 specifically comprises:
S51. The word2vec model vectorizes the words of the r topic-feature probability distributions respectively, obtaining r × T vectors, denoted v_t^j, where v_t^j denotes the vector of the t-th word under the j-th topic-feature distribution, t ∈ [1, T], j ∈ [1, r];
S52. Combine the T vectors under each topic feature by a weighted average, the weight of each word being its corresponding distribution probability, so that each topic feature is represented by a single vector:
V_s = (Σ_{t=1}^{T} p_t^s · v_t^s) / (Σ_{t=1}^{T} p_t^s)
where s ∈ [1, r]; the resulting r topic-feature vectors are denoted V_1, V_2, …, V_r.
The terms describing positional relationships in the drawings are only for illustration and shall not be understood as limiting this patent;
Obviously, the above embodiment of the present invention is merely an example for clearly illustrating the invention, and is not a restriction on the embodiments of the present invention. For those of ordinary skill in the art, other variations or changes in different forms can be made on the basis of the above description. It is neither necessary nor possible to exhaust all the embodiments here. Any modifications, equivalent replacements, and improvements made within the spirit and principle of the invention shall be included within the protection scope of the claims of the present invention.
Claims (5)
1. A text feature extraction method based on the LDA topic model, characterized by comprising the following steps:
S1. pre-processing the document, including performing Chinese word segmentation on it and removing stop words;
S2. training an LDA model and a word2vec model respectively;
S3. inputting the pre-processed document of step S1 into the LDA model, the LDA model extracting from the document the probability distributions of K topic features, each distribution containing T words, K and T being positive integers;
S4. for each of the K topic-feature probability distributions, computing similarities among its words with the word2vec model and screening the distributions, obtaining the probability distributions of r topic features;
S5. vectorizing the words of each of the r topic-feature probability distributions with the word2vec model, obtaining r topic-feature vectors;
S6. using the r topic-feature vectors as the feature input.
2. The text feature extraction method according to claim 1, characterized in that step S1 uses the jieba segmenter to perform Chinese word segmentation on the document and removes stop words by loading a stop-word list with jieba.
3. The text feature extraction method according to claim 1, characterized in that the probability distribution of step S3 is expressed as:
φ_k = {(w_t^k, p_t^k) : t = 1, …, T}, k = 1, …, K
where w_t^k denotes the t-th word of the k-th topic-feature probability distribution, and p_t^k denotes the distribution probability of the t-th word under the k-th topic-feature distribution, k ∈ [1, K], t ∈ [1, T].
4. The text feature extraction method according to claim 3, characterized in that step S4 specifically comprises:
S41. for the first word w_1^k of the k-th topic-feature probability distribution, computing its similarity to each of w_2^k, …, w_T^k with the word2vec model, obtaining sim(w_1^k, w_2^k), …, sim(w_1^k, w_T^k), T − 1 values in total;
S42. summing the T − 1 values and taking their average: θ_k = (1/(T − 1)) · Σ_{t=2}^{T} sim(w_1^k, w_t^k); recording θ = {θ_1, θ_2, …, θ_K};
S43. setting the threshold ε as the mean of the θ_k: ε = (1/K) · Σ_{k=1}^{K} θ_k;
S44. comparing each θ_k with the threshold ε: if θ_k ≥ ε, retaining the corresponding topic-feature probability distribution as a useful one; if θ_k < ε, treating the corresponding topic-feature distribution as redundant information and discarding it; wherein k ∈ [1, K] and there are K values θ_k in total;
the r topic-feature probability distributions finally retained being denoted φ'_1, φ'_2, …, φ'_r.
5. The text feature extraction method according to claim 4, characterized in that step S5 specifically comprises:
S51. the word2vec model vectorizing the words of the r topic-feature probability distributions respectively, obtaining r × T vectors, denoted v_t^j, where v_t^j denotes the vector of the t-th word under the j-th topic-feature distribution, t ∈ [1, T], j ∈ [1, r];
S52. combining the T vectors under each topic feature by a weighted average, the weight of each word being its corresponding distribution probability, so that each topic feature is represented by a single vector:
V_s = (Σ_{t=1}^{T} p_t^s · v_t^s) / (Σ_{t=1}^{T} p_t^s)
where s ∈ [1, r]; the resulting r topic-feature vectors being denoted V_1, V_2, …, V_r.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811595082.3A CN109739951A (en) | 2018-12-25 | 2018-12-25 | A kind of text feature based on LDA topic model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109739951A true CN109739951A (en) | 2019-05-10 |
Family
ID=66360275
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811595082.3A Pending CN109739951A (en) | 2018-12-25 | 2018-12-25 | A kind of text feature based on LDA topic model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109739951A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111159335A (en) * | 2019-12-12 | 2020-05-15 | 中国电子科技集团公司第七研究所 | Short text classification method based on pyramid pooling and LDA topic model |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104636456A (en) * | 2015-02-03 | 2015-05-20 | 大连理工大学 | Question routing method based on word vectors |
CN107122349A (en) * | 2017-04-24 | 2017-09-01 | 无锡中科富农物联科技有限公司 | A kind of feature word of text extracting method based on word2vec LDA models |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106599198B (en) | Image description method of multi-cascade junction cyclic neural network | |
CN109783818B (en) | Enterprise industry classification method | |
CN106055538B (en) | The automatic abstracting method of the text label that topic model and semantic analysis combine | |
CN106383877B (en) | Social media online short text clustering and topic detection method | |
CN108363816A (en) | Open entity relation extraction method based on sentence justice structural model | |
Mouhcine et al. | Recognition of cursive Arabic handwritten text using embedded training based on HMMs | |
CN105302884B (en) | Webpage mode identification method and visual structure learning method based on deep learning | |
CN110532395B (en) | Semantic embedding-based word vector improvement model establishing method | |
CN108710611A (en) | A kind of short text topic model generation method of word-based network and term vector | |
CN109446423B (en) | System and method for judging sentiment of news and texts | |
CN108154156B (en) | Image set classification method and device based on neural topic model | |
CN107480688A (en) | Fine granularity image-recognizing method based on zero sample learning | |
Elleuch et al. | Towards unsupervised learning for Arabic handwritten recognition using deep architectures | |
CN108733647A (en) | A kind of term vector generation method based on Gaussian Profile | |
CN111125370A (en) | Relation extraction method suitable for small samples | |
WO2023173555A1 (en) | Model training method and apparatus, text classification method and apparatus, device, and medium | |
Shi et al. | Multitask training with unlabeled data for end-to-end sign language fingerspelling recognition | |
Al-Hmouz et al. | Enhanced numeral recognition for handwritten multi-language numerals using fuzzy set-based decision mechanism | |
CN109739951A (en) | A kind of text feature based on LDA topic model | |
CN113779283A (en) | Fine-grained cross-media retrieval method with deep supervision and feature fusion | |
CN113204975A (en) | Sensitive character wind identification method based on remote supervision | |
CN110597982A (en) | Short text topic clustering algorithm based on word co-occurrence network | |
CN113221885B (en) | Hierarchical modeling method and system based on whole words and radicals | |
Yang et al. | Tb-CNN: joint tree-bank information for sentiment analysis using CNN | |
CN110377845B (en) | Collaborative filtering recommendation method based on interval semi-supervised LDA |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190510 |