CN108710611B - Short text topic model generation method based on word network and word vector - Google Patents
- Publication number
- CN108710611B (application CN201810473370.5A)
- Authority
- CN
- China
- Prior art keywords
- word
- document
- words
- network
- short text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a short text topic model generation method based on a word network and word vectors, comprising the following steps: 1) Learn semantic information: a. segment words and remove stop words; b. learn word vectors from the preprocessed short text data; c. calculate the semantic similarity between words. 2) Construct a pseudo document for each word: a. obtain a word co-occurrence list based on semantic similarity and construct a word network; b. compute the arithmetic relations of the word vectors to obtain a latent word list; c. check the length of the pseudo document and decide whether to add similar words. 3) Perform LDA topic modeling on each pseudo document to obtain the topic and word frequency distributions of the original documents. By introducing semantic information to construct pseudo documents and performing topic modeling on them, the invention alleviates the sparsity and imbalance problems of short text data and improves the performance of tasks such as topic discovery, text classification and text clustering on short texts.
Description
Technical Field
The invention relates to the field of text topic model construction, and in particular to a short text topic model generation method based on a word network and word vectors.
Background
With the rapid development of the Internet and the rapid growth of short text content on it, mining and analyzing short text data has become increasingly urgent. Faced with these short texts, accurately mining the topics behind them is widely recognized as a challenging and highly promising task.
Because short texts are sparse, transient and irregular, traditional topic model algorithms applied directly to short texts, such as pLSA and LDA, tend to perform poorly. As short text research has developed, topic models designed for short texts, such as BTM and WNTM, have been proposed in succession. These models, however, consider only the co-occurrence relations of words in the corpus. Although establishing word-pair relations or a word network yields far richer co-occurrence information for modeling than the short texts themselves, and thus alleviates the sparsity problem to some extent, the semantic relations among words are ignored, so the performance of these topic models on text mining tasks hits a bottleneck.
Disclosure of Invention
Purpose of the invention: the invention provides a short text topic model generation method based on a word network and word vectors, aiming to solve the technical problem that conventional short text topic models consider only word co-occurrence relations and ignore semantic information, which limits their performance on tasks such as topic discovery, text classification and text clustering.
The technical scheme is as follows: the technical scheme provided by the invention is as follows:
A method for generating a short text topic model based on a word network and word vectors comprises the following steps:
(1) Learning semantic information of the text, comprising: preprocessing the documents; training word vectors on the preprocessed document corpus to obtain a word vector for each word; and calculating the similarity between words from the word vectors.
(2) Constructing a pseudo document for each word in the documents, comprising performing steps (2-1) to (2-5) for each word i in turn:
(2-1) Setting a sliding window of size W, and extracting N words including the word i through the sliding window to form the word network of word i.
(2-2) Constructing a word list L_cooccur(i): extracting the words other than word i with a frequency fr_{i,j} and adding them to the word list L_cooccur(i); wherein Avr_i is the average length of the pseudo document of i after the word network is constructed, sim(i, j) is the similarity between word i and word j, σ(·) is a sigmoid function, and count(i, j) is the number of occurrences of j in the current word network of word i.
(2-3) Constructing a word list L_latent(i): setting a similarity threshold δ; for each word j in the word network, computing the cosine similarity between candidate words and the vector w_i + w_j, and adding the words whose cosine similarity exceeds the threshold δ to L_latent(i); wherein w_i and w_j denote the word vectors of word i and word j respectively.
(2-4) Judging whether |L_cooccur(i)| + |L_latent(i)| < L is satisfied, where L represents the set minimum length of the pseudo document; if so, selecting the m words in the word network with the highest similarity to word i and adding them to the word list L_similar(i), with m less than L.
(2-5) Merging the word lists L_cooccur(i), L_latent(i) and L_similar(i) to obtain the pseudo document of word i.
(3) Performing LDA topic modeling on each pseudo document to obtain the topic and word frequency distributions of the original documents.
Further, preprocessing the documents comprises performing Chinese word segmentation and stop word removal on the documents.
Further, the expression for sim(i, j) is: sim(i, j) = (w_i · w_j) / (|w_i| · |w_j|), i.e., the cosine similarity between the word vectors w_i and w_j of word i and word j.
further, the Word vector training adopts a Word2Vec model method.
Beneficial effects: compared with the prior art, the method is based on a word network and constructs a pseudo document for each word in the short text data by training word vectors and computing word similarities, and then performs LDA topic modeling. It thereby overcomes difficulties of short texts such as sparsity and imbalance, and improves the performance of the model by introducing semantic information.
Drawings
FIG. 1 is a flow chart of the short text topic model generation method based on a word network and word vectors according to the present invention;
FIG. 2 is a schematic flow chart of constructing the word network;
FIG. 3 is a flow chart of constructing the word list L_latent(i);
FIG. 4 is a flow chart of constructing the word list L_similar(i).
Detailed Description
The present invention will be further described with reference to the accompanying drawings.
Fig. 1 is a flow chart of the present invention, and the whole flow includes three stages:
First, the semantic information learning stage:
Step 1, preprocess the short text documents: perform Chinese word segmentation and remove stop words.
Step 2, train word vectors on the preprocessed corpus to obtain a word vector for each word.
Step 3, calculate the similarity between words using the word vectors obtained by training in step 2. Cosine similarity is adopted, calculated as:
sim(i, j) = (w_i · w_j) / (|w_i| · |w_j|)
where sim(i, j) represents the cosine similarity between word i and word j, and w_i, w_j represent the word vectors of word i and word j respectively.
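The cosine similarity calculation above can be sketched in a few lines of NumPy; the toy vectors are illustrative only.

```python
# Minimal sketch of the word-similarity step: cosine similarity between
# two word vectors, matching the formula sim(i,j) = (wi . wj)/(|wi||wj|).
import numpy as np

def cosine_sim(wi: np.ndarray, wj: np.ndarray) -> float:
    """Cosine similarity between word vectors wi and wj."""
    return float(np.dot(wi, wj) / (np.linalg.norm(wi) * np.linalg.norm(wj)))

wi = np.array([1.0, 0.0, 1.0])   # toy word vector for word i
wj = np.array([1.0, 1.0, 0.0])   # toy word vector for word j
print(round(cosine_sim(wi, wj), 3))  # prints 0.5 for these toy vectors
```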
Second, the pseudo document construction stage: the invention constructs a pseudo document for each word i and then performs topic modeling on the basis of the pseudo documents. The pseudo document of each word i consists of three parts, introduced as follows:
Step 4, construct the word network of word i: set a sliding window of size W and extract N words including word i through the sliding window; the specific process is shown in fig. 2.
Step 5, construct the co-occurrence word list, denoted L_cooccur(i): extract the words other than word i with a frequency fr_{i,j} and add them to the word list L_cooccur(i); wherein Avr_i is the average length of the pseudo document of i after the word network is constructed, sim(i, j) is the similarity between word i and word j, σ(·) is a sigmoid function, and count(i, j) is the number of times j appears in the current word network of word i.
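The count(i, j) statistic referenced above, i.e. how often word j appears in the sliding-window word network of word i, can be sketched as follows. The window size W and the toy corpus are assumptions for illustration, and the full fr_{i,j} weighting (whose formula appears as an image in the original patent) is not implemented here.

```python
# Sketch of collecting count(i, j): co-occurrences of word j within
# word i's sliding windows of size W on either side.
from collections import defaultdict

def window_cooccurrence(docs, W=3):
    """Return counts[i][j] = number of times j co-occurs in i's windows."""
    counts = defaultdict(lambda: defaultdict(int))
    for doc in docs:
        for pos, wi in enumerate(doc):
            lo, hi = max(0, pos - W), min(len(doc), pos + W + 1)
            for wj in doc[lo:pos] + doc[pos + 1:hi]:
                counts[wi][wj] += 1
    return counts

docs = [["a", "b", "c", "a"], ["b", "a", "b"]]
counts = window_cooccurrence(docs, W=2)
print(counts["a"]["b"])  # prints 4 for this toy corpus
```

Note the counts are symmetric by construction, since every window pairing is visited from both endpoints.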
Step 6, use the arithmetic relations of the word vectors to find words that are semantically similar to, but have no co-occurrence relation with, the words in the network, and add them to the word list L_latent(i); the specific process is shown in fig. 3:
For a word j in the word network of word i, compute the vector w_i + w_j and search by cosine similarity for the word w_latent most similar to it; wherein w_latent denotes the word to be added to L_latent(i), and δ is the set similarity threshold.
Compare the computed cosine similarity with the similarity threshold δ: if it is larger than δ, add the word w_latent to the word list L_latent(i); otherwise, do not add it.
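A minimal sketch of this latent-word search follows, assuming a small in-memory vocabulary of toy vectors; the vocabulary, the vector values and the threshold are illustrative assumptions, not data from the patent.

```python
# Sketch of step 6: for word j in word i's network, find the vocabulary
# word closest (by cosine) to w_i + w_j; keep it if similarity > delta.
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

vocab = {  # toy word vectors, chosen only for illustration
    "beijing": np.array([0.8, 0.9]),
    "china":   np.array([0.9, 0.3]),
    "capital": np.array([0.2, 1.0]),
}

def latent_word(i, j, delta=0.8):
    """Return the latent word for the pair (i, j), or None if below delta."""
    target = vocab[i] + vocab[j]
    # Exclude i and j themselves from the candidate search.
    cand = [(w, cos(target, v)) for w, v in vocab.items() if w not in (i, j)]
    best, sim = max(cand, key=lambda t: t[1])
    return best if sim > delta else None

print(latent_word("china", "capital"))  # prints beijing
```

This mirrors the classic word-vector arithmetic intuition (e.g. "china" + "capital" being close to "beijing"), which is what the patent exploits to surface words with no direct co-occurrence.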
Step 7, judge the length of the current pseudo document of word i: determine whether |L_cooccur(i)| + |L_latent(i)| < L is satisfied, where L represents the preset maximum length of the pseudo document; if so, add the m words in the word network most similar to word i to the word list L_similar(i). The specific process is shown in fig. 4.
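The length check and padding of step 7 can be sketched as follows; the list contents and parameter values are illustrative assumptions.

```python
# Sketch of step 7: if the pseudo document built so far is shorter than
# L, pad it with the m words most similar to word i.
def pad_pseudo_doc(l_cooccur, l_latent, sims, L=10, m=3):
    """sims: (word, similarity-to-i) pairs from word i's word network."""
    if len(l_cooccur) + len(l_latent) < L:
        ranked = sorted(sims, key=lambda t: t[1], reverse=True)
        return [w for w, _ in ranked[:m]]  # the L_similar(i) list
    return []  # pseudo document already long enough

sims = [("x", 0.9), ("y", 0.7), ("z", 0.95), ("q", 0.1)]
print(pad_pseudo_doc(["a"], ["b"], sims, L=5, m=2))  # prints ['z', 'x']
```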
Step 8, merge the three word lists L_cooccur(i), L_latent(i) and L_similar(i) obtained in steps 5, 6 and 7 to obtain the final pseudo document of word i.
Step 9, perform LDA topic modeling using the pseudo documents obtained in step 8.
Step 10, infer the topic-word distribution of the original documents using the pseudo document topic and word probability distributions obtained in step 9.
In summary, the invention is a short text topic model method based on a word network and word vectors that addresses the sparsity, imbalance and noise difficulties of short text topic modeling. Based on a word network, it constructs a pseudo document for each word in the short text data by training word vectors and computing word similarities, and then performs LDA topic modeling. The final effect of the invention is that the difficulties of short texts such as sparsity and imbalance are overcome, and the introduction of semantic information improves the performance of the model.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.
Claims (4)
1. A method for generating a short text topic model based on a word network and a word vector is characterized by comprising the following steps:
(1) learning semantic information of a text, comprising: preprocessing a document, and performing word vector training on a preprocessed document corpus to obtain a word vector of each word; calculating the similarity between the words according to the word vectors;
(2) constructing a pseudo document for each word in the documents, comprising performing steps (2-1) to (2-5) for each word i in turn:
(2-1) setting a sliding window of size W, and extracting N words including the word i through the sliding window to form the word network of word i;
(2-2) constructing a word list L_cooccur(i): extracting the words other than word i with a frequency fr_{i,j} and adding them to the word list L_cooccur(i); wherein Avr_i is the average length of the pseudo document of i after the word network is constructed, sim(i, j) is the similarity between word i and word j, σ(·) is a sigmoid function, and count(i, j) is the number of occurrences of j in the word network of word i;
(2-3) constructing a word list L_latent(i): setting a similarity threshold δ; for each word j in the word network, computing the cosine similarity between candidate words and the vector w_i + w_j, and adding the words whose cosine similarity is larger than the similarity threshold δ to L_latent(i); wherein w_i and w_j denote the word vectors of word i and word j respectively;
(2-4) judging whether |L_cooccur(i)| + |L_latent(i)| < L is satisfied, wherein L represents the set maximum length of the pseudo document; if so, selecting the m words in the word network with the highest similarity to word i and adding them to the word list L_similar(i), wherein m is less than L;
(2-5) merging the word lists L_cooccur(i), L_latent(i) and L_similar(i) to obtain the pseudo document of word i;
(3) performing LDA topic modeling on each pseudo document to obtain the topic and word frequency distributions of the original documents.
2. The method of claim 1, wherein preprocessing the document comprises performing Chinese word segmentation and stop word removal on the document.
3. The method of claim 1, wherein the expression for sim(i, j) is sim(i, j) = (w_i · w_j) / (|w_i| · |w_j|), the cosine similarity between the word vectors of word i and word j.
4. The method of claim 1, wherein the word vector training adopts a Word2Vec model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810473370.5A CN108710611B (en) | 2018-05-17 | 2018-05-17 | Short text topic model generation method based on word network and word vector |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810473370.5A CN108710611B (en) | 2018-05-17 | 2018-05-17 | Short text topic model generation method based on word network and word vector |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108710611A CN108710611A (en) | 2018-10-26 |
CN108710611B true CN108710611B (en) | 2021-08-03 |
Family
ID=63868224
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810473370.5A Active CN108710611B (en) | 2018-05-17 | 2018-05-17 | Short text topic model generation method based on word network and word vector |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108710611B (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109359302B (en) * | 2018-10-26 | 2023-04-18 | 重庆大学 | Optimization method of domain word vectors and fusion ordering method based on optimization method |
CN110046340A (en) * | 2018-12-28 | 2019-07-23 | 阿里巴巴集团控股有限公司 | The training method and device of textual classification model |
CN109858028B (en) * | 2019-01-30 | 2022-11-18 | 神思电子技术股份有限公司 | Short text similarity calculation method based on probability model |
CN109857942A (en) * | 2019-03-14 | 2019-06-07 | 北京百度网讯科技有限公司 | For handling the method, apparatus, equipment and storage medium of document |
CN110532378B (en) * | 2019-05-13 | 2021-10-26 | 南京大学 | Short text aspect extraction method based on topic model |
CN110134786B (en) * | 2019-05-14 | 2021-09-10 | 南京大学 | Short text classification method based on subject word vector and convolutional neural network |
CN110263343B (en) * | 2019-06-24 | 2021-06-15 | 北京理工大学 | Phrase vector-based keyword extraction method and system |
CN111897952B (en) * | 2020-06-10 | 2022-10-14 | 中国科学院软件研究所 | Sensitive data discovery method for social media |
CN113051917B (en) * | 2021-04-23 | 2022-11-18 | 东南大学 | Document implicit time inference method based on time window text similarity |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2006039566A2 (en) * | 2004-09-30 | 2006-04-13 | Intelliseek, Inc. | Topical sentiments in electronically stored communications |
CN105955948A (en) * | 2016-04-22 | 2016-09-21 | 武汉大学 | Short text topic modeling method based on word semantic similarity |
CN105975499A (en) * | 2016-04-27 | 2016-09-28 | 深圳大学 | Text subject detection method and system |
CN106294662A (en) * | 2016-08-05 | 2017-01-04 | 华东师范大学 | Inquiry based on context-aware theme represents and mixed index method for establishing model |
CN106327341A (en) * | 2016-08-15 | 2017-01-11 | 首都师范大学 | Weibo user gender deduction method and system based on combined theme |
CN107451187A (en) * | 2017-06-23 | 2017-12-08 | 天津科技大学 | Sub-topic finds method in half structure assigned short text set based on mutual constraint topic model |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2006039566A2 (en) * | 2004-09-30 | 2006-04-13 | Intelliseek, Inc. | Topical sentiments in electronically stored communications |
CN105955948A (en) * | 2016-04-22 | 2016-09-21 | 武汉大学 | Short text topic modeling method based on word semantic similarity |
CN105975499A (en) * | 2016-04-27 | 2016-09-28 | 深圳大学 | Text subject detection method and system |
CN106294662A (en) * | 2016-08-05 | 2017-01-04 | 华东师范大学 | Inquiry based on context-aware theme represents and mixed index method for establishing model |
CN106327341A (en) * | 2016-08-15 | 2017-01-11 | 首都师范大学 | Weibo user gender deduction method and system based on combined theme |
CN107451187A (en) * | 2017-06-23 | 2017-12-08 | 天津科技大学 | Sub-topic finds method in half structure assigned short text set based on mutual constraint topic model |
Non-Patent Citations (4)
Title |
---|
Biterm Pseudo Document Topic Model for Short Text; Lan Jiang et al.; 2016 IEEE 28th International Conference on Tools with Artificial Intelligence; 2016-12-08; pp. 865-872 *
Intensity of Relationship Between Words: Using Word Triangles in Topic Discovery for Short Texts; Ming Xu; Web and Big Data; Springer; 2017-08-03; pp. 642-648 *
Topic Modeling of Short Texts: A Pseudo-Document View; Yuan Zuo; ACM; 2016-08-17; pp. 1-10 *
Short Text Sentiment Topic Model for Product Review Analysis; Xiong Shufeng et al.; Acta Automatica Sinica; 2016-08-31; vol. 42, no. 8, pp. 1227-1237 *
Also Published As
Publication number | Publication date |
---|---|
CN108710611A (en) | 2018-10-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108710611B (en) | Short text topic model generation method based on word network and word vector | |
CN107085581B (en) | Short text classification method and device | |
CN108052593B (en) | Topic keyword extraction method based on topic word vector and network structure | |
CN107133213B (en) | Method and system for automatically extracting text abstract based on algorithm | |
WO2019085236A1 (en) | Search intention recognition method and apparatus, and electronic device and readable storage medium | |
CN102799647B (en) | Method and device for webpage reduplication deletion | |
CN108763348B (en) | Classification improvement method for feature vectors of extended short text words | |
CN111104510B (en) | Text classification training sample expansion method based on word embedding | |
CN109086375B (en) | Short text topic extraction method based on word vector enhancement | |
CN110134958B (en) | Short text topic mining method based on semantic word network | |
CN107291914A (en) | A kind of method and system for generating search engine inquiry expansion word | |
CN110889282B (en) | Text emotion analysis method based on deep learning | |
US20140032207A1 (en) | Information Classification Based on Product Recognition | |
CN109840324B (en) | Semantic enhancement topic model construction method and topic evolution analysis method | |
CN107357895B (en) | Text representation processing method based on bag-of-words model | |
CN110728144B (en) | Extraction type document automatic summarization method based on context semantic perception | |
CN109918507B (en) | textCNN (text-based network communication network) improved text classification method | |
CN112528653B (en) | Short text entity recognition method and system | |
CN110705272A (en) | Named entity identification method for automobile engine fault diagnosis | |
CN112860889A (en) | BERT-based multi-label classification method | |
CN114462392A (en) | Short text feature expansion method based on topic relevance and keyword association | |
CN111460147A (en) | Title short text classification method based on semantic enhancement | |
CN111353045A (en) | Method for constructing text classification system | |
CN114491062B (en) | Short text classification method integrating knowledge graph and topic model | |
CN111061866B (en) | Barrage text clustering method based on feature expansion and T-oBTM |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |