CN108710611A - A kind of short text topic model generation method of word-based network and term vector - Google Patents
- Publication number
- CN108710611A CN108710611A CN201810473370.5A CN201810473370A CN108710611A CN 108710611 A CN108710611 A CN 108710611A CN 201810473370 A CN201810473370 A CN 201810473370A CN 108710611 A CN108710611 A CN 108710611A
- Authority
- CN
- China
- Prior art keywords
- word
- document
- pseudo
- network
- short text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The present invention proposes a short text topic model generation method based on a word network and word vectors, comprising the following steps. 1) Learn semantic information: a. perform word segmentation and remove stop words; b. learn word vectors from the preprocessed short text data; c. compute the semantic similarity between words. 2) Build a pseudo document for each word: a. obtain a word co-occurrence list based on semantic similarity and build a word network; b. compute arithmetic relations of the word vectors to obtain a latent word list; c. check the pseudo-document length and decide whether to add similar words. 3) Perform LDA topic modeling on each pseudo document to obtain the topic and word frequency distributions of the original documents. By introducing semantic information into the construction of pseudo documents and performing topic modeling on them, the invention alleviates the sparsity and imbalance problems of short text data, improving performance on tasks over short texts such as topic discovery, text classification, and text clustering.
Description
Technical Field
The invention relates to the field of text topic model construction, in particular to a short text topic model generation method based on a word network and a word vector.
Background
With the rapid development of the Internet and the rapid growth of short text content on it, mining and analyzing short text data has become increasingly urgent. Accurately mining the topics behind these short texts is widely recognized as a challenging and highly promising task.
Owing to the sparsity, immediacy, and irregularity of short texts, traditional topic model algorithms such as pLSA and LDA tend to perform poorly when applied to them directly. As short text research has developed, topic models designed for short texts, such as BTM and WNTM, have been proposed in succession. These models, however, consider only the co-occurrence relationships of words in the corpus. Although building word pairs or word networks yields far richer co-occurrence relationships for modeling than the short texts alone, and thus alleviates the sparsity problem to some extent, the semantic relationships among words are still ignored, so the performance of these topic models on text mining tasks hits a bottleneck.
Disclosure of Invention
Purpose of the invention: the invention provides a short text topic model generation method based on a word network and word vectors, aiming to solve the technical problem that conventional short text topic models consider only word co-occurrence relations and ignore semantic information, resulting in limited performance on tasks such as topic discovery, text classification, and text clustering.
Technical scheme: the invention provides the following technical scheme.
a method for generating a short text topic model based on a word network and a word vector comprises the following steps:
(1) learning semantic information of a text, comprising: preprocessing a document, and performing word vector training on a preprocessed document corpus to obtain a word vector of each word; calculating the similarity between the words according to the word vectors;
(2) constructing a pseudo document for each word in the documents, comprising performing steps (2-1) to (2-5) in turn for each word i:
(2-1) setting a sliding window of size W, and extracting N words including word i through the sliding window to form the word network of word i;
(2-2) constructing a word list L_cooccur(i): extracting the words other than word i with frequency fr_{i,j} and adding them to L_cooccur(i); the frequency fr_{i,j} is computed from Avr_i, the average length of the pseudo document of i after the word network is constructed, sim(i, j), the similarity between word i and word j, σ(·), the sigmoid function, and count(i, j), the number of occurrences of word j in the current word network of word i;
(2-3) constructing a word list L_latent(i): setting a similarity threshold δ; for each word j in the word network, computing the vector w_i + w_j, where w_i and w_j denote the word vectors of word i and word j respectively, finding by cosine similarity the word w_latent most similar to w_i + w_j, and adding w_latent to L_latent(i) when its cosine similarity exceeds the threshold δ;
(2-4) determining whether |L_cooccur(i)| + |L_latent(i)| < L, where L denotes the set minimum length of the pseudo document; if so, selecting the m words in the word network with the highest similarity to word i and adding them to the word list L_similar(i), where m < L;
(2-5) merging the word lists L_cooccur(i), L_latent(i), and L_similar(i) to obtain the pseudo document of word i;
(3) performing LDA topic modeling on each pseudo document to obtain the topic and word frequency distributions of the original documents.
Further, preprocessing the documents comprises performing Chinese word segmentation and stop word removal on the documents.
Further, the expression of sim(i, j) is the cosine similarity of the word vectors: sim(i, j) = (w_i · w_j) / (‖w_i‖ ‖w_j‖).
further, the Word vector training adopts a Word2Vec model method.
Beneficial effects: compared with the prior art, the method builds a word network, constructs a pseudo document for each word in the short text data by training word vectors and computing word similarity, and then performs LDA topic modeling. It thereby overcomes difficulties of short texts such as sparsity and imbalance, and improves model performance by introducing semantic information.
Drawings
FIG. 1 is a flow chart of a short text topic model generation method based on word networks and word vectors according to the present invention;
FIG. 2 is a schematic flow diagram of constructing a word network;
FIG. 3 is a flow chart of constructing the word list L_latent(i);
FIG. 4 is a flow chart of constructing the word list L_similar(i).
Detailed Description
The present invention will be further described with reference to the accompanying drawings.
Fig. 1 is a flow chart of the present invention, and the whole flow includes three stages:
First, the semantic information learning stage:
Step 1: preprocess the text data. The main operations are word segmentation (this step can be omitted for English short text data, while Chinese text must be segmented, typically with the jieba segmentation tool) and stop word removal.
Step 2: perform word vector training on the preprocessed documents using the Word2Vec model proposed by Mikolov.
Step 3: calculate the similarity between words using the word vectors obtained from training in step 2. Cosine similarity is adopted, calculated as:

sim(i, j) = (w_i · w_j) / (‖w_i‖ ‖w_j‖)

where sim(i, j) denotes the cosine similarity between word i and word j, and w_i, w_j denote the word vectors of word i and word j, respectively.
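The cosine similarity of step 3 can be sketched in plain Python. The word vectors below are illustrative toys, not vectors produced by the patent's Word2Vec training:

```python
import math

def cosine_similarity(wi, wj):
    """sim(i, j) = (w_i . w_j) / (||w_i|| * ||w_j||)."""
    dot = sum(a * b for a, b in zip(wi, wj))
    norm_i = math.sqrt(sum(a * a for a in wi))
    norm_j = math.sqrt(sum(b * b for b in wj))
    return dot / (norm_i * norm_j)

# Toy 3-dimensional word vectors, for illustration only.
w_cat = [0.9, 0.1, 0.3]
w_dog = [0.8, 0.2, 0.4]
w_car = [0.1, 0.9, 0.0]

print(cosine_similarity(w_cat, w_dog))  # near 1: semantically close
print(cosine_similarity(w_cat, w_car))  # smaller: semantically distant
```

In the actual method these vectors would come from a Word2Vec model trained on the preprocessed corpus.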
Second, the pseudo document construction stage: the invention constructs a pseudo document for each word i and then performs topic modeling on the pseudo documents. The pseudo document of each word i consists of three parts, introduced as follows:
step 4, constructing a word network: setting the window size as W, and extracting N words including the word i through a sliding window to form a word network of the word i; fig. 2 is a schematic diagram of a word network constructed by using a sliding window, and it can be seen that words closer to the word i occur more frequently in the word network.
Step 5: construct the co-occurrence word list, denoted L_cooccur(i): extract the words other than word i with frequency fr_{i,j} and add them to L_cooccur(i). The frequency fr_{i,j} is computed from the following quantities: Avr_i, the average length of the pseudo document of i after the word network is constructed; sim(i, j), the similarity between word i and word j; σ(·), the sigmoid function; and count(i, j), the number of times word j appears in the current word network of word i.
Step 6: search for words that are semantically similar to word i but have no co-occurrence relation with it, using the arithmetic relations of the word vectors, and add them to the word list L_latent(i). The specific process is shown in Fig. 3:
For each word j in the word network of word i, compute the vector w_i + w_j and search, by cosine similarity, for the word most similar to it:

w_latent = argmax_w cos(w, w_i + w_j)

where w_latent denotes the candidate word to be added to L_latent(i), and δ is the set similarity threshold.
Compare the resulting cosine similarity with the threshold δ: if it is greater than δ, add w_latent to the word list L_latent(i); otherwise, do not add it.
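Step 6 can be sketched as follows, under one reading of the patent text: the vocabulary word nearest to w_i + w_j is kept when its cosine similarity to the sum vector exceeds δ. The embedding table and the threshold value are illustrative assumptions:

```python
import math

def cos(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def latent_words(i, network_words, vectors, delta=0.9):
    """For each word j in the word network of i, find the vocabulary word
    closest to the sum vector w_i + w_j; keep it when its cosine
    similarity to that sum exceeds delta. `vectors` is a toy embedding
    table, not trained Word2Vec output."""
    latent = []
    for j in network_words:
        s = [a + b for a, b in zip(vectors[i], vectors[j])]
        cand, best = max(
            ((w, cos(s, v)) for w, v in vectors.items() if w not in (i, j)),
            key=lambda t: t[1],
        )
        if best > delta:
            latent.append(cand)
    return latent

# Toy 2-dimensional embeddings: "c" lies exactly on the sum of "a" and "b".
vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0], "d": [1.0, 0.1]}
print(latent_words("a", ["b"], vecs))  # ['c']
```

In practice the nearest-neighbour search would be delegated to the trained embedding model rather than a linear scan over the vocabulary.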
Step 7: check the length of the current pseudo document of word i; if it is less than the preset minimum length L, add the m words most similar to word i to the word list L_similar(i). The specific process is shown in Fig. 4:
Determine whether |L_cooccur(i)| + |L_latent(i)| < L, where L denotes the minimum length of the pseudo document; if so, select the m words in the word network with the highest similarity to word i and add them to L_similar(i).
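The length check of step 7 and the merge of step 8 can be sketched together, assuming the co-occurrence list, the latent list, and a similarity-ranked candidate list have already been computed (the word lists below are invented for illustration):

```python
def build_pseudo_document(cooccur, latent, similar_ranked, min_len, m):
    """If the co-occurrence and latent lists together fall short of the
    minimum pseudo-document length L, pad with the m most similar words
    (m < L), then return the merged pseudo document."""
    pseudo = list(cooccur) + list(latent)
    if len(pseudo) < min_len:
        pseudo += similar_ranked[:m]
    return pseudo

doc = build_pseudo_document(
    ["net", "graph"], ["lattice"], ["mesh", "web", "grid"], min_len=5, m=2
)
print(doc)  # ['net', 'graph', 'lattice', 'mesh', 'web']
```

The padding only triggers for under-length pseudo documents, which is how the method compensates for very sparse words.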
Step 8: merge the three word lists L_cooccur(i), L_latent(i) and L_similar(i) obtained in steps 5, 6 and 7 to obtain the final pseudo document of word i.
Step 9: perform LDA topic modeling using the pseudo documents obtained in step 8.
Step 10: infer the topic-word distribution of the original documents from the pseudo-document topic and word probability distributions obtained in step 9.
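The patent applies standard LDA to the pseudo documents. Purely to illustrate the underlying machinery, here is a minimal collapsed Gibbs sampler for LDA in plain Python; it is a didactic sketch, not the patent's implementation, and the toy corpus is invented:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, K, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA. Returns per-topic word
    counts, from which topic-word distributions can be normalized."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    z = [[rng.randrange(K) for _ in d] for d in docs]  # token-topic assignments
    ndk = [[0] * K for _ in docs]                      # doc-topic counts
    nkw = [defaultdict(int) for _ in range(K)]         # topic-word counts
    nk = [0] * K                                       # topic totals
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][n] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return nkw

# Two clearly separated toy "pseudo documents" per theme: animals vs vehicles.
docs = [["cat", "dog", "cat"], ["dog", "cat", "dog"],
        ["car", "bus", "car"], ["bus", "car", "bus"]]
topics = lda_gibbs(docs, K=2)
```

On this toy corpus the sampler typically separates the animal words from the vehicle words into the two topics, though Gibbs sampling is stochastic; in practice an off-the-shelf LDA library would be used on the real pseudo documents.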
In summary, the invention is a short text topic modeling method based on a word network and word vectors that addresses the sparsity, imbalance, and noisiness of short text topic modeling. Building on a word network, it constructs a pseudo document for the words in short text data by training word vectors and computing word similarity, and then performs LDA topic modeling, achieving the final effect of the invention: it overcomes difficulties of short texts such as sparsity and imbalance, and improves model performance by introducing semantic information.
The above description covers only the preferred embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and adaptations without departing from the principles of the invention, and these are intended to fall within the scope of the invention.
Claims (4)
1. A method for generating a short text topic model based on a word network and a word vector is characterized by comprising the following steps:
(1) learning semantic information of a text, comprising: preprocessing a document, and performing word vector training on a preprocessed document corpus to obtain a word vector of each word; calculating the similarity between the words according to the word vectors;
(2) constructing a pseudo document for each word in the documents, comprising performing steps (2-1) to (2-5) in turn for each word i:
(2-1) setting a sliding window of size W, and extracting N words including word i through the sliding window to form the word network of word i;
(2-2) constructing a word list L_cooccur(i): extracting the words other than word i with frequency fr_{i,j} and adding them to L_cooccur(i); the frequency fr_{i,j} is computed from Avr_i, the average length of the pseudo document of i after the word network is constructed, sim(i, j), the similarity between word i and word j, σ(·), the sigmoid function, and count(i, j), the number of occurrences of word j in the word network of word i;
(2-3) constructing a word list L_latent(i): setting a similarity threshold δ; for each word j in the word network, computing the vector w_i + w_j, where w_i and w_j denote the word vectors of word i and word j respectively, finding by cosine similarity the word w_latent most similar to w_i + w_j, and adding w_latent to L_latent(i) when its cosine similarity exceeds the threshold δ;
(2-4) determining whether |L_cooccur(i)| + |L_latent(i)| < L, where L denotes the set minimum length threshold of the pseudo document; if so, selecting the m words in the word network with the highest similarity to word i and adding them to the word list L_similar(i), where m < L;
(2-5) merging the word lists L_cooccur(i), L_latent(i), and L_similar(i) to obtain the pseudo document of word i;
(3) performing LDA topic modeling on each pseudo document to obtain the topic and word frequency distributions of the original documents.
2. The method of claim 1, wherein preprocessing the documents comprises performing Chinese word segmentation and stop word removal on the documents.
3. The method of claim 1, wherein the expression of sim(i, j) is the cosine similarity of the word vectors: sim(i, j) = (w_i · w_j) / (‖w_i‖ ‖w_j‖).
4. The method of claim 1, wherein the word vector training uses the Word2Vec model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810473370.5A CN108710611B (en) | 2018-05-17 | 2018-05-17 | Short text topic model generation method based on word network and word vector |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810473370.5A CN108710611B (en) | 2018-05-17 | 2018-05-17 | Short text topic model generation method based on word network and word vector |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108710611A true CN108710611A (en) | 2018-10-26 |
CN108710611B CN108710611B (en) | 2021-08-03 |
Family
ID=63868224
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810473370.5A Active CN108710611B (en) | 2018-05-17 | 2018-05-17 | Short text topic model generation method based on word network and word vector |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108710611B (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109359302A (en) * | 2018-10-26 | 2019-02-19 | 重庆大学 | A kind of optimization method of field term vector and fusion sort method based on it |
CN109858028A (en) * | 2019-01-30 | 2019-06-07 | 神思电子技术股份有限公司 | A kind of short text similarity calculating method based on probabilistic model |
CN109857942A (en) * | 2019-03-14 | 2019-06-07 | 北京百度网讯科技有限公司 | For handling the method, apparatus, equipment and storage medium of document |
CN110046340A (en) * | 2018-12-28 | 2019-07-23 | 阿里巴巴集团控股有限公司 | The training method and device of textual classification model |
CN110134786A (en) * | 2019-05-14 | 2019-08-16 | 南京大学 | A kind of short text classification method based on theme term vector and convolutional neural networks |
CN110263343A (en) * | 2019-06-24 | 2019-09-20 | 北京理工大学 | The keyword abstraction method and system of phrase-based vector |
CN110532378A (en) * | 2019-05-13 | 2019-12-03 | 南京大学 | A kind of short text aspect extracting method based on topic model |
CN111897952A (en) * | 2020-06-10 | 2020-11-06 | 中国科学院软件研究所 | Sensitive data discovery method for social media |
CN113051917A (en) * | 2021-04-23 | 2021-06-29 | 东南大学 | Document implicit time inference method based on time window text similarity |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2006039566A2 (en) * | 2004-09-30 | 2006-04-13 | Intelliseek, Inc. | Topical sentiments in electronically stored communications |
CN105955948A (en) * | 2016-04-22 | 2016-09-21 | 武汉大学 | Short text topic modeling method based on word semantic similarity |
CN105975499A (en) * | 2016-04-27 | 2016-09-28 | 深圳大学 | Text subject detection method and system |
CN106294662A (en) * | 2016-08-05 | 2017-01-04 | 华东师范大学 | Inquiry based on context-aware theme represents and mixed index method for establishing model |
CN106327341A (en) * | 2016-08-15 | 2017-01-11 | 首都师范大学 | Weibo user gender deduction method and system based on combined theme |
CN107451187A (en) * | 2017-06-23 | 2017-12-08 | 天津科技大学 | Sub-topic finds method in half structure assigned short text set based on mutual constraint topic model |
- 2018
- 2018-05-17 CN CN201810473370.5A patent/CN108710611B/en active Active
Non-Patent Citations (4)
Title |
---|
LAN JIANG et al.: "Biterm Pseudo Document Topic Model for Short Text", 2016 IEEE 28th International Conference on Tools with Artificial Intelligence * |
MING XU: "Intensity of Relationship Between Words: Using Word Triangles in Topic Discovery for Short Texts", Web and Big Data * |
YUAN ZUO: "Topic Modeling of Short Texts: A Pseudo-Document View", ACM * |
XIONG Shufeng et al.: "Short text sentiment topic model for product review analysis", Acta Automatica Sinica * |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109359302A (en) * | 2018-10-26 | 2019-02-19 | 重庆大学 | A kind of optimization method of field term vector and fusion sort method based on it |
CN110046340A (en) * | 2018-12-28 | 2019-07-23 | 阿里巴巴集团控股有限公司 | The training method and device of textual classification model |
CN109858028A (en) * | 2019-01-30 | 2019-06-07 | 神思电子技术股份有限公司 | A kind of short text similarity calculating method based on probabilistic model |
CN109858028B (en) * | 2019-01-30 | 2022-11-18 | 神思电子技术股份有限公司 | Short text similarity calculation method based on probability model |
CN109857942A (en) * | 2019-03-14 | 2019-06-07 | 北京百度网讯科技有限公司 | For handling the method, apparatus, equipment and storage medium of document |
CN110532378A (en) * | 2019-05-13 | 2019-12-03 | 南京大学 | A kind of short text aspect extracting method based on topic model |
CN110532378B (en) * | 2019-05-13 | 2021-10-26 | 南京大学 | Short text aspect extraction method based on topic model |
CN110134786A (en) * | 2019-05-14 | 2019-08-16 | 南京大学 | A kind of short text classification method based on theme term vector and convolutional neural networks |
CN110263343A (en) * | 2019-06-24 | 2019-09-20 | 北京理工大学 | The keyword abstraction method and system of phrase-based vector |
CN111897952A (en) * | 2020-06-10 | 2020-11-06 | 中国科学院软件研究所 | Sensitive data discovery method for social media |
CN111897952B (en) * | 2020-06-10 | 2022-10-14 | 中国科学院软件研究所 | Sensitive data discovery method for social media |
CN113051917A (en) * | 2021-04-23 | 2021-06-29 | 东南大学 | Document implicit time inference method based on time window text similarity |
Also Published As
Publication number | Publication date |
---|---|
CN108710611B (en) | 2021-08-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108710611B (en) | Short text topic model generation method based on word network and word vector | |
CN107133213B (en) | Method and system for automatically extracting text abstract based on algorithm | |
CN107193801B (en) | Short text feature optimization and emotion analysis method based on deep belief network | |
CN107066553B (en) | Short text classification method based on convolutional neural network and random forest | |
CN107451126B (en) | Method and system for screening similar meaning words | |
CN107085581B (en) | Short text classification method and device | |
WO2019085236A1 (en) | Search intention recognition method and apparatus, and electronic device and readable storage medium | |
CN109960799B (en) | Short text-oriented optimization classification method | |
CN108763348B (en) | Classification improvement method for feature vectors of extended short text words | |
CN109086375B (en) | Short text topic extraction method based on word vector enhancement | |
CN111104510B (en) | Text classification training sample expansion method based on word embedding | |
CN110889282B (en) | Text emotion analysis method based on deep learning | |
CN107291914A (en) | A kind of method and system for generating search engine inquiry expansion word | |
CN107423282A (en) | Semantic Coherence Sexual Themes and the concurrent extracting method of term vector in text based on composite character | |
CN110134958B (en) | Short text topic mining method based on semantic word network | |
CN108920482B (en) | Microblog short text classification method based on lexical chain feature extension and LDA (latent Dirichlet Allocation) model | |
Ritu et al. | Performance analysis of different word embedding models on bangla language | |
CN107357895B (en) | Text representation processing method based on bag-of-words model | |
CN114462392B (en) | Short text feature expansion method based on association degree of subject and association of keywords | |
CN112528653B (en) | Short text entity recognition method and system | |
CN110705272A (en) | Named entity identification method for automobile engine fault diagnosis | |
CN111859961A (en) | Text keyword extraction method based on improved TopicRank algorithm | |
CN109840324A (en) | It is a kind of semantic to strengthen topic model and subject evolution analysis method | |
CN111460147A (en) | Title short text classification method based on semantic enhancement | |
Chang et al. | A METHOD OF FINE-GRAINED SHORT TEXT SENTIMENT ANALYSIS BASED ON MACHINE LEARNING. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |