CN108710611A - Short text topic model generation method based on word network and word vector - Google Patents

Short text topic model generation method based on word network and word vector

Info

Publication number
CN108710611A
CN108710611A (application CN201810473370.5A)
Authority
CN
China
Prior art keywords
word
document
pseudo
network
short text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810473370.5A
Other languages
Chinese (zh)
Other versions
CN108710611B (en)
Inventor
张雷
唐驰
陆恒杨
徐鸣
王崇骏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CN201810473370.5A
Publication of CN108710611A
Application granted
Publication of CN108710611B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention proposes a short text topic model generation method based on a word network and word vectors, comprising the following steps: 1) learn semantic information: a) segment words and remove stop words; b) learn word vectors from the preprocessed short text data; c) compute the semantic similarity between words; 2) build a pseudo document for each word: a) obtain the word co-occurrence list based on semantic similarity and build the word network; b) compute arithmetic relations over the word vectors to obtain the latent word list; c) check the pseudo document length and decide whether to add similar words; 3) perform LDA topic modeling on each pseudo document to obtain the topic and word frequency distributions of the original documents. By introducing semantic information into the construction of pseudo documents and modeling topics over those pseudo documents, the invention addresses the sparsity and imbalance of short text data, improving performance on short text tasks such as topic discovery, text classification, and text clustering.

Description

Short text topic model generation method based on word network and word vector
Technical Field
The invention relates to the field of text topic model construction, in particular to a short text topic model generation method based on a word network and a word vector.
Background
With the rapid development of the internet and the rapid growth of short text content on it, the mining and analysis of short text data has become increasingly urgent. Faced with these short texts, accurately mining the topics behind them is widely recognized as a challenging and highly promising task.
Due to the sparsity, transience, and irregularity of short texts, traditional topic model algorithms such as pLSA and LDA tend to perform poorly when applied to short texts directly. As short text research has progressed, topic models designed for short texts, such as BTM and WNTM, have been proposed in succession. However, these models consider only the co-occurrence relationships of words in the corpus: although building word-pair relationships or a word network provides far richer co-occurrence information for modeling than the short texts themselves, and thus alleviates the sparsity problem to some extent, the semantic relationships among words are ignored, so the performance of these topic models on text mining tasks faces a bottleneck.
Disclosure of Invention
The purpose of the invention is as follows: the invention provides a short text topic model generation method based on a word network and word vectors, aiming to solve the technical problem that conventional short text topic models consider only word co-occurrence relations while ignoring semantic information, which limits their performance on tasks such as topic discovery, text classification, and text clustering.
The technical scheme provided by the invention is as follows:
a method for generating a short text topic model based on a word network and a word vector comprises the following steps:
(1) learning semantic information of a text, comprising: preprocessing a document, and performing word vector training on a preprocessed document corpus to obtain a word vector of each word; calculating the similarity between the words according to the word vectors;
(2) constructing a pseudo document for each word in the document, including performing steps (2-1) to (2-5) for each word i in turn:
(2-1) setting a sliding window with the size of W, and extracting N words including the word i through the sliding window to form a word network of the word i;
(2-2) constructing a word list L_cooccur(i): extracting each word j other than the word i from the word network and adding it to L_cooccur(i) with frequency fr_{i,j}, where fr_{i,j} = Avr_i · σ(sim(i, j)) · count(i, j) / Σ_k count(i, k); Avr_i is the average pseudo document length after the word network is constructed, sim(i, j) is the similarity between the word i and the word j, σ(·) is the sigmoid function, and count(i, j) is the number of occurrences of j in the current word network of the word i;
(2-3) constructing a word list L_latent(i): setting a similarity threshold δ; for each word j in the word network, computing the vector w_i + w_j and finding the word w_latent whose vector has the highest cosine similarity to w_i + w_j; if that cosine similarity is larger than the similarity threshold δ, adding w_latent to L_latent(i); wherein w_i and w_j denote the word vectors of the word i and the word j, respectively;
(2-4) determining whether |L_cooccur(i)| + |L_latent(i)| < L is satisfied, wherein L represents the set minimum length of the pseudo document; if so, selecting the m words in the word network with the highest similarity to the word i and adding them to the word list L_similar(i), where m < L;
(2-5) merging the word lists L_cooccur(i), L_latent(i), and L_similar(i) to obtain the pseudo document of the word i;
(3) performing LDA topic modeling on each pseudo document to obtain the topic and word frequency distributions of the original documents.
Further, the preprocessing of the document includes performing Chinese word segmentation and stop word removal on the document.
Further, the expression of sim(i, j) is: sim(i, j) = (w_i · w_j) / (‖w_i‖ ‖w_j‖).
further, the Word vector training adopts a Word2Vec model method.
Beneficial effects: compared with the prior art, the method starts from a word network and constructs a pseudo document for each word in the short text data by training word vectors and computing word similarities, then performs LDA topic modeling. This overcomes the sparsity and imbalance of short texts, and the introduction of semantic information improves the performance of the model.
Drawings
FIG. 1 is a flow chart of a short text topic model generation method based on word networks and word vectors according to the present invention;
FIG. 2 is a schematic flow diagram of constructing a word network;
FIG. 3 is a flow chart of constructing the word list L_latent(i);
FIG. 4 is a flow chart of constructing the word list L_similar(i).
Detailed Description
The present invention will be further described with reference to the accompanying drawings.
Fig. 1 is a flow chart of the present invention, and the whole flow includes three stages:
First, the semantic information learning stage:
Step 1: preprocess the text data. The main operations are word segmentation (English short texts can skip this step; Chinese text must be segmented, typically with the jieba segmentation tool) and stop word removal.
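As an illustrative sketch only (not part of the patent text), step 1 can be approximated in Python with the jieba tokenizer named above; the stop word set and sample texts are placeholder assumptions:

```python
import jieba

# Placeholder stop word set; in practice, load a standard stop word file.
STOP_WORDS = {"的", "了", "是", "在", "一个"}

def preprocess(docs):
    """Segment each short text with jieba and drop stop words."""
    corpus = []
    for doc in docs:
        tokens = [t for t in jieba.lcut(doc) if t.strip() and t not in STOP_WORDS]
        corpus.append(tokens)
    return corpus

docs = ["短文本主题模型的一个例子", "短文本数据是稀疏的"]
corpus = preprocess(docs)  # list of token lists, one per document
```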
Step 2: perform word vector training on the preprocessed documents using the Word2Vec model proposed by Mikolov.
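A minimal training sketch with gensim's Word2Vec (4.x API); the hyperparameter values are illustrative assumptions, not values fixed by the patent:

```python
from gensim.models import Word2Vec

# corpus: list of token lists produced by the preprocessing step.
model = Word2Vec(
    sentences=corpus,
    vector_size=100,  # word vector dimensionality (assumed)
    window=5,         # training context window (assumed)
    min_count=1,      # keep rare words, since short text corpora are sparse
    sg=1,             # skip-gram variant
)
w_i = model.wv[corpus[0][0]]  # learned vector for the first token of the first text
```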
Step 3: compute the similarity between words using the word vectors trained in step 2. Cosine similarity is adopted, calculated as follows:

sim(i, j) = (w_i · w_j) / (‖w_i‖ ‖w_j‖)

where sim(i, j) denotes the cosine similarity between the word i and the word j, and w_i, w_j denote their word vectors, respectively.
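The formula above maps directly onto the following sketch; gensim's model.wv.similarity(i, j) computes the same quantity:

```python
import numpy as np

def sim(w_i, w_j):
    """Cosine similarity between two word vectors."""
    return float(np.dot(w_i, w_j) / (np.linalg.norm(w_i) * np.linalg.norm(w_j)))
```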
Second, the pseudo document construction stage: the invention constructs a pseudo document for each word i and then performs topic modeling on the basis of these pseudo documents. The pseudo document of each word i consists of three parts, introduced as follows:
step 4, constructing a word network: setting the window size as W, and extracting N words including the word i through a sliding window to form a word network of the word i; fig. 2 is a schematic diagram of a word network constructed by using a sliding window, and it can be seen that words closer to the word i occur more frequently in the word network.
Step 5: construct the co-occurrence word list, denoted L_cooccur(i). Extract each word j other than the word i from the word network and add it to L_cooccur(i) with frequency fr_{i,j}. The calculation formula of fr_{i,j} is as follows:

fr_{i,j} = Avr_i · σ(sim(i, j)) · count(i, j) / Σ_k count(i, k)

where Avr_i is the average pseudo document length after the word network is constructed, sim(i, j) is the similarity between the word i and the word j, σ(·) is the sigmoid function, count(i, j) is the number of times j appears in the current word network of the word i, and the sum runs over all words k in that network.
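A sketch of step 5 under the reconstruction of fr_{i,j} given above; the exact normalization in the granted patent may differ, and Avr_i is treated here as a tunable target length:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def build_cooccur_list(i, networks, model, avr_i=30):
    """L_cooccur(i): each neighbour j is repeated round(fr_{i,j}) times."""
    counts = networks[i]
    total = sum(counts.values())
    pseudo_doc = []
    for j, c in counts.items():
        fr_ij = avr_i * sigmoid(model.wv.similarity(i, j)) * c / total
        pseudo_doc.extend([j] * max(1, round(fr_ij)))
    return pseudo_doc
```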
Step 6: search for words that are semantically similar to but have no co-occurrence relation with the word i by using the arithmetic relations of the word vectors, and add them to the word list L_latent(i). The specific process is shown in Fig. 3:
For each word j in the word network of the word i, compute the vector w_i + w_j and search the vocabulary for the word most similar to it. The cosine similarity is calculated as follows:

cos(w_latent, w_i + w_j) = (w_latent · (w_i + w_j)) / (‖w_latent‖ ‖w_i + w_j‖)

where w_latent denotes the candidate word to be added to L_latent(i), found by cosine similarity search as the word most similar to w_i + w_j, and δ is the set similarity threshold.

Compare the calculated cosine similarity with the threshold δ: if it is larger than δ, add the word w_latent to the word list L_latent(i); otherwise, do not add it.
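A sketch of step 6: for each neighbour j, gensim's most_similar searches the vocabulary for the word closest to w_i + w_j, and the candidate is kept only if its cosine similarity clears the threshold δ; skipping words already in the network is an assumption consistent with "similar semantics but no co-occurrence relation", and δ = 0.7 is illustrative:

```python
def build_latent_list(i, networks, model, delta=0.7):
    """L_latent(i): words discovered through the vector arithmetic w_i + w_j."""
    latent = []
    for j in networks[i]:
        target = model.wv[i] + model.wv[j]
        # Candidates ranked by cosine similarity to w_i + w_j.
        for w_latent, cos_sim in model.wv.most_similar(positive=[target], topn=5):
            if w_latent in (i, j) or w_latent in networks[i]:
                continue  # keep only words with no co-occurrence relation to i
            if cos_sim > delta:
                latent.append(w_latent)
            break  # only the best admissible candidate is considered
    return latent
```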
Step 7: judge the length of the current pseudo document of the word i; if it is shorter than the preset minimum length L, add the m words most similar to the word i to the word list L_similar(i). The specific process is shown in Fig. 4:

Judge whether |L_cooccur(i)| + |L_latent(i)| < L is satisfied, where L represents the minimum length of the pseudo document; if so, select the m words in the word network with the highest similarity to the word i and add them to the word list L_similar(i).
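Step 7 as a sketch: if the pseudo document is still shorter than the minimum length L, pad it with the m network words most similar to the word i; L = 30 and m = 10 are illustrative assumptions:

```python
def build_similar_list(i, networks, model, current_len, L=30, m=10):
    """L_similar(i): top-m network words by similarity to i, used only as padding."""
    if current_len >= L:
        return []
    ranked = sorted(networks[i], key=lambda j: model.wv.similarity(i, j), reverse=True)
    return ranked[:m]
```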
Step 8: combine the three word lists L_cooccur(i), L_latent(i), and L_similar(i) obtained in steps 5, 6, and 7 to form the final pseudo document of the word i.
Step 9: perform LDA topic modeling on the pseudo documents obtained in step 8.

Step 10: infer the topic-word distribution of the original documents using the pseudo document topic and word probability distributions obtained in step 9.
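Steps 8 to 10 assembled into a single sketch with gensim's LdaModel; the topic count and pass count are assumptions, and the last line folds an original short text back in to estimate its topic distribution:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Step 8: one pseudo document per word i.
pseudo_docs = []
for i in networks:
    doc = build_cooccur_list(i, networks, model)
    doc += build_latent_list(i, networks, model)
    doc += build_similar_list(i, networks, model, current_len=len(doc))
    pseudo_docs.append(doc)

# Step 9: LDA topic modeling over the pseudo documents.
dictionary = Dictionary(pseudo_docs)
bow_corpus = [dictionary.doc2bow(d) for d in pseudo_docs]
lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=20, passes=10)

# Step 10: infer the topic distribution of an original short text.
theta = lda.get_document_topics(dictionary.doc2bow(corpus[0]))
```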
In summary, the invention is a short text topic model method based on word networks and word vectors, addressing the sparsity, imbalance, and noise that trouble short text topic modeling. Building on a word network, it constructs a pseudo document for each word in the short text data by training word vectors and computing word similarities, and then performs LDA topic modeling. The final effect of the invention is that the sparsity and imbalance of short texts are overcome, and the introduction of semantic information improves the performance of the model.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims (4)

1. A method for generating a short text topic model based on a word network and a word vector is characterized by comprising the following steps:
(1) learning semantic information of a text, comprising: preprocessing a document, and performing word vector training on a preprocessed document corpus to obtain a word vector of each word; calculating the similarity between the words according to the word vectors;
(2) constructing a pseudo document for each word in the document, including performing steps (2-1) to (2-5) for each word i in turn:
(2-1) setting a sliding window with the size of W, and extracting N words including the word i through the sliding window to form a word network of the word i;
(2-2) constructing a word list L_cooccur(i): extracting each word j other than the word i from the word network and adding it to L_cooccur(i) with frequency fr_{i,j}, where fr_{i,j} = Avr_i · σ(sim(i, j)) · count(i, j) / Σ_k count(i, k); Avr_i is the average pseudo document length after the word network is constructed, sim(i, j) is the similarity between the word i and the word j, σ(·) is the sigmoid function, and count(i, j) is the number of occurrences of j in the word network of the word i;
(2-3) constructing a word list L_latent(i): setting a similarity threshold δ; for each word j in the word network, computing the vector w_i + w_j and finding the word w_latent whose vector has the highest cosine similarity to w_i + w_j; if that cosine similarity is larger than the similarity threshold δ, adding w_latent to L_latent(i); wherein w_i and w_j denote the word vectors of the word i and the word j, respectively;
(2-4) determining whether |L_cooccur(i)| + |L_latent(i)| < L is satisfied, wherein L represents the set minimum length threshold of the pseudo document; if so, selecting the m words in the word network with the highest similarity to the word i and adding them to the word list L_similar(i), where m < L;
(2-5) merging the word lists L_cooccur(i), L_latent(i), and L_similar(i) to obtain the pseudo document of the word i;
(3) performing LDA topic modeling on each pseudo document to obtain the topic and word frequency distributions of the original documents.
2. The method of claim 1, wherein preprocessing the document comprises performing Chinese word segmentation and stop word removal on the document.
3. The method of claim 1, wherein the expression of sim(i, j) is as follows: sim(i, j) = (w_i · w_j) / (‖w_i‖ ‖w_j‖).
4. The method of claim 1, wherein the word vector training uses the Word2Vec model.
CN201810473370.5A 2018-05-17 2018-05-17 Short text topic model generation method based on word network and word vector Active CN108710611B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810473370.5A CN108710611B (en) 2018-05-17 2018-05-17 Short text topic model generation method based on word network and word vector

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810473370.5A CN108710611B (en) 2018-05-17 2018-05-17 Short text topic model generation method based on word network and word vector

Publications (2)

Publication Number Publication Date
CN108710611A (en) 2018-10-26
CN108710611B (en) 2021-08-03

Family

ID=63868224

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810473370.5A Active CN108710611B (en) 2018-05-17 2018-05-17 Short text topic model generation method based on word network and word vector

Country Status (1)

Country Link
CN (1) CN108710611B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006039566A2 (en) * 2004-09-30 2006-04-13 Intelliseek, Inc. Topical sentiments in electronically stored communications
CN105955948A (en) * 2016-04-22 2016-09-21 武汉大学 Short text topic modeling method based on word semantic similarity
CN105975499A (en) * 2016-04-27 2016-09-28 深圳大学 Text subject detection method and system
CN106294662A (en) * 2016-08-05 2017-01-04 华东师范大学 Context-aware topic-based query representation and hybrid index modeling method
CN106327341A (en) * 2016-08-15 2017-01-11 首都师范大学 Weibo user gender inference method and system based on combined topics
CN107451187A (en) * 2017-06-23 2017-12-08 天津科技大学 Sub-topic discovery method for semi-structured short text sets based on a mutually constrained topic model

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Lan Jiang et al.: "Biterm Pseudo Document Topic Model for Short Text", 2016 IEEE 28th International Conference on Tools with Artificial Intelligence *
Ming Xu: "Intensity of Relationship Between Words: Using Word Triangles in Topic Discovery for Short Texts", Web and Big Data *
Yuan Zuo: "Topic Modeling of Short Texts: A Pseudo-Document View", ACM *
Xiong Shufeng et al.: "A Short Text Sentiment-Topic Model for Product Review Analysis", Acta Automatica Sinica *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109359302A (en) * 2018-10-26 2019-02-19 重庆大学 Domain word vector optimization method and fusion ranking method based thereon
CN110046340A (en) * 2018-12-28 2019-07-23 阿里巴巴集团控股有限公司 Training method and device for a text classification model
CN109858028A (en) * 2019-01-30 2019-06-07 神思电子技术股份有限公司 Short text similarity calculation method based on a probability model
CN109858028B (en) * 2019-01-30 2022-11-18 神思电子技术股份有限公司 Short text similarity calculation method based on probability model
CN109857942A (en) * 2019-03-14 2019-06-07 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for processing documents
CN110532378A (en) * 2019-05-13 2019-12-03 南京大学 Short text aspect extraction method based on a topic model
CN110532378B (en) * 2019-05-13 2021-10-26 南京大学 Short text aspect extraction method based on topic model
CN110134786A (en) * 2019-05-14 2019-08-16 南京大学 Short text classification method based on topic word vectors and convolutional neural networks
CN110263343A (en) * 2019-06-24 2019-09-20 北京理工大学 Phrase-vector-based keyword extraction method and system
CN111897952A (en) * 2020-06-10 2020-11-06 中国科学院软件研究所 Sensitive data discovery method for social media
CN111897952B (en) * 2020-06-10 2022-10-14 中国科学院软件研究所 Sensitive data discovery method for social media
CN113051917A (en) * 2021-04-23 2021-06-29 东南大学 Document implicit time inference method based on time window text similarity

Also Published As

Publication number Publication date
CN108710611B (en) 2021-08-03

Similar Documents

Publication Publication Date Title
CN108710611B (en) Short text topic model generation method based on word network and word vector
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN107193801B (en) Short text feature optimization and emotion analysis method based on deep belief network
CN107066553B (en) Short text classification method based on convolutional neural network and random forest
CN107451126B (en) Method and system for screening similar meaning words
CN107085581B (en) Short text classification method and device
WO2019085236A1 (en) Search intention recognition method and apparatus, and electronic device and readable storage medium
CN109960799B (en) Short text-oriented optimization classification method
CN108763348B (en) Classification improvement method for feature vectors of extended short text words
CN109086375B (en) Short text topic extraction method based on word vector enhancement
CN111104510B (en) Text classification training sample expansion method based on word embedding
CN110889282B (en) Text emotion analysis method based on deep learning
CN107291914A (en) Method and system for generating search engine query expansion words
CN107423282A (en) Method for jointly extracting semantically coherent topics and word vectors from text based on compound features
CN110134958B (en) Short text topic mining method based on semantic word network
CN108920482B (en) Microblog short text classification method based on lexical chain feature extension and LDA (latent Dirichlet Allocation) model
Ritu et al. Performance analysis of different word embedding models on Bangla language
CN107357895B (en) Text representation processing method based on bag-of-words model
CN114462392B (en) Short text feature expansion method based on association degree of subject and association of keywords
CN112528653B (en) Short text entity recognition method and system
CN110705272A (en) Named entity identification method for automobile engine fault diagnosis
CN111859961A (en) Text keyword extraction method based on improved TopicRank algorithm
CN109840324A (en) Semantically enhanced topic model and topic evolution analysis method
CN111460147A (en) Title short text classification method based on semantic enhancement
Chang et al. A method of fine-grained short text sentiment analysis based on machine learning.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant