CN108710611B - Short text topic model generation method based on word network and word vector - Google Patents
- Publication number
- CN108710611B (application CN201810473370.5A)
- Authority
- CN
- China
- Prior art keywords
- word
- document
- words
- network
- short text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a short text topic model generation method based on a word network and word vectors, comprising the following steps: 1) Learn semantic information: a. segment words and remove stop words; b. learn word vectors from the preprocessed short text data; c. calculate the semantic similarity between words. 2) Construct a pseudo document for each word: a. obtain a word co-occurrence list based on semantic similarity and construct a word network; b. compute the arithmetic relations of the word vectors to obtain a latent word list; c. check the length of the pseudo document and decide whether to add similar words. 3) Perform LDA topic modeling on each pseudo document to obtain the topic and word frequency distributions of the original documents. By introducing semantic information to construct pseudo documents and performing topic modeling on them, the invention alleviates the sparsity and imbalance problems of short text data and improves the performance of tasks such as topic discovery, text classification and text clustering on short texts.
Description
Technical Field
The invention relates to the field of text topic model construction, and in particular to a short text topic model generation method based on a word network and word vectors.
Background
With the rapid development of the Internet and the rapid growth of short text content on it, mining and analyzing short text data has become increasingly urgent. Faced with these short texts, accurately mining the topics behind them is widely recognized as a challenging and highly promising task.
Because short texts are sparse, transient and irregular, traditional topic model algorithms applied directly to short texts, such as pLSA and LDA, tend to perform poorly. As short text research has developed, topic models designed for short texts, such as BTM and WNTM, have been proposed in succession. These models, however, consider only the co-occurrence relations of words in the corpus. Although establishing word-pair relations or a word network yields far richer co-occurrence information for modeling than the short texts themselves, and thus alleviates the sparsity problem to some extent, the semantic relations among words are ignored, so the performance of these topic models on text mining tasks hits a bottleneck.
Disclosure of Invention
Purpose of the invention: the invention provides a short text topic model generation method based on a word network and word vectors, aiming to solve the technical problem that conventional short text topic models consider only word co-occurrence relations and ignore semantic information, which limits their performance on tasks such as topic discovery, text classification and text clustering.
The technical scheme is as follows: the technical scheme provided by the invention is as follows:
A method for generating a short text topic model based on a word network and word vectors comprises the following steps:
(1) Learning semantic information of the text, comprising: preprocessing the documents; training word vectors on the preprocessed document corpus to obtain a word vector for each word; and calculating the similarity between words from the word vectors.
(2) Constructing a pseudo document for each word in the documents, comprising performing steps (2-1) to (2-5) for each word i in turn:
(2-1) Setting a sliding window of size W, and extracting N words including the word i through the sliding window to form the word network of word i.
(2-2) Constructing a word list L_cooccur(i): extracting the words other than word i with a frequency fr_{i,j} and adding them to the word list L_cooccur(i); wherein Avr_i is the average length of the pseudo document of i after the word network is constructed, sim(i, j) is the similarity between word i and word j, σ(·) is a sigmoid function, and count(i, j) is the number of occurrences of j in the current word network of word i.
(2-3) Constructing a word list L_latent(i): setting a similarity threshold δ; for each word j in the word network, computing the cosine similarity between candidate words and the vector w_i + w_j, and adding the words whose cosine similarity exceeds the threshold δ to L_latent(i); wherein w_i and w_j denote the word vectors of word i and word j respectively.
(2-4) Judging whether |L_cooccur(i)| + |L_latent(i)| < L is satisfied, where L represents the set minimum length of the pseudo document; if so, selecting the m words in the word network with the highest similarity to word i and adding them to the word list L_similar(i), with m less than L.
(2-5) Merging the word lists L_cooccur(i), L_latent(i) and L_similar(i) to obtain the pseudo document of word i.
(3) Performing LDA topic modeling on each pseudo document to obtain the topic and word frequency distributions of the original documents.
Further, preprocessing the documents comprises performing Chinese word segmentation and stop word removal on the documents.
Further, the expression for sim(i, j) is: sim(i, j) = (w_i · w_j) / (|w_i| · |w_j|), i.e., the cosine similarity between the word vectors w_i and w_j of word i and word j.
further, the Word vector training adopts a Word2Vec model method.
Beneficial effects: compared with the prior art, the method is based on a word network and constructs a pseudo document for each word in the short text data by training word vectors and computing word similarities, and then performs LDA topic modeling. It thereby overcomes difficulties of short texts such as sparsity and imbalance, and improves the performance of the model by introducing semantic information.
Drawings
FIG. 1 is a flow chart of the short text topic model generation method based on a word network and word vectors according to the present invention;
FIG. 2 is a schematic flow chart of constructing the word network;
FIG. 3 is a flow chart of constructing the word list L_latent(i);
FIG. 4 is a flow chart of constructing the word list L_similar(i).
Detailed Description
The present invention will be further described with reference to the accompanying drawings.
Fig. 1 is a flow chart of the present invention, and the whole flow includes three stages:
First, the semantic information learning stage:
Step 1, preprocess the short text documents: perform Chinese word segmentation and remove stop words.
Step 2, train word vectors on the preprocessed corpus to obtain a word vector for each word.
Step 3, calculate the similarity between words using the word vectors obtained by training in step 2. Cosine similarity is adopted, calculated as:
sim(i, j) = (w_i · w_j) / (|w_i| · |w_j|)
where sim(i, j) represents the cosine similarity between word i and word j, and w_i, w_j represent the word vectors of word i and word j respectively.
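The cosine similarity calculation above can be sketched in a few lines of NumPy; the toy vectors are illustrative only.

```python
# Minimal sketch of the word-similarity step: cosine similarity between
# two word vectors, matching the formula sim(i,j) = (wi . wj)/(|wi||wj|).
import numpy as np

def cosine_sim(wi: np.ndarray, wj: np.ndarray) -> float:
    """Cosine similarity between word vectors wi and wj."""
    return float(np.dot(wi, wj) / (np.linalg.norm(wi) * np.linalg.norm(wj)))

wi = np.array([1.0, 0.0, 1.0])   # toy word vector for word i
wj = np.array([1.0, 1.0, 0.0])   # toy word vector for word j
print(round(cosine_sim(wi, wj), 3))  # prints 0.5 for these toy vectors
```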
Second, the pseudo document construction stage: the invention constructs a pseudo document for each word i and then performs topic modeling on the basis of the pseudo documents. The pseudo document of each word i consists of three parts, introduced as follows:
Step 4, construct the word network of word i: set a sliding window of size W and extract N words including word i through the sliding window; the specific process is shown in fig. 2.
Step 5, construct the co-occurrence word list, denoted L_cooccur(i): extract the words other than word i with a frequency fr_{i,j} and add them to the word list L_cooccur(i); wherein Avr_i is the average length of the pseudo document of i after the word network is constructed, sim(i, j) is the similarity between word i and word j, σ(·) is a sigmoid function, and count(i, j) is the number of times j appears in the current word network of word i.
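The count(i, j) statistic referenced above, i.e. how often word j appears in the sliding-window word network of word i, can be sketched as follows. The window size W and the toy corpus are assumptions for illustration, and the full fr_{i,j} weighting (whose formula appears as an image in the original patent) is not implemented here.

```python
# Sketch of collecting count(i, j): co-occurrences of word j within
# word i's sliding windows of size W on either side.
from collections import defaultdict

def window_cooccurrence(docs, W=3):
    """Return counts[i][j] = number of times j co-occurs in i's windows."""
    counts = defaultdict(lambda: defaultdict(int))
    for doc in docs:
        for pos, wi in enumerate(doc):
            lo, hi = max(0, pos - W), min(len(doc), pos + W + 1)
            for wj in doc[lo:pos] + doc[pos + 1:hi]:
                counts[wi][wj] += 1
    return counts

docs = [["a", "b", "c", "a"], ["b", "a", "b"]]
counts = window_cooccurrence(docs, W=2)
print(counts["a"]["b"])  # prints 4 for this toy corpus
```

Note the counts are symmetric by construction, since every window pairing is visited from both endpoints.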
Step 6, use the arithmetic relations of the word vectors to find words that are semantically similar to, but have no co-occurrence relation with, the words in the network, and add them to the word list L_latent(i); the specific process is shown in fig. 3:
For a word j in the word network of word i, compute the vector w_i + w_j and search by cosine similarity for the word w_latent most similar to it; wherein w_latent denotes the word to be added to L_latent(i), and δ is the set similarity threshold.
Compare the computed cosine similarity with the similarity threshold δ: if it is larger than δ, add the word w_latent to the word list L_latent(i); otherwise, do not add it.
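A minimal sketch of this latent-word search follows, assuming a small in-memory vocabulary of toy vectors; the vocabulary, the vector values and the threshold are illustrative assumptions, not data from the patent.

```python
# Sketch of step 6: for word j in word i's network, find the vocabulary
# word closest (by cosine) to w_i + w_j; keep it if similarity > delta.
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

vocab = {  # toy word vectors, chosen only for illustration
    "beijing": np.array([0.8, 0.9]),
    "china":   np.array([0.9, 0.3]),
    "capital": np.array([0.2, 1.0]),
}

def latent_word(i, j, delta=0.8):
    """Return the latent word for the pair (i, j), or None if below delta."""
    target = vocab[i] + vocab[j]
    # Exclude i and j themselves from the candidate search.
    cand = [(w, cos(target, v)) for w, v in vocab.items() if w not in (i, j)]
    best, sim = max(cand, key=lambda t: t[1])
    return best if sim > delta else None

print(latent_word("china", "capital"))  # prints beijing
```

This mirrors the classic word-vector arithmetic intuition (e.g. "china" + "capital" being close to "beijing"), which is what the patent exploits to surface words with no direct co-occurrence.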
Step 7, judge the length of the current pseudo document of word i: determine whether |L_cooccur(i)| + |L_latent(i)| < L is satisfied, where L represents the preset maximum length of the pseudo document; if so, add the m words in the word network most similar to word i to the word list L_similar(i). The specific process is shown in fig. 4.
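The length check and padding of step 7 can be sketched as follows; the list contents and parameter values are illustrative assumptions.

```python
# Sketch of step 7: if the pseudo document built so far is shorter than
# L, pad it with the m words most similar to word i.
def pad_pseudo_doc(l_cooccur, l_latent, sims, L=10, m=3):
    """sims: (word, similarity-to-i) pairs from word i's word network."""
    if len(l_cooccur) + len(l_latent) < L:
        ranked = sorted(sims, key=lambda t: t[1], reverse=True)
        return [w for w, _ in ranked[:m]]  # the L_similar(i) list
    return []  # pseudo document already long enough

sims = [("x", 0.9), ("y", 0.7), ("z", 0.95), ("q", 0.1)]
print(pad_pseudo_doc(["a"], ["b"], sims, L=5, m=2))  # prints ['z', 'x']
```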
Step 8, merge the three word lists L_cooccur(i), L_latent(i) and L_similar(i) obtained in steps 5, 6 and 7 to obtain the final pseudo document of word i.
Step 9, perform LDA topic modeling using the pseudo documents obtained in step 8.
Step 10, infer the topic-word distribution of the original documents using the pseudo document topic and word probability distributions obtained in step 9.
In summary, the invention is a short text topic model method based on a word network and word vectors that addresses the sparsity, imbalance and noise difficulties of short text topic modeling. Based on a word network, it constructs a pseudo document for each word in the short text data by training word vectors and computing word similarities, and then performs LDA topic modeling. The final effect of the invention is that the difficulties of short texts such as sparsity and imbalance are overcome, and the introduction of semantic information improves the performance of the model.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.
Claims (4)
1. A method for generating a short text topic model based on a word network and a word vector is characterized by comprising the following steps:
(1) learning semantic information of a text, comprising: preprocessing a document, and performing word vector training on a preprocessed document corpus to obtain a word vector of each word; calculating the similarity between the words according to the word vectors;
(2) constructing a pseudo document for each word in the documents, comprising performing steps (2-1) to (2-5) for each word i in turn:
(2-1) setting a sliding window of size W, and extracting N words including the word i through the sliding window to form the word network of word i;
(2-2) constructing a word list L_cooccur(i): extracting the words other than word i with a frequency fr_{i,j} and adding them to the word list L_cooccur(i); wherein Avr_i is the average length of the pseudo document of i after the word network is constructed, sim(i, j) is the similarity between word i and word j, σ(·) is a sigmoid function, and count(i, j) is the number of occurrences of j in the word network of word i;
(2-3) constructing a word list L_latent(i): setting a similarity threshold δ; for each word j in the word network, computing the cosine similarity between candidate words and the vector w_i + w_j, and adding the words whose cosine similarity is larger than the similarity threshold δ to L_latent(i); wherein w_i and w_j denote the word vectors of word i and word j respectively;
(2-4) judging whether |L_cooccur(i)| + |L_latent(i)| < L is satisfied, wherein L represents the set maximum length of the pseudo document; if so, selecting the m words in the word network with the highest similarity to word i and adding them to the word list L_similar(i), wherein m is less than L;
(2-5) merging the word lists L_cooccur(i), L_latent(i) and L_similar(i) to obtain the pseudo document of word i;
(3) performing LDA topic modeling on each pseudo document to obtain the topic and word frequency distributions of the original documents.
2. The method of claim 1, wherein preprocessing the document comprises performing Chinese word segmentation and stop word removal on the document.
3. The method of claim 1, wherein the expression for sim(i, j) is sim(i, j) = (w_i · w_j) / (|w_i| · |w_j|), the cosine similarity between the word vectors of word i and word j.
4. The method of claim 1, wherein the word vector training adopts a Word2Vec model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810473370.5A CN108710611B (en) | 2018-05-17 | 2018-05-17 | Short text topic model generation method based on word network and word vector |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810473370.5A CN108710611B (en) | 2018-05-17 | 2018-05-17 | Short text topic model generation method based on word network and word vector |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108710611A CN108710611A (en) | 2018-10-26 |
CN108710611B true CN108710611B (en) | 2021-08-03 |
Family
ID=63868224
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810473370.5A Active CN108710611B (en) | 2018-05-17 | 2018-05-17 | Short text topic model generation method based on word network and word vector |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108710611B (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109359302B (en) * | 2018-10-26 | 2023-04-18 | 重庆大学 | Optimization method of domain word vectors and fusion ordering method based on optimization method |
CN110046340A (en) * | 2018-12-28 | 2019-07-23 | 阿里巴巴集团控股有限公司 | The training method and device of textual classification model |
CN109858028B (en) * | 2019-01-30 | 2022-11-18 | 神思电子技术股份有限公司 | Short text similarity calculation method based on probability model |
CN109857942A (en) * | 2019-03-14 | 2019-06-07 | 北京百度网讯科技有限公司 | For handling the method, apparatus, equipment and storage medium of document |
CN110532378B (en) * | 2019-05-13 | 2021-10-26 | 南京大学 | Short text aspect extraction method based on topic model |
CN110134786B (en) * | 2019-05-14 | 2021-09-10 | 南京大学 | Short text classification method based on subject word vector and convolutional neural network |
CN110263343B (en) * | 2019-06-24 | 2021-06-15 | 北京理工大学 | Phrase vector-based keyword extraction method and system |
CN111897952B (en) * | 2020-06-10 | 2022-10-14 | 中国科学院软件研究所 | Sensitive data discovery method for social media |
CN113051917B (en) * | 2021-04-23 | 2022-11-18 | 东南大学 | Document implicit time inference method based on time window text similarity |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2006039566A2 (en) * | 2004-09-30 | 2006-04-13 | Intelliseek, Inc. | Topical sentiments in electronically stored communications |
CN105955948A (en) * | 2016-04-22 | 2016-09-21 | 武汉大学 | Short text topic modeling method based on word semantic similarity |
CN105975499A (en) * | 2016-04-27 | 2016-09-28 | 深圳大学 | Text subject detection method and system |
CN106294662A (en) * | 2016-08-05 | 2017-01-04 | 华东师范大学 | Inquiry based on context-aware theme represents and mixed index method for establishing model |
CN106327341A (en) * | 2016-08-15 | 2017-01-11 | 首都师范大学 | Weibo user gender deduction method and system based on combined theme |
CN107451187A (en) * | 2017-06-23 | 2017-12-08 | 天津科技大学 | Sub-topic finds method in half structure assigned short text set based on mutual constraint topic model |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2006039566A2 (en) * | 2004-09-30 | 2006-04-13 | Intelliseek, Inc. | Topical sentiments in electronically stored communications |
CN105955948A (en) * | 2016-04-22 | 2016-09-21 | 武汉大学 | Short text topic modeling method based on word semantic similarity |
CN105975499A (en) * | 2016-04-27 | 2016-09-28 | 深圳大学 | Text subject detection method and system |
CN106294662A (en) * | 2016-08-05 | 2017-01-04 | 华东师范大学 | Inquiry based on context-aware theme represents and mixed index method for establishing model |
CN106327341A (en) * | 2016-08-15 | 2017-01-11 | 首都师范大学 | Weibo user gender deduction method and system based on combined theme |
CN107451187A (en) * | 2017-06-23 | 2017-12-08 | 天津科技大学 | Sub-topic finds method in half structure assigned short text set based on mutual constraint topic model |
Non-Patent Citations (4)
Title |
---|
Biterm Pseudo Document Topic Model for Short Text; Lan Jiang et al.; 2016 IEEE 28th International Conference on Tools with Artificial Intelligence; 2016-12-08; pp. 865-872 *
Intensity of Relationship Between Words: Using Word Triangles in Topic Discovery for Short Texts; Ming Xu; Web and Big Data; Springer; 2017-08-03; pp. 642-648 *
Topic Modeling of Short Texts: A Pseudo-Document View; Yuan Zuo; ACM; 2016-08-17; pp. 1-10 *
Short Text Sentiment Topic Model for Product Review Analysis; Xiong Shufeng et al.; Acta Automatica Sinica; 2016-08-31; vol. 42, no. 8, pp. 1227-1237 *
Also Published As
Publication number | Publication date |
---|---|
CN108710611A (en) | 2018-10-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108710611B (en) | Short text topic model generation method based on word network and word vector | |
CN107085581B (en) | Short text classification method and device | |
CN108052593B (en) | Topic keyword extraction method based on topic word vector and network structure | |
CN107133213B (en) | Method and system for automatically extracting text abstract based on algorithm | |
WO2019085236A1 (en) | Search intention recognition method and apparatus, and electronic device and readable storage medium | |
CN102799647B (en) | Method and device for webpage reduplication deletion | |
CN108763348B (en) | Classification improvement method for feature vectors of extended short text words | |
CN111104510B (en) | Text classification training sample expansion method based on word embedding | |
CN109086375B (en) | Short text topic extraction method based on word vector enhancement | |
CN110134958B (en) | Short text topic mining method based on semantic word network | |
CN107291914A (en) | A kind of method and system for generating search engine inquiry expansion word | |
CN110889282B (en) | Text emotion analysis method based on deep learning | |
US20140032207A1 (en) | Information Classification Based on Product Recognition | |
CN109840324B (en) | Semantic enhancement topic model construction method and topic evolution analysis method | |
CN107357895B (en) | Text representation processing method based on bag-of-words model | |
CN110728144B (en) | Extraction type document automatic summarization method based on context semantic perception | |
CN109918507B (en) | textCNN (text-based network communication network) improved text classification method | |
CN112528653B (en) | Short text entity recognition method and system | |
CN110705272A (en) | Named entity identification method for automobile engine fault diagnosis | |
CN112860889A (en) | BERT-based multi-label classification method | |
CN114462392A (en) | Short text feature expansion method based on topic relevance and keyword association | |
CN111460147A (en) | Title short text classification method based on semantic enhancement | |
CN111353045A (en) | Method for constructing text classification system | |
CN114491062B (en) | Short text classification method integrating knowledge graph and topic model | |
CN111061866B (en) | Barrage text clustering method based on feature expansion and T-oBTM |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |