CN108052593B - Topic keyword extraction method based on topic word vector and network structure - Google Patents

Topic keyword extraction method based on topic word vector and network structure

Info

Publication number
CN108052593B
CN108052593B (application CN201711315360.0A)
Authority
CN
China
Prior art keywords
topic
word
keyword
words
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711315360.0A
Other languages
Chinese (zh)
Other versions
CN108052593A (en)
Inventor
胡晓慧
李超
曾庆田
戴明弟
赵中英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University of Science and Technology
Original Assignee
Shandong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University of Science and Technology filed Critical Shandong University of Science and Technology
Priority to CN201711315360.0A priority Critical patent/CN108052593B/en
Publication of CN108052593A publication Critical patent/CN108052593A/en
Application granted granted Critical
Publication of CN108052593B publication Critical patent/CN108052593B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3334Selection or weighting of terms from queries, including natural language queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a topic keyword extraction method based on topic word vectors and a network structure, and relates in particular to the technical field of extracting keywords from text. The method performs topic clustering on a text corpus based on an LDA topic model and obtains the 100 keywords most relevant to each topic (top-100); word2vec represents each word in the corpus as a word vector, the semantic similarity between every pair of words is computed, and for each keyword the top-5 most semantically similar words are found, the keywords and these top-5 words together forming a new keyword set; a keyword network is then constructed, and the top-20 words of each set are taken as the topic's keywords. The method can extract keywords with high word frequency in a document, and can also effectively discover keywords with low word frequency but a strong relation to the topic.

Description

Topic keyword extraction method based on topic word vector and network structure
Technical Field
The invention relates to the technical field of extracting keywords from texts, in particular to a topic keyword extraction method based on topic word vectors and network structures.
Background
With the wide application of representation learning in natural language processing, word2vec has been applied to the vector representation of words and can capture their semantic and grammatical regularities well, while topic models explain topic aggregation at the document level well. Research on word vector representations that fuse topic models and topic keywords is therefore becoming increasingly widespread.
LDA topic model: among the various topic models proposed, LDA is a generative model that can summarize the distribution of topics. LDA is a three-level hierarchical Bayesian model in which each item of a collection is modeled as a finite mixture over an underlying set of latent topics, and each topic is in turn modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. The modeling process of LDA can be described as finding a topic mixture for each resource (i.e., P(z|d)), with each topic described by another probability distribution (i.e., P(t|z)). This can be formally expressed as:
P(t_i | d) = Σ_{j=1}^{Z} P(t_i | z_j) · P(z_j | d)

where P(t_i|d) is the probability of the i-th term in a given document d, z_j is a latent topic, P(t_i|z_j) is the probability of t_i in topic j, and P(z_j|d) is the probability of topic j in document d. The number of latent topics Z must be defined in advance. LDA uses Dirichlet prior distributions and a fixed topic number to estimate the topic-word distribution P(t|z) and the document-topic distribution P(z|d) from an unlabeled corpus.
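As a concrete illustration of the mixture above, the following sketch computes P(t_i|d) for every vocabulary word from a topic-word matrix and a document-topic vector; the distributions are invented toy values, not data from the invention:

```python
import numpy as np

# Hypothetical model with Z=3 topics over a 5-word vocabulary.
# topic_word[j, i] = P(t_i | z_j); each row sums to 1.
topic_word = np.array([
    [0.5, 0.2, 0.1, 0.1, 0.1],
    [0.1, 0.4, 0.3, 0.1, 0.1],
    [0.2, 0.1, 0.1, 0.3, 0.3],
])
# doc_topic[j] = P(z_j | d) for one document d; sums to 1.
doc_topic = np.array([0.7, 0.2, 0.1])

# P(t_i | d) = sum_j P(t_i | z_j) * P(z_j | d), for all i at once.
p_word_given_doc = doc_topic @ topic_word
print(p_word_given_doc)        # a probability distribution over the 5 words
```

Because both inputs are proper distributions, the result also sums to 1, which is a quick sanity check when wiring up real LDA output.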
LDA is a widely applied topic model, and most other topic models are extensions of it. However, the keywords extracted by LDA are generally too broad and do not reflect the theme of an article well, which motivates the method proposed by the invention.
word embedding: word embedding encodes each word as a continuous vector (word vector) based on syntactic and semantic information so that the distance of similar words on their word vectors is similar. After a language model is counted and established from a natural text and word vectors are obtained, the language model can be used as input of a neural network to perform syntactic analysis, emotion analysis and the like, and can also be used as auxiliary features to expand the existing model. But only the word vectors are unable to identify the topic that the text expects, which must be combined with the topic model.
The existing unsupervised keyword extraction technology mainly comprises the schemes of TF-IDF, Topic model, TextRank and the like. The technical defects are mainly reflected in the following aspects:
TF-IDF is a common weighting technique in information retrieval and data mining that measures the importance of search keywords, and it also performs well when applied to text keyword extraction. However, TF-IDF is based only on word frequency and the cross entropy of the keyword probability distribution: it considers neither the order in which words occur nor the relation between each word in the text and its context.
Widely used topic models such as LDA can mine topics from documents well, but the extracted keywords are too broad: many words with high frequency but no relation to the topic fail to reflect it and are therefore unsuitable as keywords.
The TextRank algorithm is a graph-based ranking algorithm for text: the text is split into sentences, a graph model is built from the contextual co-occurrence relations of words, and keywords are extracted according to their PageRank values in the graph. On the basis of word frequency and word co-occurrence, the algorithm extracts the keywords of a single document simply and effectively, but it cannot identify and cluster topics across multiple documents, so it cannot extract document keywords under a specific topic.
Disclosure of Invention
The invention aims to overcome the above defects and provides a keyword extraction method that combines the LDA topic model with word embedding and extracts keywords from same-topic texts via similarity network propagation; it can extract keywords with high word frequency in a document and also effectively discover keywords with low word frequency but a strong relation to the topic.
The invention specifically adopts the following technical scheme:
as shown in fig. 1, a method for extracting topic keywords based on topic word vectors and network structures specifically includes:
performing word segmentation on an original text corpus;
performing topic clustering on the text corpus based on an LDA topic model, and obtaining for each topic a keyword set Keywordset1 = {k1, ..., k100} of the 100 words most relevant to the topic (top-100);
Using word2vec to represent each word in the text corpus as a word vector, and obtaining semantic similarity between every two words by calculating cosine values between the word vectors;
for each keyword in Keywordset1, respectively calculating its top-5 semantically similar words; the keywords in Keywordset1 and their top-5 similar words together form a new keyword set Keywordset2;
taking each keyword in Keywordset2 as a node and the semantic similarity between words as edge weights, constructing a keyword network, and, according to each node's PageRank value, taking the top-20 words of Keywordset2 as the topic's keywords to form the final keyword set Keywordset_final.
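The top-5 expansion step above can be sketched as follows; the vocabulary, the random vectors, and the use of top-2 instead of top-5 are illustrative assumptions for a toy-sized example:

```python
import numpy as np

# Hypothetical vocabulary with word vectors (rows), normalised so that a
# dot product equals cosine similarity. Vectors are random stand-ins.
vocab = ["course", "lesson", "exam", "grade", "score", "library", "book"]
rng = np.random.default_rng(0)
vecs = rng.normal(size=(len(vocab), 8))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

keywords = ["course", "exam"]   # stands in for Keywordset1 (top-100 in the method)
topn = 2                        # top-5 in the method; 2 here for the toy vocabulary

sim = vecs @ vecs.T             # pairwise cosine similarities
expanded = set(keywords)
for kw in keywords:
    i = vocab.index(kw)
    order = np.argsort(-sim[i])                       # most similar first
    neighbours = [vocab[j] for j in order if j != i][:topn]
    expanded |= set(neighbours)  # Keywordset2 = Keywordset1 ∪ top-n neighbours

print(sorted(expanded))
```

Duplicates disappear automatically because a set is used, matching the de-duplication in step S1 below.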
Preferably, word segmentation divides the acquired original text into word sequences for subsequent topic clustering and keyword extraction; when the segmentation result is used as input to word2vec, special symbols are removed, and when it is used as input to LDA, stop words, place names that cannot serve as topic keywords, and repeated topic-irrelevant prepositions are also removed.
Preferably, topic clustering is performed on the text corpus based on the LDA topic model, and perplexity is used, as in language modeling, to measure how well the model fits: a lower perplexity represents better generalization performance. The perplexity is calculated as:

perplexity = exp( − (1/N) · Σ_{i=1}^{N} log Σ_{j=1}^{K} P(w_i | t_j) · P(t_j | d) )

where P(w_i|t_j) is the distribution of word w_i under topic t_j, P(t_j|d) is the distribution of topic t_j over document d, N is the number of distinct words in the corpus, K is the number of topics, i = 1, ..., N, and j = 1, ..., K.
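A sketch of this perplexity computation for a single document, using invented toy distributions (K=2 topics, N=4 distinct words):

```python
import numpy as np

# Hypothetical LDA output: topic_word[j, i] = P(w_i | t_j), doc_topic[j] = P(t_j | d).
topic_word = np.array([
    [0.4, 0.3, 0.2, 0.1],
    [0.1, 0.2, 0.3, 0.4],
])
doc_topic = np.array([0.6, 0.4])

# p(w_i | d) = sum_j P(w_i | t_j) * P(t_j | d)
p_w = doc_topic @ topic_word

# perplexity = exp(-(1/N) * sum_i log p(w_i | d))
perp = np.exp(-np.log(p_w).mean())
print(perp)
```

Lower values indicate a better fit; sweeping the topic number K and plotting this value yields the inflection-point curve of fig. 2.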
Preferably, in the word vector generation process, the word-segmentation result of the combined title-and-content text is used as input to obtain a word vector representation model for each word.
Preferably, in the keyword network construction process, the construction steps specifically include:
s1: calculating words with initial keyword semantic similarity top5 obtained in the step of clustering with the same theme by using cosine relationship between word vectors, removing duplication and combining with keyword set Keywordset1Form a new keyword set2
S2: calculating keyword set under each topic2The similarity between every two words in the Chinese language is used as the weight between two points;
s3: setting a threshold value, and filtering edges with similarity lower than the threshold value;
s4: constructing a keyword network of each topic;
s5: extracting the topic key words: after the keyword network is constructed, 20 nodes of top from high to low of PageRank values in each topic network are calculated, and corresponding words are used as keyword sets of the topicsfinal
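Steps S2-S5 can be sketched end to end as follows; the keyword list and similarity matrix are invented, the 0.3 threshold mirrors the choice justified in Table 2 below, and the plain power-iteration PageRank (damping 0.85) is an illustrative stand-in for whatever implementation the patent actually used:

```python
import numpy as np

# Hypothetical Keywordset2 for one topic, with pairwise cosine similarities.
keywords = ["course", "teaching", "exam", "library", "grade"]
sim = np.array([
    [1.0, 0.8, 0.5, 0.1, 0.4],
    [0.8, 1.0, 0.6, 0.2, 0.5],
    [0.5, 0.6, 1.0, 0.1, 0.7],
    [0.1, 0.2, 0.1, 1.0, 0.1],
    [0.4, 0.5, 0.7, 0.1, 1.0],
])

# S3: drop edges whose similarity is below the threshold.
w = np.where(sim >= 0.3, sim, 0.0)
np.fill_diagonal(w, 0.0)                 # no self-loops in the network

# S5: weighted PageRank by plain power iteration, damping factor 0.85.
d, n = 0.85, len(keywords)
out = w.sum(axis=1, keepdims=True)
out[out == 0] = 1.0                      # guard fully isolated nodes
P = w / out                              # row-stochastic transition matrix
r = np.full(n, 1.0 / n)
for _ in range(100):
    r = (1 - d) / n + d * (P.T @ r)

ranked = [keywords[i] for i in np.argsort(-r)]   # keep top-20 in the real method
print(ranked)
```

"library" loses all its edges at the 0.3 threshold, so it ends up ranked last, which is the intended behaviour: weakly connected words are pushed out of the topic's keyword set.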
The invention has the following beneficial effects:
firstly, the text corpus is clustered based on an LDA topic model; secondly, each word in the text corpus is represented as a word vector using word2vec; then, the top-5 most similar words of each keyword in the topic's documents are obtained and together form a new keyword set; finally, a keyword network is constructed with the keywords as nodes and the similarity between words as edge weights, and the core nodes of the network are obtained as the topic's keywords;
the method combines the LDA topic model with word embedding and extracts keywords from same-topic texts via similarity network propagation; it can extract keywords with high word frequency in a document and also effectively discover keywords with low word frequency but a strong relation to the topic;
on the basis of word frequency, the method performs a secondary discovery of keywords according to word vector relations, bringing low-frequency but semantically similar words into the candidate keyword set; this reasonably expands the selection range of keywords and makes the final keywords under the same topic semantically closer;
the method introduces word vectors and constructs the network based on distances between them, so keywords with similar senses under the same topic can be found more accurately, giving a more precise result.
Drawings
FIG. 1 is a flow chart of a topic keyword extraction method based on topic word vectors and network structures;
FIG. 2 is the perplexity curve;
FIG. 3 is the keyword distribution for teaching-type notifications;
FIG. 4 is the keyword distribution for rating-type notifications;
FIG. 5 is the keyword distribution for library-type notifications.
Detailed Description
The following description of the embodiments of the present invention will be made with reference to the accompanying drawings:
as shown in fig. 1, a method for extracting topic keywords based on topic word vectors and network structures specifically includes:
performing word segmentation on an original text corpus;
performing topic clustering on the text corpus based on an LDA topic model, and obtaining for each topic a keyword set Keywordset1 = {k1, ..., k100} of the 100 words most relevant to the topic (top-100);
Using word2vec to represent each word in the text corpus as a word vector, and obtaining semantic similarity between every two words by calculating cosine values between the word vectors;
for each keyword in Keywordset1, respectively calculating its top-5 semantically similar words; the keywords in Keywordset1 and their top-5 similar words together form a new keyword set Keywordset2;
taking each keyword in Keywordset2 as a node and the semantic similarity between words as edge weights, constructing a keyword network, and, according to each node's PageRank value, taking the top-20 words of Keywordset2 as the topic's keywords to form the final keyword set Keywordset_final.
Word segmentation divides the acquired original text into word sequences for subsequent topic clustering and keyword extraction; when the segmentation result is used as input to word2vec, special symbols are removed, and when it is used as input to LDA, stop words, place names that cannot serve as topic keywords, and a large number of repeated topic-irrelevant prepositions are removed.
As shown in fig. 2, topic clustering is performed on the text corpus based on the LDA topic model, and perplexity is used, as in language modeling, to measure how well the model fits: a lower perplexity represents better generalization performance. The perplexity is calculated as:

perplexity = exp( − (1/N) · Σ_{i=1}^{N} log Σ_{j=1}^{K} P(w_i | t_j) · P(t_j | d) )

where P(w_i|t_j) is the distribution of word w_i under topic t_j, P(t_j|d) is the distribution of topic t_j over document d, N is the number of distinct words in the corpus, K is the number of topics, i = 1, ..., N, and j = 1, ..., K. The number of topics is varied, and the optimal number is obtained by calculating the perplexity of the data set under different topic counts.
Selecting the value at the inflection point of the curve keeps the perplexity of the data set small without making the number of topics excessive. The topic distribution of each document and the word distribution under each topic are then obtained, and the 100 top-ranked words by LDA value under each topic are selected as the initial keyword set.
In the word vector generation process, the word-segmentation result of the combined title-and-content text is used as input to obtain a word vector representation model for each word. In this scheme, the CBOW model is selected with a window size of 5 to predict the probability of the current centre word, and a negative sampling algorithm is selected to distinguish the target word from noise-distribution samples via logistic regression. Table 1 (word2vec model training parameter settings) gives a description and default values of the key training parameters.
TABLE 1
(Table 1 is reproduced as an image in the original patent and is not rendered here.)
Finally, high-dimensional vector representation of all words in the text can be obtained, and similarity relation between all words, namely semantic distance, can be obtained by utilizing the word vector model.
In the keyword network construction process, the construction steps specifically comprise:
s1: calculating words with initial keyword semantic similarity top5 obtained in the step of clustering with the same theme by using cosine relationship between word vectors, removing duplication and combining with keyword set Keywordset1Form a new keyword set2
S2: calculating keyword set Kevwordset under each topic2The similarity between every two words in the Chinese language is used as the weight between two points;
s3: setting a threshold value, and filtering edges with similarity lower than the threshold value; the different results for different values of threshold selection are shown in table 2:
TABLE 2
Similarity threshold    Topic similarity
0.05                    0.41
0.1                     0.44
0.15                    0.48
0.2                     0.49
0.25                    0.52
0.3                     0.59
0.35                    0.55
0.4                     0.57
0.45                    0.56
0.5                     0.52
0.55                    0.50
As the table shows, keywords within the same topic aggregate most strongly when the threshold is set to 0.3.
S4: constructing a keyword network of each topic;
S5: extracting the topic keywords: after the keyword network is constructed, the 20 nodes with the highest PageRank values in each topic network are calculated, and the corresponding words are taken as the topic's keywords to form a new keyword set Keywordset_final.
As shown in figs. 3-5, in an experimental validation of the scheme, 9802 news announcements published by a college from 2002 to 2017 were crawled; after word-segmentation preprocessing, topic keywords were extracted through topic mining, word vector calculation and keyword network construction, and the result was compared with the keywords obtained by the traditional LDA topic model.
In the figures, darker words reflect the topic better, lighter words are less relevant to it, and larger words rank higher under the method. It can be seen that, by integrating word frequency and semantics, the method of the invention better extracts keywords that represent the topic.
It is to be understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art may make modifications, alterations, additions or substitutions within the spirit and scope of the present invention.

Claims (4)

1. A topic keyword extraction method based on topic word vectors and network structures is characterized by specifically comprising the following steps:
performing word segmentation on an original text corpus;
performing topic clustering on the text corpus based on an LDA topic model, and obtaining for each topic a keyword set Keywordset1 = {k1, ..., k100} of the 100 words most relevant to the topic (top-100);
Using word2vec to represent each word in the text corpus as a word vector, and obtaining semantic similarity between every two words by calculating cosine values between the word vectors;
for each keyword in Keywordset1, respectively calculating its top-5 semantically similar words; the keywords in Keywordset1 and their top-5 similar words together form a new keyword set Keywordset2;
taking each keyword in Keywordset2 as a node and the semantic similarity between words as edge weights, constructing a keyword network, and, according to each node's PageRank value, taking the top-20 words of Keywordset2 as the topic's keywords to form the final keyword set Keywordset_final;
In the keyword network construction process, the construction steps specifically include:
s1: calculating words with initial keyword semantic similarity top5 obtained in the step of clustering with the same theme by using cosine relationship between word vectors, removing duplication and combining with keyword set Keywordset1Form a new keyword set2
S2: calculating keyword set under each topic2The similarity between every two words in the Chinese language is used as the weight between two points;
s3: setting a threshold value, and filtering edges with similarity lower than the threshold value;
s4: constructing a keyword network of each topic;
s5: extracting the topic key words: after the keyword network is constructed, 20 nodes of top from high to low of PageRank values in each topic network are calculated, and corresponding words are used as keyword sets of the topicsfinal
2. The method as claimed in claim 1, wherein word segmentation divides the acquired original text into word sequences for subsequent topic clustering and keyword extraction; when the segmentation result is used as input to word2vec, special symbols are removed, and when it is used as input to LDA, stop words, place names that cannot serve as topic keywords, and repeated topic-irrelevant prepositions are removed.
3. The method as claimed in claim 1, wherein topic clustering is performed on the text corpus based on the LDA topic model, and perplexity is used, as in language modeling, to measure how well the model fits, a lower perplexity representing better generalization performance; the perplexity is calculated as:

perplexity = exp( − (1/N) · Σ_{i=1}^{N} log Σ_{j=1}^{K} P(w_i | t_j) · P(t_j | d) )

where P(w_i|t_j) is the distribution of word w_i under topic t_j, P(t_j|d) is the distribution of topic t_j over document d, N is the number of distinct words in the corpus, K is the number of topics, i = 1, ..., N, and j = 1, ..., K.
4. The method of claim 1, wherein, in the word vector generation process, the word-segmentation result of the combined title-and-content text is used as input to obtain a word vector representation model for each word.
CN201711315360.0A 2017-12-12 2017-12-12 Topic keyword extraction method based on topic word vector and network structure Active CN108052593B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711315360.0A CN108052593B (en) 2017-12-12 2017-12-12 Topic keyword extraction method based on topic word vector and network structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711315360.0A CN108052593B (en) 2017-12-12 2017-12-12 Topic keyword extraction method based on topic word vector and network structure

Publications (2)

Publication Number Publication Date
CN108052593A CN108052593A (en) 2018-05-18
CN108052593B true CN108052593B (en) 2020-09-22

Family

ID=62124320

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711315360.0A Active CN108052593B (en) 2017-12-12 2017-12-12 Topic keyword extraction method based on topic word vector and network structure

Country Status (1)

Country Link
CN (1) CN108052593B (en)

Families Citing this family (45)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108829822B (en) * 2018-06-12 2023-10-27 腾讯科技(深圳)有限公司 Media content recommendation method and device, storage medium and electronic device
CN108920454A (en) * 2018-06-13 2018-11-30 北京信息科技大学 A kind of theme phrase extraction method
CN108984519B (en) * 2018-06-14 2022-07-05 华东理工大学 Dual-mode-based automatic event corpus construction method and device and storage medium
CN110020034B (en) * 2018-06-29 2023-12-08 程宇镳 Information quotation analysis method and system
CN109086355B (en) * 2018-07-18 2022-05-17 北京航天云路有限公司 Hot-spot association relation analysis method and system based on news subject term
CN109376352B (en) * 2018-08-28 2022-11-29 中山大学 Patent text modeling method based on word2vec and semantic similarity
CN109522928A (en) * 2018-10-15 2019-03-26 北京邮电大学 Theme sentiment analysis method, apparatus, electronic equipment and the storage medium of text
CN109284366A (en) * 2018-10-17 2019-01-29 徐佳慧 A kind of construction method and device of the homogenous network towards investment and financing mechanism
CN109492157B (en) * 2018-10-24 2021-08-31 华侨大学 News recommendation method and theme characterization method based on RNN and attention mechanism
CN109636645A (en) * 2018-12-13 2019-04-16 平安医疗健康管理股份有限公司 Medical insurance monitoring and managing method, unit and computer readable storage medium
CN109710759B (en) * 2018-12-17 2021-06-08 北京百度网讯科技有限公司 Text segmentation method and device, computer equipment and readable storage medium
CN109885831B (en) * 2019-01-30 2023-06-02 广州杰赛科技股份有限公司 Keyword extraction method, device, equipment and computer readable storage medium
CN110442855B (en) * 2019-04-10 2023-11-07 北京捷通华声科技股份有限公司 Voice analysis method and system
CN110046228B (en) * 2019-04-18 2021-06-11 合肥工业大学 Short text topic identification method and system
CN110209941B (en) * 2019-06-03 2021-01-15 北京卡路里信息技术有限公司 Method for maintaining push content pool, push method, device, medium and server
CN110222347B (en) * 2019-06-20 2020-06-23 首都师范大学 Composition separation detection method
CN110287321A (en) * 2019-06-26 2019-09-27 南京邮电大学 A kind of electric power file classification method based on improvement feature selecting
CN110472005B (en) * 2019-06-27 2023-09-15 中山大学 Unsupervised keyword extraction method
CN110427492B (en) * 2019-07-10 2023-08-15 创新先进技术有限公司 Keyword library generation method and device and electronic equipment
CN110717329B (en) * 2019-09-10 2023-06-16 上海开域信息科技有限公司 Method for performing approximate search based on word vector to rapidly extract advertisement text theme
CN110807326B (en) * 2019-10-24 2023-04-28 江汉大学 Short text keyword extraction method combining GPU-DMM and text features
CN111026866B (en) * 2019-10-24 2020-10-23 北京中科闻歌科技股份有限公司 Domain-oriented text information extraction clustering method, device and storage medium
CN110851602A (en) * 2019-11-13 2020-02-28 精硕科技(北京)股份有限公司 Method and device for topic clustering
CN110851570B (en) * 2019-11-14 2023-04-18 中山大学 Unsupervised keyword extraction method based on Embedding technology
CN110991175B (en) * 2019-12-10 2024-04-09 爱驰汽车有限公司 Method, system, equipment and storage medium for generating text in multi-mode
CN111078838B (en) * 2019-12-13 2023-08-18 北京小米智能科技有限公司 Keyword extraction method, keyword extraction device and electronic equipment
CN111079422B (en) * 2019-12-13 2023-07-14 北京小米移动软件有限公司 Keyword extraction method, keyword extraction device and storage medium
CN113139379B (en) * 2020-01-20 2023-12-22 中国电信股份有限公司 Information identification method and system
CN111401040B (en) * 2020-03-17 2021-06-18 上海爱数信息技术股份有限公司 Keyword extraction method suitable for word text
CN111428489B (en) * 2020-03-19 2023-08-29 北京百度网讯科技有限公司 Comment generation method and device, electronic equipment and storage medium
CN111950264B (en) * 2020-08-05 2024-04-26 广东工业大学 Text data enhancement method and knowledge element extraction method
CN112100317B (en) * 2020-09-24 2022-10-14 南京邮电大学 Feature keyword extraction method based on theme semantic perception
CN112270185A (en) * 2020-10-29 2021-01-26 山西大学 Text representation method based on topic model
CN112508376A (en) * 2020-11-30 2021-03-16 中国科学院深圳先进技术研究院 Index system construction method
CN113011133A (en) * 2021-02-23 2021-06-22 吉林大学珠海学院 Single cell correlation technique data analysis method based on natural language processing
CN113051917B (en) * 2021-04-23 2022-11-18 东南大学 Document implicit time inference method based on time window text similarity
CN113407679B (en) * 2021-06-30 2023-10-03 竹间智能科技(上海)有限公司 Text topic mining method and device, electronic equipment and storage medium
CN113378512B (en) * 2021-07-05 2023-05-26 中国科学技术信息研究所 Automatic indexing-based stepless dynamic evolution subject cloud image generation method
CN113486176B (en) * 2021-07-08 2022-11-04 桂林电子科技大学 News classification method based on secondary feature amplification
CN113505581A (en) * 2021-07-27 2021-10-15 北京工商大学 Education big data text analysis method based on APSO-LSTM network
CN113591476A (en) * 2021-08-10 2021-11-02 闪捷信息科技有限公司 Data label recommendation method based on machine learning
CN113673223A (en) * 2021-08-25 2021-11-19 北京智通云联科技有限公司 Keyword extraction method and system based on semantic similarity
CN115168600B (en) * 2022-06-23 2023-07-11 广州大学 Value chain knowledge discovery method under personalized customization
CN116431814B (en) * 2023-06-06 2023-09-05 北京中关村科金技术有限公司 Information extraction method, information extraction device, electronic equipment and readable storage medium
CN116975246B (en) * 2023-08-03 2024-04-26 深圳市博锐高科科技有限公司 Data acquisition method, device, chip and terminal

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104778161A (en) * 2015-04-30 2015-07-15 车智互联(北京)科技有限公司 Keyword extracting method based on Word2Vec and Query log
CN105893410A (en) * 2015-11-18 2016-08-24 乐视网信息技术(北京)股份有限公司 Keyword extraction method and apparatus

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8245135B2 (en) * 2009-09-08 2012-08-14 International Business Machines Corporation Producing a visual summarization of text documents

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104778161A (en) * 2015-04-30 2015-07-15 车智互联(北京)科技有限公司 Keyword extracting method based on Word2Vec and Query log
CN105893410A (en) * 2015-11-18 2016-08-24 乐视网信息技术(北京)股份有限公司 Keyword extraction method and apparatus

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Research on Keyword extraction based on Word2Vec weighted TextRank;Yujun Wen 等;《2016 2nd IEEE International Conference on Computer and Communications》;20161017 *
Topic keyword extraction method fusing topic word embedding and network structure analysis; Zeng Qingtian et al.; Data Analysis and Knowledge Discovery; 20190725 (No. 7); 52-60 *
Domain keyword extraction: combining LDA and Word2Vec; Wei Qiangshen; China Master's Theses Full-text Database, Information Science and Technology; 20161215 (No. 12); I138-406 *

Also Published As

Publication number Publication date
CN108052593A (en) 2018-05-18

Similar Documents

Publication Publication Date Title
CN108052593B (en) Topic keyword extraction method based on topic word vector and network structure
CN111177365B (en) Unsupervised automatic abstract extraction method based on graph model
CN109858028B (en) Short text similarity calculation method based on probability model
CN106156204B (en) Text label extraction method and device
CN106372061B (en) Short text similarity calculation method based on semantics
CN105528437B (en) A kind of question answering system construction method extracted based on structured text knowledge
Wen et al. Research on keyword extraction based on word2vec weighted textrank
CN108681557B (en) Short text topic discovery method and system based on self-expansion representation and similar bidirectional constraint
CN109376352B (en) Patent text modeling method based on word2vec and semantic similarity
CN109885675B (en) Text subtopic discovery method based on improved LDA
CN108920482B (en) Microblog short text classification method based on lexical chain feature extension and LDA (latent Dirichlet Allocation) model
CN108052630B (en) Method for extracting expansion words based on Chinese education videos
WO2013049529A1 (en) Method and apparatus for unsupervised learning of multi-resolution user profile from text analysis
CN111382276A (en) Event development venation map generation method
Tiwari et al. Ensemble approach for twitter sentiment analysis
CN116501875B (en) Document processing method and system based on natural language and knowledge graph
CN112836029A (en) Graph-based document retrieval method, system and related components thereof
Chang et al. A METHOD OF FINE-GRAINED SHORT TEXT SENTIMENT ANALYSIS BASED ON MACHINE LEARNING.
CN116362243A (en) Text key phrase extraction method, storage medium and device integrating incidence relation among sentences
CN110569351A (en) Network media news classification method based on restrictive user preference
Villegas et al. Vector-based word representations for sentiment analysis: a comparative study
Khan et al. Efficient feature selection and domain relevance term weighting method for document classification
CN114298020A (en) Keyword vectorization method based on subject semantic information and application thereof
Gong A personalized recommendation method for short drama videos based on external index features
Tohalino et al. Using virtual edges to extract keywords from texts modeled as complex networks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant