CN108052593A

CN108052593A - A topic keyword extraction method based on topic word vectors and network structure

Info

Publication number: CN108052593A
Application number: CN201711315360.0A
Authority: CN
Inventors: 胡晓慧; 李超; 曾庆田; 戴明弟; 赵中英
Original assignee: Shandong University of Science and Technology
Current assignee: Shandong University of Science and Technology
Priority date: 2017-12-12
Filing date: 2017-12-12
Publication date: 2018-05-18
Anticipated expiration: 2037-12-12
Also published as: CN108052593B

Abstract

The invention discloses a topic keyword extraction method based on topic word vectors and network structure, and relates to the technical field of extracting keywords from text. The method performs topic clustering on a text corpus using an LDA topic model and obtains, for each topic, the 100 keywords most relevant to that topic. Each word in the corpus is represented as a word vector using word2vec, and the semantic similarity between every pair of words is computed. For each keyword, the five semantically most similar words are found; the keywords and these similar words together form a new keyword set. A keyword network is then built, and the top 20 words of each set are taken as the keywords of the topic. The method can extract keywords that occur frequently in the documents and can also effectively discover keywords that occur infrequently but are strongly related to the topic.

Description

A topic keyword extraction method based on topic word vectors and network structure

Technical field

The present invention relates to the technical field of extracting keywords from text, and in particular to a topic keyword extraction method based on topic word vectors and network structure.

Background

With the widespread use of representation learning in natural language processing, representing words as vectors with word2vec captures the semantics and syntactic regularities of words well, while topic models explain document-level topic aggregation well. Research on word vector representations that fuse topic models and topic keywords has therefore become increasingly widespread.

LDA topic model: Among the topic models that have been proposed, LDA is a generative model that can generalize topic distributions. LDA is a three-level hierarchical Bayesian model in which each item of a collection is modeled as a finite mixture over an underlying set of topics; each topic is in turn modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. The LDA modeling process can be described as finding a mixture of topics for each resource, i.e. P(z | d), with each topic described by another probability distribution, i.e. P(t | z). This can be formalized as:

$$P(t_i \mid d) = \sum_{j=1}^{Z} P(t_i \mid z_i = j)\, P(z_i = j \mid d)$$

where P(t_i | d) is the probability of the i-th term for a given document d, and z_i is the latent topic. P(t_i | z_i = j) is the probability of t_i within topic j. P(z_i = j | d) is the probability of topic j in the document. The number of latent topics Z must be defined in advance. LDA uses Dirichlet priors and a fixed number of topics to estimate the topic-term distribution P(t | z) and the document-topic distribution P(z | d) from an unlabeled corpus.
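The mixture above can be illustrated with a minimal Python sketch. The two topics, the three-term vocabulary and all probabilities are made-up toy values, and the function name `term_given_doc` is hypothetical; the sketch only evaluates the sum over topics and does not estimate an LDA model.

```python
def term_given_doc(p_term_given_topic, p_topic_given_doc):
    """Return P(t|d) = sum_j P(t|z=j) * P(z=j|d) for every term t."""
    terms = set()
    for dist in p_term_given_topic:
        terms.update(dist)
    return {
        t: sum(p_t_z.get(t, 0.0) * p_z
               for p_t_z, p_z in zip(p_term_given_topic, p_topic_given_doc))
        for t in terms
    }

# Toy values (assumed): Z = 2 topics over a three-term vocabulary.
p_t_z = [
    {"exam": 0.7, "course": 0.2, "library": 0.1},  # P(t | z = 0)
    {"exam": 0.1, "course": 0.1, "library": 0.8},  # P(t | z = 1)
]
p_z_d = [0.75, 0.25]  # P(z = 0 | d), P(z = 1 | d)

p_t_d = term_given_doc(p_t_z, p_z_d)
print(p_t_d)
```

Because each topic distribution and the document-topic distribution sum to one, the resulting P(t | d) also sums to one.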

LDA is a very widely used topic model, and most other topic models are extensions of it. However, the keywords that LDA extracts are generally too broad to reflect the topic of an article well; the method proposed by the present invention is therefore an innovation in this respect.

Word embedding: Word embedding encodes each word as a vector (a word vector) according to syntactic and semantic information, so that similar words lie close to each other in the vector space. After a language model has been built from statistics over natural text and word vectors have been obtained, the vectors can be used as input to a neural network for tasks such as syntactic analysis and sentiment analysis, or as supplementary features to extend existing models. However, word vectors alone cannot identify the intended topic of a text and must be combined with a topic model.

Existing unsupervised keyword extraction techniques mainly include TF-IDF, topic models and TextRank. Their technical shortcomings are mainly the following:

TF-IDF is a weighting technique commonly used in information retrieval and data mining. It measures the importance of search keywords and also performs well when applied to text keyword extraction. However, TF-IDF is based on word frequency and the cross entropy of the keyword probability distribution; it considers neither the order in which words occur nor the relation between each word and its context.

Widely used topic models such as LDA can mine topics from documents well, but the keywords they extract are too broad: many words have high frequency but are unrelated to the topic and do not reflect it well, so they are unsuitable as keywords.

TextRank is a graph-based ranking algorithm for text. It splits the text into sentences, builds a graph model from the co-occurrence of words within their context, and extracts keywords according to the PageRank values in the graph. The algorithm extracts the keywords of a single document concisely and effectively by taking word frequency and word co-occurrence into account, but it cannot identify or cluster the topics of multiple documents and therefore cannot extract the keywords of documents on a particular topic.
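To make the first of these shortcomings concrete, the following minimal sketch scores terms with the commonly used tf × log(N/df) weighting. This formulation is an assumption, since the text gives no TF-IDF formula, and the function name `tf_idf` and the toy documents are hypothetical. Two documents containing the same words in a different order receive identical scores, which shows that word order and context are ignored.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

docs = [
    ["library", "opens", "new", "reading", "room"],
    ["room", "reading", "new", "opens", "library"],  # same words, reversed order
    ["exam", "schedule", "for", "new", "term"],
]
scores = tf_idf(docs)
print(scores[0] == scores[1])  # word order does not affect the scores
```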

Summary of the invention

The purpose of the present invention is to address the above deficiencies by proposing a keyword extraction method that combines the LDA topic model with word embedding and extracts the keywords of texts on the same topic through similarity propagation over a network. The method can extract keywords that occur frequently in the documents and can also effectively discover keywords that occur infrequently but are strongly related to the topic.

The present invention specifically adopts the following technical scheme that：

As shown in Figure 1, a topic keyword extraction method based on topic word vectors and network structure specifically includes:

Segmenting the original text corpus into words;

Performing topic clustering on the text corpus based on an LDA topic model, and obtaining for each topic the keyword set KeywordsSet₁ = {k₁, ..., k₁₀₀} of the 100 words most relevant to the topic;

Representing each word in the text corpus as a word vector using word2vec, and obtaining the semantic similarity between every pair of words by computing the cosine between their word vectors;

For each keyword in KeywordsSet₁, finding the five words semantically most similar to it; the words in KeywordsSet₁ together with these top-5 similar words form a new keyword set KeywordsSet₂;

Building a keyword network with each keyword in KeywordsSet₂ as a node and the reciprocal of the semantic similarity between words as the edge weight, and taking the 20 words in KeywordsSet₂ with the highest PageRank values as the keywords of the topic, forming the final keyword set KeywordsSet_final.
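The similarity and expansion steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the three-dimensional vectors are toy values rather than trained word2vec output, the names `cosine` and `expand_keywords` are hypothetical, and `top_n=1` is used only to keep the toy example small (the method uses the top 5).

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand_keywords(keywords, vectors, top_n=5):
    """Return the keywords plus, for each keyword, its top_n most similar words."""
    expanded = list(keywords)
    seen = set(keywords)
    for kw in keywords:
        if kw not in vectors:
            continue
        ranked = sorted(
            (w for w in vectors if w != kw),
            key=lambda w: cosine(vectors[kw], vectors[w]),
            reverse=True,
        )
        for w in ranked[:top_n]:
            if w not in seen:  # deduplicate across keywords
                seen.add(w)
                expanded.append(w)
    return expanded

# Toy word vectors (assumed values).
vectors = {
    "exam": [0.9, 0.1, 0.0],
    "test": [0.8, 0.2, 0.0],
    "quiz": [0.7, 0.3, 0.1],
    "library": [0.0, 0.1, 0.9],
    "book": [0.1, 0.0, 0.8],
}
keywords_set_1 = ["exam", "library"]
keywords_set_2 = expand_keywords(keywords_set_1, vectors, top_n=1)
print(keywords_set_2)
```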

Preferably, the word segmentation divides the obtained original text into word sequences for the subsequent topic clustering and keyword extraction. When the segmentation result is used as input to word2vec, special symbols are removed; when it is used as input to LDA, function words, place names that cannot serve as topic keywords, and repeated prepositions unrelated to the topic are also removed.
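The two filtering levels described above might look like the following sketch. The input is assumed to be already segmented, the stop lists `FUNCTION_WORDS` and `PLACE_NAMES` are hypothetical placeholders, and the English tokens stand in for segmented Chinese text.

```python
import re

# Hypothetical stop lists; real lists would be built for the corpus at hand.
FUNCTION_WORDS = {"the", "of", "and", "in", "to"}
PLACE_NAMES = {"Qingdao"}

def tokens_for_word2vec(tokens):
    """Keep every word; drop tokens that consist only of symbols or punctuation."""
    return [t for t in tokens if re.search(r"\w", t)]

def tokens_for_lda(tokens):
    """Additionally drop function words and place names."""
    return [
        t for t in tokens_for_word2vec(tokens)
        if t.lower() not in FUNCTION_WORDS and t not in PLACE_NAMES
    ]

segmented = ["Notice", ":", "the", "library", "in", "Qingdao", "opens", "★"]
print(tokens_for_word2vec(segmented))
print(tokens_for_lda(segmented))
```

Keeping function words in the word2vec input preserves the context windows the embedding model learns from, while removing them from the LDA input keeps them out of the topic-word distributions.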

Preferably, when topic clustering is performed on the text corpus based on the LDA topic model, perplexity, as used in language modeling, measures the quality of the model: a lower perplexity indicates better generalization performance. Perplexity is calculated as follows:

$$\mathrm{perplexity} = \exp\left(-\frac{\sum_{i=1}^{N} \log \sum_{j=1}^{K} P(w_i \mid t_j)\, P(t_j \mid d)}{N}\right) \qquad (1)$$

where P(w_i | t_j) is the distribution of word w_i over topic t_j, P(t_j | d) is the distribution of topic t_j over document d, N is the total number of distinct words in the corpus, K is the number of topics, i = 1, ..., N and j = 1, ..., K.
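As a sanity check on the perplexity formula of claim 3, the exponential of the negative mean log of the mixture probabilities Σ_j P(w_i | t_j) P(t_j | d), the following sketch evaluates it on toy values. The function name `perplexity` and all probabilities are assumptions; a model that assigns every one of four words the probability 0.25 should give a perplexity of 4.

```python
import math

def perplexity(words, p_word_given_topic, p_topic_given_doc):
    """exp(-(1/N) * sum_i log(sum_j P(w_i|t_j) * P(t_j|d)))."""
    log_sum = 0.0
    for w in words:
        p_w = sum(p_w_t[w] * p_t
                  for p_w_t, p_t in zip(p_word_given_topic, p_topic_given_doc))
        log_sum += math.log(p_w)
    return math.exp(-log_sum / len(words))

# Toy values (assumed): N = 4 distinct words, K = 2 topics, both uniform.
words = ["exam", "course", "library", "notice"]
uniform = {w: 0.25 for w in words}
p_w_t = [uniform, uniform]  # P(w | t_1), P(w | t_2)
p_t_d = [0.5, 0.5]          # P(t_1 | d), P(t_2 | d)

ppl = perplexity(words, p_w_t, p_t_d)
print(ppl)
```

A uniform model is exactly as uncertain as a fair choice among the four words, which is why the value equals the vocabulary size here.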

Preferably, in the word vector generation process, the segmentation result of the combined title and body text is used as input to obtain a word vector representation model for each word.

Preferably, the keyword network construction process specifically includes the following steps:

S1: Using the cosine relation between word vectors, compute, under the same topic, the five words semantically most similar to each initial keyword obtained in the topic clustering step; remove duplicates and combine them with KeywordsSet₁ to form the new keyword set KeywordsSet₂;

S2: For each topic, compute the pairwise similarity between the words in KeywordsSet₂ and use its reciprocal as the weight between the two nodes;

S3: Set a threshold and filter out edges whose similarity is below the threshold;

S4: Build the keyword network of each topic;

S5: Extract topic keywords: once the keyword network has been built, compute the PageRank value of every node in each topic network and take the words corresponding to the 20 highest-ranked nodes as the keyword set KeywordsSet_final of the topic.
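Steps S2 to S5 can be sketched in plain Python as below. Several points are assumptions rather than details given in the text: the graph is treated as undirected, a node passes its rank to neighbors in proportion to the edge weights (a standard weighted PageRank), the damping factor is 0.85, and the similarity values and the threshold of 0.3 are illustrative. The function names are hypothetical.

```python
def build_network(similarity, threshold):
    """similarity: {(word_a, word_b): sim}. Returns {node: {neighbor: weight}}."""
    graph = {}
    for (a, b), sim in similarity.items():
        graph.setdefault(a, {})
        graph.setdefault(b, {})
        if sim >= threshold and sim > 0:  # S3: drop edges below the threshold
            weight = 1.0 / sim            # S2: reciprocal of the similarity
            graph[a][b] = weight
            graph[b][a] = weight
    return graph

def pagerank(graph, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration on an undirected graph."""
    n = len(graph)
    out_weight = {v: sum(graph[v].values()) for v in graph}
    rank = {v: 1.0 / n for v in graph}
    for _ in range(iterations):
        # Rank held by isolated nodes is spread evenly over all nodes.
        dangling = sum(rank[v] for v in graph if not graph[v])
        new_rank = {}
        for v in graph:
            incoming = sum(rank[nb] * w / out_weight[nb]
                           for nb, w in graph[v].items())
            new_rank[v] = (1 - damping) / n + damping * (incoming + dangling / n)
        rank = new_rank
    return rank

def top_keywords(graph, k=20):
    """S5: words of the k nodes with the highest PageRank values."""
    rank = pagerank(graph)
    return sorted(rank, key=rank.get, reverse=True)[:k]

# Illustrative pairwise similarities (assumed values).
pair_similarity = {
    ("exam", "test"): 0.9,
    ("exam", "quiz"): 0.8,
    ("exam", "course"): 0.7,
    ("test", "quiz"): 0.2,  # below the threshold, so this edge is dropped
}
network = build_network(pair_similarity, threshold=0.3)
ranks = pagerank(network)
keywords_final = top_keywords(network, k=20)
print(keywords_final)
```

In this toy network "exam" is connected to every other word and therefore ends up with the highest rank.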

The present invention has the following advantages:

The present invention first clusters the text corpus based on the LDA topic model; second, it represents each word in the corpus as a word vector using word2vec; then it obtains, for each keyword in the documents of a topic, the five most similar words, which together with the keywords form a new keyword set. Finally, with keywords as nodes and edge weights derived from the similarity between words, it builds a keyword network and takes the core nodes of the network as the keywords of the topic;

The method combines the LDA topic model with word embedding and extracts the keywords of texts on the same topic through similarity propagation over a network; it can extract keywords that occur frequently in the documents and, at the same time, effectively discover keywords that occur infrequently but are strongly related to the topic;

In addition to considering word frequency, the method performs a second round of keyword discovery based on word vector relations: words that are not frequent but are semantically similar are included in the candidate keyword set, which reasonably widens the range of candidate keywords and makes the final keywords of a topic more closely related semantically;

By introducing word vectors and building the network from the distances between them, the method finds keywords with similar meanings under the same topic more accurately and thus obtains more accurate results.

Brief description of the drawings

Fig. 1 is a flow chart of the topic keyword extraction method based on topic word vectors and network structure;

Fig. 2 is the perplexity curve;

Fig. 3 is the keyword distribution map for teaching notices;

Fig. 4 is the keyword distribution map for award and evaluation notices;

Fig. 5 is the keyword distribution map for library notices.

Detailed description

Specific embodiments of the present invention are further described below with reference to the drawings:

The original text corpus is segmented into words;

With each keyword in KeywordsSet₂ as a node and the reciprocal of the semantic similarity between words as the edge weight, a keyword network is built, and the 20 words in KeywordsSet₂ with the highest PageRank values are taken as the keywords of the topic, forming the final keyword set KeywordsSet_final.

Word segmentation divides the obtained original text into word sequences for the subsequent topic clustering and keyword extraction. When the segmentation result is used as input to word2vec, special symbols are removed; when it is used as input to LDA, function words, place names that cannot serve as topic keywords, and the many repeated prepositions unrelated to the topic are also removed.

As shown in Figure 2, when topic clustering is performed on the text corpus based on the LDA topic model, perplexity, as used in language modeling, measures the quality of the model: a lower perplexity indicates better generalization performance. Perplexity is calculated as follows:

$$\mathrm{perplexity} = \exp\left(-\frac{\sum_{i=1}^{N} \log \sum_{j=1}^{K} P(w_i \mid t_j)\, P(t_j \mid d)}{N}\right) \qquad (1)$$

where P(w_i | t_j) is the distribution of word w_i over topic t_j, P(t_j | d) is the distribution of topic t_j over document d, N is the total number of distinct words in the corpus, K is the number of topics, i = 1, ..., N and j = 1, ..., K. The optimal number of topics is obtained by varying the number of topics and computing the perplexity of the data set for each value.

Choosing the number of topics at the inflection point of the curve keeps the perplexity of the data set small without making the number of topics excessive. The topic distribution of each document and the word distribution under each topic are then obtained, and the 100 words with the highest LDA values under each topic are selected as the initial keyword set.
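The text does not say how the inflection point is located. One simple heuristic, given here purely as an assumption, is to pick the topic count at which the curve bends most sharply, measured by the largest second difference over evenly spaced topic counts. The perplexity values below are invented for illustration and are not read from Fig. 2; the function names are hypothetical. The second helper shows the top-100 selection on a topic-word distribution.

```python
def pick_topic_count(topic_counts, perplexities):
    """Return the topic count at the sharpest bend (largest second difference)."""
    best_k, best_bend = topic_counts[1], float("-inf")
    for i in range(1, len(topic_counts) - 1):
        bend = perplexities[i - 1] - 2 * perplexities[i] + perplexities[i + 1]
        if bend > best_bend:
            best_k, best_bend = topic_counts[i], bend
    return best_k

def top_words(word_probs, n=100):
    """Return the n words with the highest probability under one topic."""
    ranked = sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:n]]

# Invented perplexity values for evenly spaced topic counts.
topic_counts = [5, 10, 15, 20, 25]
perplexities = [900.0, 600.0, 400.0, 380.0, 370.0]
chosen = pick_topic_count(topic_counts, perplexities)
print(chosen)

initial_keywords = top_words({"exam": 0.6, "course": 0.3, "notice": 0.1}, n=2)
print(initial_keywords)
```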

In the word vector generation process, the segmentation result of the combined title and body text is used as input to obtain a word vector representation model for each word. In this scheme, the CBOW model with a window size of 5 is selected to predict the probability of the current center word, and negative sampling is selected, which uses logistic regression to distinguish target words from words drawn from a noise distribution. Table 1 (word2vec model training parameter settings) gives the description and default values of the key parameters used in training.

Table 1

In the end, a high-dimensional vector representation of every word in the text is obtained, and the word vector model yields the similarity relation between all words, i.e. their semantic distance.
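A training configuration along these lines could be expressed with the gensim library, which is an assumption since the text does not name an implementation. Only CBOW, the window size of 5 and negative sampling come from the description; the other values are placeholders because the contents of Table 1 are not available in this text. The function name `train_word_vectors` is hypothetical.

```python
# CBOW, window 5 and negative sampling come from the description; the
# remaining values are placeholder assumptions, not the values of Table 1.
W2V_PARAMS = {
    "sg": 0,             # 0 selects CBOW (1 would select skip-gram)
    "window": 5,         # context window size stated in the description
    "hs": 0,             # disable hierarchical softmax ...
    "negative": 5,       # ... and use negative sampling (count is assumed)
    "vector_size": 100,  # assumed dimensionality
    "min_count": 1,      # assumed; keeps rare words in the vocabulary
}

def train_word_vectors(segmented_docs):
    """segmented_docs: list of token lists (title and body tokens combined)."""
    # Imported lazily so the configuration can be inspected without gensim;
    # the keyword names above follow gensim 4.x.
    from gensim.models import Word2Vec
    model = Word2Vec(sentences=segmented_docs, **W2V_PARAMS)
    return model.wv

print(W2V_PARAMS)
```

With gensim installed, the returned keyed vectors expose `similarity(a, b)` for the cosine similarity of two words and `most_similar(word, topn=5)` for the five nearest words.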

During keyword network construction, the steps specifically include:

S3: Set a threshold and filter out edges whose similarity is below the threshold. The results for different threshold values are shown in Table 2:

Table 2

Threshold (1/similarity)	Topic similarity
0.05	0.41
0.1	0.44
0.15	0.48
0.2	0.49
0.25	0.52
0.3	0.59
0.35	0.55
0.4	0.57
0.45	0.56
0.5	0.52
0.55	0.50

Table 2 shows that with a threshold of 0.3 (i.e. similarity > 3.33), the cohesion between keywords under the same topic is highest among the tested values.

S4：Build the keyword network of each theme；

S5: Extract topic keywords: once the keyword network has been built, compute the PageRank value of every node in each topic network and take the words corresponding to the 20 highest-ranked nodes as the new keyword set KeywordsSet_final of the topic.
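The threshold choice in step S3 can be reproduced directly from Table 2. The short sketch below copies the table values and selects the row with the highest topic similarity; the variable names are hypothetical.

```python
# (threshold on 1/similarity, topic similarity) pairs copied from Table 2.
TABLE_2 = [
    (0.05, 0.41), (0.10, 0.44), (0.15, 0.48), (0.20, 0.49),
    (0.25, 0.52), (0.30, 0.59), (0.35, 0.55), (0.40, 0.57),
    (0.45, 0.56), (0.50, 0.52), (0.55, 0.50),
]

# Pick the threshold whose keyword set is the most cohesive.
best_threshold, best_score = max(TABLE_2, key=lambda row: row[1])
print(best_threshold, best_score)
```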

As shown in Figures 3 to 5, the scheme of the present invention was tested experimentally. A total of 9,802 campus news notices published by a certain university between 2002 and 2017 were first crawled. After word segmentation, topic keywords were extracted through topic mining, word vector calculation and keyword network construction, and the results were compared with the keywords obtained by the traditional LDA topic model.

In the figures, darker words reflect the topic better, and lighter words are less relevant to the topic. A larger word ranks higher under this method. It can be seen that the method of the present invention extracts keywords that better represent the topic by taking both word frequency and semantics into account.

Of course, the above description does not limit the present invention, and the present invention is not restricted to the above examples. Variations, modifications, additions or substitutions made by those skilled in the art within the essential scope of the present invention also fall within its scope of protection.

Claims

1. A topic keyword extraction method based on topic word vectors and network structure, characterized in that it specifically comprises:

Segmenting the original text corpus into words;

Performing topic clustering on the text corpus based on an LDA topic model, and obtaining for each topic the keyword set KeywordsSet₁ = {k₁, ..., k₁₀₀} of the 100 words most relevant to the topic;

Representing each word in the text corpus as a word vector using word2vec, and obtaining the semantic similarity between every pair of words by computing the cosine between their word vectors;

For each keyword in KeywordsSet₁, finding the five words semantically most similar to it, the words in KeywordsSet₁ together with these top-5 similar words forming a new keyword set KeywordsSet₂;

Building a keyword network with each keyword in KeywordsSet₂ as a node and the reciprocal of the semantic similarity between words as the edge weight, and taking the 20 words in KeywordsSet₂ with the highest PageRank values as the keywords of the topic, forming the final keyword set KeywordsSet_final.

2. The topic keyword extraction method based on topic word vectors and network structure according to claim 1, characterized in that the word segmentation divides the obtained original text into word sequences for the subsequent topic clustering and keyword extraction; when the segmentation result is used as input to word2vec, special symbols are removed; when it is used as input to LDA, function words, place names that cannot serve as topic keywords, and repeated prepositions unrelated to the topic are also removed.

3. The topic keyword extraction method based on topic word vectors and network structure according to claim 1, characterized in that, when topic clustering is performed on the text corpus based on the LDA topic model, perplexity, as used in language modeling, measures the quality of the model, a lower perplexity indicating better generalization performance, and perplexity is calculated as follows:

$$\mathrm{perplexity} = \exp\left(-\frac{\sum_{i=1}^{N} \log \sum_{j=1}^{K} P(w_i \mid t_j)\, P(t_j \mid d)}{N}\right) \qquad (1)$$

where P(w_i | t_j) is the distribution of word w_i over topic t_j, P(t_j | d) is the distribution of topic t_j over document d, N is the total number of distinct words in the corpus, K is the number of topics, i = 1, ..., N and j = 1, ..., K.

4. The topic keyword extraction method based on topic word vectors and network structure according to claim 1, characterized in that, in the word vector generation process, the segmentation result of the combined title and body text is used as input to obtain a word vector representation model for each word.

5. The topic keyword extraction method based on topic word vectors and network structure according to claim 1, characterized in that the keyword network construction process specifically includes the following steps:

S1: Using the cosine relation between word vectors, compute, under the same topic, the five words semantically most similar to each initial keyword obtained in the topic clustering step; remove duplicates and combine them with KeywordsSet₁ to form the new keyword set KeywordsSet₂;

S2: For each topic, compute the pairwise similarity between the words in KeywordsSet₂ and use its reciprocal as the weight between the two nodes;

S4: Build the keyword network of each topic;

S5: Extract topic keywords: once the keyword network has been built, compute the PageRank value of every node in each topic network and take the words corresponding to the 20 highest-ranked nodes as the keyword set KeywordsSet_final of the topic.