CN105843795A - Topic model based document keyword extraction method and system - Google Patents

Info

Publication number
CN105843795A
CN105843795A (application CN201610162410.5A); granted publication CN105843795B
Authority
CN
China
Prior art keywords: word, document, theme, pagerank, value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610162410.5A
Other languages
Chinese (zh)
Other versions
CN105843795B (en)
Inventor
蔡毅
杨楷
闵华清
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT
Priority to CN201610162410.5A
Publication of CN105843795A
Application granted
Publication of CN105843795B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/35 - Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a topic model based document keyword extraction method and system. The extraction method comprises the following steps: document information preprocessing, document structure graph construction, document topic distribution extraction, word weight extraction, and keyword generation. The extraction system comprises the corresponding modules: a document information preprocessing module, a document structure graph construction module, a document topic distribution extraction module, a word weight extraction module, and a keyword generation module. With the method and system, the extracted keywords are more reasonable and more closely related to the topic of the document; some current deficiencies in the keyword extraction field are overcome, the document is summarized better, and users can quickly and conveniently grasp the gist of the document.

Description

Topic model based document keyword extraction method and system
Technical field
The present invention relates to data mining technology, and in particular to a topic model based document keyword extraction method and system.
Background technology
Keywords summarize the main content of a document and are an important means of quickly understanding its subject. Keywords can be seen everywhere: on news websites every article carries labels, and when browsing technical papers we can see the keywords they discuss. Keywords reduce the difficulty of finding information in a sea of content and are now used in many fields. In information retrieval, keyword applications are widespread: search engine companies such as Baidu and Google retrieve results based on the keywords of web page text, and the results retrieved by document keywords are often exactly what users want. In social networks, many features and related studies are built on the tags annotated by users. User tags make it convenient for users to manage, collect, and retrieve the tagged objects, and can also be used to deliver personalized recommendations to users. By offering tagging functions for objects such as pictures, articles, and videos, and exploiting collective intelligence, we can obtain large numbers of annotated documents that provide data support for research work.
Keywords are widely used in many fields, and there are three general ways of producing them: 1) spontaneous generation by users, who annotate the content they are interested in; 2) manual annotation of documents by experts; 3) automatic keyword extraction techniques. Spontaneous user tagging suits only limited scenarios: users annotate only the specific objects they care about, and there is no effective way to motivate them to tag other content. Moreover, with information technology developing rapidly and the amount of Internet content growing explosively, new content is produced all the time; manual annotation by experts is expensive, the annotated documents can only be used for search, and commercial use is difficult. The demand for automatic document keyword extraction techniques is therefore urgent, and research on the related problems is a current focus.
Summary of the invention
The primary object of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing a topic model based document keyword extraction method that makes the extracted keywords more reasonable and more representative.
Another object of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing a topic model based document keyword extraction system that addresses some weaknesses of current keyword extraction, so that the keywords summarize the document better and help users understand the document quickly.
The primary object of the present invention is achieved through the following technical solution: a topic model based document keyword extraction method, comprising:
S1, document information preprocessing: the input document is segmented and part-of-speech tagged, function words and stop words are removed, stems are extracted, and semi-structured data is established.
S2, document structure graph construction: the document structure graph describes the position of each word in the document. Each node of the graph represents one word, and an edge linking two nodes indicates that the words represented by the two nodes occur close to each other in the document. The present invention proposes a document structure graph construction method.
S3, document topic distribution extraction: each document has the topics it emphasizes. The method extracts, through topic model techniques, the topic distribution of the document and of each word in it. The method also proposes a background-word based topic model that improves the effect of the topic model. The closer the topics of two documents are, the closer the things they describe; and for each topic, a set of words related to that topic can be extracted from the document collection.
S4, word weight extraction: the weight of each word represents its importance in the document. The more important a word is in the document, the higher its weight; conversely, a low-weight word is of low importance in the document. The present invention proposes a weight extraction method.
S5, keyword generation: based on the steps above, the method converts the keyword extraction problem into the graph problem of extracting key nodes. It applies the PageRank algorithm to the document structure graph and, combining the topic model and the word weights, computes a score for each word; the larger the score, the more likely the word is a keyword of the document. The present invention proposes a keyword generation method.
The document information preprocessing comprises the following steps:
S1a, for Chinese text, a word segmentation tool is used to split the text into words with parts of speech; for English text, the document is tokenized on whitespace, and word stemming is applied to obtain word prototypes;
S1b, a part-of-speech tagging tool is used to annotate the part of speech of each word segmented in step S1a;
S1c, using the part-of-speech tagging results produced in step S1b, function words and stop words are deleted from the document.
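As an illustration of the preprocessing steps above for English text, the following sketch assumes regex tokenization on word characters, a tiny stand-in stop-word list, and a naive suffix-stripping stemmer in place of the segmentation, tagging, and stemming tools named above; the function names and the stop-word set are illustrative, not part of the invention.

```python
import re

# Minimal illustrative stop-word list; a real system would use a full one.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in"}

def naive_stem(word: str) -> str:
    # Crude stand-in for a stemming tool: strip a few common English suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text: str) -> list[str]:
    tokens = re.findall(r"[a-z]+", text.lower())         # tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]  # drop stop words
    return [naive_stem(t) for t in tokens]               # reduce to prototypes

print(preprocess("The dogs are barking in the garden."))
# → ['dog', 'bark', 'garden']
```

The output is the semi-structured word list that the later steps operate on.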
The document structure graph construction comprises the following steps:
S2a, select a sliding window length, denoted W;
S2b, construct a sliding window of length W; for each word that appears in the window, build a graph node, and if two words appear in the sliding window at the same time, add an edge between the two nodes that represent them;
S2c, move the sliding window from the head of the document toward its tail, continually adding nodes and edges to the graph as the window moves.
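The steps S2a to S2c can be sketched as follows, assuming an undirected graph whose edge weights count how many windows two words co-occur in; the symmetric treatment and the weight convention are assumptions for illustration, not specified above.

```python
from collections import defaultdict
from itertools import combinations

def build_graph(words: list[str], w: int = 3) -> dict[tuple[str, str], int]:
    """Slide a window of length w over the word list and count co-occurrences."""
    edges: dict[tuple[str, str], int] = defaultdict(int)
    for start in range(max(1, len(words) - w + 1)):  # move toward the tail
        window = words[start : start + w]
        # Every pair of distinct words in the same window gets an edge.
        for a, b in combinations(sorted(set(window)), 2):
            edges[(a, b)] += 1
    return dict(edges)

g = build_graph(["topic", "model", "keyword", "topic", "graph"], w=3)
print(g[("model", "topic")])  # → 2
```

"model" and "topic" share two windows, so their edge carries weight 2.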
The document topic distribution extraction comprises the following steps:
S3a, document background word annotation: background words are words that carry no specific semantic meaning in a document, and they often confuse the topic model. Because the number of such words is limited, we annotate them semi-automatically: we compute the TF-IDF value of each word, choose a threshold, and treat the words below the threshold as carrying little information; these words are then browsed manually, and the background words are selected among them.
The method uses TF-IDF (Term Frequency-Inverse Document Frequency) to compute the weight values. TF-IDF measures how important a word is to a document collection. Its main idea is: if a word occurs frequently in one document (high TF) but rarely in the other documents (low IDF), the word has strong discriminating power. TF-IDF is computed as follows:
TF_i = n_i / Σ_k n_k
IDF_i = log(D / D_w)
TF-IDF_i = TF_i × IDF_i
In the formulas above, i denotes the i-th word: n_i is the number of times word t_i occurs in the document; TF_i is the term frequency of t_i in the document; Σ_k n_k is the total number of word occurrences in the document; IDF_i is the inverse document frequency of t_i; D is the number of documents in the system, and D_w is the number of documents in which t_i occurs;
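A minimal sketch of the TF-IDF computation defined by these formulas, under the assumption that a document is a list of word tokens and the collection is a list of such documents:

```python
import math

def tf_idf(doc: list[str], docs: list[list[str]]) -> dict[str, float]:
    """TF_i = n_i / sum_k n_k within one document; IDF_i = log(D / D_w)."""
    total = len(doc)
    scores = {}
    for word in set(doc):
        tf = doc.count(word) / total
        d_w = sum(1 for d in docs if word in d)  # documents containing the word
        idf = math.log(len(docs) / d_w)
        scores[word] = tf * idf
    return scores

docs = [["cat", "sits", "cat"], ["dog", "runs"], ["cat", "dog"]]
s = tf_idf(docs[0], docs)
print(round(s["sits"], 3))  # → 0.366 ("sits" occurs in only one document)
```

Words like "sits" that concentrate in one document score high; words below a chosen threshold would be reviewed manually as background-word candidates.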
S3b, a variant of LDA (Latent Dirichlet Allocation) called bLDA (background-based Latent Dirichlet Allocation) is used to obtain the latent topic distribution of the document. In bLDA, all documents in the collection share all the latent topics in certain proportions, and each latent topic consists of a set of related feature words. In bLDA, the first topic is designated the background topic, and all words unrelated to any topic are gathered into it. Because Gibbs sampling can extract topics effectively from large-scale document sets, the method uses Gibbs sampling to solve bLDA. Through bLDA, the probability of each topic for each word in the document can be obtained:
W_z(w_i) = p(z | w)
i.e. the probability that word w is assigned to topic z given w, which reflects to what degree w belongs to z.
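A full Gibbs sampler for bLDA is beyond a short sketch, but once a sampler has produced topic-word assignment counts, the probability p(z | w) above can be recovered by normalizing each word's counts over the topics. The count layout below is an assumption for illustration, with topic 0 playing the background role:

```python
def topic_given_word(counts: list[dict[str, int]]) -> dict[str, list[float]]:
    """counts[z][w]: how often word w was assigned to topic z by the sampler.
    Returns, for each word, the distribution p(z | w) over topics."""
    words = {w for topic in counts for w in topic}
    dist = {}
    for w in words:
        col = [topic.get(w, 0) for topic in counts]
        total = sum(col)
        dist[w] = [c / total for c in col]  # normalize over topics
    return dist

# Hypothetical sampler output: topic 0 is the background topic.
counts = [{"the": 9, "model": 1}, {"model": 4, "topic": 5}]
p = topic_given_word(counts)
print(p["model"])  # → [0.2, 0.8]
```

Here "model" belongs mostly to topic 1, while "the" would fall almost entirely into the background topic.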
The word weight extraction comprises the following steps:
S4a, step S3b yields the topic distribution of each word in the document, and we compute the information entropy of that distribution. The larger a word's entropy, the more uniform its topic distribution, i.e. the higher its topic ambiguity; the smaller the entropy, the more concentrated the distribution and the lower the ambiguity. The entropy of the topic distribution of word w_i in the document is computed as follows:
H(w_i) = −Σ_{t=1..K} p(k = t | w_i) log p(k = t | w_i)
S4b, the weight of each word is computed from the entropy obtained above, as follows:
weight(w_i) = δ^{H(w_i)}
When the latent topic distribution of w_i is nearly uniform (i.e. the word is not representative), H(w_i) is large, so δ^{H(w_i)} becomes small (δ being a factor in (0, 1)), i.e. the weight of the word becomes small.
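Steps S4a and S4b can be sketched as follows, assuming δ lies in (0, 1) so that a more uniform (less representative) topic distribution yields a smaller weight; the value δ = 0.5 is illustrative only.

```python
import math

def entropy(p: list[float]) -> float:
    """Information entropy of a topic distribution (0 log 0 treated as 0)."""
    return -sum(q * math.log(q) for q in p if q > 0)

def weight(p: list[float], delta: float = 0.5) -> float:
    """weight(w_i) = delta ** H(w_i); delta in (0, 1) penalizes uniform distributions."""
    return delta ** entropy(p)

focused = [0.97, 0.01, 0.01, 0.01]  # concentrated on one topic
uniform = [0.25, 0.25, 0.25, 0.25]  # spread evenly over all topics
print(weight(focused) > weight(uniform))  # → True
```

The concentrated distribution has low entropy and therefore keeps a weight close to 1, while the uniform one is discounted.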
The keyword generation comprises the following steps:
S5a, first randomly initialize the PageRank value of each word.
S5b, for every document and every topic, the method uses PageRank to score the importance of each word. PageRank, an algorithm first proposed and used by Google, builds a web page graph from the in-links and out-links of pages and computes the influence of each page from the resulting topology. It is generally used in the field of web page evaluation to judge how important a page is, and it rests on two assumptions: the quantity assumption, that the more in-links a page has, the more important it is; and the quality assumption, that the more important the pages linking into a page are, the more important that page is. The method converts the keyword extraction problem into the graph problem of extracting key nodes and applies the PageRank algorithm to the document structure graph.
For each topic in each document, the method computes the PageRank value of every word, using the following formula:
TR_t(w_i) = λ · Σ_{j: w_j → w_i} [ e(w_j, w_i) / O(w_j) ] · TR_t(w_j) + (1 − λ) · W_t(w_i)
In the formula above, λ is the damping factor, with a value in [0, 1]. A node does not necessarily have out-links; a node whose out-degree is 0 is isolated from the other pages, so to make such a page reachable, PageRank is modified by the damping factor: (1 − λ) is the probability that a node jumps to an arbitrary other node. O(w_j) is the total out-degree of node w_j, and e(w_j, w_i) is the number of edges from w_j to w_i. The main idea of the formula is that the PageRank value of each page equals the sum of the values distributed to it by the pages whose links enter it.
Here W_t(w_i) = p(z | w), the probability that word w is assigned to topic z given w, which reflects to what degree w belongs to z; this probability is produced by step S3b. For each document, we can draw K word graphs with different weights according to the K topics, and by applying the formula above compute K PageRank values for each word in the document.
S5c, the method is iterative: compute the difference between the weight value of each word in the current iteration and in the previous one. If the difference is less than 0.001, terminate and go to the next step; if the number of iterations reaches the termination threshold of 300, also terminate and go to the next step; otherwise return to S5b for another round of iteration. As the steps are repeated, the PageRank value of each page tends toward a stable value, because the algorithm eventually converges, and the convergence value is the PageRank value.
S5d, for each word, the method merges its K PageRank values under the K topics into a single PageRank value. The concrete processing is as follows: with p(z | d) the probability that document d belongs to topic z, the K PageRank values of each word are merged, adding the influence of the entropy when computing the weight of each word, using the following formula:
TR(w_i) = δ^{H(w_i)} · Σ_{z=1..K} TR_z(w_i) × p(z | d)
δ is a smoothing factor that controls how strongly the entropy value affects the final PageRank value.
S5e, according to the final PageRank value of each word in the document, the method sorts the words by value and designates the top N words as the keywords of the document.
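The merging and ranking of S5d and S5e can be sketched as follows, again assuming δ in (0, 1); tr_per_topic[z] is the per-topic PageRank table from the previous step, and all names and numbers are illustrative.

```python
def final_scores(tr_per_topic, p_z_d, entropies, delta=0.5):
    """TR(w_i) = delta ** H(w_i) * sum_z TR_z(w_i) * p(z | d)."""
    words = tr_per_topic[0].keys()
    return {w: delta ** entropies[w]
               * sum(tr_per_topic[z][w] * p_z_d[z] for z in range(len(p_z_d)))
            for w in words}

def top_n(scores, n=2):
    # S5e: sort by final PageRank value and keep the N largest.
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:n]]

tr = [{"cat": 0.6, "dog": 0.2, "the": 0.2},   # topic 0
      {"cat": 0.1, "dog": 0.7, "the": 0.2}]   # topic 1
scores = final_scores(tr, p_z_d=[0.5, 0.5],
                      entropies={"cat": 0.3, "dog": 0.3, "the": 1.0})
print(top_n(scores))  # → ['dog', 'cat']
```

The high-entropy word "the" is discounted by the δ^H term, so the topic-focused words win the top-N slots.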
Another object of the present invention is achieved through the following technical solution: a topic model based document keyword extraction system, comprising:
a document information preprocessing module: performs word segmentation and part-of-speech tagging of the input document, removes function words and stop words, extracts stems, and establishes semi-structured data;
a document structure graph construction module: builds the document structure graph;
a document topic distribution extraction module: uses the topic model to extract the topic distribution of the document;
a word weight extraction module: assigns a weight to each word in the document;
a keyword generation module: generates the document keywords.
The keyword generation module is specifically configured as follows:
First, PageRank, an algorithm originally proposed and used by Google, builds a web page graph from the in-links and out-links of pages and computes the influence of each page from the resulting topology. It is generally used in the field of web page evaluation to judge how important a page is, and it rests on two assumptions: the quantity assumption, that the more in-links a page has, the more important it is; and the quality assumption, that the more important the pages linking into a page are, the more important that page is. The method converts the keyword extraction problem into the graph problem of extracting key nodes and applies the PageRank algorithm to the document structure graph. PageRank is widely used in the keyword extraction field.
In the present invention, for each topic in each document, the method computes the PageRank value of every word, using the following formula:
TR_t(w_i) = λ · Σ_{j: w_j → w_i} [ e(w_j, w_i) / O(w_j) ] · TR_t(w_j) + (1 − λ) · W_t(w_i)
In the formula above, λ is the damping factor, with a value in [0, 1]. A node does not necessarily have out-links; a node whose out-degree is 0 is isolated from the other pages, so to make such a page reachable, PageRank is modified by the damping factor: (1 − λ) is the probability that a node jumps to an arbitrary other node. O(w_j) is the total out-degree of node w_j, and e(w_j, w_i) is the number of edges from w_j to w_i. The main idea of the formula is that the PageRank value of each page equals the sum of the values distributed to it by the pages whose links enter it. Here W_t(w_i) = p(z | w), the probability that word w is assigned to topic z given w, which reflects to what degree w belongs to z; this probability is produced by step S3b. For each document, we can draw K word graphs with different weights according to the K topics, and by applying the formula above compute K PageRank values for each word in the document.
The present invention first randomly initializes the PageRank value of each word, then computes the PageRank value TR_t(w_i) of each word w_i for topic t, iterating until the number of iterations exceeds 300 or the difference between two successive iterations is less than 0.001.
Then, for each word, the present invention merges its K PageRank values under the K topics into a single PageRank value, using the following formula:
TR(w_i) = δ^{H(w_i)} · Σ_{z=1..K} TR_z(w_i) × p(z | d)
p(z | d) is the probability that document d belongs to topic z; when the K PageRank values of each word are merged, the influence of the entropy is added to the weight of each word. δ is a smoothing factor that controls how strongly the entropy value affects the final PageRank value.
According to the final PageRank value of each word in the document, the module sorts the words by value, designates the top N words as the keywords of the document, and then presents the keyword extraction results to the user.
The document topic distribution extraction module is specifically configured as follows:
First, LDA (Latent Dirichlet Allocation) is a latent topic model with many variants; the present invention proposes another variant, bLDA (background-based Latent Dirichlet Allocation), to obtain the latent topic distribution of the document. In bLDA, all documents in the collection share all the latent topics in certain proportions, and each latent topic consists of a set of related feature words. In bLDA, the first topic is designated the background topic, and all words unrelated to any topic are gathered into it. Because Gibbs sampling can extract topics effectively from large-scale document sets, the method uses Gibbs sampling to solve bLDA. Through bLDA, the probability of each topic for each word in the document can be obtained:
W_z(w_i) = p(z | w)
i.e. the probability that word w is assigned to topic z given w, which reflects to what degree w belongs to z. The document topic distribution extraction module generates the topic distribution of each word from the document.
The word weight extraction module is specifically configured as follows:
The present invention computes the information entropy of each word according to its topic distribution. The entropy of the topic distribution of word w_i in the document is computed as follows:
H(w_i) = −Σ_{t=1..K} p(k = t | w_i) log p(k = t | w_i)
Then, the weight of each word is computed from the entropy obtained above, as follows:
weight(w_i) = δ^{H(w_i)}
The larger weight(w_i) is, the fewer topics the word is spread over and the more important it is; conversely, the smaller weight(w_i) is, the higher the word's topic ambiguity, the weaker its representativeness, and the lower its probability of becoming a keyword of the document.
Brief description of the drawings
Fig. 1 is the overall flowchart of the topic model based document keyword extraction method disclosed in the present invention.
Detailed description of the invention
The present invention is described in further detail below with reference to the embodiment and the accompanying drawing, but the embodiments of the present invention are not limited thereto.
Embodiment
As shown in Fig. 1, the overall flowchart of the topic model based document keyword extraction method, the method comprises the following steps:
Document information preprocessing: the input document is segmented and part-of-speech tagged, function words and stop words are removed, stems are extracted, and semi-structured data is established.
Document structure graph construction: the document structure graph describes the position of each word in the document. Each node of the graph represents one word, and an edge linking two nodes indicates that the words represented by the two nodes occur close to each other in the document. The present invention proposes a document structure graph construction method.
Document topic distribution extraction: each document has the topics it emphasizes. The method extracts, through topic model techniques, the topic distribution of the document and of each word in it. The method also proposes a background-word based topic model that improves the effect of the topic model. The closer the topics of two documents are, the closer the things they describe; and for each topic, a set of words related to that topic can be extracted from the document collection.
Word weight extraction: the weight of each word represents its importance in the document. The more important a word is in the document, the higher its weight; conversely, a low-weight word is of low importance in the document. The present invention proposes a weight extraction method.
Keyword generation: based on the steps above, the method converts the keyword extraction problem into the graph problem of extracting key nodes. It applies the PageRank algorithm to the document structure graph and, combining the topic model and the word weights, computes a score for each word; the larger the score, the more likely the word is a keyword of the document. The present invention proposes a keyword generation method.
The topic model based document keyword extraction method provided by the present invention is elaborated below:
The document information preprocessing module: for Chinese text, a word segmentation tool is used to split the text into words with parts of speech; for English text, the document is tokenized on whitespace, and word stemming is applied to obtain word prototypes. A part-of-speech tagging tool is then used to obtain the part of speech of each word. Finally, function words and stop words are deleted from the document, leaving only nouns, adjectives, and verbs, which reduces noise when the topic words are built. Stem extraction removes the influence of morphology on the words in a sentence; for example, 'dog' and 'dogs' should be treated as the same word.
The document structure graph construction module: to convert the text-processing problem into a graph problem, the document must be converted into a graph. To preserve the information of the original document as far as possible, the present invention builds a sliding window of length W; for each word that appears in the window, a graph node is built, and if two words appear in the sliding window at the same time, an edge is added between the two nodes that represent them. The window moves from the head of the document toward its tail, continually adding nodes and edges to the graph as it moves.
The document topic distribution extraction module: first, the document background words are annotated. The present invention annotates them semi-automatically: we compute the TF-IDF value of each word, choose a threshold, and treat the words below the threshold as carrying little information; these words are then browsed manually, and the background words are selected among them. Then, the present invention uses bLDA (background-based Latent Dirichlet Allocation), a variant of LDA (Latent Dirichlet Allocation), to obtain the latent topic distribution of the document. On the basis of LDA, the present invention adds a background topic, into which all words unrelated to any topic are gathered. Gibbs sampling is then used to extract the topics from the document set.
The word weight extraction module: the present invention computes the information entropy of each word according to its topic distribution. The larger a word's entropy, the more uniform its topic distribution, i.e. the higher its topic ambiguity; the smaller the entropy, the more concentrated the distribution and the lower the ambiguity. The entropy of the topic distribution of word w_i in the document is computed as follows:
H(w_i) = −Σ_{t=1..K} p(k = t | w_i) log p(k = t | w_i)
The weight of each word is computed from the entropy obtained above, as follows:
weight(w_i) = δ^{H(w_i)}
When the latent topic distribution of w_i is nearly uniform (i.e. the word is not representative), H(w_i) is large, so δ^{H(w_i)} becomes small (δ being a factor in (0, 1)), i.e. the weight of the word becomes small.
The keyword generation module: the present invention first randomly initializes the PageRank value of each word. Then, for every document and every topic, the method uses PageRank to score the importance of each word. Finally, for each word, the scores under its different topics are weighted and accumulated to obtain the word's final score. For each topic in each document, the method computes the PageRank value of every word, using the following formula:
TR_t(w_i) = λ · Σ_{j: w_j → w_i} [ e(w_j, w_i) / O(w_j) ] · TR_t(w_j) + (1 − λ) · W_t(w_i)
In the formula above, λ is the damping factor, with a value in [0, 1]. A node does not necessarily have out-links; a node whose out-degree is 0 is isolated from the other pages, so to make such a page reachable, PageRank is modified by the damping factor: (1 − λ) is the probability that a node jumps to an arbitrary other node. O(w_j) is the total out-degree of node w_j, and e(w_j, w_i) is the number of edges from w_j to w_i. The main idea of the formula is that the PageRank value of each page equals the sum of the values distributed to it by the pages whose links enter it.
Here W_t(w_i) = p(z | w), the probability that word w is assigned to topic z given w, which reflects to what degree w belongs to z; this probability is produced by step S3b. For each document, we can draw K word graphs with different weights according to the K topics, and by applying the formula above compute K PageRank values for each word in the document.
We compute the PageRank value of each word iteratively until a termination condition is met. The termination conditions of the present invention are: 1) the difference between the weight value of each word in the current iteration and in the previous one is less than 0.001; 2) the number of iterations reaches the termination threshold of 300. Over the iterations, the PageRank value of each page tends toward a stable value, because the algorithm eventually converges, and the convergence value is the PageRank value.
For each word, the method merges its K PageRank values under the K different topics into a single PageRank value. The specific procedure is as follows: let p(z | d) be the probability that document d belongs to topic z; for the K PageRank values of each word, the influence of information entropy is added when computing the word's weight, and the values are merged with the following formula:
TR(w_i) = δ^{H(w_i)} · Σ_{z=1}^{K} TR_z(w_i) × p(z | d)
where δ is a smoothing factor that controls how strongly the information entropy affects the final PageRank value.
According to the final PageRank value of each word in the document, the method sorts the words by value and designates the top N words with the largest values as the keywords of the document.
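The merge-and-rank step can be sketched as follows. The per-topic scores, the document-topic probabilities, the entropies, and δ below are illustrative values chosen for the example, not outputs of the method.

```python
def merge_and_rank(tr_per_topic, p_z_given_d, H, delta=0.5, n=2):
    """tr_per_topic: K dicts word -> TR_z(wi); p_z_given_d: K values p(z|d);
    H: word -> entropy H(wi). Returns the top-n words as keywords."""
    # TR(wi) = delta ** H(wi) * sum_z TR_z(wi) * p(z|d)
    final = {w: (delta ** H[w]) * sum(tr[w] * p
                                      for tr, p in zip(tr_per_topic, p_z_given_d))
             for w in tr_per_topic[0]}
    return sorted(final, key=final.get, reverse=True)[:n]

keywords = merge_and_rank(
    tr_per_topic=[{"a": 0.6, "b": 0.3, "c": 0.1}, {"a": 0.1, "b": 0.5, "c": 0.4}],
    p_z_given_d=[0.7, 0.3],
    H={"a": 0.2, "b": 1.0, "c": 0.9},
)
```

Note how the entropy term rewards "a", which is concentrated in the dominant topic, over "b", which is spread across topics.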
The present embodiment also discloses a topic model based document keyword extraction system, which includes the following modules:
a document information preprocessing module, for performing word and part-of-speech segmentation, function-word/stop-word removal, and stemming on the input document to establish semi-structured data;
a document structure graph construction module, for building the document structure graph, wherein the document structure graph describes the positional information of each word in the document; each node of the graph represents one word of the document, and an edge linking two nodes indicates that the words represented by the two nodes are close to each other in position;
a document topic distribution extraction module, for extracting, with a background-word-based topic model technique, the topic distribution of the document and the topic distribution of each word in the document;
a word weight extraction module, for extracting the weight of each word in the document, wherein the weight of each word represents the word's degree of importance in the document;
a keyword generation module, for converting the keyword extraction problem into the problem of extracting key nodes in a graph algorithm: the PageRank algorithm is applied according to the document structure graph, combined with the topic model and the word weights, to compute a score for each word, and the highest-scoring words are taken as the keywords of the document.
(1) The working principle of the keyword generation module is as follows:
First, randomly initialize the PageRank value of each word;
For each document and each topic, score the importance of each word using the PageRank method, where the PageRank value of each word is computed with the following formula:
TR_t(w_i) = λ · Σ_{j: w_j → w_i} (e(w_j, w_i) / O(w_j)) · TR_t(w_j) + (1 − λ) · W_t(w_i)
In the above formula, λ is the damping factor, with value range [0, 1]; PageRank is corrected by the damping factor, and (1 − λ) is the probability that a node jumps to another node; O(w_j) is the total outgoing link weight of page w_j, and e(w_j, w_i) is the weight of the link from w_j to w_i; W_t(w_i) = p(z = t | w_i), i.e., the probability that the word is assigned to topic t given word w_i;
Iterate: compute the difference between each word's value in the current iteration and in the previous iteration; if it is less than 0.001, terminate the method and proceed to the next step; if the number of iterations reaches the threshold of 300, also terminate the method and proceed to the next step; otherwise perform the next round of iteration. By repeating the above steps, the convergence value of each page's PageRank value is taken as its final PageRank value;
For each word, consolidate its K PageRank values under the K different topics into one PageRank value, with the following merging formula:
TR(w_i) = δ^{H(w_i)} · Σ_{z=1}^{K} TR_z(w_i) × p(z | d)
where p(z | d) is the probability that document d belongs to topic z, and δ is a smoothing factor that controls how strongly the information entropy affects the final PageRank value;
Sort the words in the document by their final PageRank values, and designate the top N words with the largest values as the keywords of the document.
(2) The working principle of the word weight extraction module is as follows:
Compute the information entropy of each word from its topic distribution; for word w_i in the document, the information entropy of its topic distribution is computed with the following formula:
H(w_i) = − Σ_{t=1}^{K} p(k = t | w_i) · log p(k = t | w_i)
Compute the weight of each word from the entropy obtained above, with the following formula:
weight(w_i) = δ^{H(w_i)}.
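The entropy-based weight can be sketched as follows; the topic distributions and the value of δ are illustrative.

```python
import math

def topic_entropy(p_topics):
    """H(wi) = -sum_t p(k=t|wi) * log p(k=t|wi), skipping zero-probability topics."""
    return -sum(p * math.log(p) for p in p_topics if p > 0)

def word_weight(p_topics, delta=0.5):
    """weight(wi) = delta ** H(wi); delta is a smoothing factor in (0, 1)."""
    return delta ** topic_entropy(p_topics)

# A word concentrated on a single topic has zero entropy and weight 1;
# a word spread evenly over topics has high entropy and is down-weighted.
focused = word_weight([1.0, 0.0, 0.0])
diffuse = word_weight([1 / 3, 1 / 3, 1 / 3])
```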
(3) The working principle of the document topic distribution extraction module is as follows:
Mark the background words of the document: compute the TF-IDF value of each word and select a threshold; words whose value falls below the threshold are regarded as carrying little information, and the background words are then selected from them by manual inspection;
The TF-IDF value is computed with the following formulas:
TF_i = n_i / Σ_k n_k
IDF_i = log(D / D_w)
TF-IDF_i = TF_i × IDF_i
In the above formulas, i indexes the i-th keyword, n_i is the number of times word t_i occurs in the document, TF_i is the term frequency of keyword t_i in the document, Σ_k n_k is the total number of word occurrences in the document, IDF_i is the inverse document frequency of keyword t_i, D is the number of all documents in the system, and D_w is the number of documents in which word t_i occurs.
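The background-word marking step can be sketched as follows; the toy corpus and the threshold value are illustrative, and the final selection would still be confirmed manually as the text describes.

```python
import math
from collections import Counter

def tfidf_per_doc(docs):
    """docs: list of token lists; returns one dict word -> TF-IDF per document."""
    D = len(docs)
    df = Counter(w for doc in docs for w in set(doc))        # D_w per word
    result = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())                         # sum_k n_k
        result.append({w: (c / total) * math.log(D / df[w])  # TF_i * IDF_i
                       for w, c in counts.items()})
    return result

docs = [["topic", "model", "keyword"],
        ["topic", "graph", "keyword"],
        ["topic", "rank"]]
scores = tfidf_per_doc(docs)
# "topic" appears in every document, so IDF = log(3/3) = 0 and its score falls
# below any positive threshold, marking it as a background-word candidate.
background = {w for w, s in scores[0].items() if s < 0.05}
```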
Obtain the latent topic distribution of the document using bLDA (background-based latent Dirichlet allocation): the first topic is set as the background topic, into which all topic-irrelevant words are gathered; Gibbs sampling is used to solve the bLDA model, yielding for each word in the document the probability of each corresponding topic:
W_z(w_i) = p(z | w)
i.e., the probability that the word is assigned to topic z given word w.
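The full bLDA Gibbs sampler is beyond a short sketch, but once sampling has produced topic-assignment counts, p(z|w) can be read off by normalizing each word's counts over topics. The count matrix below is a made-up example (not sampler output), with topic 0 playing the role of the background topic.

```python
def p_topic_given_word(counts):
    """counts[z][w]: number of times word w was assigned to topic z.
    Returns word -> [p(z=0|w), ..., p(z=K-1|w)] by normalizing over topics."""
    K = len(counts)
    return {w: [counts[z][w] / sum(counts[z2][w] for z2 in range(K))
                for z in range(K)]
            for w in counts[0]}

# The function word "the" is almost always assigned to the background topic,
# while "model" belongs mostly to a content topic.
counts = [{"the": 90, "model": 5}, {"the": 10, "model": 95}]
p = p_topic_given_word(counts)
```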
(4) The working principle of the document structure graph construction module is as follows:
Select a sliding window length, denoted W;
Construct a sliding window of length W; for each word appearing in the window, build a graph node for it; if two words appear in the sliding window at the same time, add an edge between the two nodes represented by these words;
Move the sliding window from the head of the document toward its tail, continually adding nodes and edges to the graph during the movement.
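The sliding-window construction above can be sketched as follows; the token sequence is illustrative, and counting repeated co-occurrences as edge weights is one common choice for the edge weight e(w_i, w_j).

```python
from collections import Counter
from itertools import combinations

def build_word_graph(tokens, W=3):
    """Slide a window of length W over the tokens; words co-occurring inside
    a window are linked, and edge weights count the co-occurrences."""
    nodes = set(tokens)
    edges = Counter()
    for start in range(len(tokens) - W + 1):
        window = set(tokens[start:start + W])
        for wi, wj in combinations(window, 2):
            edges[tuple(sorted((wi, wj)))] += 1   # undirected edge (wi, wj)
    return nodes, edges

nodes, edges = build_word_graph(
    ["topic", "model", "keyword", "topic", "graph"], W=3)
```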
(5) The working principle of the document information preprocessing module is as follows:
For Chinese text, segmentation is performed with a word segmentation tool; for English text, the document is tokenized on whitespace and stemming is applied to obtain word prototypes;
A part-of-speech tagging tool is used to annotate the part of speech of each segmented word;
According to the tagging results, function words and stop words are deleted from the document.
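The preprocessing stage for English text can be sketched as follows. The stop-word list and the suffix-stripping rules are illustrative stand-ins for the segmentation, tagging, and stemming tools named in the text.

```python
STOPWORDS = {"the", "a", "of", "is", "and", "in"}   # illustrative stop-word list

def crude_stem(word):
    """A toy suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    """Whitespace tokenization, stop-word removal, then stemming."""
    return [crude_stem(t) for t in text.lower().split() if t not in STOPWORDS]

tokens = preprocess("the model is extracting keywords in documents")
```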
In each method embodiment of the present invention, the sequence numbers of the steps are not intended to limit their order of execution; for those of ordinary skill in the art, changing the order of the steps without creative effort also falls within the protection scope of the present invention.
For the specific working processes of the modules or units described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
The above embodiment is a preferred implementation of the present invention, but the implementations of the present invention are not limited by it; any change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the present invention shall be an equivalent replacement and is included within the protection scope of the present invention.

Claims (10)

1. A topic model based document keyword extraction method, characterized in that the method specifically includes the following steps:
S1, document information preprocessing: perform word and part-of-speech segmentation, function-word/stop-word removal, and stemming on the input document to establish semi-structured data;
S2, document structure graph construction: build the document structure graph, wherein the document structure graph describes the positional information of each word in the document; each node of the graph represents one word of the document, and an edge linking two nodes indicates that the words represented by the two nodes are close to each other in position;
S3, document topic distribution extraction: extract, with a background-word-based topic model technique, the topic distribution of the document and the topic distribution of each word in the document;
S4, word weight extraction: extract the weight of each word in the document, wherein the weight of each word represents the word's degree of importance in the document;
S5, keyword generation: convert the keyword extraction problem into the problem of extracting key nodes in a graph algorithm; apply the PageRank algorithm according to the document structure graph, combined with the topic model and the word weights, to compute a score for each word, and take the highest-scoring words as the keywords of the document.
2. The topic model based document keyword extraction method according to claim 1, characterized in that step S5, keyword generation, is specifically as follows:
S5a, first randomly initialize the PageRank value of each word;
S5b, for each document and each topic, score the importance of each word using the PageRank method, where the PageRank value of each word is computed with the following formula:
TR_t(w_i) = λ · Σ_{j: w_j → w_i} (e(w_j, w_i) / O(w_j)) · TR_t(w_j) + (1 − λ) · W_t(w_i)
In the above formula, λ is the damping factor, with value range [0, 1]; PageRank is corrected by the damping factor, and (1 − λ) is the probability that a node jumps to another node; O(w_j) is the total outgoing link weight of page w_j, and e(w_j, w_i) is the weight of the link from w_j to w_i; W_t(w_i) = p(z = t | w_i), i.e., the probability that the word is assigned to topic t given word w_i;
S5c, iterate: compute the difference between each word's value in the current iteration and in the previous iteration; if it is less than 0.001, terminate the method and proceed to the next step; if the number of iterations reaches the threshold of 300, also terminate the method and proceed to the next step; otherwise return to S5b for the next round of iteration; by repeating the above steps, the convergence value of each page's PageRank value is taken as the final PageRank value;
S5d, for each word, consolidate its K PageRank values under the K different topics into one PageRank value, with the following merging formula:
TR(w_i) = δ^{H(w_i)} · Σ_{z=1}^{K} TR_z(w_i) × p(z | d)
where p(z | d) is the probability that document d belongs to topic z, and δ is a smoothing factor that controls how strongly the information entropy affects the final PageRank value;
S5e, sort the words in the document by their final PageRank values, and designate the top N words with the largest values as the keywords of the document.
3. The topic model based document keyword extraction method according to claim 1, characterized in that step S4, word weight extraction, is specifically as follows:
S4a, compute the information entropy of each word from its topic distribution; for word w_i in the document, the information entropy of its topic distribution is computed with the following formula:
H(w_i) = − Σ_{t=1}^{K} p(k = t | w_i) · log p(k = t | w_i)
S4b, compute the weight of each word from the entropy obtained above, with the following formula:
weight(w_i) = δ^{H(w_i)}.
4. The topic model based document keyword extraction method according to claim 1, characterized in that step S3, document topic distribution extraction, is specifically as follows:
S3a, mark the background words of the document: compute the TF-IDF value of each word and select a threshold; words whose value falls below the threshold are regarded as carrying little information, and the background words are then selected from them by manual inspection; the TF-IDF value is computed with the following formulas:
TF_i = n_i / Σ_k n_k
IDF_i = log(D / D_w)
TF-IDF_i = TF_i × IDF_i
In the above formulas, i indexes the i-th keyword, n_i is the number of times word t_i occurs in the document, TF_i is the term frequency of keyword t_i in the document, Σ_k n_k is the total number of word occurrences in the document, IDF_i is the inverse document frequency of keyword t_i, D is the number of all documents in the system, and D_w is the number of documents in which word t_i occurs;
S3b, obtain the latent topic distribution of the document using bLDA (background-based latent Dirichlet allocation): the first topic is set as the background topic, into which all topic-irrelevant words are gathered; Gibbs sampling is used to solve the bLDA model, yielding for each word in the document the probability of each corresponding topic:
W_z(w_i) = p(z | w)
i.e., the probability that the word is assigned to topic z given word w.
5. The topic model based document keyword extraction method according to claim 1, characterized in that step S2, document structure graph construction, is specifically as follows:
S2a, select a sliding window length, denoted W;
S2b, construct a sliding window of length W; for each word appearing in the window, build a graph node for it; if two words appear in the sliding window at the same time, add an edge between the two nodes represented by these words;
S2c, move the sliding window from the head of the document toward its tail, continually adding nodes and edges to the graph during the movement.
6. The topic model based document keyword extraction method according to claim 1, characterized in that step S1, document information preprocessing, is specifically as follows:
S1a, for Chinese text, segmentation is performed with a word segmentation tool; for English text, the document is tokenized on whitespace and stemming is applied to obtain word prototypes;
S1b, a part-of-speech tagging tool is used to annotate the part of speech of each segmented word;
S1c, according to the tagging results, function words and stop words are deleted from the document.
7. A topic model based document keyword extraction system, characterized in that the system includes the following modules:
a document information preprocessing module, for performing word and part-of-speech segmentation, function-word/stop-word removal, and stemming on the input document to establish semi-structured data;
a document structure graph construction module, for building the document structure graph, wherein the document structure graph describes the positional information of each word in the document; each node of the graph represents one word of the document, and an edge linking two nodes indicates that the words represented by the two nodes are close to each other in position;
a document topic distribution extraction module, for extracting, with a background-word-based topic model technique, the topic distribution of the document and the topic distribution of each word in the document;
a word weight extraction module, for extracting the weight of each word in the document, wherein the weight of each word represents the word's degree of importance in the document;
a keyword generation module, for converting the keyword extraction problem into the problem of extracting key nodes in a graph algorithm: the PageRank algorithm is applied according to the document structure graph, combined with the topic model and the word weights, to compute a score for each word, and the highest-scoring words are taken as the keywords of the document.
8. The topic model based document keyword extraction system according to claim 7, characterized in that the working principle of the keyword generation module is as follows:
First, randomly initialize the PageRank value of each word;
For each document and each topic, score the importance of each word using the PageRank method, where the PageRank value of each word is computed with the following formula:
TR_t(w_i) = λ · Σ_{j: w_j → w_i} (e(w_j, w_i) / O(w_j)) · TR_t(w_j) + (1 − λ) · W_t(w_i)
In the above formula, λ is the damping factor, with value range [0, 1]; PageRank is corrected by the damping factor, and (1 − λ) is the probability that a node jumps to another node; O(w_j) is the total outgoing link weight of page w_j, and e(w_j, w_i) is the weight of the link from w_j to w_i; W_t(w_i) = p(z = t | w_i), i.e., the probability that the word is assigned to topic t given word w_i;
Iterate: compute the difference between each word's value in the current iteration and in the previous iteration; if it is less than 0.001, terminate the method and proceed to the next step; if the number of iterations reaches the threshold of 300, also terminate the method and proceed to the next step; otherwise perform the next round of iteration; by repeating the above steps, the convergence value of each page's PageRank value is taken as the final PageRank value;
For each word, consolidate its K PageRank values under the K different topics into one PageRank value, with the following merging formula:
TR(w_i) = δ^{H(w_i)} · Σ_{z=1}^{K} TR_z(w_i) × p(z | d)
where p(z | d) is the probability that document d belongs to topic z, and δ is a smoothing factor that controls how strongly the information entropy affects the final PageRank value;
Sort the words in the document by their final PageRank values, and designate the top N words with the largest values as the keywords of the document.
9. The topic model based document keyword extraction system according to claim 7, characterized in that the working principle of the word weight extraction module is as follows:
Compute the information entropy of each word from its topic distribution; for word w_i in the document, the information entropy of its topic distribution is computed with the following formula:
H(w_i) = − Σ_{t=1}^{K} p(k = t | w_i) · log p(k = t | w_i)
Compute the weight of each word from the entropy obtained above, with the following formula:
weight(w_i) = δ^{H(w_i)}.
10. The topic model based document keyword extraction system according to claim 7, characterized in that the working principle of the document topic distribution extraction module is as follows:
Mark the background words of the document: compute the TF-IDF value of each word and select a threshold; words whose value falls below the threshold are regarded as carrying little information, and the background words are then selected from them by manual inspection;
The TF-IDF value is computed with the following formulas:
TF_i = n_i / Σ_k n_k
IDF_i = log(D / D_w)
TF-IDF_i = TF_i × IDF_i
In the above formulas, i indexes the i-th keyword, n_i is the number of times word t_i occurs in the document, TF_i is the term frequency of keyword t_i in the document, Σ_k n_k is the total number of word occurrences in the document, IDF_i is the inverse document frequency of keyword t_i, D is the number of all documents in the system, and D_w is the number of documents in which word t_i occurs;
Obtain the latent topic distribution of the document using bLDA (background-based latent Dirichlet allocation): the first topic is set as the background topic, into which all topic-irrelevant words are gathered; Gibbs sampling is used to solve the bLDA model, yielding for each word in the document the probability of each corresponding topic:
W_z(w_i) = p(z | w)
i.e., the probability that the word is assigned to topic z given word w;
The working principle of the document structure graph construction module is as follows:
Select a sliding window length, denoted W;
Construct a sliding window of length W; for each word appearing in the window, build a graph node for it; if two words appear in the sliding window at the same time, add an edge between the two nodes represented by these words;
Move the sliding window from the head of the document toward its tail, continually adding nodes and edges to the graph during the movement;
The working principle of the document information preprocessing module is as follows:
For Chinese text, segmentation is performed with a word segmentation tool; for English text, the document is tokenized on whitespace and stemming is applied to obtain word prototypes;
A part-of-speech tagging tool is used to annotate the part of speech of each segmented word;
According to the tagging results, function words and stop words are deleted from the document.
CN201610162410.5A 2016-03-21 2016-03-21 Document keyword abstraction method and its system based on topic model Active CN105843795B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610162410.5A CN105843795B (en) 2016-03-21 2016-03-21 Document keyword abstraction method and its system based on topic model


Publications (2)

Publication Number Publication Date
CN105843795A true CN105843795A (en) 2016-08-10
CN105843795B CN105843795B (en) 2019-05-14


Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106407316A (en) * 2016-08-30 2017-02-15 北京航空航天大学 Topic model-based software question and answer recommendation method and device
CN106484783A (en) * 2016-09-19 2017-03-08 济南浪潮高新科技投资发展有限公司 A kind of graphical representation method of report data
CN106599076A (en) * 2016-11-16 2017-04-26 深圳市异度信息产业有限公司 Forum induced graph generation method and apparatus
CN106844416A (en) * 2016-11-17 2017-06-13 中国科学院计算技术研究所 A kind of sub-topic method for digging
CN107092595A (en) * 2017-04-23 2017-08-25 四川用联信息技术有限公司 New keyword extraction techniques
CN107102985A (en) * 2017-04-23 2017-08-29 四川用联信息技术有限公司 Multi-threaded keyword extraction techniques in improved document
CN107102986A (en) * 2017-04-23 2017-08-29 四川用联信息技术有限公司 Multi-threaded keyword extraction techniques in document
CN107193803A (en) * 2017-05-26 2017-09-22 北京东方科诺科技发展有限公司 A kind of particular task text key word extracting method based on semanteme
CN107193892A (en) * 2017-05-02 2017-09-22 东软集团股份有限公司 A kind of document subject matter determines method and device
CN107391613A (en) * 2017-07-04 2017-11-24 北京航空航天大学 A kind of automatic disambiguation method of more documents of industry security theme and device
CN107665189A (en) * 2017-06-16 2018-02-06 平安科技(深圳)有限公司 A kind of method, terminal and equipment for extracting centre word
CN107797990A (en) * 2017-10-18 2018-03-13 渡鸦科技(北京)有限责任公司 Method and apparatus for determining text core sentence
CN108197117A (en) * 2018-01-31 2018-06-22 厦门大学 A kind of Chinese text keyword extracting method based on document subject matter structure with semanteme
CN108415900A (en) * 2018-02-05 2018-08-17 中国科学院信息工程研究所 A kind of visualText INFORMATION DISCOVERY method and system based on multistage cooccurrence relation word figure
CN108763390A (en) * 2018-05-18 2018-11-06 浙江新能量科技股份有限公司 Fine granularity subject distillation method based on sliding window technique
CN108763213A (en) * 2018-05-25 2018-11-06 西南电子技术研究所(中国电子科技集团公司第十研究所) Theme feature text key word extracting method
CN108776653A (en) * 2018-05-25 2018-11-09 南京大学 A kind of text segmenting method of the judgement document based on PageRank and comentropy
CN108920456A (en) * 2018-06-13 2018-11-30 北京信息科技大学 A kind of keyword Automatic method
CN109635081A (en) * 2018-11-23 2019-04-16 上海大学 A kind of text key word weighing computation method based on word frequency power-law distribution characteristic
CN109918660A (en) * 2019-03-04 2019-06-21 北京邮电大学 A kind of keyword extracting method and device based on TextRank
CN109960724A (en) * 2019-03-13 2019-07-02 北京工业大学 A kind of text snippet method based on TF-IDF
CN110019639A (en) * 2017-07-18 2019-07-16 腾讯科技(北京)有限公司 Data processing method, device and storage medium
CN110162592A (en) * 2019-05-24 2019-08-23 东北大学 A kind of news keyword extracting method based on the improved TextRank of gravitation
CN110472005A (en) * 2019-06-27 2019-11-19 中山大学 A kind of unsupervised keyword extracting method
CN110493019A (en) * 2019-07-05 2019-11-22 深圳壹账通智能科技有限公司 Automatic generation method, device, equipment and the storage medium of meeting summary
CN110728136A (en) * 2019-10-14 2020-01-24 延安大学 Multi-factor fused textrank keyword extraction algorithm
CN112883171A (en) * 2021-02-02 2021-06-01 中国科学院计算技术研究所 Document keyword extraction method and device based on BERT model
CN113094573A (en) * 2020-01-09 2021-07-09 中移(上海)信息通信科技有限公司 Multi-keyword sequencing searchable encryption method, device, equipment and storage medium
CN114020901A (en) * 2021-09-27 2022-02-08 南京云创大数据科技股份有限公司 Financial public opinion analysis method combining topic mining and emotion analysis
CN114332872A (en) * 2022-03-14 2022-04-12 四川国路安数据技术有限公司 Contract document fault-tolerant information extraction method based on graph attention network
CN114328826A (en) * 2021-12-20 2022-04-12 青岛檬豆网络科技有限公司 Method for extracting key words and abstracts of technical achievements and technical requirements
CN114510565A (en) * 2020-11-16 2022-05-17 威联通科技股份有限公司 Method for automatically extracting, classifying and keyword-searching short texts and device adopting same
CN116431930A (en) * 2023-06-13 2023-07-14 天津联创科技发展有限公司 Technological achievement conversion data query method, system, terminal and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090292685A1 (en) * 2008-05-22 2009-11-26 Microsoft Corporation Video search re-ranking via multi-graph propagation
CN103955489A (en) * 2014-04-15 2014-07-30 华南理工大学 Distributed mass short text KNN (K Nearest Neighbor) classification algorithm and distributed mass short text KNN classification system based on information entropy feature weight quantification
CN104915446A (en) * 2015-06-29 2015-09-16 华南理工大学 Automatic extracting method and system of event evolving relationship based on news


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
XIN JIN ET AL.: "LDA based Related Word Detection in Advertising", 《2010 SEVENTH WEB INFORMATION SYSTEMS AND APPLICATIONS CONFERENCE》 *
ZHIYUAN LIU ET AL.: "Automatic Keyphrase Extraction via Topic Decomposition", 《PROCEEDINGS OF THE 2010 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING》 *
刁倩 等: "VSM中词权重的信息熵算法", 《情报学报》 *
江雨燕 等: "基于共享背景主题的Labeled LDA模型", 《电子学报》 *

Cited By (44)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106407316A (en) * 2016-08-30 2017-02-15 北京航空航天大学 Topic model-based software question and answer recommendation method and device
CN106407316B (en) * 2016-08-30 2020-05-15 北京航空航天大学 Software question and answer recommendation method and device based on topic model
CN106484783A (en) * 2016-09-19 2017-03-08 济南浪潮高新科技投资发展有限公司 A kind of graphical representation method of report data
CN106599076A (en) * 2016-11-16 2017-04-26 深圳市异度信息产业有限公司 Forum induced graph generation method and apparatus
CN106844416A (en) * 2016-11-17 2017-06-13 中国科学院计算技术研究所 A kind of sub-topic method for digging
CN107102986A (en) * 2017-04-23 2017-08-29 四川用联信息技术有限公司 Multi-threaded keyword extraction techniques in document
CN107102985A (en) * 2017-04-23 2017-08-29 四川用联信息技术有限公司 Multi-threaded keyword extraction techniques in improved document
CN107092595A (en) * 2017-04-23 2017-08-25 四川用联信息技术有限公司 New keyword extraction techniques
CN107193892A (en) * 2017-05-02 2017-09-22 东软集团股份有限公司 A kind of document subject matter determines method and device
CN107193803A (en) * 2017-05-26 2017-09-22 北京东方科诺科技发展有限公司 A kind of particular task text key word extracting method based on semanteme
CN107193803B (en) * 2017-05-26 2020-07-10 北京东方科诺科技发展有限公司 Semantic-based specific task text keyword extraction method
CN107665189A (en) * 2017-06-16 2018-02-06 平安科技(深圳)有限公司 A kind of method, terminal and equipment for extracting centre word
CN107665189B (en) * 2017-06-16 2019-12-13 平安科技(深圳)有限公司 method, terminal and equipment for extracting central word
CN107391613A (en) * 2017-07-04 2017-11-24 北京航空航天大学 A kind of automatic disambiguation method of more documents of industry security theme and device
CN110019639A (en) * 2017-07-18 2019-07-16 腾讯科技(北京)有限公司 Data processing method, device and storage medium
CN110019639B (en) * 2017-07-18 2023-04-18 腾讯科技(北京)有限公司 Data processing method, device and storage medium
CN107797990A (en) * 2017-10-18 2018-03-13 渡鸦科技(北京)有限责任公司 Method and apparatus for determining text core sentence
CN108197117B (en) * 2018-01-31 2020-05-26 厦门大学 Chinese text keyword extraction method based on document theme structure and semantics
CN108197117A (en) * 2018-01-31 2018-06-22 厦门大学 A kind of Chinese text keyword extracting method based on document subject matter structure with semanteme
CN108415900A (en) * 2018-02-05 2018-08-17 中国科学院信息工程研究所 A kind of visualText INFORMATION DISCOVERY method and system based on multistage cooccurrence relation word figure
CN108763390A (en) * 2018-05-18 2018-11-06 浙江新能量科技股份有限公司 Fine granularity subject distillation method based on sliding window technique
CN108776653A (en) * 2018-05-25 2018-11-09 南京大学 A kind of text segmenting method of the judgement document based on PageRank and comentropy
CN108763213A (en) * 2018-05-25 2018-11-06 西南电子技术研究所(中国电子科技集团公司第十研究所) Theme feature text key word extracting method
CN108920456A (en) * 2018-06-13 2018-11-30 北京信息科技大学 A kind of keyword Automatic method
CN108920456B (en) * 2018-06-13 2022-08-30 北京信息科技大学 Automatic keyword extraction method
CN109635081A (en) * 2018-11-23 2019-04-16 上海大学 A kind of text key word weighing computation method based on word frequency power-law distribution characteristic
CN109635081B (en) * 2018-11-23 2023-06-13 上海大学 Text keyword weight calculation method based on word frequency power law distribution characteristics
CN109918660B (en) * 2019-03-04 2021-03-02 北京邮电大学 Keyword extraction method and device based on TextRank
CN109918660A (en) * 2019-03-04 2019-06-21 北京邮电大学 A kind of keyword extracting method and device based on TextRank
CN109960724A (en) * 2019-03-13 2019-07-02 北京工业大学 A kind of text snippet method based on TF-IDF
CN110162592A (en) * 2019-05-24 2019-08-23 东北大学 A kind of news keyword extracting method based on the improved TextRank of gravitation
CN110472005B (en) * 2019-06-27 2023-09-15 中山大学 Unsupervised keyword extraction method
CN110472005A (en) * 2019-06-27 2019-11-19 中山大学 An unsupervised keyword extraction method
CN110493019A (en) * 2019-07-05 2019-11-22 深圳壹账通智能科技有限公司 Automatic generation method, device, equipment, and storage medium for meeting minutes
CN110728136A (en) * 2019-10-14 2020-01-24 延安大学 Multi-factor fused TextRank keyword extraction algorithm
CN113094573A (en) * 2020-01-09 2021-07-09 中移(上海)信息通信科技有限公司 Multi-keyword ranked searchable encryption method, device, equipment, and storage medium
CN114510565A (en) * 2020-11-16 2022-05-17 威联通科技股份有限公司 Method for automatic extraction, classification, and keyword search of short texts, and device using the same
CN112883171A (en) * 2021-02-02 2021-06-01 中国科学院计算技术研究所 Document keyword extraction method and device based on BERT model
CN112883171B (en) * 2021-02-02 2023-02-03 中国科学院计算技术研究所 Document keyword extraction method and device based on BERT model
CN114020901A (en) * 2021-09-27 2022-02-08 南京云创大数据科技股份有限公司 Financial public opinion analysis method combining topic mining and sentiment analysis
CN114328826A (en) * 2021-12-20 2022-04-12 青岛檬豆网络科技有限公司 Method for extracting keywords and abstracts of technical achievements and technical demands
CN114328826B (en) * 2021-12-20 2024-06-11 青岛檬豆网络科技有限公司 Method for extracting keywords and abstracts of technical achievements and technical demands
CN114332872A (en) * 2022-03-14 2022-04-12 四川国路安数据技术有限公司 Contract document fault-tolerant information extraction method based on graph attention network
CN116431930A (en) * 2023-06-13 2023-07-14 天津联创科技发展有限公司 Technological achievement conversion data query method, system, terminal and storage medium

Also Published As

Publication number Publication date
CN105843795B (en) 2019-05-14

Similar Documents

Publication Publication Date Title
CN105843795A (en) Topic model based document keyword extraction method and system
Thakkar et al. Graph-based algorithms for text summarization
CN103324665B (en) Hot topic information extraction method and device based on microblogs
US9183281B2 (en) Context-based document unit recommendation for sensemaking tasks
CN103678412B (en) A file retrieval method and device
CN107239512B (en) A microblog comment spam recognition method combining a comment relation network graph
CN102411638A (en) Method for generating multimedia summary of news search result
Lahiri et al. Keyword extraction from emails
CN104298732B (en) A personalized text ranking and recommendation method for network users
CN104182504A (en) Algorithm for dynamically tracking and summarizing news events
Li et al. Eos: expertise oriented search using social networks
Nicoletti et al. Mining interests for user profiling in electronic conversations
CN106874419B (en) A multi-granularity real-time hot topic aggregation method
Chatterjee et al. RENT: Regular expression and NLP-based term extraction scheme for agricultural domain
Marujo et al. Hourly traffic prediction of news stories
CN107066585A (en) A public opinion monitoring method and system based on probabilistic topic computation and matching
Lin et al. Combining a segmentation-like approach and a density-based approach in content extraction
Konagala et al. Fake news detection using deep learning: supervised fake news detection analysis in social media with semantic similarity method
Kanakaraj et al. NLP based intelligent news search engine using information extraction from e-newspapers
You Automatic summarization and keyword extraction from web page or text file
EP3040932A1 (en) A method for tracking discussion in social media
Kannan et al. Text document clustering using statistical integrated graph based sentence sensitivity ranking algorithm
Yang et al. An Opinion-aware Approach to Contextual Suggestion.
Lim et al. Generalized and lightweight algorithms for automated web forum content extraction
Raj et al. Malayalam text summarization: Minimum spanning tree based graph reduction approach

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant