CN105069143A - Method and device for extracting keywords from a document

Publication number: CN105069143A (published 2015-11-18); granted as CN105069143B (2019-07-23)
Application number: CN201510512363.8A, filed 2015-08-19
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 姜迪, 石磊, 林鸿宇
Applicant: Beijing Baidu Netcom Science and Technology Co., Ltd.
Assignees as listed: Baidu Online Network Technology (Beijing) Co., Ltd.; Beijing Baidu Netcom Science and Technology Co., Ltd.
Legal status: Granted; Active

Classifications

    • G06F16/36: Information retrieval of unstructured textual data; creation of semantic tools, e.g. ontologies or thesauri
    • G06F16/313: Information retrieval of unstructured textual data; indexing; selection or weighting of terms for indexing

Abstract

The present invention discloses a method and a device for extracting keywords from a document. The method comprises: training a latent topic vector model, a fusion of a topic model and word vectors, to obtain at least one topic vector and at least one word vector relevant to the information of the document; calculating the distance between the word vectors and the topic vector; and, according to that distance, selecting the words corresponding to a preset number of word vectors as the keywords of the document. The method and device disclosed by the embodiments of the present invention can extract keyword information that accurately expresses the information of the document.

Description

Method and device for extracting keywords from a document
Technical field
Embodiments of the present invention relate to the field of information technology, and in particular to a method and a device for extracting keywords from a document.
Background
In the current era of information explosion, a user cannot possibly browse every document that may contain relevant information. Extracting the keywords of a document therefore gives the user a point of reference, and is of great value in helping the user obtain information accurately and in reducing the cost of obtaining it.
In general, the keywords of a document are necessarily words highly correlated with the document's topics, so the topic information of a document is of great significance for keyword extraction. At present, this problem is mainly addressed with the probability distributions of the Latent Dirichlet Allocation (LDA) model, chiefly by the following two methods.
The first is a likelihood-based method: an LDA model is used to obtain the topic distribution P(z|d) of the document and the word distribution P(w|z) of each topic, and the distribution of words in the document is computed as P(w|d) = Σ_z P(z|d)P(w|z), where z denotes a topic, d a document and w a word. P(w|d) is taken as the importance score of word w in document d, and the K highest-scoring words are selected as the keywords of the document.
The second is a method based on the distance between latent-variable distributions: an LDA model is used to obtain the topic distribution P(z|d) of the document and the topic distribution of each word, P(z|w) = P(w|z)P(z)/P(w) ∝ P(w|z)P(z). The cosine distance between the two distributions is then computed, and the K words with the largest cosine distance are selected as the keywords of the document.
However, each of the above methods for extracting keywords from a document has shortcomings. The first method is severely biased towards high-frequency words: most of the extracted words are high-frequency words under some topic, but such words occur widely across different documents and cannot truly reflect the information expressed by a particular document.
As for the second method, computing P(z|w) ∝ P(w|z)P(z) requires the distribution P(z) of the latent variable, which is not a parameter of the LDA model. One generally uses P(z) = Σ_d P(z|d)P(d), where P(d) is the posterior distribution over documents, and assumes P(d) to be uniform so that P(z) ∝ Σ_d P(z|d). But P(d) is not in fact uniform across documents, so the theoretical foundation of this model is not solid, and its performance in practice is also poor.
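For illustration only, the two prior-art scorings above can be sketched as follows. This is a non-authoritative sketch: the names doc_topics (for P(z|d)), topic_words (for P(w|z)) and topic_prior (for an estimate of P(z)) are assumptions, not part of the patent.

```python
# Illustrative sketch of the two prior-art LDA-based scorings discussed above.
import numpy as np

def likelihood_scores(doc_topics: np.ndarray, topic_words: np.ndarray) -> np.ndarray:
    # Method one: P(w|d) = sum_z P(z|d) P(w|z); a higher score marks a more
    # important word, and the top-K words become the keywords.
    return doc_topics @ topic_words                     # shape (|V|,)

def distribution_distance_scores(doc_topics: np.ndarray,
                                 topic_words: np.ndarray,
                                 topic_prior: np.ndarray) -> np.ndarray:
    # Method two: build P(z|w) proportional to P(w|z) P(z) for every word,
    # then compare it with P(z|d) by cosine similarity; words whose topic
    # distribution is closest to the document's are chosen.
    p_z_w = topic_words * topic_prior[:, None]          # shape (T, |V|)
    p_z_w /= p_z_w.sum(axis=0, keepdims=True)           # normalize per word
    return (doc_topics @ p_z_w) / (
        np.linalg.norm(doc_topics) * np.linalg.norm(p_z_w, axis=0))
```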
Summary of the invention
Embodiments of the present invention provide a method and a device for extracting keywords from a document, capable of extracting keyword information that accurately expresses the information of the document.
In a first aspect, an embodiment of the present invention provides a method for extracting keywords from a document, comprising:
obtaining at least one topic vector and at least one word vector relevant to the document information by training a latent topic vector model, the latent topic vector model being a fusion model of a topic model and word vectors;
calculating the distance between the word vectors and the topic vector;
according to the distance between the word vectors and the topic vector, selecting the words corresponding to a preset number of word vectors as the keywords of the document.
In a second aspect, an embodiment of the present invention further provides a device for extracting keywords from a document, comprising:
a vector training module, configured to obtain at least one topic vector and at least one word vector relevant to the document information by training a latent topic vector model, the latent topic vector model being a fusion model of a topic model and word vectors;
a distance calculation module, configured to calculate the distance between the word vectors and the topic vector;
a keyword extraction module, configured to select, according to the distance between the word vectors and the topic vector, the words corresponding to a preset number of word vectors as the keywords of the document.
In the embodiments of the present invention, a latent topic vector model obtained by fusing a topic model with word vectors is trained on the document to obtain at least one topic vector and at least one word vector relevant to the document information, and the words corresponding to a preset number of word vectors are then selected as the keywords of the document according to the distance between the word vectors and the topic vector. Because training the document with the latent topic vector model captures more of the document's information, the extracted keyword information expresses the document's information accurately.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the method for extracting keywords from a document provided by Embodiment 1 of the present invention;
Fig. 2 is a schematic flowchart of the method for extracting keywords from a document provided by Embodiment 2 of the present invention;
Fig. 3 is a schematic structural diagram of the device for extracting keywords from a document provided by Embodiment 3 of the present invention.
Detailed description
The present invention is described in further detail below in conjunction with the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and do not limit it. It should also be noted that, for convenience of description, the drawings show only the parts related to the present invention rather than the entire structure.
The method for extracting keywords from a document provided by the embodiments of the present invention may be executed by the device for extracting keywords from a document provided by the embodiments, or by a terminal device (for example, a smartphone or a tablet computer) in which that device is integrated; the device may be implemented in hardware or in software.
Embodiment 1
Fig. 1 is a schematic flowchart of the method for extracting keywords from a document provided by Embodiment 1 of the present invention. As shown in Fig. 1, the method specifically comprises:
S11, obtaining at least one topic vector and at least one word vector relevant to the document information by training a latent topic vector model, the latent topic vector model being a fusion model of a topic model and word vectors.
Here, the topic model and the word vector (word embedding) are both semantic representation methods commonly used in the prior art. A topic model assumes that each word is generated by a semantic unit in a latent space; under this assumption, both documents and words can be mapped into the latent semantic space for dimensionality reduction. The word vector is another, distributed representation of a word, which uses a fixed-length vector to represent the meaning of a word.
Topic models generally model at the document or sentence level and attend more to global semantics, whereas word vectors generally assume that the semantics of a word are represented by the words around it and attend more to local, syntax-like information. The two methods have different emphases, and each has separately proven to be of great practical value. The present embodiment therefore combines the two, so that the latent topic vector model can capture more information.
The dimensions of the topic vectors and word vectors can be set as desired, and the value of each element of a vector is obtained by training the latent topic vector model. To make the training result more accurate, the latent topic vector model also incorporates a training database containing a large number of documents.
S12, calculating the distance between the word vectors and the topic vector.
The purpose of training to obtain the word vectors and topic vectors is to compute the importance of each word within the document, rank the words by importance, and select the most important words as the keywords of the document.
In this embodiment, the importance of a word within the document is measured by computing the distance between its word vector and the topic vector; specifically, the Euclidean distance, cosine distance or sine distance between the word vector and the topic vector may be computed, and the criterion for importance depends on which distance is computed. If the Euclidean distance or sine distance between the word vector and the topic vector is computed, then the larger the distance, the more important the word is within the document, that is, the better it reflects the topic expressed by the document; if the cosine distance between the word vector and the topic vector is computed, then the smaller the distance, the more important the word is within the document.
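A minimal sketch of the three distance measures named in step S12 follows; the NumPy representation and function names are assumptions, and which direction of each measure indicates importance follows the convention stated above.

```python
# Illustrative distance measures between a word vector and a topic vector.
import numpy as np

def euclidean_distance(word_vec: np.ndarray, topic_vec: np.ndarray) -> float:
    return float(np.linalg.norm(word_vec - topic_vec))

def cosine_distance(word_vec: np.ndarray, topic_vec: np.ndarray) -> float:
    cos_sim = word_vec @ topic_vec / (
        np.linalg.norm(word_vec) * np.linalg.norm(topic_vec))
    return float(1.0 - cos_sim)

def sine_distance(word_vec: np.ndarray, topic_vec: np.ndarray) -> float:
    # sin(theta) = sqrt(1 - cos(theta)^2) for the angle between the vectors.
    cos_sim = word_vec @ topic_vec / (
        np.linalg.norm(word_vec) * np.linalg.norm(topic_vec))
    return float(np.sqrt(max(0.0, 1.0 - cos_sim ** 2)))
```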
S13, according to the distance between the word vectors and the topic vector, selecting the words corresponding to a preset number of word vectors as the keywords of the document.
The preset number can be set according to the actual situation and is not specifically limited here.
From the calculation result of step S12, the preset number of most important word vectors in the document can be determined, and the words corresponding to those word vectors are then taken as the keywords of the document.
In the present embodiment, a latent topic vector model obtained by fusing a topic model with word vectors is trained on the document to obtain at least one topic vector and at least one word vector relevant to the document information, and the words corresponding to a preset number of word vectors are then selected as the keywords of the document according to the distance between the word vectors and the topic vector. Because training the document with the latent topic vector model captures more of the document's information, the extracted keyword information expresses the document's information accurately.
Illustratively, to improve the accuracy of keyword extraction, embodiments of the present invention provide the following two methods of calculating the distance between the word vectors and the topic vector. The first is a calculation method based on the optimal topic, which mainly comprises the following steps:
selecting, according to the topic distribution of the document, the topic with the largest topic distribution probability from at least one topic as the optimal topic;
calculating the distance between the word vectors and the topic vector corresponding to the optimal topic.
Specifically, for a given document, the latent topic vector model can be trained to obtain its topic distribution P(z|d), comprising the topic distribution probability of each topic of the document. The topic z of largest probability in this distribution, the optimal topic, represents the core content of the document. The most important words in the document can therefore be taken to be those whose vectors are nearest, in the vector space, to the vector representation of topic z. Accordingly, the topic of largest topic distribution probability is selected as the optimal topic, the distance between each word vector and the topic vector corresponding to the optimal topic is calculated, and the words corresponding to a preset number of word vectors are selected as the keywords of the document according to those distances.
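The optimal-topic strategy just described might be sketched as follows, assuming the trained model exposes P(z|d), the topic vectors and a word-to-vector mapping; all names are illustrative.

```python
# Illustrative sketch of the optimal-topic keyword selection.
import numpy as np

def keywords_by_optimal_topic(topic_dist: np.ndarray,   # P(z|d), shape (T,)
                              topic_vecs: np.ndarray,   # shape (T, dim)
                              word_vecs: dict,          # word -> np.ndarray
                              k: int = 3) -> list:
    z_star = topic_vecs[int(np.argmax(topic_dist))]     # optimal topic vector
    def cos_dist(v: np.ndarray) -> float:
        return 1.0 - float(v @ z_star) / (
            float(np.linalg.norm(v)) * float(np.linalg.norm(z_star)))
    # Words closest to the optimal topic vector are taken as keywords.
    return sorted(word_vecs, key=lambda w: cos_dist(word_vecs[w]))[:k]
```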
The second is a calculation method based on the topic distribution, which mainly comprises the following steps:
weighting and summing the distances between a word vector and each topic vector according to the topic distribution probability of each topic of the document;
taking the weighted sum as the distance between the word vector and the topic vector.
Specifically, considering that more than one topic may play an important role in a document, the above method based on the optimal topic may lose part of the information. The distances to the different topics are therefore weighted by P(z|d), yielding a new measure as shown in the following formula:

$$\mathrm{Score\_Distr}(w) = \sum_{z \in Z} P(z \mid d)\, L(w, z)$$

where Score_Distr(w) is the weighted sum and L(w, z) is the distance between the word vector of w and the topic vector of topic z.
The above measure is the word importance score obtained after weighting by the topic distribution of the document. The words are ranked by the Score_Distr(w) obtained in this way, and the words corresponding to a preset number of word vectors are selected as the keywords of the document.
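The weighted score Score_Distr(w) might be computed as in the following sketch, taking L(w, z) to be the cosine distance, one of the measures named earlier; the names are illustrative.

```python
# Illustrative topic-distribution-weighted importance score.
import numpy as np

def score_distr(word_vec: np.ndarray,
                topic_vecs: np.ndarray,          # shape (T, dim)
                topic_dist: np.ndarray) -> float:  # P(z|d), shape (T,)
    sims = topic_vecs @ word_vec / (
        np.linalg.norm(topic_vecs, axis=1) * np.linalg.norm(word_vec))
    dists = 1.0 - sims                           # L(w, z) for every topic z
    return float(topic_dist @ dists)             # sum_z P(z|d) * L(w, z)
```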
Illustratively, an embodiment of the present invention further provides a concrete implementation of obtaining at least one topic vector and at least one word vector relevant to the document information by training the latent topic vector model, which mainly comprises the following steps:
adding the document to a training database, and constructing an initial topic vector for each topic and an initial word vector for each word of each document in the training database;
establishing the joint likelihood function of all documents in the training database according to the initial topic vectors and initial word vectors;
performing parameter estimation on the joint likelihood function to obtain the topic vectors and word vectors.
Here, the training database can be obtained from the internet (for example, the Sina corpus) and contains documents of various types. The initial topic vectors and initial word vectors can be set as desired.
Illustratively, establishing the joint likelihood function according to the initial topic vectors and initial word vectors comprises:
obtaining the generating probability of an initial word vector according to formula one:

Formula one:
$$P(\hat{v}_w \mid x_w) = \frac{e^{x_w \cdot \hat{v}_w}}{\sum_{w'} e^{x_w \cdot \hat{v}_{w'}}}$$

where $\hat{v}_w$ is the auxiliary vector of the word vector $v_w$ of the current word w; $x_w$ denotes the context vector of the current word w, namely the sum of the word vectors of the words surrounding w and the topic vector $v_z$ of the current topic z; and w' ranges over the words of the vocabulary in the normalization.
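Formula one is a softmax over the vocabulary; as a non-authoritative sketch, it might be computed as follows, with the context vector built as defined above. Array layouts and names are assumptions.

```python
# Illustrative computation of the generating probability of formula one.
import numpy as np

def context_vector(surrounding_vecs: np.ndarray,  # word vectors around w
                   topic_vec: np.ndarray) -> np.ndarray:
    # x_w = sum of the surrounding word vectors plus the current topic vector.
    return surrounding_vecs.sum(axis=0) + topic_vec

def generation_prob(x_w: np.ndarray,
                    aux_vecs: np.ndarray,         # auxiliary vectors, (|V|, dim)
                    w_idx: int) -> float:
    logits = aux_vecs @ x_w
    logits -= logits.max()                        # numerical stability
    probs = np.exp(logits)
    return float(probs[w_idx] / probs.sum())      # softmax over the vocabulary
```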
The joint likelihood function of all documents in the training database is then obtained from formula one, as formula two:
Formula two:
where $\alpha_z$ is the Dirichlet prior hyperparameter corresponding to topic z; $\beta_v$ is the Dirichlet prior hyperparameter corresponding to word v; $m_{dz}$ is the number of sentences in document d sampled as topic z; $n_{zv}$ denotes the total number of times word v occurs together with topic z in the training database; M denotes the set of all word vectors and topic vectors; D denotes the total number of documents d; T denotes the total number of topics in document d; and $\hat{v}_v$ denotes the auxiliary vector of word v.
Illustratively, to further optimize the above joint likelihood function, after the joint likelihood function is obtained as formula two according to formula one, the method further comprises the following steps:
processing formula two with the Gibbs sampling algorithm, so that the conditional distribution of the topic corresponding to each sentence s of document d can be obtained as formula three:

Formula three:
$$P(z_{ds} = k \mid \mathbf{w}, \mathbf{z}_{-ds}, \alpha, \beta, M) \approx (m_{dk} + \alpha_k)\, \frac{\Gamma\!\left(\sum_{w=1}^{W} (n_{kw} + \beta_w)\right)}{\Gamma\!\left(\sum_{w=1}^{W} (n_{kw} + \beta_w + N_{iw})\right)} \prod_{w \in s} \frac{\Gamma(n_{kw} + \beta_w + N_{iw})}{\Gamma(n_{kw} + \beta_w)} \prod_{\tilde{w} \in s} e^{x_w \cdot \hat{v}_w}$$

where k is the candidate topic, W is the total number of words in the training database, and $N_{iw}$ is the number of times word w occurs in the i-th sentence of document d;
determining a specific topic for each sentence s of document d according to the conditional distribution probability of each topic in the conditional distribution;
processing formula one according to the conditional distribution probability of the specific topic, to obtain the log-likelihood function of formula four:

Formula four:
$$T \log\!\left(\frac{\Gamma\!\left(\sum_{v=1}^{V} \beta_v\right)}{\prod_{v=1}^{V} \Gamma(\beta_v)}\right) + \sum_{z} \sum_{v} \log \Gamma(n_{zv} + \beta_v) - \sum_{z} \log \Gamma\!\left(\sum_{v} (n_{zv} + \beta_v)\right) + \sum_{d} \sum_{s \in d} \sum_{\tilde{w} \in s} \log P(\hat{v}_w \mid x_w).$$

Performing parameter estimation on the joint likelihood function to obtain the topic vectors and word vectors then comprises:
performing parameter estimation on the log-likelihood function to obtain the topic vectors and word vectors.
Illustratively, the log-likelihood function obtained above can be further optimized, specifically by the following steps:
optimizing the parameters α and β in the log-likelihood function with Newton's method;
and/or,
optimizing the word vectors, topic vectors and auxiliary vectors in the log-likelihood function with the negative sampling algorithm.
Accordingly, performing parameter estimation on the log-likelihood function to obtain the topic vectors and word vectors comprises:
performing parameter estimation on the optimized log-likelihood function to obtain the topic vectors and word vectors.
Illustratively, optimizing the word vectors, topic vectors and auxiliary vectors with the negative sampling algorithm comprises:
processing the words and topics of all documents in the training database with the negative sampling algorithm, to obtain the likelihood function of formula five:
Formula five:
where l is the label of the current word: l = 1 if the current word is a true word and l = 0 if it is a negative-sampled word; |NEG| is the number of negative-sampled words corresponding to a word; and |V| is the total number of words in the training database;
processing formula five with stochastic gradient descent, so that the update formula of the word vectors can be obtained as formula six, the update formula of the topic vectors as formula seven, and the update formula of the auxiliary vectors as formula eight:

Formula six:
$$v_u := v_u + \eta \sum_{u \in c_w \cup NEG(c_w)} \left[ l_u^{c_w} - \sigma\!\left( x_w \cdot \hat{v}_u - \log \frac{|NEG|}{|V|} \right) \right] \hat{v}_u,$$

Formula seven:
$$v_z := v_z + \eta \sum_{u \in c_w \cup NEG(c_w)} \left[ l_u^{c_w} - \sigma\!\left( x_w \cdot \hat{v}_u - \log \frac{|NEG|}{|V|} \right) \right] \hat{v}_u,$$

Formula eight:
$$\hat{v}_u := \hat{v}_u + \eta \left[ l_u^{c_w} - \sigma\!\left( x_w \cdot \hat{v}_u - \log \frac{|NEG|}{|V|} \right) \right] x_w.$$
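As a non-authoritative sketch, formulas six to eight amount to the following gradient step, where labels holds l (1 for the true word, 0 for each negative sample) and eta is a learning rate; all names and shapes are assumptions.

```python
# Illustrative negative-sampling SGD step for formulas six to eight.
import numpy as np

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_step(x_w, v_u, v_z, aux, labels, eta, neg, vocab):
    """aux holds the auxiliary vectors of the sampled words, row-aligned
    with labels; neg and vocab stand for |NEG| and |V|."""
    shift = np.log(neg / vocab)             # log(|NEG| / |V|)
    accum = np.zeros_like(v_u)
    for i, l in enumerate(labels):
        g = l - sigmoid(float(x_w @ aux[i]) - shift)
        accum += g * aux[i]
        aux[i] = aux[i] + eta * g * x_w     # formula eight
    v_u = v_u + eta * accum                 # formula six
    v_z = v_z + eta * accum                 # formula seven
    return v_u, v_z, aux
```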
For the corpus in the training database, the latent topic vector model provided by the embodiments of the present invention yields a vectorized representation of each topic, whereas the topic model provided in the prior art learns a multinomial distribution of words under each topic. For each topic, the embodiment of the present invention compares the 10 words of largest probability in the multinomial distribution of words under that topic with the 10 word vectors closest to the topic vector; the results are shown in Table 1 below:
Table 1
As can be seen from Table 1, the multinomial distribution of the topic model has a clear inclination towards high-frequency words, while medium- and low-frequency words are only weakly connected to topics by the traditional topic distribution. As a result, when the multinomial distribution is used for keyword extraction, the topic model naturally favors high-frequency words, leading to poor keyword extraction results. The vectorized representation of the latent topic model eliminates this problem: as the table shows, the words nearest to a topic vector are usually words carrying concrete meaning under that topic, which is why a model using topic vectors can obtain better results in the keyword extraction task.
Therefore, the above embodiments likewise train on the document with a latent topic vector model obtained by fusing a topic model with word vectors, obtain at least one topic vector and at least one word vector relevant to the document information, and then select the words corresponding to a preset number of word vectors as the keywords of the document according to the distance between the word vectors and the topic vector. Because training the document with the latent topic vector model captures more of the document's information, the extracted keyword information expresses the document's information accurately.
Embodiment 2
Fig. 2 is a schematic flowchart of the method for extracting keywords from a document provided by Embodiment 2 of the present invention. As shown in Fig. 2, the method specifically comprises:
S21, adding the document to be processed to a training database, and constructing an initial topic vector for each topic and an initial word vector for each word of each document in the training database;
S22, obtaining the generating probability of an initial word vector according to formula one:

Formula one:
$$P(\hat{v}_w \mid x_w) = \frac{e^{x_w \cdot \hat{v}_w}}{\sum_{w'} e^{x_w \cdot \hat{v}_{w'}}}$$

where $\hat{v}_w$ is the auxiliary vector of the word vector $v_w$ of the current word w; $x_w$ denotes the context vector of the current word w, namely the sum of the word vectors of the words surrounding w and the topic vector $v_z$ of the current topic z; and w' ranges over the words of the vocabulary in the normalization;
S23, obtaining the joint likelihood function of all documents in the training database from formula one, as formula two:
Formula two:
where $\alpha_z$ is the Dirichlet prior hyperparameter corresponding to topic z; $\beta_v$ is the Dirichlet prior hyperparameter corresponding to word v; $m_{dz}$ is the number of sentences in document d sampled as topic z; $n_{zv}$ denotes the total number of times word v occurs together with topic z in the training database; M denotes the set of all word vectors and topic vectors; D denotes the total number of documents d; T denotes the total number of topics in document d; and $\hat{v}_v$ denotes the auxiliary vector of word v;
S24, processing formula two with the Gibbs sampling algorithm, so that the conditional distribution of the topic corresponding to each sentence s of document d can be obtained as formula three:

Formula three:
$$P(z_{ds} = k \mid \mathbf{w}, \mathbf{z}_{-ds}, \alpha, \beta, M) \approx (m_{dk} + \alpha_k)\, \frac{\Gamma\!\left(\sum_{w=1}^{W} (n_{kw} + \beta_w)\right)}{\Gamma\!\left(\sum_{w=1}^{W} (n_{kw} + \beta_w + N_{iw})\right)} \prod_{w \in s} \frac{\Gamma(n_{kw} + \beta_w + N_{iw})}{\Gamma(n_{kw} + \beta_w)} \prod_{\tilde{w} \in s} e^{x_w \cdot \hat{v}_w}$$

where k is the candidate topic, W is the total number of words in the training database, and $N_{iw}$ is the number of times word w occurs in the i-th sentence of document d;
S25, determining a specific topic for each sentence s of document d according to the conditional distribution probability of each topic in the conditional distribution;
S26, processing formula one according to the conditional distribution probability of the specific topic, to obtain the log-likelihood function of formula four:

Formula four:
$$T \log\!\left(\frac{\Gamma\!\left(\sum_{v=1}^{V} \beta_v\right)}{\prod_{v=1}^{V} \Gamma(\beta_v)}\right) + \sum_{z} \sum_{v} \log \Gamma(n_{zv} + \beta_v) - \sum_{z} \log \Gamma\!\left(\sum_{v} (n_{zv} + \beta_v)\right) + \sum_{d} \sum_{s \in d} \sum_{\tilde{w} \in s} \log P(\hat{v}_w \mid x_w);$$

S27, optimizing the parameters α and β in the log-likelihood function with Newton's method, and optimizing the word vectors, topic vectors and auxiliary vectors in the log-likelihood function with the negative sampling algorithm;
S28, performing parameter estimation on the optimized log-likelihood function to obtain the topic vectors and word vectors of the document to be processed;
S29, calculating the cosine distance between the word vectors and the topic vector;
S210, according to the cosine distance between the word vectors and the topic vector, selecting the words corresponding to a preset number of word vectors as the keywords of the document to be processed.
To verify the effectiveness of the embodiments of the present invention, the inventors conducted several groups of comparative experiments on experimental data sets of different scales; in every case the results surpassed the best of the traditional topic-model-based methods.
First group of experiments: small-scale data
Purpose: from among all the words of a document, pick out the keywords that best embody the meaning of the document.
Training data: the development set, training set and test set of the Sina corpus, comprising 32000 documents in total.
Test data: the test set of the Sina corpus; each document in the test set is accompanied by its reference keywords. 1000 documents in total.
Evaluation method: for each document, each model generates 3 keywords. Precision and recall are used to assess the results. Precision is the percentage of the keywords predicted by the model that are correct; recall is the percentage of the keywords in the reference answer that the model predicts correctly. The micro-average is used as the evaluation index, that is, precision and recall are computed for each document separately and then averaged.
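The evaluation just described might be implemented as in the following sketch; the dict-of-sets representation of predictions and references is an assumption.

```python
# Illustrative per-document precision/recall, averaged over documents.
def evaluate(predictions: dict, references: dict) -> tuple:
    """predictions and references map a document id to a set of keywords."""
    precisions, recalls = [], []
    for doc_id, predicted in predictions.items():
        gold = references[doc_id]
        correct = len(predicted & gold)
        precisions.append(correct / len(predicted) if predicted else 0.0)
        recalls.append(correct / len(gold) if gold else 0.0)
    n = len(precisions)
    return sum(precisions) / n, sum(recalls) / n
```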
Experimental setup: the cases with and without stop-word removal from the corpus are considered separately, and the likelihood-based methods of several LDA and SentenceLDA models are compared against the latent topic vector model used in the embodiment of the present invention. In the LDA and sentence-level latent Dirichlet allocation (SentenceLDA, sLDA) methods, Σ_z P(z|d)P(w|z) is computed for each word of a document as the word's score in the current document, and the 3 words of largest value are taken as keywords. In all of the above methods, the embodiment of the present invention removes the words consisting of a single character from the whole corpus. The results are shown in Table 2 below:
Table 2
Analysis: in the above results it can be seen that, whether or not stop words are removed, the method of the embodiment of the present invention achieves the best experimental results. In the experiment with stop words removed, compared with the LDA model, the improvement achieved by the latent topic vector model provided by the embodiment of the present invention reaches 20.9%. Meanwhile, whether stop words are removed has no effect on the final result of the latent topic vector model, which shows that the model has a certain ability to resist noise. In addition, the experimental results of the calculation method based on the topic distribution are better than those of the calculation method based on the optimal topic, which shows that taking more topic information into account when generating the final keywords is helpful to the final result.
Second group of experiments: large-scale data
Purpose: from among all the words of a document, pick out the keywords that best embody the meaning of the document.
Training data: the development set, training set and test set of the Sina corpus, plus corpus data from the news domain, comprising 261173 documents in total.
Test data: the test set of the Sina corpus; each document in the test set is accompanied by its reference keywords. 1000 documents in total.
Evaluation method: for each document, each model generates 3 keywords. Precision and recall are used to assess the results. Precision is the percentage of the keywords predicted by the model that are correct; recall is the percentage of the keywords in the reference answer that the model predicts correctly. The micro-average is used as the evaluation index, that is, precision and recall are computed for each document separately and then averaged.
Experimental setup: the cases with and without stop-word removal from the corpus are considered separately, and the likelihood-function-based methods of several LDA and SentenceLDA models are compared against the latent topic vector model used by the embodiment of the present invention. In the likelihood-based (PL) methods of LDA and SentenceLDA, Σ_z P(z|d)P(w|z) is computed for each word of a document as the word's score in the current document, and the 3 words of largest value are taken as keywords. Meanwhile, the method of the embodiment of the present invention is also compared with the LDA method based on the distance between latent-variable distributions: from the topic distribution P(z|d) of the document and the topic distribution of each word, P(z|w) = P(w|z)P(z)/P(w) ∝ P(w|z)P(z), the cosine distance between the two distributions is computed and the words are sorted by distance, and the 3 words whose topic distributions are closest to the document's topic distribution are chosen as the keywords of the document. In all of the above methods, the words consisting of a single character are removed from the whole corpus. The results are shown in Table 3 below:
Table 3
Analysis: in the above results it can be seen that the method of the embodiment of the present invention still achieves the best experimental results, and the conclusions drawn on the small-scale corpus still hold on the large-scale corpus. Meanwhile, after the large-scale training corpus is added, the improvement of the methods based on LDA and SentenceLDA is not significant, whereas the experimental results of the method of the embodiment of the present invention improve markedly after the corpus is added: the calculation method based on the optimal topic improves by 12.1%, and the calculation method based on the topic distribution improves by 6.5%. Moreover, as the model's corpus grows further, the experimental results of the embodiment of the present invention still have the potential to improve further.
Embodiment 3
Fig. 3 is a schematic structural diagram of the device for extracting keywords from a document provided by Embodiment 3 of the present invention. As shown in Fig. 3, the device specifically comprises: a vector training module 31, a distance calculation module 32 and a keyword extraction module 33.
The vector training module 31 is configured to obtain at least one topic vector and at least one word vector relevant to the document information by training a latent topic vector model, the latent topic vector model being a fusion model of a topic model and word vectors.
The distance calculation module 32 is configured to calculate the distance between the word vectors and the topic vector.
The keyword extraction module 33 is configured to select, according to the distance between the word vectors and the topic vector, the words corresponding to a preset number of word vectors as the keywords of the document.
The device for extracting keywords from a document described in this embodiment is used to perform the methods for extracting keywords from a document described in the above embodiments; its technical principle and the technical effects produced are similar and are not repeated here.
Illustratively, on the basis of the above embodiment, the distance calculation module 32 is specifically configured to:
select, according to the topic distribution of the document, the topic with the largest topic distribution probability from at least one topic as the optimal topic; and calculate the distance between the word vectors and the topic vector corresponding to the optimal topic.
Illustratively, on the basis of the above embodiment, the distance calculation module 32 is alternatively specifically configured to:
weight and sum the distances between a word vector and each topic vector according to the topic distribution probability of each topic of the document; and take the weighted sum as the distance between the word vector and the topic vector.
Illustratively, on the basis of the above embodiment, the distance is the cosine distance.
Illustratively, on the basis of the above embodiment, the vector training module 31 comprises: a vector construction unit 311, a joint likelihood function establishing unit 312 and a parameter estimation unit 313.
The vector construction unit 311 is configured to add the document to a training database, and to construct an initial topic vector for each topic and an initial word vector for each word of each document in the training database.
The joint likelihood function establishing unit 312 is configured to establish the joint likelihood function of all documents in the training database according to the initial topic vectors and initial word vectors.
The parameter estimation unit 313 is configured to perform parameter estimation on the joint likelihood function to obtain the topic vectors and word vectors.
Illustratively, the joint likelihood function establishing unit 312 is specifically configured to:
obtain the generating probability of an initial word vector according to formula one:

Formula one:
$$P(\hat{v}_w \mid x_w) = \frac{e^{x_w \cdot \hat{v}_w}}{\sum_{w'} e^{x_w \cdot \hat{v}_{w'}}}$$

where $\hat{v}_w$ is the auxiliary vector of the word vector $v_w$ of the current word w; $x_w$ denotes the context vector of the current word w, namely the sum of the word vectors of the words surrounding w and the topic vector $v_z$ of the current topic z; and w' ranges over the words of the vocabulary in the normalization; and
obtain the joint likelihood function of all documents in the training database from formula one, as formula two:
Formula two:
where $\alpha_z$ is the Dirichlet prior hyperparameter corresponding to topic z; $\beta_v$ is the Dirichlet prior hyperparameter corresponding to word v; $m_{dz}$ is the number of sentences in document d sampled as topic z; $n_{zv}$ denotes the total number of times word v occurs together with topic z in the training database; M denotes the set of all word vectors and topic vectors; D denotes the total number of documents d; T denotes the total number of topics in document d; and $\hat{v}_v$ denotes the auxiliary vector of word v.
Illustratively, the vector training module 31 further comprises: a joint likelihood function processing unit 314.
The joint likelihood function processing unit 314 is configured to, after the joint likelihood function establishing unit 312 obtains the joint likelihood function as formula two according to formula one, process formula two with the Gibbs sampling algorithm, so that the conditional distribution of the topic corresponding to each sentence s of document d can be obtained as formula three:

Formula three:
$$P(z_{ds} = k \mid \mathbf{w}, \mathbf{z}_{-ds}, \alpha, \beta, M) \approx (m_{dk} + \alpha_k)\, \frac{\Gamma\!\left(\sum_{w=1}^{W} (n_{kw} + \beta_w)\right)}{\Gamma\!\left(\sum_{w=1}^{W} (n_{kw} + \beta_w + N_{iw})\right)} \prod_{w \in s} \frac{\Gamma(n_{kw} + \beta_w + N_{iw})}{\Gamma(n_{kw} + \beta_w)} \prod_{\tilde{w} \in s} e^{x_w \cdot \hat{v}_w}$$

where k is the candidate topic, W is the total number of words in the training database, and $N_{iw}$ is the number of times word w occurs in the i-th sentence of document d;
to determine a specific topic for each sentence s of document d according to the conditional distribution probability of each topic in the conditional distribution; and
to process formula one according to the conditional distribution probability of the specific topic, obtaining the log-likelihood function of formula four:

Formula four:
$$T \log\!\left(\frac{\Gamma\!\left(\sum_{v=1}^{V} \beta_v\right)}{\prod_{v=1}^{V} \Gamma(\beta_v)}\right) + \sum_{z} \sum_{v} \log \Gamma(n_{zv} + \beta_v) - \sum_{z} \log \Gamma\!\left(\sum_{v} (n_{zv} + \beta_v)\right) + \sum_{d} \sum_{s \in d} \sum_{\tilde{w} \in s} \log P(\hat{v}_w \mid x_w).$$
The parameter estimation unit 313 is then specifically configured to:
perform parameter estimation on the log-likelihood function to obtain the topic vectors and word vectors.
Illustratively, the vector training module 31 further comprises: a log-likelihood function optimization unit 315.
The log-likelihood function optimization unit 315 is configured to, after the joint likelihood function processing unit 314 obtains the log-likelihood function of formula four, optimize the parameters α and β in the log-likelihood function with Newton's method;
and/or,
optimize the word vectors, topic vectors and auxiliary vectors in the log-likelihood function with the negative sampling algorithm.
The parameter estimation unit 313 is then specifically configured to:
perform parameter estimation on the optimized log-likelihood function to obtain the topic vectors and word vectors.
Illustratively, the log-likelihood function optimization unit 315 is specifically configured to:
process the words and topics of all documents in the training database with the negative sampling algorithm, to obtain the likelihood function of formula five:
Formula five:
where l is the label of the current word, |NEG| is the number of negative-sampled words corresponding to a word, and |V| is the total number of words in the training database; and
process formula five with stochastic gradient descent, so that the update formula of the word vectors can be obtained as formula six, the update formula of the topic vectors as formula seven, and the update formula of the auxiliary vectors as formula eight:

Formula six:
$$v_u := v_u + \eta \sum_{u \in c_w \cup NEG(c_w)} \left[ l_u^{c_w} - \sigma\!\left( x_w \cdot \hat{v}_u - \log \frac{|NEG|}{|V|} \right) \right] \hat{v}_u,$$

Formula seven:
$$v_z := v_z + \eta \sum_{u \in c_w \cup NEG(c_w)} \left[ l_u^{c_w} - \sigma\!\left( x_w \cdot \hat{v}_u - \log \frac{|NEG|}{|V|} \right) \right] \hat{v}_u,$$

Formula eight:
$$\hat{v}_u := \hat{v}_u + \eta \left[ l_u^{c_w} - \sigma\!\left( x_w \cdot \hat{v}_u - \log \frac{|NEG|}{|V|} \right) \right] x_w.$$
Illustratively, the parameter estimation unit 313 is further configured to:
in the process of performing parameter estimation on the joint likelihood function, obtain the topic distribution of each document according to formula nine:

Formula nine:
$$P(z \mid d) = \frac{m_{dz} + \alpha_z}{\sum_{z'=1}^{K} (m_{dz'} + \alpha_{z'})}$$

where K is the total number of topics z in document d.
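Formula nine can be read as a simple count-plus-prior normalization; a minimal sketch, assuming m_d holds the sentence-topic counts of document d and alpha the Dirichlet priors (array names are assumptions):

```python
# Illustrative recovery of P(z|d) from sentence-topic counts and priors.
import numpy as np

def topic_distribution(m_d: np.ndarray, alpha: np.ndarray) -> np.ndarray:
    """m_d[z] = number of sentences of document d sampled as topic z."""
    unnorm = m_d + alpha
    return unnorm / unnorm.sum()
```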
The device for extracting keywords from a document described in the above embodiments is likewise used to perform the methods for extracting keywords from a document described in the above embodiments; its technical principle and the technical effects produced are similar and are not repeated here.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will appreciate that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made by a person skilled in the art without departing from the scope of protection of the present invention. Therefore, although the present invention has been described in further detail through the above embodiments, the present invention is not limited to the above embodiments and may include other equivalent embodiments without departing from the concept of the present invention; the scope of the present invention is determined by the appended claims.

Claims (20)

1. A method for extracting keywords from a document, characterized by comprising:
obtaining at least one topic vector and at least one word vector relevant to the document information by training a latent topic vector model, the latent topic vector model being a fusion model of a topic model and word vectors;
calculating the distance between the word vectors and the topic vector;
according to the distance between the word vectors and the topic vector, selecting the words corresponding to a preset number of word vectors as the keywords of the document.
2. The method according to claim 1, characterized in that calculating the distance between the word vectors and the topic vector comprises:
selecting, according to the topic distribution of the document, the topic with the largest topic distribution probability from at least one topic as the optimal topic;
calculating the distance between the word vectors and the topic vector corresponding to the optimal topic.
3. The method according to claim 1, characterized in that calculating the distance between the word vectors and the topic vector comprises:
weighting and summing the distances between a word vector and each topic vector according to the topic distribution probability of each topic of the document;
taking the weighted sum as the distance between the word vector and the topic vector.
4. The method according to any one of claims 1-3, characterized in that the distance is the cosine distance.
5. The method according to any one of claims 1-3, characterized in that obtaining at least one topic vector and at least one word vector relevant to the document information by training a latent topic vector model comprises:
adding the document to a training database, and constructing an initial topic vector for each topic and an initial word vector for each word of each document in the training database;
establishing the joint likelihood function of all documents in the training database according to the initial topic vectors and initial word vectors;
performing parameter estimation on the joint likelihood function to obtain the topic vectors and word vectors.
6. The method according to claim 5, characterized in that establishing the joint likelihood function of all documents in the training database according to the initial topic vectors and initial word vectors comprises:
obtaining the generating probability of an initial word vector by a computing formula;
obtaining the joint likelihood function of all documents in the training database according to the computing formula.
7. The method according to claim 6, characterized by, after the joint likelihood function is obtained according to the computing formula, further comprising:
processing the joint likelihood function with the Gibbs sampling algorithm, so that the conditional distribution of the topic corresponding to each sentence of each document can be obtained;
determining a specific topic for each sentence of each document according to the conditional distribution probability of each topic in the conditional distribution;
processing the joint likelihood function according to the conditional distribution probability of the specific topic, to obtain a log-likelihood function;
wherein performing parameter estimation on the joint likelihood function to obtain the topic vectors and word vectors comprises:
performing parameter estimation on the log-likelihood function to obtain the topic vectors and word vectors.
8. The method according to claim 7, characterized by, after the log-likelihood function is obtained, further comprising:
optimizing the parameters in the log-likelihood function with Newton's method;
and/or,
optimizing the word vectors, topic vectors and auxiliary vectors in the log-likelihood function with the negative sampling algorithm;
wherein performing parameter estimation on the log-likelihood function to obtain the topic vectors and word vectors comprises:
performing parameter estimation on the optimized log-likelihood function to obtain the topic vectors and word vectors.
9. The method according to claim 8, characterized in that optimizing the word vectors, topic vectors and auxiliary vectors with the negative sampling algorithm comprises:
processing the words and topics of all documents in the training database with the negative sampling algorithm, to obtain a negative sampling likelihood function;
processing the negative sampling likelihood function with stochastic gradient descent, to obtain the update formula of the word vectors, the update formula of the topic vectors and the update formula of the auxiliary vectors.
10. The method according to any one of claims 6-9, characterized by further comprising:
obtaining the topic distribution of each document in the process of performing parameter estimation on the joint likelihood function.
11. A device for extracting keywords from a document, characterized by comprising:
a vector training module, configured to obtain at least one topic vector and at least one word vector relevant to the document information by training a latent topic vector model, the latent topic vector model being a fusion model of a topic model and word vectors;
a distance calculation module, configured to calculate the distance between the word vectors and the topic vector;
a keyword extraction module, configured to select, according to the distance between the word vectors and the topic vector, the words corresponding to a preset number of word vectors as the keywords of the document.
12. The device according to claim 11, characterized in that the distance calculation module is specifically configured to:
select, according to the topic distribution of the document, the topic with the largest topic distribution probability from at least one topic as the optimal topic; and calculate the distance between the word vectors and the topic vector corresponding to the optimal topic.
13. The device according to claim 11, characterized in that the distance calculation module is specifically configured to:
weight and sum the distances between a word vector and each topic vector according to the topic distribution probability of each topic of the document; and take the weighted sum as the distance between the word vector and the topic vector.
14. The device according to any one of claims 11-13, characterized in that the distance is the cosine distance.
15. The device according to any one of claims 11-13, characterized in that the vector training module comprises:
a vector construction unit, configured to add the document to a training database, and to construct an initial topic vector for each topic and an initial word vector for each word of each document in the training database;
a joint likelihood function establishing unit, configured to establish the joint likelihood function of all documents in the training database according to the initial topic vectors and initial word vectors;
a parameter estimation unit, configured to perform parameter estimation on the joint likelihood function to obtain the topic vectors and word vectors.
16. The device according to claim 15, characterized in that the joint likelihood function establishing unit is specifically configured to:
obtain the generating probability of an initial word vector by a computing formula;
obtain the joint likelihood function of all documents in the training database according to the computing formula.
17. The device according to claim 16, characterized in that the vector training module further comprises:
a joint likelihood function processing unit, configured to, after the joint likelihood function establishing unit obtains the joint likelihood function according to the computing formula, process the joint likelihood function with the Gibbs sampling algorithm, so that the conditional distribution of the topic corresponding to each sentence of each document can be obtained;
to determine a specific topic for each sentence of each document according to the conditional distribution probability of each topic in the conditional distribution; and
to process the joint likelihood function according to the conditional distribution probability of the specific topic, obtaining a log-likelihood function;
wherein the parameter estimation unit is specifically configured to:
perform parameter estimation on the log-likelihood function to obtain the topic vectors and word vectors.
18. The device according to claim 17, characterized in that the vector training module further comprises:
a log-likelihood function optimization unit, configured to, after the joint likelihood function processing unit obtains the log-likelihood function, optimize the parameters in the log-likelihood function with Newton's method;
and/or,
optimize the word vectors, topic vectors and auxiliary vectors in the log-likelihood function with the negative sampling algorithm;
wherein the parameter estimation unit is specifically configured to:
perform parameter estimation on the optimized log-likelihood function to obtain the topic vectors and word vectors.
19. The device according to claim 18, characterized in that the log-likelihood function optimization unit is specifically configured to:
process the words and topics of all documents in the training database with the negative sampling algorithm, to obtain a negative sampling likelihood function;
process the negative sampling likelihood function with stochastic gradient descent, to obtain the update formula of the word vectors, the update formula of the topic vectors and the update formula of the auxiliary vectors.
20. The device according to any one of claims 16-19, characterized in that the parameter estimation unit is further configured to:
obtain the topic distribution of each document in the process of performing parameter estimation on the joint likelihood function.
CN201510512363.8A 2015-08-19 2015-08-19 Extract the method and device of keyword in document Active CN105069143B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510512363.8A CN105069143B (en) 2015-08-19 2015-08-19 Extract the method and device of keyword in document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510512363.8A CN105069143B (en) 2015-08-19 2015-08-19 Extract the method and device of keyword in document

Publications (2)

Publication Number Publication Date
CN105069143A true CN105069143A (en) 2015-11-18
CN105069143B CN105069143B (en) 2019-07-23

Family

ID=54498512

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510512363.8A Active CN105069143B (en) 2015-08-19 2015-08-19 Extract the method and device of keyword in document

Country Status (1)

Country Link
CN (1) CN105069143B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009104296A (en) * 2007-10-22 2009-05-14 Nippon Telegr & Teleph Corp <Ntt> Related keyword extraction method, device, program, and computer readable recording medium
US8768960B2 (en) * 2009-01-20 2014-07-01 Microsoft Corporation Enhancing keyword advertising using online encyclopedia semantics
CN102081660A (en) * 2011-01-13 2011-06-01 Method for searching and ranking keywords of XML documents based on semantic correlation
US20150199438A1 (en) * 2014-01-15 2015-07-16 Roman Talyansky Methods, apparatus, systems and computer readable media for use in keyword extraction
CN104008090A (en) * 2014-04-29 2014-08-27 Multi-topic extraction method based on concept vector model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Liu Zhiyuan: "Research on Keyword Extraction Methods Based on Document Topic Structure", WWW.THUNLP.ORG *

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740354B (en) * 2016-01-26 2018-11-30 Method and device for adaptive latent Dirichlet model selection
CN105740354A (en) * 2016-01-26 2016-07-06 Adaptive latent Dirichlet model selection method and apparatus
CN106021272A (en) * 2016-04-04 2016-10-12 Automatic keyword extraction method based on distributed-representation word vectors
CN106021272B (en) * 2016-04-04 2019-11-19 Keyword extraction method based on distributed-representation term vectors
CN106407316A (en) * 2016-08-30 2017-02-15 Topic model-based software question and answer recommendation method and device
CN106407316B (en) * 2016-08-30 2020-05-15 Software question and answer recommendation method and device based on topic model
CN108399180B (en) * 2017-02-08 2021-11-26 Knowledge graph construction method, device and server
CN108399180A (en) * 2017-02-08 2018-08-14 Knowledge graph construction method, device and server
CN107220232A (en) * 2017-04-06 2017-09-29 Artificial-intelligence-based keyword extraction method, device, equipment and computer-readable recording medium
CN107220232B (en) * 2017-04-06 2021-06-11 Artificial-intelligence-based keyword extraction method, device, equipment and readable medium
CN109815474A (en) * 2017-11-20 2019-05-28 Word sequence vector determination method, apparatus, server and storage medium
CN109815474B (en) * 2017-11-20 2022-09-23 Word sequence vector determination method, device, server and storage medium
CN108829822A (en) * 2018-06-12 2018-11-16 Media content recommendation method and device, storage medium and electronic device
CN108829822B (en) * 2018-06-12 2023-10-27 Media content recommendation method and device, storage medium and electronic device
CN108984526A (en) * 2018-07-10 2018-12-11 Document topic vector extraction method based on deep learning
CN108984526B (en) * 2018-07-10 2021-05-07 Document topic vector extraction method based on deep learning
CN109446516A (en) * 2018-09-28 2019-03-08 Data processing method and system based on topic recommendation model
CN109446516B (en) * 2018-09-28 2022-11-11 Data processing method and system based on topic recommendation model
CN109299465A (en) * 2018-10-17 2019-02-01 System for improving document keyword accuracy based on multiple algorithms
CN109408641A (en) * 2018-11-22 2019-03-01 Document classification method and system based on a supervised topic model
CN110263122A (en) * 2019-05-08 2019-09-20 Keyword acquisition method, device and computer-readable storage medium
CN110263122B (en) * 2019-05-08 2022-05-17 Keyword acquisition method, device and computer-readable storage medium
CN110134957A (en) * 2019-05-14 2019-08-16 Scientific and technological achievement warehousing method and system based on semantic analysis
CN110134957B (en) * 2019-05-14 2023-06-13 Scientific and technological achievement warehousing method and system based on semantic analysis

Also Published As

Publication number Publication date
CN105069143B (en) 2019-07-23

Similar Documents

Publication Publication Date Title
CN105069143A (en) Method and device for extracting keywords from document
CN104376406B (en) Enterprise innovation resource management and analysis method based on big data
CN105224695B (en) Text feature quantization method and device and text classification method and device based on information entropy
CN104834747A (en) Short text classification method based on convolutional neural network
CN105183833A (en) Microblog text recommendation method and apparatus based on a user model
CN106599029A (en) Chinese short text clustering method
CN103870001A (en) Input method candidate item generating method and electronic device
CN101295294A (en) Improved Bayesian word sense disambiguation method based on information gain
CN103870474A (en) News topic organizing method and device
CN100511214C (en) Method and system for batch single-document summarization over a document set
CN103870000A (en) Method and device for ranking candidate items generated by an input method
CN106294618A (en) Search method and device
CN104408033A (en) Text information extraction method and system
CN105550170A (en) Chinese word segmentation method and apparatus
CN103869998A (en) Method and device for ranking candidate items generated by an input method
CN102629272A (en) Clustering-based optimization method for examination system database
CN112115716A (en) Service discovery method, system and equipment based on multi-dimensional word vector context matching
CN107656920A (en) Method for recommending skilled personnel based on patents
CN107885717A (en) Keyword extraction method and device
CN109740158A (en) Text semantic analysis method and device
CN102915448A (en) AdaBoost-based automatic classification method for 3D (three-dimensional) models
CN108090178A (en) Text data analysis method, device, server and storage medium
CN107015965A (en) Chinese text sentiment analysis device and method
CN106681986A (en) Multi-dimensional sentiment analysis system
CN105243053A (en) Method and apparatus for extracting key sentences from a document

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant