CN105069143B - Method and device for extracting keywords from a document - Google Patents

Method and device for extracting keywords from a document

Publication number: CN105069143B
Authority: CN (China)
Prior art keywords: vector, topic, document, likelihood function, word vector
Legal status: Active (granted)
Application number: CN201510512363.8A
Original language: Chinese (zh)
Other versions: CN105069143A
Inventors: 姜迪 (Jiang Di), 石磊 (Shi Lei), 林鸿宇 (Lin Hongyu)
Assignee (original and current): Beijing Baidu Netcom Science and Technology Co., Ltd.
Application filed 2015-08-19 by Beijing Baidu Netcom Science and Technology Co., Ltd.
Priority: CN201510512363.8A
Publication of CN105069143A: 2015-11-18
Publication of CN105069143B (grant): 2019-07-23

Classifications

    • G (Physics) > G06 (Computing; calculating or counting) > G06F (Electric digital data processing) > G06F16/00 (Information retrieval; database structures therefor; file system structures therefor) > G06F16/30 (of unstructured textual data) > G06F16/36 (Creation of semantic tools, e.g. ontology or thesauri)
    • G (Physics) > G06 > G06F > G06F16/00 > G06F16/30 > G06F16/31 (Indexing; data structures therefor; storage structures) > G06F16/313 (Selection or weighting of terms for indexing)

Abstract

The invention discloses a method and device for extracting keywords from a document. The method comprises: obtaining at least one topic vector related to the document's content and at least one word vector by training a latent topic vector model, the latent topic vector model being a fusion of a topic model and word vectors; computing the distance between the word vectors and the topic vector; and selecting, according to the distances between the word vectors and the topic vector, the words corresponding to a predetermined number of word vectors as keywords of the document. Embodiments of the invention can extract keyword information that accurately expresses the document's content.

Description

Method and device for extracting keywords from a document
Technical field
Embodiments of the present invention relate to the field of information technology, and in particular to a method and device for extracting keywords from a document.
Background art
In the current era of information explosion, a user cannot browse every document that may contain relevant information. Extracting the keywords of a document therefore gives users a reference, and is of great significance for helping users obtain information accurately and for reducing the cost of obtaining it.
In general, the keywords of a document are necessarily words highly relevant to the document's topics, so a document's topic information is of great significance for its keyword extraction. At present, the problem is mainly addressed using the probability distributions of a latent Dirichlet allocation (LDA) model, chiefly by the following two methods:
The first is a likelihood-based method: an LDA model is used to obtain the topic distribution P(z|d) of the document and the word distribution P(w|z) of each topic, and the distribution of words in the document is computed as P(w|d) = Σ_z P(z|d)·P(w|z), where z denotes a topic, d a document, and w a word. P(w|d) is taken as the importance score of word w in document d, and the K highest-scoring words are selected as the document's keywords.
The second is a method based on the distance between latent-variable distributions: an LDA model is used to obtain the topic distribution P(z|d) of the document and the topic distribution P(z|w) of each word. The cosine distance between the two distributions is then computed, and the K words with the largest cosine distance are selected as the document's keywords.
However, the above keyword-extraction methods have shortcomings. The first method is severely biased toward high-frequency words: the extracted words are largely the high-frequency words of some topic, yet such words do not necessarily occur widely in the given document and may fail to reflect the information the document actually expresses.
The second method needs the latent-variable distribution P(z), since it computes P(z|w) ∝ P(w|z)·P(z), but this distribution is not a parameter of the LDA model. It is usually obtained as P(z) = Σ_d P(z|d)·P(d), where P(d) is the posterior distribution over documents; assuming P(d) to be uniform gives P(z) ∝ Σ_d P(z|d). But since P(d) is in fact not uniform across documents, the theoretical basis of this model is not solid, and its performance in practice is also poor.
Summary of the invention
Embodiments of the present invention provide a method and device for extracting keywords from a document, capable of extracting keyword information that accurately expresses the document's content.
In a first aspect, an embodiment of the invention provides a method for extracting keywords from a document, comprising:
obtaining at least one topic vector related to the document's content and at least one word vector by training a latent topic vector model, the latent topic vector model being a fusion of a topic model and word vectors;
computing the distance between the word vectors and the topic vector;
selecting, according to the distances between the word vectors and the topic vector, the words corresponding to a predetermined number of word vectors as keywords of the document.
In a second aspect, an embodiment of the invention also provides a device for extracting keywords from a document, comprising:
a vector training module, configured to obtain at least one topic vector related to the document's content and at least one word vector by training a latent topic vector model, the latent topic vector model being a fusion of a topic model and word vectors;
a distance calculation module, configured to compute the distance between the word vectors and the topic vector;
a keyword extraction module, configured to select, according to the distances between the word vectors and the topic vector, the words corresponding to a predetermined number of word vectors as keywords of the document.
In the embodiments of the invention, the document is trained with the latent topic vector model obtained by fusing a topic model and word vectors, at least one topic vector related to the document's content and at least one word vector are obtained, and the words corresponding to a predetermined number of word vectors are then selected as the document's keywords according to the distances between the word vectors and the topic vector. Because the embodiments train on the document with the latent topic vector model, more document information can be captured during training, so the extracted keyword information accurately expresses the document's content.
Detailed description of the invention
Fig. 1 is a flow diagram of the method for extracting keywords from a document provided by Embodiment 1 of the present invention;
Fig. 2 is a flow diagram of the method for extracting keywords from a document provided by Embodiment 2 of the present invention;
Fig. 3 is a structural schematic diagram of the device for extracting keywords from a document provided by Embodiment 3 of the present invention.
Specific embodiment
The present invention is described in further detail below with reference to the drawings and embodiments. It will be understood that the specific embodiments described here are used only to explain the invention, not to limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the invention rather than the entire structure.
The method for extracting keywords from a document provided by the embodiments of the invention may be executed by the device for extracting keywords from a document provided by the embodiments of the invention, or by a terminal device (for example, a smartphone or a tablet computer) in which that device is integrated; the device may be implemented in hardware or software.
Embodiment one
Fig. 1 is a flow diagram of the method for extracting keywords from a document provided by Embodiment 1 of the present invention. As shown in Fig. 1, the method specifically includes:
S11. Obtaining at least one topic vector related to the document's content and at least one word vector by training a latent topic vector model, the latent topic vector model being a fusion of a topic model and word vectors;
Here, topic models and word vectors (word embeddings) are both common semantic representation methods in the prior art. A topic model assumes that each word is generated by a semantic element in a latent space; under this assumption, documents and words can be mapped into the latent semantic space for dimensionality reduction. Word vectors are another, distributed representation of words, expressing the meaning of a word with a vector of fixed length.
A topic model is usually built at the document or sentence level and focuses more on global semantics, whereas word vectors generally assume that the meaning of a word is expressed by the words around it and focus more on local, syntax-like information. The two methods have different emphases, and each has been shown to have great application value. This embodiment therefore combines the two, so that the latent topic vector model can capture more information.
The dimensions of the topic vectors and word vectors can be set as desired, and the value of each element of a vector is obtained by training the latent topic vector model. To make the training results more accurate, the latent topic vector model also includes a training database, which contains a large amount of document data.
S12. Computing the distance between the word vectors and the topic vector;
The purpose of training to obtain the word vectors and topic vectors is to measure the importance of each word in the document and to rank the words by importance, so as to select the most important words as the document's keywords.
In this embodiment, the importance of a word in the document is measured by computing the distance between its word vector and the topic vector; specifically, this includes computing the Euclidean distance, cosine distance, or sine distance between the word vector and the topic vector. The criterion for importance depends on the distance computed: with the Euclidean distance or the sine distance, the smaller the distance, the more important the word is in the document and the better it reflects the topics the document expresses; with the cosine distance, the larger the value, the more important the word is in the document.
S13. Selecting, according to the distances between the word vectors and the topic vector, the words corresponding to a predetermined number of word vectors as keywords of the document.
The predetermined number can be set according to the actual situation and is not specifically limited here.
From the result of step S12, the predetermined number of word vectors that are most important in the document can be determined, and the words corresponding to those word vectors are taken as the document's keywords.
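As a minimal illustration of steps S12 and S13 (not the patent's reference implementation), the distance computation and top-K selection can be sketched as follows, assuming the trained vectors are available as NumPy arrays; the function names and the choice of k = 3 are ours:

```python
import numpy as np

def cosine_similarity(a, b):
    # Larger value = directions closer = word more important (see S12 above).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_keywords(word_vectors, topic_vector, k=3):
    """Rank words by cosine similarity to one topic vector and keep the top k.

    word_vectors: dict mapping word -> np.ndarray (trained word vectors)
    topic_vector: np.ndarray (one trained topic vector)
    """
    scores = {w: cosine_similarity(v, topic_vector)
              for w, v in word_vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```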
In this embodiment, the document is trained with the latent topic vector model obtained by fusing a topic model and word vectors, at least one topic vector related to the document's content and at least one word vector are obtained, and the words corresponding to a predetermined number of word vectors are then selected as the document's keywords according to the distances between the word vectors and the topic vector. Because the document is trained with the latent topic vector model, more document information can be captured during training, so the extracted keyword information accurately expresses the document's content.
Illustratively, to improve the accuracy of keyword extraction, embodiments of the invention provide the following two methods of computing the distance between the word vectors and the topic vectors. The first is a computation method based on the optimal topic, mainly comprising the following steps:
selecting, according to the document's topic distribution, the topic with the largest probability from among the at least one topic as the optimal topic;
computing the distance between the word vectors and the topic vector corresponding to the optimal topic.
Specifically, in the latent topic vector model, the topic distribution P(z|d) of a given document can be obtained by training; it contains the probability of each topic for that document, and the topic z with the largest probability, namely the optimal topic, expresses the document's core content. It will be understood that the most important words in the document are exactly those whose vector representations are nearest, in the vector space, to the vector of topic z. Therefore, the topic with the largest probability is selected from the topic distribution as the optimal topic, the distance between each word vector and the topic vector corresponding to the optimal topic is computed, and the words corresponding to a predetermined number of word vectors are selected as the document's keywords according to those distances.
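A sketch of this first method, reusing the helpers above; the argmax over P(z|d) picks the optimal topic, and the names remain illustrative:

```python
def optimal_topic_keywords(word_vectors, topic_vectors, p_z_given_d, k=3):
    """Method 1: score words against the single most probable topic.

    topic_vectors: np.ndarray of shape (T, dim), one row per topic
    p_z_given_d:   np.ndarray of shape (T,), the document's topic distribution P(z|d)
    """
    z_star = int(np.argmax(p_z_given_d))  # the optimal topic
    return top_k_keywords(word_vectors, topic_vectors[z_star], k=k)
```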
The second is a computation method based on the topic distribution, mainly comprising the following steps:
weighting and summing the distances between the word vectors and each topic vector according to the topic distribution probability of each topic of the document;
taking the weighted sum as the distance between the word vectors and the topic vectors.
Specifically, considering that more than one topic may play an important role in a document, and that the optimal-topic method above may lose part of that information, the distances to the different topics can be weighted by P(z|d), giving a new measure of the form
Score_Distr(w) = Σ_z P(z|d) · L(w, z)
where Score_Distr(w) is the weighted sum and L(w, z) is the distance between the word vector of w and the vector of topic z.
This measure is a word-importance score obtained by weighting with the document's topic distribution. The words are ranked by the Score_Distr(w) obtained as above, and the words corresponding to a predetermined number of word vectors are selected as the document's keywords.
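The second method can be sketched under the same assumptions; here cosine similarity stands in for the distance measure L, which is our choice, since the embodiment allows several measures:

```python
def topic_distribution_keywords(word_vectors, topic_vectors, p_z_given_d, k=3):
    """Method 2: Score_Distr(w) = sum over z of P(z|d) * L(w, z)."""
    scores = {
        w: sum(p * cosine_similarity(v, topic_vectors[z])
               for z, p in enumerate(p_z_given_d))
        for w, v in word_vectors.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]
```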
Illustratively, embodiments of the invention also provide a concrete implementation of obtaining at least one topic vector related to the document's content and at least one word vector by training the latent topic vector model, mainly comprising the following steps:
adding the document to a training database, and constructing an initial topic vector for each topic and an initial word vector for each word of each document in the training database;
establishing the joint likelihood function of all documents in the training database from the initial topic vectors and the initial word vectors;
performing parameter estimation on the joint likelihood function to obtain the topic vectors and word vectors.
Here, the training database can be obtained from the internet (for example, the Sina corpus database); it contains documents of various types. The initial topic vectors and initial word vectors can be set as desired.
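Reading the three steps above together with the Gibbs sampling and the negative-sampling optimization described below, the training procedure can be summarized by the following skeleton; the two step functions are deliberate placeholders for formula three and formulas five to eight, and nothing here is an API defined by the patent:

```python
def sample_sentence_topics(docs, topic_vecs, word_vecs):
    """Placeholder for the Gibbs sampling step (formula three)."""
    raise NotImplementedError

def update_vectors_by_negative_sampling(docs, word_vecs, topic_vecs, aux_vecs):
    """Placeholder for the SGD step with negative sampling (formulas five to eight)."""
    raise NotImplementedError

def train_latent_topic_vector_model(docs, num_topics, dim, iterations=100):
    """Skeleton: alternate topic sampling and vector updates until convergence."""
    rng = np.random.default_rng(0)
    vocab = sorted({w for doc in docs for w in doc})           # docs: lists of words
    word_vecs = {w: rng.normal(0, 0.01, dim) for w in vocab}   # initial word vectors
    aux_vecs = {w: rng.normal(0, 0.01, dim) for w in vocab}    # auxiliary vectors
    topic_vecs = rng.normal(0, 0.01, (num_topics, dim))        # initial topic vectors
    for _ in range(iterations):
        sample_sentence_topics(docs, topic_vecs, word_vecs)
        update_vectors_by_negative_sampling(docs, word_vecs, topic_vecs, aux_vecs)
    return topic_vecs, word_vecs
```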
Illustratively, establishing the joint likelihood function from the initial topic vectors and the initial word vectors comprises:
obtaining the generating probability of each word according to formula one:
Formula one: P(w | x_w, z) = exp(v̂_w · (x_w ⊕ v_z)) / Σ_{w′} exp(v̂_{w′} · (x_w ⊕ v_z))
where v̂_w is the auxiliary vector of the word vector v of the current word w; x_w denotes the context vector of the current word w, formed from the word vectors of the words surrounding w; v_z is the topic vector of the current topic z; ⊕ denotes the add operation; and the sum runs over the words w′ of the vocabulary;
obtaining the joint likelihood function of all documents in the training database, formula two, according to formula one:
Formula two:
where α_z is the Dirichlet prior parameter for topic z; β_v is the Dirichlet prior parameter for word v; m_dz is the number of sentences in document d whose sampled topic is z; n_zv denotes the total number of times word v and topic z occur together in the training database; M denotes the set of all word vectors and topic vectors; D denotes the total number of documents d; T denotes the total number of topics in document d; and v̂ denotes the auxiliary vector of word v.
Illustratively, to further optimize the joint likelihood function above, the following steps are performed after the joint likelihood function (formula two) is obtained according to formula one:
processing formula two with the Gibbs sampling algorithm to obtain the conditional distribution of the topic of each sentence s in document d, formula three:
Formula three:
where k is the candidate topic; W is the total number of words in the training database; and N_iw is the number of occurrences of word w in the i-th sentence of document d;
determining a specific topic for each sentence s of document d according to the conditional probability of each topic in the conditional distribution;
processing formula one according to the conditional probabilities of the specific topics to obtain the log-likelihood function of formula four:
Formula four:
Performing parameter estimation on the joint likelihood function to obtain the topic vectors and word vectors then comprises:
performing parameter estimation on the log-likelihood function to obtain the topic vectors and word vectors.
Illustratively, the log-likelihood function obtained above can be optimized further, specifically by the following steps:
optimizing the parameters α and β in the log-likelihood function using Newton's iteration method;
and/or
optimizing the word vectors, topic vectors, and auxiliary vectors in the log-likelihood function using the negative sampling algorithm;
correspondingly, performing parameter estimation on the log-likelihood function to obtain the topic vectors and word vectors comprises:
performing parameter estimation on the optimized log-likelihood function to obtain the topic vectors and word vectors.
Illustratively, optimizing the word vectors, topic vectors, and auxiliary vectors using the negative sampling algorithm comprises:
processing the words and topics of all documents in the training database with the negative sampling algorithm to obtain the likelihood function of formula five:
Formula five: L = Π_w Π_{u ∈ {w} ∪ NEG(w)} σ(v̂_u · (x_w ⊕ v_z))^l · (1 − σ(v̂_u · (x_w ⊕ v_z)))^(1−l)
where l is the label of the current word: if the current word is a genuine word, l = 1; if the current word is a negative-sample word, l = 0; |NEG| is the number of negative-sample words drawn per word; and |V| is the total number of words in the training database;
processing formula five with stochastic gradient descent gives the optimization formula of the word vectors (formula six), the optimization formula of the topic vectors (formula seven), and the optimization formula of the auxiliary vectors (formula eight):
Formula six:
Formula seven:
Formula eight:
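Since formulas five to eight are not reproduced above, the following single update step is only our reading of the standard negative-sampling objective that the definitions of l, |NEG|, and the three update formulas suggest; the uniform negative sampler and the fixed learning rate are simplifications of our own:

```python
import random

def negative_sampling_sgd_step(w, ctx_words, z, word_vecs, topic_vecs,
                               aux_vecs, vocab, lr=0.025, num_neg=5):
    """One SGD update for (current word w, sentence topic z)."""
    h = sum(word_vecs[c] for c in ctx_words) + topic_vecs[z]  # x_w (+) v_z
    grad_h = np.zeros_like(h)
    # One positive example (l = 1) plus |NEG| negative samples (l = 0):
    pairs = [(w, 1.0)] + [(random.choice(vocab), 0.0) for _ in range(num_neg)]
    for u, label in pairs:
        sigma = 1.0 / (1.0 + np.exp(-np.dot(aux_vecs[u], h)))
        g = lr * (label - sigma)
        grad_h += g * aux_vecs[u]
        aux_vecs[u] = aux_vecs[u] + g * h   # auxiliary-vector update (cf. formula eight)
    topic_vecs[z] = topic_vecs[z] + grad_h  # topic-vector update (cf. formula seven)
    for c in ctx_words:                     # word-vector updates (cf. formula six)
        word_vecs[c] = word_vecs[c] + grad_h
```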
For the training corpus in the training database, the latent topic vector model provided by the embodiments of the invention yields a vectorized representation of each topic, while the prior-art topic model learns a multinomial distribution over the words of each topic. The embodiment of the invention compared, for each topic, the 10 most probable words of the word multinomial with the 10 word vectors closest to that topic's vector; the results are shown in Table one below:
Table one
As can be seen from Table one, the multinomial distribution of the topic model is clearly skewed toward high-frequency words, while the association that the traditional topic distribution establishes between medium- and low-frequency words and the topics is weak. When the multinomial distribution is used for keyword extraction, the topic model therefore naturally favors high-frequency words, leading to poor keyword-extraction results. The vectorized representation of the latent topic model eliminates this problem: as the table shows, the words nearest a topic vector are usually the words that carry concrete meaning under that topic, which is why a topic-vector model obtains better results in the keyword-extraction task.
Therefore, the embodiments above likewise train on documents with the latent topic vector model obtained by fusing a topic model and word vectors, obtain at least one topic vector related to the document's content and at least one word vector, and then select the words corresponding to a predetermined number of word vectors as the document's keywords according to the distances between the word vectors and the topic vector. Because the document is trained with the latent topic vector model, more document information can be captured during training, so the extracted keyword information accurately expresses the document's content.
Embodiment two
Fig. 2 is a flow diagram of the method for extracting keywords from a document provided by Embodiment 2 of the present invention. As shown in Fig. 2, the method specifically includes:
S21. Adding the document to be processed to a training database, and constructing an initial topic vector for each topic and an initial word vector for each word of each document in the training database;
S22. Obtaining the generating probability of each word according to formula one:
Formula one: P(w | x_w, z) = exp(v̂_w · (x_w ⊕ v_z)) / Σ_{w′} exp(v̂_{w′} · (x_w ⊕ v_z))
where v̂_w is the auxiliary vector of the word vector v of the current word w; x_w denotes the context vector of the current word w, formed from the word vectors of the words surrounding w; v_z is the topic vector of the current topic z; ⊕ denotes the add operation; and the sum runs over the words w′ of the vocabulary;
S23. Obtaining the joint likelihood function of all documents in the training database, formula two, according to formula one:
Formula two:
where α_z is the Dirichlet prior parameter for topic z; β_v is the Dirichlet prior parameter for word v; m_dz is the number of sentences in document d whose sampled topic is z; n_zv denotes the total number of times word v and topic z occur together in the training database; M denotes the set of all word vectors and topic vectors; D denotes the total number of documents d; T denotes the total number of topics in document d; and v̂ denotes the auxiliary vector of word v.
S24. Processing formula two with the Gibbs sampling algorithm to obtain the conditional distribution of the topic of each sentence s in document d, formula three:
Formula three:
where k is the candidate topic; W is the total number of words in the training database; and N_iw is the number of occurrences of word w in the i-th sentence of document d;
S25. Determining a specific topic for each sentence s of document d according to the conditional probability of each topic in the conditional distribution;
S26. Processing formula one according to the conditional probabilities of the specific topics to obtain the log-likelihood function of formula four:
Formula four:
S27. Optimizing the parameters α and β in the log-likelihood function using Newton's iteration method, and optimizing the word vectors, topic vectors, and auxiliary vectors in the log-likelihood function using the negative sampling algorithm;
S28. Performing parameter estimation on the optimized log-likelihood function to obtain the topic vectors and word vectors of the document to be processed.
S29. Computing the cosine distance between the word vectors and the topic vector;
S210. Selecting, according to the cosine distances between the word vectors and the topic vector, the words corresponding to a predetermined number of word vectors as keywords of the document to be processed.
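On top of the training outputs, steps S29 and S210 reduce to the weighted scoring sketched earlier; the following usage-level sketch assumes steps S21 to S28 have produced the vectors and the document's topic distribution P(z|d) (for example via formula nine during parameter estimation):

```python
def extract_keywords(doc_words, topic_vecs, word_vecs, p_z_given_d, k=3):
    """doc_words: the document to be processed, as a list of words."""
    doc_vecs = {w: word_vecs[w] for w in set(doc_words) if w in word_vecs}
    return topic_distribution_keywords(doc_vecs, topic_vecs, p_z_given_d, k=k)
```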
To verify the effectiveness of the embodiments of the invention, the inventors ran several groups of comparative experiments on experimental data sets of different scales; the results exceeded the best results of the traditional topic-model-based methods.
First group of experiments: small-scale data
Experimental goal: from all the words in a document, pick out the keywords that best express the document's meaning.
Training database: the development, training, and test sets of the Sina corpus, 32,000 documents in total.
Test data: the test set of the Sina corpus, with the corresponding reference keywords for each document in the test set; 1,000 documents in total.
Evaluation method: for each document, each model generates 3 keywords. Precision and recall are used to assess the experimental results. Precision is the percentage of the keywords predicted by a model that are correct; recall is the percentage of the reference keywords that the model predicts correctly. Micro-averaging is used as the evaluation metric: precision and recall are computed separately for each document and then averaged.
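The evaluation protocol as stated, per-document precision and recall followed by averaging, can be expressed as follows; predictions and references are assumed to be one set of keywords per document:

```python
def averaged_precision_recall(predictions, references):
    """Precision and recall computed per document, then averaged over documents."""
    precisions, recalls = [], []
    for pred, ref in zip(predictions, references):
        hits = len(pred & ref)                      # correctly predicted keywords
        precisions.append(hits / len(pred) if pred else 0.0)
        recalls.append(hits / len(ref) if ref else 0.0)
    n = len(precisions)
    return sum(precisions) / n, sum(recalls) / n
```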
Experimental setup: the cases with and without stop-word removal from the training corpus were both considered, and the latent topic vector model used in the embodiment of the invention was compared with the likelihood-based methods of several variants of LDA and Sentence LDA. For LDA and for latent Dirichlet allocation at the sentence level (Sentence LDA, sLDA), Σ_z P(z|d)·P(w|z) was computed as the score of each word in the current document, and the 3 words with the largest values were taken as keywords. In all the methods above, the embodiment of the invention removed all single-character words from the corpus. The experimental results are shown in Table two below:
Table two
Analysis of experimental results: in the results above, it can be seen that the method of the embodiment of the invention achieved the best experimental results whether or not stop words were removed. In the stop-word-removal experiment, the latent topic vector model provided by the embodiment improved on the result of the LDA model by 20.9%. At the same time, whether or not stop words were removed did not affect the final result of the latent topic vector model, which shows that the model has a degree of resistance to noise. In addition, the experimental results of the computation method based on the topic distribution were better than those of the computation method based on the optimal topic, which shows that taking more topic information into account when generating the final keywords helps the final experimental results.
Second group of experiments: large-scale data
Experimental goal: from all the words in a document, pick out the keywords that best express the document's meaning.
Training data: the development, training, and test sets of the Sina corpus together with corpus data from the news domain, 261,173 documents in total.
Test data: the test set of the Sina corpus, with the corresponding reference keywords for each document in the test set; 1,000 documents in total.
Evaluation method: for each document, each model generates 3 keywords. Precision and recall are used to assess the experimental results. Precision is the percentage of the keywords predicted by a model that are correct; recall is the percentage of the reference keywords that the model predicts correctly. Micro-averaging is used as the evaluation metric: precision and recall are computed separately for each document and then averaged.
Experimental setup: the cases with and without stop-word removal from the training corpus were both considered, and the latent topic vector model used in the embodiment of the invention was compared with the likelihood-function-based methods of several variants of LDA and Sentence LDA. In the PL method of LDA and Sentence LDA, Σ_z P(z|d)·P(w|z) was computed as the score of each word in the current document, and the 3 words with the largest values were taken as keywords. The method of the embodiment of the invention was also compared with the LDA method based on the distance between latent-variable distributions: from the document's topic distribution P(z|d) and each word's topic distribution P(z|w), the cosine distance of the two distributions is computed, the words are ranked by distance, and the 3 words whose topic distributions are closest to the document's topic distribution are taken as the document's keywords. In all the methods above, single-character words were removed from the corpus. The experimental results are shown in Table three below:
Table three
Analysis of experimental results: in the results above, it can be seen that the method of the embodiment of the invention still achieved the best experimental results; the conclusions obtained on the small-scale corpus continue to hold on the large-scale corpus. Meanwhile, the methods based on LDA and Sentence LDA showed no significant improvement after the large-scale training corpus was added, whereas the experimental results of the method of the embodiment improved significantly as the training corpus grew: the computation method based on the optimal topic improved by 12.1%, and the computation method based on the topic distribution improved by 6.5%. As the model's training corpus grows further, the experimental results of the embodiment still have the potential to improve.
Embodiment three
Fig. 3 is a structural schematic diagram of the device for extracting keywords from a document provided by Embodiment 3 of the present invention. As shown in Fig. 3, the device specifically includes a vector training module 31, a distance calculation module 32, and a keyword extraction module 33.
The vector training module 31 is configured to obtain at least one topic vector related to the document's content and at least one word vector by training a latent topic vector model, the latent topic vector model being a fusion of a topic model and word vectors;
the distance calculation module 32 is configured to compute the distance between the word vectors and the topic vector;
the keyword extraction module 33 is configured to select, according to the distances between the word vectors and the topic vector, the words corresponding to a predetermined number of word vectors as keywords of the document.
The device for extracting keywords from a document described in this embodiment is configured to execute the method for extracting keywords from a document described in the embodiments above; its technical principle and technical effects are similar and are not described again here.
Illustratively, on the basis of the embodiments above, the distance calculation module 32 is specifically configured to:
select, according to the document's topic distribution, the topic with the largest probability from among the at least one topic as the optimal topic; and compute the distance between the word vectors and the topic vector corresponding to the optimal topic.
Illustratively, on the basis of the embodiments above, the distance calculation module 32 is specifically configured to:
weight and sum the distances between the word vectors and each topic vector according to the topic distribution probability of each topic of the document; and take the weighted sum as the distance between the word vectors and the topic vectors.
Illustratively, on the basis of the embodiments above, the distance is the cosine distance.
Illustratively, on the basis of the embodiments above, the vector training module 31 includes a vector construction unit 311, a joint likelihood function establishing unit 312, and a parameter estimation unit 313;
the vector construction unit 311 is configured to add the document to a training database and to construct an initial topic vector for each topic and an initial word vector for each word of each document in the training database;
the joint likelihood function establishing unit 312 is configured to establish the joint likelihood function of all documents in the training database from the initial topic vectors and the initial word vectors;
the parameter estimation unit 313 is configured to perform parameter estimation on the joint likelihood function to obtain the topic vectors and word vectors.
Illustratively, the joint likelihood function establishing unit 312 is specifically configured to:
obtain the generating probability of each word according to formula one:
Formula one: P(w | x_w, z) = exp(v̂_w · (x_w ⊕ v_z)) / Σ_{w′} exp(v̂_{w′} · (x_w ⊕ v_z))
where v̂_w is the auxiliary vector of the word vector v of the current word w; x_w denotes the context vector of the current word w, formed from the word vectors of the words surrounding w; v_z is the topic vector of the current topic z; ⊕ denotes the add operation; and the sum runs over the words w′ of the vocabulary;
obtain the joint likelihood function of all documents in the training database, formula two, according to formula one:
Formula two:
where α_z is the Dirichlet prior parameter for topic z; β_v is the Dirichlet prior parameter for word v; m_dz is the number of sentences in document d whose sampled topic is z; n_zv denotes the total number of times word v and topic z occur together in the training database; M denotes the set of all word vectors and topic vectors; D denotes the total number of documents d; T denotes the total number of topics in document d; and v̂ denotes the auxiliary vector of word v.
Illustratively, the vector training module 31 further includes a joint likelihood function processing unit 314;
the joint likelihood function processing unit 314 is configured to, after the joint likelihood function establishing unit 312 obtains the joint likelihood function (formula two) according to formula one, process formula two with the Gibbs sampling algorithm to obtain the conditional distribution of the topic of each sentence s in document d, formula three:
Formula three:
where k is the candidate topic; W is the total number of words in the training database; and N_iw is the number of occurrences of word w in the i-th sentence of document d;
determine a specific topic for each sentence s of document d according to the conditional probability of each topic in the conditional distribution;
and process formula one according to the conditional probabilities of the specific topics to obtain the log-likelihood function of formula four:
Formula four:
The parameter estimation unit 313 is then specifically configured to:
perform parameter estimation on the log-likelihood function to obtain the topic vectors and word vectors.
Illustratively, the vector training module 31 further includes a log-likelihood function optimization processing unit 315;
the log-likelihood function optimization processing unit 315 is configured to, after the joint likelihood function processing unit 314 obtains the log-likelihood function of formula four, optimize the parameters α and β in the log-likelihood function using Newton's iteration method;
and/or
optimize the word vectors, topic vectors, and auxiliary vectors in the log-likelihood function using the negative sampling algorithm;
the parameter estimation unit 313 is then specifically configured to:
perform parameter estimation on the optimized log-likelihood function to obtain the topic vectors and word vectors.
Illustratively, the log-likelihood function optimization processing unit 315 is specifically configured to:
process the words and topics of all documents in the training database with the negative sampling algorithm to obtain the likelihood function of formula five:
Formula five: L = Π_w Π_{u ∈ {w} ∪ NEG(w)} σ(v̂_u · (x_w ⊕ v_z))^l · (1 − σ(v̂_u · (x_w ⊕ v_z)))^(1−l)
where l is the label of the current word (l = 1 for a genuine word, l = 0 for a negative-sample word); |NEG| is the number of negative-sample words drawn per word; and |V| is the total number of words in the training database;
and process formula five with stochastic gradient descent to obtain the optimization formula of the word vectors (formula six), the optimization formula of the topic vectors (formula seven), and the optimization formula of the auxiliary vectors (formula eight):
Formula six:
Formula seven:
Formula eight:
Illustratively, the parameter estimation unit 313 is further configured to:
obtain the topic distribution of each document during the parameter estimation on the joint likelihood function, using formula nine:
Formula nine:
where K is the total number of topics z of document d.
The device for extracting keywords from a document described in the embodiments above is likewise configured to execute the method for extracting keywords from a document described in the embodiments above; its technical principle and technical effects are similar and are not described again here.
Note that the above are merely preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will appreciate that the invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments, and substitutions can be made without departing from the protection scope of the invention. Therefore, although the invention has been described in some detail through the embodiments above, it is not limited to those embodiments and may include other equivalent embodiments without departing from the inventive concept; the scope of the invention is determined by the scope of the appended claims.

Claims (18)

1. A method for extracting keywords from a document, characterized by comprising:
obtaining at least one topic vector related to the document's content and at least one word vector by training a latent topic vector model, the latent topic vector model being a fusion of a topic model and word vectors, wherein the training to obtain the at least one topic vector related to the document's content and the at least one word vector specifically comprises: adding the document to a training database; constructing an initial topic vector for each topic and an initial word vector for each word of each document in the training database; establishing a joint likelihood function of all documents in the training database from the initial topic vectors and the initial word vectors; and performing parameter estimation on the joint likelihood function to obtain the topic vectors and word vectors;
computing the distance between the word vectors and the topic vector;
selecting, according to the distances between the word vectors and the topic vector, the words corresponding to a predetermined number of word vectors as keywords of the document.
2. The method according to claim 1, characterized in that computing the distance between the word vectors and the topic vector comprises:
selecting, according to the document's topic distribution, the topic with the largest probability from among the at least one topic as the optimal topic;
computing the distance between the word vectors and the topic vector corresponding to the optimal topic.
3. The method according to claim 1, characterized in that computing the distance between the word vectors and the topic vector comprises:
weighting and summing the distances between the word vectors and each topic vector according to the topic distribution probability of each topic of the document;
taking the weighted sum as the distance between the word vectors and the topic vectors.
4. The method according to any one of claims 1-3, characterized in that the distance is the cosine distance.
5. The method according to claim 1, characterized in that establishing the joint likelihood function of all documents in the training database from the initial topic vectors and the initial word vectors comprises:
obtaining the generating probability of each initial word vector by a calculation formula;
obtaining the joint likelihood function of all documents in the training database according to the calculation formula.
6. The method according to claim 5, characterized in that, after obtaining the joint likelihood function according to the calculation formula, the method further comprises:
processing the joint likelihood function with the Gibbs sampling algorithm to obtain the conditional distribution of the topic of each sentence in each document;
determining a specific topic for each sentence of each document according to the conditional probability of each topic in the conditional distribution;
processing the joint likelihood function according to the conditional probabilities of the specific topics to obtain a log-likelihood function;
and performing parameter estimation on the joint likelihood function to obtain the topic vectors and word vectors comprises:
performing parameter estimation on the log-likelihood function to obtain the topic vectors and word vectors.
7. The method according to claim 6, characterized in that, after obtaining the log-likelihood function, the method further comprises:
optimizing the parameters in the log-likelihood function using Newton's iteration method;
and/or
optimizing the word vectors, topic vectors, and auxiliary vectors in the log-likelihood function using the negative sampling algorithm;
and performing parameter estimation on the log-likelihood function to obtain the topic vectors and word vectors comprises:
performing parameter estimation on the optimized log-likelihood function to obtain the topic vectors and word vectors.
8. The method according to claim 7, characterized in that optimizing the word vectors, topic vectors, and auxiliary vectors using the negative sampling algorithm comprises:
processing the words and topics of all documents in the training database with the negative sampling algorithm to obtain a negative-sampling likelihood function;
processing the negative-sampling likelihood function with stochastic gradient descent to obtain the optimization formulas of the word vectors, the topic vectors, and the auxiliary vectors.
9. The method according to any one of claims 5-8, characterized by further comprising:
obtaining the topic distribution of each document during the parameter estimation on the joint likelihood function.
10. A device for extracting keywords from a document, characterized by comprising:
a vector training module, configured to obtain at least one topic vector related to the document's content and at least one word vector by training a latent topic vector model, the latent topic vector model being a fusion of a topic model and word vectors, wherein the vector training module comprises: a vector construction unit, configured to add the document to a training database and to construct an initial topic vector for each topic and an initial word vector for each word of each document in the training database; a joint likelihood function establishing unit, configured to establish a joint likelihood function of all documents in the training database from the initial topic vectors and the initial word vectors; and a parameter estimation unit, configured to perform parameter estimation on the joint likelihood function to obtain the topic vectors and word vectors;
a distance calculation module, configured to compute the distance between the word vectors and the topic vector;
a keyword extraction module, configured to select, according to the distances between the word vectors and the topic vector, the words corresponding to a predetermined number of word vectors as keywords of the document.
11. The device according to claim 10, characterized in that the distance calculation module is specifically configured to:
select, according to the document's topic distribution, the topic with the largest probability from among the at least one topic as the optimal topic; and compute the distance between the word vectors and the topic vector corresponding to the optimal topic.
12. The device according to claim 10, characterized in that the distance calculation module is specifically configured to:
weight and sum the distances between the word vectors and each topic vector according to the topic distribution probability of each topic of the document; and take the weighted sum as the distance between the word vectors and the topic vectors.
13. The device according to any one of claims 10-12, characterized in that the distance is the cosine distance.
14. The device according to claim 10, characterized in that the joint likelihood function establishing unit is specifically configured to:
obtain the generating probability of each initial word vector by a calculation formula;
obtain the joint likelihood function of all documents in the training database according to the calculation formula.
15. The device according to claim 14, characterized in that the vector training module further comprises:
a joint likelihood function processing unit, configured to, after the joint likelihood function establishing unit obtains the joint likelihood function according to the calculation formula, process the joint likelihood function with the Gibbs sampling algorithm to obtain the conditional distribution of the topic of each sentence in each document;
determine a specific topic for each sentence of each document according to the conditional probability of each topic in the conditional distribution;
and process the joint likelihood function according to the conditional probabilities of the specific topics to obtain a log-likelihood function;
and the parameter estimation unit is specifically configured to:
perform parameter estimation on the log-likelihood function to obtain the topic vectors and word vectors.
16. The device according to claim 15, characterized in that the vector training module further comprises:
a log-likelihood function optimization processing unit, configured to, after the joint likelihood function processing unit obtains the log-likelihood function, optimize the parameters in the log-likelihood function using Newton's iteration method;
and/or
optimize the word vectors, topic vectors, and auxiliary vectors in the log-likelihood function using the negative sampling algorithm;
and the parameter estimation unit is specifically configured to:
perform parameter estimation on the optimized log-likelihood function to obtain the topic vectors and word vectors.
17. The device according to claim 16, characterized in that the log-likelihood function optimization processing unit is specifically configured to:
process the words and topics of all documents in the training database with the negative sampling algorithm to obtain a negative-sampling likelihood function;
process the negative-sampling likelihood function with stochastic gradient descent to obtain the optimization formulas of the word vectors, the topic vectors, and the auxiliary vectors.
18. The device according to any one of claims 14-17, characterized in that the parameter estimation unit is further configured to:
obtain the topic distribution of each document during the parameter estimation on the joint likelihood function.
CN201510512363.8A 2015-08-19 2015-08-19 Method and device for extracting keywords from a document Active CN105069143B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510512363.8A CN105069143B (en) 2015-08-19 2015-08-19 Method and device for extracting keywords from a document

Publications (2)

Publication Number Publication Date
CN105069143A CN105069143A (en) 2015-11-18
CN105069143B true CN105069143B (en) 2019-07-23

Family

ID=54498512

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510512363.8A Active CN105069143B (en) 2015-08-19 2015-08-19 Method and device for extracting keywords from a document

Country Status (1)

Country Link
CN (1) CN105069143B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740354B (en) * 2016-01-26 2018-11-30 中国人民解放军国防科学技术大学 Method and device for adaptive latent Dirichlet model selection
CN106021272B (en) * 2016-04-04 2019-11-19 上海大学 Keyword extraction method based on distributed-representation word-vector computation
CN106407316B (en) * 2016-08-30 2020-05-15 北京航空航天大学 Software question and answer recommendation method and device based on topic model
CN108399180B (en) * 2017-02-08 2021-11-26 腾讯科技(深圳)有限公司 Knowledge graph construction method and device and server
CN107220232B (en) * 2017-04-06 2021-06-11 北京百度网讯科技有限公司 Keyword extraction method and device based on artificial intelligence, equipment and readable medium
CN109815474B (en) * 2017-11-20 2022-09-23 深圳市腾讯计算机系统有限公司 Word sequence vector determination method, device, server and storage medium
CN108829822B (en) * 2018-06-12 2023-10-27 腾讯科技(深圳)有限公司 Media content recommendation method and device, storage medium and electronic device
CN108984526B (en) * 2018-07-10 2021-05-07 北京理工大学 Document theme vector extraction method based on deep learning
CN109446516B (en) * 2018-09-28 2022-11-11 北京赛博贝斯数据科技有限责任公司 Data processing method and system based on theme recommendation model
CN109299465A (en) * 2018-10-17 2019-02-01 北京京航计算通讯研究所 The identifying system of file keyword accuracy is promoted based on many algorithms
CN109408641B (en) * 2018-11-22 2020-06-02 山东工商学院 Text classification method and system based on supervised topic model
CN110263122B (en) * 2019-05-08 2022-05-17 北京奇艺世纪科技有限公司 Keyword acquisition method and device and computer readable storage medium
CN110134957B (en) * 2019-05-14 2023-06-13 云南电网有限责任公司电力科学研究院 Scientific and technological achievement warehousing method and system based on semantic analysis

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009104296A (en) * 2007-10-22 2009-05-14 Nippon Telegr & Teleph Corp <Ntt> Related keyword extraction method, device, program, and computer readable recording medium
CN102081660A (en) * 2011-01-13 2011-06-01 西北工业大学 Method for searching and sequencing keywords of XML documents based on semantic correlation
US8768960B2 (en) * 2009-01-20 2014-07-01 Microsoft Corporation Enhancing keyword advertising using online encyclopedia semantics
CN104008090A (en) * 2014-04-29 2014-08-27 河海大学 Multi-subject extraction method based on concept vector model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9384287B2 (en) * 2014-01-15 2016-07-05 Sap Portals Isreal Ltd. Methods, apparatus, systems and computer readable media for use in keyword extraction

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009104296A (en) * 2007-10-22 2009-05-14 Nippon Telegr & Teleph Corp <Ntt> Related keyword extraction method, device, program, and computer readable recording medium
US8768960B2 (en) * 2009-01-20 2014-07-01 Microsoft Corporation Enhancing keyword advertising using online encyclopedia semantics
CN102081660A (en) * 2011-01-13 2011-06-01 西北工业大学 Method for searching and sequencing keywords of XML documents based on semantic correlation
CN104008090A (en) * 2014-04-29 2014-08-27 河海大学 Multi-subject extraction method based on concept vector model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘知远 (Liu Zhiyuan), "Research on keyword extraction methods based on document topic structure" (基于文档主题结构的关键词抽取方法研究), www.thunlp.org, 2011-12-31, pp. 18-21

Also Published As

Publication number Publication date
CN105069143A (en) 2015-11-18

Similar Documents

Publication Publication Date Title
CN105069143B (en) Extract the method and device of keyword in document
CN109408526B (en) SQL sentence generation method, device, computer equipment and storage medium
CN105426354B (en) Vector fusion method and device
CN109145290A (en) Based on word vector with from the semantic similarity calculation method of attention mechanism
Vig et al. Exploring neural models for query-focused summarization
CN106227714A (en) Method and apparatus for obtaining keywords for poem generation based on artificial intelligence
CN109740158A (en) A kind of text semantic analysis method and device
CN112016320A (en) English punctuation adding method, system and equipment based on data enhancement
CN104915399A (en) Recommended data processing method based on news headline and recommended data processing method system based on news headline
CN108363688A (en) A kind of name entity link method of fusion prior information
CN112860896A (en) Corpus generalization method and man-machine conversation emotion analysis method for industrial field
KR20160112248A (en) Latent keyphrase generation method and apparatus
CN107341152B (en) Parameter input method and device
CN107122378B (en) Object processing method and device and mobile terminal
CN110929022A (en) Text abstract generation method and system
CN107665222B (en) Keyword expansion method and device
CN110309513B (en) Text dependency analysis method and device
CN107291686B (en) Method and system for identifying emotion identification
Hong et al. Comprehensive technology function product matrix for intelligent chatbot patent mining
CN111566665B (en) Apparatus and method for applying image coding recognition in natural language processing
CN109727591B (en) Voice search method and device
CN109241993B (en) Evaluation object emotion classification method and device integrating user and overall evaluation information
CN112069816A (en) Chinese punctuation adding method, system and equipment
CN110610001A (en) Short text integrity identification method and device, storage medium and computer equipment
CN110619866A (en) Speech synthesis method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant