CN105069143B - Method and device for extracting keywords from a document - Google Patents

Method and device for extracting keywords from a document

Publication number: CN105069143B
Authority: CN (China)
Prior art keywords: vector, topic, document, likelihood function, word vector
Legal status: Active (granted)
Application number: CN201510512363.8A
Original language: Chinese (zh)
Other versions: CN105069143A
Inventors: 姜迪 (Jiang Di), 石磊 (Shi Lei), 林鸿宇 (Lin Hongyu)
Assignee (original and current): Beijing Baidu Netcom Science and Technology Co., Ltd.
Application filed 2015-08-19 by Beijing Baidu Netcom Science and Technology Co., Ltd.
Priority: CN201510512363.8A
Publication of CN105069143A: 2015-11-18
Publication of CN105069143B (grant): 2019-07-23

Classifications

    • G (Physics) > G06 (Computing; calculating or counting) > G06F (Electric digital data processing) > G06F16/00 (Information retrieval; database structures therefor; file system structures therefor) > G06F16/30 (of unstructured textual data) > G06F16/36 (Creation of semantic tools, e.g. ontology or thesauri)
    • G (Physics) > G06 > G06F > G06F16/00 > G06F16/30 > G06F16/31 (Indexing; data structures therefor; storage structures) > G06F16/313 (Selection or weighting of terms for indexing)

Abstract

The invention discloses a method and device for extracting keywords from a document. The method comprises: obtaining at least one topic vector related to the document's content and at least one word vector by training a latent topic vector model, the latent topic vector model being a fusion of a topic model and word vectors; computing the distance between the word vectors and the topic vector; and selecting, according to the distances between the word vectors and the topic vector, the words corresponding to a predetermined number of word vectors as keywords of the document. Embodiments of the invention can extract keyword information that accurately expresses the document's content.

Description

Method and device for extracting keywords from a document
Technical field
Embodiments of the present invention relate to the field of information technology, and in particular to a method and device for extracting keywords from a document.
Background art
In the current era of information explosion, a user cannot browse every document that may contain relevant information. Extracting the keywords of a document therefore gives users a reference, and is of great significance for helping users obtain information accurately and for reducing the cost of obtaining it.
In general, the keywords of a document are necessarily words highly relevant to the document's topics, so a document's topic information is of great significance for its keyword extraction. At present, the problem is mainly addressed using the probability distributions of a latent Dirichlet allocation (LDA) model, chiefly by the following two methods:
The first is a likelihood-based method: an LDA model is used to obtain the topic distribution P(z|d) of the document and the word distribution P(w|z) of each topic, and the distribution of words in the document is computed as P(w|d) = Σ_z P(z|d)·P(w|z), where z denotes a topic, d a document, and w a word. P(w|d) is taken as the importance score of word w in document d, and the K highest-scoring words are selected as the document's keywords.
The second is a method based on the distance between latent-variable distributions: an LDA model is used to obtain the topic distribution P(z|d) of the document and the topic distribution P(z|w) of each word. The cosine distance between the two distributions is then computed, and the K words with the largest cosine distance are selected as the document's keywords.
However, the above keyword-extraction methods have shortcomings. The first method is severely biased toward high-frequency words: the extracted words are largely the high-frequency words of some topic, yet such words do not necessarily occur widely in the given document and may fail to reflect the information the document actually expresses.
The second method needs the latent-variable distribution P(z), since it computes P(z|w) ∝ P(w|z)·P(z), but this distribution is not a parameter of the LDA model. It is usually obtained as P(z) = Σ_d P(z|d)·P(d), where P(d) is the posterior distribution over documents; assuming P(d) to be uniform gives P(z) ∝ Σ_d P(z|d). But since P(d) is in fact not uniform across documents, the theoretical basis of this model is not solid, and its performance in practice is also poor.
Summary of the invention
Embodiments of the present invention provide a method and device for extracting keywords from a document, capable of extracting keyword information that accurately expresses the document's content.
In a first aspect, an embodiment of the invention provides a method for extracting keywords from a document, comprising:
obtaining at least one topic vector related to the document's content and at least one word vector by training a latent topic vector model, the latent topic vector model being a fusion of a topic model and word vectors;
computing the distance between the word vectors and the topic vector;
selecting, according to the distances between the word vectors and the topic vector, the words corresponding to a predetermined number of word vectors as keywords of the document.
In a second aspect, an embodiment of the invention also provides a device for extracting keywords from a document, comprising:
a vector training module, configured to obtain at least one topic vector related to the document's content and at least one word vector by training a latent topic vector model, the latent topic vector model being a fusion of a topic model and word vectors;
a distance calculation module, configured to compute the distance between the word vectors and the topic vector;
a keyword extraction module, configured to select, according to the distances between the word vectors and the topic vector, the words corresponding to a predetermined number of word vectors as keywords of the document.
In the embodiments of the invention, the document is trained with the latent topic vector model obtained by fusing a topic model and word vectors, at least one topic vector related to the document's content and at least one word vector are obtained, and the words corresponding to a predetermined number of word vectors are then selected as the document's keywords according to the distances between the word vectors and the topic vector. Because the embodiments train on the document with the latent topic vector model, more document information can be captured during training, so the extracted keyword information accurately expresses the document's content.
Detailed description of the invention
Fig. 1 is a flow diagram of the method for extracting keywords from a document provided by Embodiment 1 of the present invention;
Fig. 2 is a flow diagram of the method for extracting keywords from a document provided by Embodiment 2 of the present invention;
Fig. 3 is a structural schematic diagram of the device for extracting keywords from a document provided by Embodiment 3 of the present invention.
Specific embodiment
The present invention is described in further detail below with reference to the drawings and embodiments. It will be understood that the specific embodiments described here are used only to explain the invention, not to limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the invention rather than the entire structure.
The method for extracting keywords from a document provided by the embodiments of the invention may be executed by the device for extracting keywords from a document provided by the embodiments of the invention, or by a terminal device (for example, a smartphone or a tablet computer) in which that device is integrated; the device may be implemented in hardware or software.
Embodiment one
Fig. 1 is a flow diagram of the method for extracting keywords from a document provided by Embodiment 1 of the present invention. As shown in Fig. 1, the method specifically includes:
S11. Obtaining at least one topic vector related to the document's content and at least one word vector by training a latent topic vector model, the latent topic vector model being a fusion of a topic model and word vectors;
Here, topic models and word vectors (word embeddings) are both common semantic representation methods in the prior art. A topic model assumes that each word is generated by a semantic element in a latent space; under this assumption, documents and words can be mapped into the latent semantic space for dimensionality reduction. Word vectors are another, distributed representation of words, expressing the meaning of a word with a vector of fixed length.
A topic model is usually built at the document or sentence level and focuses more on global semantics, whereas word vectors generally assume that the meaning of a word is expressed by the words around it and focus more on local, syntax-like information. The two methods have different emphases, and each has been shown to have great application value. This embodiment therefore combines the two, so that the latent topic vector model can capture more information.
The dimensions of the topic vectors and word vectors can be set as desired, and the value of each element of a vector is obtained by training the latent topic vector model. To make the training results more accurate, the latent topic vector model also includes a training database, which contains a large amount of document data.
S12. Computing the distance between the word vectors and the topic vector;
The purpose of training to obtain the word vectors and topic vectors is to measure the importance of each word in the document and to rank the words by importance, so as to select the most important words as the document's keywords.
In this embodiment, the importance of a word in the document is measured by computing the distance between its word vector and the topic vector; specifically, this includes computing the Euclidean distance, cosine distance, or sine distance between the word vector and the topic vector. The criterion for importance depends on the distance computed: with the Euclidean distance or the sine distance, the smaller the distance, the more important the word is in the document and the better it reflects the topics the document expresses; with the cosine distance, the larger the value, the more important the word is in the document.
S13. Selecting, according to the distances between the word vectors and the topic vector, the words corresponding to a predetermined number of word vectors as keywords of the document.
The predetermined number can be set according to the actual situation and is not specifically limited here.
From the result of step S12, the predetermined number of word vectors that are most important in the document can be determined, and the words corresponding to those word vectors are taken as the document's keywords.
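As a minimal illustration of steps S12 and S13 (not the patent's reference implementation), the distance computation and top-K selection can be sketched as follows, assuming the trained vectors are available as NumPy arrays; the function names and the choice of k = 3 are ours:

```python
import numpy as np

def cosine_similarity(a, b):
    # Larger value = directions closer = word more important (see S12 above).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_keywords(word_vectors, topic_vector, k=3):
    """Rank words by cosine similarity to one topic vector and keep the top k.

    word_vectors: dict mapping word -> np.ndarray (trained word vectors)
    topic_vector: np.ndarray (one trained topic vector)
    """
    scores = {w: cosine_similarity(v, topic_vector)
              for w, v in word_vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```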
In this embodiment, the document is trained with the latent topic vector model obtained by fusing a topic model and word vectors, at least one topic vector related to the document's content and at least one word vector are obtained, and the words corresponding to a predetermined number of word vectors are then selected as the document's keywords according to the distances between the word vectors and the topic vector. Because the document is trained with the latent topic vector model, more document information can be captured during training, so the extracted keyword information accurately expresses the document's content.
Illustratively, to improve the accuracy of keyword extraction, embodiments of the invention provide the following two methods of computing the distance between the word vectors and the topic vectors. The first is a computation method based on the optimal topic, mainly comprising the following steps:
selecting, according to the document's topic distribution, the topic with the largest probability from among the at least one topic as the optimal topic;
computing the distance between the word vectors and the topic vector corresponding to the optimal topic.
Specifically, in the latent topic vector model, the topic distribution P(z|d) of a given document can be obtained by training; it contains the probability of each topic for that document, and the topic z with the largest probability, namely the optimal topic, expresses the document's core content. It will be understood that the most important words in the document are exactly those whose vector representations are nearest, in the vector space, to the vector of topic z. Therefore, the topic with the largest probability is selected from the topic distribution as the optimal topic, the distance between each word vector and the topic vector corresponding to the optimal topic is computed, and the words corresponding to a predetermined number of word vectors are selected as the document's keywords according to those distances.
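A sketch of this first method, reusing the helpers above; the argmax over P(z|d) picks the optimal topic, and the names remain illustrative:

```python
def optimal_topic_keywords(word_vectors, topic_vectors, p_z_given_d, k=3):
    """Method 1: score words against the single most probable topic.

    topic_vectors: np.ndarray of shape (T, dim), one row per topic
    p_z_given_d:   np.ndarray of shape (T,), the document's topic distribution P(z|d)
    """
    z_star = int(np.argmax(p_z_given_d))  # the optimal topic
    return top_k_keywords(word_vectors, topic_vectors[z_star], k=k)
```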
The second is a computation method based on the topic distribution, mainly comprising the following steps:
weighting and summing the distances between the word vectors and each topic vector according to the topic distribution probability of each topic of the document;
taking the weighted sum as the distance between the word vectors and the topic vectors.
Specifically, considering that more than one topic may play an important role in a document, and that the optimal-topic method above may lose part of that information, the distances to the different topics can be weighted by P(z|d), giving a new measure of the form
Score_Distr(w) = Σ_z P(z|d) · L(w, z)
where Score_Distr(w) is the weighted sum and L(w, z) is the distance between the word vector of w and the vector of topic z.
This measure is a word-importance score obtained by weighting with the document's topic distribution. The words are ranked by the Score_Distr(w) obtained as above, and the words corresponding to a predetermined number of word vectors are selected as the document's keywords.
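The second method can be sketched under the same assumptions; here cosine similarity stands in for the distance measure L, which is our choice, since the embodiment allows several measures:

```python
def topic_distribution_keywords(word_vectors, topic_vectors, p_z_given_d, k=3):
    """Method 2: Score_Distr(w) = sum over z of P(z|d) * L(w, z)."""
    scores = {
        w: sum(p * cosine_similarity(v, topic_vectors[z])
               for z, p in enumerate(p_z_given_d))
        for w, v in word_vectors.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]
```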
Illustratively, embodiments of the invention also provide a concrete implementation of obtaining at least one topic vector related to the document's content and at least one word vector by training the latent topic vector model, mainly comprising the following steps:
adding the document to a training database, and constructing an initial topic vector for each topic and an initial word vector for each word of each document in the training database;
establishing the joint likelihood function of all documents in the training database from the initial topic vectors and the initial word vectors;
performing parameter estimation on the joint likelihood function to obtain the topic vectors and word vectors.
Here, the training database can be obtained from the internet (for example, the Sina corpus database); it contains documents of various types. The initial topic vectors and initial word vectors can be set as desired.
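Reading the three steps above together with the Gibbs sampling and the negative-sampling optimization described below, the training procedure can be summarized by the following skeleton; the two step functions are deliberate placeholders for formula three and formulas five to eight, and nothing here is an API defined by the patent:

```python
def sample_sentence_topics(docs, topic_vecs, word_vecs):
    """Placeholder for the Gibbs sampling step (formula three)."""
    raise NotImplementedError

def update_vectors_by_negative_sampling(docs, word_vecs, topic_vecs, aux_vecs):
    """Placeholder for the SGD step with negative sampling (formulas five to eight)."""
    raise NotImplementedError

def train_latent_topic_vector_model(docs, num_topics, dim, iterations=100):
    """Skeleton: alternate topic sampling and vector updates until convergence."""
    rng = np.random.default_rng(0)
    vocab = sorted({w for doc in docs for w in doc})           # docs: lists of words
    word_vecs = {w: rng.normal(0, 0.01, dim) for w in vocab}   # initial word vectors
    aux_vecs = {w: rng.normal(0, 0.01, dim) for w in vocab}    # auxiliary vectors
    topic_vecs = rng.normal(0, 0.01, (num_topics, dim))        # initial topic vectors
    for _ in range(iterations):
        sample_sentence_topics(docs, topic_vecs, word_vecs)
        update_vectors_by_negative_sampling(docs, word_vecs, topic_vecs, aux_vecs)
    return topic_vecs, word_vecs
```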
Illustratively, establishing the joint likelihood function from the initial topic vectors and the initial word vectors comprises:
obtaining the generating probability of each word according to formula one:
Formula one: P(w | x_w, z) = exp(v̂_w · (x_w ⊕ v_z)) / Σ_{w′} exp(v̂_{w′} · (x_w ⊕ v_z))
where v̂_w is the auxiliary vector of the word vector v of the current word w; x_w denotes the context vector of the current word w, formed from the word vectors of the words surrounding w; v_z is the topic vector of the current topic z; ⊕ denotes the add operation; and the sum runs over the words w′ of the vocabulary;
obtaining the joint likelihood function of all documents in the training database, formula two, according to formula one:
Formula two:
where α_z is the Dirichlet prior parameter for topic z; β_v is the Dirichlet prior parameter for word v; m_dz is the number of sentences in document d whose sampled topic is z; n_zv denotes the total number of times word v and topic z occur together in the training database; M denotes the set of all word vectors and topic vectors; D denotes the total number of documents d; T denotes the total number of topics in document d; and v̂ denotes the auxiliary vector of word v.
Illustratively, to further optimize the joint likelihood function above, the following steps are performed after the joint likelihood function (formula two) is obtained according to formula one:
processing formula two with the Gibbs sampling algorithm to obtain the conditional distribution of the topic of each sentence s in document d, formula three:
Formula three:
where k is the candidate topic; W is the total number of words in the training database; and N_iw is the number of occurrences of word w in the i-th sentence of document d;
determining a specific topic for each sentence s of document d according to the conditional probability of each topic in the conditional distribution;
processing formula one according to the conditional probabilities of the specific topics to obtain the log-likelihood function of formula four:
Formula four:
Performing parameter estimation on the joint likelihood function to obtain the topic vectors and word vectors then comprises:
performing parameter estimation on the log-likelihood function to obtain the topic vectors and word vectors.
Illustratively, the log-likelihood function obtained above can be optimized further, specifically by the following steps:
optimizing the parameters α and β in the log-likelihood function using Newton's iteration method;
and/or
optimizing the word vectors, topic vectors, and auxiliary vectors in the log-likelihood function using the negative sampling algorithm;
correspondingly, performing parameter estimation on the log-likelihood function to obtain the topic vectors and word vectors comprises:
performing parameter estimation on the optimized log-likelihood function to obtain the topic vectors and word vectors.
Illustratively, optimizing the word vectors, topic vectors, and auxiliary vectors using the negative sampling algorithm comprises:
processing the words and topics of all documents in the training database with the negative sampling algorithm to obtain the likelihood function of formula five:
Formula five: L = Π_w Π_{u ∈ {w} ∪ NEG(w)} σ(v̂_u · (x_w ⊕ v_z))^l · (1 − σ(v̂_u · (x_w ⊕ v_z)))^(1−l)
where l is the label of the current word: if the current word is a genuine word, l = 1; if the current word is a negative-sample word, l = 0; |NEG| is the number of negative-sample words drawn per word; and |V| is the total number of words in the training database;
processing formula five with stochastic gradient descent gives the optimization formula of the word vectors (formula six), the optimization formula of the topic vectors (formula seven), and the optimization formula of the auxiliary vectors (formula eight):
Formula six:
Formula seven:
Formula eight:
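Since formulas five to eight are not reproduced above, the following single update step is only our reading of the standard negative-sampling objective that the definitions of l, |NEG|, and the three update formulas suggest; the uniform negative sampler and the fixed learning rate are simplifications of our own:

```python
import random

def negative_sampling_sgd_step(w, ctx_words, z, word_vecs, topic_vecs,
                               aux_vecs, vocab, lr=0.025, num_neg=5):
    """One SGD update for (current word w, sentence topic z)."""
    h = sum(word_vecs[c] for c in ctx_words) + topic_vecs[z]  # x_w (+) v_z
    grad_h = np.zeros_like(h)
    # One positive example (l = 1) plus |NEG| negative samples (l = 0):
    pairs = [(w, 1.0)] + [(random.choice(vocab), 0.0) for _ in range(num_neg)]
    for u, label in pairs:
        sigma = 1.0 / (1.0 + np.exp(-np.dot(aux_vecs[u], h)))
        g = lr * (label - sigma)
        grad_h += g * aux_vecs[u]
        aux_vecs[u] = aux_vecs[u] + g * h   # auxiliary-vector update (cf. formula eight)
    topic_vecs[z] = topic_vecs[z] + grad_h  # topic-vector update (cf. formula seven)
    for c in ctx_words:                     # word-vector updates (cf. formula six)
        word_vecs[c] = word_vecs[c] + grad_h
```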
For the training corpus in the training database, the latent topic vector model provided by the embodiments of the invention yields a vectorized representation of each topic, while the prior-art topic model learns a multinomial distribution over the words of each topic. The embodiment of the invention compared, for each topic, the 10 most probable words of the word multinomial with the 10 word vectors closest to that topic's vector; the results are shown in Table one below:
Table one
As can be seen from Table one, the multinomial distribution of the topic model is clearly skewed toward high-frequency words, while the association that the traditional topic distribution establishes between medium- and low-frequency words and the topics is weak. When the multinomial distribution is used for keyword extraction, the topic model therefore naturally favors high-frequency words, leading to poor keyword-extraction results. The vectorized representation of the latent topic model eliminates this problem: as the table shows, the words nearest a topic vector are usually the words that carry concrete meaning under that topic, which is why a topic-vector model obtains better results in the keyword-extraction task.
Therefore, the embodiments above likewise train on documents with the latent topic vector model obtained by fusing a topic model and word vectors, obtain at least one topic vector related to the document's content and at least one word vector, and then select the words corresponding to a predetermined number of word vectors as the document's keywords according to the distances between the word vectors and the topic vector. Because the document is trained with the latent topic vector model, more document information can be captured during training, so the extracted keyword information accurately expresses the document's content.
Embodiment two
Fig. 2 is a flow diagram of the method for extracting keywords from a document provided by Embodiment 2 of the present invention. As shown in Fig. 2, the method specifically includes:
S21. Adding the document to be processed to a training database, and constructing an initial topic vector for each topic and an initial word vector for each word of each document in the training database;
S22. Obtaining the generating probability of each word according to formula one:
Formula one: P(w | x_w, z) = exp(v̂_w · (x_w ⊕ v_z)) / Σ_{w′} exp(v̂_{w′} · (x_w ⊕ v_z))
where v̂_w is the auxiliary vector of the word vector v of the current word w; x_w denotes the context vector of the current word w, formed from the word vectors of the words surrounding w; v_z is the topic vector of the current topic z; ⊕ denotes the add operation; and the sum runs over the words w′ of the vocabulary;
S23. Obtaining the joint likelihood function of all documents in the training database, formula two, according to formula one:
Formula two:
where α_z is the Dirichlet prior parameter for topic z; β_v is the Dirichlet prior parameter for word v; m_dz is the number of sentences in document d whose sampled topic is z; n_zv denotes the total number of times word v and topic z occur together in the training database; M denotes the set of all word vectors and topic vectors; D denotes the total number of documents d; T denotes the total number of topics in document d; and v̂ denotes the auxiliary vector of word v.
S24. Processing formula two with the Gibbs sampling algorithm to obtain the conditional distribution of the topic of each sentence s in document d, formula three:
Formula three:
where k is the candidate topic; W is the total number of words in the training database; and N_iw is the number of occurrences of word w in the i-th sentence of document d;
S25. Determining a specific topic for each sentence s of document d according to the conditional probability of each topic in the conditional distribution;
S26. Processing formula one according to the conditional probabilities of the specific topics to obtain the log-likelihood function of formula four:
Formula four:
S27. Optimizing the parameters α and β in the log-likelihood function using Newton's iteration method, and optimizing the word vectors, topic vectors, and auxiliary vectors in the log-likelihood function using the negative sampling algorithm;
S28. Performing parameter estimation on the optimized log-likelihood function to obtain the topic vectors and word vectors of the document to be processed.
S29. Computing the cosine distance between the word vectors and the topic vector;
S210. Selecting, according to the cosine distances between the word vectors and the topic vector, the words corresponding to a predetermined number of word vectors as keywords of the document to be processed.
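On top of the training outputs, steps S29 and S210 reduce to the weighted scoring sketched earlier; the following usage-level sketch assumes steps S21 to S28 have produced the vectors and the document's topic distribution P(z|d) (for example via formula nine during parameter estimation):

```python
def extract_keywords(doc_words, topic_vecs, word_vecs, p_z_given_d, k=3):
    """doc_words: the document to be processed, as a list of words."""
    doc_vecs = {w: word_vecs[w] for w in set(doc_words) if w in word_vecs}
    return topic_distribution_keywords(doc_vecs, topic_vecs, p_z_given_d, k=k)
```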
To verify the effectiveness of the embodiments of the invention, the inventors ran several groups of comparative experiments on experimental data sets of different scales; the results exceeded the best results of the traditional topic-model-based methods.
First group of experiments: small-scale data
Experimental goal: from all the words in a document, pick out the keywords that best express the document's meaning.
Training database: the development, training, and test sets of the Sina corpus, 32,000 documents in total.
Test data: the test set of the Sina corpus, with the corresponding reference keywords for each document in the test set; 1,000 documents in total.
Evaluation method: for each document, each model generates 3 keywords. Precision and recall are used to assess the experimental results. Precision is the percentage of the keywords predicted by a model that are correct; recall is the percentage of the reference keywords that the model predicts correctly. Micro-averaging is used as the evaluation metric: precision and recall are computed separately for each document and then averaged.
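The evaluation protocol as stated, per-document precision and recall followed by averaging, can be expressed as follows; predictions and references are assumed to be one set of keywords per document:

```python
def averaged_precision_recall(predictions, references):
    """Precision and recall computed per document, then averaged over documents."""
    precisions, recalls = [], []
    for pred, ref in zip(predictions, references):
        hits = len(pred & ref)                      # correctly predicted keywords
        precisions.append(hits / len(pred) if pred else 0.0)
        recalls.append(hits / len(ref) if ref else 0.0)
    n = len(precisions)
    return sum(precisions) / n, sum(recalls) / n
```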
Experimental setup: the cases with and without stop-word removal from the training corpus were both considered, and the latent topic vector model used in the embodiment of the invention was compared with the likelihood-based methods of several variants of LDA and Sentence LDA. For LDA and for latent Dirichlet allocation at the sentence level (Sentence LDA, sLDA), Σ_z P(z|d)·P(w|z) was computed as the score of each word in the current document, and the 3 words with the largest values were taken as keywords. In all the methods above, the embodiment of the invention removed all single-character words from the corpus. The experimental results are shown in Table two below:
Table two
Analysis of experimental results: in the results above, it can be seen that the method of the embodiment of the invention achieved the best experimental results whether or not stop words were removed. In the stop-word-removal experiment, the latent topic vector model provided by the embodiment improved on the result of the LDA model by 20.9%. At the same time, whether or not stop words were removed did not affect the final result of the latent topic vector model, which shows that the model has a degree of resistance to noise. In addition, the experimental results of the computation method based on the topic distribution were better than those of the computation method based on the optimal topic, which shows that taking more topic information into account when generating the final keywords helps the final experimental results.
Second group of experiments: large-scale data
Experimental goal: from all the words in a document, pick out the keywords that best express the document's meaning.
Training data: the development, training, and test sets of the Sina corpus together with corpus data from the news domain, 261,173 documents in total.
Test data: the test set of the Sina corpus, with the corresponding reference keywords for each document in the test set; 1,000 documents in total.
Evaluation method: for each document, each model generates 3 keywords. Precision and recall are used to assess the experimental results. Precision is the percentage of the keywords predicted by a model that are correct; recall is the percentage of the reference keywords that the model predicts correctly. Micro-averaging is used as the evaluation metric: precision and recall are computed separately for each document and then averaged.
Experimental setup: the cases with and without stop-word removal from the training corpus were both considered, and the latent topic vector model used in the embodiment of the invention was compared with the likelihood-function-based methods of several variants of LDA and Sentence LDA. In the PL method of LDA and Sentence LDA, Σ_z P(z|d)·P(w|z) was computed as the score of each word in the current document, and the 3 words with the largest values were taken as keywords. The method of the embodiment of the invention was also compared with the LDA method based on the distance between latent-variable distributions: from the document's topic distribution P(z|d) and each word's topic distribution P(z|w), the cosine distance of the two distributions is computed, the words are ranked by distance, and the 3 words whose topic distributions are closest to the document's topic distribution are taken as the document's keywords. In all the methods above, single-character words were removed from the corpus. The experimental results are shown in Table three below:
Table three
Analysis of experimental results: in the results above, it can be seen that the method of the embodiment of the invention still achieved the best experimental results; the conclusions obtained on the small-scale corpus continue to hold on the large-scale corpus. Meanwhile, the methods based on LDA and Sentence LDA showed no significant improvement after the large-scale training corpus was added, whereas the experimental results of the method of the embodiment improved significantly as the training corpus grew: the computation method based on the optimal topic improved by 12.1%, and the computation method based on the topic distribution improved by 6.5%. As the model's training corpus grows further, the experimental results of the embodiment still have the potential to improve.
Embodiment three
Fig. 3 is a structural schematic diagram of the device for extracting keywords from a document provided by Embodiment 3 of the present invention. As shown in Fig. 3, the device specifically includes a vector training module 31, a distance calculation module 32, and a keyword extraction module 33.
The vector training module 31 is configured to obtain at least one topic vector related to the document's content and at least one word vector by training a latent topic vector model, the latent topic vector model being a fusion of a topic model and word vectors;
the distance calculation module 32 is configured to compute the distance between the word vectors and the topic vector;
the keyword extraction module 33 is configured to select, according to the distances between the word vectors and the topic vector, the words corresponding to a predetermined number of word vectors as keywords of the document.
The device for extracting keywords from a document described in this embodiment is configured to execute the method for extracting keywords from a document described in the embodiments above; its technical principle and technical effects are similar and are not described again here.
Illustratively, on the basis of the embodiments above, the distance calculation module 32 is specifically configured to:
select, according to the document's topic distribution, the topic with the largest probability from among the at least one topic as the optimal topic; and compute the distance between the word vectors and the topic vector corresponding to the optimal topic.
Illustratively, on the basis of the embodiments above, the distance calculation module 32 is specifically configured to:
weight and sum the distances between the word vectors and each topic vector according to the topic distribution probability of each topic of the document; and take the weighted sum as the distance between the word vectors and the topic vectors.
Illustratively, on the basis of the embodiments above, the distance is the cosine distance.
Illustratively, on the basis of the embodiments above, the vector training module 31 includes a vector construction unit 311, a joint likelihood function establishing unit 312, and a parameter estimation unit 313;
the vector construction unit 311 is configured to add the document to a training database and to construct an initial topic vector for each topic and an initial word vector for each word of each document in the training database;
the joint likelihood function establishing unit 312 is configured to establish the joint likelihood function of all documents in the training database from the initial topic vectors and the initial word vectors;
the parameter estimation unit 313 is configured to perform parameter estimation on the joint likelihood function to obtain the topic vectors and word vectors.
Illustratively, the joint likelihood function establishing unit 312 is specifically configured to:
obtain the generating probability of each word according to formula one:
Formula one: P(w | x_w, z) = exp(v̂_w · (x_w ⊕ v_z)) / Σ_{w′} exp(v̂_{w′} · (x_w ⊕ v_z))
where v̂_w is the auxiliary vector of the word vector v of the current word w; x_w denotes the context vector of the current word w, formed from the word vectors of the words surrounding w; v_z is the topic vector of the current topic z; ⊕ denotes the add operation; and the sum runs over the words w′ of the vocabulary;
obtain the joint likelihood function of all documents in the training database, formula two, according to formula one:
Formula two:
where α_z is the Dirichlet prior parameter for topic z; β_v is the Dirichlet prior parameter for word v; m_dz is the number of sentences in document d whose sampled topic is z; n_zv denotes the total number of times word v and topic z occur together in the training database; M denotes the set of all word vectors and topic vectors; D denotes the total number of documents d; T denotes the total number of topics in document d; and v̂ denotes the auxiliary vector of word v.
Illustratively, the vector training module 31 further includes a joint likelihood function processing unit 314;
the joint likelihood function processing unit 314 is configured to, after the joint likelihood function establishing unit 312 obtains the joint likelihood function (formula two) according to formula one, process formula two with the Gibbs sampling algorithm to obtain the conditional distribution of the topic of each sentence s in document d, formula three:
Formula three:
where k is the candidate topic; W is the total number of words in the training database; and N_iw is the number of occurrences of word w in the i-th sentence of document d;
determine a specific topic for each sentence s of document d according to the conditional probability of each topic in the conditional distribution;
and process formula one according to the conditional probabilities of the specific topics to obtain the log-likelihood function of formula four:
Formula four:
The parameter estimation unit 313 is then specifically configured to:
perform parameter estimation on the log-likelihood function to obtain the topic vectors and word vectors.
Illustratively, the vector training module 31 further includes a log-likelihood function optimization processing unit 315;
the log-likelihood function optimization processing unit 315 is configured to, after the joint likelihood function processing unit 314 obtains the log-likelihood function of formula four, optimize the parameters α and β in the log-likelihood function using Newton's iteration method;
and/or
optimize the word vectors, topic vectors, and auxiliary vectors in the log-likelihood function using the negative sampling algorithm;
the parameter estimation unit 313 is then specifically configured to:
perform parameter estimation on the optimized log-likelihood function to obtain the topic vectors and word vectors.
Illustratively, the log-likelihood function optimization processing unit 315 is specifically configured to:
process the words and topics of all documents in the training database with the negative sampling algorithm to obtain the likelihood function of formula five:
Formula five: L = Π_w Π_{u ∈ {w} ∪ NEG(w)} σ(v̂_u · (x_w ⊕ v_z))^l · (1 − σ(v̂_u · (x_w ⊕ v_z)))^(1−l)
where l is the label of the current word (l = 1 for a genuine word, l = 0 for a negative-sample word); |NEG| is the number of negative-sample words drawn per word; and |V| is the total number of words in the training database;
and process formula five with stochastic gradient descent to obtain the optimization formula of the word vectors (formula six), the optimization formula of the topic vectors (formula seven), and the optimization formula of the auxiliary vectors (formula eight):
Formula six:
Formula seven:
Formula eight:
Illustratively, the parameter estimation unit 313 is further configured to:
obtain the topic distribution of each document during the parameter estimation on the joint likelihood function, using formula nine:
Formula nine:
where K is the total number of topics z of document d.
The device for extracting keywords from a document described in the embodiments above is likewise configured to execute the method for extracting keywords from a document described in the embodiments above; its technical principle and technical effects are similar and are not described again here.
Note that the above are merely preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will appreciate that the invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments, and substitutions can be made without departing from the protection scope of the invention. Therefore, although the invention has been described in some detail through the embodiments above, it is not limited to those embodiments and may include other equivalent embodiments without departing from the inventive concept; the scope of the invention is determined by the scope of the appended claims.

Claims (18)

1. A method for extracting keywords from a document, characterized by comprising:
obtaining at least one topic vector related to the document's content and at least one word vector by training a latent topic vector model, the latent topic vector model being a fusion of a topic model and word vectors, wherein the training to obtain the at least one topic vector related to the document's content and the at least one word vector specifically comprises: adding the document to a training database; constructing an initial topic vector for each topic and an initial word vector for each word of each document in the training database; establishing a joint likelihood function of all documents in the training database from the initial topic vectors and the initial word vectors; and performing parameter estimation on the joint likelihood function to obtain the topic vectors and word vectors;
computing the distance between the word vectors and the topic vector;
selecting, according to the distances between the word vectors and the topic vector, the words corresponding to a predetermined number of word vectors as keywords of the document.
2. The method according to claim 1, characterized in that computing the distance between the word vectors and the topic vector comprises:
selecting, according to the document's topic distribution, the topic with the largest probability from among the at least one topic as the optimal topic;
computing the distance between the word vectors and the topic vector corresponding to the optimal topic.
3. The method according to claim 1, characterized in that computing the distance between the word vectors and the topic vector comprises:
weighting and summing the distances between the word vectors and each topic vector according to the topic distribution probability of each topic of the document;
taking the weighted sum as the distance between the word vectors and the topic vectors.
4. The method according to any one of claims 1-3, characterized in that the distance is the cosine distance.
5. The method according to claim 1, characterized in that establishing the joint likelihood function of all documents in the training database from the initial topic vectors and the initial word vectors comprises:
obtaining the generating probability of each initial word vector by a calculation formula;
obtaining the joint likelihood function of all documents in the training database according to the calculation formula.
6. The method according to claim 5, characterized in that, after obtaining the joint likelihood function according to the calculation formula, the method further comprises:
processing the joint likelihood function with the Gibbs sampling algorithm to obtain the conditional distribution of the topic of each sentence in each document;
determining a specific topic for each sentence of each document according to the conditional probability of each topic in the conditional distribution;
processing the joint likelihood function according to the conditional probabilities of the specific topics to obtain a log-likelihood function;
and performing parameter estimation on the joint likelihood function to obtain the topic vectors and word vectors comprises:
performing parameter estimation on the log-likelihood function to obtain the topic vectors and word vectors.
7. The method according to claim 6, characterized in that, after obtaining the log-likelihood function, the method further comprises:
optimizing the parameters in the log-likelihood function using Newton's iteration method;
and/or
optimizing the word vectors, topic vectors, and auxiliary vectors in the log-likelihood function using the negative sampling algorithm;
and performing parameter estimation on the log-likelihood function to obtain the topic vectors and word vectors comprises:
performing parameter estimation on the optimized log-likelihood function to obtain the topic vectors and word vectors.
8. The method according to claim 7, characterized in that optimizing the word vectors, topic vectors, and auxiliary vectors using the negative sampling algorithm comprises:
processing the words and topics of all documents in the training database with the negative sampling algorithm to obtain a negative-sampling likelihood function;
processing the negative-sampling likelihood function with stochastic gradient descent to obtain the optimization formulas of the word vectors, the topic vectors, and the auxiliary vectors.
9. The method according to any one of claims 5-8, characterized by further comprising:
obtaining the topic distribution of each document during the parameter estimation on the joint likelihood function.
10. A device for extracting keywords from a document, characterized by comprising:
a vector training module, configured to obtain at least one topic vector related to the document's content and at least one word vector by training a latent topic vector model, the latent topic vector model being a fusion of a topic model and word vectors, wherein the vector training module comprises: a vector construction unit, configured to add the document to a training database and to construct an initial topic vector for each topic and an initial word vector for each word of each document in the training database; a joint likelihood function establishing unit, configured to establish a joint likelihood function of all documents in the training database from the initial topic vectors and the initial word vectors; and a parameter estimation unit, configured to perform parameter estimation on the joint likelihood function to obtain the topic vectors and word vectors;
a distance calculation module, configured to compute the distance between the word vectors and the topic vector;
a keyword extraction module, configured to select, according to the distances between the word vectors and the topic vector, the words corresponding to a predetermined number of word vectors as keywords of the document.
11. The device according to claim 10, characterized in that the distance calculation module is specifically configured to:
select, according to the document's topic distribution, the topic with the largest probability from among the at least one topic as the optimal topic; and compute the distance between the word vectors and the topic vector corresponding to the optimal topic.
12. The device according to claim 10, characterized in that the distance calculation module is specifically configured to:
weight and sum the distances between the word vectors and each topic vector according to the topic distribution probability of each topic of the document; and take the weighted sum as the distance between the word vectors and the topic vectors.
13. The device according to any one of claims 10-12, characterized in that the distance is the cosine distance.
14. The device according to claim 10, characterized in that the joint likelihood function establishing unit is specifically configured to:
obtain the generating probability of each initial word vector by a calculation formula;
obtain the joint likelihood function of all documents in the training database according to the calculation formula.
15. The device according to claim 14, characterized in that the vector training module further comprises:
a joint likelihood function processing unit, configured to, after the joint likelihood function establishing unit obtains the joint likelihood function according to the calculation formula, process the joint likelihood function with the Gibbs sampling algorithm to obtain the conditional distribution of the topic of each sentence in each document;
determine a specific topic for each sentence of each document according to the conditional probability of each topic in the conditional distribution;
and process the joint likelihood function according to the conditional probabilities of the specific topics to obtain a log-likelihood function;
and the parameter estimation unit is specifically configured to:
perform parameter estimation on the log-likelihood function to obtain the topic vectors and word vectors.
16. The device according to claim 15, characterized in that the vector training module further comprises:
a log-likelihood function optimization processing unit, configured to, after the joint likelihood function processing unit obtains the log-likelihood function, optimize the parameters in the log-likelihood function using Newton's iteration method;
and/or
optimize the word vectors, topic vectors, and auxiliary vectors in the log-likelihood function using the negative sampling algorithm;
and the parameter estimation unit is specifically configured to:
perform parameter estimation on the optimized log-likelihood function to obtain the topic vectors and word vectors.
17. The device according to claim 16, characterized in that the log-likelihood function optimization processing unit is specifically configured to:
process the words and topics of all documents in the training database with the negative sampling algorithm to obtain a negative-sampling likelihood function;
process the negative-sampling likelihood function with stochastic gradient descent to obtain the optimization formulas of the word vectors, the topic vectors, and the auxiliary vectors.
18. The device according to any one of claims 14-17, characterized in that the parameter estimation unit is further configured to:
obtain the topic distribution of each document during the parameter estimation on the joint likelihood function.
CN201510512363.8A 2015-08-19 2015-08-19 Method and device for extracting keywords from a document Active CN105069143B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510512363.8A CN105069143B (en) 2015-08-19 2015-08-19 Method and device for extracting keywords from a document

Publications (2)

Publication Number Publication Date
CN105069143A CN105069143A (en) 2015-11-18
CN105069143B true CN105069143B (en) 2019-07-23

Family

ID=54498512

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510512363.8A Active CN105069143B (en) 2015-08-19 2015-08-19 Method and device for extracting keywords from a document

Country Status (1)

Country Link
CN (1) CN105069143B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740354B (en) * 2016-01-26 2018-11-30 中国人民解放军国防科学技术大学 Method and device for adaptive latent Dirichlet model selection
CN106021272B (en) * 2016-04-04 2019-11-19 上海大学 Keyword extraction method based on distributed-representation word-vector computation
CN106407316B (en) * 2016-08-30 2020-05-15 北京航空航天大学 Software question and answer recommendation method and device based on topic model
CN108399180B (en) * 2017-02-08 2021-11-26 腾讯科技(深圳)有限公司 Knowledge graph construction method and device and server
CN107220232B (en) * 2017-04-06 2021-06-11 北京百度网讯科技有限公司 Keyword extraction method and device based on artificial intelligence, equipment and readable medium
CN109815474B (en) * 2017-11-20 2022-09-23 深圳市腾讯计算机系统有限公司 Word sequence vector determination method, device, server and storage medium
CN108829822B (en) * 2018-06-12 2023-10-27 腾讯科技(深圳)有限公司 Media content recommendation method and device, storage medium and electronic device
CN108984526B (en) * 2018-07-10 2021-05-07 北京理工大学 Document theme vector extraction method based on deep learning
CN109446516B (en) * 2018-09-28 2022-11-11 北京赛博贝斯数据科技有限责任公司 Data processing method and system based on theme recommendation model
CN109299465A (en) * 2018-10-17 2019-02-01 北京京航计算通讯研究所 The identifying system of file keyword accuracy is promoted based on many algorithms
CN109408641B (en) * 2018-11-22 2020-06-02 山东工商学院 Text classification method and system based on supervised topic model
CN110263122B (en) * 2019-05-08 2022-05-17 北京奇艺世纪科技有限公司 Keyword acquisition method and device and computer readable storage medium
CN110134957B (en) * 2019-05-14 2023-06-13 云南电网有限责任公司电力科学研究院 Scientific and technological achievement warehousing method and system based on semantic analysis

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009104296A (en) * 2007-10-22 2009-05-14 Nippon Telegr & Teleph Corp <Ntt> Related keyword extraction method, device, program, and computer readable recording medium
CN102081660A (en) * 2011-01-13 2011-06-01 西北工业大学 Method for searching and sequencing keywords of XML documents based on semantic correlation
US8768960B2 (en) * 2009-01-20 2014-07-01 Microsoft Corporation Enhancing keyword advertising using online encyclopedia semantics
CN104008090A (en) * 2014-04-29 2014-08-27 河海大学 Multi-subject extraction method based on concept vector model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9384287B2 (en) * 2014-01-15 2016-07-05 Sap Portals Isreal Ltd. Methods, apparatus, systems and computer readable media for use in keyword extraction

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009104296A (en) * 2007-10-22 2009-05-14 Nippon Telegr & Teleph Corp <Ntt> Related keyword extraction method, device, program, and computer readable recording medium
US8768960B2 (en) * 2009-01-20 2014-07-01 Microsoft Corporation Enhancing keyword advertising using online encyclopedia semantics
CN102081660A (en) * 2011-01-13 2011-06-01 西北工业大学 Method for searching and sequencing keywords of XML documents based on semantic correlation
CN104008090A (en) * 2014-04-29 2014-08-27 河海大学 Multi-subject extraction method based on concept vector model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘知远 (Liu Zhiyuan), "Research on keyword extraction methods based on document topic structure" (基于文档主题结构的关键词抽取方法研究), www.thunlp.org, 2011-12-31, pp. 18-21

Also Published As

Publication number Publication date
CN105069143A (en) 2015-11-18

Similar Documents

Publication Publication Date Title
CN105069143B (en) Extract the method and device of keyword in document
CN109408526B (en) SQL sentence generation method, device, computer equipment and storage medium
CN105426354B (en) Vector fusion method and device
CN109145290A (en) Based on word vector with from the semantic similarity calculation method of attention mechanism
Vig et al. Exploring neural models for query-focused summarization
CN106227714A (en) Method and apparatus for obtaining keywords for poem generation based on artificial intelligence
CN109740158A (en) A kind of text semantic analysis method and device
CN112016320A (en) English punctuation adding method, system and equipment based on data enhancement
CN104915399A (en) Recommended data processing method based on news headline and recommended data processing method system based on news headline
CN108363688A (en) A kind of name entity link method of fusion prior information
CN112860896A (en) Corpus generalization method and man-machine conversation emotion analysis method for industrial field
KR20160112248A (en) Latent keyphrase generation method and apparatus
CN107341152B (en) Parameter input method and device
CN107122378B (en) Object processing method and device and mobile terminal
CN110929022A (en) Text abstract generation method and system
CN107665222B (en) Keyword expansion method and device
CN110309513B (en) Text dependency analysis method and device
CN107291686B (en) Method and system for identifying emotion identification
Hong et al. Comprehensive technology function product matrix for intelligent chatbot patent mining
CN111566665B (en) Apparatus and method for applying image coding recognition in natural language processing
CN109727591B (en) Voice search method and device
CN109241993B (en) Evaluation object emotion classification method and device integrating user and overall evaluation information
CN112069816A (en) Chinese punctuation adding method, system and equipment
CN110610001A (en) Short text integrity identification method and device, storage medium and computer equipment
CN110619866A (en) Speech synthesis method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant