CN105069143B - Method and device for extracting keywords from a document - Google Patents
- Publication number: CN105069143B
- Application number: CN201510512363.8A
- Authority
- CN
- China
- Prior art keywords
- vector
- topic
- document
- likelihood function
- word vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/313—Selection or weighting of terms for indexing
Abstract
The invention discloses a method and device for extracting keywords from a document. The method comprises: training a latent topic vector model to obtain at least one topic vector and at least one word vector relevant to the information in the document, the latent topic vector model being a fusion of a topic model and word vectors; calculating the distance between each word vector and the topic vector; and, according to the distance between each word vector and the topic vector, selecting the words corresponding to a preset number of word vectors as the keywords of the document. Embodiments of the present invention can extract keyword information that accurately expresses the information in a document.
Description
Technical field
Embodiments of the present invention relate to the field of information technology, and in particular to a method and device for extracting keywords from a document.
Background art
In the current era of information explosion, a user cannot browse every document that may contain relevant information. Extracting the keywords of a document therefore gives the user a useful reference, and is of great significance for helping users obtain information accurately and for reducing the cost of obtaining it.
In general, the keywords of a document are necessarily words that are highly relevant to the document's topics, so the topic information of a document is of great value for keyword extraction. At present, this problem is mainly solved using the probability distributions of words in a Latent Dirichlet Allocation (LDA) model, chiefly by the following two methods:
The first is a likelihood-based method: an LDA model is used to obtain the topic distribution P(z|d) of the document and the word distribution P(w|z) of each topic, and the distribution of words in the document is computed as P(w|d) = Σ_z P(z|d)P(w|z), where z denotes a topic, d denotes the document, and w denotes a word. The probability P(w|d) is taken as the importance score of word w in document d, and the K highest-scoring words are selected as the keywords of the document.
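The likelihood-based scoring just described can be sketched in a few lines; the distributions below are toy values chosen for illustration, not outputs of a trained LDA model:

```python
import numpy as np

# Toy LDA outputs: P(z|d) over 2 topics, P(w|z) over a 4-word vocabulary.
# These values are illustrative, not from a trained model.
p_z_given_d = np.array([0.7, 0.3])                     # topic distribution of document d
p_w_given_z = np.array([[0.5, 0.3, 0.1, 0.1],          # word distribution of topic 0
                        [0.1, 0.1, 0.4, 0.4]])         # word distribution of topic 1
vocab = ["economy", "market", "game", "player"]

# P(w|d) = sum_z P(z|d) * P(w|z), used as the importance score of each word.
p_w_given_d = p_z_given_d @ p_w_given_z

# Pick the K highest-scoring words as keywords.
K = 2
keywords = [vocab[i] for i in np.argsort(p_w_given_d)[::-1][:K]]
print(keywords)
```

As the background notes, this scoring inherits the topic model's preference for high-frequency words, since P(w|z) concentrates mass on them.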
The second is a method based on the distance between latent-variable distributions: an LDA model is used to obtain the topic distribution P(z|d) of the document and the topic distribution of each word, P(z|w) ∝ P(w|z)P(z). The cosine similarity between these two distributions is then computed, and the K words with the largest similarity are selected as the keywords of the document.
Both of these keyword extraction methods have shortcomings. The first method is seriously biased toward high-frequency words: the extracted words are largely the high-frequency words under some topic, yet these high-frequency words do not occur widely in every document and often fail to truly reflect the information the document expresses.

The second method requires the distribution P(z) of the latent variable in order to compute P(z|w) ∝ P(w|z)P(z). This distribution, however, is not a parameter of the LDA model; it is usually obtained as P(z) = Σ_d P(z|d)P(d), where P(d) is the posterior distribution over documents, and by assuming that P(d) is uniform one obtains P(z) ∝ Σ_d P(z|d). But since the posterior P(d) is in fact not uniform across documents, the theoretical basis of this model is not solid, and its effect in practice is also poor.
Summary of the invention
Embodiments of the present invention provide a method and device for extracting keywords from a document, which can extract keyword information that accurately expresses the information in the document.
In a first aspect, an embodiment of the invention provides a method for extracting keywords from a document, comprising:

training a latent topic vector model to obtain at least one topic vector and at least one word vector relevant to the information in the document, the latent topic vector model being a fusion of a topic model and word vectors;

calculating the distance between each word vector and the topic vector; and

according to the distance between each word vector and the topic vector, selecting the words corresponding to a preset number of word vectors as the keywords of the document.
In a second aspect, an embodiment of the invention further provides a device for extracting keywords from a document, comprising:

a vector training module, configured to train a latent topic vector model to obtain at least one topic vector and at least one word vector relevant to the information in a document, the latent topic vector model being a fusion of a topic model and word vectors;

a distance calculation module, configured to calculate the distance between each word vector and the topic vector; and

a keyword extraction module, configured to select, according to the distance between each word vector and the topic vector, the words corresponding to a preset number of word vectors as the keywords of the document.
Embodiments of the present invention train on a document with a latent topic vector model that fuses a topic model with word vectors, obtain at least one topic vector and at least one word vector relevant to the information in the document, and then, according to the distance between each word vector and the topic vector, select the words corresponding to a preset number of word vectors as the keywords of the document. Because the document is trained with a latent topic vector model, more of the document's information can be captured during training, so the extracted keyword information accurately expresses the information in the document.
Brief description of the drawings

Fig. 1 is a flow diagram of the method for extracting keywords from a document provided by Embodiment One of the present invention;

Fig. 2 is a flow diagram of the method for extracting keywords from a document provided by Embodiment Two of the present invention;

Fig. 3 is a structural diagram of the device for extracting keywords from a document provided by Embodiment Three of the present invention.
Specific embodiment
The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention, not to limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire structure.

The method for extracting keywords from a document provided by the embodiments of the present invention may be executed by the device for extracting keywords from a document provided by the embodiments, or by a terminal device (for example, a smartphone or tablet computer) in which that device is integrated; the device may be implemented in hardware or software.
Embodiment one
Fig. 1 is a flow diagram of the method for extracting keywords from a document provided by Embodiment One of the present invention. As shown in Fig. 1, the method specifically comprises:

S11. Train a latent topic vector model to obtain at least one topic vector and at least one word vector relevant to the information in the document, the latent topic vector model being a fusion of a topic model and word vectors.
Topic models (Topic Model) and word vectors (Word Embedding) are both common semantic representation methods in the prior art. A topic model assumes that each word is generated by a semantic unit in a latent space; under this assumption, documents and words can be mapped into the latent semantic space for dimensionality reduction. A word vector, by contrast, is a distributed representation of a word: it represents the meaning of a word with a fixed-length vector.

A topic model is usually built at the document or sentence level and focuses on global semantics, whereas word vectors generally assume that the semantics of a word are determined by the words around it, focusing on local, syntax-like information. The two methods have different emphases, and each has proven to be of great application value. The present embodiment therefore combines the two, so that the latent topic vector model can capture more information.
The dimensions of the topic vectors and word vectors can be set as desired, and the value of each element of a vector is obtained by training the latent topic vector model. To make the training result more accurate, the latent topic vector model also includes a training database containing a large quantity of document data.
S12. Calculate the distance between each word vector and the topic vector.

The purpose of training to obtain the word vectors and topic vectors is to compute the importance of each word in the document and rank the words by importance, so that the most important words can be selected as the keywords of the document.

In the present embodiment, the importance of a word in the document is measured by calculating the distance between its word vector and the topic vector, specifically the Euclidean distance, cosine distance, or sine distance between the two. The interpretation of the measurement depends on the metric chosen: for the Euclidean or sine distance, the smaller the distance, the more important the word is in the document and the better it reflects the topic the document expresses; for cosine similarity, the larger the value, the more important the word.
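As a sketch of step S12, the two most common metrics can be computed directly with NumPy; the vectors below are made-up stand-ins for the trained word and topic vectors:

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance: smaller means the word is closer to the topic."""
    return float(np.linalg.norm(a - b))

def cosine_similarity(a, b):
    """Cosine similarity: larger means the word is closer to the topic."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical trained vectors (illustrative values only).
topic = np.array([1.0, 0.0, 1.0])
words = {"stock": np.array([0.9, 0.1, 1.1]),
         "weather": np.array([-1.0, 1.0, 0.0])}

for w, v in words.items():
    print(w, euclidean(v, topic), cosine_similarity(v, topic))
```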
S13. According to the distance between each word vector and the topic vector, select the words corresponding to a preset number of word vectors as the keywords of the document.

The preset number can be configured according to the actual situation and is not specifically limited here. From the result calculated in step S12, the preset number of word vectors that are most important in the document can be determined, and the corresponding words are taken as the keywords of the document.
The present embodiment trains on the document with a latent topic vector model that fuses a topic model with word vectors, obtains at least one topic vector and at least one word vector relevant to the information in the document, and then, according to the distance between each word vector and the topic vector, selects the words corresponding to a preset number of word vectors as the keywords of the document. Because the document is trained with a latent topic vector model, more of the document's information can be captured during training, so the extracted keyword information accurately expresses the information in the document.
Illustratively, to improve the accuracy of keyword extraction, embodiments of the invention provide the following two methods of calculating the distance between the word vectors and the topic vector. The first is a calculation method based on the optimal topic, which mainly comprises the following steps:

according to the topic distribution of the document, selecting from the at least one topic the topic with the largest topic distribution probability as the optimal topic;

calculating the distance between each word vector and the topic vector corresponding to the optimal topic.

Specifically, for a given document the latent topic vector model can be trained to obtain the topic distribution of the document, P(z|d), which contains the distribution probability of each topic for that document. The topic z with the largest probability in this distribution, i.e., the optimal topic, represents the core content of the document, and it can be understood that the most important words in the document are those whose vector representations in the vector space are closest to that of topic z. Therefore, the topic with the largest topic distribution probability is selected as the optimal topic, the distance between each word vector and the topic vector corresponding to the optimal topic is calculated, and according to these distances the words corresponding to a preset number of word vectors are selected as the keywords of the document.
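A minimal sketch of the optimal-topic method, using cosine similarity as the metric and hypothetical trained vectors (`keywords_by_best_topic` is an illustrative name, not from the patent):

```python
import numpy as np

def keywords_by_best_topic(p_z_given_d, topic_vecs, word_vecs, vocab, k=3):
    """Pick the topic with the largest P(z|d), then rank words by cosine
    similarity of their vectors to that optimal topic's vector."""
    best = int(np.argmax(p_z_given_d))                  # optimal topic z*
    t = topic_vecs[best]
    sims = word_vecs @ t / (np.linalg.norm(word_vecs, axis=1) * np.linalg.norm(t))
    return [vocab[i] for i in np.argsort(sims)[::-1][:k]]

# Toy data: topic 1 dominates the document's topic distribution.
p_z_given_d = np.array([0.2, 0.8])
topic_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
word_vecs = np.array([[0.1, 1.0], [1.0, 0.1], [0.5, 0.9]])
vocab = ["match", "finance", "score"]
print(keywords_by_best_topic(p_z_given_d, topic_vecs, word_vecs, vocab, k=2))
```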
The second is a calculation method based on the topic distribution, which mainly comprises the following steps:

weighting and summing the distances between each word vector and every topic vector according to the topic distribution probability of each topic of the document;

taking the weighted sum as the distance between the word vector and the topic vector.

Specifically, considering that more than one topic may play an important role in a document, and that the optimal-topic method above may lose part of the information, the distances to the different topics are weighted according to P(z|d), giving a new metric:

Score_Distr(w) = Σ_z P(z|d) · L(w, z)

where Score_Distr(w) is the weighted sum and L(w, z) is the distance between the word vector of w and the topic vector of z. This metric is the word importance score obtained by weighting with the topic distribution of the document. The words are ranked by Score_Distr(w) computed as above, and the words corresponding to a preset number of word vectors are selected as the keywords of the document.
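The distribution-weighted score can be sketched as follows, with Euclidean distance as the metric L and made-up vectors (the helper name is illustrative):

```python
import numpy as np

def keywords_by_topic_distribution(p_z_given_d, topic_vecs, word_vecs, vocab, k=3):
    """Score each word by the P(z|d)-weighted sum of its distances to all topic
    vectors (Score_Distr); smaller weighted Euclidean distance = more important."""
    # dists[w, z] = Euclidean distance between word vector w and topic vector z
    dists = np.linalg.norm(word_vecs[:, None, :] - topic_vecs[None, :, :], axis=2)
    scores = dists @ p_z_given_d          # weighted sum over topics
    order = np.argsort(scores)[:k]        # smaller distance -> more important
    return [vocab[i] for i in order]

# Toy data: topic 0 carries most of the document's probability mass.
p_z_given_d = np.array([0.9, 0.1])
topic_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
word_vecs = np.array([[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]])
vocab = ["economy", "sport", "mixed"]
print(keywords_by_topic_distribution(p_z_given_d, topic_vecs, word_vecs, vocab, k=2))
```

Unlike the optimal-topic method, a word close to a secondary topic still contributes through its P(z|d) weight.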
Illustratively, embodiments of the invention also provide a concrete implementation of training the latent topic vector model to obtain at least one topic vector and at least one word vector relevant to the information in the document, mainly comprising the following steps:

adding the document to a training database, and constructing an initial topic vector and an initial word vector for each topic and each word of every document in the training database;

establishing the joint likelihood function of all documents in the training database according to the initial topic vectors and initial word vectors;

performing parameter estimation on the joint likelihood function to obtain the topic vectors and word vectors.

The training database can be obtained from the Internet (for example, the Sina corpus) and contains documents of various types. The initial topic vectors and initial word vectors can be set as desired.
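The construction of the initial vectors can be sketched as below; small uniform random values are a common choice, though the text leaves the initialization scheme unspecified (`init_vectors` is an illustrative name):

```python
import numpy as np

def init_vectors(docs, num_topics, dim=100, seed=0):
    """Construct an initial word vector for every word and an initial topic
    vector for every topic (small uniform random values; the actual
    initialization scheme is not specified in the text)."""
    rng = np.random.default_rng(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_vecs = {w: rng.uniform(-0.5 / dim, 0.5 / dim, dim) for w in vocab}
    topic_vecs = rng.uniform(-0.5 / dim, 0.5 / dim, (num_topics, dim))
    return word_vecs, topic_vecs
```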
Illustratively, the joint likelihood function is established from the initial topic vectors and initial word vectors as follows. The generation probability of the current word is obtained according to formula one:
Formula one: P(w | x_w, z) = exp(v̂_w · (x_w ⊕ v_z)) / Σ_{w'} exp(v̂_{w'} · (x_w ⊕ v_z))
where v̂_w is the auxiliary vector of the word vector v of the current word w; x_w denotes the context vector of the current word w, obtained with the summing operation ⊕ over the word vectors of the words surrounding w; v_z is the topic vector of the current topic z; and w' ranges over the words of the vocabulary.
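Reading formula one as a softmax over the vocabulary (an assumption consistent with the normalization over w' described above), the generation probability can be sketched as:

```python
import numpy as np

def word_probability(w_idx, context_idxs, z, word_vecs, aux_vecs, topic_vecs):
    """P(w | x_w, z): softmax over the vocabulary of each auxiliary vector's
    inner product with the context vector x_w combined with the topic vector v_z."""
    x_w = word_vecs[context_idxs].sum(axis=0)   # summing operation over surrounding words
    h = x_w + topic_vecs[z]                     # fuse context with the current topic
    logits = aux_vecs @ h                       # one score per candidate word w'
    probs = np.exp(logits - logits.max())       # numerically stable softmax
    probs /= probs.sum()
    return float(probs[w_idx])
```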
The joint likelihood function of all documents in the training database, formula two, is then obtained from formula one:

Formula two:

where α_z is the Dirichlet prior parameter corresponding to topic z; β_v is the Dirichlet prior parameter corresponding to word v; m_dz is the number of sentences in document d sampled with topic z; n_zv denotes the total number of times word v and topic z occur together in the training database; M denotes the set of all word vectors and topic vectors; D denotes the number of documents d; T denotes the number of topics in document d; and v̂ denotes the auxiliary vector of word v.
Illustratively, to further optimize the above joint likelihood function, the following steps are performed after formula two is obtained from formula one:

Formula two is processed using the Gibbs sampling algorithm to obtain the conditional distribution of the topic corresponding to each sentence s in document d, as formula three:
Formula three: P(z_i = k | ·) ∝ (m_dk + α_k) · Π_{w=1}^{W} P(w | x_w, k)^{N_iw}

where k is a candidate topic, W is the number of words in the training database, and N_iw is the number of times word w occurs in sentence i of document d;
A specific topic is determined for each sentence s of document d according to the conditional distribution probability of each topic in the conditional distribution.

Formula one is processed according to the conditional distribution probability of the specific topic to obtain the log-likelihood function, as formula four:

Formula four:

Performing parameter estimation on the joint likelihood function to obtain the topic vectors and word vectors then comprises: performing parameter estimation on the log-likelihood function to obtain the topic vectors and word vectors.
Illustratively, the log-likelihood function obtained above can be optimized further, specifically comprising the following steps:

optimizing the parameters α and β in the log-likelihood function using the Newton iteration method;

and/or

optimizing the word vectors, topic vectors, and auxiliary vectors in the log-likelihood function using the negative sampling algorithm.

Correspondingly, performing parameter estimation on the log-likelihood function to obtain the topic vectors and word vectors comprises: performing parameter estimation on the optimized log-likelihood function to obtain the topic vectors and word vectors.
Illustratively, optimizing the word vectors, topic vectors, and auxiliary vectors using the negative sampling algorithm comprises: processing the words and topics of all documents in the training database with the negative sampling algorithm to obtain the likelihood function of formula five:
Formula five: L = Π_w Π_{u ∈ {w} ∪ NEG(w)} σ(v̂_u · (x_w ⊕ v_z))^l · (1 − σ(v̂_u · (x_w ⊕ v_z)))^(1−l)

where σ is the sigmoid function; l is the label of the current word, with l = 1 if the current word is a normal word and l = 0 if it is a negatively sampled word; |NEG| is the number of negatively sampled words corresponding to each word; and |V| is the number of words in the training database.
Formula five is processed using the stochastic gradient descent method to obtain the update formulas: formula six for the word vectors, formula seven for the topic vectors, and formula eight for the auxiliary vectors:

Formula six:

Formula seven:

Formula eight:
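Under the assumption that formulas six to eight follow the standard word2vec-style negative-sampling updates (the formulas themselves are not reproduced here), one stochastic-gradient step can be sketched as:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_step(w, neg_ids, context_ids, z, word_vecs, aux_vecs, topic_vecs, lr=0.025):
    """One stochastic-gradient step on the negative-sampling objective:
    raise the score of the true word w (l = 1) and lower the score of each
    negatively sampled word (l = 0). Updates are in place, word2vec-style;
    the mapping to formulas six-eight is an assumption."""
    x_w = word_vecs[context_ids].sum(axis=0)       # context vector
    h = x_w + topic_vecs[z]                        # context fused with topic
    grad_h = np.zeros_like(h)
    for u, l in [(w, 1.0)] + [(n, 0.0) for n in neg_ids]:
        g = lr * (l - sigmoid(aux_vecs[u] @ h))    # scalar gradient factor
        grad_h += g * aux_vecs[u]
        aux_vecs[u] = aux_vecs[u] + g * h          # auxiliary-vector update
    topic_vecs[z] = topic_vecs[z] + grad_h         # topic-vector update
    for c in context_ids:
        word_vecs[c] = word_vecs[c] + grad_h       # word-vector update
```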
For the training corpus in the training database, the latent topic vector model provided by the embodiments of the present invention yields a vectorized representation of each topic, while a topic-model-based method from the prior art can learn the multinomial distribution of words under each topic. The embodiment compared, for each topic, the 10 words with the largest probability in the multinomial distribution against the 10 word vectors closest to the topic's vector; the results are shown in Table one below:

Table one

As can be seen from Table one, the multinomial distribution based on the topic model is clearly skewed toward high-frequency words, while low-frequency words are only weakly connected to topics by the traditional topic distribution. As a result, when the multinomial distribution is used for keyword extraction, the topic model naturally favors high-frequency words, leading to poor extraction results. The vectorized representation of the latent topic model eliminates this problem: as the table shows, the words closest to a topic vector tend to be the words that carry concrete meaning under that topic, so a model using topic vectors can obtain better results in the keyword extraction task.
Thus the above embodiments likewise train on the document with a latent topic vector model that fuses a topic model with word vectors, obtain at least one topic vector and at least one word vector relevant to the information in the document, and then, according to the distance between each word vector and the topic vector, select the words corresponding to a preset number of word vectors as the keywords of the document. Training the document with a latent topic vector model captures more of the document's information during training, so the extracted keyword information accurately expresses the information in the document.
Embodiment two
Fig. 2 is a flow diagram of the method for extracting keywords from a document provided by Embodiment Two of the present invention. As shown in Fig. 2, the method specifically comprises:

S21. Add the document to be processed to a training database, and construct an initial topic vector and an initial word vector for each topic and each word of every document in the training database.
S22. Obtain the generation probability of the current word according to formula one:

Formula one: P(w | x_w, z) = exp(v̂_w · (x_w ⊕ v_z)) / Σ_{w'} exp(v̂_{w'} · (x_w ⊕ v_z))

where v̂_w is the auxiliary vector of the word vector v of the current word w; x_w denotes the context vector of the current word w, obtained with the summing operation ⊕ over the word vectors of the words surrounding w; v_z is the topic vector of the current topic z; and w' ranges over the words of the vocabulary.
S23. Obtain the joint likelihood function of all documents in the training database according to formula one, as formula two:

Formula two:

where α_z is the Dirichlet prior parameter corresponding to topic z; β_v is the Dirichlet prior parameter corresponding to word v; m_dz is the number of sentences in document d sampled with topic z; n_zv denotes the total number of times word v and topic z occur together in the training database; M denotes the set of all word vectors and topic vectors; D denotes the number of documents d; T denotes the number of topics in document d; and v̂ denotes the auxiliary vector of word v.
S24. Process formula two using the Gibbs sampling algorithm to obtain the conditional distribution of the topic corresponding to each sentence s in document d, as formula three:

Formula three: P(z_i = k | ·) ∝ (m_dk + α_k) · Π_{w=1}^{W} P(w | x_w, k)^{N_iw}

where k is a candidate topic, W is the number of words in the training database, and N_iw is the number of times word w occurs in sentence i of document d.
S25. Determine a specific topic for each sentence s of document d according to the conditional distribution probability of each topic in the conditional distribution.

S26. Process formula one according to the conditional distribution probability of the specific topic to obtain the log-likelihood function, as formula four:

Formula four:
S27. Optimize the parameters α and β in the log-likelihood function using the Newton iteration method, and optimize the word vectors, topic vectors, and auxiliary vectors in the log-likelihood function using the negative sampling algorithm.

S28. Perform parameter estimation on the optimized log-likelihood function to obtain the topic vectors and word vectors of the document to be processed.

S29. Calculate the cosine distance between each word vector and the topic vector.

S210. According to the cosine distance between each word vector and the topic vector, select the words corresponding to a preset number of word vectors as the keywords of the document to be processed.
To verify the validity of the embodiments of the present invention, the inventors carried out several groups of comparative experiments on experimental data sets of different scales; the experimental effect exceeded the best results of traditional topic-model-based methods.
First group of experiments: small-scale data

Experiment purpose: to pick out, from all the words in a document, the keywords that best embody the document's meaning.

Training database: the development, training, and test sets of the Sina corpus, 32,000 documents in total.

Test data: the test set of the Sina corpus, in which each document is accompanied by its reference keywords; 1,000 documents in total.
Appraisal procedure: for each document, each model generates 3 keywords. The experimental results are assessed with precision and recall. Precision is the percentage of the model's predicted keywords that are correct; recall is the percentage of the reference keywords that the model predicts correctly. The micro-average is used as the evaluation index: precision and recall are computed separately for each document and then averaged.
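The appraisal procedure corresponds to the following computation (toy predictions and references; the names are illustrative):

```python
def precision_recall(predicted, reference):
    """Per-document precision and recall over keyword sets."""
    correct = len(set(predicted) & set(reference))
    return correct / len(predicted), correct / len(reference)

# Toy example: each model predicts 3 keywords per document.
docs = [(["a", "b", "c"], ["a", "b", "d"]),   # 2 of 3 predictions correct
        (["x", "y", "z"], ["x", "q", "r"])]   # 1 of 3 predictions correct
pairs = [precision_recall(pred, ref) for pred, ref in docs]
avg_precision = sum(p for p, _ in pairs) / len(pairs)
avg_recall = sum(r for _, r in pairs) / len(pairs)
print(avg_precision, avg_recall)   # per-document scores averaged over documents
```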
Experimental setup: the cases with and without removing stop words from the training corpus were considered separately, and the latent topic vector model used in the embodiment of the present invention was compared with the likelihood-based methods of several LDA variants and Sentence LDA. In LDA and in sentence-level latent Dirichlet allocation (Sentence LDA, sLDA), for each word in the document the score Σ_z P(z|d)P(w|z) is computed as the word's score in the current document, and the 3 words with the largest values are taken as keywords. In all of the above methods, the embodiment of the present invention removes all single-character words from the corpus. The experimental results are shown in Table two below:
Table two
Analysis of experimental results: it can be seen from the results above that, regardless of whether stop words are removed, the method of the embodiment of the present invention achieves the best experimental result. In the experiment with stop words removed, the latent topic vector model provided by the embodiment improves on the result of the LDA model by 20.9%. At the same time, whether or not stop words are removed has no effect on the final result of the latent topic vector model, which shows that the model has a certain degree of noise resistance. In addition, the experimental result of the calculation method based on the topic distribution is better than that of the calculation method based on the optimal topic, which shows that considering more topic information when generating the final keywords helps the final experimental result.
Second group of experiments: large-scale data

Experiment purpose: to pick out, from all the words in a document, the keywords that best embody the document's meaning.

Training data: the development, training, and test sets of the Sina corpus plus corpus data from the news domain, 261,173 documents in total.

Test data: the test set of the Sina corpus, in which each document is accompanied by its reference keywords; 1,000 documents in total.
Appraisal procedure: for each document, each model generates 3 keywords. The experimental results are assessed with precision and recall. Precision is the percentage of the model's predicted keywords that are correct; recall is the percentage of the reference keywords that the model predicts correctly. The micro-average is used as the evaluation index: precision and recall are computed separately for each document and then averaged.
Experimental setup: the cases with and without removing stop words from the training corpus were considered separately, and the latent topic vector model used in the embodiment of the present invention was compared with the likelihood-function-based methods of several LDA variants and Sentence LDA. In the likelihood method of LDA and Sentence LDA, for each word in the document the score Σ_z P(z|d)P(w|z) is computed as the word's score in the current document, and the 3 words with the largest values are taken as keywords. The method of the embodiment was also compared with the LDA method based on the distance between latent-variable distributions: from the topic distribution P(z|d) of the document and the topic distribution of each word, P(z|w) ∝ P(w|z)P(z), the cosine distance between the two distributions is calculated and the words are sorted by distance, and the 3 words whose topic distributions are closest to the document's topic distribution are taken as the keywords of the document. In all of the above methods, single-character words are removed from the corpus. The experimental results are shown in Table three below:
Table three
Analysis of experimental results: it can be seen from the results above that the method of the embodiment of the present invention still achieves the best experimental result; the conclusion obtained on the small-scale corpus continues to hold on the large-scale corpus. At the same time, the methods based on LDA and Sentence LDA show no significant improvement after the large-scale training corpus is added, whereas the experimental result of the method of the embodiment improves significantly after the training corpus is increased: the calculation method based on the optimal topic improves by 12.1%, and the calculation method based on the topic distribution improves by 6.5%. Moreover, as the model's training corpus grows, the experimental result of the embodiment still has the potential to improve further.
Embodiment three
Fig. 3 is a structural diagram of the device for extracting keywords from a document provided by Embodiment Three of the present invention. As shown in Fig. 3, the device specifically comprises a vector training module 31, a distance calculation module 32, and a keyword extraction module 33.

The vector training module 31 is configured to train a latent topic vector model to obtain at least one topic vector and at least one word vector relevant to the information in a document, the latent topic vector model being a fusion of a topic model and word vectors.

The distance calculation module 32 is configured to calculate the distance between each word vector and the topic vector.

The keyword extraction module 33 is configured to select, according to the distance between each word vector and the topic vector, the words corresponding to a preset number of word vectors as the keywords of the document.
The device of keyword is for executing the text of extraction described in the various embodiments described above in extraction document described in the present embodiment
The method of keyword in shelves, technical principle is similar with the technical effect of generation, is described again here.
Illustratively, on the basis of the above embodiments, the distance calculation module 32 is specifically configured to:
select, according to the topic distribution of the document, the topic with the largest topic distribution probability from the at least one topic as the optimal topic; and calculate the distance between the word vector and the topic vector corresponding to the optimal topic.
Illustratively, on the basis of the above embodiments, the distance calculation module 32 is specifically configured to:
perform a weighted summation of the distances between the word vector and each topic vector according to the topic distribution probability of each topic of the document; and use the weighted sum as the distance between the word vector and the topic vectors.
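A minimal sketch of the two distance strategies described above (the optimal-topic variant and the weighted-sum variant), using cosine distance; the names and array shapes are illustrative, not from the patent:

```python
import numpy as np

def cosine_distance(a, b):
    # Cosine distance, the distance measure used by this embodiment.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def optimal_topic_distance(word_vec, topic_vecs, p_z_given_d):
    """Strategy 1: pick the topic with the largest probability in the
    document's topic distribution and measure against that topic alone."""
    best = int(np.argmax(p_z_given_d))
    return cosine_distance(word_vec, topic_vecs[best])

def weighted_topic_distance(word_vec, topic_vecs, p_z_given_d):
    """Strategy 2: weight the distance to every topic vector by that
    topic's probability and sum the results."""
    return sum(p * cosine_distance(word_vec, tv)
               for p, tv in zip(p_z_given_d, topic_vecs))
```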
Illustratively, on the basis of the above embodiments, the distance is a cosine distance.
Illustratively, on the basis of the above embodiments, the vector training module 31 includes: a vector construction unit 311, a joint likelihood function establishing unit 312 and a parameter estimation unit 313.
The vector construction unit 311 is configured to add the document to a training database, and to construct an initial topic vector and an initial word vector respectively for each topic and each word of each document in the training database.
The joint likelihood function establishing unit 312 is configured to establish the joint likelihood function of all documents in the training database according to the initial topic vectors and initial word vectors.
The parameter estimation unit 313 is configured to perform parameter estimation on the joint likelihood function to obtain the topic vectors and word vectors.
Illustratively, the joint likelihood function establishing unit 312 is specifically configured to:
obtain the generating probability of the initial word vectors according to Formula one:
Formula one:
where the auxiliary vector corresponds to the word vector v of the current word w; x_w denotes the context vector of the current word w, built from the word vectors of the words surrounding the current word w; v_z is the topic vector of the current topic z; the summation operator adds these vectors together; and w' denotes a word w';
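Formula one itself appears only as an image in the source; based on the surrounding description (a context vector x_w built from the surrounding word vectors plus the topic vector v_z, normalized over all words via their auxiliary vectors), one plausible softmax reading can be sketched as follows. This form is an assumption, not the patent's exact formula:

```python
import numpy as np

def generating_probability(w, context_words, z, word_vecs, topic_vecs, aux_vecs):
    """Sketch of P(w | context, z): the context vector x_w sums the word
    vectors of the surrounding words with the topic vector v_z, and the
    probability is a softmax over the auxiliary vectors (assumed form)."""
    x_w = topic_vecs[z] + sum(word_vecs[c] for c in context_words)
    logits = aux_vecs @ x_w                  # one logit per vocabulary word
    probs = np.exp(logits - logits.max())    # subtract max for stability
    probs /= probs.sum()
    return probs[w]
```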
The joint likelihood function of all documents in the training database, Formula two, is obtained according to Formula one:
Formula two:
where α_z is the Dirichlet prior parameter corresponding to topic z, β_v is the Dirichlet prior parameter corresponding to word v, m_dz is the number of sentences in document d sampled as topic z, n_zv denotes the total number of times word v and topic z occur together in the training database, M denotes the set of all word vectors and topic vectors, D denotes the total number of documents d, T denotes the total number of topics in document d, and the remaining symbol denotes the auxiliary vector of word v.
Illustratively, the vector training module 31 further includes: a joint likelihood function processing unit 314.
The joint likelihood function processing unit 314 is configured to: after the joint likelihood function establishing unit 312 obtains the joint likelihood function of Formula two according to Formula one, process Formula two using the Gibbs sampling algorithm to obtain the conditional distribution, Formula three, of the topic corresponding to each sentence s in document d:
Formula three:
where k is the candidate topic, W is the total number of words in the training database, and N_iw is the number of times word w occurs in the i-th sentence of document d; and
determine a specific topic for each sentence s of document d according to the conditional distribution probability of each topic in the conditional distribution;
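Formula three is likewise shown only as an image; given the counts named around it (m_dz sentences of document d assigned topic z, n_zv co-occurrences of word v with topic z, and the Dirichlet priors α and β), the sentence-level Gibbs step can be sketched in the style of collapsed Sentence-LDA sampling. The exact form is an assumption:

```python
import numpy as np

def sample_sentence_topic(sentence_words, d, m_dz, n_zv, n_z, alpha, beta, rng):
    """Draw a topic for one sentence of document d: candidate topic k is
    weighted by (m_dk + alpha_k) times the product over the sentence's
    words v of (n_kv + beta_v) / (n_k + sum(beta)) -- a standard
    collapsed-Gibbs conditional, assumed since Formula three is an image."""
    T = len(n_z)
    beta_sum = float(np.sum(beta))
    weights = np.empty(T)
    for k in range(T):
        w_k = m_dz[d, k] + alpha[k]
        for v in sentence_words:
            w_k *= (n_zv[k, v] + beta[v]) / (n_z[k] + beta_sum)
        weights[k] = w_k
    weights /= weights.sum()                 # normalize to a distribution
    return int(rng.choice(T, p=weights))
```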
process Formula one according to the conditional distribution probability of the specific topic to obtain the log-likelihood function described by Formula four:
Formula four:
The parameter estimation unit 313 is specifically configured to:
perform parameter estimation on the log-likelihood function to obtain the topic vectors and word vectors.
Illustratively, the vector training module 31 further includes: a log-likelihood function optimization processing unit 315.
The log-likelihood function optimization processing unit 315 is configured to: after the joint likelihood function processing unit 314 obtains the log-likelihood function described by Formula four, optimize the parameter α and the parameter β in the log-likelihood function using Newton's iteration method;
and/or
optimize the word vectors, topic vectors and auxiliary vectors in the log-likelihood function using a negative sampling algorithm.
The parameter estimation unit 313 is specifically configured to:
perform parameter estimation on the optimized log-likelihood function to obtain the topic vectors and word vectors.
Illustratively, the log-likelihood function optimization processing unit 315 is specifically configured to:
process the words and topics in all documents in the training database using the negative sampling algorithm to obtain the likelihood function described by Formula five:
Formula five:
where l is the label value corresponding to the current word, |NEG| is the number of negative-sampled words corresponding to a word, and |V| is the total number of words in the training database; and
process Formula five using stochastic gradient descent to obtain the optimization formula for the word vectors, Formula six, the optimization formula for the topic vectors, Formula seven, and the optimization formula for the auxiliary vectors, Formula eight:
Formula six:
Formula seven:
Formula eight:
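Formulas six through eight are shown only as images; a word2vec-style negative-sampling SGD step consistent with the surrounding description (label l, |NEG| negative samples, and updates to the word, topic and auxiliary vectors) can be sketched as follows. The update form is assumed, not taken from the patent:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_step(x_w, aux_vecs, pos, negs, lr):
    """One SGD step of negative sampling: move the auxiliary vector of the
    positive word toward the context vector x_w and those of the negative
    samples away from it, and return the accumulated gradient that would
    be applied to the word and topic vectors making up x_w (assumed form)."""
    grad_x = np.zeros_like(x_w)
    for v, label in [(pos, 1.0)] + [(n, 0.0) for n in negs]:
        g = lr * (label - sigmoid(np.dot(x_w, aux_vecs[v])))
        grad_x += g * aux_vecs[v]              # gradient contribution for x_w
        aux_vecs[v] = aux_vecs[v] + g * x_w    # update the auxiliary vector
    return grad_x
```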
Illustratively, the parameter estimation unit 313 is further configured to:
obtain the topic distribution of each document using Formula nine during the parameter estimation on the joint likelihood function:
Formula nine:
where K is the total number of topics z in document d.
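Formula nine is shown only as an image; given the sentence-topic counts m_dz and the Dirichlet prior α already defined, the usual LDA-style posterior estimate is the natural reading. The following sketch is an assumption on that basis:

```python
import numpy as np

def document_topic_distribution(m_d, alpha):
    """Estimate P(z|d) from the per-topic sentence counts m_d of document d,
    smoothed by the Dirichlet prior alpha (standard LDA estimate; assumed
    since Formula nine is not reproduced in the text)."""
    m_d = np.asarray(m_d, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    return (m_d + alpha) / (m_d.sum() + alpha.sum())
```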
The device for extracting keywords from a document described in the above embodiments is likewise used to execute the method for extracting keywords from a document described in the above embodiments; its technical principle and technical effects are similar and are not described again here.
Note that the above are only preferred embodiments of the present invention and the technical principles applied therein. Those skilled in the art will appreciate that the present invention is not limited to the specific embodiments described herein, and that various obvious changes, readjustments and substitutions can be made by those skilled in the art without departing from the protection scope of the present invention. Therefore, although the present invention has been described in further detail through the above embodiments, the present invention is not limited to the above embodiments only; without departing from the inventive concept, it may also include more other equivalent embodiments, and the scope of the present invention is determined by the scope of the appended claims.
Claims (18)
1. A method for extracting keywords from a document, characterized by comprising:
obtaining, by training an implicit topic vector model, at least one topic vector and at least one word vector related to document information, the implicit topic vector model being a fusion model of a topic model and word vectors, wherein obtaining by training the at least one topic vector and the at least one word vector related to the document information specifically comprises: adding the document to a training database, constructing an initial topic vector and an initial word vector respectively for each topic and each word of each document in the training database, establishing a joint likelihood function of all documents in the training database according to the initial topic vectors and initial word vectors, and performing parameter estimation on the joint likelihood function to obtain the topic vectors and word vectors;
calculating distances between the word vectors and the topic vectors; and
selecting, according to the distances between the word vectors and the topic vectors, the words corresponding to a preset number of word vectors as keywords of the document.
2. The method according to claim 1, wherein calculating the distances between the word vectors and the topic vectors comprises:
selecting, according to the topic distribution of the document, the topic with the largest topic distribution probability from the at least one topic as the optimal topic; and
calculating the distance between the word vector and the topic vector corresponding to the optimal topic.
3. The method according to claim 1, wherein calculating the distances between the word vectors and the topic vectors comprises:
performing a weighted summation of the distances between the word vector and each topic vector according to the topic distribution probability of each topic of the document; and
using the weighted sum as the distance between the word vector and the topic vectors.
4. The method according to any one of claims 1-3, wherein the distance is a cosine distance.
5. The method according to claim 1, wherein establishing the joint likelihood function of all documents in the training database according to the initial topic vectors and initial word vectors comprises:
obtaining the generating probability of the initial word vectors through a calculation formula; and
obtaining the joint likelihood function of all documents in the training database according to the calculation formula.
6. The method according to claim 5, wherein, after the joint likelihood function is obtained according to the calculation formula, the method further comprises:
processing the joint likelihood function using a Gibbs sampling algorithm to obtain the conditional distribution of the topic corresponding to each sentence in each document;
determining a specific topic for each sentence of each document according to the conditional distribution probability of each topic in the conditional distribution; and
processing the joint likelihood function according to the conditional distribution probability of the specific topic to obtain a log-likelihood function;
and wherein performing parameter estimation on the joint likelihood function to obtain the topic vectors and word vectors comprises:
performing parameter estimation on the log-likelihood function to obtain the topic vectors and word vectors.
7. The method according to claim 6, wherein, after the log-likelihood function is obtained, the method further comprises:
optimizing the parameters in the log-likelihood function using Newton's iteration method;
and/or
optimizing the word vectors, topic vectors and auxiliary vectors in the log-likelihood function using a negative sampling algorithm;
and wherein performing parameter estimation on the joint likelihood function to obtain the topic vectors and word vectors comprises:
performing parameter estimation on the optimized log-likelihood function to obtain the topic vectors and word vectors.
8. The method according to claim 7, wherein optimizing the word vectors, topic vectors and auxiliary vectors using the negative sampling algorithm comprises:
processing the words and topics in all documents in the training database using the negative sampling algorithm to obtain a negative sampling likelihood function; and
processing the negative sampling likelihood function using stochastic gradient descent to obtain an optimization formula for the word vectors, an optimization formula for the topic vectors and an optimization formula for the auxiliary vectors.
9. The method according to any one of claims 5-8, characterized by further comprising:
obtaining the topic distribution of each document during the parameter estimation on the joint likelihood function.
10. A device for extracting keywords from a document, characterized by comprising:
a vector training module, configured to obtain, by training an implicit topic vector model, at least one topic vector and at least one word vector related to document information, the implicit topic vector model being a fusion model of a topic model and word vectors, wherein the vector training module comprises: a vector construction unit, configured to add the document to a training database and to construct an initial topic vector and an initial word vector respectively for each topic and each word of each document in the training database; a joint likelihood function establishing unit, configured to establish a joint likelihood function of all documents in the training database according to the initial topic vectors and initial word vectors; and a parameter estimation unit, configured to perform parameter estimation on the joint likelihood function to obtain the topic vectors and word vectors;
a distance calculation module, configured to calculate the distances between the word vectors and the topic vectors; and
a keyword extraction module, configured to select, according to the distances between the word vectors and the topic vectors, the words corresponding to a preset number of word vectors as keywords of the document.
11. The device according to claim 10, wherein the distance calculation module is specifically configured to:
select, according to the topic distribution of the document, the topic with the largest topic distribution probability from the at least one topic as the optimal topic; and calculate the distance between the word vector and the topic vector corresponding to the optimal topic.
12. The device according to claim 10, wherein the distance calculation module is specifically configured to:
perform a weighted summation of the distances between the word vector and each topic vector according to the topic distribution probability of each topic of the document; and use the weighted sum as the distance between the word vector and the topic vectors.
13. The device according to any one of claims 10-12, wherein the distance is a cosine distance.
14. The device according to claim 10, wherein the joint likelihood function establishing unit is specifically configured to:
obtain the generating probability of the initial word vectors through a calculation formula; and
obtain the joint likelihood function of all documents in the training database according to the calculation formula.
15. The device according to claim 14, wherein the vector training module further comprises:
a joint likelihood function processing unit, configured to: after the joint likelihood function establishing unit obtains the joint likelihood function according to the calculation formula, process the joint likelihood function using a Gibbs sampling algorithm to obtain the conditional distribution of the topic corresponding to each sentence in each document; determine a specific topic for each sentence of each document according to the conditional distribution probability of each topic in the conditional distribution; and process the joint likelihood function according to the conditional distribution probability of the specific topic to obtain a log-likelihood function;
and wherein the parameter estimation unit is specifically configured to:
perform parameter estimation on the log-likelihood function to obtain the topic vectors and word vectors.
16. The device according to claim 15, wherein the vector training module further comprises:
a log-likelihood function optimization processing unit, configured to: after the joint likelihood function processing unit obtains the log-likelihood function, optimize the parameters in the log-likelihood function using Newton's iteration method;
and/or
optimize the word vectors, topic vectors and auxiliary vectors in the log-likelihood function using a negative sampling algorithm;
and wherein the parameter estimation unit is specifically configured to:
perform parameter estimation on the optimized log-likelihood function to obtain the topic vectors and word vectors.
17. The device according to claim 16, wherein the log-likelihood function optimization processing unit is specifically configured to:
process the words and topics in all documents in the training database using the negative sampling algorithm to obtain a negative sampling likelihood function; and
process the negative sampling likelihood function using stochastic gradient descent to obtain an optimization formula for the word vectors, an optimization formula for the topic vectors and an optimization formula for the auxiliary vectors.
18. The device according to any one of claims 14-17, wherein the parameter estimation unit is further configured to:
obtain the topic distribution of each document during the parameter estimation on the joint likelihood function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510512363.8A CN105069143B (en) | 2015-08-19 | 2015-08-19 | Extract the method and device of keyword in document |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105069143A CN105069143A (en) | 2015-11-18 |
CN105069143B true CN105069143B (en) | 2019-07-23 |
Family
ID=54498512
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510512363.8A Active CN105069143B (en) | 2015-08-19 | 2015-08-19 | Extract the method and device of keyword in document |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105069143B (en) |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105740354B (en) * | 2016-01-26 | 2018-11-30 | 中国人民解放军国防科学技术大学 | The method and device of adaptive potential Di Li Cray model selection |
CN106021272B (en) * | 2016-04-04 | 2019-11-19 | 上海大学 | The keyword extraction method calculated based on distributed expression term vector |
CN106407316B (en) * | 2016-08-30 | 2020-05-15 | 北京航空航天大学 | Software question and answer recommendation method and device based on topic model |
CN108399180B (en) * | 2017-02-08 | 2021-11-26 | 腾讯科技(深圳)有限公司 | Knowledge graph construction method and device and server |
CN107220232B (en) * | 2017-04-06 | 2021-06-11 | 北京百度网讯科技有限公司 | Keyword extraction method and device based on artificial intelligence, equipment and readable medium |
CN109815474B (en) * | 2017-11-20 | 2022-09-23 | 深圳市腾讯计算机系统有限公司 | Word sequence vector determination method, device, server and storage medium |
CN108829822B (en) * | 2018-06-12 | 2023-10-27 | 腾讯科技(深圳)有限公司 | Media content recommendation method and device, storage medium and electronic device |
CN108984526B (en) * | 2018-07-10 | 2021-05-07 | 北京理工大学 | Document theme vector extraction method based on deep learning |
CN109446516B (en) * | 2018-09-28 | 2022-11-11 | 北京赛博贝斯数据科技有限责任公司 | Data processing method and system based on theme recommendation model |
CN109299465A (en) * | 2018-10-17 | 2019-02-01 | 北京京航计算通讯研究所 | The identifying system of file keyword accuracy is promoted based on many algorithms |
CN109408641B (en) * | 2018-11-22 | 2020-06-02 | 山东工商学院 | Text classification method and system based on supervised topic model |
CN110263122B (en) * | 2019-05-08 | 2022-05-17 | 北京奇艺世纪科技有限公司 | Keyword acquisition method and device and computer readable storage medium |
CN110134957B (en) * | 2019-05-14 | 2023-06-13 | 云南电网有限责任公司电力科学研究院 | Scientific and technological achievement warehousing method and system based on semantic analysis |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2009104296A (en) * | 2007-10-22 | 2009-05-14 | Nippon Telegr & Teleph Corp <Ntt> | Related keyword extraction method, device, program, and computer readable recording medium |
CN102081660A (en) * | 2011-01-13 | 2011-06-01 | 西北工业大学 | Method for searching and sequencing keywords of XML documents based on semantic correlation |
US8768960B2 (en) * | 2009-01-20 | 2014-07-01 | Microsoft Corporation | Enhancing keyword advertising using online encyclopedia semantics |
CN104008090A (en) * | 2014-04-29 | 2014-08-27 | 河海大学 | Multi-subject extraction method based on concept vector model |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9384287B2 (en) * | 2014-01-15 | 2016-07-05 | Sap Portals Isreal Ltd. | Methods, apparatus, systems and computer readable media for use in keyword extraction |
- 2015-08-19: CN application CN201510512363.8A, granted as patent CN105069143B/en (status: Active)
Non-Patent Citations (1)
Title |
---|
Research on Keyword Extraction Methods Based on Document Topic Structure; Liu Zhiyuan; www.thunlp.org; 2011-12-31; pp. 18-21
Also Published As
Publication number | Publication date |
---|---|
CN105069143A (en) | 2015-11-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105069143B (en) | Extract the method and device of keyword in document | |
CN109408526B (en) | SQL sentence generation method, device, computer equipment and storage medium | |
CN105426354B (en) | The fusion method and device of a kind of vector | |
CN109145290A (en) | Based on word vector with from the semantic similarity calculation method of attention mechanism | |
Vig et al. | Exploring neural models for query-focused summarization | |
CN106227714A (en) | A kind of method and apparatus obtaining the key word generating poem based on artificial intelligence | |
CN109740158A (en) | A kind of text semantic analysis method and device | |
CN112016320A (en) | English punctuation adding method, system and equipment based on data enhancement | |
CN104915399A (en) | Recommended data processing method based on news headline and recommended data processing method system based on news headline | |
CN108363688A (en) | A kind of name entity link method of fusion prior information | |
CN112860896A (en) | Corpus generalization method and man-machine conversation emotion analysis method for industrial field | |
KR20160112248A (en) | Latent keyparase generation method and apparatus | |
CN107341152B (en) | Parameter input method and device | |
CN107122378B (en) | Object processing method and device and mobile terminal | |
CN110929022A (en) | Text abstract generation method and system | |
CN107665222B (en) | Keyword expansion method and device | |
CN110309513B (en) | Text dependency analysis method and device | |
CN107291686B (en) | Method and system for identifying emotion identification | |
Hong et al. | Comprehensive technology function product matrix for intelligent chatbot patent mining | |
CN111566665B (en) | Apparatus and method for applying image coding recognition in natural language processing | |
CN109727591B (en) | Voice search method and device | |
CN109241993B (en) | Evaluation object emotion classification method and device integrating user and overall evaluation information | |
CN112069816A (en) | Chinese punctuation adding method, system and equipment | |
CN110610001A (en) | Short text integrity identification method and device, storage medium and computer equipment | |
CN110619866A (en) | Speech synthesis method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||