CN105069143A - Method and device for extracting keywords from a document

Publication number: CN105069143A (published 2015-11-18); granted as CN105069143B (2019-07-23)
Application number: CN201510512363.8A, filed 2015-08-19
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 姜迪, 石磊, 林鸿宇
Applicant: Beijing Baidu Netcom Science and Technology Co., Ltd.
Assignees as listed: Baidu Online Network Technology (Beijing) Co., Ltd.; Beijing Baidu Netcom Science and Technology Co., Ltd.
Legal status: Granted; Active

Classifications

    • G06F16/36: Information retrieval of unstructured textual data; creation of semantic tools, e.g. ontologies or thesauri
    • G06F16/313: Information retrieval of unstructured textual data; indexing; selection or weighting of terms for indexing

Abstract

The present invention discloses a method and a device for extracting keywords from a document. The method comprises: training a latent topic vector model, a fusion of a topic model and word vectors, to obtain at least one topic vector and at least one word vector relevant to the information of the document; calculating the distance between the word vectors and the topic vector; and, according to that distance, selecting the words corresponding to a preset number of word vectors as the keywords of the document. The method and device disclosed by the embodiments of the present invention can extract keyword information that accurately expresses the information of the document.

Description

Method and device for extracting keywords from a document
Technical field
Embodiments of the present invention relate to the field of information technology, and in particular to a method and a device for extracting keywords from a document.
Background
In the current era of information explosion, a user cannot possibly browse every document that may contain relevant information. Extracting the keywords of a document therefore gives the user a point of reference, and is of great value in helping the user obtain information accurately and in reducing the cost of obtaining it.
In general, the keywords of a document are necessarily words highly correlated with the document's topics, so the topic information of a document is of great significance for keyword extraction. At present, this problem is mainly addressed with the probability distributions of the Latent Dirichlet Allocation (LDA) model, chiefly by the following two methods.
The first is a likelihood-based method: an LDA model is used to obtain the topic distribution P(z|d) of the document and the word distribution P(w|z) of each topic, and the distribution of words in the document is computed as P(w|d) = Σ_z P(z|d)P(w|z), where z denotes a topic, d a document and w a word. P(w|d) is taken as the importance score of word w in document d, and the K highest-scoring words are selected as the keywords of the document.
The second is a method based on the distance between latent-variable distributions: an LDA model is used to obtain the topic distribution P(z|d) of the document and the topic distribution of each word, P(z|w) = P(w|z)P(z)/P(w) ∝ P(w|z)P(z). The cosine distance between the two distributions is then computed, and the K words with the largest cosine distance are selected as the keywords of the document.
However, each of the above methods for extracting keywords from a document has shortcomings. The first method is severely biased towards high-frequency words: most of the extracted words are high-frequency words under some topic, but such words occur widely across different documents and cannot truly reflect the information expressed by a particular document.
As for the second method, computing P(z|w) ∝ P(w|z)P(z) requires the distribution P(z) of the latent variable, which is not a parameter of the LDA model. One generally uses P(z) = Σ_d P(z|d)P(d), where P(d) is the posterior distribution over documents, and assumes P(d) to be uniform so that P(z) ∝ Σ_d P(z|d). But P(d) is not in fact uniform across documents, so the theoretical foundation of this model is not solid, and its performance in practice is also poor.
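For illustration only, the two prior-art scorings above can be sketched as follows. This is a non-authoritative sketch: the names doc_topics (for P(z|d)), topic_words (for P(w|z)) and topic_prior (for an estimate of P(z)) are assumptions, not part of the patent.

```python
# Illustrative sketch of the two prior-art LDA-based scorings discussed above.
import numpy as np

def likelihood_scores(doc_topics: np.ndarray, topic_words: np.ndarray) -> np.ndarray:
    # Method one: P(w|d) = sum_z P(z|d) P(w|z); a higher score marks a more
    # important word, and the top-K words become the keywords.
    return doc_topics @ topic_words                     # shape (|V|,)

def distribution_distance_scores(doc_topics: np.ndarray,
                                 topic_words: np.ndarray,
                                 topic_prior: np.ndarray) -> np.ndarray:
    # Method two: build P(z|w) proportional to P(w|z) P(z) for every word,
    # then compare it with P(z|d) by cosine similarity; words whose topic
    # distribution is closest to the document's are chosen.
    p_z_w = topic_words * topic_prior[:, None]          # shape (T, |V|)
    p_z_w /= p_z_w.sum(axis=0, keepdims=True)           # normalize per word
    return (doc_topics @ p_z_w) / (
        np.linalg.norm(doc_topics) * np.linalg.norm(p_z_w, axis=0))
```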
Summary of the invention
Embodiments of the present invention provide a method and a device for extracting keywords from a document, capable of extracting keyword information that accurately expresses the information of the document.
In a first aspect, an embodiment of the present invention provides a method for extracting keywords from a document, comprising:
obtaining at least one topic vector and at least one word vector relevant to the document information by training a latent topic vector model, the latent topic vector model being a fusion model of a topic model and word vectors;
calculating the distance between the word vectors and the topic vector;
according to the distance between the word vectors and the topic vector, selecting the words corresponding to a preset number of word vectors as the keywords of the document.
In a second aspect, an embodiment of the present invention further provides a device for extracting keywords from a document, comprising:
a vector training module, configured to obtain at least one topic vector and at least one word vector relevant to the document information by training a latent topic vector model, the latent topic vector model being a fusion model of a topic model and word vectors;
a distance calculation module, configured to calculate the distance between the word vectors and the topic vector;
a keyword extraction module, configured to select, according to the distance between the word vectors and the topic vector, the words corresponding to a preset number of word vectors as the keywords of the document.
In the embodiments of the present invention, a latent topic vector model obtained by fusing a topic model with word vectors is trained on the document to obtain at least one topic vector and at least one word vector relevant to the document information, and the words corresponding to a preset number of word vectors are then selected as the keywords of the document according to the distance between the word vectors and the topic vector. Because training the document with the latent topic vector model captures more of the document's information, the extracted keyword information expresses the document's information accurately.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the method for extracting keywords from a document provided by Embodiment 1 of the present invention;
Fig. 2 is a schematic flowchart of the method for extracting keywords from a document provided by Embodiment 2 of the present invention;
Fig. 3 is a schematic structural diagram of the device for extracting keywords from a document provided by Embodiment 3 of the present invention.
Detailed description
The present invention is described in further detail below in conjunction with the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and do not limit it. It should also be noted that, for convenience of description, the drawings show only the parts related to the present invention rather than the entire structure.
The method for extracting keywords from a document provided by the embodiments of the present invention may be executed by the device for extracting keywords from a document provided by the embodiments, or by a terminal device (for example, a smartphone or a tablet computer) in which that device is integrated; the device may be implemented in hardware or in software.
Embodiment 1
Fig. 1 is a schematic flowchart of the method for extracting keywords from a document provided by Embodiment 1 of the present invention. As shown in Fig. 1, the method specifically comprises:
S11, obtaining at least one topic vector and at least one word vector relevant to the document information by training a latent topic vector model, the latent topic vector model being a fusion model of a topic model and word vectors.
Here, the topic model and the word vector (word embedding) are both semantic representation methods commonly used in the prior art. A topic model assumes that each word is generated by a semantic unit in a latent space; under this assumption, both documents and words can be mapped into the latent semantic space for dimensionality reduction. The word vector is another, distributed representation of a word, which uses a fixed-length vector to represent the meaning of a word.
Topic models generally model at the document or sentence level and attend more to global semantics, whereas word vectors generally assume that the semantics of a word are represented by the words around it and attend more to local, syntax-like information. The two methods have different emphases, and each has separately proven to be of great practical value. The present embodiment therefore combines the two, so that the latent topic vector model can capture more information.
The dimensions of the topic vectors and word vectors can be set as desired, and the value of each element of a vector is obtained by training the latent topic vector model. To make the training result more accurate, the latent topic vector model also incorporates a training database containing a large number of documents.
S12, calculating the distance between the word vectors and the topic vector.
The purpose of training to obtain the word vectors and topic vectors is to compute the importance of each word within the document, rank the words by importance, and select the most important words as the keywords of the document.
In this embodiment, the importance of a word within the document is measured by computing the distance between its word vector and the topic vector; specifically, the Euclidean distance, cosine distance or sine distance between the word vector and the topic vector may be computed, and the criterion for importance depends on which distance is computed. If the Euclidean distance or sine distance between the word vector and the topic vector is computed, then the larger the distance, the more important the word is within the document, that is, the better it reflects the topic expressed by the document; if the cosine distance between the word vector and the topic vector is computed, then the smaller the distance, the more important the word is within the document.
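A minimal sketch of the three distance measures named in step S12 follows; the NumPy representation and function names are assumptions, and which direction of each measure indicates importance follows the convention stated above.

```python
# Illustrative distance measures between a word vector and a topic vector.
import numpy as np

def euclidean_distance(word_vec: np.ndarray, topic_vec: np.ndarray) -> float:
    return float(np.linalg.norm(word_vec - topic_vec))

def cosine_distance(word_vec: np.ndarray, topic_vec: np.ndarray) -> float:
    cos_sim = word_vec @ topic_vec / (
        np.linalg.norm(word_vec) * np.linalg.norm(topic_vec))
    return float(1.0 - cos_sim)

def sine_distance(word_vec: np.ndarray, topic_vec: np.ndarray) -> float:
    # sin(theta) = sqrt(1 - cos(theta)^2) for the angle between the vectors.
    cos_sim = word_vec @ topic_vec / (
        np.linalg.norm(word_vec) * np.linalg.norm(topic_vec))
    return float(np.sqrt(max(0.0, 1.0 - cos_sim ** 2)))
```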
S13, according to the distance between the word vectors and the topic vector, selecting the words corresponding to a preset number of word vectors as the keywords of the document.
The preset number can be set according to the actual situation and is not specifically limited here.
From the calculation result of step S12, the preset number of most important word vectors in the document can be determined, and the words corresponding to those word vectors are then taken as the keywords of the document.
In the present embodiment, a latent topic vector model obtained by fusing a topic model with word vectors is trained on the document to obtain at least one topic vector and at least one word vector relevant to the document information, and the words corresponding to a preset number of word vectors are then selected as the keywords of the document according to the distance between the word vectors and the topic vector. Because training the document with the latent topic vector model captures more of the document's information, the extracted keyword information expresses the document's information accurately.
Illustratively, to improve the accuracy of keyword extraction, embodiments of the present invention provide the following two methods of calculating the distance between the word vectors and the topic vector. The first is a calculation method based on the optimal topic, which mainly comprises the following steps:
selecting, according to the topic distribution of the document, the topic with the largest topic distribution probability from at least one topic as the optimal topic;
calculating the distance between the word vectors and the topic vector corresponding to the optimal topic.
Specifically, for a given document, the latent topic vector model can be trained to obtain its topic distribution P(z|d), comprising the topic distribution probability of each topic of the document. The topic z of largest probability in this distribution, the optimal topic, represents the core content of the document. The most important words in the document can therefore be taken to be those whose vectors are nearest, in the vector space, to the vector representation of topic z. Accordingly, the topic of largest topic distribution probability is selected as the optimal topic, the distance between each word vector and the topic vector corresponding to the optimal topic is calculated, and the words corresponding to a preset number of word vectors are selected as the keywords of the document according to those distances.
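The optimal-topic strategy just described might be sketched as follows, assuming the trained model exposes P(z|d), the topic vectors and a word-to-vector mapping; all names are illustrative.

```python
# Illustrative sketch of the optimal-topic keyword selection.
import numpy as np

def keywords_by_optimal_topic(topic_dist: np.ndarray,   # P(z|d), shape (T,)
                              topic_vecs: np.ndarray,   # shape (T, dim)
                              word_vecs: dict,          # word -> np.ndarray
                              k: int = 3) -> list:
    z_star = topic_vecs[int(np.argmax(topic_dist))]     # optimal topic vector
    def cos_dist(v: np.ndarray) -> float:
        return 1.0 - float(v @ z_star) / (
            float(np.linalg.norm(v)) * float(np.linalg.norm(z_star)))
    # Words closest to the optimal topic vector are taken as keywords.
    return sorted(word_vecs, key=lambda w: cos_dist(word_vecs[w]))[:k]
```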
The second is a calculation method based on the topic distribution, which mainly comprises the following steps:
weighting and summing the distances between a word vector and each topic vector according to the topic distribution probability of each topic of the document;
taking the weighted sum as the distance between the word vector and the topic vector.
Specifically, considering that more than one topic may play an important role in a document, the above method based on the optimal topic may lose part of the information. The distances to the different topics are therefore weighted by P(z|d), yielding a new measure as shown in the following formula:

$$\mathrm{Score\_Distr}(w) = \sum_{z \in Z} P(z \mid d)\, L(w, z)$$

where Score_Distr(w) is the weighted sum and L(w, z) is the distance between the word vector of w and the topic vector of topic z.
The above measure is the word importance score obtained after weighting by the topic distribution of the document. The words are ranked by the Score_Distr(w) obtained in this way, and the words corresponding to a preset number of word vectors are selected as the keywords of the document.
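The weighted score Score_Distr(w) might be computed as in the following sketch, taking L(w, z) to be the cosine distance, one of the measures named earlier; the names are illustrative.

```python
# Illustrative topic-distribution-weighted importance score.
import numpy as np

def score_distr(word_vec: np.ndarray,
                topic_vecs: np.ndarray,          # shape (T, dim)
                topic_dist: np.ndarray) -> float:  # P(z|d), shape (T,)
    sims = topic_vecs @ word_vec / (
        np.linalg.norm(topic_vecs, axis=1) * np.linalg.norm(word_vec))
    dists = 1.0 - sims                           # L(w, z) for every topic z
    return float(topic_dist @ dists)             # sum_z P(z|d) * L(w, z)
```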
Illustratively, an embodiment of the present invention further provides a concrete implementation of obtaining at least one topic vector and at least one word vector relevant to the document information by training the latent topic vector model, which mainly comprises the following steps:
adding the document to a training database, and constructing an initial topic vector for each topic and an initial word vector for each word of each document in the training database;
establishing the joint likelihood function of all documents in the training database according to the initial topic vectors and initial word vectors;
performing parameter estimation on the joint likelihood function to obtain the topic vectors and word vectors.
Here, the training database can be obtained from the internet (for example, the Sina corpus) and contains documents of various types. The initial topic vectors and initial word vectors can be set as desired.
Illustratively, establishing the joint likelihood function according to the initial topic vectors and initial word vectors comprises:
obtaining the generating probability of an initial word vector according to formula one:

Formula one:
$$P(\hat{v}_w \mid x_w) = \frac{e^{x_w \cdot \hat{v}_w}}{\sum_{w'} e^{x_w \cdot \hat{v}_{w'}}}$$

where $\hat{v}_w$ is the auxiliary vector of the word vector $v_w$ of the current word w; $x_w$ denotes the context vector of the current word w, namely the sum of the word vectors of the words surrounding w and the topic vector $v_z$ of the current topic z; and w' ranges over the words of the vocabulary in the normalization.
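Formula one is a softmax over the vocabulary; as a non-authoritative sketch, it might be computed as follows, with the context vector built as defined above. Array layouts and names are assumptions.

```python
# Illustrative computation of the generating probability of formula one.
import numpy as np

def context_vector(surrounding_vecs: np.ndarray,  # word vectors around w
                   topic_vec: np.ndarray) -> np.ndarray:
    # x_w = sum of the surrounding word vectors plus the current topic vector.
    return surrounding_vecs.sum(axis=0) + topic_vec

def generation_prob(x_w: np.ndarray,
                    aux_vecs: np.ndarray,         # auxiliary vectors, (|V|, dim)
                    w_idx: int) -> float:
    logits = aux_vecs @ x_w
    logits -= logits.max()                        # numerical stability
    probs = np.exp(logits)
    return float(probs[w_idx] / probs.sum())      # softmax over the vocabulary
```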
The joint likelihood function of all documents in the training database is then obtained from formula one, as formula two:
Formula two:
where $\alpha_z$ is the Dirichlet prior hyperparameter corresponding to topic z; $\beta_v$ is the Dirichlet prior hyperparameter corresponding to word v; $m_{dz}$ is the number of sentences in document d sampled as topic z; $n_{zv}$ denotes the total number of times word v occurs together with topic z in the training database; M denotes the set of all word vectors and topic vectors; D denotes the total number of documents d; T denotes the total number of topics in document d; and $\hat{v}_v$ denotes the auxiliary vector of word v.
Illustratively, to further optimize the above joint likelihood function, after the joint likelihood function is obtained as formula two according to formula one, the method further comprises the following steps:
processing formula two with the Gibbs sampling algorithm, so that the conditional distribution of the topic corresponding to each sentence s of document d can be obtained as formula three:

Formula three:
$$P(z_{ds} = k \mid \mathbf{w}, \mathbf{z}_{-ds}, \alpha, \beta, M) \approx (m_{dk} + \alpha_k)\, \frac{\Gamma\!\left(\sum_{w=1}^{W} (n_{kw} + \beta_w)\right)}{\Gamma\!\left(\sum_{w=1}^{W} (n_{kw} + \beta_w + N_{iw})\right)} \prod_{w \in s} \frac{\Gamma(n_{kw} + \beta_w + N_{iw})}{\Gamma(n_{kw} + \beta_w)} \prod_{\tilde{w} \in s} e^{x_w \cdot \hat{v}_w}$$

where k is the candidate topic, W is the total number of words in the training database, and $N_{iw}$ is the number of times word w occurs in the i-th sentence of document d;
determining a specific topic for each sentence s of document d according to the conditional distribution probability of each topic in the conditional distribution;
processing formula one according to the conditional distribution probability of the specific topic, to obtain the log-likelihood function of formula four:

Formula four:
$$T \log\!\left(\frac{\Gamma\!\left(\sum_{v=1}^{V} \beta_v\right)}{\prod_{v=1}^{V} \Gamma(\beta_v)}\right) + \sum_{z} \sum_{v} \log \Gamma(n_{zv} + \beta_v) - \sum_{z} \log \Gamma\!\left(\sum_{v} (n_{zv} + \beta_v)\right) + \sum_{d} \sum_{s \in d} \sum_{\tilde{w} \in s} \log P(\hat{v}_w \mid x_w).$$

Performing parameter estimation on the joint likelihood function to obtain the topic vectors and word vectors then comprises:
performing parameter estimation on the log-likelihood function to obtain the topic vectors and word vectors.
Illustratively, the log-likelihood function obtained above can be further optimized, specifically by the following steps:
optimizing the parameters α and β in the log-likelihood function with Newton's method;
and/or,
optimizing the word vectors, topic vectors and auxiliary vectors in the log-likelihood function with the negative sampling algorithm.
Accordingly, performing parameter estimation on the log-likelihood function to obtain the topic vectors and word vectors comprises:
performing parameter estimation on the optimized log-likelihood function to obtain the topic vectors and word vectors.
Illustratively, optimizing the word vectors, topic vectors and auxiliary vectors with the negative sampling algorithm comprises:
processing the words and topics of all documents in the training database with the negative sampling algorithm, to obtain the likelihood function of formula five:
Formula five:
where l is the label of the current word: l = 1 if the current word is a true word and l = 0 if it is a negative-sampled word; |NEG| is the number of negative-sampled words corresponding to a word; and |V| is the total number of words in the training database;
processing formula five with stochastic gradient descent, so that the update formula of the word vectors can be obtained as formula six, the update formula of the topic vectors as formula seven, and the update formula of the auxiliary vectors as formula eight:

Formula six:
$$v_u := v_u + \eta \sum_{u \in c_w \cup NEG(c_w)} \left[ l_u^{c_w} - \sigma\!\left( x_w \cdot \hat{v}_u - \log \frac{|NEG|}{|V|} \right) \right] \hat{v}_u,$$

Formula seven:
$$v_z := v_z + \eta \sum_{u \in c_w \cup NEG(c_w)} \left[ l_u^{c_w} - \sigma\!\left( x_w \cdot \hat{v}_u - \log \frac{|NEG|}{|V|} \right) \right] \hat{v}_u,$$

Formula eight:
$$\hat{v}_u := \hat{v}_u + \eta \left[ l_u^{c_w} - \sigma\!\left( x_w \cdot \hat{v}_u - \log \frac{|NEG|}{|V|} \right) \right] x_w.$$
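As a non-authoritative sketch, formulas six to eight amount to the following gradient step, where labels holds l (1 for the true word, 0 for each negative sample) and eta is a learning rate; all names and shapes are assumptions.

```python
# Illustrative negative-sampling SGD step for formulas six to eight.
import numpy as np

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_step(x_w, v_u, v_z, aux, labels, eta, neg, vocab):
    """aux holds the auxiliary vectors of the sampled words, row-aligned
    with labels; neg and vocab stand for |NEG| and |V|."""
    shift = np.log(neg / vocab)             # log(|NEG| / |V|)
    accum = np.zeros_like(v_u)
    for i, l in enumerate(labels):
        g = l - sigmoid(float(x_w @ aux[i]) - shift)
        accum += g * aux[i]
        aux[i] = aux[i] + eta * g * x_w     # formula eight
    v_u = v_u + eta * accum                 # formula six
    v_z = v_z + eta * accum                 # formula seven
    return v_u, v_z, aux
```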
For the corpus in the training database, the latent topic vector model provided by the embodiments of the present invention yields a vectorized representation of each topic, whereas the topic model provided in the prior art learns a multinomial distribution of words under each topic. For each topic, the embodiment of the present invention compares the 10 words of largest probability in the multinomial distribution of words under that topic with the 10 word vectors closest to the topic vector; the results are shown in Table 1 below:
Table 1
As can be seen from Table 1, the multinomial distribution of the topic model has a clear inclination towards high-frequency words, while medium- and low-frequency words are only weakly connected to topics by the traditional topic distribution. As a result, when the multinomial distribution is used for keyword extraction, the topic model naturally favors high-frequency words, leading to poor keyword extraction results. The vectorized representation of the latent topic model eliminates this problem: as the table shows, the words nearest to a topic vector are usually words carrying concrete meaning under that topic, which is why a model using topic vectors can obtain better results in the keyword extraction task.
Therefore, the above embodiments likewise train on the document with a latent topic vector model obtained by fusing a topic model with word vectors, obtain at least one topic vector and at least one word vector relevant to the document information, and then select the words corresponding to a preset number of word vectors as the keywords of the document according to the distance between the word vectors and the topic vector. Because training the document with the latent topic vector model captures more of the document's information, the extracted keyword information expresses the document's information accurately.
Embodiment 2
Fig. 2 is a schematic flowchart of the method for extracting keywords from a document provided by Embodiment 2 of the present invention. As shown in Fig. 2, the method specifically comprises:
S21, adding the document to be processed to a training database, and constructing an initial topic vector for each topic and an initial word vector for each word of each document in the training database;
S22, obtaining the generating probability of an initial word vector according to formula one:

Formula one:
$$P(\hat{v}_w \mid x_w) = \frac{e^{x_w \cdot \hat{v}_w}}{\sum_{w'} e^{x_w \cdot \hat{v}_{w'}}}$$

where $\hat{v}_w$ is the auxiliary vector of the word vector $v_w$ of the current word w; $x_w$ denotes the context vector of the current word w, namely the sum of the word vectors of the words surrounding w and the topic vector $v_z$ of the current topic z; and w' ranges over the words of the vocabulary in the normalization;
S23, obtaining the joint likelihood function of all documents in the training database from formula one, as formula two:
Formula two:
where $\alpha_z$ is the Dirichlet prior hyperparameter corresponding to topic z; $\beta_v$ is the Dirichlet prior hyperparameter corresponding to word v; $m_{dz}$ is the number of sentences in document d sampled as topic z; $n_{zv}$ denotes the total number of times word v occurs together with topic z in the training database; M denotes the set of all word vectors and topic vectors; D denotes the total number of documents d; T denotes the total number of topics in document d; and $\hat{v}_v$ denotes the auxiliary vector of word v;
S24, processing formula two with the Gibbs sampling algorithm, so that the conditional distribution of the topic corresponding to each sentence s of document d can be obtained as formula three:

Formula three:
$$P(z_{ds} = k \mid \mathbf{w}, \mathbf{z}_{-ds}, \alpha, \beta, M) \approx (m_{dk} + \alpha_k)\, \frac{\Gamma\!\left(\sum_{w=1}^{W} (n_{kw} + \beta_w)\right)}{\Gamma\!\left(\sum_{w=1}^{W} (n_{kw} + \beta_w + N_{iw})\right)} \prod_{w \in s} \frac{\Gamma(n_{kw} + \beta_w + N_{iw})}{\Gamma(n_{kw} + \beta_w)} \prod_{\tilde{w} \in s} e^{x_w \cdot \hat{v}_w}$$

where k is the candidate topic, W is the total number of words in the training database, and $N_{iw}$ is the number of times word w occurs in the i-th sentence of document d;
S25, determining a specific topic for each sentence s of document d according to the conditional distribution probability of each topic in the conditional distribution;
S26, processing formula one according to the conditional distribution probability of the specific topic, to obtain the log-likelihood function of formula four:

Formula four:
$$T \log\!\left(\frac{\Gamma\!\left(\sum_{v=1}^{V} \beta_v\right)}{\prod_{v=1}^{V} \Gamma(\beta_v)}\right) + \sum_{z} \sum_{v} \log \Gamma(n_{zv} + \beta_v) - \sum_{z} \log \Gamma\!\left(\sum_{v} (n_{zv} + \beta_v)\right) + \sum_{d} \sum_{s \in d} \sum_{\tilde{w} \in s} \log P(\hat{v}_w \mid x_w);$$

S27, optimizing the parameters α and β in the log-likelihood function with Newton's method, and optimizing the word vectors, topic vectors and auxiliary vectors in the log-likelihood function with the negative sampling algorithm;
S28, performing parameter estimation on the optimized log-likelihood function to obtain the topic vectors and word vectors of the document to be processed;
S29, calculating the cosine distance between the word vectors and the topic vector;
S210, according to the cosine distance between the word vectors and the topic vector, selecting the words corresponding to a preset number of word vectors as the keywords of the document to be processed.
To verify the effectiveness of the embodiments of the present invention, the inventors conducted several groups of comparative experiments on experimental data sets of different scales; in every case the results surpassed the best of the traditional topic-model-based methods.
First group of experiments: small-scale data
Purpose: from among all the words of a document, pick out the keywords that best embody the meaning of the document.
Training data: the development set, training set and test set of the Sina corpus, comprising 32000 documents in total.
Test data: the test set of the Sina corpus; each document in the test set is accompanied by its reference keywords. 1000 documents in total.
Evaluation method: for each document, each model generates 3 keywords. Precision and recall are used to assess the results. Precision is the percentage of the keywords predicted by the model that are correct; recall is the percentage of the keywords in the reference answer that the model predicts correctly. The micro-average is used as the evaluation index, that is, precision and recall are computed for each document separately and then averaged.
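The evaluation just described might be implemented as in the following sketch; the dict-of-sets representation of predictions and references is an assumption.

```python
# Illustrative per-document precision/recall, averaged over documents.
def evaluate(predictions: dict, references: dict) -> tuple:
    """predictions and references map a document id to a set of keywords."""
    precisions, recalls = [], []
    for doc_id, predicted in predictions.items():
        gold = references[doc_id]
        correct = len(predicted & gold)
        precisions.append(correct / len(predicted) if predicted else 0.0)
        recalls.append(correct / len(gold) if gold else 0.0)
    n = len(precisions)
    return sum(precisions) / n, sum(recalls) / n
```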
Experimental setup: the cases with and without stop-word removal from the corpus are considered separately, and the likelihood-based methods of several LDA and SentenceLDA models are compared against the latent topic vector model used in the embodiment of the present invention. In the LDA and sentence-level latent Dirichlet allocation (SentenceLDA, sLDA) methods, Σ_z P(z|d)P(w|z) is computed for each word of a document as the word's score in the current document, and the 3 words of largest value are taken as keywords. In all of the above methods, the embodiment of the present invention removes the words consisting of a single character from the whole corpus. The results are shown in Table 2 below:
Table 2
Analysis: in the above results it can be seen that, whether or not stop words are removed, the method of the embodiment of the present invention achieves the best experimental results. In the experiment with stop words removed, compared with the LDA model, the improvement achieved by the latent topic vector model provided by the embodiment of the present invention reaches 20.9%. Meanwhile, whether stop words are removed has no effect on the final result of the latent topic vector model, which shows that the model has a certain ability to resist noise. In addition, the experimental results of the calculation method based on the topic distribution are better than those of the calculation method based on the optimal topic, which shows that taking more topic information into account when generating the final keywords is helpful to the final result.
Second group of experiments: large-scale data
Purpose: from among all the words of a document, pick out the keywords that best embody the meaning of the document.
Training data: the development set, training set and test set of the Sina corpus, plus corpus data from the news domain, comprising 261173 documents in total.
Test data: the test set of the Sina corpus; each document in the test set is accompanied by its reference keywords. 1000 documents in total.
Evaluation method: for each document, each model generates 3 keywords. Precision and recall are used to assess the results. Precision is the percentage of the keywords predicted by the model that are correct; recall is the percentage of the keywords in the reference answer that the model predicts correctly. The micro-average is used as the evaluation index, that is, precision and recall are computed for each document separately and then averaged.
Experimental setup: the cases with and without stop-word removal from the corpus are considered separately, and the likelihood-function-based methods of several LDA and SentenceLDA models are compared against the latent topic vector model used by the embodiment of the present invention. In the likelihood-based (PL) methods of LDA and SentenceLDA, Σ_z P(z|d)P(w|z) is computed for each word of a document as the word's score in the current document, and the 3 words of largest value are taken as keywords. Meanwhile, the method of the embodiment of the present invention is also compared with the LDA method based on the distance between latent-variable distributions: from the topic distribution P(z|d) of the document and the topic distribution of each word, P(z|w) = P(w|z)P(z)/P(w) ∝ P(w|z)P(z), the cosine distance between the two distributions is computed and the words are sorted by distance, and the 3 words whose topic distributions are closest to the document's topic distribution are chosen as the keywords of the document. In all of the above methods, the words consisting of a single character are removed from the whole corpus. The results are shown in Table 3 below:
Table 3
Analysis: in the above results it can be seen that the method of the embodiment of the present invention still achieves the best experimental results, and the conclusions drawn on the small-scale corpus still hold on the large-scale corpus. Meanwhile, after the large-scale training corpus is added, the improvement of the methods based on LDA and SentenceLDA is not significant, whereas the experimental results of the method of the embodiment of the present invention improve markedly after the corpus is added: the calculation method based on the optimal topic improves by 12.1%, and the calculation method based on the topic distribution improves by 6.5%. Moreover, as the model's corpus grows further, the experimental results of the embodiment of the present invention still have the potential to improve further.
Embodiment 3
Fig. 3 is a schematic structural diagram of the device for extracting keywords from a document provided by Embodiment 3 of the present invention. As shown in Fig. 3, the device specifically comprises: a vector training module 31, a distance calculation module 32 and a keyword extraction module 33.
The vector training module 31 is configured to obtain at least one topic vector and at least one word vector relevant to the document information by training a latent topic vector model, the latent topic vector model being a fusion model of a topic model and word vectors.
The distance calculation module 32 is configured to calculate the distance between the word vectors and the topic vector.
The keyword extraction module 33 is configured to select, according to the distance between the word vectors and the topic vector, the words corresponding to a preset number of word vectors as the keywords of the document.
The device for extracting keywords from a document described in this embodiment is used to perform the methods for extracting keywords from a document described in the above embodiments; its technical principle and the technical effects produced are similar and are not repeated here.
Illustratively, on the basis of the above embodiment, the distance calculation module 32 is specifically configured to:
select, according to the topic distribution of the document, the topic with the largest topic distribution probability from at least one topic as the optimal topic; and calculate the distance between the word vectors and the topic vector corresponding to the optimal topic.
Illustratively, on the basis of the above embodiment, the distance calculation module 32 is alternatively specifically configured to:
weight and sum the distances between a word vector and each topic vector according to the topic distribution probability of each topic of the document; and take the weighted sum as the distance between the word vector and the topic vector.
Illustratively, on the basis of the above embodiment, the distance is the cosine distance.
Illustratively, on the basis of the above embodiment, the vector training module 31 comprises: a vector construction unit 311, a joint likelihood function establishing unit 312 and a parameter estimation unit 313.
The vector construction unit 311 is configured to add the document to a training database, and to construct an initial topic vector for each topic and an initial word vector for each word of each document in the training database.
The joint likelihood function establishing unit 312 is configured to establish the joint likelihood function of all documents in the training database according to the initial topic vectors and initial word vectors.
The parameter estimation unit 313 is configured to perform parameter estimation on the joint likelihood function to obtain the topic vectors and word vectors.
Illustratively, the joint likelihood function establishing unit 312 is specifically configured to:
obtain the generating probability of an initial word vector according to formula one:

Formula one:
$$P(\hat{v}_w \mid x_w) = \frac{e^{x_w \cdot \hat{v}_w}}{\sum_{w'} e^{x_w \cdot \hat{v}_{w'}}}$$

where $\hat{v}_w$ is the auxiliary vector of the word vector $v_w$ of the current word w; $x_w$ denotes the context vector of the current word w, namely the sum of the word vectors of the words surrounding w and the topic vector $v_z$ of the current topic z; and w' ranges over the words of the vocabulary in the normalization; and
obtain the joint likelihood function of all documents in the training database from formula one, as formula two:
Formula two:
where $\alpha_z$ is the Dirichlet prior hyperparameter corresponding to topic z; $\beta_v$ is the Dirichlet prior hyperparameter corresponding to word v; $m_{dz}$ is the number of sentences in document d sampled as topic z; $n_{zv}$ denotes the total number of times word v occurs together with topic z in the training database; M denotes the set of all word vectors and topic vectors; D denotes the total number of documents d; T denotes the total number of topics in document d; and $\hat{v}_v$ denotes the auxiliary vector of word v.
Illustratively, the vector training module 31 further comprises: a joint likelihood function processing unit 314.
The joint likelihood function processing unit 314 is configured to, after the joint likelihood function establishing unit 312 obtains the joint likelihood function as formula two according to formula one, process formula two with the Gibbs sampling algorithm, so that the conditional distribution of the topic corresponding to each sentence s of document d can be obtained as formula three:

Formula three:
$$P(z_{ds} = k \mid \mathbf{w}, \mathbf{z}_{-ds}, \alpha, \beta, M) \approx (m_{dk} + \alpha_k)\, \frac{\Gamma\!\left(\sum_{w=1}^{W} (n_{kw} + \beta_w)\right)}{\Gamma\!\left(\sum_{w=1}^{W} (n_{kw} + \beta_w + N_{iw})\right)} \prod_{w \in s} \frac{\Gamma(n_{kw} + \beta_w + N_{iw})}{\Gamma(n_{kw} + \beta_w)} \prod_{\tilde{w} \in s} e^{x_w \cdot \hat{v}_w}$$

where k is the candidate topic, W is the total number of words in the training database, and $N_{iw}$ is the number of times word w occurs in the i-th sentence of document d;
to determine a specific topic for each sentence s of document d according to the conditional distribution probability of each topic in the conditional distribution; and
to process formula one according to the conditional distribution probability of the specific topic, obtaining the log-likelihood function of formula four:

Formula four:
$$T \log\!\left(\frac{\Gamma\!\left(\sum_{v=1}^{V} \beta_v\right)}{\prod_{v=1}^{V} \Gamma(\beta_v)}\right) + \sum_{z} \sum_{v} \log \Gamma(n_{zv} + \beta_v) - \sum_{z} \log \Gamma\!\left(\sum_{v} (n_{zv} + \beta_v)\right) + \sum_{d} \sum_{s \in d} \sum_{\tilde{w} \in s} \log P(\hat{v}_w \mid x_w).$$
The parameter estimation unit 313 is then specifically configured to:
perform parameter estimation on the log-likelihood function to obtain the topic vectors and word vectors.
Illustratively, the vector training module 31 further comprises: a log-likelihood function optimization unit 315.
The log-likelihood function optimization unit 315 is configured to, after the joint likelihood function processing unit 314 obtains the log-likelihood function of formula four, optimize the parameters α and β in the log-likelihood function with Newton's method;
and/or,
optimize the word vectors, topic vectors and auxiliary vectors in the log-likelihood function with the negative sampling algorithm.
The parameter estimation unit 313 is then specifically configured to:
perform parameter estimation on the optimized log-likelihood function to obtain the topic vectors and word vectors.
Illustratively, the log-likelihood function optimization unit 315 is specifically configured to:
process the words and topics of all documents in the training database with the negative sampling algorithm, to obtain the likelihood function of formula five:
Formula five:
where l is the label of the current word, |NEG| is the number of negative-sampled words corresponding to a word, and |V| is the total number of words in the training database; and
process formula five with stochastic gradient descent, so that the update formula of the word vectors can be obtained as formula six, the update formula of the topic vectors as formula seven, and the update formula of the auxiliary vectors as formula eight:

Formula six:
$$v_u := v_u + \eta \sum_{u \in c_w \cup NEG(c_w)} \left[ l_u^{c_w} - \sigma\!\left( x_w \cdot \hat{v}_u - \log \frac{|NEG|}{|V|} \right) \right] \hat{v}_u,$$

Formula seven:
$$v_z := v_z + \eta \sum_{u \in c_w \cup NEG(c_w)} \left[ l_u^{c_w} - \sigma\!\left( x_w \cdot \hat{v}_u - \log \frac{|NEG|}{|V|} \right) \right] \hat{v}_u,$$

Formula eight:
$$\hat{v}_u := \hat{v}_u + \eta \left[ l_u^{c_w} - \sigma\!\left( x_w \cdot \hat{v}_u - \log \frac{|NEG|}{|V|} \right) \right] x_w.$$
Illustratively, the parameter estimation unit 313 is further configured to:
in the process of performing parameter estimation on the joint likelihood function, obtain the topic distribution of each document according to formula nine:

Formula nine:
$$P(z \mid d) = \frac{m_{dz} + \alpha_z}{\sum_{z'=1}^{K} (m_{dz'} + \alpha_{z'})}$$

where K is the total number of topics z in document d.
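Formula nine can be read as a simple count-plus-prior normalization; a minimal sketch, assuming m_d holds the sentence-topic counts of document d and alpha the Dirichlet priors (array names are assumptions):

```python
# Illustrative recovery of P(z|d) from sentence-topic counts and priors.
import numpy as np

def topic_distribution(m_d: np.ndarray, alpha: np.ndarray) -> np.ndarray:
    """m_d[z] = number of sentences of document d sampled as topic z."""
    unnorm = m_d + alpha
    return unnorm / unnorm.sum()
```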
The device for extracting keywords from a document described in the above embodiments is likewise used to perform the methods for extracting keywords from a document described in the above embodiments; its technical principle and the technical effects produced are similar and are not repeated here.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will appreciate that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made by a person skilled in the art without departing from the scope of protection of the present invention. Therefore, although the present invention has been described in further detail through the above embodiments, the present invention is not limited to the above embodiments and may include other equivalent embodiments without departing from the concept of the present invention; the scope of the present invention is determined by the appended claims.

Claims (20)

1. A method for extracting keywords from a document, characterized by comprising:
obtaining at least one topic vector and at least one word vector relevant to the document information by training a latent topic vector model, the latent topic vector model being a fusion model of a topic model and word vectors;
calculating the distance between the word vectors and the topic vector;
according to the distance between the word vectors and the topic vector, selecting the words corresponding to a preset number of word vectors as the keywords of the document.
2. The method according to claim 1, characterized in that calculating the distance between the word vectors and the topic vector comprises:
selecting, according to the topic distribution of the document, the topic with the largest topic distribution probability from at least one topic as the optimal topic;
calculating the distance between the word vectors and the topic vector corresponding to the optimal topic.
3. The method according to claim 1, characterized in that calculating the distance between the word vectors and the topic vector comprises:
weighting and summing the distances between a word vector and each topic vector according to the topic distribution probability of each topic of the document;
taking the weighted sum as the distance between the word vector and the topic vector.
4. The method according to any one of claims 1-3, characterized in that the distance is the cosine distance.
5. The method according to any one of claims 1-3, characterized in that obtaining at least one topic vector and at least one word vector relevant to the document information by training a latent topic vector model comprises:
adding the document to a training database, and constructing an initial topic vector for each topic and an initial word vector for each word of each document in the training database;
establishing the joint likelihood function of all documents in the training database according to the initial topic vectors and initial word vectors;
performing parameter estimation on the joint likelihood function to obtain the topic vectors and word vectors.
6. The method according to claim 5, characterized in that establishing the joint likelihood function of all documents in the training database according to the initial topic vectors and initial word vectors comprises:
obtaining the generating probability of an initial word vector by a computing formula;
obtaining the joint likelihood function of all documents in the training database according to the computing formula.
7. The method according to claim 6, characterized by, after the joint likelihood function is obtained according to the computing formula, further comprising:
processing the joint likelihood function with the Gibbs sampling algorithm, so that the conditional distribution of the topic corresponding to each sentence of each document can be obtained;
determining a specific topic for each sentence of each document according to the conditional distribution probability of each topic in the conditional distribution;
processing the joint likelihood function according to the conditional distribution probability of the specific topic, to obtain a log-likelihood function;
wherein performing parameter estimation on the joint likelihood function to obtain the topic vectors and word vectors comprises:
performing parameter estimation on the log-likelihood function to obtain the topic vectors and word vectors.
8. The method according to claim 7, characterized by, after the log-likelihood function is obtained, further comprising:
optimizing the parameters in the log-likelihood function with Newton's method;
and/or,
optimizing the word vectors, topic vectors and auxiliary vectors in the log-likelihood function with the negative sampling algorithm;
wherein performing parameter estimation on the log-likelihood function to obtain the topic vectors and word vectors comprises:
performing parameter estimation on the optimized log-likelihood function to obtain the topic vectors and word vectors.
9. The method according to claim 8, characterized in that optimizing the word vectors, topic vectors and auxiliary vectors with the negative sampling algorithm comprises:
processing the words and topics of all documents in the training database with the negative sampling algorithm, to obtain a negative sampling likelihood function;
processing the negative sampling likelihood function with stochastic gradient descent, to obtain the update formula of the word vectors, the update formula of the topic vectors and the update formula of the auxiliary vectors.
10. The method according to any one of claims 6-9, characterized by further comprising:
obtaining the topic distribution of each document in the process of performing parameter estimation on the joint likelihood function.
11. A device for extracting keywords from a document, characterized by comprising:
a vector training module, configured to obtain at least one topic vector and at least one word vector relevant to the document information by training a latent topic vector model, the latent topic vector model being a fusion model of a topic model and word vectors;
a distance calculation module, configured to calculate the distance between the word vectors and the topic vector;
a keyword extraction module, configured to select, according to the distance between the word vectors and the topic vector, the words corresponding to a preset number of word vectors as the keywords of the document.
12. The device according to claim 11, characterized in that the distance calculation module is specifically configured to:
select, according to the topic distribution of the document, the topic with the largest topic distribution probability from at least one topic as the optimal topic; and calculate the distance between the word vectors and the topic vector corresponding to the optimal topic.
13. The device according to claim 11, characterized in that the distance calculation module is specifically configured to:
weight and sum the distances between a word vector and each topic vector according to the topic distribution probability of each topic of the document; and take the weighted sum as the distance between the word vector and the topic vector.
14. The device according to any one of claims 11-13, characterized in that the distance is the cosine distance.
15. The device according to any one of claims 11-13, characterized in that the vector training module comprises:
a vector construction unit, configured to add the document to a training database, and to construct an initial topic vector for each topic and an initial word vector for each word of each document in the training database;
a joint likelihood function establishing unit, configured to establish the joint likelihood function of all documents in the training database according to the initial topic vectors and initial word vectors;
a parameter estimation unit, configured to perform parameter estimation on the joint likelihood function to obtain the topic vectors and word vectors.
16. The device according to claim 15, characterized in that the joint likelihood function establishing unit is specifically configured to:
obtain the generating probability of an initial word vector by a computing formula;
obtain the joint likelihood function of all documents in the training database according to the computing formula.
17. The device according to claim 16, characterized in that the vector training module further comprises:
a joint likelihood function processing unit, configured to, after the joint likelihood function establishing unit obtains the joint likelihood function according to the computing formula, process the joint likelihood function with the Gibbs sampling algorithm, so that the conditional distribution of the topic corresponding to each sentence of each document can be obtained;
to determine a specific topic for each sentence of each document according to the conditional distribution probability of each topic in the conditional distribution; and
to process the joint likelihood function according to the conditional distribution probability of the specific topic, obtaining a log-likelihood function;
wherein the parameter estimation unit is specifically configured to:
perform parameter estimation on the log-likelihood function to obtain the topic vectors and word vectors.
18. The device according to claim 17, characterized in that the vector training module further comprises:
a log-likelihood function optimization unit, configured to, after the joint likelihood function processing unit obtains the log-likelihood function, optimize the parameters in the log-likelihood function with Newton's method;
and/or,
optimize the word vectors, topic vectors and auxiliary vectors in the log-likelihood function with the negative sampling algorithm;
wherein the parameter estimation unit is specifically configured to:
perform parameter estimation on the optimized log-likelihood function to obtain the topic vectors and word vectors.
19. The device according to claim 18, characterized in that the log-likelihood function optimization unit is specifically configured to:
process the words and topics of all documents in the training database with the negative sampling algorithm, to obtain a negative sampling likelihood function;
process the negative sampling likelihood function with stochastic gradient descent, to obtain the update formula of the word vectors, the update formula of the topic vectors and the update formula of the auxiliary vectors.
20. The device according to any one of claims 16-19, characterized in that the parameter estimation unit is further configured to:
obtain the topic distribution of each document in the process of performing parameter estimation on the joint likelihood function.
CN201510512363.8A 2015-08-19 2015-08-19 Extract the method and device of keyword in document Active CN105069143B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510512363.8A CN105069143B (en) 2015-08-19 2015-08-19 Extract the method and device of keyword in document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510512363.8A CN105069143B (en) 2015-08-19 2015-08-19 Extract the method and device of keyword in document

Publications (2)

Publication Number Publication Date
CN105069143A true CN105069143A (en) 2015-11-18
CN105069143B CN105069143B (en) 2019-07-23

Family

ID=54498512

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510512363.8A Active CN105069143B (en) 2015-08-19 2015-08-19 Extract the method and device of keyword in document

Country Status (1)

Country Link
CN (1) CN105069143B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009104296A (en) * 2007-10-22 2009-05-14 Nippon Telegr & Teleph Corp <Ntt> Related keyword extraction method, device, program, and computer readable recording medium
US8768960B2 (en) * 2009-01-20 2014-07-01 Microsoft Corporation Enhancing keyword advertising using online encyclopedia semantics
CN102081660A (en) * 2011-01-13 2011-06-01 Method for searching and ranking keywords of XML documents based on semantic correlation
US20150199438A1 (en) * 2014-01-15 2015-07-16 Roman Talyansky Methods, apparatus, systems and computer readable media for use in keyword extraction
CN104008090A (en) * 2014-04-29 2014-08-27 Multi-topic extraction method based on concept vector model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Liu Zhiyuan: "Research on Keyword Extraction Methods Based on Document Topic Structure", WWW.THUNLP.ORG *

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740354B (en) * 2016-01-26 2018-11-30 Method and device for adaptive latent Dirichlet model selection
CN105740354A (en) * 2016-01-26 2016-07-06 Adaptive latent Dirichlet model selection method and apparatus
CN106021272A (en) * 2016-04-04 2016-10-12 Automatic keyword extraction method based on distributed-representation word vectors
CN106021272B (en) * 2016-04-04 2019-11-19 Keyword extraction method based on distributed-representation term vectors
CN106407316A (en) * 2016-08-30 2017-02-15 Topic model-based software question and answer recommendation method and device
CN106407316B (en) * 2016-08-30 2020-05-15 Software question and answer recommendation method and device based on topic model
CN108399180B (en) * 2017-02-08 2021-11-26 Knowledge graph construction method, device and server
CN108399180A (en) * 2017-02-08 2018-08-14 Knowledge graph construction method, device and server
CN107220232A (en) * 2017-04-06 2017-09-29 Artificial-intelligence-based keyword extraction method, device, equipment and computer-readable recording medium
CN107220232B (en) * 2017-04-06 2021-06-11 Artificial-intelligence-based keyword extraction method, device, equipment and readable medium
CN109815474A (en) * 2017-11-20 2019-05-28 Word sequence vector determination method, apparatus, server and storage medium
CN109815474B (en) * 2017-11-20 2022-09-23 Word sequence vector determination method, device, server and storage medium
CN108829822A (en) * 2018-06-12 2018-11-16 Media content recommendation method and device, storage medium and electronic device
CN108829822B (en) * 2018-06-12 2023-10-27 Media content recommendation method and device, storage medium and electronic device
CN108984526A (en) * 2018-07-10 2018-12-11 Document topic vector extraction method based on deep learning
CN108984526B (en) * 2018-07-10 2021-05-07 Document topic vector extraction method based on deep learning
CN109446516A (en) * 2018-09-28 2019-03-08 Data processing method and system based on topic recommendation model
CN109446516B (en) * 2018-09-28 2022-11-11 Data processing method and system based on topic recommendation model
CN109299465A (en) * 2018-10-17 2019-02-01 System for improving document keyword accuracy based on multiple algorithms
CN109408641A (en) * 2018-11-22 2019-03-01 Document classification method and system based on a supervised topic model
CN110263122A (en) * 2019-05-08 2019-09-20 Keyword acquisition method, device and computer-readable storage medium
CN110263122B (en) * 2019-05-08 2022-05-17 Keyword acquisition method, device and computer-readable storage medium
CN110134957A (en) * 2019-05-14 2019-08-16 Scientific and technological achievement warehousing method and system based on semantic analysis
CN110134957B (en) * 2019-05-14 2023-06-13 Scientific and technological achievement warehousing method and system based on semantic analysis

Also Published As

Publication number Publication date
CN105069143B (en) 2019-07-23

Similar Documents

Publication Publication Date Title
CN105069143A (en) Method and device for extracting keywords from document
CN104376406B (en) Enterprise innovation resource management and analysis method based on big data
CN105224695B (en) Text feature quantization method and device and text classification method and device based on information entropy
CN104834747A (en) Short text classification method based on convolutional neural network
CN105183833A (en) Microblog text recommendation method and apparatus based on a user model
CN106599029A (en) Chinese short text clustering method
CN103870001A (en) Input method candidate item generating method and electronic device
CN101295294A (en) Improved Bayesian word sense disambiguation method based on information gain
CN103870474A (en) News topic organizing method and device
CN100511214C (en) Method and system for batch single-document summarization over a document set
CN103870000A (en) Method and device for ranking candidate items generated by an input method
CN106294618A (en) Search method and device
CN104408033A (en) Text information extraction method and system
CN105550170A (en) Chinese word segmentation method and apparatus
CN103869998A (en) Method and device for ranking candidate items generated by an input method
CN102629272A (en) Clustering-based optimization method for examination system database
CN112115716A (en) Service discovery method, system and equipment based on multi-dimensional word vector context matching
CN107656920A (en) Method for recommending skilled personnel based on patents
CN107885717A (en) Keyword extraction method and device
CN109740158A (en) Text semantic analysis method and device
CN102915448A (en) AdaBoost-based automatic classification method for 3D (three-dimensional) models
CN108090178A (en) Text data analysis method, device, server and storage medium
CN107015965A (en) Chinese text sentiment analysis device and method
CN106681986A (en) Multi-dimensional sentiment analysis system
CN105243053A (en) Method and apparatus for extracting key sentences from a document

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant