CN108376131A - Keyword abstraction method based on seq2seq deep neural network models - Google Patents
- Publication number
- CN108376131A CN108376131A CN201810211285.1A CN201810211285A CN108376131A CN 108376131 A CN108376131 A CN 108376131A CN 201810211285 A CN201810211285 A CN 201810211285A CN 108376131 A CN108376131 A CN 108376131A
- Authority
- CN
- China
- Prior art keywords
- word
- document
- neural network
- keyword
- term vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
Abstract
The present invention relates to the field of computing, and in particular to a keyword extraction method based on a seq2seq (sequence-to-sequence) deep neural network model. The method first extracts target information through a preprocessing module, then converts and tags it in a word-vector conversion module and a part-of-speech tagging module respectively, passes the result through a candidate-word weight computation module to obtain a candidate word sequence, and finally obtains keywords through a candidate-word screening module. By treating the document vector as the average of its word vectors and combining word vectors and the document vector into a single vector representation for each word, the invention can better analyze the importance of each word to the document and select keywords that better represent its gist. It also broadens the scope of keyword extraction, overcoming the inability of existing extraction techniques to predict keywords outside the vocabulary or absent from the source document.
Description
Technical field
The present invention relates to the field of computing, and in particular to a keyword extraction method based on a seq2seq deep neural network model.
Background technology
Development with computer and network technologies and the arrival in big data epoch, digitized file is just with surprising
Speed madness increase, a large amount of human contacts to information be all to exist with electrical file form.It is vast as the open sea in face of these
Information, people can automatically identify the keyword that can most represent article purport there is an urgent need to machine, us helped quickly to understand
Article main contents, to save reading, processing and utilize the time of these electronic documents.
These techniques are collectively known as keyword extraction (Keyword Extraction). Keyword extraction refers to quickly obtaining from a document several words or phrases that represent its topic, serving as a distilled summary of the document's main content. Through keywords, people can quickly understand the main content of a document and efficiently grasp its subject. Keywords are widely used in fields such as news reporting and scientific publishing, allowing people to manage and retrieve documents efficiently. Keyword extraction has therefore become an important research topic in the field of text processing.
Many web-page keyword extraction methods already exist. Most start from features such as a word's frequency of occurrence, its position within the full text, and the semantics of the word itself. The methods in use fall roughly into the following classes: statistics-based methods, machine-learning methods, and natural-language-processing methods.
However, all of these methods have shortcomings. For keyword extraction, they evaluate and rank the candidate keywords of a text and then take the top N words as the page's keywords. Yet among those N keywords, not all words are truly related to the text's theme, while among the candidates that were not extracted there remain words highly relevant to the theme, so the precision and recall of keyword extraction are not high.
Among searched keywords, roughly half do not come from the source document, yet existing keyword extraction techniques can only select candidate keywords from the source document. They therefore cannot predict keywords absent from the source document, and cannot use near-synonyms of document words as keywords, which seriously affects the precision of keyword extraction.
At the same time, existing keyword extraction techniques can only choose candidate words from a vocabulary of limited size; when the number of distinct words in the documents far exceeds the vocabulary size, the inability to predict out-of-vocabulary words likewise affects the precision of keyword extraction.
When choosing candidate keywords, existing keyword extraction methods usually rely on features obtained by machine learning. However, these features can only estimate the importance of each word from statistics such as its frequency of occurrence in the document; they cannot reveal the full semantics hidden in the document's content.
Summary of the invention
The present invention overcomes at least one of the above-mentioned deficiencies (defects) of the prior art by providing a keyword extraction method based on a seq2seq deep neural network model.
In order to solve the above technical problems, the technical solution of the present invention is as follows.
A keyword extraction method based on a seq2seq (sequence-to-sequence) deep neural network model, the method comprising the following steps:
S1. Import the document to be processed and the corpus into the preprocessing module for extraction;
S2. Feed the information produced by the preprocessing module into the word-vector conversion module and the part-of-speech tagging module respectively; the word-vector conversion module performs word-vector conversion, and part-of-speech tagging is performed in the part-of-speech tagging module;
S3. Pass the word-vector-converted and POS-tagged information into the candidate-word weight computation module to obtain a candidate word sequence;
S4. Pass the resulting candidate word sequence into the candidate-word screening module to obtain suitable keywords.
Further, the preprocessing module performs Chinese word segmentation, English stemming, and stop-word removal on the corpus and the document to be processed; the Chinese segmentation results enter the word-vector conversion module, and the English stems enter the part-of-speech tagging module.
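As a rough illustration of the preprocessing step for English text, the sketch below combines tokenization, stop-word removal, and stemming. The stop-word list and the suffix-stripping stemmer are placeholders: a real pipeline would use a full stop-word list, a Porter-style stemmer such as NLTK's, and a segmenter such as jieba for Chinese.

```python
import re

# Hypothetical minimal stop-word list; a production system would use a full one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "for"}

def crude_stem(word):
    """Very rough English stemming by suffix stripping (a Porter-style
    stemmer would normally be used instead)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize, drop stop words, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]
```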
Further, the word-vector conversion module converts the words produced by the preprocessing module into word-vector (word embedding) form. On the basis of the word-vector representation framework word2vec, the module randomly selects words from the document and averages their vectors to serve as the document vector, and then lets word vectors and the document vector participate in training and prediction together as a whole.
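The document-vector construction just described (averaging the vectors of randomly selected words) can be sketched as follows. Here `word_vectors` is assumed to come from a trained word2vec-style model, and the random sampling itself is left to the caller:

```python
def document_vector(word_vectors, sampled_words):
    """Average the vectors of a sample of the document's words to form a
    document vector (the per-word vectors would normally come from a
    trained word2vec-style model)."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for w in sampled_words:
        for i, v in enumerate(word_vectors[w]):
            total[i] += v
    return [t / len(sampled_words) for t in total]
```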
Further, the word-vector conversion module uses the following formula to obtain the probability of a given word from its context and document information:
Wherein c is the context word vector, x is the document vector, U is the mapping matrix from the neural network's input layer to its hidden layer, V is the mapping matrix from the hidden layer to the output layer, w is the predicted target word, and T is the length of the document.
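The formula itself was an image in the original filing and did not survive extraction. A doc2vec-style softmax form consistent with the symbols defined above is one plausible reconstruction (an assumption, not the patent's verbatim equation; the concatenation [c_t; x] of context and document vectors is likewise assumed):

```latex
p(w_t \mid c_t, x) \;=\;
\frac{\exp\!\big(V_{w_t}^{\top}\, U\,[c_t; x]\big)}
     {\sum_{w'} \exp\!\big(V_{w'}^{\top}\, U\,[c_t; x]\big)},
\qquad t = 1, \dots, T
```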
Further, the document vector x is processed by the following formula: the seq2seq deep neural network model drops each component of a word's vector with probability q and, to avoid introducing bias, normalizes the retained dimensions:
The optimization objective of the word-vector conversion module can be expressed by the following formula, where the first term is still the likelihood of observing the target word given the context and document semantics, and the second term is a data-dependent regularization:
Further, the data-dependent regularization is expressed by the following formula and becomes the regularization term,
Wherein σ can be viewed as a logistic regression. This regularization term tends to penalize high-frequency words: a high-frequency word is sampled with higher probability, so the term's value is larger for it; and for the coefficient σ(1−σ), the more accurate the word vector and the larger the predicted probability, the smaller this coefficient becomes. This shows mathematically that the regularization term can indeed help optimize the model.
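The component-dropping step described above resembles dropout with renormalization. A sketch under that assumption follows; the patent does not show its exact normalization, so the 1/(1−q) rescaling here (which keeps the expected value unchanged, inverted-dropout style) is an assumed choice:

```python
import random

def drop_components(vector, q, rng=random):
    """Drop each component of a word/document vector with probability q and
    rescale the survivors by 1/(1-q), so the vector's expectation is
    preserved (an assumed, inverted-dropout-style normalization)."""
    return [0.0 if rng.random() < q else v / (1.0 - q) for v in vector]
```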
Further, the part-of-speech tagging module performs part-of-speech tagging on the words output by the preprocessing module using the NLTK (Natural Language Toolkit) package from the Python libraries.
Further, the candidate-word weight computation module is a seq2seq model comprising an encoder and a decoder. Both the input and the output of the seq2seq model are sequences, the lengths of the input and output sequences are variable, and the encoder and decoder are recurrent neural networks (RNNs).
Further, an attention mechanism and a copy mechanism are added to the recurrent neural network so that the network can predict keywords outside the vocabulary and outside the source document. The probability of a predicted word can therefore be expressed by the following formula:
p(y_t | y_{1,...,t-1}, x) = p_g(y_t | y_{1,...,t-1}, x) + p_c(y_t | y_{1,...,t-1}, x)
The first term is the prediction formula of a conventional recurrent neural network, in which a softmax classifier outputs the probability of every word in the vocabulary from the hidden state and the previously predicted words; the second term is the copy mechanism, which accounts for the importance of each word in the document and can be expressed by the following formula:
Wherein ψ is the set of all words in the source document, σ is a nonlinear function, W_c is a trainable parameter matrix, and Z is the sum of all scores, used for normalization.
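The two-term prediction probability above can be illustrated by jointly normalizing generation-mode and copy-mode scores, in the spirit of CopyNet-style decoders. This is a simplified sketch, not the patent's exact parameterization; the score tables stand in for the softmax logits and the copy scores over source positions:

```python
import math

def combined_word_probability(generate_scores, copy_scores, word):
    """Combine generate-mode and copy-mode scores into one probability by
    exponentiating and normalizing over both score tables together
    (Z plays the role of the shared normalizer)."""
    z = sum(math.exp(s) for s in generate_scores.values()) + \
        sum(math.exp(s) for s in copy_scores.values())
    p_g = math.exp(generate_scores[word]) / z if word in generate_scores else 0.0
    p_c = math.exp(copy_scores[word]) / z if word in copy_scores else 0.0
    return p_g + p_c
```

A word that appears in both tables (in the vocabulary and in the source document) accumulates probability from both modes, which is how such decoders favor copying salient source words.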
Compared with the prior art, the beneficial effects of the technical solution of the present invention are:
(1) In the word-vector conversion module, the present invention treats the document vector as the average of word vectors randomly selected from the document, and combines the word vector and the document vector into a single vector representation for each word. This takes into account a word's meaning in different contexts, so the importance of each word to the document can be analyzed better and keywords that better represent the document's gist can be selected.
(2) In the candidate-word selection module, the present invention adds the word vectors of the document information as an input to the module, introducing more external information for keyword extraction from the document to be processed, and also adds the attention and copy mechanisms. This broadens the scope of keyword extraction and overcomes the inability of existing extraction techniques to predict keywords outside the vocabulary or absent from the source document.
(3) The accuracy of keyword discovery is greatly improved, solving the problem that keywords absent from the source document could not be selected, while broadening the search scope so that semantics hidden behind the keywords can be revealed.
Description of the drawings
Fig. 1 is a schematic diagram of the keyword extraction method based on the seq2seq deep neural network model.
Fig. 2 is a workflow diagram of the keyword extraction method based on the seq2seq deep neural network model.
Detailed description of embodiments
The attached figures are only for illustrative purposes and should not be construed as limiting this patent; those skilled in the art will appreciate that certain well-known structures and their explanations may be omitted from the drawings.
The technical solution of the present invention is further described below with reference to the accompanying drawings and embodiments.
The present invention proposes a keyword extraction method based on a seq2seq deep neural network model; its operating steps are shown in Fig. 1:
S1. Import the document to be processed and the corpus into the preprocessing module for extraction.
S2. Feed the information produced by the preprocessing module into the word-vector conversion module and the part-of-speech tagging module respectively; the word-vector conversion module performs word-vector conversion, and part-of-speech tagging is performed in the part-of-speech tagging module.
S3. Pass the word-vector-converted and POS-tagged information into the candidate-word weight computation module to obtain a candidate word sequence.
S4. Pass the resulting candidate word sequence into the candidate-word screening module to obtain suitable keywords.
Here the preprocessing module performs Chinese word segmentation, English stemming, and stop-word removal on the corpus and the document to be processed; the Chinese segmentation results enter the word-vector conversion module, and the English stems enter the part-of-speech tagging module.
The word-vector conversion module converts the words produced by the preprocessing module into word-vector (word embedding) form. The technique it uses is similar to the existing word-vector representation framework word2vec: on the basis of word2vec, words are randomly selected from the document and their vectors averaged to serve as the document vector, and word vectors and the document vector then participate in training and prediction together as a whole. This vector representation not only expresses part of the document's semantics as a vector; the random-selection mechanism also reduces the number of trainable parameters, greatly lowering training complexity. Random selection is, moreover, itself a form of regularization and helps optimize the model's predictions.
The word-vector conversion module uses Formula 1 to obtain the probability of a given word from its context and document information:
Wherein c is the context word vector, x is the document vector, U is the mapping matrix from the neural network's input layer to its hidden layer, V is the mapping matrix from the hidden layer to the output layer, w is the predicted target word, and T is the length of the document.
The document vector x_d in the word-vector conversion module is processed by Formula 2, which drops each component of a word's vector with probability q and, to avoid introducing bias, normalizes the retained dimensions:
The optimization objective of the word-vector conversion module can be expressed as Formula 3: the first term is still the likelihood of observing the target word given the context and document semantics, and the second term is a data-dependent regularization.
The regularization term is equivalent to Formula 4, where σ can be viewed as a logistic regression. This regularization term tends to penalize high-frequency words: a high-frequency word is sampled with higher probability, so the term's value is larger for it; and for the coefficient σ(1−σ), the more accurate the word vector and the larger the predicted probability, the smaller this coefficient becomes, which shows mathematically that the regularization term can help optimize the model.
The part-of-speech tagging module performs part-of-speech tagging on the words output by the preprocessing module using the NLTK natural-language-processing toolkit from the Python libraries.
The candidate-word weight computation module is a seq2seq model whose input and output are both sequences, and the lengths of the input and output sequences are variable. The seq2seq model consists of an encoder and a decoder, and we use recurrent neural networks (RNNs) as both. We feed each word of a sentence into the encoder, which then outputs a semantic vector for the entire sentence. Because a recurrent network takes every previous input into account at each step, the output semantic vector can in principle contain the information of the whole sentence, and we can treat it as a semantic representation of the sentence, that is, a sentence vector. In the decoder, we then gradually unfold the information contained in the sentence vector produced by the encoder.
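The encoder's folding of a whole sentence into one state can be illustrated with a single-unit Elman-style recurrence. The scalar inputs and fixed weights below are purely illustrative (an untrained toy, not the patent's network):

```python
import math

def rnn_encode(features, w_in=0.5, w_rec=0.9):
    """Minimal single-unit Elman-style RNN encoder: fold a sequence of
    scalar word features into one 'sentence' state."""
    h = 0.0
    for x in features:
        # Each update mixes the new input with the state carrying all
        # previous inputs, which is why the final h summarizes the sequence.
        h = math.tanh(w_in * x + w_rec * h)
    return h
```

Changing any earlier input changes the final state, which is the property the text appeals to when calling the encoder output a sentence vector.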
By adding the attention mechanism and the copy mechanism to the recurrent neural network in the candidate-word weight computation module, the network can predict keywords outside the vocabulary and outside the source document. The probability of a predicted word can therefore be expressed as Formula 5:
p(y_t | y_{1,...,t-1}, x) = p_g(y_t | y_{1,...,t-1}, x) + p_c(y_t | y_{1,...,t-1}, x) (Formula 5)
The first term is the prediction formula of a conventional recurrent neural network, in which a softmax classifier outputs the probability of every word in the vocabulary from the hidden state and the previously predicted words; the second term is the copy mechanism, which accounts for the importance of each word in the document and can be expressed as Formula 6:
Wherein ψ is the set of all words in the source document, σ is a nonlinear function, W_c is a trainable parameter matrix, and Z is the sum of all scores, used for normalization.
The decoder of the candidate-word weight computation module therefore differs from a conventional RNN in the following respects. When generating a word it has two modes, a generate mode and a copy mode, and the final model combines the two through the probabilistic model of a selection network: in generate mode it produces a word much like a conventional RNN decoder, while in copy mode it obtains the word's location in the input from a positional softmax. When updating its state, it uses the word predicted at time t−1 to update the state at time t, and also considers the hidden states of that word's specific positions in the word matrix.
The candidate-word screening module screens the candidate word sequence produced by the candidate-word weight computation module: it keeps keywords with suitable parts of speech while excluding keywords consisting of digits or single characters, keywords that are prefixes of other keywords, and duplicates.
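The screening rules just described can be sketched as a filter over POS-tagged candidates. The allowed tag set below is illustrative; the tag names follow the Penn Treebank convention that NLTK's tagger uses:

```python
def screen_candidates(candidates, keep_pos=("NN", "NNS", "JJ")):
    """Filter (word, pos) candidates: keep allowed parts of speech, drop
    digits and single characters, drop duplicates and any candidate that
    is a prefix of another candidate."""
    words = [w for w, pos in candidates
             if pos in keep_pos and len(w) > 1 and not w.isdigit()]
    result = []
    for w in words:
        if w in result:
            continue  # duplicate
        if any(other != w and other.startswith(w) for other in words):
            continue  # prefix of another candidate
        result.append(w)
    return result
```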
Embodiment 1
A selected text is fed into the keyword extraction system of the present invention and a keyword extraction experiment is run, as shown in Fig. 2:
"Towards content-based relevance ranking for video search. Most existing web video search engines index videos by file names, URLs, and surrounding texts. These types of video metadata roughly describe the whole video in an abstract level without taking the rich content, such as semantic content descriptions and speech within the video. In this paper we propose a novel relevance ranking approach for Web-based video search using both video metadata and rich content. To leverage real content into ranking, the videos are segmented into shots, which are smaller and more semantic-meaningful retrievable units. With video metadata and content information of shots, we developed an integrated ranking approach, which achieves improved ranking performance."
After segmentation and part-of-speech tagging with the default set of retained parts of speech, the keywords obtained by this system and by a conventional RNN model are compared in two respects: keywords located in the source document and keywords outside it. The benchmark keywords are: video metadata, integrated ranking, relevance ranking, content based ranking, video segmentation. The results are as follows:
1. Keywords in the source document: 1. information retrieval; 2. video search; 3. ranking; 4. relevance ranking; 5. relevance ranking; 6. video metadata; 7. intergrated ranking; 8. web video; 9. web video search; 10. rich content
2. Keywords outside the source document: 1. video retrieval; 2. web search; 3. content ranking; 4. content based retrieval; 5. content retrieval; 6. video indexing; 7. relevance feedback; 8. content based ranking; 9. semantic web; 10. video segmentation
Embodiment 2
Several existing keyword extraction algorithms are compared, using the F-score as the performance metric and predicting the top 5 and top 10 keywords; the results are as follows. The proposed keyword extraction algorithm and model (CopyRNN, a recurrent neural network with a copy mechanism) performs best on every data set.
Embodiment 3
An extraction experiment is run for keywords outside the source document. Since the other algorithms cannot predict keywords beyond the source document, the comparison is made only against the algorithm using a conventional recurrent neural network, predicting the top 10 and top 50 keywords and using recall as the evaluation metric; the results are as follows. The proposed keyword extraction algorithm and model (CopyRNN) achieves higher recall on every data set, showing that it can more accurately predict keywords outside the source document.
It can be seen that the keyword extraction system proposed by this invention can not only extract keywords present in the source document but also predicts keywords outside it well; compared with existing keyword extraction techniques, the results achieved by the system of this invention are more reasonable and efficient.
Claims (9)
1. A keyword extraction method based on a seq2seq deep neural network model, characterized in that the method comprises the following steps:
S1. Import the document to be processed and the corpus into the preprocessing module for extraction;
S2. Feed the information produced by the preprocessing module into the word-vector conversion module and the part-of-speech tagging module respectively; the word-vector conversion module performs word-vector conversion, and part-of-speech tagging is performed in the part-of-speech tagging module;
S3. Pass the word-vector-converted and POS-tagged information into the candidate-word weight computation module to obtain a candidate word sequence;
S4. Pass the resulting candidate word sequence into the candidate-word screening module to obtain suitable keywords.
2. The keyword extraction method based on a seq2seq deep neural network model according to claim 1, characterized in that the preprocessing module performs Chinese word segmentation, English stemming, and stop-word removal on the corpus and the document to be processed.
3. The keyword extraction method based on a seq2seq deep neural network model according to claim 1, characterized in that the word-vector conversion module converts the words processed by the preprocessing module into word-vector form; on the basis of the word-vector representation framework word2vec, the module randomly selects words from the document and averages their vectors to serve as the document vector, and then lets word vectors and the document vector participate in training and prediction together as a whole.
4. The keyword extraction method based on a seq2seq deep neural network model according to claim 3, characterized in that the word-vector conversion module uses the following formula to obtain the probability of a given word from its context and document information:
Wherein c is the context word vector, x is the document vector, U is the mapping matrix from the neural network's input layer to its hidden layer, v is the mapping matrix from the hidden layer to the output layer, w is the predicted target word, and T is the length of the document.
5. The keyword extraction method based on a seq2seq deep neural network model according to claim 4, characterized in that the document vector x is processed by the following formula, which drops each component of a word's vector with probability q and, to avoid introducing bias, normalizes the retained dimensions:
The optimization objective of the word-vector conversion module can be expressed by the following formula:
The first term is the likelihood of observing the target word given the context and document semantics, and the second term is a data-dependent regularization.
6. The keyword extraction method based on a seq2seq deep neural network model according to claim 5, characterized in that the data-dependent regularization is expressed by the following formula, referred to as the regularization term:
Wherein σ can be viewed as a logistic regression. The regularization term tends to penalize high-frequency words, because a high-frequency word is sampled with higher probability, making the term's value larger; and for the coefficient σ(1−σ), the more accurate the word vector and the larger the predicted probability, the smaller the regularization term becomes.
7. The keyword extraction method based on a seq2seq deep neural network model according to claim 1, characterized in that the part-of-speech tagging module performs part-of-speech tagging on the words output by the preprocessing module using the NLTK natural-language-processing toolkit from the Python libraries.
8. The keyword extraction method based on a seq2seq deep neural network model according to claim 1, characterized in that the candidate-word weight computation module is a seq2seq model comprising an encoder and a decoder; both the input and the output of the seq2seq model are sequences, the lengths of the input and output sequences are variable, and the encoder and decoder are recurrent neural networks.
9. The keyword extraction method based on a seq2seq deep neural network model according to claim 1, characterized in that an attention mechanism and a copy mechanism are added to the recurrent neural network so that the network can predict keywords outside the vocabulary and outside the source document; the probability of a predicted word can be expressed by the following formula:
p(y_t | y_{1,...,t-1}, x) = p_g(y_t | y_{1,...,t-1}, x) + p_c(y_t | y_{1,...,t-1}, x)
The first term is the prediction formula of a conventional recurrent neural network, in which a softmax classifier outputs the probability of every word in the vocabulary from the hidden state and the previously predicted words; the second term is the copy mechanism, which accounts for the importance of each word in the document and can be expressed by the following formula:
Wherein ψ is the set of all words in the source document, σ is a nonlinear function, W_c is a trainable parameter matrix, and Z is the sum of all scores, used for normalization.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810211285.1A CN108376131A (en) | 2018-03-14 | 2018-03-14 | Keyword abstraction method based on seq2seq deep neural network models |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810211285.1A CN108376131A (en) | 2018-03-14 | 2018-03-14 | Keyword abstraction method based on seq2seq deep neural network models |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108376131A true CN108376131A (en) | 2018-08-07 |
Family
ID=63018752
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810211285.1A Pending CN108376131A (en) | 2018-03-14 | 2018-03-14 | Keyword abstraction method based on seq2seq deep neural network models |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108376131A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105893410A (en) * | 2015-11-18 | 2016-08-24 | 乐视网信息技术(北京)股份有限公司 | Keyword extraction method and apparatus |
US20170091318A1 (en) * | 2015-09-29 | 2017-03-30 | Kabushiki Kaisha Toshiba | Apparatus and method for extracting keywords from a single document |
CN106919646A (en) * | 2017-01-18 | 2017-07-04 | 南京云思创智信息科技有限公司 | Chinese text summarization generation system and method |
CN106997344A (en) * | 2017-03-31 | 2017-08-01 | 成都数联铭品科技有限公司 | Keyword abstraction system |
Cited By (38)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109165950A (en) * | 2018-08-10 | 2019-01-08 | 哈尔滨工业大学(威海) | A kind of abnormal transaction identification method based on financial time series feature, equipment and readable storage medium storing program for executing |
CN109299470A (en) * | 2018-11-01 | 2019-02-01 | 成都数联铭品科技有限公司 | The abstracting method and system of trigger word in textual announcement |
CN109299470B (en) * | 2018-11-01 | 2024-02-09 | 成都数联铭品科技有限公司 | Method and system for extracting trigger words in text bulletin |
CN109670035A (en) * | 2018-12-03 | 2019-04-23 | 科大讯飞股份有限公司 | A kind of text snippet generation method |
WO2020155769A1 (en) * | 2019-01-30 | 2020-08-06 | 平安科技(深圳)有限公司 | Method and device for establishing keyword generation model |
CN109948089A (en) * | 2019-02-21 | 2019-06-28 | 中国海洋大学 | A kind of method and device for extracting Web page text |
CN109992774A (en) * | 2019-03-25 | 2019-07-09 | 北京理工大学 | The key phrase recognition methods of word-based attribute attention mechanism |
CN109933806A (en) * | 2019-04-01 | 2019-06-25 | 长沙理工大学 | A kind of repetition generation method, system, equipment and computer readable storage medium |
CN109933806B (en) * | 2019-04-01 | 2024-01-30 | 长沙理工大学 | Method, system, equipment and computer readable storage medium for generating duplicate description |
CN110069611A (en) * | 2019-04-12 | 2019-07-30 | 武汉大学 | A kind of the chat robots reply generation method and device of theme enhancing |
CN110069611B (en) * | 2019-04-12 | 2021-05-04 | 武汉大学 | Topic-enhanced chat robot reply generation method and device |
CN110119765A (en) * | 2019-04-18 | 2019-08-13 | 浙江工业大学 | A kind of keyword extracting method based on Seq2seq frame |
CN111859940B (en) * | 2019-04-23 | 2024-05-14 | 北京嘀嘀无限科技发展有限公司 | Keyword extraction method and device, electronic equipment and storage medium |
CN111859940A (en) * | 2019-04-23 | 2020-10-30 | 北京嘀嘀无限科技发展有限公司 | Keyword extraction method and device, electronic equipment and storage medium |
CN110110330B (en) * | 2019-04-30 | 2023-08-11 | 腾讯科技(深圳)有限公司 | Keyword extraction method based on text and computer equipment |
CN110110330A (en) * | 2019-04-30 | 2019-08-09 | 腾讯科技(深圳)有限公司 | Text based keyword extracting method and computer equipment |
CN110263122A (en) * | 2019-05-08 | 2019-09-20 | 北京奇艺世纪科技有限公司 | A kind of keyword acquisition methods, device and computer readable storage medium |
CN112446206A (en) * | 2019-08-16 | 2021-03-05 | 阿里巴巴集团控股有限公司 | Menu title generation method and device |
CN110991612A (en) * | 2019-11-29 | 2020-04-10 | 交通银行股份有限公司 | Message analysis method of international routine real-time reasoning model based on word vector |
CN111192567A (en) * | 2019-12-27 | 2020-05-22 | 青岛海信智慧家居系统股份有限公司 | Method and device for generating interaction information of intelligent equipment |
CN111477320A (en) * | 2020-03-11 | 2020-07-31 | 北京大学第三医院(北京大学第三临床医学院) | Construction system of treatment effect prediction model, treatment effect prediction system and terminal |
CN111477320B (en) * | 2020-03-11 | 2023-05-30 | 北京大学第三医院(北京大学第三临床医学院) | Treatment effect prediction model construction system, treatment effect prediction system and terminal |
CN111737401A (en) * | 2020-06-22 | 2020-10-02 | 首都师范大学 | Key phrase prediction method based on Seq2set2Seq framework |
CN111737401B (en) * | 2020-06-22 | 2023-03-24 | 北方工业大学 | Key phrase prediction method based on Seq2set2Seq framework |
CN112163405A (en) * | 2020-09-08 | 2021-01-01 | 北京百度网讯科技有限公司 | Question generation method and device |
CN112464656A (en) * | 2020-11-30 | 2021-03-09 | 科大讯飞股份有限公司 | Keyword extraction method and device, electronic equipment and storage medium |
CN112464656B (en) * | 2020-11-30 | 2024-02-13 | 中国科学技术大学 | Keyword extraction method, keyword extraction device, electronic equipment and storage medium |
CN112800757B (en) * | 2021-04-06 | 2021-07-09 | 杭州远传新业科技有限公司 | Keyword generation method, device, equipment and medium |
CN112800757A (en) * | 2021-04-06 | 2021-05-14 | 杭州远传新业科技有限公司 | Keyword generation method, device, equipment and medium |
WO2023060795A1 (en) * | 2021-10-12 | 2023-04-20 | 平安科技(深圳)有限公司 | Automatic keyword extraction method and apparatus, and device and storage medium |
CN114021440B (en) * | 2021-10-28 | 2022-07-12 | 中航机载系统共性技术有限公司 | FPGA (field programmable Gate array) time sequence simulation verification method and device based on MATLAB (matrix laboratory) |
CN114021440A (en) * | 2021-10-28 | 2022-02-08 | 中航机载系统共性技术有限公司 | FPGA (field programmable Gate array) time sequence simulation verification method and device based on MATLAB (matrix laboratory) |
CN115809665A (en) * | 2022-12-13 | 2023-03-17 | 杭州电子科技大学 | Unsupervised keyword extraction method based on bidirectional multi-granularity attention mechanism |
CN115809665B (en) * | 2022-12-13 | 2023-07-11 | 杭州电子科技大学 | Unsupervised keyword extraction method based on bidirectional multi-granularity attention mechanism |
CN116011633A (en) * | 2022-12-23 | 2023-04-25 | 浙江苍南仪表集团股份有限公司 | Regional gas consumption prediction method, regional gas consumption prediction system, regional gas consumption prediction equipment and Internet of things cloud platform |
CN116011633B (en) * | 2022-12-23 | 2023-08-18 | 浙江苍南仪表集团股份有限公司 | Regional gas consumption prediction method, regional gas consumption prediction system, regional gas consumption prediction equipment and Internet of things cloud platform |
CN117150046A (en) * | 2023-09-12 | 2023-12-01 | 广东省华南技术转移中心有限公司 | Automatic task decomposition method and system based on context semantics |
CN117150046B (en) * | 2023-09-12 | 2024-03-15 | 广东省华南技术转移中心有限公司 | Automatic task decomposition method and system based on context semantics |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108376131A (en) | Keyword abstraction method based on seq2seq deep neural network models | |
CN103605665B (en) | Keyword based evaluation expert intelligent search and recommendation method | |
CN111460092B (en) | Multi-document-based automatic complex problem solving method | |
CN113591483A (en) | Document-level event argument extraction method based on sequence labeling | |
CN105528437B (en) | A kind of question answering system construction method extracted based on structured text knowledge | |
CN112818694A (en) | Named entity recognition method based on rules and improved pre-training model | |
JP5216063B2 (en) | Method and apparatus for determining categories of unregistered words | |
CN110888991B (en) | Sectional type semantic annotation method under weak annotation environment | |
CN112307182B (en) | Question-answering system-based pseudo-correlation feedback extended query method | |
CN108710672B (en) | Theme crawler method based on incremental Bayesian algorithm | |
CN109670014A (en) | A kind of Authors of Science Articles name disambiguation method of rule-based matching and machine learning | |
Pérez-Sancho et al. | Genre classification using chords and stochastic language models | |
CN110569355B (en) | Viewpoint target extraction and target emotion classification combined method and system based on word blocks | |
CN113032552B (en) | Text abstract-based policy key point extraction method and system | |
CN110008309A (en) | A kind of short phrase picking method and device | |
CN111061939A (en) | Scientific research academic news keyword matching recommendation method based on deep learning | |
Haque et al. | Literature review of automatic single document text summarization using NLP | |
CN114997288A (en) | Design resource association method | |
CN112926340A (en) | Semantic matching model for knowledge point positioning | |
CN107562774A (en) | Generation method, system and the answering method and system of rare foreign languages word incorporation model | |
Rani et al. | Telugu text summarization using LSTM deep learning | |
Uzun et al. | Automatically discovering relevant images from web pages | |
CN111859090A (en) | Method for obtaining plagiarism source document based on local matching convolutional neural network model facing source retrieval | |
Hashim et al. | An implementation method for Arabic keyword tendency using decision tree | |
KR102724394B1 (en) | Method and apparatus for analyzing articles through keyword extraction, and computer programs |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20180807 |