CN108804495A - Automatic text summarization method based on enhanced semantics - Google Patents
Automatic text summarization method based on enhanced semantics
- Publication number
- CN108804495A CN108804495A CN201810281684.5A CN201810281684A CN108804495A CN 108804495 A CN108804495 A CN 108804495A CN 201810281684 A CN201810281684 A CN 201810281684A CN 108804495 A CN108804495 A CN 108804495A
- Authority
- CN
- China
- Prior art keywords
- text
- abstract
- semantic
- summarization
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The invention discloses an automatic text summarization method based on enhanced semantics, with the following steps: preprocess the text, sort words from high to low by frequency, and convert words to ids; encode the input sequence with a single-layer bidirectional LSTM to extract textual features; decode the semantic vector produced by the encoder with a single-layer unidirectional LSTM to obtain the hidden states; compute a context vector that extracts the information in the input sequence most useful to the current output; after decoding, obtain a probability distribution over the vocabulary and select summary words with a certain strategy; and during training, fuse the semantic similarity between the generated summary and the source text into the loss computation, improving their semantic similarity. The present invention represents text with an LSTM deep learning model, incorporates contextual semantic relations, and strengthens the semantic relation between the summary and the source text, so that the generated summary better fits the theme of the text; the method has broad application prospects.
Description
Technical field
The present invention relates to the field of natural language processing, and in particular to an automatic text summarization method based on enhanced semantics.
Background art
With the rapid development of science and technology and the Internet and the arrival of the big data era, the amount of information on the network grows daily and is everywhere. In particular, textual information such as news, blogs, chats, reports and microblogs has grown explosively, causing information overload; the sheer volume forces people to spend a great deal of time browsing and reading. How to quickly extract the key content from massive text and solve the information overload problem has therefore become an urgent demand, and automatic text summarization technology has emerged in response.
According to how the summary is produced, automatic text summarization can be divided into extractive summarization and abstractive summarization. The former ranks the sentences of the original text by importance according to some method and takes the top n most important sentences as the summary; the latter mines deeper semantic information to restate and condense the central idea of the original text. Extractive summarization has been studied extensively, but it stays at the level of surface lexical information, whereas abstractive summarization better matches the way people actually write summaries.
In recent years, with the rise of deep learning, considerable results have been achieved in many fields, and deep learning has also been introduced into automatic summarization. Abstractive summarization can be realized with sequence-to-sequence (seq2seq) models; drawing on the successful application of machine translation, automatic summarization based on seq2seq models has become a research hotspot in natural language processing, although problems of coherence and readability remain. Traditional extractive summarization usually causes large information loss, especially on long texts, so in-depth study of abstractive automatic summarization is of great significance for truly solving information overload.
Summary of the invention
The purpose of the present invention is to solve the above drawbacks in the prior art by providing an automatic text summarization method based on enhanced semantics. The method is based on a seq2seq model and introduces an attention mechanism; it trains with the semantic similarity between the generated summary and the source text, improving the semantic relevance between them and improving summary quality.
The purpose of the present invention can be achieved by the following technical scheme:
An automatic text summarization method based on enhanced semantics, the method comprising:
a text preprocessing step: segmenting the text and performing lemmatization and coreference resolution, sorting words from high to low by frequency, and converting words to ids;
an encoding step: encoding the input sequence with a neural network to obtain hidden-state vectors carrying the information of the text sequence;
a decoding step: initializing with the last hidden state produced by the encoder and decoding to obtain the hidden state s_t of each step;
an attention distribution calculation step: combining the hidden states of the input sequence with the hidden state s_t obtained by decoding at the current time to compute the context vector, yielding the context vector u_t of the current time t;
a summary generation step: mapping the output of the decoding step through two linear layers to a vector of vocabulary-size dimension, each dimension representing the probability of a word in the vocabulary, selecting candidate words with a certain selection strategy, and generating the summary.
Further, in the text preprocessing step, the text data are a corpus crawled by a web crawler or an open-source corpus, consisting of article-summary pairs.
Further, in the text preprocessing step, the 200k most frequent words are taken as the basic vocabulary, the special markers [PAD], [UNK], [START] and [STOP] are added to the vocabulary, and the words of the text are converted to ids, each text corresponding to one sequence.
Further, the input sequence is the word vectors corresponding to the id sequence obtained after conversion; the word-vector dimension is 128 and the maximum sequence length is taken as 700.
Further, the neural network is a single-layer bidirectional LSTM with 256 hidden units; the forward and backward hidden states h are concatenated to obtain the final hidden state.
Further, the decoding step proceeds as follows: the input word vector and the hidden state of the previous time step are received, and the hidden state s_t of the current time step is obtained through a single-layer unidirectional LSTM with 256 hidden units.
Further, the context vector u_t is calculated as follows:

e_t^i = v^T tanh(W_h h_i + W_s s_t + b_att)
a_t = softmax(e_t)
u_t = Σ_{i=1}^{N} a_t^i h_i

wherein v, W_h, W_s and b_att are parameters to be learned, h_i is the hidden state of the encoder, and N is the length of the input sequence.
Further, the selection strategy is: at test time, a beam search algorithm keeps the 4 highest-probability results at each step until the highest-probability summary sequence is finally obtained; during training, only the highest-probability word is selected at each step, and after the summary is fully generated it is compared against the reference summary for evaluation.
Further, in the summary generation step, each step generates only one word, and the maximum length of the generated summary is 100; that is, the maximum number of cycles from the encoding step to the summary generation step is 100, stopping when the end-of-output marker is produced or the maximum length is reached. The probability is calculated as follows:

p_v = softmax(V_1(V_2[s_t, u_t] + b_2) + b_1)

wherein V_1, V_2, b_1 and b_2 are all parameters to be learned, and p_v provides the basis for predicting the next word.
Further, the summary generation step also includes: computing the semantic similarity Rel between the finally obtained predicted summary and the source text sequence, so that the training process penalizes summaries of low semantic relevance. The calculation is as follows:

G_t = [h→_t ; h←_t]
Rel = cos(s_M, G_N) = (s_M · G_N) / (‖s_M‖ ‖G_N‖)
loss = (1/M) Σ_{t=1}^{M} loss_t − λ · Rel

wherein h→ and h← are the forward and backward hidden states respectively, G_t is the encoder hidden state, λ is an adjustable factor, M is the length of the generated summary sequence, and loss_t is the loss of each step; combining the per-step losses with the semantic similarity Rel gives the total loss.
Compared with the prior art, the present invention has the following advantages and effects: based on a seq2seq model, an automatic text summarization model built on LSTMs is constructed; an attention mechanism is introduced in the decoder to obtain the context vector at each moment; and a semantic similarity term is introduced to strengthen the semantic relevance between the generated summary and the source text. This similarity is fused into the loss function during training, preventing the model from drifting off topic and improving the quality of the summary.
Description of the drawings
Fig. 1 is a flow chart of the steps of the automatic text summarization method based on enhanced semantics of the present invention;
Fig. 2 is a structure chart of the semantic similarity calculation in the present invention;
Fig. 3 is an algorithm flow chart of each step when decoding generates a summary word in the present invention.
Detailed description of the embodiments
In order to make the objectives, technical schemes and advantages of the embodiments of the present invention clearer, the technical schemes in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.
Embodiment
As shown in Fig. 1, the automatic text summarization method based on enhanced semantics includes a text preprocessing step, an encoding step, a decoding step, an attention step and a summary generation step, wherein:
In the text preprocessing step, the text data here may be a corpus crawled by a web crawler or an open-source corpus. Taking CNN/Daily Mail as an example, the corpus consists of article-summary pairs; each article averages 780 words and each summary averages 56 words. After the source text is segmented and lemmatization and coreference resolution are applied, the 200k most frequent words are taken as the basic vocabulary, together with an extended vocabulary composed of the words of each individual text; the special markers [PAD], [UNK], [START] and [STOP] are added to the vocabulary, and the words of the text are converted to ids, each text corresponding to one sequence (summaries are handled likewise). The training set contains 287,226 samples, the validation set 13,368 samples, and the test set 11,490 samples.
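A minimal sketch of this preprocessing step in python (the function and variable names are illustrative assumptions; segmentation, lemmatization and coreference resolution are assumed to have been done upstream):

```python
from collections import Counter

# Special markers added to the vocabulary, as described above.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[START]", "[STOP]"]
VOCAB_SIZE = 200_000  # keep the 200k most frequent words

def build_vocab(tokenized_articles):
    """Count word frequencies over the corpus, sort from high to low,
    and keep the 200k most frequent words plus the special markers."""
    counts = Counter(w for article in tokenized_articles for w in article)
    words = [w for w, _ in counts.most_common(VOCAB_SIZE)]
    return {w: i for i, w in enumerate(SPECIAL_TOKENS + words)}

def words_to_ids(words, vocab):
    """Convert each word to its id; out-of-vocabulary words map to [UNK]."""
    unk = vocab["[UNK]"]
    return [vocab.get(w, unk) for w in words]
```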
In the encoding step, word embedding is applied to the input sequence to obtain 128-dimensional vectors, and a neural network then produces a text representation vector carrying the information of the text sequence.
The input sequence is the id sequence obtained from the article after conversion; the maximum length is taken as 700 and the minimum length is 30.
The neural network in the encoding step is a single-layer bidirectional LSTM with 256 hidden units; the forward and backward hidden states h are concatenated to obtain the final hidden state.
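A sketch of such an encoder using tensorflow (consistent with the platform named in the experiments below); the vocabulary size of 200k words plus 4 special markers and the layer names are assumptions, not the patent's implementation:

```python
import tensorflow as tf

embedding = tf.keras.layers.Embedding(input_dim=200_004, output_dim=128)  # 128-dim word vectors
encoder = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(256, return_sequences=True, return_state=True))  # 256 hidden units

def encode(input_ids):
    """Return the per-position hidden states h_i and the concatenated final state."""
    x = embedding(input_ids)                            # (batch, N, 128)
    h, fwd_h, fwd_c, bwd_h, bwd_c = encoder(x)          # h: (batch, N, 512)
    final_state = tf.concat([fwd_h, bwd_h], axis=-1)    # forward/backward states joined
    return h, final_state
```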
In the decoding step, the input word vector is received and, through a single-layer unidirectional LSTM with 256 hidden units, the hidden state s_t is obtained.
In the attention calculation step, the decoded state s_t obtained by the decoding step at the current time is combined with the hidden states of the input sequence from the encoding step to obtain the context vector u_t of the current time.
The context vector at time t is calculated as follows:

e_t^i = v^T tanh(W_h h_i + W_s s_t + b_att)
a_t = softmax(e_t)
u_t = Σ_{i=1}^{N} a_t^i h_i

wherein v, W_h, W_s and b_att are parameters to be learned, h_i is the hidden state of the encoder, and N is the length of the input sequence.
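The formulas above transcribe directly into code; a sketch (tensor shapes and the layer decomposition are assumptions):

```python
import tensorflow as tf

class AdditiveAttention(tf.keras.layers.Layer):
    """Computes e_t^i = v^T tanh(W_h h_i + W_s s_t + b_att), then a_t and u_t."""
    def __init__(self, units=512):
        super().__init__()
        self.W_h = tf.keras.layers.Dense(units, use_bias=False)
        self.W_s = tf.keras.layers.Dense(units)   # its bias plays the role of b_att
        self.v = tf.keras.layers.Dense(1, use_bias=False)

    def call(self, h, s_t):
        # h: (batch, N, enc_dim) encoder states; s_t: (batch, dec_dim) decoder state
        e_t = self.v(tf.tanh(self.W_h(h) + self.W_s(s_t)[:, None, :]))  # (batch, N, 1)
        a_t = tf.nn.softmax(e_t, axis=1)       # attention distribution over input positions
        u_t = tf.reduce_sum(a_t * h, axis=1)   # context vector u_t, (batch, enc_dim)
        return u_t, a_t
```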
In the summary generation step, the output of the decoding step is mapped through two linear layers to a vector of vocabulary-size dimension, each dimension representing the probability of a word in the vocabulary, and candidate words are selected with a certain selection strategy.
The selection strategy is: at test time, a beam search algorithm keeps the 4 highest-probability results at each step until the highest-probability summary sequence is finally obtained; during training, only the highest-probability word is taken at each step, and after the summary is fully generated it is compared against the reference summary for evaluation.
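A sketch of this beam search (the step_fn interface, which returns log-probabilities over the vocabulary given a prefix, is an assumed abstraction over the decoder):

```python
import heapq

def beam_search(step_fn, start_id, stop_id, beam_width=4, max_len=100):
    """Keep the 4 highest-probability hypotheses at each step; scores are
    cumulative log-probabilities, so the best finished sequence wins."""
    beams = [(0.0, [start_id])]
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == stop_id:            # finished hypotheses are carried over
                candidates.append((score, seq))
                continue
            log_probs = step_fn(seq)          # log P(next word | prefix), one per vocab word
            for word_id, lp in enumerate(log_probs):
                candidates.append((score + lp, seq + [word_id]))
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
        if all(seq[-1] == stop_id for _, seq in beams):
            break
    return max(beams, key=lambda c: c[0])[1]  # highest-probability summary sequence
```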
The maximum length of the generated summary is 100, and the probability is calculated as follows:

p_v = softmax(V_1(V_2[s_t, u_t] + b_2) + b_1)

wherein V_1, V_2, b_1 and b_2 are all parameters to be learned, and p_v provides the basis for predicting the next word.
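The two-linear-layer mapping corresponds to the following sketch (the inner layer width of 512 is an assumption):

```python
import tensorflow as tf

V2 = tf.keras.layers.Dense(512)                            # inner linear layer (V_2, b_2)
V1 = tf.keras.layers.Dense(200_004, activation="softmax")  # vocabulary-sized layer (V_1, b_1)

def vocab_distribution(s_t, u_t):
    """p_v = softmax(V_1(V_2[s_t, u_t] + b_2) + b_1)."""
    return V1(V2(tf.concat([s_t, u_t], axis=-1)))
```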
The summary generation step also includes computing the semantic similarity Rel between the finally obtained predicted summary and the source text sequence, so that the training process penalizes summaries of low semantic relevance. The calculation is as follows:

G_t = [h→_t ; h←_t]
Rel = cos(s_M, G_N) = (s_M · G_N) / (‖s_M‖ ‖G_N‖)
loss = (1/M) Σ_{t=1}^{M} loss_t − λ · Rel

wherein h→ and h← are the forward and backward hidden states respectively, G_t is the encoder hidden state, λ is an adjustable factor defaulting to 1, M is the length of the generated summary sequence, and loss_t is the loss of each step; combining the per-step losses with the similarity gives the total loss.
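Under the reconstruction above, the combined loss can be sketched as follows (pairing the final decoder state with the final encoder state as the two representations is an assumption):

```python
import tensorflow as tf

def total_loss(step_losses, summary_repr, source_repr, lam=1.0):
    """Average the per-step losses and subtract lambda times the cosine
    similarity Rel, penalizing summaries of low semantic relevance."""
    rel = tf.reduce_sum(summary_repr * source_repr, axis=-1) / (
        tf.norm(summary_repr, axis=-1) * tf.norm(source_repr, axis=-1) + 1e-8)
    return tf.reduce_mean(step_losses) - lam * tf.reduce_mean(rel)
```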
During training, the back-propagation algorithm is used with the Adagrad optimizer, a learning rate of 0.15 and an initial accumulator value of 0.1.
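In tensorflow this training configuration corresponds to (a sketch):

```python
import tensorflow as tf

# Adagrad with learning rate 0.15 and initial accumulator value 0.1, as stated above.
optimizer = tf.keras.optimizers.Adagrad(
    learning_rate=0.15, initial_accumulator_value=0.1)
```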
The decoding step is divided into a training phase and a test phase: in the training phase the reference summary is fed as input, while in the test phase the output of the previous time step is fed as the input of the current time step.
The reference summaries and predicted summaries are evaluated with the ROUGE metrics (a computation sketch follows Table 1 below). The experiments use the Linux operating system and run the program on a GPU; the programming language is python and the platform is tensorflow. The model with semantic similarity runs for about 4 days and about 380,000 iterations, and the experimental results are shown in the following table.
Table 1. Comparison of the results of the three models
Experimental model | ROUGE-1 | ROUGE-2 | ROUGE-L |
Basic LSTM models | 0.2896 | 0.1028 | 0.2613 |
LSTM+Attention | 0.3116 | 0.1127 | 0.2920 |
LSTM+Attention+Rel | 0.3493 | 0.1390 | 0.3342 |
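For reference, ROUGE-1/2/L F-scores of the kind reported in Table 1 can be computed with the open-source rouge-score package (the use of this particular package, and the example strings, are assumptions; the patent does not name its evaluation tool):

```python
# pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "police arrest suspect after downtown robbery"
prediction = "police arrested a suspect following a robbery downtown"
scores = scorer.score(reference, prediction)
print(scores["rouge1"].fmeasure, scores["rouge2"].fmeasure, scores["rougeL"].fmeasure)
```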
By fusing the attention mechanism, the present invention gives full play to the ability of seq2seq models to mine deep text semantic information, allowing decoding to focus on the information in the input sequence useful to the current output when generating the summary, and incorporates semantic similarity into the loss computation, so that the model attends to the semantic similarity with the source text when generating the summary and produces sentences that better match the semantics of the original text. Compared with traditional statistics-based automatic summarization, deep-learning-based models have stronger representational ability and a clear advantage in automatic text summarization tasks.
The above embodiment is a preferred embodiment of the present invention, but the embodiments of the present invention are not limited by the above embodiment; any other changes, modifications, substitutions, combinations and simplifications made without departing from the spirit and principles of the present invention shall be equivalent replacements and are included within the protection scope of the present invention.
Claims (10)
1. An automatic text summarization method based on enhanced semantics, characterized in that the method comprises:
a text preprocessing step: segmenting the text and performing lemmatization and coreference resolution, sorting words from high to low by frequency, and converting words to ids;
an encoding step: encoding the input sequence with a neural network to obtain hidden-state vectors carrying the information of the text sequence;
a decoding step: initializing with the last hidden state produced by the encoder and decoding to obtain the hidden state s_t of each step;
an attention distribution calculation step: combining the hidden states of the input sequence with the hidden state s_t obtained by decoding at the current time to compute the context vector, yielding the context vector u_t of the current time t;
a summary generation step: mapping the output of the decoding step through two linear layers to a vector of vocabulary-size dimension, each dimension representing the probability of a word in the vocabulary, selecting candidate words with a certain selection strategy, and generating the summary.
2. The automatic text summarization method based on enhanced semantics according to claim 1, characterized in that in the text preprocessing step, the text data are a corpus crawled by a web crawler or an open-source corpus, consisting of article-summary pairs.
3. The automatic text summarization method based on enhanced semantics according to claim 1, characterized in that in the text preprocessing step, the 200k most frequent words are taken as the basic vocabulary, the special markers [PAD], [UNK], [START] and [STOP] are added to the vocabulary, and the words of the text are converted to ids, each text corresponding to one sequence.
4. The automatic text summarization method based on enhanced semantics according to claim 1, characterized in that the input sequence is the word vectors corresponding to the id sequence obtained after conversion; the word-vector dimension is 128 and the maximum sequence length is taken as 700.
5. The automatic text summarization method based on enhanced semantics according to claim 1, characterized in that the neural network is a single-layer bidirectional LSTM with 256 hidden units, and the forward and backward hidden states h are concatenated to obtain the final hidden state.
6. The automatic text summarization method based on enhanced semantics according to claim 1, characterized in that the decoding step proceeds as follows: the input word vector and the hidden state of the previous time step are received, and the hidden state s_t of the current time step is obtained through a single-layer unidirectional LSTM with 256 hidden units.
7. The automatic text summarization method based on enhanced semantics according to claim 1, characterized in that the context vector u_t is calculated as follows:

e_t^i = v^T tanh(W_h h_i + W_s s_t + b_att)
a_t = softmax(e_t)
u_t = Σ_{i=1}^{N} a_t^i h_i

wherein v, W_h, W_s and b_att are parameters to be learned, h_i is the hidden state of the encoder, and N is the length of the input sequence.
8. The automatic text summarization method based on enhanced semantics according to claim 1, characterized in that the selection strategy is: at test time, a beam search algorithm keeps the 4 highest-probability results at each step until the highest-probability summary sequence is finally obtained; during training, only the highest-probability word is selected at each step, and after the summary is fully generated it is compared against the reference summary for evaluation.
9. The automatic text summarization method based on enhanced semantics according to claim 1, characterized in that in the summary generation step, each step generates only one word, and the maximum length of the generated summary is 100; that is, the maximum number of cycles from the encoding step to the summary generation step is 100, stopping when the end-of-output marker is produced or the maximum length is reached. The probability is calculated as follows:

p_v = softmax(V_1(V_2[s_t, u_t] + b_2) + b_1)

wherein V_1, V_2, b_1 and b_2 are all parameters to be learned, and p_v provides the basis for predicting the next word.
10. The automatic text summarization method based on enhanced semantics according to claim 1, characterized in that the summary generation step further comprises: computing the semantic similarity Rel between the finally obtained predicted summary and the source text sequence, so that the training process penalizes summaries of low semantic relevance, calculated as follows:

G_t = [h→_t ; h←_t]
Rel = cos(s_M, G_N) = (s_M · G_N) / (‖s_M‖ ‖G_N‖)
loss = (1/M) Σ_{t=1}^{M} loss_t − λ · Rel

wherein h→ and h← are the forward and backward hidden states respectively, G_t is the encoder hidden state, λ is an adjustable factor, M is the length of the generated summary sequence, and loss_t is the loss of each step; combining the per-step losses with the semantic similarity Rel gives the total loss.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810281684.5A CN108804495B (en) | 2018-04-02 | 2018-04-02 | Automatic text summarization method based on enhanced semantics |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810281684.5A CN108804495B (en) | 2018-04-02 | 2018-04-02 | Automatic text summarization method based on enhanced semantics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108804495A true CN108804495A (en) | 2018-11-13 |
CN108804495B CN108804495B (en) | 2021-10-22 |
Family
ID=64095279
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810281684.5A Expired - Fee Related CN108804495B (en) | 2018-04-02 | 2018-04-02 | Automatic text summarization method based on enhanced semantics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108804495B (en) |
Cited By (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109620205A (en) * | 2018-12-26 | 2019-04-16 | 上海联影智能医疗科技有限公司 | Electrocardiogram (ECG) data classification method, device, computer equipment and storage medium |
CN109800390A (en) * | 2018-12-21 | 2019-05-24 | 北京石油化工学院 | A kind of calculation method and device of individualized emotion abstract |
CN109829161A (en) * | 2019-01-30 | 2019-05-31 | 延边大学 | A kind of method of multilingual autoabstract |
CN109885673A (en) * | 2019-02-13 | 2019-06-14 | 北京航空航天大学 | A kind of Method for Automatic Text Summarization based on pre-training language model |
CN109947931A (en) * | 2019-03-20 | 2019-06-28 | 华南理工大学 | Text automatic abstracting method, system, equipment and medium based on unsupervised learning |
CN110119444A (en) * | 2019-04-23 | 2019-08-13 | 中电科大数据研究院有限公司 | A kind of official document summarization generation model that extraction-type is combined with production |
CN110134782A (en) * | 2019-05-14 | 2019-08-16 | 南京大学 | A kind of text snippet model and Method for Automatic Text Summarization based on improved selection mechanism and LSTM variant |
CN110209802A (en) * | 2019-06-05 | 2019-09-06 | 北京金山数字娱乐科技有限公司 | A kind of method and device for extracting summary texts |
CN110209801A (en) * | 2019-05-15 | 2019-09-06 | 华南理工大学 | A kind of text snippet automatic generation method based on from attention network |
CN110222840A (en) * | 2019-05-17 | 2019-09-10 | 中山大学 | A kind of cluster resource prediction technique and device based on attention mechanism |
CN110334362A (en) * | 2019-07-12 | 2019-10-15 | 北京百奥知信息科技有限公司 | A method of the solution based on medical nerve machine translation generates untranslated word |
CN110390103A (en) * | 2019-07-23 | 2019-10-29 | 中国民航大学 | Short text auto-abstracting method and system based on Dual-encoder |
CN110532554A (en) * | 2019-08-26 | 2019-12-03 | 南京信息职业技术学院 | A kind of Chinese abstraction generating method, system and storage medium |
CN110688479A (en) * | 2019-08-19 | 2020-01-14 | 中国科学院信息工程研究所 | Evaluation method and sequencing network for generating abstract |
CN110765264A (en) * | 2019-10-16 | 2020-02-07 | 北京工业大学 | Text abstract generation method for enhancing semantic relevance |
CN110795556A (en) * | 2019-11-01 | 2020-02-14 | 中山大学 | Abstract generation method based on fine-grained plug-in decoding |
CN111078866A (en) * | 2019-12-30 | 2020-04-28 | 华南理工大学 | Chinese text abstract generation method based on sequence-to-sequence model |
CN111339763A (en) * | 2020-02-26 | 2020-06-26 | 四川大学 | English mail subject generation method based on multi-level neural network |
CN111414505A (en) * | 2020-03-11 | 2020-07-14 | 上海爱数信息技术股份有限公司 | Rapid image abstract generation method based on sequence generation model |
CN111460109A (en) * | 2019-01-22 | 2020-07-28 | 阿里巴巴集团控股有限公司 | Abstract and dialogue abstract generation method and device |
CN111563160A (en) * | 2020-04-15 | 2020-08-21 | 华南理工大学 | Text automatic summarization method, device, medium and equipment based on global semantics |
CN111639174A (en) * | 2020-05-15 | 2020-09-08 | 民生科技有限责任公司 | Text abstract generation system, method and device and computer readable storage medium |
CN111708877A (en) * | 2020-04-20 | 2020-09-25 | 中山大学 | Text abstract generation method based on key information selection and variation latent variable modeling |
CN111797196A (en) * | 2020-06-01 | 2020-10-20 | 武汉大学 | Service discovery method combining attention mechanism LSTM and neural topic model |
CN112364157A (en) * | 2020-11-02 | 2021-02-12 | 北京中科凡语科技有限公司 | Multi-language automatic abstract generation method, device, equipment and storage medium |
CN113157855A (en) * | 2021-02-22 | 2021-07-23 | 福州大学 | Text summarization method and system fusing semantic and context information |
CN113221577A (en) * | 2021-04-28 | 2021-08-06 | 西安交通大学 | Education text knowledge induction method, system, equipment and readable storage medium |
CN113407711A (en) * | 2021-06-17 | 2021-09-17 | 成都崇瑚信息技术有限公司 | Gibbs limited text abstract generation method by using pre-training model |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107291699A (en) * | 2017-07-04 | 2017-10-24 | 湖南星汉数智科技有限公司 | A kind of sentence semantic similarity computational methods |
CN107484017A (en) * | 2017-07-25 | 2017-12-15 | 天津大学 | Supervision video abstraction generating method is had based on attention model |
CN107832300A (en) * | 2017-11-17 | 2018-03-23 | 合肥工业大学 | Towards minimally invasive medical field text snippet generation method and device |
CN107844469A (en) * | 2017-10-26 | 2018-03-27 | 北京大学 | The text method for simplifying of word-based vector query model |
2018
- 2018-04-02 CN CN201810281684.5A patent/CN108804495B/en not_active Expired - Fee Related
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107291699A (en) * | 2017-07-04 | 2017-10-24 | 湖南星汉数智科技有限公司 | A kind of sentence semantic similarity computational methods |
CN107484017A (en) * | 2017-07-25 | 2017-12-15 | 天津大学 | Supervision video abstraction generating method is had based on attention model |
CN107844469A (en) * | 2017-10-26 | 2018-03-27 | 北京大学 | The text method for simplifying of word-based vector query model |
CN107832300A (en) * | 2017-11-17 | 2018-03-23 | 合肥工业大学 | Towards minimally invasive medical field text snippet generation method and device |
Non-Patent Citations (2)
Title |
---|
NHI-THAO TRAN ET AL: "Effective Attention-based Neural Architectures for Sentence Compression with Bidirectional Long Short-Term Memory", 《PROCEEDINGS OF THE SEVENTH SYMPOSIUM ON INFORMATION AND COMMUNICATION TECHNOLOGY》 * |
SHUMING MA ET AL: "Improving Semantic Relevance for Sequence-to-Sequence Learning of Chinese Social Media Text Summarization", 《ACL》 * |
Cited By (48)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109800390A (en) * | 2018-12-21 | 2019-05-24 | 北京石油化工学院 | A kind of calculation method and device of individualized emotion abstract |
CN109620205A (en) * | 2018-12-26 | 2019-04-16 | 上海联影智能医疗科技有限公司 | Electrocardiogram (ECG) data classification method, device, computer equipment and storage medium |
CN111460109A (en) * | 2019-01-22 | 2020-07-28 | 阿里巴巴集团控股有限公司 | Abstract and dialogue abstract generation method and device |
CN111460109B (en) * | 2019-01-22 | 2023-12-26 | 阿里巴巴集团控股有限公司 | Method and device for generating abstract and dialogue abstract |
CN109829161B (en) * | 2019-01-30 | 2023-08-04 | 延边大学 | Method for automatically abstracting multiple languages |
CN109829161A (en) * | 2019-01-30 | 2019-05-31 | 延边大学 | A kind of method of multilingual autoabstract |
CN109885673A (en) * | 2019-02-13 | 2019-06-14 | 北京航空航天大学 | A kind of Method for Automatic Text Summarization based on pre-training language model |
CN109947931B (en) * | 2019-03-20 | 2021-05-14 | 华南理工大学 | Method, system, device and medium for automatically abstracting text based on unsupervised learning |
CN109947931A (en) * | 2019-03-20 | 2019-06-28 | 华南理工大学 | Text automatic abstracting method, system, equipment and medium based on unsupervised learning |
CN110119444B (en) * | 2019-04-23 | 2023-06-30 | 中电科大数据研究院有限公司 | Drawing type and generating type combined document abstract generating model |
CN110119444A (en) * | 2019-04-23 | 2019-08-13 | 中电科大数据研究院有限公司 | A kind of official document summarization generation model that extraction-type is combined with production |
CN110134782A (en) * | 2019-05-14 | 2019-08-16 | 南京大学 | A kind of text snippet model and Method for Automatic Text Summarization based on improved selection mechanism and LSTM variant |
CN110134782B (en) * | 2019-05-14 | 2021-05-18 | 南京大学 | Text summarization model based on improved selection mechanism and LSTM variant and automatic text summarization method |
CN110209801B (en) * | 2019-05-15 | 2021-05-14 | 华南理工大学 | Text abstract automatic generation method based on self-attention network |
CN110209801A (en) * | 2019-05-15 | 2019-09-06 | 华南理工大学 | A kind of text snippet automatic generation method based on from attention network |
CN110222840A (en) * | 2019-05-17 | 2019-09-10 | 中山大学 | A kind of cluster resource prediction technique and device based on attention mechanism |
CN110209802B (en) * | 2019-06-05 | 2021-12-28 | 北京金山数字娱乐科技有限公司 | Method and device for extracting abstract text |
CN110209802A (en) * | 2019-06-05 | 2019-09-06 | 北京金山数字娱乐科技有限公司 | A kind of method and device for extracting summary texts |
CN110334362B (en) * | 2019-07-12 | 2023-04-07 | 北京百奥知信息科技有限公司 | Method for solving and generating untranslated words based on medical neural machine translation |
CN110334362A (en) * | 2019-07-12 | 2019-10-15 | 北京百奥知信息科技有限公司 | A method of the solution based on medical nerve machine translation generates untranslated word |
CN110390103B (en) * | 2019-07-23 | 2022-12-27 | 中国民航大学 | Automatic short text summarization method and system based on double encoders |
CN110390103A (en) * | 2019-07-23 | 2019-10-29 | 中国民航大学 | Short text auto-abstracting method and system based on Dual-encoder |
CN110688479A (en) * | 2019-08-19 | 2020-01-14 | 中国科学院信息工程研究所 | Evaluation method and sequencing network for generating abstract |
CN110688479B (en) * | 2019-08-19 | 2022-06-17 | 中国科学院信息工程研究所 | Evaluation method and sequencing network for generating abstract |
CN110532554B (en) * | 2019-08-26 | 2023-05-05 | 南京信息职业技术学院 | Chinese abstract generation method, system and storage medium |
CN110532554A (en) * | 2019-08-26 | 2019-12-03 | 南京信息职业技术学院 | A kind of Chinese abstraction generating method, system and storage medium |
CN110765264A (en) * | 2019-10-16 | 2020-02-07 | 北京工业大学 | Text abstract generation method for enhancing semantic relevance |
CN110795556B (en) * | 2019-11-01 | 2023-04-18 | 中山大学 | Abstract generation method based on fine-grained plug-in decoding |
CN110795556A (en) * | 2019-11-01 | 2020-02-14 | 中山大学 | Abstract generation method based on fine-grained plug-in decoding |
CN111078866B (en) * | 2019-12-30 | 2023-04-28 | 华南理工大学 | Chinese text abstract generation method based on sequence-to-sequence model |
CN111078866A (en) * | 2019-12-30 | 2020-04-28 | 华南理工大学 | Chinese text abstract generation method based on sequence-to-sequence model |
CN111339763A (en) * | 2020-02-26 | 2020-06-26 | 四川大学 | English mail subject generation method based on multi-level neural network |
CN111339763B (en) * | 2020-02-26 | 2022-06-28 | 四川大学 | English mail subject generation method based on multi-level neural network |
CN111414505B (en) * | 2020-03-11 | 2023-10-20 | 上海爱数信息技术股份有限公司 | Quick image abstract generation method based on sequence generation model |
CN111414505A (en) * | 2020-03-11 | 2020-07-14 | 上海爱数信息技术股份有限公司 | Rapid image abstract generation method based on sequence generation model |
CN111563160B (en) * | 2020-04-15 | 2023-03-31 | 华南理工大学 | Text automatic summarization method, device, medium and equipment based on global semantics |
CN111563160A (en) * | 2020-04-15 | 2020-08-21 | 华南理工大学 | Text automatic summarization method, device, medium and equipment based on global semantics |
CN111708877B (en) * | 2020-04-20 | 2023-05-09 | 中山大学 | Text abstract generation method based on key information selection and variational potential variable modeling |
CN111708877A (en) * | 2020-04-20 | 2020-09-25 | 中山大学 | Text abstract generation method based on key information selection and variation latent variable modeling |
CN111639174A (en) * | 2020-05-15 | 2020-09-08 | 民生科技有限责任公司 | Text abstract generation system, method and device and computer readable storage medium |
CN111639174B (en) * | 2020-05-15 | 2023-12-22 | 民生科技有限责任公司 | Text abstract generation system, method, device and computer readable storage medium |
CN111797196B (en) * | 2020-06-01 | 2021-11-02 | 武汉大学 | Service discovery method combining attention mechanism LSTM and neural topic model |
CN111797196A (en) * | 2020-06-01 | 2020-10-20 | 武汉大学 | Service discovery method combining attention mechanism LSTM and neural topic model |
CN112364157A (en) * | 2020-11-02 | 2021-02-12 | 北京中科凡语科技有限公司 | Multi-language automatic abstract generation method, device, equipment and storage medium |
CN113157855A (en) * | 2021-02-22 | 2021-07-23 | 福州大学 | Text summarization method and system fusing semantic and context information |
CN113221577A (en) * | 2021-04-28 | 2021-08-06 | 西安交通大学 | Education text knowledge induction method, system, equipment and readable storage medium |
CN113407711B (en) * | 2021-06-17 | 2023-04-07 | 成都崇瑚信息技术有限公司 | Gibbs limited text abstract generation method by using pre-training model |
CN113407711A (en) * | 2021-06-17 | 2021-09-17 | 成都崇瑚信息技术有限公司 | Gibbs limited text abstract generation method by using pre-training model |
Also Published As
Publication number | Publication date |
---|---|
CN108804495B (en) | 2021-10-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108804495A (en) | A kind of Method for Automatic Text Summarization semantic based on enhancing | |
CN111897949B (en) | Guided text abstract generation method based on Transformer | |
CN109508462B (en) | Neural network Mongolian Chinese machine translation method based on encoder-decoder | |
CN107291836B (en) | Chinese text abstract obtaining method based on semantic relevancy model | |
CN110210016B (en) | Method and system for detecting false news of bilinear neural network based on style guidance | |
CN110147451B (en) | Dialogue command understanding method based on knowledge graph | |
CN110795556A (en) | Abstract generation method based on fine-grained plug-in decoding | |
CN110119444B (en) | Drawing type and generating type combined document abstract generating model | |
CN111061861B (en) | Text abstract automatic generation method based on XLNet | |
CN111858932A (en) | Multiple-feature Chinese and English emotion classification method and system based on Transformer | |
CN107357899B (en) | Short text sentiment analysis method based on sum-product network depth automatic encoder | |
CN110807324A (en) | Video entity identification method based on IDCNN-crf and knowledge graph | |
CN111178053B (en) | Text generation method for generating abstract extraction by combining semantics and text structure | |
CN111209749A (en) | Method for applying deep learning to Chinese word segmentation | |
CN109325109A (en) | Attention encoder-based extraction type news abstract generating device | |
CN111814477B (en) | Dispute focus discovery method and device based on dispute focus entity and terminal | |
CN108710672B (en) | Theme crawler method based on incremental Bayesian algorithm | |
CN113111663A (en) | Abstract generation method fusing key information | |
CN111984782A (en) | Method and system for generating text abstract of Tibetan language | |
CN110992943B (en) | Semantic understanding method and system based on word confusion network | |
CN114298055B (en) | Retrieval method and device based on multilevel semantic matching, computer equipment and storage medium | |
CN114972848A (en) | Image semantic understanding and text generation based on fine-grained visual information control network | |
CN111008277B (en) | Automatic text summarization method | |
CN116483991A (en) | Dialogue abstract generation method and system | |
CN114548090B (en) | Fast relation extraction method based on convolutional neural network and improved cascade labeling |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20211022 |