CN108804495B - Automatic text summarization method based on enhanced semantics - Google Patents
- Publication number: CN108804495B (application CN201810281684.5A)
- Authority
- CN
- China
- Prior art keywords
- text
- hidden layer
- abstract
- sequence
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The invention discloses an automatic text summarization method based on enhanced semantics, which comprises the following steps: preprocessing the text, arranging words from high to low word frequency, and converting each word into an id; encoding the input sequence with a single-layer bidirectional LSTM to extract text information features; decoding the text semantic vector obtained by encoding with a single-layer unidirectional LSTM to obtain the hidden layer states; calculating a context vector that extracts from the input sequence the information most useful for the current output; and, in the training stage, fusing the semantic similarity between the generated abstract and the source text into the loss calculation, thereby improving the semantic similarity between the abstract and the source text. The invention represents the text with an LSTM deep learning model, integrates the semantic relations of the context, enhances the semantic relation between the abstract and the source text, and generates abstracts that better fit the theme of the text, with wide application prospects.
Description
Technical Field
The invention relates to the technical field of natural language processing, in particular to an automatic text summarization method based on enhanced semantics.
Background
With the rapid development of science, technology and the internet, the era of big data has arrived, and the amount of information on the network grows day by day. Representative text information such as news, blogs, chats, reports and microblogs is increasing explosively, creating a heavy information burden: people spend a great deal of time browsing and reading. How to quickly extract key content from large amounts of text and relieve information overload has therefore become an urgent need, and automatic text summarization technology has arisen to meet it.
Automatic text summarization techniques are classified into extractive summarization and generative summarization according to how the summary is produced. The former ranks the sentences of the original text by some method and takes the n most important sentences as the summary; the latter describes and summarizes the central idea of the original text by mining deeper semantic information. There has been much research on extractive summarization, but it stays at surface-level vocabulary information, whereas generative summarization is closer to the process by which humans write summaries.
In recent years, with the rise of deep learning, many achievements have been made across numerous fields, and deep learning has also been introduced into automatic summarization. Generative summarization can be realized with the sequence-to-sequence (seq2seq) model, and drawing on its successful application to machine translation, seq2seq-based automatic summarization has become a research hotspot of natural language processing, though it still has problems of continuity and readability. Traditional extractive summarization generally causes great information loss, particularly on long texts, so in-depth research on generative automatic summarization is of great significance for truly solving the problem of information overload.
Disclosure of Invention
The invention aims to remedy the defects of the prior art by providing an automatic text summarization method based on enhanced semantics, which builds on a seq2seq model, introduces an attention mechanism, and trains with the semantic similarity between the generated summary and the source text, thereby improving the semantic relevance between the generated summary and the source text and improving summary quality.
The purpose of the invention can be achieved by adopting the following technical scheme:
an automatic text summarization method based on enhanced semantics, the automatic text summarization method comprising:
a text preprocessing step, namely performing word segmentation, lemmatization and coreference resolution on the text, arranging the words from high to low word frequency, and converting the words into ids;
coding, namely coding an input sequence, and obtaining a hidden layer state vector carrying text sequence information through a neural network;
a decoding step, namely initializing with the last hidden layer state obtained by the encoder and starting decoding to obtain the hidden layer state s_t of each step;
an attention distribution calculation step, namely combining the hidden layer states of the input sequence with the hidden layer state s_t obtained by decoding at the current moment to calculate the context vector, obtaining the context vector u_t at the current time t;
And an abstract generation step, namely mapping the output obtained in the decoding step into vectors of the dimension of the size of the word list through two linear layers, wherein each dimension represents the probability of the word in the word list, and selecting a candidate word by using a certain selection strategy to generate an abstract.
Further, the data of the text in the text preprocessing step is a corpus crawled by a crawler or an open-source corpus, and consists of article-abstract pairs.
Further, in the text preprocessing step, the first 200k words are taken as the basic vocabulary, the special marks [PAD], [UNK], [START] and [STOP] are added to the vocabulary, and the words of the text are converted into ids, so that each text corresponds to an id sequence.
Further, the input sequence is a word vector corresponding to an id sequence obtained by converting the text, the dimension of the word vector is 128, and the maximum length of the sequence is 700.
Further, the neural network is a single-layer bidirectional LSTM, the number of hidden layer units is 256, and the forward and reverse hidden layer states h are connected to obtain a final hidden layer state.
Further, the decoding step process is as follows:
receiving an input word vector and the hidden layer state of the previous moment, and obtaining the hidden layer state s_t of the current moment through a single-layer unidirectional LSTM neural network; the number of hidden units is 256.
Further, the context vector u_t is calculated as follows:
e_t,i = v^T tanh(W_h h_i + W_s s_t + b_att), a_t = softmax(e_t), u_t = Σ_{i=1..N} a_t,i h_i
wherein v, W_h, W_s and b_att are parameters to be learned, h_i is the hidden layer state value of the encoder, and N is the length of the input sequence.
Furthermore, the selection strategy refers to the following: in the testing stage, a beam search algorithm keeps the 4 results with the highest probability at each step until the summary sequence with the highest overall probability is finally obtained, while in the training stage only the word with the highest probability is selected at each step; after the summary is completely generated, it is compared and evaluated against the reference summary.
Further, in the abstract generation step, only one word is generated in each step, and the maximum length of the generated abstract is 100, that is, the maximum number of cycles from the encoding step to the abstract generation step is 100; generation stops when the end mark is output or the maximum length is reached, and the probability calculation formula is as follows:
p_v = softmax(V_1(V_2[s_t, u_t] + b_2) + b_1)
wherein V_1, V_2, b_1 and b_2 are all parameters to be learned, and p_v provides the basis for predicting the next word.
Further, the abstract generation step further includes: performing a semantic similarity (Rel) calculation between the finally obtained predicted abstract and the source text sequence, and penalizing abstracts with low semantic relevance during training, calculated as follows:
wherein the forward and backward hidden layer states are used respectively, G_t is the encoder hidden layer state, λ is an adjustable factor, M is the length of the generated abstract sequence, and loss_t is the loss of each step, which is combined with the semantic similarity Rel to form the total loss.
Compared with the prior art, the invention has the following advantages and effects:
the invention constructs an automatic text abstract model based on an LSTM based on a seq2seq model, introduces an attention mechanism to obtain a context vector at each moment when a decoder is used, introduces semantic similarity to enhance the semantic relevance between a generated abstract and a source text, fuses the similarity into a loss function during training, avoids model bias and improves the quality of the abstract.
Drawings
FIG. 1 is a flow chart of the steps of the enhanced semantic based automatic text summarization method of the present invention;
FIG. 2 is a diagram of a semantic similarity calculation structure in the present invention;
fig. 3 is a flowchart of an algorithm of each step when generating the abstract word in decoding according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Examples
As shown in fig. 1, the automatic text summarization method based on enhanced semantics includes: the method comprises the steps of text preprocessing, encoding, decoding, attention and abstract generation. Wherein:
Text preprocessing step: the text data may be a corpus crawled by a crawler or an open-source corpus, for example CNN/Daily Mail, composed of article-abstract pairs, where each article has 780 words on average and each abstract has 56 words on average. The source text is segmented into words, lemmatized and coreference-resolved; the first 200k words by frequency are taken as the basic vocabulary, the special marks [PAD], [UNK], [START], [STOP] are added to the vocabulary, and the words of each text are converted into ids, so that each article corresponds to an id sequence (the abstracts are handled the same way). The training set contains 287,226 samples, the validation set 13,368 samples, and the test set 11,490 samples.
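A minimal sketch of this preprocessing, assuming a simple frequency-sorted vocabulary (function and variable names here are illustrative, not from the patent): the vocabulary is capped, the special marks are prepended, and out-of-vocabulary words map to [UNK].

```python
from collections import Counter

# Illustrative sketch: build a frequency-sorted vocabulary capped at 200k
# words, prepend the special marks, and convert words to ids.
SPECIALS = ["[PAD]", "[UNK]", "[START]", "[STOP]"]

def build_vocab(tokenized_texts, max_size=200_000):
    counts = Counter(w for text in tokenized_texts for w in text)
    words = [w for w, _ in counts.most_common(max_size)]
    return {w: i for i, w in enumerate(SPECIALS + words)}

def to_ids(words, vocab):
    unk = vocab["[UNK]"]
    return [vocab.get(w, unk) for w in words]

vocab = build_vocab([["the", "cat", "sat", "on", "the", "mat"]])
ids = to_ids(["the", "dog", "sat"], vocab)   # "dog" maps to [UNK]
```

In a real pipeline the ids for articles and abstracts would also be padded with [PAD] and bracketed with [START]/[STOP] before batching.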
And a coding step, namely performing word embedding on the input sequence to obtain a 128-dimensional vector, and obtaining a text expression vector carrying text sequence information through a neural network.
The input sequence is an id sequence obtained by converting an article, the maximum length is 700, and the minimum length is 30.
The neural network in the encoding step is composed of a single-layer bidirectional LSTM, the number of hidden layer units is 256, and the forward and reverse hidden layer states h are connected to obtain the final hidden layer state.
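The bidirectional encoding above can be sketched with a minimal numpy LSTM cell: one cell is run forward and one backward over the embeddings, and the per-step states are concatenated. This is a toy stand-in (random weights, small sizes, and a single shared weight matrix, whereas a real bidirectional encoder trains separate forward and backward weights).

```python
import numpy as np

rng = np.random.default_rng(2)
T, d_in, d_h = 6, 4, 8          # sequence length, embedding dim, hidden units

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_run(xs, W, b):
    # W maps [x_t; h_{t-1}] to the stacked gates (input, forget, output, cell).
    h = np.zeros(d_h); c = np.zeros(d_h); hs = []
    for x in xs:
        z = np.concatenate([x, h]) @ W + b
        i, f, o = sigmoid(z[:d_h]), sigmoid(z[d_h:2*d_h]), sigmoid(z[2*d_h:3*d_h])
        g = np.tanh(z[3*d_h:])
        c = f * c + i * g
        h = o * np.tanh(c)
        hs.append(h)
    return np.array(hs)

xs = rng.normal(size=(T, d_in))                 # stand-in word embeddings
W = rng.normal(size=(d_in + d_h, 4 * d_h)) * 0.1
b = np.zeros(4 * d_h)
h_fwd = lstm_run(xs, W, b)                      # forward pass
h_bwd = lstm_run(xs[::-1], W, b)[::-1]          # backward pass, re-aligned
h_enc = np.concatenate([h_fwd, h_bwd], axis=1)  # per-step states, dim 2*d_h
```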
Decoding step: receive the word vector of the input and the previous hidden layer state, and obtain the hidden layer state s_t of the current moment through a single-layer unidirectional LSTM neural network; the number of hidden units is 256.
An attention calculation step: combine the decoding state s_t obtained by the decoding step at the current moment with the hidden layer states of the input sequence from the encoding step, obtaining the context vector u_t at the current moment.
The context vector at time t is calculated as follows:
e_t,i = v^T tanh(W_h h_i + W_s s_t + b_att), a_t = softmax(e_t), u_t = Σ_{i=1..N} a_t,i h_i
wherein v, W_h, W_s and b_att are parameters to be learned, h_i is the hidden layer state value of the encoder, and N is the length of the input sequence.
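A numpy sketch of this attention computation, assuming the standard additive (Bahdanau-style) form implied by the listed parameters v, W_h, W_s and b_att; the weights here are random stand-ins, not trained values.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 8                      # input length, hidden size (toy values)
h = rng.normal(size=(N, d))      # encoder hidden states h_i
s_t = rng.normal(size=d)         # decoder state at step t
W_h = rng.normal(size=(d, d)); W_s = rng.normal(size=(d, d))
b_att = rng.normal(size=d); v = rng.normal(size=d)

e = np.tanh(h @ W_h + s_t @ W_s + b_att) @ v   # scores e_t,i, shape (N,)
a = np.exp(e - e.max()); a /= a.sum()          # attention distribution a_t
u_t = a @ h                                    # context vector u_t, shape (d,)
```

The softmax turns the N scores into a distribution over input positions, so u_t is a weighted average of the encoder states focused on the positions most useful for the current output.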
And an abstract generating step, namely mapping the output obtained in the decoding step into vectors of the dimension of the size of the word list through two linear layers, wherein each dimension represents the probability of the word in the word list, and selecting candidate words by using a certain selection strategy.
The selection strategy: in the testing stage, a beam search algorithm keeps the 4 results with the highest probability at each step until the summary sequence with the highest overall probability is obtained; the training stage takes only the word with the highest probability at each step. After the summary is completely generated, it is compared and evaluated against the reference summary.
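The beam-search selection strategy can be sketched as follows (beam width 4 as above; `step_probs` is a random placeholder for the trained decoder, which would really condition on the full prefix — all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
VOCAB, STOP, BEAM, MAX_LEN = 10, 0, 4, 5   # toy sizes; STOP is the end mark id

def step_probs(prefix):
    logits = rng.normal(size=VOCAB)        # placeholder for the decoder output
    p = np.exp(logits - logits.max())
    return p / p.sum()

def beam_search():
    beams = [([], 0.0)]                    # (token sequence, log-probability)
    for _ in range(MAX_LEN):
        candidates = []
        for seq, lp in beams:
            if seq and seq[-1] == STOP:    # finished hypotheses carry over
                candidates.append((seq, lp))
                continue
            p = step_probs(seq)
            for w in np.argsort(p)[-BEAM:]:            # top-4 extensions
                candidates.append((seq + [int(w)], lp + float(np.log(p[w]))))
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:BEAM]
    return beams[0][0]

summary_ids = beam_search()
```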
The maximum length of the generated abstract is 100, and the probability calculation formula is as follows:
p_v = softmax(V_1(V_2[s_t, u_t] + b_2) + b_1)
wherein V_1, V_2, b_1 and b_2 are all parameters to be learned, and p_v provides the basis for predicting the next word.
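The two-linear-layer projection above can be sketched directly in numpy: the decoder state and context vector are concatenated and mapped to a distribution over the vocabulary (sizes and weights here are toy stand-ins).

```python
import numpy as np

rng = np.random.default_rng(1)
d, hidden, vocab_size = 8, 16, 50         # toy sizes
s_t = rng.normal(size=d); u_t = rng.normal(size=d)
V2 = rng.normal(size=(2 * d, hidden)); b2 = rng.normal(size=hidden)
V1 = rng.normal(size=(hidden, vocab_size)); b1 = rng.normal(size=vocab_size)

z = np.concatenate([s_t, u_t]) @ V2 + b2  # inner linear layer on [s_t, u_t]
logits = z @ V1 + b1                      # outer linear layer to vocab size
p_v = np.exp(logits - logits.max()); p_v /= p_v.sum()   # softmax
next_word = int(p_v.argmax())             # greedy choice used during training
```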
The abstract generation step also includes performing a semantic similarity (Rel) calculation between the finally obtained predicted abstract and the source text sequence, penalizing abstracts with low semantic relevance during training, calculated as follows:
wherein the forward and backward hidden layer states are used respectively, G_t is the encoder hidden layer state, λ is an adjustable factor that defaults to 1, M is the length of the generated abstract sequence, and loss_t is the loss of each step, combined with the similarity to make up the total loss.
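The Rel formula itself is given only as a figure in the original. The sketch below therefore assumes a simple stand-in: a cosine similarity between mean-pooled summary and source representations, subtracted from the averaged per-step loss with weight λ — an illustrative construction, not the patented formula.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def total_loss(step_losses, summary_states, source_states, lam=1.0):
    # Assumed stand-in for Rel: cosine of mean-pooled representations.
    rel = cosine(summary_states.mean(axis=0), source_states.mean(axis=0))
    # Low semantic relevance (small rel) leaves the loss larger, i.e. is
    # penalised, matching the behaviour described in the text.
    return float(np.mean(step_losses)) - lam * rel

rng = np.random.default_rng(4)
losses = rng.uniform(0.5, 2.0, size=10)
summ = rng.normal(size=(10, 16)); src = rng.normal(size=(30, 16))
loss = total_loss(losses, summ, src)
```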
The training process uses the back-propagation algorithm with an Adagrad optimizer, a learning rate of 0.15, and an initial accumulator value of 0.1.
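The Adagrad update with these hyperparameters can be sketched on a toy objective (minimizing w²); the per-parameter accumulator of squared gradients shrinks the effective step size over time.

```python
import numpy as np

def adagrad_step(w, grad, accum, lr=0.15):
    # Accumulate squared gradients, then scale the step per parameter.
    accum += grad ** 2
    return w - lr * grad / np.sqrt(accum), accum

w = np.array([5.0])
accum = np.full(1, 0.1)          # initial accumulator value 0.1
for _ in range(200):
    grad = 2 * w                 # gradient of the toy objective w^2
    w, accum = adagrad_step(w, grad, accum)
```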
The decoding step is divided into a training stage and a testing stage: the training stage takes the reference abstract as input (teacher forcing), while the testing stage takes the output of the previous moment as the input of the current moment.
The reference abstracts and predicted abstracts are evaluated with the ROUGE metrics. The experiments use a Linux operating system, run on a GPU, with Python as the programming language and TensorFlow as the platform. The model with semantic similarity introduced runs for about 4 days, about 380,000 iterations; the experimental results are shown in the table below.
TABLE 1 comparison of the results of the three models
Experimental model | ROUGE-1 | ROUGE-2 | ROUGE-L
---|---|---|---
Basic LSTM model | 0.2896 | 0.1028 | 0.2613
LSTM+Attention | 0.3116 | 0.1127 | 0.2920
LSTM+Attention+Rel | 0.3493 | 0.1390 | 0.3342
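For reference, ROUGE-1 at its core is a unigram-overlap score; a simplified F1 sketch is below (the official ROUGE toolkit adds stemming, stopword options, and multi-reference handling, none of which are modeled here).

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    # Clipped unigram overlap between candidate and reference token lists.
    c, r = Counter(candidate), Counter(reference)
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat".split(), "the cat sat down".split())
```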
By fusing the attention mechanism, the method fully exploits the seq2seq model's ability to mine deep text semantic information, so that decoding can focus on the information in the input sequence that is useful for the current output; by fusing semantic similarity into the loss calculation, the model attends to semantic similarity with the source text while generating the abstract, yielding sentences that better match the original semantics. Compared with traditional statistics-based automatic summarization methods, the deep-learning-based model has stronger representation capability and a great advantage on the automatic text summarization task.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.
Claims (7)
1. An automatic text summarization method based on enhanced semantics, characterized in that the automatic text summarization method comprises:
a text preprocessing step, namely performing word segmentation, lemmatization and coreference resolution on the text, arranging words from high to low word frequency, and converting the words into an id sequence;
coding, namely coding an input sequence, and obtaining a hidden layer state vector carrying text sequence information through a neural network;
a decoding step, namely initializing with the last hidden layer state obtained by the encoder and starting decoding to obtain the hidden layer state s_t of each step;
an attention distribution calculation step, namely combining the hidden layer states of the input sequence with the hidden layer state s_t obtained by decoding at the current moment to calculate the context vector, obtaining the context vector u_t at the current time t;
A summary generation step, namely mapping the output obtained in the decoding step into vectors of the dimension of the size of the word list through two linear layers, wherein each dimension represents the probability of a word in the word list, selecting a candidate word by using a selection strategy, and generating a summary; the selection strategy refers to that 4 results with the maximum probability are selected in each step by using a beam search algorithm in the testing stage until a summary sequence with the maximum probability is obtained finally, only the words with the maximum probability are selected in the training stage, and the summary is compared and evaluated with a reference summary after being completely generated;
the abstract generating step further comprises: and performing semantic similarity Rel calculation on the finally obtained prediction abstract and the source text sequence, and punishing the abstract with low semantic relevance in the training process, wherein the calculation is as follows:
wherein the forward and backward hidden layer states are used respectively, G_t is the encoder hidden layer state, λ is an adjustable factor, M is the length of the generated abstract sequence, and loss_t is the loss of each step, combined with the semantic similarity Rel to form the total loss;
in the abstract generating step, only one word is generated in each step, and the maximum length of the generated abstract is 100, that is, the maximum cycle number from the encoding step to the abstract generating step is 100; generation stops when the end mark is output or the maximum length is reached, and the probability calculation formula is as follows:
p_v = softmax(V_1(V_2[s_t, u_t] + b_2) + b_1)
wherein V_1, V_2, b_1 and b_2 are all parameters to be learned, and p_v provides the basis for predicting the next word.
2. The method for automatically abstracting text based on enhanced semantics as claimed in claim 1, wherein the data of the text in the text preprocessing step is a corpus crawled by a crawler or an open-source corpus, and is composed of article-abstract pairs.
3. The method for automatically summarizing text based on enhanced semantics of claim 1, wherein in the text preprocessing step the top 200k words are taken as the basic vocabulary, the special marks [PAD], [UNK], [START] and [STOP] are added to the vocabulary, and the words of the text are converted into ids, each text corresponding to an id sequence.
4. The method for automatically abstracting text based on enhanced semantics of claim 1, wherein the input sequence is a word vector corresponding to an id sequence obtained by converting a text, the dimension of the word vector is 128, and the maximum length of the sequence is 700.
5. The method according to claim 1, wherein the neural network is a single-layer bi-directional LSTM, the number of hidden layer units is 256, and forward and reverse hidden layer states h are connected to obtain a final hidden layer state.
6. The method for automatic text summarization based on enhanced semantics of claim 1 wherein the decoding step is performed as follows:
receiving an input word vector and the hidden layer state of the previous moment, and obtaining the hidden layer state s_t of the current moment through a single-layer unidirectional LSTM neural network; the number of hidden units is 256.
7. The method according to claim 1, wherein the context vector u_t is calculated as follows:
e_t,i = v^T tanh(W_h h_i + W_s s_t + b_att), a_t = softmax(e_t), u_t = Σ_{i=1..N} a_t,i h_i
wherein v, W_h, W_s and b_att are parameters to be learned, h_i is the hidden layer state value of the encoder, and N is the length of the input sequence.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810281684.5A CN108804495B (en) | 2018-04-02 | 2018-04-02 | Automatic text summarization method based on enhanced semantics |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810281684.5A CN108804495B (en) | 2018-04-02 | 2018-04-02 | Automatic text summarization method based on enhanced semantics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108804495A CN108804495A (en) | 2018-11-13 |
CN108804495B true CN108804495B (en) | 2021-10-22 |
Family
ID=64095279
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810281684.5A Expired - Fee Related CN108804495B (en) | 2018-04-02 | 2018-04-02 | Automatic text summarization method based on enhanced semantics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108804495B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | |
Granted publication date: 20211022 |