CN111061862A - Method for generating abstract based on attention mechanism - Google Patents


Info

Publication number
CN111061862A
Authority
CN
China
Prior art keywords
word
article
abstract
layer
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911293797.8A
Other languages
Chinese (zh)
Other versions
CN111061862B (en)
Inventor
唐卓
方小泉
李肯立
周文
阳王东
周旭
刘楚波
曹嵘晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN201911293797.8A priority Critical patent/CN111061862B/en
Publication of CN111061862A publication Critical patent/CN111061862A/en
Application granted granted Critical
Publication of CN111061862B publication Critical patent/CN111061862B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/34 Browsing; Visualisation therefor
    • G06F 16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/088 Non-supervised learning, e.g. competitive learning

Abstract

The invention discloses a method for generating a text abstract based on an attention mechanism. The method comprises two stages: the first stage is a sentence ordering process and the second stage is an abstract generation process, and the input of the abstract generation process is the N sentences most relevant to the article topic obtained in the first stage. In the first stage, a supervised ordering method is provided for articles with titles: the similarity between each sentence and the title is calculated, the sentences are sorted by similarity, and the N sentences with the highest similarity are selected. For the second stage, the invention proposes a new way of calculating the attention distribution between the encoder and the decoder, so that at different times the decoder focuses on different parts of the encoder output. Through this ordering method and abstract generation method, the invention solves the problem of losing article information that arises when a part of an overly long article is simply truncated to serve as the input of the abstract generation model.

Description

Method for generating abstract based on attention mechanism
Technical Field
The invention belongs to the field of natural language processing, and particularly relates to a method for generating an abstract based on an attention mechanism.
Background
With the development of the Internet, the amount of online information has grown explosively, and how to acquire this information quickly and effectively has become an important research problem. Text summarization developed against this background, and with the progress of information retrieval and natural language processing technologies it has become a research hotspot in recent years.
The purpose of text summarization is to convert a text or a collection of texts into a short text containing the key information. According to the type of input, text summarization can be divided into single-document summarization, which generates a summary from one given document, and multi-document summarization, which generates a summary from a given set of topic-related documents. According to how the summary is produced, there are two approaches: the extractive approach finds key sentences directly in the article and combines them into a summary in the order in which they appear, while the generative (abstractive) approach requires the computer to read and understand the content of the article and express it in a more condensed form. Compared with the extractive technique, the advanced natural language processing algorithms used in the generative approach produce a more concise summary through rephrasing, synonym substitution, sentence abbreviation and similar techniques.
At present, most abstract generation methods are implemented with an encoder and a decoder that use a recurrent neural network (RNN) or one of its variants as the basic unit. However, this way of generating abstracts has some non-negligible drawbacks: first, the RNN training process is complex and training is very slow; second, because the input length of the encoder is limited, when an article is longer than the input length the content beyond that length is simply discarded, so some important information in the article is lost.
Disclosure of Invention
In view of the above defects or improvement requirements of the prior art, the present invention provides a method for generating an abstract based on an attention mechanism. It aims to solve the technical problems of slow training and high training difficulty in existing RNN-based abstractive summarization methods, as well as the technical problem that important information in the article is lost.
To achieve the above object, according to one aspect of the present invention, there is provided a method for generating a summary based on an attention mechanism, comprising the steps of:
s1, acquiring an article from the Internet, and inputting the article into a trained sentence sequencing model to acquire a simplified article;
and S2, inputting the simplified article obtained in the step S1 into a trained abstract generation model to obtain an abstract of the article.
Preferably, the training process in the sentence ordering model is as follows:
(1) acquiring a title of an article, and inputting each word in the title into a title-level encoder of a sentence sequencing model to obtain a semantic vector of the title;
(2) obtaining sentences in the article, inputting each word in the sentences into a sentence-level encoder of a sentence sequencing model to obtain semantic vectors which correspond to the sentences and contain title information;
(3) and (3) calculating the similarity between the title and the sentence according to the semantic vector which is obtained in the step (2), corresponds to the sentence and contains the title information.
The specific step is that firstly, the maximum pooling operation is carried out on the semantic vector which is obtained in the step (2), corresponds to the sentence and contains the title information so as to obtain the final representation of the sentence, and the final representation is processed by using linear mapping and sigmoid activation function so as to obtain the similarity between the title and the sentence.
The calculation formula of the similarity between the title and the sentence is as follows:
s = sigmoid(w1 · maxpooling(h_1^s, h_2^s, …, h_n^s))
where s represents the similarity between the title and the sentence, n represents the total number of words in the sentence, w1 represents the weight of the linear mapping, maxpooling represents the maximum pooling operation (the maximum value is selected as the result), the sigmoid activation function transforms a continuous input value into an output between 0 and 1, and h_j^s represents the semantic vector which corresponds to the sentence and contains the title information obtained in step (2-4).
(4) repeating the steps (2) and (3) m times to obtain the similarity between the title and each sentence in the article, sorting all m sentences of the article in descending order of similarity, selecting the N sentences with the highest similarity from the m sentences, and forming a new article according to the order in which these N sentences appear in the original article, wherein m represents the total number of sentences in the article and N is an integer between 10 and 20;
(5) and (3) acquiring a text abstract data set, and executing the steps (1) to (4) on each article in the text abstract data set to obtain new texts, wherein all the new texts form the new data set.
Preferably, step (1) comprises the sub-steps of:
(1-1) inputting each word in the title into a word embedding layer of a title-level encoder, inputting an output result of the word embedding layer into a position coding layer of the title-level encoder as a first word vector to obtain a position coding vector of each word, and adding the position coding vector of each word and the first word vector to obtain a second word vector which corresponds to each word in the title and contains position information;
wherein, the position coding vector of the word is calculated by sine and cosine coding:
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
where pos denotes the position of the word, d_model denotes the dimension of the word vector, and 2i and 2i+1 index the even and odd dimensions of the position coding vector, the dimension index ranging from 0 to (d_model − 1).
The second word vector is calculated by the following formula:
E_pos(x_j) = E_word(x_j) + PE(x_j)
where E_word(x_j) represents the first word vector of x_j, x_j represents the j-th word in title x, E_pos(x_j) represents the second word vector of the j-th word in title x, and j ranges from 0 to the length of the title.
(1-2) inputting the second word vector obtained in the step (1-1) into a multi-head self-attention layer of a header-level encoder to obtain a self-attention layer output result;
(1-3) inputting the self-attention layer output result obtained in the step (1-2) into the position embedding network layer of the title-level encoder to obtain the semantic vector of the title;
preferably, the step (1-2) is embodied in such a way that firstly, the second word vector obtained in the step (1-1) is taken as the question Q, the key K and the value V, then Q, K and V are linearly mapped and the dimension d is the dimensionmodelCut them into nheadPortions of each portion being divided into knotsAll fruits include problem QaKey KaAnd the value VaDimension of each segmentation result is dkAnd has dmodel=nhead×dkWherein n isheadRepresenting the number of heads of a multi-head self-attention layer;
then, taking each segmentation result as the input of a corresponding head in the multi-head self-attention layer, and calculating the self-attention output result of each head:
Figure BDA0002319952060000041
wherein the value range of a is 1 to the number of the head of the multi-head self-attention layer, softmax is an activation function, and the calculation formula is as follows:
Figure BDA0002319952060000042
Softi′is the i' th output value of the softmax activation function,
Figure BDA0002319952060000043
is the element of the i 'th dimension of the input, and the value range of j' is 0 to (d)model-1)。
Finally, all n are put togetherheadAnd splicing the self-attention output results of the individual heads to obtain the self-attention layer output result.
Preferably, the position embedding network layer comprises a first convolutional layer, a second convolutional layer and a Relu activation function which are connected in sequence;
wherein the input matrix of the first convolutional layer has a size of d_model × len_q, len_q indicating the length of the title, the convolution kernel size is d_model × 2048 × 1, the step size is 1, and the output matrix size is 2048 × len_q;
the input matrix of the second convolutional layer has a size of 2048 × len_q, the convolution kernel size is 2048 × d_model × 1, the step size is 1, and the output matrix size is d_model × len_q;
the Relu activation function is calculated as:
Relu(x″) = max(0, x″)
and the final output result of the position embedding network layer is:
FFN(x′) = conv2(Relu(conv1(x′)))
where x′ represents the self-attention layer output result, conv1 represents the first convolutional layer, conv2 represents the second convolutional layer, and FFN(x′) is the semantic vector of the title.
Preferably, step (2) comprises the sub-steps of:
(2-1) inputting each word of a sentence in the article into a word embedding layer of a sentence-level encoder, inputting an output result of the word embedding layer into a position encoding layer as a first word vector to obtain a position encoding vector of each word, and adding the position encoding vector of each word and the first word vector to obtain a second word vector which corresponds to each word in the sentence and contains position information;
(2-2) inputting the second word vector obtained in the step (2-1) into a multi-head self-attention layer of a sentence-level encoder to obtain a self-attention layer output result;
(2-3) inputting the semantic vector of the title obtained in the step (1) and the self-attention layer output result obtained in the step (2-2) into another multi-head self-attention layer of the sentence-level encoder together to obtain a semantic vector corresponding to the sentence and the title;
and (2-4) inputting the semantic vector corresponding to the sentence and the title obtained in the step (2-3) into the position embedding network layer of the sentence-level encoder so as to obtain a semantic vector corresponding to the sentence and containing title information.
Preferably, the training process of the abstract generation model is as follows:
(6) acquiring a sample from the new data set generated in the step (5), wherein the sample comprises an article X and an abstract Y of the article X, and inputting the article X in the sample into an abstract-level encoder of the text abstract generation model to obtain an article semantic vector containing the full-text information of the article X;
(7) inputting the 0th to (y−1)-th words in the abstract Y of the article X into a decoder of the abstract generation model to generate y−1 abstract words, wherein y represents the total number of words in the abstract;
(8) repeating the steps (6) and (7) on the new data set obtained in the step (5) to train the abstract generating model until the abstract generating model converges, thereby obtaining the trained abstract generating model.
Specifically, the condition for the abstract generation model to converge is that the loss no longer decreases, or the number of iterations reaches a preset upper limit of 800,000.
Preferably, step (6) comprises the sub-steps of:
(6-1) inputting each word in the article X into a word embedding layer of an abstract-level encoder, inputting the output result of the word embedding layer into a position coding layer of the abstract-level encoder as a first word vector to obtain the position coding vector of each word, and adding the position coding vector of each word and the first word vector to obtain a second word vector which corresponds to each word in the article X and contains position information;
(6-2) inputting the second word vector obtained in the step (6-1) into a multi-head self-attention layer to obtain a multi-head self-attention layer output result; the multi-head self-attention layer output result is then input into the position embedding network layer of the abstract-level encoder so as to obtain an article semantic vector.
Preferably, step (7) comprises the sub-steps of:
(7-1) inputting the first y−1 words in the abstract Y into a word embedding layer of a decoder, inputting the output result of the word embedding layer as a first word vector into a position coding layer to obtain the position coding vectors of all the words, and adding the position coding vectors of all the words and the first word vectors to obtain second word vectors which correspond to all the words in the abstract and contain position information;
(7-2) inputting second word vectors which correspond to all the words in the abstract obtained in the step (7-1) and contain position information into the multi-head self-attention layer to obtain an output result of the multi-head self-attention layer;
(7-3) processing the multi-head self-attention layer output result obtained in the step (7-2) by using a mask mechanism to obtain a processed multi-head self-attention layer output result;
wherein the mask matrix mask is a lower triangular matrix with a size of (y−1) × (y−1), wherein y is the total number of words in the abstract Y, and:
mask[i][j] = 1 if j ≤ i, and mask[i][j] = 0 if j > i, for i, j = 1, …, y−1.
and (7-4) inputting the article semantic vector obtained in the step (6) and the multi-head output result from the attention layer processed in the step (7-3) into a time penalty attention layer of a decoder together to obtain a context matrix containing article information and generated abstract words.
(7-5) inputting the context matrix obtained in the step (7-4) into a position feedforward network of a digest-level encoder to obtain a plurality of decoded words, mapping the decoded words onto a vocabulary through a full connection layer of a decoder to obtain the probability distribution of each decoded word in the vocabulary, and obtaining the probability that the decoded word is a real digest word according to the probability distribution;
specifically, this step first calculates the probability distribution of each decoded word in the vocabulary:
P_vocab = W_V(FFN(C))
wherein W_V is the weight of the full connection layer;
then, according to the probability distribution P_vocab, the probability that each decoded word is the real abstract word is calculated:
P(y*_t) = P_vocab(y*_t)
wherein y*_t represents the real abstract word corresponding to the t-th decoded word;
(7-6) calculating a loss value according to the probability that the decoded word obtained in the step (7-5) is the real abstract word:
loss = −(1/T) · Σ_{t=1}^{T} log P(y*_t)
wherein T represents the total number of the decoded words obtained in the step (7-5), and the value of T is y−1.
Preferably, in the step (7-4), when the t-th abstract word is generated, the attention distribution of the abstract word over the article is calculated as follows:
first, the attention distribution over the article is calculated:
attention_t = softmax(e_t)
wherein
e_t = V_v · tanh(W_h · mul_output[t] + W_e · enc_output^T)
wherein mul_output[t] represents the t-th row element of the lower triangular matrix obtained in step (7-3), enc_output represents the article semantic vector obtained in the step (6), T represents the transposition operation, W_h, W_e and V_v are all weights of the linear mapping operation, and tanh is the activation function:
tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
finally, the attention distribution is multiplied by the article semantic vector obtained in the step (6) to obtain a context matrix containing the article information and the generated abstract words.
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
(1) because the sentence sequencing model and the abstract generating model do not need to be circulated, all words in the sequence are processed in parallel, and the context and the far words are combined by using the self-attention mechanism (namely using a multi-head self-attention layer), the training difficulty of the model is lower than that of RNN, and the training speed is much higher than that of RNN;
(2) in the sentence ordering model, a title encoder encodes the title of the article and a sentence encoder encodes the sentences in the article, so that semantic vectors containing title information and sentence information are obtained; the similarity between the title and each sentence is then calculated from these semantic vectors, and the N sentences with the highest similarity to the title are taken as the input of the abstract generation model, which solves the technical problem in the prior art that article information is lost because part of the article is directly cut off;
(3) when the attention distribution between the decoder and the encoder is calculated, the invention provides a time-based attention mechanism through the step (7-4), and the condition that the generated abstract contains a plurality of repeated words can be relieved.
Drawings
FIG. 1 is an architectural diagram of a sentence ordering model used by the present invention.
FIG. 2 is an architectural diagram of a summary generation model used by the present invention.
FIG. 3 is a flow chart of a method for generating a summary based on an attention mechanism of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The input lengths of both the encoder and the decoder of the abstract generation model are set to fixed values. If the input length is set too long, the model becomes difficult to train and its accuracy also decreases. A common way to deal with this is to truncate the part of the text longer than the set maximum length, which may cut away part of the complete information of the text and thus prevent the model from fully summarizing it. It is therefore meaningful to study sentence ordering for the article. In this respect, for articles with a title, the invention calculates the semantic similarity between each sentence in the article and the title, and the N sentences with the highest similarity are used to summarize the whole article. For articles without a title, N sentences are extracted from the article with the unsupervised TextRank method and used as the input of the generation model.
As shown in fig. 3, the present invention provides a method for generating a summary based on an attention mechanism, comprising the following steps:
firstly, acquiring an article from the Internet, and inputting the article into a trained sentence sequencing model (shown in figure 1) to acquire a simplified article;
and secondly, inputting the simplified article obtained in the step one into a trained abstract generation model (as shown in figure 2) to obtain an abstract of the article.
The training process in the sentence sequencing model is as follows:
(1) acquiring a title of an article, and inputting each word in the title into a title-level encoder of a sentence sequencing model to obtain a semantic vector of the title;
the method comprises the following substeps:
(1-1) inputting each word in the title into a word embedding layer of a title-level encoder, inputting an output result of the word embedding layer into a position coding layer of the title-level encoder as a first word vector to obtain a position coding vector of each word, and adding the position coding vector of each word and the first word vector to obtain a second word vector which corresponds to each word in the title and contains position information;
specifically, the word embedding layer acquires a first word vector of each word in the title according to a pre-established word vector table; the word vector table is obtained by training word vectors using a wikipedia corpus.
The position coding layer adds the relative position information of the words (tokens) at input time, so that the title-level encoder can make use of the order information of the words in the title. The position coding vector of a word is calculated by sine and cosine coding:
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
where pos denotes the position of the word, d_model denotes the dimension of the word vector, and 2i and 2i+1 index the even and odd dimensions of the position coding vector, the dimension index ranging from 0 to (d_model − 1).
The second word vector is calculated by the following formula:
E_pos(x_j) = E_word(x_j) + PE(x_j)
where E_word(x_j) represents the first word vector of x_j, x_j represents the j-th word in title x, E_pos(x_j) represents the second word vector of the j-th word in title x, and j ranges from 0 to the length of the title.
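As an illustration of step (1-1), the sketch below, written in Python with NumPy, computes the sine and cosine position coding vectors and adds them to the first word vectors; the vocabulary size, embedding table and title word ids are made-up example values rather than parameters fixed by the patent.

    import numpy as np

    def position_encoding(seq_len, d_model):
        """Sine/cosine position coding: even dimensions use sin, odd dimensions use cos."""
        pe = np.zeros((seq_len, d_model))
        pos = np.arange(seq_len)[:, None]              # word positions 0 .. seq_len-1
        i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
        angle = pos / np.power(10000.0, i / d_model)
        pe[:, 0::2] = np.sin(angle)                    # PE(pos, 2i)
        pe[:, 1::2] = np.cos(angle)                    # PE(pos, 2i+1)
        return pe

    # Illustrative title of 6 word ids and a random word-vector table (first word vectors).
    d_model, vocab_size = 512, 10000
    title_ids = np.array([5, 42, 7, 901, 3, 17])
    embedding_table = np.random.randn(vocab_size, d_model) * 0.02

    first_word_vectors = embedding_table[title_ids]    # word embedding layer lookup
    second_word_vectors = first_word_vectors + position_encoding(len(title_ids), d_model)
    print(second_word_vectors.shape)                   # (6, 512): one position-aware vector per word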
(1-2) inputting the second word vector obtained in the step (1-1) into a Multi-head self-attention layer (Multi-head self-attention layer) of a title level encoder to obtain a self-attention layer output result;
This step is specifically as follows: firstly, the second word vector obtained in the step (1-1) is used as the question (query, hereinafter referred to as Q), the key (key, hereinafter referred to as K) and the value (value, hereinafter referred to as V); Q, K and V are then linearly mapped and split along the dimension d_model into n_head parts (where n_head represents the number of heads of the multi-head self-attention layer); each split result includes a question Q_a, a key K_a and a value V_a, the dimension of each split result is d_k, and d_model = n_head × d_k.
Then, each split result is taken as the input of the corresponding head in the multi-head self-attention layer, and the self-attention output result of each head is calculated:
head_a = softmax(Q_a · K_a^T / sqrt(d_k)) · V_a
wherein the value of a ranges from 1 to the number of heads of the multi-head self-attention layer, and softmax is an activation function whose calculation formula is:
Soft_i′ = exp(z_i′) / Σ_{j′=0}^{d_model−1} exp(z_j′)
where Soft_i′ is the i′-th output value of the softmax activation function, z_i′ is the element in the i′-th dimension of the input, and the value range of j′ is 0 to (d_model − 1).
Finally, the self-attention output results of all n_head heads are spliced to obtain the self-attention layer output result.
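The per-head computation head_a = softmax(Q_a · K_a^T / sqrt(d_k)) · V_a and the final concatenation can be sketched as follows; the projection weights are random stand-ins and the head count is only an example, so this illustrates the mechanism rather than reproducing the trained title-level encoder.

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)        # numerical stability
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def multi_head_self_attention(x, w_q, w_k, w_v, n_head):
        """x: (seq_len, d_model). Map to Q, K, V, split into n_head parts of size d_k,
        apply scaled dot-product attention per head, then concatenate the heads."""
        seq_len, d_model = x.shape
        d_k = d_model // n_head
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        q = q.reshape(seq_len, n_head, d_k).transpose(1, 0, 2)
        k = k.reshape(seq_len, n_head, d_k).transpose(1, 0, 2)
        v = v.reshape(seq_len, n_head, d_k).transpose(1, 0, 2)
        scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)          # (n_head, seq_len, seq_len)
        heads = softmax(scores) @ v                               # (n_head, seq_len, d_k)
        return heads.transpose(1, 0, 2).reshape(seq_len, d_model) # concatenate the heads

    d_model, n_head, seq_len = 512, 8, 6
    x = np.random.randn(seq_len, d_model)              # second word vectors of the title
    w_q = np.random.randn(d_model, d_model) * 0.02
    w_k = np.random.randn(d_model, d_model) * 0.02
    w_v = np.random.randn(d_model, d_model) * 0.02
    print(multi_head_self_attention(x, w_q, w_k, w_v, n_head).shape)   # (6, 512)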
(1-3) inputting the self-attention layer output result obtained in the step (1-2) into the position embedding network layer of the title-level encoder to obtain the semantic vector of the title;
specifically, the position embedding network layer comprises a first convolutional layer, a second convolutional layer and a Relu activation function which are connected in sequence;
wherein the input matrix of the first convolutional layer has a size of d_model × len_q, len_q indicating the length of the title, the convolution kernel size is d_model × 2048 × 1, the step size is 1, and the output matrix size is 2048 × len_q;
the input matrix of the second convolutional layer has a size of 2048 × len_q, the convolution kernel size is 2048 × d_model × 1, the step size is 1, and the output matrix size is d_model × len_q;
the Relu activation function is calculated as:
Relu(x″) = max(0, x″)
and the final output in this step is:
FFN(x′) = conv2(Relu(conv1(x′)))
where x′ represents the self-attention layer output result, conv1 represents the first convolutional layer, conv2 represents the second convolutional layer, and FFN(x′) is the semantic vector of the title.
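Because both convolution kernels have width 1, the position embedding network layer amounts to the same two-layer transformation applied independently at every position of the sequence; a minimal sketch with random weights standing in for the trained kernels is given below.

    import numpy as np

    def position_embedding_network(x, w1, b1, w2, b2):
        """x: (d_model, len_q). conv1 maps d_model -> 2048, conv2 maps 2048 -> d_model,
        both with kernel width 1, i.e. position-wise linear maps."""
        hidden = np.maximum(0.0, w1 @ x + b1[:, None])   # Relu(conv1(x')), shape (2048, len_q)
        return w2 @ hidden + b2[:, None]                 # conv2(...),      shape (d_model, len_q)

    d_model, len_q = 512, 6
    x = np.random.randn(d_model, len_q)                  # self-attention layer output
    w1, b1 = np.random.randn(2048, d_model) * 0.02, np.zeros(2048)
    w2, b2 = np.random.randn(d_model, 2048) * 0.02, np.zeros(d_model)
    print(position_embedding_network(x, w1, b1, w2, b2).shape)   # (512, 6)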
(2) Obtaining sentences in the article, inputting each word in the sentences into a sentence-level encoder of a sentence sequencing model to obtain semantic vectors which correspond to the sentences and contain title information;
the method comprises the following substeps:
(2-1) inputting each word of a sentence in the article into a word embedding layer of a sentence-level encoder, inputting an output result of the word embedding layer into a position encoding layer as a first word vector to obtain a position encoding vector of each word, and adding the position encoding vector of each word and the first word vector to obtain a second word vector which corresponds to each word in the sentence and contains position information;
specifically, the word embedding layer and the position coding layer in this step are completely the same as those in the step (1-1), and are not described herein again;
(2-2) inputting the second word vector obtained in the step (2-1) into a Multi-head self-attention layer (Multi-head self-attention layer) of a sentence-level encoder to obtain a self-attention layer output result;
specifically, the multi-head self-attention layer in this step is completely the same as the multi-head self-attention layer in step (1-2), and the problem, key and value are the second word vector output in step (2-1), which is not described herein again;
(2-3) inputting the semantic vector of the title obtained in the step (1) and the self-attention layer output result obtained in the step (2-2) into another multi-head self-attention layer of the sentence-level encoder together to obtain a semantic vector corresponding to the sentence and the title;
Specifically, in this step, the semantic vector of the title obtained in the step (1) is first used as the key and the value, and the self-attention layer output result obtained in the step (2-2) is used as the question; the question, the key and the value are then linearly mapped and split along their dimension d′_model into n′_head parts, each split result including a question Q_a′, a key K_a′ and a value V_a′, the dimension of each split result being d′_k, with d′_model = n′_head × d′_k.
Then, each split result is taken as the input of the corresponding head in the multi-head self-attention layer, and the self-attention output result of each head is calculated:
head_a′ = softmax(Q_a′ · K_a′^T / sqrt(d′_k)) · V_a′
wherein the value of a′ ranges from 1 to the number of heads of the multi-head self-attention layer, and softmax is the same activation function as in the step (1-2) above.
Finally, the self-attention output results of the n′_head heads are concatenated to obtain the semantic vectors corresponding to the sentence and the title.
(2-4) inputting the semantic vector corresponding to the sentence and the title obtained in the step (2-3) into the position embedding network layer of the sentence-level encoder so as to obtain a semantic vector corresponding to the sentence and containing title information;
specifically, the position embedded network layer in this step is completely the same as the position embedded network layer in the above step (1-3), and is not described herein again;
(3) and (3) calculating the similarity between the title and the sentence according to the semantic vector which is obtained in the step (2), corresponds to the sentence and contains the title information.
This step is specifically as follows: firstly, a max-pooling operation is performed on the semantic vector which is obtained in the step (2) and corresponds to the sentence and contains the title information, so as to obtain the final representation of the sentence, and the final representation is processed using a linear mapping and a sigmoid activation function to obtain the similarity between the title and the sentence.
The calculation formula of the similarity between the title and the sentence is as follows:
s = sigmoid(w1 · maxpooling(h_1^s, h_2^s, …, h_n^s))
where s represents the similarity between the title and the sentence, n represents the total number of words in the sentence, w1 represents the weight of the linear mapping, maxpooling represents the maximum pooling operation (the maximum value is selected as the result), the sigmoid activation function transforms a continuous input value into an output between 0 and 1, and h_j^s represents the semantic vector which corresponds to the sentence and contains the title information obtained in the step (2-4).
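As an illustration of this similarity computation, the sketch below max-pools the per-word semantic vectors, applies the linear mapping w1 and then a sigmoid; the dimension, number of words and weight are made-up example values, not the trained parameters of the model.

    import numpy as np

    def title_sentence_similarity(h, w1):
        """h: (n, d_model) semantic vectors containing title information, one per word.
        Max-pool over the words, apply the linear mapping, then squash with sigmoid."""
        pooled = h.max(axis=0)                  # maxpooling over the n word positions
        score = float(w1 @ pooled)              # linear mapping to a scalar
        return 1.0 / (1.0 + np.exp(-score))     # sigmoid: similarity in (0, 1)

    d_model, n_words = 512, 20
    h = np.random.randn(n_words, d_model)
    w1 = np.random.randn(d_model) * 0.02
    print(title_sentence_similarity(h, w1))     # e.g. 0.47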
(4) repeating the steps (2) and (3) m times (where m represents the total number of sentences in the article), thereby obtaining the similarity between the title and each sentence in the article, sorting all m sentences of the article in descending order of similarity, selecting the N sentences with the highest similarity from the m sentences (where N is an integer between 10 and 20, preferably 15), and forming a new article according to the order in which these N sentences appear in the original article;
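Given one similarity score per sentence, the selection in step (4) can be sketched as follows; the sentence list and scores are illustrative only. The chosen sentences are re-sorted by their original index so that the new article keeps the original order.

    def build_reduced_article(sentences, similarities, n=15):
        """Keep the n sentences most similar to the title, in their original article order."""
        ranked = sorted(range(len(sentences)), key=lambda i: similarities[i], reverse=True)
        keep = sorted(ranked[:n])               # restore original order of the chosen sentences
        return [sentences[i] for i in keep]

    sentences = ["s0", "s1", "s2", "s3", "s4"]
    similarities = [0.11, 0.93, 0.25, 0.78, 0.40]
    print(build_reduced_article(sentences, similarities, n=3))   # ['s1', 's3', 's4']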
(5) and (3) acquiring a text abstract data set, and executing the steps (1) to (4) on each article in the text abstract data set to obtain new texts, wherein all the new texts form the new data set.
The training process of the abstract generation model is as follows:
(6) acquiring a sample from the new data set generated in the step (5), wherein the sample comprises an article X and an abstract Y of the article X, and inputting the article X in the sample into an abstract-level encoder of the text abstract generation model to obtain an article semantic vector containing the full-text information of the article X;
the method comprises the following substeps:
(6-1) inputting each word in the article X into a word embedding layer of a abstract-level encoder, inputting an output result of the word embedding layer into a position coding layer of the abstract-level encoder as a first word vector to obtain a position coding vector of each word, and adding the position coding vector of each word and the first word vector to obtain a second word vector which corresponds to each word in the article X and contains position information;
specifically, the word embedding layer and the position coding layer in this step are completely the same as those in the step (1-1), and are not described herein again;
(6-2) inputting the second word vector obtained in the step (6-1) into a multi-head self-attention layer to obtain a multi-head self-attention layer output result; the multi-head self-attention layer output result is then input into the position embedding network layer of the abstract-level encoder so as to obtain an article semantic vector.
Specifically, the multi-head self-attention layer in this step is completely the same as the multi-head self-attention layer in step (1-2), the problem, key and value are the second word vector output in step (6-1), and the position embedding network layer is completely the same as the position embedding network layer in step (1-3), and is not described herein again;
(7) the 0th to (y−1)-th words in the abstract Y of the article X (where y represents the total number of words in the abstract) are input into the decoder of the abstract generation model to generate y−1 abstract words.
The method comprises the following steps:
(7-1) inputting the first Y-1 words in the abstract Y into a word embedding layer of a decoder, inputting the output result of the word embedding layer as a first word vector into a position coding layer to obtain position coding vectors of all the words, and adding the position coding vectors of all the words and the first word vector to obtain second word vectors which correspond to all the words in the abstract and contain position information;
specifically, the word embedding layer and the position encoding layer in this step are completely the same as those in the step (1-1), and are not described herein again;
(7-2) inputting a second word vector which corresponds to each word in the abstract obtained in the step (7-1) and contains position information into a Multi-head self-attention layer (Multi-head self-attention layer) to obtain a Multi-head self-attention layer output result;
specifically, the question, the key, and the value of the multi-head attention layer are all second word vectors corresponding to the words in the summary obtained in step (7-1) and containing position information, and the structure of the second word vectors is completely the same as that of the multi-head attention layer in step (1-2), and will not be described again here.
(7-3) processing the multi-head self-attention layer output result obtained in the step (7-2) by using a Mask (Mask) mechanism to obtain a processed multi-head self-attention layer output result;
Specifically, an abstract word to be generated depends only on the article and on the abstract words already generated before it, not on the abstract words generated after it. Since the first y−1 words of the abstract are used as input in the step (7-1), the multi-head self-attention layer output result of the step (7-2) contains information about all abstract words, so a mask mechanism must be applied to the multi-head self-attention layer output result of the step (7-2) in order to hide the abstract words that come after the abstract word currently being generated.
Specifically, this step is to multiply the multi-headed self-attention layer output result of step (7-2) by the following mask matrix, thereby obtaining a lower triangular matrix indicating that each word in the abstract will focus only on the abstract word generated before it.
The mask matrix mask is a lower triangular matrix of size (y−1) × (y−1), where y is the total number of words in the abstract Y:
mask[i][j] = 1 if j ≤ i, and mask[i][j] = 0 if j > i, for i, j = 1, …, y−1.
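A minimal sketch of the mask, assuming the masked quantity is the (y−1) × (y−1) decoder self-attention matrix (the size 4 is only an example): the mask is a lower triangular matrix of ones, and element-wise multiplication zeroes out every position after the word currently being generated.

    import numpy as np

    y_minus_1 = 4                                     # y - 1 decoder input positions
    mask = np.tril(np.ones((y_minus_1, y_minus_1)))   # 1 on and below the diagonal, 0 above
    print(mask)
    # [[1. 0. 0. 0.]
    #  [1. 1. 0. 0.]
    #  [1. 1. 1. 0.]
    #  [1. 1. 1. 1.]]

    attn = np.random.rand(y_minus_1, y_minus_1)       # decoder self-attention weights
    masked_attn = attn * mask                         # each position attends only to itself and earlier words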
And (7-4) inputting the article semantic vector obtained in the step (6) and the multi-head self-attention layer output result processed in the step (7-3) into a time-penalty attention layer (time-penalty attention layer) of a decoder together to obtain a context matrix containing article information and generated abstract words.
Specifically, when the t-th abstract word is generated, the attention distribution of the abstract word over the article is calculated as follows. First, an attention score over the article is computed and turned into an attention distribution:
e_t = V_v · tanh(W_h · mul_output[t] + W_e · enc_output^T)
attention_t = softmax(e_t)
wherein mul_output[t] represents the t-th row element of the lower triangular matrix obtained in the step (7-3), enc_output represents the article semantic vector obtained in the step (6), T represents the transposition operation, and W_h, W_e and V_v are the weights of the linear mapping operation (these weights are initialized at training time, are not fixed, and are learned so as to give the best results). tanh is the activation function, which is calculated by the formula:
tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
In order to prevent the article information attended to when the t-th abstract word is generated from being similar to the article information attended to by the abstract words generated in the previous steps, a penalty mechanism is adopted to penalize the words that received higher attention in the previous steps; under the penalty mechanism, the attention distribution over the article is calculated as:
attention′_t[i] = exp(e_t[i]) / Σ_{t′=1}^{t−1} exp(e_t′[i]), renormalized over i so that the weights sum to 1.
Finally, the attention distribution is multiplied by the article semantic vector obtained in the step (6) to obtain the context matrix C containing the article information and the generated abstract words.
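The exact penalty formula appears only as an image in the original publication, so the sketch below assumes a temporal-attention style penalty: the exponentiated score of each article position is divided by the sum of its exponentiated scores from earlier decoding steps, which matches the stated goal of penalizing words that already received high attention. The additive score built from W_h, W_e and V_v follows the symbols listed above; all weights and sizes are random stand-ins.

    import numpy as np

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    def time_penalty_context(mul_out_t, enc_out, prev_scores, w_h, w_e, v_v):
        """mul_out_t: (d_model,) t-th row of the masked decoder self-attention output.
        enc_out: (src_len, d_model) article semantic vectors.
        prev_scores: list of score vectors e_1 .. e_{t-1} from earlier decoding steps."""
        e_t = np.tanh(mul_out_t @ w_h + enc_out @ w_e) @ v_v       # additive attention scores, (src_len,)
        if prev_scores:                                            # penalize positions attended to before
            penalized = np.exp(e_t) / np.sum(np.exp(prev_scores), axis=0)
            attn = penalized / penalized.sum()
        else:
            attn = softmax(e_t)
        context = attn @ enc_out                                   # weighted sum of article vectors
        return context, e_t

    d_model, src_len = 512, 30
    w_h = np.random.randn(d_model, d_model) * 0.02
    w_e = np.random.randn(d_model, d_model) * 0.02
    v_v = np.random.randn(d_model) * 0.02
    enc_out = np.random.randn(src_len, d_model)
    prev = []
    for t in range(3):                                             # three decoding steps
        ctx, e_t = time_penalty_context(np.random.randn(d_model), enc_out, prev, w_h, w_e, v_v)
        prev.append(e_t)
    print(ctx.shape)                                               # (512,)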
(7-5) inputting the context matrix obtained in the step (7-4) into a position feedforward network of a digest-level encoder to obtain a plurality of decoded words, mapping the decoded words onto a vocabulary through a full connection layer of a decoder to obtain the probability distribution of each decoded word in the vocabulary, and obtaining the probability that the decoded word is a real digest word according to the probability distribution;
specifically, this step first calculates the probability distribution of each decoded word in the vocabulary:
P_vocab = W_V(FFN(C))
wherein W_V is the weight of the full connection layer;
then, according to the probability distribution P_vocab, the probability that each decoded word is the real abstract word is calculated:
P(y*_t) = P_vocab(y*_t)
wherein y*_t represents the real abstract word corresponding to the t-th decoded word.
(7-6) calculating a loss value according to the probability that the decoded word obtained in the step (7-5) is the real abstract word:
loss = −(1/T) · Σ_{t=1}^{T} log P(y*_t)
wherein T represents the total number of the decoded words obtained in the step (7-5), and the value of T is y−1.
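A sketch of steps (7-5) and (7-6) under simplifying assumptions: a random weight matrix stands in for the trained fully connected layer, a softmax is added so that P_vocab is a proper probability distribution (the text above only writes P_vocab = W_V(FFN(C))), and the target word ids are made up.

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    T, d_model, vocab_size = 4, 512, 50000           # T = y - 1 decoded words
    ffn_out = np.random.randn(T, d_model)            # FFN(C): position feed-forward output
    w_v = np.random.randn(d_model, vocab_size) * 0.02

    p_vocab = softmax(ffn_out @ w_v)                 # (T, vocab_size) probabilities over the vocabulary
    targets = np.array([17, 905, 3, 42])             # ids of the real abstract words y*_1 .. y*_T
    p_target = p_vocab[np.arange(T), targets]        # P(y*_t) for each decoded position
    loss = -np.mean(np.log(p_target))                # loss = -(1/T) * sum_t log P(y*_t)
    print(loss)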
(8) Repeating the steps (6) and (7) on the new data set obtained in the step (5) to train the abstract generating model until the abstract generating model converges, thereby obtaining a trained abstract generating model;
Specifically, the condition for the abstract generation model to converge is that the loss no longer decreases, or the number of iterations reaches a preset upper limit of 800,000.
In contrast to RNN-based methods, the present invention does not require recurrence; it processes all words or symbols in a sequence in parallel while using the self-attention mechanism to combine the context with more distant words. By processing all words in parallel and letting each word attend to the other words in the sentence over multiple processing steps, the invention trains much faster than an RNN, and such architectures have also produced much better results than RNNs on machine translation tasks.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method for generating an abstract based on an attention mechanism is characterized by comprising the following steps:
s1, acquiring an article from the Internet, and inputting the article into a trained sentence sequencing model to acquire a simplified article;
and S2, inputting the simplified article obtained in the step S1 into a trained abstract generation model to obtain an abstract of the article.
2. The method of claim 1, wherein the training process in the sentence ordering model is as follows:
(1) acquiring a title of an article, and inputting each word in the title into a title-level encoder of a sentence sequencing model to obtain a semantic vector of the title;
(2) obtaining sentences in the article, inputting each word in the sentences into a sentence-level encoder of a sentence sequencing model to obtain semantic vectors which correspond to the sentences and contain title information;
(3) and (3) calculating the similarity between the title and the sentence according to the semantic vector which is obtained in the step (2), corresponds to the sentence and contains the title information.
The specific step is that firstly, the maximum pooling operation is carried out on the semantic vector which is obtained in the step (2), corresponds to the sentence and contains the title information so as to obtain the final representation of the sentence, and the final representation is processed by using linear mapping and sigmoid activation function so as to obtain the similarity between the title and the sentence.
The calculation formula of the similarity between the title and the sentence is as follows:
s = sigmoid(w1 · maxpooling(h_1^s, h_2^s, …, h_n^s))
where s represents the similarity between the title and the sentence, n represents the total number of words in the sentence, w1 represents the weight of the linear mapping, maxpooling represents the maximum pooling operation (the maximum value is selected as the result), the sigmoid activation function transforms a continuous input value into an output between 0 and 1, and h_j^s represents the semantic vector which corresponds to the sentence and contains the title information obtained in step (2-4).
(4) repeating the steps (2) and (3) m times to obtain the similarity between the title and each sentence in the article, sorting all m sentences of the article in descending order of similarity, selecting the N sentences with the highest similarity from the m sentences, and forming a new article according to the order in which these N sentences appear in the original article, wherein m represents the total number of sentences in the article and N is an integer between 10 and 20;
(5) and (3) acquiring a text abstract data set, and executing the steps (1) to (4) on each article in the text abstract data set to obtain new texts, wherein all the new texts form the new data set.
3. The method according to claim 1, wherein step (1) comprises the sub-steps of:
(1-1) inputting each word in the title into a word embedding layer of a title-level encoder, inputting an output result of the word embedding layer into a position coding layer of the title-level encoder as a first word vector to obtain a position coding vector of each word, and adding the position coding vector of each word and the first word vector to obtain a second word vector which corresponds to each word in the title and contains position information;
wherein, the position coding vector of the word is calculated by sine and cosine coding:
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
where pos denotes the position of the word, d_model denotes the dimension of the word vector, and 2i and 2i+1 index the even and odd dimensions of the position coding vector, the dimension index ranging from 0 to (d_model − 1).
The second word vector is calculated by the following formula:
E_pos(x_j) = E_word(x_j) + PE(x_j)
where E_word(x_j) represents the first word vector of x_j, x_j represents the j-th word in title x, E_pos(x_j) represents the second word vector of the j-th word in title x, and j ranges from 0 to the length of the title.
(1-2) inputting the second word vector obtained in the step (1-1) into a multi-head self-attention layer of a header-level encoder to obtain a self-attention layer output result;
and (1-3) inputting the self-attention layer output result obtained in the step (1-2) into the position embedding network layer of the title-level encoder to obtain the semantic vector of the title.
4. The method according to claim 3, wherein the step (1-2) is implemented by first taking the second word vector obtained in the step (1-1) as the question Q, the key K and the value V, then linearly mapping Q, K and V and splitting them along the dimension d_model into n_head parts, each split result including a question Q_a, a key K_a and a value V_a, the dimension of each split result being d_k, with d_model = n_head × d_k, wherein n_head represents the number of heads of the multi-head self-attention layer;
then, each split result is taken as the input of the corresponding head in the multi-head self-attention layer, and the self-attention output result of each head is calculated:
head_a = softmax(Q_a · K_a^T / sqrt(d_k)) · V_a
wherein the value of a ranges from 1 to the number of heads of the multi-head self-attention layer, and softmax is an activation function whose calculation formula is:
Soft_i′ = exp(z_i′) / Σ_{j′=0}^{d_model−1} exp(z_j′)
where Soft_i′ is the i′-th output value of the softmax activation function, z_i′ is the element in the i′-th dimension of the input, and the value range of j′ is 0 to (d_model − 1).
Finally, the self-attention output results of all n_head heads are spliced to obtain the self-attention layer output result.
5. The method of claim 4,
the position embedding network layer comprises a first convolutional layer, a second convolutional layer and a Relu activation function which are connected in sequence;
wherein the input matrix of the first convolutional layer has a size of d_model × len_q, len_q indicating the length of the title, the convolution kernel size is d_model × 2048 × 1, the step size is 1, and the output matrix size is 2048 × len_q;
the input matrix of the second convolutional layer has a size of 2048 × len_q, the convolution kernel size is 2048 × d_model × 1, the step size is 1, and the output matrix size is d_model × len_q;
the Relu activation function is calculated as:
Relu(x″) = max(0, x″)
and the final output result of the position embedding network layer is:
FFN(x′) = conv2(Relu(conv1(x′)))
where x′ represents the self-attention layer output result, conv1 represents the first convolutional layer, conv2 represents the second convolutional layer, and FFN(x′) is the semantic vector of the title.
6. The method according to claim 5, wherein step (2) comprises the sub-steps of:
(2-1) inputting each word of a sentence in the article into a word embedding layer of a sentence-level encoder, inputting an output result of the word embedding layer into a position encoding layer as a first word vector to obtain a position encoding vector of each word, and adding the position encoding vector of each word and the first word vector to obtain a second word vector which corresponds to each word in the sentence and contains position information;
(2-2) inputting the second word vector obtained in the step (2-1) into a multi-head self-attention layer of a sentence-level encoder to obtain a self-attention layer output result;
(2-3) inputting the semantic vector of the title obtained in the step (1) and the self-attention layer output result obtained in the step (2-2) into another multi-head self-attention layer of the sentence-level encoder together to obtain a semantic vector corresponding to the sentence and the title;
and (2-4) inputting the semantic vector corresponding to the sentence and the title obtained in the step (2-3) into the position embedding network layer of the sentence-level encoder so as to obtain a semantic vector corresponding to the sentence and containing title information.
7. The method of claim 6, wherein the abstract generation model is trained as follows:
(6) acquiring a sample from the new data set generated in the step (5), wherein the sample comprises an article X and an abstract Y of the article X, and inputting the article X in the sample into an abstract-level encoder of the text abstract generation model to obtain an article semantic vector containing the full-text information of the article X;
(7) inputting the 0th to (y−1)-th words in the abstract Y of the article X into a decoder of the abstract generation model to generate y−1 abstract words, wherein y represents the total number of words in the abstract;
(8) repeating the steps (6) and (7) on the new data set obtained in the step (5) to train the abstract generating model until the abstract generating model converges, thereby obtaining the trained abstract generating model.
Specifically, the condition for the abstract generation model to converge is that the loss no longer decreases, or the number of iterations reaches a preset upper limit of 800,000.
8. The method according to claim 7, characterized in that step (6) comprises the following sub-steps:
(6-1) inputting each word in the article X into a word embedding layer of a abstract-level encoder, inputting an output result of the word embedding layer into a position coding layer of the abstract-level encoder as a first word vector to obtain a position coding vector of each word, and adding the position coding vector of each word and the first word vector to obtain a second word vector which corresponds to each word in the article X and contains position information;
(6-2) inputting the second word vector obtained in the step (6-1) into a multi-head self-attention layer to obtain a multi-head self-attention layer output result; the multi-head self-attention layer output result is then input into the position embedding network layer of the abstract-level encoder so as to obtain an article semantic vector.
9. The method according to claim 8, characterized in that step (7) comprises the following sub-steps:
(7-1) inputting the first Y-1 words in the abstract Y into a word embedding layer of a decoder, inputting the output result of the word embedding layer as a first word vector into a position coding layer to obtain position coding vectors of all the words, and adding the position coding vectors of all the words and the first word vector to obtain second word vectors which correspond to all the words in the abstract and contain position information;
(7-2) inputting second word vectors which correspond to all the words in the abstract obtained in the step (7-1) and contain position information into the multi-head self-attention layer to obtain an output result of the multi-head self-attention layer;
(7-3) processing the multi-head self-attention layer output result obtained in the step (7-2) by using a mask mechanism to obtain a processed multi-head self-attention layer output result;
wherein the mask matrix mask is a lower triangular matrix with a size of (y−1) × (y−1), wherein y is the total number of words in the abstract Y, and:
mask[i][j] = 1 if j ≤ i, and mask[i][j] = 0 if j > i, for i, j = 1, …, y−1.
and (7-4) inputting the article semantic vector obtained in the step (6) and the multi-head output result from the attention layer processed in the step (7-3) into a time penalty attention layer of a decoder together to obtain a context matrix containing article information and generated abstract words.
(7-5) inputting the context matrix obtained in the step (7-4) into a position feedforward network of a digest-level encoder to obtain a plurality of decoded words, mapping the decoded words onto a vocabulary through a full connection layer of a decoder to obtain the probability distribution of each decoded word in the vocabulary, and obtaining the probability that the decoded word is a real digest word according to the probability distribution;
specifically, this step first calculates the probability distribution of each decoded word in the vocabulary:
P_vocab = W_V(FFN(C))
wherein W_V is the weight of the full connection layer;
then, according to the probability distribution P_vocab, the probability that each decoded word is the real abstract word is calculated:
P(y*_t) = P_vocab(y*_t)
wherein y*_t represents the real abstract word corresponding to the t-th decoded word;
(7-6) calculating a loss value according to the probability that the decoded word obtained in the step (7-5) is the real abstract word:
loss = −(1/T) · Σ_{t=1}^{T} log P(y*_t)
wherein T represents the total number of the decoded words obtained in the step (7-5), and the value of T is y−1.
10. The method according to claim 9, wherein the step (7-4) is specifically that, when the t-th abstract word is generated, the attention distribution of the article by the abstract word is calculated in a manner that:
first, the attention distribution over the article is calculated:
attention_t = softmax(e_t)
wherein
e_t = V_v · tanh(W_h · mul_output[t] + W_e · enc_output^T)
wherein mul_output[t] represents the t-th row element of the lower triangular matrix obtained in step (7-3), enc_output represents the article semantic vector obtained in the step (6), T represents the transposition operation, W_h, W_e and V_v are all weights of the linear mapping operation, and tanh is the activation function:
tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
finally, the attention distribution is multiplied by the article semantic vector obtained in the step (6) to obtain a context matrix containing the article information and the generated abstract words.
CN201911293797.8A 2019-12-16 2019-12-16 Method for generating abstract based on attention mechanism Active CN111061862B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911293797.8A CN111061862B (en) 2019-12-16 2019-12-16 Method for generating abstract based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911293797.8A CN111061862B (en) 2019-12-16 2019-12-16 Method for generating abstract based on attention mechanism

Publications (2)

Publication Number Publication Date
CN111061862A true CN111061862A (en) 2020-04-24
CN111061862B CN111061862B (en) 2020-12-15

Family

ID=70301924

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911293797.8A Active CN111061862B (en) 2019-12-16 2019-12-16 Method for generating abstract based on attention mechanism

Country Status (1)

Country Link
CN (1) CN111061862B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9535899B2 (en) * 2013-02-20 2017-01-03 International Business Machines Corporation Automatic semantic rating and abstraction of literature
CN108319668A (en) * 2018-01-23 2018-07-24 义语智能科技(上海)有限公司 Generate the method and apparatus of text snippet
CN110472238A (en) * 2019-07-25 2019-11-19 昆明理工大学 Text snippet method based on level interaction attention

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
VASWANI A et al.: "Attention Is All You Need", 31st Conference on Neural Information Processing Systems (NIPS 2017) *
郭洪杰: "Research on Generative Automatic Summarization Technology Based on Deep Learning" (基于深度学习的生成式自动摘要技术研究), Wanfang Data (万方数据) *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111831814B (en) * 2020-06-04 2023-06-23 北京百度网讯科技有限公司 Pre-training method and device for abstract generation model, electronic equipment and storage medium
CN111831814A (en) * 2020-06-04 2020-10-27 北京百度网讯科技有限公司 Pre-training method and device of abstract generation model, electronic equipment and storage medium
CN113824624B (en) * 2020-06-19 2023-10-17 阿里巴巴集团控股有限公司 Training method of mail header generation model and mail header generation method
CN113824624A (en) * 2020-06-19 2021-12-21 阿里巴巴集团控股有限公司 Training method of mail title generation model and mail title generation method
CN111966820A (en) * 2020-07-21 2020-11-20 西北工业大学 Method and system for constructing and extracting generative abstract model
CN112417865A (en) * 2020-12-02 2021-02-26 中山大学 Abstract extraction method and system based on dynamic fusion of articles and titles
CN112668305B (en) * 2020-12-03 2024-02-09 华中科技大学 Attention mechanism-based thesis reference quantity prediction method and system
CN112668305A (en) * 2020-12-03 2021-04-16 华中科技大学 Paper quote amount prediction method and system based on attention mechanism
CN112749253B (en) * 2020-12-28 2022-04-05 湖南大学 Multi-text abstract generation method based on text relation graph
CN112749253A (en) * 2020-12-28 2021-05-04 湖南大学 Multi-text abstract generation method based on text relation graph
JP2022115160A (en) * 2021-01-28 2022-08-09 ヤフー株式会社 Information processing device, information processing system, information processing method, and program
JP7287992B2 (en) 2021-01-28 2023-06-06 ヤフー株式会社 Information processing device, information processing system, information processing method, and program
CN113221967A (en) * 2021-04-23 2021-08-06 中国农业大学 Feature extraction method, feature extraction device, electronic equipment and storage medium
CN113221967B (en) * 2021-04-23 2023-11-24 中国农业大学 Feature extraction method, device, electronic equipment and storage medium
WO2022228127A1 (en) * 2021-04-29 2022-11-03 京东科技控股股份有限公司 Element text processing method and apparatus, electronic device, and storage medium
CN113537459A (en) * 2021-06-28 2021-10-22 淮阴工学院 Method for predicting humiture of drug storage room
CN113537459B (en) * 2021-06-28 2024-04-26 淮阴工学院 Drug warehouse temperature and humidity prediction method
CN114169312A (en) * 2021-12-08 2022-03-11 湘潭大学 Two-stage hybrid automatic summarization method for judicial official documents
CN114997143A (en) * 2022-08-04 2022-09-02 北京澜舟科技有限公司 Text generation model training method and system, text generation method and storage medium

Also Published As

Publication number Publication date
CN111061862B (en) 2020-12-15

Similar Documents

Publication Publication Date Title
CN111061862B (en) Method for generating abstract based on attention mechanism
Al-Sabahi et al. A hierarchical structured self-attentive model for extractive document summarization (HSSAS)
Tunstall et al. Natural language processing with transformers
CN110413986B (en) Text clustering multi-document automatic summarization method and system for improving word vector model
CN110390103B (en) Automatic short text summarization method and system based on double encoders
CN109214003B (en) The method that Recognition with Recurrent Neural Network based on multilayer attention mechanism generates title
Dashtipour et al. Exploiting deep learning for Persian sentiment analysis
CN107273913B (en) Short text similarity calculation method based on multi-feature fusion
CN110597961B (en) Text category labeling method and device, electronic equipment and storage medium
CN112989834A (en) Named entity identification method and system based on flat grid enhanced linear converter
CN110619043A (en) Automatic text abstract generation method based on dynamic word vector
EP3732592A1 (en) Intelligent routing services and systems
Huang et al. Character-level convolutional network for text classification applied to chinese corpus
Magdum et al. A survey on deep learning-based automatic text summarization models
CN110807326A (en) Short text keyword extraction method combining GPU-DMM and text features
CN113609840B (en) Chinese law judgment abstract generation method and system
CN114281982B (en) Book propaganda abstract generation method and system adopting multi-mode fusion technology
CN116737938A (en) Fine granularity emotion detection method and device based on fine tuning large model online data network
Dong et al. Neural question generation with semantics of question type
Zhuang et al. An ensemble approach to conversation generation
Hsu et al. Prompt-learning for cross-lingual relation extraction
Rabut et al. Multi-class document classification using improved word embeddings
CN113641789B (en) Viewpoint retrieval method and system based on hierarchical fusion multi-head attention network and convolution network
Ramesh et al. Abstractive text summarization using t5 architecture
CN110275957B (en) Name disambiguation method and device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant