CN111723547A - Text automatic summarization method based on pre-training language model - Google Patents

Text automatic summarization method based on pre-training language model

Info

Publication number
CN111723547A
Authority
CN
China
Prior art keywords: vector, representing, word, text, language model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010449079.1A
Other languages
Chinese (zh)
Inventor
王宇 (Wang Yu)
师岩 (Shi Yan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University (HHU)
Priority to CN202010449079.1A
Publication of CN111723547A
Legal status: Pending

Classifications

    • G — PHYSICS; G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — Electric digital data processing; G06F40/00 — Handling natural language data
    •   G06F40/126 Character encoding
    •   G06F40/258 Heading extraction; Automatic titling; Numbering
    •   G06F40/30 Semantic analysis
    • G06N — Computing arrangements based on specific computational models; G06N3/02 — Neural networks
    •   G06N3/044 Recurrent networks, e.g. Hopfield networks
    •   G06N3/045 Combinations of networks
    •   G06N3/047 Probabilistic or stochastic networks
    •   G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    •   G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to an automatic text summarization method based on a pre-training language model, belonging to the technical field of natural language processing. The method comprises the following steps: the source text information is encoded with the pre-training language model BERT; a summary of the source text is then generated automatically by an LSTM network combined with an attention mechanism. In the automatic Chinese text summarization task, the generated Chinese summaries are readable and of high quality, and the model trains quickly; because a pre-training language model is used as the encoder, summaries of relatively high quality can be generated even when little training data is available.

Description

Text automatic summarization method based on pre-training language model
Technical Field
The invention relates to an automatic text summarization method based on a pre-training language model, belonging to the technical field of natural language processing.
Background
With the continuous emergence of new media platforms, the amount of information people encounter every day has grown explosively, bringing the problem of information overload; and as the pace of life accelerates, people cannot keep up with all the information they receive. By reading a summary, people can understand the original text more efficiently and effectively reduce the time and energy spent browsing information.
The text summarization techniques commonly used at home and abroad can be broadly divided into two types: extractive and abstractive (generative).
The extractive approach computes the importance of each sentence from statistical features of the text, such as term frequency and inverse document frequency, and selects the most important sentences as the summary of the original text. This can be viewed as a sentence classification task: a sentence with a high score becomes a summary sentence, otherwise it is a non-summary sentence. Extractive methods have the longest history of development, and most relatively mature automatic summarization systems still adopt them. However, the summaries obtained in this way are of lower quality and the extracted sentences are highly redundant, so the quality and effect of the generated summaries are unsatisfactory.
The abstractive (generative) approach has the computer "understand" the text and, following the way humans write summaries, perform syntactic and semantic analysis on the information in the text to generate new summary sentences. With the continuous development of deep learning, deep-learning-based abstractive summarization has greatly improved summary quality and fluency and has become the mainstream research direction for text summary generation.
The Seq2Seq model is an end-to-end model first proposed in 2014 and applied to machine translation, where it achieved good results at the time. Text summarization is also an end-to-end text generation task, so researchers applied the Seq2Seq model to it; a series of studies on abstractive summarization followed this modelling idea, and abstractive automatic text summarization has developed rapidly.
Pre-training a language model means applying an already-trained language model to other natural language processing tasks. Word embedding, which appeared in 2003, is the earliest pre-training technique: it requires no large-scale labeled data set and can learn the semantic similarity between words without supervision. With the continuous development and improvement of pre-training, language models such as ELMo, GPT and BERT have appeared. Their common characteristic is that, through large-scale pre-training and the strong information-extraction capability of their architectures, they obtain good text representations, and applying them to downstream natural language processing tasks yields strong results with comparatively little effort.
Disclosure of Invention
The invention provides an automatic text summarization method based on a pre-training language model, which aims to solve problems of existing automatic text summarization, for example that the encoder cannot adequately learn the information in the source text, so that the decoder produces repeated summary content, semantically irrelevant output, and the like.
The invention adopts the following technical scheme for solving the technical problems:
a text automatic summarization method based on a pre-training language model comprises the following steps:
(1) globally coding the context information of the text based on a BERT pre-training language model;
(2) decoding the coded output result based on an attention mechanism and an LSTM network to generate a text abstract;
(3) finally, computing the output state vector from the hidden state vector, the context vector and the word vector of the previous summary word, and, according to the probability corresponding to each possible output word, selecting the word with the largest probability as the output of this step.
In step (1), the BERT pre-training language model is a stack of self-attention layers, and each self-attention layer is computed as follows:
Q_l = H_{l-1} W_l^Q
K_l = H_{l-1} W_l^K
V_l = H_{l-1} W_l^V
A_l = softmax(Q_l K_l^T / √d_k) V_l
wherein: wl QA matrix of query vectors representing the l-th layer,
Figure BDA0002506866590000033
a key vector matrix representing the l-th layer,
Figure BDA0002506866590000034
a value vector matrix representing the l-th layer; ql,Kl,VlRespectively representing a query vector, a key vector and a value vector of the l layer; dkVector dimensions representing Q and K; softmax denotes the softmax function, T denotes the transposition of the vector matrix, Hl-1Representing the result after the operation of the previous layer of the Transformer module; a. thelThe result after the self-attention model is finally shown, and the result H after the I layer coding is finally obtained after the full connection layer and the residual layerl
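By way of illustration only, one such self-attention layer could be sketched in PyTorch as follows; this is a minimal single-head sketch, and the tensor names and dimensions (10 tokens, model width 768, head width 64) are assumptions made for the example rather than values fixed by this description.

```python
import math
import torch

def self_attention_layer(h_prev, w_q, w_k, w_v):
    """One self-attention layer: A_l = softmax(Q_l K_l^T / sqrt(d_k)) V_l.

    h_prev        : (N, d_model) output H_{l-1} of the previous Transformer layer
    w_q, w_k, w_v : (d_model, d_k) projection matrices W_l^Q, W_l^K, W_l^V
    """
    q = h_prev @ w_q                      # Q_l = H_{l-1} W_l^Q
    k = h_prev @ w_k                      # K_l = H_{l-1} W_l^K
    v = h_prev @ w_v                      # V_l = H_{l-1} W_l^V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    attn = torch.softmax(scores, dim=-1)  # attention weights over all positions
    return attn @ v                       # A_l, passed to the feed-forward and residual sub-layers

# toy usage with assumed sizes
h = torch.randn(10, 768)
w_q, w_k, w_v = (torch.randn(768, 64) for _ in range(3))
a_l = self_attention_layer(h, w_q, w_k, w_v)   # shape (10, 64)
```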
The specific process of step (2) is as follows: a context vector for a given word vector is calculated through the attention mechanism, and, combined with the output of the encoder, the next word to be generated is predicted.
The context vector of a given word vector is calculated as follows:
α_{i,j} = exp(e(H_i, S_j)) / Σ_{i=1}^{N} exp(e(H_i, S_j))
c_j = Σ_{i=1}^{N} α_{i,j} H_i
wherein: H_i denotes the i-th word vector output by the encoder; S_j denotes the hidden state of the decoder at step j; e(H_i, S_j) is a function that measures how relevant each word of the original text is to the current decoder state, implemented as a feedforward neural network with a single hidden layer; exp(·) denotes the exponential function with the natural constant e as its base; α_{i,j} is the weight of the influence of the encoder output at position i on the prediction of the decoder at step j; and c_j is the hidden state obtained after the attention mechanism (the context vector), used for the subsequent prediction of the generated word.
The hidden state vector S_j of the decoder at step j is computed as follows:
S_j = LSTM(y_{j-1}, c_{j-1}, S_{j-1})
where LSTM(·) denotes the LSTM network operation, y_{j-1} is the word vector of the previously generated word, c_{j-1} is the context vector of the previous word vector, and S_{j-1} is the previous hidden state vector.
The output state vector in step (3) is:
y_j = W_r y_{j-1} + U_r c_j + V_r S_j
where W_r, U_r and V_r are trainable weight matrix parameters.
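Tying the LSTM update and the output state vector together, a single decoding step could be sketched as follows; the class name, the internal LSTM cell state, and the final vocabulary projection with softmax are assumptions added for the example (the description above only specifies that the word with the largest probability is selected).

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """One step: S_j = LSTM(y_{j-1}, c_{j-1}, S_{j-1}); y_j = W_r y_{j-1} + U_r c_j + V_r S_j."""

    def __init__(self, emb_dim, ctx_dim, hid_dim, vocab_size):
        super().__init__()
        self.lstm = nn.LSTMCell(emb_dim + ctx_dim, hid_dim)  # consumes y_{j-1} and c_{j-1}
        self.w_r = nn.Linear(emb_dim, emb_dim, bias=False)   # W_r
        self.u_r = nn.Linear(ctx_dim, emb_dim, bias=False)   # U_r
        self.v_r = nn.Linear(hid_dim, emb_dim, bias=False)   # V_r
        self.out = nn.Linear(emb_dim, vocab_size)            # assumed projection of y_j to word scores

    def forward(self, y_prev, c_prev, c_j, state):
        # y_prev (1, emb_dim), c_prev/c_j (1, ctx_dim), state = (S_{j-1}, cell_{j-1}), each (1, hid_dim)
        s_j, cell_j = self.lstm(torch.cat([y_prev, c_prev], dim=-1), state)
        y_j = self.w_r(y_prev) + self.u_r(c_j) + self.v_r(s_j)   # output state vector
        probs = torch.softmax(self.out(y_j), dim=-1)             # probability of each possible word
        next_word = probs.argmax(dim=-1)                         # word with the largest probability
        return next_word, y_j, (s_j, cell_j)
```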
The invention has the following beneficial effects:
the invention combines the pre-training language model BERT, the attention mechanism and the LSTM (long short term memory network) model, and effectively solves the defect that the traditional RNN (recurrent neural network) can not acquire long-distance information. Because the pre-training language model BERT is a bi-directional training language model and a network model using a self-attention mechanism, such as a Transformer, is adopted, the information of the context of the source text can be effectively acquired, and the defect of insufficient information acquisition in a long distance is overcome. Therefore, the hidden state obtained after the source text passes through the encoder actually contains rich context information, and more favorable for generating more accurate abstract in the subsequent decoding problem. Meanwhile, because the pre-training language model is trained by a large amount of text data, the effect which is more excellent than the prior effect can be obtained only by fine adjustment aiming at the text abstract task, and meanwhile, the training speed of the network model can be improved.
Drawings
Fig. 1 is a flowchart of a text automatic summarization method provided by the present invention.
Fig. 2 is a schematic diagram of an automatic text summarization model provided by the present invention.
FIG. 3 is a schematic diagram of the coding of the pre-training model BERT provided by the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
FIG. 1 is a flow chart of the automatic text summarization method based on a pre-training language model according to the present invention. The method is mainly divided into two parts:
(1) the source text is encoded using an encoder built based on a pre-trained language model.
(2) The encoded vectors are decoded by a decoder built on an LSTM network combined with an attention mechanism, and the text summary is generated.
For a given piece of text, a summary is generated automatically by the model shown in Fig. 2. The specific process is as follows:
(1) the source text is encoded using a BERT pretrained network.
The processed source text is represented as H_0 = [E_1, …, E_N], wherein E_1 is the vector representation of the first word (character) in the source text and E_N is the vector representation of the N-th word (character). After the BERT pre-training language model, that is, after the computation of L Transformer layers, the encoding result is obtained:
Q_l = H_{l-1} W_l^Q
K_l = H_{l-1} W_l^K
V_l = H_{l-1} W_l^V
A_l = softmax(Q_l K_l^T / √d_k) V_l
H_l denotes the output obtained after the l-th Transformer layer:
H_l = Transformer_l(H_{l-1}), l ∈ [1, L]
For each input vector (input word vector), a query vector, a key vector and a value vector are created. W_l^Q denotes the query weight matrix of the l-th layer; similarly, W_l^K denotes the key weight matrix and W_l^V the value weight matrix of the l-th layer. Q_l, K_l and V_l denote the query vector, key vector and value vector of the l-th layer, respectively. d_k denotes the vector dimension of Q and K. softmax denotes the softmax function, and T denotes the transpose of the matrix; H_{l-1} denotes the output of the previous Transformer layer. A_l denotes the output of the self-attention sub-layer; after the fully connected layer and the residual connection, the encoding result H_l of the l-th layer is finally obtained.
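As a purely illustrative sketch of this encoding step, the source text could be encoded with a publicly available Chinese BERT checkpoint through the Hugging Face transformers library; the checkpoint name "bert-base-chinese", the 512-token limit and the use of last_hidden_state are assumptions made for the example and are not specified by the description.

```python
import torch
from transformers import BertModel, BertTokenizer

# assumed public checkpoint; the description only requires "a BERT pre-training language model"
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")
encoder.eval()

source_text = "随着新媒体平台的不断涌现，人们每天接触的信息呈爆炸式增长。"  # example source sentence
inputs = tokenizer(source_text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = encoder(**inputs)

h_L = outputs.last_hidden_state   # H_L: one contextualised vector per input character, shape (1, seq_len, 768)
h_N = h_L[0, -1, :]               # last encoder output h_N, later used for S_0 = tanh(W_d h_N + b)
```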
(2) Computing attention over the encoder output.
Through the attention mechanism, a vector containing the contextual information of each word vector can be obtained, which better assists the prediction process in the decoding stage. The attention weights are computed together with the decoding process.
(3) A unidirectional LSTM network acts as the decoder, and the attention mechanism is applied before each word to be generated is predicted, because the attention mechanism can better extract the contextual information of all word vectors in the source text and thus better assist summary generation.
α_{i,j} = exp(e(H_i, S_j)) / Σ_{i=1}^{N} exp(e(H_i, S_j))
c_j = Σ_{i=1}^{N} α_{i,j} H_i
wherein H_i denotes the i-th word vector output by the encoder; S_j denotes the hidden state of the decoder at step j; e(H_i, S_j) is a function that measures how relevant each word of the original text is to the current decoder state, implemented as a feedforward neural network with a single hidden layer; exp(·) denotes the exponential function with the natural constant e as its base; α_{i,j} is the weight of the influence of the encoder output at position i on the prediction of the decoder at step j; and c_j is the hidden state obtained after the attention mechanism, used for the subsequent prediction of the generated word. The hidden state vector S_j of the decoder is computed as follows:
S_j = LSTM(y_{j-1}, c_{j-1}, S_{j-1})
where LSTM(·) denotes the LSTM network operation, y_{j-1} is the word vector of the previously generated word, c_{j-1} is the context vector of the previous word vector, and S_{j-1} is the previous hidden state vector. The initial hidden state vector S_0 is:
S_0 = tanh(W_d h_N + b)
where tanh(·) denotes the hyperbolic tangent function, h_N is the last output of the encoder, W_d is a trainable hidden-state weight matrix, and b is a hidden-state bias vector.
Finally, the output state vector is obtained from the hidden state vector S_j, the context vector c_j and the word vector y_{j-1} of the previous summary word:
y_j = W_r y_{j-1} + U_r c_j + V_r S_j
where W_r, U_r and V_r are trainable weight matrix parameters.
Given the output state vector, the probability corresponding to each possible output word is computed, and the word with the largest probability is selected as the output of this step.
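The whole decoding procedure can then be sketched as a greedy loop; the helper names below (embedding, attention, decoder_step, w_d) refer to the hypothetical sketches given earlier, and the start/end token ids, the maximum length and the zero-initialised LSTM cell state are illustrative assumptions.

```python
import torch

@torch.no_grad()
def greedy_decode(enc_outputs, embedding, attention, decoder_step, w_d,
                  start_id, end_id, max_len=60):
    """Generate a summary word by word, always keeping the most probable word.

    enc_outputs  : (N, enc_dim) encoder outputs H_1..H_N
    embedding    : nn.Embedding mapping word ids to word vectors y
    attention    : callable (enc_outputs, s_j) -> (c_j, alpha) with c_j of shape (1, enc_dim),
                   e.g. a batched variant of the AdditiveAttention sketch above
    decoder_step : the DecoderStep sketch above
    w_d          : nn.Linear implementing S_0 = tanh(W_d h_N + b) (bias b lives inside the layer)
    """
    h_n = enc_outputs[-1:, :]                     # last encoder output h_N, shape (1, enc_dim)
    s_j = torch.tanh(w_d(h_n))                    # S_0
    cell = torch.zeros_like(s_j)                  # LSTM cell state (implementation detail)
    y_prev = embedding(torch.tensor([start_id]))  # embedding of the start-of-summary token
    c_prev = torch.zeros(1, enc_outputs.size(-1)) # c_0

    summary_ids = []
    for _ in range(max_len):
        c_j, _ = attention(enc_outputs, s_j)      # context vector for this step
        word_id, _, (s_j, cell) = decoder_step(y_prev, c_prev, c_j, (s_j, cell))
        if word_id.item() == end_id:              # stop at the end-of-summary token
            break
        summary_ids.append(word_id.item())
        y_prev, c_prev = embedding(word_id), c_j
    return summary_ids
```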
FIG. 3 shows a schematic diagram of the encoding performed by the pre-training model BERT provided by the present invention. In the figure, Trm is an abbreviation for the Transformer module. The BERT model encodes the source text as a bidirectional language model and can capture the contextual information of the source text more fully, thereby improving the quality of the final generated summary.

Claims (6)

1. A text automatic summarization method based on a pre-training language model is characterized by comprising the following steps:
(1) globally coding the context information of the text based on a BERT pre-training language model;
(2) decoding the coded output result based on an attention mechanism and an LSTM network to generate a text abstract;
(3) finally, computing the output state vector from the hidden state vector, the context vector and the word vector of the previous summary word, and, according to the probability corresponding to each possible output word, selecting the word with the largest probability as the output of this step.
2. The automatic text summarization method based on a pre-training language model as claimed in claim 1, wherein the BERT pre-training language model in step (1) is a stacked self-attention model, encoded as:
Q_l = H_{l-1} W_l^Q, K_l = H_{l-1} W_l^K, V_l = H_{l-1} W_l^V
A_l = softmax(Q_l K_l^T / √d_k) V_l
wherein: W_l^Q denotes the query weight matrix of the l-th layer, W_l^K the key weight matrix of the l-th layer, and W_l^V the value weight matrix of the l-th layer; Q_l, K_l and V_l denote the query vector, key vector and value vector of the l-th layer, respectively; d_k denotes the vector dimension of Q and K; softmax denotes the softmax function, and T denotes the transpose of the matrix; H_{l-1} denotes the output of the previous Transformer layer; A_l denotes the output of the self-attention sub-layer, and after the fully connected layer and the residual connection the encoding result H_l of the l-th layer is finally obtained.
3. The automatic text summarization method based on a pre-training language model as claimed in claim 1, wherein the specific process of step (2) is as follows: a context vector of a given word vector is calculated through the attention mechanism, and, combined with the output of the encoder, the next word to be generated is predicted.
4. The method of claim 3, wherein the method of calculating the context vector of a word vector comprises:
α_{i,j} = exp(e(H_i, S_j)) / Σ_{i=1}^{N} exp(e(H_i, S_j))
c_j = Σ_{i=1}^{N} α_{i,j} H_i
wherein: H_i denotes the i-th word vector output by the encoder; S_j denotes the hidden state of the decoder at step j; e(H_i, S_j) is a function that measures how relevant each word of the original text is to the current decoder state, implemented as a feedforward neural network with a single hidden layer; exp(·) denotes the exponential function with the natural constant e as its base; α_{i,j} is the weight of the influence of the encoder output at position i on the prediction of the decoder at step j; and c_j is the hidden state obtained after the attention mechanism, used for the subsequent prediction of the generated word.
5. The method according to claim 4, wherein the hidden state vector S_j of the decoder at step j is computed as follows:
S_j = LSTM(y_{j-1}, c_{j-1}, S_{j-1})
where LSTM(·) denotes the LSTM network operation, y_{j-1} is the word vector of the previously generated word, c_{j-1} is the context vector of the previous word vector, and S_{j-1} is the previous hidden state vector.
6. The automatic text summarization method based on a pre-training language model as claimed in claim 5, wherein the output state vector in step (3) is:
y_j = W_r y_{j-1} + U_r c_j + V_r S_j
where W_r, U_r and V_r are trainable weight matrix parameters.
CN202010449079.1A 2020-05-25 2020-05-25 Text automatic summarization method based on pre-training language model Pending CN111723547A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010449079.1A CN111723547A (en) 2020-05-25 2020-05-25 Text automatic summarization method based on pre-training language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010449079.1A CN111723547A (en) 2020-05-25 2020-05-25 Text automatic summarization method based on pre-training language model

Publications (1)

Publication Number Publication Date
CN111723547A true CN111723547A (en) 2020-09-29

Family

ID=72564980

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010449079.1A Pending CN111723547A (en) 2020-05-25 2020-05-25 Text automatic summarization method based on pre-training language model

Country Status (1)

Country Link
CN (1) CN111723547A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112395841A (en) * 2020-11-18 2021-02-23 福州大学 BERT-based method for automatically filling blank text
CN112417139A (en) * 2020-11-19 2021-02-26 深圳大学 Abstract generation method based on pre-training language model
CN112507081A (en) * 2020-12-16 2021-03-16 平安科技(深圳)有限公司 Similar sentence matching method and device, computer equipment and storage medium
CN112559702A (en) * 2020-11-10 2021-03-26 西安理工大学 Transformer-based natural language problem generation method in civil construction information field
CN112559730A (en) * 2020-12-08 2021-03-26 北京京航计算通讯研究所 Text abstract automatic generation method and system based on global feature extraction
CN113128214A (en) * 2021-03-17 2021-07-16 重庆邮电大学 Text abstract generation method based on BERT pre-training model
CN113159168A (en) * 2021-04-19 2021-07-23 清华大学 Pre-training model accelerated reasoning method and system based on redundant word deletion
CN113157855A (en) * 2021-02-22 2021-07-23 福州大学 Text summarization method and system fusing semantic and context information
CN113407711A (en) * 2021-06-17 2021-09-17 成都崇瑚信息技术有限公司 Gibbs limited text abstract generation method by using pre-training model
CN115081437A (en) * 2022-07-20 2022-09-20 中国电子科技集团公司第三十研究所 Machine-generated text detection method and system based on linguistic feature contrast learning
CN115687031A (en) * 2022-11-15 2023-02-03 北京优特捷信息技术有限公司 Method, device, equipment and medium for generating alarm description text
WO2023061107A1 (en) * 2021-10-13 2023-04-20 北京有竹居网络技术有限公司 Language translation method and apparatus based on layer prediction, and device and medium
CN116562275A (en) * 2023-06-09 2023-08-08 创意信息技术股份有限公司 Automatic text summarization method combined with entity attribute diagram

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018233647A1 (en) * 2017-06-22 2018-12-27 腾讯科技(深圳)有限公司 Abstract generation method, device and computer device and storage medium
CN110737769A (en) * 2019-10-21 2020-01-31 南京信息工程大学 pre-training text abstract generation method based on neural topic memory

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018233647A1 (en) * 2017-06-22 2018-12-27 腾讯科技(深圳)有限公司 Abstract generation method, device and computer device and storage medium
CN110737769A (en) * 2019-10-21 2020-01-31 南京信息工程大学 pre-training text abstract generation method based on neural topic memory

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
唐子惠 (TANG Zihui), ed.: "Introduction to Medical Artificial Intelligence (《医学人工智能导论》)", Shanghai Scientific and Technical Publishers, 30 April 2020 *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112559702A (en) * 2020-11-10 2021-03-26 西安理工大学 Transformer-based natural language problem generation method in civil construction information field
CN112559702B (en) * 2020-11-10 2022-09-30 西安理工大学 Method for generating natural language problem in civil construction information field based on Transformer
CN112395841B (en) * 2020-11-18 2022-05-13 福州大学 BERT-based method for automatically filling blank text
CN112395841A (en) * 2020-11-18 2021-02-23 福州大学 BERT-based method for automatically filling blank text
CN112417139A (en) * 2020-11-19 2021-02-26 深圳大学 Abstract generation method based on pre-training language model
CN112417139B (en) * 2020-11-19 2023-07-25 深圳大学 Abstract generation method based on pre-training language model
WO2022104967A1 (en) * 2020-11-19 2022-05-27 深圳大学 Pre-training language model-based summarization generation method
CN112559730A (en) * 2020-12-08 2021-03-26 北京京航计算通讯研究所 Text abstract automatic generation method and system based on global feature extraction
CN112507081A (en) * 2020-12-16 2021-03-16 平安科技(深圳)有限公司 Similar sentence matching method and device, computer equipment and storage medium
CN113157855A (en) * 2021-02-22 2021-07-23 福州大学 Text summarization method and system fusing semantic and context information
CN113128214A (en) * 2021-03-17 2021-07-16 重庆邮电大学 Text abstract generation method based on BERT pre-training model
CN113128214B (en) * 2021-03-17 2022-05-06 重庆邮电大学 Text abstract generation method based on BERT pre-training model
CN113159168B (en) * 2021-04-19 2022-09-02 清华大学 Pre-training model accelerated reasoning method and system based on redundant word deletion
CN113159168A (en) * 2021-04-19 2021-07-23 清华大学 Pre-training model accelerated reasoning method and system based on redundant word deletion
CN113407711A (en) * 2021-06-17 2021-09-17 成都崇瑚信息技术有限公司 Gibbs limited text abstract generation method by using pre-training model
CN113407711B (en) * 2021-06-17 2023-04-07 成都崇瑚信息技术有限公司 Gibbs limited text abstract generation method by using pre-training model
WO2023061107A1 (en) * 2021-10-13 2023-04-20 北京有竹居网络技术有限公司 Language translation method and apparatus based on layer prediction, and device and medium
CN115081437A (en) * 2022-07-20 2022-09-20 中国电子科技集团公司第三十研究所 Machine-generated text detection method and system based on linguistic feature contrast learning
CN115687031A (en) * 2022-11-15 2023-02-03 北京优特捷信息技术有限公司 Method, device, equipment and medium for generating alarm description text
CN116562275A (en) * 2023-06-09 2023-08-08 创意信息技术股份有限公司 Automatic text summarization method combined with entity attribute diagram
CN116562275B (en) * 2023-06-09 2023-09-15 创意信息技术股份有限公司 Automatic text summarization method combined with entity attribute diagram

Similar Documents

Publication Publication Date Title
CN111723547A (en) Text automatic summarization method based on pre-training language model
CN111897949B (en) Guided text abstract generation method based on Transformer
CN109635124B (en) Remote supervision relation extraction method combined with background knowledge
CN109918666B (en) Chinese punctuation mark adding method based on neural network
CN112100351A (en) Method and equipment for constructing intelligent question-answering system through question generation data set
CN111241816B (en) Automatic news headline generation method
CN109992775B (en) Text abstract generation method based on high-level semantics
CN110929030A (en) Text abstract and emotion classification combined training method
CN110737769A (en) pre-training text abstract generation method based on neural topic memory
CN111143563A (en) Text classification method based on integration of BERT, LSTM and CNN
CN109992669B (en) Keyword question-answering method based on language model and reinforcement learning
CN111858932A (en) Multiple-feature Chinese and English emotion classification method and system based on Transformer
CN111125333B (en) Generation type knowledge question-answering method based on expression learning and multi-layer covering mechanism
CN112232053A (en) Text similarity calculation system, method and storage medium based on multi-keyword pair matching
CN111966797B (en) Method for machine reading and understanding by using word vector introduced with semantic information
CN116414962A (en) Question-answer matching method based on attention mechanism
CN115238691A (en) Knowledge fusion based embedded multi-intention recognition and slot filling model
CN115169349A (en) Chinese electronic resume named entity recognition method based on ALBERT
CN114757184A (en) Method and system for realizing knowledge question answering in aviation field
CN112818124A (en) Entity relationship extraction method based on attention neural network
CN116521857A (en) Method and device for abstracting multi-text answer abstract of question driven abstraction based on graphic enhancement
CN116663501A (en) Chinese variant text conversion method based on multi-modal sharing weight
Jiang et al. A hierarchical bidirectional LSTM sequence model for extractive text summarization in electric power systems
Luo et al. Learning document embeddings with crossword prediction
Zhang Automatic generation of Chinese abstract based on vocabulary and LSTM neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200929)