CN113468854A - Multi-document automatic abstract generation method - Google Patents
Multi-document automatic abstract generation method
- Publication number
- CN113468854A (application CN202110703934.1A)
- Authority
- CN
- China
- Prior art keywords
- document
- vector
- abstract
- attention
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 32
- 238000012549 training Methods 0.000 claims abstract description 18
- 238000007781 pre-processing Methods 0.000 claims abstract description 8
- 230000008569 process Effects 0.000 claims abstract description 6
- 239000013598 vector Substances 0.000 claims description 109
- 230000006870 function Effects 0.000 claims description 22
- 230000007246 mechanism Effects 0.000 claims description 18
- 238000004364 calculation method Methods 0.000 claims description 12
- 230000004927 fusion Effects 0.000 claims description 10
- 239000011159 matrix material Substances 0.000 claims description 8
- 238000010606 normalization Methods 0.000 claims description 8
- 238000001914 filtration Methods 0.000 claims description 7
- 238000013528 artificial neural network Methods 0.000 claims description 6
- 230000003993 interaction Effects 0.000 claims description 6
- 238000000605 extraction Methods 0.000 claims description 4
- 238000013507 mapping Methods 0.000 claims description 4
- 238000012216 screening Methods 0.000 claims description 3
- 230000002457 bidirectional effect Effects 0.000 claims description 2
- 230000007423 decrease Effects 0.000 claims description 2
- 238000004590 computer program Methods 0.000 description 2
- 230000002776 aggregation Effects 0.000 description 1
- 238000004220 aggregation Methods 0.000 description 1
- 238000013473 artificial intelligence Methods 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 239000000284 extract Substances 0.000 description 1
- 230000010354 integration Effects 0.000 description 1
- 238000003062 neural network model Methods 0.000 description 1
- 238000011045 prefiltration Methods 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 238000012552 review Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- Biomedical Technology (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Data Mining & Analysis (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a multi-document automatic abstract generation method that can automatically generate a summarizing abstract for multiple texts on the same theme. First, a preset text abstract data set is preprocessed to obtain the input data required for model training; then a hierarchical Transformer multi-document abstract generation model is constructed and trained with a combination of triplet loss and cross-entropy loss; finally, the multiple texts to be processed are preprocessed and input into the trained abstract model, which automatically generates their summarizing abstract. Compared with the prior art, the method effectively combines the semantic information within each document and the dependency relationships between documents, providing rich hierarchical structure information for the abstract generation process and thereby improving the contextual coherence and information coverage of the abstract result.
Description
Technical Field
The invention relates to a multi-document automatic abstract generation method, and belongs to the technical field of Internet and artificial intelligence.
Background
In recent years, with the rapid development of Internet technology, the network has become an important channel through which people obtain information. However, network information is highly redundant and enormous in quantity, which greatly reduces the efficiency with which people obtain important information. Multi-Document Summarization (MDS) technology aims to analyze, refine and integrate multiple documents on the same or similar topics into a summarizing abstract that captures the central theme. It can effectively aggregate the content of multiple documents on one topic and thus help users quickly and clearly grasp the main content of the document information.
At present, mainstream multi-document summarization techniques generally use deep neural network models to perform rich semantic vector encoding at two levels, the vocabulary and the document, so as to capture both the lexical semantics within a document and the dependency relationships between documents, and then generate the abstract using the document-level information. However, these methods suffer from three main problems. First, extracting cross-document relationships requires a feature representation of each document, but existing methods lack a global explicit constraint, so important information is easily lost in the document representation, which hinders document relationship modeling. Second, multiple documents on the same theme overlap substantially in content, and without screening the generated abstract tends to contain redundant information. Third, existing methods fuse document hierarchy information simply by concatenation or addition, which makes it difficult to build deep associations between hierarchical features.
Disclosure of Invention
Aiming at the problems and defects of the prior art, the invention provides a multi-document automatic abstract generation method that effectively combines the semantic information within each document and the dependency relationships between documents, providing rich hierarchical structure information for the abstract generation process and thereby improving the contextual coherence and information coverage of the abstract result.
To achieve this purpose, the multi-document automatic abstract generation method first extracts a sub-topic representation of each document and constructs a central topic representation of the document set, so as to generate document vectors that are more topic-relevant; it then filters the semantic information within each document through an information gating mechanism to obtain lexical vectors with more salient information; finally, it integrates information at the document and vocabulary levels with a hierarchical attention mechanism, fusing the semantic information of the two levels into a context vector that guides the abstract generation process. The method mainly comprises the following steps:
step 1: data preprocessing. Texts in a preset text abstract data set are truncated, word-embedded and position-coded, and the word embedding representation and the position coding are added to obtain the word vector representation of each word; each sample in the preset data set comprises several texts on the same theme and a corresponding manually written abstract;
step 2: construct and train the multi-document abstract generation model. First, a Transformer coding module extracts lexical semantic information from the word-vector representation of the sample texts, and a topic fusion attention module integrates this lexical semantic information to generate document vector representations; then information interaction between documents is realized through multi-head self-attention, and document vector representations containing the document dependency relationships are obtained through a residual structure, layer normalization and a feed-forward neural network; next, information gating screens the lexical semantic vectors, a hierarchical attention mechanism fuses the semantic vectors obtained at the vocabulary and document levels, and the resulting context vector guides abstract generation; finally, the model is trained with the triplet loss and the cross-entropy loss.
Step 3: generate abstracts for the multiple texts to be summarized. The texts to be summarized are first truncated, word-embedded and position-coded, and the resulting word vector representations are input into the abstract generation model trained in step 2 to generate the topic abstract of the texts.
Compared with the prior art, the invention has the following advantages:
(1) The document representation method based on topic fusion attention constructs a central topic that guides the lexical semantic vectors toward document vector representations with more comprehensive information and stronger topic relevance, alleviating the problem of important-information loss;
(2) the information gating mechanism adopted by the invention pre-filters the lexical semantic vectors, reducing the interference of redundant information and effectively improving the accuracy of the abstract result;
(3) the hierarchical attention mechanism adopted by the invention hierarchically integrates and associates the lexical semantic information within documents and the associated information between documents, effectively building deep associations between hierarchical features, providing rich hierarchical semantic information for the abstract generation process and improving the contextual coherence of the abstract result.
Drawings
FIG. 1 is a diagram of a multi-document abstract model framework according to an embodiment of the present invention.
Fig. 2 is an overall structural diagram of the topic fusion attention module.
Detailed Description
The invention will be further illustrated with reference to specific examples in order to provide a better understanding and appreciation of the invention.
Embodiment: referring to fig. 1 and fig. 2, the multi-document automatic abstract generation method provided by the present invention comprises the following specific implementation steps:
Step 1, data preprocessing. In this embodiment, the preset data set D is preprocessed first. The M texts contained in each sample are truncated so that each truncated text has length Len/M; a text that is already shorter than Len/M is left unchanged. In this embodiment Len is 1500. Each truncated text then undergoes word embedding mapping and position coding, where the word embedding matrix is a learnable parameter matrix and the position coding adopts the position coding module of the Transformer model.
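As an illustrative sketch only (not part of the claimed method), the truncation, word embedding and position coding of step 1 could look roughly as follows in PyTorch; the vocabulary size, embedding dimension and tensor shapes are placeholder assumptions, while Len = 1500 and the Len/M truncation rule follow this embodiment.

```python
import math
import torch

def truncate(token_ids_per_doc, total_len=1500):
    """Truncate each of the M documents in a sample to Len/M tokens;
    documents already shorter than Len/M are left unchanged."""
    per_doc = total_len // len(token_ids_per_doc)
    return [ids[:per_doc] for ids in token_ids_per_doc]

def sinusoidal_position_encoding(seq_len, d_model):
    """Fixed sinusoidal position coding, as in the Transformer position coding module."""
    pos = torch.arange(seq_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

# Word vectors = learnable word embedding + position coding (hypothetical vocab/dim).
embed = torch.nn.Embedding(num_embeddings=50000, embedding_dim=512)
doc_ids = torch.randint(0, 50000, (1, 300))                 # one truncated document
word_vectors = embed(doc_ids) + sinusoidal_position_encoding(300, 512)
```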
Substep 2-1, construct the internal feature extraction layer. l stacked Transformer coding sublayers extract the semantic information of each input text's word vector representation in each sample, yielding the lexical semantic vector representation $x_j^i$ of the jth word in the ith input text. On this basis, a document vector representation of fixed dimensionality is constructed through a topic fusion attention module, which comprises three parts: sub-topic coding, sub-topic fusion and attention calculation. Sub-topic coding uses a two-layer bidirectional LSTM network to generate the sub-topic vector representation of each text; the input of the sub-topic encoder is the lexical semantic vector sequence, and the output forward hidden state and backward hidden state are spliced to obtain the sub-topic vector representation $T_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$. The central topic vector $T_c$ of the input text set is calculated by weighted summation:

$T_c = \sum_{i=1}^{N} w_i T_i$    (1)

$w_i = \mathrm{softmax}\big(v\,[T_i; T_{sum}]\big)$    (2)

where the weight factor $w_i$ represents the contribution of the sub-topic vector to the central topic vector, N is the total number of documents in the document set, $T_{sum}$ is the vector sum of all document sub-topic vectors in the document set, $[T_i; T_{sum}]$ is the concatenation of $T_i$ and $T_{sum}$, and v is a learnable weight matrix parameter. Based on the central topic vector $T_c$, an attention mechanism integrates the lexical semantic vectors and constructs the document vector representation $D_i$:

$a_j^i = \mathrm{softmax}\big(T_c\, W_d\, x_j^i\big)$    (3)

$D_i = \sum_{j=1}^{J} a_j^i\, x_j^i$    (4)

where $\{x_1^i, \dots, x_J^i\}$ is the lexical semantic vector sequence of the ith document, $W_d$ is a learnable parameter matrix, and J is the number of words contained in the input document.
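The following PyTorch sketch illustrates one possible reading of the topic fusion attention module: the two-layer bidirectional LSTM sub-topic encoder, the weighted central-topic fusion and the topic-guided pooling follow the description above, whereas the bilinear score functions, tensor shapes and dimensions are assumptions rather than the patent's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopicFusionAttention(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        # Two-layer bidirectional LSTM sub-topic encoder (output dim = d_model).
        self.subtopic_enc = nn.LSTM(d_model, d_model // 2, num_layers=2,
                                    bidirectional=True, batch_first=True)
        self.v = nn.Linear(2 * d_model, 1, bias=False)    # weight over [T_i; T_sum]
        self.W_d = nn.Linear(d_model, d_model, bias=False)

    def forward(self, word_vecs):
        # word_vecs: (N_docs, J_words, d_model) lexical semantic vectors per document
        out, _ = self.subtopic_enc(word_vecs)
        half = out.size(-1) // 2
        # Splice forward state at the last step and backward state at the first step -> T_i.
        T = torch.cat([out[:, -1, :half], out[:, 0, half:]], dim=-1)   # (N, d_model)
        T_sum = T.sum(dim=0, keepdim=True).expand_as(T)
        w = F.softmax(self.v(torch.cat([T, T_sum], dim=-1)).squeeze(-1), dim=0)
        T_c = (w.unsqueeze(-1) * T).sum(dim=0)                         # central topic vector
        # Topic-guided attention pooling of lexical vectors into document vectors D_i.
        scores = torch.einsum('d,njd->nj', T_c, self.W_d(word_vecs))
        a = F.softmax(scores, dim=-1)
        D = torch.einsum('nj,njd->nd', a, word_vecs)
        return T, T_c, D
```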
Substep 2-2, construct the external information interaction layer. This embodiment adopts a Multi-Head Self-Attention mechanism to realize information interaction between documents and capture the association information between them; the input is the vector representation $D_i$ of each document. In this embodiment the number of attention heads is 8, and the final document vector $d_i$ is obtained through a residual structure, layer normalization and a feed-forward neural network module.
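A minimal sketch of the external information interaction layer, assuming a standard PyTorch multi-head self-attention block with 8 heads; the model dimension and feed-forward width are placeholders.

```python
import torch
import torch.nn as nn

class DocumentInteractionLayer(nn.Module):
    """Multi-head self-attention over document vectors, followed by residual
    connection, layer normalization and a feed-forward network."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, doc_vecs):
        # doc_vecs: (batch, N_docs, d_model) document vectors from the topic fusion module
        attn_out, _ = self.self_attn(doc_vecs, doc_vecs, doc_vecs)
        x = self.norm1(doc_vecs + attn_out)        # residual + layer normalization
        return self.norm2(x + self.ffn(x))         # final document vectors d_i
```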
Substep 2-3, information gating filtering. The document sub-topic vector representation is used to filter the lexical semantic vectors generated by the encoder, so as to reduce unnecessary redundant content. For the jth word in the ith document, the corresponding gating vector $g_j^i$ is calculated as:

$g_j^i = \sigma\big(W_g\, x_j^i + U_g\, T_i + b_g\big)$    (5)

where $W_g$, $U_g$ and $b_g$ are learnable parameters and $\sigma(\cdot)$ is the sigmoid function. The gating vector is then applied to the lexical semantic vector by element-wise multiplication to realize information filtering:

$\tilde{x}_j^i = g_j^i \odot x_j^i$    (6)
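A short sketch of the information gating filter under the reconstruction above, assuming the gate combines the lexical vector and the document's sub-topic vector through learnable linear maps and a sigmoid; dimensions are placeholders.

```python
import torch
import torch.nn as nn

class InformationGate(nn.Module):
    """Filter lexical semantic vectors with a gate computed from the
    document's sub-topic vector (element-wise multiplication)."""
    def __init__(self, d_model=512):
        super().__init__()
        self.W_g = nn.Linear(d_model, d_model, bias=False)
        self.U_g = nn.Linear(d_model, d_model, bias=True)   # its bias plays the role of b_g

    def forward(self, word_vecs, subtopic_vec):
        # word_vecs: (J, d_model); subtopic_vec: (d_model,) for the same document
        gate = torch.sigmoid(self.W_g(word_vecs) + self.U_g(subtopic_vec))
        return gate * word_vecs                              # filtered lexical vectors
```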
Substep 2-4, hierarchical attention calculation. In this embodiment a hierarchical attention mechanism fuses the document vectors and the lexical vectors to generate a context vector containing rich hierarchical semantic information. The input of the mechanism comprises three parts: all document vectors $d$ obtained in substep 2-2, the lexical semantic vectors $\tilde{x}$ filtered in substep 2-3, and the hidden state vector $h_t$ at the current decoding moment, where $h_t$ is obtained from the decoder input $y_t$ at moment t through word embedding, position coding, masked multi-head self-attention, residual connection and layer normalization; during training $y_t$ is the tth word of the manual abstract contained in the sample. The mechanism first performs attention calculation at the document level to generate a document context vector $c_t^d$:

$c_t^d = \sum_{i=1}^{N} \beta_i^t\, d_i$    (7)

where the attention weight $\beta_i^t$ is calculated from $h_t$ and all document vectors $d$ in the form of equation (3). Attention is then calculated at the lexical level and adjusted with the document attention weights:

$\alpha_{i,j}^t = \beta_i^t \cdot a_{i,j}^t$    (8)

$c_t^w = \sum_{i=1}^{N} \sum_{j=1}^{J} \alpha_{i,j}^t\, \tilde{x}_j^i$    (9)

where $\alpha_{i,j}^t$ is the lexical attention weight of the jth word in the ith document, $a_{i,j}^t$ is calculated from $h_t$ and the filtered lexical vectors in the form of equation (3), and $c_t^w$ is the lexical context vector. Finally, the context vectors obtained at the document level and the lexical level are concatenated and linearly mapped to generate the context vector $c_t$:

$c_t = W_c\,[c_t^d; c_t^w]$    (10)

where $W_c$ is a learnable weight parameter.
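The hierarchical attention calculation could be sketched as follows; the bilinear score functions and the renormalization of the adjusted lexical weights are assumed instantiations, and only the document-level/lexical-level split, the adjustment by document weights and the final concatenation-plus-linear-mapping follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalAttention(nn.Module):
    """Fuse document-level and lexical-level information into one context vector."""
    def __init__(self, d_model=512):
        super().__init__()
        self.W_doc = nn.Linear(d_model, d_model, bias=False)
        self.W_word = nn.Linear(d_model, d_model, bias=False)
        self.W_c = nn.Linear(2 * d_model, d_model, bias=False)

    def forward(self, h_t, doc_vecs, word_vecs):
        # h_t: (d_model,), doc_vecs: (N, d_model), word_vecs: (N, J, d_model) filtered
        beta = F.softmax(self.W_doc(doc_vecs) @ h_t, dim=0)          # document attention
        ctx_doc = beta @ doc_vecs                                     # document context c_t^d
        word_scores = torch.einsum('njd,d->nj', self.W_word(word_vecs), h_t)
        alpha = F.softmax(word_scores, dim=-1) * beta.unsqueeze(-1)   # adjust by doc weight
        alpha = alpha / alpha.sum()                                   # renormalize (assumed)
        ctx_word = torch.einsum('nj,njd->d', alpha, word_vecs)        # lexical context c_t^w
        return self.W_c(torch.cat([ctx_doc, ctx_word], dim=-1))       # fused context c_t
```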
Substep 2-5, constructing abstract probability layer, and comparing context vector ctAnd hidden state vector htObtaining an output vector of a decoder at the t moment through residual connection, layer normalization and a feedforward neural networkAnd converting the predicted probability distribution P into abstract words through a full connection layer fc and softmax function, wherein the calculation formula is as follows:
Substep 2-6, construct the loss function layer. The loss function layer combines the cross-entropy loss $L_S$ between the predicted abstract and the manual abstract with the triplet loss $L_T$ for document topic extraction to form the overall loss function of the model. The triplet loss is calculated as follows:

$L_T = \max\{\, d(T_A, T_P) - d(T_A, T_N) + Margin,\ 0 \,\}$    (12)

$d(T_A, T_P) = 1 - \cos(T_A, T_P)$    (13)

$d(T_A, T_N) = 1 - \cos(T_A, T_N)$    (14)

$L_{total} = \alpha L_S + \beta L_T$    (15)

where $L_T$ is the triplet loss; Margin is the boundary distance, set to 1 in this embodiment, which ensures that the positive example P and the negative example N differ in topic semantics; $T_A$ is the sub-topic vector of the true abstract, $T_P$ is the central topic vector of the input document set, and $T_N$ is the central topic vector of another sample's document set; the cos function computes the cosine of the angle between two topic vectors to measure their semantic similarity; $\alpha$ and $\beta$ are hyperparameters representing the weight coefficients of the two losses, set to 0.9 and 1.5 respectively in this embodiment; $L_S$ is the cross-entropy loss of the predicted abstract words; and $L_{total}$ is the overall loss function of the model.
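The loss construction of formulas (12)-(15) can be written down directly; in the sketch below the logits and target shapes are assumptions, while the cosine-distance triplet loss and the weighted combination follow the text.

```python
import torch
import torch.nn.functional as F

def triplet_topic_loss(T_A, T_P, T_N, margin=1.0):
    """Triplet loss over topic vectors with cosine distance (formulas 12-14)."""
    d_ap = 1 - F.cosine_similarity(T_A, T_P, dim=-1)
    d_an = 1 - F.cosine_similarity(T_A, T_N, dim=-1)
    return torch.clamp(d_ap - d_an + margin, min=0).mean()

def total_loss(logits, target_ids, T_A, T_P, T_N, alpha=0.9, beta=1.5):
    """Overall loss (formula 15): weighted cross entropy plus triplet loss."""
    L_S = F.cross_entropy(logits.view(-1, logits.size(-1)), target_ids.view(-1))
    L_T = triplet_topic_loss(T_A, T_P, T_N)
    return alpha * L_S + beta * L_T
```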
Substep 2-7, model training. In this embodiment all parameters to be trained are initialized randomly; during training an Adam optimizer performs gradient back-propagation to update the model parameters, with the initial learning rate set to 0.001, the $\beta_1$ and $\beta_2$ coefficients set to 0.9 and 0.998, and the batch size set to 16. Training stops when the loss function has not decreased for 3 consecutive epochs or the number of training epochs exceeds 50.
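A sketch of the training procedure under the embodiment's hyperparameters; `model`, `train_loader` and `compute_loss` are assumed to be provided elsewhere and are not defined by the patent.

```python
import torch

def train_model(model, train_loader, compute_loss):
    """Training loop sketch: Adam with lr=0.001 and betas=(0.9, 0.998); batch size 16
    is assumed to be handled by the loader; stop after 3 epochs without improvement
    or at most 50 epochs."""
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.998))
    best_loss, patience, epoch = float("inf"), 0, 0
    while patience < 3 and epoch < 50:
        epoch_loss = 0.0
        for batch in train_loader:
            optimizer.zero_grad()
            loss = compute_loss(model, batch)   # L_total = alpha * L_S + beta * L_T
            loss.backward()                     # gradient back-propagation
            optimizer.step()
            epoch_loss += loss.item()
        patience = 0 if epoch_loss < best_loss else patience + 1
        best_loss = min(best_loss, epoch_loss)
        epoch += 1
    return model
```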
Step 3, generate abstracts with the trained model. The multiple texts to be summarized are preprocessed as in step 1 and input into the trained model. The initial decoder input is the special marker "<START>"; at each moment the predicted abstract word is the word with the maximum probability output by the abstract probability layer, and it is fed back as the decoder input at the next moment. When the end marker "<END>" is output, abstract generation stops, and the generated abstract words are output as the predicted abstract of the input text set.
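Greedy decoding as described in step 3 could be sketched as follows; `model.encode` and `model.decode_step` are hypothetical interfaces for illustration, not an API defined by the patent.

```python
import torch

def generate_summary(model, doc_token_ids, start_id, end_id, max_len=200):
    """Start from <START>, feed back the highest-probability word at each step,
    and stop when <END> is produced."""
    memory = model.encode(doc_token_ids)              # hierarchical encoding of all documents
    summary = [start_id]
    for _ in range(max_len):
        probs = model.decode_step(torch.tensor(summary), memory)  # abstract probability layer
        next_id = int(probs[-1].argmax())             # word with the maximum probability
        if next_id == end_id:
            break
        summary.append(next_id)
    return summary[1:]                                # predicted abstract word ids
```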
Based on the same inventive concept, the embodiment of the present invention further provides a multi-document automatic summary generation apparatus, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the computer program is loaded into the processor, the multi-document automatic summary generation method is implemented.
It should be understood that the above embodiment is for illustration only and is not intended to limit the scope of the invention; after reading the present disclosure, those skilled in the art may make various modifications and equivalent substitutions, which all fall within the scope defined by the appended claims.
Claims (6)
1. A multi-document automatic abstract generation method is characterized by comprising the following steps:
step 1: data preprocessing;
step 2: constructing and training a multi-document abstract generation model;
step 3: generating abstracts for the plurality of texts to be summarized.
2. The multi-document automatic abstract generation method according to claim 1, wherein step 1, data preprocessing, is specifically as follows: texts in a preset text abstract data set are truncated, word-embedded and position-coded, and the word embedding representation and the position coding are added to obtain the word vector representation of each word; each sample in the preset data set comprises several texts on the same theme and a corresponding manually written abstract.
3. The multi-document automatic abstract generation method according to claim 1, wherein step 2, constructing and training the multi-document abstract generation model, is specifically as follows: first, a Transformer coding module extracts lexical semantic information from the word-vector representation of the sample texts, and a topic fusion attention module integrates this lexical semantic information to generate document vector representations; then information interaction between documents is realized through multi-head self-attention, and document vector representations containing the document dependency relationships are obtained through a residual structure, layer normalization and a feed-forward neural network; next, information gating screens the lexical semantic vectors, a hierarchical attention mechanism fuses the semantic vectors obtained at the vocabulary and document levels, and the resulting context vector guides abstract generation; finally, the model is trained with the triplet loss and the cross-entropy loss.
4. The multi-document automatic abstract generation method according to claim 1, wherein step 3, generating abstracts for the plurality of texts to be summarized, is specifically as follows: the texts to be summarized are first truncated, word-embedded and position-coded, and the resulting word vector representations are input into the abstract generation model trained in step 2 to generate the topic abstract of the texts.
5. The multi-document automatic abstract generation method according to claim 1, wherein in step 1, data preprocessing, the preset data set D is preprocessed: the M texts contained in each sample are truncated so that each truncated text has length Len/M, a text already shorter than Len/M being left unchanged, with Len set to 1500; each truncated text then undergoes word embedding mapping and position coding, where the word embedding matrix is a learnable parameter matrix and the position coding adopts the position coding module of the Transformer model.
6. The multi-document automatic abstract generation method according to claim 1, wherein in step 2 the data set D processed in step 1 is used to train the multi-document abstract generation model, and this step is divided into the following sub-steps:
substep 2-1, construct the internal feature extraction layer: l stacked Transformer coding sublayers extract the semantic information of each input text's word vector representation in each sample, yielding the lexical semantic vector representation $x_j^i$ of the jth word in the ith input text; on this basis, a document vector representation of fixed dimensionality is constructed through a topic fusion attention module comprising three parts, sub-topic coding, sub-topic fusion and attention calculation; sub-topic coding uses a two-layer bidirectional LSTM network to generate the sub-topic vector representation of each text, the input of the sub-topic encoder being the lexical semantic vector sequence and the output forward and backward hidden states being spliced into the sub-topic vector representation $T_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$; the central topic vector $T_c$ of the input text set is calculated by weighted summation:

$T_c = \sum_{i=1}^{N} w_i T_i$    (1)

$w_i = \mathrm{softmax}\big(v\,[T_i; T_{sum}]\big)$    (2)

where the weight factor $w_i$ represents the contribution of the sub-topic vector to the central topic vector, N is the total number of documents in the document set, $T_{sum}$ is the vector sum of all document sub-topic vectors in the document set, $[T_i; T_{sum}]$ is the concatenation of $T_i$ and $T_{sum}$, and v is a learnable weight matrix parameter; based on the central topic vector $T_c$, an attention mechanism integrates the lexical semantic vectors and constructs the document vector representation $D_i$:

$a_j^i = \mathrm{softmax}\big(T_c\, W_d\, x_j^i\big)$    (3)

$D_i = \sum_{j=1}^{J} a_j^i\, x_j^i$    (4)

where $\{x_1^i, \dots, x_J^i\}$ is the lexical semantic vector sequence of the ith document, $W_d$ is a learnable parameter matrix, and J is the number of words contained in the input document;
substep 2-2, construct the external information interaction layer: a Multi-Head Self-Attention mechanism realizes information interaction between documents to capture the association information between them, the input being the vector representation $D_i$ of each document; the number of attention heads is 8, and the final document vector $d_i$ is obtained through a residual structure, layer normalization and a feed-forward neural network module;
substep 2-3, information gating filtering: the document sub-topic vector representation is used to filter the lexical semantic vectors generated by the encoder so as to reduce unnecessary redundant content; for the jth word in the ith document, the corresponding gating vector $g_j^i$ is calculated as:

$g_j^i = \sigma\big(W_g\, x_j^i + U_g\, T_i + b_g\big)$    (5)

where $W_g$, $U_g$ and $b_g$ are learnable parameters and $\sigma(\cdot)$ is the sigmoid function; the gating vector is then applied to the lexical semantic vector by element-wise multiplication to realize information filtering:

$\tilde{x}_j^i = g_j^i \odot x_j^i$    (6)
substep 2-4, hierarchical attention calculation: a hierarchical attention mechanism fuses the document vectors and the lexical vectors to generate a context vector containing rich hierarchical semantic information; the input of the mechanism comprises three parts, namely all document vectors $d$ obtained in substep 2-2, the lexical semantic vectors $\tilde{x}$ filtered in substep 2-3, and the hidden state vector $h_t$ at the current decoding moment, where $h_t$ is obtained from the decoder input $y_t$ at moment t through word embedding, position coding, masked multi-head self-attention, residual connection and layer normalization, and during training $y_t$ is the tth word of the manual abstract contained in the sample; the mechanism first performs attention calculation at the document level to generate a document context vector $c_t^d$:

$c_t^d = \sum_{i=1}^{N} \beta_i^t\, d_i$    (7)

where the attention weight $\beta_i^t$ is calculated from $h_t$ and all document vectors $d$ in the form of equation (3); attention is then calculated at the lexical level and adjusted with the document attention weights:

$\alpha_{i,j}^t = \beta_i^t \cdot a_{i,j}^t$    (8)

$c_t^w = \sum_{i=1}^{N} \sum_{j=1}^{J} \alpha_{i,j}^t\, \tilde{x}_j^i$    (9)

where $\alpha_{i,j}^t$ is the lexical attention weight of the jth word in the ith document and $c_t^w$ is the lexical context vector; finally, the context vectors obtained at the document level and the lexical level are concatenated and linearly mapped to generate the context vector $c_t$:

$c_t = W_c\,[c_t^d; c_t^w]$    (10)

where $W_c$ is a learnable weight parameter;
substeps 2-5, constructing abstract probabilitiesLayer, for context vector ctAnd hidden state vector htObtaining an output vector of a decoder at the t moment through residual connection, layer normalization and a feedforward neural networkAnd converting the predicted probability distribution P into abstract words through a full connection layer fc and softmax function, wherein the calculation formula is as follows:
substep 2-6, construct the loss function layer: the loss function layer combines the cross-entropy loss $L_S$ between the predicted abstract and the manual abstract with the triplet loss $L_T$ for document topic extraction to form the overall loss function of the model, where the triplet loss is calculated as follows:

$L_T = \max\{\, d(T_A, T_P) - d(T_A, T_N) + Margin,\ 0 \,\}$    (12)

$d(T_A, T_P) = 1 - \cos(T_A, T_P)$    (13)

$d(T_A, T_N) = 1 - \cos(T_A, T_N)$    (14)

$L_{total} = \alpha L_S + \beta L_T$    (15)

where $L_T$ is the triplet loss; Margin is the boundary distance, set to 1, which ensures that the positive example P and the negative example N differ in topic semantics; $T_A$ is the sub-topic vector of the true abstract, $T_P$ is the central topic vector of the input document set, and $T_N$ is the central topic vector of another sample's document set; the cos function computes the cosine of the angle between two topic vectors to measure their semantic similarity; $\alpha$ and $\beta$ are hyperparameters representing the weight coefficients of the two losses; $L_S$ is the cross-entropy loss of the predicted abstract words; and $L_{total}$ is the overall loss function of the model;
substep 2-7, model training: all parameters to be trained are initialized randomly; during training an Adam optimizer performs gradient back-propagation to update the model parameters, with the initial learning rate set to 0.001, the $\beta_1$ and $\beta_2$ coefficients set to 0.9 and 0.998, and the batch size set to 16; training stops when the loss function has not decreased for 3 consecutive epochs or the number of training epochs exceeds 50.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110703934.1A CN113468854A (en) | 2021-06-24 | 2021-06-24 | Multi-document automatic abstract generation method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110703934.1A CN113468854A (en) | 2021-06-24 | 2021-06-24 | Multi-document automatic abstract generation method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113468854A true CN113468854A (en) | 2021-10-01 |
Family
ID=77872791
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110703934.1A Pending CN113468854A (en) | 2021-06-24 | 2021-06-24 | Multi-document automatic abstract generation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113468854A (en) |
2021-06-24: CN application CN202110703934.1A filed; publication CN113468854A; status: Pending (active)
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115544244A (en) * | 2022-09-06 | 2022-12-30 | 内蒙古工业大学 | Cross fusion and reconstruction-based multi-mode generative abstract acquisition method |
CN115544244B (en) * | 2022-09-06 | 2023-11-17 | 内蒙古工业大学 | Multi-mode generation type abstract acquisition method based on cross fusion and reconstruction |
CN116362351A (en) * | 2023-05-29 | 2023-06-30 | 深圳须弥云图空间科技有限公司 | Method and device for training pre-training language model by using noise disturbance |
CN116362351B (en) * | 2023-05-29 | 2023-09-26 | 深圳须弥云图空间科技有限公司 | Method and device for training pre-training language model by using noise disturbance |
CN117236323A (en) * | 2023-10-09 | 2023-12-15 | 青岛中企英才集团商业管理有限公司 | Information processing method and system based on big data |
CN117236323B (en) * | 2023-10-09 | 2024-03-29 | 京闽数科(北京)有限公司 | Information processing method and system based on big data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |