CN113609284A - Method and device for automatically generating text abstract fused with multivariate semantics - Google Patents

Method and device for automatically generating text abstract fused with multivariate semantics

Info

Publication number
CN113609284A
Authority
CN
China
Prior art keywords
text
multivariate
semantic
hidden layer
abstract
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110882867.4A
Other languages
Chinese (zh)
Inventor
何欣
陈永超
胡霄林
于俊洋
王光辉
翟瑞
宋亚林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University
Original Assignee
Henan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University filed Critical Henan University
Priority to CN202110882867.4A
Publication of CN113609284A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention belongs to the technical field of text data processing, and particularly relates to a method and a device for automatically generating a text abstract fused with multivariate semantics. The method comprises the following steps: first, multivariate semantic features are fused into the source text before it is input to the encoder, so that the source text contains more semantic information; the source text with the fused multivariate semantic features is then input to a bidirectional long short-term memory network in the encoder to obtain the hidden-layer states corresponding to the word vectors of the fused text; next, the decoder, using a unidirectional long short-term memory network combined with an improved attention mechanism, predicts the word vector to be generated at the next moment from the context vector and the decoder hidden-layer state at the current moment; finally, the model is trained with a loss function, and the trained model automatically generates the abstract of a text. By fusing multivariate semantic features before the source text is input to the encoder, the invention fully mines the deep hidden features of the source text and improves the quality of the generated abstract.

Description

Method and device for automatically generating text abstract fused with multivariate semantics
Technical Field
The invention belongs to the technical field of text data processing, and particularly relates to a method and a device for automatically generating a text abstract fused with multivariate semantics.
Background
Automatic text summarization can effectively reduce reading costs and alleviate the information overload people face today. Summarization methods fall into two main categories: extractive summarization and abstractive (generative) summarization.
Extractive summarization judges the importance of each sentence in the original text, extracts the most important sentences, and recombines them to form the abstract. Early extractive methods used statistical knowledge, such as word frequency, relative sentence length, and the similarity between sentences and titles, as the basis for judging importance. Sentence importance was first measured by high-frequency words: the more high-frequency words a sentence contains, the more important it is considered; the term frequency-inverse document frequency (TF-IDF) algorithm was later proposed to improve the traditional word-frequency algorithm and thereby raise abstract quality. At present, given sufficient computing power, machine learning methods can be applied: a data set is labeled by supervised and semi-supervised methods, and after reasonable modeling, the trained model labels unlabeled sentences to predict whether they can serve as abstract sentences. Although extractive methods are easy to implement, they operate at the surface level of the document, ignore the grammar and contextual relations between adjacent words, and do not truly understand the original text; the sentences assembled into the abstract therefore lack coherence, and the approach is quite limited.
Abstractive summarization, the more advanced and more complex current approach, analyzes the grammar of the original text and, on the basis of understanding it, expresses its content in more concise sentences. With the growing performance of hardware and the increasing amount of training data in recent years, deep learning has developed rapidly. After the sequence-to-sequence model was proposed, it was applied to several fields of natural language processing, provided a good research direction for automatic text summarization, and made great progress. The sequence-to-sequence model encodes the source text into a fixed-size context vector through an encoder, and the decoder then generates the next predicted word based on the word generated at the previous moment and the hidden-layer state at that moment. Later work applied the attention mechanism to the encoder, improving the quality of the generated abstract, and replacing the decoder with a recurrent neural network also achieved good progress. On this basis, reinforcement learning was introduced to alleviate error propagation, reduce repeated words and sentences, and improve the readability of the generated abstract. In addition, abstractive summarization can exploit inherent characteristics of the source text to improve the model: blending statistical information such as TF-IDF, POS, and NER into the word vectors makes the generated abstract closer to a manually written summary.
With the development of deep learning and natural language processing, sequence-to-sequence-based abstractive summarization methods continue to improve. However, most current improvements are made at the encoder and decoder level, and the fusion of multivariate semantics is largely missing.
Disclosure of Invention
In order to obtain more effective information from the source text during model training, and thereby further improve the quality of the abstract generated by the trained model, the invention provides a method and a device for automatically generating a text abstract fused with multivariate semantics.
In order to solve the technical problems, the invention adopts the following technical scheme:
The invention provides a method for automatically generating a text abstract fusing multivariate semantics, which comprises the following steps:
Step 1: based on a sequence-to-sequence model and combined with the multivariate semantic characteristics of natural language processing, fuse multivariate semantic features into the source text before it is input to the encoder, so that the source text contains more semantic information;
Step 2: input the source text with the fused multivariate semantic features into a bidirectional long short-term memory network in the encoder, and obtain the hidden-layer states corresponding to the word vectors of the fused text;
Step 3: the decoder, using a unidirectional long short-term memory network combined with an improved attention mechanism, predicts the word vector to be generated at the next moment from the context vector and the decoder hidden-layer state at the current moment;
Step 4: train the model with the loss function, and automatically generate the abstract of a text through the trained model.
Further, the fusion of multivariate semantic features in step 1 comprises two rounds of semantic information extraction and two rounds of vector splicing; the specific process is as follows:
the number of convolution kernels in each of the two convolution layers of a convolutional neural network is set equal to the word-vector size k; the size of each kernel in the first convolution layer is set to 3, and the size of each kernel in the second convolution layer is set to 5;
the source text is input to the first convolution layer, which outputs k semantic vectors; these k semantic vectors are spliced for the first time;
the spliced semantic vectors are input, as a new feature matrix, to the second convolution layer, which again outputs k semantic vectors; the new k semantic vectors are spliced a second time, and the spliced semantic vectors are finally input to the encoder.
Further, the hidden-layer state in step 2 is expressed as:

h_i = [\overrightarrow{h}_i ; \overleftarrow{h}_i]

where h_i is formed by splicing the forward hidden-layer state \overrightarrow{h}_i and the backward hidden-layer state \overleftarrow{h}_i, which are generated by:

\overrightarrow{h}_i = LSTM(x_i, \overrightarrow{h}_{i-1})

\overleftarrow{h}_i = LSTM(x_i, \overleftarrow{h}_{i+1})

where x_i denotes the i-th input word vector, i ∈ [1, m], and m is the number of word vectors of the input source text.
Further, step 3 specifically comprises the following steps:
Step 3.1: compute the decoder hidden-layer state s_t at time t through a unidirectional long short-term memory network;
Step 3.2: generate the context vector C_t used for decoding at time t through the improved attention mechanism and the hidden-layer states of the encoder;
Step 3.3: predict the vocabulary through the context vector C_t and the decoder hidden-layer state s_t at time t.
Further, in step 3.1 the decoder hidden-layer state s_t at time t is calculated as:

s_t = LSTM(s_{t-1}, y_{t-1})

where s_{t-1} is the hidden-layer state at the previous moment; during model training, y_{t-1} is the word vector of the reference-abstract word in the training set, while during prediction y_{t-1} is the word vector predicted at the previous moment. The last encoding output h_m of the encoder hidden layer initializes the decoder hidden-layer state s_0 at the initial moment, and the end-of-text vector of the source text is assigned to the decoder's initial input y_0; t ∈ [1, n], where n is the set length of the generated abstract.
Further, the context vector C_t at time t in step 3.2 is generated from the encoder hidden-layer states h_i and the decoder hidden-layer state s_t at time t, and is calculated as:

C_t = \sum_{i=1}^{m} \alpha_i^t h_i

The unsaturated activation function LeakyReLU is introduced into the attention mechanism to optimize the model, where LeakyReLU is defined as:

LeakyReLU(x) = max(\theta x, x)

where \theta is a parameter of the function, \theta ∈ (-∞, 1);

e_i^t = v^T LeakyReLU(W_h h_i + W_s s_t + b_{attn})

\alpha_i^t = \frac{\exp(e_i^t)}{\sum_{k=1}^{m} \exp(e_k^t)}

where v, W_h, W_s, and b_{attn} are all learnable parameters, \exp(·) denotes the exponential function, e_i^t represents the similarity between the decoder hidden-layer state s_t at time t and the encoder hidden-layer state h_i, and \alpha^t represents the probability distribution over the source vocabulary; i ∈ [1, m], t ∈ [1, n].
Further, the formula for vocabulary prediction at time t in step 3.3 is:

P_vocab = softmax(V'(V[s_t; C_t] + b) + b')

where V', V, b, and b' are learnable parameters, P_vocab is the probability distribution over all words in the dictionary, and softmax(·) denotes the softmax function. The final distribution of the predicted word w is:

P(w) = P_vocab(w).
Further, in step 4, the loss function for the target word w_t^* at time t is:

loss_t = -\log P(w_t^*)

and the loss of the entire sequence is:

loss = \frac{1}{n} \sum_{t=1}^{n} loss_t

The abstract is then generated automatically by the trained model.
The invention also provides a device for automatically generating a text abstract fusing multivariate semantics, which comprises:
a multivariate semantic feature fusion module, configured to fuse multivariate semantic features into the source text before it is input to the encoder, based on a sequence-to-sequence model combined with the multivariate semantic characteristics of natural language processing, so that the source text contains more semantic information;
an encoder hidden-layer state calculation module, configured to input the source text with the fused multivariate semantic features into a bidirectional long short-term memory network in the encoder and obtain the hidden-layer states corresponding to the word vectors of the fused text;
a word vector prediction module, configured to predict, by the decoder, the word vector to be generated at the next moment from the context vector and the decoder hidden-layer state at the current moment, using a unidirectional long short-term memory network combined with an improved attention mechanism;
and a model training module, configured to train the model with the loss function and automatically generate the abstract of a text through the trained model.
Compared with the prior art, the invention has the following advantages:
1. On the basis of a traditional sequence-to-sequence model combined with an attention mechanism, multivariate semantic features are fused before the source text is input to the encoder, so that the source text carries more semantic information before entering the encoder. The model thus fully mines the important content of the source text, which increases the readability and global relevance of the generated abstract and alleviates the problem of low global relevance in generated abstracts.
2. After encoding, the context vector used to predict the word at the next moment is generated by the attention mechanism. Conventional attention mechanisms mostly use saturating activation functions; the invention instead uses the non-saturating activation function LeakyReLU in the attention mechanism, which avoids gradient vanishing during model training and accelerates model convergence. The next word is predicted by combining the context vector at that moment with the decoder hidden-layer state. Training the model in this way improves the overall quality of the generated abstract.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a network architecture diagram of a sequence-to-sequence model with an attention mechanism after incorporating fused multivariate semantic features according to an embodiment of the present invention;
FIG. 2 is a flowchart of the method for automatically generating a text abstract fusing multivariate semantics according to an embodiment of the present invention;
FIG. 3 is a process diagram of fusing multivariate semantic features according to an embodiment of the present invention;
FIG. 4 is a flowchart of predicting the word vector generated at the next moment according to an embodiment of the present invention;
FIG. 5 is a block diagram of the device for automatically generating a text abstract fusing multivariate semantics according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer and more complete, the technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention, and based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without creative efforts belong to the scope of the present invention.
As shown in fig. 1 and fig. 2, the method for automatically generating a text abstract with a fused multivariate semantic meaning of this embodiment includes the following steps:
and step S11, based on the sequence-to-sequence model, combining the multi-element semantic characteristics of natural language processing, fusing the multi-element semantic characteristics before the source text is input into the encoder, so that the source text contains more semantic information, and the model can fully mine the important content of the source text.
As shown in fig. 3, the fusion of multivariate semantic features comprises two rounds of semantic information extraction and two rounds of vector splicing; the specific process is as follows:
The invention provides a multivariate semantic extraction method suitable for text summarization that uses two convolution layers of a convolutional neural network. The number of convolution kernels in the first convolution layer is set equal to the word-vector size k, and, considering that the range people typically read at a glance is three to five words, the size of each kernel in the first convolution layer is set to 3. The source text is input to the first convolution layer to obtain as many semantic vectors as the word-vector size k, and these semantic vectors are spliced for the first time. The spliced semantic vectors are input, as a new feature matrix, to the second convolution layer, whose number of kernels is also set equal to k and whose kernel size is set to 5. The k semantic vectors obtained again are spliced a second time into a feature matrix that contains more semantic information than the initial source-text vector matrix, and this feature matrix is finally input to the encoder.
Convolutional neural networks were first used in natural language processing for text classification, to capture features within sentences. The model of the present invention improves on this to extract local correlations within sentences, such as the internal correlations of phrase structures, and removes the pooling layer of the convolutional neural network (pooling would cause the text to lose many features), preventing information loss. Zero padding is applied at every feature-matrix boundary so that the size of the matrix after fusing multivariate semantic features is unchanged; in this way, the deep features of the text are better mined after the fusion, and the global relevance of the abstract is enhanced.
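For illustration, the following is a minimal sketch of the two-layer convolutional fusion described above, written in PyTorch; the module name, the word-vector size k = 128, and the sequence length are illustrative assumptions, not values fixed by the patent. Because each layer has k output channels, the k semantic vectors it produces are already stacked ("spliced") into a feature matrix of unchanged size:

```python
import torch
import torch.nn as nn

class MultivariateSemanticFusion(nn.Module):
    """Sketch of the two-round extraction: kernel sizes 3 and 5,
    k kernels each, zero ("same") padding, and no pooling layer."""
    def __init__(self, k: int):
        super().__init__()
        # k output channels = k semantic vectors per layer; stacking them
        # along the channel axis is the splicing into a feature matrix.
        self.conv1 = nn.Conv1d(k, k, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(k, k, kernel_size=5, padding=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, k) word-vector matrix of the source text
        x = x.transpose(1, 2)      # (batch, k, seq_len) for Conv1d
        x = self.conv1(x)          # first semantic extraction + splice
        x = self.conv2(x)          # second extraction on the new feature matrix
        return x.transpose(1, 2)   # shape preserved: (batch, seq_len, k)

fused = MultivariateSemanticFusion(k=128)(torch.randn(2, 40, 128))
print(fused.shape)  # torch.Size([2, 40, 128])
```

Omitting the pooling layer and using "same" zero padding is what keeps the fused matrix the same size as the initial source-text vector matrix, as required above.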
Step S12: input the source text with the fused multivariate semantic features into a bidirectional long short-term memory network in the encoder, and obtain the hidden-layer states corresponding to the word vectors of the fused text.
In this embodiment, the hidden-layer state is expressed as:

h_i = [\overrightarrow{h}_i ; \overleftarrow{h}_i]

where h_i is formed by splicing the forward hidden-layer state \overrightarrow{h}_i and the backward hidden-layer state \overleftarrow{h}_i, which are generated by:

\overrightarrow{h}_i = LSTM(x_i, \overrightarrow{h}_{i-1})

\overleftarrow{h}_i = LSTM(x_i, \overleftarrow{h}_{i+1})

where x_i denotes the i-th input word vector, i ∈ [1, m], and m is the number of word vectors of the input source text.
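A minimal sketch of this encoder step, assuming PyTorch; the hidden size and input shapes are illustrative. The forward and backward halves of each output vector correspond to \overrightarrow{h}_i and \overleftarrow{h}_i in the formulas above:

```python
import torch
import torch.nn as nn

k, hidden = 128, 256
encoder = nn.LSTM(input_size=k, hidden_size=hidden,
                  bidirectional=True, batch_first=True)

x = torch.randn(2, 40, k)   # source text with fused multivariate features
H, _ = encoder(x)           # H: (2, 40, 2 * hidden)
# H[:, i, :hidden] is the forward state of step i and H[:, i, hidden:]
# the backward state; their concatenation along the last axis is h_i.
print(H.shape)              # torch.Size([2, 40, 512])
```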
In step S13, the decoder, using a unidirectional long short-term memory network combined with an improved attention mechanism, predicts the word vector to be generated at the next moment from the context vector and the decoder hidden-layer state at the current moment; this specifically comprises steps S131 to S133, as shown in fig. 4:
step S131, calculating the hidden layer state S of the decoder at the time t through the one-way long-short term memory networktThe calculation formula is as follows:
St=LSTM(St-1,yt-1)
wherein s ist-1For the previous moment to hide the layer state, when model training is performed, yt-1Is a word vector of a reference abstract vocabulary in a training set, and y is the word vector of the reference abstract vocabulary when the training set is used for predictiont-1Is a word vector predicted at the last moment; outputting the last encoding output result h of the encoder hidden layermInitializing hidden layer states s at the initial moment of the decoder0Assigning the ending vector of the source text to the initial input sequence y of the decoder0;t∈[1,n]And n is the set length for generating the summary.
Step S132: generate the context vector C_t used for decoding at time t through the improved attention mechanism and the hidden-layer states of the encoder.
Specifically, the context vector C_t at time t is generated from the encoder hidden-layer states h_i and the decoder hidden-layer state s_t at time t:

C_t = \sum_{i=1}^{m} \alpha_i^t h_i

The invention improves the attention mechanism by introducing the unsaturated activation function LeakyReLU to optimize the model, which avoids gradient vanishing during model training and accelerates model convergence. LeakyReLU is defined as:

LeakyReLU(x) = max(\theta x, x)

where \theta is a parameter of the function, \theta ∈ (-∞, 1);

e_i^t = v^T LeakyReLU(W_h h_i + W_s s_t + b_{attn})

\alpha_i^t = \frac{\exp(e_i^t)}{\sum_{k=1}^{m} \exp(e_k^t)}

where v, W_h, W_s, and b_{attn} are all learnable parameters, \exp(·) denotes the exponential function, and e_i^t represents the similarity between the decoder hidden-layer state s_t at time t and the encoder hidden-layer state h_i. The attention distribution \alpha^t can be viewed as a probability distribution over the source vocabulary that tells the decoder where to focus for the next word; the weighted sum of the encoder hidden-layer states with the attention weights is then generated and called the context vector C_t; i ∈ [1, m], t ∈ [1, n].
In natural language processing, sequence-to-sequence models are often used together with an attention mechanism; by determining the relevance between the next word and the source text before the decoder decodes, the attention mechanism improves the global relevance and quality of the generated abstract.
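The modified attention step can be sketched as follows, assuming PyTorch; the class name, dimensions, and the LeakyReLU slope θ = 0.01 are illustrative assumptions (the patent only requires θ ∈ (-∞, 1)):

```python
import torch
import torch.nn as nn

class LeakyReLUAttention(nn.Module):
    """e_i^t = v^T LeakyReLU(W_h h_i + W_s s_t + b_attn);
    alpha^t = softmax(e^t); C_t = sum_i alpha_i^t h_i."""
    def __init__(self, enc_dim: int, dec_dim: int, attn_dim: int, theta: float = 0.01):
        super().__init__()
        self.W_h = nn.Linear(enc_dim, attn_dim, bias=False)
        self.W_s = nn.Linear(dec_dim, attn_dim, bias=False)
        self.b_attn = nn.Parameter(torch.zeros(attn_dim))
        self.v = nn.Linear(attn_dim, 1, bias=False)
        self.act = nn.LeakyReLU(negative_slope=theta)  # max(theta*x, x)

    def forward(self, h: torch.Tensor, s_t: torch.Tensor):
        # h: (batch, m, enc_dim) encoder states; s_t: (batch, dec_dim)
        e = self.v(self.act(self.W_h(h) + self.W_s(s_t).unsqueeze(1) + self.b_attn))
        alpha = torch.softmax(e.squeeze(-1), dim=-1)       # (batch, m)
        C_t = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)  # weighted sum of h_i
        return C_t, alpha

C_t, alpha = LeakyReLUAttention(512, 256, 256)(torch.randn(2, 40, 512),
                                               torch.randn(2, 256))
print(C_t.shape, alpha.shape)  # torch.Size([2, 512]) torch.Size([2, 40])
```

PyTorch's LeakyReLU computes max(θx, x) for any slope θ < 1, which matches the definition above.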
Step S133: predict the vocabulary through the context vector C_t and the decoder hidden-layer state s_t at time t:

P_vocab = softmax(V'(V[s_t; C_t] + b) + b')

where V', V, b, and b' are learnable parameters, P_vocab is the probability distribution over all words in the dictionary, and softmax(·) denotes the softmax function. The final distribution of the predicted word w is:

P(w) = P_vocab(w).
and step S14, training the model by using the loss function, and automatically generating the abstract of the text through the trained model.
For the target word w_t^* at time t, the loss function is:

loss_t = -\log P(w_t^*)

and the loss of the entire sequence is:

loss = \frac{1}{n} \sum_{t=1}^{n} loss_t

The abstract is then generated automatically by the trained model.
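A minimal sketch of the loss computation, assuming PyTorch and illustrative shapes; in practice the probabilities come from the prediction step above rather than from random logits:

```python
import torch

n, vocab_size = 6, 50000
P = torch.softmax(torch.randn(2, n, vocab_size), dim=-1)  # P_vocab at each step
w_star = torch.randint(0, vocab_size, (2, n))             # reference word ids w_t*
loss_t = -torch.log(P.gather(-1, w_star.unsqueeze(-1)).squeeze(-1))  # -log P(w_t*)
loss = loss_t.mean(dim=-1)   # average over the n steps of the sequence
print(loss)
```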
Generating the abstract repeats the process of steps S131 to S133: each pass generates one word, and the process is repeated until all words have been generated; the generated words are finally combined to form the abstract.
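Putting the steps together, a minimal greedy decoding loop might look as follows, assuming PyTorch; the dot-product attend function is a simplified stand-in for the LeakyReLU attention sketched above, and the end-of-text id, dimensions, and summary length n are illustrative assumptions:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, dim, n = 50000, 128, 256, 30
embedding = nn.Embedding(vocab_size, emb_dim)
cell = nn.LSTMCell(emb_dim, dim)                 # unidirectional decoder step
project = nn.Linear(2 * dim, vocab_size)         # stand-in for V', V, b, b'

def attend(h, s):  # simplified stand-in for the LeakyReLU attention above
    alpha = torch.softmax(torch.bmm(h, s.unsqueeze(-1)).squeeze(-1), dim=-1)
    return torch.bmm(alpha.unsqueeze(1), h).squeeze(1)

h_enc = torch.randn(1, 40, dim)                  # encoder states h_1..h_m
s_t = h_enc[:, -1, :]                            # s_0 initialized from h_m
c_t = torch.zeros(1, dim)                        # LSTM cell state
y = torch.zeros(1, dtype=torch.long)             # end-of-text id as y_0 (assumed 0)
summary = []
for _ in range(n):                               # repeat S131 to S133
    s_t, c_t = cell(embedding(y), (s_t, c_t))    # S131: decoder state s_t
    C_t = attend(h_enc, s_t)                     # S132: context vector C_t
    P_vocab = torch.softmax(project(torch.cat([s_t, C_t], -1)), dim=-1)  # S133
    y = P_vocab.argmax(dim=-1)                   # greedy choice of the next word
    summary.append(y.item())
print(summary)                                   # word ids of the generated abstract
```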
Corresponding to the above method for automatically generating a text abstract fused with multivariate semantics, as shown in fig. 5, the present embodiment further provides an apparatus for automatically generating a text abstract fused with multivariate semantics, which includes a multivariate semantic feature fusion module 51, an encoder hidden layer state calculation module 52, a word vector prediction module 53, and a model training module 54.
The multivariate semantic feature fusion module 51 is configured to fuse multivariate semantic features before the source text is input to the encoder based on a sequence-to-sequence model in combination with multivariate semantic features of natural language processing, so that the source text contains more semantic information.
The encoder hidden-layer state calculation module 52 is configured to input the source text with the fused multivariate semantic features into a bidirectional long short-term memory network in the encoder, and obtain the hidden-layer states corresponding to the word vectors of the fused text.
The word vector prediction module 53 is configured to predict the word vector to be generated at the next moment from the context vector and the decoder hidden-layer state at the current moment, using a unidirectional long short-term memory network combined with an improved attention mechanism.
And the model training module 54 is used for training the model by using the loss function, and automatically generating the abstract of the text through the trained model.
It is to be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Those of ordinary skill in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it is to be noted that: the above description is only a preferred embodiment of the present invention, and is only used to illustrate the technical solutions of the present invention, and not to limit the protection scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (9)

1. A method for automatically generating a text abstract fusing multivariate semantics, characterized by comprising the following steps:
Step 1: based on a sequence-to-sequence model and combined with the multivariate semantic characteristics of natural language processing, fuse multivariate semantic features into the source text before it is input to the encoder, so that the source text contains more semantic information;
Step 2: input the source text with the fused multivariate semantic features into a bidirectional long short-term memory network in the encoder, and obtain the hidden-layer states corresponding to the word vectors of the fused text;
Step 3: the decoder, using a unidirectional long short-term memory network combined with an improved attention mechanism, predicts the word vector to be generated at the next moment from the context vector and the decoder hidden-layer state at the current moment;
Step 4: train the model with the loss function, and automatically generate the abstract of a text through the trained model.
2. The method for automatically generating a text abstract fusing multivariate semantics according to claim 1, characterized in that the fusion of multivariate semantic features in step 1 comprises two rounds of semantic information extraction and two rounds of vector splicing; the specific process is as follows:
the number of convolution kernels in each of the two convolution layers of a convolutional neural network is set equal to the word-vector size k; the size of each kernel in the first convolution layer is set to 3, and the size of each kernel in the second convolution layer is set to 5;
the source text is input to the first convolution layer, which outputs k semantic vectors; these k semantic vectors are spliced for the first time;
the spliced semantic vectors are input, as a new feature matrix, to the second convolution layer, which again outputs k semantic vectors; the new k semantic vectors are spliced a second time, and the spliced semantic vectors are finally input to the encoder.
3. The method for automatically generating a text abstract fusing multivariate semantics according to claim 1, characterized in that the hidden-layer state in step 2 is expressed as:

h_i = [\overrightarrow{h}_i ; \overleftarrow{h}_i]

where h_i is formed by splicing the forward hidden-layer state \overrightarrow{h}_i and the backward hidden-layer state \overleftarrow{h}_i, which are generated by:

\overrightarrow{h}_i = LSTM(x_i, \overrightarrow{h}_{i-1})

\overleftarrow{h}_i = LSTM(x_i, \overleftarrow{h}_{i+1})

where x_i denotes the i-th input word vector, i ∈ [1, m], and m is the number of word vectors of the input source text.
4. The method for automatically generating a text abstract fusing multivariate semantics according to claim 3, characterized in that step 3 specifically comprises the following steps:
Step 3.1: compute the decoder hidden-layer state s_t at time t through a unidirectional long short-term memory network;
Step 3.2: generate the context vector C_t used for decoding at time t through the improved attention mechanism and the hidden-layer states of the encoder;
Step 3.3: predict the vocabulary through the context vector C_t and the decoder hidden-layer state s_t at time t.
5. The method for automatically generating a text abstract fusing multivariate semantics according to claim 4, characterized in that the decoder hidden-layer state s_t at time t in step 3.1 is calculated as:

s_t = LSTM(s_{t-1}, y_{t-1})

where s_{t-1} is the hidden-layer state at the previous moment; during model training, y_{t-1} is the word vector of the reference-abstract word in the training set, while during prediction y_{t-1} is the word vector predicted at the previous moment. The last encoding output h_m of the encoder hidden layer initializes the decoder hidden-layer state s_0 at the initial moment, and the end-of-text vector of the source text is assigned to the decoder's initial input y_0; t ∈ [1, n], where n is the set length of the generated abstract.
6. The method for automatically generating a text abstract fusing multivariate semantics according to claim 5, characterized in that the context vector C_t at time t in step 3.2 is generated from the encoder hidden-layer states h_i and the decoder hidden-layer state s_t at time t, and is calculated as:

C_t = \sum_{i=1}^{m} \alpha_i^t h_i

The unsaturated activation function LeakyReLU is introduced into the attention mechanism to optimize the model, where LeakyReLU is defined as:

LeakyReLU(x) = max(\theta x, x)

where \theta is a parameter of the function, \theta ∈ (-∞, 1);

e_i^t = v^T LeakyReLU(W_h h_i + W_s s_t + b_{attn})

\alpha_i^t = \frac{\exp(e_i^t)}{\sum_{k=1}^{m} \exp(e_k^t)}

where v, W_h, W_s, and b_{attn} are all learnable parameters, \exp(·) denotes the exponential function, e_i^t represents the similarity between the decoder hidden-layer state s_t at time t and the encoder hidden-layer state h_i, and \alpha^t represents the probability distribution over the source vocabulary; i ∈ [1, m], t ∈ [1, n].
7. The method for automatically generating a text abstract fusing multivariate semantics according to claim 6, characterized in that the formula for vocabulary prediction at time t in step 3.3 is:

P_vocab = softmax(V'(V[s_t; C_t] + b) + b')

where V', V, b, and b' are learnable parameters, P_vocab is the probability distribution over all words in the dictionary, and softmax(·) denotes the softmax function. The final distribution of the predicted word w is:

P(w) = P_vocab(w).
8. The method for automatically generating a text abstract fusing multivariate semantics according to claim 7, characterized in that, in step 4, the loss function for the target word w_t^* at time t is:

loss_t = -\log P(w_t^*)

and the loss of the entire sequence is:

loss = \frac{1}{n} \sum_{t=1}^{n} loss_t

The abstract is generated automatically by the trained model.
9. A device for automatically generating a text abstract fusing multivariate semantics, characterized by comprising:
a multivariate semantic feature fusion module, configured to fuse multivariate semantic features into the source text before it is input to the encoder, based on a sequence-to-sequence model combined with the multivariate semantic characteristics of natural language processing, so that the source text contains more semantic information;
an encoder hidden-layer state calculation module, configured to input the source text with the fused multivariate semantic features into a bidirectional long short-term memory network in the encoder and obtain the hidden-layer states corresponding to the word vectors of the fused text;
a word vector prediction module, configured to predict, by the decoder, the word vector to be generated at the next moment from the context vector and the decoder hidden-layer state at the current moment, using a unidirectional long short-term memory network combined with an improved attention mechanism;
and a model training module, configured to train the model with the loss function and automatically generate the abstract of a text through the trained model.
CN202110882867.4A 2021-08-02 2021-08-02 Method and device for automatically generating text abstract fused with multivariate semantics Pending CN113609284A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110882867.4A CN113609284A (en) 2021-08-02 2021-08-02 Method and device for automatically generating text abstract fused with multivariate semantics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110882867.4A CN113609284A (en) 2021-08-02 2021-08-02 Method and device for automatically generating text abstract fused with multivariate semantics

Publications (1)

Publication Number Publication Date
CN113609284A true CN113609284A (en) 2021-11-05

Family

ID=78339115

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110882867.4A Pending CN113609284A (en) 2021-08-02 2021-08-02 Method and device for automatically generating text abstract fused with multivariate semantics

Country Status (1)

Country Link
CN (1) CN113609284A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114118024A (en) * 2021-12-06 2022-03-01 成都信息工程大学 Conditional text generation method and generation system
CN114118024B (en) * 2021-12-06 2022-06-21 成都信息工程大学 Conditional text generation method and generation system
CN114610871A (en) * 2022-05-12 2022-06-10 北京道达天际科技有限公司 Information system modeling analysis method based on artificial intelligence algorithm
CN114610871B (en) * 2022-05-12 2022-07-08 北京道达天际科技有限公司 Information system modeling analysis method based on artificial intelligence algorithm
CN115865459A (en) * 2022-11-25 2023-03-28 南京信息工程大学 Network flow abnormity detection method and system based on secondary feature extraction
CN115994541A (en) * 2023-03-22 2023-04-21 金蝶软件(中国)有限公司 Interface semantic data generation method, device, computer equipment and storage medium
CN115994541B (en) * 2023-03-22 2023-07-07 金蝶软件(中国)有限公司 Interface semantic data generation method, device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
Tan et al. Neural machine translation: A review of methods, resources, and tools
CN110119765B (en) Keyword extraction method based on Seq2Seq framework
US11113479B2 (en) Utilizing a gated self-attention memory network model for predicting a candidate answer match to a query
CN110210032B (en) Text processing method and device
CN113609284A (en) Method and device for automatically generating text abstract fused with multivariate semantics
CN110326002B (en) Sequence processing using online attention
CN111859978A (en) Emotion text generation method based on deep learning
CN112199956B (en) Entity emotion analysis method based on deep representation learning
Xie et al. Attention-based dense LSTM for speech emotion recognition
CN110991290B (en) Video description method based on semantic guidance and memory mechanism
US11475225B2 (en) Method, system, electronic device and storage medium for clarification question generation
CN111078866A (en) Chinese text abstract generation method based on sequence-to-sequence model
CN113127631A (en) Text summarization method based on multi-head self-attention mechanism and pointer network
CN111581970B (en) Text recognition method, device and storage medium for network context
CN113435211A (en) Text implicit emotion analysis method combined with external knowledge
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN114168754A (en) Relation extraction method based on syntactic dependency and fusion information
US20220383119A1 (en) Granular neural network architecture search over low-level primitives
CN114281982B (en) Book propaganda abstract generation method and system adopting multi-mode fusion technology
CN110913229B (en) RNN-based decoder hidden state determination method, device and storage medium
CN114387537A (en) Video question-answering method based on description text
Morioka et al. Multiscale recurrent neural network based language model.
CN114218928A (en) Abstract text summarization method based on graph knowledge and theme perception
Parmar et al. Abstractive text summarization using artificial intelligence
CN113704466B (en) Text multi-label classification method and device based on iterative network and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination