CN111858931A

CN111858931A - Text generation method based on deep learning

Info

Publication number: CN111858931A
Application number: CN202010652675.XA
Authority: CN
Inventors: 廖盛斌; 余亚斌
Original assignee: Central China Normal University
Current assignee: Central China Normal University
Priority date: 2020-07-08
Filing date: 2020-07-08
Publication date: 2020-10-30
Anticipated expiration: 2040-07-08
Also published as: CN111858931B

Abstract

The invention discloses a text generation method based on deep learning. The method comprises training and testing, and is characterized in that the training comprises the following steps: constructing a training set, wherein the training set comprises a plurality of sample pairs consisting of the preprocessed topics and the corresponding texts; a pre-defined generator, wherein the generator is used for generating a text according to an input topic, pre-training the generator by using the training set, and adding an attention mechanism and a new historical memory information module in the encoding and decoding of the generator; predefining a classifier, and inputting the text output by the generator and the text in the training set into the classifier for countertraining; and performing reinforcement learning training on the generator according to the pre-trained generator and the classifier definition loss function. The invention has better text generation effect.

Description

Text generation method based on deep learning

Technical Field

The invention belongs to the technical field of natural language processing, and particularly relates to a text generation method based on deep learning.

Background

The advent of deep learning has enabled the development of artificial intelligence to move up a new step and quickly have profound effects in both academic and industrial areas. The method based on deep learning has become a mainstream method in the fields of computer vision, natural language processing and the like. The method based on deep learning in the natural processing field has also made great progress, for example, in the fields of machine translation, man-machine conversation, ancient poetry generation and the like, the method based on deep learning completely surpasses or even replaces the traditional machine learning method.

The automatic writing is an important artificial intelligence technology, writing or aided creation is performed by using artificial intelligence, a new creation method and a new approach are provided for human beings, the convenience and the speed of the automatic writing are greatly improved, and the daily writing mode of people is changed to a great extent. However, the former automatic writing is automatic writing based on a template, and although the automatic writing can be rapidly performed, the method has great defects in novelty and diversity and is difficult to meet the requirement of people on innovation.

The classic text generation method based on deep learning is an artificial neural network model based on a Recurrent Neural Network (RNN). Compressing the input information into a vector with a fixed length, and generating a text sentence by sentence through a neural network by using linear or nonlinear transformation. The method has a very obvious defect that the model compresses historical memory information into state vectors with the same length, and each word only considers the historical information transmitted by the previous word, so that the historical information is seriously lost, and the quality of the text generated later is increasingly poor.

Disclosure of Invention

Aiming at least one defect or improvement requirement of the prior art, the invention provides a text generation method based on deep learning, which has better text generation effect.

In order to achieve the above object, the present invention provides a text generation method based on deep learning, which comprises training and testing, wherein the training comprises the following steps:

constructing a training set, wherein the training set comprises a plurality of sample pairs consisting of the preprocessed topics and the corresponding texts;

the generator comprises an encoder and a decoder, wherein the encoder is used for encoding the input topics into word vectors, the decoder is a long-short term memory network using a recurrent neural network, the initial state vectors of the long-short term memory network use randomly initialized vectors, and the input of the long-short term memory network comprises the real output of the last time step, the topic vectors obtained by attention mechanism and the global history memory vectors;

predefining a classifier, and inputting the text output by the generator and the text in the training set into the classifier for countertraining;

and performing reinforcement learning training on the generator according to the pre-trained generator and the classifier definition loss function.

Preferably, the pre-treatment comprises: and performing keyword segmentation on the texts in the sample set, calculating tf-idf scores of all the keywords by using a tf-idf algorithm, and selecting a plurality of keywords with the highest scores as topics of each text.

Preferably, the global history memory vector is obtained according to a history memory matrix, the history memory matrix is composed of vectors with a length of L, the history memory matrix is initialized to 0 at first, the word vectors generated before are dynamically stored in the training process, and the history memory matrix is not updated in the training process of the generator.

Preferably, a gating network is used to obtain the currently required global history memory vector.

Preferably, the classifier comprises a convolutional layer, a pooling layer and a Highway network which are connected in sequence, and an objective function of the classifier uses a cross entropy loss function.

Preferably, the defining a loss function according to the pre-trained generator and the classifier is specifically: and using the expectation based on punishment as an objective function of reinforcement learning training, wherein the punishment function is obtained by jointly calculating according to the classifier and the generator.

Preferably, the hidden state vector s of the decoder_t＝LSTM(s_t-1，[e(y_t-1)；h_t-1；ct_]) In which is the hidden state vector of the decoder t-1 time steps, h_t-1Vector representing memory information of t-1 time steps, c_tIs a context vector of a topic, c_tThe method is obtained according to a multiplicative attention mechanism, and particularly according to the following formula:

g_tj＝v_a ^TC_t-1，jtanh(W_as_t-1+U_ae(τ_j))

α_tj＝soft max(g_tj)

in the above formula, g_tjRepresenting the t-th time step decoder to the j-th topic tau_jAttention weight of (a)_tjIs to g_tjNormalized attention weight, upsilon_a、W_aAnd U_aAll are trainable parameters and are initialized by using standard normal distribution, C is a topic coverage vector, and C is used for coverage vector of jth topic at t-1 time step_t-1，_jAnd (4) showing.

In general, compared with the prior art, the invention has the following beneficial effects:

(1) two attention mechanisms and a new history memory module are added to the coding and decoding structure. One attention mechanism is used for selectively focusing on a specific input feature vector, the new history memory module as an explicit information storage module can learn and characterize global history information, and the other attention mechanism is used for acquiring a global history information vector from the global history information storage module. The global history information is combined with local history information of a long-term memory network (LSTM) to further increase the long-term dependence capacity of the recurrent neural network on language.

(2) In order to further improve the effect, the ideas of reinforcement learning and neural network resistance are combined, and the topic relevance of the text is further increased.

(3) In addition, because topic-based text generation belongs to an open text generation task, the invention adopts a decoding mode based on Temperature sampling (Sample with Temperature) to increase the diversity of generated texts.

Drawings

FIG. 1 is a flowchart illustrating a text generation method based on deep learning according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a generator of an embodiment of the present invention;

FIG. 3 is a schematic diagram of an arbiter according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of reinforcement learning based on strategy gradients according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of an anti-neural network training in accordance with an embodiment of the present invention;

fig. 6 is a schematic diagram of an experimental result of the text generation method based on deep learning according to the embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

As shown in fig. 1, the text generation method based on deep learning according to the embodiment of the present invention includes the following stages.

Stage 1: data preparation

In the data preparation stage, required text data is firstly acquired from a webpage crawler, and special symbols in the data are cleaned to acquire required training data.

Secondly, extracting keywords by using a tf-idf algorithm, which specifically comprises the following steps: performing word segmentation processing on all articles; removing stop words; according to tf-idf_ij＝tf_ij×idf_iThe tf-idf score is calculated. Wherein tf is_ijThe method is used for measuring the frequency of occurrence of the ith word in the jth text and is calculated according to the following formula:

wherein idf_iThe reverse text frequency of the ith word is specifically calculated as follows:

calculated tf-idf_ijRepresenting the tf-idf score of the ith word in the jth article. And (3) sequencing the tf-idf scores of all words of each article in a descending order, taking the first 5 words as article keywords, taking the extracted keywords as topics of the article, then counting the word frequencies of all the topics, and removing the topics with lower frequency (the frequency is less than 100) from the 5 topics in each article.

Randomly dividing the data of the articles and the topic data acquired above, wherein 85% of all the data are used as a training set, 5% are used as a verification set, and the rest 10% are used as a test set;

Selecting a proper dictionary size | V |, and using the segmented text to construct the mapping from the vocabulary index to the word, wherein the specific method is that 4 markers are sequentially added at the beginning of the vocabulary, < PAD > represents a filling marker, < UNK > represents a word not in the vocabulary, < GO > represents a starting marker of the text, and < END > represents an ending marker of the text. For the new vocabulary, the words in the vocabulary are numbered from 0, a mapping from number to word is established, and a mapping from word to number is established.

In one embodiment, the method further comprises the following steps: and selecting a proper text length L, and preprocessing all texts by using the word list constructed above. Adding the above < GO > tag at the beginning of each text, adding < END > tags at the END of each text, and filling with < PAD > until the total length reaches L, directly intercepting the text length to L for the text with the length larger than L, and adding < END > termination tags at the END of the text; and (4) serializing all texts by using the mapping dictionary from the words to the numbers, and converting all the texts with the divided words into a word list number sequence for training and testing.

And (2) stage: pre-training generator model

Constructing a pre-training model, wherein the pre-training model specifically comprises an encoder and a decoder; the method comprises the steps that a coder codes topics into word vectors with proper dimensionality, a decoder uses a long-time and short-time memory network of a recurrent neural network, an initial state vector of the long-time and short-time memory network uses a vector initialized at random, a topic vector of each time step is obtained by using an attention mechanism, and the current topic vector represents topic semantic information included in a current word. The current input comprises a word vector mapped (corresponding) by a real output (group Truth) word of the previous time step, a topic vector obtained by an attention mechanism and a new historical memory vector, and the three vectors are spliced to form the current input. The method comprises the steps that a current input vector and the state of a previous time step are subjected to nonlinear transformation of a long-time memory network (LSTM) to obtain an output vector, the output vector is subjected to one-layer linear transformation to convert the last dimension into the size of a word list, and a softmax function is used for normalization to obtain the occurrence probability of each word in the word list of the current time step or the probability distribution of the current word list. We calculate the cross entropy loss function as the final objective function based on this probability distribution and the 0, 1 distribution of the training set labels, and then continuously adjust the parameters of the model according to the batch stochastic gradient descent algorithm and the appropriate learning rate until the model converges. The final probability component is used to decode to the final predicted word.

The pre-training generator model architecture is shown in FIG. 2, and specific inputs are set as n topic sets { τ [ ]₁，τ₂，...，τ_nN is the one with the preset maximum input topicAnd (4) counting. Mapping the input topic by a dictionary to obtain the unique id of each word, and then retrieving the e (tau) from the word vector matrix_j) As a topic τ_jWord vector of (2), hidden state vector s of decoder_t＝LSTM(s_t-1，[e(y_t-1)；h_t-1；c_t]) Wherein s is_t-1Is the hidden state vector for the decoder t-1 time steps. h is_t-1Vector representing memory information of t-1 time steps, c_tIs a context vector of a topic, c_tThe method is obtained according to a multiplicative attention mechanism, and particularly according to the following formula:

g_tj＝υ_a ^TC_t-1，jtanh(W_as_t-1+U_ae(τ_j))

α_tj＝soft max(g_tj)

in the above formula, g_tjRepresenting the t-th time step decoder to the j-th topic tau_jAttention weight of (a)_ijIs to g_tjNormalized attention weight, upsilon_a、W_aAnd U_aAre trainable parameters and are initialized using a standard normal distribution.

Meanwhile, in order to avoid over-expressing some topics and neglecting other topics during generation, a topic coverage vector C is also used. C is [0, 0, 0 ]]Initialization is performed to indicate that no topics are expressed at the beginning. C is also dynamically updated, and for topics which are expressed before, the attention weight of the topics is reduced, so that the chances of the topics being expressed next are reduced; and for topics which are not expressed, the opportunity that the topics are expressed next is increased by increasing the attention weight of the topics. C is used for coverage vector of jth topic at tth time step _t，jIndicating that the update is performed according to the following equation:

wherein phi_jObtained according to the following formula:

φ_j＝N·σ(U_f[e(τ₁)，e(τ₂)，...，e(τ_k)])

in the above formula, N is the number of input topics, U_fσ is a sigmoid activation function for trainable parameters.

Wherein h is_tObtained according to a history memory module, which mainly comprises a history memory matrix HM^T ^×ET represents the maximum length of the article, E represents the dimension of a word vector, the history memory matrix is initialized by using an all-0 matrix, the history memory matrix represents that no memory information is stored at the beginning, the word vectors generated at each time step before the decoder are dynamically stored in the training process, the word vectors are filled into the history memory matrix, and the history memory matrix is not subjected to parameter updating in the training process and is only used as a container for storing the word vectors.

Since the long-time memory network encodes the historical information into 2 vectors, a certain loss is caused, and the historical memory matrix is equivalent to a historical information enhancement module and is used for making up the loss of the historical information.

With the generation of new words, the history memory matrix is calculated as follows:

HM^T×E(t)＝e(y_t-1)

to select the history, we use a gating network that uses the decoder's hidden state vector s _tAnd two trainable parameters W_hAnd b_hAs input, processing is performed using the tanh activation function, calculated as follows:

υ_t＝tanh(W_hs_t+b_h)HM[t，；]

vector upsilon obtained by upper gating network_tPerforming soft max normalization as the weight of each word in the history memory module, and performing soft max normalization according to the weightThe weight and history memory module selects the needed history information vector h_tThe specific calculation is obtained according to the following formula:

h_t＝soft max(υ_t)HM^T×E

the distribution of the model at the t time step is finally obtained according to the following formula:

p(y_t|y_1：t-1，τ_1：k)＝soft max(W_os_t)

wherein W_oAre learnable parameters and are initialized using a standard normal distribution.

In the training phase, the model is trained by using a cross entropy loss function as an objective function, and the formula is as follows:

where q (t) is the distribution of the real output, using one hot (one hot) encoding, and p (t) is the distribution predicted by the model.

In the prediction phase, the model predicts the word y at time t_tSampling was performed based on the following distribution:

y_t～p(y_t|y_1：t-1，τ_1：k)

and (3) stage: training multiple-question classifier

And constructing a multi-classification discriminator. As shown in fig. 2, the multi-classifier specifically includes a convolutional layer followed by a max-pooling layer, and a high speed Network (high-way Network), and the objective function uses a cross-entropy loss function. The data of the multi-classifier is from the training set and the data generated by the previous pre-training generator, the label dimension is T +1, T represents the number of labels contained in the data set, and the other label represents whether the current text is the data of the training set or the generated data of the pre-training generator.

The structure of the discriminator is shown in fig. 3, the discriminator is a text multi-classifier with n +1 targets, wherein n targets are n topics to which the text belongs, and the other is used for judging whether the text is a moduleThe pattern is generated or an actual training sample. The input of the classifier is composed of real training data and data generated by a pre-training generator, and the input text sequence y₁，y₂，...，y_TFeature vector obtained by a two-dimensional convolution

Representing concatenation of vectors, text feature sequence vectors pi_1：T∈R^T×E，ω∈R^l×EThe length of the convolution kernel is E and the dimension of the word vector is consistent, the width of the convolution kernel is l, the word convolution and the activation function are mapped to obtain the feature vector

And obtaining the final output class distribution D by using the maximum pooling and a Highway network_φ(x_j|y_1：T). The goal is a cross entropy loss function, which is formulated as follows:

wherein x is_jIs to input a text sequence y_1：TThe topic label is coded by using one hot (one hot), the total number of the topic labels is n +1, the last label represents whether the text is real data, and the model updates parameters by using an Adam gradient descent algorithm.

The generator and the multi-classifier are initialized, and the word vector is randomly initialized by using a standard normal distribution. Other weights are initialized with a normal with a mean of 0 and a variance of 0.01. Initializing a proper batch size, namely the number of data pieces of the disposable feeding model, and initializing the learning rate to be 0.01;

Pre-training a generator and a multi-classifier, randomly disordering training data in each round, pre-training the generator, generating data with the same quantity as a training set by using the pre-training generator, and disordering the multi-classified data set in each round; secondly, pre-training the multiple classifiers, and updating the weight of each layer of network by using a random gradient descent algorithm until the network converges.

And (4) stage: construction reinforcement learning generator

And constructing a reinforcement learning module, wherein the reinforcement learning module is composed of a new generator and the above multi-classifiers. And modifying a loss function by the new generator on the basis of the pre-training generator, using an expectation based on punishment as an objective function of the reinforcement learning model, and calculating the punishment function according to the multi-topic classifier and the new generator.

Since the goal of training with Maximum Likelihood Estimation (MLE) is to find the solution with the maximum probability at each step, but in reality, words with low probability appear in the text, which results in the situation that the model is inconsistent with the actual intention of generation, while reinforcement learning does not require that each step is the optimal solution but the solution with the maximum accumulated reports, low-probability words are allowed to appear in the middle, local optimal solution is sought by maximum likelihood, global optimal solution is sought by reinforcement learning, and thus reinforcement learning is more likely to find the grammar generation rule which accords with human language cognition. As shown in fig. 4, in the text generation, the agent of reinforcement learning can be regarded as a generator, the graph is represented by G, the environment of reinforcement learning is represented by a multi-classifier D, the state of the agent is the set of the words (token) generated before, and the graph is represented by solid black dots. The action of reinforcement learning is the word (token) selectable by the next agent, and a method of strategy gradient is introduced, a _tRepresenting the tokens predicted by the t-th time step model, and the strategy pi represents all tokens with which the text is generated by us. Then reinforcement learning is to determine this strategy. Then we parameterize the strategy pi, P_θ(as) where s ═ a ═ s₁，a₂，...，a_nUnder the condition, the probability that the next token selected by the model agent is a. Here our strategy is G_θ(y_t+1|y_1：t) The loss function for deep reinforcement learning is as follows:

the Penalty factor Penalty is calculated according to whether the current sequence generates the last word or not according to the following two conditions:

as shown in fig. 4, the reward (reward) of the arbiter is not used directly as feedback for reinforcement learning, but the penalty is used as feedback to the agent from the environment, so our strategy should minimize the accumulated penalty. Since we need to obtain the current accumulated expected Penalty every action, we need to use Monte Carlo (MenterCarlo) to sample the remaining T-T tokens and calculate the current instantaneous accumulated Penalty (Penalty) with the discriminator when generating the T word. The addition of reinforcement learning can enable the generated text of us to have more topic relevance.

And (5) stage: using antagonistic neural network training

As shown in fig. 5, the antagonistic neural network is composed of a generator and a discriminator. The reinforcement learning generator described in the generator experiment stage 5, during the counterstudy, on one hand, the generator generates more training data and texts with topic relevance through learning evolution, and on the other hand, the discriminator identifies the texts generated by the generator through learning evolution and guides the evolution of the generator.

The counterlearning is added because when the generator is stronger by the prior reinforcement learning, the capability of the discriminator is weakened at this time, once the discrimination capability of the discriminator is weakened, the reward calculation of the reinforcement learning is biased, and because the existence of the sampling variance can make the reinforcement learning training unstable, after the parameters of the reinforcement learning are updated, a log-likelihood (MLE) objective function is added to correct the reinforcement learning generator, so as to slow down the fluctuation in the training process.

The training speed of the anti-neural network is slow and convergence is difficult because the anti-neural network needs to train the arbiter and the generator simultaneously in the training process. To solve these problems, on the one hand, the multi-topic classifier and the pre-training generator need to be pre-trained sufficiently before training against the neural network, which also helps the convergence of the model. On the other hand, a smaller training period (1-3 training periods) is selected during the counterlearning training, and an excessively large training period wastes computing resources and may cause overfitting.

And 6: model selection and testing

Since the task itself belongs to the open text generation, the use of greedy decoding or beam search (beamsearch) decoding makes the generated text too single and the duplication phenomenon is severe. Therefore, the embodiment of the invention uses a sampling decoding mode to increase the diversity of the generated text. Through sampling-based decoding, the embodiment of the invention can generate more diversified texts. And the sampling method can effectively avoid the repeated phenomenon of words. In addition, the embodiment of the invention uses the decoding based on sampling in the training and testing, and the problem of exposure bias (exposing bias) caused by the inconsistency of the decoding methods of the training and testing is relieved to a certain extent.

Compared with a plurality of models on the disclosed data set, experiments prove that the text generation method provided by the embodiment of the invention can generate texts with more topic relevance and is smoother and smoother. In the decoding process, the diversity of the generated text is increased and the occurrence of the repeated phenomenon of the text generation is reduced by using a decoding mode based on distributed sampling.

To verify the effectiveness of the embodiments of the present invention, experimental verification was performed on the disclosed data set:

The experiment adopts the zhihu data set disclosed by Haugha in 2018, and on the aspect of the BLEU score of the text automatic evaluation index, the model of the embodiment of the invention is improved by 37% compared with a baseline model and is improved by 6% compared with the current best model. In terms of manual evaluation, the model-generated short text score of the embodiment of the invention also achieves the best result. The text generated by the method of the embodiment of the invention is shown in fig. 6.

It must be noted that in any of the above embodiments, the methods are not necessarily executed in order of sequence number, and as long as it cannot be assumed from the execution logic that they are necessarily executed in a certain order, it means that they can be executed in any other possible order.

It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A text generation method based on deep learning comprises training and testing, and is characterized in that the training comprises the following steps:

2. The method of claim 1, wherein the preprocessing comprises: and performing keyword segmentation on the texts in the sample set, calculating tf-idf scores of all the keywords by using a tf-idf algorithm, and selecting a plurality of keywords with the highest scores as topics of each text.

3. The method as claimed in claim 1, wherein the global history memory vector is obtained from a history memory matrix, the history memory matrix is composed of vectors with length L, the history memory matrix is initially initialized to 0, the word vectors generated before are dynamically stored in the training process, and the history memory matrix is not updated during the training process of the generator.

4. The method of claim 3, wherein a gating network is used to obtain the global history memory vector needed currently.

5. The deep learning-based text generation method of claim 1, wherein the classifier comprises a convolutional layer, a pooling layer and a Highway network which are connected in sequence, and an objective function of the classifier uses a cross entropy loss function.

6. The method for generating text based on deep learning of claim 1, wherein the defining a loss function according to the pre-trained generator and the classifier is specifically: and using the expectation based on punishment as an objective function of reinforcement learning training, wherein the punishment function is obtained by jointly calculating according to the classifier and the generator.

7. The text generation method based on deep learning of claim 1,

hidden state vector s of the decoder_t＝LSTM(s_t-1，[e(y_t-1)；h_t-1；c_t]) In which is t-1 times of the decoderHidden state vector of step, h_t-1Vector representing memory information of t-1 time steps, c_tIs a context vector of a topic, c_tThe method is obtained according to a multiplicative attention mechanism, and particularly according to the following formula:

g_tj＝v_a ^TC_t-1，jtanh(W_as_t-1+U_ae(τ_j))

α_tj＝softmax(g_tj)

in the above formula, g_tjRepresenting the t-th time step decoder to the j-th topic tau_jAttention weight of (a)_tjIs to g_tjNormalized attention weight, v_a、W_aAnd U_aAll are trainable parameters and are initialized by using standard normal distribution, C is a topic coverage vector, and C is used for coverage vector of jth topic at t-1 time step_t-1，jAnd (4) showing.