CN111858931B - A text generation method based on deep learning - Google Patents

A text generation method based on deep learning

Info

Publication number
CN111858931B
CN111858931B (application CN202010652675.XA)
Authority
CN
China
Prior art keywords
vector
training
text
generator
topic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010652675.XA
Other languages
Chinese (zh)
Other versions
CN111858931A (en)
Inventor
廖盛斌
余亚斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central China Normal University
Original Assignee
Central China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central China Normal University filed Critical Central China Normal University
Priority to CN202010652675.XA priority Critical patent/CN111858931B/en
Publication of CN111858931A publication Critical patent/CN111858931A/en
Application granted granted Critical
Publication of CN111858931B publication Critical patent/CN111858931B/en

Classifications

    • G06F16/355 Creation or modification of classes or clusters (clustering/classification of unstructured textual data)
    • G06F40/216 Parsing using statistical methods (natural language analysis)
    • G06F40/242 Dictionaries (lexical tools)
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates (recognition of textual entities)
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking (recognition of textual entities)
    • G06N3/044 Recurrent networks, e.g. Hopfield networks (neural network architectures)
    • G06N3/045 Combinations of networks (neural network architectures)
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs (neural network architectures)
    • G06N3/08 Learning methods (neural networks)


Abstract

The invention discloses a text generation method based on deep learning. The method comprises training and testing. The training comprises the steps of: constructing a training set that contains multiple sample pairs, each composed of a preprocessed topic and its corresponding text; predefining a generator that generates text from an input topic, pre-training the generator on the training set, and adding an attention mechanism and a new history memory module to the generator's encoding and decoding; predefining a classifier, and feeding the text output by the generator and the text in the training set into the classifier for adversarial training; and performing reinforcement-learning training on the generator with a loss function defined from the pre-trained generator and the classifier. The invention achieves a better text generation effect.

Description

A text generation method based on deep learning

Technical Field

The invention belongs to the technical field of natural language processing and, more particularly, relates to a text generation method based on deep learning.

Background Art

The emergence of deep learning has taken the development of artificial intelligence to a new level and has quickly had a profound impact on academia and industry. Deep-learning-based methods have become mainstream in fields such as computer vision and natural language processing. In natural language processing in particular they have made great progress: in machine translation, human-machine dialogue, classical poetry generation, and other areas, deep-learning-based methods have surpassed or even replaced traditional machine learning methods.

Automatic writing is an important artificial intelligence technology. Using artificial intelligence to write, or to assist creation, provides new creative methods and avenues for human beings; it greatly improves the convenience and speed of writing and has substantially changed how people write every day. Previous automatic writing systems, however, were all template-based: although they can write quickly, they have serious shortcomings in novelty and diversity and can hardly meet the demand for creativity.

Classic deep-learning text generation methods are artificial neural network models based on recurrent neural networks (RNNs). They compress the input information into a fixed-length vector and then generate text word by word through linear or nonlinear transformations of the network state. These methods have an obvious drawback: the model compresses all historical information into state vectors of the same fixed length, and each word only sees the history handed over by the previous word, so historical information is lost severely and the quality of the later generated text degrades more and more.

Summary of the Invention

In view of at least one defect of, or need for improvement in, the prior art, the present invention provides a text generation method based on deep learning that achieves a better text generation effect.

To achieve the above object, the present invention provides a text generation method based on deep learning, comprising training and testing. The training comprises the steps of:

constructing a training set that contains multiple sample pairs, each composed of a preprocessed topic and its corresponding text;

predefining a generator that generates text from an input topic and pre-training it on the training set, wherein the generator comprises an encoder and a decoder, the encoder encodes the input topics into word vectors, and the decoder is a long short-term memory (LSTM) recurrent network whose initial state vector is randomly initialized and whose input at each time step comprises the ground-truth output of the previous time step, the topic vector obtained by an attention mechanism, and a global history memory vector;

predefining a classifier, and feeding the text output by the generator and the text in the training set into the classifier for adversarial training;

performing reinforcement-learning training on the generator with a loss function defined from the pre-trained generator and the classifier.

Preferably, the preprocessing comprises: segmenting the texts in the sample set into keywords, computing the tf-idf scores of all keywords with the tf-idf algorithm, and selecting the several highest-scoring keywords as the topics of each text.

Preferably, the global history memory vector is obtained from a history memory matrix composed of vectors of length L; the matrix is initialized to all zeros and dynamically stores the previously generated word vectors during training, and it receives no parameter updates while the generator is trained.

Preferably, a gating network is used to obtain the currently required global history memory vector.

Preferably, the classifier comprises a convolutional layer, a pooling layer, and a Highway network connected in sequence, and the objective function of the classifier uses the cross-entropy loss function.

Preferably, defining the loss function from the pre-trained generator and the classifier specifically comprises: using a penalty-based expectation as the objective function of reinforcement-learning training, the penalty function being computed jointly from the classifier and the generator.

Preferably, the decoder's hidden state vector is s_t = LSTM(s_{t-1}, [e(y_{t-1}); h_{t-1}; c_t]), where s_{t-1} is the decoder's hidden state vector at time step t-1, h_{t-1} is the vector representing the memory information at time step t-1, and c_t is the topic context vector, obtained by the multiplicative attention mechanism according to the following formulas:

g_{tj} = v_a^T C_{t-1,j} tanh(W_a s_{t-1} + U_a e(τ_j))

α_{tj} = softmax(g_{tj})

c_t = Σ_{j=1}^{n} α_{tj} e(τ_j)

In the above formulas, g_{tj} is the attention weight of the decoder at time step t on the j-th topic τ_j, α_{tj} is the attention weight obtained by normalizing g_{tj}, v_a, W_a, and U_a are trainable parameters initialized from a standard normal distribution, C is the topic coverage vector, and C_{t-1,j} denotes the coverage vector of the j-th topic at time step t-1.

In general, compared with the prior art, the present invention has the following beneficial effects:

(1) Two attention mechanisms and a new history memory module are added to the encoder-decoder architecture. One attention mechanism selectively attends to particular input feature vectors; the new history memory module, as an explicit information store, learns and represents global historical information; the other attention mechanism retrieves a global history vector from that store. This global history, combined with the local history of the long short-term memory (LSTM) network, further strengthens the recurrent network's ability to model long-range dependencies in language.

(2) To further improve the results, the ideas of reinforcement learning and adversarial neural networks are combined, which further increases the topic relevance of the generated text.

(3) In addition, since topic-based text generation is an open-ended generation task, the invention adopts a sample-with-temperature decoding scheme to increase the diversity of the generated text.

Brief Description of the Drawings

Fig. 1 is a schematic flowchart of the deep-learning-based text generation method according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of the generator according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of the discriminator according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of policy-gradient-based reinforcement learning according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of adversarial neural network training according to an embodiment of the present invention;

Fig. 6 is a schematic diagram of experimental results of the deep-learning-based text generation method according to an embodiment of the present invention.

Detailed Description of the Embodiments

To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it. Moreover, the technical features involved in the embodiments described below may be combined with one another as long as they do not conflict.

As shown in Fig. 1, the deep-learning-based text generation method of an embodiment of the present invention comprises the following stages.

Stage 1: Data Preparation

In the data preparation stage, the required text data are first obtained with a web crawler, and the special symbols in the data are cleaned to obtain the required training data.

Next, keywords are extracted with the tf-idf algorithm, which specifically comprises: segmenting all articles into words; removing stop words; and computing tf-idf scores according to tf-idf_{ij} = tf_{ij} × idf_i, where tf_{ij} measures the frequency of the i-th word in the j-th text and is calculated according to the following formula:

tf_{ij} = n_{ij} / Σ_k n_{kj}, where n_{ij} is the number of occurrences of the i-th word in the j-th text

where idf_i is the inverse document frequency of the i-th word, calculated as follows:

idf_i = log(|D| / (1 + |{j : the i-th word appears in text d_j}|)), where |D| is the total number of texts

The computed tf-idf_{ij} is the tf-idf score of the i-th word in the j-th article. For each article, the tf-idf scores of all its words are sorted in descending order, the top 5 words are taken as the article's keywords, and the extracted keywords serve as the article's topics; the word frequencies of all topics are then counted, and low-frequency topics (frequency below 100) are removed from the 5 topics of each article.
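As an illustration of this step, the following is a minimal Python sketch of the tf-idf keyword extraction described above; the tokenization, the stop-word list, and the +1 smoothing in the idf denominator are assumptions, not details fixed by the patent:

    import math
    from collections import Counter

    def extract_topics(docs, stopwords=frozenset(), k=5):
        """Illustrative sketch: pick the top-k tf-idf keywords of each
        segmented text as its topics (docs is a list of word lists)."""
        docs = [[w for w in doc if w not in stopwords] for doc in docs]
        df = Counter(w for doc in docs for w in set(doc))   # document frequency
        n_docs = len(docs)
        topics = []
        for doc in docs:
            tf = Counter(doc)
            total = len(doc) or 1
            # tf-idf_ij = tf_ij * idf_i, with +1 smoothing in the denominator
            scores = {w: (c / total) * math.log(n_docs / (1 + df[w]))
                      for w, c in tf.items()}
            topics.append(sorted(scores, key=scores.get, reverse=True)[:k])
        return topics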

The article and topic data obtained above are randomly split: 85% of all data serve as the training set, 5% as the validation set, and the remaining 10% as the test set.

A suitable vocabulary size |V| is selected, and an index-to-word mapping is built from the segmented texts. Specifically, four special tokens are added at the beginning of the vocabulary: <PAD> is the padding token, <UNK> represents words not in the vocabulary, <GO> marks the beginning of a text, and <END> marks the end of a text. The words in the new vocabulary are numbered from 0, and a mapping from number to word as well as a reverse mapping from word to number are built.

In one of the embodiments, the method further comprises: selecting a suitable text length L and preprocessing all texts with the vocabulary constructed above. The <GO> tag is added at the beginning of every text; a text shorter than L gets the <END> tag at its end and is padded with <PAD> until the total length reaches L, while a text longer than L is truncated to length L with the <END> terminator appended at its end. All texts are then serialized with the word-to-number mapping dictionary above, converting every segmented text into a sequence of vocabulary indices for training and testing.
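For illustration, a minimal sketch of the vocabulary construction and length normalization described above; the function names and the frequency-based cutoff for the vocabulary are illustrative assumptions:

    from collections import Counter

    SPECIALS = ["<PAD>", "<UNK>", "<GO>", "<END>"]

    def build_vocab(docs, vocab_size):
        # most frequent words fill the vocabulary after the four special tokens
        counts = Counter(w for doc in docs for w in doc)
        words = SPECIALS + [w for w, _ in counts.most_common(vocab_size - len(SPECIALS))]
        word2id = {w: i for i, w in enumerate(words)}
        id2word = {i: w for w, i in word2id.items()}
        return word2id, id2word

    def serialize(doc, word2id, L):
        """Map a segmented text to a fixed-length id sequence of length L."""
        ids = [word2id["<GO>"]] + [word2id.get(w, word2id["<UNK>"]) for w in doc]
        ids = ids[:L - 1] + [word2id["<END>"]]        # truncate and terminate
        ids += [word2id["<PAD>"]] * (L - len(ids))    # pad short texts up to L
        return ids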

Stage 2: Pre-training the Generator Model

Construction of the pre-trained model: the pre-trained model specifically comprises an encoder and a decoder. The encoder encodes the topics into word vectors of suitable dimension; the decoder is a long short-term memory (LSTM) recurrent network whose initial state vector is randomly initialized. The topic vector of each time step is obtained with an attention mechanism, and the current topic vector represents the topic semantics carried by the current word. The current input is the concatenation of three vectors: the word vector corresponding to the ground-truth output of the previous time step, the topic vector obtained by the attention mechanism, and a new history memory vector. The current input vector and the state of the previous time step pass through the nonlinear transformation of the LSTM to yield an output vector; a linear layer maps its last dimension to the vocabulary size, and softmax normalization gives the probability of each vocabulary word at the current time step, that is, the current distribution over the vocabulary. The cross-entropy between this distribution and the 0/1 distribution of the training-set labels is the final objective function, and the model parameters are adjusted with mini-batch stochastic gradient descent at a suitable learning rate until the model converges. The final probabilities are then decoded to obtain the predicted word.

The architecture of the pre-trained generator is shown in Fig. 2. The input is set as a collection of n topics {τ_1, τ_2, ..., τ_n}, where n is the preset maximum number of input topics. Each input topic is mapped through the dictionary to its unique word id, and e(τ_j), the word vector of topic τ_j, is retrieved from the word-vector matrix. The decoder's hidden state vector is s_t = LSTM(s_{t-1}, [e(y_{t-1}); h_{t-1}; c_t]), where s_{t-1} is the decoder's hidden state vector at time step t-1, h_{t-1} is the vector of memory information at time step t-1, and c_t is the topic context vector, obtained by the multiplicative attention mechanism according to the following formulas:

g_{tj} = v_a^T C_{t-1,j} tanh(W_a s_{t-1} + U_a e(τ_j))

α_{tj} = softmax(g_{tj})

c_t = Σ_{j=1}^{n} α_{tj} e(τ_j)

In the above formulas, g_{tj} is the attention weight of the decoder at time step t on the j-th topic τ_j, α_{tj} is the attention weight obtained by normalizing g_{tj}, and v_a, W_a, and U_a are trainable parameters initialized from a standard normal distribution.

Meanwhile, to avoid some topics being expressed over-repeatedly while others are ignored during generation, a topic coverage vector C is also used. C is initialized with [0, 0, 0, 0, 0], indicating that no topic has been expressed at the start, and C is updated dynamically: for topics that have already been expressed, their attention weights are reduced so that they are less likely to be expressed next, while for topics that have not yet been expressed, their attention weights are increased so that they are more likely to be expressed next. The coverage vector of the j-th topic at time step t is denoted C_{t,j} and is updated according to the following formula:

C_{t,j} = C_{t-1,j} + (1/φ_j) α_{tj}

where φ_j is obtained from the following formula:

φ_j = N · σ(U_f [e(τ_1), e(τ_2), ..., e(τ_k)])

In the above formula, N is the number of input topics, U_f is a trainable parameter, and σ is the sigmoid activation function.
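As an illustration, the following PyTorch sketch performs one step of the multiplicative topic attention together with the coverage update; the tensor shapes and the per-topic form of φ_j are assumptions of this sketch rather than details fixed by the patent:

    import torch
    import torch.nn as nn

    class TopicAttention(nn.Module):
        """Illustrative sketch of the multiplicative attention with coverage."""

        def __init__(self, hidden_dim, embed_dim):
            super().__init__()
            self.W_a = nn.Linear(hidden_dim, hidden_dim, bias=False)
            self.U_a = nn.Linear(embed_dim, hidden_dim, bias=False)
            self.v_a = nn.Parameter(torch.randn(hidden_dim))   # standard normal init
            self.U_f = nn.Linear(embed_dim, 1, bias=False)

        def forward(self, s_prev, topic_emb, coverage):
            # s_prev: (hidden,)  topic_emb: (n, embed)  coverage: (n,)
            # g_tj = v_a^T C_{t-1,j} tanh(W_a s_{t-1} + U_a e(tau_j))
            g = (coverage.unsqueeze(-1)
                 * torch.tanh(self.W_a(s_prev) + self.U_a(topic_emb))) @ self.v_a
            alpha = torch.softmax(g, dim=0)        # normalized attention weights
            c_t = alpha @ topic_emb                # topic context vector c_t
            # phi_j = N * sigmoid(U_f e(tau_j)); C_t,j = C_{t-1,j} + alpha_tj / phi_j
            phi = topic_emb.size(0) * torch.sigmoid(self.U_f(topic_emb)).squeeze(-1)
            return c_t, alpha, coverage + alpha / phi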

Here h_t is obtained from a history memory module, which mainly comprises a history memory matrix HM^{T×E}, where T is the maximum article length and E is the word-vector dimension. This matrix is initialized to all zeros, meaning that no memory is stored at the start. During training, it dynamically stores the word vector generated by the decoder at each earlier time step, and these word vectors are filled into the matrix. During training the matrix receives no parameter updates; it serves only as a container for word vectors.

Because the long short-term memory network encodes the history into two vectors, some information is inevitably lost; the history memory matrix therefore acts as a history-information enhancement module that compensates for this loss.

As new words are generated, the history memory matrix is filled as follows:

HM^{T×E}(t) = e(y_{t-1})

To select from the historical information, a gating network is used. It takes the decoder's hidden state vector s_t and two trainable parameters W_h and b_h as input and processes them with the tanh activation function, computed as follows:

v_t = tanh(W_h s_t + b_h) HM[t,:]

The vector v_t obtained from the gating network is softmax-normalized as the weight of each word in the history memory module; with this weight, the required history information vector h_t is selected from the module, computed according to the following formula:

h_t = softmax(v_t) HM^{T×E}
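The history memory module might be sketched as follows; registering HM as a buffer mirrors the statement that the matrix receives no parameter updates, while reading over all rows at once is an assumption of this sketch:

    import torch
    import torch.nn as nn

    class HistoryMemory(nn.Module):
        """Illustrative sketch of the history memory matrix with gated read-out."""

        def __init__(self, max_len, embed_dim, hidden_dim):
            super().__init__()
            self.W_h = nn.Linear(hidden_dim, embed_dim)        # carries W_h and b_h
            # all-zero at the start; a buffer, so the optimizer never updates it
            self.register_buffer("HM", torch.zeros(max_len, embed_dim))

        def write(self, t, prev_word_emb):
            # HM(t) = e(y_{t-1}): store the word vector generated at the previous step
            self.HM[t] = prev_word_emb.detach()

        def read(self, s_t):
            # v_t = tanh(W_h s_t + b_h) . HM: one gate value per stored word
            v = self.HM @ torch.tanh(self.W_h(s_t))            # (max_len,)
            # h_t = softmax(v_t) HM: weighted mix of all stored word vectors
            return torch.softmax(v, dim=0) @ self.HM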

The model's final distribution at time step t is obtained according to the following formula:

p(y_t | y_{1:t-1}, τ_{1:k}) = softmax(W_o s_t)

where W_o is a learnable parameter initialized from a standard normal distribution.
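For illustration, one decoding step can be assembled as in the following sketch; the LSTMCell interface and the assumption that the history vector and the topic context share the word-vector dimension E are choices of this sketch, not specified by the patent:

    import torch
    import torch.nn as nn

    class DecoderStep(nn.Module):
        """One decoding step: s_t = LSTM(s_{t-1}, [e(y_{t-1}); h_{t-1}; c_t])."""

        def __init__(self, embed_dim, hidden_dim, vocab_size):
            super().__init__()
            # the input concatenates three embed-dim vectors: word, history, topic context
            self.cell = nn.LSTMCell(3 * embed_dim, hidden_dim)
            self.W_o = nn.Linear(hidden_dim, vocab_size)       # output projection W_o

        def forward(self, e_prev, h_prev, c_t, state):
            # e_prev, h_prev, c_t: (batch, embed); state: pair of (batch, hidden)
            x = torch.cat([e_prev, h_prev, c_t], dim=-1)
            s_t, mem = self.cell(x, state)
            p_t = torch.softmax(self.W_o(s_t), dim=-1)         # p(y_t | y_{1:t-1}, topics)
            return p_t, (s_t, mem)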

In the training stage, the model is trained with the cross-entropy loss function as the objective, given by the following formula:

Loss = -Σ_t q(t) log p(t)

where q(t) is the distribution of the true output, represented with one-hot encoding, and p(t) is the distribution predicted by the model.

In the prediction stage, the model samples the predicted word y_t at time t from the following distribution:

y_t ~ p(y_t | y_{1:t-1}, τ_{1:k})

Stage 3: Training the Multi-Topic Classifier

A multi-class discriminator is built. As shown in Fig. 3, this multi-classifier specifically comprises a convolutional layer, followed by a max-pooling layer and a Highway network, and its objective function uses the cross-entropy loss. The data for the multi-classifier come from the training set and from the data generated by the pre-trained generator. The label dimension is T+1, where T is the number of labels contained in the data set and the additional label indicates whether the current text comes from the training set or was generated by the pre-trained generator.

The architecture of the discriminator is shown in Fig. 3. The discriminator is a multi-class text classifier with n+1 targets: n targets are the n topics the text may belong to, and the remaining one indicates whether the text was generated by the model or is an actual training sample. The classifier's input consists of two parts, the real training data and the data generated by the pre-trained generator. The input text sequence y_1, y_2, ..., y_T is embedded and concatenated into the feature-sequence matrix π_{1:T} = e(y_1) ⊕ e(y_2) ⊕ ... ⊕ e(y_T) ∈ R^{T×E}, where ⊕ denotes vector concatenation. A two-dimensional convolution with kernel ω ∈ R^{l×E}, whose length E matches the word-vector dimension and whose width is l, followed by an activation-function mapping, produces the feature vectors c_i = ρ(ω ⊗ π_{i:i+l-1} + b). Max pooling and a Highway network then give the final output class distribution D_φ(x_j | y_{1:T}). The objective is the cross-entropy loss function, given by the following formula:

J_D(φ) = -Σ_{j=1}^{n+1} x_j log D_φ(x_j | y_{1:T})

where x_j is a topic label of the input text sequence y_{1:T}, one-hot encoded; there are n+1 labels in total, the last indicating whether the text is real data. The model updates its parameters with the Adam gradient-descent algorithm.
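For illustration, a minimal PyTorch sketch of the multi-topic discriminator, with a single convolution kernel, max-over-time pooling, one Highway layer, and an (n+1)-way output; real implementations typically use several kernel widths, so the single-kernel form here is a simplification:

    import torch
    import torch.nn as nn

    class Discriminator(nn.Module):
        """Illustrative sketch: convolution -> max-over-time pooling ->
        Highway layer -> (n+1)-way class distribution."""

        def __init__(self, vocab_size, embed_dim, n_filters, kernel_width, n_topics):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            # kernel of size (l, E) slides along the word dimension only
            self.conv = nn.Conv2d(1, n_filters, (kernel_width, embed_dim))
            self.gate = nn.Linear(n_filters, n_filters)    # Highway transform gate
            self.lin = nn.Linear(n_filters, n_filters)
            self.out = nn.Linear(n_filters, n_topics + 1)  # n topics + real/fake

        def forward(self, y):                              # y: (batch, T) token ids
            pi = self.embed(y).unsqueeze(1)                # (batch, 1, T, E)
            c = torch.relu(self.conv(pi)).squeeze(3)       # (batch, F, T-l+1)
            c = torch.max(c, dim=2).values                 # max-over-time pooling
            g = torch.sigmoid(self.gate(c))                # Highway: g*H(c)+(1-g)*c
            c = g * torch.relu(self.lin(c)) + (1 - g) * c
            return torch.log_softmax(self.out(c), dim=-1)  # class log-probabilities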

The generator and the multi-classifier are initialized: word vectors are randomly initialized from a standard normal distribution, and the other weights are initialized from a normal distribution with mean 0 and variance 0.01. A suitable batch size batch_size, i.e. the number of data items fed to the model at once, is initialized, and the learning rate is initialized to 0.01.

The generator and the multi-classifier are pre-trained, with the training data randomly shuffled in each epoch. The generator is pre-trained first; the pre-trained generator is then used to generate the same amount of data as the training set, and the multi-class data set is likewise shuffled in each epoch. The multi-classifier is then pre-trained, and the weights of every network layer are updated with stochastic gradient descent until the network converges.

Stage 4: Building the Reinforcement-Learning Generator

A reinforcement-learning module is built, consisting of a new generator and the multi-classifier above. The new generator modifies the loss function of the pre-trained generator, using a penalty-based expectation as the objective function of the reinforcement-learning model; the penalty function is computed jointly from the multi-topic classifier and the new generator.

Training with maximum likelihood estimation (MLE) seeks the most probable word at every step, yet low-probability words do occur in real text, so the model's objective can diverge from the actual intent of generation. Reinforcement learning, by contrast, does not require every step to be optimal but seeks the solution with the greatest cumulative return, allowing low-probability words along the way: maximum likelihood pursues local optima, whereas reinforcement learning pursues the global optimum, so reinforcement learning is more likely to find generation rules that match human language cognition. As shown in Fig. 4, in text generation the reinforcement-learning agent can be viewed as the generator, denoted G in the figure, and the reinforcement-learning environment is represented by a multi-classifier D; the agent's state is the set of previously generated tokens, drawn as solid black dots in the figure. A reinforcement-learning action is the token the agent may choose next. The policy-gradient method is introduced: a_t denotes the token predicted by the model at time step t, and the policy π represents the strategy by which all tokens of the text are generated. Reinforcement learning must determine this policy π. The policy π is then parameterized: P_θ(a|s) is the probability that, conditioned on s = {a_1, a_2, ..., a_n}, the next token selected by the model agent is a. Here the policy is G_θ(y_{t+1} | y_{1:t}), and the loss function of deep reinforcement learning is as follows:

J_G(θ) = Σ_{t=1}^{T} Σ_{y_t} G_θ(y_t | y_{1:t-1}) · Penalty(y_{1:t})

The penalty factor Penalty is computed in the following two cases, according to whether the current sequence has already generated its last word:

Penalty(y_{1:t}) = (1/N) Σ_{n=1}^{N} Penalty_{D_φ}(Y_{1:T}^{n}), Y_{1:T}^{n} ∈ MC(y_{1:t}; N), for t < T

Penalty(y_{1:t}) = Penalty_{D_φ}(y_{1:t}), for t = T

As shown in Fig. 4, the discriminator's reward is not used directly as the reinforcement-learning feedback here; instead, a penalty serves as the environment's feedback to the agent, so the policy should minimize the cumulative penalty. Because every action requires the current cumulative expected penalty, when the t-th word is generated, Monte Carlo sampling is used to obtain the remaining T-t tokens, and the discriminator computes the current instantaneous cumulative penalty. Adding reinforcement learning makes the generated text more topic-relevant.
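The penalty-based policy-gradient update with Monte Carlo rollouts could be organized as in the following sketch; every interface used here (init_state, step, rollout, penalty) is a hypothetical stand-in for the components described above, not an API defined by the patent:

    import torch

    def policy_gradient_loss(generator, discriminator, topics, T, n_rollouts=16):
        # Hypothetical interfaces: generator.step samples the next token and returns
        # its log-probability; generator.rollout completes a prefix by Monte Carlo
        # sampling; discriminator.penalty maps a finished sequence to a scalar penalty.
        y, state = [], generator.init_state(topics)
        loss = torch.zeros(())
        for t in range(T):
            token, log_p, state = generator.step(y, state)   # y_t ~ G_theta(.|y_{1:t-1})
            y.append(token)
            if t + 1 < T:
                # expected penalty of the unfinished prefix via N Monte Carlo rollouts
                pen = torch.stack([discriminator.penalty(generator.rollout(y, T))
                                   for _ in range(n_rollouts)]).mean()
            else:
                pen = discriminator.penalty(y)               # the sequence is complete
            # REINFORCE-style term: minimizing this discourages high-penalty choices
            loss = loss + pen.detach() * log_p
        return loss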

Stage 5: Training with an Adversarial Neural Network

As shown in Fig. 5, the adversarial neural network consists of a generator and a discriminator; the generator is the reinforcement-learning generator described in Stage 4. During adversarial learning, on the one hand the generator evolves through learning to produce text that is more like the training data and more topic-relevant; on the other hand the discriminator evolves through learning to recognize the text produced by the generator, thereby guiding the generator's evolution.

Adversarial learning is added because, when the preceding reinforcement learning makes the generator stronger, the discriminator is weakened; once the discriminator's discriminative ability is weakened, the reinforcement-learning reward computation becomes biased. Since sampling variance makes reinforcement-learning training unstable, after each reinforcement-learning parameter update a log-likelihood (MLE) objective is additionally applied to correct the reinforcement-learning generator and damp fluctuations during training.

Because the adversarial neural network must train the discriminator and the generator simultaneously, adversarial training is relatively slow and hard to converge. To address this, on the one hand the multi-topic classifier and the pre-trained generator are fully pre-trained before adversarial training, which also helps the model converge; on the other hand, a small number of adversarial training epochs is chosen (1 to 3 suffice), since too many epochs not only waste computing resources but may also overfit.

Stage 6: Model Selection and Testing

Because the task itself is open-ended text generation, greedy decoding or beam-search decoding would make the generated text overly uniform, with serious repetition. The embodiment of the invention therefore uses sampling-based decoding to increase the diversity of the generated text. With sampling-based decoding, the embodiment can generate more diverse texts, and the sampling method effectively avoids word repetition. In addition, sampling-based decoding is used during both training and testing, which alleviates to some extent the exposure-bias problem caused by inconsistent decoding methods between training and testing.
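A minimal sketch of sample-with-temperature decoding over the logits W_o s_t; the default temperature value is an assumption, since the patent does not fix one:

    import torch

    def sample_with_temperature(logits, temperature=0.8):
        """Sample the next token id from logits scaled by a temperature.
        temperature < 1 sharpens the distribution; temperature = 1 recovers
        plain sampling from p(y_t | y_{1:t-1}, topics)."""
        probs = torch.softmax(logits / temperature, dim=-1)
        return torch.multinomial(probs, num_samples=1).item()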

Several models were compared on a public data set, and the experiments show that the text generation method of the embodiment generates text that is more topic-relevant, as well as more fluent and coherent. During decoding, a distribution-sampling-based decoding scheme increases the diversity of the generated text while reducing repetition.

To verify the effectiveness of the embodiment of the present invention, experimental validation was carried out on a public data set:

The experiments use the Zhihu data set released by Harbin Institute of Technology in 2018. On the automatic text evaluation metric BLEU, the model of the embodiment improves on the baseline model by 37% and on the current best model by 6%. In human evaluation, the short texts generated by the model also score best. Text generated by the method of the embodiment is shown in Fig. 6.

It must be noted that, in any of the above embodiments, the steps are not necessarily executed in numerical order; as long as no particular order can be inferred from the execution logic, they may be executed in any other feasible order.

Those skilled in the art will readily understand that the above are merely preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (7)

1. A text generation method based on deep learning, comprising training and testing, characterized in that it comprises:

Stage 1: data preparation, in which the required text data are first obtained with a web crawler and the special symbols in the data are cleaned to obtain the required training data;

Stage 2: construction of the pre-trained model, the pre-trained model comprising an encoder and a decoder, wherein the encoder encodes the topics into word vectors of suitable dimension, the decoder is a long short-term memory (LSTM) recurrent network whose initial state vector is randomly initialized, the topic vector of each time step is obtained with an attention mechanism, and the current topic vector represents the topic semantics carried by the current word;

the input is set as a collection of n topics {τ_1, τ_2, ..., τ_n}, n being the preset maximum number of input topics; each input topic is mapped through the dictionary to its unique word id, and e(τ_j), the word vector of topic τ_j, is retrieved from the word-vector matrix; the decoder's hidden state vector is s_t = LSTM(s_{t-1}, [e(y_{t-1}); h_{t-1}; c_t]), where s_{t-1} is the decoder's hidden state vector at time step t-1, h_{t-1} is the vector of memory information at time step t-1, and c_t is the topic context vector, obtained by the multiplicative attention mechanism according to the following formulas:

g_{tj} = v_a^T C_{t-1,j} tanh(W_a s_{t-1} + U_a e(τ_j))

α_{tj} = softmax(g_{tj})

c_t = Σ_{j=1}^{n} α_{tj} e(τ_j)

in the above formulas, g_{tj} is the attention weight of the decoder at time step t on the j-th topic τ_j, α_{tj} is the attention weight obtained by normalizing g_{tj}, and v_a, W_a, and U_a are trainable parameters initialized from a standard normal distribution;

the coverage vector of the j-th topic at time step t is denoted C_{t,j} and is updated according to the following formula:

C_{t,j} = C_{t-1,j} + (1/φ_j) α_{tj}

where C_{t-1,j} denotes the coverage vector of the j-th topic at time step t-1, and φ_j is obtained from the following formula:

φ_j = N · σ(U_f [e(τ_1), e(τ_2), ..., e(τ_k)])

where N is the number of input topics, U_f is a trainable parameter, and σ is the sigmoid activation function;

h_t is obtained from a history memory module, which mainly comprises a history memory matrix HM^{T×E}, T being the maximum article length and E the word-vector dimension; this matrix is initialized to all zeros, meaning that no memory is stored at the start; during training it dynamically stores the word vector generated by the decoder at each earlier time step, and these word vectors are filled into the matrix; during training the matrix receives no parameter updates and serves only as a container for word vectors;

as new words are generated, the history memory matrix is filled as follows:

HM^{T×E}(t) = e(y_{t-1})

to select from the historical information, a gating network takes the decoder's hidden state vector s_t and two trainable parameters W_h and b_h as input and processes them with the tanh activation function, computed as follows:

v_t = tanh(W_h s_t + b_h) HM[t,:]

the vector v_t produced by the gating network is softmax-normalized as the weight of each word in the history memory module, and with this weight the required history information vector h_t is selected from the module, computed according to the following formula:

h_t = softmax(v_t) HM^{T×E}

the model's final distribution p(y_t | y_{1:t-1}, τ_{1:k}) at time step t is obtained according to the following formula:

p(y_t | y_{1:t-1}, τ_{1:k}) = softmax(W_o s_t)

where W_o is a learnable parameter initialized from a standard normal distribution;

in the training stage, the model is trained with the cross-entropy loss function as the objective, given by the following formula:

Loss = -Σ_t q(t) log p(t)

where q(t) is the distribution of the true output, one-hot encoded, and p(t) is the distribution predicted by the model;

in the prediction stage, the model samples the predicted word y_t at time t from the following distribution:

y_t ~ p(y_t | y_{1:t-1}, τ_{1:k})

Stage 3: training the multi-topic classifier and building a multi-class discriminator; the discriminator is a multi-class text classifier with n+1 targets, of which n targets are the n topics the text may belong to and the remaining one indicates whether the text was generated by the model or is an actual training sample; the classifier's input consists of two parts, the real training data and the data generated by the pre-trained generator; the input text sequence y_1, y_2, ..., y_T is embedded and concatenated into the feature-sequence matrix π_{1:T} ∈ R^{T×E}; a two-dimensional convolution with kernel ω ∈ R^{l×E}, whose length E matches the word-vector dimension and whose width is l, followed by an activation-function mapping, produces the feature vectors, and max pooling and a Highway network give the final output class distribution D_φ(x_j | y_{1:T});

Stage 4: building the reinforcement-learning generator; in text generation, the reinforcement-learning agent can be viewed as the generator, denoted G, and the reinforcement-learning environment is represented by a multi-classifier D; the agent's state is the set of previously generated words, and a reinforcement-learning action is the word the agent may choose next; the policy-gradient method is introduced, with a_t denoting the word predicted by the model at time step t and the policy π denoting the strategy by which all tokens of the text are generated; reinforcement learning determines this policy π, which is then parameterized, P_θ(a|s) being the probability that, conditioned on s = {a_1, a_2, ..., a_n}, the next word chosen by the model is a; the policy is G_θ(y_{t+1} | y_{1:t}), and the loss function J_G of deep reinforcement learning is as follows:

J_G(θ) = Σ_{t=1}^{T} Σ_{y_t} G_θ(y_t | y_{1:t-1}) · Penalty(y_{1:t})

the penalty factor Penalty being computed in the following two cases, according to whether the current sequence has already generated its last word:

Penalty(y_{1:t}) = (1/N) Σ_{n=1}^{N} Penalty_{D_φ}(Y_{1:T}^{n}), Y_{1:T}^{n} ∈ MC(y_{1:t}; N), for t < T

Penalty(y_{1:t}) = Penalty_{D_φ}(y_{1:t}), for t = T

Stage 5: training with an adversarial neural network composed of a generator and a discriminator; during adversarial learning, on the one hand the generator evolves through learning to produce text that is more like the training data and more topic-relevant, and on the other hand the discriminator evolves through learning to recognize the text produced by the generator, thereby guiding the generator's evolution;

Stage 6: model selection and testing;

the training comprising the steps of:

constructing a training set that contains multiple sample pairs, each composed of a preprocessed topic and its corresponding text;

predefining a generator that generates text from an input topic and pre-training it on the training set, the generator comprising an encoder and a decoder, the encoder encoding the input topics into word vectors, the decoder being a long short-term memory recurrent network whose initial state vector is randomly initialized and whose input comprises the ground-truth output of the previous time step, the topic vector obtained by the attention mechanism, and the global history memory vector;

predefining a classifier, and feeding the text output by the generator and the text in the training set into the classifier for adversarial training; and

performing reinforcement-learning training on the generator with a loss function defined from the pre-trained generator and the classifier.

2. The deep-learning-based text generation method of claim 1, characterized in that the preprocessing comprises: segmenting the texts in the sample set into keywords, computing the tf-idf scores of all keywords with the tf-idf algorithm, and selecting the several highest-scoring keywords as the topics of each text.

3. The deep-learning-based text generation method of claim 1, characterized in that the global history memory vector is obtained from a history memory matrix composed of vectors of length L; the matrix is initialized to all zeros and dynamically stores the previously generated word vectors during training, and it receives no parameter updates while the generator is trained.

4. The deep-learning-based text generation method of claim 3, characterized in that a gating network is used to obtain the currently required global history memory vector.

5. The deep-learning-based text generation method of claim 1, characterized in that the classifier comprises a convolutional layer, a pooling layer, and a Highway network connected in sequence, and the objective function of the classifier uses the cross-entropy loss function.

6. The deep-learning-based text generation method of claim 1, characterized in that defining the loss function from the pre-trained generator and the classifier specifically comprises: using a penalty-based expectation as the objective function of reinforcement-learning training, the penalty function being computed jointly from the classifier and the generator.

7. The deep-learning-based text generation method of claim 1, characterized in that the decoder's hidden state vector is s_t = LSTM(s_{t-1}, [e(y_{t-1}); h_{t-1}; c_t]), where s_{t-1} is the decoder's hidden state vector at time step t-1, h_{t-1} is the vector of memory information at time step t-1, and c_t is the topic context vector, obtained by the multiplicative attention mechanism according to the following formulas:

g_{tj} = v_a^T C_{t-1,j} tanh(W_a s_{t-1} + U_a e(τ_j))

α_{tj} = softmax(g_{tj})

c_t = Σ_{j=1}^{n} α_{tj} e(τ_j)

in the above formulas, g_{tj} is the attention weight of the decoder at time step t on the j-th topic τ_j, α_{tj} is the attention weight obtained by normalizing g_{tj}, v_a, W_a, and U_a are trainable parameters initialized from a standard normal distribution, C is the topic coverage vector, and C_{t-1,j} denotes the coverage vector of the j-th topic at time step t-1.
CN202010652675.XA 2020-07-08 2020-07-08 A text generation method based on deep learning Active CN111858931B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010652675.XA CN111858931B (en) 2020-07-08 2020-07-08 A text generation method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010652675.XA CN111858931B (en) 2020-07-08 2020-07-08 A text generation method based on deep learning

Publications (2)

Publication Number Publication Date
CN111858931A CN111858931A (en) 2020-10-30
CN111858931B true CN111858931B (en) 2022-05-13

Family

ID=73153043

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010652675.XA Active CN111858931B (en) 2020-07-08 2020-07-08 A text generation method based on deep learning

Country Status (1)

Country Link
CN (1) CN111858931B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11989939B2 (en) 2021-03-17 2024-05-21 Samsung Electronics Co., Ltd. System and method for enhancing machine learning model for audio/video understanding using gated multi-level attention and temporal adversarial training
CN113435183B (en) * 2021-06-30 2023-08-29 平安科技(深圳)有限公司 Text generation method, device and storage medium
CN113505611B (en) * 2021-07-09 2022-04-15 中国人民解放军战略支援部队信息工程大学 Training methods and systems for better speech translation models in generative adversarial
CN114266320A (en) * 2021-12-30 2022-04-01 北京天融信网络安全技术有限公司 Model training method, password cracking method, device and electronic equipment
CN114444488B (en) * 2022-01-26 2023-03-24 中国科学技术大学 Few-sample machine reading understanding method, system, equipment and storage medium
CN114629699B (en) * 2022-03-07 2022-12-09 北京邮电大学 Migratory network flow behavior anomaly detection method and device based on deep reinforcement learning
CN114818666B (en) * 2022-04-26 2023-03-28 广东外语外贸大学 Evaluation method, device and equipment for Chinese grammar error correction and storage medium
CN114925658B (en) * 2022-05-18 2023-04-28 电子科技大学 Open text generation method and storage medium
CN115481630A (en) * 2022-09-27 2022-12-16 深圳先进技术研究院 Method and device for automatic generation of electronic letter of guarantee based on sequence confrontation and prior reasoning
CN115630640B (en) * 2022-12-23 2023-03-10 苏州浪潮智能科技有限公司 Intelligent writing method, device, equipment and medium
CN116127051B (en) * 2023-04-20 2023-07-11 中国科学技术大学 Dialogue generation method based on deep learning, electronic equipment and storage medium
CN116957056B (en) * 2023-09-18 2023-12-08 天津汇智星源信息技术有限公司 Feedback-based model training method, keyword extraction method and related equipment
CN117610548B (en) * 2024-01-22 2024-05-03 中国科学技术大学 Multi-mode-based automatic paper chart title generation method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108334497A (en) * 2018-02-06 2018-07-27 北京航空航天大学 The method and apparatus for automatically generating text
CN109241377A (en) * 2018-08-30 2019-01-18 山西大学 A kind of text document representation method and device based on the enhancing of deep learning topic information
CN109657041A (en) * 2018-12-04 2019-04-19 南京理工大学 The problem of based on deep learning automatic generation method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10387464B2 (en) * 2015-08-25 2019-08-20 Facebook, Inc. Predicting labels using a deep-learning model

Also Published As

Publication number Publication date
CN111858931A (en) 2020-10-30

Similar Documents

Publication Publication Date Title
CN111858931B (en) A text generation method based on deep learning
CN109657239B (en) Chinese Named Entity Recognition Method Based on Attention Mechanism and Language Model Learning
CN110502749B (en) Text relation extraction method based on double-layer attention mechanism and bidirectional GRU
CN112818159B (en) Image description text generation method based on generation countermeasure network
Shih et al. Investigating siamese lstm networks for text categorization
CN109947931B (en) Method, system, device and medium for automatically abstracting text based on unsupervised learning
CN108334497A (en) The method and apparatus for automatically generating text
CN111160467A (en) Image description method based on conditional random field and internal semantic attention
CN108628823A (en) In conjunction with the name entity recognition method of attention mechanism and multitask coordinated training
CN107590138A (en) A kind of neural machine translation method based on part of speech notice mechanism
CN109214006B (en) A Natural Language Inference Method for Image Enhanced Hierarchical Semantic Representation
Jiang et al. Cold-start and interpretability: Turning regular expressions into trainable recurrent neural networks
CN111738007A (en) A Data Augmentation Algorithm for Chinese Named Entity Recognition Based on Sequence Generative Adversarial Networks
Choi et al. Knowledge graph extension with a pre-trained language model via unified learning method
CN108491515A (en) A kind of sentence pair matching degree prediction technique for campus psychological consultation
CN111353040A (en) GRU-based attribute level emotion analysis method
Yang et al. Recurrent neural network-based language models with variation in net topology, language, and granularity
WO2023159759A1 (en) Model training method and apparatus, emotion message generation method and apparatus, device and medium
CN112199503B (en) Feature-enhanced unbalanced Bi-LSTM-based Chinese text classification method
Mathur et al. A scaled‐down neural conversational model for chatbots
Ni et al. Fraud's Bargain Attack: Generating Adversarial Text Samples via Word Manipulation Process
Yang et al. Sequence-to-sequence prediction of personal computer software by recurrent neural network
CN111046157B (en) Universal English man-machine conversation generation method and system based on balanced distribution
Wang et al. An integrated deep generative model for text classification and generation
Tesfagergish et al. Part-of-speech tagging via deep neural networks for northern-Ethiopic languages

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant