CN109299211B - Automatic text generation method based on Char-RNN model


Info

Publication number
CN109299211B
Authority
CN
China
Prior art keywords
probability
training
data
char
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811104442.5A
Other languages
Chinese (zh)
Other versions
CN109299211A (en)
Inventor
朱静
黄颖杰
杨晋昌
黄文恺
韩晓英
邓文婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou University
Original Assignee
Guangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou University filed Critical Guangzhou University
Priority to CN201811104442.5A priority Critical patent/CN109299211B/en
Publication of CN109299211A publication Critical patent/CN109299211A/en
Application granted granted Critical
Publication of CN109299211B publication Critical patent/CN109299211B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an automatic text generation method based on a Char-RNN model, which comprises the following steps: S1, acquiring text data meeting the characteristic requirements; S2, modeling the text data, namely representing letters or Chinese characters with a vector matrix to obtain training data; S3, inputting the training data into the Char-RNN model batch by batch for training to obtain, for each character, the probability of the next character, continuously correcting these probabilities as the number of training iterations grows, and saving the trained model result once the preset number of training iterations is reached; and S4, taking the input keyword as the initial characters, using the trained model result to obtain and output the probability corresponding to the next character, feeding the predicted character back as the input of the next step, and continuing in this way to generate a passage of text. Compared with the common RNN model, the method solves the problem of vanishing or exploding gradients when processing long sequence data.

Description

Automatic text generation method based on Char-RNN model
Technical Field
The invention belongs to the technical field of neural networks, and particularly relates to an automatic text generation method based on a Char-RNN model.
Background
The Recurrent Neural Network (RNN) is a classical neural network and the preferred network for time series data. On certain sequential machine learning tasks, the RNN can reach an accuracy that no other algorithm matches. This is because a conventional neural network has no memory of what it has already processed, whereas the RNN has the advantage of a limited short-term memory.
The main purpose of an RNN is to process sequence data. In a traditional neural network model, the layers are fully connected from the input layer through the hidden layer to the output layer, but the nodes within each layer are unconnected. Such an ordinary neural network is powerless against many problems. For example, to predict the next word of a sentence, the previous words are generally needed, because the words in a sentence are not independent of one another. The RNN is called a recurrent neural network because the current output of a sequence also depends on the previous outputs. Concretely, the network memorizes the previous information and applies it to the calculation of the current output; that is, the nodes of the hidden layer are no longer unconnected but are connected across time steps, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment. In theory, an RNN can process sequence data of any length.
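To make the recurrence concrete, the following minimal NumPy sketch shows how the hidden state at each step combines the current input with the hidden state of the previous moment; the layer sizes, weight initialization and the helper name rnn_step are illustrative assumptions and not details taken from the patent.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrent step: the new hidden state depends on the current input
    and on the hidden state from the previous moment."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# toy sizes: 26-dimensional one-hot input, 32-dimensional hidden state
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(26, 32))
W_hh = rng.normal(size=(32, 32))
b_h = np.zeros(32)

h = np.zeros(32)
for x_t in np.eye(26)[[0, 1, 2]]:           # feed the letters a, b, c one after another
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)   # h carries the memory of the earlier letters
```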
At present many papers apply CNNs to NLP, but the most widely used model at the intersection of deep learning and NLP is the RNN, because a text can be naturally expressed as an input sequence, which the RNN processes conveniently while capturing information such as long-term dependencies, giving good results in practice. RNNs have many applications, one of which is the combination with Natural Language Processing (NLP). Many people on the web have demonstrated RNNs and created surprisingly effective models.
RNN-based systems can learn tasks such as translating languages, controlling robots, image analysis, document summarization, speech recognition, image recognition, handwriting recognition, controlling chatbots, predicting diseases, click-through rates and stock prices, synthesizing music, and so on. For example, in 2015 Google greatly improved the speech recognition capability of Android phones and other devices through an RNN program trained with CTC; Apple's iPhone uses RNNs in QuickType and Siri; Microsoft uses RNNs not only for speech recognition but also for generating virtual dialogue figures and for programming program code. Amazon Alexa communicates with you at home through a bidirectional RNN, while Google uses RNNs even more widely: they generate image captions and the automatic e-mail replies included in the intelligent assistant Allo, and have significantly improved the quality of Google Translate since 2016. In fact, a significant portion of the computing resources of Google's data centers now run RNN tasks.
The RNN is also well established in content recommendation; it is the subject of continuing in-depth research in academia and has begun to be applied in large enterprises and Internet companies. For example, Netflix co-authored a paper at ICLR 2016 that describes how to use an RNN to make video recommendations based on a user's short-term behavior data.
The RNN also has a place in automatic text generation. In the prior art, RNN language models have been trained on large inputs such as the works of Shakespeare and, after training, generate their own Shakespeare-style poems that are difficult to distinguish from the original. The RNN model used for automatic text generation is the Char-RNN, so how to generate text automatically with the Char-RNN model is one of the research directions of those skilled in the art.
Disclosure of Invention
The main object of the invention is to overcome the defects of the prior art and provide an automatic text generation method based on a Char-RNN model, so as to solve the problem of vanishing or exploding gradients when processing long sequence data.
In order to achieve the purpose, the invention adopts the following technical scheme:
The invention provides an automatic text generation method based on a Char-RNN model, which comprises the following steps:
S1, acquiring text data meeting the characteristic requirements;
S2, modeling the acquired text data, namely representing letters or Chinese characters with a vector matrix to obtain training data;
S3, inputting the training data into the Char-RNN model batch by batch for training to obtain, for each character, the probability of the next character, continuously correcting these probabilities as the number of training iterations grows, and saving the trained model result once the preset number of training iterations is reached;
and S4, taking the input keyword as the initial characters, using the trained model result to obtain and output the probability corresponding to the next character, feeding the predicted character back as the input of the next step, and continuing in this way to generate a passage of text.
As a preferable technical solution, in step S1, the text data meeting the feature requirement satisfies the following requirements:
the text type and sentence style characteristics are similar;
uniformly stored as txt documents in UTF-8 encoding format;
the same language is used.
As a preferred technical solution, in step S2, when a vector matrix is used to represent letters, each letter is represented by a one-hot vector and then input to the network in sequence; one-hot encoding uses a binary vector whose length is the size of the vocabulary.
As a preferred technical solution, for the 26 lowercase English letters, the letter a is represented with the first digit set to 1 and the other 25 digits set to 0, i.e., (1,0,0, …, 0); the letter b is represented with the second digit set to 1 and the other 25 digits set to 0, i.e., (0,1,0, …, 0); and so on. The output is equivalent to a 26-class classification problem, so the vector output at each step is also 26-dimensional.
As a preferred technical solution, in step S2, when the vector matrix is used to represent Chinese characters, an embedding processing layer is added before the Chinese characters are processed, and individual Chinese characters are classified according to their actual meaning, so that the large number of Chinese characters is converted into a denser representation and a better effect is obtained.
As a preferred technical solution, the Char-RNN model is an N VS N model, and a Dropout layer is added for data processing to reduce overfitting; in the Char-RNN model, the principle of Dropout is to randomly ignore some of the connections so that the whole neural network becomes incomplete; after the incomplete network has been trained once, other connections are randomly ignored the second time and the network is trained again, and so on; the Dropout layer holds a keep_prob parameter that indicates the probability of randomly retaining data.
As a preferred technical solution, in step S3, a Softmax layer is added to the Char-RNN model. Softmax is a function whose mathematical definition is: given an array V in which vi denotes the i-th element, the Softmax value of that element is Si = e^(vi) / Σj e^(vj); that is, Softmax maps the raw numerical outputs to values in (0,1) whose sum is 1. The Softmax function is a tool that helps obtain the probability; therefore the intermediate calculation result is transformed to obtain logits, which are then passed through Softmax to obtain the output, i.e., the predicted probability, and the character with the largest probability can be selected as the output according to that probability.
As a preferred technical solution, in step S3, when defining the loss, the loss is obtained as the cross entropy between the predicted probability and the one-hot encoding of the training target, i.e., the next letter corresponding to each input letter.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) In text processing, compared with the sentence- and word-level granularity of other models, the invention is accurate to the character level, and character-level coherence can be learned.
(2) Compared with the common RNN model, the invention solves the problem of vanishing or exploding gradients when the RNN processes long sequence data.
(3) Compared with an RNN model whose input is at the sentence or word level, the method can solve the problem of unknown words.
(4) When the invention trains on a Chinese corpus, because the input unit is each individual Chinese character, no Chinese word segmentation is needed, which removes a comparatively complex segmentation step and avoids the errors that segmentation may introduce.
Drawings
FIG. 1 is a schematic diagram of the classical RNN structure of "N VS N";
FIG. 2 is a schematic diagram of the Char-RNN using an already entered character to predict the next character;
FIG. 3 is a flow chart of a method for automatic text generation in accordance with the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Examples
The invention relates to an automatic text generation method based on a Char-RNN model, whose main principle is as follows: first, with the provided text data serving as the training sample data set, the text data in the data set is preprocessed into sample data for training (the characters are encoded and a mapping is established; letters are expressed with one-hot vectors, and an embedding layer must be added for Chinese characters); second, the processed training data is used as the input of the Char-RNN model for training, and the model output is produced.
Char-RNN is a character-level recurrent neural network, originally described in The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy. It is well known that RNNs are very good at handling sequence problems, in which successive items of sequence data are strongly correlated; the RNN embodies this through the sharing of weights and biases across elements and through recurrent computation (previously processed information is used when processing subsequent information). The Char-RNN model generates text at the level of individual characters, i.e., it predicts the probability of the next character from the characters already observed, which amounts to guessing the continuation of sequence data. Most of the deep-learning song, poem and novel writers introduced on the Internet are based on this method.
Samples with a sequential relation among them are called sequence samples; for example, in a text a word is related to the preceding word, and in soil data the temperature at one moment is correlated with the previous temperature. Unlike a traditional neural network whose inputs and outputs are fixed, the Char-RNN allows sequences of input and output vectors and can reuse existing information. Therefore, using Char-RNN to analyze various types of sequential text has become the first choice.
A language model is required for a machine to generate text. A language model evaluates the probability that a sentence is natural, i.e., it estimates the probability of the next word from the words already present in the sentence. Char-RNN is suitable for processing sequence data: it can extract a digest of a sequence of any length (Xt, Xt-1, …, X1) and optionally preserve certain aspects of the sequence.
As shown in fig. 1, the Char-RNN model is an "N VS N" structure. Taking English input as an example, the input sequence is the letters of a sentence, and the output at each step is the next letter following that input; in other words, the model predicts the probability of the next letter from the letters already input.
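As a rough illustration of this "N VS N" character-level structure, the sketch below uses PyTorch with an LSTM cell; the class name CharRNN, all layer sizes, and the choice of an LSTM cell are assumptions made for illustration rather than details fixed by the patent.

```python
import torch
import torch.nn as nn

class CharRNN(nn.Module):
    """Minimal character-level N-vs-N model: one output per input step."""
    def __init__(self, vocab_size, hidden_size=128, num_layers=2, keep_prob=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)   # dense lookup; one-hot also works for small alphabets
        self.rnn = nn.LSTM(hidden_size, hidden_size, num_layers,
                           batch_first=True, dropout=1.0 - keep_prob)
        self.fc = nn.Linear(hidden_size, vocab_size)          # scores (logits) over the next character

    def forward(self, x, state=None):
        # x: (batch, seq_len) integer character ids
        h = self.embed(x)                  # (batch, seq_len, hidden)
        out, state = self.rnn(h, state)    # one hidden vector per input character
        logits = self.fc(out)              # (batch, seq_len, vocab): next-character scores
        return logits, state
```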
The technical scheme of the invention is further explained as follows:
as shown in fig. 3, the text automatic generation method based on the Char-RNN model in this embodiment includes the following steps:
step S1: and acquiring text data meeting the characteristic requirements.
In step S1, the acquired text data needs to meet the following requirements:
(1) the text type and sentence style characteristics are relatively close;
(2) uniformly stored as txt documents in UTF-8 encoding format;
(3) the same language is used (e.g., all English or all Chinese; different languages may not be mixed together).
Step S2: modeling the text data, mainly by using a vector matrix to represent letters or Chinese characters, to obtain training data.
The modeling of text data mainly uses vector matrices to represent characters. English letters are represented with one-hot encoding and then input to the network in sequence.
Take the 26 lowercase English letters as an example: the letter a is represented with the first digit set to 1 and the other 25 digits set to 0, i.e., (1,0,0, …, 0); the letter b is represented with the second digit set to 1 and the other 25 digits set to 0, i.e., (0,1,0, …, 0); and so on. The output is equivalent to a 26-class classification problem, so the vector output at each step is also 26-dimensional. In practical use, because letters are also distinguished by case and other punctuation marks occur, the one-hot vectors may encode more symbols than just the 26 letters.
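A minimal sketch of this one-hot representation for the 26 lowercase letters follows; the alphabet ordering and the helper name one_hot are illustrative assumptions.

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz"            # the 26 lowercase letters
CHAR_TO_ID = {c: i for i, c in enumerate(ALPHABET)}

def one_hot(char):
    """Return the 26-dimensional one-hot vector for a single letter."""
    vec = np.zeros(len(ALPHABET), dtype=np.float32)
    vec[CHAR_TO_ID[char]] = 1.0
    return vec

print(one_hot("a"))   # [1. 0. 0. ... 0.]
print(one_hot("b"))   # [0. 1. 0. ... 0.]
```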
If training is performed on Latin letters such as English letters, the vector matrix representation can be completed directly with simple one-hot encoding because the number of letters is small (there are only 26 English letters, for example). When the vector matrix is used to represent Chinese characters, however, the number of characters is huge; if one-hot encoding were still used, the vectors would be extremely long and the representation far too sparse. Embedding (word embedding) can express Chinese characters with low-dimensional vectors and reduce the number of feature dimensions. The embedding representation projects characters into a continuous vector space in which the position of each character reflects semantic relatedness; for example, the character "male" is closer to "female" than to "cat", because "male" and "female" are more similar. Therefore, an embedding layer needs to be added before the Chinese characters are processed, classifying individual Chinese characters according to their actual meaning, so that the large number of Chinese characters is converted into a dense representation and a better effect is obtained.
In the modeling of the text data, Chinese characters therefore require a processing method different from that of English letters: an embedding layer must be added before the Chinese characters are processed, which classifies individual Chinese characters according to their actual meaning, converting the large number of Chinese characters into a dense representation and achieving a better effect. The embedding parameter variables are obtained through training.
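A hedged sketch of such an embedding layer follows; the vocabulary size, embedding dimension and example character ids are assumed values for illustration, and in line with the description the embedding parameters would be learned during training.

```python
import torch
import torch.nn as nn

vocab_size = 5000      # assumed number of distinct Chinese characters in the corpus
embed_dim = 128        # assumed dense representation size, far smaller than vocab_size

embedding = nn.Embedding(vocab_size, embed_dim)    # parameters learned jointly with the RNN

char_ids = torch.tensor([[17, 402, 3981]])         # a batch containing three character ids
dense = embedding(char_ids)                        # shape (1, 3, 128): dense vectors instead of sparse one-hot
```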
Step S3: inputting the training data into the Char-RNN model batch by batch for training. The purpose of training is to obtain, for each character, the probability of the next character; these probabilities are continuously corrected as the number of training iterations grows, and the trained model result is saved once the preset number of training iterations is reached.
In step S3, the training model is a multi-layer N VS N model, to which a Dropout layer for data processing is added to reduce overfitting.
In the Char-RNN model, the principle of Dropout is to randomly ignore some of the connections, so that the whole neural network becomes "incomplete"; after the "incomplete" network has been trained once, other connections are randomly ignored the second time and the network is trained again, and so on. In this way, no prediction depends too heavily on one specific part of the data over the whole training process. In essence, Dropout denies the neural network the chance to over-rely on particular units, thereby reducing overfitting. In the Char-RNN model, the Dropout layer stores a keep_prob parameter, which indicates the probability of randomly retaining data; for example, keep_prob = 0.5 means that half of the data is randomly kept for training. This embodiment implements the data processing of the Dropout layer through this one key parameter to prevent overfitting.
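A small sketch of the keep_prob idea is given below; the keep_prob name follows the convention mentioned above, while the PyTorch Dropout layer used here is parameterised by the drop probability, i.e. 1 - keep_prob, an implementation detail assumed for illustration.

```python
import torch
import torch.nn as nn

keep_prob = 0.5                              # probability of keeping each activation
dropout = nn.Dropout(p=1.0 - keep_prob)      # PyTorch expects the *drop* probability

x = torch.ones(4, 8)
dropout.train()                              # dropout is active only during training
print(dropout(x))                            # roughly half of the entries are zeroed, the rest rescaled
dropout.eval()
print(dropout(x))                            # at inference time the layer acts as an identity
```

Because a different random subset of connections is dropped in every training pass, each prediction is forced not to rely on any one fixed part of the network, which is exactly the over-fitting reduction described above.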
The output of the training model must yield the final classification probability, so a Softmax layer is added. Softmax is a function whose mathematical definition is: given an array V in which vi denotes the i-th element, the Softmax value of that element is Si = e^(vi) / Σj e^(vj). That is, Softmax maps the raw numerical outputs to values in (0,1) whose sum is 1 (satisfying the property of a probability). The Softmax function is a tool that helps obtain the probability. Therefore, the intermediate calculation result is transformed to obtain logits, which are then passed through Softmax to obtain the output, i.e., the predicted probability, and the character with the largest probability can be selected as the output according to that probability.
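A minimal, numerically stable sketch of the Softmax mapping described above follows; the stabilising max-subtraction is a common implementation detail and is not required by the patent.

```python
import numpy as np

def softmax(logits):
    """Map raw scores (logits) to probabilities in (0, 1) that sum to 1."""
    z = logits - np.max(logits)          # subtract the maximum for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)                  # approximately [0.659, 0.242, 0.099]
print(probs, probs.sum())                # the probabilities sum to 1.0
next_char_id = int(np.argmax(probs))     # pick the most probable character as the output
```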
When defining the loss, the loss is obtained as the cross entropy between the predicted probability and the one-hot encoding of the training target (i.e., the next letter corresponding to each input letter).
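A hedged sketch of one training step with this loss is given below, reusing the hypothetical CharRNN class sketched earlier. In this sketch the framework's cross-entropy takes raw logits and integer targets, which is mathematically equivalent to comparing the Softmax probabilities with the one-hot codes; the optimiser, learning rate and vocabulary size are assumptions.

```python
import torch
import torch.nn as nn

model = CharRNN(vocab_size=64)        # hypothetical CharRNN class from the earlier sketch
criterion = nn.CrossEntropyLoss()     # framework form of the cross entropy described above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(inputs, targets):
    """inputs: (batch, seq_len) character ids; targets: the same text shifted by one character,
    i.e. the next letter corresponding to each input letter."""
    logits, _ = model(inputs)                               # (batch, seq_len, vocab) raw scores
    loss = criterion(logits.reshape(-1, logits.size(-1)),   # each step's prediction ...
                     targets.reshape(-1))                    # ... against the true next character id
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```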
Step S4: taking the input keyword as the initial characters, using the trained model result to obtain and output the probability corresponding to the next character, feeding the predicted character back as the input of the next step, and continuing in this way to generate a passage of text. As shown in FIG. 2, after inputting a phrase such as the classical Chinese poem line "锄禾日当午", the output following the character "锄" (to hoe) is most likely "禾" (grain), and so on.
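A sketch of this step-by-step generation loop follows, again reusing the hypothetical CharRNN class; the sampling strategy (multinomial sampling rather than always taking the arg-max) and the helper names are assumptions, the essential point being that the predicted character is fed back as the next input.

```python
import torch

def generate(model, seed_ids, length, id_to_char):
    """Feed the seed keyword, then repeatedly feed back the predicted character."""
    model.eval()
    state = None
    x = torch.tensor([seed_ids])                         # (1, len(seed)) seed character ids
    out_chars = [id_to_char[i] for i in seed_ids]
    with torch.no_grad():
        for _ in range(length):
            logits, state = model(x, state)              # scores for the next character
            probs = torch.softmax(logits[0, -1], dim=-1) # probability of each possible next character
            next_id = int(torch.multinomial(probs, 1))   # sample (or take arg-max of) the next character
            out_chars.append(id_to_char[next_id])
            x = torch.tensor([[next_id]])                # the prediction becomes the next input
    return "".join(out_chars)
```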
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (4)

1. An automatic text generation method based on a Char-RNN model, characterized by comprising the following steps:
S1, acquiring text data meeting the characteristic requirements;
S2, modeling the acquired text data, namely representing letters or Chinese characters with a vector matrix to obtain training data; when a vector matrix is used to represent letters, each letter is represented by a one-hot vector and then input to the network in sequence; one-hot encoding uses a binary vector whose length is the size of the vocabulary; for the 26 lowercase English letters, the letter a is represented with the first digit set to 1 and the other 25 digits set to 0, i.e., the vector (1,0,0, …, 0), the letter b is represented with the second digit set to 1 and the other 25 digits set to 0, i.e., the vector (0,1,0, …, 0), and so on; the output is equivalent to a 26-class classification problem, so the vector output at each step is also 26-dimensional;
when the vector matrix is used to represent Chinese characters, a processing layer needs to be added before the Chinese characters are processed, and individual Chinese characters are classified according to their actual meaning, so that the large number of Chinese characters is converted into a dense representation and a better effect is obtained;
S3, inputting the training data into the Char-RNN model batch by batch for training to obtain, for each character, the probability of the next character, continuously correcting these probabilities as the number of training iterations grows, and saving the trained model result once the preset number of training iterations is reached; a Softmax layer is added to the Char-RNN model for processing, where Softmax is a function that maps the raw numerical outputs to values in (0,1) whose sum is 1; the Softmax function is a tool that helps obtain the probability, so the intermediate calculation result is transformed to obtain logits, which are then passed through Softmax to obtain the output, i.e., the predicted probability, and the character with the largest probability is selected as the output according to that probability;
and S4, taking the input keyword as the initial characters, using the trained model result to obtain and output the probability corresponding to the next character, feeding the predicted character back as the input of the next step, and continuing in this way to generate a passage of text.
2. The method as claimed in claim 1, wherein in step S1, the text data meeting the feature requirement satisfies the following requirements:
the text type and sentence style characteristics are similar;
uniformly stored as txt documents in UTF-8 encoding format;
the same language is used.
3. The method as claimed in claim 1, wherein the Char-RNN model is an N VS N model, and a Dropout layer is added for data processing to reduce overfitting; in the Char-RNN model, the principle of Dropout is to randomly ignore some of the connections so that the whole neural network becomes incomplete; after the incomplete network has been trained once, other connections are randomly ignored the second time and the network is trained again, and so on; the Dropout layer holds a keep_prob parameter that indicates the probability of randomly retaining data.
4. The method as claimed in claim 1, wherein in step S3, when defining the loss, the loss is obtained as the cross entropy between the predicted probability and the one-hot encoding of the training target, i.e., the next letter corresponding to each input letter.
CN201811104442.5A 2018-09-21 2018-09-21 Automatic text generation method based on Char-RNN model Active CN109299211B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811104442.5A CN109299211B (en) 2018-09-21 2018-09-21 Automatic text generation method based on Char-RNN model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811104442.5A CN109299211B (en) 2018-09-21 2018-09-21 Automatic text generation method based on Char-RNN model

Publications (2)

Publication Number Publication Date
CN109299211A CN109299211A (en) 2019-02-01
CN109299211B true CN109299211B (en) 2021-06-29

Family

ID=65163966

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811104442.5A Active CN109299211B (en) 2018-09-21 2018-09-21 Automatic text generation method based on Char-RNN model

Country Status (1)

Country Link
CN (1) CN109299211B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287478B (en) * 2019-05-15 2023-05-23 广东工业大学 Machine writing system based on natural language processing technology
WO2020240872A1 (en) * 2019-05-31 2020-12-03 株式会社 AI Samurai Patent text generating device, patent text generating method, and patent text generating program
CN112307820B (en) * 2019-07-29 2022-03-22 北京易真学思教育科技有限公司 Text recognition method, device, equipment and computer readable medium
CN111222320B (en) * 2019-12-17 2020-10-20 共道网络科技有限公司 Character prediction model training method and device
CN111325095B (en) * 2020-01-19 2024-01-30 西安科技大学 Intelligent detection method and system for equipment health state based on acoustic wave signals
CN112329779A (en) * 2020-11-02 2021-02-05 平安科技(深圳)有限公司 Method and related device for improving certificate identification accuracy based on mask

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10503827B2 (en) * 2016-09-23 2019-12-10 International Business Machines Corporation Supervised training for word embedding
CN107665254A (en) * 2017-09-30 2018-02-06 济南浪潮高新科技投资发展有限公司 A kind of menu based on deep learning recommends method
CN108197294B (en) * 2018-01-22 2021-10-22 桂林电子科技大学 Text automatic generation method based on deep learning
CN108334497A (en) * 2018-02-06 2018-07-27 北京航空航天大学 The method and apparatus for automatically generating text

Also Published As

Publication number Publication date
CN109299211A (en) 2019-02-01

Similar Documents

Publication Publication Date Title
CN109299211B (en) Automatic text generation method based on Char-RNN model
CN109635109B (en) Sentence classification method based on LSTM and combined with part-of-speech and multi-attention mechanism
CN109697232B (en) Chinese text emotion analysis method based on deep learning
CN108920622B (en) Training method, training device and recognition device for intention recognition
CN108416058B (en) Bi-LSTM input information enhancement-based relation extraction method
CN110059188B (en) Chinese emotion analysis method based on bidirectional time convolution network
CN110704576B (en) Text-based entity relationship extraction method and device
CN111209401A (en) System and method for classifying and processing sentiment polarity of online public opinion text information
CN111401061A (en) Method for identifying news opinion involved in case based on BERT and Bi L STM-Attention
CN112214604A (en) Training method of text classification model, text classification method, device and equipment
CN113190656B (en) Chinese named entity extraction method based on multi-annotation frame and fusion features
US11783179B2 (en) System and method for domain- and language-independent definition extraction using deep neural networks
CN111666758A (en) Chinese word segmentation method, training device and computer readable storage medium
CN113536795B (en) Method, system, electronic device and storage medium for entity relation extraction
CN113626589A (en) Multi-label text classification method based on mixed attention mechanism
CN112183083A (en) Abstract automatic generation method and device, electronic equipment and storage medium
CN112766319A (en) Dialogue intention recognition model training method and device, computer equipment and medium
CN112632258A (en) Text data processing method and device, computer equipment and storage medium
CN113987167A (en) Dependency perception graph convolutional network-based aspect-level emotion classification method and system
CN113821635A (en) Text abstract generation method and system for financial field
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN114564563A (en) End-to-end entity relationship joint extraction method and system based on relationship decomposition
Fang et al. A method of automatic text summarisation based on long short-term memory
CN115186080A (en) Intelligent question-answering data processing method, system, computer equipment and medium
CN111191439A (en) Natural sentence generation method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant