CN109299211B - Automatic text generation method based on Char-RNN model


Info

Publication number
CN109299211B
Authority
CN
China
Prior art keywords
probability
training
data
char
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811104442.5A
Other languages
Chinese (zh)
Other versions
CN109299211A (en)
Inventor
朱静
黄颖杰
杨晋昌
黄文恺
韩晓英
邓文婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou University
Original Assignee
Guangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou University filed Critical Guangzhou University
Priority to CN201811104442.5A priority Critical patent/CN109299211B/en
Publication of CN109299211A publication Critical patent/CN109299211A/en
Application granted granted Critical
Publication of CN109299211B publication Critical patent/CN109299211B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an automatic text generation method based on a Char-RNN model, which comprises the following steps: S1, acquiring text data meeting the characteristic requirements; S2, modeling the text data, namely representing letters or Chinese characters with a vector matrix to obtain training data; S3, inputting the training data into the Char-RNN model batch by batch for training to obtain, for each character, the probability of the next character, continuously correcting these probabilities as the number of training iterations grows, and saving the trained model result once the preset number of training iterations is reached; and S4, taking the input keyword as the initial characters, using the trained model result to obtain and output the probability corresponding to the next character, feeding the predicted character back as the input of the next step, and continuing in this way to generate a passage of text. Compared with the common RNN model, the method solves the problem of vanishing or exploding gradients when processing long sequence data.

Description

Automatic text generation method based on Char-RNN model
Technical Field
The invention belongs to the technical field of neural networks, and particularly relates to an automatic text generation method based on a Char-RNN model.
Background
The Recurrent Neural Network (RNN) is a classical neural network and the preferred network for time series data. On certain sequential machine learning tasks, the RNN can reach an accuracy that no other algorithm matches. This is because a conventional neural network has no memory of what it has already processed, whereas the RNN has the advantage of a limited short-term memory.
The main purpose of an RNN is to process sequence data. In a traditional neural network model, the layers are fully connected from the input layer through the hidden layer to the output layer, but the nodes within each layer are unconnected. Such an ordinary neural network is powerless against many problems. For example, to predict the next word of a sentence, the previous words are generally needed, because the words in a sentence are not independent of one another. The RNN is called a recurrent neural network because the current output of a sequence also depends on the previous outputs. Concretely, the network memorizes the previous information and applies it to the calculation of the current output; that is, the nodes of the hidden layer are no longer unconnected but are connected across time steps, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment. In theory, an RNN can process sequence data of any length.
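To make the recurrence concrete, the following minimal NumPy sketch shows how the hidden state at each step combines the current input with the hidden state of the previous moment; the layer sizes, weight initialization and the helper name rnn_step are illustrative assumptions and not details taken from the patent.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrent step: the new hidden state depends on the current input
    and on the hidden state from the previous moment."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# toy sizes: 26-dimensional one-hot input, 32-dimensional hidden state
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(26, 32))
W_hh = rng.normal(size=(32, 32))
b_h = np.zeros(32)

h = np.zeros(32)
for x_t in np.eye(26)[[0, 1, 2]]:           # feed the letters a, b, c one after another
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)   # h carries the memory of the earlier letters
```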
At present many papers apply CNNs to NLP, but the most widely used model at the intersection of deep learning and NLP is the RNN, because a text can be naturally expressed as an input sequence, which the RNN processes conveniently while capturing information such as long-term dependencies, giving good results in practice. RNNs have many applications, one of which is the combination with Natural Language Processing (NLP). Many people on the web have demonstrated RNNs and created surprisingly effective models.
RNN-based systems can learn tasks such as translating languages, controlling robots, image analysis, document summarization, speech recognition, image recognition, handwriting recognition, controlling chatbots, predicting diseases, click-through rates and stock prices, synthesizing music, and so on. For example, in 2015 Google greatly improved the speech recognition capability of Android phones and other devices through an RNN program trained with CTC; Apple's iPhone uses RNNs in QuickType and Siri; Microsoft uses RNNs not only for speech recognition but also for generating virtual dialogue figures and for programming program code. Amazon Alexa communicates with you at home through a bidirectional RNN, while Google uses RNNs even more widely: they generate image captions and the automatic e-mail replies included in the intelligent assistant Allo, and have significantly improved the quality of Google Translate since 2016. In fact, a significant portion of the computing resources of Google's data centers now run RNN tasks.
The RNN is also well established in content recommendation; it is the subject of continuing in-depth research in academia and has begun to be applied in large enterprises and Internet companies. For example, Netflix co-authored a paper at ICLR 2016 that describes how to use an RNN to make video recommendations based on a user's short-term behavior data.
The RNN also has a place in automatic text generation. In the prior art, RNN language models have been trained on large inputs such as the works of Shakespeare and, after training, generate their own Shakespeare-style poems that are difficult to distinguish from the original. The RNN model used for automatic text generation is the Char-RNN, so how to generate text automatically with the Char-RNN model is one of the research directions of those skilled in the art.
Disclosure of Invention
The main object of the invention is to overcome the defects of the prior art and provide an automatic text generation method based on a Char-RNN model, so as to solve the problem of vanishing or exploding gradients when processing long sequence data.
In order to achieve the purpose, the invention adopts the following technical scheme:
The invention provides an automatic text generation method based on a Char-RNN model, which comprises the following steps:
S1, acquiring text data meeting the characteristic requirements;
S2, modeling the acquired text data, namely representing letters or Chinese characters with a vector matrix to obtain training data;
S3, inputting the training data into the Char-RNN model batch by batch for training to obtain, for each character, the probability of the next character, continuously correcting these probabilities as the number of training iterations grows, and saving the trained model result once the preset number of training iterations is reached;
and S4, taking the input keyword as the initial characters, using the trained model result to obtain and output the probability corresponding to the next character, feeding the predicted character back as the input of the next step, and continuing in this way to generate a passage of text.
As a preferable technical solution, in step S1, the text data meeting the feature requirement satisfies the following requirements:
the text type and sentence style characteristics are similar;
uniformly stored as txt documents in UTF-8 encoding format;
the same language is used.
As a preferred technical solution, in step S2, when a vector matrix is used to represent letters, each letter is represented by a one-hot vector and then input to the network in sequence; one-hot encoding uses a binary vector whose length is the size of the vocabulary.
As a preferred technical solution, for the 26 lowercase English letters, the letter a is represented with the first digit set to 1 and the other 25 digits set to 0, i.e., (1,0,0, …, 0); the letter b is represented with the second digit set to 1 and the other 25 digits set to 0, i.e., (0,1,0, …, 0); and so on. The output is equivalent to a 26-class classification problem, so the vector output at each step is also 26-dimensional.
As a preferred technical solution, in step S2, when the vector matrix is used to represent Chinese characters, an embedding processing layer is added before the Chinese characters are processed, and individual Chinese characters are classified according to their actual meaning, so that the large number of Chinese characters is converted into a denser representation and a better effect is obtained.
As a preferred technical solution, the Char-RNN model is an N VS N model, and a Dropout layer is added for data processing to reduce overfitting; in the Char-RNN model, the principle of Dropout is to randomly ignore some of the connections so that the whole neural network becomes incomplete; after the incomplete network has been trained once, other connections are randomly ignored the second time and the network is trained again, and so on; the Dropout layer holds a keep_prob parameter that indicates the probability of randomly retaining data.
As a preferred technical solution, in step S3, a Softmax layer is added to the Char-RNN model. Softmax is a function whose mathematical definition is: given an array V in which vi denotes the i-th element, the Softmax value of that element is Si = e^(vi) / Σj e^(vj); that is, Softmax maps the raw numerical outputs to values in (0,1) whose sum is 1. The Softmax function is a tool that helps obtain the probability; therefore the intermediate calculation result is transformed to obtain logits, which are then passed through Softmax to obtain the output, i.e., the predicted probability, and the character with the largest probability can be selected as the output according to that probability.
As a preferred technical solution, in step S3, when defining the loss, the loss is obtained as the cross entropy between the predicted probability and the one-hot encoding of the training target, i.e., the next letter corresponding to each input letter.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) In text processing, compared with the sentence- and word-level granularity of other models, the invention is accurate to the character level, and character-level coherence can be learned.
(2) Compared with the common RNN model, the invention solves the problem of vanishing or exploding gradients when the RNN processes long sequence data.
(3) Compared with an RNN model whose input is at the sentence or word level, the method can solve the problem of unknown words.
(4) When the invention trains on a Chinese corpus, because the input unit is each individual Chinese character, no Chinese word segmentation is needed, which removes a comparatively complex segmentation step and avoids the errors that segmentation may introduce.
Drawings
FIG. 1 is a schematic diagram of the classical RNN structure of "N VS N";
FIG. 2 is a schematic diagram of the Char-RNN using an already entered character to predict the next character;
FIG. 3 is a flow chart of a method for automatic text generation in accordance with the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Examples
The invention relates to an automatic text generation method based on a Char-RNN model, whose main principle is as follows: first, with the provided text data serving as the training sample data set, the text data in the data set is preprocessed into sample data for training (the characters are encoded and a mapping is established; letters are expressed with one-hot vectors, and an embedding layer must be added for Chinese characters); second, the processed training data is used as the input of the Char-RNN model for training, and the model output is produced.
Char-RNN is a character-level recurrent neural network, originally described in The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy. It is well known that RNNs are very good at handling sequence problems, in which successive items of sequence data are strongly correlated; the RNN embodies this through the sharing of weights and biases across elements and through recurrent computation (previously processed information is used when processing subsequent information). The Char-RNN model generates text at the level of individual characters, i.e., it predicts the probability of the next character from the characters already observed, which amounts to guessing the continuation of sequence data. Most of the deep-learning song, poem and novel writers introduced on the Internet are based on this method.
Samples with a sequential relation among them are called sequence samples; for example, in a text a word is related to the preceding word, and in soil data the temperature at one moment is correlated with the previous temperature. Unlike a traditional neural network whose inputs and outputs are fixed, the Char-RNN allows sequences of input and output vectors and can reuse existing information. Therefore, using Char-RNN to analyze various types of sequential text has become the first choice.
A language model is required for a machine to generate text. A language model evaluates the probability that a sentence is natural, i.e., it estimates the probability of the next word from the words already present in the sentence. Char-RNN is suitable for processing sequence data: it can extract a digest of a sequence of any length (Xt, Xt-1, …, X1) and optionally preserve certain aspects of the sequence.
As shown in fig. 1, the Char-RNN model is an "N VS N" structure. Taking English input as an example, the input sequence is the letters of a sentence, and the output at each step is the next letter following that input; in other words, the model predicts the probability of the next letter from the letters already input.
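As a rough illustration of this "N VS N" character-level structure, the sketch below uses PyTorch with an LSTM cell; the class name CharRNN, all layer sizes, and the choice of an LSTM cell are assumptions made for illustration rather than details fixed by the patent.

```python
import torch
import torch.nn as nn

class CharRNN(nn.Module):
    """Minimal character-level N-vs-N model: one output per input step."""
    def __init__(self, vocab_size, hidden_size=128, num_layers=2, keep_prob=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)   # dense lookup; one-hot also works for small alphabets
        self.rnn = nn.LSTM(hidden_size, hidden_size, num_layers,
                           batch_first=True, dropout=1.0 - keep_prob)
        self.fc = nn.Linear(hidden_size, vocab_size)          # scores (logits) over the next character

    def forward(self, x, state=None):
        # x: (batch, seq_len) integer character ids
        h = self.embed(x)                  # (batch, seq_len, hidden)
        out, state = self.rnn(h, state)    # one hidden vector per input character
        logits = self.fc(out)              # (batch, seq_len, vocab): next-character scores
        return logits, state
```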
The technical scheme of the invention is further explained as follows:
as shown in fig. 3, the text automatic generation method based on the Char-RNN model in this embodiment includes the following steps:
step S1: and acquiring text data meeting the characteristic requirements.
In step S1, the acquired text data needs to meet the following requirements:
(1) the text type and sentence style characteristics are relatively close;
(2) uniformly stored as txt documents in UTF-8 encoding format;
(3) the same language is used (e.g., all English or all Chinese; different languages may not be mixed together).
Step S2: modeling the text data, mainly by using a vector matrix to represent letters or Chinese characters, to obtain training data.
The modeling of text data mainly uses vector matrices to represent characters. English letters are represented with one-hot encoding and then input to the network in sequence.
Take the 26 lowercase English letters as an example: the letter a is represented with the first digit set to 1 and the other 25 digits set to 0, i.e., (1,0,0, …, 0); the letter b is represented with the second digit set to 1 and the other 25 digits set to 0, i.e., (0,1,0, …, 0); and so on. The output is equivalent to a 26-class classification problem, so the vector output at each step is also 26-dimensional. In practical use, because letters are also distinguished by case and other punctuation marks occur, the one-hot vectors may encode more symbols than just the 26 letters.
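A minimal sketch of this one-hot representation for the 26 lowercase letters follows; the alphabet ordering and the helper name one_hot are illustrative assumptions.

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz"            # the 26 lowercase letters
CHAR_TO_ID = {c: i for i, c in enumerate(ALPHABET)}

def one_hot(char):
    """Return the 26-dimensional one-hot vector for a single letter."""
    vec = np.zeros(len(ALPHABET), dtype=np.float32)
    vec[CHAR_TO_ID[char]] = 1.0
    return vec

print(one_hot("a"))   # [1. 0. 0. ... 0.]
print(one_hot("b"))   # [0. 1. 0. ... 0.]
```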
If training is performed on Latin letters such as English letters, the vector matrix representation can be completed directly with simple one-hot encoding because the number of letters is small (there are only 26 English letters, for example). When the vector matrix is used to represent Chinese characters, however, the number of characters is huge; if one-hot encoding were still used, the vectors would be extremely long and the representation far too sparse. Embedding (word embedding) can express Chinese characters with low-dimensional vectors and reduce the number of feature dimensions. The embedding representation projects characters into a continuous vector space in which the position of each character reflects semantic relatedness; for example, the character "male" is closer to "female" than to "cat", because "male" and "female" are more similar. Therefore, an embedding layer needs to be added before the Chinese characters are processed, classifying individual Chinese characters according to their actual meaning, so that the large number of Chinese characters is converted into a dense representation and a better effect is obtained.
In the modeling of the text data, Chinese characters therefore require a processing method different from that of English letters: an embedding layer must be added before the Chinese characters are processed, which classifies individual Chinese characters according to their actual meaning, converting the large number of Chinese characters into a dense representation and achieving a better effect. The embedding parameter variables are obtained through training.
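A hedged sketch of such an embedding layer follows; the vocabulary size, embedding dimension and example character ids are assumed values for illustration, and in line with the description the embedding parameters would be learned during training.

```python
import torch
import torch.nn as nn

vocab_size = 5000      # assumed number of distinct Chinese characters in the corpus
embed_dim = 128        # assumed dense representation size, far smaller than vocab_size

embedding = nn.Embedding(vocab_size, embed_dim)    # parameters learned jointly with the RNN

char_ids = torch.tensor([[17, 402, 3981]])         # a batch containing three character ids
dense = embedding(char_ids)                        # shape (1, 3, 128): dense vectors instead of sparse one-hot
```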
Step S3: inputting the training data into the Char-RNN model batch by batch for training. The purpose of training is to obtain, for each character, the probability of the next character; these probabilities are continuously corrected as the number of training iterations grows, and the trained model result is saved once the preset number of training iterations is reached.
In step S3, the training model is a multi-layer N VS N model, to which a Dropout layer for data processing is added to reduce overfitting.
In the Char-RNN model, the principle of Dropout is to randomly ignore some of the connections, so that the whole neural network becomes "incomplete"; after the "incomplete" network has been trained once, other connections are randomly ignored the second time and the network is trained again, and so on. In this way, no prediction depends too heavily on one specific part of the data over the whole training process. In essence, Dropout denies the neural network the chance to over-rely on particular units, thereby reducing overfitting. In the Char-RNN model, the Dropout layer stores a keep_prob parameter, which indicates the probability of randomly retaining data; for example, keep_prob = 0.5 means that half of the data is randomly kept for training. This embodiment implements the data processing of the Dropout layer through this one key parameter to prevent overfitting.
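A small sketch of the keep_prob idea is given below; the keep_prob name follows the convention mentioned above, while the PyTorch Dropout layer used here is parameterised by the drop probability, i.e. 1 - keep_prob, an implementation detail assumed for illustration.

```python
import torch
import torch.nn as nn

keep_prob = 0.5                              # probability of keeping each activation
dropout = nn.Dropout(p=1.0 - keep_prob)      # PyTorch expects the *drop* probability

x = torch.ones(4, 8)
dropout.train()                              # dropout is active only during training
print(dropout(x))                            # roughly half of the entries are zeroed, the rest rescaled
dropout.eval()
print(dropout(x))                            # at inference time the layer acts as an identity
```

Because a different random subset of connections is dropped in every training pass, each prediction is forced not to rely on any one fixed part of the network, which is exactly the over-fitting reduction described above.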
The output of the training model must yield the final classification probability, so a Softmax layer is added. Softmax is a function whose mathematical definition is: given an array V in which vi denotes the i-th element, the Softmax value of that element is Si = e^(vi) / Σj e^(vj). That is, Softmax maps the raw numerical outputs to values in (0,1) whose sum is 1 (satisfying the property of a probability). The Softmax function is a tool that helps obtain the probability. Therefore, the intermediate calculation result is transformed to obtain logits, which are then passed through Softmax to obtain the output, i.e., the predicted probability, and the character with the largest probability can be selected as the output according to that probability.
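A minimal, numerically stable sketch of the Softmax mapping described above follows; the stabilising max-subtraction is a common implementation detail and is not required by the patent.

```python
import numpy as np

def softmax(logits):
    """Map raw scores (logits) to probabilities in (0, 1) that sum to 1."""
    z = logits - np.max(logits)          # subtract the maximum for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)                  # approximately [0.659, 0.242, 0.099]
print(probs, probs.sum())                # the probabilities sum to 1.0
next_char_id = int(np.argmax(probs))     # pick the most probable character as the output
```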
When defining the loss, the loss is obtained as the cross entropy between the predicted probability and the one-hot encoding of the training target (i.e., the next letter corresponding to each input letter).
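A hedged sketch of one training step with this loss is given below, reusing the hypothetical CharRNN class sketched earlier. In this sketch the framework's cross-entropy takes raw logits and integer targets, which is mathematically equivalent to comparing the Softmax probabilities with the one-hot codes; the optimiser, learning rate and vocabulary size are assumptions.

```python
import torch
import torch.nn as nn

model = CharRNN(vocab_size=64)        # hypothetical CharRNN class from the earlier sketch
criterion = nn.CrossEntropyLoss()     # framework form of the cross entropy described above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(inputs, targets):
    """inputs: (batch, seq_len) character ids; targets: the same text shifted by one character,
    i.e. the next letter corresponding to each input letter."""
    logits, _ = model(inputs)                               # (batch, seq_len, vocab) raw scores
    loss = criterion(logits.reshape(-1, logits.size(-1)),   # each step's prediction ...
                     targets.reshape(-1))                    # ... against the true next character id
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```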
Step S4: taking the input keyword as the initial characters, using the trained model result to obtain and output the probability corresponding to the next character, feeding the predicted character back as the input of the next step, and continuing in this way to generate a passage of text. As shown in FIG. 2, after inputting a phrase such as the classical Chinese poem line "锄禾日当午", the output following the character "锄" (to hoe) is most likely "禾" (grain), and so on.
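A sketch of this step-by-step generation loop follows, again reusing the hypothetical CharRNN class; the sampling strategy (multinomial sampling rather than always taking the arg-max) and the helper names are assumptions, the essential point being that the predicted character is fed back as the next input.

```python
import torch

def generate(model, seed_ids, length, id_to_char):
    """Feed the seed keyword, then repeatedly feed back the predicted character."""
    model.eval()
    state = None
    x = torch.tensor([seed_ids])                         # (1, len(seed)) seed character ids
    out_chars = [id_to_char[i] for i in seed_ids]
    with torch.no_grad():
        for _ in range(length):
            logits, state = model(x, state)              # scores for the next character
            probs = torch.softmax(logits[0, -1], dim=-1) # probability of each possible next character
            next_id = int(torch.multinomial(probs, 1))   # sample (or take arg-max of) the next character
            out_chars.append(id_to_char[next_id])
            x = torch.tensor([[next_id]])                # the prediction becomes the next input
    return "".join(out_chars)
```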
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (4)

1. An automatic text generation method based on a Char-RNN model, characterized by comprising the following steps:
S1, acquiring text data meeting the characteristic requirements;
S2, modeling the acquired text data, namely representing letters or Chinese characters with a vector matrix to obtain training data; when a vector matrix is used to represent letters, each letter is represented by a one-hot vector and then input to the network in sequence; one-hot encoding uses a binary vector whose length is the size of the vocabulary; for the 26 lowercase English letters, the letter a is represented with the first digit set to 1 and the other 25 digits set to 0, i.e., the vector (1,0,0, …, 0), the letter b is represented with the second digit set to 1 and the other 25 digits set to 0, i.e., the vector (0,1,0, …, 0), and so on; the output is equivalent to a 26-class classification problem, so the vector output at each step is also 26-dimensional;
when the vector matrix is used to represent Chinese characters, a processing layer needs to be added before the Chinese characters are processed, and individual Chinese characters are classified according to their actual meaning, so that the large number of Chinese characters is converted into a dense representation and a better effect is obtained;
S3, inputting the training data into the Char-RNN model batch by batch for training to obtain, for each character, the probability of the next character, continuously correcting these probabilities as the number of training iterations grows, and saving the trained model result once the preset number of training iterations is reached; a Softmax layer is added to the Char-RNN model for processing, where Softmax is a function that maps the raw numerical outputs to values in (0,1) whose sum is 1; the Softmax function is a tool that helps obtain the probability, so the intermediate calculation result is transformed to obtain logits, which are then passed through Softmax to obtain the output, i.e., the predicted probability, and the character with the largest probability is selected as the output according to that probability;
and S4, taking the input keyword as the initial characters, using the trained model result to obtain and output the probability corresponding to the next character, feeding the predicted character back as the input of the next step, and continuing in this way to generate a passage of text.
2. The method as claimed in claim 1, wherein in step S1, the text data meeting the feature requirement satisfies the following requirements:
the text type and sentence style characteristics are similar;
uniformly stored as txt documents in UTF-8 encoding format;
the same language is used.
3. The method as claimed in claim 1, wherein the Char-RNN model is an N VS N model, and a Dropout layer is added for data processing to reduce overfitting; in the Char-RNN model, the principle of Dropout is to randomly ignore some of the connections so that the whole neural network becomes incomplete; after the incomplete network has been trained once, other connections are randomly ignored the second time and the network is trained again, and so on; the Dropout layer holds a keep_prob parameter that indicates the probability of randomly retaining data.
4. The method as claimed in claim 1, wherein in step S3, when defining the loss, the loss is obtained as the cross entropy between the predicted probability and the one-hot encoding of the training target, i.e., the next letter corresponding to each input letter.
CN201811104442.5A 2018-09-21 2018-09-21 Automatic text generation method based on Char-RNN model Active CN109299211B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811104442.5A CN109299211B (en) 2018-09-21 2018-09-21 Automatic text generation method based on Char-RNN model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811104442.5A CN109299211B (en) 2018-09-21 2018-09-21 Automatic text generation method based on Char-RNN model

Publications (2)

Publication Number Publication Date
CN109299211A CN109299211A (en) 2019-02-01
CN109299211B true CN109299211B (en) 2021-06-29

Family

ID=65163966

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811104442.5A Active CN109299211B (en) 2018-09-21 2018-09-21 Automatic text generation method based on Char-RNN model

Country Status (1)

Country Link
CN (1) CN109299211B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287478B (en) * 2019-05-15 2023-05-23 广东工业大学 Machine writing system based on natural language processing technology
WO2020240872A1 (en) * 2019-05-31 2020-12-03 株式会社 AI Samurai Patent text generating device, patent text generating method, and patent text generating program
CN112307820B (en) * 2019-07-29 2022-03-22 北京易真学思教育科技有限公司 Text recognition method, device, equipment and computer readable medium
CN111222320B (en) * 2019-12-17 2020-10-20 共道网络科技有限公司 Character prediction model training method and device
CN111325095B (en) * 2020-01-19 2024-01-30 西安科技大学 Intelligent detection method and system for equipment health state based on acoustic wave signals
CN112329779A (en) * 2020-11-02 2021-02-05 平安科技(深圳)有限公司 Method and related device for improving certificate identification accuracy based on mask

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10503827B2 (en) * 2016-09-23 2019-12-10 International Business Machines Corporation Supervised training for word embedding
CN107665254A (en) * 2017-09-30 2018-02-06 济南浪潮高新科技投资发展有限公司 A kind of menu based on deep learning recommends method
CN108197294B (en) * 2018-01-22 2021-10-22 桂林电子科技大学 Text automatic generation method based on deep learning
CN108334497A (en) * 2018-02-06 2018-07-27 北京航空航天大学 The method and apparatus for automatically generating text

Also Published As

Publication number Publication date
CN109299211A (en) 2019-02-01

Similar Documents

Publication Publication Date Title
CN109299211B (en) Automatic text generation method based on Char-RNN model
CN109635109B (en) Sentence classification method based on LSTM and combined with part-of-speech and multi-attention mechanism
CN109697232B (en) Chinese text emotion analysis method based on deep learning
CN108920622B (en) Training method, training device and recognition device for intention recognition
CN108416058B (en) Bi-LSTM input information enhancement-based relation extraction method
CN110059188B (en) Chinese emotion analysis method based on bidirectional time convolution network
CN110704576B (en) Text-based entity relationship extraction method and device
CN111209401A (en) System and method for classifying and processing sentiment polarity of online public opinion text information
CN111401061A (en) Method for identifying news opinion involved in case based on BERT and Bi L STM-Attention
CN112214604A (en) Training method of text classification model, text classification method, device and equipment
CN113190656B (en) Chinese named entity extraction method based on multi-annotation frame and fusion features
US11783179B2 (en) System and method for domain- and language-independent definition extraction using deep neural networks
CN111666758A (en) Chinese word segmentation method, training device and computer readable storage medium
CN113536795B (en) Method, system, electronic device and storage medium for entity relation extraction
CN113626589A (en) Multi-label text classification method based on mixed attention mechanism
CN112183083A (en) Abstract automatic generation method and device, electronic equipment and storage medium
CN112766319A (en) Dialogue intention recognition model training method and device, computer equipment and medium
CN112632258A (en) Text data processing method and device, computer equipment and storage medium
CN113987167A (en) Dependency perception graph convolutional network-based aspect-level emotion classification method and system
CN113821635A (en) Text abstract generation method and system for financial field
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN114564563A (en) End-to-end entity relationship joint extraction method and system based on relationship decomposition
Fang et al. A method of automatic text summarisation based on long short-term memory
CN115186080A (en) Intelligent question-answering data processing method, system, computer equipment and medium
CN111191439A (en) Natural sentence generation method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant