CN112613282A - Text generation method and device and storage medium

Text generation method and device and storage medium

Info

Publication number: CN112613282A
Application number: CN202011625631.4A
Authority: CN (China)
Prior art keywords: data set, text, vector, vectors, training model
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 蔡晓东, 高铸成
Current Assignee: Guilin University of Electronic Technology
Original Assignee: Guilin University of Electronic Technology
Application filed by Guilin University of Electronic Technology
Priority to: CN202011625631.4A
Publication of: CN112613282A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/10 - Text processing
    • G06F 40/12 - Use of codes for handling textual entities
    • G06F 40/151 - Transformation
    • G06F 40/20 - Natural language analysis
    • G06F 40/205 - Parsing
    • G06F 40/216 - Parsing using statistical methods
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a text generation method, a text generation device and a storage medium. The method comprises the following steps: importing a plurality of data sets and performing a splicing calculation on each data set to obtain a data set embedding vector corresponding to each data set; constructing a training model and updating the parameters of the training model according to the plurality of data set embedding vectors and the plurality of data sets to obtain an updated training model; and training the text to be trained according to the updated training model to obtain a generated text. Compared with the prior art, the method can accurately select the important content in a record to generate text, perform update iteration through text similarity, and optimize text generation quality, thereby improving the accuracy and fluency of the generated text and solving the problems of low accuracy and redundancy in the generated text.

Description

Text generation method and device and storage medium
Technical Field
The invention relates to the technical field of language processing, and in particular to a text generation method, a text generation device and a storage medium.
Background
Data-to-text generation is an important and challenging task in natural language generation. It aims to convert non-linguistic input into text output and can be applied to many real-world natural language generation scenarios, such as automatically writing sports reports or weather forecasts from data. However, existing encoding-decoding generation models suffer from low accuracy and redundancy in the generated text.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a method, an apparatus and a storage medium for generating a text, aiming at the defects of the prior art.
The technical scheme for solving the technical problems is as follows: a text generation method comprises the following steps:
importing a plurality of data sets, and respectively carrying out splicing calculation on each data set to obtain a data set embedding vector corresponding to each data set;
constructing a training model, and updating parameters of the training model according to the plurality of data set embedded vectors and the plurality of data sets to obtain an updated training model;
and training the text to be trained according to the updated training model to obtain a generated text.
Another technical solution of the present invention for solving the above technical problems is as follows: a text generation apparatus comprising:
the splicing calculation module is used for importing a plurality of data sets and respectively carrying out splicing calculation on the data sets to obtain data set embedding vectors corresponding to the data sets;
the parameter updating module is used for constructing a training model and updating parameters of the training model according to the data set embedding vectors and the data sets to obtain an updated training model;
and the generated text obtaining module is used for training the text to be trained according to the updated training model to obtain a generated text.
The invention has the beneficial effects that: the method comprises the steps of obtaining data set embedded vectors corresponding to the data sets through splicing calculation of the data sets, updating parameters of a training model according to the data set embedded vectors and the data sets to obtain an updated training model, and obtaining a generated text through training of the training text to be trained according to the updated training model.
Drawings
Fig. 1 is a schematic flowchart of a text generation method according to an embodiment of the present invention;
fig. 2 is a block diagram of a text generation apparatus according to an embodiment of the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
Fig. 1 is a schematic flowchart of a text generation method according to an embodiment of the present invention.
As shown in fig. 1, a text generation method includes the following steps:
importing a plurality of data sets, and respectively carrying out splicing calculation on each data set to obtain a data set embedding vector corresponding to each data set;
constructing a training model, and updating parameters of the training model according to the plurality of data set embedded vectors and the plurality of data sets to obtain an updated training model;
and training the text to be trained according to the updated training model to obtain a generated text.
It should be understood that the data set may be an event log table.
In the embodiment, the data set embedded vectors corresponding to the data sets are obtained through splicing calculation of the data sets respectively, the updated training model is obtained through updating the parameters of the training model according to the data set embedded vectors and the data sets, and the generated text is obtained through training the text to be trained according to the updated training model.
Optionally, as an embodiment of the present invention, the process of performing the splicing calculation on each data set to obtain the data set embedding vector corresponding to each data set includes:
and respectively carrying out splicing calculation on each data set through a first formula to obtain a data set embedding vector corresponding to each data set, wherein the first formula is as follows:
r_j = ReLU(W_r[r_{j,1}; r_{j,2}; ...; r_{j,K}] + b_r),
where r_j is the embedding vector of the jth data set, ReLU is the activation function, W_r is a weight matrix, [r_{j,1}; r_{j,2}; ...; r_{j,K}] is the data in the jth data set, b_r is a bias vector, and [;] denotes concatenation between vectors.
It should be understood that the data in the dataset may be various types of attributes in a record.
It should be understood that all data in the data set are passed through an embedding method and spliced to obtain an embedded representation of the record, i.e., the data set embedding vector.
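To make the splicing step concrete, the following PyTorch sketch embeds each attribute of a record, concatenates the attribute embeddings, and applies a linear layer with ReLU, as in the first formula. It is a minimal sketch under assumed dimensions; RecordEmbedder, attr_vocab_sizes and the (entity, field, value) layout are illustrative assumptions, not the patent's reference implementation.

```python
import torch
import torch.nn as nn

class RecordEmbedder(nn.Module):
    """Sketch of the first formula: r_j = ReLU(W_r[r_{j,1}; ...; r_{j,K}] + b_r)."""

    def __init__(self, attr_vocab_sizes, attr_dim=64, record_dim=256):
        super().__init__()
        # One embedding table per attribute type in a record (hypothetical setup).
        self.attr_embeddings = nn.ModuleList(
            [nn.Embedding(v, attr_dim) for v in attr_vocab_sizes]
        )
        # W_r and b_r from the first formula.
        self.proj = nn.Linear(attr_dim * len(attr_vocab_sizes), record_dim)

    def forward(self, attr_ids):
        # attr_ids: (batch, K) integer ids, one per attribute of the record.
        parts = [emb(attr_ids[:, k]) for k, emb in enumerate(self.attr_embeddings)]
        concat = torch.cat(parts, dim=-1)        # [r_{j,1}; ...; r_{j,K}]
        return torch.relu(self.proj(concat))     # r_j

# Usage: a record with K=3 attributes, e.g. (entity, field, value) ids.
embedder = RecordEmbedder(attr_vocab_sizes=[100, 20, 500])
r_j = embedder(torch.tensor([[3, 7, 42]]))       # shape (1, 256)
```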
In this embodiment, the data set embedding vector corresponding to each data set is obtained through the splicing calculation of the first formula, which provides data support for subsequent processing. Important content in the records can thus be accurately selected to generate text, update iteration is performed through text similarity, and the text generation quality is optimized; compared with the prior art, the accuracy and fluency of the generated text can be improved, and the problems of low accuracy and redundancy in the generated text are solved.
Optionally, as an embodiment of the present invention, the constructing a training model, and performing parameter update on the training model according to the multiple data set embedded vectors and the multiple data sets, to obtain an updated training model includes:
constructing a training model based on an OpenNMT-py encoding and decoding model, wherein the training model comprises an encoding layer and a decoding layer;
respectively inputting each data set embedding vector into the coding layer, and respectively carrying out coding analysis on each data set embedding vector through the coding layer to obtain a plurality of data set updating vectors;
inputting the plurality of data set updating vectors into the decoding layer, and performing final-text calculation on the plurality of data set updating vectors through the decoding layer to obtain a final text and a plurality of text original probabilities;
performing loss calculation on the plurality of text original probabilities according to the final text to obtain an update loss function;
and training the training model according to the update loss function to obtain an updated training model.
It should be understood that OpenNMT-py is the PyTorch version of OpenNMT and is widely used for machine translation, automatic summarization, text generation, language morphology and many other fields. OpenNMT-py is an encoding-decoding framework in which the encoding-decoding route and the loss feedback method are already defined, so a user can train a custom model simply by specifying the required encoder, decoder and loss calculation method.
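For orientation only, the sketch below shows a toy encoder-decoder training step written in plain PyTorch rather than through the OpenNMT-py API, to illustrate the division of labour described above (the user supplies an encoder, a decoder and a loss; the framework handles the feedback loop). All names and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ToySeq2Seq(nn.Module):
    """Toy encoder-decoder; NOT the OpenNMT-py API, just the same overall shape."""

    def __init__(self, vocab=1000, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, src, tgt_in):
        _, state = self.encoder(self.embed(src))          # encode the source records
        dec_out, _ = self.decoder(self.embed(tgt_in), state)
        return self.out(dec_out)                          # logits over the vocabulary

model = ToySeq2Seq()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

src = torch.randint(0, 1000, (2, 12))                     # batch of source sequences
tgt = torch.randint(0, 1000, (2, 8))                      # batch of target sentences
optimizer.zero_grad()
logits = model(src, tgt[:, :-1])                          # teacher forcing
loss = loss_fn(logits.reshape(-1, 1000), tgt[:, 1:].reshape(-1))
loss.backward()                                           # loss feedback step
optimizer.step()
```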
In the embodiment, the training model is constructed, parameters of the training model are updated according to the multiple data set embedded vectors and the multiple data sets to obtain the updated training model, important contents in records can be accurately selected to generate texts, updating iteration is carried out according to the similarity of the texts, the text generation quality is optimized, compared with the prior art, the accuracy and the fluency of the generated texts can be improved, and the problems that the accuracy rate of the generated texts is low and redundancy exists are solved.
Optionally, as an embodiment of the present invention, the process of obtaining a plurality of data set update vectors by performing coding analysis on each data set embedded vector by the coding layer respectively includes:
respectively inputting each data set embedded vector into a hidden layer, and respectively coding each data set embedded vector through the pre-built hidden layer to obtain a plurality of data set coding vectors and hidden layer output vectors corresponding to the data set coding vectors;
and respectively carrying out content screening calculation on each data set encoding vector and the hidden layer output vector corresponding to the data set encoding vector to obtain a plurality of data set updating vectors.
In the embodiment, the plurality of data set update vectors are obtained through the coding analysis of the embedded vectors of each data set by the coding layer, the secondary information can be filtered, the important content in the record can be accurately selected to generate the text, the update iteration is performed through the text similarity, the text generation quality is optimized, compared with the prior art, the accuracy and the fluency of the generated text can be improved, and the problems that the generated text is low in accuracy rate and has redundancy are solved.
Optionally, as an embodiment of the present invention, the process of respectively inputting each data set embedding vector into a hidden layer, and respectively performing encoding processing on each data set embedding vector by using the pre-built hidden layer to obtain a plurality of data set encoding vectors and hidden layer output vectors corresponding to the data set encoding vectors includes:
performing mean pooling encoding on each data set embedding vector through a second formula to obtain a plurality of data set encoding vectors, wherein the second formula is as follows:
r̄_j = MeanPooling(r_j),
where r_j is the embedding vector of the jth data set, MeanPooling denotes mean pooling, and r̄_j is the encoding vector of the jth data set;
and respectively performing hidden information extraction on each data set embedding vector to obtain the hidden layer output vector corresponding to the data set encoding vector.
It will be appreciated that the embedded representation of the records is mean-pooling encoded, resulting in a vector representation of the records, i.e. the data set encoding vector.
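A minimal sketch of the mean-pooling encoding of the second formula, assuming the embedded representation of a record is laid out as K attribute vectors so that the mean is taken over the attribute dimension (that per-attribute layout is an assumption made for illustration):

```python
import torch

def mean_pool_records(attr_vectors: torch.Tensor) -> torch.Tensor:
    """Second formula (sketch): encode each record by mean pooling.

    attr_vectors: (num_records, K, dim) - K attribute vectors per record
                  (the per-attribute layout is an assumption for illustration).
    returns:      (num_records, dim)    - one encoding vector per record, r̄_j.
    """
    return attr_vectors.mean(dim=1)

# Usage: 5 records, 4 attributes each, 256-dim attribute vectors.
encoded = mean_pool_records(torch.randn(5, 4, 256))   # shape (5, 256)
```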
In the embodiment, the mean value pooling coding of each data set embedded vector is respectively carried out through the second formula to obtain a plurality of data set coding vectors, the hidden information of each data set embedded vector is respectively extracted to obtain the hidden layer output vector corresponding to the data set coding vector, important contents in records can be accurately selected to generate texts, updating iteration is carried out through the text similarity, the text generation quality is optimized, compared with the prior art, the accuracy and the fluency of the generated texts can be improved, and the problems that the generated texts are not high in accuracy rate and have redundancy are solved.
Optionally, as an embodiment of the present invention, the process of performing content screening calculation on the plurality of data set encoding vectors and the hidden layer output vector corresponding to the data set encoding vectors respectively to obtain a plurality of data set update vectors includes:
respectively performing content screening calculation on each data set encoding vector and the hidden layer output vector corresponding to the data set encoding vector through a third formula to obtain a plurality of data set update vectors, wherein the third formula is as follows:
e_j = r̄_j ⊙ sGate,
wherein
sGate = σ(W_s r̄_j + U_s h_n + b),
where h_n is the hidden layer output vector, W_s and U_s are weight matrices, r̄_j is the encoding vector of the jth data set, ⊙ denotes the element-wise product, e_j is the update vector of the jth data set, σ is the sigmoid function, b is a bias vector, and sGate is the selection gate.
It should be understood that content screening is performed on the encoding layer record vector, i.e., the data set encoding vector, filtering the secondary information in the record, and updating the vector representation of the record, i.e., the data set update vector.
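The selection gate of the third formula might be sketched as follows in PyTorch; SelectGate, the shared dimensionality and the bias placement are illustrative assumptions rather than the patent's implementation.

```python
import torch
import torch.nn as nn

class SelectGate(nn.Module):
    """Sketch of the third formula: sGate = σ(W_s r̄_j + U_s h_n + b), e_j = r̄_j ⊙ sGate."""

    def __init__(self, dim=256):
        super().__init__()
        self.W_s = nn.Linear(dim, dim, bias=False)
        self.U_s = nn.Linear(dim, dim, bias=True)    # carries the bias vector b

    def forward(self, record_enc, hidden_out):
        # record_enc: (num_records, dim) encoding vectors r̄_j
        # hidden_out: (1, dim) hidden layer output vector h_n
        gate = torch.sigmoid(self.W_s(record_enc) + self.U_s(hidden_out))
        return record_enc * gate                      # e_j: filtered record vectors

# Usage: screen 5 record encodings with one hidden-layer output vector.
gate = SelectGate(dim=256)
e = gate(torch.randn(5, 256), torch.randn(1, 256))   # shape (5, 256)
```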
In the embodiment, the third formula is used for respectively screening and calculating the contents of the plurality of data set coding vectors and the hidden layer output vectors corresponding to the data set coding vectors to obtain the plurality of data set updating vectors, so that the secondary information can be filtered, the important contents in the records can be accurately selected to generate the text, updating iteration is performed through the text similarity, the text generation quality is optimized, compared with the prior art, the accuracy and the smoothness of the generated text can be improved, and the problems that the generated text is low in accuracy and has redundancy are solved.
Optionally, as an embodiment of the present invention, the calculating, by the decoding layer, a final text of the plurality of data set update vectors to obtain a final text and a plurality of original probabilities of the text includes:
calculating content vectors of the plurality of data set updating vectors to obtain content vectors;
respectively calculating the content vector and the data set through a fourth formula to obtain a plurality of text original probabilities, wherein the fourth formula is as follows:
p(y_t | r) = softmax(W_y d_t^att + b_y),
where r is the data set, y_t is the output text, p(y_t | r) is the text original probability, softmax is the activation function, W_y is a weight matrix, d_t^att is the content vector, and b_y is a bias vector;
calculating the plurality of text original probabilities through a fifth formula to obtain the final text, wherein the fifth formula is as follows:
y* = argmax_{y'} p(y' | r),
where p(y' | r) is the text original probability and y* is the final text.
It should be appreciated that the text original probability gives the probability of the correct text based on the content vector and the data set.
Specifically, given the input data set r, the text original probability of the output text y_t is calculated as follows:
p(y_t | r) = softmax(W_y d_t^att + b_y),
where p(y_t | r) is the probability of the output text y_t conditioned on the data set r, softmax is the activation function, W_y is a weight matrix, d_t^att is the content vector, and b_y is a bias vector.
It should be understood that the output text with the highest text original probability is selected as the final generated text, as follows:
y* = argmax_{y'} p(y' | r),
where y' denotes an output text and p(y' | r) its text original probability.
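As a small illustration of the fourth and fifth formulas, the sketch below projects a content vector to vocabulary logits, applies softmax to obtain p(y_t | r), and takes the argmax. Applying this per decoding step, as well as the vocabulary size and dimensions, are assumptions made for illustration.

```python
import torch
import torch.nn as nn

# Sketch of the fourth and fifth formulas: project the content vector to vocabulary
# logits, softmax to get p(y_t | r), then pick the highest-probability output.
vocab_size, dim = 10_000, 256                       # illustrative sizes
W_y = nn.Linear(dim, vocab_size)                    # W_y and b_y from the fourth formula

content_vec = torch.randn(1, dim)                   # d_t^att from the decoder (sixth formula)
probs = torch.softmax(W_y(content_vec), dim=-1)     # p(y_t | r)
y_star = probs.argmax(dim=-1)                       # fifth formula: highest-probability output
```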
In the embodiment, the final text and the original probabilities of the plurality of texts are obtained by calculating the final text of the plurality of data set update vectors through the decoding layer, the secondary information can be filtered, the important content in the record can be accurately selected to generate the text, the update iteration is performed through the text similarity, the text generation quality is optimized, compared with the prior art, the accuracy and the fluency of the generated text can be improved, and the problems that the generated text is low in accuracy rate and has redundancy are solved.
Optionally, as an embodiment of the present invention, the calculating a content vector for the plurality of data set update vectors to obtain a content vector includes:
calculating a content vector for the plurality of data set update vectors by using an LSTM decoder to obtain the content vector, specifically:
calculating the plurality of data set update vectors through a sixth formula to obtain the content vector, wherein the sixth formula is as follows:
d_t^att = tanh(W_c[d_t; q_t]),
wherein q_t = α_{t,1} e_1 + α_{t,2} e_2 + ... + α_{t,j} e_j,
and α_{t,j} = softmax(d_t^T W_a e_j),
where d_t is the hidden state of the LSTM unit at time t, W_a and W_c are weight matrices, α_{t,j} is the attention score with Σ_j α_{t,j} = 1, q_t is the dynamic context vector, e_j is the update vector of the jth data set, d_t^T is the transpose of d_t, tanh is the activation function, and d_t^att is the content vector.
It should be understood that the content vector is used to generate all content preceding a piece of coherent text.
It should be understood that the decoding side captures the most important information in the record through an attention mechanism and uses it to take part in text generation.
Specifically, an LSTM decoder generates text based on the data set update vectors e_j and an attention mechanism, as follows:
let d_t denote the hidden state of the LSTM unit at time t; the attention scores α_{t,j} over the data set update vectors e_j output by the encoder are calculated to obtain the dynamic context vector q_t, and d_t and q_t are then combined into the content vector d_t^att.
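A minimal sketch of this decoding step (sixth formula): the attention scores over the update vectors e_j are computed from the LSTM hidden state d_t, the dynamic context vector q_t is formed, and the two are combined into the content vector. The class name, dimensions and single-query layout are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentionContent(nn.Module):
    """Sketch of the sixth formula: α_{t,j} = softmax(d_t^T W_a e_j),
    q_t = Σ_j α_{t,j} e_j, d_t^att = tanh(W_c [d_t; q_t])."""

    def __init__(self, dim=256):
        super().__init__()
        self.W_a = nn.Linear(dim, dim, bias=False)
        self.W_c = nn.Linear(2 * dim, dim, bias=False)

    def forward(self, d_t, e):
        # d_t: (1, dim) decoder hidden state; e: (num_records, dim) update vectors e_j.
        scores = self.W_a(e) @ d_t.squeeze(0)          # d_t^T W_a e_j for every record j
        alpha = torch.softmax(scores, dim=0)           # attention scores, Σ_j α_{t,j} = 1
        q_t = (alpha.unsqueeze(1) * e).sum(dim=0, keepdim=True)       # dynamic context q_t
        return torch.tanh(self.W_c(torch.cat([d_t, q_t], dim=-1)))    # content vector d_t^att

# Usage: one decoding step over 5 filtered record vectors.
attn = AttentionContent(dim=256)
d_att = attn(torch.randn(1, 256), torch.randn(5, 256))   # shape (1, 256)
```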
In this embodiment, the content vector is obtained by calculating the plurality of data set update vectors through the sixth formula, and the most important information is captured to participate in the text generation process, so that the important content in the records can be accurately selected to generate text; update iteration is performed through text similarity and the text generation quality is optimized. Compared with the prior art, the accuracy and fluency of the generated text can be improved, and the problems of low accuracy and redundancy in the generated text are solved.
Optionally, as an embodiment of the present invention, the process of performing loss calculation on a plurality of original probabilities of the texts according to the final text to obtain an updated loss function includes:
calculating the final text and the plurality of text original probabilities through a seventh formula to obtain the update loss function, wherein the seventh formula is as follows:
L = L_g + λ L_c,
wherein
L_g = -(1/G) Σ_{(r, y) ∈ D} log p(y_t | r),
L_c = cos(V_o, V_t),
where L is the update loss function, L_g is the negative log-likelihood loss function, L_c is the cosine distance, λ is the weight of the cosine distance, G is the total number of data sets, D is the set of data sets and output texts, p(y_t | r) is the text original probability, V_o is the final text, and V_t is the preset standard text.
The text similarity between the final text output by decoding and the given standard text is calculated, loss analysis and update iteration are performed, and a high-quality text is finally generated.
Specifically, the negative log-likelihood loss function L_g is first calculated as follows:
L_g = -(1/G) Σ_{(r, y) ∈ D} log p(y | r),
where G is the total number of data sets, D is the set of data sets r and output texts y, and p(y | r) is the probability of the output text y conditioned on the data set r;
secondly, the cosine distance L_c between the generated content and the ground truth is calculated as follows:
L_c = cos(V_o, V_t),
where V_o and V_t are the vector representations of the final text and the ground-truth (standard) text, respectively;
finally, the two loss functions are fused as follows:
L = L_g + λ L_c,
where λ is the weight of L_c. This new loss function is used for training feedback, so that the generated text is closer to the standard text.
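A compact sketch of the fused loss of the seventh formula, assuming the decoder's per-step log-probabilities and vector representations of the generated and standard texts are already available; the function name, the λ value and the per-step NLL formulation are illustrative assumptions (the patent simply writes L_c = cos(V_o, V_t)).

```python
import torch
import torch.nn.functional as F

def update_loss(log_probs, targets, v_out, v_std, lam=0.5):
    """Sketch of the seventh formula: L = L_g + λ L_c.

    log_probs: (steps, vocab) log p(y_t | r) from the decoder
    targets:   (steps,) gold token ids of the standard text
    v_out:     (dim,) vector representation of the generated (final) text, V_o
    v_std:     (dim,) vector representation of the preset standard text, V_t
    lam:       λ, the weight of the cosine term (the value is an assumption)
    """
    l_g = F.nll_loss(log_probs, targets)               # negative log-likelihood L_g
    l_c = F.cosine_similarity(v_out, v_std, dim=0)     # L_c = cos(V_o, V_t)
    return l_g + lam * l_c

# Usage with random placeholder tensors.
loss = update_loss(torch.log_softmax(torch.randn(7, 10_000), dim=-1),
                   torch.randint(0, 10_000, (7,)),
                   torch.randn(256), torch.randn(256))
```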
In the embodiment, the update loss function is obtained by calculating the final text and the original probabilities of the plurality of texts through the seventh formula, update iteration can be performed, important contents in the records are accurately selected to generate the text, update iteration is performed through the similarity of the texts, and the text generation quality is optimized.
Fig. 2 is a block diagram of a text generation apparatus according to an embodiment of the present invention.
Alternatively, as another embodiment of the present invention, as shown in fig. 2, a text generating apparatus includes:
the splicing calculation module is used for importing a plurality of data sets and respectively carrying out splicing calculation on the data sets to obtain data set embedding vectors corresponding to the data sets;
the parameter updating module is used for constructing a training model and updating parameters of the training model according to the data set embedding vectors and the data sets to obtain an updated training model;
and the generated text obtaining module is used for training the text to be trained according to the updated training model to obtain a generated text.
Alternatively, another embodiment of the present invention provides a text generation apparatus, including a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the computer program is executed by the processor, the text generation method as described above is implemented. The device may be a computer or the like.
Alternatively, another embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the text generation method as described above.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A text generation method is characterized by comprising the following steps:
importing a plurality of data sets, and respectively carrying out splicing calculation on each data set to obtain a data set embedding vector corresponding to each data set;
constructing a training model, and updating parameters of the training model according to the plurality of data set embedded vectors and the plurality of data sets to obtain an updated training model;
and training the text to be trained according to the updated training model to obtain a generated text.
2. The text generation method according to claim 1, wherein the process of performing the stitching calculation on each of the data sets to obtain the data set embedding vector corresponding to each of the data sets comprises:
and respectively carrying out splicing calculation on each data set through a first formula to obtain a data set embedding vector corresponding to each data set, wherein the first formula is as follows:
r_j = ReLU(W_r[r_{j,1}; r_{j,2}; ...; r_{j,K}] + b_r),
where r_j is the embedding vector of the jth data set, ReLU is the activation function, W_r is a weight matrix, [r_{j,1}; r_{j,2}; ...; r_{j,K}] is the data in the jth data set, b_r is a bias vector, and [;] denotes concatenation between vectors.
3. The method of claim 1, wherein the constructing a training model, and the updating parameters of the training model according to the plurality of data set embedding vectors and the plurality of data sets to obtain an updated training model comprises:
constructing a training model based on an OpenNMT-py encoding and decoding model, wherein the training model comprises an encoding layer and a decoding layer;
respectively inputting each data set embedding vector into the coding layer, and respectively carrying out coding analysis on each data set embedding vector through the coding layer to obtain a plurality of data set updating vectors;
inputting the plurality of data set updating vectors into the decoding layer, and calculating a final text of the plurality of data set updating vectors through the decoding layer to obtain a final text and a plurality of original text probabilities;
performing loss calculation on a plurality of original probabilities of the texts according to the final text to obtain an updated loss function;
and training the training model according to the updating loss function to obtain an updated training model.
4. The method of claim 3, wherein the encoding analysis of each embedded vector of the data sets by the encoding layer to obtain a plurality of update vectors of the data sets comprises:
respectively inputting each data set embedded vector into a hidden layer, and respectively coding each data set embedded vector through the pre-built hidden layer to obtain a plurality of data set coding vectors and hidden layer output vectors corresponding to the data set coding vectors;
and respectively carrying out content screening calculation on each data set encoding vector and the hidden layer output vector corresponding to the data set encoding vector to obtain a plurality of data set updating vectors.
5. The method according to claim 4, wherein the process of inputting each of the data set embedding vectors into a hidden layer, and encoding each of the data set embedding vectors by the pre-built hidden layer to obtain a plurality of data set encoding vectors and hidden layer output vectors corresponding to the data set encoding vectors comprises:
performing mean pooling encoding on each data set embedding vector through a second formula to obtain a plurality of data set encoding vectors, wherein the second formula is as follows:
r̄_j = MeanPooling(r_j),
where r_j is the embedding vector of the jth data set, MeanPooling denotes mean pooling, and r̄_j is the encoding vector of the jth data set;
and respectively carrying out hidden information extraction on each data set embedded vector to obtain a hidden layer output vector corresponding to the data set coding vector.
6. The method of claim 5, wherein the step of performing content filtering calculation on each of the data set encoding vectors and the hidden layer output vector corresponding to the data set encoding vector to obtain a plurality of data set update vectors comprises:
respectively calculating each data set encoding vector and the hidden layer output vector corresponding to the data set encoding vector through a third formula to obtain a plurality of data set update vectors, wherein the third formula is as follows:
e_j = r̄_j ⊙ sGate,
wherein
sGate = σ(W_s r̄_j + U_s h_n + b),
where h_n is the hidden layer output vector, W_s and U_s are weight matrices, r̄_j is the encoding vector of the jth data set, ⊙ denotes the element-wise product, e_j is the update vector of the jth data set, σ is the sigmoid function, b is a bias vector, and sGate is the selection gate.
7. The method of claim 2, wherein the calculating of the final text from the plurality of data set update vectors by the decoding layer to obtain the final text and a plurality of original probabilities of the text comprises:
calculating content vectors of the plurality of data set updating vectors to obtain content vectors;
respectively calculating the content vector and the data set through a fourth formula to obtain a plurality of text original probabilities, wherein the fourth formula is as follows:
p(y_t | r) = softmax(W_y d_t^att + b_y),
where r is the data set, y_t is the output text, p(y_t | r) is the text original probability, softmax is the activation function, W_y is a weight matrix, d_t^att is the content vector, and b_y is a bias vector;
calculating the plurality of text original probabilities through a fifth formula to obtain the final text, wherein the fifth formula is as follows:
y* = argmax_{y'} p(y' | r),
where p(y' | r) is the text original probability and y* is the final text.
8. The method of claim 7, wherein the computing a content vector for the plurality of dataset update vectors comprises:
calculating a content vector for the plurality of data set update vectors by using an LSTM decoder to obtain the content vector, specifically:
calculating the plurality of data set update vectors through a sixth formula to obtain the content vector, wherein the sixth formula is as follows:
d_t^att = tanh(W_c[d_t; q_t]),
wherein q_t = α_{t,1} e_1 + α_{t,2} e_2 + ... + α_{t,j} e_j,
and α_{t,j} = softmax(d_t^T W_a e_j),
where d_t is the hidden state of the LSTM unit at time t, W_a and W_c are weight matrices, α_{t,j} is the attention score with Σ_j α_{t,j} = 1, q_t is the dynamic context vector, e_j is the update vector of the jth data set, d_t^T is the transpose of d_t, tanh is the activation function, and d_t^att is the content vector.
9. The method of claim 3, wherein the performing a loss calculation on the plurality of original probabilities of the text according to the final text to obtain an updated loss function comprises:
calculating the final text and the plurality of text original probabilities through a seventh formula to obtain the update loss function, wherein the seventh formula is as follows:
L = L_g + λ L_c,
wherein
L_g = -(1/G) Σ_{(r, y) ∈ D} log p(y_t | r),
L_c = cos(V_o, V_t),
where L is the update loss function, L_g is the negative log-likelihood loss function, L_c is the cosine distance, λ is the weight of the cosine distance, G is the total number of data sets, D is the set of data sets and output texts, p(y_t | r) is the text original probability, V_o is the final text, and V_t is the preset standard text.
10. A text generation apparatus, comprising:
the splicing calculation module is used for importing a plurality of data sets and respectively carrying out splicing calculation on the data sets to obtain data set embedding vectors corresponding to the data sets;
the parameter updating module is used for constructing a training model and updating parameters of the training model according to the data set embedding vectors and the data sets to obtain an updated training model;
and the generated text obtaining module is used for training the text to be trained according to the updated training model to obtain a generated text.
CN202011625631.4A 2020-12-31 2020-12-31 Text generation method and device and storage medium Pending CN112613282A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011625631.4A CN112613282A (en) 2020-12-31 2020-12-31 Text generation method and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011625631.4A CN112613282A (en) 2020-12-31 2020-12-31 Text generation method and device and storage medium

Publications (1)

Publication Number Publication Date
CN112613282A true CN112613282A (en) 2021-04-06

Family

ID=75252949

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011625631.4A Pending CN112613282A (en) 2020-12-31 2020-12-31 Text generation method and device and storage medium

Country Status (1)

Country Link
CN (1) CN112613282A (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109145105A (en) * 2018-07-26 2019-01-04 福州大学 A kind of text snippet model generation algorithm of fuse information selection and semantic association
CN110134771A (en) * 2019-04-09 2019-08-16 广东工业大学 A kind of implementation method based on more attention mechanism converged network question answering systems
CN111310438A (en) * 2020-02-20 2020-06-19 齐鲁工业大学 Chinese sentence semantic intelligent matching method and device based on multi-granularity fusion model
CN112069827A (en) * 2020-07-30 2020-12-11 国网天津市电力公司 Data-to-text generation method based on fine-grained subject modeling
CN112069810A (en) * 2020-08-11 2020-12-11 桂林电子科技大学 Text filling method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
QINGYU ZHOU et al.: "Selective Encoding for Abstractive Sentence Summarization", https://arxiv.org/pdf/1704.07073.pdf *
RATISH PUDUPPULLY et al.: "Data-to-text Generation with Entity Modeling", https://arxiv.org/abs/1906.03221 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239160A (en) * 2021-04-29 2021-08-10 桂林电子科技大学 Question generation method and device and storage medium
CN113239160B (en) * 2021-04-29 2022-08-12 桂林电子科技大学 Question generation method and device and storage medium

Similar Documents

Publication Publication Date Title
Choe et al. A neural grammatical error correction system built on better pre-training and sequential transfer learning
Singh et al. End-to-end training of multi-document reader and retriever for open-domain question answering
US10909157B2 (en) Abstraction of text summarization
CN114444479B (en) End-to-end Chinese speech text error correction method, device and storage medium
Kriz et al. Complexity-weighted loss and diverse reranking for sentence simplification
CN108519890A (en) A kind of robustness code abstraction generating method based on from attention mechanism
CN108052499B (en) Text error correction method and device based on artificial intelligence and computer readable medium
Zhao et al. A language model based evaluator for sentence compression
CN109582952A (en) Poem generation method, device, computer equipment and medium
CN112905736B (en) Quantum theory-based unsupervised text emotion analysis method
CN111291175B (en) Method for automatically generating submitted demand abstract based on strategy gradient algorithm
CN114969304A (en) Case public opinion multi-document generation type abstract method based on element graph attention
CN112734104B (en) Cross-domain recommendation method fusing generation countermeasure network and self-encoder
CN115794480A (en) System abnormal log detection method and system based on log semantic encoder
CN115438154A (en) Chinese automatic speech recognition text restoration method and system based on representation learning
CN111125333A (en) Generation type knowledge question-answering method based on expression learning and multi-layer covering mechanism
CN110119443A (en) A kind of sentiment analysis method towards recommendation service
CN112069809B (en) Missing text generation method and system
CN111008517A (en) Tensor decomposition technology-based neural language model compression method
CN112417138A (en) Short text automatic summarization method combining pointer generation type and self-attention mechanism
CN117271759A (en) Text abstract generation model training method, text abstract generation method and device
CN112613282A (en) Text generation method and device and storage medium
CN117877460A (en) Speech synthesis method, device, speech synthesis model training method and device
CN116932736A (en) Patent recommendation method based on combination of user requirements and inverted list
CN116521857A (en) Method and device for abstracting multi-text answer abstract of question driven abstraction based on graphic enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210406)