CN113111638A - Training method and device of natural language generation model - Google Patents

Training method and device of natural language generation model

Info

Publication number
CN113111638A
CN113111638A (application CN202110395155.XA)
Authority
CN
China
Prior art keywords
training
moment
time
objective function
reinforcement learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110395155.XA
Other languages
Chinese (zh)
Inventor
程维
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN202110395155.XA priority Critical patent/CN113111638A/en
Publication of CN113111638A publication Critical patent/CN113111638A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure relates to a training method and a training device for a natural language generation model, in the technical field of computers. The training method comprises the following steps: taking the words generated at the historical moments as the state at the current moment, taking the output of the generated word at the current moment as an action, and modeling the natural language generation process as a reinforcement learning model; determining a first objective function according to the advantage function of the generated word at each moment output by the reinforcement learning model; determining a second objective function according to the difference between the probability distribution of the labeling result at each moment and the probability distribution of the generated word at each moment; and training the reinforcement learning model according to a weighted average of the first objective function and the second objective function.

Description

Training method and device of natural language generation model
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a training method for a natural language generation model, a training apparatus for a natural language generation model, a natural language generation method, a natural language generation apparatus, and a non-volatile computer-readable storage medium.
Background
In recent years, AI (Artificial Intelligence) technology has developed rapidly, and its applications have spread across many fields of production and daily life. NLP (Natural Language Processing) is an important application of artificial intelligence and is mainly divided into two branches, namely natural language understanding and NLG (Natural Language Generation).
Natural language generation is an important technology combining artificial intelligence, computer science and computational linguistics. Its main purpose is to give computers the same expression and writing abilities as humans, i.e., to enable a computer to automatically generate high-quality text from key input information through a series of processing and planning steps. This technology is already widely applied, most commonly in machine translation, chatbots, voice assistants, and so on.
In the related art, a statistical machine learning-based method models information input to a computer, thereby generating text.
Disclosure of Invention
The inventors of the present disclosure found that the related art described above has the following problem: the gradient variance during training of the natural language generation model is too large, so that model training is unstable or does not converge, and the natural language generation effect is poor.
In view of this, the present disclosure provides a training technical solution for a natural language generation model, which can improve a natural language generation effect.
According to some embodiments of the present disclosure, there is provided a training method of a natural language generation model, including: taking the words generated at the historical moments as the state at the current moment, taking the output of the generated word at the current moment as an action, and modeling the natural language generation process as a reinforcement learning model; determining a first objective function according to the advantage function of the generated word at each moment output by the reinforcement learning model; determining a second objective function according to the difference between the probability distribution of the labeling result at each moment and the probability distribution of the generated word at each moment; and training the reinforcement learning model according to a weighted average of the first objective function and the second objective function.
In some embodiments, determining the first objective function according to the advantage function of the generated word at each moment output by the reinforcement learning model comprises: determining the advantage function according to the difference between the state action function at each moment and the state action function at the previous moment.
In some embodiments, determining the advantage function comprises: predicting, according to the actions a0~at-1 from time 0 to t-1 and the action at at time t, a plurality of action combinations at+1~aL-1 for times t+1 to L-1; calculating the reward value of each action combination at+1~aL-1; and determining the state action function at time t according to the weighted average of the reward values.
In some embodiments, determining the first objective function according to the advantage function of the generated word at each moment output by the reinforcement learning model comprises: outputting the generation probability of the generated word at each moment by using the reinforcement learning model; and determining the first objective function according to the advantage function and the generation probability of the generated word at each moment.
In some embodiments, determining the second objective function according to the difference between the probability distribution of the labeling result at each moment and the probability distribution of the generated word at each moment comprises: determining the prior probability at each moment according to the difference between the labeling result at each moment and each word in the corpus; outputting the generation probability of the generated word at each moment by using the reinforcement learning model; and determining the second objective function according to the weighted average of the differences between the prior probability and the generation probability at each moment.
In some embodiments, determining the prior probability at each time according to the difference between the labeling result at each time and each word in the corpus comprises: and determining the prior probability of each moment according to the similarity between the word vector of the labeling result at each moment and the vector of each word in the corpus.
In some embodiments, determining the second objective function according to the weighted average of the differences between the prior probabilities and the generation probabilities at the respective moments comprises: determining the weight of the difference between the prior probability and the generation probability at the corresponding moment according to the advantage function of the generated word at each moment, wherein the weight is negatively correlated with the advantage function.
In some embodiments, training the reinforcement learning model based on the weighted average of the first objective function and the second objective function comprises: determining a comprehensive objective function according to the weighted average value of the first objective function and the second objective function; and training a reinforcement learning model under the condition of minimizing the comprehensive objective function.
In some embodiments, the training method further comprises: and generating natural language data by using the trained reinforcement learning model.
In some embodiments, generating natural language data comprises: and translating the input first language data into second language data by using the trained reinforcement learning model.
According to further embodiments of the present disclosure, there is provided a natural language generation method including: taking the words generated at the historical moments as the state at the current moment, taking the output of the generated word at the current moment as an action, and modeling the natural language generation process as a reinforcement learning model; determining a first objective function according to the advantage function of the generated word at each moment output by the reinforcement learning model; determining a second objective function according to the difference between the probability distribution of the labeling result at each moment and the probability distribution of the generated word at each moment; training the reinforcement learning model according to a weighted average of the first objective function and the second objective function; and generating natural language data by using the trained reinforcement learning model.
In some embodiments, generating natural language data comprises: and translating the input first language data into second language data by using the trained reinforcement learning model.
According to still other embodiments of the present disclosure, there is provided a training apparatus for a natural language generation model, including: a modeling unit for modeling the natural language generation process as a reinforcement learning model by taking the words generated at the historical moments as the state at the current moment and taking the output of the generated word at the current moment as an action; a determining unit for determining a first objective function according to the advantage function of the generated word at each moment output by the reinforcement learning model, and determining a second objective function according to the difference between the probability distribution of the labeling result at each moment and the probability distribution of the generated word at each moment; and a training unit for training the reinforcement learning model according to a weighted average of the first objective function and the second objective function.
In some embodiments, the determining unit determines the advantage function according to the difference between the state action function at each moment and the state action function at the immediately preceding moment.
In some embodiments, the determining unit predicts, according to the actions a0~at-1 from time 0 to t-1 and the action at at time t, a plurality of action combinations at+1~aL-1 for times t+1 to L-1; calculates the reward value of each action combination at+1~aL-1; and determines the state action function at time t according to the weighted average of the reward values.
In some embodiments, the determining unit outputs the generation probability of the generated word at each moment by using the reinforcement learning model, and determines the first objective function according to the advantage function and the generation probability of the generated word at each moment.
In some embodiments, the determining unit determines the prior probability at each time according to the similarity between the word vector of the labeling result at each time and the vector of each word in the corpus.
In some embodiments, the determining unit determines the prior probability at each moment according to the difference between the labeling result at each moment and each word in the corpus, outputs the generation probability of the generated word at each moment by using the reinforcement learning model, and determines the second objective function according to the weighted average of the differences between the prior probability and the generation probability at each moment.
In some embodiments, the determining unit determines, according to the advantage function of the generated word at each moment, the weight of the difference between the prior probability and the generation probability at the corresponding moment, the weight being negatively correlated with the advantage function.
In some embodiments, the training unit determines a comprehensive objective function according to a weighted average of the first objective function and the second objective function, and trains the reinforcement learning model under the condition that the comprehensive objective function is minimized.
In some embodiments, the training apparatus further comprises: and the generating unit is used for generating natural language data by using the trained reinforcement learning model.
In some embodiments, the generation unit translates the input first language data into the second language data using the trained reinforcement learning model.
According to still further embodiments of the present disclosure, there is provided a natural language generation apparatus including: a modeling unit for modeling the natural language generation process as a reinforcement learning model by taking the words generated at the historical moments as the state at the current moment and taking the output of the generated word at the current moment as an action; a determining unit for determining a first objective function according to the advantage function of the generated word at each moment output by the reinforcement learning model, and determining a second objective function according to the difference between the probability distribution of the labeling result at each moment and the probability distribution of the generated word at each moment; a training unit for training the reinforcement learning model according to a weighted average of the first objective function and the second objective function; and a generating unit for generating natural language data by using the trained reinforcement learning model.
According to still further embodiments of the present disclosure, there is provided a training apparatus for a natural language generation model, including: a memory; and a processor coupled to the memory, the processor configured to perform the method of training a natural language generative model of any of the above embodiments based on instructions stored in the memory device.
According to still further embodiments of the present disclosure, there is provided a natural language generating apparatus including: a memory; and a processor coupled to the memory, the processor configured to perform the method of generating natural language in any of the above embodiments based on instructions stored in the memory device.
According to still further embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a training method of a natural language generation model or a generation method of a natural language in any of the above embodiments.
In the above embodiments, one objective function is constructed based on the policy gradient of the advantage function, which has a smaller gradient variance; another objective function is constructed based on the difference between the labeling result and the model output, which addresses the problem that the degree of training deviation is not accounted for. Therefore, model training becomes more stable and converges more easily, the training effect of the model is improved, and the natural language generation effect is thereby improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The present disclosure can be more clearly understood from the following detailed description with reference to the accompanying drawings, in which:
FIG. 1 illustrates a flow diagram of some embodiments of a training method of a natural language generation model of the present disclosure;
FIG. 2 shows a flow diagram of some embodiments of step 140 of FIG. 1;
FIG. 3 illustrates a schematic diagram of some embodiments of a training method of a natural language generation model of the present disclosure;
FIG. 4 illustrates a block diagram of some embodiments of a training apparatus of a natural language generation model of the present disclosure;
FIG. 5 illustrates a block diagram of some embodiments of a training apparatus of a natural language generation model or a generation apparatus of natural language of the present disclosure;
fig. 6 shows a block diagram of further embodiments of the training apparatus of the natural language generation model or the generation apparatus of natural language of the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
In the related art mentioned above, the information input to the computer can also be modeled with a deep neural network, typically an encoder-decoder network (e.g., a Transformer or a Seq2Seq model). Such models are usually trained by MLE (Maximum Likelihood Estimation) using a cross-entropy loss function. This approach has the following technical problems: inconsistency between model training and model testing/inference; the exposure bias problem of the model; and the negative-diversity ignorance problem (differently wrong predictions are scored identically).
The training/testing inconsistency arises because the MLE method trains the model at the word level by maximizing the conditional probability, and each locally optimal word is then selected by greedy sampling to form the text. However, the text assembled from the words chosen greedily at each step is not necessarily optimal as a whole.
Moreover, the model testing or inference phase often uses sequence-level evaluation criteria (e.g., the commonly used BLEU (Bilingual Evaluation Understudy)) to evaluate the quality of the generated text from the perspective of the entire sequence.
The exposure bias problem mainly occurs in autoregressive encoder-decoder networks, where each decoding step relies on the output of the previous step. However, when each word of the text sequence is generated during training, the MLE method feeds the label (ground-truth) word into the model as the input required for producing the next word.
This method is known as teacher forcing. Its drawback is that no labels are available at test or inference time, so the information output by the model at each step must serve as the input required for producing the next word. During training, however, the model is never exposed to its own outputs (i.e., it is never trained on them), so it performs poorly at test or inference time.
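For illustration only (not taken from the patent), a minimal sketch contrasting teacher forcing during training with free-running decoding at inference; the `step` callable and the `<bos>` token are hypothetical:

```python
from typing import Callable, List

def decode(step: Callable[[List[str]], str],   # hypothetical one-step decoder
           gold: List[str],
           teacher_forcing: bool) -> List[str]:
    """With teacher forcing the gold (label) word is fed back at each step;
    in free-running mode the model's own previous output is fed back instead."""
    inputs: List[str] = ["<bos>"]
    outputs: List[str] = []
    for t in range(len(gold)):
        word = step(inputs)
        outputs.append(word)
        inputs.append(gold[t] if teacher_forcing else word)
    return outputs
```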
Regarding the negative-diversity ignorance problem, the MLE training method usually takes the cross-entropy loss function as the optimization target. However, the cross-entropy loss treats all mispredicted samples identically, i.e., it assigns the same "score" to erroneous samples regardless of how far the prediction deviates from the true label. Such training reduces the diversity of the text generated by the model.
In addition, natural language generation algorithms based on reinforcement learning have the following problems: the negative-diversity ignorance problem and an excessively large training gradient variance.
The negative-diversity ignorance problem arises because, when training a natural language generation model, reinforcement learning often scores the predicted text with an evaluation metric such as BLEU as the reward value. However, these evaluation criteria still give the same score to prediction samples with different degrees of error.
The excessively large training gradient variance stems from the Policy Gradient method adopted in reinforcement learning. When estimating the reward value, the policy gradient algorithm easily produces an overly large gradient variance during model training, which makes training unstable or prevents convergence.
In view of the above technical problem, the present disclosure can give a more reasonable score to different misprediction samples when the model is trained. Therefore, a more accurate and more diverse natural language generation model can be obtained through training.
In addition, the model is trained by designing the reinforcement learning algorithm with smaller gradient variance, so that the model training process is more stable, and a more accurate natural language generation model is obtained. For example, the technical solution of the present disclosure can be realized by the following embodiments.
FIG. 1 illustrates a flow diagram of some embodiments of a training method of a natural language generation model of the present disclosure.
As shown in fig. 1, in step 110, the natural language generation process is modeled as a reinforcement learning model, with the words generated at the historical moments taken as the state at the current moment and the output of the generated word at the current moment taken as the action.
In some embodiments, the training process of the natural language generation model may be defined as a reinforcement learning problem. Reinforcement learning proceeds as a Markov decision process; therefore, the natural language generation process needs to be defined as a Markov decision process.
In some embodiments, the Markov decision process comprises a five-tuple (S, A, T, R, γ), where S is the state space, A the action space, T the state transition function, R the reward value, and γ the discount factor. The natural language generation model can be trained with reinforcement learning methods by maximizing the reward value.
For example, the evaluation method for the reward value may be BLEU, ROUGE (Recall-Oriented Understudy for Gisting Evaluation), METEOR, or the like.
For example, with respect to the state S, the state St at time t consists of the words w0~wt-1 already generated from time 0 to t-1 and the output Z of the decoder of the machine learning model, where 0 ≤ t ≤ L and L is the length of the sentence to be generated (corresponding to the moments at which the words are generated).
For example, with respect to the action space A, a word generated by the model represents an action, and the action at time t is denoted at, i.e., wt=at; the different actions make up the action space A. In the natural language generation problem, the action space A is equivalent to a dictionary, which contains all the words available during generation.
For example, the state transition function T defines the probability of transitioning from the state St at the current moment to the state St+1 at the next moment. Once the agent has performed an action at, the transition is determined. The state at time t may be expressed as:
St=(w0,w1,…,wt-1,Z)
for example, regarding the reward value R, when the length of the sentence generated at the present time is smaller than L, the reward value is 0; when the sentence length generated at the present moment is equal to L, an EOS (end identifier) is generated for identifying that the whole sentence is generated, and the current bonus value is calculated. the reward value at time t may be calculated according to various evaluation methods:
Figure BDA0003018246870000091
in some embodiments, the reward value r may be determined using different evaluation methods depending on the different types of natural language generating tasks. For example, for a machine translation type task, a BLEU method can be adopted to determine the reward value r; for summary type tasks, a ROUGE method can be adopted to determine the reward value r.
For example, with respect to the state transition probability, P(St+1|St,at) denotes the probability of transitioning from state St to state St+1 under the action at. In the natural language generation task, it follows from the definition of the state that P(St+1|St,at) is constantly equal to 1.
For example, with respect to the discount factor γ, the effect is to give different weights to different time instants t. In the natural language generation task, it can be considered that the importance level of each word is the same, i.e., γ is always equal to 1.
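As a purely illustrative sketch of the formalization above (the scoring function stands in for BLEU/ROUGE; all names are assumptions, not part of the patent):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class NLGState:
    generated: List[str]    # words w0..w(t-1) produced so far
    decoder_output: object  # decoder output Z

def transition(state: NLGState, action: str) -> NLGState:
    # The state transition is deterministic: taking action a_t appends the word w_t.
    return NLGState(state.generated + [action], state.decoder_output)

def reward(state: NLGState, action: str, max_len: int,
           score_fn: Callable[[List[str]], float]) -> float:
    """Reward is 0 until the sentence reaches length L; the final step receives
    the sequence-level evaluation score (e.g., BLEU or ROUGE)."""
    sentence = state.generated + [action]
    if len(sentence) < max_len:
        return 0.0
    return score_fn(sentence)
```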
In step 120, a first objective function is determined according to the advantage function of the word generated at each moment output by the reinforcement learning model.
In some embodiments, the advantage function is determined according to the difference between the state action function at each moment and the state action function at the immediately preceding moment.
For example, the advantage function at time t is Aπ(St,at). It measures the quality of the word generated at time t and is related to the state value Vπ(St) and the reward value rt at time t. That is, to compute the advantage function, an estimate of the state value is needed.
However, the natural language generation model has no separate value network to estimate the state value Vπ(St). Therefore, in reinforcement learning, the state action value Qπ(st,at) is computed instead of the state value Vπ(St).
For example, Qπ(st,at), the value of the state action function (also called the Q value) at time t, is the expectation of the reward value and measures the quality of the word output at time t. Qπ(st,at) can be expressed as:
Qπ(st,at) = E[ r(w0~wL) | st,at ], i.e., the expected reward of the completed sentence given the state and action at time t
From the above formula, Vπ(St) can be computed from Qπ(st,at). Moreover, as previously mentioned, from the definition of the natural language generation problem in reinforcement learning, the state transition probability and the discount factor are both constantly equal to 1, and the reward value rt at time t < L is 0, so the state value at time t+1 equals the Q value at time t:
Vπ(St+1)=Qπ(st,at)
Thus, the advantage function may be determined by the following calculation:
Aπ(St,at)=Qπ(st,at)-Qπ(st-1,at-1)
Accordingly, the advantage function at the corresponding moment can be determined by estimating only the Q value.
In some embodiments, by the definition used in reinforcement learning, the Q value is the expectation of the reward value. During training of the natural language generation model, the word output at each step (e.g., wt) is used to reason forward about the subsequent words (e.g., wt+1~wL), yielding a complete sentence containing L words.
In some embodiments, a complete sentence can be predicted for each output word. Thus, the reward value of each complete sentence can be computed and used to calculate the Q value.
For example, according to the actions a0~at-1 taken from time 0 to t-1 and the action at at time t, a plurality of action combinations at+1~aL-1 for times t+1 to L-1 (i.e., a plurality of complete sentences) are predicted; the reward value of each action combination at+1~aL-1 is calculated; and the state action function at time t is determined according to the weighted average of these reward values.
For example, when a word is output at time t, K-step Monte Carlo simulation can be used to predict K sentences from time t+1 to the end time L, and the average of the reward values of the K sentences is taken as the Q value at time t:
Qπ(st,at) = (1/K) Σk=1..K r(a0:t-1, at, a(k)t+1:L)
where a0:t-1 are the actions taken from time 0 to t-1 (i.e., the words already generated) and a(k)t+1:L, k=1,…,K, are the K predicted combinations of actions taken from time t+1 to L.
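A minimal illustrative sketch of the K-rollout Q estimate (the `rollout` and `score_fn` callables are assumed, not defined by the patent):

```python
from typing import Callable, List

def estimate_q(prefix: List[str],                          # w0..wt generated so far, including a_t
               rollout: Callable[[List[str]], List[str]],  # samples a completion w(t+1)..wL
               score_fn: Callable[[List[str]], float],     # sequence-level reward, e.g. BLEU
               k: int = 5) -> float:
    """Monte Carlo estimate of Q(s_t, a_t): the average reward of K sampled completions."""
    rewards = []
    for _ in range(k):
        full_sentence = prefix + rollout(prefix)
        rewards.append(score_fn(full_sentence))
    return sum(rewards) / k
```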
In some embodiments, the generation probability of the generated word at each moment is output by the reinforcement learning model, and the first objective function is determined according to the advantage function and the generation probability of the generated word at each moment. For example, the first objective function LRL(θ) may be calculated by the following formula:
LRL(θ) = EP[ Σt Aπ(St,at)·log Pθ(wt) ]
where EP[ ] denotes the expectation taken with respect to Pθ(wt), Pθ(wt) is the generation probability of the word wt output by the reinforcement learning model at time t, and θ is a model parameter that can be adjusted according to the actual situation.
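An illustrative PyTorch-style sketch of this advantage-weighted policy-gradient objective; the formula above is a reconstruction, and the tensor shapes and names here are assumptions:

```python
import torch

def policy_gradient_loss(log_probs: torch.Tensor,   # [T] log P_theta(w_t) of the generated words
                         advantages: torch.Tensor   # [T] A^pi(S_t, a_t) estimates
                         ) -> torch.Tensor:
    """Negative of the first objective L_RL(theta): minimizing this loss maximizes
    the advantage-weighted log-likelihood of the generated words."""
    return -(advantages.detach() * log_probs).sum()
```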
In step 130, a second objective function is determined according to the difference between the probability distribution of the labeling result at each time and the probability distribution of the generated word at each time.
In some embodiments, the prior probability at each time is determined according to the difference between the labeling result at each time and each word in the corpus. For example, the prior probability at each time is determined based on the similarity between the word vector of the labeling result at each time and the vector of each word in the corpus.
The generation probability of the generated word at each moment is output by the reinforcement learning model, and the second objective function is determined according to the weighted average of the differences between the prior probability and the generation probability at each moment.
For example, the weight of the difference between the prior probability and the generation probability at the corresponding moment is determined according to the advantage function of the generated word at each moment, and the weight is negatively correlated with the advantage function.
In some embodiments, the second objective function may be calculated based on the adaptation factor and the KL divergence of the prior distribution.
For example, a word2vec pre-training algorithm can first be used to pre-train the corpus and obtain pre-trained word vectors; then, based on the pre-trained word vectors, the prior probability P*(wt) at time t is calculated:
P*(wt) = σ( cos_sim( emb(w*t), emb(w) ) ), taken over every word w in the dictionary
where w*t is the label word annotated in advance for time t, wt is the generated word output by the model at time t, w ranges over the words in the dictionary (corpus), emb() is the word-vector lookup function, σ() is the Softmax function, and cos_sim() is the cosine similarity function.
That is, after the model outputs a generated word wt at time t, the label word at that moment is used to calculate the prior probability of wt.
For example, the cosine similarity between the word vector of the label word and the vectors of all words in the dictionary can be calculated; words semantically closer to the label have a higher cosine similarity. All the calculated similarities are then normalized with the Softmax function to obtain the prior distribution P*(wt) at time t.
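An illustrative sketch of this prior distribution, assuming the pre-trained word vectors are available as a NumPy matrix (all names are assumptions):

```python
import numpy as np

def prior_distribution(label_vec: np.ndarray,   # embedding of the label word w*_t
                       vocab_vecs: np.ndarray   # [V, d] embeddings of all dictionary words
                       ) -> np.ndarray:
    """P*(w_t): softmax over the cosine similarities between the label word
    and every word in the dictionary."""
    label_norm = label_vec / np.linalg.norm(label_vec)
    vocab_norm = vocab_vecs / np.linalg.norm(vocab_vecs, axis=1, keepdims=True)
    cos_sim = vocab_norm @ label_norm            # [V] cosine similarities
    exp = np.exp(cos_sim - cos_sim.max())        # numerically stable softmax
    return exp / exp.sum()
```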
In some embodiments, after the prior probability P*(wt) is obtained, the KL divergence (Kullback-Leibler divergence, relative entropy) between the prior probability P*(wt) and the generation probability Pθ(wt) output by the model can be calculated to determine the second objective function:
LKL(θ) = Σt λt·KL[ P*(wt) || Pθ(wt) ]
λt = α / Aπ(St,at)
where λt is the adaptation factor, α is a weight parameter set according to the actual situation, Pθ(wt) is the generation probability output by the model at time t, and KL[ ] is the KL divergence function.
In some embodiments, the adaptation factor is inversely proportional to the advantage function Aπ(St,at), i.e., the larger the advantage function value, the smaller the adaptation factor. That is, when the complete sentence predicted from the word wt generated at time t can achieve a higher reward value, the weight of the KL divergence is automatically reduced. This avoids excessive correction by the KL divergence of outputs that already obtain a high reward value, and improves the accuracy of model training.
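An illustrative sketch of the adaptively weighted KL term; the α/|A| form of the adaptation factor is an assumption based only on the stated inverse proportionality, and the tensor names are hypothetical:

```python
import torch

def adaptive_kl_loss(prior_probs: torch.Tensor,      # [T, V] P*(w_t), rows sum to 1
                     model_log_probs: torch.Tensor,  # [T, V] log P_theta(w_t)
                     advantages: torch.Tensor,       # [T] A^pi(S_t, a_t)
                     alpha: float = 1.0,
                     eps: float = 1e-6) -> torch.Tensor:
    """Second objective L_KL(theta): per-step KL[P* || P_theta] weighted by an
    adaptation factor that shrinks as the advantage grows."""
    # KL(P* || P_theta) = sum_w P*(w) * (log P*(w) - log P_theta(w)) at each step
    kl_per_step = (prior_probs * (torch.log(prior_probs + eps) - model_log_probs)).sum(dim=-1)
    # Inverse relation to the advantage; abs() and eps keep the factor positive and
    # finite (an assumption, the patent only states the inverse proportionality).
    adapt = alpha / (advantages.detach().abs() + eps)
    return (adapt * kl_per_step).sum()
```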
In step 140, a reinforcement learning model is trained based on the weighted average of the first objective function and the second objective function.
In some embodiments, natural language data is generated using a trained reinforcement learning model. For example, the input first language data is translated into second language data by using a trained reinforcement learning model.
In some embodiments, step 140 may be implemented by the embodiment in fig. 2.
FIG. 2 illustrates a flow diagram of some embodiments of step 140 of FIG. 1.
As shown in fig. 2, in step 1410, a composite objective function is determined according to the weighted average of the first objective function and the second objective function.
In some embodiments, using the first objective function based on the advantage function (the policy gradient loss function) and the second objective function based on the KL divergence, the composite objective function is determined as:
L(θ)=-LRL(θ)+βLKL(θ)
β is a weight parameter that can be adjusted according to the actual situation.
In step 1420, the reinforcement learning model is trained with minimization of the composite objective function as the training condition. For example, the natural language generation model may be trained with a gradient descent method based on the composite objective function.
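A minimal illustrative training step combining the two objectives with a generic PyTorch optimizer; this is a sketch under the assumptions above, not the patent's reference implementation:

```python
import torch

def training_step(optimizer: torch.optim.Optimizer,
                  log_probs: torch.Tensor,     # [T] log P_theta(w_t) of the sampled words
                  kl_per_step: torch.Tensor,   # [T] KL[P*(w_t) || P_theta(w_t)] per time step
                  advantages: torch.Tensor,    # [T] Monte Carlo advantage estimates
                  alpha: float = 1.0,
                  beta: float = 0.5,
                  eps: float = 1e-6) -> float:
    """One gradient-descent step on L(theta) = -L_RL(theta) + beta * L_KL(theta)."""
    adv = advantages.detach()
    loss_rl = (adv * log_probs).sum()            # L_RL: advantage-weighted log-likelihood
    adapt = alpha / (adv.abs() + eps)            # adaptation factor (assumed inverse form)
    loss_kl = (adapt * kl_per_step).sum()        # L_KL: adaptively weighted KL divergence
    loss = -loss_rl + beta * loss_kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```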
FIG. 3 illustrates a schematic diagram of some embodiments of a training method of a natural language generation model of the present disclosure.
As shown in fig. 3, the overall training process is as follows. During reinforcement learning training, the word wt generated at time t is expanded K times with the Monte Carlo simulation method to predict K complete sentences. For example, if wt-1 is "The" and wt is "cat", K complete sentences can be predicted based on wt-1 and wt.
The advantage function is calculated from the K reward values and the Q value of each predicted sentence, and the policy gradient training objective is weighted by the advantage function.
In some embodiments, to address the problem that the degree of deviation of wrong predictions is not distinguished, the dictionary may be pre-trained to obtain trained word vectors. For example, during model training, the prior probability distribution at each moment is calculated with the pre-trained word vectors according to the labeling result at the current moment. For example, the labeling results at the successive moments are "The", "cat", "is", "eating", "an" and "apple".
In some embodiments, the KL divergence between the generation probability distribution output by the model and the prior probability distribution is calculated to measure how close the two distributions are. For example, to prevent the KL divergence from over-correcting the output, the KL divergence may be weighted by the advantage function as part of the optimization objective.
In some embodiments, the technical solution of the present disclosure may include four steps A, B, C, and D. Steps A and B calculate the advantage-function-based policy gradient objective function as the first objective function; step C calculates the second objective function based on the adaptive-factor KL divergence; and step D adds the two optimization targets to obtain the composite objective function and trains the model.
In step A, the training process of the natural language generation model can be defined as a reinforcement learning problem. Reinforcement learning proceeds as a Markov decision process; therefore, the natural language generation process needs to be defined as a Markov decision process.
In some embodiments, the Markov decision process comprises a five-tuple (S, A, T, R, γ), where S is the state space, A the action space, T the state transition function, R the reward value, and γ the discount factor. The natural language generation model can be trained with reinforcement learning methods by maximizing the reward value.
For example, the evaluation method for the reward value may be BLEU, ROUGE (Recall-Oriented Understudy for Gisting Evaluation), METEOR, or the like.
For example, with respect to the state S, the state St at time t consists of the words w0~wt-1 already generated from time 0 to t-1 and the output Z of the decoder of the machine learning model, where 0 ≤ t ≤ L and L is the length of the sentence to be generated (corresponding to the moments at which the words are generated).
For example, with respect to the action space A, a word generated by the model represents an action, and the action at time t is denoted at, i.e., wt=at; the different actions make up the action space A. In the natural language generation problem, the action space A is equivalent to a dictionary, which contains all the words available during generation.
For example, the state transition function T defines the probability of transitioning from the state St at the current moment to the state St+1 at the next moment. Once the agent has performed an action at, the transition is determined.
For example, regarding the reward value R, when the length of the sentence generated so far is smaller than L, the reward value is 0; when the generated sentence length equals L, an EOS (end-of-sentence identifier) is generated to mark that the whole sentence has been produced, and the reward value is calculated.
In some embodiments, the reward value r may be determined using different evaluation methods depending on the different types of natural language generating tasks. For example, for a machine translation type task, a BLEU method can be adopted to determine the reward value r; for summary type tasks, a ROUGE method can be adopted to determine the reward value r.
For example, with respect to the state transition probability, P(St+1|St,at) denotes the probability of transitioning from state St to state St+1 under the action at. In the natural language generation task, it follows from the definition of the state that P(St+1|St,at) is constantly equal to 1.
For example, with respect to the discount factor γ, the effect is to give different weights to different time instants t. In the natural language generation task, it can be considered that the importance level of each word is the same, i.e., γ is always equal to 1.
In step B, the advantage function at time t is Aπ(St,at). It measures the quality of the word generated at time t and is related to the state value Vπ(St) and the reward value rt at time t. That is, to compute the advantage function, an estimate of the state value is needed.
However, the natural language generation model has no separate value network to estimate the state value Vπ(St). Therefore, in reinforcement learning, the state action value Qπ(st,at) is computed instead of the state value Vπ(St).
For example, Qπ(st,at), the value of the state action function (also called the Q value) at time t, is the expectation of the reward value and measures the quality of the word output at time t. Vπ(St) can be computed from Qπ(st,at).
Further, as described above, from the definition of the natural language generation problem in reinforcement learning, the state transition probability and the discount factor are both constantly equal to 1, and the reward value rt at time t < L is 0, so the state value at time t+1 equals the Q value at time t.
Thus, the advantage function may be determined by the following calculation:
Aπ(St,at)=Qπ(st,at)-Qπ(st-1,at-1)
Accordingly, the advantage function at the corresponding moment can be determined by estimating only the Q value.
In some embodiments, by the definition used in reinforcement learning, the Q value is the expectation of the reward value. During training of the natural language generation model, the word output at each step (e.g., wt) is used to reason forward about the subsequent words (e.g., wt+1~wL), yielding a complete sentence containing L words.
In some embodiments, a complete sentence can be predicted for each output word. Thus, the reward value of each complete sentence can be computed and used to calculate the Q value.
For example, according to the actions a0~at-1 taken from time 0 to t-1 and the action at at time t, a plurality of action combinations at+1~aL-1 for times t+1 to L-1 (i.e., a plurality of complete sentences) are predicted; the reward value of each action combination at+1~aL-1 is calculated; and the state action function at time t is determined according to the weighted average of these reward values.
For example, when a word is output at time t, K sentences from time t+1 to the end time L are predicted using K-step Monte Carlo simulation, and the average of the reward values of the K sentences is calculated as the Q value at time t.
In some embodiments, the generation probability of the generated word at each moment is output by the reinforcement learning model, and the first objective function is determined according to the advantage function and the generation probability of the generated word at each moment.
In step C, the corpus is pre-trained with a word2vec pre-training algorithm to obtain pre-trained word vectors; then, based on the pre-trained word vectors, the prior probability P*(wt) at time t is calculated.
That is, after the model outputs a generated word wt at time t, the label word at that moment is used to calculate the prior probability of wt.
For example, the cosine similarity between the word vector of the label word and the vectors of all words in the dictionary can be calculated; words semantically closer to the label have a higher cosine similarity. All the calculated similarities are then normalized with the Softmax function to obtain the prior distribution P*(wt) at time t.
In some embodiments, after the prior probability P*(wt) is obtained, the KL divergence between the prior probability P*(wt) and the generation probability Pθ(wt) output by the model can be calculated to determine the second objective function.
In some embodiments, the adaptation factor is inversely proportional to the advantage function Aπ(St,at), i.e., the larger the advantage function value, the smaller the adaptation factor. That is, when the complete sentence predicted from the word wt generated at time t can achieve a higher reward value, the weight of the KL divergence is automatically reduced. This avoids excessive correction by the KL divergence of outputs that already obtain a high reward value, and improves the accuracy of model training.
In step D, the composite objective function is determined from the first objective function (the policy gradient loss function) based on the advantage function and the second objective function based on the KL divergence. For example, the natural language generation model may be trained with a gradient descent method based on the composite objective function.
In these embodiments, the advantage-function-based policy gradient objective function solves the problem in existing methods that model training is difficult to converge because the gradient variance is too large; the adaptive-factor-based KL divergence objective function measures the distance between the model's output distribution and the pre-trained prior distribution, and addresses the problem that the conventional cross-entropy loss function and the policy gradient objective function do not account for the degree of deviation.
In addition, the adaptation factor based on the advantage function can automatically adjust the weight of the KL divergence, so that the two optimization targets are trained jointly more effectively, and the adaptive-factor KL divergence target is prevented from over-correcting the model at the word level.
FIG. 4 illustrates a block diagram of some embodiments of a training apparatus of a natural language generation model of the present disclosure.
As shown in fig. 4, the training device 4 of the natural language generation model includes a modeling unit 41, a determining unit 42, and a training unit 43.
The modeling unit 41 models the natural language generation process as a reinforcement learning model by using the generated word at each history time as the state of the current time and using the output of the generated word at the current time as the action.
The determining unit 42 determines a first objective function according to the advantage function of the generated word at each moment output by the reinforcement learning model, and determines a second objective function according to the difference between the probability distribution of the labeling result at each moment and the probability distribution of the generated word at each moment.
In some embodiments, the determining unit 42 determines the advantage function according to the difference between the state action function at each moment and the state action function at the immediately preceding moment.
In some embodiments, the determining unit 42 predicts, according to the actions a0~at-1 from time 0 to t-1 and the action at at time t, a plurality of action combinations at+1~aL-1 for times t+1 to L-1; calculates the reward value of each action combination at+1~aL-1; and determines the state action function at time t according to the weighted average of the reward values.
In some embodiments, the determining unit 42 outputs the generation probability of the generated word at each moment by using the reinforcement learning model, and determines the first objective function according to the advantage function and the generation probability of the generated word at each moment.
In some embodiments, the determining unit 42 determines the prior probability at each time according to the similarity between the word vector of the labeling result at each time and the vector of each word in the corpus.
In some embodiments, the determining unit 42 determines the prior probability at each moment according to the difference between the labeling result at each moment and each word in the corpus, outputs the generation probability of the generated word at each moment by using the reinforcement learning model, and determines the second objective function according to the weighted average of the differences between the prior probability and the generation probability at each moment.
In some embodiments, the determining unit 42 determines, according to the advantage function of the generated word at each moment, the weight of the difference between the prior probability and the generation probability at the corresponding moment, and the weight is negatively correlated with the advantage function.
The training unit 43 trains the reinforcement learning model based on the weighted average of the first objective function and the second objective function.
In some embodiments, the training unit 43 determines a comprehensive objective function according to the weighted average of the first objective function and the second objective function, and trains the reinforcement learning model with the condition that the comprehensive objective function is minimized.
In some embodiments, the training device 4 further comprises: and a generating unit 44, configured to generate natural language data by using the trained reinforcement learning model.
In some embodiments, the generation unit 44 translates the input first language data into the second language data using the trained reinforcement learning model.
Fig. 5 shows a block diagram of some embodiments of a training apparatus of a natural language generation model or a generation apparatus of a natural language of the present disclosure.
As shown in fig. 5, in some embodiments, the training device 5 for generating a model from natural language includes: a memory 51 and a processor 52 coupled to the memory 51, the processor 52 being configured to execute a training method of a natural language generating model in any one of the embodiments of the present disclosure based on instructions stored in the memory 51.
In some embodiments, the natural language generation device 5 includes: a memory 51 and a processor 52 coupled to the memory 51, the processor 52 being configured to execute the method for generating natural language in any one of the embodiments of the present disclosure based on instructions stored in the memory 51.
The memory 51 may include, for example, a system memory, a fixed nonvolatile storage medium, and the like. The system memory stores, for example, an operating system, an application program, a Boot Loader, a database, and other programs.
Fig. 6 shows a block diagram of further embodiments of the training apparatus of the natural language generation model or the generation apparatus of natural language of the present disclosure.
As shown in fig. 6, in some embodiments, the training device 6 of the natural language generation model includes: a memory 610 and a processor 620 coupled to the memory 610, the processor 620 being configured to execute the training method of the natural language generation model in any of the foregoing embodiments based on instructions stored in the memory 610.
In some embodiments, the natural language generation device 6 includes: a memory 610 and a processor 620 coupled to the memory 610, wherein the processor 620 is configured to execute the natural language generating method in any one of the embodiments based on instructions stored in the memory 610.
The memory 610 may include, for example, system memory, fixed non-volatile storage media, and the like. The system memory stores, for example, an operating system, an application program, a Boot Loader, and other programs.
The apparatus 6 may also include an input-output interface 630, a network interface 640, a storage interface 650, and the like. These interfaces 630, 640, 650 and the memory 610 and the processor 620 may be connected, for example, by a bus 660. The input/output interface 630 provides a connection interface for input/output devices such as a display, a mouse, a keyboard, a touch screen, a microphone, and a speaker. The network interface 640 provides a connection interface for various networking devices. The storage interface 650 provides a connection interface for external storage devices such as an SD card and a USB flash disk.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media having computer-usable program code embodied therein, including but not limited to disk storage, CD-ROM, optical storage, and the like.
Thus far, a training method of a natural language generation model, a training apparatus of a natural language generation model, a generation method of a natural language, a generation apparatus of a natural language, and a nonvolatile computer-readable storage medium according to the present disclosure have been described in detail. Some details that are well known in the art have not been described in order to avoid obscuring the concepts of the present disclosure. It will be fully apparent to those skilled in the art from the foregoing description how to practice the presently disclosed embodiments.
The method and system of the present disclosure may be implemented in a number of ways. For example, the methods and systems of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
Although some specific embodiments of the present disclosure have been described in detail by way of example, it should be understood by those skilled in the art that the foregoing examples are for purposes of illustration only and are not intended to limit the scope of the present disclosure. It will be appreciated by those skilled in the art that modifications may be made to the above embodiments without departing from the scope and spirit of the present disclosure. The scope of the present disclosure is defined by the appended claims.

Claims (26)

1. A training method of a natural language generation model, comprising:
taking the generated words at each historical moment as the state of the current moment, taking the output of the generated words at the current moment as an action, and modeling the natural language generation processing as a reinforcement learning model;
determining a first objective function according to the advantage function of the generated word at each moment output by the reinforcement learning model;
determining a second objective function according to the difference between the probability distribution of the labeling result at each moment and the probability distribution of the generated words at each moment;
and training the reinforcement learning model according to the weighted average value of the first objective function and the second objective function.
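For orientation only (this sketch is not part of the claims), the training objective of claim 1 can be read as a weighted average of the two objective functions; the symbols λ, J_1 and J_2 below are assumed names introduced for illustration:

```latex
% Illustrative reading of claim 1; \lambda, J_1, J_2 are assumed symbols, not claim language.
% J_1: first objective function built from the advantage function of the generated words.
% J_2: second objective function built from the gap between the label and generation distributions.
J(\theta) = \lambda\, J_1(\theta) + (1-\lambda)\, J_2(\theta), \qquad 0 \le \lambda \le 1
```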
2. The training method according to claim 1, wherein the determining a first objective function according to the advantage function of the generated word at each moment output by the reinforcement learning model comprises:
and determining the advantage function according to the difference between the state action function at each moment and the state action function at the previous moment.
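One reading of claim 2, written out as a formula (an illustrative sketch; Q and A are assumed symbols for the state action function and the advantage function):

```latex
% Advantage at time t as the difference of state action functions at consecutive moments.
A(s_t, a_t) = Q(s_t, a_t) - Q(s_{t-1}, a_{t-1})
```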
3. The training method of claim 2, wherein the determining the advantage function comprises:
according to the action a from time 0 to t-10~at-1Action a at time ttPredicting a plurality of action combinations a of times t +1 to L-1t+1~aL-1
Calculating each action combination at+1~aL-1The prize value of (d);
and determining the state action function at the time t according to the weighted average value of the reward values.
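The rollout of claim 3 can be sketched as a Monte Carlo estimate of the state action function. The code below is illustrative only: the uniform random continuations and the toy `reward_fn` are stand-ins for the generation model's own sampling and for a sentence-level reward such as BLEU, and the uniform weighting is just one admissible weighted average:

```python
"""Illustrative sketch of claim 3; not the patent's reference implementation."""
import random
from typing import Callable, List, Sequence


def estimate_q(
    prefix: Sequence[int],                      # generated words a_0 .. a_t (word ids)
    vocab_size: int,
    seq_len: int,                               # total sequence length L
    reward_fn: Callable[[Sequence[int]], float],
    num_rollouts: int = 8,
) -> float:
    """Monte Carlo estimate of the state action function Q(s_t, a_t)."""
    rewards: List[float] = []
    for _ in range(num_rollouts):
        # Sample one action combination a_{t+1} .. a_{L-1}; a real system would sample
        # from the generation model, uniform ids only keep the sketch self-contained.
        continuation = [random.randrange(vocab_size) for _ in range(seq_len - len(prefix))]
        rewards.append(reward_fn(list(prefix) + continuation))
    # Uniform weights; claim 3 only requires some weighted average of the reward values.
    return sum(rewards) / len(rewards)


if __name__ == "__main__":
    # Toy reward (fraction of even word ids), used only to make the sketch runnable.
    toy_reward = lambda seq: sum(w % 2 == 0 for w in seq) / len(seq)
    print(estimate_q(prefix=[3, 7, 2], vocab_size=50, seq_len=10, reward_fn=toy_reward))
```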
4. The training method according to claim 1, wherein the determining a first objective function according to the advantage function of the generated word at each moment output by the reinforcement learning model comprises:
outputting the generation probability of the generated word at each moment by using the reinforcement learning model;
and determining the first objective function according to the advantage function and the generation probability of the generated word at each moment.
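A common instantiation of claim 4 (a sketch, not the claimed formula) weights the log of the generation probability by the advantage of each generated word, in the style of a policy gradient:

```latex
% p_\theta(a_t \mid s_t): generation probability output by the reinforcement learning model;
% A(s_t, a_t): advantage of the generated word at time t.
J_1(\theta) = -\sum_{t=0}^{L-1} A(s_t, a_t)\, \log p_\theta(a_t \mid s_t)
```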
5. The training method according to claim 1, wherein the determining the second objective function according to the difference between the probability distribution of the labeling result at each time and the probability distribution of the generated word at each time comprises:
determining the prior probability at each moment according to the difference between the labeling result at each moment and each word in the corpus;
outputting the generation probability of the generated word at each moment by using the reinforcement learning model;
and determining the second objective function according to the weighted average of the differences between the prior probability and the generation probability at each moment.
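One way to write the second objective function of claim 5 (illustrative only; q_t, w_t and the KL divergence are assumed choices for the prior probability, the per-moment weight, and the "difference" between the two distributions):

```latex
% q_t: prior probability distribution at moment t derived from the labeling result;
% p_\theta(\cdot \mid s_t): generation probability distribution; w_t: weight of moment t.
J_2(\theta) = \sum_{t=0}^{L-1} w_t \, \mathrm{KL}\!\left(q_t \,\|\, p_\theta(\cdot \mid s_t)\right)
```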
6. The training method of claim 5, wherein the determining the prior probability at each time according to the difference between the labeling result at each time and each word in the corpus comprises:
and determining the prior probability of each moment according to the similarity between the word vector of the labeling result at each moment and the vector of each word in the corpus.
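A minimal sketch of claim 6, under the assumptions that the similarity is cosine similarity and that the similarities are normalized into a prior distribution with a softmax (neither assumption is required by the claim):

```python
"""Illustrative sketch of claim 6; embeddings and the softmax are assumed details."""
import numpy as np


def prior_probability(label_vec: np.ndarray, corpus_vecs: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """label_vec: (d,) word vector of the labeling result at moment t.
    corpus_vecs: (V, d) vectors of every word in the corpus.
    Returns the (V,) prior probability at moment t."""
    # Cosine similarity between the label vector and each corpus word vector.
    label = label_vec / (np.linalg.norm(label_vec) + 1e-8)
    corpus = corpus_vecs / (np.linalg.norm(corpus_vecs, axis=1, keepdims=True) + 1e-8)
    sims = corpus @ label                                   # (V,)
    # Softmax turns the similarities into a probability distribution over the corpus.
    z = np.exp((sims - sims.max()) / temperature)
    return z / z.sum()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q_t = prior_probability(rng.normal(size=64), rng.normal(size=(1000, 64)))
    print(q_t.shape, float(q_t.sum()))                      # (1000,) and a sum of about 1.0
```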
7. The training method of claim 5, wherein the determining the second objective function according to the weighted average of the differences between the prior probability and the generation probability at each moment comprises:
and determining the weight of the difference between the prior probability and the generation probability at the corresponding moment according to the advantage function of the generated word at each moment, wherein the weight is negatively correlated with the advantage function.
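Claim 7 only requires the weight to be negatively correlated with the advantage function; one illustrative choice (σ is the logistic function, an assumed detail):

```latex
% Larger advantage -> smaller weight on the distribution-matching term at that moment.
w_t = 1 - \sigma\big(A(s_t, a_t)\big) = \frac{1}{1 + e^{A(s_t, a_t)}}
```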
8. The training method of claim 1, wherein said training the reinforcement learning model based on a weighted average of the first and second objective functions comprises:
determining a comprehensive objective function according to the weighted average value of the first objective function and the second objective function;
and training the reinforcement learning model by minimizing the comprehensive objective function.
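Putting claims 1 to 8 together, one training step might look like the sketch below. PyTorch is an assumed framework; the tensor shapes, the KL form chosen for the second objective, and the mixing weight `lam` are illustrative choices rather than the patent's implementation:

```python
"""Illustrative sketch of the comprehensive objective of claim 8."""
import torch
import torch.nn.functional as F


def combined_loss(logits, actions, advantage, prior, lam=0.5):
    """logits: (T, V) model outputs per moment; actions: (T,) generated word ids;
    advantage: (T,) advantage of each generated word; prior: (T, V) prior distributions."""
    log_p = F.log_softmax(logits, dim=-1)                   # generation log-probabilities
    # First objective: advantage-weighted negative log-likelihood (policy-gradient style).
    j1 = -(advantage * log_p.gather(1, actions.unsqueeze(1)).squeeze(1)).mean()
    # Second objective: divergence between the prior and the generation distribution.
    j2 = F.kl_div(log_p, prior, reduction="batchmean")
    # Comprehensive objective: weighted average of the two, to be minimized.
    return lam * j1 + (1.0 - lam) * j2


if __name__ == "__main__":
    T, V = 6, 100
    logits = torch.randn(T, V, requires_grad=True)
    loss = combined_loss(
        logits,
        actions=torch.randint(0, V, (T,)),
        advantage=torch.randn(T),
        prior=torch.softmax(torch.randn(T, V), dim=-1),
    )
    loss.backward()                                         # gradients reach the model parameters
    print(float(loss))
```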
9. The training method of any one of claims 1-8, further comprising:
and generating natural language data by using the reinforcement learning model after training.
10. The training method of claim 9, wherein the generating natural language data comprises:
and translating the input first language data into second language data by using the trained reinforcement learning model.
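For claim 10, the trained model is used as an ordinary sequence generator at inference time. The sketch below assumes a hypothetical `encode`/`step` interface and greedy decoding; a real translator could decode with beam search instead, and `DummyTranslator` exists only to make the example runnable:

```python
"""Illustrative inference sketch for claim 10; the model interface is assumed."""
import torch


class DummyTranslator:
    """Stand-in for the trained reinforcement learning model."""
    def __init__(self, vocab_size: int = 32):
        self.vocab_size = vocab_size

    def encode(self, source_ids):
        # Fake "encoder state" derived from the first-language word ids.
        return torch.tensor(source_ids, dtype=torch.float32).mean()

    def step(self, state, output_ids):
        # Fake next-word scores; a real model would condition on state and output_ids.
        torch.manual_seed(len(output_ids))
        return torch.randn(len(output_ids), self.vocab_size) + state


@torch.no_grad()
def translate_greedy(model, source_ids, bos_id=1, eos_id=2, max_len=20):
    """Greedily decode second-language word ids from first-language input ids."""
    state = model.encode(source_ids)
    output = [bos_id]
    for _ in range(max_len):
        logits = model.step(state, output)                  # scores given the words so far
        next_id = int(torch.argmax(logits[-1]))
        output.append(next_id)
        if next_id == eos_id:                               # stop at the end-of-sentence token
            break
    return output[1:]


if __name__ == "__main__":
    print(translate_greedy(DummyTranslator(), source_ids=[5, 9, 3]))
```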
11. A method of generating natural language, comprising:
taking the generated words at each historical moment as the state of the current moment, taking the output of the generated words at the current moment as an action, and modeling the natural language generation processing as a reinforcement learning model;
determining a first objective function according to the advantage function of the generated word at each moment output by the reinforcement learning model;
determining a second objective function according to the difference between the probability distribution of the labeling result at each moment and the probability distribution of the generated words at each moment;
training the reinforcement learning model according to the weighted average value of the first objective function and the second objective function;
and generating natural language data by using the reinforcement learning model after training.
12. The generation method of claim 11, wherein the generating natural language data comprises:
and translating the input first language data into second language data by using the trained reinforcement learning model.
13. A training apparatus for a natural language generation model, comprising:
a modeling unit for modeling natural language generation processing as a reinforcement learning model by using the generated word at each historical time as a state of the current time and using the output of the generated word at the current time as an action;
a determining unit, configured to determine a first objective function according to the advantage function of the generated word at each time output by the reinforcement learning model, and determine a second objective function according to the difference between the probability distribution of the labeling result at each time and the probability distribution of the generated word at each time;
and the training unit is used for training the reinforcement learning model according to the weighted average value of the first objective function and the second objective function.
14. The training device of claim 13,
the determining unit determines the advantage function according to a difference between the state action function at each time and the state action function at the previous time.
15. The training device of claim 14,
the determining unit predicts a plurality of action combinations a_{t+1} to a_{L-1} for times t+1 to L-1 according to the actions a_0 to a_{t-1} from time 0 to time t-1 and the action a_t at time t; calculates the reward value of each action combination a_{t+1} to a_{L-1}; and determines the state action function at time t according to the weighted average of the reward values.
16. The training device according to claim 13, wherein the determining unit outputs the generation probability of the generated word at each time by using the reinforcement learning model, and determines the first objective function according to the advantage function and the generation probability of the generated word at each time.
17. The training device of claim 13,
the determining unit determines the prior probability at each time according to the difference between the labeling result at each time and each word in the corpus, outputs the generation probability of the generated word at each time by using the reinforcement learning model, and determines the second objective function according to the weighted average of the differences between the prior probability and the generation probability at each time.
18. The training device of claim 17,
the determining unit determines the prior probability of each moment according to the similarity between the word vector of the labeling result of each moment and the vector of each word in the corpus.
19. The training device of claim 17,
the determining unit determines the weight of the difference between the prior probability and the generation probability at the corresponding time according to the advantage function of the generated word at each time, wherein the weight is negatively correlated with the advantage function.
20. The training device of claim 13,
the training unit determines a comprehensive objective function according to the weighted average value of the first objective function and the second objective function, and trains the reinforcement learning model by minimizing the comprehensive objective function.
21. The training apparatus of any one of claims 13-20, further comprising:
and the generating unit is used for generating natural language data by using the reinforcement learning model after training.
22. The training device of claim 21,
the generation unit translates the input first language data into second language data by using the reinforcement learning model after training.
23. An apparatus for generating a natural language, comprising:
a modeling unit for modeling natural language generation processing as a reinforcement learning model by using the generated word at each historical time as a state of the current time and using the output of the generated word at the current time as an action;
a determining unit, configured to determine a first objective function according to the advantage function of the generated word at each time output by the reinforcement learning model, and determine a second objective function according to the difference between the probability distribution of the labeling result at each time and the probability distribution of the generated word at each time;
the training unit is used for training the reinforcement learning model according to the weighted average value of the first objective function and the second objective function;
and the generating unit is used for generating natural language data by using the reinforcement learning model after training.
24. A training apparatus for a natural language generation model, comprising:
a memory; and
a processor coupled to the memory, the processor configured to perform the training method of the natural language generation model of any one of claims 1 to 10 based on instructions stored in the memory.
25. An apparatus for generating a natural language, comprising:
a memory; and
a processor coupled to the memory, the processor configured to perform the natural language generation method of claim 11 or 12 based on instructions stored in the memory.
26. A non-transitory computer-readable storage medium on which a computer program is stored, which when executed by a processor implements the method of training a natural language generation model according to any one of claims 1 to 10, or the method of generating a natural language according to claim 11 or 12.
CN202110395155.XA 2021-04-13 2021-04-13 Training method and device of natural language generation model Pending CN113111638A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110395155.XA CN113111638A (en) 2021-04-13 2021-04-13 Training method and device of natural language generation model

Publications (1)

Publication Number Publication Date
CN113111638A true CN113111638A (en) 2021-07-13

Family

ID=76716257

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110395155.XA Pending CN113111638A (en) 2021-04-13 2021-04-13 Training method and device of natural language generation model

Country Status (1)

Country Link
CN (1) CN113111638A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10467274B1 (en) * 2016-11-10 2019-11-05 Snap Inc. Deep reinforcement learning-based captioning with embedding reward
WO2020253648A1 (en) * 2019-06-19 2020-12-24 腾讯科技(深圳)有限公司 Translation method, method and apparatus for training machine translation model, and storage medium
CN112329404A (en) * 2021-01-04 2021-02-05 湖南科迪云飞信息科技有限公司 Text generation method and device based on fact guide and computer equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ANKUR GANDHE et al.: "Optimization of Neural Network Language Models for keyword search", 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 14 July 2014 (2014-07-14) *
ZHANG Yike; ZHANG Pengyuan; YAN Yonghong: "Language model data augmentation based on an adversarial training strategy", Acta Automatica Sinica, no. 05, 18 April 2018 (2018-04-18) *

Similar Documents

Publication Publication Date Title
US11604956B2 (en) Sequence-to-sequence prediction using a neural network model
US10860808B2 (en) Method and system for generation of candidate translations
Nandwani et al. A primal dual formulation for deep learning with constraints
Jing et al. A survey on neural network language models
Leblond et al. Machine translation decoding beyond beam search
Zhou et al. Multi-space variational encoder-decoders for semi-supervised labeled sequence transduction
He et al. Discriminative learning in sequential pattern recognition
Subramanya et al. Efficient graph-based semi-supervised learning of structured tagging models
CN111401084B (en) Method and device for machine translation and computer readable storage medium
CN112016332B (en) Multi-modal machine translation method based on variational reasoning and multi-task learning
Kreutzer et al. Bandit structured prediction for neural sequence-to-sequence learning
Ma et al. Accurate linear-time Chinese word segmentation via embedding matching
CN109299479A (en) Translation memory is incorporated to the method for neural machine translation by door control mechanism
JP7070653B2 (en) Learning devices, speech recognition ranking estimators, their methods, and programs
CN113408430B (en) Image Chinese description system and method based on multi-level strategy and deep reinforcement learning framework
CN112052318A (en) Semantic recognition method and device, computer equipment and storage medium
JP7163618B2 (en) LEARNING DEVICE, LEARNING METHOD, PROGRAM AND ESTIMATION DEVICE
CN113655893A (en) Word and sentence generation method, model training method and related equipment
Li et al. Text revision by on-the-fly representation optimization
Labeau et al. Character and subword-based word representation for neural language modeling prediction
Quattoni et al. Spectral regularization for max-margin sequence tagging
US11948387B2 (en) Optimized policy-based active learning for content detection
Yazdani et al. Incremental recurrent neural network dependency parser with search-based discriminative training
Shao et al. Rephrasing the reference for non-autoregressive machine translation
CN116680575B (en) Model processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination