CN111428519B - Entropy-based neural machine translation dynamic decoding method and system - Google Patents

Entropy-based neural machine translation dynamic decoding method and system

Info

Publication number
CN111428519B
CN111428519B (application CN202010151246.4A)
Authority
CN
China
Prior art keywords
entropy
vector
word
time step
target language
Prior art date
Legal status
Active
Application number
CN202010151246.4A
Other languages
Chinese (zh)
Other versions
CN111428519A (en
Inventor
程学旗
郭嘉丰
范意兴
王素
Current Assignee
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202010151246.4A
Publication of CN111428519A
Application granted
Publication of CN111428519B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Abstract

The invention provides an entropy-based neural machine translation dynamic decoding method and system. By analyzing the relation between sentence entropy and the BLEU score, it is found that the average entropy of the words in sentences with high BLEU scores is smaller than that of the words in sentences with low BLEU scores, and that sentences with low entropy generally obtain higher BLEU scores than sentences with high entropy. Computing the Pearson coefficient between sentence entropy and the BLEU score confirms that the two are correlated. The invention therefore proposes that, at each time step of the decoding stage during training, the model not only samples a real word or a predicted word with a certain probability as context information, but also computes an entropy value from the prediction of the previous time step and dynamically adjusts the weight of the context information according to this entropy. This alleviates the error accumulation caused by the difference in context information between training and inference in the decoding process of neural machine translation models.

Description

Entropy-based neural machine translation dynamic decoding method and system
Technical Field
The invention relates to the technical field of natural language processing and neural machine translation, in particular to a neural machine translation dynamic decoding method and system based on entropy.
Background
Machine translation is an important task in natural language processing, and in recent years, with the rise of deep neural networks, machine translation methods based on neural networks have made great progress and have gradually become mainstream machine translation methods. The neural machine translation model mainly comprises three parts: an encoder network, a decoder network, and an attention network.
The encoder network is responsible for encoding the source language sentence into a list of hidden vectors, with one hidden vector representation for each word. The encoder network is typically a multi-layer bidirectional RNN. The forward RNN $\overrightarrow{f}$ reads the source sentence in order (from $x_1$ to $x_{|x|}$) and computes the forward hidden state sequence $(\overrightarrow{h}_1,\dots,\overrightarrow{h}_{|x|})$; the reverse RNN $\overleftarrow{f}$ reads the source sentence in reverse order (from $x_{|x|}$ to $x_1$) and computes the reverse hidden state sequence $(\overleftarrow{h}_1,\dots,\overleftarrow{h}_{|x|})$. The hidden vector corresponding to word $x_i$ is the concatenation $h_i=[\overrightarrow{h}_i;\overleftarrow{h}_i]$, so $h_i$ contains semantic information from both the preceding and the following words.
The attention network computes a context vector $c_j$ from the list of hidden vectors $(h_1,\dots,h_{|x|})$ generated by the encoder network and the current hidden state vector $s_{j-1}$, and passes it to the decoder network. It first computes the degree of correlation between each hidden vector in $(h_1,\dots,h_{|x|})$ and the current hidden state vector $s_{j-1}$, obtaining a weight list $(\alpha_{1j},\dots,\alpha_{|x|j})$; it then uses these weights to compute the context vector $c_j$ as a weighted sum of the hidden vectors, which is used in the calculation of the next hidden state vector $s_j$.
The decoder network is typically a multi-layer RNN. At each time step, it computes the hidden state vector $s_j$ of the next time step from the current word vector $E(y_{j-1})$, the hidden state vector $s_{j-1}$ and the context vector $c_j$ computed by the attention network, and decodes a target language word $y_j$, until a special end-of-sentence symbol (EOS) is generated.
The architecture of the existing neural machine translation model is shown in FIG. 1. Although existing neural machine translation models have achieved good results, some shortcomings remain. In the prior art, the model decodes the target words one by one according to context information. In the training phase, the model predicts using the real words as context information, whereas in the inference phase it must generate the whole sequence from scratch and can only use the prediction of the previous time step as context information. This difference in context information between training and inference leads to an accumulation of errors, because the model must make predictions in situations it never encountered during training.
In existing neural machine translation models, each time step of the decoding process computes the hidden state vector $s_j$ of the next time step from the current word vector $E(y_{j-1})$, the hidden state vector $s_{j-1}$ and the context vector $c_j$ computed by the attention network, i.e.

$s_j = f\big(s_{j-1}, E(y_{j-1}), c_j\big)$

In the training phase, $y_{j-1}$ is the real target language word $y^*_{j-1}$ from the training corpus, while in the inference phase $y_{j-1}$ is the target language word $\hat{y}_{j-1}$ predicted at the previous time step. To reduce this difference between training and inference, the model samples the context information from both the real sequence and the predicted sequence with a certain probability during training, instead of always selecting the target language word from the real sequence; that is, $y_{j-1}$ is sampled from $\{y^*_{j-1}, \hat{y}_{j-1}\}$. Although this method reduces the gap between the training phase and the inference phase to a certain extent and improves the translation quality, when the sampling selects a word $\hat{y}_{j-1}$ from the predicted sequence, the uncertainty of the prediction introduces prediction errors into the training process and reduces the robustness of the model.
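As a minimal, self-contained illustration of this training/inference mismatch (a toy example, not the patent's model: the tiny vocabulary, the stand-in `toy_step` function and its error rate are all assumptions), the following sketch contrasts teacher-forced decoding, where the context is always the real previous word, with free-running decoding, where each mistake is fed back into the next step:

```python
# Illustrative only: toy_step stands in for one decoder time step.
import random

random.seed(0)

def toy_step(prev_word: str) -> str:
    """Stand-in for the s_j / y_j computation: returns a 'predicted' next word."""
    order = {"<bos>": "the", "the": "cat", "cat": "sat", "sat": "<eos>"}
    # Inject occasional mistakes to mimic prediction uncertainty.
    return order.get(prev_word, "<eos>") if random.random() > 0.2 else "the"

reference = ["the", "cat", "sat", "<eos>"]

# Training-style decoding: the context is always the real previous word y*_{j-1}.
train_ctx, train_out = "<bos>", []
for gold in reference:
    train_out.append(toy_step(train_ctx))
    train_ctx = gold                      # ground truth, so errors cannot accumulate

# Inference-style decoding: the context is the model's own prediction.
infer_ctx, infer_out = "<bos>", []
for _ in range(len(reference)):
    pred = toy_step(infer_ctx)
    infer_out.append(pred)
    infer_ctx = pred                      # mistakes feed forward and accumulate

print("teacher forcing :", train_out)
print("free running    :", infer_out)
```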
Disclosure of Invention
The invention aims to solve the problem of error accumulation in the decoding process of a neural machine translation model caused by the difference in context information between training and inference. The prior art provides a method for keeping training and prediction consistent in machine translation: during training, a correct word or a predicted word is sampled and selected for each decoding position with a certain probability, so that the training and prediction procedures are kept consistent. However, due to the uncertainty of the prediction itself, this method introduces prediction errors into the training process when words from the predicted sequence are selected, which reduces the robustness of the model. The invention analyzes the correlation between sentence entropy and the bilingual evaluation understudy (BLEU) score and, building on the method for keeping training and prediction consistent, dynamically adjusts the weight of the context information according to the entropy value, thereby reducing the influence of uncertainty on the translation result.
Specifically, the invention provides an entropy-based neural machine translation dynamic decoding method, comprising:
Step 1, feeding word vectors of the words in a source language sentence in a training corpus into an encoder network to obtain a list of encoding vectors $(h_1,\dots,h_{|x|})$ of the source language sentence and the hidden state vector $s_{j-1}$ of the (j-1)-th time step;
Step 2, obtaining, by an attention network, a context vector $c_j$ from the list of encoding vectors $(h_1,\dots,h_{|x|})$ and the hidden state vector $s_{j-1}$;
Step 3, obtaining the (j-1)-th real target language word $y^*_{j-1}$ in the training corpus and the target language word $\hat{y}_{j-1}$ predicted at time step j-1, selecting the real target language word $y^*_{j-1}$ with probability p and the predicted target language word $\hat{y}_{j-1}$ with probability 1-p, and obtaining a selection result $y_{j-1}$;
Step 4, obtaining the entropy $e_{j-1}$ of time step j-1 from the selection result $y_{j-1}$ by the following formula, where N is the size of the target language lexicon and $p_{i,j-1}$ is the predicted probability of the i-th target language word:
$e_{j-1} = -\sum_{i=1}^{N} p_{i,j-1}\log p_{i,j-1}$;
Step 5, inputting the selection result $y_{j-1}$, the hidden state vector $s_{j-1}$, the context vector $c_j$ and the entropy value $e_{j-1}$ into a decoder network to obtain the hidden state vector $s_j$ of the current j-th time step;
Step 6, obtaining the target language word $\hat{y}_j$ of the j-th time step from $y_{j-1}$, the hidden state vector $s_j$ and the context vector $c_j$;
Step 7, passing the list of encoding vectors $(h_1,\dots,h_{|x|})$, the hidden state vector $s_j$ of the j-th time step, the j-th real target language word $y^*_j$ in the training corpus and the target language word $\hat{y}_j$ predicted at the j-th time step to the (j+1)-th time step, and continuing the decoding process until a special end-of-sentence symbol EOS is generated.
The neural machine translation dynamic decoding method based on entropy, wherein the step 2 comprises:
obtaining the context vector $c_j$ through the attention network, where the weight $\alpha_{ij}$ reflects the importance of the hidden state $h_i$, relative to the hidden state $s_{j-1}$, in determining the next hidden state $s_j$ and predicting $y_j$:
$u_{ij} = V_a^{\top}\tanh(W_a s_{j-1} + U_a h_i)$;
$\alpha_{ij} = \exp(u_{ij}) \big/ \sum_{k=1}^{|x|}\exp(u_{kj})$;
$c_j = \sum_{i=1}^{|x|}\alpha_{ij} h_i$;
where x is the source language sentence, and $V_a$, $W_a$ and $U_a$ are parameters to be learned in the neural network.
The neural machine translation dynamic decoding method based on entropy, wherein the step 3 comprises:
$p = \dfrac{\mu}{\mu + \exp(e/\mu)}$;
$y_{j-1} = \begin{cases} y^*_{j-1}, & \text{with probability } p \\ \hat{y}_{j-1}, & \text{with probability } 1-p \end{cases}$;
where μ is the hyperparameter and e is the number of training rounds.
The neural machine translation dynamic decoding method based on entropy, wherein the step 6 comprises:
$t_j = g\big(E(y_{j-1}), s_j, c_j\big)$;
$o_j = W_o t_j$;
$P_j = \mathrm{softmax}(o_j)$;
$\hat{y}_j = \arg\max_i\, p_{i,j}$;
where $E(y_{j-1})$ is the word vector of $y_{j-1}$, g is the readout transformation combining $E(y_{j-1})$, $s_j$ and $c_j$, $e_{j-1}$ is the entropy of the probability distribution of the word predicted at time step j-1, and $W_o$ is a parameter to be learned in the neural network.
The neural machine translation dynamic decoding method based on entropy, wherein the step 7 comprises:
the encoder network computes the list of hidden vectors $(h_1,\dots,h_{|x|})$ corresponding to the source language sentence, where $E(x_i)$ is the word vector of the word $x_i$ and $h_i$ is the hidden vector representation corresponding to the word $x_i$:
$\overrightarrow{h}_i = \overrightarrow{f}(\overrightarrow{h}_{i-1}, E(x_i)),\quad \overleftarrow{h}_i = \overleftarrow{f}(\overleftarrow{h}_{i+1}, E(x_i))$;
$h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$.
the invention also provides a neural machine translation dynamic decoding system based on entropy, which comprises
Module 1, feeding word vectors of the words in a source language sentence in the training corpus into an encoder network to obtain a list of encoding vectors $(h_1,\dots,h_{|x|})$ of the source language sentence and the hidden state vector $s_{j-1}$ of the (j-1)-th time step;
Module 2, obtaining, through the attention network, a context vector $c_j$ from the list of encoding vectors $(h_1,\dots,h_{|x|})$ and the hidden state vector $s_{j-1}$;
Module 3, obtaining the (j-1)-th real target language word $y^*_{j-1}$ in the training corpus and the target language word $\hat{y}_{j-1}$ predicted at time step j-1, selecting the real target language word $y^*_{j-1}$ with probability p and the predicted target language word $\hat{y}_{j-1}$ with probability 1-p, and obtaining a selection result $y_{j-1}$;
Module 4, obtaining the entropy $e_{j-1}$ of time step j-1 from the selection result $y_{j-1}$ by the following formula, where N is the size of the target language lexicon and $p_{i,j-1}$ is the predicted probability of the i-th target language word:
$e_{j-1} = -\sum_{i=1}^{N} p_{i,j-1}\log p_{i,j-1}$;
Module 5, inputting the selection result $y_{j-1}$, the hidden state vector $s_{j-1}$, the context vector $c_j$ and the entropy value $e_{j-1}$ into the decoder network to obtain the hidden state vector $s_j$ of the current j-th time step;
Module 6, obtaining the target language word $\hat{y}_j$ of the j-th time step from $y_{j-1}$, the hidden state vector $s_j$ and the context vector $c_j$;
Module 7, passing the list of encoding vectors $(h_1,\dots,h_{|x|})$, the hidden state vector $s_j$ of the j-th time step, the j-th real target language word $y^*_j$ in the training corpus and the target language word $\hat{y}_j$ predicted at the j-th time step to the (j+1)-th time step, and continuing the decoding process until a special end-of-sentence symbol EOS is generated.
The neural machine translation dynamic decoding system based on entropy, wherein the module 2 comprises:
obtaining the context vector $c_j$ through the attention network, where the weight $\alpha_{ij}$ reflects the importance of the hidden state $h_i$, relative to the hidden state $s_{j-1}$, in determining the next hidden state $s_j$ and predicting $y_j$:
$u_{ij} = V_a^{\top}\tanh(W_a s_{j-1} + U_a h_i)$;
$\alpha_{ij} = \exp(u_{ij}) \big/ \sum_{k=1}^{|x|}\exp(u_{kj})$;
$c_j = \sum_{i=1}^{|x|}\alpha_{ij} h_i$;
where x is the source language sentence, and $V_a$, $W_a$ and $U_a$ are parameters to be learned in the neural network.
The neural machine translation dynamic decoding system based on entropy, wherein the module 3 comprises:
$p = \dfrac{\mu}{\mu + \exp(e/\mu)}$;
$y_{j-1} = \begin{cases} y^*_{j-1}, & \text{with probability } p \\ \hat{y}_{j-1}, & \text{with probability } 1-p \end{cases}$;
where μ is the hyperparameter and e is the number of training rounds.
The neural machine translation dynamic decoding system based on entropy, wherein the module 6 comprises:
$t_j = g\big(E(y_{j-1}), s_j, c_j\big)$;
$o_j = W_o t_j$;
$P_j = \mathrm{softmax}(o_j)$;
$\hat{y}_j = \arg\max_i\, p_{i,j}$;
where $E(y_{j-1})$ is the word vector of $y_{j-1}$, g is the readout transformation combining $E(y_{j-1})$, $s_j$ and $c_j$, $e_{j-1}$ is the entropy of the probability distribution of the word predicted at time step j-1, and $W_o$ is a parameter to be learned in the neural network.
The neural machine translation dynamic decoding system based on entropy, wherein the module 7 comprises:
the encoder network computes the list of hidden vectors $(h_1,\dots,h_{|x|})$ corresponding to the source language sentence, where $E(x_i)$ is the word vector of the word $x_i$ and $h_i$ is the hidden vector representation corresponding to the word $x_i$:
$\overrightarrow{h}_i = \overrightarrow{f}(\overrightarrow{h}_{i-1}, E(x_i)),\quad \overleftarrow{h}_i = \overleftarrow{f}(\overleftarrow{h}_{i+1}, E(x_i))$;
$h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$.
drawings
FIG. 1 is a diagram of a prior art neural machine translation model architecture;
FIG. 2 is a diagram of the sampling of $y_{j-1}$;
FIG. 3 is a GRU calculation schematic;
FIG. 4 is a flowchart of the entropy-based dynamic decoding method.
Detailed Description
While researching neural machine translation technology, the inventors analyzed the relation between sentence entropy and the BLEU score and found that the average entropy of the words in sentences with high BLEU scores is smaller than that of the words in sentences with low BLEU scores, and that sentences with low entropy obtain higher BLEU scores than sentences with high entropy. By computing the Pearson coefficient between sentence entropy and the BLEU score, the inventors found that the two are correlated. The invention therefore proposes that, at each time step of the decoding stage during training, the model not only samples a real word or a predicted word with a certain probability as context information, but also computes an entropy value from the prediction of the previous time step and dynamically adjusts the weight of the context information according to this entropy.
In order to make the aforementioned features and effects of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.
RNN-based NMT model
Let the source language sentence be $x = (x_1,\dots,x_{|x|})$, and let the corresponding target language sentence be $y^* = (y^*_1,\dots,y^*_{|y|})$.
Encoder:
The encoder network computes the list of hidden vectors $(h_1,\dots,h_{|x|})$ corresponding to the source language sentence, where $E(x_i)$ is the word vector of the word $x_i$ and $h_i$ is the hidden vector representation corresponding to the word $x_i$:

$\overrightarrow{h}_i = \overrightarrow{f}(\overrightarrow{h}_{i-1}, E(x_i)),\quad \overleftarrow{h}_i = \overleftarrow{f}(\overleftarrow{h}_{i+1}, E(x_i))$ (1)

$h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$ (2)
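A minimal PyTorch-style sketch of formulas (1)-(2) follows, assuming a single-layer bidirectional GRU as the encoder RNN; the class name, dimensions and batch layout are illustrative assumptions rather than the patent's implementation:

```python
import torch
import torch.nn as nn

class BiGRUEncoder(nn.Module):
    """Sketch of formulas (1)-(2): h_i is the concatenation of forward and reverse states."""
    def __init__(self, vocab_size: int, emb_dim: int = 64, hid_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)        # E(x_i)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True,
                          bidirectional=True)                  # forward + reverse RNN

    def forward(self, src_ids: torch.Tensor) -> torch.Tensor:
        emb = self.embed(src_ids)     # (batch, |x|, emb_dim)
        h, _ = self.rnn(emb)          # (batch, |x|, 2*hid_dim): [forward_h_i ; backward_h_i]
        return h                      # the list (h_1, ..., h_|x|)

enc = BiGRUEncoder(vocab_size=1000)
hs = enc(torch.randint(0, 1000, (2, 7)))   # two sentences of length 7
print(hs.shape)                             # torch.Size([2, 7, 256])
```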
Attention:
The attention network computes the context vector $c_j$. The weight $\alpha_{ij}$ reflects the importance of the hidden state $h_i$, relative to the hidden state $s_{j-1}$, in determining the next hidden state $s_j$ and predicting $y_j$:

$u_{ij} = V_a^{\top}\tanh(W_a s_{j-1} + U_a h_i)$ (3)

$\alpha_{ij} = \dfrac{\exp(u_{ij})}{\sum_{k=1}^{|x|}\exp(u_{kj})}$ (4)

$c_j = \sum_{i=1}^{|x|}\alpha_{ij} h_i$ (5)

where $V_a$, $W_a$ and $U_a$ are parameters to be learned in the neural network.
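The following sketch implements formulas (3)-(5) as reconstructed above (additive attention with parameters $V_a$, $W_a$, $U_a$); the module name and dimensions are assumptions:

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Sketch of formulas (3)-(5): weights alpha_ij and context vector c_j."""
    def __init__(self, enc_dim: int, dec_dim: int, att_dim: int = 128):
        super().__init__()
        self.W_a = nn.Linear(dec_dim, att_dim, bias=False)   # W_a s_{j-1}
        self.U_a = nn.Linear(enc_dim, att_dim, bias=False)   # U_a h_i
        self.V_a = nn.Linear(att_dim, 1, bias=False)         # V_a^T tanh(...)

    def forward(self, h: torch.Tensor, s_prev: torch.Tensor):
        # h: (batch, |x|, enc_dim), s_prev: (batch, dec_dim)
        scores = self.V_a(torch.tanh(self.W_a(s_prev).unsqueeze(1) + self.U_a(h)))
        alpha = torch.softmax(scores.squeeze(-1), dim=-1)    # (batch, |x|), formula (4)
        c_j = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)    # (batch, enc_dim), formula (5)
        return c_j, alpha

att = AdditiveAttention(enc_dim=256, dec_dim=128)
c, a = att(torch.randn(2, 7, 256), torch.randn(2, 128))
print(c.shape, a.shape)   # torch.Size([2, 256]) torch.Size([2, 7])
```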
Decoder:
The decoder network decodes the target language words in turn until a special end of sentence symbol (EOS) is generated.
$y_{j-1}$ (the (j-1)-th word fed to the decoder as context) is obtained by sampling: the real target language word $y^*_{j-1}$ is selected with probability p, and the predicted target language word $\hat{y}_{j-1}$ is selected with probability 1-p, where p is the sampling probability given by formula (6), $\mu$ is a hyperparameter, and e is the number of training rounds. Concretely, a 0-1 sampling vector is generated according to the probability p and the selection is realized by multiplication: positions where the sampling vector is 1 select $y^*_{j-1}$, and positions where it is 0 select $\hat{y}_{j-1}$. For example, if the probability p is 0.3 and the sampling vector is [1,0,0,1,0,0,1,0,0,0], the first, fourth and seventh positions take the real words and the remaining positions take the predicted words. The sampling of $y_{j-1}$ is illustrated in FIG. 2.

$p = \dfrac{\mu}{\mu + \exp(e/\mu)}$ (6)

$y_{j-1} = \begin{cases} y^*_{j-1}, & \text{with probability } p \\ \hat{y}_{j-1}, & \text{with probability } 1-p \end{cases}$ (7)
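A sketch of the sampling in formulas (6)-(7) as reconstructed above: the probability p of keeping the ground-truth word decays with the number of training rounds, and a 0-1 sampling vector mixes $y^*_{j-1}$ and $\hat{y}_{j-1}$ per position. The value of μ and the example tensors are assumed for illustration:

```python
import math
import torch

def keep_truth_prob(epoch: int, mu: float = 12.0) -> float:
    """Formula (6) as reconstructed: p decays from near 1 toward 0 as training proceeds."""
    return mu / (mu + math.exp(epoch / mu))

def sample_context_words(gold_ids: torch.Tensor, pred_ids: torch.Tensor, p: float) -> torch.Tensor:
    """Formula (7): per position, take the real word with probability p, else the predicted word."""
    mask = torch.bernoulli(torch.full(gold_ids.shape, p))     # 0-1 sampling vector
    return torch.where(mask.bool(), gold_ids, pred_ids)

gold = torch.tensor([[11, 12, 13, 14, 15, 16, 17, 18, 19, 20]])
pred = torch.tensor([[11, 99, 13, 14, 98, 16, 17, 97, 19, 20]])
p = keep_truth_prob(epoch=30)
print(round(p, 3), sample_context_words(gold, pred, p))
```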
The hidden state vector $s_j$ is computed as

$\tilde{s}_j = \mathrm{GRU}_1(s_{j-1}, E(y_{j-1}))$ (8)

$s_j = \mathrm{GRU}_2(\tilde{s}_j, [c_j; e_{j-1}])$ (9)
The calculation principle of $\mathrm{GRU}_2$ is shown in FIG. 3.
In formula (9), $[c_j; e_{j-1}]$ (vector concatenation) corresponds to $x_t$ in FIG. 3, $\tilde{s}_j$ corresponds to $h_{t-1}$ in FIG. 3, and $s_j$ corresponds to $h_t$ in FIG. 3. The weight of $c_j$ is adjusted according to the entropy value $e_{j-1}$: the larger the entropy $e_{j-1}$, the greater the uncertainty and the worse the predicted translation $\hat{y}_{j-1}$, so the next time step makes less use of the information of $\hat{y}_{j-1}$ and more use of the information of $c_j$.
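A sketch of the state update in formulas (8)-(9) as reconstructed above, using two GRU cells: the entropy $e_{j-1}$ (defined by formula (10) below) is concatenated to the context vector $c_j$ before the second cell, so an uncertain previous prediction shifts the update toward the source-side context. The two-cell layout, class name and dimensions are assumptions:

```python
import torch
import torch.nn as nn

class EntropyGatedDecoderCell(nn.Module):
    """Sketch of formulas (8)-(9): s_j = GRU2(GRU1(s_{j-1}, E(y_{j-1})), [c_j ; e_{j-1}])."""
    def __init__(self, emb_dim: int, hid_dim: int, ctx_dim: int):
        super().__init__()
        self.gru1 = nn.GRUCell(emb_dim, hid_dim)          # consumes the selected word vector
        self.gru2 = nn.GRUCell(ctx_dim + 1, hid_dim)      # consumes [c_j ; e_{j-1}]

    def forward(self, y_prev_emb, s_prev, c_j, entropy_prev):
        s_tilde = self.gru1(y_prev_emb, s_prev)                        # formula (8)
        gru2_in = torch.cat([c_j, entropy_prev.unsqueeze(-1)], dim=-1)
        return self.gru2(gru2_in, s_tilde)                             # formula (9)

cell = EntropyGatedDecoderCell(emb_dim=64, hid_dim=128, ctx_dim=256)
s_j = cell(torch.randn(2, 64), torch.randn(2, 128), torch.randn(2, 256), torch.rand(2))
print(s_j.shape)   # torch.Size([2, 128])
```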
The formula for calculating the entropy value is as follows, where N is the size of the target language dictionary. The prediction probability at time step j-1 is denoted $P_{j-1}$, an N-dimensional vector containing the predicted probabilities of all words in the target language lexicon, in which the predicted probability of the i-th target language word is denoted $p_{i,j-1}$:

$e_{j-1} = -\sum_{i=1}^{N} p_{i,j-1}\log p_{i,j-1}$ (10)
The probability distribution $P_j$ over all words in the target language dictionary is computed as follows:

$t_j = g\big(E(y_{j-1}), s_j, c_j\big)$ (11)

$o_j = W_o t_j$ (12)

$P_j = \mathrm{softmax}(o_j)$ (13)

$\hat{y}_j = \arg\max_i\; p_{i,j}$ (14)

where $E(y_{j-1})$ is the word vector of $y_{j-1}$, g is the readout transformation combining $E(y_{j-1})$, $s_j$ and $c_j$, $e_{j-1}$ is the entropy value reflecting the uncertainty of the probability distribution of the word predicted at time step j-1, and $W_o$ is a parameter to be learned in the network.
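A sketch of formulas (10)-(14) as reconstructed above: a readout of $E(y_{j-1})$, $s_j$ and $c_j$ is projected to vocabulary logits, the softmax gives $P_j$, the predicted word $\hat{y}_j$ is the argmax, and the entropy of $P_j$ gates the next time step. The concrete readout (a single tanh layer) and all names and dimensions are assumptions:

```python
import torch
import torch.nn as nn

class OutputLayer(nn.Module):
    """Sketch of formulas (11)-(14) plus the entropy of formula (10)."""
    def __init__(self, emb_dim: int, hid_dim: int, ctx_dim: int, vocab_size: int):
        super().__init__()
        self.readout = nn.Linear(emb_dim + hid_dim + ctx_dim, hid_dim)  # t_j (assumed form)
        self.W_o = nn.Linear(hid_dim, vocab_size, bias=False)           # o_j = W_o t_j

    def forward(self, y_prev_emb, s_j, c_j):
        t_j = torch.tanh(self.readout(torch.cat([y_prev_emb, s_j, c_j], dim=-1)))
        P_j = torch.softmax(self.W_o(t_j), dim=-1)                      # formulas (12)-(13)
        y_hat = P_j.argmax(dim=-1)                                      # formula (14), greedy pick
        entropy = -(P_j * torch.log(P_j + 1e-12)).sum(dim=-1)           # formula (10)
        return y_hat, P_j, entropy

out = OutputLayer(emb_dim=64, hid_dim=128, ctx_dim=256, vocab_size=1000)
y_hat, P, e_j = out(torch.randn(2, 64), torch.randn(2, 128), torch.randn(2, 256))
print(y_hat.shape, P.shape, e_j.shape)   # torch.Size([2]) torch.Size([2, 1000]) torch.Size([2])
```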
The use of the above-described entropy-based dynamic decoding technique is explained:
first, word vectors of words in source language sentencesTransmitting into encoder network, and obtaining encoding vector list (h) corresponding to source language sentence according to formulas (1) - (2)1,…,h|x|). The target language words are then decoded in sequence until a special end of sentence symbol (EOS) is generated. The specific decoding process at the jth time step is as follows:
step S1, known quantity: code vector list (h)1,…,h|x|) Hidden state vector s at the j-1 st time stepj-1(the whole decoding process is carried out backward along with the time step, which is a concrete decoding process of the jth time step, so that the first j-1 time steps are all calculated), and the jth-1 real target language word in the training corpus
Figure BDA0002402513420000083
Target language word predicted at j-1 time step
Figure BDA0002402513420000084
Step S2, given the list of encoding vectors $(h_1,\dots,h_{|x|})$ and the hidden state vector $s_{j-1}$, the attention network computes the context vector $c_j$ according to formulas (3)-(5).
Step S3, given the real target language word $y^*_{j-1}$ and the predicted target language word $\hat{y}_{j-1}$, sample and select $y_{j-1}$ according to formulas (6)-(7).
Step S4, given $y_{j-1}$, compute the entropy value $e_{j-1}$ according to formula (10).
Step S5, given the entropy value $e_{j-1}$, the target language word $y_{j-1}$, the hidden state vector $s_{j-1}$ and the context vector $c_j$, the decoder network computes the hidden state vector $s_j$ of the j-th time step according to formulas (8)-(9).
Step S6, given $y_{j-1}$, the hidden state vector $s_j$ and the context vector $c_j$, predict the target language word $\hat{y}_j$ of the j-th time step according to formulas (11)-(14).
Step S7, pass the list of encoding vectors $(h_1,\dots,h_{|x|})$, the hidden state vector $s_j$ of the j-th time step, the j-th real target language word $y^*_j$ in the training corpus and the target language word $\hat{y}_j$ predicted at the j-th time step to the (j+1)-th time step, and continue the decoding process until a special end-of-sentence symbol (EOS) is generated.
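Putting steps S1-S7 together, one training-time decoding step can be sketched as the following function, which reuses the components sketched above (the encoder outputs, attention, entropy-gated decoder cell and output layer); the signature, the decay schedule and all names are assumptions made for illustration:

```python
import math
import torch

def decode_step(h, s_prev, gold_prev, pred_prev, P_prev, epoch,
                embed, attention, decoder_cell, output_layer, mu=12.0):
    """One training-time decoding step following steps S1-S7 (components as in the sketches above)."""
    # S2: context vector c_j from the encoder states and s_{j-1}          (formulas (3)-(5))
    c_j, _ = attention(h, s_prev)
    # S3: sample the context word y_{j-1} from the real / predicted word  (formulas (6)-(7))
    p = mu / (mu + math.exp(epoch / mu))
    mask = torch.bernoulli(torch.full(gold_prev.shape, p)).bool()
    y_prev = torch.where(mask, gold_prev, pred_prev)
    # S4: entropy of the previous prediction distribution                  (formula (10))
    e_prev = -(P_prev * torch.log(P_prev + 1e-12)).sum(dim=-1)
    # S5: entropy-gated state update                                       (formulas (8)-(9))
    s_j = decoder_cell(embed(y_prev), s_prev, c_j, e_prev)
    # S6: predict the j-th word and its distribution P_j                   (formulas (11)-(14))
    y_hat, P_j, _ = output_layer(embed(y_prev), s_j, c_j)
    # S7: (h, s_j, the real word y*_j, y_hat and P_j) are passed on to time step j+1
    return s_j, y_hat, P_j
```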
The following are system examples corresponding to the above method examples, and this embodiment can be implemented in cooperation with the above embodiments. The related technical details mentioned in the above embodiments are still valid in this embodiment, and are not described herein again in order to reduce repetition. Accordingly, the related-art details mentioned in the present embodiment can also be applied to the above-described embodiments.
The invention also provides a neural machine translation dynamic decoding system based on entropy, which comprises
Module 1, feeding word vectors of the words in a source language sentence in the training corpus into an encoder network to obtain a list of encoding vectors $(h_1,\dots,h_{|x|})$ of the source language sentence and the hidden state vector $s_{j-1}$ of the (j-1)-th time step;
Module 2, obtaining, through the attention network, a context vector $c_j$ from the list of encoding vectors $(h_1,\dots,h_{|x|})$ and the hidden state vector $s_{j-1}$;
Module 3, obtaining the (j-1)-th real target language word $y^*_{j-1}$ in the training corpus and the target language word $\hat{y}_{j-1}$ predicted at time step j-1, selecting the real target language word $y^*_{j-1}$ with probability p and the predicted target language word $\hat{y}_{j-1}$ with probability 1-p, and obtaining a selection result $y_{j-1}$;
Module 4, obtaining the entropy $e_{j-1}$ of time step j-1 from the selection result $y_{j-1}$ by the following formula, where N is the size of the target language lexicon and $p_{i,j-1}$ is the predicted probability of the i-th target language word:
$e_{j-1} = -\sum_{i=1}^{N} p_{i,j-1}\log p_{i,j-1}$;
Module 5, inputting the selection result $y_{j-1}$, the hidden state vector $s_{j-1}$, the context vector $c_j$ and the entropy value $e_{j-1}$ into the decoder network to obtain the hidden state vector $s_j$ of the current j-th time step;
Module 6, obtaining the target language word $\hat{y}_j$ of the j-th time step from $y_{j-1}$, the hidden state vector $s_j$ and the context vector $c_j$;
Module 7, passing the list of encoding vectors $(h_1,\dots,h_{|x|})$, the hidden state vector $s_j$ of the j-th time step, the j-th real target language word $y^*_j$ in the training corpus and the target language word $\hat{y}_j$ predicted at the j-th time step to the (j+1)-th time step, and continuing the decoding process until a special end-of-sentence symbol EOS is generated.
The neural machine translation dynamic decoding system based on entropy, wherein the module 2 comprises:
obtaining the context vector $c_j$ through the attention network, where the weight $\alpha_{ij}$ reflects the importance of the hidden state $h_i$, relative to the hidden state $s_{j-1}$, in determining the next hidden state $s_j$ and predicting $y_j$:
$u_{ij} = V_a^{\top}\tanh(W_a s_{j-1} + U_a h_i)$;
$\alpha_{ij} = \exp(u_{ij}) \big/ \sum_{k=1}^{|x|}\exp(u_{kj})$;
$c_j = \sum_{i=1}^{|x|}\alpha_{ij} h_i$;
where x is the source language sentence, and $V_a$, $W_a$ and $U_a$ are parameters to be learned in the neural network.
The neural machine translation dynamic decoding system based on entropy, wherein the module 3 comprises:
$p = \dfrac{\mu}{\mu + \exp(e/\mu)}$;
$y_{j-1} = \begin{cases} y^*_{j-1}, & \text{with probability } p \\ \hat{y}_{j-1}, & \text{with probability } 1-p \end{cases}$;
where μ is the hyperparameter and e is the number of training rounds.
The neural machine translation dynamic decoding system based on entropy, wherein the module 6 comprises:
$t_j = g\big(E(y_{j-1}), s_j, c_j\big)$;
$o_j = W_o t_j$;
$P_j = \mathrm{softmax}(o_j)$;
$\hat{y}_j = \arg\max_i\, p_{i,j}$;
where $E(y_{j-1})$ is the word vector of $y_{j-1}$, g is the readout transformation combining $E(y_{j-1})$, $s_j$ and $c_j$, $e_{j-1}$ is the entropy of the probability distribution of the word predicted at time step j-1, and $W_o$ is a parameter to be learned in the neural network.
The neural machine translation dynamic decoding system based on entropy, wherein the module 7 comprises:
the encoder network computes the list of hidden vectors $(h_1,\dots,h_{|x|})$ corresponding to the source language sentence, where $E(x_i)$ is the word vector of the word $x_i$ and $h_i$ is the hidden vector representation corresponding to the word $x_i$:
$\overrightarrow{h}_i = \overrightarrow{f}(\overrightarrow{h}_{i-1}, E(x_i)),\quad \overleftarrow{h}_i = \overleftarrow{f}(\overleftarrow{h}_{i+1}, E(x_i))$;
$h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$.

Claims (10)

1. An entropy-based neural machine translation dynamic decoding method, characterized by comprising:
Step 1, feeding word vectors of the words in a source language sentence in a training corpus into an encoder network to obtain a list of encoding vectors $(h_1,\dots,h_{|x|})$ of the source language sentence and the hidden state vector $s_{j-1}$ of the (j-1)-th time step;
Step 2, obtaining, by an attention network, a context vector $c_j$ from the list of encoding vectors $(h_1,\dots,h_{|x|})$ and the hidden state vector $s_{j-1}$;
Step 3, obtaining the (j-1)-th real target language word $y^*_{j-1}$ in the training corpus and the target language word $\hat{y}_{j-1}$ predicted at time step j-1, selecting the real target language word $y^*_{j-1}$ with probability p and the predicted target language word $\hat{y}_{j-1}$ with probability 1-p, and obtaining a selection result $y_{j-1}$;
Step 4, obtaining the entropy $e_{j-1}$ of time step j-1 from the selection result $y_{j-1}$ by the following formula, where N is the size of the target language lexicon and $p_{i,j-1}$ is the predicted probability of the i-th target language word:
$e_{j-1} = -\sum_{i=1}^{N} p_{i,j-1}\log p_{i,j-1}$;
Step 5, inputting the selection result $y_{j-1}$, the hidden state vector $s_{j-1}$, the context vector $c_j$ and the entropy value $e_{j-1}$ into a decoder network to obtain the hidden state vector $s_j$ of the current j-th time step;
Step 6, obtaining the target language word $\hat{y}_j$ of the j-th time step from $y_{j-1}$, the hidden state vector $s_j$ and the context vector $c_j$;
Step 7, passing the list of encoding vectors $(h_1,\dots,h_{|x|})$, the hidden state vector $s_j$ of the j-th time step, the j-th real target language word $y^*_j$ in the training corpus and the target language word $\hat{y}_j$ predicted at the j-th time step to the (j+1)-th time step, and continuing the decoding process until a special end-of-sentence symbol EOS is generated.
2. An entropy-based neural machine translation dynamic decoding method as claimed in claim 1, wherein the step 2 comprises:
obtaining the context vector $c_j$ through the attention network, where the weight $\alpha_{ij}$ reflects the importance of the hidden state $h_i$, relative to the hidden state $s_{j-1}$, in determining the next hidden state $s_j$ and predicting $y_j$:
$u_{ij} = V_a^{\top}\tanh(W_a s_{j-1} + U_a h_i)$;
$\alpha_{ij} = \exp(u_{ij}) \big/ \sum_{k=1}^{|x|}\exp(u_{kj})$;
$c_j = \sum_{i=1}^{|x|}\alpha_{ij} h_i$;
where x is the source language sentence, and $V_a$, $W_a$ and $U_a$ are parameters to be learned in the neural network.
3. An entropy-based neural machine translation dynamic decoding method as claimed in claim 2, wherein the step 3 comprises:
$p = \dfrac{\mu}{\mu + \exp(e/\mu)}$;
$y_{j-1} = \begin{cases} y^*_{j-1}, & \text{with probability } p \\ \hat{y}_{j-1}, & \text{with probability } 1-p \end{cases}$;
where μ is the hyperparameter and e is the number of training rounds.
4. An entropy-based neural machine translation dynamic decoding method as claimed in claim 3, wherein the step 6 comprises:
$t_j = g\big(E(y_{j-1}), s_j, c_j\big)$;
$o_j = W_o t_j$;
$P_j = \mathrm{softmax}(o_j)$;
$\hat{y}_j = \arg\max_i\, p_{i,j}$;
where $E(y_{j-1})$ is the word vector of $y_{j-1}$, g is the readout transformation combining $E(y_{j-1})$, $s_j$ and $c_j$, $e_{j-1}$ is the entropy of the probability distribution of the word predicted at time step j-1, and $W_o$ is a parameter to be learned in the neural network.
5. An entropy-based neural machine translation dynamic decoding method as claimed in claim 4, wherein the step 7 comprises:
the encoder network computes the list of encoding vectors $(h_1,\dots,h_{|x|})$ corresponding to the source language sentence, where $E(x_i)$ is the word vector of the word $x_i$ and $h_i$ is the hidden vector representation corresponding to the word $x_i$:
$\overrightarrow{h}_i = \overrightarrow{f}(\overrightarrow{h}_{i-1}, E(x_i)),\quad \overleftarrow{h}_i = \overleftarrow{f}(\overleftarrow{h}_{i+1}, E(x_i))$;
$h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$.
6. An entropy-based neural machine translation dynamic decoding system, characterized by comprising:
Module 1, feeding word vectors of the words in a source language sentence in the training corpus into an encoder network to obtain a list of encoding vectors $(h_1,\dots,h_{|x|})$ of the source language sentence and the hidden state vector $s_{j-1}$ of the (j-1)-th time step;
Module 2, obtaining, through the attention network, a context vector $c_j$ from the list of encoding vectors $(h_1,\dots,h_{|x|})$ and the hidden state vector $s_{j-1}$;
Module 3, obtaining the (j-1)-th real target language word $y^*_{j-1}$ in the training corpus and the target language word $\hat{y}_{j-1}$ predicted at time step j-1, selecting the real target language word $y^*_{j-1}$ with probability p and the predicted target language word $\hat{y}_{j-1}$ with probability 1-p, and obtaining a selection result $y_{j-1}$;
Module 4, obtaining the entropy $e_{j-1}$ of time step j-1 from the selection result $y_{j-1}$ by the following formula, where N is the size of the target language lexicon and $p_{i,j-1}$ is the predicted probability of the i-th target language word:
$e_{j-1} = -\sum_{i=1}^{N} p_{i,j-1}\log p_{i,j-1}$;
Module 5, inputting the selection result $y_{j-1}$, the hidden state vector $s_{j-1}$, the context vector $c_j$ and the entropy value $e_{j-1}$ into the decoder network to obtain the hidden state vector $s_j$ of the current j-th time step;
Module 6, obtaining the target language word $\hat{y}_j$ of the j-th time step from $y_{j-1}$, the hidden state vector $s_j$ and the context vector $c_j$;
Module 7, passing the list of encoding vectors $(h_1,\dots,h_{|x|})$, the hidden state vector $s_j$ of the j-th time step, the j-th real target language word $y^*_j$ in the training corpus and the target language word $\hat{y}_j$ predicted at the j-th time step to the (j+1)-th time step, and continuing the decoding process until a special end-of-sentence symbol EOS is generated.
7. An entropy-based neural machine translation dynamic decoding system as claimed in claim 6, wherein the module 2 comprises:
obtaining the context vector $c_j$ through the attention network, where the weight $\alpha_{ij}$ reflects the importance of the hidden state $h_i$, relative to the hidden state $s_{j-1}$, in determining the next hidden state $s_j$ and predicting $y_j$:
$u_{ij} = V_a^{\top}\tanh(W_a s_{j-1} + U_a h_i)$;
$\alpha_{ij} = \exp(u_{ij}) \big/ \sum_{k=1}^{|x|}\exp(u_{kj})$;
$c_j = \sum_{i=1}^{|x|}\alpha_{ij} h_i$;
where x is the source language sentence, and $V_a$, $W_a$ and $U_a$ are parameters to be learned in the neural network.
8. An entropy-based neural machine translation dynamic decoding system as claimed in claim 7, wherein the module 3 comprises:
$p = \dfrac{\mu}{\mu + \exp(e/\mu)}$;
$y_{j-1} = \begin{cases} y^*_{j-1}, & \text{with probability } p \\ \hat{y}_{j-1}, & \text{with probability } 1-p \end{cases}$;
where μ is the hyperparameter and e is the number of training rounds.
9. An entropy-based neural machine translation dynamic decoding system as claimed in claim 8, wherein the module 6 comprises:
$t_j = g\big(E(y_{j-1}), s_j, c_j\big)$;
$o_j = W_o t_j$;
$P_j = \mathrm{softmax}(o_j)$;
$\hat{y}_j = \arg\max_i\, p_{i,j}$;
where $E(y_{j-1})$ is the word vector of $y_{j-1}$, g is the readout transformation combining $E(y_{j-1})$, $s_j$ and $c_j$, $e_{j-1}$ is the entropy of the probability distribution of the word predicted at time step j-1, and $W_o$ is a parameter to be learned in the neural network.
10. An entropy-based neural machine translation dynamic decoding system as claimed in claim 9, wherein the module 7 comprises:
the encoder network computes the list of encoding vectors $(h_1,\dots,h_{|x|})$ corresponding to the source language sentence, where $E(x_i)$ is the word vector of the word $x_i$ and $h_i$ is the hidden vector representation corresponding to the word $x_i$:
$\overrightarrow{h}_i = \overrightarrow{f}(\overrightarrow{h}_{i-1}, E(x_i)),\quad \overleftarrow{h}_i = \overleftarrow{f}(\overleftarrow{h}_{i+1}, E(x_i))$;
$h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$.
CN202010151246.4A 2020-03-06 2020-03-06 Entropy-based neural machine translation dynamic decoding method and system Active CN111428519B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010151246.4A CN111428519B (en) 2020-03-06 2020-03-06 Entropy-based neural machine translation dynamic decoding method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010151246.4A CN111428519B (en) 2020-03-06 2020-03-06 Entropy-based neural machine translation dynamic decoding method and system

Publications (2)

Publication Number Publication Date
CN111428519A CN111428519A (en) 2020-07-17
CN111428519B (en) 2022-03-29

Family

ID=71547442

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010151246.4A Active CN111428519B (en) 2020-03-06 2020-03-06 Entropy-based neural machine translation dynamic decoding method and system

Country Status (1)

Country Link
CN (1) CN111428519B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112016332B (en) * 2020-08-26 2021-05-07 华东师范大学 Multi-modal machine translation method based on variational reasoning and multi-task learning
CN112836485B (en) * 2021-01-25 2023-09-19 中山大学 Similar medical record prediction method based on neural machine translation

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110795912A (en) * 2019-09-19 2020-02-14 平安科技(深圳)有限公司 Method, device and equipment for encoding text based on neural network and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10049106B2 (en) * 2017-01-18 2018-08-14 Xerox Corporation Natural language generation through character-based recurrent neural networks with finite-state prior knowledge
CN108984539B (en) * 2018-07-17 2022-05-17 苏州大学 Neural machine translation method based on translation information simulating future moment


Also Published As

Publication number Publication date
CN111428519A (en) 2020-07-17


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant