CN113378584B - Non-autoregressive neural machine translation method based on auxiliary representation fusion - Google Patents

Non-autoregressive neural machine translation method based on auxiliary representation fusion

Info

Publication number
CN113378584B
CN113378584B (application CN202110592517.4A)
Authority
CN
China
Prior art keywords
machine translation
autoregressive
model
neural machine
decoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110592517.4A
Other languages
Chinese (zh)
Other versions
CN113378584A (en)
Inventor
杜权
刘兴宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang Yayi Network Technology Co ltd
Original Assignee
Shenyang Yayi Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang Yayi Network Technology Co ltd filed Critical Shenyang Yayi Network Technology Co ltd
Priority to CN202110592517.4A priority Critical patent/CN113378584B/en
Publication of CN113378584A publication Critical patent/CN113378584A/en
Application granted granted Critical
Publication of CN113378584B publication Critical patent/CN113378584B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a non-autoregressive neural machine translation method based on auxiliary representation fusion, which comprises the following steps: constructing an autoregressive neural machine translation model; constructing a training parallel corpus and training a model whose decoder has only one layer; constructing a non-autoregressive neural machine translation model; fusing, by weighted summation, the output of the feed-forward neural network at the top layer of the autoregressive neural machine translation model decoder with the top-layer representation of the non-autoregressive neural machine translation model encoder, and using the result as the input of the non-autoregressive neural machine translation model decoder; the encoder extracts the source sentence information, and the decoder predicts the corresponding target sentence from that information; training the non-autoregressive neural machine translation model; and feeding the source sentence into the non-autoregressive neural machine translation model and decoding translation results of different lengths. The invention combines the advantages of autoregressive and non-autoregressive models and obtains a 7-9 fold speedup at the cost of only a small performance loss.

Description

Non-autoregressive neural machine translation method based on auxiliary representation fusion
Technical Field
The invention relates to a neural machine translation inference acceleration method, in particular to a non-autoregressive neural machine translation method based on auxiliary representation fusion.
Background
Machine translation is a technique of translating one natural language into another. Machine translation is a branch of natural language processing, is one of the ultimate targets of artificial intelligence, and has important scientific research value. Meanwhile, with the rapid development of internet technology, the machine translation technology plays an increasingly important role in daily life and work of people.
Machine translation technology has developed over the years from rule-based methods in the 1970s, example-based methods in the 1980s and statistics-based methods in the 1990s to today's neural network-based methods, which finally achieve good results and are widely used in people's daily lives.
The most widely used neural machine translation systems currently employ end-to-end encoder-decoder frameworks based on neural networks, among which the most powerful is the Transformer model structure based on the self-attention mechanism, which achieves optimal translation performance across multiple language pairs. The Transformer consists of an encoder and a decoder based on the self-attention mechanism. A standard Transformer encoder consists of six stacked encoding layers, and the decoder likewise comprises six decoding layers. The traditional RNN and CNN are discarded from the whole model, which consists entirely of attention mechanisms; more precisely, the Transformer consists only of attention mechanisms and feed-forward neural networks. Compared with RNNs, the Transformer abandons the restriction of strictly sequential computation and improves the parallelism of the system. At the same time, this parallel computation alleviates the difficulty of handling long-range dependencies that arises in sequential computation. Each Transformer encoding layer comprises a self-attention layer and a feed-forward neural network: the sentences output by the self-attention layer as dense vectors are fed, after feature extraction, into the feed-forward neural network. Relative to the encoder, the decoder models the mapping relationship between the source and target languages by adding an encoder-decoder attention layer between the self-attention layer and the feed-forward neural network layer.
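By way of illustration only, the following minimal PyTorch sketch shows one such encoding layer (multi-head self-attention followed by a feed-forward network, each with a residual connection and layer normalization); the dimensions, dropout rate and module names are assumptions for this example and are not specified by the invention:

    import torch
    import torch.nn as nn

    class EncoderLayer(nn.Module):
        # One Transformer encoding layer: multi-head self-attention + feed-forward network,
        # each wrapped in a residual connection and layer normalization.
        def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
            self.dropout = nn.Dropout(dropout)

        def forward(self, x, pad_mask=None):
            # Self-attention sub-layer: every position attends to every source position.
            attn_out, _ = self.self_attn(x, x, x, key_padding_mask=pad_mask)
            x = self.norm1(x + self.dropout(attn_out))
            # Position-wise feed-forward sub-layer.
            x = self.norm2(x + self.dropout(self.ffn(x)))
            return x

A decoding layer differs only in that an encoder-decoder attention sub-layer is inserted between the self-attention and the feed-forward network.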
Neural network-based machine translation systems have made significant advances in performance over the earlier statistics-based translation systems. However, because neural networks involve a large number of matrix operations, both training and decoding are more time-consuming than previous approaches. Of these two, the time consumed by decoding tends to matter more in practice. For a neural machine translation system to be practical, it must respond quickly during decoding; otherwise it is difficult for users to accept in many scenarios, even if the translation system performs better.
Most machine translation models are currently implemented with an encoder-decoder framework, in which the encoder feeds a representation of the source sentence to the decoder to generate the target sentence. The decoder typically works in an autoregressive manner, generating the target sentence from left to right word by word, so that the generation of the t-th target word depends on the t-1 target words generated previously. This autoregressive decoding mode matches the way people read and produce sentences and can effectively capture the distribution of real translations. However, each step of the decoder must run sequentially rather than in parallel, and thus autoregressive decoding prevents architectures such as the Transformer from fully exploiting at inference time the parallelism advantages they enjoy during training.
To mitigate this inference delay, a non-autoregressive translation (NAT) model has been proposed which, instead of generating left to right, uses copied source inputs to initialize the decoder inputs and generates all target words independently and simultaneously. However, while accelerating decoding, the NAT model has to handle the translation task with weak target-side information at its decoder, which reduces translation accuracy.
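The copy-based initialization of the decoder inputs can be illustrated by the following sketch, which assumes a simple uniform copy of source token embeddings onto an estimated target length; this is an illustrative simplification, not the exact procedure of any particular NAT system:

    import torch

    def uniform_copy(src_embed, tgt_len):
        # src_embed: (batch, src_len, d_model) source token embeddings.
        # Each target position is initialized with the source embedding at a
        # proportionally mapped position, so the decoder inputs all exist up front
        # and every target word can be predicted in parallel.
        batch, src_len, d_model = src_embed.shape
        pos = torch.arange(tgt_len, dtype=torch.float)            # target positions 0 .. tgt_len-1
        idx = (pos * src_len / tgt_len).long().clamp(max=src_len - 1)
        return src_embed[:, idx, :]                               # (batch, tgt_len, d_model)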
Disclosure of Invention
Aiming at the problem of reduced translation quality caused by weak target-end information in non-autoregressive machine translation models, the invention provides a non-autoregressive neural machine translation method based on auxiliary representation fusion, which enables non-autoregressive machine translation to achieve performance comparable to autoregressive machine translation while responding faster, making it better suited to practical application.
In order to solve the technical problems, the invention adopts the following technical scheme:
The invention provides a non-autoregressive neural machine translation method based on auxiliary representation fusion, which comprises the following steps:
1) Constructing an autoregressive neural machine translation model comprising an encoder and a decoder by adopting a Transformer model based on the autoregressive mechanism;
2) Constructing a training parallel corpus, performing word segmentation and subword segmentation preprocessing to obtain a source language sequence and a target language sequence, generating a machine translation vocabulary, and training a model whose decoder has only one layer until the model converges;
3) Removing the matrix used by the original decoder in the Transformer model to mask future information, and at the same time adding multi-head position attention between the self-attention and the encoder-decoder attention, to construct a non-autoregressive neural machine translation model;
4) A front portion of words, set according to a specified proportion of the source language sentence, is decoded using a shallow autoregressive model, and the output of the feed-forward neural network at the top layer of the autoregressive neural machine translation model decoder is fused by weighted summation with the top-layer representation of the non-autoregressive neural machine translation model encoder to serve as the input of the non-autoregressive neural machine translation model decoder;
5) Using the parallel corpus to train the non-autoregressive neural machine translation model that takes the fused representation as input: the encoder encodes the source sentence and extracts the source sentence information, and the decoder predicts the corresponding target sentence from the source sentence information; the difference between the predicted data distribution and the real data distribution is calculated and continuously reduced through back-propagation until the model converges, completing the training process of the non-autoregressive neural machine translation model;
6) Sending the source sentence input by the user into the non-autoregressive neural machine translation model, decoding translation results of different lengths, and selecting the optimal translation result through evaluation by the autoregressive neural machine translation model.
In step 3), the non-autoregressive neural machine translation model is constructed, specifically comprising the following steps:
301) Modeling the translation problem after removing the matrix used by the decoding end to mask future information:

P(Y|X) = ∏_{t=1}^{T} P(y_t | x_{1…T′})

wherein X is the source language sequence, Y is the target language sequence, T is the target language sequence length, T′ is the source language sequence length, t is a target language position, x_{1…T′} is the source language sentence, and y_t is the target word at the t-th position;
302) Adding an additional multi-head position attention module in each decoder layer, the module being:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

wherein Q is the query matrix, K is the key matrix, V is the value matrix, softmax(·) is the normalization function, Attention(·) is the attention calculation function, and d_k is the dimension of the key matrix;
303) Before decoding begins, the target length is estimated from the source length, and the estimated target length is sent to the non-autoregressive neural machine translation model so that all words are generated in parallel.
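As an illustration of step 303), the following sketch estimates the target length from the source length and lets the decoder emit all positions in a single parallel pass; the simple length offset and the encode/decode_parallel interfaces are assumptions made for this example only:

    import torch

    def nat_generate(nat_model, src_tokens, length_offset=2):
        # Estimate the target length from the source length (a simple
        # source_length + offset heuristic; a learned length predictor may be used instead).
        src_len = src_tokens.size(1)
        tgt_len = src_len + length_offset

        # A single decoder pass yields a distribution over every target position,
        # so all words are generated in parallel rather than left to right.
        enc_out = nat_model.encode(src_tokens)                  # assumed encoder interface
        logits = nat_model.decode_parallel(enc_out, tgt_len)    # assumed decoder interface, (batch, tgt_len, vocab)
        return logits.argmax(dim=-1)                            # (batch, tgt_len) token ids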
In step 4), the translation result of the autoregressive model is used to improve the input of the non-autoregressive neural machine translation model, specifically:
401) The input at the decoding end of the non-autoregressive machine translation model is as follows:

wherein θ_at and θ_nat are the parameters of the autoregressive and non-autoregressive neural machine translation models respectively, T is the target language sequence length, T′ is the source language sequence length, y_t is the target word at the t-th position, x_{1…T′} is the source language sentence, y_<t denotes the 1st to (t-1)-th target words, and z_nat is the input of the decoding end of the non-autoregressive neural machine translation model;
402) Constructing a fusion function using a weighted sum, specifically:

Fusion = λ·Decoder_at(y_{1…k}) + μ·Encoder_nat(x_{1…T′})

wherein λ and μ are hyper-parameters controlling the weights of the different representation terms, Decoder_at(·) is the output of the autoregressive neural machine translation model decoder, Encoder_nat(·) is the output of the non-autoregressive neural machine translation model encoder, y_{1…k} denotes the 1st to k-th target words, and x_{1…T′} is the source sentence;
403) Before the fusion representation calculated above is fed to the decoder, a layer normalization operation is applied to normalize the forward-pass layer inputs and the backward-pass layer gradients.
The calculation formula of z_nat in step 401) is:

z_nat = Fusion(Decoder_at(y_{1…k}), Encoder_nat(x_{1…T′}))

wherein Decoder_at(·) is the output of the decoding end of the autoregressive model, Encoder_nat(·) is the output of the encoding end of the non-autoregressive model, Fusion(·) is the auxiliary representation fusion function, y_{1…k} denotes the 1st to k-th target words, and x_{1…T′} is the source sentence.
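A minimal sketch of the fusion in formulas 402) and 403), assuming the autoregressive decoder output and the non-autoregressive encoder output have already been mapped to tensors of the same target-length shape (this alignment is an assumption of the example); λ and μ are the weighting hyper-parameters from the description:

    import torch
    import torch.nn.functional as F

    def fuse_representations(dec_at, enc_nat, lam=0.5, mu=0.5):
        # dec_at:  top-layer feed-forward output of the autoregressive decoder,
        #          assumed shape (batch, tgt_len, d_model).
        # enc_nat: top-layer representation of the non-autoregressive encoder,
        #          assumed shape (batch, tgt_len, d_model).
        fusion = lam * dec_at + mu * enc_nat            # weighted-sum fusion (formula 402)
        return F.layer_norm(fusion, fusion.shape[-1:])  # layer normalization before the NAT decoder (step 403)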
In step 5), during training of the non-autoregressive neural machine translation model, the parallel corpus is fed into the model to compute the cross-entropy loss, and the corresponding gradients are then computed to update the parameters, completing the training process.
In step 6), the source sentence input by the user is fed into the non-autoregressive neural machine translation model, and multiple translation results are obtained by specifying different target language lengths; the autoregressive neural machine translation model is then used as a scoring function over these decoded translation results, and the best overall translation is selected.
The invention has the following beneficial effects and advantages:
1. The invention provides a non-autoregressive neural machine translation method based on auxiliary representation fusion, which introduces the high-level representation encoded by the autoregressive model into the non-autoregressive model to improve the translation quality of the non-autoregressive model. By combining the advantages of autoregressive and non-autoregressive models, fast and accurate translation can be achieved.
2. The method uses the fused representation of the source language and part of the target language as input, which greatly alleviates the problem of weak target-end information in the non-autoregressive model and effectively improves its performance.
3. The method is highly extensible: by adjusting the proportion of the leading words decoded with the autoregressive neural machine translation model, it can range from a fully autoregressive to a fully non-autoregressive neural machine translation model.
Drawings
FIG. 1 is a diagram of a non-autoregressive neural machine translation model based on a fusion representation in accordance with the present invention;
FIG. 2 is a schematic diagram of the structure of the encoding layer and decoding layer in a standard Transformer according to the present invention.
Detailed Description
The invention is further elucidated below in connection with the drawings of the specification.
The invention optimizes the translation performance of a non-autoregressive neural machine translation system from the perspective of representation fusion, with the aim of achieving accurate and fast translation.
The invention provides a non-autoregressive neural machine translation method based on auxiliary representation fusion, which comprises the following steps:
1) Constructing an autoregressive neural machine translation model comprising an encoder and a decoder by adopting a Transformer model based on the autoregressive mechanism;
2) Constructing a training parallel corpus, carrying out word segmentation and subword segmentation preprocessing to obtain a source language sequence and a target language sequence, generating a machine translation vocabulary, and training a model whose decoding end has only one layer until convergence;
3) Removing the matrix used by the decoding end of the Transformer to mask future information, and at the same time adding multi-head position attention between the self-attention and the encoder-decoder attention, to construct a non-autoregressive machine translation model;
4) The leading portion of words is decoded using a shallow autoregressive model, and the output of the feed-forward neural network at the topmost layer of the decoding end of the autoregressive machine translation model is fused by weighted summation with the top-layer representation at the encoding end of the non-autoregressive machine translation model to serve as the input of the decoding end of the non-autoregressive machine translation model;
5) The parallel corpus is used to train the non-autoregressive machine translation model that takes the fused representation as input: the encoder encodes the source sentence to extract the source sentence information, and the decoder predicts the corresponding target sentence from this information. The loss between the predicted distribution and the real data distribution is then calculated and continuously reduced through back-propagation to complete the training of the model;
6) The source sentence input by the user is fed into the machine translation model, translation results of different lengths are decoded, and the optimal translation result is obtained through evaluation by the autoregressive model.
In step 1), the Transformer consists only of attention mechanisms and feed-forward neural networks, as shown in fig. 2. The Transformer is still based on an encoder-decoder framework, in which the encoder and the decoder are each formed by stacking a plurality of identical layers whose sub-layer structures differ slightly. The Transformer achieves significant performance improvements on multiple data sets of the machine translation task, reaching the best performance at the time while also training faster. The attention mechanism is an important component of neural machine translation models. In the original encoder-decoder framework, the neural network has difficulty learning the correspondence between the source end and the target end, and the translation system translates long input sentences poorly. In the self-attention mechanism, the Query (Q), Key (K) and Value (V) come from the same content. The three matrices are first transformed linearly, and then a scaled dot-product operation is performed: the dot product between the Query and the Key is computed and, to prevent the result from becoming too large, divided by the square root of the Key dimension to act as a regulator, as shown in the following formula:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

wherein Q is the query matrix, K is the key matrix, V is the value matrix, softmax(·) is the normalization function, Attention(·) is the attention calculation function, and d_k is the dimension of the key matrix.
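The formula can be transcribed directly into the following sketch (a plain implementation for illustration, not an optimized one):

    import math
    import torch

    def scaled_dot_product_attention(Q, K, V):
        # Q, K, V: (batch, heads, seq_len, d_k)
        d_k = K.size(-1)
        # Dot products between queries and keys, scaled by sqrt(d_k) so the
        # scores stay in a range where the softmax gradients remain stable.
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
        weights = torch.softmax(scores, dim=-1)   # normalize over the key positions
        return torch.matmul(weights, V)           # weighted sum of the values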
In step 2), the most time-consuming part of the autoregressive model at decoding time is the decoder. Since there is no reference translation in the decoding stage, the autoregressive neural machine translation model predicts the current target word using the already generated sequence, which causes serious decoding delay. Using a lightweight autoregressive neural machine translation model greatly improves decoding speed.
In step 3), the non-autoregressive neural machine translation model is constructed, specifically comprising the following steps:
301) Modeling the translation problem after removing the matrix used by the decoding end to mask future information:

P(Y|X) = ∏_{t=1}^{T} P(y_t | x_{1…T′})

wherein X is the source language sequence, Y is the target language sequence, T is the target language sequence length, T′ is the source language sequence length, t is a target language position, x_{1…T′} is the source language sentence, and y_t is the target word at the t-th position;
302) Adding an additional multi-head position attention module in each decoder layer, the module being:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

wherein Q is the query matrix, K is the key matrix, V is the value matrix, softmax(·) is the normalization function, Attention(·) is the attention calculation function, and d_k is the dimension of the model's hidden layer;
303) Before decoding begins, the target length is estimated from the source length, and the estimated target length is sent to the non-autoregressive neural machine translation model so that all words are generated in parallel.
In step 4), the translation result of the autoregressive neural machine translation model is used to reconstruct the input of the non-autoregressive neural machine translation model, specifically:
401) The first k words are decoded using a shallow autoregressive model, and the output of the feed-forward neural network at the topmost layer of the decoding end of the autoregressive machine translation model is fused by weighted summation with the top-layer representation of the encoding end of the non-autoregressive machine translation model to serve as the input of the decoding end of the non-autoregressive machine translation model:

wherein θ_at and θ_nat are the parameters of the autoregressive and non-autoregressive neural machine translation models respectively, T is the target language sequence length, T′ is the source language sequence length, y_t is the target word at the t-th position, x_{1…T′} is the source language sentence, y_<t denotes the 1st to (t-1)-th target words, and z_nat is the input of the decoding end of the non-autoregressive neural machine translation model; the calculation formula of z_nat is:

z_nat = Fusion(Decoder_at(y_{1…k}), Encoder_nat(x_{1…T′}))

wherein Decoder_at(·) is the output of the decoding end of the autoregressive model, Encoder_nat(·) is the output of the encoding end of the non-autoregressive model, Fusion(·) is the auxiliary representation fusion function, y_{1…k} denotes the 1st to k-th target words, and x_{1…T′} is the source sentence.
402) Constructing a fusion function using a weighted sum, specifically:

Fusion = λ·Decoder_at(y_{1…k}) + μ·Encoder_nat(x_{1…T′})

wherein λ and μ are hyper-parameters controlling the weights of the different representation terms, Decoder_at(·) is the output of the autoregressive neural machine translation model decoder, Encoder_nat(·) is the output of the non-autoregressive neural machine translation model encoder, y_{1…k} denotes the 1st to k-th target words, and x_{1…T′} is the source sentence;
403) Before the fusion representation calculated above is fed to the decoder, a layer normalization operation is applied to normalize the forward-pass layer inputs and the backward-pass layer gradients.
In step 5), during training of the non-autoregressive neural machine translation model, the parallel corpus is fed into the model to compute the cross-entropy loss, and the corresponding gradients are then computed to update the parameters, completing the training process.
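One such training step could look like the following sketch, assuming a NAT model that returns per-position logits in one forward pass; the model call signature, optimizer and padding index are assumptions for this example:

    import torch
    import torch.nn.functional as F

    def train_step(nat_model, optimizer, src_tokens, tgt_tokens, pad_idx=0):
        # Forward pass: the model predicts a distribution over the vocabulary
        # for every target position in parallel.
        logits = nat_model(src_tokens, tgt_len=tgt_tokens.size(1))   # (batch, tgt_len, vocab), assumed interface

        # Cross-entropy between the predicted and reference distributions, ignoring padding.
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               tgt_tokens.reshape(-1),
                               ignore_index=pad_idx)

        # Back-propagate and update the parameters.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()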
In step 6), the source sentence input by the user is fed into the model, and multiple translation results are obtained by specifying different target language lengths; an autoregressive model is then used as a scoring function over the decoded translation results, and the best overall translation is selected. Since all translation candidates can be computed and scored completely independently, with sufficient parallelism this process takes only about twice the time of computing a single translation.
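A sketch of this length-parallel decoding and rescoring is given below; it reuses the assumed encode/decode_parallel interface from the earlier sketch and additionally assumes an autoregressive scorer that returns the log-probability of a candidate translation:

    def translate_with_rescoring(nat_model, at_scorer, src_tokens, base_len, num_candidates=9):
        # Decode several candidate translations of different target lengths.
        candidates = []
        for delta in range(-(num_candidates // 2), num_candidates // 2 + 1):
            tgt_len = max(1, base_len + delta)
            enc_out = nat_model.encode(src_tokens)                  # assumed interface
            logits = nat_model.decode_parallel(enc_out, tgt_len)    # assumed interface
            candidates.append(logits.argmax(dim=-1))

        # Score every candidate with the autoregressive model and keep the one
        # with the highest log-probability; the candidates are independent, so
        # these scoring passes can run fully in parallel.
        scores = [at_scorer(src_tokens, cand) for cand in candidates]   # assumed scorer interface
        best = max(range(len(candidates)), key=lambda i: scores[i])
        return candidates[best]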
The invention verifies the effectiveness of the proposed method on two commonly used data sets, the IWSLT14 German-English spoken-language data set and the WMT14 German-English data set, whose training sets contain 160,000 and 4.5 million parallel sentence pairs respectively. The processed bilingual training corpus is obtained through byte pair encoding (BPE) subword segmentation. However, since the non-autoregressive model has difficulty fitting the multimodal distribution in the real data, this embodiment addresses the problem with sentence-level knowledge distillation: sentences generated by an autoregressive neural machine translation model with the same parameter configuration are used as training samples and provided to the non-autoregressive neural machine translation model for learning.
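The sentence-level knowledge distillation step can be sketched as follows: the trained autoregressive teacher re-translates the source side of the training corpus, and the resulting pairs replace the original references for training the non-autoregressive model (the teacher's translate interface is an assumption of this example):

    def build_distilled_corpus(at_teacher, src_sentences):
        # Replace the original references with the autoregressive teacher's outputs,
        # whose distribution is smoother and less multimodal and therefore easier
        # for the non-autoregressive student to fit.
        distilled_pairs = []
        for src in src_sentences:
            hyp = at_teacher.translate(src)        # assumed teacher interface
            distilled_pairs.append((src, hyp))
        return distilled_pairs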
As shown in fig. 1, in this embodiment the first two words "We total" of the source language sentence are first fed into the encoder of the autoregressive neural machine translation model; the encoder's multi-head attention extracts the source language sentence information by computing the correlation coefficients between words and passing the result to the feed-forward neural network. The decoder of the autoregressive neural machine translation model then receives this information and passes it in turn through the multi-head self-attention layer, the multi-head encoder-decoder attention layer and the feed-forward neural network layer, followed by a linear transformation, to obtain the translation result "complete". The non-autoregressive neural machine translation model then fuses this translation result with its own encoder information to form the input of its decoder. Finally, the decoder uses the extracted source language sentence information and the decoder input, passing them in turn through the multi-head self-attention layer, the multi-head position attention layer, the multi-head encoder-decoder attention layer and the feed-forward neural network layer, and after a final linear transformation obtains the complete target language sentence.
The invention uses BLEU, a bilingual evaluation metric commonly used in machine translation tasks, as the evaluation standard. Experimental results show that, using the auxiliary representation fusion method as the input of the non-autoregressive model, 9 candidate translations of different lengths are decoded simultaneously and then evaluated with the autoregressive model; on the IWSLT14 German-English data set a 9.4-fold speedup is obtained at the cost of 14 percent of performance, and on the WMT14 German-English data set a 7.9-fold speedup is obtained with only an 8.5 percent loss of performance.
The invention optimizes the translation performance of the non-autoregressive neural machine translation system from the perspective of representation fusion, with the aim of achieving accurate and fast translation. By introducing the high-level representation encoded by the shallow autoregressive model into the non-autoregressive model, the translation quality of the non-autoregressive model is improved while an efficient inference speed is maintained. Using the fused representation of the source language and part of the target language as input greatly alleviates the problem of weak target-end information in the non-autoregressive model and effectively enhances the model's performance.

Claims (6)

1. A non-autoregressive neural machine translation method based on auxiliary representation fusion, characterized by comprising the following steps:
1) Constructing an autoregressive neural machine translation model comprising an encoder and a decoder by adopting a Transformer model based on the autoregressive mechanism;
2) Constructing a training parallel corpus, performing word segmentation and subword segmentation preprocessing to obtain a source language sequence and a target language sequence, generating a machine translation vocabulary, and training a model whose decoder has only one layer until the model converges;
3) Removing the matrix used by the original decoder in the Transformer model to mask future information, and at the same time adding multi-head position attention between the self-attention and the encoder-decoder attention, to construct a non-autoregressive neural machine translation model;
4) A front portion of words, set according to a specified proportion of the source language sentence, is decoded using a shallow autoregressive model, and the output of the feed-forward neural network at the top layer of the autoregressive neural machine translation model decoder is fused by weighted summation with the top-layer representation of the non-autoregressive neural machine translation model encoder to serve as the input of the non-autoregressive neural machine translation model decoder;
5) Using the parallel corpus to train the non-autoregressive neural machine translation model that takes the fused representation as input: the encoder encodes the source sentence and extracts the source sentence information, and the decoder predicts the corresponding target sentence from the source sentence information; the difference between the predicted data distribution and the real data distribution is calculated and continuously reduced through back-propagation until the model converges, completing the training process of the non-autoregressive neural machine translation model;
6) Sending the source sentence input by the user into the non-autoregressive neural machine translation model, decoding translation results of different lengths, and selecting the optimal translation result through evaluation by the autoregressive neural machine translation model.
2. The non-autoregressive neural machine translation method based on auxiliary representation fusion of claim 1, wherein in step 3), the non-autoregressive neural machine translation model is constructed through the following steps:
301) Modeling the translation problem after removing the matrix used by the decoding end to mask future information:

P(Y|X) = ∏_{t=1}^{T} P(y_t | x_{1…T′})

wherein X is the source language sequence, Y is the target language sequence, T is the target language sequence length, T′ is the source language sequence length, t is a target language position, x_{1…T′} is the source language sentence, and y_t is the target word at the t-th position;
302) Adding an additional multi-head position attention module in each decoder layer, the module being:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

wherein Q is the query matrix, K is the key matrix, V is the value matrix, softmax(·) is the normalization function, Attention(·) is the attention calculation function, and d_k is the dimension of the key matrix;
303) Before decoding begins, the target length is estimated from the source length, and the estimated target length is sent to the non-autoregressive neural machine translation model so that all words are generated in parallel.
3. The non-autoregressive neural machine translation method based on auxiliary representation fusion of claim 1, wherein: in step 4), the translation result of the autoregressive model is used to improve the input of the non-autoregressive neural machine translation model, specifically:
401) The input at the decoding end of the non-autoregressive machine translation model is as follows:

wherein θ_at and θ_nat are the parameters of the autoregressive and non-autoregressive neural machine translation models respectively, T is the target language sequence length, T′ is the source language sequence length, y_t is the target word at the t-th position, x_{1…T′} is the source language sentence, y_<t denotes the 1st to (t-1)-th target words, and z_nat is the input of the decoding end of the non-autoregressive neural machine translation model;
402) Constructing a fusion function using a weighted sum, specifically:

Fusion = λ·Decoder_at(y_{1…k}) + μ·Encoder_nat(x_{1…T′})

wherein λ and μ are hyper-parameters controlling the weights of the different representation terms, Decoder_at(·) is the output of the autoregressive neural machine translation model decoder, Encoder_nat(·) is the output of the non-autoregressive neural machine translation model encoder, y_{1…k} denotes the 1st to k-th target words, and x_{1…T′} is the source sentence;
403) Before the fusion representation calculated above is fed to the decoder, a layer normalization operation is applied to normalize the forward-pass layer inputs and the backward-pass layer gradients.
4. The non-autoregressive neural machine translation method based on auxiliary representation fusion of claim 3, wherein the calculation formula of z_nat in step 401) is:

z_nat = Fusion(Decoder_at(y_{1…k}), Encoder_nat(x_{1…T′}))

wherein Decoder_at(·) is the output of the decoding end of the autoregressive model, Encoder_nat(·) is the output of the encoding end of the non-autoregressive model, Fusion(·) is the auxiliary representation fusion function, y_{1…k} denotes the 1st to k-th target words, and x_{1…T′} is the source sentence.
5. The non-autoregressive neural machine translation method based on auxiliary representation fusion of claim 1, wherein in step 5), during training of the non-autoregressive neural machine translation model, the parallel corpus is fed into the model to compute the cross-entropy loss, and the corresponding gradients are then computed to update the parameters, completing the training process.
6. The non-autoregressive neural machine translation method based on auxiliary representation fusion of claim 1, wherein in step 6), the source sentence input by the user is fed into the non-autoregressive neural machine translation model, and multiple translation results are obtained by specifying different target language lengths; the autoregressive neural machine translation model is then used as a scoring function over these decoded translation results, and the best overall translation is selected.
CN202110592517.4A 2021-05-28 2021-05-28 Non-autoregressive neural machine translation method based on auxiliary representation fusion Active CN113378584B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110592517.4A CN113378584B (en) 2021-05-28 2021-05-28 Non-autoregressive neural machine translation method based on auxiliary representation fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110592517.4A CN113378584B (en) 2021-05-28 2021-05-28 Non-autoregressive neural machine translation method based on auxiliary representation fusion

Publications (2)

Publication Number Publication Date
CN113378584A CN113378584A (en) 2021-09-10
CN113378584B (en) 2023-09-05

Family

ID=77574788

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110592517.4A Active CN113378584B (en) 2021-05-28 2021-05-28 Non-autoregressive neural machine translation method based on auxiliary representation fusion

Country Status (1)

Country Link
CN (1) CN113378584B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114065784B (en) * 2021-11-16 2023-03-10 北京百度网讯科技有限公司 Training method, translation method, device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111160050A (en) * 2019-12-20 2020-05-15 沈阳雅译网络技术有限公司 Chapter-level neural machine translation method based on context memory network
CN112052692A (en) * 2020-08-12 2020-12-08 内蒙古工业大学 Mongolian Chinese neural machine translation method based on grammar supervision and deep reinforcement learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20200044201A (en) * 2018-10-10 2020-04-29 한국전자통신연구원 Neural machine translation model learning method and apparatus for improving translation performance

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111160050A (en) * 2019-12-20 2020-05-15 沈阳雅译网络技术有限公司 Chapter-level neural machine translation method based on context memory network
CN112052692A (en) * 2020-08-12 2020-12-08 内蒙古工业大学 Mongolian Chinese neural machine translation method based on grammar supervision and deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A Survey of Frontiers in Neural Machine Translation; Feng Yang; Shao Chenze; Journal of Chinese Information Processing (No. 07); full text *

Also Published As

Publication number Publication date
CN113378584A (en) 2021-09-10

Similar Documents

Publication Publication Date Title
CN113468895B (en) Non-autoregressive neural machine translation method based on decoder input enhancement
CN110598221B Method for improving Mongolian-Chinese translation quality by constructing a Mongolian-Chinese parallel corpus with a generative adversarial network
CN111160050A (en) Chapter-level neural machine translation method based on context memory network
CN108897740A Mongolian-Chinese machine translation method based on adversarial neural networks
CN111178093B (en) Neural machine translation system training acceleration method based on stacking algorithm
CN109299479B (en) Method for integrating translation memory into neural machine translation through gating mechanism
CN114787914A (en) System and method for streaming end-to-end speech recognition with asynchronous decoder
CN111597778A (en) Method and system for automatically optimizing machine translation based on self-supervision
CN112257465B (en) Multi-mode machine translation data enhancement method based on image description generation
Sen et al. Neural machine translation of low-resource languages using SMT phrase pair injection
CN111178087B (en) Neural machine translation decoding acceleration method based on discrete type attention mechanism
CN110781690A (en) Fusion and compression method of multi-source neural machine translation model
CN112417901A (en) Non-autoregressive Mongolian machine translation method based on look-around decoding and vocabulary attention
CN113378584B (en) Non-autoregressive neural machine translation method based on auxiliary representation fusion
CN114691858B (en) Improved UNILM digest generation method
CN113657125B (en) Mongolian non-autoregressive machine translation method based on knowledge graph
Chen et al. Research on neural machine translation model
Han et al. A coordinated representation learning enhanced multimodal machine translation approach with multi-attention
CN117218503A (en) Cross-Han language news text summarization method integrating image information
Shi et al. Adding Visual Information to Improve Multimodal Machine Translation for Low-Resource Language
CN116663578A (en) Neural machine translation method based on strategy gradient method improvement
CN116663577A (en) Cross-modal characterization alignment-based english end-to-end speech translation method
CN115659172A (en) Generation type text summarization method based on key information mask and copy
CN112257463B (en) Compression method of neural machine translation model for Chinese-English inter-translation
Wang et al. Multimodal object classification using bidirectional gated recurrent unit networks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant