CN112329483A - Multi-mechanism attention-combined multi-path neural machine translation method - Google Patents
Multi-mechanism attention-combined multi-path neural machine translation method
- Publication number
- CN112329483A (application number CN202011209086.0A)
- Authority
- CN
- China
- Prior art keywords
- attention
- translation
- training
- embedding vector
- value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention relates to a multi-mechanism attention-combined multi-path neural machine translation method, and belongs to the field of natural language processing. In the invention, a CNN translation mechanism, a Transformer translation mechanism and a Tree-Transformer translation mechanism each independently generate their own attention values; the calculated attention values are accumulated with weights, then aligned and normalized to form a new attention value, which is passed to the Dec-Enc attention layer of the decoder so that each translation mechanism completes the subsequent machine translation process and obtains a decoding key-value matrix. The decoding key-value matrices generated by all mechanisms are then weighted, superposed and normalized, and the target translation is generated through a linear transformation layer and a softmax layer. Superposing and normalizing the attention of multiple mechanisms effectively integrates the analytical capability of the different algorithms, and the resulting attention is closer to the theoretical true attention, so a better translation effect is obtained and translation accuracy is effectively improved.
Description
Technical Field
The invention relates to a multi-mechanism attention-combined multi-path neural machine translation method, and belongs to the field of natural language processing.
Background
Machine translation, which is a process implemented by a computer to translate a sentence in one language (a source language sentence) into a sentence in another language (a target language sentence) having the same meaning, has become an important research direction in the field of artificial intelligence.
In the prior art, Gehring et al. proposed a CNN translation mechanism that realizes machine translation entirely with convolutional neural networks: both the encoder and the decoder are built by stacking multiple convolutional layers. At the encoding end, the input sequence is encoded with convolution operations. At the decoding end, each convolutional layer computes attention and passes its result as input to the next layer; the next target word is finally predicted from the hidden state of the last layer.
In the prior art, Vaswani et al. proposed a Transformer translation mechanism that realizes machine translation entirely with attention. The encoding end is a stack of six identical encoding layers, each consisting of a multi-head self-attention sublayer and a feed-forward sublayer, with residual connections and layer normalization. The decoding end is a stack of six identical decoding layers, where each decoder layer has one additional masked attention sublayer compared with the encoder layer.
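To make the attention computation concrete, the following is an illustrative sketch in Python/PyTorch (not code from the patent) of the scaled dot-product attention that underlies the multi-head self-attention described above; tensor shapes and the function name are assumptions.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, d_k); returns the context and the attention weights
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5   # similarity between queries and keys
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))    # e.g. for the decoder's masked attention
    weights = F.softmax(scores, dim=-1)                          # normalized attention matrix
    return torch.matmul(weights, v), weights
```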
In the prior art, Wang et al. proposed a Tree-Transformer translation mechanism that takes the syntactic information of a sentence into account during translation: it adds a constituent attention module for capturing syntactic information on top of the multi-head self-attention of the conventional Transformer encoding end.
These three algorithms all come from top academic conferences in the field and are among the newer, best-performing machine-learning-based automatic translation methods in the prior art, but they still leave room for improvement.
Currently, attention has become the core of most machine-learning-based automatic translation methods, and the accuracy of the attention directly determines the quality of the translation. Different attention generation mechanisms produce inconsistent calculation results, and the attention generated by any single mechanism cannot completely and accurately reflect the theoretical attention in the language.
In practical decision-making, opening the floor so that everyone can speak freely and then collating the opinions produces a democratic decision that is generally better than a single authority deciding alone. We speculate that translation accuracy can likewise be improved by introducing a similar democratic-decision mechanism into the formation of attention in automatic translation.
Disclosure of Invention
The invention provides a multi-mechanism attention-combined multi-path neural machine translation method, which is used for effectively improving the translation quality.
The technical scheme of the invention is as follows: a multi-mechanism attention-combined multi-path neural machine translation method combines a CNN translation mechanism, a Transformer translation mechanism and a Tree-Transformer translation mechanism. Each translation mechanism independently generates its own attention value, and the calculated attention values are accumulated with weights. Newer algorithms and algorithms with better experimental results are considered more likely to produce attention values closer to the theoretical attention value, so these algorithms are given higher weights during accumulation; the specific weight values are determined by further experiments. After weighted accumulation, the values are aligned and normalized to form the new attention value.
In this multi-mechanism method, a newer algorithm or an algorithm with better experimental results is theoretically closer to the real attention, so such algorithms are assigned higher weights when the weights of the method are designed.
In constructing the multi-mechanism attention-combined multi-path neural machine translation model, the sum of the input word embedding vector and the position embedding vector is first fed to each of the translation mechanisms; each translation mechanism then trains on this input in its own training mode, forms its own training model, and calculates its own attention vector. At the encoding end of the model, the calculated attention values are weighted and superposed, then aligned and normalized to form a new attention value that is passed to the Dec-Enc attention layer of the decoder, so that each translation mechanism completes the subsequent machine translation process and obtains a decoding key-value matrix. The decoding key-value matrices generated by all mechanisms are then weighted, superposed and normalized, and the target translation is generated through a linear transformation layer and a softmax layer.
The method comprises the following specific steps:
step1, collecting training corpora;
Step2, preprocessing the corpus: perform word segmentation, lowercasing and data cleaning on the bilingual corpus in the training corpus with the MOSES toolkit, keep only sentence pairs whose length is within 175 tokens, and then segment all preprocessed data into subwords with the BPE (byte pair encoding) algorithm.
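As an illustration of this step, the sketch below shows one possible preprocessing pass; it assumes the sacremoses package as a stand-in for the MOSES scripts and leaves the BPE segmentation to the external subword-nmt toolkit, so the names and exact pipeline are assumptions rather than the patent's own tooling.

```python
from sacremoses import MosesTokenizer

tok_de, tok_en = MosesTokenizer(lang="de"), MosesTokenizer(lang="en")

def preprocess_pairs(pairs, max_len=175):
    """Tokenize, lowercase and length-filter (source, target) sentence pairs."""
    kept = []
    for src, tgt in pairs:
        src = tok_de.tokenize(src.lower(), return_str=True)
        tgt = tok_en.tokenize(tgt.lower(), return_str=True)
        if len(src.split()) <= max_len and len(tgt.split()) <= max_len:
            kept.append((src, tgt))
    return kept

# BPE (byte pair encoding) subword segmentation would then be applied to the kept pairs,
# e.g. with the subword-nmt toolkit.
```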
Step3, extracting a test set, a verification set and a training set from the preprocessed corpus: randomly extract 160K parallel sentence pairs from the processed corpus as the training set, 7K pairs as the verification set, and 6K pairs as the test set used to evaluate the translation model. The training set is used to train the parameters of the neural network, the test set is used to measure the accuracy of the current translation model, and hyperparameters such as the number of iterations and the learning rate are adjusted according to the results on the verification set so that the translation model performs better.
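A minimal sketch of the random split described above (160K train / 7K verification / 6K test); the seed and the list representation of the corpus are placeholders.

```python
import random

def split_corpus(pairs, n_train=160_000, n_valid=7_000, n_test=6_000, seed=1):
    rng = random.Random(seed)
    shuffled = list(pairs)          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:n_train + n_valid + n_test]
    return train, valid, test
```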
Step4, generating the source-language word embedding vectors for training and the position embedding vectors for training from the training-set corpus, splicing them together as the input, and feeding this input respectively to the CNN translation mechanism, the Transformer translation mechanism and the Tree-Transformer translation mechanism, wherein each translation mechanism trains on the input in its own training mode and forms its own training model.
Step5, generating the word embedding vectors and position embedding vectors for the sentence to be translated and feeding them respectively to the plurality of translation mechanisms, where each translation mechanism calculates the corresponding attention vector from its own trained model. With the CNN translation mechanism, the Transformer translation mechanism and the Tree-Transformer translation mechanism superposed (i.e. three models in parallel), the model first converts the input sequence into word embedding vectors. To let the model learn the order of the words in the sequence, a position embedding vector is added to each input word embedding vector; the position embedding vector represents the positional relation of the different words in the source sentence. The position embedding vectors and the word embedding vectors are denoted as p = (p_1, …, p_m) and w = (w_1, …, w_m) respectively, where the position embedding vector is calculated with the following formula:
wherein pos represents the position of a word in the source sentence, and i represents the dimension;
The word embedding vector and the position embedding vector are added together as the model input and fed to each of the translation mechanisms; each translation mechanism trains on this input in its own training mode and forms its own training model.
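The position embedding formula itself is not reproduced in the text above; the sketch below assumes the standard sinusoidal encoding of Vaswani et al., with pos indexing the word position, i indexing the embedding dimension, and an even d_model.

```python
import torch

def positional_encoding(max_len, d_model):
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimension indices
    angle = pos / (10000.0 ** (i / d_model))                        # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)   # even dimensions use sine
    pe[:, 1::2] = torch.cos(angle)   # odd dimensions use cosine
    return pe                        # added element-wise to the word embeddings
```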
Step6, performing weighted superposition on the attention values calculated in Step5, then aligning and normalizing them to form a new attention value, which is used as the new attention. Comparatively, an algorithm that performs better on the tested corpus receives a higher weight, and the specific weights are determined by experiment.
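A minimal sketch of this step under stated assumptions: the three attention matrices are already aligned to a common (target length, source length) shape, the weights are the experimentally determined values (placeholders here), and "normalization" is read as renormalizing each row to sum to one.

```python
def fuse_attention(att_cnn, att_trf, att_tree, w_cnn=1.0, w_trf=1.0, w_tree=2.0):
    # weighted accumulation of the three mechanisms' attention matrices (torch tensors)
    fused = w_cnn * att_cnn + w_trf * att_trf + w_tree * att_tree
    # renormalize so each row is again a probability distribution over source positions
    return fused / fused.sum(dim=-1, keepdim=True)
```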
Step7, sending the attention value obtained in Step6 to the Dec-Enc attention layer of the decoder, so that each translation mechanism completes the subsequent machine translation process and generates its own decoding key-value matrix.
Step8, performing weighted superposition and normalization on the decoding key-value matrices generated by the mechanisms, and sending the result through a linear transformation layer and a softmax layer to generate the target translation.
The invention has the beneficial effects that:
1. In the invention, each automatic translation method independently generates its own attention value; the calculated attention values are weighted and accumulated, then aligned and normalized to form the new attention value, so a better translation effect is obtained.
2. Superposing and normalizing the attention of multiple mechanisms effectively integrates the analytical capability of the different algorithms and creates the advantage of democratic decision-making, as in the proverb "three cobblers with their wits combined equal Zhuge Liang"; the resulting attention is closer to the theoretical true attention, so a better translation effect is obtained.
Drawings
FIG. 1 is a block flow diagram of the present invention;
FIG. 2 is a bar graph of experimental results of the present invention;
Detailed Description
The invention is further described with reference to the following figures and specific examples.
Example 1: in this example, the German-English corpus is used as the translation corpus, and the mechanisms selected for the multi-mechanism decision are the CNN translation mechanism, the Transformer translation mechanism and the Tree-Transformer translation mechanism.
As shown in fig. 1-2, a multi-mechanism attention-combined multi-path neural machine translation method includes the following specific steps:
Model construction process:
step1, downloading the German-English corpus from the website and determining the plurality of translation mechanisms;
step2, preprocessing the corpus: performing word segmentation, lowercasing and data cleaning on the bilingual corpus with the MOSES toolkit, keeping only sentence pairs whose length is within 175 tokens, and then segmenting all preprocessed data into subwords with the BPE (byte pair encoding) algorithm;
step3, generating the training set, verification set and test set: randomly extracting 160K parallel sentence pairs from the processed corpus as the training set, 7K pairs as the verification set for tuning the translation model, and 6K pairs as the test set for evaluating the translation model;
step4, to make full use of the advantages of the CNN translation mechanism, the Transformer translation mechanism and the Tree-Transformer translation mechanism, the encoding end uses a convolutional-neural-network-based encoder, a Transformer encoder and a Tree-Transformer encoder respectively to encode the input sequence;
in order to enable the model to learn the order of the words in the source sentence and capture the positional information of the words in the input sequence, the position embedding vector is added element-wise to the word embedding vector as the input of the encoding end; the position embedding vector represents the positional information of the different words in the input sequence, and the position embedding vectors and word embedding vectors are denoted as p = (p_1, …, p_m) and w = (w_1, …, w_m) respectively;
step5, each translation mechanism in the translation model trains on the input in its own training mode, forms its own training model, and calculates its own attention vector;
step6, carrying out weighted superposition on the plurality of attention values obtained by calculation in the Step5, aligning and normalizing to form a new attention value;
step7, between the encoder and the decoder, an attention fusion module is used to automatically acquire the information needed to decode the target word. At the decoding end, the decoder of each path computes attention using the outputs of all three encoder paths as context, so nine information streams pass from the encoder to the decoder. Specifically, the attention value calculated in step6 is sent to the Dec-Enc attention layer of the decoder, which takes the context information generated by the three encoding-end paths and the decoder output of the previous time step as the decoder input for decoding. The calculation formulas of the Dec-Enc attention module are as follows:
ctx_cc = Attention(q_c, k_c, v_c)
ctx_ca = Attention(q_c, k_a, v_a)
ctx_cl = Attention(q_c, k_l, v_l)
ctx_aa = Attention(q_a, k_a, v_a)
ctx_ac = Attention(q_a, k_c, v_c)
ctx_al = Attention(q_a, k_l, v_l)
ctx_ll = Attention(q_l, k_l, v_l)
ctx_lc = Attention(q_l, k_c, v_c)
ctx_la = Attention(q_l, k_a, v_a)
Here ctx_cc denotes the attention result of the CNN-path attention query q_c in the decoder with the attention key k_c and value v_c of the CNN path at the encoding end; ctx_ca denotes the attention result of the CNN-path query q_c with the Transformer-path key k_a and value v_a at the encoding end; ctx_cl denotes the attention result of the CNN-path query q_c with the Tree-Transformer-path key k_l and value v_l at the encoding end; ctx_aa denotes the attention result of the Transformer-path query q_a in the decoder with the Transformer-path key k_a and value v_a; ctx_ac denotes the attention result of the Transformer-path query q_a with the CNN-path key k_c and value v_c; ctx_al denotes the attention result of the Transformer-path query q_a with the Tree-Transformer-path key k_l and value v_l; ctx_ll denotes the attention result of the Tree-Transformer-path query q_l in the decoder with the Tree-Transformer-path key k_l and value v_l; ctx_lc denotes the attention result of the Tree-Transformer-path query q_l with the CNN-path key k_c and value v_c; and ctx_la denotes the attention result of the Tree-Transformer-path query q_l with the Transformer-path key k_a and value v_a.
To fully exploit the information captured by the different encoder paths, we use a weighted summation mechanism to fuse them.
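The nine context computations and their weighted fusion can be sketched as below; this reuses the scaled_dot_product_attention helper from the earlier sketch, and the per-pair fusion weights are assumptions (the text only states that a weighted summation is used).

```python
def dec_enc_attention(queries, keys, values, weights):
    """queries, keys, values: dicts keyed by path id 'c' (CNN), 'a' (Transformer), 'l' (Tree-Transformer)."""
    contexts = {}
    for dec_path, q in queries.items():
        for enc_path in ("c", "a", "l"):
            ctx, _ = scaled_dot_product_attention(q, keys[enc_path], values[enc_path])
            contexts[dec_path + enc_path] = ctx            # e.g. contexts['ca'] corresponds to ctx_ca
    # weighted summation over the encoder paths for each decoder path
    fused = {d: sum(weights[d + e] * contexts[d + e] for e in ("c", "a", "l"))
             for d in queries}
    return contexts, fused
```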
step8, predicting the target word: at the decoding end, after the three decoders have generated their decoding information, the information is integrated with a weighted summation mechanism and the integrated result is passed to the softmax layer to predict the target word, according to the following formulas:
z_o = normal(z_c + z_a + z_t)
P(y) = softmax(z_o W_s + b_s)
where z_c, z_a and z_t denote the decoding information generated by the three decoders, z_o denotes the final fused output of the three decoder paths, and P(y) is the predicted probability of the target word.
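A sketch of this prediction step under assumptions: "normal" is read here as layer normalization, and W_s, b_s are the parameters of the output projection; both readings are interpretations rather than details confirmed by the text.

```python
import torch
import torch.nn.functional as F

def predict_target(z_c, z_a, z_t, layer_norm, output_proj):
    z_o = layer_norm(z_c + z_a + z_t)    # z_o = normal(z_c + z_a + z_t)
    logits = output_proj(z_o)            # z_o W_s + b_s
    return F.softmax(logits, dim=-1)     # P(y)

# Example wiring (dimensions are placeholders):
# layer_norm = torch.nn.LayerNorm(256)
# output_proj = torch.nn.Linear(256, vocab_size)
```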
In order to verify the effectiveness of the method, the experiment compares it with a CNN-based neural machine translation model, a Transformer translation model, a Tree-Transformer translation model and a translation model combining CNN and Transformer.
The model parameters are set as follows:
The operating environment of this experiment is Python 3.6 with the deep learning framework torch 0.4.0, and the experimental corpus is the IWSLT2014 German-English corpus. For the specific algorithms in the experiment, the parameters are set as follows: the word embedding dimension is 256, the number of network layers in the encoder and decoder is 2, the number of hidden units per layer is 256, dropout is set to 0.1, filter_size is set to 1024, kernel_size is set to 3, the learning rate is set to 0.25, the label smoothing rate is set to 0.1, the NAG optimizer is used to optimize the training model, and the batch size is 128.
To demonstrate the effectiveness of our method, we compare it with four reference models: the CNN translation mechanism, the Transformer translation mechanism, the Tree-Transformer translation mechanism and the CNN+Transformer translation mechanism.
The specific parameters of each method can affect the final experimental results, but fine-tuning these parameters is not needed to highlight the benefit of the present invention. To show its effectiveness and benefit in general terms, we set the operating parameters of all algorithms close to one another and close to the examples provided by the original authors of these algorithms.
The BLEU score is used to evaluate the translation models; as can be seen from FIG. 2 and Table 1, the multi-mechanism attention-combined multi-path neural machine translation method effectively improves the performance of neural machine translation.
To demonstrate the benefit of the present invention, we designed two examples (Algorithm 5 and Algorithm 6) for comparison with the existing methods; both generate attention by superposing the three mechanisms. Algorithm 5 does not use weighted superposition, while Algorithm 6 gives the newer Tree-Transformer mechanism twice the weight. In this experiment, the encoding end of Algorithm 5 uses a Transformer encoder, a Tree-Transformer encoder and a CNN-based encoder respectively, the decoding end uses only one Transformer decoder, and attention is generated in the manner described above. The encoding end of Algorithm 6 likewise uses a Transformer encoder, a Tree-Transformer encoder and a CNN-based encoder, while the decoding end uses one Transformer decoder and one CNN-based decoder, with the syntactic (Tree-Transformer) information given twice the weight.
Table 1 shows the translation results of the different models:

| | Model | De-En data set (BLEU) |
|---|---|---|
| Algorithm 1 | CNN | 29.07 |
| Algorithm 2 | Transformer | 28.65 |
| Algorithm 3 | Tree Transformer | 29.62 |
| Algorithm 4 | CNN+Transformer | 31.69 |
| Algorithm 5 (invention) | CNN+Transformer+Tree Transformer | 32.49 |
| Algorithm 6 (invention) | CNN+Transformer+2*Tree Transformer | 32.69 |
The translation results of the translation model proposed by the invention and the baseline models on the German-English corpus are shown in Table 1. As can be seen from Table 1, the CNN, Transformer and Tree-Transformer models all achieve good results on German-English, with the Tree-Transformer performing best among them. The three-mechanism attention weighted-superposition method achieves a higher BLEU score and translates more accurately; compared with a single decision, it is more accurate and consistent with the democratic-voting effect we predicted. In particular, when the Tree-Transformer is given twice the weight, the BLEU score increases by a further 0.2, indicating that our weighting strategy is effective. The attention generation mechanism based on weighted democratic voting thus obtains better experimental performance and fully embodies the beneficial effects of the invention.
The multi-mechanism attention-combined multi-path neural machine translation method provided by the invention performs well on the translation task, mainly for the following reasons: 1. The translation model combines the advantages of the CNN translation mechanism, the Transformer translation mechanism and the Tree-Transformer translation mechanism, where the Tree-Transformer can incorporate syntactic information during translation. 2. Each automatic translation method independently generates its own attention value; the calculated attention values are weighted, accumulated, aligned and normalized to form the new attention value. Superposing and normalizing the attention of multiple mechanisms effectively integrates the analytical capability of the different algorithms and creates the multi-decision advantage of the proverb "three cobblers with their wits combined equal Zhuge Liang"; the resulting attention is closer to the theoretical true attention, so a better translation effect is obtained.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.
Claims (6)
1. A multi-mechanism attention-combined multi-path neural machine translation method, characterized by comprising the following specific steps:
step1, collecting training corpora;
step2, preprocessing the training corpus;
step3, extracting a training set, a verification set and a test set from the preprocessed training corpus;
step4, generating a source-language word embedding vector for training and a position embedding vector for training from the corpus of the training set, splicing them together as the input, and inputting the input respectively to a CNN translation mechanism, a Transformer translation mechanism and a Tree-Transformer translation mechanism, wherein each translation mechanism trains on the input in its own training mode, respectively forming its own training model;
step5, generating word embedding vectors and position embedding vectors for the sentence to be translated and inputting them respectively to the plurality of translation mechanisms, wherein each translation mechanism calculates the corresponding attention vector according to its own trained model;
step6, carrying out weighted superposition on the plurality of attention values obtained by calculation in the Step5, aligning and normalizing to form a new attention value;
step7, sending the attention value obtained by the calculation in the Step6 to a Dec-Enc attention layer of a decoder, and enabling each translation mechanism to complete the subsequent machine translation process and respectively generate a decoding key-value matrix;
and Step8, performing weighted superposition and normalization on the decoding key-value matrixes generated by the mechanisms, and sending the result to a linear transformation layer and a softmax layer to generate a target translation.
2. The multi-mechanism attention-merging multipath neural machine translation method of claim 1, wherein: the preprocessing in Step2 is to perform word segmentation, lowercasing and data cleaning on the bilingual corpus in the corpus with the MOSES toolkit, keep only sentence pairs whose length is within 175, and then perform subword segmentation on all the preprocessed data with the BPE (byte pair encoding) algorithm.
3. The multi-mechanism attention-merging multipath neural machine translation method of claim 1, wherein: the Step of extracting the training set, the verification set and the test set in Step3 means that 160K parallel corpora are randomly extracted from the processed corpora to be used as the training set, 7K parallel corpora are used as the verification set to train the translation model, and 6K parallel corpora are used as the test set to evaluate the translation model.
4. The multi-mechanism attention-merging multipath neural machine translation method of claim 1, wherein: in Step4, a CNN translation mechanism, a Transformer translation mechanism and a Tree-Transformer translation mechanism are adopted; the model first converts the input sequence into word embedding vectors and, in order to let the model learn the order of the words in the sequence, adds a position embedding vector to each input word embedding vector, wherein the position embedding vector represents the positional relation of the different words in the source sentence; the position embedding vectors and the word embedding vectors are denoted as p = (p_1, …, p_m) and w = (w_1, …, w_m) respectively, wherein the position embedding vector is calculated with the following formula:
wherein pos represents the position of a word in the source sentence, and i represents the dimension;
the word embedding vector and the position embedding vector are added together as the model input and fed respectively to the plurality of translation mechanisms, and each translation mechanism trains on the input in its own training mode, respectively forming its own training model.
5. The multi-mechanism attention-merging multipath neural machine translation method of claim 1, wherein: in Step6, the encoding end receives the input vectors, calculates the attention values respectively, performs weighted superposition on them, and then aligns and normalizes them to form a new attention value, wherein the encoding end comprises three encoders and the weight given to the Tree-Transformer translation mechanism is twice the weight given to each of the other two translation mechanisms.
6. The multi-mechanism attention-merging multipath neural machine translation method of claim 1, wherein: in Step7, the Dec-Enc attention layer of the decoder receives the attention key-value pairs generated by the encoding end; the query matrix q in the decoder and the key matrix k are combined by a dot-product operation, and the result is used in a weighted summation with the value matrix v, so that each translation mechanism completes the subsequent machine translation process and respectively generates a decoding key-value matrix.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011209086.0A CN112329483A (en) | 2020-11-03 | 2020-11-03 | Multi-mechanism attention-combined multi-path neural machine translation method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011209086.0A CN112329483A (en) | 2020-11-03 | 2020-11-03 | Multi-mechanism attention-combined multi-path neural machine translation method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112329483A true CN112329483A (en) | 2021-02-05 |
Family
ID=74322805
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011209086.0A Pending CN112329483A (en) | 2020-11-03 | 2020-11-03 | Multi-mechanism attention-combined multi-path neural machine translation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112329483A (en) |
-
2020
- 2020-11-03 CN CN202011209086.0A patent/CN112329483A/en active Pending
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
- CN110377918A * | 2019-07-15 | 2019-10-25 | 昆明理工大学 | Chinese-Vietnamese neural machine translation method fusing syntactic parse trees
Non-Patent Citations (2)
Title |
---|
- KAITAO SONG et al.: "Double Path Networks for Sequence to Sequence Learning", Computation and Language *
- XI Xiangyu: "Paper interpretation: Attention is all you need", zhuanlan.zhihu.com/p/46990010 *
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110717343A (en) * | 2019-09-27 | 2020-01-21 | 电子科技大学 | Optimal alignment method based on transformer attention mechanism output |
CN110717343B (en) * | 2019-09-27 | 2023-03-14 | 电子科技大学 | Optimal alignment method based on transformer attention mechanism output |
CN114118111A (en) * | 2021-11-26 | 2022-03-01 | 昆明理工大学 | Multi-mode machine translation method fusing text and picture characteristics |
CN114118111B (en) * | 2021-11-26 | 2024-05-24 | 昆明理工大学 | Multi-mode machine translation method integrating text and picture features |
CN114580443A (en) * | 2022-03-01 | 2022-06-03 | 腾讯科技(深圳)有限公司 | Text translation method, text translation device, kernel function combination method, server and medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108052512B (en) | Image description generation method based on depth attention mechanism | |
CN112329483A (en) | Multi-mechanism attention-combined multi-path neural machine translation method | |
CN110490946B (en) | Text image generation method based on cross-modal similarity and antagonism network generation | |
CN112613303B (en) | Knowledge distillation-based cross-modal image aesthetic quality evaluation method | |
CN110472238B (en) | Text summarization method based on hierarchical interaction attention | |
CN109492227A (en) | It is a kind of that understanding method is read based on the machine of bull attention mechanism and Dynamic iterations | |
CN112733533B (en) | Multi-modal named entity recognition method based on BERT model and text-image relation propagation | |
CN111274398A (en) | Method and system for analyzing comment emotion of aspect-level user product | |
WO2022020467A1 (en) | System and method for training multilingual machine translation evaluation models | |
He et al. | Improving neural relation extraction with positive and unlabeled learning | |
CN114398976A (en) | Machine reading understanding method based on BERT and gate control type attention enhancement network | |
US20240119716A1 (en) | Method for multimodal emotion classification based on modal space assimilation and contrastive learning | |
Zhao et al. | RoR: Read-over-read for long document machine reading comprehension | |
CN113704437A (en) | Knowledge base question-answering method integrating multi-head attention mechanism and relative position coding | |
CN114238649B (en) | Language model pre-training method with common sense concept enhancement | |
CN113255366B (en) | Aspect-level text emotion analysis method based on heterogeneous graph neural network | |
CN112015760B (en) | Automatic question-answering method and device based on candidate answer set reordering and storage medium | |
CN113901847A (en) | Neural machine translation method based on source language syntax enhanced decoding | |
CN113822125A (en) | Processing method and device of lip language recognition model, computer equipment and storage medium | |
CN114648024B (en) | Method for generating cross-language abstract of Chinese crossing based on multi-type word information guidance | |
CN116450877A (en) | Image text matching method based on semantic selection and hierarchical alignment | |
CN117174163A (en) | Virus evolution trend prediction method and system | |
CN116010622A (en) | BERT knowledge graph completion method and system for fusion entity type | |
CN117350330A (en) | Semi-supervised entity alignment method based on hybrid teaching | |
CN115810351A (en) | Controller voice recognition method and device based on audio-visual fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | | Application publication date: 20210205