Background
With the trend of economic globalization, communication and cooperation between countries have become more frequent. Manual translation performed by human translators consumes enormous manpower and financial resources and cannot meet the growing demand for translation, which has driven the development of machine translation. Machine translation, as the name implies, refers to the process of using computer technology to convert a source language into a semantically equivalent target language.
Thanks to the improvement of computer computing power and the development of deep learning research, Neural Machine Translation (NMT) models based on deep neural networks now occupy the leading position in machine translation research. NMT models adopt an encoder-decoder framework, achieve excellent translation performance, and are widely applied. Specifically, given a source language sentence X = {x_1, x_2, ..., x_m}, where x_i denotes the i-th subword in the source language sentence, i = {1, 2, ..., m}, the NMT model first uses an encoder to encode it into a source-end representation E = {e_1, e_2, ..., e_m}, where e_i denotes the semantic representation corresponding to the i-th subword in the source language sentence, and then decodes through a decoder to obtain the target language translation Y = {y_1, y_2, ..., y_n}, where y_j denotes the j-th subword in the target language sentence, j = {1, 2, ..., n}. According to the way the decoder works, NMT models can be divided into two categories: autoregressive neural machine translation models and non-autoregressive neural machine translation models. Their translation principles are shown in fig. 1a and fig. 1b respectively; in both cases the input source language sentence is "I love China" and the target language translation result is the corresponding Chinese sentence.
In autoregressive neural machine translation models, classical models such as the Transformer (Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998-6008.) and RNN models (Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint.) generate the translation from left to right (fig. 1a): for the prediction at time t, the decoder uses the output before time t, Y_{<t} = {y_1, y_2, ..., y_{t-1}}, where y_j denotes the j-th subword in the target language sentence, j = {1, 2, ..., t-1}, and, combined with an attention mechanism over the source-end representation E produced by the encoder, predicts the target word y'_t. The Transformer model achieves excellent performance on many translation data sets, but its autoregressive decoding process has the following problems: 1) exposure bias: the historical information during training is taken from the reference translation, while during testing it can only be obtained from the model's own predictions, which causes an inconsistency between training and testing and degrades performance. 2) Low translation efficiency: the serial working mode of the autoregressive model cannot exploit the highly parallel nature of GPU hardware at test time; its prediction time grows with the sequence length, so translation of long sentences is slow.
Unlike autoregressive neural machine translation models, non-autoregressive neural machine translation (NAT) models assume that the words in the target language sequence are independent of one another, and generate them in parallel: after the encoder produces the source-end representation E, a length predictor determines the length n of the target language sequence, the decoder input D = {d_1, d_2, ..., d_n} is constructed, where d_j denotes the decoder input corresponding to the j-th position, j = {1, 2, ..., n}, and the decoder then predicts all corresponding words simultaneously. By removing the dependence on historical information through the independence assumption, the NAT model not only achieves extremely high translation efficiency but also alleviates the exposure bias problem of autoregressive translation. However, its performance lags far behind that of autoregressive machine translation models, because without explicit dependencies between outputs it is difficult for predictions at different positions of the NAT model to be coordinated into a consistent translation. Furthermore, the multi-modality phenomenon of the translation task (i.e., one source language sentence may correspond to multiple correct target language sentences) aggravates this problem and lowers the final translation quality.
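As an illustration only, the following minimal Python sketch contrasts the two decoding regimes described above; encode, predict_next_token, predict_length, and predict_all_tokens are hypothetical stand-ins for model components, not functions of any particular library.

```python
# Minimal sketch contrasting serial autoregressive decoding with parallel
# non-autoregressive decoding; all callables are hypothetical placeholders.

def autoregressive_decode(encode, predict_next_token, src, bos="<bos>", eos="<eos>", max_len=50):
    E = encode(src)                         # source-end representation E
    target = [bos]
    for _ in range(max_len):                # serial: step t conditions on y_1..y_{t-1}
        y_t = predict_next_token(E, target)
        if y_t == eos:
            break
        target.append(y_t)
    return target[1:]

def non_autoregressive_decode(encode, predict_length, predict_all_tokens, src):
    E = encode(src)                         # source-end representation E
    n = predict_length(E)                   # target length from the length predictor
    return predict_all_tokens(E, n)         # all n positions predicted in parallel
```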
To address the missing dependencies in non-autoregressive neural machine translation models, two types of solutions exist: one class of schemes directly models the dependency relationships between words in the target sequence; the other introduces hidden variables to model the missing dependencies implicitly.
The schemes that model dependencies directly adopt a training strategy similar to that of autoregressive models. Researchers propose taking part of the words in the reference translation Y as input and training the decoder to predict the words not provided, thereby modeling the dependencies between the provided words and the remaining words. Combined with iterative decoding strategies, this scheme significantly improves the performance of non-autoregressive machine translation (Lihua Qian, Hao Zhou, Yu Bao, Mingxuan Wang, Lin Qiu, Weinan Zhang, Yong Yu, and Lei Li. 2020. Glancing transformer for non-autoregressive neural machine translation. arXiv preprint arXiv:2008.07905; Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. 2019. Mask-Predict: Parallel decoding of conditional masked language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).). However, this solution has the following problems: 1) part of the words of the reference sequence are used as input during training, while at test time the decoder must rely on previously decoded or simultaneously predicted words, so the exposure bias problem remains and the prediction performance of the model is reduced; 2) the decoding algorithm with multiple iterations reduces the efficiency of the model.
Hidden-variable-based schemes use a deep neural network to encode the dependency information of the target sequence into hidden variables, and then train a model to predict those hidden variables; modeling the hidden variables serves as an intermediate step of non-autoregressive modeling. Specifically, hidden-variable-based schemes first predict the hidden variables, either autoregressively or non-autoregressively, and then predict the target sequence (Xuezhe Ma, Chunting Zhou, Xian Li, Graham Neubig, and Eduard Hovy. 2019. FlowSeq: Non-autoregressive conditional sequence generation with generative flow. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).). However, the modeling of hidden variables relies on complex deep neural networks, which makes the model less efficient in translation and often less interpretable.
Disclosure of Invention
The invention aims to provide a method for improving the translation quality of a non-autoregressive neural machine translation model by modeling cooperative relationships, so as to solve the problem that existing non-autoregressive neural machine translation models lack explicit modeling of dependency relationships and therefore suffer reduced translation performance.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
The method for improving the translation quality of non-autoregressive neural machine translation by modeling cooperative relationships first obtains the source-end representation corresponding to a source language sequence, then obtains the length of the target language sequence, and then constructs the input of the decoder in the non-autoregressive neural machine translation model by combining the source-end representation with the length of the target language sequence; the method further comprises the following steps:
step 1, obtaining a dependency syntax tree of a target language sequence based on source end representation and decoder input, and converting the dependency syntax tree of the target language sequence into a collaborative relationship matrix.
Step 2, integrating the cooperative relationship matrix of the target language sequence into the decoder of the non-autoregressive neural machine translation model, and decoding the decoder input with the decoder into which the cooperative relationship matrix has been integrated, thereby obtaining the target language sequence.
In the method for improving the translation quality of non-autoregressive neural machine translation by modeling cooperative relationships, the source-end representation is obtained by encoding the source language sequence with an encoder.
In the method for improving the translation quality of non-autoregressive neural machine translation by modeling cooperative relationships, the length of the target language sequence is predicted from the source-end representation by a length predictor in the non-autoregressive neural machine translation model. The length predictor first predicts the length difference between the source language sequence and the target language sequence based on the source-end representation, and then obtains the length of the target language sequence from this length difference and the length of the source language sequence.
In the method for improving the translation quality of the non-autoregressive neural machine through modeling the cooperative relationship, in step 1, a cooperative relationship predictor is adopted to construct a dependency syntax tree of a target language sequence based on source end representation and decoder input, and the dependency syntax tree of the target language sequence is converted into a cooperative relationship matrix of the target language sequence.
The cooperative relationship predictor predicts the dependency syntax tree with a biaffine dependency parser model. The biaffine dependency parser model takes the source-end representation and the decoder input as input and is trained with the dependency syntax tree of the target language as the training target; the dependency syntax tree of the target language sequence is obtained by prediction with the trained biaffine dependency parser model, after which the cooperative relationship predictor converts the dependency syntax tree of the target sequence into a cooperative relationship matrix. The dependency syntax tree of the target language is extracted from the reference translation of the target language.
In step 2 of the method for improving the translation quality of non-autoregressive neural machine translation by modeling cooperative relationships, a cooperative relationship layer is constructed in the decoder of the non-autoregressive neural machine translation model; the cooperative relationship layer comprises a self-attention sub-layer based on the cooperative relationship matrix of the target language sequence, a source-target attention sub-layer and a feed-forward neural network sub-layer, whereby the cooperative relationship matrix of the target language sequence is integrated into the decoder.
Compared with the traditional autoregressive decoding scheme, neural machine translation models adopting non-autoregressive decoding are extremely efficient and better suited to the requirements of industry, but their application is hampered by lower translation quality. At its root, the non-autoregressive neural machine translation model lacks explicit modeling of the dependency relationships between target language words, so it has difficulty coping with the multi-modality phenomenon commonly found in machine translation tasks. Existing non-autoregressive neural machine translation research on modeling dependencies relies either on inefficient rounds of iteration or on complex deep networks. The invention extracts the undirected dependency relationships (namely the cooperative relationships) between words in the target language sequence through the dependency syntax tree, then models the cooperative relationships with a simple cooperative relationship predictor, and thereby improves the translation quality of the non-autoregressive neural machine translation model.
By analyzing the working mode of the non-autoregressive neural machine translation model (NAT), it can be seen that the parallel prediction of NAT is in essence a cooperative prediction of the target sequence, and the (directed) dependency relationships modeled in existing work are not the essential requirement of NAT. Therefore, the invention proposes to model the cooperative relationships between words in the target language sequence and to integrate the cooperative relationships into the decoding process of NAT.
Compared with the prior art, the invention has the advantages that:
1) The invention is the first to propose modeling the cooperative relationships among words in the target sequence, expresses these cooperative relationships as a cooperative relationship matrix, and accordingly constructs a cooperative relationship layer and integrates it into the decoder of the NAT model.
2) The invention provides a method for modeling the cooperative relationships between words in the target sequence via the dependency syntax tree, extracting a cooperative relationship matrix from the dependency syntax tree, and integrating it into the decoding process of the NAT model. The translation quality is thereby significantly improved while dependency relationships are taken into account, demonstrating the great value of modeling cooperative relationships in NAT.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
In this embodiment, a German-to-English NAT model system is given as an example: the input source language is German, the source language sequence is "Ich habe eine Katze", the desired output target language is English, and the desired output target language sequence is "I have a cat". As shown in fig. 2, the method for improving the translation quality of non-autoregressive neural machine translation by modeling cooperative relationships in this embodiment comprises the following steps:
Step one: let the source language sequence be X = {x_1, x_2, ..., x_m}, where x_i denotes the i-th subword in the source language sentence, i = {1, 2, ..., m}. The invention adopts the encoder of an autoregressive Transformer model: the source language sequence X = {x_1, x_2, ..., x_m} is input into the encoder, and the encoder encodes it to obtain the corresponding source-end representation E = {e_1, e_2, ..., e_m}, where e_i denotes the semantic representation corresponding to the i-th subword in the source language sentence, i = {1, 2, ..., m}.
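As an illustration only, the following PyTorch sketch shows this encoding step; the vocabulary size, number of layers and attention heads are assumptions and not the exact configuration of the embodiment (positional encodings are omitted for brevity).

```python
import torch
import torch.nn as nn

# Minimal sketch of the encoding step: subword ids -> source-end representation E.
vocab_size, d_model, m = 32000, 512, 4
embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=6,
)

x = torch.randint(0, vocab_size, (1, m))   # subword ids of e.g. "Ich habe eine Katze"
E = encoder(embed(x))                      # source-end representation E = {e_1, ..., e_m}
print(E.shape)                             # torch.Size([1, 4, 512])
```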
Step two: the length of the target language sequence is predicted by the length predictor of the NAT model from the source-end representation obtained in step one, and the input of the decoder in the non-autoregressive neural machine translation model is then constructed by combining the source-end representation E with the length of the target language sequence, specifically:
Firstly, the length predictor is used to predict, based on the source-end representation, the length difference Δ(L) between the target language sequence and the source language sequence, and the length n of the target language sequence is calculated from Δ(L), as shown in formula (1):

Δ(L) = argmax_{ΔL} P(ΔL | X; φ),  P(ΔL | X; φ) = softmax(MLP(mean-pooling(E))),  n = m + Δ(L)    (1)

In formula (1), MLP represents a multi-layer perceptron, φ denotes the parameters of the length predictor, mean-pooling denotes the average pooling operation over the source-end representation E, and m denotes the length of the source language sequence. P(ΔL | X; φ) denotes the probability distribution of the length difference between the target and source language sequences given the source-end input X; the ΔL below the argmax function indicates that the function returns the value of ΔL for which P(ΔL | X; φ) is maximal.
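A minimal PyTorch sketch of such a length predictor is given below; the allowed range of the length difference and the hidden sizes are illustrative assumptions, not the exact values of the embodiment.

```python
import torch
import torch.nn as nn

# Sketch of formula (1): predict the length difference from the mean-pooled
# source representation, then add it to the source length m.
class LengthPredictor(nn.Module):
    def __init__(self, d_model=512, max_delta=32):
        super().__init__()
        self.max_delta = max_delta
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 2 * max_delta + 1),   # classes for ΔL in [-max_delta, max_delta]
        )

    def forward(self, E):                            # E: (batch, m, d_model)
        pooled = E.mean(dim=1)                       # mean-pooling over source positions
        logits = self.mlp(pooled)                    # unnormalized P(ΔL | X)
        delta = logits.argmax(dim=-1) - self.max_delta
        return E.size(1) + delta                     # n = m + Δ(L)

n = LengthPredictor()(torch.randn(2, 4, 512))        # predicted target lengths per sentence
```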
Then, based on the target language sequence length n and the source-end representation E, the input D = {d_1, d_2, ..., d_n} of the decoder in the NAT model is constructed, where d_j denotes the decoder input corresponding to the j-th position, j = {1, 2, ..., n}, as shown in formula (2):

d_j = Σ_{i=1}^{m} w_ij · e_i,  w_ij = softmax_i( -|i - j| / τ )    (2)

In formula (2), τ is a hyper-parameter controlling the sharpness of the softmax function, i denotes an index of the source language sequence, i = {1, 2, ..., m}, j denotes an index of the target language sequence, j = {1, 2, ..., n}, and e_i denotes the semantic representation corresponding to the i-th subword in the source language sentence. w_ij denotes the relevance of the i-th subword to the j-th subword.
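The following sketch builds the decoder input as such a distance-weighted ("soft copy") mixture of source representations; the exact distance term is an assumption reconstructed from the description of w_ij and τ above, not necessarily the embodiment's precise formula.

```python
import torch

# Sketch of formula (2): d_j = sum_i w_ij * e_i with distance-based weights.
def build_decoder_input(E, n, tau=0.3):
    m = E.size(1)                                             # E: (batch, m, d_model)
    i = torch.arange(m, dtype=torch.float).unsqueeze(0)       # (1, m) source positions
    j = torch.arange(n, dtype=torch.float).unsqueeze(1)       # (n, 1) target positions
    w = torch.softmax(-torch.abs(j - i) / tau, dim=-1)        # (n, m) weights w_ij
    return torch.einsum("nm,bmd->bnd", w, E)                  # D: (batch, n, d_model)

D = build_decoder_input(torch.randn(1, 4, 512), n=5)
print(D.shape)                                                # torch.Size([1, 5, 512])
```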
Step three: a dependency syntax tree of the target language sequence is obtained based on the source-end representation and the decoder input and is converted into a cooperative relationship matrix, as described below:
the dependency syntax tree clearly defines the grammatical dependency relationship between words in the sentence, and can significantly improve the performance of a non-autoregressive machine translation (NAT) model, so the embodiment first adopts the collaborative relationship predictor to obtain the dependency syntax tree of the target language sequence and converts the dependency syntax tree into the collaborative relationship matrix.
Obtaining the dependency syntax tree: in the training phase, this embodiment first uses an external dependency syntax tree extraction tool (e.g., stanza) to extract the dependency syntax tree of the reference translation. Then, a biaffine dependency parser model is trained to predict the dependency syntax tree of the corresponding reference translation from the decoder input and the source-end representation (Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.). The biaffine dependency parser model is trained with the decoder input D and the source-end representation E as inputs and the dependency syntax tree of the target language as the training target. In the testing stage, the dependency syntax tree of the target language sequence is obtained by prediction with the trained biaffine dependency parser model.
This embodiment parses the reference translation of the target language using an external dependency syntax tree extraction tool (stanza). It should be noted that the processing units of the dependency syntax tree extraction tool are words, while the processing units of the NMT model are subwords, so word-level dependency syntax trees must be converted into subword-level dependency syntax trees. Suppose a word y_j with head t_j is decomposed into three subwords {y_j^1, y_j^2, y_j^3}; then the parent node of the first subword y_j^1 is t_j, and the parent node of the remaining subwords {y_j^2, y_j^3} is y_j^1.
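A minimal sketch of this word-to-subword conversion is given below; the index convention (1-based, 0 for the root) and the input structures are illustrative assumptions.

```python
# Convert a word-level dependency tree to a subword-level one: the first subword
# of a word keeps the word's head, and the remaining subwords attach to it.
def word_tree_to_subword_tree(word_heads, word_to_subwords):
    """word_heads: dict word_idx -> head word_idx (1-based, 0 = root);
    word_to_subwords: dict word_idx -> ordered list of subword indices (1-based)."""
    n_subwords = sum(len(s) for s in word_to_subwords.values())
    subword_heads = [0] * (n_subwords + 1)            # 1-based; index 0 unused
    first = {w: subs[0] for w, subs in word_to_subwords.items()}
    for w, subs in word_to_subwords.items():
        head_word = word_heads[w]
        subword_heads[subs[0]] = 0 if head_word == 0 else first[head_word]
        for s in subs[1:]:
            subword_heads[s] = subs[0]                # attach remaining subwords to the first
    return subword_heads[1:]

# Illustrative example: word 2 is the root; word 4 is split into two subwords.
print(word_tree_to_subword_tree(
    {1: 2, 2: 0, 3: 4, 4: 2},
    {1: [1], 2: [2], 3: [3], 4: [4, 5]},
))  # -> [2, 0, 4, 2, 4]
```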
This embodiment predicts the dependency syntax tree of the target language sequence using the biaffine dependency parser model proposed in Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. The biaffine dependency parser model takes the decoder input D and the source-end representation E as inputs and the dependency syntax tree of the target language as the training target, predicting the parent node of every subword in parallel. However, unlike the architecture in the above-mentioned document, this embodiment removes the post-processing step based on the minimum spanning tree and uses a Transformer encoder instead of the LSTM encoder. Post-processing the parser output with a minimum spanning tree can yield a higher-quality dependency syntax tree but brings a large time overhead, so this embodiment removes that module. Meanwhile, since the encoding capability of the Transformer encoder is stronger than that of the LSTM, this embodiment uses 4 Transformer encoder layers instead of the LSTM encoder layers in the literature to extract dependency information. Jointly training the NMT task and the dependency syntax tree prediction task not only produces the cooperative relationships of the target sequence but also regularizes the encoder representation, further improving translation performance.
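The sketch below shows a biaffine head scorer in the spirit of Dozat & Manning (2017): for every position it scores every other position as a candidate parent. The hidden sizes, the ReLU MLPs and the absence of an explicit root token are illustrative assumptions, not the embodiment's exact architecture; H stands for the output of the 4-layer Transformer encoder applied to the decoder input D together with the source-end representation E.

```python
import torch
import torch.nn as nn

# Biaffine scoring of candidate parent positions for each subword.
class BiaffineHeadScorer(nn.Module):
    def __init__(self, d_model=512, d_arc=256):
        super().__init__()
        self.dep_mlp = nn.Sequential(nn.Linear(d_model, d_arc), nn.ReLU())
        self.head_mlp = nn.Sequential(nn.Linear(d_model, d_arc), nn.ReLU())
        self.U = nn.Parameter(torch.zeros(d_arc, d_arc))        # biaffine weight
        self.head_bias = nn.Linear(d_arc, 1, bias=False)        # per-head bias term

    def forward(self, H):                                       # H: (batch, n, d_model)
        dep = self.dep_mlp(H)                                   # dependent representations
        head = self.head_mlp(H)                                 # candidate-head representations
        scores = dep @ self.U @ head.transpose(1, 2)            # (batch, n, n) biaffine scores
        scores = scores + self.head_bias(head).transpose(1, 2)  # add head bias
        return scores                                           # argmax over dim -1 gives parents

scores = BiaffineHeadScorer()(torch.randn(1, 5, 512))
print(scores.argmax(dim=-1))                                    # predicted parent index per subword
```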
In the step of converting the dependency syntax tree of the target language sequence into a cooperative relationship matrix, given the dependency syntax tree t of the reference translation of the target language, with t_i denoting the index of the parent node of the i-th node, this embodiment converts the dependency syntax tree of the target language sequence into the cooperative relationship matrix of the target language sequence using formula (3):

A_kj = 1, if t_k = j or t_j = k or k = j;  A_kj = 0, otherwise    (3)

In formula (3), A_kj denotes the cooperative relationship between the k-th subword and the j-th subword; k denotes an index of the target language sequence, k = {1, 2, ..., n}, and j denotes an index of the target language sequence, j = {1, 2, ..., n}.
Intuitively, the present invention considers that: (1) nodes in a parent-child relationship have a cooperative relationship; (2) each node has a cooperative relationship with itself. Fig. 3 shows the dependency syntax tree and the corresponding cooperative relationship matrix of the sentence "I have a cat.".
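A minimal sketch of this conversion is given below; the head indices used for "I have a cat ." are illustrative (the actual tree in fig. 3 may differ in detail).

```python
import numpy as np

# Sketch of formula (3): A[k][j] = 1 for parent-child pairs and on the diagonal,
# 0 elsewhere; `heads` is 1-based with 0 denoting the root node.
def cooperative_relation_matrix(heads):
    n = len(heads)
    A = np.eye(n, dtype=int)                  # each node cooperates with itself
    for k, head in enumerate(heads, start=1):
        if head > 0:
            A[k - 1, head - 1] = 1            # child-parent cooperation
            A[head - 1, k - 1] = 1            # parent-child cooperation
    return A

# Illustrative heads for "I have a cat ." with "have" as the root.
print(cooperative_relation_matrix([2, 0, 4, 2, 2]))
```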
Step four: the cooperative relationship matrix of the target language sequence is integrated into the decoder of the non-autoregressive neural machine translation model, and the decoder decodes the decoder input to obtain the target language sequence as the translation result.
In this embodiment, in order to integrate the cooperative relationship matrix of the target language sequence into the decoding process of the NAT model, a self-attention layer based on the cooperative relationship matrix is constructed following the relative-position self-attention component proposed in Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). The calculation of the self-attention layer based on the cooperative relationship matrix is shown in formula (4):

α_kj = softmax_j( (d_k W^Q)(d_j W^K + r_kj^K)^T / sqrt(d_model) ),  h_k = Σ_{j=1}^{n} α_kj (d_j W^V + r_kj^V)    (4)

In formula (4), k denotes an index of the target language sequence, k = {1, 2, ..., n}, j denotes an index of the target language sequence, j = {1, 2, ..., n}, r_kj^K and r_kj^V respectively denote representations of the cooperative relationship between the k-th word and the j-th word, d_k and d_j denote the decoder inputs at the k-th and j-th positions respectively, d_model denotes the model size, α_kj denotes the degree of association between the k-th subword and the j-th subword, h_k denotes the hidden-layer state of the k-th subword, and W^V, W^Q and W^K are trainable parameters. The remaining N-1 layers of the decoder use the same architecture as the Transformer.
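The following PyTorch sketch illustrates such a cooperative-relation-aware self-attention sub-layer; it is single-head for brevity, and the multi-head projection, dropout, residual connections and the source-target attention sub-layer are omitted, so it should be read as an illustration of formula (4) rather than the embodiment's exact layer.

```python
import math
import torch
import torch.nn as nn

# Self-attention in which the cooperative relation A_kj selects extra key/value
# embeddings (in the spirit of Shaw et al., 2018, relative position representations).
class CooperativeSelfAttention(nn.Module):
    def __init__(self, d_model=512, num_relations=2):
        super().__init__()
        self.d_model = d_model
        self.W_Q = nn.Linear(d_model, d_model, bias=False)
        self.W_K = nn.Linear(d_model, d_model, bias=False)
        self.W_V = nn.Linear(d_model, d_model, bias=False)
        self.rel_k = nn.Embedding(num_relations, d_model)    # r^K for A_kj in {0, 1}
        self.rel_v = nn.Embedding(num_relations, d_model)    # r^V for A_kj in {0, 1}

    def forward(self, D, A):                                 # D: (b, n, d); A: (b, n, n) LongTensor of 0/1
        Q, K, V = self.W_Q(D), self.W_K(D), self.W_V(D)
        rk, rv = self.rel_k(A), self.rel_v(A)                # (b, n, n, d)
        scores = Q @ K.transpose(1, 2)                       # (d_k W^Q)(d_j W^K)^T
        scores = scores + torch.einsum("bkd,bkjd->bkj", Q, rk)   # + (d_k W^Q)(r^K_kj)^T
        alpha = torch.softmax(scores / math.sqrt(self.d_model), dim=-1)
        h = alpha @ V + torch.einsum("bkj,bkjd->bkd", alpha, rv) # sum_j α_kj (d_j W^V + r^V_kj)
        return h

attn = CooperativeSelfAttention()
D = torch.randn(1, 5, 512)
A = torch.eye(5, dtype=torch.long).unsqueeze(0)              # toy cooperative relationship matrix
print(attn(D, A).shape)                                      # torch.Size([1, 5, 512])
```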
In training, the present invention uses the curriculum learning strategy proposed in GLAT (Lihua Qian, Hao Zhou, Yu Bao, Mingxuan Wang, Lin Qiu, Weinan Zhang, Yong Yu, and Lei Li. 2020. Glancing transformer for non-autoregressive neural machine translation. arXiv preprint arXiv:2008.07905.). Specifically, a two-stage decoding process is performed: in the first stage, the cooperative relationship predictor predicts a dependency syntax tree t' from the decoder input D and the source-end representation E, the quality of the predicted dependency syntax tree t' is computed, and the reference translation Y is mixed with the decoder input D according to that quality to obtain a new vector representation D'; in the second stage, the predictor predicts the dependency syntax tree from D' and the source-end representation E, the loss is computed, and the model is updated.
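The sketch below outlines one such two-stage (glancing-style) training step; model.predict_tree, model.embed_target and model.tree_loss are hypothetical helpers, and the mixing rule (revealing more reference positions when the first-pass tree is worse) is an assumption based on the GLAT curriculum idea rather than the embodiment's exact schedule.

```python
import torch

# Glancing-style two-stage step: first pass measures tree quality, second pass
# trains on a decoder input partially mixed with reference embeddings.
def glancing_step(model, D, E, Y_ref, tree_ref, ratio=0.5):
    with torch.no_grad():                                      # first stage: no gradient
        tree_pred = model.predict_tree(D, E)                   # t' from D and E
        n_wrong = (tree_pred != tree_ref).sum(dim=-1)          # quality of t'
    n_glance = (ratio * n_wrong).long()                        # positions to reveal per sentence
    D_mixed = D.clone()
    for b in range(D.size(0)):
        idx = torch.randperm(D.size(1))[: n_glance[b]]
        D_mixed[b, idx] = model.embed_target(Y_ref[b, idx])    # mix reference into decoder input
    loss = model.tree_loss(D_mixed, E, tree_ref)               # second stage: predict from D'
    loss.backward()
    return loss
```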
The invention integrates the cooperative relationship matrix into the decoding process of the NAT model by using the cooperative relationship layer, and predicts the cooperative relationship by using the cooperative relationship predictor, thereby supplementing the cooperative relationship lacking in the NAT model and improving the performance of the NAT model.
At the technical level, the dependency syntax tree is used to model the word-to-word cooperative relationships in the target sequence and is integrated into the NAT decoding process, bringing a significant performance improvement.
At the application level, the invention achieves state-of-the-art performance on 3 widely used machine translation data sets (WMT14 En-De, WMT16 En-Ro, IWSLT De-En), demonstrating that the decoding process of NAT lacks modeling of cooperative relationships and that modeling the cooperative relationships with a syntax tree can significantly improve machine translation performance.
The embodiments described above are only preferred embodiments of the present invention and are not intended to limit the spirit and scope of the present invention. Various modifications and improvements of the technical solutions of the present invention made by those skilled in the art without departing from the design concept of the present invention shall fall within the protection scope of the present invention; the technical content claimed by the present invention is fully described in the claims.