WO2021239631A1

WO2021239631A1 - Neural machine translation method, neural machine translation system, learning method, learning system, and programm

Info

Publication number: WO2021239631A1
Application number: PCT/EP2021/063697
Authority: WO
Inventors: Thomas Eißfeller
Original assignee: IP.appify GmbH
Priority date: 2020-05-26
Filing date: 2021-05-21
Publication date: 2021-12-02
Also published as: DE102020114046A1

Abstract

A neural machine translation method for translating a source text in a source language into a target text in a target language is provided which obtains the source text, a source auxiliary entry, such as a source dictionary entry, in the source language and a target auxiliary entry, such as a target dictionary entry, in the target language as explicit inputs and computes, using a multi-layer encoder-decoder network, target token information (215) regarding a target token of the target text, by combining information (213) computed based on at least one previously computed or initial target token (203) with information (211, 212, 214) computed based on the source text, the source auxiliary entry, and the target auxiliary entry. A corresponding learning method is provided which performs a learning step using a record including a source text and a target text from which a source phrase and a target phrase are extracted using a computed alignment, wherein the source and the target phrase are used as the source dictionary entry and the target dictionary entry for computing target token information (215) according to the neural machine translation method and a loss computed by comparing the target token information(215) with information corresponding to the actual target text is used to adjust at least some of the parameters of the multi-layer encoder-decoder network.

Description

TITLE

NEURAL MACHINE TRANSLATION METHOD, NEURAL MACHINE TRANSLATION SYSTEM, LEARNING METHOD, LEARNING SYSTEM, AND PROGRAMM

BACKGROUND OF THE INVENTION

Field of the Invention

[0001] The present invention relates to a neural machine translation method using explicit bilingual auxiliary information such as bilingual dictionary information, to a neural machine translation system using the same, to a corresponding learning method, to a learning system using the same, and to a program.

Description of the Related Art

[0002] Recently, neural machine translation (NMT) systems based on multi-layer (deep) neural networks (multi-layer encoder-decoder networks) have outperformed classical statistical machine translation (SMT) systems and have become the new standard for machine translation or machine-aided translation. In particular, the so- called attention mechanism described in NLP1 has enabled translations with almost human-level accuracy.

[0003] In attention-based NMT, source tokens obtained by tokenizing a source text (e.g. a source sentence) are encoded by an encoder (multi-layer encoder network) to obtain source encodings (sometimes called contextual embeddings), one for each source token. The encoder usually combines each source token at least with some source tokens in its proximity in order to capture the context of the source token. Then, a decoder (multi-layer decoder network) decodes target embeddings in a stepwise manner. In each such step, a previous (or initial) target token is input to the decoder. Based on the previous target token and the source encodings, a (next) target embedding is computed by combining the previous target token with the source encodings. In attention-based NMT, the previous target token (or more precisely, a vector computed based on the previous target token) and the source encodings (or more precisely, a vector computed based on the source encodings) are combined by an attention layer and the target embedding is computed based on the output of the attention layer. Then, a (next) target token is selected based on a score over target token candidates computed based on the target embedding. This step is repeated until a completion condition is met, for instance, until a certain number of target tokens or words is generated or a specific token such as a control token indicating an end of sentence, end of paragraph, end of table cell, or the like is encountered. Finally, the target tokens are detokenized to obtain the target text (e.g. a target sentence). Similar to the encoder, the decoder usually combines the previous target token at one step at least with some previous target tokens at earlier steps in order to capture a context of the previous target token. Here, "previous" is to be understood as "previously generated" (or input) since there is no particular limitation regarding the order in which the target tokens are decoded. They may be decoded from the beginning to the end of the text, from the end to the beginning of the text, or may even be successively inserted somewhere within the text (as, for instance, in so-called insertion models).

[0004] There are several different encoder-decoder network architectures for attention-based NMT. In PL1 and NLP1, recurrent layers such as LSTM (long-short term memory) and GRU (gated recurrent unit) are used in the encoder as well as the decoder to combine source tokens among each other and previous target tokens among each other. NLP2 employs convolutional layers which use filters to combine adjacent tokens. NLP3 uses attention layers not only for combining the previous target token with source encodings, but also for combining source tokens among each other and for combining previous target tokens among each other (so-called self-attention). [0005] In any case, the computation by the encoder-decoder network is based on parameters, such as weights, biases, and filters, associated with the respective layers which constitute the encoder-decoder network. By learning (optimizing) these parameters using training data including bilingual text pairs, the NMT model can learn to focus its attention on the most relevant source token or tokens for computing the (next) target embedding (for an example, see Fig. 3 of NPL1). Hence, when learned on a sufficiently large amount of training data, NMT is very successful at translating texts which are similar to those in the training data.

[0006] However, translations by a conventional attention-based NMT method can still be inconsistent. For instance, in a certain context, two or more translations of a source phrase may be about equally likely. In another context, a source phrase may have a specific unusual translation which does not occur or only rarely occurs in the training data. In either case, it is difficult to cause the NMT model to consistently use the desired translation. Such problems often occur in technical documents such as technical documentations or patent specifications where consistency is highly important. Consider, for example, that an English text which is to be translated to German contains a rare phrase such as "ophthalmic screening apparatus". A conventional NMT model which is learned without an example of this phrase in the training data may yield a German translation such as "ophthalmisches Siebgerat". However, in the context of the respective text, a different translation such as "Augenuntersuchungsvorrichtung" may be desired. Since the NMT model is learned without any example of the desired translation, it will however not produce the desired translation.

[0007] In case of equally likely translations, NMT models are often consistent over a certain range of text since the NMT models usually combine the previous target token with several earlier previous target tokens and can thus access information on how a source phrase has been translated before. However, the effective context size, that is the number of earlier previous target tokens which the previous target token is effectively combined with, is limited. In recurrent NMT models, the hidden state of the recurrent units tends to decay, even if gating is used, since the size of the hidden state is limited and it is favorable for the NMT model to forget past information in favor of more recent information. In convolutional NMT models, the context size is limited by the width of filters. And, even in self-attention based NMT models, the context size is limited due to constraints in memory and computational resources. In the case of rare translations,

NMT models will often not produce the desired translation at all. Returning to the example above, if the translation is manually corrected at the beginning of the text, a conventional NMT may use the desired translation in following sentences, but it will hardly be able to consistently use the desired translation throughout a whole document, in particular when there are longer gaps in the text where the phrase does not occur. In particular, if the gaps where the phrase does not occur are longer than the effective context size, the NMT model has no way to "remember" and to use the desired translation.

[0008] A known way to mitigate this problem is to assign rewards to the desired translation which shifts the output of the NMT model in favor of the desired target phrase. However, if rewards are too low, the desired translation may not be produced. If the rewards are too high, the target text is distorted. Namely, the desired translation may appear in the target text but at an ungrammatical position and often the remaining target text becomes utterly incorrect because the multi-layer encoder-decoder network will not know what source phrase the forced translation was meant to correspond to. In case of equally likely translations, a more or less acceptable trade-off may exist. But even then, it is difficult to find an appropriate value for the reward. For rare translations, such an acceptable trade-off often does not even exist. Here, the reward required for the NMT model to produce the rare translation is often so large that the target text is necessarily distorted. In the above example, a low reward may lead to a translation such as "Augensiebvorrichtung" which is closer to the desired translation but still does not match the desired translation. On the other hand, a high reward may produce the desired translation but often leads to distortion of the remaining text. A typical distortion is stuttering, where the NMT model keeps repeating a single token or a few tokens over an extended range of text such as "Augenuntersuchungsvorrichtungvorrichtungvorrichtungvorrichtungf...]". Such stuttering occurs since the NMT model does not understand that the German phrase "Augenuntersuchungsvorrichtung" which was enforced by the high reward is actually a translation of the English phrase "ophthalmic screening apparatus". Therefore, the NMT model still tries to produce the translation of the English phrase (or parts thereof) yielding an incorrect (sub-word) token, such as "vorrichtung", as the next token. However, from the viewpoint of the NMT model, the resulting phrase

"Augenuntersuchungsvorrichtungvorrichtung" is still not a translation of the English phrase and the NMT model keeps producing partial translations, for instance, by indefinitely repeating "vorrichtung".

[0009] A different approach is so-called fine-tuning (re-training). Here, the already learned NMT model is learned again on a smaller data set comprising the desired translation so as to shift the outputs of the NMT model towards the desired translation. This, however, requires that such a data set is actually available. Usually, some tens or hundreds of example sentence pairs featuring the desired translation are necessary to fine-tune the NMT model to the desired translation. Clearly, generating such numbers of examples manually for each problematic (i.e. equally likely or rare) translation would place an unreasonable burden on a user (a translator) - in particular, when considering that individual fine-tuning may be necessary for each particular translation job.

Moreover, fine-tuning of current NMT models having several 100 million parameters requires substantial computational resources which may not be available on devices such as servers, PCs, or smartphones. Even when performed on specialized hardware, the power-consumption caused by fine-tuning is not desirable in view of the environment.

[0010] In a different approach described in PL2, an alignment of the next target token with a source token is determined. If the source token indicated by the alignment corresponds to a source entry in a bilingual dictionary, the next target token is replaced by the corresponding target dictionary entry. However, computed alignments usually have a certain error rate such that the method of PL2 may not be able to identify the source token correctly, in particular in case of rare source tokens. But even if the correct source token can be identified, simply replacing the target token by a target dictionary token, leads to similar problems like the reward-based approach. So, the produced target text may be ungrammatical because tokens preceding the replaced target token, such as articles, adjective forms, verb forms, or the like, may be incorrect for the target dictionary token, for instance because it may have a different numerus or genus than the replaced target token. In addition, the target text following the replaced token may be distorted for the same reasons as in the reward-based approach.

Cited References

[0011] Patent Literature:

PL1 US 2020/0S44S5 A1

PL2 CN 110489762A

[0012] Non-Patent Literature:

NLP1 Bahdanau, D. et al.: "Neural Machine Translation by Jointly Learning to Align and Translate", ICLR 2015, URL: https://arxiv.org/abs/1409.047Sv7 NLP2 Gehring, J. et al.: "Convolutional Sequence to Sequence Learning", ICML 2017, URL: https://arxiv.org/abs/1705.03122v3

NLP3 Vaswani, A. et al.: "Attention Is All You Need", NIPS 2017,

URL: https://arxiv.org/abs/1706.03762v5

SUMMARY OF THE INVENTION

[0013] In view of the above problems, the present invention provides a modification of conventional NMT which enables the use of explicit dictionary information (glossary information or, more generally, auxiliary information) comprising one or more corresponding source and target entries in the source and target language, respectively.

In the present invention, the source entries and corresponding target entries of the bilingual dictionary (bilingual auxiliary information) are directly input to the NMT model in addition to the source text (or more precisely respective tokens of the source entries, target entries, and source text). Therefore, no fine-tuning or re-training is required when new dictionary entries (auxiliary entries) are added, or existing dictionary entries are modified. The added or modified dictionary entries can be simply input to the same learned NMT model. In contrast to using rewards where the only outputs of the learned NMT model are shifted in favor of the desired translation, the NMT model of the present invention takes the dictionary entries as an explicit input so that the NMT can adapt to the use of dictionary entries during learning such that distortion of the target text can be avoided while the use of dictionary entries, even of rare translations, can be promoted.

[0014] In a first aspect of the present invention, a neural machine translation method according to claim 1 is provided.

[0015] In a second aspect of the present invention, a neural machine translation system according to claim 10 is provided.

[0016] In a third aspect of the present invention, a learning method according to claim

12 is provided.

[0017] In a fourth aspect of the present invention, a learning system according to claim

13 is provided.

[0018] In a fifth aspect of the present invention, a method according to claim 14 is provided.

[0019] In a sixth aspect of the present invention, a program according to claim 15 is provided.

[0020] The other claims relate to further developments.

[0021] In sum, a neural machine translation method for translating a source text in a source language into a target text in a target language is provided which obtains the source text, a source dictionary entry (a source auxiliary entry) in the source language and a target dictionary entry (a target auxiliary entry) in the target language as explicit inputs and computes, using a multi-layer encoder-decoder network, target token information (target embedding) regarding a target token of the target text, by combining information computed based on at least one previously computed or initial target token (at least one previous target embedding) with information computed based on the source text (one or more source embeddings), the source dictionary entry (one or more source dictionary embeddings), and the target dictionary entry (one or more target dictionary embeddings).

[0022] Moreover, a corresponding learning method is provided which performs a learning step using a record including a source text and a target text from which a source phrase and a target phrase are extracted using a computed alignment, wherein the source and the target phrase are used as the source dictionary entry and the target dictionary entry for computing target token information (215) according to the neural machine translation method and a loss computed by comparing the target token information(215) with information corresponding to the actual target text is used to adjust at least some of the parameters of the multi-layer encoder-decoder network.

BRIEF DESCRIPTION OF THE DRAWINGS

[002S] Fig. 1 is a flow chart illustrating a neural machine translation method according to the present invention.

[0024] Fig. 2 illustrates a flow of information in the multi-layer encoder-decoder network used by the neural machine translation method according to the present invention.

[0025] Fig. S illustrates a flow of information in the multi-layer encoder-decoder network used by the neural machine translation method according to a first Embodiment of the present invention.

[0026] Fig. 4 illustrates a flow of information in the multi-layer encoder-decoder network used by the neural machine translation method according to a second Embodiment of the present invention. [0027] Fig. 5 illustrates a flow of information in the multi-layer encoder-decoder network used by the neural machine translation method according to a third Embodiment of the present invention.

[0028] Fig. 6 illustrates a flow of information in an attention operation.

[0029] Fig. 7 illustrates a flow of information in an attention layer with a residual connection

[0030] Fig. 8 is a flow chart illustrating a learning step of the neural machine translation method according to the present invention.

[0031] Fig. 9 illustrates a system according to the present invention.

DESCRIPTION OF THE EMBODIMENTS

[0032] The present invention relates to an NMT (neural machine translation) method which translates a source text (e.g. a source sentence) in a source language into a target text (e.g. a target sentence) in a target language. The languages are usually different natural languages. However, the source and target languages may also be dialects of the same natural language. In this context, text is to be understood broadly as any unit of text and, in particular, includes complete sentences, incomplete sentences such as titles or headlines, single phrases, words, numbers, or the like such as in cells of a table, and also entire paragraphs or documents. Details of the NMT method will be described below.

Neural Machine Translation Method

[0033] In the following, the NMT method illustrated in Fig. 1 is described as an example of the neural machine translation method according to the present invention. Dictionary entries each containing an isolated word or phrase in the source language and its translation in the target language will be described as a specific example of auxiliary entries. However, the present invention is not limited to this and other bilingual textual entries which are relevant for translating a source text may be used as auxiliary entries. Namely, any textual entries which represent translations between the source and target language containing phrases or words in the source language which occur in the source text to be translated may be used.

[0034] In a preliminary step (not shown), a source text in the source language, at least one source dictionary entry (source auxiliary entry) in the source language and at least one corresponding target dictionary entry (target auxiliary entry) in the target language are obtained as inputs. Optionally, the inputs may also include a partial target text that is to be completed by the NMT method. The inputs may be designated by a user, for instance, by directly inputting them via a user interface or by selecting a file or parts of file to be translated and a file containing the dictionary entries (auxiliary entries). The inputs may also be obtained automatically, for instance, based on an API call by another system. Moreover, a source text may also be designated automatically based on a received file or email or a database entry. Dictionary entries may, for instance, be automatically extracted from previous translations by an alignment tool.

[0035] In step S100, the inputs are preprocessed. Such preprocessing comprises at least tokenizing the inputs so as to obtain respective source tokens, source dictionary tokens (source auxiliary tokens), and target dictionary tokens (target auxiliary tokens). If a partial target text is obtained, the partial target text may also be tokenized to obtain previous target tokens. Here, tokens include punctuation, numbers, symbols, words, sub word units, and the like. Words may be segmented into sub-word units, for instance, by BPE (byte-pair encoding), unigram sub-word models, or the like. In addition, tokens may include special control tokens such as control tokens indicating the beginning or end of a text. Moreover, special control tokens may be added to the inputs. For instance, a BOS (beginning of sequence) token may be prepended to the source and target tokens, and an EOS (end of sequence) token may be appended to the source tokens (and, in case of training, to the target tokens). If there are multiple dictionary entries, the respective source and target dictionary entries may be concatenated in a respective sequence, optionally, separated by a control token, in the following denoted as DCT (dictionary). Using such control tokens is however not necessary and additional inputs may be used to indicate the beginning and/or end of the respective inputs. In addition, normalization may be performed before or after tokenizing the inputs. For instance, unusual Unicode characters may be replaced by similar, more usual characters or substituted by reserved tokens which can be back substituted when detokenizing generated target tokens. The inputs may also be cast to lowercase.

[0036] In step S101, an initial target token is selected as a previous target token. For instance, the control token BOS may be selected. If a partial target text is obtained, the respective tokens may be used as previous target tokens.

[0037] In step S102, a (next) target embedding is decoded using the (at least one) previous target token by a multi-layer (deep) encoder-decoder network. In a first sub step S102A, the embeddings are computed for the tokens corresponding to each input, using respective embedding weights. And in a second sub-step S102B, the previous target token is combined with the other inputs, namely with the source embeddings, the source dictionary embeddings (source auxiliary embeddings), and the target dictionary embeddings (target auxiliary embeddings).

[0038] Fig. 2 illustrates an information flow in the encoder-decoder network according to the present invention. The source tokens 201 and the source dictionary tokens 202 are embedded to obtain source embeddings 211 and source dictionary embeddings 212 by an embedding layer 220. Such an embedding operation can be implemented as selecting a column (or row) of an embedding matrix (embedding weights) corresponding to an index of a token in a vocabulary. Alternatively, the embedding operation can be implemented as multiplying a (possibly sparse) one-hot column (or row) vector which is all zeros expect for a one at the index of the token in the vocabulary with the embedding matrix. Since both the source text and the source dictionary entry share the same language, it is preferable to also share the same vocabulary and embedding weights. Hence, the embedding weights for embedding the source tokens 201 and for embedding the source dictionary tokens 202 are preferably the same. This saves memory and promotes learning. While precomputed embeddings, such as GloVe, may be used, it is preferable that the respective embedding matrix or matrices are learnable parameters to enable better adaption to the particular NMT task.

[00B9] Similarly, the (at least one) previous target token 203 and the target dictionary tokens are embedded to obtain (at least one) previous target embedding 213 and target dictionary embeddings 214 by embedding layer 221. Here, it is also preferable that the embedding weights for embedding the at least one previous target token (213) and for embedding the target dictionary tokens (214) are the same.

[0040] Preferably, the embedding layers 220 and 221 embed also positional information indicating the position of a token in the respective sequence such as an index of the token in the respective text or dictionary sequence (auxiliary information sequence). The respective positional embedding weights may be learnable parameters or may be precomputed and fixed (frozen) during learning such as a sinusoidal positional embedding (for details, see Section 3.5 of NLP3). Preferably, the embedding layers 220 and 221 share the same positional embedding weights. Additional information, for instance, regarding casing or whitespace patterns may also be embedded. Information indicating a beginning or end of a text or dictionary entry may also be embedded, in particular, if specific control tokens are not used for this purpose. If different embeddings are used, the different embeddings are combined. The particular method of combining the embeddings is not important and any suitable method may be used. For instance, the embeddings can be combined by element-wise summation or multiplication, or can be concatenated and input to a linear or affine layer.

[0041] The embeddings are forwarded through the layers of the encoder-decoder network. The encoder-decoder network according to the present invention comprises one or more combining layers 230 which combine the previous target embedding 213 (or, more precisely, a vector computed based thereon) with the source embeddings 211, the source dictionary embeddings 212, and the target dictionary embeddings 214 (or, more precisely, vectors computed based on the respective embeddings). The (next) target embedding 215 is then computed based on the output of the one or more combining layers.

[0042] In addition, the encoder-decoder network can include various further layers to process the input or output of the one or more combining layers as needed (indicated by " in the figures). In order to capture the context of respective tokens, the encoder- decoder network may comprise any suitable layer for combining vectors within the same sequence, such as vectors computed based on the source embeddings 211, the previous target embeddings 213, the source dictionary embeddings 212, and target dictionary embeddings 214, or between sequences. For instance, a recurrent layer as in NLP1 and PL1, a convolutional layer as in NLP2, or an attention-based layer as in NLP3 may be used to combine vectors within the same sequence. An attention-based layer may be used to combine vectors among different sequences. Multiple such layers of the same or different kind can be stacked directly or with further layers in between. For instance, such further layers can be dense layers (linear or affine layers), activation layers such as ReLU, GELU, GLU, or the like, normalization layers, drop-out layers, bottleneck layers, pooling layers, combinations thereof, and the like. Moreover, so-called residual connections may be added which combine the input vector of a layer with the respective output vector of a layer. For combining vectors within the same sequences, preferably at least one attention-based layer is used, since attention has a large context size (for details, see Section 4. of NLP3). For combining vectors among different sequences, preferably an attention-based layer is used. Residual connections are preferably included at least around attention layers, if present. Otherwise, propagation of information through the encoder-decoder network may be impaired. Preferably, a normalization layer is arranged directly before or after an attention-based layer, if present, in order to avoid exploding or vanishing gradients during learning.

[0043] The skilled person will appreciate that any suitable architecture such as those described in PL1, NPL1, NPL2, and NPL3 may be used as a basis for the encoder-decoder network according to the present invention, simply by including the one or more combining layers at appropriate positions in these networks. [0044] Returning to Fig. 1, in step S103 a score over target token candidates is computed based on the (next) target embedding output by the encoder-decoder network. Usually, the target embedding is mapped by a linear or affine transformation onto a vector corresponding to all or parts of the target vocabulary (that is each vector entry corresponds to a vocabulary entry) and a score is computed over this vector by the softmax function or a variation thereof to obtain a probability distribution over the target vocabulary (or the respective parts thereof). Sometimes, the result of the transformation is also used directly as a score. Further factors may also be included in the score such as the probability of a target token candidate according to a language model of the target language. While dictionary rewards are usually not necessary when using the NMT method according to the present invention, they may in principle still be included in the score.

[0045] In step S104, a target token is selected among the target token candidates based on the respective score. The target token candidate with the highest score may be selected as the single target token. Alternative, a search algorithm, such as beam search, may be applied in which multiple target token candidates are selected each serving as a target token in one among multiple hypotheses.

[0046] In step S105, it is checked whether or not a completion condition is met, for instance, whether a certain number of target tokens or words is generated or a specific token such as control token indicating an end of sentence, end of paragraph, end of table cell, or the like is encountered. If the completion condition is met, the processing proceeds to step S107, otherwise to S106. In case of using a search algorithm, the check may be performed for each hypothesis separately.

[0047] In step S106, the target token obtained in step S104 is selected as a previous target token and the processing returns to step S102. In case of using a search algorithm, the steps S102 to S104 may be performed for one, some, or all hypotheses, for instance depending of an overall score of the respective hypothesis. [0048] In step S107, the generated target tokens are postprocessed. Postprocessing at least comprises detokenizing the target tokens to obtain the target text. In addition, casing or whitespaces may added and substitutions performed in step S101 may be reverted. In case of using a search algorithm, a target text may be generated for one, some, or all hypotheses, for instance depending of an overall score of the respective hypothesis.

[0049] In a final step (not shown), the target text may be output via a user interface. Outputting may include displaying the target text on a screen and saving or printing a file containing the target text. The target text may also be automatically output, for instance, to a file or a database entry, or may be automatically sent to another system, for instance, by email or an API call.

[0050] Since the source and target dictionary tokens are used as explicit inputs to the encoder-decoder network and the previous target token is combined with them, the encoder-decoder network can compute a (next) target embedding which takes into account the source and target dictionary tokens to obtain a consistent translation in which source dictionary entry is translated according to the respective target dictionary entry.

[0051] While the advantageous effect of the present invention is obtained by using an explicit dictionary entry as an input, the same NMT method can also be used to translate a source text without source and target dictionary entries. In this case, the source dictionary tokens and target dictionary tokens input to the encoder-decoder network are simply left empty (sequences of zero length).

First Embodiment

[0052] In the first Embodiment of the present invention, the NMT method according to the present invention is carried out using the encoder-decoder network illustrated in Fig. 3. [0053] In the encoder-decoder network of the first Embodiment, the source dictionary tokens 202 and the source tokens 201 are processed by an encoder with embedding layer 220 and layers which combine tokens from the same sequence among each other, such as recurrent layers, convolutional layers, self-attention layers, or the like. Preferably, four or more and, more preferably, six or more such layers are included. Additional layers of other types may be added as needed. At one or more points in the encoder, a first attention-based layer 331 combines vectors computed based on the source tokens 201 with vectors computed based on the source dictionary tokens 202, that is the first attention-based layer 331 takes vectors computed based on the source tokens 201 as input vectors (query vectors) and vectors computed based on the source dictionary tokens 202 as memory vectors (key / value vectors) and forwards its output to the next layer of the encoder. The final output of the encoder are the source encodings 340 each corresponding to a source token 201 and capturing the context of the source token 201 as well as dictionary information from the source dictionary tokens 202.

[0054] Similarly, the target dictionary tokens 204 and the one or more previous target tokens 203 are processed by a decoder with embedding layer 221 and layers which combine tokens from the same sequence among each other, such as recurrent layers, convolutional layers, self-attention layers, or the like. Preferably, four or more and, more preferably, six or more such layers are included. At one or more points in the decoder, a second attention-based layer 332 combines vectors computed based on the previous target tokens 203 with vectors computed based on the target dictionary tokens 204, that is the second attention-based layer 332 takes vectors computed based on the previous target tokens 203 as input vectors (query vectors) and vectors computed based on the target dictionary tokens 202 as memory vectors (key / value vectors) and forwards its output to the next layer of the decoder. Moreover, at one or more points in the decoder, a third attention-based layer 333 combines vectors computed based on the previous target tokens 203 with vectors computed based on the source encodings 340, that is the third attention-based layer 333 takes vectors computed based on the previous target tokens 203 as input vectors (query vectors) and vectors computed based on the source encodings 340 as memory vectors (key / value vectors) and forwards its output to the next layer of the decoder. The final output of the decoder is a decoded target embedding 215.

Second Embodiment

[0055] In the second Embodiment of the present invention, the NMT method according to the present invention is carried out using the encoder-decoder network illustrated in Fig. 4. The encoder-decoder network of the second Embodiment is similar to that of the first Embodiment and mainly the differences are described in the following.

[0056] In the second Embodiment, the source dictionary tokens 202 and source tokens 201 are concatenated to a common source sequence. Specific control tokens are preferably added to separate the tokens of different source dictionary entries and to separate the source dictionary tokens from the source tokens. The concatenated source sequence is input to an encoder of the encoder-decoder network which comprises an embedding layer 220 and layers for combining the tokens in the concatenated source sequence among each other, such as recurrent layers, convolutional layers, self-attention layers, or the like. Preferably, four or more and, more preferably, six or more such layers are included. Preferably, these layers include at least one self-attention layer (first attention-based layer 331) which takes vector computed based on tokens in the concatenated source sequence as input vectors (query vectors) and as memory vectors (key / value vectors). Using a self-attention layer enables the encoder to access all source dictionary entries relevant to each source token. The final outputs of the encoder are source encodings 340 for respective tokens in the concatenated source sequence.

[0057] Similarly, the target dictionary tokens 204 and previous target tokens 203 are concatenated to a common target sequence. Specific control tokens are preferably added to separate the tokens of different target dictionary entries and to separate the target dictionary tokens from the previous target tokens. The concatenated target sequence is input to a decoder of the encoder-decoder network which comprises an embedding layer 221 and layers for combining the tokens in the concatenated target sequence among each other, such as recurrent layers, convolutional layers, self-attention layers, or the like. Preferably, four or more and, more preferably, six or more such layers are included. Preferably, these layers include at least one self-attention layer (second attention-based layer 332) which takes vectors computed based on tokens in the concatenated target sequence as input vectors (query vectors) and as memory vectors (key / value vectors). Using a self-attention layer enables the decoder to access all target dictionary entries relevant for computing the (next) target embedding.

[0058] The encoder includes at least one third attention-based layer 333 for combining vectors computed based on the concatenated target sequence with vectors computed based on the concatenated source sequence, similar as in the first Embodiment. The final output of the decoder is a decoded target embedding 215.

Third Embodiment

[0059] In the third Embodiment of the present invention, the NMT method according to the present invention is carried out using the encoder-decoder network illustrated in Fig. 5. The encoder-decoder network of the third Embodiment is similar to that of the first and second Embodiment and mainly the differences are described in the following.

[0060] In the third Embodiment, the source dictionary tokens 202, the source tokens 201, the target dictionary tokens 204, and the previous target tokens 203 are concatenated to a common sequence. Specific control tokens are preferably added to separate the tokens of different source and target dictionary entries and to separate tokens from the different inputs (the source dictionary tokens, the source tokens, the target dictionary tokens, and the previous target tokens). The common sequence is input to a unified encoder-decoder network. The encoder-decoder network may include an embedding layer 220 for the source part and an embedding layer 221 for the target part of the common sequence. This can for instance, be implemented by masking or by temporarily slicing the common sequence in a source and a target part. Alternatively, the tokens may first be embedded using the respective embedding layer and may than be concatenated to a common sequence, optionally, after having passed through further layers such as normalization layers. Further alternatively, a common embedding layer may be used for the entire common sequence.

[0061] After the embedding layer(s), the encoder-decoder network includes layers for combining the tokens in the concatenated sequence among each other, such as recurrent layers, convolutional layers, self-attention layers, or the like. Preferably, six or more and, more preferably eight, or more such layers are included. Preferably, these layers include at least one self-attention layer (attention-based layer 430) which takes vectors computed from tokens in the common sequence as input vectors (query vectors) and as memory vectors (key / value vectors). Using a self-attention layer enables the encoder- decoder network to access information from all source dictionary tokens, source tokens, target dictionary tokens, and previous target tokens relevant for computing the (next) target embedding in the same layer.

[0062] This Embodiment can be especially useful when translating between similar languages which at least partially share the same vocabulary and between dialects of the same language since embedding weights can be shared and the overall number and size of layers can be reduced compared to the first and second Embodiment.

Fourth Embodiment

[0063] In the fourth Embodiment of the present invention, translation memory entries are used in the NMT method as auxiliary entries to further improve consistency of translations over an entire document. Namely, previously translated units of text from the same document (same text) or from other documents (other texts) are appropriately selected and used in addition to or in place of the dictionary entries in the previously described Embodiments. The fourth Embodiment uses the encoder-decoder network of the present invention, for instance that of the first, second, and third Embodiment, and a detailed description thereof is omitted.

[0064] Consistency means that a source word or phrase in the source language is always translated by essentially the same target word or phrase in the target language within a document or a plurality of documents. As set out before, conventional NMT models often maintain consistency over a certain range of text which is however limited by the effective context size of the model. In particular, if the gaps where the source phrase does not occur are longer than the effective context size, the NMT model has no way to "remember" and to use the same target phrase. There are approaches to compress context information in order to quasi increase the effective context size of a model, however at present with limited success since it is difficult for a model to learn which context information might actually be useful for later translations.

[0065] In the previously described Embodiments, explicit dictionary information is used to promote a consistent translation. However, it can be time consuming for a user to input or select appropriate dictionary entries. Therefore, it is desirable to reduce the number of explicitly input or selected dictionary entries while maintaining a high consistency.

[0066] Moreover, conventional computer aided translation (CAT) tools commonly have a function of exporting translated units of text (in the following denoted as segments), such as paragraphs, sentences, titles, contents of table cells or the like, from a translated document as a so-called translation memory. A translation memory contains at least one segment of the document in the source language as a source part of a translation memory entry and a corresponding segment of the document in the target language as a target part of the translation memory entry. When the translation memory is imported into the CAT tool and a sentence in a new document is translated, the CAT tool displays translation memory entries based on a similarity of the sentence to be translated with a source parts of a respective translation memory entry in order to assist the user (the translator). It is desirable that such translation memory entries are also utilized by the NMT model to improve consistency over multiple similar documents, such as revisions of a technical document.

[0067] In addition to importing a translation memory, an already translated segment of a document to be translated can also be taken as a translation memory entry when translating further segments of the document. Namely, documents are often too large to be translated by a single pass (loop) of the NMT method due to memory constraints. For example, the memory required by attention-based layers increases quadratically with context size and current NMT models therefore often restrict the context size to about 1000 tokens. Hence, a document is usually split into segments which are then translated one at a time. Sometimes, several consecutive segments are translated at a time depending on the context size of the NMT model. It is desirable that such translation memory entries corresponding to the same document are also utilized by the NMT model to improve consistency within the document.

[0068] In the present Embodiment, one or more translation memory entries are selected among a plurality of translation memory entries based on a similarity score between the source text and a source part of the respective translation memory entry. The similarity score is not particularly limited as long as it is higher for source parts which have words or phrases in common with the source text than for source parts which have no words or phrases in common with the source text, optionally determined after preprocessing of the source text and/or source part. Such preprocessing can include common techniques like omitting some portions of the source text and/or source part such as punctuation, numbers, meaningless fill words (trivial words, stop words) or the like, splitting compound words into word parts, stemming or lemmatizing words, or casting characters to lowercase. In other words, the similarity score is higher in a case where a text based on the source text and a text based on the source part have pieces of text (words, word parts, or phrases) in common than in a case where the text based on the source text and the text based on the source part have no pieces of text in common.

[0069] The similarity score can be based on a computed similarity between the source text and some or all source parts of the translation memory entries. The similarity can be computed by conventional methods. For instance, a bag-of-words or bag-of-n-grams similarity between the source text and the source parts of the translation memory entries may be used. That is, the cosine similarity, Jaccard similarity, or the like between vectors representing the count or existence of a word or n-gram in the source text and in the source part of a translation memory entry is computed as a similarity. Also, the inverse, negative value, or the like of a distance measure such as the edit distance may be used as the similarity. Before computing the similarity, preprocessing as mentioned above may be applied.

[0070] The computed similarity may then be used directly as the similarity score and the translation memory entry with the highest score may be selected. However, additional factors may also be considered in some or all similarity scores. The similarity score may, for instance, be increased when the translation memory entry has been translated or revised by a human, for instance via a respective user interface. The similarity score may also be increased when the translation memory entry stems from the same document as source text to be translated. The similarity score may further be modified based on a comparison between meta data of the document to be translated and the translation memory entry, for instance indicating a technical field, an author or translator, a writing or publishing date or the like, such that more similar meta data leads to a higher similarity score. On the other hand, the similarity score may be decreased for longer source parts so that shorter entries with the same similarity are preferably used.

In some cases, the costly computation of a similarity may be avoided by assigning a predetermined similarity score. For instance, a high predetermined similarity score may be assigned to a translation memory entry of which the source part is identical to the source text or contains the same (non-trivial) words, optionally, determined after preprocessing. A low predetermined similarity score may be assigned to a translation memory entry which has no (non-trivial) words in common with the source text, optionally, determined after preprocessing. Instead of assigning a low predetermined similarity score, the respective translation memory entries may also be filtered out and excluded from computing the similarity. Such determinations can be implemented efficiently using a full text index over the source parts of some or all translation memory entries.

[0071] Moreover, several translation memory entries with the highest scores may be selected, for instance, up to a certain combined text length of the source and/or target parts of the selected translation memory entries, measured for instance based on a count of characters, tokens, words, or sub-word units.

[0072] Alternatively, multiple translation entries may be selected based on a coverage of the (non-trivial) words in the source text by the source parts of the selected translation memory entries, optionally after preprocessing. That is, translation memory entries may be selected such that the words in their source parts together contain as many words in the source text as possible, for instance, subject to a certain maximal combined text length of the source and / or target parts. For instance, a first translation memory entry may be selected based on the computed similarity scores and then similarity scores of all or at least some of the remaining translation memory entries may be recomputed based on the words in the source text which are not already covered by the source part of the first translation memory entry. Based on the recomputed similarity scores, a further translation memory entry may be selected. This procedure may be repeated based on the words in the source text which are not covered by any source part of the selected translation memory entries until a certain combined text length is reached or no further suitable translation memory entry is found.

[0073] If multiple translation memory entries are selected, the respective source and target parts may be concatenated in a respective sequence, optionally, separated by a designated control token. Source and target dictionary entries, if any, can also be concatenated (e.g., prepended or appended) to the respective sequence, optionally, separated by a designated control token. If there are no suitable translation memory entries, for instance, when all computed similarity scores are below a threshold, the respective sequences are empty or contain only dictionary entries, if any.

[0074] The NMT method of the present invention, as illustrated in Fig. 1, is then employed with the provision that the translation memory entries are used in place of or in addition to the dictionary entries.

[0075] As noted above, dictionary entries can be used in addition to translation memory entries as auxiliary entries. In this case, the learning method of the present invention is preferably used to obtain an NMT model which can handle dictionary entries appropriately. On the other hand, when only translation memory entries but no dictionary entries are used, it is not necessary to perform the learning method according to the present invention and a conventional learning method may be used (that is, only the conventional learning step as described below may be employed). In either case, it is preferred that at least some records in the training data contain multiple segments (for instance, consecutive sentences of a document) such that the model can learn to translate a text comprising multiple segments appropriately.

Attention-Based Layers

[0076] Examples of attention-based layers according to the present invention are illustrated in Figs. 6 and 7. The attention-based layer according to the present invention may be any variation of attention such as simple attention (NLP1), multi-headed attention (NLP3), structured attention, multi-stream attention, or the like. The attention based- layers have in common that they combine one or more input vectors with memory vectors to obtain a number of output vectors which corresponds to the number of input vectors.

[0077] In attention based-layers, the memory vectors are each mapped by linear or affine transformations LI and L2 to respective key and value vectors. The input vectors are each mapped by a linear or affine transformation L3 to respective query vectors.

[0078] The query, key, and value vectors are input to an attention operation. For each query vector, the attention operation computes a score over all or some of the key vectors (resulting in so-called energies) by a evaluating a function F for the query vector and each key vector to be used. The function F may be a scaled dot-product or any other appropriate function which takes two vectors as an input and generates a score (a scalar value). The function may also generate multiple scores (a vector value), for instance, by computing the dot-product on corresponding slices of the query and key vector. Then, a function G computes weights from the energies for the key vectors, for instance, by a softmax function or variation thereof. If the function F generates multiple scores, the function G is applied to the energies of each score separately. The function G may apply masking as necessary, for instance, to hide future information during training (such as the target token at a position that corresponds to the target embedding to be predicted). Then, the attention operation computes a weighted sum å over the value vector with the corresponding weights to obtain a context vector for the respective query vector. If the function F generates multiple scores, the respective weights may be used for weighting different slices of each of the value vectors.

[0079] The so-obtained context vectors are mapped by a linear or affine transformation L4 to obtain output vectors. Preferably, a residual connection is included which adds the input to the output vectors. Preferably, at least one of the input and the output of the attention-based layer is normalized, for instance, by layer-normalization.

[0080] An attention-based layer which combines input vectors with memory vectors from different sequences is usually called an attention layer and a layer which combines input vectors with memory vectors from the same sequence is usually called a self attention layer.

Learning Method

[0081] First, in a preliminary step (not shown) of the learning method, training data is obtained. The training data may, for instance, be read from a file, read from a database, or downloaded from a network. Multiple sources of training data may be combined. The training data may be filtered, for instance based on an estimated reliability or the text length, to obtain the training data to be actually used in the learning method. The training data may be designated by a user, for instance, by selecting data sources. The training data may also be obtained automatically, for instance, by a script or based on a timer or an API call by another system. Each record of training data contains a source text and a target text. As before, a text is to be understood broadly in the present context.

[0082] Then, for each record, a learning step is performed. The learning step is either a conventional learning step or a specific learning step according to the present invention. ln a conventional learning step, the source and target text in the record are preprocessed similar as in step S100 to obtain the source tokens and target tokens. The encoder- decoder network is used to predict target embeddings wherein, for computing a predicted target embedding corresponding to a certain position among the target tokens, at least one target token at a different position is used as the at least one previous target token. In case the target tokens are decoded from beginning to end of the text, a target token before the predicted target embedding may be used as the previous target token.

In case the target tokens are decoded from end to beginning of the text, a target token after the predicted target embedding may be used as the previous target token. When the encoder-decoder network comprises a layer which combines a previous target token with earlier previous target tokens (that is, respective vectors computed based thereon) masking can be applied as necessary to prevent the encoder-decoder network from accessing future information such as the target token (that is, a vector computed based on the target token) at the corresponding position of the target embedding to be predicted. Then, a loss is computed from the target tokens and the predicted target embeddings (215) at corresponding positions. Commonly, negative log likelihood (NLL) is used as a loss function, however, any suitable loss function may be used as well. Finally, the parameters of the encoder-decoder network are optimized (adjusted) to reduce the loss. This can be performed by any suitable optimization method. For deep neural networks, commonly backpropagation is used to compute the gradients of the loss with respect to the parameters and the parameters are updated based on the gradients by gradient descent or a variation thereof such as Adam, RMSProp, and Adagrad. Although currently less efficient, gradient-free methods, for instance, known from the field of reinforcement learning could be used. Multiple records may be combined in batches and a combined loss may be calculated for each batch in order to better exploit parallel hardware and also to stabilize the training.

[0083] In the following, the learning step illustrated in Fig. 8 is described as an example of the learning step according to the present invention. In the learning method according to the present invention, at least a first subset of the records in the training is used to perform the learning step according to the present invention. For the remaining training data, the conventional training step may be performed, wherein the source and target dictionary entries input to the encode-decoder network are simply left empty (sequences of zero length). Preferably, the number of records in first subset of the records is 5 % or more and 95 % or less, more preferably 10 % or more and 80 % or less, relative to the number of records in the training data used in the learning method. If the number of records in first subset is too small, the encoder-decoder network may not learn to use the dictionary entries at all. If the number of records in first subset is too large, the encoder- decoder network may become dependent on dictionary entries and may produce incorrect target embeddings if no dictionary entries are input, that is, if the source and target dictionary tokens are empty sequences.

[0084] In step S800, the source and target text of the record is aligned. There is no particular limitation on the quality of the alignment since neural network training tends to be very robust with respect to random noise in the training data. In fact, such random noise in the training data may serve a regularization mechanism and suppress overfitting (overlearning). An alignment obtainable by conventional alignment tools, such as GIZA++ or fast_align, is sufficient. The alignment may be based on words, sub-word units, and other parts of the text as necessary.

[0085] In step S801, using the alignment, at least one source phrase and the corresponding target phrase is extracted from the respective text as a dictionary entry. A phrase may consist of one or more words or sub-word units and may also include other parts of the text such as punctuations, numbers, symbols or the like.

[0086] In step S802, the source text, source dictionary entry, target text, and target dictionary entry are preprocessed as inputs similar as in step S100. They are in particular tokenized to obtain respective tokens.

[0087] In step S803, the target embeddings are predicted from the source tokens, source dictionary tokens, target tokens, and target dictionary tokens, using the encoder- decoder network. This is similar to the convention learning step described above with the exception that, here, the source dictionary tokens and target dictionary tokens are actually input to the encoder-decoder network, instead of empty sequences.

[0088] In step S804, the loss is computed based on the target tokens and the predicted target embeddings at corresponding positions and, in step S805, the parameters of the encoder-decoder network are optimized (adjusted) so as to reduce the loss, as in the conventional learning step.

[0089] The conventional learning step and the learning step according to the present invention are repeated for the entire training data or for multiple iterations over the training data (epochs) wherein the learning step according to the present invention is performed for records in the first subset and the conventional learning step is performed for other records.

[0090] In a final step (not shown), the learning parameters may be output, for instance, to a memory such as a file or a database. In addition, the parameters may be modified before they are stored or after they are read. For instance, some or all parameters may be converted to a lower floating-point precision or to a quantized integer representation. To this end, additional quantization layers may be added before and / or after layers of the encoder-decoder network as necessary. In addition, the parameters may be sparsified or a distilling technique may be used to reduce the number of parameters. Large matrices in the parameters may be factorized into smaller matrices.

[0091] As it is commonly known in the field of machine learning and especially deep learning, hyper-parameters such as number and size of layers and the hyper-parameters of the optimization algorithm such as a learning rate need to adapted to the concrete NMT task, for instance, depending on the languages involved and the kinds of texts to be translated, in order to obtain high translation accuracy. In the present case, also the relative sizes of first subset of records and the rate of extracting source / target dictionary entries may be treated as hyper-parameters. For a given problem, such hyper parameters can be found automatically, for instance, by a grid search, or may be tuned manually. [0092] As noted before it is preferable but not necessary that the embedding weights are learned (adjusted) in the optimization step. Also, further parameters of the encoder- decoder network may be pre-trained on a different task such as a language model task and may be frozen (not adjusted) in the optimization step.

NMT System and Learning System

[009S] The hardware configuration of an NMT system and a learning system according to the present invention can be essentially the same. Fig. 9 illustrates an exemplary hardware configuration of such a system.

[0094] The system 900 includes at least one CPU 901 and preferably includes at least one hardware accelerator (HWA) 902 such as a GPU, a TPU, a FPGA or the like. The HWA may also be realized as part of the CPU, for instance by a specialized instruction set. The system 900 further includes a RAM 911 and a disk 912 for storing parameters and embedding weights of the encoder-decoder network and, in case of the learning system, at least parts of the training data. The entire training data need not be stored in the system and parts thereof may be acquired from an external system only when needed. The system may include a network interface (network l/F) for receiving translation or training requests and for accessing data such as dictionaries or training data on an external system. In the NMT system, the CPU 901 and / or HWA 902 perform the NMT method using the parameters and embedding weights stored in the RAM 911 and / or a disk 912. In the learning system, the CPU 901 and / or HWA 902 perform the conventional learning step and the learning step according to the present invention using the training data stored in the RAM 911 and / or a disk 912 and store the optimized parameters in the RAM 911 and / or a disk 912.

[0095] The NMT system may further include input unit 931 for designating a source text to translate and dictionary or translation memory entries (auxiliary entries) to be used for the translation, for instance, via a user interface. The learning system may include input unit 931 for designating training data, for instance, via a user interface. The input unit 931 of the learning system may also be used to initiate, stop, or resume training or to specify or modify hyper-parameters to be used for the training. The respective input unit 931 may in particular perform the preliminary step of the NMT method or the learning method.

[0096] The NMT system may further include output unit 932 for outputting the obtained target text, for instance, via a user interface. The learning system may include output unit 931 for storing the optimized parameters. The respective output unit 931 may in particular perform the final step of the NMT method or the learning method.

[0097] The NMT system and learning system according to the present invention may be realized by a single apparatus or by a plurality of connected apparatuses. Both systems may share all or some of the hardware components. The systems may be realized by specialized hardware or by generic hardware which executes a program which causes the generic hardware to perform the NMT method and / or learning method according to the present invention.

Modifications

[0098] The sequences used by the encoder-decoder networks are usually ordered from the beginning to the end of the text. In this case, the term "previous target tokens" refers to past target tokens and the next target token generated in each decoding step is appended to the past target tokens to be used in the next step. But depending on the languages, it can sometimes be favorable to reverse the order of the source and/or target sequence so as to encode and/or decode texts from the end to the beginning. In this case, the term "previous target tokens" refers to future target tokens and the next target token generated in each decoding step is prepended to the future target tokens to be used in the next step. Moreover, there are so-called insertion models where the next target token is inserted somewhere inside the sequence of previous target token. The present invention is not limited to any specific case and can be directly applied to the other cases analogously. [0099] The Embodiments according to the present invention use attention-based layers. However, any other layer can be used as long as it combines one or more input vectors with memory vectors so as to obtain a number of output vectors corresponding to the number of input vectors. In particular, any mechanism which provides a similar effect as attention, such as active memory, may be used. The attention mechanism is differentiable and is therefore suitable for deep learning since it can be trained with backpropagation. However, a non-differentiable mechanism may be used and trained with techniques known, for instance, from reinforcement learning.

Ei/Ah/An

Claims

1. A neural machine translation method for translating a source text in a source language into a target text in a target language, the method comprising: obtaining the source text, a source auxiliary entry in the source language, and a target auxiliary entry in the target language; computing (S102), using a multi-layer encoder-decoder network, target token information (215) regarding a target token of the target text, by combining information (213) computed based on at least one previous or initial target token (203) with information (211, 212, 214) computed based on the source text, the source auxiliary entry, and the target auxiliary entry; and selecting (S103, S104) a target token based on the computed target token information.

2. The neural machine translation method according to claim 1, comprising: preprocessing (S100) the source text, the source auxiliary entry, and the target auxiliary entry to obtain one or more source tokens (201), one or more source auxiliary tokens (202), and one or more target auxiliary tokens (204), respectively; decoding (S102), as the target token information, a target embedding (215) based on at least one previous or initial target token (203) by a multi-layer encoder-decoder network, wherein the multi-layer encoder-decoder network computes (S102A) at least one previous target embedding (213), source embeddings (211), source auxiliary embeddings (212), and target auxiliary embeddings (214) based on the at least one previous target token (203), the source tokens (201), source auxiliary tokens (202), and target auxiliary tokens (204), respectively, and wherein the multi-layer encoder-decoder network computes (S102B) the target embedding based on the result of combining a vector computed based on the at least one previous target embedding (213) with vectors computed based on the source embeddings (211), vectors computed based on the source auxiliary embeddings (212), and vectors computed based on the target auxiliary embeddings (214); computing (S103) a score for target token candidates based on the target embedding (215); and selecting (S104) a target token from the target token candidates based on the computed score; wherein the steps of decoding (S102) a target embedding, computing (S103) a score for target token candidates, and selecting (S104) a target token are repeated using the target token selected in one repetition as the at least one previous target token (203) in the next repetition (S106) until a completion condition is met (S105), and wherein the target tokens (203) selected over the repetitions are postprocessed to obtain the target text (S107).

3. The neural machine translation method according to claim 2, wherein the multi-layer encoder-decoder network comprises a multi-layer encoder network which encodes the source embeddings (211) to obtain source encodings (340), wherein the multi-layer encoder network comprises a first attention-based layer (331) which takes vectors computed based on the source embeddings (211) as input vectors and vectors computed based on the source auxiliary embeddings (212) as memory vectors, wherein the multi-layer encoder-decoder network comprises a multi-layer decoder network which decodes the target embedding (215), wherein the multi-layer decoder network comprises a second attention- based layer (332) which takes a vector computed based on the at least one previous target embedding (213) as an input vector and vectors computed based on the target auxiliary embeddings (214) as memory vectors, and wherein the multi-layer decoder network comprises a third attention- based layer (333) which takes a vector computed based on the at least one previous target embedding (313) as an input and vectors computed based on the source encodings (340) as memory vectors.

4. The neural machine translation method according to claim 3, wherein the first attention-based layer (331) takes vectors computed based on the source embeddings (211) and vectors computed based on the source auxiliary embeddings (212) as input and memory vectors, and wherein the second attention-based layer (332) takes a vector computed based on the at least one previous target embedding (213) and vectors computed based on the target auxiliary embeddings (214) as input and memory vectors.

5. The neural machine translation method according to claim 2, wherein the multi-layer encoder-decoder network comprises an attention- based layer (430) which takes vectors computed based on the source embeddings (211), vectors computed based on the source auxiliary embeddings (212), a vector computed based on the at least one previous target embedding (213), and vectors computed based on the target auxiliary embeddings (214) as input and memory vectors.

6. The neural machine translation method according to any one of claims 1 to 5, further comprising: at least one of an input step of displaying a user interface for designating the source text and auxiliary information including the source auxiliary entry and the target auxiliary entry and an output step of displaying a user interface for outputting the target text.

7. The neural machine translation method according to claim 1, wherein the source auxiliary entry is a source dictionary entry, and the target auxiliary entry is a target dictionary entry.

8. The neural machine translation method according to any one of claims 2 to 6, wherein the source auxiliary entry is a source dictionary entry, the target auxiliary entry is a target dictionary entry, the source auxiliary tokens are source dictionary tokens, the target auxiliary tokens are target dictionary tokens, the source auxiliary embeddings are source dictionary embeddings, and the target auxiliary embeddings are target dictionary embeddings.

9. The neural machine translation method according to any one of claims 1 to 8, wherein the source auxiliary entry or a further source auxiliary entry is a source part of a translation memory entry and the target auxiliary entry or a further target auxiliary entry is a target part of the translation memory entry, and wherein the translation memory entry is selected from a plurality of translation memory entries based on similarity scores between the source text and the respective source parts of the translation memory entries.

10. A neural machine translation system (900) for translating a source text in a source language into a target text in a target language, the system comprising: one or more processing units (901, 902) configured to perform the neural machine translation method according to any one of claims 1 to 9; one or more memory units (911, 912) configured to store parameters of the multi-layer encoder-decoder network; and preferably, at least one of an input unit (931) configured to display a user interface for designating the source text and auxiliary information including the source auxiliary entry and the target auxiliary entry and an output unit (932) configured to display a user interface for outputting the target text.

11. The neural machine translation system according to claim 10, wherein the at least one input unit (931) is configured to display a user interface for designating a dictionary including source and target dictionary entries and/or a translation memory including source and target translation memory entries as the auxiliary information.

12. A learning method of adjusting parameters of a multi-layer encoder-decoder network as defined in any one of claims 1 to 9 based on training data including records each comprising a source text in the source language and a target text in the target language, the learning method comprising: obtaining the training data; and performing, for each record in a first subset of the records, a learning step, comprising: computing (S800) an alignment between the source text and the target text; extracting (S801), using the alignment, a source phrase and a target phrase from the source text and the target text as a source dictionary entry and as a target dictionary entry, respectively; preprocessing (S802) the source text, the source dictionary entry, the target text, and the target dictionary entry to obtain source tokens (201), source dictionary tokens (202), target tokens, and target dictionary tokens (204), respectively; computing (S803) predicted target embeddings (215) using the multi-layer encoder-decoder network as defined in any one of claims 1 to 5 from the source tokens (201), source dictionary tokens (202), target tokens (203), and target dictionary tokens (204), wherein, for computing the predicted target embedding (215) corresponding to a position among the target tokens, at least one target token corresponding to a different position among the target tokens is used as the at least one previous target token (203); computing (S804) a loss based on at least some of the target tokens and the predicted target embeddings (215) at corresponding positions among the target tokens; and adjusting (S805) at least some of the parameters of the multi-layer encoder-decoder network so as to reduce the loss.

13. A learning system (900), comprising: one or more processing units (901, 902) configured to perform the learning method according to claim 12; one or more memory units (911, 912) configured to store the training data and to store the adjusted parameters of the multi-layer encoder-decoder network; and preferably, at least one of an input unit (931) configured to display a user interface for designating the training data and an output unit (932) configured to output the adjusted parameters of the multi-layer encoder-decoder network.

14. A neural machine translation method for translating a source text in a source language into a target text in a target language, the method comprising: the learning method according to claim 12 for adjusting parameters of a multi-layer encoder-decoder network; and the neural machine translation method according to any one of claims 1 to 9 using parameters based on the adjusted parameters of the multi-layer encoder- decoder network to translate the source text into the target text.

15. A program for causing, when executed on a system (900) having a processor (901, 902) and a memory (911, 912), the system (900) to perform the method according to any one of claims 1 to 9, 12, and 14.

E