CN110222350A - Method for incorporating bilingual predefined translation pairs into a neural machine translation model - Google Patents
Method for incorporating bilingual predefined translation pairs into a neural machine translation model
- Publication number
- CN110222350A CN110222350A CN201910577358.3A CN201910577358A CN110222350A CN 110222350 A CN110222350 A CN 110222350A CN 201910577358 A CN201910577358 A CN 201910577358A CN 110222350 A CN110222350 A CN 110222350A
- Authority
- CN
- China
- Prior art keywords
- translation
- model
- source
- predefined
- bilingual
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Abstract
The invention discloses a method for incorporating bilingual predefined translation pairs into a neural machine translation (NMT) model. The method applies to attention-based NMT models that use an encoder-decoder framework. The goal is to incorporate a bilingual predefined translation pair (p, q) into the NMT model, where p appears in the source sentence and must be correctly translated as q, appearing in the target sentence, while the other words of the source sentence are still translated correctly. Beneficial effects of the present invention: we propose to introduce examples into the NMT model to guide its translation. By using a tagging method, the model learns this pattern, establishing a link between the source-side and target-side predefined phrases and raising the probability that they are translated successfully.
Description
Technical field
The present invention relates to the field of neural machine translation, and in particular to a method for incorporating bilingual predefined translation pairs into a neural machine translation model.
Background technique
Machine translation (MT) is an important area of natural language processing (NLP) that aims to use machines to translate one language into another. Over many years, MT has developed from rule-based methods to statistics-based methods, and now to neural machine translation (Neural Machine Translation, NMT) based on neural networks. Like many other mainstream NLP tasks, NMT typically uses a sequence-to-sequence (seq2seq) structure composed of an encoder and a decoder. The encoder encodes the source sentence into a vector representation, and the decoder then generates the corresponding translation word by word from that representation.
In many scenarios, an NMT system must use the translations in a predefined translation database. In the e-commerce domain, for example, many brand names have fixed translations, and a wrong translation may cause trade disputes. In the news domain, person names and country names must be translated directly according to their predefined translations. We call such phrase pairs with agreed translations bilingual predefined translation pairs. Because an NMT model is end to end, its translation process can be regarded as a black box, so it is difficult to intervene directly in its output; incorporating predefined translations into NMT is therefore not an easy matter.
(1) Addressing the rare word problem in neural machine translation. This article introduces alignment techniques so that the NMT model knows which source word corresponds to each target-side word that is not in the vocabulary, and uses a predefined alignment dictionary in post-processing to replace those out-of-vocabulary words with words from the dictionary.
(2) Neural machine translation with external phrase memory. This article proposes a framework called PhraseNet, which modifies the entire architecture and lets the model decide at each time step whether to output an ordinary word or to generate a phrase from a bilingual predefined translation pair.
(3) Lexically constrained decoding for sequence generation using grid beam search. This article proposes a technique called Grid Beam Search that forces specified segments, either single words or multi-word spans, to appear in the model's output.
The traditional techniques have the following problems:
The approach of introducing alignment techniques and replacing translations in post-processing mainly addresses the rare word problem and can only handle single words, whereas a bilingual predefined translation pair may contain multi-word phrases. Modifying the model architecture solves the phrase problem, but it cannot guarantee that the predefined phrase actually appears on the target side, and the modifications are complex and hard to reproduce. The method of modifying beam search does not require changing the model architecture, but when generating each word the decoder must decide, according to the current translation, whether to use information from the bilingual predefined translation pair, which greatly reduces translation speed.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method for incorporating bilingual predefined translation pairs into a neural machine translation model which, through preprocessing of the data and slight modifications to the model, can successfully translate the source-side phrases of bilingual predefined translation pairs (with a success rate above 99%) without reducing decoding speed.
To solve the above technical problem, the present invention provides a method for incorporating bilingual predefined translation pairs into a neural machine translation model, applied to an attention-based NMT model using an encoder-decoder framework, comprising: the goal is to incorporate a bilingual predefined translation pair (p, q) into the neural machine translation model, where p appears in the source sentence and must be correctly translated as q, appearing in the target sentence, while the other words of the source sentence are also translated correctly.
The occurrences of p and q in the training corpus are surrounded with the special labels <start> and <end>. Like the other words in the dictionary, the labels <start> and <end> have randomly initialized vector representations whose parameters are learned gradually during training. The purpose of using tagging is to establish, through this shared pattern, a link between p and q, while keeping q contiguous in the output without interruption.
A method for incorporating bilingual predefined translation pairs into a neural machine translation model, applied to an attention-based NMT model using an encoder-decoder framework, comprising: the goal is to incorporate a bilingual predefined translation pair (p, q) into the neural machine translation model, where p appears in the source sentence and must be correctly translated as q, appearing in the target sentence, while the other words of the source sentence are also translated correctly;
It is proposed to augment p with q. In addition to the <start> and <end> labels of tagging, a <middle> label is also used on the source side. The source side then contains the information of both p and q, so the NMT model can not only learn the translation from "p" to "q" but also learn to copy "q" directly, without translating, when it encounters it. When a word occurs only rarely in the corpus, the model may fail to learn its translation, and copying yields a better result.
A method for incorporating bilingual predefined translation pairs into a neural machine translation model, applied to an attention-based NMT model using an encoder-decoder framework, comprising: the goal is to incorporate a bilingual predefined translation pair (p, q) into the neural machine translation model, where p appears in the source sentence and must be correctly translated as q, appearing in the target sentence, while the other words of the source sentence are also translated correctly;
The data is not modified; instead the model is slightly modified. Taking the source sentence "I love <start> Hong Kong <middle> hong kong <end>" as an example (extra embeddings are used only on the encoder side and are therefore independent of the target side), its 8 tokens can be viewed as the tag sequence "n n n s n t t n", where s and t mark the phrases p and q of the predefined translation pair and n marks ordinary words. Then, according to each word's tag, the corresponding tag vector is added to that word's word vector.
A computer device comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the program, realizes the steps of any one of the above methods.
A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, realizes the steps of any one of the above methods.
A processor for running a program, wherein the program, when run, executes any one of the above methods.
Beneficial effects of the present invention:
We propose to introduce examples into the neural machine translation model to guide its translation, with the following advantages.
1. By using the tagging method, the model learns this pattern, establishing a link between the source-side and target-side predefined phrases and raising the probability that they are translated successfully.
2. The mixed-phrase method adds the target-side phrase to the source side and, combined with tagging, lets the model learn translation and direct copying at the same time, greatly increasing the probability that these phrases are translated successfully.
3. By using extra embeddings combined with the above methods, we distinguish the different components of a source sentence and strengthen the copy signal in a more direct way, without modifying the corpus.
Brief description of the drawings
Fig. 1 is a schematic diagram of the attention-based NMT model used in the present invention.
Fig. 2 is a schematic diagram of the improved model used by the method of the present invention for incorporating bilingual predefined translation pairs into a neural machine translation model.
Detailed description of the embodiments
The present invention is further explained below with reference to the accompanying drawings and specific embodiments, so that those skilled in the art can better understand and practice it; the illustrated embodiments, however, do not limit the invention.
Background: the attention-based NMT model
Neural machine translation systems generally use end-to-end models. A model generally comprises an encoder and a decoder. During translation, the encoder converts the input sentence into a representation that is fed into the decoder; the decoder receives the encoder's output, combines it with other mechanisms (such as the attention mechanism), and outputs one word at a time, feeding each output word back into the decoder as input for the next one. This continues until the sentence is fully translated.
Transformer is the model Google proposed in 2017. It is similar to most encoder-decoder frameworks in that it also consists of these two parts, but differs in that it abandons the RNN structure common in traditional natural language processing and builds the model entirely from attention mechanisms.
In the data input part, besides converting the input sequence into the corresponding embeddings, a position vector (the output of position encoding) is also added to encode the positions of the source sentence.
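The position vector mentioned above can be sketched as follows. The patent only states that a position vector is added, so the sinusoidal formula here is an assumption borrowed from the original Transformer design, and the function names are illustrative.

```python
import math

# Sketch of sinusoidal position encoding (assumed formulation; the patent
# only says a position vector is added to the input embeddings).
def position_encoding(pos, d_model):
    # Even indices use sine, odd indices cosine, with geometrically
    # increasing wavelengths over the feature dimension.
    return [math.sin(pos / 10000 ** (i / d_model)) if i % 2 == 0
            else math.cos(pos / 10000 ** ((i - 1) / d_model))
            for i in range(d_model)]

def add_position(embeddings):
    # Add the position vector to each token embedding, position by position.
    return [[e + p for e, p in zip(vec, position_encoding(pos, len(vec)))]
            for pos, vec in enumerate(embeddings)]
```

In this sketch the position signal is simply summed with the token embedding, which is why it needs no extra parameters.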
The left half of the figure is the encoder module, a stack of 6 identical layers, each containing two sublayers. The first sublayer is the multi-head self-attention sublayer, which takes the information of the other words in the source sentence into account through a self-attention network to generate the context vector of the current word. The calculation formula of multi-head attention is as follows:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
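The formula above can be sketched in code. This is a minimal illustration: the per-head projection matrices W_i^Q, W_i^K, W_i^V and the output matrix W^O are omitted (the heads are formed by slicing the feature dimension instead), so it shows the split-attend-concat pattern rather than the full parameterized layer.

```python
import math

def matmul(A, B):
    # Plain list-of-lists matrix multiplication for small illustrative sizes.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head(Q, K, V, h):
    # Split the feature dimension into h heads, attend per head, then
    # concatenate (W^O is omitted in this sketch).
    d = len(Q[0]) // h
    heads = []
    for i in range(h):
        sl = lambda M: [row[i * d:(i + 1) * d] for row in M]
        heads.append(attention(sl(Q), sl(K), sl(V)))
    return [sum((head[t] for head in heads), []) for t in range(len(Q))]
```

For self-attention, as in the encoder sublayer described here, Q, K, and V are all the same sequence of hidden states.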
The second sublayer is a fully connected feed-forward sublayer, whose role is to integrate the context vector produced by the self-attention network with the information of the current word, generating the hidden state of the current time step fused with the context of the whole sentence. The formula is as follows:
FFN(x) = max(0, x W_1 + b_1) W_2 + b_2
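The feed-forward formula maps directly to code. A minimal sketch with illustrative weight shapes:

```python
# Sketch of the position-wise feed-forward sublayer
# FFN(x) = max(0, x W1 + b1) W2 + b2; the weights passed in are
# illustrative, not trained parameters.
def ffn(x, W1, b1, W2, b2):
    # First linear map followed by the ReLU nonlinearity max(0, .).
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*W1), b1)]
    # Second linear map back to the model dimension.
    return [sum(h * w for h, w in zip(hidden, col)) + b
            for col, b in zip(zip(*W2), b2)]
```

The same two-layer map is applied independently at every position of the sentence.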
Through the above steps, we obtain the representation of the source sentence.
The right half of the figure is the decoder module. Like the encoder, it is also a stack of 6 identical layers, but with certain differences. Each decoder layer contains three sublayers. The first sublayer is similar to the encoder's first sublayer: a multi-head self-attention sublayer responsible for generating the current word's context vector from contextual information. Unlike the encoder side, however, decoding may only see the words generated so far; the information of words not yet generated is shielded by a mask mechanism, i.e. the masked multi-head self-attention described in the figure. The second sublayer is a cross-attention sublayer that relates the hidden states of the source sentence to the hidden states of the target language to generate the source-language context vector; its Q is the output of the masked multi-head self-attention sublayer, while K and V are the output previously produced by the encoder. The third sublayer is similar to the encoder's second sublayer, and uses the accumulated information to generate the prediction of the target-language word at the current position.
In addition, all connections between layers use layer norm with residual connections. After obtaining the decoder's output representation, a linear transformation followed by softmax yields the probability distribution of the current position over the whole dictionary, from which the translation result of the current step is finally obtained.
The model is generally trained by minimizing the negative log-likelihood as the loss function, iterating with stochastic gradient descent as the training method. On a training set D = {(x_n, y_n)}, n = 1, ..., N, where (x_n, y_n) are parallel sentence pairs, the training objective function of the model is as follows:
L(theta) = - sum_{n=1}^{N} log P(y_n | x_n; theta)
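The objective above can be sketched numerically. The per-token probabilities below stand in for real model outputs and are purely illustrative; the function names are not from the patent.

```python
import math

# Sketch of the negative log-likelihood objective
# L(theta) = -sum_n log P(y_n | x_n; theta).
def sentence_nll(token_probs):
    # log P(y | x) factorizes over target tokens, so the sentence NLL is
    # the negated sum of per-token log-probabilities.
    return -sum(math.log(p) for p in token_probs)

def corpus_loss(batches):
    # Sum the NLL over all parallel sentence pairs in the training set.
    return sum(sentence_nll(probs) for probs in batches)
```

A model that assigns probability 1 to every reference token attains the minimum loss of 0, which is what the optimizer pushes toward.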
A method for incorporating bilingual predefined translation pairs into a neural machine translation model:
Our goal is to incorporate a bilingual predefined translation pair (p, q) into the neural machine translation model, where p appears in the source sentence and must be correctly translated as q, appearing in the target sentence, while the other words of the source sentence are also translated correctly. To reach this goal, we propose three methods: tagging, mixed phrase, and extra embeddings. To better describe them, we use the parallel sentence pair "I love Hong Kong | i love hong kong" as an example.
Tagging:
This method is very intuitive: we surround the occurrences of p and q in the training corpus with the special labels <start> and <end>. Taking the parallel sentence pair above as an example, after tagging the pair becomes "I love <start> Hong Kong <end> | i love <start> hong kong <end>". Like the other words in the dictionary, the labels <start> and <end> have randomly initialized vector representations whose parameters are learned gradually during training. The purpose of using tagging is to establish, through this shared pattern, a link between p and q, while keeping q contiguous in the output without interruption.
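The tagging preprocessing above can be sketched as follows. The tokenization and the helper name `tag_pair` are illustrative, not from the patent.

```python
# Sketch of the tagging step: wrap every occurrence of the predefined
# pair (p, q) in <start>/<end> on both sides of a parallel sentence.
def tag_pair(src_tokens, tgt_tokens, p_tokens, q_tokens):
    def wrap(tokens, phrase):
        out, i = [], 0
        while i < len(tokens):
            if tokens[i:i + len(phrase)] == phrase:
                # Surround the matched phrase with the special labels.
                out += ["<start>"] + phrase + ["<end>"]
                i += len(phrase)
            else:
                out.append(tokens[i])
                i += 1
        return out
    return wrap(src_tokens, p_tokens), wrap(tgt_tokens, q_tokens)
```

Because both sides receive the same label pattern, the model can associate the tagged source span with the tagged target span during training.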
Mixed Phrase:
For a neural network, learning to copy should be easier than learning to translate, so we propose to augment p with q. Using the same example as before, the sentence pair becomes "I love <start> Hong Kong <middle> hong kong <end> | i love <start> hong kong <end>". Besides the <start> and <end> labels of tagging, a <middle> label is also used on the source side, to separate "Hong Kong" and "hong kong". The source side then contains the information of both p and q, so the NMT model can not only learn the translation from "Hong Kong" to "hong kong" but also learn to copy "hong kong" directly, without translating, when it encounters it. When a word occurs only rarely in the corpus, the model may fail to learn its translation, and copying yields a better result.
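The mixed-phrase augmentation of the source side can be sketched as follows; the helper name `mix_phrase` and the tokenization are illustrative.

```python
# Sketch of the mixed-phrase step: on the source side, follow the phrase p
# with <middle> and the target phrase q, so the model can learn both to
# translate p and to copy q directly.
def mix_phrase(src_tokens, p_tokens, q_tokens):
    out, i = [], 0
    while i < len(src_tokens):
        if src_tokens[i:i + len(p_tokens)] == p_tokens:
            # Insert q after p, separated by the <middle> label.
            out += ["<start>"] + p_tokens + ["<middle>"] + q_tokens + ["<end>"]
            i += len(p_tokens)
        else:
            out.append(src_tokens[i])
            i += 1
    return out
```

Only the source side is rewritten; the target side keeps the plain tagging of the previous method.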
Extra Embeddings:
The third method does not modify the data but slightly modifies the model. Taking "I love <start> Hong Kong <middle> hong kong <end>" as an example again (extra embeddings are used only on the encoder side and are therefore independent of the target side), its 8 tokens can be viewed as the tag sequence "n n n s n t t n", where s and t mark the phrases p and q of the predefined translation pair and n marks ordinary words. Then, according to each word's tag, the corresponding tag vector is added to that word's word vector. These vectors are similar to the position embeddings added to the word embeddings in the Transformer, so we call them extra embeddings. Extra embeddings not only strengthen the signal that q is to be copied directly; compared with the first method, tagging, they also require no data modification and do not increase sentence length, acting instead directly on the intermediate representation.
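The extra-embeddings step can be sketched as follows. The dimension, the random initialization, and the function name are illustrative; in the real model the tag vectors would be learned parameters.

```python
import random

# Sketch of extra embeddings: each source token carries a role tag
# (n = ordinary word, s = source phrase p, t = target phrase q), and the
# tag's vector is added to the token's word vector on the encoder side.
DIM = 4
random.seed(0)
tag_emb = {t: [random.uniform(-0.1, 0.1) for _ in range(DIM)]
           for t in "nst"}

def add_extra_embeddings(word_vectors, tags):
    # Add the tag vector elementwise to each word vector, mirroring how
    # position embeddings are added in the Transformer.
    return [[w + e for w, e in zip(vec, tag_emb[tag])]
            for vec, tag in zip(word_vectors, tags)]
```

For the example sentence, the tags would be the sequence "n n n s n t t n", one tag per token.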
Of the three methods above, tagging and mixed phrase can be used alone or together, while the extra embeddings method must be combined with the mixed phrase method. In addition, because the source side contains multiple languages in the mixed phrase method, we use shared embeddings; to achieve this, we made a simple modification to the model. As shown in the figure, compared with the generic Transformer model, our shared word vectors (embeddings) are composed of the source-side word vectors, the target-side word vectors, and the label word vectors, so that a target-side word appearing in a source sentence shares the same word vector as that word on the target side. In addition, the target linear part is limited to the size of the target-side vocabulary, ensuring that the output consists only of target-side words. It can further be seen that we add extra embeddings in the encoder part, as described in the third method above.
The bilingual predefined translation pairs described above can be obtained in two ways: first, constructed manually; second, constructed from the training corpus. Since manual construction is very expensive, we describe here how to construct translation pairs automatically from the corpus. We focus mainly on named entities (Named Entities, NEs), because many person names, place names, and brand names essentially belong to named entities. We first use the statistical translation tool Moses to build a phrase table from the bilingual training corpus (the phrase table contains the various possible translations of each n-gram phrase together with their probabilities), and use a named entity recognition tool to identify the entities on the source side. We then traverse the parallel corpus, and for each entity identified on the source side we look up its candidate translations in the phrase table; if a translation is present in the target sentence, it is added to the candidate list, and if several exist, the one with the higher probability is selected.
Through the above process, we can successfully obtain bilingual predefined translation pairs.
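The candidate-selection step of this process can be sketched as follows. The phrase-table format shown (entity mapped to a list of (translation, probability) pairs) is a simplified stand-in for a real Moses phrase table, and all data in the example is hypothetical.

```python
# Sketch of automatic construction of predefined pairs: given source-side
# named entities and a simplified phrase table, keep the candidate
# translation that also appears in the target sentence, preferring the
# higher-probability one.
def extract_pairs(entities, phrase_table, tgt_sentence):
    pairs = []
    for ent in entities:
        # Candidate translations of this entity that occur in the target.
        cands = [(t, p) for t, p in phrase_table.get(ent, [])
                 if t in tgt_sentence]
        if cands:
            best = max(cands, key=lambda tp: tp[1])
            pairs.append((ent, best[0]))
    return pairs
```

Entities with no candidate occurring in the target sentence are simply skipped, so only pairs actually attested in the parallel data are kept.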
When we want to incorporate bilingual predefined translation pairs into the neural machine translation model, in order to raise the translation success rate of these phrases while avoiding complicated model modifications or reduced translation speed, we propose three methods: first, labeling these predefined phrase pairs (Tagging), so that the NMT model learns this translation pattern during training; second, adding the target-side phrase of the bilingual predefined translation pair to the source side (Mixed Phrase), so that the NMT model learns direct copying while learning to translate; third, adding additional word vectors to the model (Extra Embeddings) to distinguish the different components of a sentence, further improving translation quality.
We propose to introduce examples into the neural machine translation model to guide its translation, with the following advantages.
By using the tagging method, the model learns this pattern, establishing a link between the source-side and target-side predefined phrases and raising the probability that they are translated successfully.
The mixed-phrase method adds the target-side phrase to the source side and, combined with tagging, lets the model learn translation and direct copying at the same time, greatly increasing the probability that these phrases are translated successfully.
By using extra embeddings combined with the above methods, we distinguish the different components of a source sentence and strengthen the copy signal in a more direct way, without modifying the corpus.
We tested one or more of the above methods on Chinese-English and English-German corpora. Both the translation success rate of the bilingual predefined phrase pairs and the BLEU-4 score improve substantially over the baseline. The experimental results are as follows:
Table one
Table two
Tables one and two show results on the Chinese-English news corpus. Table one gives the probability that a source-side p is successfully translated as q on the development set (NIST06) and on multiple test sets (NIST03, NIST04, NIST05). The results are divided into two columns, giving the sentence-level and the phrase-level probabilities respectively. For each test sentence, if every phrase p in the sentence is successfully translated as q, the sentence is considered successfully translated. In the tables, BASE denotes the basic Transformer model, T the tagging method, and M the mixed phrase method described above. R requires additional explanation: it is a method similar to M, except that instead of adding q to the source side it replaces p with q; it is used for comparison with M. E denotes the extra embeddings method, and & denotes the combined use of several methods.
The data in the tables show that all of the above methods help the translation of predefined translation pairs; the combinations T&R and T&M achieve the best effect, and the combination of all three, T&M&E, is almost as high.
The second table gives the BLEU-4 results on these development and test sets. Comparing the T&M and T&R methods, we find that M yields better translation quality than R. Combining the three methods gives the best translation result.
Table three
Table three shows the experiment carried out on the English-German data. Here we only compare the base Transformer model with the combination of the three methods.
Since English and German both belong to the European language family and the nouns of many entities are actually identical, even without any of our methods the sentence-level success rate is 91.18%. Our method can still raise the success rate further, to 99.34%, and the BLEU-4 value also improves.
The embodiments described above are only preferred embodiments given to fully illustrate the present invention, and the protection scope of the present invention is not limited to them. Equivalent substitutions or transformations made by those skilled in the art on the basis of the present invention fall within the protection scope of the present invention, which is defined by the appended claims.
Claims (6)
1. A method for incorporating bilingual predefined translation pairs into a neural machine translation model, applied to an attention-based NMT model using an encoder-decoder framework, characterized by comprising: the goal is to incorporate a bilingual predefined translation pair (p, q) into the neural machine translation model, where p appears in the source sentence and must be correctly translated as q, appearing in the target sentence, while the other words of the source sentence are also translated correctly.
The occurrences of p and q in the training corpus are surrounded with the special labels <start> and <end>; like the other words in the dictionary, the labels <start> and <end> have randomly initialized vector representations whose parameters are learned gradually during training; the purpose of using tagging is to establish, through this shared pattern, a link between p and q, while keeping q contiguous in the output without interruption.
2. A method for incorporating bilingual predefined translation pairs into a neural machine translation model, applied to an attention-based NMT model using an encoder-decoder framework, characterized by comprising: the goal is to incorporate a bilingual predefined translation pair (p, q) into the neural machine translation model, where p appears in the source sentence and must be correctly translated as q, appearing in the target sentence, while the other words of the source sentence are also translated correctly;
It is proposed to augment p with q; besides the <start> and <end> labels of tagging, a <middle> label is also used on the source side; the source side then contains the information of both p and q, so the NMT model can not only learn the translation from "p" to "q" but also learn to copy "q" directly, without translating, when it encounters it; when a word occurs only rarely in the corpus, the model may fail to learn its translation, and copying yields a better result.
3. A method for incorporating bilingual predefined translation pairs into a neural machine translation model, applied to an attention-based NMT model using an encoder-decoder framework, characterized by comprising: the goal is to incorporate a bilingual predefined translation pair (p, q) into the neural machine translation model, where p appears in the source sentence and must be correctly translated as q, appearing in the target sentence, while the other words of the source sentence are also translated correctly;
The data is not modified; instead the model is slightly modified. Taking "I love <start> Hong Kong <middle> hong kong <end>" as an example (extra embeddings are used only on the encoder side and are therefore independent of the target side), its 8 tokens can be viewed as the tag sequence "n n n s n t t n", where s and t mark the phrases p and q of the predefined translation pair and n marks ordinary words; then, according to each word's tag, the corresponding tag vector is added to that word's word vector.
4. A computer device comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, characterized in that the processor, when executing the program, realizes the steps of the method of any one of claims 1 to 3.
5. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, realizes the steps of the method of any one of claims 1 to 3.
6. A processor, characterized in that the processor is used for running a program, wherein the program, when run, executes the method of any one of claims 1 to 3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910577358.3A CN110222350A (en) | 2019-06-28 | 2019-06-28 | Method for incorporating bilingual predefined translation pairs into a neural machine translation model
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910577358.3A CN110222350A (en) | 2019-06-28 | 2019-06-28 | Method for incorporating bilingual predefined translation pairs into a neural machine translation model
Publications (1)
Publication Number | Publication Date |
---|---|
CN110222350A true CN110222350A (en) | 2019-09-10 |
Family
ID=67815520
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910577358.3A Pending CN110222350A (en) | 2019-06-28 | 2019-06-28 | Method for incorporating bilingual predefined translation pairs into a neural machine translation model
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110222350A (en) |
- 2019-06-28: CN application CN201910577358.3A filed; published as CN110222350A/en; status active, Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102193912A (en) * | 2010-03-12 | 2011-09-21 | 富士通株式会社 | Phrase division model establishing method, statistical machine translation method and decoder |
CN104298662A (en) * | 2014-04-29 | 2015-01-21 | 中国专利信息中心 | Machine translation method and translation system based on organism named entities |
CN108132932A (en) * | 2017-12-27 | 2018-06-08 | 苏州大学 | Neural machine translation method with replicanism |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111160050A (en) * | 2019-12-20 | 2020-05-15 | 沈阳雅译网络技术有限公司 | Chapter-level neural machine translation method based on context memory network |
CN111178093A (en) * | 2019-12-20 | 2020-05-19 | 沈阳雅译网络技术有限公司 | Neural machine translation system training acceleration method based on stacking algorithm |
CN111178093B (en) * | 2019-12-20 | 2023-08-04 | 沈阳雅译网络技术有限公司 | Neural machine translation system training acceleration method based on stacking algorithm |
CN111353315A (en) * | 2020-01-21 | 2020-06-30 | 沈阳雅译网络技术有限公司 | Deep neural machine translation system based on random residual algorithm |
CN111353315B (en) * | 2020-01-21 | 2023-04-25 | 沈阳雅译网络技术有限公司 | Deep nerve machine translation system based on random residual error algorithm |
CN111597827A (en) * | 2020-04-02 | 2020-08-28 | 云知声智能科技股份有限公司 | Method and device for improving machine translation accuracy |
CN111597827B (en) * | 2020-04-02 | 2023-05-26 | 云知声智能科技股份有限公司 | Method and device for improving accuracy of machine translation |
CN111563391A (en) * | 2020-04-28 | 2020-08-21 | 北京金山云网络技术有限公司 | Machine translation method and device and electronic equipment |
WO2021248589A1 (en) * | 2020-06-12 | 2021-12-16 | Huawei Technologies Co., Ltd. | System and method for bi-directional translation using sum-product networks |
US11586833B2 (en) | 2020-06-12 | 2023-02-21 | Huawei Technologies Co., Ltd. | System and method for bi-directional translation using sum-product networks |
CN114580445A (en) * | 2022-03-10 | 2022-06-03 | 昆明理工大学 | Multi-domain adaptive neural machine translation method based on mask substructure of domain perception |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110222350A (en) | By bilingual predefined translation to the method for incorporating neural Machine Translation Model | |
US20160350655A1 (en) | Systems Methods Circuits and Associated Computer Executable Code for Deep Learning Based Natural Language Understanding | |
WO2022095563A1 (en) | Text error correction adaptation method and apparatus, and electronic device, and storage medium | |
CN110874537A (en) | Generation method of multi-language translation model, translation method and translation equipment | |
CN108132932B (en) | Neural machine translation method with replication mechanism | |
EP3847591A1 (en) | Natural language question answering | |
CN110046359A (en) | Neural machine translation method based on sample guidance | |
Hsu et al. | Prompt-learning for cross-lingual relation extraction | |
Sundaram et al. | Why are nlp models fumbling at elementary math? a survey of deep learning based word problem solvers | |
CN116187324B (en) | Method, system and medium for generating cross-language abstract for long text of source language | |
Onan et al. | Improving Turkish text sentiment classification through task-specific and universal transformations: an ensemble data augmentation approach | |
CN113657125A (en) | Knowledge graph-based Mongolian non-autoregressive machine translation method | |
Sun | Analysis of Chinese machine translation training based on deep learning technology | |
Zhang et al. | Mind the gap: Machine translation by minimizing the semantic gap in embedding space | |
US20230153550A1 (en) | Machine Translation Method and Apparatus, Device and Storage Medium | |
CN116681061A (en) | English grammar correction technology based on multitask learning and attention mechanism | |
Li et al. | Cross-lingual transferring of pre-trained contextualized language models | |
Yazar et al. | Low-Resource Neural Machine Translation: A Systematic Literature Review | |
CN114595700A (en) | Zero-pronoun and chapter information fused Hanyue neural machine translation method | |
Quy Tran et al. | FU Covid-19 AI Agent built on Attention algorithm using a combination of Transformer, ALBERT model, and RASA framework | |
Bonnell et al. | Rule-based Adornment of Modern Historical Japanese Corpora using Accurate Universal Dependencies. | |
Abka et al. | Transformer-based Cross-Lingual Summarization using Multilingual Word Embeddings for English-Bahasa Indonesia | |
Alansary et al. | The universal networking language in action in English-Arabic machine translation | |
Jain et al. | Literature survey: Neural machine translation in low resource setting | |
Parnow et al. | Grammatical error correction: More data with more context |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | | Application publication date: 20190910 |