A context-sensitive device and method for translating out-of-vocabulary words in neural machine translation
Technical field
The present invention relates to a lexical translation device and method, and belongs to the technical field of lexical translation devices and methods.
Background technology
Neural machine translation (NMT) is a new approach to machine translation whose core is a deep neural network that is simple to train end-to-end and easy to scale. The network uses an encoder-decoder structure: the encoder maps the source sentence into a fixed-length semantic vector representation, and the decoder, a recurrent neural network (RNN), generates the target-side translation word by word using the source-sentence representation and the target-side history. Since this network was proposed, it has achieved the best results to date on machine translation tasks between many language pairs, such as English-French, German-English, and English-Czech translation.
In practical model implementations, because of the limits on computation and GPU memory, an NMT model must fix in advance a very limited vocabulary of everyday words; all other words are out-of-vocabulary (OOV) words and are marked with the special symbol <unk> (unknown). The vocabulary size is generally set between 30,000 and 80,000. Because translation is an open-vocabulary problem, representing a large number of semantically rich OOV words with a single meaningless <unk> mark greatly increases the ambiguity of the source sentence. Moreover, once a generated translation contains <unk>, the OOV information has already been discarded during NMT decoding, so the model itself cannot handle these <unk> marks; they can only be post-processed in the translation result after translation is complete.
At present, the most widely used approach is a greedy post-processing method: word-alignment information is recorded in the NMT model, usually via the attention mechanism; for each <unk>, the source word with the highest alignment probability is found from the alignment information, the translation candidates of that source word are then looked up in a pre-built translation dictionary, and the candidate with the highest translation probability in the dictionary replaces the <unk> in the translation result. This method also serves as the comparison baseline in the experiments of the present embodiment.
Many experiments have shown that the substitutes this method finds for <unk> can improve the quality of NMT translation results. However, because the context of the <unk> in the translation is not considered during replacement, many problems remain. Translation involves a large number of "one-to-many", "many-to-one", and "many-to-many" mappings, and even in the "one-to-one" case, the same source word may need to be translated into different target words in different contexts. Faced with these complex translation phenomena, the greedy post-processing method causes a large number of replacement errors, repeated replacements, and sentences that are not fluent after replacement.
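The greedy baseline described above can be sketched as follows. The attention-matrix and dictionary formats are illustrative assumptions, not those of any particular NMT toolkit.

```python
# A minimal sketch of the greedy <unk> post-processing baseline: align each
# <unk> to the source word with maximum attention weight, then substitute
# that word's most probable dictionary translation, ignoring context.

def greedy_replace(target_tokens, source_tokens, attention, dictionary):
    """attention[j][i] -- alignment weight of target position j to source i.
    dictionary[src]    -- list of (candidate, prob), sorted by prob descending.
    """
    result = []
    for j, tok in enumerate(target_tokens):
        if tok != "<unk>":
            result.append(tok)
            continue
        # Source word with maximum alignment probability for this <unk>.
        i = max(range(len(source_tokens)), key=lambda i: attention[j][i])
        src = source_tokens[i]
        candidates = dictionary.get(src)
        # Greedy choice: the single most probable candidate.
        result.append(candidates[0][0] if candidates else src)
    return result
```
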
The content of the invention
In order to solve the problem that the translations produced by existing neural machine translation devices cannot match the context or semantics, the present invention proposes a context-sensitive device and method for translating out-of-vocabulary words in neural machine translation.
A context-sensitive device for translating out-of-vocabulary words in neural machine translation adopts the following technical scheme:
The OOV translation device includes:
a lookup module that searches the translation dictionary according to all source-side words;
a candidate-translation providing module that, according to the lookup result obtained by the lookup module, provides possible OOV candidate translations for the <unk> mark;
a feature extraction module for extracting contextual features for the candidate translations;
a ranking module that, for the contextual features, obtains an evaluation score for each OOV candidate translation using a trained SVM-rank model, and sorts the OOV candidate translations by evaluation score from high to low;
a replacement module that replaces the <unk> mark in the sentence translation with the top-ranked OOV candidate translation, obtaining a complete translation that conforms to the context.
Further, the feature extraction module includes:
a word-alignment feature extraction module for extracting word-alignment features from the NMT attention alignment model;
a word-granularity feature extraction module for extracting word-granularity features of the source-side word and the OOV candidate translation;
a phrase-granularity feature extraction module for extracting phrase-granularity features of the source-side word and the OOV candidate translation;
a language-model feature extraction module for extracting language-model features near the OOV candidate translation when it appears at the position of the <unk> mark.
Further, the word-granularity feature extraction module includes:
a forward translation probability module for the probability that the source-side word translates into the OOV candidate translation;
a reverse translation probability module for the probability that the OOV candidate translation translates into the source-side word;
a source-word count extraction module for extracting the number of times the source-side word occurs in the NMT training parallel corpus;
an OOV candidate-translation count extraction module for extracting the number of times the OOV candidate translation occurs in the NMT training parallel corpus;
a co-occurrence count extraction module for extracting the number of times the source-side word and the OOV candidate translation co-occur in the parallel sentence pairs of the parallel corpus;
a vocabulary position extraction module for extracting the position at which the source-side word appears in the vocabulary;
a judgment module for judging whether the source-side word is itself an OOV word.
Further, the phrase-granularity feature extraction module includes:
a source-word in-phrase-table count extraction module for extracting the number of times the source-side word occurs in the phrase table;
an OOV candidate-translation in-phrase-table count extraction module (one) for extracting the number of times the OOV candidate translation occurs in the phrase table;
an in-phrase-table co-occurrence count extraction module for extracting the number of times the source-side word and the OOV candidate translation co-occur in individual phrase pairs of the phrase table;
a phrase count extraction module for extracting the number of times the phrase formed by the source-side word and its neighboring words occurs in the phrase table;
an OOV candidate-translation in-phrase-table count extraction module (two) for extracting, when the source-side word forms a phrase with its neighboring words, the number of times the OOV candidate translation appears in the corresponding target phrase;
an OOV candidate-translation phrase length extraction module for extracting the maximum length of the OOV candidate-translation phrase when the phrases formed by the source-side word and by the OOV candidate translation with their respective neighboring words appear in the phrase table as a phrase pair;
a source-word phrase length extraction module for extracting the length of the source-side phrase when such a phrase pair appears in the phrase table and the OOV candidate-translation phrase attains its maximum length.
Further, the language-model feature extraction module includes:
a forward n-gram language-model probability extraction module for extracting the forward n-gram language-model probability of the OOV candidate translation when it appears at the position of the <unk> mark in the continuous sequence of translated words;
a reverse n-gram language-model probability extraction module for extracting the reverse n-gram language-model probability of the OOV candidate translation when it appears at the position of the <unk> mark in the continuous sequence of translated words;
a word-string count extraction module for extracting the number of word strings of the corresponding order that contain the OOV candidate translation when it appears at the position of the <unk> mark.
A context-sensitive method for translating out-of-vocabulary words in neural machine translation adopts the following technical scheme:
The OOV translation method includes:
a lookup step of searching the translation dictionary according to all source-side words;
a candidate-translation providing step of providing, according to the lookup result obtained by the lookup step, possible OOV candidate translations for the <unk> mark;
a feature extraction step of extracting contextual features for the candidate translations;
a ranking step of obtaining, for the contextual features, an evaluation score for each OOV candidate translation using a trained SVM-rank model, and sorting the OOV candidate translations by evaluation score from high to low;
a replacement step of replacing the <unk> mark in the sentence translation with the top-ranked OOV candidate translation, obtaining a complete translation that conforms to the context.
Further, the feature extraction step includes:
a word-alignment feature extraction step of extracting word-alignment features from the NMT attention alignment model;
a word-granularity feature extraction step of extracting word-granularity features of the source-side word and the OOV candidate translation;
a phrase-granularity feature extraction step of extracting phrase-granularity features of the source-side word and the OOV candidate translation;
a language-model feature extraction step of extracting language-model features near the OOV candidate translation when it appears at the position of the <unk> mark.
Further, the word-granularity feature extraction step includes:
a forward translation probability step for the probability that the source-side word translates into the OOV candidate translation;
a reverse translation probability step for the probability that the OOV candidate translation translates into the source-side word;
a source-word count extraction step of extracting the number of times the source-side word occurs in the NMT training parallel corpus;
an OOV candidate-translation count extraction step of extracting the number of times the OOV candidate translation occurs in the NMT training parallel corpus;
a co-occurrence count extraction step of extracting the number of times the source-side word and the OOV candidate translation co-occur in the parallel sentence pairs of the parallel corpus;
a vocabulary position extraction step of extracting the position at which the source-side word appears in the vocabulary;
a judgment step of judging whether the source-side word is itself an OOV word.
Further, the phrase-granularity feature extraction step includes:
a source-word in-phrase-table count extraction step of extracting the number of times the source-side word occurs in the phrase table;
an OOV candidate-translation in-phrase-table count extraction step (one) of extracting the number of times the OOV candidate translation occurs in the phrase table;
an in-phrase-table co-occurrence count extraction step of extracting the number of times the source-side word and the OOV candidate translation co-occur in individual phrase pairs of the phrase table;
a phrase count extraction step of extracting the number of times the phrase formed by the source-side word and its neighboring words occurs in the phrase table;
an OOV candidate-translation in-phrase-table count extraction step (two) of extracting, when the source-side word forms a phrase with its neighboring words, the number of times the OOV candidate translation appears in the corresponding target phrase;
an OOV candidate-translation phrase length extraction step of extracting the maximum length of the OOV candidate-translation phrase when the phrases formed by the source-side word and by the OOV candidate translation with their respective neighboring words appear in the phrase table as a phrase pair;
a source-word phrase length extraction step of extracting the length of the source-side phrase when such a phrase pair appears in the phrase table and the OOV candidate-translation phrase attains its maximum length.
Further, the language-model feature extraction step includes:
a forward n-gram language-model probability extraction step of extracting the forward n-gram language-model probability of the OOV candidate translation when it appears at the position of the <unk> mark in the continuous sequence of translated words;
a reverse n-gram language-model probability extraction step of extracting the reverse n-gram language-model probability of the OOV candidate translation when it appears at the position of the <unk> mark in the continuous sequence of translated words;
a word-string count extraction step of extracting the number of word strings of the corresponding order that contain the OOV candidate translation when it appears at the position of the <unk> mark.
Beneficial effects of the present invention:
The context-sensitive device and method for translating out-of-vocabulary words in neural machine translation of the present invention translate words in accordance with the context and semantics, yielding better BLEU scores and OOV recall for the translated words. On the Chinese-to-English translation task over the NIST data sets, the BLEU score and OOV recall are 33.405 and 6.53% respectively, improvements of 0.012 and 0.37% over the 33.393 and 6.16% of the prior-art greedy post-processing method; the translation quality of OOV words in NMT translation results is thereby markedly improved.
Brief description of the drawings
Fig. 1 is a structural schematic diagram of the context-sensitive device for translating out-of-vocabulary words in neural machine translation of the present invention.
Fig. 2 is a structural schematic diagram of the word-granularity feature extraction module of the present invention.
Fig. 3 is a structural schematic diagram of the phrase-granularity feature extraction module of the present invention.
Fig. 4 is a structural schematic diagram of the language-model feature extraction module of the present invention.
Fig. 5 is a schematic diagram of a worked example of the context-sensitive device for translating out-of-vocabulary words in neural machine translation of the present invention.
Embodiment
The present invention is further described below with reference to specific embodiments, but the present invention should not be limited by these examples.
Embodiment 1:
As shown in Figs. 1 to 4, a context-sensitive device for translating out-of-vocabulary words in neural machine translation adopts the following technical scheme:
The OOV translation device includes:
a lookup module that searches the translation dictionary according to all source-side words;
a candidate-translation providing module that, according to the lookup result obtained by the lookup module, provides possible OOV candidate translations for the <unk> mark;
a feature extraction module for extracting contextual features for the candidate translations;
a ranking module that, for the contextual features, obtains an evaluation score for each OOV candidate translation using the trained SVM-rank model, and sorts the OOV candidate translations by evaluation score from high to low;
a replacement module that replaces the <unk> mark in the sentence translation with the top-ranked OOV candidate translation, obtaining a complete translation that conforms to the context.
Wherein, the feature extraction module includes:
a word-alignment feature extraction module for extracting word-alignment features from the NMT attention alignment model;
a word-granularity feature extraction module for extracting word-granularity features of the source-side word and the OOV candidate translation;
a phrase-granularity feature extraction module for extracting phrase-granularity features of the source-side word and the OOV candidate translation;
a language-model feature extraction module for extracting language-model features near the OOV candidate translation when it appears at the position of the <unk> mark.
The word-granularity feature extraction module includes:
a forward translation probability module for the probability that the source-side word translates into the OOV candidate translation;
a reverse translation probability module for the probability that the OOV candidate translation translates into the source-side word;
a source-word count extraction module for extracting the number of times the source-side word occurs in the NMT training parallel corpus;
an OOV candidate-translation count extraction module for extracting the number of times the OOV candidate translation occurs in the NMT training parallel corpus;
a co-occurrence count extraction module for extracting the number of times the source-side word and the OOV candidate translation co-occur in the parallel sentence pairs of the parallel corpus;
a vocabulary position extraction module for extracting the position at which the source-side word appears in the vocabulary;
a judgment module for judging whether the source-side word is itself an OOV word.
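The word-granularity statistics above can be sketched as follows, computed from a toy parallel corpus. The corpus format, the maximum-likelihood translation probabilities, and the `src_vocab` list are illustrative assumptions, not the invention's exact implementation.

```python
# Word-granularity features for one (source word, OOV candidate) pair:
# forward/reverse translation probability, occurrence and co-occurrence
# counts in the parallel corpus, vocabulary position, and an OOV flag.

def word_features(src_word, cand, corpus, src_vocab):
    """corpus: list of (source_tokens, target_tokens) sentence pairs.
    src_vocab: frequency-ranked source vocabulary (a list of words)."""
    src_count = sum(s.count(src_word) for s, _ in corpus)
    cand_count = sum(t.count(cand) for _, t in corpus)
    cooc = sum(1 for s, t in corpus if src_word in s and cand in t)
    return {
        # Maximum-likelihood lexical translation probabilities.
        "fwd_prob": cooc / src_count if src_count else 0.0,
        "rev_prob": cooc / cand_count if cand_count else 0.0,
        "src_count": src_count,
        "cand_count": cand_count,
        "cooc_count": cooc,
        # Position of the source word in the vocabulary (-1 if absent).
        "vocab_pos": src_vocab.index(src_word) if src_word in src_vocab else -1,
        # Whether the source word is itself out of vocabulary.
        "src_is_oov": src_word not in src_vocab,
    }
```
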
The phrase-granularity feature extraction module includes:
a source-word in-phrase-table count extraction module for extracting the number of times the source-side word occurs in the phrase table;
an OOV candidate-translation in-phrase-table count extraction module (one) for extracting the number of times the OOV candidate translation occurs in the phrase table;
an in-phrase-table co-occurrence count extraction module for extracting the number of times the source-side word and the OOV candidate translation co-occur in individual phrase pairs of the phrase table;
a phrase count extraction module for extracting the number of times the phrase formed by the source-side word and its neighboring words occurs in the phrase table;
an OOV candidate-translation in-phrase-table count extraction module (two) for extracting, when the source-side word forms a phrase with its neighboring words, the number of times the OOV candidate translation appears in the corresponding target phrase;
an OOV candidate-translation phrase length extraction module for extracting the maximum length of the OOV candidate-translation phrase when the phrases formed by the source-side word and by the OOV candidate translation with their respective neighboring words appear in the phrase table as a phrase pair;
a source-word phrase length extraction module for extracting the length of the source-side phrase when such a phrase pair appears in the phrase table and the OOV candidate-translation phrase attains its maximum length.
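The phrase-table statistics above can be sketched as follows. The phrase-table format, a list of (source_phrase, target_phrase) token-list pairs, is an illustrative assumption; real phrase tables (e.g. Moses-style) also carry scores that are omitted here.

```python
# Phrase-granularity features for one (source word, OOV candidate) pair:
# occurrence counts in the phrase table, pairwise co-occurrence, and the
# maximum length of a target phrase containing the candidate whose source
# side contains the source word.

def phrase_features(src_word, cand, phrase_table):
    """phrase_table: list of (source_phrase, target_phrase) token lists."""
    src_count = sum(sp.count(src_word) for sp, _ in phrase_table)
    cand_count = sum(tp.count(cand) for _, tp in phrase_table)
    cooc = sum(1 for sp, tp in phrase_table
               if src_word in sp and cand in tp)
    # Longest target phrase containing the candidate, among phrase pairs
    # whose source side contains the source word.
    max_len = max((len(tp) for sp, tp in phrase_table
                   if src_word in sp and cand in tp), default=0)
    return {"src_count": src_count, "cand_count": cand_count,
            "cooc_count": cooc, "max_cand_phrase_len": max_len}
```
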
Wherein, the language-model feature extraction module includes:
a forward n-gram language-model probability extraction module for extracting the forward n-gram language-model probability of the OOV candidate translation when it appears at the position of the <unk> mark in the continuous sequence of translated words;
a reverse n-gram language-model probability extraction module for extracting the reverse n-gram language-model probability of the OOV candidate translation when it appears at the position of the <unk> mark in the continuous sequence of translated words;
a word-string count extraction module for extracting the number of word strings of the corresponding order that contain the OOV candidate translation when it appears at the position of the <unk> mark.
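The forward language-model feature above can be sketched with a toy bigram model. The maximum-likelihood estimation with add-one smoothing is an illustrative assumption; in practice a pre-trained n-gram model (e.g. from SRILM or KenLM) would supply these probabilities.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Collect unigram and bigram counts from tokenized sentences."""
    uni, bi = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi

def bigram_logprob(tokens, model):
    uni, bi = model
    toks = ["<s>"] + tokens
    lp = 0.0
    for a, b in zip(toks, toks[1:]):
        # Add-one smoothing over the observed unigram vocabulary.
        lp += math.log((bi[(a, b)] + 1) / (uni[a] + len(uni)))
    return lp

def lm_feature(translation, unk_index, cand, model):
    """Forward LM log-probability of the translation with the candidate
    substituted into the <unk> slot."""
    filled = translation[:unk_index] + [cand] + translation[unk_index + 1:]
    return bigram_logprob(filled, model)
```
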
A context-sensitive method for translating out-of-vocabulary words in neural machine translation adopts the following technical scheme:
The OOV translation method includes:
a lookup step of searching the translation dictionary according to all source-side words;
a candidate-translation providing step of providing, according to the lookup result obtained by the lookup step, possible OOV candidate translations for the <unk> mark;
a feature extraction step of extracting contextual features for the candidate translations;
a ranking step of obtaining, for the contextual features, an evaluation score for each OOV candidate translation using the trained SVM-rank model, and sorting the OOV candidate translations by evaluation score from high to low;
a replacement step of replacing the <unk> mark in the sentence translation with the top-ranked OOV candidate translation, obtaining a complete translation that conforms to the context.
Wherein, the feature extraction step includes:
a word-alignment feature extraction step of extracting word-alignment features from the NMT attention alignment model;
a word-granularity feature extraction step of extracting word-granularity features of the source-side word and the OOV candidate translation;
a phrase-granularity feature extraction step of extracting phrase-granularity features of the source-side word and the OOV candidate translation;
a language-model feature extraction step of extracting language-model features near the OOV candidate translation when it appears at the position of the <unk> mark.
The word-granularity feature extraction step includes:
a forward translation probability step for the probability that the source-side word translates into the OOV candidate translation;
a reverse translation probability step for the probability that the OOV candidate translation translates into the source-side word;
a source-word count extraction step of extracting the number of times the source-side word occurs in the NMT training parallel corpus;
an OOV candidate-translation count extraction step of extracting the number of times the OOV candidate translation occurs in the NMT training parallel corpus;
a co-occurrence count extraction step of extracting the number of times the source-side word and the OOV candidate translation co-occur in the parallel sentence pairs of the parallel corpus;
a vocabulary position extraction step of extracting the position at which the source-side word appears in the vocabulary;
a judgment step of judging whether the source-side word is itself an OOV word.
The phrase-granularity feature extraction step includes:
a source-word in-phrase-table count extraction step of extracting the number of times the source-side word occurs in the phrase table;
an OOV candidate-translation in-phrase-table count extraction step (one) of extracting the number of times the OOV candidate translation occurs in the phrase table;
an in-phrase-table co-occurrence count extraction step of extracting the number of times the source-side word and the OOV candidate translation co-occur in individual phrase pairs of the phrase table;
a phrase count extraction step of extracting the number of times the phrase formed by the source-side word and its neighboring words occurs in the phrase table;
an OOV candidate-translation in-phrase-table count extraction step (two) of extracting, when the source-side word forms a phrase with its neighboring words, the number of times the OOV candidate translation appears in the corresponding target phrase;
an OOV candidate-translation phrase length extraction step of extracting the maximum length of the OOV candidate-translation phrase when the phrases formed by the source-side word and by the OOV candidate translation with their respective neighboring words appear in the phrase table as a phrase pair;
a source-word phrase length extraction step of extracting the length of the source-side phrase when such a phrase pair appears in the phrase table and the OOV candidate-translation phrase attains its maximum length.
The language-model feature extraction step includes:
a forward n-gram language-model probability extraction step of extracting the forward n-gram language-model probability of the OOV candidate translation when it appears at the position of the <unk> mark in the continuous sequence of translated words;
a reverse n-gram language-model probability extraction step of extracting the reverse n-gram language-model probability of the OOV candidate translation when it appears at the position of the <unk> mark in the continuous sequence of translated words;
a word-string count extraction step of extracting the number of word strings of the corresponding order that contain the OOV candidate translation when it appears at the position of the <unk> mark.
The experimental results of the context-sensitive device and method for translating out-of-vocabulary words in neural machine translation described in this embodiment are shown in Table 1. In Table 1, ① denotes the NMT word-alignment feature, ② the word-granularity features, ③ the phrase-granularity features, and ④ the language-model features.
As can be seen from Table 1, the model trained with all features reaches the highest accuracy, 45.12%, which is 8.23% higher than the 36.89% of the greedy post-processing method.
Table 1: Effect of the model on constructed data for OOV post-processing
We also evaluated actual NMT translation results in an open setting; the experimental results are shown in Table 2. Here we compare against the greedy post-processing method and against directly deleting the <unk> marks during post-processing. The BLEU and Recall(OOV) values of the two methods in Table 2 are averages over all test sets. As can be seen from Table 2, our model exceeds the greedy OOV processing method on both Recall(OOV) and BLEU. This shows that the context-sensitive device and method for translating out-of-vocabulary words in neural machine translation of the present invention constitute a significant technological advance over the existing greedy post-processing method.
Table 2: Effect of the model with extended word-selection scope on real NMT translation results
The context-sensitive device and method for translating out-of-vocabulary words in neural machine translation of the present invention translate words in accordance with the context and semantics, yielding better BLEU scores and OOV recall for the translated words. On the Chinese-to-English translation task over the NIST data sets, the BLEU score and OOV recall are 33.405 and 6.53% respectively, improvements of 0.012 and 0.37% over the 33.393 and 6.16% of the prior-art greedy post-processing method; the translation quality of OOV words in NMT translation results is thereby markedly improved.
Embodiment 2
Embodiment 2 is that unregistered word is translated in being translated to a kind of neural network machine of context-sensitive described in embodiment 1
The further refinement of method, finds a most suitable word using contextual information and goes to replace in NMT translation results<unk>Mark
Note (is used for representing unregistered word) in NMT.The dictionary for translation that unregistered word translating equipment has been constructed according to former terminal word and in advance is carried
Take unregistered word candidate to translate, while recording the former terminal word for producing this unregistered word candidate translation, be not logged in for each
Word candidate translates and former terminal word is extracted 4 class contextual features from different angles to combining former sentence and translation result:NMT words
Alignment feature, word grain size characteristic, phrase grain size characteristic, language model feature, finally using all 4 classes of svm-rank models couplings
Feature goes sequence to obtain optimal substitute, and unregistered word interpretation method is turned in the neural network machine translation of the context-sensitive
Translate<unk>The detailed process of mark is as follows:
Given a sentence translation containing <unk> marks and its corresponding source sentence, the translation flow of this method is as follows:
Step 1: look up the translation dictionary over all source words to provide possible unregistered-word candidate translations for each <unk>.
Step 2: extract contextual features for each unregistered-word candidate translation of each <unk>.
Step 3: rank all unregistered-word candidate translations according to their contextual features using the trained SVM rank model, and replace the <unk> mark in the sentence translation with the highest-ranked word.
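The three steps above can be sketched as follows. The names `translation_dict`, `extract_features` and `ranker` are hypothetical stand-ins for the translation dictionary, the feature extractor and the trained SVM rank model, so this is only an illustrative outline of the flow, not the patented implementation:

```python
def replace_unks(source_words, target_words, attention, translation_dict,
                 extract_features, ranker):
    """Replace each <unk> in target_words with the best-ranked candidate."""
    result = list(target_words)
    for pos, word in enumerate(target_words):
        if word != "<unk>":
            continue
        # Step 1: collect candidate translations from every source word,
        # remembering which source word produced each candidate.
        candidates = []
        for src_pos, src in enumerate(source_words):
            for cand in translation_dict.get(src, []):
                candidates.append((cand, src, src_pos))
        if not candidates:
            continue
        # Step 2: extract contextual features for each candidate.
        scored = []
        for cand, src, src_pos in candidates:
            feats = extract_features(cand, src, src_pos, pos,
                                     source_words, target_words, attention)
            # Step 3: score with the trained ranking model.
            scored.append((ranker(feats), cand))
        # Replace <unk> with the highest-ranked candidate.
        result[pos] = max(scored)[1]
    return result
```

A caller would supply the dictionary built from word alignments, the four-class feature extractor and the SVM rank scorer; with those in place the loop simply fills every <unk> position independently.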
The SVM rank model belongs to the pairwise class of learning-to-rank methods: it is trained to sort a candidate list rather than to perform binary classification. The basic assumption of Rank SVM is that there exists a linear function f(x) = w^T x + b such that f(x_i) > f(x_j) whenever candidate i should be ranked above candidate j. SVM rank thus essentially fits a linear score; this score is not guaranteed to equal the true evaluation metric, but it is guaranteed that ranking the candidates by this score is consistent with ranking them by the true evaluation metric. The present invention adds slack variables to the SVM rank model to handle noise in the input and increase generalization ability; the mathematical form of the model after adding slack variables is:
min (1/2)‖w‖^2 + C Σ ξ_{i,j}
subject to w^T x_i ≥ w^T x_j + 1 − ξ_{i,j} and ξ_{i,j} ≥ 0 for all pairs (i, j) with y_i > y_j,
where x_i and y_i are the feature vector and evaluation score of candidate i respectively, x_j and y_j are the feature vector and evaluation score of candidate j respectively, and ξ_{i,j} is the slack variable.
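A minimal pairwise sketch of this idea (a simplification, not svm-rank itself): for every pair of candidates where i is rated above j, we take a subgradient step on the hinge loss of the score difference w·(x_i − x_j), mirroring the slack constraint above. All names and hyper-parameters here are illustrative.

```python
import random
import numpy as np

def train_pairwise_ranker(X, y, lr=0.1, epochs=200, seed=0):
    """X: (n, d) feature matrix; y: evaluation scores. Returns weights w."""
    random.seed(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    w = np.zeros(d)
    # All ordered pairs (i, j) where candidate i should outrank candidate j.
    pairs = [(i, j) for i in range(n) for j in range(n) if y[i] > y[j]]
    for _ in range(epochs):
        random.shuffle(pairs)
        for i, j in pairs:
            diff = X[i] - X[j]
            # Violated margin constraint w.(x_i - x_j) >= 1: take a hinge step.
            if w @ diff < 1.0:
                w += lr * diff
    return w

def rank_candidates(w, X):
    """Indices of candidates sorted best-first by score w.x."""
    scores = np.asarray(X, dtype=float) @ w
    return list(np.argsort(-scores))
```

The real svm-rank tool solves the regularized quadratic program above exactly; this perceptron-style loop only illustrates why pairwise constraints yield a consistent ordering.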
Once the SVM rank model is chosen, whether the input features are discriminative is the key factor deciding the quality of the model's performance.
The model training process is as follows:
1) SVM rank model training data set
This embodiment extracts 2.1 million Chinese-English parallel sentence pairs as NMT training data from seven LDC data sets: LDC2002E18, LDC2003E07, LDC2003E14, LDC2004T07, LDC2004T08, LDC2005T06 and LDC2005T10, containing 54 million Chinese words and 60 million English words respectively. From the NMT training corpus this embodiment filters out 250,000 parallel sentence pairs containing unregistered words, and constructs 320,000 unregistered-word post-processing training examples from them. In every training example, all words in the source sentence provide unregistered-word candidate translations for the <unk> mark; the candidate scope is the unregistered words among the top 100 words by translation probability in the translation dictionary. On average, each training example finally has 65 unregistered-word candidate translations.
Table 3: sample of the ranking-model training data
Table 3 shows a sample of the ranking-model training data. Columns 1, 2 and 3 are the sequence number, the candidate translation and the corresponding source word respectively. Columns 5 to 32 are the alignment features, word-granularity features, phrase-granularity features and language-model features respectively. Each candidate translation is obtained by looking up the source word in the translation dictionary. This embodiment obtains the attention alignment features of the training data by NMT forced decoding, gathers the word-granularity and language-model features by counting over the 2.1 million parallel sentence pairs, and extracts the phrase-granularity features from the phrase table built with Moses.
This embodiment uses the GIZA++ tool with the standard "grow-diag-final" method on the 2.1 million parallel sentence pairs to obtain a bidirectional word-alignment matrix. Based on this word-alignment result, this embodiment uses maximum likelihood estimation to calculate the forward translation probability from each source word to each target word and the reverse translation probability from each target word to each source word; each word keeps at most 200 candidate translations in the dictionary. Finally, this embodiment obtains two translation dictionaries, source-to-target and target-to-source, which are used to provide unregistered-word candidates and to extract the forward and reverse translation-probability features.
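Assuming the word alignments are already available (here as lists of (source index, target index) pairs per sentence), the maximum-likelihood dictionary step reduces to relative-frequency counting; the data layout below is a hypothetical sketch, not the patented code.

```python
from collections import Counter, defaultdict

def build_translation_dict(aligned_pairs, max_candidates=200):
    """aligned_pairs: iterable of (src_sent, tgt_sent, alignment) triples,
    where alignment is a list of (source index, target index) links.
    Returns {source word: [(target word, p(t|s), p(s|t)), ...]}."""
    cooc = Counter()          # counts of aligned (source, target) word pairs
    src_count = Counter()
    tgt_count = Counter()
    for src_sent, tgt_sent, alignment in aligned_pairs:
        for i, j in alignment:
            s, t = src_sent[i], tgt_sent[j]
            cooc[(s, t)] += 1
            src_count[s] += 1
            tgt_count[t] += 1
    table = defaultdict(list)
    for (s, t), c in cooc.items():
        p_ts = c / src_count[s]           # forward probability  p(t|s)
        p_st = c / tgt_count[t]           # reverse probability  p(s|t)
        table[s].append((t, p_ts, p_st))
    for s in table:                        # keep at most 200 candidates,
        table[s] = sorted(table[s], key=lambda e: -e[1])[:max_candidates]
    return dict(table)
```

Running the same routine with source and target swapped yields the second, target-to-source dictionary.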
In addition, this embodiment extracts four classes of contextual features from different perspectives, as shown in Fig. 5. The four classes are: 1. the word-alignment feature extracted from the NMT attention alignment model; 2. word-granularity features of the source word and the unregistered-word candidate translation; 3. phrase-granularity features of the source word and the unregistered-word candidate translation; 4. language-model features of the neighbourhood of the unregistered-word candidate translation when it appears at the <unk> mark position.
As shown in Fig. 2, taking the source sentence in Fig. 2 as an example:
1. NMT word-alignment feature
For each candidate translation and the source word that produced it, we first extract an NMT word-alignment feature. This feature is the attention score produced when NMT generates the <unk>; it represents the probability that the <unk> in the translation result aligns to that source word. This score is produced inside the NMT model and is also important information connecting the <unk> with the source word.
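Assuming the decoder exposes its attention weights as a target-by-source matrix, extracting this feature reduces to reading one entry of the row for the <unk> position; this is an illustrative sketch, not the patented code.

```python
import numpy as np

def alignment_feature(attention, unk_pos, src_pos):
    """Attention score linking the <unk> at target position unk_pos
    to the source word at src_pos."""
    row = np.asarray(attention[unk_pos], dtype=float)
    row = row / row.sum()        # renormalise in case scores are unnormalised
    return float(row[src_pos])

def most_likely_source(attention, unk_pos):
    """Source position the <unk> most probably aligns to."""
    return int(np.argmax(attention[unk_pos]))
```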
2. Word-granularity features
For each source word and its corresponding candidate translation, we first extract the co-occurrence relation of the two words in the corpus, together with their individual statistics in the corpus. This embodiment extracts 7 word-granularity contextual features:
● p(t|s): forward translation probability from the source word to the candidate translation.
● p(s|t): reverse translation probability from the candidate translation to the source word.
● number_in_corpus(s): number of times the source word occurs in the NMT training parallel corpus.
● number_in_corpus(t): number of times the candidate translation occurs in the NMT training parallel corpus.
● number_cooc_in_corpus(s,t): number of times the source word and the candidate translation co-occur in parallel sentence pairs of the corpus.
● freq_in_vocab(s): position of the source word in the vocabulary sorted from high to low by word frequency in the parallel corpus.
● 1 if s is OOV else 0: whether the source word is an unregistered word; the feature value is 1 if it is, otherwise 0.
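The seven features above can be assembled from pre-computed corpus statistics. In this sketch the probability tables, count tables, frequency-ranked vocabulary and limited NMT vocabulary are hypothetical inputs standing in for statistics gathered over the 2.1-million-sentence training corpus.

```python
def word_features(s, t, p_ts, p_st, src_count, tgt_count, cooc,
                  vocab_rank, nmt_vocab):
    """s: source word, t: candidate translation.
    p_ts/p_st: translation probability tables keyed by (s, t);
    src_count/tgt_count/cooc: occurrence and co-occurrence counts;
    vocab_rank: position in the frequency-sorted vocabulary;
    nmt_vocab: the limited NMT vocabulary (words outside it are OOV)."""
    return {
        "p(t|s)": p_ts.get((s, t), 0.0),
        "p(s|t)": p_st.get((s, t), 0.0),
        "number_in_corpus(s)": src_count.get(s, 0),
        "number_in_corpus(t)": tgt_count.get(t, 0),
        "number_cooc_in_corpus(s,t)": cooc.get((s, t), 0),
        # Unseen words fall to the end of the frequency-sorted vocabulary.
        "freq_in_vocab(s)": vocab_rank.get(s, len(vocab_rank)),
        "1 if s is OOV else 0": 0 if s in nmt_vocab else 1,
    }
```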
3. Phrase-granularity features
We further capture the co-occurrence relations and statistics between the phrases formed by the source word and the candidate translation with their neighbouring words. These features are counted and extracted from the phrase translation table generated by the statistical machine translation tool Moses. This embodiment extracts 7 phrase-granularity contextual features:
● number_in_phrase_table(s): number of times the source word occurs in the phrase table.
● number_in_phrase_table(t): number of times the candidate translation occurs in the phrase table.
● number_cooc_in_phrase_table(s,t): number of times the source word and the candidate translation co-occur in phrase pairs of the phrase table.
● number_in_phrase_table(phrase(s)): number of times the phrase formed by the source word with its neighbouring words occurs in the phrase table.
● number_in_phrase_table(phrase(s)) if t in phrase table: number of times the candidate translation appears in the corresponding target phrase when the source word forms a phrase with its neighbouring words.
● max_length(t) if cooc(phrase(s), phrase(t)): maximum length of the candidate-translation phrase when the phrases formed by the source word and the candidate translation with their respective neighbouring words appear as a pair in the phrase table.
● length(s) if max_length(t) and cooc(phrase(s), phrase(t)): length of the source-word phrase when the phrases formed by the source word and the candidate translation with their respective neighbouring words appear as a pair in the phrase table and the candidate-translation phrase attains its maximum length.
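A sketch of a subset of these phrase features over a toy phrase table, represented here simply as (source phrase, target phrase) string pairs (real Moses phrase tables carry additional score fields). `src_phrase` stands for the phrase that s forms with its neighbours in the current sentence; all names are illustrative.

```python
def phrase_features(s, t, phrase_table, src_phrase):
    """Count-based phrase features; phrase_table is a list of
    (source_phrase, target_phrase) pairs of space-separated words."""
    # Target sides of all phrase pairs whose source side contains s.
    in_src = [tp for sp, tp in phrase_table if s in sp.split()]
    return {
        "number_in_phrase_table(s)": len(in_src),
        "number_in_phrase_table(t)": sum(
            1 for sp, tp in phrase_table if t in tp.split()),
        "number_cooc_in_phrase_table(s,t)": sum(
            1 for tp in in_src if t in tp.split()),
        "number_in_phrase_table(phrase(s))": sum(
            1 for sp, tp in phrase_table if sp == src_phrase),
        "number_in_phrase_table(phrase(s)) if t": sum(
            1 for sp, tp in phrase_table
            if sp == src_phrase and t in tp.split()),
    }
```

The two length-conditioned features would additionally require scanning the matching phrase pairs for the longest target phrase containing t, which is omitted here for brevity.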
4. Language-model features
The language model is an important feature representing the fluency of a word in context. Centred on the candidate translation, this embodiment extracts 15 language-model features from the words before and after the <unk>. For the 5 consecutive translation words A B OOV C D:
● p(OOV|B), p(C|OOV): forward 2-gram language-model features containing the OOV.
● p(B|OOV), p(OOV|C): reverse 2-gram language-model features containing the OOV.
● p(OOV|B,A), p(C|OOV,B), p(D|C,OOV): forward 3-gram language-model features containing the OOV.
● p(A|B,OOV), p(B|OOV,C), p(OOV|C,D): reverse 3-gram language-model features containing the OOV.
● count(B OOV), count(OOV C): counts of the 2-gram word strings containing the OOV.
● count(A B OOV), count(B OOV C), count(OOV C D): counts of the 3-gram word strings containing the OOV.
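Assuming an n-gram language model that exposes a conditional probability `lm_prob(context, word)` and an `ngram_count` lookup (both hypothetical stand-ins for a model estimated on the training corpus), a subset of the fifteen features above can be sketched as:

```python
def lm_features(window, lm_prob, ngram_count):
    """window: the 5 words [A, B, OOV, C, D], with the candidate translation
    substituted at the <unk> position in the middle."""
    A, B, O, C, D = window
    return {
        # Forward 2-gram and 3-gram probabilities around the candidate.
        "p(OOV|B)":   lm_prob((B,), O),
        "p(C|OOV)":   lm_prob((O,), C),
        "p(OOV|A,B)": lm_prob((A, B), O),
        "p(C|B,OOV)": lm_prob((B, O), C),
        "p(D|OOV,C)": lm_prob((O, C), D),
        # Raw n-gram counts containing the candidate.
        "count(B OOV)":   ngram_count((B, O)),
        "count(OOV C)":   ngram_count((O, C)),
        "count(B OOV C)": ngram_count((B, O, C)),
    }
```

The remaining reverse-direction probabilities follow the same pattern with a language model trained on reversed text.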
Although the present invention is disclosed above with preferred embodiments, they are not intended to limit the present invention. Anyone familiar with this technology may make various changes and modifications without departing from the spirit and scope of the present invention; therefore, the protection scope of the present invention shall be defined by the claims.