CN108038111A

CN108038111A - Method and system for establishing a machine translation pipeline, computer program, and computer

Info

Publication number: CN108038111A
Application number: CN201711309530.4A
Authority: CN
Inventors: 汪鸣; 汪一鸣; 程国艮
Original assignee: Chinese Translation Language Through Polytron Technologies Inc
Current assignee: Chinese Translation Language Through Polytron Technologies Inc
Priority date: 2017-12-11
Filing date: 2017-12-11
Publication date: 2018-05-15

Abstract

The invention belongs to the technical field of computer software and discloses a method and system for establishing a machine translation pipeline, a computer program, and a computer. The method reuses the bilingual parallel corpus a second time; after a correction model corrects the first-pass translation, new words or phrase combinations are produced; a translation pipeline joins the trained conventional machine translation model with the trained translation correction model, and data to be translated passes through the pipeline in sequence to obtain the translation. Because the two models are joined in a translation pipeline, the final translation is obtained simply by passing the data to be translated through the pipeline in sequence. The translation accuracy of the translation system can be improved by more than 5%.

Description

Method and system for establishing a machine translation pipeline, computer program, and computer

Technical field

The invention belongs to the technical field of computer software, and in particular relates to a method and system for establishing a machine translation pipeline, a computer program, and a computer.

Background art

Machine translation is the process of using a computer to translate one natural language (the source language) into another natural language (the target language), and it has significant scientific and practical value. Research on machine translation dates back to the 1930s and 1940s and has passed through rule-based, statistical, and neural network-based approaches. The main idea of statistical machine translation is to perform statistical analysis on large amounts of parallel data, build a translation model, and combine it with a language model, a reordering model, and so on to score candidate translations of the sentence to be translated. The main idea of neural machine translation is to convert a source-language sentence of arbitrary length into a floating-point vector of a specific dimension, transform it layer by layer through the network into a specific representation, and finally generate the target-language translation. As a result, machine translation no longer stops at simple literal matching and has begun to reach the semantic level. Existing machine translation mainly uses data-driven techniques, learning relevant information from bilingual parallel sentence pairs to produce the final translation model.

Existing machine translation still has defects. Taking statistical machine translation as an example: on the one hand, when the translation model is built, each sentence pair in the parallel corpus is used effectively only once in producing the final model, so the information in the parallel corpus is not fully exploited. On the other hand, research and industry have proposed many post-processing methods for machine translation output, but these methods cannot produce new phrases or word combinations. For example, reranking methods merely use extra parameters or other models to reorder the top-N candidate translations produced by machine translation. This process only reorders or rescores the original translation results and does not produce new phrase combinations.

In summary, the problem with the prior art is that existing machine translation cannot make full use of the information in the parallel corpus and cannot produce new phrases or word combinations.

Summary of the invention

In view of the problems in the prior art, the present invention provides a method and system for establishing a machine translation pipeline, a computer program, and a computer.

The present invention is realized as follows. A method for establishing a machine translation pipeline reuses the bilingual parallel corpus a second time; after a correction model corrects the first-pass translation, new words or phrase combinations are produced; a translation pipeline joins the trained conventional machine translation model with the trained translation correction model, and data to be translated passes through the translation pipeline in sequence to obtain the translation.

Further, the method for establishing a machine translation pipeline comprises the following steps:

First step: for the preprocessed bilingual parallel corpus (S, T), word alignments are obtained using IBM Model 1 through IBM Model 5 and a hidden Markov model (HMM), and the final phrase model is then obtained from the phrase frequencies and the alignments; for the preprocessed bilingual parallel corpus (S, T), the final reordering model is estimated using a distance-based cost; for the preprocessed monolingual target-language corpus, all phrases and their N-gram combinations are counted to produce the final language model.
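As a rough illustration of the first step, the sketch below estimates phrase translation probabilities by relative frequency and computes a distance-based reordering cost. It assumes that phrase pairs have already been extracted from the word alignments. The function names, the decay parameter `alpha`, the exponential form of the cost, and the toy phrase pairs are illustrative assumptions, not details given in this document.

```python
from collections import Counter


def estimate_phrase_table(phrase_pairs):
    """Relative-frequency estimate of P(source phrase | target phrase).

    phrase_pairs: iterable of (source_phrase, target_phrase) tuples, assumed
    to have been extracted beforehand from the word-aligned corpus.
    """
    pair_counts = Counter(phrase_pairs)
    target_counts = Counter()
    for (_, target), count in pair_counts.items():
        target_counts[target] += count
    return {
        (source, target): count / target_counts[target]
        for (source, target), count in pair_counts.items()
    }


def distortion_cost(start_i, end_prev, alpha=0.5):
    """Distance-based cost, assumed form: alpha ** |start_i - end_prev - 1|."""
    return alpha ** abs(start_i - end_prev - 1)


# Toy, hypothetical phrase pairs: (source phrase, target phrase).
toy_pairs = [("f1 f2", "e1 e2"), ("f1 f2", "e1 e2"), ("f3", "e1 e2"), ("f3", "e3")]
phrase_table = estimate_phrase_table(toy_pairs)
```

Under this assumed form, a monotone continuation (`start_i == end_prev + 1`) costs 1.0, and each skipped source position multiplies the cost by `alpha`.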

Second step: train the translation correction model. Before model training, all source-language sentences S in the parallel corpus are translated into target-language sentences T' using the phrase model, the language model, and the reordering model; once the translations T' of the source-language sentences have been obtained, T' and the target-language sentences T of the original bilingual corpus are treated as a bilingual parallel corpus, and a new translation model is trained with T' and T as the new bilingual parallel corpus.
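The corpus construction in the second step can be sketched as follows. This is a minimal illustration, assuming the first-pass system is available as a callable; `build_correction_corpus`, `toy_translate`, and the toy lexicon are hypothetical names introduced here and do not come from this document.

```python
def build_correction_corpus(source_sentences, target_sentences, translate):
    """Pair each first-pass translation T' with its reference translation T.

    translate: callable standing in for the first-pass system (phrase model,
    language model, and reordering model combined); assumed, not specified.
    """
    if len(source_sentences) != len(target_sentences):
        raise ValueError("both sides of the parallel corpus must have equal length")
    first_pass = [translate(sentence) for sentence in source_sentences]
    # The new parallel corpus has T' as its source side and T as its target side.
    return list(zip(first_pass, target_sentences))


def toy_translate(sentence):
    """Hypothetical word-by-word stand-in for the first-pass translation system."""
    toy_lexicon = {"f1": "e1", "f2": "e2"}
    return " ".join(toy_lexicon.get(word, word) for word in sentence.split())


correction_corpus = build_correction_corpus(["f1 f2"], ["e2 e1"], toy_translate)
```

Training on such (T', T) pairs is what lets the correction model learn mappings from first-pass output to reference wording, including phrase combinations that the first pass never produced.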

Further, in the second step, decoding during translation follows the formula:
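The formula itself is not reproduced in this text. Assuming it is the standard phrase-based decoding objective, which is consistent with the symbol definitions that follow, it could be written as below. The symbol φ for the phrase translation probability, the bounds I and |e|, and the argument of d are assumptions; the index i ranges over phrases in the first product and over words in the language model term.

```latex
e_{\mathrm{best}} = \arg\max_{e} \prod_{i=1}^{I} \phi(f_i \mid e_i)\, d(\mathit{start}_i - \mathit{end}_{i-1} - 1) \prod_{i=1}^{|e|} P_{LM}(e_i \mid e_1 \ldots e_{i-1})
```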

Wherein f_iFor i-th of phrase of original language, e_iFor i-th of phrase translation of original language into object language phrase,What is represented is the probability that source language phrase matches with object language phrase,It can be translated by query phrase Model obtains；start_iWith end_i-1The first word for representing source language phrase respectively is translated into i-th of object phrase Put and original language last word corresponds to the position of translation, (d) then represents target language word built-up sequence after translation Score, score are obtained by inquiring about corresponding sequencing model；P_LM(e_i|e₁...e_i-1) then represent translation object phrase language Model score, e_iI-th of word for being translated phrase is then represented, score value is obtained by query language model.

Another object of the present invention is to provide a method of translating that uses the machine translation pipeline establishment method. For a sentence s to be translated, the sentence is first fed as input to the three trained conventional models (phrase model, language model, and reordering model) to obtain the corresponding target-language translation t; the translation t is then fed as input to the three trained correction models (phrase model, language model, and reordering model) to obtain the corrected translation t', which is the translation result.

Another object of the present invention is to provide a machine translation pipeline establishment system for the machine translation pipeline establishment method, the system comprising:

a translation model training module, for training the conventional translation model;

a correction model training module, for training the translation correction model;

a creation module, for creating the translation pipeline.

Another object of the present invention is to provide a computer program implementing the machine translation pipeline establishment method.

Another object of the present invention is to provide a computer equipped with the computer program.

Another object of the present invention is to provide a computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the machine translation pipeline establishment method.

The present invention extends conventional statistical machine translation systems, which translate only once. It reuses the bilingual parallel corpus a second time so that the corpus is fully exploited, and after the correction model corrects the first-pass translation, new words or phrase combinations are produced. The two models are joined by establishing a translation pipeline, and the final translation is obtained simply by passing the data to be translated through the pipeline in sequence. The translation accuracy of the translation system can be improved by more than 5%.

Brief description of the drawings

Fig. 1 is a flowchart of the machine translation pipeline establishment method provided by an embodiment of the present invention.

Fig. 2 is a structural diagram of the machine translation pipeline establishment system provided by an embodiment of the present invention.

In the figures: 1, translation model training module; 2, correction model training module; 3, creation module.

Detailed description of the embodiments

In order to make the purpose, technical solution, and advantages of the present invention clearer, the present invention is described in further detail below with reference to embodiments. It should be understood that the specific embodiments described herein merely illustrate the present invention and are not intended to limit it.

The application principle of the present invention is explained in detail below with reference to the accompanying drawings.

As shown in Fig. 1, the machine translation pipeline establishment method provided by an embodiment of the present invention comprises the following steps:

S101: the bilingual parallel corpus is reused a second time; after the correction model corrects the first-pass translation, new words or phrase combinations are produced;

S102: the two models are joined by establishing a translation pipeline, and data to be translated passes through the translation pipeline in sequence to obtain the final translation.

As shown in Fig. 2, the machine translation pipeline establishment system provided by an embodiment of the present invention comprises:

Translation model training module 1, for training the conventional translation model.

Correction model training module 2, for training the translation correction model.

Creation module 3, for creating the translation pipeline.

The application principle of the present invention is further described below with reference to a specific embodiment.

The machine translation pipeline establishment method provided by an embodiment of the present invention comprises the following steps:

First step: train the conventional machine translation model. Taking statistical machine translation as an example, statistical machine translation adopts the noisy channel model. For the preprocessed bilingual parallel corpus (S, T), phrase alignments are obtained using IBM Model 1 through IBM Model 5 and an HMM (hidden Markov model), and the final phrase translation model and reordering model are then obtained from the phrase frequencies and the alignments. For the preprocessed monolingual target-language corpus, all phrases and their N-gram combinations are counted to produce the final language model.
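The N-gram counting used for the language model can be sketched as below. This is a minimal, unsmoothed maximum-likelihood estimate over whitespace-tokenized sentences; the boundary markers `<s>` and `</s>`, the function names, and the toy corpus are assumptions made for illustration, and a practical system would add smoothing.

```python
from collections import Counter


def train_ngram_lm(sentences, n=2):
    """Count N-grams and their (N-1)-word contexts over a monolingual corpus."""
    ngrams = Counter()
    contexts = Counter()
    for sentence in sentences:
        # Pad so that the first real word also has a full context.
        tokens = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1:i])
            ngrams[context + (tokens[i],)] += 1
            contexts[context] += 1
    return ngrams, contexts


def lm_probability(word, context, ngrams, contexts):
    """Unsmoothed estimate: count(context + word) / count(context)."""
    context = tuple(context)
    if contexts[context] == 0:
        return 0.0
    return ngrams[context + (word,)] / contexts[context]


# Toy, hypothetical target-language corpus.
toy_corpus = ["the cat sat", "the dog sat"]
bigrams, bigram_contexts = train_ngram_lm(toy_corpus, n=2)
p_cat_given_the = lm_probability("cat", ["the"], bigrams, bigram_contexts)
```

Here `p_cat_given_the` is 0.5 because "the" is followed by "cat" in one of its two occurrences.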

Second step: train a translation correction model. Before model training, in order to make full use of the bilingual parallel corpus, all source-language sentences S in the parallel corpus must be translated into target-language sentences T' using the phrase model, the language model, and the reordering model; during translation, decoding follows the formula below:

Wherein f_iFor i-th of phrase of original language, e_iFor i-th of phrase translation of original language into object language phrase,What is represented is the probability that source language phrase matches with object language phrase,It can be translated by query phrase Model obtains.start_iWith end_i-1The first word for representing source language phrase respectively is translated into i-th of object phrase Put and original language last word corresponds to the position of translation, (d) then represents target language word built-up sequence after translation Score, the score can be obtained by inquiring about corresponding sequencing model.P_LM(e_i|e₁...e_i-1) then represent the object phrase of translation Language model scores, e_iI-th of word for being translated phrase is then represented, which can be obtained by query language model.Obtaining The translation translation T ' of bilingual corpora source language sentence afterwards, by assuming that in T ' and original bilingual corpora object language sentence T For bilingual parallel corporas, translation T ' will be translated with target language sentence T in former bilingual parallel corporas as new bilingual parallel language Material one new translation model of training, the model can be used as the correction model of translation translation.

Third step: establish the translation pipeline. For a sentence s to be translated, the sentence is first fed as input to the three models trained in the first step (phrase model, language model, and reordering model) to obtain the corresponding target-language translation t; the translation t is then fed as input to the three models trained in the second step (phrase model, language model, and reordering model) to obtain the corrected translation t', which is the final translation result.
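The three steps above end in a simple composition of two translation stages, which can be sketched as follows. The sketch assumes each trained system is exposed as a callable from string to string; `make_pipeline` and the two toy stages are hypothetical stand-ins and not the trained models themselves.

```python
from typing import Callable, Sequence

Translator = Callable[[str], str]


def make_pipeline(stages: Sequence[Translator]) -> Translator:
    """Chain translation stages so that each stage's output feeds the next."""
    def run(sentence: str) -> str:
        for stage in stages:
            sentence = stage(sentence)
        return sentence
    return run


def toy_first_pass(sentence: str) -> str:
    """Hypothetical stand-in for the conventional model: s -> t."""
    toy_lexicon = {"f1": "e1", "f2": "e2"}
    return " ".join(toy_lexicon.get(word, word) for word in sentence.split())


def toy_correction(translation: str) -> str:
    """Hypothetical stand-in for the correction model: t -> t'."""
    return translation.replace("e1 e2", "e2 e1")


pipeline = make_pipeline([toy_first_pass, toy_correction])
corrected = pipeline("f1 f2")
```

Keeping the stages as interchangeable callables means the correction model can be retrained or replaced without touching the first-pass system.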

The above embodiments may be implemented wholly or partly in software, hardware, firmware, or any combination thereof. When implemented wholly or partly as a computer program product, the computer program product comprises one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the flows or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any usable medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more usable media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)).

The foregoing describes only preferred embodiments of the present invention and is not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for establishing a machine translation pipeline, characterized in that the method reuses the bilingual parallel corpus a second time; after a correction model corrects the first-pass translation, new words or phrase combinations are produced; a translation pipeline joins the trained conventional machine translation model with the trained translation correction model, and data to be translated passes through the translation pipeline in sequence to obtain the translation.

2. The method for establishing a machine translation pipeline according to claim 1, characterized in that the method comprises the following steps:

First step: for the preprocessed bilingual parallel corpus (S, T), word alignments are obtained using IBM Model 1 through IBM Model 5 and a hidden Markov model (HMM), and the final phrase model is then obtained from the phrase frequencies and the alignments; for the preprocessed bilingual parallel corpus (S, T), the final reordering model is estimated using a distance-based cost; for the preprocessed monolingual target-language corpus, all phrases and their N-gram combinations are counted to produce the final language model.

Second step: train the translation correction model. Before model training, all source-language sentences S in the parallel corpus are translated into target-language sentences T' using the phrase model, the language model, and the reordering model; once the translations T' of the source-language sentences have been obtained, T' and the target-language sentences T of the original bilingual corpus are treated as a bilingual parallel corpus, and a new translation model is trained with T' and T as the new bilingual parallel corpus.

3. The method for establishing a machine translation pipeline according to claim 2, characterized in that, in the second step, decoding during translation follows the formula:

Wherein f_iFor i-th of phrase of original language, e_iFor i-th of phrase translation of original language into object language phrase,Table What is shown is the probability that source language phrase matches with object language phrase,It can be obtained by query phrase translation model Arrive；start_iWith end_i-1Represent respectively source language phrase first word be translated into object phrase i-th of position and Last word of original language corresponds to the position of translation, and (d) then represents the score of target language word built-up sequence after translation, obtain Divide and obtained by inquiring about corresponding sequencing model；

P_LM(e_i|e₁...e_i-1) then represent translation object phrase language model scores, e_iThen represent and be translated the i-th of phrase A word, score value are obtained by query language model.

4. A method of translating that uses the method for establishing a machine translation pipeline according to any one of claims 1 to 3, characterized in that, for a sentence s to be translated, the sentence is first fed as input to the three trained conventional models (phrase model, language model, and reordering model) to obtain the corresponding target-language translation t; the translation t is then fed as input to the three trained correction models (phrase model, language model, and reordering model) to obtain the corrected translation t', which is the translation result.

5. A machine translation pipeline establishment system for the method for establishing a machine translation pipeline according to claim 1, characterized in that the system comprises:

a creation module, for creating the translation pipeline.

6. A computer program implementing the method for establishing a machine translation pipeline according to any one of claims 1 to 3.

7. A computer equipped with the computer program according to claim 6.

8. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method for establishing a machine translation pipeline according to any one of claims 1 to 3.