CN108038111A - Method and system for establishing a machine translation pipeline, computer program, and computer - Google Patents

Method and system for establishing a machine translation pipeline, computer program, and computer

Info

Publication number
CN108038111A
Authority
CN
China
Prior art keywords
translation
model
language
phrase
pipeline
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711309530.4A
Other languages
Chinese (zh)
Inventor
汪鸣
汪一鸣
程国艮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chinese Translation Language Through Polytron Technologies Inc
Original Assignee
Chinese Translation Language Through Polytron Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chinese Translation Language Through Polytron Technologies Inc filed Critical Chinese Translation Language Through Polytron Technologies Inc
Priority to CN201711309530.4A priority Critical patent/CN108038111A/en
Publication of CN108038111A publication Critical patent/CN108038111A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention belongs to the technical field of computer software and discloses a method and system for establishing a machine translation pipeline, a computer program, and a computer. The method uses the bilingual parallel corpus a second time: a correction model corrects the first-pass translation and, in doing so, produces new words or phrase combinations. A translation pipeline joins the conventionally trained machine translation model with the trained translation correction model, so data to be translated only needs to pass through the pipeline in sequence to obtain the final translation. The translation accuracy of the translation system can be improved by more than 5%.

Description

Method and system for establishing a machine translation pipeline, computer program, and computer
Technical field
The invention belongs to the technical field of computer software, and in particular relates to a method and system for establishing a machine translation pipeline, a computer program, and a computer.
Background
Machine translation is the process of using a computer to translate one natural language (the source language) into another natural language (the target language), and it has significant scientific and practical value. Research on machine translation dates back to the 1930s and 1940s and has passed through three main stages: rule-based, statistical, and neural-network-based machine translation. The main idea of statistical machine translation is to perform statistical analysis on a large amount of parallel data, build a translation model, and combine it with a language model, a reordering model, and so on to score candidate translations of the sentence to be translated. The main idea of neural machine translation is to convert a source-language sentence of arbitrary length into a floating-point vector of fixed dimension, transform it layer by layer through the network into a particular representation, and finally generate the target-language translation. As a result, machine translation no longer stops at literal matching and has begun to reach the semantic level. Existing machine translation is mainly data-driven: it learns the relevant information from bilingual parallel sentence pairs and generates the final translation model. Existing machine translation still has defects. Taking statistical machine translation as an example: on the one hand, when the translation model is built, each sentence pair in the parallel corpus is effectively used only once in producing the final model, so the information in the parallel corpus is not fully exploited. On the other hand, research and industry have proposed many post-processing methods for machine translation output, but these methods cannot produce new phrases or word combinations. For example, reranking methods merely use additional parameters or other models to reorder the top-N candidate translations produced by machine translation. This process only reorders or rescores the original translation results and does not produce new phrase combinations.
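The limitation of reranking described above can be illustrated with a minimal Python sketch. The candidate list, its scores, and the `bonus` feature are hypothetical and chosen only for illustration; the point is that the output is always a permutation of the input candidates.

```python
from typing import Callable, List, Tuple


def rerank_nbest(
    candidates: List[Tuple[str, float]],
    extra_score: Callable[[str], float],
    weight: float = 1.0,
) -> List[str]:
    """Reorder an N-best list by base score plus a weighted extra feature."""
    rescored = [(text, base + weight * extra_score(text)) for text, base in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in rescored]


# Hypothetical 3-best list with decoder log-scores.
nbest = [
    ("he go to school", -2.0),
    ("he goes to school", -2.3),
    ("he going school", -3.1),
]


def bonus(text: str) -> float:
    """Hypothetical extra feature: reward candidates containing 'goes'."""
    return 1.0 if "goes" in text.split() else 0.0


reranked = rerank_nbest(nbest, bonus)
```

Reranking changes which candidate comes first, but it cannot output a sentence that was not already in the N-best list, which is the gap the correction model is meant to address.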
In summary, the problems in the prior art are that existing machine translation cannot make full use of the information in the parallel corpus and cannot produce new phrases or word combinations.
Summary of the invention
In view of the problems in the prior art, the present invention provides a method and system for establishing a machine translation pipeline, a computer program, and a computer.
The present invention is implemented as follows. A method for establishing a machine translation pipeline uses the bilingual parallel corpus a second time; a correction model corrects the first-pass translation and thereby produces new words or phrase combinations; and a translation pipeline joins the conventionally trained machine translation model with the trained translation correction model, so that data to be translated passes through the pipeline in sequence to yield the translation.
Further, the method for establishing a machine translation pipeline comprises the following steps:
Step 1: for the preprocessed bilingual parallel corpus (S, T), use IBM Model 1 through IBM Model 5 and a hidden Markov model (HMM) to obtain word alignments, and then derive the final phrase model from the phrase occurrence frequencies and the alignments; for the preprocessed bilingual parallel corpus (S, T), use distance-based cost estimation to obtain the final reordering model; for the preprocessed target-language monolingual corpus, count all phrases and their N-gram combinations to produce the final language model.
Step 2: train the translation correction model. Before model training, translate all source-language sentences S in the parallel corpus into target-language sentences T' using the phrase model, the language model, and the reordering model. Once the translations T' of the source-language sentences have been obtained, treat T' and the target-language sentences T of the original bilingual corpus as a new bilingual parallel corpus, and train a new translation model on it.
Further, in the second step, decoding during translation is performed according to the formula:
where f_i is the i-th source-language phrase and e_i is the target-language phrase into which the i-th source phrase is translated; the phrase translation probability is the probability that the source phrase matches the target phrase and is obtained by querying the phrase translation model; start_i and end_{i-1} denote, respectively, the position of the first word of the source phrase translated into the i-th target phrase and the position of the last word of the source phrase translated into the (i-1)-th target phrase; d denotes the score of the target-language word order after translation and is obtained by querying the corresponding reordering model; P_LM(e_i | e_1 ... e_{i-1}) denotes the language model score of the translated target phrase, where e_i in this term denotes the i-th word of the translated phrase, and is obtained by querying the language model.
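As a rough illustration of how the three scores defined above could be combined for one candidate derivation, the following Python sketch sums them in log space. It rests on several assumptions that are not in the source: the reordering score is taken to be an exponential distance penalty alpha ** |start_i - end_{i-1} - 1|, the language model is a bigram table with a floor probability for unseen entries, and all names and toy values are hypothetical.

```python
import math
from typing import Dict, List, Tuple

# One derivation step: (source phrase, target phrase, source start, source end),
# with 1-based inclusive source positions.
Step = Tuple[str, str, int, int]


def derivation_log_score(
    steps: List[Step],
    phrase_table: Dict[Tuple[str, str], float],
    bigram_lm: Dict[Tuple[str, str], float],
    alpha: float = 0.5,
    floor: float = 1e-6,
) -> float:
    """Sum log phrase, reordering, and language-model scores for one derivation."""
    total = 0.0
    prev_end = 0  # end_{i-1}; 0 before the first phrase
    prev_word = "<s>"
    for src, tgt, start, end in steps:
        # Phrase translation probability, looked up in the phrase table.
        total += math.log(phrase_table.get((src, tgt), floor))
        # Assumed distance-based reordering score: alpha ** |start_i - end_{i-1} - 1|.
        total += abs(start - prev_end - 1) * math.log(alpha)
        # Bigram language model over the target words produced so far.
        for word in tgt.split():
            total += math.log(bigram_lm.get((prev_word, word), floor))
            prev_word = word
        prev_end = end
    return total


# Hypothetical toy models.
phrase_table = {("wo", "I"): 0.8, ("xihuan", "like"): 0.7, ("cha", "tea"): 0.9}
bigram_lm = {("<s>", "I"): 0.5, ("I", "like"): 0.4, ("like", "tea"): 0.3}

monotone = [("wo", "I", 1, 1), ("xihuan", "like", 2, 2), ("cha", "tea", 3, 3)]
jumped = [("cha", "tea", 3, 3), ("wo", "I", 1, 1), ("xihuan", "like", 2, 2)]

monotone_score = derivation_log_score(monotone, phrase_table, bigram_lm)
jumped_score = derivation_log_score(jumped, phrase_table, bigram_lm)
```

With these toy values the monotone derivation scores higher than the one that jumps to the third source word first, because the latter pays both a reordering penalty and a language model penalty. A real decoder would search over many such derivations instead of scoring two by hand.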
Another object of the present invention is to provide a method of translating text using the machine translation pipeline so established. For a sentence s to be translated, the sentence is first fed as input to the three trained models (phrase model, language model, and reordering model) to obtain the corresponding target-language translation t; the translation t is then fed as input to the three trained models of the correction stage (phrase model, language model, and reordering model) to obtain the corrected translation t', which is the translation result.
Another object of the present invention is to provide a machine translation pipeline establishment system implementing the method, the system comprising:
a translation model training module, for training the conventional translation model;
a correction model training module, for training the translation correction model;
a creation module, for creating the translation pipeline.
Another object of the present invention is to provide a computer program implementing the method for establishing a machine translation pipeline.
Another object of the present invention is to provide a computer equipped with the computer program.
Another object of the present invention is to provide a computer-readable storage medium comprising instructions that, when run on a computer, cause the computer to perform the method for establishing a machine translation pipeline.
The present invention extends conventional statistical machine translation systems, which translate only once: it uses the bilingual parallel corpus a second time, making full use of the corpus, and the correction model, by correcting the first-pass translation, produces new words or phrase combinations. A translation pipeline joins the two models, so data to be translated only needs to pass through the pipeline in sequence to obtain the final translation. The translation accuracy of the translation system can be improved by more than 5%.
Brief description of the drawings
Fig. 1 is a flowchart of the method for establishing a machine translation pipeline provided by an embodiment of the present invention.
Fig. 2 is a structural diagram of the machine translation pipeline establishment system provided by an embodiment of the present invention.
In the figures: 1, translation model training module; 2, correction model training module; 3, creation module.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.
The application principle of the present invention is explained in detail below with reference to the drawings.
As shown in Fig. 1, the method for establishing a machine translation pipeline provided by an embodiment of the present invention comprises the following steps:
S101: use the bilingual parallel corpus a second time; after the correction model corrects the first-pass translation, new words or phrase combinations are produced;
S102: join the two models by establishing a translation pipeline; data to be translated passes through the pipeline in sequence to yield the final translation.
As shown in Fig. 2, the machine translation pipeline establishment system provided by an embodiment of the present invention comprises:
a translation model training module 1, for training the conventional translation model;
a correction model training module 2, for training the translation correction model;
a creation module 3, for creating the translation pipeline.
The application principle of the present invention is further described below with reference to a specific embodiment.
The method for establishing a machine translation pipeline provided by an embodiment of the present invention comprises the following steps:
Step 1: train the conventional machine translation model. Taking statistical machine translation as an example, statistical machine translation adopts the noisy channel model. For the preprocessed bilingual parallel corpus (S, T), IBM Model 1 through IBM Model 5 and an HMM (hidden Markov model) are used to obtain phrase alignments, and the final phrase translation model and reordering model are then derived from the phrase occurrence frequencies and the alignments. For the preprocessed target-language monolingual corpus, all phrases and their N-gram combinations are counted to produce the final language model.
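The N-gram counting that produces the language model can be sketched in Python as follows. This is a minimal maximum-likelihood estimate without smoothing, shown only to make the counting step concrete; the function name and the toy corpus are hypothetical, and production systems normally add smoothing for unseen N-grams.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def train_ngram_lm(
    sentences: Iterable[List[str]], n: int = 2
) -> Dict[Tuple[str, ...], float]:
    """Estimate P(w_n | w_1 .. w_{n-1}) by relative frequency from tokenized sentences."""
    ngram_counts: Counter = Counter()
    context_counts: Counter = Counter()
    for tokens in sentences:
        # Pad so that sentence-initial and sentence-final words are also modeled.
        padded = ["<s>"] * (n - 1) + list(tokens) + ["</s>"]
        for i in range(len(padded) - n + 1):
            ngram = tuple(padded[i : i + n])
            ngram_counts[ngram] += 1
            context_counts[ngram[:-1]] += 1
    return {
        ngram: count / context_counts[ngram[:-1]]
        for ngram, count in ngram_counts.items()
    }


# Hypothetical monolingual target-language corpus, already tokenized.
corpus = [
    ["the", "cat", "sleeps"],
    ["the", "dog", "sleeps"],
    ["the", "cat", "eats"],
]
lm = train_ngram_lm(corpus, n=2)
```

In this toy corpus "the" is followed by "cat" in two of its three occurrences, so `lm[("the", "cat")]` is 2/3.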
Step 2: train a translation correction model. Before model training, to make full use of the bilingual parallel corpus, all source-language sentences S in the parallel corpus must be translated into target-language sentences T' using the phrase model, the language model, and the reordering model. Decoding during translation is performed according to the following formula:
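The formula itself does not appear in this text. Judging from the symbol definitions given for it, it is consistent with the standard phrase-based decoding objective; the reconstruction below is an assumption made on that basis, not the patent's verbatim formula:

```latex
\hat{e} = \arg\max_{e} \;
  \prod_{i=1}^{I} \phi(f_i \mid e_i)\,
  d(\mathrm{start}_i - \mathrm{end}_{i-1} - 1)
  \prod_{i=1}^{|e|} P_{LM}(e_i \mid e_1 \ldots e_{i-1})
```

Here \hat{e} (the best-scoring translation), I (the number of phrase pairs), |e| (the number of target words), and the symbol \phi for the phrase translation probability are introduced for this sketch. As in the definitions, e_i stands for the i-th target phrase in the first product and for the i-th target word in the language model product.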
where f_i is the i-th source-language phrase and e_i is the target-language phrase into which the i-th source phrase is translated; the phrase translation probability is the probability that the source phrase matches the target phrase and can be obtained by querying the phrase translation model. start_i and end_{i-1} denote, respectively, the position of the first word of the source phrase translated into the i-th target phrase and the position of the last word of the source phrase translated into the (i-1)-th target phrase; d denotes the score of the target-language word order after translation, which can be obtained by querying the corresponding reordering model. P_LM(e_i | e_1 ... e_{i-1}) denotes the language model score of the translated target phrase, where e_i in this term denotes the i-th word of the translated phrase; this value can be obtained by querying the language model. Once the translations T' of the source-language sentences in the bilingual corpus have been obtained, T' and the target-language sentences T of the original bilingual corpus are treated as a bilingual parallel corpus: a new translation model is trained on this new parallel corpus of T' and T, and that model serves as the correction model for translation output.
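The construction of the correction model's training data can be sketched in Python as below. The function `build_correction_corpus`, the toy first-pass translator, and the example sentences are hypothetical stand-ins; in the method described above, the first-pass translator would be the decoder built from the Step 1 models.

```python
from typing import Callable, List, Tuple


def build_correction_corpus(
    parallel_corpus: List[Tuple[str, str]],
    first_pass_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each first-pass translation T' with its reference translation T."""
    return [(first_pass_translate(src), ref) for src, ref in parallel_corpus]


# Hypothetical first-pass system, stubbed as a lookup table.
first_pass_table = {"ni hao": "you good", "xie xie": "thank thank"}


def toy_first_pass(sentence: str) -> str:
    """Return the stubbed first-pass output, or the input unchanged if unknown."""
    return first_pass_table.get(sentence, sentence)


original_corpus = [("ni hao", "hello"), ("xie xie", "thanks")]
correction_corpus = build_correction_corpus(original_corpus, toy_first_pass)
```

Both sides of the resulting corpus are in the target language, so a translation model trained on it learns to map first-pass output to the reference wording. This is how the original parallel corpus gets used a second time.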
Step 3: establish the translation pipeline. For a sentence s to be translated, the sentence is first fed as input to the three models trained in Step 1 (phrase model, language model, and reordering model) to obtain the corresponding target-language translation t; the translation t is then fed as input to the three models trained in Step 2 (phrase model, language model, and reordering model) to obtain the corrected translation t', which is the final translation result.
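The pipeline in Step 3 amounts to composing the two translators in sequence. A minimal Python sketch follows, with both stages stubbed as lookup tables; the names `build_pipeline`, `first_pass`, and `correct` and the toy data are hypothetical.

```python
from typing import Callable, Sequence

Translator = Callable[[str], str]


def build_pipeline(stages: Sequence[Translator]) -> Translator:
    """Chain translators so that each stage consumes the previous stage's output."""

    def run(sentence: str) -> str:
        for stage in stages:
            sentence = stage(sentence)
        return sentence

    return run


# Hypothetical stand-ins for the Step 1 decoder and the Step 2 correction decoder.
first_pass_table = {"ni hao": "you good"}
correction_table = {"you good": "hello"}


def first_pass(s: str) -> str:
    """Stage 1: source sentence s to first-pass translation t."""
    return first_pass_table.get(s, s)


def correct(t: str) -> str:
    """Stage 2: first-pass translation t to corrected translation t'."""
    return correction_table.get(t, t)


translate = build_pipeline([first_pass, correct])
result = translate("ni hao")
```

Representing the pipeline as an ordered list of stages keeps the caller's interface to a single function call, matching the description that data to be translated only needs to pass through the pipeline in sequence.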
The above embodiments may be implemented wholly or partly in software, hardware, firmware, or any combination thereof. When implemented wholly or partly as a computer program product, the computer program product comprises one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the flows or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared, radio, or microwave). The computer-readable storage medium may be any usable medium that a computer can access, or a data storage device such as a server or data center that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).
The above are only preferred embodiments of the present invention and do not limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (8)

1. A method for establishing a machine translation pipeline, characterized in that the method uses the bilingual parallel corpus a second time; a correction model corrects the first-pass translation and thereby produces new words or phrase combinations; and a translation pipeline joins the conventionally trained machine translation model with the trained translation correction model, so that data to be translated passes through the pipeline in sequence to yield the translation.
2. The method for establishing a machine translation pipeline according to claim 1, characterized in that the method comprises the following steps:
Step 1: for the preprocessed bilingual parallel corpus (S, T), use IBM Model 1 through IBM Model 5 and a hidden Markov model (HMM) to obtain word alignments, and then derive the final phrase model from the phrase occurrence frequencies and the alignments; for the preprocessed bilingual parallel corpus (S, T), use distance-based cost estimation to obtain the final reordering model; for the preprocessed target-language monolingual corpus, count all phrases and their N-gram combinations to produce the final language model.
Step 2: train the translation correction model. Before model training, translate all source-language sentences S in the parallel corpus into target-language sentences T' using the phrase model, the language model, and the reordering model. Once the translations T' of the source-language sentences have been obtained, treat T' and the target-language sentences T of the original bilingual corpus as a new bilingual parallel corpus, and train a new translation model on it.
3. The method for establishing a machine translation pipeline according to claim 2, characterized in that, in the second step, decoding during translation is performed according to the formula:
where f_i is the i-th source-language phrase and e_i is the target-language phrase into which the i-th source phrase is translated; the phrase translation probability is the probability that the source phrase matches the target phrase and is obtained by querying the phrase translation model; start_i and end_{i-1} denote, respectively, the position of the first word of the source phrase translated into the i-th target phrase and the position of the last word of the source phrase translated into the (i-1)-th target phrase; d denotes the score of the target-language word order after translation and is obtained by querying the corresponding reordering model;
P_LM(e_i | e_1 ... e_{i-1}) denotes the language model score of the translated target phrase, where e_i in this term denotes the i-th word of the translated phrase, and is obtained by querying the language model.
4. A method of translating text using the method for establishing a machine translation pipeline according to any one of claims 1 to 3, characterized in that, for a sentence s to be translated, the sentence is first fed as input to the three trained conventional models (phrase model, language model, and reordering model) to obtain the corresponding target-language translation t; the translation t is then fed as input to the three trained correction models (phrase model, language model, and reordering model) to obtain the corrected translation t', which is the translation result.
5. A machine translation pipeline establishment system for the method for establishing a machine translation pipeline according to claim 1, characterized in that the system comprises:
a translation model training module, for training the conventional translation model;
a correction model training module, for training the translation correction model;
a creation module, for creating the translation pipeline.
6. A computer program implementing the method for establishing a machine translation pipeline according to any one of claims 1 to 3.
7. A computer equipped with the computer program according to claim 6.
8. A computer-readable storage medium comprising instructions that, when run on a computer, cause the computer to perform the method for establishing a machine translation pipeline according to any one of claims 1 to 3.
CN201711309530.4A 2017-12-11 2017-12-11 A kind of machine translation pipeline method for building up and system, computer program, computer Pending CN108038111A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711309530.4A CN108038111A (en) 2017-12-11 2017-12-11 A kind of machine translation pipeline method for building up and system, computer program, computer

Publications (1)

Publication Number Publication Date
CN108038111A true CN108038111A (en) 2018-05-15

Family

ID=62101935

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711309530.4A Pending CN108038111A (en) 2017-12-11 2017-12-11 A kind of machine translation pipeline method for building up and system, computer program, computer

Country Status (1)

Country Link
CN (1) CN108038111A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030018470A1 (en) * 2001-04-13 2003-01-23 Golden Richard M. System and method for automatic semantic coding of free response data using Hidden Markov Model methodology
CN103646019A (en) * 2013-12-31 2014-03-19 哈尔滨理工大学 Method and device for fusing multiple machine translation systems
CN105321517A (en) * 2015-11-16 2016-02-10 范朝阳 Voice command conversion and translation execution system
CN106407184A (en) * 2015-07-30 2017-02-15 阿里巴巴集团控股有限公司 Decoding method used for statistical machine translation, and statistical machine translation method and apparatus

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ACHRAF OTHMAN等: "Statistical Sign Language Machine Translation: from English written text to American Sign Language Gloss", 《IJCSI INTERNATIONAL JOURNAL OF COMPUTER SCIENCE ISSUES》 *
李响等: "面向多引擎融合技术的统计后编辑方法研究", 《工业技术创新》 *
樊重俊等: "《大数据分析与应用》", 31 January 2016, 立信会计出版社 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344195A (en) * 2018-10-25 2019-02-15 电子科技大学 Pipe safety event recognition and Knowledge Discovery Method based on HMM model
CN109344195B (en) * 2018-10-25 2021-09-21 电子科技大学 HMM model-based pipeline security event recognition and knowledge mining method
CN111160046A (en) * 2018-11-07 2020-05-15 北京搜狗科技发展有限公司 Data processing method and device and data processing device
CN109558597A (en) * 2018-12-17 2019-04-02 北京百度网讯科技有限公司 Text interpretation method and device, equipment and storage medium
CN109558597B (en) * 2018-12-17 2022-05-24 北京百度网讯科技有限公司 Text translation method and device, equipment and storage medium
CN110555213A (en) * 2019-08-21 2019-12-10 语联网(武汉)信息技术有限公司 training method of text translation model, and text translation method and device
CN110837741A (en) * 2019-11-14 2020-02-25 北京小米智能科技有限公司 Machine translation method, device and system
CN110837741B (en) * 2019-11-14 2023-11-07 北京小米智能科技有限公司 Machine translation method, device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180515