CN108038111A - Method and system for establishing a machine translation pipeline, computer program, and computer - Google Patents
- Publication number: CN108038111A
- Application number: CN201711309530.4A
- Authority: CN (China)
- Prior art keywords
- translation
- model
- language
- phrase
- pipeline
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The invention belongs to the field of computer software technology and discloses a method and system for establishing a machine translation pipeline, a computer program, and a computer. The method uses the bilingual parallel corpus a second time; a correction model corrects the first-pass translation, producing new word or phrase combinations; and a translation pipeline joins the trained conventional machine translation model with the trained translation correction model, so that data to be translated need only pass through the pipeline in sequence to obtain the final translation. The translation accuracy of the translation system can be improved by more than 5%.
Description
Technical field
The invention belongs to the field of computer software technology, and in particular relates to a method and system for establishing a machine translation pipeline, a computer program, and a computer.
Background art
Machine translation is the process of using a computer to translate one natural language (the source language) into another natural language (the target language), and it has significant scientific research value and practical value. Research on machine translation can be traced back to the 1930s and 1940s and has passed through three main stages: rule-based, statistical, and neural machine translation. The main idea of statistical machine translation is to perform statistical analysis on a large amount of parallel data, construct a translation model, and combine it with a language model, a reordering model, and other models to score candidate translations of the sentence to be translated. The main idea of neural machine translation is to convert a source-language sentence of arbitrary length into a floating-point vector of fixed dimension, transform it layer by layer through the network into a particular representation, and finally generate the target-language translation. As a result, machine translation no longer stops at simple literal matching and has begun to reach the semantic level. Existing machine translation mainly relies on data-driven techniques that learn the relevant information from bilingual parallel sentence pairs and generate the final translation model.
Existing machine translation still has defects. Taking statistical machine translation as an example: on the one hand, when the translation model is built, each sentence pair in the parallel corpus is effectively used only once in producing the final model, so the information in the parallel corpus is not fully exploited. On the other hand, the many post-processing methods for machine translation output proposed by academia and industry cannot produce new phrases or word combinations. For example, reranking methods merely use additional parameters or other models to reorder the top-N candidate translations produced by the machine translation system. In this process the original translation results are only reordered or rescored, and no new phrase combinations are produced.
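The limitation of reranking can be illustrated with a minimal Python sketch. The candidate list and the scoring function are invented for illustration and are not taken from the patent; the only point is that reranking returns a permutation of its input.

```python
from typing import Callable, List


def rerank(nbest: List[str], score: Callable[[str], float]) -> List[str]:
    """Return the same candidates, sorted by an extra (hypothetical) scoring function."""
    return sorted(nbest, key=score, reverse=True)


# Toy N-best list from a first-pass system (invented for illustration).
nbest = ["he go to school", "he goes school", "him goes to school"]


def toy_score(sentence: str) -> float:
    """Hypothetical scorer: reward candidates that contain "goes" and "to"."""
    words = sentence.split()
    return float("goes" in words) + float("to" in words)


reranked = rerank(nbest, toy_score)
best = reranked[0]
```

Whatever scorer is used, `best` is always one of the original candidates; a sentence such as "he goes to school", which recombines words across candidates, can never be returned by reranking alone.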
In conclusion problem existing in the prior art is:Existing machine translation presence cannot make full use of parallel corpora
Information, it is impossible to produce new phrase or word combination.
Summary of the invention
In view of the problems in the prior art, the present invention provides a method and system for establishing a machine translation pipeline, a computer program, and a computer.
The present invention is realized as follows. In a method for establishing a machine translation pipeline, the bilingual parallel corpus is used a second time; a correction model corrects the first-pass translation, producing new word or phrase combinations; and a translation pipeline joins the trained conventional machine translation model with the trained translation correction model, so that data to be translated pass through the pipeline in sequence to obtain the translation.
Further, the method for establishing a machine translation pipeline comprises the following steps:
Step 1: for the preprocessed bilingual parallel corpus (S, T), obtain word alignments using IBM Model 1 through IBM Model 5 and a hidden Markov model (HMM), and then obtain the final phrase model from the phrase frequencies and the alignments; for the preprocessed bilingual parallel corpus (S, T), obtain the final reordering model by distance-based cost estimation; for the preprocessed monolingual target-language corpus, count all phrases and the N-gram combinations of phrases to produce the final language model.
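As a rough illustration of the language model part of step 1, the following Python sketch estimates a bigram model by relative-frequency counting. The bigram order, the absence of smoothing, the sentence boundary symbols, and the toy corpus are all assumptions; the text only states that N-gram combinations are counted.

```python
from collections import Counter
from typing import Dict, List, Tuple


def train_bigram_lm(sentences: List[List[str]]) -> Dict[Tuple[str, str], float]:
    """Estimate P(cur | prev) by relative frequency, with <s> and </s> padding."""
    history_counts: Counter = Counter()
    bigram_counts: Counter = Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return {
        (prev, cur): count / history_counts[prev]
        for (prev, cur), count in bigram_counts.items()
    }


# Toy monolingual target-language corpus (invented for illustration).
corpus = [["the", "cat", "sleeps"], ["the", "dog", "sleeps"], ["the", "cat", "eats"]]
lm = train_bigram_lm(corpus)
```

A practical system would add smoothing and higher-order N-grams; this sketch only shows the counting step.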
Step 2: train the translation correction model. Before this model is trained, all source-language sentences S in the parallel corpus are translated into target-language sentences T' using the phrase model, the language model, and the reordering model. Once the translations T' of the source-language sentences have been obtained, T' and the target-language sentences T of the original bilingual corpus are treated as a bilingual parallel corpus, and a new translation model is trained on this new corpus (T', T).
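The construction of the correction corpus in step 2 can be sketched as follows. The function name `build_correction_corpus`, the word-by-word stand-in for the trained first-pass system, and the German-English toy data are hypothetical; the patent does not name languages or an implementation.

```python
from typing import Callable, List, Tuple


def build_correction_corpus(
    sources: List[str],
    targets: List[str],
    first_pass: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each first-pass translation T' with its reference target sentence T."""
    if len(sources) != len(targets):
        raise ValueError("parallel corpus sides must have the same length")
    return [(first_pass(s), t) for s, t in zip(sources, targets)]


# Hypothetical stand-in for the trained first-pass system: word-by-word lookup.
LEXICON = {"ich": "I", "gehe": "go", "nach": "to", "hause": "house"}


def toy_first_pass(sentence: str) -> str:
    return " ".join(LEXICON.get(word, word) for word in sentence.split())


S = ["ich gehe nach hause"]   # source side of the parallel corpus
T = ["I am going home"]       # target side of the parallel corpus
correction_corpus = build_correction_corpus(S, T, toy_first_pass)
```

The resulting pairs are monolingual on both sides (target language to target language), which is what lets the second model learn to rewrite first-pass output into new phrase combinations.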
Further, in step 2, decoding during translation follows the formula:
where f_i is the i-th source-language phrase and e_i is the target-language phrase into which the i-th source-language phrase is translated. The phrase translation term denotes the probability that a source-language phrase matches a target-language phrase and can be obtained by querying the phrase translation model. start_i denotes the position of the first word of the source-language phrase translated into the i-th target phrase, and end_{i-1} denotes the position of the last word of the source-language phrase translated into the preceding target phrase. d denotes the score of the target-language word order after translation and is obtained by querying the corresponding reordering model. P_LM(e_i | e_1 ... e_{i-1}) denotes the language model score of the translated target phrase, where e_i here stands for the i-th word of the translated phrase; this score is obtained by querying the language model.
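The decoding formula is not reproduced in this text. As a minimal sketch, assuming it has the standard phrase-based form (a product of phrase translation probabilities, distance-based reordering scores, and language model probabilities), one candidate derivation can be scored as follows. The exponential distortion `alpha ** |start_i - end_{i-1} - 1|`, the bigram language model, and all table values are assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple

# One step of a derivation: (source phrase f_i, target phrase e_i, start_i, end_i),
# with 1-based positions of the source phrase inside the source sentence.
Segment = Tuple[str, str, int, int]


def distortion(start_i: int, end_prev: int, alpha: float = 0.5) -> float:
    """Assumed distance-based reordering score: alpha ** |start_i - end_{i-1} - 1|."""
    return alpha ** abs(start_i - end_prev - 1)


def derivation_log_score(
    segments: List[Segment],
    phrase_table: Dict[Tuple[str, str], float],
    bigram_lm: Dict[Tuple[str, str], float],
    alpha: float = 0.5,
) -> float:
    """Sum the log phrase, reordering, and language model scores of a derivation."""
    log_score = 0.0
    end_prev = 0       # end_0 = 0, so a monotone first phrase is not penalised
    history = "<s>"    # sentence-start symbol for the bigram model
    for src, tgt, start, end in segments:
        log_score += math.log(phrase_table[(src, tgt)])
        log_score += math.log(distortion(start, end_prev, alpha))
        for word in tgt.split():
            log_score += math.log(bigram_lm[(history, word)])
            history = word
        end_prev = end
    return log_score


# Toy tables (invented values).
phrase_table = {("a b", "x y"): 0.5, ("c", "z"): 0.25}
bigram_lm = {("<s>", "x"): 0.5, ("x", "y"): 0.5, ("y", "z"): 0.5,
             ("<s>", "z"): 0.5, ("z", "x"): 0.5}

monotone = [("a b", "x y", 1, 2), ("c", "z", 3, 3)]
swapped = [("c", "z", 3, 3), ("a b", "x y", 1, 2)]
mono_score = derivation_log_score(monotone, phrase_table, bigram_lm)
swap_score = derivation_log_score(swapped, phrase_table, bigram_lm)
```

Under these toy values the monotone derivation scores higher than the swapped one, because the swapped order pays a reordering penalty at both steps. A real decoder would search over many derivations rather than score two by hand.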
Another object of the present invention is to provide a method of translating with a pipeline built by the above method. For a sentence s to be translated, the sentence is first fed as input to the three trained conventional models (phrase model, language model, and reordering model) to obtain the corresponding target-language translation t; the translation t is then fed as input to the three trained correction models (phrase model, language model, and reordering model) to obtain the corrected translation t', which is the translation result.
Another object of the present invention is to provide a machine translation pipeline establishing system for the above method, the system comprising:
a translation model training module, for training the conventional translation model;
a correction model training module, for training the translation correction model;
a creation module, for creating the translation pipeline.
Another object of the present invention is to provide a computer program that implements the method for establishing a machine translation pipeline.
Another object of the present invention is to provide a computer on which the computer program is installed.
Another object of the present invention is to provide a computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method for establishing a machine translation pipeline.
The present invention extends conventional statistical machine translation systems, which translate only once: the bilingual parallel corpus is used a second time so that it is exploited fully, and the correction model corrects the first-pass translation, producing new word or phrase combinations. The two models are joined by establishing a translation pipeline, so that data to be translated need only pass through the pipeline in sequence to obtain the final translation. The translation accuracy of the translation system can be improved by more than 5%.
Brief description of the drawings
Fig. 1 is a flowchart of the method for establishing a machine translation pipeline provided by an embodiment of the present invention.
Fig. 2 is a structural diagram of the machine translation pipeline establishing system provided by an embodiment of the present invention.
In the figure: 1, translation model training module; 2, correction model training module; 3, creation module.
Detailed description
To make the purpose, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to embodiments. It should be understood that the specific embodiments described here merely explain the invention and do not limit it.
The application principle of the invention is explained in detail below with reference to the drawings.
As shown in Fig. 1, the method for establishing a machine translation pipeline provided by an embodiment of the present invention comprises the following steps:
S101: use the bilingual parallel corpus a second time; correct the first-pass translation with a correction model, producing new word or phrase combinations;
S102: join the two models by establishing a translation pipeline, and pass the data to be translated through the pipeline in sequence to obtain the final translation.
As shown in Fig. 2, the machine translation pipeline establishing system provided by an embodiment of the present invention comprises:
Translation model training module 1, for training the conventional translation model.
Correction model training module 2, for training the translation correction model.
Creation module 3, for creating the translation pipeline.
The application principle of the invention is further described below with reference to a specific embodiment.
The method for establishing a machine translation pipeline provided by an embodiment of the present invention comprises the following steps:
Step 1: train the conventional machine translation model. Taking statistical machine translation as an example, statistical machine translation uses the noisy channel model. For the preprocessed bilingual parallel corpus (S, T), alignments are obtained with IBM Model 1 through IBM Model 5 and a hidden Markov model (HMM), and the final phrase translation model and reordering model are then obtained from the phrase frequencies and the alignments. For the preprocessed monolingual target-language corpus, all phrases and the N-gram combinations of phrases are counted to produce the final language model.
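To give a feel for the alignment stage of step 1, here is a minimal Python sketch of IBM Model 1 trained with expectation-maximization. It covers only Model 1 (not Models 2 to 5 or the HMM model), omits the NULL word, and uses an invented three-sentence corpus; it is a simplified illustration, not the training procedure of the patent.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

SentencePair = Tuple[List[str], List[str]]  # (source words, target words)


def ibm_model1(pairs: List[SentencePair], iterations: int = 10) -> Dict[Tuple[str, str], float]:
    """Estimate t(target_word | source_word) with EM (IBM Model 1, no NULL word)."""
    src_vocab = {s for src, _ in pairs for s in src}
    tgt_vocab = {w for _, tgt in pairs for w in tgt}
    # Uniform initialisation over the target vocabulary.
    t = {(w, s): 1.0 / len(tgt_vocab) for w in tgt_vocab for s in src_vocab}
    for _ in range(iterations):
        count: Dict[Tuple[str, str], float] = defaultdict(float)
        total: Dict[str, float] = defaultdict(float)
        # E-step: distribute each target word over the source words of its sentence.
        for src, tgt in pairs:
            for w in tgt:
                norm = sum(t[(w, s)] for s in src)
                for s in src:
                    frac = t[(w, s)] / norm
                    count[(w, s)] += frac
                    total[s] += frac
        # M-step: renormalise the expected counts per source word.
        t = {(w, s): count[(w, s)] / total[s] for (w, s) in t}
    return t


def best_translation(t: Dict[Tuple[str, str], float], source_word: str) -> str:
    """Return the target word with the highest probability for a source word."""
    candidates = {w: p for (w, s), p in t.items() if s == source_word}
    return max(candidates, key=candidates.get)


# Toy parallel corpus (invented for illustration).
pairs = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]
probs = ibm_model1(pairs)
```

From such word-level probabilities, alignments are derived, and phrase pairs consistent with the alignments are then counted to build the phrase table.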
Step 2: train a translation correction model. Before this model is trained, in order to exploit the bilingual parallel corpus fully, all source-language sentences S in the parallel corpus are translated into target-language sentences T' using the phrase model, the language model, and the reordering model. During translation, decoding follows the formula:
where f_i is the i-th source-language phrase and e_i is the target-language phrase into which the i-th source-language phrase is translated. The phrase translation term denotes the probability that a source-language phrase matches a target-language phrase and can be obtained by querying the phrase translation model. start_i denotes the position of the first word of the source-language phrase translated into the i-th target phrase, and end_{i-1} denotes the position of the last word of the source-language phrase translated into the preceding target phrase. d denotes the score of the target-language word order after translation; this score can be obtained by querying the corresponding reordering model. P_LM(e_i | e_1 ... e_{i-1}) denotes the language model score of the translated target phrase, where e_i here stands for the i-th word of the translated phrase; this score can be obtained by querying the language model.
Once the translations T' of the source-language sentences of the bilingual corpus have been obtained, T' and the target-language sentences T of the original bilingual corpus are assumed to form a bilingual parallel corpus. T' and T are used as a new bilingual parallel corpus to train a new translation model, which serves as the correction model for translations.
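The decoding formula itself is missing from this text. Assuming it is the standard phrase-based decoding objective, which is consistent with every symbol defined above, it would read as follows. This is a reconstruction under that assumption, not a quotation from the patent; the symbol \phi for the phrase translation probability, the name e_best, the upper limits I and |e|, and the "- 1" offset inside d are assumed.

```latex
e_{\mathrm{best}}
  = \arg\max_{e}
    \prod_{i=1}^{I} \phi(f_i \mid e_i)\,
    d(\mathrm{start}_i - \mathrm{end}_{i-1} - 1)
    \prod_{i=1}^{|e|} P_{LM}(e_i \mid e_1 \ldots e_{i-1})
```

In this form, I is the number of phrase pairs in the segmentation, and the second product runs over the words of the target sentence, which matches the statement that e_i in the language model term is the i-th word rather than the i-th phrase.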
Step 3: establish the translation pipeline. For a sentence s to be translated, the sentence is first fed as input to the three models trained in step 1 (phrase model, language model, and reordering model) to obtain the corresponding target-language translation t; the translation t is then fed as input to the three models trained in step 2 (phrase model, language model, and reordering model) to obtain the corrected translation t', which is the final translation result.
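The pipeline of step 3 amounts to function composition, which can be sketched in Python as follows. `build_pipeline`, the two toy stages, and the example sentence are hypothetical stand-ins; in the patent each stage is a full phrase-based system with its own phrase, language, and reordering models.

```python
from typing import Callable, Dict, Sequence

Translator = Callable[[str], str]


def build_pipeline(stages: Sequence[Translator]) -> Translator:
    """Chain translators so that the output of one stage is the input of the next."""
    def run(sentence: str) -> str:
        for stage in stages:
            sentence = stage(sentence)
        return sentence
    return run


# Hypothetical first-pass model (step 1): word-by-word lookup.
LEXICON: Dict[str, str] = {"ich": "I", "gehe": "go", "nach": "to", "hause": "house"}


def first_pass(sentence: str) -> str:
    return " ".join(LEXICON.get(word, word) for word in sentence.split())


# Hypothetical correction model (step 2): target-to-target phrase rewriting.
CORRECTIONS: Dict[str, str] = {"go to house": "am going home"}


def correction(sentence: str) -> str:
    for wrong, right in CORRECTIONS.items():
        sentence = sentence.replace(wrong, right)
    return sentence


pipeline = build_pipeline([first_pass, correction])

s = "ich gehe nach hause"
t = first_pass(s)          # first-pass translation t
t_prime = pipeline(s)      # corrected translation t'
```

Here the second stage introduces the phrase "am going home", which does not occur in the first-pass output; this is the kind of new phrase combination that reranking alone cannot produce.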
The above embodiments may be implemented wholly or partly in software, hardware, firmware, or any combination thereof. When implemented wholly or partly as a computer program product, the computer program product comprises one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the flows or functions described in the embodiments of the present invention are produced wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (for example coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (for example infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (for example a floppy disk, hard disk, or magnetic tape), an optical medium (for example a DVD), or a semiconductor medium (for example a solid state disk (SSD)).
The foregoing describes only preferred embodiments of the present invention and does not limit it; any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within its scope of protection.
Claims (8)
1. A method for establishing a machine translation pipeline, characterized in that the method uses the bilingual parallel corpus a second time; corrects the first-pass translation with a correction model, producing new word or phrase combinations; and, by establishing a translation pipeline, joins the trained conventional machine translation model with the trained translation correction model, so that data to be translated pass through the translation pipeline in sequence to obtain the translation.
2. The method for establishing a machine translation pipeline according to claim 1, characterized in that the method comprises the following steps:
Step 1: for the preprocessed bilingual parallel corpus (S, T), obtain word alignments using IBM Model 1 through IBM Model 5 and a hidden Markov model (HMM), and then obtain the final phrase model from the phrase frequencies and the alignments; for the preprocessed bilingual parallel corpus (S, T), obtain the final reordering model by distance-based cost estimation; for the preprocessed monolingual target-language corpus, count all phrases and the N-gram combinations of phrases to produce the final language model.
Step 2: train the translation correction model. Before this model is trained, all source-language sentences S in the parallel corpus are translated into target-language sentences T' using the phrase model, the language model, and the reordering model. Once the translations T' of the source-language sentences have been obtained, T' and the target-language sentences T of the original bilingual corpus are treated as a bilingual parallel corpus, and a new translation model is trained on this new corpus (T', T).
3. The method for establishing a machine translation pipeline according to claim 2, characterized in that, in step 2, decoding during translation follows the formula:
where f_i is the i-th source-language phrase and e_i is the target-language phrase into which the i-th source-language phrase is translated. The phrase translation term denotes the probability that a source-language phrase matches a target-language phrase and can be obtained by querying the phrase translation model; start_i denotes the position of the first word of the source-language phrase translated into the i-th target phrase, and end_{i-1} denotes the position of the last word of the source-language phrase translated into the preceding target phrase; d denotes the score of the target-language word order after translation and is obtained by querying the corresponding reordering model;
P_LM(e_i | e_1 ... e_{i-1}) denotes the language model score of the translated target phrase, where e_i here stands for the i-th word of the translated phrase; this score is obtained by querying the language model.
4. A method of translating using the method for establishing a machine translation pipeline according to any one of claims 1 to 3, characterized in that, for a sentence s to be translated, the sentence is first fed as input to the three trained conventional models (phrase model, language model, and reordering model) to obtain the corresponding target-language translation t; the translation t is then fed as input to the three trained correction models (phrase model, language model, and reordering model) to obtain the corrected translation t', which is the translation result.
5. A machine translation pipeline establishing system for the method for establishing a machine translation pipeline according to claim 1, characterized in that the system comprises:
a translation model training module, for training the conventional translation model;
a correction model training module, for training the translation correction model;
a creation module, for creating the translation pipeline.
6. A computer program implementing the method for establishing a machine translation pipeline according to any one of claims 1 to 3.
7. A computer on which the computer program according to claim 6 is installed.
8. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method for establishing a machine translation pipeline according to any one of claims 1 to 3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711309530.4A CN108038111A (en) | 2017-12-11 | 2017-12-11 | Method and system for establishing a machine translation pipeline, computer program, and computer |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711309530.4A CN108038111A (en) | 2017-12-11 | 2017-12-11 | Method and system for establishing a machine translation pipeline, computer program, and computer |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108038111A true CN108038111A (en) | 2018-05-15 |
Family
ID=62101935
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711309530.4A Pending CN108038111A (en) | 2017-12-11 | 2017-12-11 | Method and system for establishing a machine translation pipeline, computer program, and computer |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108038111A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109344195A (en) * | 2018-10-25 | 2019-02-15 | 电子科技大学 | Pipe safety event recognition and Knowledge Discovery Method based on HMM model |
CN109558597A (en) * | 2018-12-17 | 2019-04-02 | 北京百度网讯科技有限公司 | Text interpretation method and device, equipment and storage medium |
CN110555213A (en) * | 2019-08-21 | 2019-12-10 | 语联网(武汉)信息技术有限公司 | training method of text translation model, and text translation method and device |
CN110837741A (en) * | 2019-11-14 | 2020-02-25 | 北京小米智能科技有限公司 | Machine translation method, device and system |
CN111160046A (en) * | 2018-11-07 | 2020-05-15 | 北京搜狗科技发展有限公司 | Data processing method and device and data processing device |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030018470A1 (en) * | 2001-04-13 | 2003-01-23 | Golden Richard M. | System and method for automatic semantic coding of free response data using Hidden Markov Model methodology |
CN103646019A (en) * | 2013-12-31 | 2014-03-19 | 哈尔滨理工大学 | Method and device for fusing multiple machine translation systems |
CN105321517A (en) * | 2015-11-16 | 2016-02-10 | 范朝阳 | Voice command conversion and translation execution system |
CN106407184A (en) * | 2015-07-30 | 2017-02-15 | 阿里巴巴集团控股有限公司 | Decoding method used for statistical machine translation, and statistical machine translation method and apparatus |
Non-Patent Citations (3)
Title |
---|
ACHRAF OTHMAN et al.: "Statistical Sign Language Machine Translation: from English written text to American Sign Language Gloss", 《IJCSI INTERNATIONAL JOURNAL OF COMPUTER SCIENCE ISSUES》 * |
李响 et al.: "面向多引擎融合技术的统计后编辑方法研究", 《工业技术创新》 * |
樊重俊 et al.: "《大数据分析与应用》", 31 January 2016, 立信会计出版社 * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109344195A (en) * | 2018-10-25 | 2019-02-15 | 电子科技大学 | Pipe safety event recognition and Knowledge Discovery Method based on HMM model |
CN109344195B (en) * | 2018-10-25 | 2021-09-21 | 电子科技大学 | HMM model-based pipeline security event recognition and knowledge mining method |
CN111160046A (en) * | 2018-11-07 | 2020-05-15 | 北京搜狗科技发展有限公司 | Data processing method and device and data processing device |
CN109558597A (en) * | 2018-12-17 | 2019-04-02 | 北京百度网讯科技有限公司 | Text interpretation method and device, equipment and storage medium |
CN109558597B (en) * | 2018-12-17 | 2022-05-24 | 北京百度网讯科技有限公司 | Text translation method and device, equipment and storage medium |
CN110555213A (en) * | 2019-08-21 | 2019-12-10 | 语联网(武汉)信息技术有限公司 | training method of text translation model, and text translation method and device |
CN110837741A (en) * | 2019-11-14 | 2020-02-25 | 北京小米智能科技有限公司 | Machine translation method, device and system |
CN110837741B (en) * | 2019-11-14 | 2023-11-07 | 北京小米智能科技有限公司 | Machine translation method, device and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108038111A (en) | Method and system for establishing a machine translation pipeline, computer program, and computer | |
US10810379B2 (en) | Statistics-based machine translation method, apparatus and electronic device | |
US10860808B2 (en) | Method and system for generation of candidate translations | |
US11481562B2 (en) | Method and apparatus for evaluating translation quality | |
Ling et al. | Latent predictor networks for code generation | |
Chen et al. | Automated scoring of nonnative speech using the speechrater sm v. 5.0 engine | |
US11972365B2 (en) | Question responding apparatus, question responding method and program | |
US9460080B2 (en) | Modifying a tokenizer based on pseudo data for natural language processing | |
WO2019113783A1 (en) | Number generalization method and system for machine translation, computer, and computer program | |
US8874433B2 (en) | Syntax-based augmentation of statistical machine translation phrase tables | |
US20080120092A1 (en) | Phrase pair extraction for statistical machine translation | |
CN104484322A (en) | Methods and systems for automated text correction | |
US7725306B2 (en) | Efficient phrase pair extraction from bilingual word alignments | |
US20180018960A1 (en) | Systems and methods for automatic repair of speech recognition engine output | |
CN116150613A (en) | Information extraction model training method, information extraction method and device | |
Li et al. | Semantic structure based query graph prediction for question answering over knowledge graph | |
Chen et al. | A self-attention joint model for spoken language understanding in situational dialog applications | |
CN114822518A (en) | Knowledge distillation method, electronic device, and storage medium | |
Brouns et al. | Supporting language diversity of European MOOCs with the EMMA platform | |
Setiawan et al. | Discriminative word alignment with a function word reordering model | |
Stahlberg et al. | Word segmentation and pronunciation extraction from phoneme sequences through cross-lingual word-to-phoneme alignment | |
Neubarth et al. | A hybrid approach to statistical machine translation between standard and dialectal varieties | |
JP2017059216A (en) | Query calibration system and method | |
Pudaruth et al. | Morisia: A Neural Machine Translation System to Translate between Kreol Morisien and English | |
CN104899193A (en) | Interactive translation method of restricted translation fragments in computer |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180515 |