CN106156007A - English-Chinese statistical machine translation method based on word lemmatization

Info

Publication number
CN106156007A
Authority
CN
China
Prior art keywords
word
original shape
english
translation
chinese
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201510130398.5A
Other languages
Chinese (zh)
Inventor
Not disclosed (inventor requested non-publication of name)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2015-03-24
Filing date
2015-03-24
Publication date
2016-11-23
Application filed by Individual
Priority to CN201510130398.5A
Publication of CN106156007A
Legal status: Withdrawn

Abstract

The present invention relates to an English-Chinese statistical machine translation method based on word lemmatization. In a bilingual corpus, each word in the English sentences is preprocessed into the form "lemma_part-of-speech"; the lemmatized English corpus is then word-aligned with the Chinese corpus, and a translation phrase table is extracted. The translation phrase table, a language model, and a reordering model together form the translation system. An English sentence to be translated is lemmatized in the same way and then translated by this system to obtain the Chinese translation. With lemmatization preprocessing, English-Chinese statistical machine translation improves to some degree in both translation efficiency and translation quality.

Description

An English-Chinese statistical machine translation method based on word lemmatization
Technical field
The present invention relates to the field of machine translation, and in particular to an English-Chinese statistical machine translation method based on word lemmatization.
Background art
English is one of the most widely used languages in the world and the most commonly used language in fields such as international politics, economics, culture, education, and science and technology. English-to-Chinese translation is the main way native Chinese speakers obtain information published in English. In the information age, the volume of English-language information is growing explosively, and only machine translation can effectively overcome the language barrier between English and Chinese.
At present, phrase-based statistical machine translation has become a cornerstone of machine translation; services such as Google Translate and Baidu Translate have adopted this approach. However, the quality of current machine translation still falls far short of users' requirements in terms of translation ambiguity and readability. To improve translation quality, many improvements to phrase-based statistical machine translation have been proposed, such as factored translation models and syntax-tree-based translation models. These methods mainly apply grammatical preprocessing to the source language, taking part of speech and syntax trees into account, and thereby form more complex translation models. As a result, their decoding is very complex, their improvement to translation quality is limited, and they have not been widely adopted.
Phrase-based statistical machine translation consists of three main parts: a translation model, a language model, and a reordering model. The most important of these is the translation model, that is, the translation phrase table. For a translation system of practical value, the bilingual training corpus generally needs more than one million sentence pairs, from which tens of millions to hundreds of millions of phrase-table entries can be extracted. Such a large phrase table greatly slows translation and places high demands on the computing power of the machine that runs the decoder. In practice, the phrase table is usually filtered against a development corpus and with a limit on the number of entries, to obtain a reduced translation phrase table. However, a rich phrase table effectively improves translation quality, so the phrase table should cover as many phrases as possible while remaining as small as possible.
In English-Chinese machine translation, English words take several forms in a sentence: nouns have a base form and a plural form; verbs have a base form, third-person singular, present participle, past tense, and past participle; adjectives and adverbs have a base form, a comparative, and a superlative. Chinese, the target language, has essentially no tense inflection and no singular/plural distinction, so the Chinese meaning of a word is generally the same across its English inflected forms. Therefore, by converting each word in the English corpus to its lemma (base form) before word alignment, the extracted translation phrase table will have fewer but higher-quality entries.
Summary of the invention
The technical problem to be solved by the present invention is to establish an English-Chinese statistical machine translation method based on word lemmatization that not only effectively reduces the size of the translation phrase table but also improves the quality of English-Chinese statistical machine translation to some extent.
The technical solution adopted by the present invention is an English-Chinese statistical machine translation method based on word lemmatization, comprising the following: in a bilingual corpus, the words in each English sentence are first part-of-speech tagged and converted to their lemmas, so that each word in the sentence is preprocessed into the form "lemma_part-of-speech"; the preprocessed English corpus and the corresponding Chinese corpus are word-aligned, and a translation phrase table is extracted; the translation phrase table, a language model, and a reordering model together form the translation system; an English sentence to be translated undergoes the same lemmatization preprocessing and is then translated by this system to obtain the Chinese translation.
For the comparative and superlative forms of adjectives and adverbs, the necessary auxiliary word "more" or "most" is added before the "lemma_part-of-speech" token so that the meaning remains complete after lemmatization.
Compared with an English-Chinese statistical machine translation method that uses the original (unlemmatized) translation phrase table, this method has the following advantages.
1. The number of entries in the translation phrase table is substantially reduced, typically by about 30% relative to the original phrase table, which improves English-Chinese decoding efficiency.
2. As long as a phrase occurs in any one of its forms, it appears in the lemmatized translation phrase table and can be used to translate every inflected form of that English phrase, which improves English-Chinese translation quality.
3. In the lemmatized phrase table, each word is represented as "lemma_part-of-speech", and essentially no superfluous tokens are added. A standard phrase-based statistical machine translation decoder can therefore be used unchanged, so the decoding technology is already mature.
4. Known English phrase entries can easily be added to the lemmatized translation phrase table: adding a single "lemma_part-of-speech" entry covers all inflected forms of the phrase, which greatly facilitates customized expansion of the phrase table.
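The effect behind advantages 1 and 2 can be illustrated with a small Python sketch. It is not part of the patent: the tiny `LEMMA` lookup table, the `lemmatize` and `collapse` helpers, and the placeholder target string `ZH_PHRASE` are all assumptions made for illustration, standing in for a real tagger and real Chinese phrases.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical surface-form -> "lemma_pos" lookup; a real system would use a tagger.
LEMMA: Dict[str, str] = {
    "live": "live_v", "lives": "live_v", "living": "live_v", "lived": "live_v",
    "promise": "promise_n", "promises": "promise_n",
}

def lemmatize(phrase: str) -> str:
    """Replace each known word by its lemma_pos token; keep other words as-is."""
    return " ".join(LEMMA.get(word, word) for word in phrase.split())

def collapse(entries: List[Tuple[str, str]]) -> Set[Tuple[str, str]]:
    """Lemmatize the source side of each entry and drop the resulting duplicates."""
    return {(lemmatize(src), tgt) for src, tgt in entries}

# Four surface-form entries that share one (placeholder) Chinese target.
SURFACE_ENTRIES = [
    ("live up to its promise", "ZH_PHRASE"),
    ("lives up to its promise", "ZH_PHRASE"),
    ("lived up to its promises", "ZH_PHRASE"),
    ("living up to its promises", "ZH_PHRASE"),
]

collapsed = collapse(SURFACE_ENTRIES)
print(len(SURFACE_ENTRIES), "surface entries ->", len(collapsed), "lemmatized entry")
```

In this toy input, four surface entries collapse into one lemmatized entry. On a real corpus the reduction depends on how often inflected variants of a phrase share the same translation.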
Brief description of the drawings
Figure 1 is the training and translation flowchart of the English-Chinese statistical machine translation method based on word lemmatization of the present invention.
Figure 2 shows the lemmatization rules of the present invention for English words of different forms and parts of speech.
Detailed description of the embodiments
To make the principle, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and specific embodiments.
The present invention proposes an English-Chinese statistical machine translation method based on word lemmatization. Figure 1 is the training and translation flowchart of the invention. As shown in the figure, the invention first lemmatizes the sentences of the English corpus, producing tokens of the form "lemma_part-of-speech" and adding the necessary auxiliary words, and applies Chinese word segmentation to the sentences of the Chinese corpus. The lemmatized English sentences and the segmented Chinese sentences are then automatically word-aligned, and a translation phrase table is extracted that maps English phrases in "lemma_part-of-speech" form to the corresponding Chinese phrases. This phrase table, together with a language model and a reordering model, forms the English-Chinese statistical machine translation system. To translate an English sentence with this system, the sentence is first lemmatized in the same way and then decoded by the translation system to obtain the Chinese sentence.
The core of the present invention is the lemmatization preprocessing of English sentences. Based on the characteristics of English-to-Chinese translation, the invention provides lemmatization rules (Figure 2) for English nouns, verbs, adjectives, and adverbs.
Following the rules of Figure 2, every noun, verb, adjective, and adverb in an English sentence is converted to the corresponding "lemma_part-of-speech" token, with an auxiliary word where required. This preprocessing not only reduces the various inflected forms of an English word to its lemma; the "_part-of-speech" suffix attached to the lemma also compensates for the part-of-speech ambiguity that lemmatization introduces. Here the parts of speech noun, verb, adjective, and adverb are denoted by "_n", "_v", "_a", and "_r" respectively; an underscore "_" followed by any four distinct characters or strings may be used instead. In addition, to preserve the comparative and superlative meaning of adjectives and adverbs after lemmatization, the auxiliary word "more" or "most" is added before the "lemma_part-of-speech" token; these correspond to the Chinese expressions for "more ..." and "most ..." respectively. In this way, essentially all the information expressed by the English sentence is retained, while every noun, verb, adjective, and adverb appears in its lemma form.
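A minimal Python sketch of this token format follows. It assumes that a tagger has already supplied (word, lemma, tag) triples and that the tags follow the Penn Treebank convention; the tag set, the function name `to_lemma_pos`, and the hand-written `EXAMPLE` analysis of the sentence from Embodiment 1 are assumptions of this sketch, not details taken from Figure 2.

```python
from typing import Iterable, List, Tuple

# Penn Treebank tag prefixes mapped to the four suffixes named in the description.
# (Using Penn Treebank tags is an assumption of this sketch.)
SUFFIX_BY_PREFIX = {"NN": "_n", "VB": "_v", "JJ": "_a", "RB": "_r"}
# Comparative and superlative tags receive an auxiliary word before the token.
DEGREE_WORD = {"JJR": "more", "RBR": "more", "JJS": "most", "RBS": "most"}

def to_lemma_pos(tagged: Iterable[Tuple[str, str, str]]) -> List[str]:
    """Convert (word, lemma, tag) triples into "lemma_pos" tokens."""
    out: List[str] = []
    for word, lemma, tag in tagged:
        suffix = SUFFIX_BY_PREFIX.get(tag[:2])
        if suffix is None:
            out.append(word)              # other parts of speech stay unchanged
            continue
        if tag in DEGREE_WORD:
            out.append(DEGREE_WORD[tag])  # "more" or "most" goes first
        out.append(lemma + suffix)
    return out

# Hand-written analysis of the Embodiment 1 sentence (illustrative only).
EXAMPLE = [
    ("If", "if", "IN"), ("cellulosic", "cellulosic", "JJ"),
    ("ethanol", "ethanol", "NN"), ("is", "be", "VBZ"), ("to", "to", "TO"),
    ("live", "live", "VB"), ("up", "up", "RP"), ("to", "to", "IN"),
    ("its", "its", "PRP$"), ("promise", "promise", "NN"), (",", ",", ","),
    ("researchers", "researcher", "NNS"), ("will", "will", "MD"),
    ("have", "have", "VB"), ("to", "to", "TO"), ("find", "find", "VB"),
    ("cheaper", "cheap", "JJR"), ("enzymes", "enzyme", "NNS"), (".", ".", "."),
]

print(" ".join(to_lemma_pos(EXAMPLE)))
```

Under these assumptions the output matches the lemmatized sentence given in Embodiment 1. Choosing a different suffix scheme only requires editing `SUFFIX_BY_PREFIX`.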
The following three embodiments illustrate the principle, implementation, and advantages of the present invention in detail.
Embodiment 1. Lemmatization and translation phrase extraction for an English-Chinese sentence pair.
English: If cellulosic ethanol is to live up to its promise, researchers will have to find cheaper enzymes.
Chinese (English gloss): If cellulosic ethanol can truly live up to expectations, researchers must find cheaper enzymes.
English after lemmatization: If cellulosic_a ethanol_n be_v to live_v up to its promise_n , researcher_n will have_v to find_v more cheap_a enzyme_n .
Chinese after word segmentation (English gloss): If cellulosic ethanol can truly live up to expectations, researchers must find cheaper enzymes.
From the lemmatized English sentence and the segmented Chinese sentence, the verb phrase pair "live_v up to its promise_n ||| live up to expectations" can be extracted. This single entry can translate eight phrases in real English sentences: live up to its promise, live up to its promises, lives up to its promise, lives up to its promises, living up to its promise, living up to its promises, lived up to its promise, lived up to its promises. Lemmatization preprocessing can therefore greatly reduce the size of the phrase table and thus improve decoding efficiency. In particular, as long as any one form of a phrase occurs in the training data, the lemmatized phrase table can translate it into its Chinese meaning, which effectively improves the translation of low-frequency phrases.
In addition, the noun phrase pair "cheap_a enzyme_n ||| cheap enzyme" can be extracted. It can translate "cheap enzyme" and "cheap enzymes" in real sentences, and it can also be matched when translating "cheaper enzyme", "cheaper enzymes", "cheapest enzyme", and "cheapest enzymes": these four phrases only need to be lemmatized to "more cheap_a enzyme_n" and "most cheap_a enzyme_n" respectively. The auxiliary words "more" and "most" therefore both resolve the ambiguity introduced by lemmatizing comparative and superlative adjectives and adverbs and widen the range of English phrases that can be translated.
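The count of eight surface phrases can be reproduced by enumerating inflected forms. The sketch below uses a hand-written `FORMS` table and a hypothetical helper `surface_variants`; both are assumptions for illustration and are not part of the described system.

```python
from itertools import product
from typing import Dict, List

# Hand-written inflection table for the tokens in the example (illustrative only).
FORMS: Dict[str, List[str]] = {
    "live_v": ["live", "lives", "living", "lived"],
    "promise_n": ["promise", "promises"],
    "cheap_a": ["cheap"],
    "enzyme_n": ["enzyme", "enzymes"],
}

def surface_variants(entry: str) -> List[str]:
    """List every surface phrase that lemmatizes to the given phrase-table entry."""
    slots = [FORMS.get(token, [token]) for token in entry.split()]
    return [" ".join(combo) for combo in product(*slots)]

variants = surface_variants("live_v up to its promise_n")
print(len(variants))  # 4 verb forms x 2 noun forms
for phrase in variants:
    print(phrase)
```

The count is the product of the number of forms per token, which is why entries containing several inflecting words cover many surface phrases at once.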
Embodiment 2. Adding customized translation phrase entries to the translation phrase table.
However large the translation phrase table is, it cannot fully cover every specialized English phrase that may occur. In many cases it is therefore desirable to improve translation quality by adding customized English phrases and their Chinese meanings. If, because of the limits of the training corpus, the phrase table lacks the entry "take action ||| take action" (the target side being the Chinese phrase), the entry "take_v action_n ||| take action" can be added directly, and it matches ten phrases: "take action", "takes action", "taking action", "took action", "taken action", "take actions", "takes actions", "taking actions", "took actions", "taken actions". The lemmatized translation phrase table thus makes it easy to add user-customized translations of specialized phrases, and a single added entry has a multiplying effect.
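As a rough illustration of such a customized entry, the following Python sketch models the phrase table as a dictionary keyed by the lemmatized source phrase. The `LEMMA` table, the `lemmatize_phrase` and `lookup` helpers, and the Chinese target string "采取行动" are assumptions chosen for the example; a real decoder would read entries from its own phrase-table file and score them with model weights.

```python
from typing import Dict, Optional

# Hypothetical surface-form -> "lemma_pos" lookup standing in for a real tagger.
LEMMA: Dict[str, str] = {
    "take": "take_v", "takes": "take_v", "taking": "take_v",
    "took": "take_v", "taken": "take_v",
    "action": "action_n", "actions": "action_n",
}

def lemmatize_phrase(phrase: str) -> str:
    """Map each word of a surface phrase to its lemma_pos token."""
    return " ".join(LEMMA.get(word, word) for word in phrase.lower().split())

# Toy phrase table: lemmatized English phrase -> Chinese phrase.
phrase_table: Dict[str, str] = {}

# One customized entry (the target string is an illustrative assumption).
phrase_table["take_v action_n"] = "采取行动"

def lookup(phrase: str) -> Optional[str]:
    """Lemmatize a surface phrase, then look it up in the phrase table."""
    return phrase_table.get(lemmatize_phrase(phrase))

SURFACE_PHRASES = [
    f"{verb} {noun}"
    for noun in ("action", "actions")
    for verb in ("take", "takes", "taking", "took", "taken")
]
matched = [p for p in SURFACE_PHRASES if lookup(p) is not None]
print(len(matched), "of", len(SURFACE_PHRASES), "surface phrases matched")
```

All ten surface phrases resolve to the single added entry, which is the multiplying effect described above.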
Embodiment 3. An English-Chinese statistical machine translation system based on word lemmatization.
To demonstrate the benefit of lemmatization preprocessing, this embodiment builds a phrase-based English-Chinese statistical machine translation system with word lemmatization and compares the translation phrase table and the translation quality with and without lemmatization.
According to the flow of Figure 1, a complete phrase-based statistical machine translation system requires many modules. Today each module can be implemented with open-source or free software packages. This embodiment uses the NiuTrans 1.3.0 open-source statistical machine translation package from Northeastern University for phrase extraction and filtering, language model training, reordering model training, weight tuning, decoding, and evaluation; the Chinese Academy of Sciences' ICTCLAS software for Chinese word segmentation; and GIZA++ for word alignment. English lemmatization preprocessing uses the GENIA Tagger package from the UK National Centre for Text Mining, which uses a bidirectional maximum entropy Markov model and can reach an accuracy above 98% for lemmatization and part-of-speech tagging. After part-of-speech tagging and lemmatization, the results are combined into the "lemma_part-of-speech" form, and the necessary auxiliary words are added. The training corpus consists of 2 million English-Chinese parallel news sentence pairs, and the test set consists of 1,000 English-Chinese sentence pairs from the same domain.
First consider the number of bilingual entries in the translation phrase table obtained by training the translation model. After lemmatization, 14,320,000 entries are extracted, about 30% fewer than the 20,340,000 entries extracted without preprocessing. After filtering with the development corpus, the lemmatized phrase table has 2,840,000 entries, about 27% fewer than the 3,910,000 entries of the filtered non-lemmatized table. Lemmatization preprocessing thus effectively reduces the number of phrase-table entries and thereby improves decoding efficiency.
This embodiment uses the IBM BLEU score as the evaluation metric. The method without lemmatization preprocessing obtains a BLEU score of 0.3046, and the method with lemmatization preprocessing obtains 0.3256, an improvement of 0.021 over the baseline system. Lemmatization preprocessing can thus improve the quality of English-Chinese machine translation to some extent.
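The percentages and the BLEU difference reported in this embodiment follow directly from the stated figures. A short Python check, in which the helper name `relative_reduction` is introduced only for illustration:

```python
def relative_reduction(before: int, after: int) -> float:
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before

# Entry counts reported in Embodiment 3.
raw = relative_reduction(20_340_000, 14_320_000)      # before filtering
filtered = relative_reduction(3_910_000, 2_840_000)   # after filtering

# BLEU scores reported in Embodiment 3.
bleu_gain = 0.3256 - 0.3046

print(f"unfiltered phrase table: {raw:.1%} fewer entries")
print(f"filtered phrase table:   {filtered:.1%} fewer entries")
print(f"BLEU improvement:        {bleu_gain:.4f}")
```

The reductions come to roughly 29.6% and 27.4%, which the text rounds to 30% and 27%, and the BLEU difference is 0.021.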
Judging by the translation efficiency and quality in Embodiment 3, the English-Chinese statistical machine translation method based on word lemmatization has a clear advantage.
The specific embodiments described above further explain the objectives, technical solutions, and advantages of the present invention in detail. It should be understood that the foregoing are merely specific embodiments of the present invention and do not limit it. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (4)

1. An English-Chinese statistical machine translation method based on word lemmatization, characterized in that: in the translation training stage, the words in the English sentences are lemmatized; the lemmatized English corpus and the word-segmented Chinese corpus are word-aligned, and a translation phrase table is extracted; the translation phrase table, a language model, and a reordering model together form a translation system; and an English sentence to be translated undergoes the same lemmatization preprocessing and is then translated by the translation system to obtain the Chinese translation.
2. The English-Chinese statistical machine translation method based on word lemmatization according to claim 1, characterized in that the nouns, verbs, adjectives, and adverbs in the English sentences are lemmatized.
3. The English-Chinese statistical machine translation method based on word lemmatization according to claim 1, characterized in that each word in the English sentence is converted to a combined form of lemma and part of speech, such as "lemma_part-of-speech" or another combined form.
4. The English-Chinese statistical machine translation method based on word lemmatization according to claim 1, characterized in that, for the comparative and superlative forms of adjectives and adverbs in the English sentence, the auxiliary words "more" and "most" are respectively added before the token after conversion to "lemma_part-of-speech".
CN201510130398.5A 2015-03-24 2015-03-24 English-Chinese statistical machine translation method based on word lemmatization Withdrawn CN106156007A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510130398.5A CN106156007A (en) 2015-03-24 2015-03-24 English-Chinese statistical machine translation method based on word lemmatization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510130398.5A CN106156007A (en) 2015-03-24 2015-03-24 English-Chinese statistical machine translation method based on word lemmatization

Publications (1)

Publication Number Publication Date
CN106156007A (en) 2016-11-23

Family

ID=58063939

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510130398.5A Withdrawn CN106156007A (en) 2015-03-24 2015-03-24 English-Chinese statistical machine translation method based on word lemmatization

Country Status (1)

Country Link
CN (1) CN106156007A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106980611A (en) * 2017-03-23 2017-07-25 吕海港 The Chinese machine annotation system and method for a kind of English Electronic document
CN108538111A (en) * 2017-12-14 2018-09-14 李敏 A kind of Chinese language teaching information system and its application method
CN110414013A (en) * 2019-07-31 2019-11-05 腾讯科技(深圳)有限公司 Data processing method, device and electronic equipment
CN110738045A (en) * 2019-10-25 2020-01-31 北京中献电子技术开发有限公司 English lexical analysis method and system oriented to neural network machine translation

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1134567A (en) * 1995-11-29 1996-10-30 陈肇雄 Morphological parsing algorithm for English-Chinese translation system
CN101452446A (en) * 2007-12-07 2009-06-10 株式会社东芝 Target language word deforming method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张霄军: "《计算机语言学》", 31 October 2011 *

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20161123