CN101770458A

CN101770458A - Machine translation method based on example phrases

Info

Publication number: CN101770458A
Application number: CN200910002334A
Authority: CN
Inventors: 何亮; 万磊; 王进
Original assignee: Samsung Electronics China R&D Center; Samsung Electronics Co Ltd
Current assignee: Samsung Electronics China R&D Center; Samsung Electronics Co Ltd
Priority date: 2009-01-07
Filing date: 2009-01-07
Publication date: 2010-07-07

Abstract

The invention provides a machine translation method based on example phrases. The method comprises the following steps: extracting phrases according to word alignment information obtained from a bilingual aligned text to obtain a phrase alignment table; segmenting a source language sentence into a plurality of phrases according to the phrase alignment table and based on a predetermined principle; and performing phrase-based statistical machine translation on the segmented phrases. The invention improves both translation speed and translation quality. In addition, unknown words in the translation result are translated by using a bilingual dictionary in combination with an existing language model of the target language, which further improves translation quality.

Description

Machine translation method based on example phrases

Technical field

The present invention relates to the field of machine translation, specifically to corpus-based machine translation, and describes a method of translating by using example phrases.

Background technology

Machine translation is the automatic translation of one natural language into another. The problem that machine translation must solve is to use a computer to automatically translate a sentence or fragment in a source language (SL) into the corresponding sentence or fragment in a target language (TL). There are many types of machine translation system, including example-based machine translation (EBMT) systems and phrase-based machine translation (PBMT) systems.

The basic idea of an EBMT system is to translate by analogy from existing experience, without deep syntactic or semantic analysis. The idea is realized as follows: the main knowledge source of the system is a translation example base of bilingual sentence pairs. When a source language sentence S is input, the system finds the sentence S' most similar to S and imitates the translation T' of S': the places where S and S' do not match are translated and substituted for the corresponding parts of T', and the resulting translation T of S is then output. The characteristic of this approach is that a high-quality translation can be produced as long as an example sentence with very high similarity, or even an identical one, exists. The EBMT method requires a very large example base as support.

The basic idea of a PBMT system is to use the phrase as the basic unit of translation. During translation, the system does not translate each word in isolation but translates several consecutive words together. Because the granularity of translation is enlarged, the phrase-based method handles local context dependencies easily and translates idioms and common collocations well. In the phrase-based method, a phrase can usually be any continuous string without grammatical restriction, so bilingual phrases can easily be extracted automatically from a word-aligned bilingual corpus and used to translate a given source language sentence. The phrase-based method requires the system to be trained. For training, a bilingual corpus, that is, a set of sentences that are translations of each other, is first input. Word alignment then shows which words in a sentence pair are translations of each other. Next, phrase extraction is performed: all continuous word strings in the corpus that are translations of each other are extracted, regardless of whether a word string has a real meaning.

However, EBMT has the following defect: if the similarity threshold is too high, the matching success rate is low; conversely, if the similarity threshold is too low, fuzzy matching produces poor translation quality. To raise the matching success rate while guaranteeing translation quality, the only option is to build a large-scale example base, which requires a great deal of time, manpower and material resources. PBMT has the following defect: when a sentence is translated, all possible phrases (any continuous word string can be regarded as a phrase) and the combinations of these phrases must be considered, which greatly reduces translation speed; at the same time, for long sentences or phrases, a large number of ambiguities must be handled during translation, which degrades the translation result. In addition, pure EBMT and PBMT methods do not consider how to handle unknown words that do not appear in the corpus, especially the large body of specialized vocabulary. One way of handling them is to expand the example base or the bilingual aligned corpus so as to enlarge its vocabulary coverage. However, on the one hand, building an example base or a bilingual aligned corpus requires a great deal of time, manpower and material resources; on the other hand, whenever new words appear, the system must be retrained after the corpus is expanded.

Summary of the invention

According to an aspect of the present invention, the phrase-based machine translation method is combined with the example-based idea. Without modifying an existing PBMT system, the example-based method is introduced to make full use of the existing phrase alignment data and of the advantage that matched sentences can be translated quickly and with high quality, so that translation speed and translation quality are improved together. At the same time, a bilingual dictionary is used in combination with an existing language model of the target language to translate unknown words in the translation result. A bilingual dictionary is significantly easier to build than a set of bilingual sentence pairs, and new words can be translated simply by expanding the dictionary, without retraining the existing system.

According to an aspect of the present invention, a machine translation method based on example phrases is provided, the method comprising: performing phrase extraction according to word alignment information obtained from a bilingual aligned text to obtain a phrase alignment table; segmenting a source language sentence into several phrases according to the phrase alignment table and based on a predetermined principle; and performing phrase-based statistical machine translation on the segmented phrases.

According to an aspect of the present invention, the method may further comprise: translating unknown words by using a bilingual dictionary and a language model of the target language.

According to an aspect of the present invention, the principle on which the step of segmenting the source language sentence is based is: maximizing the phrase coverage after segmentation, wherein coverage is the total number of characters of the covered phrases in the source language sentence divided by the total number of characters of the source language sentence, and a phrase is covered if the segmented phrase is present in the phrase alignment table.

According to an aspect of the present invention, in the step of segmenting the source language sentence, on the premise that the phrase coverage after segmentation is the highest, the number of phrases of the source language sentence is minimized.

According to an aspect of the present invention, on the premise that the phrase coverage after segmentation is the highest and the number of phrases of the source language sentence is the smallest, the segmented phrases are made as long as possible.

According to an aspect of the present invention, the source language sentence may be segmented into several phrases by finding the shortest path between two fixed vertices, as in graph theory.

According to an aspect of the present invention, the step of segmenting the source language sentence by finding the shortest path between two fixed vertices comprises: defining a vertex between every two adjacent characters of the source language sentence, and setting one vertex before the first character and one vertex after the last character of the sentence; setting the weights of the edges connecting two vertices in the graph to the same value; and using the A* algorithm or Dijkstra's algorithm to find the shortest path between the first and last vertices.

According to an aspect of the present invention, the step of translating unknown words may comprise: retrieving from the bilingual dictionary the possible translations of each unknown word in the source language sentence; in the result obtained by performing phrase-based statistical machine translation on the segmented phrases, replacing the unknown word with each of its possible translations; using the language model of the target language to calculate the probability value of the sentence after replacement; and selecting the replacement with the highest probability value as the final translation result.

Description of drawings

The present invention is described in further detail below with reference to the drawings and specific embodiments.

Fig. 1 is a flowchart of the machine translation method based on example phrases according to an embodiment of the invention;

Fig. 2 is a schematic diagram of constructing a phrase alignment table according to the prior art;

Fig. 3 is an example of the phrase segmentation method according to an embodiment of the invention;

Fig. 4 is an example of phrase-based statistical machine translation according to the prior art;

Fig. 5 is a flowchart of translating unknown words according to an embodiment of the invention.

Embodiment

The system and method of the present invention consist of the following core parts: construction of the phrase alignment table, example phrase segmentation, phrase-based translation, and unknown word translation.

Fig. 1 shows a flowchart of the machine translation method based on example phrases according to an embodiment of the invention, which comprises the following steps:

In step S100, the phrase alignment table is constructed. During construction, GIZA++ is used to obtain word alignment information from the bilingual aligned text, and phrase extraction is then performed according to the word alignment information to obtain the phrase alignment table. The phrase alignment table consists of three parts: source language phrases, target language phrases and probability values. Fig. 2 is an example of constructing a phrase alignment table and illustrates the input of the phrase alignment table construction module; there are several probability values, which together measure the probability of the phrase alignment.
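As an illustration of step S100, the following Python sketch extracts phrase pairs that are consistent with a word alignment and estimates a single relative-frequency probability for each pair. The consistency criterion, the names extract_phrases and build_phrase_table, the max_len limit and the toy sentence pair are assumptions made for illustration; the embodiment only states that GIZA++ supplies the word alignment and that the table stores source phrases, target phrases and several probability values.

```python
from collections import Counter, defaultdict

def extract_phrases(src, tgt, alignment, max_len=4):
    """Return phrase pairs consistent with the word alignment.

    src, tgt: token lists; alignment: set of (src_index, tgt_index) links.
    A pair is kept when no word inside the target span is linked to a
    source word outside the source span.
    """
    pairs = []
    for s_start in range(len(src)):
        for s_end in range(s_start, min(s_start + max_len, len(src))):
            # Target positions linked to the current source span.
            linked = [t for (s, t) in alignment if s_start <= s <= s_end]
            if not linked:
                continue
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue
            # Consistency check on the target side.
            if any(t_start <= t <= t_end and not (s_start <= s <= s_end)
                   for (s, t) in alignment):
                continue
            pairs.append((" ".join(src[s_start:s_end + 1]),
                          " ".join(tgt[t_start:t_end + 1])))
    return pairs

def build_phrase_table(corpus):
    """Map each source phrase to {target phrase: relative frequency}."""
    counts, src_totals = Counter(), Counter()
    for src, tgt, alignment in corpus:
        for pair in extract_phrases(src, tgt, alignment):
            counts[pair] += 1
            src_totals[pair[0]] += 1
    table = defaultdict(dict)
    for (s, t), c in counts.items():
        table[s][t] = c / src_totals[s]
    return table

# Toy sentence pair with placeholder tokens; A/B are aligned crosswise to y/x.
corpus = [(["A", "B", "C"], ["x", "y", "z"], {(0, 1), (1, 0), (2, 2)})]
pairs = extract_phrases(*corpus[0])
table = build_phrase_table(corpus)
```

In this toy pair the span "B C" is rejected because its target span would also contain y, which is aligned to A. A real phrase table would carry several scores per entry rather than the single relative frequency used here.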

In step S200, example phrase segmentation is performed. The input of example phrase segmentation is a source language sentence. This sentence may be word-segmented in advance, that is, the text is split into words so that, as in English, the words in the sentence are delimited by spaces. For example, the sentence "machine translation system and method" is split into a form in which the words are separated by spaces. One benefit of doing so is that, after word segmentation, the subsequent phrase segmentation can operate on words instead of characters, which obviously improves translation efficiency. Alternatively, the source language sentence input to example phrase segmentation may be left without any preprocessing and input as a continuous character string.

In step S200, the source language sentence is segmented into several source language phrases according to the phrase alignment table, with the phrases separated by spaces. Segmentation follows the principles below:

First, the phrase coverage of the segmented sentence is the highest (coverage = total number of characters of the covered phrases in the sentence / total number of characters of the sentence), where a phrase is said to be covered if the segmented phrase is present in the phrase alignment table. Second, on the above premise, the number of cuts made to the sentence is the smallest, that is, after segmentation the sentence contains the fewest space-separated phrases. For example, if the sentence "machine translation system and method" contains two spaces after segmentation, and both spaces were produced by segmentation, its cut number is said to be 2. Third, on the two premises above, the case in which the segmented phrases are longest is considered, that is, among several segmentation modes, the one containing the longest phrase is preferred. The longer a phrase is, the less often it occurs in the original aligned text, the less complex the situations in which it occurs, and the higher the uniqueness and accuracy of its alignment with the target language. In the extreme case, the whole sentence has occurred in the aligned text and is present in the phrase alignment table; the sentence then needs no segmentation at all and can simply be translated directly.

The following is an example of phrase segmentation. The input sentence is "machine translation system and method", and the phrase alignment table is assumed to contain the following phrases: a. machine, b. translation, c. system, d. method, e. machine translation, f. translation system, g. machine translation system, h. system and method. Possible segmentations (not all are listed), with their coverage, cut number and longest phrase length counted in source language characters, are then as follows, where "/" marks a cut:

(1) machine translation system / and / method (coverage: 8/9, cut number: 2, longest phrase: 6)

(2) machine / translation / system and method (coverage: 9/9, cut number: 2, longest phrase: 5)

(3) machine translation / system / and / method (coverage: 8/9, cut number: 3, longest phrase: 4)

(4) machine translation / system and method (coverage: 9/9, cut number: 1, longest phrase: 5)

According to the above phrase segmentation principles, segmentation (4), "machine translation / system and method", is finally selected: its phrase coverage is the highest and, on that premise, its cut number is the smallest.
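The three segmentation principles can be read as a lexicographic ranking. The sketch below enumerates every segmentation of the example sentence by brute force and ranks it by coverage, then fewest cuts, then longest covered phrase. The English glosses stand in for the source language words, and the per-word character counts are assumptions chosen to add up to the nine characters implied by the statistics above; the names WORDS, PHRASE_TABLE, score and best_segmentation are hypothetical.

```python
from itertools import combinations

# English glosses of the example words with assumed source-language character counts.
WORDS = [("machine", 2), ("translation", 2), ("system", 2), ("and", 1), ("method", 2)]

# Phrases a to h of the example phrase alignment table, as gloss tuples.
PHRASE_TABLE = {
    ("machine",), ("translation",), ("system",), ("method",),
    ("machine", "translation"), ("translation", "system"),
    ("machine", "translation", "system"), ("system", "and", "method"),
}

def segmentations(n):
    """Yield every split of n words into contiguous (start, end) spans."""
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield list(zip(bounds, bounds[1:]))

def score(spans):
    """Return (coverage, negated cut number, longest covered phrase length)."""
    total = sum(length for _, length in WORDS)
    covered = longest = 0
    for start, end in spans:
        if tuple(w for w, _ in WORDS[start:end]) in PHRASE_TABLE:
            size = sum(length for _, length in WORDS[start:end])
            covered += size
            longest = max(longest, size)
    return (covered / total, -(len(spans) - 1), longest)

def best_segmentation():
    """Pick the segmentation that ranks highest under the three principles."""
    best = max(segmentations(len(WORDS)), key=score)
    return [" ".join(w for w, _ in WORDS[s:e]) for s, e in best]
```

Under these assumptions best_segmentation() returns the two phrases "machine translation" and "system and method". Brute-force enumeration grows exponentially with sentence length, so it only serves to make the ranking explicit.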

In step S300, phrase-based statistical machine translation is performed on the segmented input sentence obtained in step S200. A phrase-based statistical machine translation system consists mainly of four parts: a translation model, a language model of the target language, a reordering model and a decoder. The translation model gives the mutual-translation relations between source language phrases and target language phrases and expresses the strength of each relation with a probability value; the higher the probability value, the more accurate the translation correspondence. It is used to provide possible target language translations for the source language sentence. The language model of the target language stores a large number of probability values, which give the probabilistic relation between each word and the words or phrases before and after it. Its function is to judge the degree to which a sentence S_t conforms to the grammar and usage of the target language, and it is used to select among translation results. This degree is generally measured with a probability value P_LM(S_t); the higher the value of P_LM(S_t), the better the sentence conforms to the target language. The function of the reordering model is to adjust the order of the words or phrases in the translated target language result. The function of the decoder is to coordinate the above models: it combines the probability values of the translation model, the language model of the target language and the reordering model to translate the source language sentence. The output of step S300 is a preliminary target language sentence, which may contain unknown words that have not been translated; these words remain in their source language form.

In addition, the present invention may also comprise step S400, in which unknown words are translated to obtain the translation result. Unknown word translation relies on two important components: a bilingual dictionary with a large vocabulary and a language model of the target language. The bilingual dictionary provides possible translation items for an unknown word, and the language model of the target language selects the most suitable of these translation items as the translation. The most basic bilingual dictionary entry comprises two parts: a word W_SL of the source language (SL) and a group of corresponding translations W_TLi in the target language (TL). Other information can be added to the bilingual dictionary as needed, such as part-of-speech information, or a probability value assigned to each translation W_TLi in the group to represent the likelihood that W_SL is translated as W_TLi. The language model of the target language is also an essential component of the phrase-based translation in step S300. Its function is to judge the degree to which a sentence S_t conforms to the grammar and usage of the target language, and it is used to select among translation results. This degree is generally measured with a probability value P_LM(S_t); the higher the value of P_LM(S_t), the better the sentence conforms to the target language, and the translation item with the highest P_LM(S_t) value is selected as the final translation result.

Fig. 3 shows an example of a preferred phrase segmentation method that follows the above segmentation principles. The method converts the phrase segmentation problem into the graph-theoretic problem of finding the shortest path between two fixed vertices. First, a vertex is defined between every two adjacent characters of the sentence (in languages such as English, a "character" here means a word delimited by spaces), and one further vertex is set before the first character and after the last character of the sentence. An edge in the graph indicates that the phrase formed by the characters it spans can be found in the phrase alignment table. All edges in the graph have weight 1; setting every weight to 1 expresses that the characters spanned by an edge are treated as a whole and handled as one phrase. The weights may also be set to another value, provided that all weights are equal. The A* algorithm or Dijkstra's algorithm is then used to find the shortest path between the first and last vertices. If the sentence contains an unknown word, that is, if the shortest path is infinite, the graph decomposes into several connected subgraphs, and it suffices to apply the A* algorithm or Dijkstra's algorithm to each subgraph. Finally, among all results with the same shortest path length, the result containing the largest span is selected (the span is the number of characters covered by an edge; the larger the span, the longer the corresponding phrase after segmentation). If there are several subgraphs, the selection is made for each subgraph separately.

In Fig. 3, the phrases covered by edges a, b, c, d, e, f, g and h (corresponding respectively to machine, translation, system, method, machine translation, translation system, machine translation system, and system and method) can be found in the phrase alignment table. Of the paths (a, b, h) and (e, h), the path consisting of edges e and h is then selected, which yields the segmented input sentence.
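A minimal sketch of the shortest-path formulation of Fig. 3 follows, using Dijkstra's algorithm with unit edge weights on a sentence that has already been split into words. Folding the largest-span tie-break into the path cost, returning None when the last vertex is unreachable instead of processing each connected subgraph, and the name segment_by_shortest_path are simplifications and assumptions of this sketch, not details taken from the embodiment.

```python
import heapq

def segment_by_shortest_path(words, phrase_table):
    """Split words into the fewest phrases found in phrase_table.

    Vertices 0..len(words) sit at the word boundaries; an edge (i, j) exists
    when words[i:j] is in phrase_table, and every edge has weight 1. Equally
    short paths are ranked by their longest edge. Returns None when the last
    vertex cannot be reached, for example because of an unknown word.
    """
    n = len(words)
    edges = {i: [j for j in range(i + 1, n + 1) if tuple(words[i:j]) in phrase_table]
             for i in range(n)}
    # Cost is (number of edges, negated longest span), compared lexicographically.
    heap = [((0, 0), 0, [])]
    settled = set()
    while heap:
        cost, vertex, path = heapq.heappop(heap)
        if vertex in settled:
            continue
        settled.add(vertex)
        if vertex == n:
            return [" ".join(words[i:j]) for i, j in path]
        for nxt in edges[vertex]:
            if nxt not in settled:
                new_cost = (cost[0] + 1, min(cost[1], -(nxt - vertex)))
                heapq.heappush(heap, (new_cost, nxt, path + [(vertex, nxt)]))
    return None

# Phrases a to h of the example, written as tuples of English glosses.
table = {
    ("machine",), ("translation",), ("system",), ("method",),
    ("machine", "translation"), ("translation", "system"),
    ("machine", "translation", "system"), ("system", "and", "method"),
}
sentence = ["machine", "translation", "system", "and", "method"]
segments = segment_by_shortest_path(sentence, table)
```

For the example graph the search reaches the last vertex through edges e and h, so segments holds "machine translation" and "system and method". Because every edge points from left to right, a simple dynamic program over the vertices would give the same result; Dijkstra's algorithm is used here only because the embodiment names it.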

Fig. 4 shows an example of phrase-based statistical machine translation. Three translation results of the source sentence meaning "the meeting will be held in May", namely "The meeting will be held in May", "The meeting will hold in May" and "The meeting will in may be held", have language model probability values of 0.9, 0.7 and 0.2 respectively, so the translation result with the highest language model probability value, "The meeting will be held in May", is chosen.
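To make the role of the language model concrete, the sketch below trains a toy bigram model with add-one smoothing and uses it to rank the three candidate translations of Fig. 4. The class BigramLM, the three training sentences and the smoothing scheme are assumptions for illustration; the scores it produces are log-probabilities and are not the values 0.9, 0.7 and 0.2 quoted above.

```python
import math
from collections import Counter

class BigramLM:
    """Add-one smoothed bigram language model over whitespace tokens."""

    def __init__(self, sentences):
        self.bigrams = Counter()
        self.contexts = Counter()
        vocab = set()
        for sentence in sentences:
            tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
            vocab.update(tokens)
            for prev, word in zip(tokens, tokens[1:]):
                self.bigrams[(prev, word)] += 1
                self.contexts[prev] += 1
        self.vocab_size = len(vocab)

    def log_prob(self, sentence):
        """Return the smoothed log-probability of a sentence."""
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        total = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            numerator = self.bigrams[(prev, word)] + 1
            denominator = self.contexts[prev] + self.vocab_size
            total += math.log(numerator / denominator)
        return total

# Hypothetical target-language training text (not from the patent).
lm = BigramLM([
    "the meeting will be held in may",
    "the party will be held in june",
    "the meeting will start in may",
])
candidates = [
    "The meeting will be held in May",
    "The meeting will hold in May",
    "The meeting will in may be held",
]
best = max(candidates, key=lm.log_prob)
```

With this toy training text the model ranks the candidates in the same order as Fig. 4 and selects "The meeting will be held in May". A production system would use a higher-order model trained on a much larger target language text.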

Fig. 5 shows the basic flow of translating unknown words. In step S510, for each untranslated unknown word W_Unknown in the preliminary translation result obtained by phrase-based machine translation, the possible translations of the unknown word are retrieved from the bilingual dictionary. Then, in step S520, the following operations are carried out for each possible translation T_i of the unknown word: first, W_Unknown is replaced with T_i; next, the language model of the target language scores the sentence after replacement, that is, it calculates the probability value of the sentence after replacement; finally, the replacement with the highest probability value is selected as the final translation result. The language model of the target language is generated from a target language text, which stores a set of target language sentences and is the raw material for generating the language model. When a sentence contains several unknown words, one unknown word may be translated at a time, although the present invention is not limited to this.

An example of unknown word translation is introduced below. In this example, the source language is Chinese and the target language is English. Suppose the input source language sentence means "Could you tell me the payment terms?" and the word meaning "payment" is an unknown word. The preliminary result obtained after the phrase-based translation step is then "Would you please tell me the terms of [payment].", where [payment] stands for the unknown word, which is still in its source language form. In the bilingual dictionary, the word meaning "payment" has the following translation items: defray, disburse, pay and payment. Replacing the unknown word with each possible translation item from the bilingual dictionary gives the following intermediate results:

a. Would you please tell me the terms of defray.

b. Would you please tell me the terms of disburse.

c. Would you please tell me the terms of pay.

d. Would you please tell me the terms of payment.

The English language model then scores these four intermediate results. Because "payment terms" has the common English rendering "terms of payment", and "Would you please tell me the terms of payment." better conforms to English grammar and usage, the English language model gives this result a higher score: among the scores 0.4, 0.4, 0.7 and 0.9 obtained for intermediate results a, b, c and d respectively, the highest is selected as the final result: Would you please tell me the terms of payment.
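The replace-and-score loop of steps S510 and S520 can be sketched as follows. The function translate_unknown_words, the [payment] placeholder for the untranslated source word and the toy_lm_score stand-in are hypothetical; the four scores are the ones quoted in the example above, whereas a real system would obtain them from the language model of the target language.

```python
def translate_unknown_words(sentence_tokens, unknown_positions, dictionary, lm_score):
    """Replace each untranslated token with the candidate the LM scores highest.

    sentence_tokens: preliminary translation as a token list.
    unknown_positions: indices of tokens still in source-language form.
    dictionary: maps a source word to a list of candidate translations.
    lm_score: callable returning a score for a token list (higher is better).
    Unknown words are handled one at a time, from left to right.
    """
    tokens = list(sentence_tokens)
    for pos in unknown_positions:
        candidates = dictionary.get(tokens[pos], [])
        if not candidates:
            continue  # no dictionary entry: leave the word untouched
        tokens[pos] = max(
            candidates,
            key=lambda cand: lm_score(tokens[:pos] + [cand] + tokens[pos + 1:]),
        )
    return tokens

# Placeholder for the source-language word meaning "payment".
UNKNOWN = "[payment]"
preliminary = ["Would", "you", "please", "tell", "me", "the", "terms", "of", UNKNOWN]
dictionary = {UNKNOWN: ["defray", "disburse", "pay", "payment"]}

# Scores from the worked example, keyed by the replaced word.
example_scores = {"defray": 0.4, "disburse": 0.4, "pay": 0.7, "payment": 0.9}

def toy_lm_score(tokens):
    """Stand-in for a language model: look up the score of the last token."""
    return example_scores[tokens[-1]]

result = translate_unknown_words(preliminary, [8], dictionary, toy_lm_score)
```

Here result ends with "payment", matching intermediate result d. Extending the dictionary is enough to handle a new word, since nothing else in the pipeline needs retraining.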

In Chinese-Korean translation carried out with the translation method of this patent, on a test set consisting half of closed tests (test sentences selected from the training set) and half of open tests (test sentences not in the training set), the speed of Chinese-to-Korean and Korean-to-Chinese translation improved by 80% and 90% respectively relative to a phrase-based machine translation model (such as the open-source translation model Moses), and the number of sentences in the translation results whose fluency was obviously improved increased by 30%, for example:

Example 1

Korean:

Chinese: Can you help me cash this traveller's check?

Translation result of the phrase-based model: Can this cash traveller's check?

Translation result according to the present invention: Please help me cash the traveller's check?

Example 2

Korean:

Chinese: Have a bit of wine?

Translation result of the phrase-based model: Is wine drunk?

Translation result according to the present invention: May I ask you to drink a glass of wine?

The present invention improves on existing phrase-based machine translation models in both speed and accuracy. Because the decoding of a sentence is reduced from a process whose unit is the character or word to a process whose unit is the phrase, the decoding search space is narrowed and decoding speed is increased; at the same time, decoding in units of phrases reduces the ambiguity between the words inside a phrase and improves translation accuracy. In addition, the translation of unknown words further improves translation quality.

Claims

1. A machine translation method based on example phrases, the method comprising:

performing phrase extraction according to word alignment information obtained from a bilingual aligned text to obtain a phrase alignment table;

segmenting a source language sentence into several phrases according to the phrase alignment table and based on a predetermined principle; and

performing phrase-based statistical machine translation on the segmented phrases.

2. The method of claim 1, characterized in that the method further comprises:

translating unknown words by using a bilingual dictionary and a language model of the target language.

3. The method of claim 1 or 2, characterized in that the principle on which the step of segmenting the source language sentence is based is: maximizing the phrase coverage after segmentation, wherein coverage is the total number of characters of the covered phrases in the source language sentence divided by the total number of characters of the source language sentence, and a phrase is covered if the segmented phrase is present in the phrase alignment table.

4. The method of claim 3, characterized in that, in the step of segmenting the source language sentence, on the premise that the phrase coverage after segmentation is the highest, the number of phrases of the source language sentence is minimized.

5. The method of claim 4, characterized in that, on the premise that the phrase coverage after segmentation is the highest and the number of phrases of the source language sentence is the smallest, the segmented phrases are made as long as possible.

6. The method of claim 1 or 2, characterized in that the source language sentence is segmented into several phrases by finding the shortest path between two fixed vertices, as in graph theory.

7. The method of claim 6, characterized in that the step of segmenting the source language sentence by finding the shortest path between two fixed vertices comprises: defining a vertex between every two adjacent characters of the source language sentence, and setting one vertex before the first character and one vertex after the last character of the sentence; setting the weights of the edges connecting two vertices in the graph to the same value; and using the A* algorithm or Dijkstra's algorithm to find the shortest path between the first and last vertices.

8. The method of claim 2, characterized in that the step of translating unknown words comprises:

retrieving from the bilingual dictionary the possible translations of each unknown word in the source language sentence;

in the result obtained by performing phrase-based statistical machine translation on the segmented phrases, replacing the unknown word with each of its possible translations;

using the language model of the target language to calculate the probability value of the sentence after replacement; and

selecting the replacement with the highest probability value as the final translation result.