CN103116578A - Translation method integrating syntactic tree and statistical machine translation technology and translation device

Info

Publication number: CN103116578A
Application number: CN2013100497397A
Authority: CN
Inventors: 罗文�; 黄子河; 刘法旺; 胡小鹏; 宋金平; 袁琦
Original assignee: BEIJING SAIDI TRANSLATION TECHNOLOGY Co Ltd
Current assignee: BEIJING SAIDI TRANSLATION TECHNOLOGY Co Ltd
Priority date: 2013-02-07
Filing date: 2013-02-07
Publication date: 2013-05-22

Abstract

The invention discloses a translation method and a translation device that integrate a syntax tree with statistical machine translation technology. The method comprises the following steps. First, a dictionary, a grammar rule base, a phrase translation probability table and a target-language language model are established between the different languages. Next, word segmentation, part-of-speech disambiguation and syntactic analysis are performed on the input source sentence to generate a syntax tree. The syntax tree is then traversed top-down; for each individual node, and for certain runs of consecutive nodes that cross syntactic boundaries, the source text of the leaf nodes is matched against the phrase translation probability table trained by statistical machine translation. By using the translations in the phrase translation table together with the target-language language model, the fluency and accuracy of the output translation are improved. The method and device thus exploit both the fine-grained knowledge provided by the phrase translation table and the strength of the syntax tree in handling deep-structure and long-distance dependency problems in a sentence, so the quality of machine-translated text can be improved remarkably.

Description

A translation method and device integrating a syntax tree and statistical machine translation technology

Technical field

The present invention relates to the field of statistical and rule-based machine translation, and in particular to a machine translation method and device that integrate a syntax tree with statistical machine translation techniques such as a phrase translation probability table and a language model.

Background technology

With the spread of the Internet, computer processing of natural language has become an important means of obtaining knowledge from the network. In fields such as international exchange, scientific research and education, for example, people need to translate foreign spoken and written languages; in the past this was the stage on which master linguists displayed their talents. With the rapid development of hardware, the steady improvement of software and the deepening of linguistic research, machine translation is used more and more widely. Machine translation has great advantages of its own, such as fast translation speed, strong memory capacity and reduced translation cost, but its shortcoming is that translation quality is still far from meeting people's needs. How to develop high-quality machine translation methods has therefore become an important problem.

International evaluations in 2011 showed that the translation quality of data-driven and knowledge-driven machine translation is evenly matched, and that a single method alone can hardly meet users' needs. Error analysis of statistical and rule-based machine translation shows that the types of errors made by the different kinds of system are complementary. The weakness of rule-based systems is that they perform relatively poorly at lexical selection during transfer and at analyzing ill-formed sentences; their advantage is that no small part of the source text is omitted during analysis, so the translation can be more accurate. By contrast, statistical machine translation systems are highly adaptable, their use of phrase collocations makes the translation more fluent, and they are also better at lexical selection. Their biggest problem, however, is that they struggle with the fact that generating a translation requires linguistic knowledge: they lack morphological and syntactic functions as well as word-reordering functions, and reordering at the level of the whole sentence is harder still. In addition, the output of statistical systems is not always exact, and omissions and mistranslations sometimes occur.

Summary of the invention

Because machine translation with a single method cannot achieve good results, and data-driven and knowledge-driven machine translation have largely complementary strengths, combining the different methods is a reasonable way to raise machine translation quality. The machine translation method proposed by the present invention uses both the fine-grained knowledge provided by the statistical translation engine and the strength of the syntax tree in solving deep-structure and long-distance dependency problems of sentences. It can therefore significantly improve the quality of machine translation, and will effectively promote the development of hybrid-engine machine translation technology.

The present invention proposes a machine translation method integrating a syntax tree and statistical machine translation technology, comprising the following steps:

1) establishing a dictionary, a grammar rule base, a phrase translation probability table and a target-language language model between the different languages; wherein the dictionary stores corresponding words and phrases of the different languages, the grammar rule base stores corresponding grammar rules of the different languages, the phrase translation probability table stores translation segments between the different languages obtained by training a statistical machine translation system, and the target-language language model stores the language model of the target language obtained by training a statistical machine translation system;

2) reading the dictionary information and segmenting the input single sentence to be translated, decomposing it into words or phrases of the source language;

3) reading the grammar rule base information and performing part-of-speech disambiguation and syntactic analysis on the segmented sentence to form a syntax tree;

4) reading the phrase translation probability table information and traversing the syntax tree with a top-down strategy; for each individual node in the syntax tree, and for certain runs of consecutive nodes that cross syntactic boundaries, searching the phrase translation probability table with the source text of the leaf nodes, and choosing a translation from the phrase translation table as the translation of that node in the syntax tree; for syntax tree nodes left untranslated by this process, generating the translation by a rule-based translation method;

5) smoothing the generated translation with the target-language language model to produce the target-language output.

Preferably, the translation segments between the different languages stored in the phrase translation probability table are obtained by training with GIZA++.

Preferably, the language model of the target language is trained from a parallel corpus with the language model training tool SRILM or with an N-gram method.

The present invention also proposes a device that uses the above machine translation method, comprising:

a dictionary module, for storing corresponding words and phrases of the different languages;

a grammar rule base module, for storing corresponding grammar rules of the different languages;

a phrase translation probability table module, for storing translation segments between the different languages obtained by training a statistical machine translation system;

a target-language language model module, for storing the language model of the target language obtained by training a statistical machine translation system;

a parser, connected to the dictionary module and the grammar rule base module, for performing sentence division, segmentation, part-of-speech disambiguation and syntactic analysis on the source text in sequence according to the dictionary and the grammar rule base, and thereby generating a syntax tree;

a decoder, connected to the phrase translation probability table module, the language model module and the parser, for traversing the syntax tree according to the phrase translation probability table and the target-language language model, converting the source text into a translation, and generating the target-language output.

Further, the parser comprises:

a sentence division module, for reading the source text and splitting it into sentences;

a segmentation and preprocessing module, connected to the sentence division module, for segmenting and preprocessing each sentence after division;

a disambiguation module, connected to the segmentation and preprocessing module, for performing part-of-speech disambiguation on the segmented sentence;

a syntactic analysis module, connected to the disambiguation module, for performing syntactic analysis on the disambiguated sentence;

a master control module, connected to each of the above modules and controlling their operation.

The invention provides a machine translation method and device integrating a syntax tree, a phrase translation probability table and a language model. It adopts a strategy of scanning the syntax tree section by section and across syntactic boundaries while searching the phrase translation probability table and the language model of statistical machine translation. The method makes full use of the strength of traditional rule-based machine translation in solving deep-structure and long-distance dependency problems of sentences, and also benefits from the fine-grained knowledge provided by the phrase translation table and language model of statistical machine translation, improving the quality of the machine translation output as far as possible.

Description of drawings

Fig. 1 is a schematic diagram of the structure of the English-Chinese machine translation device in the embodiment;

Fig. 2 is a flow chart of the English-Chinese machine translation method in the embodiment;

Fig. 3 is a schematic diagram of the modules of the parser in Fig. 1;

Fig. 4 is a schematic diagram of the training of the translation probability table and the language model in the embodiment;

Fig. 5 is a schematic diagram of the syntax tree obtained in the embodiment.

Embodiment

The present invention is described in detail below through a specific embodiment, with reference to the accompanying drawings.

Fig. 1 is a schematic diagram of the structure of the machine translation device 100 of this embodiment, which integrates a syntax tree and statistical machine translation technology; Fig. 2 is a flow chart of machine translation carried out with this device.

Referring to Fig. 1, the device 100 comprises: a dictionary module 110, for storing corresponding words and phrases of the different languages; a grammar rule base module 120, for storing corresponding grammar rules of the different languages; a phrase translation probability table module 130, for storing translation segments between the different languages obtained by training a statistical machine translation system; a target-language language model module 140, for storing the target-language language model obtained by training a statistical machine translation system; a parser 150, connected to the dictionary module and the grammar rule base module, for performing sentence division, segmentation, part-of-speech disambiguation and syntactic analysis on the source document in sequence according to the dictionary and the grammar rule base, and generating a syntax tree; and a decoder 160, connected to the phrase translation probability table module, the language model module and the parser, for traversing the syntax tree according to the phrase translation probability table and the target-language language model, converting the source text into a translation, and generating the target-language output. The phrase translation probability table and the target-language language model are obtained by training on a parallel corpus, as shown in Fig. 2.

With reference to Fig. 1 and Fig. 2, and taking English as the source language and Chinese as the target language, the concrete translation process is described below. It mainly comprises the following steps:

1) performing morphological analysis on the English side of the bidirectional English-Chinese parallel corpus, and word segmentation on the Chinese side;

2) performing word alignment and phrase alignment on the parallel corpus with the GIZA++ statistical tool, and extracting an English-Chinese phrase translation probability table;

3) filtering the extracted English-Chinese phrase translation probability table to remove inaccurate statistical entries;

4) training the language model of the target language from the parallel corpus with the language model training tool SRILM;

5) reading the dictionary information and segmenting the input single sentence to be translated; reading the grammar rule base information and performing part-of-speech disambiguation and syntactic analysis on the segmented sentence to form a syntax tree; the disambiguation and syntactic analysis steps also identify and record noun or verb phrases that are absent from the dictionary or cannot be fully listed in it;

6) traversing the syntax tree with a top-down strategy, searching the phrase translation probability table for an entry matching the subtree rooted at the current node, and generating the translation;

7) when traversing the syntax tree, in addition to searching the statistical phrase table for the subtree under each root node, suitably adding some cross-syntax cases, so that the phrase translation probability table can be searched and used without destroying the syntax tree, making maximal use of the statistical phrase table to improve translation quality;

Consecutive nodes that cross syntactic boundaries must satisfy a specific structure in the syntax tree before the source text of their leaf nodes may be used to search the phrase translation probability table. For example, in "V N to V", the span "V N to" may be searched, whereas "N to V" may not. For the concrete handling of these cross-syntax cases, see steps 3) and 4) of the translation example below;

8) for syntax tree nodes left untranslated by the above process, generating the target language by combining the dictionary, the rules and the language model, that is, generating the translation by the rule-based translation method and smoothing the generated translation with the target-language language model.

There are "untranslated syntax tree nodes" because some segments cannot be found in the phrase translation probability table; about 29% of segments remain untranslated and must therefore be translated by the rule-based method. Note that the smoothing in step 8) focuses on the rule-translated output, but in other embodiments all previously generated translations (including those obtained from the phrase translation probability table) may also be smoothed; the present invention is not limited in this respect.
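The traversal in steps 6) to 8) can be sketched as follows. This is a minimal illustration under stated assumptions, not the decoder of the embodiment: the `Node` class, the `CROSS_SYNTAX` whitelist, the lowercasing of lookup keys and the `rule_translate` callback are all hypothetical names and choices, and word reordering by grammar rules is left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Node:
    """Syntax-tree node: a leaf carries a source word, an inner node carries children."""
    label: str
    word: Optional[str] = None
    children: List["Node"] = field(default_factory=list)

    def leaves(self) -> List[str]:
        """Return the source words under this node, left to right."""
        if self.word is not None:
            return [self.word]
        return [w for child in self.children for w in child.leaves()]


# Hypothetical whitelist of cross-syntax spans: for a node whose children carry
# these labels, the first k children may be looked up together ("V N to" in
# "V N to V"). Spans not listed here, such as "N to V", are never searched.
CROSS_SYNTAX: Dict[Tuple[str, ...], int] = {("V", "N", "to", "V"): 3}


def translate(node: Node,
              phrase_table: Dict[str, str],
              rule_translate: Callable[[Node], str]) -> str:
    """Top-down decoding: phrase table first, rule-based fallback last."""
    # 1. Try the whole subtree rooted at this node.
    hit = phrase_table.get(" ".join(node.leaves()).lower())
    if hit is not None:
        return hit
    # 2. A leaf the phrase table does not cover goes to the rule-based method.
    if not node.children:
        return rule_translate(node)
    # 3. Try an allowed run of consecutive children that crosses syntax.
    k = CROSS_SYNTAX.get(tuple(child.label for child in node.children))
    if k is not None:
        span = " ".join(w for child in node.children[:k] for w in child.leaves())
        hit = phrase_table.get(span.lower())
        if hit is not None:
            rest = [translate(c, phrase_table, rule_translate)
                    for c in node.children[k:]]
            return " ".join([hit] + rest)
    # 4. Otherwise descend into each child, keeping source order.
    return " ".join(translate(c, phrase_table, rule_translate)
                    for c in node.children)
```

The whitelist keeps the lookup from breaking the tree: only spans registered for a given child-label pattern are ever joined across node boundaries, and everything else is decoded node by node.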

As shown in Fig. 3, one embodiment of the parser comprises: a master control module 151, for managing and controlling the work of each module of the parser; a sentence division module 152, for splitting the English text to be translated into sentence strings; a segmentation and preprocessing module 153, for cutting an English sentence into a sequence of strings with the phrase as the unit, the preprocessing including punctuation handling, format handling and the like, which are common techniques in rule-based translation systems; a disambiguation module 154, for part-of-speech tagging the segmented English sentence by resolving part-of-speech ambiguity; and a syntactic analysis module 155, for relatively simple syntactic analysis that turns the segmented English sentence into a syntax tree.

The entries stored in the dictionary are annotated according to the requirements of the translation system and marked with the relevant semantic attributes, as follows:

Afromosia N afrormosia

&CAT[N]M_SEM[B]S_SEM[D]CLAS[]$

Afront F in front

&CAT[F]M_SEM[J]$

……

Mountain bike N mountain bike, mountain bike

&CAT[N]M_SEM[C]S_SEM[I]$

Mountain coast N steep coast

&CAT[N]M_SEM[C]S_SEM[B]CLAS[d]$

Mountain cork N asbestos

&CAT[N]M_SEM[C]S_SEM[G]NUM[U]$

The requirements of the translation system refer to the dictionary specification, which is defined by the developers of the rule-based translation system themselves and generally covers part-of-speech, syntactic and semantic annotation of entries; this is a common technique in rule-based translation systems.

The grammar rules stored in the grammar rule base specify, according to the requirements of the translation system, the translation rules for words or phrases, as follows:
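The attribute lines in the dictionary entries above can be read into key/value pairs with a few lines of code. The sketch below assumes, from the examples only, that an attribute line starts with `&`, ends with `$`, and consists of `NAME[value]` groups; the function name `parse_attributes` is hypothetical and the real dictionary specification may allow more than this.

```python
import re
from typing import Dict

# One NAME[value] group, e.g. CAT[N] or CLAS[]; the value may be empty.
_ATTRIBUTE = re.compile(r"([A-Z_]+)\[([^\]]*)\]")


def parse_attributes(line: str) -> Dict[str, str]:
    """Parse a line such as '&CAT[N]M_SEM[C]S_SEM[I]$' into a dict."""
    body = line.strip()
    if not (body.startswith("&") and body.endswith("$")):
        raise ValueError("attribute line must start with '&' and end with '$'")
    return dict(_ATTRIBUTE.findall(body[1:-1]))


if __name__ == "__main__":
    print(parse_attributes("&CAT[N]M_SEM[C]S_SEM[B]CLAS[d]$"))
```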

@with links to:

[24] (1) CAT[N]-->be related with %1

@reach:

[12] (1) CHI[level|value]-->MEANQ[0, reach];

[13] (0) CAT[V]+(1) CHI[conclusion]-->MEAN[0, draw];

[14] (0) CAT[V]+(1) CHI[goal]-->MEAN[0, reach];

[15] (0) CAT[V]&&IS_CENTER[1]+(1) CAT[N]&&L_CHI[agreement]-->MEAN[0, reach].

As shown in Fig. 4, the statistical training process that produces the phrase translation probability table and the language model is as follows: the statistical machine translation training tool GIZA++ is trained on the parallel corpus to obtain the phrase translation probability table, and the statistical machine translation language model training tool SRILM is trained on the parallel corpus to obtain the target-language language model. Besides SRILM, other language model training methods such as N-gram may also be used.

In the above embodiment, the extraction of the phrase translation probability table in step 2) is a key point of the present invention and is now described further. The phrase translation probability table here comprises four parts: a source language phrase f_{1}^{J} containing J words, a target language phrase e_{1}^{I} containing I words, the word alignment relation α between the words of the source language phrase and the target language phrase, and a phrase translation score p. An entry can be expressed as (f_{1}^{J}, e_{1}^{I}, α, p).

The phrase translation score is then calculated. It comprises four parts: the phrase translation probabilities

p (e_{1}^{I} | f_{1}^{J})

and

p (f_{1}^{J} | e_{1}^{I}),

and the lexical translation probabilities

p_{w} (e_{1}^{I} | f_{1}^{J}, α)

and

p_{w} (f_{1}^{J} | e_{1}^{I}, α).
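The four-part entry described above maps naturally onto a small record type. The sketch below is one possible representation, assuming a text layout of `source ||| target ||| alignment ||| scores` with alignment pairs written as `i-j`; the names `PhraseEntry` and `parse_entry` are hypothetical, and the score values in the usage line are illustrative only, not taken from the trained table.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PhraseEntry:
    source: Tuple[str, ...]                 # source language phrase, J words
    target: Tuple[str, ...]                 # target language phrase, I words
    alignment: Tuple[Tuple[int, int], ...]  # word alignment pairs (alpha)
    scores: Tuple[float, ...]               # phrase translation score p


def parse_entry(line: str) -> PhraseEntry:
    """Parse 'source ||| target ||| 0-0 1-1 ||| s1 s2 ...' into a PhraseEntry."""
    parts = [part.strip() for part in line.split("|||")]
    if len(parts) != 4:
        raise ValueError("expected four '|||'-separated fields")
    source, target, alignment, scores = parts
    pairs = tuple(
        (int(i), int(j))
        for i, j in (pair.split("-") for pair in alignment.split())
    )
    return PhraseEntry(
        source=tuple(source.split()),
        target=tuple(target.split()),
        alignment=pairs,
        scores=tuple(float(s) for s in scores.split()),
    )


if __name__ == "__main__":
    # Illustrative scores only.
    print(parse_entry("pruning is performed ||| completing pruning ||| 0-1 1-0 2-0 ||| 0.5 0.25 0.5 0.125"))
```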

The phrase translation probabilities are computed as:

p (e_{1}^{I} | f_{1}^{J}) = \frac{N (f_{1}^{J} | e_{1}^{I})}{\sum_{ee_{1}^{I}} N (f_{1}^{J} | ee_{1}^{I})}

p (f_{1}^{J} | e_{1}^{I}) = \frac{N (e_{1}^{I} | f_{1}^{J})}{\sum_{ff_{1}^{J}} N (e_{1}^{I} | ff_{1}^{J})}

In the formulas above, N (f_{1}^{J} | e_{1}^{I}) denotes the number of times the phrase pair (f_{1}^{J}, e_{1}^{I}) occurs in the corpus; ee_{1}^{I} ranges over all possible target language phrases corresponding to f_{1}^{J}, and N (f_{1}^{J} | ee_{1}^{I}) denotes the number of times the phrase pair (f_{1}^{J}, ee_{1}^{I}) occurs in the corpus; ff_{1}^{J} ranges over all possible source language phrases corresponding to e_{1}^{I}, N (e_{1}^{I} | ff_{1}^{J}) denotes the number of times the phrase pair (e_{1}^{I}, ff_{1}^{J}) occurs in the corpus, and N (e_{1}^{I} | f_{1}^{J}) denotes the number of times the phrase pair (e_{1}^{I}, f_{1}^{J}) occurs in the corpus.
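Both phrase translation probabilities are relative frequencies over the extracted phrase pairs. A minimal sketch, assuming the extracted pairs are available as a flat list of (source phrase, target phrase) tuples; the function name `phrase_translation_probs` is hypothetical:

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]  # (source phrase f, target phrase e)


def phrase_translation_probs(
    pairs: Iterable[Pair],
) -> Tuple[Dict[Pair, float], Dict[Pair, float]]:
    """Return (p(e|f), p(f|e)), both keyed by the pair (f, e)."""
    pair_counts = Counter(pairs)
    source_totals: Counter = Counter()  # sum over all e paired with f
    target_totals: Counter = Counter()  # sum over all f paired with e
    for (f, e), n in pair_counts.items():
        source_totals[f] += n
        target_totals[e] += n
    p_e_given_f = {(f, e): n / source_totals[f] for (f, e), n in pair_counts.items()}
    p_f_given_e = {(f, e): n / target_totals[e] for (f, e), n in pair_counts.items()}
    return p_e_given_f, p_f_given_e
```

For example, if the source phrase "a" is extracted twice with target "x" and once with target "y", then p(x | a) is 2/3 and p(y | a) is 1/3.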

The lexical translation probabilities are computed as:

In the formulas above, p (e_{i}, f_{j}) denotes the probability that the source language word f_{j} (j = 1...J) is translated as the target language word e_{i} (i = 1...I), and p (f_{j}, e_{i}) denotes the probability that the target language word e_{i} (i = 1...I) is translated as the source language word f_{j} (j = 1...J). α denotes the alignment relation between source language and target language word pairs.

In the above embodiment, generating the target language in step 8) by combining the dictionary, the rules and the language model means using the target-language language model to smooth the translation generated by the machine translation dictionary and rules, and/or the translation obtained from the phrase translation probability table, so as to improve the fluency of the translation. The method for computing the smoothness of a target-language translation with respect to the target-language language model is disclosed here:

1) A statistical model of the target language is represented by the conditional probability of the next word given the preceding words:

\hat{P} (w_{1}^{T}) = \prod_{t = 1}^{T} \hat{P} (w_{t} | w_{1}^{t - 1})

Here, w_{t} denotes the t-th word in the translation, w_{1}^{T} denotes w_{1}, ..., w_{T}, and w_{1}^{t - 1} denotes w_{1}, ..., w_{t - 1}.

2) due to

the N-gram model can be adopted to compute the conditional probability of the next word given the preceding words:

\hat{P} (w_{t} | w_{1}^{t - 1}) \approx \hat{P} (w_{t} | w_{t - n + 1}^{t - 1})

3) Let w_{1} ... w_{T} be the training set of a target language, with w_{t} ∈ V, where V is a finite set. The goal is then to design a good model:

f (w_{t}, ..., w_{t - n + 1}) = \hat{P} (w_{t} | w_{1}^{t - 1})

The formula above gives the maximum sample likelihood; its geometric mean is taken as:

Perplexity = 1 / \hat{P} (w_{t} | w_{1}^{t - 1})

4) in following formula, for arbitrarily Have

In this way, the smoothness of the target-language translation with respect to the target-language language model can be computed:

Score = \frac{1}{T} \sum_{t} \log \frac{f (w_{t}, w_{t - 1}, ..., w_{t - n + 1})}{Perplexity},

where T is the number of words in the training set of the target language.
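One possible reading of the Score formula can be sketched as follows. The sketch assumes that the per-word model probabilities f(w_t, ..., w_{t-n+1}) have already been obtained from some language model, that Perplexity is the geometric mean of their reciprocals over the same words, and that T is the number of words being scored; these are interpretations for illustration, and the function names are hypothetical.

```python
import math
from typing import Sequence


def perplexity(probs: Sequence[float]) -> float:
    """Geometric mean of 1/p over the given per-word probabilities."""
    if not probs or any(p <= 0.0 for p in probs):
        raise ValueError("probabilities must be a non-empty sequence of positive values")
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))


def smoothness_score(probs: Sequence[float]) -> float:
    """Average of log(f_t / Perplexity) over the scored words."""
    ppl = perplexity(probs)
    return sum(math.log(p / ppl) for p in probs) / len(probs)


if __name__ == "__main__":
    word_probs = [0.5, 0.5]              # hypothetical model outputs
    print(perplexity(word_probs))        # geometric mean of 1/0.5 and 1/0.5
    print(smoothness_score(word_probs))
```

Under this reading, a translation whose words the model finds more probable receives a higher (less negative) score.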

A concrete example is given below. The sentence to be translated in this example is:

Select this option to postpone deleting these records until pruning is performed.

First, the dictionary information is read and the input sentence is segmented; the grammar rule base information is read and part-of-speech disambiguation and syntactic analysis are performed on the segmented sentence to form a syntax tree, shown in Fig. 5.

Next, the syntax tree is decoded. The tree is traversed with a top-down strategy, starting from the top-level node [V] at the upper left and proceeding from the left towards the leaf nodes on the right. The detailed traversal steps are as follows:

1) Read the leaf node string of [V]: Select this option to postpone deleting these records until pruning is performed. Search the phrase translation probability table with this string; no matching translation segment is found.

2) Read the structure attribute of [V] and find that it is a "V Conj S V" structure. This structure is divided into two parts for translation, namely "V||Conj S V".

3) Read the leaf node string of the first part [V*] of "V||Conj S V": Select this option to postpone deleting these records. Search the phrase translation probability table with this string; no matching translation segment is found.

4) Read the structure attribute of [V*] and find that it is a "V N to V" structure. This structure can be split in two ways, namely "V N to||V" or "V N||to||V".

5) According to the maximum matching principle, the fewer the pieces produced by a split, the more accurate the translation result tends to be, so the first split "V N to||V" should be tried first. The leaf node source text "Select this option to" of "V N to" is read and used to search the phrase translation probability table, and the search succeeds:

Select this option to ||| select this option can ||| 0-0 1-1 2-2 3-3 ||| 10.0003327991397108e-007;

This example therefore uses "select this option can" as the translation of "Select this option to".

6) Read the leaf node string of the second part [V*] of "V N to||V": postpone deleting these records. Search the phrase translation probability table with this string, and the search succeeds:

postpone deleting these records ||| postponing these records of deletion ||| 0-0 1-1 2-2 3-3 ||| 10.00056812810.125;

This example uses "postponing these records of deletion" as the translation of "postpone deleting these records".

7) In this way, the first "V" of the whole sentence "V||Conj S V" can be translated as:

Select this option to postpone deleting these records → this option of selection can be postponed these records of deletion

8) Next, for "Conj S V", read its leaf node string: until pruning is performed. Search the phrase translation probability table with this string; no matching translation segment is found.

9) The "Conj S V" structure can be split into "Conj||S V".

10) "Conj" corresponds to the common word "until", so there is no need to search the phrase translation probability table for it.

11) Read the leaf node string of the second part "S V" of "Conj||S V": pruning is performed. Search the phrase translation probability table with this string, and the search succeeds:

pruning is performed ||| completing pruning ||| 0-1 1-0 2-0 ||| 15.14058e-0063.37201e-006

This example uses "completing pruning" as the translation of "pruning is performed".

12) In this way, the whole sentence "V||Conj S V" can be translated as:

Select this option to postpone deleting these records until pruning is performed. → this option of selection can be postponed these records of deletion, and until completes pruning.

13) For the "V||Conj S V" structure there are two basic translation patterns:

V||Conj S V → V, Conj S V
V||Conj S V → Conj S V, V

For the specific "Conj" here, "until", these become:

V||until S V → V, until S V
V||until S V → before S V, V

This gives two translation results:

Select this option to postpone deleting these records until pruning is performed. → this option of selection can be postponed these records of deletion, until complete pruning.

Select this option to postpone deleting these records until pruning is performed. → before completing pruning, select this option can postpone these records of deletion.

14) Which translation result to adopt is decided by computing the probability of the two sentences above under the target-language language model. Under the N-gram language model, let the word sequence of a translation result be w_{1}, w_{2}, ..., w_{m}; the probability of this translation result is:

P (w_{1}, ..., w_{m}) = \prod_{i = 1}^{m} P (w_{i} | w_{1}, ..., w_{i - 1}) \approx \prod_{i = 1}^{m} P (w_{i} | w_{i - (n - 1)}, ..., w_{i - 1})

Under the Markov assumption, the conditional probability above can be computed from N-gram frequency counts:

P (w_{i} | w_{i - (n - 1)}, ..., w_{i - 1}) = \frac{count (w_{i - (n - 1)}, ..., w_{i - 1}, w_{i})}{count (w_{i - (n - 1)}, ..., w_{i - 1})}

Here n is the order of the n-gram language model and can be set to 3. Actual computation gives a probability of 5.38125e-005 for the first translation and 4.20337e-006 for the second. The first translation is thus more probable and is selected as the final translation.
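The selection in step 14) can be illustrated with count-based n-gram probabilities. The sketch below is a bare maximum-likelihood estimate with hypothetical function names and hypothetical `<s>` / `</s>` padding symbols; it applies no smoothing or backoff, so any unseen n-gram drives the sentence probability to zero, which a production toolkit would avoid.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def _pad(sentence: Sequence[str], n: int) -> List[str]:
    """Add n-1 start symbols and one end symbol (padding is an assumption)."""
    return ["<s>"] * (n - 1) + list(sentence) + ["</s>"]


def train_counts(corpus: Iterable[Sequence[str]], n: int = 3) -> Tuple[Counter, Counter]:
    """Count n-grams and their (n-1)-word contexts in a tokenized corpus."""
    ngrams: Counter = Counter()
    contexts: Counter = Counter()
    for sentence in corpus:
        padded = _pad(sentence, n)
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            ngrams[context + (padded[i],)] += 1
            contexts[context] += 1
    return ngrams, contexts


def sentence_prob(sentence: Sequence[str], ngrams: Counter,
                  contexts: Counter, n: int = 3) -> float:
    """Product of count(context, w) / count(context) over the sentence."""
    padded = _pad(sentence, n)
    prob = 1.0
    for i in range(n - 1, len(padded)):
        context = tuple(padded[i - n + 1:i])
        if contexts[context] == 0:
            return 0.0
        prob *= ngrams[context + (padded[i],)] / contexts[context]
    return prob


def pick_best(candidates: Sequence[Sequence[str]], ngrams: Counter,
              contexts: Counter, n: int = 3) -> Sequence[str]:
    """Return the candidate translation with the highest probability."""
    return max(candidates, key=lambda s: sentence_prob(s, ngrams, contexts, n))
```

With n set to 3 this corresponds to the trigram setting mentioned in step 14); the two candidate translations would be passed to `pick_best` as token lists.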

"Select this option to" in the above process is not a complete node of the syntax tree but a run of nodes that crosses syntactic boundaries; it is part of the structure "V N to V". Such consecutive cross-syntax nodes should also be allowed to search the statistical phrase table as long as they satisfy a certain pattern. Doing so makes much fuller use of the statistical phrase table and yields better translation results, provided that the macro structure of the sentence is not destroyed.

The applicant tested the "machine translation method and device integrating a syntax tree, a phrase translation probability table and a language model" of the present invention in a practical customized English-Chinese system for the IT security field. An English-Chinese parallel corpus of 610,000 sentence pairs from the IT security field was chosen as the training corpus; the statistical alignment tool GIZA++ was used to train a phrase translation probability table containing 3,490,000 translation segments, and the Chinese side of the 610,000 pairs was used to train a language model. The translation method of the present invention was tested with 2468 English test sentences; the resulting BLEU value, TER value and readability are shown in Table 1.

As can be seen from Table 1, with the method of the present invention, phrase table segments (e.g. "Select this option to ## select this option can") occur 7684 times in total in the translations of the 2468 test sentences, an average of 3.11 per sentence. This frequency is very high: such segments account for 71.7% of all Chinese characters in the output translation.

That is, 71.7% of the output Chinese translation comes from the statistical phrase translation probability table. This shows that the method of the present invention makes full use of the fine-grained knowledge of statistical machine translation, and the improvement in translation quality is clear: the BLEU value, TER value and readability of the translation results all improve to varying degrees.

Table 1. Test data of the method of the present invention

The principles and features of the present invention have been described above through specific embodiments. It should be understood that the present invention is not limited to these specific embodiments; many variations are possible, and the concrete implementation steps may differ. The scope of protection of the present invention is defined only by the appended claims.

Claims

1. A machine translation method integrating a syntax tree and statistical machine translation technology, the steps comprising:

2. The method of claim 1, characterized in that the translation segments between the different languages stored in the phrase translation probability table are obtained by training with GIZA++.

3. The method of claim 1, characterized in that the target-language language model is obtained with the language model training tool SRILM or with an N-gram method.

4. The method of claim 1, characterized in that the phrase translation probability table comprises: a source language phrase f_{1}^{J} containing J words, a target language phrase e_{1}^{I} containing I words, the word alignment relation α between the words of the source language phrase and the target language phrase, and a phrase translation score p.

5. The method of claim 4, characterized in that the phrase translation score p comprises phrase translation probabilities and lexical translation probabilities; the phrase translation probabilities are computed as:

p (e_{1}^{I} | f_{1}^{J}) = \frac{N (f_{1}^{J} | e_{1}^{I})}{\sum_{ee_{1}^{I}} N (f_{1}^{J} | ee_{1}^{I})}

p (f_{1}^{J} | e_{1}^{I}) = \frac{N (e_{1}^{I} | f_{1}^{J})}{\sum_{ff_{1}^{J}} N (e_{1}^{I} | ff_{1}^{J})}

where N (f_{1}^{J} | e_{1}^{I}) denotes the number of times the phrase pair (f_{1}^{J}, e_{1}^{I}) occurs in the corpus; ee_{1}^{I} ranges over all possible target language phrases corresponding to f_{1}^{J}, and N (f_{1}^{J} | ee_{1}^{I}) denotes the number of times the phrase pair (f_{1}^{J}, ee_{1}^{I}) occurs in the corpus; ff_{1}^{J} ranges over all possible source language phrases corresponding to e_{1}^{I}, N (e_{1}^{I} | ff_{1}^{J}) denotes the number of times the phrase pair (e_{1}^{I}, ff_{1}^{J}) occurs in the corpus, and N (e_{1}^{I} | f_{1}^{J}) denotes the number of times the phrase pair (e_{1}^{I}, f_{1}^{J}) occurs in the corpus;

the lexical translation probabilities are computed as:

where p (e_{i}, f_{j}) denotes the probability that the source language word f_{j} (j = 1...J) is translated as the target language word e_{i} (i = 1...I), and p (f_{j}, e_{i}) denotes the probability that the target language word e_{i} (i = 1...I) is translated as the source language word f_{j} (j = 1...J); α denotes the alignment relation between source language and target language word pairs.

6. The method of claim 1, characterized in that the smoothness of the target-language translation with respect to the target-language language model is computed as follows:

1) the statistical model of the target language is represented by the conditional probability of the next word given the preceding words:

\hat{P} (w_{1}^{T}) = \prod_{t = 1}^{T} \hat{P} (w_{t} | w_{1}^{t - 1})

where w_{t} denotes the t-th word in the translation, w_{1}^{T} denotes w_{1}, ..., w_{T}, and w_{1}^{t - 1} denotes w_{1}, ..., w_{t - 1};

2) the N-gram model is adopted to compute the conditional probability of the next word given the preceding words:

\hat{P} (w_{t} | w_{1}^{t - 1}) \approx \hat{P} (w_{t} | w_{t - n + 1}^{t - 1})

3) let w_{1} ... w_{T} be the training set of a target language, with w_{t} ∈ V, where V is a finite set, and compute the maximum sample likelihood:

f (w_{t}, ..., w_{t - n + 1}) = \hat{P} (w_{t} | w_{1}^{t - 1}),

with its geometric mean:

Perplexity = 1 / \hat{P} (w_{t} | w_{1}^{t - 1})

4) for arbitrarily Have

The smoothness of the target-language translation with respect to the target-language language model is thereby obtained as:

Score = \frac{1}{T} \sum_{t} \log \frac{f (w_{t}, w_{t - 1}, ..., w_{t - n + 1})}{Perplexity},

where T is the number of words in the training set of the target language.

7. The method of claim 1, characterized in that the entries stored in the dictionary are annotated according to the requirements of the translation system and marked with the relevant semantic attributes; and the grammar rules stored in the grammar rule base specify the translation rules for words or phrases according to the requirements of the translation system.

8. The method of claim 1, characterized in that the translation probabilities of the different translation results are computed according to the target-language language model, and the translation with the higher probability is taken as the final translation.

9. A machine translation device integrating a syntax tree and statistical machine translation technology, characterized by comprising:

10. The device of claim 9, characterized in that the parser comprises:

a sentence division module, for reading the source document and splitting it into sentences;

a segmentation and preprocessing module, connected to the sentence division module, for segmenting and preprocessing each sentence after division;