CN103116578A - Translation method integrating syntactic tree and statistical machine translation technology and translation device - Google Patents

Translation method integrating syntactic tree and statistical machine translation technology and translation device

Info

Publication number
CN103116578A
Authority
CN
China
Prior art keywords
translation
language
phrase
target language
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013100497397A
Other languages
Chinese (zh)
Inventor
罗文�
黄子河
刘法旺
胡小鹏
宋金平
袁琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING SAIDI TRANSLATION TECHNOLOGY Co Ltd
Original Assignee
BEIJING SAIDI TRANSLATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING SAIDI TRANSLATION TECHNOLOGY Co Ltd filed Critical BEIJING SAIDI TRANSLATION TECHNOLOGY Co Ltd
Priority to CN2013100497397A priority Critical patent/CN103116578A/en
Publication of CN103116578A publication Critical patent/CN103116578A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a translation method and a translation device that integrate a syntax tree with statistical machine translation technology. The method comprises the following steps. First, a dictionary, a grammar rule base, a phrase translation probability table and a target-language language model are established for the language pair. Next, the input source sentence is segmented, its part-of-speech ambiguities are resolved and it is parsed, producing a syntax tree. The syntax tree is then traversed top-down; for each individual node, and for certain runs of adjacent nodes that cross syntactic boundaries, the source text of the leaf nodes is matched against the phrase translation probability table trained by statistical machine translation. The translations from the phrase translation table, together with the target-language language model, are used to improve the fluency and accuracy of the output translation. The method and device thus exploit both the fine-grained knowledge provided by the phrase translation table and the strengths of the syntax tree in handling deep-structure and long-distance dependency problems within a sentence, and can markedly improve the quality of machine-translated text.

Description

A translation method and device integrating a syntax tree and statistical machine translation technology
Technical field
The present invention relates to the field of statistical and rule-based machine translation, and in particular to a machine translation method and device that integrate a syntax tree with statistical machine translation techniques such as a phrase translation probability table and a language model.
Background art
With the spread of the Internet, computer processing of natural language has become an important means of obtaining knowledge from the Internet. For example, in fields such as international exchange, scientific research and education, people need to translate foreign spoken and written languages; in the past this was the stage on which master linguists displayed their talents. With the rapid development of hardware, the continuous improvement of software and the deepening of language research, machine translation is being used more and more widely. Machine translation has great advantages of its own, such as high translation speed, large memory capacity and lower translation cost, but its drawback is that translation quality is still far from meeting users' needs. How to develop a high-quality machine translation method has therefore become an important problem.
International evaluations in 2011 showed that data-driven and knowledge-driven machine translation are evenly matched in translation quality, and that a single method can hardly meet users' needs. Error analyses of statistical and rule-based machine translation show that the error types of the different kinds of system are complementary. The weakness of the rule-based approach is that it performs relatively poorly at lexical choice during transfer and at analysing ill-formed sentences; its strength is that it does not omit any small part of the source text during analysis, so its translations can be more accurate. By contrast, statistical machine translation systems are highly adaptable, their use of phrase collocations makes translations more fluent, and they are also better at lexical choice. Their biggest problem, however, is that they struggle with the fact that translation generation requires linguistic knowledge: for example, they lack morphological and syntactic functions as well as word reordering, and reordering at the level of the whole sentence is harder still. In addition, statistical machine translation output is not always precise, and omissions and mistranslations sometimes occur.
Summary of the invention
Because machine translation based on a single method cannot achieve good results, and because data-driven and knowledge-driven machine translation largely have complementary strengths, combining the different methods is a reasonable way to improve machine translation quality. The machine translation method proposed by the present invention exploits both the fine-grained knowledge provided by the statistical translation engine and the strengths of the syntax tree in solving deep-structure and long-distance dependency problems in sentences. It can therefore significantly improve translation quality, and it should effectively promote the development of hybrid-engine machine translation technology.
The present invention proposes a machine translation method integrating a syntax tree and statistical machine translation technology, comprising the following steps:
1) Establish a dictionary, a grammar rule base, a phrase translation probability table and a target-language language model for the language pair. The dictionary stores corresponding words and phrases of the two languages; the grammar rule base stores the corresponding grammar rules of the two languages; the phrase translation probability table stores cross-language translation fragments obtained by training a statistical machine translation system; and the target-language language model stores the language model of the target language obtained by training a statistical machine translation system.
2) Read the dictionary information and segment the input single sentence to be translated, decomposing it into source-language words or phrases.
3) Read the grammar rule base and perform part-of-speech disambiguation and syntactic analysis on the segmented sentence to form a syntax tree.
4) Read the phrase translation probability table and traverse the syntax tree top-down. For each individual node of the syntax tree, and for certain runs of adjacent nodes that cross syntactic boundaries, take the source text of the leaf nodes, search the phrase translation probability table with it, and choose a translation from the table as the translation of that node. For syntax tree nodes left untranslated by this process, generate a translation with the rule-based translation method.
5) Use the target-language language model to smooth the generated translation and produce the target-language output.
Preferably, the translation fragments stored in the phrase translation probability table are obtained by training with GIZA++.
Preferably, the language model of the target language is trained from a parallel corpus using the language model training tool SRILM or an N-gram method.
The present invention also proposes a device that uses the above machine translation method, comprising:
A dictionary module, for storing corresponding words and phrases of the different languages;
A grammar rule base module, for storing the corresponding grammar rules of the different languages;
A phrase translation probability table module, for storing cross-language translation fragments obtained by training a statistical machine translation system;
A target-language language model module, for storing the language model of the target language obtained by training a statistical machine translation system;
A parser, connected to the dictionary module and the grammar rule base module, for performing sentence division, segmentation, part-of-speech disambiguation and syntactic analysis on the source text in turn, according to the dictionary and the grammar rule base, and thereby generating a syntax tree;
A decoder, connected to the phrase translation probability table module, the language model module and the parser, for traversing the syntax tree according to the phrase translation probability table and the target-language language model, converting the source text into a translation and generating the target-language output.
Further, the parser comprises:
A sentence division module, for reading the source text and splitting it into sentences;
A segmentation and preprocessing module, connected to the sentence division module, for segmenting and preprocessing each sentence after division;
A disambiguation module, connected to the segmentation and preprocessing module, for part-of-speech disambiguation of the segmented sentence;
A syntactic analysis module, connected to the disambiguation module, for syntactic analysis of the disambiguated sentence;
A master control module, connected to each of the above modules, for controlling the operation of each module.
The invention provides a machine translation method and device integrating a syntax tree, a phrase translation probability table and a language model. It scans the syntax tree both node by node and across node boundaries while searching the phrase translation probability table and language model of statistical machine translation. The method thus makes full use of the strengths of traditional rule-based machine translation in solving deep-structure and long-distance dependency problems in sentences, and also benefits from the fine-grained knowledge provided by the phrase translation table and language model of statistical machine translation, improving the quality of machine translation output as far as possible.
Description of the drawings
Fig. 1 is a schematic diagram of the structure of the English-Chinese machine translation device in the embodiment;
Fig. 2 is a schematic flow chart of the English-Chinese machine translation method in the embodiment;
Fig. 3 is a schematic diagram of the modules of the parser in Fig. 1;
Fig. 4 is a schematic diagram of the training of the translation probability table and the language model in the embodiment;
Fig. 5 is a schematic diagram of the syntax tree obtained in the embodiment.
Embodiments
The present invention is described in detail below through a specific embodiment, with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of the structure of the machine translation device 100 of this embodiment, which integrates a syntax tree and statistical machine translation technology; Fig. 2 is a flow chart of machine translation performed with this device.
Referring to Fig. 1, the device 100 comprises: a dictionary module 110, for storing corresponding words and phrases of the different languages; a grammar rule base module 120, for storing the corresponding grammar rules of the different languages; a phrase translation probability table module 130, for storing cross-language translation fragments obtained by training a statistical machine translation system; a target-language language model module 140, for storing the target-language language model obtained by training a statistical machine translation system; a parser 150, connected to the dictionary module and the grammar rule base module, for performing sentence division, segmentation, part-of-speech disambiguation and syntactic analysis on the source document in turn, according to the dictionary and the grammar rule base, and generating a syntax tree; and a decoder 160, connected to the phrase translation probability table module, the language model module and the parser, for traversing the syntax tree according to the phrase translation probability table and the target-language language model, converting the source text into a translation and generating the target-language output. The phrase translation probability table and the target-language language model are trained from a parallel corpus, as shown in Figure 2.
With reference to Fig. 1 and Fig. 2, and taking English as the source language and Chinese as the target language, the concrete translation process is described below. It mainly comprises the following steps:
1) Perform morphological analysis on the English side of the bidirectional English-Chinese parallel corpus, and perform word segmentation on the Chinese side;
2) Use the GIZA++ statistical tool to perform word alignment and phrase alignment on the parallel corpus, and extract the English-Chinese phrase translation probability table;
3) Filter the extracted English-Chinese phrase translation probability table, removing inaccurate statistical entries;
4) Use the language model training tool SRILM to train the language model of the target language from the parallel corpus;
5) Read the dictionary information and segment the input sentence to be translated; read the grammar rule base and perform part-of-speech disambiguation and syntactic analysis on the segmented sentence to form a syntax tree. The disambiguation and syntactic analysis steps also identify and record noun or verb phrases that are absent from the dictionary or cannot be exhaustively collected in it;
6) Traverse the syntax tree top-down; for the subtree rooted at the current node, search the phrase translation probability table for a matching entry and generate the translation;
7) While traversing the syntax tree, in addition to searching the statistical phrase table for the root node of each subtree, some cross-syntax cases should also be admitted where appropriate, so that the phrase translation probability table can be searched and used without destroying the syntax tree, making the greatest possible use of the statistical phrase table to improve translation quality;
For a run of adjacent nodes that crosses syntactic boundaries, the source text of its leaf nodes may be used to search the phrase translation probability table only when the run satisfies certain specific structures in the syntax tree. For example, within "V N to V", the run "V N to" may be searched, whereas "N to V" may not. For the concrete handling of such cross-syntax cases, see steps 3) and 4) of the translation example below;
8) For syntax tree nodes left untranslated by the above process, generate the target language by combining the dictionary, the rules and the language model; that is, generate a translation with the rule-based translation method and use the target-language language model to smooth the generated translation.
"Untranslated syntax tree nodes" arise because some fragments cannot be found in the phrase translation probability table; about 29% of fragments remain untranslated in this way and must be translated with the rule-based translation method. Note that the smoothing in step 8) focuses on the translations produced by the rules, but in other embodiments smoothing may also be applied to all translations generated earlier (including those obtained from the phrase translation probability table); the present invention is not limited in this respect.
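The top-down lookup with cross-syntax runs and a rule-based fallback (steps 6 to 8) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patent's decoder: the `Node` class, the `allowed` set of label runs, the `fallback` callable and the placeholder targets `T1` to `T3` are all hypothetical, and reordering and language-model smoothing are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                       # syntactic category, e.g. "V", "N", "Conj"
    children: list = field(default_factory=list)
    word: str = ""                   # source text, set on leaf nodes only

    def text(self):
        """Source text covered by this node (its leaves, left to right)."""
        if not self.children:
            return self.word
        return " ".join(child.text() for child in self.children)


def translate(node, table, fallback, allowed):
    """Look up the whole node first, then allowed runs of adjacent children."""
    whole = node.text()
    if whole in table:
        return [table[whole]]
    if not node.children:
        return [fallback(node.word)]  # stand-in for the rule-based path
    out, i, kids = [], 0, node.children
    while i < len(kids):
        # Try the longest run of at least two adjacent children first.
        for j in range(len(kids), i + 1, -1):
            labels = tuple(k.label for k in kids[i:j])
            span = " ".join(k.text() for k in kids[i:j])
            if labels in allowed and span in table:
                out.append(table[span])
                i = j
                break
        else:
            # No run matched: descend into the single child instead.
            out.extend(translate(kids[i], table, fallback, allowed))
            i += 1
    return out


# Hypothetical tree for the example sentence used later in the text.
tree = Node("V", [
    Node("V", [
        Node("V", word="Select"),
        Node("N", word="this option"),
        Node("to", word="to"),
        Node("V", word="postpone deleting these records"),
    ]),
    Node("Conj", word="until"),
    Node("S", word="pruning"),
    Node("V", word="is performed"),
])
# Placeholder targets; the real table holds Chinese translations.
table = {
    "Select this option to": "T1",
    "postpone deleting these records": "T2",
    "pruning is performed": "T3",
}
allowed = {("V", "N", "to"), ("S", "V")}   # assumed searchable runs
result = translate(tree, table, lambda w: "<" + w + ">", allowed)
print(result)
```

Because "N to V" is not in `allowed`, that run is never sent to the phrase table, which mirrors the restriction stated in step 7.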
As shown in Figure 3, one embodiment of the parser comprises: a master control module 151, for managing and controlling the work of each module of the parser; a sentence division module 152, for splitting the English text to be translated into sentence strings; a segmentation and preprocessing module 153, for segmenting an English sentence into a sequence of strings in units of phrases, where preprocessing includes punctuation handling, format handling and the like, which are common techniques in rule-based translation systems; a disambiguation module 154, for part-of-speech tagging of the segmented English sentence by resolving ambiguous word categories; and a syntactic analysis module 155, for relatively simple syntactic analysis that turns the segmented English sentence into a syntax tree.
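The text does not say how module 153 matches dictionary phrases. One common choice, assumed here purely for illustration, is greedy longest-match over the dictionary's multi-word entries; the function name `segment`, the `max_len` limit and the toy lexicon are hypothetical.

```python
def segment(sentence, lexicon, max_len=4):
    """Greedy longest-match segmentation into dictionary phrases or single words."""
    tokens = sentence.split()
    units = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first; a single token always succeeds.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if n == 1 or candidate.lower() in lexicon:
                units.append(candidate)
                i += n
                break
    return units


# Toy lexicon with one multi-word entry of the kind shown in the dictionary.
units = segment("He rides a mountain bike", {"mountain bike"})
print(units)
```

Longest-match keeps multi-word dictionary entries together as one unit, which is what lets later steps treat them as a single leaf.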
The entries stored in the dictionary are annotated according to the requirements of the translation system and carry the relevant semantic attributes, as follows:
Afromosia N afrormosia
&CAT[N]M_SEM[B]S_SEM[D]CLAS[]$
Afront F in front
&CAT[F]M_SEM[J]$
……
Mountain bike N mountain bike, mountain bike
&CAT[N]M_SEM[C]S_SEM[I]$
Mountain coast N steep coast
&CAT[N]M_SEM[C]S_SEM[B]CLAS[d]$
Mountain cork N asbestos
&CAT[N]M_SEM[C]S_SEM[G]NUM[U]$
The "requirements of the translation system" are the dictionary specification, which the developers of a rule-based translation system define themselves. It generally covers annotation of each entry's part of speech, syntax and semantic information, and is a common technique in rule-based translation systems.
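The attribute lines in the dictionary sample follow a visible `&NAME[value]...$` pattern. A minimal sketch of reading one such line into a mapping is shown below; the function name `parse_attributes` is hypothetical, and the meaning of the individual attribute codes is not defined in the text, so the sketch treats them as opaque strings.

```python
import re

# One attribute: an upper-case name followed by a bracketed (possibly empty) value.
ATTRIBUTE = re.compile(r"([A-Z_]+)\[([^\]]*)\]")


def parse_attributes(line):
    """Turn '&CAT[N]M_SEM[C]$' into {'CAT': 'N', 'M_SEM': 'C'}."""
    body = line.strip()
    if not (body.startswith("&") and body.endswith("$")):
        raise ValueError("attribute line must start with '&' and end with '$'")
    return dict(ATTRIBUTE.findall(body[1:-1]))


attrs = parse_attributes("&CAT[N]M_SEM[C]S_SEM[G]NUM[U]$")
print(attrs)
```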
The grammar rules stored in the grammar rule base specify the translation rules for words or phrases according to the requirements of the translation system, as follows:
@with links to:
[24] (1) CAT[N]-->be related with %1
@reach:
[12] (1) CHI[level|value]-->MEANQ[0, reach];
[13] (0) CAT[V]+(1) CHI[conclusion]-->MEAN[0, draw];
[14] (0) CAT[V]+(1) CHI[goal]-->MEAN[0, reach];
[15] (0) CAT[V]&&IS_CENTER[1]+(1) CAT[N]&&L_CHI[agreement]-->MEAN[0, reach].
As shown in Figure 4, the statistical training process that yields the phrase translation probability table and the language model is as follows: the statistical machine translation training tool GIZA++ is run on the parallel corpus to obtain the phrase translation probability table, and the language model training tool SRILM is run on the parallel corpus to obtain the target-language language model. Besides SRILM, other language model training methods such as N-gram can also be used.
In the above embodiment, the extraction of the phrase translation probability table in step 2) is a key point of the present invention and is now described further. Each entry of the phrase translation probability table comprises four parts: a source-language phrase $f_1^J$ containing $J$ words, a target-language phrase $e_1^I$ containing $I$ words, the word alignment $\alpha$ between the words of the source-language phrase and those of the target-language phrase, and the phrase translation score $p$. An entry can be written as $(f_1^J, e_1^I, \alpha, p)$.
The phrase translation score is then computed. It comprises four parts: the phrase translation probabilities $p(e_1^I \mid f_1^J)$ and $p(f_1^J \mid e_1^I)$, and the lexical translation probabilities $p_w(e_1^I \mid f_1^J, \alpha)$ and $p_w(f_1^J \mid e_1^I, \alpha)$.
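A four-part entry of this kind can be held in a small record. The sketch below assumes the `|||`-separated text layout that the phrase-table lines in the worked example appear to use (source, target, alignment pairs, scores); the class name `PhraseEntry`, the function `parse_entry`, the placeholder target `T3` and the two scores in the sample line are illustrative, not taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class PhraseEntry:
    source: str      # source-language phrase
    target: str      # target-language phrase
    alignment: list  # (source index, target index) pairs
    scores: list     # translation scores, in file order


def parse_entry(line):
    """Parse 'source ||| target ||| i-j i-j ... ||| score score ...'."""
    fields = [part.strip() for part in line.split("|||")]
    if len(fields) != 4:
        raise ValueError("expected four '|||'-separated fields")
    source, target, align_text, score_text = fields
    alignment = [tuple(int(x) for x in pair.split("-"))
                 for pair in align_text.split()]
    scores = [float(s) for s in score_text.split()]
    return PhraseEntry(source, target, alignment, scores)


# Illustrative line: target and scores are made up for the example.
entry = parse_entry("pruning is performed ||| T3 ||| 0-1 1-0 2-0 ||| 0.5 0.25")
print(entry)
```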
The phrase translation probabilities are computed as:
$$p(e_1^I \mid f_1^J) = \frac{N(f_1^J \mid e_1^I)}{\sum_{ee_1^I} N(f_1^J \mid ee_1^I)}$$
$$p(f_1^J \mid e_1^I) = \frac{N(e_1^I \mid f_1^J)}{\sum_{ff_1^J} N(e_1^I \mid ff_1^J)}$$
In the formulas above, $N(f_1^J \mid e_1^I)$ denotes the number of times the phrase pair $(f_1^J, e_1^I)$ occurs in the corpus, $ee_1^I$ ranges over all possible target-language phrases corresponding to $f_1^J$, and $N(f_1^J \mid ee_1^I)$ denotes the number of times the phrase pair $(f_1^J, ee_1^I)$ occurs in the corpus. Likewise, $ff_1^J$ ranges over all possible source-language phrases corresponding to $e_1^I$, $N(e_1^I \mid ff_1^J)$ denotes the number of times the phrase pair $(ff_1^J, e_1^I)$ occurs in the corpus, and $N(e_1^I \mid f_1^J)$ denotes the number of times the phrase pair $(f_1^J, e_1^I)$ occurs in the corpus.
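The two phrase translation probabilities are relative frequencies over extracted phrase pairs. The following sketch computes both directions from a list of (source phrase, target phrase) pairs; the function name and the toy pairs are hypothetical, and the extraction of pairs from word alignments is assumed to have happened already.

```python
from collections import Counter


def phrase_translation_probs(pairs):
    """Return p(e|f) and p(f|e) as dicts keyed by (f, e)."""
    pair_counts = Counter(pairs)
    source_totals = Counter()   # sum over all e paired with a given f
    target_totals = Counter()   # sum over all f paired with a given e
    for (f, e), n in pair_counts.items():
        source_totals[f] += n
        target_totals[e] += n
    p_e_given_f = {(f, e): n / source_totals[f]
                   for (f, e), n in pair_counts.items()}
    p_f_given_e = {(f, e): n / target_totals[e]
                   for (f, e), n in pair_counts.items()}
    return p_e_given_f, p_f_given_e


# Toy extracted pairs: "a" was seen with "x" twice and with "y" once.
pairs = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "y")]
p_e_given_f, p_f_given_e = phrase_translation_probs(pairs)
print(p_e_given_f[("a", "x")], p_f_given_e[("a", "y")])
```

For the toy data, the pair ("a", "x") gets p(e|f) = 2/3 because "a" occurs three times in total, and ("a", "y") gets p(f|e) = 1/2 because "y" occurs twice.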
The lexical translation probability is computed as:
Figure BDA000028310155000616
In the formula above, $p(e_i, f_j)$ denotes the probability that the source-language word $f_j$ ($j = 1 \dots J$) is translated as the target-language word $e_i$ ($i = 1 \dots I$), and $p(f_j, e_i)$ denotes the probability that the target-language word $e_i$ ($i = 1 \dots I$) is translated as the source-language word $f_j$ ($j = 1 \dots J$). $\alpha$ denotes the alignment between source-language and target-language words.
In the above embodiment, generating the target language in step 8) by combining the dictionary, the rules and the language model means using the target-language language model to smooth the translation generated from the machine translation dictionary and rules, and/or to smooth the translation obtained from the phrase translation probability table, so as to improve the fluency of the translation. The method for computing the fluency of a target-language translation with respect to the target-language language model is as follows:
1) A statistical model of the target language is represented by the conditional probability of each word given the words preceding it:
$$\hat{P}(w_1^T) = \prod_{t=1}^{T} \hat{P}(w_t \mid w_1^{t-1})$$
Here $w_t$ denotes the $t$-th word of the translation, $w_1^T$ denotes $w_1, \dots, w_T$, and $w_1^{t-1}$ denotes $w_1, \dots, w_{t-1}$.
2) Because
Figure BDA000028310155000620
an N-gram model can be used to approximate the conditional probability of a word given the preceding words:
$$\hat{P}(w_t \mid w_1^{t-1}) \approx \hat{P}(w_t \mid w_{t-n+1}^{t-1})$$
3) Let $w_1 \dots w_T$ be a training set of the target language, with $w_t \in V$, where $V$ is a finite set. The goal is then to design a good model:
$$f(w_t, \dots, w_{t-n+1}) = \hat{P}(w_t \mid w_1^{t-1})$$
The formula above gives the maximum sample likelihood, and its geometric mean is taken as:
$$\text{Perplexity} = 1 / \hat{P}(w_t \mid w_1^{t-1})$$
4) In the formula above, the following holds in all cases:
Figure BDA00002831015500074
In this way, the fluency of a target-language translation with respect to the target-language language model can be computed:
$$\text{Score} = \frac{1}{T} \sum_t \frac{\log f(w_t, w_{t-1}, \dots, w_{t-n+1})}{\text{Perplexity}},$$
where $T$ is the number of words in the training set of the target language.
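To make the N-gram approximation and the perplexity concrete, the sketch below estimates trigram probabilities from raw counts, computes a sentence's perplexity as the geometric mean of 1/P over its words, and uses it to choose between candidate translations. It is an unsmoothed toy under stated assumptions, not the patent's Score formula: the function names, the `<s>` and `</s>` padding symbols and the three-sentence corpus are hypothetical, and toolkits such as SRILM apply smoothing that is omitted here.

```python
import math
from collections import Counter


def train(corpus, n=3):
    """Count n-grams and their (n-1)-word contexts over tokenised sentences."""
    ngrams, contexts = Counter(), Counter()
    for sentence in corpus:
        padded = ["<s>"] * (n - 1) + list(sentence) + ["</s>"]
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            ngrams[context + (padded[i],)] += 1
            contexts[context] += 1
    return ngrams, contexts


def word_probs(sentence, ngrams, contexts, n=3):
    """Maximum-likelihood P(w_t | previous n-1 words) for each position."""
    padded = ["<s>"] * (n - 1) + list(sentence) + ["</s>"]
    probs = []
    for i in range(n - 1, len(padded)):
        context = tuple(padded[i - n + 1:i])
        seen = contexts[context]
        probs.append(ngrams[context + (padded[i],)] / seen if seen else 0.0)
    return probs


def perplexity(probs):
    """Geometric mean of 1/P over the words; infinite if any P is zero."""
    if any(p == 0.0 for p in probs):
        return math.inf
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))


# Toy target-language corpus; tokens stand in for words.
corpus = [["a", "b", "c"], ["a", "b", "d"], ["a", "b", "c"]]
ngrams, contexts = train(corpus)
candidates = [["a", "b", "d"], ["a", "b", "c"]]
best = min(candidates,
           key=lambda s: perplexity(word_probs(s, ngrams, contexts)))
print(best)
```

Lower perplexity means the model finds the word sequence more fluent, so the candidate with the smallest value is kept.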
A concrete example is given below. The sentence to be translated is:
Select this option to postpone deleting these records until pruning is performed.
First, the input sentence is segmented using the dictionary information; then, using the grammar rule base, part-of-speech disambiguation and syntactic analysis are performed on the segmented sentence to form a syntax tree, shown in Figure 5.
The syntax tree is then decoded by traversing it top-down: starting from the node [V] at the upper left of the top level, the traversal proceeds from left to right towards the leaf nodes. The detailed traversal steps are as follows:
1) Read the leaf-node string of [V]: "Select this option to postpone deleting these records until pruning is performed", and search the phrase translation probability table with it. No matching translation fragment is found.
2) Read the structural attribute of [V]. It is a "V Conj S V" structure, which is split into two parts for translation: "V||Conj S V".
3) Read the leaf-node string of the first part [V*] of "V||Conj S V": "Select this option to postpone deleting these records", and search the phrase translation probability table with it. No matching translation fragment is found.
4) Read the structural attribute of [V*]. It is a "V N to V" structure, which can be split in two ways: "V N to||V" or "V N||to||V".
5) According to the maximum-matching principle, the fewer pieces a split produces, the more accurate the translation tends to be, so the first split, "V N to||V", is tried first. The leaf-node text of "V N to", "Select this option to", is used to search the phrase translation probability table, and the search succeeds:
Select this option to ||| select this option can ||| 0-0 1-1 2-2 3-3 ||| 10.0003327991397108e-007;
This example therefore uses "select this option can" as the translation of "Select this option to".
6) Read the leaf-node string of the second part [V*] of "V N to||V": "postpone deleting these records", and search the phrase translation probability table with it. The search succeeds:
postpone deleting these records ||| postpones these records of deletion ||| 0-0 1-1 2-2 3-3 ||| 10.00056812810.125;
This example uses "postponing these records of deletion" as the translation of "postpone deleting these records".
7) The first "V" of the whole sentence "V||Conj S V" can thus be translated as:
Select this option to postpone deleting these records → this option of selection can be postponed these records of deletion
8) Next, for "Conj S V", read its leaf-node string: "until pruning is performed", and search the phrase translation probability table with it. No matching translation fragment is found.
9) The structure "Conj S V" can be split into "Conj||S V".
10) "Conj" corresponds to the common word "until", so the phrase translation probability table need not be searched.
11) Read the leaf-node string of the second part "S V" of "Conj||S V": "pruning is performed", and search the phrase translation probability table with it. The search succeeds:
pruning is performed ||| completes pruning ||| 0-1 1-0 2-0 ||| 15.14058e-0063.37201e-006
This example uses "completing pruning" as the translation of "pruning is performed".
12) The whole sentence "V||Conj S V" can thus be translated as:
Select this option to postpone deleting these records until pruning is performed. → this option of selection can be postponed these records of deletion, and until completes pruning.
13) For the structure "V||Conj S V" there are two basic translation patterns:
V||Conj S V → V, Conj S V
V||Conj S V → Conj S V, V
For the specific "Conj" here, "until", these become:
V||until S V → V, until S V
V||until S V → before S V, V
This gives two translation results:
Select this option to postpone deleting these records until pruning is performed. → this option of selection can be postponed these records of deletion, until complete pruning.
Select this option to postpone deleting these records until pruning is performed. → before completing pruning, select this option can postpone these records of deletion.
14) To decide which translation result to adopt, the probability of each of the two sentences is computed with the target-language language model. Under the N-gram language model, let the word sequence of a translation result be $w_1, w_2, \dots, w_m$; the probability of this translation result is:
$$P(w_1, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \dots, w_{i-1})$$
Under the Markov assumption, the conditional probability above can be computed from N-gram frequency counts:
$$P(w_i \mid w_{i-(n-1)}, \dots, w_{i-1}) = \frac{\operatorname{count}(w_{i-(n-1)}, \dots, w_{i-1}, w_i)}{\operatorname{count}(w_{i-(n-1)}, \dots, w_{i-1})}$$
Here $n$ is the order of the language model and can be set to 3. Actual computation gives a probability of 5.38125e-005 for the first translation and 4.20337e-006 for the second. The first translation is more probable, so it is selected as the final translation:
Select this option to postpone deleting these records until pruning is performed. → this option of selection can be postponed these records of deletion, until complete pruning.
In the process above, "Select this option to" is not a complete node of the syntax tree but a run of nodes that crosses syntactic boundaries; it is part of the structure "V N to V". Such a run of adjacent cross-syntax nodes should also be allowed to search the statistical phrase table as long as it matches an admitted pattern. Doing so makes fuller use of the statistical phrase table and yields better translations, provided that the overall structure of the sentence is not destroyed.
The applicant tested the method and device of the present invention, which integrate a syntax tree, a phrase translation probability table and a language model, in a practical English-Chinese system customised for the IT security domain. An English-Chinese parallel corpus of 610,000 sentence pairs from the IT security domain was chosen as training data. The statistical alignment tool GIZA++ was used to train a phrase translation probability table containing 3,490,000 translation fragments, and the Chinese side of the 610,000 pairs was used to train a language model. The translation method was tested on 2468 English test sentences; the resulting BLEU score, TER score and readability are shown in Table 1.
As Table 1 shows, with the method of the present invention, phrase-table fragments (for example: Selectthis option to## select this option can) occurred 7684 times in total in the translations of the 2468 test sentences, an average of 3.11 per sentence. This frequency is very high: the fragments account for 71.7% of all Chinese characters in the output translations.
In other words, 71.7% of the Chinese output comes from the statistical phrase translation probability table. This shows that the method makes full use of the fine-grained knowledge of statistical machine translation and clearly improves translation quality: the BLEU score, TER score and readability of the translation results all improve to varying degrees.
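The per-sentence average quoted above follows directly from the two counts given in the text, as this small check shows (the variable names are illustrative; the 71.7% figure depends on character counts that the text does not give, so it is not recomputed here).

```python
fragment_hits = 7684     # phrase-table fragments used across all test translations
test_sentences = 2468    # number of English test sentences

average_per_sentence = fragment_hits / test_sentences
print(f"{average_per_sentence:.2f}")
```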
Table 1. Test data for the method of the invention
(Table 1 is reproduced only as image BDA00002831015500101 in the source.)
The principle and features of the invention have been described above through specific embodiments. It should be understood that the invention is not limited to these embodiments; many variations are possible, and the concrete implementation steps may also differ. The scope of protection of the invention is defined solely by the appended claims.

Claims (10)

1. A machine translation method merging a syntax tree and statistical machine translation technology, comprising the steps of:
1) establishing a dictionary, a grammar rule base, a phrase translation probability table and a target language model between different languages; wherein the dictionary stores the corresponding words and phrases of the different languages, the grammar rule base stores the corresponding grammar rules of the different languages, the phrase translation probability table stores translation segments between the different languages obtained by training a statistical machine translation system, and the target language model stores the language model of the target language obtained by training a statistical machine translation system;
2) reading the dictionary information and segmenting the input simple sentence to be translated, decomposing the sentence into words or phrases of the source language;
3) reading the grammar rule base information and performing part-of-speech disambiguation and grammatical analysis on the segmented sentence to form a syntax tree;
4) reading the phrase translation probability table information and traversing the syntax tree with a top-down strategy; for each individual node of the syntax tree and for partial runs of contiguous nodes that cross syntactic boundaries, taking the source text of the leaf nodes to search the phrase translation probability table, and choosing a translation from the phrase translation table as the translation of the node in the syntax tree; for syntax tree nodes left untranslated by the above process, generating the translation by the rule-based translation method;
5) using the target language model to smooth the generated translation and generate the target language output.
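Step 4 can be pictured with a small sketch. This is an illustration under stated assumptions, not the applicant's implementation: the `Node` class, the `translate` function, the bracketed rule-based fallback and the toy phrase table are all hypothetical. Cross-syntax runs are approximated here as runs of adjacent children under one parent, and target-side reordering is ignored.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    label: str
    word: str = ""                       # set only on leaf nodes
    children: List["Node"] = field(default_factory=list)

    def leaves(self) -> List[str]:
        if not self.children:
            return [self.word]
        return [w for child in self.children for w in child.leaves()]


def translate(node: Node, table: Dict[str, str],
              rule_translate: Callable[[Node], str]) -> str:
    """Top-down traversal: try the whole node, then runs of adjacent
    children, then recurse; untranslated leaves use the rule-based fallback."""
    source = " ".join(node.leaves())
    if source in table:                  # the whole node is in the phrase table
        return table[source]
    if not node.children:                # untranslated leaf
        return rule_translate(node)
    out, i, kids = [], 0, node.children
    while i < len(kids):
        # Longest run of at least two adjacent children found in the table.
        for j in range(len(kids), i + 1, -1):
            run = " ".join(w for k in kids[i:j] for w in k.leaves())
            if run in table:
                out.append(table[run])
                i = j
                break
        else:                            # no run matched: descend into the child
            out.append(translate(kids[i], table, rule_translate))
            i += 1
    return " ".join(out)


# Toy tree for the "V N to V" structure; the target string is the gloss from the text.
tree = Node("VP", children=[
    Node("V", "select"),
    Node("NP", children=[Node("DET", "this"), Node("N", "option")]),
    Node("TO", "to"),
    Node("V", "postpone"),
])
table = {"select this option to": "select this option can"}
result = translate(tree, table, lambda n: f"[{n.word}]")
```

Here `result` combines the phrase-table translation of the cross-syntax run with the bracketed fallback for the remaining leaf. A fuller system would also check the pattern condition on the run and let the rule-based component reorder the output.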
2. The method of claim 1, characterized in that the translation segments between different languages stored in the phrase translation probability table are obtained by training with GIZA++.
3. The method of claim 1, characterized in that the target language model is obtained using the language model training tool SRILM or N-gram.
4. the method for claim 1, is characterized in that, described phrase translation probability table comprises: the source language phrase that comprises J word
Figure FDA00002831015400011
The target language phrase that comprises I word
Figure FDA00002831015400012
The word alignment of source language phrase and target language phrase inside concerns α, and phrase translation mark p.
5. The method of claim 4, characterized in that the phrase translation score p comprises a phrase translation probability and a lexical translation probability; the phrase translation probability is computed as:
$$p(e_1^I \mid f_1^J) = \frac{N(f_1^J \mid e_1^I)}{\sum_{ee_1^I} N(f_1^J \mid ee_1^I)}$$
$$p(f_1^J \mid e_1^I) = \frac{N(e_1^I \mid f_1^J)}{\sum_{ff_1^J} N(e_1^I \mid ff_1^J)}$$
where $N(f_1^J \mid e_1^I)$ denotes the number of times the phrase pair $(f_1^J, e_1^I)$ occurs in the corpus, $ee_1^I$ denotes all possible target language phrases corresponding to $f_1^J$, $N(e_1^I \mid f_1^J)$ denotes the number of times the phrase pair $(e_1^I, f_1^J)$ occurs in the corpus, $ff_1^J$ denotes all possible source language phrases corresponding to $e_1^I$, $N(f_1^J \mid ee_1^I)$ denotes the number of times the phrase pair $(f_1^J, ee_1^I)$ occurs in the corpus, and $N(e_1^I \mid ff_1^J)$ denotes the number of times the phrase pair $(e_1^I, ff_1^J)$ occurs in the corpus;
The lexical translation probability is computed as:
Figure FDA00002831015400023
where $p(e_i, f_j)$ denotes the probability that the source language word $f_j$ ($j = 1 \ldots J$) is translated as the target language word $e_i$ ($i = 1 \ldots I$), and $p(f_j, e_i)$ denotes the probability that the target language word $e_i$ ($i = 1 \ldots I$) is translated as the source language word $f_j$ ($j = 1 \ldots J$); $\alpha$ denotes the alignment relation between source language and target language word pairs.
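The two phrase translation probabilities in claim 5 are relative frequencies over phrase-pair counts. A minimal sketch, assuming the phrase pairs have already been extracted from the aligned corpus; the function name `phrase_probs` and the toy pairs are hypothetical, and the lexical translation probability is not covered here:

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]  # (source phrase f, target phrase e)


def phrase_probs(pairs: Iterable[Pair]) -> Tuple[Dict[Pair, float], Dict[Pair, float]]:
    """Return (p(e|f), p(f|e)), each keyed by the (f, e) pair."""
    pair_count = Counter(pairs)          # N(f, e): occurrences of each phrase pair
    f_total: Counter = Counter()         # sum of N(f, ee) over all targets ee
    e_total: Counter = Counter()         # sum of N(ff, e) over all sources ff
    for (f, e), n in pair_count.items():
        f_total[f] += n
        e_total[e] += n
    p_e_given_f = {(f, e): n / f_total[f] for (f, e), n in pair_count.items()}
    p_f_given_e = {(f, e): n / e_total[e] for (f, e), n in pair_count.items()}
    return p_e_given_f, p_f_given_e


# Toy extracted phrase pairs (placeholders, not data from the patent).
pairs = [("the file", "E1")] * 3 + [("the file", "E2"), ("a file", "E1")]
p_e_given_f, p_f_given_e = phrase_probs(pairs)
```

With these toy counts, "the file" is paired with E1 three times out of four, so `p_e_given_f[("the file", "E1")]` is 0.75; E1 is paired with "the file" three times out of four, so `p_f_given_e[("the file", "E1")]` is also 0.75.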
6. the method for claim 1 is characterized in that: calculating the target language translation with respect to the method for the smoothness of described target language language model is:
1) the target language statistical model is represented with the conditional probability of a rear word with respect to previous word:
Figure FDA00002831015400025
where $w_t$ denotes the t-th word of the translation, $w_1^T$ denotes $w_1, \ldots, w_T$, and $w_1^{t-1}$ denotes $w_1, \ldots, w_{t-1}$;
2) the N-gram model is used to approximate the conditional probability of each word given the words preceding it:
$$\hat{P}(w_t \mid w_1^{t-1}) \approx \hat{P}(w_t \mid w_{t-n+1}^{t-1})$$
3) let $w_1 \cdots w_T$ be a training set of the target language, with $w_T \in V$, where V is a finite set, and compute the maximum sample likelihood:
$$f(w_t, \ldots, w_{t-n+1}) = \hat{P}(w_t \mid w_1^{t-1}),$$
its geometric mean being:
$$\mathrm{Perplexity} = 1 / \hat{P}(w_t \mid w_1^{t-1})$$
4) for arbitrarily Have
Figure FDA000028310154000212
The fluency of the target language translation with respect to the target language model is thereby obtained as:
Score = 1 T Σ t log f ( w t , w t - 1 , . . . , w t - n + 1 ) Perplexity ,
where T is the number of words in the training set of the target language.
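To make the N-gram approximation and the perplexity concrete, the following sketch trains a bigram model (n = 2) and scores sentences. Several choices here are assumptions for illustration only: the add-one smoothing, the `<s>`/`</s>` boundary tokens, and the textbook definition of perplexity as the exponential of the negative average log-probability, which is not necessarily the exact Score of this claim. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple

Model = Tuple[Counter, Counter, int]  # (context counts, bigram counts, vocabulary size)


def train_bigram(sentences: List[str]) -> Model:
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            contexts[prev] += 1          # how often prev occurs as a context
            bigrams[(prev, cur)] += 1
    return contexts, bigrams, len(vocab)


def avg_log_prob(sentence: str, model: Model) -> float:
    """Average of log P(w_t | w_{t-1}) over the sentence, add-one smoothed."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        total += math.log(p)
    return total / (len(tokens) - 1)


def perplexity(sentence: str, model: Model) -> float:
    """Inverse geometric mean of the per-word conditional probabilities."""
    return math.exp(-avg_log_prob(sentence, model))


model = train_bigram(["a b", "a c"])
seen = perplexity("a b", model)      # word order observed in training
unseen = perplexity("b a", model)    # word order never observed
```

A lower perplexity indicates a translation that the target language model finds more fluent, so `seen` comes out smaller than `unseen` in this toy setting.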
7. the method for claim 1, it is characterized in that: the entry of preserving in described dictionary marks by the requirement of translation system, indicates relevant semantic attribute; The syntax rule of preserving in described syntax rule storehouse is according to the requirement regulation word of translation system or the translation rule of phrase.
8. the method for claim 1 is characterized in that: calculate the translation probability of different translation results according to described target language language model, the translation that probability is high is as final translation.
9. A machine translation apparatus merging a syntax tree and statistical machine translation technology, characterized by comprising:
a dictionary module for storing the corresponding words and phrases of different languages;
a grammar rule base module for storing the corresponding grammar rules of different languages;
a phrase translation probability table module for storing the translation segments between different languages obtained by training a statistical machine translation system;
a target language model module for storing the language model of the target language obtained by training a statistical machine translation system;
a parser, connected to the dictionary module and the grammar rule base module, for performing in sequence sentence division, segmentation, part-of-speech disambiguation and grammatical analysis on the source text according to the dictionary and the grammar rule base, and thereby generating a syntax tree;
a decoder, connected to the phrase translation probability table module, the target language model module and the parser, for traversing the syntax tree according to the phrase translation probability table and the target language model to convert the source text into a translation and generate the target language output.
10. The apparatus of claim 9, characterized in that the parser comprises:
a sentence division module for reading the source document and dividing it into sentences;
a segmentation and preprocessing module, connected to the sentence division module, for segmenting and preprocessing the divided simple sentences;
a disambiguation module, connected to the segmentation and preprocessing module, for performing part-of-speech disambiguation on the segmented simple sentences;
a grammatical analysis module, connected to the disambiguation module, for performing grammatical analysis on the disambiguated simple sentences;
a master control module, connected to each of the above modules, for controlling the operation of each module.
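The parser of claim 10 is a chain of modules driven by a master control module. The sketch below shows only that wiring. Every function body is a toy stand-in: regex splitting instead of dictionary-based segmentation, "first listed tag" instead of real part-of-speech disambiguation, and a flat tree instead of rule-based grammatical analysis. The names and the lexicon are hypothetical.

```python
import re
from typing import Dict, List, Tuple

Tagged = List[Tuple[str, str]]


def divide_sentences(text: str) -> List[str]:
    """Sentence division module: split after sentence-final punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def segment(sentence: str) -> List[str]:
    """Segmentation module: split into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def disambiguate(tokens: List[str], lexicon: Dict[str, List[str]]) -> Tagged:
    """Disambiguation module (toy): keep the first tag listed for each word."""
    return [(tok, lexicon.get(tok.lower(), ["UNK"])[0]) for tok in tokens]


def analyse(tagged: Tagged) -> Tuple[str, Tagged]:
    """Grammatical analysis module (toy): wrap the tagged words in a flat tree."""
    return ("S", tagged)


def master_control(text: str, lexicon: Dict[str, List[str]]) -> List[Tuple[str, Tagged]]:
    """Master control module: run the modules in the order given in the claim."""
    return [analyse(disambiguate(segment(s), lexicon))
            for s in divide_sentences(text)]


lexicon = {"select": ["V", "N"], "this": ["DET"], "option": ["N"]}
trees = master_control("Select this option. Delete records.", lexicon)
```

Keeping each stage a plain function with a narrow input and output mirrors the claim's module boundaries and lets any stage be replaced independently.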

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013100497397A CN103116578A (en) 2013-02-07 2013-02-07 Translation method integrating syntactic tree and statistical machine translation technology and translation device

Publications (1)

Publication Number Publication Date
CN103116578A true CN103116578A (en) 2013-05-22

Family

ID=48414955

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013100497397A Pending CN103116578A (en) 2013-02-07 2013-02-07 Translation method integrating syntactic tree and statistical machine translation technology and translation device

Country Status (1)

Country Link
CN (1) CN103116578A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1652106A (en) * 2004-02-04 2005-08-10 北京赛迪翻译技术有限公司 Machine translation method and apparatus based on language knowledge base
US20080162111A1 (en) * 2006-12-28 2008-07-03 Srinivas Bangalore Sequence classification for machine translation
CN101482861A (en) * 2008-01-09 2009-07-15 中国科学院自动化研究所 Chinese-English words automatic alignment method
US20120316862A1 (en) * 2011-06-10 2012-12-13 Google Inc. Augmenting statistical machine translation with linguistic knowledge
CN102662932A (en) * 2012-03-15 2012-09-12 中国科学院自动化研究所 Method for establishing tree structure and tree-structure-based machine translation system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
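Xu Zhiming, Wang Xiaolong, et al.: "Data smoothing techniques for N-gram language models", Application Research of Computers *
Jiang Hongfei, Li Sheng, et al.: "A statistical machine translation model based on synchronous tree substitution grammar", Journal of Software *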

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104731777A (en) * 2015-03-31 2015-06-24 网易有道信息技术(北京)有限公司 Translation evaluation method and device
WO2017012327A1 (en) * 2015-07-22 2017-01-26 华为技术有限公司 Syntax analysis method and device
US10909315B2 (en) 2015-07-22 2021-02-02 Huawei Technologies Co., Ltd. Syntax analysis method and apparatus
CN106407184A (en) * 2015-07-30 2017-02-15 阿里巴巴集团控股有限公司 Decoding method used for statistical machine translation, and statistical machine translation method and apparatus
CN106407184B (en) * 2015-07-30 2019-10-01 阿里巴巴集团控股有限公司 Coding/decoding method, statistical machine translation method and device for statistical machine translation
CN106598937A (en) * 2015-10-16 2017-04-26 阿里巴巴集团控股有限公司 Language recognition method and device for text and electronic equipment
CN106598937B (en) * 2015-10-16 2019-10-18 阿里巴巴集团控股有限公司 Language Identification, device and electronic equipment for text
CN107436865A (en) * 2016-05-25 2017-12-05 阿里巴巴集团控股有限公司 A kind of word alignment training method, machine translation method and system
CN107436865B (en) * 2016-05-25 2020-10-16 阿里巴巴集团控股有限公司 Word alignment training method, machine translation method and system
CN106844352B (en) * 2016-12-23 2019-11-08 中国科学院自动化研究所 Word prediction method and system based on neural machine translation system
CN106844352A (en) * 2016-12-23 2017-06-13 中国科学院自动化研究所 Word prediction method and system based on neural machine translation system
CN107066455A (en) * 2017-03-30 2017-08-18 唐亮 A kind of multilingual intelligence pretreatment real-time statistics machine translation system
CN107066455B (en) * 2017-03-30 2020-07-28 唐亮 Multi-language intelligent preprocessing real-time statistics machine translation system
TWI637278B (en) * 2017-07-03 2018-10-01 雲拓科技有限公司 Computer automatically claim-translating device
CN107526726A (en) * 2017-07-27 2017-12-29 山东科技大学 A kind of method that Chinese procedural model is automatically converted to English natural language text
CN107729326A (en) * 2017-09-25 2018-02-23 沈阳航空航天大学 Neural machine translation method based on Multi BiRNN codings
CN107729326B (en) * 2017-09-25 2020-12-25 沈阳航空航天大学 Multi-BiRNN coding-based neural machine translation method
CN108829657B (en) * 2018-04-17 2022-05-03 广州视源电子科技股份有限公司 Smoothing method and system
CN108829657A (en) * 2018-04-17 2018-11-16 广州视源电子科技股份有限公司 smoothing processing method and system
CN108763222B (en) * 2018-05-17 2020-08-04 腾讯科技(深圳)有限公司 Translation missing detection and translation method and device, server and storage medium
CN108763222A (en) * 2018-05-17 2018-11-06 腾讯科技(深圳)有限公司 Detection, interpretation method and device, server and storage medium are translated in a kind of leakage
CN108874790A (en) * 2018-06-29 2018-11-23 中译语通科技股份有限公司 A kind of cleaning parallel corpora method and system based on language model and translation model
CN110895660A (en) * 2018-08-23 2020-03-20 澳门大学 Statement processing method and device based on syntax dependency relationship dynamic coding
CN110895660B (en) * 2018-08-23 2024-05-17 澳门大学 Sentence processing method and device based on syntactic dependency dynamic coding
CN109448458A (en) * 2018-11-29 2019-03-08 郑昕匀 A kind of Oral English Training device, data processing method and storage medium
CN109978829A (en) * 2019-02-26 2019-07-05 深圳市华汉伟业科技有限公司 A kind of detection method and its system of object to be detected
CN110413963A (en) * 2019-07-03 2019-11-05 东华大学 Breast ultrasonography report structure method based on domain body
CN110413963B (en) * 2019-07-03 2022-11-25 东华大学 Breast ultrasonic examination report structuring method based on domain ontology
CN111104796A (en) * 2019-12-18 2020-05-05 北京百度网讯科技有限公司 Method and device for translation
CN111104796B (en) * 2019-12-18 2023-05-05 北京百度网讯科技有限公司 Method and device for translation
RU2766821C1 (en) * 2021-02-10 2022-03-16 Общество с ограниченной ответственностью " МЕНТАЛОГИЧЕСКИЕ ТЕХНОЛОГИИ" Method for automated extraction of semantic components from compound sentences of natural language texts in machine translation systems and device for implementation thereof
CN113283250A (en) * 2021-05-26 2021-08-20 南京大学 Automatic machine translation test method based on syntactic component analysis
CN114330376A (en) * 2021-11-15 2022-04-12 甲骨易(北京)语言科技股份有限公司 Computer aided translation system and method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20130522
