CN102214166B - Machine translation system and machine translation method based on syntactic analysis and hierarchical model - Google Patents

Machine translation system and machine translation method based on syntactic analysis and hierarchical model

Info

Publication number
CN102214166B
Authority
CN
Grant status
Grant
Patent type
Prior art keywords
phrase
based
syntax
Prior art date
Application number
CN 201010144623
Other languages
Chinese (zh)
Other versions
CN102214166A (en)
Inventor
熊张亮
何亮
万磊
Original Assignee
三星电子(中国)研发中心
三星电子株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Grant date

Abstract

The present invention discloses a machine translation system and method based on syntactic analysis and a hierarchical model. The machine translation system comprises a word alignment module, a phrase extraction module, a part-of-speech (POS) and syntax tagging module, a syntax-based discontinuous phrase extraction module, a discontinuous-phrase-based translation module, and a scoring output module. The system and method perform syntactic analysis on top of a general continuous-phrase-based machine translation model and thereby extract a syntax-based discontinuous phrase rule base from bilingual sentence-aligned text. This addresses discontinuous fixed collocations in the context of the whole sentence and makes the rules conform to the syntactic features of the language. Translation is performed on the basis of the discontinuous phrase rule base and the phrase alignment table, and the translation results are scored with an evaluation model, which effectively improves translation quality.

Description

Machine translation system and method based on syntactic analysis and hierarchical model

Technical Field

[0001] The present invention relates to machine translation and, more particularly, to a machine translation system and method based on syntactic analysis and a hierarchical model.

Background

[0002] Machine translation is the automatic translation of one natural language into another. There are many types of machine translation system; the currently popular type is the phrase-based machine translation (PBMT) system, which works on continuous phrases. The problem machine translation has to solve is how to use a computer to translate a sentence or fragment of a source language (SL) automatically into the corresponding sentence or fragment of a target language (TL). Corpus-based machine translation relies on a bilingual aligned corpus (that is, every source-language sentence has one or more corresponding target-language translations), and all the data and knowledge the computer needs for automatic translation are obtained from the corpus.

[0003] A PBMT system takes the phrase as the basic unit of translation. During translation, the system does not translate each word in isolation but translates several consecutive words together. Because the granularity of translation is larger, the phrase-based approach handles local context dependencies easily and translates idioms and common collocations well. In general, a phrase in the phrase-based approach may be any continuous string, with no syntactic restriction, so bilingual phrase translations for a given source-language sentence can be extracted automatically from a word-aligned bilingual corpus. The phrase-based approach requires the system to be trained. Training starts from a bilingual corpus, i.e. a set of sentences that are translations of each other. The word alignment result shows which words in the sentences are translations of each other. Phrase extraction follows: all continuous word strings in the corpus that are translations of each other are extracted, regardless of whether a word string carries any real meaning.

[0004] PBMT has the following drawbacks: (1) because it relies on local context dependencies, PBMT does not handle longer sentences or phrases well, in particular the long-distance reordering caused by discontinuous fixed collocations; (2) because PBMT relies entirely on continuous-phrase statistics, it ignores the syntactic features of the language and fails to make full use of the knowledge contained in the corpus, which limits further improvement of its translation quality.

Summary of the Invention

[0005] In view of the drawbacks mentioned above, an object of the present invention is to provide a machine translation system and method based on syntactic analysis and a hierarchical model.
[0006] According to one aspect of the present invention, there is provided a machine translation system based on syntactic analysis and a hierarchical model. The machine translation system may comprise: a word alignment module, which receives bilingual sentence-aligned text from outside and obtains word alignment information from the received text; a phrase extraction module, which receives the word alignment information from the word alignment module and uses it to perform phrase extraction, obtaining a phrase alignment table; a POS and syntax tagging module, which receives an annotated corpus and the bilingual sentence-aligned text from outside, extracts useful linguistic knowledge and its probability distribution information from the annotated corpus, and uses the extracted knowledge and probability distribution information to perform POS and syntax tagging on both languages or one language of the bilingual sentence-aligned text, producing a syntax-tagged corpus; a syntax-based discontinuous phrase extraction module, which receives the syntax-tagged corpus from the POS and syntax tagging module and, on the basis of that corpus, performs syntax-based discontinuous phrase extraction according to the alignment information produced by the word alignment module or the phrase alignment table produced by the phrase extraction module, producing a syntax-based discontinuous phrase rule base; a discontinuous-phrase-based translation module, which receives the syntax-based discontinuous phrase rule base from the discontinuous phrase extraction module, retrieves from that rule base all possible phrases, translations and their probabilities for the sentence to be translated, and outputs translation results; and a scoring output module, which receives an evaluation model from outside, scores the translation results on the basis of the evaluation model, and outputs the translation result with the highest score.

[0007] The machine translation system may further comprise a continuous-phrase-based translation module, which receives the phrase alignment table from the phrase extraction module, retrieves from the phrase alignment table all possible phrases, translations and their probabilities for the sentence to be translated, and outputs the translation result to the scoring output module.

[0008] The syntax-based discontinuous phrase extraction module may comprise: a discontinuous phrase extraction module, which, according to the word alignment information produced by the word alignment module or the phrase alignment table produced by the phrase extraction module, replaces the bilingually aligned continuous phrases in each sentence of the bilingual sentence-aligned text with non-terminal symbols, obtaining a discontinuous phrase rule base; and a syntactic filtering module, which filters the discontinuous phrase rule base produced by the discontinuous phrase extraction module on the basis of the syntax-tagged corpus, producing the syntax-based discontinuous phrase rule base.

[0009] The probability distribution information may include the probability that a particular word belongs to a particular part of speech, the probability that a particular phrase belongs to a particular phrase category, and context probabilities.
[0010] The phrase alignment table may include source-language phrases, target-language phrases and probability values.

[0011] According to another aspect of the present invention, there is provided a machine translation method based on syntactic analysis and a hierarchical model. The machine translation method comprises the following steps: receiving bilingual sentence-aligned text and obtaining word alignment information from the received text; performing phrase extraction using the word alignment information to obtain a phrase alignment table; receiving an annotated corpus and the bilingual sentence-aligned text, extracting useful linguistic knowledge and its probability distribution information from the annotated corpus, and using the extracted knowledge and probability distribution information to perform POS and syntax tagging on both languages or one language of the bilingual sentence-aligned text, producing a syntax-tagged corpus; performing, on the basis of the syntax-tagged corpus, syntax-based discontinuous phrase extraction according to the alignment information or the phrase alignment table, producing a syntax-based discontinuous phrase rule base; retrieving from the syntax-based discontinuous phrase rule base all possible phrases, translations and their probabilities for the sentence to be translated; and receiving an evaluation model, scoring the translations on the basis of the evaluation model, and outputting the translation result with the highest score.

[0012] The machine translation method may further comprise the following step: retrieving from the phrase alignment table all possible phrases, translations and their probabilities for the sentence to be translated.

[0013] The step of producing the syntax-based discontinuous phrase rule base may comprise the following steps: replacing, according to the word alignment information or the phrase alignment table, the bilingually aligned continuous phrases in each sentence of the bilingual sentence-aligned text with non-terminal symbols, obtaining a discontinuous phrase rule base; and filtering the discontinuous phrase rule base on the basis of the syntax-tagged corpus, producing the syntax-based discontinuous phrase rule base.

[0014] The machine translation system and method according to the present invention perform syntactic analysis on top of a general continuous-phrase-based machine translation model and thereby extract a syntax-based discontinuous phrase rule base from bilingual sentence-aligned text. This addresses discontinuous fixed collocations in the context of the whole sentence and makes the rules conform to the syntactic features of the language. Translation is performed on the basis of the discontinuous phrase rule base and the phrase alignment table, and the translation results are scored with an evaluation model, which effectively improves translation quality.
Brief Description of the Drawings

[0015] The above and other features and aspects of the present invention will become clearer from the detailed description of exemplary embodiments of the invention with reference to the accompanying drawings, in which:

[0016] Fig. 1 is a block diagram showing a machine translation system based on syntactic analysis and a hierarchical model according to an exemplary embodiment of the present invention;

[0017] Fig. 2 is a diagram showing the construction of a syntax-tagged corpus;

[0018] Fig. 3 is a diagram showing the syntax-based discontinuous phrase extraction module of Fig. 1 according to an exemplary embodiment of the present invention;

[0019] Fig. 4 is a diagram showing an example of the operation of the discontinuous phrase extraction module in Fig. 3;

[0020] Fig. 5 is a diagram showing an example of monolingual syntactic-analysis filtering of the discontinuous phrase rule base;

[0021] Figs. 6A and 6B are diagrams showing machine translation according to an exemplary embodiment of the present invention and according to the conventional art, respectively;

[0022] Fig. 7 is a flowchart showing a machine translation method based on syntactic analysis and a hierarchical phrase model according to an exemplary embodiment of the present invention.

Detailed Description

[0023] Hereinafter, exemplary embodiments of the present invention will be described in detail with reference to the accompanying drawings.
[0024] Fig. 1 shows a machine translation system based on syntactic analysis and a hierarchical phrase model according to an exemplary embodiment of the present invention.

[0025] As shown in Fig. 1, the machine translation system based on syntactic analysis and a hierarchical phrase model according to the exemplary embodiment comprises: a word alignment module 101, a phrase extraction module 102, a continuous-phrase-based translation module 103, a POS and syntax tagging module 201, a syntax-based discontinuous phrase extraction module 202, a discontinuous-phrase-based translation module 301, and a scoring output module 302.

[0026] The word alignment module 101, the phrase extraction module 102 and the continuous-phrase-based translation module 103 are the modules used in conventional continuous-phrase-based translation systems. Together with the POS and syntax tagging module 201 and the syntax-based discontinuous phrase extraction module 202 according to the exemplary embodiment, they constitute the preprocessing part of the machine translation system. The continuous-phrase-based translation module 103, together with the discontinuous-phrase-based translation module 301 and the scoring output module 302 according to the exemplary embodiment, may constitute the translation engine of the machine translation system.

[0027] Referring to Fig. 1, bilingual sentence-aligned text is input to the word alignment module 101. The word alignment module 101 uses a tool (for example, GIZA++) to obtain word alignment information from the input text and passes the word alignment information to the phrase extraction module 102.

[0028] The phrase extraction module 102 receives the word alignment information from the word alignment module 101, uses it to perform phrase extraction and thereby obtains a phrase alignment table (also called the continuous phrase library), and sends the phrase alignment table to the continuous-phrase-based translation module 103 and the syntax-based discontinuous phrase extraction module 202. The phrase alignment table contains three parts: (1) the source-language phrase; (2) the target-language phrase; (3) a probability value.

[0029] In computer processing of natural language, rule-based syntactic parsing mainly uses Chomsky's context-free grammar, which, however, is powerless when faced with the ambiguity of natural language.
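As an illustration of how a phrase alignment table like the one in paragraph [0028] might be populated, the following is a minimal sketch of phrase-pair extraction from a word alignment. It assumes the standard alignment-consistency criterion from phrase-based statistical MT, which the patent does not spell out; the function name `extract_phrase_pairs`, the `max_len` parameter and the toy alignment are hypothetical, and probability estimation is omitted.

```python
def extract_phrase_pairs(src, tgt, alignment, max_len=4):
    """Return (source phrase, target phrase) pairs consistent with the alignment.

    src, tgt  : lists of tokens
    alignment : set of (src_index, tgt_index) links
    A pair is kept only if no word inside the target span is linked to a
    source word outside the source span (assumed consistency criterion).
    """
    pairs = set()
    for s_start in range(len(src)):
        for s_end in range(s_start, min(s_start + max_len, len(src))):
            # Target positions linked to the chosen source span.
            linked = [t for (s, t) in alignment if s_start <= s <= s_end]
            if not linked:
                continue
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start >= max_len:
                continue
            # Reject spans whose target words link back outside the source span.
            inconsistent = any(
                t_start <= t <= t_end and not (s_start <= s <= s_end)
                for (s, t) in alignment
            )
            if inconsistent:
                continue
            pairs.add((" ".join(src[s_start:s_end + 1]),
                       " ".join(tgt[t_start:t_end + 1])))
    return pairs


# Hypothetical toy sentence pair and alignment (not taken from the patent figures).
src = ["中国", "的", "上海"]
tgt = ["Shanghai", "of", "China"]
alignment = {(0, 2), (1, 1), (2, 0)}
pairs = extract_phrase_pairs(src, tgt, alignment)
```

In a full system each extracted pair would also carry a probability value, typically estimated from relative frequencies over the corpus; that step is left out here.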
[0030] In recent years, improvements to context-free grammar have come mainly in two directions. One is to attach probabilities to the rules of a context-free grammar, giving the probabilistic context-free grammar (PCFG). The other is, in addition to attaching probabilities to the rules, to take into account the influence of a rule's head word on the rule probability, giving the probabilistic lexicalized context-free grammar.

[0031] These studies skillfully combine the rule-based rationalist approach with the statistics-based empiricist approach, have achieved good results, and provide a powerful means of resolving syntactic ambiguity. A probabilistic grammar assigns a probability to the symbol string of a sentence or word and thereby captures finer syntactic information than an ordinary context-free grammar. A probabilistic context-free grammar is itself a context-free grammar in which every rule is labeled with the probability of selecting that rule. When the context-free rules are processed, they are assumed to be conditionally independent, and the probability of a sentence is computed as the product of the probabilities of the rules used in parsing it.

[0032] The specific operation by which the POS and syntax tagging module 201 constructs the syntax-tagged corpus (here the corpus is also called a treebank) is described below with reference to Fig. 2, taking PCFG as an example.
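The product rule of paragraph [0031] can be sketched in a few lines. The factor lists below are the rule probabilities that paragraph [0035] quotes for the two candidate taggings of Fig. 2; the function names and the dictionary keys are hypothetical.

```python
from math import prod


def parse_probability(rule_probs):
    """Probability of one parse: the product of the probabilities of its rules,
    under the PCFG assumption that rule applications are independent."""
    return prod(rule_probs)


def best_parse(candidates):
    """Return the name of the candidate parse with the highest probability."""
    return max(candidates, key=lambda name: parse_probability(candidates[name]))


# Rule probabilities as quoted in paragraph [0035] for Fig. 2 (a) and (b).
candidates = {
    "fig2_a": [0.2, 0.2, 0.2, 0.4, 0.45, 1.0, 1.0, 0.4, 0.05],
    "fig2_b": [0.8, 0.2, 0.05, 0.4, 0.4, 0.3, 0.4, 0.4, 0.4, 0.05],
}
p1 = parse_probability(candidates["fig2_a"])
p2 = parse_probability(candidates["fig2_b"])
chosen = best_parse(candidates)
```

In practice a parser would sum log-probabilities rather than multiply raw probabilities, to avoid underflow on long sentences.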
[0033] First, a corpus is annotated (automatically or manually) to form a corpus carrying annotation information at different levels, such as the Penn Treebank, which is annotated with part-of-speech and syntax-tree information and whose main tag set is shown in Fig. 2(a). The annotated corpus is input to the POS and syntax tagging module 201.

[0034] The POS and syntax tagging module 201 uses statistical tools to extract useful linguistic knowledge and its probability distribution information from the annotated corpus, i.e. a supervised training method. The main probability distribution information includes the probability that a word belongs to a part of speech, the probability that a phrase belongs to a phrase category, and context probabilities.

[0035] Using the extracted linguistic knowledge and probability distribution information, the POS and syntax tagging module 201 performs POS and syntax tagging on both languages or one language of the bilingual sentence-aligned text, produces a syntax-tagged corpus, and sends it to the syntax-based discontinuous phrase extraction module 202. A sentence may have several possible taggings, and the one with the highest probability is selected as the output. For the two taggings shown in Fig. 2(a) and (b), the probability of Fig. 2(a) is $P_1 = 0.2 \times 0.2 \times 0.2 \times 0.4 \times 0.45 \times 1.0 \times 1.0 \times 0.4 \times 0.05 = 2.88 \times 10^{-5}$, and the probability of Fig. 2(b) is $P_2 = 0.8 \times 0.2 \times 0.05 \times 0.4 \times 0.4 \times 0.3 \times 0.4 \times 0.4 \times 0.4 \times 0.05 = 1.2288 \times 10^{-6}$. The tagging of Fig. 2(a) is therefore selected.

[0036] Fig. 2(c) and (d) show part of the syntactic tag set and a tagged Chinese sentence, respectively.

[0037] The syntax-based discontinuous phrase extraction module 202 receives the syntax-tagged corpus from the POS and syntax tagging module 201 and, on the basis of that corpus, performs syntax-based discontinuous phrase extraction according to the alignment information produced by the word alignment module 101 or the phrase alignment table produced by the phrase extraction module 102, obtaining the syntax-based discontinuous phrase rule base.

[0038] How the syntax-based discontinuous phrase extraction module 202 produces the syntax-based discontinuous phrase rule base is described in detail below with reference to Figs. 3 to 5.

[0039] Figs. 3 to 5 show the specific structure and operation of the discontinuous phrase extraction module 202 according to the exemplary embodiment of the present invention.

[0040] As shown in Fig. 3, the syntax-based discontinuous phrase extraction module 202 comprises a discontinuous phrase extraction module 212 and a syntactic filtering module 222.

[0041] How the discontinuous phrase extraction module 212 constructs the discontinuous phrase rule base is described in detail below with reference to Fig. 4.

[0042] According to the word alignment information produced by the word alignment module 101 or the phrase alignment table produced by the phrase extraction module 102, the discontinuous phrase extraction module 212 replaces the bilingually aligned continuous phrases in each sentence of the bilingual sentence-aligned text with non-terminal symbols such as [X] and [Y], obtaining the discontinuous phrase rule base.

[0043] Fig. 4 shows an example of discontinuous phrase rule extraction. The rule in this example is: 带[X]的[Y] ||| [Y] with [X] ||| 0.1 0.3 0.6, where 0.1 is the source-to-target translation probability, 0.3 is the target-to-source word translation probability, and 0.6 is the source-to-target word translation probability.

[0044] The basic idea of syntactic filtering of the discontinuous phrase rule base is to ensure that the part of the sentence extracted as a phrase is a relatively independent sentence constituent, such as a noun phrase (NP) or a quantifier phrase (QP), so as to safeguard the quality of the later translation.

[0045] The syntactic filtering module 222 filters the discontinuous phrase rule base produced by the discontinuous phrase extraction module on the basis of the syntax-tagged corpus, producing the syntax-based discontinuous phrase rule base.

[0046] How the syntactic filtering module 222 performs syntactic filtering is described below with reference to Fig. 5.

[0047] Fig. 5 shows an example of monolingual syntactic-analysis filtering of the discontinuous phrase rule base.

[0048] As shown in Fig. 5, the input monolingual sentence is syntactically tagged.

[0049] Consider the case in which a non-pronoun noun phrase (NP-NN) is cut out of the tagged sentence and replaced with [X]; here the phrase is "地铁路线图" (subway route map). The resulting discontinuous phrase rule is the first retained RULE in Fig. 5.

[0050] Consider the quantifier phrase (QP) case, specifically a phrase labeled QP that contains two child nodes, CD and CLP, such as (QP (CD 两) (CLP (M 张))). The CD, here "两" (two), is replaced with [X]. The resulting discontinuous phrase rule is the second retained rule in Fig. 5.

[0051] The rule that is filtered out because it does not conform to the grammar rules is "[X]给我地铁路线图吗?" in Fig. 5.

[0052] The preprocessing part of the machine translation system based on syntactic analysis and a hierarchical phrase model according to the exemplary embodiment has been described in detail above with reference to the drawings. The translation engine of the system is described below with reference to Fig. 1 and Fig. 6.
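The extraction step of paragraph [0042] and the filtering idea of paragraph [0044] can be sketched as follows. This is only an illustration under stated assumptions: the example sentence pair and spans are hypothetical (Fig. 4 is not reproduced here), the helper names `make_rule` and `keep_rule` are invented, and the filter is reduced to a check that the replaced span carries an NP or QP label, which is a simplification of whatever the syntactic filtering module actually does.

```python
def make_rule(src_tokens, tgt_tokens, src_span, tgt_span, symbol="[X]"):
    """Replace an aligned continuous sub-phrase with a non-terminal on both sides.

    Spans are half-open (start, end) token index ranges.
    Returns the (source side, target side) of the resulting rule.
    """
    s0, s1 = src_span
    t0, t1 = tgt_span
    src_rule = src_tokens[:s0] + [symbol] + src_tokens[s1:]
    tgt_rule = tgt_tokens[:t0] + [symbol] + tgt_tokens[t1:]
    return " ".join(src_rule), " ".join(tgt_rule)


# Assumed whitelist of constituent labels, following the NP/QP examples in [0044].
ALLOWED_LABELS = {"NP", "QP"}


def keep_rule(replaced_span_label):
    """Keep a rule only if the replaced span is a relatively independent constituent."""
    return replaced_span_label in ALLOWED_LABELS


# Hypothetical sentence pair: "带 浴室 的 房间" / "room with bathroom".
src = ["带", "浴室", "的", "房间"]
tgt = ["room", "with", "bathroom"]
rule = make_rule(src, tgt, src_span=(1, 2), tgt_span=(2, 3))
```

Applying `make_rule` a second time with the symbol `[Y]` on the remaining aligned noun would give a two-gap rule of the same shape as the one quoted in paragraph [0043].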
[0053] The machine translation system based on syntactic analysis and a hierarchical model according to the present invention uses a translation model, a language model, a reordering model and a decoder.

[0054] The main differences between machine translation based on syntactic analysis and a hierarchical model according to the present invention and conventional continuous-phrase-based machine translation lie in the extension of the translation model and the relative weakening of the reordering model.

[0055] The translation model provides the translation correspondences between source-language and target-language phrases and expresses the strength of each correspondence as a probability value; the higher the value, the more accurate the correspondence. It is used to supply possible target-language translations for a source-language sentence. The hierarchical-phrase-based translation model extends the correspondences from continuous phrases alone to both continuous phrases and syntax-based discontinuous phrases.

[0056] The language model stores a large number of probability values that give the probabilistic relationship between each word and the words or phrases before and after it. Its role is to judge how well a phrase $S_t$ conforms to the syntax and usage of the target language, and it is used to select among translation results. This degree is generally measured by a probability value $P_{LM}(S_t)$; the higher the value of $P_{LM}(S_t)$, the better the phrase conforms to the target language.
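To make the role of the language model in paragraph [0056] concrete, here is a minimal sketch of an n-gram model that scores how well a word string fits the target language. It is a bigram model with add-one smoothing trained on a two-sentence toy corpus; the model order, the smoothing method, the training data and the function names are all assumptions for illustration, not details given by the patent.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count history unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    vocab = {"</s>"}
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        vocab.update(tokens[1:])
        unigrams.update(tokens[:-1])            # counts of each word as a history
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams, len(vocab)


def log_prob(sentence, model):
    """Add-one smoothed bigram log-probability of a sentence.

    Unseen words are not given separate probability mass, so this is a
    rough score rather than a properly normalized distribution.
    """
    unigrams, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )


# Hypothetical toy training data.
model = train_bigram(["the terms of payment", "terms of payment apply"])
fluent = log_prob("the terms of payment", model)
awkward = log_prob("the pay terms", model)
```

With this toy corpus `fluent` is larger than `awkward`, which mirrors the statement that a higher language model value indicates a better fit to the target language.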
[0057] The reordering model adjusts the positional order of words or phrases in the translated target-language result. Because syntax-based discontinuous phrases are present, part of the function of the reordering model is taken over by them, and its weight can be correspondingly lower.

[0058] The role of the translation engine is to coordinate the models above in order to translate a source-language sentence.

[0059] Referring to Fig. 1, the continuous-phrase-based translation module 103 retrieves, in the phrase alignment table output by the phrase extraction module 102, all possible phrases, translations and their probabilities for the word-segmented sentence to be translated.

[0060] The discontinuous-phrase-based translation module 301 receives the syntax-based discontinuous phrase rule base from the discontinuous phrase extraction module 202 and retrieves from that rule base all possible phrases, translations and their probabilities for the word-segmented sentence to be translated.

[0061] Fig. 6A is a diagram showing the translation of Chinese into English based on syntactic analysis and a hierarchical phrase model according to an exemplary embodiment of the present invention.

[0062] The labels (1)-(5) in Fig. 6A correspond one-to-one to operations (1)-(5) below.
[0063] (I)输入待翻译的中文句子;[0064] (2)根据翻译模型,基于连续短语的翻译模块103在短语对齐表中搜索所有可能的短语、翻译及其概率;8[0065] (3)根据翻译模型,基于非连续短语的翻译模块301在非连续短语规则库中搜索所有可能的非连续短语、翻译及其概率;[0066] (4)根据短语、非连续短语对的翻译概率和三元语言模型概率等,解码器计算各种可能翻译结果的总概率;[0067] (5)解码器选取总概率最优的前N个句子作为N-best候选目标语言句。 [0063] (I) input to be translated Chinese sentence; [0064] (2) The translation model, a translation module 103 based on a continuous search for all possible phrases of phrase, the phrase translation probability its alignment table; 8 [0065] (3) based on a translation model, based on non-continuous phrase translation module 301 searches all possible non-continuous phrases translated phrase and its probability in non-continuous rule base; [0066] (4) the phrase translation, according to the non-continuous phrases and ternary language model probability probability, the decoder computes all possible translations of the total probability; [0067] (5) the decoder to select the best overall probability of a sentence as a first N N-best candidate target language sentence. [0068] 在图6A中,(4)-(5)表示汇总计算总概率,从而选出N个候选句子。 [0068] In FIG. 6A, (4) - (5) represents the total probability calculation summary, in order to select N candidate sentence. 另外,在图6A 中,|3,6|表示的范围均为[3,6),即包含3,但不包含6,范围是到6之前。 Further, in FIG. 6A, | 3,6 | range are represented by [3,6), i.e. comprising 3, 6 but does not include, in the range of 6 to before. [0069] 图6B是与图6A相应的根据传统技术的将中文翻译成英文的示图。 [0069] FIG. 6B is a diagram corresponding to FIG. 6A translated into English according to the conventional art will be Chinese. [0070] 与根据本发明的图6A相比,主要区别在于,在传统技术翻译过程中仅利用连续短语进行翻译,而未利用句法分析过滤过的层次短语,例如X-> ([X]的[Y],[Y]of[X]),进行概率计算,生成翻译结果。 [0070] Compared with the present invention according to FIG. 6A, the main difference is that in the conventional art using only the translation process translated phrases continuously, without using the hierarchical parsing phrases filtered, e.g. X-> ([X] of [Y], [Y] of [X]), the probability calculation, generates translation. 
For example, the method of the present application translates "中国的上海" as "Shanghai of China", whereas the conventional art yields "Chinese Shanghai". The translation according to the present invention is therefore clearly better than that of the conventional art.
[0071] The scoring of translation results by the scoring output module 302, based on the evaluation model, is described below.
[0072] The translation output fed to the scoring output module 302 consists of N candidate target-language sentences, where N is greater than or equal to 1.
[0073] The scoring output module 302 scores the N input candidate target-language sentences based on the input evaluation model.
[0074] The evaluation model may combine several translation features, such as language model features, features of the sentence's part-of-speech sequence model, and the length of the target-language sentence, to re-rank the N candidate target-language sentences and output the overall best translation as the translation result.
[0075] For simplicity of implementation and processing efficiency, the exemplary embodiment of the present invention is described with the language model of the target language as the evaluation model. Its role is to judge how well a sentence S_t conforms to the syntax and usage of the target language, and thereby to select among translation results. This degree is generally measured by a probability value P_LM(S_t); the higher P_LM(S_t), the better the sentence conforms to the target language.
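As a rough illustration of using a target-language language model as the evaluation model, the sketch below trains a toy bigram model and picks the candidate with the higher log-probability. It is only a sketch under stated assumptions: the three training sentences are invented, the model is a bigram with add-one smoothing chosen for brevity (not necessarily the model a real system would use), and the names `train_bigram_lm`, `log_prob`, and `rerank` are hypothetical.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count bigrams and their left contexts over whitespace-tokenized sentences."""
    contexts, bigrams, vocab = Counter(), Counter(), {"<s>", "</s>"}
    for s in sentences:
        toks = ["<s>"] + s.lower().split() + ["</s>"]
        vocab.update(toks)
        contexts.update(toks[:-1])          # every token that has a successor
        bigrams.update(zip(toks, toks[1:]))
    return contexts, bigrams, len(vocab)


def log_prob(sentence, lm):
    """log P_LM(S_t) under the bigram model with add-one smoothing."""
    contexts, bigrams, vocab_size = lm
    toks = ["<s>"] + sentence.lower().split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(toks, toks[1:])
    )


def rerank(candidates, lm):
    """Return the candidate the language model scores highest."""
    return max(candidates, key=lambda c: log_prob(c, lm))


# Invented toy corpus standing in for target-language training data.
lm = train_bigram_lm([
    "please tell me the terms of payment",
    "the terms of payment are fixed",
    "would you please tell me the time",
])
candidates = [
    "would you please tell me the pay terms",
    "would you please tell me the terms of payment",
]
print(rerank(candidates, lm))
```

Other features named in the text, such as part-of-speech sequence scores or sentence length, could be added as extra weighted terms in the key function; the weights would have to be tuned and are not specified in the source.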
[0076] Considering processing efficiency and the diversity of the candidate target-language sentences, N = 2 in the present exemplary embodiment: one output sentence from translation based on continuous phrases only, and one output sentence based on syntactic analysis and the hierarchical model.
[0077] The scoring output module 302 scores according to the following basic flow:
[0078] 1. Receive the N = 2 candidate target-language sentences: one output sentence from translation based on continuous phrases only, and one output sentence based on syntactic analysis and the hierarchical model.
[0079] 2. Use the language model of the target language to compute a probability value for each possible translation.
[0080] 3. Select the output with the best score.
[0081] An example of scoring by the scoring output module 302 is described below.
[0082] The source language is Chinese and the target language is English. The source-language input is "请告诉我支付条件".
[0083] The translated results (N = 2) are:
[0084] l) ffould you please tell me the pay terms (the translation-based continuous phrases) [0085] 2) Would you please tell me the terms of payment . (based on hierarchical model of parsing and translation) [0086] scoring results with these two English language model, because the "payment terms" has its usual saying "terms of payment", and "Would you please tell me the terms of payment ,, more in line with the regulations of English sentences 9 and habits, therefore, the language model will give a higher score for this result:. [0087] I) of intermediate results I scored: 0. 7 [0088] 2) on the intermediate result is scored 2: 09 [0089] 5. select the highest score as the final result: would you please tell me the terms ofpaymento [0090] will be described below with reference to FIG. 7 according to the present invention exemplary embodiments of the machine translation method and parsing hierarchical model. [0091] FIG. 7 is a flowchart showing based machine translation method and parsing a hierarchy model according to an exemplary embodiment of the present invention. [0092] the As shown in FIG. 7, at step S701和S702,分别输入已标注语料库和双语句对齐文本。[0093] 在步骤S703,进行词性和句法标注。首先利用统计工具从输入的已标注语料库中提取有用的语言知识及其概率分布信息,然后,利用提取出的语言知识及其概率分布信息, 对输入的双语句对齐文本中的双语或者单语进行词性及句法标注,最终产生句法标注语料库(或称为句法标注树库)。[0094] 在步骤S704,利用GIZA++工具从输入的双语句对齐文本获得词对齐信息。[0095] 在步骤S705,利用在步骤S704获得的词对齐信息提取短语,从而获得短语对齐表,所述短语对齐表包括以下三个部分:(I)源语言短语;(2)目标语言短语;(3)概率值。[0096] 在步骤S706,基于在步骤S703中获得的句法标注语料库根据在步骤S704中产生的对齐信息或在步骤S705中获得的短语对齐表来进行非连续短语提取,以获得基于句法的非连 S701 and S702, respectively, enter the annotated corpus and bilingual sentence-aligned text. [0093] In step S703, part of speech and syntactic annotation. 
First, statistical tools are used to extract useful linguistic knowledge and its probability distribution information from the input annotated corpus. The extracted linguistic knowledge and probability distribution information are then used to tag both languages, or a single language, of the input bilingual sentence-aligned text with parts of speech and syntax, finally producing a syntactically annotated corpus (also called a syntactically annotated treebank).
[0094] In step S704, the GIZA++ tool is used to obtain word alignment information from the input bilingual sentence-aligned text.
[0095] In step S705, phrases are extracted using the word alignment information obtained in step S704, yielding a phrase alignment table with three parts: (1) source-language phrase; (2) target-language phrase; (3) probability value.
[0096] In step S706, non-continuous phrases are extracted based on the syntactically annotated corpus obtained in step S703, according to the alignment information generated in step S704 or the phrase alignment table obtained in step S705, to obtain the syntax-based non-continuous phrase rule base.
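The three-part phrase alignment table of step S705 and the non-terminal substitution behind step S706 can be sketched in Python as follows. This is an illustration under assumptions, not the patented implementation: the record type `PhrasePair`, the helpers `replace_once` and `extract_rule`, the whitespace tokenization, and the probabilities are all hypothetical, and the syntactic filtering against the annotated corpus is not shown.

```python
from typing import NamedTuple


class PhrasePair(NamedTuple):
    """One phrase alignment table entry: source phrase, target phrase, probability."""
    source: str
    target: str
    prob: float


def replace_once(tokens, phrase, symbol):
    """Replace the first occurrence of `phrase` in `tokens` with `symbol`, or return None."""
    n = len(phrase)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            return tokens[:i] + [symbol] + tokens[i + n:]
    return None


def extract_rule(src_tokens, tgt_tokens, sub_pairs, symbols=("[X]", "[Y]")):
    """Turn an aligned sentence pair into a gapped rule by replacing aligned
    continuous sub-phrases on both sides with the same non-terminal."""
    src, tgt = list(src_tokens), list(tgt_tokens)
    for pair, sym in zip(sub_pairs, symbols):
        new_src = replace_once(src, pair.source.split(), sym)
        new_tgt = replace_once(tgt, pair.target.split(), sym)
        if new_src is None or new_tgt is None:
            continue  # the sub-phrase must appear on both sides to be replaced
        src, tgt = new_src, new_tgt
    return " ".join(src), " ".join(tgt)


# Invented entries; the probabilities are placeholders.
table = [PhrasePair("中国", "China", 0.8), PhrasePair("上海", "Shanghai", 0.9)]
rule = extract_rule("中国 的 上海".split(), "Shanghai of China".split(), table)
print(rule)
```

Using the same symbol on both sides keeps track of which source gap maps to which target gap, which is what lets a rule such as ([X] 的 [Y], [Y] of [X]) express reordering.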
[0097] In detail, first, based on the alignment information obtained in step S704 or the phrase alignment table obtained in step S705, the bilingually aligned continuous phrases in each sentence of the bilingual sentence-aligned text are replaced with non-terminals such as [X] and [Y], yielding a non-continuous phrase rule base. Then syntactic filtering is performed based on the syntactically annotated corpus obtained in step S703 to obtain the syntax-based non-continuous phrase rule base.
[0098] In step S707, according to the translation model, the phrase alignment table and the syntax-based non-continuous phrase rule base are searched for all possible phrases, non-continuous phrases, their translations, and their probabilities, and the N translations with the best total probability are output as candidate target-language sentences.
[0099] In step S708, the candidate target-language sentences are scored based on the evaluation model, and the overall best one is selected as the final output.
[0100] The machine translation system and method based on syntactic analysis and a hierarchical model according to exemplary embodiments of the present invention have been described above with reference to the drawings. Those skilled in the art should understand that the present invention is not limited to these exemplary embodiments. For example, in order to obtain all possible translation results, FIG. 1 includes the continuous-phrase-based translation module 103, and step S707 of FIG. 7 includes searching the phrase alignment table for all possible phrases, their translations, and their probabilities; however, it is also feasible to omit the continuous-phrase-based translation module 103 from FIG. 1 and to omit the search of the phrase alignment table from step S707 of FIG. 7. In addition, in the exemplary embodiments of the present invention, the evaluation model is not limited to a language model.
[0101] A Korean-to-Chinese translation experiment was conducted on a prototype system based on this patent.
[0102] Test set composition: closed test (test sentences selected from the training set), 20%; open test (test sentences not in the training set), 80%.
[0103] Human evaluation results: compared with a conventional continuous-phrase-based machine translation system, the share of sentences with clearly improved Korean-to-Chinese fluency increased by more than 10%, reaching a practical level with a human-rated "good" rate of 86.5%.
[0104] On an embedded system equivalent to the hardware configuration of current mainstream mobile phones, the average translation speed was 2 sentences per second, achieving real-time translation.
[0105]

Figure CN102214166BD00111

[0106] The following are a Korean-to-Chinese translation (Example 1) and a Chinese-to-Korean translation (Example 2).
[0107] Example 1 (Korean-to-Chinese translation)
[0108]

Figure CN102214166BD00112

[0109] Example 2 (Chinese-to-Korean translation)
[0110] Chinese: 请把我的包送去我的房间。 (Please send my bag to my room.)
[0111] Translation result based on the continuous phrase model:
[0112] 啮7]·嘹咅旦切平砷立. (translation error)
[0113] Translation result of the present invention based on syntactic analysis and the hierarchical model:
[0114] 宅咅冲啩幺旦7>^4平砷A. (correct translation)
[0115] Compared with prior-art machine translation systems and methods based on continuous phrases, the machine translation system and method based on syntactic analysis and a hierarchical model according to exemplary embodiments of the present invention can markedly improve translation accuracy, particularly when the corpus size is limited.
[0116] The machine translation system and method based on syntactic analysis and a hierarchical model according to exemplary embodiments of the present invention can be applied to computer systems as well as to embedded systems.
[0117] The present invention introduces a hierarchical model: an aligned non-continuous phrase rule base is extracted from a sentence-aligned bilingual corpus, which solves the translation problem of non-continuous fixed collocations in whole-sentence context.
[0118] The present invention adds a part-of-speech and syntactic tagging module and a syntax-based non-continuous phrase extraction module. These analyze and obtain the syntactic annotation tree of each sentence in the corpus (that is, the syntactically tagged sentence) and derive the syntax-based non-continuous phrase rule base from these trees, so that the rule base conforms to the syntactic features of the language. This improves translation quality and greatly reduces the size of the non-continuous phrase rule base, making it suitable for use on embedded systems.
[0119] The present invention scores and selects translation results based on the evaluation model and outputs the highest-scoring translation as the final result. This effectively combines the strengths of the individual translation models, ensures the extensibility of the system, and further improves translation quality.
[0120] Those skilled in the art should understand that various changes in form and detail may be made without departing from the spirit and scope of the present invention. Accordingly, the exemplary embodiments described above are for illustration only and should not be construed as limiting the present invention. The scope of the present invention is defined by the claims.

Claims (10)

  1. A machine translation system based on syntactic analysis and a hierarchical model, comprising: a word alignment module that receives bilingual sentence-aligned text from outside and obtains word alignment information from the received text; a phrase extraction module that receives the word alignment information from the word alignment module and performs phrase extraction using it to obtain a phrase alignment table; a part-of-speech and syntactic tagging module that receives an annotated corpus and the bilingual sentence-aligned text from outside, extracts from the annotated corpus linguistic knowledge and its probability distribution information for the bilingual sentence-aligned text, and uses the extracted linguistic knowledge and probability distribution information to tag both languages, or a single language, of the bilingual sentence-aligned text with parts of speech and syntax, producing a syntactically annotated corpus; a syntax-based non-continuous phrase extraction module that receives the syntactically annotated corpus from the part-of-speech and syntactic tagging module and, based on the syntactically annotated corpus, performs syntax-based non-continuous phrase extraction according to the alignment information generated by the word alignment module or the phrase alignment table generated by the phrase extraction module, to produce a syntax-based non-continuous phrase rule base; a non-continuous-phrase-based translation module that receives the syntax-based non-continuous phrase rule base from the non-continuous phrase extraction module, retrieves from that rule base all possible phrases of a sentence to be translated, their translations, and their translation probabilities, and outputs translation results; and a scoring output module that receives an evaluation model from outside, scores the translation results based on the evaluation model, and outputs the highest-scoring translation result.
  2. The machine translation system based on syntactic analysis and a hierarchical model according to claim 1, characterized in that the machine translation system further comprises: a continuous-phrase-based translation module that receives the phrase alignment table from the phrase extraction module, retrieves from the phrase alignment table all possible phrases of the sentence to be translated, their translations, and their probabilities, and outputs translation results to the scoring output module.
  3. The machine translation system based on syntactic analysis and a hierarchical model according to claim 1 or 2, characterized in that the syntax-based non-continuous phrase extraction module comprises: a non-continuous phrase extraction module that, according to the word alignment information generated by the word alignment module or the phrase alignment table generated by the phrase extraction module, replaces the bilingually aligned continuous phrases in each sentence of the bilingual sentence-aligned text with non-terminals to obtain a non-continuous phrase rule base; and a syntactic filtering module that filters the non-continuous phrase rule base produced by the non-continuous phrase extraction module based on the syntactically annotated corpus, to produce the syntax-based non-continuous phrase rule base.
  4. The machine translation system based on syntactic analysis and a hierarchical model according to claim 1, characterized in that the probability distribution information comprises the probability that a particular word belongs to a particular part of speech, the probability that a particular phrase belongs to a particular phrase category, and context probabilities.
  5. The machine translation system based on syntactic analysis and a hierarchical model according to claim 1, characterized in that the phrase alignment table comprises source-language phrases, target-language phrases, and probability values.
  6. A machine translation method based on syntactic analysis and a hierarchical model, comprising the steps of: receiving bilingual sentence-aligned text and obtaining word alignment information from the received text; performing phrase extraction using the word alignment information to obtain a phrase alignment table; receiving an annotated corpus and the bilingual sentence-aligned text, extracting from the annotated corpus linguistic knowledge and its probability distribution information for the bilingual sentence-aligned text, and using the extracted linguistic knowledge and probability distribution information to tag both languages, or a single language, of the bilingual sentence-aligned text with parts of speech and syntax, producing a syntactically annotated corpus; performing, based on the syntactically annotated corpus, syntax-based non-continuous phrase extraction according to the alignment information or the phrase alignment table, to produce a syntax-based non-continuous phrase rule base; retrieving from the syntax-based non-continuous phrase rule base all possible phrases of a sentence to be translated, their translations, and their translation probabilities; and receiving an evaluation model, scoring the translations based on the evaluation model, and outputting the highest-scoring translation result.
  7. The machine translation method based on syntactic analysis and a hierarchical model according to claim 6, characterized in that the machine translation method further comprises the step of: retrieving from the phrase alignment table all possible phrases of the sentence to be translated, their translations, and their probabilities.
  8. The machine translation method based on syntactic analysis and a hierarchical model according to claim 6 or 7, characterized in that the step of producing the syntax-based non-continuous phrase rule base comprises the steps of: replacing, according to the word alignment information or the phrase alignment table, the bilingually aligned continuous phrases in each sentence of the bilingual sentence-aligned text with non-terminals to obtain a non-continuous phrase rule base; and filtering the non-continuous phrase rule base based on the syntactically annotated corpus, to produce the syntax-based non-continuous phrase rule base.
  9. The machine translation method based on syntactic analysis and a hierarchical model according to claim 6, characterized in that the probability distribution information comprises the probability that a particular word belongs to a particular part of speech, the probability that a particular phrase belongs to a particular phrase category, and context probabilities.
  10. The machine translation method based on syntactic analysis and a hierarchical model according to claim 6, characterized in that the phrase alignment table comprises source-language phrases, target-language phrases, and probability values.
CN 201010144623 2010-04-06 2010-04-06 Machine translation system and machine translation method based on syntactic analysis and hierarchical model CN102214166B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010144623 CN102214166B (en) 2010-04-06 2010-04-06 Machine translation system and machine translation method based on syntactic analysis and hierarchical model

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN 201010144623 CN102214166B (en) 2010-04-06 2010-04-06 Machine translation system and machine translation method based on syntactic analysis and hierarchical model
KR20110018439A KR101777421B1 (en) 2010-04-06 2011-03-02 A syntactic analysis and hierarchical phrase model based machine translation system and method
US13079283 US8818790B2 (en) 2010-04-06 2011-04-04 Syntactic analysis and hierarchical phrase model based machine translation system and method

Publications (2)

Publication Number Publication Date
CN102214166A true CN102214166A (en) 2011-10-12
CN102214166B true CN102214166B (en) 2013-02-20

Family

ID=44745481

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010144623 CN102214166B (en) 2010-04-06 2010-04-06 Machine translation system and machine translation method based on syntactic analysis and hierarchical model

Country Status (1)

Country Link
CN (1) CN102214166B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103116575B (en) * 2011-11-16 2016-06-22 富士通株式会社 Based on hierarchical phrase translation probability model of word order to determine the method and device
KR101475284B1 (en) * 2011-11-29 2014-12-23 에스케이텔레콤 주식회사 Error detection apparatus and method based on shallow parser for estimating writing automatically
CN103914447B (en) * 2013-01-09 2017-04-19 富士通株式会社 The information processing apparatus and information processing method
CN104346325B (en) * 2013-07-30 2017-05-10 富士通株式会社 Information processing method and apparatus
CN104050160B (en) * 2014-03-12 2017-04-05 北京紫冬锐意语音科技有限公司 One kind of spoken language translation method and apparatus for machine translation and manual integration of
CN106372053A (en) * 2015-07-22 2017-02-01 华为技术有限公司 Syntactic analysis method and apparatus
CN106383818A (en) * 2015-07-30 2017-02-08 阿里巴巴集团控股有限公司 Machine translation method and device
CN106484681A (en) * 2015-08-25 2017-03-08 阿里巴巴集团控股有限公司 Method and device for generating candidate translation, and electronic equipment
CN106484682A (en) * 2015-08-25 2017-03-08 阿里巴巴集团控股有限公司 Statistics-based machine translation method and apparatus, and electronic device
CN105320644B (en) * 2015-09-23 2018-01-02 陕西中医药大学 Chinese syntax analysis method for automatic rule-based
CN106156013A (en) * 2016-06-30 2016-11-23 电子科技大学 Two-stage-type machine translation method with preferentiality of idiomatic phrases

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1228566A (en) 1998-03-11 1999-09-15 英业达股份有限公司 Non-continuous phrase matching translation device and method
CN1652106A (en) 2004-02-04 2005-08-10 北京赛迪翻译技术有限公司 Machine translation method and apparatus based on language knowledge base
CN101290616A (en) 2008-06-11 2008-10-22 中国科学院计算技术研究所 Statistical machine translation method and system
CN101685441A (en) 2008-09-24 2010-03-31 中国科学院自动化研究所 Generalized reordering statistic translation method and device based on non-continuous phrase

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7353165B2 (en) * 2002-06-28 2008-04-01 Microsoft Corporation Example based machine translation system
KR100911619B1 (en) * 2007-12-11 2009-08-12 한국전자통신연구원 Method and apparatus for constructing vocabulary pattern of english

Also Published As

Publication number Publication date Type
CN102214166A (en) 2011-10-12 application

Legal Events

Date Code Title Description
C06 Publication
C10 Request of examination as to substance
C14 Granted