JP5416021B2

JP5416021B2 - Machine translation apparatus, machine translation method, and program thereof

Info

Publication number: JP5416021B2
Application number: JP2010087921A
Authority: JP
Inventors: 克仁須藤; ドゥケヴィン; 努平尾; 元塚田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2010-04-06
Filing date: 2010-04-06
Publication date: 2014-02-12
Anticipated expiration: 2030-04-06
Also published as: JP2011221650A

Description

The present invention relates to a machine translation apparatus, a machine translation method, and a program therefor, which machine-translate an input sentence in one language into a sentence in a different language.

In statistical machine translation (for example, Patent Documents 1 and 2), which obtains a translated sentence in a different language (hereinafter, the "target language") from an input sentence in a certain language (hereinafter, the "source language"), searching every possible translation is known to require an amount of computation that grows as a power of the input sentence length (number of words). Long source-language sentences are therefore generally handled by narrowing the search range.

Two methods are commonly used to narrow the search range. One is beam search, which keeps only the candidate solutions whose score is at or above a certain level during the search and abandons further search of candidates below that level. The other restricts word reordering in translation to a fixed range, which avoids an exhaustive search over the factorial number of word-order permutations. With either method, however, the degree of approximation grows when the input sentence is long, and the final translation quality deteriorates. With the latter method in particular, a correct translation cannot be obtained between languages whose word order differs greatly, such as English to Japanese, and translation quality may drop significantly.

One conventional technique for this problem divides a sentence into a plurality of subsequences, translates each subsequence, and recombines the results (for example, Patent Document 3). Dividing the sentence reduces the problems described above, which arise from the length of the sentence to be translated. Patent Document 3 proposes translating the divided subsequences with a plurality of translation devices and generating the translated sentence from the optimal combination of subsequence translations. Specifically, the sentence is divided and reassembled using rules created in advance.

Patent Document 1: JP 2006-99208 A
Patent Document 2: JP 2008-15844 A
Patent Document 3: JP 2001-222529 A

The method of Patent Document 3 requires the rules for dividing and assembling sentences to be created by hand, and defining the very large number of rules needed to divide and assemble a wide variety of sentences takes considerable effort. An object of the present invention is to provide a machine translation apparatus, a machine translation method, and a program therefor that, in translating an input sentence divided into subsequences, can combine the subsequence translations in an appropriate order without using manually created rules, and can thereby improve the quality of the translated sentence even between languages whose word order differs greatly.

The machine translation apparatus of the present invention includes a block-divided parallel sentence database, a translation training unit, a block translation model database, a block division unit, a translation unit, and a combining unit.

The block-divided parallel sentence database stores block-divided parallel sentences. Each of these consists of a source-language training sentence that has been divided, based on syntactic analysis, into a plurality of blocks, each block consisting of one or more words and non-terminal symbols that indicate the insertion positions of lower-level blocks, together with an ideal translation in the target language, including the non-terminal symbols, for each of the divided blocks.

The translation training unit learns a block translation model using the block-divided parallel sentences read from the block-divided parallel sentence database, and writes the model to the block translation model database.

The block translation model database stores the block translation model.

The block division unit divides a source-language input sentence to be translated into the target language into a plurality of such blocks based on syntactic analysis.

The translation unit uses the block translation model read from the block translation model database to translate each block of the input sentence divided by the block division unit into a target-language translation that includes the non-terminal symbols.

The combining unit generates the translation of the input sentence by combining the translations of the blocks according to the block insertion positions indicated by the non-terminal symbols.

According to the machine translation apparatus, the machine translation method, and the program therefor of the present invention, in translating an input sentence divided into subsequences, the subsequence translations can be combined in an appropriate order without using manually created rules, and the quality of the translated sentence can be improved even between languages whose word order differs greatly. A long input sentence can therefore be divided and translated efficiently, and a highly accurate translation with appropriate word order can be generated even when the word-order difference between the source language and the target language is large.

FIG. 1 is a block diagram showing a configuration example of the machine translation apparatus 100.
FIG. 2 is a diagram showing an example of the processing flow of the machine translation apparatus 100.
FIG. 3 shows an example of a parsing result.
FIG. 4 is a diagram showing how block-divided parallel sentences are stored in the block-divided parallel sentence database 110.
FIG. 5 shows another example of a parsing result.
FIG. 6 is a block diagram showing a configuration example of the machine translation apparatus 200.
FIG. 7 is a diagram showing an example of the processing flow of the machine translation apparatus 200.
FIG. 8 is a diagram showing how parallel sentences are stored in the parallel sentence database 240.
FIG. 9 is a diagram showing the correspondence between each target-language word and the source-language blocks.
FIG. 10 is a diagram showing how the word relatedness model is stored in the word relatedness model database 220.
FIG. 11 is a diagram showing how the word connectivity model is stored in the word connectivity model database 230.

Embodiments of the present invention are described in detail below.

FIG. 1 is a block diagram showing a configuration example of the machine translation apparatus 100 of the present invention, and FIG. 2 shows an example of its processing flow. The machine translation apparatus 100 includes a block-divided parallel sentence database 110, a translation training unit 120, a block translation model database 130, a block division unit 140, a translation unit 150, and a combining unit 160.

The block-divided parallel sentence database 110 stores block-divided parallel sentences. Each consists of a source-language training sentence, including non-terminal symbols, that has been divided into a plurality of blocks by a known general parsing technique, together with an ideal translation in the target language, including the non-terminal symbols, for each of the divided blocks. A block in the present invention is a unit composed of one or more consecutive words and lower-level blocks. With this definition, the whole source-language sentence is regarded as the top-level block, and the sentence can be expressed recursively and hierarchically as sequences of words and blocks at the lower levels. The block unit can be set to any subtree of the parsing result (structure tree), but in the present invention the clause is the preferred unit, so the description below takes the clause as the block unit.

FIG. 3 shows an example of the result of parsing the English sentence "John bought a toy that was popular in Japan". The well-known English parser Enju was used for parsing. S is the symbol for a clause other than a relative clause, and S-REL is the symbol for a relative clause. In the example of FIG. 3, the whole sentence can first be regarded as a clause, and a relative clause is embedded in it. Taking the clause as the block unit and cutting out, in order from the top of the structure in FIG. 3, the subsequences of the English sentence represented by S or S-REL, the sentence is divided into the two blocks
B0: John bought a toy [B1]
B1: that was popular in Japan
Here, [B1] in block B0 is a non-terminal symbol representing the position of block B1 within block B0. A block-divided parallel sentence is a pair made up of such a block and its ideal translation in the target language, including the non-terminal symbols. When Japanese is the target language, the ideal translations including non-terminal symbols are, for example,
B0: ジョンは [B1] おもちゃを買った。 (John bought a toy [B1].)
B1: 日本で人気があった (that was popular in Japan)
and each of these, paired with the corresponding English block above, forms a block-divided parallel sentence. FIG. 4 shows how block-divided parallel sentences are stored in the block-divided parallel sentence database 110. FIG. 4 is an English-Japanese example in which (a) is the English side and (b) is the Japanese side. In FIG. 4, __s0 and __s1 are non-terminal symbols; when several blocks are inserted into one block, the non-terminal symbols are distinguished from one another in this way, for example.
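The clause-based division described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the parse tree is a hand-written nested tuple, not actual Enju output, and the names `CLAUSE_LABELS`, `TREE`, and `split_into_blocks` are hypothetical. Non-terminals are numbered `__s0`, `__s1`, ... within each block, following the notation of FIG. 4.

```python
# Labels treated as block (clause) boundaries, following the S / S-REL description.
CLAUSE_LABELS = {"S", "S-REL"}

# Hypothetical, simplified parse tree: (label, child, child, ...); leaves are words.
TREE = ("S",
        ("NP", "John"),
        ("VP", "bought",
         ("NP", ("NP", "a", "toy"),
          ("S-REL", "that",
           ("VP", "was", "popular", ("PP", "in", "Japan"))))))


def split_into_blocks(tree):
    """Return the blocks as strings; blocks[0] is the top-level block."""
    blocks = []

    def visit(node, tokens, counter):
        if isinstance(node, str):          # a word: keep it in the current block
            tokens.append(node)
            return
        label, *kids = node
        if label in CLAUSE_LABELS:         # embedded clause: leave a non-terminal
            tokens.append(f"__s{counter[0]}")
            counter[0] += 1
            new_block(node)
        else:                              # ordinary phrase: descend
            for kid in kids:
                visit(kid, tokens, counter)

    def new_block(clause):
        slot = len(blocks)
        blocks.append(None)                # reserve the slot so parents come first
        tokens, counter = [], [0]
        for kid in clause[1:]:
            visit(kid, tokens, counter)
        blocks[slot] = " ".join(tokens)

    new_block(tree)
    return blocks


blocks = split_into_blocks(TREE)
```

With this toy tree the top-level block keeps a single non-terminal where the relative clause was, and the relative clause becomes a block of its own.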

Embodiment 1 is a configuration that assumes such block-divided parallel sentences have been prepared in advance. The configuration for the case where they have not been prepared is described in Embodiment 2.

The translation training unit 120 learns a block translation model using the block-divided parallel sentences, including non-terminal symbols, read from the block-divided parallel sentence database 110, and writes the model to the block translation model database 130 (S1). Known techniques can be used to learn the block translation model, for example the translation model training program used together with the machine translation program Moses, the word translation probability estimation program GIZA++, and the word N-gram probability estimation program SRILM. To optimize the weight of each model, a known technique such as Minimum Error Rate Training (MERT) may be used. To improve accuracy further, ordinary sentence-level parallel sentences that are not block-divided may also be used in training. Existing methods can be used for training in this case as well (methods based on statistical models, such as Patent Documents 1 and 2; methods based on dictionaries and rules, such as Reference 1; and example-based methods, such as Reference 2).

[Reference 1] Japanese Patent No. 3358096
[Reference 2] Japanese Patent No. 4239505
Learning the block translation model from block-divided parallel sentences that include non-terminal symbols teaches the model not only the translation relationship between the source language and the target language but also the relationship between the position of a non-terminal symbol in the source language and its position in the target language. Using this block translation model in the translation unit 150 therefore allows a non-terminal symbol contained in a source-language input sentence divided by the block division unit 140 to be placed at an appropriate target-language position in the translation of the block.

The block division unit 140 divides the source-language input sentence to be translated into the target language into a plurality of blocks containing non-terminal symbols, using a known general parsing technique (S2). Block division operates on a word sequence. If the whole input sentence is already a word sequence (for example, a sentence whose words are separated by spaces, as in English), it can be input as is. If the input sentence is a character string (for example, Japanese) or contains parts that are not segmented into words, a source-language word segmentation unit 145 that splits the character string into a word sequence must be provided before the block division unit 140, as shown in FIG. 1. The source-language word segmentation unit 145 can split the character string into words using a known general morphological analysis technique. An example of a Japanese morphological analysis program is MeCab.

The translation unit 150 uses the block translation model read from the block translation model database 130 to translate each block of the source-language input sentence divided by the block division unit 140 into a target-language translation that includes the non-terminal symbols (S3). The translation can be performed with known machine translation techniques (Patent Documents 1 to 3, References 1 and 2, and so on).

The combining unit 160 generates the translation of the source-language input sentence by combining the block translations produced by the translation unit 150 according to the block insertion positions indicated by the non-terminal symbols (S4).

An example of translating the following English input sentence into Japanese with the machine translation apparatus 100 configured as above is described next. The block translation model is assumed to have been learned in advance.

Input sentence: we examined whether idiopathic pancreatitis is associated with CFTR mutations in persons who do not have lung disease of cystic fibrosis .
First, the block division unit 140 parses the sentence and obtains a syntax tree such as the one shown in FIG. 5. FIG. 5 is an example of parsing by the well-known English parser Enju. Based on this result, the block division unit 140 divides the input sentence into blocks containing non-terminal symbols, with the clause as the block unit. In the structure tree of FIG. 5 the clauses are S and S-REL, so splitting off the part below each S or S-REL node as a block gives the following.

1. we examined whether __s0 .
2. idiopathic pancreatitis is associated with CFTR mutations in persons __s0
3. who do not have lung disease of cystic fibrosis
In this division result, __s0 is a non-terminal symbol indicating the position at which the next block in the list is inserted.

Next, the translation unit 150 translates the three blocks output by the block division unit 140 into Japanese containing the non-terminal symbols, using the block translation model stored in the block translation model database 130. With the well-known machine translation program Moses, the translation results are as follows.

1. __s0 かどうかを検討した。 (we examined whether __s0 .)
2. __s0 人では、特発性膵炎がＣＦＴＲ変異と関係がある (in persons __s0, idiopathic pancreatitis is associated with CFTR mutations)
3. 嚢胞性線維症の肺疾患を発症していない (who have not developed lung disease of cystic fibrosis)

The combining unit 160 then combines the three block translations output by the translation unit 150 according to the non-terminal symbols, which yields the following translation.

Translation: 嚢胞性線維症の肺疾患を発症していない人では、特発性膵炎がＣＦＴＲ変異と関係があるかどうかを検討した。 (We examined whether idiopathic pancreatitis is associated with CFTR mutations in persons who have not developed lung disease of cystic fibrosis.)
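The combining step in this example can be sketched in Python as follows. This is a minimal sketch, not the apparatus itself: the `Block` class, the `combine` function, and the convention that `children[i]` fills `__s{i}` are assumptions made for illustration. Tokens are joined without spaces because the target language here is Japanese.

```python
import re
from dataclasses import dataclass, field

NONTERMINAL = re.compile(r"__s(\d+)")


@dataclass
class Block:
    translation: str                              # space-separated tokens, may contain __s0, __s1, ...
    children: list = field(default_factory=list)  # assumption: children[i] fills __s{i}


def combine(block, sep=""):
    """Recursively replace each non-terminal with the combined translation of its child block."""
    out = []
    for token in block.translation.split():
        match = NONTERMINAL.fullmatch(token)
        if match:
            out.append(combine(block.children[int(match.group(1))], sep))
        else:
            out.append(token)
    return sep.join(out)


# The three block translations from the example, nested in list order.
b3 = Block("嚢胞性線維症の肺疾患を発症していない")
b2 = Block("__s0 人では、特発性膵炎がＣＦＴＲ変異と関係がある", [b3])
b1 = Block("__s0 かどうかを検討した。", [b2])
sentence = combine(b1)
```

Because the reordering is already encoded in where each block translation places its non-terminal, the combining step itself needs no language-specific rules.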

As this translation shows, the dependency of the relative clause at the end of the input sentence is properly preserved, and the result is a natural Japanese sentence. In contrast, when the same input sentence is translated by a conventional machine translation apparatus that does not perform block division, the dependency of the relative clause at the end of the input sentence may be lost, as in the following example.

Translation by the conventional technique: われわれは、特発性膵炎ＣＦＴＲ変異と関連しているか否かを検討した嚢胞性線維症の肺疾患を有しない人々であった。 (Roughly: "We were people without lung disease of cystic fibrosis who examined whether it is associated with idiopathic pancreatitis CFTR mutations.")

As described above, the machine translation apparatus 100 of the present invention learns a block translation model from block-divided parallel sentences that include non-terminal symbols, and uses it to translate each block of an input sentence that has been divided into blocks containing non-terminal symbols. Because the non-terminal symbols in each block translation are placed at appropriate positions in the target language, simply combining the blocks according to the non-terminal symbols yields a translation with appropriate word order, without any rules having to be created in advance. Long modifying clauses can also be reordered appropriately, which improves translation quality.

Embodiment 1 assumes that block-divided parallel sentences have been prepared in advance; if they have not, they must be created. The machine translation apparatus 200 of Embodiment 2 adds to the configuration of Embodiment 1 a configuration for creating block-divided parallel sentences.

FIG. 6 is a block diagram showing a configuration example of the machine translation apparatus 200, and FIG. 7 shows an example of its processing flow. In addition to the components of the machine translation apparatus 100, the machine translation apparatus 200 includes a block-divided parallel sentence creation unit 210, a word relatedness model database 220, and a word connectivity model database 230.

Block-divided parallel sentences are created from training parallel sentences in which the source-language and target-language translations correspond sentence by sentence. These parallel sentences (pairs of a source-language training sentence and the corresponding ideal translation in the target language) are normally stored in advance in the parallel sentence database 240. FIG. 8 shows how parallel sentences are stored in the parallel sentence database 240. FIG. 8 is an English-Japanese example in which (a) is the English side and (b) is the Japanese side.

The block-divided parallel sentence creation unit 210 takes as input a source-language training sentence divided into blocks and the word sequence of the ideal target-language translation of that training sentence. If the source-language training sentence is a character string or contains parts that are not segmented into words, the source-language word segmentation unit 145 first splits it into a word sequence, the block division unit 140 divides the resulting word sequence into blocks (S11), and the result is input to the block-divided parallel sentence creation unit 210. Likewise, if the ideal target-language translation is a character string or contains parts that are not segmented into words, the target-language word segmentation unit 245 splits it into a word sequence before it is input to the block-divided parallel sentence creation unit 210. As with the source-language word segmentation unit 145, the target-language word segmentation unit 245 can split the character string into words using a known general morphological analysis technique.

The block-divided parallel sentence creation unit 210 associates each input target-language word with a source-language block. This groups the target-language word sequence into blocks and places each non-terminal symbol of a source-language block at an appropriate position in the corresponding target-language block (S12). Let the source-language word sequence be $F = f_1, f_2, \ldots, f_M$, let the number of blocks be $K$, let $\mathrm{Block}(k)$ ($1 \le k \le K$) denote the source-language words contained in the $k$-th block, and let the target-language word sequence of the translation be $E = e_1, e_2, \ldots, e_N$. Associating each target-language word with a source-language block is then equivalent to determining which block each of $e_1, e_2, \ldots, e_N$ corresponds to.

In the present invention, this problem is defined as a graph partitioning problem of the kind shown in FIG. 9 and is solved by a method similar to that of Reference 3.

[Reference 3] X. Zhu, Z. Ghahramani, and J. Lafferty, "Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions", Proceedings of the 20th International Conference on Machine Learning, 2003, pp. 912-919
Specifically, in FIG. 9, each source-language block $\mathrm{Block}(k)$ ($1 \le k \le K$) and each target-language word $e_n$ ($1 \le n \le N$) forms a node. Edges connect the block nodes to the target-language word nodes (thin solid lines), and edges connect the nodes of adjacent target-language words (thick solid lines). Dividing the graph of FIG. 9 into $K$ subgraphs, each containing exactly one block node, determines which block each word $e_n$ corresponds to. This is solved as the optimization problem for $L$ in the following equation.

Here, $w_{ij}$ is the weight of the edge connecting node $i$ and node $j$, and $v_i$ and $v_j$ are the block IDs $k$ ($1 \le k \le K$) of node $i$ and node $j$, respectively. When node $i$ is a block node, $v_i$ is the ID of that block; the unknowns are the IDs of the blocks to which the $N$ target-language word nodes belong (that is, of the elements of the $(K+N)$-dimensional vector $v$, $K$ are known and $N$ are unknown). The weight $w_{ij}$ is designed so that, for an edge connecting a block node and a word node, it is the degree of relatedness between the source-language words in the block and the target-language word at the word node, and, for an edge connecting two word nodes, it is the degree of connection between the target-language words.
In the machine translation apparatus 200, the word relatedness is stored as a word relatedness model in the word relatedness model database 220, and the word connectivity is stored as a word connectivity model in the word connectivity model database 230. The word relatedness model and the word connectivity model are learned separately from the parallel sentence database 240 or from other parallel sentence databases and stored in advance. As the word relatedness model, a model based on a known technique can be used, such as the word translation probability model used in statistical machine translation systems. The word translation probability model can be configured, for example, as a list of word translation probabilities like the one shown in FIG. 10, obtained from a parallel sentence database with the known word translation probability estimation program GIZA++. Each row in FIG. 10 gives the English word, the Japanese word, and the conditional translation probability from the Japanese word to the English word. As the word connectivity model, a model based on a known technique can be used, such as the word bigram model used in statistical machine translation systems. The word bigram model can be configured, for example, as a list of word bigram probabilities like the one shown in FIG. 11, obtained from the Japanese side of a parallel sentence database with the known word N-gram probability estimation program SRILM. Each row in FIG. 11 gives the logarithm of the conditional probability that the second word appears after the first word, the first word, the second word, and the back-off probability. By optimizing equation (1), each target-language word comes to belong to the block it is more strongly related to, and groups of target-language words with high connectivity come to belong to the same block, so the resulting graph partition can be expected to suit the problem of dividing the target-language side into blocks and mapping them to the source-language side.
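One way the edge weights described above might be assembled from the two models is sketched below in Python. Several points are assumptions for illustration: the function name `build_weights`, the dictionary layouts, summing the translation probabilities over the source words of a block as the block-word relatedness, and converting a base-10 log bigram probability back to a probability and using it symmetrically for the edge between adjacent words. The text does not specify these details.

```python
def build_weights(blocks, target_words, lex_prob, bigram_log10):
    """Build block-word and word-word edge weights.

    blocks:        list of blocks, each a list of source-language words
    target_words:  target-language word sequence e_1 .. e_N
    lex_prob:      {(source_word, target_word): translation probability}
    bigram_log10:  {(first_word, second_word): log10 conditional probability}
    """
    n_words = len(target_words)

    # Block-word weights: relatedness of the block's source words to each target word
    # (assumption: sum of translation probabilities over the words in the block).
    w_bw = [[sum(lex_prob.get((f, e), 0.0) for f in block) for e in target_words]
            for block in blocks]

    # Word-word weights: only adjacent target words are connected (as in FIG. 9).
    w_ww = [[0.0] * n_words for _ in range(n_words)]
    for n in range(n_words - 1):
        pair = (target_words[n], target_words[n + 1])
        if pair in bigram_log10:
            weight = 10.0 ** bigram_log10[pair]
            w_ww[n][n + 1] = w_ww[n + 1][n] = weight   # keep the matrix symmetric
    return w_bw, w_ww


# Toy data, invented for this example only.
toy_blocks = [["John", "bought", "a", "toy"], ["that", "was", "popular", "in", "Japan"]]
toy_target = ["ジョン", "は", "日本"]
toy_lex = {("John", "ジョン"): 0.9, ("Japan", "日本"): 0.8}
toy_bigram = {("ジョン", "は"): -0.5}
w_bw, w_ww = build_weights(toy_blocks, toy_target, toy_lex, toy_bigram)
```

Other aggregations (for example, the maximum instead of the sum) would fit the description equally well; the choice mainly changes how strongly long blocks attract words.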

Based on the above definitions, the block-divided parallel translation creation unit 230 obtains the vector v that optimizes Equation (1). Assuming that node numbers are assigned to the block nodes first and then to the word nodes, v can be expressed, as in Equation (2), as the concatenation of a K-dimensional vector v_b for the block nodes and an N-dimensional vector v_w for the word nodes.

Similarly, the symmetric weight matrix W composed of the edge weights w_ij can be partitioned into parts relating to the block nodes and the word nodes (Equation (3)).

In Equation (3), W_bb is the matrix of weights of edges connecting block nodes to each other (all of its values are 0 in the present invention, because no edges exist between block nodes), W_bw and W_wb are the matrices of weights of edges connecting block nodes and word nodes, and W_ww is the matrix of weights of edges connecting word nodes to each other.
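The partition described for Equations (2) and (3) can be written out as below. This is a reconstruction from the definitions in the surrounding text (block nodes numbered first, then word nodes, with W symmetric), and the exact notation of the original equations is an assumption.

```latex
v = \begin{pmatrix} v_b \\ v_w \end{pmatrix}, \qquad
W = \begin{pmatrix} W_{bb} & W_{bw} \\ W_{wb} & W_{ww} \end{pmatrix}, \qquad
W_{bb} = 0, \quad W_{wb} = W_{bw}^{\top}
```

Here v_b has K elements, v_w has N elements, W_bw is K × N, and W_ww is N × N.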

Under Equations (2) and (3), the unknown part v_w of the vector v that optimizes Equation (1) can be obtained, following Reference 3, by the matrix operation expressed by the following equation.

In Equation (4), D_ww is the part of the (K+N)×(K+N) diagonal matrix D that relates to the edges connecting word nodes to each other (analogous to the relationship between W and W_ww in Equation (3)), and D is the matrix whose elements d_i are given by the following expression.
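As an illustration only, the following sketch assigns word nodes to blocks by iterative label propagation. It assumes that Equation (4) has the standard harmonic-function form v_w = (D_ww − W_ww)^(-1) W_wb v_b with d_i = Σ_j w_ij, which the iteration below reaches as its fixed point; that form is an assumption about the equation rather than a quotation of it. The sketch also represents each block ID as a one-hot score vector and takes the highest-scoring block for each word, which is a design choice of this example. The function name `propagate_block_ids` and the toy weights are hypothetical.

```python
from typing import List

def propagate_block_ids(w: List[List[float]], k: int,
                        iterations: int = 200) -> List[int]:
    """Assign a block ID (1..K) to each of the N word nodes.

    w is the symmetric (K+N) x (K+N) weight matrix with block nodes first.
    """
    size = len(w)
    # Known part: block node i carries a one-hot score for block i.
    scores = [[1.0 if c == i else 0.0 for c in range(k)] for i in range(k)]
    # Unknown part: word nodes start with all-zero scores.
    scores += [[0.0] * k for _ in range(size - k)]
    degree = [sum(row) for row in w]  # assumed d_i = sum_j w_ij
    for _ in range(iterations):
        updated = [row[:] for row in scores]
        for i in range(k, size):      # only word nodes are updated
            if degree[i] == 0.0:
                continue              # isolated word: nothing to propagate
            for c in range(k):
                total = sum(w[i][j] * scores[j][c] for j in range(size))
                updated[i][c] = total / degree[i]
        scores = updated
    # Pick the block with the highest score for each word node.
    return [max(range(k), key=lambda c: scores[i][c]) + 1
            for i in range(k, size)]

# Toy graph: 2 block nodes (0, 1) and 3 word nodes (2, 3, 4).
edges = {(0, 2): 0.8, (1, 4): 0.9, (2, 3): 0.5, (3, 4): 0.1}
toy_w = [[0.0] * 5 for _ in range(5)]
for (i, j), weight in edges.items():
    toy_w[i][j] = toy_w[j][i] = weight

labels = propagate_block_ids(toy_w, k=2)
```

In the toy graph, the middle word has no edge to any block node and receives its block only through its connections to the neighbouring words, which is the effect the word connection weights are meant to produce.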

Based on the processing described above, the following example shows how the block-divided parallel translation creation unit 230 generates a block-divided parallel translation by assigning the Japanese words of the translation to the blocks of an English learning sentence.

Assume that the parallel sentence for learning read from the bilingual sentence database 240 consists of the following word strings.

English sentence: Although epidural corticosteroid injection are commonly used for sciatica , their efficacy has not been established .
Japanese sentence: コルチコステロイドの硬膜外注射は、坐骨神経痛に対して一般的に用いられているが、その有効性は確立されていない。 (Epidural injections of corticosteroids are commonly used for sciatica, but their efficacy has not been established.)

First, the English word string, which is in the source language, is divided by the block dividing unit 140 into blocks containing non-terminal symbols, as follows.

1. Although __s0 , __s1 .
2. epidural corticosteroid injection are commonly used for sciatica
3. their efficacy has not been established
Then, the block-divided parallel translation creation unit 230 assigns each word of the Japanese translation read from the bilingual sentence database 240 to the English sentence divided into blocks as above, and the block-divided Japanese sentence corresponding to each English block is obtained as follows.

1. __s0 られているが、__s1 。 (Although __s0, __s1.)
2. コルチコステロイドの硬膜外注射は、坐骨神経痛に対して一般的に用い (epidural injection of corticosteroids is commonly used for sciatica)
3. その有効性は確立されていない (their efficacy has not been established)
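How the non-terminal symbols tie the blocks together can be illustrated by substituting each lower block back into the block that refers to it. The following is a minimal sketch, not the apparatus's combining unit itself: the function name `combine` is hypothetical, and it assumes that __s0 and __s1 refer to blocks 2 and 3, respectively.

```python
import re
from typing import Dict

def combine(root: str, blocks: Dict[str, str]) -> str:
    """Recursively replace each non-terminal (e.g. __s0) with its block."""
    return re.sub(r"__s\d+",
                  lambda m: combine(blocks[m.group(0)], blocks),
                  root)

# English side of the example (assumed mapping: __s0 -> block 2, __s1 -> block 3).
blocks_en = {
    "__s0": "epidural corticosteroid injection are commonly used for sciatica",
    "__s1": "their efficacy has not been established",
}
combined_en = combine("Although __s0 , __s1 .", blocks_en)

# Japanese side of the example, using the same mapping.
blocks_ja = {
    "__s0": "コルチコステロイドの硬膜外注射は、坐骨神経痛に対して一般的に用い",
    "__s1": "その有効性は確立されていない",
}
# Spaces only separate the symbols from the Japanese text, so they are removed.
combined_ja = combine("__s0 られているが、__s1 。", blocks_ja).replace(" ", "")
```

Both sides recover the original sentence pair, which shows that the Japanese blocks carry the same insertion positions as the English blocks.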

As described above, according to the machine translation apparatus 200, even when block-divided parallel translations have not been prepared in advance, they can be generated from parallel sentences aligned at the sentence level, such as those used in existing techniques as translation examples or for training statistical models. The processing of the machine translation apparatus 100 described in the first embodiment can then be executed using the generated block-divided parallel translations.

Note that the division of functions among the components of the machine translation apparatuses 100 and 200 of the present invention is not limited to that shown in the above embodiments, and can be changed as appropriate without departing from the scope of the present invention. In addition, the steps of the machine translation method of the present invention need not be executed only in the time series described above; they may be executed in parallel or individually, depending on the processing capability of each component that executes the processing or as necessary.

Claims

1. A machine translation apparatus comprising:
a block-divided parallel translation database storing block-divided parallel translations, each consisting of a learning sentence in a source language divided, based on syntax analysis, into a plurality of blocks each composed of one or more words and a non-terminal symbol representing the insertion position of a lower block, and, for each of the divided blocks, an ideal translated sentence in a target language including the non-terminal symbol;
a translation training unit that learns a block translation model using the block-divided parallel translations read from the block-divided parallel translation database and writes the block translation model to a block translation model database;
the block translation model database, in which the block translation model is stored;
a block dividing unit that parses an input sentence in the source language to be translated into the target language, separates the resulting tree structure at any selected node, and adds to the tree structure a non-terminal symbol representing the selected node, thereby dividing the input sentence into a plurality of the blocks;
a translation unit that, using the block translation model read from the block translation model database, translates each block of the input sentence divided by the block dividing unit into a translated sentence in the target language including the non-terminal symbol; and
a combining unit that generates a translated sentence for the input sentence by combining the translated sentences of the blocks based on the block insertion positions represented by the non-terminal symbols.

2. The machine translation apparatus according to claim 1, wherein the block dividing unit further divides a learning sentence in the source language into a plurality of the blocks based on syntax analysis, the machine translation apparatus further comprising:
a word relevance model database storing a word relevance model indicating the degree of relevance between words included in the learning sentence in the source language and words included in the ideal translated sentence in the target language of that learning sentence;
a word connection model database storing a word connection model indicating the degree of connection between words included in the ideal translated sentence in the target language of the learning sentence in the source language; and
a block-divided parallel translation creation unit that receives the learning sentence in the source language divided into the blocks and the ideal translated sentence in the target language of the learning sentence, identifies, using the word relevance model read from the word relevance model database and the word connection model read from the word connection model database, which block of the learning sentence in the source language each word included in the ideal translated sentence in the target language corresponds to, thereby generates, for each of the divided blocks of the learning sentence in the source language, an ideal translated sentence in the target language including the non-terminal symbol, and writes the resulting block-divided parallel translation to the block-divided parallel translation database.

3. The machine translation apparatus according to claim 1, wherein the block is a node unit.

4. A machine translation method comprising:
a translation training step in which a translation training unit reads, from a block-divided parallel translation database, block-divided parallel translations each consisting of a learning sentence in a source language divided, based on syntax analysis, into a plurality of blocks each composed of one or more words and a non-terminal symbol representing the insertion position of a lower block, and, for each of the divided blocks, an ideal translated sentence in a target language including the non-terminal symbol, learns a block translation model using the block-divided parallel translations, and writes the block translation model to a block translation model database;
a block dividing step in which a block dividing unit parses an input sentence in the source language to be translated into the target language, separates the resulting tree structure at any selected node, and adds to the tree structure a non-terminal symbol representing the selected node, thereby dividing the input sentence into a plurality of the blocks;
a translation step in which a translation unit, using the block translation model read from the block translation model database, translates each block of the input sentence divided in the block dividing step into a translated sentence in the target language including the non-terminal symbol; and
a combining step of generating a translated sentence for the input sentence by combining the translated sentences of the blocks based on the block insertion positions represented by the non-terminal symbols.

5. The machine translation method according to claim 4, further comprising:
a learning sentence block dividing step of dividing a learning sentence in the source language into a plurality of the blocks based on syntax analysis; and
a block-divided parallel translation creation step of identifying, from the learning sentence in the source language divided into the blocks and the ideal translated sentence in the target language of the learning sentence, which block of the learning sentence in the source language each word included in the ideal translated sentence in the target language corresponds to, using a word relevance model, read from a word relevance model database, indicating the degree of relevance between words included in the learning sentence in the source language and words included in the ideal translated sentence in the target language, and a word connection model, read from a word connection model database, indicating the degree of connection between words included in the ideal translated sentence in the target language of the learning sentence in the source language, thereby generating, for each of the divided blocks, an ideal translated sentence in the target language including the non-terminal symbol, and storing the resulting block-divided parallel translation in the block-divided parallel translation database.

6. The machine translation method according to claim 4, wherein the block is a node unit.

7. A program for causing a computer to function as the machine translation apparatus according to claim 1.