JP2006099208A - Statistical machine translation apparatus and statistical machine translation program

Info

Publication number: JP2006099208A
Application number: JP2004281636A
Authority: JP
Inventors: Taro Watanabe; 太郎渡辺
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2004-09-28
Filing date: 2004-09-28
Publication date: 2006-04-13
Anticipated expiration: 2024-09-28
Also published as: JP4084789B2

Abstract

PROBLEM TO BE SOLVED: To provide a statistical machine translation apparatus using bilingual phrases that can translate with higher accuracy.
SOLUTION: A decoder 42 for Japanese-to-English machine translation includes a Japanese phrase N-gram model 56, an English phrase N-gram model 58, an English language model 60, and an English-to-Japanese phrase translation model 68; a segmentation processing unit 80 that performs all possible segmentations of a Japanese input sentence 40; a lattice creation unit 84 that, in accordance with the obtained segmentations and using the models 68, 56, 58, and 60, creates a lattice representing phrase sequences in which English phrases are arranged in arbitrary order with probabilities attached; and an A* search processing unit 88 that searches the lattice created by the lattice creation unit 84 for the top M paths with the highest probability and outputs them.
[Selected drawing] FIG. 3

Description

The present invention relates to a machine translation apparatus, and more particularly to a machine translation apparatus for statistical machine translation that translates efficiently while exploiting the phrase translations that appear in a parallel corpus. In this specification, a "phrase" means a unit (word string) consisting of one word or of a plurality of consecutive words.

From the standpoint of statistical machine translation, as proposed in Non-Patent Document 1, the problem of translating a source-language sentence f into a target-language sentence e is formulated as the following maximization problem.
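The displayed equation is not reproduced in this text. Assuming the standard formulation of Non-Patent Document 1, in which the translation is the target sentence with the highest probability given the source sentence, Equation (1) would read roughly as follows (the hat notation for the chosen translation is an assumption):

```latex
\hat{e} = \operatorname*{argmax}_{e} P(e \mid f) \tag{1}
```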

Applying the noisy channel model to this problem yields the following equation.
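The equation image is missing here. Assuming the usual Bayes decomposition behind the noisy channel model, in which P(f) is constant for a given input and can be dropped from the maximization, Equation (2) would take the following form. This agrees with the later remark that P(f | e) is its first factor; P(e) is the target language model.

```latex
\hat{e} = \operatorname*{argmax}_{e} P(f \mid e)\, P(e) \tag{2}
```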

Among the works that have tried phrase-based approaches in statistical machine translation, some approximate the first factor P(f | e) of Equation (2) by a product of translations of a sequence of phrases under additional constraints (Non-Patent Documents 2, 3, and 4).

Here, /f_i (in this specification, a "/" immediately before a symbol denotes an overline that should be written directly above that symbol) is the i-th phrase of the sentence /f obtained by applying phrase segmentation to the sentence f, and a_i is the i-th phrase alignment of the phrase-segmented text.

Non-Patent Document 1: P. F. Brown et al., "A statistical approach to machine translation," Computational Linguistics, vol. 16, no. 2, pp. 79-85, 1990.
Non-Patent Document 2: P. Koehn et al., "Statistical phrase-based translation," in Proc. of HLT-NAACL 2003, pp. 48-54, 2003.
Non-Patent Document 3: S. Vogel et al., "The CMU statistical translation system," in Proceedings of MT Summit IX, pp. 402-409, 2003.
Non-Patent Document 4: C. Tillmann, "A projection extension algorithm for statistical machine translation," in Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pp. 1-8, 2003.
Non-Patent Document 5: H. Daume III et al., "A phrase-based HMM approach to document/abstract alignment," in Proceedings of EMNLP 2004, pp. 119-126, 2004.
Non-Patent Document 6: A. P. Dempster et al., "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B, vol. 39, pp. 1-38, 1977.
Non-Patent Document 7: F. J. Och et al., "A systematic comparison of various statistical alignment models," Computational Linguistics, vol. 29, no. 1, pp. 19-51, 2003.
Non-Patent Document 8: F. J. Och et al., "Discriminative training and maximum entropy models for statistical machine translation," in Proc. of ACL 2002, pp. 295-302, 2002.
Non-Patent Document 9: F. J. Och, "Minimum error rate training in statistical machine translation," in Proc. of ACL 2003, pp. 160-167, 2003.
Non-Patent Document 10: W. H. Press et al., "Numerical Recipes in C++," Cambridge University Press, 2002.
Non-Patent Document 11: N. Ueffing et al., "Generation of word graphs in statistical machine translation," in Proc. of Conference on Empirical Methods for Natural Language Processing (EMNLP02), pp. 156-163, 2002.

However, in the conventional approaches the method of obtaining phrase translation pairs is simplistic: even when P(/f_i | /e_{a_i}) is estimated according to Equation (3), conditions such as the interrelationship between phrases are not taken into account, so translation accuracy is low.

It is therefore an object of the present invention to provide a statistical machine translation apparatus using bilingual phrases that can translate with higher accuracy.

Another object of the present invention is to provide a statistical machine translation apparatus using bilingual phrases that can translate with higher accuracy by estimating phrase segmentations and phrase translations jointly.

A machine translation apparatus according to a first aspect of the present invention is a translation apparatus that translates an input sentence in a source language into a target language, and includes: storage means for storing a source-language phrase N-gram model, a target-language phrase N-gram model, a predetermined language model, and a phrase-based translation model from the target language to the source language; segmentation means for segmenting the input sentence in the source language; lattice creation means for creating, in accordance with the segmentation obtained by the segmentation means and using the source-language phrase N-gram model, the target-language phrase N-gram model, the language model, and the phrase-based translation model stored in the storage means, a lattice of phrases in which target-language phrases are arranged in arbitrary order, with the probabilities obtained from those models attached to each node and edge; and search means for searching the lattice created by the lattice creation means for the top M phrase sequences giving the highest probabilities.

By using the novel concept of a phrase-based translation model, the apparatus can, given an input sentence in the source language, output the top M translations that are statistically most probable as translations of that input sentence. Because probabilities at the phrase level are used in addition to those at the word level, the correspondence between phrases in the input sentence and in the translation is used directly, enabling highly accurate translation.

Preferably, the search means includes means for searching, by an A* search algorithm, the lattice created by the lattice creation means for the top M phrase sequences giving the highest probabilities.

More preferably, the lattice creation means includes means for creating the lattice using only nodes that satisfy a predetermined probability condition.

The phrase-based translation model may be formed by combining a target-language phrase N-gram model, a source-language phrase segmentation model, and a phrase translation model from the target language to the source language.

A statistical machine translation program according to a second aspect of the present invention is a program that, when executed by a computer, causes the computer to operate as any of the machine translation apparatuses described above.

[Basic concept]
In the conventional techniques, the method of obtaining translation pairs and the estimation of P(/f_i | /e_{a_i}) in Equation (3) are treated as separate problems, which is considered to be why no improvement in accuracy could be expected. The present embodiment performs phrase-based machine translation on the following idea: in place of Equation (3) used in the conventional techniques, new variables /f and /e are introduced and P(f | e) is computed by the following Equation (4), so that the relationships corresponding to phrase translation are reflected directly in the translation.
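Equation (4) is not reproduced in this text. Since the next paragraphs describe P(f, /f, /e | e) as the quantity appearing in it, it presumably marginalizes over the two segmentations. A sketch under that assumption, writing \bar{f} and \bar{e} for /f and /e:

```latex
P(f \mid e) = \sum_{\bar{f},\, \bar{e}} P(f, \bar{f}, \bar{e} \mid e) \tag{4}
```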

Here, /f denotes a segmentation of the input sentence f, and /e denotes a segmentation of the target sentence e.

In Equation (4), P(f, /f, /e | e) is further decomposed into three terms.
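The decomposition itself is missing from this text. Applying the chain rule in the order in which the three terms are described in the specification (reconstruction of f from /f, translation between /e and /f, and generation of /e from e) gives the following sketch, with \bar{f} and \bar{e} standing for /f and /e:

```latex
P(f, \bar{f}, \bar{e} \mid e)
  = P(f \mid \bar{f}, \bar{e}, e)\;
    P(\bar{f} \mid \bar{e}, e)\;
    P(\bar{e} \mid e) \tag{5}
```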

The first term of Equation (5) represents the probability that the segmented input sentence /f is reconstructed into the input sentence f. The second term represents the translation probability between the two phrase sequences /e and /f. The last term represents the likelihood that the phrase-segmented sentence /e is generated from the sentence e. In this specification these terms are called the phrase segmentation model, the phrase translation model, and the phrase N-gram model, respectively. Each is described below.

-Phrase N-gram model-
The phrase N-gram model can be approximated as follows.
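The approximation is not reproduced in this text. Given that the specification treats P(/e_i | /e_{i-1}) as a bigram constraint on adjacent phrases, Equation (6) presumably has the following product form (\bar{e}_i stands for /e_i; the handling of the sentence-initial phrase is left unspecified here):

```latex
P(\bar{e} \mid e) \approx \prod_{i} P(\bar{e}_i \mid \bar{e}_{i-1}) \tag{6}
```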

In Equation (6), P(/e_i | /e_{i-1}) is treated as a bigram constraint on adjacent translated phrases.

The phrase N-gram model can be computed easily with the forward-backward algorithm by expanding all possible phrase segmentations of the sentence e into a lattice structure /E as shown in FIG. 1. As FIG. 1 shows, each node in this lattice represents a particular phrase /E_i (i = 1 to 6) in the sentence e, and the nodes /E_1 to /E_6 are connected to one another by edges. Each edge carries the probability that the phrase corresponding to the node at its front end and the phrase corresponding to the node at its rear end occur consecutively. For example, a node /E_i and its immediately preceding node /E_i' are connected by an edge labeled with the probability P(/E_i | /E_i'). In the example of FIG. 1, the edge connecting node /E_2 and its immediately preceding node (/E_1 in FIG. 1) is labeled with the probability P(/E_2 | /E_1).
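The lattice construction described above can be sketched as follows. This is a minimal illustration, assuming phrases are contiguous word spans of at most `max_len` words; the names `build_lattice`, `count_segmentations`, and `max_len` are hypothetical and do not come from the specification.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open word span [start, end)


def build_lattice(words: List[str], max_len: int) -> Tuple[List[Span], List[Tuple[Span, Span]]]:
    """Return the nodes (phrase spans) and edges (adjacent spans) of a segmentation lattice."""
    n = len(words)
    nodes = [(s, e) for s in range(n) for e in range(s + 1, min(n, s + max_len) + 1)]
    # An edge links a phrase to every phrase that starts exactly where it ends.
    edges = [(a, b) for a in nodes for b in nodes if a[1] == b[0]]
    return nodes, edges


def count_segmentations(n: int, max_len: int) -> int:
    """Count complete paths through the lattice, i.e. distinct segmentations of n words."""
    ways = [1] + [0] * n
    for end in range(1, n + 1):
        ways[end] = sum(ways[start] for start in range(max(0, end - max_len), end))
    return ways[n]
```

For a three-word sentence with phrases of up to three words, this yields six nodes, four edges, and four complete segmentations.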

The value of P(/E_i | /E_i') is estimated by the following procedure.

1) Initialize the probability table with some uniform value.

2) For each sentence e in the training corpus, estimate the posterior probability P(/E_i, /E_i' | e) in the lattice using the forward-backward algorithm.

3) Treating the estimated posterior probabilities as occurrence frequencies, compute the prior probabilities by maximum-likelihood estimation as follows.

Steps 2 and 3 are repeated until a predetermined termination condition is satisfied. The termination condition can be, for example, completion of a predetermined number of iterations.
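Steps 1 to 3 and their repetition can be sketched as below. This is an illustrative sketch, not the implementation of the specification: the sentence-boundary symbols `<s>` and `</s>`, the function name `em_phrase_bigram`, and the `max_len` cap on phrase length are assumptions introduced here, and a fixed iteration count serves as the termination condition.

```python
from collections import defaultdict

BOS, EOS = "<s>", "</s>"  # sentence-boundary pseudo-phrases (an assumption)


def em_phrase_bigram(corpus, max_len=2, iterations=5):
    """EM estimate of P(phrase | previous phrase) over all segmentations."""
    prob = defaultdict(lambda: 1.0)  # step 1: uniform (unnormalised) start
    for _ in range(iterations):      # termination: fixed number of rounds
        counts = defaultdict(float)
        for words in corpus:
            n = len(words)
            # Lattice nodes: every span [s, e) of at most max_len words.
            nodes = [(s, e) for s in range(n)
                     for e in range(s + 1, min(n, s + max_len) + 1)]
            text = {sp: " ".join(words[sp[0]:sp[1]]) for sp in nodes}
            prev = {sp: [p for p in nodes if p[1] == sp[0]] for sp in nodes}
            # Step 2, forward pass: mass of all partial paths ending in sp.
            alpha = {}
            for sp in nodes:  # sorted by start, so predecessors come first
                if sp[0] == 0:
                    alpha[sp] = prob[(BOS, text[sp])]
                else:
                    alpha[sp] = sum(alpha[p] * prob[(text[p], text[sp])]
                                    for p in prev[sp])
            # Step 2, backward pass: mass of all completions following sp.
            beta = {}
            for sp in reversed(nodes):
                if sp[1] == n:
                    beta[sp] = prob[(text[sp], EOS)]
                else:
                    beta[sp] = sum(prob[(text[sp], text[q])] * beta[q]
                                   for q in nodes if q[0] == sp[1])
            z = sum(alpha[sp] * beta[sp] for sp in nodes if sp[1] == n)
            # Edge posteriors, accumulated as fractional counts.
            for sp in nodes:
                if sp[0] == 0:
                    counts[(BOS, text[sp])] += alpha[sp] * beta[sp] / z
                if sp[1] == n:
                    counts[(text[sp], EOS)] += alpha[sp] * beta[sp] / z
                for p in prev[sp]:
                    edge = (text[p], text[sp])
                    counts[edge] += alpha[p] * prob[edge] * beta[sp] / z
        # Step 3: maximum-likelihood re-estimation from the fractional counts.
        totals = defaultdict(float)
        for (left, _right), c in counts.items():
            totals[left] += c
        prob = defaultdict(float,
                           {k: c / totals[k[0]] for k, c in counts.items()})
    return prob
```

After each round the probabilities conditioned on the same preceding phrase sum to one, because step 3 normalizes the fractional counts per left-hand phrase.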

-Phrase segmentation model-
Under the modeling of Equation (4), P(f | /f, /e, e) can be regarded as a distortion probability indicating how the phrase-segmented sentence /f must be reordered to yield the source sentence f. In the present embodiment, this model is instead regarded as representing the likelihood that a particular phrase segment /f_j occurs in the sentence f. That is,

This segmentation can be realized as the unigram posterior probability of the phrase N-gram model described in the preceding section. The unigram posterior probability can be computed efficiently with the forward-backward algorithm using the lattice structure /F for the sentence f.

This phrase segmentation model can be thought of as assigning a weight to a phrase given the source sentence. If the phrase length is limited to 1, that is, if every phrase contains exactly one word, there is only one possible phrase segmentation, so the phrase segmentation model assigns a weight of 1 to every phrase.
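The unigram posterior described above can be sketched with a forward-backward pass over word positions. This is a simplified illustration that assumes a plain phrase unigram table (the hypothetical `unigram` dictionary) and at least one segmentation with non-zero probability; the function name `span_posteriors` and the `max_len` parameter are not from the specification.

```python
def span_posteriors(words, unigram, max_len):
    """Posterior probability that each span is a phrase in a segmentation of `words`.

    Assumes at least one segmentation has non-zero probability under `unigram`.
    """
    n = len(words)

    def p(s, e):
        # Unigram probability of the phrase covering words[s:e] (0 if unknown).
        return unigram.get(" ".join(words[s:e]), 0.0)

    fwd = [1.0] + [0.0] * n  # fwd[i]: total mass of segmentations of words[:i]
    for e in range(1, n + 1):
        fwd[e] = sum(fwd[s] * p(s, e) for s in range(max(0, e - max_len), e))
    bwd = [0.0] * n + [1.0]  # bwd[i]: total mass of segmentations of words[i:]
    for s in range(n - 1, -1, -1):
        bwd[s] = sum(p(s, e) * bwd[e]
                     for e in range(s + 1, min(n, s + max_len) + 1))
    return {(s, e): fwd[s] * p(s, e) * bwd[e] / fwd[n]
            for s in range(n)
            for e in range(s + 1, min(n, s + max_len) + 1)}
```

With `max_len=1` every single-word span receives posterior 1, which matches the remark that limiting the phrase length to one word leaves a single segmentation.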

-Phrase translation model-
The phrase translation model is approximated so that the translation of a phrase sequence is obtained as the product of the translations of the individual phrases.

Here, a_i denotes a phrase alignment of the kind found in word-alignment-based translation models (for example, the IBM (registered trademark) models).

One possible implementation of the phrase translation model stores, in table form, all phrase translation pairs that appear in the training corpus together with their probabilities of occurrence.

-Phrase-based HMM statistical translation-
Integrating the phrase N-gram model, the phrase segmentation model, and the phrase translation model described above, Equation (4) can be rewritten as follows.

When the phrase-segmented sentences /e and /f are expanded into the corresponding lattice structures /E and /F, Equation (12) can be regarded as a hidden Markov model (HMM). In this case, as shown in FIG. 2, each source phrase /F_j (j = 1 to 6) in the lattice /F is treated as an observation emitted from a state /E_i (i = 1 to 6), which is a target phrase in the lattice /E. In the example of FIG. 2, the phrase /F_2 in the lattice /F is treated as an observation from the phrase /E_3 in the lattice /E.

In FIG. 2, the phrase /F_2 occurring in the lattice /F is observed from the phrase /E_3 with probability P(/F_2 | /E_3). If P(/E_3 | /E_1) is the probability that the phrase /E_3 follows the phrase /E_1 in the lattice /E, and P(/F_2 | f) is the probability that the phrase /F_2 occurs in a segmentation of the sentence f, then multiplying these together gives the value of the term of Equation (12) corresponding to i = 3, j = 2. Computing this for all combinations of nodes (phrases) in the lattices /E and /F and summing the results approximates the value of Equation (4).
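Written out, the single term for i = 3, j = 2 is the product of the three probabilities named in this paragraph. The block below is only a reading of the prose, not a reproduction of Equation (12); \bar{E} and \bar{F} stand for /E and /F.

```latex
\underbrace{P(\bar{F}_2 \mid \bar{E}_3)}_{\text{phrase translation}}
\;\times\;
\underbrace{P(\bar{E}_3 \mid \bar{E}_1)}_{\text{phrase bigram}}
\;\times\;
\underbrace{P(\bar{F}_2 \mid f)}_{\text{phrase segmentation}}
```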

A phrase-based HMM structure has already been proposed in Non-Patent Document 5 for the problem of aligning documents with their summaries. In the approach of Non-Patent Document 5, jump probabilities are encoded explicitly as state transitions, and they correspond roughly to the alignment probabilities of word-based statistical translation models. Using jump or alignment probabilities explicitly makes the translation model complete, but the phrase-based HMM structure then requires a vast search space during training.

In the approach of the present embodiment, state transitions use the phrase N-gram model and the phrase bigram connection probabilities, and phrase alignment probabilities are ignored. The phrase-based HMM translation model can therefore be called an incomplete model; thanks to its simplicity, however, its parameters can be estimated quickly.

-Parameter estimation-
As already described, the parameters of the phrase-based HMM translation model can be estimated efficiently with the forward-backward algorithm.

The forward-backward algorithm defines two auxiliary variables, α(e_{i1}^{i2}, f_{j1}^{j2}) and β(e_{i1}^{i2}, f_{j1}^{j2}). α(e_{i1}^{i2}, f_{j1}^{j2}) is the forward estimate of the probability that the phrase e_{i1}^{i2} is translated into f_{j1}^{j2} after all combinations of phrases in e_{1}^{i1-1} have been output. Similarly, β(e_{i1}^{i2}, f_{j1}^{j2}) is the backward estimate of the probability that the phrase e_{i1}^{i2} is translated into the phrase f_{j1}^{j2}, taking into account all combinations of phrases to its right, in e_{i2+1}^{l}.

The forward-backward algorithm is therefore formulated as solving the following recurrences.

To avoid the problem of local convergence common to the EM algorithm (Non-Patent Document 6), the present embodiment trains the phrase-based translation model using, as initial values, the lexicon model obtained from the GIZA++ training described in Non-Patent Document 7. In addition, the phrase N-gram model and the phrase segmentation model are trained separately on monolingual corpora and held fixed during the iterations of HMM training.

-Induction of phrase segments-
Equations (13) and (14) involve summing over all possible paths to the left and to the right in the lattice structure /E, and summing over all possible segmentations for /F. Because this still requires an enormous amount of computation even with dynamic programming, the present embodiment restricts in advance the segmentations to be considered.

To derive phrase pairs, the product of the two phrase translation probabilities shown in the following equation is first used for every bilingual phrase pair that can be found in the bilingual training corpus.

Here, count(/e, /f) is the co-occurrence frequency of the two phrases /e and /f. The basic idea of Equation (15) is to capture the correspondence between bilingual phrases by looking at the correspondence in both directions.
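Equation (15) is not reproduced in this text. The sketch below assumes that each of the two phrase translation probabilities is a relative frequency of the co-occurrence count, one normalized over the target phrase and one over the source phrase; that normalization, the function name `bidirectional_scores`, and the toy counts are assumptions for illustration only.

```python
from collections import Counter


def bidirectional_scores(pair_counts):
    """Score each (e_phrase, f_phrase) pair by the product of two relative frequencies.

    pair_counts maps (e_phrase, f_phrase) to its co-occurrence count.
    """
    e_total, f_total = Counter(), Counter()
    for (e, f), c in pair_counts.items():
        e_total[e] += c
        f_total[f] += c
    # count / sum over f (given e), times count / sum over e (given f).
    return {(e, f): (c / e_total[e]) * (c / f_total[f])
            for (e, f), c in pair_counts.items()}


toy_counts = {("hello", "konnichiwa"): 3, ("hello", "yaa"): 1, ("hi", "yaa"): 2}
scores = bidirectional_scores(toy_counts)
```

A pair scores highly only when each phrase predicts the other, which is the two-directional check the paragraph describes.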

Further phrases are derived exhaustively from the intersection and union of the Viterbi word alignments obtained with the two directional models P(e | f) and P(f | e) computed by GIZA++ (Non-Patent Document 2).
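The intersection and union of two directional alignments can be illustrated with plain set operations. The representation of an alignment as a set of index pairs and the function name `combine_alignments` are assumptions of this sketch; how phrases are then read off the combined links is not shown.

```python
def combine_alignments(e2f_links, f2e_links):
    """Return (intersection, union) of two directional word alignments.

    e2f_links: set of (e_index, f_index) links from one direction.
    f2e_links: set of (f_index, e_index) links from the other direction.
    """
    flipped = {(e, f) for (f, e) in f2e_links}  # bring both into (e, f) order
    return e2f_links & flipped, e2f_links | flipped


inter, union = combine_alignments({(0, 0), (1, 2)}, {(0, 0), (1, 1)})
```

The intersection keeps only links both directions agree on (high precision), while the union keeps every link proposed by either direction (high recall).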

After the phrase translation pairs have been extracted, monolingual phrase lexicons are extracted from them and used as the possible segmentations of the source and target sentences.
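Restricting segmentations to a monolingual phrase lexicon can be sketched as follows. The generator `segmentations`, the `max_len` cap, and the example lexicon are hypothetical; the sketch simply enumerates every way of splitting a sentence into phrases that all occur in the lexicon.

```python
def segmentations(words, lexicon, max_len=3):
    """Yield every segmentation of `words` whose phrases all occur in `lexicon`."""
    if not words:
        yield []
        return
    for k in range(1, min(max_len, len(words)) + 1):
        head = " ".join(words[:k])
        if head in lexicon:
            # Recurse on the remaining words and prepend the accepted phrase.
            for rest in segmentations(words[k:], lexicon, max_len):
                yield [head] + rest


lexicon = {"new", "york", "city", "new york"}
result = list(segmentations("new york city".split(), lexicon))
```

Here `result` contains two segmentations, one with three single-word phrases and one that groups "new york", because "york city" and "new york city" are not in the lexicon.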

-Decoder-
The decision rule for computing the best translation is the log-linear combination of all components of the translation model, as presented in Non-Patent Document 8.

Here, Pr_j(e, f) is a component of the translation model, such as the phrase N-gram model or the language model, and λ_j is the weight given to each model. The weights λ_j may be determined either by the maximum-likelihood criterion using the IIS (Improved Iterative Scaling) or GIS (Generalized Iterative Scaling) algorithm, or by the minimum error rate criterion (Non-Patent Document 9) using an unconstrained optimization algorithm such as the downhill simplex method (Non-Patent Document 10).
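A log-linear decision rule of this kind can be sketched as choosing the candidate that maximizes the weighted sum of log component scores. The equation itself is missing from this text, so the exact form is an assumption; the function name `best_translation` and the toy component values are hypothetical, and all component probabilities are assumed to be strictly positive.

```python
import math


def best_translation(candidates, weights):
    """Pick the candidate maximising sum_j lambda_j * log Pr_j(e, f).

    candidates: {translation: [Pr_1, ..., Pr_J]} with strictly positive values.
    weights:    [lambda_1, ..., lambda_J].
    """
    def score(probs):
        return sum(w * math.log(p) for w, p in zip(weights, probs))

    return max(candidates, key=lambda e: score(candidates[e]))


toy = {"a": [0.1, 0.5], "b": [0.2, 0.2]}
equal = best_translation(toy, [1.0, 1.0])
skewed = best_translation(toy, [2.0, 0.5])
```

With equal weights candidate "a" wins; putting more weight on the first component flips the choice to "b", which shows why the weights λ_j need to be tuned.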

The decoder resembles a word-graph-based decoder (Non-Patent Document 11) and decodes in multiple passes so that the complex structure of the submodels can be incorporated. The first decoding pass uses beam search to generate a word graph, that is, a lattice, of translations for the input sentence. This pass uses all submodels of the phrase-based HMM translation model together with a word trigram language model and a class 5-gram language model of the target language. The next pass searches this word graph for the N-best translation paths using A* search.
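The second pass can be sketched as an A* search over an already built lattice. This sketch assumes the lattice is a DAG whose integer node ids increase along every edge and whose edges carry log-probabilities; the heuristic is the exact best completion score computed by a backward sweep. The function name `n_best_paths` and the toy lattice are hypothetical, and the first (beam search) pass is not shown.

```python
import heapq


def n_best_paths(edges, start, goal, n):
    """Return up to n best (log_prob, phrases) paths from start to goal.

    edges: {node: [(next_node, log_prob, phrase), ...]}, node ids increasing
    along every edge (an assumption of this sketch).
    """
    # Heuristic h[node]: best achievable log-probability from node to goal.
    h = {goal: 0.0}
    for node in sorted(edges, reverse=True):
        outs = [lp + h[nxt] for nxt, lp, _ in edges[node] if nxt in h]
        if outs:
            h[node] = max(outs)
    if start not in h:
        return []  # goal unreachable
    results = []
    heap = [(-h[start], 0.0, start, [])]  # (-(g + h), g, node, phrases so far)
    while heap and len(results) < n:
        _, g, node, phrases = heapq.heappop(heap)
        if node == goal:
            results.append((g, phrases))
            continue
        for nxt, lp, phrase in edges.get(node, []):
            if nxt in h:
                heapq.heappush(
                    heap, (-(g + lp + h[nxt]), g + lp, nxt, phrases + [phrase]))
    return results


toy_lattice = {
    0: [(1, -0.5, "I"), (1, -0.9, "me")],
    1: [(2, -0.4, "go"), (2, -1.2, "went")],
}
top2 = n_best_paths(toy_lattice, 0, 2, 2)
```

Because the heuristic never underestimates the best completion, complete paths leave the queue in order of decreasing total score, so the first n complete paths popped are the n best.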

[Configuration]
FIG. 3 is a block diagram of a statistical machine translation system 20 according to the present embodiment. The statistical machine translation system 20 is a Japanese-to-English translation apparatus that translates a Japanese sentence (f) into an English sentence (e). That is, in the present embodiment the source language f is Japanese and the target language e is English.

Referring to FIG. 3, the statistical machine translation system 20 includes: a Japanese monolingual corpus (hereinafter "Japanese corpus") 30; an English monolingual corpus (hereinafter "English corpus") 34; a Japanese-English parallel corpus 32; a phrase-based HMM translation model creation unit 36 for creating a Japanese phrase N-gram model 56, an English phrase N-gram model 58, an English language model 60, and a phrase translation model 68 from the Japanese corpus 30, the parallel corpus 32, and the English corpus 34; and a decoder (statistical machine translation apparatus) 42 for translating a Japanese input sentence 40 using the Japanese phrase N-gram model 56, the English phrase N-gram model 58, the English language model 60, and the phrase translation model 68 created by the phrase-based HMM translation model creation unit 36, and outputting an English translation 44. The statistical machine translation system 20 shown in FIG. 3 uses phrase bigrams as the phrase N-grams. The English phrase N-gram model 58, the Japanese phrase N-gram model 56, and the phrase translation model 68 together constitute a phrase-based HMM translation model 38.

The phrase-based HMM translation model creation unit 36 includes a Japanese phrase N-gram model creation unit 50 for creating the Japanese phrase N-gram model 56 from the Japanese corpus 30, an English phrase N-gram model creation unit 52 for creating the English phrase N-gram model 58 from the English corpus 34, and an English language model creation unit 54 for creating the English language model 60 from the English corpus 34. Although not shown, the English language model 60 includes, as described above, an English word trigram language model and an English class 5-gram language model. A class N-gram language model is a word N-gram language model based on parts of speech: it is built by treating words with the same part of speech as the same word.

In creating the Japanese phrase N-gram model 56, the Japanese phrase N-gram model creation unit 50 follows two constraints: it adopts only a predetermined number of the top-ranked N-grams appearing in the Japanese corpus 30, and it adopts only Japanese N-grams that appear as part of a phrase pair in the parallel corpus 32. Likewise, in creating the English phrase N-gram model 58, the English phrase N-gram model creation unit 52 adopts only a predetermined number of the top-ranked N-grams appearing in the English corpus 34, and only English N-grams that appear as part of a phrase pair in the parallel corpus 32.
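The two constraints can be sketched as a simple filter. This is only an illustration under stated assumptions: the ranking is assumed to be by corpus frequency, both constraints are assumed to apply independently, and `select_phrases` and its parameters are hypothetical names.

```python
from collections import Counter


def select_phrases(ngram_counts, paired_phrases, top_k):
    """Keep N-grams that are among the top_k most frequent in the
    monolingual corpus AND occur as one side of a phrase pair."""
    top = {gram for gram, _ in ngram_counts.most_common(top_k)}
    return top & set(paired_phrases)


# Toy N-gram counts from a monolingual corpus (assumed values).
ngram_counts = Counter({("a",): 5, ("a", "b"): 3, ("b",): 2, ("c",): 1})
# Toy set of N-grams seen in phrase pairs of the parallel corpus.
paired = {("a", "b"), ("c",)}
selected = select_phrases(ngram_counts, paired, top_k=3)
```

Here `("c",)` is dropped because it is not among the three most frequent N-grams, and `("a",)` and `("b",)` are dropped because they never occur in a phrase pair.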

The phrase-based HMM translation model creation unit 36 further includes a phrase segmentation model creation unit 62 for creating a Japanese phrase segmentation model 66 from the Japanese phrase N-gram model 56 by the method described above, and an HMM learning unit 70 that uses the English phrase N-gram model 58 and the phrase segmentation model 66 to estimate the parameters of the HMM constituting the phrase translation model 68.

The decoder 42, for its part, includes: a segmentation processing unit 80 that, given a Japanese input sentence 40, generates all possible segmentations of the input sentence 40; a segmentation storage unit 82 that stores all segmentations generated by the segmentation processing unit 80; and a lattice creation unit 84 that creates a lattice of translated phrases for the input sentence 40 based on all the segmentations of the input sentence 40 stored in the segmentation storage unit 82. To create the lattice, the lattice creation unit 84 uses the phrase translation model 68 to replace each Japanese phrase stored in the segmentation storage unit 82 with the corresponding English phrases, and combines those English phrases in arbitrary order. It additionally uses the Japanese phrase N-gram model 56, the English phrase N-gram model 58, and the English language model 60 to assign probabilities to the edges and nodes of the lattice. Because the search space is very large when building a lattice of English phrases, the lattice creation unit 84 performs a beam search using these probabilities and builds the lattice keeping only the edges and nodes with high probability.
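Enumerating all possible segmentations, as the segmentation processing unit 80 does, can be sketched as follows. The function name `all_segmentations` is hypothetical, and the sketch does not restrict phrases to those known to a phrase N-gram model, which a practical system would presumably do.

```python
def all_segmentations(words):
    """Yield every way to split `words` into contiguous phrases."""
    if not words:
        yield []
        return
    for end in range(1, len(words) + 1):
        head = tuple(words[:end])          # first phrase of this split
        for rest in all_segmentations(words[end:]):
            yield [head] + rest


segs = list(all_segmentations(["a", "b", "c"]))
```

A sentence of n words has 2^(n-1) segmentations, one for each choice of whether to cut at each of the n-1 word boundaries. This growth is why the segmentations are represented compactly as a lattice.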

The decoder 42 further includes a lattice storage unit 86 that stores the lattice of English phrases with probabilities created by the lattice creation unit 84, and an A* search processing unit 88 that performs an A* search on the probability-annotated English lattice stored in the lattice storage unit 86 and outputs the M highest-scoring lattice paths as the M-best translations 44.
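An M-best A* search over a probability-annotated lattice can be sketched as below. This is one common formulation and is not taken from the embodiment. The lattice is assumed to be an acyclic dictionary mapping a node to `(next_node, phrase, log_prob)` edges, and the heuristic is assumed to be the exact best log-probability from a node to the goal, computed by backward dynamic programming. All names are hypothetical.

```python
import heapq
import math
from functools import lru_cache
from itertools import count


def m_best_paths(lattice, start, goal, m):
    """Return up to m (log_prob, phrases) paths in decreasing score order."""

    @lru_cache(maxsize=None)
    def best_to_goal(node):
        # Exact heuristic: best achievable log-probability from node to goal.
        if node == goal:
            return 0.0
        scores = [lp + best_to_goal(nxt) for nxt, _, lp in lattice.get(node, [])]
        return max(scores, default=float("-inf"))

    tie = count()  # tie-breaker so the heap never compares nodes or tuples
    heap = [(-best_to_goal(start), next(tie), 0.0, start, ())]
    results = []
    while heap and len(results) < m:
        _, _, g, node, phrases = heapq.heappop(heap)
        if node == goal:
            results.append((g, list(phrases)))
            continue
        for nxt, phrase, lp in lattice.get(node, []):
            h = best_to_goal(nxt)
            if h == float("-inf"):
                continue  # dead end: the goal is unreachable from nxt
            heapq.heappush(
                heap, (-(g + lp + h), next(tie), g + lp, nxt, phrases + (phrase,)))
    return results


toy_lattice = {
    "s": [("a", "i", math.log(0.6)), ("b", "we", math.log(0.4))],
    "a": [("g", "like tea", math.log(0.7)), ("g", "love tea", math.log(0.3))],
    "b": [("g", "like tea", math.log(1.0))],
}
best_two = m_best_paths(toy_lattice, "s", "g", m=2)
```

Because the heuristic never underestimates the remaining score, complete paths leave the priority queue in order of decreasing total probability, so the first M completed paths are the M-best.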

FIG. 4 shows the procedure for estimating the values of P(/E_i | /E_i') in creating the English phrase N-gram model 58. Referring to FIG. 4, step 100 first initializes the probability table that stores these probabilities with some uniform value. Step 102 initializes the loop control variable i to 0. Step 104 adds 1 to i. Step 106 determines whether i exceeds 2. If it does, the process ends; if i is 2 or less, the process proceeds to step 108.

In step 108, the processing of steps 110 through 116 is repeated for each sentence in the English corpus 34 (see FIG. 3).

In step 110, all possible segmentations of the sentence are generated and a lattice is built from the result. In step 112, probabilities are assigned to all edges of this lattice as shown in FIG. 1. In step 114, the posterior probability P(/E_i, /E_i' | e) of each node is estimated by the forward-backward algorithm.

Step 116 determines whether the above processing has been completed for all sentences. If not, the processing from step 110 onward is repeated. Once all sentences have been processed, step 118 uses the posterior probabilities estimated in step 114 as occurrence frequencies and computes the prior probabilities of the phrase N-gram model by maximum likelihood estimation, as shown in Equation (7).

The process then returns to step 104 for the next iteration. In this embodiment, the process ends after the above processing has been repeated twice.
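The estimation loop of FIG. 4 can be sketched for phrase bigrams as follows. This is a simplified illustration, not the embodiment itself: the maximum phrase length `max_len`, the uniform initial value 0.1, the omission of an end-of-sentence symbol, and all function names are assumptions made here. The re-estimation step normalizes expected counts per preceding phrase, which is assumed to correspond to the maximum likelihood computation of step 118.

```python
from collections import defaultdict

START = "<s>"
UNIFORM = 0.1  # assumed uniform initial value (step 100)


def expected_counts(words, prob, max_len=3):
    """Forward-backward over the segmentation lattice of one sentence.
    Returns posterior (expected) counts of phrase bigrams (steps 110-114)."""
    n = len(words)
    spans = [(i, j) for i in range(n)
             for j in range(i + 1, min(n, i + max_len) + 1)]

    def phrase(i, j):
        return " ".join(words[i:j])

    def p(prev, cur):
        return prob.get((prev, cur), UNIFORM)

    alpha, beta = {}, {}
    for i, j in sorted(spans):                      # increasing start position
        if i == 0:
            alpha[(i, j)] = p(START, phrase(i, j))
        else:
            alpha[(i, j)] = sum(alpha[(h, i)] * p(phrase(h, i), phrase(i, j))
                                for h in range(max(0, i - max_len), i))
    for i, j in sorted(spans, reverse=True):        # decreasing start position
        if j == n:
            beta[(i, j)] = 1.0
        else:
            beta[(i, j)] = sum(p(phrase(i, j), phrase(j, k)) * beta[(j, k)]
                               for k in range(j + 1, min(n, j + max_len) + 1))
    z = sum(alpha[(i, n)] for i in range(max(0, n - max_len), n))

    counts = defaultdict(float)
    for i, j in spans:
        if i == 0:
            counts[(START, phrase(i, j))] += alpha[(i, j)] * beta[(i, j)] / z
        else:
            for h in range(max(0, i - max_len), i):
                edge = alpha[(h, i)] * p(phrase(h, i), phrase(i, j)) * beta[(i, j)]
                counts[(phrase(h, i), phrase(i, j))] += edge / z
    return counts


def train(corpus, iterations=2, max_len=3):
    """Repeat expectation and re-estimation a fixed number of times."""
    prob = {}
    for _ in range(iterations):
        counts = defaultdict(float)
        for words in corpus:                        # step 108: every sentence
            for key, c in expected_counts(words, prob, max_len).items():
                counts[key] += c
        totals = defaultdict(float)
        for (prev, _), c in counts.items():
            totals[prev] += c
        prob = {key: c / totals[key[0]] for key, c in counts.items()}  # step 118
    return prob


bigram_model = train([["new", "york", "city"], ["new", "york"]])
```

After training, the probabilities of all phrases that can follow a given phrase sum to one, so `bigram_model` is a proper conditional distribution over the phrase bigrams seen in the lattices.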

FIG. 5 shows the procedure for computing the phrase segmentation model for a Japanese sentence F. Referring to FIG. 5, step 140 first finds all possible phrase segmentations of the sentence F and builds a lattice. In step 142, for every phrase appearing in the lattice obtained in step 140, the forward probability α and the backward probability β are computed by the forward-backward algorithm using the Japanese phrase N-gram model 56. In step 144, P(/F | F) is computed according to Equation (10). This processing is performed for all sentences.
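For a short sentence, the segmentation posterior can be checked by brute force. The sketch below assumes that P(/F | F) is the product of phrase bigram probabilities along a segmentation, normalized over all segmentations of F. It enumerates segmentations instead of using α and β, so it is practical only for toy inputs. The toy bigram table and all names are hypothetical.

```python
START = "<s>"


def segmentations(words):
    """Yield each segmentation of `words` as a tuple of phrase strings."""
    if not words:
        yield ()
        return
    for end in range(1, len(words) + 1):
        head = " ".join(words[:end])
        for rest in segmentations(words[end:]):
            yield (head,) + rest


def segmentation_posteriors(words, bigram):
    """Map each segmentation to its normalized phrase-bigram score."""
    scores = {}
    for seg in segmentations(words):
        score, prev = 1.0, START
        for ph in seg:
            score *= bigram(prev, ph)
            prev = ph
        scores[seg] = score
    total = sum(scores.values())  # total probability mass of the lattice
    return {seg: s / total for seg, s in scores.items()}


# Toy phrase bigram probabilities (assumed values); unseen pairs get 0.1.
TABLE = {(START, "kyoto eki"): 0.4, ("kyoto eki", "made"): 0.5}


def toy_bigram(prev, cur):
    return TABLE.get((prev, cur), 0.1)


posteriors = segmentation_posteriors(["kyoto", "eki", "made"], toy_bigram)
best_seg = max(posteriors, key=posteriors.get)
```

The normalizer `total` is the quantity that the forward pass computes without enumeration, which is what makes the lattice-based procedure tractable for full-length sentences.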

FIG. 6 shows the procedure for computing the phrase translation model. Referring to FIG. 6, step 160 first creates a phrase-level lattice pair <F, E> for every sentence pair in the parallel corpus 32 shown in FIG. 3.

In step 162, the processing of steps 164 through 170 is repeated. In step 164, the posterior probability P(/F | F) of each segmentation /F of the sentence F is obtained from the lattice of the source sentence F. In step 166, on the lattice of the target sentence E, the forward probability α and the backward probability β for each combination of a segmentation /E and a segmentation /F are computed with the forward-backward algorithm. In step 168, the probabilities thus obtained are added to the counts as co-occurrence counts of the phrases contained in the segmentation /E and the segmentation /F.

Step 170 determines whether steps 164 through 168 have been performed for all <F, E>. If not, the processing is repeated from step 164.

When processing is complete for all <F, E>, step 172 computes P(/F | /E) from the counts accumulated in step 168, and the process ends.
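The final normalization of step 172 can be sketched as follows, assuming that the counts of step 168 are available as weighted phrase pairs and that P(/F | /E) is obtained by normalizing the counts for each English phrase. The function name and the toy weights are hypothetical.

```python
from collections import defaultdict


def estimate_translation_model(weighted_pairs):
    """weighted_pairs: iterable of (japanese_phrase, english_phrase, weight),
    where weight is a posterior probability used as a fractional count."""
    counts = defaultdict(float)
    totals = defaultdict(float)
    for f, e, w in weighted_pairs:
        counts[(f, e)] += w      # step 168: accumulate co-occurrence counts
        totals[e] += w
    # step 172: P(f | e) = count(f, e) / sum over f' of count(f', e)
    return {(f, e): c / totals[e] for (f, e), c in counts.items()}


pairs = [
    ("eki", "station", 0.9),
    ("eki", "station", 0.6),
    ("eki made", "station", 0.5),
]
translation_model = estimate_translation_model(pairs)
```

With these toy weights, "eki" receives 1.5 of the 2.0 total count for "station", so P(eki | station) is 0.75 and P(eki made | station) is 0.25.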

[Operation]
The operation of the statistical machine translation system 20 shown in FIG. 3, whose configuration was described above, is as follows. As is clear from the description so far, its operation has two phases. The first creates the phrase translation model 68 and the associated Japanese phrase N-gram model 56, English phrase N-gram model 58, and English language model 60. The second uses the phrase translation model 68, the Japanese phrase N-gram model 56, the English phrase N-gram model 58, and the English language model 60 thus created to translate the input sentence 40 and output the translation 44. The two phases are described in order below.

-Model creation phase-
It is assumed that the Japanese corpus 30, the parallel corpus 32, and the English corpus 34 have been prepared in advance. The Japanese phrase N-gram model creation unit 50 creates the Japanese phrase N-gram model 56 from the Japanese corpus 30. The English phrase N-gram model creation unit 52 creates the English phrase N-gram model 58 from the English corpus 34. The English language model creation unit 54 creates the English language model 60 from the English corpus 34.

The phrase segmentation model creation unit 62 creates the phrase segmentation model 66 based on the Japanese phrase N-gram model 56 and the Japanese corpus 30.

The HMM learning unit 70 trains the phrase translation model 68 using the phrase segmentation model 66 and the English phrase N-gram model 58 thus created, together with the parallel corpus 32.

-Translation phase-
Once the phrase-based HMM translation model 38 (that is, the Japanese phrase N-gram model 56, the English phrase N-gram model 58, and the phrase translation model 68) and the English language model 60 are ready, the decoder 42 can translate the input sentence 40. The segmentation processing unit 80 generates all possible segmentations of the input sentence 40 and stores them in the segmentation storage unit 82. Based on the segmentations stored in the segmentation storage unit 82, the lattice creation unit 84 creates a lattice of the translation whose nodes are phrases, using the phrase translation model 68 and the other models. The edges and nodes of this lattice are assigned probabilities computed from the models. The resulting lattice is stored in the lattice storage unit 86.
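Building translation hypotheses with beam pruning, in the spirit of the lattice creation unit 84, can be sketched as below. The sketch is heavily simplified: it scores hypotheses with only a translation probability and a single English phrase bigram, whereas the embodiment also uses the Japanese phrase N-gram model 56 and the English language model 60, and it keeps a flat list of hypotheses rather than a lattice. All names and toy probabilities are assumptions.

```python
import heapq

START = "<s>"


def beam_decode(src_phrases, options, bigram, beam=5):
    """Translate source phrases in any order, keeping only the `beam`
    most probable partial hypotheses after each extension.
    options: source phrase -> list of (english_phrase, translation_prob)."""
    hyps = [(1.0, frozenset(), (START,))]   # (prob, covered indices, output)
    for _ in src_phrases:
        expanded = []
        for prob, covered, out in hyps:
            for idx, src in enumerate(src_phrases):
                if idx in covered:
                    continue                # each source phrase is used once
                for eng, tp in options.get(src, []):
                    score = prob * tp * bigram(out[-1], eng)
                    expanded.append((score, covered | {idx}, out + (eng,)))
        hyps = heapq.nlargest(beam, expanded, key=lambda h: h[0])  # pruning
    return [(prob, list(out[1:])) for prob, _, out in hyps]


toy_options = {
    "watashi wa": [("i", 1.0)],
    "ocha ga": [("tea", 1.0)],
    "suki desu": [("like", 1.0)],
}
GOOD = {(START, "i"), ("i", "like"), ("like", "tea")}


def toy_bigram(prev, cur):
    return 0.9 if (prev, cur) in GOOD else 0.1


hypotheses = beam_decode(["watashi wa", "ocha ga", "suki desu"],
                         toy_options, toy_bigram, beam=2)
```

Even with a beam of 2, the reordered hypothesis "i like tea" survives every pruning step, because its partial scores stay above those of the competing phrase orders.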

The A* search processing unit 88 uses the A* search algorithm to find the M-best paths with the highest probabilities in the probability-annotated lattice stored in the lattice storage unit 86, and outputs the corresponding phrase sequences as the translations 44.

[Effect]
As described above, this embodiment directly models the correspondence between phrases contained in the parallel corpus as an HMM, so that translation phrase pairs can be created efficiently and used for translation. Accuracy is also higher than with conventional approaches.

[Implementation by computer]
The statistical machine translation system 20 of this embodiment is implemented by computer hardware, a program executed by that hardware, and data stored in that hardware. FIG. 7 shows the external appearance of the computer system 330, and FIG. 8 shows its internal configuration.

Referring to FIG. 7, the computer system 330 includes a computer 340 having an FD (flexible disk) drive 352 and a CD-ROM (compact disc read-only memory) drive 350, a keyboard 346, a mouse 348, and a monitor 342.

Referring to FIG. 8, in addition to the FD drive 352 and the CD-ROM drive 350, the computer 340 includes a CPU (central processing unit) 356; a bus 366 connected to the CPU 356, the FD drive 352, and the CD-ROM drive 350; a read-only memory (ROM) 358 that stores a boot-up program and the like; and a random access memory (RAM) 360, connected to the bus 366, that stores program instructions, system programs, work data, and the like. The computer system 330 further includes a printer 344.

Although not shown here, the computer 340 may further include a network adapter board that provides a connection to a local area network (LAN).

The computer program that causes the computer system 330 to operate as the statistical machine translation system 20 is stored on a CD-ROM 362 or an FD 364 inserted into the CD-ROM drive 350 or the FD drive 352, and is then transferred to a hard disk 354. Alternatively, the program may be sent to the computer 340 over a network (not shown) and stored on the hard disk 354. The program is loaded into the RAM 360 for execution. The program may also be loaded into the RAM 360 directly from the CD-ROM 362, from the FD 364, or over the network.

In this embodiment, the phrase translation model 68, the Japanese phrase N-gram model 56, the English phrase N-gram model 58, the English language model 60, and the like shown in FIG. 3 are all stored on the hard disk 354 shown in FIG. 8, read into the RAM 360 as needed, and used during program execution. The segmentation storage unit 82, the lattice storage unit 86, and the like shown in FIG. 3 are all implemented by the RAM 360.

This program includes a plurality of instructions that cause the computer 340 to operate as the statistical machine translation system 20 according to this embodiment. Some of the basic functions needed for this operation are provided by the operating system (OS) running on the computer 340, by third-party programs, or by modules of various toolkits installed on the computer 340. The program therefore need not contain every function needed to implement the system of this embodiment. It need only contain those instructions that carry out the operation of the statistical machine translation system 20 described above by calling the appropriate functions or "tools" in a controlled manner so as to obtain the desired result. The operation of the computer system 330 is well known and is not repeated here.

[Modifications]
In the embodiment described above, the Japanese phrase N-gram model 56 and the English language model 60 are both created from independent monolingual corpora, namely the Japanese corpus 30 and the English corpus 34. Because monolingual corpora are generally rich in content, the Japanese phrase N-gram model 56, the English phrase N-gram model 58, and the English language model 60 created by this procedure are highly accurate. However, the present invention is not limited to such an embodiment. For example, when the parallel corpus 32 contains a sufficiently large number of sentence pairs, the Japanese phrase N-gram model 56 may be created from the Japanese sentences of the parallel corpus 32, and the English phrase N-gram model 58 and the English language model 60 from its English sentences.

In this embodiment, the processing shown in FIG. 4 is repeated twice and the processing shown in FIG. 6 is performed only once, but the present invention is not limited to such an embodiment. For example, the processing of FIG. 4 may be repeated any number of times, one or more. In the processing shown in FIG. 6, the process may return to step 160 after step 172 so that the iteration is performed two or more times. Furthermore, instead of fixing the number of iterations, the iteration may be terminated when some condition is satisfied.
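Terminating the iteration on a condition rather than after a fixed count can be sketched as follows. The stopping condition used here, an improvement in log-likelihood smaller than a tolerance, is only one possible choice and is an assumption of this sketch, as are the function names and the toy update step.

```python
def run_until_converged(step, state, tol=1e-4, max_iters=50):
    """Repeat `step` until the log-likelihood improves by less than `tol`.
    `step` maps a state to (new_state, log_likelihood)."""
    prev = float("-inf")
    for iteration in range(1, max_iters + 1):
        state, loglik = step(state)
        if loglik - prev < tol:
            return state, iteration
        prev = loglik
    return state, max_iters


def toy_step(x):
    """Toy update standing in for one re-estimation pass: halves the state
    and reports a log-likelihood that approaches 0 from below."""
    new_x = x / 2.0
    return new_x, -new_x


final_state, n_iters = run_until_converged(toy_step, 1.0)
```

A cap such as `max_iters` keeps the loop bounded even if the condition is never met.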

In the above embodiment, the Japanese corpus 30, the parallel corpus 32, and the English corpus 34 are used as constraints when the Japanese phrase N-gram model creation unit 50 and the English phrase N-gram model creation unit 52 create the Japanese phrase N-gram model 56 and the English phrase N-gram model 58. However, the present invention is not limited to such an embodiment. For example, in addition to these, or in place of any of them, a phrase-level bilingual dictionary may be used as a constraint in creating the Japanese phrase N-gram model 56 and the English phrase N-gram model 58.

The embodiment described above translates from Japanese to English. However, the present invention is not limited to such an embodiment. It can be applied to any combination of languages for which suitable corpora are available.

The embodiment disclosed herein is merely illustrative, and the present invention is not limited to the embodiment described above. The scope of the present invention is indicated by each claim, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

FIG. 1 is a diagram showing an example of the phrase lattice structure of the target language.
FIG. 2 is a diagram showing the relationship between the phrase lattice structures of the source language and the target language used in creating the phrase translation model.
FIG. 3 is a block diagram of the statistical machine translation system 20 according to an embodiment of the present invention.
FIG. 4 is a flowchart showing the procedure for estimating the values of P(/E_i | /E_i') in creating the English phrase N-gram model 58.
FIG. 5 is a flowchart showing the procedure for computing the phrase segmentation model for a Japanese sentence F.
FIG. 6 is a flowchart showing the procedure for computing the phrase translation model.
FIG. 7 is an external view of a computer system that implements an embodiment of the present invention.
FIG. 8 is a block diagram of the computer system shown in FIG. 7.

Explanation of symbols

20: statistical machine translation system; 30: Japanese corpus; 32: parallel corpus; 34: English corpus; 36: phrase-based HMM translation model creation unit; 38: phrase-based HMM translation model; 40: input sentence; 42: decoder; 44: translation; 50: Japanese phrase N-gram model creation unit; 52: English phrase N-gram model creation unit; 54: English language model creation unit; 56: Japanese phrase N-gram model; 58: English phrase N-gram model; 60: English language model; 62: phrase segmentation model creation unit; 66: phrase segmentation model; 68: phrase translation model; 70: HMM learning unit; 80: segmentation processing unit; 82: segmentation storage unit; 84: lattice creation unit; 86: lattice storage unit

Claims

1. A statistical machine translation apparatus that translates an input sentence in a source language into a target language, comprising:
storage means for storing a phrase N-gram model of the source language (N being an integer of 2 or more), a phrase N-gram model and a predetermined language model of the target language, and a phrase-based translation model from the target language to the source language;
segmentation means for segmenting the input sentence in the source language;
lattice creation means for creating, in accordance with the segmentation obtained by the segmentation means and using the phrase N-gram model of the source language, the phrase N-gram model and the language model of the target language, and the phrase-based translation model from the target language to the source language stored in the storage means, a lattice in which phrases of the target language are arranged in any order, each node and edge of the lattice carrying a probability obtained using the phrase N-gram model of the source language, the phrase N-gram model and the language model of the target language, and the phrase-based translation model; and
search means for searching the lattice created by the lattice creation means for the top M phrase sequences giving the highest probabilities.

2. The statistical machine translation apparatus according to claim 1, wherein the search means includes means for searching, using the A* search algorithm, for the top M phrase sequences giving the highest probabilities in the lattice created by the lattice creation means.

3. The statistical machine translation apparatus according to claim 1, wherein the lattice creation means includes means for creating the lattice using only nodes that satisfy a predetermined probability condition.

4. The statistical machine translation apparatus according to any one of claims 1 to 3, wherein the phrase-based translation model is formed by a combination of the phrase N-gram model of the target language, a phrase segmentation model of the source language, and a phrase translation model from the target language to the source language.

5. A statistical machine translation program that, when executed by a computer, causes the computer to operate as the statistical machine translation apparatus according to any one of claims 1 to 4.