JP2017142757A

JP2017142757A - Information processing method, device, and program

Info

Publication number: JP2017142757A
Application number: JP2016025252A
Authority: JP
Inventors: 正彬西野; Masaaki Nishino; 潤鈴木; Jun Suzuki; 昌明永田; Masaaki Nagata
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2016-02-12
Filing date: 2016-02-12
Publication date: 2017-08-17
Anticipated expiration: 2036-02-12
Also published as: JP6498135B2

Abstract

PROBLEM TO BE SOLVED: To obtain a phrase table with a reduced number of phrase pairs that enables accurate translation. SOLUTION: A selection processing unit 22 selects a subset X from a phrase table, which is a set of phrase pairs z_j, so as to optimize an objective function. The objective function is a submodular function expressed using, for each pair (e_i, f_i) of a training bilingual corpus, the number of times each k-th word e_ik of the source language sentence e_i is covered by the phrase pairs z_j included in the subset X, and the number of times each k-th word f_ik of the target language sentence f_i is covered by the phrase pairs z_j included in the subset X. SELECTED DRAWING: Figure 1

Description

The present invention relates to an information processing method, apparatus, and program.

Statistical machine translation is a technique that uses probability and statistics to automatically translate a document written in one language (hereinafter, the source language) into a document written in another language (hereinafter, the target language). Among the various statistical machine translation techniques, phrase-based statistical machine translation represents a source language sentence as a sequence of phrases, each consisting of a run of words, and performs translation by converting that sequence into a sequence of corresponding target language phrases.

To perform phrase-based statistical machine translation, it is necessary to prepare a table, called a phrase table, that indicates which target language phrase each source language phrase is translated into. Let S denote the phrase table. The elements of S are phrase pairs (p, q), where p is a source language phrase and q is a target language phrase. Because the phrase pairs included in the phrase table determine the vocabulary that the translation system can translate, the total number of phrase pairs in a phrase table is generally enormous.
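As a minimal sketch of the data structure described above, a phrase table S can be held as a set of (p, q) pairs whose phrases are tuples of words. The class name `PhrasePair`, its field names, and the toy English-French entries are illustrative assumptions and do not come from the patent.

```python
from typing import NamedTuple, Set, Tuple


class PhrasePair(NamedTuple):
    """One element z = (p, q) of the phrase table S (hypothetical representation)."""
    source: Tuple[str, ...]  # p: words of the source language phrase
    target: Tuple[str, ...]  # q: words of the target language phrase


# A toy phrase table S; the entries are made up for illustration.
phrase_table: Set[PhrasePair] = {
    PhrasePair(("the", "cat"), ("le", "chat")),
    PhrasePair(("the",), ("le",)),
    PhrasePair(("cat",), ("chat",)),
}

# M, the total number of phrase pairs in S.
M = len(phrase_table)
```

Using hashable tuples lets the table be an ordinary set, which matches the set notation used for S and its subsets X in the description.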

When a phrase-based statistical machine translation system performs translation, it must repeatedly access the phrase table stored in the storage device of a computer. When the number of phrase pairs in the phrase table becomes enormous, the number of options available when generating a translation increases, and as a result the translation takes longer to generate.

In general, the phrase pairs in a phrase table are acquired automatically from the results of word alignment of pairs of source language and target language sentences that are translations of each other. However, the phrase pairs obtained in this way include many poor-quality pairs that are not actually translations of each other. Poor-quality phrase pairs act as noise during translation generation and degrade the quality of the generated translation. For these reasons, techniques for creating a smaller phrase table by removing poor-quality phrase pairs from a given phrase table have been studied (for example, Non-Patent Document 1).

Zens, Richard and Stanton, Daisy and Xu, Peng, “A Systematic Comparison of Phrase Table Pruning Techniques”, In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2012.

Non-Patent Document 1 proposes a method of creating a small phrase table by scoring phrase pairs using entropy and deleting unnecessary phrase pairs based on those scores.

However, because this method assigns a score to each phrase pair independently and extracts the pairs with high scores, it cannot directly evaluate the properties of the phrase table formed as a collection of high-scoring phrase pairs. As a result, the quality of the phrase table may deteriorate; that is, when a statistical machine translation system is built using a table with a reduced number of phrase pairs, translation accuracy may decrease.
In addition, reducing phrase pairs using entropy requires computation that is exponential in the length of the phrases included in each phrase pair, so efficient computation is difficult when the table contains many phrase pairs made up of long phrases.

The present invention has been made in view of the above circumstances, and an object thereof is to provide an information processing method, apparatus, and program capable of obtaining a phrase table with a reduced number of phrase pairs that enables accurate translation.

To achieve the above object, an information processing method according to the present invention is a method in an information processing apparatus that includes selection processing means and that selects a subset from a phrase table. The phrase table is generated in advance from a training bilingual corpus, which is a set of pairs (e_i, f_i) of a source language sentence e_i and a target language sentence f_i that is a translation of e_i, and is a set of phrase pairs z_i, each consisting of a phrase p_i that is a substring of the source language sentence e_i and a phrase q_i that is a translation of p_i and a substring of the target language sentence f_i. The method includes a step in which the selection processing means selects a subset X from the phrase table so as to optimize an objective function. The objective function is a submodular function expressed using, for each pair (e_i, f_i) of the training bilingual corpus, the number of times each k-th word e_ik of the source language sentence e_i is covered by the phrase pairs z_j included in the subset X of the phrase table, and the number of times each k-th word f_ik of the target language sentence f_i is covered by the phrase pairs z_j included in the subset X.

An information processing apparatus according to the present invention selects a subset from a phrase table. The phrase table is generated in advance from a training bilingual corpus, which is a set of pairs (e_i, f_i) of a source language sentence e_i and a target language sentence f_i that is a translation of e_i, and is a set of phrase pairs z_i, each consisting of a phrase p_i that is a substring of the source language sentence e_i and a phrase q_i that is a translation of p_i and a substring of the target language sentence f_i. The apparatus includes selection processing means for selecting a subset X from the phrase table so as to optimize an objective function. The objective function is a submodular function expressed using, for each pair (e_i, f_i) of the training bilingual corpus, the number of times each k-th word e_ik of the source language sentence e_i is covered by the phrase pairs z_j included in the subset X of the phrase table, and the number of times each k-th word f_ik of the target language sentence f_i is covered by the phrase pairs z_j included in the subset X.

In the step of selecting the subset X from the phrase table, the selection processing means may select the subset X from the phrase table so as to optimize the objective function using a greedy method.

In the step of selecting the subset X from the phrase table, the selection processing means may select the subset X from the phrase table so as to optimize the objective function g(X) represented by the following equation.

Here, K is a predetermined value representing the size of the subset. E denotes the set of source language sentences {e_1, ..., e_N} in the training bilingual corpus, and F denotes the set of target language sentences {f_1, ..., f_N} in the training bilingual corpus. c(X, f_ik) denotes the number of times the word f_ik is covered by the phrase pairs z_j included in the subset X, and c(X, e_ik) denotes the number of times the word e_ik is covered by the phrase pairs z_j included in the subset X.

In the step of selecting the subset X from the phrase table, the selection processing means may select the subset X from the phrase table so as to optimize the objective function h(X) represented by the following equation.

Here, E denotes the set of source language sentences {e_1, ..., e_N} in the training bilingual corpus, and F denotes the set of target language sentences {f_1, ..., f_N} in the training bilingual corpus. c(X, f_ik) denotes the number of times the word f_ik is covered by the phrase pairs z_j included in the subset X, and c(X, e_ik) denotes the number of times the word e_ik is covered by the phrase pairs z_j included in the subset X. λ is a parameter satisfying 0 ≤ λ ≤ 1. I(·) is a function that returns 1 when its argument is true and 0 when its argument is false.

A program according to the present invention is a program for causing a computer to execute each step of the information processing method of the present invention.

As described above, according to the information processing method, apparatus, and program of the present invention, a subset X is selected from the phrase table so as to optimize an objective function that is a submodular function expressed using, for each pair (e_i, f_i) of the training bilingual corpus, the number of times each k-th word e_ik of the source language sentence e_i is covered by the phrase pairs z_j included in the subset X of the phrase table, and the number of times each k-th word f_ik of the target language sentence f_i is covered by the phrase pairs z_j included in the subset X. This provides the effect that a phrase table with a reduced number of phrase pairs that enables accurate translation can be obtained.

FIG. 1 is a schematic diagram showing the configuration of an information processing apparatus according to an embodiment of the present invention. FIG. 2 is a diagram showing an algorithm according to the embodiment of the present invention. FIG. 3 is a flowchart showing the selection processing routine in the information processing apparatus according to the embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

<Outline of Embodiment of the Present Invention>
The embodiment of the present invention reduces the number of phrase pairs stored in the phrase table required for phrase-based statistical machine translation. In this embodiment, the problem of extracting a set of phrase pairs from the phrase table is formulated as an optimization problem that maximizes a submodular function, and the phrase pairs are extracted by solving this problem with a greedy method.

<System configuration>
The information processing apparatus 100 according to the embodiment of the present invention selects a subset X from a phrase table S generated in advance from a training bilingual corpus. The information processing apparatus 100 is a computer including a CPU, a RAM, and a ROM that stores a program for executing the selection processing routine described later, and is functionally configured as follows. As shown in FIG. 1, the information processing apparatus 100 includes an input unit 10, a calculation unit 20, and an output unit 30.

The input unit 10 receives as input the phrase table S and a training bilingual corpus that includes a set E of source language sentences and a set F of target language sentences. In this embodiment, the phrase table S is S = {z_1, ..., z_M}. The phrase table S is a set of phrase pairs z_i.

The phrase table S is generated in advance from the training bilingual corpus. The training bilingual corpus is a set of pairs (e_i, f_i) of a source language sentence e_i and a target language sentence f_i that is a translation of e_i. The phrase table S is a set of phrase pairs z_i, each consisting of a phrase p_i that is a substring of a source language sentence e_i of the training bilingual corpus and a phrase q_i that is a translation of p_i and a substring of the target language sentence f_i.

M denotes the total number of phrase pairs, and z_i = (p_i, q_i) is a pair of a source language phrase p_i and a target language phrase q_i. Each phrase is a sequence of words:

$$p_i = (p_{i1}, \ldots, p_{i l_i}), \qquad q_i = (q_{i1}, \ldots, q_{i k_i})$$

where l_i is the number of words in p_i and k_i is the number of words in q_i. E is the set of source language sentences and F is the set of target language sentences, with E = {e_1, ..., e_N} and F = {f_1, ..., f_N}; e_i and f_i are a source language sentence and a target language sentence, respectively, and are translations of each other. Each sentence is represented as a sequence of words:

$$e_i = (e_{i1}, \ldots, e_{i n_i})$$

where e_ij is a source language word and n_i is the number of words in e_i. Similarly,

$$f_i = (f_{i1}, \ldots, f_{i m_i})$$

where m_i is the number of words in f_i.

For a phrase pair z_j = (p_j, q_j) and a sentence pair (e_i, f_i) in the bilingual corpus, z_j is said to be contained in the pair when p_j matches some contiguous subsequence of e_i and q_j matches some contiguous subsequence of f_i. That is, it is defined that there exist start positions s and t satisfying

$$(p_{j1}, \ldots, p_{j l_j}) = (e_{i,s}, \ldots, e_{i,s+l_j-1}), \qquad (q_{j1}, \ldots, q_{j k_j}) = (f_{i,t}, \ldots, f_{i,t+k_j-1}).$$

In this case, the words included in the subsequences matching p_j and q_j are defined to be covered by z_j.
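The containment and coverage definitions above can be sketched in code as follows. This is an illustrative reading, not the patent's implementation: the function names are hypothetical, positions are 0-based, and it is assumed that a phrase pair covers every occurrence of its phrases in the sentence but is counted at most once per word position.

```python
from typing import List, Sequence, Set, Tuple

Phrase = Tuple[str, ...]


def find_occurrences(phrase: Sequence[str], sentence: Sequence[str]) -> List[int]:
    """Start indices at which `phrase` appears as a contiguous run in `sentence`."""
    n = len(phrase)
    if n == 0:
        return []
    return [s for s in range(len(sentence) - n + 1)
            if list(sentence[s:s + n]) == list(phrase)]


def covered_positions(pair: Tuple[Phrase, Phrase],
                      e: Sequence[str],
                      f: Sequence[str]) -> Tuple[Set[int], Set[int]]:
    """Word positions of e and f covered by the phrase pair z = (p, q).

    z is contained in (e, f) only if p occurs in e AND q occurs in f;
    otherwise nothing is covered on either side.
    """
    p, q = pair
    src_starts = find_occurrences(p, e)
    tgt_starts = find_occurrences(q, f)
    if not src_starts or not tgt_starts:
        return set(), set()
    src = {s + d for s in src_starts for d in range(len(p))}
    tgt = {t + d for t in tgt_starts for d in range(len(q))}
    return src, tgt


# Toy sentence pair, made up for illustration.
e = "the cat sat".split()
f = "le chat assis".split()
contained = covered_positions((("the", "cat"), ("le", "chat")), e, f)
not_contained = covered_positions((("the", "cat"), ("le", "chien")), e, f)
```

Summing these per-position memberships over all phrase pairs in a subset X gives the counts c(X, e_ik) and c(X, f_ik) used by the objective function.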

The calculation unit 20 selects a subset from the phrase table S based on the training bilingual corpus and the phrase table S received by the input unit 10. The calculation unit 20 includes a selection processing unit 22.

Based on the training bilingual corpus and the phrase table S received by the input unit 10, the selection processing unit 22 selects a subset X (phrase table X) from the phrase table so as to optimize an objective function that is a submodular function. In this embodiment, the subset X is selected from the phrase table S by solving the optimization problem with a greedy method so as to maximize the objective function.

In this embodiment, the objective function is a submodular function expressed using, for each pair (e_i, f_i) of the training bilingual corpus, the number of times each k-th word e_ik of the source language sentence e_i is covered by the phrase pairs z_j included in the subset X of the phrase table, and the number of times each k-th word f_ik of the target language sentence f_i is covered by the phrase pairs z_j included in the subset X.

Specifically, the selection processing unit 22 selects the subset X from the phrase table by solving the optimization problem shown in the following equation (1).

Here, g(X) is an objective function that evaluates the quality of a subset X of phrase pairs, as shown in the following equation (2).

Here, K is a predetermined value representing the size of the subset. E denotes the set of source language sentences {e_1, ..., e_N} in the training bilingual corpus, and F denotes the set of target language sentences {f_1, ..., f_N} in the training bilingual corpus. c(X, f_ik) denotes the number of times the word f_ik is covered by the phrase pairs z_j included in the subset X, and c(X, e_ik) denotes the number of times the word e_ik is covered by the phrase pairs z_j included in the subset X.
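Equations (1) and (2) do not appear in this text. One form that is consistent with the surrounding definitions is sketched below; it is an assumption, not the patent's exact formula. In particular, the offset of 1 inside each logarithm is assumed so that a word with zero coverage contributes 0 rather than an undefined value, and the size constraint is written as |X| ≤ K although the original may use |X| = K.

```latex
\begin{align}
X^{*} &= \operatorname*{arg\,max}_{X \subseteq S,\ |X| \le K} \; g(X) \tag{1} \\
g(X) &= \sum_{i=1}^{N} \sum_{k=1}^{n_i} \log\bigl(c(X, e_{ik}) + 1\bigr)
      + \sum_{i=1}^{N} \sum_{k=1}^{m_i} \log\bigl(c(X, f_{ik}) + 1\bigr) \tag{2}
\end{align}
```

Under this assumed form, g(∅) = 0 and each additional covering of the same word adds a smaller increment than the previous one, which is the diminishing-returns behavior expected of a submodular function.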

Furthermore, g(X) gives a high score to a phrase table that covers more of the words appearing in E and F. Because the logarithm of the number of times a word is covered, c(X, f_ik) or c(X, e_ik), is taken before it is added to the score, a higher score is given to a phrase table that covers many different words than to one that covers the same words repeatedly.
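The effect of the logarithm can be checked numerically with a small sketch. The function name `coverage_score` is hypothetical, and the score is assumed to be the sum of log(count + 1) over word positions; the offset of 1 is an assumption that keeps uncovered words at a contribution of 0.

```python
import math
from typing import Iterable


def coverage_score(counts: Iterable[int]) -> float:
    """Sum of log(c + 1) over per-word coverage counts c (assumed form of the score)."""
    return sum(math.log(c + 1) for c in counts)


# Two words in total, and two "units" of coverage to spend.
spread = coverage_score([1, 1])   # each word covered once
stacked = coverage_score([2, 0])  # one word covered twice, the other never

# Covering different words is rewarded more than re-covering the same word.
prefers_spread = spread > stacked
```

Here `spread` equals 2·log 2 and `stacked` equals log 3, so the same amount of coverage scores higher when it is distributed across different words.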

FIG. 2 shows a greedy algorithm for obtaining an approximate solution to the above optimization problem. As shown in FIG. 2, in this embodiment, the subset X is first initialized to the empty set; then elements z ∈ S that maximize the score difference g(X ∪ {z}) − g(X) are selected one at a time and added to X, thereby generating a subset X that is a phrase table of size K. By solving the above optimization problem, the selection processing unit 22 obtains a subset X, which is a phrase table satisfying X ⊆ S.
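The greedy procedure described above can be sketched as follows. This is a minimal illustration rather than the algorithm of FIG. 2 itself: the names `greedy_select`, `toy_score`, and `covers` are hypothetical, the toy data is made up, and the score again assumes the log(count + 1) form.

```python
import math
from typing import Callable, Dict, Hashable, List, Sequence, Set


def greedy_select(candidates: Sequence[Hashable],
                  score: Callable[[Set[Hashable]], float],
                  k: int) -> Set[Hashable]:
    """Start from X = {} and repeatedly add the z with the largest gain
    score(X | {z}) - score(X) until X holds k elements."""
    selected: Set[Hashable] = set()
    remaining: List[Hashable] = list(candidates)  # list keeps tie-breaking deterministic
    while remaining and len(selected) < k:
        base = score(selected)
        best = max(remaining, key=lambda z: score(selected | {z}) - base)
        selected.add(best)
        remaining.remove(best)
    return selected


# Toy data: which word positions (0..3) each candidate phrase pair covers.
covers: Dict[str, List[int]] = {
    "z1": [0, 1],
    "z2": [1],
    "z3": [2, 3],
    "z4": [0],
}


def toy_score(subset: Set[Hashable]) -> float:
    """Assumed score: sum over positions of log(coverage count + 1)."""
    counts = [0, 0, 0, 0]
    for z in subset:
        for pos in covers[z]:
            counts[pos] += 1
    return sum(math.log(c + 1) for c in counts)


pruned = greedy_select(["z1", "z2", "z3", "z4"], toy_score, k=2)
```

With this toy data the first pick is `z1` (it ties with `z3` and comes first in the list), and the second pick is `z3` because it covers two still-uncovered positions, whereas `z2` and `z4` would only re-cover a position already covered. This sketch recomputes the score from scratch for every candidate; a practical implementation would update coverage counts incrementally.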

The output unit 30 outputs the subset X selected by the selection processing unit 22 as a phrase table with a reduced number of phrase pairs.

<Operation of information processing apparatus>
Next, the operation of the information processing apparatus 100 according to the embodiment of the present invention will be described. When the training bilingual corpus and the phrase table S are input to the information processing apparatus 100, the information processing apparatus 100 executes the selection processing routine shown in FIG. 3.

First, in step S100, the input unit 10 receives the training bilingual corpus and the phrase table S.

Then, in step S102, the selection processing unit 22 selects the subset X from the phrase table S by the greedy method in accordance with the above equations (1) and (2), based on the training bilingual corpus and the phrase table S received in step S100.

In step S104, the subset X selected in step S102 is output as the result, and the selection processing routine ends.

As described above, the information processing apparatus according to this embodiment selects a subset X from the phrase table so as to optimize an objective function that is a submodular function expressed using, for each pair (e_i, f_i) of the training bilingual corpus, the number of times each k-th word e_ik of the source language sentence e_i is covered by the phrase pairs z_j included in the subset X of the phrase table, and the number of times each k-th word f_ik of the target language sentence f_i is covered by the phrase pairs z_j included in the subset X. This makes it possible to obtain a phrase table that enables accurate translation and that has a reduced number of phrase pairs. A phrase table with a small number of phrase pairs can thus be obtained.

As a result, the translation generation process can be made faster, and translation accuracy can be improved by reducing unnecessary phrase pairs.

In addition, according to this embodiment, unlike the existing method using entropy, the quality of the phrase table as a whole can be evaluated directly, so the phrases in the phrase table can be reduced. Furthermore, the greedy solution runs in time polynomial in the number of phrases and does not take time exponential in the phrase length. It therefore runs fast even when a phrase table containing an enormous number of phrase pairs is given as input.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, the above embodiment has been described using g(X) of equation (2) as an example of the submodular function, but the invention is not limited to this. For example, h(X) shown in the following equation (3) may be used as the objective function.

Here, as in equation (2), E denotes the set of source language sentences {e_1, ..., e_N} in the training bilingual corpus, and F denotes the set of target language sentences {f_1, ..., f_N} in the training bilingual corpus. c(X, f_ik) denotes the number of times the word f_ik is covered by the phrase pairs z_j included in the subset X, and c(X, e_ik) denotes the number of times the word e_ik is covered by the phrase pairs z_j included in the subset X. λ is a parameter satisfying 0 ≤ λ ≤ 1. I(·) is a function that returns 1 when its argument is true and 0 when its argument is false.

When h(X) of equation (3) is used with λ < 1, a higher score is assigned, compared with g(X) of equation (2), to a subset X that covers a larger number of different words.

The information processing apparatus 100 described above contains a computer system. When a WWW system is used, the term "computer system" also includes the environment for providing (or displaying) web pages.

In this specification, the embodiment has been described with the program installed in advance, but the program can also be provided stored on a computer-readable recording medium.

Description of Symbols
10 Input unit
20 Calculation unit
22 Selection processing unit
30 Output unit
100 Information processing apparatus

Claims

An information processing method in an information processing apparatus that includes selection processing means and that selects a subset from a phrase table, the phrase table being generated in advance from a training bilingual corpus that is a set of pairs (e_i, f_i) of a source language sentence e_i and a target language sentence f_i that is a translation of the source language sentence e_i, and being a set of phrase pairs z_i each consisting of a phrase p_i that is a substring of the source language sentence e_i and a phrase q_i that is a translation of the phrase p_i and a substring of the target language sentence f_i,
the method comprising a step in which the selection processing means selects a subset X from the phrase table so as to optimize an objective function, the objective function being a submodular function expressed using, for each of the pairs (e_i, f_i) of the training bilingual corpus, the number of times each k-th word e_ik of the source language sentence e_i is covered by the phrase pairs z_j included in the subset X of the phrase table, which is a set of the phrase pairs z_j, and the number of times each k-th word f_ik of the target language sentence f_i is covered by the phrase pairs z_j included in the subset X.

The information processing method according to claim 1, wherein, in the step of selecting the subset X from the phrase table, the selection processing means selects the subset X from the phrase table so as to optimize the objective function using a greedy method.

The information processing method according to claim 1 or claim 2, wherein, in the step of selecting the subset X from the phrase table, the selection processing means selects the subset X from the phrase table so as to optimize the objective function g(X) shown in the following equation.

The information processing method according to claim 1, wherein, in the step of selecting the subset X from the phrase table, the selection processing means selects the subset X from the phrase table so as to optimize the objective function h(X) expressed by the following equation.

An information processing apparatus that selects a subset from a phrase table, the phrase table being generated in advance from a training bilingual corpus that is a set of pairs (e_i, f_i) of a source language sentence e_i and a target language sentence f_i that is a translation of the source language sentence e_i, and being a set of phrase pairs z_i each consisting of a phrase p_i that is a substring of the source language sentence e_i and a phrase q_i that is a translation of the phrase p_i and a substring of the target language sentence f_i,
the apparatus comprising selection processing means for selecting a subset X from the phrase table so as to optimize an objective function, the objective function being a submodular function expressed using, for each of the pairs (e_i, f_i) of the training bilingual corpus, the number of times each k-th word e_ik of the source language sentence e_i is covered by the phrase pairs z_j included in the subset X of the phrase table, which is a set of the phrase pairs z_j, and the number of times each k-th word f_ik of the target language sentence f_i is covered by the phrase pairs z_j included in the subset X.

A program for causing a computer to execute each step of the information processing method according to any one of claims 1 to 4.