JP6482073B2

JP6482073B2 - Information processing method, apparatus, and program

Info

Publication number: JP6482073B2
Application number: JP2015116059A
Authority: JP
Inventors: 西野　正彬; 鈴木　潤; 平尾　努; 梅谷　俊治
Original assignee: Nippon Telegraph and Telephone Corp; Osaka University NUC
Current assignee: Nippon Telegraph and Telephone Corp; Osaka University NUC
Priority date: 2015-06-08
Filing date: 2015-06-08
Publication date: 2019-03-13
Anticipated expiration: 2035-06-08
Also published as: JP2017004179A

Description

The present invention relates to an information processing method, apparatus, and program.

Statistical machine translation is a technique that uses probability and statistics to automatically translate a document written in one language (hereinafter, the source language) into a document written in another language (hereinafter, the target language). Among the various statistical machine translation methods, phrase-based statistical machine translation represents a source-language sentence as a sequence of phrases, each consisting of consecutive words, and translates it by converting that sequence into a sequence of corresponding target-language phrases.

Phrase-based statistical machine translation requires a table, called a phrase table, that indicates which target-language phrase each source-language phrase is translated into. Let S denote the phrase table. The elements of S are phrase pairs (p, q), where p is a source-language phrase and q is a target-language phrase. Because the phrase pairs in the phrase table determine the vocabulary that the translation system can translate, the total number of phrase pairs in a phrase table is generally enormous.

When a phrase-based statistical machine translation system performs translation, it must repeatedly access the phrase table stored in the computer's storage device. When the phrase table contains an enormous number of phrase pairs, the number of options available when generating a translation increases, and generating the translation consequently takes longer.

In addition, the phrase pairs in a phrase table are generally acquired automatically from the results of word alignment on pairs of source-language and target-language sentences that are translations of each other, and the phrase pairs obtained in this way include many poor-quality pairs that are not actually translations of each other. Poor-quality phrase pairs act as noise during translation generation and degrade the quality of the generated translation. For these reasons, techniques that remove poor-quality phrase pairs from a given phrase table to create a smaller phrase table have been studied (for example, Non-Patent Document 1).

Zens, Richard and Stanton, Daisy and Xu, Peng, "A Systematic Comparison of Phrase Table Pruning Techniques", In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2012.

Non-Patent Document 1 proposes methods that score phrase pairs using frequency or entropy and create a smaller phrase table by deleting unnecessary phrase pairs based on those scores.

However, these methods build the phrase table either by setting a threshold θ on the phrase-pair scores or by taking the K highest-scoring pairs, so the parameter θ or K, which determines the size of the phrase table, must be set according to some criterion. Because the appropriate phrase table size differs from one translation system to another, selecting an appropriate parameter has been a difficult problem that involves trial and error.
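The two pruning styles described above can be sketched as follows. This is only an illustrative sketch: the function names, the score values, and the example phrase pairs are hypothetical and are not taken from Non-Patent Document 1. It shows that the size of the resulting table is determined entirely by the hand-chosen parameter θ or K.

```python
from typing import Dict, Set, Tuple

PhrasePair = Tuple[str, str]  # (source-language phrase, target-language phrase)


def prune_by_threshold(scores: Dict[PhrasePair, float], theta: float) -> Set[PhrasePair]:
    """Keep every phrase pair whose score is at least the threshold theta."""
    return {pair for pair, score in scores.items() if score >= theta}


def prune_top_k(scores: Dict[PhrasePair, float], k: int) -> Set[PhrasePair]:
    """Keep the k highest-scoring phrase pairs."""
    ranked = sorted(scores, key=lambda pair: scores[pair], reverse=True)
    return set(ranked[:k])


# Hypothetical scores for a toy phrase table (illustrative values only).
scores = {
    ("the house", "das Haus"): 0.9,
    ("house", "Haus"): 0.7,
    ("the", "das"): 0.4,
    ("the house", "Haus"): 0.1,
}

kept_by_theta = prune_by_threshold(scores, theta=0.5)
kept_by_k = prune_top_k(scores, k=1)
```

Neither function looks at the training corpus, so nothing in this style of pruning ensures that the retained pairs can still cover the training sentences.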

Furthermore, being able to correctly translate the sentences in the training parallel corpus is important for ensuring the robustness of a translation system, but existing phrase table reduction methods cannot guarantee that the phrase pairs needed to correctly translate the sentences of the training parallel corpus are selected.

The present invention has been made in view of the above circumstances, and an object thereof is to provide an information processing method, apparatus, and program capable of obtaining a phrase table with a reduced number of phrase pairs.

To achieve the above object, an information processing method according to the present invention is an information processing method in an information processing apparatus that includes selection processing means and that selects a subset from a phrase table. The phrase table is generated in advance from a training parallel corpus, which is a set of pairs (e_j, f_j) of a source-language sentence e_j and a target-language sentence f_j that is a translation of e_j, and is a set of phrase pairs x_i, each consisting of a phrase p_i that is a substring of the source-language sentence e_j and a phrase q_i that is a translation of p_i and a substring of the target-language sentence f_j. The method includes a step in which the selection processing means selects the subset from the phrase table such that, for each pair (e_j, f_j) of the training parallel corpus, each r-th word e_jr of the source-language sentence e_j is contained in the phrase p_i of any one phrase pair x_i in the subset, where p_i is a substring of e_j, and each r-th word f_jr of the target-language sentence f_j is contained in the phrase q_i of any one phrase pair x_i in the subset, where q_i is a substring of f_j.

An information processing apparatus according to the present invention is an information processing apparatus that selects a subset from a phrase table. The phrase table is generated in advance from a training parallel corpus, which is a set of pairs (e_j, f_j) of a source-language sentence e_j and a target-language sentence f_j that is a translation of e_j, and is a set of phrase pairs x_i, each consisting of a phrase p_i that is a substring of the source-language sentence e_j and a phrase q_i that is a translation of p_i and a substring of the target-language sentence f_j. The apparatus includes selection processing means that selects the subset from the phrase table such that, for each pair (e_j, f_j) of the training parallel corpus, each r-th word e_jr of the source-language sentence e_j is contained in the phrase p_i of any one phrase pair x_i in the subset, where p_i is a substring of e_j, and each r-th word f_jr of the target-language sentence f_j is contained in the phrase q_i of any one phrase pair x_i in the subset, where q_i is a substring of f_j.

In the step of selecting a subset from the phrase table, the selection processing means may select the subset from the phrase table according to the following expression.

[Expression not reproduced in this text]

Here, the variable y_i (i = 1, ..., M) is a binary variable indicating whether the phrase pair x_i is included in the subset. a_ijr is a parameter indicating whether the r-th word e_jr of the source-language sentence e_j is covered by the phrase pair x_i; a_ijr = 1 when it is covered and 0 when it is not. b_ijr is a parameter indicating whether the r-th word f_jr of the target-language sentence f_j is covered by the phrase pair x_i; b_ijr = 1 when it is covered and 0 when it is not.

A program according to the present invention is a program for causing a computer to execute each step of the information processing method of the present invention.

As described above, according to the information processing method, apparatus, and program of the present invention, a subset is selected from the phrase table, which is a set of phrase pairs x_i, such that, for each pair (e_j, f_j) of the training parallel corpus, each r-th word e_jr of the source-language sentence e_j is contained in the phrase p_i of any one phrase pair x_i in the subset, where p_i is a substring of e_j, and each r-th word f_jr of the target-language sentence f_j is contained in the phrase q_i of any one phrase pair x_i in the subset, where q_i is a substring of f_j. This yields the effect that a phrase table with a reduced number of phrase pairs can be obtained.

FIG. 1 is a schematic diagram showing the configuration of an information processing apparatus according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the contents of the selection processing routine in the information processing apparatus according to the embodiment of the present invention.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings.

<Outline of the Embodiment of the Present Invention>
The embodiment of the present invention reduces the number of phrase pairs stored in the phrase table required for phrase-based statistical machine translation. In this embodiment, the problem of extracting a subset of phrase pairs from the phrase table is formulated as a set partitioning problem, which is a kind of combinatorial optimization problem, and the subset of phrase pairs is extracted by solving that set partitioning problem.

<System Configuration>
The information processing apparatus 100 according to the embodiment of the present invention selects a subset T from a phrase table S generated in advance from a training parallel corpus. The information processing apparatus 100 is a computer comprising a CPU, a RAM, and a ROM that stores a program for executing the selection processing routine described later, and is functionally configured as follows. As shown in FIG. 1, the information processing apparatus 100 includes an input unit 10, a calculation unit 20, and an output unit 30.

The input unit 10 accepts as input the phrase table S and a training parallel corpus comprising a set E of source-language sentences and a set F of target-language sentences. In this embodiment, the phrase table is S = {x_1, ..., x_M}. The phrase table S is a set of phrase pairs x_i.

The phrase table S is generated in advance from the training parallel corpus. The training parallel corpus is a set of pairs (e_j, f_j) of a source-language sentence e_j and a target-language sentence f_j that is a translation of e_j. The phrase table S is a set of phrase pairs x_i, each consisting of a phrase p_i that is a substring of a source-language sentence e_j of the training parallel corpus and a phrase q_i that is a translation of p_i and a substring of the target-language sentence f_j.

M denotes the total number of phrase pairs, and x_i = (p_i, q_i) is a pair consisting of a source-language phrase p_i and a target-language phrase q_i. Each phrase is a sequence of words.

E is the set of source-language sentences and F is the set of target-language sentences, and e_i and f_i are a source-language sentence and a target-language sentence, respectively. E = {e_1, ..., e_N} and F = {f_1, ..., f_N}, and sentence e_i and sentence f_i are translations of each other. Each sentence is represented as a sequence of words,

$e_i = (e_{i1}, \ldots, e_{i n_i})$,

where e_ij is a source-language word and n_i is the number of words in e_i. Similarly,

$f_i = (f_{i1}, \ldots, f_{i m_i})$,

where f_ij is a target-language word and m_i is the number of words in f_i.

For a phrase pair x_j and a sentence pair (e_i, f_i) in the parallel corpus, x_j = (p_j, q_j) is defined to be contained in the pair when p_j matches some contiguous subsequence of e_i and q_j matches some contiguous subsequence of f_i. That is, there must exist a position in e_i from which the words of p_j appear consecutively and a position in f_i from which the words of q_j appear consecutively. In this case, the words in the subsequences that match p_j and q_j are defined to be covered by x_j.
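The definitions of "contained" and "covered" can be made concrete with a short sketch. The function names and the toy sentences below are hypothetical, word positions are 0-based rather than the 1-based r used in the text, and taking the first match when a phrase occurs more than once in a sentence is an assumption that the text does not settle.

```python
from typing import Optional, Sequence, Set, Tuple


def find_span(phrase: Sequence[str], sentence: Sequence[str]) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the first contiguous match of phrase in sentence.

    end is exclusive; None is returned when the phrase does not occur.
    """
    k = len(phrase)
    if k == 0:
        return None
    for start in range(len(sentence) - k + 1):
        if list(sentence[start:start + k]) == list(phrase):
            return start, start + k
    return None


def covered_words(pair, e: Sequence[str], f: Sequence[str]) -> Tuple[Set[int], Set[int]]:
    """Word positions of e and f covered by the phrase pair (p, q).

    The pair covers words only when it is contained in (e, f), that is,
    when p matches in e AND q matches in f; otherwise both sets are empty.
    """
    p, q = pair
    span_e = find_span(p, e)
    span_f = find_span(q, f)
    if span_e is None or span_f is None:
        return set(), set()
    return set(range(*span_e)), set(range(*span_f))


# Toy sentence pair (illustrative only).
e = "the house is small".split()
f = "das Haus ist klein".split()

contained = covered_words((["the", "house"], ["das", "Haus"]), e, f)
not_contained = covered_words((["small"], ["gross"]), e, f)
```

In the second call the source phrase matches but the target phrase does not, so the pair is not contained in the sentence pair and covers nothing on either side.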

The calculation unit 20 selects a subset from the phrase table S based on the training parallel corpus and the phrase table S accepted by the input unit 10. The calculation unit 20 includes a selection processing unit 22.

The selection processing unit 22 selects the subset T from the phrase table S such that, for each pair (e_j, f_j) of the training parallel corpus accepted by the input unit 10, each r-th word e_jr of the source-language sentence e_j is contained in the phrase p_i of any one phrase pair x_i in the subset, where p_i is a substring of e_j, and each r-th word f_jr of the target-language sentence f_j is contained in the phrase q_i of any one phrase pair x_i in the subset, where q_i is a substring of f_j. The selection processing unit 22 obtains a subset T (phrase table T) satisfying T⊆S by solving a set partitioning problem.

Specifically, the selection processing unit 22 selects the subset T from the phrase table S by solving the set partitioning problem according to the following expression. Here, the set partitioning problem to be solved is formulated as the following integer programming problem in the variables y_i.

[Expression not reproduced in this text]

The variable y_i (i = 1, ..., M) is a binary variable that takes y_i = 1 when the phrase pair x_i is included in the subset T and y_i = 0 otherwise. a_ijr is a parameter indicating whether the r-th word e_jr of the source-language sentence e_j is covered by the phrase pair x_i; a_ijr = 1 when it is covered and 0 otherwise. Similarly, b_ijr is a parameter indicating whether the r-th word f_jr of the target-language sentence f_j is covered by the phrase pair x_i; b_ijr = 1 when it is covered and 0 otherwise.

The set partitioning problem is the problem of selecting a set of phrase pairs such that each e_jr and each f_jr is covered exactly once.
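The integer program referred to above is not reproduced in this text. A formulation consistent with the definitions of y_i, a_ijr, and b_ijr and with the exactly-once requirement might be written as follows. The equality constraints follow directly from those definitions; the objective of minimizing the number of selected phrase pairs is an assumption, chosen because no cost coefficients are defined in the text.

```latex
\[
\begin{aligned}
\text{minimize}\quad   & \sum_{i=1}^{M} y_i \\
\text{subject to}\quad & \sum_{i=1}^{M} a_{ijr}\, y_i = 1 && (j = 1,\dots,N;\ r = 1,\dots,n_j) \\
                       & \sum_{i=1}^{M} b_{ijr}\, y_i = 1 && (j = 1,\dots,N;\ r = 1,\dots,m_j) \\
                       & y_i \in \{0, 1\} && (i = 1,\dots,M)
\end{aligned}
\]
```

Here N is the number of sentence pairs, and n_j and m_j are the numbers of words in e_j and f_j. Each equality states that exactly one selected phrase pair covers the corresponding word, which is what distinguishes set partitioning from set covering, where the constraint would be an inequality (at least one).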

The output unit 30 outputs the subset T of phrase pairs x_i selected by the selection processing unit 22 as a phrase table with a reduced number of phrase pairs.

<Operation of the Information Processing Apparatus>
Next, the operation of the information processing apparatus 100 according to the embodiment of the present invention will be described. First, when the training parallel corpus and the phrase table S are input to the information processing apparatus 100, the information processing apparatus 100 executes the selection processing routine shown in FIG. 2.

First, in step S100, the input unit 10 accepts the training parallel corpus and the phrase table S.

Then, in step S102, based on the training parallel corpus and the phrase table S accepted in step S100, the selection processing unit 22 selects the subset T from the phrase table S such that, for each pair (e_j, f_j) of the training parallel corpus, each r-th word e_jr of the source-language sentence e_j is contained in the phrase p_i of any one phrase pair x_i in the subset, where p_i is a substring of e_j, and each r-th word f_jr of the target-language sentence f_j is contained in the phrase q_i of any one phrase pair x_i in the subset, where q_i is a substring of f_j.

In step S104, the subset T selected in step S102 is output as the result, and the selection processing routine ends.
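Steps S100 to S104 can be illustrated end to end on a toy instance. The following is a minimal sketch under stated assumptions, not the patented implementation: the function names, the toy corpus, and the example phrases are hypothetical; coverage uses the first match of each phrase in a sentence; and the subset is found by exhaustive search for the smallest exactly-once cover, an assumed objective that is feasible only for very small tables. A practical system would presumably pass the same constraints to an integer programming solver, which the text does not specify.

```python
from itertools import combinations
from typing import List, Optional, Sequence, Set, Tuple

Phrase = Tuple[str, ...]
PhrasePair = Tuple[Phrase, Phrase]
SentencePair = Tuple[Phrase, Phrase]
Item = Tuple[str, int, int]  # (side "e" or "f", sentence index j, word index r)


def first_span(phrase: Phrase, sentence: Phrase) -> Optional[range]:
    """Positions of the first contiguous match of phrase in sentence, or None."""
    k = len(phrase)
    for start in range(len(sentence) - k + 1):
        if sentence[start:start + k] == phrase:
            return range(start, start + k)
    return None


def coverage(pair: PhrasePair, corpus: Sequence[SentencePair]) -> Set[Item]:
    """All corpus words covered by one phrase pair (the nonzero a_ijr and b_ijr)."""
    p, q = pair
    items: Set[Item] = set()
    for j, (e, f) in enumerate(corpus):
        span_e, span_f = first_span(p, e), first_span(q, f)
        if span_e is not None and span_f is not None:  # pair is contained in (e, f)
            items |= {("e", j, r) for r in span_e}
            items |= {("f", j, r) for r in span_f}
    return items


def select_subset(table: Sequence[PhrasePair],
                  corpus: Sequence[SentencePair]) -> Optional[List[PhrasePair]]:
    """Smallest subset T of the table covering every corpus word exactly once.

    Returns None when no such subset exists. Exhaustive search: exponential time.
    """
    universe: Set[Item] = set()
    for j, (e, f) in enumerate(corpus):
        universe |= {("e", j, r) for r in range(len(e))}
        universe |= {("f", j, r) for r in range(len(f))}
    cov = [coverage(x, corpus) for x in table]
    for size in range(len(table) + 1):
        for idx in combinations(range(len(table)), size):
            union = set().union(*(cov[i] for i in idx))
            total = sum(len(cov[i]) for i in idx)
            # Equal total and union sizes mean the chosen covers are disjoint.
            if union == universe and total == len(universe):
                return [table[i] for i in idx]
    return None


# S100: accept a toy training parallel corpus and phrase table S.
corpus = [(("the", "house"), ("das", "Haus")),
          (("the", "car"), ("das", "Auto"))]
S = [(("the",), ("das",)),
     (("house",), ("Haus",)),
     (("car",), ("Auto",)),
     (("the", "house"), ("das", "Haus")),
     (("the", "car"), ("das", "Auto"))]

# S102: select the subset T.  S104: output it.
T = select_subset(S, corpus)
print(T)
```

On this toy input two exactly-once covers exist, one using the three single-word pairs and one using the two two-word pairs; the search returns the latter because it is smaller. No threshold θ or count K is supplied at any point.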

As described above, the information processing apparatus according to this embodiment selects a subset from the phrase table, which is a set of phrase pairs x_i, such that, for each pair (e_j, f_j) of the training parallel corpus, each r-th word e_jr of the source-language sentence e_j is contained in the phrase p_i of any one phrase pair x_i in the subset, where p_i is a substring of e_j, and each r-th word f_jr of the target-language sentence f_j is contained in the phrase q_i of any one phrase pair x_i in the subset, where q_i is a substring of f_j. This makes it possible to obtain a phrase table that includes the phrase pairs contained in the training parallel corpus while having a reduced, small number of phrase pairs.

Moreover, as a result of obtaining a phrase table with fewer phrase pairs, translation generation can be sped up and translation accuracy can be improved by reducing unnecessary phrase pairs. In addition, because the problem is formulated using the concept of partitioning, a phrase table that guarantees that the sentences in the training parallel corpus can be translated is obtained without manually setting a parameter for the size of the phrase table.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

The information processing apparatus 100 described above contains a computer system; when a WWW system is used, the "computer system" also includes a homepage providing environment (or display environment).

Although this specification describes an embodiment in which the program is installed in advance, the program can also be provided stored on a computer-readable recording medium.

Description of symbols
10 Input unit
20 Calculation unit
22 Selection processing unit
30 Output unit
100 Information processing apparatus

Claims

An information processing method in an information processing apparatus that includes selection processing means and that selects a subset from a phrase table, the phrase table being generated in advance from a training parallel corpus that is a set of pairs (e_j, f_j) of a source-language sentence e_j and a target-language sentence f_j that is a translation of the source-language sentence e_j, and being a set of phrase pairs x_i each consisting of a phrase p_i that is a substring of the source-language sentence e_j and a phrase q_i that is a translation of the phrase p_i and a substring of the target-language sentence f_j,
the method comprising a step in which the selection processing means selects a subset from the phrase table such that, for each of the pairs (e_j, f_j) of the training parallel corpus, each r-th word e_jr of the source-language sentence e_j of the pair is contained in the phrase p_i of any one phrase pair x_i in the subset of the phrase table, the phrase p_i being a substring of the source-language sentence e_j, and each r-th word f_jr of the target-language sentence f_j is contained in the phrase q_i of any one phrase pair x_i in the subset of the phrase table, the phrase q_i being a substring of the target-language sentence f_j,
wherein, in the step of selecting a subset from the phrase table, the selection processing means selects the subset from the phrase table according to the following expression.

[Expression not reproduced in this text]

Here, the variable y_i (i = 1, ..., M) is a binary variable indicating whether the phrase pair x_i is included in the subset. a_ijr is a parameter indicating whether the r-th word e_jr of the source-language sentence e_j is covered by the phrase pair x_i; a_ijr = 1 when it is covered and 0 when it is not. b_ijr is a parameter indicating whether the r-th word f_jr of the target-language sentence f_j is covered by the phrase pair x_i; b_ijr = 1 when it is covered and 0 when it is not.

An information processing apparatus that selects a subset from a phrase table, the phrase table being generated in advance from a training parallel corpus that is a set of pairs (e_j, f_j) of a source-language sentence e_j and a target-language sentence f_j that is a translation of the source-language sentence e_j, and being a set of phrase pairs x_i each consisting of a phrase p_i that is a substring of the source-language sentence e_j and a phrase q_i that is a translation of the phrase p_i and a substring of the target-language sentence f_j,
the apparatus comprising selection processing means that selects a subset from the phrase table such that, for each of the pairs (e_j, f_j) of the training parallel corpus, each r-th word e_jr of the source-language sentence e_j of the pair is contained in the phrase p_i of any one phrase pair x_i in the subset of the phrase table, the phrase p_i being a substring of the source-language sentence e_j, and each r-th word f_jr of the target-language sentence f_j is contained in the phrase q_i of any one phrase pair x_i in the subset of the phrase table, the phrase q_i being a substring of the target-language sentence f_j,
wherein the selection processing means selects the subset from the phrase table according to the following expression.

[Expression not reproduced in this text]

Here, the variable y_i (i = 1, ..., M) is a binary variable indicating whether the phrase pair x_i is included in the subset. a_ijr is a parameter indicating whether the r-th word e_jr of the source-language sentence e_j is covered by the phrase pair x_i; a_ijr = 1 when it is covered and 0 when it is not. b_ijr is a parameter indicating whether the r-th word f_jr of the target-language sentence f_j is covered by the phrase pair x_i; b_ijr = 1 when it is covered and 0 when it is not.

A program for causing a computer to execute each step of the information processing method according to claim 1.