JP5555542B2 - Automatic word association apparatus, method and program thereof

Info

Publication number: JP5555542B2
Application number: JP2010116040A
Authority: JP
Inventors: 裕之進藤; 昭典藤野; 昌明永田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2010-05-20
Filing date: 2010-05-20
Publication date: 2014-07-23
Anticipated expiration: 2030-05-20
Also published as: JP2011243087A

Description

The present invention relates to an automatic word association apparatus, and a method and program therefor, that automatically extract word-to-word correspondences from a parallel corpus in which source-language sentences and their target-language translations are paired sentence by sentence.

The correspondence between words in the source language and words in the target language is called word alignment. Conventional word alignment methods are based either on co-occurrence information or on a noisy channel model. In the co-occurrence approach, taking Japanese-English alignment as an example, the Dice coefficient or the mutual information is computed from three counts: the number of times a given Japanese word appears in the parallel corpus, the number of times a given English word appears in the corpus, and the number of times the two appear together in the same sentence pair. The most plausible word correspondences are then extracted from these scores.
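The co-occurrence approach can be illustrated with a minimal Python sketch. It assumes that each word is counted once per sentence pair in which it appears and uses the standard Dice coefficient 2·c(e,f) / (c(e) + c(f)); the function name and the toy corpus are hypothetical and are not taken from the patent.

```python
from collections import Counter
from itertools import product


def dice_scores(sentence_pairs):
    """Return the Dice coefficient of every co-occurring (source, target) word pair.

    sentence_pairs: iterable of (source_tokens, target_tokens).
    Assumption: a word is counted once per sentence pair it appears in.
    """
    src_count, tgt_count, joint_count = Counter(), Counter(), Counter()
    for src_tokens, tgt_tokens in sentence_pairs:
        src_set, tgt_set = set(src_tokens), set(tgt_tokens)
        src_count.update(src_set)
        tgt_count.update(tgt_set)
        # Every source word co-occurs with every target word of the same pair.
        joint_count.update(product(src_set, tgt_set))
    return {
        (s, t): 2.0 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in joint_count.items()
    }


# Toy corpus, invented for illustration only.
corpus = [
    (["watashi", "wa", "gakusei"], ["i", "am", "a", "student"]),
    (["gakusei", "wa", "hashiru"], ["the", "student", "runs"]),
]
scores = dice_scores(corpus)
```

In this toy corpus "gakusei" and "student" always appear together, so their score is the maximum value of 1, whereas "watashi", which occurs only once, scores equally against every word of its single sentence pair. That is the low-frequency weakness discussed later in the description.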

The noisy channel approach assumes that the source language has been stochastically transformed into the target language. For example, with Japanese as the source language and English as the target language, the generation probability of each Japanese word, the translation probability from a Japanese word to an English word, and the alignment probability that captures word reordering are learned from the parallel corpus, and word alignments are extracted from those probabilities.

Both the co-occurrence-based and the noisy-channel-based word alignment methods are described in, for example, Non-Patent Document 1.

Non-Patent Document 1: F. J. Och and H. Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Conventional automatic word association apparatuses have two problems. The first is that low-frequency words are aligned with poor accuracy. Alignment methods based on statistical information judge a source-language word and a target-language word more likely to correspond the more often they co-occur in sentence pairs. For a word that appears only once in the parallel corpus, however, every word in the paired sentence is judged equally likely to be its translation. The fewer times a word occurs in the parallel corpus, therefore, the harder it is to determine its translation.

The second problem is the diversity of word meanings. For example, the English word "head" has several meanings, such as 「会長」 (chairman) and 「頭部」 (the head of the body), and which one is meant depends on the context. Unless the context in which "head" is used is taken into account, the word may be aligned incorrectly.

The present invention has been made in view of these problems. Its object is to provide an automatic word association apparatus, method, and program that build a synonym dictionary model incorporating topics, that is, subject areas such as "company" or "economy", and that learn word alignment using this synonym dictionary model together with a conventional word association model.

The automatic word association apparatus of the present invention comprises a training data storage unit, an alignment probability learning unit, and an automatic association unit. The training data storage unit holds a parallel corpus made up of word-segmented source-language and target-language sentence pairs, and a synonym dictionary that is a set of synonym pairs in the target language. The alignment probability learning unit learns, for each topic, the parameters that maximize the weighted sum of the log likelihood of the probability model of the parallel corpus data and the log likelihood of the synonym dictionary probability model of the synonym dictionary data. The automatic association unit takes a target translation sentence and the learned parameters as input and generates the alignment between the source-language and target-language words of that sentence.

Because its alignment probability learning unit learns the parameters that maximize the weighted sum of the log likelihood of the parallel corpus and the log likelihood of the synonym dictionary, the automatic word association apparatus of the present invention can improve the accuracy of automatic word association.

FIG. 1 is a diagram showing an example functional configuration of the automatic word association apparatus 100 of the present invention.
FIG. 2 is a diagram showing the operation flow of the automatic word association apparatus 100.
FIG. 3 is a diagram showing an example functional configuration of the alignment probability learning unit 20.
FIG. 4 is a diagram showing the operation flow of the alignment probability learning unit 20.
FIG. 5 is a diagram showing an example of the topic-specific source-language word generation probability table.
FIG. 6 is a diagram showing an example of the topic-specific word translation probability table.
FIG. 7 is a diagram showing an example of the probability table of the synonym dictionary.

Embodiments of the present invention are described below with reference to the drawings. Identical elements in different drawings carry the same reference numerals, and their description is not repeated. Before the embodiments are described, the basic concept of the invention is explained.
[Basic concept of the invention]
The invention is novel in that it applies probability to a synonym dictionary and further introduces topic information into that dictionary. The "meaning" u of a synonym is expressed as the combination u = (k, e) of a topic k and a source-language word e. Using the parameter set Θ ≡ ({α_k}, {β_{k,e}}, {B_{f,e,k}}, {T_{i,i'}}) disclosed in Reference 1 (Zhao, B. and Xing, E.P. 2007. HM-BiTAM: Bilingual topic exploration, word alignment, and translation. Twenty-second annual conference on neural information processing systems, Vancouver BC, Canada), the synonym dictionary probability model p_m(D_m; Θ) is calculated as shown in Equation (1).

The synonym dictionary probability model p_m(D_m; Θ) is defined as a value proportional to the sum, over all topics k and source-language words e, of the product of the probabilities associated with the source-language word e, the target-language word f_s, and its synonym f'_s. Here, α_k is the parameter of the probability model that generates the topic mixture ratio for topic k, β_{k,e} is the source-language word generation probability, B_{f_s,e,k} is the word translation probability, and B_{f'_s,e,k} is the synonym probability.
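The verbal definition above can be made concrete with a short Python sketch. It reads the definition as an unnormalized score Σ_k Σ_e α_k · β_{k,e} · B_{f_s,e,k} · B_{f'_s,e,k} for a single synonym pair; this reading, the nested-dictionary layout `B[k][e][f]`, the function name, and all numeric values are assumptions made for illustration, since Equation (1) itself is not reproduced in this text.

```python
def synonym_pair_score(f, f_prime, alpha, beta, B):
    """Unnormalized score of the synonym pair (f, f_prime).

    alpha[k]      : weight of topic k
    beta[k][e]    : generation probability of source word e under topic k
    B[k][e][f]    : translation probability of target word f given (e, k)
    The score sums alpha * beta * B(f) * B(f_prime) over all (k, e).
    """
    total = 0.0
    for k, alpha_k in enumerate(alpha):
        for e, beta_ke in beta[k].items():
            b = B[k].get(e, {})
            total += alpha_k * beta_ke * b.get(f, 0.0) * b.get(f_prime, 0.0)
    return total


# Hypothetical two-topic parameters (topic 0: company, topic 1: body).
alpha = [0.5, 0.5]
beta = [{"kaichou": 1.0}, {"toubu": 1.0}]
B = [
    {"kaichou": {"head": 0.5, "chief": 0.5}},
    {"toubu": {"head": 0.6, "forefront": 0.4}},
]

score_head_chief = synonym_pair_score("head", "chief", alpha, beta, B)
score_chief_forefront = synonym_pair_score("chief", "forefront", alpha, beta, B)
```

With these toy values, "head" and "chief" share a meaning (k, e) and receive a positive score, while "chief" and "forefront" share none and score zero. Two target words are treated as synonyms to the extent that some topic and source word translates to both.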

The probability model p(D_b; Θ) of the parallel data, on the other hand, is calculated using Equation (2), which is used in the prior art.

Here, E is a source-language sentence, F is a target-language sentence, z is a topic, θ is the topic mixture ratio, and a is the word alignment; z, θ, and a are determined stochastically according to the value of the parameter Θ. T_{i,i'} is the parameter of the probability model that generates the alignment a.

The log likelihood of the parallel data probability model p(D_b; Θ) is calculated by Equation (4).

Because the posterior probability p(z, θ, a) of the word alignment a and the topic cannot be obtained analytically, it is approximated as p(z, θ, a) ≈ q(θ|γ) q(z|φ) q(a|λ), where q(θ|γ) is assumed to be a Dirichlet distribution, q(z|φ) a multinomial distribution, and q(a|λ) a first-order hidden Markov model. Equations (2) to (4) are all prior art.

As noted above, the invention is novel in applying probability to the synonym dictionary and introducing topic information into it. The log likelihood of the synonym dictionary probability model p_m(D_m; Θ) of Equation (1) is calculated by Equation (5).

The alignment probability learning unit of the invention learns the parameter Θ that maximizes the weighted sum log L(Θ) = log p(D_b; Θ) + ζ log p_m(D_m; Θ) of the log likelihood of the parallel data probability model p(D_b; Θ) and the log likelihood of the synonym dictionary probability model p_m(D_m; Θ). A lower bound on log L(Θ) is calculated by Equation (6).

Here, ζ is the weight given to the synonym dictionary probability model p_m(D_m; Θ). The automatic word association apparatus of the invention learns the parameters that maximize log L(Θ).

FIG. 1 shows an example functional configuration of the automatic word association apparatus 100 of the present invention, and FIG. 2 shows its operation flow. The automatic word association apparatus 100 comprises a training data storage unit 10, an alignment probability learning unit 20, and an automatic association unit 30.

The training data storage unit 10 holds a parallel corpus 11, made up of word-segmented source-language and target-language sentence pairs, and a synonym dictionary 12, which is a set of synonym pairs in the target language. The alignment probability learning unit 20 learns, for each topic, the parameters that maximize the weighted log likelihood of the parallel data and the synonym dictionary (step S20). The automatic association unit 30 takes the target translation sentence X and the parameter Θ as input and generates the word alignment of the target translation sentence X (step S40).

The automatic word association apparatus 100 automatically estimates word alignments from parallel sentences. The functions of the units described above are realized when a predetermined program is loaded into a computer comprising, for example, a ROM, a RAM, and a CPU, and the CPU executes that program.

The parallel corpus 11 is data made up of source-language and target-language sentence pairs. The synonym dictionary 12 is a set of synonym pairs in the target language. When the target language is Japanese, for example, it is dictionary data that collects synonym pairs such as 「二酸化炭素」 and 「炭酸ガス」 (both meaning carbon dioxide) or 「学生」 and 「生徒」 (both meaning student).

The alignment probability learning unit 20 learns the parameter Θ that maximizes the weighted sum log L(Θ) = log p_b(D_b; Θ) + ζ log p_m(D_m; Θ) of the log likelihood of the parallel corpus 11 and the log likelihood of the synonym dictionary 12 (step S20). Here, D_b and D_m denote the data of the parallel corpus 11 and of the synonym dictionary 12, respectively, both prepared as training data. p_b(D_b; Θ) is the probability model of the parallel corpus 11 parameterized by Θ and takes a value between 0 and 1. Likewise, p_m(D_m; Θ) is the probability model of the synonym dictionary 12. ζ is a weight that sets the relative importance of the parallel data D_b and the synonym dictionary data D_m. The learned parameters are stored in a memory (not shown) inside the alignment probability learning unit 20.
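As a small illustration of this objective, the Python sketch below assumes that both data sets factorize over independent items, so that each log likelihood is a sum of per-item logs. The function name and the likelihood values are hypothetical; in the apparatus itself these quantities come from the probability models p_b and p_m.

```python
import math


def combined_objective(pair_likelihoods, synonym_likelihoods, zeta):
    """Return log p_b(D_b) + zeta * log p_m(D_m).

    pair_likelihoods    : per-sentence-pair likelihoods under p_b (each in (0, 1])
    synonym_likelihoods : per-synonym-pair likelihoods under p_m (each in (0, 1])
    zeta                : weight of the synonym dictionary term
    Assumption: items are independent, so the logs simply add up.
    """
    log_p_b = sum(math.log(p) for p in pair_likelihoods)
    log_p_m = sum(math.log(p) for p in synonym_likelihoods)
    return log_p_b + zeta * log_p_m


# Invented likelihood values for two sentence pairs and one synonym pair.
baseline = combined_objective([0.5, 0.5], [0.25], zeta=0.0)
regularized = combined_objective([0.5, 0.5], [0.25], zeta=1.0)
```

Setting ζ to 0 removes the dictionary term and leaves only the parallel corpus likelihood, while larger values of ζ make the learner favor parameters that also explain the synonym pairs.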

The probability models p_b(D_b; Θ) and p_m(D_m; Θ) are formulated by treating information that cannot be observed in the training data as a latent variable Z. In this invention, for example, a probability model is used in which the word correspondence a and the topic z are introduced as latent variables. The probability model of the parallel data is therefore p_b(D_b; Θ) = Σ_{Z_b} p_b(D_b, Z_b; Θ), with Z_b = (a, z). Similarly, the probability model of the synonym dictionary data is p_m(D_m; Θ) = Σ_{Z_m} p_m(D_m, Z_m; Θ).

The automatic association unit 30 uses the parameter Θ learned by the alignment probability learning unit 20 to estimate the correspondence between the source language and the target language of the target translation sentence X (step S40).

Operating as described above, the automatic word association apparatus 100 builds a synonym dictionary model that incorporates topics. By learning word alignment with this synonym dictionary model and a conventional word association model at the same time, it can perform automatic word association with high accuracy.

For example, the information that 「炭酸ガス」 and 「二酸化炭素」 are synonyms makes it possible to align 「炭酸ガス」 with the same translation as 「二酸化炭素」. Even if 「炭酸ガス」 hardly appears in the parallel data, as long as 「二酸化炭素」 appears there often and is aligned with the correct translation "carbon dioxide", 「炭酸ガス」 can also be aligned with the correct translation "carbon dioxide".

Introducing topic information also addresses the problem of polysemy. For example, the word "head" is a synonym of both "forefront" and "chief", yet "forefront" and "chief" differ in meaning; what "head" means depends on the context. With a synonym dictionary model that incorporates the concept of topics, the apparatus learns automatically, from the topic of the whole sentence, which of the two is the synonym of "head". Synonyms can thus be used in learning word alignment, and an improvement in alignment accuracy can be expected.

FIG. 3 shows a specific example of the functional configuration of the alignment probability learning unit 20, with reference to which the operation of the automatic word association apparatus 100 is described in more detail. The alignment probability learning unit 20 comprises a reference value calculation unit 21, a word alignment probability calculation unit 22, a synonym dictionary probability calculation unit 23, a parameter update unit 24, and a convergence determination unit 25.

The alignment probability learning unit 20 obtains the optimal parameter Θ̂ by updating the parameter Θ of the probability models p_b(D_b; Θ) and p_m(D_m; Θ) iteratively until it converges. FIG. 4 shows the operation flow.

The reference value calculation unit 21 reads the parallel data D_b and the synonym dictionary data D_m stored in the training data storage unit 10 and sets the initial value Θ^(0) (step S21).

The word alignment probability calculation unit 22 first takes the initial parameter value Θ^(0) as input and calculates the posterior probability p_b(Z_b | D_b; Θ^(0)) of the latent variable Z. Thereafter it calculates the posterior probability p_b(Z_b | D_b; Θ^(t)) from the parameter Θ^(t) supplied by the convergence determination unit 25 (step S22). Θ^(t) denotes the parameter obtained at the t-th update step.

The synonym dictionary probability calculation unit 23 calculates the posterior probability p_m(Z_m | D_m; Θ^(t)) of the latent variable Z, first from the initial parameter value Θ^(0) and then from the parameter Θ^(t) being updated (step S23).

The parameter update unit 24 takes as input the posterior probability p_b(Z_b | D_b; Θ^(t)) calculated by the word alignment probability calculation unit 22 and the posterior probability p_m(Z_m | D_m; Θ^(t)) calculated by the synonym dictionary probability calculation unit 23. It refers to the word translation probabilities recorded in the topic-specific word translation probability table 241, the word generation probabilities recorded in the topic-specific source-language word generation probability table 242, the synonym dictionary probabilities recorded in the synonym dictionary probability table 243, the word reordering probabilities from the source language to the target language, and the generation probability of the topic mixture ratio. It then estimates the source-language word generation probability p(E_n | z_n; β), the source-to-target word translation probability p(F_n | E_n, z_n, a_n; B), the topic z_n of each training sentence pair, the topic mixture ratio θ over the whole training data, and the word alignment a between the source and target languages, and calculates a new parameter Θ^(t+1) from these estimates (step S24).

FIG. 5 shows an example of the topic-specific source-language word generation probability table 242. FIG. 6 shows an example of the topic-specific word translation probability table 241. FIG. 7 shows an example of the synonym dictionary probability table 243.

The convergence determination unit 25 calculates the log likelihood log L(Θ^(t+1)) for the parameter Θ^(t+1) supplied by the parameter update unit 24. If the convergence condition log L(Θ^(t+1)) - log L(Θ^(t)) < ε is satisfied, it sets the parameter estimate to Θ̂ ← Θ^(t+1). If the condition is not satisfied, it advances the update step with t ← t+1 and outputs the updated parameter Θ^(t+1) again to the word alignment probability calculation unit 22 and the synonym dictionary probability calculation unit 23 (step S25, not converged).
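The control flow of steps S21 to S25 can be sketched in Python as a generic iterate-until-converged loop. The `update` and `log_likelihood` callables are placeholders for the posterior calculations and parameter updates of units 22 to 24 and for the objective log L(Θ); the toy objective used here is invented purely so that the loop can run and is not the patent's model.

```python
def fit(theta0, update, log_likelihood, eps=1e-6, max_iter=1000):
    """Iterate theta <- update(theta) until the log-likelihood gain drops below eps.

    Returns the final parameter and the number of update steps performed.
    max_iter is a safety cap added for this sketch.
    """
    theta = theta0                      # step S21: initial value
    previous = log_likelihood(theta)
    for t in range(max_iter):
        theta_next = update(theta)      # steps S22-S24: posteriors and new parameters
        current = log_likelihood(theta_next)
        if current - previous < eps:    # step S25: convergence test
            return theta_next, t + 1
        theta, previous = theta_next, current
    return theta, max_iter


# Toy stand-ins: a concave objective with its maximum at theta = 3.
def toy_log_likelihood(theta):
    return -(theta - 3.0) ** 2


def toy_update(theta):
    return theta + 0.5 * (3.0 - theta)  # moves halfway towards the maximum


theta_hat, steps = fit(0.0, toy_update, toy_log_likelihood)
```

The convergence test compares consecutive log-likelihood values, as in the description, rather than the parameters themselves. This choice suits models whose parameters are large probability tables but whose objective is a single scalar.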

Steps S22 to S25 are repeated until the convergence condition is satisfied. The automatic association unit 30 then uses the learned optimal parameter Θ̂ to extract the optimal alignment â_n between the source language and the target language of the target translation sentence X (Equation (7)).
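Equation (7) is not reproduced in this text. Assuming it selects the most probable alignment under a first-order hidden Markov model, which is the form assumed for q(a|λ) in the description, Viterbi decoding is one common way to carry out that extraction. The Python sketch below is illustrative only: `emit` stands in for topic-conditioned translation probabilities, `trans` for the reordering probabilities T, and `init` for an initial-position distribution, and all three names and their values are assumptions.

```python
import math


def viterbi_alignment(emit, trans, init):
    """Return the most probable alignment a, where a[j] = i means that
    target word j is aligned to source word i.

    emit[j][i]       : p(target word j | source word i)
    trans[i_prev][i] : p(a_j = i | a_{j-1} = i_prev)
    init[i]          : p(a_0 = i)
    """
    def lg(p):
        return math.log(p) if p > 0.0 else float("-inf")

    num_src = len(init)
    score = [lg(init[i]) + lg(emit[0][i]) for i in range(num_src)]
    backpointers = []
    for j in range(1, len(emit)):
        new_score, pointers = [], []
        for i in range(num_src):
            best_prev = max(range(num_src),
                            key=lambda ip: score[ip] + lg(trans[ip][i]))
            new_score.append(score[best_prev] + lg(trans[best_prev][i]) + lg(emit[j][i]))
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state.
    alignment = [max(range(num_src), key=lambda i: score[i])]
    for pointers in reversed(backpointers):
        alignment.append(pointers[alignment[-1]])
    alignment.reverse()
    return alignment


# Toy example: three target words whose best matches appear in reverse source order.
emit = [[0.1, 0.1, 0.8], [0.1, 0.8, 0.1], [0.8, 0.1, 0.1]]
uniform = [1.0 / 3.0] * 3
best = viterbi_alignment(emit, [uniform] * 3, uniform)
```

With uniform transitions the decoder simply follows the strongest translation probability at each position; non-uniform `trans` values would let a preference for particular jumps between source positions influence the result.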

Each of the processes above is now described in more detail with specific examples. In the word alignment problem, a target-language sentence is considered to be generated by stochastically transforming a source-language sentence. The parallel data is given as a set of source-language and target-language sentence pairs (Equation (8)), and the synonym dictionary is given as a set of synonym pairs in the target language (Equation (9)).

Here, E_n is the n-th source-language sentence in the parallel data and F_n is the n-th target-language sentence. (f_s, f'_s) denotes a synonym pair in the target language.

The following description assumes that HM-BiTAM, disclosed in Reference 1 above, is used as the existing word alignment technique. HM-BiTAM provides a probability model of a set of source-language and target-language sentence pairs. The latent variable Z_b = (z, a) consists of the word alignment a and the topic z. HM-BiTAM also places a prior distribution on the topic mixture ratio θ, which is the parameter of the topic z. The mixture ratio θ therefore does not appear explicitly during model training; the parameters of the prior (hyperparameters) are estimated instead, so for convenience one may write Z_b = (z, a, θ). The alignment variable a_{j_n} = i indicates that, in sentence pair n, the j_n-th word of the target language corresponds to the i-th word of the source language. One topic z is assigned to each sentence pair (E_n, F_n). The topic mixture ratio θ is a frequency distribution giving the generation probability of each topic.

Next, the synonym dictionary probability model of the invention is described. The synonym dictionary probability model assigns a probability to the synonym dictionary, which is a set of synonym pairs (f_s, f'_s) in the target language. Synonyms are different expressions of the same "meaning" of a word, and are defined as shown in Equation (10).

Here, u denotes the meaning of a synonym. In this embodiment, the meaning u is expressed as the combination of a topic k and a source-language word e, that is, u = (k, e). FIG. 7 shows an example of the synonym dictionary probability table expressed by the probability model of the synonym dictionary (the synonym dictionary probability table 243 in FIG. 3).

Under these assumptions, the probability model p(D_b; Θ) of the parallel data is given by Equation (2) above, and the synonym dictionary probability model p_m(D_m; Θ) by Equation (1). The alignment probability learning unit 20 learns the parameter Θ that maximizes the weighted sum log L(Θ) = log p(D_b; Θ) + ζ log p_m(D_m; Θ) of the log likelihood of the parallel data probability model and the log likelihood of the synonym dictionary probability model. A lower bound on log L(Θ) is calculated by Equation (6) above.

The word alignment probability calculation unit 22 obtains the parameters that maximize Equation (6): the parameter γ_k of the probability model of the topic mixture ratio θ by Equation (11), the parameter φ_{n,k} of the probability model of the topic z by Equation (12), and the parameter λ_{n,j,i} of the probability model of the word alignment a by Equation (13).

Because the latent variable is Z_m = (k, e), the synonym dictionary probability calculation unit 23 calculates the posterior probability p_m(k, e | D_m; Θ^(t)) (Equation (14)).
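Equation (14) is not reproduced in this text. If the synonym model is read as a sum over meanings (k, e) of α_k · β_{k,e} · B_{f,e,k} · B_{f',e,k}, then the posterior of a meaning for one synonym pair is simply its term divided by that sum. The Python sketch below follows that assumed reading; the function name, the data layout `B[k][e][f]`, and the numbers are hypothetical.

```python
def synonym_posterior(f, f_prime, alpha, beta, B):
    """Posterior over meanings (k, e) for one synonym pair (f, f_prime).

    Each meaning's weight is alpha[k] * beta[k][e] * B[k][e][f] * B[k][e][f_prime];
    the weights are normalized to sum to 1. Returns {} if no meaning explains the pair.
    """
    weights = {}
    for k, alpha_k in enumerate(alpha):
        for e, beta_ke in beta[k].items():
            b = B[k].get(e, {})
            w = alpha_k * beta_ke * b.get(f, 0.0) * b.get(f_prime, 0.0)
            if w > 0.0:
                weights[(k, e)] = w
    total = sum(weights.values())
    if total == 0.0:
        return {}
    return {meaning: w / total for meaning, w in weights.items()}


# Hypothetical parameters: topic 0 (company) and topic 1 (body).
alpha = [0.5, 0.5]
beta = [{"kaichou": 1.0}, {"toubu": 1.0}]
B = [
    {"kaichou": {"head": 0.5, "chief": 0.5}},
    {"toubu": {"head": 0.6, "forefront": 0.3, "chief": 0.1}},
]
posterior = synonym_posterior("head", "chief", alpha, beta, B)
```

For the pair ("head", "chief") most of the posterior mass falls on the company topic, which is how topic information can separate the senses of "head" in this kind of model.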

Here, β_{k,e} is the topic-specific generation probability of a source-language word and is calculated by Equation (15). B_{f,e,k} is the topic-specific word translation probability from the source language to the target language and is calculated by Equation (16). α_k is the parameter of the probability model of the topic mixture ratio θ and is calculated by Equation (17). T_{i',i} is the word reordering probability between the source language and the target language and is calculated by Equation (18).

The parameters in Equations (11) to (18) are obtained by taking the partial derivative of Equation (6), which expresses the log likelihood log L(Θ) of the training data, with respect to each parameter.

The convergence determination unit 25 determines whether the updated parameter Θ^(t+1) has converged. For example, it compares the log likelihood of the training data before and after the update and judges that convergence has been reached if the difference is less than ε (Equation (19)).

[Experimental results]
An evaluation experiment was conducted to confirm the effect of the automatic word association method of the invention. The parallel data was the Hansards data set described in Reference 2 (R. Mihalcea and T. Pedersen. 2003. An evaluation exercise for word alignment. In Proceedings of the HLT-NAACL 2003 Workshop on Building and using parallel texts: data driven machine translation and beyond-Volume 3, page 10. Association for Computational Linguistics.), which is widely used for word alignment problems. It is an English-French parallel corpus. WordNet 2.1, described in Reference 3 (G.A. Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):41.), was used as the synonym dictionary.

As evaluation data, 10,000 parallel sentence pairs were randomly extracted from the Hansards dataset. Among the synonyms listed in WordNet, the synonym pairs that appear at least once in the evaluation parallel dataset were used as the synonym dictionary. Using these heterogeneous data and the synonym dictionary as training data, word alignment between English and French was estimated and evaluated. The results are shown in Table 1. As evaluation metrics, precision, recall, F-measure, and AER (Alignment Error Rate), which are commonly used for word alignment, were used.
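
The text names these four metrics without defining them. The sketch below uses the definitions commonly applied to word alignment data annotated with "sure" and "possible" gold links; that annotation scheme, the function name `alignment_scores`, and the representation of links as (source index, target index) pairs are assumptions made for illustration.

```python
from typing import Dict, Iterable, Set, Tuple

Link = Tuple[int, int]  # (source word index, target word index)


def alignment_scores(
    predicted: Iterable[Link],
    sure: Iterable[Link],
    possible: Iterable[Link],
) -> Dict[str, float]:
    """Compute precision, recall, F-measure and AER for one alignment.

    Assumed convention: `sure` links are unambiguous gold links and
    `possible` links are acceptable alternatives; sure links are also
    counted as possible.
    """
    a: Set[Link] = set(predicted)
    s: Set[Link] = set(sure)
    p: Set[Link] = set(possible) | s

    precision = len(a & p) / len(a) if a else 0.0
    recall = len(a & s) / len(s) if s else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom > 0 else 0.0
    # AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|)
    total = len(a) + len(s)
    aer = 1.0 - (len(a & s) + len(a & p)) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "aer": aer,
    }


if __name__ == "__main__":
    scores = alignment_scores(
        predicted=[(0, 0), (1, 1), (2, 3)],
        sure=[(0, 0), (1, 1)],
        possible=[(2, 2)],
    )
    print(scores)
```

In practice the counts would be accumulated over all evaluated sentence pairs before the ratios are taken; the single-sentence form here is kept short for readability.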

The precision of the automatic word association method of the present invention was the highest, at 0.941, which confirms the effect of the present invention.

When the processing means of the above apparatus are implemented by a computer, the processing contents of the functions that each apparatus should have are described by a program. By executing this program on the computer, the processing means of each apparatus are realized on the computer.

The program describing these processing contents can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example, a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), CD-R (Recordable)/RW (ReWritable), or the like as the optical disc; an MO (Magneto Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers via a network.

Each means may be configured by executing a predetermined program on a computer, or at least part of the processing contents may be implemented in hardware.

Claims

An automatic word association apparatus comprising:
a training data storage unit that stores a parallel sentence corpus composed of pairs of word-segmented parallel sentences in a source language and a target language, and a synonym dictionary that is a set of synonyms of the target language;
an alignment probability learning unit that learns a parameter maximizing the weighted sum of the log likelihood of the parallel sentence data probability model for the parallel sentence data of the parallel sentence corpus and the log likelihood of the synonym dictionary probability model for the synonym dictionary data of the synonym dictionary; and
an automatic association unit that takes a target parallel sentence and the parameter as input and generates an alignment between the source-language and target-language words of the target parallel sentence.

The automatic word association apparatus according to claim 1,
wherein the synonym dictionary probability model
includes a generation probability for each meaning of a synonym pair.

The automatic word association apparatus according to claim 1,
wherein the synonym dictionary probability model of the target language,
on the assumption that the meaning of a synonym can be expressed by a combination of topics of the source-language words used in word alignment, is defined as a value proportional to the sum-product of the topic-specific word translation probability, the source-language word generation probability, the generation probability of the topic mixing ratio, and the synonym probability.

The automatic word association apparatus according to claim 1 or 3,
wherein the alignment probability learning unit
refers to the word translation probabilities recorded in a topic-specific word translation probability table, the word generation probabilities recorded in a topic-specific source-language word generation probability table, the synonym dictionary probabilities recorded in a synonym dictionary table, the word-order reordering probabilities from the source language to the target language, and the generation probability of the topic mixing ratio, and estimates the topics of the parallel sentences of the training data, the topic mixing ratio of the entire training data, and the alignment between source-language and target-language words,
updates the parameter Θ based on the estimated values, and comprises:
a reference value calculation unit that reads the parallel sentence data and the synonym dictionary data and calculates a reference value of the parameter Θ that can be expressed as the weighted sum of the log likelihood of the parallel sentence data probability model and the log likelihood of the synonym dictionary probability model;
a word alignment probability calculation unit that calculates the posterior probability of the latent variables included in the parallel sentence data probability model, given the reference value of the parameter Θ and the current parameter Θ^(t);
a synonym dictionary probability calculation unit that calculates the posterior probability of the latent variables included in the synonym dictionary probability model, given the reference value of the parameter Θ and the current parameter Θ^(t);
a parameter update unit that calculates a new parameter Θ^(t+1) from the current parameter Θ^(t); and
a convergence determination unit that calculates the log likelihood using the parameter Θ^(t+1) and determines whether the parameter Θ^(t+1) satisfies the convergence condition as the optimal parameter estimate Θ^.

An automatic word association method including:
an alignment probability learning step in which an alignment probability learning unit learns a parameter maximizing the weighted sum of the log likelihood of the parallel sentence data probability model for a parallel sentence corpus composed of pairs of word-segmented parallel sentences in a source language and a target language, and the log likelihood of the synonym dictionary probability model for the synonym dictionary data of a synonym dictionary that is a set of synonyms of the target language; and
an automatic association step in which an automatic association unit takes a target parallel sentence and the parameter as input and generates an alignment between the source-language and target-language words of the target parallel sentence.

The automatic word association method according to claim 5,
wherein the synonym dictionary probability model
includes a generation probability for each meaning of a synonym pair.

The automatic word association method according to claim 5,
wherein the synonym dictionary probability model of the target language,
on the assumption that the meaning of a synonym can be expressed by a combination of topics of the source-language words used in word alignment, is defined as a value proportional to the sum-product of the topic-specific word translation probability, the source-language word generation probability, the generation probability of the topic mixing ratio, and the synonym probability.

The automatic word association method according to claim 5 or 7,
wherein the alignment probability learning step
refers to the word translation probabilities recorded in a topic-specific word translation probability table, the word generation probabilities recorded in a topic-specific source-language word generation probability table, the synonym dictionary probabilities recorded in a synonym dictionary table, the word-order reordering probabilities from the source language to the target language, and the generation probability of the topic mixing ratio, and estimates the topics of the parallel sentences of the training data, the topic mixing ratio of the entire training data, and the alignment between source-language and target-language words,
updates the parameter Θ based on the estimated values, and includes:
a reference value calculation step in which a reference value calculation unit reads the parallel sentence data and the synonym dictionary data and calculates a reference value of the parameter Θ that can be expressed as the weighted sum of the log likelihood of the parallel sentence data probability model and the log likelihood of the synonym dictionary probability model;
a word alignment probability calculation step of calculating the posterior probability of the latent variables included in the parallel sentence data probability model, given the reference value of the parameter Θ and the current parameter Θ^(t);
a synonym dictionary probability calculation step of calculating the posterior probability of the latent variables included in the synonym dictionary probability model, given the reference value of the parameter Θ and the current parameter Θ^(t);
a parameter update step in which a parameter update unit calculates a new parameter Θ^(t+1) from the current parameter Θ^(t); and
a convergence determination step in which a convergence determination unit calculates the log likelihood using the parameter Θ^(t+1) and determines whether the parameter Θ^(t+1) satisfies the convergence condition as the optimal parameter estimate Θ^.

A program for causing a computer to execute the functions of each unit of the automatic word association apparatus according to any one of claims 1 to 4.