JP5791097B2 - Parallel translation phrase learning apparatus, phrase-based statistical machine translation apparatus, parallel phrase learning method, and parallel phrase production method

Info

Publication number: JP5791097B2
Application number: JP2011047588A
Authority: JP
Inventors: Graham Neubig; Taro Watanabe; Eiichiro Sumita
Original assignee: National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology
Priority date: 2011-03-04
Filing date: 2011-03-04
Publication date: 2015-10-07
Anticipated expiration: 2031-03-04
Also published as: JP2012185622A

Description

The present invention relates to a parallel phrase learning apparatus and the like for learning parallel (bilingual) phrases.

In a first conventional method, word alignment is followed by exhaustive heuristic phrase-level alignment to create a phrase table for phrase-based statistical machine translation (see, for example, Non-Patent Document 1). In this first method, long phrases resolve lexical ambiguity while short phrases cope with sparse data.

In contrast to this heuristic (first) method, methods have been proposed that compute phrase-level alignments directly from parallel data (see, for example, Non-Patent Documents 2 to 5). The methods of Non-Patent Documents 2 and 3 compute phrases exhaustively, whereas the methods of Non-Patent Documents 4 and 5 use the constraints of Inversion Transduction Grammar (ITG).

The methods of Non-Patent Documents 3 to 5 also make it possible to extract short phrases by using a stochastic process based on nonparametric Bayesian methods to assign high probability to concise phrases.

More specifically, ITG is a kind of synchronous context-free grammar, characterized by reordering words when nonterminal symbols are generated. Using the ITG constraint reduces the amount of computation, so that the maximum-likelihood alignment and the marginal probabilities can be computed in polynomial time. In ITG, the generation probability of a phrase pair is written P_flat(<e,f>; θ_x, θ_t) and is parameterized by the phrase pair probability θ_t and the symbol probability θ_x. The conventional ITG model uses the following generative process. Here, <e,f> denotes a pair of a phrase e in a first language (for example, English) and a phrase f in a second language (for example, Japanese).

That is, in the conventional ITG model, a symbol x is first generated according to the multinomial distribution P_x(x; θ_x). The possible values of x are term, reg, and inv, where term is a terminal symbol, reg is a regular nonterminal symbol, and inv is an inverted nonterminal symbol.

Second, depending on the value of x: when x = term (terminal symbol), a phrase pair is generated according to the phrase pair probability P_t(<e,f>; θ_t). When x = reg (regular nonterminal symbol), two phrase pairs <e_1,f_1> and <e_2,f_2> are generated according to P_flat and merged into the single phrase pair <e_1 e_2, f_1 f_2>. When x = inv (inverted nonterminal symbol), two phrase pairs <e_1,f_1> and <e_2,f_2> are likewise generated according to P_flat, but f_1 and f_2 are concatenated in reverse order to obtain <e_1 e_2, f_2 f_1>.
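The merging step for the reg and inv symbols can be illustrated with a minimal Python sketch. The function name `combine` and the representation of a phrase as a tuple of words are assumptions made for illustration only; they do not appear in the original.

```python
def combine(symbol, pair1, pair2):
    """Merge two phrase pairs under an ITG nonterminal.

    'reg' keeps both language sides in order; 'inv' keeps the
    first-language side in order and reverses the second-language side.
    Each phrase is a tuple of words (an illustrative assumption).
    """
    (e1, f1), (e2, f2) = pair1, pair2
    if symbol == "reg":
        return (e1 + e2, f1 + f2)
    if symbol == "inv":
        return (e1 + e2, f2 + f1)
    raise ValueError(f"unknown nonterminal: {symbol}")


red = (("red",), ("赤い",))
cookbook = (("cookbook",), ("料理本",))
mrs = (("Mrs.",), ("さん",))
smith = (("Smith",), ("スミス",))

print(combine("reg", red, cookbook))  # same order on both sides
print(combine("inv", mrs, smith))     # second-language side reversed
```

The inv case covers pairs such as "Mrs. Smith" and "スミス さん", where the two languages order the same constituents differently.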

Then, by taking the product of P_flat over all sentences, the corpus likelihood can be computed as shown in Equation 1.

The conventional ITG model (referred to as FLAT) can be trained directly by maximum likelihood estimation, but the maximum-likelihood solution yields very long phrase pairs (one phrase per sentence). The problem of long phrases is therefore solved by using a prior P(θ) = P(θ_x, θ_t) that gives high probability to a compact phrase dictionary (see Non-Patent Document 5).

Here, a Dirichlet distribution is used as the prior for θ_x, and a Pitman-Yor process based on nonparametric Bayesian methods (see Non-Patent Document 6) is used for θ_t. The Pitman-Yor process is shown in Equation 2.

In Equation 2, d is the discount parameter of the Pitman-Yor process and s is the strength parameter. The discount parameter d and the strength parameter s are estimated using the technique described in Non-Patent Document 6. P_base in Equation 2 is the base measure described later. The Dirichlet distribution is a known technique, so its description is omitted.

The advantage of using a Pitman-Yor process prior lies in the property of the stochastic process that it memorizes the phrase pairs it has generated. Phrase pairs that are generated frequently from the distribution receive higher probability and thus become even more likely to be generated (the "rich-gets-richer" effect). Learning with the Pitman-Yor process therefore makes it possible to build a phrase table composed of fewer, more useful phrases. Only phrases generated from P_t (the phrase pair probability distribution) are memorized. In FLAT (the conventional ITG model), only the minimal phrase pairs at terminal symbols are generated from P_t, so only minimal phrase pairs are memorized.
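The rich-gets-richer effect can be made concrete with a short Python sketch. Equation 2 itself is not reproduced in this text; the code below uses the standard Chinese-restaurant predictive probability of a Pitman-Yor process with discount d, strength s, and base measure P_base. The names `pitman_yor_prob`, `counts`, and `tables` are assumptions for illustration, not identifiers from the original.

```python
from collections import Counter


def pitman_yor_prob(pair, counts, tables, d, s, p_base):
    """Predictive probability of `pair` under a Pitman-Yor process
    (Chinese-restaurant representation).

    counts[pair]: number of times the pair has been generated
    tables[pair]: number of those times it was drawn from the base measure
    """
    n = sum(counts.values())
    t = sum(tables.values())
    if n == 0:
        # Nothing memorized yet: fall back entirely on the base measure.
        return p_base(pair)
    cached = max(counts[pair] - d * tables[pair], 0.0) / (s + n)
    backoff = (s + d * t) / (s + n) * p_base(pair)
    return cached + backoff


# Toy state: pair A generated three times, pair B once, pair C never.
A, B, C = ("red", "赤い"), ("cookbook", "料理本"), ("book", "本")
counts = Counter({A: 3, B: 1})
tables = Counter({A: 1, B: 1})


def uniform_base(pair):
    return 0.01  # hypothetical constant base probability


for pair in (A, B, C):
    print(pair, pitman_yor_prob(pair, counts, tables, d=0.5, s=1.0, p_base=uniform_base))
```

In this toy state the frequently generated pair A receives a much larger probability than B, and the unseen pair C receives only the base-measure term.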

P_base in Equation 2 is the prior probability of phrase pairs in the model; by choosing it appropriately, prior knowledge about how easily phrases align can be built into the model. Here, P_base chooses with a fixed probability P_u whether to generate an unaligned phrase (|e| = 0 or |f| = 0), generates unaligned phrases from P_bu, and generates aligned phrase pairs from P_ba.
As described in Non-Patent Document 7, P_ba can be calculated by Equation 3 below.

In Equation 3, P_pois is a Poisson distribution with mean length parameter λ. A small value of λ is used in order to avoid long phrases. P_m1 is the IBM Model 1 probability based on word probabilities (see Non-Patent Document 8). With it, a phrase whose constituent words have high translation probabilities also receives a high probability. Taking the geometric mean of the conditional probabilities in both directions preferentially aligns phrases on which both models agree. Also in Equation 3, when the phrase e consists of the words e_1 ... e_n, P_uni(e) is the product of the unigram probabilities of those words.
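The ingredients named above can be sketched in Python. Equation 3 is not reproduced in this text, so the way `p_ba` combines them (a geometric mean of two directional scores, each the product of a unigram probability, a Poisson length probability, and an IBM Model 1 probability) is an assumption based only on the prose. All function names, the NULL source word, and the dictionary-based probability tables are likewise illustrative.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of length k with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def unigram_prob(phrase, p_word):
    """Product of the unigram probabilities of the words in the phrase."""
    return math.prod(p_word[w] for w in phrase)


def model1_prob(target, source, t_table, epsilon=1.0):
    """IBM Model 1 probability P(target | source), with a NULL source word.

    t_table maps (target_word, source_word) to a translation probability.
    """
    src = ("NULL",) + tuple(source)
    prob = epsilon / len(src) ** len(target)
    for w in target:
        prob *= sum(t_table.get((w, s), 0.0) for s in src)
    return prob


def p_ba(e, f, p_e, p_f, t_fe, t_ef, lam):
    """Assumed form of the aligned base measure: geometric mean of the
    e-to-f and f-to-e directional scores."""
    dir_e = unigram_prob(e, p_e) * poisson_pmf(len(e), lam) * model1_prob(f, e, t_fe)
    dir_f = unigram_prob(f, p_f) * poisson_pmf(len(f), lam) * model1_prob(e, f, t_ef)
    return math.sqrt(dir_e * dir_f)
```

Because the Poisson factor shrinks quickly for lengths above a small λ, long phrases are penalized, while the geometric mean keeps the score low unless both translation directions agree.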

For P_bu, let g be whichever of e and f is the non-empty word sequence; the probability is then defined as in Equation 4 below.

In Equation 4, P_bu is divided by 2 so that both e and f are taken into account. Also in Equation 4, when the phrase g consists of the words g_1 ... g_n, P_uni(g) is the product of the unigram probabilities of those words.
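How P_base mixes the unaligned and aligned cases can be sketched as follows. Equation 4 is not reproduced in this text; the sketch keeps only the two factors the prose states (the unigram probability of the non-empty side g and the division by 2) and omits any other factor Equation 4 may contain. The function names and the callable arguments are assumptions for illustration.

```python
def p_bu(e, f, p_uni_e, p_uni_f):
    """Unaligned-phrase probability: unigram probability of the
    non-empty side g, halved so that both e and f are covered."""
    if bool(e) == bool(f):
        return 0.0  # exactly one side must be empty
    g, p_word = (e, p_uni_e) if e else (f, p_uni_f)
    prob = 1.0
    for w in g:
        prob *= p_word[w]
    return prob / 2.0


def p_base(e, f, p_u, p_ba_fn, p_bu_fn):
    """Mixture described in the text: with probability p_u generate an
    unaligned phrase from P_bu, otherwise an aligned pair from P_ba."""
    if e and f:
        return (1.0 - p_u) * p_ba_fn(e, f)
    return p_u * p_bu_fn(e, f)


# Toy usage with hypothetical probabilities.
unaligned = p_bu(("red",), (), {"red": 0.1}, {})
print(unaligned)
print(p_base(("red",), (), 0.2, lambda e, f: 0.3, lambda e, f: unaligned))
print(p_base(("red",), ("赤い",), 0.2, lambda e, f: 0.3, lambda e, f: 0.0))
```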

Non-Patent Document 1: P. Koehn, F. J. Och, and D. Marcu. Statistical phrase-based translation. In Proc. NAACL, pp. 48-54, 2003.
Non-Patent Document 2: Daniel Marcu and William Wong. 2002. A phrase-based, joint probability model for statistical machine translation. pages 133-139.
Non-Patent Document 3: J. DeNero and D. Klein. The complexity of phrase alignment problems. In Proc. ACL, pp. 25-28, 2008.
Non-Patent Document 4: P. Blunsom and T. Cohn. Inducing synchronous grammars with slice sampling. In Proc. NAACL, 2010.
Non-Patent Document 5: H. Zhang, C. Quirk, R. C. Moore, and D. Gildea. Bayesian learning of non-compositional phrases with synchronous parsing. In Proc. ACL, pp. 97-105, 2008.
Non-Patent Document 6: Y. W. Teh. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proc. ACL, 2006.
Non-Patent Document 7: J. DeNero, A. Bouchard-Côté, and D. Klein. Sampling alignment structure under a Bayesian translation model. In Proc. EMNLP, pp. 314-323, 2008.
Non-Patent Document 8: J. DeNero and D. Klein. Discriminative modeling of extraction sets for machine translation. In Proc. ACL, pp. 1453-1463, 2010.

However, the first conventional method does not necessarily obtain the phrases best suited to translation, and a large number of phrases must be extracted to improve translation accuracy.

In addition, maximum likelihood estimation in the method of Non-Patent Document 2 can extract only very long phrase pairs.

Furthermore, as described above, the methods of Non-Patent Documents 3 to 5 memorize only minimal-unit phrases and cannot directly model phrases of multiple granularities. Consequently, after the minimal-unit phrase alignment is computed, phrases are extracted exhaustively by heuristics, so a two-stage method is still required. Moreover, because phrases are extracted exhaustively, inappropriate phrase pairs are also learned.

Specifically, when the conventional ITG model is used, for the phrase pair "Mrs. Smith's red cookbook / スミスさんの赤い料理本", for example, only the minimal-unit phrases "Mrs. / さん", "Smith / スミス", "'s / の", "red / 赤い", and "cookbook / 料理本" can be obtained, as shown in FIG. 11.

The parallel phrase learning apparatus of the first invention comprises: a phrase table capable of storing one or more scored phrase pairs, each having a phrase pair, which is a pair of a first language phrase having one or more words of a first language and a second language phrase having one or more words of a second language, and a score, which is information on the appearance probability of the phrase pair; a phrase appearance frequency information storage unit capable of storing one or more pieces of phrase appearance frequency information, each having a phrase pair and F appearance frequency information, which is information on the appearance frequency of the phrase pair; a symbol appearance frequency information storage unit capable of storing one or more pieces of symbol appearance frequency information, each having a symbol that identifies a method of generating a new phrase pair and S appearance frequency information, which is information on the appearance frequency of the symbol; a generated phrase pair acquisition unit that acquires a phrase pair having a first language phrase and a second language phrase using the one or more pieces of phrase appearance frequency information; a phrase appearance frequency information update unit that, when the phrase pair has been acquired, increases the F appearance frequency information corresponding to the phrase pair by a predetermined value; a symbol acquisition unit that, when the phrase pair could not be acquired, acquires one symbol using the one or more pieces of symbol appearance frequency information; a symbol appearance frequency information update unit that increases the S appearance frequency information corresponding to the symbol acquired by the symbol acquisition unit by a predetermined value; a partial phrase pair generation unit that, when the phrase pair could not be acquired, generates two phrase pairs smaller than the phrase pair that was to be acquired; a new phrase pair generation unit that, according to the symbol acquired by the symbol acquisition unit, performs one of a first process of generating a new phrase pair, a second process of generating two smaller phrase pairs and, using the one or more pieces of phrase appearance frequency information, generating one phrase pair having a new first language phrase in which the two first language phrases of the two generated phrase pairs are concatenated in order and a new second language phrase in which the two second language phrases of the two phrase pairs are concatenated in order, and a third process of generating two smaller phrase pairs and, using the one or more pieces of phrase appearance frequency information, generating one phrase pair having a new first language phrase in which the two first language phrases of the two generated phrase pairs are concatenated in order and a new second language phrase in which the two second language phrases of the two phrase pairs are concatenated in reverse order; a control unit that recursively applies the processing of the phrase appearance frequency information update unit, the symbol acquisition unit, the symbol appearance frequency information update unit, the partial phrase pair generation unit, and the new phrase pair generation unit to the phrase pair generated by the new phrase pair generation unit; a score calculation unit that calculates a score for each phrase pair in the phrase table using the one or more pieces of phrase appearance frequency information stored in the phrase appearance frequency information storage unit; and a phrase table update unit that stores the score calculated by the score calculation unit in association with each phrase pair.

With this configuration, a large number of appropriate phrase pairs can be learned.

The parallel phrase learning apparatus of the second invention is the apparatus of the first invention in which: the generated phrase pair acquisition unit acquires a generated phrase pair having a first language phrase and a second language phrase using a probability distribution of phrase pairs; the symbol acquisition unit, when the phrase pair could not be acquired, acquires one symbol using a probability distribution of symbols; the partial phrase pair generation unit, when the phrase pair could not be acquired, uses a base measure to generate two phrase pairs smaller than the phrase pair that was to be generated; the first process is a process of generating a new phrase pair using the base measure of phrase pairs; and the score calculation unit calculates the score for each phrase pair in the phrase table, based on a nonparametric Bayesian method, using the one or more pieces of phrase appearance frequency information stored in the phrase appearance frequency information storage unit.

The phrase-based statistical machine translation apparatus of the third invention comprises: a phrase table learned by the parallel phrase learning apparatus of the first or second invention; a reception unit that receives a sentence in the first language having one or more words; a phrase acquisition unit that extracts one or more phrases from the sentence received by the reception unit and, using the scores in the phrase table, acquires one or more phrases in the second language from the phrase table; a sentence construction unit that constructs a sentence in the second language from the one or more phrases acquired by the phrase acquisition unit; and an output unit that outputs the sentence constructed by the sentence construction unit.

With this configuration, accurate machine translation is possible using a large number of appropriate phrase pairs.

According to the parallel phrase learning apparatus of the present invention, a large number of appropriate phrase pairs can be learned.

FIG. 1: Block diagram of the parallel phrase learning apparatus in Embodiment 1
FIG. 2: Flowchart explaining the operation of the parallel phrase learning apparatus
FIG. 3: Flowchart explaining the operation of the phrase generation process
FIG. 4: Diagram explaining the phrase pairs that can be learned
FIG. 5: Diagram showing the specifications of the corpus
FIG. 6: Diagram showing the experimental results
FIG. 7: Diagram comparing phrase extraction based on model probabilities with the conventional method
FIG. 8: Block diagram of the phrase-based statistical machine translation apparatus in Embodiment 2
FIG. 9: Overview of the computer system in the above embodiments
FIG. 10: Block diagram of the computer system in the above embodiments
FIG. 11: Diagram explaining the phrase pairs that can be learned in the prior art

Embodiments of the parallel phrase learning apparatus and the like are described below with reference to the drawings. Components given the same reference numerals in the embodiments operate in the same way, so repeated descriptions may be omitted.

(Embodiment 1)
This embodiment describes a parallel phrase learning apparatus that accumulates parallel phrases at multiple levels of a hierarchy.

In this embodiment, a hierarchical ITG model is used so that phrases of multiple granularities are represented directly in the probabilistic model. The parallel phrase learning apparatus of this embodiment can therefore achieve high translation accuracy without heuristic phrase extraction.

FIG. 1 is a block diagram of the parallel phrase learning apparatus 1 in this embodiment. The parallel phrase learning apparatus 1 includes a parallel corpus 100, a phrase table 101, a phrase appearance frequency information storage unit 102, a symbol appearance frequency information storage unit 103, a phrase table initialization unit 113, a generated phrase pair acquisition unit 104, a phrase appearance frequency information update unit 105, a symbol acquisition unit 106, a symbol appearance frequency information update unit 107, a partial phrase pair generation unit 108, a new phrase pair generation unit 109, a control unit 110, a score calculation unit 111, a parsing unit 114, a phrase table update unit 112, and a tree update unit 115.

The parallel corpus 100 can store one or more pieces of parallel translation information, each having a parallel sentence and the tree structure of that parallel sentence. A parallel sentence is a pair of a first language sentence and a second language sentence. The first language sentence is a sentence in the first language, and the second language sentence is a sentence in the second language. Here, "sentence" means one or more words and also includes a phrase. The tree structure of a parallel sentence is information that represents, as a tree, the correspondence between the phrases (including single words) into which the sentences of the two languages are divided. The tree structure of a parallel sentence is, for example, information such as that shown in FIG. 4.

The phrase table 101 can store one or more scored phrase pairs. A scored phrase pair has a phrase pair and a score. A phrase pair is a pair of a first language phrase and a second language phrase. The first language phrase is a phrase having one or more words of the first language, and the second language phrase is a phrase having one or more words of the second language. "Phrase" is interpreted broadly and also includes a sentence. The score is information on the appearance probability of the phrase pair, for example, the phrase pair probability θ_t.

The phrase appearance frequency information storage unit 102 can store one or more pieces of phrase appearance frequency information. Each piece of phrase appearance frequency information has a phrase pair and F appearance frequency information. The F appearance frequency information is information on the appearance frequency of the phrase pair. It is preferably the appearance frequency of the phrase pair itself, but may instead be, for example, the appearance probability of the phrase pair. The initial values of the F appearance frequency information are, for example, all 0.

The symbol appearance frequency information storage unit 103 can store one or more pieces of symbol appearance frequency information. Each piece of symbol appearance frequency information has a symbol and S appearance frequency information. A symbol is information that identifies a method of generating a new phrase pair and is, for example, one of BASE, REG, and INV. Here, BASE is a symbol indicating that a phrase pair is generated from the base measure, REG is the regular nonterminal symbol, and INV is the inverted nonterminal symbol. The S appearance frequency information is information on the appearance frequency of the symbol. It is preferably the appearance frequency of the symbol itself, but may instead be, for example, the appearance probability of the symbol. The initial value of the S appearance frequency information is, for example, 0 for all three symbols.

The phrase table initialization unit 113 generates initial information for one or more scored phrase pairs from the one or more pieces of parallel translation information in the parallel corpus 100 and stores it in the phrase table 101. For example, the phrase table initialization unit 113 obtains, as scored phrase pairs, the phrase pairs that appear in the tree structures of the parallel sentences in the parallel translation information together with their numbers of appearances, and stores them in the phrase table 101. In that case, the score is the number of appearances.

The generated phrase pair acquisition unit 104 acquires each of the one or more parallel sentences stored in the parallel corpus 100 and, for each of the one or more phrase pairs that constitute the tree structure of that parallel sentence, subtracts its occurrences (usually an appearance frequency of 1) from the score of the corresponding phrase pair in the phrase table 101. Next, the generated phrase pair acquisition unit 104 acquires (more precisely, attempts to acquire) a phrase pair having a first language phrase and a second language phrase using the one or more pieces of phrase appearance frequency information. Using the one or more pieces of phrase appearance frequency information may mean, for example, using the phrase pair probability distribution P_t. That is, the generated phrase pair acquisition unit 104 preferably acquires a phrase pair having a first language phrase and a second language phrase using the phrase pair probability distribution P_t.

When the generated phrase pair acquisition unit 104 or the like has acquired a phrase pair, the phrase appearance frequency information update unit 105 increases the F appearance frequency information corresponding to that phrase pair by a predetermined value. Here, the F appearance frequency information is usually the appearance frequency of the phrase pair, and the predetermined value is usually 1. "The generated phrase pair acquisition unit 104 or the like" refers to the generated phrase pair acquisition unit 104 and the new phrase pair generation unit 109.

When the generated phrase pair acquisition unit 104 or the like could not acquire a phrase pair, the symbol acquisition unit 106 acquires one symbol using the one or more pieces of symbol appearance frequency information. Using the one or more pieces of symbol appearance frequency information preferably means using the symbol probability distribution P_x(x; θ_x). That is, when the generated phrase pair acquisition unit 104 could not acquire a generated phrase pair, the symbol acquisition unit 106 preferably acquires one symbol using the probability distribution of symbols. The one symbol is, for example, one of BASE, REG, and INV.

The symbol appearance frequency information update unit 107 increases the S appearance frequency information corresponding to the symbol acquired by the symbol acquisition unit 106 by a predetermined value. The predetermined value is usually 1.

When the generated phrase pair acquisition unit 104 or the like could not acquire a phrase pair, the partial phrase pair generation unit 108 generates two phrase pairs smaller than the phrase pair that was to be acquired. In that case, the partial phrase pair generation unit 108 usually generates the two smaller phrase pairs using the prior probability of phrase pairs. More specifically, for example, the partial phrase pair generation unit 108 generates the two smaller phrase pairs using the base measure P_dac obtained from the prior probability of phrase pairs. The "dac" in P_dac stands for "divide-and-conquer"; P_dac is a probability based on a mechanism that divides a long phrase pair into shorter phrase pairs. For example, when the phrase pair to be acquired is <red cookbook, 赤い料理本>, P_dac(<red cookbook, 赤い料理本>) = P_x(REG) * P_t(<red, 赤い>) * P_t(<cookbook, 料理本>) + P_x(REG) * P_t(<red, 赤い料理>) * P_t(<cookbook, 本>) + P_x(INV) * P_t(<red, 本>) * P_t(<cookbook, 赤い料理>) + P_x(INV) * P_t(<red, 料理本>) * P_t(<cookbook, 赤い>) + P_x(BASE) * P_base(<red cookbook, 赤い料理本>).
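The sum in the example above can be reproduced by enumerating every way of splitting both sides of the phrase pair into two non-empty parts. The following Python sketch assumes phrases are tuples of words and that P_x, P_t, and P_base are supplied by the caller; the function name `p_dac` and these argument conventions are illustrative, and the restriction to non-empty parts follows the worked example rather than an explicit statement in the text.

```python
def p_dac(e, f, p_x, p_t, p_base):
    """Divide-and-conquer base measure of the phrase pair <e, f>.

    e, f:   tuples of words
    p_x:    dict mapping "BASE", "REG", "INV" to probabilities
    p_t:    function (e_part, f_part) -> phrase pair probability
    p_base: function (e, f) -> base probability of the whole pair
    """
    total = p_x["BASE"] * p_base(e, f)
    for i in range(1, len(e)):        # split point on the first-language side
        for j in range(1, len(f)):    # split point on the second-language side
            # REG: both sides keep their order.
            total += p_x["REG"] * p_t(e[:i], f[:j]) * p_t(e[i:], f[j:])
            # INV: the second-language parts are swapped.
            total += p_x["INV"] * p_t(e[:i], f[j:]) * p_t(e[i:], f[:j])
    return total


# Toy usage with hypothetical constant probabilities.
e = ("red", "cookbook")
f = ("赤い", "料理", "本")
p_x = {"BASE": 0.5, "REG": 0.3, "INV": 0.2}
print(p_dac(e, f, p_x, lambda a, b: 0.1, lambda a, b: 0.001))
```

With the two-word English side and the three-word Japanese side shown, the loops visit exactly the two REG splits and two INV splits listed in the example, plus the BASE term.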

The new phrase pair generation unit 109 performs the first process, the second process, or the third process according to the symbol acquired by the symbol acquisition unit 106. It performs the first process when the symbol is BASE, the second process when the symbol is REG, and the third process when the symbol is INV.

Here, the first process generates a new phrase pair using the prior probability of the phrase pair. Since the first process is a known technique, its description is omitted.

The second process generates two smaller phrase pairs and, using one or more pieces of phrase appearance frequency information, generates one phrase pair consisting of a new first-language phrase, obtained by concatenating in order the two first-language phrases of the two generated phrase pairs, and a new second-language phrase, obtained by concatenating in order the two second-language phrases of the two phrase pairs. The third process likewise generates two smaller phrase pairs and, using one or more pieces of phrase appearance frequency information, generates one phrase pair consisting of a new first-language phrase, obtained by concatenating in order the two first-language phrases, and a new second-language phrase, obtained by concatenating the two second-language phrases in reverse order. Here, using one or more pieces of phrase appearance frequency information may mean using the phrase pair generation probability (P_hier).
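The only difference between the second and third processes is the order in which the second-language phrases are joined. A minimal sketch, assuming phrase pairs are represented as (first-language tokens, second-language tokens) tuples; the function name `combine` is hypothetical:

```python
def combine(pair1, pair2, symbol):
    """Join two smaller phrase pairs into one, per the REG or INV rule."""
    (e1, f1), (e2, f2) = pair1, pair2
    if symbol == "REG":
        # Second process: both sides are concatenated in order.
        return (e1 + e2, f1 + f2)
    if symbol == "INV":
        # Third process: the second-language side is concatenated in reverse order.
        return (e1 + e2, f2 + f1)
    raise ValueError("combine() only handles REG and INV, got %r" % (symbol,))


red = (("red",), ("赤い",))
cookbook = (("cookbook",), ("料理本",))
regular = combine(red, cookbook, "REG")
inverted = combine(red, cookbook, "INV")
```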

The control unit 110 recursively applies the processing of the phrase appearance frequency information update unit 105, the symbol acquisition unit 106, the symbol appearance frequency information update unit 107, the partial phrase pair generation unit 108, and the new phrase pair generation unit 109 to the phrase pairs generated by the new phrase pair generation unit 109. Here, "recursively" usually means that the recursion ends when the processing target becomes a word pair. The recursion ends when a phrase is generated directly from P_t (without using the base measure). The recursion also ends when BASE is generated from P_x and a phrase pair is then generated from P_base.

The score calculation unit 111 calculates a score for each phrase pair in the phrase table 101 using one or more pieces of phrase appearance frequency information stored in the phrase appearance frequency information storage unit 102. Here, using one or more pieces of phrase appearance frequency information means, for example, using a Pitman-Yor process based on the nonparametric Bayesian method, as shown in Equation 5. That is, the score calculation unit 111 preferably calculates the score for each phrase pair in the phrase table 101 based on the nonparametric Bayesian method, using the one or more pieces of phrase appearance frequency information stored in the phrase appearance frequency information storage unit 102.

Equation 5 means that the probability distribution of phrase pairs is generated from a Pitman-Yor process with parameters d, s, and P_dac.
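Equation 5 itself is not reproduced in this text. Based only on the sentence above, it presumably has the following form. This is a reconstruction, and reading d as the discount parameter and s as the strength parameter follows the usual Pitman-Yor convention rather than anything stated in this passage:

```latex
\theta_t \sim \mathrm{PY}\left(d,\; s,\; P_{dac}\right)
```

Here \theta_t denotes the parameters of the phrase pair distribution P_t, and P_{dac} plays the role of the base measure.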

The parsing unit 114 acquires the tree structure of a bilingual sentence (including phrases) that maximizes the score calculated by the score calculation unit 111. More specifically, the parsing unit 114 acquires the tree structure with an ITG chart parser. The ITG chart parser is described in "M. Saers, J. Nivre, and D. Wu. Learning stochastic bracketing inversion transduction grammars with a cubic time biparsing algorithm. In Proc. IWPT, 2009."

The tree update unit 115 stores the tree structure acquired by the parsing unit 114 in the bilingual corpus 100. The tree update unit 115 normally overwrites the tree structure. That is, the old tree structure in the bilingual corpus 100 is replaced by the new tree structure.

The phrase table update unit 112 stores the score calculated by the score calculation unit 111 in association with each phrase pair. When the phrase pair corresponding to the score calculated by the score calculation unit 111 does not exist in the phrase table 101, the phrase table update unit 112 may store in the phrase table 101 a scored phrase pair consisting of that score and the phrase pair.

The phrase table 101, the phrase appearance frequency information storage unit 102, the symbol appearance frequency information storage unit 103, and the bilingual corpus 100 are preferably realized by non-volatile recording media, but can also be realized by volatile recording media.

It does not matter how scored phrase pairs and the like come to be stored in the phrase table 101 and the like. For example, they may be stored via a recording medium, they may be received over a communication line or the like and then stored, or they may be entered through an input device and then stored.

The generated phrase pair acquisition unit 104, the phrase appearance frequency information update unit 105, the symbol acquisition unit 106, the symbol appearance frequency information update unit 107, the partial phrase pair generation unit 108, the new phrase pair generation unit 109, the control unit 110, the score calculation unit 111, the phrase table update unit 112, the phrase table initialization unit 113, the parsing unit 114, and the tree update unit 115 can usually be realized by an MPU, memory, and the like. The processing procedures of the generated phrase pair acquisition unit 104 and the other units are usually realized by software, which is recorded on a recording medium such as a ROM. However, they may also be realized by hardware (dedicated circuits).

Next, the operation of the parallel phrase learning device 1 will be described with reference to the flowchart of FIG. 2. It is assumed that, before the operation of the flowchart of FIG. 2, the phrase table initialization unit 113 has generated the initial phrase table 101. The flowcharts of FIGS. 2 and 3 describe the process of acquiring scored phrase pairs using one piece of bilingual information in the bilingual corpus 100. That is, scored phrase pairs are normally acquired repeatedly for each of the many pieces of bilingual information in the bilingual corpus 100. It is also preferable to acquire scored phrase pairs repeatedly for a single piece of bilingual information.

(Step S201) The generated phrase pair acquisition unit 104 acquires each of the one or more bilingual sentences stored in the bilingual corpus 100, and subtracts the occurrences (usually an appearance frequency of 1) of each of the one or more phrase pairs constituting the tree structure of that bilingual sentence from the scores of the phrase pairs in the phrase table 101. Next, the generated phrase pair acquisition unit 104 attempts to generate one phrase pair using the phrase pair probability distribution P_t. P_t can be calculated from the phrase pair frequencies (F appearance frequency information) in the phrase appearance frequency information storage unit 102 by, for example, a Pitman-Yor process. Since calculating probabilities based on the Pitman-Yor process is a known technique, its description is omitted.
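Although the patent omits the calculation as known art, a common way to obtain such a probability from stored counts is the Chinese-restaurant predictive formula for the Pitman-Yor process. The sketch below is an assumption about how P_t might be evaluated, not the patent's stated procedure. In particular, the per-item table counts (`tables`) are an extra bookkeeping quantity that this passage does not mention, and `base_prob` stands in for the base measure.

```python
def pitman_yor_prob(x, counts, tables, d, s, base_prob):
    """Predictive probability of item x under a Pitman-Yor process.

    counts    : dict item -> number of times the item has been generated
    tables    : dict item -> number of "tables" serving the item (assumed bookkeeping)
    d, s      : discount (0 <= d < 1) and strength (s > -d) parameters
    base_prob : callable item -> probability under the base measure
    """
    total_count = sum(counts.values())
    total_tables = sum(tables.values())
    if total_count == 0:
        # Nothing observed yet: fall back to the base measure.
        return base_prob(x)
    # Mass from reusing an existing occurrence of x, discounted per table.
    reuse = max(counts.get(x, 0) - d * tables.get(x, 0), 0.0)
    # Mass from backing off to the base measure.
    backoff = (s + d * total_tables) * base_prob(x)
    return (reuse + backoff) / (s + total_count)
```

A frequently generated phrase pair gets most of its probability from the reuse term, while an unseen pair receives only the back-off term, which is why a failed draw from P_t leads into the base measure in the flow that follows.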

(Step S202) The partial phrase pair generation unit 108 and the other units perform the phrase generation process, and the processing ends. The phrase generation process generates phrases at two or more levels using a hierarchical ITG. The phrase generation process is described with reference to the flowchart of FIG. 3.

Next, the phrase generation process of step S202 will be described with reference to the flowchart of FIG. 3.

(Step S301) The partial phrase pair generation unit 108 determines whether a phrase pair was generated in the preceding phrase pair generation process. If a phrase pair was generated, the process goes to step S302; otherwise, it goes to step S305.

(Step S302) The phrase appearance frequency information update unit 105 increases the F appearance frequency information corresponding to the phrase pair generated in the preceding phrase pair generation process by a predetermined value (usually 1). If the phrase pair does not exist in the phrase appearance frequency information storage unit 102, the phrase appearance frequency information update unit 105 stores the generated phrase pair in association with its F appearance frequency information in the phrase appearance frequency information storage unit 102.

(Step S303) The score calculation unit 111 calculates the score of the phrase pair corresponding to the updated phrase appearance frequency information.

(Step S304) The phrase table update unit 112 constructs a scored phrase pair having the score calculated in step S303 and writes it to the phrase table 101. If the phrase pair does not exist in the phrase table 101, the phrase table update unit 112 constructs the scored phrase pair and appends it to the phrase table 101. If the phrase pair already exists in the phrase table 101, the phrase table update unit 112 updates the score corresponding to that phrase pair to the score calculated in step S303. The process then returns to the calling process (step S202 or the like).

(Step S305) The partial phrase pair generation unit 108 uses the base measure P_dac to generate two phrase pairs smaller than the phrase pair it tried to generate.

(Step S306) The symbol acquisition unit 106 acquires one symbol x using one or more pieces of symbol appearance frequency information.

(Step S307) The symbol appearance frequency information update unit 107 increases the S appearance frequency information corresponding to the symbol x acquired by the symbol acquisition unit 106 by a predetermined value (usually 1).

(Step S308) The new phrase pair generation unit 109 determines whether the symbol x acquired in step S306 is "BASE". If the symbol x is "BASE", the process goes to step S309; otherwise, it goes to step S310.

(Step S309) The new phrase pair generation unit 109 generates a new phrase pair using the prior probability of the phrase pair, and the process jumps to step S302.

(Step S310) The new phrase pair generation unit 109 determines whether the symbol x acquired in step S306 is "REG". If the symbol x is "REG", the process goes to step S311; otherwise, it goes to step S315. If the symbol x is not "REG" at this point, it is "INV".

(Step S311) The new phrase pair generation unit 109 generates two smaller phrase pairs. These two phrase pairs are referred to as the first phrase pair and the second phrase pair.

(Step S312) The phrase generation process of FIG. 3 is performed on the first phrase pair generated in step S311.

(Step S313) The phrase generation process of FIG. 3 is performed on the second phrase pair generated in step S311.

(Step S314) The new phrase pair generation unit 109 concatenates, in order, the two phrase pairs generated in steps S312 and S313 to generate one phrase pair, and the process jumps to step S302.

(Step S315) The new phrase pair generation unit 109 generates two smaller phrase pairs. These two phrase pairs are referred to as the first phrase pair and the second phrase pair.

(Step S316) The phrase generation process of FIG. 3 is performed on the first phrase pair generated in step S315.

(Step S317) The phrase generation process of FIG. 3 is performed on the second phrase pair generated in step S315.

(Step S318) The new phrase pair generation unit 109 concatenates, in reverse order, the two phrase pairs generated in steps S316 and S317 to generate one phrase pair, and the process jumps to step S302.

In the flowcharts of FIGS. 2 and 3, it is preferable that, after step S304 and before the return, the parsing unit 114 generates a tree structure and the tree update unit 115 updates the tree structure in the bilingual corpus 100.
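The control flow of steps S301 to S318 can be sketched as one recursive function. This is a simplified illustration under several assumptions, not the patented procedure: the callables `from_p_t`, `draw_symbol`, and `split` are hypothetical stand-ins for the probabilistic draws, the split of S305 and the generation of S311/S315 are merged into a single `split` call, and the scoring and table update of S303 and S304 are left out.

```python
from collections import Counter


def generate(pair, from_p_t, draw_symbol, split, phrase_counts, symbol_counts):
    """Recursively generate `pair` = (first-language tokens, second-language tokens).

    from_p_t(pair)      -> True if the pair is generated directly from P_t (S301)
    draw_symbol(pair)   -> "BASE", "REG" or "INV" (S306)
    split(pair, symbol) -> (first_pair, second_pair), two smaller pairs (S305, S311, S315)
    The callables must eventually stop the recursion (direct generation or BASE).
    """
    if not from_p_t(pair):                     # S301: direct generation failed
        symbol = draw_symbol(pair)             # S306
        symbol_counts[symbol] += 1             # S307
        if symbol != "BASE":                   # S308, S310
            first, second = split(pair, symbol)
            args = (from_p_t, draw_symbol, split, phrase_counts, symbol_counts)
            first = generate(first, *args)     # S312 / S316
            second = generate(second, *args)   # S313 / S317
            e = first[0] + second[0]
            if symbol == "REG":                # S314: concatenate in order
                f = first[1] + second[1]
            else:                              # S318: concatenate in reverse order
                f = second[1] + first[1]
            pair = (e, f)
        # For BASE the pair is taken as generated from the prior (S309).
    phrase_counts[pair] += 1                   # S302
    return pair


# Toy usage: single-word pairs come straight from P_t, everything else splits as REG.
phrase_counts, symbol_counts = Counter(), Counter()
target = (("red", "cookbook"), ("赤い", "料理本"))
result = generate(
    target,
    from_p_t=lambda p: len(p[0]) == 1,
    draw_symbol=lambda p: "REG",
    split=lambda p, s: ((p[0][:1], p[1][:1]), (p[0][1:], p[1][1:])),
    phrase_counts=phrase_counts,
    symbol_counts=symbol_counts,
)
```

Because the count update of S302 runs at every level of the recursion, the whole pair and both of its parts are counted, which is what lets phrase pairs of several granularities enter the phrase table.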

A specific operation of the parallel phrase learning device 1 in the present embodiment is described below.

Assume that the phrase table 101 stores a large number of scored phrase pairs and that the probability distribution of phrase pairs is P_t.

The phrase appearance frequency information storage unit 102 stores one or more pieces of phrase appearance frequency information, each of which is a pair of a phrase pair and its appearance frequency.

The symbol appearance frequency information storage unit 103 stores three pieces of symbol appearance frequency information, each of which pairs one of the symbols "BASE", "REG", and "INV" with its appearance frequency.

In this situation, the generated phrase pair acquisition unit 104 of the parallel phrase learning device 1 first acquires one bilingual sentence from the bilingual corpus 100. Next, it subtracts the occurrences (usually an appearance frequency of 1) of each of the one or more phrase pairs constituting the tree structure of the acquired bilingual sentence from the scores of the phrase pairs in the phrase table 101. It then tries to generate the phrase pair <e,f>, which is the bilingual sentence itself, from the phrase pair probability distribution P_t.

When the partial phrase pair generation unit 108 determines that a phrase pair could not be generated in the preceding phrase pair generation process, it proceeds as follows.

The partial phrase pair generation unit 108 uses the base measure P_dac to recursively generate two phrase pairs smaller than the phrase pair it tried to generate, and generates a new phrase pair by combining the two smaller phrase pairs. With the base measure P_dac so defined, the expression for θ_t is as shown in Equation 5.

The generation process of P_dac is the following ITG-based generation process.

The symbol acquisition unit 106 generates a symbol according to the symbol probability distribution P_x(x;θ_x), using the three pieces of symbol appearance frequency information. The symbol appearance frequency information update unit 107 then increases by 1 the S appearance frequency information corresponding to the generated symbol (for example, "x=reg").
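One plausible way to turn the three symbol counts into the distribution P_x is additive (Dirichlet-style) smoothing. The passage does not say how θ_x is obtained from the counts, so the smoothing constant `alpha` and the function names below are assumptions made for illustration only.

```python
import random

SYMBOLS = ("BASE", "REG", "INV")


def symbol_probs(counts, alpha=1.0):
    """Smoothed estimate of P_x from the S appearance frequency information."""
    total = sum(counts.get(x, 0) for x in SYMBOLS) + alpha * len(SYMBOLS)
    return {x: (counts.get(x, 0) + alpha) / total for x in SYMBOLS}


def draw_symbol(counts, rng, alpha=1.0):
    """Draw one symbol from P_x, then increase its count by 1."""
    probs = symbol_probs(counts, alpha)
    symbol = rng.choices(SYMBOLS, weights=[probs[x] for x in SYMBOLS])[0]
    counts[symbol] = counts.get(symbol, 0) + 1
    return symbol


counts = {"BASE": 2, "REG": 5, "INV": 0}
drawn = draw_symbol(counts, random.Random(0))
```

Updating the count immediately after the draw mirrors the pairing of the symbol acquisition unit 106 with the symbol appearance frequency information update unit 107.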

Next, when the generated symbol x is "x=base", the new phrase pair generation unit 109 generates a new phrase pair directly from P_base. When the generated symbol x is "x=reg", the new phrase pair generation unit 109 generates <e₁,f₁> and <e₂,f₂> from P_hier and creates one phrase pair <e₁e₂,f₁f₂>. When the generated symbol x is "x=inv", the new phrase pair generation unit 109 generates <e₁,f₁> and <e₂,f₂> from P_hier, arranges f₁ and f₂ in reverse order, and creates one phrase pair <e₁e₂,f₂f₁>.

The phrase appearance frequency information update unit 105 then updates the phrase appearance frequency information of the newly created phrase pair.

The score calculation unit 111 calculates the score of the phrase pair corresponding to the updated phrase appearance frequency information.

The phrase table update unit 112 then updates the phrase table.

The parsing unit 114 uses the scores calculated by the score calculation unit 111 to acquire a new tree structure that maximizes the score of the tree structure. The tree update unit 115 stores the acquired tree structure in the bilingual corpus 100, replacing the old tree structure with the new one.

With the above processing, phrase pairs at multiple levels of granularity can be learned for the phrase pair "Mrs. Smith's red cookbook / スミスさんの赤い料理本", as shown in FIG. 4.

The phrase table 101 in this specific example is constructed, for example, as follows.

As features of the phrase table, the conditional probabilities P_t(f|e) and P_t(e|f), lexical weighting probabilities, a phrase penalty, and the like are used. Here, the conditional probabilities are calculated from the model probability P_t, that is, using Equations 6 and 7. The score calculation unit 111 then calculates a score by, for example, multiplying each feature of the phrase table by a predetermined weight and summing the resulting values. The lexical weighting probabilities can be calculated from the words constituting the phrases. This calculation is a known technique (see P. Koehn, F. J. Och, and D. Marcu. Statistical phrase-based translation. In Proc. NAACL, pp. 48-54, 2003). The phrase penalty is, for example, 1 for every phrase.
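Equations 6 and 7 are not reproduced in this text. A natural reading, offered here only as an assumption, is that each conditional probability is the joint model probability normalized over the phrase pairs in the table that share the same conditioning phrase. The sketch below combines that reading with the weighted sum described above; the function names and the example values are hypothetical.

```python
from collections import defaultdict


def conditional_probs(joint):
    """Derive P_t(f|e) and P_t(e|f) from joint probabilities P_t(<e,f>).

    joint: dict (e, f) -> joint probability, restricted to pairs in the table.
    """
    e_total = defaultdict(float)
    f_total = defaultdict(float)
    for (e, f), p in joint.items():
        e_total[e] += p
        f_total[f] += p
    return {
        (e, f): {"p_f_given_e": p / e_total[e], "p_e_given_f": p / f_total[f]}
        for (e, f), p in joint.items()
    }


def score(features, weights):
    """Weighted sum of feature values, as described for the score calculation unit."""
    return sum(weights[name] * value for name, value in features.items())


joint = {("red", "赤い"): 0.3, ("red", "赤"): 0.1, ("book", "本"): 0.2}
features = conditional_probs(joint)[("red", "赤い")]
features["phrase_penalty"] = 1.0
total = score(features, {"p_f_given_e": 0.5, "p_e_given_f": 0.5, "phrase_penalty": 0.1})
```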

The phrase table update unit 112 puts into the phrase table 101 only those phrase pairs that appear at least once in the sample. Two further features are added. The first is the joint probability P_t(<e,f>) of the phrase pair under the model. The second is the average posterior probability of the spans containing a given phrase pair <e,f>, based on span posterior probabilities computed by the inside-outside algorithm. Because the span probability is high for frequently occurring phrase pairs and for phrase pairs built from frequently occurring phrase pairs, it is useful for judging how reliable a phrase pair is. Phrase extraction based on these model probabilities is called MOD. The span probabilities can be calculated by the ITG chart parser.

(Experiment)

Experimental results for the parallel phrase learning device 1 are described below. In this experiment, the method of the parallel phrase learning device 1 was evaluated on French-English and Japanese-English translation tasks.

For French-English translation, data from the Workshop on Statistical Machine Translation (WMT) were used (see C. Callison-Burch, et al. Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation. In Proc. WMT/MetricsMATR, pp. 17-53, 2010), with the news commentary corpus for translation model training and the news commentary and Europarl corpora for language model training. For Japanese-English translation, data from the NTCIR patent translation task were used (A. Fujii, M. Utiyama, M. Yamamoto, and T. Utsuro. Overview of the patent translation task at the NTCIR-7 workshop. In Proc. NTCIR-7, pp. 389-400, 2008), with the first 100,000 sentences of the parallel corpus for the translation model and the entire parallel corpus for the language model. The corpus statistics are shown in FIG. 5.

As preprocessing, the data were tokenized and lowercased, and only sentences of 40 words or fewer were used to train the translation model. Moses was used as the decoder (see P. Koehn, et al. Moses: Open source toolkit for statistical machine translation. In Proc. ACL, 2007). The maximum phrase length was 7, and the language model was a 5-gram model with Kneser-Ney smoothing. The evaluation metric was the BLEU score up to 4-grams.

In the first experiment, phrase extraction using the model probabilities of flat and hier (mod) was compared with heuristic phrase extraction from alignments obtained with GIZA++ (giza). For giza, the standard training settings up to Model 4 were used, and the alignments in both directions were combined with grow-diag-final-and. For the method of the parallel phrase learning device 1, training was run for 100 iterations and the last sample was used. The likelihood increased monotonically up to the 100th iteration, but translation accuracy was almost unchanged after the 5th to 10th iteration. Since one iteration took about 1.3 hours on one core, good translation accuracy could be achieved in 6.5 to 13 hours.
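The preprocessing described above (tokenization, lowercasing, and a 40-word length limit) can be sketched as follows. The passage does not name the tokenizer that was used, so plain whitespace splitting is used here purely as a stand-in, and the function name `preprocess` is hypothetical.

```python
def preprocess(sentence_pairs, max_len=40):
    """Lowercase, tokenize, and keep only pairs whose sides have 1..max_len words."""
    kept = []
    for source, target in sentence_pairs:
        # Whitespace splitting is a placeholder for a real tokenizer.
        source_tokens = source.lower().split()
        target_tokens = target.lower().split()
        if 0 < len(source_tokens) <= max_len and 0 < len(target_tokens) <= max_len:
            kept.append((source_tokens, target_tokens))
    return kept


corpus = [
    ("Le Chat dort .", "The Cat sleeps ."),
    (" ".join(["mot"] * 41), "too long on the source side"),
]
training_pairs = preprocess(corpus)
```

Applying the limit to both sides of a pair is a design choice of this sketch; the passage only states that sentences of 40 words or fewer were used.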

The experimental results are shown in FIG. 6, which gives the BLEU scores and the phrase table sizes. For both French-English and Japanese-English, the phrase table built from the probabilities of the hierarchical model slightly exceeded the accuracy of heuristic phrase extraction based on GIZA++. In other words, a fully probabilistic model outperformed heuristic phrase extraction. Furthermore, the phrase table obtained by the method of the parallel phrase learning device 1 was just under 20% of the size of that of the conventional method. When model probabilities were used, hier (the present method) greatly outperformed flat (the conventional method based on an ITG model). This is because high accuracy cannot be obtained when only minimal phrases are used.

FIG. 7 compares phrase extraction based on model probabilities with the conventional methods, showing translation accuracy and phrase table size for various phrase extraction methods. In addition to the proposed method mod, which uses model probabilities with hier or flat alignments, FIG. 7 compares heuristic extraction that takes phrases (heur-p), blocks (heur-b), or words (heur-w) as the minimal unit. FIG. 7 shows that the combination of hier and mod achieves accuracy roughly equal to or higher than heuristic extraction while greatly reducing the size of the phrase table.

As described above, according to the present embodiment, the size of the phrase table can be greatly reduced while maintaining the accuracy of machine translation that uses the phrase table created by the parallel phrase learning device 1. That is, according to the present embodiment, a large number of appropriate phrase pairs can be learned.

The processing in the present embodiment may be realized by software. This software may be distributed by software download or the like, or may be recorded on a recording medium such as a CD-ROM and distributed. The same applies to the other embodiments in this specification.

(Embodiment 2)
In the present embodiment, a phrase-based statistical machine translation device that uses the phrase table 101 learned by the parallel phrase learning device 1 of Embodiment 1 is described.

FIG. 8 is a block diagram of the phrase-based statistical machine translation device 2 in the present embodiment.

The phrase-based statistical machine translation device 2 includes a phrase table 101, a reception unit 201, a phrase acquisition unit 202, a sentence composition unit 203, and an output unit 204.

The phrase table 101 is the phrase table learned by the parallel phrase learning device 1.

The reception unit 201 receives a sentence in the first language having one or more words. Here, reception is a concept that includes receiving information entered from an input device such as a keyboard, mouse, or touch panel; receiving information transmitted over a wired or wireless communication line; and receiving information read from a recording medium such as an optical disk, magnetic disk, or semiconductor memory. Any means may be used to input the first language sentence, such as a keyboard, a mouse, or a menu screen. The reception unit 201 can be realized by a device driver for an input means such as a keyboard, by control software for a menu screen, or the like.

The phrase acquisition unit 202 extracts one or more phrases from the sentence received by the reception unit 201 and, using the scores in the phrase table 101, acquires one or more phrases in the second language from the phrase table 101. The processing of the phrase acquisition unit 202 is a known technique.

The sentence composition unit 203 composes a sentence in the second language from the one or more phrases acquired by the phrase acquisition unit 202. The processing of the sentence composition unit 203 is a known technique.

The output unit 204 outputs the sentence composed by the sentence composition unit 203. Here, output is a concept that includes display on a display, projection using a projector, printing on a printer, sound output, transmission to an external device, storage on a recording medium, and delivery of the processing result to another processing device or another program.

The phrase acquisition unit 202 and the sentence composition unit 203 can usually be realized by an MPU, memory, and the like. The processing procedures of the phrase acquisition unit 202 and the other units are usually realized by software, and the software is recorded on a recording medium such as a ROM. However, they may also be realized by hardware (dedicated circuits).

The output unit 204 may or may not be considered to include an output device such as a display or a speaker. The output unit 204 can be realized by driver software for an output device, or by driver software for an output device together with the output device.

The phrase-based statistical machine translation device 2 need only perform known phrase-based statistical machine translation processing, so a detailed description of its operation is omitted.
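As a rough illustration of how data could flow through units 201 to 204, the following minimal Python sketch segments an input sentence by greedy longest match against a toy phrase table, picks the highest-scoring target phrase for each source phrase, and joins the results in source order. The table contents, the function names, and the greedy monotone strategy are assumptions made for illustration only; an actual phrase-based decoder typically searches over segmentations and reorderings and combines several model scores.

```python
from typing import Dict, List, Tuple

# Hypothetical toy phrase table: source phrase -> [(target phrase, score), ...].
PhraseTable = Dict[Tuple[str, ...], List[Tuple[str, float]]]

TOY_TABLE: PhraseTable = {
    ("das", "haus"): [("the house", 0.8), ("this house", 0.2)],
    ("ist",): [("is", 0.9)],
    ("klein",): [("small", 0.6), ("little", 0.4)],
}


def acquire_phrases(words: List[str], table: PhraseTable, max_len: int = 3) -> List[str]:
    """Stand-in for phrase acquisition unit 202 (greedy longest match, an assumption)."""
    targets: List[str] = []
    i = 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            src = tuple(words[i:i + n])
            if src in table:
                # Use the phrase table score to choose among candidate translations.
                best_target, _ = max(table[src], key=lambda cand: cand[1])
                targets.append(best_target)
                i += n
                break
        else:
            # No entry covers this word: pass it through unchanged.
            targets.append(words[i])
            i += 1
    return targets


def compose_sentence(phrases: List[str]) -> str:
    """Stand-in for sentence composition unit 203 (monotone concatenation)."""
    return " ".join(phrases)


def translate(sentence: str, table: PhraseTable) -> str:
    words = sentence.split()                  # reception unit 201
    phrases = acquire_phrases(words, table)   # phrase acquisition unit 202
    return compose_sentence(phrases)          # sentence composition unit 203


if __name__ == "__main__":
    print(translate("das haus ist klein", TOY_TABLE))  # output unit 204
```

A smaller phrase table shrinks the dictionary that `acquire_phrases` consults, which is where the storage savings described in this embodiment would show up in such a sketch.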

As described above, according to the present embodiment, highly accurate machine translation is possible using a phrase table that requires only a small storage area.

FIG. 9 shows the external appearance of a computer that realizes the parallel phrase learning device and the other devices of the embodiments described above. The embodiments described above can be realized by computer hardware and a computer program executed on it. FIG. 9 is an overview of this computer system 340, and FIG. 10 is a block diagram of the computer system 340.

In FIG. 9, the computer system 340 includes a computer 341 that includes an FD drive and a CD-ROM drive, a keyboard 342, a mouse 343, and a monitor 344.

In FIG. 10, the computer 341 includes, in addition to the FD drive 3411 and the CD-ROM drive 3412, an MPU 3413; a bus 3414 connected to the CD-ROM drive 3412 and the FD drive 3411; a ROM 3415 for storing programs such as a boot-up program; a RAM 3416 that is connected to these components, temporarily stores the instructions of application programs, and provides temporary storage space; and a hard disk 3417 for storing application programs, system programs, and data. Although not shown here, the computer 341 may further include a network card that provides a connection to a LAN.

A program that causes the computer system 340 to execute the functions of the parallel phrase learning device and the other devices of the embodiments described above may be stored on a CD-ROM 3501 or an FD 3502, inserted into the CD-ROM drive 3412 or the FD drive 3411, and then transferred to the hard disk 3417. Alternatively, the program may be transmitted to the computer 341 over a network (not shown) and stored on the hard disk 3417. The program is loaded into the RAM 3416 at execution time. The program may also be loaded directly from the CD-ROM 3501, the FD 3502, or the network.

The program need not necessarily include an operating system (OS), third-party programs, or the like that cause the computer 341 to execute the functions of the parallel phrase learning device and the other devices of the embodiments described above. The program need only include those instructions that call the appropriate functions (modules) in a controlled manner so that the desired result is obtained. How the computer system 340 operates is well known, and a detailed description is omitted.

The program may be executed by a single computer or by a plurality of computers. That is, either centralized processing or distributed processing may be performed.

In each of the above embodiments, each process (each function) may be realized by centralized processing on a single device (system), or by distributed processing across a plurality of devices.

The present invention is not limited to the above embodiments. Various modifications are possible, and it goes without saying that they are also included within the scope of the present invention.

As described above, the parallel phrase learning device according to the present invention has the effect of being able to learn a large number of appropriate phrase pairs, and is useful as a parallel phrase learning device, a machine translation device, and the like.

Description of Symbols
1 Parallel phrase learning device
2 Phrase-based statistical machine translation device
101 Phrase table
102 Phrase appearance frequency information storage unit
103 Symbol appearance frequency information storage unit
104 Generated phrase pair acquisition unit
105 Phrase appearance frequency information update unit
106 Symbol acquisition unit
107 Symbol appearance frequency information update unit
108 Partial phrase pair generation unit
109 New phrase pair generation unit
110 Control unit
111 Score calculation unit
112 Phrase table update unit
201 Reception unit
202 Phrase acquisition unit
203 Sentence composition unit
204 Output unit

Claims

A parallel phrase learning device comprising:
a phrase table capable of storing one or more scored phrase pairs, each having (i) a phrase pair that is a pair of a first language phrase having one or more words of a first language and a second language phrase having one or more words of a second language, the phrase pair being acquired from a bilingual corpus that stores one or more pieces of parallel translation information each having a parallel sentence, which is a pair of a first language sentence and a second language sentence, and a tree structure of the parallel sentence, and (ii) a score that is information on the appearance probability of the phrase pair;
a phrase appearance frequency information storage unit capable of storing one or more pieces of phrase appearance frequency information, each having a phrase pair and F appearance frequency information, which is information on the appearance frequency of the phrase pair;
a symbol appearance frequency information storage unit capable of storing one or more pieces of symbol appearance frequency information, each having a symbol that identifies a method for generating a new phrase pair and S appearance frequency information, which is information on the appearance frequency of the symbol;
a generated phrase pair acquisition unit that attempts to acquire a phrase pair from among one or more phrase pairs constituting the tree structures of the one or more parallel sentences stored in the bilingual corpus, using a probability distribution over the one or more phrase pairs obtained from the F appearance frequency information that is included in the one or more pieces of phrase appearance frequency information and corresponds to each phrase pair;
a phrase appearance frequency information update unit that, when the phrase pair has been acquired, increases the F appearance frequency information corresponding to the phrase pair by a predetermined value;
a symbol acquisition unit that, when the phrase pair could not be acquired, acquires one symbol using a probability distribution over one or more symbols obtained from the S appearance frequency information included in the one or more pieces of symbol appearance frequency information;
a symbol appearance frequency information update unit that increases, by a predetermined value, the S appearance frequency information that corresponds to the symbol acquired by the symbol acquisition unit and is stored in the symbol appearance frequency information storage unit;
a partial phrase pair generation unit that, when the phrase pair could not be acquired, generates two phrase pairs smaller than the phrase pair that was to be acquired;
a new phrase pair generation unit that performs, as the process corresponding to the symbol acquired by the symbol acquisition unit, one of: a first process of generating a new phrase pair using a prior probability of phrase pairs obtained from the F appearance frequency information corresponding to each of one or more phrase pairs; a second process of generating two smaller phrase pairs and, using the one or more pieces of phrase appearance frequency information, generating one phrase pair having a new first language phrase in which the two first language phrases of the two generated phrase pairs are concatenated in order and a new second language phrase in which the two second language phrases of the two phrase pairs are concatenated in order; and a third process of generating two smaller phrase pairs and, using the one or more pieces of phrase appearance frequency information, generating one phrase pair having a new first language phrase in which the two first language phrases of the two generated phrase pairs are concatenated in order and a new second language phrase in which the two second language phrases of the two phrase pairs are concatenated in reverse order;
a control unit that causes the processing of the phrase appearance frequency information update unit, the symbol acquisition unit, the symbol appearance frequency information update unit, the partial phrase pair generation unit, and the new phrase pair generation unit to be performed recursively on each of the two phrase pairs acquired by the new phrase pair generation unit in the second process and on each of the two phrase pairs acquired by the new phrase pair generation unit in the third process;
a score calculation unit that calculates, using the one or more pieces of phrase appearance frequency information stored in the phrase appearance frequency information storage unit, a score for each of the phrase pair acquired by the generated phrase pair acquisition unit, the new phrase pair acquired by the new phrase pair generation unit in the first process, the phrase pair acquired by the new phrase pair generation unit in the second process, and the phrase pair acquired by the new phrase pair generation unit in the third process; and
a phrase table update unit that accumulates each score calculated by the score calculation unit in the phrase table in association with the corresponding phrase pair.
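For illustration only, and not as part of the claim language, the recursive procedure recited above can be sketched in Python. The sketch assumes a simple count-based reuse probability governed by a hypothetical concentration parameter `alpha`, a three-entry toy base distribution, a depth limit that forces termination, and a relative-frequency score. None of these specifics come from the claims. The symbol names `base`, `reg`, and `inv` are likewise hypothetical labels for the first, second, and third processes.

```python
import random
from collections import Counter

BASE, REG, INV = "base", "reg", "inv"  # hypothetical names for the three symbols


class ToyPhrasePairSampler:
    def __init__(self, alpha=1.0, max_depth=3, seed=0):
        self.pair_counts = Counter()  # F appearance frequency information
        # S appearance frequency information; the initial count of 1 is an assumption.
        self.symbol_counts = Counter({BASE: 1, REG: 1, INV: 1})
        self.alpha = alpha            # assumed concentration parameter
        self.max_depth = max_depth    # assumed recursion limit
        self.rng = random.Random(seed)
        self.base_pairs = [("a", "A"), ("b", "B"), ("c", "C")]  # toy base distribution

    def _try_reuse(self):
        """Try to acquire an existing phrase pair in proportion to its count."""
        total = sum(self.pair_counts.values())
        if self.rng.random() < total / (total + self.alpha):
            pairs = list(self.pair_counts)
            weights = [self.pair_counts[p] for p in pairs]
            return self.rng.choices(pairs, weights=weights)[0]
        return None  # acquisition failed

    def _sample_symbol(self):
        symbols = [BASE, REG, INV]
        weights = [self.symbol_counts[s] for s in symbols]
        return self.rng.choices(symbols, weights=weights)[0]

    def generate(self, depth=0):
        pair = self._try_reuse()
        if pair is None:
            symbol = BASE if depth >= self.max_depth else self._sample_symbol()
            self.symbol_counts[symbol] += 1
            if symbol == BASE:
                # First process: draw a new pair from the base distribution.
                pair = self.rng.choice(self.base_pairs)
            else:
                # Second and third processes: build from two smaller pairs, recursively.
                f1, e1 = self.generate(depth + 1)
                f2, e2 = self.generate(depth + 1)
                e = f"{e1} {e2}" if symbol == REG else f"{e2} {e1}"
                pair = (f"{f1} {f2}", e)
        # Counting newly built pairs as well as reused ones is an assumption.
        self.pair_counts[pair] += 1
        return pair

    def score(self, pair):
        """Relative-frequency score (an assumption); needs at least one generated pair."""
        return self.pair_counts[pair] / sum(self.pair_counts.values())
```

Calling `generate()` repeatedly fills `pair_counts`, and pairing each key with `score(pair)` gives the kind of scored phrase pair that the phrase table update unit would accumulate.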

The parallel phrase learning device according to claim 1, wherein
the generated phrase pair acquisition unit acquires a generated phrase pair having a first language phrase and a second language phrase from among one or more phrase pairs constituting the tree structures of the one or more parallel sentences stored in the bilingual corpus, using a probability distribution of phrase pairs obtained from the F appearance frequency information that is included in the one or more pieces of phrase appearance frequency information and corresponds to each phrase pair;
the symbol acquisition unit, when the phrase pair could not be acquired, acquires one symbol using a probability distribution of symbols;
the partial phrase pair generation unit, when the phrase pair could not be acquired, uses a base measure to generate two phrase pairs smaller than the phrase pair that was to be generated;
the first process is a process of generating a new phrase pair using a base measure of phrase pairs; and
the score calculation unit calculates a score for each phrase pair in the phrase table based on a nonparametric Bayesian method, using the one or more pieces of phrase appearance frequency information stored in the phrase appearance frequency information storage unit.

A phrase-based statistical machine translation device comprising:
a phrase table learned by the parallel phrase learning device according to claim 1;
a reception unit that receives a sentence in a first language having one or more words;
a phrase acquisition unit that extracts one or more phrases from the sentence received by the reception unit and, using the scores in the phrase table, acquires one or more phrases in a second language from the phrase table;
a sentence composition unit that composes a sentence in the second language from the one or more phrases acquired by the phrase acquisition unit; and
an output unit that outputs the sentence composed by the sentence composition unit.

A parallel phrase learning method in which a storage medium stores:
a phrase table having one or more scored phrase pairs, each having (i) a phrase pair that is a pair of a first language phrase having one or more words of a first language and a second language phrase having one or more words of a second language, the phrase pair being acquired from a bilingual corpus that stores one or more pieces of parallel translation information each having a parallel sentence, which is a pair of a first language sentence and a second language sentence, and a tree structure of the parallel sentence, and (ii) a score that is information on the appearance probability of the phrase pair;
one or more pieces of phrase appearance frequency information, each having a phrase pair and F appearance frequency information, which is information on the appearance frequency of the phrase pair; and
one or more pieces of symbol appearance frequency information, each having a symbol that identifies a method for generating a new phrase pair and S appearance frequency information, which is information on the appearance frequency of the symbol,
the method being realized by a generated phrase pair acquisition unit, a phrase appearance frequency information update unit, a symbol acquisition unit, a symbol appearance frequency information update unit, a partial phrase pair generation unit, a new phrase pair generation unit, a control unit, a score calculation unit, and a phrase table update unit, and comprising:
a generated phrase pair acquisition step in which the generated phrase pair acquisition unit acquires a phrase pair from among one or more phrase pairs constituting the tree structures of the one or more parallel sentences stored in the bilingual corpus, using the F appearance frequency information that is included in the one or more pieces of phrase appearance frequency information and corresponds to each of the one or more phrase pairs;
a phrase appearance frequency information update step in which the phrase appearance frequency information update unit, when a phrase pair has been acquired, increases the F appearance frequency information corresponding to the phrase pair by a predetermined value;
a symbol acquisition step in which the symbol acquisition unit, when a phrase pair could not be acquired, acquires one symbol using the S appearance frequency information included in the one or more pieces of symbol appearance frequency information;
a symbol appearance frequency information update step in which the symbol appearance frequency information update unit increases, by a predetermined value, the S appearance frequency information that corresponds to the symbol acquired in the symbol acquisition step and is stored in the symbol appearance frequency information storage unit;
a partial phrase pair generation step in which the partial phrase pair generation unit, when a phrase pair could not be acquired, generates two phrase pairs smaller than the phrase pair that was to be acquired;
a new phrase pair generation step in which the new phrase pair generation unit performs, as the process corresponding to the symbol acquired in the symbol acquisition step, one of: a first process of generating a new phrase pair; a second process of generating two smaller phrase pairs and, using the one or more pieces of phrase appearance frequency information, generating one phrase pair having a new first language phrase in which the two first language phrases of the two generated phrase pairs are concatenated in order and a new second language phrase in which the two second language phrases of the two phrase pairs are concatenated in order; and a third process of generating two smaller phrase pairs and, using the one or more pieces of phrase appearance frequency information, generating one phrase pair having a new first language phrase in which the two first language phrases of the two generated phrase pairs are concatenated in order and a new second language phrase in which the two second language phrases of the two phrase pairs are concatenated in reverse order;
a control step in which the control unit recursively performs, on the phrase pairs generated in the new phrase pair generation step, the processing of the phrase appearance frequency information update step, the symbol acquisition step, the symbol appearance frequency information update step, the partial phrase pair generation step, and the new phrase pair generation step;
a score calculation step in which the score calculation unit calculates, using the one or more pieces of phrase appearance frequency information stored in the storage medium, a score for the phrase pair acquired by the generated phrase pair acquisition unit; and
a phrase table update step in which the phrase table update unit accumulates the score calculated in the score calculation step in the phrase table in association with each phrase pair.

A parallel phrase production method in which a storage medium stores:
a phrase table having one or more scored phrase pairs, each having (i) a phrase pair that is a pair of a first language phrase having one or more words of a first language and a second language phrase having one or more words of a second language, the phrase pair being acquired from a bilingual corpus that stores one or more pieces of parallel translation information each having a parallel sentence, which is a pair of a first language sentence and a second language sentence, and a tree structure of the parallel sentence, and (ii) a score that is information on the appearance probability of the phrase pair;
one or more pieces of phrase appearance frequency information, each having a phrase pair and F appearance frequency information, which is information on the appearance frequency of the phrase pair; and
one or more pieces of symbol appearance frequency information, each having a symbol that identifies a method for generating a new phrase pair and S appearance frequency information, which is information on the appearance frequency of the symbol,
the method being realized by a generated phrase pair acquisition unit, a phrase appearance frequency information update unit, a symbol acquisition unit, a symbol appearance frequency information update unit, a partial phrase pair generation unit, a new phrase pair generation unit, a control unit, a score calculation unit, and a phrase table update unit, and comprising:
a generated phrase pair acquisition step in which the generated phrase pair acquisition unit acquires a phrase pair from among one or more phrase pairs constituting the tree structures of the one or more parallel sentences stored in the bilingual corpus, using a probability distribution over the one or more phrase pairs obtained from the F appearance frequency information that is included in the one or more pieces of phrase appearance frequency information and corresponds to each of the one or more phrase pairs;
a phrase appearance frequency information update step in which the phrase appearance frequency information update unit, when a phrase pair has been acquired, increases the F appearance frequency information corresponding to the phrase pair by a predetermined value;
a symbol acquisition step in which the symbol acquisition unit, when a phrase pair could not be acquired, acquires one symbol using a probability distribution over one or more symbols obtained from the S appearance frequency information included in the one or more pieces of symbol appearance frequency information;
a symbol appearance frequency information update step in which the symbol appearance frequency information update unit increases, by a predetermined value, the S appearance frequency information that corresponds to the symbol acquired in the symbol acquisition step and is stored in the symbol appearance frequency information storage unit;
a partial phrase pair generation step in which the partial phrase pair generation unit, when a phrase pair could not be acquired, generates two phrase pairs smaller than the phrase pair that was to be acquired;
a new phrase pair generation step in which the new phrase pair generation unit performs, as the process corresponding to the symbol acquired in the symbol acquisition step, one of: a first process of generating a new phrase pair using a prior probability of phrase pairs obtained from the F appearance frequency information corresponding to each of one or more phrase pairs; a second process of generating two smaller phrase pairs and, using the one or more pieces of phrase appearance frequency information, generating one phrase pair having a new first language phrase in which the two first language phrases of the two generated phrase pairs are concatenated in order and a new second language phrase in which the two second language phrases of the two phrase pairs are concatenated in order; and a third process of generating two smaller phrase pairs and, using the one or more pieces of phrase appearance frequency information, generating one phrase pair having a new first language phrase in which the two first language phrases of the two generated phrase pairs are concatenated in order and a new second language phrase in which the two second language phrases of the two phrase pairs are concatenated in reverse order;
a control step in which the control unit recursively performs, on each of the two phrase pairs acquired in the second process of the new phrase pair generation step and on each of the two phrase pairs acquired in the third process, the processing of the phrase appearance frequency information update step, the symbol acquisition step, the symbol appearance frequency information update step, the partial phrase pair generation step, and the new phrase pair generation step;
a score calculation step in which the score calculation unit calculates, using the one or more pieces of phrase appearance frequency information stored in the storage medium, a score for the phrase pair acquired by the generated phrase pair acquisition unit; and
a phrase table update step in which the phrase table update unit accumulates the score calculated in the score calculation step in the phrase table in association with each phrase pair.

A program, wherein a storage medium stores:
a phrase table having one or more scored phrase pairs, each having (i) a phrase pair that is a pair of a first language phrase having one or more words of a first language and a second language phrase having one or more words of a second language, the phrase pair being acquired from a bilingual corpus that stores one or more pieces of parallel translation information each having a parallel sentence, which is a pair of a first language sentence and a second language sentence, and a tree structure of the parallel sentence, and (ii) a score that is information on the appearance probability of the phrase pair;
one or more pieces of phrase appearance frequency information, each having a phrase pair and F appearance frequency information, which is information on the appearance frequency of the phrase pair; and
one or more pieces of symbol appearance frequency information, each having a symbol that identifies a method for generating a new phrase pair and S appearance frequency information, which is information on the appearance frequency of the symbol,
the program causing a computer to function as:
a generated phrase pair acquisition unit that attempts to acquire a phrase pair from among one or more phrase pairs constituting the tree structures of the one or more parallel sentences stored in the bilingual corpus, using a probability distribution over the one or more phrase pairs obtained from the F appearance frequency information that is included in the one or more pieces of phrase appearance frequency information and corresponds to each phrase pair;
a phrase appearance frequency information update unit that, when the phrase pair has been acquired, increases the F appearance frequency information corresponding to the phrase pair by a predetermined value;
a symbol acquisition unit that, when the phrase pair could not be acquired, acquires one symbol using a probability distribution over one or more symbols obtained from the S appearance frequency information included in the one or more pieces of symbol appearance frequency information;
a symbol appearance frequency information update unit that increases, by a predetermined value, the S appearance frequency information that corresponds to the symbol acquired by the symbol acquisition unit and is stored in the symbol appearance frequency information storage unit;
a partial phrase pair generation unit that, when the phrase pair could not be acquired, generates two phrase pairs smaller than the phrase pair that was to be acquired;
a new phrase pair generation unit that performs, as the process corresponding to the symbol acquired by the symbol acquisition unit, one of: a first process of generating a new phrase pair using a prior probability of phrase pairs obtained from the F appearance frequency information corresponding to each of one or more phrase pairs; a second process of generating two smaller phrase pairs and, using the one or more pieces of phrase appearance frequency information, generating one phrase pair having a new first language phrase in which the two first language phrases of the two generated phrase pairs are concatenated in order and a new second language phrase in which the two second language phrases of the two phrase pairs are concatenated in order; and a third process of generating two smaller phrase pairs and, using the one or more pieces of phrase appearance frequency information, generating one phrase pair having a new first language phrase in which the two first language phrases of the two generated phrase pairs are concatenated in order and a new second language phrase in which the two second language phrases of the two phrase pairs are concatenated in reverse order;
a control unit that causes the processing of the phrase appearance frequency information update unit, the symbol acquisition unit, the symbol appearance frequency information update unit, the partial phrase pair generation unit, and the new phrase pair generation unit to be performed recursively on each of the two phrase pairs acquired by the new phrase pair generation unit in the second process and on each of the two phrase pairs acquired by the new phrase pair generation unit in the third process;
a score calculation unit that calculates, using the one or more pieces of phrase appearance frequency information stored in the storage medium, a score for each of the phrase pair acquired by the generated phrase pair acquisition unit, the new phrase pair acquired by the new phrase pair generation unit in the first process, the phrase pair acquired by the new phrase pair generation unit in the second process, and the phrase pair acquired by the new phrase pair generation unit in the third process; and
a phrase table update unit that accumulates each score calculated by the score calculation unit in the phrase table in association with the corresponding phrase pair.