JP2015001862A

JP2015001862A - Bilingual phrase learning device, statistical machine translation device, bilingual phrase learning method, and program

Info

Publication number: JP2015001862A
Application number: JP2013126490A
Authority: JP
Inventors: Taro Watanabe (渡辺 太郎); Conghui Zhu (チューツォンフィ); Eiichiro Sumida (隅田 英一郎)
Original assignee: National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology
Priority date: 2013-06-17
Filing date: 2013-06-17
Publication date: 2015-01-05
Anticipated expiration: 2033-06-17
Also published as: JP6192098B2; WO2014203681A1

Abstract

PROBLEM TO BE SOLVED: To address the problem in the prior art that a translation model has to be updated every time a bilingual corpus is added. SOLUTION: When calculating the score of each phrase pair acquired from the j-th (2 <= j <= N) bilingual corpus (an added bilingual corpus), one or more pieces of appearance frequency information corresponding to the (j-1)-th bilingual corpus are used. The calculated scores are used to create a translation model, and the newly created translation model is used in linkage with the original translation model, so that the translation model can easily be enriched step by step.

Description

The present invention relates to a bilingual phrase learning device and the like for learning bilingual phrases.

In conventional statistical machine translation (see Non-Patent Document 1), bilingual knowledge such as a phrase table is extracted from bilingual data to learn a translation model, and a translation system is built on that model. To estimate an accurate translation model, a large amount of bilingual data has had to be learned by a method called batch learning. Batch learning is a learning method that performs optimization over all of the training data.

In particular, bilingual data grows year by year, and the prior art requires retraining every time data is added. However, there has been no guarantee that retraining yields a better translation model.

To address this problem, one conventional method divides the bilingual data by domain, learns a local translation model for each domain, and combines those translation models (see Non-Patent Document 2).

Another conventional approach uses a classifier that assigns an appropriate domain to an input sentence in the source language (see Non-Patent Document 3).

Other conventional techniques add domain-dependent features and optimize their parameters on bilingual data annotated with domain labels (see Non-Patent Documents 3 to 6).

Furthermore, there is a conventional technique called incremental learning, in which the model is updated according to the added portion each time bilingual data is added (see Non-Patent Document 7).

[Non-Patent Document 1] Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proc. HLT.
[Non-Patent Document 2] George Foster and Roland Kuhn. 2007. Mixture-model adaptation for SMT. In Proc. of the second workshop of SMT.
[Non-Patent Document 3] Jia Xu, Yonggang Deng, Yuqing Gao, and Hermann Ney. 2007. Domain dependent statistical machine translation. MT Summit XI.
[Non-Patent Document 4] Wei Wang, Klaus Macherey, Wolfgang Macherey, Franz Och, and Peng Xu. 2012. Improved domain adaptation for statistical machine translation. In Proc. of AMTA.
[Non-Patent Document 5] Yajuan Lu, Jin Huang, and Qun Liu. 2007. Improving statistical machine translation performance by training data selection and optimization. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 343-350.
[Non-Patent Document 6] Jinsong Su, Hua Wu, Haifeng Wang, Yidong Chen, Xiaodong Shi, Huailin Dong, and Qun Liu. 2012. Translation model adaptation for statistical machine translation with monolingual topic information. In Proc. of ACL.
[Non-Patent Document 7] Abby Levenberg, Chris Callison-Burch, and Miles Osborne. 2010. Stream-based translation models for statistical machine translation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10, pages 394-402, Stroudsburg, PA, USA. Association for Computational Linguistics.

However, the domain adaptation approaches above have the problem that, when models are learned locally for each domain, less bilingual data is available to each model, so the translation model cannot be estimated accurately. These approaches also require a classifier, either to determine the weight of each domain or to identify the domain of the input sentence. In particular, the techniques of Non-Patent Documents 2 and 3 require a classifier that assigns an appropriate domain to the input sentence, so translation accuracy depends on the performance of that classifier.

In addition, the feature-based methods of Non-Patent Documents 4 to 6 have the problem that they require accurate bilingual data labeled with domains.

Furthermore, the incremental learning of Non-Patent Document 7 is based on techniques that require complicated parameter tuning, such as the online EM algorithm, and it results in a complex system, for example because the added translation model must be optimized further.

In summary, in the prior art, enriching the translation model step by step each time a bilingual corpus is added has required cumbersome processing.

In view of the above, an object of the present invention is to make it easy to enrich a translation model step by step by using a translation model generated from an added bilingual corpus in linkage with the original translation model.

The bilingual phrase learning device of the first invention comprises:
a bilingual information storage unit capable of storing N bilingual corpora (N is a natural number of 2 or more), each having one or more pieces of bilingual information that have a bilingual sentence and a tree structure of the bilingual sentence;
a phrase table capable of storing, for each bilingual corpus, one or more scored phrase pairs, each having a phrase pair, which is a pair of a first language phrase having one or more words of a first language and a second language phrase having one or more words of a second language, and a score, which is information on the appearance probability of the phrase pair;
a phrase appearance frequency information storage unit capable of storing, for each bilingual corpus, one or more pieces of phrase appearance frequency information, each having a phrase pair and F appearance frequency information, which is information on the appearance frequency of the phrase pair;
a symbol appearance frequency information storage unit capable of storing one or more pieces of symbol appearance frequency information, each having a symbol that identifies a method of generating a new phrase pair and S appearance frequency information, which is information on the appearance frequency of the symbol;
a generated phrase pair acquisition unit that, for each bilingual corpus, acquires a phrase pair having a first language phrase and a second language phrase using the one or more pieces of phrase appearance frequency information;
a phrase appearance frequency information update unit that, when the phrase pair could be acquired, increases the F appearance frequency information corresponding to the phrase pair by a predetermined value;
a symbol acquisition unit that, when the phrase pair could not be acquired, acquires one symbol using the one or more pieces of symbol appearance frequency information;
a symbol appearance frequency information update unit that increases the S appearance frequency information corresponding to the symbol acquired by the symbol acquisition unit by a predetermined value;
a partial phrase pair generation unit that, when the phrase pair could not be acquired, generates two phrase pairs smaller than the phrase pair that was to be acquired;
a new phrase pair generation unit that, according to the symbol acquired by the symbol acquisition unit, performs one of the following: a first process of generating a new phrase pair; a second process of generating two smaller phrase pairs and, using the one or more pieces of phrase appearance frequency information, generating one phrase pair having a new first language phrase in which the two first language phrases of the two generated phrase pairs are concatenated in order and a new second language phrase in which the two second language phrases of the two phrase pairs are concatenated in order; or a third process of generating two smaller phrase pairs and, using the one or more pieces of phrase appearance frequency information, generating one phrase pair having a new first language phrase in which the two first language phrases of the two generated phrase pairs are concatenated in order and a new second language phrase in which the two second language phrases of the two phrase pairs are concatenated in reverse order;
a control unit that instructs that the processing of the phrase appearance frequency information update unit, the symbol acquisition unit, the symbol appearance frequency information update unit, the partial phrase pair generation unit, and the new phrase pair generation unit be performed recursively on the phrase pair generated by the new phrase pair generation unit;
a score calculation unit that calculates a score for each phrase pair in the phrase table using the one or more pieces of phrase appearance frequency information stored in the phrase appearance frequency information storage unit; and
a phrase table update unit that accumulates the score calculated by the score calculation unit in association with each phrase pair.
When calculating the score for each phrase pair acquired from the j-th (2 <= j <= N) bilingual corpus, the score calculation unit calculates the score for each phrase pair corresponding to the j-th bilingual corpus using the one or more pieces of phrase appearance frequency information corresponding to the (j-1)-th bilingual corpus.

With this configuration, the translation model can easily be enriched step by step.
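The generation procedure described for the first invention can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: the function name `generate_pair`, the reuse probability `total / (total + alpha)`, the add-one smoothing of the symbol counts, the `max_depth` cut-off, and the space-joined toy phrases are all introduced here for the example.

```python
import random
from collections import Counter

SYMBOLS = ("BASE", "REG", "INV")


def generate_pair(f_counts, s_counts, base_draw, rng, alpha=1.0, depth=0, max_depth=3):
    """Return one (source, target) phrase pair, updating both Counters in place."""
    total = sum(f_counts.values())
    # Assumption: a stored pair is reused with probability total / (total + alpha).
    if total > 0 and rng.random() < total / (total + alpha):
        pairs = list(f_counts)
        pair = rng.choices(pairs, weights=[f_counts[p] for p in pairs])[0]
    else:
        # No stored pair was obtained, so pick a symbol from the S counts.
        if depth >= max_depth:
            symbol = "BASE"  # assumption: force the recursion to terminate
        else:
            weights = [s_counts[s] + 1 for s in SYMBOLS]  # add-one smoothing (assumption)
            symbol = rng.choices(SYMBOLS, weights=weights)[0]
        s_counts[symbol] += 1
        if symbol == "BASE":
            pair = base_draw(rng)  # first process: new pair from the base measure
        else:
            args = (f_counts, s_counts, base_draw, rng, alpha, depth + 1, max_depth)
            src1, tgt1 = generate_pair(*args)
            src2, tgt2 = generate_pair(*args)
            if symbol == "REG":  # second process: target phrases in the same order
                pair = (src1 + " " + src2, tgt1 + " " + tgt2)
            else:  # third process: target phrases in reverse order
                pair = (src1 + " " + src2, tgt2 + " " + tgt1)
    f_counts[pair] += 1  # F appearance frequency of the pair that was obtained
    return pair
```

Both count tables are expected to be `collections.Counter` objects, and `base_draw` is a hypothetical callable that stands in for drawing a phrase pair from the base measure.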

The bilingual phrase learning device of the second invention is the device of the first invention in which the bilingual information storage unit stores one or more bilingual corpora, and which further comprises a bilingual corpus reception unit that receives a bilingual corpus and a bilingual corpus accumulation unit that accumulates the bilingual corpus received by the bilingual corpus reception unit in the bilingual information storage unit. After the bilingual corpus accumulation unit has accumulated the received bilingual corpus in the bilingual information storage unit, the control unit instructs the generated phrase pair acquisition unit, the phrase appearance frequency information update unit, the symbol acquisition unit, the symbol appearance frequency information update unit, the partial phrase pair generation unit, and the new phrase pair generation unit to process that bilingual corpus. When calculating the score for each phrase pair acquired from the bilingual corpus received by the bilingual corpus reception unit, the score calculation unit calculates the score for each phrase pair corresponding to the received bilingual corpus using the one or more pieces of phrase appearance frequency information corresponding to any one of the bilingual corpora that were stored in the bilingual information storage unit before the bilingual corpus accumulation unit accumulated the received bilingual corpus.

The bilingual phrase learning device of the third invention is the device of the first invention that further comprises a bilingual corpus generation unit that divides two or more bilingual sentences into N groups, acquires the tree structures of the bilingual sentences from the bilingual sentences of each group, and accumulates the N bilingual corpora created in this way in the bilingual information storage unit. When calculating the score for each phrase pair acquired from one bilingual corpus, the score calculation unit calculates the score for each phrase pair corresponding to that bilingual corpus using the one or more pieces of phrase appearance frequency information corresponding to another bilingual corpus different from that one.

The bilingual phrase learning device of the fourth invention is the device of any one of the first to third inventions in which the score calculation unit calculates the score for each phrase pair corresponding to each bilingual corpus using a hierarchical Chinese restaurant process.

With this configuration, the translation model can easily be enriched step by step using a hierarchical Chinese restaurant process.
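One way to picture how the score for the j-th corpus can lean on the (j-1)-th corpus is the standard Chinese restaurant process predictive probability, with each corpus using the previous corpus's distribution as its prior. The sketch below is a simplification under assumptions that are not taken from the patent: a single concentration parameter `strength`, no discount, and no propagation of table counts to the parent level as a full hierarchical model would do. The names `chained_scores` and `base_prob` are hypothetical.

```python
def chained_scores(counts_per_corpus, base_prob, strength=1.0):
    """Return one scoring function per corpus; corpus j backs off to corpus j-1."""
    scorers = []
    previous = base_prob  # corpus 1 backs off to the base measure
    for counts in counts_per_corpus:
        total = sum(counts.values())

        def score(pair, counts=counts, total=total, parent=previous):
            # CRP predictive probability with the parent distribution as the prior.
            return (counts.get(pair, 0) + strength * parent(pair)) / (total + strength)

        scorers.append(score)
        previous = score
    return scorers
```

With counts `{"a": 3, "b": 1}` for the first corpus, `{"a": 1}` for the second, and a base probability of 0.5 for each pair, the first corpus scores "a" at (3 + 0.5) / (4 + 1) = 0.7, and the second corpus scores it at (1 + 0.7) / (1 + 1) = 0.85.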

The statistical machine translation device of the fifth invention comprises: a phrase table learned by the bilingual phrase learning device of any one of the first to fourth inventions; a reception unit that receives a sentence in the first language having one or more words; a phrase acquisition unit that extracts one or more phrases from the sentence received by the reception unit and, using the scores in the phrase table, acquires one or more phrases in the second language from the phrase table; a sentence construction unit that constructs a sentence in the second language from the one or more phrases acquired by the phrase acquisition unit; and an output unit that outputs the sentence constructed by the sentence construction unit.

With this configuration, highly accurate machine translation can be achieved using a translation model that has been enriched step by step.
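The flow of the fifth invention (extract phrases, look them up by score, assemble the output) can be illustrated with a toy lookup. This is a deliberately simplified sketch: it assumes greedy, monotone, longest-match segmentation and ignores reordering and language models, which a practical phrase-based decoder would use. The function name `translate` and the nested-dictionary table layout are assumptions made for the example.

```python
def translate(words, table, max_len=3):
    """Greedy, monotone, longest-match lookup in a scored phrase table."""
    output, i = [], 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            source = " ".join(words[i:i + n])
            if source in table:
                # Pick the target phrase with the highest score.
                output.append(max(table[source], key=table[source].get))
                i += n
                break
        else:
            output.append(words[i])  # unknown word: copy it through unchanged
            i += 1
    return output
```

Here `table` maps a first language phrase to a dictionary of second language phrases and their scores, and the returned list of target phrases is left for the caller to join.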

According to the bilingual phrase learning device of the present invention, a translation model can easily be enriched step by step.

Block diagram of the bilingual phrase learning device 1 in Embodiment 1 of the present invention
Flowchart explaining the operation of the bilingual phrase learning device 1 in Embodiment 1 of the present invention
Flowchart explaining the phrase generation processing in Embodiment 1 of the present invention
Diagram showing an example of a tree structure constituting bilingual information in Embodiment 1 of the present invention
Block diagram of the bilingual phrase learning device 2 in Embodiment 2 of the present invention
Flowchart explaining the operation of the bilingual phrase learning device 2 in Embodiment 2 of the present invention
Block diagram of the statistical machine translation device 3 in Embodiment 3 of the present invention
Diagram explaining the data sets used in the experiments in the embodiments of the present invention
Diagram showing experimental results in the embodiments of the present invention
Diagram showing experimental results in the embodiments of the present invention
Diagram showing experimental results in the embodiments of the present invention
External view of a computer system in the embodiments of the present invention
Block diagram of the computer system in the embodiments of the present invention

Embodiments of the bilingual phrase learning device and the like are described below with reference to the drawings. Components given the same reference numerals in the embodiments operate in the same way, so repeated description of them may be omitted.

(Embodiment 1)
This embodiment describes a bilingual phrase learning device that can easily enrich a translation model step by step by linking a translation model generated from an added bilingual corpus to the original translation model.

FIG. 1 is a block diagram of the bilingual phrase learning device 1 according to this embodiment.

The bilingual phrase learning device 1 includes a bilingual information storage unit 100, a phrase table 101, a phrase appearance frequency information storage unit 102, a symbol appearance frequency information storage unit 103, a bilingual corpus reception unit 104, a bilingual corpus accumulation unit 105, a phrase table initialization unit 106, a generated phrase pair acquisition unit 107, a phrase appearance frequency information update unit 108, a symbol acquisition unit 109, a symbol appearance frequency information update unit 110, a partial phrase pair generation unit 111, a new phrase pair generation unit 112, a control unit 113, a score calculation unit 114, a parsing unit 115, a phrase table update unit 116, and a tree update unit 117.

The bilingual information storage unit 100 can store N bilingual corpora (N is a natural number that is 2, or 3 or more). Each bilingual corpus here has one or more pieces of bilingual information. A piece of bilingual information has a bilingual sentence and the tree structure of that bilingual sentence. A bilingual sentence is a pair of a first language sentence and a second language sentence, that is, a sentence in the first language and a sentence in the second language. Here, a sentence means one or more words and therefore also includes a phrase. The tree structure of a bilingual sentence is information that represents, as a tree, the correspondence between the phrases (including single words) into which the sentences of the two languages are divided.

Note that the bilingual information storage unit 100 may store a single bilingual corpus before processing, with one or more further bilingual corpora, the second and subsequent ones, accumulated later.

The phrase table 101 can store one or more scored phrase pairs for each of the N bilingual corpora. A scored phrase pair has a phrase pair and a score. A phrase pair is a pair of a first language phrase and a second language phrase. The first language phrase is a phrase having one or more words of the first language, and the second language phrase is a phrase having one or more words of the second language. The term "phrase" is interpreted broadly and also includes a sentence. The score is information on the appearance probability of the phrase pair, for example the phrase pair probability θ_t. The term "phrase pair" is likewise interpreted broadly and includes rule pairs. The one or more scored phrase pairs may be regarded as synonymous with the translation model described above.

The phrase appearance frequency information storage unit 102 can store one or more pieces of phrase appearance frequency information for each bilingual corpus. A piece of phrase appearance frequency information has a phrase pair and F appearance frequency information, which is information on the appearance frequency of the phrase pair. The F appearance frequency information is preferably the appearance frequency of the phrase pair, but may instead be, for example, its appearance probability. The initial values of the F appearance frequency information are, for example, all 0.

The symbol appearance frequency information storage unit 103 can store one or more pieces of symbol appearance frequency information. A piece of symbol appearance frequency information has a symbol and S appearance frequency information. A symbol is information that identifies a method of generating a new phrase pair and is, for example, one of BASE, REG, and INV. BASE is the symbol indicating that a phrase pair is generated from the base measure, REG is the regular nonterminal symbol, and INV is the inverted nonterminal symbol. The S appearance frequency information is information on the appearance frequency of the symbol. It is preferably the appearance frequency of the symbol, but may instead be, for example, its appearance probability. The initial value of the S appearance frequency information is, for example, 0 for all three symbols. The base measure is, for example, a prior probability computed by a word translation model such as IBM Model 1. Since this is a known technique, a detailed description is omitted.

The bilingual corpus reception unit 104 receives a bilingual corpus. Here, reception is a concept that includes receiving information entered from an input device such as a keyboard, mouse, or touch panel, receiving information transmitted over a wired or wireless communication line, and receiving information read from a recording medium such as an optical disk, magnetic disk, or semiconductor memory.

Any means may be used to input the bilingual corpus, such as a keyboard, a mouse, or a menu screen. The bilingual corpus reception unit 104 can be realized by a device driver for input means such as a keyboard, by control software for a menu screen, or the like.

The bilingual corpus accumulation unit 105 accumulates the bilingual corpus received by the bilingual corpus reception unit 104 in the bilingual information storage unit 100.

The phrase table initialization unit 106 generates initial information for one or more scored phrase pairs from the one or more pieces of bilingual information in a bilingual corpus and accumulates it in the phrase table 101. For example, the phrase table initialization unit 106 acquires each phrase pair that appears in the tree structures of the bilingual sentences in the bilingual information, together with its number of appearances, as a scored phrase pair and accumulates it in the phrase table 101. In that case, the score is the number of appearances. The phrase table initialization unit 106 normally generates this initial information for each bilingual corpus. It may also generate the initial information from the one or more pieces of bilingual information in the bilingual corpus received by the bilingual corpus reception unit 104 and accumulate it in the phrase table 101.
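The count-based initialization described above can be sketched as a walk over each tree. The tree representation used here, a nested tuple `(source, target, children)`, and the function name `count_pairs` are assumptions for the example. The patent does not specify a data format.

```python
from collections import Counter


def count_pairs(tree, counts=None):
    """Count every phrase pair that appears as a node of an alignment tree."""
    counts = Counter() if counts is None else counts
    source, target, children = tree
    counts[(source, target)] += 1  # initial score = number of appearances
    for child in children:
        count_pairs(child, counts)
    return counts
```

Passing the same `Counter` for every tree of one bilingual corpus yields the initial table for that corpus.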

The generated phrase pair acquisition unit 107 acquires, for each bilingual corpus, a phrase pair having a first language phrase and a second language phrase using the one or more pieces of phrase appearance frequency information.

For each bilingual corpus, the generated phrase pair acquisition unit 107 takes each of the one or more bilingual sentences stored in the corpus and subtracts the occurrences (normally an appearance frequency of 1) of each phrase pair in the tree structure of that bilingual sentence from the score of the corresponding phrase pair in the phrase table 101. The generated phrase pair acquisition unit 107 then acquires (more precisely, attempts to acquire) a phrase pair having a first language phrase and a second language phrase using the one or more pieces of phrase appearance frequency information. Using the phrase appearance frequency information may mean, for example, using the phrase pair probability distribution P_t. In other words, the generated phrase pair acquisition unit 107 preferably acquires the phrase pair using the phrase pair probability distribution P_t.
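The subtraction step, removing one sentence's phrase pairs from the table before a new phrase pair is drawn for it, might look like the sketch below. It assumes the table is a `collections.Counter` keyed by `(source, target)` tuples and that the phrase pairs of the sentence's tree are supplied as a list. The name `remove_sentence` is hypothetical.

```python
from collections import Counter


def remove_sentence(counts, tree_pairs):
    """Subtract one occurrence of every phrase pair in a sentence's tree."""
    for pair in tree_pairs:
        if counts[pair] <= 1:
            del counts[pair]  # drop pairs whose count would reach zero
        else:
            counts[pair] -= 1
```

Deleting exhausted entries keeps the table from offering phrase pairs that no remaining sentence supports. Whether the patent's implementation does this is not stated.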

When the generated phrase pair acquisition unit 107 or the new phrase pair generation unit 112 has been able to acquire a phrase pair, the phrase appearance frequency information update unit 108 increases the F appearance frequency information corresponding to that phrase pair by a predetermined value. The F appearance frequency information here is normally the appearance frequency of the phrase pair, and the predetermined value is normally 1.

When the generated phrase pair acquisition unit 107 or the like could not acquire a phrase pair, the symbol acquisition unit 109 acquires one symbol using the one or more pieces of symbol appearance frequency information. Preferably, this means drawing the symbol from the symbol probability distribution P_x(x; θ_x). The symbol is, for example, one of BASE, REG, and INV. In P_x(x; θ_x), x is a symbol and θ_x is the probability with which each symbol is chosen.

記号出現頻度情報更新部１１０は、記号取得部１０９が取得した記号に対応するＳ出現頻度情報を、予め決められた値だけ増加する。また、予め決められた値とは、通常、１である。 The symbol appearance frequency information update unit 110 increases the S appearance frequency information corresponding to the symbol acquired by the symbol acquisition unit 109 by a predetermined value. The predetermined value is usually 1.
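The interplay between the symbol acquisition unit 109 and the symbol appearance frequency information update unit 110 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes P_x is a plain relative-frequency estimate over the stored counts, and the names `symbol_probabilities` and `draw_symbol` are hypothetical.

```python
import random

def symbol_probabilities(counts):
    """Relative-frequency estimate of P_x from symbol counts (an assumption)."""
    total = sum(counts.values())
    return {symbol: count / total for symbol, count in counts.items()}

def draw_symbol(counts, rng):
    """Draw one symbol in proportion to its count, then add 1 to that count."""
    symbols = list(counts)
    weights = [counts[s] for s in symbols]
    chosen = rng.choices(symbols, weights=weights, k=1)[0]
    counts[chosen] += 1  # the "predetermined value" is usually 1
    return chosen

# Hypothetical S appearance frequency information: symbol -> count.
symbol_counts = {"BASE": 3, "REG": 5, "INV": 2}
probabilities = symbol_probabilities(symbol_counts)
symbol = draw_symbol(symbol_counts, random.Random(0))
```

Because every draw increments a count, symbols that are drawn often become more likely to be drawn again.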

When the generated phrase pair acquisition unit 107 or the like could not acquire a phrase pair, the partial phrase pair generation unit 111 generates two phrase pairs smaller than the phrase pair it attempted to acquire, usually by using the prior probability of phrase pairs. More specifically, when a phrase pair in the j-th bilingual corpus is to be generated, the partial phrase pair generation unit 111 uses the prior probability P^{j-1} of phrase pairs from the (j-1)-th bilingual corpus. When j = 1, it uses P_base (for example, IBM Model 1) instead.

For example, if the phrase pair to be acquired is <red cookbook, 赤い料理本>, then

$$
\begin{aligned}
P_{\mathrm{base}}(\langle \text{red cookbook}, \text{赤い料理本} \rangle)
 ={}& P_x(\mathrm{REG})\, P_t(\langle \text{red}, \text{赤い} \rangle)\, P_t(\langle \text{cookbook}, \text{料理本} \rangle) \\
 &+ P_x(\mathrm{REG})\, P_t(\langle \text{red}, \text{赤い料理} \rangle)\, P_t(\langle \text{cookbook}, \text{本} \rangle) \\
 &+ P_x(\mathrm{INV})\, P_t(\langle \text{red}, \text{本} \rangle)\, P_t(\langle \text{cookbook}, \text{赤い料理} \rangle) \\
 &+ P_x(\mathrm{INV})\, P_t(\langle \text{red}, \text{料理本} \rangle)\, P_t(\langle \text{cookbook}, \text{赤い} \rangle) \\
 &+ P_x(\mathrm{BASE})\, P_{\mathrm{base}}(\langle \text{red cookbook}, \text{赤い料理本} \rangle)
\end{aligned}
$$

Here P_base is a prior probability computed by a word translation model such as IBM Model 1.
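The decomposition in the example above can be enumerated mechanically. The sketch below assumes phrases are tuples of tokens and that the second-language phrase is tokenized as 赤い / 料理 / 本; the function names and the placeholder probability values are hypothetical and serve only to show how the five terms arise.

```python
def itg_splits(f, e):
    """Yield (symbol, (f1, e1), (f2, e2)) for every binary split of <f, e>."""
    for i in range(1, len(f)):
        for j in range(1, len(e)):
            f1, f2 = f[:i], f[i:]
            # REG pairs the halves in the same order on both sides.
            yield "REG", (f1, e[:j]), (f2, e[j:])
            # INV pairs the first f half with the second e half.
            yield "INV", (f1, e[j:]), (f2, e[:j])

def mixture_probability(f, e, p_x, p_t, p_base):
    """Sum the right-hand side: one BASE term plus one term per split."""
    total = p_x["BASE"] * p_base(f, e)
    for symbol, first, second in itg_splits(f, e):
        total += p_x[symbol] * p_t(*first) * p_t(*second)
    return total

f = ("red", "cookbook")
e = ("赤い", "料理", "本")
splits = list(itg_splits(f, e))

# Placeholder probabilities, chosen only so the arithmetic is easy to follow.
p_x = {"BASE": 0.2, "REG": 0.5, "INV": 0.3}
value = mixture_probability(f, e, p_x, lambda f, e: 0.1, lambda f, e: 0.01)
```

With two source tokens and three target tokens there is one split point on the source side and two on the target side, which gives the two REG terms and two INV terms of the example.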

The new phrase pair generation unit 112 performs one of a first process, a second process, and a third process according to the symbol acquired by the symbol acquisition unit 109: the first process when the symbol is BASE, the second process when it is REG, and the third process when it is INV.

The first process generates a new phrase pair using the prior probability of phrase pairs. When the j-th (2 <= j <= N) bilingual corpus is being processed, the prior probability used in the first process is the prior probability of phrase pairs for the (j-1)-th bilingual corpus.

The second process generates two smaller phrase pairs and, using the one or more pieces of phrase appearance frequency information, generates one phrase pair consisting of a new first language phrase, formed by concatenating the two first language phrases of the generated pairs in order, and a new second language phrase, formed by concatenating their two second language phrases in order.

The third process generates two smaller phrase pairs and, using the one or more pieces of phrase appearance frequency information, generates one phrase pair consisting of a new first language phrase, formed by concatenating the two first language phrases of the generated pairs in order, and a new second language phrase, formed by concatenating their two second language phrases in reverse order. Using the phrase appearance frequency information here may mean using the phrase pair generation probability P_hier.
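The difference between the second and third processes comes down to how the two smaller pairs are joined. A minimal sketch, assuming phrases are tuples of tokens and using a hypothetical helper `combine`:

```python
def combine(symbol, pair1, pair2):
    """Join two smaller phrase pairs into one, as in the second/third process."""
    (f1, e1), (f2, e2) = pair1, pair2
    if symbol == "REG":
        # Second process: both sides are concatenated in order.
        return (f1 + f2, e1 + e2)
    if symbol == "INV":
        # Third process: the second-language side is concatenated in reverse order.
        return (f1 + f2, e2 + e1)
    raise ValueError(f"unexpected symbol: {symbol}")

regular = combine("REG", (("red",), ("赤い",)), (("cookbook",), ("料理", "本")))
inverted = combine("INV", (("red",), ("本",)), (("cookbook",), ("赤い", "料理")))
```

Both calls rebuild the same phrase pair; they differ only in which smaller pairs were aligned to each other.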

The control unit 113 instructs the phrase appearance frequency information update unit 108, the symbol acquisition unit 109, the symbol appearance frequency information update unit 110, the partial phrase pair generation unit 111, and the new phrase pair generation unit 112 to process recursively the phrase pairs generated by the new phrase pair generation unit 112. The recursion usually ends when the processing target has become a word pair. More precisely, it ends when a phrase pair is generated directly from P_t (without using the base measure), or when BASE is drawn from P_x and a phrase pair is generated from P_base.

The control unit 113 may also instruct the generated phrase pair acquisition unit 107, the phrase appearance frequency information update unit 108, the symbol acquisition unit 109, the symbol appearance frequency information update unit 110, the partial phrase pair generation unit 111, and the new phrase pair generation unit 112 to process a bilingual corpus after the bilingual corpus accumulation unit 105 has stored the accepted bilingual corpus in the bilingual information storage unit 100.

スコア算出部１１４は、フレーズ出現頻度情報格納部１０２に格納されている１以上のフレーズ出現頻度情報を用いて、フレーズテーブル１０１の各フレーズペアに対するスコアを算出する。 The score calculation unit 114 calculates a score for each phrase pair in the phrase table 101 using one or more phrase appearance frequency information stored in the phrase appearance frequency information storage unit 102.

When calculating the score for each phrase pair acquired from the j-th (2 <= j <= N) bilingual corpus, the score calculation unit 114 uses the one or more pieces of phrase appearance frequency information corresponding to the (j-1)-th bilingual corpus.

When calculating the score for each phrase pair acquired from a bilingual corpus accepted by the bilingual corpus accepting unit 104, the score calculation unit 114 may use the one or more pieces of phrase appearance frequency information corresponding to any one of the bilingual corpora that were already stored in the bilingual information storage unit 100 before the bilingual corpus accumulation unit 105 stored the newly accepted corpus.

また、スコア算出部１１４は、数式１により階層的な中華レストラン過程を用いて、各対訳コーパスに対応する各フレーズペアに対するスコアを算出しても良い。
Further, the score calculation unit 114 may calculate a score for each phrase pair corresponding to each bilingual corpus using the hierarchical Chinese restaurant process according to Equation 1.

In summary, rather than estimating model parameters over all of the bilingual data <F, E>, the bilingual phrase learning device 1 learns from the part of the bilingual data that belongs to a particular domain. Moreover, as the prior probability it uses a model learned on another domain rather than a model such as IBM Model 1. Specifically, the bilingual data <F, E> is assumed to be divided into J domains <F^1, E^1>, ..., <F^J, E^J>, and the translation model parameters θ_t^j of the j-th domain are learned from the bilingual data <F^j, E^j> of the j-th domain, using the model P^{j-1} obtained for the preceding (j-1)-th domain as the prior probability (see Equation 2). The translation model of Equation 2 is called a hierarchical Pitman-Yor model and is used, for example, in n-gram language models and in domain adaptation. Expressed as a hierarchical Chinese restaurant process, the hierarchical Pitman-Yor model takes the form of Equation 1. In the bilingual data <F, E>, F is a source language sentence and E is a target language sentence (second language sentence).
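To make the chaining of domains concrete, the sketch below uses the textbook predictive probability of a Pitman-Yor process in its Chinese restaurant form, with the model of domain j-1 acting as the base distribution of domain j. Equations 1 and 2 are not reproduced in this section, so the exact formula, the parameter names (`discount`, `strength`), and the count layout are assumptions rather than the patent's definitions.

```python
def py_predictive(x, customers, tables, discount, strength, base):
    """Standard Pitman-Yor predictive probability of item x.

    customers[x]: how often x was observed; tables[x]: how many of those
    observations were generated from the base distribution.
    """
    n = sum(customers.values())
    t = sum(tables.values())
    if n == 0:
        return base(x)  # nothing observed yet: fall back to the prior
    seen = max(customers.get(x, 0) - discount * tables.get(x, 0), 0.0)
    backoff = (strength + discount * t) / (strength + n)
    return seen / (strength + n) + backoff * base(x)

def chain_domains(domains, discount, strength, p_base):
    """Build P^J so that each P^j backs off to P^(j-1), and P^1 to p_base."""
    model = p_base
    for customers, tables in domains:
        model = (lambda x, c=customers, t=tables, prev=model:
                 py_predictive(x, c, t, discount, strength, prev))
    return model

items = ["a", "b", "c", "d"]
uniform = lambda x: 1.0 / len(items)
domain1 = ({"a": 3, "b": 1}, {"a": 1, "b": 1})
domain2 = ({"c": 1}, {"c": 1})
p1 = chain_domains([domain1], 0.5, 1.0, uniform)
p2 = chain_domains([domain1, domain2], 0.5, 1.0, uniform)
```

In this toy setting `p2` keeps most of the probability mass that `p1` assigned to "a" while boosting "c", which is the effect of using the earlier domain as a prior instead of retraining on all data.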

The parsing unit 115 acquires the tree structure of a bilingual sentence (including phrases) that maximizes the score calculated by the score calculation unit 114. More specifically, the parsing unit 115 acquires the tree structure with an ITG chart parser, which is described in M. Saers, J. Nivre, and D. Wu, "Learning stochastic bracketing inversion transduction grammars with a cubic time biparsing algorithm," in Proc. IWPT, 2009.

The phrase table update unit 116 stores the score calculated by the score calculation unit 114 in association with each phrase pair. If the phrase pair corresponding to a calculated score does not exist in the phrase table 101, the phrase table update unit 116 may store in the phrase table 101 a scored phrase pair consisting of that score and the phrase pair.

木更新部１１７は、パージング部１１５が取得した木構造を、対訳コーパスに蓄積する。ここで、通常、木更新部１１７は、木構造を上書きする。つまり、対訳コーパス中の古い木構造は、新しい木構造に更新される。 The tree update unit 117 accumulates the tree structure acquired by the parsing unit 115 in the bilingual corpus. Here, the tree update unit 117 normally overwrites the tree structure. That is, the old tree structure in the bilingual corpus is updated to a new tree structure.

対訳情報格納部１００、フレーズテーブル１０１、フレーズ出現頻度情報格納部１０２、および記号出現頻度情報格納部１０３は、不揮発性の記録媒体が好適であるが、揮発性の記録媒体でも実現可能である。 The bilingual information storage unit 100, the phrase table 101, the phrase appearance frequency information storage unit 102, and the symbol appearance frequency information storage unit 103 are preferably non-volatile recording media, but can also be realized by volatile recording media.

対訳情報格納部１００等に対訳コーパス等が記憶される過程は問わない。例えば、記録媒体を介して対訳コーパス等が対訳情報格納部１００等で記憶されるようになってもよく、通信回線等を介して送信された対訳コーパス等が対訳情報格納部１００等で記憶されるようになってもよく、あるいは、入力デバイスを介して入力された対訳コーパス等が対訳情報格納部１００等で記憶されるようになってもよい。 The process in which the bilingual corpus is stored in the bilingual information storage unit 100 or the like is not limited. For example, a bilingual corpus or the like may be stored in the bilingual information storage unit 100 or the like via a recording medium, and a bilingual corpus or the like transmitted via a communication line or the like is stored in the bilingual information storage unit 100 or the like. Alternatively, the bilingual corpus or the like input via the input device may be stored in the bilingual information storage unit 100 or the like.

対訳コーパス蓄積部１０５、フレーズテーブル初期化部１０６、生成フレーズペア取得部１０７、フレーズ出現頻度情報更新部１０８、記号取得部１０９、記号出現頻度情報更新部１１０、部分フレーズペア生成部１１１、新フレーズペア生成部１１２、制御部１１３、スコア算出部１１４、パージング部１１５、フレーズテーブル更新部１１６、および木更新部１１７は、通常、ＭＰＵやメモリ等から実現され得る。対訳コーパス蓄積部１０５等の処理手順は、通常、ソフトウェアで実現され、当該ソフトウェアはＲＯＭ等の記録媒体に記録されている。但し、ハードウェア（専用回路）で実現しても良い。 Bilingual corpus accumulation unit 105, phrase table initialization unit 106, generated phrase pair acquisition unit 107, phrase appearance frequency information update unit 108, symbol acquisition unit 109, symbol appearance frequency information update unit 110, partial phrase pair generation unit 111, new phrase The pair generation unit 112, the control unit 113, the score calculation unit 114, the parsing unit 115, the phrase table update unit 116, and the tree update unit 117 can be normally realized by an MPU, a memory, or the like. The processing procedure of the bilingual corpus storage unit 105 or the like is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may be realized by hardware (dedicated circuit).

Next, the operation of the bilingual phrase learning device 1 is described with reference to the flowchart of FIG. 2. The flowchart covers the case in which the device sequentially accepts N bilingual corpora (N is a natural number of 2 or more) and uses the (j-1)-th phrase table when building the phrase table from the j-th (2 <= j <= N) bilingual corpus.

（ステップＳ２０１）対訳コーパス受付部１０４は、カウンタｉに１を代入する。 (Step S201) The parallel corpus accepting unit 104 substitutes 1 for a counter i.

（ステップＳ２０２）対訳コーパス受付部１０４は、ｉ番目の対訳コーパスを受け付けたか否かを判断する。ｉ番目の対訳コーパスを受け付ければステップＳ２０３に行き、ｉ番目の対訳コーパスを受け付けなければステップＳ２０２に戻る。 (Step S202) The parallel corpus reception unit 104 determines whether or not an i-th parallel corpus has been received. If the i-th parallel corpus is accepted, the process proceeds to step S203. If the i-th parallel corpus is not accepted, the process returns to step S202.

(Step S203) The phrase table initialization unit 106 generates initial information for one or more scored phrase pairs from the one or more pieces of bilingual information in the i-th bilingual corpus and stores it in the phrase table 101 in association with i.

(Step S204) The generated phrase pair acquisition unit 107 acquires each of the one or more bilingual sentences in the bilingual corpus accepted in step S202 and subtracts the occurrence count (usually 1) of each phrase pair in the tree structure of that sentence from the score of the phrase pair in the phrase table 101 that corresponds to i. The unit then attempts to generate one phrase pair using the phrase pair probability distribution P^{i-1} corresponding to (i-1). When i = 1, it uses the probability distribution P_base, for example IBM Model 1, instead. The phrase pair probability distribution can be calculated, for example by a Pitman-Yor process, from the phrase pair frequencies (F appearance frequency information) corresponding to (i-1), which are stored in the phrase appearance frequency information storage unit 102. Calculating probabilities based on the Pitman-Yor process is a known technique and is not described here.

(Step S205) The partial phrase pair generation unit 111 and related units perform phrase generation processing, for example the generation of phrases at two or more levels using a hierarchical ITG. The details are described with reference to the flowchart of FIG. 3.

（ステップＳ２０６）対訳コーパス受付部１０４は、カウンタｉを１、インクリメントする。 (Step S206) The parallel corpus reception unit 104 increments the counter i by one.

（ステップＳ２０７）対訳コーパス受付部１０４は、「ｉ＜＝Ｎ」を満たすか否かを判断する。「ｉ＜＝Ｎ」を満たす場合はステップＳ２０２に戻り、「ｉ＜＝Ｎ」を満たさない場合は処理を終了する。 (Step S207) The parallel corpus reception unit 104 determines whether or not “i <= N” is satisfied. If “i <= N” is satisfied, the process returns to step S202, and if “i <= N” is not satisfied, the process ends.
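The outer loop of steps S201 to S207 can be summarized in a few lines. This is only a sketch of the control flow: `train_one` is a hypothetical stand-in for steps S203 to S205, and the waiting loop of step S202 is replaced by iterating over corpora that are already available.

```python
def learn_incrementally(corpora, p_base, train_one):
    """Process corpora 1..N in order; corpus i uses the model from corpus i-1."""
    models = []
    prior = p_base  # for i = 1 the prior is P_base (for example, IBM Model 1)
    for i, corpus in enumerate(corpora, start=1):  # S201, S206, S207
        model = train_one(i, corpus, prior)        # S203 to S205
        models.append(model)
        prior = model  # the next corpus uses this model as its prior
    return models

# Toy stand-in that only records which prior each corpus was trained with.
calls = []
def record_training(i, corpus, prior):
    calls.append((i, corpus, prior))
    return f"P^{i}"

models = learn_incrementally(["corpus1", "corpus2", "corpus3"], "P_base", record_training)
```

Adding a new corpus therefore requires one more pass of the loop body rather than retraining on everything accepted so far.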

次に、ステップＳ２０５のフレーズ生成処理の詳細について、図３のフローチャートを用いて説明する。 Next, details of the phrase generation processing in step S205 will be described using the flowchart of FIG.

（ステップＳ３０１）部分フレーズペア生成部１１１は、先のフレーズペアの生成の処理において、フレーズペアが生成できたか否かを判断する。フレーズペアが生成できればステップＳ３０２に行き、生成できなければステップＳ３０５に行く。 (Step S301) The partial phrase pair generation unit 111 determines whether a phrase pair has been generated in the process of generating the previous phrase pair. If the phrase pair can be generated, the process goes to step S302; otherwise, the process goes to step S305.

(Step S302) The phrase appearance frequency information update unit 108 increases the F appearance frequency information corresponding to the phrase pair generated in the preceding phrase pair generation by a predetermined value (usually 1). If the phrase pair does not exist in the phrase appearance frequency information storage unit 102, the unit stores the generated phrase pair there in association with its F appearance frequency information.

（ステップＳ３０３）スコア算出部１１４は、更新されたフレーズ出現頻度情報に対応するフレーズペアのスコアを算出する。ここで、スコア算出部１１４は、当該フレーズペアのスコアを算出する場合に、（ｉ−１）に対応するフレーズ出現頻度情報を使用する（数式１、数式２参照）。 (Step S303) The score calculation unit 114 calculates the score of the phrase pair corresponding to the updated phrase appearance frequency information. Here, the score calculation unit 114 uses the phrase appearance frequency information corresponding to (i-1) when calculating the score of the phrase pair (see Expressions 1 and 2).

(Step S304) The phrase table update unit 116 forms a scored phrase pair having the score calculated in step S303 and writes it to the phrase table 101. If the phrase pair does not yet exist in the phrase table 101, the scored phrase pair is newly added; if it already exists, its score is updated to the score calculated in step S303. Processing then returns to the calling process (S206).

(Step S305) The partial phrase pair generation unit 111 generates two phrase pairs smaller than the phrase pair it attempted to generate, using, for example, the base measure P_dac or the probability distribution P^{i-1} corresponding to the (i-1)-th bilingual corpus.

（ステップＳ３０６）記号取得部１０９は、１以上の記号出現頻度情報を用いて、一の記号ｘを取得する。 (Step S306) The symbol acquisition unit 109 acquires one symbol x using one or more symbol appearance frequency information.

（ステップＳ３０７）記号出現頻度情報更新部１１０は、記号取得部１０９が取得した記号ｘに対応するＳ出現頻度情報を、予め決められた値（通常、「１」）だけ増加させる。 (Step S307) The symbol appearance frequency information update unit 110 increases the S appearance frequency information corresponding to the symbol x acquired by the symbol acquisition unit 109 by a predetermined value (usually “1”).

（ステップＳ３０８）新フレーズペア生成部１１２は、ステップＳ３０６で取得された記号ｘが「BASE」であるか否かを判断する。記号ｘが「BASE」であればステップＳ３０９に行き、「BASE」でなければステップＳ３１０に行く。 (Step S308) The new phrase pair generation unit 112 determines whether or not the symbol x acquired in step S306 is “BASE”. If the symbol x is “BASE”, go to step S309, and if not “BASE”, go to step S310.

（ステップＳ３０９）新フレーズペア生成部１１２は、フレーズペアの事前確率を用いて、新しいフレーズペアを生成し、ステップＳ３０２にジャンプする。 (Step S309) The new phrase pair generation unit 112 generates a new phrase pair using the prior probability of the phrase pair, and jumps to step S302.

（ステップＳ３１０）新フレーズペア生成部１１２は、ステップＳ３０６で取得された記号ｘが「REG」であるか否かを判断する。記号ｘが「REG」であればステップＳ３１１に行き、「REG」でなければステップＳ３１５に行く。なお、記号ｘが「REG」でなければ、記号ｘは「INV」である。 (Step S310) The new phrase pair generation unit 112 determines whether or not the symbol x acquired in Step S306 is “REG”. If the symbol x is “REG”, the process goes to step S311, and if it is not “REG”, the process goes to step S315. If the symbol x is not “REG”, the symbol x is “INV”.

（ステップＳ３１１）新フレーズペア生成部１１２は、より小さい２つのフレーズペアを生成する。なお、ここでの２つのフレーズペアを第一フレーズペア、と第二フレーズペアとする。 (Step S311) The new phrase pair generation unit 112 generates two smaller phrase pairs. In addition, let two phrase pairs here be the 1st phrase pair and the 2nd phrase pair.

（ステップＳ３１２）ステップＳ３１１で生成された第一フレーズペアに対して、図３のフレーズ生成処理を行う。 (Step S312) The phrase generation process of FIG. 3 is performed on the first phrase pair generated in Step S311.

（ステップＳ３１３）ステップＳ３１１で生成された第二フレーズペアに対して、図３のフレーズ生成処理を行う。 (Step S313) The phrase generation process of FIG. 3 is performed on the second phrase pair generated in Step S311.

（ステップＳ３１４）新フレーズペア生成部１１２は、ステップＳ３１２とステップＳ３１３で生成された２つのフレーズペアを順に連結し、一つのフレーズペアを生成し、ステップＳ３０２にジャンプする。 (Step S314) The new phrase pair generation unit 112 sequentially connects the two phrase pairs generated in Step S312 and Step S313, generates one phrase pair, and jumps to Step S302.

（ステップＳ３１５）新フレーズペア生成部１１２は、より小さい２つのフレーズペアを生成する。なお、ここでの２つのフレーズペアを第三フレーズペア、と第四フレーズペアとする。 (Step S315) The new phrase pair generation unit 112 generates two smaller phrase pairs. The two phrase pairs here are referred to as a third phrase pair and a fourth phrase pair.

（ステップＳ３１６）ステップＳ３１５で生成された第三フレーズペアに対して、図３のフレーズ生成処理を行う。 (Step S316) The phrase generation process of FIG. 3 is performed on the third phrase pair generated in Step S315.

（ステップＳ３１７）ステップＳ３１５で生成された第四フレーズペアに対して、図３のフレーズ生成処理を行う。 (Step S317) The phrase generation process of FIG. 3 is performed on the fourth phrase pair generated in Step S315.

（ステップＳ３１８）新フレーズペア生成部１１２は、ステップＳ３１６とステップＳ３１７で生成された２つのフレーズペアを逆順に連結し、一つのフレーズペアを生成し、ステップＳ３０２にジャンプする。 (Step S318) The new phrase pair generation unit 112 concatenates the two phrase pairs generated in steps S316 and S317 in reverse order, generates one phrase pair, and jumps to step S302.
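The recursion of FIG. 3 can be illustrated with a simplified top-down sampler. This sketch departs from the flowchart in several labeled ways: drawing from P_t is replaced by reusing a counted phrase pair with a fixed probability `reuse_prob`, score calculation and phrase table updates (S303, S304) are omitted, and a `depth` limit forces BASE so that the recursion is guaranteed to end. All names are hypothetical.

```python
import random
from collections import Counter

def generate_pair(rng, phrase_counts, symbol_counts, draw_base, depth, reuse_prob=0.5):
    """Generate one phrase pair and update the phrase and symbol counts."""
    # S301: try to obtain a known phrase pair (a stand-in for drawing from P_t).
    if phrase_counts and rng.random() < reuse_prob:
        pairs = list(phrase_counts)
        weights = [phrase_counts[p] for p in pairs]
        pair = rng.choices(pairs, weights=weights, k=1)[0]
    else:
        # S306: draw a symbol; at depth 0 force BASE so the recursion always ends.
        symbols = list(symbol_counts)
        weights = [symbol_counts[s] for s in symbols]
        x = "BASE" if depth == 0 else rng.choices(symbols, weights=weights, k=1)[0]
        symbol_counts[x] += 1  # S307
        if x == "BASE":        # S308, S309
            pair = draw_base(rng)
        else:                  # S311 to S313, or S315 to S317
            f1, e1 = generate_pair(rng, phrase_counts, symbol_counts,
                                   draw_base, depth - 1, reuse_prob)
            f2, e2 = generate_pair(rng, phrase_counts, symbol_counts,
                                   draw_base, depth - 1, reuse_prob)
            # S314 keeps the second-language order (REG); S318 reverses it (INV).
            pair = (f1 + f2, e1 + e2) if x == "REG" else (f1 + f2, e2 + e1)
    phrase_counts[pair] += 1   # S302
    return pair

phrase_counts = Counter()
symbol_counts = {"BASE": 1, "REG": 1, "INV": 1}
word_pair = lambda rng: (("w",), ("v",))  # toy base distribution: one word pair
pair = generate_pair(random.Random(0), phrase_counts, symbol_counts, word_pair, depth=3)
```

Every phrase pair created on the way, from single word pairs up to the final pair, is counted, which is how phrase pairs at several levels of granularity end up in the model.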

In the flowcharts of FIGS. 2 and 3, it is preferable that, after step S304 and before the return, the parsing unit 115 generates a tree structure and the tree update unit 117 updates the stored tree structure. The tree structure that is updated is the one held in the i-th bilingual corpus.

以下、本実施の形態における対訳フレーズ学習装置１の具体的な動作について説明する。 Hereinafter, a specific operation of the parallel phrase learning device 1 according to the present embodiment will be described.

今、記号出現頻度情報格納部１０３には、記号「BASE」「REG」「INV」と、各記号の出現頻度の組である３つの記号出現頻度情報が格納されている、とする。 Now, it is assumed that the symbol appearance frequency information storage unit 103 stores the symbols “BASE”, “REG”, “INV”, and three symbol appearance frequency information that are combinations of the appearance frequencies of the respective symbols.

かかる状況において、まず、対訳コーパス受付部１０４は、１番目の対訳コーパスを受け付け、対訳情報格納部１００に１と対応付けて、１番目の対訳コーパスを蓄積した、とする。 In this situation, it is assumed that the bilingual corpus accepting unit 104 first accepts the first bilingual corpus and stores the first bilingual corpus in association with 1 in the bilingual information storage unit 100.

次に、フレーズテーブル初期化部１０６は、１番目の対訳コーパスが有する１以上の対訳情報から、１以上のスコア付きフレーズペアの初期の情報を生成し、１と対応付けて、フレーズテーブル１０１に蓄積する。 Next, the phrase table initialization unit 106 generates initial information of one or more scored phrase pairs from one or more parallel translation information of the first bilingual corpus, associates it with 1, and stores it in the phrase table 101. accumulate.

Next, the generated phrase pair acquisition unit 107 acquires one bilingual sentence from the first bilingual corpus and subtracts the occurrence count (usually 1) of each phrase pair in the tree structure of that sentence from the score of the corresponding phrase pair in the phrase table 101. It then attempts to generate the phrase pair <f, e> that constitutes the bilingual sentence using the probability distribution P_base^1. The probability distribution P_base^1 is assumed to have been estimated in advance, for example with IBM Model 1, and to be held by the bilingual phrase learning device 1.

When the partial phrase pair generation unit 111 determines that the preceding phrase pair generation did not produce a phrase pair, it proceeds as follows.

That is, the partial phrase pair generation unit 111 uses P_base^1 to recursively generate two phrase pairs smaller than the phrase pair it attempted to generate, and then generates a new phrase pair by combining the two smaller phrase pairs. The probability of the phrase pair correspondence <f, e> is given by Equation 3.

θ_t^1 is the parameter of the translation model, taken to be a table holding the probability values of all pairs <f, e>. In this case, θ_t^1 is estimated by the Pitman-Yor process of Equation 4.

Next, the symbol acquisition unit 109 uses the three pieces of symbol appearance frequency information to generate a symbol according to the symbol probability distribution P_x(x; θ_x). The symbol appearance frequency information update unit 110 then increases the S appearance frequency information corresponding to the generated symbol (here, x = reg) by 1.

Next, when the generated symbol is x = base, the new phrase pair generation unit 112 generates a new phrase pair directly from P_base^1. When x = reg, it generates <f_1, e_1> and <f_2, e_2> from the phrase pair generation probability P_hier and creates one phrase pair <f_1 f_2, e_1 e_2>. When x = inv, it generates <f_1, e_1> and <f_2, e_2> from P_hier, arranges f_1 and f_2 in reverse order, and creates one phrase pair <f_2 f_1, e_1 e_2>.

そして、フレーズ出現頻度情報更新部１０８は、新たに作成されたフレーズペアのフレーズ出現頻度情報を更新する。 Then, the phrase appearance frequency information update unit 108 updates the phrase appearance frequency information of the newly created phrase pair.

また、スコア算出部１１４は、更新されたフレーズ出現頻度情報に対応するフレーズペアのスコアを、Ｐ_base ^１を用いて算出する。 Further, the score calculation unit 114 calculates the score of the phrase pair corresponding to the updated phrase appearance frequency information using P _base ¹ .

次に、フレーズテーブル更新部１１６は、フレーズテーブルを更新する。 Next, the phrase table update unit 116 updates the phrase table.

Next, the parsing unit 115 uses the scores calculated by the score calculation unit 114 to acquire a new tree structure whose score is maximal. The tree update unit 117 then stores the acquired tree structure in the bilingual corpus, replacing the old tree structure with the new one.

Through the above processing, phrase pairs at multiple levels of granularity can be learned, for example for the phrase pair "Mrs. Smith's red cookbook / スミスさんの赤い料理本", as shown in FIG. 4. FIG. 4 is an example of a tree structure constituting bilingual information.

The phrase table 101 in this specific example is constructed, for example, as follows.

As features of the phrase table, the conditional probabilities P_t(f|e) and P_t(e|f), lexical weighting probabilities, a phrase penalty, and the like are used. Here, the conditional probabilities are calculated using the model probability P_t, that is, using Equation 5 and Equation 6. The score calculation unit 114 then calculates a score by, for example, multiplying each feature of the phrase table by a predetermined weight and summing the results. The lexical weighting probability can be calculated from the words that make up the phrase; this calculation is a known technique (see P. Koehn, F. J. Och, and D. Marcu. Statistical phrase-based translation. In Proc. NAACL, pp. 48-54, 2003). The phrase penalty is, for example, "1" for every phrase.
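The weighted sum described for the score calculation unit 114 can be illustrated with a short sketch. The feature names and weight values below are hypothetical; the text only states that each feature is multiplied by a predetermined weight and the products are summed.

```python
from typing import Mapping


def phrase_pair_score(features: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Multiply each phrase-table feature by its predetermined weight and sum the products."""
    return sum(weights[name] * value for name, value in features.items())


# Hypothetical feature values for one phrase pair and hypothetical weights.
example_features = {"p_f_given_e": 0.5, "p_e_given_f": 0.25, "phrase_penalty": 1.0}
example_weights = {"p_f_given_e": 1.0, "p_e_given_f": 2.0, "phrase_penalty": 0.5}
example_score = phrase_pair_score(example_features, example_weights)
```

Keeping the weights in a separate mapping means further features, such as the lexical weighting probability, can be added without changing the function.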

The Σ term in Equation 5 represents P_t(e): among all <e,f>, the phrase pairs whose frequency is 1 or more and whose e matches are enumerated, and their probability values are summed. Here, f~ (where ~ sits directly above f) is an f that forms a phrase pair with a matching e among all <e,f>. Likewise, the Σ term in Equation 6 represents P_t(f): among all <e,f>, the phrase pairs whose frequency is 1 or more and whose f matches are enumerated, and their probability values are summed. Here, e~ (where ~ sits directly above e) is an e that forms a phrase pair with a matching f among all <e,f>. In addition, c(<e,f~>) denotes the frequency of <e,f~>, and c(<e~,f>) denotes the frequency of <e~,f>.
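Equations 5 and 6 are not reproduced in this passage, so the following sketch works only from the description of their Σ terms: each conditional probability is taken to be the joint model probability divided by the marginal obtained by summing over phrase pairs with frequency 1 or more. The function name, the dictionary layout, and the assumption that all stored probabilities are positive are illustrative choices made here.

```python
from collections import defaultdict
from typing import Dict, Tuple

Pair = Tuple[str, str]  # (e, f)


def conditional_probabilities(
    joint: Dict[Pair, float], counts: Dict[Pair, int]
) -> Tuple[Dict[Pair, float], Dict[Pair, float]]:
    """Return (P_t(f|e), P_t(e|f)) for every pair <e,f> whose frequency c(<e,f>) is at least 1."""
    p_e: Dict[str, float] = defaultdict(float)  # sum over f~ of P_t(<e,f~>)
    p_f: Dict[str, float] = defaultdict(float)  # sum over e~ of P_t(<e~,f>)
    kept = {pair: p for pair, p in joint.items() if counts.get(pair, 0) >= 1}
    for (e, f), p in kept.items():
        p_e[e] += p
        p_f[f] += p
    f_given_e = {(e, f): p / p_e[e] for (e, f), p in kept.items()}
    e_given_f = {(e, f): p / p_f[f] for (e, f), p in kept.items()}
    return f_given_e, e_given_f
```

Pairs with a frequency of zero are excluded from both the numerators and the marginals, matching the enumeration condition stated for the Σ terms.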

Note that the phrase table update unit 116 enters into the phrase table 101 only phrase pairs p that appear at least once in the sample. Two further features are added. The first is the joint probability P_t(<f,e>) of the phrase pair under the model. The second is the average posterior probability of the spans containing a given phrase pair <f,e>, based on the span posterior probabilities computed by the inside-outside algorithm. Because the span probability is high for frequently occurring phrase pairs, and for phrase pairs built from frequently occurring phrase pairs, it is useful for judging how reliable a phrase pair is. Phrase extraction based on this model probability is called MOD. The span probability can be calculated by an ITG chart parser.

The above processing is performed on all bilingual sentences included in the first bilingual corpus.

Next, assume that the bilingual corpus reception unit 104 receives the second bilingual corpus and accumulates it in the bilingual information storage unit 100 in association with 2.

Next, the phrase table initialization unit 106 generates initial information for one or more scored phrase pairs from the one or more pieces of bilingual information in the second bilingual corpus, and accumulates it in the phrase table 101 in association with 2.

Next, the generated phrase pair acquisition unit 107 acquires one bilingual sentence from the second bilingual corpus. The generated phrase pair acquisition unit 107 then subtracts the occurrences (normally an appearance frequency of "1") of each of the one or more phrase pairs constituting the tree structure of the acquired bilingual sentence from the scores of the corresponding phrase pairs in the phrase table 101. Next, the generated phrase pair acquisition unit 107 attempts to generate the phrase pair <f,e>, which is the bilingual sentence itself, using the probability distribution P^1. The probability distribution P^1 is the one obtained by the above processing of the first bilingual corpus.
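The subtraction step, in which the contribution of the current sentence's tree is removed before the sentence is generated again, can be sketched as below. Treating the stored value as an occurrence count held in a `Counter` is an assumption made for illustration, as is the function name; the text itself speaks of subtracting the occurrences from the score held in the phrase table 101.

```python
from collections import Counter
from typing import Iterable, Tuple

Pair = Tuple[str, str]  # (f, e)


def remove_sentence_counts(counts: Counter, tree_pairs: Iterable[Pair]) -> None:
    """Subtract one occurrence for every phrase pair found in the sentence's current tree."""
    for pair in tree_pairs:
        counts[pair] -= 1
        if counts[pair] <= 0:
            # Drop pairs that no longer occur so they do not linger with a zero count.
            del counts[pair]
```

After the new tree for the sentence has been obtained, the counts of its phrase pairs would be added back in the same way, so each sentence is always scored against statistics that exclude its own previous analysis.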

That is, the partial phrase pair generation unit 111 uses P^1 to recursively generate two phrase pairs smaller than the phrase pair it attempted to generate, and then generates a new phrase pair by combining the two smaller phrase pairs. The second translation model (θ_t^2) is assumed to be estimated by the Pitman-Yor process of Equation 7.

Next, when the generated symbol x is "x = base", the new phrase pair generation unit 112 generates a new phrase pair directly from P^1. When the generated symbol x is "x = reg", the new phrase pair generation unit 112 generates <f₁,e₁> and <f₂,e₂> from the phrase pair generation probability (P_hier) and creates a single phrase pair <f₁f₂,e₁e₂>. When the generated symbol x is "x = inv", the new phrase pair generation unit 112 generates <f₁,e₁> and <f₂,e₂> from P_hier, arranges f₁ and f₂ in reverse order, and creates a single phrase pair <f₂f₁,e₁e₂>.

Then, the phrase appearance frequency information update unit 108 updates the phrase appearance frequency information of the newly created phrase pair. This phrase appearance frequency information is the one corresponding to the second bilingual corpus.

The score calculation unit 114 then calculates, using P^1, the score of the phrase pair corresponding to the updated phrase appearance frequency information.

Next, the phrase table update unit 116 updates the phrase table corresponding to the second bilingual corpus.

Next, the parsing unit 115 uses the scores calculated by the score calculation unit 114 to acquire a new tree structure that maximizes the score of the tree structure. The tree update unit 117 then accumulates the acquired tree structure in the bilingual corpus, replacing the old tree structure with the new one. This tree structure is the one corresponding to the second bilingual corpus.

The above processing is performed on all bilingual sentences included in the second bilingual corpus. As a result, one or more scored phrase pairs associated with 2 are accumulated in the phrase table 101.

Assume that the above processing has also been performed on the third and subsequent bilingual corpora, so that the phrase table 101 stores a large number of scored phrase pairs corresponding to each of the first to (j-1)-th bilingual corpora. Assume further that, for example, the probability distribution of the (j-1)-th set of phrase pairs is P^(j-1). Here, j is a natural number of 3 or more.

In this situation, assume that the bilingual corpus reception unit 104 receives the j-th bilingual corpus and accumulates it in the bilingual information storage unit 100 in association with j.

Next, the phrase table initialization unit 106 generates initial information for one or more scored phrase pairs from the one or more pieces of bilingual information in the j-th bilingual corpus, and accumulates it in the phrase table 101 in association with j.

Next, the generated phrase pair acquisition unit 107 acquires one bilingual sentence from the j-th bilingual corpus. The generated phrase pair acquisition unit 107 then subtracts the occurrences (normally an appearance frequency of "1") of each of the one or more phrase pairs constituting the tree structure of the acquired bilingual sentence from the scores of the corresponding phrase pairs in the phrase table 101. Next, the generated phrase pair acquisition unit 107 attempts to generate the phrase pair <f,e>, which is the bilingual sentence itself, from the probability distribution P^(j-1) of the (j-1)-th set of phrase pairs. The probability distribution P^(j-1) is the one obtained by the processing of the (j-1)-th bilingual corpus.

That is, the partial phrase pair generation unit 111 uses the probability distribution P^(j-1) to recursively generate two phrase pairs smaller than the phrase pair it attempted to generate, and then generates a new phrase pair by combining the two smaller phrase pairs. Note that θ_t^j is estimated by the Pitman-Yor process of Equation 8.
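The way each new model leans on the previous one can be illustrated with the standard Chinese-restaurant predictive probability of a Pitman-Yor process, in which the distribution learned from the earlier corpus serves as the base measure. Equations 7 and 8 are not reproduced in this passage, so this is a sketch under that assumption, not the patent's exact formula; the parameter names `discount` and `strength`, the per-item table counts, and the helper `chain_models` are all hypothetical.

```python
from functools import partial
from typing import Callable, Dict, Hashable, Sequence, Tuple

Prob = Callable[[Hashable], float]
Model = Tuple[Dict[Hashable, int], Dict[Hashable, int], float, float]


def pitman_yor_probability(item: Hashable, counts: Dict[Hashable, int],
                           tables: Dict[Hashable, int], discount: float,
                           strength: float, base: Prob) -> float:
    """Predictive probability of `item`, backing off to the base measure `base`."""
    n = sum(counts.values())             # total customers (observed occurrences)
    total_tables = sum(tables.values())  # total tables across all items
    seen = max(counts.get(item, 0) - discount * tables.get(item, 0), 0.0)
    backoff = strength + discount * total_tables
    return (seen + backoff * base(item)) / (strength + n)


def chain_models(models: Sequence[Model], initial_base: Prob) -> Prob:
    """Link models in order so that model j uses model j-1 as its base measure."""
    prob = initial_base
    for counts, tables, discount, strength in models:
        prob = partial(pitman_yor_probability, counts=counts, tables=tables,
                       discount=discount, strength=strength, base=prob)
    return prob
```

Because model j only reads the probabilities of model j-1 and never modifies its statistics, adding a corpus in this sketch requires estimating only the newest link of the chain.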

Next, when the generated symbol x is "x = base", the new phrase pair generation unit 112 generates a new phrase pair directly from P^(j-1). When the generated symbol x is "x = reg", the new phrase pair generation unit 112 generates <f₁,e₁> and <f₂,e₂> from the phrase pair generation probability (P_hier) and creates a single phrase pair <f₁f₂,e₁e₂>. When the generated symbol x is "x = inv", the new phrase pair generation unit 112 generates <f₁,e₁> and <f₂,e₂> from P_hier, arranges f₁ and f₂ in reverse order, and creates a single phrase pair <f₂f₁,e₁e₂>.

Then, the phrase appearance frequency information update unit 108 updates the phrase appearance frequency information of the newly created phrase pair. This phrase appearance frequency information is the one corresponding to the j-th bilingual corpus.

The score calculation unit 114 then calculates, using P^(j-1), the score of the phrase pair corresponding to the updated phrase appearance frequency information.

Next, the phrase table update unit 116 updates the phrase table corresponding to the j-th bilingual corpus.

The above processing is performed on all bilingual sentences included in the j-th bilingual corpus. As a result, one or more scored phrase pairs associated with j are accumulated in the phrase table 101.

As described above, according to the present embodiment, a translation model generated from an added bilingual corpus can be linked to the original translation model, and as a result the translation model can easily be enriched step by step.

Furthermore, according to the present embodiment, the size of the phrase table can be greatly reduced while maintaining the accuracy of machine translation that uses the phrase table created by the bilingual phrase learning device 1. That is, according to the present embodiment, many appropriate phrase pairs can be learned.

The processing in the present embodiment may be realized by software. This software may be distributed by software download or the like, or may be recorded on a recording medium such as a CD-ROM and distributed. The same applies to the other embodiments in this specification. The software that implements the information processing apparatus of the present embodiment is the following program.

In this program, a computer-accessible recording medium comprises: a bilingual information storage unit capable of storing N bilingual corpora (N is a natural number of 2 or more), each having one or more pieces of bilingual information that include a bilingual sentence and a tree structure of the bilingual sentence; a phrase table capable of storing one or more scored phrase pairs, each having a phrase pair, which is a pair of a first language phrase having one or more words of a first language and a second language phrase having one or more words of a second language, and a score, which is information on the appearance probability of the phrase pair; a phrase appearance frequency information storage unit capable of storing, for each bilingual corpus, one or more pieces of phrase appearance frequency information, each having a phrase pair and F appearance frequency information, which is information on the appearance frequency of the phrase pair; and a symbol appearance frequency information storage unit capable of storing one or more pieces of symbol appearance frequency information, each having a symbol identifying a method of generating a new phrase pair and S appearance frequency information, which is information on the appearance frequency of the symbol.

The program causes a computer to function as: a generated phrase pair acquisition unit that, for each bilingual corpus, acquires a phrase pair having a first language phrase and a second language phrase using the one or more pieces of phrase appearance frequency information; a phrase appearance frequency information update unit that, when a phrase pair has been acquired, increases the F appearance frequency information corresponding to the phrase pair by a predetermined value; a symbol acquisition unit that, when a phrase pair could not be acquired, acquires one symbol using the one or more pieces of symbol appearance frequency information; a symbol appearance frequency information update unit that increases the S appearance frequency information corresponding to the symbol acquired by the symbol acquisition unit by a predetermined value; a partial phrase pair generation unit that, when a phrase pair could not be acquired, generates two phrase pairs smaller than the phrase pair whose acquisition was attempted; a new phrase pair generation unit that, according to the symbol acquired by the symbol acquisition unit, performs one of a first process of generating a new phrase pair, a second process of generating two smaller phrase pairs and, using the one or more pieces of phrase appearance frequency information, generating one phrase pair having a new first language phrase in which the two first language phrases of the two generated phrase pairs are joined in order and a new second language phrase in which the two second language phrases of the two phrase pairs are joined in order, and a third process of generating two smaller phrase pairs and, using the one or more pieces of phrase appearance frequency information, generating one phrase pair having a new first language phrase in which the two first language phrases of the two generated phrase pairs are joined in order and a new second language phrase in which the two second language phrases of the two phrase pairs are joined in reverse order; a control unit that instructs that the processing of the phrase appearance frequency information update unit, the symbol acquisition unit, the symbol appearance frequency information update unit, the partial phrase pair generation unit, and the new phrase pair generation unit be performed recursively on the phrase pair generated by the new phrase pair generation unit; a score calculation unit that calculates a score for each phrase pair in the phrase table using the one or more pieces of phrase appearance frequency information stored in the phrase appearance frequency information storage unit; and a phrase table update unit that accumulates the score calculated by the score calculation unit in association with each phrase pair.

When calculating the score for each phrase pair acquired from the j-th (2 <= j <= N) bilingual corpus, the score calculation unit uses one or more pieces of phrase appearance frequency information corresponding to the (j-1)-th bilingual corpus to calculate the score for each phrase pair corresponding to the j-th bilingual corpus.

In the above program, it is preferable that the bilingual information storage unit stores one or more bilingual corpora, and that the program causes the computer to further function as a bilingual corpus reception unit that receives a bilingual corpus and a bilingual corpus accumulation unit that accumulates the bilingual corpus received by the bilingual corpus reception unit in the bilingual information storage unit. In this case, after the bilingual corpus accumulation unit has accumulated the received bilingual corpus in the bilingual information storage unit, the control unit instructs that the processing of the generated phrase pair acquisition unit, the phrase appearance frequency information update unit, the symbol acquisition unit, the symbol appearance frequency information update unit, the partial phrase pair generation unit, and the new phrase pair generation unit be performed on that bilingual corpus. When calculating the score for each phrase pair acquired from the bilingual corpus received by the bilingual corpus reception unit, the score calculation unit uses one or more pieces of phrase appearance frequency information corresponding to any one of the one or more bilingual corpora that were stored in the bilingual information storage unit before the bilingual corpus accumulation unit accumulated the received bilingual corpus, and calculates the score for each phrase pair corresponding to the received bilingual corpus.

(Embodiment 2)
In the present embodiment, a bilingual phrase learning device is described that learns a plurality of fields independently, replaces the prior probability of each field with a model obtained in another field, and connects the plurality of models hierarchically.

FIG. 5 is a block diagram of the bilingual phrase learning device 2 in the present embodiment. As shown in FIG. 5, the bilingual phrase learning device 2 differs from the bilingual phrase learning device 1 in that it includes a bilingual corpus generation unit 201 and does not include the bilingual corpus reception unit 104 or the bilingual corpus accumulation unit 105.

The bilingual corpus generation unit 201 divides two or more bilingual sentences into N groups, acquires the tree structure of each bilingual sentence in each group, and accumulates the N bilingual corpora created in this way in the bilingual information storage unit 100. Here, N is a natural number of 2 or more. The method of division is not limited. When a class identifier identifying a class is assigned to each bilingual sentence, the bilingual corpus generation unit 201 may use the class identifiers to divide the two or more bilingual sentences into N groups. Alternatively, the bilingual corpus generation unit 201 may divide the two or more bilingual sentences into N groups so that each group contains a similar number of bilingual sentences.
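The two division strategies mentioned above, by class identifier and by roughly equal size, can be sketched as follows. The function names and the representation of a labelled sentence as a (class identifier, sentence) tuple are assumptions for illustration; the text leaves the division method open.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple, TypeVar

S = TypeVar("S")  # type of one bilingual sentence (left abstract here)


def split_by_class(labelled: Sequence[Tuple[Hashable, S]]) -> Dict[Hashable, List[S]]:
    """Group bilingual sentences by their class identifier."""
    groups: Dict[Hashable, List[S]] = defaultdict(list)
    for class_id, sentence in labelled:
        groups[class_id].append(sentence)
    return dict(groups)


def split_evenly(sentences: Sequence[S], n: int) -> List[List[S]]:
    """Deal sentences round-robin into n groups whose sizes differ by at most one."""
    if n < 2:
        raise ValueError("N must be 2 or more")
    return [list(sentences[i::n]) for i in range(n)]
```

Either function yields the raw groups only; obtaining the tree structure for each sentence, which the bilingual corpus generation unit 201 also performs, is outside this sketch.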

The bilingual corpus generation unit 201 can usually be realized by an MPU, a memory, and the like. Its processing procedure is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may also be realized by hardware (a dedicated circuit).

Next, the operation of the bilingual phrase learning device 2 is described with reference to the flowchart of FIG. 6. In the flowchart of FIG. 6, description of the steps that are the same as in the flowchart of FIG. 2 is omitted.

(Step S601) The bilingual corpus generation unit 201 divides the two or more bilingual sentences stored in the bilingual information storage unit 100 into N groups. Each group has a bilingual corpus containing one or more bilingual sentences.

(Step S602) The bilingual corpus generation unit 201 constructs the tree structure of each of the one or more bilingual sentences in each group and accumulates it in the bilingual information storage unit 100. Through this processing, the final bilingual corpora of the N groups are stored in the bilingual information storage unit 100. Each bilingual corpus here has one or more pieces of bilingual information that include a bilingual sentence and the tree structure of the bilingual sentence.

(Step S603) The phrase table initialization unit 106 acquires the i-th bilingual corpus from the bilingual information storage unit 100. The process proceeds to step S203.

In the flowchart of FIG. 6, when the phrase table corresponding to the bilingual corpus of the j-th group is constructed, the phrase table corresponding to the bilingual corpus of the (j-1)-th group is normally used. However, a phrase table of another group that has already been acquired (for example, the phrase table corresponding to the bilingual corpus of the third group) may be used instead.

As described above, according to the present embodiment, parallelization becomes easy because the bilingual data is divided, for example, by field and a local model is learned in each field. In addition, according to the present embodiment, when the statistical models obtained as a result of learning are combined, their weights need not be recalculated, and the translation model can easily be enriched.

Using the bilingual phrase learning device 1 described in the above embodiment makes the following possible. Bilingual data is updated daily, and with the bilingual phrase learning device 1 or the bilingual phrase learning device 2, the cost of re-learning when new bilingual data is added to existing large-scale bilingual data can be greatly reduced. In particular, when adapting to a new field such as patent data, the bilingual phrase learning device 1 or the bilingual phrase learning device 2 makes it easy to learn with an existing statistical model as the prior probability and to estimate new model parameters. Furthermore, with the bilingual phrase learning device 1 or the bilingual phrase learning device 2, new bilingual data can be added whenever a new expression appears, for instance in everyday conversation, and a model can be created for the added portion while a general model is held as the prior probability.

The software that implements the information processing apparatus of the present embodiment is the following program.

That is, a computer-accessible recording medium comprises: a bilingual information storage unit capable of storing N bilingual corpora (N is a natural number of 2 or more), each having one or more pieces of bilingual information that include a bilingual sentence and a tree structure of the bilingual sentence; a phrase table capable of storing one or more scored phrase pairs, each having a phrase pair, which is a pair of a first language phrase having one or more words of a first language and a second language phrase having one or more words of a second language, and a score, which is information on the appearance probability of the phrase pair; a phrase appearance frequency information storage unit capable of storing, for each bilingual corpus, one or more pieces of phrase appearance frequency information, each having a phrase pair and F appearance frequency information, which is information on the appearance frequency of the phrase pair; and a symbol appearance frequency information storage unit capable of storing one or more pieces of symbol appearance frequency information, each having a symbol identifying a method of generating a new phrase pair and S appearance frequency information, which is information on the appearance frequency of the symbol.

This program causes a computer to function as: a generated phrase pair acquisition unit that, for each bilingual corpus, acquires a phrase pair having a first language phrase and a second language phrase using the one or more pieces of phrase appearance frequency information; a phrase appearance frequency information update unit that, when a phrase pair has been acquired, increases the F appearance frequency information corresponding to the phrase pair by a predetermined value; a symbol acquisition unit that, when a phrase pair could not be acquired, acquires one symbol using the one or more pieces of symbol appearance frequency information; a symbol appearance frequency information update unit that increases the S appearance frequency information corresponding to the symbol acquired by the symbol acquisition unit by a predetermined value; a partial phrase pair generation unit that, when a phrase pair could not be acquired, generates two phrase pairs smaller than the phrase pair whose acquisition was attempted; a new phrase pair generation unit that, according to the symbol acquired by the symbol acquisition unit, performs one of a first process of generating a new phrase pair, a second process of generating two smaller phrase pairs and, using the one or more pieces of phrase appearance frequency information, generating one phrase pair having a new first language phrase in which the two first language phrases of the two generated phrase pairs are joined in order and a new second language phrase in which the two second language phrases of the two phrase pairs are joined in order, and a third process of generating two smaller phrase pairs and, using the one or more pieces of phrase appearance frequency information, generating one phrase pair having a new first language phrase in which the two first language phrases of the two generated phrase pairs are joined in order and a new second language phrase in which the two second language phrases of the two phrase pairs are joined in reverse order; a control unit that instructs that the processing of the phrase appearance frequency information update unit, the symbol acquisition unit, the symbol appearance frequency information update unit, the partial phrase pair generation unit, and the new phrase pair generation unit be performed recursively on the phrase pair generated by the new phrase pair generation unit; a score calculation unit that calculates a score for each phrase pair in the phrase table using the one or more pieces of phrase appearance frequency information stored in the phrase appearance frequency information storage unit; and a phrase table update unit that accumulates the score calculated by the score calculation unit in association with each phrase pair.

When calculating the score for each phrase pair acquired from the j-th (2 <= j <= N) bilingual corpus, the score calculation unit uses one or more pieces of phrase appearance frequency information corresponding to the (j-1)-th bilingual corpus to calculate the score for each phrase pair corresponding to the j-th bilingual corpus.

Further, it is preferable that the above program further causes the computer to function as a bilingual corpus generation unit that divides two or more bilingual sentences into N groups, acquires a tree structure of the bilingual sentences from the bilingual sentences of each group, and accumulates the N bilingual corpora thus created in the bilingual information storage unit, and that the score calculation unit, when calculating the score for each phrase pair acquired from one bilingual corpus, calculates the score for each phrase pair corresponding to that bilingual corpus using one or more pieces of phrase appearance frequency information corresponding to another bilingual corpus different from it.

(Embodiment 3)
In the present embodiment, a statistical machine translation apparatus 3 that uses the phrase table 101 learned by the bilingual phrase learning device 1 or the bilingual phrase learning device 2 will be described.

FIG. 7 is a block diagram of the statistical machine translation apparatus 3 in the present embodiment. The statistical machine translation apparatus 3 includes a phrase table 101, a reception unit 301, a phrase acquisition unit 302, a sentence composition unit 303, and an output unit 304.

The phrase table 101 is a phrase table learned by the bilingual phrase learning device 1 or the bilingual phrase learning device 2.

The reception unit 301 receives a sentence in the first language having one or more words. Here, reception is a concept that includes receiving information input from an input device such as a keyboard, mouse, or touch panel, receiving information transmitted via a wired or wireless communication line, and receiving information read from a recording medium such as an optical disk, magnetic disk, or semiconductor memory. Any means may be used to input the first-language sentence, such as a keyboard, a mouse, or a menu screen. The reception unit 301 can be realized by a device driver for input means such as a keyboard, control software for a menu screen, or the like.

The phrase acquisition unit 302 extracts one or more phrases from the sentence received by the reception unit 301 and, using the scores in the phrase table 101, acquires one or more second-language phrases from the phrase table 101. The processing of the phrase acquisition unit 302 is a known technique.

The sentence composition unit 303 composes a sentence in the second language from the one or more phrases acquired by the phrase acquisition unit 302. The processing of the sentence composition unit 303 is a known technique.

The output unit 304 outputs the sentence composed by the sentence composition unit 303. Here, output is a concept that includes display on a display, projection using a projector, printing by a printer, sound output, transmission to an external device, storage in a recording medium, and delivery of processing results to another processing device or another program.

The phrase acquisition unit 302 and the sentence composition unit 303 can usually be realized by an MPU, a memory, and the like. The processing procedures of the phrase acquisition unit 302 and the like are usually implemented in software, and the software is recorded on a recording medium such as a ROM. However, they may also be implemented in hardware (dedicated circuits).

The output unit 304 may or may not be considered to include an output device such as a display or a speaker. The output unit 304 can be realized by driver software for an output device, or by driver software for an output device together with the output device.

The operation of the statistical machine translation apparatus 3 requires nothing more than known phrase-based statistical machine translation processing, so a detailed description is omitted.
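The flow through the reception, phrase acquisition, and sentence composition units can be pictured with the following minimal Python sketch. It is an illustration only: the toy phrase table, the romanized tokens, the function names, the greedy longest-match segmentation, and the monotone composition without reordering are all assumptions made here for brevity, and they stand in for the known phrase-based decoding that an actual apparatus would use.

```python
from typing import Dict, List, Tuple

# Hypothetical toy phrase table: source phrase -> [(target phrase, score), ...]
PhraseTable = Dict[Tuple[str, ...], List[Tuple[str, float]]]


def acquire_phrases(words: List[str], table: PhraseTable, max_len: int = 3) -> List[str]:
    """Segment greedily (longest match first) and keep the best-scoring target."""
    out: List[str] = []
    i = 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            src = tuple(words[i:i + n])
            if src in table:
                best_target, _ = max(table[src], key=lambda pair: pair[1])
                out.append(best_target)
                i += n
                break
        else:
            # No phrase table entry starts here: pass the word through unchanged.
            out.append(words[i])
            i += 1
    return out


def translate(sentence: str, table: PhraseTable) -> str:
    words = sentence.split()                 # reception: accept a first-language sentence
    phrases = acquire_phrases(words, table)  # phrase acquisition using table scores
    return " ".join(phrases)                 # sentence composition (monotone order)


# Toy entries with romanized source tokens (illustrative values only).
toy_table: PhraseTable = {
    ("wo",): [("I", 0.9), ("me", 0.1)],
    ("xihuan",): [("like", 0.8), ("love", 0.2)],
    ("beijing", "kaoya"): [("Peking duck", 0.95)],
    ("beijing",): [("Beijing", 0.9)],
}
```

With this toy table, `translate("wo xihuan beijing kaoya", toy_table)` picks the two-word entry for the last two tokens rather than translating them separately.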

As described above, according to the present embodiment, highly accurate machine translation becomes possible by using hierarchically linked phrase tables.

The experimental results for the bilingual phrase learning device 1 or the bilingual phrase learning device 2 are described below.
(Experiment)

Information on the data sets used in this experiment is shown in FIG. 8. The experiment used a Chinese-English translation task (translation from Chinese into English) and the three data sets of different sizes shown in FIG. 8. In FIG. 8, “Data set” is the name of the data set, “Corpus” is the name of a corpus in the data set, and “#sent.pairs” is the number of sentence pairs.

The data set “IWSLT” in FIG. 8 is the data set used in IWSLT2012 OLYMPICS and consists of two training sets (the HIT corpus and the BTEC corpus). The HIT corpus is closely related to the 2008 Beijing Olympics. The BTEC corpus is a multilingual speech corpus containing tourism-related sentences.

The data set “FBIS” in FIG. 8 is a collection of news articles and does not itself carry domain (field) information. The entire corpus was therefore divided into five domains using a latent Dirichlet allocation (LDA) tool called PLDA (see http://code.google.com/p/plda/). In each of the five domains, the source side (first language side) and the target side (second language side) were concatenated as if they were a single sentence (see Zhiyuan Liu, Yuzhou Zhang, Edward Y Chang, and Maosong Sun. 2011. Plda+: Parallel latent dirichlet allocation with data placement and pipeline processing. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):1-18.).

The data set “LDC” in FIG. 8 covers various domains such as news, magazines, and finance, and consists of five corpora obtained from the LDC.

In this experiment, the following five phrase pair extraction methods were used to evaluate the bilingual phrase learning device 1 or the bilingual phrase learning device 2. The method of the bilingual phrase learning device 1 or the bilingual phrase learning device 2 is referred to as “Hier-combin”.
(1) GIZA-linear

In this method, phrase pairs are extracted within each domain using GIZA++ (see Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.) and the "grow-diag-final-and" heuristic with a maximum phrase length of 7. The phrase tables built from the various domains are then combined linearly by averaging their feature values.
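The linear combination by feature averaging can be sketched as follows. This is a hedged illustration rather than the procedure actually used in the experiment: the dictionary-based table layout, the function name `combine_linear`, and the choice to let a phrase pair that is absent from a domain contribute zero to the average are assumptions made here. The text only states that feature values are averaged.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

PhrasePair = Tuple[str, str]
FeatureTable = Dict[PhrasePair, List[float]]


def combine_linear(tables: List[FeatureTable], n_features: int) -> FeatureTable:
    """Average each feature over all domain tables (missing entries count as 0)."""
    sums: Dict[PhrasePair, List[float]] = defaultdict(lambda: [0.0] * n_features)
    for table in tables:
        for pair, feats in table.items():
            acc = sums[pair]
            for k, value in enumerate(feats):
                acc[k] += value
    return {pair: [v / len(tables) for v in acc] for pair, acc in sums.items()}


# Two hypothetical domain tables with two features per phrase pair.
domain_a: FeatureTable = {("a", "x"): [0.5, 0.25]}
domain_b: FeatureTable = {("a", "x"): [0.25, 0.75], ("b", "y"): [1.0, 0.5]}
combined = combine_linear([domain_a, domain_b], n_features=2)
```

Under the zero-for-missing assumption, a pair seen in only one of two domains keeps half of its original feature values, which is the behaviour of a uniform interpolation of the per-domain tables.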
(2) Pialign-linear

This method is similar to GIZA-linear but differs from it in that it uses the phrasal ITG method by way of the pialign toolkit (see Graham Neubig, Taro Watanabe, Eiichiro Sumita, Shinsuke Mori, and Tatsuya Kawahara. 2011. An unsupervised model for joint phrase alignment and extraction. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 632-641, Portland, Oregon, USA, June. Association for Computational Linguistics.). In this method too, the extracted phrase pairs are combined linearly by averaging their feature values.
(3) GIZA-batch

In this method, the data set is not divided into domains but is treated as a single corpus. A heuristic GIZA-based phrase extraction similar to that of GIZA-linear is then performed.
(4) Pialign-batch

In this method, as in GIZA-batch, a single model is estimated from a single merged corpus. Because pialign cannot handle large-scale data, no experiment could be run on the largest data set, LDC.
(5) Pialign-adaptive

In this method, alignments and phrase pairs are extracted in the same way as in Pialign-batch. The translation probabilities are estimated by an adaptive method that uses monolingual topic information.

In the “Hier-combin” method of the bilingual phrase learning device 1 or the bilingual phrase learning device 2, phrase pairs were extracted in the same way as in “Pialign-linear”. In the phrase table combination process, the translation probability of each phrase pair is estimated by “Hier-combin”, while the other features are combined linearly by averaging their feature values. “Pialign” was run with its default parameters. The parameter “samps” was set to 5, which means that five samples are generated for each sentence pair.

Furthermore, in this experiment, batch-MIRA (see Colin Cherry and George Foster. 2012. Batch tuning strategies for statistical machine translation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 427-436, Montréal, Canada, June. Association for Computational Linguistics.) was used to tune the weight of each feature. Translation quality was evaluated with the case-insensitive BLEU-4 metric (see Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318, Philadelphia, Pennsylvania, USA, July. Association for Computational Linguistics.).
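For readers unfamiliar with the metric, a case-insensitive BLEU-4 score can be sketched as below. This is a simplified illustration under stated assumptions: whitespace tokenization, a single reference per sentence, corpus-level counts, and no smoothing. The function names are hypothetical, and the evaluation script actually used in the experiment is not specified in the text.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(hypotheses: List[str], references: List[str]) -> float:
    """Corpus-level BLEU-4, lowercased, one reference per hypothesis."""
    matches = [0] * 4
    totals = [0] * 4
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h = hyp.lower().split()
        r = ref.lower().split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, 5):
            h_counts = ngrams(h, n)
            r_counts = ngrams(r, n)
            # Clipped counts: an n-gram is credited at most as often as in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if min(matches) == 0:
        return 0.0  # unsmoothed: any zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / 4
    # Brevity penalty: penalize output that is shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)
```

The score is the geometric mean of the clipped 1-gram to 4-gram precisions multiplied by the brevity penalty, so an output identical to the reference up to letter case scores 1.0.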
FIG. 9 shows the results of the experiments conducted under the above conditions. In FIG. 9, “BLEU” is the translation quality score and “Size” is the number of phrase pairs. FIG. 9 shows that “Hier-combin” outperforms “Pialign-linear”. Note that “Hier-combin” and “Pialign-linear” differ only in their translation probabilities; the phrase pairs themselves and their number are the same.

The performance of “Pialign-adaptive” is higher than that of “Pialign-linear” and lower than that of “Hier-combin”. This shows that the adaptive method using monolingual topic information is useful for the task. However, “Hier-combin”, which uses a hierarchical Pitman-Yor process, can estimate more accurate translation probabilities based on all the data from the various domains. In other words, FIG. 9 shows that “Hier-combin” achieves high translation quality scores on the various data sets with a relatively small number of phrase pairs compared with the other methods.

In more detail, FIG. 9 shows that, compared with “GIZA-batch”, “Hier-combin” achieves competitive performance with a much smaller phrase table. Relative to “GIZA-batch”, the number of phrase pairs generated by the “Hier-combin” method is far smaller for every data set: 73.9%, 52.7%, and 45.4%, respectively.

On the IWSLT2012 data set, there is a large difference between the HIT corpus and the BTEC corpus, and the “Hier-combin” method improved the BLEU score by 0.814 over the “Pialign-batch” method. On the FBIS data set, in contrast, the corpus was divided into subdomains artificially and the differences between the resulting assignments were not clear, so the “Hier-combin” method scored 0.09 BLEU lower than the “GIZA-batch” method.

In “Hier-combin”, phrase pairs can be acquired independently from each of the multiple domains. Each domain can therefore be processed on a different machine, and parallel processing is also possible.
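Because the domains are independent, the per-domain extraction can be dispatched as separate jobs. The sketch below illustrates this with Python's standard `concurrent.futures` module. The extractor `extract_counts` is a deliberately trivial placeholder that counts whole sentence pairs, and the domain names and data are hypothetical; a real system would run an aligner such as pialign in its place.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

SentencePair = Tuple[str, str]


def extract_counts(corpus: List[SentencePair]) -> Counter:
    """Placeholder extractor: treats each whole sentence pair as one phrase pair."""
    return Counter(corpus)


def extract_all(domains: Dict[str, List[SentencePair]]) -> Dict[str, Counter]:
    """Run one independent extraction job per domain and collect the results."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(extract_counts, corpus)
                   for name, corpus in domains.items()}
        return {name: future.result() for name, future in futures.items()}


# Hypothetical two-domain input.
domains = {
    "HIT": [("a", "x"), ("a", "x"), ("c", "z")],
    "BTEC": [("b", "y")],
}
per_domain_counts = extract_all(domains)
```

A thread pool is used here only to keep the sketch self-contained. For CPU-bound extraction, a process pool or separate machines, as the text suggests, would be the more realistic choice.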

FIG. 10 shows the time taken for alignment and phrase pair extraction on the “FBIS” data set. In FIG. 10, “Batch” is the batch-based ITGs sampling method (“pialign-batch”). The results in FIG. 10 were obtained on a 2.7 GHz E5-2680 CPU with 128 GB of memory. In FIG. 10, “Parallel Extraction” is the time taken by the parallel extraction, “Integrating” is the time taken by the combination process, and “Total” is the sum of the two.

Comparing “Hier-combin” with “pialign-batch”, FIG. 10 shows that the training time of “Hier-combin” is far shorter, roughly one quarter of that of “pialign-batch”. FIG. 9, on the other hand, shows that “Hier-combin” has a slightly higher BLEU score than “pialign-batch”.

The “Hier-combin” method, which combines phrase tables hierarchically, exploits the properties of the hierarchical Pitman-Yor process and thereby gains the advantage of its smoothing effect. With the “Hier-combin” method, a compact phrase table based on all the data from the various domains can be generated step by step with more accurate probabilities. Phrase pair extraction in traditional SMT is batch-based, whereas the “Hier-combin” method extracts phrase pairs very efficiently while remaining no less accurate in translation than the traditional SMT methods.
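The smoothing effect can be illustrated with the standard predictive probability of a hierarchical Pitman-Yor process, in which the estimate for the j-th corpus backs off to the estimate for the (j-1)-th corpus. The sketch below is an assumption-laden illustration, not a reproduction of Formula 9 of the claims, which is not shown in this text. The class and function names are hypothetical, and the variable names only mirror the symbols defined for that formula (customer counts c, table counts t, strength s, discount d, and a base probability).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

PhrasePair = Tuple[str, str]


@dataclass
class CorpusStats:
    customers: Dict[PhrasePair, int]  # c^j for each phrase pair <f, e>
    tables: Dict[PhrasePair, int]     # t^j for each phrase pair <f, e>
    strength: float                   # s^j (assumed > 0 here)
    discount: float                   # d^j (assumed in [0, 1))


def hier_score(pair: PhrasePair,
               levels: List[CorpusStats],
               base_prob: Callable[[PhrasePair], float]) -> float:
    """Score a phrase pair, backing off from corpus j to corpus j-1 to the base."""
    prob = base_prob(pair)
    for stats in levels:  # levels[0] is the first corpus, levels[-1] the newest
        total_c = sum(stats.customers.values())
        total_t = sum(stats.tables.values())
        c = stats.customers.get(pair, 0)
        t = stats.tables.get(pair, 0)
        denom = stats.strength + total_c
        # Discounted relative frequency plus back-off mass times the parent estimate.
        prob = ((c - stats.discount * t) / denom
                + (stats.strength + stats.discount * total_t) / denom * prob)
    return prob


# Hypothetical statistics for two corpora over two phrase pairs A and B.
A, B = ("f1", "e1"), ("f2", "e2")
first = CorpusStats(customers={A: 3, B: 1}, tables={A: 1, B: 1}, strength=1.0, discount=0.5)
second = CorpusStats(customers={A: 1}, tables={A: 1}, strength=1.0, discount=0.5)
uniform = lambda pair: 0.5  # base distribution over the two pairs
```

In this toy example, pair B never occurs in the second corpus, yet it still receives probability mass there through the back-off term, and the scores of A and B continue to sum to one at every level.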

FIG. 11 shows the BLEU scores obtained with the “Hier-combin” method on each of the three data sets when the corpora are combined in different orders. In FIG. 11, the corpora are ordered by similarity: “Descending” is descending order and “Ascending” is ascending order. The similarity between data was computed with a perplexity measure using a 5-gram language model.

FIG. 12 shows the external appearance of a computer that executes the programs described in this specification to realize the bilingual phrase learning device and the like of the various embodiments described above. The above-described embodiments can be realized by computer hardware and a computer program executed on it. FIG. 12 is an overview of this computer system 300, and FIG. 13 is a block diagram of the system 300.

In FIG. 12, the computer system 300 includes a computer 301 including a CD-ROM drive, a keyboard 302, a mouse 303, and a monitor 304.

In FIG. 13, the computer 301 includes, in addition to the CD-ROM drive 3012, an MPU 3013, a bus 3014 connected to the MPU 3013 and the CD-ROM drive 3012, a ROM 3015 for storing programs such as a boot-up program, a RAM 3016 that is connected to the MPU 3013, temporarily stores the instructions of application programs, and provides temporary storage space, and a hard disk 3017 for storing application programs, system programs, and data. Although not shown here, the computer 301 may further include a network card that provides a connection to a LAN.

A program that causes the computer system 300 to execute the functions of the bilingual phrase learning device and the like of the above-described embodiments may be stored on a CD-ROM 3101, which is inserted into the CD-ROM drive 3012, and then transferred to the hard disk 3017. Alternatively, the program may be transmitted to the computer 301 via a network (not shown) and stored on the hard disk 3017. The program is loaded into the RAM 3016 at the time of execution. The program may also be loaded directly from the CD-ROM 3101 or the network.

The program does not necessarily have to include an operating system (OS), a third-party program, or the like that causes the computer 301 to execute the functions of the bilingual phrase learning device and the like of the above-described embodiments. The program only needs to include those instructions that call the appropriate functions (modules) in a controlled manner so that the desired results are obtained. How the computer system 300 operates is well known, and a detailed description is omitted.

The program may be executed by a single computer or by a plurality of computers. That is, either centralized processing or distributed processing may be performed.

In each of the above embodiments, it goes without saying that two or more communication means (a terminal information transmission unit, a terminal information reception unit, and the like) present in one device may be physically realized by a single medium.

In each of the above embodiments, each process (each function) may be realized by centralized processing in a single device (system) or by distributed processing across a plurality of devices.

The present invention is not limited to the above embodiments. Various modifications are possible, and it goes without saying that they are also included in the scope of the present invention.

As described above, the bilingual phrase learning device according to the present invention links a translation model generated from an added bilingual corpus to the original translation model and uses them together, and thereby has the effect that the translation model can easily be enhanced step by step. It is useful as a device for machine translation and the like.

1, 2 Bilingual phrase learning device
3 Statistical machine translation device
100 Bilingual information storage unit
101 Phrase table
102 Phrase appearance frequency information storage unit
103 Symbol appearance frequency information storage unit
104 Bilingual corpus reception unit
105 Bilingual corpus storage unit
106 Phrase table initialization unit
107 Generated phrase pair acquisition unit
108 Phrase appearance frequency information update unit
109 Symbol acquisition unit
110 Symbol appearance frequency information update unit
111 Partial phrase pair generation unit
112 New phrase pair generation unit
113 Control unit
114 Score calculation unit
115 Parsing unit
116 Phrase table update unit
117 Tree update unit
201 Bilingual corpus generation unit

Claims

A bilingual phrase learning device comprising:
a bilingual information storage unit capable of storing N (N is a natural number of 2 or more) bilingual corpora, each having one or more pieces of bilingual information including a bilingual sentence and a tree structure of the bilingual sentence;
a phrase table capable of storing, for each bilingual corpus, one or more scored phrase pairs, each having a phrase pair, which is a pair of a first language phrase having one or more words in a first language and a second language phrase having one or more words in a second language, and a score, which is information relating to the appearance probability of the phrase pair;
a phrase appearance frequency information storage unit capable of storing, for each bilingual corpus, one or more pieces of phrase appearance frequency information, each including a phrase pair and F appearance frequency information, which is information relating to the appearance frequency of the phrase pair;
a symbol appearance frequency information storage unit capable of storing one or more pieces of symbol appearance frequency information, each including a symbol that identifies a method for generating a new phrase pair and S appearance frequency information, which is information relating to the appearance frequency of the symbol;
a generated phrase pair acquisition unit that, for each bilingual corpus, acquires a phrase pair having a first language phrase and a second language phrase using the one or more pieces of phrase appearance frequency information;
a phrase appearance frequency information update unit that, when the phrase pair has been acquired, increases the F appearance frequency information corresponding to the phrase pair by a predetermined value;
a symbol acquisition unit that, when the phrase pair could not be acquired, acquires one symbol using the one or more pieces of symbol appearance frequency information;
a symbol appearance frequency information update unit that increases the S appearance frequency information corresponding to the symbol acquired by the symbol acquisition unit by a predetermined value;
a partial phrase pair generation unit that, when the phrase pair could not be acquired, generates two phrase pairs smaller than the phrase pair to be acquired;
a new phrase pair generation unit that performs, in accordance with the symbol acquired by the symbol acquisition unit, one of a first process of generating a new phrase pair, a second process in which two smaller phrase pairs are generated and, using the one or more pieces of phrase appearance frequency information, one phrase pair is generated that has a new first language phrase in which the two first language phrases constituting the two generated phrase pairs are connected in order and a new second language phrase in which the two second language phrases constituting the two phrase pairs are connected in order, and a third process in which two smaller phrase pairs are generated and, using the one or more pieces of phrase appearance frequency information, one phrase pair is generated that has a new first language phrase in which the two first language phrases constituting the two generated phrase pairs are connected in order and a new second language phrase in which the two second language phrases constituting the two phrase pairs are connected in reverse order;
a control unit that instructs the phrase appearance frequency information update unit, the symbol acquisition unit, the symbol appearance frequency information update unit, the partial phrase pair generation unit, and the new phrase pair generation unit to recursively process the phrase pairs generated by the new phrase pair generation unit;
a score calculation unit that calculates a score for each phrase pair in the phrase table using the one or more pieces of phrase appearance frequency information stored in the phrase appearance frequency information storage unit; and
a phrase table update unit that stores the score calculated by the score calculation unit in association with each phrase pair,
wherein the score calculation unit,
when calculating a score for each phrase pair acquired from the j-th (2 <= j <= N) bilingual corpus, calculates the score for each phrase pair corresponding to the j-th bilingual corpus using one or more pieces of phrase appearance frequency information corresponding to the (j-1)-th bilingual corpus.

The bilingual phrase learning device according to claim 1, wherein
the bilingual information storage unit stores one or more bilingual corpora,
the device further comprising:
a bilingual corpus reception unit that receives a bilingual corpus; and
a bilingual corpus accumulation unit that accumulates the bilingual corpus received by the bilingual corpus reception unit in the bilingual information storage unit,
wherein the control unit,
after the bilingual corpus accumulation unit has accumulated the received bilingual corpus in the bilingual information storage unit, instructs the generated phrase pair acquisition unit, the phrase appearance frequency information update unit, the symbol acquisition unit, the symbol appearance frequency information update unit, the partial phrase pair generation unit, and the new phrase pair generation unit to process the received bilingual corpus, and
the score calculation unit,
when calculating the score for each phrase pair acquired from the bilingual corpus received by the bilingual corpus reception unit, calculates the score for each phrase pair corresponding to the received bilingual corpus using one or more pieces of phrase appearance frequency information corresponding to one of the one or more bilingual corpora that were stored in the bilingual information storage unit before the bilingual corpus accumulation unit accumulated the received bilingual corpus.

The bilingual phrase learning device according to claim 1, further comprising a bilingual corpus generation unit that divides two or more bilingual sentences into N groups, acquires a tree structure of the bilingual sentences from the bilingual sentences of each group, and accumulates the N bilingual corpora thus created in the bilingual information storage unit,
wherein the score calculation unit,
when calculating the score for each phrase pair acquired from one bilingual corpus, calculates the score for each phrase pair corresponding to the one bilingual corpus using one or more pieces of phrase appearance frequency information corresponding to another bilingual corpus different from the one bilingual corpus.

The bilingual phrase learning device according to any one of claims 1 to 3, wherein the score calculation unit calculates the score for each phrase pair corresponding to each bilingual corpus using a hierarchical Chinese restaurant process according to Equation 9.
In Equation 9, f is a source language phrase, e is a target language phrase, F is a source language sentence, E is a target language sentence, C^j is the number of all customers in the j-th parallel translation data <F, E>, s^j is the strength corresponding to the j-th parallel translation data, d^j is the discount corresponding to the j-th parallel translation data, T^j is the number of all tables in the j-th parallel translation data <F, E>, c^j_{<f, e>} is the number of customers corresponding to each <f, e> in the j-th parallel translation data, t^j is the number of tables corresponding to each <f, e> in the j-th parallel translation data, and P_base(<f_i, e_i>) is the prior probability of the model estimated in advance.
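Equation 9 itself is not reproduced in this text, so the sketch below does not claim to be that equation. It uses the standard predictive probability of a Pitman-Yor style Chinese restaurant process, in which level j backs off to level j-1 and the first level backs off to P_base, and it reuses the symbols defined above (s^j, d^j, C^j, T^j, c^j, t^j). The `Restaurant` container and the `score` function are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

PhrasePair = Tuple[str, str]  # (source phrase f, target phrase e)


@dataclass
class Restaurant:
    """Counts for the j-th bilingual corpus (hypothetical container)."""
    strength: float  # s^j
    discount: float  # d^j
    customers: Dict[PhrasePair, int] = field(default_factory=dict)  # c^j per <f, e>
    tables: Dict[PhrasePair, int] = field(default_factory=dict)     # t^j per <f, e>


def score(pair: PhrasePair, restaurants: List[Restaurant], j: int,
          p_base: Callable[[PhrasePair], float]) -> float:
    """Assumed form: P^j = (c - d*t)/(s + C) + (s + d*T)/(s + C) * P^(j-1).

    j is 1-based; level 0 is the base distribution P_base.
    """
    if j == 0:
        return p_base(pair)
    r = restaurants[j - 1]
    total_customers = sum(r.customers.values())  # C^j
    total_tables = sum(r.tables.values())        # T^j
    c = r.customers.get(pair, 0)
    t = r.tables.get(pair, 0)
    denom = r.strength + total_customers
    backoff = score(pair, restaurants, j - 1, p_base)  # uses the (j-1)-th counts
    return (c - r.discount * t) / denom \
        + (r.strength + r.discount * total_tables) / denom * backoff
```

Under this assumed form, a phrase pair unseen in the j-th corpus still receives probability mass from the (j-1)-th corpus, which matches the idea of reusing the earlier frequency information rather than relearning from all data.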

A statistical machine translation device comprising:
A phrase table learned by the bilingual phrase learning device according to any one of claims 1 to 4;
An accepting unit that accepts a sentence of a first language having one or more words;
A phrase acquisition unit that extracts one or more phrases from the sentence accepted by the accepting unit and acquires one or more phrases of a second language from the phrase table using the scores of the phrase table;
A sentence construction unit that constructs a sentence of the second language from the one or more phrases acquired by the phrase acquisition unit; and
An output unit that outputs the sentence constructed by the sentence construction unit.
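As a rough illustration of how the accepting, phrase acquisition, sentence construction, and output stages fit together, the sketch below performs a greedy, monotone lookup in a scored phrase table. The table layout, the `translate` function, and the `max_len` parameter are assumptions; an actual statistical decoder would normally search over segmentations and reorderings and combine further models, which this sketch does not attempt.

```python
from typing import Dict, List, Tuple

# Hypothetical phrase table: source phrase -> list of (target phrase, score).
PhraseTable = Dict[str, List[Tuple[str, float]]]


def translate(sentence: str, table: PhraseTable, max_len: int = 3) -> str:
    """Greedy longest-match translation without reordering (a simplification)."""
    words = sentence.split()
    output: List[str] = []
    i = 0
    while i < len(words):
        # Try the longest source phrase starting at position i first.
        for n in range(min(max_len, len(words) - i), 0, -1):
            src = " ".join(words[i:i + n])
            if src in table:
                # Pick the candidate with the highest phrase-table score.
                best, _ = max(table[src], key=lambda cand: cand[1])
                output.append(best)
                i += n
                break
        else:
            # No phrase matched: pass the word through unchanged.
            output.append(words[i])
            i += 1
    return " ".join(output)
```

For instance, with a table that maps "guten morgen" to "good morning" (score 0.9) and "welt" to "world", the input "guten morgen welt" is segmented into those two phrases and the highest-scored target phrase is chosen for each.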

A bilingual phrase learning method in which a recording medium includes:
A bilingual information storage unit capable of storing N (N is a natural number of 2 or more) bilingual corpora, each having one or more pieces of bilingual information having a bilingual sentence and a tree structure of the bilingual sentence;
A phrase table capable of storing one or more scored phrase pairs, each having a phrase pair, which is a pair of a first language phrase having one or more words of a first language and a second language phrase having one or more words of a second language, and a score, which is information relating to the appearance probability of the phrase pair;
A phrase appearance frequency information storage unit capable of storing, for each bilingual corpus, one or more pieces of phrase appearance frequency information, each including a phrase pair and F appearance frequency information, which is information relating to the appearance frequency of the phrase pair; and
A symbol appearance frequency information storage unit capable of storing one or more pieces of symbol appearance frequency information, each including a symbol that identifies a method for generating a new phrase pair and S appearance frequency information, which is information relating to the appearance frequency of the symbol,
the method being realizable by a generated phrase pair acquisition unit, a phrase appearance frequency information update unit, a symbol acquisition unit, a symbol appearance frequency information update unit, a partial phrase pair generation unit, a new phrase pair generation unit, a control unit, a score calculation unit, and a phrase table update unit, and comprising:
A generated phrase pair acquisition step in which the generated phrase pair acquisition unit acquires, for each bilingual corpus, a phrase pair having a first language phrase and a second language phrase using the one or more pieces of phrase appearance frequency information;
A phrase appearance frequency information update step in which the phrase appearance frequency information update unit, when the phrase pair has been acquired, increases the F appearance frequency information corresponding to the phrase pair by a predetermined value;
A symbol acquisition step in which the symbol acquisition unit, when the phrase pair could not be acquired, acquires one symbol using the one or more pieces of symbol appearance frequency information;
A symbol appearance frequency information update step in which the symbol appearance frequency information update unit increases the S appearance frequency information corresponding to the symbol acquired in the symbol acquisition step by a predetermined value;
A partial phrase pair generation step in which the partial phrase pair generation unit, when the phrase pair could not be acquired, generates two phrase pairs smaller than the phrase pair to be acquired;
A new phrase pair generation step in which the new phrase pair generation unit performs, according to the symbol acquired in the symbol acquisition step, one of: a first process of generating a new phrase pair; a second process of generating two smaller phrase pairs and, using the one or more pieces of phrase appearance frequency information, generating one phrase pair having a new first language phrase in which the two first language phrases constituting the two generated phrase pairs are connected in order and a new second language phrase in which the two second language phrases constituting the two phrase pairs are connected in order; and a third process of generating two smaller phrase pairs and, using the one or more pieces of phrase appearance frequency information, generating one phrase pair having a new first language phrase in which the two first language phrases constituting the two generated phrase pairs are connected in order and a new second language phrase in which the two second language phrases constituting the two phrase pairs are connected in reverse order;
A control step in which the control unit instructs that the processing of the phrase appearance frequency information update step, the symbol acquisition step, the symbol appearance frequency information update step, the partial phrase pair generation step, and the new phrase pair generation step be performed recursively on the phrase pairs generated in the new phrase pair generation step;
A score calculation step of calculating a score for each phrase pair in the phrase table using the one or more pieces of phrase appearance frequency information stored in the phrase appearance frequency information storage unit; and
A phrase table update step of storing the score calculated in the score calculation step in association with each phrase pair,
Wherein, in the score calculation step of the bilingual phrase learning method,
when calculating the score for each phrase pair acquired from the j-th (2 <= j <= N) bilingual corpus, the score for each phrase pair corresponding to the j-th bilingual corpus is calculated using the one or more pieces of phrase appearance frequency information corresponding to the (j-1)-th bilingual corpus.
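The recursive control flow of the steps above can be pictured with the following simplified sketch. It makes two loud assumptions: the symbol and the split into two smaller phrase pairs are read from a given tree instead of being drawn from the symbol appearance frequency information, and the symbol names `BASE`, `REG`, and `INV` (standing for the first, second, and third processes) are hypothetical labels. `Node`, `span`, and `update_counts` are likewise illustrative names.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Tuple

PhrasePair = Tuple[str, str]  # (first language phrase, second language phrase)


@dataclass
class Node:
    """Hypothetical tree node; symbol is 'BASE', 'REG' or 'INV'."""
    symbol: str
    pair: Optional[PhrasePair] = None  # set only for BASE leaves
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def span(node: Node) -> PhrasePair:
    """Phrase pair covered by a node."""
    if node.symbol == "BASE":
        return node.pair
    (f1, e1), (f2, e2) = span(node.left), span(node.right)
    f = f"{f1} {f2}"  # first language side: always connected in order
    # Second language side: in order for REG, in reverse order for INV.
    e = f"{e1} {e2}" if node.symbol == "REG" else f"{e2} {e1}"
    return (f, e)


def update_counts(node: Node, f_counts: Counter, s_counts: Counter) -> PhrasePair:
    """Update F and S appearance counts for one tree, recursing on sub-pairs."""
    pair = span(node)
    if f_counts[pair] > 0:       # the phrase pair can be acquired
        f_counts[pair] += 1      # increase its F appearance frequency
        return pair
    s_counts[node.symbol] += 1   # not acquired: record the symbol (S frequency)
    if node.symbol != "BASE":    # two smaller phrase pairs, processed recursively
        update_counts(node.left, f_counts, s_counts)
        update_counts(node.right, f_counts, s_counts)
    f_counts[pair] += 1          # register the newly generated phrase pair
    return pair
```

Processing the same tree a second time stops at the root, because the whole phrase pair is now found in the phrase appearance counts; only its F count increases and no symbol count changes.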

A computer-accessible recording medium includes:
A bilingual information storage unit capable of storing N (N is a natural number of 2 or more) bilingual corpora, each having one or more pieces of bilingual information having a bilingual sentence and a tree structure of the bilingual sentence;
A phrase table capable of storing one or more scored phrase pairs, each having a phrase pair, which is a pair of a first language phrase having one or more words of a first language and a second language phrase having one or more words of a second language, and a score, which is information relating to the appearance probability of the phrase pair;
A phrase appearance frequency information storage unit capable of storing, for each bilingual corpus, one or more pieces of phrase appearance frequency information, each including a phrase pair and F appearance frequency information, which is information relating to the appearance frequency of the phrase pair; and
A symbol appearance frequency information storage unit capable of storing one or more pieces of symbol appearance frequency information, each including a symbol that identifies a method for generating a new phrase pair and S appearance frequency information, which is information relating to the appearance frequency of the symbol,
A program causes a computer to function as:
A generated phrase pair acquisition unit that, for each bilingual corpus, acquires a phrase pair having a first language phrase and a second language phrase using the one or more pieces of phrase appearance frequency information;
A phrase appearance frequency information update unit that, when the phrase pair has been acquired, increases the F appearance frequency information corresponding to the phrase pair by a predetermined value;
A symbol acquisition unit that, when the phrase pair could not be acquired, acquires one symbol using the one or more pieces of symbol appearance frequency information;
A symbol appearance frequency information update unit that increases the S appearance frequency information corresponding to the symbol acquired by the symbol acquisition unit by a predetermined value;
A partial phrase pair generation unit that, when the phrase pair could not be acquired, generates two phrase pairs smaller than the phrase pair to be acquired;
A new phrase pair generation unit that performs, according to the symbol acquired by the symbol acquisition unit, one of: a first process of generating a new phrase pair; a second process of generating two smaller phrase pairs and, using the one or more pieces of phrase appearance frequency information, generating one phrase pair having a new first language phrase in which the two first language phrases constituting the two generated phrase pairs are connected in order and a new second language phrase in which the two second language phrases constituting the two phrase pairs are connected in order; and a third process of generating two smaller phrase pairs and, using the one or more pieces of phrase appearance frequency information, generating one phrase pair having a new first language phrase in which the two first language phrases constituting the two generated phrase pairs are connected in order and a new second language phrase in which the two second language phrases constituting the two phrase pairs are connected in reverse order;
A control unit that instructs the phrase appearance frequency information update unit, the symbol acquisition unit, the symbol appearance frequency information update unit, the partial phrase pair generation unit, and the new phrase pair generation unit to recursively process the phrase pairs generated by the new phrase pair generation unit;
A score calculation unit that calculates a score for each phrase pair in the phrase table using the one or more pieces of phrase appearance frequency information stored in the phrase appearance frequency information storage unit; and
A phrase table update unit that stores the score calculated by the score calculation unit in association with each phrase pair,
In the program, the score calculation unit,
when calculating the score for each phrase pair acquired from the j-th (2 <= j <= N) bilingual corpus, calculates the score for each phrase pair corresponding to the j-th bilingual corpus using the one or more pieces of phrase appearance frequency information corresponding to the (j-1)-th bilingual corpus.