JP6044996B2

JP6044996B2 - Character string association apparatus, method, and program

Info

Publication number: JP6044996B2
Application number: JP2013149869A
Authority: JP
Inventors: 克仁須藤; 永田　昌明; 昌明永田; 信介森
Original assignee: Kyoto University; Nippon Telegraph and Telephone Corp
Current assignee: Kyoto University; Nippon Telegraph and Telephone Corp
Priority date: 2013-07-18
Filing date: 2013-07-18
Publication date: 2016-12-14
Anticipated expiration: 2033-07-18
Also published as: JP2015022508A

Description

The present invention relates to a character string association apparatus, method, and program, and in particular to a character string association apparatus, method, and program that associate characters within pairs of character strings in different languages.

Machine transliteration converts a phrase written in the phonological system of one language into its notation in the phonological system of another language. To express it as a statistical model, it is common practice to use pairs of mutually corresponding words as training data and to estimate the correspondence between the characters that make up those words. For example, Non-Patent Document 1 describes a one-to-many method for associating phonetic symbols with Roman letters, between English phonetic representations and the romanized notation of Japanese katakana words. Non-Patent Document 2 describes many-to-many association between English letters and phonetic notation. Non-Patent Document 3 describes a computer program that automatically performs many-to-many association of symbols.

On the other hand, when a large number of word pairs is collected for training a statistical model, some contamination by errors is unavoidable. Take the correspondence between so-called "katakana words" and English as an example. Among the entries of a Japanese-English bilingual dictionary whose Japanese side is written in katakana, some are in a transliteration relationship, such as "コンピュータ" and "computer", while others are written in katakana but are not in a transliteration relationship, such as "カブトムシ" and "beetle". Using such incorrect word correspondences for training introduces noise and degrades the quality of the statistical model, and this should be avoided. To address this problem, a transliteration association method has been proposed that uses a statistical model of character-to-character correspondence for pairs that are transliterations, together with a statistical model of character strings that are not transliterations and are independent between the two languages, and it has been shown to work effectively (for example, Non-Patent Document 4).

In the technical field called "statistical machine translation", as in Non-Patent Document 5, which does not assume that a bilingual dictionary exists, word correspondences are obtained automatically by using, for example, co-occurrence relations in parallel sentences. Deriving transliteration correspondences from such automatically obtained word correspondences makes it more likely that the word correspondences contain errors. With the method of Non-Patent Document 4, however, one-to-one word pairs can be classified automatically into transliterated pairs and non-transliterated pairs, and a statistical transliteration model can be trained from the transliterated pairs only. Non-Patent Document 6 describes a case in which the technique of Non-Patent Document 3 is adapted to handle character strings that are not transliterations.

Non-Patent Document 1: Kevin Knight and Jonathan Graehl, "Machine Transliteration", Computational Linguistics, 1998, Volume 24, Number 4, p.599-612
Non-Patent Document 2: Sittichai Jiampojamarn and two others, "Applying Many-to-Many Alignments and Hidden Markov Models to Letter-to-Phoneme Conversion", Proceedings of NAACL HLT, 2007, p.372-379
Non-Patent Document 3: Andrew Finch and Eiichiro Sumita, "A Bayesian Model of Bilingual Segmentation for Transliteration", Proceedings of International Workshop on Spoken Language Translation, 2010
Non-Patent Document 4: Hassan Sajjad and two others, "A Statistical Model for Unsupervised and Semisupervised Transliteration Mining", Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, 2012, p.469-477
Non-Patent Document 5: Philipp Koehn and two others, "Statistical Phrase-Based Translation", Proceedings of HLT-NAACL, 2003, p.48-54
Non-Patent Document 6: Ohnmar Htun and three others, "Improving Transliteration Mining by Integrating Expert Knowledge with Statistical Approaches", International Journal of Computer Applications, 2012, 58(17), p.12-22

The method of Non-Patent Document 4 decides only whether a word correspondence as a whole "is a transliteration" or "is not a transliteration". For correspondences that span multiple words, such as compound words, it therefore cannot identify cases that are "partly a transliteration, with the remaining part not transliterated". In phrase-based statistical machine translation such as that of Non-Patent Document 5, phrase-to-phrase translation relations between the two languages are estimated automatically, so a phrase pair may include parts that are not translations of each other. The method of Non-Patent Document 4 cannot be expected to discriminate such cases adequately, which compromises the accuracy of the transliteration correspondences and of the transliteration model obtained from them. For example, when the phrase "the computer" corresponds to "コンピュータ", the katakana side has no character string corresponding to "the", so a binary classification into "transliteration" or "not a transliteration" is unsuitable.

The present invention has been made in view of the above circumstances, and its object is to provide a character string association apparatus, method, and program that can accurately associate characters within pairs of character strings in different languages.

To achieve the above object, the character string association apparatus according to the present invention associates characters between a character string of a first language and a character string of a second language for a character string pair, that is, a combination of character strings with the same meaning that belong respectively to the different first and second languages. The apparatus includes an association calculation unit that operates on each of the character string pairs stored in a character string pair database holding a plurality of such pairs.

Each character string of a pair is taken to consist, in order from its beginning, of a prefix non-transliteration segment (a substring of zero or more characters that is not in a transliteration relationship with a substring of the other language), a transliteration segment (a substring of zero or more characters that is in a transliteration relationship with a substring of the other language), and a postfix non-transliteration segment (a substring of zero or more characters that is not in a transliteration relationship with a substring of the other language). Under this structure, the association calculation unit divides each character string of the pair into the prefix non-transliteration segment, the transliteration segment, and the postfix non-transliteration segment, and associates characters between the substrings of the transliteration segment of the first-language character string and the substrings of the transliteration segment of the second-language character string, so that the result is most plausible with respect to the following probabilities:

・a non-transliteration model selection probability representing the probability that a substring of the first language is a non-transliterated part, not in a transliteration relationship with a substring of the second language;
・a non-transliteration model selection probability representing the probability that a substring of the second language is a non-transliterated part, not in a transliteration relationship with a substring of the first language;
・a transliteration model selection probability representing the probability that a substring of the first language is a transliterated part in a transliteration relationship with a substring of the second language, and that the substring of the second language is a transliterated part in a transliteration relationship with the substring of the first language;
・a non-transliteration model generation probability representing the generation probability, in the first language, of each substring of the prefix non-transliteration segment and of the postfix non-transliteration segment of the first-language character string;
・a non-transliteration model generation probability representing the generation probability, in the second language, of each substring of the prefix non-transliteration segment and of the postfix non-transliteration segment of the second-language character string;
・a transliteration model generation probability representing the joint generation probability of each pair of substrings, one from the transliteration segment of the first-language character string and one from the transliteration segment of the second-language character string.

The character string association method according to the present invention is performed in a character string association apparatus that includes an association calculation unit and that associates characters between a character string of a first language and a character string of a second language for a character string pair, that is, a combination of character strings with the same meaning that belong respectively to the different first and second languages. The method includes a step performed by the association calculation unit on each of the character string pairs stored in a character string pair database holding a plurality of such pairs.

Each character string of a pair is taken to consist, in order from its beginning, of a prefix non-transliteration segment (a substring of zero or more characters that is not in a transliteration relationship with a substring of the other language), a transliteration segment (a substring of zero or more characters that is in a transliteration relationship with a substring of the other language), and a postfix non-transliteration segment (a substring of zero or more characters that is not in a transliteration relationship with a substring of the other language). In the step, each character string of the pair is divided into the prefix non-transliteration segment, the transliteration segment, and the postfix non-transliteration segment, and characters are associated between the substrings of the transliteration segment of the first-language character string and the substrings of the transliteration segment of the second-language character string, so that the result is most plausible with respect to the following probabilities:

・a non-transliteration model selection probability representing the probability that a substring of the first language is a non-transliterated part, not in a transliteration relationship with a substring of the second language;
・a non-transliteration model selection probability representing the probability that a substring of the second language is a non-transliterated part, not in a transliteration relationship with a substring of the first language;
・a transliteration model selection probability representing the probability that a substring of the first language is a transliterated part in a transliteration relationship with a substring of the second language, and that the substring of the second language is a transliterated part in a transliteration relationship with the substring of the first language;
・a non-transliteration model generation probability representing the generation probability, in the first language, of each substring of the prefix non-transliteration segment and of the postfix non-transliteration segment of the first-language character string;
・a non-transliteration model generation probability representing the generation probability, in the second language, of each substring of the prefix non-transliteration segment and of the postfix non-transliteration segment of the second-language character string;
・a transliteration model generation probability representing the joint generation probability of each pair of substrings, one from the transliteration segment of the first-language character string and one from the transliteration segment of the second-language character string.

The association calculation unit according to the present invention may also include the following units:

・an initial value setting unit that sets an initial value for each of the non-transliteration model selection probability of the first language, the non-transliteration model selection probability of the second language, the transliteration model selection probability for each substring of the second language, the non-transliteration model generation probability for each substring of the first language, the non-transliteration model generation probability for each substring of the second language, and the transliteration model generation probability for each pair of a substring of the first language and a substring of the second language;
・an expected value calculation unit that uses the non-transliteration model selection probabilities, the transliteration model selection probability, the non-transliteration model generation probabilities, and the transliteration model generation probability, as set by the initial value setting unit or as last updated, to calculate for each character string pair: for each pair of a substring of the first-language character string and a substring of the second-language character string, the expected value that the pair is in a translation relationship; for each substring of the first-language character string, the expected value that the substring is a non-transliterated part; and for each substring of the second-language character string, the expected value that the substring is a non-transliterated part;
・a parameter update unit that updates the non-transliteration model selection probabilities, the transliteration model selection probability, the non-transliteration model generation probabilities, and the transliteration model generation probability, based on the expected values calculated by the expected value calculation unit for each character string pair, namely the expected value of the translation relationship for each pair of substrings, the expected value of being a non-transliterated part for each substring of the first language, and the expected value of being a non-transliterated part for each substring of the second language;
・a stop condition determination unit that determines whether a predetermined stop condition is satisfied and repeats the calculation by the expected value calculation unit and the update by the parameter update unit until the stop condition is satisfied.

With these units, for each character string pair and based on each of the non-transliteration model selection probabilities, the transliteration model selection probability, the non-transliteration model generation probabilities, and the transliteration model generation probability, the association calculation unit divides each character string of the pair into the prefix non-transliteration segment, the transliteration segment, and the postfix non-transliteration segment, and associates characters between the substrings of the transliteration segment of the first-language character string and the substrings of the transliteration segment of the second-language character string.
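The iteration described here (initial values, expected values, parameter update, stop condition) has the shape of an expectation-maximization loop. The following Python sketch illustrates that loop under strong simplifying assumptions that are not taken from the source: the candidate analyses of each pair are enumerated in advance, each candidate is a tuple of hashable "units", and the selection and generation probabilities are merged into a single categorical distribution over units. The name `em_unit_probs`, the unit encoding, and the threshold `tol` are hypothetical.

```python
import math
from collections import defaultdict

def em_unit_probs(candidates_per_pair, max_iters=50, tol=1e-6):
    """Toy EM over pre-enumerated candidate analyses (illustrative assumption).

    candidates_per_pair: one list per string pair; each candidate is a tuple
    of hashable units, e.g. ("T", "a", "ア") for a transliterated unit.
    """
    # Initial value setting: uniform probability for every unit seen.
    units = {u for cands in candidates_per_pair for c in cands for u in c}
    prob = {u: 1.0 / len(units) for u in units}
    prev_ll = None
    for _ in range(max_iters):
        # Expected value calculation: posterior-weighted unit counts.
        expected = defaultdict(float)
        log_likelihood = 0.0
        for cands in candidates_per_pair:
            weights = [math.prod(prob[u] for u in c) for c in cands]
            z = sum(weights)
            log_likelihood += math.log(z)
            for c, w in zip(cands, weights):
                for u in c:
                    expected[u] += w / z
        # Parameter update: renormalize the expected counts.
        total = sum(expected.values())
        prob = {u: expected.get(u, 0.0) / total for u in units}
        # Stop condition: log-likelihood has stopped changing.
        if prev_ll is not None and abs(log_likelihood - prev_ll) < tol:
            break
        prev_ll = log_likelihood
    return prob
```

A real implementation would compute the expected values with dynamic programming over substrings instead of enumerating candidates; the enumeration here only keeps the loop structure visible.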

Alternatively, the association calculation unit according to the present invention may include the following units:

・an initial association setting unit that, for each character string pair, divides each character string of the pair into the prefix non-transliteration segment, the transliteration segment, and the postfix non-transliteration segment, and associates characters between the substrings of the transliteration segment of the first-language character string and the substrings of the transliteration segment of the second-language character string, thereby initializing the association;
・a filtering unit that, based on the associations, as set by the initial association setting unit or as last updated, of each of the character string pairs other than the pair being processed, calculates the non-transliteration model selection probability of the first language, the non-transliteration model selection probability of the second language, the transliteration model selection probability, the non-transliteration model generation probability for each substring of the first-language character string of the pair being processed, the non-transliteration model generation probability for each substring of the second-language character string of the pair being processed, and the transliteration model generation probability for each pair of a substring of the first-language character string and a substring of the second-language character string of the pair being processed;
・a sampling unit that, based on the non-transliteration model selection probabilities, the transliteration model selection probability, the non-transliteration model generation probabilities, and the transliteration model generation probability calculated by the filtering unit, divides each character string of the pair being processed into the prefix non-transliteration segment, the transliteration segment, and the postfix non-transliteration segment, and associates characters between the substrings of the transliteration segment of the first-language character string and the substrings of the transliteration segment of the second-language character string, thereby updating the association for the pair being processed;
・a stop condition determination unit that determines whether a predetermined stop condition is satisfied and, until the stop condition is satisfied, repeats the calculation by the filtering unit and the update by the sampling unit with each character string pair in turn as the pair being processed.
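The filtering and sampling steps can be pictured as a leave-one-out resampling pass: statistics are rebuilt from every pair except the one being processed, and that pair's analysis is then redrawn in proportion to a weight computed from those statistics. The Python sketch below is an illustration only. The source does not specify the weighting at this point, so the add-alpha smoothed weight, the names `resampling_sweep` and `smoothed_weight`, and the encoding of an analysis as a tuple of hashable units are all assumptions.

```python
import random
from collections import Counter

def smoothed_weight(units, counts, alpha=0.1):
    """Weight of one candidate analysis: product of add-alpha unit estimates
    (a hypothetical stand-in for the selection and generation probabilities)."""
    total = sum(counts.values())
    weight = 1.0
    for u in units:
        weight *= (counts[u] + alpha) / (total + alpha * (len(counts) + 1))
    return weight

def resampling_sweep(analyses, candidates_fn, weight_fn, rng):
    """One pass over all pairs; `analyses` is updated in place and returned."""
    for i in range(len(analyses)):
        # "Filtering": count units in every pair except the one being processed.
        counts = Counter()
        for j, units in enumerate(analyses):
            if j != i:
                counts.update(units)
        # "Sampling": redraw the analysis of pair i from its candidates.
        candidates = candidates_fn(i)
        weights = [weight_fn(c, counts) for c in candidates]
        analyses[i] = rng.choices(candidates, weights=weights, k=1)[0]
    return analyses
```

Calling `resampling_sweep` repeatedly until a stop condition holds (for example a fixed number of sweeps, which is also an assumption) mirrors the role of the stop condition determination unit.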

The program according to the present invention is a program for causing a computer to function as each unit of the character string association apparatus according to the present invention.

As described above, according to the character string association apparatus, method, and program of the present invention, for a character string pair that is a combination of character strings with the same meaning belonging respectively to different first and second languages, each character string of the pair is taken to consist, in order from its beginning, of a prefix non-transliteration segment, a transliteration segment, and a postfix non-transliteration segment. Each character string is divided into these segments, and characters are associated between the substrings of the transliteration segment of the first-language character string and those of the second-language character string, so that the result is most plausible with respect to the transliteration model selection probability, the non-transliteration model selection probabilities, the non-transliteration model generation probabilities, and the transliteration model generation probability. This yields the effect that characters in character string pairs of different languages can be associated accurately.

A schematic diagram showing the configuration of the character string association apparatus according to the first embodiment of the present invention.
A flowchart showing the character string association processing routine in the character string association apparatus according to the first embodiment of the present invention.
A schematic diagram showing the configuration of the character string association apparatus according to the second embodiment of the present invention.
A flowchart showing the character string association processing routine in the character string association apparatus according to the second embodiment of the present invention.
A diagram showing, as an example of association, the English and katakana input to the character string association apparatus according to the present embodiment.
A diagram showing an example of the result of associating English with katakana.

Embodiments of the present invention are described in detail below with reference to the drawings.

<Outline of the invention>
For a character string pair to be associated, the embodiment of the present invention identifies whether it "is a transliteration" or "is not a transliteration" at the level of substrings. If this identification were simply made substring by substring, the "transliteration" and "non-transliteration" parts could alternate frequently within a pair, which is unsuitable as a character string association method. A constraint is therefore added that a pair contains at most one "transliteration" part. Specifically, in the many-to-many character string association method described in Non-Patent Document 1, two character string generation models, a "transliteration model" and a "non-transliteration model", are learned simultaneously. This identifies the transliterated part of each pair and achieves appropriate association between character strings even in data in which non-transliterated parts are mixed.

The embodiment of the present invention assumes two kinds of models: a probability model of character strings that "are transliterations" (hereinafter the "transliteration model"), similar to those of Non-Patent Document 3 and Non-Patent Document 4, and a probability model of character strings that "are not transliterations" (hereinafter the "non-transliteration model"). The transliteration model is a model of the joint probability P(s̄, t̄) of substrings of the source language and the target language, where s̄ is a source-language substring and t̄ is a target-language substring. The non-transliteration model consists of two independent probability models: a model of the source-language substring probability P(s̄) and a model of the target-language substring probability P(t̄).
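As a rough illustration of the two kinds of models, the Python sketch below stores a joint distribution over substring pairs for the transliteration model and two separate monolingual distributions for the non-transliteration model. The relative-frequency estimates and the class names `JointModel`, `MonolingualModel`, and `NonTransliterationModel` are assumptions made for the example; the source does not fix how the probabilities are parameterized.

```python
from collections import Counter

class JointModel:
    """Transliteration model: relative-frequency estimate of P(s̄, t̄)."""

    def __init__(self):
        self.counts = Counter()

    def observe(self, src, tgt, weight=1.0):
        self.counts[(src, tgt)] += weight

    def prob(self, src, tgt):
        total = sum(self.counts.values())
        return self.counts[(src, tgt)] / total if total else 0.0

class MonolingualModel:
    """One half of the non-transliteration model: P(s̄) or P(t̄)."""

    def __init__(self):
        self.counts = Counter()

    def observe(self, text, weight=1.0):
        self.counts[text] += weight

    def prob(self, text):
        total = sum(self.counts.values())
        return self.counts[text] / total if total else 0.0

class NonTransliterationModel:
    """Two independent monolingual models, one per language."""

    def __init__(self):
        self.source = MonolingualModel()
        self.target = MonolingualModel()
```

Keeping the source and target sides in separate objects reflects the independence stated in the text: a non-transliterated substring on one side says nothing about the other side.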

Unlike Non-Patent Document 4 and Non-Patent Document 6, the embodiment of the present invention does not distinguish whether a source-language and target-language character string pair used for training is, as a whole, a transliteration or not. Instead, it trains the probability model to distinguish which part of each character string is a transliteration and which part is not. It further assumes that at most one part is a transliteration. That is, each of the source-language and target-language character strings is assumed to consist of the following three segments:

・a part of zero or more characters that is not a transliteration (hereinafter the prefix non-transliteration segment)
・a part of zero or more characters that is a transliteration (hereinafter the transliteration segment)
・a part of zero or more characters that is not a transliteration (hereinafter the postfix non-transliteration segment)

の順で構成されると仮定する。すなわち、文字列組の各文字列を、文字列の先頭から順番に、他方の言語の部分文字列と翻字関係にない０文字以上の部分文字列を示す前置非翻字セグメントと、他方の言語の部分文字列と翻字関係にある０文字以上の部分文字列を示す翻字セグメントと、他方の言語の部分文字列と翻字関係にない０文字以上の部分文字列を示す後置非翻字セグメントとで構成した場合を想定する。
なお、文字列全体が翻字でない場合は、すべてが前置非翻字セグメントに属するものとする。本実施の形態における文字列の対応付けは、原言語と目的言語の文字列組を構成する文字列を上記セグメントに分割し、非翻字セグメントにおいては原言語と目的言語で独立な非翻字モデルに基づいて部分文字列が生成され、翻字セグメントにおいては原言語と目的言語との翻字モデルに基づいて部分文字列の組が生成される場合の尤度が最大となるような対応付けを求める過程である。翻字モデルや非翻字モデル自体の構成及びモデル最適化アルゴリズムは特に規定しないが、翻字モデルは、原言語の０文字以上の文字列と目的言語の０文字以上の文字列の組の同時生成確率（ただし双方とも０文字となる場合は除く）モデルであり、非翻字モデルは原言語と目的言語で独立な、１文字以上の文字列の生成確率モデルであるとする。 Suppose that That is, each character string of the character string set is composed of a prefix non-transliteration segment indicating a partial character string of zero or more characters not in transliteration relation with the partial character string of the other language in order from the top of the character string, A transliteration segment indicating zero or more partial character strings in transliteration with a partial character string in one language, and a postfix indicating zero or more partial character strings in no transliteration relationship with a partial character string in the other language Assume a case of non-transliteration segments.
If the entire character string is not transliterated, all belong to the prefix non-transliterated segment. In this embodiment, the character strings are associated by dividing the character string constituting the character string set of the source language and the target language into the above segments, and in the non-transliteration segment, the non-transliteration independent in the source language and the target language. Correspondence that maximizes the likelihood that a partial character string is generated based on the model, and in the transliterated segment, a pair of partial character strings is generated based on the transliteration model of the source language and the target language Is the process of seeking. The configuration of the transliteration model and the non-transliteration model itself and the model optimization algorithm are not stipulated, but the transliteration model is a simultaneous combination of a string of zero or more characters in the source language and a string of zero or more characters in the target language. It is assumed that the model is a generation probability model (except when both are 0 characters), and the non-transliteration model is a generation probability model of one or more character strings independent of the source language and the target language.
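The three-segment assumption above can be illustrated with a minimal Python sketch that enumerates every way of cutting one character string into a prefix non-transliteration segment, a transliteration segment, and a postfix non-transliteration segment. The function name `three_segment_splits` is hypothetical. Keeping a single split when the transliteration segment is empty is an assumption made here, based on the convention that a fully non-transliterated string belongs entirely to the prefix segment.

```python
from typing import Iterator, Tuple


def three_segment_splits(text: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (prefix, transliteration, postfix) splits of `text`.

    Every segment may be empty. When the transliteration segment is empty,
    only the split that places the whole string in the prefix is kept
    (assumption: a fully non-transliterated string belongs to the prefix).
    """
    n = len(text)
    for a in range(n + 1):           # a = end of the prefix segment
        for b in range(a, n + 1):    # b = end of the transliteration segment
            if a == b and a != n:
                continue             # empty transliteration: keep only a == b == n
            yield text[:a], text[a:b], text[b:]


if __name__ == "__main__":
    for split in three_segment_splits("ab"):
        print(split)
```

A string of n characters therefore has n(n+1)/2 + 1 candidate splits on each side, and the association process chooses one split per language jointly with the character alignment inside the transliteration segment.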

A random variable representing whether a given character string is generated from the transliteration model or from the non-transliteration model (hereinafter, the “transliteration model selection probability” and the “non-transliteration model selection probability”) is also taken into account. Accordingly, given that a character string is composed of the prefix non-transliteration segment, the transliteration segment, and the postfix non-transliteration segment described above, the non-transliteration model selection probability of the source language represents the probability that a source-language partial character string is a non-transliteration part having no transliteration relation with a target-language partial character string, and the non-transliteration model selection probability of the target language represents the probability that a target-language partial character string is a non-transliteration part having no transliteration relation with a source-language partial character string.
The transliteration model selection probability represents the probability that a source-language partial character string is a transliteration part in a transliteration relation with a target-language partial character string and that the target-language partial character string is a transliteration part in a transliteration relation with the source-language partial character string.

To optimize the models, the following process may be repeated within either the expectation-maximization (EM) algorithm based on the forward-backward algorithm used in Non-Patent Document 4, or what may be called its variant, the forward filtering-backward sampling algorithm based on Gibbs sampling used in Patent Document 1: taking the three-segment structure described above into account, expected values are calculated using the non-transliteration model generation probability in the non-transliteration segments and the transliteration model generation probability in the transliteration segment, and the models are updated using the result.
Here, the non-transliteration model generation probability of a source-language partial character string represents the probability of generating, in the source language, a partial character string in the prefix or postfix non-transliteration segment of the source language. The non-transliteration model generation probability of a target-language partial character string represents the probability of generating, in the target language, a partial character string in the prefix or postfix non-transliteration segment of the target language.
The transliteration model generation probability represents the joint generation probability for each pair of partial character strings, one taken from the transliteration segment of the source-language character string and the other from the transliteration segment of the target-language character string.

[First Embodiment]
＜System configuration＞
The character string association device 100 according to the first embodiment of the present invention receives as input a plurality of character string sets, each of which is a translation pair of a character string (word) in a source language (first language) and a character string (word) in a target language (second language), and associates the character strings in each character string set. The character string association device 100 is implemented by a computer including a CPU, a RAM, and a ROM that stores a program for executing the character string association processing routine described later, and is functionally configured as follows. As shown in FIG. 1, the character string association device 100 includes an input unit 10, a calculation unit 20, and an output unit 30.

The input unit 10 receives a plurality of character string sets whose character strings are to be associated. Specifically, the input unit 10 reads, through an input device, a storage medium, or a network, a plurality of character string sets that are expected to be transliterations or translations of each other and that also include blank characters and the like.

The calculation unit 20 includes a character string set database 21, a character string set data reading unit 22, and an association calculation unit 23.

The character string set database 21 stores the plurality of character string sets received by the input unit 10.

The character string set data reading unit 22 reads all character string sets from the character string set database 21.

For each character string set, the association calculation unit 23 associates partial character strings of the source-language character string with partial character strings of the target-language character string.

The association calculation unit 23 includes an initial value setting unit 231, an expected value calculation unit 232, a parameter update unit 233, a stop determination unit 234, and a character association processing unit 235.

The initial value setting unit 231 sets initial values for the following: the non-transliteration model selection probability for source-language partial character strings; the non-transliteration model selection probability for target-language partial character strings; the transliteration model selection probability for pairs of a source-language partial character string and a target-language partial character string; the non-transliteration model generation probability for each partial character string of the source-language character strings; the non-transliteration model generation probability for each partial character string of the target-language character strings; and the transliteration model generation probability for each pair of a partial character string of a source-language character string and a partial character string of a target-language character string. Initial values are commonly set using ratios of occurrence frequencies or co-occurrence frequencies, but a uniform distribution may also be used.
When only single-character units are considered as units of association, as in Non-Patent Document 4, the non-transliteration model generation probability is a simple character unigram probability. Unless smoothing is considered, it is then obtained from the occurrence probability distribution based on simple character occurrence frequencies, and it may be kept fixed rather than updated in subsequent processing.
Non-Patent Document 4 uses 0.5 as the initial value of each of the transliteration model selection probability and the non-transliteration model selection probability, but any initial values that are greater than 0 and sum to 1 may be set.
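For the single-character case mentioned above, the non-transliteration model generation probability reduces to an unsmoothed character unigram distribution. The following Python sketch shows one way such a distribution might be computed from one side of the character string sets; the function name `unigram_probabilities` and the use of a plain dictionary are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, Iterable


def unigram_probabilities(strings: Iterable[str]) -> Dict[str, float]:
    """Unsmoothed character unigram probabilities from relative frequencies."""
    counts = Counter(ch for s in strings for ch in s)
    total = sum(counts.values())
    # With no characters at all, the comprehension is empty and nothing is divided.
    return {ch: c / total for ch, c in counts.items()}


if __name__ == "__main__":
    # Hypothetical source-side strings; in practice these would come from
    # the character string set database.
    print(unigram_probabilities(["computer", "beetle"]))
```

Because this distribution depends only on character counts, it can be computed once and held fixed during the later iterations, as the text permits.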

Based on the non-transliteration model selection probabilities, the transliteration model selection probability, the non-transliteration model generation probabilities, and the transliteration model generation probabilities that were set by the initial value setting unit 231 or updated in the previous iteration, the expected value calculation unit 232 calculates the following for each character string set: for each pair of a partial character string of the source-language character string and a partial character string of the target-language character string, the expected value that the pair is in a transliteration relation; for each partial character string of the source-language character string, the expected value that the partial character string is a non-transliteration part; and for each partial character string of the target-language character string, the expected value that the partial character string is a non-transliteration part. The expected values are calculated by a method similar to that described in Non-Patent Document 4. In this embodiment, however, each character string is composed of segments assigned in the order of prefix non-transliteration segment, transliteration segment, and postfix non-transliteration segment, so an extension of the method of Non-Patent Document 4, which assumes that only a transliteration segment exists, is used.

In the forward-backward algorithm used in the method described in Non-Patent Document 4, the calculated probabilities are recorded successively, using dynamic programming, in a forward probability table, which records association probabilities from the beginning of the character string set, and in a backward probability table, which records association probabilities from the end of the character string set. This embodiment uses a total of six probability tables: a forward probability table and a backward probability table for each of the prefix non-transliteration segment, the transliteration segment, and the postfix non-transliteration segment. The forward and backward probabilities are calculated and recorded with the order of prefix non-transliteration segment, transliteration segment, and postfix non-transliteration segment taken into account. A probability recorded in the forward probability table of the prefix non-transliteration segment is calculated by referring only to the forward probabilities of the prefix non-transliteration segment at earlier positions. A probability recorded in the forward probability table of the transliteration segment is calculated by referring to the forward probabilities of the prefix non-transliteration segment and of the transliteration segment at earlier positions. A probability recorded in the forward probability table of the postfix non-transliteration segment is calculated by referring to the forward probabilities of the transliteration segment and of the postfix non-transliteration segment at earlier positions.

For the backward probability tables the order is reversed. A probability recorded in the backward probability table of the postfix non-transliteration segment is calculated by referring only to the backward probabilities of the postfix non-transliteration segment at later positions. A probability recorded in the backward probability table of the transliteration segment is calculated by referring to the backward probabilities of the postfix non-transliteration segment and of the transliteration segment at later positions. A probability recorded in the backward probability table of the prefix non-transliteration segment is calculated by referring to the backward probabilities of the transliteration segment and of the prefix non-transliteration segment at later positions. As initial values of the probability tables, the first entry of the forward probability table of the prefix non-transliteration segment and the last entry of the backward probability table of the postfix non-transliteration segment are set to 1, and all other entries are set to 0.
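The dependency structure of the three forward probability tables can be sketched in Python as below. This is only an illustration under several assumptions that are not taken from the patent: units are limited to single characters, the empty string `""` stands for the empty side of a transliteration pair, the models are plain dictionaries, and the postfix table is kept internally as two sub-tables (`f3_src`, `f3_trg`) to enforce source-first generation inside non-transliteration segments. The function name and all parameter names are hypothetical.

```python
from typing import Dict, List, Tuple

Table = List[List[float]]


def forward_tables(src: str, trg: str,
                   p_src: float, p_trg: float, p_t: float,
                   P_src: Dict[str, float], P_trg: Dict[str, float],
                   P_t: Dict[Tuple[str, str], float]) -> Tuple[Table, Table, Table]:
    """Return forward tables (f1, f2, f3) for the prefix, transliteration and
    postfix segments. Entry [j][i] covers src[:j] and trg[:i]."""
    J, I = len(src), len(trg)

    def zeros() -> Table:
        return [[0.0] * (I + 1) for _ in range(J + 1)]

    f1, f2 = zeros(), zeros()
    f3_src, f3_trg = zeros(), zeros()   # postfix: last unit was source / target
    f1[0][0] = 1.0                      # only the head of the prefix table starts at 1
    for j in range(J + 1):
        for i in range(I + 1):
            # Weights of the non-transliteration units ending at (j, i).
            e_s = p_src * P_src.get(src[j - 1], 0.0) if j > 0 else 0.0
            e_t = p_trg * P_trg.get(trg[i - 1], 0.0) if i > 0 else 0.0
            # Prefix table: refers only to earlier prefix entries (source first).
            if i == 0 and j > 0:
                f1[j][0] = f1[j - 1][0] * e_s
            elif i > 0:
                f1[j][i] = f1[j][i - 1] * e_t
            # Transliteration table: refers to earlier prefix and transliteration entries.
            acc = 0.0
            if j > 0 and i > 0:
                acc += (f1[j - 1][i - 1] + f2[j - 1][i - 1]) * P_t.get((src[j - 1], trg[i - 1]), 0.0)
            if j > 0:
                acc += (f1[j - 1][i] + f2[j - 1][i]) * P_t.get((src[j - 1], ""), 0.0)
                # Postfix (source unit): refers to earlier transliteration and postfix entries.
                f3_src[j][i] = (f2[j - 1][i] + f3_src[j - 1][i]) * e_s
            if i > 0:
                acc += (f1[j][i - 1] + f2[j][i - 1]) * P_t.get(("", trg[i - 1]), 0.0)
                # Postfix (target unit): may follow transliteration or any postfix unit.
                f3_trg[j][i] = (f2[j][i - 1] + f3_src[j][i - 1] + f3_trg[j][i - 1]) * e_t
            f2[j][i] = p_t * acc
    f3 = [[f3_src[j][i] + f3_trg[j][i] for i in range(I + 1)] for j in range(J + 1)]
    return f1, f2, f3
```

Under these assumptions, the likelihood of one character string set is the sum of the three final entries, `f1[J][I] + f2[J][I] + f3[J][I]`. The backward tables would mirror the same recurrences from the end of the strings.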

As described above, based on the non-transliteration model selection probabilities, the transliteration model selection probability, the non-transliteration model generation probabilities, and the transliteration model generation probabilities that were set by the initial value setting unit 231 or updated in the previous iteration, the expected value calculation unit 232 calculates and records each forward probability in the forward probability tables corresponding to the prefix non-transliteration segment, the transliteration segment, and the postfix non-transliteration segment, and calculates and records each backward probability in the backward probability tables corresponding to the same three segments. The forward and backward probability tables are calculated for each character string set.
Using these forward and backward probability tables, the expected value calculation unit 232 calculates, for an arbitrary character association in a character string set, the expected value that the association is a transliteration relation. The expected value E_t(j→j’, i→i’) that the character string from character position j to j’−1 on the source-language side and the character string from character position i to i’−1 on the target-language side form a transliteration is calculated by Equation (1) (character positions start at 0).

Here, P_t(s_{j→j’}, t_{i→i’}) is the joint generation probability (the transliteration model probability) of the character string from character position j to j’−1 on the source-language side and the character string from character position i to i’−1 on the target-language side, and p_t is the transliteration model selection probability. P_f1(j, i), P_f2(j, i), and P_f3(j, i) are the forward probabilities for the prefix non-transliteration segment, the transliteration segment, and the postfix non-transliteration segment, respectively, up to character position j on the source-language side and character position i on the target-language side. P_b1(j, i), P_b2(j, i), and P_b3(j, i) are the backward probabilities for the prefix non-transliteration segment, the transliteration segment, and the postfix non-transliteration segment, respectively, after character position j on the source-language side and character position i on the target-language side. J is the length of the source-language character string, and I is the length of the target-language character string.
For every character string set d, the expected value calculation unit 232 calculates the expected value of Equation (1) for each pair of partial character strings of the character string set d.

The expected value calculation unit 232 also calculates the expected values of the non-transliteration parts in the same way, using the forward and backward probability tables. Specifically, the expected value E_src(s_{j→j’}, i) that the partial character string from character position j to j’−1 on the source-language side is generated independently of the target-language side (rather than as a transliteration) after character position i on the target-language side is calculated, as in Equation (2), as the sum of the expected value for the case where it appears in the prefix non-transliteration segment and the expected value for the case where it appears in the postfix non-transliteration segment. Similarly for the target-language side, the expected value calculation unit 232 uses the forward and backward probability tables to calculate the expected value E_trg(t_{i→i’}, j) that the partial character string from character position i to i’−1 on the target-language side is generated independently of the source-language side (rather than as a transliteration) after character position j on the source-language side.

Here, P_src(s_{j→j’}) is the generation probability of the partial character string from character position j to j’−1 on the source-language side (the probability under the source-language non-transliteration model), and p_src is the non-transliteration model selection probability of the source language.

For every character string set d, the expected value calculation unit 232 calculates the expected value of Equation (2) for each source-language partial character string of the character string set d. Likewise, for every character string set d, it calculates the analogous expected value for each target-language partial character string of the character string set d.
Based on the calculation results for all character string sets d, the expected value calculation unit 232 totals, for each pair (s, t) of a source-language and a target-language partial character string, the expected values calculated for that pair, to obtain the expected value E_t(s, t) of joint generation of the pair (s, t) from the transliteration model. For each source-language partial character string s, it totals the expected values calculated for s to obtain the expected value E_src(s) of generation of s from the non-transliteration model. For each target-language partial character string t, it totals the expected values calculated for t to obtain the expected value E_trg(t) of generation of t from the non-transliteration model.
In calculating the probability tables and expected values above, the source-language side is always generated first within the prefix and postfix non-transliteration segments, so that the generation of the same character strings is not counted more than once.

The parameter update unit 233 updates the non-transliteration model selection probabilities, the transliteration model selection probability, the non-transliteration model generation probabilities, and the transliteration model generation probabilities, based on the expected values calculated by the expected value calculation unit 232 for each character string set: the expected value that each pair is in a transliteration relation, the expected value that each source-language partial character string is a non-transliteration part, and the expected value that each target-language partial character string is a non-transliteration part. Specifically, based on the expected values calculated by the expected value calculation unit 232, the parameter update unit 233 updates the joint generation probabilities of the transliteration model, the generation probabilities of the non-transliteration models, and the transliteration and non-transliteration model selection probabilities. For the joint generation probabilities of the transliteration model and the generation probabilities of the non-transliteration models, the updated probability of each pair of character strings, or each character string, whose expected value is greater than 0 is the ratio of its expected value to the sum of the expected values.
That is, the joint generation probability P_t(￣s, ￣t) of a source-language partial character string s and a target-language partial character string t is updated to the ratio of E_t(s, t) to the sum of the expected values E_t over all pairs.

As for the generation probabilities of the non-transliteration models, the generation probability P_src(￣s) of a source-language partial character string s is updated to the ratio of its expected value E_src(s) to the sum of the expected values E_src over all source-language partial character strings. The generation probability P_trg(￣t) of a target-language partial character string t is updated in the same way.
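The updates of the generation probabilities described above amount to normalizing expected counts. The following LaTeX is a reconstruction from the prose only, not the patent's own typeset equations; the primed summation variables are notation introduced here for illustration.

```latex
P_t(\bar{s}, \bar{t}) \leftarrow \frac{E_t(s, t)}{\sum_{s', t'} E_t(s', t')},
\qquad
P_{src}(\bar{s}) \leftarrow \frac{E_{src}(s)}{\sum_{s'} E_{src}(s')},
\qquad
P_{trg}(\bar{t}) \leftarrow \frac{E_{trg}(t)}{\sum_{t'} E_{trg}(t')}
```

In each case the sum in the denominator runs over the pairs or partial character strings whose expected value is greater than 0, so each updated distribution sums to 1.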

The transliteration and non-transliteration model selection probabilities are considered to be based on the proportions of characters that belong to the transliteration segment and to the non-transliteration segments in the source language and the target language, respectively. The non-transliteration model selection probability of the source language is therefore the proportion of source-language characters that appear in non-transliteration segments, and the non-transliteration model selection probability of the target language is the proportion of target-language characters that appear in non-transliteration segments. That is, the non-transliteration model selection probability p_src of the source language is updated to this proportion, calculated from the expected values of the partial character strings weighted by their lengths, where |s| denotes the length (number of characters) of a partial character string s. The non-transliteration model selection probability p_trg of the target language is updated in the same way.

The transliteration model selection probability p_t is then updated as (1 − the non-transliteration model selection probability of the source language) × (1 − the non-transliteration model selection probability of the target language), that is, p_t = (1 − p_src)(1 − p_trg).
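The parameter update as a whole can be sketched in Python as follows. The helper names `normalize` and `selection_probabilities` are hypothetical. The exact form of the length-weighted proportion used for p_src and p_trg is an assumption reconstructed from the prose (expected non-transliterated characters divided by all expected characters on that side). Only the product form of p_t is stated explicitly in the text.

```python
from typing import Dict, Hashable, Tuple


def normalize(expected: Dict[Hashable, float]) -> Dict[Hashable, float]:
    """Turn expected counts into probabilities, keeping only entries > 0."""
    kept = {k: v for k, v in expected.items() if v > 0.0}
    total = sum(kept.values())
    return {k: v / total for k, v in kept.items()}


def selection_probabilities(E_t: Dict[Tuple[str, str], float],
                            E_src: Dict[str, float],
                            E_trg: Dict[str, float]) -> Tuple[float, float, float]:
    """Return (p_src, p_trg, p_t) from length-weighted expected counts."""
    src_non = sum(len(s) * e for s, e in E_src.items())       # non-transliterated source chars
    src_tr = sum(len(s) * e for (s, _), e in E_t.items())     # transliterated source chars
    trg_non = sum(len(t) * e for t, e in E_trg.items())       # non-transliterated target chars
    trg_tr = sum(len(t) * e for (_, t), e in E_t.items())     # transliterated target chars
    # Assumed proportion form; guard against an empty side.
    p_src = src_non / (src_non + src_tr) if src_non + src_tr > 0 else 0.0
    p_trg = trg_non / (trg_non + trg_tr) if trg_non + trg_tr > 0 else 0.0
    p_t = (1.0 - p_src) * (1.0 - p_trg)                       # stated in the text
    return p_src, p_trg, p_t
```

For example, with `E_t = {("a", "x"): 1.0}`, `E_src = {"b": 1.0}` and `E_trg = {"y": 3.0}`, half of the expected source characters and three quarters of the expected target characters are non-transliterated, so the sketch gives p_src = 0.5, p_trg = 0.75 and p_t = 0.125.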

The stop determination unit 234 determines whether a predetermined stop condition is satisfied. The calculation by the expected value calculation unit 232 and the update by the parameter update unit 233 are repeated until the stop condition is satisfied. Possible stop conditions include the likelihood of the character string set data reaching or exceeding a certain value, the change in the likelihood falling to or below a certain value, and the number of parameter update iterations exceeding a certain number. In this embodiment, the stop condition is that the likelihood of the character string set data reaches or exceeds a certain value.
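The three candidate stop conditions can be expressed as a small predicate, sketched below in Python. The function name, the keyword names, and the use of log-likelihood rather than raw likelihood are assumptions made for illustration; a threshold left as `None` is simply not checked.

```python
from typing import Optional


def should_stop(log_likelihood: float,
                previous_log_likelihood: Optional[float],
                iteration: int,
                *,
                target: Optional[float] = None,
                min_delta: Optional[float] = None,
                max_iterations: Optional[int] = None) -> bool:
    """Return True if any enabled stop condition holds."""
    # (a) likelihood has reached or exceeded a fixed value
    if target is not None and log_likelihood >= target:
        return True
    # (b) change in likelihood has fallen to or below a fixed value
    if (min_delta is not None and previous_log_likelihood is not None
            and abs(log_likelihood - previous_log_likelihood) <= min_delta):
        return True
    # (c) number of parameter updates has exceeded a fixed count
    if max_iterations is not None and iteration > max_iterations:
        return True
    return False
```

The embodiment described in the text corresponds to enabling only `target`. Combining it with `max_iterations` is a common safeguard against a threshold that is never reached, though the text does not require it.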

For each character string set, based on the non-transliteration model selection probabilities, the transliteration model selection probability, the non-transliteration model generation probabilities, and the transliteration model generation probabilities as finally updated by the parameter update unit 233, the character association processing unit 235 divides each character string of the character string set into a prefix non-transliteration segment, a transliteration segment, and a postfix non-transliteration segment, and associates characters between the partial character string of the transliteration segment of the source-language character string and the partial character string of the transliteration segment of the target-language character string. Specifically, the character association processing unit 235 uses the Viterbi algorithm to find the maximum-likelihood character association under the final non-transliteration model selection probabilities, transliteration model selection probability, non-transliteration model generation probabilities, and transliteration model generation probabilities.
The Viterbi algorithm is an algorithm, widely known in the field, that finds the maximum-likelihood sequence by dynamic programming. Specifically, unlike the probability calculation for the forward probability tables in the expected value calculation unit 232, each entry of the tables for the prefix non-transliteration segment, the transliteration segment, and the postfix non-transliteration segment holds the probability of the most probable path and information about that path. The calculation proceeds to the end of the character strings, and the maximum-likelihood path is obtained by tracing the resulting tables from the end back to the beginning. A part of the path that lies in the table of the transliteration segment is in a transliteration relation, and a part that lies in the table of a non-transliteration segment is not.
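The table tracing described above can be sketched in Python as follows. As with any such illustration, several choices are assumptions rather than the patent's own design: single-character units, `""` for the empty side of a transliteration pair, dictionary-based models, and four internal states (the postfix segment is split by which side was generated last, so that the source side comes first). All names are hypothetical.

```python
from typing import Dict, List, Tuple

PRE, TR, POST_S, POST_T = range(4)
NAMES = ("prefix", "transliteration", "postfix", "postfix")


def viterbi_align(src: str, trg: str,
                  p_src: float, p_trg: float, p_t: float,
                  P_src: Dict[str, float], P_trg: Dict[str, float],
                  P_t: Dict[Tuple[str, str], float]
                  ) -> Tuple[float, List[Tuple[str, str, str]]]:
    """Return (best probability, path of (source unit, target unit, segment))."""
    J, I = len(src), len(trg)
    best = [[[0.0] * 4 for _ in range(I + 1)] for _ in range(J + 1)]
    back = [[[None] * 4 for _ in range(I + 1)] for _ in range(J + 1)]
    best[0][0][PRE] = 1.0

    def relax(j, i, state, pj, pi, prev_states, weight):
        # Keep only the most probable way of reaching (j, i, state).
        for ps in prev_states:
            score = best[pj][pi][ps] * weight
            if score > best[j][i][state]:
                best[j][i][state] = score
                back[j][i][state] = (pj, pi, ps)

    for j in range(J + 1):
        for i in range(I + 1):
            e_s = p_src * P_src.get(src[j - 1], 0.0) if j > 0 else 0.0
            e_t = p_trg * P_trg.get(trg[i - 1], 0.0) if i > 0 else 0.0
            if j > 0 and i == 0:
                relax(j, i, PRE, j - 1, i, (PRE,), e_s)
            if i > 0:
                relax(j, i, PRE, j, i - 1, (PRE,), e_t)
            if j > 0 and i > 0:
                relax(j, i, TR, j - 1, i - 1, (PRE, TR),
                      p_t * P_t.get((src[j - 1], trg[i - 1]), 0.0))
            if j > 0:
                relax(j, i, TR, j - 1, i, (PRE, TR), p_t * P_t.get((src[j - 1], ""), 0.0))
                relax(j, i, POST_S, j - 1, i, (TR, POST_S), e_s)
            if i > 0:
                relax(j, i, TR, j, i - 1, (PRE, TR), p_t * P_t.get(("", trg[i - 1]), 0.0))
                relax(j, i, POST_T, j, i - 1, (TR, POST_S, POST_T), e_t)

    # Trace back from the best final state to the beginning.
    state = max(range(4), key=lambda s: best[J][I][s])
    score = best[J][I][state]
    path: List[Tuple[str, str, str]] = []
    j, i = J, I
    while back[j][i][state] is not None:
        pj, pi, ps = back[j][i][state]
        path.append((src[pj:j], trg[pi:i], NAMES[state]))
        j, i, state = pj, pi, ps
    path.reverse()
    return score, path
```

Units labeled `"transliteration"` in the returned path form the transliteration segment and carry the character association; units labeled `"prefix"` or `"postfix"` are treated as not being in a transliteration relation.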

出力部３０は、文字対応付け処理部２３５で対応付けられた文字列組の各々の対応付けの情報を出力する。具体的には，出力部３０は、対応付けの情報を付与した文字列組を端末に表示、もしくは記憶媒体やネットワークを通じて書き出す。 The output unit 30 outputs information on the association of each character string set associated with the character association processing unit 235. Specifically, the output unit 30 displays the character string set to which the association information is added on the terminal, or writes it through a storage medium or a network.

＜文字列対応付け装置の作用＞
次に、第１の実施の形態に係る文字列対応付け装置１００の作用について説明する。まず、対訳となっている第１の言語体系の文字列及び第２の言語体系の文字列の組である文字列組が、文字列対応付け装置１００に複数入力されると、文字列対応付け装置１００によって、入力された複数の文字列組が、文字列組データベース２１に格納される。そして、文字列対応付け装置１００によって、図２に示す文字列対応付け処理ルーチンが実行される。 <Operation of character string matching device>
Next, the operation of the character string association device 100 according to the first embodiment will be described. First, when a plurality of character string sets, each a pair consisting of a character string in the first language system and a character string in the second language system that are translations of each other, are input to the character string association device 100, the device stores the input character string sets in the character string set database 21. The character string association device 100 then executes the character string association processing routine shown in FIG. 2.

まず、ステップＳ１００において、文字列組データ読み込み部２２によって、文字列組データベース２１から、全ての文字列組を読み込む。 First, in step S100, the character string set data reading unit 22 reads all character string sets from the character string set database 21.

ステップＳ１０２において、初期値設定部２３１によって、原言語及び目的言語の各々の非翻字モデル選択確率と、翻字モデル選択確率と、原言語及び目的言語の各々の各部分文字列に対する非翻字モデル生成確率と、原言語及び目的言語の部分文字列の各ペアに対する翻字モデル生成確率と、に対して初期値を設定する。 In step S102, the initial value setting unit 231 sets initial values for the non-transliteration model selection probability of each of the source language and the target language, the transliteration model selection probability, the non-transliteration model generation probability for each partial character string of each of the source language and the target language, and the transliteration model generation probability for each pair of source language and target language partial character strings.

次に、ステップＳ１０４において、期待値計算部２３２によって、全ての文字列組の各々について、周知の前向き後ろ向きアルゴリズムを用いて、上記ステップＳ１０２で設定され、又は後述するステップＳ１１０で前回更新された、非翻字モデル選択確率、翻字モデル選択確率、非翻字モデル生成確率、及び翻字モデル生成確率に基づいて、前置非翻字セグメント、翻字セグメント、及び後置非翻字セグメントのそれぞれに対応する前向き確率表の各前向き確率を計算して前向き確率表に記録すると共に、前置非翻字セグメント、翻字セグメント、及び後置非翻字セグメントのそれぞれに対応する後ろ向き確率表の各後ろ向き確率を計算して後ろ向き確率表に記録する。 Next, in step S104, the expected value calculation unit 232 applies the well-known forward-backward algorithm to each of the character string sets, using the non-transliteration model selection probabilities, the transliteration model selection probability, the non-transliteration model generation probabilities, and the transliteration model generation probabilities that were set in step S102 or last updated in step S110 described later. It calculates each forward probability and records it in the forward probability tables corresponding to the prefix non-transliteration segment, the transliteration segment, and the postfix non-transliteration segment, and likewise calculates each backward probability and records it in the backward probability tables corresponding to those three segments.

ステップＳ１０６において、期待値計算部２３２によって、全ての文字列組の各々について、上記ステップＳ１０４で計算された当該文字列組の前向き確率表と後向き確率表とを用いて、当該文字列組の部分文字列の各ペアに対し、翻字関係にある期待値を計算する。 In step S106, the expected value calculation unit 232 uses the forward probability table and the backward probability table of the character string set calculated in step S104 for each of the character string sets. For each pair of strings, calculate the expected value in transliteration.

ステップＳ１０８において、期待値計算部２３２によって、全ての文字列組の各々について、上記ステップＳ１０４で計算された当該文字列組の前向き確率表と後向き確率表を用いて、当該文字列組の原言語側の各部分文字列に対し、非翻字部分となる期待値を計算すると共に、当該文字列組の目的言語側の各部分文字列に対し、非翻字部分となる期待値を計算する。 In step S108, the expected value calculation unit 232 uses the forward probability table and the backward probability table of the character string set calculated in step S104 for each of the character string sets. For each partial character string on the side, an expected value to be a non-transliterated part is calculated, and an expected value to be a non-transliterated part is calculated for each partial character string on the target language side of the character string set.
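Steps S104 to S108 can be illustrated with a small forward-backward sketch. As a simplifying assumption, only the transliteration segment is modeled here, so the code returns expected counts for substring pairs (the quantity of step S106) and omits the non-transliteration expectations of step S108. The function name, the `prob` callback, and the span limits are hypothetical and introduced for this example only.

```python
from collections import defaultdict


def expected_pair_counts(src, trg, prob, max_src=2, max_trg=3):
    """Posterior expected count of each aligned substring pair.

    prob(s, t) is a hypothetical callback returning the current generation
    probability of aligning source substring s with target substring t.
    """
    J, I = len(src), len(trg)
    spans = [(dj, di) for dj in range(1, max_src + 1)
             for di in range(1, max_trg + 1)]

    # Forward table: total probability of all paths from (0, 0) to (j, i).
    fwd = [[0.0] * (I + 1) for _ in range(J + 1)]
    fwd[0][0] = 1.0
    for j in range(J + 1):
        for i in range(I + 1):
            for dj, di in spans:
                if j >= dj and i >= di:
                    fwd[j][i] += (fwd[j - dj][i - di]
                                  * prob(src[j - dj:j], trg[i - di:i]))

    # Backward table: total probability of all paths from (j, i) to (J, I).
    bwd = [[0.0] * (I + 1) for _ in range(J + 1)]
    bwd[J][I] = 1.0
    for j in range(J, -1, -1):
        for i in range(I, -1, -1):
            for dj, di in spans:
                if j + dj <= J and i + di <= I:
                    bwd[j][i] += (prob(src[j:j + dj], trg[i:i + di])
                                  * bwd[j + dj][i + di])

    counts = defaultdict(float)
    total = fwd[J][I]
    if total == 0.0:
        return counts  # no alignment is possible under the span limits
    for j in range(J + 1):
        for i in range(I + 1):
            for dj, di in spans:
                if j + dj <= J and i + di <= I:
                    s, t = src[j:j + dj], trg[i:i + di]
                    # Probability mass of all paths that use this span.
                    counts[(s, t)] += (fwd[j][i] * prob(s, t)
                                       * bwd[j + dj][i + di] / total)
    return counts
```

With a uniform `prob`, the strings "ab" and "xyz" admit three alignments under these limits, so each span that appears in exactly one of them receives an expected count of one third.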

ステップＳ１１０において、パラメータ更新部２３３は、上記ステップＳ１０６で計算された翻訳関係にある期待値、上記ステップＳ１０８で計算された原言語の各部分文字列についての非翻字部分である期待値、及び目的言語の各部分文字列についての非翻字部分である期待値に基づいて、原言語及び目的言語の各々の非翻字モデル選択確率、翻字モデル選択確率、原言語の各部分文字列に対する非翻字モデル生成確率、目的言語の各部分文字列に対する非翻字モデル生成確率、及び原言語及び目的言語の部分文字列の各ペアに対する翻字モデル生成確率を更新する。 In step S110, the parameter update unit 233 updates the non-transliteration model selection probability of each of the source language and the target language, the transliteration model selection probability, the non-transliteration model generation probability for each partial character string of the source language, the non-transliteration model generation probability for each partial character string of the target language, and the transliteration model generation probability for each pair of source language and target language partial character strings. The update is based on the expected values of being in a transliteration relationship calculated in step S106, and on the expected values of being a non-transliterated part calculated in step S108 for each partial character string of the source language and of the target language.

ステップＳ１１４において、停止判定部２３４によって、予め定められた停止条件が満たされたか否かを判定する。そして、停止条件が満たされていない場合には、上記ステップＳ１０４へ戻り、上記ステップＳ１０４〜ステップＳ１１０の処理を実行する。一方、停止条件が満たされている場合には、ステップＳ１１６へ進む。 In step S114, the stop determination unit 234 determines whether or not a predetermined stop condition is satisfied. If the stop condition is not satisfied, the process returns to step S104, and the processes of steps S104 to S110 are executed. On the other hand, if the stop condition is satisfied, the process proceeds to step S116.

ステップＳ１１６において、文字対応付け処理部２３５によって、文字列組の各々に対して、上記ステップＳ１１０で最終的に更新された非翻字モデル選択確率、翻字モデル選択確率、非翻字モデル生成確率、及び翻字モデル生成確率の各々に基づいて、当該文字列組の各文字列を、前置非翻字セグメント、翻字セグメント、及び後置非翻字セグメントで構成し、かつ、原言語の文字列のうちの翻字セグメントの部分文字列と、目的言語の文字列のうちの翻字セグメントの部分文字列との間で文字の対応付けを行う。 In step S116, the character association processing unit 235 performs the non-transliteration model selection probability, the transliteration model selection probability, and the non-transliteration model generation probability that are finally updated in step S110 for each character string set. , And each of the transliteration model generation probabilities, each character string of the character string set is composed of a prefix non-transliteration segment, a transliteration segment, and a postfix non-transliteration segment, and Characters are associated between a transliteration partial character string in the character string and a transliteration segment partial character string in the target language character string.

ステップＳ１１８において、出力部３０によって、上記ステップＳ１１６で対応付けた結果を出力して、文字列対応付け処理ルーチンを終了する。 In step S118, the output unit 30 outputs the result of association in step S116, and the character string association processing routine ends.

以上説明したように、本発明の第１の実施の形態に係る文字列対応付け装置によれば、原言語及び目的言語において対訳となる文字列の組み合わせである文字列組について、文字列組の各文字列を、文字列の先頭から順番に、前置非翻字セグメントと、翻字セグメントと、後置非翻字セグメントとで構成したときに、翻字モデル選択確率と、非翻字モデル選択確率と、非翻字モデル生成確率と、翻字モデル生成確率と、に基づいて尤もらしくなるように、文字列組の各文字列を前置非翻字セグメント、翻字セグメント、及び後置非翻字セグメントで構成し、かつ、原言語の文字列のうちの翻字セグメントの部分文字列と、原言語の文字列のうちの翻字セグメントの部分文字列との間で文字の対応付けを行うことにより、異なる言語の文字列組における文字の対応付けを精度よく行うことができる。 As described above, the character string association device according to the first embodiment of the present invention handles character string sets, each a combination of character strings that are translations of each other in the source language and the target language. Treating each character string of a set as composed, in order from its beginning, of a prefix non-transliteration segment, a transliteration segment, and a postfix non-transliteration segment, the device segments each character string in the way that is most likely under the transliteration model selection probability, the non-transliteration model selection probabilities, the non-transliteration model generation probabilities, and the transliteration model generation probabilities, and associates characters between the transliteration segment substring of the source language string and the transliteration segment substring of the target language string. Characters in character string sets of different languages can thereby be associated accurately.

また、文字列の生成モデルとして「翻字モデル」「非翻字モデル」の２つを同時に学習することで、文字列の組の翻字となっている部分を識別し、非翻字部分が混在したデータにおける適切な文字列間の対応付けを実現することができる。 In addition, by simultaneously learning the “transliteration model” and the “non-transliteration model” as the character string generation model, the transliteration part of the character string pair is identified, and the non-transliteration part is Appropriate correspondence between character strings in mixed data can be realized.

また、統計的機械翻訳で用いられる自動的に抽出された対訳語句対のような、必ずしも翻字の組となっておらず、部分的に翻字となっているような文字列組データに対しても、翻字となっている文字列を選択的に対応づけることが可能となる。 Also, for character string data that is not necessarily a transliteration pair, such as an automatically extracted bilingual word pair used in statistical machine translation, but partially transliterated. However, it is possible to selectively associate the character strings that are transliterated.

〔第２の実施の形態〕
＜システム構成＞
次に、第２の実施の形態について説明する。なお、第１の実施の形態と同様の構成となる部分については、同一符号を付して説明を省略する。 [Second Embodiment]
<System configuration>
Next, a second embodiment will be described. Parts having the same configuration as in the first embodiment are denoted by the same reference numerals, and their description is omitted.

第２の実施の形態では、ギブスサンプリングを用いて、文字列組における文字の対応付けを求めている点が、第１の実施の形態と異なっている。 The second embodiment is different from the first embodiment in that character matching in a character string set is obtained using Gibbs sampling.

本実施の形態で用いるギブスサンプリングは、上記非特許文献３や上記非特許文献６で用いられており、当該分野では広く知られる方法である。
上記非特許文献３及び非特許文献６に記載の方法のように、参考文献７（Daichi Mochihashi他２名、「Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling」、Proc. of ACL-IJCNLP、2009）に記載の方法に類似したアルゴリズムによって対応付けを繰り返し更新する。 Gibbs sampling used in the present embodiment is used in Non-Patent Document 3 and Non-Patent Document 6 and is a widely known method in the field.
As in the methods described in Non-Patent Document 3 and Non-Patent Document 6, the association is updated repeatedly by an algorithm similar to the method described in Reference 7 (Daichi Mochihashi and two others, "Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling", Proc. of ACL-IJCNLP, 2009).

この方法では、文字列組の集合Ｄから１つの文字列組ｄを選び、ｄを除く文字列組の集合Ｄ−ｄにおける対応付けから推定される事後確率分布に基づいてｄ上の対応付け結果をサンプリングする、という過程を、ｄを入れ替えながら繰り返し行う。サンプリングの結果はｄに対する一意の対応付け結果であるため、ＥＭアルゴリズムを利用する第１の実施の形態の構成とは異なり、停止判定部によって学習が終了したと判定された時点ですべてのｄに対する対応付け結果が得られるため、文字対応付け処理部２３５は必要ない。ただし、停止判定後に１度フィルタリングステップ・サンプリングステップを繰り返すことによって対応付けを再度更新してもよい。 In this method, one character string set d is selected from the set D of character string sets, and the association result on d is based on the posterior probability distribution estimated from the association in the set D-d of character string sets excluding d. Is repeated while replacing d. Since the sampling result is a unique association result with respect to d, unlike the configuration of the first embodiment that uses the EM algorithm, the sampling is performed for all d when the stop determination unit determines that learning has ended. Since the association result is obtained, the character association processing unit 235 is not necessary. However, the association may be updated again by repeating the filtering step and the sampling step once after the stop determination.
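The bookkeeping behind "estimate from D−d, then resample d" can be sketched as a count table from which the current alignment of d is removed before sampling and to which the new alignment is added afterwards. The class below is an illustration only. Its probability formula assumes a Dirichlet-process-style smoothed count, which is a common choice in this family of samplers but is an assumption of this sketch, and the names `PairCounts`, `alpha`, and `base` are hypothetical.

```python
from collections import Counter


class PairCounts:
    """Counts of aligned substring pairs over the current alignments."""

    def __init__(self, alpha, base):
        self.alpha = alpha      # concentration hyperparameter (assumed form)
        self.base = base        # base(s, t): base-measure probability
        self.counts = Counter()
        self.total = 0

    def add(self, alignment):
        """Add every (s, t) pair of one string set's alignment."""
        for pair in alignment:
            self.counts[pair] += 1
            self.total += 1

    def remove(self, alignment):
        """Remove the pairs of one string set, giving the statistics of D - d."""
        for pair in alignment:
            self.counts[pair] -= 1
            if self.counts[pair] == 0:
                del self.counts[pair]
            self.total -= 1

    def prob(self, s, t):
        """Smoothed generation probability from the remaining counts."""
        return ((self.counts[(s, t)] + self.alpha * self.base(s, t))
                / (self.total + self.alpha))
```

In a sampler loop, one would call `remove` with the old alignment of d, draw a new alignment using `prob`, and then call `add` with the result.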

図３に示すように、第２の実施の形態に係る文字列対応付け装置２００の演算部２２０は、文字列組データベース２１、文字列組データ読み込み部２２、及び対応付け計算部２４を備えている。 As illustrated in FIG. 3, the calculation unit 220 of the character string association device 200 according to the second embodiment includes a character string group database 21, a character string group data reading unit 22, and a correspondence calculation unit 24. Yes.

対応付け計算部２４は、ギブスサンプリングを用いて、複数の文字列組の各々に対して対応付けを行う。対応付け計算部２４は、初期値対応部２４１、フィルタリング部２４２、サンプリング部２４３、及び停止判定部２４４を備えている。 The association calculation unit 24 associates each of the plurality of character string sets using Gibbs sampling. The association calculation unit 24 includes an initial value association unit 241, a filtering unit 242, a sampling unit 243, and a stop determination unit 244.

初期値対応部２４１は、文字列組の各々に対して、文字列組の各文字列を前置非翻字セグメント、翻字セグメント、及び後置非翻字セグメントで構成し、かつ、原言語の文字列のうちの翻字セグメントの部分文字列と、目的言語の文字列のうちの翻字セグメントの部分文字列との間で文字の対応付けを行って、対応付けの初期設定を行う。ここで、ギブスサンプリングを用いる場合は、確率分布の初期値でなく、適当な初期対応付けを与える必要がある。従って、初期対応付の決定方法としては、何かの規則により対応付けを行う、ランダムに対応付けを行う、別途簡便なモデルでの対応付け結果を利用するなどの方法が考えられる。本実施の形態では、ランダムに対応付けを行う。 The initial value corresponding unit 241 includes, for each character string set, each character string of the character string set composed of a prefix non-transliteration segment, a transliteration segment, and a post-translation non-transliteration segment. Characters are matched between the transliteration segment partial character string in the character string and the target language character string partial character string, and the association is initialized. Here, when Gibbs sampling is used, it is necessary to give an appropriate initial association instead of the initial value of the probability distribution. Accordingly, as a method for determining the initial association, there are conceivable methods such as performing association according to some rule, performing association at random, and using an association result with a separate simple model. In this embodiment, the association is performed at random.

フィルタリング部２４２は、フィルタリング部２４２によって前回計算された、原言語の非翻字モデル選択確率と、目的言語の非翻字モデル選択確率と、翻字モデル選択確率と、処理対象の文字列組の原言語の文字列のうちの各部分文字列に対する非翻字モデル生成確率と、処理対象の文字列組の目的言語の文字列のうちの各部分文字列に対する非翻字モデル生成確率と、処理対象の文字列組の原言語の文字列のうちの部分文字列と、目的言語の文字列のうちの部分文字列との間の部分文字列の各ペアに対する翻字モデル生成確率とに基づいて、上記第１の実施の形態における前向き確率の計算の場合と同様の方法で、処理対象の文字列組ｄに対して、文字列組の先頭から前向き確率を計算し、前置非翻字セグメント、翻字セグメント、及び後置非翻字セグメントのそれぞれに対応する前向き確率表に、計算した各前向き確率を記録する。ここで、確率が非常に小さい経路を大量に保持すると計算量が大きくなる場合があるため、他の対立する対応付け経路と比較して非常に確率が小さいような場合はその経路を無視してもよい。
また、フィルタリング部２４２は、初期対応設定部２４１によって設定され、又は後述するサンプリング部２４３によって前回更新された、複数組の文字列組のうちの処理対象の文字列組以外の文字列組の各々についての対応付けに基づいて、原言語の非翻字モデル選択確率と、目的言語の非翻字モデル選択確率と、翻字モデル選択確率と、処理対象の文字列組の原言語の文字列のうちの各部分文字列に対する非翻字モデル生成確率と、処理対象の文字列組の目的言語の文字列のうちの各部分文字列に対する非翻字モデル生成確率と、処理対象の文字列組の原言語の文字列のうちの部分文字列と、目的言語の文字列のうちの部分文字列との間の部分文字列の各ペアに対する翻字モデル生成確率と、を計算する。翻字モデル生成確率Ｐ_ｇｔ（ＥＭアルゴリズムによる構成の場合のＰ_ｔに相当）は、上記非特許文献３や上記非特許文献６に記載の方法を拡張し、以下のように定義できる。 The filtering unit 242 calculates forward probabilities for the character string set d to be processed, starting from the beginning of the set and in the same manner as the forward probability calculation of the first embodiment, and records each calculated forward probability in the forward probability tables corresponding to the prefix non-transliteration segment, the transliteration segment, and the postfix non-transliteration segment. The calculation is based on the following values previously calculated by the filtering unit 242: the non-transliteration model selection probability of the source language, the non-transliteration model selection probability of the target language, the transliteration model selection probability, the non-transliteration model generation probability for each partial character string of the source language string of the set being processed, the non-transliteration model generation probability for each partial character string of the target language string of that set, and the transliteration model generation probability for each pair of a source language partial character string and a target language partial character string of that set. Because retaining a large number of paths with very small probabilities can increase the amount of computation, a path may be ignored when its probability is very small compared with competing association paths.
The filtering unit 242 also calculates these same probabilities (the non-transliteration model selection probabilities of the source language and the target language, the transliteration model selection probability, the non-transliteration model generation probabilities for the partial character strings of the source language string and of the target language string of the set being processed, and the transliteration model generation probability for each pair of partial character strings) based on the associations of the character string sets other than the one being processed, as set by the initial correspondence setting unit 241 or as last updated by the sampling unit 243 described later. The transliteration model generation probability P_gt (corresponding to P_t in the EM-based configuration) extends the methods described in Non-Patent Document 3 and Non-Patent Document 6 and can be defined as follows.

ここで、ｃ（ｓ_ｊ→ｊ’，ｔ_ｉ→ｉ’）は，原言語側の文字列ｓ_ｊ→ｊ’と目的言語側の文字列ｔ_ｉ→ｉ’がｄを除く文字組列の集合で対応付けられている回数、ＢＭ（ｓ_ｊ→ｊ’，ｔ_ｉ→ｉ’）は基底測度、αはハイパーパラメータ、Ｃはｄを除く文字列組データでの翻字となっているすべての対応付けの数である。非翻字モデルについても同様であり、原言語・目的言語の非翻字モデル生成確率も、同様に以下のように定義できる（ＥＭアルゴリズムによる構成の場合のＰ_ｓｒｃ、Ｐ_ｔｒｇに相当）。 Here, c(s_{j→j'}, t_{i→i'}) is the number of times the source-language character string s_{j→j'} and the target-language character string t_{i→i'} are associated with each other in the set of character string sets excluding d, BM(s_{j→j'}, t_{i→i'}) is the base measure, α is a hyperparameter, and C is the total number of transliteration associations in the character string set data excluding d. The same applies to the non-transliteration models: the non-transliteration model generation probabilities of the source language and the target language can be defined similarly as follows (corresponding to P_src and P_trg in the EM-based configuration).
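The equations referred to in this passage did not survive in this text. As a hedged reconstruction only, assuming the Dirichlet-process-style predictive form that the listed symbols (count c, base measure BM, hyperparameter α, total C) suggest, the probabilities might take the following shape. The names used for the non-transliteration models (P_gsrc, P_gtrg, c_src, c_trg, BM_src, BM_trg, C_src, C_trg, α_src, α_trg) are labels introduced here for illustration and do not come from the original.

```latex
% Assumed form of the transliteration model generation probability
P_{gt}(s_{j \to j'}, t_{i \to i'})
  = \frac{c(s_{j \to j'}, t_{i \to i'}) + \alpha \, BM(s_{j \to j'}, t_{i \to i'})}
         {C + \alpha}

% Assumed analogous forms for the non-transliteration models
P_{gsrc}(s_{j \to j'})
  = \frac{c_{src}(s_{j \to j'}) + \alpha_{src} \, BM_{src}(s_{j \to j'})}
         {C_{src} + \alpha_{src}},
\qquad
P_{gtrg}(t_{i \to i'})
  = \frac{c_{trg}(t_{i \to i'}) + \alpha_{trg} \, BM_{trg}(t_{i \to i'})}
         {C_{trg} + \alpha_{trg}}
```

Under this assumed form, a pair never seen in the data excluding d still receives probability mass proportional to its base measure, and frequently seen pairs approach their relative frequency c / C.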

翻字モデル選択確率ｐ_ｔ、非翻字モデル選択確率ｐ_ｓｒｃ、ｐ_ｔｒｇは前記ＥＭアルゴリズムの場合と同様、翻字になっている文字数と非翻字になっている文字数の割合で与えることができる。 The transliteration model selection probability p _t , the non-transliteration model selection probability p _src , and p _trg can be given as a ratio of the number of characters that are transliterated and the number of characters that are not transliterated, as in the case of the EM algorithm. it can.

サンプリング部２４３は、フィルタリング部２４２によって計算された非翻字モデル選択確率、翻字モデル選択確率、非翻字モデル生成確率、及び翻字モデル生成確率に基づいて、処理対象の文字列組の各文字列を前置非翻字セグメント、翻字セグメント、及び後置非翻字セグメントで構成し、かつ、原言語の文字列のうちの翻字セグメントの部分文字列と、目的言語の文字列のうちの翻字セグメントの部分文字列との間で文字の対応付けを行って、処理対象の文字列組に対する対応付けを更新する。具体的には、サンプリング部２４３は、フィルタリング部２４２で計算された前向き確率表を末尾から順に辿っていき、前向き確率に基づいて対応付けを決定していき、ｄに対する対応付けを更新する。より具体的には、文字列組データ中のある位置（ｊ，ｉ）に対して、その位置に至る１つ前の位置（ｊ₋，ｉ₋）を選択する際に、（ｊ₋，ｉ₋）から（ｊ，ｉ）に至る確率に基づく重み付きサンプリングを行う（確率が高い経路ほど選ばれやすいようにする）。（ｊ₋，ｉ₋）から（ｊ，ｉ）に至る確率は、（ｊ₋，ｉ₋）における前向き確率と（ｊ₋，ｉ₋）から（ｊ，ｉ）への経路の確率（ｓ_{ｊ−→ｊ’}，ｔ_{ｉ−→ｉ’}の翻字あるいは非翻字確率）である。こうして決定された対応付けをｄの新しい対応付けとする。 Based on the non-transliteration model selection probabilities, the transliteration model selection probability, the non-transliteration model generation probabilities, and the transliteration model generation probabilities calculated by the filtering unit 242, the sampling unit 243 divides each character string of the set being processed into a prefix non-transliteration segment, a transliteration segment, and a postfix non-transliteration segment, associates characters between the transliteration segment substring of the source language string and that of the target language string, and updates the association for the set being processed. Specifically, the sampling unit 243 traces the forward probability table calculated by the filtering unit 242 from the end, determines the association step by step based on the forward probabilities, and updates the association for d. More specifically, for a position (j, i) in the character string set data, the immediately preceding position (j₋, i₋) leading to it is selected by weighted sampling based on the probability of going from (j₋, i₋) to (j, i), so that paths with higher probability are more likely to be chosen. The probability of going from (j₋, i₋) to (j, i) is determined by the forward probability at (j₋, i₋) and the probability of the path from (j₋, i₋) to (j, i) (the transliteration or non-transliteration probability of s_{j₋→j'}, t_{i₋→i'}). The association determined in this way becomes the new association of d.
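The forward filtering and backward sampling described for the filtering unit 242 and the sampling unit 243 can be sketched as follows. This is a reduced illustration under stated assumptions: only the transliteration segment is modeled, spans must be non-empty on both sides, each predecessor is weighted by the product of its forward probability and the probability of the connecting span, and `prob`, `sample_alignment`, and the span limits are hypothetical names chosen for this example.

```python
import random


def forward_table(src, trg, prob, max_src=2, max_trg=3):
    """Forward probabilities over the alignment lattice of src and trg."""
    J, I = len(src), len(trg)
    fwd = [[0.0] * (I + 1) for _ in range(J + 1)]
    fwd[0][0] = 1.0
    for j in range(1, J + 1):
        for i in range(1, I + 1):
            for dj in range(1, min(max_src, j) + 1):
                for di in range(1, min(max_trg, i) + 1):
                    fwd[j][i] += (fwd[j - dj][i - di]
                                  * prob(src[j - dj:j], trg[i - di:i]))
    return fwd


def sample_alignment(src, trg, prob, rng, max_src=2, max_trg=3):
    """Draw one alignment by tracing the forward table from the end."""
    J, I = len(src), len(trg)
    fwd = forward_table(src, trg, prob, max_src, max_trg)
    if fwd[J][I] == 0.0:
        return None  # no alignment is possible under the span limits
    path, j, i = [], J, I
    while (j, i) != (0, 0):
        cands, weights = [], []
        for dj in range(1, min(max_src, j) + 1):
            for di in range(1, min(max_trg, i) + 1):
                # Weight: forward probability of the predecessor times the
                # probability of the span that connects it to (j, i).
                w = fwd[j - dj][i - di] * prob(src[j - dj:j], trg[i - di:i])
                if w > 0.0:
                    cands.append((j - dj, i - di))
                    weights.append(w)
        pj, pi = rng.choices(cands, weights=weights, k=1)[0]
        path.append((src[pj:j], trg[pi:i]))
        j, i = pj, pi
    path.reverse()
    return path
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which is convenient when comparing sampler settings.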

全ての文字列組の各々を、処理対象の文字列組として、フィルタリング部２４２及びサンプリング部２４３による処理を繰り返し行う。
停止判定部２４４は、予め定められた停止条件が満たされたか否かを判定し、停止条件が満たされるまで、各文字列組を処理対象とした、フィルタリング部２４２による計算及びサンプリング部２４３による更新を繰り返す。本実施の形態では、ギブスサンプリングを利用しており、文字列組ｄを一つ選択するごとにフィルタリング部２４２によるフィルタリング処理と、サンプリング部２４３によるサンプリング処理とを行うので、文字列組の集合Ｄの全てに対してフィルタリング処理とサンプリング処理とを適用した後に停止判定を行う。ただし、途中で停止してもよい。停止条件はＥＭアルゴリズムを利用する第１の実施の形態の場合と同様に設定が可能である。 The processing by the filtering unit 242 and the sampling unit 243 is repeated with each character string set as a processing target character string set.
The stop determination unit 244 determines whether a predetermined stop condition is satisfied, and the calculation by the filtering unit 242 and the update by the sampling unit 243, with each character string set as the processing target, are repeated until it is. Because this embodiment uses Gibbs sampling, and the filtering process by the filtering unit 242 and the sampling process by the sampling unit 243 are performed each time one character string set d is selected, the stop determination is made after the filtering and sampling processes have been applied to every set in the set D of character string sets. The process may, however, be stopped partway. The stop condition can be set in the same way as in the first embodiment, which uses the EM algorithm.

＜文字列対応付け装置の作用＞
次に、第２の実施の形態に係る文字列対応付け装置２００の作用について説明する。なお、第１の実施の形態と同様の処理については、同一符号を付して説明を省略する。 <Operation of character string matching device>
Next, the operation of the character string association device 200 according to the second embodiment will be described. Processes that are the same as in the first embodiment are denoted by the same reference numerals, and their description is omitted.

まず、対訳となっている原言語の文字列及び目的言語の文字列の組である文字列組が、文字列対応付け装置２００に複数入力されると、文字列対応付け装置２００によって、入力された複数の文字列組が、文字列組データベース２１に格納される。そして、文字列対応付け装置２００によって、図４に示す文字列対応付け処理ルーチンが実行される。 First, when a plurality of character string sets, each a pair consisting of a source language character string and a target language character string that are translations of each other, are input to the character string association device 200, the device stores the input character string sets in the character string set database 21. The character string association device 200 then executes the character string association processing routine shown in FIG. 4.

ステップＳ１００において、文字列組データベース２１から、全ての文字列組を読み込む。 In step S100, all character string sets are read from the character string set database 21.

そして、ステップＳ２０２において、初期値対応部２４１によって、文字列組の各々に対して、文字列組の各文字列を前置非翻字セグメント、翻字セグメント、及び後置非翻字セグメントで構成し、かつ、原言語の文字列のうちの翻字セグメントの部分文字列と、目的言語の文字列のうちの翻字セグメントの部分文字列との間で文字の対応付けを行って、対応付けの初期設定を行う。 In step S202, the initial value corresponding unit 241 configures each character string of the character string set with a prefix non-transliteration segment, a transliteration segment, and a postfix non-transliteration segment for each character string set. In addition, character mapping is performed between the partial character string of the transliteration segment in the source language character string and the partial character string of the transliteration segment in the target language character string. Perform initial settings for.

次にステップＳ２０４において、フィルタリング部２４２によって、処理対象として、１つの文字組のデータｄを設定する。 In step S204, the filtering unit 242 sets one character set data d as a processing target.

ステップＳ２０６において、フィルタリング部２４２によって、上記ステップＳ２０４で設定された処理対象の文字列組のデータｄについて、本ステップで前回計算された、原言語の非翻字モデル選択確率と、目的言語の非翻字モデル選択確率と、翻字モデル選択確率と、処理対象の文字列組の原言語の文字列のうちの各部分文字列に対する非翻字モデル生成確率と、処理対象の文字列組の目的言語の文字列のうちの各部分文字列に対する非翻字モデル生成確率と、処理対象の文字列組の原言語の文字列のうちの部分文字列と、目的言語の文字列のうちの部分文字列との間の部分文字列の各ペアに対する翻字モデル生成確率とに基づいて、上記第１の実施の形態における前向き確率の計算の場合と同様の方法で、処理対象の文字列組ｄに対して、文字列組の先頭から前向き確率を計算し、前置非翻字セグメント、翻字セグメント、及び後置非翻字セグメントのそれぞれに対応する前向き確率表に、計算した各前向き確率を記録する。 In step S206, the filtering unit 242 calculates forward probabilities for the data d of the character string set selected as the processing target in step S204, starting from the beginning of the set and in the same manner as the forward probability calculation of the first embodiment, and records each calculated forward probability in the forward probability tables corresponding to the prefix non-transliteration segment, the transliteration segment, and the postfix non-transliteration segment. The calculation is based on the following values previously calculated in this step: the non-transliteration model selection probability of the source language, the non-transliteration model selection probability of the target language, the transliteration model selection probability, the non-transliteration model generation probability for each partial character string of the source language string of the set being processed, the non-transliteration model generation probability for each partial character string of the target language string of that set, and the transliteration model generation probability for each pair of a source language partial character string and a target language partial character string of that set.

ステップＳ２０８において、フィルタリング部２４２によって、上記ステップＳ２０２で設定され、又は後述するステップＳ２１０で前回更新された、複数組の文字列組のうちの処理対象の文字列組以外の文字列組の各々についての対応付けに基づいて、原言語及び目的言語の各々の非翻字モデル選択確率と、翻字モデル選択確率と、処理対象の文字列組の原言語及び目的言語の各々の、文字列のうちの各部分文字列に対する非翻字モデル生成確率と、処理対象の文字列組の目的言語の文字列のうちの各部分文字列に対する非翻字モデル生成確率と、翻字モデル生成確率と、を計算する。 In step S208, for each of the character string sets other than the character string set to be processed among the plurality of character string sets, set in step S202 by the filtering unit 242 or updated last time in step S210 described later. Of the non-transliteration model selection probability of each of the source language and the target language, the transliteration model selection probability, and the character string of each of the source language and the target language of the character set to be processed. Non-transliteration model generation probability for each partial character string, non-transliteration model generation probability for each partial character string in the target language character string of the target character string set, and transliteration model generation probability calculate.

ステップＳ２１０において、サンプリング部２４３によって、上記ステップＳ２０８で計算された非翻字モデル選択確率、翻字モデル選択確率、非翻字モデル生成確率、及び翻字モデル生成確率に基づいて、処理対象の文字列組の各文字列を前置非翻字セグメント、翻字セグメント、及び後置非翻字セグメントで構成し、かつ、原言語の文字列のうちの翻字セグメントの部分文字列と、目的言語の文字列のうちの翻字セグメントの部分文字列との間で文字の対応付けを行って、処理対象の文字列組に対する対応付けを更新する。 In step S210, based on the non-transliteration model selection probabilities, the transliteration model selection probability, the non-transliteration model generation probabilities, and the transliteration model generation probabilities calculated in step S208, the sampling unit 243 divides each character string of the set being processed into a prefix non-transliteration segment, a transliteration segment, and a postfix non-transliteration segment, associates characters between the transliteration segment substring of the source language string and the transliteration segment substring of the target language string, and updates the association for the set being processed.

ステップＳ２１２において、文字列組の集合Ｄのうちの全ての文字列組ｄについて、上記ステップＳ２０４〜ステップＳ２１０の処理を実行したか否かを判定する。そして、全ての文字列組ｄについて、上記ステップＳ２０４〜ステップＳ２１０の処理を実行していない場合には、上記ステップＳ２０４へ戻り、新たな文字列組ｄを処理対象として設定する。一方、全ての文字列組ｄについて、上記ステップＳ２０４〜ステップＳ２１０の処理を実行した場合には、ステップＳ２１４へ進む。 In step S212, it is determined whether or not the processing in steps S204 to S210 has been executed for all character string sets d in the character string set set D. Then, when the processes in steps S204 to S210 are not executed for all character string sets d, the process returns to step S204, and a new character string set d is set as a processing target. On the other hand, when the processing of step S204 to step S210 is executed for all character string sets d, the process proceeds to step S214.

ステップＳ２１４において、停止判定部２４４によって、予め定められた停止条件が満たされたか否かを判定する。そして、停止条件が満たされていない場合には、上記ステップＳ２０４へ戻り、再び、全ての文字列組ｄについて、上記ステップＳ２０４〜ステップＳ２１２の処理を実行する。一方、停止条件が満たされている場合には、ステップＳ２１６へ進む。 In step S214, the stop determination unit 244 determines whether a predetermined stop condition is satisfied. If the stop condition is not satisfied, the process returns to step S204, and the processes of steps S204 to S212 are executed again for all character string sets d. On the other hand, if the stop condition is satisfied, the process proceeds to step S216.

ステップＳ２１６において、出力部３０によって、上記ステップＳ２１０で最終的に対応付けた結果を出力して、文字列対応付け処理ルーチンを終了する。 In step S216, the output unit 30 outputs the result finally associated in step S210, and the character string association processing routine ends.

なお、第２の実施の形態に係る文字列対応付け装置２００の他の構成及び作用については、第１の実施の形態と同様であるため、説明を省略する。 The other configurations and operations of the character string association device 200 according to the second embodiment are the same as those of the first embodiment, and their description is omitted.

以上説明したように、第２の実施の形態に係る文字列対応付け装置によれば、ギブスサンプリングにより、非翻字モデル選択確率と、翻字モデル選択確率と、非翻字モデル生成確率と、翻字モデル生成確率とを計算し、処理対象の文字列組に対する対応付けを更新することにより、異なる言語の文字列組における文字の対応付けを精度よく行うことができる。 As described above, according to the character string association device according to the second embodiment, by Gibbs sampling, the non-transliteration model selection probability, the transliteration model selection probability, the non-transliteration model generation probability, By calculating the transliteration model generation probability and updating the association with the character string set to be processed, it is possible to accurately associate the characters in the character string sets of different languages.

＜実施例＞
次に本発明に係る実施の形態を実施した例について示す。実施例では原言語として日本語（カタカナ語）、目的言語として英語を利用した。なお、本実施例ではギブスサンプリングを利用した第２の実施の形態によって、対応付けの計算を行う。 <Example>
Next, an example in which the embodiment according to the present invention is implemented will be described. In the examples, Japanese (Katakana) was used as the source language, and English was used as the target language. In this embodiment, the association is calculated according to the second embodiment using Gibbs sampling.

図５はそれぞれ本実施例で用いた日本語と英語との文字列組データ（およそ５万語対）を抜粋したものである。同じ行に記された文字列組が対応していることを示す。各記号（カタカナ及びアルファベット）は１文字ずつ区切り文字（空白文字）によって分割されている。空白文字自体を対応付けの対象文字として扱いたい場合には、区切り文字として別の文字（例えば”：”）を利用するか、空白文字を別の記号（例えば”[ｓｐ]”）に置換すればよい。 FIG. 5 shows an excerpt of the Japanese-English character string set data (approximately 50,000 word pairs) used in this example. Character strings written on the same line correspond to each other. The symbols (katakana and alphabetic characters) are separated one by one by a delimiter (a space character). To treat the space character itself as a character to be associated, either use a different character (for example ":") as the delimiter or replace the space character with another symbol (for example "[sp]").
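Reading data in the layout described for FIG. 5 could look like the sketch below. The text states only that symbols are separated by a space and that the two strings share a line. How the two sides are separated from each other is not stated, so the tab used here is purely an assumption, as are the function name and parameter names.

```python
def parse_pair(line, side_sep="\t", symbol_sep=" "):
    """Split one data line into (source symbols, target symbols).

    side_sep is an assumed separator between the two strings; symbol_sep is
    the per-symbol delimiter (a space in the example data).
    """
    src, trg = line.rstrip("\n").split(side_sep)
    return src.split(symbol_sep), trg.split(symbol_sep)
```

Because splitting is done on the delimiter rather than per character, a multi-character placeholder such as "[sp]" is kept as a single symbol.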

The character string set data shown in FIG. 5 were associated according to the second embodiment. First, the electronic file containing the character string set data was read by the character string set data reading unit 22. Because considering correspondences between katakana and alphabetic character strings of arbitrary length makes the amount of computation large, this example limited correspondences to at most two characters on the katakana (Japanese) side and at most three characters on the alphabet (English) side.
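The effect of the length limits can be illustrated by enumerating the admissible partial-string pair sizes and counting the monotone segmentations they allow. The sketch below is illustrative only: `candidate_steps` and `count_alignments` are hypothetical names, it assumes that a zero-length partial string is allowed on either side (but not on both at once), and it ignores the non-transliteration segments.

```python
def candidate_steps(max_src=2, max_tgt=3):
    """All (source length, target length) sizes of one aligned pair.

    Assumption: length 0 is allowed on one side, but not on both at once.
    """
    return [(di, dj)
            for di in range(max_src + 1)
            for dj in range(max_tgt + 1)
            if di > 0 or dj > 0]


def count_alignments(src_len, tgt_len, max_src=2, max_tgt=3):
    """Count monotone segmentations of a pair under the length limits."""
    steps = candidate_steps(max_src, max_tgt)
    # table[i][j] = number of ways to align the first i source symbols
    # with the first j target symbols.
    table = [[0] * (tgt_len + 1) for _ in range(src_len + 1)]
    table[0][0] = 1
    for i in range(src_len + 1):
        for j in range(tgt_len + 1):
            for di, dj in steps:
                if i >= di and j >= dj:
                    table[i][j] += table[i - di][j - dj]
    return table[src_len][tgt_len]


if __name__ == "__main__":
    print(len(candidate_steps()))   # number of admissible pair sizes
    print(count_alignments(6, 8))   # e.g. 6 katakana vs. 8 letters
```

Raising `max_src` or `max_tgt` enlarges the step set and therefore the work done per table cell, which is the cost the limits are meant to contain.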

To provide a random association, the initial value correspondence unit 241 in this example assumed that every forward probability in the forward probability table used in the filtering process of the filtering unit 242 was 1 (while disallowing associations of three or more characters on the katakana side or four or more characters on the alphabet side), and then applied the sampling algorithm used in the sampling process of the sampling unit 243. A random initial association is obtained in this way.
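One way to picture this initialization is backward sampling over a table whose entries are all equal, so that every admissible step is chosen with the same weight. The following is a simplified sketch under stated assumptions, not the procedure of the sampling unit 243 itself: `random_initial_alignment` is a hypothetical name, zero-length partial strings are assumed to be allowed on one side at a time, and the non-transliteration segments are left out.

```python
import random


def random_initial_alignment(src, tgt, max_src=2, max_tgt=3, rng=None):
    """Sample a random monotone alignment, walking back from the string ends.

    With every forward probability assumed to be 1, each admissible step
    (at most `max_src` source symbols and `max_tgt` target symbols, not
    both zero) receives equal weight.
    """
    rng = rng or random.Random()
    i, j = len(src), len(tgt)
    pairs = []
    while i > 0 or j > 0:
        steps = [(di, dj)
                 for di in range(min(max_src, i) + 1)
                 for dj in range(min(max_tgt, j) + 1)
                 if di > 0 or dj > 0]
        di, dj = rng.choice(steps)
        pairs.append(("".join(src[i - di:i]), "".join(tgt[j - dj:j])))
        i, j = i - di, j - dj
    pairs.reverse()  # steps were collected from the end of the strings
    return pairs


if __name__ == "__main__":
    print(random_initial_alignment(list("コンピュータ"), list("computer"),
                                   rng=random.Random(0)))
```

Each loop iteration consumes at least one symbol, so the walk always terminates at the beginning of both strings.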

Next, for each character string set in the character string set data, the filtering process and then the sampling process were carried out repeatedly. Taking one round to be the filtering and sampling of every character string set in the data, 30 rounds were performed in this example. Because the English symbol "-" does not normally correspond to any katakana, it was always treated as corresponding to a Japanese partial character string of length 0. In Gibbs sampling, the data are processed one item at a time and the associations and model probabilities are updated after each item, so the processing order of the data is widely known to have a large influence on learning. The processing order of the data in the character string set data was therefore shuffled randomly at the start of every round.
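The round structure described above can be sketched as a simple loop. This is a schematic illustration only: `run_rounds` is a hypothetical name, and `resample_pair` is a placeholder callback standing in for the filtering and sampling of one character string set, whose details are not reproduced here.

```python
import random


def run_rounds(num_pairs, resample_pair, num_rounds=30, seed=None):
    """Visit every string pair once per round, in a freshly shuffled order.

    `resample_pair(index)` is a placeholder for filtering followed by
    sampling on the pair with that index.
    """
    rng = random.Random(seed)
    order = list(range(num_pairs))
    for _ in range(num_rounds):
        rng.shuffle(order)          # new processing order every round
        for index in order:
            resample_pair(index)


if __name__ == "__main__":
    visits = [0] * 5
    def count_visit(index):
        visits[index] += 1
    run_rounds(5, count_visit, seed=0)
    print(visits)
```

Shuffling an index list rather than the data itself keeps the stored associations addressable by a stable index across rounds.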

After the 30 rounds, character string association was performed once more using the Viterbi algorithm. The result is shown in FIG. 6. In FIG. 6, the Japanese and English character strings are separated by a tab character, and the associated partial character strings are separated by the symbol ":". The number of ":" symbols is therefore the same on the Japanese side and the English side, and elements at the same position form a corresponding pair of partial character strings. "<noise>" indicates that there is no corresponding partial character string (that is, the element is noise).
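A result line in the format described for FIG. 6 could be read back as follows. This is a sketch under assumptions: `parse_alignment_line` is a hypothetical name, the separators are taken to be an ASCII tab and ":", and the sample line in the usage section is a made-up illustration of the format, not a line copied from FIG. 6.

```python
NOISE = "<noise>"


def parse_alignment_line(line, field_sep="\t", part_sep=":"):
    """Return a list of (japanese_part, english_part) tuples.

    A part equal to "<noise>" is returned as None, meaning that the
    element on the other side has no corresponding partial string.
    """
    src_field, tgt_field = line.rstrip("\n").split(field_sep)
    src_parts = src_field.split(part_sep)
    tgt_parts = tgt_field.split(part_sep)
    if len(src_parts) != len(tgt_parts):
        raise ValueError("both sides must contain the same number of parts")

    def clean(part):
        return None if part == NOISE else part

    return [(clean(s), clean(t)) for s, t in zip(src_parts, tgt_parts)]


if __name__ == "__main__":
    # Hypothetical line illustrating the format only.
    for pair in parse_alignment_line("コン:ピュー:タ:<noise>\tcom:pu:ter:s"):
        print(pair)
```

The length check mirrors the statement that both sides carry the same number of ":" separators.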

The results in FIG. 6 show, for example, that in the pair "コンピュータ" and "computers" the trailing "s" on the English side is noise; that in the pair "バーン" and "burn-in" the "in" on the English side is noise; that in the pair "シンメトリー" and "asymmetry" the leading "a" on the English side is noise; and that in the pair "サイド" and "side-effect" the "effect" on the English side is noise. Reasonable association results were also obtained for the other character string sets.

When the same data were associated by a computer program implementing the method of Non-Patent Document 4, and by a computer program implementing the method of Non-Patent Document 6, every character string set that this example judged to be partially noise was judged to be noise in its entirety. This confirms that the method of this example is suited to associating character string sets that partially contain noise.

The present invention is not limited to the embodiments described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, the character string set database 21 may be provided externally and connected to the character string association device via a network.

Although the character string association device described above contains a computer system, the "computer system" also includes a homepage providing environment (or display environment) when a WWW system is used.

Although this specification describes an embodiment in which the program is installed in advance, the program can also be provided stored on a computer-readable recording medium.

Description of Symbols
10 Input unit
20, 220 Calculation unit
21 Character string set database
22 Character string set data reading unit
23, 24 Association calculation unit
30 Output unit
100, 200 Character string association device
231 Initial value setting unit
232 Expected value calculation unit
233 Parameter update unit
234 Stop determination unit
235 Character association processing unit
241 Initial value correspondence unit
242 Filtering unit
243 Sampling unit
244 Stop determination unit

Claims

A character string association device that, for a character string set that is a combination of character strings having the same meaning and belonging to mutually different first and second languages, associates characters between the character string of the first language and the character string of the second language, the device comprising:
an association calculation unit that, for each of the character string sets stored in a character string set database storing a plurality of character string sets, where each character string of the character string set is composed, in order from the beginning of the character string, of a prefix non-transliteration segment indicating a partial character string of zero or more characters that is not in a transliteration relationship with a partial character string of the other language, a transliteration segment indicating a partial character string of zero or more characters that is in a transliteration relationship with a partial character string of the other language, and a postfix non-transliteration segment indicating a partial character string of zero or more characters that is not in a transliteration relationship with a partial character string of the other language, determines, so as to be most likely, how each character string of the character string set is composed of the prefix non-transliteration segment, the transliteration segment, and the postfix non-transliteration segment, and associates characters between the partial character strings of the transliteration segment of the character string of the first language and the partial character strings of the transliteration segment of the character string of the second language, based on: a non-transliteration model selection probability representing the probability that a partial character string of the first language is a non-transliteration part not in a transliteration relationship with a partial character string of the second language; a non-transliteration model selection probability representing the probability that a partial character string of the second language is a non-transliteration part not in a transliteration relationship with a partial character string of the first language; a transliteration model selection probability representing the probability that a partial character string of the first language is a transliteration part in a transliteration relationship with a partial character string of the second language and that the partial character string of the second language is a transliteration part in a transliteration relationship with the partial character string of the first language; a non-transliteration model generation probability representing a generation probability in the first language for each partial character string of the prefix non-transliteration segment and the postfix non-transliteration segment of the character string of the first language; a non-transliteration model generation probability representing a generation probability in the second language for each partial character string of the prefix non-transliteration segment and the postfix non-transliteration segment of the character string of the second language; and a transliteration model generation probability representing a joint generation probability for each pair of partial character strings between the partial character strings of the transliteration segment of the character string of the first language and the partial character strings of the transliteration segment of the character string of the second language,
wherein the association calculation unit includes:
The non-transliteration model selection probability of the first language; the non-transliteration model selection probability of the second language; the transliteration model selection probability for each partial character string of the second language; The non-transliteration model generation probability for each partial character string in the first language, the non-transliteration model generation probability for each partial character string in the second language, the partial character string in the first language, and the An initial value setting unit for setting an initial value for each transliteration model generation probability for each pair of partial character strings between partial character strings of the second language;
an expected value calculation unit that, based on the non-transliteration model selection probability, the transliteration model selection probability, the non-transliteration model generation probability, and the transliteration model generation probability set by the initial value setting unit or updated the previous time, for each of the character string sets, calculates, for each pair of partial character strings between a partial character string of the character string of the first language and a partial character string of the character string of the second language, an expected value that the pair is in a transliteration relationship, calculates, for each partial character string of the character string of the first language, an expected value that the partial character string is a non-transliteration part, and calculates, for each partial character string of the character string of the second language, an expected value that the partial character string is a non-transliteration part;
a parameter update unit that updates the non-transliteration model selection probability, the transliteration model selection probability, the non-transliteration model generation probability, and the transliteration model generation probability, based on the expected value of being in a transliteration relationship for each pair, the expected value of being a non-transliteration part for each partial character string of the first language, and the expected value of being a non-transliteration part for each partial character string of the second language, each calculated by the expected value calculation unit for each of the character string sets; and
a stop condition determination unit that determines whether a predetermined stop condition is satisfied and causes the calculation by the expected value calculation unit and the update by the parameter update unit to be repeated until the stop condition is satisfied,
and wherein, for each of the character string sets, based on the non-transliteration model selection probability, the transliteration model selection probability, the non-transliteration model generation probability, and the transliteration model generation probability, each character string of the character string set is composed of the prefix non-transliteration segment, the transliteration segment, and the postfix non-transliteration segment, and characters are associated between the partial character strings of the transliteration segment of the character string of the first language and the partial character strings of the transliteration segment of the character string of the second language.

A character string association device that, for a character string set that is a combination of character strings having the same meaning and belonging to mutually different first and second languages, associates characters between the character string of the first language and the character string of the second language, the device comprising:
an association calculation unit that, for each of the character string sets stored in a character string set database storing a plurality of character string sets, where each character string of the character string set is composed, in order from the beginning of the character string, of a prefix non-transliteration segment indicating a partial character string of zero or more characters that is not in a transliteration relationship with a partial character string of the other language, a transliteration segment indicating a partial character string of zero or more characters that is in a transliteration relationship with a partial character string of the other language, and a postfix non-transliteration segment indicating a partial character string of zero or more characters that is not in a transliteration relationship with a partial character string of the other language, determines, so as to be most likely, how each character string of the character string set is composed of the prefix non-transliteration segment, the transliteration segment, and the postfix non-transliteration segment, and associates characters between the partial character strings of the transliteration segment of the character string of the first language and the partial character strings of the transliteration segment of the character string of the second language, based on: a non-transliteration model selection probability representing the probability that a partial character string of the first language is a non-transliteration part not in a transliteration relationship with a partial character string of the second language; a non-transliteration model selection probability representing the probability that a partial character string of the second language is a non-transliteration part not in a transliteration relationship with a partial character string of the first language; a transliteration model selection probability representing the probability that a partial character string of the first language is a transliteration part in a transliteration relationship with a partial character string of the second language and that the partial character string of the second language is a transliteration part in a transliteration relationship with the partial character string of the first language; a non-transliteration model generation probability representing a generation probability in the first language for each partial character string of the prefix non-transliteration segment and the postfix non-transliteration segment of the character string of the first language; a non-transliteration model generation probability representing a generation probability in the second language for each partial character string of the prefix non-transliteration segment and the postfix non-transliteration segment of the character string of the second language; and a transliteration model generation probability representing a joint generation probability for each pair of partial character strings between the partial character strings of the transliteration segment of the character string of the first language and the partial character strings of the transliteration segment of the character string of the second language,
wherein the association calculation unit includes:
an initial correspondence setting unit that, for each of the character string sets, initially sets an association in which each character string of the character string set is composed of the prefix non-transliteration segment, the transliteration segment, and the postfix non-transliteration segment, and in which characters are associated between the partial character strings of the transliteration segment of the character string of the first language and the partial character strings of the transliteration segment of the character string of the second language;
a filtering unit that, based on the association, set by the initial correspondence setting unit or updated the previous time, for each of the character string sets other than the character string set to be processed among the plurality of character string sets, calculates the non-transliteration model selection probability of the first language, the non-transliteration model selection probability of the second language, the transliteration model selection probability, the non-transliteration model generation probability for each partial character string of the character string of the first language of the character string set to be processed, the non-transliteration model generation probability for each partial character string of the character string of the second language of the character string set to be processed, and the transliteration model generation probability for each pair of partial character strings between a partial character string of the character string of the first language and a partial character string of the character string of the second language of the character string set to be processed;
Based on the non-transliteration model selection probability, the transliteration model selection probability, the non-transliteration model generation probability, and the transliteration model generation probability calculated by the filtering unit, each of the character string sets to be processed A character string is composed of the prefix non-transliteration segment, the transliteration segment, and the post-translation non-transliteration segment, and a partial character string of the transliteration segment of the character string of the first language; A sampling unit that performs character association with a partial character string of the transliteration segment of the character string of the second language, and updates the association with the character string set to be processed;
A stop condition determination unit that determines whether or not a predetermined stop condition is satisfied and repeats the calculation by the filtering unit and the update by the sampling unit for processing each character string set until the stop condition is satisfied And a character string matching device.

A character string association method in a character string association device that includes an association calculation unit and that, for a character string set that is a combination of character strings having the same meaning and belonging to mutually different first and second languages, associates characters between the character string of the first language and the character string of the second language, the method comprising:
a step in which the association calculation unit, for each of the character string sets stored in a character string set database storing a plurality of character string sets, where each character string of the character string set is composed, in order from the beginning of the character string, of a prefix non-transliteration segment indicating a partial character string of zero or more characters that is not in a transliteration relationship with a partial character string of the other language, a transliteration segment indicating a partial character string of zero or more characters that is in a transliteration relationship with a partial character string of the other language, and a postfix non-transliteration segment indicating a partial character string of zero or more characters that is not in a transliteration relationship with a partial character string of the other language, determines, so as to be most likely, how each character string of the character string set is composed of the prefix non-transliteration segment, the transliteration segment, and the postfix non-transliteration segment, and associates characters between the partial character strings of the transliteration segment of the character string of the first language and the partial character strings of the transliteration segment of the character string of the second language, based on: a non-transliteration model selection probability representing the probability that a partial character string of the first language is a non-transliteration part not in a transliteration relationship with a partial character string of the second language; a non-transliteration model selection probability representing the probability that a partial character string of the second language is a non-transliteration part not in a transliteration relationship with a partial character string of the first language; a transliteration model selection probability representing the probability that a partial character string of the first language is a transliteration part in a transliteration relationship with a partial character string of the second language and that the partial character string of the second language is a transliteration part in a transliteration relationship with the partial character string of the first language; a non-transliteration model generation probability representing a generation probability in the first language for each partial character string of the prefix non-transliteration segment and the postfix non-transliteration segment of the character string of the first language; a non-transliteration model generation probability representing a generation probability in the second language for each partial character string of the prefix non-transliteration segment and the postfix non-transliteration segment of the character string of the second language; and a transliteration model generation probability representing a joint generation probability for each pair of partial character strings between the partial character strings of the transliteration segment of the character string of the first language and the partial character strings of the transliteration segment of the character string of the second language,
The association calculation unit includes an initial value setting unit, an expected value calculation unit, a parameter update unit, and a stop condition determination unit.
a step in which the initial value setting unit sets an initial value for each of the non-transliteration model selection probability of the first language, the non-transliteration model selection probability of the second language, the transliteration model selection probability for each partial character string of the second language, the non-transliteration model generation probability for each partial character string of the first language, the non-transliteration model generation probability for each partial character string of the second language, and the transliteration model generation probability for each pair of partial character strings between a partial character string of the first language and a partial character string of the second language;
a step in which the expected value calculation unit, based on the non-transliteration model selection probability, the transliteration model selection probability, the non-transliteration model generation probability, and the transliteration model generation probability set by the initial value setting unit or updated the previous time, for each of the character string sets, calculates, for each pair of partial character strings between a partial character string of the character string of the first language and a partial character string of the character string of the second language, an expected value that the pair is in a transliteration relationship, calculates, for each partial character string of the character string of the first language, an expected value that the partial character string is a non-transliteration part, and calculates, for each partial character string of the character string of the second language, an expected value that the partial character string is a non-transliteration part;
a step in which the parameter update unit updates the non-transliteration model selection probability, the transliteration model selection probability, the non-transliteration model generation probability, and the transliteration model generation probability, based on the expected value of being in a transliteration relationship for each pair, the expected value of being a non-transliteration part for each partial character string of the first language, and the expected value of being a non-transliteration part for each partial character string of the second language, each calculated by the expected value calculation unit for each of the character string sets; and
a step in which the stop condition determination unit determines whether a predetermined stop condition is satisfied and causes the calculation by the expected value calculation unit and the update by the parameter update unit to be repeated until the stop condition is satisfied,
wherein, for each of the character string sets, based on the non-transliteration model selection probability, the transliteration model selection probability, the non-transliteration model generation probability, and the transliteration model generation probability, each character string of the character string set is composed of the prefix non-transliteration segment, the transliteration segment, and the postfix non-transliteration segment, and characters are associated between the partial character strings of the transliteration segment of the character string of the first language and the partial character strings of the transliteration segment of the character string of the second language.

A character string association method in a character string association device that includes an association calculation unit and that, for a character string set that is a combination of character strings having the same meaning and belonging to mutually different first and second languages, associates characters between the character string of the first language and the character string of the second language, the method comprising:
a step in which the association calculation unit, for each of the character string sets stored in a character string set database storing a plurality of character string sets, where each character string of the character string set is composed, in order from the beginning of the character string, of a prefix non-transliteration segment indicating a partial character string of zero or more characters that is not in a transliteration relationship with a partial character string of the other language, a transliteration segment indicating a partial character string of zero or more characters that is in a transliteration relationship with a partial character string of the other language, and a postfix non-transliteration segment indicating a partial character string of zero or more characters that is not in a transliteration relationship with a partial character string of the other language, determines, so as to be most likely, how each character string of the character string set is composed of the prefix non-transliteration segment, the transliteration segment, and the postfix non-transliteration segment, and associates characters between the partial character strings of the transliteration segment of the character string of the first language and the partial character strings of the transliteration segment of the character string of the second language, based on: a non-transliteration model selection probability representing the probability that a partial character string of the first language is a non-transliteration part not in a transliteration relationship with a partial character string of the second language; a non-transliteration model selection probability representing the probability that a partial character string of the second language is a non-transliteration part not in a transliteration relationship with a partial character string of the first language; a transliteration model selection probability representing the probability that a partial character string of the first language is a transliteration part in a transliteration relationship with a partial character string of the second language and that the partial character string of the second language is a transliteration part in a transliteration relationship with the partial character string of the first language; a non-transliteration model generation probability representing a generation probability in the first language for each partial character string of the prefix non-transliteration segment and the postfix non-transliteration segment of the character string of the first language; a non-transliteration model generation probability representing a generation probability in the second language for each partial character string of the prefix non-transliteration segment and the postfix non-transliteration segment of the character string of the second language; and a transliteration model generation probability representing a joint generation probability for each pair of partial character strings between the partial character strings of the transliteration segment of the character string of the first language and the partial character strings of the transliteration segment of the character string of the second language,
The association calculation unit includes an initial correspondence setting unit, a filtering unit, a sampling unit, and a stop condition determination unit,
a step in which the initial correspondence setting unit, for each of the character string sets, initially sets an association in which each character string of the character string set is composed of the prefix non-transliteration segment, the transliteration segment, and the postfix non-transliteration segment, and in which characters are associated between the partial character strings of the transliteration segment of the character string of the first language and the partial character strings of the transliteration segment of the character string of the second language;
a step in which the filtering unit, based on the association, set by the initial correspondence setting unit or updated the previous time, for each of the character string sets other than the character string set to be processed among the plurality of character string sets, calculates the non-transliteration model selection probability of the first language, the non-transliteration model selection probability of the second language, the transliteration model selection probability, the non-transliteration model generation probability for each partial character string of the character string of the first language of the character string set to be processed, the non-transliteration model generation probability for each partial character string of the character string of the second language of the character string set to be processed, and the transliteration model generation probability for each pair of partial character strings between a partial character string of the character string of the first language and a partial character string of the character string of the second language of the character string set to be processed;
a step in which the sampling unit, based on the non-transliteration model selection probability, the transliteration model selection probability, the non-transliteration model generation probability, and the transliteration model generation probability calculated by the filtering unit, determines how each character string of the character string set to be processed is composed of the prefix non-transliteration segment, the transliteration segment, and the postfix non-transliteration segment, associates characters between the partial character strings of the transliteration segment of the character string of the first language and the partial character strings of the transliteration segment of the character string of the second language, and updates the association for the character string set to be processed; and
a step in which the stop condition determination unit determines whether a predetermined stop condition is satisfied and causes the calculation by the filtering unit and the update by the sampling unit, which process each character string set, to be repeated until the stop condition is satisfied.

A program for causing a computer to function as each unit of the character string association device according to claim 1 or claim 2.