JP5565827B2

JP5565827B2 - A sentence separator training device for language independent word segmentation for statistical machine translation, a computer program therefor and a computer readable medium.

Info

Publication number: JP5565827B2
Application number: JP2009273137A
Authority: JP
Inventors: Michael Paul; Andrew Finch; Eiichiro Sumita
Original assignee: National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology
Priority date: 2009-12-01
Filing date: 2009-12-01
Publication date: 2014-08-06
Anticipated expiration: 2029-12-01
Also published as: JP2011118496A

Description

The present invention relates to preprocessing for natural language processing (NLP), and more particularly to segmenting the sentences of a corpus used to train models for statistical machine translation (SMT) or natural language understanding.

Word segmentation, i.e., identifying word boundaries in continuous text, is one of the fundamental preprocessing steps in data-driven NLP applications such as natural language understanding, information extraction, and machine translation. Unlike Indo-European languages such as English, many Asian languages, including Chinese and Japanese, do not use whitespace characters to separate meaningful word units.

Word segmentation for these languages faces the following problems.

(1) Ambiguity. In Chinese, for example, a single character can be a component of a word in one context and a word by itself in another.

(2) Unknown words. Existing words can be combined to form new words, for example proper nouns such as "White House".

Purely dictionary-based approaches in the prior art address these problems with maximum matching heuristics, but their accuracy depends heavily on the coverage of the dictionary used. Recent research on unsupervised word segmentation has focused on approaches based on probabilistic methods. For example, Non-Patent Document 1 proposes a probabilistic segmentation model based on unigram word distributions, and Non-Patent Document 2 uses standard n-gram (1 ≤ n ≤ 3) language models. As an alternative, Non-Patent Document 3 introduces a nonparametric Bayesian inference approach based on the Dirichlet process that incorporates unigram and bigram word dependencies.
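The maximum matching heuristic mentioned above can be illustrated with a minimal sketch. The function name `max_match`, the `max_len` limit, and the toy lexicon are assumptions made for illustration and do not come from the cited prior art.

```python
def max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching: take the longest lexicon entry at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


lexicon = {"白鳥", "湖"}  # toy dictionary, purely illustrative
print(max_match("白鳥の湖", lexicon))
```

With an empty lexicon the same call degrades to one unit per character, which shows why the accuracy of this heuristic is tied to dictionary coverage.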

Non-Patent Document 1: M. Brent. An efficient, probabilistically sound algorithm for segmentation and word discovery. Machine Learning, 34:71-105, 1999.
Non-Patent Document 2: A. Venkataraman. A statistical model for word discovery in transcribed speech. Computational Linguistics, 27(3):351-372, 2001.
Non-Patent Document 3: S. Goldwater. Contextual Dependencies in Unsupervised Word Segmentation. In Proc. of the ACL, pages 673-680, Sydney, Australia, 2006.
Non-Patent Document 4: J. Xu, J. Gao, K. Toutanova, and H. Ney. Bayesian Semi-Supervised Chinese Word Segmentation for SMT. In Proc. of the COLING, pages 1017-1024, Manchester, UK, 2008.
Non-Patent Document 5: A. Ratnaparkhi. A Maximum Entropy Model for Part-Of-Speech Tagging. In Proc. of the EMNLP, Pennsylvania, USA, 1996.

However, little attention has been paid to learning word segmentations that are optimal for machine translation of languages that do not use delimiters. Small translation units, such as single Chinese or Japanese characters, are likely to appear in the training corpus and can therefore be translated by SMT, but the contextual information such tokens provide may not be sufficient to obtain a good translation.

For example, a Japanese-English SMT system would likely translate the two consecutive characters "白" ("white") and "鳥" ("bird") as "white bird", whereas a human would translate the character sequence "白鳥" as "swan". Thus, the longer the translation unit, the more context can be exploited to find a meaningful translation. On the other hand, the longer the translation unit, the less likely such a token is to occur in the training data, owing to the sparseness of the language resources used to train the statistical translation models.

Accordingly, the word segmentation that is "optimal" for machine translation can be said to be the one that identifies translation units that are small enough yet still meaningful in the context of the given input sentence, and that achieve a trade-off between the translation task complexity and the coverage of the statistical models so as to optimize translation quality. Using monolingual probabilistic models does not necessarily yield good machine translation performance. Improvements have been reported, however, for several approaches that take bilingual as well as monolingual information into account in order to derive a word segmentation suited to SMT. Because of the availability of language resources, recent research has focused on optimizing Chinese word segmentation (CWS) for Chinese-to-English SMT. For example, Non-Patent Document 4 proposes a Bayesian semi-supervised approach to CWS that builds on Non-Patent Document 3. This generative model first segments the Chinese text using an off-the-shelf segmenter and then learns new word types and word distributions suited to SMT. In addition, segmentation consistency and the granularity of translation units are also important for improving CWS.

However, the system disclosed in Non-Patent Document 3 requires a linguistically derived word segmentation tool to obtain the initial segmentation.

Accordingly, one object of the present invention is to provide a sentence separator training device that does not require a linguistically derived word segmentation tool.

Another object of the present invention is to provide a sentence separator training device that automatically generates an optimal sentence separator without requiring a linguistically derived word segmentation tool.

According to a first aspect of the present invention, a sentence separator training device for training a sentence separator for a preselected first language includes: means for storing a bilingual corpus of translation pairs, each translation pair including a source language sentence in the first language and a target language sentence in a second language; a character-based first separator for segmenting the source language sentences in the bilingual corpus into character sequences separated by a predetermined delimiter; first training means for training SMT means using a bilingual training corpus containing translation pairs of sentences in the first language and the second language, the SMT means aligning each of the translation pairs in the bilingual training corpus during training; evaluation means for evaluating the performance of the SMT means trained by the first training means; second training means for training a second separator for the source language sentences of the bilingual corpus using the alignment results produced by the SMT means; segmentation means for segmenting the source language sentences in the bilingual corpus, using the second separator trained by the second training means, into segment sequences separated by the predetermined delimiter; iteration control means for controlling the first training means, the evaluation means, the second training means, and the second separator so that they operate repeatedly until a predetermined termination condition relating to the performance of the SMT means is satisfied; and selection means for selecting, in the first iteration of the iteration control means, the bilingual corpus containing the source language sentences segmented by the character-based first separator, selecting, in subsequent iterations, the bilingual corpus having the source language sentences segmented by the second separator, and causing the first training means to train the SMT means using the selected bilingual corpus as the bilingual training corpus.

The first separator segments the source language sentences in the bilingual corpus into character sequences separated by the predetermined delimiter. In the first iteration, the selection means selects the bilingual corpus containing the source language sentences segmented by the character-based first separator. The first training means trains the SMT means using the selected corpus as the bilingual training corpus. During training, the SMT means aligns each translation pair of the bilingual training corpus. The second training means trains the second separator using the alignment results produced by the SMT means. The segmentation means segments the source language sentences in the bilingual corpus into segment sequences using the second separator trained by the second training means. In the second iteration, the selection means selects the bilingual corpus segmented by the segmentation means, and the first training means trains the SMT means using the selected corpus as the bilingual training corpus. At the end of SMT training in each iteration, the evaluation means evaluates the performance of the SMT means trained by the first training means. The iteration control means stops the iteration when the predetermined termination condition is satisfied, for example when the evaluation deteriorates.
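The control flow described above can be sketched as a small loop. This is only an outline under stated assumptions: the function `bootstrap`, its five callable parameters, and the `max_iter` safety bound are hypothetical names introduced here, and the stopping rule shown is the "evaluation deteriorates" example.

```python
def bootstrap(corpus, unigram_segment, train_smt, evaluate, train_segmenter, max_iter=20):
    """Alternate SMT training and separator training until the score gets worse.

    corpus is a list of (source, target) sentence pairs. All callables are
    hypothetical placeholders supplied by the caller:
      unigram_segment(src)             -> character-segmented source sentence
      train_smt(pairs)                 -> (smt_model, alignments)
      evaluate(smt_model)              -> numeric score, higher is better
      train_segmenter(pairs, aligns)   -> callable that segments a raw source sentence
    """
    # First iteration: character-based segmentation of every source sentence.
    segmented = [(unigram_segment(src), tgt) for src, tgt in corpus]
    best_score, best_segmenter = None, None
    segmenter = None  # None stands for the character-based first separator
    for _ in range(max_iter):
        smt, alignments = train_smt(segmented)
        score = evaluate(smt)
        if best_score is not None and score < best_score:
            break  # worse than the preceding iteration: stop
        best_score, best_segmenter = score, segmenter
        # Learn the next separator from the alignments and re-segment the corpus.
        segmenter = train_segmenter(segmented, alignments)
        segmented = [(segmenter(src), tgt) for src, tgt in corpus]
    return best_segmenter, best_score
```

Returning the separator from the iteration before the score dropped is one possible design choice; a returned separator of `None` means the character-based segmentation already gave the best score.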

Because the first iteration uses a bilingual corpus whose source language sentences are segmented into character sequences, the segmentation requires no language-specific separator. The separators used in the following iterations can be trained from the alignment results obtained while training the SMT means. The sentence separator training device is therefore language independent, and a sentence separator training device that requires no linguistically derived word segmentation tool is provided.

In a second aspect based on the first aspect, the second training means includes: means for annotating each character in the source language sentences of the bilingual corpus, using the alignment results produced by the SMT means, with an annotation indicating whether or not the character is the end of a word; means for extracting a predetermined feature set for each character in the source language sentences of the bilingual corpus, the predetermined feature set reflecting the context of the character in question in the source language sentence and the context of the word aligned with that character in the target language sentence paired with the source language sentence; and means for training a probabilistic model used in the second separator, the probabilistic model being used to estimate, through statistical analysis of the feature sets extracted by the extraction means, the probability that a character in a source language sentence is the end of a word.

At the end of the training of the SMT means, the source language sentences of the bilingual corpus are annotated. The predetermined feature sets are extracted from the bilingual corpus annotated in this way, and the probabilistic model used in the second separator can be trained statistically using these feature sets and an existing statistical machine translation scheme.

Accordingly, a sentence separator training device is provided that automatically generates a statistically optimized sentence separator without requiring a linguistically derived word segmentation tool.

In a third aspect based on the second aspect, the probabilistic model includes a Maximum Entropy (ME) model.

The alignment results obtained during SMT training often contain some alignment errors. Training an ME model on the alignments, however, reduces the susceptibility to such errors. A sentence separator training device is thus provided that automatically generates a relatively accurate sentence separator without requiring a linguistically derived word segmentation tool.

According to a fourth aspect based on the third aspect, the iteration control means controls the first training means, the evaluation means, the second training means, and the second separator so that they operate repeatedly until the evaluation by the evaluation means becomes worse than its evaluation in the preceding iteration. A sentence separator training device is thus provided that automatically generates an optimal sentence separator without requiring a linguistically derived word segmentation tool. Because the optimization is carried out using SMT and automatic training of the separator, the resulting separator differs from one trained on a manually segmented bilingual corpus, and its segmentation should be better suited to SMT.

A fifth aspect of the present invention relates to a computer program that, when executed on a computer, causes the computer to function as a device according to any of the first to fourth aspects.

According to a sixth aspect of the present invention, a computer readable medium stores the computer program according to the fifth aspect.

A diagram illustrating the iterative bootstrap method according to one embodiment of the present invention.
A diagram showing a translation pair of a source language and a target language.
A flow chart of a program that implements the first embodiment of the present invention on a computer.
A diagram showing an example of an original translation pair and a translation pair whose source language sentence has been segmented on a character basis.
A diagram showing examples of source language sentence segmentation at various iterations of the iterative bootstrap method.
A diagram showing examples of source language sentence segmentation at various iterations of the iterative bootstrap method.
A diagram showing the change in system performance of the experimental systems.
A diagram showing the change in system performance of the experimental systems.
A diagram showing the change in vocabulary size and length for the experimental systems.
A diagram showing the change in vocabulary size and length for the experimental systems.
A diagram showing the change in out-of-vocabulary (OOV) size for the experimental systems.
A front view of a computer system 320.
A block diagram of the computer system 320.

[First embodiment]
(Overview)
In contrast to previous approaches, the present invention proposes a language-independent approach that does not require a linguistically derived word segmentation tool to obtain the initial segmentation. The proposed method uses a parallel corpus to align source language sentences, treated as character sequences, with the whitespace-separated word units of the target language. Consecutive characters aligned to the same target word are merged into a larger source language unit. The granularity of the translation units is therefore defined in the context of the given bilingual corpus. To minimize the side effects of alignment errors and to keep the segmentation consistent, a Maximum-Entropy (ME) algorithm is applied to learn the source language word segmentation that optimizes the translation quality of an SMT system trained on the re-segmented bilingual corpus.

Modern SMT systems embed a token-to-word alignment subsystem such as GIZA++. Such subsystems are known to output the most probable alignment between the tokens of a source language sentence and the words of a target language sentence, but the accuracy of these alignments is sometimes questionable.

Experiments were conducted in which the proposed segmentation method was applied to translation from five Asian languages (Japanese, Korean, Thai, and Chinese (Mandarin and Taiwanese)) into English. The results show that the proposed method outperforms a baseline system that translates character-segmented source language sentences, and that it achieves translation results comparable to those of SMT modules trained on bilingual corpora segmented with linguistic tools.

(Word segmentation)
The proposed word segmentation method consists of two steps. In the first step, a standard SMT model is trained on a parallel text corpus consisting of unigram-segmented source language character sequences and whitespace-separated target language words. The character-to-word alignment results of the SMT training procedure are used to identify consecutive source language characters that are aligned to the same target language word in the respective bilingual corpus, and these characters are merged into larger translation units.
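The merging of consecutive characters in the first step can be sketched as follows. The function name `merge_aligned`, the one-index-per-character alignment format, and the handling of unaligned characters (each kept as its own unit) are assumptions made for this illustration; real aligner output may be many-to-many.

```python
def merge_aligned(chars, alignment):
    """Merge runs of adjacent characters that are aligned to the same target word.

    chars:     list of source characters
    alignment: list of the same length; alignment[i] is the index of the target
               word aligned to chars[i], or None if the character is unaligned.
    """
    units = []
    prev = object()  # sentinel that never equals a real alignment value
    for ch, tgt in zip(chars, alignment):
        if units and tgt is not None and tgt == prev:
            units[-1] += ch  # same target word as the previous character
        else:
            units.append(ch)  # start a new translation unit
        prev = tgt
    return units


# Toy example: the first two characters are aligned to target word 0,
# the third is unaligned, and the fourth is aligned to target word 1.
print(merge_aligned(list("白鳥の湖"), [0, 0, None, 1]))
```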

In the second step, the word segmentation task is treated as character tagging with only two tags: "WB" (word boundary) if the given source language character is the last character of a merged character sequence aligned to a target language word, and "NB" (no boundary) otherwise. Using these alignment-based word boundary annotations, the ME method is applied to learn the optimal source language word segmentation.
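As a minimal sketch of the tagging scheme just described, merged units can be turned into per-character WB/NB labels like this. The function name `boundary_tags` and the list-of-strings input format are assumptions for illustration.

```python
def boundary_tags(units):
    """Return one (character, tag) pair per character.

    The last character of each merged unit is tagged "WB" (word boundary);
    every other character is tagged "NB" (no boundary).
    """
    tagged = []
    for unit in units:
        for i, ch in enumerate(unit):
            tagged.append((ch, "WB" if i == len(unit) - 1 else "NB"))
    return tagged


print(boundary_tags(["白鳥", "の", "湖"]))
```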

(1) Maximum entropy tagging model
ME models provide a general-purpose machine learning technique for classification and prediction. They are versatile tools that can handle large numbers of features, and they have proved very effective in a wide range of NLP tasks, including sentence boundary detection and part-of-speech tagging. A maximum entropy classifier is an exponential model consisting of a number of binary feature functions and their weights. The model is trained by adjusting the weights to maximize the entropy of the probabilistic model subject to the constraints imposed by the training data. The experiments use a conditional maximum entropy model, in which the conditional probability of the outcome given the set of features is modeled as in Non-Patent Document 5. The model has the following form:

where
t is the tag being predicted,
c is the context of t,
x is a normalization coefficient,
K is the number of features in the model,
f_k are binary feature functions,
α_k is the weight of the feature function f_k, and
p₀ is the default model.
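The formula itself is not reproduced in this text. Assuming the usual product form of a conditional maximum entropy model, an expression consistent with the symbol definitions above would be the following; the arrangement of the terms and the index range of k are assumptions rather than a copy of the original equation.

```latex
p(t \mid c) = x \prod_{k=1}^{K} \alpha_k^{f_k(c,\,t)} \cdot p_0
```

Because each f_k is binary, a feature contributes its weight α_k to the product only when it fires for the pair (c, t), and x rescales the result so that the probabilities over all tags sum to one.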

Table 1 shows the feature set. The lexical context features consist of source language character sequences annotated (tagged) with a tag t. c₀ denotes the context unit being tagged (for example a character or a word), and c₋₂, ..., c₊₂ denote the surrounding context units. t₀ denotes the current tag, t₋₁ the preceding tag, and so on. The tag context features supply information about the context of the preceding tag sequence. This conditional model can be used as a classifier. The model is trained iteratively, and the experiments used the Improved Iterative Scaling (IIS) algorithm.
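Since Table 1 is not reproduced here, the exact feature templates are unknown. The sketch below only illustrates the kind of window described above, assuming one feature per context unit c₋₂ to c₊₂ and one per preceding tag t₋₁ and t₋₂. The function name `extract_features`, the key names, and the padding symbol are hypothetical.

```python
def extract_features(units, tags, i, pad="<PAD>"):
    """Build a feature dict for position i.

    units: context units of one sentence (characters or larger tokens)
    tags:  tags already assigned to the positions before i
    Keys "c-2".."c+2" hold the context window; "t-1" and "t-2" hold the
    two preceding tags. Positions outside the sentence use the pad symbol.
    """
    feats = {}
    for off in range(-2, 3):
        j = i + off
        feats["c%+d" % off] = units[j] if 0 <= j < len(units) else pad
    for off in (-1, -2):
        j = i + off
        feats["t%+d" % off] = tags[j] if j >= 0 else pad
    return feats


print(extract_features(list("白鳥の湖"), ["NB", "WB", "WB", "WB"], 1))
```

A dictionary of this shape could then be expanded into binary feature functions, one per observed (key, value, tag) combination.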

(2) Iterative bootstrap method
The iterative bootstrap method proposed by the present invention for learning the optimal word segmentation for SMT is summarized in FIG. 1. Referring to FIG. 1, a bilingual corpus 30 containing target language sentences 32 and source language sentences 34 is prepared. Each target language sentence 32 is paired with the source language sentence that it translates.

Referring to FIG. 2, a translation pair 110 includes a source language sentence 112 and a target language sentence 114 that is a translation of the source language sentence 112.

Referring again to FIG. 1, in the initial iteration (iteration 0), each of the source language sentences 34 is split character by character into a unigram-segmented source language sentence 38 by a unigram separator 36. The unigram separator 36 simply inserts a space between each pair of adjacent characters in the source language sentence 34.
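A unigram separator of this kind is short enough to show in full. In this sketch the function name `unigram_segment` is hypothetical, and dropping any whitespace already present in the input is an added assumption.

```python
def unigram_segment(sentence):
    """Insert a single space between every pair of adjacent characters."""
    # Existing whitespace is discarded so that the output has exactly one
    # space between consecutive characters (an assumption of this sketch).
    return " ".join(ch for ch in sentence if not ch.isspace())


print(unigram_segment("白鳥の湖"))
```

Note that iterating over a Python string splits on code points, so scripts that use combining marks would need a grapheme-aware split instead.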

An SMT system 40 is trained using the bilingual corpus containing the target language sentences 32 and the unigram-segmented source language sentences 38. Because this is the initial iteration, the SMT system 40 is referred to as "SMT₀". During the training of SMT₀ 40, each sentence pair consisting of a target language sentence 32 and a unigram-segmented source language sentence 38 is aligned.

Before the next iteration begins, SMT₀ 40 is evaluated by decoding a development set of source language sentences (not shown) into target language sentences and scoring the decoded output with an automatic evaluation metric such as BLEU (K. Papineni, "BLEU: a Method for Automatic Evaluation of Machine Translation", in Proceedings of the 40th ACL, pages 311-318, Philadelphia, US, 2002) or METEOR (S. Banerjee et al., "METEOR: An Automatic Metric for MT Evaluation", in Proceedings of the ACL, pages 65-72, Ann Arbor, US, 2005). The score of the evaluation result 42 is saved. During the training of SMT₀ 40, the token-to-word alignment result 44 is extracted. The result 44 can be used to annotate each token in the unigram-segmented source language sentences 38. A feature set is extracted for each token in the annotated source language sentences 38 and used to train an ME classifier 46 (ME₁) capable of handling longer translation units.

After the ME classifier 46 has been learned from the initial character-to-word alignments of the respective bilingual corpus 30, it is applied to re-segment the source language sentences 34 of the unsegmented parallel corpus. This yields a segmented bilingual corpus containing the target language sentences 32 and the source language sentences 48, which can be used to retrain and re-evaluate another SMT system 50 (SMT₁) that is expected to achieve better translation performance than the initial SMT system (SMT₀).
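Once a classifier has predicted a boundary tag for every character, re-segmentation reduces to joining characters up to each boundary. The sketch below assumes the predicted tags are already available as a list of "WB"/"NB" strings; the function name `apply_tags` is hypothetical and the classifier itself is not shown.

```python
def apply_tags(chars, tags):
    """Join characters into units, closing a unit after every character tagged "WB"."""
    units, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag == "WB":
            units.append(current)
            current = ""
    if current:  # trailing characters with no closing WB tag
        units.append(current)
    return " ".join(units)


print(apply_tags(list("白鳥の湖"), ["NB", "WB", "WB", "WB"]))
```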

The unsupervised ME tagging method can likewise be applied to the token-to-word alignments extracted during the training of SMT₁, yielding an ME classifier 56 (ME₂) capable of handling longer translation units.

In this embodiment, the unigram-segmented source language sentences 38 are annotated using alignment results 44 from SMT_0 40. For example, if a character is determined during the training of SMT_0 40 to be the end of a word, it is labeled “WE” (Word End); otherwise it is labeled “NE” (Not End). The annotated source language sentences are then used to train the ME classifier. In this embodiment, a set of context features such as those shown in Table 1 is derived for each annotated character of the unigram-segmented source language sentences 38. ME classifier 46 is trained so that the entropy of the probability model is maximized under the constraints imposed by the training data. The ME model is trained statistically on the feature sets. As noted above, this embodiment uses a conditional maximum entropy model for ME classifier 46.
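The WE/NE annotation described above can be sketched as follows. This is a minimal illustration rather than the embodiment's actual code: the function name `label_characters` and the input format (one target-word index per source character, or `None` for an unaligned character) are assumptions, as is the choice to let an unaligned character stand alone.

```python
from typing import List, Optional


def label_characters(aligned_word: List[Optional[int]]) -> List[str]:
    """Return one "WE" (word end) or "NE" (not end) tag per source character.

    aligned_word[i] is the index of the target word that character i was
    aligned to, or None if the character was left unaligned (hypothetical
    input format chosen for this sketch).
    """
    tags = []
    last = len(aligned_word) - 1
    for i, word in enumerate(aligned_word):
        # A character closes a unit if it is the final character, if it is
        # unaligned (assumption), or if the next character is aligned to a
        # different target word.
        ends_unit = i == last or word is None or aligned_word[i + 1] != word
        tags.append("WE" if ends_unit else "NE")
    return tags
```

Consecutive characters aligned to the same target word receive NE tags until the last of them, so the tags encode exactly the merged units that the classifier is meant to reproduce.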

The next iteration is called the “first” iteration. In the first iteration, source language sentences 34 are segmented by ME classifier 46, resulting in segmented source language sentences 48. SMT 50 (SMT_1) is trained using a bilingual corpus that includes target language sentences 32 and segmented source language sentences 48. During training, each segment of the segmented source language sentences 48 is aligned with the corresponding word of the target language sentences 32. Alignment results 54 are extracted from SMT_1 50 and used to annotate the segmented source language sentences 48. The annotated segmented source language sentences 48 are used to train the ME classifier 56 of the next iteration (the ME_2 classifier).
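Once a classifier has assigned a WE or NE tag to every character, re-segmenting a sentence amounts to closing a segment after each WE. The sketch below assumes the tags are already available as a list; the name `segment_by_tags` is hypothetical, and the handling of trailing characters without a closing WE is a choice made here for illustration.

```python
from typing import List


def segment_by_tags(chars: List[str], tags: List[str]) -> List[str]:
    """Join characters into segments, closing a segment after each "WE" tag."""
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    segments: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag == "WE":
            segments.append(current)
            current = ""
    if current:
        # Trailing characters not closed by a WE tag form a final segment.
        segments.append(current)
    return segments
```

For example, the characters of 「もう真夜中」 tagged NE, WE, NE, NE, WE are grouped into the two segments 「もう」 and 「真夜中」.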

Meanwhile, the performance of SMT_1 50 is evaluated by decoding the source language development set sentences. Evaluation result 52 is compared with the stored evaluation result 42 of the initial iteration. If result 52 is better than result 42, the iteration continues. Otherwise, the iteration stops at this stage and ME classifier 46 is output as the optimal classifier for segmenting source language sentences.

If result 52 is better than the stored result 42, the evaluation result is stored and source language sentences 34 are segmented by ME classifier 56, resulting in segmented source language sentences 58. SMT 60 (SMT_2) is trained using a bilingual corpus that includes target language sentences 32 and segmented source language sentences 58. The alignment results for the source language sentences (not shown) are extracted during the training of SMT_2 60. The performance of SMT_2 60 is evaluated by the automatic evaluator. If evaluation result 62 of SMT_2 60 is worse than the stored result 52, the iteration ends and ME classifier 46 is output as the optimal classifier. If evaluation result 62 is better than the stored result 52, the next iteration is performed.

Training an ME classifier, segmenting source language sentences 34 with the ME classifier, training an SMT on the bilingual corpus containing the segmented source language sentences, and evaluating SMT performance are repeated in this way until an evaluation result is worse than the preceding one.

That is, referring to FIG. 1, assume that ME classifier 76 is trained in the (J-1)th iteration using the bilingual corpus alignment from the (J-2)th SMT training. In the (J-1)th iteration, source language sentences 34 are segmented by ME classifier 76. The resulting segmented text 78 is used together with target language sentences 32 to train SMT 80 (SMT_J-1). The performance of SMT_J-1 80 is evaluated. If evaluation result 82 is better than the preceding result, result 82 is stored and the alignment results from the training of SMT_J-1 80 are extracted. ME classifier 86 is trained using alignment results 84. Source language sentences 34 are segmented into segmented source language sentences 88. SMT 90 (SMT_J) is trained using a bilingual corpus that includes target language sentences 32 and segmented source language sentences 88. The performance of SMT_J 90 is evaluated by the automatic evaluator, and evaluation result 92 is compared with the preceding evaluation result 82. Assume here that result 92 is worse than result 82. The iteration then stops, and classifier 76, obtained in the preceding iteration, is identified and stored as the optimal classifier.

This bootstrap method iteratively generates a series of SMT_i, and translation complexity decreases each time, because larger chunks can be translated in a single step without word order or word disambiguation errors. At some point, however, the increasing length of the translation units learned from the training corpus causes overfitting, and translation performance on sentences not seen during training degrades. Therefore, if the J-th re-segmentation of the training corpus yields a lower automatic evaluation score on the unseen test set than the previous iteration, the bootstrap method is stopped. ME classifier 76 (ME_J-1), which achieved the highest automatic translation score, is then selected and output as the final word separator of the proposed method.
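The control flow of the bootstrap procedure can be sketched as a loop with an early stop. Everything below is an illustrative assumption: the four callables (`train_smt`, `evaluate`, `train_classifier`, `resegment`) stand in for the SMT training, automatic evaluation, ME training, and segmentation stages and are supplied by the caller; none of them is a real library API. In this sketch a return value of `None` for the classifier means that no learned segmentation beat the unigram baseline, which is a choice made here for illustration.

```python
from typing import Any, Callable, Optional, Tuple


def run_bootstrap(
    initial_corpus: Any,
    train_smt: Callable[[Any], Tuple[Any, Any]],
    evaluate: Callable[[Any], float],
    train_classifier: Callable[[Any, Any], Any],
    resegment: Callable[[Any], Any],
    max_iterations: int = 20,
) -> Tuple[Optional[Any], Optional[float]]:
    """Iterate until the evaluation score drops; return the best classifier."""
    corpus = initial_corpus      # unigram-segmented corpus at the start
    producer = None              # classifier that produced `corpus`
    best_classifier = None
    best_score = None
    for _ in range(max_iterations):
        smt, alignments = train_smt(corpus)   # SMT training also yields alignments
        score = evaluate(smt)                 # e.g. an automatic metric on a dev set
        if best_score is not None and score < best_score:
            break                             # worse than the stored result: stop
        best_score, best_classifier = score, producer
        producer = train_classifier(corpus, alignments)
        corpus = resegment(producer)          # re-segment the source sentences
    return best_classifier, best_score
```

Passing the stages in as callables keeps the sketch independent of any particular SMT toolkit or classifier implementation.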

(Computer Implementation)
Referring to FIG. 3, the computer program that implements this token classifier training device begins at step 140, in which source language sentences 34 of bilingual corpus 30 are segmented into unigrams to obtain unigram-segmented source language sentences 48, and then includes a step of training SMT_0 40 using a bilingual corpus that includes target language sentences 32 and segmented source language sentences 48.

Referring to FIG. 4, bilingual corpus 30 includes a large number of translation pairs (sentence pairs), such as sentence pair 240, each consisting of a source language sentence and the corresponding target language sentence. FIG. 4(A) shows pair 240, which contains a manually segmented source sentence, and FIG. 4(B) shows pair 242, which contains a unigram-segmented source language sentence. Here, “unigram-segmented” means “segmented into individual characters”.
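Unigram segmentation as defined above needs no language-specific knowledge, which a short sketch makes concrete. The function name `unigram_segment` and the default space delimiter are assumptions for illustration. Note that iterating over a Python string yields Unicode code points, so combining marks are split off as separate tokens.

```python
def unigram_segment(sentence: str, delimiter: str = " ") -> str:
    """Separate every non-whitespace character of `sentence` with `delimiter`."""
    # Existing whitespace is dropped so that each token is exactly one character.
    return delimiter.join(ch for ch in sentence if not ch.isspace())
```

Applied to 「もう真夜中」, this produces five single-character tokens separated by spaces.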

The program further includes a step (144) of evaluating the performance of the SMT using an automatic evaluator such as BLEU or METEOR to obtain an evaluation result, and a step (146) of determining whether this is the first iteration. If the determination at step 146 is YES, control proceeds to step 150. Otherwise, control proceeds to step 148. At step 148, it is determined whether the evaluation result calculated at step 144 is worse than the preceding result. If the determination is YES, control proceeds to step 164, where the ME classifier obtained in the preceding iteration is output as the optimal classifier, and the program ends. If the determination at step 148 is NO, control proceeds to step 150.

At step 150, the result calculated at step 144 is stored in a memory location.

The program further includes: a step (152) of storing the most recently obtained ME classifier in a memory location; a step (154) of extracting the alignment results from the preceding SMT training step; a step (156) of annotating the source language sentences using the alignment results; a step (158) of extracting a feature set for each token of the segmented source language sentences; a step (160) of training the ME classifier of the current iteration using the extracted feature sets; and a step (162) of segmenting the source language sentences with the ME classifier obtained at step 160 and returning control to step 142.

In the first iteration, the bilingual corpus containing the unigram-segmented source language sentences is selected and used for SMT training. In subsequent iterations, the bilingual corpus segmented with the ME classifier trained at step 160 is selected and used for SMT training. Because unigram segmentation is character-based, the segmentation at step 140 is language independent. No linguistically motivated word segmentation tool is therefore required.

Well-known tools exist for alignment during SMT training, but their alignment results often contain some alignment errors. Applying the alignment results directly to the segmentation of the bilingual corpus would therefore produce many errors. By statistically training the ME classifier on the alignment results of SMT training, however, the segmentation produced by the ME classifier contains comparatively few errors. The ME classifier obtained at the end of the iterations described above is optimal in the sense that it yields the best SMT performance among the SMTs obtained during the iterations.

Those skilled in the art will understand that a program configured in this way realizes the operation shown in FIG. 1 when executed by a computer.

(Experiments)
The effect of using different word segmentations was investigated using the multilingual Basic Travel Expressions Corpus (BTEC), a collection of sentences that bilingual travel experts consider useful for people traveling to or from foreign countries. For the word segmentation experiments, five Asian languages that do not naturally separate word units with delimiters were selected: Japanese (ja), Korean (ko), Thai (th), and two varieties of Chinese, namely Standard Mandarin (zh) and Taiwanese (tw).

Table 2 summarizes the characteristics of the BTEC corpus data sets used for training the SMT models (train), tuning the model weights (dev), and evaluating translation quality (eval). In addition to the number of sentences (sen) and the vocabulary size (voc), the sentence length (len) is given as the average number of words per sentence.

For all source languages, the statistics given were obtained using the linguistic segmentation tools available for each language. No such segmentation was available for Taiwanese.

The SMT models were trained using the standard word alignment proposed in F. Och and H. Ney, “A Systematic Comparison of Statistical Alignment Models”, Computational Linguistics, 29(1):19-51, 2003, and the language modeling tool proposed in A. Stolcke, “SRILM an extensible LM toolkit”, in Proceedings of ICSLP, pages 901-904, Denver, US, 2002. The decoder parameters were tuned on the dev set by minimum error rate training (MERT), using the technique proposed by Och and Ney. For translation, the multi-stack phrase-based decoder proposed in A. Finch et al., “The NICT/ATR Speech Translation System”, in Proceedings of the IWSLT, pages 103-110, Trento, Italy, 2007, which is comparable to the open-source toolkit MOSES, was used. Translation quality was evaluated with the standard automatic evaluation metrics BLEU and METEOR. In the experimental results of this embodiment, scores are listed as percentages.

Section (1) compares the method proposed in this embodiment with a baseline system that translates source sentences split into individual characters, and with SMTs trained on language-dependent, linguistically motivated word segmentation. Section (2) summarizes the effect of the iterative bootstrap method.

Section (1) Effects of word segmentation

Table 3 shows the automatic evaluation scores of SMTs trained on differently segmented source language resources, where:
“Character” refers to the baseline system, which uses unigram-segmented source text for translation.
“Proposed” refers to the SMT trained on the corpus segmented by the optimal ME tagging model proposed here.
“Linguistic” refers to the use of the language-dependent, linguistically motivated word segmentation tools available for each of the investigated languages except tw.

These results show that the method of this embodiment outperforms the baseline system for every language investigated, with consistent gains on both metrics: 0.8 to 2.0 points in BLEU and 0.5 to 3.0 points in METEOR. The size of the improvement, however, depends on the source language and its characteristics. For example, the smallest gain was obtained for Chinese, because a single character often constitutes a word on its own, which results in higher ambiguity than in Japanese, where consecutive hiragana or katakana characters form larger meaningful units.

Compared with linguistically motivated segmenters, the automatically learned word segmentation obtained slightly lower evaluation scores for most source languages, but for some tasks, such as Korean-English, the results of the method of this embodiment were very close to or better than those of the linguistic segmentation tool. For Thai, however, whose script differs markedly from the other scripts investigated, there was a large gap between the present method and linguistic segmentation.

Characteristics that affect the present segmentation method, and that should therefore be considered when generating the initial corpus segmentation, include the following: (1) The Thai script is a segmental writing system based on consonants in which vowel notation is obligatory, so the character splitting of the baseline system affects the dependencies on consonants. (2) The script uses tone markers placed above consonants, which the present approach treats as separate single characters. (3) Vowels pronounced after a consonant are non-sequential and can appear before, after, above, or below the consonant, which increases the number of word form variations in the training corpus and reduces the accuracy of the learned ME classifier.

Japanese is written using several scripts (kanji, hiragana, and katakana), so adding features about the script of a given token may help improve the performance of the ME classifier.

Finally, for Taiwanese, for which no linguistic tool was available to segment text written in traditional Chinese characters, the results show that the method of this embodiment can be applied successfully to the translation of any language for which no linguistically motivated segmentation tool is available.

Section (2) Effects of iterative bootstrapping
To examine the robustness of the method of this embodiment, FIGS. 7 and 8 show how system performance changed for each source language over the course of the iterative bootstrap method. The BLEU results in FIG. 7 and the METEOR results in FIG. 8 show that the best performance is reached in the first iteration for all languages, after which performance degrades slightly but consistently with each additional iteration. The reason is overfitting caused by the concatenation of source tokens that are aligned to longer target language phrases, which leads to segmentation into longer translation units.

The changes in vocabulary size and word length are summarized in FIGS. 9 and 10. Referring to FIGS. 9 and 10, the number of words extracted by the present method is considerably larger than that of the baseline system: the vocabulary size increased by a factor of 10 for Standard Mandarin and Taiwanese, 30 for Japanese and Korean, and 100 for Thai. In addition, for all languages investigated, the vocabulary was 1.5 to 2.5 times larger than that obtained with linguistic rules. The average word length also increased with each iteration; the translation units learned after ten iterations were about twice as long as the average word length of the first iteration.
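The two quantities tracked here, vocabulary size and average word length, are straightforward to compute from a segmented corpus. The sketch below is an illustration only: the name `vocabulary_stats` is hypothetical, and averaging the length in characters over all token occurrences (rather than over distinct vocabulary entries) is an assumption about how the figures were derived.

```python
from typing import Iterable, List, Tuple


def vocabulary_stats(corpus: Iterable[List[str]]) -> Tuple[int, float]:
    """Return (number of distinct tokens, mean token length in characters)."""
    vocabulary = set()
    total_chars = 0
    total_tokens = 0
    for sentence in corpus:
        for token in sentence:
            vocabulary.add(token)
            total_chars += len(token)
            total_tokens += 1
    # Guard against an empty corpus to avoid division by zero.
    average_length = total_chars / total_tokens if total_tokens else 0.0
    return len(vocabulary), average_length
```

Running this on the corpus produced after each iteration would give one data point per iteration for each of the two curves.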

The overfitting problem of the method of this embodiment shows up as an increase in out-of-vocabulary (OOV) words, that is, source language words in the unseen data set that the respective SMT cannot translate. The results given in FIG. 11 show a large increase in OOV words over the first three iterations, which lowers translation quality as shown in FIGS. 7 and 8.
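To make the OOV effect concrete, the sketch below collects the tokens of an unseen data set that never occur in the training corpus. The name `oov_tokens` and the representation of a corpus as a list of token lists are assumptions for illustration; both corpora are taken to be segmented by the same classifier.

```python
from typing import Iterable, List


def oov_tokens(
    train_corpus: Iterable[List[str]], unseen_corpus: Iterable[List[str]]
) -> List[str]:
    """Return every token occurrence in `unseen_corpus` absent from training."""
    known = {token for sentence in train_corpus for token in sentence}
    return [
        token
        for sentence in unseen_corpus
        for token in sentence
        if token not in known
    ]
```

A long merged token such as 「もう真夜中」 counts as OOV even when its parts 「もう」 and 「真夜中」 were both seen in training, which is how longer translation units can raise the OOV count.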

FIGS. 5 and 6 show several translation examples for the Japanese-English translation task using different segmentation schemes. In the figures, “REF” indicates the manual segmentation and the corresponding translation. The iteration that achieved the best translation is marked with “*”. In the first example, shown in FIG. 5, the concatenation 「もう真夜中」 (“already midnight”) produced by the method of this embodiment becomes an OOV word, so only a partial translation is achieved. In the second example, shown in the upper part of FIG. 6, the best translation is obtained with the word segmentation of the first iteration, which handles the sentence structure correctly, whereas this information is omitted in the baseline system output. Further concatenation, however, causes the same OOV problem as in the first example. The third example, shown in the lower part of FIG. 6, shows that longer tokens reduce translation complexity and therefore allow a better translation than the other segmentations, which cause more ambiguity.

Referring to FIG. 12, computer system 320 includes a computer 340 and a monitor 342, a keyboard 346, and a mouse 348, all connected to computer 340. Computer 340 further includes a DVD (Digital Versatile Disc) drive 350 and a semiconductor memory port 352.

Referring to FIG. 13, computer 340 further includes: a bus 366 connected to DVD drive 350 and semiconductor memory port 352; a CPU (Central Processing Unit) 356 that executes the computer program implementing the device described above; a ROM (Read Only Memory) 358 that stores the boot-up program of computer 340; a RAM (Random Access Memory) 360 that provides a work area used by CPU 356 and a storage area for programs executed by CPU 356; and a hard disk drive (HDD) 364 that stores bilingual corpus 30 (see FIG. 1) and other data.

When computer 340 is used as a translation training device, HDD 364 additionally stores the program for the SMT module, as well as the bilingual corpus and the test set.

Computer 340 further includes a network interface (I/F) 380 that is connected to bus 366 and connects computer 340 to a network 382.

The software that implements the system of the embodiment described above may be distributed in the form of object code recorded on a recording medium such as DVD 368 or semiconductor memory 370, provided to computer 340 through a reading device such as DVD drive 350 or semiconductor memory port 352, and stored on HDD 364. When CPU 356 executes the program, the program is read from HDD 364 and stored in RAM 360. Instructions are fetched into CPU 356 from the address specified by a program counter (not shown) in CPU 356 and executed. CPU 356 reads the data to be processed from a register in CPU 356, from RAM 360, or from HDD 364, and stores the processing result back in a register in CPU 356, in RAM 360, or on HDD 364.

Because the general operation of computer system 320 is well known, it is not described in detail here.

The software need not be distributed in a form fixed on a storage medium. For example, the software may be transmitted from another computer to computer 340 over network 382. Part of the software may be stored on HDD 364, with the remainder fetched from the network onto HDD 364 and integrated at execution time.

Typically, a modern computer uses functions provided by its operating system (OS) and executes them in a controlled manner according to the desired purpose. Accordingly, a program that does not itself contain these functions, which are provided by the OS or by a third party, and that merely specifies a combination and order of execution of such general functions also falls within the scope of this invention, as long as the program as a whole has a control structure that achieves the desired purpose.

In the embodiment described above, the iteration stops at step 148 (see FIG. 3) when the evaluation result is worse than that of the preceding iteration. The invention is not limited to such an embodiment, however. For example, the iteration may be stopped when the evaluation result is not higher than the preceding one, or a moving average of the evaluation results over a predetermined number of iterations may be used in place of the evaluation result of a single iteration.
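The three stopping rules mentioned here differ only in the comparison they make, as the sketch below illustrates. The function name `should_stop`, the mode names, and the default window of three iterations are all hypothetical. The text leaves open exactly how the moving average is applied; this sketch compares the new score against the average of the most recent stored scores, which is one possible reading.

```python
from typing import Sequence


def should_stop(
    previous: Sequence[float],
    new_score: float,
    mode: str = "worse",
    window: int = 3,
) -> bool:
    """Decide whether to stop iterating, given earlier scores and a new one."""
    if not previous:
        return False  # first iteration: nothing to compare against
    if mode == "worse":
        return new_score < previous[-1]
    if mode == "not_higher":
        return new_score <= previous[-1]
    if mode == "moving_average":
        recent = previous[-window:]
        return new_score < sum(recent) / len(recent)
    raise ValueError(f"unknown mode: {mode}")
```

The "not_higher" rule also stops on a tie, which avoids extra iterations when the score has plateaued, while the moving-average rule is less sensitive to a single noisy evaluation.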

Furthermore, the classifier used to tag the characters of the source language sentences of the bilingual corpus need not be an ME model; another statistical model may be used. For example, an SVM (Support Vector Machine) or a decision tree may be used in place of ME.

(Conclusion)
This embodiment proposes a new language-independent method for unsupervised segmentation of sentences that do not use whitespace to separate meaningful word units, in order to improve the performance of current SMT systems. The proposed method requires no linguistic information about the source language, which is important for building SMT systems for relatively minor languages, for which morphological analysis tools are often unavailable. In addition, the only development cost is that of creating the bilingual corpus, which is far lower than the cost of developing a linguistic word segmentation tool or of paying people to segment a data set manually.

The effect of the proposed method was investigated in the travel conversation domain by translating five Asian languages (Japanese, Korean, Thai, and Chinese (Standard Mandarin and Taiwanese)) into English. Automatic evaluation of the translation results showed consistent improvements of up to 2.0 points in BLEU and up to 3.0 points in METEOR over a baseline system that translates character-split input sentences. In addition, using only the given bilingual corpus to train the ME-based segmentation model, with no external information, the proposed method achieved translation results similar to those of SMT models trained on bilingual corpora segmented with linguistically motivated tools.

System performance could be improved further by handling script-specific characteristics of the source language when generating the initial word segmentation, and by identifying additional features for the ME-based tagging model. Moreover, the word segmentation schemes learned by the iterative bootstrap method differ considerably from one another. Combining different schemes in the SMT decoding process therefore has great potential to overcome the overfitting problem of word segmentation schemes that use long translation units and to improve system performance.

３０バイリンガルコーパス 30 bilingual corpus
３２ターゲット言語文 32 target language sentence
３４ソース言語文 34 source language sentence
３６ユニグラム分離器 36 unigram separator
４０、５０、６０、８０、９０ＳＭＴ 40, 50, 60, 80, 90 SMT
４２、５２、６３、８２、９２評価結果 42, 52, 63, 82, 92 evaluation result
４４、５４、８４トークン−単語対応付け結果 44, 54, 84 token-word alignment result
４６、５６、７６、８６ＭＥ分類器 46, 56, 76, 86 ME classifier
４８、５８、７８、８０セグメント化されたソース言語文 48, 58, 78, 80 segmented source language sentence

Claims

A sentence separator training device for training a sentence separator for a preselected first language, comprising:
means for storing a bilingual corpus of translation pairs, each of the translation pairs including a source language sentence of the first language and a target language sentence of a second language, each of the source language sentences being composed of an undelimited character string, and the words of each target language sentence being separated from one another by space characters;
a character-based first separator for segmenting each source language sentence in the bilingual corpus by separating its characters from one another with a predetermined delimiter;
first training means for training statistical machine translation means using a bilingual training corpus including translation pairs of sentences of the first language and the second language, the statistical machine translation means having a function of associating, during the training and for each of the translation pairs in the bilingual training corpus, each segment of the source language sentence separated by the delimiter with one of the words of the target language sentence;
evaluation means for evaluating the performance of the statistical machine translation means trained by the first training means;
second training means for training a second separator for the source language sentences, using the result of the association by the statistical machine translation means, such that, for each translation pair of the bilingual corpus, a plurality of consecutive characters in the source language sentence that are associated with the same word in the target language sentence are merged into a single string when the source language sentence is separated into segments;
segmentation means for separating the source language sentences in the bilingual corpus into segments using the second separator trained by the second training means, and for inserting the predetermined delimiter at each segment boundary;
iteration control means for controlling the first training means, the evaluation means, the second training means and the second separator so that they operate repeatedly until no improvement is observed in the evaluation by the evaluation means; and
means for selecting, in the first iteration under the iteration control means, the bilingual corpus containing the source language sentences segmented by the character-based first separator and, in subsequent iterations, the bilingual corpus containing the source language sentences segmented by the second separator, and for causing the first training means to train the statistical machine translation means using the selected bilingual corpus as the bilingual training corpus,
wherein the sentence separator training device outputs, as the trained sentence separator, the second separator selected on the basis of the evaluation by the evaluation means from among the second separators obtained by the time the iteration under the control of the iteration control means ends.
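
The iteration recited in the claim above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: `train_smt`, `evaluate` and `train_segmenter` are hypothetical callables standing in for the first training means, the evaluation means and the second training means, `max_iter` is an assumed safety cap that the claim does not mention, and keeping the segmenter whose corpus scored highest is one possible reading of "selected on the basis of the evaluation".

```python
from typing import Any, Callable, List, Tuple

Segmenter = Callable[[str], List[str]]
Corpus = List[Tuple[List[str], List[str]]]  # (source segments, target words)


def char_segment(sentence: str) -> List[str]:
    """Character-based first separator: every character is its own segment."""
    return [ch for ch in sentence if not ch.isspace()]


def bootstrap(
    pairs: List[Tuple[str, str]],
    train_smt: Callable[[Corpus], Tuple[Any, Any]],      # hypothetical: returns (model, alignments)
    evaluate: Callable[[Any], float],                     # hypothetical: e.g. a BLEU score
    train_segmenter: Callable[[Corpus, Any], Segmenter],  # hypothetical: learns a new separator
    max_iter: int = 10,                                   # assumed cap, not in the source
) -> Tuple[Segmenter, float]:
    segmenter: Segmenter = char_segment  # first iteration uses the character split
    best_segmenter, best_score = segmenter, float("-inf")
    corpus = [(segmenter(src), tgt.split()) for src, tgt in pairs]
    for _ in range(max_iter):
        model, alignments = train_smt(corpus)
        score = evaluate(model)
        if score <= best_score:
            break  # no improvement observed: stop iterating
        # The score belongs to the segmenter that produced the current corpus.
        best_segmenter, best_score = segmenter, score
        # Learn a new separator from the alignments and re-segment the corpus.
        segmenter = train_segmenter(corpus, alignments)
        corpus = [(segmenter(src), tgt.split()) for src, tgt in pairs]
    return best_segmenter, best_score
```

Note that in this sketch the score of each iteration is credited to the separator that segmented the corpus used in that iteration, so the separator trained in the final, non-improving iteration is discarded.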

The sentence separator training device according to claim 1, wherein the second training means includes:
means for annotating each character in the source language sentences of the bilingual corpus, using the result of the association by the statistical machine translation means, with an annotation indicating whether the character is the end of a word;
means for extracting a predetermined feature set for each character in the source language sentences of the bilingual corpus, the predetermined feature set reflecting a context that includes the annotation of the target character of the source language sentence; and
means for training a probability model used in the second separator by statistical analysis of the feature sets extracted by the extracting means, wherein the probability model is used to estimate the probability that a character in a source language sentence is the end of a word.
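
The annotation and feature extraction steps recited above might look like the following sketch. Several details are assumptions made only for illustration: the alignment is given as one target word index per source character (every character is assumed to be aligned), the tag names `"E"` (end of word) and `"I"` (inside a word) are chosen here, and the feature template (a character window plus the tags of the two preceding characters) is hypothetical rather than the feature set of the source.

```python
from typing import Dict, List


def end_of_word_tags(alignment: List[int]) -> List[str]:
    """Tag each source character from its aligned target word index.

    Consecutive characters aligned to the same target word form one unit;
    the last character of each unit gets "E", all others get "I".
    """
    tags = []
    for i, tgt in enumerate(alignment):
        is_last = i == len(alignment) - 1 or alignment[i + 1] != tgt
        tags.append("E" if is_last else "I")
    return tags


def context_features(chars: List[str], tags: List[str], i: int,
                     window: int = 1) -> Dict[str, str]:
    """Hypothetical feature template for character i."""
    feats = {}
    # Lexical context: the character itself and its neighbours.
    for off in range(-window, window + 1):
        j = i + off
        feats[f"c[{off}]"] = chars[j] if 0 <= j < len(chars) else "<pad>"
    # Tag context: annotations of the two preceding characters.
    for off in (-2, -1):
        j = i + off
        feats[f"t[{off}]"] = tags[j] if j >= 0 else "<pad>"
    return feats


def apply_tags(chars: List[str], tags: List[str]) -> List[str]:
    """Merge characters into segments, closing a segment at each "E" tag."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag == "E":
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


chars = list("東京に行く")
tags = end_of_word_tags([2, 2, 1, 0, 0])  # aligned to "go(0) to(1) Tokyo(2)"
print(tags)                     # ['I', 'E', 'E', 'I', 'E']
print(apply_tags(chars, tags))  # ['東京', 'に', '行く']
```

The pairs of feature dictionary and tag produced this way would then serve as training examples for the probability model.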

The sentence separator training device according to claim 2, wherein the probability model includes a maximum entropy model.
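
For reference, a maximum entropy model is conventionally written in the log-linear form below. The symbols are not taken from the source: $x$ denotes the context of a character, $y$ its annotation, $f_i$ the feature functions and $\lambda_i$ their learned weights.

```latex
p(y \mid x) =
  \frac{\exp\left(\sum_{i} \lambda_i f_i(x, y)\right)}
       {\sum_{y'} \exp\left(\sum_{i} \lambda_i f_i(x, y')\right)},
\qquad y \in \{\text{end of word},\ \text{not end of word}\}
```

Under this reading, the probability that a character ends a word is $p(\text{end of word} \mid x)$.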

The sentence separator training device according to any one of claims 1 to 3, wherein the iteration control means controls the first training means, the evaluation means, the second training means and the second separator so that they operate repeatedly until the evaluation by the evaluation means becomes worse than the evaluation by the evaluation means in the preceding iteration.

A computer program that, when executed by a computer, causes the computer to function as the device according to any one of claims 1 to 4.

A computer readable medium recording the computer program according to claim 5.