JP2006127405A

JP2006127405A - Method for carrying out alignment of bilingual parallel text and executable program in computer

Info

Publication number: JP2006127405A
Application number: JP2004318207A
Authority: JP
Inventors: Tsuya Cho; 艶張; Hidenori Kashioka; 秀紀柏岡
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2004-11-01
Filing date: 2004-11-01
Publication date: 2006-05-18

Abstract

PROBLEM TO BE SOLVED: To provide a method for aligning sentences of languages from different language families with high accuracy.
SOLUTION: The method for aligning the sentences of a bilingual parallel text 30 includes three steps. A first step provides a length-based probability model 34 that describes the probability that a sentence is a translation of a sentence in the other language, based on language information 42 including sentence-length information and corresponding word pairs. A second step aligns a bilingual parallel text 36 so that the sum of the alignment probabilities, calculated by a dynamic programming algorithm using the length-based probability model 34, is maximized (84, 88). A third step corrects the alignment so that punctuation marks in the bilingual text are aligned with each other (92).
COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to sentence alignment of bilingual corpora, and more particularly to extended sentence-length-based sentence alignment of bilingual corpora such as Chinese and English corpora.

Recently, many studies have been made on sentence alignment of bilingual corpora (see Non-Patent Documents 1, 2, and 3). Sentence alignment is one of the basic elements of machine translation, and it provides translation information and statistical parameters for a bilingual corpus. In particular, text alignment is essential for statistical machine translation.

In general, research on text alignment methods follows two basic and principal approaches: lexicon-based and statistical. The lexicon-based approach uses a bilingual lexicon to align sentences (Non-Patent Documents 3 and 4). Alignment between Chinese and English also requires other information, such as Chinese segmentation and part-of-speech (POS) information.

Statistical approaches require little linguistic knowledge. They rely only on the sentence lengths in the two languages and on the distribution of those lengths.

Currently, sentence-length-based approaches are widely used to align alphabetic languages, and they are known to give good results for languages of the same family. When the languages belong to different families, however, alignment accuracy is low. In particular, sentence-length-based alignment is known to be problematic when one of the languages is Chinese, because Chinese is entirely different from alphabetic languages.

In Non-Patent Document 2, Wu proposes a hybrid approach that combines the statistical approach with the lexicon-based approach. Drawing on the advantages of each single approach, Wu proposes a sentence-length-based method that uses a lexicon as cues.

Non-Patent Document 1: Gale, Church, "A Program for Aligning Sentences in Bilingual Corpora", Computational Linguistics 19(1): 75-102, 1991.
Non-Patent Document 2: Dekai Wu, "Aligning a Parallel English-Chinese Corpus Statistically with Lexical Criteria", in Proceedings of the Annual Meeting of ACL-32, pp. 80-87, 1993.
Non-Patent Document 3: S. F. Chen, "Aligning Sentences in Bilingual Corpora Using Lexical Information", in Proceedings of the Annual Meeting of ACL-31, 1993.
Non-Patent Document 4: M. Kay & K. Roescheisen, "Text-Translation Alignment", Computational Linguistics 19(1), pp. 121-142, 1993.
Non-Patent Document 5: P. F. Brown, J. C. Lai & R. L. Mercer, "Aligning Sentences in Parallel Corpora", in Proceedings of the Annual Meeting of ACL-29, pp. 169-176, 1991.

However, prior-art alignment of sentences in parallel text in two languages of different families has been found to have low accuracy. Chinese-English sentence alignment in particular is known to be very difficult and inaccurate.

Accordingly, one object of the present invention is to provide a method for aligning sentences in two languages of different language families with high accuracy.

Another object of the present invention is to provide a method for aligning Chinese-English sentences with high accuracy.

One aspect of the present invention relates to a computer-implemented method of aligning the sentences of a bilingual parallel text in a first language and a second language. The method includes the step of preparing a sentence-length-based probability model, for calculating the probability that a sentence in the first language and a sentence in the second language are translations of each other, based on sentence-length information and a set of word pairs, each pair including a word of the first language and a word of the second language. The method further includes the steps of aligning the sentences of the bilingual parallel text so that the total alignment probability, computed by a dynamic programming algorithm using the sentence-length-based probabilities, is maximized, and correcting the alignment so that corresponding punctuation marks of the bilingual parallel text are aligned with each other.

Combining the sentence-length-based probability model with word pairs of the first and second languages reduces the alignment error rate.

Preferably, each of the pairs includes a pair of words of a predetermined part of speech (mainly nouns). The word pairs are prepared as a bilingual lexicon.

More preferably, some of the pairs each include a proper noun that represents a predetermined entity (including objective and conceptual ones) in the first language and the corresponding proper noun that represents that entity in the second language.

Proper nouns are easy to find in both languages because their correspondence is clear. This makes it easy to judge whether an alignment is correct, and therefore greatly reduces alignment errors.

More preferably, the sentence-length-based probability model is prepared so that the probability that a sentence in the first language and a sentence in the second language are translations of each other is calculated from the ratio of the two sentence lengths.

Each sentence length may be counted as the number of characters in the sentence.

The number of words in a sentence varies with the type of language. With this arrangement, no segmentation of sentences is needed even when words are not separated by delimiters. The method can therefore be applied to any language, including Chinese and Japanese.

Another aspect of the present invention relates to a computer-executable program that, when executed on a computer, causes the computer to execute all the steps of any of the methods described above.

[Construction]
One embodiment of the present invention relates to a new approach for aligning Chinese-English sentences in parallel text. This hybrid approach is based mainly on the sentence-length-based approach, while also taking into account lexical information from a bilingual lexicon.

This hybrid approach not only makes further processing of Chinese, such as segmentation and part-of-speech tagging, unnecessary, but also improves sentence alignment accuracy by using some Chinese keywords within the statistical approach. The bilingual corpus used in this embodiment is an LDC (Linguistic Data Consortium) corpus.

FIG. 1 is a block diagram showing the configuration of a sentence alignment system 20 according to one embodiment of the present invention. Referring to FIG. 1, the sentence alignment system 20 includes a training corpus 30, which includes an English corpus 60 and a Chinese corpus 62 aligned sentence by sentence, and a training module 32 that uses the training corpus 30 to train a probability model 34 for the sentence-length-based approach. The probability model 34 is used to find the maximum-likelihood assignment between the English and Chinese corpora. For each pair of bilingual sentences, the probability model 34 assigns a probability score based on the ratio of the two sentence lengths.

The sentence alignment system 20 further includes: a sentence alignment device 38 that aligns an input corpus 36, which includes an English corpus 70 and a Chinese corpus 72, and outputs an aligned output corpus 44; a dictionary (lexicon) 40 used by the sentence alignment device 38 when segmenting the Chinese sentences in the Chinese corpus 72 and attaching appropriate part-of-speech (POS) tags to the Chinese words; and a language information storage unit 42 that stores linguistic information used to improve the alignment. The aligned output corpus 44 includes an English corpus 100 and a Chinese corpus 102, which are basically the same as the English corpus 60 and the Chinese corpus 62, respectively.

The sentence alignment device 38 includes a segmentation module 80 that segments the sentences of the Chinese corpus 72 into morphemes using the dictionary 40, and a POS tagging module 82 that attaches a POS tag to each morpheme using the dictionary 40. The purpose of segmentation and POS tagging is to find proper nouns in the bilingual corpus. The sentence alignment device 38 further includes an alignment module 84 that aligns the sentences in the English corpus 70 with the segmented and POS-tagged Chinese sentences so that the probability of the aligned sentence pairs, as calculated by the probability model 34, is maximized, and that outputs the maximum-probability alignment as a first result 86. Because the sentences are aligned using the probability model 34 alone, the first result 86 is expected to contain many errors.

The sentence alignment device 38 further includes a first correction module 88 that corrects the first result 86 using the language information stored in the language information storage unit 42 and outputs the corrected alignment as a second result 90, and a second correction module 92 that corrects the second result 90 using English and Chinese punctuation information.

The bilingual text in the training corpus 30 used in this embodiment is an LDC corpus. Its files are Hong Kong newspaper articles in three forms: Big5 coding, GB coding, and the corresponding English translations. In this embodiment, the Big5-coded text is converted to GB coding. After conversion, the attributes of the aligned bilingual corpus are analyzed. The completed parallel text contains high-quality literal translations. From this bilingual text, information identifying the translation type of each bilingual pair is obtained. Most Chinese sentences and their English translations are of the one-to-one type. One-to-two and two-to-one pairs are few, and very rarely there are other configurations, such as m-to-m with m greater than 2. All of these last types involving m are unified into the two-to-two type.

The sentence-length-based approach to sentence alignment (Non-Patent Document 5) is known to give good results for English and French and for other languages of the same family. Chinese is clearly different from alphabetic languages, so the alignment process for Chinese-English differs.

The sentence-length-based approach rests on a statistical model (the probability model 34) of the character-string lengths in the two languages. The basic principle is to assign a probability score to each pair of bilingual sentences based on the ratio of the two sentence lengths. Using the probability model 34 trained on the training corpus 30, the maximum-likelihood alignment is found among all possible alignments of the sentences in the input corpus 36.

The statistical model is shown below, where A is an aligned pair and L_S and L_T denote the source language and the target language, respectively.

Equation (1) is approximated by a maximization problem on the assumption that the probabilities of the individual aligned pairs are independent.

Equation (2) is further approximated using Bayes' theorem, so that the model depends only on the parameter δ, which is a function of |l_s| and |l_t|.

In the above equation, |l_s| and |l_t| are the lengths of the Chinese and English sentences, respectively, counted in characters. M(L_S, L_T) is the alignment "type".
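For orientation, the chain of approximations described above parallels the length-based formulation of Gale and Church (Non-Patent Document 1). The following is only a sketch in that form, on the assumption that equations (1) to (3) follow it; the symbol A^* and the product over aligned pairs are notation introduced here for illustration, and M is written per sentence pair.

```latex
\begin{align}
A^{*} &= \operatorname*{arg\,max}_{A} P(A \mid L_S, L_T) \\
      &\approx \operatorname*{arg\,max}_{A} \prod_{(l_s,\, l_t) \in A} P\bigl(M(l_s, l_t) \mid l_s, l_t\bigr) \\
      &\approx \operatorname*{arg\,max}_{A} \prod_{(l_s,\, l_t) \in A} P\bigl(M(l_s, l_t)\bigr)\, P\bigl(\delta(|l_s|, |l_t|) \mid M(l_s, l_t)\bigr)
\end{align}
```

The first line corresponds to the maximum-likelihood alignment, the second to the independence assumption over aligned pairs, and the third to the Bayes step, which leaves a prior on the alignment type and a length term that depends only on δ.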

Several methods currently exist for calculating sentence length (Non-Patent Document 5). To avoid segmentation errors, the number of characters in a sentence is counted rather than the number of words. Each Chinese character and each Chinese punctuation mark has a length of 2, whereas each English character and punctuation mark has a length of 1.
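The counting rule above can be sketched as follows. This is a minimal illustration, not the embodiment's code: the function name `sentence_length` is hypothetical, and it assumes that every non-ASCII code point is a Chinese character or Chinese punctuation mark and that spaces in English text count as one character each.

```python
def sentence_length(sentence: str) -> int:
    """Length in the units described in the text: 2 per Chinese character
    or Chinese punctuation mark, 1 per English character or punctuation mark."""
    length = 0
    for ch in sentence:
        # Assumption: any code point outside ASCII is a double-width
        # (Chinese) character or punctuation mark.
        length += 1 if ord(ch) < 128 else 2
    return length


if __name__ == "__main__":
    print(sentence_length("Hello."))  # six ASCII characters
    print(sentence_length("你好。"))   # three double-width characters
```

Because only characters are counted, no word segmentation of the Chinese text is needed at this stage.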

Here, the model is expressed by a value c and a variance s^2, where s^2 is proportional to the length. In this embodiment, c = 1.63 and s^2 = 3.2 for the LDC corpus (training corpus 30). In this case, δ follows a normal distribution and is defined below.
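As a hedged sketch of how δ might be computed, the code below assumes the Gale and Church form δ = (|l_t| - c·|l_s|) / sqrt(|l_s|·s^2), with c applied to the Chinese length, and converts δ into a two-tailed probability under a standard normal distribution. Only the values c = 1.63 and s^2 = 3.2 come from the text; the function names and the exact form of δ are assumptions.

```python
import math

C = 1.63   # value c stated in the text for the LDC training corpus
S2 = 3.2   # variance s^2 stated in the text


def delta(len_s: int, len_t: int, c: float = C, s2: float = S2) -> float:
    """Normalized length difference (assumed Gale-and-Church form).
    len_s is the Chinese length and len_t the English length, both > 0."""
    if len_s <= 0:
        raise ValueError("len_s must be positive")
    return (len_t - c * len_s) / math.sqrt(len_s * s2)


def two_tailed(d: float) -> float:
    """P(|Z| >= |d|) for a standard normal Z, i.e. 2 * (1 - Phi(|d|))."""
    return math.erfc(abs(d) / math.sqrt(2.0))


def match_probability(len_s: int, len_t: int) -> float:
    """Length-based score for one candidate sentence pair."""
    return two_tailed(delta(len_s, len_t))


if __name__ == "__main__":
    print(match_probability(100, 163))  # lengths in the expected ratio
    print(match_probability(100, 300))  # English side far too long
```

A pair whose lengths are close to the expected ratio receives a score near 1, while a badly mismatched pair receives a score near 0.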

There are four alignment types: one-to-one, one-to-two, two-to-one, and two-to-two pairs. Table 1 shows the prior probability of each alignment type. The four types are learned, and the corresponding probabilities are estimated from the frequency with which they occur in the aligned training corpus 30.
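Estimating the priors from type frequencies can be sketched as a relative-frequency count. The function name and the type labels are hypothetical, and the sample list in the usage part is made-up illustration data, not the values of Table 1.

```python
from collections import Counter
from typing import Dict, Iterable


def estimate_type_priors(types: Iterable[str]) -> Dict[str, float]:
    """Relative frequency of each alignment type in an aligned training corpus."""
    counts = Counter(types)
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}


if __name__ == "__main__":
    # Made-up observations for illustration only.
    observed = ["1-1"] * 8 + ["1-2", "2-1"]
    print(estimate_type_priors(observed))
```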

The sentence-length parameter alone is not sufficient for Chinese-English alignment; linguistic information is needed to improve alignment accuracy further. Alignment methods often rely on cognates, that is, words of the same origin, but phonetic or semantic cognates are difficult to recognize in Chinese-English parallel text.

For this reason, the first correction module 88 combines linguistic information, such as Chinese keyword information in the bilingual corpus, with the sentence-length-based approach. This information is stored in the language information storage unit 42 and consists mainly of the key words of sentences. These words belong to specific parts of speech and are considered to denote specific entities, including objective and conceptual ones. Typical examples are proper nouns such as the names of organizations or persons. The correspondence of such words between the first and second languages is clear. FIG. 2 shows an example of keyword information.

Referring to FIG. 2, the Chinese word shown in FIG. 2 and the English "Prince of Wales" are corresponding proper nouns that denote the same entity. By finding such corresponding words in aligned or misaligned sentences, the correctness or failure of an alignment can be confirmed, and the alignment can be corrected.
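The check described above can be sketched as a count of lexicon pairs that appear on both sides of a candidate sentence pair versus on one side only. This is an illustration under stated assumptions: the function name and the lexicon entries are hypothetical, and plain substring matching stands in for the segmentation and POS tagging performed by modules 80 and 82.

```python
from typing import List, Tuple


def keyword_agreement(zh_sentence: str, en_sentence: str,
                      lexicon: List[Tuple[str, str]]) -> Tuple[int, int]:
    """Return (matched, mismatched): keyword pairs found on both sides,
    and pairs found on exactly one side, of a candidate alignment."""
    matched = mismatched = 0
    en_lower = en_sentence.lower()
    for zh_word, en_word in lexicon:
        in_zh = zh_word in zh_sentence
        in_en = en_word.lower() in en_lower
        if in_zh and in_en:
            matched += 1
        elif in_zh or in_en:
            mismatched += 1
    return matched, mismatched


if __name__ == "__main__":
    # Hypothetical lexicon entries, for illustration only.
    lexicon = [("香港", "Hong Kong"), ("政府", "Government")]
    print(keyword_agreement("香港政府今天宣布。",
                            "The Hong Kong Government announced today.",
                            lexicon))
```

A candidate pair with mismatched keywords is a natural target for correction, whereas matched keywords support keeping the alignment.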

Here, using lexicon information, Equation (2) can be replaced by the following approximation.

Here, c_i = freq(Cword_i, L_S) and e_i = freq(Eword_i, L_T), where Cword_i and Eword_i denote Chinese and English word information in the bilingual corpus, respectively.

These are unified by the parameter δ_i as follows.

By Bayes' theorem, this is further transformed as follows.

The first correction module 88 corrects the first result 86 using Equation (5). In this way, the training module 32 and the language information 42 provide, based on sentence-length information and a set of word pairs, a sentence-length-based probability model for calculating the probability that a sentence in the first language and a sentence in the second language are translations of each other. Each of the pairs includes a word of the first language and a word of the second language. The alignment module 84 and the first correction module 88 align the sentences of the input corpus 36 so that the total alignment probability, computed by the dynamic programming algorithm using the sentence-length-based probabilities, is maximized.

In practice, punctuation information is also important for correcting the results of the sentence-length-based approach. Punctuation marks help determine the internal ordering between ordered bilingual texts.

The second correction module 92 corrects the second result 90 output from the first correction module 88 by punctuation matching. There are four main pairs of sentence punctuation marks, which mark the end of a sentence in Chinese and English respectively: "；/;", "。/.", "？/?", and "！/!". In the first result 86, some Chinese and English sentences may not be aligned correctly because of punctuation mismatches. These errors arise in the following situations.

(1) Punctuation marks such as "（）" and "“ ”" usually appear within a single sentence in Chinese, but in English they may be split across separate sentences into two independent pairs. This situation arises when the Chinese sentence is quoted from another newspaper or person and contains two or more complete sentences.

(2) As shown in FIG. 3, when a Chinese sentence contains a plurality of clauses delimited by Chinese commas and each clause has an English translation that ends with sentence-final punctuation, the initial alignment type is one-to-two if only the clauses are aligned. Aligning the complex Chinese and English sentences as wholes therefore changes the alignment type to one-to-one.

Situation (1) is resolved by finding the matches of all punctuation pairs and merging the split Chinese sentences. In situation (2), the number of clauses in the Chinese sentence is counted and the alignment type is changed to one-to-one. Alignment errors spread easily in situation (2).
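Two building blocks of the punctuation check can be sketched as follows: testing whether the sentence-final marks of a candidate pair belong to one of the four pairs listed above, and counting the clauses of a Chinese sentence by its commas. The function names are hypothetical, and the merging logic of the second correction module 92 itself is not reproduced here.

```python
# The four sentence-final punctuation pairs named in the text (Chinese -> English).
PUNCT_PAIRS = {"；": ";", "。": ".", "？": "?", "！": "!"}


def final_punct_matches(zh_sentence: str, en_sentence: str) -> bool:
    """True if both sentences end with corresponding punctuation marks."""
    zh = zh_sentence.rstrip()
    en = en_sentence.rstrip()
    if not zh or not en:
        return False
    # .get() returns None for an unlisted mark, which never equals en[-1].
    return PUNCT_PAIRS.get(zh[-1]) == en[-1]


def count_clauses(zh_sentence: str) -> int:
    """Number of clauses delimited by Chinese commas (situation (2))."""
    return zh_sentence.count("，") + 1


if __name__ == "__main__":
    print(final_punct_matches("你好吗？", "How are you?"))
    print(count_clauses("今天下雨，我们不去，明天再说。"))
```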

The sentences are aligned by the alignment module 84 using a dynamic programming algorithm. Let the source-language sentences be S_i, i = 1…s, and the target-language sentences be T_j, j = 1…t. The sequence of aligned sentences is A(k) = {<S_k, T_k>, k ∈ [0, K]}, where K is the number of aligned sentences.

The goal of the dynamic programming procedure is to find the aligned sentence pairs that minimize Equation (6), where g(i, j) is computed recursively.

FIG. 4 shows the dynamic programming process. Referring to FIG. 4, the procedure begins at step 110, where the variables i and j are initialized to 1 and g(i, j) is initialized to 0. In step 112, four candidate values of g(i, j) are calculated according to the match type and the values g(i-1, j-1), g(i-1, j-2), g(i-2, j-1), and g(i-2, j-2).

In step 114, it is determined whether there are two or more paths to the end point. If so, the path with the smallest g(i, j) is selected in step 116 and control proceeds to step 118; otherwise, control proceeds directly to step 118. In step 118, it is determined whether the end of the paragraph has been reached. If it has, a path is judged to have been found successfully and the process ends. Otherwise, the variable i is incremented by 1 in step 120 and control returns to step 112.

Through this dynamic programming, the alignment module 84 shown in FIG. 1 determines the sentence-length-based alignment.
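The recurrence over the four alignment types can be sketched as a standard two-index dynamic program that minimizes a summed cost of -log(prior × length probability). This is an illustration under stated assumptions rather than the flow of FIG. 4: the priors are made-up placeholders (Table 1 is not reproduced in this text), the length term assumes a Gale-and-Church style δ, all sentence lengths are assumed positive, and the names `match_cost` and `align` are hypothetical.

```python
import math
from typing import List, Tuple

C, S2 = 1.63, 3.2  # values stated in the text for the LDC training corpus

# Placeholder priors for illustration only; not the values of Table 1.
PRIORS = {(1, 1): 0.90, (1, 2): 0.04, (2, 1): 0.04, (2, 2): 0.02}

Span = Tuple[Tuple[int, int], Tuple[int, int]]


def match_cost(len_s: int, len_t: int, prior: float) -> float:
    """-log(prior * P(delta)) for one candidate pair (assumed delta form)."""
    d = (len_t - C * len_s) / math.sqrt(len_s * S2)
    p = max(math.erfc(abs(d) / math.sqrt(2.0)), 1e-300)  # avoid log(0)
    return -math.log(prior * p)


def align(src_lens: List[int], tgt_lens: List[int]) -> List[Span]:
    """Return aligned spans ((i0, i1), (j0, j1)) as half-open index ranges."""
    s, t = len(src_lens), len(tgt_lens)
    inf = float("inf")
    g = [[inf] * (t + 1) for _ in range(s + 1)]
    back = [[None] * (t + 1) for _ in range(s + 1)]
    g[0][0] = 0.0
    for i in range(s + 1):
        for j in range(t + 1):
            # Try the four types: 1-1, 1-2, 2-1, 2-2.
            for (di, dj), prior in PRIORS.items():
                pi, pj = i - di, j - dj
                if pi < 0 or pj < 0 or g[pi][pj] == inf:
                    continue
                cost = g[pi][pj] + match_cost(
                    sum(src_lens[pi:i]), sum(tgt_lens[pj:j]), prior)
                if cost < g[i][j]:
                    g[i][j] = cost
                    back[i][j] = (pi, pj)
    if g[s][t] == inf:
        raise ValueError("no alignment with the four allowed types")
    # Trace the minimum-cost path back from the end point.
    path, i, j = [], s, t
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        path.append(((pi, i), (pj, j)))
        i, j = pi, pj
    return path[::-1]


if __name__ == "__main__":
    # One Chinese sentence paired 1-1, then one Chinese sentence covering two English ones.
    print(align([10, 40], [16, 30, 35]))
```

Minimizing the summed negative log probability is equivalent to maximizing the product of the pair probabilities, which is why the text speaks of both a minimum g(i, j) and a maximum total probability.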

[Operation]
The sentence alignment system 20 operates in two phases. In the first phase, the training module 32 trains the probability model 34 using the training corpus 30. Once the probability model 34, the dictionary 40, and the language information storage unit 42 have been prepared, the sentence alignment device 38 can perform the second phase.

When the input corpus 36 is given, the segmentation module 80 segments the sentences of the Chinese corpus 72 into morphemes, and the POS tagging module 82 attaches the corresponding POS tag to each morpheme. These operations are performed using the dictionary 40.

The alignment module 84 then aligns the sentences of the English corpus 70 with the sentences of the Chinese corpus 72 by the dynamic programming shown in FIG. 4, using the probability model 34. The alignment module 84 outputs the alignment corresponding to the minimum g(i, j) as the first result 86.

Because the first result 86 is expected to contain errors, the first correction module 88 corrects the first result 86 using language information such as that shown in FIG. 2. The corrected alignment is output from the first correction module 88 as the second result 90.

The second correction module 92 further corrects the second result 90 based on the matching of punctuation marks. The corrected result is output as the aligned output corpus 44.

The sentence-length-based approach and the lexicon-based approach each have strengths and weaknesses in Chinese-English sentence alignment. The sentence-length-based approach depends mainly on statistical parameters of the two languages and is independent of linguistic knowledge. This embodiment is advantageous in that it takes keyword information and punctuation into account together, based on features specific to the Chinese language.

A purely lexicon-based approach is not well suited to Chinese-English alignment on large corpora, particularly because Chinese processing is complex. The sentence-length-based approach is suited to large bilingual texts and is portable across languages. Because Chinese is insensitive to the sentence-length parameter, the extended sentence-length-based approach shown in this embodiment combines the advantages of the two approaches described above.

[Experimental results]
Experiments were carried out using an LDC corpus of Hong Kong news published in Hong Kong newspapers between 1997 and 1999. The test text was selected at random and contains 2880 bilingual parallel sentences.

Alignment accuracy was calculated by the following Equation (7):
Accuracy = Num_c / Num_t (7)
where Num_c is the number of correctly aligned sentences as judged by a human, and Num_t is the total number of aligned sentences in the test set. Table 2 shows the test results.

These results show that this embodiment, which follows the extended approach, achieves better alignment accuracy than the pure sentence-length-based approach when lexicon information and punctuation marks are introduced to correct the results.

Those skilled in the art will appreciate that the above-described embodiment of the present invention can be implemented with computer hardware and computer software executed on that hardware. The computer program may be stored on a computer-readable medium and distributed. Such computer software, a medium storing the software, and a computer programmed with the software fall within the scope of the present invention as long as the program executes all the steps, or satisfies all the functions, recited in any of the claims.

The embodiment disclosed herein is merely an example, and the present invention is not limited to the embodiment described above. The scope of the present invention is indicated by each of the claims, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

FIG. 1 is a block diagram showing the structure of the sentence alignment system 20 according to one embodiment of the present invention.
FIG. 2 is a diagram showing examples of entries in the language information storage unit 42.
FIG. 3 is a diagram showing an example of a punctuation mismatch.
FIG. 4 is a diagram showing the dynamic programming procedure for aligning the sentences of two corpora.

Explanation of symbols

20 sentence alignment system
30 training corpus
32 training module
34 probability model
36 input corpus
38 sentence alignment device
40 dictionary
42 language information
44 output corpus
80 segmentation module
82 POS tagging module
84 alignment module
88 first correction module
92 second correction module

Claims

A computer-implemented method for aligning the sentences of a bilingual parallel text in a first language and a second language, comprising the steps of:
preparing a sentence-length-based probability model, for calculating a probability that a sentence in the first language and a sentence in the second language are translations of each other, based on sentence-length information and a set of word pairs, each of the pairs including a word of the first language and a word of the second language;
aligning the sentences of the bilingual parallel text so that the total alignment probability, computed by a dynamic programming algorithm using the sentence-length-based probabilities, is maximized; and
correcting the alignment such that corresponding punctuation marks of the bilingual parallel text are aligned with each other.

The method of claim 1, wherein each of the pairs includes a predetermined part-of-speech word pair.

The method according to claim 1, wherein some of the pairs each include a proper noun that represents a predetermined entity in the first language and a corresponding proper noun that represents the predetermined entity in the second language.

The method according to any one of claims 1 to 3, wherein the sentence-length-based probability model is prepared such that the probability that a sentence in the first language and a sentence in the second language are translations of each other is calculated based on a ratio of the lengths of the two sentences.

The method according to claim 1, wherein the length of each sentence is counted by the number of characters in the sentence.

A computer-executable program that, when executed on a computer, causes the computer to execute all the steps according to any one of claims 1 to 5.