JP3778705B2

JP3778705B2 - Bilingual document matching system

Info

Publication number: JP3778705B2
Application number: JP26968098A
Authority: JP
Inventors: 達哉介弘; 秀樹山本
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1998-09-24
Filing date: 1998-09-24
Publication date: 2006-05-24
Anticipated expiration: 2018-09-24
Also published as: JP2000099511A

Description

【０００１】
【発明の属する技術分野】
本発明は、英語−日本語などの原文と訳文で構成される対訳文書における文の対応付けを行う対訳文書対応付けシステムに関する。
【０００２】
【従来の技術】
英語−日本語といった原文と訳文とで構成される対訳文書中の各文の対応付けを、対訳辞書を用いて行うシステムがある。例えば、このようなシステムに関する文献として、
「対訳辞書および統計情報を用いた二言語対訳テキスト照合」宇津呂武仁、松本裕治、Ｖｏｌ．１２、Ｎｏ．５、Ｓｅｐ．１９９５があった。
【０００３】
この文献では、対訳文書の文の対応付けを対訳辞書を利用してダイナミックプログラミング（ＤＰ）で行う方法を述べている。このような対応付けをダイナミックプログラミングを用いて行うには、先ず、原文および訳文を１文毎に区切り、その文の形態解析を行って単語毎に分割する。そして、それらの単語の中から自立語を取り出し、対訳辞書を用いて、それぞれの文の中の自立語がどの程度対応しているかで評価する。例えば、評価する方法として以下のような式を用いる。
【０００４】
ｈ（ｘ，ｙ）＝２×ｆｍ（ｘ，ｙ）／（ｆｊ（ｘ）＋ｆｊ（ｙ））
ここで、
ｘは原文中の文（複数文の場合もある）
ｙは訳文中の文（複数文の場合もある）
ｈ（ｘ，ｙ）は評価関数
ｆｍ（ｘ，ｙ）は、ｘ，ｙの中で対応のついた自立語の数
ｆｊ（ｘ）は、ｘ中の自立語の数
ｆｊ（ｙ）は、ｙ中の自立語の数
である。
【０００５】
このような評価関数を用いることによって、対応する文同士の評価関数の値は大きくなり、対応していない文同士の評価関数は小さくなる。これを文の先頭から調べていき、評価関数の和が最も大きくなるような組み合わせを対応付け問題の解とする。
【０００６】
【発明が解決しようとする課題】
しかしながら、上記のように、評価関数を用いて対応付けを行うには、対訳辞書を引いたり、文字列のマッチング処理を行うため、評価値を算出するには時間がかかる。更に、ＤＰを用いた方法は、原文の文数×訳文の文数に比例した時間がかかり、実行速度に問題があった。それは文の対応のほとんどが１文対１文の対応であり、１文対２文、２文対１文まで含めるとほとんど１００パーセントに近い確率で当てはまるにもかかわらず、１文対３文、３文対１文や１文対４文、４文対１文になるまでの可能性を考慮して、考えられる全ての組の評価値を計算しているからである。
【０００７】
このような問題に対処するためには、１文対３文、３文対１文などを計算しないといった方法が容易に考えられる。しかしながら、上記のような確率値は対訳文書の分野等によっては変化する恐れがあり、場合によっては、１文対３文といった対応が最適解であることも考えられる。ところが、このような方法は、１文対３文の対応があった場合にこれを見つけられないため、場合によっては対応付けの精度が低くなってしまうという問題があった。
【０００８】
このような点から、対応付けの精度を低下させることなく処理を高速化することのできる対訳文書対応付けシステムの実現が望まれていた。
【０００９】
【課題を解決するための手段】
本発明は、前述の課題を解決するため次の構成を採用する。
〈構成〉
本発明は、原語文書に含まれる複数の原語文と、原語文書の訳語文書に含まれる複数の訳語文との対応関係を付けるための対訳文書文対応付けシステムであって、原語文の数と訳語文の数とを比較し、原語文の数が訳語文の数と等しい場合、各原語文及び各訳語文を文並びに従って一対一に対応させたそれぞれの対応組を順に設定してなる各初期パスを求め、原語文の数が訳語文の数より大きい場合、各原語文及び各訳語文を文並びに従って一対一及び二対一のいずれかに対応させたそれぞれの対応組を順に設定してなる各初期パスを求め、原語文の数が訳語文の数より小さい場合、各原語文及び各訳語文を文並びに従って一対一及び一対二のいずれかに対応させたそれぞれの対応組を順に設定してなる各初期パスを求め、引き続き、原語文と訳語文との対応数を順次増加させる毎に、他の対応組からなる他のパスを順次求める組合せ管理手段と、初期パス及び他のパス毎に、各対応組に対応して、原語文及び訳語文の対応度を示すそれぞれの評価値を文の評価要素に基づき算出し、該算出した各評価値を順次、加算してパス毎の評価値の総和を求める評価値計算手段と、各いずれかの初期パスを求めた場合のパス毎の評価値の総和の最大値を暫定解として該当する初期のパスに対応させて保持し、且つ、評価値の設定評価値を保持し、他のパスに対して計算済みの評価値を取り込んでその途中和を計算し、未計算の対応組の評価値が計算される前に、該未計算の対応組の数を調べて該数と設定評価値とを積算し、該積算値を途中和に加算してその加算結果が暫定解より小さいと判定すると他のパスの計算を中止させ、他のパスの評価値の総和が暫定解より大きいと判定すると、暫定解を更新し、更に、他のパスにおいて対応組の原語文と訳語文との対応数が最大対応付け文数になると、組合せ管理手段に該最大対応付け文数以上の対応組に関する他のパスの生成を終止させると共に、現在の暫定解に対応するパスを文の対応付けの解とする計算結果管理手段と
を備えることを特徴とする。
【００１０】
この設定評価値は、組合せが取り得る最大評価値及び組合せの平均及び分散から計算される評価基準値のいずれかである。
【００１２】
【発明の実施の形態】
本発明は、１文対１文や１文対２文等、可能性の高いものから優先的に計算し、暫定解を利用して正解の可能性を絞ることにより、ＤＰで使用する評価値の計算量を減らすようにしたものである。
【００１３】
以下、本発明の実施の形態を具体例を用いて詳細に説明する。
【００１４】
《具体例１》
〈構成〉
図１は本発明の対訳文書対応付けシステムの具体例１を示す構成図である。
図のシステムは、原文ファイル１０１、訳文ファイル１０２、文分割手段１０３、形態素解析手段１０４、組み合わせ管理手段１０５、評価値計算手段１０６、計算結果管理手段１０７、対応タグ付き原文ファイル１０８、対応タグ付き訳文ファイル１０９、対訳辞書１１０からなる。
【００１５】
原文ファイル１０１は、複数の原文からなる文書ファイルであり、例えば英語の文書ファイルである。訳文ファイル１０２は、複数の訳文からなる文書ファイルであり、例えば日本語の文書ファイルである。
【００１６】
文分割手段１０３は、原文ファイル１０１および訳文ファイル１０２の文書ファイルを１文毎に分割する機能を有している。例えば、英文であればピリオド、日本文ならば句点等で分割を行うよう構成されている。
【００１７】
形態素解析手段１０４は、文分割手段１０３で分割された英文や日本文に対して形態素解析を行い、単語毎に分割する機能を有している。尚、これら文分割手段１０３および形態素解析手段１０４は既知の構成を用いることができる。
【００１８】
組み合わせ管理手段１０５は、原文と訳文との対応組の組み合わせを求めると共に、文分割手段１０３や形態素解析手段１０４の結果に基づき、原文と訳文の原文ファイル１０１および訳文ファイル１０２中の文数から、原文と訳文との対応付けの最適解の可能性が高い組み合わせを求める機能を有している。
【００１９】
評価値計算手段１０６は、原文と訳文とが対応している程、高い値となる評価値を求めると共に、暫定解として、組み合わせ管理手段１０５で求めた可能性の高い組み合わせの評価値の和を求める機能を有している。具体的には、従来技術の項で説明した評価関数の式に基づいて評価値を計算する構成、あるいは他の手段により評価値を計算する構成であってもよい。
【００２０】
計算結果管理手段１０７は、評価値計算手段１０６で計算したＤＰの評価値を格納するテーブル（図示せず）を備え、組み合わせ管理手段１０５で求めた原文と訳文との組み合わせ（パス）の評価値の計算を、組み合わせの先頭の文から順次計算するよう評価値計算手段１０６に指示し、得られた評価値を前記のテーブルに格納する機能を有している。そして、これらのテーブルの評価値に基づき、任意の組み合わせで、最後まで計算しても暫定解を上回る値にならないと判定できた場合は、その時点で評価値の計算を中止させる機能を有している。
【００２１】
対応タグ付き原文ファイル１０８および対応タグ付き訳文ファイル１０９は、それぞれ原文ファイル１０１と訳文ファイル１０２に文の対応を示すためのタグを付与したものである。
【００２２】
対訳辞書１１０は、対応付けするための原文の単語を引くと訳文の語が複数あるような辞書である。例えば、原文が英語、訳文が日本語の場合英和辞典に相当する。
【００２３】
上記の対訳文書対応付けシステムは、マイクロコンピュータ等で構成され、原文ファイル１０１、訳文ファイル１０２、対応タグ付き原文ファイル１０８、対応タグ付き訳文ファイル１０９、対訳辞書１１０は、ハードディスク装置等の外部記憶装置あるいは半導体メモリに設けられている。また、文分割手段１０３〜計算結果管理手段１０７は、それぞれの手段に対応したプログラムとこれを実行するためのプロセッサや主記憶装置等で構成されている。
【００２４】
〈動作〉
図２は、本具体例の動作を示すフローチャートである。
先ず、文分割手段１０３によって、原文ファイル１０１と訳文ファイル１０２の文区切りを行う（ステップＳ１００）。ここで、原文の文数をｍ、訳文の文数をｎとする。また、組み合わせ管理手段１０５において、文の対応組を徐々に増やしていくための変数ｉに２をセットする。
【００２５】
次に、原文の文数と訳文の文数とが等しいか、即ち、ｍ＝ｎかをチェックする（ステップＳ１０１）。ここで、等しければ評価値を計算する文の組み合わせを（１，１）のみとし（ステップＳ１０３）、ステップＳ１０６に移行する。一方、ステップＳ１０１において、等しくなければ、ｍ＞ｎであるかを判定する（ステップＳ１０２）。
【００２６】
ステップＳ１０２において、ｍ＞ｎであれば、計算する対応組を（１，１）、（２，１）として（ステップＳ１０４）、ステップＳ１０６に移行し、ｍ＜ｎであれば計算する対応組を（１，１）、（１，２）として（ステップＳ１０５）、ステップＳ１０６に移行する。
【００２７】
ステップＳ１０６では、評価値計算手段１０６によって、対応組の評価関数を計算し、計算結果を計算結果管理手段１０７に送る。これにより、計算結果管理手段１０７は、最も点数の高いパスを暫定解とする（ステップＳ１０７）。尚、パスとは、先頭文から最後の文までどのようなルート（原文と訳文の対応組）を通って対応付けられているかを一意に示すものである。
【００２８】
次いで、計算結果管理手段１０７は、計算する対応組を（１，１）、（１，２）、（２，１）とし（ステップＳ１０８）、未計算の枝があるかを調べる（ステップＳ１０９）。尚、枝とは、パスの１要素であり、パス中のある地点（英文５文目、和文４文目等）から、次の原文と訳文との対応組がどのようなものであるかを示すものである。
【００２９】
ステップＳ１０９において、未計算の枝があった場合は、その枝の評価値を計算し、この計算値を計算結果管理手段１０７内のＤＰのテーブルに保持する（ステップＳ１１０）。次に、この計算値によって、上述した暫定解よりも評価値の合計値の高い解が得られるかを判定し（ステップＳ１１１）、高い解が得られた場合は、その解を暫定解として更新する（ステップＳ１１２）。
【００３０】
一方、ステップＳ１１１において、暫定解より高い解が得られるかが不明であった場合は、そのパスが最適解になり得ないかを判定する（ステップＳ１１３）。このステップＳ１１３において、明らかに暫定解より評価値が大きくならないことが判明した場合は、そのノードにマークを付け、それより先の枝の計算は保留し（ステップＳ１１４）、ステップＳ１０９に戻る。即ち、そのパスとしての計算を中止する。また、ステップＳ１１３において、最適解となり得ないかが不明であった場合は、そのままステップＳ１０９に戻る。
【００３１】
このようなステップＳ１０９〜ステップＳ１１４の処理を繰り返すことにより、対応組（１，１）、（１，２）、（２，１）におけるパスの暫定解が求められ、また、その処理中で暫定解より大きい値にならないと分かったパスは、それ以上の枝の計算が保留される。尚、ステップＳ１１４において、計算が中止ではなく保留となっているのは、後述するステップＳ１１５以降でｉの値をインクリメントした場合に、そのノードより先も計算する場合があるからである。
【００３２】
ステップＳ１０９において、未計算の枝がなくなった場合は、ステップＳ１１５に移行し、文の対応組を徐々に増やしていくための変数ｉの値をインクリメントする。即ち、ｉの値を３とする。
【００３３】
次に、ｉの値が予め定めた最大対応付け文数より大きいかを調べる（ステップＳ１１６）。この最大対応付け文数とは、１文と何文の対応まで調べるかを示すもので、最大対応付け文数が４であれば、１文対４文、４文対１文の対応まで調べることを意味する。このステップＳ１１６において、ｉ≦最大対応付け文数であれば、計算結果管理手段１０７において、計算する対応組に、（１，ｉ）、（ｉ，１）を追加し（ステップＳ１１７）、ステップＳ１０９に戻る。
【００３４】
一方、ステップＳ１１６において、ｉ＞最大対応付け文数であれば、現在の暫定解を最適解、最適解のパスを文の対応付けの解とし（ステップＳ１１８）、対応付け処理を終了する。
【００３５】
以上の処理を、更に具体的な一例を用いて説明する。
【００３６】
図３は、英文９文、和文７文からなるファイルの対応を取るためのパスの説明図である。
【００３７】
図において、Ｅ１〜Ｅ９はそれぞれ英文の１文目から９文目を表しており、Ｊ１〜Ｊ７はそれぞれ和文の１文目から７文目までを表している。また、丸付きの番号は枝の評価値を計算する順番を表している。例えば、１番の枝は英文の１文目と和文の１文目の対応を評価する。２番の枝は英文の１文目から２文目までと、和文の１文目の対応を評価する、…、２７番目の枝は英文の９文目と和文の７文目の対応を評価する、という意味である。
【００３８】
先ず、図２のフローチャートにおけるステップＳ１００〜ステップＳ１０７の処理を説明する。
【００３９】
この例では英文の数の方が多いので、最初に（１，１）、（２，１）の組み合わせを計算する。対応の組み合わせをこの２通りに限定するとゴール（図の右上）まで到達し得るパスは図３に示すように狭い範囲となる。
【００４０】
図示の枝にふられた番号順に計算し（必ずしもこの通りに計算しなくともよい）、２７番まで計算し終わったら、ＤＰによって、最も評価値の和の高いパスを調べ、その評価値の和を暫定解とする。ここでは、２−５−９−１４−１９−２３−２６が暫定解となったとする。
【００４１】
次に、ステップＳ１０８〜ステップＳ１１４の処理を説明する。
ステップＳ１０８において、（１，２）の組み合わせを追加すると、図３のパスは次のようになる。
【００４２】
図４は、（１，２）の組み合わせを追加した場合のパスの説明図である。
例えば、図中の３１番を計算した時点で、２−３１−１２−１９−２３−２６のパスの方が暫定解よりも高い評価値を得たとすると、その時に暫定解を更新し、そのパスを記憶しておく（ステップＳ１１１〜ステップＳ１１２）。
【００４３】
次に、３６番を計算した時点で、どのようなパスを通っても暫定解よりよい解が得られないと分かったら（この方法については後述する）、３６番の枝の終点（ゴール側）にマーク付けし、そこから先の枝の評価を保留する（ステップＳ１１３〜ステップＳ１１４）。図４の例では、４０、４１、４４、４５の枝の計算をせずに済む。
【００４４】
上記のステップＳ１１３の判断処理は次のように行う。各枝の評価値の最大値は１、暫定解が４．８、２−６−３６の枝の和が１．２であるとすると、そのパスを通る解は４．２以上にはならない。即ち、暫定解より大きくならないため、そのパスの評価値を計算する意味がないことになる。従って、そのパスとしての計算は中止する。
【００４５】
ステップＳ１１５〜ステップＳ１１８の処理では、最大対応付け文数までの対応について計算するためのものである。その際、上記の対応組を（１，１）、（１，２）、（２，１）とした場合と同様に、１文対３文といった計算で、最適解になり得ない枝の計算はせずに済む。
【００４６】
〈効果〉
以上のように、具体例１によれば、原文と訳文との対応の可能性の高い解を暫定解とし、評価値の計算処理において、暫定解より高い値にならないと分かった時点で、そのパスとしての計算を中止するようにしたので、ＤＰにおける計算量を削減でき、処理時間の短縮化を図ることができる。また、１文対３文といった計算も考慮しているため、文書の内容等で精度が低下することもない。
【００４７】
更に、任意の組み合わせで、それまでの暫定解よりも評価値の和の高い値が見つかった場合は、見つかった値を新たな暫定解とするようにしたので、処理を行うに従ってより正しい暫定解が得られ、従って、処理の高速化と精度向上の効果を同時に得ることができる。
【００４８】
特に、本具体例では、最適解のパスが最初に求めた暫定解のパスと一致するか、あるいは近傍にある場合、従来技術と比較して省略できる評価値の計算個所が多くなり、効果が顕著となる。
【００４９】
《具体例２》
具体例２は、複数の枝（原文と訳文との対応組）の評価値が一定の割合で含まれるような値を各枝の基準値として設定し、そのパスが最適解になるか否かの判定を、この基準値を用いて行うようにしたものである。
【００５０】
〈構成〉
図５は、具体例２の対訳文書対応付けシステムの構成図である。
図示のシステムは、原文ファイル１０１、訳文ファイル１０２、文分割手段１０３、形態素解析手段１０４、組み合わせ管理手段１０５、評価値計算手段１０６、計算結果管理手段１０７、対応タグ付き原文ファイル１０８、対応タグ付き訳文ファイル１０９、対訳辞書１１０、基準値計算手段１１１からなる。ここで、原文ファイル１０１〜対訳辞書１１０は、具体例１の構成と同様であるため、その説明を省略する。
【００５１】
基準値計算手段１１１は、各枝の平均、分散などを計算し、一定の割合の枝の評価値がその値を超えないような一定の値、例えば、９５％の枝がこの基準値内にあるような値を導き出す機能を有している。図５中に破線で示す式は、基準値の求め方の一例を示すものである。例えば、枝の値が正規分布をなし、平均をμ、分散をＤとしたときに図示の式において、ｐ＝０．９５となるようなｘを基準値とするよう構成されている。
【００５２】
計算結果管理手段１０７は、基準値計算手段１１１で求めた基準値を用いて暫定解よりも高くなり得ないパスを判断するよう構成されている。つまり、評価値を計算していない枝は高々基準値であると見なしてそのパスの評価値を計算する。
【００５３】
〈動作〉
図６は、具体例２の動作を示すフローチャートである。
図２に示した具体例１と異なる部分は、ステップＳ２０８において、基準値計算手段１１１で基準値を求めている部分と、ステップＳ２１４において、計算結果管理手段１０７が最適解になる見込みがあるかどうかを判定する時に、ステップＳ２０８で求めた基準値を使うことである。
【００５４】
具体例２におけるステップＳ２００〜Ｓ２０７は、具体例１のステップＳ１００〜Ｓ１０７の処理と同様である。次に、ステップＳ２０８では、例えば図５中の式に基づき、枝の平均、分散から評価基準値を計算する。これ以降のステップＳ２０９〜Ｓ２１３の処理は、具体例１のステップＳ１０８〜Ｓ１１２の処理と同様である。
【００５５】
ステップＳ２１４では、上述したように最適解になる見込みがあるか否かを判断する場合に、まだ、計算していない枝の評価値を高々基準値であるとして計算し、これに基づき判断を行うことである。即ち、具体例１では、まだ、計算していない枝の評価値を１として、つまり評価値を最高値としてそのパスの評価値を計算した。これに対し、具体例２では、ほとんどの枝がこの基準値内にあるような値、例えば０．７といったような値であるとしてそのパスの評価値を計算する。
【００５６】
このような基準値を用いることにより、最適解が見つかるという保証は必ずしもできなくなるが、具体例１よりも更に評価値を計算する枝を少なくすることができる。
【００５７】
尚、最適解が見つかるという保証は必ずしもできなくなるという理由は次の点からである。即ち、基準値は、０．７といった最高値ではない値であるため、まだ計算していない枝の評価値を基準値で計算した場合よりも、実際にその枝の評価値を計算した場合の方がそのパスの評価値が大きくなってしまう可能性がある。しかしながら、適切な基準値を設定することで、実際の処理ではこのような可能性はほとんどあり得ないと考えられる。
【００５８】
また、文の対応付け問題は、最適解が１００％正しいとは限らないので、後で人手により対応が正しいかどうかをチェックすることが必要であることを考えると、時間をかけて最適解を見つけるよりも、短時間でそれなりの解を見つける方が利用価値の高い場合もある。
【００５９】
具体例２におけるそれ以降のステップＳ２１５〜Ｓ２１９の処理は、具体例１におけるステップＳ１１４〜Ｓ１１８と同様であるため、ここでの説明は省略する。
【００６０】
以上の処理を更に具体的な一例を用いて説明する。ここで、具体例２においても、対象となるファイルは英文９文、和文７文からなるとする。また、これらの文の対応を取るためのパスの説明図として、具体例１における図３および図４を用いて説明する。
【００６１】
先ず、図６のフローチャートにおけるステップＳ２００〜ステップＳ２０７の処理を説明する。
【００６２】
この例では英文の数の方が多いので、最初に（１，１）、（２，１）の組み合わせを計算する。対応の組み合わせをこの２通りに限定するとゴール（図の右上）まで到達し得るパスは図３に示すように狭い範囲となる。
【００６３】
図示の枝にふられた番号順に計算し（必ずしもこの通りに計算しなくともよい）、２７番まで計算し終わったら、ＤＰによって、最も評価値の和の高いパスを調べ、その評価値の和を暫定解とする。ここでは、２−５−９−１４−１９−２３−２６が暫定解となったとする。以上は、具体例１と同様である。
【００６４】
次に、ステップＳ２０８では、枝の基準値を計算する。ここでは、基準値が０．７になったとする。
【００６５】
次に、ステップＳ２０９〜Ｓ２１５の処理を説明する。
ステップＳ２０９において、（１，２）の組み合わせを追加すると、図３のパスは図４のようになる。
【００６６】
例えば、図中の３１番を計算した時点で、２−３１−１２−１９−２３−２６のパスの方が暫定解よりも高い評価値を得たとすると、その時に暫定解を更新し、そのパスを記憶しておく（ステップＳ２１２〜ステップＳ２１３）。
【００６７】
次に、３６番を計算した時点で、どのようなパスを通っても暫定解よりよい解が得られないと分かったら（この方法については後述する）、３６番の枝の終点（ゴール側）にマーク付けし、そこから先の枝の評価を保留する。図４の例では、４０、４１、４４、４５の枝の計算をせずに済む。
【００６８】
上記の判断方法は次のように行う。各枝の評価値の基準値は０．７、暫定解が４．８、２−６−３６の枝の和が２．２であるとすると、そのパスを通る解は４．３以上にはならない。即ち、暫定解より大きくならないため、そのパスの評価値を計算する意味がないことになる。従って、そのパスとしての計算は中止する。
【００６９】
ここで、具体例１との比較を行うと次のようになる。即ち、具体例１では、各枝の評価値の最大値である１を用いているため、暫定解が４．８、２−６−３６の枝の和が２．２であるとすると、そのパスを通る解は最大５．２となり、この時点では最適解とはなり得ないと判定することができない。これに対し、具体例２では、この時点で計算を中止することができるため、具体例１よりも評価値を計算する枝を少なくすることができる。
【００７０】
ステップＳ２１６〜ステップＳ２１９の処理では、最大対応付け文数までの対応について計算するためのものである。その際、上記の対応組を（１，１）、（１，２）、（２，１）とした場合と同様に、１文対３文といった計算で、最適解になり得ない枝の計算はせずに済む。
【００７１】
〈効果〉
以上のように、具体例２によれば、原文と訳文との対応の可能性の高い解を暫定解とし、評価値の計算処理において、暫定解より高い値にならないと分かった時点で、そのパスとしての計算を中止するようにしたので、具体例１と同様に、ＤＰにおける計算量を削減でき、処理時間の短縮化を図ることができ、また、精度の低下もない。
【００７２】
更に、具体例２では、予め、基準値を用意し、この基準値を用いて最適解とはなり得ないパスかを判定するようにしたので、具体例１よりも更に評価値を求める枝を少なくすることができ、処理時間の短縮化を図ることができる。
【００７３】
尚、上記具体例１、２では、英文と和文との対応付けの場合を示したが、対訳辞書を替えることによって、あらゆる言語同士の対応付けにも利用することができる。また、枝の評価値を計算していく順番も、図示の順番に限定されるものではなく、ある程度順番を変えてもよい。
【００７４】
また、具体例１、２において、ｍ＞２ｎのような場合にも、最初に計算する対応組を変更することで容易に対処することができる。このような場合は、１文対１文、２文対１文、３文対１文を最初に計算する。
【００７５】
上記具体例２において、基準値は、枝の評価値の平均や分散等により予めその値を決めておいたが、最初に暫定解を見つけるまでの枝の値（図３に示す枝の値＝図６におけるステップＳ２０６までの処理の値）で計算してもよい。こうすれば、対応付け処理と同時に基準値のための計算も行うことができる。
【図面の簡単な説明】
【図１】本発明の対訳文書対応付けシステムの具体例１を示す構成図である。
【図２】本発明の対訳文書対応付けシステムの具体例１の動作を示すフローチャートである。
【図３】本発明の対訳文書対応付けシステムにおける英文９文、和文７文からなるファイルの対応を取るためのパスの説明図である。
【図４】本発明の対訳文書対応付けシステムにおける（１，２）の組み合わせを追加した場合のパスの説明図である。
【図５】本発明の対訳文書対応付けシステムの具体例２を示す構成図である。
【図６】本発明の対訳文書対応付けシステムの具体例２の動作を示すフローチャートである。
【符号の説明】
１０１原文ファイル
１０２訳文ファイル
１０５組み合わせ管理手段
１０６評価値計算手段
１０７計算結果管理手段
１１１基準値計算手段
[0001]
[Technical field of the invention]
The present invention relates to a bilingual document association system that associates sentences in a bilingual document consisting of an original text and its translation, such as English and Japanese.
[0002]
[Prior art]
There are systems that use a bilingual dictionary to associate the sentences of a bilingual document consisting of an original text and its translation, such as English and Japanese. One document on such systems is:
Takehito Utsuro and Yuji Matsumoto, "Bilingual text matching using bilingual dictionaries and statistical information," Vol. 12, No. 5, Sep. 1995.
[0003]
This document describes a method of associating the sentences of a bilingual document by dynamic programming (DP) using a bilingual dictionary. To perform the association with dynamic programming, the original text and the translation are first split into sentences, and each sentence is morphologically analyzed and split into words. Independent words (content words) are then extracted from these words, and the bilingual dictionary is used to evaluate how well the independent words in the respective sentences correspond. For example, the following formula is used for the evaluation.
[0004]
h(x, y) = 2 × fm(x, y) / (fj(x) + fj(y))
where
x is a sentence in the original text (possibly several sentences),
y is a sentence in the translation (possibly several sentences),
h(x, y) is the evaluation function,
fm(x, y) is the number of independent words in x and y for which a correspondence was found,
fj(x) is the number of independent words in x, and
fj(y) is the number of independent words in y.
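The evaluation function above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `evaluation_value`, the representation of the bilingual dictionary as a mapping from an original-language word to a set of translations, and the greedy one-to-one matching used to count fm(x, y) are all assumptions made here, not details given in the cited document.

```python
def evaluation_value(src_words, tgt_words, dictionary):
    """h(x, y) = 2 * fm(x, y) / (fj(x) + fj(y)).

    src_words / tgt_words: independent words of x and y.
    dictionary: hypothetical mapping {source word: set of translations}.
    """
    src = list(src_words)
    tgt = list(tgt_words)
    if not src and not tgt:
        return 0.0  # nothing to compare; avoid division by zero

    unmatched = list(tgt)
    matched = 0  # fm(x, y)
    for word in src:
        candidates = dictionary.get(word, set())
        # Greedy assumption: each target word can be matched at most once.
        for k, target in enumerate(unmatched):
            if target in candidates:
                matched += 1
                del unmatched[k]
                break
    return 2 * matched / (len(src) + len(tgt))
```

With this definition the value lies between 0 (no independent word corresponds) and 1 (every independent word on both sides corresponds), which is the property the later pruning step relies on.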
[0005]
With such an evaluation function, the value is large for sentences that correspond and small for sentences that do not. The sentences are examined in order from the beginning of the text, and the combination that maximizes the sum of the evaluation function values is taken as the solution of the association problem.
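The search for the combination with the largest sum can be sketched as a conventional DP over a grid of (original sentences consumed, translated sentences consumed). The sketch below is an assumption about how such a DP is typically written, not the cited authors' implementation; `best_alignment`, `score`, and `beads` are hypothetical names. `score(i, a, j, b)` is assumed to return the evaluation value of original sentences i..i+a-1 against translated sentences j..j+b-1.

```python
def best_alignment(m, n, score, beads=((1, 1), (1, 2), (2, 1))):
    """Return (total, path) with the largest sum of branch scores, or None.

    m, n  : number of original / translated sentences.
    beads : allowed (a, b) pair sizes, a original to b translated sentences.
    """
    NEG = float("-inf")
    best = [[NEG] * (n + 1) for _ in range(m + 1)]
    back = [[None] * (n + 1) for _ in range(m + 1)]
    best[0][0] = 0.0

    for i in range(m + 1):
        for j in range(n + 1):
            if best[i][j] == NEG:
                continue  # node not reachable with the allowed pair sizes
            for a, b in beads:
                if i + a <= m and j + b <= n:
                    total = best[i][j] + score(i, a, j, b)
                    if total > best[i + a][j + b]:
                        best[i + a][j + b] = total
                        back[i + a][j + b] = (i, j, a, b)

    if best[m][n] == NEG:
        return None  # the goal cannot be reached with these pair sizes

    # Walk the back-pointers from the goal to recover the path.
    path = []
    i, j = m, n
    while (i, j) != (0, 0):
        pi, pj, a, b = back[i][j]
        path.append((pi, a, pj, b))
        i, j = pi, pj
    path.reverse()
    return best[m][n], path
```

Every reachable node evaluates every allowed pair size, which is why the running time grows with the product of the two sentence counts and with the number of pair sizes considered.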
[0006]
[Problems to be solved by the invention]
However, calculating an evaluation value in this way takes time, because it involves looking up the bilingual dictionary and matching character strings. Furthermore, the DP method takes time proportional to (number of original sentences) × (number of translated sentences), so execution speed was a problem. This is because, although most sentence correspondences are one-to-one, and including one-to-two and two-to-one correspondences covers nearly 100 percent of cases, the method calculates evaluation values for every conceivable pair, allowing for one-to-three, three-to-one, one-to-four, and four-to-one correspondences as well.
[0007]
One easily conceived way to address this problem is simply not to calculate cases such as one sentence to three sentences or three sentences to one sentence. However, the probabilities mentioned above may vary with, for example, the field of the bilingual document, and in some cases a one-to-three correspondence may be the optimal solution. Such a method cannot find a one-to-three correspondence when one exists, so the accuracy of the association is lowered in some cases.
[0008]
From this point of view, it has been desired to realize a bilingual document association system that can speed up the processing without reducing the accuracy of association.
[0009]
[Means for Solving the Problems]
The present invention employs the following configuration in order to solve the above-described problems.
<Configuration>
The present invention is a bilingual document sentence association system for associating a plurality of source-language sentences contained in a source-language document with a plurality of target-language sentences contained in a translation of that document, characterized by comprising:
combination management means that compares the number of source-language sentences with the number of target-language sentences; when the two numbers are equal, obtains initial paths each formed by setting, in order, correspondence pairs that associate the source-language and target-language sentences one-to-one in sentence order; when the number of source-language sentences is larger than the number of target-language sentences, obtains initial paths each formed by setting, in order, correspondence pairs that associate the sentences either one-to-one or two-to-one in sentence order; when the number of source-language sentences is smaller than the number of target-language sentences, obtains initial paths each formed by setting, in order, correspondence pairs that associate the sentences either one-to-one or one-to-two in sentence order; and thereafter, each time the number of sentences associated within a pair is increased, successively obtains other paths composed of other correspondence pairs;
evaluation value calculation means that, for each initial path and each other path, calculates for each correspondence pair an evaluation value indicating the degree of correspondence between the source-language and target-language sentences on the basis of evaluation elements of the sentences, and adds the calculated evaluation values in sequence to obtain the total evaluation value of each path; and
calculation result management means that holds, as a provisional solution associated with the corresponding initial path, the maximum of the per-path totals obtained for the initial paths; holds a set evaluation value; for each other path, takes in the evaluation values already calculated and computes their partial sum; before the evaluation values of the uncalculated correspondence pairs are calculated, counts the uncalculated pairs, multiplies that count by the set evaluation value, and adds the product to the partial sum; stops calculation of the other path when it determines that the result is smaller than the provisional solution; updates the provisional solution when it determines that the total evaluation value of the other path is larger than the provisional solution; and, when the number of sentences associated within a pair in the other paths reaches the maximum number of associated sentences, causes the combination management means to stop generating other paths for pairs at or above that maximum and takes the path corresponding to the current provisional solution as the solution of the sentence association.
[0010]
This set evaluation value is either the maximum evaluation value that a combination can take or an evaluation reference value calculated from the mean and variance of the combinations.
[0012]
[Embodiments of the invention]
The present invention calculates the most likely cases first, such as one-to-one and one-to-two sentence correspondences, and uses a provisional solution to narrow down the candidates for the correct answer, thereby reducing the number of evaluation values that the DP has to calculate.
[0013]
Hereinafter, embodiments of the present invention will be described in detail using specific examples.
[0014]
<< Specific Example 1 >>
<Configuration>
FIG. 1 is a block diagram showing Specific Example 1 of the bilingual document association system of the present invention.
The system shown in the figure consists of an original text file 101, a translated text file 102, sentence dividing means 103, morphological analysis means 104, combination management means 105, evaluation value calculation means 106, calculation result management means 107, a correspondence-tagged original text file 108, a correspondence-tagged translated text file 109, and a bilingual dictionary 110.
[0015]
The original text file 101 is a document file consisting of multiple original sentences, for example an English document file. The translated text file 102 is a document file consisting of multiple translated sentences, for example a Japanese document file.
[0016]
The sentence dividing means 103 has a function of dividing the original text file 101 and the translated text file 102 into individual sentences. For example, it is configured to split English text at periods and Japanese text at the full stop (。) and similar marks.
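As a rough illustration of the kind of splitting described here, the following sketch divides text at periods for English and at 。 for Japanese. The function name `split_sentences` and the `lang` codes are assumptions for this sketch; a practical splitter would also have to handle abbreviations, quotations, and other sentence-final marks, which this one does not.

```python
import re


def split_sentences(text, lang):
    """Split text into sentences: after '.' for English, after '。' for Japanese."""
    if lang == "en":
        pattern = r"(?<=\.)\s+"   # whitespace that follows a period
    else:
        pattern = r"(?<=。)"      # zero-width position right after 。
    return [s.strip() for s in re.split(pattern, text) if s.strip()]
```

For instance, `split_sentences("I ran. You sat.", "en")` yields two sentences, and the numbers of sentences obtained for the two files become the m and n used in the later steps.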
[0017]
The morphological analysis means 104 has a function of performing morphological analysis on the English and Japanese sentences produced by the sentence dividing means 103 and splitting them into words. Known configurations can be used for both the sentence dividing means 103 and the morphological analysis means 104.
[0018]
The combination management means 105 has a function of obtaining combinations of correspondence pairs between original and translated sentences and, based on the results of the sentence dividing means 103 and the morphological analysis means 104, of using the numbers of sentences in the original text file 101 and the translated text file 102 to obtain the combinations most likely to contain the optimal association of original and translated sentences.
[0019]
The evaluation value calculation means 106 has a function of obtaining an evaluation value that is higher the better the original and translated sentences correspond, and of obtaining, as a provisional solution, the sum of the evaluation values of the likely combinations obtained by the combination management means 105. Specifically, the evaluation value may be calculated with the evaluation function given in the prior art section, or by other means.
[0020]
The calculation result management means 107 has a table (not shown) that stores the DP evaluation values calculated by the evaluation value calculation means 106. It instructs the evaluation value calculation means 106 to calculate the evaluation values of each combination (path) of original and translated sentences obtained by the combination management means 105 in order from the first sentence of the combination, and stores the resulting evaluation values in the table. Based on the evaluation values in the table, when it can determine that a given combination cannot exceed the provisional solution even if calculated to the end, it stops the evaluation value calculation at that point.
[0021]
The correspondence-tagged original text file 108 and the correspondence-tagged translated text file 109 are the original text file 101 and the translated text file 102, respectively, with tags added to indicate the sentence correspondences.
[0022]
The bilingual dictionary 110 is a dictionary in which looking up a word of the original text to be associated yields multiple words of the translation language. For example, when the original text is English and the translation is Japanese, it corresponds to an English-Japanese dictionary.
[0023]
The bilingual document association system described above is implemented on a microcomputer or the like. The original text file 101, translated text file 102, correspondence-tagged original text file 108, correspondence-tagged translated text file 109, and bilingual dictionary 110 are held in an external storage device such as a hard disk device or in semiconductor memory. The sentence dividing means 103 through the calculation result management means 107 each consist of a program for that means together with a processor, main memory, and the like for executing it.
[0024]
<Operation>
FIG. 2 is a flowchart showing the operation of this example.
First, the sentence dividing means 103 splits the original text file 101 and the translated text file 102 into sentences (step S100). Let m be the number of sentences in the original text and n the number of sentences in the translation. In addition, the combination management means 105 sets to 2 the variable i that is used to gradually widen the correspondence pairs.
[0025]
Next, it is checked whether the number of sentences in the original sentence is equal to the number of sentences in the translated sentence, that is, m = n (step S101). Here, if they are equal, the combination of sentences for calculating the evaluation value is only (1, 1) (step S103), and the process proceeds to step S106. On the other hand, if they are not equal in step S101, it is determined whether m> n is satisfied (step S102).
[0026]
In step S102, if m > n, the correspondence pairs to be calculated are set to (1, 1) and (2, 1) (step S104) and the process proceeds to step S106; if m < n, they are set to (1, 1) and (1, 2) (step S105) and the process proceeds to step S106.
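The choice of the initial correspondence pairs in steps S101 to S105 can be sketched as follows. The sketch follows the configuration in paragraph [0009] and the worked example in paragraph [0039], in which the side with more sentences is allowed to contribute two sentences to a pair; a pair (a, b) means a original sentences associated with b translated sentences. The function name `initial_pairs` is an assumption.

```python
def initial_pairs(m, n):
    """Pair sizes (a, b) tried first, given m original and n translated sentences."""
    if m == n:
        return [(1, 1)]            # step S103: one-to-one only
    if m > n:
        return [(1, 1), (2, 1)]    # more original sentences: allow two-to-one
    return [(1, 1), (1, 2)]        # more translated sentences: allow one-to-two
```

For the example used later in the description (nine English sentences, seven Japanese sentences), `initial_pairs(9, 7)` gives `[(1, 1), (2, 1)]`.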
[0027]
In step S106, the evaluation value calculation means 106 calculates the evaluation function for the correspondence pairs and sends the results to the calculation result management means 107. The calculation result management means 107 then takes the path with the highest score as the provisional solution (step S107). A path uniquely identifies the route (the sequence of correspondence pairs of original and translated sentences) by which the sentences are associated from the first sentence to the last.
[0028]
Next, the calculation result management means 107 sets the correspondence pairs to be calculated to (1, 1), (1, 2), and (2, 1) (step S108) and checks whether any uncalculated branch remains (step S109). A branch is one element of a path: from a given point in the path (for example, the fifth English sentence and the fourth Japanese sentence), it indicates which correspondence pair of original and translated sentences comes next.
[0029]
If there is an uncalculated branch in step S109, the evaluation value of that branch is calculated and stored in the DP table in the calculation result management means 107 (step S110). Next, it is determined from this calculated value whether a solution with a higher total evaluation value than the provisional solution has been obtained (step S111); if so, that solution becomes the new provisional solution (step S112).
[0030]
On the other hand, if it is unclear in step S111 whether a solution higher than the provisional solution can be obtained, it is determined whether the path can be ruled out as the optimal solution (step S113). If it is clear in step S113 that the evaluation value cannot exceed the provisional solution, the node is marked, calculation of the branches beyond it is suspended (step S114), and the process returns to step S109. That is, calculation along that path is stopped. If it is unclear in step S113 whether the path can be ruled out, the process returns directly to step S109.
[0031]
By repeating steps S109 to S114, a provisional solution is obtained for paths built from the correspondence pairs (1, 1), (1, 2), and (2, 1), and for any path found during this processing to be unable to exceed the provisional solution, calculation of further branches is suspended. In step S114 the calculation is suspended rather than abandoned because, once the value of i is incremented in step S115 and later, branches beyond that node may still need to be calculated.
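The idea of steps S109 to S114 can be illustrated with a simplified depth-first search that abandons a partial path as soon as it cannot beat the provisional solution. This is a sketch under stated assumptions and not the table-based procedure of the flowchart: it does not keep marked nodes between rounds, it bounds the number of remaining branches by min(m - i, n - j) (each pair uses at least one sentence from each side), and it assumes that no branch scores more than `cap`. The names `pruned_search`, `score`, `beads`, and `cap` are hypothetical.

```python
def pruned_search(m, n, score, beads, provisional, cap=1.0):
    """Return (best total, best path or None, number of branches evaluated)."""
    cache = {}                  # branch scores already calculated
    best = provisional
    best_path = None

    def branch(i, a, j, b):
        key = (i, a, j, b)
        if key not in cache:
            cache[key] = score(i, a, j, b)
        return cache[key]

    def visit(i, j, partial, path):
        nonlocal best, best_path
        if (i, j) == (m, n):
            if partial > best:
                best, best_path = partial, list(path)   # update provisional solution
            return
        remaining = min(m - i, n - j)       # upper bound on branches still to come
        if partial + remaining * cap <= best:
            return                          # cannot exceed the provisional solution
        for a, b in beads:
            if i + a <= m and j + b <= n:
                path.append((i, a, j, b))
                visit(i + a, j + b, partial + branch(i, a, j, b), path)
                path.pop()

    visit(0, 0, 0.0, [])
    return best, best_path, len(cache)
```

The third return value shows the saving: the better the provisional solution passed in, the fewer branch scores have to be calculated. A returned path of `None` means that nothing better than the provisional solution was found.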
[0032]
If there are no more uncalculated branches in step S109, the process proceeds to step S115, and the value of the variable i for gradually increasing the corresponding pairs of sentences is incremented. That is, the value of i is set to 3.
[0033]
Next, it is checked whether the value of i exceeds a predetermined maximum number of associated sentences (step S116). The maximum number of associated sentences indicates how many sentences may be associated with a single sentence; if it is 4, correspondences up to one sentence to four sentences and four sentences to one sentence are examined. If i ≦ the maximum number of associated sentences in step S116, the calculation result management means 107 adds (1, i) and (i, 1) to the correspondence pairs to be calculated (step S117), and the process returns to step S109.
[0034]
On the other hand, if i> the maximum number of associated sentences in step S116, the current provisional solution is set as the optimum solution, and the path of the optimum solution is set as the sentence association solution (step S118), and the association process is terminated.
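The way the set of correspondence pairs grows from round to round (steps S108 and S115 to S118) can be sketched as a generator. The name `pair_schedule` and its parameters are assumptions for this sketch; it only lists the pair sizes allowed in each round and leaves the evaluation itself to the other steps.

```python
def pair_schedule(initial_pairs, max_group):
    """Yield the allowed pair sizes for each round of the search.

    initial_pairs : pair sizes used for the first provisional solution.
    max_group     : the maximum number of associated sentences.
    """
    allowed = list(initial_pairs)
    yield tuple(allowed)                      # steps S103 to S107

    for pair in ((1, 1), (1, 2), (2, 1)):     # step S108
        if pair not in allowed:
            allowed.append(pair)
    yield tuple(allowed)

    i = 3                                     # step S115: i goes from 2 to 3
    while i <= max_group:                     # step S116
        allowed.extend([(1, i), (i, 1)])      # step S117
        yield tuple(allowed)
        i += 1
```

With a maximum of 4 and the initial pairs (1, 1) and (2, 1), the generator produces four rounds, the last of which adds (1, 4) and (4, 1); when it is exhausted, the current provisional solution is the answer (step S118).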
[0035]
The above processing will be described using a more specific example.
[0036]
FIG. 3 is an explanatory diagram of paths for taking correspondence between files composed of nine English sentences and seven Japanese sentences.
[0037]
In the figure, E1 to E9 represent the first to ninth English sentences, and J1 to J7 represent the first to seventh Japanese sentences. The circled numbers indicate the order in which the branch evaluation values are calculated. For example, branch 1 evaluates the correspondence between the first English sentence and the first Japanese sentence, branch 2 evaluates the correspondence between the first and second English sentences and the first Japanese sentence, and so on, up to branch 27, which evaluates the correspondence between the ninth English sentence and the seventh Japanese sentence.
[0038]
First, the processing of step S100 to step S107 in the flowchart of FIG. 2 will be described.
[0039]
In this example, since there are more English sentences, the combination of (1, 1) and (2, 1) is calculated first. If the corresponding combinations are limited to these two ways, the path that can reach the goal (upper right in the figure) is a narrow range as shown in FIG.
[0040]
The branches are calculated in the order of the numbers shown in the figure (although this exact order is not required). Once branch 27 has been calculated, DP is used to find the path with the highest sum of evaluation values, and that sum is taken as the provisional solution. Here, assume that 2-5-9-14-19-23-26 becomes the provisional solution.
[0041]
Next, the process of step S108-step S114 is demonstrated.
If the combination of (1, 2) is added in step S108, the path in FIG. 3 becomes as follows.
[0042]
FIG. 4 is an explanatory diagram of paths when the combination of (1, 2) is added.
For example, if, at the point where branch 31 in the figure has been calculated, the path 2-31-12-19-23-26 obtains a higher evaluation value than the provisional solution, the provisional solution is updated at that point and the path is stored (steps S111 to S112).
[0043]
Next, if, at the point where branch 36 has been calculated, it turns out that no path through it can yield a better solution than the provisional solution (how this is determined is described later), the end point (goal side) of branch 36 is marked and evaluation of the branches beyond it is suspended (steps S113 to S114). In the example of FIG. 4, branches 40, 41, 44, and 45 need not be calculated.
[0044]
The determination in step S113 is made as follows. If the maximum evaluation value of each branch is 1, the provisional solution is 4.8, and the sum of branches 2-6-36 is 1.2, then a solution passing through that path cannot exceed 4.2 (the 1.2 already accumulated plus the maximum value 1 for each remaining branch). Since this cannot become larger than the provisional solution, there is no point in calculating the evaluation values of that path, and calculation along that path is stopped.
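The arithmetic of this test can be written out as a small check. The sketch assumes that three branches remain after 2-6-36, which is inferred from the figures in the text (4.2 = 1.2 + 3 × 1) rather than stated there; the function names are hypothetical.

```python
def upper_bound(partial_sum, remaining_branches, per_branch_value):
    """Best total a path can still reach if every remaining branch scores per_branch_value."""
    return partial_sum + remaining_branches * per_branch_value


def can_be_ruled_out(partial_sum, remaining_branches, per_branch_value, provisional):
    """True when the path cannot become larger than the provisional solution."""
    # A tie is also ruled out here, since it would not improve the provisional solution.
    return upper_bound(partial_sum, remaining_branches, per_branch_value) <= provisional
```

With the values of the example, `upper_bound(1.2, 3, 1.0)` is about 4.2, which is below the provisional solution 4.8, so the path is ruled out. If the partial sum were 2.2 instead, the bound would be about 5.2 and the path would have to be kept.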
[0045]
The processing from step S115 to step S118 calculates correspondences up to the maximum number of corresponding sentences. Here too, as in the case where the corresponding pairs are (1, 1), (1, 2), and (2, 1), branches that cannot lead to the optimal solution, for example in the one-sentence-to-three-sentence calculation, need not be calculated.
[0046]
<effect>
As described above, according to specific example 1, a solution with a high likelihood of being the correct correspondence between original and translated sentences is set as the provisional solution, and the calculation for a path is stopped once it is found, during evaluation value calculation, that the path cannot exceed the provisional solution. The amount of calculation in DP is therefore reduced and the processing time shortened. In addition, since correspondences such as one sentence to three sentences are also taken into account, accuracy does not deteriorate depending on the content of the documents.
[0047]
Furthermore, whenever any combination yields a sum of evaluation values higher than the current provisional solution, that sum is adopted as the new provisional solution. Processing speed and accuracy can therefore be improved at the same time.
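The interplay of stopping hopeless paths and updating the provisional solution can be sketched as a small branch-and-bound search. This is a depth-first illustration under stated assumptions, not the lattice DP of the figures: the function name, the `score(i, j, di, dj)` callback, and the use of min(m - i, n - j) as the number of remaining branches (valid only when every step consumes at least one sentence on each side) are all assumptions.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Branch = Tuple[int, int, int, int]  # (i, j, di, dj)


def branch_and_bound(m: int, n: int, steps: Sequence[Tuple[int, int]],
                     score: Callable[[int, int, int, int], float],
                     provisional: float,
                     per_branch_bound: float = 1.0
                     ) -> Tuple[float, Optional[List[Branch]]]:
    """Search for a path that beats `provisional`; return (best sum, path or None)."""
    best_total = provisional
    best_path: Optional[List[Branch]] = None
    current: List[Branch] = []

    def visit(i: int, j: int, partial: float) -> None:
        nonlocal best_total, best_path
        if (i, j) == (m, n):
            if partial > best_total:  # better than the provisional solution
                best_total, best_path = partial, list(current)
            return
        remaining = min(m - i, n - j)  # most branches that can still follow
        if partial + remaining * per_branch_bound <= best_total:
            return  # cannot exceed the provisional solution: stop this path
        for di, dj in steps:
            if i + di <= m and j + dj <= n:
                current.append((i, j, di, dj))
                visit(i + di, j + dj, partial + score(i, j, di, dj))
                current.pop()

    visit(0, 0, 0.0)
    return best_total, best_path
```

If the initial provisional solution is already optimal, the bound test fails at the start node and no branch is evaluated at all, which is the situation in which the savings are largest.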
[0048]
In particular, in this specific example, when the path of the optimal solution matches or is close to the path of the provisional solution obtained first, the number of evaluation value calculations that can be omitted compared with the prior art increases, and the effect becomes more pronounced.
[0049]
<< Specific Example 2 >>
In specific example 2, a value within which the evaluation values of a certain proportion of the branches (corresponding pairs of original and translated sentences) fall is set as the reference value for each branch, and whether a path can be the optimal solution is determined using this reference value.
[0050]
<Constitution>
FIG. 5 is a configuration diagram of the bilingual document association system of the second specific example.
The illustrated system includes an original text file 101, a translation text file 102, a sentence division means 103, a morpheme analysis means 104, a combination management means 105, an evaluation value calculation means 106, a calculation result management means 107, an original text file 108 with correspondence tags, a translation file 109 with correspondence tags, a bilingual dictionary 110, and a reference value calculation means 111. Since the original text file 101 through the bilingual dictionary 110 are the same as in specific example 1, their description is omitted.
[0051]
The reference value calculation means 111 calculates the average, variance, and so on of the branch evaluation values and derives a value that the evaluation values of a certain proportion of the branches do not exceed, for example a value within which 95% of the branches fall. The formula enclosed by a broken line in FIG. 5 shows an example of how to obtain the reference value. For example, when the branch values follow a normal distribution with mean μ and variance D, the reference value is the x for which p = 0.95 in that formula.
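One way to obtain such a value, assuming the formula in the figure is the cumulative distribution function of a normal distribution, is to take the inverse CDF at p. The sketch below uses Python's standard `statistics.NormalDist`. The function name and the clamp to a maximum of 1 are assumptions added for illustration.

```python
from math import sqrt
from statistics import NormalDist


def reference_value(mean: float, variance: float,
                    p: float = 0.95, max_value: float = 1.0) -> float:
    """Value x such that a fraction p of normally distributed scores is <= x."""
    # NormalDist takes the standard deviation, so convert the variance D.
    # The variance must be positive for the inverse CDF to be defined.
    x = NormalDist(mu=mean, sigma=sqrt(variance)).inv_cdf(p)
    # Assumption: a branch score never exceeds max_value, so clamp the result.
    return min(x, max_value)


# Hypothetical statistics: mean 0.5, variance 0.01 (standard deviation 0.1).
print(round(reference_value(0.5, 0.01), 3))
```

For p = 0.95 this is approximately the mean plus 1.645 standard deviations.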
[0052]
The calculation result management means 107 uses the reference value obtained by the reference value calculation means 111 to identify paths that cannot exceed the provisional solution. That is, the evaluation value of a path is calculated on the assumption that each branch whose evaluation value has not yet been calculated is at most the reference value.
[0053]
<Operation>
FIG. 6 is a flowchart illustrating the operation of the second specific example.
Specific example 2 differs from specific example 1 shown in FIG. 2 in that the reference value calculation means 111 obtains the reference value in step S208, and in that the calculation result management means 107 uses this reference value in step S214 when determining whether a path can still be the optimal solution.
[0054]
Steps S200 to S207 in specific example 2 are the same as steps S100 to S107 of specific example 1. Next, in step S208, an evaluation reference value is calculated from the average and variance of the branch evaluation values, for example using the formula shown in the figure. The subsequent steps S209 to S213 are the same as steps S108 to S112 of specific example 1.
[0055]
In step S214, the determination of whether a path can still be the optimal solution is made, as described above, by assuming that the evaluation value of each branch not yet calculated is at most the reference value. That is, in specific example 1 the evaluation value of the path is calculated by setting the evaluation value of each uncalculated branch to 1, the highest possible value. In specific example 2, by contrast, the evaluation value of the path is calculated on the assumption that most branches fall within the reference value, for example 0.7.
[0056]
With such a reference value, finding the optimal solution is no longer guaranteed, but the number of branches whose evaluation values must be calculated can be reduced further than in specific example 1.
[0057]
The reason the optimal solution is not guaranteed is as follows. Because the reference value, such as 0.7, is not the highest possible value, the evaluation value of a path may turn out larger when the uncalculated branches are actually evaluated than when they are estimated at the reference value. However, with an appropriately chosen reference value, this is considered to occur rarely in actual processing.
[0058]
Moreover, even the optimal solution is not necessarily 100% correct for the sentence association problem, and the correspondences must later be checked manually. Given this, finding a reasonable solution in a short time may be more useful than spending time to find the optimal solution.
[0059]
Since the processing of subsequent steps S215 to S219 in the specific example 2 is the same as the steps S114 to S118 in the specific example 1, the description thereof is omitted here.
[0060]
The above processing will now be described using a more concrete example. In specific example 2 as well, the target files are assumed to consist of 9 English sentences and 7 Japanese sentences, and the explanation refers to the explanatory diagrams of the paths for taking correspondences between these sentences.
[0061]
First, the processing of step S200 to step S207 in the flowchart of FIG. 6 will be described.
[0062]
In this example there are more English sentences than Japanese sentences, so the combinations (1, 1) and (2, 1) are calculated first. If the correspondences are limited to these two types, the paths that can reach the goal (upper right in the figure) are confined to a narrow range, as the figure shows.
[0063]
The calculation is performed in the order of the numbers assigned to the branches in the figure (although this order is not mandatory). When the calculation has been completed up to branch 27, the path with the highest sum of evaluation values is found by DP, and that sum becomes the provisional solution. Here, the path 2-5-9-14-19-23-26 is assumed to be the provisional solution. Up to this point the processing is the same as in specific example 1.
[0064]
In step S208, a branch reference value is calculated. Here, it is assumed that the reference value becomes 0.7.
[0065]
Next, the processing of steps S209 to S215 will be described.
When the combination (1, 2) is added in step S209, the paths of FIG. 3 are extended with the additional branches, as in specific example 1.
[0066]
For example, if, when branch 31 in the figure is calculated, the path 31-12-19-23-26 yields a higher evaluation value than the provisional solution, the provisional solution is updated at that point and the path is stored (steps S212 to S213).
[0067]
Next, if, when branch 36 is calculated, it is found that no path through it can yield a better solution than the provisional solution (the method of determination is described later), evaluation of the branches beyond the end point (goal side) of branch 36 is suspended. In the example of FIG. 4, branches 40, 41, 44, and 45 need not be calculated.
[0068]
The above determination is made as follows. Suppose the reference value for the evaluation value of each branch is 0.7, the provisional solution is 4.8, and the sum of the branches 2-6-36 is 2.2. Then a solution passing through this path is assumed not to exceed 4.3. Since it does not become larger than the provisional solution, there is no point in calculating the remaining evaluation values of the path, and the calculation for that path is stopped.
[0069]
A comparison with specific example 1 is as follows. Specific example 1 uses 1, the maximum evaluation value of each branch, so if the provisional solution is 4.8 and the sum of the branches 2-6-36 is 2.2, a solution passing through this path can be as large as 5.2, and at this point it cannot be ruled out as the optimal solution. In specific example 2, by contrast, the calculation can be stopped at this point, so fewer branch evaluation values need to be calculated than in specific example 1.
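The two settings can be compared numerically. The sketch below assumes three remaining branches, which is inferred from the text's figures (5.2 minus 2.2 at a bound of 1, and 4.3 minus 2.2 at a bound of 0.7). It also computes the break-even per-branch bound, a quantity that does not appear in the source and is shown only to illustrate how the choice of reference value affects the decision.

```python
def upper_bound(partial_sum: float, remaining: int, per_branch: float) -> float:
    """Optimistic total if every unevaluated branch scored `per_branch`."""
    return partial_sum + remaining * per_branch


provisional, partial, remaining = 4.8, 2.2, 3

for label, bound in (("specific example 1", 1.0), ("specific example 2", 0.7)):
    ub = upper_bound(partial, remaining, bound)
    decision = "stop" if ub <= provisional else "continue"
    print(f"{label}: bound {ub:.1f} -> {decision}")

# Any per-branch bound at or below this value stops the path.
break_even = (provisional - partial) / remaining
print(f"break-even per-branch bound: {break_even:.3f}")
```

With these numbers the maximum-value bound gives 5.2 and the path continues, while the reference-value bound gives 4.3 and the path is stopped.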
[0070]
The processing from step S216 to step S219 calculates correspondences up to the maximum number of corresponding sentences. Here too, as in the case where the corresponding pairs are (1, 1), (1, 2), and (2, 1), branches that cannot lead to the optimal solution, for example in the one-sentence-to-three-sentence calculation, need not be calculated.
[0071]
<effect>
As described above, in specific example 2, as in specific example 1, a solution with a high likelihood of being the correct correspondence between original and translated sentences is set as the provisional solution, and the calculation for a path is stopped once it is found, during evaluation value calculation, that the path cannot exceed the provisional solution. The amount of calculation in DP is therefore reduced and the processing time shortened without lowering accuracy.
[0072]
Furthermore, in specific example 2, a reference value is prepared in advance and is used to determine whether a path can no longer be the optimal solution. The processing time can therefore be shortened still further.
[0073]
Specific examples 1 and 2 describe the association of English sentences with Japanese sentences, but by changing the bilingual dictionary the system can also be used to associate sentences in other language pairs. Further, the order in which the branch evaluation values are calculated is not limited to the order shown in the figure and may be changed to some extent.
[0074]
Further, in specific examples 1 and 2, the case m > 2n can also be handled easily by changing the set of correspondences calculated first. In that case, one sentence to one sentence, two sentences to one sentence, and three sentences to one sentence are calculated first.
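The choice of the correspondences calculated first can be sketched as a function of the two sentence counts. The sketch assumes that m is the number of source sentences and n the number of translated sentences, both positive. The function name is hypothetical, and extending the rule beyond three-to-one (or to the mirrored case of more translated sentences) is a generalisation that the text does not state.

```python
from math import ceil
from typing import List, Tuple


def initial_steps(m: int, n: int) -> List[Tuple[int, int]]:
    """Correspondence types (source count, translated count) to try first."""
    if m == n:
        return [(1, 1)]
    if m > n:
        # Enough source sentences per translated sentence to reach the goal:
        # (1, 1), (2, 1) when m <= 2n; also (3, 1) when 2n < m <= 3n; and so on.
        widest = ceil(m / n)
        return [(k, 1) for k in range(1, widest + 1)]
    # Mirrored case: more translated sentences than source sentences.
    widest = ceil(n / m)
    return [(1, k) for k in range(1, widest + 1)]


print(initial_steps(9, 7))   # more source sentences, m <= 2n
print(initial_steps(15, 7))  # m > 2n
```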
[0075]
In specific example 2, the reference value is determined in advance from the average and variance of the branch evaluation values. Alternatively, it may be calculated from the values of the branches evaluated before the first provisional solution is found (the branch values shown in FIG. 3, that is, the values obtained in the processing up to step S206 of the flowchart). In this way, the calculation of the reference value can be performed simultaneously with the association processing.
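Accumulating the average and variance while the initial branches are being evaluated can be done with a running update. The sketch below uses Welford's online algorithm, which is a standard technique chosen here for illustration and is not named in the patent. The class name and the sample values are hypothetical.

```python
class RunningStats:
    """Mean and population variance updated one value at a time (Welford)."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, value: float) -> None:
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        return self._m2 / self.count if self.count else 0.0


stats = RunningStats()
for branch_value in (0.2, 0.4, 0.6, 0.8):  # hypothetical evaluation values
    stats.add(branch_value)  # called right after each branch is evaluated
print(round(stats.mean, 3), round(stats.variance, 3))
```

Once the initial branches have been processed, the accumulated mean and variance can be passed to whatever formula derives the reference value, with no second pass over the branch values.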
[Brief description of the drawings]
FIG. 1 is a configuration diagram showing specific example 1 of the bilingual document association system of the present invention.
FIG. 2 is a flowchart showing the operation of specific example 1 of the bilingual document association system of the present invention.
FIG. 3 is an explanatory diagram of paths for taking correspondence between files composed of nine English sentences and seven Japanese sentences in the bilingual document association system of the present invention.
FIG. 4 is an explanatory diagram of paths when a combination of (1, 2) is added in the bilingual document association system of the present invention.
FIG. 5 is a configuration diagram showing specific example 2 of the bilingual document association system of the present invention.
FIG. 6 is a flowchart showing the operation of specific example 2 of the bilingual document association system of the present invention.
[Explanation of symbols]
101 Original file
102 Translated file
105 Combination management means
106 Evaluation value calculation means
107 Calculation result management means
111 Reference value calculation means

Claims

1. A bilingual document sentence association system for establishing correspondences between a plurality of source language sentences included in a source language document and a plurality of translated sentences included in a translated document of the source language document, comprising:
combination management means for comparing the number of source language sentences with the number of translated sentences; when the number of source language sentences is equal to the number of translated sentences, obtaining initial paths each formed by setting, in order, corresponding pairs in which the source language sentences and the translated sentences are matched one-to-one in sentence order; when the number of source language sentences is larger than the number of translated sentences, obtaining initial paths each formed by setting, in order, corresponding pairs in which the source language sentences and the translated sentences are matched either one-to-one or two-to-one in sentence order; when the number of source language sentences is smaller than the number of translated sentences, obtaining initial paths each formed by setting, in order, corresponding pairs in which the source language sentences and the translated sentences are matched either one-to-one or one-to-two in sentence order; and subsequently, each time the number of sentences matched between source language sentences and translated sentences is increased, sequentially obtaining other paths composed of other corresponding pairs;
evaluation value calculation means for calculating, for each of the initial paths and the other paths and for each corresponding pair, an evaluation value indicating the degree of correspondence between the source language sentences and the translated sentences on the basis of evaluation elements of the sentences, and sequentially adding the calculated evaluation values to obtain the total sum of evaluation values for each path; and
calculation result management means for holding, as a provisional solution and in association with the relevant initial path, the maximum of the per-path total sums of evaluation values obtained for the initial paths; holding a set evaluation value for the evaluation values; for each other path, taking in the evaluation values already calculated and calculating their intermediate sum; before the evaluation values of uncalculated corresponding pairs are calculated, determining the number of the uncalculated corresponding pairs, multiplying that number by the set evaluation value, and adding the product to the intermediate sum; stopping the calculation of the other path when the result of the addition is determined to be smaller than the provisional solution; updating the provisional solution when the total sum of evaluation values of the other path is determined to be larger than the provisional solution; and, when the number of sentences matched in a corresponding pair of the other paths reaches a maximum number of associated sentences, causing the combination management means to stop generating other paths involving corresponding pairs at or above the maximum number of associated sentences, and taking the path corresponding to the current provisional solution as the solution of the sentence association.

2. The bilingual document association system according to claim 1, wherein the set evaluation value is either the maximum evaluation value that a corresponding pair can take or an evaluation reference value calculated from the average and variance of the evaluation values of the corresponding pairs.