JP3831357B2 - Parallel translation information creation device and parallel translation information search device

Info

Publication number: JP3831357B2
Application number: JP2003111807A
Authority: JP
Inventors: 晶佐々木; 裕美子吉村
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2003-04-16
Filing date: 2003-04-16
Publication date: 2006-10-11
Anticipated expiration: 2023-04-16
Also published as: JP2004318510A

Description

【０００１】
【発明の属する技術分野】
本発明は、翻訳支援ツール等に利用される対訳情報作成装置及び対訳情報検索装置に関する。
【０００２】
【従来の技術】
国際化の進行に伴い、外国語を用いた情報交換へのニーズが高まっており、機械翻訳は、かかる情報交換のツールとして大いに期待されている。しかし、現在の機械翻訳技術による翻訳結果は、人手による手直しが全く不要なレベルにあるとはいえず、翻訳精度の更なる向上が求められている。従って、現状の機械翻訳システムを用いて、人手による手直しのない状態まで翻訳精度を上げるためには、多大の労力と時間を必要とする。
【０００３】
そこで、従来、新たに翻訳を行う場合、過去に翻訳済みとされた文書を有効に活用するために、次のような幾つかの技術が提案されている。
【０００４】
その１つは、対訳データベース作成装置であって、ユーザが原文と当該原文の訳文とを文単位で対応付けし（以下、対訳ペアと呼ぶ）、データベース（以下、対訳メモリと呼ぶ）に保存する。原文の翻訳に関し、以後、原文と訳文との対訳ペアを作成し、順次、対訳メモリに保存し、対訳情報を作成する。
【０００５】
従って、以上のような状態において、新たに入力される翻訳対象文の翻訳を行う場合、過去に翻訳済みとされた対訳メモリを検索し、翻訳対象文と類似した文が存在すれば、機械翻訳により訳文を生成する代わりに当該対訳メモリ中の訳文を翻訳文に採用する（特開平１０−６３６６９号公報参照）。
【０００６】
他の１つは、過去の翻訳済みの文書を有効に活用する技術として、会話文翻訳装置がある（特開平５−３２４７０２号公報、特開平９−６２６８１号公報）。これらの会話文翻訳装置は、予め用意された会話用例文の文類情報を対訳ペアに付与し、対訳メモリ検索者の意図する方向により近い対訳ぺアを検索可能にした構成である。なお、前記会話文の分類情報とは、例えば「部屋の交渉」、「支払う、デポジット」など、会話の目的を表すフレーズ及び想定されたシーンがキーワードとなる。
【０００７】
さらに、前記特開平９−６２６８１号公報の会話文翻訳装置は、対訳メモリに格納される対訳ペアに対し、対訳ペアの原文文字列の中から「意味情報」を抽出して付与し、対訳ペアの文意をより忠実に検索に反映させる方法も提案されている。この「意味情報」は主に自立語の基本形が用いられ、予め事前に「意味素性」毎にその同義語、活用変化形、表現のバリエーション等を対応付けした「意味素性辞書」を作成し、当該「意味素性辞書」を参照し、当該意味情報を抽出するものである。例えば意味素性「お願い」には、「依頼、お願いした、お願いしたいのです」などが対応付けられている。
【０００８】
【特許文献１】
特開平１０−６３６６９号公報（４頁右欄３０行〜５頁左欄３８行、図１参照）
【０００９】
【特許文献２】
特開平５−３２４７０２号公報（８頁左欄３４行〜同頁右欄１２行参照）
【００１０】
【特許文献３】
特開平９−６２６８１号公報（図９，１７頁右欄１７行〜１９頁左欄１７行）
【００１１】
【発明が解決しようとする課題】
ところで、以上のような装置においては、次のような種々の問題が指摘されている。
【００１２】
先ず、前者の対訳データベース作成装置では、対訳メモリに格納されている対訳ペアは、翻訳対象文書の一文だけが考慮されているので、検索時の検索対象文の文脈や意図が何ら考慮されていない。その結果、対訳メモリの検索に際し、原文文字列は類似しているが、対訳文の意味やニューアンスを異にする複数の対訳ペアが対訳メモリ中に存在する場合、検索対象文の文脈の合致度に拘らず、単に原文文字列の最も高い一致度の対訳ペアが優先的に検索されるといった問題がでてくる。
【００１３】
一方、後者の会話文翻訳装置では、過去の翻訳済みの文書を有効に活用する点で意義を有するが、会話翻訳という観点から新たな問題が生じ、また十分な問題解決に至っていない。その理由について説明する。
【００１４】
その１つとしては、会話文翻訳装置は旅行会話文を対象としており、例えば税関手続、ホテルの出入り等に使われる語や挨拶等のシーン等にある程度のパターンがあるので、分類情報の網羅はある程度可能な状況にある。しかし、翻訳対象文書は、多種多様な一般的な文書であることから、あらゆる分類項目を網羅して対訳ペアを作成することは到底不可能なことであり、さらに分類の追加・変更などの更新も大変な労力と手間がかかる問題がある。
【００１５】
また、他の１つは、特開平９−６２６８１号公報に記載される「意味情報」は各対訳ペア自体から抽出したものであり、対訳ペア自体の意図が検索結果に反映できても、対訳ペアが作成された出典文書全体の文脈は反映することができない。このことは、未だ十分な問題解決に至っていないことを意味する。通常、会話文は、文単位で文意が明確となる場合がほとんどであり、前後の文脈を考慮する必要性はそれほどない。これに対し、一般的な文書は、一文単位だけでは意図が不明瞭であり、文意を汲み取るためには少なくとも前後の文脈を考慮する必要が多々ある。例えば「よくそのようなことがおできになりましたね。」という文は、肯定的な文脈から「賞賛」、否定的な文脈から「皮肉」の意味をもつため、前後の文脈に応じて訳文が大きく異なり、全く意味をもたない翻訳結果ないし検索結果となる問題がある。
【００１６】
本発明は上記事情にかんがみてなされたもので、対訳ペアに対訳対象文書の全体の特徴を考慮した情報を付加し、文脈や意図を反映した対訳情報を作成する対訳情報作成装置を提供することを目的とする。
【００１７】
また、本発明の他の目的は、検索対象原文に対し、文脈や意図を反映した対訳情報を利用し、文脈や意図を汲み取った検索結果（翻訳結果）を容易に検索可能とし、また検索結果の前後の文も出力し、検索結果の文がどのような文脈であるかを容易に把握可能とする対訳情報検索装置を提供することにある。
【００１８】
【課題を解決するための手段】
（１）上記課題を解決するために、本発明に係る対訳情報作成装置は、複数の文から構成される原文文書と当該原文文書の訳文である複数の文から構成される対訳文書を入力する文書データ入力手段と、この文書データ入力手段により入力された原文文書と訳文文書を文単位に対応付けて記憶手段に記憶する文対応付け手段と、前記原文文書から文書の特徴を表す文書識別情報を抽出する文書識別情報抽出手段と、前記文対応付け手段で文単位に対応付けられた前記原文文書と前記訳文文書との対訳ペアに前記抽出された文書識別情報を付加した対訳情報を作成し対訳メモリに記憶する対訳情報作成手段とを設けた構成である。
【００１９】
この発明は、以上のような構成とすることにより、対訳文書が入力されると、文対応付け手段は、対訳文書を構成する原文文書と訳文文書を文単位に対応付けし、一方、文書識別情報抽出手段は、原文文書全体から文書の特徴を表す文書識別情報を抽出する。しかる後、対訳情報作成手段は、文対応付けされた原文と訳文との対訳ペアに文書識別情報を付加した対訳情報（文書識別情報付き対訳ペア）を作成し対訳メモリに記憶する。従って、後に検索対象原文文書をもとに対訳ペアの訳文を検索する際、原文文書の文脈や意図を含んだ文書識別情報から適切な訳文を検索可能となる。
【００２１】
（２）本発明に係る対訳情報検索装置は、複数の文から構成される原文文書と当該原文文書の訳文である複数の文から構成される対訳文書について、前記原文文書と前記対訳文書を文単位に対応付けるとともに、前記原文文書から抽出した文書の特徴を表す文書識別情報を付加した、複数の対訳情報を記憶する対訳メモリ記憶手段と、入力した複数の文から構成される検索対象原文文書を文単位に分割する文書分割手段と、この検索対象原文文書から文書の特徴を表す文書識別情報を抽出する文書識別情報抽出手段と、前記文単位に分割された前記検索対象原文と前記文書識別情報に基づいて、対訳メモリ記憶手段に記憶された前記複数の対訳情報との一致度を計算し、この一致度に基づいて前記複数の対訳情報の中から前記検索対象原文の訳文を検索する検索手段とを設けた構成である。
【００２２】
この発明は、以上のような構成とすることにより、検索対象原文文書から抽出される文書識別情報と既に対訳メモリに記憶される対訳情報の文書識別情報とに基づいて、検索対象原文文書の各文に対する対訳ペアの適切な訳文を検索することが可能である。
【００２４】
【発明の実施の形態】
以下、本発明の実施の形態について図面を参照して説明する。
【００２５】
図１は対訳情報の作成及び対訳情報の検索を含んだシステムの一実施の形態を示す全体構成図である。
【００２６】
このシステムは、翻訳処理された翻訳結果に基づいて対訳情報を作成する対訳情報作成装置１０と検索対象文に対して当該対訳情報作成装置１０により作成された対訳情報から最適な検索結果（翻訳結果）を検索する対訳情報検索装置２０とによって構成されている。
【００２７】
この対訳情報作成装置１０は、ＣＰＵで構成され、対訳情報作成対象となる第一言語による原文文書と第二言語による訳文文書よりなる対訳文書を入力する文書データ入力部１１と、この文書データ入力部１１からバッフアメモリ１２に格納される対訳文書を文単位で対応付ける文対応付け部１３と、原文文書全体から文書の特徴を表す文書識別情報を抽出する文書識別情報抽出部１４と、文対応付け部１３により文単位に対応付けられた原文及び訳文よりなる対訳ペア（対訳文）に文書識別情報抽出部１４で抽出された文書識別情報を付け加えた対訳情報を作成し、対訳メモリ１５に記憶する対訳情報作成部１６とが設けられている。
【００２８】
なお、対訳情報作成装置１０には、対訳情報作成用プログラムを記録するプログラム記録媒体１７が設けられている。
【００２９】
前記文書データ入力部１１としては、翻訳処理後の原文及び訳文よりなる対訳文書を入力するもので、例えば入力機器であるマウス等を含むキーボード１１１、予め翻訳処理後の原文及び訳文よりなる対訳文書を記憶するファイル１１２、当該対訳文書が伝送されてくるインターネット、専用線、ＬＡＮ等のネットワーク１１３などが挙げられる。その他、トラックボール、タブレットなどのポインティングデバイス、光学式文字読取装置などがある。
【００３０】
一方、対訳情報検索装置２０は、検索対象文書である原文文書を入力する文書データ入力部１１と、この文書データ入力部１１から入力される原文文書を文単位に分割しバッフアメモリ１２に格納する文書分割部２１と、この文書分割部２１で分割された原文文書から文書の特徴を表す文書識別情報を抽出する文書識別情報抽出部１４と、この文書識別情報抽出部１４で抽出された文書識別情報及び前記検索対象文書である原文文書の各文の構成文字列をキーとして前記対訳メモリ１５の対訳情報の中から検索結果（翻訳結果）となる訳文を検索する対訳情報検索処理部２２と、この検索結果を検索対象原文文書とともに、或いは検索結果だけを出力する検索結果出力制御部２３とによって構成されている。
【００３１】
２４は検索結果出力部であって、原文文書を含み、或いは含まない検索結果を格納するファイル２４１、原文文書を含み、或いは含まない検索結果を表示する表示部２４２、原文文書を含み、或いは含まない検索結果を所要とする端末などに伝送するインターネット、専用線、ＬＡＮ等を含むネットワーク２４３などの何れか１つ以上が用いられている。
【００３２】
また、対訳情報検索装置２０には、対訳情報を検索処理する対訳情報検索用プログラムを記録するプログラム記録媒体２５が設けられている。
【００３３】
なお、対訳情報作成装置１０と対訳情報検索装置２０は個別にプログラム記録媒体１７，２５を設けたが、対訳情報作成か対訳情報検索かを判断させる機能を設ければ、対訳情報作成処理と対訳情報検索処理とを１つのプログラム記録媒体を用いて実現できることは言うまでもない。
【００３４】
次に、対訳情報作成装置１０と対訳情報検索装置２０とに分けて、それぞれの動作について順次説明する。
【００３５】
（１）対訳情報作成装置１０の動作について（図２及び図３参照）。
【００３６】
なお、図２は対訳情報作成装置１０の全体動作を説明する図、図３は図１に示す文書識別情報抽出部１４の詳細動作を説明する図である。
【００３７】
先ず、ユーザは、文書データ入力部１１から図４に示す対訳文書（例文１）を入力しバッフアメモリ１２に格納する（ＳＴ１１）。この対訳文書の上段は日本語文書である原文文書、下段は英語文書である訳文文書である。
【００３８】
ここで、以上のような対訳文書が入力されると、文対応付け部１３は、自動的に日本語文書の各文が英語文書のどの文に対応しているかを判断し対応付けを行う（ＳＴ１２：文対応付けステップ）。
【００３９】
この文対応付け部１３による文対応付け方法は、例えば対訳文書を構成する各文書を一文単位に分割し、日本語原文文書を英語に翻訳する翻訳辞書（図示せず）を用いて翻訳処理を行い、日本語原文文書の文単位の原文から生成される訳文と対訳文書の訳文との類似度を計算し、文書全体の中で最も類似度の高い訳文文書を選択し、日本語原文と訳文文書との文対応付けを行い、バッフアメモリ１２に格納する。
【００４０】
引き続き、文書識別情報抽出部１４は、文対応付けされた日本語文書と英語文書（訳文文書）に関し、後記するように文書全体の特徴を表す文書識別情報を抽出する（Ｓ１３：文書識別情報抽出ステップ）。この文書識別情報の詳細な抽出処理は、後記する（図３参照）。
【００４１】
ここで、文書識別情報抽出部１４が文書識別情報を抽出すると、文対応付けされた日本語文書の各文及び英語文書の各文と、文書識別情報とを対訳情報作成部１６に送出する。この対訳情報作成部１６では、日本語文書の各文及び英語文書の各文と文書識別情報とを受け取ると、文対応付けされた日本語文と英語訳文とを対（対訳ペア）とし、各対訳ペアに文書識別情報を付加した情報付き対訳ペア（対訳情報）を所要とする形式に従って対訳メモリ１５に記憶する（Ｓ１４）。
【００４２】
図５は対訳メモリ１５を示す図であって、文書識別情報付き対訳メモリ１５ａと文書識別情報定義テーブル１５ｂとからなり、文書識別情報付き対訳メモリ１５ａには文対応付けされた日本語文（Ｊ：）と英語文（Ｅ：）との対訳ペアとし、この各対訳ペアに文書識別情報（ＰＲＯＰ：）を付け加えたものを一つの単位とする情報付き対訳ペアの形式で記憶されている。この文書識別情報（ＰＲＯＰ：）には文書構成文字列見出し（ＪＷＤ＝ＪＷＤ１）と日本語文書及び英語文書に対する各文の構成情報ＳＮ、ＰＮが格納される。また、文書識別情報定義テーブル１５ｂには文書構成文字列見出し（ＪＷＤ１）に対応する文書構成文字列データが格納される。
【００４３】
（２）図２に示す文書識別情報抽出部１４の詳細動作について（図３参照）。
【００４４】
文対応付け部１３により文対応付けられた日本語文書と英語文書が入力されると、文書識別情報抽出部１４は、日本語文書全体にわたって文書構成文字列（ＪＷＤ）を抽出する（Ｓ１３１）。この文書構成文字列（ＪＷＤ）は、文書全体から意味のある語をほぼ全て抽出しているので、文書の文脈や意図を反映したものと言うことができる。ＪＷＤは各情報付き対訳ペアに共通の文書識別情報であるので、その抽出結果であるＪＷＤ１（文書構成文字列見出し）は、文書識別情報定義テーブル１５ｂに別途抽出頻度とともに定義付けしておく。
【００４５】
この文書構成文字列（ＪＷＤ）の切り出し法は、例えば日本語文書中の文字列に対して形態素解析を行い、自立語を中心とし、名詞、動詞、形容詞、副詞などを切り出し、例えば「美しければ」とある場合には「美しい」という活用形に変換する。この実施の形態においては、図４の日本文全体から切り出された構成文字列ＪＷＤ１には、「１万、台、売上、達成、心より、お祝い、申し上げる」などの語に加え、「成果、評判、高い」など、原文の肯定的な文脈もよく反映されていると言える。なお、各語に付記されるカッコ内の数字は文書内の出現頻度を表す。
【００４６】
次に、日本語文書及び英語文書に対する各文の構成情報を抽出する。
【００４７】
この構成情報の１つとしては、文書中の文番号（ＳＮ）を抽出する（Ｓ１３２）。この文番号ＳＮは、日本語文書及び英語文書に関し、総文数を分母とし、文番号を分子とする分数で表される。例えば図５の１番目の情報付き対訳ペアは、日本語文書の６文中の第１文なのでＳＮ＝Ｊ１／６、英語文書でも同様に６文中の第１文なのでＳＮ＝Ｅ１／６となる。日本語文書及び英語文書の第２文以降について同様に文番号（ＳＮ）を抽出する。
【００４８】
構成情報の他の１つとしては、文番号と同様な要領で文書中の段落番号（ＰＮ）を抽出する。この文書中の段落番号ＰＮは、日本語文書及び英語文書とも文書全体の総段落数を分母、該当文の段落数を分子とする分数で表される。例えば図５の１番目の情報付き対訳ペアは、日本語文書及び英語文書とも４段落で構成されており、かつ、１番最初の段落の文であるので、日本語文はＰＮ＝Ｊ１／４、英語文も同じくＰＮ＝Ｅ１／４となる。日本語文書及び英語文書の第２段落以降の文について同様に段落番号（ＰＮ）を抽出する。
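The SN and PN notation described in paragraphs 【００４７】 and 【００４８】 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `structure_labels` and the way the six sentences are spread over the four paragraphs are hypothetical, since the text gives only the totals (six sentences, four paragraphs).

```python
def structure_labels(paragraphs, lang):
    """Return (SN, PN) labels such as ("J1/6", "J1/4") for every sentence.

    paragraphs: list of paragraphs, each a list of sentences.
    lang: language prefix used in the labels, e.g. "J" or "E".
    """
    total_sentences = sum(len(p) for p in paragraphs)
    total_paragraphs = len(paragraphs)
    labels = []
    sentence_no = 0
    for paragraph_no, paragraph in enumerate(paragraphs, start=1):
        for _ in paragraph:
            sentence_no += 1
            sn = f"{lang}{sentence_no}/{total_sentences}"
            pn = f"{lang}{paragraph_no}/{total_paragraphs}"
            labels.append((sn, pn))
    return labels


# Hypothetical layout: six sentences in four paragraphs.
example_doc = [["s1"], ["s2", "s3"], ["s4", "s5"], ["s6"]]
labels = structure_labels(example_doc, "J")
```

The first label pair is `("J1/6", "J1/4")`, matching the SN and PN values the text gives for the first translation pair in FIG. 5. Keeping the total in the denominator lets a reader of the memory see where a sentence sits in its source document without opening that document.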
【００４９】
なお、抽出する文書識別情報は、以上のような情報に限らず、例えば英語文書を構成する文字列、ファイル名、ファイル作成日時、作成者名、関連する顧客情報など、本装置のユーザが必要に応じて種々の情報を付与することが可能である。図６は文書データ入力部１１から入力される例文２を示す図であり、上段の日本語文書である原文文書、下段の英語文書である訳文よりなる対訳文書が示されている。
【００５０】
図７は、文書データ入力部１１から入力された例文２の対訳文書に関する情報付き対訳ペアを図５の対訳メモリ１５上に更に加えた例である。この例に示すように、対訳メモリ１５上には第４番目以降の情報付き対訳ペアが付加されている。この例の４番目の対訳ペアに見られるように、片方の言語の一文に対し、もう片方の言語の複数の文が対応する場合、分子の文書番号がＳＮ＝Ｊ４＋５／１０のごとく、プラス記号（＋）で結ばれて列挙される。また、図６の日本語文書から切り出される文書構成文字列見出しＪＷＤ２に対応する文書構成文字列は、「先ごろ、貴殿、届く、同封」等のほかに、「抗議」という否定的な文脈の語が含まれており、これらの語から原文は否定的な文脈であることが把握できる。
【００５１】
図７に示す情報付き対訳メモリ１５ａの中には、「よくこのようなことがおできに……」という、日本語ではほぼ等しいが、英語ではかなり異なる２つの対訳ペアが格納されている（図７網掛け部分参照）。この２つの対訳ペアにはそれぞれ異なる文書構成文字列見出し（ＪＷＤ１及びＪＷＤ２）が付いており、それぞれ肯定的な文脈及び否定的な文脈の原文から抽出された文であることが理解できる。
【００５２】
（３）対訳情報検索装置２０の動作について（図８及び図９参照）。なお、図８は対訳情報検索装置２０の全体動作を説明する図、図９は図１に示す対訳情報検索処理部２２の詳細動作を説明する図である。
【００５３】
この対訳情報検索装置は、ユーザが文書データ入力部１１から図１０を示す例文３の日本語文書（翻訳対象文書ないし検索対象文書）を入力し（ＳＴ２１）、文書分割部２１に送出する。この文書分割部２１では、文書データ入力部１１から入力される日本語文書を文単位に分割処理し、これら分割された日本語の各文は順次バッフアメモリ１２に格納する（Ｓ２２）。
【００５４】
しかる後、文書識別情報抽出部１４は、前記対訳情報作成装置１０で説明したとほぼ同様な手段によって文書識別情報を抽出する（Ｓ２３）。ここでは、文書識別情報抽出部１４の詳しい処理動作は図３の説明に譲る。
【００５５】
この文書識別情報抽出部１４は、文書識別情報を抽出した後、分割された日本語文書と文書識別情報を対訳情報検索処理部２２に渡す。この対訳情報検索処理部２２は、分割された日本語文書と文書識別情報とに基づいて検索処理を実行する（Ｓ２４）。この対訳情報検索処理部２２による検索処理の詳細は後記する（図９参照）。
【００５６】
この対訳情報検索処理部２２は、検索処理を終了すると、検索結果が成功したか否かを判断する（Ｓ２５）。検索結果が失敗の場合、検索結果出力制御部２３は表示部２４２に検索結果無しの状態を表示する（Ｓ２６）。検索結果が成功した場合、検索対象日本語文に基づいて対訳ペアとなっている英語文を抽出し、検索結果出力制御部２３に渡す（Ｓ２７）。この検索結果出力制御部２３は、受け取った検索結果を表示部２４２又はプリンタ（図示せず）に出力する（Ｓ２８）。
次に、対訳情報検索処理部２２の検索処理の詳細について図９を参照して説明する。
【００５７】
この対訳情報検索処理部２２の検索処理は、文書分割部２１により一文単位に分割された日本語文書及び文書識別情報抽出部１４で抽出された文書識別情報から、識別情報付き日本語文書を作成し、対訳メモリ１５に格納する（Ｓ２４１）。図１１は対訳メモリ１５のデータ配列構成を示す図であって、文書識別情報付き対訳メモリ１５ａには図１０に示す日本語文書の例文３から作成された文書識別情報付き日本語文書が格納され、文書識別情報定義テーブル１５ｂには日本語文書の文書構成文字列データが格納されている。
【００５８】
この文書識別情報付き日本語文書は、一文単位に分割された日本語文（Ｊ：）に文書識別情報（ＰＲＯＰ）を付与した一つの単位（以下、情報付き日本語文と呼ぶ）として構成されている。この文書識別情報は、前述する対訳情報作成装置１０とほぼ同様のデータ配列構成を有しており、例えば文書構成文字列見出し（ＪＷＤ＝ＪＷＤＰ）、文番号（ＳＮ）、段落番号（ＰＮ）などからなっている。同様に、文書識別情報定義テーブル１５ｂには文書構成文字列見出しに対応する文書構成文字列データが定義されている。しかし、その定義内容は、前述する対訳情報作成装置１０と多少異なり、日本語の文書構成文字列が段落別に抽出され、抽出結果としてＪＷＤＰ１〜ＪＷＤＰ５別に分けられている。これは、検索対象文書が長く、多数の段落から構成されている場合、文書全体をひとまとめにした処理だけでなく、後記する文書識別情報に関する処理を段落単位で行えるようにするためである。
【００５９】
引き続き、検索対象となる情報付き日本語文と文書構成情報付き対訳メモリ１５ａ中の情報付き対訳ペアの一方である原文との一致度を計算する（Ｓ２４２〜Ｓ２４５）。この検索装置２０における一致度計算のポイントは、日本語文字列の一致度に加え、さらに文書識別情報の一致度も考慮する点にある。この文書識別情報は、検索対象の日本語文書全体の文脈や文意を反映しているので、これにより検索対象の日本語文書の文脈を考慮した検索が可能となる。特に、日本語文をもつ同様な複数の対訳ペアが対訳メモリ１５ａに存在しても、文書識別情報の文書構成文字列の一致度を考慮することにより、検索対象の日本語文と文脈的に一致度の高い対訳ペアを検索することが可能となる。
【００６０】
なお、文書識別情報の一致度の計算は、最も単純な一計算法を説明すれば、例えば文書識別情報中の文書構成文字列（ＪＷＤ）を直交ベクトル成分とする文書全体を代表する文書ベクトルを作成し、ベクトルの内積を一致度とするベクトル空間法が用いられる。
【００６１】
ここで、検索対象である図１１の文書識別情報付き日本語文に対して、図７の対訳メモリ１５ａを検索した場合を例とし、一致度の計算処理（Ｓ２４２〜Ｓ２４５）を具体的に説明する。
【００６２】
今、図１１に示す文書識別情報定義テーブル１５ｂにあるすべての文書構成文字列（ＪＷＤＰ１〜ＪＷＤＰ５）に基づき、各文書構成文字列をベクトル成分とし、その文書識別情報の頻度を重みとした検索対象文書ベクトル（Dtr）を作成する（Ｓ２４２）。ここで、段落別の文書構成文字列ＪＷＤＰ１、ＪＷＤＰ２などから、それぞれ個別に検索対象ベクトルを作成し、これら複数のベクトルを同時に考慮すれば、段落ごとの文脈をきめ細かく反映した検索が可能となる。
【００６３】
次に、図７に示す文書識別情報付き対訳ペアに付与された各文書構成文字列（ＪＷＤ１、ＪＷＤ２）に基づき、各文字列をベクトル成分とし、頻度を重みとした対訳メモリ１５の文書ベクトル（Ｄ_TM１、Ｄ_TM２）を作成する（Ｓ２４３）。さらに、検索対象文の文書ベクトルと対訳メモリ１５の文書ベクトルとの一致度を求めるために、ＤtrとＤ_TM１、ＤtrとＤ_TM２の内積をそれぞれ計算する（Ｓ２４４）。Ｄ_TM１では、「お祝い、健闘」が一致することから内積値はゼロより大きい正の整数となるが、Ｄ_TM２では、一致項目が無いので、内積値はゼロとなる。その結果、ＤtrとＤ_TM１の内積値はＤtrとＤ_TM２の内積値よりも大きく、Ｄ_TM１の方の一致度が高いことが分かる。
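The inner-product comparison of steps S242 to S244 can be illustrated with a short Python sketch. It is only an illustration: the words are taken from the examples quoted in the description, but every frequency is assumed to be 1 because the actual counts in FIGS. 7 and 11 are not reproduced here, and the function name `inner_product` is hypothetical.

```python
from collections import Counter


def inner_product(doc_a, doc_b):
    """Inner product of two frequency-weighted document vectors.

    Each vector maps a document constituent word to its frequency;
    words absent from a vector count as zero.
    """
    return sum(freq * doc_b.get(word, 0) for word, freq in doc_a.items())


# Assumed unit frequencies; the real values come from the definition table.
d_tr = Counter(["お祝い", "健闘", "栄誉", "獲得", "勝利"])    # search target document
d_tm1 = Counter(["お祝い", "健闘", "達成", "成果", "評判"])   # JWD1 (positive context)
d_tm2 = Counter(["先ごろ", "貴殿", "届く", "同封", "抗議"])   # JWD2 (negative context)

score_tm1 = inner_product(d_tr, d_tm1)
score_tm2 = inner_product(d_tr, d_tm2)
```

Under these assumptions `score_tm1` is 2 (the shared words are 「お祝い」 and 「健闘」) and `score_tm2` is 0, which reproduces the ordering stated in the text.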
【００６４】
次に、検索対象文書から一文を取り上げ、文字列の一致度について計算する（Ｓ２４５）。一例として、図１１の一文である「よくこのようなことがおできになりましたね」（図１１の網掛け部分参照）を文字列検索した場合を考えてみる。図７の情報付き対訳メモリ１５ａの中には、日本語文が「よくこのようなことがおできになりますね。」と「よくこのようなことがおできになりましたね。」である二つの対訳ペアが存在し（図７の網掛け部分参照）、それぞれに文書識別情報ＪＷＤ１、ＪＷＤ２が付与されている。
【００６５】
そこで、検索対象文を意味のある４つの語「よく・このような・こと・おできになりましたね」に分解したとする。このような文において、活用が異なるだけで基本形が一致している場合には０．５の重みで一致と考える。このような条件のもとに一致度を計算すると、ＪＷＤ１が付与されている日本語文は、４語中の３語が完全に一致し、１語は活用の違いだけであって基本形は一致するので、一致度は、（３／４）＋｛０．５（１／４）｝＝０．８８となる。一方、ＪＷＤ２が付与されている日本語文は、４語中４語が一致するので、一致度は４／４＝１となる。従って、文字列の一致度だけを考慮すると、ＪＷＤ２が付与されている対訳ペアの方が一致度が高い。しかし、最終的な一致度は、文書構成文字列の一致度と文書識別情報の一致度との両方を考慮し、、例えば２つの一致度を掛けた値とすれば、ＪＷＤ２が付与された対訳ペアの文書識別情報の一致度がゼロになり、結局、ＪＷＤ１が付与された対訳ペアの一致度の方が高くなる。このことは、文字列の一致度が低くても、文脈の一致度が高い対訳ペアが選択されることになる。
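The arithmetic in paragraph 【００６５】 can be checked with the following Python sketch. Several points are assumptions made for illustration: words are compared position by position, the base-form lookup table `BASE_FORM` is a hypothetical stand-in for morphological analysis, and the document-level scores 2 and 0 are illustrative values standing for a positive and a zero inner product.

```python
# Hypothetical base-form table standing in for morphological analysis.
BASE_FORM = {
    "おできになりましたね": "おできになる",
    "おできになりますね": "おできになる",
}


def base_form(word):
    return BASE_FORM.get(word, word)


def string_match(query_words, candidate_words):
    """Exact match counts 1, same base form with different inflection counts 0.5."""
    score = 0.0
    for q, c in zip(query_words, candidate_words):
        if q == c:
            score += 1.0
        elif base_form(q) == base_form(c):
            score += 0.5
    return score / len(query_words)


query = ["よく", "このような", "こと", "おできになりましたね"]
pair_jwd1 = ["よく", "このような", "こと", "おできになりますね"]
pair_jwd2 = ["よく", "このような", "こと", "おできになりましたね"]

s1 = string_match(query, pair_jwd1)   # 3/4 + 0.5 * (1/4)
s2 = string_match(query, pair_jwd2)   # 4/4

# Final score as the product of string score and document-level score.
final_jwd1 = s1 * 2   # assumed positive document score for JWD1
final_jwd2 = s2 * 0   # document score of zero for JWD2
```

The string score for the JWD1 pair is exactly 0.875, which the text rounds to 0.88. Multiplying by the document-level score makes the JWD1 pair rank first even though its string score is lower.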
【００６６】
従って、以上のような対訳情報作成装置１０は、対訳対象の一文だけでなく、対訳対象文書全体の特徴を反映するように対訳情報を作成するので、対訳情報検索装置２０では、図１０に示す検索対象文書に関し、文脈や意図を考慮した訳文を検索することができる。なお、前述の説明は、ごく単純な例を挙げて説明をしたが、文書ベクトルを構成するベクトル成分を作成する際、以下のようなステップ数を導入することにより、一致度計算の精度を上げることができる。すなわち、各日本語構成文字列そのものをベクトル成分とせずに同意語、関連語などの相関が大きな語を分類（クラスタリング）し、同一分類に入る語をサブ成分としてベクトル成分を再構築する。同じベクトル成分に分類された同義語及び関連語は一致する語と見なすことにより、一致度はより文の主旨を反映したものとなる。例えば前記例において、Ｄ_TM１では、「お祝い、健闘」の二語だけが一致していたが、対訳メモリ１５（図７参照）中の「達成、成果」等の語と、「検索対象文（図１０参照）中の「栄誉、獲得、勝利」などの関連語も「一致する」と期待することができる。
【００６７】
また、図１０に示す検索対象文中にＪＷＤ２の文字列（図６の日本語文書＝例文２）と一致する文のスタイルに関わる「貴殿、届く」などの語が含まれていたとしても、以上のような処理を実行することにより、文意を反映した一致度を十分高くできれば、文意に即した検索を行うことができる。
【００６９】
なお、クラスタリングを行うには、相関強度の定義が必要になるが、例えばニューラルネットワークを利用した自動学習、ＥＤＲ（Electronic Dictionary Research）編集の電子化辞書又はWordNet等の同意語、関連語、概念等の既存の分類体系を利用することができる。このような辞書、分類体系を利用することにより、文書の構成文字列から必要に応じて分類を作成することができ、特に分類を用意する必要がなく、分類の追加、変更も柔軟に行うことができる。
【００７０】
次に、図１に示す検索結果出力制御部２３、図８のステップＳ２５ないしＳ２８の詳細について説明する。
【００７１】
ステップＳ２５において、検索が成功した場合、検索結果を表示部２４２に表示するが、本発明の対訳情報検索装置２０の検索結果出力制御部２３では、検索結果だけでなく、その前後の文も同時に表示する方法を採用する。つまり、対訳メモリ１５に格納される文書識別情報の中に、対訳ペアの出典文書全体の通し番号（文番号）が記述されている。そこで、この文番号を利用し、検索結果の英語文及び日本語文の前後の文を表示することができる。
【００７２】
図１２は翻訳結果の表示部２４２への表示例を示す図である。この検索結果出力制御部２３は、左上側に検索対象原文表示ウインドウ２４２ａ、右上側に訳文表示ウインドウ２４２ｂが配置されている。この検索対象原文表示ウインドウ２４２ａには検索対象原文である日本語文が表示され、一方、訳文表示ウインドウ２４２ｂには対訳メモリ１５を参照し検索結果である翻訳結果英語文が表示される。
【００７３】
このような状態において、検索対象原文である日本語文書の一文をマウスで選択すると、表示部下側に対訳メモリ検索結果表示ウインドウ２４２ｃが表れ、ここに検索対象原文のみの検索結果が表示される。さらに、検索結果表示ウインドウ２４２ｃに表示された検索結果をマウスで選択し、右クリックすると、ウインドウ２４２ｄが表れ、このウインドウ２４２ｄには検索結果文の出典文書における前後に位置する文もポップアップ表示される。これにより、検索結果の一文がどのような文脈で用いられているかを容易に把握することができる。
【００７４】
なお、本願発明は、上記実施の形態に限定されるものでなく、その要旨を逸脱しない範囲で種々変形して実施できる。
【００７５】
また、各実施の形態は可能な限り組み合わせて実施することが可能であり、その場合には組み合わせによる効果が得られる。さらに、上記各実施の形態には種々の上位，下位段階の発明が含まれており、開示された複数の構成要素の適宜な組み合わせにより種々の発明が抽出され得るものである。例えば問題点を解決するための手段に記載される全構成要件から幾つかの構成要件が省略されうることで発明が抽出された場合には、その抽出された発明を実施する場合には省略部分が周知慣用技術で適宜補われるものである。
【００７６】
【発明の効果】
以上説明したように本発明によれば、対訳ペアに原文文書の全体の特徴を考慮した文書識別情報を付加することにより、原文文書の文脈や意図を反映した対訳情報を作成することができ、また原文文書の構成文字列から容易に分類分けされた文書識別情報付き対訳情報を作成できる対訳情報作成装置を提供できる。
【００７７】
また、本発明は、検索対象原文に対し、文脈や意図を汲み取った第二言語の検索結果（翻訳結果）を容易に検索でき、また検索結果の前後の文も同時に出力すれば、検索結果の文がどのような文脈となっているか容易に把握できる対訳情報検索装置を提供できる。
【図面の簡単な説明】
【図１】本発明に係る対訳情報作成装置及び対訳情報検索装置の一実施の形態を含んだシステムの構成図。
【図２】対訳情報作成装置の動作を説明するフローチャート。
【図３】図１に示す対訳情報作成装置の文書識別情報抽出部の動作例を説明するフローチャート。
【図４】例文１としての入力原文とこの入力原文の対訳文書（訳文）との関係を示す図。
【図５】例文１に関する文書の文ごとの対訳ペアに文書識別情報を付加した対訳情報が格納された対訳メモリのデータ配列構成を示す図。
【図６】例文２としての入力原文とこの入力原文の対訳文書（訳文）との関係を示す図。
【図７】例文１に関する対訳情報に例文２に関する対訳情報を付け加えた対訳メモリのデータ配列構成を示す図。
【図８】対訳情報検索装置の動作を説明するフローチャート。
【図９】図１に示す対訳情報検索装置の対訳情報検索部の動作例を説明するフローチャート。
【図１０】例文３としての検索対象文書を説明する図。
【図１１】例文３に関する文書の文ごとの対訳ペアに文書識別情報を付加した対訳情報が格納された対訳メモリのデータ配列構成を示す図。
【図１２】図１に示す対訳情報検索装置の検索結果出力制御部における表示部への表示状態を示す図。
【符号の説明】
１０…対訳情報作成装置、１１…文書データ入力部、１３…文対応付け部、１４…文書識別情報抽出部、１５…対訳メモリ、１５ａ…文書識別情報付き対訳メモリ、１５ｂ…文書識別情報定義テーブル、１６…対訳情報作成部、１７…プログラム記録媒体、２０…対訳情報検索装置、２１…文書分割部、２２…対訳情報検索処理部、２３…検索結果出力制御部、２５…プログラム記録媒体。
[0001]
[Technical field of the invention]
The present invention relates to a bilingual information creation device and a bilingual information search device used in translation support tools and the like.
[0002]
[Prior art]
With the progress of internationalization, the need for information exchange using foreign languages is increasing, and machine translation is highly expected as a tool for such information exchange. However, it cannot be said that the result of translation by the current machine translation technology is at a level where manual rework is unnecessary, and further improvement in translation accuracy is required. Therefore, a great deal of labor and time are required to improve the translation accuracy to the state where there is no manual correction by using the current machine translation system.
[0003]
Therefore, conventionally, when a new translation is performed, the following several techniques have been proposed in order to effectively use a document that has been translated in the past.
[0004]
One of them is a bilingual database creation device, in which a user associates an original sentence with its translation on a sentence-by-sentence basis (hereinafter referred to as a “translation pair”) and stores the pair in a database (hereinafter referred to as a “translation memory”). As further source text is translated, translation pairs of original and translated sentences are created and stored one after another in the translation memory, building up the bilingual information.
[0005]
Accordingly, when a newly input sentence is to be translated, the translation memory built from past translations is searched, and if a sentence similar to the sentence to be translated exists, the translated sentence stored in the translation memory is adopted as the translation instead of generating one by machine translation (see Japanese Patent Laid-Open No. 10-63669).
[0006]
Another technique for making effective use of past translated documents is the conversational sentence translation device (Japanese Patent Laid-Open No. 5-324702, Japanese Patent Laid-Open No. 9-62681). These devices attach classification information for pre-prepared conversational example sentences to each translation pair, so that translation pairs closer to what the person searching the translation memory intends can be retrieved. The classification information uses as keywords phrases that express the purpose of the conversation and the assumed scene, for example “negotiating a room” or “pay, deposit”.
[0007]
Furthermore, for the conversational sentence translation device of Japanese Patent Laid-Open No. 9-62681, a method has also been proposed in which “semantic information” is extracted from the source character string of each translation pair stored in the translation memory and attached to that pair, so that the meaning of the translation pair is reflected more faithfully in the search. This “semantic information” mainly uses the base forms of independent words. A “semantic feature dictionary” is created in advance that associates each “semantic feature” with its synonyms, inflected forms, variations of expression and so on, and the semantic information is extracted by referring to this dictionary. For example, the semantic feature “request” is associated with “request, requested, I would like to request” and the like.
[0008]
[Patent Document 1]
Japanese Patent Laid-Open No. 10-63669 (see page 4, right column, line 30 to page 5, left column, line 38, and FIG. 1)
[0009]
[Patent Document 2]
Japanese Patent Laid-Open No. 5-324702 (see page 8, left column, line 34 to right column, line 12 of the same page)
[0010]
[Patent Document 3]
Japanese Patent Laid-Open No. 9-62681 (FIG. 9; page 17, right column, line 17 to page 19, left column, line 17)
[0011]
[Problems to be solved by the invention]
However, various problems such as the following have been pointed out in the devices described above.
[0012]
First, in the former bilingual database creation device, each translation pair stored in the translation memory takes only a single sentence of the translated document into account, so the context and intent of the sentence being searched for are not considered at all. As a result, when the translation memory contains several translation pairs whose source character strings are similar but whose translations differ in meaning or nuance, the translation pair whose source character string matches most closely is retrieved preferentially, regardless of how well it fits the context of the sentence being searched for.
[0013]
The latter conversational sentence translation device, on the other hand, is significant in that it makes effective use of past translated documents, but new problems arise because it is designed for conversational translation, and it does not solve the problem adequately. The reasons are as follows.
[0014]
One reason is that the conversational sentence translation device targets travel conversation. Because the words used in, for example, customs procedures and hotel check-in and check-out, and scenes such as greetings, follow patterns to some extent, the classification information can be covered reasonably completely. Documents to be translated, however, are general documents of a wide variety, so it is practically impossible to create translation pairs that cover every classification item, and updating the classifications by adding or changing them also takes considerable labor and time.
[0015]
The other is that the “semantic information” described in Japanese Patent Laid-Open No. 9-62681 is extracted from each translation pair itself. Even if the intent of the translation pair itself can be reflected in the search result, the context of the entire source document from which the pair was created cannot. In this sense the problem has not yet been adequately solved. In conversation, the meaning of a sentence is in most cases clear from the sentence alone, and there is little need to consider the surrounding context. In general documents, by contrast, the intent is often unclear from a single sentence, and at least the preceding and following context frequently has to be considered to grasp the meaning. For example, the sentence “よくそのようなことがおできになりましたね。” (roughly, “How well you managed to do such a thing.”) means praise in a positive context and sarcasm in a negative context. The appropriate translation therefore differs greatly depending on the surrounding context, and ignoring it can yield a translation result or search result that is entirely meaningless.
[0016]
The present invention has been made in view of the above circumstances, and its object is to provide a bilingual information creation device that adds to each translation pair information reflecting the characteristics of the entire translated document, and thereby creates bilingual information that reflects context and intent.
[0017]
Another object of the present invention is to provide a bilingual information search device that uses bilingual information reflecting context and intent so that a search result (translation result) capturing the context and intent of the original text to be searched can be retrieved easily, and that also outputs the sentences before and after the search result so that the context in which the retrieved sentence appears can be grasped easily.
[0018]
[Means for Solving the Problems]
(1) In order to solve the above problems, a bilingual information creation device according to the present invention comprises: document data input means for inputting an original document composed of a plurality of sentences and its counterpart translated document composed of a plurality of sentences that are translations of the original document; sentence association means for associating the original document and the translated document input by the document data input means sentence by sentence and storing them in storage means; document identification information extraction means for extracting, from the original document, document identification information representing the characteristics of the document; and bilingual information creation means for creating bilingual information in which the extracted document identification information is added to each translation pair of the original document and the translated document associated sentence by sentence by the sentence association means, and storing the bilingual information in a translation memory.
[0019]
With this configuration, when a bilingual document is input, the sentence association means associates the original document and the translated document that make up the bilingual document sentence by sentence, while the document identification information extraction means extracts document identification information representing the characteristics of the document from the entire original document. The bilingual information creation means then creates bilingual information (translation pairs with document identification information) by adding the document identification information to each sentence-associated translation pair of original and translated sentences, and stores it in the translation memory. Consequently, when the translation in a translation pair is later searched for on the basis of an original document to be searched, an appropriate translation can be retrieved using document identification information that carries the context and intent of the original document.
[0021]
(2) A bilingual information search device according to the present invention comprises: translation memory storage means for storing a plurality of pieces of bilingual information in which, for an original document composed of a plurality of sentences and a translated document composed of a plurality of sentences that are translations of the original document, the original document and the translated document are associated sentence by sentence and document identification information representing the characteristics of the document, extracted from the original document, is added; document division means for dividing an input original document to be searched, composed of a plurality of sentences, into sentences; document identification information extraction means for extracting, from the original document to be searched, document identification information representing the characteristics of the document; and search means for calculating, on the basis of the original sentences obtained by the division and the document identification information, a degree of matching with the plurality of pieces of bilingual information stored in the translation memory storage means, and searching the plurality of pieces of bilingual information for a translation of the original sentence to be searched on the basis of the degree of matching.
[0022]
With this configuration, an appropriate translation in a translation pair can be retrieved for each sentence of the original document to be searched, on the basis of the document identification information extracted from that document and the document identification information of the bilingual information already stored in the translation memory.
[0024]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0025]
FIG. 1 is an overall configuration diagram showing an embodiment of a system that covers both the creation of bilingual information and the search of bilingual information.
[0026]
This system consists of a bilingual information creation device 10, which creates bilingual information from translation results that have already been produced, and a bilingual information search device 20, which retrieves the best search result (translation result) for a sentence to be searched from the bilingual information created by the bilingual information creation device 10.
[0027]
This bilingual information creation device 10 is implemented on a CPU and comprises: a document data input unit 11 for inputting a bilingual document, consisting of an original document in a first language and a translated document in a second language, from which bilingual information is to be created; a sentence association unit 13 that associates, sentence by sentence, the bilingual document passed from the document data input unit 11 and stored in the buffer memory 12; a document identification information extraction unit 14 that extracts document identification information representing the characteristics of the document from the entire original document; and a bilingual information creation unit 16 that creates bilingual information by adding the document identification information extracted by the document identification information extraction unit 14 to each translation pair (bilingual sentence) of original and translated sentences associated by the sentence association unit 13, and stores it in the translation memory 15.
[0028]
The bilingual information creation device 10 is also provided with a program recording medium 17 that records a bilingual information creation program.
[0029]
The document data input unit 11 inputs a bilingual document consisting of an original text and its translation after translation processing. Examples include a keyboard 111 (an input device, including a mouse or the like), a file 112 that stores in advance bilingual documents consisting of original and translated text, and a network 113 such as the Internet, a dedicated line, or a LAN over which bilingual documents are transmitted. Other options include pointing devices such as trackballs and tablets, and optical character readers.
[0030]
The bilingual information search device 20, on the other hand, comprises: a document data input unit 11 for inputting an original document that is the document to be searched; a document division unit 21 that divides the original document input from the document data input unit 11 into sentences and stores them in the buffer memory 12; a document identification information extraction unit 14 that extracts document identification information representing the characteristics of the document from the original document divided by the document division unit 21; a bilingual information search processing unit 22 that, using as keys the document identification information extracted by the document identification information extraction unit 14 and the constituent character string of each sentence of the original document to be searched, searches the bilingual information in the translation memory 15 for a translation serving as the search result (translation result); and a search result output control unit 23 that outputs the search result either together with the original document to be searched or on its own.
[0031]
Reference numeral 24 denotes a search result output unit, for which one or more of the following is used: a file 241 that stores the search result, with or without the original document; a display unit 242 that displays the search result, with or without the original document; and a network 243, such as the Internet, a dedicated line, or a LAN, that transmits the search result, with or without the original document, to a terminal or the like that requires it.
[0032]
In addition, the bilingual information search device 20 is provided with a program recording medium 25 that records a bilingual information search program for performing the search processing.
[0033]
Although the bilingual information creation device 10 and the bilingual information search device 20 are each provided with their own program recording medium (17 and 25), it goes without saying that, if a function for determining whether bilingual information is to be created or searched is provided, both the creation processing and the search processing can be implemented with a single program recording medium.
[0034]
Next, the operation of the bilingual information creation device 10 and of the bilingual information search device 20 will be described in turn.
[0035]
(1) Operation of the bilingual information creation device 10 (see FIGS. 2 and 3).
[0036]
FIG. 2 is a diagram explaining the overall operation of the bilingual information creation device 10, and FIG. 3 is a diagram explaining the detailed operation of the document identification information extraction unit 14 shown in FIG. 1.
[0037]
First, the user inputs the bilingual document (example sentence 1) shown in FIG. 4 from the document data input unit 11 and stores it in the buffer memory 12 (ST11). The upper part of this bilingual document is the original document in Japanese, and the lower part is the translated document in English.
[0038]
When such a bilingual document is input, the sentence association unit 13 automatically determines which sentence of the English document each sentence of the Japanese document corresponds to, and associates them (ST12: sentence association step).
[0039]
In the sentence association method used by the sentence association unit 13, for example, each document constituting the parallel translation document is divided into sentence units, and the Japanese original is translated into English using a translation dictionary (not shown). The similarity between each translated sentence generated from the Japanese original and each sentence of the English translation document is then calculated, and the English sentence with the highest similarity in the entire document is selected. The sentences associated in this way are stored in the buffer memory 12.
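The association step described above can be sketched as follows. This is only an illustrative sketch: the word-overlap similarity, the function names, and the toy lookup table standing in for the translation dictionary are all assumptions, since the description does not specify the similarity measure.

```python
from typing import Callable, List, Tuple


def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets (an assumed similarity measure)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def associate_sentences(
    source_sentences: List[str],
    target_sentences: List[str],
    translate: Callable[[str], str],
) -> List[Tuple[int, int]]:
    """Pair each source sentence with the most similar target sentence.

    `translate` stands in for the dictionary-based translation step.
    Returns (source_index, target_index) pairs.
    """
    pairs = []
    for i, src in enumerate(source_sentences):
        draft = translate(src)  # rough translation of the source sentence
        scores = [word_overlap(draft, tgt) for tgt in target_sentences]
        best = max(range(len(target_sentences)), key=scores.__getitem__)
        pairs.append((i, best))
    return pairs


# Toy data: "S1" and "S2" are placeholders for Japanese sentences, and the
# lookup table plays the role of the translation dictionary (hypothetical).
toy_dictionary = {
    "S1": "congratulations on the sales achievement",
    "S2": "the product has a high reputation",
}
english = [
    "The product enjoys a high reputation.",
    "Congratulations on your sales achievement!",
]
alignment = associate_sentences(["S1", "S2"], english, lambda s: toy_dictionary[s])
```

Here each source sentence is matched independently; a fuller implementation would also need to handle one-to-many correspondences such as the SN = J4+5/10 case.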
[0040]
Subsequently, the document identification information extraction unit 14 extracts, from the sentence-associated Japanese document and English document (translation document), document identification information representing the characteristics of the whole document, as described later (S13: document identification information extraction step). The detailed process of extracting the document identification information will be described later (see FIG. 3).
[0041]
When the document identification information extraction unit 14 has extracted the document identification information, each sentence of the Japanese document and each sentence of the English document that have been associated with each other are sent, together with the document identification information, to the bilingual information creation unit 16. On receiving them, the bilingual information creation unit 16 pairs each Japanese sentence with its corresponding English translation (a parallel translation pair), adds the document identification information to the pair to form a bilingual pair with information (bilingual information), and stores it in the bilingual memory 15 in the required format (S14).
[0042]
FIG. 5 is a diagram showing the bilingual memory 15, which is composed of a bilingual memory 15a with document identification information and a document identification information definition table 15b. The bilingual memory 15a with document identification information stores bilingual pairs, each consisting of a Japanese sentence (J:) and an English sentence (E:), in the form of bilingual pairs with information, in which document identification information (PROP:) is added to each bilingual pair to form one unit. This document identification information (PROP:) stores a document configuration character string heading (JWD = JWD1) and the configuration information SN and PN of each sentence for the Japanese document and the English document. The document identification information definition table 15b stores the document configuration character string data corresponding to the document configuration character string heading (JWD1).
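One possible in-memory representation of this layout is sketched below. The class and field names are hypothetical, and the sentences and frequencies are placeholders rather than the contents of FIG. 5.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class BilingualPairWithInfo:
    """One unit of the bilingual memory 15a (field names are assumptions)."""
    japanese: str  # J: original sentence
    english: str   # E: translated sentence
    jwd: str       # PROP: heading into the definition table, e.g. "JWD1"
    sn_j: str      # sentence number in the Japanese document, e.g. "J1/6"
    sn_e: str      # sentence number in the English document, e.g. "E1/6"
    pn_j: str      # paragraph number in the Japanese document, e.g. "J1/4"
    pn_e: str      # paragraph number in the English document, e.g. "E1/4"


# Definition table 15b: heading -> {constituent string: frequency}.
# The frequencies below are placeholder values for illustration.
definition_table: Dict[str, Dict[str, int]] = {
    "JWD1": {"sales": 2, "achievement": 1, "congratulations": 1},
}

memory_15a: List[BilingualPairWithInfo] = [
    BilingualPairWithInfo(
        japanese="<Japanese sentence 1>",
        english="<English sentence 1>",
        jwd="JWD1",
        sn_j="J1/6",
        sn_e="E1/6",
        pn_j="J1/4",
        pn_e="E1/4",
    ),
]

# Every pair from the same document shares one heading, so the constituent
# strings are stored once in the table rather than once per pair.
shared_strings = definition_table[memory_15a[0].jwd]
```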
[0043]
(2) Detailed operation of the document identification information extraction unit 14 shown in FIG. 2 (see FIG. 3).
[0044]
When a Japanese document and an English document associated with a sentence are input by the sentence association unit 13, the document identification information extraction unit 14 extracts a document configuration character string (JWD) over the entire Japanese document (S131). Since this document structure character string (JWD) extracts almost all meaningful words from the entire document, it can be said that it reflects the context and intention of the document. Since JWD is document identification information common to each translation pair with information, JWD1 (document configuration character string header) as an extraction result is separately defined in the document identification information definition table 15b together with the extraction frequency.
[0045]
To cut out the document constituent character strings (JWD), for example, morphological analysis is performed on the character strings of the Japanese document, and nouns, verbs, adjectives, adverbs, and the like are cut out, centering on independent words. Inflected words are normalized in this step; for example, an inflected form of “beautiful” is converted to the form “beautiful”. In this embodiment, the constituent character strings JWD1 cut out from the entire Japanese document in FIG. 4 include, in addition to words such as “10,000, units, sales, achievement, heartfelt, congratulations”, words such as “result, reputation, high”, so it can be said that the positive context of the original text is also well reflected. The numbers in parentheses attached to each word represent its frequency of appearance in the document.
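A minimal sketch of this cut-out step is shown below. It assumes the morphological analysis has already been done by some external analyzer and that its output is a list of (basic form, part of speech) tuples; the part-of-speech labels and the sample data are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple

# Parts of speech treated as independent (content) words. The label names
# are assumptions; a real analyzer would use its own tag set.
CONTENT_POS = {"noun", "verb", "adjective", "adverb"}


def extract_constituent_strings(morphemes: Iterable[Tuple[str, str]]) -> Counter:
    """Count the basic forms of content words over the whole document.

    Each morpheme is (basic_form, part_of_speech). Function words such as
    particles are skipped.
    """
    return Counter(base for base, pos in morphemes if pos in CONTENT_POS)


# Hypothetical analyzer output for a short document.
analyzed = [
    ("sales", "noun"),
    ("of", "particle"),
    ("achievement", "noun"),
    ("congratulate", "verb"),
    ("sales", "noun"),
    ("high", "adjective"),
]
jwd1 = extract_constituent_strings(analyzed)
```

The resulting counter maps each constituent string to its frequency, which corresponds to the numbers in parentheses described above.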
[0046]
Next, the composition information of each sentence for the Japanese document and the English document is extracted.
[0047]
As one item of configuration information, a sentence number (SN) in the document is extracted (S132). For both the Japanese and English documents, this sentence number SN is expressed as a fraction with the total number of sentences as the denominator and the sentence's own number as the numerator. For example, the first bilingual pair with information in FIG. 5 has SN = J1 / 6 because it is the first of six sentences in the Japanese document, and SN = E1 / 6 because it is also the first of six sentences in the English document. Sentence numbers (SN) are extracted in the same way for the second and subsequent sentences of the Japanese document and the English document.
[0048]
As another item of configuration information, the paragraph number (PN) in the document is extracted in the same manner as the sentence number. For both the Japanese and English documents, the paragraph number PN is expressed as a fraction with the total number of paragraphs in the entire document as the denominator and the number of the paragraph containing the sentence as the numerator. For example, the first bilingual pair with information shown in FIG. 5 belongs to a document composed of 4 paragraphs in both Japanese and English and is a sentence of the first paragraph, so the Japanese sentence has PN = J1 / 4 and the English sentence has PN = E1 / 4. Paragraph numbers (PN) are extracted in the same manner for the sentences of the second and subsequent paragraphs of the Japanese document and the English document.
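The SN and PN labels described above can be derived as in the following sketch. The function name, the dictionary keys, and the input layout (a document given as a list of paragraphs, each a list of sentences) are assumptions made for illustration.

```python
from typing import Dict, List


def position_labels(paragraphs: List[List[str]], prefix: str) -> List[Dict[str, str]]:
    """Label every sentence with its SN and PN fractions.

    `paragraphs` is a document as a list of paragraphs, each a list of
    sentences. `prefix` is the language letter ("J" or "E").
    """
    total_sentences = sum(len(p) for p in paragraphs)
    total_paragraphs = len(paragraphs)
    labels = []
    sentence_no = 0
    for paragraph_no, paragraph in enumerate(paragraphs, start=1):
        for sentence in paragraph:
            sentence_no += 1  # serial number across the whole document
            labels.append({
                "sentence": sentence,
                "SN": f"{prefix}{sentence_no}/{total_sentences}",
                "PN": f"{prefix}{paragraph_no}/{total_paragraphs}",
            })
    return labels


# A placeholder document with 6 sentences in 4 paragraphs.
japanese_doc = [["s1"], ["s2", "s3"], ["s4", "s5"], ["s6"]]
labels = position_labels(japanese_doc, "J")
```

With this placeholder document, the first sentence receives SN = J1/6 and PN = J1/4, matching the form of the first pair described for FIG. 5.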
[0049]
Note that the document identification information to be extracted is not limited to the above information. For example, various kinds of information can be added as the user of the apparatus requires, such as the character strings that make up the English document, the file name, the file creation date and time, the creator name, and related customer information. FIG. 6 is a diagram showing an example sentence 2 input from the document data input unit 11, and shows a bilingual document composed of an original document that is a Japanese document in the upper row and a translation that is an English document in the lower row.
[0050]
FIG. 7 shows an example in which bilingual pairs with information for the bilingual document of example sentence 2, input from the document data input unit 11, are further added to the bilingual memory 15 described above. As shown in this example, the fourth and subsequent bilingual pairs with information are added to the bilingual memory 15. As can be seen from the fourth parallel translation pair, when a sentence in one language corresponds to a plurality of sentences in the other language, the sentence numbers in the numerator are joined with a plus sign (+), as in SN = J4+5/10. Further, the document composition character strings corresponding to the heading JWD2, cut out from the Japanese document in FIG. 6, include words with a negative context such as “protest” in addition to “recently, you, arrive, enclose”. From these words, it can be understood that the original text has a negative context.
[0051]
In the bilingual memory 15a with information shown in FIG. 7, two bilingual pairs are stored whose Japanese sentences, “similar to such a thing ...”, are almost identical but whose English sentences are quite different (see the shaded portion in FIG. 7). These two parallel translation pairs have different document structure character string headers (JWD1 and JWD2), and it can be understood that they are sentences extracted from original texts having a positive context and a negative context, respectively.
[0052]
(3) Operation of the bilingual information search device 20 (see FIG. 8 and FIG. 9). FIG. 8 is a diagram for explaining the overall operation of the bilingual information search device 20, and FIG. 9 is a diagram for explaining the detailed operation of the translation information search processing unit 22.
[0053]
In this bilingual information search apparatus, the user inputs the Japanese document (translation target document or search target document) of example sentence 3 shown in FIG. 10 from the document data input unit 11 (ST21), and it is sent to the document dividing unit 21. The document dividing unit 21 divides the Japanese document input from the document data input unit 11 into sentence units and sequentially stores each divided Japanese sentence in the buffer memory 12 (S22).
[0054]
Thereafter, the document identification information extracting unit 14 extracts the document identification information by means almost the same as described in the bilingual information creating apparatus 10 (S23). Here, the detailed processing operation of the document identification information extraction unit 14 will be described with reference to FIG.
[0055]
The document identification information extraction unit 14 extracts the document identification information and then passes the divided Japanese document and document identification information to the parallel translation information search processing unit 22. The parallel translation information search processing unit 22 executes a search process based on the divided Japanese document and document identification information (S24). Details of the search processing by the parallel translation information search processing unit 22 will be described later (see FIG. 9).
[0056]
When the bilingual information search processing unit 22 ends the search process, it determines whether the search was successful (S25). If the search was unsuccessful, the search result output control unit 23 displays on the display unit 242 that there is no search result (S26). If the search was successful, the English sentence of the parallel translation pair is extracted based on the search target Japanese sentence and passed to the search result output control unit 23 (S27). The search result output control unit 23 outputs the received search result to the display unit 242 or a printer (not shown) (S28). Next, details of the search processing of the parallel translation information search processing unit 22 will be explained with reference to FIG. 9.
[0057]
In its search processing, the bilingual information search processing unit 22 first creates a Japanese document with identification information from the Japanese document divided into single sentences by the document dividing unit 21 and the document identification information extracted by the document identification information extracting unit 14, and stores it in the parallel translation memory 15 (S241). FIG. 11 is a diagram showing a data arrangement structure of the bilingual memory 15. In the bilingual memory 15a with document identification information, a Japanese document with document identification information created from the example sentence 3 of the Japanese document shown in FIG. 10 is stored. The document identification information definition table 15b stores document structure character string data of Japanese documents.
[0058]
This Japanese document with document identification information is configured in units (hereinafter referred to as information-added Japanese sentences) in which document identification information (PROP) is added to each Japanese sentence (J:) divided into single sentence units. This document identification information has a data array configuration substantially the same as that of the bilingual information creation apparatus 10 described above and is made up of, for example, the document configuration character string heading (JWD = JWDP), the sentence number (SN), and the paragraph number (PN). Similarly, document configuration character string data corresponding to the document configuration character string heading is defined in the document identification information definition table 15b. However, the definition content differs slightly from that of the bilingual information creating apparatus 10 described above: the Japanese document constituent character strings are extracted for each paragraph, and the extraction results are divided into JWDP1 to JWDP5. This is so that, when the search target document is long and composed of a large number of paragraphs, the processing relating to the document identification information described later can be performed not only on the entire document but also in units of paragraphs.
[0059]
Subsequently, the degree of coincidence between the information-added Japanese sentence to be searched and the original sentence of each parallel translation pair with information in the bilingual memory 15a with document identification information is calculated (S242 to S245). The key point of the coincidence calculation in the search device 20 is that, in addition to the coincidence of the Japanese character strings, the coincidence of the document identification information is also taken into consideration. Since this document identification information reflects the context and meaning of the entire Japanese document to be searched, a search that takes the context of that document into account becomes possible. In particular, even if a plurality of parallel translation pairs with similar Japanese sentences exist in the parallel translation memory 15a, considering the degree of coincidence of the document constituent character strings in the document identification information makes it possible to retrieve the parallel translation pair that best matches the search target Japanese sentence.
[0060]
Note that the simplest way to calculate the degree of coincidence of document identification information is, for example, the vector space method: a document vector representing the entire document is formed with the document constituent character strings (JWD) in the document identification information as orthogonal vector components, and the inner product of two such vectors is used as the degree of coincidence.
[0061]
Here, the case where the parallel translation memory 15a shown in FIG. 7 is searched for the Japanese sentence with document identification information shown in FIG. 11 as a search target will be described in detail as an example of the matching degree calculation processing (S242 to S245).
[0062]
Now, based on all the document composition character strings (JWDP1 to JWDP5) in the document identification information definition table 15b shown in FIG. 11, a document vector (Dtr) is created in which each document composition character string is a vector component weighted by its frequency (S242). If search target vectors are instead created individually from the document composition character strings JWDP1, JWDP2, and so on for each paragraph, and these multiple vectors are considered together, a search that accurately reflects the context of each paragraph becomes possible.
[0063]
Next, based on each set of document constituent character strings (JWD1, JWD2) assigned to the bilingual pairs with document identification information shown in FIG. 7, document vectors (D_TM1, D_TM2) are created with each character string as a vector component (S243). Further, in order to obtain the degree of coincidence between the document vector of the search target sentence and the document vectors of the parallel translation memory 15, the inner products of Dtr and D_TM1, and of Dtr and D_TM2, are calculated (S244). For D_TM1, the inner product value is a positive integer greater than zero because “celebration, good fight” match, but for D_TM2, since there is no matching item, the inner product value is zero. As a result, the inner product of Dtr and D_TM1 is greater than the inner product of Dtr and D_TM2, and it can be seen that D_TM1 has the higher degree of coincidence.
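The inner-product comparison in steps S242 to S244 can be sketched as follows, representing each document vector as a sparse mapping from constituent string to frequency. The function name and all frequency values are assumptions chosen for illustration; they are not the values of FIG. 7 or FIG. 11.

```python
from typing import Dict


def inner_product(a: Dict[str, int], b: Dict[str, int]) -> int:
    """Inner product of two sparse term-frequency vectors.

    Terms missing from either vector contribute zero, which is what
    treating each constituent string as an orthogonal component implies.
    """
    return sum(weight * b[term] for term, weight in a.items() if term in b)


# Search target document vector (Dtr), with placeholder frequencies.
d_tr = {"celebration": 1, "good fight": 1, "victory": 2}

# Document vectors of the two stored source documents (placeholder values).
d_tm1 = {"celebration": 1, "good fight": 1, "sales": 2, "achievement": 1}
d_tm2 = {"protest": 1, "enclose": 1, "recently": 1}

score_tm1 = inner_product(d_tr, d_tm1)  # two shared terms
score_tm2 = inner_product(d_tr, d_tm2)  # no shared terms
```

In this toy data the first stored document shares two terms with the search target and the second shares none, so the first receives the higher context score.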
[0064]
Next, a sentence is taken out from the search target document, and the character string matching degree is calculated (S245). As an example, let us consider a case where a character string search is performed on the text “I was able to do this well” (see the shaded portion in FIG. 11), which is one sentence of the search target document. In the bilingual memory 15a with information shown in FIG. 7, there are two translation pairs whose Japanese sentences are “I can do this well” and “I have been able to do this well” (see the shaded portion in FIG. 7), and document identification information JWD1 and JWD2 is assigned to them, respectively.
[0065]
Suppose that the sentence to be searched is broken down into four meaningful words, “well, like this, what you can do”. A word whose basic form is identical and that differs only in inflection is counted as a match with a weight of 0.5. When the degree of coincidence is calculated under these conditions, the Japanese sentence with JWD1 matches completely in 3 of the 4 words, and the remaining word has the same basic form but a different inflection. Therefore, the degree of coincidence is (3/4) + {0.5 × (1/4)} = 0.875, or approximately 0.88. On the other hand, the Japanese sentence to which JWD2 is assigned matches in 4 of the 4 words, so the degree of coincidence is 4/4 = 1. Therefore, considering only the matching degree of character strings, the translation pair to which JWD2 is assigned has the higher matching degree. However, the final degree of coincidence takes into account both the coincidence of the character strings and the coincidence of the document identification information. Since the degree of coincidence of the document identification information of the JWD2 pair is zero, the final degree of coincidence of the bilingual pair to which JWD1 is assigned ends up higher. This means that a parallel translation pair with a high degree of contextual match is selected even if its character string match is lower.
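The arithmetic of this example can be reproduced with the sketch below. Words are represented as (surface form, basic form) tuples, and the English tokens are placeholders for the Japanese words. The description does not state how the two degrees of coincidence are combined into the final value, so the multiplication used here is purely an assumption, as are the context scores of 2 and 0.

```python
from typing import List, Tuple

Word = Tuple[str, str]  # (surface form, basic form)


def string_match(query: List[Word], candidate: List[Word]) -> float:
    """Fraction of query words matched: 1.0 for an identical surface form,
    0.5 when only the basic form matches (a different inflection)."""
    if not query:
        return 0.0
    surfaces = {surface for surface, _ in candidate}
    bases = {base for _, base in candidate}
    score = 0.0
    for surface, base in query:
        if surface in surfaces:
            score += 1.0
        elif base in bases:
            score += 0.5
    return score / len(query)


# Placeholder tokens: three identical words plus one differing in inflection.
query = [("well", "well"), ("like this", "like this"), ("thing", "thing"), ("could", "can")]
pair_jwd1 = [("well", "well"), ("like this", "like this"), ("thing", "thing"), ("can", "can")]
pair_jwd2 = list(query)  # identical to the query in all four words

match_jwd1 = string_match(query, pair_jwd1)  # (3/4) + 0.5 * (1/4)
match_jwd2 = string_match(query, pair_jwd2)  # 4/4

# Assumed combination rule: multiply by the context (inner product) score.
context_jwd1, context_jwd2 = 2, 0
final_jwd1 = match_jwd1 * context_jwd1
final_jwd2 = match_jwd2 * context_jwd2
```

Under this assumed rule a context score of zero eliminates the JWD2 pair despite its perfect character string match; a weighted sum would give the same ranking here.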
[0066]
Accordingly, the bilingual information creating apparatus 10 described above creates bilingual information that reflects not only each individual sentence but also the characteristics of the entire bilingual target document. For a search target document, a translated sentence can therefore be retrieved in consideration of its context and intention. Although the above description used a very simple example, the accuracy of the degree-of-coincidence calculation can be improved by introducing the following steps when creating the vector components that constitute a document vector. That is, instead of using each Japanese constituent character string itself as a vector component, words having a large correlation, such as synonyms and related words, are classified (clustered), and each vector component is reconstructed with the words belonging to the same classification as its sub-components. By treating synonyms and related words classified into the same vector component as matching words, the degree of matching better reflects the purpose of the sentence. For example, in the above example only the two words “celebration, good fight” matched for D_TM1, but related words such as “achievement, achievement” in the parallel translation memory 15 (see FIG. 7) and “honor, win, win” in the search target sentence (see FIG. 10) can also be expected to be treated as matching.
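The effect of reconstructing vector components from clusters can be illustrated with the following sketch. The cluster table, the cluster label "SUCCESS", and the frequencies are hypothetical; in practice the classification would come from a dictionary or a learned model.

```python
from collections import Counter
from typing import Dict


def to_cluster_vector(term_freqs: Dict[str, int], cluster_of: Dict[str, str]) -> Counter:
    """Fold each term into its cluster; unclustered terms stay as themselves."""
    vector: Counter = Counter()
    for term, freq in term_freqs.items():
        vector[cluster_of.get(term, term)] += freq
    return vector


def inner_product(a: Dict[str, int], b: Dict[str, int]) -> int:
    """Inner product of two sparse vectors."""
    return sum(weight * b[key] for key, weight in a.items() if key in b)


# Hypothetical classification of related words into one component.
cluster_of = {"achievement": "SUCCESS", "honor": "SUCCESS", "win": "SUCCESS"}

memory_terms = {"achievement": 1, "celebration": 1}
query_terms = {"honor": 1, "win": 1, "celebration": 1}

raw_score = inner_product(query_terms, memory_terms)
clustered_score = inner_product(
    to_cluster_vector(query_terms, cluster_of),
    to_cluster_vector(memory_terms, cluster_of),
)
```

Without clustering only "celebration" matches; after clustering, the related words also contribute through the shared component, which raises the score.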
[0067]
Further, even if the search target sentence shown in FIG. 10 includes words related to the style of the sentence, such as “you, arrive”, that match the character strings of JWD2 (Japanese document = example sentence 2 in FIG. 6), a search according to the meaning of the sentence can still be performed, provided that the degree of coincidence reflecting the meaning of the sentence is made sufficiently high by executing the process described above.
[0069]
To perform clustering, the strength of correlation must be defined. For this purpose, for example, automatic learning using a neural network can be used, as can existing classification systems of synonyms, related terms, and concepts, such as the electronic dictionary compiled by EDR (Electronic Dictionary Research) or WordNet. With such a dictionary or classification system, classifications can be created as needed from the constituent character strings of documents, so there is no need to prepare classifications in advance, and classifications can be added and changed flexibly.
[0070]
Next, details of the search result output control unit 23 shown in FIG. 1 and steps S25 to S28 of FIG. 8 will be described.
[0071]
In step S25, if the search is successful, the search result is displayed on the display unit 242. The search result output control unit 23 of the parallel translation information search device 20 of the present invention adopts a display method that simultaneously displays not only the search result but also the sentences preceding and following it. That is, the document identification information stored in the parallel translation memory 15 records the serial number (sentence number) of each sentence within the entire source document of the parallel translation pair. By using this sentence number, the sentences before and after the English sentence and the Japanese sentence of the search result can be displayed.
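A minimal sketch of looking up the surrounding sentences from a sentence number follows. It assumes that the sentences of the source document are available as a list and that SN strings have the form shown earlier, such as "E3/6" or "J4+5/10"; the function names and the `window` parameter are hypothetical.

```python
from typing import List


def parse_sentence_numbers(sn: str) -> List[int]:
    """Parse 'E3/6' into [3] and 'J4+5/10' into [4, 5] (1-based numbers)."""
    numerator = sn[1:].split("/")[0]  # drop the language letter and the total
    return [int(n) for n in numerator.split("+")]


def with_context(sentences: List[str], sn: str, window: int = 1) -> List[str]:
    """Return the hit sentence(s) plus `window` neighbours on each side."""
    numbers = parse_sentence_numbers(sn)
    start = max(min(numbers) - 1 - window, 0)
    end = min(max(numbers) + window, len(sentences))
    return sentences[start:end]


# Placeholder English document of six sentences.
english_document = ["E-s1", "E-s2", "E-s3", "E-s4", "E-s5", "E-s6"]
context = with_context(english_document, "E3/6")
```

The slice is clamped at the document boundaries, so a hit on the first or last sentence simply returns fewer neighbours.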
[0072]
FIG. 12 is a diagram illustrating a display example of the translation result on the display unit 242. The search result output control unit 23 has a search target original text display window 242a on the upper left side and a translated text display window 242b on the upper right side. The search target original text display window 242a displays a Japanese text as a search target text, while the translation text display window 242b displays a translation result English text as a search result with reference to the parallel translation memory 15.
[0073]
In this state, when a sentence of the Japanese document that is the search target original is selected with the mouse, a bilingual memory search result display window 242c appears on the lower side of the display unit, and the search result for that search target sentence alone is displayed there. Further, when the search result displayed in the search result display window 242c is selected with the mouse and right-clicked, a window 242d pops up that also shows the sentences located before and after the search result sentence in the source document. This makes it easy to grasp the context in which the sentence of the search result is used.
[0074]
Note that the present invention is not limited to the above-described embodiment, and various modifications can be made without departing from the scope of the invention.
[0075]
In addition, the embodiments can be implemented in combination as much as possible, and in that case, the effect of the combination can be obtained. Further, each of the above embodiments includes inventions at various higher and lower levels, and various inventions can be extracted by appropriately combining a plurality of the disclosed constituent elements. For example, when an invention is extracted by omitting some constituent elements from all the constituent elements described in the means for solving the problem, the omitted part is appropriately supplemented by well-known conventional techniques when the extracted invention is implemented.
[0076]
[Effects of the invention]
As described above, according to the present invention, bilingual information reflecting the context and intention of the original document can be created by adding document identification information that takes the overall characteristics of the original document into account to each bilingual pair. In addition, a bilingual information creation device can be provided that creates bilingual information with document identification information that is easily classified from the character strings of the original document.
[0077]
In addition, the present invention can provide a bilingual information retrieval device that can easily retrieve a second-language search result (translation result) that captures the context and intention of the original text to be searched and that, when the sentences before and after the search result are output at the same time, makes it easy to understand the context of the retrieved sentence.
[Brief description of the drawings]
FIG. 1 is a configuration diagram of a system including an embodiment of a parallel translation information creation device and a parallel translation information search device according to the present invention.
FIG. 2 is a flowchart for explaining the operation of the bilingual information creation device.
FIG. 3 is a flowchart for explaining an operation example of a document identification information extraction unit of the bilingual information creation apparatus shown in FIG. 1;
FIG. 4 is a diagram showing a relationship between an input original sentence as an example sentence 1 and a parallel translation document (translation sentence) of the input original sentence.
FIG. 5 is a diagram showing a data array configuration of a bilingual memory in which bilingual information in which document identification information is added to a bilingual pair for each sentence of a document related to example sentence 1 is stored;
FIG. 6 is a diagram showing a relationship between an input original sentence as an example sentence 2 and a parallel translation document (translation sentence) of the input original sentence.
FIG. 7 is a diagram showing a data array configuration of a parallel translation memory in which parallel translation information related to example sentence 2 is added to parallel translation information related to example sentence 1;
FIG. 8 is a flowchart for explaining the operation of the bilingual information search device.
FIG. 9 is a flowchart for explaining an operation example of a parallel translation information search unit of the parallel translation information search apparatus shown in FIG. 1;
FIG. 10 is a diagram for explaining a search target document as an example sentence 3;
FIG. 11 is a diagram showing a data array configuration of a bilingual memory in which bilingual information obtained by adding document identification information to bilingual pairs for each sentence of a document related to example sentence 3 is stored.
FIG. 12 is a diagram showing a display state on a display unit in a search result output control unit of the parallel translation information search apparatus shown in FIG. 1.
[Explanation of symbols]
10 ... Parallel translation information creation apparatus, 11 ... Document data input part, 13 ... Sentence matching part, 14 ... Document identification information extraction part, 15 ... Parallel translation memory, 15a ... Parallel translation memory with document identification information, 15b ... Document identification information definition table, 16 ... parallel translation information creation unit, 17 ... program recording medium, 20 ... parallel translation information search device, 21 ... document splitting unit, 22 ... parallel translation information search processing unit, 23 ... search result output control unit, 25 ... program recording medium.

Claims

1. A bilingual information creation device comprising:
document data input means for inputting an original document composed of a plurality of sentences and a parallel translation document composed of a plurality of sentences that are translations of the original document;
sentence association means for associating, sentence by sentence, the original document and the parallel translation document input by the document data input means, and storing them in storage means;
document identification information extracting means for extracting document identification information representing the characteristics of the document from the original document; and
bilingual information creating means for creating bilingual information in which the extracted document identification information is added to each bilingual pair of the original document and the parallel translation document associated sentence by sentence by the sentence association means, and storing the bilingual information in a bilingual memory.

2. The bilingual information creation device according to claim 1, wherein the sentence association means associates sentences based on a similarity between a translation generated from a sentence constituting the original document and a translated sentence constituting the parallel translation document.

3. The bilingual information creating apparatus according to claim 1, wherein the document identification information is a character string constituting the original document, extracted from the original document.

4. The bilingual information creation apparatus according to claim 3, wherein the document identification information further includes a sentence number and a paragraph number extracted based on a sentence constituting the original document and a sentence constituting the translated document.

5. A bilingual information search apparatus comprising:
bilingual memory storage means for storing a plurality of pieces of bilingual information in which, for an original document composed of a plurality of sentences and a parallel translation document composed of a plurality of sentences that are translations of the original document, the original document and the parallel translation document are associated sentence by sentence and document identification information representing the characteristics of the document, extracted from the original document, is added;
document dividing means for dividing an input search target original document composed of a plurality of sentences into sentence units;
document identification information extracting means for extracting document identification information representing the characteristics of the document from the search target original document; and
search means for calculating, based on the original text divided into sentence units and the document identification information, the degree of coincidence with the plurality of pieces of bilingual information stored in the bilingual memory storage means, and searching the plurality of pieces of bilingual information for a translation of the search target original text based on the degree of coincidence.

6. The bilingual information search apparatus according to claim 5, wherein the document identification information is a character string constituting the original document, extracted from the original document.

7. The bilingual information search apparatus according to claim 6, wherein the document identification information further includes a sentence number and a paragraph number extracted based on a sentence constituting the original document and a sentence constituting the translated document.