JP4259726B2

JP4259726B2 - Recording medium recording a parallel thesaurus generation program, recording medium recording a parallel thesaurus, and recording medium recording a parallel thesaurus navigation program

Info

Publication number: JP4259726B2
Application number: JP2000149413A
Authority: JP
Inventors: 博行梶; 康嗣森本
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2000-05-22
Filing date: 2000-05-22
Publication date: 2009-04-30
Anticipated expiration: 2020-05-22
Also published as: JP2001331484A

Description

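[0001]
[Technical field of the invention]
The present invention relates to programs for an apparatus that generates, from text corpora in two languages, a parallel thesaurus combining the thesauri of the two languages, and for a navigation system that uses the parallel thesaurus, and to a recording medium on which the parallel thesaurus is recorded.
[0002]
[Prior art]
As the volume of electronic text grows, information access technology is becoming increasingly important.
The present inventors previously filed a thesaurus navigation method as Japanese Patent Application No. H11-28101. The method makes more efficient the task of digging useful information (hereinafter, text mining) out of a document database that stores text data (hereinafter, a text corpus). It extracts terms and knowledge about the relations between terms from the text corpus to generate a thesaurus, and navigates the user by displaying the thesaurus contents in the browser of a client terminal.
[0003]
A text mining system based on this navigation method is reported in the Information Processing Society of Japan technical report DBS-118-13 / FI-54-13 (SIG Database Systems / SIG Fundamental Informatics), "Corpus-dependent association thesaurus navigation" (May 17, 1999).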
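[0004]
[Problems to be solved by the invention]
In information retrieval, there is a growing need to enter a search request expressed in one's native language and retrieve documents written in a foreign language. This form of retrieval is called cross-language information retrieval and is being actively studied.
[0005]
A representative approach to cross-language information retrieval is reported in, for example, Transactions of the Information Processing Society of Japan, Vol. 40, No. 11, pp. 4075-4086, "A study on English-Japanese and Japanese-English cross-language retrieval using machine translation" (November 1999). This approach uses a bilingual dictionary or a machine translation system to translate the search request into the language of the documents and then runs the document search. Because a search request is shorter than a document and carries less context, it is hard to translate accurately, and retrieval accuracy suffers as a result.
[0006]
One way to address this problem is to use the navigation method proposed in Japanese Patent Application No. H11-28101. Extending that method from one language to two and using it as the front end of an information retrieval system should solve the above problem of cross-language information retrieval, but doing so requires combining the thesauri of the two languages.
[0007]
Meanwhile, techniques for generating a thesaurus from a text corpus are useful not only for text mining as described above but also for many natural language processing applications. The following technical problems nevertheless remain.
[0008]
Conventional automatic thesaurus generation handles polysemous words poorly. It extracts relations between terms, but such relations are inherently semantic, so ideally a polysemous term would be split into its senses, that is, its concepts, and relations would be extracted between concepts. With the conventional technique, for example, "loan", "money", "river", and "water" are all extracted as related terms of "bank", regardless of which concept "bank" represents.
[0009]
Here "loan" and "money" are related terms of "bank (銀行)", the institution where money is deposited and withdrawn, while "river" and "water" are related terms of "bank (岸)", the land beside water. It is therefore desirable to split "bank" into the concepts it represents and to extract related terms such as "loan" and "money" separately from related terms such as "river" and "water".
[0010]
Conventional automatic thesaurus generation also handles synonyms poorly. It extracts related terms based on co-occurrence probability, that is, the probability that terms appear together near each other in text. Synonyms are therefore not even extracted as related terms and are treated as separate entities. Since synonyms represent the same concept, it is desirable to handle them together as a single entity.
[0011]
Accordingly, an object of the present invention is to solve the above problems of the prior art.
The first object is to provide a recording medium recording a parallel thesaurus generation program that automatically generates, from text corpora of a first language and a second language, a parallel thesaurus consisting of first-language concepts and the relations among them, second-language concepts and the relations among them, and links between first-language concepts and second-language concepts.
[0012]
The second object is to provide a recording medium recording a parallel thesaurus navigation program that serves as a front end for cross-language information retrieval and enables accurate translation of search requests, with particular attention to the handling of polysemous words and synonyms.
The third object is to provide a recording medium recording the parallel thesaurus, so as to realize effective text mining using the parallel thesaurus.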
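[0013]
[Means for solving the problems]
The present invention is a computer-readable recording medium recording a program that causes a computer, having an input device that receives text corpora of a first language and a second language, a storage device that stores a bilingual dictionary of the first and second languages, and a processing device that processes data, to function as a parallel thesaurus generation apparatus, the program causing the processing device to function as: first-language thesaurus generation means for generating a first-language term-related thesaurus representing the relations between first-language terms in the first-language text corpus input from the input device; second-language thesaurus generation means for generating a second-language term-related thesaurus representing the relations between second-language terms in the second-language text corpus input from the input device; and thesaurus combining means for combining the first-language term-related thesaurus and the second-language term-related thesaurus using the bilingual dictionary of the first and second languages stored in the storage device.
[0014]
As the thesaurus combining means, the program causes the processing device to function as: term linking means for linking corresponding terms between the first-language term-related thesaurus and the second-language term-related thesaurus using the bilingual dictionary of the first and second languages; first-language concept label generation means for generating concept labels from each first-language term by combining it with the second-language terms linked to it; second-language concept label generation means for generating concept labels from each second-language term by combining it with the first-language terms linked to it; concept linking means for converting the term links between the first and second languages into concept links between the languages; first-language concept-related thesaurus generation means for converting the relations between terms in the first-language term-related thesaurus into relations between concepts; second-language concept-related thesaurus generation means for converting the relations between terms in the second-language term-related thesaurus into relations between concepts; first-language concept merging means for merging a plurality of first-language concepts linked to the same second-language concept into one concept when their sets of related first-language concepts are similar; and second-language concept merging means for merging a plurality of second-language concepts linked to the same first-language concept into one concept when their sets of related second-language concepts are similar. This makes it possible to generate a parallel thesaurus that takes into account the concepts carried by polysemous words and synonyms.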
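[0015]
The parallel thesaurus generation apparatus operates as follows.
The first-language thesaurus generation means extracts terms and pairs of co-occurring terms from the first-language text corpus, analyzes the correlation between terms, and outputs a set of related terms for each term. When the first language is Japanese, for example, {ローン, 金利, 口座, 利率, 証券, 経済, 金融, 投資} is output as the set of related terms of the term "銀行" (bank).
[0016]
The second-language thesaurus generation means applies the same processing to the second-language text corpus and outputs a set of related terms for each term. When the second language is English, for example, {account, river, interest, loan, boat, investment, fishing, park, economy, lake} is output as the set of related terms of the term "bank".
[0017]
Within the thesaurus combining means, each means operates as follows.
The term linking means uses the bilingual dictionary of the first and second languages to link corresponding terms between the first-language term-related thesaurus and the second-language term-related thesaurus. For example, "銀行" is linked to "bank", and "岸" (shore) is linked to "bank".
[0018]
The first-language concept label generation means generates at least one concept label from each first-language term by combining it with the second-language terms linked to it. Likewise, the second-language concept label generation means generates at least one concept label from each second-language term by combining it with the first-language terms linked to it. For example, "bank" is linked to both "銀行" and "岸". If the related term set of "銀行" and the related term set of "岸" are not similar, "銀行" and "岸" are judged to be conceptually different, and two concept labels, "bank・銀行" and "bank・岸", are generated from "bank".
[0019]
The concept linking means converts term links between the languages into concept links between the languages. For example, the term link "銀行−bank" is converted into the concept link "銀行−bank・銀行", and the term link "岸−bank" is converted into the concept link "岸−bank・岸".
[0020]
The first-language concept-related thesaurus generation means converts the relations between terms in the first-language term-related thesaurus into relations between concepts. Likewise, the second-language concept-related thesaurus generation means converts the relations between terms in the second-language term-related thesaurus into relations between concepts. For example, the term relation "bank−interest" is converted into the concept relation "bank・銀行−interest・金利/利率", and the term relation "bank−river" is converted into the concept relation "bank・岸−river".
[0021]
The first-language concept merging means merges into one concept those first-language concepts that are linked to the same second-language concept and whose sets of related first-language concepts are similar. Likewise, the second-language concept merging means merges into one concept those second-language concepts that are linked to the same first-language concept and whose sets of related second-language concepts are similar. For example, suppose the two Japanese concepts "金利" and "利率" (both meaning interest rate) are linked to the English concept "interest・金利/利率". If the set of related concepts of "金利" and the set of related concepts of "利率" are similar, "金利" and "利率" are merged into one concept, "金利−利率".
[0022]
Through the operation of these functions, a parallel thesaurus is generated in which a first-language concept-related thesaurus, consisting of first-language concepts and the relations among them, is combined with a second-language concept-related thesaurus, consisting of second-language concepts and the relations among them.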
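[0023]
The present invention is also a computer-readable recording medium recording: a parallel thesaurus that includes a first-language concept-related thesaurus representing the relations between first-language concepts, built from first-language concept labels that combine terms of the first and second languages; and a parallel thesaurus navigation program that causes a computer connected to display means to execute, using the relations between concepts represented in the first-language concept-related thesaurus, a step of reading the first-language concept set shown on the display means and a step of obtaining concepts strongly related to the first-language concepts in the concept set that was read.
[0024]
The present invention is also a computer-readable recording medium recording a program that causes a computer, having a terminal for entering a first-language term or a first-language concept label, a processing device that processes data, and a storage device recording a parallel thesaurus consisting of concept link data built from concept labels of the first and second languages and concept-related thesauri of the first and second languages, to function as a parallel thesaurus navigation system, the program causing the processing device to function as: concept set reading means for reading, using the first-language concept-related thesaurus, a concept set containing the first-language term entered at the terminal or a concept set containing the first-language concept label entered at the terminal; expanded concept set clustering means for expanding the concept set that was read by adding concepts strongly related to it, using the first-language concept-related thesaurus, and clustering the expanded concept set; concept set translation means for translating the concepts in the clustered expanded concept set into a second-language expanded concept set, using the concept link data and the second-language concept-related thesaurus; and means for outputting information that causes display means to display the second-language expanded concept set.
[0025]
As the concept set translation means, the program causes the processing device to function as: means for generating a core second-language concept set by collecting, using the concept link data, the second-language concepts linked to the first-language concepts in the clustered expanded concept set; means for collecting, using the second-language concept-related thesaurus, second-language concepts strongly related to that second-language concept set; and means for adding, using the concept link data, those strongly related second-language concepts that are not linked to any first-language concept. In the parallel thesaurus navigation system, translation (transition) from a set of first-language concepts to a set of second-language concepts proceeds as follows.
[0026]
The concept set after translation (transition) consists of the second-language concepts linked to the concepts in the concept set before translation, together with second-language concepts that are strongly related to those concepts and are not linked to any first-language concept. This makes it possible to translate (transition) from a first-language concept set to a related second-language concept set that includes concepts not explicitly linked in the parallel thesaurus.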
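[0027]
[Embodiments of the invention]
Embodiments of the present invention are described in detail below with reference to the accompanying drawings.
FIG. 1 is a block diagram showing the configuration of a parallel thesaurus generation apparatus according to an embodiment of the present invention and of a parallel thesaurus navigation system that includes the apparatus. This embodiment describes the generation of a Japanese-English parallel thesaurus as an example of a parallel thesaurus combining the thesauri of two languages.
[0028]
The parallel thesaurus navigation system of this embodiment (hereinafter, the system) consists of a server computer 1 and a client computer 2 connected to each other through a communication network 3.
[0029]
The server computer 1 performs the parallel thesaurus generation processing, which associates a Japanese thesaurus with an English thesaurus, and, within the processing of the system, tasks such as thesaurus search. The server computer 1 mainly consists of a processing device 11, an input device 12, and a storage device 13.
[0030]
The processing device 11 executes the overall processing of the server computer 1. In particular, it executes the data processing involved in parallel thesaurus generation shown in FIGS. 2 to 4, described later. The input device 12 is a CD-ROM drive, a floppy disk drive, or the like, depending on the medium on which the text corpus is obtained, and is used to input the text corpus.
[0031]
The storage device 13 is a collective name for storage means such as RAM, ROM, and a magneto-optical disk library device (not shown). For example, the processing programs of the server computer 1 are stored permanently in ROM; data and work files created during processing are stored temporarily in RAM; and the corpora, thesauri, and bilingual dictionaries of each language (see FIG. 2) are stored in a mass storage device such as the magneto-optical disk library device.
[0032]
The client computer 2 handles, within the processing of the system, the display of the thesaurus sent as search results from the server computer 1, interaction with the user, and the like.
Next, the parallel thesaurus generation processing is described in detail with reference to FIGS. 2 to 4.
[0033]
FIG. 2 is a functional diagram of the input and output data and the module configuration of the parallel thesaurus generation apparatus. The input to the apparatus is a Japanese-English bilingual text corpus made up of a Japanese corpus 51 paired with an English corpus 52. The Japanese corpus 51 and the English corpus 52 must contain texts from the same domain, but they need not be translations of each other.
[0034]
The output of the apparatus is a Japanese-English parallel thesaurus consisting of a Japanese concept-related thesaurus 61, an English concept-related thesaurus 62, Japanese-English concept link data 63, and English-Japanese concept link data 64. The Japanese-English concept link data 63 and the English-Japanese concept link data 64 carry the same information and differ only in record format. Both are output, despite the redundancy, for the sake of efficiency in the Japanese-English thesaurus combining processing.
[0035]
The modules of the parallel thesaurus generation apparatus are Japanese thesaurus generation 10, English thesaurus generation 20, and Japanese-English thesaurus combination 30. Japanese thesaurus generation 10 generates a Japanese term-related thesaurus 71 from the Japanese corpus 51. English thesaurus generation 20 generates an English term-related thesaurus 72 from the English corpus 52. Japanese-English thesaurus combination 30 refers to a Japanese-English bilingual dictionary 73 and an English-Japanese bilingual dictionary 74 and generates, from the Japanese term-related thesaurus 71 and the English term-related thesaurus 72, the Japanese concept-related thesaurus 61, the English concept-related thesaurus 62, the Japanese-English concept link data 63, and the English-Japanese concept link data 64. The Japanese-English bilingual dictionary 73 and the English-Japanese bilingual dictionary 74 carry the same information and differ only in record format. Both are used, despite the redundancy, for the sake of efficiency in the Japanese-English thesaurus combining processing.
[0036]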
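FIG. 3 shows the processing of Japanese thesaurus generation 10 in detail. As shown in FIG. 3, generating the Japanese thesaurus consists of three steps: term extraction 101, co-occurrence data extraction 102, and correlation analysis 103.
[0037]
(1) Term extraction 101
Terms are extracted from the Japanese corpus 51 and their occurrence frequencies are counted. Nouns and compound nouns whose frequency is at or above a predetermined threshold are extracted as terms. Compound nouns are extracted by pattern matching against part-of-speech sequence patterns. Many high-frequency words are general words with no particular connection to the domain. These are removed using a stop word list.
[0038]
With this stop word list, for example, registering "上記" (the above) as a stop word for the first element excludes noun phrases such as "上記システム" (the above system). Similarly, registering "全体" (whole) as a stop word for the last element excludes noun phrases such as "システム全体" (the whole system).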
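The positional stop word rule can be pictured with a minimal Python sketch. It assumes a candidate noun phrase has already been split into its component words; the function name `keep_noun_phrase` and the two set arguments are illustrative and do not come from the patent.

```python
def keep_noun_phrase(words, first_stop, last_stop):
    """Return True if a candidate noun phrase survives the stop word filter.

    words:      component words of the candidate phrase, in order
    first_stop: stop words that disqualify a phrase when they come first
    last_stop:  stop words that disqualify a phrase when they come last
    """
    if not words:
        return False
    return words[0] not in first_stop and words[-1] not in last_stop


first_stop = {"上記"}   # "the above"
last_stop = {"全体"}    # "whole"
candidates = [["上記", "システム"], ["システム", "全体"], ["検索", "システム"]]
kept = [c for c in candidates if keep_noun_phrase(c, first_stop, last_stop)]
```

Only the third candidate is kept, matching the two exclusions described in the text.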
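[0039]
(2) Co-occurrence data extraction 102
Pairs of co-occurring terms are extracted and their co-occurrence frequencies are counted. Co-occurrence is defined as window co-occurrence: a window of fixed width is moved along the text, and the pairs of terms contained in the window at each position are extracted. The window width is, for example, 25 terms, not counting function words.
(3) Correlation analysis 103
A statistical correlation value is computed for every term pair, and the term pairs whose correlation value is at or above a predetermined threshold are extracted. Mutual information is used as the correlation value between terms.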
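Steps (2) and (3) can be sketched in Python as follows. This is an illustration under stated assumptions, not the patent's implementation: the input is assumed to be a list of already extracted terms with function words removed, the correlation value is taken to be pointwise mutual information estimated from window counts, and the name `related_terms` and its parameters are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def related_terms(tokens, window=25, min_mi=0.0):
    """Window co-occurrence counting followed by a mutual information cut-off."""
    n_windows = max(len(tokens) - window + 1, 1)
    term_count = Counter()  # number of windows that contain a term
    pair_count = Counter()  # number of windows that contain both terms
    for start in range(n_windows):
        seen = sorted(set(tokens[start:start + window]))
        term_count.update(seen)
        pair_count.update(combinations(seen, 2))

    related = {}
    for (a, b), n_ab in pair_count.items():
        # log2( P(a,b) / (P(a) P(b)) ), with probabilities estimated per window
        mi = math.log2(n_ab * n_windows / (term_count[a] * term_count[b]))
        if mi >= min_mi:
            related.setdefault(a, set()).add(b)
            related.setdefault(b, set()).add(a)
    return related


toy = ["bank", "loan", "bank", "loan", "river", "water"]
rt = related_terms(toy, window=2, min_mi=0.0)
```

On this toy input "bank" and "loan" end up related, while "loan" and "river" share one window but fall below the threshold. The resulting dictionary has the same shape as the related term records RT described next.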
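[0040]
Steps (1) to (3) above yield the Japanese term-related thesaurus 71. The Japanese term-related thesaurus 71 is a collection of records, each representing the set of related terms of a Japanese term. That is,
RT_J(x_i) = {x(i,1), x(i,2), ..., x(i,m_i)} (i = 1, 2, ..., M).
Here x_i is a Japanese term, x(i,1) through x(i,m_i) are its related terms, i is the number assigned to each term, m_i is the number of related terms of the i-th Japanese term, and M is the total number of Japanese terms. Example records are shown below.
[0041]
RT_J(銀行) = {ローン, 金利, 口座, 利率, 証券, 経済, 金融, 投資}.
RT_J(金利) = {ローン, 貸出し, 預貯金, 引き上げ, 銀行}.
The processing of Japanese thesaurus generation 10 has been described above; the processing of English thesaurus generation 20 is exactly the same. English thesaurus generation yields the English term-related thesaurus 72, a collection of records each representing the set of related terms of an English term. That is,
RT_E(y_i) = {y(i,1), y(i,2), ..., y(i,n_i)} (i = 1, 2, ..., N).
Here y_i is an English term, y(i,1) through y(i,n_i) are its related terms, i is the number assigned to each term, n_i is the number of related terms of the i-th English term, and N is the total number of English terms. Example records are shown below.
RT_E(bank) = {account, river, interest, loan, boat, investment, fishing, park, economy, lake}.
RT_E(interest) = {loan, deposit, bank, science, economy, exchange, politics}.
[0042]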
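FIG. 4 shows the processing of the Japanese-English thesaurus combination module 30 in detail. As shown in FIG. 4, the processing consists of eight steps: Japanese-English term linking 301, Japanese concept label generation 302, English concept label generation 303, Japanese-English concept linking 304, Japanese concept-related thesaurus generation 305, English concept-related thesaurus generation 306, Japanese concept merging 307, and English concept merging 308. These steps are described in detail below.
[0043]
(1) Japanese-English term linking 301
Referring to the Japanese-English bilingual dictionary 73 and the English-Japanese bilingual dictionary 74, this step links corresponding terms between the Japanese term-related thesaurus 71 and the English term-related thesaurus 72 and outputs Japanese-English term link data 91 and English-Japanese term link data 92.
Of the inputs to this step, the Japanese term-related thesaurus 71 and the English term-related thesaurus 72 have already been described, so the Japanese-English bilingual dictionary 73 and the English-Japanese bilingual dictionary 74 are described here.
[0044]
The Japanese-English bilingual dictionary 73 is a collection of records, each representing the set of English translation terms of a Japanese term. That is,
D_JE(a_i) = {b(i,1), b(i,2), ..., b(i,l_i)} (i = 1, 2, ..., K).
Here a is a Japanese term and b is an English term. Example records are shown below.
D_JE(銀行) = {bank}.
D_JE(岸) = {bank}.
[0045]
The English-Japanese bilingual dictionary 74 is a collection of records, each representing the set of Japanese translation terms of an English term. That is,
D_EJ(b_i) = {a(i,1), a(i,2), ..., a(i,k_i)} (i = 1, 2, ..., L).
Here b is an English term and a is a Japanese term. Example records are shown below.
D_EJ(bank) = {銀行, バンク, 岸}.
D_EJ(interest) = {興味, 金利, 利率}.
[0046]
Next, the output of Japanese-English term linking is described. The Japanese-English term link data 91 and the English-Japanese term link data 92 express the same information in different formats. Both are output, despite the redundancy, for the sake of efficiency in the subsequent processing.
The Japanese-English term link data 91 is a collection of records, each representing the set of English terms linked to a Japanese term. That is,
TL_JE(x_i) = {y'(i,1), y'(i,2), ..., y'(i,n'_i)} (i = 1, 2, ..., M).
Here x is a Japanese term and y' is an English term.
[0047]
The English-Japanese term link data 92 is a collection of records, each representing the set of Japanese terms linked to an English term. That is,
TL_EJ(y_i) = {x'(i,1), x'(i,2), ..., x'(i,m'_i)} (i = 1, 2, ..., N).
Here y is an English term and x' is a Japanese term.
[0048]
The algorithm of Japanese-English term linking 301 is as follows.
1) Initialize the Japanese-English term link data 91. That is,
TL_JE(x_i) ← φ (i = 1, 2, ..., M).
2) Initialize the English-Japanese term link data 92. That is,
TL_EJ(y_i) ← φ (i = 1, 2, ..., N).
[0049]
3) Link each Japanese term x and English term y that satisfy the following two conditions.
(a) The translation relation <x, y> is supported by the bilingual dictionary.
(b) The domain relevance DR(x, y) of the translation relation <x, y> is at or above a predetermined threshold.
That is, for every pair of a Japanese term x and an English term y that satisfies (a) and (b), execute
TL_JE(x) ← TL_JE(x) ∪ {y} and TL_EJ(y) ← TL_EJ(y) ∪ {x}.
Stated formally, the two conditions are:
(a) x is a sequence of k terms x₁, x₂, ..., x_k, y is a sequence of k terms y₁, y₂, ..., y_k, and there exist y'₁ (∈ D_JE(x₁)), y'₂ (∈ D_JE(x₂)), ..., y'_k (∈ D_JE(x_k)) such that {y'₁, y'₂, ..., y'_k} = {y₁, y₂, ..., y_k}.
(b) DR(x, y) ≥ θ.
[0050]
Condition (a) checks whether the Japanese term x and the English term y are translations of each other by testing whether translation relations hold between their components, regardless of the order of the components. In particular, when k = 1 the condition becomes y ∈ D_JE(x), meaning that the pair is registered in the bilingual dictionary. When k ≥ 2, it means that x and y are a pair of compound terms whose component-level translation relations are registered in the bilingual dictionary.
Condition (b) checks whether the translation relation suggested by the bilingual dictionary actually holds in the domain. The domain relevance DR(x, y) of the translation relation <x, y> is defined by the following equation.
[0051]
[Equation 1]

[0052]
DR_JE(x, y) is the proportion of the related terms of the Japanese term whose English translations are related terms of the English term. In other words, it is the degree to which the contexts in which the Japanese term occurs overlap with the contexts in which the English term occurs. Likewise, DR_EJ(x, y) is the degree to which the contexts of the English term overlap with the contexts of the Japanese term. If the contexts have anything at all in common, the translation relation can be considered to hold in the domain, so the domain relevance threshold θ should be set fairly low. Japanese-English term linking 301 is carried out by the algorithm above.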
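The two tests used in term linking can be sketched in Python as follows. The sketch makes several assumptions that are not in the source: dictionaries are plain `dict` objects mapping a term to a set of terms, compound terms are given as tuples of components, and the function names are illustrative. Because the equation defining DR(x, y) is not reproduced in this text, only the one-directional ratio DR_JE described in words above is shown; how DR combines DR_JE and DR_EJ is deliberately left out.

```python
from itertools import permutations


def dictionary_supports(x_parts, y_parts, d_je):
    """Condition (a): each component of x has a dictionary translation among
    the components of y, with the order of the components ignored."""
    if len(x_parts) != len(y_parts):
        return False
    return any(
        all(y in d_je.get(x, set()) for x, y in zip(x_parts, perm))
        for perm in permutations(y_parts)
    )


def dr_je(x, y, rt_j, rt_e, d_je):
    """Share of x's related terms that have a translation among y's related terms."""
    related = rt_j.get(x, set())
    if not related:
        return 0.0
    target = rt_e.get(y, set())
    hits = sum(1 for r in related if d_je.get(r, set()) & target)
    return hits / len(related)


d_je = {"銀行": {"bank"}, "金利": {"interest"}, "ローン": {"loan"},
        "地球": {"earth", "global"}, "温暖化": {"warming"}}
rt_j = {"銀行": {"ローン", "金利", "証券"}}
rt_e = {"bank": {"loan", "interest", "river"}}

ok_single = dictionary_supports(("銀行",), ("bank",), d_je)
ok_compound = dictionary_supports(("地球", "温暖化"), ("warming", "global"), d_je)
ratio = dr_je("銀行", "bank", rt_j, rt_e, d_je)
```

In this toy data two of the three related terms of "銀行" translate into related terms of "bank", so the ratio is 2/3, comfortably above a low threshold θ.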
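[0053]
(2) Japanese concept label generation 302
(3) English concept label generation 303
Japanese concept label generation 302 and English concept label generation 303 are identical except that the roles of Japanese and English are swapped. Only English concept label generation 303 is therefore described here.
[0054]
English concept label generation step 303 generates one or more English concept labels from each English term, based on the English-Japanese term link data 92 and the Japanese term-related thesaurus 71. It also generates a set of related terms for each generated English concept label. For this purpose it refers to the English term-related thesaurus 72, the Japanese-English term link data 91, the Japanese term-related thesaurus 71, and the English-Japanese term link data 92.
[0055]
The input data of English concept label generation 303 has already been described, so the output data is described here. An English concept label is a combination of terms and is defined as follows.

[0056]
<English term>・<Japanese term>/.../<Japanese term> denotes, among the concepts represented by the English term, the concept it shares with the Japanese terms. For example, "bank・銀行" denotes bank as an organization that handles money, and "bank・岸" denotes bank as the land along a river or lake. <English concept label>+<English concept label> denotes a concept whose core is the part common to the concepts denoted by the two labels and whose extent covers both. Examples are "duty・税+tax", "plane・飛行機+airplane", and "reasoning+inference".
[0057]
The English concept label data 94 is a collection of records C_E(y), each representing the set of English concept labels for an English term y. An example is shown below.
C_E(interest) = {interest・興味, interest・金利/利率}.
This record C_E(interest) is generated when the English-Japanese term link data 92 contains the record
TL_EJ(interest) = {興味, 金利, 利率}.
[0058]
The related term data 96 for English concepts is a collection of records, each representing the set of related terms of an English concept label, written as follows.
RT_E(Y_i) = {y(N+i,1), y(N+i,2), ..., y(N+i,n_(N+i))} (i = 1, 2, ..., Q).
Here Y is an English concept label and y is an English term.
[0059]
The related term data 96 for English concepts is intermediate data for generating the final target, the English concept-related thesaurus 62. What the English concept-related thesaurus 62 needs is a set of related concepts, not a set of related terms. However, a set of related concepts cannot be built until the concept label sets of all terms have been generated. A set of related terms is therefore built provisionally and converted into a set of related concepts later, in English concept-related thesaurus generation 306.
[0060]
The algorithm that generates the English concept label set C_E(y) for an English term y, and the related term sets for the concept labels in C_E(y), is as follows.
1) When the English term y is linked to at least one Japanese term
i) Create the initial English concept label set. For each Japanese term linked to the English term y, generate an English concept label with that term as its Japanese modifier, and make these labels the elements of the set. That is,
C_E(y) ← {y・x | x ∈ TL_EJ(y)}.
ii) Whenever the similarity of two English concepts is at or above a predetermined threshold α, merge them into one English concept, and repeat this as long as possible. That is,

Here the similarity S(Y₁, Y₂) of two English concepts Y₁ = y・x₁/x₂/.../x_k and Y₂ = y・x'₁/x'₂/.../x'_k' that share the English term y is defined by the following equation.
[0061]
[Equation 2]

[0062]
That is, it is defined as the degree of overlap between the related term sets of the Japanese modifiers that make up the concept labels.
iii) Create the related term set for each English concept obtained from step ii). The related term set RT_E(y・x₁/x₂/.../x_k) of the English concept y・x₁/x₂/.../x_k is given by the following equation.
[0063]
[Equation 3]

[0064]
Here JM1 is the set of elements of the Japanese modifier, that is, JM1 = {x₁, x₂, ..., x_k}. JM2 is the set of Japanese terms linked to the English term y with the elements of the Japanese modifier removed, that is, JM2 = TL_EJ(y) − JM1.
2) When the English term y is not linked to any Japanese term
i) Use the English term y itself as the concept label. It is the only element of the English concept label set. That is,
C_E(y) ← {y}.
ii) Use the related term set RT_E(y) of the English term y unchanged as the related term set of the English concept label y.
A note on setting the threshold α in step 1) ii) of the algorithm. The aim here is to obtain Japanese modifiers that distinguish the multiple concepts represented by a single English term. The threshold α should therefore be set fairly low, so that near-synonymous Japanese translations are merged into a single Japanese modifier.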
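Steps 1) i), 1) ii) and 2) i) can be sketched in Python as below. The similarity measure is an assumption: Equation 2 is not reproduced in this text, so the sketch uses the Jaccard overlap of the pooled related term sets of the Japanese modifiers, which fits the verbal description but may differ from the actual formula. Labels are represented as `(English term, frozenset of Japanese modifiers)` pairs, and the function name and threshold value are illustrative.

```python
def concept_labels(y, tl_ej, rt_j, alpha):
    """Build the concept labels of English term y as (y, modifier set) pairs."""
    groups = [frozenset([x]) for x in sorted(tl_ej.get(y, ()))]
    if not groups:
        return [(y, frozenset())]  # case 2): the term itself is the label

    def pooled(group):
        # Related terms of all Japanese modifiers in the group, pooled together.
        return set().union(*(rt_j.get(x, set()) for x in group))

    def similarity(g1, g2):
        a, b = pooled(g1), pooled(g2)
        return len(a & b) / len(a | b) if (a | b) else 0.0

    merged = True
    while merged:  # repeat step ii) until no pair reaches alpha
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                if similarity(groups[i], groups[j]) >= alpha:
                    groups[i] = groups[i] | groups[j]
                    del groups[j]
                    merged = True
                    break
            if merged:
                break
    return [(y, g) for g in groups]


tl_ej = {"interest": {"興味", "金利", "利率"}}
rt_j = {
    "興味": {"趣味", "関心"},
    "金利": {"ローン", "銀行", "預貯金"},
    "利率": {"ローン", "銀行", "証券"},
}
labels = concept_labels("interest", tl_ej, rt_j, alpha=0.3)
```

With these toy related term sets, "金利" and "利率" are merged into one modifier while "興味" stays separate, which corresponds to the record C_E(interest) = {interest・興味, interest・金利/利率} in the text.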
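[0065]
Japanese concept label generation 302 works in the same way as English concept label generation 303. Its output, the Japanese concept label data 93, mirrors the English concept label data 94: it is a collection of records C_J(x), each representing the set of Japanese concept labels for a Japanese term x. A Japanese concept label, like an English concept label, is a combination of terms, as follows.
[0066]

The other output of Japanese concept label generation 302, the related term data 95 for Japanese concepts, mirrors the related term data 96 for English concepts: it is a collection of records, each representing the set of related terms of a Japanese concept label. That is,
RT_J(X_i) = {x(M+i,1), x(M+i,2), ..., x(M+i,n_(M+i))} (i = 1, 2, ..., P).
Here X is a Japanese concept label and x is a Japanese term.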
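[0067]
(4) Japanese-English concept linking 304
Japanese-English concept linking 304 takes the Japanese concept label data 93 and the English concept label data 94 as input and generates the Japanese-English concept link data 63 and the English-Japanese concept link data 64. The two inputs have already been described, so the Japanese-English concept link data 63 and the English-Japanese concept link data 64 are described first.
[0068]
The Japanese-English concept link data 63 is a collection of records, each representing the set of English concepts linked to a Japanese concept. That is,
CL_JE(X_i) = {Y'(i,1), Y'(i,2), ..., Y'(i,q'_i)} (i = 1, 2, ..., P).
Here X is a Japanese concept label and Y' is an English concept label.
[0069]
Similarly, the English-Japanese concept link data 64 is a collection of records, each representing the set of Japanese concepts linked to an English concept. That is,
CL_EJ(Y_i) = {X'(i,1), X'(i,2), ..., X'(i,p'_i)} (i = 1, 2, ..., Q).
Here Y is an English concept label and X' is a Japanese concept label.
[0070]
The algorithm that generates the Japanese-English concept link data 63 and the English-Japanese concept link data 64 is as follows.
1) For every Japanese concept label X = x・y₁/y₂/.../y_k',
CL_JE(X) = {Y | Y = y・x₁/x₂/.../x_k (∈ C_E(y)), y ∈ {y₁, y₂, ..., y_k'}, {x₁, x₂, ..., x_k} ∋ x}.
That is, generate the set of elements Y of the English concept sets C_E(y), for each English term y in the English modifier of X, whose Japanese modifier contains x.
[0071]
2) For every English concept label Y = y・x₁/x₂/.../x_k',
CL_EJ(Y) = {X | X = x・y₁/y₂/.../y_k (∈ C_J(x)), x ∈ {x₁, x₂, ..., x_k'}, {y₁, y₂, ..., y_k} ∋ y}.
That is, generate the set of elements X of the Japanese concept sets C_J(x), for each Japanese term x in the Japanese modifier of Y, whose English modifier contains y.
[0072]
An example of the output of Japanese-English concept linking 304 follows. Suppose the Japanese concept label data 93 is
C_J(興味) = {興味・interest},
C_J(金利) = {金利・interest},
C_J(利率) = {利率・interest}
and the English concept label data 94 is
C_E(interest) = {interest・興味, interest・金利/利率}.
Then the Japanese-English concept link data 63 is generated as
CL_JE(興味・interest) = {interest・興味},
CL_JE(金利・interest) = {interest・金利/利率},
CL_JE(利率・interest) = {interest・金利/利率}
and the English-Japanese concept link data 64 is generated as
CL_EJ(interest・興味) = {興味・interest},
CL_EJ(interest・金利/利率) = {金利・interest, 利率・interest}.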
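The linking rule is symmetric, so a single function can produce both directions. The Python sketch below reproduces the example above; the representation of a concept label as a `(head term, frozenset of modifiers)` pair and the name `link_concepts` are choices made for illustration only.

```python
def link_concepts(src_labels, tgt_labels):
    """For each source concept label head・m1/m2/..., collect the target labels
    of every modifier m whose own modifier set contains the source head."""
    links = {}
    for labels in src_labels.values():
        for head, modifiers in labels:
            links[(head, modifiers)] = {
                (t_head, t_mods)
                for m in modifiers
                for (t_head, t_mods) in tgt_labels.get(m, ())
                if head in t_mods
            }
    return links


fs = frozenset
c_j = {
    "興味": [("興味", fs({"interest"}))],
    "金利": [("金利", fs({"interest"}))],
    "利率": [("利率", fs({"interest"}))],
}
c_e = {
    "interest": [("interest", fs({"興味"})), ("interest", fs({"金利", "利率"}))],
}

cl_je = link_concepts(c_j, c_e)  # Japanese concept -> English concepts
cl_ej = link_concepts(c_e, c_j)  # English concept -> Japanese concepts
```

Here `cl_je` maps 金利・interest to interest・金利/利率, and `cl_ej` maps interest・金利/利率 to both 金利・interest and 利率・interest, as in the records listed in the text.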
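[0073]
One rule is now laid down for writing concept labels. If only one English concept corresponds to an English term, attaching a Japanese modifier is pointless, and the term itself can serve as the concept label. The same applies to Japanese terms. Unlike the algorithm of Japanese-English concept linking 304 described above, the subsequent processing makes no decisions based on Japanese or English modifiers. Therefore, for any concept that is the only concept of its term, the concept label is changed to the term itself when the concept link data 63 and 64 is output. Under this rule, the Japanese-English concept link data 63 in the example above becomes
CL_JE(興味) = {interest・興味},
CL_JE(金利) = {interest・金利/利率},
CL_JE(利率) = {interest・金利/利率}
and the English-Japanese concept link data 64 becomes
CL_EJ(interest・興味) = {興味},
CL_EJ(interest・金利/利率) = {金利, 利率}.
Here "興味" is a word that represents a single concept, so the concept label "興味・interest" is abbreviated to "興味". Likewise, "金利・interest" and "利率・interest" are abbreviated to "金利" and "利率". By contrast, "interest" is a word that represents several concepts, so the concept labels "interest・興味" and "interest・金利/利率" cannot be abbreviated.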
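[0074]
(5) Japanese concept-related thesaurus generation 305
(6) English concept-related thesaurus generation 306
Japanese concept-related thesaurus generation 305 takes the Japanese concept label data 93 and the related term data 95 for Japanese concepts as input and outputs the Japanese concept-related thesaurus 61. English concept-related thesaurus generation 306 takes the English concept label data 94 and the related term data 96 for English concepts as input and outputs the English concept-related thesaurus 62. These inputs have already been described.
[0075]
The outputs, the Japanese concept-related thesaurus 61 and the English concept-related thesaurus 62, are as follows. The Japanese concept-related thesaurus 61 is a collection of records, each representing the set of related concepts of a Japanese concept. That is,
RC_J(X_i) = {X(i,1), X(i,2), ..., X(i,p_i)} (i = 1, 2, ..., P).
Here X is a Japanese concept label. Examples of related concept sets are shown below.
[0076]
RC_J(銀行) = {ローン, 金利, 口座, 利率, 証券, 経済, 金融, 投資}.
RC_J(岸) = {川, 水, ボート, 湖, 釣り}.
The English concept-related thesaurus 62 is a collection of records, each representing the set of related concepts of an English concept. That is,
RC_E(Y_i) = {Y(i,1), Y(i,2), ..., Y(i,q_i)} (i = 1, 2, ..., Q).
Here Y is an English concept label. Examples of related concept sets are shown below.
RC_E(bank・銀行) = {account・口座, interest・金利/利率, loan, investment, economy}.
RC_E(bank・岸) = {river, boat, water, fishing, park・公園, lake}.
[0077]
The algorithm of English concept-related thesaurus generation 306 is as follows. The algorithm of Japanese concept-related thesaurus generation 305 is exactly the same.
For each element y of the related term set RT_E(Y) of an English concept Y, the element of y's concept label set C_E(y) that has the highest correlation with Y is selected as an element of the related concept set RC_E(Y). That is,
[0078]
[Equation 4]

[0079]
Here S₂ is the correlation between English concepts based on their related term sets, defined by the following equation.
S₂(Y₁, Y₂) = |RT_E(Y₁) ∩ RT_E(Y₂)| / |RT_E(Y₁) ∪ RT_E(Y₂)|.
For example, suppose the related term set of the English concept "bank・銀行" contains "interest", and the concept label set of the English term "interest" is {interest・興味, interest・金利/利率}. The correlation between "bank・銀行" and "interest・興味" and the correlation between "bank・銀行" and "interest・金利/利率" are then computed. If the latter is larger, "interest・金利/利率" is selected as an element of the related concept set of "bank・銀行".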
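The selection step can be sketched in Python using the S₂ formula given above, which is the Jaccard coefficient of two related term sets. Concept labels are plain strings here, and the function names and toy data are illustrative assumptions rather than anything specified in the patent.

```python
def jaccard(a, b):
    """S2: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def related_concepts(concept, rt, labels_of):
    """Convert the related terms of `concept` into related concepts.

    rt:        concept label -> set of related terms
    labels_of: term -> list of concept labels generated from that term
    """
    result = set()
    for term in rt[concept]:
        candidates = labels_of.get(term, [term])
        best = max(candidates,
                   key=lambda c: jaccard(rt[concept], rt.get(c, set())))
        result.add(best)
    return result


rt = {
    "bank・銀行": {"interest", "loan", "account"},
    "interest・興味": {"hobby", "curiosity"},
    "interest・金利/利率": {"loan", "bank", "deposit"},
}
labels_of = {
    "interest": ["interest・興味", "interest・金利/利率"],
    "loan": ["loan"],
    "account": ["account・口座"],
}
rc_bank = related_concepts("bank・銀行", rt, labels_of)
```

In the toy data "bank・銀行" shares the related term "loan" with "interest・金利/利率" and nothing with "interest・興味", so the financial sense of "interest" is the one selected.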
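[0080]
(7) Japanese concept merging 307
(8) English concept merging 308
Japanese concept merging 307 and English concept merging 308 are identical except that the roles of Japanese and English are swapped. Only English concept merging 308 is therefore described here.
[0081]
English concept merging 308 merges highly similar English concepts that are linked to the same Japanese concept into one concept. Its inputs are the English concept-related thesaurus 62, the Japanese-English concept link data 63, and the English-Japanese concept link data 64, and its outputs are the updated versions of these.
[0082]
The algorithm of English concept merging 308 is as follows.
For every Japanese concept X, repeat the following as long as possible.
If there is a pair of English concepts Y₁, Y₂ ∈ CL_JE(X) with S₃(Y₁, Y₂) ≥ β, execute a) to c). Here S₃(Y₁, Y₂) is the similarity of the English concepts Y₁ and Y₂, defined by the following equation.
S₃(Y₁, Y₂) = |RC_E(Y₁) ∩ RC_E(Y₂)| / |RC_E(Y₁) ∪ RC_E(Y₂)|.
[0083]
a) Update the English concept-related thesaurus 62
For every Y ∈ RC_E(Y₁), RC_E(Y) ← RC_E(Y) − {Y₁} + {Y₁+Y₂}.
For every Y ∈ RC_E(Y₂), RC_E(Y) ← RC_E(Y) − {Y₂} + {Y₁+Y₂}.
RC_E(Y₁+Y₂) ← RC_E(Y₁) ∪ RC_E(Y₂).
Delete RC_E(Y₁) and RC_E(Y₂).
[0084]
b) Update the Japanese-English concept link data 63
For every X ∈ CL_EJ(Y₁), CL_JE(X) ← CL_JE(X) − {Y₁} + {Y₁+Y₂}.
For every X ∈ CL_EJ(Y₂), CL_JE(X) ← CL_JE(X) − {Y₂} + {Y₁+Y₂}.
c) Update the English-Japanese concept link data 64
CL_EJ(Y₁+Y₂) ← CL_EJ(Y₁) ∪ CL_EJ(Y₂).
Delete CL_EJ(Y₁) and CL_EJ(Y₂).
[0085]
The threshold β in the algorithm above should be set fairly high, so that only concepts with very high similarity are merged. Keeping terms that differ in conceptual scope or nuance as separate entities makes for a more useful thesaurus. In this respect the setting differs from the case where the aim is to distinguish the multiple concepts represented by a term of the other language (the threshold α in the algorithm of English concept label generation 303).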
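Updates a) to c) can be sketched in Python with dictionaries of sets standing in for the three data files. The merged label is written with "+" as in the label notation of the text. One detail is an added assumption: the sketch removes Y₁, Y₂ and the merged label itself from the merged related concept set, a guard for the case where Y₁ and Y₂ are related to each other, which the source does not discuss. The names `should_merge` and `merge_concepts` and the value of β are illustrative.

```python
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def should_merge(y1, y2, rc_e, beta):
    """S3(Y1, Y2) >= beta, with S3 the Jaccard overlap of related concept sets."""
    return jaccard(rc_e[y1], rc_e[y2]) >= beta


def merge_concepts(y1, y2, rc_e, cl_je, cl_ej):
    """Merge English concepts y1 and y2 in place and return the merged label."""
    merged = f"{y1}+{y2}"
    for old in (y1, y2):
        for y in rc_e[old]:       # a) neighbours now point at the merged concept
            rc_e[y] = (rc_e.get(y, set()) - {old}) | {merged}
        for x in cl_ej[old]:      # b) Japanese concepts now link to it
            cl_je[x] = (cl_je.get(x, set()) - {old}) | {merged}
    # a) continued: union of the two related concept sets (guard is an assumption)
    rc_e[merged] = (rc_e.pop(y1) | rc_e.pop(y2)) - {y1, y2, merged}
    # c) union of the two link sets on the English side
    cl_ej[merged] = cl_ej.pop(y1) | cl_ej.pop(y2)
    return merged


rc_e = {
    "duty・税": {"import", "customs"},
    "tax": {"import", "customs", "income"},
    "import": {"duty・税", "tax"},
    "customs": {"duty・税", "tax"},
    "income": {"tax"},
}
cl_je = {"税": {"duty・税", "tax"}}
cl_ej = {"duty・税": {"税"}, "tax": {"税"}}

merge_ok = should_merge("duty・税", "tax", rc_e, beta=0.6)
label = merge_concepts("duty・税", "tax", rc_e, cl_je, cl_ej)
```

The two concepts share two of three related concepts (S₃ = 2/3), so with β = 0.6 they are merged into "duty・税+tax", and every record that mentioned either of them now mentions the merged concept instead.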
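[0086]
Through the processing described above, a Japanese-English parallel thesaurus consisting of the Japanese concept-related thesaurus 61, the English concept-related thesaurus 62, the Japanese-English concept link data 63, and the English-Japanese concept link data 64 can be generated from a Japanese-English bilingual text corpus made up of the Japanese corpus 51 paired with the English corpus 52.
The Japanese-English parallel thesaurus generated in this way is used for thesaurus navigation by the client computer 2 through the communication network 3 shown in FIG. 1. Next, the processing of the system is described with reference to FIGS. 5 to 7.
[0087]
FIG. 5 shows the contents of the display screen of the client computer 2 in the system.
The display screen shown in FIG. 5 consists of a concept set area 1010, a zoom-in area 1020, and function selection buttons. The function selection buttons are a zoom-in button 1030, a translate button 1040, a clear button 1050, and an end button 1060.
[0088]
The zoom-in area 1020 displays one or more concept clusters 1021, each with an associated selection button 1022. A concept cluster 1021 is a set of highly related concepts. For example, a concept cluster corresponding to "global environmental problems" would display concepts such as "地球温暖化" (global warming), "オゾン層" (ozone layer), "温室効果" (greenhouse effect), "フロン" (chlorofluorocarbon), and "大気" (atmosphere). The user of the client computer 2 can select more than one of these concept clusters 1021.
[0089]
FIG. 6 is a flowchart of the processing of the parallel thesaurus navigation system according to this embodiment. The processing of the system is described below in relation to the display contents shown in FIG. 5.
First, the initial screen is displayed (step 410). On the initial screen, the concept set area 1010 and the zoom-in area 1020 are blank. The system maintains an internal language indicator that switches between "Japanese" and "English", and the indicator is set to "Japanese" when the initial screen is displayed.
[0090]
After the initial screen is displayed (step 410), the system waits for input (step 420). In this state the concept set area 1010 is writable, and the user normally writes one or more Japanese terms or Japanese concept labels into it.
Processing branches as follows according to the button pressed while the system is waiting for input.
[0091]
(1) When the zoom-in button 1030 is pressed
The concept set displayed in the concept set area 1010 is read (step 430). What the user writes in the concept set area 1010 is usually a term rather than a concept label. When a term has been written, it is treated as if all the concept labels generated from that term had been written. This processing refers to the Japanese concept-related thesaurus 61 when the language indicator is "Japanese" and to the English concept-related thesaurus 62 when the language indicator is "English".
[0092]
Next, the concept set is expanded by adding concepts strongly related to the concepts it contains, and the expanded concept set is clustered (step 440). This processing refers to the Japanese concept-related thesaurus 61 when the language indicator is "Japanese" and to the English concept-related thesaurus 62 when the language indicator is "English".
Finally, the resulting concept clusters are displayed in the zoom-in area 1020 together with selection buttons 1022 (step 450), and the system returns to waiting for input.
[0093]
(2) When the selection button 1022 of a concept cluster is pressed
The selected concept cluster 1021 is copied into the concept set area 1010, overwriting its contents (step 460), and the system returns to waiting for input. While the system is waiting for input, the concept set area 1010 is writable, and the user can add or delete terms or concept labels.
[0094]
(3) When the translate button 1040 is pressed
The concept set displayed in the concept set area 1010 is read (step 470). This processing is exactly the same as step 430.
Next, the concept set is translated (step 480). When the language indicator is "Japanese", the Japanese concept set is translated into an English concept set; when the language indicator is "English", the English concept set is translated into a Japanese concept set.
[0095]
Finally, the translation result is displayed in the concept set area 1010, overwriting its contents, the language indicator is flipped (step 490), and the system returns to waiting for input. While the system is waiting for input, the concept set area 1010 is writable, and the user can add or delete terms or concept labels.
[0096]
(4) When the clear button 1050 is pressed
The system returns to the initial screen display state (step 410).
(5) When the end button 1060 is pressed
Processing ends.
The processing described above enables navigation of the parallel thesaurus, including transitions between languages.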
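[0097]
FIG. 7 shows in detail the concept set translation processing (step 480) that characterizes the system. FIG. 7 shows the processing that translates a Japanese concept set into an English concept set; the processing that translates an English concept set into a Japanese concept set is exactly the same.
[0098]
Given an input Japanese concept set, the Japanese-English concept link data 63 is consulted, the English concepts linked to the Japanese concepts in the set are collected, and a core English concept set is generated (step 481).
[0099]
Next, the English concept-related thesaurus 62 is consulted to collect the English concepts strongly related to the English concepts in this core set, and the English-Japanese concept link data 64 is then consulted to select, from these related English concepts, those that are not linked to any Japanese concept. The selected English concepts are added to the core English concept set to form the translation result (step 482).
[0100]
An example of translation (transition) from a Japanese concept set to an English concept set follows. Suppose the input Japanese concept set is {地球温暖化, オゾン層, 温室効果, フロン, 大気, 二酸化炭素, 環境}, and the Japanese-English concept link data 63 contains the following records.
[0101]
CL_JE(地球温暖化) = φ.
CL_JE(オゾン層) = {ozone layer}.
CL_JE(温室効果) = φ.
CL_JE(フロン) = φ.
CL_JE(大気) = {atmosphere・大気}.
CL_JE(二酸化炭素) = {carbon dioxide}.
CL_JE(環境) = {environment}.
Suppose further that the English concept-related thesaurus 62 contains the following records.
[0102]
RC_E(ozone layer) = {chlorofluorocarbon, depletion, atmosphere・大気, warming}.
RC_E(atmosphere・大気) = {pollution, environment, gas・気体/ガス, carbon dioxide}.
RC_E(carbon dioxide) = {atmosphere・大気, energy, warming, environment, regulation}.
RC_E(environment) = {protection, carbon dioxide, energy, atmosphere・大気, pollution}.
Suppose also that the English-Japanese concept link data 64 contains the following records.
[0103]
CL_EJ(ozone layer) = {オゾン層}.
CL_EJ(chlorofluorocarbon) = φ.
CL_EJ(depletion) = {破壊}.
CL_EJ(atmosphere・大気) = {大気}.
CL_EJ(warming) = φ.
CL_EJ(pollution) = {汚染}.
CL_EJ(environment) = {環境}.
CL_EJ(gas・気体/ガス) = {気体, ガス}.
CL_EJ(carbon dioxide) = {二酸化炭素}.
CL_EJ(energy) = {エネルギー}.
CL_EJ(regulation) = {規制}.
CL_EJ(protection) = {保護}.
[0104]
In this case, translating the Japanese concept set {地球温暖化, オゾン層, 温室効果, フロン, 大気, 二酸化炭素, 環境} into an English concept set gives {ozone layer, atmosphere・大気, carbon dioxide, environment, chlorofluorocarbon, warming}.
[0105]
Of the six English concepts in the resulting set, four ("ozone layer", "atmosphere・大気", "carbon dioxide", and "environment") were explicitly linked in the thesaurus to Japanese concepts in the input set. The other two ("chlorofluorocarbon" and "warming") were not explicitly linked to any Japanese concept in the input set but were added as related concepts of the four. In fact, "chlorofluorocarbon" is the English translation of "フロン", and "warming" is part of the English translation of "地球温暖化". In this way, a translation result can be obtained that includes translations not explicitly represented as concept links.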
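Steps 481 and 482 can be traced on the example above with a short Python sketch. The data is taken from the records in the text (with the spelling "chlorofluorocarbon"). The function name is illustrative, and the sketch treats every concept listed in a related concept record as "strongly related", since the text does not say how strength is ranked at this step.

```python
def translate_concepts(concepts, cl_je, rc_e, cl_ej):
    """Translate a Japanese concept set into an English concept set."""
    core = []                      # step 481: explicitly linked concepts
    for x in concepts:
        for y in sorted(cl_je.get(x, ())):
            if y not in core:
                core.append(y)
    extra = []                     # step 482: related concepts with no link back
    for y in core:
        for r in rc_e.get(y, ()):
            if r not in core and r not in extra and not cl_ej.get(r):
                extra.append(r)
    return core + extra


cl_je = {
    "地球温暖化": set(), "オゾン層": {"ozone layer"}, "温室効果": set(),
    "フロン": set(), "大気": {"atmosphere・大気"},
    "二酸化炭素": {"carbon dioxide"}, "環境": {"environment"},
}
rc_e = {
    "ozone layer": ["chlorofluorocarbon", "depletion", "atmosphere・大気", "warming"],
    "atmosphere・大気": ["pollution", "environment", "gas・気体/ガス", "carbon dioxide"],
    "carbon dioxide": ["atmosphere・大気", "energy", "warming", "environment", "regulation"],
    "environment": ["protection", "carbon dioxide", "energy", "atmosphere・大気", "pollution"],
}
cl_ej = {
    "ozone layer": {"オゾン層"}, "chlorofluorocarbon": set(), "depletion": {"破壊"},
    "atmosphere・大気": {"大気"}, "warming": set(), "pollution": {"汚染"},
    "environment": {"環境"}, "gas・気体/ガス": {"気体", "ガス"},
    "carbon dioxide": {"二酸化炭素"}, "energy": {"エネルギー"},
    "regulation": {"規制"}, "protection": {"保護"},
}
source = ["地球温暖化", "オゾン層", "温室効果", "フロン", "大気", "二酸化炭素", "環境"]
result = translate_concepts(source, cl_je, rc_e, cl_ej)
```

The result contains the four explicitly linked concepts followed by "chlorofluorocarbon" and "warming", the same six concepts as the translation result stated in the text.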
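[0106]
Embodiments of the invention have been described in detail above with reference to the drawings, but the specific configuration is not limited to these embodiments, and design changes that do not depart from the gist of the invention are permitted.
First, although the embodiment above generates a parallel thesaurus for Japanese and English, the parallel thesaurus may be generated for Japanese and French, for example, or more generally for any pair of languages, including pairs that involve Japanese.
[0107]
Also, as shown in FIG. 4, the embodiment above implements functions that take into account the concepts of both polysemous words and synonyms, but it is also possible to implement functions that consider only the concepts of polysemous words or only synonyms. This can be achieved by providing the functions of Japanese concept merging 307 and English concept merging 308 selectively.
[0108]
The client computer 2 of the present invention may be a personal computer, a workstation, or the like connected to the communication network 3 by a wired line, or a mobile communication terminal (a mobile phone, a PHS (Personal Handy-phone System) terminal, a PDA (Personal Digital Assistant), etc.) connected to the communication network 3 by a wireless line.
[0109]
The parallel thesaurus generation apparatus and the parallel thesaurus navigation system of the present invention are also realized by programs that make the server computer 1 or the client computer 2 function. These programs are stored on a computer-readable recording medium such as a CD-ROM.
[0110]
The recording medium recording the program that makes the parallel thesaurus generation apparatus or the parallel thesaurus navigation system function may be the storage device 13 shown in FIG. 1 itself, or it may be a CD-ROM or the like that becomes readable when inserted into a program reading device (not shown), such as a CD-ROM drive, provided as an external storage device. The recording medium may also be a magnetic tape, a cassette tape, a floppy disk, a hard disk, an MO, MD, or DVD disc, or a semiconductor memory.
[0111]
The parallel thesaurus generated by the parallel thesaurus generation apparatus of the present invention may also be stored on a computer-readable recording medium such as a CD-ROM. This parallel thesaurus combines the thesauri of two languages: Japanese and English concept label generation 302 and 303 generate concept labels that combine terms of the two languages, and Japanese-English concept linking 304 (see FIG. 4) links the concept labels of the two languages on the basis of concepts.
[0112]
[Effects of the invention]
With the recording medium recording the parallel thesaurus generation program of the present invention, a parallel thesaurus combining the related-term thesauri of two languages can be generated automatically from text corpora of the two languages. The generated parallel thesaurus shows relations between concepts.
[0113]
With the recording medium recording the parallel thesaurus of the present invention, the problems of polysemous words and synonyms are solved, unlike in a conventional thesaurus that shows relations between terms, so using the thesaurus generated by the present invention can improve the accuracy of various natural language processing systems.
[0114]
With the recording medium recording the parallel thesaurus navigation program of the present invention, efficient text mining across multiple languages becomes possible. In particular, it becomes easy to navigate a thesaurus in one's native language to reach information in a foreign language. The translation accuracy of search requests, which is a problem in conventional cross-language information retrieval, is also greatly improved by the concept set transition (translation) function.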
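[Brief description of the drawings]
[FIG. 1] A block diagram showing the configuration of a parallel thesaurus generation apparatus according to an embodiment of the present invention and of a parallel thesaurus navigation system that contains the apparatus.
[FIG. 2] A functional diagram of the inputs, outputs, and module configuration of the parallel thesaurus generation apparatus according to the embodiment of the present invention.
[FIG. 3] A diagram showing the processing of Japanese thesaurus generation in detail.
[FIG. 4] A diagram showing the processing of the Japanese-English thesaurus combination module in detail.
[FIG. 5] A diagram showing the contents of the display screen of the client computer in the parallel thesaurus navigation system according to the embodiment of the present invention.
[FIG. 6] A flowchart of the processing of the parallel thesaurus navigation system according to the embodiment of the present invention.
[FIG. 7] A diagram showing in detail the concept set translation processing (step 480) that characterizes the parallel thesaurus navigation system according to the embodiment of the present invention.
[Explanation of symbols]
1 Server computer
2 Client computer
3 Communication network
10 Japanese thesaurus generation
11 Processing device
12 Input device
13 Storage device
20 English thesaurus generation
30 Japanese-English thesaurus combination
51 Japanese corpus
52 English corpus
61 Japanese concept-related thesaurus
62 English concept-related thesaurus
63 Japanese-English concept link data
64 English-Japanese concept link data
71 Japanese term-related thesaurus
72 English term-related thesaurus
73 Japanese-English bilingual dictionary
74 English-Japanese bilingual dictionary
81 Terms and occurrence frequencies
82 Co-occurring term pairs and co-occurrence frequencies
91 Japanese-English term link data
92 English-Japanese term link data
93 Japanese concept label data
94 English concept label data
95 Related term data for Japanese concepts
96 Related term data for English concepts
101 Term extraction
102 Co-occurrence data extraction
103 Correlation analysis
301 Japanese-English term linking
302 Japanese concept label generation
303 English concept label generation
304 Japanese-English concept linking
305 Japanese concept-related thesaurus generation
306 English concept-related thesaurus generation
307 Japanese concept merging
308 English concept merging
1010 Concept set area
1020 Zoom-in area
1021 Concept cluster
1022 Selection button
1030 Zoom-in button
1040 Translate button
1050 Clear button
1060 End button
[0001]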
BACKGROUND OF THE INVENTION
The present invention relates to a device for generating a parallel thesaurus by combining two language thesauruses from a text corpus of two languages, a navigation system program using the parallel thesaurus, and a recording medium on which the parallel thesaurus is recorded.
[0002]
[Prior art]
With the increase of computerized text information, the importance of information access technology is increasing.
The inventors previously filed a thesaurus navigation method as Japanese Patent Application No. 11-28101. This method is a technique for efficiently extracting useful information (hereinafter referred to as text mining) from a document database (hereinafter referred to as a text corpus) that stores text data. A thesaurus is generated by extracting the related knowledge between them, and navigation is performed by displaying the contents of the thesaurus on the browser of the client terminal.
[0003]
In addition, the information processing society database system study group / informatics basic study group study report DBS-118-13 / FI-54-13 “Corpus compatible related thesaurus navigation” (May 17, 1999) includes the above navigation method. A text mining system based on is reported.
[0004]
[Problems to be solved by the invention]
In the field of information search, there is an increasing need to search for documents in a foreign language by inputting a search request expressed in a native language. This information retrieval method is called cross language information retrieval, and has been actively studied.
[0005]
A typical method for cross-language information retrieval is described in, for example, the Journal of Information Processing Society of Japan, Vol. It has been reported. This search method uses a bilingual dictionary or machine translation system to execute a document search after translating the search request into the same language as the document. In this case, since the search request is shorter than the document and has less context information, there is a problem that it is difficult to translate with high accuracy and the search accuracy is low.
[0006]
To deal with this problem, it is conceivable to use the navigation method proposed in Japanese Patent Application No. 11-28101. In this case, it is considered that the above problem in cross-language information retrieval can be solved by extending from a single language to two languages and using it as the front end of the information retrieval system. It becomes necessary to do.
[0007]
On the other hand, a thesaurus generation technology from a text corpus is effective not only for text mining as described above, but also for various natural language processing application systems, but the following technical problems remain.
[0008]
Conventional thesaurus automatic generation techniques have problems regarding the handling of multiple terms. In the conventional technology, the relationship between terms is extracted, but since the relationship between terms is semantic in nature, the term of polysemy is divided into meanings or concepts, and the relationship between concepts is extracted. Is ideal. According to the conventional technique, for example, “loan”, “money”, “river”, “water” and the like are extracted as related terms of “bank” regardless of the concept of “bank”.
[0009]
Here, “loan” and “money” are related terms of “bank” as an institution for depositing and withdrawing money, and “river” and “water” are “bank ( Kishi) "related terms. Therefore, it is desirable to divide “bank” into the concepts it represents and extract related terms such as “loan” and “money” and related terms such as “river” and “water” separately.
[0010]
In addition, the conventional automatic thesaurus generation technique has a problem regarding the handling of synonyms. In the conventional technique, related terms are extracted based on the co-occurrence probability, that is, the probability of appearing in the vicinity in the text. For this reason, synonyms are not extracted as related terms, but are treated as separate entities. Since synonyms represent the same concept, it is desirable that the synonyms are handled together in one entity.
[0011]
In view of the above, an object of the present invention is to solve the above-mentioned problems in the prior art.
The first purpose is to establish a parallel thesaurus comprising the concepts of the first language and the relationships between the concepts, the concepts of the second language and the relationships between the concepts, and the connections between the concepts of the first language and the concepts of the second language. An object of the present invention is to provide a recording medium on which a parallel thesaurus generation program is automatically generated from a text corpus of one language and a second language.
[0012]
A second object is a recording medium on which a parallel thesaurus navigation program is recorded as a front end for cross-language information retrieval, and in particular, focusing on the handling of multiple meanings and synonyms and enabling high-precision translation of search requests. Is to provide.
A third object is to provide a recording medium on which a parallel thesaurus is recorded in order to realize effective text mining using the parallel thesaurus.
[0013]
[Means for Solving the Problems]
  The present invention is an input for receiving input of a text corpus in a first language and a second language.apparatusAnd a memory for storing a bilingual dictionary of the first language and the second languageapparatusAnd processing to process the dataapparatusFor causing a computer having the above function as a parallel thesaurus generatorComputer-readable recording medium on which is recordedAnd the above processingEquipmentthe aboveFrom input deviceA first language thesaurus generator for generating a first language term-related thesaurus representing a relationship between terms of the first language in the input text corpus of the first languageStep,the aboveFrom input deviceA second language thesaurus generator for generating a second language term-related thesaurus that represents the relationship between the second language terms in the input second language text corpusStep,the aboveStored in storageA thesaurus combination hand that performs a thesaurus combination that combines the term-related thesaurus of the first language and the term-related thesaurus of the second language using a bilingual dictionary of the first language and the second language.To function as a stageA computer-readable recording medium having a program recorded thereon.
[0014]
  As the thesaurus combining means, the program causes the processing device to function as: term combining means for combining corresponding terms between the first-language term-related thesaurus and the second-language term-related thesaurus using the bilingual dictionary of the first and second languages; first language concept label generating means for generating concept labels from each first-language term by attaching the second-language terms combined with it; second language concept label generating means for generating concept labels from each second-language term by attaching the first-language terms combined with it; concept combining means for converting the term combinations between the first and second languages into concept combinations between the languages; first language concept-related thesaurus generating means for converting the relationships between terms in the first-language term-related thesaurus into relationships between concepts; second language concept-related thesaurus generating means for converting the relationships between terms in the second-language term-related thesaurus into relationships between concepts; first language concept merging means for merging a plurality of first-language concepts combined with the same second-language concept into one concept when their sets of related first-language concepts are similar; and second language concept merging means for merging a plurality of second-language concepts combined with the same first-language concept into one concept when their sets of related second-language concepts are similar. This makes it possible to generate a parallel thesaurus that takes into account the concepts expressed by polysemous words and synonyms.
[0015]
  The parallel thesaurus generator operates as follows.
  The first language thesaurus generating means extracts terms and pairs of co-occurring terms from the text corpus of the first language, analyzes the correlation between the terms, and outputs a set of related terms for each term. When the first language is Japanese, for example, {loan, interest rate, account, interest rate, securities, economy, finance, investment} is output as the set of related terms of the term 銀行 (a bank as a financial institution).
[0016]
  The second language thesaurus generating means executes the same processing as the first language thesaurus generating means on the second-language text corpus and outputs a set of related terms for each term. When the second language is English, for example, {account, river, interest, loan, boat, investment, fishing, park, economy, lake} is output as the set of related terms of the term "bank".
[0017]
  Within the thesaurus combining means, each means operates as follows.
  The term combining means uses the bilingual dictionary of the first and second languages to combine corresponding terms between the terms included in the first-language term-related thesaurus and the terms included in the second-language term-related thesaurus. For example, 銀行 (a bank as a financial institution) is combined with “bank”, and 岸 (a shore) is also combined with “bank”.
[0018]
The first language concept label generating means generates at least one concept label from each term of the first language by attaching the second-language terms combined with it. Similarly, the second language concept label generating means generates at least one concept label from each term of the second language by attaching the first-language terms combined with it. For example, the English term “bank” is combined with the Japanese terms 銀行 and 岸 (kishi, a shore). If the related term set of 銀行 and the related term set of 岸 are not similar, 銀行 and 岸 are judged to be conceptually different, and two concept labels, “bank·銀行” and “bank·岸”, are generated.
[0019]
The concept combining means converts term combinations between the languages into concept combinations between the languages. For example, the term combination between 銀行 and “bank” is converted into a combination between the concept of 銀行 and the concept “bank·銀行”, and the term combination between 岸 and “bank” is converted into a combination between the concept of 岸 and the concept “bank·岸”.
[0020]
The first language concept-related thesaurus generating means converts the relationships between terms included in the first-language term-related thesaurus into relationships between concepts. Similarly, the second language concept-related thesaurus generating means converts the relationships between terms included in the second-language term-related thesaurus into relationships between concepts. For example, the relationship between the terms “bank” and “interest” is converted into a relationship between the concept “bank·銀行” and the concept of “interest” whose label carries the two Japanese terms meaning “interest rate”, and the relationship between the terms “bank” and “river” is converted into a relationship between the concepts “bank·岸” and “river”.
[0021]
  The first language concept merging means merges into one concept those first-language concepts that are combined with the same second-language concept and whose sets of related first-language concepts are similar. Similarly, the second language concept merging means merges into one concept those second-language concepts that are combined with the same first-language concept and whose sets of related second-language concepts are similar. For example, two Japanese concepts that both mean “interest rate” are combined with the same English concept of “interest”, namely the one whose label lists both Japanese terms. If the sets of related concepts of these two Japanese concepts are similar, they are merged into a single concept whose label joins the two terms.
[0022]
When each means operates as described above, a parallel thesaurus is generated in which a first-language concept-related thesaurus, consisting of first-language concepts and the relationships between them, is combined with a second-language concept-related thesaurus, consisting of second-language concepts and the relationships between them.
[0023]
  The present invention also provides a computer-readable recording medium on which are recorded a parallel thesaurus, which includes a first-language concept-related thesaurus representing the relationships between first-language concepts created from first-language concept labels that combine terms of the first and second languages, and a parallel thesaurus navigation program that causes a computer connected to display means to execute: a step of reading a first-language concept set displayed on the display means; and a step of obtaining, using the relationships between concepts represented in the first-language concept-related thesaurus, concepts strongly related to the first-language concepts in the concept set that was read.
[0024]
  The present invention also provides a computer-readable recording medium on which is recorded a program for causing a computer, having a terminal for inputting a first-language term or a first-language concept label, a processing device that processes data, and a storage device that holds a parallel thesaurus composed of concept combination data created from concept labels of the first and second languages and concept-related thesauruses of the first and second languages, to function as a parallel thesaurus navigation system. The program causes the processing device to function as: concept set reading means for reading, using the first-language concept-related thesaurus, a concept set that includes the first-language term input at the terminal or a concept set that includes the first-language concept label input at the terminal; expanded concept set clustering means for expanding the read concept set by adding concepts strongly related to it, using the first-language concept-related thesaurus, and clustering the expanded concept set; concept set translating means for translating each concept in the clustered expanded concept set into an expanded concept set of the second language, using the concept combination data and the second-language concept-related thesaurus; and means for outputting information for displaying the second-language expanded concept set on display means.
[0025]
  In addition, the program is characterized in that, as the concept set translating means, it causes the processing device to function as: means for collecting, using the concept combination data, the second-language concepts combined with the first-language concepts included in the clustered expanded concept set, thereby generating a second-language concept set that serves as a nucleus; means for collecting, using the second-language concept-related thesaurus, second-language concepts strongly related to that nucleus concept set; and means for adding, using the concept combination data, those of the strongly related second-language concepts that are not combined with any first-language concept. In the parallel thesaurus navigation system, translation (transition) from a concept set of the first language to a concept set of the second language is performed as follows.
[0026]
  The concept set after translation (transition) is formed by putting together the second-language concepts combined with the concepts in the concept set before translation (transition) and the second-language concepts that are strongly related to those second-language concepts and are not combined with any first-language concept. This allows translation (transition) from a first-language concept set to a related second-language concept set that also includes concepts not explicitly combined in the parallel thesaurus.
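The translation (transition) just described can be illustrated with a minimal sketch. It is not the patented implementation: the function name, the dictionary layout, and in particular the `min_links` criterion for "strongly related" are assumptions introduced here, since the text does not define how strength is measured.

```python
def translate_concept_set(source_concepts, coupling, related, min_links=1):
    """Sketch of the translation (transition) of a concept set.

    coupling: first-language concept -> set of combined second-language
              concepts (stands in for the concept combination data).
    related:  second-language concept -> set of related second-language
              concepts (stands in for the concept-related thesaurus).
    min_links: ASSUMPTION. A candidate counts as "strongly related" when
              at least this many nucleus concepts list it as related.
    """
    # Step 1: the nucleus is every second-language concept combined with
    # a concept of the input set.
    nucleus = set()
    for concept in source_concepts:
        nucleus |= coupling.get(concept, set())

    # Step 2: count, for each candidate outside the nucleus, how many
    # nucleus concepts are related to it.
    links = {}
    for n in nucleus:
        for r in related.get(n, set()):
            if r not in nucleus:
                links[r] = links.get(r, 0) + 1

    # Step 3: add only candidates that have no first-language counterpart.
    coupled = set().union(*coupling.values())
    extra = {r for r, k in links.items() if k >= min_links and r not in coupled}
    return nucleus | extra
```

Under these assumptions, a second-language concept that is related to the nucleus but already has a first-language counterpart is left out, which matches the wording "not combined with any first-language concept".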
[0027]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
FIG. 1 is a block diagram for explaining the configuration of a parallel thesaurus generator according to an embodiment of the present invention and a parallel thesaurus navigation system including the device. In this embodiment, generation of a parallel thesaurus in Japanese and English will be described as a parallel thesaurus in which the thesauruses of two languages are combined.
[0028]
A parallel thesaurus navigation system (hereinafter referred to as a system) according to the present embodiment includes a server computer 1 and a client computer 2 connected to each other via a communication network 3.
[0029]
The server computer 1 performs a parallel thesaurus generation process for associating a Japanese thesaurus with an English thesaurus, and a thesaurus search process among the processes of this system. The server computer 1 is mainly composed of a processing device 11, an input device 12, and a storage device 13.
[0030]
The processing device 11 executes the entire processing of the server computer 1. In particular, each data processing related to parallel thesaurus generation shown in FIGS. 2 to 4 described later is executed. The input device 12 is a CD-ROM drive, a floppy disk drive, or the like according to a text corpus acquisition medium, and is used for inputting a text corpus.
[0031]
The storage device 13 is a collective term for storage means such as a RAM, a ROM, and a magneto-optical disk library device (not shown). For example, the processing program of the server computer 1 is permanently stored in the ROM; data, work files, and the like created in the course of processing by the server computer 1 are temporarily stored in the RAM; and the corpus of each language, the thesauruses, the bilingual dictionaries (see FIG. 2), and the like are stored in a mass storage device such as the magneto-optical disk library device.
[0032]
The client computer 2 displays a thesaurus transmitted as a search result of the server computer 1 and performs interactive processing with the user among the processes of this system.
Next, details of parallel thesaurus generation processing will be described with reference to FIGS.
[0033]
FIG. 2 is a diagram for functionally explaining the input / output data and the module configuration of the parallel thesaurus generator. The input of the parallel thesaurus generator is a Japanese-English text corpus pair consisting of a Japanese corpus 51 and an English corpus 52. The Japanese corpus 51 and the English corpus 52 are required to be in the same field, but they need not be translations of each other.
[0034]
The output of the parallel thesaurus generator is a Japanese-English parallel thesaurus comprising a Japanese concept-related thesaurus 61, an English concept-related thesaurus 62, Japanese-English concept combined data 63, and English-Japanese concept combined data 64. The Japanese-English concept combined data 63 and the English-Japanese concept combined data 64 have the same information content but different record formats. Although redundant, both are output in consideration of the efficiency of the Japanese-English thesaurus combination processing.
[0035]
The modules constituting the parallel thesaurus generator are a Japanese thesaurus generation 10, an English thesaurus generation 20, and a Japanese-English thesaurus combination 30. The Japanese thesaurus generation 10 generates a Japanese term related thesaurus 71 from the Japanese corpus 51. The English thesaurus generation 20 generates an English term related thesaurus 72 from the English corpus 52. The Japanese-English thesaurus combination 30 refers to the Japanese-English bilingual dictionary 73 and the English-Japanese bilingual dictionary 74 and, from the Japanese term related thesaurus 71 and the English term related thesaurus 72, generates the Japanese concept related thesaurus 61, the English concept related thesaurus 62, the Japanese-English concept combination data 63, and the English-Japanese concept combination data 64. The Japanese-English bilingual dictionary 73 and the English-Japanese bilingual dictionary 74 have the same information content but different record formats. Although redundant, both are used in consideration of the efficiency of the Japanese-English thesaurus combination processing.
[0036]
FIG. 3 is a diagram for explaining the details of the processing of the Japanese thesaurus generation 10. As shown in FIG. 3, the process for generating a Japanese thesaurus includes three steps: term extraction 101, co-occurrence data extraction 102, and correlation analysis 103.
[0037]
(1) Term extraction 101
Terms are extracted from the Japanese corpus 51 and the appearance frequency is counted. As terms, nouns and compound nouns whose appearance frequency is equal to or higher than a predetermined threshold are extracted. Compound nouns are extracted by pattern matching using a part-of-speech string pattern. Among the high-frequency words, there are many common words that are not particularly related to the field. They are removed using a stop word list.
[0038]
With respect to this stop word list, for example, by using “above” as the stop word of the first element, a noun phrase such as “above system” can be excluded. Similarly, by using “whole” as the stop word of the end element, a noun phrase such as “whole system” can be excluded.
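As an illustration of the term extraction step, the frequency threshold and the first-element and end-element stop word lists can be applied as in the sketch below. The function and parameter names are hypothetical, and candidate noun phrases are assumed to arrive already tokenized as tuples of words with their counts; the part-of-speech pattern matching itself is not shown.

```python
def extract_terms(phrase_counts, first_stop, last_stop, min_freq):
    """Keep candidate noun phrases that pass the frequency and stop word tests.

    phrase_counts: dict mapping a tuple of words to its appearance frequency.
    first_stop:    stop words that disqualify a phrase when they are its first element.
    last_stop:     stop words that disqualify a phrase when they are its end element.
    """
    kept = {}
    for words, freq in phrase_counts.items():
        if freq < min_freq:
            continue  # below the appearance-frequency threshold
        if words[0] in first_stop or words[-1] in last_stop:
            continue  # begins or ends with a stop word
        kept[words] = freq
    return kept
```

For instance, with `first_stop={"above"}` a candidate `("above", "system")` is dropped, and with `last_stop={"whole"}` a candidate whose end element is "whole" is dropped; the tuple order is an assumption about how the phrase was tokenized.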
[0039]
(2) Co-occurrence data extraction 102
Co-occurring term pairs are extracted and the co-occurrence frequency is counted. As the definition of co-occurrence, window co-occurrence is adopted. That is, a pair of terms included in the window at each position is extracted while moving a window having a certain width along the text. The window width is, for example, 25 terms excluding function words.
(3) Correlation analysis 103
Statistical correlation values are calculated for all term pairs, and term pairs having correlation values equal to or greater than a predetermined threshold are extracted. Mutual information is used as the term correlation value.
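Steps (2) and (3) can be sketched as follows. This is a simplified illustration under stated assumptions rather than the embodiment itself: each pair of term positions lying within one window width is counted once, probabilities are estimated as plain relative frequencies, and the correlation value is pointwise mutual information with base-2 logarithm. The function names are hypothetical.

```python
import math
from collections import Counter


def cooccurrence_counts(terms, window=25):
    """Count term frequencies and window co-occurrences.

    terms: list of terms in text order, function words already removed.
    ASSUMPTION: two positions co-occur when they fit in one window of
    `window` terms, and each such pair of positions is counted once.
    """
    term_freq = Counter(terms)
    pair_freq = Counter()
    for i, t in enumerate(terms):
        for u in terms[i + 1 : i + window]:
            if u != t:
                pair_freq[frozenset((t, u))] += 1
    return term_freq, pair_freq


def related_terms(term_freq, pair_freq, threshold):
    """Return term -> set of related terms whose mutual information >= threshold."""
    n_terms = sum(term_freq.values())
    n_pairs = sum(pair_freq.values())
    related = {}
    for pair, f_xy in pair_freq.items():
        x, y = tuple(pair)
        p_xy = f_xy / n_pairs
        p_x = term_freq[x] / n_terms
        p_y = term_freq[y] / n_terms
        if math.log2(p_xy / (p_x * p_y)) >= threshold:
            related.setdefault(x, set()).add(y)
            related.setdefault(y, set()).add(x)
    return related
```

The resulting dictionary has the same shape as the related term sets described next: one record per term, holding the terms correlated with it.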
[0040]
As a result of the steps (1) to (3) described above, the Japanese term related thesaurus 71 is obtained. The Japanese term related thesaurus 71 is a collection of records representing a related term set for each Japanese term. That is,
RT_J(x_i) = {x(i,1), x(i,2), ..., x(i,m_i)} (i = 1, 2, ..., M).
Here, x_i is a Japanese term, x(i,1), ..., x(i,m_i) are its related terms, i is the number assigned to each term, m_i is the number of related terms of the i-th Japanese term, and M is the total number of Japanese terms. An example of the record is shown below.
[0041]
RT_J(Bank) = {Loan, Interest rate, Account, Interest rate, Securities, Economy, Finance, Investment}.
RT_J(Interest rate) = {Loan, Lending, Savings, Raising, Banking}.
The processing of the Japanese thesaurus generation 10 has been described above, but the processing of the English thesaurus generation 20 is exactly the same as the Japanese thesaurus generation 10. As a result of the English thesaurus generation process, an English term-related thesaurus 72 is obtained. The English term related thesaurus 72 is a collection of records representing a related term set for each English term. That is,
RT_E(y_i) = {y(i,1), y(i,2), ..., y(i,n_i)} (i = 1, 2, ..., N).
Here, y_i is an English term, y(i,1), ..., y(i,n_i) are its related terms, i is the number assigned to each term, n_i is the number of related terms of the i-th English term, and N is the total number of English terms. An example of the record is shown below.
RT_E(bank) = {account, river, interest, loan, boat, investment, fishing, park, economy, lake}.
RT_E(interest) = {loan, deposit, bank, science, economy, exchange, politics}.
[0042]
FIG. 4 is a diagram for explaining the details of the processing of the Japanese-English thesaurus combination 30. As shown in FIG. 4, the Japanese-English thesaurus combination processing consists of eight steps: Japanese-English term combination 301, Japanese concept label generation 302, English concept label generation 303, Japanese-English concept combination 304, Japanese concept related thesaurus generation 305, English concept related thesaurus generation 306, Japanese concept merge 307, and English concept merge 308. Details of these processes will be described below.
[0043]
(1) Japanese-English term combination 301
By referring to the Japanese-English bilingual dictionary 73 and the English-Japanese bilingual dictionary 74, corresponding terms are combined between the Japanese term-related thesaurus 71 and the English term-related thesaurus 72, and the Japanese-English term combined data 91 and the English-Japanese term combined data 92 are output.
Of the inputs to the Japanese-English term combination, the Japanese term-related thesaurus 71 and the English term-related thesaurus 72 have already been described, so the Japanese-English bilingual dictionary 73 and the English-Japanese bilingual dictionary 74 are described here.
[0044]
The Japanese-English bilingual dictionary 73 is a collection of records representing a bilingual English term set for each Japanese term. That is,
D_JE(a_i) = {b(i,1), b(i,2), ..., b(i,l_i)} (i = 1, 2, ..., K).
Here, a is a Japanese term and b is an English term. An example of the record is shown below, where 銀行 is a bank as a financial institution and 岸 is a shore.
D_JE(銀行) = {bank}.
D_JE(岸) = {bank}.
[0045]
The English-Japanese bilingual dictionary 74 is a collection of records representing a bilingual Japanese term set for each English term. That is,
D_EJ(b_i) = {a(i,1), a(i,2), ..., a(i,k_i)} (i = 1, 2, ..., L).
Here, b is an English term and a is a Japanese term. An example of the record is shown below.
D_EJ(bank) = {bank, bank, bank}.
D_EJ(interest) = {interest, interest rate, interest rate}.
[0046]
Next, the output of the Japanese-English term combination will be described. The Japanese-English term combined data 91 and the English-Japanese term combined data 92 represent the same information in different formats. Although redundant, both are output in consideration of the efficiency of subsequent processing.
The Japanese-English term combined data 91 is a collection of records representing a set of English terms combined with each Japanese term. That is,
TL_JE(x_i) = {y'(i,1), y'(i,2), ..., y'(i,n'_i)} (i = 1, 2, ..., M).
Here, x is a Japanese term and y ′ is an English term.
[0047]
The English-Japanese term combined data 92 is a collection of records representing a set of Japanese terms combined with each English term. That is,
TL_EJ(y_i) = {x'(i,1), x'(i,2), ..., x'(i,m'_i)} (i = 1, 2, ..., N).
Here, y is an English term and x ′ is a Japanese term.
[0048]
The algorithm for the Japanese-English term combination 301 is as follows.
1) The Japanese-English term combination data 91 is initialized. That is,
TL_JE(x_i) ← φ (i = 1, 2, ..., M).
2) The English-Japanese term combination data 92 is initialized. That is,
TL_EJ(y_i) ← φ (i = 1, 2,..., N).
[0049]
3) Japanese terms x and English terms y that satisfy the following two conditions are combined.
(a) The translation relationship <x, y> is supported by the bilingual dictionary.
(b) The domain relevance DR(x, y) of the translation relationship <x, y> is equal to or greater than a predetermined threshold.
That is, for every pair of a Japanese term x and an English term y that satisfies (a) and (b) as formalized below,
TL_JE(x) ← TL_JE(x) ∪ {y} and TL_EJ(y) ← TL_EJ(y) ∪ {x}
are executed.
(a) x consists of k terms x₁, x₂, ..., x_k, y consists of k terms y₁, y₂, ..., y_k, and there exist y'₁ (∈ D_JE(x₁)), y'₂ (∈ D_JE(x₂)), ..., y'_k (∈ D_JE(x_k)) such that {y'₁, y'₂, ..., y'_k} = {y₁, y₂, ..., y_k}.
(b) DR(x, y) ≧ θ.
[0050]
Condition (a) determines whether the Japanese term x and the English term y are in a translation relationship by checking whether translation relationships hold between their components, regardless of the order of the components. In particular, when k = 1 the condition reduces to y ∈ D_JE(x), that is, the pair is a term pair registered in the bilingual dictionary. When k ≧ 2, the pair is a pair of compound terms whose components are in translation relationships registered in the bilingual dictionary.
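Condition (a) can be checked with a short sketch like the one below. It reads the condition as requiring a one-to-one pairing between the components of x and the components of y, which is an interpretation, and it tries all orderings, which is only practical because compound terms have few components. The function name and the dictionary layout are hypothetical.

```python
from itertools import permutations


def dictionary_supports(x_parts, y_parts, d_je):
    """Check condition (a) for a Japanese term and an English term.

    x_parts, y_parts: tuples of component terms (length 1 for simple terms).
    d_je: Japanese term -> set of English translations (bilingual dictionary).
    Returns True when the components of y can be reordered so that each one
    is a dictionary translation of the corresponding component of x.
    """
    if len(x_parts) != len(y_parts):
        return False
    return any(
        all(y in d_je.get(x, set()) for x, y in zip(x_parts, order))
        for order in permutations(y_parts)
    )
```

With k = 1 the call reduces to a plain dictionary lookup, as the text notes.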
The condition (b) checks whether the bilingual relationship suggested by the bilingual dictionary is a relationship established in the domain. The domain relevance DR (x, y) of the bilingual relationship <x, y> is defined by the following equation.
[0051]
[Expression 1]

[0052]
DR_JE(x, y) is the proportion of the related terms of the Japanese term x whose English translations are related terms of the English term y. In other words, it is the degree to which the context in which the Japanese term appears overlaps the context in which the English term appears. Likewise, DR_EJ(x, y) is the degree to which the appearance context of the English term overlaps the appearance context of the Japanese term. If there is some commonality in the appearance contexts, the bilingual relationship may be considered to hold in the domain. Therefore, it is preferable to set the threshold value θ of the domain relevance to a small value. With the above algorithm, the Japanese-English term combination 301 is executed.
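The term combination algorithm can be sketched as follows for simple (k = 1) terms. The directional measure follows the verbal description of DR_JE above; how DR_JE and DR_EJ are merged into DR(x, y) is given by Expression 1, which is not reproduced in this text, so the `merge` parameter (defaulting to `max`) is purely an assumption, as are all function names.

```python
def directional_relevance(x, y, rt_src, rt_dst, dictionary):
    """Share of x's related terms that have a translation among y's related terms."""
    related_x = rt_src.get(x, set())
    if not related_x:
        return 0.0
    related_y = rt_dst.get(y, set())
    hits = sum(1 for r in related_x if dictionary.get(r, set()) & related_y)
    return hits / len(related_x)


def combine_terms(rt_j, rt_e, d_je, d_ej, theta, merge=max):
    """Build TL_JE and TL_EJ for simple terms.

    merge: ASSUMPTION standing in for Expression 1, which combines the two
    directional values into DR(x, y).
    """
    tl_je = {x: set() for x in rt_j}  # step 1): initialize to empty sets
    tl_ej = {y: set() for y in rt_e}  # step 2): initialize to empty sets
    for x in rt_j:
        for y in d_je.get(x, set()):  # condition (a) with k = 1
            if y not in rt_e:
                continue
            dr = merge(
                directional_relevance(x, y, rt_j, rt_e, d_je),  # DR_JE
                directional_relevance(y, x, rt_e, rt_j, d_ej),  # DR_EJ
            )
            if dr >= theta:           # condition (b)
                tl_je[x].add(y)
                tl_ej[y].add(x)
    return tl_je, tl_ej
```

A small θ keeps any pair whose contexts share at least some translated related terms, in line with the recommendation above.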
[0053]
(2) Japanese concept label generation 302
(3) English concept label generation 303
The processing of Japanese concept label generation 302 and English concept label generation 303 is exactly the same except that the roles of Japanese and English are reversed. Therefore, the English concept label generation 303 will be described here.
[0054]
The English concept label generation step 303 generates one or more English concept labels from each of the English terms based on the English-Japanese term combination data 92 and the Japanese term related thesaurus 71. Furthermore, a set of related terms is generated for each of the generated English concept labels. For this purpose, the English term related thesaurus 72, the Japanese-English term combined data 91, the Japanese term related thesaurus 71, and the English-Japanese term combined data 92 are referred to.
[0055]
Since the input data for the English concept label generation 303 has already been described, the output data will be described. The English concept label is a combination of terms and is defined as follows.

[0056]
<English term>·<Japanese term>/.../<Japanese term> indicates, among the concepts represented by the English term, the concept shared with the concepts represented by the Japanese terms. For example, “bank·銀行” indicates a bank as an organization that performs money-related business, and “bank·岸” indicates a bank as a place along a river or lake. <English concept label> + <English concept label> designates a concept whose core is the concepts indicated by the two concept labels and whose range is the union of the concepts indicated by the respective labels. Examples are “duty / tax + tax”, “plane / airplane + airplane”, and “reasoning + inference”.
[0057]
The English concept label data 94 is a collection of records C_E(y), each representing the set of English concept labels corresponding to an English term y. An example of an English concept label set is shown, where the qualifiers after “·” stand for Japanese terms.
C_E(interest) = {interest·interest, interest·interest/interest rate}.
This record C_E(interest) is generated when the English-Japanese term combined data 92 contains the record
TL_EJ(interest) = {interest, interest rate, interest rate}.
[0058]
The related term data 96 of the English concept is a collection of records representing the related term set for each English concept label, and is described as follows.
RT_E(Y_i) = {y(N+i,1), y(N+i,2), ..., y(N+i,n_{N+i})} (i = 1, 2, ..., Q).
Here, Y is an English concept label, and y is an English term.
[0059]
The related term data 96 of the English concept is intermediate data for generating the English concept related thesaurus 62 which is the final purpose. What is required as the English concept related thesaurus 62 is not a related term set but a related concept set. However, the related concept set cannot be created unless the concept label set for all terms is generated. Therefore, a related term set is provisionally created and converted into a related concept set in the subsequent English concept related thesaurus generation 306.
[0060]
The algorithm for generating the English concept label set C_E(y) of an English term y and the related term sets of the concept labels in C_E(y) is as follows.
1) When English term y is combined with at least one Japanese term
i) Create initial data for a set of English concept labels. An English concept label using each Japanese term connected to the English term y as a Japanese qualifier is generated and used as its element. That is,
C_E(y) ← {y · x | x∈TL_EJ(y)}.
ii) If the similarity between two English concepts is greater than or equal to a predetermined threshold α, the process of integrating them into one English concept is repeated as much as possible. That is,

Here, the similarity S(Y₁, Y₂) between two English concepts Y₁ = y·x₁/x₂/.../x_k and Y₂ = y·x'₁/x'₂/.../x'_{k'} of a common English term y is defined by the following equation.
[0061]
[Expression 2]

[0062]
That is, it is defined by the degree of overlap between related term sets of Japanese qualifiers constituting the concept label.
iii) Create related term set data for each English concept obtained as a result of process ii). The related term set RT_E(y·x₁/x₂/.../x_k) of the English concept y·x₁/x₂/.../x_k is as follows:
[0063]
[Equation 3]

[0064]
Here, JM1 is the set of Japanese qualifier elements, that is, JM1 = {x₁, x₂, ..., x_k}. JM2 is the set of Japanese terms combined with the English term y, excluding the Japanese qualifier elements, that is, JM2 = TL_EJ(y) − JM1.
2) When English term y is not combined with Japanese term x
i) The English term y itself is a concept label. This is the only element in the English concept label set. That is,
C_E(y) ← {y}.
ii) The related term set RT_E(y) of the English term y is used as it is as the related term set of the English concept label y.
A supplementary note on the setting of the threshold value α in 1) ii) of the above algorithm: the purpose here is to obtain Japanese qualifiers that distinguish the multiple concepts represented by one English term. It is therefore preferable to set α to a small value so that similar Japanese translations are integrated into one Japanese qualifier.
[0065]
Japanese concept label generation 302 is analogous to English concept label generation 303. The output Japanese concept label data 93, like the English concept label data 94, is a collection of records C_J(x), each representing the set of Japanese concept labels corresponding to a Japanese term x. The Japanese concept label, like the English concept label, is a combination of terms defined as follows.
[0066]

The related term data 95 of the Japanese concept, which is the other output of the Japanese concept label generation 302, is likewise analogous to the related term data 96 of the English concept. That is,
RT_J(X_i) = {x(M+i,1), x(M+i,2), ..., x(M+i,n_{M+i})} (i = 1, 2, ..., P).
Here, X is a Japanese concept label, and x is a Japanese term.
[0067]
(4) Japanese-English concept combination 304
The Japanese-English concept combination 304 receives Japanese concept label data 93 and English concept label data 94 as input, and generates Japanese-English concept combination data 63 and English-Japanese concept combination data 64. Since the Japanese concept label data 93 and the English concept label data 94 have already been described, first, the Japanese-English concept combined data 63 and the English-Japanese concept combined data 64 will be described.
[0068]
The Japanese-English concept combined data 63 is a set of records representing a set of English concepts combined with each Japanese concept. That is,
CL_JE(X_i) = {Y'(i,1), Y'(i,2), ..., Y'(i,q'_i)} (i = 1, 2, ..., P).
Here, X is a Japanese concept label, and Y ′ is an English concept label.
[0069]
Similarly, the English-Japanese concept combination data 64 is a collection of records representing a set of Japanese concepts combined with each English concept. That is,
CL_EJ(Y_i) = {X'(i,1), X'(i,2), ..., X'(i,p'_i)} (i = 1, 2, ..., Q).
Here, Y is an English concept label, and X ′ is a Japanese concept label.
[0070]
The algorithm for generating the Japanese-English concept combination data 63 and the English-Japanese concept combination data 64 is as follows.
1) For every Japanese concept label X = x·y₁/y₂/.../y_{k'},
CL_JE(X) = {Y | Y = y·x₁/x₂/.../x_k (∈ C_E(y)), y ∈ {y₁, y₂, ..., y_{k'}}, {x₁, x₂, ..., x_k} ∋ x}.
That is, for each English term y included in the English qualifier of X, the elements Y of the English concept label set C_E(y) whose Japanese qualifier includes x are collected.
[0071]
2) For every English concept label Y = y·x₁/x₂/.../x_{k'},
CL_EJ(Y) = {X | X = x·y₁/y₂/.../y_k (∈ C_J(x)), x ∈ {x₁, x₂, ..., x_{k'}}, {y₁, y₂, ..., y_k} ∋ y}.
That is, for each Japanese term x included in the Japanese qualifier of Y, the elements X of the Japanese concept label set C_J(x) whose English qualifier includes y are collected.
[0072]
An output example of the Japanese-English concept combination 304 is shown. Suppose that the Japanese concept label data 93 is
C_J(Interest) = {interest·interest},
C_J(Interest rate) = {interest rate·interest},
C_J(Interest rate) = {Interest rate·interest}
and that the English concept label data 94 is
C_E(interest) = {interest·interest, interest·interest/interest}.
Then, as the Japanese-English concept combined data 63,
CL_JE(Interest·interest) = {interest·interest},
CL_JE(Interest·interest) = {interest·interest/interest},
CL_JE(Interest·interest) = {interest·interest/interest}
are generated, and as the English-Japanese concept combined data 64,
CL_EJ(interest·interest) = {interest·interest},
CL_EJ(interest·interest/interest) = {interest·interest, interest·interest}
are generated.
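The concept combination of steps 1) and 2) is symmetric, so one function can serve both directions, as in the sketch below. Labels are represented as a pair of a term and a frozenset of qualifier terms from the other language; this representation and the function name are assumptions made for illustration.

```python
def combine_concepts(c_src, c_dst):
    """Compute concept combination data in one direction.

    c_src, c_dst: term -> list of concept labels, each label being
    (term, frozenset_of_qualifier_terms_in_the_other_language).
    For a source label (t, quals), collect every label of each qualifier
    term whose own qualifier contains t.
    """
    result = {}
    for labels in c_src.values():
        for term, quals in labels:
            result[(term, quals)] = {
                (other_term, other_quals)
                for q in quals
                for (other_term, other_quals) in c_dst.get(q, [])
                if term in other_quals
            }
    return result
```

Writing the three Japanese terms of the example above as the placeholders `x1`, `x2`, and `x3`, and taking C_E(interest) = {interest·x1, interest·x2/x3}, `combine_concepts(c_j, c_e)` maps the label of `x1` to {interest·x1} and the labels of `x2` and `x3` to {interest·x2/x3}, while `combine_concepts(c_e, c_j)` maps interest·x2/x3 to the labels of both `x2` and `x3`.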
[0073]
Here, one rule concerning the notation of concept labels is defined. If only one English concept corresponds to an English term, attaching a Japanese modifier is meaningless, and the term itself can be used as the concept label. The same applies to Japanese terms. Unlike the algorithm of the Japanese-English concept combination 304 described above, the subsequent processing involves no judgment based on the Japanese modifier or the English modifier. Therefore, for a concept that is the only concept of its term, the concept label is changed to the term itself when the concept combination data 63 and 64 are output. Under this rule, the Japanese-English concept combination data 63 in the above example becomes
CL_JE(Interest) = {interest · interest},
CL_JE(Interest rate) = {interest · interest rate/interest rate},
CL_JE(Interest rate) = {interest · interest/interest}
and the English-Japanese concept combined data 64 becomes
CL_EJ(interest · interest) = {interest},
CL_EJ(interest · interest/interest) = {interest, interest}.
Here, since “interest” is a word representing a single concept, the concept label “interest · interest” is abbreviated to “interest”. The same applies to “interest rate · interest” and “interest rate · interest”, which are abbreviated to “interest rate” and “interest rate”, respectively. On the other hand, since “interest” represents a plurality of concepts, the concept labels “interest · interest” and “interest · interest/interest rate” cannot be abbreviated.
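The notation rule can be illustrated with a short Python sketch. The function name `display_label`, the tuple representation of a concept, and the sample terms (including the Japanese words) are assumptions made for this example only.

```python
def display_label(concept, labels):
    """Return the bare term when it has exactly one concept,
    otherwise the full label 'term · m1/m2'."""
    term, modifiers = concept
    if len(labels[term]) == 1:
        return term  # the only concept of its term: the modifier adds nothing
    return f"{term} · {'/'.join(sorted(modifiers))}"


# Hypothetical English concept label data: 'bank' has two concepts, 'loan' one.
labels_e = {
    "bank": [("bank", frozenset({"銀行"})), ("bank", frozenset({"岸"}))],
    "loan": [("loan", frozenset({"融資"}))],
}

short = display_label(("loan", frozenset({"融資"})), labels_e)
full = display_label(("bank", frozenset({"銀行"})), labels_e)
```

In this sketch the abbreviation is applied only at output time, which matches the statement that the rule is applied when the concept combination data 63 and 64 are output.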
[0074]
(5) Japanese concept related thesaurus generation 305
(6) English concept related thesaurus generation 306
The Japanese concept related thesaurus generation 305 receives Japanese concept label data 93 and Japanese concept related term data 95 and outputs a Japanese concept related thesaurus 61. The English concept related thesaurus generation 306 receives the English concept label data 94 and the English concept related term data 96 and outputs an English concept related thesaurus 62. These inputs have already been described.
[0075]
The outputs are the Japanese concept related thesaurus 61 and the English concept related thesaurus 62, described below. The Japanese concept related thesaurus 61 is a collection of records, each representing the related concept set of a Japanese concept. That is,
RC_J(X_i) = {X_{i,1}, X_{i,2}, ..., X_{i,p_i}} (i = 1, 2, ..., p).
Here, each X is a Japanese concept label. Examples of related concept sets are shown below.
[0076]
RC_J(Bank) = {Loan, Interest rate, Account, Interest rate, Securities, Economy, Finance, Investment}.
RC_J(Shore) = {river, water, boat, lake, fishing}.
The English concept related thesaurus 62 is a collection of records, each representing the related concept set of an English concept. That is,
RC_E(Y_i) = {Y_{i,1}, Y_{i,2}, ..., Y_{i,q_i}} (i = 1, 2, ..., q).
Here, each Y is an English concept label. Examples of related concept sets are shown below.
RC_E(bank · bank) = {account · account, interest · interest/interest, loan, investment, economy}.
RC_E(bank · shore) = {river, boat, water, fishing, park · park, lake}.
[0077]
The algorithm of the English concept related thesaurus generation 306 is as follows. The algorithm of the Japanese concept related thesaurus generation 305 is exactly the same.
For each element y of the related term set RT_E(Y) of an English concept Y, the element of the concept label set C_E(y) of y that has the highest degree of correlation with Y is selected as an element of the related concept set RC_E(Y). That is,
[0078]
[Expression 4]
RC_E(Y) = {Y' | Y' = argmax_{Y'' ∈ C_E(y)} S₂(Y, Y''), y ∈ RT_E(Y)}
[0079]
Here, S₂ is the degree of correlation between English concepts based on their related term sets, defined by
S₂(Y₁, Y₂) = |RT_E(Y₁) ∩ RT_E(Y₂)| / |RT_E(Y₁) ∪ RT_E(Y₂)|.
For example, the related term set of the English concept “bank · bank” includes “interest”, and the concept label set of the English term “interest” is {interest · interest, interest · interest/interest rate}. In this case, the correlation between “bank · bank” and “interest · interest” and the correlation between “bank · bank” and “interest · interest/interest rate” are calculated. If the latter is larger, “interest · interest/interest rate” is selected as an element of the related concept set of “bank · bank”.
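The selection step can be sketched in Python as follows. This is a minimal illustration assuming that S₂ is the ratio of the intersection to the union of the two related term sets; the function names, the "#" label notation, and the sample data are hypothetical and are not taken from the embodiment.

```python
def jaccard(a, b):
    """Overlap of two sets: |a ∩ b| / |a ∪ b| (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def related_concepts(Y, RT, C):
    """For each related term y of concept Y, keep the concept of y whose
    related term set overlaps most with that of Y."""
    return {max(C[y], key=lambda cand: jaccard(RT[Y], RT[cand])) for y in RT[Y]}


# Hypothetical related term sets (data 96) keyed by concept label.
RT = {
    "bank#finance": {"interest", "loan"},
    "interest#curiosity": {"hobby", "attention"},
    "interest#finance": {"loan", "rate", "bank"},
    "loan": {"bank", "interest"},
}
# Hypothetical concept label sets (data 94) keyed by term.
C = {
    "interest": ["interest#curiosity", "interest#finance"],
    "loan": ["loan"],
}

rc_bank = related_concepts("bank#finance", RT, C)
```

With this data the financial sense of "interest" shares the related term "loan" with the bank concept, so it is chosen over the other sense.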
[0080]
(7) Japanese concept merge 307
(8) English concept merge 308
The processing of the Japanese concept merge 307 and the English concept merge 308 is exactly the same except that the roles of Japanese and English are reversed. Accordingly, the English concept merge 308 will be described here.
[0081]
The English concept merge 308 merges highly similar English concepts that are combined with the same Japanese concept into a single concept. The inputs are the English concept related thesaurus 62, the Japanese-English concept combined data 63, and the English-Japanese concept combined data 64, and the outputs are updated versions of the same data.
[0082]
The algorithm of English concept merge 308 is as follows.
The following processing is repeated as long as possible for all Japanese concepts X.
For a pair of English concepts Y₁, Y₂ ∈ CL_JE(X) with S₃(Y₁, Y₂) ≧ β, steps a) to c) are executed. Here, S₃(Y₁, Y₂) is the similarity between the English concepts Y₁ and Y₂, defined by the following equation.
S₃(Y₁, Y₂) = |RC_E(Y₁) ∩ RC_E(Y₂)| / |RC_E(Y₁) ∪ RC_E(Y₂)|.
[0083]
a) Update of the English concept related thesaurus 62
For all Y ∈ RC_E(Y₁): RC_E(Y) ← RC_E(Y) - {Y₁} + {Y₁+Y₂}.
For all Y ∈ RC_E(Y₂): RC_E(Y) ← RC_E(Y) - {Y₂} + {Y₁+Y₂}.
RC_E(Y₁+Y₂) ← RC_E(Y₁) ∪ RC_E(Y₂).
RC_E(Y₁) and RC_E(Y₂) are deleted.
[0084]
b) Update of the Japanese-English concept combined data 63
For all X ∈ CL_EJ(Y₁): CL_JE(X) ← CL_JE(X) - {Y₁} + {Y₁+Y₂}.
For all X ∈ CL_EJ(Y₂): CL_JE(X) ← CL_JE(X) - {Y₂} + {Y₁+Y₂}.
c) Update of the English-Japanese concept combined data 64
CL_EJ(Y₁+Y₂) ← CL_EJ(Y₁) ∪ CL_EJ(Y₂).
CL_EJ(Y₁) and CL_EJ(Y₂) are deleted.
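Steps a) to c) and the surrounding loop can be sketched in Python as follows. This is an illustration only: it assumes that S₃ is the ratio of the intersection to the union of the related concept sets, that the two concept combination tables are mutually consistent, and that self-references left in the merged record should be dropped (the text does not say so). All names and the sample data are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two sets: |a ∩ b| / |a ∪ b| (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_pair(y1, y2, RC, CL_JE, CL_EJ):
    """Merge English concepts y1 and y2 in place, following steps a) to c)."""
    merged = f"{y1}+{y2}"
    # a) Redirect references in the English concept related thesaurus.
    for old in (y1, y2):
        for y in RC[old]:
            if y in RC:
                RC[y] = (RC[y] - {old}) | {merged}
    # Assumption: drop self-references left over when y1 and y2 were related.
    RC[merged] = (RC.pop(y1) | RC.pop(y2)) - {y1, y2, merged}
    # b) Redirect the Japanese-English links.
    for old in (y1, y2):
        for x in CL_EJ[old]:
            CL_JE[x] = (CL_JE[x] - {old}) | {merged}
    # c) Union the English-Japanese links and delete the old records.
    CL_EJ[merged] = CL_EJ.pop(y1) | CL_EJ.pop(y2)


def merge_all(RC, CL_JE, CL_EJ, beta):
    """Repeat merging until no pair under any Japanese concept reaches beta."""
    changed = True
    while changed:
        changed = False
        for x in list(CL_JE):
            for y1, y2 in combinations(sorted(CL_JE[x]), 2):
                if jaccard(RC[y1], RC[y2]) >= beta:
                    merge_pair(y1, y2, RC, CL_JE, CL_EJ)
                    changed = True
                    break  # CL_JE[x] changed; rescan it on the next pass


# Hypothetical data: two near-synonyms combined with one Japanese concept.
RC = {
    "car": {"engine", "road", "driver"},
    "automobile": {"engine", "road", "driver", "factory"},
    "engine": {"car", "automobile"},
    "road": {"car", "automobile"},
    "driver": {"car", "automobile"},
    "factory": {"automobile"},
}
CL_JE = {"jidosha": {"car", "automobile"}}
CL_EJ = {"car": {"jidosha"}, "automobile": {"jidosha"}}

merge_all(RC, CL_JE, CL_EJ, beta=0.7)
```

Each merge removes one record from the thesaurus, so the outer loop terminates. The similarity of the two sample concepts is 3/4, which exceeds the illustrative threshold of 0.7.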
[0085]
It is preferable to set the threshold value β in the above algorithm to a large value so that only concepts with a very high degree of similarity are merged. This is because a thesaurus is more useful when terms that differ in concept or nuance are kept as separate entries. In this respect the merge differs from the case where the purpose is to distinguish the plurality of concepts represented by a term of the partner language (threshold α in the algorithm of the English concept label generation 303).
[0086]
By the processing described above, a Japanese-English parallel thesaurus consisting of the Japanese concept related thesaurus 61, the English concept related thesaurus 62, the Japanese-English concept combined data 63, and the English-Japanese concept combined data 64 can be generated from a Japanese-English bilingual text corpus in which the Japanese corpus 51 and the English corpus 52 are paired.
The Japanese-English parallel thesaurus generated in this way is used for thesaurus navigation from the client computer 2 via the communication network 3 shown in FIG. 1. Next, the processing of this system will be described with reference to FIGS. 5 to 7.
[0087]
FIG. 5 is a diagram for explaining the contents of the display screen of the client computer 2 in this system.
The display screen shown in FIG. 5 includes a concept collection area 1010, a zoom-in area 1020, and a function selection button. The function selection buttons include a zoom-in button 1030, a translation button 1040, a clear button 1050, and an end button 1060.
[0088]
In the zoom-in area 1020, one or more concept clusters 1021 are displayed together with a selection button 1022 associated therewith. The concept cluster 1021 is a set of highly related concepts. For example, in the case of a concept cluster corresponding to the “global environmental problem”, concepts such as “global warming”, “ozone layer”, “greenhouse effect”, “Freon” and “atmosphere” are displayed. The user of the client computer 2 can designate a plurality of these concept clusters 1021.
[0089]
FIG. 6 is a flowchart for explaining processing of the parallel thesaurus navigation system according to the present embodiment. Hereinafter, the processing of this system will be described in correspondence with the display contents shown in FIG.
First, an initial screen is displayed (step 410). In the initial screen, the concept collection area 1010 and the zoom-in area 1020 are blank. In this system, a language indicator for switching between “Japanese” / “English” is provided inside, and when the initial screen is displayed, the language indicator is set to “Japanese”.
[0090]
After the initial screen display (step 410), the system waits for input (step 420). In this state, the concept set area 1010 is writable, and the user normally writes one or more Japanese terms or Japanese concept labels.
Depending on the button pressed while waiting for input, the program branches as follows.
[0091]
(1) When the zoom-in button 1030 is pressed
The concept set displayed in the concept set area 1010 is read (step 430). It is usually the term that the user writes in the concept collection area 1010, not the concept label. If a term has been written, it is assumed that all concept labels generated from the term have been written. This processing is performed by referring to the Japanese concept related thesaurus 61 when the language indicator is “Japanese” and referring to the English concept related thesaurus 62 when the language indicator is “English”.
[0092]
Next, the concept set is expanded by adding a concept strongly related to the concept included in the concept set, and the expanded concept set is clustered (step 440). This processing is performed by referring to the Japanese concept related thesaurus 61 when the language indicator is “Japanese” and referring to the English concept related thesaurus 62 when the language indicator is “English”.
Finally, the obtained concept cluster is displayed in the zoom-in area 1020 together with the selection button 1022 (step 450), and the process returns to an input waiting state.
[0093]
(2) When the concept cluster selection button 1022 is pressed
The selected concept cluster 1021 is copied (overwritten) to the concept set area 1010 (step 460), and the process returns to an input waiting state. In the state of waiting for input, the concept collection area 1010 can be written, and the user can add or delete terms or concept labels.
[0094]
(3) When the translation button 1040 is pressed
The concept set displayed in the concept set area 1010 is read (step 470). This process is exactly the same as in step 430.
Next, the concept set is translated (step 480). When the language indicator is “Japanese”, the translation from the Japanese concept set to the English concept set is executed, and when the language indicator is “English”, the translation from the English concept set to the Japanese concept set is executed.
[0095]
Finally, the translation result is displayed (overwritten) in the concept set area 1010, the language indicator is inverted (step 490), and the state of waiting for input is returned. In the state of waiting for input, the concept collection area 1010 can be written, and the user can add or delete terms or concept labels.
[0096]
(4) When the clear button 1050 is pressed
Return to the initial screen display state (step 410).
(5) When the end button 1060 is pressed
The process ends.
The processing described above enables navigation of a parallel thesaurus including transitions between languages.
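The branching on button presses described above can be sketched in Python as a small state holder. This is a hedged illustration, not the program of the embodiment: the class name `Navigator`, the button names, and the two callables `cluster` and `translate_set` (standing in for steps 430 to 440 and steps 470 to 480) are assumptions, and the stubs in the usage lines are placeholders.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Navigator:
    cluster: Callable[[Set[str], str], List[Set[str]]]   # expand and cluster
    translate_set: Callable[[Set[str], str], Set[str]]   # concept set translation
    language: str = "Japanese"                           # language indicator
    concept_set: Set[str] = field(default_factory=set)   # concept set area 1010
    clusters: List[Set[str]] = field(default_factory=list)  # zoom-in area 1020

    def press(self, button: str, index: int = 0) -> bool:
        """Handle one button press; return False when navigation ends."""
        if button == "zoom_in":          # steps 430 to 450
            self.clusters = self.cluster(self.concept_set, self.language)
        elif button == "select":         # step 460: overwrite with a cluster
            self.concept_set = set(self.clusters[index])
        elif button == "translate":      # steps 470 to 490
            self.concept_set = self.translate_set(self.concept_set, self.language)
            self.language = "English" if self.language == "Japanese" else "Japanese"
        elif button == "clear":          # back to the initial screen (step 410)
            self.concept_set, self.clusters, self.language = set(), [], "Japanese"
        elif button == "end":
            return False
        return True


# Usage with placeholder callables (not real clustering or translation).
nav = Navigator(
    cluster=lambda s, lang: [set(s) | {"x"}],
    translate_set=lambda s, lang: {c.upper() for c in s},
)
nav.concept_set = {"a"}
```

Passing the clustering and translation steps in as callables keeps the sketch independent of how the thesauruses 61 to 64 are stored.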
[0097]
FIG. 7 is a diagram for explaining in detail the process of concept set translation (step 480) that characterizes the present system. FIG. 7 shows the process of translating a Japanese concept set into an English concept set, but the process of translating an English concept set into a Japanese concept set is exactly the same.
[0098]
When an input Japanese concept set is given, the Japanese-English concept combination data 63 is referred to, and the English concepts combined with the Japanese concepts in the set are collected to generate a core English concept set (step 481).
[0099]
Next, the English concept related thesaurus 62 is referred to, and English concepts closely related to the English concepts in the core English concept set are collected; then, by referring to the English-Japanese concept combination data 64, those that are not combined with any Japanese concept are selected. The selected English concepts are added to the core English concept set to obtain the translation result (step 482).
[0100]
An example of translation (transition) from a Japanese concept set to an English concept set is shown below. Assume that the input Japanese concept set is {global warming, ozone layer, greenhouse effect, Freon, atmosphere, carbon dioxide, environment}, and that the Japanese-English concept combined data 63 includes the following records.
[0101]
CL_JE(Global warming) = φ.
CL_JE(Ozone layer) = {ozone layer}.
CL_JE(Greenhouse effect) = φ.
CL_JE(Freon) = φ.
CL_JE(Atmosphere) = {atmosphere / atmosphere}.
CL_JE(Carbon dioxide) = {carbon dioxide}.
CL_JE(Environment) = {environment}.
Further, it is assumed that the English concept related thesaurus 62 includes the following records.
[0102]
RC_E(ozone layer) = {chlorofluorocarbon, depletion, atmosphere · atmosphere, warming}.
RC_E(atmosphere · atmosphere) = {pollution, environment, gas · gas/gas, carbon dioxide}.
RC_E(carbon dioxide) = {atmosphere · atmosphere, energy, warming, environment, regulation}.
RC_E(environment) = {protection, carbon dioxide, energy, atmosphere · atmosphere, pollution}.
The English-Japanese concept combined data 64 is assumed to include the following records.
[0103]
CL_EJ(ozone layer) = {ozone layer}.
CL_EJ(chlorofluorocarbon) = φ.
CL_EJ(depletion) = {destruction}.
CL_EJ(atmosphere · atmosphere) = {atmosphere}.
CL_EJ(warming) = φ.
CL_EJ(pollution) = {contamination}.
CL_EJ(environment) = {Environment}.
CL_EJ(gas · gas/gas) = {gas, gas}.
CL_EJ(carbon dioxide) = {carbon dioxide}.
CL_EJ(energy) = {Energy}.
CL_EJ(regulation) = {regulation}.
CL_EJ(protection) = {protection}.
[0104]
In this case, the result of translating the Japanese concept set {global warming, ozone layer, greenhouse effect, Freon, atmosphere, carbon dioxide, environment} into an English concept set is {ozone layer, atmosphere · atmosphere, carbon dioxide, environment, chlorofluorocarbon, warming}.
[0105]
Of the six English concepts that make up the English concept set, four (“ozone layer”, “atmosphere · atmosphere”, “carbon dioxide”, and “environment”) are explicitly combined in the thesaurus with Japanese concepts in the Japanese concept set. “Chlorofluorocarbon” and “warming” are not explicitly combined in the thesaurus with any Japanese concept in the Japanese concept set, but were added as related concepts of the above four English concepts. In fact, “chlorofluorocarbon” is the English translation of “Freon”, and “warming” is part of the English translation of “global warming”. In this way, a translation result can be obtained that includes translation pairs not explicitly expressed as concept combinations.
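Steps 481 and 482 can be checked against this example with a short Python sketch. The function `translate` and the dictionary layout are assumptions made here. Because the Japanese concepts appear in this text only as English glosses, they are written with a hypothetical "ja:" prefix to keep them apart from the English concepts.

```python
def translate(concepts, CL_src_dst, RC_dst, CL_dst_src):
    """Translate a concept set into the partner language (steps 481 and 482)."""
    # Step 481: collect the concepts combined with the input concepts.
    core = set()
    for X in concepts:
        core |= CL_src_dst.get(X, set())
    # Step 482: add related concepts that are combined with no source concept.
    extra = {
        Y
        for c in core
        for Y in RC_dst.get(c, set())
        if not CL_dst_src.get(Y)
    }
    return core | extra


ATM = "atmosphere · atmosphere"
CL_JE = {
    "ja:global warming": set(), "ja:ozone layer": {"ozone layer"},
    "ja:greenhouse effect": set(), "ja:Freon": set(),
    "ja:atmosphere": {ATM}, "ja:carbon dioxide": {"carbon dioxide"},
    "ja:environment": {"environment"},
}
RC_E = {
    "ozone layer": {"chlorofluorocarbon", "depletion", ATM, "warming"},
    ATM: {"pollution", "environment", "gas · gas/gas", "carbon dioxide"},
    "carbon dioxide": {ATM, "energy", "warming", "environment", "regulation"},
    "environment": {"protection", "carbon dioxide", "energy", ATM, "pollution"},
}
CL_EJ = {
    "ozone layer": {"ja:ozone layer"}, "chlorofluorocarbon": set(),
    "depletion": {"ja:destruction"}, ATM: {"ja:atmosphere"},
    "warming": set(), "pollution": {"ja:contamination"},
    "environment": {"ja:environment"}, "gas · gas/gas": {"ja:gas"},
    "carbon dioxide": {"ja:carbon dioxide"}, "energy": {"ja:energy"},
    "regulation": {"ja:regulation"}, "protection": {"ja:protection"},
}

result = translate(set(CL_JE), CL_JE, RC_E, CL_EJ)
```

In this sketch a related concept with no record at all in the English-Japanese data is treated the same as one whose record is empty; the text does not address that case.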
[0106]
The embodiments of the present invention have been described in detail above with reference to the drawings. However, the specific configuration is not limited to these embodiments, and design changes that do not depart from the gist of the present invention are also included.
First, although a Japanese-English parallel thesaurus is generated in the above embodiment, a parallel thesaurus of two other languages, for example Japanese and French, or even an ordinary Japanese thesaurus, may be generated.
[0107]
Moreover, in the above embodiment, as shown in FIG. 4, a function that takes into account the concepts of both polysemous words and synonyms is implemented; however, it is also possible to implement a function that considers only the concepts of polysemous words, or only those of synonyms. This can be realized by selectively providing the functions of the Japanese concept merge 307 and the English concept merge 308.
[0108]
As the client computer 2 in the present invention, a personal computer or a workstation connected to the communication network 3 by a wired line may be used, or a mobile communication terminal connected to the communication network 3 by a wireless line, such as a mobile phone, a PHS (Personal Handy-phone System), or a PDA (Personal Digital Assistant), may be used.
[0109]
The parallel thesaurus generator and parallel thesaurus navigation system of the present invention are also realized by a program for causing the server computer 1 or the client computer 2 to function. This program is stored in a computer-readable recording medium such as a CD-ROM.
[0110]
The recording medium on which the program for causing a computer to function as the parallel thesaurus generator or the parallel thesaurus navigation system is recorded may be the storage device 13 itself shown in FIG. 1, or may be a CD-ROM or the like that is read by being inserted into a CD-ROM drive or the like provided as an external storage device. The recording medium may also be a magnetic tape, a cassette tape, a floppy disk, a hard disk, an MO / MD / DVD, or a semiconductor memory.
[0111]
The parallel thesaurus generated by the parallel thesaurus generator of the present invention may be stored in a computer-readable recording medium such as a CD-ROM. This parallel thesaurus combines the thesauruses of two languages: concept labels that combine terms of the two languages are generated by the Japanese and English concept label generation 302 and 303, and the Japanese-English concept combination 304 (FIG. 4) combines the concept labels of the two languages on the basis of their concepts.
[0112]
【The invention's effect】
According to the recording medium on which the parallel thesaurus generation program of the present invention is recorded, the parallel thesaurus in which the related thesauruses of the two languages are combined can be automatically generated from the text corpus of the two languages. The generated parallel thesaurus shows the relationship between concepts.
[0113]
In addition, according to the recording medium on which the parallel thesaurus of the present invention is recorded, the problems of polysemy and synonymy are solved, unlike a conventional thesaurus that shows relations between terms. Therefore, using the thesaurus generated by the present invention can improve the accuracy of various natural language processing systems.
[0114]
Further, according to the recording medium on which the parallel thesaurus navigation program of the present invention is recorded, efficient text mining across multiple languages becomes possible. In particular, navigation in a native language thesaurus makes it easier to access information in a foreign language. The translation accuracy of a search request, which is a problem in conventional cross-language information retrieval, is also greatly improved by the concept set transition (translation) function.
[Brief description of the drawings]
FIG. 1 is a block diagram for explaining a configuration of a parallel thesaurus generation device according to an embodiment of the present invention and a parallel thesaurus navigation system that accommodates the device.
FIG. 2 is a diagram functionally explaining input / output and module configuration in the parallel thesaurus generator according to the embodiment of the present invention.
FIG. 3 is a diagram illustrating details of a Japanese thesaurus generation process.
FIG. 4 is a diagram illustrating details of processing of a Japanese / English thesaurus combining module.
FIG. 5 is a diagram for explaining the contents of a display screen of a client computer in the parallel thesaurus navigation system according to the embodiment of the present invention.
FIG. 6 is a flowchart illustrating processing of the parallel thesaurus navigation system according to the embodiment of the present invention.
FIG. 7 is a diagram for explaining in detail processing of concept set translation (step 480) characterizing the parallel thesaurus navigation system according to the embodiment of the present invention.
[Explanation of symbols]
1 Server computer
2 Client computer
3 Communication network
10 Japanese thesaurus generation
11 Processing device
12 Input device
13 Storage device
20 English thesaurus generation
30 Japanese-English thesaurus combination
51 Japanese Corpus
52 English Corpus
61 Japanese Concept Thesaurus
62 English Concept Thesaurus
63 Japanese-English concept combined data
64 English-Japanese concept combined data
71 Japanese Term Thesaurus
72 English Term Thesaurus
73 Japanese-English Bilingual Dictionary
74 English-Japanese Bilingual Dictionary
81 terms and frequency of occurrence
82 Co-occurrence term pairs and co-occurrence frequency
91 Japanese-English term combined data
92 English-Japanese term combined data
93 Japanese concept label data
94 English concept label data
95 Related term data for Japanese concepts
96 Related term data for English concepts
101 Term extraction
102 Co-occurrence data extraction
103 Correlation analysis
301 Japanese-English term combination
302 Japanese concept label generation
303 English concept label generation
304 Japanese-English concept combination
305 Thesaurus generation related to Japanese concepts
306 Thesaurus generation related to English concepts
307 Japanese concept merge
308 English concept merge
1010 Concept collection area
1020 Zoom in area
1021 Concept cluster
1022 Select button
1030 Zoom in button
1040 Translate button
1050 Clear button
1060 Exit button

Claims

A computer-readable recording medium recording a program for causing a computer to function as a parallel thesaurus generator, the computer having:
an input device for receiving input of text corpora in a first language and a second language;
a storage device for storing a first language bilingual dictionary in which a set of second language translation terms is recorded for each term of the first language, and a second language bilingual dictionary in which a set of first language translation terms is recorded for each term of the second language; and
a processing device for processing data,
wherein the program causes the processing device to function as:
First language thesaurus generation means for extracting terms of the first language from the text corpus of the first language input from the input device and counting their occurrence frequencies, extracting first language term pairs by window co-occurrence from the text corpus of the first language and counting their co-occurrence frequencies, calculating statistical correlation values for all term pairs, and generating in the storage device, based on the first language term pairs extracted as having a correlation value equal to or higher than a predetermined threshold, a first language term related thesaurus that is a set of first language related terms for each term of the first language;
Second language thesaurus generation means for extracting terms of the second language from the text corpus of the second language input from the input device and counting their occurrence frequencies, extracting second language term pairs by window co-occurrence from the text corpus of the second language and counting their co-occurrence frequencies, calculating statistical correlation values for all term pairs, and generating in the storage device, based on the second language term pairs extracted as having a correlation value equal to or higher than a predetermined threshold, a second language term related thesaurus that is a set of second language related terms for each term of the second language;
The term related to the first language of the term related thesaurus of the first language related to the term of the first language between the term of the first language and the term of the second language that are in the translation relationship in the bilingual dictionary of the first language. Domain relevance indicating the ratio of the second language bilingual term using the first language bilingual dictionary for the second language related term in the second language term related thesaurus with respect to the second language term And the second language parallel translation using the first language bilingual dictionary for the first language term with respect to the first language term. Generating first and second language combined data by combining a term and a second language bilingual term using the first language bilingual dictionary for the related term of the first language, and the second word The second language related term in the second language term related thesaurus related to the second language term between the second language term and the first language term in the bilingual dictionary. Calculating a domain relevance level indicating a ratio of a first language bilingual term using a bilingual bilingual dictionary that is a first language related term of the first language term related thesaurus with respect to a first language term; When the domain relevance is equal to or higher than a predetermined threshold, the first language bilingual term using the second language bilingual dictionary for the second language term and the second language term are used. Term combining means for generating second and first language combined data combining the first language parallel terms using the second language parallel translation dictionary for the two language related terms;
By combining the first language term of the first and second language combination data and the second language term of the first and second language combination data, the concept expressed by the combined second language term is common. A concept label of the first language that indicates the concept of the first language is generated, and each concept label of the first language that is created is combined with the concept label of the first language based on the term-related thesaurus of the second language. When a related term set in the second language is acquired, a similarity defined by the overlapping degree of the related term set is calculated between the concept labels in the first language, and if the similarity is equal to or higher than a predetermined threshold, By combining the second language terms combined with the first language concept label with the other first language concept label, the second language concept labels are integrated with each other. First generates a concept labels set corresponding to the language terms of a collection of concept labels set for each term in the first language is recorded consisting concept label small first language than a threshold similarity is predetermined While storing in the concept label data storage unit of the first language formed in the storage device, the relation of the second language of each concept label of the first language whose similarity obtained in the generation process of the concept label set is equal to or greater than a threshold value A related term set data obtained by removing each term of the second language of the first and second language combined data from the term set is generated, and a collection of related term set data for each term of the first language is recorded in the storage device. First language concept label generating means for storing in a related term data storage unit of the formed first language concept,
By combining the second language term of the second / first language combination data and the first language term of the second / first language combination data, the concept expressed by the combined first language term is common. A second language concept label that indicates the concept of the second language is generated, and each of the created second language concept labels is combined with the second language concept label based on the term-related thesaurus of the first language. When a related term set of the first language is acquired, a similarity defined by the overlapping degree of the related term set is calculated between the concept labels of the second language, and the similarity is equal to or greater than a predetermined threshold, By combining the second language concept labels by combining the first language term combined with the second language concept label with the other second language concept label. A concept label set corresponding to a second language term comprising second language concept labels whose similarity is smaller than a predetermined threshold is generated, and a set of concept label sets for each second language term is recorded. A related term in the first language of each concept label in the second language that is stored in the concept label data storage unit in the second language formed in the storage device, and whose similarity obtained in the process of generating the concept label set is greater than or equal to a threshold value A related term set data is generated by removing each term of the first language of the second and first language combined data from the set, and formed in the storage device in which a set of related term set data for each term of the second language is recorded. Second language concept label generating means for storing in the related term data storage unit of the second language concept,
For each concept label set stored in the concept label data storage unit of the first language, a concept label set of the second language corresponding to the second language term of the concept label of the first language is converted to the concept of the second language. First and second obtained by combining the concept label of the first language and the concept label of the second language of the acquired concept label set including the first language of the concept label of the first language acquired from the label data storage unit The concept combination data is generated and stored in the first and second concept combination data storage units formed in the storage device, and for each concept label set stored in the concept label data storage unit of the second language, The first language concept label set corresponding to the first language term of the two language concept labels is obtained from the first language concept label data storage unit, and the second language concept label and the second language concept label are obtained. Second of Generating the second and first concept combination data combining the concept labels of the first language of the acquired concept label set including the word, and storing them in the second and first concept combination data storage unit formed in the storage device Concept combination means,
First language concept label data stored in the first language concept label data storage unit, and first language related term set data stored in the first language concept related term data storage unit; Based on the above, the degree of correlation with the concept label in the element of the concept label set of each first language term corresponding to each first language term of the related term set of the first language concept label is maximum. By selecting the first language term as an element of the first language concept-related thesaurus, a first language concept-related thesaurus is generated and stored in the first language concept-related thesaurus storage unit formed in the storage device. A first language concept related thesaurus generation means;
Second language concept label data stored in the second language concept label data storage unit, second term related term set data stored in the second language concept related term data storage unit, and Based on the above, corresponding to each second language term of the related term set of the concept label of the second language, the correlation degree with the concept label in the element of the concept label set of the term of each second language is the maximum. By selecting the second language term as an element of the second language concept-related thesaurus, a second language concept-related thesaurus is generated and stored in the second language concept-related thesaurus storage unit formed in the storage device. Second language concept related thesaurus generation means,
A computer-readable recording medium having recorded thereon a program for functioning as a computer.

The above processing apparatus ,
The degree of similarity between the first language concept labels of the second and first concept combination data stored in the second and first concept combination data storage unit is stored in the concept related thesaurus storage unit of the first language. When the similarity is equal to or greater than a predetermined threshold value, the correspondence stored in the concept-related thesaurus storage unit of the first language The concept related thesauruses of the first language to be integrated are updated to one concept related thesaurus of the first language, and the correspondence stored in the first / second concept combined data storage unit in accordance with the update. First language concept merging means for updating the first and second concept combined data and the corresponding second and first concept combined data stored in the second and first concept combined data storage unit,
The similarity between the second language concept labels of the first and second concept combination data stored in the first and second concept combination data storage unit is stored in the concept related thesaurus storage unit of the second language. When the similarity is greater than or equal to a predetermined threshold, the correspondence stored in the concept-related thesaurus storage unit of the second language is calculated based on the overlapping degree of the concept labels of the corresponding second-language concept-related thesauruses. The concept related thesauruses of the second language to be integrated are updated to one concept related thesaurus of the second language, and the correspondence stored in the first / second concept combined data storage unit in accordance with the update. Second language concept merging means for updating the first and second concept combined data and the corresponding second and first concept combined data stored in the second and first concept combined data storage unit,
The computer-readable recording medium according to claim 1, wherein a program for functioning as a computer is recorded.