JP4066600B2 - Multilingual document search system

Info

Publication number: JP4066600B2
Application number: JP2000387960A
Authority: JP
Inventors: 博増市
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2000-12-20
Filing date: 2000-12-20
Publication date: 2008-03-26
Anticipated expiration: 2020-12-20
Also published as: US7047182B2; US20020123982A1; JP2002189745A

Description

【０００１】
【発明の属する技術分野】
本発明はインターネット上に構築されたディレクトリ構造（階層構造）中に格納されている文書の検索システムに関し、特に異なる言語に対して構築された複数のディレクトリ構造をまたがる検索を行うシステムに関する。
【０００２】
【従来の技術】
インターネット利用者の急増に伴い、インターネットの商業上の利用も拡大しつつある。ＷＷＷサーバー上に蓄積された多量の文書へのアクセスを容易にする方法の一つとして、ディレクトリ構造を定義し、適切なディレクトリに文書を格納するディレクトリサービスを挙げることができる。これは、ユーザが最上位のディレクトリから興味の対象に近いサブディレクトリを順に辿っていくことによって目的の文書に到達するという効果を狙ったものである。しかしながら、ユーザが常に最適なサブディレクトリを辿っていくことは不可能であり、全文検索等の検索技術を併用して目的の文書に到達する可能性を高めることがほとんどである。
【０００３】
【発明が解決しようとする課題】
ディレクトリサービスは、特定の国／言語でサービスが開始された後、そこで使用されたディレクトリ構造がほとんどそのままの形で他の複数の国／言語へと移され、各国で同様のディレクトリサービスが行われることが多い。しかしながら、各国で行われているディレクトリサービスはそれぞれ独立したものであり、検索を行った場合単一のディレクトリ構造内に存在する文書が検索できるにすぎず、他国／他言語のディレクトリ構造内の文書を検索結果として得ることはできない。特にインターネット販売サイトやオークションサイト等の商用目的のディレクトリサービスでは、他の国／言語の文書を適切に検索できることは重要であり、現状においては多くの潜在的ビジネスチャンスを失っていることになる。
【０００４】
本発明はこのような点に鑑みてなされたものであり、複数のディレクトリ構造をまたがった検索を高い精度で実現することができる多言語文書検索システムを提供することを目的とする。
【０００５】
【課題を解決するための手段】
これまで、言語の違いを超えて検索を行うために数多くの多言語情報検索手法が提案されてきた。例えば、「Ｄｅｅｒｗｅｓｔｅｒ，Ｓ．，Ｄｕｍａｉｓ，Ｓ．Ｔ．，Ｌａｎｄａｕｅｒ，Ｔ．Ｋ．，Ｆｕｒｎａｓ，Ｇ．Ｗ．ａｎｄＨａｒｓｈｍａｎ，Ｒ．Ａ．， ”Ｉｎｄｅｘｉｎｇｂｙｌａｔｅｎｔｓｅｍａｎｔｉｃａｎａｌｙｓｉｓ” ＪｏｕｒｎａｌｏｆｔｈｅＳｏｃｉｅｔｙｆｏｒＩｎｆｏｒｍａｔｉｏｎＳｃｉｅｎｃｅ，４１（６），３９１−４０７．」に詳細が記述されているＬａｔｅｎｔＳｅｍａｎｔｉｃＩｎｄｅｘｉｎｇと呼ばれる手法を翻訳テキストペアの集合（パラレルコーパス）へ適用することによって多言語情報検索を実現する方法が「Ｄｕｍａｉｓ，Ｓ．Ｔ．，Ｌａｎｄａｕｅｒ，Ｔ．Ｋ．ａｎｄＬｉｔｔｍａｎ，Ｍ．Ｌ．， ”Ａｕｔｏｍａｔｉｃｃｒｏｓｓ−ｌｉｎｇｕｉｓｔｉｃｉｎｆｏｒｍａｔｉｏｎｒｅｔｒｉｅｖａｌｕｓｉｎｇＬａｔｅｎｔＳｅｍａｎｔｉｃＩｎｄｅｘｉｎｇ” ＩｎｐｒｏｃｅｅｄｉｎｇｓｏｆＳＩＧＩＲ’９６ − ＷｏｒｋｓｈｏｐｏｎＣｒｏｓｓ−ＬｉｎｇｕｉｓｔｉｃＩｎｆｏｒｍａｔｉｏｎＲｅｔｒｉｅｖａｌ，ｐｐ．１６−２３，Ａｕｇｕｓｔ１９９６．」で提案されている。また、「ＭａｒｋＷ．ＤａｖｉｓａｎｄＴｅｄＥ．Ｄｕｎｎｉｎｇ， ”Ｑｕｅｒｙｔｒａｎｓｌａｔｉｏｎｕｓｉｎｇｅｖｏｌｕｔｉｏｎａｒｙｐｒｏｇｒａｍｍｉｎｇｆｏｒｍｕｌｔｉ−ｌｉｎｇｕａｌｉｎｆｏｒｍａｔｉｏｎｒｅｔｒｉｅｖａｌ”，ＩｎＰｒｏｃｅｅｄｉｎｇｓｏｆｔｈｅＦｏｕｒｔｈＡｎｎｕａｌＣｏｎｆｅｒｅｎｃｅｏｎＥｖｏｌｕｔｉｏｎａｒｙＰｒｏｇｒａｍｍｉｎｇ，Ｍａｒｃｈ１９９５．」で提案されている手法も多言語情報検索技術の代表例である。さらに「ＰｅｔｅｒＦ．Ｂｒｏｗｎ，ＳｔｅｐｈｅｎＡ．ＤｅｌｌａＰｉｅｔｒａ，ＶｉｎｃｅｎｔＪ．ＤｅｌｌａＰｉｅｔｒａ，ａｎｄＲｏｂｅｒｔＬ．Ｍｅｒｃｅｒ， ”ＴｈｅｍａｔｈｅｍａｔｉｃｓｏｆｓｔａｔｉｓｔｉｃａｌＭａｃｈｉｎｅＴｒａｎｓｌａｔｉｏｎ：Ｐａｒａｍｅｔｅｒｅｓｔｉｍａｔｉｏｎ”，ＣｏｍｐｕｔａｔｉｏｎａｌＬｉｎｇｕｉｓｔｉｃｓ，３２：２６３−３１１，１９９３．」で述べられているように、パラレルコーパスを利用することによって機械翻訳を実現し、機械翻訳によって第１の言語で書かれた検索要求文を第２の言語へと翻訳することによって第２の言語で書かれた文書を検索するという手法の研究も多く行われてきた。
【０００６】
しかしながら、現状においては、これらの多言語情報検索手法によって商用の実システムで使用するに充分な検索精度が実現できているとは言い難い。多言語情報検索の検索精度を低下させる最大の要因は、単語あるいはフレーズの意味曖昧性の問題である。一般に第１の言語のある単語（フレーズ）を第２の言語の単語（フレーズ）へと翻訳する際には、多くの翻訳候補が存在する。例えば、英語の「ｂａｓｅ」という単語は、軍事用語としては「基地」、野球用語としては「塁」、政治用語としては「支持母体」、数学用語としては「基数」、化学用語としては「塩基」、文法用語としては「期体」、建築用語としては「（塗料の）主成分」等、分野に依存して様々な翻訳候補が存在する。これらの翻訳候補は多くの場合分野依存であるため、多言語情報検索では、検索対象を特定の分野の文書集合に限れば高い精度が得られると言われている。
【０００７】
本発明では、異なる言語を対象として構築された２つのディレクトリ構造間の各ディレクトリの対応関係を利用する。ユーザからの検索要求と関連性の高いディレクトリを選択し、得られたディレクトリと対応関係にある他言語のディレクトリに属する文書集合だけを検索対象として多言語情報検索を行うことによって、検索対象となる文書集合の分野を限定することができ、精度の高い多言語文書検索を行うことが可能となる。
【０００８】
すなわち、図１に示すように、本発明の一側面によれば、上述の目的を達成するために、多言語検索システムに：第１の言語を対象として構築された第１のディレクトリ構造を保持する第１のディレクトリ保持手段１と；第２の言語を対象として構築された第２のディレクトリ構造を保持する第２のディレクトリ保持手段２と；第１のディレクトリ構造中の各ディレクトリと第２のディレクトリ構造中の各ディレクトリの対応関係を保持するディレクトリ関係保持手段３と；ユーザからの第１の言語による検索要求が第１のディレクトリ構造中のいずれのディレクトリと関連性が高いかを決定するディレクトリ検索手段４と；ディレクトリ検索手段によって決定されたディレクトリに対応する第２のディレクトリ構造中のディレクトリに属する文書群のうち、上記ユーザからの第１の言語による検索要求と関連性が高い文書を決定する多言語検索手段５とを設けるようにしている。
【０００９】
この構成においては、先に述べたように、ユーザからの検索要求と関連性の高いディレクトリを選択し、得られたディレクトリと対応関係にある他言語のディレクトリに属する文書集合だけを検索対象として多言語情報検索を行うので、検索対象となる文書集合の分野を限定することができ、精度の高い多言語文書検索を行うことが可能となる。
【００１０】
この構成において、第１のディレクトリが記憶されているサーバと第２のディレクトリが記憶されているサーバとが異なる場合には、第１のディレクトリが記憶されているサーバに、第２のディレクトリが記憶されているサーバと通信可能な通信手段が設けられ、この通信手段を介して多言語の検索が行われる。
【００１１】
なお、本発明の上述の一側面および他の側面は特許請求の範囲に記載されるとおりであり、以下詳細に説明される。
【００１２】
また、本発明はシステムや装置として実現できるだけでなく、方法の態様でも実現できることはもちろんであり、その一部をコンピュータプログラムとして実現できることももちろんである。
【００１３】
また、本発明は、検索サーバとして実現することも可能であり、また、本発明の一部をクライアント装置に実装するようにしても良いことはもちろんである。
【００１４】
【発明の実施の形態】
以下、実施例を用いて本発明を詳細に説明する。
［実施例１］
本発明を実施例１に基づいて具体的に説明する。本実施例は請求項４に対応するものである。図２を参照して本実施例に係る多言語文書検索システムの構成を説明する。なお、本実施例および後述する実施例２では日本語と英語を対象として説明を行うが、形態素解析処理（文を単語へと分割する処理）が適用可能な言語であればいかなる言語であっても同様の効果を得ることができる。
【００１５】
第１のディレクトリ保持手段１１および第２のディレクトリ保持手段１２は、それぞれ複数の日本語文書および複数の英語文書を格納するディレクトリ構造（第１のディレクトリ構造と第２のディレクトリ構造）を計算機内部に保持する手段である。両手段によって保持されるディレクトリ構造の例（オークションサイトの例）を図３に示す。各ディレクトリには、ディレクトリに格納されている文書の内容（分野）に従って、それぞれ固有の名前（識別子）が付与されている。また、最下層のディレクトリにのみ文書が格納されている。
【００１６】
ディレクトリ関係保持手段１３には、第１のディレクトリ保持手段１１に保持されている第１のディレクトリ構造中の各ディレクトリと第２のディレクトリ保持手段１２に保持されている第２のディレクトリ構造中の各ディレクトリの対応関係が保持されている。ここでの対応関係とは、２つのディレクトリ中の文書集合の分野が等しいことを意味するものである。ディレクトリ関係保持手段１３に保持される対応関係の例を図４に示す。本実施例では、第１のディレクトリ保持手段１１に保持されている第１のディレクトリ構造中の各ディレクトリと第２のディレクトリ保持手段１２に保持されている第２のディレクトリ構造中の各ディレクトリの対応関係が一対一に定義されており、また両ディレクトリの構造が完全に等しいものとする。しかし、対応関係が定義されていないディレクトリが一部存在する場合でも、対応関係が定義されているディレクトリについては全く同様の効果を得ることができる。
【００１７】
全ディレクトリ単語ベクトル生成手段１４は、第１のディレクトリ構造中に含まれる全ての日本語文書を学習データとして、そこに含まれる全ての日本語単語の各々に対して、対応する多次元ベクトル（単語ベクトル）を計算する手段である。以下、単語ベクトルを計算するアルゴリズムを説明する。
【００１８】
＜ステップＳ１＞：第１のディレクトリ構造中に含まれる全ての日本語文書に対して形態素解析処理を施す。
＜ステップＳ２＞：ステップＳ１で得られた全日本語単語のうち、第１のディレクトリ構造中に含まれる全ての日本語文書中で出現頻度の多いものから順にｎ個の単語を選択する。ここで得られたｎ個の単語のことを特徴表現語と呼ぶことにする。ｎの値は数千のオーダーとする。
＜ステップＳ３＞：行と列がそれぞれ、ステップＳ１で得られた全日本語単語、および特徴表現語に対応する行列を作成する。ステップＳ１で得られた全日本語単語の総異なり語数が１０万であり、ｎの値を３，０００とした場合、１０万行×３，０００列の行列ができることになる。この行列の各要素には、その要素の行に対応する単語と列に対応する特徴表現語が、第１のディレクトリ構造中に含まれる全ての日本語文書中で何度共起したかを記録する。例えば、単語ａと特徴表現語ｂが３０の文書の中で共起している（同時に出現している）場合、対応する行列要素に３０と記録することになる。こうして得られた行列のことを共起行列と呼ぶことにする。このようにして、日本語文書中に含まれる全ての日本語単語に対してｎ次元のベクトルを定義することができる。このベクトルは、各日本語単語がどのようなコンテキストで出現しやすい傾向にあるかを示すベクトルであるといえる。
＜ステップＳ４＞：ステップＳ３で得られたｎ次元のベクトルは次元数が大きいため、後に必要となる処理で計算時間が膨大なものになってしまう。そこで、計算処理を実時間の範囲に抑えるために、元のｎ次元のベクトルを行列の次元圧縮手法によって、ｎ’次元（数百次元）のベクトルへと圧縮する。次元圧縮手法には様々なものが存在するが、「Ｂｅｒｒｙ，Ｍ．，Ｄｏ，Ｔ．，Ｏ’Ｂｒｉｅｎ，Ｇ．，Ｋｒｉｓｈｎａ，Ｖ．ａｎｄＶａｒａｄｈａｎ，Ｓ．（１９９３）． ”ＳＶＤＰＡＣＫＣＵＳＥＲ’ＳＧＵＩＤＥ”．Ｔｅｃｈ．Ｒｅｐ．ＣＳ−９３−１９４．ＵｎｉｖｅｒｓｉｔｙｏｆＴｅｎｎｅｓｓｅｅ，Ｋｎｏｘｖｉｌｌｅ，ＴＮ．」で詳細な説明がなされているＳｉｎｇｕｌａｒＶａｌｕｅＤｅｃｏｍｐｏｓｉｔｉｏｎがその代表例である。このようにして全ての日本語単語に対して得られたｎ’次元のベクトルを単語ベクトルと呼ぶことにする。
【００１９】
全ディレクトリ単語ベクトル保持手段１５は、全ディレクトリ単語ベクトル生成手段１４で計算された全日本語単語に対応する単語ベクトルを計算機内部に保持する手段である。
【００２０】
ディレクトリベクトル生成手段１６は、第１のディレクトリ構造中の各ディレクトリに対応するディレクトリベクトルを計算する手段である。以下、ディレクトリベクトルを計算するアルゴリズムを説明する。
【００２１】
＜ステップＳ１１＞：第１のディレクトリ構造中に含まれる全日本語文書の各々に対応する文書ベクトルを計算する。ここで、文書ベクトルとは、その文書中に含まれる全単語に対応する単語ベクトルの総和を正規化した（ベクトルの長さを１とした）ベクトルであるとする。
＜ステップＳ１２＞：最下層に位置する各ディレクトリのディレクトリベクトルを計算する。ここで、最下層に位置するディレクトリのディレクトリベクトルとは、そのディレクトリ中に含まれる全文書に対応する文書ベクトルの総和を正規化したベクトルであるとする。
＜ステップＳ１３＞：最下層に位置しないディレクトリであって、ディレクトリ中に含まれる全ディレクトリに対応するディレクトリベクトルが既に計算されているディレクトリを一つ見つけ、ディレクトリベクトルを計算する。ただし、最下層に位置しないディレクトリのディレクトリベクトルは、そのディレクトリ中に含まれる全ディレクトリに対応するディレクトリベクトルの総和を正規化したベクトルであるとする。
＜ステップＳ１４＞：全てのディレクトリについてディレクトリベクトルが計算されるまで、ステップＳ１３を繰り返す。
【００２２】
ディレクトリベクトル保持手段１７は、ディレクトリベクトル生成手段１６で計算された全ディレクトリに対応するディレクトリベクトルを計算機内部に保持する手段である。
【００２３】
学習データ保持手段１８は、第１のディレクトリ保持手段１１に保持されている第１のディレクトリ構造（あるいは第２のディレクトリ保持手段１２に保持されている第２のディレクトリ構造）中のディレクトリのうち、最下層に位置するディレクトリのそれぞれに対して、ディレクトリに含まれる文書の内容に関係する（文書の分野に属する）日英の翻訳対の集合（日英のパラレルコーパス）を学習データとして保持する手段である。学習データ保持手段１８が保持する学習データの例を図５に示す。
【００２４】
ディレクトリ毎単語ベクトル生成手段１９は、学習データ保持手段１８に保持されている日英のパラレルコーパスを学習データとして、第１のディレクトリ構造中の各ディレクトリの意味内容（第２のディレクトリ構造中の各ディレクトリの意味内容）に特化した単語ベクトル集合をそれぞれ計算する手段である。以下、任意の一つのディレクトリ（ディレクトリＡ）に対応する単語ベクトル集合を計算するアルゴリズムを説明する。
【００２５】
＜ステップＳ２１＞：ディレクトリＡに含まれる全ての最下層ディレクトリ（ディレクトリＡが最下層ディレクトリであればディレクトリＡそのもの）に対応して、学習データ保持手段１８に保持されている全ての日英パラレルコーパスをまとめて学習データとみなし、学習データ中に含まれる全ての日本語文書および英語文書に対して形態素解析処理を施す。図５の例において、ディレクトリＡが「骨董品（Ａｎｔｉｑｕｅｓ）」のディレクトリであれば、パラレルコーパス１−４をまとめて学習データとすることになる。
＜ステップＳ２２＞：ステップＳ２１で得られた全日本語単語および全英語単語のうち、学習データ中で出現頻度の多いものから順にｎ個の単語を選択する。ここで得られたｎ個の単語のことをステップＳ２と同様に特徴表現語と呼ぶことにする。ただし、この場合特徴表現語には日本語単語と英語単語が混在することになる。ｎの値は、ステップＳ２と同様、数千のオーダーとする。
＜ステップＳ２３＞：行と列がそれぞれ、ステップＳ２１で得られた全ての日本語／英語単語、および特徴表現語に対応する共起行列を作成する。この行列の各要素には、その要素の行に対応する単語と列に対応する特徴表現語が、学習データ中に含まれる全ての日英翻訳対中で何度共起したかを記録する。すなわち、日英の翻訳対を一つの文書であるとみなして共起の回数をカウントする。このようにして、全日本語単語と全英語単語をｎ次元のベクトルで表現する共起行列を生成することができる。このベクトルは、ディレクトリＡの意味内容（分野）に即した、各単語の出現傾向を示すベクトルであるといえる。
＜ステップＳ２４＞：ステップＳ２３で得られたｎ次元のベクトルを、ステップＳ４と同様に、行列の次元圧縮手法によって、ｎ’次元（数百次元）のベクトルへと圧縮する。このようにして全ての日本語／英語単語に対して同じベクトル空間上で比較可能なｎ’次元の単語ベクトルが得られることになる。
【００２６】
上記のアルゴリズムによる計算を、第１のディレクトリ構造中の全ディレクトリ（すなわち第２のディレクトリ構造中の全ディレクトリ）に対して適用することによって、ディレクトリ構造中の各ディレクトリの意味内容に特化した単語ベクトル集合をそれぞれ計算することができる。
【００２７】
ディレクトリ毎単語ベクトル保持手段１１０は、ディレクトリ毎単語ベクトル生成手段１９で計算された単語ベクトル集合をディレクトリ毎に保持する手段である。
【００２８】
文書ベクトル生成手段１１１は、第２のディレクトリ保持手段１２が保持する第２のディレクトリ構造中の全ディレクトリの各々に対して、ディレクトリに属する各英語文書の文書ベクトルを計算する手段である。任意のディレクトリＡに対して、ディレクトリＡに属する各英語文書の文書ベクトルを計算する際に、ディレクトリ毎単語ベクトル保持手段１１０中にディレクトリＡに対応して保持されている単語ベクトル集合を用いる。ここで、各英語文書の文書ベクトルは、文書中に含まれる全英単語に対応する単語ベクトルの総和を正規化したベクトルであるとして計算を行う。このようにして、第２のディレクトリ構造中の各ディレクトリに対して、それぞれその意味内容（分野）に特化した文書ベクトル集合を計算することができる。
【００２９】
文書ベクトル保持手段１１２は、文書ベクトル生成手段１１１で計算された文書ベクトル集合を第２のディレクトリ構造中の各ディレクトリ毎に保持する手段である。
【００３０】
検索要求取得手段１１３は、ユーザから日本語の文章による検索要求を受け取ることができるユーザインタフェースを持つ手段である。受け取った検索要求には形態素解析処理が施され日本語単語へと分割される。
【００３１】
全ディレクトリ検索要求ベクトル生成手段１１４は、検索要求取得手段１１３によって受け取られたユーザからの検索要求に対応する検索要求ベクトルを計算する手段である。全ディレクトリ単語ベクトル保持手段１５に保持されている単語ベクトル集合を用い、検索要求文章中に含まれる全日本語単語に対応する単語ベクトルの総和を正規化したベクトルを検索要求ベクトルとする。
【００３２】
ディレクトリ検索手段１１５は、検索要求取得手段１１３によって受け取られたユーザからの検索要求が、第１のディレクトリ構造中のいずれのディレクトリと最も関連性が高いかを決定する手段である。この決定を行うために、ディレクトリ検索手段１１５は、全ディレクトリ検索要求ベクトル生成手段１１４によって計算された検索要求ベクトルと、ディレクトリベクトル保持手段１７中に保持されている各ディレクトリベクトルの関連度を計算し、最も関連度の高いディレクトリベクトルを持つディレクトリを選択する。関連度の定義としては、ベクトル間の内積（コサイン値）を使用する。したがって、関連度は０と１の間の実数であり、２つのベクトル間の角度が小さいほど１に近づくことになる。
【００３３】
ディレクトリ毎検索要求ベクトル生成手段１１６は、ディレクトリ検索手段１１５によって計算された検索要求と最も関連度の高いディレクトリの分野に特化した検索要求ベクトルを計算する手段である。まず、ディレクトリ検索手段１１５から得られた第１のディレクトリ構造中のディレクトリに対応する第２のディレクトリ構造中のディレクトリを、ディレクトリ関係保持手段１３を参照することによって決定する。次に、そのディレクトリに対応する単語ベクトル集合をディレクトリ毎単語ベクトル保持手段１１０から得る。得られた単語ベクトル集合を用いて、検索要求文章中に含まれる全ての日本語単語に対応する単語ベクトルの総和を正規化したベクトルを計算し、新たな検索要求ベクトルとする。
【００３４】
多言語検索手段１１７は、ディレクトリ毎検索要求ベクトル生成手段１１６によって計算された検索要求ベクトルと、ディレクトリ検索手段１１５で決定されたディレクトリに対応して文書ベクトル保持手段１１２に保持されている各文書ベクトルとの間の関連度を計算する。関連度の定義はディレクトリ検索手段１１５での定義と同様である。検索要求ベクトルは日本語文章に対するベクトルであり、文書ベクトル保持手段１１２に保持されている各文書ベクトルは英語文書に対するベクトルであるが、どちらのベクトルも、日本語単語と英語単語を同一のベクトル空間上に表現したディレクトリ毎単語ベクトル保持手段１１０中のベクトルの和として計算されたベクトルであるため、比較可能である。
【００３５】
検索結果表示手段１１８は、多言語検索手段１１７によって計算された検索要求ベクトルと各文書ベクトルとの間の関連度を参照し、検索要求ベクトルと関連度の高い（ベクトルの内積が大きい）文書ベクトルに対応する文書を、検索結果としてユーザに提示する。
【００３６】
なお、本実施例では、ディレクトリ検索手段１１５によってユーザからの検索要求と関連性の高いディレクトリを自動的に決定するものとしたが、関連性の高いディレクトリをユーザがディレクトリ構造を辿ることによって人手で決定するものとしても構わない。
【００３７】
以上の構成によって得られる多言語文書検索装置では、日本語文章による検索要求に対して関連する英語文書を検索結果として得ることができ、上述した問題点（発明の課題）を解決することができる。
【００３８】
また、日本語文書を対象とする第１のディレクトリ構造と英語文書を対象とする第２のディレクトリ構造の間の対応関係を利用することによって、（１）検索要求と関連性の高い分野の英語文書のみを検索対象とすることができ、さらに、（２）検索要求と関連性の高い分野の学習データを用いて検索を行うことができる。この２点の分野限定効果によって、従来の多言語情報検索の精度を低下させる原因であった、単語の意味曖昧性（意味の分野依存性）の問題を回避することが可能となり、多言語文書検索の検索精度を飛躍的に向上させることができる。
【００３９】
本実施例では学習データとして、各最下層ディレクトリに対してパラレルコーパスを用意するものとしたが、請求項１の構成のように分野に特化した学習データを用いずに多言語文書検索を行った場合でも、上記（１）の効果が得られるため、従来の多言語文書検索に比べて高い精度の検索を行うことが可能である。
【００４０】
さらに、請求項１の構成のように分野に特化した学習データを用いずに多言語文書検索を行う場合であっても、ディレクトリ関係保持手段によって対応関係が保持されている第１のディレクトリ保持手段中のディレクトリと第２のディレクトリ保持手段中のディレクトリのペア中に含まれる文書集合（以下文書集合Ｄと呼ぶ）を用いて、分野に特化した多言語情報検索を行うことが可能である。以下にその方法について述べる。
【００４１】
構成は図２に示した上記の構成と同じであるとする。ただし、分野ごと（最下層ディレクトリごと）の学習データは持たないため、学習データ保持手段１８は、ディレクトリ毎単語ベクトル生成手段１９が各ディレクトリに対応する単語ベクトル集合を作成する際に共通に用いる日英のパラレルコーパスを１セットだけ保持するものとする。
【００４２】
したがって、ディレクトリ毎単語ベクトル生成手段１９が各ディレクトリに対応する単語ベクトル集合を作成する際、ステップＳ２１では学習データとして常に上記共通のパラレルコーパスを用いる。また、ステップＳ２３中で作成する共起行列の各要素を単語と特徴表現語の共起回数とする替わりに、式１で定義されるχ^２ _ｕを用いた重み付き共起回数とする。式１で定義されるχ^２ _ｕは、単語ｗ_ｕに対する重み（ディレクトリＡの分野での単語ｗ_ｕの重要度）であり、上記重み付き共起回数とは、単語ｗ_ｕ１と特徴表現語ｗ_ｕ２の共起回数に対してχ^２ _ｕ１とχ^２ _ｕ２とを乗じた値であるとする。上記のχ^２ _ｕは一般にχ^２検定と呼ばれる手法で用いられる値であり、集合全体とその部分集合中で異なる出現傾向を示す要素に対して高い値となる性質を持つものである。
【数１】

このようにして得られたディレクトリごとの単語ベクトル集合は、そのディレクトリの分野に特化した単語ベクトル集合となる。したがって、請求項１の構成のように分野に特化した学習データを用いずに多言語文書検索を行った場合でも、上記（１）の効果に加えて（２）の効果を得ることも可能であり、従来の多言語文書検索に比べて高い精度の検索を行うことが可能である。
【００４３】
なお、本実施例で利用した多言語文書検索手法の詳細は、文献「ＨｉｒｏｓｈｉＭａｓｕｉｃｈｉ，ＲａｙｍｏｎｄＦｌｏｕｒｎｏｙ，ＳｔｅｆａｎＫａｕｆｍａｎｎａｎｄＳｔａｎｌｅｙＰｅｔｅｒｓ， ”ＱｕｅｒｙＴｒａｎｓｌａｔｉｏｎＭｅｔｈｏｄｆｏｒＣｒｏｓｓＬａｎｇｕａｇｅＩｎｆｏｒｍａｔｉｏｎＲｅｔｒｉｅｖａｌ”，ＴｈｅＰｒｏｃｅｅｄｉｎｇｓｏｆＭａｃｈｉｎｅＴｒａｎｓｌａｔｉｏｎＳｕｍｍｉｔＶＩＩ ’９９ＷｏｒｋｓｈｏｐｏｎＭａｃｈｉｎｅＴｒａｎｓｌａｔｉｏｎｆｏｒＣｒｏｓｓＬａｎｇｕａｇｅＩｎｆｏｒｍａｔｉｏｎＲｅｔｒｉｅｖａｌ，（１９９９）」に記述されている。
【００４４】
［実施例２］
以下は、本発明の実施例２の説明である。本実施例は請求項７に対応するものである。本実施例は、実施例１と比較して学習データ保持手段１８の構成のみが異なる。したがって以下の説明では、学習データ保持手段１８に係わる部分についての説明だけを行うものとする。図６は、図２中の学習データ保持手段１８に対応する範囲の本実施例の構成を示す図である。その他の構成要素は図２と同じである。
【００４５】
第１のディレクトリ保持手段１１、第２のディレクトリ保持手段１２およびディレクトリ関係保持手段１３は図２中の各手段と同等の機能を持つ手段である。ただし、本実施例では、第１のディレクトリ構造および第２のディレクトリ構造中に格納されている文書はＷｅｂ文書であり、第１のディレクトリ構造中には日本語で書かれた文書が主に格納されているが英語で書かれた文書も格納されているものとし、同様に、第２のディレクトリ構造中には英語で書かれた文書が主に格納されているが日本語で書かれた文書も格納されているものとする。しかしながら、第１のディレクトリ構造中の全文書を形態素解析して得られる英語単語については日本語単語と同等の扱いをし、第２のディレクトリ構造中の全文書を形態素解析して得られる日本語単語については英語単語と同等の扱いをすることにより、実施例１で説明を行った各手段のアルゴリズムを全く変更せずに処理を行うことが可能である。
【００４６】
以下の２１−２６の各手段の説明は、第１のディレクトリ構造中の任意の最下層ディレクトリＡおよびディレクトリＡに対応する第２のディレクトリ構造中の最下層ディレクトリＡ’を対象とした場合のものである。したがって、ディレクトリ構造中の全最下層ディレクトリを対象として、それぞれ同じ処理を繰り返す必要がある。
【００４７】
ペアテキスト抽出手段２１は、第１のディレクトリ構造中の最下層ディレクトリＡと、それに対応する第２のディレクトリ構造中の最下層ディレクトリＡ’に属する全てのＷｅｂ文書の中から、日英で対訳となっているＷｅｂ文書の対訳テキストペアを、既存の文書収集ロボット等の技術を用いて抽出する手段である。
【００４８】
ペアテキスト保持手段２２は、ペアテキスト抽出手段２１によって得られた日英の対訳テキストペアの集合と、文書ペア抽出手段２５によって得られた日英の文書ペアを計算機内部に保持する手段である。また、予め設定された一定数以上の日英ペア（対訳テキストペア＋日英文書ペア）が手段内に保持されると、その日英ペア集合を学習データ保持手段へ渡す。
【００４９】
単語ベクトル生成手段２３は、ペアテキスト保持手段２２中に保持されている日英ペアを学習データとして、実施例１中のディレクトリ毎単語ベクトル生成手段１９と同等のアルゴリズムを用いることにより、単語ベクトルを計算する手段である。
【００５０】
文書ベクトル生成手段２４は、単語ベクトル生成手段２３から得られた単語ベクトル集合を用いることにより、ディレクトリＡおよびディレクトリＡ’に属する全ての文書に対応する文書ベクトルを計算する手段である。文書ベクトルは、文書中に含まれる全ての日本語／英語単語に対応する単語ベクトルの総和を正規化することによって計算する。
文書ペア抽出手段２５はまず、文書ベクトル生成手段２４から得られる文書ベクトルを参照することにより、以下の条件を満たす日本語文書と英語文書のペアを、ディレクトリＡおよびディレクトリＡ’に属する全ての文書集合から抽出する。
【００５１】
「ペア中の日本語文書に対応する文書ベクトルと最も関連度の高い（内積の値が大きい）英語文書ベクトルがペア中の英語文書ベクトルであり、逆にペア中の英語文書ベクトルと最も関連度の高い日本語文書ベクトルがペア中の日本語文書ベクトルである。」
次に、上記の条件を満たす日英文書ペアのうち、ペア中の日英文書に対応する日英文書ベクトルの間の内積の値が予め設定された閾値よりも大きいペアを抽出する。このようにして得られた日英の文書ペアは、意味内容が極めて近いものであり、学習データとして使用することができるものとなる。得られたペアは、ペアテキスト抽出手段２１によって得られた日英の対訳テキストペアの集合と共に、ペアテキスト保持手段２２に保持される。
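The pair-extraction condition of the document pair extraction means 25 (each document is the other's most related counterpart, and their inner product exceeds a preset threshold) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `extract_pairs`, the toy vectors, and the threshold values are assumptions, and the document vectors are assumed to be already normalized to (approximately) unit length so that the inner product acts as the relevance score.

```python
import numpy as np


def extract_pairs(ja_vecs, en_vecs, threshold):
    """Return (japanese_id, english_id) pairs that are mutual best matches.

    ja_vecs / en_vecs: dict mapping a document id to its document vector.
    A pair is kept only if its inner product is larger than `threshold`.
    """
    ja_ids = list(ja_vecs)
    en_ids = list(en_vecs)
    ja = np.array([ja_vecs[i] for i in ja_ids], dtype=float)
    en = np.array([en_vecs[i] for i in en_ids], dtype=float)

    sim = ja @ en.T                 # sim[j, e] = inner product of the two vectors
    best_en = sim.argmax(axis=1)    # most related English document per Japanese document
    best_ja = sim.argmax(axis=0)    # most related Japanese document per English document

    pairs = []
    for j, e in enumerate(best_en):
        if best_ja[e] == j and sim[j, e] > threshold:
            pairs.append((ja_ids[j], en_ids[e]))
    return pairs


# Hypothetical toy vectors for documents in directory A and directory A'.
ja_docs = {"j1": [1.0, 0.0], "j2": [0.0, 1.0], "j3": [0.6, 0.8]}
en_docs = {"e1": [0.98, 0.199], "e2": [0.0, 1.0]}

loose = extract_pairs(ja_docs, en_docs, threshold=0.9)
strict = extract_pairs(ja_docs, en_docs, threshold=0.99)
```

In this toy data, `j3` is closest to `e2`, but `e2` is closest to `j2`, so `j3` forms no pair. Raising the threshold then drops the weaker of the two remaining pairs.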
【００５２】
学習データ保持手段２６は、ペアテキスト保持手段から渡された日英ペア集合を計算機内部に保持する手段である。
【００５３】
このような構成をとり、
（１）ペアテキスト保持手段２２に保持された日英ペア集合を学習データとして、単語ベクトル生成手段２３によって単語ベクトル集合を生成し、
（２）文書ベクトル生成手段２４によって文書ベクトル集合を生成し、
（３）文書ペア抽出手段２５によって意味内容が極めて近い日英の文書ペアを抽出し、
（４）得られた文書ペアを、ペアテキスト保持手段２２に追加する（既に追加されている場合は以前のものと置き換える）。
という処理を繰り返し行うことにより、ペアテキスト保持手段２２に保持される日英ペア集合の数を徐々に増やすことが可能となる。このような繰り返し手法を用いることによって、ペアテキスト抽出手段２１から得られるペアテキストの数が少ない場合でも、実用上十分なサイズの学習データを得ることができることになる。このような繰り返し手法については、「ＨｉｒｏｓｈｉＭａｓｕｉｃｈｉ，ＲａｙｍｏｎｄＦｌｏｕｒｎｏｙ，ＳｔｅｆａｎＫａｕｆｍａｎｎａｎｄＳｔａｎｌｅｙＰｅｔｅｒｓ， ”ＡＢｏｏｔｓｔｒａｐｐｉｎｇｍｅｔｈｏｄｆｏｒＥｘｔｒａｃｔｉｎｇＢｉｌｉｎｇｕａｌＴｅｘｔＰａｉｒｓ”，ＴｈｅＰｒｏｃｅｅｄｉｎｇｓｏｆＴｈｅ１８ｔｈＩｎｔｅｒｎａｔｉｏｎａｌＣｏｎｆｅｒｅｎｃｅｏｎＣｏｍｐｕｔａｔｉｏｎａｌＬｉｎｇｕｉｓｔｉｃｓ，ｐｐ．１０６６−１０７０（２０００）」に詳細が記述されている。この繰り返し手法は、ペアを抽出する元となる文書集合の分野が限定されている時のみ有効な手法である。本実施例では、第１のディレクトリ構造と第２のディレクトリ構造の対応関係を利用して文書集合の分野を限定することにより、繰り返し手法を適用することが可能となっている。
【００５４】
このようにして学習データが得られた後の処理は、実施例１の処理と全く同じである。実施例１の例では、各最下層ディレクトリに対して学習データを予め用意しておく必要があったが、本実施例の構成によって得られる多言語文書検索装置では、Ｗｅｂ文書の中から日英で対訳となっているＷｅｂ文書の対訳テキストペアを初期学習データとして用い、さらにそれを上記の繰り返し手法によって成長させることによって多言語文書検索に必要な学習データを自動生成することが可能となる。
【００５５】
こうして得られた学習データ（２ヶ国語文書ペア）は、パラレルコーパスとして利用することができるものである。上記文献「ＨｉｒｏｓｈｉＭａｓｕｉｃｈｉ，ＲａｙｍｏｎｄＦｌｏｕｒｎｏｙ，ＳｔｅｆａｎＫａｕｆｍａｎｎａｎｄＳｔａｎｌｅｙＰｅｔｅｒｓ， ”ＡＢｏｏｔｓｔｒａｐｐｉｎｇｍｅｔｈｏｄｆｏｒＥｘｔｒａｃｔｉｎｇＢｉｌｉｎｇｕａｌＴｅｘｔＰａｉｒｓ”，ＴｈｅＰｒｏｃｅｅｄｉｎｇｓｏｆＴｈｅ１８ｔｈＩｎｔｅｒｎａｔｉｏｎａｌＣｏｎｆｅｒｅｎｃｅｏｎＣｏｍｐｕｔａｔｉｏｎａｌＬｉｎｇｕｉｓｔｉｃｓ，ｐｐ．１０６６−１０７０（２０００）」でも述べられている通り、パラレルコーパスは多言語情報検索システムあるいは機械翻訳システムを実現する上で貴重な言語資源であるにもかかわらず、不足しているのが現状である。本実施例で説明した、２つのディレクトリ構造間の対応関係を利用することによって可能となる分野毎の学習データ生成手法は、パラレルコーパスの不足の問題を解決するための極めて有効な手法であると言える。
【００５６】
なお、実施例１および実施例２共に、最下層ディレクトリにのみ文書が格納されている例を用いて説明を行ったが、最下層以外のディレクトリに文書が格納されている場合であっても、文書に対応する文書ベクトルをディレクトリベクトルと同等に扱うことによって、全く同じ処理を行うことが可能である。さらに、実施例１および実施例２共にディレクトリ構造を木構造として説明を行ったが、各ディレクトリが複数の親ディレクトリを持つようなネットワーク型のディレクトリ構造であっても、同様の処理を行うことができることは明らかである。
【００５７】
また、請求項２、３、５、６、８、９では多言語文書検索手法を行う代わりに、検索要求あるいは文書を翻訳しておくことによって異なる言語の間の検索を可能とするものである。パラレルコーパスを学習データとして機械翻訳システムを実現する例として、前述の文献「ＰｅｔｅｒＦ．Ｂｒｏｗｎ，ＳｔｅｐｈｅｎＡ．ＤｅｌｌａＰｉｅｔｒａ，ＶｉｎｃｅｎｔＪ．ＤｅｌｌａＰｉｅｔｒａ，ａｎｄＲｏｂｅｒｔＬ．Ｍｅｒｃｅｒ， ”ＴｈｅｍａｔｈｅｍａｔｉｃｓｏｆｓｔａｔｉｓｔｉｃａｌＭａｃｈｉｎｅＴｒａｎｓｌａｔｉｏｎ：Ｐａｒａｍｅｔｅｒｅｓｔｉｍａｔｉｏｎ”，ＣｏｍｐｕｔａｔｉｏｎａｌＬｉｎｇｕｉｓｔｉｃｓ，３２：２６３−３１１，１９９３．」を挙げることができる。
【００５８】
また、多言語検索手段が、直接多言語文書検索を行うのではなく、２ヶ国語文書ペアを予め抽出しておいてもよい。この抽出手法としては、実施例２で説明した学習データ生成手法をそのまま使用することが可能である。
【００５９】
以下に具体的な例を用いて上述実施例の効果を確認する。インターネット上の検索サイト、販売サイトあるいはオークションサイトにおいて、ユーザが「自然公園巡りを目的とするバスのフリーパス」という日本語の検索質問文を用いることにより英語で書かれた情報にアクセスし、バスのフリーパスについての情報を得る／フリーパスの購入を行う状況を考える。この場合、典型的な多言語検索システムでは、まず上記質問文から「自然」「公園」「巡り」「目的」「バス」「フリー」「パス」のキーワードを抽出し、各日本語キーワードを、日英翻訳辞書を用いて、対応する英語キーワードに置き換える。対応する英語キーワードの例を図７に示す。各日本語キーワードに対応する英語キーワードは複数存在し、さらに、それぞれの英語キーワードの英語文中における意味も複数の可能性が考えられる。日本語キーワード「パス」に対応する英語キーワードには、「ｐａｓｓ」「ｐａｓｓｉｎｇ」「ｃａｌｉｐｅｒｓ（パス（ノギス）［機械用語］）」「ＰＡＳ（パス（パラアミノサリチル酸、ｐａｒａ−ａｍｉｎｏｓａｌｉｃｙｌｉｃａｃｉｄ）［化学用語］）」「ｐａｔｈ」等があり、さらに、例えば「ｐａｓｓ」は英語文中で「（球技の）パス」「定期」「関門」「通過する」「合格する」等様々な意味で用いられる。したがって、これらの英語キーワードを用いて関連する英語文書を検索した場合、
▲１▼「フリーキック（ｆｒｅｅｋｉｃｋ）、ゴール（ｇｏａｌ）、パス（ｐａｓｓ）」等を重要キーワードとして含むサッカーに関する文書
▲２▼「ホームランを打つ（ｓｑｕａｒｅ）、野球場（ｂａｌｌｐａｒｋ）、チケット（ｔｉｃｋｅｔ）」等を重要キーワードとして含む野球に関する文書
▲３▼「パス（パラアミノサリチル酸）（ＰＡＳ）、遊離酸（ｆｒｅｅａｃｉｄ）」等を重要キーワードとして含む化学に関する文書
▲４▼「コンピュータバス（ｃｏｍｐｕｔｅｒｂｕｓ）、フリーアクセス（ｆｒｅｅａｃｃｅｓｓ）、パス解析（ｐａｔｈａｎａｌｙｓｉｓ）、回路（ｃｉｒｃｕｉｔ）」等を重要キーワードとして含むコンピュータに関する文書
▲５▼「バスフィッシングツアー（ｂａｓｓｆｉｓｈｉｎｇｔｏｕｒ）」に関する文書
▲６▼「試供品のパス（ノギス）（ｆｒｅｅｃａｌｉｐｅｒｓ）」に関する文書等、検索意図に反する検索結果が数多く得られてしまう結果となる。この状況を、ベクトル空間法に基づくシステム構成の例に基づいて表した模式図が図８である。単語の意味曖昧性に由来して、日本語検索質問を対応する英語単語で置き換えて得られる英語ベクトルと距離の近い英語文書ベクトルは様々な分野にわたって存在し、得られる検索結果の精度は極端に低いものとなってしまう。このように、多言語の情報検索を高精度で実現することは、単言語の情報検索と比べて極めて困難である。
【００６０】
上述実施例による多言語情報検索システムでは、まず日本語のみを対象として検索質問文と最も関連性の高いディレクトリを検索する（図９参照）。この場合、日英の両言語にまたがることによって生じる単語の意味曖昧性を考慮する必要がないため、高い精度で関連ディレクトリを得ることができる。（上記質問文「自然公園巡りを目的とするバスのフリーパス」が旅行分野の検索要求であることを、日本語のみを対象として決定することは容易である。）その後、得られた日本語ディレクトリに対応する英語ディレクトリのみを対象として英語文書を検索することによって、検索意図に反する検索結果を除外することが可能となる。さらに、実施の形態で説明を行った通り、検索要求と最も関連するディレクトリに対応する学習データを用いて多言語情報検索を行うことによって、より精度の高い多言語検索を行うことが可能である。
【００６１】
以上のように本発明によれば、第１の言語の文章による検索要求に対して適切な第２の言語の文書を検索結果として得ることができ、上述の問題点を解決することができる。
【００６２】
すなわち、第１の言語の文書を対象とする第１のディレクトリ構造と第２の言語の文書を対象とする第２のディレクトリ構造の間の対応関係を利用することによって、（１）検索要求と関連性の高い分野に属する第２の言語の文書のみを検索対象とすることができ、さらに、（２）検索要求と関連性の高い分野の学習データを用いて検索を行うことができる。この２点の分野限定効果によって、従来の多言語情報検索の精度を低下させる原因であった、単語の意味曖昧性（意味の分野依存性）の問題を回避することが可能となり、多言語文書検索の検索精度を飛躍的に向上させることができる。
【００６３】
さらに、第１のディレクトリ構造と第２のディレクトリ構造の間の対応関係を利用することによって、多言語文書検索の学習データを自動生成することも可能となる。
【００６４】
【発明の効果】
以上のように本発明によれば、第１の言語の文章による検索要求に対して適切な第２の言語の文書を検索結果として得ることができる等の効果を実現できる。
【図面の簡単な説明】
【図１】本発明に係る典型的な多言語文書検索システムの構成を示す図である。
【図２】本発明の実施例１に係る多言語文書検索システムの構成を示す図である。
【図３】ディレクトリ構造の一例を示す図である。
【図４】ディレクトリ間の対応関係の一例を示す図である。
【図５】学習データ（パラレルコーパス）の格納例を示す図である。
【図６】本発明の実施例２に係る学習データ生成部の構成を示す図である。
【図７】日本語質問文中の日本語単語に対応する英語単語の例を示す図である。
【図８】典型的な多言語情報検索システムの動作例を示す模式図である。
【図９】上述実施例による関連ディレクトリの検索動作例を示す模式図である。
【符号の説明】
１１第１のディレクトリ保持手段
１２第２のディレクトリ保持手段
１３ディレクトリ関係保持手段
１４全ディレクトリ単語ベクトル生成手段
１５全ディレクトリ単語ベクトル保持手段
１６ディレクトリベクトル生成手段
１７ディレクトリベクトル保持手段
１８学習データ保持手段
１９ディレクトリ毎単語ベクトル生成手段
１１０ディレクトリ毎単語ベクトル保持手段
１１１文書ベクトル生成手段
１１２文書ベクトル保持手段
１１３検索要求取得手段
１１４全ディレクトリ検索要求ベクトル生成手段
１１５ディレクトリ検索手段
１１６ディレクトリ毎検索要求ベクトル生成手段
１１７多言語検索手段
１１８検索結果表示手段
[0001]
[Technical Field of the Invention]
The present invention relates to a retrieval system for documents stored in a directory structure (hierarchical structure) constructed on the Internet, and more particularly to a system for performing retrieval across a plurality of directory structures constructed for different languages.
[0002]
[Prior art]
With the rapid increase in Internet users, commercial use of the Internet is also expanding. One method of facilitating access to the large number of documents stored on WWW servers is a directory service, which defines a directory structure and stores each document in an appropriate directory. The intent is that the user reaches the target document by following, from the top-level directory downward, the subdirectories closest to the object of interest. However, a user cannot always follow the optimal subdirectories, so in most cases a search technique such as full-text search is used in combination to raise the likelihood of reaching the target document.
[0003]
[Problems to be solved by the invention]
A directory service is often started in a specific country/language, after which the directory structure used there is carried over almost unchanged to several other countries/languages, and a similar directory service is operated in each country. However, the directory services operated in the individual countries are independent of one another: a search can retrieve only the documents that exist within a single directory structure, and documents in the directory structures of other countries/languages cannot be obtained as search results. Particularly for commercial directory services such as Internet sales sites and auction sites, being able to search documents of other countries/languages appropriately is important, and at present many potential business opportunities are being lost.
[0004]
The present invention has been made in view of such a point, and an object thereof is to provide a multilingual document search system capable of realizing a search across a plurality of directory structures with high accuracy.
[0005]
[Means for Solving the Problems]
Many multilingual information retrieval techniques have been proposed for searching across language differences. For example, a method that realizes multilingual information retrieval by applying the technique called Latent Semantic Indexing, described in detail in "Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W. and Harshman, R. A., 'Indexing by latent semantic analysis', Journal of the Society for Information Science, 41(6), 391-407.", to a set of translated text pairs (a parallel corpus) is proposed in "Dumais, S. T., Landauer, T. K. and Littman, M. L., 'Automatic cross-linguistic information retrieval using Latent Semantic Indexing', In Proceedings of SIGIR'96 - Workshop on Cross-Linguistic Information Retrieval, pp. 16-23, August 1996.". The technique proposed in "Mark W. Davis and Ted E. Dunning, 'Query translation using evolutionary programming for multi-lingual information retrieval', In Proceedings of the Fourth Annual Conference on Evolutionary Programming, March 1995." is another representative example of multilingual information retrieval technology. Furthermore, as described in "Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer, 'The mathematics of statistical Machine Translation: Parameter estimation', Computational Linguistics, 32: 263-311, 1993.", much research has also been done on techniques that realize machine translation by using a parallel corpus and retrieve documents written in a second language by machine-translating a search request sentence written in a first language into the second language.
[0006]
However, it is hard to say that these multilingual information retrieval techniques currently achieve retrieval accuracy sufficient for use in real commercial systems. The biggest factor lowering the accuracy of multilingual information retrieval is the semantic ambiguity of words and phrases. In general, when a word (phrase) in a first language is translated into a word (phrase) in a second language, many translation candidates exist. For example, the English word "base" has various Japanese translation candidates depending on the field: 「基地」 (a military base) as a military term, 「塁」 (a base on the diamond) as a baseball term, 「支持母体」 (a support base) as a political term, 「基数」 (a radix) as a mathematical term, 「塩基」 (a chemical base) as a chemical term, 「期体」 as a grammatical term, and 「（塗料の）主成分」 (the main component of a paint) as an architectural term. Because these translation candidates are in many cases field-dependent, it is said that multilingual information retrieval achieves high accuracy if the search target is limited to a document set of a specific field.
[0007]
The present invention uses the correspondence between the directories of two directory structures constructed for different languages. A directory highly relevant to the user's search request is selected, and multilingual information retrieval is performed with only the document set belonging to the other-language directory that corresponds to the selected directory as the search target. This limits the field of the document set being searched and makes highly accurate multilingual document retrieval possible.
[0008]
That is, as shown in FIG. 1, according to one aspect of the present invention, in order to achieve the above object, a multilingual search system is provided with: first directory holding means 1 that holds a first directory structure constructed for a first language; second directory holding means 2 that holds a second directory structure constructed for a second language; directory relation holding means 3 that holds the correspondence between each directory in the first directory structure and each directory in the second directory structure; directory search means 4 that determines which directory in the first directory structure is highly relevant to a search request made by a user in the first language; and multilingual search means 5 that determines, among the documents belonging to the directory in the second directory structure that corresponds to the directory determined by the directory search means, the documents highly relevant to the user's search request in the first language.
[0009]
In this configuration, as described above, a directory highly relevant to the user's search request is selected, and multilingual information retrieval is performed with only the document set belonging to the corresponding other-language directory as the search target. The field of the document set being searched can therefore be limited, and highly accurate multilingual document retrieval becomes possible.
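The two-stage flow of this configuration (pick a first-language directory, then search only the corresponding second-language directory) can be sketched roughly as below. Everything concrete here is an assumption made for illustration: the function names, the directory and document identifiers, the two-dimensional toy vectors, and the use of cosine similarity as the relevance measure. The sketch also reuses one query vector for both stages, whereas Example 1 of this description recomputes the query vector with word vectors specific to the selected directory.

```python
import numpy as np


def unit(v):
    """Scale a vector to length 1 (a zero vector is returned unchanged)."""
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(v)
    return v if norm == 0.0 else v / norm


def select_directory(query_vec, directory_vecs):
    """Stage 1 (directory search means 4): most relevant first-language directory."""
    q = unit(query_vec)
    return max(directory_vecs,
               key=lambda name: float(np.dot(q, unit(directory_vecs[name]))))


def multilingual_search(query_vec, source_dir, dir_map, docs_by_dir, top_k=3):
    """Stage 2 (multilingual search means 5): rank only the mapped directory's documents."""
    target_dir = dir_map[source_dir]          # directory relation holding means 3
    q = unit(query_vec)
    scored = [(float(np.dot(q, unit(vec))), doc_id)
              for doc_id, vec in docs_by_dir[target_dir].items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]


# Hypothetical toy data: two first-language directories and their counterparts.
directory_vecs = {"travel_ja": [1.0, 0.1], "sports_ja": [0.1, 1.0]}
dir_map = {"travel_ja": "travel_en", "sports_ja": "sports_en"}
docs_by_dir = {
    "travel_en": {"bus-pass": [0.9, 0.2], "hotel": [0.5, 0.5]},
    "sports_en": {"soccer-pass": [0.2, 0.9]},
}

query = [0.8, 0.3]
best_dir = select_directory(query, directory_vecs)
results = multilingual_search(query, best_dir, dir_map, docs_by_dir)
```

Because stage 2 never looks outside the mapped directory, the sports document about a "pass" cannot appear in the results, whatever its score would have been.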
[0010]
In this configuration, when the server storing the first directory differs from the server storing the second directory, the server storing the first directory is provided with communication means capable of communicating with the server storing the second directory, and the multilingual search is performed via this communication means.
[0011]
The above aspect and other aspects of the present invention are as set forth in the claims and are described in detail below.
[0012]
Further, the present invention can be realized not only as a system or an apparatus but also as a method, and a part thereof can be realized as a computer program.
[0013]
In addition, the present invention can be realized as a search server, and a part of the present invention may be implemented in a client device.
[0014]
[Embodiments of the Invention]
Hereinafter, the present invention will be described in detail using examples.
[Example 1]
The present invention will be described specifically on the basis of Example 1. This example corresponds to claim 4. The configuration of the multilingual document search system according to this example is described with reference to FIG. 2. This example and Example 2, described later, are explained for Japanese and English, but the same effect can be obtained for any language to which morphological analysis (processing that divides a sentence into words) can be applied.
[0015]
The first directory holding means 11 and the second directory holding means 12 are means for holding inside the computer the directory structures (the first directory structure and the second directory structure) that store a plurality of Japanese documents and a plurality of English documents, respectively. An example of the directory structures held by the two means (an example of an auction site) is shown in FIG. 3. Each directory is given a unique name (identifier) according to the content (field) of the documents stored in it. Documents are stored only in the lowest-level directories.
[0016]
The directory relation holding means 13 holds the correspondence between each directory in the first directory structure held in the first directory holding means 11 and each directory in the second directory structure held in the second directory holding means 12. A correspondence here means that the fields of the document sets in the two directories are the same. An example of the correspondences held in the directory relation holding means 13 is shown in FIG. 4. In this example, the correspondence between the directories of the first directory structure and those of the second directory structure is defined one-to-one, and the two directory structures are assumed to be exactly identical. However, even if some directories have no defined correspondence, exactly the same effect is obtained for the directories whose correspondence is defined.
[0017]
The all directory word vector generating means 14 is means for calculating, with all the Japanese documents contained in the first directory structure as learning data, a corresponding multidimensional vector (word vector) for each of the Japanese words contained in them. The algorithm for calculating the word vectors is described below.
[0018]
<Step S1>: A morphological analysis process is performed on all Japanese documents included in the first directory structure.
<Step S2>: Of all the Japanese words obtained in Step S1, n words are selected in descending order of appearance frequency among all the Japanese documents included in the first directory structure. The n words obtained here are called feature expression words. The value of n is on the order of thousands.
<Step S3>: A matrix is created whose rows correspond to all the Japanese words obtained in step S1 and whose columns correspond to the feature expression words. If the total number of distinct Japanese words obtained in step S1 is 100,000 and n is 3,000, the result is a matrix of 100,000 rows × 3,000 columns. Each element of this matrix records how many times the word corresponding to its row and the feature expression word corresponding to its column co-occurred in all the Japanese documents contained in the first directory structure. For example, when word a and feature expression word b co-occur (appear together) in 30 documents, 30 is recorded in the corresponding matrix element. The matrix obtained in this way is called a co-occurrence matrix. In this way, an n-dimensional vector can be defined for every Japanese word contained in the Japanese documents. This vector can be said to indicate the contexts in which each Japanese word tends to appear.
<Step S4>: Because the n-dimensional vectors obtained in step S3 have a large number of dimensions, the processing required later would take an enormous amount of computation time. To keep the computation within a practical time, the original n-dimensional vectors are therefore compressed into n'-dimensional vectors (several hundred dimensions) by a matrix dimension-reduction technique. Various dimension-reduction techniques exist; a representative example is Singular Value Decomposition, which is described in detail in "Berry, M., Do, T., O'Brien, G., Krishna, V. and Varadhan, S. (1993). 'SVDPACKC USER'S GUIDE'. Tech. Rep. CS-93-194. University of Tennessee, Knoxville, TN.". The n'-dimensional vectors obtained in this way for all the Japanese words are called word vectors.
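Steps S2 to S4 can be sketched in a few lines, assuming that step S1 (morphological analysis) has already produced a list of tokens per document. This is only an illustrative sketch under stated assumptions: the function name `build_word_vectors`, the toy documents, and the tiny values of n and n' are hypothetical; NumPy's dense `numpy.linalg.svd` stands in for the SVDPACKC package cited in the text; and counting a word as co-occurring with itself is a choice the description does not specify.

```python
from collections import Counter

import numpy as np


def build_word_vectors(tokenized_docs, n_features, n_dims):
    """Return (word vectors, co-occurrence matrix, row words, feature words)."""
    # Step S2: the n most frequent words become the feature expression words.
    freq = Counter(word for doc in tokenized_docs for word in doc)
    vocab = sorted(freq)
    ranked = sorted(freq.items(), key=lambda item: (-item[1], item[0]))
    features = [word for word, _ in ranked[:n_features]]
    row = {word: i for i, word in enumerate(vocab)}
    col = {word: j for j, word in enumerate(features)}

    # Step S3: each element counts the documents in which the row word and
    # the column feature word appear together.
    cooc = np.zeros((len(vocab), len(features)))
    for doc in tokenized_docs:
        present = set(doc)
        for word in present:
            for feature in present:
                if feature in col:
                    cooc[row[word], col[feature]] += 1

    # Step S4: compress each n-dimensional row to n' dimensions with a
    # truncated singular value decomposition.
    u, s, _ = np.linalg.svd(cooc, full_matrices=False)
    k = min(n_dims, len(s))
    reduced = u[:, :k] * s[:k]
    vectors = {word: reduced[row[word]] for word in vocab}
    return vectors, cooc, vocab, features


# Hypothetical pre-tokenized documents (English tokens used for readability).
docs = [
    ["bus", "pass", "park"],
    ["bus", "tour", "park"],
    ["soccer", "pass", "goal"],
]
vectors, cooc, vocab, features = build_word_vectors(docs, n_features=3, n_dims=2)
```

At the scale given in the text (about 100,000 rows by 3,000 columns), a sparse matrix and a sparse truncated SVD routine would likely be needed in place of the dense arrays used here.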
[0019]
The all directory word vector holding means 15 is means for holding the word vectors corresponding to all Japanese words calculated by the all directory word vector generating means 14 inside the computer.
[0020]
The directory vector generation means 16 is a means for calculating a directory vector corresponding to each directory in the first directory structure. Hereinafter, an algorithm for calculating a directory vector will be described.
[0021]
<Step S11>: A document vector corresponding to each of all Japanese documents included in the first directory structure is calculated. Here, it is assumed that the document vector is a vector obtained by normalizing the sum of word vectors corresponding to all the words included in the document (the length of the vector is 1).
<Step S12>: The directory vector of each directory located in the lowest layer is calculated. Here, it is assumed that the directory vector of the directory located at the lowest layer is a vector obtained by normalizing the sum of the document vectors corresponding to all the documents included in the directory.
<Step S13>: A directory is found that is not located in the lowest layer and for which the directory vectors of all the directories it contains have already been calculated, and its directory vector is calculated. Here, the directory vector of a directory not located in the lowest layer is assumed to be the vector obtained by normalizing the sum of the directory vectors corresponding to all the directories included in that directory.
<Step S14>: Step S13 is repeated until directory vectors are calculated for all directories.
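Steps S11 to S14 can be sketched as follows. The sketch assumes a tree-shaped directory structure represented as nested dictionaries with hypothetical keys `name`, `docs`, and `children`, and it uses recursion in place of the repeated search of steps S13 and S14; both choices are illustrative assumptions.

```python
import math


def normalize(vec):
    """Scale a vector to length 1 (a zero vector is returned unchanged)."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec] if length > 0 else list(vec)


def normalized_sum(vectors, dim):
    """Normalized sum of a list of dim-dimensional vectors."""
    total = [0.0] * dim
    for vec in vectors:
        for i, x in enumerate(vec):
            total[i] += x
    return normalize(total)


def directory_vector(directory, word_vectors, dim, out):
    """Compute directory vectors bottom-up and store them in out[name]."""
    if directory["children"]:
        # Steps S13/S14: sum the vectors of the contained directories.
        parts = [directory_vector(child, word_vectors, dim, out)
                 for child in directory["children"]]
    else:
        # Steps S11/S12: sum the document vectors of a lowest-layer directory.
        parts = [normalized_sum([word_vectors[w] for w in doc if w in word_vectors], dim)
                 for doc in directory["docs"]]
    out[directory["name"]] = normalized_sum(parts, dim)
    return out[directory["name"]]


# Toy word vectors and a two-level directory tree (illustrative values only).
word_vectors = {"bus": [1.0, 0.0], "tour": [0.0, 1.0], "vase": [1.0, 0.0]}
tree = {"name": "root", "docs": [], "children": [
    {"name": "Travel", "docs": [["bus", "tour"]], "children": []},
    {"name": "Antiques", "docs": [["vase"]], "children": []},
]}
vectors = {}
directory_vector(tree, word_vectors, 2, vectors)
```

After the call, `vectors` holds one unit-length vector per directory, which corresponds to what the directory vector holding means 17 would store.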
[0022]
The directory vector holding means 17 is means for holding directory vectors corresponding to all directories calculated by the directory vector generating means 16 inside the computer.
[0023]
The learning data holding unit 18 is a means for holding, as learning data, a set of Japanese-English translation pairs (a Japanese-English parallel corpus) related to the contents of the documents contained in a directory, for each of the lowest-layer directories in the first directory structure held in the first directory holding unit 11 (or in the second directory structure held in the second directory holding unit 12). An example of learning data held by the learning data holding means 18 is shown in FIG.
[0024]
The word vector generation unit 19 for each directory is a means for calculating, with the Japanese-English parallel corpus held in the learning data holding unit 18 as learning data, a word vector set specialized for the semantic content of each directory in the first directory structure (that is, of each directory in the second directory structure). Hereinafter, an algorithm for calculating the word vector set corresponding to an arbitrary directory (directory A) will be described.
[0025]
<Step S21>: All the Japanese-English parallel corpora held in the learning data holding means 18 for all the lowest-level directories included in directory A (for directory A itself if directory A is a lowest-level directory) are collectively regarded as the learning data, and morphological analysis is performed on all the Japanese and English documents included in the learning data. In the example of FIG. 5, if directory A is the "Antiques" directory, parallel corpora 1 to 4 are collectively used as the learning data.
<Step S22>: Of all the Japanese words and all the English words obtained in step S21, n words are selected in descending order of appearance frequency in the learning data. The n words obtained here are referred to as feature expression words, as in step S2. In this case, however, the feature expression words are a mixture of Japanese words and English words. The value of n is on the order of several thousand, as in step S2.
<Step S23>: A co-occurrence matrix is created whose rows correspond to all the Japanese and English words obtained in step S21 and whose columns correspond to the feature expression words. Each element of the matrix records how many times the word corresponding to the element's row and the feature expression word corresponding to its column co-occurred in all the Japanese-English translation pairs included in the learning data. That is, the number of co-occurrences is counted by regarding each Japanese-English translation pair as one document. In this way, a co-occurrence matrix is generated that expresses all the Japanese words and all the English words as n-dimensional vectors. These vectors indicate the appearance tendency of each word in accordance with the semantic content (field) of directory A.
<Step S24>: The n-dimensional vector obtained in step S23 is compressed into an n′-dimensional (hundreds of dimensions) vector by the matrix dimension compression method as in step S4. In this way, n'-dimensional word vectors that can be compared in the same vector space for all Japanese / English words are obtained.
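The point of steps S22 and S23, that a translation pair is counted as a single document so that Japanese and English words share the same columns, can be sketched as below. The romanized tokens stand in for Japanese words, and the function names are illustrative assumptions; the dimension compression of step S24 is not shown.

```python
from collections import Counter


def feature_words_from_pairs(pairs, n):
    """Step S22 sketch: the n most frequent words over both languages."""
    counts = Counter()
    for ja_tokens, en_tokens in pairs:
        counts.update(ja_tokens)
        counts.update(en_tokens)
    return [word for word, _ in counts.most_common(n)]


def pair_cooccurrence(pairs, feature_words):
    """Step S23 sketch: count co-occurrence with each pair treated as one document."""
    column = {w: j for j, w in enumerate(feature_words)}
    rows = {}
    for ja_tokens, en_tokens in pairs:
        merged = set(ja_tokens) | set(en_tokens)  # one translation pair = one document
        hits = [column[w] for w in merged if w in column]
        for word in merged:
            row = rows.setdefault(word, [0] * len(feature_words))
            for j in hits:
                row[j] += 1
    return rows


# Toy parallel corpus; "kotto", "tsubo", "sara" are romanized stand-ins for Japanese words.
pairs = [(["kotto", "tsubo"], ["antique", "vase"]),
         (["kotto", "sara"], ["antique", "plate"])]
features = feature_words_from_pairs(pairs, 2)
rows = pair_cooccurrence(pairs, features)
```

In this toy corpus the rows for "tsubo" and "vase" come out identical, which illustrates why words of the two languages become comparable in one vector space.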
[0026]
By applying the above algorithm to all directories in the first directory structure (that is, to all directories in the second directory structure), a word vector set specialized for the semantic content of each directory in the directory structure can be calculated.
[0027]
The directory word vector holding means 110 is means for holding the word vector set calculated by the directory word vector generation means 19 for each directory.
[0028]
The document vector generating unit 111 is a unit that calculates the document vector of each English document belonging to the directory for each of all directories in the second directory structure held by the second directory holding unit 12. When a document vector of each English document belonging to the directory A is calculated for an arbitrary directory A, a word vector set held in correspondence with the directory A in the word vector holding unit 110 for each directory is used. Here, the calculation is performed on the assumption that the document vector of each English document is a vector obtained by normalizing the sum of word vectors corresponding to all English words included in the document. In this way, for each directory in the second directory structure, it is possible to calculate a document vector set specialized for its semantic content (field).
[0029]
The document vector holding means 112 is means for holding the document vector set calculated by the document vector generating means 111 for each directory in the second directory structure.
[0030]
The search request acquisition unit 113 is a unit having a user interface that can receive a search request in Japanese text from a user. The received search request is subjected to morphological analysis processing and divided into Japanese words.
[0031]
The all directory search request vector generation unit 114 is a unit that calculates a search request vector corresponding to the search request from the user received by the search request acquisition unit 113. A vector obtained by normalizing the sum of word vectors corresponding to all Japanese words included in the search request sentence using the word vector set held in all directory word vector holding means 15 is set as a search request vector.
[0032]
The directory search unit 115 is a unit that determines which directory in the first directory structure is most relevant to the search request from the user received by the search request acquisition unit 113. To make this determination, the directory search means 115 calculates the degree of association between the search request vector calculated by the all directory search request vector generation means 114 and each directory vector held in the directory vector holding means 17, and selects the directory whose directory vector has the highest degree of association. The degree of association is defined as the inner product (cosine value) of the vectors. Therefore, the degree of association is a real number between 0 and 1, and the smaller the angle between the two vectors, the closer it is to 1.
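The selection performed by the directory search unit 115 can be sketched as a cosine comparison followed by an argmax. The function names and the toy vectors are illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two vectors (0.0 if either is a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def most_relevant_directory(query_vector, directory_vectors):
    """Return the name of the directory whose vector is closest to the query."""
    return max(directory_vectors,
               key=lambda name: cosine(query_vector, directory_vectors[name]))


# Toy directory vectors (illustrative values only).
directory_vectors = {"Travel": [0.6, 0.8], "Sports": [1.0, 0.0]}
best = most_relevant_directory([0.0, 1.0], directory_vectors)
print(best)
```

When all vectors are already normalized to length 1, as in the text, the cosine reduces to the plain inner product.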
[0033]
The search request vector generation unit 116 for each directory is a unit that calculates a search request vector specialized for the field of the directory having the highest degree of association with the search request calculated by the directory search unit 115. First, a directory in the second directory structure corresponding to the directory in the first directory structure obtained from the directory search unit 115 is determined by referring to the directory relationship holding unit 13. Next, a word vector set corresponding to the directory is obtained from the word vector holding unit 110 for each directory. Using the obtained word vector set, a vector obtained by normalizing the sum of word vectors corresponding to all Japanese words included in the search request sentence is calculated and set as a new search request vector.
[0034]
The multilingual search unit 117 calculates the degree of association between the search request vector calculated by the directory search request vector generation unit 116 and each document vector held in the document vector holding unit 112 for the directory corresponding to the one determined by the directory search unit 115. The definition of the degree of association is the same as in the directory search unit 115. The search request vector is a vector for a Japanese sentence, and each document vector held in the document vector holding means 112 is a vector for an English document. The two can nevertheless be compared, because both are calculated as sums of the word vectors held in the directory-specific word vector holding means 110, which express Japanese words and English words in the same vector space.
[0035]
The search result display unit 118 refers to the degree of association between the search request vector and each document vector calculated by the multilingual search unit 117, and presents to the user, as search results, the documents whose document vectors have a high degree of association with the search request vector (a large inner product).
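The flow through units 116, 117, and 118 can be sketched as below, assuming the relevant directory in the first directory structure has already been chosen. The dictionary layouts, the directory names `ja/Travel` and `en/Travel`, the romanized query words, and the choice to key the word vector sets by the second-structure directory are all illustrative assumptions.

```python
import math


def normalize(vec):
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec] if length > 0 else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def query_vector(words, word_vectors, dim):
    """Normalized sum of the word vectors of the query words."""
    total = [0.0] * dim
    for w in words:
        for i, x in enumerate(word_vectors.get(w, [0.0] * dim)):
            total[i] += x
    return normalize(total)


def cross_language_search(query_words, ja_directory, directory_map,
                          word_vectors_by_dir, doc_vectors_by_dir, dim, top_k=3):
    en_directory = directory_map[ja_directory]        # directory relationship lookup
    word_vectors = word_vectors_by_dir[en_directory]  # directory-specific word vectors
    q = query_vector(query_words, word_vectors, dim)  # new search request vector
    scored = [(dot(q, vec), doc_id)
              for doc_id, vec in doc_vectors_by_dir[en_directory].items()]
    scored.sort(reverse=True)                         # highest degree of association first
    return [doc_id for _, doc_id in scored[:top_k]]


# Toy data (illustrative values only); "basu" and "pasu" stand in for Japanese words.
directory_map = {"ja/Travel": "en/Travel"}
word_vectors_by_dir = {"en/Travel": {"basu": [1.0, 0.0], "pasu": [0.0, 1.0]}}
doc_vectors_by_dir = {"en/Travel": {"bus-pass.html": [0.6, 0.8], "hotel.html": [1.0, 0.0]}}
results = cross_language_search(["basu", "pasu"], "ja/Travel", directory_map,
                                word_vectors_by_dir, doc_vectors_by_dir, dim=2)
print(results)
```

Only the documents of the single corresponding directory are scored, which is the field-limiting behavior the text describes.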
[0036]
In this embodiment, the directory search unit 115 automatically determines a directory highly relevant to the search request from the user. However, the user may instead determine the highly relevant directory manually by tracing the directory structure.
[0037]
With the multilingual document search apparatus configured as described above, English documents related to a search request written in Japanese can be obtained as search results, and the problems described above (the problems to be solved by the invention) can be solved.
[0038]
In addition, by utilizing the correspondence between the first directory structure for Japanese documents and the second directory structure for English documents, (1) only English documents in a field highly relevant to the search request are set as the search target, and (2) the search is performed using learning data from a field highly relevant to the search request. These two field-limiting effects avoid the problem of word-sense ambiguity (the same word having different meanings in different fields), which has lowered the accuracy of conventional multilingual information retrieval, and dramatically improve the search accuracy of multilingual document search.
[0039]
In this embodiment, a parallel corpus is prepared for each lowermost directory as learning data. However, even when multilingual document search is performed without using learning data specialized for a field, as in the configuration of claim 1, effect (1) is still obtained, so a search with higher accuracy than conventional multilingual document search is possible.
[0040]
Further, even when a multilingual document search is performed without using learning data specialized for a field, as in the configuration of claim 1, field-specific multilingual information retrieval can be performed by using the document set (hereinafter referred to as document set D) included in a pair consisting of a directory in the first directory holding means and a directory in the second directory holding means whose correspondence is held by the directory relation holding means. The method is described below.
[0041]
It is assumed that the configuration is the same as the above configuration shown in FIG. However, since there is no learning data for each field (for each lowermost directory), the learning data holding means 18 holds only one Japanese-English parallel corpus, which is used in common whenever the word vector generating means 19 for each directory creates the word vector set corresponding to each directory.
[0042]
Therefore, when the word vector generation unit 19 for each directory creates the word vector set corresponding to each directory, the common parallel corpus is always used as the learning data in step S21. Further, instead of setting each element of the co-occurrence matrix created in step S23 to the number of co-occurrences of a word and a feature expression word, a weighted co-occurrence count using χ²_u defined by Expression 1 is used. χ²_u defined by Expression 1 is the weight for the word w_u in the field of directory A, and the weighted co-occurrence count is the number of co-occurrences of the word w_u1 and the feature expression word w_u2 multiplied by χ²_u1 and χ²_u2. The above χ²_u is a value generally used in the technique called the χ² test, and has the property of taking a high value for elements that show different appearance tendencies in the entire set and in a subset of it.
[Expression 1]

The word vector set for each directory obtained in this way is a word vector set specialized for the directory field. Therefore, even when the multilingual document search is performed without using the learning data specialized for the field as in the configuration of claim 1, the effect (2) can be obtained in addition to the effect (1). Therefore, it is possible to perform a search with higher accuracy than the conventional multilingual document search.
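The weighting of the co-occurrence matrix can be sketched as follows. Expression 1 is not reproduced here, so the χ² weights are simply passed in as given numbers; the function name, the dictionary layout, and the toy values are illustrative assumptions.

```python
def weighted_cooccurrence(counts, feature_words, weights):
    """Multiply each raw co-occurrence count by the weights of both words.

    counts[word][j] is the raw co-occurrence count of word with feature_words[j];
    weights[word] is that word's weight (assumed to come from Expression 1).
    """
    weighted = {}
    for word, row in counts.items():
        weighted[word] = [count * weights[word] * weights[feature_words[j]]
                          for j, count in enumerate(row)]
    return weighted


# Toy counts and weights (illustrative values only, not computed from Expression 1).
features = ["bus", "goal"]
counts = {"pass": [4, 2]}
weights = {"pass": 1.5, "bus": 2.0, "goal": 0.5}
weighted = weighted_cooccurrence(counts, features, weights)
print(weighted["pass"])
```

With these values, the co-occurrence with a highly weighted feature word is amplified and the co-occurrence with a lightly weighted one is damped.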
[0043]
The details of the multilingual document retrieval method used in the present embodiment are described in the literature "Hiroshi Masuichi, Raymond Flournoy, Stefan Kaufmann and Stanley Peters, "Query Translation Method for Cross Language Information Retrieval", The Proceedings of Machine Translation Summit VII '99 Workshop on Machine Translation for Cross Language Information Retrieval, (1999)".
[0044]
[Example 2]
  The following is a description of Example 2 of the present invention. This embodiment corresponds to claim 7. The present embodiment differs from the first embodiment only in the configuration of the learning data holding unit 18. Therefore, only the part related to the learning data holding means 18 is described below. FIG. 6 is a diagram showing the configuration of the present embodiment in the range corresponding to the learning data holding means 18 in FIG. Other components are the same as those in FIG.
[0045]
The first directory holding means 11, the second directory holding means 12, and the directory relation holding means 13 are means having functions equivalent to the respective means in FIG. In the present embodiment, however, the documents stored in the first directory structure and the second directory structure are Web documents. It is assumed that the first directory structure mainly stores documents written in Japanese but also stores documents written in English, and that, similarly, the second directory structure mainly stores documents written in English but also stores documents written in Japanese. Even so, by handling the English words obtained by morphological analysis of the documents in the first directory structure in the same way as Japanese words, and the Japanese words obtained by morphological analysis of the documents in the second directory structure in the same way as English words, the processing can be performed without changing the algorithm of any of the means described in the first embodiment.
[0046]
The following description of means 21 to 26 applies to a lowest directory A in the first directory structure and the lowest directory A′ in the second directory structure that corresponds to directory A. The same process therefore needs to be repeated for all the lowermost directories in the directory structure.
[0047]
The pair text extraction means 21 is a means for extracting Japanese-English parallel text pairs, that is, pairs of Web documents that are translations of each other, from all the Web documents belonging to the lowermost directory A in the first directory structure and the corresponding lowermost directory A′ in the second directory structure, using a technique such as an existing document collection robot.
[0048]
The pair text holding means 22 is a means for holding, inside the computer, the set of Japanese-English parallel text pairs obtained by the pair text extracting means 21 and the Japanese-English document pairs obtained by the document pair extracting means 25. Further, when the number of Japanese-English pairs (parallel text pairs plus Japanese-English document pairs) held in this means reaches a preset number or more, the set of Japanese-English pairs is passed to the learning data holding means 26.
[0049]
The word vector generation unit 23 uses the same algorithm as the word vector generation unit 19 for each directory in the first embodiment, using the Japanese-English pair stored in the pair text storage unit 22 as learning data, thereby generating a word vector. It is a means to calculate.
[0050]
The document vector generation unit 24 is a unit that calculates document vectors corresponding to all documents belonging to the directory A and the directory A ′ by using the word vector set obtained from the word vector generation unit 23. The document vector is calculated by normalizing the sum of word vectors corresponding to all Japanese / English words contained in the document.
First, the document pair extraction unit 25 refers to the document vectors obtained from the document vector generation unit 24 and extracts, from the set of all documents belonging to directory A and directory A′, pairs of a Japanese document and an English document that satisfy the following condition.
[0051]
"The English document vector with the highest degree of association (the largest inner product) with the Japanese document vector of the pair is the English document vector of the pair, and conversely, the Japanese document vector with the highest degree of association with the English document vector of the pair is the Japanese document vector of the pair."
Next, from the Japanese-English document pairs satisfying the above condition, those pairs are extracted for which the inner product of the document vectors of the Japanese document and the English document in the pair is larger than a preset threshold value. The Japanese-English document pairs obtained in this way are very close in meaning and can be used as learning data. The obtained pairs are held in the pair text holding means 22 together with the set of Japanese-English parallel text pairs obtained by the pair text extracting means 21.
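The condition used by the document pair extraction means 25 amounts to a mutual nearest-neighbor test followed by a threshold on the inner product. A minimal sketch, assuming unit-length document vectors stored in dictionaries keyed by hypothetical document ids:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def extract_document_pairs(ja_vectors, en_vectors, threshold):
    """Return (ja_id, en_id) pairs that are each other's best match and
    whose inner product exceeds the threshold."""
    if not ja_vectors or not en_vectors:
        return []

    def best_match(vec, candidates):
        return max(candidates, key=lambda key: dot(vec, candidates[key]))

    pairs = []
    for ja_id, ja_vec in ja_vectors.items():
        en_id = best_match(ja_vec, en_vectors)
        if best_match(en_vectors[en_id], ja_vectors) != ja_id:
            continue  # not mutual: the English document prefers another Japanese document
        if dot(ja_vec, en_vectors[en_id]) > threshold:
            pairs.append((ja_id, en_id))
    return pairs


# Toy document vectors (illustrative values only).
ja_vectors = {"ja1": [1.0, 0.0], "ja2": [0.0, 1.0], "ja3": [0.6, 0.8]}
en_vectors = {"en1": [0.8, 0.6], "en2": [0.0, 1.0]}
pairs = extract_document_pairs(ja_vectors, en_vectors, threshold=0.9)
print(pairs)
```

In the toy data, "ja1" is closest to "en1", but "en1" is closer to "ja3", so "ja1" is left unpaired.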
[0052]
The learning data holding means 26 is means for holding the Japanese-English pair set passed from the pair text holding means inside the computer.
[0053]
With such a configuration, the following steps are carried out:
(1) The word vector generating means 23 generates a word vector set, using the Japanese-English pair set held in the pair text holding means 22 as learning data.
(2) The document vector generating means 24 generates a document vector set.
(3) The document pair extraction means 25 extracts Japanese-English document pairs whose meanings are very close.
(4) The obtained document pairs are added to the pair text holding means 22 (a pair that has already been added replaces the earlier one).
By repeating this process, the number of Japanese-English pairs held in the pair text holding means 22 can be increased gradually. With such an iterative method, learning data of a practically sufficient size can be obtained even when the number of pair texts obtained from the pair text extraction means 21 is small. The details of this iterative method are described in "Hiroshi Masuichi, Raymond Flournoy, Stefan Kaufmann and Stanley Peters, "A Bootstrapping method for Extracting Bilingual Text Pairs", The Proceedings of The 18th International Conference on Computational Linguistics, pp. 1066-1070 (2000)". This iterative method is effective only when the field of the document set from which pairs are extracted is limited. In this embodiment, the iterative method can be applied because the field of the document set is limited by using the correspondence between the first directory structure and the second directory structure.
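The repetition of steps (1) to (4) can be sketched as a loop that stops once a round finds no new pairs. The callable `extract_pairs` is a hypothetical stand-in for means 23, 24, and 25 taken together, and `toy_extract` only imitates their behavior; the stopping rule and the round limit are also assumptions.

```python
def bootstrap_pairs(seed_pairs, extract_pairs, max_rounds=10):
    """Grow a set of Japanese-English pairs by repeated extraction.

    extract_pairs(current_pairs) must return an iterable of (ja_id, en_id)
    pairs found when current_pairs is used as learning data.
    """
    held = set(seed_pairs)  # plays the role of the pair text holding means
    for _ in range(max_rounds):
        found = set(extract_pairs(held))
        if found <= held:
            break           # nothing new was found, so stop iterating
        held |= found       # a pair that is already held is not duplicated
    return held


# Toy extractor: the more learning data it is given, the more pairs it finds.
candidates = [("ja1", "en1"), ("ja2", "en2"), ("ja3", "en3")]


def toy_extract(current_pairs):
    return candidates[:len(current_pairs)]


grown = bootstrap_pairs({("ja0", "en0")}, toy_extract)
print(len(grown))
```

Starting from a single seed pair, the toy run picks up one additional pair per round until the candidates are exhausted.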
[0054]
The processing after the learning data is obtained in this way is exactly the same as in the first embodiment. In the first embodiment, learning data must be prepared in advance for each lowermost directory. In the multilingual document search apparatus of the present embodiment, however, Japanese-English parallel text pairs of Web documents that are translations of each other are used as the initial learning data and are then grown by the iterative method described above, so the learning data necessary for multilingual document retrieval can be generated automatically.
[0055]
  The learning data (bilingual document pairs) obtained in this way can be used as a parallel corpus. As noted in the above-mentioned document "Hiroshi Masuichi, Raymond Flournoy, Stefan Kaufmann and Stanley Peters, "A Bootstrapping method for Extracting Bilingual Text Pairs", The Proceedings of The 18th International Conference on Computational Linguistics, pp. 1066-1070 (2000)", a parallel corpus is a valuable language resource for realizing a multilingual information retrieval system or a machine translation system, but it is currently in short supply. The field-by-field learning data generation method made possible by using the correspondence between the two directory structures described in this embodiment is a very effective way of solving the shortage of parallel corpora.
[0056]
Although both the first and second embodiments have been described for the case where documents are stored only in the lowermost directories, exactly the same processing is possible when documents are stored in other directories, by treating the document vector of such a document in the same way as a directory vector. Furthermore, although the directory structure has been described as a tree structure in both embodiments, the same processing can clearly be performed for a network-type directory structure in which a directory has a plurality of parent directories.
[0057]
  In claims 2, 3, 5, 6, 8, and 9, instead of using a multilingual document search method, a search across different languages is performed by translating the search request or the documents. An example of realizing a machine translation system using a parallel corpus as learning data is given in the above-mentioned document "Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer, "The mathematics of statistical Machine Translation: Parameter estimation", Computational Linguistics, 32: 263-311, 1993.".
[0058]
  Also, instead of directly performing multilingual document search, the multilingual search means may extract bilingual document pairs in advance. As this extraction method, the learning data generation method described in the second embodiment can be used as it is.
[0059]
The effects of the above-described embodiments will be confirmed below using a specific example. Suppose that, on a search site, sales site, or auction site on the Internet, a user wants to use the Japanese search query "bus free pass for nature park tours" to access information written in English and obtain information about such a free pass or buy one. In this case, a typical multilingual search system first extracts the keywords "nature", "park", "tour", "purpose", "bus", "free", and "pass" from the above question text, and replaces them with the corresponding English keywords using a Japanese-English translation dictionary. Examples of the corresponding English keywords are shown in FIG. Each Japanese keyword has a plurality of corresponding English keywords, and each English keyword in turn can have a plurality of meanings in an English sentence. The English keywords corresponding to the Japanese keyword "pass" include "pass", "passing", "calipers" (a machine term), "PAS" (para-aminosalicylic acid, a chemical term), "path", and so on. Further, "pass", for example, is used in English sentences in various meanings such as "pass (in a ball game)", "regular", and "barrier". Therefore, if related English documents are searched for using these English keywords,
(1) documents whose important keywords include "free kick", "goal", "pass", etc.,
(2) documents whose important keywords include "hit a home run (square)", "baseball field (ballpark)", "ticket", etc.,
(3) documents whose important keywords include "PAS (para-aminosalicylic acid)", "free acid", etc.,
(4) documents whose important keywords include "computer bus", "free access", "path analysis", "circuit", etc.,
(5) documents whose important keywords include "bass fishing tour", etc.,
(6) documents whose important keywords include "free sample calipers (free calipers)", etc.,
and other search results contrary to the search intention are obtained. FIG. 8 is a schematic diagram showing this situation, taking as an example a system configuration based on the vector space method. Because of word ambiguity, English document vectors that are close to the English vector obtained by replacing the Japanese search question with the corresponding English words exist in various fields, and the accuracy of the search results obtained is extremely low. As described above, it is extremely difficult to realize multilingual information retrieval with high accuracy compared with monolingual information retrieval.
[0060]
In the multilingual information search system according to the above-described embodiments, the directory most relevant to the search question is first searched for using Japanese only (see FIG. 9). Here there is no need to consider the word ambiguity that arises across Japanese and English, so the related directory can be obtained with high accuracy. (Using the Japanese alone, it is easy to determine that the above question "bus free pass for nature park tour" is a search request in the travel field.) Then, by searching for English documents only in the English directory corresponding to that directory, search results contrary to the search intention can be excluded. Furthermore, as described in the embodiments, a multilingual search of still higher accuracy can be performed by using the learning data corresponding to the directory most relevant to the search request.
[0061]
As described above, according to the present invention, it is possible to obtain a document in the second language appropriate for a search request using a sentence in the first language as a search result, and solve the above-described problems.
[0062]
That is, by using the correspondence between the first directory structure for documents in the first language and the second directory structure for documents in the second language, (1) only documents in the second language that belong to a field highly relevant to the search request are searched, and (2) the search is performed using learning data from a field highly relevant to the search request. These two field-limiting effects avoid the problem of word-sense ambiguity (the same word having different meanings in different fields), which has lowered the accuracy of conventional multilingual information retrieval, and dramatically improve the search accuracy of multilingual document search.
[0063]
Furthermore, by using the correspondence between the first directory structure and the second directory structure, learning data for multilingual document search can be automatically generated.
[0064]
[Effects of the invention]
As described above, according to the present invention, it is possible to realize an effect that a document in the second language appropriate for a search request using a sentence in the first language can be obtained as a search result.
[Brief description of the drawings]
FIG. 1 is a diagram showing a configuration of a typical multilingual document search system according to the present invention.
FIG. 2 is a diagram showing a configuration of a multilingual document search system according to Embodiment 1 of the present invention.
FIG. 3 is a diagram illustrating an example of a directory structure.
FIG. 4 is a diagram illustrating an example of a correspondence relationship between directories.
FIG. 5 is a diagram illustrating a storage example of learning data (parallel corpus).
FIG. 6 is a diagram illustrating a configuration of a learning data generation unit according to Embodiment 2 of the present invention.
FIG. 7 is a diagram showing an example of English words corresponding to Japanese words in a Japanese question sentence.
FIG. 8 is a schematic diagram showing an operation example of a typical multilingual information search system.
FIG. 9 is a schematic diagram illustrating an example of a related directory search operation according to the embodiment described above.
[Explanation of symbols]
11 First directory holding means
12 Second directory holding means
13 Directory relationship holding means
14 All directory word vector generation means
15 All directory word vector holding means
16 Directory vector generation means
17 Directory vector holding means
18 Learning data holding means
19 Directory word vector generation means
110 Word vector holding means for each directory
111 Document vector generation means
112 Document vector holding means
113 Search request acquisition means
114 All Directory Search Request Vector Generation Means
115 Directory search means
116 Directory search request vector generation means
117 Multilingual search means
118 Search result display means

Claims

First directory holding means for holding a first directory structure constructed for the first language;
Second directory holding means for holding a second directory structure constructed for the second language;
Directory relationship holding means for holding the correspondence between each directory in the first directory structure and each directory in the second directory structure;
Directory search means for determining which directory in the first directory structure is highly relevant to a search request in a first language from a user;
Multilingual search means for determining, based on the directory correspondence held in the directory relationship holding means, the directory in the second directory structure that corresponds to the directory determined by the directory search means, and for determining, from among the documents in the determined directory in the second directory structure, a document highly relevant to the search request in the first language from the user.
A multilingual document search system comprising the above means.

First directory holding means for holding a first directory structure constructed for the first language;
Second directory holding means for holding a second directory structure constructed for the second language;
Directory relationship holding means for holding the correspondence between each directory in the first directory structure and each directory in the second directory structure;
Directory search means for determining which directory in the first directory structure is highly relevant to a search request in a first language from a user;
Translation means for translating a search request in a first language from the user into a search request in a second language;
Search means for determining, based on the directory correspondence held in the directory relationship holding means, the directory in the second directory structure that corresponds to the directory determined by the directory search means, and for determining, from among the documents belonging to the determined directory in the second directory structure, a document highly relevant to the search request in the second language obtained from the translation means.
A multilingual document search system comprising the above means.

First directory holding means for holding a first directory structure constructed for the first language;
Second directory holding means for holding a second directory structure constructed for the second language;
Directory relationship holding means for holding the correspondence between each directory in the first directory structure and each directory in the second directory structure;
Directory search means for determining which directory in the first directory structure is highly relevant to a search request in a first language from a user;
Translation means for translating each document in the second language in the second directory structure into the first language;
Search means for determining, based on the directory correspondence held in the directory relationship holding means, the directory in the second directory structure that corresponds to the directory determined by the directory search means, and for determining, from among the documents that belong to the determined directory and have been translated into the first language by the translation means, a document highly relevant to the search request in the first language from the user. A multilingual document search system characterized by comprising the above means.
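The document-translation variant can be sketched in the same spirit. Here the second-language documents are translated into the first language ahead of time, so the user's query is matched without being translated. All data and names are hypothetical, and the dictionary lookup is only an assumed placeholder for the translation means.

```python
# Hypothetical directory correspondence, dictionary, and second-language documents.
DIR_MAP = {"en/cameras": "ja/カメラ"}
SECOND_DOCS = {"ja/カメラ": ["カメラ レンズ", "三脚"]}
DICTIONARY = {"カメラ": "camera", "レンズ": "lens", "三脚": "tripod"}


def translate_document(doc):
    """Translation means (stand-in): word-by-word lookup into the first language."""
    return " ".join(DICTIONARY.get(w, w) for w in doc.split())


# Translate every second-language document once, keeping the original next to its translation.
TRANSLATED = {
    directory: [(doc, translate_document(doc)) for doc in docs]
    for directory, docs in SECOND_DOCS.items()
}


def search(first_dir, query):
    """Search means: match the untranslated query against the translated documents."""
    terms = set(query.lower().split())
    original, _ = max(TRANSLATED[DIR_MAP[first_dir]],
                      key=lambda pair: len(terms & set(pair[1].split())))
    return original
```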

The multilingual document search system according to claim 1, further comprising learning data holding means for holding, for each corresponding directory pair held in the directory relationship holding means, learning data for multilingual search, such as dictionary data or bilingual text pairs in the field of that directory pair, wherein the multilingual search means uses the corresponding learning data held in the learning data holding means to determine, from among the documents belonging to the directory in the second directory structure corresponding to the directory determined by the directory search means, a document highly relevant to the search request in the first language from the user.

The multilingual document search system according to claim 2, further comprising learning data holding means for holding, for each corresponding directory pair held in the directory relationship holding means, learning data for translation, such as dictionary data or parallel translation pairs in the field of that directory pair, wherein the translation means translates the search request in the first language from the user into the search request in the second language using the learning data, held in the learning data holding means, that corresponds to the directory determined by the directory search means.
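One way to picture directory-specific learning data is a separate dictionary per directory pair, so that an ambiguous word is translated according to the field of the directory the query was matched to. The sketch below is an assumption-laden illustration: the directory pairs, the dictionary entries, and `translate_query` are hypothetical and not taken from the patent.

```python
# Hypothetical directory correspondence.
DIR_MAP = {"en/finance": "ja/金融", "en/outdoors": "ja/アウトドア"}

# Learning data holding means (stand-in): one field-specific dictionary per directory pair.
LEARNING_DATA = {
    ("en/finance", "ja/金融"): {"bank": "銀行"},       # financial institution
    ("en/outdoors", "ja/アウトドア"): {"bank": "土手"},  # river bank
}


def translate_query(query, first_dir):
    """Translate with the dictionary belonging to the directory pair selected for this query."""
    dictionary = LEARNING_DATA[(first_dir, DIR_MAP[first_dir])]
    return " ".join(dictionary.get(w, w) for w in query.lower().split())
```

The same query word yields different translations depending on which directory the directory search step selected.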

The multilingual document search system according to claim 3, further comprising learning data holding means for holding, for each corresponding directory pair held in the directory relationship holding means, learning data for translation, such as dictionary data or parallel translation pairs in the field of that directory pair, wherein the translation means translates each document in the second language in the second directory structure into the first language using the learning data corresponding to each directory held in the learning data holding means.

The multilingual document search system according to claim 1, further comprising learning data holding means for extracting text pairs from those documents, among the documents belonging to each corresponding directory pair held in the directory relationship holding means, that contain a text pair in the first language and the second language, and for holding the extracted text pairs as learning data for multilingual search, wherein the multilingual search means uses the corresponding learning data held in the learning data holding means to determine, from among the documents belonging to the directory in the second directory structure corresponding to the directory determined by the directory search means, a document highly relevant to the search request in the first language from the user.
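The extraction step in this claim could be sketched as below, assuming each document records its directory and carries its text keyed by language code. That document layout, the language codes, and `build_learning_data` are hypothetical choices made for illustration; the claim does not specify how bilingual documents are represented or detected.

```python
from collections import defaultdict

# Hypothetical directory correspondence and documents.
DIR_MAP = {"en/cameras": "ja/カメラ", "en/books": "ja/本"}
DOCUMENTS = [
    {"dir": "en/cameras", "texts": {"en": "digital camera", "ja": "デジタルカメラ"}},
    {"dir": "ja/カメラ", "texts": {"ja": "三脚"}},  # monolingual, so no pair is extracted
    {"dir": "ja/本", "texts": {"en": "mystery novel", "ja": "推理小説"}},
]


def build_learning_data(documents, dir_map, first="en", second="ja"):
    """Collect (first-language, second-language) text pairs, grouped by directory pair."""
    pair_of = {}
    for first_dir, second_dir in dir_map.items():
        pair_of[first_dir] = pair_of[second_dir] = (first_dir, second_dir)
    data = defaultdict(list)
    for doc in documents:
        texts = doc["texts"]
        # Keep only documents that sit in a mapped directory and contain both languages.
        if doc["dir"] in pair_of and first in texts and second in texts:
            data[pair_of[doc["dir"]]].append((texts[first], texts[second]))
    return dict(data)
```

The resulting pairs are grouped per directory pair, so the learning data stays specific to the field of each directory.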

The multilingual document search system according to claim 2, further comprising learning data holding means for extracting text pairs from those documents, among the documents belonging to each corresponding directory pair held in the directory relationship holding means, that contain a text pair in the first language and the second language, and for holding the extracted text pairs as learning data for translation,
wherein the translation means translates the search request in the first language from the user into the search request in the second language using the learning data, held in the learning data holding means, that corresponds to the directory determined by the directory search means.

The multilingual document search system according to claim 3, further comprising learning data holding means for extracting text pairs from those documents, among the documents belonging to each corresponding directory pair held in the directory relationship holding means, that contain a text pair in the first language and the second language, and for holding the extracted text pairs as learning data for translation,
wherein the translation means translates each document in the second language in the second directory structure into the first language using the learning data corresponding to each directory held in the learning data holding means.

A multilingual document search system comprising directory relationship holding means for holding the correspondence between each directory in a first directory structure constructed for a first language and each directory in a second directory structure constructed for a second language,
wherein the system determines, based on the directory correspondence held in the directory relationship holding means, the directory in the second directory structure that corresponds to a directory in the first directory structure highly relevant to a search request in the first language from a user, and issues a search request instructing a search, among the documents belonging to the determined directory in the second directory structure, for a document highly relevant to the search request in the first language from the user.

A multilingual document search system comprising: directory relationship holding means for holding the correspondence between each directory in a first directory structure constructed for a first language and each directory in a second directory structure constructed for a second language; and
translation means for translating a search request in the first language from a user into a search request in the second language, wherein the system determines, based on the directory correspondence held in the directory relationship holding means, the directory in the second directory structure that corresponds to a directory in the first directory structure highly relevant to the search request in the first language from the user, and issues a search request instructing a search, among the documents belonging to the determined directory in the second directory structure, for a document highly relevant to the user's search request as translated from the first language into the second language.
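In this claim the system does not run the final search itself; it issues a request to whatever engine serves the second directory structure. A minimal sketch of building such a request is shown below. The JSON layout, the field names, and `build_search_request` are assumptions for illustration, since the claim does not define a request format, and the dictionary lookup again stands in for the translation means.

```python
import json

# Hypothetical directory correspondence and translation dictionary.
DIR_MAP = {"en/books": "ja/本"}
DICTIONARY = {"mystery": "推理", "novel": "小説"}


def build_search_request(first_dir, query):
    """Build the request to issue: the target directory plus the translated query."""
    translated = " ".join(DICTIONARY.get(w, w) for w in query.lower().split())
    request = {"directory": DIR_MAP[first_dir], "language": "ja", "query": translated}
    # ensure_ascii=False keeps the Japanese text readable in the serialized request.
    return json.dumps(request, ensure_ascii=False)
```

For example, `build_search_request("en/books", "mystery novel")` produces a request scoped to the `ja/本` directory with the query rendered in the second language.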

First directory holding means for holding a first directory structure constructed for the first language;
A communication means capable of communicating with a second directory holding means for holding a second directory structure constructed for the second language;
Directory relationship holding means for holding the correspondence between each directory in the first directory structure and each directory in the second directory structure;
Directory search means for determining a directory in the first directory structure in response to a search request in a first language from a user;
Multilingual search means for determining, via the communication means and based on the directory correspondence held in the directory relationship holding means, the directory in the second directory structure that corresponds to the directory determined by the directory search means, and for determining, from among the documents belonging to the determined directory in the second directory structure, a document highly relevant to the search request in the first language from the user. A multilingual document search system characterized by comprising the above means.