JP2004054882A

JP2004054882A - Synonym retrieval device, method, program and storage medium

Info

Publication number: JP2004054882A
Application number: JP2002314914A
Authority: JP
Inventors: Hideo Ito; 伊東　秀夫
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2002-05-27
Filing date: 2002-10-29
Publication date: 2004-02-19
Anticipated expiration: 2022-10-29
Also published as: JP4227797B2

Abstract

PROBLEM TO BE SOLVED: To provide a synonym retrieval device, method, program and storage medium capable of inexpensively acquiring synonyms of general applicability when determining synonyms of a search term for a document group.

SOLUTION: A primary retrieval part 5 searches a document storing part 3 using the object term held in an object term buffer 2 and attaches a document score to each retrieved document. The document score is larger the more often the search term occurs in the document, the shorter the document is, and the fewer documents contain the search term. The R documents with the highest document scores are taken as seed documents and are broken down into words; each word is treated as a related term candidate, its degree of association with the object term is determined by a predetermined calculation, and the T candidates with the highest degree of association are taken as related terms. A secondary retrieval part 8 searches the document storing part 3 again in the same way as the primary retrieval, using the related term group stored in a related term buffer 7 as the search term group, determines seed documents, and extracts related terms from them. A synonym selecting part 9 selects the S highest-ranking extracted related terms as synonym candidates.

COPYRIGHT: (C)2004, JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、文書検索における検索語の類義語を求める類義語検索装置、方法、及びこの装置で用いられる類義語検索プログラム、この類義語検索プログラムを記憶した記憶媒体に関する。
【０００２】
【従来の技術】
データベース等に蓄積された多数の電子文書等の文書群から検索語（キーワード）に関連する文書を抽出する文書検索においては、検索者が検索語Ａを与えた場合、検索対象となる文書中では検索語Ａの類義語Ｂが用いられている場合がある。その場合、検索処理が語形に基づいて行われる限り、その文書は検索要求に合致する可能性が高いにもかかわらず、検索抽出することができない。この問題は一般的には語彙的ギャップ問題として従来より指摘されている。
尚、ここで類義語とは、例えば
・表記レベルの類義語：「コンピュータ」と「コンピューター」
・語彙レベルの類義語：「本」と「書籍」
のような語を言うものとする。
【０００３】
従って、検索要求中の検索語についてその類義語を求めることができれば、それらの類義語群を検索要求に含めることで、検索性能が向上することが期待できる。このため従来より、類義語群の自動獲得方法に関して例えば次のような提案がなされている。
・対象文字列の構成単語毎に予め用意した類義語ファイルから類義語を取得し、それらを組み合わせて対象文字列に対する類義表現を構成する（例えば特許文献１）。
・「〜とは〜である」といったパターンを用いて文書群から類義語を抽出する（例えば特許文献２）。
【０００４】
・複合語とその構成語間の関係を類義関係とみなし、入力単語に対し、それを構成語とする複合語を類義語として出力する（例えば特許文献３）。
・既存シソーラスに入力単語を登録する際に、入力単語の構成単語を求め、それらのシソーラス中の登録位置に基づき入力単語の登録位置を決定する（例えば特許文献４）。
・過去にユーザから入力された検索語群を記憶しておき、それらの中から現在の入力語に対する類義語（よく共起する語）を取り出す（例えば特許文献５）。
・括弧内とそれに前出する語との対を同義関係として抽出する（例えば特許文献６）。
【０００５】
【特許文献１】
特開平６−１６２０９８号公報（特許第３０２５７２４号）
【特許文献２】
特開平６−２６６７６９号公報
【特許文献３】
特開平７−３１９８８４号公報
【特許文献４】
特開平８−２２１４２７号公報
【特許文献５】
特開平９−３１９７６７号公報
【特許文献６】
特開平１１−３２８２０５号公報
【０００６】
【発明が解決しようとする課題】
しかしながら上記の従来技術では、類義語辞書、シソーラス、あるいは類義語抽出用の規則群などを必要とするが、これらのデータ群は予め人手により開発、作成しなければならず、そのための多大なコストがかかるという問題があった。また、括弧表現や同義表現などを利用することで規則群を容易に整備できたとしても、それらの表現が実際に出現する頻度は一般的には少なく、従って、得られる類義語も少量となる。即ち、一般性に欠けるという問題があった。
【０００７】
本発明は上記の問題を解決するためになされたもので、従来に比べて低コストで、かつ一般性のある類義語を獲得できるようにした類義語検索装置、方法、プログラム及び記憶媒体を提供することを目的とする。
【０００８】
【課題を解決するための手段】
上記の目的を達成するために、本発明による類義語検索装置においては、対象語を入力する入力手段と、文書群を記憶する文書記憶手段と、前記入力された対象語を検索語として前記文書記憶手段を検索し、検索語が出現する文書に第１の演算により文書スコアを付与し、文書スコアの高い順にランキングし、このランキングの上位所定数の文書をシード文書として取り出し、このシード文書を構成する単語を関連語候補として抽出し、抽出された関連語候補を検索語として前記シード文書を検索し、検索語が出現する文書について第２の演算により前記対象語と検索語との関連度を求め、前記関連語候補から前記関連度の高い順に上位所定数の検索語を関連語として抽出する１次検索手段と、前記抽出された関連語を検索語として前記文書記憶手段を検索し、検索語が出現する文書について前記第１及び第２の演算により関連語を抽出する２次検索手段と、前記２次検索手段で抽出された関連語から関連度の高い順に上位所定数の関連語を類義語候補として選択する類義語選択手段とを設けている。
【０００９】
また、本発明による類義語検索方法は、入力された対象語を検索語として文書群を検索し、検索語が出現する文書に第１の演算により文書スコアを付与し、文書スコアの高い順にランキングし、このランキングの上位所定数の文書をシード文書として取り出し、このシード文書を構成する単語を関連語候補として抽出し、抽出された関連語候補を検索語として前記シード文書を検索し、検索語が出現する文書について第２の演算により前記対象語と検索語との関連度を求め、前記関連語候補から前記関連度の高い順に上位所定数の検索語を関連語として抽出し、前記抽出された関連語を検索語として前記文書記憶手段を検索し、検索語が出現する文書について前記第１及び第２の演算により関連語を抽出し、前記抽出された関連語から関連度の高い順に上位所定数の関連語を類義語候補として選択するようにしている。
【００１０】
また、本発明によるプログラムは、対象語を入力する入力処理と、前記入力された対象語を検索語として文書群を検索し、検索語が出現する文書に第１の演算により文書スコアを付与し、文書スコアの高い順にランキングするランキング処理と、前記ランキングの上位所定数の文書をシード文書として取り出し、このシード文書を構成する単語を関連語候補として抽出する抽出処理と、抽出された関連語候補を検索語として前記シード文書を検索し、検索語が出現する文書について第２の演算により前記対象語と検索語との関連度を求め、前記関連語候補から前記関連度の高い順に上位所定数の検索語を関連語として抽出する抽出処理と、前記抽出された関連語を検索語として前記文書群を検索し、検索語が出現する文書について前記第１及び第２の演算により関連語を抽出する抽出処理と、前記抽出された関連語から関連度の高い順に上位所定数の関連語を類義語候補として選択する選択処理とをコンピュータに実行させるためのプログラムである。
【００１１】
また、本発明による記憶媒体は、上記プログラムを記憶したものである。
【００１２】
【発明の実施の形態】
以下、本発明の実施の形態を図面と共に説明する。
本実施の形態は、対象語に対し、文書群（コレクション）をランキング検索し、その上位文書群に現れる語群のうちから、対象語の関連語群を抽出し、その関連語群のみを用いて再度コレクションをランキング検索し、その上位文書群の内、対象語を含まない文書郡に現れる関連語群から関連度を利用して類義語の候補を求めることで前記課題を達成するものである。
【００１３】
図７は本発明の実施の形態による類義語検索装置を示すブロック図である。
本装置は、図示のようにユーザによる対象語入力操作等を行う入力装置２０、検索結果得られた類義語候補を出力する出力装置３０、プログラム格納用ＲＯＭ、作業用ＲＡＭ等の記憶装置４０及び全体を制御するＣＰＵ５０から構成されている。
【００１４】
図１は本発明の第１の実施の形態によるＣＰＵにおける制御部の構成を示すブロック図である。
図１において、１はユーザにより入力装置２０から入力された対象語（ここでは類義語を検索するための検索語）を受付ける入力部、２は入力された対象語を格納する対象語バッファ、３は文書群が格納された文書記憶部、４は文書記憶部３を検索する文書検索部、５は対象語バッファ２の対象語を検索語として文書検索部４を介して文書記憶部３をランキング検索すると共に、関連語候補群を求め、その中から関連語を求める１次検索部、６は関連語候補群から関連語を抽出する関連語抽出部、７は抽出された関連語を格納する関連語バッファ、８は関連語を用いて文書群をランキング検索し、関連語を抽出する２次検索部、９は２次検索された関連語から類義語候補を選択する類義語選択部、１０は類義語を出力する出力部である。
【００１５】
次に、上記構成による動作について説明する。
図２は制御部の動作を概略的に示すフローチャートである。
まず、入力部１を介して対象語を入力し、入力された対象語は対象語バッファ２に格納される（ステップＳ１、以下、ステップ略）。次に、１次検索部５により文書記憶部３の１次検索を行い（Ｓ２）、その検索結果に基づいて２次検索部８より２次検索を行う（Ｓ３）。次に、２次検索の結果に基づいて類義語選択部９により類義語候補を選択し（Ｓ４）、出力部１０により類義語候補の語形を出力する（Ｓ５）。
【００１６】
図３は１次検索部５の動作を示すフローチャートである。
１次検索部５は、まず、対象語バッファ２の対象語を検索語として文書検索部４を介して以下のような文書ランキング検索を行う（Ｓ１１）。即ち、文書記憶部３に記憶された各文書について次に定義される文書スコア（ｓｃｏｒｅ）を計算する。
ｓｃｏｒｅ＝｛ｔｆ／（ｔｆ＋ｄｌｅｎ）｝×ｗｅｉｇｈｔ・・・（１）
ｗｅｉｇｈｔ＝Ｌｏｇ（Ｎ／ｎ＋１）・・・（２）
【００１７】
上記（１）式において、ｔｆはその文書に検索語（対象語）が出現する頻度であり、ｄｌｅｎは文書長である。このスコア定義によれば、検索語が多く現れるほど、かつ文書長が短いほど、かつｗｅｉｇｈｔ（重み）が大きいほど、その文書に大きなスコアが付与される。
また、上記（２）式において、Ｎは文書記憶部３に記憶された文書の総数であり、ｎは総数Ｎの文書中で検索語が出現する文書数（文書頻度）である。この重み定義式によれば、少数の文書に出現するほど、その検索語の重みは大きくなる。
【００１８】
尚、検索語が複数の場合は、各検索語について上記スコアを求め、それらを加算することにより最終的な文書スコアを得る。検索語が１つも現れない文書のスコアは０とする。
上記文書ランキング検索により、文書スコア順に各文書がソートして出力される。
【００１９】
１次検索部５は次に、上記文書ランキングの結果から、上位にランクされた文書を所定の数Ｒだけ、上位から順に文書記憶部３から取り出す（Ｓ１２）。これらのＲ個の上位文書をここではシード文書と呼ぶ。次に、シード文書から１つずつ文書を取り出し（Ｓ１２）、その文書Ｄについて関連語抽出部６により、関連語候補群を以下のようにして求める（Ｓ１３）。
【００２０】
まず、文書Ｄを形態素解析あるいは単語の区切り文字等を用いて単語に分解する。そして各単語を関連語候補としてそれぞれについて次の属性を求める。
・文書頻度ｎ：その単語を検索語として文書検索を行うことにより、その単語の文書頻度（文書総数Ｎの中の出現文書数）を求める。
・出現シード文書数ｒ：総数Ｒのシード文書中のその単語が出現したシード文書数
・重み（ｗｅｉｇｈｔ）：前記（２）式により求める。
・選択値（ｔｓｖ）＝ｗｅｉｇｈｔ×（ｒ／Ｒ−ｎ／Ｎ）・・・（３）
【００２１】
１次検索部５は、上記のようにして全てのシード文書について関連語候補群（単語群）とその属性を求めた後（Ｓ１４）、各候補を選択値の降順にソートし、その中から上位Ｔ個を選択して関連語とする。ただしその場合、対象語と語形が同一の関連語候補は関連語とはしない。次に、１次検索部５は、各関連語の語形と選択値とのペアを関連語バッファ７に格納する。選択値は、重みが大きいほど、かつ対象語と文書内共起する確率が高いほど大きくなる。従って、選択値は、対象語と関連語候補との関連の度合い、即ち、関連度を表すものとなる。
【００２２】
図４は２次検索部８の動作を示すフローチャートである。
２次検索部８は、関連語バッファ７に格納された関連語群を検索語群として文書検索部４を介して文書記憶部３の文書ランキング検索を行う（Ｓ２１）。その際、対象語は検索語群に含めないものとする。１次検索部５の場合と同様にしてシード文書を求め（Ｓ２２）、このシード文書から関連語を抽出するが、対象語を含む上位文書はシード文書とはしない（Ｓ２３、Ｓ２４）。これは一般に、対象語を含む文書は、その文書中に一貫してその語を使用し、その類義語は使用されない場合が多いからである。全てのシード文書について２次検索の結果得られた関連語は関連語バッファ７に格納される（Ｓ２５）。
【００２３】
類義語選択部９は、２次検索の結果、関連語バッファ７に得られた関連語群から関連度（選択値）が大きい順に上位Ｓ個を類義語候補として選択する。この類義語は出力部１０から出力される。
【００２４】
図５は本発明の第２の実施の形態による制御部の構成を示すブロック図であり、図１と対応する部分には同一番号を付して重複する説明は省略する。
本実施の形態は、対象語バッファ２と関連語バッファ７との間に判定部１１を追加したものである。
【００２５】
図６は制御部の動作を示すフローチャートである。
まず、入力部１を介して対象語を入力し、入力された対象語は対象語バッファ２に格納される（Ｓ３１）。次に１次検索部５により文書記憶部３を１次検索を行う（Ｓ３２）。次に、判定部１１により処理を続行するか否かを判断し（Ｓ３３）、続行しない場合は、類義語候補が得られなかった旨のメッセージを出力部１０から出力する（Ｓ３４）。次に、１次検索の結果に基づいて２次検索部に８より文書記憶部３の２次検索を行う（Ｓ３５）。そして、２次検索の結果に基づいて類義語選択部９により類義語候補を選択し（Ｓ３６）、出力部１０により類義語候補の語形を出力する（Ｓ３７）。上記Ｓ３１、Ｓ３２及びＳ３５からＳ３７は、図１のＳ１からＳ５と同様に行われる。
【００２６】
上記Ｓ３３において、判定部１１は以下のようにして処理続行の可否を判定する。即ち、１次検索の結果、得られた関連語群を関連度の順にソートし、この関連語のランキングにおいて、対象語が上位ｋ位以内にならなかった場合は処理続行しないと判断する。それ以外は処理続行すると判断する。つまり、対象語と最も関連度が高くなるべき語は対象語自身であり、その状況から外れるほど有効性が低い関連語が得られたと判断し、その場合は処理を続行しない。
【００２７】
本実施の形態によれば、文書群の不足等で類義語が得られない対象語に対しては、その旨メッセージを出力するので、類義語の獲得結果の品質を高めることができる。
尚、図２のＳ５及び図６のＳ３７においては、出力部１０により類義語の語形のみを出力しているが、関連語バッファ７に記憶された関連度も類義語の語形と共に出力するようにしてもよい。このようにすることにより、類義語候補を更に絞り込むための情報として、各類義語候補毎にその関連度を出力するので、類義語の獲得結果の品質を高めることができる。
【００２８】
次に、本発明の第３から６の実施の形態について説明する。図８、図１２、図１３、図１６は各実施の形態による制御部の構成を示すもので、図１、図５と対応する部分には同一番号を付して重複する説明は省略する。尚、類義語検索装置の構成は図７の構成と同一である。
【００２９】
図８は本発明の第３の実施の形態による制御部の構成を示すブロック図である。
尚、図１、図５における１次検索部５及び２次検索部８は図示を省略されている。
また、類語語選択部９は類義語抽出部１２として図示されている。
【００３０】
図９は制御部の動作を示すフローチャートである。
入力部１を用いて対象語を受け付け、対象語は対象語バッファ２に格納される（Ｓ４１）。次に１次検索を行い（Ｓ４２）、さらに２次検索を行う（Ｓ４３）。次に出力部１０を用いて類似語群を出力する（Ｓ４４）。
【００３１】
図１０は１次検索の動作を示すフローチャートである。
１次検索では、まず対象語を検索語として文書検索部４を用いて文書ランキング検索を行う（Ｓ５１）。文書ランキング検索は前記（１）（２）式を用いて文書スコアを計算することにより行われる。次に、文書ランキングの結果から、上位文書（上位にランクされた文書）を予め定めた数（Ｒとする）だけ、上位から順に文書記憶部３から取り出す（Ｓ５２）。これら上位文書をここではシード文書と呼ぶ。次に、各シード文書Ｄに対して関連語抽出部６を用いて関連語候補群を得る（Ｓ５３）。関連語候補群は、文書Ｄを形態素解析あるいは単語の区切り文字等を用いて単語に分解して前記（３）式による選択値により求める。
【００３２】
次に１次検索は、全てのシード文書に対して関連語候補群とその属性が求めた後（Ｓ５４）、各候補を選択値の降順にソートし、予め定めた関連語数Ｔに基づき、上位Ｔ個を関連語とする。但し、対象語と語形が同一の関連語候補は、関連語とはしない。次に、各関連語の語形および選択値のペアを関連語バッファに格納する。選択値は、重みが大きいほど、かつ対象語と文書内共起する確率が大きいほど大きくなる。よって選択値は、対象語と関連語候補との関連の度合い（関連度と呼ぶ）を表す。
【００３３】
図１１は２次検索の動作を示すフローチャートである。
２次検索では、まず関連語バッファ７に格納された関連語群を検索語群として文書検索部４を用いて、文書ランキング検索を行う（Ｓ６１）。次に、文書ランキングの結果から、上位文書（上位にランクされた文書）を予め定めた数（Ｒ２とする）だけ、上位から順に文書記憶部３から取り出す（Ｓ６２）。これら上位文書をここでは第２シード文書Ｄと呼ぶ。次に、各第２シード文書Ｄに対して類義語抽出部１２を用いて、類義語候補群を以下のようにして得る（Ｓ６３）。
【００３４】
即ち、文書Ｄを形態素解析あるいは単語の区切り文字等を用いて単語に分解する。そして各単語ｓに対して次の属性を得る。
共起度ｃ：第２シード文書において単語ｓの各出現位置において、その前後Ｗ語以内の位置に出現した関連語の総数。ここでＷは予定めた定数である。
即ち、単語ｓに対する共起度ｃとは、第２シード文書群においてｓの付近に出現した関連語の総数である。
【００３５】
尚、ｃはｓの第２シード文書群における出現頻度ｔｆを１とする相対頻度としてもよい。この場合
ｃ’　＝　ｃ／ｔｆ
である。
【００３６】
次に、単語ｓとｃ、あるいはｓとｃ’を類義語候補の表現としてプールする。そして、全ての第２シード文書に対して類義語候補が求まった段階で（Ｓ６４）、各候補を共起度ｃの降順にソートし、予め定めた類義語数Ｓに基づき、上位Ｓ個を類義語とする。出力部１０は上記のＳ個の類義語を出力する。
【００３７】
尚、上記類義語を出力する場合、対象語と同一の語形の類義語が出力される場合がある。このため本発明の第４の実施の形態では、このような自明の類義語を除いて出力する。図１２は第４の実施の形態による制御部を示すもので、出力部１０において対象語バッファ２に記憶された対象語の語形と類義語の語形を比較し、同一であれば出力しない。
【００３８】
以上のように第３、第４の実施の形態は、対象語に対し、文書群をランキング検索し、その上位文書群から対象語の関連語群を求め、その関連語群を用いて再び文書群をランキング検索し、その上位文書群と関連語群から類義語群を求めるようにしたものである。
【００３９】
図１３は本発明の第５の実施の形態による制御部の構成を示すブロック図であり、図８に補足語バッファ１３を追加したものである。
【００４０】
図１４は制御部の動作を示すフローチャートである。
まず、入力部１を用いて対象語と補足語群を受け付け、対象語は対象語バッファ２に格納され、補足語群は補足語バッファ１３に格納される（Ｓ７１）。次に１次検索を行い、さらに２次検索を行う（Ｓ７２、Ｓ７３）。そして、出力部１０より類似語群を出力する（Ｓ７４）。
【００４１】
図１５は１次検索の動作を示すフローチャートである。
Ｓ８１〜Ｓ８４における前記図１０との違いは、Ｓ８１で対象語だけでなく、補足語群も用いて文書検索部４により文書ランキング検索を行う点である。
【００４２】
ここで補足語について説明する。
例えば、対象語“ＴＣＰ”は次の２つの意味を持つ多義語である。
意味１：ネットワークのプロトコル
意味２：リン酸三カルシウム
そこで、意味１の類義語を求める場合は、補足語群として“ネットワーク、通信”などを与えればよい。
【００４３】
２次検索は図１１と同様に行われ、出力部１０は前記Ｓ個の類義語を出力する。この場合、対象語と同一の語形の類義語が出力される場合があるため、本発明の第６の実施の形態においては、このような自明の類義語を除いて出力する。図１６は第６の実施の形態による制御部を示すもので、出力部１０は、対象語バッファ２に記憶された対象語の語形と類義語の語形を比較し、同一であれば出力しない。
【００４４】
以上のように第５、第６の実施の形態は、対象語と補足語群に対し、文書群をランキング検索し、その上位文書群から対象語の関連語群を求め、その関連語群を用いて再び文書群をランキング検索し、その上位文書群と関連語群から類義語群を求めるようにしたものである。
【００４５】
尚、各フローチャートについて説明した処理を図１、５、８、１２、１３、１６の制御部、図７のＣＰＵが実行するためのプログラムは本発明によるプログラムを構成する。また、このプログラムを記憶する図７の記憶装置４０等の記憶媒体は、本発明による記憶媒体を構成する。この記憶媒体としては、光ディスク、光磁気ディスク、磁気記録媒体、半導体記憶装置等であってよい。
【００４６】
【発明の効果】
以上説明したように本発明によれば、類義語検索のために文書群のみを利用すればよいので、従来に比べて低コストで、かつ一般性のある類義語を獲得することができる。
【００４７】
また、文書群の不足等で類義語が得られない対象語についてその旨メッセージを出力することにより、獲得される類義語の品質を高めることができる。
また、類義語候補を更に絞り込むための情報として、各類義語候補毎にその関連度（共起度を含む）を出力することにより、獲得される類義語の品質を高めることができる。
さらに、多義語についても低コストで、かつ一般性のある類義語を獲得することができると共に、獲得される類義語の品質を高めることができる。
【図面の簡単な説明】
【図１】本発明の第１の実施の形態による類義語検索装置における制御部の構成を示すブロック図である。
【図２】第１の実施の形態による制御部全体の処理を示すフローチャートである。
【図３】１次検索部の処理を示すフローチャートである。
【図４】２次検索部の処理を示すフローチャートである。
【図５】本発明の第２の実施の形態による類義語検索装置における制御部の構成を示すブロック図である。
【図６】第２の実施の形態による制御部全体の処理を示すフローチャートである。
【図７】本発明の実施の形態による類義語検索装置を示すブロック図である。
【図８】本発明の第３の実施の形態による類義語検索装置における制御部の構成を示すブロック図である。
【図９】第３の実施の形態による制御部全体の処理を示すフローチャートである。
【図１０】１次検索部の処理を示すフローチャートである。
【図１１】２次検索部の処理を示すフローチャートである。
【図１２】本発明の第４の実施の形態による類義語検索装置における制御部の構成を示すブロック図である。
【図１３】本発明の第５の実施の形態による類義語検索装置における制御部の構成を示すブロック図である。
【図１４】第５の実施の形態による制御部全体の処理を示すフローチャートである。
【図１５】１次検索部の処理を示すフローチャートである。
【図１６】本発明の第６の実施の形態による類義語検索装置における制御部の構成を示すブロック図である。
【符号の説明】
１　入力部
２　対象語バッファ
３　文書記憶部
４　文書検索部
５　１次検索部
６　関連語抽出部
７　関連語バッファ
８　２次検索部
９　類義語選択部
１０　出力部
１１　判定部
１２　類義語抽出部
１３　補足語バッファ
２０　入力装置
３０　出力装置
４０　記憶装置
５０　ＣＰＵ

[0001]
[Technical field of the invention]
The present invention relates to a synonym search device and method for obtaining a synonym of a search word in a document search, a synonym search program used in the device, and a storage medium storing the synonym search program.
[0002]
[Prior art]
In a document search that extracts documents related to a search term (keyword) from a document group, such as a large number of electronic documents stored in a database, a searcher may supply a search term A while a document to be searched uses a synonym B of A instead. In that case, as long as the search process is based on word form, the document cannot be retrieved even though it is highly likely to match the search request. This problem has long been pointed out as the lexical gap problem.
Here, synonyms are words such as the following:
- Notation-level synonyms: 「コンピュータ」 and 「コンピューター」 (two spellings of "computer")
- Vocabulary-level synonyms: 「本」 and 「書籍」 (both meaning "book")
[0003]
Therefore, if synonyms can be obtained for the search terms in a search request, search performance can be expected to improve by including those synonyms in the request. For this reason, methods such as the following have been proposed for automatically acquiring synonym groups.
- For each constituent word of a target character string, synonyms are acquired from a synonym file prepared in advance, and they are combined to form synonymous expressions for the target character string (for example, Patent Document 1).
- Synonyms are extracted from a document group using a pattern such as "X is Y" (for example, Patent Document 2).
[0004]
- The relation between a compound word and its constituent words is regarded as a synonymous relation, and for an input word, compound words having it as a constituent word are output as synonyms (for example, Patent Document 3).
- When an input word is registered in an existing thesaurus, the constituent words of the input word are obtained, and the registration position of the input word is determined from their registration positions in the thesaurus (for example, Patent Document 4).
- Search word groups input by users in the past are stored, and synonyms (frequently co-occurring words) for the current input word are taken from them (for example, Patent Document 5).
- A pair consisting of a parenthesized expression and the word preceding it is extracted as a synonymous relation (for example, Patent Document 6).
[0005]
[Patent Document 1]
JP-A-6-162098 (Japanese Patent No. 3025724)
[Patent Document 2]
JP-A-6-266769
[Patent Document 3]
JP-A-7-319884
[Patent Document 4]
JP-A-8-221427
[Patent Document 5]
JP-A-9-319767
[Patent Document 6]
JP-A-11-328205
[0006]
[Problems to be solved by the invention]
However, the conventional techniques described above require a synonym dictionary, a thesaurus, a rule set for synonym extraction, or the like. These data must be developed and created manually in advance, which incurs a large cost. Further, even if a rule set can be prepared easily by exploiting parenthetical expressions or synonymous expressions, such expressions generally occur infrequently, so only a small number of synonyms is obtained. That is, these techniques lack generality.
[0007]
The present invention has been made to solve the above problems, and its object is to provide a synonym search device, method, program, and storage medium that can acquire synonyms of general applicability at a lower cost than before.
[0008]
[Means for Solving the Problems]
To achieve the above object, the synonym search device according to the present invention comprises: input means for inputting a target word; document storage means for storing a document group; primary search means that searches the document storage means using the input target word as a search word, assigns a document score by a first operation to each document in which the search word appears, ranks the documents in descending order of document score, takes out a predetermined number of top-ranked documents as seed documents, extracts the words constituting the seed documents as related word candidates, searches the seed documents using the extracted related word candidates as search words, obtains by a second operation the degree of relevance between the target word and each search word for the documents in which the search word appears, and extracts from the related word candidates a predetermined number of search words in descending order of relevance as related words; secondary search means that searches the document storage means using the extracted related words as search words and extracts related words by the first and second operations for the documents in which the search words appear; and synonym selection means that selects, from the related words extracted by the secondary search means, a predetermined number of related words in descending order of relevance as synonym candidates.
[0009]
Further, the synonym search method according to the present invention searches a document group using an input target word as a search word, assigns a document score by a first operation to each document in which the search word appears, ranks the documents in descending order of document score, takes out a predetermined number of top-ranked documents as seed documents, extracts the words constituting the seed documents as related word candidates, searches the seed documents using the extracted related word candidates as search words, obtains by a second operation the degree of relevance between the target word and each search word for the documents in which the search word appears, extracts from the related word candidates a predetermined number of search words in descending order of relevance as related words, searches the document storage means using the extracted related words as search words, extracts related words by the first and second operations for the documents in which the search words appear, and selects from the extracted related words a predetermined number of related words in descending order of relevance as synonym candidates.
[0010]
Further, the program according to the present invention causes a computer to execute: an input process of inputting a target word; a ranking process of searching a document group using the input target word as a search word, assigning a document score by a first operation to each document in which the search word appears, and ranking the documents in descending order of document score; an extraction process of taking out a predetermined number of top-ranked documents as seed documents and extracting the words constituting the seed documents as related word candidates; an extraction process of searching the seed documents using the extracted related word candidates as search words, obtaining by a second operation the degree of relevance between the target word and each search word for the documents in which the search word appears, and extracting from the related word candidates a predetermined number of search words in descending order of relevance as related words; an extraction process of searching the document group using the extracted related words as search words and extracting related words by the first and second operations for the documents in which the search words appear; and a selection process of selecting, from the extracted related words, a predetermined number of related words in descending order of relevance as synonym candidates.
[0011]
Further, a storage medium according to the present invention stores the above-mentioned program.
[0012]
[Embodiments of the invention]
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
In the present embodiment, the document group (collection) is ranking-searched for the target word, and a related word group of the target word is extracted from the words appearing in the top-ranked documents. The collection is then ranking-searched again using only that related word group, and synonym candidates are obtained, using the degree of relevance, from the related words appearing in those top-ranked documents that do not contain the target word. The above object is thereby achieved.
[0013]
FIG. 7 is a block diagram showing a synonym search device according to an embodiment of the present invention.
As shown in the figure, the device comprises an input device 20 with which the user performs operations such as inputting a target word, an output device 30 that outputs the synonym candidates obtained as the search result, a storage device 40 such as a ROM for storing programs and a RAM for working memory, and a CPU 50 that controls the whole device.
[0014]
FIG. 1 is a block diagram showing a configuration of a control unit in a CPU according to the first embodiment of the present invention.
In FIG. 1, 1 is an input unit that receives a target word (here, the search word whose synonyms are to be found) input by the user from the input device 20; 2 is a target word buffer that stores the input target word; 3 is a document storage unit in which a document group is stored; 4 is a document search unit that searches the document storage unit 3; 5 is a primary search unit that performs a ranking search of the document storage unit 3 via the document search unit 4 using the target word in the target word buffer 2 as the search word, obtains a related word candidate group, and obtains related words from it; 6 is a related word extraction unit that extracts related words from the related word candidate group; 7 is a related word buffer that stores the extracted related words; 8 is a secondary search unit that performs a ranking search of the document group using the related words and extracts related words; 9 is a synonym selection unit that selects synonym candidates from the related words obtained by the secondary search; and 10 is an output unit that outputs the synonyms.
[0015]
Next, the operation of the above configuration will be described.
FIG. 2 is a flowchart schematically showing the operation of the control unit.
First, a target word is input via the input unit 1 and stored in the target word buffer 2 (step S1; the word "step" is omitted hereinafter). Next, the primary search unit 5 performs a primary search of the document storage unit 3 (S2), and the secondary search unit 8 performs a secondary search based on the result (S3). Next, the synonym selection unit 9 selects synonym candidates based on the result of the secondary search (S4), and the output unit 10 outputs the word forms of the synonym candidates (S5).
[0016]
FIG. 3 is a flowchart showing the operation of the primary search unit 5.
The primary search unit 5 first performs the following document ranking search via the document search unit 4, using the target word in the target word buffer 2 as the search word (S11). That is, for each document stored in the document storage unit 3, it calculates the document score (score) defined as follows.
score = {tf / (tf + dlen)} × weight   ... (1)
weight = Log(N / n + 1)   ... (2)
[0017]
In equation (1) above, tf is the frequency with which the search word (target word) appears in the document, and dlen is the document length. Under this definition, a document receives a larger score the more often the search word appears in it, the shorter the document is, and the larger the weight is.
In equation (2) above, N is the total number of documents stored in the document storage unit 3, and n is the number of those N documents in which the search word appears (the document frequency). Under this definition, the fewer documents a search word appears in, the larger its weight.
[0018]
When there are a plurality of search terms, the above-mentioned scores are obtained for each search term, and a final document score is obtained by adding them. A document in which no search word appears has a score of 0.
By the document ranking search, each document is sorted and output in the order of the document score.
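The scoring in equations (1) and (2), including the summation over multiple search words, can be sketched as follows. This is only an illustrative sketch under stated assumptions: documents are already tokenized into word lists, dlen is taken to be the number of tokens, Log is taken to be the natural logarithm, and equation (2) is read as log((N / n) + 1). The function names `weight` and `rank_documents` are hypothetical and do not come from the patent.

```python
import math
from collections import Counter


def weight(N, n):
    # Equation (2), read here as log((N / n) + 1); natural log is an assumption.
    return math.log(N / n + 1)


def rank_documents(docs, terms):
    """Score every document against the search terms and sort by score.

    docs  : list of documents, each a list of word tokens (assumed pre-tokenized)
    terms : list of search words
    Returns a list of (score, doc_index) pairs, highest score first.
    """
    N = len(docs)
    counts = [Counter(d) for d in docs]
    # Document frequency n of each search word.
    df = {t: sum(1 for c in counts if c[t] > 0) for t in terms}
    ranked = []
    for i, (doc, c) in enumerate(zip(docs, counts)):
        score = 0.0  # stays 0 when no search word appears in the document
        for t in terms:
            tf = c[t]
            if tf > 0:
                # Equation (1); dlen is taken as the token count of the document.
                score += tf / (tf + len(doc)) * weight(N, df[t])
        ranked.append((score, i))
    # Descending score; ties are broken by document index for determinism.
    ranked.sort(key=lambda p: (-p[0], p[1]))
    return ranked
```

Taking the first R entries of the returned list would give the seed documents described in the text.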
[0019]
Next, from the result of the document ranking, the primary search unit 5 takes a predetermined number R of top-ranked documents out of the document storage unit 3, in order from the top (S12). These R top-ranked documents are referred to here as seed documents. Next, the seed documents are taken one at a time (S12), and for each such document D the related word extraction unit 6 obtains a related word candidate group as follows (S13).
[0020]
First, the document D is decomposed into words using morphological analysis, word delimiters, or the like. Then, treating each word as a related word candidate, the following attributes are obtained for it.
- Document frequency n: the document frequency of the word (the number of documents, out of the total N, in which it appears), obtained by performing a document search with the word as the search word.
- Number of seed documents of appearance r: the number of seed documents, out of the total R, in which the word appears.
- Weight (weight): obtained by equation (2) above.
- Selection value (tsv) = weight × (r / R − n / N)   ... (3)
[0021]
After obtaining the related word candidate group (word group) and its attributes for all the seed documents as described above (S14), the primary search unit 5 sorts the candidates in descending order of selection value and selects the top T of them as related words. However, a related word candidate whose word form is identical to the target word is not treated as a related word. Next, the primary search unit 5 stores the pair of word form and selection value of each related word in the related word buffer 7. The selection value becomes larger as the weight becomes larger and as the probability of co-occurring with the target word within a document becomes higher. The selection value therefore represents the degree of association between the target word and the related word candidate, that is, the degree of relevance.
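As an illustration of equation (3) and the top-T selection, the following sketch computes the selection value for every word in the seed documents. It is a sketch under assumptions rather than the patented implementation: documents are pre-tokenized word lists, the weight uses the natural logarithm with equation (2) read as log((N / n) + 1), and the name `select_related_terms` and its parameters are hypothetical.

```python
import math


def select_related_terms(docs, seed_ids, target, T):
    """Return the top-T (word, tsv) pairs drawn from the seed documents.

    docs     : list of documents, each a list of word tokens
    seed_ids : indices of the R seed documents within docs
    target   : the target word, which is excluded from the result
    T        : number of related words to keep
    """
    N = len(docs)
    R = len(seed_ids)
    doc_sets = [set(d) for d in docs]

    # Every word occurring in a seed document is a related word candidate.
    candidates = set()
    for i in seed_ids:
        candidates |= doc_sets[i]
    candidates.discard(target)  # same word form as the target is not a related word

    scored = []
    for w in candidates:
        n = sum(1 for s in doc_sets if w in s)           # document frequency
        r = sum(1 for i in seed_ids if w in doc_sets[i])  # seed documents containing w
        wt = math.log(N / n + 1)                          # equation (2), assumed reading
        tsv = wt * (r / R - n / N)                        # equation (3)
        scored.append((w, tsv))

    # Descending selection value; ties broken alphabetically for determinism.
    scored.sort(key=lambda p: (-p[1], p[0]))
    return scored[:T]
```

A word that appears in every document gets r / R − n / N ≤ 0, so very common words tend to fall out of the top T even when they occur in all seed documents.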
[0022]
FIG. 4 is a flowchart showing the operation of the secondary search unit 8.
The secondary search unit 8 performs a document ranking search of the document storage unit 3 via the document search unit 4, using the related word group stored in the related word buffer 7 as the search word group (S21). The target word is not included in the search word group. Seed documents are obtained in the same manner as in the primary search unit 5 (S22), and related words are extracted from them; however, a top-ranked document that contains the target word is not treated as a seed document (S23, S24). This is because a document that contains the target word generally uses that word consistently throughout and in many cases does not use its synonyms. The related words obtained from the secondary search for all the seed documents are stored in the related word buffer 7 (S25).
[0023]
From the related word group held in the related word buffer 7 as a result of the secondary search, the synonym selection unit 9 selects the top S words in descending order of relevance (selection value) as synonym candidates. These synonym candidates are output from the output unit 10.
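The two steps that distinguish the secondary search from the primary search, namely dropping top-ranked documents that contain the target word and then keeping the S related words with the highest selection values, might look like the sketch below. The text does not say whether filtered documents are replaced by lower-ranked ones; this sketch assumes they are not, so fewer than R seed documents may remain. Both function names are hypothetical.

```python
def second_seed_documents(ranked_ids, docs, target, R):
    """Pick seed documents for the secondary search.

    ranked_ids : document indices in descending order of document score
    docs       : list of documents, each a list of word tokens
    target     : the target word
    R          : number of top-ranked documents to consider

    Assumption: the top R documents are taken first and those containing
    the target word are then discarded, without back-filling.
    """
    return [i for i in ranked_ids[:R] if target not in docs[i]]


def select_synonym_candidates(related, S):
    """Return the S (word, selection value) pairs with the highest values.

    related : dict mapping each related word to its selection value
    """
    # Descending selection value; ties broken alphabetically for determinism.
    return sorted(related.items(), key=lambda p: (-p[1], p[0]))[:S]
```

The selection values passed to `select_synonym_candidates` would come from applying equation (3) to the seed documents returned by `second_seed_documents`.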
[0024]
FIG. 5 is a block diagram showing a configuration of a control unit according to the second embodiment of the present invention. Parts corresponding to those in FIG. 1 are denoted by the same reference numerals, and redundant description is omitted.
In the present embodiment, a determination unit 11 is added between the target word buffer 2 and the related word buffer 7.
[0025]
FIG. 6 is a flowchart showing the operation of the control unit.
First, a target word is input via the input unit 1 and stored in the target word buffer 2 (S31). Next, the primary search unit 5 performs a primary search of the document storage unit 3 (S32). Next, the determination unit 11 determines whether or not to continue processing (S33). If processing is not to continue, a message stating that no synonym candidate was obtained is output from the output unit 10 (S34). If processing continues, the secondary search unit 8 performs a secondary search of the document storage unit 3 based on the result of the primary search (S35). Then the synonym selection unit 9 selects synonym candidates based on the result of the secondary search (S36), and the output unit 10 outputs the word forms of the synonym candidates (S37). Steps S31, S32, and S35 to S37 are performed in the same manner as S1 to S5 in FIG. 2.
[0026]
In S33, the determination unit 11 determines whether or not to continue processing as follows. The related word group obtained from the primary search is sorted by degree of relevance, and if the target word is not within the top k places of this ranking, the unit decides not to continue; otherwise it decides to continue. The reasoning is that the word most relevant to the target word should be the target word itself, so the further the ranking departs from that situation, the less valid the obtained related words are judged to be, and in that case processing is not continued.
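The continue-or-stop decision of S33 can be sketched as a simple rank check. One assumption is needed: for the target word to have a rank at all, the list passed in must still contain the target word with its own selection value, that is, it is the candidate ranking before the target word is filtered out. The function name `should_continue` is hypothetical.

```python
def should_continue(scored_candidates, target, k):
    """Return True when the target word ranks within the top k candidates.

    scored_candidates : list of (word, relevance) pairs that still includes
                        the target word itself (an assumption of this sketch)
    target            : the target word
    k                 : rank threshold
    """
    ranked = sorted(scored_candidates, key=lambda p: -p[1])
    top_k = [word for word, _ in ranked[:k]]
    return target in top_k
```

When this returns False, the flow described above would output the "no synonym candidate obtained" message instead of running the secondary search.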
[0027]
According to the present embodiment, for a target word for which a synonym cannot be obtained due to a lack of a document group or the like, a message to that effect is output, so that the quality of the synonym acquisition result can be improved.
In S5 of FIG. 2 and S37 of FIG. 6, the output unit 10 outputs only the word forms of the synonyms, but the degree of relevance stored in the related word buffer 7 may be output together with each word form. The degree of relevance of each synonym candidate then serves as information for narrowing down the candidates further, so the quality of the synonym acquisition result can be improved.
[0028]
Next, the third to sixth embodiments of the present invention will be described. FIGS. 8, 12, 13, and 16 show the configuration of the control unit according to each embodiment. Parts corresponding to those in FIGS. 1 and 5 are denoted by the same reference numerals, and redundant description is omitted. The configuration of the synonym search device is the same as that in FIG. 7.
[0029]
FIG. 8 is a block diagram showing a configuration of a control unit according to the third embodiment of the present invention.
The primary search unit 5 and the secondary search unit 8 in FIGS. 1 and 5 are not shown.
The synonym selecting section 9 is illustrated as a synonym extracting section 12.
[0030]
FIG. 9 is a flowchart showing the operation of the control unit.
The target word is received using the input unit 1, and the target word is stored in the target word buffer 2 (S41). Next, a primary search is performed (S42), and a secondary search is performed (S43). Next, a similar word group is output using the output unit 10 (S44).
[0031]
FIG. 10 is a flowchart showing the operation of the primary search.
In the primary search, first, a document ranking search is performed using the document search unit 4 with the target word as a search word (S51). The document ranking search is performed by calculating a document score using the above equations (1) and (2). Next, from the result of the document ranking, a predetermined number (referred to as R) of upper documents (documents ranked higher) are extracted from the document storage unit 3 in order from the upper one (S52). These higher-level documents are referred to herein as seed documents. Next, a related word candidate group is obtained for each seed document D using the related word extracting unit 6 (S53). The related word candidate group is obtained by decomposing the document D into words using morphological analysis or word delimiters or the like, and using the selection value obtained by the above equation (3).
[0032]
Next, in the primary search, once the related word candidates and their attributes have been obtained for all seed documents (S54), the candidates are sorted in descending order of selection value, and the top T candidates are taken as related words, where T is a predetermined number of related words. However, a related word candidate having the same word form as the target word is not regarded as a related word. Next, the pair of the word form and the selection value of each related word is stored in the related word buffer 7. The selection value increases as the weight increases and as the probability of co-occurrence with the target word in the document increases. Therefore, the selection value indicates the degree of association between the target word and the related word candidate (referred to as the degree of association).
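The selection of related words can be sketched as follows. The sketch assumes that the selection value of equation (3) has the form given in the claims, Relevance = weight × (r / R − n / N), where r is the number of seed documents containing the candidate and n is its document frequency in the whole document group. It further assumes that the seed documents are drawn from the document group (so n is at least 1 for every candidate); the function name and data layout are hypothetical.

```python
import math


def select_related_words(seed_docs, all_docs, target, top_t):
    """seed_docs and all_docs are lists of token lists.

    Returns up to top_t (word, selection_value) pairs, best first.
    Assumes every seed document also belongs to all_docs.
    """
    R, N = len(seed_docs), len(all_docs)
    seed_sets = [set(d) for d in seed_docs]
    all_sets = [set(d) for d in all_docs]
    candidates = set().union(*seed_sets)
    scored = []
    for word in candidates:
        if word == target:
            # A candidate with the same word form as the target word
            # is not regarded as a related word.
            continue
        r = sum(1 for s in seed_sets if word in s)
        n = sum(1 for s in all_sets if word in s)
        weight = math.log(N / n + 1)
        scored.append((word, weight * (r / R - n / N)))
    # Descending selection value; ties broken alphabetically.
    scored.sort(key=lambda p: (-p[1], p[0]))
    return scored[:top_t]
```

A word that appears in the seed documents no more often than in the document group as a whole gets a selection value of zero or less, so it tends to fall outside the top T.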
[0033]
FIG. 11 is a flowchart showing the operation of the secondary search.
In the secondary search, first, a document ranking search is performed using the related word group stored in the related word buffer 7 as a search word group using the document search unit 4 (S61). Next, based on the result of the document ranking, a predetermined number (referred to as R2) of upper documents (documents ranked higher) are extracted from the document storage unit 3 in order from the upper one (S62). These upper documents are referred to as second seed documents D here. Next, a synonym candidate group is obtained for each second seed document D using the synonym extraction unit 12 as follows (S63).
[0034]
That is, the document D is decomposed into words using morphological analysis or word delimiters. Then, the following attributes are obtained for each word s.
Co-occurrence degree c: The total number of related words that appear at positions within W words before and after each occurrence position of word s in the second seed document. Here, W is a predetermined constant.
That is, the co-occurrence degree c for the word s is the total number of related words that appeared near s in the second seed document group.
[0035]
Note that c may instead be expressed as a relative value, normalized so that the appearance frequency tf of s in the second seed document group counts as 1. In this case, c′ = c / tf.
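The co-occurrence degree c and its normalized form c′ = c / tf can be sketched as follows. Documents are assumed to be already decomposed into word lists; the function names are hypothetical, and the position of s itself is not counted as one of its own neighbors, which is an interpretation of "before and after".

```python
from collections import Counter


def cooccurrence_degrees(seed_docs, related_words, window):
    """Return (c, tf) as Counters over the words of the second seed documents.

    c[s]  : total number of related-word occurrences within `window`
            words before and after each occurrence of s.
    tf[s] : number of occurrences of s in the second seed documents.
    """
    related = set(related_words)
    c = Counter()
    tf = Counter()
    for words in seed_docs:
        for i, s in enumerate(words):
            tf[s] += 1
            lo = max(0, i - window)
            hi = min(len(words), i + window + 1)
            # Count related words near position i, excluding position i itself.
            c[s] += sum(1 for j in range(lo, hi)
                        if j != i and words[j] in related)
    return c, tf


def relative_degrees(c, tf):
    # c' = c / tf for every word that occurs at least once.
    return {s: c[s] / tf[s] for s in tf}
```

Normalizing by tf keeps a very frequent word from ranking highly only because it has many occurrence positions.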
[0036]
Next, each pair of a word s and c (or s and c′) is pooled as a synonym candidate. Then, once synonym candidates have been obtained for all the second seed documents (S64), the candidates are sorted in descending order of the co-occurrence degree c, and the top S words are taken as synonyms, where S is a predetermined number of synonyms. The output unit 10 outputs these S synonyms.
[0037]
When the synonyms are output, a synonym having the same word form as the target word may be included. Therefore, in the fourth embodiment of the present invention, such trivial synonyms are excluded from the output. FIG. 12 shows a control unit according to the fourth embodiment, in which the output unit 10 compares the word form of the target word stored in the target word buffer 2 with the word form of each synonym, and does not output the synonym if they are the same.
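The selection of the top S synonyms and the output-side filtering of the fourth embodiment can be sketched together as follows. The co-occurrence degrees are assumed to be given as a dictionary from word to c (or c′); both function names are hypothetical.

```python
def select_synonyms(degrees, top_s):
    """Sort candidates by co-occurrence degree (descending) and keep top_s."""
    ranked = sorted(degrees.items(), key=lambda p: (-p[1], p[0]))
    return [word for word, _ in ranked[:top_s]]


def output_synonyms(synonyms, target):
    """Drop any synonym whose word form equals the target word."""
    return [word for word in synonyms if word != target]
```

For example, with degrees of 5 for "tcp", 4 for "protocol", and 3 for "transmission", selecting the top three and filtering against the target word "tcp" leaves "protocol" and "transmission". Because the filter is applied at output time, fewer than S synonyms may be output.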
[0038]
As described above, according to the third and fourth embodiments, a ranking search of the document group is performed for the target word, a related word group of the target word is obtained from the top-ranked documents, a ranking search of the document group is performed again using the related word group, and a synonym group is obtained from the resulting top-ranked documents and the related word group.
[0039]
FIG. 13 is a block diagram showing a configuration of a control unit according to the fifth embodiment of the present invention, which is obtained by adding a supplementary word buffer 13 to FIG.
[0040]
FIG. 14 is a flowchart showing the operation of the control unit.
First, a target word and a supplementary word group are received using the input unit 1, the target word is stored in the target word buffer 2, and the supplementary word group is stored in the supplementary word buffer 13 (S71). Next, a primary search is performed (S72), followed by a secondary search (S73). Then, the synonym group is output from the output unit 10 (S74).
[0041]
FIG. 15 is a flowchart showing the operation of the primary search.
Steps S81 to S84 differ from FIG. 10 only in that, in S81, the document search unit 4 performs the document ranking search using not only the target word but also the supplementary word group.
[0042]
Here, supplementary words will be described.
For example, the target word “TCP” is a polysemous word having the following two meanings.
Meaning 1: Network protocol
Meaning 2: Tricalcium phosphate
When a synonym for meaning 1 is to be obtained, "network, communication" or the like may be given as the supplementary word group.
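The effect of a supplementary word group on the primary ranking can be sketched as follows. The score is assumed to be the sum, over the search words, of tf / (tf + dlen) × Log (N / n + 1), following the form given in the claims; the natural logarithm, the function names, and the toy documents are assumptions for illustration.

```python
import math


def score(doc_words, all_docs, terms):
    # Sum of per-term scores tf / (tf + dlen) * log(N / n + 1).
    total = 0.0
    for t in terms:
        tf = doc_words.count(t)
        if tf == 0:
            continue
        n = sum(1 for d in all_docs if t in d)
        total += tf / (tf + len(doc_words)) * math.log(len(all_docs) / n + 1)
    return total


def primary_ranking(all_docs, target, supplementary=()):
    """Rank document indices using the target word plus supplementary words."""
    terms = [target, *supplementary]
    scored = [(score(d, all_docs, terms), i) for i, d in enumerate(all_docs)]
    ranked = sorted(scored, key=lambda p: (-p[0], p[1]))
    return [i for s, i in ranked if s > 0]


docs = [
    ["tcp", "bone", "graft", "material"],             # meaning 2
    ["tcp", "network", "communication", "protocol"],  # meaning 1
]
print(primary_ranking(docs, "tcp"))
print(primary_ranking(docs, "tcp", ["network", "communication"]))
```

With the target word alone the two toy documents receive the same score, so their order is decided only by the tie-break. Adding the supplementary words raises the score of the networking document, so the seed documents lean toward meaning 1.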
[0043]
The secondary search is performed in the same manner as in FIG. 11, and the output unit 10 outputs the S synonyms. In this case too, a synonym having the same word form as the target word may be output, so in the sixth embodiment of the present invention, the synonyms are output with such trivial synonyms excluded. FIG. 16 shows a control unit according to the sixth embodiment. The output unit 10 compares the word form of the target word stored in the target word buffer 2 with the word form of each synonym, and does not output the synonym if they are the same.
[0044]
As described above, in the fifth and sixth embodiments, a ranking search of the document group is performed for the target word and the supplementary word group, and the related word group of the target word is obtained from the top-ranked documents. Then, a ranking search of the document group is performed again using the related word group, and a synonym group is obtained from the resulting top-ranked documents and the related word group.
[0045]
The program for executing the processing described in each flowchart by the control units of FIGS. 1, 5, 8, 12, 13, and 16 and the CPU of FIG. 7 constitutes a program according to the present invention. In addition, a storage medium such as the storage device 40 in FIG. 7 that stores the program constitutes a storage medium according to the present invention. The storage medium may be an optical disk, a magneto-optical disk, a magnetic recording medium, a semiconductor storage device, or the like.
[0046]
[Effects of the invention]
As described above, according to the present invention, it is sufficient to use only a document group for synonym search, so that a general synonym can be obtained at lower cost than in the past.
[0047]
Also, for a target word for which synonyms cannot be obtained because of a shortage of documents, a message to that effect is output, so that the quality of the synonyms obtained can be improved.
In addition, by outputting the relevance (including co-occurrence) for each synonym candidate as information for further narrowing down synonym candidates, the quality of the synonym acquired can be improved.
Furthermore, general synonyms can be acquired at low cost even for polysemous words, and the quality of the synonyms acquired can be improved.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a configuration of a control unit in a synonym search device according to a first embodiment of the present invention.
FIG. 2 is a flowchart illustrating processing of the entire control unit according to the first embodiment.
FIG. 3 is a flowchart illustrating processing of a primary search unit.
FIG. 4 is a flowchart illustrating processing of a secondary search unit.
FIG. 5 is a block diagram illustrating a configuration of a control unit in a synonym search device according to a second embodiment of the present invention.
FIG. 6 is a flowchart illustrating processing of the entire control unit according to the second embodiment.
FIG. 7 is a block diagram showing a synonym search device according to an embodiment of the present invention.
FIG. 8 is a block diagram illustrating a configuration of a control unit in a synonym search device according to a third embodiment of the present invention.
FIG. 9 is a flowchart illustrating processing of the entire control unit according to the third embodiment.
FIG. 10 is a flowchart illustrating processing of a primary search unit.
FIG. 11 is a flowchart illustrating processing of a secondary search unit.
FIG. 12 is a block diagram illustrating a configuration of a control unit in a synonym search device according to a fourth embodiment of the present invention.
FIG. 13 is a block diagram illustrating a configuration of a control unit in a synonym search device according to a fifth embodiment of the present invention.
FIG. 14 is a flowchart illustrating processing of the entire control unit according to a fifth embodiment.
FIG. 15 is a flowchart illustrating processing of a primary search unit.
FIG. 16 is a block diagram illustrating a configuration of a control unit in a synonym search device according to a sixth embodiment of the present invention.
[Explanation of symbols]
1 Input unit
2 Target word buffer
3 Document storage unit
4 Document search unit
5 Primary search unit
6 Related word extraction unit
7 Related word buffer
8 Secondary search unit
9 Synonym selection unit
10 Output unit
11 Judgment unit
12 Synonym extraction unit
13 Supplementary word buffer
20 Input device
30 Output device
40 Storage device
50 CPU

Claims

A synonym search device comprising:
input means for inputting a target word;
document storage means for storing a group of documents;
primary search means for searching the document storage means using the input target word as a search word, assigning a document score by a first operation to each document in which the search word appears, ranking the documents in descending order of the document score, extracting a predetermined number of top-ranked documents as seed documents, extracting the words constituting the seed documents as related word candidates, searching the seed documents using the extracted related word candidates as search words, calculating by a second operation the degree of relevance between the target word and the search word for each document in which the search word appears, and extracting from the related word candidates, as related words, a predetermined number of search words in descending order of the degree of relevance;
secondary search means for searching the document storage means using the extracted related words as search words, and extracting related words by the first and second operations for the documents in which the search words appear; and
synonym selecting means for selecting, as synonym candidates, a predetermined number of related words having higher relevance from the related words extracted by the secondary search means.

2. The synonym search apparatus according to claim 1, wherein the processing is stopped when the target word does not fall within a predetermined upper rank of the degree of relevancy obtained by the primary search means.

3. The synonym search device according to claim 2, further comprising an output unit that outputs a message indicating that the processing is stopped.

The synonym search device according to any one of claims 1 to 3, wherein the related word candidate group does not include the target word.

The synonym search device according to any one of claims 1 to 4, wherein the first operation is:
Document score = {tf / (tf + dlen)} × weight
weight = Log (N / n + 1)
where tf is the frequency of occurrence of a search word (target word) in the document, dlen is the document length, N is the total number of documents stored in the document storage means, and n is the number of documents, among the total number N of documents, in which the search word appears (document frequency).

The synonym search device according to any one of claims 1 to 5, wherein the second operation is:
Relevance = weight × (r / R − n / N)
where R is the number of seed documents and r is the number of seed documents in which the search word appears.

7. The synonym search device according to claim 1, wherein, when there are a plurality of input target words, that is, search words, the document score is obtained for each search word, and a final document score is obtained by adding them together.

The synonym search device according to any one of claims 1 to 7, wherein the synonym selection unit outputs the degree of relevance together with the synonym candidate.

A synonym search method comprising:
searching a document group using an input target word as a search word, assigning a document score by a first operation to each document in which the search word appears, ranking the documents in descending order of the document score, extracting a predetermined number of top-ranked documents as seed documents, extracting the words constituting the seed documents as related word candidates, searching the seed documents using the extracted related word candidates as search words, determining by a second operation the degree of relevance between the target word and the search word for each document in which the search word appears, and extracting from the related word candidates, as related words, a predetermined number of search words in descending order of the degree of relevance;
searching the document storage means using the extracted related words as search words, and extracting related words by the first and second operations for the documents in which the search words appear; and
selecting, as synonym candidates, a predetermined number of related words having higher relevance from the extracted related words.

10. The synonym search method according to claim 9, wherein the processing is stopped when the target word does not fall within the upper predetermined rank of the relevance.

The synonym search method according to claim 10, wherein a message indicating that the processing is stopped is output.

12. The synonym search method according to claim 9, wherein the related word candidate group does not include the target word.

The synonym search method according to any one of claims 9 to 12, wherein the first operation is:
Document score = {tf / (tf + dlen)} × weight
weight = Log (N / n + 1)
where tf is the frequency of occurrence of a search word (target word) in the document, dlen is the document length, N is the total number of documents stored in the document storage means, and n is the number of documents, among the total number N of documents, in which the search word appears (document frequency).

14. The synonym search method according to any one of claims 9 to 13, wherein the second operation is:
Relevance = weight × (r / R − n / N)
where R is the number of seed documents and r is the number of seed documents in which the search word appears.

15. The synonym search method according to claim 9, wherein, when there are a plurality of input target words, that is, search words, the document score is obtained for each search word, and a final document score is obtained by adding them together.

The synonym search method according to any one of claims 9 to 15, wherein a relevance is output together with the selected synonym candidate.

An input process for inputting a target word,
A ranking process of searching a document group using the input target word as a search word, assigning a document score to the document in which the search word appears by a first operation, and ranking the document in descending order of the document score;
An extraction process of extracting a predetermined number of documents in the ranking as a seed document, and extracting words forming the seed document as related word candidates;
An extraction process of searching the seed document using the extracted related word candidates as search words, obtaining by a second operation the relevance between the target word and the search word for each document in which the search word appears, and extracting from the related word candidates, as related words, a predetermined number of search words in descending order of relevance,
An extraction process of searching the document group using the extracted related word as a search word, and extracting a related word by the first and second operations for a document in which the search word appears;
A synonym search program for causing a computer to execute a selection process of selecting, as a synonym candidate, a predetermined number of related words in descending order of relevance of the extracted related words.

18. The synonym search program according to claim 17, further comprising a suspending process for suspending the process when the target word does not fall within the upper predetermined place of the relevance.

19. The synonym search program according to claim 18, further comprising an output process for outputting a message indicating that the process is stopped.

The synonym search program according to any one of claims 17 to 19, wherein the related word candidate group does not include the target word.

The synonym search program according to any one of claims 17 to 20, wherein the first operation is:
Document score = {tf / (tf + dlen)} × weight
weight = Log (N / n + 1)
where tf is the frequency of occurrence of a search word (target word) in the document, dlen is the document length, N is the total number of documents stored in the document storage means, and n is the number of documents, among the total number N of documents, in which the search word appears (document frequency).

22. The synonym search program according to claim 17, wherein the second operation is:
Relevance = weight × (r / R − n / N)
where R is the number of seed documents and r is the number of seed documents in which the search word appears.

23. The synonym search program according to claim 17, wherein, when there are a plurality of input target words, that is, search words, the document score is obtained for each search word, and a final document score is obtained by adding them together.

The synonym search program according to any one of claims 17 to 23, wherein the synonym selection unit outputs a relevance together with the synonym candidate.

A storage medium storing the synonym search program according to any one of claims 17 to 24.

The synonym search device according to any one of claims 1 to 8, wherein, when the target word is a polysemous word, the input means inputs a supplementary word indicating the intended meaning together with the target word, and the primary search means uses the input target word and the supplementary word as the search words.

The synonym search device according to claim 1, wherein the synonym selecting means determines, at each occurrence position of a word s in the seed document obtained by the second operation, a co-occurrence degree c indicating the total number of related words appearing within a predetermined number W of words before and after the word s, or c′ = c / tf, and selects a predetermined number of related words as synonym candidates in descending order of c or c′ = c / tf.

The synonym search method according to claim 9, wherein, when the target word is a polysemous word, a supplementary word indicating the intended meaning is input together with the target word, and the input target word and the supplementary word are used as the search words.

The synonym search method according to any one of claims 9 to 16 or 28, wherein, at each occurrence position of a word s in the seed document obtained by the first and second operations, a co-occurrence degree c indicating the total number of related words appearing within a predetermined number W of words before and after the word s, or c′ = c / tf, is determined, and a predetermined number of top-ranked related words are selected as synonym candidates in descending order of c or c′ = c / tf.

The synonym search program according to any one of claims 17 to 24, wherein, in the input process, when the target word is a polysemous word, a supplementary word indicating the intended meaning is input together with the target word, and the ranking process uses the input target word and the supplementary word as the search words.

In the synonym search program, the selection process determines, at each occurrence position of a word s in the seed document obtained by the extraction process for extracting the related words, a co-occurrence degree c indicating the total number of related words appearing within a predetermined number W of words before and after the word s, or c′ = c / tf, and selects a predetermined number of top-ranked related words as synonym candidates in descending order of c or c′ = c / tf.

A storage medium storing the synonym search program according to any one of claims 17 to 24, 30, and 31.