JP2004220513A - Information retrieval device

Info

Publication number: JP2004220513A
Application number: JP2003010032A
Authority: JP
Inventors: Yuji Kobayashi; 雄二小林
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2003-01-17
Filing date: 2003-01-17
Publication date: 2004-08-05
Anticipated expiration: 2023-01-17
Also published as: JP4289891B2

Abstract

PROBLEM TO BE SOLVED: To provide an information retrieval device that retrieves information appropriately according to a specified retrieval request.

SOLUTION: The information retrieval device first selects the unnecessary words corresponding to the language of the retrieval request sentence (step S11). It then refers to the retrieval method held in a retrieval method specifying part and selects the unnecessary words for concept retrieval when concept retrieval is specified, or the unnecessary words for logical retrieval when logical retrieval is specified (step S12). Next, the device refers to the retrieval target field stored in a buffer memory in a retrieval target field detecting part and selects the unnecessary words that match the specified field (step S13); when no field is specified, or when no unnecessary words correspond to the specified field, it simply proceeds to the next step. Finally, the device refers to the unnecessary word selection specification information stored in an unnecessary word selection specification holding part and, when the unnecessary words to be used are specified there, selects those unnecessary words regardless of the results of the preceding steps (step S14).

COPYRIGHT: (C)2004,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、複数種類の情報を管理し、その管理されている情報から所望の情報を検索する情報検索装置に関するものである。
【０００２】
【従来の技術】
従来、情報、例えば、文書あるいは文字によるコンテンツの内容記述を付帯させた画像情報などを検索する情報検索装置として、所望の情報を表す語あるいは文を入力して、入力された語あるいは文と一致する語あるいは文を保持する、蓄積された情報を得る、全文検索と呼ばれる手法を適用した情報検索装置は知られている。
【０００３】
また、単に入力した語あるいは文と一致する語あるいは文を保持する情報のみならず、入力した語あるいは文と相似な概念と判断される語あるいは文を保持する情報を得る情報検索装置も知られている。
【０００４】
このような情報検索装置では、どの情報にも頻出するために、情報の識別に寄与しない普遍的な語を検索対象から除外することで、検索効率を向上させるようにしており、このため、検索不要語と呼ばれる、検索対象から除外すべき語を装置内に保持している。
【０００５】
【発明が解決しようとしている課題】
しかし、上記従来の情報検索装置では、不要語は常に検索対象から除外されることになり、操作者の操作目的によっては検索されるべき語が不要語として定義されていたために検索されないという問題があった。
【０００６】
このような問題に対して、検索時に不要語を使わないようにする、すなわち、すべての検索語を検索対象とするようにした情報検索装置もあるが、操作者の操作目的に適わない、意図しない頻出語が大量に検索されてしまうという問題があった。
【０００７】
例えば、英語の代名詞“ｉｔ”は、それ自体には固有の意味を持つことが少なく、蓄積情報の如何を問わず頻出するため、検索要求文中に出現する“ｉｔ”は不要語として検索対象から除外するようにしたいとする。しかしながら、日本語文書には通常、英語の代名詞としての“ｉｔ”が出現することはまれであり、むしろ“ｉｔ”を不要語とすることで、「ＩＴ革命」の「ＩＴ」が検索されないという問題を生じてしまう。
【０００８】
また、検索要求文の内容に類似する概念をもつ文書を検索する概念検索においては、多義性があって、どの語義であるかが判定できない場合や、普遍的に出現するがゆえに文書概念の識別に寄与しない語を不要語とするが、検索語の出現する文書を真（出現する）か偽（出現しない）の２値で判定する論理検索では、概念検索で不要語と定義された語であっても検索対象とすべき場合があるというように、検索方式によって不要語とすべき語が異なるという問題があった。
【０００９】
本発明は、これらの点に着目してなされたものであり、指定した検索要求に従って、適切な情報検索を行うことができる情報検索装置を提供することを目的とする。
【００１０】
【課題を解決するための手段】
上記目的を達成するため、請求項１に記載の情報検索装置は、検索語または検索文を入力する入力手段と、該入力手段によって入力された検索語または検索文に基づいて、少なくとも文字データを含む情報複数の中から一部情報を検索する検索手段と、該検索手段による検索に用いない複数の単語または文字列からなる不要語群を複数種類記憶する記憶手段と、前記記憶された複数種類の不要語群から、所定の条件に応じて、少なくとも一部の不要語群を選択する選択手段と、前記検索手段による情報の検索時に、前記選択手段によって選択された文字データ群に含まれる不要語を当該検索語から除外する除外手段とを有することを特徴とする。
【００１１】
【発明の実施の形態】
以下、本発明の実施の形態を図面を参照して詳細に説明する。
【００１２】
図１は、本発明の一実施の形態に係る情報検索装置の概略構成を示すブロック図である。
【００１３】
同図において、マイクロプロセッサ（ＣＰＵ）１１は、情報検索のための演算や論理判断等を行ない、アドレスバスＡＢ、コントロールバスＣＢおよびデータバスＤＢを介して、それらのバスＡＢ，ＣＢ，ＤＢに接続された各構成要素１２〜１９を制御する。
【００１４】
アドレスバスＡＢは、ＣＰＵ１１の制御の対象とする各構成要素１２〜１９を指示するアドレス信号を転送する。コントロールバスＣＢは、ＣＰＵ１１の制御の対象とする各構成要素１２〜１９のコントロール信号を転送して印加する。データバスＤＢは、各構成要素１１〜１９相互間のデータ転送を行なう。
【００１５】
ＲＯＭ１２は、読出し専用の固定メモリ（リードオンリメモリ）であり、ＣＰＵ１１が実行する処理プログラム等の制御プログラムコードを記憶する。
【００１６】
ＲＡＭ１３は、１ワード１６ビットで構成される、書込み可能なランダムアクセスメモリであり、各構成要素からの各種データの一時記憶に用いられる。また、ＲＡＭ１３上には、図２において後述する、検索語保持部２０２、不要語選択指定保持部２０４、検索概念ベクトル保持部２１２および検索結果保持部２１４が形成される。
【００１７】
ＤＩＳＫ１４は、外部メモリであり、図２において後述する、不要語２０７、概念辞書２１１、蓄積文書概念ベクトル保持部２１６、蓄積文書２１８および単語インデックス２１９を格納する。また、図２において後述する、検索要求入力処理部２０１、不要語選択指定処理部２０３、検索方式指定処理部２０５、適用不要語判別処理部２０６、検索対象分野検知処理部２０８、検索語選別処理部２０９、検索概念ベクトル作成処理部２１０、概念検索処理部２１３、論理検索処理部２１５および概念ベクトル作成処理部２１７の各処理部を実行するプログラムコードも格納する。
【００１８】
なお、これらのデータおよびプログラムを格納する記憶媒体としては、ＲＯＭ、フロッピー（登録商標）ディスク、ＣＤ−ＲＯＭ、ＤＶＤ−ＲＯＭ、メモリカード、光磁気ディスクなどを用いることができる。
【００１９】
ＫＢ１５は、キーボードであり、アルファベットキー、ひらがなキー、カタカナキー、句点等の文字記号入力キー、検索を指示する検索キーおよび、カーソル移動を指示するカーソル移動キー等のような各種の機能キーを備えている。
【００２０】
ＶＲＡＭ１６は、表示用ビデオメモリであり、表示すべきデータのパターンを蓄える。
【００２１】
ＣＲＴＣ１７は、ＣＲＴコントローラであり、ＶＲＡＭ１６に蓄えられた内容をＣＲＴ１８に表示する役割を担う。
【００２２】
ＣＲＴ１８は、陰極線管であり、ＣＲＴ１８におけるドット構成の表示パターンおよびカーソルの表示は、ＣＲＴＣ１７で制御される。なお、ＣＲＴに代えて、液晶パネル等の表示装置を採用するようにしてもよい。
【００２３】
ＮＩＣ１９は、ネットワークコントローラであり、Ｅｔｈｅｒｎｅｔ（登録商標）などのネットワークに接続する役割を担う。
【００２４】
以上のように構成された情報検索装置は、ＫＢ１５からの各種の入力およびＮＩＣ１９から供給されるネットワーク経由の各種入力に応じて作動し、ＫＢ１５またはＮＩＣ１９からの入力があると、まず、インタラプト信号がＣＰＵ１１に送信される。これに応じて、ＣＰＵ１１は、ＤＩＳＫ１４内に記憶してある各種の制御信号を読出し、それらの制御信号に従って、各種の制御を開始する。
【００２５】
図２は、本実施の形態の情報検索装置の機能構成を示すブロック図である。
【００２６】
同図において、検索要求入力処理部２０１は、所望の検索対象に関する要求事項（検索文あるいは検索語）を入力する。
【００２７】
検索語保持部２０２は、検索要求入力処理部２０１によって入力された検索語を記憶する。
【００２８】
不要語選択指定処理部２０３は、検索不要語をユーザが指定するためのものであり、不要語選択指定保持部２０４は、不要語選択指定処理部２０３によって指定された不要語のユーザによる選択結果を、例えば図１１で後述する構成によって記憶する。
【００２９】
検索方式指定処理部２０５は、文書検索方式を選択する。
【００３０】
適用不要語判別処理部２０６は、複数の不要語から適用すべき不要語を判別選択する。
【００３１】
不要語保持部２０７は、目的に応じて定義づけられた複数の不要語を記憶する。
【００３２】
検索対象分野検知処理部２０８は、検索対象としている分野を検知判定する。
【００３３】
検索語選別処理部２０９は、適用不要語判別処理部２０６において選択された不要語に基づいて、検索語保持部２０２に記憶された検索語より検索対象となる語を選別する。
【００３４】
検索概念ベクトル作成処理部２１０は、検索語選別処理部２０９で選別された検索語に基づいて概念辞書２１１を参照することにより、検索概念ベクトルを作成する。
【００３５】
概念辞書２１１は、見出しとなる単語の意味特徴を記述して格納したものである。
【００３６】
検索概念ベクトル保持部２１２は、検索概念ベクトル作成処理部２１０により作成された検索概念ベクトルを記憶する。
【００３７】
概念検索処理部２１３は、概念検索を行う。
【００３８】
論理検索処理部２１５は、論理検索を行う。
【００３９】
検索結果保持部２１４は、概念検索処理部２１３および論理検索処理部２１５による処理結果を記憶する。
【００４０】
概念ベクトル作成処理部２１７は、登録文書２２０に対して概念ベクトルを作成する。
【００４１】
蓄積文書概念ベクトル保持部２１６は、概念ベクトル作成処理部２１７で作成された概念ベクトルを登録文書２２０と対応付けて記憶する。
【００４２】
単語インデックス２１９は、概念ベクトル作成処理部２１７で作成される、登録文書２２０に出現する単語の索引を記憶する。
【００４３】
蓄積文書２１８は、登録文書２２０を記憶する。
【００４４】
図３は、検索要求入力処理部２０１において、検索要求文あるいは検索要求語を指示する場合の操作パネルの表示例を示す図である。
【００４５】
同図において、表示ウィンドウ３０１は、検索要求入力操作を行うためのものである。
【００４６】
領域３０２は、検索要求となる文あるいは語を入力する検索文入力領域である。
【００４７】
文３０３は、入力中の検索要求文を示しており、図示例では、「モバイル機器の市場動向」と入力されている。
【００４８】
カーソル３０４は、検索文入力領域３０２における入力位置を示すものである。
【００４９】
ボタン３０５および３０６は、検索方式を指定するラジオボタンであり、いずれか一方を選択する。図示例では、「概念検索」が選択されている状態となっている。
【００５０】
領域３０７は、検索したい文書の対象分野を指定する分野指定入力領域であり、プルダウン表示される分野の一覧から１つを選択する。特に分野を指定しない場合は、デフォルトとして「指定なし」が選択されるものとする。
【００５１】
ボタン３０８は、検索処理の実行を指定する検索実行ボタンであり、検索実行ボタン３０８を押下することで、検索文入力領域３０２に入力した検索要求文について、ラジオボタン３０５あるいは３０６で指定した検索方式に基づく検索処理が実行される。
【００５２】
ボタン３０９は、検索処理の終了あるいは中止を指定するキャンセルボタンであり、キャンセルボタン３０９を押下すると、ただちに検索処理を終了し、表示ウインドウ３０１を閉じて終了する。
【００５３】
領域３１０は、検索ボタン３０８の押下によって検索処理を行った結果を表示する検索結果表示領域であり、図示例では、検索処理がなされていない状態であるので、何も表示されていない。
【００５４】
図４は、検索要求入力処理部２０１において、概念検索あるいは論理検索を行うための検索要求文あるいは検索要求語が操作者により指示され、検索処理が実行された場合の検索結果の表示例を示す図である。
【００５５】
同図において、領域４０１は、図３の領域３１０と同様、検索結果の表示領域である。
【００５６】
領域４０２は、検索結果の順位を示すランク表示領域である。検索結果は、検索要求に類似している順にランク付けされ、ランク順に表示される。図示例では、ランク２５位から３０位までの検索結果が表示されている。
【００５７】
領域４０３は、検索された文書の表題を表示する領域であり、領域４０４は、文書のファイル名を表示する領域である。
【００５８】
領域４０５は、検索された文書の大意が掴める程度の内容を表示する領域である。文書内容表示領域４０５には、あらかじめ文書の書誌的属性として与えられた要約文、あるいは文書から自動的に要約した要約文、あるいは文書の一部を大意として抽出した大意文などを表示することができる。
【００５９】
バー４０６は、検索結果表示領域４０１に表示しきれない場合に、表示領域４０１内において検索結果を部分表示しながら、表示されていない他の部分を表示するとともに、表示位置を指定するために、同種のウィンドウ表示装置において用いられているエレベータバーである。
【００６０】
図４に示されている表示状態は、検索文３０３として入力された「モバイル機器の市場動向」に対して、概念検索を行った検索結果を表示している。
【００６１】
図５は、概念辞書２１１の構成を示す図である。概念辞書２１１は、単語の概念を、普遍的な意味素の重みを要素とする多次元ベクトルで表したものである。
【００６２】
同図において、領域５０１は、概念辞書２１１の見出しとなる単語を格納する。
【００６３】
領域５０２は、見出し語５０１に対する２５６次元で表される意味素ベクトルの各要素を表す添え字を格納する。
【００６４】
領域５０３は、意味素ベクトルの各要素の重みを格納し、重みは０から１の間の実数をとり、意味素ベクトルの大きさが１となるよう正規化して格納する。
【００６５】
概念辞書２１１を構成する多次元ベクトルの要素となる普遍的な意味素は、１つのまとまった意味概念を表すラベルで、例えば、「これ、それ、あれ、どっち」などの語が内包している「指示の概念」、「クラス、グレード、級、ランク、順位、劣等、優劣、優等」などの語が内包している「等級の概念」、「変化、変身、革新、勃興」などの語が内包している「変化の概念」、「協力、挨拶、団結、握手、友好、国交、交友」などの語が内包している「交わりの概念」、「動物、哺乳類、ペンギン、犬、人間、金魚」などの語が内包している「生物の概念」といった特定の語に依らない各々独立した普遍的な意味素を用いる。図５においては、２５６種の意味素を用い、２５６次元の概念表現ベクトルを構成する。
【００６６】
図６は、単語インデックス２１９の構成を示す図である。単語インデックス２１９は、登録文書中に出現するすべての単語について、文書中の出現頻度を格納するテーブルである。
【００６７】
同図において、テーブルの第１列情報６０１は、登録文書を一意に同定する文書ＩＤである。テーブルの第２列情報から第ｎ列情報６０２は、図７で示される各々の単語を表す添え字である。テーブルの末尾行６０３は、各々の単語の出現数の総和を格納する。図６において、文書ＩＤが“００１４６”である文書は、添え字“１２５６”の示す単語「市場」が１２回文書中に出現していることを示している。
【００６８】
図７は、単語インデックス２１９において、単語と、単語インデックステーブルの添え字との対応関係を格納した対応テーブルの構成を示す図である。
【００６９】
同図において、対応テーブルは、単語７０１と対応付けられた一意の単語インデックス７０２とを対応をとって格納する。例えば、単語「市場」の単語インデックスは“１２５６”であることが示される。
【００７０】
次に、不要語２０７の構成について、図８〜図１０を用いて説明する。なお、不要語２０７は、複数の不要語を記憶する。
【００７１】
図８は、日本語を検索対象とした論理検索実行時に参照されるべき不要語を例示したものである。同図において、行頭にシャープ記号（＃）がある行はコメント行であり、不要語定義データは２行目以降のデータであり、文字コード順に配列されている。
【００７２】
図９は、英語を検索対象とした論理検索実行時に参照されるべき不要語を例示したものである。
【００７３】
同図において、図８と同様に、行頭のシャープ記号はコメント行を意味しており、英語の不要語が文字コード順に配列されている。
【００７４】
図１０は、英語を検索対象とした概念検索実行時に参照されるべき不要語を例示したものである。この不要語も、図８および図９の不要語とその構成は同一で、文字コード順に概念検索参照用の不要語を例示している。
【００７５】
次に、不要語選択指定保持部２０４の構成について、図１１を用いて説明する。
【００７６】
不要語選択指定保持部２０４は、不要語２０７に記憶される複数の不要語から検索実行時に使用すべき不要語を操作者が直接指定するよう構成され、図１１において、先頭にシャープ記号を持つ行はコメント行であり、データは読み飛ばされる。［ＳｔｏｐＷｏｒｄＳｅｌｅｃｔｉｏｎ］セクションに定義されるＯＮとなっている不要語ラベルの［ＳｔｏｐＷｏｒｄＬｉｓｔ］セクションで対応付けられたファイルパスを持つ不要語ファイルが使用される。図１１の例では、［ＳｔｏｐＷｏｒｄＳｅｌｅｃｔｉｏｎ］セクションにおいてＯＮが指定されているＳＴＯＰ１およびＳＴＯＰ２のラベルを持つ不要語が指定されており、［ＳｔｏｐＷｏｒｄＬｉｓｔ］セクションにおいて、それぞれＣ：￥ＪＢＯＯＬ．ｄａｔ，Ｃ：￥ＥＢＯＯＬ．ｄａｔのファイルパスで指定される不要語が検索処理において使用される。
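As an illustration only, the designation file of FIG. 11 could be read with Python's standard `configparser` module roughly as follows. The `LABEL=ON` and `LABEL=path` line syntax, the label `STOP3` and the file name `ECONC.dat` are assumptions made for this sketch; the specification names only the two sections, the labels STOP1 and STOP2, and their two file paths.

```python
import configparser

# Hypothetical file content; the exact key/value syntax of FIG. 11 is assumed.
SAMPLE = r"""
# Lines starting with '#' are comments and are skipped.
[StopWordSelection]
STOP1=ON
STOP2=ON
STOP3=OFF

[StopWordList]
STOP1=C:\JBOOL.dat
STOP2=C:\EBOOL.dat
STOP3=C:\ECONC.dat
"""


def selected_stop_word_files(text):
    """Return the file paths of every unnecessary-word list whose label is ON."""
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.optionxform = str  # keep labels case-sensitive (the default lowercases them)
    parser.read_string(text)
    selection = parser["StopWordSelection"]
    paths = parser["StopWordList"]
    return [
        paths[label]
        for label, state in selection.items()
        if state.strip().upper() == "ON" and label in paths
    ]


print(selected_stop_word_files(SAMPLE))
```

Restricting the delimiter to `=` keeps the colon in a drive letter such as `C:` from being mistaken for a key/value separator.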
【００７７】
次に、本実施形態で実行される文書検索処理について、図１２を用いて説明する。
【００７８】
図１２は、本実施の形態の情報検索装置が実行する文書検索処理の手順を示すフローチャートである。
【００７９】
同図において、ステップＳ１では、前記検索要求入力処理部２０１の動作を行う処理モジュールによって、検索要求入力処理を行う。検索要求入力処理は、前記検索要求文入力領域３０２に入力された検索要求文３０３を前記検索語保持部２０２に記憶し、前記検索対象分野指定３０７にて選択された検索対象分野を前記検索対象分野検知処理部２０８内の不図示のバッファメモリに記憶する。また、前記ラジオボタン３０５または３０６で指定された検索方式を前記検索方式指定処理部２０５内の不図示のバッファメモリに記憶する。
【００８０】
ステップＳ２では、検索方式、不要語選択指定、検索対象分野および検索対象言語などの条件に基づいて、検索実行時において選択すべき不要語を決定する。なお、この不要語選択処理の詳細については、図１３を用いて後述する。
【００８１】
ステップＳ３では、検索方式を判別する。ステップＳ１の検索要求入力処理において、検索方式指定処理部２０５内のバッファメモリに記憶された検索方式を参照し、概念検索が指定されていたならばステップＳ４へ、論理検索が指定されていたならばステップＳ５へ、それぞれ分岐する。
【００８２】
ステップＳ４では、検索要求入力処理ステップＳ１で入力された検索要求に従って、文書内容の表す概念が類似の文書を検索する概念検索処理を行う。なお、この概念検索処理の詳細については、図１４を用いて後述する。
【００８３】
ステップＳ５では、検索要求入力処理ステップＳ１で入力された検索要求に従って、検索要求文に出現する単語が出現する文書を前記単語インデックス２１９を参照して検索する論理検索処理を行う。なお、この論理検索処理の詳細については、図１５を用いて後述する。
【００８４】
ステップＳ６では、ステップＳ４またはステップＳ５において検索された検索結果を、前記検索結果保持部２１４より取り出して表示する。なお、この処理は同種の情報検索装置において広く行われている公知の処理である。
【００８５】
図１３は、ステップＳ２の不要語選択処理の詳細な手順を示すフローチャートである。
【００８６】
同図において、ステップＳ１１では、検索要求言語の言語種別による不要語選択を行う。検索語保持部２０２に記憶された検索要求文の記述言語を識別し、その言語に対応した不要語を選択する。例えば、検索要求文が英語で記述されていたならば英語用の不要語を選択し、日本語で記述されていたならば日本語用の不要語を選択する。検索要求文の記述言語識別は、図示しないが、あらかじめ言語を指定しておく方法、検索要求文から求めた単語の辞書照合率が最も高い辞書の言語種別を選択する方法、検索要求文を構成する文字コードの分布から言語を推定する方法などを選択することができる。
【００８７】
次に、ステップＳ１２では、検索方式による不要語選択を行う。前記図１２のステップＳ１において検索方式指定処理部２０５内のバッファメモリに記憶された検索方式を参照し、概念検索が指定された場合は概念検索用の不要語を選択し、論理検索が指定された場合は論理検索用の不要語を選択する。
【００８８】
次に、ステップＳ１３では、検索対象分野による不要語選択を行う。前記ステップＳ１において検索対象分野検知処理部２０８内のバッファメモリに記憶された検索対象分野を参照し、指定された検索対象分野と合致する不要語を選択する。検索対象分野が指定されていない場合または合致する分野に対応する不要語が存在しない場合は、そのまま次のステップへ進む。
【００８９】
次に、ステップＳ１４では、不要語選択指定による不要語の選択を行う。前記不要語選択指定保持部２０４に記憶された不要語選択指定情報を参照し、使用すべき不要語が指定されている場合は、ステップＳ１１からステップＳ１３までの不要語選択の結果を問わず、不要語選択指定保持部２０４に示される不要語を選択する。
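The selection flow of steps S11 to S14 might be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the `StopWordList` record, its attribute names, and the rule that a field-specific list replaces the field-independent lists in step S13 are all hypothetical, since the specification does not say how the results of the individual steps are combined.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StopWordList:
    name: str                    # label of the list (hypothetical)
    language: str                # e.g. "ja" or "en"
    method: str                  # "concept" or "logical"
    field: Optional[str] = None  # None: not tied to a search target field


def select_stop_word_lists(lists, language, method, field=None, designated=None):
    # S11: keep the lists for the language of the search request sentence.
    candidates = [sl for sl in lists if sl.language == language]
    # S12: keep the lists for the specified search method.
    candidates = [sl for sl in candidates if sl.method == method]
    # S13: use the lists matching the specified field; if no field is given
    # or none matches, carry on with the field-independent lists (assumed).
    general = [sl for sl in candidates if sl.field is None]
    specific = [sl for sl in candidates if field is not None and sl.field == field]
    selected = specific or general
    # S14: an explicit user designation overrides the result of S11 to S13.
    if designated:
        selected = [sl for sl in lists if sl.name in designated]
    return selected
```

For example, calling `select_stop_word_lists(lists, "en", "concept")` on a collection containing Japanese logical, English logical and English concept lists would return only the English concept list.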
【００９０】
図１４は、前記図１２のステップＳ４の概念検索処理の詳細な手順を示すフローチャートである。
【００９１】
同図において、ステップＳ２１では、検索語保持部２０２に格納されている検索文を取り出し、単語に分割する。検索文の単語への分割は形態素解析処理として公知の手法を適用する。
【００９２】
次に、ステップＳ２２では、前記図１２のステップＳ２で選択された不要語を参照し、ステップＳ２１で抽出された単語が不要語に含まれているか否かを検索する。不要語でなければ、ただちにステップＳ２４へ分岐し、不要語であった場合はステップＳ２３へ進む。
【００９３】
ステップＳ２３では、ステップＳ２２で合致した不要語の言語種別とステップＳ２１で抽出された検索語の言語種別が一致しているかどうかを判定する。この判定は、異種の言語が混在していた場合に適切な不要語の適用を行うためのものであり、例えば、検索要求文「ＩｔｉｓＩＴ革命」から抽出された先頭の単語「Ｉｔ」は英語の単語であり、英語用の不要語として「Ｉｔ」が定義されている場合は不要語として、この検索語を排除すべきであるが、同じ文字列を持つ「ＩＴ革命」の「ＩＴ」は日本語の単語の一部であり、言語情報が不一致であることから、検索語として採用してよいことになる。なお、大文字と小文字は正規化されて同一視されるものとする。
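The language check of steps S22 and S23 could look like the following Python sketch. It assumes that the morphological analysis already tags each extracted word with a language, and that the selected unnecessary words are held as a mapping from a normalized word to the languages of the lists that define it; both structures are hypothetical.

```python
import unicodedata


def normalize(word):
    """Fold full-width forms and letter case so that 'Ｉｔ', 'IT' and 'it' compare equal."""
    return unicodedata.normalize("NFKC", word).casefold()


def drop_stop_words(tagged_terms, stop_words):
    """Filter (surface, language) pairs against the selected unnecessary words.

    stop_words maps a normalized word to the set of languages whose
    unnecessary-word lists contain it (assumed structure).
    """
    kept = []
    for surface, language in tagged_terms:
        stop_languages = stop_words.get(normalize(surface), set())  # S22
        if language in stop_languages:                              # S23
            continue  # same language: exclude as an unnecessary word
        kept.append(surface)  # no match, or languages differ: keep as a search word
    return kept
```

With the request sentence from the text, the English word "It" is dropped, while the "IT" that belongs to the Japanese word is kept because its language does not match that of the English list.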
【００９４】
ステップＳ２４では、概念辞書２１１の見出し語５０１と一致するものがあるか検索する。検索語に一致する概念辞書２１１の見出し語５０１が存在する場合、対応する概念ベクトルデータを概念辞書２１１より取り出す。
【００９５】
ステップＳ２５では、取得した概念ベクトルデータの構成要素の成分値を加算して、検索概念ベクトルを作成する。なお、検索概念ベクトルはあらかじめ、ベクトルのすべての次元要素を“０”に初期化しておく。
【００９６】
ステップＳ２６では、検索語保持部２０２のすべての検索語を処理したかどうかを判定し、すべての検索語の処理を終えたならば、検索概念ベクトルデータを各要素の二乗和が“１”になるよう正規化を行った後、検索概念ベクトル保持部２１２に格納し、ステップＳ２７へ分岐する。未処理の検索語があればステップＳ２１へ戻る。
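Steps S24 to S26 amount to summing the concept vectors of the remaining search words and scaling the sum so that its squared elements add up to 1. A minimal Python sketch, assuming the concept dictionary is a plain mapping from headword to a dense list of weights (the function and parameter names are hypothetical):

```python
import math


def build_query_vector(terms, concept_dict, dims):
    """Sum the concept vectors of the search words and normalize the result."""
    vector = [0.0] * dims                      # every dimension initialised to 0
    for term in terms:
        concept = concept_dict.get(term)       # S24: look up the headword
        if concept is None:
            continue                           # no matching headword
        for i, weight in enumerate(concept):   # S25: add element by element
            vector[i] += weight
    norm = math.sqrt(sum(x * x for x in vector))
    # S26: scale so that the sum of squares is 1 (left as zeros if nothing matched)
    return [x / norm for x in vector] if norm > 0 else vector
```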
【００９７】
ステップＳ２７では、蓄積文書概念ベクトルを蓄積文書概念ベクトル保持部２１６より取得し、蓄積文書概念ベクトルと、検索概念ベクトル保持部２１２に格納された検索要求概念ベクトルとの概念類似度を算出する。概念類似度の算出は、両ベクトルデータの余弦測度によって求めることができる。算出した概念類似度は、蓄積文書の文書ＩＤと対応付けて不図示のバッファメモリに一時記憶する。
【００９８】
すべての蓄積文書について、ステップＳ２７の処理を終えた後、ステップＳ２８へ進み、ステップＳ２７で一時記憶された概念類似度の降順に検索結果をソートして、検索結果保持部２１４に格納して、本概念検索処理を終了する。
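The similarity calculation of step S27 and the sort of step S28 can be illustrated with the cosine measure named in the text. The sketch below is written in plain Python; the function names and the use of a dictionary keyed by document ID are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine measure between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_documents(query_vector, document_vectors):
    """Return (document ID, similarity) pairs in descending order of similarity."""
    scored = [
        (doc_id, cosine(query_vector, vector))   # S27
        for doc_id, vector in document_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)  # S28
```

Because both the search concept vector and the stored document concept vectors are normalized to unit length, the cosine measure reduces to their dot product; the general form is kept here so that the sketch also works for vectors that are not normalized.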
【００９９】
図１５は、前記図１２のステップＳ５の論理検索処理の詳細な手順を示すフローチャートである。
【０１００】
同図において、ステップＳ３１では、検索語保持部２０２に格納されている検索文を取り出し、単語に分割する。検索文の単語への分割は、形態素解析処理として公知の手法を適用する。
【０１０１】
次に、ステップＳ３２では、前記図１２のステップＳ２で選択された不要語を参照し、ステップＳ３１で抽出された単語が不要語に含まれているか否かを検索する。不要語でなければ、ただちにステップＳ３４へ分岐し、不要語であった場合はステップＳ３３へ進む。
【０１０２】
ステップＳ３３では、ステップＳ３２で合致した不要語の言語種別とステップＳ３１で抽出された検索語の言語種別が一致しているかどうかを判定する。この判定は、異種の言語が混在していた場合に適切な不要語の適用を行うためのものであり、例えば、検索要求文「ＩｔｉｓＩＴ革命」から抽出された先頭の単語「Ｉｔ」は英語の単語であり、英語用の不要語として「Ｉｔ」が定義されている場合は不要語として、この検索語を排除すべきであるが、同じ文字列を持つ「ＩＴ革命」の「ＩＴ」は日本語の単語の一部であり、言語情報が不一致であることから、検索語として採用してよいことになる。なお、大文字と小文字は正規化されて同一視されるものとする。
【０１０３】
ステップＳ３４では、単語インデックス２１９との照合を行い、ステップＳ３１で抽出された検索語を含む文書があるか否かを検索する。単語インデックス２１９との照合の結果、検索された文書の文書ＩＤおよび単語出現頻度の情報を不図示のバッファに一時記憶する。
【０１０４】
ステップＳ３５では、検索語保持部２０２のすべての検索語を処理したかどうか判定し、すべての検索語の処理を終えたならば、ステップＳ３６へ分岐する。未処理の検索語があればステップＳ３１へ戻る。
【０１０５】
すべての蓄積文書について処理を終えた後、ステップＳ３６で、ステップＳ３４で一時記憶された単語頻度の降順に検索結果をソートして、検索結果保持部２１４に格納して、本論理検索処理を終了する。
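The logical search of steps S34 to S36 might be sketched as follows. The specification does not state how hits for several search words are combined, so this sketch assumes that a document matches when it contains any of the words and that the frequencies are summed; the word ID 87 in the usage example is also hypothetical.

```python
def logical_search(terms, word_ids, index):
    """Rank documents by the total frequency of the search words they contain.

    word_ids: word -> word ID (the correspondence table of FIG. 7)
    index:    document ID -> {word ID: frequency} (the table of FIG. 6, stored sparsely)
    """
    hits = {}
    for term in terms:
        word_id = word_ids.get(term)
        if word_id is None:
            continue                                # word never registered
        for doc_id, frequencies in index.items():   # S34: look up the word index
            frequency = frequencies.get(word_id, 0)
            if frequency > 0:
                hits[doc_id] = hits.get(doc_id, 0) + frequency
    # S36: sort in descending order of frequency
    return sorted(hits.items(), key=lambda pair: pair[1], reverse=True)


word_ids = {"市場": 1256, "動向": 87}
index = {"00146": {1256: 12}, "00200": {1256: 3, 87: 2}, "00300": {}}
print(logical_search(["市場", "動向"], word_ids, index))
```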
【０１０６】
次に、本実施形態で実行される文書登録処理について、図１６を用いて説明する。
【０１０７】
図１６は、本実施の形態の情報検索装置が実行する文書登録処理の手順を示すフローチャートである。
【０１０８】
同図において、ステップＳ４１では、登録文書２２０より単語を抽出する。単語の抽出は、形態素解析処理として一般に用いられる手法を適用する。
【０１０９】
次に、ステップＳ４２では、単語インデックス２１９への登録を行う。単語インデックステーブルの単語７０１に一致する単語であれば、該当する単語ＩＤを取得し、単語ＩＤをインデックスとする列情報に対象登録文書における出現頻度を格納する。該当する単語が単語インデックステーブルの単語７０１と一致しなければ、単語インデックステーブルに該単語を追加し、新規に一意な単語ＩＤを割り振り、割り振られた単語ＩＤを新規単語インデックスとして、列情報を加え、登録済みの蓄積文書に対しては頻度として“０”を格納し、対象登録文書に対しては出現頻度を格納する。
【０１１０】
次に、ステップＳ４３では、該単語と一致する見出し語５０１があるか概念辞書２１１を検索し、一致する見出し語が存在すれば、対応する概念ベクトルデータを取り出す。
【０１１１】
ステップＳ４４では、ステップＳ４３にて取り出した概念ベクトルデータに頻度に応じた重みを乗じて、文書概念ベクトルデータに加算する。
【０１１２】
ステップＳ４５では、登録文書のすべての単語について処理を終えたかどうか判定し、未処理の単語があればステップＳ４２へ戻り、すべての単語について処理を終えていればステップＳ４６へ分岐する。
【０１１３】
ステップＳ４６では、文書概念ベクトルデータをベクトル要素の二乗和が“１”となるように正規化して、蓄積文書概念ベクトル保持部２１６へ登録して、本文書登録処理を終了する。
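The registration flow of steps S41 to S46 could be sketched in Python as below. It assumes that the words have already been extracted (S41), that the raw frequency itself serves as the frequency-dependent weight, and that the word index is held sparsely so that an absent entry stands for a frequency of 0; new word IDs are simply taken as one more than the largest existing ID. All names are hypothetical.

```python
import math
from collections import Counter


def register_document(doc_id, words, word_ids, index, concept_dict, doc_vectors, dims):
    """Update the word index and store the normalized document concept vector."""
    counts = Counter(words)
    vector = [0.0] * dims
    for word, frequency in counts.items():
        # S42: register the word, assigning a new unique ID if it is unknown.
        if word not in word_ids:
            word_ids[word] = max(word_ids.values(), default=0) + 1
        index.setdefault(doc_id, {})[word_ids[word]] = frequency
        # S43: look up the concept vector of the word, if any.
        concept = concept_dict.get(word)
        if concept is None:
            continue
        # S44: add the concept vector weighted by the frequency.
        for i, weight in enumerate(concept):
            vector[i] += frequency * weight
    # S46: normalize so that the sum of squared elements is 1.
    norm = math.sqrt(sum(x * x for x in vector))
    if norm > 0:
        vector = [x / norm for x in vector]
    doc_vectors[doc_id] = vector
```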
【０１１４】
なお、本実施の形態では、検索対象として文書情報を用いて説明したが、文書情報以外の内容記述メタデータが付随した画像情報、動画情報および番組内容記述情報などのマルチメディア情報についても、内容記述された文章情報に対して情報特徴量抽出を行い、情報特徴量の類似測度を求めることによって、本発明を適用することができる。
【０１１５】
また、本実施の形態では、概念検索のための概念ベクトルとして、概念辞書２１１に単語の属性概念ベクトルを記憶し、その合成ベクトルにより構成することとしたが、通常広く行われている単語の頻度情報あるいは単語頻度と逆文書頻度などから演算される重みを要素とする単語ベクトルによって概念検索のための概念ベクトルを構成することも可能である。
【０１１６】
さらに、本実施の形態では、検索要求文に出現する語が不要語であるか否かを決定し、不要語であった場合は検索語から除外するように構成したが、図１６において詳述した単語インデックスおよび文書概念ベクトル作成の処理において、ステップＳ４１で抽出された単語が選択された不要語であるか否かを図１４におけるステップＳ２２およびステップＳ２３と同様の処理を行い、不要語でないと判定された場合のみステップＳ４２へ進み、不要語である場合は、単語インデックスおよび文書概念ベクトル作成のステップを経ることなく、ステップＳ４１へ戻るように構成することで、単語インデックス２１９および蓄積文書概念ベクトル保持部２１６のサイズを小さくすることができるという利点を持たせることもできる。
【０１１７】
また、本実施の形態では、検索対象となる蓄積文書２１８、蓄積文書概念ベクトル保持部２１６、単語インデックス２１９および概念辞書２１１は、単一の装置を構成するＤＩＳＫ１４に配置するものとして説明したが、これらの構成要件を異なる装置に分散配置し、ＮＩＣ１９を介してネットワーク上で処理を行うようにすることも可能である。
【０１１８】
なお、本発明は複数の機器（例えば、ホストコンピュータ、インタフェース機器、リーダ、プリンタなど）から構成されるシステムに適用しても、ひとつの機器からなる装置（例えば、複写機、ファクシミリ装置など）に適用してもよい。
【０１１９】
なお、上述した実施の形態の機能を実現するソフトウェアのプログラムコードを記録した記憶媒体を、システムまたは装置に供給し、そのシステムまたは装置のコンピュータ（またはＣＰＵやＭＰＵ）が記憶媒体に格納されたプログラムコードを読出し実行することによっても、本発明の目的が達成されることは言うまでもない。
【０１２０】
この場合、記憶媒体から読出されたプログラムコード自体が本発明の新規な機能を実現することになり、そのプログラムコードを記憶した記憶媒体は本発明を構成することになる。
【０１２１】
プログラムコードを供給するための記憶媒体としては、たとえば、フレキシブルディスク、ハードディスク、光磁気ディスク、ＣＤ−ＲＯＭ、ＣＤ−Ｒ、ＣＤ−ＲＷ、ＤＶＤ−ＲＯＭ、ＤＶＤ−ＲＡＭ、ＤＶＤ−ＲＷ、ＤＶＤ＋ＲＷ、磁気テープ、不揮発性のメモリカード、ＲＯＭなどを用いることができる。また、通信ネットワークを介してサーバコンピュータからプログラムコードが供給されるようにしてもよい。
【０１２２】
また、コンピュータが読出したプログラムコードを実行することにより、上述した実施の形態の機能が実現されるだけでなく、そのプログラムコードの指示に基づき、コンピュータ上で稼働しているＯＳなどが実際の処理の一部または全部を行い、その処理によって上述した実施の形態の機能が実現される場合も含まれることは言うまでもない。
【０１２３】
さらに、記憶媒体から読出されたプログラムコードが、コンピュータに挿入された機能拡張ボードやコンピュータに接続された機能拡張ユニットに備わるメモリに書込まれた後、そのプログラムコードの指示に基づき、その機能拡張ボードや機能拡張ユニットに備わるＣＰＵなどが実際の処理の一部または全部を行い、その処理によって上述した実施の形態の機能が実現される場合も含まれることは言うまでもない。
【０１２４】
以下、本発明の実施態様の例を列挙する。
【０１２５】
（実施態様１）検索語または検索文を入力する入力手段と、
該入力手段によって入力された検索語または検索文に基づいて、少なくとも文字データを含む情報複数の中から一部情報を検索する検索手段と、
該検索手段による検索に用いない複数の単語または文字列からなる不要語群を複数種類記憶する記憶手段と、
前記記憶された複数種類の不要語群から、所定の条件に応じて、少なくとも一部の不要語群を選択する選択手段と、
前記検索手段による情報の検索時に、前記選択手段によって選択された文字データ群に含まれる不要語を当該検索語から除外する除外手段と
を有することを特徴とする情報検索装置。
【０１２６】
（実施態様２）前記記憶手段は、前記複数の言語情報のそれぞれに対応付けて、前記各不要語群を記憶し、
前記所定の条件は、前記入力手段によって入力された検索語または検索文の言語情報に対応する不要語群を選択することである
ことを特徴とする実施態様１に記載の情報検索装置。
【０１２７】
（実施態様３）前記検索手段は、複数の検索方法の中から、ユーザによって選択されたいずれかの検索方法を用いて、情報を検索し、
前記記憶手段は、前記複数の検索方法のそれぞれに対応付けて、前記各不要語群を記憶し、
前記所定の条件は、前記検索手段として選択された検索方法に対応する不要語群を選択することである
ことを特徴とする実施態様１に記載の情報検索装置。
【０１２８】
（実施態様４）前記複数の検索方法は、前記入力手段によって入力された検索語または検索文に対応する文字データが含まれる情報を検索する論理検索手段を含むことを特徴とする実施態様３に記載の情報検索装置。
【０１２９】
（実施態様５）前記複数の検索方法は、前記入力手段によって入力された検索語または検索文の有する概念特徴を表す類似性評価尺度に基づいて、当該類似概念に対応する文字データが含まれる情報を検索する概念検索手段を含むことを特徴とする実施態様３に記載の情報検索装置。
【０１３０】
（実施態様６）前記入力手段は、検索の対象となる検索分野を入力でき、
前記記憶手段は、前記複数の検索分野のそれぞれに対応付けて、前記各不要語群を記憶し、
前記所定の条件は、前記入力手段によって入力された検索分野に対応する不要語群を選択することである
ことを特徴とする実施態様１に記載の情報検索装置。
【０１３１】
（実施態様７）前記所定の条件は、ユーザによる指示であることを特徴とする実施態様１に記載の情報検索装置。
【０１３２】
（実施態様８）検索語または検索文を入力する入力手段によって入力された検索語または検索文に基づいて、少なくとも文字データを含む情報複数の中から一部情報を検索する検索ステップと、
該検索ステップによる検索に用いない複数の単語または文字列からなる不要語群を複数種類記憶する記憶手段に記憶された複数種類の不要語群から、所定の条件に応じて、少なくとも一部の不要語群を選択する選択ステップと、
前記検索ステップによる情報の検索時に、前記選択ステップによって選択された文字データ群に含まれる不要語を当該検索語から除外する除外ステップと
を有することを特徴とする情報検索方法。
【０１３３】
（実施態様９）検索語または検索文を入力する入力手段によって入力された検索語または検索文に基づいて、少なくとも文字データを含む情報複数の中から一部情報を検索する検索手順と、
該検索手順による検索に用いない複数の単語または文字列からなる不要語群を複数種類記憶する記憶手段に記憶された複数種類の不要語群から、所定の条件に応じて、少なくとも一部の不要語群を選択する選択手順と、
前記検索手順による情報の検索時に、前記選択手順によって選択された文字データ群に含まれる不要語を当該検索語から除外する除外手順と
をコンピュータに実行させるためのプログラム。
【０１３４】
【発明の効果】
以上説明したように、本発明によれば、複数種類の不要語を保持し、指定した検索要求に従って不要語を選択するようにしたので、操作性に優れ、検索精度の高い情報検索を行うことができる。
【図面の簡単な説明】
【図１】本発明の一実施の形態に係る情報検索装置の概略構成を示すブロック図である。
【図２】図１の情報検索装置の機能構成を示すブロック図である。
【図３】図２の検索要求入力処理部において、検索要求文と検索手法を入力する場合の操作パネルの表示例を示す図である。
【図４】図２の検索要求入力処理部において、検索要求文および検索手法に対応する検索結果の表示例を示す図である。
【図５】図２の概念辞書の構成を示す図である。
【図６】図２の単語インデックスの構成を示す図である。
【図７】図６の単語インデックスにおける単語ＩＤと単語の対応関係を示す図である。
【図８】日本語論理検索における検索不要語を示す図である。
【図９】英語論理検索における検索不要語を示す図である。
【図１０】英語概念検索における検索不要語を示す図である。
【図１１】検索不要語選択指定の定義を示す図である。
【図１２】図１の情報検索装置が実行する文書検索処理の手順を示すフローチャートである。
【図１３】図１２の不要語選択処理の詳細な手順を示すフローチャートである。
【図１４】図１２の概念検索処理の詳細な手順を示すフローチャートである。
【図１５】図１２の論理検索処理の詳細な手順を示すフローチャートである。
【図１６】図１の情報検索装置が実行する文書検索のためのインデックス情報作成処理の手順を示すフローチャートである。
【符号の説明】
１１ＣＰＵ
１２ＲＯＭ
１３ＲＡＭ
１４ＤＩＳＫ
１５ＫＢ
１６ＶＲＡＭ
１７ＣＲＴＣ
１８ＣＲＴ
１９ＮＩＣ
２０１検索要求入力処理部
２０２検索語保持部
２０３不要語選択指定処理部
２０４不要語選択指定保持部
２０５検索方式指定処理部
２０６適用不要語判別処理部
２０７不要語
２０８検索対象分野検知処理部
２０９検索語選別処理部
２１０検索概念ベクトル作成処理部
２１１概念辞書
２１２検索概念ベクトル保持部
２１３概念検索処理部
２１４検索結果保持部
２１５論理検索処理部
２１６蓄積文書概念ベクトル保持部
２１７概念ベクトル作成処理部
２１８蓄積文書
２１９単語インデックス
２２０登録文書

[0001]
[Technical Field of the Invention]
The present invention relates to an information search device that manages a plurality of types of information and searches for desired information from the managed information.
[0002]
[Prior Art]
Conventionally, information search devices are known that search for information, such as documents or image information accompanied by a textual description of its content, by a technique called full-text search: a word or sentence representing the desired information is input, and the stored information containing a word or sentence that matches the input is obtained.
[0003]
Information search devices are also known that obtain not only information containing a word or sentence that matches the input word or sentence, but also information containing a word or sentence judged to be conceptually similar to the input.
[0004]
In such information search devices, search efficiency is improved by excluding from the search target universal words that appear frequently in all information and therefore do not contribute to distinguishing one piece of information from another. For this purpose, the words to be excluded from the search target, called search unnecessary words (hereinafter, unnecessary words), are held in the device.
[0005]
[Problems to Be Solved by the Invention]
However, in the conventional information search device described above, unnecessary words are always excluded from the search target. As a result, a word that should be searched for, given the operator's purpose, is not searched for because it has been defined as an unnecessary word.
[0006]
To address this problem, there are also information search devices that do not use unnecessary words at search time, that is, that treat all search words as search targets. These devices, however, retrieve a large number of unintended frequent words that do not suit the operator's purpose.
[0007]
For example, the English pronoun “it” rarely carries a specific meaning of its own and appears frequently regardless of the stored information, so one would want “it” appearing in a search request sentence to be excluded from the search target as an unnecessary word. In Japanese documents, however, “it” rarely appears as an English pronoun; treating “it” as an unnecessary word instead causes the problem that the “IT” of “IT革命” (IT revolution) is not searched for.
[0008]
Furthermore, in a concept search, which retrieves documents whose concept is similar to the content of the search request sentence, words that are polysemous so that their sense cannot be determined, and words that appear so universally that they do not contribute to distinguishing document concepts, are treated as unnecessary words. In a logical search, by contrast, which makes a binary judgment of whether a search word appears in a document (true) or does not (false), even a word defined as an unnecessary word for the concept search may need to be searched for. There was thus the problem that the words that should be treated as unnecessary words differ depending on the search method.
[0009]
The present invention has been made in view of these points, and an object of the present invention is to provide an information search device capable of performing an appropriate information search according to a specified search request.
[0010]
[Means for Solving the Problems]
In order to achieve the above object, the information search device according to claim 1 comprises: input means for inputting a search word or a search sentence; search means for searching, based on the search word or search sentence input by the input means, for some information among a plurality of pieces of information each containing at least character data; storage means for storing a plurality of types of unnecessary word groups, each consisting of a plurality of words or character strings that are not used in the search by the search means; selection means for selecting at least some of the stored unnecessary word groups according to a predetermined condition; and exclusion means for excluding, when the search means searches for information, the unnecessary words contained in the character data group selected by the selection means from the search word.
[0011]
[Embodiments of the Invention]
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
[0012]
FIG. 1 is a block diagram showing a schematic configuration of an information search device according to one embodiment of the present invention.
[0013]
In FIG. 1, a microprocessor (CPU) 11 performs computation, logical decisions, and the like for information retrieval, and controls the components 12 to 19 connected to the address bus AB, the control bus CB, and the data bus DB via those buses.
[0014]
The address bus AB transfers an address signal indicating each of the components 12 to 19 to be controlled by the CPU 11. The control bus CB transfers and applies control signals of the components 12 to 19 to be controlled by the CPU 11. The data bus DB transfers data among the components 11 to 19.
[0015]
The ROM 12 is a fixed read-only memory (read only memory), and stores control program codes such as processing programs executed by the CPU 11.
[0016]
The RAM 13 is a writable random access memory composed of 16 bits per word, and is used for temporarily storing various data from each component. Further, a search word holding unit 202, an unnecessary word selection designation holding unit 204, a search concept vector holding unit 212, and a search result holding unit 214, which will be described later with reference to FIG. 2, are formed in the RAM 13.
[0017]
The DISK 14 is an external memory and stores the unnecessary words 207, the concept dictionary 211, the stored document concept vector holding unit 216, the stored documents 218, and the word index 219, which will be described later with reference to FIG. 2. It also stores the program code for executing each of the processing units described later with reference to FIG. 2, namely the search request input processing unit 201, the unnecessary word selection designation processing unit 203, the search method designation processing unit 205, the unnecessary word discrimination processing unit 206, the search target field detection processing unit 208, the search word selection processing unit 209, the search concept vector creation processing unit 210, the concept search processing unit 213, the logical search processing unit 215, and the concept vector creation processing unit 217.
[0018]
As a storage medium for storing these data and programs, a ROM, a floppy (registered trademark) disk, a CD-ROM, a DVD-ROM, a memory card, a magneto-optical disk, or the like can be used.
[0019]
The KB 15 is a keyboard and is provided with alphabet keys, hiragana keys, katakana keys, character and symbol input keys such as a period key, and various function keys such as a search key for instructing a search and cursor movement keys for instructing cursor movement.
[0020]
The VRAM 16 is a video memory for display, and stores a pattern of data to be displayed.
[0021]
The CRTC 17 is a CRT controller and plays a role of displaying the contents stored in the VRAM 16 on the CRT 18.
[0022]
The CRT 18 is a cathode ray tube, and the display pattern of the dot configuration and the display of the cursor on the CRT 18 are controlled by the CRTC 17. Note that a display device such as a liquid crystal panel may be employed instead of the CRT.
[0023]
The NIC 19 is a network controller and has a role of connecting to a network such as Ethernet (registered trademark).
[0024]
The information search device configured as described above operates in response to various inputs from the KB 15 and various inputs supplied over the network from the NIC 19. When there is an input from the KB 15 or the NIC 19, an interrupt signal is first sent to the CPU 11. In response, the CPU 11 reads out the various control signals stored in the DISK 14 and starts various kinds of control according to those control signals.
[0025]
FIG. 2 is a block diagram illustrating a functional configuration of the information search device according to the present embodiment.
[0026]
In the figure, a search request input processing unit 201 inputs a requirement (search sentence or search word) relating to a desired search target.
[0027]
The search term holding unit 202 stores the search term input by the search request input processing unit 201.
[0028]
The unnecessary word selection designation processing unit 203 allows the user to designate the search unnecessary words, and the unnecessary word selection designation holding unit 204 stores the user's selection of the unnecessary words designated through the unnecessary word selection designation processing unit 203, for example in the structure described later with reference to FIG. 11.
[0029]
The search method designation processing unit 205 selects a document search method.
[0030]
The unnecessary word discrimination processing unit 206 discriminates and selects an unnecessary word to be applied from a plurality of unnecessary words.
[0031]
The unnecessary word storage unit 207 stores a plurality of unnecessary words defined according to the purpose.
[0032]
The search target field detection processing unit 208 detects and determines a field to be searched.
[0033]
The search word selection processing unit 209 selects the words to be searched for from the search words stored in the search word holding unit 202, based on the unnecessary words selected by the unnecessary word discrimination processing unit 206.
[0034]
The search concept vector creation processing unit 210 creates a search concept vector by referring to the concept dictionary 211 based on the search word selected by the search word selection processing unit 209.
[0035]
The concept dictionary 211 describes and stores the semantic features of each word serving as a headword.
[0036]
The search concept vector holding unit 212 stores the search concept vector created by the search concept vector creation processing unit 210.
[0037]
The concept search processing unit 213 performs a concept search.
[0038]
The logical search processing unit 215 performs a logical search.
[0039]
The search result holding unit 214 stores the processing results of the concept search processing unit 213 and the logical search processing unit 215.
[0040]
The concept vector creation processing unit 217 creates a concept vector for the registered document 220.
[0041]
The stored document concept vector holding unit 216 stores the concept vector created by the concept vector creation processing unit 217 in association with the registered document 220.
[0042]
The word index 219 stores an index, created by the concept vector creation processing unit 217, of the words appearing in the registered document 220.
[0043]
The stored document 218 stores the registered document 220.
[0044]
FIG. 3 is a diagram illustrating a display example of the operation panel when the search request input processing unit 201 indicates a search request sentence or a search request word.
[0045]
In the figure, a display window 301 is for performing a search request input operation.
[0046]
An area 302 is a search sentence input area for inputting a sentence or a word as a search request.
[0047]
The sentence 303 indicates the search request sentence being input, and in the illustrated example, “mobile device market trend” is input.
[0048]
The cursor 304 indicates an input position in the search sentence input area 302.
[0049]
Buttons 305 and 306 are radio buttons for specifying the search method, and one of the two is selected. In the illustrated example, “concept search” is selected.
[0050]
An area 307 is a field specification input area for specifying the target field of the documents to be searched; one field is selected from a list of fields displayed as a pull-down menu. If no particular field is specified, “unspecified” is selected as the default.
[0051]
A button 308 is a search execution button for instructing execution of the search processing. When the search execution button 308 is pressed, search processing based on the search method specified by the radio button 305 or 306 is executed for the search request sentence input in the search sentence input area 302.
[0052]
A button 309 is a cancel button for designating termination or cancellation of the search processing. When the cancel button 309 is pressed, the search processing is immediately terminated, and the display window 301 is closed and terminated.
[0053]
An area 310 is a search result display area for displaying the result of performing the search processing by pressing the search button 308. In the illustrated example, nothing is displayed since the search processing has not been performed.
[0054]
FIG. 4 is a diagram showing a display example of the search results when a search request sentence or a search request word for performing a concept search or a logical search has been specified by the operator in the search request input processing unit 201 and the search processing has been executed.
[0055]
In FIG. 4, an area 401 is a search result display area, like the area 310 in FIG. 3.
[0056]
The area 402 is a rank display area indicating the order of the search result. Search results are ranked in order of similarity to the search request and are displayed in rank order. In the illustrated example, search results of ranks 25 to 30 are displayed.
[0057]
An area 403 is an area for displaying the title of the searched document, and an area 404 is an area for displaying the file name of the document.
[0058]
An area 405 is an area for displaying content from which the substance of the retrieved document can be understood. The document content display area 405 can display an abstract given in advance as a bibliographic attribute of the document, an abstract automatically summarized from the document, or a passage extracted from part of the document that conveys its general meaning.
[0059]
A bar 406 is an elevator bar of the kind commonly used in window display devices of this type. When the search results cannot all be displayed in the search result display area 401, only part of the results is displayed, and the bar 406 is used to designate the display position so that the parts not currently displayed can be brought into view.
[0060]
The display state shown in FIG. 4 is the search result obtained by performing a concept search on “market trend of mobile device” input as the search sentence 303.
[0061]
FIG. 5 is a diagram showing a configuration of the concept dictionary 211. The concept dictionary 211 expresses the concept of a word as a multidimensional vector having universal semantic weights as elements.
[0062]
In the figure, an area 501 stores a word serving as a heading of the concept dictionary 211.
[0063]
An area 502 stores a subscript representing each element of a semantic vector expressed in 256 dimensions for the headword 501.
[0064]
An area 503 stores the weight of each element of the semantic vector. Each weight is a real number between 0 and 1, and the weights are normalized before storage so that the magnitude of the semantic vector is 1.
[0065]
A universal semantic element, which is an element of the multidimensional vectors that make up the concept dictionary 211, is a label that represents a single semantic concept. Examples are the “concept of instruction”, which is contained in words such as “this, that, that, which”; the “concept of grade”, which is contained in words such as “class, grade, grade, rank, rank, inferiority, superiority, honor”; the “concept of change”, which is contained in words such as “change, transformation, innovation, rise”; the “concept of fellowship”, which is contained in words such as “cooperation, greeting, unity, handshake, friendship, diplomacy, friendship”; and the “concept of living things”, which is contained in words such as “animal, mammal, penguin, dog, human, goldfish”. In this way, independent and universal semantic elements that do not depend on any specific word are used. In FIG. 5, 256 types of semantic elements are used to form a 256-dimensional concept expression vector.
[0066]
FIG. 6 is a diagram showing a configuration of the word index 219. The word index 219 is a table that stores the appearance frequency in the document for all words that appear in the registered document.
[0067]
In the figure, the first column information 601 of the table is a document ID for uniquely identifying a registered document. The second to n-th column information 602 of the table are the subscripts indicating the respective words shown in FIG. 7. The last row 603 of the table stores the total number of occurrences of each word. In FIG. 6, the row for the document ID “00146” indicates that the word “market”, indicated by the subscript “1256”, appears 12 times in that document.
[0068]
FIG. 7 is a diagram showing a configuration of a correspondence table in which the correspondence between words and subscripts of the word index table is stored in the word index 219.
[0069]
In the figure, the correspondence table stores a word 701 and a unique word index 702 associated with each other. For example, it is indicated that the word index of the word “market” is “1256”.
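
The word index of FIG. 6 and the correspondence table of FIG. 7 can be pictured as two mappings. The following is a minimal Python sketch under stated assumptions, not the implementation of the embodiment: the names `word_to_id`, `doc_freq`, `frequency`, and `total_occurrences` are hypothetical, the document "00147" is invented for illustration, and only the values "00146", "market", 1256, and 12 come from the description.

```python
# Correspondence table (FIG. 7): word -> unique word index (subscript).
word_to_id = {"market": 1256}

# Word index (FIG. 6): document ID -> {word index: appearance frequency}.
# A missing entry is treated as frequency 0.
doc_freq = {
    "00146": {1256: 12},
    "00147": {1256: 3},  # hypothetical second document
}


def frequency(doc_id: str, word: str) -> int:
    """Return how often `word` appears in document `doc_id` (0 if absent)."""
    word_id = word_to_id.get(word)
    if word_id is None:
        return 0
    return doc_freq.get(doc_id, {}).get(word_id, 0)


def total_occurrences(word: str) -> int:
    """Sum the occurrences over all documents (the totals row 603)."""
    word_id = word_to_id.get(word)
    if word_id is None:
        return 0
    return sum(freqs.get(word_id, 0) for freqs in doc_freq.values())
```

Storing the table sparsely, as nested dictionaries, is a design choice of this sketch; the description itself presents a dense table with one column per word.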
[0070]
Next, the configuration of the unnecessary word 207 will be described with reference to FIGS. 8 to 10. The unnecessary word 207 stores a plurality of unnecessary words.
[0071]
FIG. 8 exemplifies unnecessary words to be referred to at the time of executing a logical search for Japanese as a search target. In the figure, a line having a pound sign (#) at the beginning of the line is a comment line, and unnecessary word definition data is data of the second and subsequent lines, which are arranged in the order of character codes.
[0072]
FIG. 9 exemplifies unnecessary words to be referred to at the time of executing a logical search for English as a search target.
[0073]
In the figure, as in FIG. 8, a line beginning with a sharp symbol is a comment line, and the English unnecessary words are arranged in character code order.
[0074]
FIG. 10 exemplifies unnecessary words to be referred to when executing a concept search with English as the search target. These unnecessary words have the same configuration as those in FIGS. 8 and 9, and the unnecessary words for concept search are listed in character code order.
[0075]
Next, the configuration of the unnecessary word selection designation holding unit 204 will be described with reference to FIG. 11.
[0076]
The unnecessary word selection designation holding unit 204 allows the operator to directly designate, from among the plurality of unnecessary words stored in the unnecessary word 207, the unnecessary words to be used when a search is performed. In FIG. 11, a line with a sharp symbol at its beginning is a comment line and is skipped as data. For each unnecessary word label defined as ON in the [Stop Word Selection] section, the unnecessary word file at the file path associated with that label in the [Stop Word List] section is used. In the example of FIG. 11, the unnecessary words with the labels STOP1 and STOP2, for which ON is specified in the [Stop Word Selection] section, are designated, and the unnecessary words at the file paths C:\JBOOL.dat and C:\EBOOL.dat specified in the [Stop Word List] section are used in the search processing.
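
Reading such a designation can be sketched with Python's standard `configparser` module. This is an illustration only: FIG. 11 is not reproduced here, so the `LABEL=value` syntax, the STOP3 entry, and its path are assumptions, and `selected_stop_word_files` is a hypothetical helper name.

```python
import configparser

# Hypothetical file contents modeled on the description of FIG. 11.
# The key=value syntax and the STOP3 entry are assumptions.
SAMPLE = r"""
# unnecessary word selection designation
[Stop Word Selection]
STOP1=ON
STOP2=ON
STOP3=OFF

[Stop Word List]
STOP1=C:\JBOOL.dat
STOP2=C:\EBOOL.dat
STOP3=C:\ECONC.dat
"""


def selected_stop_word_files(text: str) -> list[str]:
    """Return the file paths of every label switched ON in [Stop Word Selection]."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep label case (STOP1, not stop1)
    parser.read_string(text)  # lines starting with '#' are skipped as comments
    selection = parser["Stop Word Selection"]
    paths = parser["Stop Word List"]
    return [
        paths[label]
        for label, state in selection.items()
        if state.strip().upper() == "ON" and label in paths
    ]
```

With the sample above, only the paths for STOP1 and STOP2 are returned, which matches the behavior described for the labels set to ON.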
[0077]
Next, a document search process executed in the present embodiment will be described with reference to FIG. 12.
[0078]
FIG. 12 is a flowchart illustrating a procedure of a document search process performed by the information search device according to the present embodiment.
[0079]
In the figure, in step S1, search request input processing is performed by the processing module that operates as the search request input processing unit 201. In this processing, the search request sentence 303 input to the search request sentence input area 302 is stored in the search word holding unit 202, and the search target field selected in the search target field designation area 307 is stored in a buffer memory (not shown) in the search target field detection processing unit 208. The search method specified by the radio button 305 or 306 is stored in a buffer memory (not shown) in the search method specification processing unit 205.
[0080]
In step S2, the unnecessary words to be used when executing the search are selected based on conditions such as the search method, the unnecessary word selection designation, the search target field, and the search target language. The details of the unnecessary word selection processing will be described later with reference to FIG. 13.
[0081]
In step S3, the search method is determined. The search method stored in the buffer memory in the search method specification processing unit 205 during the search request input processing in step S1 is referred to. If the concept search has been specified, the process proceeds to step S4; if the logical search has been specified, the process branches to step S5.
[0082]
In step S4, a concept search process is performed according to the search request input in the search request input processing of step S1, to search for documents whose content expresses a similar concept. The details of the concept search process will be described later with reference to FIG. 14.
[0083]
In step S5, a logical search process is performed according to the search request input in the search request input processing of step S1, to search, by referring to the word index 219, for documents in which a word appearing in the search request sentence appears. The details of the logical search process will be described later with reference to FIG. 15.
[0084]
In step S6, the search result searched in step S4 or step S5 is retrieved from the search result holding unit 214 and displayed. This process is a known process widely performed in the same type of information search device.
[0085]
FIG. 13 is a flowchart showing a detailed procedure of the unnecessary word selection processing in step S2.
[0086]
In the figure, in step S11, unnecessary words are selected according to the language of the search request sentence. The language in which the search request sentence stored in the search word holding unit 202 is written is identified, and the unnecessary words corresponding to that language are selected. For example, if the search request sentence is written in English, the unnecessary words for English are selected, and if it is written in Japanese, the unnecessary words for Japanese are selected. The language identification itself is not shown, but any of the following methods can be selected: specifying the language in advance; selecting the language of the dictionary that has the highest matching rate for the words obtained from the search request sentence; or estimating the language from the distribution of the character codes that make up the search request sentence.
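
Of the identification methods mentioned for step S11, estimation from the character code distribution is the easiest to sketch. The function below is a hedged illustration, not the method of the embodiment: the name `guess_language`, the two-language scope, and the majority-vote rule are all assumptions.

```python
def guess_language(text: str) -> str:
    """Guess "ja" or "en" from the character code distribution of `text`."""
    japanese = 0
    latin = 0
    for ch in text:
        code = ord(ch)
        # Hiragana and Katakana (U+3040 to U+30FF), CJK ideographs (U+4E00 to U+9FFF).
        if 0x3040 <= code <= 0x30FF or 0x4E00 <= code <= 0x9FFF:
            japanese += 1
        elif ch.isascii() and ch.isalpha():
            latin += 1
    # Assumption: simple majority vote; ties fall back to English.
    return "ja" if japanese > latin else "en"
```

A sentence-level guess like this cannot resolve mixed-language requests on its own, which is why the per-word language check of steps S22 and S23 is still needed.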
[0087]
Next, in step S12, unnecessary words are selected by a search method. Referring to the search method stored in the buffer memory in the search method specification processing unit 205 in step S1 of FIG. 12, if a concept search is specified, an unnecessary word for the concept search is selected, and a logical search is specified. In this case, an unnecessary word for a logical search is selected.
[0088]
Next, in step S13, unnecessary words are selected according to the search target field. The search target field stored in step S1 in the buffer memory in the search target field detection processing unit 208 is referred to, and the unnecessary words that match the specified search target field are selected. When the search target field is not specified, or when there are no unnecessary words corresponding to the specified field, the process proceeds to the next step.
[0089]
Next, in step S14, unnecessary words are selected according to the unnecessary word designation. The unnecessary word selection designation information stored in the unnecessary word selection designation holding unit 204 is referred to, and if the unnecessary words to be used are designated there, the unnecessary words shown in the unnecessary word selection designation holding unit 204 are selected regardless of the result of the selection in steps S11 to S13.
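
The selection cascade of steps S11 to S14 can be sketched as successive filters over the stored unnecessary word lists. This is a minimal sketch under assumptions: the `StopList` record, the function name, and the rule that lists without a field act as the fallback in S13 are hypothetical, since the description does not state how the stored lists are tagged.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StopList:
    label: str                   # hypothetical identifier of a stored list
    language: str                # e.g. "ja" or "en"
    method: str                  # "concept" or "logical"
    field: Optional[str] = None  # None: not tied to a particular field


def select_stop_lists(lists, language, method, field=None, designated=None):
    """Narrow the stored lists following steps S11 to S14."""
    # S11: keep the lists for the language of the search request.
    chosen = [s for s in lists if s.language == language]
    # S12: keep the lists for the specified search method.
    chosen = [s for s in chosen if s.method == method]
    # S13: use field-specific lists when they exist; otherwise fall back
    # to the lists that are not tied to any field (an assumption).
    by_field = [s for s in chosen if field is not None and s.field == field]
    chosen = by_field or [s for s in chosen if s.field is None]
    # S14: an explicit designation overrides the result of S11 to S13.
    if designated:
        chosen = [s for s in lists if s.label in designated]
    return chosen
```

Evaluating S14 last mirrors the description, in which the designation wins regardless of what the earlier steps selected.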
[0090]
FIG. 14 is a flowchart showing a detailed procedure of the concept search process in step S4 of FIG. 12.
[0091]
In the figure, in step S21, the search sentence stored in the search word holding unit 202 is extracted and divided into words. For dividing the search sentence into words, a method known as morphological analysis processing is applied.
[0092]
Next, in step S22, referring to the unnecessary word selected in step S2 of FIG. 12, it is searched whether the word extracted in step S21 is included in the unnecessary word. If it is not an unnecessary word, the process immediately branches to step S24, and if it is an unnecessary word, the process proceeds to step S23.
[0093]
In step S23, it is determined whether the language type of the unnecessary word matched in step S22 matches the language type of the search word extracted in step S21. This determination ensures that the appropriate unnecessary words are applied when different languages are mixed. For example, the first word “It” extracted from the search request sentence “It is IT revolution” is an English word, so if “It” is defined as an unnecessary word for English, this search word should be excluded as an unnecessary word. The “IT” of “IT revolution” has the same character string, but it is part of a Japanese word, so the language information does not match and it can be adopted as a search word. Note that uppercase and lowercase letters are normalized before comparison.
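
The two-stage check of steps S22 and S23 can be sketched as follows. This is an illustration under assumptions: each extracted word is taken to arrive with a language tag from the morphological analysis, the unnecessary words are modeled as (text, language) pairs, and the names `STOP_WORDS` and `keep_as_search_word` are hypothetical.

```python
# Hypothetical unnecessary words, each tagged with the language it belongs to.
STOP_WORDS = {("it", "en"), ("is", "en")}


def keep_as_search_word(token: str, token_lang: str, stop_words=STOP_WORDS) -> bool:
    """Return True if the token should be adopted as a search word."""
    normalized = token.casefold()  # upper and lower case are normalized
    # S22: does the character string match any selected unnecessary word?
    matching_langs = {lang for text, lang in stop_words if text == normalized}
    if not matching_langs:
        return True
    # S23: exclude the token only if the languages also agree.
    return token_lang not in matching_langs
```

For the request in the example, the English word "It" is dropped while the "IT" tagged as part of a Japanese word is kept, even though both normalize to the same string.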
[0094]
In step S24, the concept dictionary 211 is searched for a headword 501 that matches the search word. If a matching headword 501 exists in the concept dictionary 211, the corresponding concept vector data is extracted from the concept dictionary 211.
[0095]
In step S25, the search concept vector is built up by adding the acquired concept vector data to it component by component. All the elements of the search concept vector are initialized to “0” in advance.
[0096]
In step S26, it is determined whether all the search words in the search word holding unit 202 have been processed. When all of them have been processed, the search concept vector data is normalized so that the sum of the squares of its elements becomes “1”, the result is stored in the search concept vector holding unit 212, and the process branches to step S27. If there is an unprocessed search word, the process returns to step S21.
[0097]
In step S27, a stored document concept vector is acquired from the stored document concept vector holding unit 216, and the concept similarity between it and the search concept vector stored in the search concept vector holding unit 212 is calculated. The concept similarity can be calculated as the cosine measure of the two vectors. The calculated concept similarity is temporarily stored in a buffer memory (not shown) in association with the document ID of the stored document.
[0098]
After finishing the processing of step S27 for all the stored documents, the process proceeds to step S28, where the search results are sorted in descending order of the concept similarity temporarily stored in step S27, and stored in the search result holding unit 214. The concept search process ends.
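
Steps S24 to S28 can be sketched in a few lines. The following is a simplified illustration, not the implementation of the embodiment: it uses toy vectors of any small dimension instead of the 256 dimensions of FIG. 5, and the function names are hypothetical. Because both the search concept vector and the stored document concept vectors are normalized to unit length, the cosine measure reduces to a dot product.

```python
import math


def normalize(vec):
    """Scale a vector so that the sum of the squares of its elements is 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def search_concept_vector(words, concept_dict, dims):
    """S24 to S26: sum the concept vectors of the search words, then normalize."""
    total = [0.0] * dims  # all elements start at 0
    for word in words:
        vec = concept_dict.get(word)  # S24: look up the headword
        if vec is not None:
            total = [a + b for a, b in zip(total, vec)]  # S25
    return normalize(total)  # S26


def concept_search(query_vec, doc_vectors):
    """S27 and S28: score every stored document and sort by similarity."""
    # Assumes unit-length vectors, so the dot product equals the cosine.
    scores = {
        doc_id: sum(q * d for q, d in zip(query_vec, vec))
        for doc_id, vec in doc_vectors.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Words that have no headword in the dictionary simply contribute nothing to the search concept vector in this sketch.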
[0099]
FIG. 15 is a flowchart showing a detailed procedure of the logical search process in step S5 of FIG. 12.
[0100]
In the figure, in step S31, the search sentence stored in the search word holding unit 202 is extracted and divided into words. For dividing the search sentence into words, a method known as morphological analysis processing is applied.
[0101]
Next, in step S32, the unnecessary word selected in step S2 in FIG. 12 is referred to, and it is searched whether the word extracted in step S31 is included in the unnecessary word. If it is not an unnecessary word, the process immediately branches to step S34, and if it is an unnecessary word, the process proceeds to step S33.
[0102]
In step S33, it is determined whether the language type of the unnecessary word matched in step S32 matches the language type of the search word extracted in step S31. This determination ensures that the appropriate unnecessary words are applied when different languages are mixed. For example, the first word “It” extracted from the search request sentence “It is IT revolution” is an English word, so if “It” is defined as an unnecessary word for English, this search word should be excluded as an unnecessary word. The “IT” of “IT revolution” has the same character string, but it is part of a Japanese word, so the language information does not match and it can be adopted as a search word. Note that uppercase and lowercase letters are normalized before comparison.
[0103]
In step S34, collation with the word index 219 is performed, and a search is performed to determine whether there is a document including the search term extracted in step S31. As a result of collation with the word index 219, the information of the document ID and the word appearance frequency of the searched document is temporarily stored in a buffer (not shown).
[0104]
In step S35, it is determined whether or not all the search terms in the search term holding unit 202 have been processed, and when all the search terms have been processed, the flow branches to step S36. If there is an unprocessed search word, the process returns to step S31.
[0105]
After the processing for all the stored documents is completed, in step S36, the search results are sorted in descending order of the word frequency temporarily stored in step S34 and stored in the search result holding unit 214, and the logical search processing ends.
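
The logical search of steps S34 to S36 amounts to a lookup in the word index followed by a sort. The sketch below is an assumption-laden illustration: the description does not say how the frequencies of several search words are combined, so summing them per document is a choice made here, and the function and parameter names are hypothetical.

```python
from collections import Counter


def logical_search(search_words, word_to_id, doc_freq):
    """Rank the documents that contain any of the search words."""
    scores = Counter()
    for word in search_words:
        word_id = word_to_id.get(word)
        if word_id is None:
            continue  # the word is not in the word index
        # S34: collect the document IDs and appearance frequencies.
        for doc_id, freqs in doc_freq.items():
            count = freqs.get(word_id, 0)
            if count:
                scores[doc_id] += count  # assumption: sum across search words
    # S36: sort in descending order of frequency.
    return scores.most_common()
```

Documents in which none of the search words appear never enter the result, which is what distinguishes this path from the similarity ranking of the concept search.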
[0106]
Next, a document registration process executed in the present embodiment will be described with reference to FIG. 16.
[0107]
FIG. 16 is a flowchart illustrating a procedure of a document registration process performed by the information search device according to the present embodiment.
[0108]
In the figure, in step S41, words are extracted from the registered document 220. For the extraction of words, a method generally used for morphological analysis is applied.
[0109]
Next, in step S42, registration in the word index 219 is performed. If the word matches the word 701 in the word index table, the corresponding word ID is obtained, and the appearance frequency in the target registered document is stored in the column information using the word ID as an index. If the corresponding word does not match the word 701 in the word index table, the word is added to the word index table, a new unique word ID is allocated, and the allocated word ID is used as a new word index, and column information is added. For a stored document that has already been registered, “0” is stored as the frequency, and for a target registered document, the appearance frequency is stored.
[0110]
Next, in step S43, the concept dictionary 211 is searched for a headword 501 that matches the word, and if a matching headword exists, the corresponding concept vector data is extracted.
[0111]
In step S44, the concept vector data extracted in step S43 is multiplied by a weight corresponding to the frequency, and is added to the document concept vector data.
[0112]
In step S45, it is determined whether or not processing has been completed for all words in the registered document. If there are unprocessed words, the process returns to step S42. If processing has been completed for all words, the process branches to step S46.
[0113]
In step S46, the document concept vector data is normalized so that the sum of the squares of the vector elements becomes "1", registered in the stored document concept vector holding unit 216, and the document registration process ends.
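
The registration steps S41 to S46 can be sketched together. This is a hedged illustration rather than the implementation of the embodiment: the words are assumed to be already extracted, the raw appearance frequency is used as the "weight corresponding to the frequency", new word IDs are allocated as the current maximum plus one, and all names are hypothetical.

```python
import math
from collections import Counter


def register_document(doc_id, words, word_to_id, doc_freq, concept_dict, dims):
    """Update the word index and return the normalized document concept vector."""
    counts = Counter(words)  # appearance frequency of each extracted word
    doc_freq[doc_id] = {}
    doc_vec = [0.0] * dims
    for word, freq in counts.items():
        # S42: register in the word index, allocating a new ID if needed.
        if word not in word_to_id:
            word_to_id[word] = max(word_to_id.values(), default=0) + 1
        doc_freq[doc_id][word_to_id[word]] = freq
        # S43 and S44: add the concept vector weighted by the frequency.
        vec = concept_dict.get(word)
        if vec is not None:
            doc_vec = [a + freq * b for a, b in zip(doc_vec, vec)]
    # S46: normalize so that the sum of the squares of the elements is 1.
    norm = math.sqrt(sum(x * x for x in doc_vec))
    return [x / norm for x in doc_vec] if norm else doc_vec
```

In this sparse representation, a word that is new to the index is simply absent from the entries of previously registered documents, which plays the role of the stored frequency "0".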
[0114]
Although the present embodiment has been described using document information as the search target, the present invention can also be applied to information other than documents, such as multimedia information (image information, video information, program content description information, and the like) accompanied by content description metadata, by extracting information feature amounts from the accompanying descriptive text and obtaining a similarity measure between those feature amounts.
[0115]
Further, in the present embodiment, the concept vector for concept search is formed as a composite of the word attribute concept vectors stored in the concept dictionary 211. However, the concept vector for concept search can also be configured as a word vector whose elements are weights calculated from word frequency information, or from the word frequency and the inverse document frequency.
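
The alternative mentioned here, a word vector weighted by word frequency and inverse document frequency, can be sketched as follows. The specific weighting tf × log(N / df) is one common variant chosen for illustration; the description does not specify a formula, and `tfidf_vectors` is a hypothetical name.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each document ID to a sparse {word: weight} vector.

    `docs` maps a document ID to the list of words extracted from it.
    """
    n_docs = len(docs)
    counts = {doc_id: Counter(words) for doc_id, words in docs.items()}
    # Document frequency: in how many documents each word appears.
    doc_frequency = Counter(word for c in counts.values() for word in c)
    return {
        doc_id: {
            word: tf * math.log(n_docs / doc_frequency[word])
            for word, tf in c.items()
        }
        for doc_id, c in counts.items()
    }
```

Under this weighting, a word that appears in every document receives weight 0, so it contributes nothing to the similarity.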
[0116]
Further, in the present embodiment, it is determined whether a word appearing in the search request sentence is an unnecessary word, and if it is, the word is excluded from the search words. The same determination can also be applied in the process of creating the word index and the document concept vector: whether the word extracted in step S41 is a selected unnecessary word is determined in the same manner as in steps S22 and S23 of FIG. 14, and the process proceeds to step S42 only when the word is determined not to be an unnecessary word. When the word is an unnecessary word, the process returns to step S41 without going through the steps of creating the word index and the document concept vector. This also provides the advantage that the size of the holding unit 216 can be reduced.
[0117]
In the present embodiment, the stored document 220 to be searched, the stored document concept vector holding unit 216, the word index 219, and the concept dictionary 211 are described as being arranged in the DISK 14 constituting a single device. It is also possible to disperse these components in different devices and perform processing on a network via the NIC 19.
[0118]
The present invention may be applied to a system including a plurality of devices (for example, a host computer, an interface device, a reader, a printer, etc.), or to an apparatus consisting of a single device (for example, a copying machine, a facsimile machine, etc.).
[0119]
A storage medium storing a program code of software for realizing the functions of the above-described embodiments is supplied to a system or an apparatus, and a computer (or CPU or MPU) of the system or the apparatus stores the program stored in the storage medium. It goes without saying that the object of the present invention is also achieved by reading and executing the code.
[0120]
In this case, the program code itself read from the storage medium implements the novel function of the present invention, and the storage medium storing the program code constitutes the present invention.
[0121]
Examples of a storage medium for supplying the program code include a flexible disk, a hard disk, a magneto-optical disk, a CD-ROM, a CD-R, a CD-RW, a DVD-ROM, a DVD-RAM, a DVD-RW, a DVD+RW, a magnetic tape, a nonvolatile memory card, and a ROM. Further, the program code may be supplied from a server computer via a communication network.
[0122]
In addition, it goes without saying that the functions of the above-described embodiments are realized not only when the computer executes the read program code, but also when the OS or the like running on the computer performs part or all of the actual processing based on the instructions of the program code, and the functions of the above-described embodiments are realized by that processing.
[0123]
Further, after the program code read from the storage medium is written into a memory provided on a function expansion board inserted into the computer or a function expansion unit connected to the computer, the function expansion is performed based on the instruction of the program code. It goes without saying that the CPU or the like provided in the board or the function expansion unit performs part or all of the actual processing, and the processing realizes the functions of the above-described embodiments.
[0124]
Hereinafter, examples of embodiments of the present invention will be listed.
[0125]
(Embodiment 1) Input means for inputting a search word or a search sentence,
Searching means for searching for partial information from a plurality of information including at least character data based on a search word or a search sentence input by the input means;
Storage means for storing a plurality of types of unnecessary word groups consisting of a plurality of words or character strings not used in the search by the search means;
Selecting means for selecting at least some of the unnecessary word groups from the stored plurality of types of unnecessary word groups according to a predetermined condition;
An exclusion unit for excluding unnecessary words included in the character data group selected by the selection unit from the search word when searching for information by the search unit;
An information retrieval device, comprising:
[0126]
(Embodiment 2) The storage unit stores the unnecessary word groups in association with each of the plurality of pieces of language information,
The predetermined condition is to select an unnecessary word group corresponding to the language information of the search word or the search sentence input by the input unit.
The information retrieval device according to the first embodiment, characterized in that:
[0127]
(Embodiment 3) The search means searches for information using any one of a plurality of search methods selected by a user,
The storage unit stores the unnecessary word groups in association with each of the plurality of search methods,
The predetermined condition is to select an unnecessary word group corresponding to the search method selected as the search means.
The information retrieval device according to the first embodiment, characterized in that:
[0128]
(Embodiment 4) The information retrieval device described above, wherein the plurality of search methods include a logical search unit that searches for information including character data corresponding to a search word or search sentence input by the input unit.
[0129]
(Embodiment 5) The information retrieval device according to embodiment 3, wherein the plurality of search methods include a concept search unit that searches for information including character data corresponding to a similar concept, based on a similarity evaluation scale representing the conceptual features of a search word or search sentence input by the input unit.
[0130]
(Embodiment 6) The input means can input a search field to be searched,
The storage unit stores the unnecessary word groups in association with each of the plurality of search fields,
The predetermined condition is that an unnecessary word group corresponding to the search field input by the input unit is selected.
The information retrieval device according to the first embodiment, characterized in that:
[0131]
(Embodiment 7) The information search device according to embodiment 1, wherein the predetermined condition is an instruction from a user.
[0132]
(Embodiment 8) A search step of searching for partial information from a plurality of pieces of information including at least character data, based on a search word or a search sentence input by an input means for inputting a search word or a search sentence;
A selection step of selecting, according to a predetermined condition, at least some of the unnecessary word groups from a plurality of types of unnecessary word groups stored in a storage unit, each unnecessary word group consisting of a plurality of words or character strings not used in the search in the search step;
An exclusion step of excluding unnecessary words included in the character data group selected by the selection step from the search word when searching for information by the search step;
An information search method characterized by having:
[0133]
(Embodiment 9) A search procedure for searching for partial information from a plurality of pieces of information including at least character data, based on a search word or a search sentence input by an input means for inputting a search word or a search sentence,
A selection procedure for selecting, according to a predetermined condition, at least some of the unnecessary word groups from a plurality of types of unnecessary word groups stored in a storage unit, each unnecessary word group consisting of a plurality of words or character strings not used in the search by the search procedure,
An exclusion procedure for excluding unnecessary words included in the character data group selected by the selection procedure from the search word when searching for information by the search procedure;
A program for causing a computer to execute.
[0134]
[Effects of the invention]
As described above, according to the present invention, a plurality of types of unnecessary words are held, and unnecessary words are selected in accordance with the specified search request, so that information retrieval with excellent operability and high search accuracy can be performed.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a schematic configuration of an information search device according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating a functional configuration of the information search device of FIG. 1;
FIG. 3 is a diagram showing a display example of an operation panel when a search request sentence and a search method are input in the search request input processing unit in FIG. 2;
FIG. 4 is a diagram showing a display example of a search request sentence and a search result corresponding to a search method in the search request input processing unit of FIG. 2;
FIG. 5 is a diagram showing a configuration of a concept dictionary of FIG. 2;
FIG. 6 is a diagram illustrating a configuration of a word index in FIG. 2;
FIG. 7 is a diagram showing the correspondence between word IDs and words in the word index of FIG. 6;
FIG. 8 is a diagram showing search-unnecessary words in Japanese logic search.
FIG. 9 is a diagram showing search-unnecessary words in an English logical search.
FIG. 10 is a diagram showing search unnecessary words in an English concept search.
FIG. 11 is a diagram showing a definition of search-unnecessary word selection designation.
FIG. 12 is a flowchart illustrating a procedure of a document search process executed by the information search device of FIG. 1;
FIG. 13 is a flowchart showing a detailed procedure of an unnecessary word selecting process in FIG. 12;
FIG. 14 is a flowchart showing a detailed procedure of a concept search process of FIG. 12;
FIG. 15 is a flowchart illustrating a detailed procedure of a logical search process in FIG. 12;
FIG. 16 is a flowchart showing a procedure of index information creation processing for document search executed by the information search device of FIG. 1;
[Explanation of symbols]
11 CPU
12 ROM
13 RAM
14 DISK
15 KB
16 VRAM
17 CRTC
18 CRT
19 NIC
201 Search request input processing unit
202 Search word holding unit
203 Unnecessary word selection designation processing unit
204 Unnecessary word selection designation holding unit
205 Search method specification processing unit
206 Unnecessary word discrimination processing unit
207 Unnecessary words
208 Search target field detection processing unit
209 Search word selection processing unit
210 Search concept vector creation processing unit
211 Concept dictionary
212 Search concept vector holding unit
213 Concept search processing unit
214 Search result holding unit
215 Logical search processing unit
216 Stored document concept vector holding unit
217 Concept vector creation processing unit
218 Stored document
219 Word index
220 Registered document

Claims

An input means for inputting a search word or a search sentence,
Searching means for searching for partial information from a plurality of information including at least character data based on a search word or a search sentence input by the input means;
Storage means for storing a plurality of types of unnecessary word groups consisting of a plurality of words or character strings not used in the search by the search means;
Selecting means for selecting at least some of the unnecessary word groups from the stored plurality of types of unnecessary word groups according to a predetermined condition;
An information retrieval apparatus, comprising: an exclusion unit configured to exclude an unnecessary word included in a character data group selected by the selection unit from the search word when searching for information by the search unit.