JP4037001B2 - Database creation device and database search device

Info

Publication number: JP4037001B2
Application number: JP04531299A
Authority: JP
Inventors: 則宏嶺岸; 郁子高梨; 聡田中
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1999-02-23
Filing date: 1999-02-23
Publication date: 2008-01-23
Anticipated expiration: 2019-02-23
Also published as: JP2000242662A

Description

【０００１】
【発明の属する技術分野】
本発明は、データベースのインデクスを自動的に作成するデータベース作成装置、および作成されたデータベースに対してカテゴリーを絞りながら検索を行うデータベース検索装置に関する。
【０００２】
【従来の技術】
図１３は、従来の類似検索装置を示すブロック構成図である。この検索装置は、属性と属性値のペアで表現されたデータを蓄えているデータベース１７と、類似データの検索の前にデータベース１７中のデータからデータ量を第１のインデクスを用いて絞り込む第１検索部１５と、属性値間の類似度の範囲と属性の重要度から類似度の範囲を計算し、第１検索部１５で検索されたデータを第２のインデクスを基に類似検索する第２検索部１６と、類似度範囲にしたがって第１のインデクスを変更する第１のインデクスの変更手段を備えた推論処理部１４と、第１インデクスの類似度値を設定し、類似度値に基づいて、第１インデクスのレベルを決定する第１インデクス生成部１８と、入力装置１１と、出力装置１２と、入出力制御部１３とから構成される。
【０００３】
この検索装置は、第１インデクスの類似度値を設定し、類似度値に基づいて第１インデクスのレベルを決定し、属性値間の類似度の範囲と第２インデクスを基に類似度を計算し、計算された類似度範囲にしたがって第１インデクスを変更して、類似検索を行う。このような検索装置は、たとえば特開平６−１７６０７２号公開公報に開示されている。
【０００４】
また、図１４は、従来の情報検索装置を示すブロック構成図である。この検索装置は、キーワードメモリ２４からの索引キーワード行列と相関度に応じて修正された検索ベクトルとを内積する演算器１９と、その結果を部分的線形に変換する部分線形器２０と、部分線形器２０の出力ベクトルとキーワード行列Ｘとを積する第２演算器２１と、積結果の各要素に対して０，１に正規化する正規化器２２と、演算器１９に１回フィードバックする前の正規化器２２の出力ベクトルと比較するコントローラ２３と、修正された部分線形器２０の出力ベクトルの位置の対応するアドレスに基づき、データベース２５から所望データを読み出す読み出し器２６と、その所望データを表示する表示器２７とから構成される。
【０００５】
この検索装置は、具体的には図１５に示す構成からなり、キーワード入力部２８から入力されたキーワードを蓄積部３３に蓄積するとともに、それを演算器２４で数値ベクトル化し、キーワード相関テーブル３５を参照して、相関度に応じた、より関連のあるキーワードを見つけ、変換器２９において、最初に入力されたキーワードを、その見つけたキーワードに変換する。そして、検索装置は、検索器３０により新たにそのキーワードを検索条件としてデータベース３６を検索し、読み出し器３１で読み出し、選択器３７を介して選択された結果を表示器３２に表示する。このような検索装置は、たとえば特開平８−８７５０８号公開公報に開示されている。
【０００６】
【発明が解決しようとする課題】
しかしながら、上述した類似検索装置では、単語単位の類似度を求めるためには属性と属性値のペアの状態でデータが格納されていなければならないという問題点や、属性の重要度のように、使用者の勘や経験に基づく目標や見本、あるいは使用者の意図が反映された目標や見本を設定しなければならないため、使用者によって得られる結果が異なるという問題点がある。
【０００７】
また、上述した情報検索装置では、何らかの方法により算出した一般的な相関度合いを示すキーワードの相関テーブルを用いて、入力されたキーワードを別のキーワードに変換しているため、異なる分野のデータであっても文字列が同じ単語であれば同じ相関になってしまい、分野に応じた適切な結果が得られないという問題点がある。これを回避するために仮に相関テーブルを修正すると、キーワードの空間全体に影響がおよび、全ての検索に対して性能が向上するとは限らない。
【０００８】
本発明は、上記問題点を解決するためになされたもので、データベースに与えたインデクス基準に基づいて、データに含まれる単語間の関連度を自動的に生成し、さらにその関連度に対して適切な重み付けを行うことによって、インデクス基準に影響を及ぼすことなく自動的にインデクスを作成するデータベース作成装置、およびそのデータベースに対してカテゴリーを絞りながら検索を行うデータベース検索装置を得ることを目的とする。
【０００９】
【課題を解決するための手段】
上記目的を達成するため、本発明は、データベースにデータを入力するためのデータ入力装置と、インデクスの基準となる構成を規定したインデクス基準を入力するためのインデクス基準読込装置と、入力されたデータに対して、前記インデクス基準で使用されている単語の出現頻度を調べ、同時に出現する単語について関連度を算出し、その関連度と前記インデクス基準とに基づいて単語関連度マップを作成する単語関連度マップ作成装置と、入力されたデータの第１の文書中に出現する単語数を、その第１の文書の要約または見出しとなる第２の文書中に出現する単語数で除し、得られた値の分だけ前記第２の文書中の単語に対して重み付けを行う単語重要度付与装置と、前記単語重要度付与装置により得られた重み付けを用いて、前記単語関連度マップ作成装置により作成された単語関連度マップの関連度を一時的に修正し、その修正された単語関連度マップを用いて、前記入力データに対してインデクスを作成するインデクス作成装置と、を具備することを特徴とする。
【００１０】
この発明によれば、データ入力装置によりデータが入力されるとともに、インデクス基準読込装置によりインデクス基準が入力されると、単語関連度マップ作成装置は、入力データに対して、インデクス基準で使用されている単語の出現頻度を調べ、同時に出現する単語について関連度を算出し、その関連度とインデクス基準とに基づいて単語関連度マップを作成する。また、単語重要度付与装置は、入力データの第１の文書中に出現する単語数を、その第１の文書の要約または見出しとなる第２の文書中に出現する単語数で除し、得られた値の分だけ第２の文書中の単語に対して重み付けを行う。また、インデクス作成装置は、その重み付けを用いて単語関連度マップの関連度を一時的に修正し、その修正された単語関連度マップを用いて入力データに対してインデクスを作成する。
【００１１】
また本発明は、データベースにデータを入力するためのデータ入力装置と、インデクスの基準となる構成を規定したインデクス基準を入力するためのインデクス基準読込装置と、入力されたデータに対して、前記インデクス基準で使用されている単語の出現頻度を調べ、同時に出現する単語について関連度を算出し、その関連度と前記インデクス基準とに基づいて単語関連度マップを作成する単語関連度マップ作成装置と、入力されたデータの第１の文書中に出現する単語数を、その第１の文書の要約または見出しとなる第２の文書中に出現する単語数で除し、得られた値の分だけ前記第２の文書中の単語に対して重み付けを行う単語重要度付与装置と、前記単語重要度付与装置により得られた重み付けを用いて、前記単語関連度マップ作成装置により作成された単語関連度マップの関連度を一時的に修正し、その修正された単語関連度マップを用いて、前記入力データに対してインデクスを作成するインデクス作成装置と、前記インデクス作成装置により作成されたインデクスに基づいて検索を行うデータ検索装置と、その検索結果を表示する結果表示装置と、を具備することを特徴とする。
【００１２】
この発明によれば、データ入力装置によりデータが入力されるとともに、インデクス基準読込装置によりインデクス基準が入力されると、単語関連度マップ作成装置は、入力データに対して、インデクス基準で使用されている単語の出現頻度を調べ、同時に出現する単語について関連度を算出し、その関連度とインデクス基準とに基づいて単語関連度マップを作成する。また、単語重要度付与装置は、入力データの第１の文書中に出現する単語数を、その第１の文書の要約または見出しとなる第２の文書中に出現する単語数で除し、得られた値の分だけ第２の文書中の単語に対して重み付けを行う。また、インデクス作成装置は、その重み付けを用いて単語関連度マップの関連度を一時的に修正し、その修正された単語関連度マップを用いて入力データに対してインデクスを作成する。そして、データ検索装置は、作成されたインデクスに基づいて検索を行い、結果表示装置は、その検索結果を表示する。
【００１３】
【発明の実施の形態】
以下、この発明にかかるデータベース作成装置およびデータベース検索装置の実施の形態について、添付図面を参照して詳細に説明する。
【００１４】
実施の形態１．
図１は、本発明にかかるデータベース作成装置の一例を示すブロック構成図である。このデータベース作成装置は、データベース３、システム外部からデータベース３にデータ１を入力するためのデータ入力装置２、システム外部からインデクス基準４を入力するためのインデクス基準読込装置５、入力されたデータ１とインデクス基準４とに基づいて単語関連度マップ８を作成する単語関連度マップ作成装置７、入力されたデータ１と単語関連度マップ８とインデクス基準４とに基づいてインデクス９を生成するインデクス作成装置６、およびインデクス作成時に単語の重要度を計算する単語重要度付与装置１０を備えている。
【００１５】
データベース３、単語関連度マップ８およびインデクス９は、たとえばハードディスク等の記憶装置に格納される。また、データ入力装置２、インデクス基準読込装置５、インデクス作成装置６、単語関連度マップ作成装置７および単語重要度付与装置１０は、それぞれコンピュータ・システムにおいて、たとえばデータ入力プログラム、インデクス基準読込プログラム、インデクス作成プログラム、単語関連度マップ作成プログラムおよび単語重要度付与プログラムが実行されることにより実現される。
【００１６】
インデクス基準４は、インデクスの構成を示すものであり、一例として医学書のデータベースに対するインデクス基準１００を図２に示す。この医学書インデクス基準１００は、たとえば最上層に「医学」というタイトルがあり、その一つ下層に「基礎医学」、「内科学」および「外科学」という３つのタイトルがあり、さらに「基礎医学」の一つ下層に「解剖学」および「生理学」があり、また「内科学」の一つ下層には「循環器」および「消化器」があり、また「外科学」の一つ下層には「局所外科」および「整形外科」があるというようにインデクスがツリー構造をなすように構成されている。この医学書インデクス基準１００のように、インデクス基準４もツリー構造をなすように構成されている。
【００１７】
なお、以下の説明では、この医学書インデクス基準１００を例にして具体的に説明するが、本発明は、医学書に関するデータベースおよび医学書インデクス基準１００に限らないのはいうまでもない。
【００１８】
単語関連度マップ８は、インデクス基準４の階層関係に各単語間の関連度を付与したマップである。単語関連度マップ作成装置７は、入力されたデータ１に対して、たとえば医学書インデクス基準１００で使用されている単語の出現頻度を調べ、それに基づいて所定の計算を行い、図３に示すような単語関連度マップ１０４を得る。単語関連度マップ作成装置７が単語関連度マップ８を作成する方法を、図４に示すデータ１０１を例にして具体的に説明する。
【００１９】
たとえば、データ１０１は３つの文書からなり、文書１のタイトルは「循環器の話」であり、抄録は「・・・循環器系の病気で最も恐いのは、解剖学的に狭心症と心不全である。・・・」である。文書２のタイトルは「循環器系の病気とヘルニアの併発」であり、抄録は「・・・解剖学的には、循環器が、・・・ヘルニアについては外科の医師の診察を受けること。・・・」である。文書３のタイトルは「消化器と循環器」であり、抄録は「・・・良くそしゃくしないと、消化器に炎症を起こし、嘔吐する場合があります。嘔吐すると心臓に負担をかけ、狭心症など循環器系の病気をもっていると、・・・」である。
【００２０】
これらの文書１〜３からそれぞれ単語のみを抽出すると、図４に示す単語列１０２のようになる。すなわち、単語列１０２は、文書１では、タイトルに対して「循環器」、抄録に対して「循環器、解剖学、狭心症、心不全」となり、文書２では、タイトルに対して「循環器、ヘルニア」、抄録に対して「解剖学、循環器、ヘルニア、外科」となり、文書３では、タイトルに対して「消化器、循環器」、抄録に対して「そしゃく、消化器、嘔吐、心臓、狭心症、循環器」となる。
【００２１】
そして、１つの文書に同時に出現する各単語間は相互に関係があるものとして、それらを共出現の単語の組１０３とし、すべてのデータに対して処理をする。そして、共出現の単語の組１０３について、たとえば、つぎの（１）式のように総出現回数に対する共出現の比率、などを用いて関連度を定義する。ただし、ある単語（「ＫＷ１」とする）の総出現頻度をＮ１とし、別のある単語（「ＫＷ２」とする）の総出現頻度をＮ２とし、「ＫＷ１」と「ＫＷ２」とが同時に出現する共出現頻度をＮ１２とし、「ＫＷ１」と「ＫＷ２」との関連度をμ１２とする。
【００２２】
μ１２＝Ｎ１２／（Ｎ１＋Ｎ２−Ｎ１２）・・・（１）
【００２３】
たとえば、上述した文書１に対して説明すると、共出現の単語の組１０３は、図４に示すように「循環器、解剖学」、「循環器、狭心症」、「循環器、心不全」、「解剖学、狭心症」、「解剖学、心不全」、「狭心症、心不全」、・・・となる。たとえば「循環器、狭心症」の共出現の組に対しては、上記（１）式にしたがって、（「循環器」と「狭心症」の共出現頻度）／｛（「循環器」の総出現頻度）＋（「狭心症」の総出現頻度）−（「循環器」と「狭心症」の共出現頻度）｝の値を求める。
【００２４】
そして、その値、すなわち関連度１０５をインデクス基準１００の階層関係に付与することにより、図３に示す単語関連度マップ１０４が得られる。なお、単語間の関連度の算出式は、上記（１）式以外にも、単語の１つの文書中の出現回数によって重み付けを行い共出現比率を計算するなど、種々の算式が適用できる。
【００２５】
単語重要度付与装置１０は、インデクス作成対象のデータに、たとえばタイトル、抄録および本文がある場合、タイトルに出現した単語と抄録に出現した単語と本文に出現した単語との間でそれぞれの価値に応じて適宜重み付けを行う。すなわち、一般に本文を簡潔に集約したものが抄録であり、その抄録をさらに集約したものがタイトルであるが、タイトル、抄録および本文のいずれも表現したい内容のボリュームは同等であるとし、タイトル、抄録および本文に出現した単語に対して価値を数値化して重み付けを行う。単語重要度付与装置１０による重み付けの決定方法を、たとえば図５に示すデータ１０８を例にして、図６を参照しながら具体的に説明する。
【００２６】
図５に示すデータは、タイトルと抄録を有している。まずタイトルおよび抄録のそれぞれについて、出現する単語数をカウントする（図６のステップＳ１，Ｓ２）。たとえば、文書１については、抄録に含まれた単語は「循環器」、「狭心症」、「心不全」および「解剖学」の４個である。それに対して、タイトルに含まれた単語は「循環器」の１個である。従って、タイトルに含まれた単語は、抄録に含まれた単語の４倍の価値を有していると考えられる。そこで、タイトルの単語については、抄録の単語に対して４倍という重み付けを行う（図６のステップＳ３）。これを各データ毎に行う。
【００２７】
たとえば、文書２は、タイトルに「ヘルニア」および「循環器」の２個の単語を含み、抄録に「解剖学」、「循環器」、「ヘルニア」および「外科」の４個の単語を含むので、タイトルの単語は２倍の重み付けとなる。また、文書３は、タイトルに「そしゃく」および「循環器」の２個の単語を含み、抄録に「そしゃく」、「消化器」、「嘔吐」、「解剖学」、「生理学」および「循環器」の６個の単語を含むので、タイトルの単語は３倍の重み付けとなる。このような重み付けによって、たとえば図７に示す例では、「循環器」は、本来０．２である関連度が、文書１では０．８、文書２では０．４、文書３では０．６になり、データ毎、すなわち文書１と文書２と文書３とで「循環器」の価値に違いが出ることになる（図６のステップＳ４）。
【００２８】
また、たとえば抄録と本文との間で重み付けを行う場合や、他の文書データの項目間で重み付けを行う場合も同様である。図８に、抄録と本文との間の重み付けの例を示す。図８に示すデータ１１０では、たとえば本文に関しては、同じ単語が繰り返し出現した場合には、その出現回数を加味している。また、単純に出現回数を加算するだけでは、対象としている文書や図書の量に差があるため、正規化するのが望ましい。すなわち、文書や図書によって本文の文章の量が異なり、一般的には文章量が多いほうが単語はより多く出現する。そこで、たとえば１ページあたり、または１０００文字あたり、というように一定の決まった文書量や、単位文書量を対象にして、重み付けを行うように正規化するとよい。
【００２９】
つぎに、インデクスの作成処理の流れについて説明する。データ入力装置２によってデータベース３にデータ１が入力され、またインデクス基準読込装置５により、たとえば図２に示すインデクス基準１００が入力されると、単語関連度マップ作成装置７は、入力されたデータ１およびインデクス基準４に現れる単語に基づいて、たとえば図３に示すような単語関連度マップ８を作成する。
【００３０】
しかる後、インデクス作成装置６は、たとえば図９に示すフローチャートに従い、インデクス作成対象データに対して、インデクス基準４と単語関連度マップ８を基にしてインデクス９を作成する。すなわち、まず各文書に含まれている分類項目のノード（単語）をピックアップし、単語関連度マップにマッピングする（ステップＳ１１）。一例として、図７に、データ１０６の文書３について出現単語を単語関連度マップ８中にマーキングした様子を示す。図示例では、マーキングは、該当する単語、すなわち「消化器」、「循環器」、「そしゃく」、「嘔吐」、「心臓」および「狭心症」という単語を下線付きの太字で表すことにより示した。
【００３１】
続いて、単語重要度付与装置１０によって重み付けを行い、単語関連度マップ８を一時的に修正する（ステップＳ１２）。図７に示す例では、文書３の場合、タイトルに「消化器」および「循環器」という２個の単語が出現し、それに対して抄録の出現単語数は６個であるため、文書３の処理時のみ、「消化器」および「循環器」については、単語関連度マップ８の関連度を一時的に３倍して、それぞれ０．６（０．２×３）とする。
【００３２】
続いて、インデクス作成対象データに出現した単語を末端語としてチェックし、各末端語からルートノード（「医学」）まで遡るように分類判定評価値を計算する（ステップＳ１３）。これは、単語関連度マップ８にマッピングされた各単語を、ある計算手順に従って計算し、評価することによって、マッピングされた位置で単体で評価せずに、分類体系全体の中でどのような位置付けにあるかということを考慮するためである。
【００３３】
すなわち、たとえば図７に示す例では、「心臓」という分類項目は、単に「心臓」という単語を意味しているわけではなく、「医学」に関する「内科学」に関する「循環器」に関する「心臓」という概念を意味している。それを反映するために、たとえば「心臓」という末端語ノードからルートノードの「医学」まで、マッピングされている分類項目を順に遡ってたどり、その途中の関連度を加算し、得られた関連度の累計を、たどった階層数で除して平均値を得、これを分類判定評価値とする。
【００３４】
図７に示す例で、文書３の場合、「心臓」とその一つ上層の「循環器」との関連度は０．９であり、「循環器」とその一つ上層の「内科学」との関連度は、本来０．２であるが、重み付けによって一時的に０．６になっており、さらに「内科学」とその一つ上層の「医学」との関連度は０．３である。従って、「心臓」という単語の分類判定評価値は、０．９と０．６と０．３を足し、それを３で除することにより、０．６となる。すなわち、文書３が「心臓」に分類される度合いは０．６である。
【００３５】
ただし、図７に示す文書３では、「心臓」および「循環器」という単語は出現しているが、ルートノードまで遡る途中の「内科学」および「医学」という単語は出現していない。このようにルートノードの「医学」に至るまでにマッピングされていない単語が出現し、途切れた場合には、単語関連度マップ８の関連度をそのまま加算せずに、つぎのステップＳ１４のような処理を行う。
【００３６】
すなわち、たとえば図７に示す例で説明すれば、文書１について「狭心症」という末端語ノードから上層にたどると、文書１には「心臓」という単語が出現していない。そこで、「心臓」の下位ノードの関連度の平均値を求める。具体的には、「心臓」の下位ノードである「狭心症」の関連度０．５と「心不全」の関連度０．５との平均値０．５（（０．５＋０．５）／２）を求める。そして、その平均値と、「循環器」に対する「心臓」の関連度の値０．９との積を求め、その値０．４５（０．９×０．５）を仮関連度として加算する（ステップＳ１４）。
【００３７】
先に「文書３が「心臓」に分類される度合いは０．６である」としたが、このステップＳ１４の処理を行うことによって、文書３の「心臓」という単語の分類判定評価値は、「医学」に対する「内科学」の仮関連度が０．１８（（０．６＋０．６）／２×０．３）であるので、実際には０．５６（（０．９＋０．６＋０．１８）／３）となる。
【００３８】
上述したステップＳ１３、およびノードが途切れた場合にはステップＳ１４を、全ての末端語ノードについて繰り返し行う（ステップＳ１５）。たとえば図７に示すデータの場合、文書３については「そしゃく」、「消化器」、「嘔吐」、「心臓」、「狭心症」および「心不全」について、それぞれルートノードまでたどる途中の全てのノードについて評価を行う。「そしゃく」、「消化器」、「嘔吐」、「心臓」、「狭心症」および「心不全」のそれぞれについて、分類判定評価値の計算式および計算結果を示す。その計算式において「“」と「”」で囲まれた値は、仮関連度であり、下位ノードから上位ノードに向かって順に加算している。なお「生理学」および「基礎医学」については、省略する。
【００３９】
「そしゃく」：（０．７＋“０．２３”＋“０．１４”＋“０．０９”）／４＝０．２９
「嘔吐」：（０．８＋“０．２３”＋“０．１４”＋“０．０９”）／４＝０．３２
「消化」：（“０．２３”＋“０．１４”＋“０．０９”）／３＝０．１５
「狭心症」：（０．５＋０．９＋０．６＋“０．１８”）／４＝０．５５
「心不全」：（０．５＋０．９＋０．６＋“０．１８”）／４＝０．５５
「心臓」：（０．９＋０．６＋“０．１８”）／３＝０．５６
「循環器」：（０．６＋“０．１８”）／２＝０．３９
「消化器」：（０．６＋“０．１８”）／２＝０．３９
【００４０】
以上のようにしてインデクス作成対象データの分類先として可能性のある分類項目のすべての評価が終わったら、その中で最も評価が高い項目を分類先として決定し、分類する（ステップＳ１６）。図７に示す例では、「心臓」の分類項目が最も高い評価値（０．５６）であるため、分類先を「心臓」に決定する。そして、重み付けにより一時的に修正した単語関連度マップ８を初期値に戻した後（ステップＳ１７）、同様の処理をインデクス作成の対象となるすべての文書について繰り返し行う（ステップＳ１８）。
【００４１】
なお、文書の分類先決定の評価方法については、階層数や、マップの大きさにより正規化して加算する方法も適用できる。たとえば、文書中に出現した単語の、単語関連度マップ８における階層を考慮して、深い層（すなわち下位の層）については一律に関連度に重み付けをするようにしてもよい。そうすれば、分類体系の階層数が非常に多い場合でも、関連度の相加平均値が過度に低くなり、偶然出現した、上位階層の単語の項目に分類されてしまうのを回避することができる。
【００４２】
実施の形態１によれば、インデクス基準４に基づいて、データに含まれる単語間の関連度を自動的に生成し、さらにその関連度に対して適切な重み付けを行うようになっているため、インデクス作成者が見本や典型的な例などを特に指定しなくても、インデクス基準４に影響を及ぼすことなく、単語単位のインデクスを自動的に作成することができる。
【００４３】
なお、インデクス作成対象となるデータは、文書に限らず、データベースに格納されたデータで、かつ単語を認識できるものであれば特に問わない。たとえば、インデクス作成対象データは、制御コードに相当するタグを含むインターネット上のＷＥＢページのデータであってもよい。
【００４４】
実施の形態２．
図１０は、本発明にかかるデータベース検索装置の一例を示すブロック構成図である。このデータベース検索装置は、図１に示す実施の形態１のデータベース作成装置に、インデクス作成装置６で作成されたインデクスに基づいて検索を行うデータ検索装置３８と、その検索結果を表示する結果表示装置３９を追加したものである。従って、データベース作成装置を構成するデータ入力装置２、データベース３、インデクス基準読込装置５、インデクス作成装置６、単語関連度マップ作成装置７および単語重要度付与装置１０、並びにインデクス基準４および単語関連度マップ８については、実施の形態１と同様であるため、説明を省略する。
【００４５】
データ検索装置３８は、コンピュータ・システムにおいて、たとえばデータ検索プログラムが実行されることにより実現される。データ検索装置３８は、たとえば図２に示すようなインデクス基準１００に基づいて作成されたインデクスメニュー１１１を結果表示装置３９に表示させて一望し得るようなインタフェースと、そのメニューの中から適当な項目を選択するための、たとえばマウスカーソル１１２を提供する。従ってデータ検索装置３８には、図示省略したが入力装置としてマウス等のポインティングデバイスやキーボードが接続されている。結果表示装置は、たとえばコンピュータ・システムの表示装置であるブラウン管や液晶表示装置である。
【００４６】
インデクス作成装置６により作成されたインデクスに対して検索を行う場合には、検索者は結果表示装置３９に表示されたメニューに対して、マウスカーソル１１２を移動させて適当な項目を指示し、選択することにより、インデクスを探すことができ、目的の図書を検索することができる。
【００４７】
なお、図１２に示すように、インデクス基準を各ノード毎に分割し、「医学」のノード１１４からマウスカーソル１１２で「内科学」を指示して「内科学」のノード１１５を開き、さらにマウスカーソル１１２で「循環器」を指示して「循環器」のノード１１６を開き、最終的にマウスカーソル１１２で「リンパ腺」を選択することにより、目的の図書を検索するようにしてもよい。このようなツリー構造をなすインデクス基準に対して次々と分類項目を絞り込んでいくメニュー状のインタフェースにより、効果的に検索を行うことができる。
【００４８】
【発明の効果】
以上、説明したとおり、本発明によれば、データ入力装置によりデータが入力されるとともに、インデクス基準読込装置によりインデクス基準が入力されると、単語関連度マップ作成装置は、入力データに対して、インデクス基準で使用されている単語の出現頻度を調べ、同時に出現する単語について関連度を算出し、その関連度とインデクス基準とに基づいて単語関連度マップを作成する。また、単語重要度付与装置は、入力データの第１の文書中に出現する単語数を、その第１の文書の要約または見出しとなる第２の文書中に出現する単語数で除し、得られた値の分だけ第２の文書中の単語に対して重み付けを行う。また、インデクス作成装置は、その重み付けを用いて単語関連度マップの関連度を一時的に修正し、その修正された単語関連度マップを用いて入力データに対してインデクスを作成する。従って、インデクス作成者が見本や典型的な例などを特に指定しなくても、インデクス基準に影響を及ぼすことなく、単語単位のインデクスを自動的に作成することができる。
【００４９】
つぎの発明によれば、データ入力装置によりデータが入力されるとともに、インデクス基準読込装置によりインデクス基準が入力されると、単語関連度マップ作成装置は、入力データに対して、インデクス基準で使用されている単語の出現頻度を調べ、同時に出現する単語について関連度を算出し、その関連度とインデクス基準とに基づいて単語関連度マップを作成する。また、単語重要度付与装置は、入力データの第１の文書中に出現する単語数を、その第１の文書の要約または見出しとなる第２の文書中に出現する単語数で除し、得られた値の分だけ第２の文書中の単語に対して重み付けを行う。また、インデクス作成装置は、その重み付けを用いて単語関連度マップの関連度を一時的に修正し、その修正された単語関連度マップを用いて入力データに対してインデクスを作成する。そして、データ検索装置は、作成されたインデクスに基づいて検索を行い、結果表示装置は、その検索結果を表示する。従って、効率よくインデクスを探すことができ、目的のデータを検索することができる。
【図面の簡単な説明】
【図１】本発明にかかるデータベース作成装置の一例を示すブロック構成図である。
【図２】そのデータベース作成装置において使用されるインデクス基準の一構成例を示す系統図である。
【図３】そのデータベース作成装置において作成された単語関連度マップの一例を示す模式図である。
【図４】単語関連度マップの作成方法を説明するための説明図である。
【図５】単語関連度マップに対して重み付けを行う方法を説明するための説明図である。
【図６】重み付けの決定方法の一例を示すフローチャートである。
【図７】重み付けを行った単語関連度マップの一例を示す模式図である。
【図８】抄録と本文との間の重み付けの一例を示す模式図である。
【図９】インデクス作成方法の一例を示すフローチャートである。
【図１０】本発明にかかるデータベース検索装置の一例を示すブロック構成図である。
【図１１】そのデータベース検索装置で使用される検索用メニューの一例を示す模式図である。
【図１２】そのデータベース検索装置で使用される検索用メニューの他の例を示す模式図である。
【図１３】従来におけるデータベース検索装置を示すブロック構成図である。
【図１４】従来におけるデータベース検索装置を示すブロック構成図である。
【図１５】従来におけるデータベース検索装置を示すブロック構成図である。
【符号の説明】
１データ、２データ入力装置、３データベース、４インデクス基準、５インデクス基準読込装置、６インデクス作成装置、７単語関連度マップ作成装置、８単語関連度マップ、９インデクス、１０単語重要度付与装置。

[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a database creation device that automatically creates a database index, and a database search device that searches a created database while narrowing down categories.
[0002]
[Prior art]
FIG. 13 is a block diagram showing a conventional similarity search device. This search device comprises: a database 17 that stores data represented as attribute/attribute-value pairs; a first search unit 15 that, before similar data is searched for, narrows down the data in the database 17 using a first index; a second search unit 16 that calculates a similarity range from the similarity ranges between attribute values and the importance of the attributes, and performs a similarity search, based on a second index, on the data retrieved by the first search unit 15; an inference processing unit 14 having first-index changing means that changes the first index according to the similarity range; a first index generation unit 18 that sets similarity values for the first index and determines the level of the first index based on those values; an input device 11; an output device 12; and an input/output control unit 13.
[0003]
This search device sets similarity values for the first index, determines the level of the first index based on those values, calculates the similarity based on the similarity ranges between attribute values and the second index, and performs a similarity search while changing the first index according to the calculated similarity range. Such a search device is disclosed in, for example, Japanese Patent Application Laid-Open No. 6-176072.
[0004]
FIG. 14 is a block diagram showing a conventional information retrieval device. This search device comprises: an arithmetic unit 19 that takes the inner product of the index keyword matrix from a keyword memory 24 and a search vector modified according to the degree of correlation; a partial linearizer 20 that applies a piecewise-linear conversion to the result; a second arithmetic unit 21 that multiplies the output vector of the partial linearizer 20 by the keyword matrix X; a normalizer 22 that normalizes each element of the product to 0, 1; a controller 23 that compares the result with the output vector of the normalizer 22 from before one round of feedback to the arithmetic unit 19; a reader 26 that reads the desired data from a database 25 based on the addresses corresponding to the positions in the modified output vector of the partial linearizer 20; and a display 27 that displays the desired data.
[0005]
Specifically, this search device has the configuration shown in FIG. 15. It stores a keyword input from the keyword input unit 28 in the storage unit 33, converts the keyword into a numerical vector in the arithmetic unit 24, refers to the keyword correlation table 35 to find a more closely related keyword according to the degree of correlation, and, in the converter 29, converts the originally input keyword into the keyword found. The search device then searches the database 36 with the searcher 30, using that keyword as the new search condition, reads the data with the reader 31, and displays the result selected via the selector 37 on the display 32. Such a search device is disclosed in, for example, Japanese Patent Application Laid-Open No. 8-87508.
[0006]
[Problems to be solved by the invention]
However, the similarity search device described above has the problem that, to obtain word-level similarity, the data must be stored as attribute/attribute-value pairs. It also has the problem that the results obtained differ from user to user, because targets or samples that are based on the user's intuition and experience, or that reflect the user's intention, such as the importance of attributes, must be set.
[0007]
In addition, the information retrieval device described above converts an input keyword into another keyword using a keyword correlation table that indicates a general degree of correlation calculated by some method. As a result, words with the same character string receive the same correlation even when the data belongs to different fields, and there is the problem that results appropriate to the field cannot be obtained. If the correlation table is modified to avoid this, the entire keyword space is affected, and performance does not necessarily improve for all searches.
[0008]
The present invention has been made to solve the above problems. Its object is to obtain a database creation device that, based on an index criterion given to the database, automatically generates the relevance between words contained in the data and further applies appropriate weighting to that relevance, thereby creating an index automatically without affecting the index criterion, and a database search device that searches that database while narrowing down categories.
[0009]
[Means for Solving the Problems]
To achieve the above object, the present invention comprises: a data input device for inputting data into a database; an index criterion reading device for inputting an index criterion that defines the structure serving as the basis of the index; a word relevance map creation device that examines, for the input data, the frequency of appearance of the words used in the index criterion, calculates the relevance between words that appear together, and creates a word relevance map based on that relevance and the index criterion; a word importance assigning device that divides the number of words appearing in a first document of the input data by the number of words appearing in a second document that is a summary or heading of the first document, and weights the words in the second document by the resulting value; and an index creation device that uses the weighting obtained by the word importance assigning device to temporarily modify the relevance in the word relevance map created by the word relevance map creation device, and creates an index for the input data using the modified word relevance map.
[0010]
According to the present invention, when data is input by the data input device and an index criterion is input by the index criterion reading device, the word relevance map creation device examines, for the input data, the frequency of appearance of the words used in the index criterion, calculates the relevance between words that appear together, and creates a word relevance map based on that relevance and the index criterion. The word importance assigning device divides the number of words appearing in a first document of the input data by the number of words appearing in a second document that is a summary or heading of the first document, and weights the words in the second document by the resulting value. The index creation device uses that weighting to temporarily modify the relevance in the word relevance map, and creates an index for the input data using the modified word relevance map.
[0011]
Further, the present invention comprises: a data input device for inputting data into a database; an index criterion reading device for inputting an index criterion that defines the structure serving as the basis of the index; a word relevance map creation device that examines, for the input data, the frequency of appearance of the words used in the index criterion, calculates the relevance between words that appear together, and creates a word relevance map based on that relevance and the index criterion; a word importance assigning device that divides the number of words appearing in a first document of the input data by the number of words appearing in a second document that is a summary or heading of the first document, and weights the words in the second document by the resulting value; an index creation device that uses the weighting obtained by the word importance assigning device to temporarily modify the relevance in the word relevance map created by the word relevance map creation device, and creates an index for the input data using the modified word relevance map; a data search device that performs a search based on the index created by the index creation device; and a result display device that displays the search result.
[0012]
According to the present invention, when data is input by the data input device and an index criterion is input by the index criterion reading device, the word relevance map creation device examines, for the input data, the frequency of appearance of the words used in the index criterion, calculates the relevance between words that appear together, and creates a word relevance map based on that relevance and the index criterion. The word importance assigning device divides the number of words appearing in a first document of the input data by the number of words appearing in a second document that is a summary or heading of the first document, and weights the words in the second document by the resulting value. The index creation device uses that weighting to temporarily modify the relevance in the word relevance map, and creates an index for the input data using the modified word relevance map. The data search device then performs a search based on the created index, and the result display device displays the search result.
[0013]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of a database creation device and a database search device according to the present invention will be described below in detail with reference to the accompanying drawings.
[0014]
Embodiment 1.
FIG. 1 is a block diagram showing an example of a database creation device according to the present invention. This database creation device comprises: a database 3; a data input device 2 for inputting data 1 into the database 3 from outside the system; an index criterion reading device 5 for inputting an index criterion 4 from outside the system; a word relevance map creation device 7 that creates a word relevance map 8 based on the input data 1 and the index criterion 4; an index creation device 6 that generates an index 9 based on the input data 1, the word relevance map 8, and the index criterion 4; and a word importance assigning device 10 that calculates the importance of words when the index is created.
[0015]
The database 3, the word relevance map 8, and the index 9 are stored in a storage device such as a hard disk. The data input device 2, the index criterion reading device 5, the index creation device 6, the word relevance map creation device 7, and the word importance assigning device 10 are each realized by executing, on a computer system, for example, a data input program, an index criterion reading program, an index creation program, a word relevance map creation program, and a word importance assigning program, respectively.
[0016]
The index criterion 4 indicates the structure of the index. As an example, FIG. 2 shows an index criterion 100 for a database of medical books. In this medical book index criterion 100, the index forms a tree structure: for example, the top layer has the title “medicine”; one layer below it are the three titles “basic medicine”, “internal medicine”, and “surgery”; one layer below “basic medicine” are “anatomy” and “physiology”; one layer below “internal medicine” are “circulatory organ” and “digestive organ”; and one layer below “surgery” are “local surgery” and “orthopedic surgery”. Like this medical book index criterion 100, the index criterion 4 is also structured as a tree.
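As an illustration only, a tree-structured index criterion of this kind could be held in memory as a child-to-parent mapping. The sketch below is not taken from the patent: the names `PARENT` and `path_to_root` are hypothetical, and the English node labels are this sketch's own renderings of the Japanese terms in the FIG. 2 example.

```python
# Hypothetical in-memory form of the medical book index criterion of FIG. 2.
# Each key is a classification item; its value is the item one layer above it.
PARENT = {
    "basic medicine": "medicine",
    "internal medicine": "medicine",
    "surgery": "medicine",
    "anatomy": "basic medicine",
    "physiology": "basic medicine",
    "circulatory organ": "internal medicine",
    "digestive organ": "internal medicine",
    "local surgery": "surgery",
    "orthopedic surgery": "surgery",
}


def path_to_root(node, parent=PARENT):
    """Return the nodes from `node` up to the root of the tree, inclusive."""
    path = [node]
    while path[-1] in parent:  # the root has no entry in the mapping
        path.append(parent[path[-1]])
    return path


print(path_to_root("circulatory organ"))
```

Storing only the parent of each item keeps the upward walk from a word to the root a simple loop.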
[0017]
In the following description, the medical book index criterion 100 is used as a concrete example, but it goes without saying that the present invention is not limited to medical book databases or to the medical book index criterion 100.
[0018]
The word relevance map 8 is a map in which the relevance between words is attached to the hierarchical relationships of the index criterion 4. The word relevance map creation device 7 examines, for the input data 1, the frequency of appearance of the words used in, for example, the medical book index criterion 100, performs a predetermined calculation based on those frequencies, and obtains a word relevance map 104 such as that shown in FIG. 3. The method by which the word relevance map creation device 7 creates the word relevance map 8 is described concretely below, using the data 101 shown in FIG. 4 as an example.
[0019]
For example, the data 101 consists of three documents. The title of document 1 is “A talk about the circulatory organs”, and its abstract is “... Among diseases of the circulatory system, the most frightening, anatomically, are angina and heart failure. ...”. The title of document 2 is “Circulatory disease complicated by hernia”, and its abstract is “... Anatomically, the circulatory organs ... For the hernia, see a surgeon. ...”. The title of document 3 is “Digestive organs and circulatory organs”, and its abstract is “... If you do not chew well, the digestive organs may become inflamed and you may vomit. Vomiting puts a strain on the heart, and if you have a circulatory disease such as angina, ...”.
[0020]
When only the words are extracted from documents 1 to 3, the word strings 102 shown in FIG. 4 are obtained. That is, for document 1 the word string 102 is “circulatory organ” for the title and “circulatory organ, anatomy, angina, heart failure” for the abstract; for document 2 it is “circulatory organ, hernia” for the title and “anatomy, circulatory organ, hernia, surgery” for the abstract; and for document 3 it is “digestive organ, circulatory organ” for the title and “chewing, digestive organ, vomiting, heart, angina, circulatory organ” for the abstract.
[0021]
Words that appear together in one document are then regarded as mutually related and are treated as co-occurring word pairs 103; this processing is performed on all of the data. For the co-occurring word pairs 103, the relevance is defined using, for example, the ratio of co-occurrences to the total number of appearances, as in equation (1) below. Here, N1 is the total appearance frequency of one word (“KW1”), N2 is the total appearance frequency of another word (“KW2”), N12 is the co-occurrence frequency with which “KW1” and “KW2” appear together, and μ12 is the relevance between “KW1” and “KW2”.
[0022]
μ12 = N12 / (N1 + N2 - N12) ... (1)
[0023]
For example, for document 1 described above, the co-occurring word pairs 103 are, as shown in FIG. 4, “circulatory organ, anatomy”, “circulatory organ, angina”, “circulatory organ, heart failure”, “anatomy, angina”, “anatomy, heart failure”, “angina, heart failure”, and so on. For the co-occurring pair “circulatory organ, angina”, for example, the following value is obtained according to equation (1): (co-occurrence frequency of “circulatory organ” and “angina”) / {(total appearance frequency of “circulatory organ”) + (total appearance frequency of “angina”) - (co-occurrence frequency of “circulatory organ” and “angina”)}.
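The calculation in equation (1) can be sketched as follows. This is an illustrative reading, not the patent's implementation: it assumes that "appearance frequency" means the number of documents containing a word (the text does not say whether repeated occurrences within one document are counted), and the names `docs` and `relevance_table` are hypothetical. The word sets are English renderings of the three word strings of FIG. 4; the resulting numbers are for this toy input only and are not the values shown in the figures.

```python
from collections import Counter
from itertools import combinations

# Words extracted from documents 1 to 3 (title and abstract merged per document).
docs = [
    {"circulatory organ", "anatomy", "angina", "heart failure"},
    {"circulatory organ", "hernia", "anatomy", "surgery"},
    {"digestive organ", "circulatory organ", "chewing", "vomiting", "heart", "angina"},
]


def relevance_table(documents):
    """Return {frozenset({kw1, kw2}): mu12} using mu12 = N12 / (N1 + N2 - N12)."""
    word_count = Counter()  # N1, N2: documents containing each word (assumption)
    pair_count = Counter()  # N12: documents containing both words
    for words in documents:
        word_count.update(words)
        pair_count.update(frozenset(p) for p in combinations(sorted(words), 2))
    return {
        pair: n12 / (sum(word_count[w] for w in pair) - n12)
        for pair, n12 in pair_count.items()
    }


table = relevance_table(docs)
mu = table[frozenset({"circulatory organ", "angina"})]
print(round(mu, 3))  # N1 = 3, N2 = 2, N12 = 2, so 2 / (3 + 2 - 2)
```

Under this document-count reading, equation (1) is the Jaccard coefficient of the sets of documents containing each word, so the value always lies between 0 and 1.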
[0024]
Then, by attaching that value, that is, the relevance 105, to the hierarchical relationships of the index criterion 100, the word relevance map 104 shown in FIG. 3 is obtained. Besides equation (1), various other formulas can be used to calculate the relevance between words, such as computing the co-occurrence ratio with weighting by the number of times a word appears in a single document.
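One way to picture the resulting map is as the index tree with a relevance value stored on each child-to-parent link. The following sketch makes that assumption; `PARENT`, `RELEVANCE`, and `build_relevance_map` are hypothetical names, only one branch of the tree is shown, and the three numbers are the ones the description quotes for the FIG. 7 example.

```python
# Hypothetical child -> parent links for one branch of the index criterion.
PARENT = {
    "heart": "circulatory organ",
    "circulatory organ": "internal medicine",
    "internal medicine": "medicine",
}

# Pairwise relevance values, keyed by the unordered word pair.
RELEVANCE = {
    frozenset({"heart", "circulatory organ"}): 0.9,
    frozenset({"circulatory organ", "internal medicine"}): 0.2,
    frozenset({"internal medicine", "medicine"}): 0.3,
}


def build_relevance_map(parent, relevance, default=0.0):
    """Attach a relevance value to every child -> parent edge of the tree.

    Pairs that never co-occurred get `default` (an assumption of this sketch).
    """
    return {
        child: (par, relevance.get(frozenset({child, par}), default))
        for child, par in parent.items()
    }


word_relevance_map = build_relevance_map(PARENT, RELEVANCE)
print(word_relevance_map["heart"])
```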
[0025]
When the data to be indexed has, for example, a title, an abstract, and a body text, the word importance assigning device 10 applies appropriate weighting among the words appearing in the title, the words appearing in the abstract, and the words appearing in the body text, according to their respective value. In general, the abstract is a concise condensation of the body text, and the title is a further condensation of the abstract. On the assumption that the volume of content to be expressed is the same for the title, the abstract, and the body text, the value of the words appearing in each is quantified and used as a weight. The method by which the word importance assigning device 10 determines the weighting is described concretely below with reference to FIG. 6, using the data 108 shown in FIG. 5 as an example.
[0026]
The data shown in FIG. 5 has titles and abstracts. First, the number of words appearing in the title and in the abstract is counted (steps S1 and S2 in FIG. 6). For document 1, for example, the abstract contains four words: “circulatory organ”, “angina”, “heart failure”, and “anatomy”. The title, by contrast, contains one word, “circulatory organ”. A word in the title is therefore considered to have four times the value of a word in the abstract, so the title words are given a weight of four relative to the abstract words (step S3 in FIG. 6). This is done for each data item.
[0027]
Similarly, document 2 contains the two words “hernia” and “circulatory organ” in the title and the four words “anatomy”, “circulatory organ”, “hernia”, and “surgery” in the abstract, so its title words are given a weight of two. Document 3 contains the two words “chewing” and “circulatory organ” in the title and the six words “chewing”, “digestive organ”, “vomiting”, “anatomy”, “physiology”, and “circulatory organ” in the abstract, so its title words are given a weight of three. With this weighting, in the example shown in FIG. 7, the relevance of “circulatory organ”, originally 0.2, becomes 0.8 for document 1, 0.4 for document 2, and 0.6 for document 3, so that the value of “circulatory organ” differs for each data item, that is, among documents 1, 2, and 3 (step S4 in FIG. 6).
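The weighting rule of steps S1 to S4 can be sketched as a ratio of word counts. This is a minimal illustration under stated assumptions: words are counted once each (distinct words), and the names `title_weight` and `documents` are hypothetical. The word lists are English renderings of the three documents described above, and 0.2 is the base relevance quoted for the FIG. 7 example.

```python
def title_weight(title_words, abstract_words):
    """Weight of a title word relative to an abstract word (steps S1 to S3)."""
    n_title = len(set(title_words))        # step S1: count title words
    n_abstract = len(set(abstract_words))  # step S2: count abstract words
    if n_title == 0:
        raise ValueError("the title contains no index words")
    return n_abstract / n_title            # step S3: ratio used as the weight


documents = {
    1: (["circulatory organ"],
        ["circulatory organ", "angina", "heart failure", "anatomy"]),
    2: (["hernia", "circulatory organ"],
        ["anatomy", "circulatory organ", "hernia", "surgery"]),
    3: (["chewing", "circulatory organ"],
        ["chewing", "digestive organ", "vomiting", "anatomy", "physiology",
         "circulatory organ"]),
}

BASE_RELEVANCE = 0.2  # relevance of "circulatory organ" before weighting

for doc_id, (title, abstract) in documents.items():
    weight = title_weight(title, abstract)
    # step S4: the weighted relevance differs from document to document
    print(doc_id, weight, round(BASE_RELEVANCE * weight, 2))
```

The loop reproduces the weights 4, 2, and 3 and the weighted relevances 0.8, 0.4, and 0.6 given in the text.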
[0028]
The same applies when weighting between the abstract and the body text, or between other fields of document data. FIG. 8 shows an example of weighting between the abstract and the body text. In the data 110 shown in FIG. 8, when the same word appears repeatedly in the body text, the number of appearances is taken into account. Simply adding up the number of appearances is not enough, however, because the documents and books involved differ in length, so normalization is desirable. That is, the amount of body text varies from document to document and book to book, and in general more words appear as the amount of text increases. It is therefore advisable to normalize so that the weighting is applied to a fixed or unit amount of text, for example per page or per 1000 characters.
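The per-unit normalization mentioned here could look like the following sketch. The function name `normalized_frequency` and the sample counts are hypothetical; the text only states that counts should be related to a fixed amount of text such as 1000 characters.

```python
def normalized_frequency(count, text_length, unit=1000):
    """Scale a raw occurrence count to occurrences per `unit` characters."""
    if text_length <= 0:
        raise ValueError("text_length must be positive")
    return count * unit / text_length


# A word seen 6 times in a 3000-character body text and a word seen
# 2 times in a 500-character body text (made-up figures for illustration).
long_text = normalized_frequency(6, 3000)
short_text = normalized_frequency(2, 500)
print(long_text, short_text)
```

After normalization the shorter text shows the higher rate, although its raw count is lower.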
[0029]
Next, the flow of the index creation processing will be described. When the data 1 is input to the database 3 by the data input device 2 and, for example, the index reference 100 shown in FIG. 2 is input by the index reference reading device 5, the word association degree map creating device 7 creates a word relevance map 8, for example as shown in FIG. 3, based on the words of the index criterion 4 that appear in the input data 1.
[0030]
Thereafter, the index creation device 6 creates the index 9 for the index creation target data based on the index reference 4 and the word association degree map 8, for example according to the flowchart shown in FIG. 9. That is, first, the nodes (words) of the classification items contained in each document are picked up and mapped onto the word association degree map (step S11). As an example, FIG. 7 shows the words appearing in document 3 of the data 106 marked on the word association degree map 8. In the illustrated example, the corresponding words, namely “digestive organ”, “circulatory organ”, “mastication”, “vomiting”, “heart” and “angina”, are marked in underlined bold letters.
[0031]
Subsequently, weighting is performed by the word importance assigning device 10, and the word association degree map 8 is temporarily corrected (step S12). In the example shown in FIG. 7, in document 3 the two words “digestive organ” and “circulatory organ” appear in the title, and six words appear in the abstract. Therefore, for the duration of the processing only, the relevance of “digestive organ” and “circulatory organ” in the word association degree map 8 is temporarily tripled to 0.6 (0.2 × 3).
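One way to picture the temporary correction of step S12 is to apply the weight to a copy of the map, so that the original values remain available afterwards. The sketch below assumes that the map is stored as a dictionary keyed by (word, parent word) pairs; this representation and the function name `temporarily_corrected` are illustrative assumptions, not the patented implementation.

```python
def temporarily_corrected(relevance_map, title_words, weight):
    """Return a corrected copy of the map; the original map is left untouched."""
    corrected = dict(relevance_map)
    for (word, parent), value in relevance_map.items():
        if word in title_words:
            # Scale the relevance between a title word and its parent node
            corrected[(word, parent)] = value * weight
    return corrected


# Part of the map of FIG. 7, keyed by (word, parent word)
base_map = {
    ("digestive organ", "internal medicine"): 0.2,
    ("circulatory organ", "internal medicine"): 0.2,
    ("heart", "circulatory organ"): 0.9,
}
corrected = temporarily_corrected(
    base_map, {"digestive organ", "circulatory organ"}, weight=3
)
print(corrected)
```

Because `base_map` is never modified, discarding `corrected` after a document has been classified has the same effect as returning the map to its initial values.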
[0032]
Subsequently, each word appearing in the index creation target data is treated as a terminal word, and a classification determination evaluation value is calculated by tracing back from each terminal word to the root node (“medicine”) (step S13). Each word mapped onto the word relevance map 8 is evaluated by a fixed calculation procedure so that it is judged not by its mapped position alone but by where it is positioned within the classification system as a whole.
[0033]
That is, in the example shown in FIG. 7, the classification item “heart” does not simply mean the word “heart”; it means the concept of “heart” as related to “circulatory organ”, which is related to “internal medicine”, which in turn is related to “medicine”. To reflect this, the mapped classification items are traced back in order from the terminal word node “heart” to the root node “medicine”, the relevance values along the way are added, and the resulting total is divided by the number of hierarchy levels traced to obtain an average value, which is used as the classification determination evaluation value.
[0034]
In the example shown in FIG. 7, in the case of document 3, the relevance between “heart” and its parent “circulatory organ” is 0.9, and the relevance between “circulatory organ” and its parent “internal medicine” is originally 0.2 but temporarily becomes 0.6 as a result of the weighting. Furthermore, the relevance between “internal medicine” and “medicine”, one level higher, is 0.3. The classification determination evaluation value of the word “heart” is therefore 0.6, obtained by adding 0.9, 0.6, and 0.3 and dividing the sum by 3. That is, the degree to which document 3 is classified as “heart” is 0.6.
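The averaging of step S13 can be sketched as a walk from the terminal word up to the root node. The dictionaries `PARENT` and `RELEVANCE` and the function name `evaluation_value` are assumptions made for illustration; the relevance values are those of document 3 after the temporary weighting.

```python
# Parent of each node on the path from "heart" to the root node "medicine"
PARENT = {
    "heart": "circulatory organ",
    "circulatory organ": "internal medicine",
    "internal medicine": "medicine",
}

# Relevance between each node and its parent (document 3, after weighting)
RELEVANCE = {
    ("heart", "circulatory organ"): 0.9,
    ("circulatory organ", "internal medicine"): 0.6,
    ("internal medicine", "medicine"): 0.3,
}


def evaluation_value(word, parent, relevance):
    """Average of the relevance values met while tracing `word` back to the root."""
    total, levels, node = 0.0, 0, word
    while node in parent:  # the root node has no parent entry
        upper = parent[node]
        total += relevance[(node, upper)]
        levels += 1
        node = upper
    if levels == 0:
        raise ValueError(f"{word!r} has no path to the root node")
    return total / levels


print(round(evaluation_value("heart", PARENT, RELEVANCE), 2))
```

This sketch covers only the case in which every node on the path is available in the map with a usable relevance value.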
[0035]
However, in document 3 shown in FIG. 7, the words “heart” and “circulatory organ” appear, but the words “internal medicine” and “medicine”, which lie on the path back to the root node, do not. When an unmapped word is encountered in this way before the root node “medicine” is reached, so that the path is interrupted, the relevance in the word relevance map 8 is not added as it is; instead, the processing of the next step S14 is performed.
[0036]
That is, referring to the example shown in FIG. 7, when document 1 is traced upward from the terminal word node “angina”, the word “heart” does not appear in document 1. The average of the relevance values of the lower nodes of “heart” is therefore obtained. Specifically, the average of the relevance 0.5 of “angina” and the relevance 0.5 of “heart failure”, both lower nodes of “heart”, is 0.5 ((0.5 + 0.5) / 2). The product of this average and the relevance 0.9 of “heart” with respect to “circulatory organ” is then obtained, and the resulting value 0.45 (0.9 × 0.5) is added as the temporary relevance (step S14).
[0037]
It was stated earlier that “the degree to which document 3 is classified as “heart” is 0.6”. However, when the processing of step S14 is performed, the temporary relevance of “internal medicine” with respect to “medicine” becomes 0.18 ((0.6 + 0.6) / 2 × 0.3), so the classification determination evaluation value of the word “heart” in document 3 is actually 0.56 ((0.9 + 0.6 + 0.18) / 3).
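The temporary relevance of step S14 can be reproduced with a short sketch. The function name `temporary_relevance` is an assumption made for illustration; the numbers are those given above for document 1 and document 3.

```python
def temporary_relevance(child_relevances, relevance_to_parent):
    """Step S14: mean relevance of the lower nodes times the node's own relevance."""
    if not child_relevances:
        raise ValueError("the node has no lower nodes to average")
    mean_of_children = sum(child_relevances) / len(child_relevances)
    return mean_of_children * relevance_to_parent


# Document 1: "heart" does not appear; its lower nodes are "angina" and "heart failure"
heart_doc1 = temporary_relevance([0.5, 0.5], 0.9)

# Document 3: "internal medicine" does not appear; its lower nodes are
# "digestive organ" and "circulatory organ", both temporarily weighted to 0.6
internal_medicine_doc3 = temporary_relevance([0.6, 0.6], 0.3)

# Revised evaluation value of "heart" for document 3
heart_score_doc3 = (0.9 + 0.6 + internal_medicine_doc3) / 3
print(round(heart_doc1, 2), round(internal_medicine_doc3, 2), round(heart_score_doc3, 2))
```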
[0038]
Step S13 described above and, where the path is interrupted, step S14 are repeated for all terminal word nodes (step S15). For example, in the case of the data shown in FIG. 7, for document 3 the nodes on the way to the root node are evaluated for all of “mastication”, “digestive organ”, “vomiting”, “heart”, “angina pectoris”, and “heart failure”. The calculation formulas and results of the classification determination evaluation values are shown below. In the formulas, a value enclosed in quotation marks is a temporary relevance, and the values are added in order from the lower node to the upper node. Note that “physiology” and “basic medicine” are omitted.
[0039]
“Mastication”: (0.7 + “0.23” + “0.14” + “0.09”) / 4 = 0.29
“Vomiting”: (0.8 + “0.23” + “0.14” + “0.09”) / 4 = 0.32
“Digestion”: (“0.23” + “0.14” + “0.09”) / 3 = 0.15
“Angina pectoris”: (0.5 + 0.9 + 0.6 + “0.18”) / 4 = 0.55
“Heart failure”: (0.5 + 0.9 + 0.6 + “0.18”) / 4 = 0.55
“Heart”: (0.9 + 0.6 + “0.18”) / 3 = 0.56
“Circulatory organ”: (0.6 + “0.18”) / 2 = 0.39
“Digestive organ”: (0.6 + “0.18”) / 2 = 0.39
[0040]
As described above, when all the evaluations of the classification items that are likely to be the classification destination of the index creation target data are completed, the item having the highest evaluation is determined as the classification destination and classified (step S16). In the example shown in FIG. 7, since the classification item of “heart” has the highest evaluation value (0.56), the classification destination is determined to be “heart”. Then, after the word association degree map 8 temporarily corrected by weighting is returned to the initial value (step S17), the same processing is repeated for all the documents for which the index is to be created (step S18).
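The selection in step S16 amounts to taking the classification item with the highest evaluation value. The sketch below recomputes the averages from the terms listed for document 3 and picks the maximum; the names `PATH_TERMS` and `classify` are assumptions made for illustration, and the text does not say how ties are to be resolved (here the first item with the highest value is returned).

```python
# Relevance terms added for each terminal word of document 3 (lower to upper node)
PATH_TERMS = {
    "mastication": [0.7, 0.23, 0.14, 0.09],
    "vomiting": [0.8, 0.23, 0.14, 0.09],
    "digestion": [0.23, 0.14, 0.09],
    "angina pectoris": [0.5, 0.9, 0.6, 0.18],
    "heart failure": [0.5, 0.9, 0.6, 0.18],
    "heart": [0.9, 0.6, 0.18],
    "circulatory organ": [0.6, 0.18],
    "digestive organ": [0.6, 0.18],
}


def classify(path_terms):
    """Return the item with the highest average relevance and all the averages."""
    scores = {word: sum(terms) / len(terms) for word, terms in path_terms.items()}
    best = max(scores, key=scores.get)
    return best, scores


best, scores = classify(PATH_TERMS)
print(best, round(scores[best], 2))
```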
[0041]
As an evaluation method for determining the classification destination of a document, a method that normalizes the values according to the number of layers or the size of the map before adding them can also be applied. For example, taking into account the layer that a word appearing in the document occupies in the word relevance map 8, the relevance may be weighted uniformly for deep (that is, lower) layers. This avoids the situation in which, when the classification system has a very large number of levels, the arithmetic mean of the relevance becomes excessively low and the document is classified under an upper-level word item that merely happened to appear.
[0042]
According to the first embodiment, the degree of association between words included in the data is automatically generated based on the index criterion 4, and appropriate weighting is further applied to the degree of association. Therefore, even if the index creator does not specify a sample or a typical example, an index for each word can be created automatically without affecting the index criterion 4.
[0043]
The data for which the index is created is not limited to documents; any data can be used as long as it is stored in a database and words can be recognized in it. For example, the index creation target data may be Web page data on the Internet that contains tags corresponding to control codes.
[0044]
Embodiment 2.
FIG. 10 is a block diagram showing an example of the database search apparatus according to the present invention. This database search device is obtained by adding, to the database creation device of the first embodiment shown in FIG. 1, a data search device 38 that performs a search based on the index created by the index creation device 6 and a result display device 39 that displays the search result. The data input device 2, the database 3, the index reference reading device 5, the index creation device 6, the word association degree map creation device 7 and the word importance assigning device 10 that constitute the database creation device, as well as the index criterion 4 and the word association degree map 8, are the same as in the first embodiment, and their description is therefore omitted.
[0045]
The data search device 38 is realized in a computer system by executing, for example, a data search program. The data search device 38 displays on the result display device 39 an index menu 111 created based on, for example, the index standard 100 shown in FIG. 2, and provides, for example, a mouse cursor 112 for pointing to an appropriate item in the menu. Accordingly, although not shown, a pointing device such as a mouse, or a keyboard, is connected to the data search device 38 as an input device. The result display device 39 is, for example, a cathode ray tube or a liquid crystal display serving as the display device of the computer system.
[0046]
When searching the index created by the index creation device 6, the searcher moves the mouse cursor 112 over the menu displayed on the result display device 39, points to an appropriate item, and selects it. In this way the index can be searched and the target book found.
[0047]
Alternatively, as shown in FIG. 12, the index reference may be divided node by node. The searcher points to “internal medicine” in the “medicine” node 114 with the mouse cursor 112 to open the “internal medicine” node 115, then points to “circulatory organ” with the cursor 112 to open the “circulatory organ” node 116, and finally selects “lymph gland” with the mouse cursor 112 to search for the target book. For an index criterion having such a tree structure, a menu-like interface that narrows down the classification items one after another allows searches to be performed effectively.
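The step-by-step narrowing can be pictured as descending a nested tree, one opened node per selection. The sketch below is a hypothetical illustration: the nested-dictionary representation, the function name `open_node`, and every tree entry other than “medicine”, “internal medicine”, “circulatory organ” and “lymph gland” are assumptions, not the contents of FIG. 12.

```python
# Hypothetical fragment of a tree-structured index criterion
INDEX_TREE = {
    "medicine": {
        "internal medicine": {
            "circulatory organ": {"heart": {}, "lymph gland": {}},
            "digestive organ": {},
        },
    },
}


def open_node(tree, path):
    """Follow `path` from the top of the tree and list the items in the opened node."""
    node = tree
    for name in path:
        if name not in node:
            raise KeyError(f"no such classification item: {name}")
        node = node[name]
    return sorted(node)


# Each selection opens the next node and shows its classification items
print(open_node(INDEX_TREE, ["medicine"]))
print(open_node(INDEX_TREE, ["medicine", "internal medicine"]))
print(open_node(INDEX_TREE, ["medicine", "internal medicine", "circulatory organ"]))
```

Each call corresponds to one pointing operation in the menu; the final selection of a leaf item such as “lymph gland” would then be used to look up the documents indexed under it.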
[0048]
[Effects of the invention]
As described above, according to the present invention, when data is input by the data input device and the index reference is input by the index reference reading device, the word association degree map creating device checks, for the input data, the frequency of occurrence of the words used in the index criterion, calculates the relevance level for words that appear at the same time, and creates a word relevance map based on the relevance level and the index criterion. Further, the word importance assigning device divides the number of words appearing in the first document of the input data by the number of words appearing in the second document serving as the summary or heading of the first document, and weights the words in the second document by the value obtained. The index creation device then temporarily corrects the relevance of the word relevance map using the weighting and creates an index for the input data using the corrected word relevance map. Therefore, even if the index creator does not specify a sample or a typical example, an index for each word can be created automatically without affecting the index criterion.
[0049]
According to the next invention, when data is input by the data input device and the index reference is input by the index reference reading device, the word association degree map creating device checks, for the input data, the frequency of occurrence of the words used in the index criterion, calculates the relevance level for words that appear at the same time, and creates a word relevance map based on the relevance level and the index criterion. Further, the word importance assigning device divides the number of words appearing in the first document of the input data by the number of words appearing in the second document serving as the summary or heading of the first document, and weights the words in the second document by the value obtained. The index creation device then temporarily corrects the relevance of the word relevance map using the weighting and creates an index for the input data using the corrected word relevance map. The data search device performs a search based on the created index, and the result display device displays the search result. Therefore, the index can be searched efficiently and the target data can be found.
[Brief description of the drawings]
FIG. 1 is a block diagram showing an example of a database creation device according to the present invention.
FIG. 2 is a system diagram showing an example of the configuration of index criteria used in the database creation device.
FIG. 3 is a schematic diagram showing an example of a word association degree map created by the database creation device.
FIG. 4 is an explanatory diagram for explaining a method of creating a word association degree map.
FIG. 5 is an explanatory diagram for explaining a method of weighting a word association degree map.
FIG. 6 is a flowchart illustrating an example of a weighting determination method.
FIG. 7 is a schematic diagram showing an example of a word association degree map that is weighted.
FIG. 8 is a schematic diagram showing an example of weighting between an abstract and a text.
FIG. 9 is a flowchart illustrating an example of an index creation method.
FIG. 10 is a block diagram showing an example of a database search device according to the present invention.
FIG. 11 is a schematic diagram showing an example of a search menu used in the database search device.
FIG. 12 is a schematic diagram showing another example of a search menu used in the database search device.
FIG. 13 is a block diagram showing a conventional database search apparatus.
FIG. 14 is a block diagram showing a conventional database search apparatus.
FIG. 15 is a block diagram showing a conventional database search device.
[Explanation of symbols]
1 data, 2 data input device, 3 database, 4 index standard, 5 index standard reading device, 6 index creation device, 7 word relevance map creation device, 8 word relevance map, 9 index, 10 word importance assigning device.

Claims

A data input device for entering data into the database;
An index standard reading device for defining an index standard configuration and inputting an index standard configured so that the index has a tree structure;
A word relevance map creating device that examines, for the input data, the frequency of occurrence of words used in the index criterion, calculates the degree of association for words that appear at the same time, and creates, based on the degree of association and the index criterion, a word relevance map, which is a map in which the relevance between the words is assigned to the hierarchical relationships of the index criterion;
A word importance assigning device that divides the number of words appearing in a first document of the input data by the number of words appearing in a second document that is the summary or heading of the first document, and weights the words in the second document by the value obtained;
An index creation device that temporarily corrects the relevance of the word relevance map created by the word relevance map creating device using the weight obtained by the word importance assigning device, traces the corrected word relevance map back to the root node with each word that appears in the input data as a terminal word, adds the relevance values along the path traced back to the root node, divides the resulting total by the number of hierarchy levels traced to obtain an average value, thereby calculating a classification determination evaluation value that is an evaluation value for classifying the input data, and creates an index for the input data with the terminal word having the highest classification determination evaluation value as the classification destination;
A database creation device comprising:

A data input device for entering data into the database;
An index standard reading device for defining an index standard configuration and inputting an index standard configured so that the index has a tree structure;
A word relevance map creating device that examines, for the input data, the frequency of occurrence of words used in the index criterion, calculates the degree of association for words that appear at the same time, and creates, based on the degree of association and the index criterion, a word relevance map, which is a map in which the relevance between the words is assigned to the hierarchical relationships of the index criterion;
A word importance assigning device that divides the number of words appearing in a first document of the input data by the number of words appearing in a second document that is the summary or heading of the first document, and weights the words in the second document by the value obtained;
An index creation device that temporarily corrects the relevance of the word relevance map created by the word relevance map creating device using the weight obtained by the word importance assigning device, traces the corrected word relevance map back to the root node with each word that appears in the input data as a terminal word, adds the relevance values along the path traced back to the root node, divides the resulting total by the number of hierarchy levels traced to obtain an average value, thereby calculating a classification determination evaluation value that is an evaluation value for classifying the input data, and creates an index for the input data with the terminal word having the highest classification determination evaluation value as the classification destination;
A data search device for performing a search based on the index created by the index creation device;
A result display device for displaying the search results;
A database search apparatus comprising: