JP2004126770A

JP2004126770A - Structured document retrieving method and system and structured document database managing device

Info

Publication number: JP2004126770A
Application number: JP2002287324A
Authority: JP
Inventors: Katsuhiko Nonomura; 野々村　克彦; Yosuke Kuroda; 黒田　洋介; Masakazu Hattori; 服部　雅一
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2002-09-30
Filing date: 2002-09-30
Publication date: 2004-04-22
Anticipated expiration: 2022-09-30
Also published as: JP3999093B2

Abstract

PROBLEM TO BE SOLVED: To make it easy to grasp component names that serve as retrieval clues, while minimizing the data volume of the partial document that outlines each structured document in a retrieval result list, for a structured database storing a huge number of structured documents with different document structures.

SOLUTION: A retrieval request whose retrieval condition contains the name and value of a component of the structured document to be retrieved is input through a retrieval request inputting part 12. XML documents matching the retrieval request are retrieved from a document storing part 5. From each retrieved XML document, the part matching the retrieval condition is extracted, and the components containing the matched part are extracted from among the components of the XML document. Each retrieved XML document is displayed at a result list display part 13 using the matched part and the component names extracted by the retrieval request processing part 6.

COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、異なる文書構造の複数の構造化文書を、階層化された論理構造を持つ構造化文書データベースを検索する構造化文書検索方法、システム及び構造化文書データベース管理装置に関する。ここで構造化文書とは、ＸＭＬ（Ｅｘｔｅｎｓｉｂｌｅ　Ｍａｒｋｕｐ　Ｌａｎｇｕａｇｅ）など、文書の構成要素（章、節、段落、要約、著者、題名など）を示す情報を，テキストの形式で文書の中に明示的に記載した電子文書のことをいう。
【０００２】
【従来の技術】
現在、インターネットなどの情報技術の進化により、莫大な量の電子データを容易に入手することができるようになった。一方で、情報量が莫大なため、必要な情報がその莫大なデータの中に埋没してしまい、思うように検索が出来ない結果、十分に活用できないという弊害も発生している。情報が大量に存在していても、それをうまく活用できなければ意味がない。
【０００３】
こうした弊害を解消するため、電子データを構造化文書とし、これにより情報の共有化を容易にしたり、情報の検索をより効率のよいものにしたりする研究がなされ、その有効性が確認されている。例えば、ＨＴＭＬでは、文書の構成要素、例えば文書のタイトル、見出し、段落、著者名などタグ（ｔａｇ）により記載している。また、近年注目されているＸＭＬ（Ｅｘｔｅｎｓｉｂｌｅ　Ｍａｒｋｕｐ　Ｌａｎｇｕａｇｅ）では、このタグを独自に作成することができるため、ＨＴＭＬよりも柔軟な拡張性に優れており、また、ＸＳＬ（ｅＸｔｅｎｓｉｂｌｅ　Ｓｔｙｌｅｓｈｅｅｔ　Ｌａｎｇｕａｇｅ）などの書式情報を利用することにより、様々なメディアに対応することができるなどの利点がある。
【０００４】
このように構造化された文書にしても、複数の文書間では文書の構造はそれぞれ異なっている。こうした異なる文書構造の膨大な数の構造化文書を格納した構造化文書データベースにおける検索において、文書構造が検索結果として表示されれば、検索結果の各構造化文書の概要を即座に把握することが出来て便利である。
例えば、特許文献１では、検索対象文字列を含む構成要素を表示し、ユーザ要求により、この構成要素を含む上位の構成要素（親要素）を順次表示するようにしている。
【０００５】
また、非特許文献１では、特定のＤＴＤ（Ｄｏｃｕｍｅｎｔ　Ｔｙｐｅ　Ｄｅｆｉｎｉｔｉｏｎ　文書型定義）、スキーマに対応したＸＭＬ文書については部分文書のルートとなる構成要素を予め指定し、特定のＤＴＤ、スキーマとの対応のないＸＭＬ文書については兄弟に同じ名前の構成要素名が存在する構成要素を部分文書のルートとなる構成要素とみなし、検索キーワードを含む部分文書だけを表示している。
【０００６】
さらに、非特許文献２では、ＨＴＭＬデータにおける各固定タグの重み付けと単語の重要度の評価法の１つであるｔｆ−ｉｄｆ（ｔｅｒｍ　ｆｒｅｑｕｅｎｃｙ　×　ｉｎｖｅｒｓｅ　ｄｏｃｕｍｅｎｔ　ｆｒｅｑｕｅｎｃｙ）を用いた検索キーワードの重要度に基づいて、各構成要素に対し点数づけを行い、その点数に基づき文書内情報の表示／非表示を決めている。
【０００７】
【特許文献１】
特許第３１４３３４５号公報（第４頁、第５図、第６図）
【非特許文献１】
論文「ＸＭＬ文書の文書構造と内容を用いた部分文書の抽出手法」情報処理学会論文誌：データベース　Ｖｏｌ．４３　Ｎｏ．ＳＩＧ２（ＴＯＤ１３）、２００２年３月発行
【非特許文献２】
論文「Ｄｙｎａｍｉｃ　Ｇｅｎｅｒａｔｉｏｎ　ａｎｄ　Ｂｒｏｗｓｉｎｇ　ｏｆ　Ｖｉｒｔｕａｌ　ＷＷＷ　Ｓｐａｃｅ　Ｂａｓｅｄ　ｏｎ　Ｕｓｅｒ　Ｐｒｏｆｉｌｅｓ」、第５回国際コンピュータサイエンス会議（ＩＣＳＣ）「Ｉｎｔｅｒｎｅｔ　Ａｐｐｌｉｃａｔｉｏｎｓ」の議事録９３−１０８頁（１９９９年１２月１３−１５日香港で開催、Ｓｐｒｉｎｇｅｒ社発行）
【０００８】
【発明が解決しようとする課題】
異なる文書構造の膨大な数の構造化文書が格納されている構造化データベースにおいては、前記３つの文献の手法では以下の課題がある。
まず特許文献１の手法では、検索結果一覧の初期段階では検索条件に一致した部分だけが表示されるに過ぎず、構造化文書の全体の構造が判るような表示はなされない。このため、それぞれの構造化文書の概要を把握したい場合には、その構造化文書のツリー構造を辿る必要がある。
【０００９】
次に非特許文献１の手法では、指定又は決定された構成要素（根ノード）以下の部分木を単純に部分文書として表示するものであり、部分木が大きい場合にはデータ量の制約からそのままのデータを表示するわけにはいかず、一方、前記部分木が小さい場合には、ユーザが期待する情報が欠落する虞が大きい。
【００１０】
さらに非特許文献２の手法は、文書構造がほぼ固定であることを前提としており、文書構造が膨大な構造化文書間で大きく異なる場合には適用できない。
【００１１】
そこで本発明は、上記問題点に鑑み、異なる文書構造の膨大な数の構造化文書が格納されている構造化データベースに対する検索において、構造化文書の概要を表わす部分文書のデータ量を最小限に押さえながら、検索の手がかりとなる構成要素名を容易につかむことが可能な構造化文書検索方法、構造化文書検索システム、構造化文書データベース管理装置を提供することを目的とする。
【００１２】
【課題を解決するための手段】
上記目的達成のため、本発明に係る構造化文書検索方法は、異なる文書構造の複数の構造化文書を格納した構造化文書データベースに対しユーザ端末側から検索要求を送信して検索を行う構造化文書検索方法において、前記構造化文書の構成要素の名前および前記構成要素の値を検索条件に含む検索要求を入力する検索要求入力ステップと、この作成された検索要求に該当する前記構造化文書を前記構造化文書データベースの中から検索する検索ステップと、前記検索ステップで検索された前記構造化文書から、前記検索条件に一致する一致部分を抽出すると共に前記構造化文書に含まれる構成要素のうち前記一致部分を含む構成要素を抽出する抽出ステップと、前記検索ステップで検索された前記構造化文書を、前記抽出ステップで抽出された前記一致部分及び前記構成要素の名前により表示する表示ステップと、を備えたことを特徴とする。
【００１３】
この発明によれば、構造化文書の構成要素の名前および前記構成要素の値を検索条件に含む検索要求が入力され、この検索要求に該当する前記構造化文書が構造化文書データベースの中から検索される。そして、検索された前記構造化文書から、前記検索条件に一致する一致部分が抽出されると共に前記構造化文書に含まれる構成要素のうち前記一致部分を含む構成要素及びその周辺の構成要素が抽出される。そして、検索された前記構造化文書を、抽出された前記一致部分及び前記構成要素名により表示する。
従って、異なる文書構造の膨大な数の構造化文書が格納されている構造化データベースに対する検索において、構造化文書の概要を表わす部分文書のデータ量を最小限に押さえながら、検索の手がかりとなる構成要素名を容易につかむことが可能となる。
【００１４】
本発明において、前記表示ステップは、前記構成要素が前記複数の構造化文書中に共通して存在する度合を示す要素名生起情報を用いて、前記度合が低い構成要素名を優先的に検索結果として表示することができる。これにより、ユーザが入力した値および要素名を含む部分だけでなく、前記度合の低い要素名も表示されるため、埋もれがちな情報が含まれている構成要素の獲得が容易となる。
【００１５】
また、本発明の前記検索要求入力ステップにおいて、前記要素名生起情報を用いて前記度合が高い構成要素の一覧をユーザに提示し、これにより検索条件としてユーザが与える構成要素の入力を支援することもできる。
また、前記表示ステップにおいて表示された前記構成要素をユーザに選択させる選択ステップと、この選択ステップで選択された前記構成要素と前記検索要求入力ステップで入力された検索条件に該当する前記構造化文書を前記構造化文書データベースの中から再検索する再検索ステップとを更に備えるようにすることもできる。これにより、検索結果の絞込みを効率的に実行することができる。
【００１６】
さらに、前記表示ステップにおいて表示された前記構造化文書の中から所望のものをユーザに選択させる構造化文書選択ステップと、前記前記構造化文書選択ステップで選択された前記構造化文書の詳細を前記構造化文書データベースから取得して表示する詳細表示ステップとを更に備えるようにすることもできる。
また、この詳細表示ステップにおいて表示された前記詳細における構成要素をユーザに選択させるステップと、このステップで選択された前記構成要素を表示するステップとを更に備えるようにすることもできる。
【００１７】
また、検索要求入力ステップにおいて、前記構成要素の名前の類似関係を定義した類似構成要素辞書を用いて、入力された前記構成要素と類似の前記構成要素とをまとめて検索条件とすることもできる。
【００１８】
上記目的達成のため、本発明に係る構造化文書検索システムは、異なる文書構造の複数の構造化文書を格納した構造化文書データベースに対し検索要求を送信して検索を行う構造化文書検索システムにおいて、前記構造化文書の構成要素の名前および前記構成要素の値を検索条件に含む検索要求を入力する検索要求入力部と、この入力された検索要求に該当する前記構造化文書を前記構造化文書データベースの中から検索する検索部と、前記検索部で検索された前記構造化文書から、前記検索条件に一致する一致部分を抽出すると共に前記構造化文書に含まれる構成要素のうち前記一致部分を含む前記構成要素及びその周辺の構成要素を抽出する抽出部と、前記検索部で検索された前記構造化文書を、前記抽出部で抽出された前記一致部分及び前記構成要素とにより表示する表示部と、を備えたことを特徴とする。
【００１９】
上記目的達成のため、本発明に係る構造化文書データベース管理装置は、異なる文書構造の複数の構造化文書を格納した構造化文書データベースと接続され、検索要求をユーザ端末から受領し前記データベースを検索すると共に検索結果を前記ユーザ端末に送信する構造化文書データベース管理装置において、検索しようとする前記構造化文書の構成要素の名前および前記構成要素の値を検索条件に含む検索要求を受け付ける検索要求受付部と、前記検索要求に該当する前記構造化文書を前記構造化文書データベースの中から検索する検索部と、前記検索部で検索された前記構造化文書から、前記検索条件に一致する一致部分を抽出すると共に前記構造化文書に含まれる構成要素のうち前記一致部分を含む構成要素及びその周辺の構成要素を抽出する抽出部と、前記抽出部で抽出された前記一致部分及び前記構成要素を、前記構造化文書の構造及びその構造における前記一致部分の位置が理解できるような形式で表示するデータ形式に変更し、前記ユーザ端末に送信する結果処理部とを備えたことを特徴とする。
【００２０】
【発明の実施の形態】
以下、本発明の実施の形態を図面に基づいて説明する。図１は本発明の構造化文書検索システムの構成を示す図である。この実施の形態では、構造化文書はＸＭＬ文書であるものとして説明するが、本発明をこれに限る趣旨ではない。
構造化文書検索システムは、ＧＵＩ部１、要求制御部２、アクセス要求処理部３、データアクセス部４、文書記憶部５、検索要求処理部６、要素名生起情報記憶部７とから大略構成されている。文書記憶部５はＸＭＬ文書を記憶するための構造化文書データベースであり、具体的にはハードディスクドライブなどの外部記憶装置を用いて構成される。図１のシステム構成は、ＬＡＮ（Ｌｏｃａｌ　Ａｒｅａ　Ｎｅｔｗｏｒｋ）、ＷＡＮ（Ｗｉｄｅ　Ａｒｅａ　Ｎｅｔｗｏｒｋ）等のネットワークに接続されたコンピュータ（サーバとクライアント（ユーザ端末））とソフトウェアを用いて実現可能である。
【００２１】
ＧＵＩ部１は、ユーザ（データベース利用者）のＸＭＬ文書の新規格納要求、変更要求、削除要求及び検索要求の入力並びに検索結果の出力をするための入出力インタフェースをグラフィカルに提供する部分であり、登録部１１、検索条件入力部１２、結果一覧表示部１３、詳細表示部１４から構成される。
登録部１１はユーザからのＸＭＬ文書格納や変更、削除の要求を受け付けて、要求制御部２を呼び出す機能を有する。検索条件入力部１２はユーザからの検索要求を受け付けて、要求制御部２を呼び出す機能を有する。結果一覧表示部１３は検索結果一覧を要求制御部２から受け付けて表示する機能を有する。
詳細表示部１４は結果一覧表示部１３に表示された検索結果一覧のうち、ユーザが詳細情報を要求したＸＭＬ文書に関し、その詳細情報を表示する部分である。
【００２２】
要求制御部２は、要求受付部２１と結果処理部２２から構成されている。要求受付部２１は、ＧＵＩ部１からのＸＭＬ文書格納／変更／削除の要求、検索要求等を受け付けて、アクセス要求処理部３又は検索要求処理部６を呼び出す部分である。結果処理部２２はアクセス要求処理部３および検索要求処理部６が処理した結果をＧＵＩ部１に返す処理を行う部分である。
【００２３】
アクセス要求処理部３は、ユーザからのＸＭＬ文書格納や文書変更、文書削除等の要求に対応した処理を行う。
データアクセス部４は、文書記憶部５にアクセスするための基本インタフェースの集合である。データアクセス部４は、文書オブジェクトツリー格納部４１、文書オブジェクトツリー削除部４２、文書オブジェクトツリー取得部４３、文書文字列取得部４４から構成される。文書オブジェクトツリー格納部４１は、登録部１１からのＸＭＬ文書格納指令に基づき、文書記憶部５中の物理的な指定エリアに文書オブジェクトツリーを格納する処理を行う。文書オブジェクトツリー削除部４２は、登録部１１からのＸＭＬ文書削除指令に基づき、文書記憶部５中の物理的な指定エリアに存在する文書オブジェクトツリーを削除する処理を行う。文書オブジェクトツリー取得部４３は、登録部１１からのＸＭＬ文書取得指令に基づき、文書記憶部５中の物理的な指定エリアに存在する文書オブジェクトツリーを取得する処理を行う。文書文字列取得部４４は、文書オブジェクトツリーをＸＭＬ文書に変換する処理を行う。
【００２４】
文書記憶部５は、例えば、図２に示すように、ＸＭＬ文書をＵＮＩＸのディレクトリ構造のように階層的にツリー構造状に格納する。図２では、パス／製品情報群（ルートノードの下の「製品情報群」）というフォルダに図３や図４に示すようなＸＭＬ文書が多数格納され、パス／最新情報／カタログ集（ルートノードの下の「最新情報」の下の「カタログ集」）というフォルダに図５に示すようなＸＭＬ文書が多数格納されていることを示している。これら多数のＸＭＬ文書は、図３〜５に示されるように、その文書構造が異なっている。
【００２５】
図１に戻って、検索要求処理部６は、検索結果抽出部６１、要素名一覧抽出部６２から構成され、ＧＵＩ部１からの検索要求に対応した処理を行う。
検索結果抽出部６１はデータアクセス部４を呼び出すことで、検索条件入力部１２より入力された検索要求を満たす構成要素の集合を求める。
【００２６】
要素名一覧抽出部６２はこの検索結果抽出部６１により求められた構成要素の集合の周辺に位置する構成要素（子要素、親要素、兄弟要素など）の一覧を抽出する。
また、要素名一覧抽出部６２は、図６に示すような要素名生起情報を用いて、これら抽出された各構成要素が登場する頻度を示す度（カウント／文書数。以下これを共通度という）をチェックし、その共通度がある閾値より小さい構成要素名とその構成要素のＩＤの一覧を求める。
図６に示す要素名生起情報は要素名生起情報記憶部７に格納される。要素名生起情報は、文書記憶部５に登録されているＸＭＬ化文書の数２０１（図６中では３２０９）、各構成要素を含むＸＭＬ文書の数２０２から構成される。フラグ２００の役割については後述する。
【００２７】
この要素名生起情報記憶部７の記憶内容は、新しいＸＭＬ文書が文書記憶部５に登録されるごとに更新される。この更新動作の手順を図７に示すフローチャートを用いて説明する。
まずステップ１００において、登録しようとするＸＭＬ文書の構成要素名や構成要素の値、構成要素の親子関係（上下関係）、兄弟関係の情報が文書記憶部５に登録される。
続いてステップ１０１において要素名生起情報のフラグ２００をすべて０に設定する。
【００２８】
続くステップ１０２において、文書数の値２０１をインクリメントする。
次にステップ１０３において、登録されたＸＭＬ文書内の構成要素の名前を順次取得し、続くステップ１０４においてその構成要素名が要素名生起情報記憶部７にデータとして存在するか否かチェックを行う。ＹＥＳの場合には、ステップ１０５においてその構成要素名に該当するフラグ２００の値をチェックする。フラグ２００の値が０であるならば、ステップ１０６においてその構成要素名に該当するカウント２０２の値をインクリメントとともに、フラグの値を１に変更する。フラグ２００の値が０でない場合には、ステップ１０６はスキップする。
【００２９】
ステップ１０４の判定がＮＯの場合には、ステップ１０７へ移行し、要素名生起情報にその構成要素名に関する情報を保持するエリアを確保し、カウント２０２の値を１に設定する。
以上のステップ１０４から１０７までの手順を、すべての構成要素について繰り返す。
【００３０】
図１に戻って、類似構成要素辞書記憶部８は、構成要素名の類似関係を定義した辞書を記憶する部分である。これにより、検索条件入力部１２で入力された構成要素名と類似の構成要素名に関連する構成要素も検索結果抽出部６１により抽出される。
【００３１】
次に、この構造化文書検索システムによる検索の処理手順を、図８及び図９を用いて説明する。
まず図８に示すように、検索条件入力部１２に表示される検索画面において、ユーザは検索要求３０１を入力する。検索条件として入力する項目としては、図９に例示されるように、「キーワード」と「タグ名」とがある。「キーワード」には、検索対象としてのＸＭＬ文書内のいずれかの構成要素の値に含んでほしい文字列等を入力し、「タグ名」には、ＸＭＬ文書内に含んでほしい構成要素名を入力する。「キーワード」又は「タグ名」の欄のいずれか一方だけに検索条件を入力してもよい。また、図９に示すように、１つの欄に複数の文字列を入力してもよい。
なお、図９に示すプルダウンメニューＴ１「タグ名一覧」をクリックすると、共通度の高い構成要素の一覧が表示されるので、検索条件入力の参考にしてもよい。
【００３２】
例えば、図９に示すように、ユーザが「価格が安いパソコンに関して何か有益な情報を得たい。」と考え、検索条件入力部１２において、「キーワード」欄に「パソコン」と「低価格」の文字を、「タグ名」欄に「価格」の文字を検索要求３０１として入力したとする。
すると検索条件入力部１２は、クエリ並びに検索要求入力部１２の「キーワード」欄及び「タグ名」欄に入力された文字の列から構成される指示データ３０２を要求制御部２に送る。
要求制御部２の要求受付部２１は、この指示データ３０２を検索要求処理部６に送る。
検索結果抽出部６１は、この指示データ３０２に合致するＸＭＬ文書を、類似構成要素辞書記憶部８を参照しつつ文書記憶部５から抽出し、そのＸＭＬ文書中から指示データの上記条件に合致する一致部分、及びこの一致部分を含む構成要素の名前の集合を抽出する（図８の３０３）。そして、要素名一覧抽出部６２は、この抽出されたＸＭＬ文書内の構成要素の名称一覧をＸＭＬ文書単位で抽出する（図８の３０４）。
【００３３】
このようにして抽出された検索結果及びその構成要素名一覧データは、要求制御部２２の結果処理部において、検索結果一覧データ３０５としてＸＭＬ文書ごとに１つにまとめられ、検索要求処理部６から要求制御部２の結果処理部２２に送信される。
検索結果一覧データ３０５は、例えば図１０に図示されるような構成のデータを、検索結果としてのＸＭＬ文書ごとに作成したものとなっている。このデータは、以下の３つのものを含んでいる。
（１）検索されたＸＭＬ文書のルート構成要素（文書中の一番外側の構成要素、図１０では＜製品情報＞）
（２）検索条件入力部１２において、検索条件として「キーワード」欄及び「タグ名」欄に与えられた文字列に合致する一致部分（文字列）と、その一致部分を含む構成要素名。図１０では、構成要素＜タイプ＞＜特徴＞＜価格＞とその中の値がこれに該当する。図１０に示すように、それぞれの構成要素の値には、キーワードとして入力された「パソコン」「低価格」、及びこれと類似する「〇〇円」が含まれている。
（３）共通度が低いとされた構成要素名。図１０では、構成要素＜お得情報＞　がこれに該当する。
【００３４】
結果一覧表示部１３は、この検索結果一覧データ３０５に基づき、例えば図１１に示すような検索結果画面を表示する（図８の検索結果一覧表示３０６）。図１１では、検索結果としての複数のＸＭＬ文書のうち、３件のみを表示した形式となっている。別の検索結果を表示したい場合には、「前」「次」のアイコンをクリックすることにより前又は次の３件の検索結果を表示させることができる。
【００３５】
ユーザは、図１１に示すような検索結果画面を見て、詳細を見たいと思うＸＭＬ文書を発見した場合には、そのＸＭＬ文書のルートの構成要素名をマウスでクリックする（図８の文書獲得要求３０７）。これにより、そのＸＭＬ文書の指定された構成要素のＩＤを含む指示データ３０７が要求制御部２に送信される。要求制御部２は文書獲得処理３０９を実行し、獲得された文書データ３１０をＧＵＩ部１に返送する。この文書データ３１０により、ＧＵＩ部１は詳細結果表示３１１を実行し、図１２に示すように、文書の詳細内容を表示させることができる。表示された詳細内容のうち、さらに詳細に見たい部分がある場合にはその部分をマウスでクリックすることにより、表示位置移動要求３１２を要求制御部２に送信することができる。また、別の部分文書を見たい場合に詳細結果表示更新要求３１３を送信することができる。これらの操作により、選択した部分を含む位置を中心とする表示に切り替え、個々の構成要素の情報を即座に見ることが可能となる。
【００３６】
また、ユーザが、図１１に示す検索結果画面を見て、再検索を望む場合には、再検索要求３１５を入力し、再検索を実行することもできる。このとき、図１１に示す構成要素（例えば検索結果１の「タグ名一覧」に表示されている「お得情報」）をマウスで右クリックし、表示されるメニュー選択画面（図示せず）において「再検索」を指定することにより、これを先に入力した検索条件の一部に含めることができる。この実施例の場合、図９に示す「パソコン」「低価格」「価格」に加え、例えば「お得情報」を検索条件に加えて再度の検索を実行することができる。これにより、埋もれがちな情報（共通度が小さい構成要素に係る情報）が含まれている構成要素を基準として条件検索が可能となる。再検索要求３１５は、指示データ３１６に変換される。指示データ３１６を受けた要求制御部２は、この指示データ３１６に基づき、検索結果抽出処理３１７を実行すると共に、検索結果から構成要素名の一覧を抽出する要素名一覧抽出処理３１８を実行する。検索結果一覧３１９は３０５と同様にＧＵＩ部１に送信され、これに基づきＧＵＩ部１において検索結果一覧表示処理３２０が実行される。
【００３７】
次に、図１３に示すフローチャートを用いて、要素名一覧抽出部６２での要素名一覧抽出処理３０４（図８）の詳細について説明する。
まず、検索結果抽出３０３（図８）にて抽出されたＸＭＬ文書内の構成要素の子や親、兄弟を辿ることで周辺構成要素の名称データを取得する（ステップ４０１）。取得された名称データは、要素名一覧抽出部６２において、検索結果として得られたＸＭＬ文書の数だけ用意され予め初期化されたメモリ領域に記憶される。
続くステップ４０２では、図８の検索結果抽出３０３で抽出された構成要素、及びステップ４０１で取得された周辺構成要素について、各構成要素の共通度が閾値Ｘより小さいかどうかをチェックする。具体的には各構成要素の名前に該当する要素生起情報記憶部７のデータ「カウント」を用いて、（カウント／文書数）の値（共通度）がある閾値Ｘ以下であるかチェックし、ＹＥＳの場合には、その構成要素名を要素名一覧抽出部６２のメモリ領域に記憶させる。ＮＯの場合には、その構成要素名はメモリ領域に記憶させず、ステップ４０２へ戻り、別の構成要素の共通度のチェックの手順へ移行する。このステップ４０２，４０３の手順を、全構成要素について繰り返す。
【００３８】
以上の結果一覧により、ユーザが入力した値および要素名を含む部分だけでなく、その部分周辺で共通度の低い要素名も表示されるため、埋もれがちな情報が含まれている構成要素の獲得が容易となる。
【００３９】
以上、発明の実施の形態について説明したが、本発明はこれに限定されるものではなく、本発明の趣旨を逸脱しない範囲で種々の変更、置換、追加等が可能である。
【００４０】
【発明の効果】
以上説明したように、本発明によれば、異なる文書構造の膨大な数の構造化文書が格納されている構造化データベースに対する検索結果一覧において、構造化文書の概要を表わす部分文書のデータ量を最小限に押さえながら、検索の手がかりとなる構成要素名を容易につかむことが可能となる。
【図面の簡単な説明】
【図１】本発明の実施の形態に係る構造化文書検索システムの全体構成を示したブロック図である。
【図２】図１に示す文書記憶部５に記憶される、階層的にツリー構造状に格納されるＸＭＬ文書の内容を示す。
【図３】文書記憶部５に記憶されるＸＭＬ文書の例を示す。
【図４】文書記憶部５に記憶されるＸＭＬ文書の例を示す。
【図５】文書記憶部５に記憶されるＸＭＬ文書の例を示す。
【図６】図１に示す要素名生起情報記憶部７に記憶される要素名生起情報の一例を示す。
【図７】要素名生起情報の更新の手順を示すフローチャートである。
【図８】図１に示す構造化文書検索システムの検索処理動作を説明する概念図である。
【図９】検索条件入力部１２の検索条件入力画面の一例を示す。
【図１０】検索結果一覧データ３０５のデータ構造の一例を示す。
【図１１】検索結果画面の一例を示す。
【図１２】詳細表示画面の一例を示す。
【図１３】要素名一覧抽出部６２での要素名一覧抽出処理を説明するフローチャートである。
【符号の説明】
１…ＧＵＩ部、　２…要求制御部　、３…アクセス要求処理部、　４…データアクセス部、　５…文書記憶部、　６…検索要求処理部、　７…要素名生起情報記憶部、　８…類似構成要素辞書記憶部、　１１…登録部、１２…検索条件入力部、　１３…結果一覧表示部、　１４…詳細表示部、２１…要求受付部、　２２…結果処理部、　４１…文書オブジェクトツリー格納部、　４２…文書オブジェクトツリー削除部、４３…文書オブジェクトツリー取得部、４４…文書文字列取得部、　６１…検索結果抽出部、　６２…要素名一覧抽出部
[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a structured document search method and system for searching a structured document database that has a hierarchical logical structure and holds a plurality of structured documents with different document structures, and to a structured document database management device. Here, a structured document is an electronic document, such as an XML (Extensible Markup Language) document, in which information indicating the components of the document (chapters, sections, paragraphs, abstracts, authors, titles, and so on) is written explicitly in the document in text form.
[0002]
[Prior art]
At present, advances in information technology such as the Internet have made an enormous amount of electronic data easy to obtain. On the other hand, because the amount of information is so large, necessary information gets buried in the data and cannot be searched for as desired, with the harmful result that it cannot be fully used. Even a large amount of information is meaningless if it cannot be used effectively.
[0003]
To solve these problems, research has been conducted on turning electronic data into structured documents, thereby making information easier to share and information retrieval more efficient, and its effectiveness has been confirmed. For example, in HTML, the components of a document, such as its title, headings, paragraphs, and author name, are marked with tags. In XML (Extensible Markup Language), which has attracted attention in recent years, tags can be defined freely, so XML is more flexible and extensible than HTML. It also has the advantage that, by using formatting information such as XSL (eXtensible Stylesheet Language), it can support a variety of media.
[0004]
Even when documents are structured in this way, their structures differ from one document to another. When searching a structured document database that stores a huge number of structured documents with such differing structures, it is convenient if the document structure is displayed in the search results, because the outline of each structured document in the results can then be grasped immediately.
For example, in Patent Document 1, the component containing the search target character string is displayed, and the higher-level components (parent elements) containing that component are displayed one after another at the user's request.
[0005]
In Non-Patent Document 1, for XML documents that conform to a specific DTD (Document Type Definition) or schema, the component that serves as the root of a partial document is specified in advance. For XML documents that do not correspond to a specific DTD or schema, a component that has a sibling with the same component name is regarded as the root of a partial document. In both cases, only the partial documents containing the search keyword are displayed.
[0006]
Further, in Non-Patent Document 2, each component is scored based on the weighting of each fixed tag in HTML data and on the importance of the search keywords computed using tf-idf (term frequency × inverse document frequency), one of the methods for evaluating the importance of a word. Whether to display or hide information within the document is decided based on that score.
[0007]
[Patent Document 1]
Japanese Patent No. 3143345 (page 4, FIG. 5, FIG. 6)
[Non-patent document 1]
Paper "Method for Extracting Partial Documents Using Document Structure and Contents of XML Documents", IPSJ Transactions: Database, Vol. 43, No. SIG2 (TOD13), published March 2002
[Non-patent document 2]
Paper "Dynamic Generation and Browsing of Virtual WWW Space Based on User Profiles", Proceedings of the 5th International Computer Science Conference (ICSC) "Internet Applications", pp. 93-108 (held in Hong Kong, December 13-15, 1999; published by Springer)
[0008]
[Problems to be solved by the invention]
In a structured database in which an enormous number of structured documents having different document structures are stored, the methods described in the above three documents have the following problems.
First, according to the method disclosed in Patent Document 1, only a portion that matches the search condition is displayed in the initial stage of the search result list, and a display that allows the entire structure of the structured document to be understood is not performed. Therefore, when it is desired to grasp the outline of each structured document, it is necessary to follow the tree structure of the structured document.
[0009]
Next, the method of Non-Patent Document 1 simply displays the subtree below a specified or determined component (root node) as the partial document. When the subtree is large, the data cannot be displayed as-is because of limits on data volume. When the subtree is small, on the other hand, the information the user expects is likely to be missing.
[0010]
Furthermore, the method of Non-Patent Document 2 presupposes that the document structure is nearly fixed, and cannot be applied when document structures differ greatly across a huge number of structured documents.
[0011]
In view of the above problems, an object of the present invention is to provide a structured document search method, a structured document search system, and a structured document database management device that, in a search of a structured database storing a huge number of structured documents with different document structures, make it easy to grasp component names that serve as search clues while keeping the data volume of the partial document representing the outline of each structured document to a minimum.
[0012]
[Means for Solving the Problems]
To achieve the above object, a structured document search method according to the present invention performs a search by transmitting a search request from a user terminal to a structured document database storing a plurality of structured documents with different document structures, and is characterized by comprising: a search request input step of inputting a search request whose search condition includes the name of a component of the structured document and the value of the component; a search step of searching the structured document database for the structured documents that satisfy this search request; an extraction step of extracting, from each structured document found in the search step, the matching part that matches the search condition and, from among the components included in the structured document, the component containing the matching part; and a display step of displaying each structured document found in the search step by means of the matching part and the component name extracted in the extraction step.
[0013]
According to this invention, a search request whose search condition includes the name of a component of a structured document and the value of the component is input, and the structured documents that satisfy the search request are retrieved from the structured document database. From each retrieved structured document, the matching part that matches the search condition is extracted, together with the component containing the matching part and its surrounding components from among the components included in the structured document. Each retrieved structured document is then displayed by means of the extracted matching part and component names.
Therefore, in a search of a structured database storing a huge number of structured documents with different document structures, component names that serve as search clues can be grasped easily while the data volume of the partial document representing the outline of each structured document is kept to a minimum.
[0014]
In the present invention, the display step can use element name occurrence information, which indicates the degree to which a component is present in common across the plurality of structured documents, to preferentially display component names with a low degree as search results. As a result, not only the parts containing the value and element name input by the user but also element names with a low degree are displayed, which makes it easier to reach components containing information that tends to be buried.
[0015]
In the search request input step of the present invention, the element name occurrence information can also be used to present the user with a list of components with a high degree, thereby assisting the user in entering the components given as search conditions.
The method may further comprise a selection step of having the user select a component displayed in the display step, and a re-search step of searching the structured document database again for the structured documents that satisfy both the component selected in the selection step and the search condition input in the search request input step. This allows the search results to be narrowed down efficiently.
[0016]
The method may further comprise a structured document selection step of having the user select a desired document from among the structured documents displayed in the display step, and a detail display step of acquiring the details of the structured document selected in the structured document selection step from the structured document database and displaying them.
Further, the method may further include a step of allowing a user to select a component in the details displayed in the detail display step, and a step of displaying the component selected in this step.
[0017]
Further, in the search request input step, a similar component dictionary that defines similarity relationships between component names can be used so that the input component and the components similar to it are combined into the search condition.
[0018]
To achieve the above object, a structured document search system according to the present invention performs a search by transmitting a search request to a structured document database storing a plurality of structured documents with different document structures, and is characterized by comprising: a search request input unit for inputting a search request whose search condition includes the name of a component of the structured document and the value of the component; a search unit for searching the structured document database for the structured documents that satisfy the input search request; an extraction unit for extracting, from each structured document found by the search unit, the matching part that matches the search condition and, from among the components included in the structured document, the component containing the matching part and its surrounding components; and a display unit for displaying each structured document found by the search unit by means of the matching part and the components extracted by the extraction unit.
[0019]
To achieve the above object, a structured document database management device according to the present invention is connected to a structured document database storing a plurality of structured documents with different document structures, receives a search request from a user terminal, searches the database, and transmits the search result to the user terminal, and is characterized by comprising: a search request receiving unit that accepts a search request whose search condition includes the name of a component of the structured document to be searched for and the value of the component; a search unit that searches the structured document database for the structured documents that satisfy the search request; an extraction unit that extracts, from each structured document found by the search unit, the matching part that matches the search condition and, from among the components included in the structured document, the component containing the matching part and its surrounding components; and a result processing unit that converts the matching part and the components extracted by the extraction unit into a data format for display in which the structure of the structured document and the position of the matching part within that structure can be understood, and transmits the result to the user terminal.
[0020]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings. FIG. 1 is a diagram showing the configuration of the structured document search system of the present invention. In this embodiment, the structured document is described as an XML document, but the present invention is not limited to this.
The structured document search system broadly consists of a GUI unit 1, a request control unit 2, an access request processing unit 3, a data access unit 4, a document storage unit 5, a search request processing unit 6, and an element name occurrence information storage unit 7. The document storage unit 5 is a structured document database for storing XML documents, and is specifically configured using an external storage device such as a hard disk drive. The system configuration in FIG. 1 can be realized using computers (a server and clients (user terminals)) connected to a network such as a LAN (Local Area Network) or WAN (Wide Area Network), together with software.
[0021]
The GUI unit 1 graphically provides an input/output interface through which a user (database user) enters requests to newly store, change, delete, and search for XML documents, and through which search results are output. It consists of a registration unit 11, a search condition input unit 12, a result list display unit 13, and a detail display unit 14.
The registration unit 11 has a function of receiving a request for storing, changing, or deleting an XML document from a user and calling the request control unit 2. The search condition input unit 12 has a function of receiving a search request from a user and calling the request control unit 2. The result list display unit 13 has a function of receiving and displaying a search result list from the request control unit 2.
The detail display unit 14 displays detailed information about an XML document for which the user has requested details from the search result list displayed on the result list display unit 13.
[0022]
The request control unit 2 consists of a request receiving unit 21 and a result processing unit 22. The request receiving unit 21 receives XML document storage/change/deletion requests, search requests, and the like from the GUI unit 1 and calls the access request processing unit 3 or the search request processing unit 6. The result processing unit 22 returns the results processed by the access request processing unit 3 and the search request processing unit 6 to the GUI unit 1.
[0023]
The access request processing unit 3 performs processing corresponding to a request from the user for storing an XML document, changing a document, deleting a document, and the like.
The data access unit 4 is a set of basic interfaces for accessing the document storage unit 5. The data access unit 4 includes a document object tree storage unit 41, a document object tree deletion unit 42, a document object tree acquisition unit 43, and a document character string acquisition unit 44. The document object tree storage unit 41 performs a process of storing a document object tree in a physically designated area in the document storage unit 5 based on an XML document storage command from the registration unit 11. The document object tree deletion unit 42 performs a process of deleting a document object tree existing in a physically designated area in the document storage unit 5 based on an XML document deletion command from the registration unit 11. The document object tree acquisition unit 43 performs a process of acquiring a document object tree existing in a physically designated area in the document storage unit 5 based on an XML document acquisition command from the registration unit 11. The document character string acquisition unit 44 performs a process of converting a document object tree into an XML document.
[0024]
As shown in FIG. 2, for example, the document storage unit 5 stores XML documents hierarchically in a tree structure, like a UNIX directory structure. FIG. 2 shows that a large number of XML documents like those in FIGS. 3 and 4 are stored in a folder with the path /product information group ("product information group" under the root node), and that a large number of XML documents like the one in FIG. 5 are stored in a folder with the path /latest information/catalog collection ("catalog collection" under "latest information" under the root node). As FIGS. 3 to 5 show, these XML documents differ in document structure.
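The folder hierarchy described above can be pictured with a small in-memory sketch. It is only an illustration under stated assumptions: the class `DocumentStore`, its method names, and the underscore-separated folder paths are hypothetical and do not come from the patent, which stores document object trees on an external storage device.

```python
from collections import defaultdict
from pathlib import PurePosixPath


class DocumentStore:
    """Toy stand-in for the document storage unit 5 (hypothetical API)."""

    def __init__(self):
        # Maps a folder path to the XML documents stored directly in it.
        self._folders = defaultdict(list)

    def store(self, folder, xml_text):
        """Store one XML document in the given folder."""
        self._folders[PurePosixPath(folder)].append(xml_text)

    def documents_under(self, folder):
        """Return all documents in the folder or in any folder below it."""
        base = PurePosixPath(folder)
        return [
            doc
            for path, docs in self._folders.items()
            if path == base or base in path.parents
            for doc in docs
        ]


store = DocumentStore()
store.store("/product_information_group", "<product_info/>")
store.store("/latest_information/catalog_collection", "<catalog/>")
```

Calling `store.documents_under("/")` would return both documents, while `store.documents_under("/latest_information")` would return only the catalog document.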
[0025]
Returning to FIG. 1, the search request processing unit 6 includes a search result extraction unit 61 and an element name list extraction unit 62, and performs processing corresponding to the search request from the GUI unit 1.
The search result extraction unit 61 calls the data access unit 4 to obtain a set of components that satisfy the search request input from the search condition input unit 12.
[0026]
The element name list extracting unit 62 extracts a list of constituent elements (child elements, parent elements, sibling elements, etc.) located around the set of constituent elements obtained by the search result extracting unit 61.
The element name list extraction unit 62 also uses element name occurrence information such as that shown in FIG. 6 to check, for each extracted component, the degree indicating how frequently it appears (count / number of documents; hereinafter called the commonality), and obtains a list of the names and IDs of the components whose commonality is smaller than a certain threshold.
The element name occurrence information shown in FIG. 6 is stored in the element name occurrence information storage unit 7. It consists of the number 201 of XML documents registered in the document storage unit 5 (3209 in FIG. 6) and, for each component, the number 202 of XML documents containing it. The role of the flag 200 is described later.
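As a rough illustration of the commonality check, the following sketch computes count / number of documents for each name and keeps the names that fall below a threshold. The class and function names, the per-element counts, and the threshold value 0.1 are assumptions made for the example; only the document total 3209 is taken from the description of FIG. 6.

```python
from dataclasses import dataclass, field


@dataclass
class OccurrenceInfo:
    """Element name occurrence information (hypothetical field names)."""
    doc_count: int = 0                           # number of registered documents (201)
    counts: dict = field(default_factory=dict)   # documents containing each name (202)

    def commonality(self, name):
        """Return count / number of documents for one component name."""
        if self.doc_count == 0:
            return 0.0
        return self.counts.get(name, 0) / self.doc_count


def rare_names(info, names, threshold):
    """Keep the names whose commonality is smaller than the threshold."""
    return [name for name in names if info.commonality(name) < threshold]


# Illustrative counts only; they are not taken from FIG. 6.
info = OccurrenceInfo(doc_count=3209, counts={"price": 3000, "bargain_info": 12})
rare = rare_names(info, ["price", "bargain_info"], threshold=0.1)
```

With these made-up counts, "price" appears in most documents and is filtered out, while "bargain_info" appears in very few and is kept as a candidate for display.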
[0027]
The content stored in the element name occurrence information storage unit 7 is updated each time a new XML document is registered in the document storage unit 5. The procedure of this update operation will be described with reference to the flowchart shown in FIG.
First, in step 100, the component names, component values, parent-child (hierarchical) relationships, and sibling relationships of the XML document to be registered are registered in the document storage unit 5.
Subsequently, in step 101, all the flags 200 of the element name occurrence information are set to 0.
[0028]
In the following step 102, the value 201 of the number of documents is incremented.
Next, in step 103, the names of the components in the registered XML document are sequentially acquired, and in step 104, it is checked whether or not the component names exist as data in the element name occurrence information storage unit 7. In the case of YES, in step 105, the value of the flag 200 corresponding to the component name is checked. If the value of the flag 200 is 0, the value of the count 202 corresponding to the component name is incremented and the value of the flag is changed to 1 in step 106. If the value of the flag 200 is not 0, step 106 is skipped.
[0029]
If the determination in step 104 is NO, the process proceeds to step 107 to secure an area for holding information on the component name in the element name occurrence information, and set the value of the count 202 to 1.
The above steps 104 to 107 are repeated for all components.
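The update procedure of steps 101 to 107 can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the dictionary layout and the function name `register_document` are hypothetical, step 100 (storing the document itself) is omitted, and setting the flag to 1 in step 107 is an assumption added so that a name appearing twice in the same newly registered document is counted only once.

```python
def register_document(occurrence, component_names):
    """Update the element name occurrence information for one newly registered document.

    occurrence: dict with keys "num_documents" (value 201), "count" (value 202 per name)
    and "flag" (flag 200 per name). component_names: names in document order.
    """
    # Step 101: reset every flag to 0.
    for name in occurrence["flag"]:
        occurrence["flag"][name] = 0

    # Step 102: one more document is registered.
    occurrence["num_documents"] += 1

    # Steps 103 to 107: process each component name in turn.
    for name in component_names:
        if name in occurrence["count"]:          # step 104: name already known?
            if occurrence["flag"][name] == 0:    # step 105: not yet counted for this document
                occurrence["count"][name] += 1   # step 106
                occurrence["flag"][name] = 1
        else:                                    # step 107: reserve an entry, count = 1
            occurrence["count"][name] = 1
            occurrence["flag"][name] = 1         # assumption: avoid double counting a repeat
    return occurrence


occurrence = {"num_documents": 0, "count": {}, "flag": {}}
register_document(occurrence, ["product information", "type", "price", "price"])
register_document(occurrence, ["catalog", "price"])
print(occurrence["num_documents"], occurrence["count"])
```

Because the flag is reset per document, the count for each name ends up as the number of documents containing that name, not the number of occurrences.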
[0030]
Returning to FIG. 1, the similar component dictionary storage unit 8 is a unit that stores a dictionary that defines a similar relationship between component names. Thus, the search result extraction unit 61 also extracts components related to component names similar to the component name input in the search condition input unit 12.
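How a similar component dictionary might widen a tag name condition can be sketched as below. The description does not specify the dictionary format, so the mapping `SIMILAR_COMPONENTS`, its entries, and the function `expand_tag_names` are all hypothetical.

```python
# Hypothetical dictionary: component name -> set of names treated as similar.
SIMILAR_COMPONENTS = {
    "price": {"cost", "selling price"},
}


def expand_tag_names(tag_names, dictionary):
    """Return the input tag names together with every name registered as similar."""
    expanded = set(tag_names)
    for name in tag_names:
        expanded |= dictionary.get(name, set())
    return expanded


print(sorted(expand_tag_names(["price"], SIMILAR_COMPONENTS)))
```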
[0031]
Next, a search processing procedure by the structured document search system will be described with reference to FIGS.
First, as shown in FIG. 8, the user inputs a search request 301 on a search screen displayed on the search condition input unit 12. The items to be input as search conditions include a "keyword" and a "tag name", as illustrated in the drawings. In the "keyword" field, the user enters a character string or the like that should be included in the value of some component of the XML document to be searched; in the "tag name" field, the user enters the name of a component that the XML document should contain. A search condition may be entered in only one of the "keyword" and "tag name" fields. Further, as shown in FIG. 9, a plurality of character strings may be entered in one field.
When a pull-down menu T1 “tag name list” shown in FIG. 9 is clicked, a list of components having a high degree of commonality is displayed, which may be used as a reference for inputting search conditions.
[0032]
For example, suppose that, as shown in FIG. 9, a user who wants useful information about a low-priced personal computer enters "PC" and "low price" in the "keyword" field and "price" in the "tag name" field of the search condition input unit 12 as the search request 301.
Then, the search condition input unit 12 sends to the request control unit 2 instruction data 302 including a query and the character strings entered in the "keyword" and "tag name" fields of the search condition input unit 12.
The request receiving unit 21 of the request control unit 2 sends the instruction data 302 to the search request processing unit 6.
The search result extraction unit 61 refers to the similar component dictionary storage unit 8 and extracts, from the document storage unit 5, the XML documents that match the instruction data 302. From each such XML document it extracts the matching parts that match the conditions of the instruction data and the set of names of the components including those matching parts (303 in FIG. 8). Then, the element name list extraction unit 62 extracts, for each XML document, the name list of the components in the extracted XML document (304 in FIG. 8).
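The extraction of matching parts from one XML document can be sketched with Python's standard `xml.etree.ElementTree` module. This is only an illustration under stated assumptions: the function `extract_matches`, the sample document `DOC` and its values, and the English tag names with underscores are hypothetical, and the description does not state how keyword and tag name conditions are combined, so this sketch simply treats an element as a match if either condition holds.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; tag names use underscores to be valid XML names.
DOC = """<product_information>
  <type>PC</type>
  <feature>low price and light weight</feature>
  <price>50000 yen</price>
  <profit_information>free shipping</profit_information>
</product_information>"""


def extract_matches(xml_text, keywords, tag_names):
    """Return the root component name and a list of (component name, value) matches."""
    root = ET.fromstring(xml_text)
    matches = []
    for elem in root.iter():
        text = (elem.text or "").strip()
        hit_keyword = any(k in text for k in keywords)  # value contains a keyword
        hit_tag = elem.tag in tag_names                 # component name was requested
        if hit_keyword or hit_tag:                      # assumption: either condition suffices
            matches.append((elem.tag, text))
    return root.tag, matches


root_name, matches = extract_matches(DOC, ["PC", "low price"], ["price"])
print(root_name, matches)
```

In this sample, `<profit_information>` is not a match, which is why a separate step is needed to surface it as a peripheral component name.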
[0033]
The search results and the component name list data thus extracted are transmitted to the result processing unit 22 of the request control unit 2, where they are combined, for each XML document, into search result list data 305.
The search result list data 305 is, for example, data having a configuration as shown in FIG. 10 created for each XML document as a search result. This data includes the following three items.
(1) Root component of retrieved XML document (outermost component in document, <product information> in FIG. 10)
(2) The matching parts (character strings) that match the character strings given as search conditions in the "keyword" and "tag name" fields of the search condition input unit 12, and the names of the components including those matching parts. In FIG. 10, the components <type>, <feature>, <price> and the values therein correspond to this. As shown in FIG. 10, the values of the respective components include "PC" and "low price" input as keywords, and "〇〇 yen" similar thereto.
(3) Component names determined to have low commonality. In FIG. 10, the component <profitable information> corresponds to this.
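The three items above suggest a simple per-document record. The following sketch is one possible shape, not the data format of FIG. 10: the class `SearchResultEntry` and its field names are hypothetical, and the pairing of each value with a particular component is assumed for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SearchResultEntry:
    """Hypothetical record for one XML document in the search result list data 305."""
    root_name: str                                              # item (1)
    matches: list = field(default_factory=list)                 # item (2): (name, matching value)
    low_commonality_names: list = field(default_factory=list)   # item (3)


# Component names follow the description of FIG. 10; the value pairing is assumed.
entry = SearchResultEntry(
    root_name="product information",
    matches=[("type", "PC"), ("feature", "low price"), ("price", "〇〇 yen")],
    low_commonality_names=["profitable information"],
)
print(entry.root_name, [name for name, _ in entry.matches], entry.low_commonality_names)
```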
[0034]
Based on the search result list data 305, the result list display unit 13 displays, for example, a search result screen as shown in FIG. 11 (search result list display 306 in FIG. 8). FIG. 11 shows a format in which only three of a plurality of XML documents as search results are displayed. If another search result is desired to be displayed, the previous or next three search results can be displayed by clicking the “previous” and “next” icons.
[0035]
When the user looks at the search result screen shown in FIG. 11 and finds an XML document whose details he or she wants to view, the user clicks the root component name of that XML document with the mouse (document acquisition request 307 in FIG. 8). As a result, the instruction data 307 including the ID of the specified component of the XML document is transmitted to the request control unit 2. The request control unit 2 executes the document acquisition processing 309 and returns the acquired document data 310 to the GUI unit 1. Based on the document data 310, the GUI unit 1 executes the detailed result display 311 and displays the detailed contents of the document as shown in FIG. 12. If there is a part of the displayed detailed contents that the user wants to see in more detail, the user can transmit a display position movement request 312 to the request control unit 2 by clicking that part with the mouse. Further, when another partial document is to be viewed, a detailed result display update request 313 can be transmitted. Through these operations, the display switches to one centered on the position including the selected part, and the information of each component can be viewed immediately.
[0036]
In addition, when the user looks at the search result screen shown in FIG. 11 and wants to search again, the user can input a re-search request 315 and execute the re-search. At this time, by right-clicking a component shown in FIG. 11 (for example, "profit information" displayed in the "tag name list" of search result 1) with the mouse and designating "re-search" on a menu selection screen (not shown), the user can add that component to the previously input search conditions. In this embodiment, for example, "profit information" can be added to the search conditions "PC", "low price" and "price" shown in FIG. 9, and the search can be performed again. This makes it possible to perform a conditional search based on components that contain information that tends to be buried (information in components having low commonality). The re-search request 315 is converted into instruction data 316. Upon receiving the instruction data 316, the request control unit 2 executes, based on the instruction data 316, a search result extraction process 317 and an element name list extraction process 318 that extracts a list of component names from the search result. The search result list 319 is transmitted to the GUI unit 1 in the same way as the search result list data 305, and the GUI unit 1 executes the search result list display processing 320 based on it.
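Building the re-search conditions from the previous ones can be sketched as below. The dictionary layout and the function `add_to_conditions` are hypothetical, and adding the selected component as a tag name condition is an assumption; the description only states that it becomes part of the search conditions.

```python
def add_to_conditions(conditions, component_name):
    """Return new search conditions with the selected component name added.

    conditions: dict with "keywords" and "tag_names" lists (hypothetical layout).
    The input is left unchanged so that the previous search can still be repeated.
    """
    tag_names = list(conditions["tag_names"])
    if component_name not in tag_names:
        tag_names.append(component_name)  # assumption: added as a tag name condition
    return {"keywords": list(conditions["keywords"]), "tag_names": tag_names}


previous = {"keywords": ["PC", "low price"], "tag_names": ["price"]}
research = add_to_conditions(previous, "profit information")
print(research)
```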
[0037]
Next, details of the element name list extraction processing 304 (FIG. 8) in the element name list extraction unit 62 will be described using the flowchart shown in FIG.
First, the name data of the peripheral component is obtained by tracing the child, parent, and sibling of the component in the XML document extracted by the search result extraction 303 (FIG. 8) (step 401). The acquired name data is stored in a memory area that is prepared and initialized in advance by the element name list extraction unit 62 by the number of XML documents obtained as search results.
In the following step 402, for the components extracted by the search result extraction 303 in FIG. 8 and the peripheral components obtained in step 401, it is checked whether the commonality of each component is smaller than the threshold X. Specifically, using the "count" data of the element name occurrence information storage unit 7 corresponding to the name of each component, it is checked whether the value (count / number of documents), that is, the commonality, is equal to or less than the threshold X. In the case of YES, the component name is stored in the memory area of the element name list extraction unit 62. In the case of NO, the component name is not stored in the memory area, and the process returns to step 402 to check the commonality of another component. The procedure of steps 402 and 403 is repeated for all components.
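Steps 401 to 403 can be sketched with `xml.etree.ElementTree` as follows. This is an illustration under assumptions: the function `extract_element_name_list`, the sample document, the counts, and the threshold value 0.1 are hypothetical, the number of documents is assumed to be greater than zero, and the comparison uses "equal to or less than" as in the detailed description of the check.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; tag names use underscores to be valid XML names.
DOC = """<product_information>
  <type>PC</type>
  <feature>low price</feature>
  <price>50000 yen</price>
  <profit_information>free shipping</profit_information>
</product_information>"""


def extract_element_name_list(root, matched, counts, num_documents, threshold):
    """Return names of matched and peripheral components whose commonality <= threshold."""
    # ElementTree keeps no parent pointers, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    candidates = []
    for elem in matched:
        around = [elem] + list(elem)                      # the match itself and its children
        parent = parent_of.get(elem)
        if parent is not None:
            around.append(parent)                         # parent
            around.extend(s for s in parent if s is not elem)  # siblings
        for e in around:                                  # step 401: collect names once
            if e.tag not in candidates:
                candidates.append(e.tag)

    # Steps 402 and 403: keep only names with low commonality.
    return [n for n in candidates if counts.get(n, 0) / num_documents <= threshold]


root = ET.fromstring(DOC)
counts = {"product_information": 3000, "type": 2900, "feature": 2500,
          "price": 2800, "profit_information": 12}       # invented counts
names = extract_element_name_list(root, [root.find("price")], counts, 3209, 0.1)
print(names)
```

With these invented counts, only `profit_information` falls under the threshold, which mirrors how a rarely used component name is surfaced next to the matched `<price>` component.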
[0038]
According to the above result list, not only the parts including the values and element names input by the user but also the surrounding element names having a low degree of commonality are displayed, so that it becomes easier to obtain components containing information that tends to be buried.
[0039]
Although the embodiments of the present invention have been described above, the present invention is not limited to these embodiments, and various changes, substitutions, additions, and the like can be made without departing from the spirit of the present invention.
[0040]
[Effect of the invention]
As described above, according to the present invention, in a search result list for a structured database storing an enormous number of structured documents having different document structures, it is possible to easily grasp the component names that serve as clues for the search while keeping the data amount of the partial document representing the outline of each structured document to a minimum.
[Brief description of the drawings]
FIG. 1 is a block diagram showing an overall configuration of a structured document search system according to an embodiment of the present invention.
FIG. 2 shows contents of an XML document stored in a document storage unit 5 shown in FIG. 1 and stored in a hierarchical tree structure.
FIG. 3 shows an example of an XML document stored in a document storage unit 5;
FIG. 4 shows an example of an XML document stored in a document storage unit 5;
FIG. 5 shows an example of an XML document stored in a document storage unit 5;
FIG. 6 shows an example of element name occurrence information stored in an element name occurrence information storage unit 7 shown in FIG.
FIG. 7 is a flowchart illustrating a procedure for updating element name occurrence information.
FIG. 8 is a conceptual diagram illustrating a search processing operation of the structured document search system shown in FIG.
FIG. 9 shows an example of a search condition input screen of the search condition input section 12.
FIG. 10 shows an example of the data structure of search result list data 305.
FIG. 11 shows an example of a search result screen.
FIG. 12 shows an example of a detail display screen.
FIG. 13 is a flowchart illustrating an element name list extraction process in an element name list extraction unit 62.
[Explanation of symbols]
1 ... GUI unit, 2 ... request control unit, 3 ... access request processing unit, 4 ... data access unit, 5 ... document storage unit, 6 ... search request processing unit, 7 ... element name occurrence information storage unit, 8 ... similar component dictionary storage unit, 11 ... registration unit, 12 ... search condition input unit, 13 ... result list display unit, 14 ... detailed display unit, 21 ... request receiving unit, 22 ... result processing unit, 41 ... document object tree storage unit, 42 ... document object tree deletion unit, 43 ... document object tree acquisition unit, 44 ... document character string acquisition unit, 61 ... search result extraction unit, 62 ... element name list extraction unit

Claims

A structured document search method for performing a search by sending a search request from a user terminal to a structured document database storing a plurality of structured documents having different document structures,
A search request inputting step of inputting a search request including a name of a component of the structured document and a value of the component in a search condition;
A search step of searching the structured document database for the structured document corresponding to the input search request;
An extraction step of extracting, from the structured document searched in the search step, a matching part that matches the search condition, and extracting, from among the components included in the structured document, the component including the matching part and its surrounding components;
A display step of displaying the structured document searched in the search step by the matching part and the names of the components extracted in the extraction step;
A structured document search method comprising:

The structured document search method according to claim 1, wherein the display step preferentially displays, as a search result, the names of components having a low degree, using element name occurrence information indicating the degree to which each component is commonly present in the plurality of structured documents.

The structured document search method according to claim 1 or 2, wherein, in the search request inputting step, a list of the components having a high degree is presented to the user using the element name occurrence information, thereby supporting the input of components given by the user as search conditions.

A selection step of allowing a user to select the component name displayed in the display step,
A re-search step of re-searching the structured document database corresponding to the component name selected in the selection step and the search condition input in the search request input step from the structured document database. 3. The structured document search method according to claim 1, wherein:

A structured document selecting step for allowing a user to select a desired one from the structured documents displayed in the displaying step,
3. The structured document search method according to claim 1, further comprising: a detail display step of acquiring and displaying details of the structured document selected in the structured document selection step from the structured document database.

The structured document search method according to claim 5, further comprising: a step of allowing a user to select a component in the details displayed in the detail display step; and a step of displaying the component selected in that step.

In the search request inputting step, the input component name and the similar component name are collectively used as a search condition using a similar component dictionary defining a similar relationship between the component names. 2. The structured document search method according to claim 1, wherein:

In a structured document search system for performing a search by sending a search request to a structured document database storing a plurality of structured documents having different document structures,
A search request input unit for inputting a search request including a name of a component of the structured document and a value of the component in a search condition;
A search unit for searching the structured document database for the structured document corresponding to the input search request,
An extraction unit that extracts, from the structured document searched by the search unit, a matching part that matches the search condition, and extracts, from among the components included in the structured document, the component including the matching part and its surrounding components,
A display unit that displays the structured document searched by the search unit, by the matching part and the name of the component extracted by the extraction unit;
A structured document search system comprising:

The structured document search system according to claim 8, further comprising an element name occurrence information storage unit that stores element name occurrence information indicating the degree to which each component is commonly present in the plurality of structured documents, wherein the display unit uses the element name occurrence information to preferentially display, as a search result, the names of the components having a low degree.

The structured document search system according to claim 8, further comprising a component display unit that displays a list of the components with a high degree using the element name occurrence information.

A selection unit that allows a user to select the component name displayed on the display unit,
The search unit is configured to search for a structured document corresponding to the component name selected by the selection unit and the search condition input by the search request input unit. Item 9. A structured document search system according to Item 8.

A structured document selection unit that allows a user to select a desired one from the structured documents displayed by the display unit,
The structured document search system according to claim 8, further comprising: a detail display unit configured to acquire and display details of the structured document selected by the structured document selection unit from the structured document database.

The apparatus further includes a component selection unit that allows a user to select a component displayed on the detail display unit, and a component display unit that displays the component selected by the component selection unit. The structured document search system according to claim 8.

A dictionary storage unit that stores a similar component dictionary that defines the similarity relationship of the component names, wherein the search unit stores the component names input in the search request input unit and the similar component names. 9. The structured document search system according to claim 8, wherein a search is executed collectively as a search condition.

A structured document database management device connected to a structured document database storing a plurality of structured documents having different document structures, receiving a search request from a user terminal, searching the database, and transmitting a search result to the user terminal,
A search request receiving unit that receives a search request including a name of a component of the structured document to be searched and a value of the component in a search condition;
A search unit that searches the structured document database corresponding to the search request from the structured document database;
An extraction unit that extracts, from the structured document searched by the search unit, a matching part that matches the search condition, and extracts, from among the components included in the structured document, the component including the matching part and the components around it;
A result processing unit that converts the matching part and the components extracted by the extraction unit into a data format to be displayed in a form that allows the structure of the structured document and the position of the matching part in that structure to be understood, and transmits the converted data; the structured document database management device comprising the above units.

An element name occurrence information acquisition unit that acquires element name occurrence information indicating a degree in which the constituent element is commonly present in the plurality of structured documents, and the result processing unit uses the element name occurrence information 9. The structured document database management device according to claim 8, wherein the name of the component having a low degree is displayed on the user terminal so as to be distinguishable from others.

The structured document database management device according to claim 15, further comprising an element name occurrence information presentation unit that presents, to a user terminal, information on the components having a high degree, using the element name occurrence information acquired by the element name occurrence information acquisition unit.