JP2004258912A

JP2004258912A - Document retrieval device, method and program

Info

Publication number: JP2004258912A
Application number: JP2003048009A
Authority: JP
Inventors: Kenji Ono; 顕司小野; Mitsuo Nunome; 光生布目; Masaru Suzuki; 優鈴木; Takuya Kanewa; 拓也金輪; Shozo Isobe; 庄三磯部
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2003-02-25
Filing date: 2003-02-25
Publication date: 2004-09-16

Abstract

PROBLEM TO BE SOLVED: To provide a document retrieval device and method capable of quickly and efficiently retrieving document data where structured data and unstructured data are intermingled.

SOLUTION: Desired document data are retrieved from a storage means which stores a plurality of document data, including structured data to be retrieved by a first search method which is one of two different retrieval methods, and unstructured data to be retrieved by a second search method which is the other of the two methods. First retrieval requirements for the structured data and second retrieval requirements for the unstructured data are input and the structured data among the document data stored in the storage means are retrieved using the first retrieval method, whereby document data including the structured data satisfying the first retrieval requirements are obtained. The unstructured data among the document data obtained are retrieved using the second retrieval method, whereby the document data including the unstructured data satisfying the second retrieval requirements are obtained.

COPYRIGHT: (C)2004,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は文書検索装置および方法に関する。
【０００２】
【従来の技術】
インターネットやイントラネットにおいて顧客情報や商品情報などのデータベースと連動したサービスやアプリケーションを構築する技術として、従来のリレーショナルデータベース（以下ＲＤＢ）技術に加えて近年はＸＭＬ（ＥｘｔｅｎｓｉｂｌｅＭａｒｋｕｐＬａｎｇｕａｇｅ：株式会社メディアフュージョンＸＭＬラボ発行所ソフトバンクパブリッシング株式会社「ＭＬデータベースによるＷｅｂアプリケーション開発」２００１年３月）といったマークアップ言語によるデータベース技術が普及してきている。
【０００３】
ＸＭＬデータの検索技術としては、ＸＱＬ（ＸＭＬＱｕｅｒｙＬａｎｕｇａｇｅ：［ＵＲＬ］ｈｔｔｐ：／／ｗｗｗ．ｗ３．ｏｒｇ／ＴａｎｄＳ／ＱＬ／ＱＬ９８／ｐｐ／ｘｑｌ．ｈｔｍｌ）などが知られている。一方、文書情報の検索方式としては従来からテキスト検索、特にフルテキスト検索技術がよく知られている。
【０００４】
一般にＸＭＬ技術は従来のＲＤＢ技術が扱ってきたような構造化されたデータと、従来のテキスト検索技術が扱ってきたような非構造化データの中間に位置するようなデータ（半構造化データ）を処理するのに適しているといわれている。
【０００５】
ところで、ＸＭＬの普及に伴い従来テキスト検索で扱ってきたデータをＸＭＬデータ化してＸＭＬ検索しようという要求が高まってきている。既存のテキスト文書の完全なＸＭＬ化のコストは膨大であり、当面はその文書中の限られた情報データについてだけＸＭＬ化を行うことが多いと思われる。また管理上、ＸＭＬ化の対象である文書自体にＸＭＬタグを付加していく形でＸＭＬ化を行うケースが多い。ＸＭＬ化の対象でない文書部分は、適当なタグで囲むことによりコメント部分として処理される。このようにすれば、元のデータとＸＭＬ化したデータがファイル単位で一元管理でき、元のデータのどの部分がＸＭＬ化されていて、どの部分がまだそうでないかが自明であるからである。元のデータの情報はすべてファイル中で保持されているので、将来のＸＭＬ化作業の際にもタグを追加していくだけでよく、便宜がいい。
【０００６】
このような情報の一部がＸＭＬ化されたデータに対してＸＭＬ検索を行う場合、未だＸＭＬ化されていないデータ部分はＸＭＬデータではないので、当然のことながらうまく検索できないという問題がある。
【０００７】
ＸＭＬ文書などの構造化文書の検索においてフルテキスト検索を利用しているものがある（例えば、特許文献１参照）。この手法は、ＸＭＬ文書全体を上記したようなコメント部分も含めてすべて事前に処理してフルテキスト検索用インデクスを作成しておく。そのため、通常のＸＭＬ検索に必要でないコメント部分もインデクスに含まれるため、インデクスが巨大になり検索速度を低下させる。
【０００８】
また、文書のレイアウト情報なども含めてＸＭＬ化して統一管理するというものもある（例えば、特許文献２参照）。
【０００９】
しかし、上記いずれの技術も、ＸＭＬ化されたデータと未だＸＭＬ化されていないデータ部分とが混在する文書データを高速に、しかも効率よく検索することはできない。
【００１０】
【特許文献１】
特開２００１−１６７０８７公報
【００１１】
【特許文献２】
特開２０００−９９５４３公報
【００１２】
【発明が解決しようとする課題】
以上説明したように、従来は、ＸＭＬデータ化された構造化データと、ＸＭＬデータ化されていない非構造化データとが混在する文書データを高速に、しかも効率よく検索することができないという問題点があった。
【００１３】
そこ、本発明は、上記問題点に鑑み、構造化データと非構造化データの混在する文書データの検索が高速にしかも効率よく行える文書検索装置および方法を提供することを目的とする。
【００１４】
【課題を解決するための手段】
本発明は、異なる２つの検索方式のうちの１つである第１の検索方式の検索対象となる構造化データと、前記２つの検索方式のうちの他の１つである第２の検索方式の検索対象となる非構造化データとを含む複数の文書データを記憶した記憶手段から所望の文書データを検索するためのものであって、前記構造化データに対する第１の検索条件と、前記非構造化データに対する第２の検索条件を入力し、前記記憶手段に記憶された各文書データ中の前記構造化データを前記第１の検索方式を用いて検索することにより、前記第１の検索条件を満たす構造化データを含む文書データを求め、その結果得られた各文書データ中の前記非構造化データを前記第２の検索方式を用いて検索することにより、前記第２の検索条件を満たす非構造化データを含む文書データを求め、その結果得られた文書データを出力することを特徴とする。本発明によれば、構造化データと非構造化データとが混在する文書データに対し、構造化データに対しては、構造化データの検索に適した検索方式を用いて検索を行い、非構造化データに対しては、非構造化データの検索に適した検索方式を用いて検索を行い、しかも、先に、構造化データに対する検索を行って得られた結果を検索対象として非構造化データに対する検索を行うことにより、構造化データと非構造化データの混在する文書データの検索が高速にしかも効率よく行える。
【００１５】
【発明の実施の形態】
以下、本発明の実施形態について図面を参照して説明する。
【００１６】
図１は、本実施形態に係る文書検索システムの構成例を示したもので、入力部１、検索処理部２、第１の検索部３、第２の検索部４、文書記憶部５、出力部６から構成されている。
【００１７】
文書記憶部５には、複数の文書データが記憶されている。ここに記憶されている複数の文書データは、もともと、構造化されていない（所定のタグでマークアップされていない）非構造化文書データであったものを、構造化するためのＸＭＬタグで囲まれたデータを挿入して、あるいは、ＸＭＬタグで文書中の所定のデータを囲むことによりマークアップして作成された、構造化（ＸＭＬ化）されたデータと構造化されていないデータとが混在する文書データである。
【００１８】
構造化されたデータ（構造化データ）は、上記のように、構造化するための所定のＸＭＬタグと当該タグで囲まれたデータとからなる少なくとも１つの要素からなるデータである。各要素は、ＸＭＬタグ（要素名とも呼ぶ）に当該ＸＭＬタグに対応するデータを対応付けたもの、とも云える。
【００１９】
また、各文書中の構造化されていないデータ領域は、ここでは、コメント領域と呼び、コメント領域内のデータを非構造化データとも呼ぶ。各文書中でコメント領域（非構造化データ部分）と、構造化データ部分とを区別するために、ここでは、コメント領域は、「＜！−−」と「−−＞」という記号で囲むことにする。この記号で囲まれた領域は、通常のＸＭＬ処理系において（例えば、図１の第１の検索部３において）、コメントとして無視して処理される。
【００２０】
なお、コメント領域を定めるための特定のタグを予め定義し、そのような特定のタグで囲まれた領域は、上記コメント領域として取り扱うようにしてもよい。この場合は、後述する第１の検索部３の処理に、文書中の上記特定のタグで囲まれた領域をコメント領域として無視するような機能を新たに追加する必要がある。
【００２１】
図２はＸＭＬ化する前の構造化データを含まない非構造化文書データであるオリジナル文書の具体例を示したものである。内容は会議の開催通知である。図３は図２に示した文書データの一部のデータをＸＭＬ化したものである。既に述べたように、非構造化文書データ中の予め定義したＸＭＬタグに対応するデータを当該ＸＭＬタグに対応付ける（例えば、ここでは、ＸＭＬタグでマークアップする）形で、ＸＭＬ文書化されている。この例では、会議の名称、「日時」、「場所」を表すデータが、ＸＭＬタグでマークアップされている。すなわち、会議の名称は、＜ｎａｍｅ＞タグでマークアップされ、「日時」は＜ｄａｔｅ＞タグでマークアップされ、「場所」は＜ｐｌａｃｅ＞タグでマークアップされている。なお、ここでは、＜ｄａｔｅ＞タグで始まる要素では、＜ｙｅａｒ＞、＜ｍｏｎｔｈ＞、＜ｄａｙ＞、＜ｄａｙｏｆｔｈｅｙｅａｒ＞、＜ｈｏｕｒｆｒｏｍ＞、＜ｈｏｕｒｔｏ＞といったタグで、年、月、日、曜日、開始時間と終了時間を表すデータがそれぞれマークアップされて構成された要素を包含する構造を有している。また、＜ｐｌａｃｅ＞タグで始まる要素は、＜ａｄｄｒｅｓｓ＞タグで住所データがマークアップされて構成された要素を包含する構造を有している。
【００２２】
図３に示した文書データでは、元の文書データ（図２参照）の改行やインデントといった書式構造はそのまま保存されている。上記したようなタグで囲まれた部分以外は、コメント領域として、「＜！−−」と「−−＞」という記号で囲まれている。
【００２３】
文書記憶部５に記憶された各文書中の、＜ｎａｍｅ＞タグ、＜ｄａｔｅ＞タグ＜ｐｌａｃｅ＞タグといったタグでマークアップされた構造化データが、後述する第１の検索部３におけるＸＭＬ検索の検索対象となり、非構造化データの領域（コメント領域）は、後述する第２の検索部４におけるフルテキスト検索（あるいは全文検索）の検索対象となる。
【００２４】
入力部１は、グラフィカル・ユーザ・インターフェイス（以下ＧＵＩ）を備えており、所望の文書を検索するための検索条件を入力するためのものである。ここで、検索条件としては、文書データ中の構造化データに対する検索条件と、コメント領域内の非構造化データに対する検索条件とがある。ここでは、前者を第１の検索条件、後者を第２の検索条件と呼ぶ。
【００２５】
図４は、入力部１により所定のディスプレイに表示される検索条件の入力画面の一例を示したものである。図４に示した検索条件入力画面では、「会議名」「日時」「場所」の各項目に対応する空欄部分にユーザが所望の文字列（キーワード）を入力することにより、構造化データに対する検索条件（第１の検索条件）を指定することができるようになっている。「会議名」「日時」「場所」といった項目は、文書記憶部５に記憶されている文書データに含まれるタグ名に対応する。すなわち、「会議名」は＜ｎａｍｅ＞タグに対応し、「日時」は＜ｄａｔｅ＞タグに対応し、「場所」は＜ｐｌａｃｅ＞タグに対応する。
【００２６】
また、図４に示した検索条件入力画面では、「その他」の項目に対応する空欄部分に、ユーザが所望の文字列（キーワード）を入力することにより、非構造化データに対する検索条件（第２の検索条件）を指定することができるようになっている。
【００２７】
検索処理部２は、入力部１から入力された検索条件を、第１の検索部３でのＸＭＬ検索で用いる検索条件（第１の検索条件）と、第２の検索部４でのテキスト検索で用いる検索条件（第２の検索条件）とに分離する。第１の検索条件を第１の検索部３が解釈可能な形式の第１の検索条件文に変換して第１の検索部３に出力し、第２の検索条件を第２の検索部４が解釈可能な形式の第２の検索条件文に変換して第２の検索部４に出力する。そして、第１の検索部３、第２の検索部４のそれぞれから出力された検索結果を基に、ユーザに提示するための表示データを生成し、それを出力部６へ出力する。
【００２８】
出力部６は、検索処理部２で作成された表示データを、例えば所定のディスプレイに表示する。ＧＵＩを備えており、ユーザが検索結果をブラウズすることができるようになっている。
【００２９】
第１の検索部３は、検索処理部２から渡された上記第１の検索条件文を基に、公知のＸＭＬ検索方式を用いて、文書記憶部５に記憶された各文書データの構造化データを検索対象として、当該第１の検索条件文に含まれる上記第１の検索条件を満たす文書データを検索する。例えば、文書記憶部５に記憶された文書の中から、第１の検索条件として指定されたタグ名（要素名）の要素（構成要素）を含み、しかも、当該要素に当該第１の検索条件として指定された文字列を含むデータが対応付けられているような文書を検索する。検索結果として得られた文書は、検索処理部２に返される。
【００３０】
第２の検索部４は、検索処理部２から渡された上記第２の検索条件文を基に、検索処理部２から渡された、第１の検索部５で検索された各文書のコメント領域内を検索対象として、公知のフルテキスト検索方式を用い、コメント領域内の非構造化データが上記第２の検索条件文に含まれる上記第２の検索条件を満たす文書を検索する。例えば、第１の検索部５で検索された文書の中から、第２の検索条件として指定された文字列を含むコメント領域をもつような文書を検索する。検索結果として得られた文書は、検索処理部２に返される。
【００３１】
次に、図５、図６に示すフローチャートを参照して、検索処理部２の処理動作について説明する。
【００３２】
例えば、ユーザが、図４に示した検索条件入力画面上で、図７に示すように、「日時」という項目に対応する空欄に、「１２月１２日」という文字列を検索条件として指定したとする。
【００３３】
この検索条件は入力部１から検索処理部２に送信される（ステップＳ１）。検索処理部２は、ユーザの検索条件を調べて、第１の検索条件と、第２の検索条件とに分離する（ステップＳ２）。まず、第１の検索条件から第１の検索条件文を生成する。上記例の場合、「日時」はタグ名＜ｄａｔｅ＞に対応し、第１の検索条件のみが入力されている。従って、この第１の検索条件から、＜ｄａｔａ＞タグの要素をもち、当該要素が文字列「１２月１２日」を含むような文書を検索するための第１の検索条件文が生成される。そして、この第１の検索条件文を第１の検索部３に渡し、検索を要求し、検索結果を得る（ステップＳ３）。上記例の場合、第１の検索部３では、＜ｄａｔｅ＞タグの要素をもち、当該要素が文字列「１２月１２日」を含む文書（すなわち、＜ｄａｔｅ＞タグの要素をもち、＜ｄａｔｅ＞タグに対応付けられたデータに文字列「１２月１２日」が含まれている文書）が検索され、それが検索処理部２に渡される。
【００３４】
入力部１から入力した検索条件に第２の検索条件が含まれている場合には（ステップＳ４）、当該第２の検索条件から第２の検索条件文を生成し、第１の検索部３での検索により得られた文書と、当該第２の検索条件文を第２の検索部４に渡し、検索を要求する（図６のステップＳ６）。
【００３５】
なお、上記例の場合、ユーザにより入力された検索条件は、第１の検索条件だけであったので（ステップＳ４）、第１の検索部３で検索した結果得られた文書から、表示データを生成し、それを出力部６へ渡す（図５のステップＳ５）。出力部６では、当該表示データを所定のディスプレイに表示する。
【００３６】
図８は、出力部８により表示される、第１の検索部３で検索した結果得られた文書から生成された表示データの表示画面の一例を示したものである。ここでは、２件の文書が検索結果として得られ、この２件の文書中の、会議名、日時、場所のそれぞれに対応するとタグ名に対応付けられたデータの全部または１部を表示するための表示データが表示されている。この表示画面上、例えば、各文書の会議名の欄をマウスでクリックすることにより、検索された文書全体を表示する表示画面に移行することができる。なお、この表示処理は、従来のＨＴＭＬ文書やＸＭＬ文書の処理技術で実現可能である。
【００３７】
次に、ユーザが図４の検索条件入力画面上に、上記例（図７参照）とは異なる検索条件を入力した場合を例にとり図５〜図６に示すフローチャートを参照して検索処理部２の処理動作を説明する。
【００３８】
例えば、ユーザが、図４に示した検索条件入力画面上で、図９に示すように、「日時」という項目に対応する空欄に、「１２月１２日」という文字列を検索条件として指定するとともに、「その他」という項目に対応する空欄に、「田中」という文字列を検索条件として指定したとする。
【００３９】
この検索条件は入力部１から検索処理部２に送信される（ステップＳ１）。検索処理部２は、前述同様、ユーザの検索条件を調べて、第１の検索条件と、第２の検索条件とに分離する（ステップＳ２）。まず、第１の検索条件から第１の検索条件文を生成する。上記例の場合、前述同様、第１の検索条件から、＜ｄａｔａ＞タグの要素をもち、当該要素が文字列「１２月１２日」を含むような文書を検索するための第１の検索条件文が生成され、この第１の検索条件文を第１の検索部３に渡し、検索を要求し、検索結果を得る（ステップＳ３）。
【００４０】
また、上記例の場合、入力部１から入力した検索条件には、「その他」という項目に対応する空欄で、「田中」という文字列が指定されており、この「その他」という項目に対応する空欄に入力された文字列は、非構造化データに対する検索条件（第２の検索条件）として用いられる。すなわち、ユーザにより入力された検索条件から第２の検索条件として、文字列「田中」が得られる（ステップＳ４）。
【００４１】
この第２の検索条件から、第１の検索部５で検索された文書の中からコメント領域内の非構造化データが上記第２の検索条件として与えられた文字列「田中」を含む文書を検索するための第２の検索条件文が生成される。そして、そして、この第２の検索条件文を第２の検索部４に渡し、検索を要求し、検索結果を得る（図６のステップＳ６）。なお、上記「田中」という文字列のように、ＸＭＬ文書のマークアップに使われていない検索キーのことを、「フリーワード」と呼ぶこともある。
【００４２】
上記例の場合、第２の検索部４では、第１の検索部３で検索された各文書のコメント領域内に対してフルテキスト検索を実施し、「田中」という文字列を含むコメント領域をもつ文書を検索する。そして、検索された文書が検索処理部２へ渡される。
【００４３】
例えば、図３に示した文書のコメント領域には、「田中」という文字列が含まれているので、この文書が第１の検索部３で検索されると、当該文書は、第２の検索部４でも、検索結果として検出される。
【００４４】
第２の検索部４で、例えば図３に示した文書等が検索されたときには（ステップＳ７）、当該検索された各文書に対して所定の言語処理を行う（ステップＳ９）。例えば、検索された各文書中の上記第２の検索条件として指定された文字列（上記例の場合、「田中」）が出現する箇所について、言語処理を行う。ここでは、その文字列の出現個所の前後一定文字数部分がＫＷＩＣ（ＫｅｙＷｏｒｄＩｎＣｏｎｔｅｘｔ）として抽出する。
【００４５】
そして、この言語処理結果を利用して、検索結果の表示データを生成し、それを出力部６に出力する（ステップＳ１０）。出力部６では、当該表示データを所定のディスプレイに表示する。
【００４６】
図１０は、出力部８により表示される、第２の検索部４で検索した結果得られた文書から生成された表示データの表示画面の一例を示したものである。ここでは、１件の文書（例えば、図３に示した文書）が検索結果として得られ、この文書中の、「会議名」に対応するタグ名に対応付けられたデータの全部または一部と、検索条件として指定された「日時」に対応するタグ名に対応付けられたデータの全文または一部と、検索条件として指定された文字列「田中」の前後一定文字数部分を表示するための表示データが表示されている。この表示画面上、検索条件として指定された文字列「田中」の前後一定文字数部分を表示する箇所をマウスでクリックすることにより、検索された文書中の、文字列「田中」が出現する個所の表示に移行することができる。その表示画面の一例を図１１に示す。
【００４７】
図１１に示した表示画面では、文字列「田中」は、反転、強調表示など特殊表示がなされており、ユーザに視認しやすくなっている。この処理の実現には、公知公用技術を用いればよい。
【００４８】
このように、検索された文書中でユーザが与えたフリーワードの出現個所の前後の文脈を表示する機能と、検索された文書中の該当個所の表示にすばやく移行できる機能は重要である。なぜなら、フリーワードに基づく検索は、タグ名などを用いた構造化データに対する検索（ＸＭＬ検索）と違って意味的曖昧性を含んでおり、検索結果がユーザが所望するものとは必ずしも限らないからである。ユーザは検索された文書が自分の所望する文書であるかどうかを確認しなければならない。その判断を行う際に有益な情報は、指定したフリーワードの文書中での文脈情報である。上述したＫＷＩＣの提示機能と検索された文書中の該当個所の表示にすばやく移行できる機能はこの文脈情報を効果的に与えるものである。
【００４９】
なお、上述の検索例ではフリーワードとして文字列を１つ（例えば、上記例の場合「田中」）を与えた場合を示したが、複数の単語がある場合も上記同様に処理を行うことができる。その場合、検索条件入力画面上で「その他」という項目に対応する空欄に、例えば、「田中」、「山田」、「佐藤」のような複数の文字列を列挙して記述すればよい。例えば、１つの空欄内に、スペースや「、」や「／」などで１つ１つの文字列を分けて入力してもよいし、複数の空欄があれば、各空欄にそれぞれ１つの文字列を入力してもよい。
【００５０】
次に、ユーザが図４の検索条件入力画面上に、上記例（例えば図９参照）とは異なる検索条件を入力した場合を例にとり図５〜図６に示すフローチャートを参照して検索処理部２の処理動作を説明する。
【００５１】
例えば、ユーザが、図４に示した検索条件入力画面上で、図１２に示すように、「日時」という項目に対応する空欄に、「１２月１２日」という文字列を検索条件として指定するとともに、「その他」という項目に対応する空欄に、「主催」「ＥＩＡ」という文字列を検索条件として指定したとする。
【００５２】
この検索条件は入力部１から検索処理部２に送信される（ステップＳ１）。検索処理部２は、前述同様、ユーザの検索条件を調べて、第１の検索条件と、第２の検索条件とに分離する（ステップＳ２）。まず、第１の検索条件から第１の検索条件文を生成する。上記例の場合、前述同様、第１の検索条件から、＜ｄａｔａ＞タグの要素をもち、当該要素が文字列「１２月１２日」を含むような文書を検索するための第１の検索条件文が生成され、この第１の検索条件文を第１の検索部３に渡し、検索を要求し、検索結果を得る（ステップＳ３）。
【００５３】
また、上記例の場合、入力部１から入力した検索条件には、「その他」という項目に対応する空欄で、「主催」、「ＥＩＡ」という文字列が指定されており、この「その他」という項目に対応する空欄に入力された文字列は、非構造化データに対する検索条件（第２の検索条件）として用いられる。すなわち、ユーザにより入力された検索条件から第２の検索条件として「主催」、「ＥＩＡ」が得られる（ステップＳ４）。
【００５４】
ところで、図４の検索画面の「その他」行の２つのフィールドに単語を指定する場合、最初のフィールドに項目名，次のフィールドに項目値が来ることが想定されている。項目名は一般に「日時」「場所」といった一般名詞や普通名詞である。項目値は「１２月１２日」、「金曜日」、「東京」といった数値や固有名詞である場合が多いが、一般名詞である場合もある。この事例における検索単語「主催」は一般名詞であり、「ＥＩＡ」は固有名詞である。
【００５５】
フリーワード検索の常として検索キーとなる検索単語中に一般名詞や普通名詞がある場合、同義語展開が実行される。この場合、「主催」という文字列には「共催」、「開催者」のように同義の検索単語が追加されてから、第２の検索が実行される。
【００５６】
一方、「ＥＩＡ」のような固有名詞については、異表記展開と呼ばれる検索単語の追加処理が行われる。具体的には、長音記号「ー」の有無，検索単語中の小文字のカタカナ「ッ」を「ツ」に置換した単語，略記表現の元の単語等が追加される。この場合、「ＥＩＡ」には「ＥｌｅｃｔｒｉｃＩｎｆｏｒｍａｔｉｏｎＡｓｓｏｃｉａｔｉｏｎ」や「電子情報協会」のように異表記語が追加されたあと、第２の検索が実行される。
【００５７】
通常この同義語展開や異表記展開には汎用および特定の単語辞書が利用される。このような検索単語の同義語展開や異表記展開は情報検索の公知の技術で実現できる。
【００５８】
以下の説明ではこのように展開された同義語や異表記語についての説明は省くが、これらの技術の利用を除外するものではないことを補足しておく。
【００５９】
第２の検索条件から、第１の検索部５で検索された文書の中からコメント領域内の非構造化データが上記第２の検索条件として与えられた文字列「主催」、「ＥＩＡ」を含む文書を検索するための第２の検索条件文が生成される。そして、この第２の検索条件文を第２の検索部４に渡し、検索を要求し、検索結果を得る（図６のステップＳ６）。
【００６０】
上記例の場合、第２の検索部４では、第１の検索部３で検索された各文書のコメント領域内に対してフルテキスト検索を実施し、「主催」、「ＥＩＡ」という文字列を含むコメント領域をもつ文書を検索する。そして、検索された文書が検索処理部２へ渡される。
【００６１】
上記例の場合、検索キーとして２つの文字列（「主催」、「ＥＩＡ」）があるが、１つのコメント領域に両方の文字列が出現している文書が優先的に検索される。なぜなら、検索キーとして指定された複数の文字列（ここでは、２つの文字列）は、意味的に関連するものであり、これら複数の文字列は文書中でも互いに近い位置に出現していることが多いからである。
【００６２】
例えば、図３に示した文書は、そのコメント領域に「主催」、「ＥＩＡ」の両文字列を含むので、検索結果として検出される（ステップＳ７）。
【００６３】
次に、検索された各文書中のコメント領域について言語処理が行われる（ステップＳ９）。
【００６４】
２つのフリーワード「主催」、「ＥＩＡ」は、情報論的な観点では「主催」がキーで、「ＥＩＡ」が値、つまり「主催」が項目名で「ＥＩＡ」がその内容という意味構造を持っている。このように与えられた複数のフリーワードに意味構造がある場合は、ステップＳ９の言語処理において、検索された各文書中のコメント領域に対して、公知・公用の情報抽出技術を適用する。今の場合、例えば、ｐｌａｉｎ２（ｐｌａｉｎ２ｆａｎ［ＵＲＬ］ｈｔｔｐ：／／ｓｈｉｋａ．ａｉｓｔ−ｎａｒａ．ａｃ．ｊｐ／ｐｒｏｄｕｃｔｓ／ｐｌａｉｎ２／ｐｌａｉｎ２−ｊ．ｈｔｍｌ）等を適用することにより、項目名：「主催」、項目内容：「社団法人電子情報協会（ＥＩＡ）」という対応関係を表した情報を抽出することができる。ｐｌａｉｎ２は、文書の改行情報やインデント情報を利用して文書の構造解析を行うものである。すなわち、テキストから箇条書きや表の構造を抽出してテキストを構造化するものである。
【００６５】
図３に示した文書は、元の文書（図２参照）の改行情報やインデント情報を含んでいるので、ｐｌａｉｎ２を適用することができる。また、このような文書の書式情報がなくても、ＭＵＣ（ＰｒｏｃｅｅｄｉｎｇｓｏｆｔｈｅＳｉｘｔｈＭｅｓｓａｇｅＵｎｄｅｒｓｔａｎｄｉｎｇＣｏｎｆｅｒｅｎｃｅ（ＭＵＣ−６），ＭｏｒｇａｎＫａｕｆｍａｎ）等で開示されている技術を使えば、同様の情報抽出を行うことが可能である。
【００６６】
文書記憶部５に記憶されている文書が、特定のワープロ文書であったり、ＲＴＦ（ｒｉｃｈｔｅｘｔｆｏｒｍａｔ）やＨＴＭＬのような箇条書きやインデントといった文書のレイアウト情報を含んだ構造化文書である場合も同様である。このような場合、文書のレイアウト情報を直接得るため、上記ＭＵＣのような技術を用いて、上述した「項目名」−「項目内容」といた対応関係を表した情報を取得できる。
【００６７】
第２の検索部４では、フリーワードとして指定された複数の文字列を含むコメント領域を持つ各文書のうち、コメント領域内の、フリーワードとして指定された複数の文字列間の対応関係が、検索キーとしての複数のフリーワードの間の意味構造に合致する文書を、優先して検索する。
【００６８】
さて、ステップＳ９では、検索された各文書中の上記第２の検索条件として指定された文字列、すなわち、フリーワード（上記例の場合、「主催」、「ＥＩＡ」）が出現する箇所について、言語処理を行う。ここでは、その文字列の出現個所の前後一定文字数部分がＫＷＩＣ（ＫｅｙＷｏｒｄＩｎＣｏｎｔｅｘｔ）として抽出する。
【００６９】
そして、この言語処理結果を利用して、検索結果の表示データを生成し、それを出力部６に出力する（ステップＳ１０）。出力部６では、当該表示データを所定のディスプレイに表示する。
【００７０】
図１３は、出力部８により表示される、第２の検索部４で検索した結果得られた文書から生成された表示データの表示画面の一例を示したものである。ここでは、１件の文書（例えば、図３に示した文書）が検索結果として得られ、この文書中の、「会議名」に対応するタグ名に対応付けられたデータの全部または一部と、検索条件として指定された「日時」に対応するタグ名に対応付けられたデータの全文または一部と、検索条件として指定された文字列「主催」、「ＥＩＡ」の前後一定文字数部分を表示するための表示データが表示されている。なお、第２の検索部４で、「主催」が項目名で、「ＥＩＡ」が項目内容といった対応関係が得られているとき、この対応関係にある「主催」と「ＥＩＡ」を含むコメント領域をもつ文書が検索される。従って、この対応関係に基づき、図１３に示した表示例では、検索された文書が、項目名「主催」とその項目内容に「ＥＩＡ」が含まれるという関係があることを表現すべく表示されている。すなわち、図１３では、文字列「主催」と、文字列「ＥＩＡ」を含むデータが、項目名と項目内容という対応関係にあることが明らかとなるように、項目名に対応する「主催」という文字列を、「名称」「日時」といった（タグ名に対応する項目名と同列に表示し、項目内容として、文字列「ＥＩＡ」の前後一定文字数部分を表示する箇所を表示している。この項目内容を表示する箇所は、図１０と同様、ＫＷＩＣとして表示されている。図１３の表示画面上、ＫＷＩＣ表示されている部分（検索条件として指定された文字列「ＥＩＡ」の前後一定文字数部分を表示する箇所）をマウスでクリックすることにより、検索された文書中の、文字列「ＥＩＡ」が出現する個所の表示に移行することができる。その表示画面の一例を図１４に示す。
【００７１】
図１４に示した表示画面では、文字列「主催」と、これを項目名とする項目内容に対応する文字列部分（文字列「ＥＩＡ」を含む文字列）は、反転、強調表示など特殊表示がなされており、ユーザに視認しやすくなっている。この処理の実現には、公知公用技術を用いればよい。
【００７２】
なお、上記実施形態では、文書記憶部５に記憶されている各文書中のコメント領域のマークアップのための記号として「＜！−−」と「−−＞」を用いて説明を行ったが、この場合に限らず、例えば、コメント領域である旨を表すために定義した（コメントの意味を持たせた）任意のＸＭＬのタグ（マークアップ記号）を用いてもかまわないことはいうまでもない。
【００７３】
また、第１の検索部３におけるＸＭＬ検索は、上記実施形態の場合、タグ名と当該タグ名の要素に含まれる文字列を検索条件として指定して、指定されたタグ名の要素をもち、その要素が指定された文字列を包含するような構造化データ（をもつ文書データ）を検索するものであるが、この場合に限らない。例えば、検索条件として１つまたは複数のタグ名のみが指定されたときには、当該指定されたタグ名の要素をもつ構造化データ（をもつ文書データ）を検索するようにしてもよい。また、例えば、あるタグ名の要素を包含する要素とか、ある要素に包含される要素といったような文書構造や、文書構造といずれかの要素に含まれる文字列とを検索条件として指定された場合の検索も、やはり公知のＸＭＬ検索技術を用いれば容易に実現可能である。上記実施形態では、第１の検索部３では、構造化データに対し、公知のＸＭＬ検索技術を用いて検索を行う場合の一例を示したにすぎず、公知のＸＭＬ検索技術を用いて実現可能な検索であれば、どのような検索をおこなってもよい。その場合に、入力すべき検索条件に応じて、検索条件を入力するための入力画面等を変更すればよい。
【００７４】
以上説明したように、上記実施形態によれば、異なる２つの検索方式のうちの１つである第１の検索方式（例えば、ＸＭＬ検索方式）の検索対象となる所定の要素名と当該要素名に対応付けられたデータとからなる少なくとも１つの要素から構成される（構造化データ）と、前記２つの検索方式のうちの他の１つである第２の検索方式（たとえば、フルテキスト検索方式）の検索対象となるデータ（コメント領域内の非構造化データ）とを含む複数の文書データを文書記憶部５に記憶する。そして、構造化データに対する第１の検索条件と、非構造化データに対する第２の検索条件が入力されると、第１の検索部３では、文書記憶部５に記憶された各文書データ中の構造化データをＸＭＬ検索方式を用いて検索することにより、上記第１の検索条件を満たす要素を含む文書データを求め、第２の検索部４では、第１の検索部３での検索の結果得られた各文書データ中の非構造化データをフルテキスト検索方式を用いて検索することにより、上記第２の検索条件を満たす非構造化データをもつ文書データを求める。
【００７５】
すなわち、上記実施形態では、構造化データと非構造化データとが混在する文書データに対し、構造化データに対しては、構造化データの検索に適した検索方式を用いて検索を行い、非構造化データに対しては、非構造化データの検索に適した検索方式を用いて検索を行う。一般的に、構造化データに対する検索時間は、非構造化データに対する検索時間よりも短時間である。従って、先に、構造化データに対する検索を行って得られた結果を検索対象として非構造化データに対する検索を行う。
【００７６】
このように、上記実施形態によれば、処理時間の比較的長い非構造化データに対する検索を行う前に、構造化データに対する検索を行って、検索対象の文書を絞り込み、さらに、各文書中の検索位置を（コメント領域に）限定した上で、非構造化データに対する検索を行うことにより、構造化データと非構造化データの混在する文書データの検索が高速にしかも効率よく行える。
【００７７】
なお、上記実施形態では、構造化データとして、ＸＭＬタグでマークアップして構造化したデータを例にとり説明したが、この場合に限らず、例えばＲＤＢ内のデータのように、所定の項目（上記実施形態における要素名に対応する）に当該項目に対応するデータを対応付けて構造化したデータであっても、上記同様の効果が得られる。
【００７８】
本発明の実施の形態に記載した本発明の手法は（特に、図５、６のフローチャートに示した検索処理部２の処理動作、さらに、検索処理動作以外にも図１の各部の処理動作も含めて）、コンピュータに実行させることのできるプログラムとして、磁気ディスク（フレキシブルディスク、ハードディスクなど）、光ディスク（ＣＤ−ＲＯＭ、ＤＶＤなど）、半導体メモリなどの記録媒体に格納して頒布することもできる。
【００７９】
なお、本発明は、上記実施形態に限定されるものではなく、実施段階ではその要旨を逸脱しない範囲で種々に変形することが可能である。さらに、上記実施形態には種々の段階の発明は含まれており、開示される複数の構成要件における適宜な組み合わせにより、種々の発明が抽出され得る。例えば、実施形態に示される全構成要件から幾つかの構成要件が削除されても、発明が解決しようとする課題の欄で述べた課題（の少なくとも１つ）が解決でき、発明の効果の欄で述べられている効果（のなくとも１つ）が得られる場合には、この構成要件が削除された構成が発明として抽出され得る。
【００８０】
【発明の効果】
以上説明したように、本発明によれば、構造化データと非構造化データの混在する文書データの検索が高速にしかも効率よく行える。
【図面の簡単な説明】
【図１】本発明の実施形態に係る文書検索システムの構成例を示した図。
【図２】構造化データを含まないオリジナルの文書の具体例を示した図。
【図３】文書記憶部に記憶される、構造化データと非構造化データを含む文書の具体例であって、図２のオリジナルの文書から生成された文書を示している。
【図４】検索条件入力画面の一例を示した図。
【図５】図１の検索処理部の処理動作を説明するためのフローチャート。
【図６】図１の検索処理部の処理動作を説明するためのフローチャート。
【図７】検索条件入力画面に入力された検索条件の一例を示した図。
【図８】図７に示した検索条件に基づく検索結果の表示例を示した図。
【図９】検索条件入力画面に入力された検索条件の他の例を示した図。
【図１０】図９に示した検索条件に基づく検索結果の表示例を示した図。
【図１１】検索結果の他の表示例を示した図。
【図１２】検索条件入力画面に入力された検索条件のさらに他の例を示した図。
【図１３】図１２に示した検索条件に基づく検索結果の表示例を示した図。
【図１４】検索結果の他の表示例を示した図。
【符号の説明】
１…入力部、２…検索処理部、３…第１の検索部、４…第２の検索部、５…文書記憶部、６…出力部
[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a document search device and method.
[0002]
[Prior art]
As technologies for building services and applications linked to databases of customer information, product information, and the like on the Internet and intranets, database technologies based on markup languages such as XML (Extensible Markup Language; see Media Fusion Co., Ltd. XML Lab, "Web Application Development Using ML Database", Softbank Publishing Co., Ltd., March 2001) have become widespread in recent years, in addition to conventional relational database (hereinafter, RDB) technology.
[0003]
Known search techniques for XML data include XQL (XML Query Language: [URL] http://www.w3.org/TandS/QL/QL98/pp/xql.html). For searching document information, on the other hand, text search, and full-text search in particular, has long been well known.
[0004]
XML technology is generally said to be suited to processing data (semi-structured data) that lies between the structured data handled by conventional RDB technology and the unstructured data handled by conventional text search technology.
[0005]
With the spread of XML, there is a growing demand to convert data conventionally handled by text search into XML data and search it by XML search. The cost of completely converting existing text documents to XML is enormous, so for the time being only limited items of information in a document are likely to be converted. For management purposes, the conversion is in many cases performed by adding XML tags to the document itself. The parts of the document that are not targets of XML conversion are enclosed in appropriate tags and treated as comment parts. In this way, the original data and the XML data can be managed together in a single file, and it is obvious which parts of the original data have been converted to XML and which have not. Because all of the information in the original data is retained in the file, future XML conversion work only requires adding tags, which is convenient.
[0006]
When an XML search is performed on data in which only part of the information has been converted to XML in this way, the data portions not yet converted are not XML data and therefore, naturally, cannot be searched properly.
[0007]
Some techniques use full-text search when searching structured documents such as XML documents (for example, see Patent Document 1). In this technique, the entire XML document, including comment parts such as those described above, is processed in advance to create a full-text search index. Because comment parts that are not needed for a normal XML search are also included in the index, the index becomes huge and search speed decreases.
[0008]
There is also a technique that converts a document to XML, including its layout information, and manages it in a unified manner (for example, see Patent Document 2).
[0009]
However, none of the above techniques can quickly and efficiently search document data in which data converted to XML and data portions not yet converted to XML are mixed.
[0010]
[Patent Document 1]
JP 2001-167087 A
[0011]
[Patent Document 2]
JP 2000-99543 A
[0012]
[Problems to be solved by the invention]
As described above, there has conventionally been a problem that document data in which structured data converted to XML and unstructured data not converted to XML are mixed cannot be searched quickly and efficiently.
[0013]
Accordingly, in view of the above problem, an object of the present invention is to provide a document search device and method that can quickly and efficiently search document data in which structured data and unstructured data are mixed.
[0014]
[Means for Solving the Problems]
The present invention is for retrieving desired document data from storage means that stores a plurality of document data, each including structured data to be searched by a first search method, which is one of two different search methods, and unstructured data to be searched by a second search method, which is the other of the two search methods. A first search condition for the structured data and a second search condition for the unstructured data are input. The structured data in each document data stored in the storage means is searched using the first search method to obtain document data including structured data that satisfies the first search condition. The unstructured data in each document data thus obtained is then searched using the second search method to obtain document data including unstructured data that satisfies the second search condition, and the resulting document data is output. According to the present invention, for document data in which structured data and unstructured data are mixed, the structured data is searched using a search method suited to structured data, and the unstructured data is searched using a search method suited to unstructured data. Moreover, the search of the unstructured data is performed on the results obtained by first searching the structured data, so that document data in which structured data and unstructured data are mixed can be searched quickly and efficiently.
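The two-stage procedure described above can be illustrated with a minimal sketch in Python. It is not taken from the patent: the Document class, the two_stage_search function, and the sample records are hypothetical names chosen for illustration, and simple predicates stand in for the first and second search methods.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Document:
    doc_id: str
    structured: Dict[str, str]  # element name -> data (stand-in for tagged data)
    unstructured: str           # stand-in for the not-yet-structured text


def two_stage_search(
    docs: List[Document],
    first_condition: Callable[[Dict[str, str]], bool],
    second_condition: Optional[Callable[[str], bool]] = None,
) -> List[Document]:
    # Stage 1: search only the structured data to narrow the candidates.
    candidates = [d for d in docs if first_condition(d.structured)]
    if second_condition is None:
        return candidates
    # Stage 2: search the unstructured data of the stage-1 results only.
    return [d for d in candidates if second_condition(d.unstructured)]


# Hypothetical sample data.
docs = [
    Document("a", {"date": "December 12"}, "Contact: Tanaka"),
    Document("b", {"date": "December 12"}, "Contact: Suzuki"),
    Document("c", {"date": "January 5"}, "Contact: Tanaka"),
]

hits = two_stage_search(
    docs,
    lambda s: "December 12" in s.get("date", ""),
    lambda text: "Tanaka" in text,
)
print([d.doc_id for d in hits])
```

In this sketch, document "c" is never examined by the second predicate, which is the point of searching the structured data first.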
[0015]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0016]
FIG. 1 shows an example of the configuration of a document search system according to the present embodiment, which is composed of an input unit 1, a search processing unit 2, a first search unit 3, a second search unit 4, a document storage unit 5, and an output unit 6.
[0017]
The document storage unit 5 stores a plurality of document data. The document data stored here were originally unstructured document data (not marked up with predetermined tags). They were created by inserting data enclosed in XML tags for structuring, or by marking up predetermined data in the document by enclosing it in XML tags, and are therefore document data in which structured (XML) data and unstructured data are mixed.
[0018]
As described above, structured data is data consisting of at least one element, each element consisting of a predetermined XML tag for structuring and the data enclosed by that tag. Each element can also be described as an XML tag (also called an element name) associated with the data corresponding to that tag.
[0019]
An unstructured data region in each document is here called a comment area, and the data in a comment area is also called unstructured data. To distinguish the comment areas (unstructured data parts) from the structured data parts in each document, a comment area is enclosed here by the symbols "<!--" and "-->". A region enclosed by these symbols is ignored as a comment by a normal XML processing system (for example, by the first search unit 3 in FIG. 1).
[0020]
Note that a specific tag for delimiting a comment area may be defined in advance, and a region enclosed by that tag may be handled as a comment area. In that case, a function for ignoring regions enclosed by the specific tag as comment areas must be added to the processing of the first search unit 3 described later.
[0021]
FIG. 2 shows a specific example of an original document, that is, unstructured document data containing no structured data, before XML conversion. Its content is a meeting announcement. FIG. 3 shows the document data of FIG. 2 with part of its data converted to XML. As already described, the document is converted to XML by associating data in the unstructured document data with the predefined XML tags to which it corresponds (here, for example, by marking it up with those tags). In this example, the data representing the meeting name, "date and time", and "location" are marked up with XML tags: the meeting name with a <name> tag, the "date and time" with a <date> tag, and the "location" with a <place> tag. Here, the element beginning with the <date> tag contains elements in which the data representing the year, month, day, day of the week, start time, and end time are marked up with the tags <year>, <month>, <day>, <dayoftheyear>, <hourfrom>, and <hourto>, respectively. Likewise, the element beginning with the <place> tag contains an element in which the address data is marked up with an <address> tag.
[0022]
In the document data shown in FIG. 3, the formatting of the original document data (see FIG. 2), such as line breaks and indentation, is preserved as it is. Everything other than the portions enclosed by the tags described above is enclosed, as comment areas, by the symbols "<!--" and "-->".
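As an illustration only, a document of the kind described for FIG. 3 might look like the sample below. The figure itself is not reproduced in this text, so the <meeting> root element and all of the sample content are assumptions; only the <name>, <date>, <month>, <day>, <place>, and <address> tag names and the comment delimiters come from the description. The sketch uses Python's standard xml.etree.ElementTree module and asks the tree builder to keep comments, so that the structured elements and the comment areas can be separated.

```python
import xml.etree.ElementTree as ET

# Hypothetical mixed document: tagged elements plus comment areas.
SAMPLE = """<meeting>
<!-- Notice of meeting -->
<name>Technical Committee Meeting</name>
<date><month>December</month> <day>12</day></date>
<place><address>Tokyo</address></place>
<!-- Organizer: EIA
     Contact: Tanaka -->
</meeting>"""


def split_document(xml_text):
    """Return (structured elements, comment areas) of one document."""
    # insert_comments=True (Python 3.8+) keeps comments in the tree;
    # the default parser would silently drop them.
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    root = ET.fromstring(xml_text, parser=parser)
    structured = {}
    comments = []
    for node in root:
        if node.tag is ET.Comment:
            comments.append(node.text.strip())
        else:
            # Concatenate the text of the element and its child elements.
            structured[node.tag] = "".join(node.itertext()).strip()
    return structured, comments


structured, comments = split_document(SAMPLE)
print(structured)
print(comments)
```

Line breaks and indentation inside a comment area are left untouched, matching the statement that the original formatting is preserved.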
[0023]
In each document stored in the document storage unit 5, the structured data marked up with tags such as the <name>, <date>, and <place> tags is the search target of the XML search in the first search unit 3 described later, and the unstructured data regions (comment areas) are the search target of the full-text search in the second search unit 4 described later.
[0024]
The input unit 1 has a graphical user interface (hereinafter, GUI) for inputting search conditions for searching for a desired document. Here, the search conditions include a search condition for structured data in the document data and a search condition for unstructured data in the comment area. Here, the former is called a first search condition, and the latter is called a second search condition.
[0025]
FIG. 4 shows an example of the search condition input screen displayed on a predetermined display by the input unit 1. On this screen, the user can specify search conditions for the structured data (first search conditions) by entering desired character strings (keywords) in the blank fields corresponding to the items "meeting name", "date and time", and "location". These items correspond to tag names contained in the document data stored in the document storage unit 5: "meeting name" corresponds to the <name> tag, "date and time" to the <date> tag, and "location" to the <place> tag.
[0026]
On the search condition input screen shown in FIG. 4, the user can also specify a search condition for the unstructured data (second search condition) by entering a desired character string (keyword) in the blank field corresponding to the item "others".
[0027]
The search processing unit 2 separates the search conditions input from the input unit 1 into the search condition used in the XML search by the first search unit 3 (first search condition) and the search condition used in the text search by the second search unit 4 (second search condition). It converts the first search condition into a first search condition sentence in a format that the first search unit 3 can interpret and outputs it to the first search unit 3, and converts the second search condition into a second search condition sentence in a format that the second search unit 4 can interpret and outputs it to the second search unit 4. Then, based on the search results output from the first search unit 3 and the second search unit 4, it generates display data to be presented to the user and outputs it to the output unit 6.
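The separation of the input into first and second search conditions might be sketched as follows. This is an assumption-laden illustration, not the patent's implementation: the form is modeled as a dictionary keyed by the item labels of FIG. 4, FIELD_TO_TAG and split_conditions are hypothetical names, and splitting the "others" field on whitespace is an assumed convention.

```python
# Item labels of the input screen mapped to tag names, as stated in the text.
FIELD_TO_TAG = {
    "meeting name": "name",
    "date and time": "date",
    "location": "place",
}


def split_conditions(form):
    """Split form input into (first, second) search conditions.

    first:  dict of tag name -> character string (structured data)
    second: list of free words (unstructured data)
    """
    first = {}
    second = []
    for field, value in form.items():
        value = value.strip()
        if not value:
            continue  # blank fields contribute no condition
        if field in FIELD_TO_TAG:
            first[FIELD_TO_TAG[field]] = value
        elif field == "others":
            second.extend(value.split())
    return first, second


form = {
    "meeting name": "",
    "date and time": "December 12",
    "location": "",
    "others": "Tanaka",
}
first, second = split_conditions(form)
print(first, second)
```

An empty second list corresponds to the case where only a first search condition was entered.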
[0028]
The output unit 6 displays the display data created by the search processing unit 2 on, for example, a predetermined display. A GUI is provided so that the user can browse the search results.
[0029]
Based on the first search condition sentence passed from the search processing unit 2, the first search unit 3 uses a known XML search method to search the structured data of each document data stored in the document storage unit 5 for document data that satisfies the first search condition contained in the first search condition sentence. For example, from among the documents stored in the document storage unit 5, it retrieves documents that contain an element with the tag name (element name) specified as the first search condition and in which data containing the character string specified as the first search condition is associated with that element. The documents obtained as the search result are returned to the search processing unit 2.
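The patent leaves the XML search method open ("a known XML search method"). As one possible stand-in, the sketch below uses Python's standard xml.etree.ElementTree module to find documents that have an element with a given tag name whose data contains a given character string. The xml_search function, the DOCS dictionary, and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def xml_search(documents, tag, text):
    """Return ids of documents with a <tag> element whose data contains text."""
    hits = []
    for doc_id, xml_text in documents.items():
        # The default parser discards comments, so comment areas are
        # ignored here, as the description requires of the first search.
        root = ET.fromstring(xml_text)
        for elem in root.iter(tag):
            if text in "".join(elem.itertext()):
                hits.append(doc_id)
                break
    return hits


# Hypothetical stored documents.
DOCS = {
    "doc1": "<meeting><name>Committee</name><date>December 12</date>"
            "<!-- Contact: Tanaka --></meeting>",
    "doc2": "<meeting><name>Workshop</name><date>January 5</date>"
            "<!-- December 12 is the deadline --></meeting>",
}

print(xml_search(DOCS, "date", "December 12"))
```

Here "doc2" is not returned, because its only occurrence of "December 12" lies inside a comment area rather than in a <date> element.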
[0030]
Based on the second search condition sentence passed from the search processing unit 2, the second search unit 4 searches the comment areas of the documents retrieved by the first search unit 3, which are also passed from the search processing unit 2, using a known full-text search method, and retrieves documents in which the unstructured data in a comment area satisfies the second search condition contained in the second search condition sentence. For example, from among the documents retrieved by the first search unit 3, it retrieves documents having a comment area that contains the character string specified as the second search condition. The documents obtained as the search result are returned to the search processing unit 2.
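A minimal sketch of this second stage follows, again as an illustration rather than the patent's method. A plain substring test stands in for the "known full-text search method", a regular expression extracts the comment areas, and the names COMMENT_RE, comment_search, and CANDIDATES are hypothetical. Requiring every keyword to appear in the same comment area is a simplifying assumption.

```python
import re

# Matches one comment area; DOTALL lets it span line breaks.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def comment_search(candidates, keywords):
    """Return ids of candidate documents having a comment area that
    contains every keyword. Only comment areas are examined."""
    hits = []
    for doc_id, xml_text in candidates.items():
        regions = COMMENT_RE.findall(xml_text)
        if any(all(k in region for k in keywords) for region in regions):
            hits.append(doc_id)
    return hits


# Hypothetical documents already retrieved by the first search.
CANDIDATES = {
    "doc1": "<meeting><date>December 12</date>"
            "<!-- Contact: Tanaka --></meeting>",
    "doc3": "<meeting><name>Tanaka seminar</name><date>December 12</date>"
            "<!-- Contact: Suzuki --></meeting>",
}

print(comment_search(CANDIDATES, ["Tanaka"]))
```

Because only the comment areas of already-retrieved documents are scanned, "doc3" is not returned even though "Tanaka" appears in its <name> element.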
[0031]
Next, the processing operation of the search processing unit 2 will be described with reference to the flowcharts shown in FIGS. 5 and 6.
[0032]
For example, suppose that on the search condition input screen shown in FIG. 4, the user specifies the character string "December 12" as a search condition in the blank field corresponding to the item "date and time", as shown in FIG. 7.
[0033]
This search condition is transmitted from the input unit 1 to the search processing unit 2 (step S1). The search processing unit 2 examines the user's search condition and separates it into the first search condition and the second search condition (step S2). First, a first search condition sentence is generated from the first search condition. In the above example, "date and time" corresponds to the tag name <date>, and only a first search condition has been input. Accordingly, a first search condition sentence is generated from this first search condition for retrieving documents that have a <date> element containing the character string "December 12". This first search condition sentence is passed to the first search unit 3, a search is requested, and a search result is obtained (step S3). In the above example, the first search unit 3 retrieves the documents that have a <date> element containing the character string "December 12" (that is, documents that have a <date> element and in which the data associated with the <date> tag contains the character string "December 12") and passes them to the search processing unit 2.
[0034]
If the search condition input from the input unit 1 includes the second search condition (step S4), a second search condition sentence is generated from the second search condition, the documents obtained by the search by the first search unit 3 and the second search condition sentence are passed to the second search unit 4, and a search is requested (step S6 in FIG. 6).
[0035]
In the above example, since the search condition input by the user consists of only the first search condition (step S4), display data is generated from the documents obtained as a result of the search by the first search unit 3 and is passed to the output unit 6 (step S5 in FIG. 5). The output unit 6 displays the display data on a predetermined display.
[0036]
FIG. 8 shows an example of a display screen, displayed by the output unit 6, of display data generated from the documents obtained as a result of the search by the first search unit 3. Here, two documents are obtained as the search result, and for each of the two documents, all or part of the data associated with the tag names corresponding to the meeting name, the date and time, and the place is displayed. On this display screen, for example, clicking the meeting name column of a document with a mouse shifts to a display screen showing the entire retrieved document. Note that this display processing can be realized by conventional HTML document or XML document processing techniques.
[0037]
Next, the processing operation of the search processing unit 2 will be described with reference to the flowcharts shown in FIGS. 5 and 6, taking as an example a case where the user inputs, on the search condition input screen of FIG. 4, a search condition different from that of the above example (see FIG. 7).
[0038]
For example, assume that on the search condition input screen shown in FIG. 4, the user specifies the character string "December 12" as a search condition in the blank corresponding to the item "Date" and, at the same time, specifies the character string "Tanaka" as a search condition in the blank corresponding to the item "others", as shown in FIG. 9.
[0039]
The search condition is transmitted from the input unit 1 to the search processing unit 2 (step S1). As described above, the search processing unit 2 checks the search condition input by the user and separates it into the first search condition and the second search condition (step S2). First, a first search condition sentence is generated from the first search condition. In the above example, as described above, a first search condition sentence is generated from the first search condition for searching for a document that has an element with the <date> tag in which that element includes the character string “December 12”. The first search condition sentence is passed to the first search unit 3, a search is requested, and a search result is obtained (step S3).
[0040]
In the above example, the search condition input from the input unit 1 specifies the character string "Tanaka" in the blank corresponding to the item "others", and a character string entered in the blank corresponding to the item "others" is used as a search condition for unstructured data (second search condition). That is, the character string "Tanaka" is obtained as the second search condition from the search condition input by the user (step S4).
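The separation performed in steps S2 and S4 can be sketched as follows. The mapping `ITEM_TO_TAG` from form items to tag names is hypothetical (the specification only states that the item "Date" corresponds to the tag name <date>), as is the function name `split_conditions`; any item not in the mapping, such as "others", is treated as supplying a free word.

```python
# Hypothetical mapping from form item labels to tag names.
# Only "Date" -> "date" is stated in the text; the rest is assumed.
ITEM_TO_TAG = {"Meeting name": "name", "Date": "date", "Place": "place"}


def split_conditions(form: dict[str, str]) -> tuple[dict[str, str], list[str]]:
    """Separate form input into structured and unstructured conditions.

    Returns (first, second): `first` maps tag names to character strings,
    `second` is the list of free words for the comment area.
    """
    first: dict[str, str] = {}
    second: list[str] = []
    for item, value in form.items():
        value = value.strip()
        if not value:
            continue  # blank fields contribute no condition
        if item in ITEM_TO_TAG:
            first[ITEM_TO_TAG[item]] = value
        else:
            second.append(value)  # e.g. the "others" item
    return first, second


first, second = split_conditions({"Date": "December 12", "others": "Tanaka"})
print(first, second)
```

An empty `second` list corresponds to the branch in which only the first search is performed.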
[0041]
From the second search condition, a second search condition sentence is generated for searching, from among the documents retrieved by the first search unit 3, for documents in which the unstructured data in the comment area includes the character string “Tanaka” given as the second search condition. The second search condition sentence is then passed to the second search unit 4, a search is requested, and a search result is obtained (step S6 in FIG. 6). Note that a search key that is not used in the markup of an XML document, such as the character string “Tanaka”, may be referred to as a “free word”.
[0042]
In the above example, the second search unit 4 performs a full-text search on the comment area of each document retrieved by the first search unit 3 and searches for documents having a comment area that includes the character string “Tanaka”. The retrieved documents are then passed to the search processing unit 2.
[0043]
For example, since the comment area of the document shown in FIG. 3 includes the character string “Tanaka”, if this document has been retrieved by the first search unit 3, the second search unit 4 also detects it as a search result.
[0044]
When, for example, the document shown in FIG. 3 is retrieved by the second search unit 4 (step S7), predetermined language processing is performed on each retrieved document (step S9). For example, language processing is performed on the portions of each retrieved document where the character string specified as the second search condition (“Tanaka” in the above example) appears. Here, a fixed number of characters before and after each occurrence of the character string are extracted as KWIC (Key Word In Context).
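The KWIC extraction of step S9 can be sketched as below. The window size of the "fixed number of characters" is not given in the specification, so the `width` parameter and the function name `kwic` are assumptions.

```python
def kwic(text: str, keyword: str, width: int = 10) -> list[str]:
    """Return each occurrence of `keyword` with up to `width` characters
    of context on either side (Key Word In Context)."""
    snippets = []
    start = text.find(keyword)
    while start != -1:
        left = max(0, start - width)
        right = min(len(text), start + len(keyword) + width)
        snippets.append(text[left:right])
        # Continue searching after the current occurrence.
        start = text.find(keyword, start + len(keyword))
    return snippets


comment = "Minutes were taken by Tanaka of the sales division."
print(kwic(comment, "Tanaka", 5))
```

Each snippet can serve as the clickable excerpt from which the display shifts to the corresponding location in the retrieved document.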
[0045]
Then, display data of the search result is generated using the result of the language processing, and is output to the output unit 6 (step S10). The output unit 6 displays the display data on a predetermined display.
[0046]
FIG. 10 shows an example of a display screen, displayed by the output unit 6, of display data generated from the documents obtained as a result of the search by the second search unit 4. Here, one document (for example, the document shown in FIG. 3) is obtained as the search result, and the display shows all or part of the data associated with the tag name corresponding to the “meeting name” in this document, all or part of the data associated with the tag name corresponding to the “date and time” specified as a search condition, and a fixed number of characters before and after the character string "Tanaka" specified as a search condition. On this display screen, clicking with a mouse the place where the fixed number of characters before and after the character string "Tanaka" is displayed shifts to a display of the location where the character string "Tanaka" appears in the retrieved document. FIG. 11 shows an example of that display screen.
[0047]
In the display screen shown in FIG. 11, the character string "Tanaka" is displayed in a special manner, such as inverted or highlighted display, so that the user can easily recognize it visually. A publicly known technique may be used to realize this processing.
[0048]
As described above, it is important to have a function for displaying the context before and after the location of the user-supplied free word in the retrieved document, and a function for quickly shifting to a display of that location in the retrieved document. This is because, unlike a search of structured data using tag names or the like (XML search), a search based on a free word involves semantic ambiguity, so the search result is not always what the user wants. The user must confirm whether a retrieved document is the desired one, and the information useful for making that determination is the context in which the specified free word appears in the document. The KWIC presentation function and the function of quickly shifting to the display of the corresponding location in the retrieved document provide this context information effectively.
[0049]
The above search example showed a case where one character string (for example, “Tanaka”) is given as a free word, but the same processing can be performed when a plurality of free words are given. In this case, a plurality of character strings such as "Tanaka", "Yamada", and "Sato" may be listed in the blank corresponding to the item "others" on the search condition input screen. For example, the character strings may be entered in one blank, separated by spaces, ",", "/", or the like, or, if there are a plurality of blanks, one character string may be entered in each blank.
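Splitting a single blank into several free words can be sketched as follows. The delimiter set (whitespace, ",", "/") follows the examples in the text; the function name `parse_free_words` is hypothetical.

```python
import re


def parse_free_words(field: str) -> list[str]:
    """Split one input field into free words on whitespace, commas, or slashes."""
    return [word for word in re.split(r"[\s,/]+", field) if word]


print(parse_free_words("Tanaka, Yamada/Sato"))
```

When the screen provides several blanks instead, the same function can be applied to each blank and the resulting lists concatenated.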
[0050]
Next, the processing operation of the search processing unit 2 will be described with reference to the flowcharts shown in FIGS. 5 and 6, taking as an example a case where the user has input, on the search condition input screen of FIG. 4, a search condition different from that of the above example (for example, see FIG. 9).
[0051]
For example, assume that on the search condition input screen shown in FIG. 4, the user specifies the character string "December 12" as a search condition in the blank corresponding to the item "Date" and, at the same time, specifies the character strings “host” and “EIA” as search conditions in the blanks corresponding to the item “others”, as shown in FIG. 12.
[0052]
The search condition is transmitted from the input unit 1 to the search processing unit 2 (step S1). As described above, the search processing unit 2 checks the search condition input by the user and separates it into the first search condition and the second search condition (step S2). First, a first search condition sentence is generated from the first search condition. In the above example, as described above, a first search condition sentence is generated from the first search condition for searching for a document that has an element with the <date> tag in which that element includes the character string “December 12”. The first search condition sentence is passed to the first search unit 3, a search is requested, and a search result is obtained (step S3).
[0053]
Further, in the above example, the search condition input from the input unit 1 specifies the character strings “host” and “EIA” in the blanks corresponding to the item “others”, and a character string entered in a blank corresponding to the item “others” is used as a search condition for the unstructured data (second search condition). That is, “host” and “EIA” are obtained as second search conditions from the search condition input by the user (step S4).
[0054]
Incidentally, when words are specified in the two fields of the "others" line on the search screen of FIG. 4, it is assumed that the first field holds an item name and the next field holds an item value. Item names are generally nouns, typically common nouns such as “date and time” and “place”. Item values are often numerical values or proper nouns such as “December 12”, “Friday”, or “Tokyo”, but may also be common nouns. In this case, the search word “host” is a common noun and “EIA” is a proper noun.
[0055]
When a search word serving as a search key includes a common noun, synonym expansion is executed, as is usual in free word search. In this case, the second search is executed after synonymous search words such as "co-host" and "organizer" are added to the character string "host".
[0056]
On the other hand, for proper nouns such as “EIA”, search words are added by a process called “variant notation expansion”. Specifically, forms with and without the long vowel mark, words in which a small katakana "tsu" in the search word is replaced with a full-size "tsu", the original unabbreviated form of an abbreviation, and the like are added. In this case, the second search is performed after variant notations such as “Electric Information Association” or “Electronic Information Association” are added to “EIA”.
[0057]
Synonym expansion and variant notation expansion normally use general-purpose and special-purpose word dictionaries. Such synonym expansion and variant notation expansion of search words can be realized by known information retrieval techniques.
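A dictionary-based expansion of search words might look like the following sketch. The two dictionaries are hypothetical stand-ins for the word dictionaries mentioned above, populated only with the examples given in the text; real dictionaries and the rules for long vowel marks or small katakana are not modeled.

```python
# Hypothetical dictionaries; the entries are the examples from the text.
SYNONYMS = {"host": ["co-host", "organizer"]}
VARIANTS = {"EIA": ["Electric Information Association",
                    "Electronic Information Association"]}


def expand(word: str) -> list[str]:
    """Return the word followed by its synonyms and variant notations."""
    return [word] + SYNONYMS.get(word, []) + VARIANTS.get(word, [])


def matches(area: str, word: str) -> bool:
    """True if the comment area contains the word or any expanded form."""
    return any(form in area for form in expand(word))


print(expand("host"))
```

With this sketch, a comment area that says "organizer" still satisfies a search for "host", which is the purpose of adding the expanded words before the second search.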
[0058]
In the following description, the synonyms and variant notations expanded in this manner are not described, but note that the use of these techniques is not excluded.
[0059]
From the second search condition, a second search condition sentence is generated for searching, from among the documents retrieved by the first search unit 3, for documents in which the unstructured data in the comment area includes the character strings “host” and “EIA” given as the second search condition. The second search condition sentence is then passed to the second search unit 4, a search is requested, and a search result is obtained (step S6 in FIG. 6).
[0060]
In the above example, the second search unit 4 performs a full-text search on the comment area of each document retrieved by the first search unit 3 and searches for documents having a comment area that includes the character strings “host” and “EIA”. The retrieved documents are then passed to the search processing unit 2.
[0061]
In the above example, there are two character strings (“host” and “EIA”) as search keys, and documents in which both character strings appear in one comment area are retrieved preferentially. This is because a plurality of character strings specified as search keys (here, two) are semantically related, and such character strings often appear close to each other in a document.
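The preference for documents in which all search keys appear in one comment area can be sketched as a simple two-level ranking. This is an illustrative assumption about how "preferentially" might be realized; the specification does not define a scoring method, and the function name `rank` is hypothetical.

```python
import re

# Assumption: comment areas are delimited by "<!--" and "-->".
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def rank(documents: list[str], words: list[str]) -> list[str]:
    """Order documents so that those with a single comment area
    containing every word come first."""
    def score(doc: str) -> int:
        areas = COMMENT_RE.findall(doc)
        # 0: some single comment area holds every word; 1: otherwise.
        return 0 if any(all(w in a for w in words) for a in areas) else 1

    # sorted() is stable, so the original order is kept within each group.
    return sorted(documents, key=score)


split = "<m><!-- host: local office --><!-- EIA sent a guest --></m>"
together = "<m><!-- host: EIA --></m>"
print(rank([split, together], ["host", "EIA"])[0] == together)
```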
[0062]
For example, the document shown in FIG. 3 is detected as a search result because the comment area includes both character strings of “host” and “EIA” (step S7).
[0063]
Next, linguistic processing is performed on the comment area in each searched document (step S9).
[0064]
The two free words "host" and "EIA" have, from an informational viewpoint, a semantic structure in which "host" is a key and "EIA" is a value, that is, "host" is an item name and "EIA" is its content. When a plurality of given free words have such a semantic structure, a publicly known information extraction technique is applied, in the language processing of step S9, to the comment area of each retrieved document. In this case, for example, by applying plain2 (plain2 fan [URL] http://shika.aist-nara.ac.jp/products/plane2/plane2-plane2-j.html) or the like, information indicating the correspondence between the item name “host” and the item content "Electronic Information Association (EIA)" can be extracted. plain2 analyzes the structure of a document using its line feed and indent information. That is, it extracts itemized (bulleted) structure from the text and thereby structures the text.
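The idea of recovering item-name and item-content pairs from the line structure of a comment area can be illustrated with a much simpler stand-in. The sketch below is not plain2 and does not reproduce its analysis; it merely assumes, for illustration, that each itemized line has the form "name: content". The function names `extract_items` and `has_relation` are hypothetical.

```python
def extract_items(comment: str) -> dict[str, str]:
    """Extract item name -> item content pairs from lines of the
    assumed form "name: content" (indentation is ignored)."""
    items: dict[str, str] = {}
    for line in comment.splitlines():
        name, sep, content = line.strip().partition(":")
        if sep and name.strip() and content.strip():
            items[name.strip()] = content.strip()
    return items


def has_relation(comment: str, item_name: str, item_value: str) -> bool:
    """True if the content of `item_name` includes `item_value`."""
    return item_value in extract_items(comment).get(item_name, "")


comment = "Agenda\n  host: Electronic Information Association (EIA)\n  place: Tokyo\n"
print(has_relation(comment, "host", "EIA"))
```

A document for which `has_relation` holds matches the key-value structure of the free words, whereas a document that merely contains both strings in unrelated lines does not.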
[0065]
Since the document shown in FIG. 3 retains the line feed and indent information of the original document (see FIG. 2), plain2 can be applied. Even for a document that has no such format information, similar information extraction is possible by using a technique disclosed in MUC (Proceedings of the Sixth Message Understanding Conference (MUC-6), Morgan Kaufmann) or the like.
[0066]
The same applies when the documents stored in the document storage unit 5 are documents of a specific word processor or structured documents that include document layout information such as itemization or indentation, for example RTF (Rich Text Format) or HTML. In such cases, the layout information of the document can be obtained directly, so information indicating the correspondence between “item name” and “item content” can be acquired using a technique such as that of MUC.
[0067]
Among the documents having a comment area that includes the plurality of character strings specified as free words, the second search unit 4 preferentially retrieves documents in which the correspondence between those character strings in the comment area matches the semantic structure between the free words given as search keys.
[0068]
In step S9, language processing is performed on the portions of each retrieved document where the character strings specified as the second search condition, that is, the free words (“host” and “EIA” in the above example), appear. Here, a fixed number of characters before and after each occurrence of the character strings are extracted as KWIC (Key Word In Context).
[0069]
Then, display data of the search result is generated using the result of the language processing, and is output to the output unit 6 (step S10). The output unit 6 displays the display data on a predetermined display.
[0070]
FIG. 13 shows an example of a display screen, displayed by the output unit 6, of display data generated from the documents obtained as a result of the search by the second search unit 4. Here, one document (for example, the document shown in FIG. 3) is obtained as the search result, and the display shows all or part of the data associated with the tag name corresponding to the “meeting name” in this document, all or part of the data associated with the tag name corresponding to the “date and time” specified as a search condition, and a fixed number of characters before and after the character strings “host” and “EIA” specified as search conditions. When the second search unit 4 has obtained a correspondence in which “host” is an item name and “EIA” is included in the item content, a document having a comment area that includes “host” and “EIA” in that correspondence is retrieved. Based on this correspondence, the display example of FIG. 13 presents the retrieved document so as to express the relationship in which the item name is “host” and the item content includes “EIA”. That is, in FIG. 13, so that the correspondence between item name and item content is clear, the character string “host”, which corresponds to the item name, is displayed in the same column as the item names corresponding to tag names such as “name” and “date and time”, and the fixed number of characters before and after the character string “EIA” is displayed as the item content. The location where the item content is displayed is shown as KWIC, as in FIG. 10. On the display screen of FIG. 13, clicking with a mouse the portion displayed as KWIC (the fixed number of characters before and after the character string "EIA" specified as a search condition) shifts to a display of the location where the character string "EIA" appears in the retrieved document. An example of that display screen is shown in FIG. 14.
[0071]
In the display screen shown in FIG. 14, the character string "host" and the character string portion corresponding to the item content of that item name (a character string including "EIA") are displayed in a special manner, such as inverted or highlighted display, so that the user can easily recognize them visually. A publicly known technique may be used to realize this processing.
[0072]
In the above embodiment, “<!--” and “-->” are used as the marks for marking up the comment area in each document stored in the document storage unit 5. However, the present invention is not limited to this case. For example, an arbitrary XML tag (markup symbol) defined to represent a comment area (that is, given the meaning of a comment) may be used instead.
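When the comment area is marked up with a dedicated tag rather than with comment delimiters, its text can be collected with a standard XML parser. The sketch below assumes a hypothetical tag name `comment`; the specification does not name such a tag.

```python
import xml.etree.ElementTree as ET


def comment_area_text(document: str, tag: str = "comment") -> str:
    """Collect the text of every element with the (assumed) comment tag."""
    root = ET.fromstring(document)
    # itertext() also gathers text nested inside child elements.
    return "\n".join("".join(el.itertext()) for el in root.iter(tag))


doc = "<meeting><date>December 12</date><comment>Chair: Tanaka</comment></meeting>"
print(comment_area_text(doc))
```

The full-text search then runs over the returned text in the same way as over a comment area delimited by comment marks.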
[0073]
In the above embodiment, the XML search in the first search unit 3 takes a tag name and a character string included in the element of that tag name as search conditions, and searches for structured data (or document data having it) that has an element of the specified tag name in which the element includes the specified character string. However, the search is not limited to this. For example, when only one or more tag names are specified as the search condition, structured data (or document data having it) that has elements of the specified tag names may be searched for. Likewise, a search in which a document structure, such as an element containing an element with a certain tag name or an element contained in a certain element, or a document structure together with a character string included in some element, is specified as the search condition can easily be realized using well-known XML search techniques. The above embodiment merely shows one example in which the first search unit 3 searches structured data using a known XML search technique; the first search unit 3 may perform any search that can be realized with known XML search techniques. In that case, the input screen and the like for entering search conditions may be changed according to the search conditions to be input.
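The two variations mentioned above, a search by tag names only and a search by document structure, can be sketched with the limited path syntax of Python's standard `xml.etree.ElementTree`. The function names and the sample document are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def has_elements(document: str, tags: list[str]) -> bool:
    """True if the document has at least one element for every tag name."""
    root = ET.fromstring(document)
    return all(next(root.iter(tag), None) is not None for tag in tags)


def has_path(document: str, path: str) -> bool:
    """True if an element exists at `path`, evaluated relative to the root
    (for example "./agenda/item": an <item> directly inside <agenda>)."""
    return ET.fromstring(document).find(path) is not None


doc = ("<meeting><agenda><item>Budget</item></agenda>"
       "<date>December 12</date></meeting>")
print(has_elements(doc, ["date", "item"]), has_path(doc, "./agenda/item"))
```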
[0074]
As described above, according to the embodiment, the document storage unit 5 stores a plurality of document data, each including data to be searched by a first search method (for example, an XML search method), which is one of two different search methods, namely structured data composed of at least one element consisting of a predetermined element name and data associated with that element name, and data to be searched by a second search method (for example, a full-text search method), which is the other of the two search methods, namely unstructured data in the comment area. When the first search condition for the structured data and the second search condition for the unstructured data are input, the first search unit 3 searches the structured data in each document data stored in the document storage unit 5 using the XML search method, thereby obtaining document data including an element that satisfies the first search condition. The second search unit 4 searches the unstructured data in each document data obtained as a result of the search by the first search unit 3 using the full-text search method, thereby obtaining document data having unstructured data that satisfies the second search condition.
[0075]
That is, in the above embodiment, for document data in which structured data and unstructured data are mixed, the structured data is searched using a search method suited to searching structured data, and the unstructured data is searched using a search method suited to searching unstructured data. In general, a search of structured data takes less time than a search of unstructured data. Therefore, the search of the structured data is performed first, and the search of the unstructured data is then performed with the result of the structured data search as its search target.
[0076]
As described above, according to the embodiment, before the search of unstructured data, which takes a relatively long processing time, is performed, a search of the structured data is performed to narrow down the documents to be searched, and the search of the unstructured data is further performed after limiting the search position (to the comment area). This makes it possible to search document data in which structured data and unstructured data are mixed at high speed and efficiently.
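The overall two-stage flow summarized above can be sketched in a few lines. This is a minimal illustration under several assumptions: documents are small XML strings, comment areas are delimited by "<!--" and "-->", both conditions are substring matches, and the function name `search` is hypothetical. It is not an implementation of the search units themselves.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: comment areas are delimited by "<!--" and "-->".
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def search(documents: list[str], first: dict[str, str],
           second: list[str]) -> list[str]:
    """Two-stage search: structured conditions first, free words second.

    first:  tag name -> character string that the element must contain
    second: free words that the comment area must contain
    """
    # Stage 1: the structured search narrows the candidate set.
    candidates = []
    for doc in documents:
        root = ET.fromstring(doc)  # the default parser skips comments
        if all(any(text in (el.text or "") for el in root.iter(tag))
               for tag, text in first.items()):
            candidates.append(doc)
    if not second:
        return candidates  # only the first search condition was given

    # Stage 2: full-text search, limited to comment areas of candidates.
    return [doc for doc in candidates
            if all(word in "\n".join(COMMENT_RE.findall(doc))
                   for word in second)]


docs = [
    "<meeting><date>December 12</date><!-- Chair: Tanaka --></meeting>",
    "<meeting><date>December 12</date><!-- Chair: Yamada --></meeting>",
    "<meeting><date>December 13</date><!-- Chair: Tanaka --></meeting>",
]
print(len(search(docs, {"date": "December 12"}, ["Tanaka"])))
```

Because stage 2 only touches the comment areas of documents that survived stage 1, the slower full-text scan runs over a smaller amount of text, which is the efficiency argument made in the text.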
[0077]
Note that, in the above embodiment, structured data marked up with XML tags has been described as an example of structured data. However, the present invention is not limited to this case. The same effect as described above can be obtained with any data that is structured by associating an item name (corresponding to the element name in the embodiment) with the data corresponding to that item.
[0078]
The method of the present invention described in the embodiment (in particular, the processing operation of the search processing unit 2 shown in the flowcharts of FIGS. 5 and 6 and, in addition to that search processing operation, the processing operation of each unit in FIG. 1) can also be stored, as a program executable by a computer, in a recording medium such as a magnetic disk (a flexible disk, a hard disk, or the like), an optical disk (a CD-ROM, a DVD, or the like), or a semiconductor memory, and distributed.
[0079]
Note that the present invention is not limited to the above embodiment and can be variously modified at the implementation stage without departing from the gist of the invention. Furthermore, the above embodiment includes inventions at various stages, and various inventions can be extracted by appropriately combining the plurality of disclosed constituent elements. For example, even if some constituent elements are deleted from all the constituent elements shown in the embodiment, a configuration from which those constituent elements are deleted can be extracted as an invention, provided that at least one of the problems described in the section on the problems to be solved by the invention can be solved and at least one of the effects described in the section on the effects of the invention is obtained.
[0080]
[Effect of the invention]
As described above, according to the present invention, retrieval of document data in which structured data and unstructured data are mixed can be performed at high speed and efficiently.
[Brief description of the drawings]
FIG. 1 is an exemplary diagram showing a configuration example of a document search system according to an embodiment of the present invention.
FIG. 2 is a diagram showing a specific example of an original document that does not include structured data.
FIG. 3 is a specific example of a document including structured data and unstructured data stored in a document storage unit, and shows a document generated from the original document of FIG. 2;
FIG. 4 is a diagram showing an example of a search condition input screen.
FIG. 5 is a flowchart for explaining a processing operation of a search processing unit in FIG. 1;
FIG. 6 is a flowchart for explaining a processing operation of a search processing unit in FIG. 1;
FIG. 7 is a diagram showing an example of a search condition input on a search condition input screen.
FIG. 8 is a diagram showing a display example of a search result based on the search condition shown in FIG. 7.
FIG. 9 is a diagram showing another example of a search condition input on a search condition input screen.
FIG. 10 is an exemplary view showing a display example of a search result based on the search condition shown in FIG. 9;
FIG. 11 is a diagram showing another display example of a search result.
FIG. 12 is a diagram showing still another example of the search condition input on the search condition input screen.
FIG. 13 is a view showing a display example of a search result based on the search condition shown in FIG. 12.
FIG. 14 is a diagram showing another display example of a search result.
[Explanation of symbols]
1 input unit
2 search processing unit
3 first search unit
4 second search unit
5 document storage unit
6 output unit

Claims

1. A document search device comprising:
storage means for storing a plurality of document data, each including structured data to be searched by a first search method, which is one of two different search methods, and unstructured data to be searched by a second search method, which is the other of the two search methods;
input means for inputting a first search condition for the structured data and a second search condition for the unstructured data;
first search means for searching the structured data in each document data stored in the storage means using the first search method to obtain document data including structured data satisfying the first search condition;
second search means for searching the unstructured data in each document data obtained as a result of the search by the first search means using the second search method to obtain document data including unstructured data satisfying the second search condition; and
output means for outputting the document data obtained as a result of the search by the second search means.

2. The document search device according to claim 1, wherein
the first search condition specifies an element name of an element constituting the structured data and a character string included in data associated with the element name,
the second search condition specifies a character string included in the unstructured data,
the first search means obtains document data including an element having the element name and the character string specified as the first search condition, and
the second search means obtains document data including unstructured data including the character string specified as the second search condition.

3. A document search method for retrieving desired document data from storage means storing a plurality of document data, each including structured data to be searched by a first search method, which is one of two different search methods, and unstructured data to be searched by a second search method, which is the other of the two search methods, the method comprising:
an input step of inputting a first search condition for the structured data and a second search condition for the unstructured data;
a first search step of searching the structured data in each document data stored in the storage means using the first search method to obtain document data including structured data satisfying the first search condition;
a second search step of searching the unstructured data in each document data obtained as a result of the search in the first search step using the second search method to obtain document data including unstructured data satisfying the second search condition; and
an output step of outputting the document data obtained as a result of the search in the second search step.

4. The document search method according to claim 3, wherein
the first search condition specifies an element name of an element constituting the structured data and a character string included in data associated with the element name,
the second search condition specifies a character string included in the unstructured data,
the first search step obtains document data including an element having the element name and the character string specified as the first search condition, and
the second search step obtains document data including unstructured data including the character string specified as the second search condition.

5. A program for retrieving desired document data from storage means storing a plurality of document data, each including structured data to be searched by a first search method, which is one of two different search methods, and unstructured data to be searched by a second search method, which is the other of the two search methods, the program causing a computer to execute:
an input step of inputting a first search condition for the structured data and a second search condition for the unstructured data;
a first search step of searching the structured data in each document data stored in the storage means using the first search method to obtain document data including structured data satisfying the first search condition;
a second search step of searching the unstructured data in each document data obtained as a result of the search in the first search step using the second search method to obtain document data including unstructured data satisfying the second search condition; and
an output step of outputting the document data obtained as a result of the search in the second search step.