JP3568062B2 - Document database management device and document database management method

Info

Publication number: JP3568062B2
Application number: JP15594495A
Authority: JP
Inventors: 恒中津山
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 1995-06-22
Filing date: 1995-06-22
Publication date: 2004-09-22
Anticipated expiration: 2019-09-22
Also published as: JPH096803A

Description

【０００１】
【産業上の利用分野】
本発明は、電子文書を管理対象とする文書データベース管理装置と文書データベース管理方法に関わる。
【０００２】
【従来の技術】
ワードプロセッサ等により作成された電子文書は、デジタルデータとして表現されるので、追加、削除、変更等の編集を容易に行なうことができ、文書作成効率を高めることができる。また、複数の電子文書を大容量の記憶装置に蓄積して文書データベース装置を構築することにより、キーワード検索等により目的とする文書を電子的に検索することができる。
【０００３】
従来の電子文書を管理対象とする文書データベース管理装置では、文書の検索を行なう場合には、ワードプロセッサ等で作られた文書データそのものを蓄積し、そのデータを使って検索を行なっていた。
【０００４】
一方、電子文書の作成や編集作業を容易に行なえるようにするために、電子文書を構造化することが行なわれている。文書の構造は、たとえば、文書を構成する章、見出し、段落などの要素と、その要素間の関係についての情報、たとえば、章は、下位構造として見出しと段落を持つなどについての情報により表される。
【０００５】
【発明が解決しようとする課題】
本発明が解決しようとする課題を、文書構造の国際規格であるＯＤＡ（ＯｆｆｉｃｅＤｏｃｕｍｅｎｔＡｒｃｈｉｔｅｃｔｕｒｅ）（ＩＳＯ８６１３）とＳＧＭＬ（ＳｔａｎｄａｒｄＧｅｎｅｒａｌｉｚｅｄＭａｒｋｕｐＬａｎｇｕａｇｅ）（ＩＳＯ８８７９；ＪＩＳＸ４１５１）を例にとって説明する。
【０００６】
先ず、本明細書で使用する用語について説明する。
【０００７】
「文書構造」という用語は、文書を表現する情報構造とする。たとえば、ＯＤＡが定める情報構造は文書構造である。ＳＧＭＬのサブセッティング（機能の制限）を行ない、使用する文字コードや図表などに用いる情報構造を定めたものも文書構造である。なお、ＳＧＭＬについては、たとえば、ＭａｒｔｉｎＢｒｙａｎ著，「ＳＧＭＬ入門」，株式会社アスキー，１９９１年３月３１日発行を参照されたい。
【０００８】
「文書型」という用語は、文書のテンプレートを示すものとする。文書型は、そこから作られる文書がどのような論理構造をもち得るか、すなわち、論理構造中に現われるノードの種類、各ノードがもち得る属性、各ノードがもち得る下位構造を定める。ＯＤＡの共通論理構造（ｇｅｎｅｒｉｃｌｏｇｉｃａｌｓｔｒｕｃｔｕｒｅ）や、ＳＧＭＬをサブセッティングした文書アーキテクチャにおけるＤＴＤ（ＤｏｃｕｍｅｎｔＴｙｐｅＤｅｆｉｎｉｔｉｏｎ）は、文書型である。
【０００９】
次に、上述したような、構造化された文書を検索する場合の問題点について説明する。
【００１０】
構造化文書では、文書の内容は論理構造と呼ばれ、章、節、図などの複数の文書構成要素からなる木構造で表現される。論理構造の例を図１０に示す。
【００１１】
論理構造はまったく自由に作成してよいのではなく、上述した文書型と呼ばれる構文規則に沿って作成される。文書型の例を図１１に示す。矩形のノードは要素の型（要素型）を定義している。ノードのラベルは、要素型の名前を示している。同一の名前をもつノードの実体は同一の要素型である。したがって、図１１の「節」という名前の要素型は、再帰的に定義されていることになる。楕円で示したノードは要素のつながりを定義する。このノードを構築子と呼ぶ。ＳＥＱノードは、それにつながるノードのインスタンスがその順に生成されることを示している。ＲＥＰノードは、それにつながるノードのインスタンスが１回以上生成されることを示す。ＯＰＴノードは、それにつながるノードのインスタンスが、出現してもしなくてもよいことを示す。ＣＨＯノードは、それにつながるいずれか１つのノードのインスタンスが生成されることを示す。図１０の論理構造は、図１１の文書型の制約を満たしている。
【００１２】
構造化文書を管理対象とする文書データベース管理装置では、検索を記述するための問合せ言語を提供している。問合せ言語は、テキストで記述されるものもあるが、グラフィカルユーザインタフェースで記述されるものもある。グラフィカルユーザインターフェースで記述された検索式の例を図１２に示す。ノードの文字列は要素型を示している。ノードの傍に示した文字列は、そのノードがもつテキストがその文字列を含むことを示す。実線で示されたアークは、両端のノードが親子関係にあることを示す。破線で示されたアークは、両端のノードが祖孫関係（先祖と子孫の関係）にあることを示す。ひとつのノードから複数のアークが出ている場合、すべての条件を満すものが検索結果となる。つまり、連言として指定されたことになる。図１２の検索式は、「見出しに”文書”という文字列を含み、”データベース”という文字列を含む段落をもつ章」の検索を指定している。
【００１３】
検索式は、文書の要素に関する条件と、要素間の接続関係に関する条件を用いて指定される。図１２の例では、前者は要素型に関する条件、後者は祖孫関係を用いた条件である。
【００１４】
図１３は、上述したような検索を行なう従来のデータベース管理装置の模式図である。問合せエディタ１で作成された検索式は、検索式生成部２により、検索式評価部３で実行可能な形式の検索式に変換される。この検索式は、文書型管理部５に渡され指定された条件を満たす要素をもつ文書を検索する。なお、６はデータ辞書、７はデータベースである。上記検索式評価部３においては検索式が文法的に正しいかどうかの検査も行なわれ、文法的に正しくない場合には、その旨が操作者に通知されると共に処理が中止される。
【００１５】
図１３に示す従来の電子文書を管理対象とする文書データベース管理装置では、検索式が文法的に正しいかどうかの検査だけを行なっていた。このため、解が存在し得ない検索式を与えても、正しい検索式として扱われる。たとえば、図１４に示すような「段落に”データベース”という文字列を含み、その段落の下位に繋がった見出しに”文書”という文字列を含む」という検索式は、文法的には正しいが、これを満す要素をもつ文書は存在しない。即ち、図１１に示す文書例の構造の場合、段落の下位に見出しが存在することは有り得ない。
【００１６】
このような、妥当でない検索式を検索した結果は、条件を満すものが存在しないので、何も得られない。ユーザの視点からは、検索式が文法的にも意味論的にも正しいが、条件を満たすものがデータベースに存在しなかったのか、そもそも検索式が妥当でなかったかは容易には判断できず、検索式を構成する際のユーザの負担になっていた。また、条件を満す要素をもつ文書が存在し得ないにもかかわらず検索処理を行なうので、無意味にシステムの計算時間を浪費する結果となっていた。図１４に示す検索式の評価は、典型的には、見出しのインスタンスと段落のインスタンスをすべて走査し、その後、親子関係を満すものがあるかどうかを調べるが、そもそも図１４の条件を満すものは存在し得ないので、評価する必要はない。
【００１７】
そこで本発明は、検索式の妥当性を容易に判断できるようにすることを目的とする。また、本発明の他の目的は、妥当でない検索式に対しては検索を行なわないようにして計算時間の浪費を防止することである。
【００１８】
【課題を解決するための手段】
前記問題点を解決するため、本発明の文書データベース管理装置は、要素間のつながりを定義する構文規則によって取り得る構造が規定されている構造化文書を管理対象とし、この構造化文書を構成する要素に関する条件と、前記要素間の親子関係及び祖孫関係に関する条件とを用いた検索式により検索対象を指定する文書データベース管理装置において、文書データベース管理装置内に格納されている構文規則と与えられた検索式を照合する検索式検証手段を備え、前記検索式検証手段が、前記構文規則に基づいて、前記構造化文書を構成する要素のうち親となる出発要素と、前記出発要素の子として出現し得る隣接要素と、前記出発要素の子孫として出現し得る到達可能要素を一つの組とし、すべての出発要素についての前記の組を備えた対応表を生成する対応表生成手段と、前記与えられた検索式における出発要素と隣接要素と到達可能要素に基づき、前記与えられた検索式における前記出発要素について前記隣接要素と前記到達可能要素が出現し得るか否かを前記対応表を走査して検証し、前記検索式が妥当であるか否かを検証する手段とを備えていることを特徴とする。
【００１９】
また本発明の文書データベース管理方法は、検索式検証手段及び対応表生成手段を備え、要素間のつながりを定義する構文規則によって取り得る構造が規定されている構造化文書を管理対象とし、この構造化文書を構成する要素に関する条件と、前記要素間の親子関係及び祖孫関係に関する条件とを用いた検索式により検索対象を指定する文書データベース管理装置において、前記検索式検証手段が、文書データベース管理装置内の前記構文規則に基づいて、前記構造化文書を構成する要素のうち親となる出発要素と、前記出発要素の子として出現し得る隣接要素と、前記出発要素の子孫として出現し得る到達可能要素を一つの組とし、すべての出発要素についての前記の組を備えた対応表を対応表生成手段により生成し、与えられた検索式における出発要素と隣接要素と到達可能要素に基づき、前記与えられた検索式における前記出発要素について前記隣接要素と前記到達可能要素が出現し得るか否かを前記対応表を走査して検証し、前記検索式が妥当であるか否かを検証することを特徴とする。
【００２０】
【作用】
本発明によれば、構造化文書を検索するに際し、文書型と、検索式で指定された、親子関係および祖孫関係を用いた条件が照合され、検索式が妥当か否か判断される。
【００２１】
本発明においては、検索を実行する前に文書型が調べられ、構造化文書を構成する要素のうち親となる出発要素と、前記出発要素の子として出現し得る隣接要素と、前記出発要素の子孫として出現し得る到達可能要素を一つの組とし、すべての出発要素についての前記の組を備えた対応表が生成される。検索式が入力されるとこの検索式に基づいて前記対応表が走査され、検索式が対応表の条件を満足しているか否かが判別され、条件が満足されないときには、検索式が妥当でないと判断される。
【００２２】
【実施例】
図１は、本発明の文書データベース管理装置のブロック図である。問合せエディタ１は、検索条件を入力するためのもので図２に示すような問合せエディタ画面を使用して入力される。問合せエディタ１で作成された検索式は、検索式生成部２により、検索式評価部３で実行可能な形式の検索式に変換される。この検索式は、検索式検証部４に渡され、妥当か否か判定される。検索式検証部４における処理の詳細については後述する。妥当な検索式であれば検索式評価部３に渡され、指定された条件を満たす要素をもつ文書を検索する。なお、５は文書型管理部、６はデータ辞書、７はデータベースである。
【００２３】
検索式検証部４は、検証制御部８、対応表保持部９、対応表生成部１０、到達可能性判定部１１をもつ。
【００２４】
検証制御部８は、全体を統轄する要素で、対応表保持部９、対応表生成部１０、到達可能性判定部１１を適宜呼び出す。
【００２５】
対応表保持部９は、要素（出発要素）、出発要素の子として出現し得る要素（隣接要素集合）、および出発要素から到達可能な要素の集合（到達可能要素集合）の３つ組をエントリとする対応表を保持する。ここで、ある要素Ａからある要素Ｂに到達可能であるとは、要素Ａのインスタンスの下位（子孫）として要素Ｂが出現し得ることを言う。図１１の文書型から生成した対応表を図３に示す。たとえば、要素「記事」に対しては、下位要素として「節」が隣接しており、要素「記事」からは、要素「節」，「見出し」，「段落」の何れにも到達可能であることを示している。
【００２６】
対応表生成部１０は、指定された文書型について文書型の構造を参照して上記した対応表を生成する。
【００２７】
到達可能性判定部１１は、ソースとデスティネーションの２つの要素が与えられたとき、対応表保持部９に保持されている対応表（図３参照）を走査し、ソースを出発要素とするエントリの隣接要素集合または到達可能要素集合にデスティネーションが含まれるか検査する。デスティネーションが隣接要素集合に含まれるときには、デスティネーションはソースの子として出現し得る。デスティネーションが到達可能要素集合に含まれるときには、デスティネーションはソースの子孫として出現し得る。
【００２８】
図４〜図９に、検索式検証部４において実行される与えられた検索式を検証する処理のフローを示す。このフローに沿って、本実施例について説明する。
【００２９】
図４は、検索式の検証の全体のフローである。この処理の入力は、検索式生成部２で生成された検索式である。検証制御部８は、文書型管理部５を呼び出し、入力された検索式の検索対象となるスキーマ（文書型）の情報を取得する（ステップ６−１）。続いて、対応表生成部１０を呼び出し、そのスキーマの対応表を作成する（ステップ６−２）。
【００３０】
図５は、対応表の作成処理（図４のステップ６−２参照）のフローである。この処理は対応表生成部１０で行なわれる。入力は、スキーマの情報である。スキーマの情報は、図６に示すような有向グラフで表現される。まず、入力されたスキーマのルートを選択する（ステップ７−２）。次に、ルートから到達可能な要素型の集合を求める（ステップ７−３）。変数Ｓに、ルートから到達可能な要素型の集合にルートを加えて集合を保持させる（ステップ７−４）。なお、ステップ７−４における戻り値とは直前の処理により得られた結果を示す次いで、変数Ｓに未処理のノードがあるか検査する（ステップ７−５）。未処理のノードがなければ終了である（ステップ７−１０）。未処理のノードがあれば、ノードをひとつ選択する（ステップ７−６）。選択したノードと隣接する要素型の集合を求める（ステップ７−７）。さらに、選択したノードから到達可能な要素型の集合を求める（ステップ７−８）。選択中のノード（要素型）、ステップ７−７で得られた隣接要素型集合、およびステップ７−８で得られた到達可能要素型集合を３つ組として対応表保持部９に渡し、対応表にエントリを登録する（ステップ７−９）。この後、ステップ７−５に戻る。
【００３１】
図７は、到達可能要素型集合を求める処理（図５のステップ７−３参照）のフローである。この処理も対応表生成部１０で行なわれる。この処理の入力は要素型で、出力は入力された要素型から到達可能な要素型の集合である。このフローでは、要素型の集合を保持する変数Ｓと、要素型のキューを保持する変数Ｑを用いる。変数Ｓの初期値は空集合である（ステップ８−２）。変数Ｑの初期値は、入力ノードに隣接するノードすべてからなるキューである（ステップ８−３）。まず、変数Ｑの長さが０かどうか判定する（ステップ８−４）。変数Ｑの長さが０であれば、入力された要素型から到達可能な要素の集合が変数Ｓに格納されているので、これを戻り値として制御を戻す（ステップ８−１０）。変数Ｑの長さが１以上であれば、変数Ｑの先頭要素を取り出す（ステップ８−１１）。取り出した要素が変数Ｓに含まれていれば、ステップ８−４に戻る。取り出した要素がＳに含まれていなければ、それが要素型かどうか検査する（ステップ８−７）。要素型であれば、Ｓにその要素型を加える（ステップ８−８）。取り出した要素に隣接するノードすべてを変数Ｑの末尾に追加し（ステップ８−９）、ステップ８−４に戻る。
【００３２】
図８は、隣接要素型集合を求める処理（図５のステップ７−７参照）のフローである。この処理も対応表生成部１０で行なわれる。この処理の入力は要素型で、出力は入力された要素型と隣接する要素型の集合である。到達可能要素型集合を求める処理と同様、このフローでも、要素型の集合を保持する変数Ｓと、要素型のキューを保持する変数Ｑを用いる。変数Ｓの初期値は空集合である（ステップ９−２）。変数Ｑの初期値は、入力ノードに隣接するノードすべてからなるキューである（ステップ９−３）。まず、変数Ｑの長さが０かどうか判定する（ステップ９−４）。変数Ｑの長さが０であれば、入力された要素型から到達可能な要素の集合が変数Ｓに格納されているので、これを戻り値として制御を戻す（ステップ９−８）。変数Ｑの長さが１以上であれば、変数Ｑの先頭要素を取り出す（ステップ９−５）。取り出した要素が要素型かどうか検査する（ステップ９−６）。要素型であれば、変数Ｓにその要素型を加え（ステップ９−７）、ステップ９−４に戻る。要素型でなければ、取り出した要素に隣接するノードすべてを変数Ｑの末尾に追加し（ステップ９−９）、ステップ９−４に戻る。
【００３３】
図９は、検索式のノードの検証処理（図４のステップ６−４参照）のフローである。この処理は、到達可能性判定部１１で行なわれる。この処理の入力は検索式のノード、出力はそのノードが妥当か否かを示す真理値である。まず、対応表保持部９に保持されている対応表を走査し、入力されたノードを出発要素型とするエントリを求めておく（ステップ１０−２）。次に、未処理の隣接ノードがあるかどうか検査する（ステップ１０−３）。すべて処理が済んでいれば戻り値を真とし、制御を戻す（ステップ１０−１２）。未処理の隣接ノードがあれば、ノードをひとつ選ぶ（ステップ１０−４）。選択したノードが、入力されたノードの子として指定されているかどうか判定する（ステップ１０−５）。子として指定されていれば、選択したノードが、エントリの隣接要素型集合に含まれるかどうか検査する（ステップ１０−７）。含まれていなければ、戻り値を偽として制御を戻す（ステップ１０−７）。含まれていれば、選択したノードを検証する（ステップ１０−８）。ステップ１０−５で、選択したノードが子として指定されていなければ、つまり子孫として指定されていれば、選択したノードが、エントリの到達可能要素型集合に含まれるかどうか検査する（ステップ１０−１１）。含まれていなければ、戻り値を偽として制御を戻す（ステップ１０−１３）。含まれていれば、ステップ１０−８に行く。ステップ１０−８での検証結果が偽であれば、戻り値を偽として制御を戻す（ステップ１０−１０）。真であれば、ステップ１０−３に戻る。
【００３４】
本実施例では、検索式の検証を行なう度に対応表を構成しているが、文書型をデータベースに登録する時点で対応表を構成し、検証時はその表を走査するようにしてもよい。
【００３５】
【発明の効果】
以上のように、本発明によれば、文書型と、検索式で指定された、親子関係および祖孫関係を用いた条件が照合され、妥当か否か判断される。
【００３６】
これにより、検索式の意味的な誤りにより検索結果が得られなかったのか、条件に該当するインスタンスがなかったのかを判別するのが容易になる。また、システムが、検索結果があり得ない検索式を評価しなくて済むようになり、計算時間の浪費を防ぐことができる。
【図面の簡単な説明】
【図１】本発明の文書データベース管理装置の実施例の構成である。
【図２】問合せエディタのグラフィカルユーザインターフェースの例である。
【図３】対応表の例である。これは図１２に示した文書型の対応表である。
【図４】検索式の検証のフローである。
【図５】対応表の作成処理のフローである。
【図６】図１１の文書型を有向グラフで表現したものである。
【図７】ある要素型から到達可能な要素型の集合を求める処理のフローである。
【図８】ある要素型に隣接する要素型の集合を求める処理のフローである。
【図９】検索式のノードの検証処理のフローである。
【図１０】文書インスタンスの例である。
【図１１】文書型の例である。これは図１０の文書インスタンスの文書型である。
【図１２】検索対象の指定の例である。
【図１３】従来の文書データベース管理装置の構成である。
【図１４】妥当でない検索式の例である。この検索式で用いている文書型は図１１のものである。
【符号の説明】
１…問い合わせエディタ、２…検索式生成部、３…検索式評価部、４…検索式検証部、５…文書型管理部、６…データ辞書、７…データベース、８…検証制御部、９…対応表保持部、１０…対応表生成部、１１…到達可能性判定部
[0001]
[Industrial Field of Application]
The present invention relates to a document database management apparatus and a document database management method for managing electronic documents.
[0002]
[Prior art]
Since an electronic document created by a word processor or the like is expressed as digital data, editing such as addition, deletion, and modification can be easily performed, and document creation efficiency can be improved. In addition, by constructing a document database device by storing a plurality of electronic documents in a large-capacity storage device, a target document can be electronically searched by a keyword search or the like.
[0003]
In a conventional document database management apparatus that manages electronic documents, when searching for a document, the document data itself created by a word processor or the like is stored, and the search is performed using the data.
[0004]
On the other hand, electronic documents are being structured in order to make them easier to create and edit. The structure of a document is represented, for example, by the elements that make up the document, such as chapters, headings, and paragraphs, together with information on the relationships between those elements, for example that a chapter has headings and paragraphs as its substructure.
[0005]
[Problems to be solved by the invention]
The problem to be solved by the present invention is explained using as examples ODA (Office Document Architecture) (ISO 8613) and SGML (Standard Generalized Markup Language) (ISO 8879; JIS X4151), which are international standards for document structure.
[0006]
First, terms used in the present specification will be described.
[0007]
The term "document structure" refers to an information structure that represents a document. For example, the information structure defined by ODA is a document structure. A structure obtained by subsetting SGML (restricting its features) and fixing the character codes to be used and the information structures used for figures, tables, and the like is also a document structure. For details on SGML, see, for example, Martin Bryan, "Introduction to SGML", ASCII Corporation, published March 31, 1991.
[0008]
The term "document type" refers to a template for documents. A document type defines what logical structure the documents created from it can have, namely, the kinds of nodes that appear in the logical structure, the attributes each node can have, and the substructure each node can have. The generic logical structure of ODA and the DTD (Document Type Definition) of a document architecture obtained by subsetting SGML are document types.
[0009]
Next, the problems that arise when searching structured documents such as those described above are explained.
[0010]
In a structured document, the content of the document is called a logical structure, and is represented by a tree structure including a plurality of document components such as chapters, sections, and figures. FIG. 10 shows an example of the logical structure.
[0011]
A logical structure cannot be created with complete freedom; it is created in accordance with syntax rules called the document type, described above. FIG. 11 shows an example of a document type. Rectangular nodes define element types. The label of a node gives the name of the element type. Nodes with the same name represent the same element type. The element type named "section" in FIG. 11 is therefore defined recursively. Nodes drawn as ellipses define how elements are connected. Such a node is called a constructor. A SEQ node indicates that instances of the nodes connected to it are generated in that order. A REP node indicates that instances of the nodes connected to it are generated one or more times. An OPT node indicates that instances of the nodes connected to it may or may not appear. A CHO node indicates that an instance of exactly one of the nodes connected to it is generated. The logical structure of FIG. 10 satisfies the constraints of the document type of FIG. 11.
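The meaning of the four constructors can be sketched in Python by translating a constructor tree into a regular expression over child element names. This is only an illustration under stated assumptions: FIG. 11 is not reproduced in this text, so the content model `SECTION_MODEL` and the helper names `to_regex` and `children_conform` are hypothetical.

```python
import re

# Hypothetical content model for "section" (FIG. 11 is not available here):
# a heading followed by one or more of (paragraph | section).
SECTION_MODEL = ("SEQ", ["heading", ("REP", [("CHO", ["paragraph", "section"])])])


def to_regex(model):
    """Translate a constructor tree into a regex over space-terminated names."""
    if isinstance(model, str):            # a plain element type name
        return "(?:" + re.escape(model) + " )"
    kind, parts = model
    inner = [to_regex(p) for p in parts]
    if kind == "SEQ":                     # connected nodes appear in order
        return "(?:" + "".join(inner) + ")"
    if kind == "CHO":                     # exactly one of the connected nodes
        return "(?:" + "|".join(inner) + ")"
    if kind == "REP":                     # one or more occurrences
        return "(?:" + "".join(inner) + ")+"
    if kind == "OPT":                     # may or may not appear
        return "(?:" + "".join(inner) + ")?"
    raise ValueError("unknown constructor: " + kind)


def children_conform(children, model):
    """Check one element's list of child names against its content model."""
    text = "".join(name + " " for name in children)
    return re.fullmatch(to_regex(model), text) is not None
```

Under this assumed model, a section consisting of a heading followed by a paragraph conforms, while a section with only a heading does not, because REP requires at least one occurrence.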
[0012]
A document database management apparatus that manages structured documents provides a query language for describing searches. Some query languages are written as text, while others are written through a graphical user interface. FIG. 12 shows an example of a search expression written in a graphical user interface. The character string inside a node indicates the element type. A character string shown beside a node indicates that the text of that node contains the string. An arc drawn as a solid line indicates that the nodes at its two ends are in a parent-child relationship. An arc drawn as a broken line indicates that the nodes at its two ends are in an ancestor-descendant relationship. When several arcs leave a single node, only items that satisfy all of the conditions are returned as search results; that is, the conditions are specified as a conjunction. The search expression of FIG. 12 specifies a search for "a chapter whose heading contains the string 'document' and which has a paragraph containing the string 'database'".
[0013]
A search expression is specified using conditions on the elements of a document and conditions on the connection relationships between elements. In the example of FIG. 12, the former are conditions on the element type and the latter are conditions using the ancestor-descendant relationship.
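A search expression of this kind can be modelled as a small tree whose nodes carry an element type and an optional text condition, and whose arcs are labelled as parent-child or ancestor-descendant. The following Python sketch is an assumption-laden illustration: the class `QueryNode`, its fields, and the choice of descendant arcs for the FIG. 12 query are hypothetical, since the figure itself is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class QueryNode:
    element_type: str                      # label inside the node
    contains: Optional[str] = None         # string shown beside the node
    arcs: List[Tuple[str, "QueryNode"]] = field(default_factory=list)

    def child(self, node):
        """Solid arc: node must be a direct child."""
        self.arcs.append(("child", node))
        return self

    def descendant(self, node):
        """Broken arc: node must be a descendant at any depth."""
        self.arcs.append(("descendant", node))
        return self


# "A chapter whose heading contains 'document' and which has a paragraph
# containing 'database'". Multiple arcs from one node form a conjunction.
fig12_query = (
    QueryNode("chapter")
    .descendant(QueryNode("heading", contains="document"))
    .descendant(QueryNode("paragraph", contains="database"))
)
```

Keeping the relation label on the arc rather than on the node keeps the element condition and the connection condition separate, which mirrors the two kinds of condition described above.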
[0014]
FIG. 13 is a schematic diagram of a conventional database management device that performs the search described above. A search expression created with the query editor 1 is converted by the search expression generation unit 2 into a form that the search expression evaluation unit 3 can execute. This search expression is passed to the document type management unit 5, and documents having elements that satisfy the specified conditions are retrieved. Reference numeral 6 denotes a data dictionary and 7 denotes a database. The search expression evaluation unit 3 also checks whether the search expression is grammatically correct; if it is not, the operator is notified and processing is stopped.
[0015]
The conventional document database management apparatus for electronic documents shown in FIG. 13 checks only whether a search expression is grammatically correct. Consequently, even a search expression that cannot have any solution is treated as a correct search expression. For example, the search expression shown in FIG. 14, "a paragraph that contains the string 'database' and has, connected below it, a heading that contains the string 'document'", is grammatically correct, but no document has elements that satisfy it. That is, with the structure shown in FIG. 11, a heading can never exist below a paragraph.
[0016]
Evaluating such an invalid search expression yields nothing, because nothing can satisfy its conditions. From the user's point of view, it is not easy to tell whether the search expression was grammatically and semantically correct but nothing in the database satisfied the conditions, or whether the search expression was invalid in the first place; this burdens the user when constructing search expressions. In addition, because search processing is carried out even though no document with elements satisfying the conditions can exist, system computation time is wasted to no purpose. Evaluating the search expression of FIG. 14 typically involves scanning all heading instances and all paragraph instances and then checking whether any of them satisfy the parent-child relationship; but since nothing can ever satisfy the conditions of FIG. 14, there is no need to evaluate it at all.
[0017]
An object of the present invention is therefore to make it easy to judge the validity of a search expression. Another object of the present invention is to avoid wasting computation time by not performing a search for an invalid search expression.
[0018]
[Means for Solving the Problems]
To solve the above problems, the document database management apparatus of the present invention manages structured documents whose possible structures are prescribed by syntax rules that define the connections between elements, and specifies a search target by a search expression that uses conditions on the elements constituting a structured document and conditions on the parent-child and ancestor-descendant relationships between those elements. The apparatus comprises search expression verification means for checking a given search expression against the syntax rules stored in the document database management apparatus. The search expression verification means comprises: correspondence table generation means for generating, based on the syntax rules, a correspondence table that contains, for every starting element, a set consisting of a starting element that is a parent among the elements constituting the structured document, the adjacent elements that can appear as children of the starting element, and the reachable elements that can appear as descendants of the starting element; and means for verifying whether the search expression is valid by scanning the correspondence table, based on the starting element, adjacent elements, and reachable elements in the given search expression, to verify whether the adjacent elements and the reachable elements can appear for the starting element in the given search expression.
[0019]
The document database management method of the present invention is a method for a document database management apparatus that includes search expression verification means and correspondence table generation means, that manages structured documents whose possible structures are prescribed by syntax rules defining the connections between elements, and that specifies a search target by a search expression using conditions on the elements constituting a structured document and conditions on the parent-child and ancestor-descendant relationships between those elements. In this method, the search expression verification means causes the correspondence table generation means to generate, based on the syntax rules held in the document database management apparatus, a correspondence table that contains, for every starting element, a set consisting of a starting element that is a parent among the elements constituting the structured document, the adjacent elements that can appear as children of the starting element, and the reachable elements that can appear as descendants of the starting element; scans the correspondence table, based on the starting element, adjacent elements, and reachable elements in a given search expression, to verify whether the adjacent elements and the reachable elements can appear for the starting element in the given search expression; and thereby verifies whether the search expression is valid.
[0020]
[Action]
According to the present invention, when a structured document is searched, the document type is checked against the conditions using parent-child and ancestor-descendant relationships specified in the search expression, and it is determined whether the search expression is valid.
[0021]
In the present invention, the document type is examined before a search is executed, and a correspondence table is generated that contains, for every starting element, a set consisting of a starting element that is a parent among the elements constituting the structured document, the adjacent elements that can appear as children of the starting element, and the reachable elements that can appear as descendants of the starting element. When a search expression is entered, the correspondence table is scanned based on the search expression to determine whether the search expression satisfies the conditions of the correspondence table; if it does not, the search expression is judged to be invalid.
[0022]
[Embodiment]
FIG. 1 is a block diagram of the document database management device of the present invention. The query editor 1 is used to enter search conditions, which are entered on a query editor screen such as that shown in FIG. 2. A search expression created with the query editor 1 is converted by the search expression generation unit 2 into a form that the search expression evaluation unit 3 can execute. This search expression is passed to the search expression verification unit 4, which determines whether it is valid. The processing in the search expression verification unit 4 is described in detail later. If the search expression is valid, it is passed to the search expression evaluation unit 3, and documents having elements that satisfy the specified conditions are retrieved. Reference numeral 5 denotes a document type management unit, 6 a data dictionary, and 7 a database.
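The ordering of verification before evaluation can be sketched as a simple gate. This Python fragment is a hypothetical illustration of the control flow only; the function name `run_query`, the status strings, and the callables passed in are assumptions and do not correspond to named interfaces in the specification.

```python
def run_query(query, is_valid, evaluate):
    """Verify first; evaluate only when the query can possibly match.

    `is_valid` stands in for the search expression verification unit and
    `evaluate` for the search expression evaluation unit (both assumed).
    """
    if not is_valid(query):
        # Invalid by the document type: report it and skip the search.
        return ("invalid", [])
    # Valid: an empty result now means "no matching instance exists".
    return ("ok", evaluate(query))
```

Returning a separate status lets the caller distinguish an expression that can never match from one that simply matched nothing in the database.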
[0023]
The search expression verification unit 4 includes a verification control unit 8, a correspondence table holding unit 9, a correspondence table generation unit 10, and a reachability determination unit 11.
[0024]
The verification control unit 8 is the component that controls the overall process, and calls the correspondence table holding unit 9, the correspondence table generation unit 10, and the reachability determination unit 11 as needed.
[0025]
The correspondence table holding unit 9 holds a correspondence table whose entries are triples consisting of an element (the starting element), the elements that can appear as children of the starting element (the adjacent element set), and the set of elements reachable from the starting element (the reachable element set). Here, element B being reachable from element A means that element B can appear below (as a descendant of) an instance of element A. FIG. 3 shows the correspondence table generated from the document type of FIG. 11. For example, it shows that, for the element "article", "section" is adjacent as a lower element, and that the elements "section", "heading", and "paragraph" are all reachable from "article".
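The correspondence table can be pictured as a mapping from each starting element to its two sets. In the Python sketch below, only the "article" entry is taken from the description above; the other entries, and the names `Entry` and `TABLE`, are assumptions made for illustration because FIG. 3 is not reproduced here.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    adjacent: frozenset    # element types that can appear as children
    reachable: frozenset   # element types that can appear as descendants


TABLE = {
    # Entry stated in the description.
    "article": Entry(frozenset({"section"}),
                     frozenset({"section", "heading", "paragraph"})),
    # Assumed entries: "section" is recursive; leaves have nothing below them.
    "section": Entry(frozenset({"heading", "paragraph", "section"}),
                     frozenset({"heading", "paragraph", "section"})),
    "heading": Entry(frozenset(), frozenset()),
    "paragraph": Entry(frozenset(), frozenset()),
}
```

Every child is also a descendant, so in any such table the adjacent set of an entry is a subset of its reachable set.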
[0026]
The correspondence table generation unit 10 generates the above correspondence table for a specified document type by referring to the structure of that document type.
[0027]
Given two elements, a source and a destination, the reachability determination unit 11 scans the correspondence table (see FIG. 3) held in the correspondence table holding unit 9 and checks whether the destination is included in the adjacent element set or the reachable element set of the entry whose starting element is the source. When the destination is included in the adjacent element set, the destination can appear as a child of the source. When the destination is included in the reachable element set, the destination can appear as a descendant of the source.
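This lookup can be sketched in a few lines of Python. The table contents other than the "article" entry, the function name `can_appear`, and the treatment of an unknown source as "cannot appear" are assumptions for illustration.

```python
# starting element -> (adjacent element set, reachable element set)
TABLE = {
    "article": ({"section"}, {"section", "heading", "paragraph"}),
    # Assumed entries (FIG. 3 is not reproduced in this text).
    "section": ({"heading", "paragraph", "section"},
                {"heading", "paragraph", "section"}),
    "heading": (set(), set()),
    "paragraph": (set(), set()),
}


def can_appear(source, destination, relation):
    """Return True if `destination` can appear under `source`.

    relation is "child" (adjacent set) or "descendant" (reachable set).
    """
    entry = TABLE.get(source)
    if entry is None:          # assumption: unknown element types never match
        return False
    adjacent, reachable = entry
    if relation == "child":
        return destination in adjacent
    if relation == "descendant":
        return destination in reachable
    raise ValueError("unknown relation: " + relation)
```

With this assumed table, a heading is a possible descendant of an article but not a possible child, and nothing can appear under a paragraph.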
[0028]
FIGS. 4 to 9 show the flow of the processing performed by the search expression verification unit 4 to verify a given search expression. The present embodiment is described following this flow.
[0029]
FIG. 4 shows the overall flow of search expression verification. The input to this process is the search expression generated by the search expression generation unit 2. The verification control unit 8 calls the document type management unit 5 and obtains information on the schema (document type) that the input search expression targets (step 6-1). It then calls the correspondence table generation unit 10 to create the correspondence table for that schema (step 6-2).
[0030]
FIG. 5 shows the flow of the correspondence table creation process (see step 6-2 in FIG. 4). This process is performed by the correspondence table generation unit 10. The input is the schema information, which is represented as a directed graph such as that shown in FIG. 6. First, the root of the input schema is selected (step 7-2). Next, the set of element types reachable from the root is obtained (step 7-3). The variable S is set to the set of element types reachable from the root with the root itself added (step 7-4). The "return value" in step 7-4 refers to the result obtained by the immediately preceding process. Next, it is checked whether the variable S contains an unprocessed node (step 7-5). If there is no unprocessed node, the process ends (step 7-10). If there is an unprocessed node, one node is selected (step 7-6). The set of element types adjacent to the selected node is obtained (step 7-7). The set of element types reachable from the selected node is also obtained (step 7-8). The selected node (element type), the adjacent element type set obtained in step 7-7, and the reachable element type set obtained in step 7-8 are passed as a triple to the correspondence table holding unit 9, and an entry is registered in the correspondence table (step 7-9). The process then returns to step 7-5.
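The table creation loop can be sketched in Python as follows. This is a compact illustration rather than the flow of the figures: the schema graph (`ELEMENT_TYPES`, `EDGES`, and constructor node names such as `REP1`) is hypothetical because FIG. 6 is not reproduced here, and the single helper `_collect` stands in for both set computations.

```python
from collections import deque

# Hypothetical schema graph: element types plus constructor nodes.
ELEMENT_TYPES = {"article", "section", "heading", "paragraph"}
EDGES = {
    "article": ["REP1"], "REP1": ["section"],
    "section": ["SEQ1"], "SEQ1": ["heading", "REP2"],
    "REP2": ["CHO1"], "CHO1": ["paragraph", "section"],
    "heading": [], "paragraph": [],
}


def _collect(start, descend):
    """Element types below `start`; pass through element types only if `descend`."""
    found, queue = set(), deque(EDGES[start])
    while queue:
        node = queue.popleft()
        if node in found:
            continue
        if node in ELEMENT_TYPES:
            found.add(node)
            if not descend:      # adjacent set: stop at the first element type
                continue
        queue.extend(EDGES[node])
    return found


def build_table(root):
    pending = _collect(root, True) | {root}      # steps 7-3 and 7-4
    table = {}
    for node in sorted(pending):                 # steps 7-5 and 7-6
        adjacent = _collect(node, False)         # step 7-7
        reachable = _collect(node, True)         # step 7-8
        table[node] = (adjacent, reachable)      # step 7-9
    return table
```

For the assumed graph, `build_table("article")` produces one entry per element type, and the "article" entry matches the example given for FIG. 3 in the description.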
[0031]
FIG. 7 shows the flow of the process for obtaining a reachable element type set (see step 7-3 in FIG. 5). This process is also performed by the correspondence table generation unit 10. The input of this process is an element type, and the output is the set of element types reachable from the input element type. This flow uses a variable S that holds a set of element types and a variable Q that holds a queue of element types. The initial value of the variable S is the empty set (step 8-2). The initial value of the variable Q is a queue consisting of all nodes adjacent to the input node (step 8-3). First, it is determined whether the length of the variable Q is 0 (step 8-4). If the length of the variable Q is 0, the set of element types reachable from the input element type is stored in the variable S, so control is returned with this as the return value (step 8-10). If the length of the variable Q is 1 or more, the head element of the variable Q is taken out (step 8-11). If the element taken out is included in the variable S, the process returns to step 8-4. If the element taken out is not included in S, it is checked whether it is an element type (step 8-7). If it is an element type, it is added to S (step 8-8). All nodes adjacent to the element taken out are added to the end of the variable Q (step 8-9), and the process returns to step 8-4.
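The steps above amount to a breadth-first traversal of the schema graph, which can be sketched in Python as follows. The graph contents (`ELEMENT_TYPES`, `EDGES`, and the constructor node names) are hypothetical stand-ins for FIG. 6, and the step numbers in the comments follow the text.

```python
from collections import deque

# Hypothetical schema graph: element types plus constructor nodes.
ELEMENT_TYPES = {"article", "section", "heading", "paragraph"}
EDGES = {
    "article": ["REP1"], "REP1": ["section"],
    "section": ["SEQ1"], "SEQ1": ["heading", "REP2"],
    "REP2": ["CHO1"], "CHO1": ["paragraph", "section"],
    "heading": [], "paragraph": [],
}


def reachable_element_types(start):
    s = set()                      # step 8-2: S starts empty
    q = deque(EDGES[start])        # step 8-3: Q holds the adjacent nodes
    while q:                       # step 8-4: stop when Q is empty
        node = q.popleft()         # take the head of Q
        if node in s:              # already recorded: skip it
            continue
        if node in ELEMENT_TYPES:  # step 8-7
            s.add(node)            # step 8-8
        q.extend(EDGES[node])      # step 8-9: enqueue its adjacent nodes
    return s                       # step 8-10
```

Because only element types are recorded in S, termination on a recursive document type relies on every cycle in the graph passing through at least one element type, as the cycle through "section" does in this assumed graph.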
[0032]
FIG. 8 shows the flow of the process for obtaining an adjacent element type set (see step 7-7 in FIG. 5). This process is also performed by the correspondence table generation unit 10. The input of this process is an element type, and the output is the set of element types adjacent to the input element type. As in the process for obtaining the reachable element type set, this flow uses a variable S that holds a set of element types and a variable Q that holds a queue of element types. The initial value of the variable S is the empty set (step 9-2). The initial value of the variable Q is a queue consisting of all nodes adjacent to the input node (step 9-3). First, it is determined whether the length of the variable Q is 0 (step 9-4). If the length of the variable Q is 0, the set of element types adjacent to the input element type is stored in the variable S, so control is returned with this as the return value (step 9-8). If the length of the variable Q is 1 or more, the head element of the variable Q is taken out (step 9-5). It is checked whether the element taken out is an element type (step 9-6). If it is an element type, it is added to the variable S (step 9-7), and the process returns to step 9-4. If it is not an element type, all nodes adjacent to the element taken out are added to the end of the variable Q (step 9-9), and the process returns to step 9-4.
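The difference from the reachable-set computation is that the traversal stops at the first element type on each path and expands only constructor nodes. A minimal Python sketch follows; the graph contents (`ELEMENT_TYPES`, `EDGES`, constructor node names) are hypothetical stand-ins for FIG. 6.

```python
from collections import deque

# Hypothetical schema graph: element types plus constructor nodes.
ELEMENT_TYPES = {"article", "section", "heading", "paragraph"}
EDGES = {
    "article": ["REP1"], "REP1": ["section"],
    "section": ["SEQ1"], "SEQ1": ["heading", "REP2"],
    "REP2": ["CHO1"], "CHO1": ["paragraph", "section"],
    "heading": [], "paragraph": [],
}


def adjacent_element_types(start):
    s = set()                      # step 9-2: S starts empty
    q = deque(EDGES[start])        # step 9-3: Q holds the adjacent nodes
    while q:                       # step 9-4: stop when Q is empty
        node = q.popleft()         # step 9-5: take the head of Q
        if node in ELEMENT_TYPES:  # step 9-6
            s.add(node)            # step 9-7: record it, do not go below it
        else:
            q.extend(EDGES[node])  # step 9-9: look through the constructor
    return s                       # step 9-8
```

In the assumed graph, "section" is adjacent to itself because the recursion passes only through constructor nodes before reaching "section" again.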
[0033]
FIG. 9 shows the flow of the process for verifying a node of the search expression (see step 6-4 in FIG. 4). This process is performed by the reachability determination unit 11. Its input is a node of the search expression, and its output is a truth value indicating whether that node is valid. First, the correspondence table held in the correspondence table holding unit 9 is scanned to find the entry whose starting element type is the input node (step 10-2). Next, it is checked whether any adjacent node remains unprocessed (step 10-3). If all have been processed, the return value is set to true and control is returned (step 10-12). If an unprocessed adjacent node remains, one is selected (step 10-4). It is then determined whether the selected node is specified as a child of the input node (step 10-5). If it is specified as a child, it is checked whether the selected node is included in the adjacent element type set of the entry (step 10-7). If it is not included, control is returned with the return value set to false (step 10-7). If it is included, the selected node is verified (step 10-8). If, in step 10-5, the selected node is not specified as a child, that is, if it is specified as a descendant, it is checked whether the selected node is included in the reachable element type set of the entry (step 10-11). If it is not included, control is returned with the return value set to false (step 10-13). If it is included, processing proceeds to step 10-8. If the verification result in step 10-8 is false, control is returned with the return value set to false (step 10-10). If it is true, processing returns to step 10-3.
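The node verification is naturally recursive, as the following Python sketch shows. It is an illustration under assumptions: the table entries other than "article" are hypothetical, a query node is modelled as a plain `(element_type, arcs)` tuple, text conditions are omitted because they do not affect validity, and treating an element type missing from the table as invalid is an added assumption.

```python
# starting element type -> (adjacent set, reachable set); partly assumed.
TABLE = {
    "article": ({"section"}, {"section", "heading", "paragraph"}),
    "section": ({"heading", "paragraph", "section"},
                {"heading", "paragraph", "section"}),
    "heading": (set(), set()),
    "paragraph": (set(), set()),
}


def validate(node):
    """node = (element_type, [(relation, node), ...]); relation is
    "child" or "descendant"."""
    element_type, arcs = node
    entry = TABLE.get(element_type)            # step 10-2
    if entry is None:                          # assumption: unknown type
        return False
    adjacent, reachable = entry
    for relation, target in arcs:              # steps 10-3 and 10-4
        # Step 10-5: pick the set that matches the kind of arc.
        allowed = adjacent if relation == "child" else reachable
        if target[0] not in allowed:           # membership check fails
            return False
        if not validate(target):               # step 10-8: verify recursively
            return False
    return True                                # step 10-12


# FIG. 14: a paragraph with a heading below it, which can never occur.
fig14 = ("paragraph", [("child", ("heading", []))])
```

With this table, `validate(fig14)` returns False without touching any document instance, whereas a section with a heading child and a paragraph descendant is accepted.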
[0034]
In the present embodiment, the correspondence table is built each time a search expression is verified. However, the correspondence table may instead be built when the document type is registered in the database, and that table scanned at verification time.
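The alternative of building the table once at registration time can be sketched as a small cache. The class `TableCache`, its method names, and the `builds` counter are hypothetical and exist only to illustrate the idea; the table builder is passed in as a callable.

```python
class TableCache:
    """Builds a correspondence table once per registered document type."""

    def __init__(self, build_table):
        self._build_table = build_table   # assumed: schema -> table
        self._tables = {}
        self.builds = 0                   # counts how often a table is built

    def register(self, name, schema):
        # Build the table when the document type is registered.
        self._tables[name] = self._build_table(schema)
        self.builds += 1

    def table_for(self, name):
        # Verification only looks the table up; nothing is rebuilt.
        return self._tables[name]
```

This trades a little storage for not repeating the graph traversal on every query, which is worthwhile as long as a document type changes far less often than it is searched.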
[0035]
[Effects of the Invention]
As described above, according to the present invention, the document type is checked against the conditions using parent-child and ancestor-descendant relationships specified in the search expression, and it is determined whether the search expression is valid.
[0036]
This makes it easy to tell whether no search result was obtained because of a semantic error in the search expression or because no instance met the conditions. In addition, the system no longer needs to evaluate search expressions that cannot have any result, so wasted computation time is avoided.
[Brief description of the drawings]
FIG. 1 shows the configuration of an embodiment of the document database management device of the present invention.
FIG. 2 is an example of a graphical user interface of a query editor.
FIG. 3 is an example of a correspondence table. This is the correspondence table for the document type shown in FIG. 11.
FIG. 4 is a flowchart of verification of a search expression.
FIG. 5 is a flowchart of a correspondence table creation process.
FIG. 6 is a representation of the document type of FIG. 11 as a directed graph.
FIG. 7 is a flowchart of a process for obtaining a set of reachable element types from a certain element type.
FIG. 8 is a flowchart of a process for obtaining a set of element types adjacent to a certain element type.
FIG. 9 is a flowchart of a process of verifying a node of a search expression.
FIG. 10 is an example of a document instance.
FIG. 11 is an example of a document type. This is the document type of the document instance in FIG. 10.
FIG. 12 is an example of specifying a search target.
FIG. 13 shows a configuration of a conventional document database management device.
FIG. 14 is an example of an invalid search expression. The document type used in this search expression is that of FIG. 11.
[Explanation of symbols]
1: query editor, 2: search expression generation unit, 3: search expression evaluation unit, 4: search expression verification unit, 5: document type management unit, 6: data dictionary, 7: database, 8: verification control unit, 9: correspondence table holding unit, 10: correspondence table generation unit, 11: reachability determination unit

Claims

A document database management device that manages structured documents whose possible structures are prescribed by syntax rules defining the connections between elements, and that specifies a search target by a search expression using conditions on the elements constituting a structured document and conditions on the parent-child and ancestor-descendant relationships between the elements, the device comprising:
search expression verification means for checking a given search expression against the syntax rules stored in the document database management device,
wherein the search expression verification means comprises: correspondence table generation means for generating, based on the syntax rules, a correspondence table containing, for every starting element, a set consisting of a starting element that is a parent among the elements constituting the structured document, the adjacent elements that can appear as children of the starting element, and the reachable elements that can appear as descendants of the starting element; and means for verifying whether the search expression is valid by scanning the correspondence table, based on the starting element, adjacent elements, and reachable elements in the given search expression, to verify whether the adjacent elements and the reachable elements can appear for the starting element in the given search expression.

A document database management method for a document database management device that includes search expression verification means and correspondence table generation means, that manages structured documents whose possible structures are prescribed by syntax rules defining the connections between elements, and that specifies a search target by a search expression using conditions on the elements constituting a structured document and conditions on the parent-child and ancestor-descendant relationships between the elements,
wherein the search expression verification means causes the correspondence table generation means to generate, based on the syntax rules in the document database management device, a correspondence table containing, for every starting element, a set consisting of a starting element that is a parent among the elements constituting the structured document, the adjacent elements that can appear as children of the starting element, and the reachable elements that can appear as descendants of the starting element; scans the correspondence table, based on the starting element, adjacent elements, and reachable elements in a given search expression, to verify whether the adjacent elements and the reachable elements can appear for the starting element in the given search expression; and thereby verifies whether the search expression is valid.