JP2013246522A

JP2013246522A - Structured document retrieval device and program

Info

Publication number: JP2013246522A
Application number: JP2012117985A
Authority: JP
Inventors: Kaname Kojima (小島 要)
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2012-05-23
Filing date: 2012-05-23
Publication date: 2013-12-09
Also published as: CN103425719A

Abstract

PROBLEM TO BE SOLVED: To enable structure retrieval that combines the structure information given by XML (Extensible Markup Language) with the structure information given by annotation tags. SOLUTION: The structured document retrieval device according to the present invention includes: a processor that executes a program; a first storage area that stores the program; a second storage area that stores structured documents satisfying a tree structure condition and annotation data attached to those documents; a document structure list construction unit that generates a text-sharing DOM tree by assigning the text of a structured document to a structure in which the root elements of DOM (Document Object Model) trees are shared, the DOM trees being obtained individually from the inclusion relations of the tags of the structured document and from the inclusion relations of the tags of the annotation data; and a retrieval processing unit that retrieves elements matching a retrieval query from the text-sharing DOM tree.

Description

The present invention relates to a structured document search device that searches documents written in a structured language (hereinafter, "structured documents"), including structured documents to which annotation data has been attached in an arbitrary format, based on tag structure and/or character string data, and to a program that realizes the functions of the device on a computer.

XML (Extensible Markup Language) is a data format that allows structure information to be described for text. Structure information is embedded in the text by character strings enclosed in "<" and ">", called tags. XML can express a hierarchical tree structure by nesting tags, and the tree structure can be changed by adding or deleting tags. For this reason, XML is widely used for recording financial information, recording patent specifications, exchanging data in electronic commerce, and as a software file format. Hereinafter, a document written in XML is referred to as an XML document. An XML document can be searched using both structure and text as search conditions. Query languages for XML documents include XPath, a W3C Recommendation.
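As a small illustration of searching an XML document by both structure and text, the sketch below uses Python's standard `xml.etree.ElementTree` module, which supports a limited subset of XPath. The sample document and its tag names are assumptions chosen for illustration and are not prescribed by this specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (illustrative tag names only).
DOC = (
    "<p>In the present invention, <mtd>by using powdered coke</mtd>, "
    "<fx><obj>crude steel</obj> is produced, and a <val>twofold</val> "
    "<attr>efficiency gain</attr> is achieved</fx>.</p>"
)

root = ET.fromstring(DOC)

# Structural condition: <obj> elements that are direct children of an <fx> element.
structural_hits = root.findall(".//fx/obj")

# Text condition layered on top of the structural condition.
hits = [e for e in structural_hits if "steel" in (e.text or "")]

for e in hits:
    print(e.tag, repr(e.text))
```

A search of this kind relies on the tags forming a single tree, which is exactly the property that annotation tags added by other means may not have.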

Meanwhile, UIMA (Unstructured Information Management Architecture) is one technique for attaching annotations to general text data. UIMA is used to manage data such as unstructured documents and provides a platform for attaching annotation tags to documents. Unlike XML, UIMA does not require tags to be attached in a form that satisfies the tree structure condition. UIMA is therefore used to store document structure information whose structures need not stand in a tree relationship to one another, such as computer-generated grammatical analysis results and markings of technically important parts of a document.

Patent Document 1: US Patent Application Publication No. 2004/0243560

Toshiyuki Shimizu, Makoto Onizuka, Takeharu Eda, Masatoshi Yoshikawa, "Techniques for XML Data Management and Stream Processing", IEICE Transactions D, J90-D(2): 159-184, 2007.
G. Navarro and V. Mäkinen, "Compressed full-text indexes", ACM Computing Surveys 39(1), 2007.

Document structure information whose structures are not guaranteed to stand in a tree relationship is expected to keep increasing. A technique is therefore needed that enables searches by structural conditions and text conditions without being bound by tree relationships.

However, the structure information in results obtained by automatic computer extraction (for example, grammatical analysis results, or document structure analysis results based on the semantic information of the text, such as key technologies and effects) and in the results of manual marking does not necessarily satisfy the tree structure condition. Existing XML search methods therefore could not be used for documents containing structure information that does not satisfy the tree structure condition.

For these reasons, the search function provided by UIMA is used to search tag information that has no structural constraints (Patent Document 1). However, this search method does not consider the hierarchical structure formed by the inclusion relations between tags. The search function provided by UIMA therefore performs only a Boolean search that checks whether the text specified as the search query is contained in each tag.

As a result, neither the existing search functions for XML documents nor the existing search function provided by UIMA could search an annotated XML document while taking into account both the structural conditions given by XML and those given by annotations.

In view of the problems described above, the present invention enables searches that consider both structural conditions and text even when the structure formed by a structured document whose tag structure satisfies the tree structure condition, together with the structure information of arbitrary annotations associated with that document, does not satisfy the tree structure condition.

This specification includes a plurality of inventions that solve the above problem. One example includes: a processor that executes a program; a first storage area that stores the program; a second storage area that stores a structured document satisfying the tree structure condition and annotation data attached to the document; a document structure list construction unit that generates a text-sharing DOM tree by assigning the text of the structured document to a structure in which the root elements of the DOM (Document Object Model) trees, obtained individually from the inclusion relations of the tags of the structured document and the inclusion relations of the tags of the annotation data, are shared; and a search processing unit that searches the text-sharing DOM tree for elements matching a location path given as a search query.

According to the present invention, searches can be performed using both text and structure information that includes annotations attached in an arbitrary format. Problems, configurations, and effects other than those described above will become apparent from the following description of the embodiments.

Diagram showing a configuration example of the structured document search device (first embodiment).
Diagram showing an example of the programs and data stored in the main storage device (first embodiment).
Diagram explaining an example of annotation groups (common to all embodiments).
Diagram showing an example of a text-sharing DOM tree (common to all embodiments).
Diagram explaining the flow of preprocessing in the structured document search device (first embodiment).
Flowchart showing a processing example of the DOM tree construction unit (first embodiment).
Flowchart showing a processing example of the document structure list construction unit (first embodiment).
Flowchart showing a processing example of the parent-child relationship analysis and registration unit (first embodiment).
Diagram explaining the flow of search processing by the structured document search device (first embodiment).
Diagram showing an example of a DOM DAG (second embodiment).
Diagram explaining an example of a path DAG that integrates two DOM DAGs, and of an inverted index (second embodiment).
Flowchart showing a processing example of the inverted index construction unit (third embodiment).
Flowchart showing a processing example of the depth assignment unit (third embodiment).
Flowchart showing a processing example of the path DAG ID acquisition unit (third embodiment).
Flowchart showing the path DAG element generation and registration unit (third embodiment).
Flowchart showing a processing example of the inverted index registration unit (third embodiment).
Diagram explaining a recording example of a DOM DAG (fourth embodiment).
Diagram showing an overview of the search index (fourth embodiment).
Flowchart showing a processing example of the search index construction unit (fourth embodiment).
Flowchart showing a processing example of the search index registration unit (fourth embodiment).
Flowchart showing a processing example of the path DAG ID registration unit (fourth embodiment).
Flowchart showing a processing example of the location path search unit (fourth embodiment).
Flowchart showing a processing example of the XML element search unit (fourth embodiment).
Flowchart showing a processing example of the annotation element search unit (fourth embodiment).
Diagram showing an overview of the extended wavelet tree (sixth embodiment).
Flowchart showing a processing example of the extended wavelet tree construction unit (sixth embodiment).
Flowchart showing a processing example of the rank calculation unit (sixth embodiment).
Flowchart showing a processing example of the select calculation unit (sixth embodiment).

Embodiments of the present invention are described below with reference to the accompanying drawings. The present invention is not limited to the embodiments described below, and various modifications are possible within the scope of its technical idea.

[First Embodiment]
[Overview]
This embodiment describes a structured document search device that preprocesses a set of XML documents and a set of annotation data to create search data in advance, matches this search data against a search query, and outputs the elements that match the query as the search result. The search data used in this embodiment is a text-sharing DOM tree that integrates the structure information of the XML tags and the annotation tags.

[Device Configuration]
FIG. 1-1 shows a configuration example of the structured document search device 400. The structured document search device 400 is configured as a computer having a CPU (Central Processing Unit) 401, a main storage device (memory) 402, an auxiliary storage device 403A, and a user interface unit 406. The structured document search device 400 is connected to external network devices via a network 405 such as a LAN (Local Area Network).

The CPU 401 is a central processing unit that executes programs stored in the main storage device 402. FIG. 1-2 shows an example of the functional units realized through the programs stored in the main storage device 402, together with the data. FIG. 1-2 shows not only the functional units realized by the programs used in the first embodiment but also the functional units realized by the programs used in the other embodiments and the data used in each embodiment. For example, by executing programs, the CPU 401 functions as a DOM tree construction unit 409 that constructs a DOM tree from an XML document, a DOM tree construction unit 410 that constructs DOM trees from annotation data, a document structure list construction unit 411, a text-sharing DOM tree construction unit 415, a text data and text element list construction unit 417, a text assignment unit 418, a parent-child relationship analysis and registration unit 419, a location path search unit 420, a DOM DAG construction unit 422, an inverted index construction unit 424, a depth assignment unit 427, a path DAG ID acquisition unit 428, a path DAG element generation and registration unit 429, a path DAG ID registration unit 439, a search index registration unit 440, an XML element search unit 443, an annotation element search unit 444, an extended wavelet tree construction unit 451, a search index construction unit 454, and a simple bit vector and wavelet tree construction unit 455.

The main storage device 402 is a storage device such as a RAM (Random Access Memory). It stores the programs described above and data used in executing them, such as the path DAG 423. Where necessary, the main storage device 402 also temporarily stores the XML document set 407, the annotation data set 408, and the document structure list 412.

The auxiliary storage device 403A is a storage device or storage medium, such as an HDD, that stores XML documents, annotation data, the programs described above, and the like.

The removable medium 404 is a recording medium, such as a CD-ROM or DVD, on which XML documents, annotation data, and the like are recorded. The data recorded in the auxiliary storage device 403A and on the removable medium 404 is read into the main storage device 402 as needed when the structured document search device 400 starts up.

The user interface unit 406 is an input/output device (for example, a keyboard, mouse, and display) that provides a user interface.

The CPU 401 acquires XML documents and the annotation data attached to them, as needed, from the main storage device 402, the auxiliary storage device 403A, the removable medium 404, or an external storage device 403B connected via the network 405. The external storage device 403B is a storage device or storage medium such as an HDD. The network 405 may be a local area network or the Internet, and may be wired or wireless. Based on the XML documents and annotation data acquired from these storage devices, the CPU 401 creates a search index 430, which is stored in the main storage device 402.

The description above gave an example in which the XML documents and annotation data are stored in one of the main storage device 402, the auxiliary storage device 403A, the removable medium 404, and the external storage device 403B on the network 405, but it suffices that they are stored on a storage device that the CPU 401 can read from and write to. For example, the XML documents may be stored in the auxiliary storage device 403A and the annotation data in the main storage device 402.

The CPU 401 executes the program corresponding to each of the functional units described above and thereby operates as a functional unit that realizes a given function. For example, by operating according to the document structure construction program, the CPU 401 functions as the text-sharing DOM tree construction unit 415. The same applies to the other programs: by operating according to the location path search program, the CPU 401 functions as the location path search unit 420, and by operating according to the search index construction program, it functions as the search index construction unit 454.

Information such as the programs and tables that realize the functions of the location path search unit 420, the DOM DAG construction unit 422, and the search index construction unit 454 can be stored in the auxiliary storage device 403A, the removable medium 404, a storage device such as a nonvolatile semiconductor memory, hard disk drive, or SSD (Solid State Drive), or a computer-readable non-transitory data storage medium such as an IC card, SD card, or DVD.

[Annotation Data]
Next, annotation data is described. An annotation is given by attaching a tag to a text region in an XML document. A tag attached by annotation is called an annotation tag. Annotation tags are assumed to have been divided in advance into groups such that, within each group, the structure formed by the inclusion relations among the tags is a tree. Such a group is called an annotation group, and each annotation group is assigned an integer ID. Annotation tags belonging to different annotation groups need not satisfy the tree structure condition with respect to one another.

FIG. 2 shows an example of annotations attached to an XML document and their annotation groups. In "annotation group 1", the annotation tag "effect" is attached to the text "improve durability" (耐久性を向上) in the document, and the annotation tag "content" is attached to the text "relating to a knife" (ナイフに関する). These tags do not overlap each other and thus satisfy the tree structure condition. In "annotation group 2", the annotation tag "target" is attached to the text "for medical use" (医療用), and the annotation tag "tool" is attached to the text "medical knife" (医療用ナイフ). These annotation tags are nested, so they satisfy the tree structure condition.

In contrast, the annotation tag "content", which belongs to "annotation group 1", and the annotation tag "tool", which belongs to "annotation group 2", are not nested (their text regions only partially overlap) and therefore do not satisfy the tree structure condition. However, because "content" and "tool" belong to different annotation groups, this poses no problem for the annotation data.
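The tree structure condition discussed here can be sketched as a pairwise check on text regions: two regions are compatible when they are disjoint or one contains the other. The sketch below is illustrative only. The sample text, the half-open `(start, end)` character offsets, and the function names are assumptions, since FIG. 2 itself is not reproduced here.

```python
from itertools import combinations

def compatible(a, b):
    """True if spans a and b (half-open offsets) are disjoint or nested."""
    s1, e1 = a
    s2, e2 = b
    disjoint = e1 <= s2 or e2 <= s1
    nested = (s1 <= s2 and e2 <= e1) or (s2 <= s1 and e1 <= e2)
    return disjoint or nested

def satisfies_tree_condition(spans):
    """True if every pair of spans is disjoint or nested."""
    return all(compatible(a, b) for a, b in combinations(spans, 2))

# Hypothetical text fragment containing the regions named in the example.
TEXT = "医療用ナイフに関する"

def span(sub):
    """Locate a substring of TEXT and return its (start, end) offsets."""
    start = TEXT.index(sub)
    return (start, start + len(sub))

group1 = [span("ナイフに関する")]               # "content"
group2 = [span("医療用"), span("医療用ナイフ")]  # "target", "tool"

print(satisfies_tree_condition(group1))           # within group 1
print(satisfies_tree_condition(group2))           # within group 2
print(satisfies_tree_condition(group1 + group2))  # across both groups
```

Checking each group separately succeeds, while checking the union fails because "content" and "tool" share only the characters for "knife".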

An annotation tag consists of a set of four items: tag name, annotation group ID, start position of the text region, and end position of the text region. Annotation data is the collection of these four-item sets, one per annotation tag. One possible format is text data in which each line describes one annotation tag, giving its tag name, annotation group ID, text region start position, and text region end position separated by tabs.
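A minimal sketch of reading the tab-separated format described above follows. The record type `AnnotationTag`, the function name, and the sample offsets are assumptions made for illustration; the specification fixes only the four fields and their order.

```python
from typing import NamedTuple

class AnnotationTag(NamedTuple):
    name: str      # tag name
    group_id: int  # annotation group ID
    start: int     # start position of the text region
    end: int       # end position of the text region

def parse_annotation_data(text):
    """Parse one annotation tag per line, with four tab-separated fields."""
    tags = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, group_id, start, end = line.split("\t")
        tags.append(AnnotationTag(name, int(group_id), int(start), int(end)))
    return tags

# Hypothetical sample data; the offsets are invented for illustration.
SAMPLE = (
    "effect\t1\t10\t16\n"
    "content\t1\t3\t10\n"
    "target\t2\t0\t3\n"
    "tool\t2\t0\t6\n"
)

tags = parse_annotation_data(SAMPLE)
for tag in tags:
    print(tag)
```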

[XML Elements, Annotation Elements, and Text Elements]
An XML element is an element represented by the start tag and end tag of an XML tag; it has a tag name and the positions of the start and end tags in the text region. An annotation element has a tag name, an annotation group ID, and the positions of the start and end tags of the annotation tag. A text element represents a text region and consists of the start position and end position of the region and the text it contains.

[Elements of the DOM Tree]
Each element of a DOM tree has a tag name, an annotation group ID, a text region start position, a text region end position, and a depth. For an element derived from an ordinary XML tag, the annotation group ID is "0". For a text element, the annotation group ID is "-1" and the text is held in the tag name field.
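One way to picture such an element is as a small record whose annotation group ID doubles as a type discriminator. The sketch below is an assumption-laden illustration: the class name, field names, `children` list, and `kind` helper are not taken from the specification, which lists only the five fields and the ID conventions "0" and "-1".

```python
from dataclasses import dataclass, field

XML_GROUP_ID = 0    # group ID used for ordinary XML elements
TEXT_GROUP_ID = -1  # group ID used for text elements

@dataclass
class DomElement:
    tag_name: str   # tag name, or the text itself for a text element
    group_id: int   # 0 = XML, -1 = text, positive = annotation group
    start: int      # text region start position
    end: int        # text region end position
    depth: int = 0
    children: list = field(default_factory=list)  # hypothetical helper field

    @property
    def kind(self):
        """Classify the element from its annotation group ID."""
        if self.group_id == TEXT_GROUP_ID:
            return "text"
        if self.group_id == XML_GROUP_ID:
            return "xml"
        return "annotation"

xml_el = DomElement("obj", XML_GROUP_ID, 50, 61, depth=3)
ann_el = DomElement("T", 1, 50, 73, depth=2)
txt_el = DomElement("crude steel", TEXT_GROUP_ID, 50, 61, depth=4)
print(xml_el.kind, ann_el.kind, txt_el.kind)
```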

[Overview of Preprocessing]
This section outlines the preprocessing that the structured document search device 400 performs before a search. The preprocessing is executed by the CPU 401 functioning as the document structure list construction unit 411. As preprocessing, the CPU 401 creates the text-sharing DOM tree 416.

First, the CPU 401 acquires an unprocessed XML document and its corresponding annotation data from the XML document set 407 and the annotation data set 408. Next, the CPU 401 executes the DOM tree construction program and operates as the DOM tree construction unit 409, which constructs a DOM tree from an XML document, and as the DOM tree construction unit 410, which constructs DOM trees from annotation data. Functioning as the DOM tree construction units 409 and 410, the CPU 401 creates a DOM tree for the XML document and a DOM tree for the elements belonging to each annotation group in the annotation data.

Next, the CPU 401 executes the text-sharing DOM tree construction program and functions as the text-sharing DOM tree construction unit 415. Here, the CPU 401 assigns the text to a structure in which the root elements of the two DOM trees formed for the same text are shared, and thereby creates the text-sharing DOM tree 416. The text-sharing DOM tree 416 constructed for each XML document and its annotation data is held by the document structure list construction unit 411 as the data structure used for searching.

[Text-Sharing DOM Tree]
Next, the text-sharing DOM tree 416 is described. The text-sharing DOM tree 416 is a data structure built from the DOM tree constructed from the XML tags and the one or more DOM trees constructed from the tags belonging to each annotation group; the root elements of these DOM trees are merged into a single shared root. The DOM tree based on XML tags is constructed by the CPU 401 functioning as the DOM tree construction unit 409, and the DOM trees based on annotation tags are constructed by the CPU 401 functioning as the DOM tree construction unit 410. In the following description, "DOM tree" by itself refers to a DOM tree constructed from XML tags or a DOM tree constructed from annotation tags.

The text of an XML document is delimited by the XML tags and annotation tags, and a text element is formed for each delimited text region. In each DOM tree, a text element is assigned to the deepest element (the one farthest from the root) among the elements that contain it. Each text region may therefore be assigned to multiple DOM trees rather than to a single one.

FIG. 3 shows an example of a text-sharing DOM tree created from the following sentence, tagged with the XML tags <p>, <mtd>, <fx>, <obj>, <val>, and <attr> and with the tags [A] and [T], which belong to the same annotation group ID.
(Sentence before tagging)
"In the present invention, by using powdered coke, crude steel is produced, and a twofold efficiency gain is achieved."
(Sentence after tagging)
"<p>In the present invention, [A]<mtd>by using powdered coke</mtd>, <fx>[T]<obj>crude steel</obj> is produced[/T][/A], and a <val>twofold</val> <attr>efficiency gain</attr> is achieved</fx>.</p>"
As shown in FIG. 3, the root element of the text-sharing DOM tree has edges to the XML tag <p> and to the annotation tag [A]. In other words, the text-sharing DOM tree has a structure in which the root elements of the DOM tree formed by the XML tags and the DOM tree formed by the annotation tags belonging to the same annotation group are shared. The text is split at every tag boundary, regardless of whether the tag is an XML tag or an annotation tag. Then, separately for the XML tags and for the annotation tags, each text segment is assigned, based on the inclusion relations of the text regions, to the deepest element that contains it. In the example of FIG. 3, the text "crude steel" is assigned to both the XML tag <obj> and the annotation tag [T].
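The splitting and assignment described above can be sketched as follows: cut the text at every tag boundary, then, for each tree separately, pick the smallest region that contains each segment. This is an illustrative sketch, not the procedure of the text assignment unit 418. The English sample sentence, the function names, and the use of the label "root" for segments that no tag in a tree contains are assumptions.

```python
TEXT = ("In the present invention, by using powdered coke, crude steel is "
        "produced, and a twofold efficiency gain is achieved.")

def span(sub):
    """Return half-open (start, end) offsets of a substring of TEXT."""
    start = TEXT.index(sub)
    return (start, start + len(sub))

# One dictionary of tag regions per tree (XML tree and one annotation group).
TREES = {
    "xml": {
        "p": span(TEXT),
        "mtd": span("by using powdered coke"),
        "fx": span("crude steel is produced, and a twofold efficiency gain is achieved"),
        "obj": span("crude steel"),
        "val": span("twofold"),
        "attr": span("efficiency gain"),
    },
    "annotation": {
        "A": span("by using powdered coke, crude steel is produced"),
        "T": span("crude steel is produced"),
    },
}

def deepest_container(spans, seg):
    """Name of the smallest region containing seg, or None if there is none.

    Within one tree the regions are nested, so the smallest containing
    region is also the deepest one.
    """
    s, e = seg
    containing = [(end - start, name)
                  for name, (start, end) in spans.items()
                  if start <= s and e <= end]
    return min(containing)[1] if containing else None

def assign_text(text, trees):
    """Split text at all tag boundaries and assign each segment per tree."""
    cuts = sorted({0, len(text)} | {pos for spans in trees.values()
                                    for region in spans.values()
                                    for pos in region})
    result = []
    for s, e in zip(cuts, cuts[1:]):
        owners = {tree: deepest_container(spans, (s, e)) or "root"
                  for tree, spans in trees.items()}
        result.append((text[s:e], owners))
    return result

for segment, owners in assign_text(TEXT, TREES):
    print(repr(segment), owners)
```

Because the cut points come from both trees, a single segment such as "crude steel" ends up with one owner per tree, which is how the same text is shared between the XML structure and the annotation structure.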

[Details of Preprocessing]
FIG. 4 shows the flow of data in the preprocessing. The instruction to start preprocessing is given to the CPU 401 through the user interface unit 406 (step S301). The CPU 401 then reads the XML document set 407 and the annotation data set 408 from the auxiliary storage device 403A (step S302). The XML document set 407 and annotation data set 408 that have been read are stored in the main storage device 402, which serves as the work area.

At this stage, the CPU 401 executes the analysis processing as the text-sharing DOM tree construction unit 415 and constructs the document structure list 412 (step S303).

The CPU 401 (text-sharing DOM tree construction unit 415) generates the text-sharing DOM tree 416 from an XML document, which is an element of the XML document set 407, and the corresponding annotation data, which is an element of the annotation data set 408. The details of the preprocessing executed by the CPU 401 (text-sharing DOM tree construction unit 415) are described later. When the preprocessing ends, the CPU 401 outputs the generated text-sharing DOM trees 416 to the auxiliary storage device 403A as the document structure list 412 (step S304).

[DOM tree construction unit 410 for annotation data]
FIG. 5 is a flowchart showing an example of the processing performed by the CPU 401 when it functions as the DOM tree construction unit 410, which constructs DOM trees from annotation data.

The CPU 401 (DOM tree construction unit 410) reads the input annotation data file and prepares a set AS of 4-tuples of the form (tag name, annotation group ID, text region start position, text region end position) (step S401). The CPU 401 sets g to the largest annotation group ID in the set AS (step S402). Next, the CPU 401 sets the variable i to 1 (step S403). The CPU 401 then determines whether i is less than or equal to g, and repeats steps S405 to S413 described below until i exceeds g (step S404).

If the result of step S404 is affirmative, the CPU 401 sets AL to the list of elements of the set AS whose annotation group ID equals i (step S405). Next, the CPU 401 sorts the elements of the list AL in ascending order of text region start position; elements with the same start position are sorted in descending order of text region end position (step S406).

Subsequently, the CPU 401 prepares a DOM tree AT consisting only of a root element, and sets v to that root element (step S407). The CPU 401 then sets the variable j to 1 (step S408). Next, the CPU 401 repeats steps S410 to S412 described below until j exceeds the length (number of elements) of the list AL (step S409). When the repetition ends, the CPU 401 adds 1 to i and returns to step S404 (step S413).

If the result of step S409 is affirmative, the CPU 401 repeatedly sets v to the parent element of v until the text region of the j-th element of the list AL (the region given by its text region start and end positions) is contained in the text region of the element v, which is initially the root element (step S410).

After this, the CPU 401 creates a DOM tree element having the tag name, annotation group ID, text region start position, and text region end position of the j-th element of the list AL, and adds it to the DOM tree AT as a child of the element v (step S411). The CPU 401 then adds 1 to j and returns to step S409 (step S412).

In this way, for each annotation group, the CPU 401 first sorts the elements by the start and end positions of their text regions and then applies the procedure above to construct the DOM tree of the annotation data. The DOM tree is an ordered tree, and adding elements by this procedure leaves sibling elements sorted in ascending order of text region start position.
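
Steps S401 to S413 can be sketched in Python roughly as follows. This is a minimal illustration, not the implementation of the embodiment: the names `Node` and `build_annotation_trees` are hypothetical, the root is assumed to span the whole text, and the sketch assumes that v moves to the newly added element after step S411, which the description does not state explicitly but which nesting appears to require.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    tag: str
    group: int
    start: int
    end: int
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    children: List["Node"] = field(default_factory=list)


def build_annotation_trees(annotations, text_length):
    """annotations: iterable of (tag, group_id, start, end) tuples (the set AS)."""
    annotations = list(annotations)
    trees = {}
    g = max((a[1] for a in annotations), default=0)         # S402
    for i in range(1, g + 1):                               # S403, S404, S413
        al = [a for a in annotations if a[1] == i]          # S405
        al.sort(key=lambda a: (a[2], -a[3]))                # S406: start asc, end desc
        root = Node("ROOT", i, 0, text_length)              # S407 (root spans the text)
        v = root
        for tag, group, start, end in al:                   # S408, S409, S412
            # S410: climb until v's text region contains the new region.
            while not (v.start <= start and end <= v.end):
                v = v.parent
            child = Node(tag, group, start, end, parent=v)  # S411
            v.children.append(child)
            v = child  # assumption: continue from the element just added
        trees[i] = root
    return trees
```

Because the input is sorted by start position before insertion, the children of every node end up in ascending order of start position, which matches the ordered-tree property noted above.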

[DOM tree construction unit 409 for XML documents]
As described above, the CPU 401 also functions as the DOM tree construction unit 409, which constructs a DOM tree from an XML document. In this role, the CPU 401 analyzes the inclusion relationships of the tags in the input XML document and constructs a DOM tree. The DOM tree of an XML document can be constructed by the method described in Non-Patent Document 1. The DOM tree constructed in this way does not contain the start and end positions of the text region corresponding to each tag.

In this embodiment, the CPU 401 calculates the start and end positions in the text for the text elements assigned to the DOM tree constructed from the XML document. The CPU 401 then attaches the calculated position information to each element, deletes the text elements, and creates a text by concatenating the character strings held by the text elements.

Let t be an initially empty text. The CPU 401 traverses the constructed DOM tree from the root element, recording the length of the text corresponding to each text element, and records each element's start position in pre-order and its end position in post-order. After visiting each text element, the CPU 401 deletes the text element and appends the character string it held to the text t. Finally, the CPU 401 outputs the constructed DOM tree and the text t, and ends the processing.
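
The traversal can be illustrated with a short Python sketch built on the standard `xml.dom.minidom` module. The function name `flatten_with_offsets` and the sample document are assumptions made for this example; the sketch returns the positions as a list instead of attaching them to the DOM elements and deleting the text nodes.

```python
from xml.dom import minidom


def flatten_with_offsets(xml_source):
    """Return ([(tag, start, end), ...] in pre-order, concatenated text t)."""
    document = minidom.parseString(xml_source)
    pieces = []   # character strings of the text nodes, in document order
    spans = []
    position = 0  # length of the text collected so far

    def visit(node):
        nonlocal position
        if node.nodeType == node.TEXT_NODE:
            pieces.append(node.data)
            position += len(node.data)
            return
        if node.nodeType != node.ELEMENT_NODE:
            return  # comments, CDATA sections and so on are ignored here
        span = [node.tagName, position, None]  # start recorded on entry (pre-order)
        spans.append(span)
        for child in node.childNodes:
            visit(child)
        span[2] = position                     # end recorded on exit (post-order)

    visit(document.documentElement)
    return [tuple(s) for s in spans], "".join(pieces)
```

For the hypothetical input `<p><fx>produce <obj>crude steel</obj></fx> here</p>`, the element `obj` receives the region from 8 to 19 and the text t becomes `produce crude steel here`.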

[Document structure list construction unit 411]
FIG. 6 is a flowchart showing an example of the processing executed by the CPU 401 when it functions as the document structure list construction unit 411.

First, the CPU 401 takes the XML document set 407 and the annotation data set 408 as input (step S501). Next, the CPU 401 initializes the document structure list 412 to an empty list (step S502). The CPU 401 then sets the variable i to 1 (step S503) and initializes the text data list to empty (step S504).

After this, the CPU 401 determines whether all pairs of an XML document and its annotation data in the XML document set 407 and the annotation data set 408 have been processed (step S505). Until this determination yields an affirmative result, the CPU 401 repeats steps S506 to S510 described below.

If the result of step S505 is negative, the CPU 401 reads an unprocessed pair of an XML document and annotation data (step S506). Next, the CPU 401 functions as the DOM tree construction unit 409 for XML documents and the DOM tree construction unit 410 for annotation data, and creates the corresponding DOM trees (step S507).

After this, the CPU 401 takes as input the DOM tree obtained from the XML document and the DOM trees obtained from the tags belonging to each annotation group, functions as the text sharing DOM tree construction unit 415 described later to create the text sharing DOM tree 416, and obtains a list N of the elements that make up the created tree (step S508). Next, the CPU 401 adds the list N to the document structure list (text sharing DOM tree list) 412 (step S509).

Next, the CPU 401 marks the XML document and annotation data that were read as processed, and returns to step S505 (step S510).

[Text sharing DOM tree construction unit 415]
The CPU 401 functioning as the text sharing DOM tree construction unit 415 receives as input the DOM tree of XML elements, the text in the XML document, and the DOM trees made up of the elements of each annotation group. The CPU 401 then prepares a list N of the elements of each DOM tree other than the root elements.

The CPU 401 inputs the list N and the text in the XML document to the text data / text element list construction unit 417, and updates the text data list 414 and the text element list 413.

After this, the CPU 401 receives the text element list 413 and the DOM trees made up of the XML elements and of the elements of each annotation group. Functioning as the text assignment unit 418, the CPU 401 assigns the text elements to the elements as child elements. The CPU 401 adds each element of the text element list 413 to the list N. The CPU 401 also merges the root elements of the DOM trees into a single shared root element and adds it to the head of the list N. Finally, the CPU 401 outputs the list N and ends the creation of the text sharing DOM tree.

[Text data / text element list construction unit 417]
The CPU 401 functioning as the text data / text element list construction unit 417 receives the list N and the text t as input. The CPU 401 extracts the start and end positions of the text region of each element in the list N, and lets S be the list of these values sorted in ascending order. The CPU 401 removes duplicate values from the list S. If the first value of the list S is not 0, the CPU 401 inserts 0 at the head. If the last value of the list S equals the length of the text t, the CPU 401 deletes that last value. The CPU 401 then appends the text t and the list S, as text data, to the end of the text data list 414.

The CPU 401 initializes the text element list 413 to empty. Each value in the list S marks the start or end position of a text region. Accordingly, working through the list S from its first value, the CPU 401 creates an element of the DOM DAG 421 whose text region start position is the current value, whose end position is the next value, and whose tag name is #, and appends each such element to the end of the text element list 413. For the last value in the list S there is no next value, so the CPU 401 uses the length of the text t as the text region end position.
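
The construction of the boundary list S and of the text elements can be sketched as below. The function name `build_text_elements` and the tuple representation `("#", start, end)` are assumptions for illustration, and the sketch reads the condition on the last value of S as a comparison with the length of the text t.

```python
def build_text_elements(regions, text):
    """regions: iterable of (start, end) pairs taken from the elements of list N."""
    # Collect every start and end position, drop duplicates, sort ascending.
    boundaries = sorted({p for region in regions for p in region})
    if not boundaries or boundaries[0] != 0:
        boundaries.insert(0, 0)
    if boundaries[-1] == len(text):
        boundaries.pop()
    elements = []
    for k, start in enumerate(boundaries):
        # The last boundary has no successor, so the text length closes it.
        end = boundaries[k + 1] if k + 1 < len(boundaries) else len(text)
        elements.append(("#", start, end))
    return boundaries, elements
```

With the hypothetical regions (0, 24), (0, 19), and (8, 19) over a 24-character text, S becomes [0, 8, 19] and three non-overlapping text elements are produced: 0 to 8, 8 to 19, and 19 to 24.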

[Text assignment unit 418]
The elements of the text element list 413 have no inclusion relationships with one another because their text regions do not overlap. The CPU 401 functioning as the text assignment unit 418 can therefore create, from the elements of the text element list 413 and a root element, a DOM tree in which all text elements are siblings. Let Tt be this DOM tree. Using the DOM tree of each annotation group, the CPU 401 registers parents outside the annotation group for the elements of the DOM tree Tt through its function as the parent-child relationship analysis / registration unit 419 described later.

[Inclusion relationships between elements]
In this embodiment, the inclusion relationship between an XML element or annotation element and a text element follows the rule below. If the text region of the XML element or annotation element contains, or is identical to, the text region of the text element, the XML element or annotation element includes the text element. Otherwise, there is no inclusion relationship.

[Parent-child relationship analysis / registration unit 419]
FIG. 7 is a flowchart showing an example of the processing executed by the CPU 401 when it functions as the parent-child relationship analysis / registration unit 419.

First, the CPU 401 takes a DOM tree T and a DOM tree U as input (step S601).
Next, the CPU 401 lets V be the set of elements of the DOM tree U (step S602) and determines whether V is empty (step S603). Steps S604 to S608 described below are repeated until V becomes empty.

The CPU 401 takes an element v from V and removes it from V (step S604). Among the elements of the DOM tree T that include v, the CPU 401 lets u be the one deepest from the root (step S605). The CPU 401 determines whether u is the root element of the DOM tree T (step S606). If u is not the root element, the CPU 401 registers u as a parent element of v outside v's annotation group and returns to step S603 (step S607).
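
The loop of FIG. 7 can be sketched as follows. The names `Node`, `walk`, and `register_external_parents` are hypothetical, inclusion is reduced here to plain containment of text regions, and the root of U is skipped, which is an assumption of this sketch and is not stated in the description.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    tag: str
    start: int
    end: int
    children: List["Node"] = field(default_factory=list)
    external_parents: List["Node"] = field(default_factory=list)


def walk(node, depth=0):
    """Yield (node, depth) pairs in pre-order."""
    yield node, depth
    for child in node.children:
        yield from walk(child, depth + 1)


def register_external_parents(t_root, u_root):
    """For each element v of U, register the deepest element of T that contains v."""
    t_nodes = list(walk(t_root))
    for v, _ in walk(u_root):                       # S602 to S604
        if v is u_root:
            continue  # assumption: the root of U itself is not processed
        best, best_depth = t_root, 0
        for u, depth in t_nodes:                    # S605: deepest containing element
            if u.start <= v.start and v.end <= u.end and depth > best_depth:
                best, best_depth = u, depth
        if best is not t_root:                      # S606
            v.external_parents.append(best)         # S607
```

Calling the function once in each direction, for example with the XML tree as T and an annotation tree as U and then with the roles swapped, gives every element its parents from the other tree.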

[Overview of the search operation]
Next, an overview of the search processing executed in the structured document search apparatus 400 is given. The search processing is executed by the CPU 401 functioning as the location path search unit 420 described later. In this role, the CPU 401 follows the text sharing DOM tree constructed from each XML document and its annotation data along the location path, and obtains the elements that match the search query as the search result.

FIG. 3 illustrates a search of the text sharing DOM tree for the elements common to the location path “/p/fx/obj” and the location path “/A/T”. In the example of FIG. 3, the location path “/p/fx/obj” denotes the elements assigned under the tag <obj>, which is under the tag <fx>, which is under the tag <p> under the root element. The location path “/A/T” denotes the elements assigned under the tag [T], which is under the tag [A] under the root element. In FIG. 3, the two location paths are drawn as dotted arrows. The element common to both is the text element “crude steel”.

[Details of the search operation]
FIG. 8 shows the flow of data when the search processing is executed in the structured document search apparatus 400. It is assumed that the CPU 401 has read the document structure list 412 used for the search (in this embodiment, the text sharing DOM tree 416) from the auxiliary storage device 403A into the main storage device 402 in advance (step S401).

In this state, the user submits a location path as a search query to the structured document search apparatus 400 (specifically, the CPU 401) through the user interface 406 (step S402).

The CPU 401 functioning as the location path search unit 420 then accesses the text sharing DOM tree 416 in the document structure list 412 and computes the locations of the elements that match the location path specified as the search query (step S403). After this, the CPU 401 outputs the set of elements matching the search query to the user interface 406 (step S405).

[Location path search unit 420]
Finally, the operation of the CPU 401 functioning as the location path search unit 420 is described. The first entry of each element list in the document structure list 412 is the root element of a text sharing DOM tree. The CPU 401 therefore obtains the elements of each text sharing DOM tree that match the search query by following the tree from its root element along the location path.

[Effects of the embodiment]
As preprocessing, the structured document search apparatus 400 according to this embodiment generates the text sharing DOM tree 416, in which a root element is shared between the DOM tree constructed from the XML tags and the DOM trees constructed from the tags belonging to each annotation group. The text sharing DOM tree 416 contains the structure of each DOM tree, starting from the common root element. Consequently, even when annotations are attached to an XML document in an arbitrary format, a search that takes both the XML hierarchy and the annotation hierarchy into account becomes possible.

[Second embodiment]
This embodiment defines inclusion relationships between elements of different kinds, such as between an XML element and an annotation element, or between annotation elements belonging to different annotation groups. For this purpose, this embodiment uses a DOM DAG (DAG: directed acyclic graph), which extends the structure of the text sharing DOM tree of the first embodiment. The basic configuration of the structured document search apparatus 400 according to this embodiment is the same as in the first embodiment, that is, the configuration shown in FIGS. 1-1 and 1-2. In this embodiment, however, the DOM DAG construction unit 422 is used in place of the text sharing DOM tree construction unit 415.

[Overview of preprocessing]
As stated above, this embodiment describes a structure search that uses a DOM DAG, in which the inclusion relationships of text regions between XML elements and annotation elements, and between elements belonging to different annotation groups, are also treated as parent-child relationships. Introducing the DOM DAG yields a structured document search apparatus that can execute searches that take the inclusion relationships between elements of different kinds into account.

[DOM DAG]
The DOM DAG used in this embodiment is described here. The DOM DAG is a data structure that, on top of the text sharing DOM tree, also records the inclusion relationships between elements of different kinds, such as XML tags and tags belonging to different annotation groups. In other words, it is a data structure in which parent-child relationships based on inclusion are recorded between the DOM trees that share a root element.

In the DOM DAG, for example, an element of a DOM tree T2 is treated as having a parent-child relationship with the deepest element, measured from the root element of a DOM tree T1, among the elements of T1 that include it, and a link is established between the DOM tree T1 and the DOM tree T2. This expresses the inclusion relationship from the element of the DOM tree T1 to the element of the DOM tree T2.

FIG. 9 shows an example of a DOM DAG constructed from the text sharing DOM tree 416 shown in FIG. 3. In FIG. 9, parent-child relationships are also established between tags of different kinds, such as the link from the XML tag <fx> to the annotation tag [T] and the link from the annotation tag [T] to the XML tag <obj>.

[Inclusion relationships of elements in the DOM DAG]
The inclusion relationships of elements in a DOM tree or DOM DAG follow the rules below.
(1) The root element includes every element other than the root element.
(2) If an XML element and an annotation element have the same text region start and end positions, the XML element includes the annotation element.
(3) If two annotation elements have the same text region start and end positions, the element with the smaller annotation group ID includes the element with the larger annotation group ID.
(4) If a text element and an element other than a text element have the same text region start and end positions, the latter includes the former.
(5) If rules (1) to (4) do not determine the inclusion relationship, it is determined from the inclusion relationship of the elements' text regions. However, if the text region of a text element contains the text region of another element, there is no inclusion relationship between the elements.
(6) If rules (1) to (5) do not determine the inclusion relationship, the elements do not include each other.
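
Rules (1) to (6) can be read as a single predicate, sketched below in Python. The names `Elem` and `includes` and the string values of `kind` are assumptions of this example. The rules do not say how two elements of the same kind with identical text regions are ordered, so the sketch treats that case as falling under rule (6); this is an interpretation, not something the description states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Elem:
    kind: str        # "root", "xml", "ann" (annotation), or "text"
    start: int = 0
    end: int = 0
    group: int = 0   # annotation group ID, used only when kind == "ann"


def includes(a, b):
    """Return True if element a includes element b under rules (1) to (6)."""
    if a.kind == "root":                                    # rule (1)
        return b.kind != "root"
    if b.kind == "root":
        return False
    if (a.start, a.end) == (b.start, b.end):
        kinds = (a.kind, b.kind)
        if kinds == ("xml", "ann"):                         # rule (2)
            return True
        if kinds == ("ann", "xml"):
            return False
        if kinds == ("ann", "ann") and a.group != b.group:  # rule (3)
            return a.group < b.group
        if "text" in kinds and a.kind != b.kind:            # rule (4)
            return b.kind == "text"
        return False  # assumption: remaining ties are handled as rule (6)
    if a.kind == "text":                                    # exception in rule (5)
        return False
    return a.start <= b.start and b.end <= a.end            # rule (5)
```

Keeping the tie-breaking rules ahead of the plain region test means that equal regions never produce a mutual inclusion, which is needed for the resulting graph to stay acyclic.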

[DOM DAG construction unit 422]
The CPU 401 functioning as the DOM DAG construction unit 422 takes as input the DOM tree constructed from the XML elements, the text in the XML document, and the DOM trees constructed from the elements of each annotation group.

The CPU 401 functioning as the parent-child relationship analysis / registration unit 419 assigns parents outside the annotation group for every pair of DOM trees among the annotation groups, with the XML elements counted as one group. The CPU 401 then prepares, for each DOM tree, a list N of the elements other than the root element. The CPU 401 inputs the list N and the text in the XML document to the text data / text element list construction unit 417, and updates the text data list 414 and the text element list 413.

When the CPU 401 functioning as the text assignment unit 418 receives the text element list 413 and the DOM trees made up of the XML elements and of the elements of each annotation group, it assigns the text elements to the elements of the DOM trees as child elements. The CPU 401 adds each element of the text element list 413 to the list N and assigns depth information to each element. The CPU 401 creates parent-child links from each element of the DOM trees to the elements in its list of parent elements outside the annotation group. The CPU 401 also merges the root elements of the DOM trees into a single shared root element and adds it to the head of the list N. After this, the CPU 401 outputs the list N and ends the creation of the DOM DAG.

[Effects of the embodiment]
As preprocessing, the structured document search apparatus 400 according to this embodiment creates a DOM DAG that also defines inclusion relationships between elements of the different DOM trees making up the text sharing DOM tree 416 described in the first embodiment. Using the structured document search apparatus 400 according to this embodiment therefore enables, in addition to the effects of the first embodiment, searches that take the inclusion (parent-child) relationships between elements of different kinds into account.

[Third embodiment]
As described above, using a DOM DAG enables searches that exploit the structural relationships between tags of different kinds. However, following every constructed DOM DAG from its root element when searching for a location path is inefficient.

This embodiment therefore defines a path DAG, a data structure that aggregates the structures of multiple DOM DAGs. In addition, it enables efficient searches that use a location path as the search query by providing a transposed index (that is, an inverted index) whose entries are elements of the path DAG and whose values are elements of the DOM DAGs.

In this embodiment as well, the basic configuration of the structured document search apparatus 400 is the same as in the first embodiment, that is, the configuration shown in FIGS. 1-1 and 1-2. In this embodiment, however, the functions of the DOM DAG construction unit 422 and the location path search unit 420 are extended.

[Overview of preprocessing]
In this embodiment, the transposed index construction unit 424 executes the preprocessing. The CPU 401 functioning as the transposed index construction unit 424 initializes the path DAG 423 to a data structure consisting only of a root element, and constructs a DOM DAG from each XML document and its corresponding annotation data.

The CPU 401 functioning as the path DAG ID acquisition unit 428 determines, from the tag name and parent elements of each element of the DOM DAG, whether the structure is already registered in the path DAG. If the structure is already registered, the CPU 401 gives it the path DAG ID, which is the ID of the corresponding element in the path DAG. If the structure is not registered, the CPU 401 functions as the path DAG element generation / registration unit 429, creates a new element in the path DAG 423, and assigns a path DAG ID to the created element. After obtaining the path DAG ID, the CPU 401 adds the element to the entry of the transposed index 425 that corresponds to that path DAG ID.

FIG. 10 shows an example of the path DAG and the transposed index 425 generated after two DOM DAGs have been registered. As shown in FIG. 10, the path DAG 423 is constructed so that elements with the same set of parent nodes are shared between the DOM DAG constructed from XML document 1 and the DOM DAG constructed from XML document 2. In the figure, white elements are XML elements and black elements are annotation elements.

However, an element such as element c, whose set of parents differs between the two documents, is registered as different elements in the path DAG 423. In FIG. 10 they are registered separately as “c:1” and “c:2”.

Element d has element c as its only parent in both documents, so its set of parents appears to be the same. However, as described above, element c is registered as different elements in the path DAG, so element d is also registered as different elements in the path DAG. Each element of a DOM DAG is given a number that pairs the position of the document with the order in which the tag appears in the document. The number of each element is recorded in the transposed index 425 under the entry for the path DAG ID of the corresponding path DAG element.

なお、図１０には、パスＤＡＧ４２３に対応する転置インデクス４２５の構造もＩＤとタグ番号の関係として記載されている。この転置インデクスの生成方法については、次項において説明する。 In FIG. 10, the structure of the transposed index 425 corresponding to the path DAG 423 is also described as the relationship between the ID and the tag number. A method for generating this transposed index will be described in the next section.

[Transposed index construction unit 424]
FIG. 11 is a flowchart showing the processing performed by the CPU 401 functioning as the transposed index construction unit 424.

First, the CPU 401 receives the XML document set 407 and the annotation data set 408 as input (step S1001). Next, the CPU 401 initializes the DOM DAG list to an empty list (step S1002) and initializes the transposed index 425 to an empty table (step S1003). The CPU 401 also sets the variable i to 1 (step S1004) and initializes the text data list to empty (step S1005).

The CPU 401 then determines whether every pair of XML document and annotation data contained in the XML document set 407 and the annotation data set 408 has been processed (step S1006). Until this determination yields a positive result, the CPU 401 repeats steps S1007 to S1013 described below.

The CPU 401, functioning as the DOM DAG construction unit 422, reads an unprocessed pair of XML document and annotation data (step S1007). The CPU 401 then functions as the DOM tree construction unit 409 for XML documents and the DOM tree construction unit 410 for annotation data, and creates a DOM tree from the XML document and from the annotation data (step S1008).

Next, the CPU 401 functions as the DOM DAG construction unit 422 and creates a DOM DAG 421 from the DOM tree obtained from the XML document and the DOM trees obtained from the tags belonging to each annotation group, obtaining a list N of the elements that make up the DOM DAG 421 (step S1009).

Subsequently, the CPU 401 passes the list N to the DOM DAG element list sorting unit 426, described below, which sorts it (step S1010). The CPU 401 then passes the list N and the variable i to the transposed index registration unit 442, which registers each element in the transposed index 425 (step S1011). After that, the CPU 401 appends the list N to the DOM DAG list (step S1012), marks the XML document and annotation data it read as processed, and returns to step S1006 (step S1013).

[DOM DAG element list sorting unit 426]
This section describes the DOM DAG element list sorting unit 426, which executes step S1010. As with the other units, its function is realized by the CPU 401 executing a program. The CPU 401, functioning as the DOM DAG element list sorting unit 426, takes the list N of DOM DAG elements as input and reorders the elements of list N into their order of appearance in the document, based on text region start positions and inclusion relations. List N can be sorted efficiently by first arranging the elements of the DOM tree of the XML document and of the DOM tree of each annotation group in pre-order, and then merging the ordered lists in the manner of merge sort. The ordering obeys the following rules.
(1) The root element comes before every other element.
(2) An element with an earlier text region start position comes first; when the start positions are equal, the containing element comes first.
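
The two ordering rules can be sketched as a comparison function. This is a minimal illustration rather than the sorting unit itself: the `Element` class and its field names are hypothetical, and containment is approximated by comparing text regions only (for equal start positions, the wider region is treated as the containing one), whereas the description relies on the actual inclusion relation between tags.

```python
from dataclasses import dataclass
from functools import cmp_to_key


@dataclass(frozen=True)
class Element:          # hypothetical stand-in for a DOM DAG element
    name: str
    start: int          # text region start position
    end: int            # text region end position
    is_root: bool = False


def compare(a: Element, b: Element) -> int:
    # Rule (1): the root element precedes every other element.
    if a.is_root != b.is_root:
        return -1 if a.is_root else 1
    # Rule (2), first half: earlier start position comes first.
    if a.start != b.start:
        return -1 if a.start < b.start else 1
    # Rule (2), second half: same start, so the containing (wider) region first.
    # Assumption: containment is judged from the region extent alone.
    if a.end != b.end:
        return -1 if a.end > b.end else 1
    return 0


def sort_elements(elements):
    # sorted() is stable, so elements with identical regions keep their input order.
    return sorted(elements, key=cmp_to_key(compare))
```

Elements whose regions are identical compare as equal here; a fuller implementation would have to consult the tag structure to break such ties.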

[Depth assignment unit 427]
This section describes the CPU 401 functioning as the depth assignment unit 427. FIG. 12 is a flowchart showing an example of the processing of the depth assignment unit 427.

First, the CPU 401 takes as input the list N of elements of the DOM DAG 421 (step S1101). Next, the CPU 401 lets D be the depth of the deepest element in the XML document and prepares an array E of length D (step S1102). The CPU 401 then sets the variable i to 1 and the variable d to 0 (step S1103), and determines whether i is less than or equal to the length of N (step S1104). Until i is found to exceed the length of N, the CPU 401 repeats steps S1105 to S1113 described below.

If the result of step S1104 is positive, the CPU 401 lets v be the i-th element of list N (step S1105) and determines whether v is an XML element or a text element (step S1106). If it is, the CPU 401 sets d to the depth of v from the root element of the DOM DAG 421 (step S1107), stores the text region end position of v in the d-th element of E, and returns to step S1104 (step S1108).

If step S1106 finds that v is neither an XML element nor a text element, the CPU 401 determines whether v is the start element of an annotation tag (step S1109). If v is the start element of an annotation tag, the CPU 401 sets the variable e to the text region start position of v (step S1110). If v is not the start element of an annotation tag, the CPU 401 sets the variable e to the text region end position of v (step S1111).

Whether e was set from the start position or from the end position, the CPU 401 repeatedly subtracts 1 from d until d becomes 1 or the (d-1)-th element of E is greater than e (step S1112). The CPU 401 then records d as the depth of v, adds 1 to i, and returns to step S1104 (step S1113).
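
The loop of steps S1104 to S1113 can be transcribed roughly as follows. This is a sketch under several assumptions that the description leaves open: the `Node` class and its `kind` values are hypothetical, XML and text elements are assumed to carry their depth already, the array E is given one extra slot so that a depth can be used directly as an index, and the loop advances to the next element after an XML or text element as well.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:                       # hypothetical DOM DAG element record
    kind: str                     # "xml", "text", "ann_start" or "ann_end"
    start: int                    # text region start position
    end: int                      # text region end position
    depth: Optional[int] = None   # assumed known for "xml" and "text" nodes


def assign_depths(nodes: List[Node], max_depth: int) -> List[int]:
    # E[k]: end position of the most recent XML/text element seen at depth k.
    E = [0] * (max_depth + 1)
    d = 0
    for v in nodes:                                # list N, already sorted
        if v.kind in ("xml", "text"):              # steps S1106 to S1108
            d = v.depth
            E[d] = v.end
        else:                                      # steps S1109 to S1113
            e = v.start if v.kind == "ann_start" else v.end
            # Step S1112: step back until d is 1 or E[d-1] is greater than e.
            while d > 1 and E[d - 1] <= e:
                d -= 1
            v.depth = d
    return [v.depth for v in nodes]
```

With this reading, an annotation element receives the depth of the text elements that sit beside it under the innermost XML element still open at position e.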

[Path DAG ID acquisition unit 428]
This section describes the path DAG ID acquisition unit 428, which obtains path DAG IDs for the elements of a DOM DAG. FIG. 13 is a flowchart showing an example of the processing performed by the CPU 401 functioning as the path DAG ID acquisition unit 428.

First, the CPU 401 takes as input the list N of elements of the DOM DAG 421 (step S1201). Next, the CPU 401 sets the variable i to 1 (step S1202). The CPU 401 then determines whether i is less than or equal to the length of N, and repeats steps S1204 to S1213 described below until i is found to exceed the length of N (step S1203).

If i is less than or equal to the length of list N, the CPU 401 lets v be the i-th element of list N (step S1204). Next, the CPU 401 determines whether element v is an annotation end tag element (step S1205). If it is, the CPU 401 adds 1 to i and returns to step S1203 (step S1206). If it is not, the CPU 401 lets V be the set of parent elements of v in the DOM DAG 421 (step S1207), lets P be the set of elements of the path DAG 423 corresponding to the elements of V (step S1208), and lets I be the set of elements of the path DAG 423 that are common child elements of the elements in P and have the same tag name as v (step S1209).

The CPU 401 then determines whether I contains an element whose set of parent elements is identical to P (step S1210). If it does, the CPU 401 registers the path DAG ID of that element in element v and returns to step S1203 (step S1211). If it does not, the CPU 401 passes element v and the set P to the path DAG element generation/registration unit 429, described below (step S1212). The CPU 401 then adds 1 to i and returns to step S1203 (step S1213).
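
The lookup in steps S1209 to S1212 amounts to a find-or-create operation keyed on the tag name and the exact parent set. The sketch below illustrates that idea only. `PathNode`, `PathDag` and `find_or_create` are hypothetical names, and a plain integer counter stands in for the three-part path DAG IDs used in the description.

```python
class PathNode:
    """Hypothetical path DAG element."""

    def __init__(self, node_id, tag):
        self.node_id = node_id
        self.tag = tag
        self.parents = set()
        self.children = set()


class PathDag:
    def __init__(self):
        self.root = PathNode(0, "")
        self._next_id = 1          # assumption: simple integer IDs

    def find_or_create(self, tag, parents):
        """Return the node with this tag whose parent set equals `parents` (set P)."""
        parents = set(parents)     # must be non-empty for non-root elements
        # Step S1209: children shared by every node in P ...
        common = set.intersection(*(p.children for p in parents))
        for cand in common:
            # ... with the same tag, and (step S1210) exactly the same parents.
            if cand.tag == tag and cand.parents == parents:
                return cand
        # Step S1212: no match, so generate and register a new element.
        node = PathNode(self._next_id, tag)
        self._next_id += 1
        node.parents = parents
        for p in parents:
            p.children.add(node)
        return node
```

Because the parent set is part of the key, an element c under {b} and an element c under {b, annotation} become distinct nodes, and their children d are distinct as well, which matches the "c:1" and "c:2" case of FIG. 10.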

[Path DAG element generation/registration unit 429]
This section describes the processing of the CPU 401 as the path DAG element generation/registration unit 429 used in step S1212. FIG. 14 is a flowchart showing an example of the processing performed by the CPU 401 functioning as the path DAG element generation/registration unit 429.

The CPU 401 takes as input an element v of the DOM DAG 421 and the set P of elements of the path DAG 423 (step S1301). The CPU 401 lets d be the depth held by element v (step S1302).

Next, the CPU 401 determines whether element v is an annotation element (step S1303). If element v is an annotation element, the CPU 401 gives element v the path DAG ID (2, d, AI) and adds 1 to AI (step S1304). The CPU 401 then creates a path DAG element having the path DAG ID held by element v and makes it a child element of each element of the set P (step S1305).

If the result of step S1303 is negative (that is, element v is an XML element), the CPU 401 determines whether the tag name of element v is "#" (step S1306).

If the tag name of element v is "#", the CPU 401 gives element v the path DAG ID (1, d, XI[d]) (step S1307). If the tag name of element v is not "#", the CPU 401 gives element v the path DAG ID (0, d, XI[d]) (step S1308). After either path DAG ID has been given to element v, the CPU 401 adds 1 to XI[d] (step S1309).
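
The ID numbering of steps S1303 to S1309 can be illustrated with a small allocator. This is a sketch, not the registration unit: the class name and method are hypothetical, the counters are assumed to start at 0, and creation of the path DAG element itself (step S1305) is left out.

```python
class PathDagIdAllocator:
    """Hypothetical allocator for (kind, depth, serial) path DAG IDs."""

    def __init__(self, max_depth: int):
        self.xi = [0] * (max_depth + 1)   # XI[d]: per-depth counter for XML/text
        self.ai = 0                       # AI: single counter for annotations

    def allocate(self, tag: str, depth: int, is_annotation: bool):
        if is_annotation:                         # step S1304
            path_dag_id = (2, depth, self.ai)
            self.ai += 1
            return path_dag_id
        kind = 1 if tag == "#" else 0             # steps S1306 to S1308
        path_dag_id = (kind, depth, self.xi[depth])
        self.xi[depth] += 1                       # step S1309
        return path_dag_id
```

The first component therefore separates XML elements (0), text elements (1) and annotation elements (2), while XML and text elements at the same depth share one serial counter.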

[Transposed index registration unit 442]
Next, the processing of the transposed index registration unit 442 is described. FIG. 15 is a flowchart showing the processing performed by the CPU 401 functioning as the transposed index registration unit 442.

First, the CPU 401 takes as input the list N of DOM DAG elements and the document number n (step S1401). Next, the CPU 401 sets the variable i to 1 (step S1402). The CPU 401 then determines whether i is less than or equal to the length of list N (step S1403). Until i is found to exceed the length of list N, the CPU 401 repeats steps S1404 to S1409 described below.

If i does not exceed the length of list N, the CPU 401 determines whether the i-th element v of list N is the root element (step S1404). If element v is the root element, the CPU 401 adds 1 to i and returns to step S1403 (step S1409). If element v is not the root element, the CPU 401 passes element v to the path DAG ID acquisition unit 428 and obtains the path DAG ID j held by element v (step S1405).

The CPU 401 then determines whether the transposed index 425 lacks an entry corresponding to j (step S1406). If there is no such entry, the CPU 401 creates an entry corresponding to j (step S1408). The CPU 401 then adds the tuple (n, i) to the entry corresponding to j (step S1407). After that, the CPU 401 adds 1 to i and returns to step S1403 (step S1409).
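
The registration loop of steps S1401 to S1409 can be sketched as below. The dictionary-based index, the element records and the `path_dag_id_of` callback are assumptions made for illustration; the callback stands in for the path DAG ID acquisition unit 428.

```python
from typing import Callable, Dict, Hashable, List, Tuple

Index = Dict[Hashable, List[Tuple[int, int]]]


def register_document(
    index: Index,
    elements: List[dict],
    doc_no: int,
    path_dag_id_of: Callable[[dict], Hashable],
) -> None:
    """Add every non-root element of one sorted DOM DAG element list to the index."""
    for i, v in enumerate(elements, start=1):      # i is the 1-based position in N
        if v.get("is_root"):                       # step S1404: skip the root
            continue
        j = path_dag_id_of(v)                      # step S1405
        # Steps S1406 to S1408: create the entry if missing, then append (n, i).
        index.setdefault(j, []).append((doc_no, i))
```

Each entry thus maps one path DAG ID to the (document number, tag number) pairs of all DOM DAG elements sharing that structure.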

[Overview of search operation]
The search operation in this embodiment proceeds as follows. A search by location path is executed by the CPU 401 functioning as the location path search unit 420, using the transposed index 425. The CPU 401 traverses the path DAG along the location path and obtains the path DAG ID of each path DAG element that matches the structure of the location path.

The CPU 401 searches the transposed index 425 with the obtained path DAG ID and obtains the tag numbers of the DOM DAG elements that match the location path. The CPU 401 then retrieves the DOM DAG elements identified by these tag numbers from the DOM DAG list as the search result.

This processing is illustrated with a concrete example based on FIG. 10. FIG. 10 shows a search for elements matching the location path "/a/b/#" using the path DAG 423 and the transposed index 425. In this case, the CPU 401 follows the location path on the path DAG (the structure diagram at lower left) and obtains (1, 3, 1) as the path DAG ID. Next, the CPU 401 looks up the entry for (1, 3, 1) in the transposed index 425 (the table at lower right). In the example of FIG. 10, as the broken-line arrows indicate, the tag numbers (1, 5) and (2, 4) of DOM DAG elements are obtained. The CPU 401 then retrieves the elements corresponding to these tag numbers from the DOM DAG of document 1 and the DOM DAG of document 2.

[Location path search unit 420]
This section describes the processing of the CPU 401 functioning as the location path search unit 420. The CPU 401 traverses the path DAG 423 along the location path and lets i be the path DAG ID of the element reached. Next, the CPU 401 obtains from the transposed index 425 the entry E corresponding to path DAG ID i. Each item in entry E is a pair consisting of the position, within the DOM DAG list, of a DOM DAG's element list and the position of the element within that list. The location path search unit 420 retrieves all of the DOM DAG elements identified in this way and outputs them.
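
The two-stage lookup (path DAG traversal, then transposed index lookup) can be sketched as follows. The data layout is an assumption: the path DAG is a dictionary from ID to a record with `tag` and `children`, the index maps IDs to (document number, position) pairs, and only simple child steps separated by "/" are handled, not the full XPath syntax.

```python
def search_location_path(path_dag, root_id, index, dom_dag_list, location_path):
    """Return the DOM DAG elements whose structure matches a simple location path."""
    steps = [s for s in location_path.split("/") if s]
    frontier = [root_id]
    for tag in steps:
        # Follow child edges whose tag name matches the current step.
        frontier = [
            child
            for node in frontier
            for child in path_dag[node]["children"]
            if path_dag[child]["tag"] == tag
        ]
    hits = []
    # A DAG node may be reached through several parents, so drop duplicates.
    for path_dag_id in dict.fromkeys(frontier):
        for doc_no, pos in index.get(path_dag_id, []):
            hits.append(dom_dag_list[doc_no - 1][pos - 1])   # 1-based numbering
    return hits
```

The structural match is resolved entirely on the small path DAG; the per-document element lists are touched only to fetch the final hits.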

[Effect of the embodiment]
In the structured document search apparatus 400 according to this embodiment, defining a path DAG that aggregates the structures of multiple DOM DAGs improves search efficiency compared with the embodiments described earlier.

[Fourth embodiment]
This section considers how the DOM DAG 421 is held in a computer. In languages such as C, structural information is expressed with pointers. When the structure of the DOM DAG 421 is complex, many pointers are therefore set up between elements, and more memory is needed than for the original XML document and annotation data. This embodiment describes a technique for representing the structure of a DOM DAG with a path DAG and sequence data.

The structured document search apparatus 400 according to this embodiment searches for the elements corresponding to the start and end locations of tags matching a search query, based on element numbers assigned in order of appearance to the start tags of the XML document and to the start and end elements appearing in the annotation data.

[Overview of preprocessing]
The preprocessing in this embodiment is executed by the CPU 401 functioning as the search index construction unit 454. As noted above, the search index 430 describes the structure of each DOM DAG 421 using the path DAG and sequence data. The DOM DAG 421 for each XML document and its corresponding annotation data therefore does not need to be held in the main storage device 402.

First, the CPU 401 constructs a DOM DAG 421 for each XML document and its corresponding annotation data. Next, the CPU 401 functions as the path DAG ID acquisition unit 428 and obtains the path DAG ID of each element of the constructed DOM DAG 421. The CPU 401 then registers the obtained path DAG IDs in the sequence data in order.

FIG. 16 shows how the DOM DAGs of FIG. 10 are recorded as sequence data. As shown at upper left, the elements of each DOM DAG 421 are given serial numbers in order of appearance. In FIG. 16, the elements of the DOM DAG of document 1 are numbered "1" to "7" in order of appearance, and the elements of the DOM DAG of document 2 are numbered "8" to "15" in order of appearance. As shown at upper right, for the path DAG 423 that aggregates the two DOM DAGs 421, the path DAG IDs are recorded as sequence data in order of appearance within each level.

[Search index 430]
FIG. 17 shows an overview of the search index 430. The search index 430 consists of the path DAG 423, which aggregates the structures of multiple DOM DAGs; a group of sequence data in which the IDs given to the elements of the path DAG 423 are assigned to the elements of each DOM DAG and registered in order of element appearance; a group of bit strings; and the text data list 414.

The sequence data group consists of the annotation element determination bit string 431, the text element determination bit string 432, the XML element depth sequence 433, the depth-specific ID sequence list 434, the annotation end tag determination bit string 435, the annotation start tag ID sequence 436, the annotation end tag ID sequence 437, and the annotation element depth sequence 438. The text data list 414 consists of a text formed by concatenating all character strings in each XML document in order of appearance, and a list of the text division positions at which that text is split by XML tags and annotation tags.

[Search index construction unit 454]
FIG. 18 is a flowchart showing an example of the processing performed by the CPU 401 functioning as the search index construction unit 454.

First, the CPU 401 takes the XML document set 407 and the annotation data set 408 as input (step S1701). Next, the CPU 401 initializes the annotation element determination bit string 431, the text element determination bit string 432, the XML element depth sequence 433, the annotation end tag determination bit string 435, the annotation start tag ID sequence 436, the annotation end tag ID sequence 437, and the annotation element depth sequence 438 to empty sequences (step S1702).

Next, letting D be the depth of the deepest element among the XML documents in the XML document set 407, the CPU 401 places D empty sequences in the depth-specific ID sequence list 434 (step S1703).

The CPU 401 then initializes the path DAG 423 to a graph having only a root element and gives the root element the path DAG ID (0, 0, 1) (step S1704). The CPU 401 also initializes the number XD, each element of the sequence XI of length D, and the number AI to 0 (step S1704).

The CPU 401 further initializes the text data list to empty (step S1705).

After that, the CPU 401 determines whether every pair of XML document and annotation data contained in the XML document set 407 and the annotation data set 408 has been processed (step S1706). Until this determination yields a positive result, the CPU 401 repeatedly executes steps S1707 to S1713 described below.

If the result of step S1706 is negative, the CPU 401 reads an unprocessed pair of XML document and annotation data (step S1707). A DOM tree is created from the XML document by the DOM tree construction unit 409 for XML documents, and from the annotation data by the DOM tree construction unit 410 for annotation data (step S1708).

Next, the CPU 401 takes the DOM tree obtained from the XML document and the DOM trees obtained from the tags belonging to each annotation group, creates a DOM DAG 421 by functioning as the DOM DAG construction unit 422, and obtains the list N of its elements (step S1709).

The CPU 401 then passes the list N to the DOM DAG element list sorting unit 426, which sorts it (step S1710). The CPU 401 further passes the list N to the annotation end tag insertion unit 441, described below (step S1711). Next, the CPU 401 passes the list N to the search index registration unit 440, described below (step S1712). The CPU 401 then marks the XML document and annotation data it read as processed and returns to step S1706 (step S1713).

[Annotation end tag element]
An annotation end tag element is an element that represents the end tag of an annotation. It holds the corresponding annotation element, a depth, and a path DAG ID. The annotation end tag element is used to indicate the end position of an annotation tag within the sorted list of elements of the DOM DAG 421.

[Annotation end tag insertion unit 441]
The CPU 401 functioning as the annotation end tag insertion unit 441 takes as input the list N of elements of the DOM DAG 421. The CPU 401 prepares an empty list L of annotation elements.

The CPU 401 scans the elements of list N in order from the beginning. When an element of list N is an annotation element, the CPU 401 creates an end element for the annotation tag, records the corresponding annotation element in it, and appends it to the end of list L. After scanning list N, the CPU 401 inserts the elements of list L into list N in the manner of merge sort, thereby merging list L into list N. The sort order obeys the rules below.

When sorting the elements of the DOM tree, the elements are sorted in descending order according to the following rules.
(1) The root element comes before every other element.
(2) When an element in list N is compared with an annotation end tag element and the text region start position of the former equals the text region end position of the latter, the former comes first if the annotation element held by the latter is contained in the former. Otherwise, the latter comes first.
(3) When an element in list N is compared with an annotation end tag element and the text region end position of the former equals the text region end position of the latter, the former comes first if the annotation element corresponding to the latter contains the former. Otherwise, the latter comes first.

［検索インデクス登録部４４０］
図１９は、検索インデクス登録部４４０として機能するＣＰＵ４０１の処理例を示すフローチャートである。 [Search index registration unit 440]
FIG. 19 is a flowchart illustrating a processing example of the CPU 401 functioning as the search index registration unit 440.

ＣＰＵ４０１は、ＤＯＭＤＡＧ４２１の要素からなるリストＮを入力とする（ステップＳ１８０１）。次に、ＣＰＵ４０１は、変数ｉに１をセットする（ステップＳ１８０２）。その後、ＣＰＵ４０１は、変数ｉがリストＮの長さ以下か判定する（ステップＳ１８０３）。ＣＰＵ４０１は、変数ｉがリストＮの長さを越えるまで、後述するステップＳ１８０４〜Ｓ１８１１を繰り返す。 The CPU 401 receives a list N composed of elements of the DOM DAG 421 (step S1801). Next, the CPU 401 sets 1 to the variable i (step S1802). Thereafter, the CPU 401 determines whether the variable i is equal to or shorter than the length of the list N (step S1803). The CPU 401 repeats steps S1804 to S1811 described later until the variable i exceeds the length of the list N.

変数ｉがリストＮの長さ以下の場合、ＣＰＵ４０１は、リストＮのｉ番目の要素をｖとする（ステップＳ１８０４）。ここで、ＣＰＵ４０１は、要素ｖがアノテーションタグの終了要素か否か判定する（ステップＳ１８０５）。肯定結果が得られた場合、ＣＰＵ４０１は、要素ｖに対応した開始要素のパスＤＡＧＩＤｊを取得し、ＩＤｊの１番目の要素に３をセットする（ステップＳ１８０６）。一方、ステップＳ１８０５で否定結果が得られた場合、ＣＰＵ４０１は、後述するパスＤＡＧＩＤ取得部４２８に要素ｖを入力し、要素ｖが保持するパスＤＡＧＩＤｊを取得する（ステップＳ１８０７、Ｓ１８０８）。 When the variable i is equal to or shorter than the length of the list N, the CPU 401 sets the i-th element of the list N as v (step S1804). Here, the CPU 401 determines whether or not the element v is an end element of the annotation tag (step S1805). If an affirmative result is obtained, the CPU 401 obtains the path DAG IDj of the start element corresponding to the element v, and sets 3 to the first element of IDj (step S1806). On the other hand, if a negative result is obtained in step S1805, the CPU 401 inputs an element v to a path DAG ID acquisition unit 428 described later, and acquires a path DAG IDj held by the element v (steps S1807 and S1808).

After the above processing, the CPU 401 registers ID j in the DN as the path DAG ID (step S1809). The CPU 401 further inputs ID j to the path DAG ID registration unit 439 (step S1810). The CPU 401 then adds 1 to the variable i and returns to step S1803 (step S1811).

[Path DAG ID registration unit 439]
The processing of the path DAG ID registration unit 439, which is used in step S1810 above, is described next. FIG. 20 is a flowchart showing the processing of the CPU 401 functioning as the path DAG ID registration unit 439.

First, the CPU 401 receives a path DAG ID i as input (step S1901). Next, the CPU 401 denotes the j-th element of ID i by i[j] (step S1902).
The CPU 401 then determines whether i[1] is 0 or 1 (step S1903). When i[1] is 0 or 1, the CPU 401 appends 0 to the end of the annotation element determination bit string 431 (step S1904), appends i[2] to the end of the XML element depth sequence 433 (step S1905), and appends i[1] to the end of the text element determination bit string 432 (step S1906).

The CPU 401 further determines whether i[2] is 0 (step S1907). If i[2] is 0, the CPU 401 ends the registration process at this point. If i[2] is not 0, the CPU 401 appends i[3] to the end of the i[2]-th sequence in the depth-specific ID sequence list 434 and ends the registration process (step S1908).

If the result of step S1903 is negative, that is, when i[1] is 2 or 3, the CPU 401 appends 1 to the end of the annotation element determination bit string 431 (step S1910). The CPU 401 further appends i[2] to the end of the annotation element depth sequence 438 (step S1911).

The CPU 401 then determines whether i[1] is 2 (step S1912). When i[1] is 2, the CPU 401 appends 0 to the end of the annotation end tag determination bit string 435, appends i[3] to the end of the annotation start tag ID sequence 436, and ends the registration process (steps S1913 and S1914).

If the result of step S1912 is negative, that is, when i[1] is 3, the CPU 401 appends 1 to the end of the annotation end tag determination bit string 435, appends i[3] to the end of the annotation end tag ID sequence 437, and ends the registration process (steps S1915 and S1916).
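As a non-limiting illustration, the registration flow of steps S1901 to S1916 can be sketched in Python as follows. The class name `SearchIndex`, the function name `register_path_dag_id`, the field names, and the sample path DAG IDs are hypothetical and chosen only for this sketch. The depth-specific ID sequence list 434 is held here as a dictionary keyed by depth, which is an assumption made for brevity.

```python
class SearchIndex:
    """Hypothetical container; comments give the corresponding reference numerals."""

    def __init__(self):
        self.annotation_bits = []    # 431: 0 = XML/text element, 1 = annotation element
        self.text_bits = []          # 432: copy of i[1] for non-annotation elements
        self.xml_depths = []         # 433: depth of each non-annotation element
        self.ids_by_depth = {}       # 434: depth -> IDs (dict is an assumption)
        self.end_tag_bits = []       # 435: 0 = annotation start tag, 1 = end tag
        self.start_tag_ids = []      # 436
        self.end_tag_ids = []        # 437
        self.annotation_depths = []  # 438


def register_path_dag_id(index, path_dag_id):
    """Append one path DAG ID (i[1], i[2], i[3]) to the index sequences."""
    kind, depth, ident = path_dag_id
    if kind in (0, 1):                          # S1903
        index.annotation_bits.append(0)         # S1904
        index.xml_depths.append(depth)          # S1905
        index.text_bits.append(kind)            # S1906
        if depth != 0:                          # S1907
            index.ids_by_depth.setdefault(depth, []).append(ident)  # S1908
        return
    index.annotation_bits.append(1)             # S1910
    index.annotation_depths.append(depth)       # S1911
    if kind == 2:                               # S1912
        index.end_tag_bits.append(0)            # S1913
        index.start_tag_ids.append(ident)       # S1914
    else:
        index.end_tag_bits.append(1)            # S1915
        index.end_tag_ids.append(ident)         # S1916


# Made-up input: five XML elements with one annotation wrapped around the fourth.
index = SearchIndex()
for pid in [(0, 1, 1), (0, 2, 1), (2, 1, 1), (0, 3, 3), (3, 1, 1), (0, 2, 2), (0, 3, 3)]:
    register_path_dag_id(index, pid)
```

In this sketch every sequence only grows at its tail, which matches the append-only behavior described for the registration unit.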

[Overview of search operation]
The search operation according to this embodiment is described here. The search operation in this embodiment is executed by the CPU 401 functioning as the location path search unit 420. The CPU 401 follows the path DAG along the location path given by the search query and acquires the path DAG ID of the path DAG element that matches the structure of the location path.

From the acquired path DAG ID, the CPU 401 determines whether the element to be returned as the search result is an XML element or an annotation element. If it is an XML element, the CPU 401 functions as the XML element search unit 443 and calculates the position, in the number sequence data, of the element that matches the search query. If it is an annotation element, the CPU 401 functions as the annotation element search unit 444 and calculates the position, in the number sequence data, of the element that matches the search query.

Here, the XML element search unit 443 and the annotation element search unit 444 execute the following operations, called rank, select, and nearest, on a number sequence.
(1) rank(c, p): the number of occurrences of c among the elements up to the p-th position of the sequence
(2) select(c, n): the position at which c appears for the n-th time in the sequence
(3) nearest(p, d): among the elements after the p-th position of the sequence, the position of the element closest to p whose value is d or less
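The three operations above can be illustrated with a naive, linear-time Python sketch. It is not the data structure used by the apparatus. It assumes 1-based positions, treats rank as inclusive of the p-th position, and returns -1 when no matching position exists; these conventions and the sample sequence are assumptions made for illustration.

```python
def rank(seq, c, p):
    """Count occurrences of c among the first p elements (1-based, inclusive)."""
    return sum(1 for value in seq[:p] if value == c)


def select(seq, c, n):
    """Return the 1-based position of the n-th occurrence of c, or -1 if absent."""
    count = 0
    for pos, value in enumerate(seq, start=1):
        if value == c:
            count += 1
            if count == n:
                return pos
    return -1


def nearest(seq, p, d):
    """Return the first 1-based position after p whose value is <= d, or -1."""
    for pos in range(p + 1, len(seq) + 1):
        if seq[pos - 1] <= d:
            return pos
    return -1


depths = [1, 2, 3, 2, 3]  # made-up depth sequence
```

For the sample sequence, `select(depths, 3, 2)` locates the second element of depth 3, and `nearest(depths, 3, 2)` finds the next element after position 3 whose depth is 2 or less, which is one way to locate where a subtree ends.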

A series of processing operations is described here with reference to FIG. 16. FIG. 16 shows an example of searching for elements that match the location path "/e/c/d" using the path DAG 423 and the number sequence data.

First, the CPU 401 follows the path DAG along the location path and obtains the path DAG ID (0, 3, 3). In FIG. 16, this path DAG ID is enclosed by a broken line. A first element of "0" in the path DAG ID indicates an XML element, and the second element, "3", indicates the depth.

Accordingly, the CPU 401 searches the third sequence of the depth-specific ID sequence list 434 (the sequence for depth "3") for positions containing "3", the third element of the path DAG ID. In FIG. 16, "3" is at the fourth position in the sequence for depth "3". This element is also enclosed by a broken line in FIG. 16.

Next, the CPU 401 calculates select(3, 4) on the XML element depth sequence 433 and obtains "14", the position at which "3" appears for the fourth time. The CPU 401 then calculates select(0, 14) on the annotation element determination bit string 431, because 0 is registered at the positions of XML elements in that bit string. This operation therefore finds the position of the 14th XML element in the annotation element determination bit string 431. Here, the result is "15".

As shown in the upper left part of FIG. 16, the DOM DAG element corresponding to "15" is "d" in the DOM DAG of XML document 2. Note that "15" is the position of the start tag of the element. The position of the end tag of the element can be obtained in a similar manner.
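The chain of select operations in this example can be sketched in Python as follows. The data below is made up for illustration and is not the data of FIG. 16; the function and variable names are hypothetical, positions are assumed to be 1-based, and -1 is used here to signal that no match exists. Only the start position is computed.

```python
def select(seq, c, n):
    """Return the 1-based position of the n-th occurrence of c, or -1 if absent."""
    count = 0
    for pos, value in enumerate(seq, start=1):
        if value == c:
            count += 1
            if count == n:
                return pos
    return -1


def xml_start_position(path_dag_id, n, ids_by_depth, xml_depths, annotation_bits):
    """Locate the start position of the n-th element carrying the given path DAG ID."""
    _, depth, ident = path_dag_id
    ids = ids_by_depth.get(depth, [])
    if ids.count(ident) < n:
        return -1                                 # fewer than n matches
    k = select(ids, ident, n)                     # position within the depth-specific IDs
    k = select(xml_depths, depth, k)              # position among XML elements
    return select(annotation_bits, 0, k)          # position among all elements


# Toy index: seven elements, the 3rd and 5th being annotation start/end tags.
annotation_bits = [0, 0, 1, 0, 1, 0, 0]
xml_depths = [1, 2, 3, 2, 3]
ids_by_depth = {1: [1], 2: [1, 2], 3: [3, 3]}

first = xml_start_position((0, 3, 3), 1, ids_by_depth, xml_depths, annotation_bits)
second = xml_start_position((0, 3, 3), 2, ids_by_depth, xml_depths, annotation_bits)
```

In the toy index the two elements with path DAG ID (0, 3, 3) sit at overall positions 4 and 7, so `first` and `second` take those values.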

[Location path search unit 420]
FIG. 21 is a flowchart showing the processing of the CPU 401 functioning as the location path search unit 420.

The CPU 401 follows the path DAG 423 along the location path and sets i to the path DAG ID corresponding to the element reached (step S2001). Next, the CPU 401 sets s to the first element of ID i and d to the second element (step S2002). The CPU 401 also sets the variable j to 1 (step S2003). Further, the CPU 401 initializes L as an empty list for holding pairs of the start position and end position of the tags being searched for (step S2004).

The CPU 401 then determines whether s is 0 or 1 (step S2005). If s is 0 or 1, the CPU 401 inputs i and j to the XML element search unit 443 and obtains a tuple t of the start position and end position of the tag found (step S2006).

Next, the CPU 401 determines whether the tuple t is (-1, -1) (step S2007). If the tuple t is not (-1, -1), the CPU 401 appends the tuple t to the end of the list L and increments j by 1 (steps S2008 and S2009). The CPU 401 repeats steps S2006 to S2009 until the tuple t becomes (-1, -1). When the tuple t becomes (-1, -1), the CPU 401 outputs the list L and ends the search process (step S2014).

If the result of step S2005 is negative (that is, s is neither 0 nor 1), the CPU 401 inputs i and j to the annotation element search unit 444 and sets t to the tuple of the start position and end position of the tag found (step S2010).

The CPU 401 then determines whether the tuple t is (-1, -1) (step S2011). In this case as well, if the tuple t is not (-1, -1), the CPU 401 appends the tuple t to the end of the list L and adds 1 to the variable j (steps S2012 and S2013). The CPU 401 repeats steps S2010 to S2013 until the tuple t becomes (-1, -1). When the tuple t becomes (-1, -1), the CPU 401 outputs the list L and ends the search process (step S2014).
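The collection loop of steps S2003 to S2014 can be sketched in Python as follows. The function `collect_matches` and the stub search function are hypothetical; the two search units are passed in as callables that take a path DAG ID and a search number and return a (start, end) tuple, with (-1, -1) as the sentinel described above.

```python
def collect_matches(path_dag_id, search_xml, search_annotation):
    """Gather (start, end) tuples until the selected search unit returns (-1, -1)."""
    s = path_dag_id[0]
    # S2005: 0 or 1 selects the XML element search, anything else the annotation search.
    search = search_xml if s in (0, 1) else search_annotation
    results = []                      # S2004: list L
    j = 1                             # S2003
    while True:
        t = search(path_dag_id, j)    # S2006 / S2010
        if t == (-1, -1):             # S2007 / S2011
            break
        results.append(t)             # S2008 / S2012
        j += 1                        # S2009 / S2013
    return results                    # S2014


# Stub standing in for a search unit: two hits, then the sentinel.
hits = {1: (4, 5), 2: (7, 8)}


def stub_search(path_dag_id, j):
    return hits.get(j, (-1, -1))


xml_results = collect_matches((0, 3, 3), stub_search, None)
annotation_results = collect_matches((2, 1, 1), None, stub_search)
```

Because both branches of the flowchart differ only in which search unit is called, the sketch selects the callable once and shares a single loop.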

[XML element search unit 443]
FIG. 22 is a flowchart showing the processing of the CPU 401 functioning as the XML element search unit 443.

First, the CPU 401 sets i to the input path DAG ID and n to the search number (step S2101). Next, the CPU 401 sets d to the second element of the path DAG ID i and j to the third element (step S2102). Further, the CPU 401 sets m to the number of occurrences of j in the d-th sequence of the depth-specific ID sequence list 434 (step S2103).

The CPU 401 then determines whether m is smaller than n (step S2104). If m is smaller than n, the CPU 401 outputs the two-element tuple (-1, -1) and ends the search process (step S2105).

If m is greater than or equal to n, the CPU 401 calculates select(j, n) on the d-th sequence of the depth-specific ID sequence list 434 and sets n to the result (step S2106).

Next, the CPU 401 calculates select(d, n) on the XML element depth sequence 433 and sets n to the result (step S2107).

Further, the CPU 401 calculates select(0, n) on the annotation element determination bit string 431 and sets the variable s to the result (step S2108).

Next, the CPU 401 calculates nearest(n, d) on the XML element depth sequence 433 and sets the variable p to the result (step S2109).

The CPU 401 then calculates select(0, p) on the annotation element determination bit string 431 and sets p to the result (step S2110).

Next, the CPU 401 calculates rank(1, n) on the annotation element determination bit string 431 and sets the variable r to the result (step S2111).

Further, the CPU 401 calculates nearest(r, d) on the annotation element depth sequence 438 and sets r to the result (step S2112).

The CPU 401 also calculates select(1, q) on the annotation element determination bit string 431 and sets the variable q to the result (step S2113).

The CPU 401 then sets the variable e to the smaller of p and q (step S2114).

Finally, the CPU 401 outputs the tuple (s, e) (step S2115).

[Annotation element search unit 444]
FIG. 23 is a flowchart showing the processing of the CPU 401 functioning as the annotation element search unit 444.

First, the CPU 401 sets i to the input path DAG ID and n to the search number (step S2201). Next, the CPU 401 sets j to the third element of i (step S2202). The CPU 401 also sets m to the number of occurrences of j in the annotation start tag ID sequence 436 (step S2203).

The CPU 401 then determines whether m is smaller than n (step S2204). If m is smaller than n, the CPU 401 outputs the two-element tuple (-1, -1) and ends the search process (step S2205).

If m is greater than or equal to n, the CPU 401 calculates select(j, n) on the annotation start tag ID sequence 436 and sets m to the result (step S2206).

Next, the CPU 401 calculates select(0, m) on the annotation end tag determination bit string 435 and sets m to the result (step S2207).

Further, the CPU 401 calculates select(1, m) on the annotation element determination bit string 431 and sets the variable s to the result (step S2208).

The CPU 401 also calculates select(j, n) on the annotation end tag ID sequence 437 and sets m to the result (step S2209).

The CPU 401 then calculates select(1, m) on the annotation end tag determination bit string 435 and sets m to the result (step S2210).

The CPU 401 also calculates select(1, m) on the annotation element determination bit string 431 and sets the variable e to the result (step S2211).

Finally, the CPU 401 outputs the tuple (s, e) (step S2212).
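Steps S2201 to S2212 can be sketched in Python as follows, using a naive select in place of the index structures. The function names and the toy sequences are hypothetical; positions are assumed to be 1-based. The toy data describes nine elements in which annotations with IDs 1, 2, and 1 occupy positions (2, 4), (5, 7), and (8, 9).

```python
def select(seq, c, n):
    """Return the 1-based position of the n-th occurrence of c, or -1 if absent."""
    count = 0
    for pos, value in enumerate(seq, start=1):
        if value == c:
            count += 1
            if count == n:
                return pos
    return -1


def search_annotation(path_dag_id, n, start_tag_ids, end_tag_ids,
                      end_tag_bits, annotation_bits):
    """Return (start, end) of the n-th annotation with the given ID, or (-1, -1)."""
    j = path_dag_id[2]                       # S2202
    if start_tag_ids.count(j) < n:           # S2203, S2204
        return (-1, -1)                      # S2205
    m = select(start_tag_ids, j, n)          # S2206: position among start tags
    m = select(end_tag_bits, 0, m)           # S2207: position among annotation tags
    s = select(annotation_bits, 1, m)        # S2208: position among all elements
    m = select(end_tag_ids, j, n)            # S2209: position among end tags
    m = select(end_tag_bits, 1, m)           # S2210
    e = select(annotation_bits, 1, m)        # S2211
    return (s, e)                            # S2212


annotation_bits = [0, 1, 0, 1, 1, 0, 1, 1, 1]  # toy stand-in for 431
end_tag_bits = [0, 1, 0, 1, 0, 1]              # toy stand-in for 435
start_tag_ids = [1, 2, 1]                      # toy stand-in for 436
end_tag_ids = [1, 2, 1]                        # toy stand-in for 437

second_id1 = search_annotation((2, 1, 1), 2, start_tag_ids, end_tag_ids,
                               end_tag_bits, annotation_bits)
```

The sketch pairs the n-th start tag of an ID with the n-th end tag of the same ID, as the flowchart does.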

[Effect of the embodiment]
The structured document search apparatus 400 according to this embodiment describes the structure of the DOM DAG using the path DAG and number sequence data. The structured document search apparatus 400 therefore does not need to hold the DOM DAG created from each XML document and its corresponding annotation data, and can greatly reduce memory consumption compared with the embodiments described above.

[Fifth embodiment]
The succinct bit vector is known as a data structure that allows rank and select to be computed on a bit string in constant time. The wavelet tree is known as a data structure that holds number sequence data in compressed form and allows rank and select to be computed efficiently while the data remains compressed.

The wavelet tree is a data structure that extends the succinct bit vector, and both require dictionary data to be prepared in advance for the computation. The dictionary data may be constructed, and rank and select may be computed using these data structures, by a method such as the one described in Non-Patent Document 2.
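The role of precomputed dictionary data can be illustrated with a simplified Python sketch in which cumulative counts of 1 bits are stored per fixed-size block, so that a rank query inspects at most one block. This is only an illustration of the idea and is not the method of Non-Patent Document 2; the class name `BlockRankBitVector`, the block size, and the sample bits are assumptions.

```python
class BlockRankBitVector:
    """Bit vector with per-block cumulative counts acting as dictionary data."""

    def __init__(self, bits, block_size=4):
        self.bits = list(bits)
        self.block_size = block_size
        # block_counts[k] = number of 1 bits in the first k complete blocks.
        self.block_counts = [0]
        for start in range(0, len(self.bits), block_size):
            ones = sum(self.bits[start:start + block_size])
            self.block_counts.append(self.block_counts[-1] + ones)

    def rank(self, c, p):
        """Count bits equal to c among the first p bits (1-based, inclusive)."""
        block = p // self.block_size
        ones = self.block_counts[block] + sum(self.bits[block * self.block_size:p])
        return ones if c == 1 else p - ones


vector = BlockRankBitVector([0, 1, 0, 1, 1, 0, 1, 1, 1])
```

With a fixed block size, each query adds one table lookup to a scan of fewer than `block_size` bits, regardless of the length of the bit string.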

In this embodiment, after the processing of the search index construction unit 454 has been executed, the simple bit vector/wavelet tree construction unit 455 uses the known method described above to generate succinct bit vectors for the annotation end tag determination bit string 435, the text element determination bit string 432, and the annotation element determination bit string 431, and to generate wavelet trees for the XML element depth sequence 433, the depth-specific ID sequence list 434, the annotation start tag ID sequence 436, the annotation end tag ID sequence 437, and the annotation element depth sequence 438.

[Effect of the embodiment]
By using the succinct bit vectors and wavelet trees prepared in this way, the structured document search apparatus 400 according to this embodiment can make the rank and select operations in the XML element search unit 443 and the annotation element search unit 444 more efficient.

[Sixth embodiment]
In practical operation, it is desirable to be able to add different annotation data to, or delete it from, the same XML document. This embodiment describes a method of adding a group of annotation elements, without greatly changing the configuration of the index, when an index has already been constructed for a given XML document.

When an annotation element is added, the set of parents of each element changes. Elements that were previously regarded as identical on the path DAG must therefore be treated as distinct elements. In addition, adding an annotation splits text areas, so not only do path DAG IDs change but the number of elements representing text areas also increases. Furthermore, adding an annotation requires not only changing the structure of the path DAG but also replacing some occurrences of a number c in the number sequence data with a sequence consisting of c and numbers that did not previously appear in the sequence.

The following describes the extended wavelet tree 445. When some occurrences of a number c in the sequence registered in the wavelet tree 446 are changed to a sequence N consisting of numbers that did not previously appear in the sequence, the extended wavelet tree 445 adds data structures to the existing wavelet tree 446 so that rank, select, and lookup remain possible after the change.

[Extended wavelet tree 445]
FIG. 24 is a diagram showing an overview of the extended wavelet tree 445. The extended wavelet tree 445 consists of: the wavelet tree 446 constructed from the original sequence; a number change table 447 indicating which number registered in the original sequence each changed number x corresponds to; an addition flag 448 indicating which parts of the entire sequence contain added numbers; a number-specific addition flag 449 indicating, for each number of the original sequence, the positions at which numbers were added; and a modified wavelet tree 450 for each number of the original sequence that is affected by the change.

[Extended wavelet tree construction unit 451]
FIG. 25 is a flowchart showing the processing of the CPU 401 functioning as the extended wavelet tree construction unit 451. When the number c at the p-th position of the sequence registered in the wavelet tree 446 is changed to a sequence N, the CPU 401 extends the wavelet tree 446 by the following procedure.

First, the CPU 401 prepares a sequence B that has the same length as the sequence registered in the wavelet tree 446 and whose values are all initialized to 0 (step S2401). Next, with |N| denoting the length of the sequence N, the CPU 401 adds |N| - 1 "1"s after the p-th position of the sequence B (step S2402).

The CPU 401 then creates a succinct bit vector of the sequence B and uses it as the addition flag 448 (step S2403). Next, the CPU 401 calculates rank(p, c) on the addition flag 448 and sets q to the result (step S2404).

The CPU 401 then sets m to the number of occurrences of the number c registered in the wavelet tree 446 (step S2405). Further, the CPU 401 prepares a sequence V of length m whose values are all initialized to 0 and adds |N| - 1 "1"s after the q-th position (step S2406).

The CPU 401 then creates a succinct bit vector of the sequence V and uses it as the number-specific addition flag 449 corresponding to the number c (step S2407). Further, the CPU 401 creates a wavelet tree that holds the sequence N and uses it as the modified wavelet tree 450 corresponding to the number c (step S2408). Finally, the CPU 401 creates the number change table 447, which records which number of the original sequence each number of the changed sequence corresponds to (step S2409).
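The construction of the two flag sequences B and V (steps S2401 to S2406) can be sketched in Python as follows, with plain lists in place of succinct bit vectors. The function name `build_addition_flags` and the sample data are hypothetical. The sketch reads "after the p-th position" as insertion immediately following the 1-based position p, and it takes q to be the number of occurrences of c among the first p elements of the original sequence; both readings are assumptions.

```python
def build_addition_flags(original, p, new_seq):
    """Return (B, V) for replacing the p-th element of original by new_seq."""
    c = original[p - 1]                 # number being replaced
    extra = len(new_seq) - 1            # |N| - 1 added elements

    b = [0] * len(original)             # S2401: all-zero sequence B
    b[p:p] = [1] * extra                # S2402: insert the 1s after position p

    q = original[:p].count(c)           # assumed meaning of q (see lead-in)
    m = original.count(c)               # S2405: occurrences of c
    v = [0] * m                         # S2406: all-zero sequence V of length m
    v[q:q] = [1] * extra                #         insert the 1s after position q
    return b, v


# Made-up example: the second element (7) of [5, 7, 5, 7] becomes [8, 7, 9].
flag_b, flag_v = build_addition_flags([5, 7, 5, 7], 2, [8, 7, 9])
```

In this example B marks the two added slots in the whole sequence and V marks the same two slots relative to the occurrences of 7 only.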

[Rank calculation unit 452 for the extended wavelet tree]
FIG. 26 is a flowchart showing the processing of the CPU 401 functioning as the rank calculation unit 452, which executes the rank calculation on the extended wavelet tree 445.

First, the CPU 401 calculates rank(0, p) on the addition flag 448 and sets q to the result (step S2501). Next, the CPU 401 obtains from the number change table 447 the number in the original sequence that corresponds to c and sets d to it (step S2502). Next, the CPU 401 calculates rank(c, q) on the original wavelet tree 446 and sets r to the result (step S2503).

The CPU 401 then determines whether c and d are the same (step S2504). If c and d are not the same, the CPU 401 outputs r and ends the calculation (step S2508).

If c and d are the same, the CPU 401 calculates select(0, q) on the addition flag 448 and sets s to the result (step S2505). Next, the CPU 401 calculates select(0, r) on the number-specific addition flag 449 corresponding to d and sets t to the result (step S2506). The CPU 401 then calculates rank(c, t + p - s) on the number-specific addition flag 449 corresponding to d, outputs the result, and ends the calculation (step S2507).

[Select calculation unit 453 for the extended wavelet tree]
FIG. 27 is a flowchart showing the processing of the CPU 401 functioning as the select calculation unit 453, which executes the select calculation on the extended wavelet tree 445.

First, the CPU 401 obtains from the number change table 447 the number in the original sequence that corresponds to c and sets d to it (step S2601).

The CPU 401 then determines whether c and d differ (step S2602). If c differs from d (affirmative result), the CPU 401 calculates select(c, n) on the additional wavelet tree corresponding to d and sets s to the result (step S2603). Further, the CPU 401 calculates rank(0, s) on the number-specific addition flag 449 corresponding to d and sets n to the result (step S2604).

After step S2604, or after a negative result in step S2602, the CPU 401 calculates select(0, n) on the number-specific addition flag 449 corresponding to d and sets t to the result (step S2606).

Next, the CPU 401 calculates select(d, n) on the original wavelet tree 446 and sets m to the result (step S2607). The CPU 401 also calculates select(0, m) on the addition flag 448 and sets u to the result (step S2608).

The CPU 401 then determines whether c and d are the same (step S2609). If c and d are the same, the CPU 401 outputs u and ends the calculation (step S2610). If c and d differ, the CPU 401 outputs s - t + u as the result and ends the calculation (step S2611).

[Effect of the embodiment]
The structured document search apparatus 400 according to this embodiment makes it possible to add different annotation data to, or delete it from, the same XML document.

[Other embodiments]
The present invention is not limited to the embodiments described above and includes various modifications. For example, the embodiments above have been described in detail to make the present invention easy to understand, and the invention is not necessarily limited to configurations that include every element described.

Part of the configuration of one embodiment can be replaced with the configuration of another embodiment, and the configuration of another embodiment can be added to the configuration of one embodiment. For part of the configuration of each embodiment, other configurations can also be added, deleted, or substituted.

Some or all of the configurations, functions, processing units, processing means, and the like described above may be implemented in hardware, for example by designing them as integrated circuits. The configurations, functions, and the like described above may also be implemented in software, with the CPU interpreting and executing programs that realize the respective functions.

Information such as the programs, tables, and files that realize each function can be stored in a memory, in a recording device such as a hard disk or an SSD (Solid State Drive), or on a recording medium such as an IC card, an SD card, or a DVD.

The control lines and information lines shown are those considered necessary for the explanation, and not all control lines and information lines of a product are necessarily shown. In practice, almost all of the components may be considered to be connected to one another.

４００構造化文書検索装置
４０１ＣＰＵ（中央演算装置）
４０２主記憶装置
４０３Ａ補助記憶装置
４０３Ｂ外部記憶装置
４０４リムーバブルメディア
４０５ネットワーク
４０６インターフェース部
４０７ＸＭＬ文書集合
４０８アノテーションデータ集合
４０９ＤＯＭ木構築部（ＸＭＬ文書用）
４１０ＤＯＭ木構築部（アノテーションデータ用）
４１１文書構造リスト構築部
４１２文書構造リスト
４１３テキスト要素リスト
４１４テキストデータリスト
４１５テキスト共有ＤＯＭ木構築部
４１６テキスト共有ＤＯＭ木
４１７テキストデータ・テキスト要素リスト構築部
４１８テキスト割当部
４１９親子関係解析・登録部
４２０ロケーションパス検索部
４２１ＤＯＭＤＡＧ
４２２ＤＯＭＤＡＧ構築部
４２３パスＤＡＧ
４２４転置インデクス構築部
４２５転置インデクス
４２６ＤＯＭＤＡＧ要素リストソート部
４２７深さ割当部
４２８パスＤＡＧＩＤ取得部
４２９パスＤＡＧ要素生成・登録部
４３０検索インデクス
４３１アノテーション要素判別ビット列
４３２テキスト要素判別ビット列
４３３ＸＭＬ要素の深さ数列
４３４深さ別ＩＤ列リスト
４３５アノテーション終了タグ判別ビット列
４３６アノテーション開始タグＩＤ列
４３７アノテーション終了タグＩＤ列
４３８アノテーション要素の深さ数列
４３９パスＤＡＧＩＤ登録部
４４０検索インデクス登録部
４４１アノテーション終了タグ挿入部
４４２転置インデクス登録部
４４３ＸＭＬ要素検索部
４４４アノテーション要素検索部
４４５拡張ウェーブレット木
４４６元の数列から構成されたウェーブレット木
４４７数字変更表
４４８追加フラグ
４４９数字別追加フラグ
４５０変更ウェーブレット木
４５１拡張ウェーブレット木構築部
４５２拡張ウェーブレット木におけるｒａｎｋ計算部
４５３拡張ウェーブレット木におけるｓｅｌｅｃｔ計算部
４５４検索インデクス構築部
４５５簡易ビットベクトル・ウェーブレット木構築部

400 Structured document search apparatus
401 CPU (central processing unit)
402 Main storage device
403A Auxiliary storage device
403B External storage device
404 Removable media
405 Network
406 Interface unit
407 XML document set
408 Annotation data set
409 DOM tree construction unit (for XML documents)
410 DOM tree construction unit (for annotation data)
411 Document structure list construction unit
412 Document structure list
413 Text element list
414 Text data list
415 Text-sharing DOM tree construction unit
416 Text-sharing DOM tree
417 Text data / text element list construction unit
418 Text allocation unit
419 Parent-child relationship analysis / registration unit
420 Location path search unit
421 DOM DAG
422 DOM DAG construction unit
423 Path DAG
424 Inverted index construction unit
425 Inverted index
426 DOM DAG element list sort unit
427 Depth allocation unit
428 Path DAG ID acquisition unit
429 Path DAG element generation / registration unit
430 Search index
431 Annotation element discrimination bit string
432 Text element discrimination bit string
433 XML element depth number sequence
434 Per-depth ID sequence list
435 Annotation end tag discrimination bit string
436 Annotation start tag ID sequence
437 Annotation end tag ID sequence
438 Annotation element depth number sequence
439 Path DAG ID registration unit
440 Search index registration unit
441 Annotation end tag insertion unit
442 Inverted index registration unit
443 XML element search unit
444 Annotation element search unit
445 Extended wavelet tree
446 Wavelet tree constructed from the original number sequence
447 Number change table
448 Addition flag
449 Per-number addition flag
450 Modified wavelet tree
451 Extended wavelet tree construction unit
452 Rank calculation unit for the extended wavelet tree
453 Select calculation unit for the extended wavelet tree
454 Search index construction unit
455 Succinct bit vector / wavelet tree construction unit

Claims

A structured document search apparatus comprising:
a processor that executes a program;
a first storage area that stores the program;
a second storage area that stores a structured document satisfying a tree structure condition and annotation data attached to the document;
a document structure list construction unit that generates a text-sharing DOM tree by assigning the text of the structured document to a structure in which the root elements of the DOM (Document Object Model) trees, obtained individually from the inclusion relation of the tags of the structured document and from the inclusion relation of the tags of the annotation data, are made common;
an input device for entering a search query; and
a location path search unit that searches the text-sharing DOM tree for elements matching a location path given as the search query.
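
As an illustration only, and not as part of the claim, the following minimal Python sketch shows one way two tag sets could share a root and a single text. It assumes each tag is given as a hypothetical `(name, start, end)` character span, that each tag set nests properly on its own, and that "assigning the text" can be approximated by letting every node hold offsets into the one shared string. The names `Node`, `build_tree`, and `find` are invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    start: int  # character offset into the shared text (inclusive)
    end: int    # character offset into the shared text (exclusive)
    children: list = field(default_factory=list)


def build_tree(root, spans):
    """Nest (name, start, end) spans under `root` by their inclusion relation.

    Assumes the spans of one tag set nest properly (tree structure condition).
    """
    stack = [root]
    # Outer spans first: start ascending, then end descending.
    for name, start, end in sorted(spans, key=lambda s: (s[1], -s[2])):
        # Pop until the top of the stack contains the new span.
        while not (stack[-1].start <= start and end <= stack[-1].end):
            stack.pop()
        node = Node(name, start, end)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def find(root, path):
    """Return nodes matching a simple absolute location path such as /doc/a/b."""
    names = path.strip("/").split("/")
    if names[0] != root.name:
        return []
    nodes = [root]
    for name in names[1:]:
        nodes = [c for n in nodes for c in n.children if c.name == name]
    return nodes


text = "Alice met Bob in Paris."
root = Node("doc", 0, len(text))                            # common root element
build_tree(root, [("em", 0, 9)])                            # tags of the structured document
build_tree(root, [("phrase", 6, 13), ("person", 10, 13)])   # annotation tags

for node in find(root, "/doc/phrase/person"):
    print(text[node.start:node.end])
```

Here `em` and `phrase` overlap without either containing the other, so they could not be written as a single well-formed XML document, yet both hang from the same root and refer to the same text.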

A structured document search apparatus comprising:
a processor that executes a program;
a first storage area that stores the program;
a second storage area that stores a structured document satisfying a tree structure condition and annotation data attached to the document;
a document structure list construction unit that generates a DOM DAG (Directed Acyclic Graph) by incorporating the inclusion relations between tags of different DOM trees into a structure in which the root elements of the DOM (Document Object Model) trees, obtained individually from the inclusion relation of the tags of the structured document and from the inclusion relation of the tags of the annotation data, are made common, and by assigning the text of the structured document to the structure after the incorporation;
an input device for entering a search query; and
a location path search unit that searches the DOM DAG for elements matching a location path given as the search query.

The structured document search apparatus according to claim 2, further comprising:
a path DAG element generation / registration unit that generates a path DAG in which the structures of a plurality of the DOM DAGs are aggregated; and
an inverted index construction unit that constructs an inverted index composed of path DAG IDs, each being the ID of an element of the path DAG, and position information of the one or more corresponding elements of the DOM DAG,
wherein the location path search unit calculates, based on the path DAG and the inverted index, the locations at which elements matching the location path given as the search query appear.
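
As an illustration only, and not as part of the claim, the lookup described above can be sketched in Python as follows. This is a simplification under stated assumptions: the path DAG is flattened into a plain dictionary from location path to a path ID, and each element is supplied as a hypothetical `(position, paths)` pair listing every location path that reaches it (an element of a DAG may have more than one). The names `build_index` and `search` are invented for this example.

```python
from collections import defaultdict


def build_index(elements):
    """Build a path dictionary and an inverted index.

    elements: iterable of (position, [location paths reaching that element]).
    Returns (path_ids, inverted), where path_ids maps a location path to its
    path ID and inverted maps a path ID to element positions in input order.
    """
    path_ids = {}
    inverted = defaultdict(list)
    for position, paths in elements:
        for path in paths:
            # Assign the next free ID the first time a path is seen.
            pid = path_ids.setdefault(path, len(path_ids))
            inverted[pid].append(position)
    return path_ids, inverted


def search(path_ids, inverted, query):
    """Return the positions of all elements matching the location path."""
    pid = path_ids.get(query)
    return [] if pid is None else list(inverted[pid])


# Hypothetical element positions gathered from two documents.
elements = [
    (0, ["/doc"]),
    (1, ["/doc/em"]),
    (2, ["/doc/phrase"]),
    (3, ["/doc/phrase/person", "/doc/em/person"]),  # two parents in the DAG
    (7, ["/doc/phrase"]),
]
path_ids, inverted = build_index(elements)
print(search(path_ids, inverted, "/doc/phrase"))
```

A query is resolved once against the aggregated path structure, and the matching positions are then read directly from the inverted index without traversing each DOM DAG.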

The structured document search apparatus according to claim 2, further comprising:
a path DAG element generation / registration unit that generates a path DAG in which the structures of a plurality of the DOM DAGs are aggregated; and
a search index construction unit that constructs a search index holding the path DAG together with a bit string and number sequence data that store the type of location path corresponding to each element of the DOM DAG,
wherein the location path search unit calculates, from the path DAG, the path DAG ID corresponding to the location path given as the search query, and calculates the locations at which the elements specified by that path DAG ID appear by scanning the bit string and the number sequence data.

The structured document search apparatus according to claim 2, further comprising:
a path DAG element generation / registration unit that generates a path DAG in which the structures of a plurality of the DOM DAGs are aggregated;
a search index construction unit that constructs a search index holding the path DAG together with a bit string and number sequence data that store the type of location path corresponding to each element of the DOM DAG; and
a succinct bit vector / wavelet tree generation unit that generates a succinct bit vector and a wavelet tree based on the bit string and the number sequence data held by the search index,
wherein the location path search unit calculates, from the path DAG, the path DAG ID corresponding to the location path given as the search query, and calculates the locations at which the elements specified by that path DAG ID appear by rank and select operations on the succinct bit vector or the wavelet tree.
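
As an illustration only, and not as part of the claim, the Python sketch below shows how rank and select on a wavelet tree can enumerate the positions at which a given ID occurs in a number sequence. It is a plain pointer-based wavelet tree with a linear-time bit select, not a space-efficient succinct structure, and the sequence of IDs is hypothetical. The names `WaveletTree`, `select_bit`, and `occurrences` are invented for this example.

```python
def select_bit(bits, b, k):
    """0-based position of the k-th (1-based) occurrence of bit b, or -1."""
    seen = 0
    for pos, bit in enumerate(bits):
        if bit == b:
            seen += 1
            if seen == k:
                return pos
    return -1


class WaveletTree:
    """Wavelet tree over a non-empty sequence of integers."""

    def __init__(self, seq, lo=None, hi=None):
        if lo is None:
            lo, hi = min(seq), max(seq)
        self.lo, self.hi, self.n = lo, hi, len(seq)
        self.bits = None  # None marks a leaf
        if lo == hi or not seq:
            return
        mid = (lo + hi) // 2
        self.bits = [int(v > mid) for v in seq]
        self.prefix = [0]  # prefix[i] = number of 1 bits in bits[:i]
        for b in self.bits:
            self.prefix.append(self.prefix[-1] + b)
        self.left = WaveletTree([v for v in seq if v <= mid], lo, mid)
        self.right = WaveletTree([v for v in seq if v > mid], mid + 1, hi)

    def rank(self, c, i):
        """Number of occurrences of c among the first i values."""
        if c < self.lo or c > self.hi:
            return 0
        if self.bits is None:
            return i
        if c <= (self.lo + self.hi) // 2:
            return self.left.rank(c, i - self.prefix[i])
        return self.right.rank(c, self.prefix[i])

    def select(self, c, k):
        """0-based position of the k-th (1-based) occurrence of c, or -1."""
        if c < self.lo or c > self.hi or k < 1:
            return -1
        if self.bits is None:
            return k - 1 if k <= self.n else -1
        go_right = c > (self.lo + self.hi) // 2
        child = self.right if go_right else self.left
        j = child.select(c, k)
        # Map the position in the child back to a position at this level.
        return -1 if j < 0 else select_bit(self.bits, int(go_right), j + 1)

    def occurrences(self, c):
        """All positions of c, obtained from one rank and repeated selects."""
        return [self.select(c, k) for k in range(1, self.rank(c, self.n) + 1)]


ids = [2, 0, 1, 2, 3, 2, 0]  # hypothetical path DAG ID per element
wt = WaveletTree(ids)
print(wt.occurrences(2))
```

In this sketch, `rank` gives the number of matching elements and `select` gives each of their positions, so the sequence itself never has to be scanned from the start.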

The structured document search apparatus according to claim 5, further comprising:
an extended wavelet tree construction unit that, when part of a number sequence registered in the wavelet tree is replaced with another number sequence as a result of adding annotation data, converts the wavelet tree into an extended wavelet tree that includes change information for the wavelet tree,
wherein the location path search unit uses the extended wavelet tree for the rank operation and the select operation.

A program that causes a computer to execute:
a first process of generating a text-sharing DOM tree by assigning the text of a structured document to a structure in which the root elements of the DOM (Document Object Model) trees, obtained individually from the inclusion relation of the tags of the structured document, which satisfies a tree structure condition, and from the inclusion relation of the tags of annotation data attached to the document, are made common; and
a second process of searching the text-sharing DOM tree for elements matching a location path given as a search query.

A program that causes a computer to execute:
a first process of generating a DOM DAG (Directed Acyclic Graph) by incorporating the inclusion relations between tags of different DOM trees into a structure in which the root elements of the DOM (Document Object Model) trees, obtained individually from the inclusion relation of the tags of a structured document, which satisfies a tree structure condition, and from the inclusion relation of the tags of annotation data attached to the document, are made common, and by assigning the text of the structured document to the structure after the incorporation; and
a second process of searching the DOM DAG for elements matching a location path given as a search query.

The program according to claim 8, further causing the computer to execute:
a third process of generating a path DAG in which the structures of a plurality of the DOM DAGs are aggregated; and
a fourth process of constructing an inverted index composed of path DAG IDs, each being the ID of an element of the path DAG, and position information of the one or more corresponding elements of the DOM DAG,
wherein the second process calculates, based on the path DAG and the inverted index, the locations at which elements matching the location path given as the search query appear.

The program according to claim 8, further causing the computer to execute:
a third process of generating a path DAG in which the structures of a plurality of the DOM DAGs are aggregated; and
a fifth process of constructing a search index holding the path DAG together with a bit string and number sequence data that store the type of location path corresponding to each element of the DOM DAG,
wherein the second process calculates, from the path DAG, the path DAG ID corresponding to the location path given as the search query, and calculates the locations at which the elements specified by that path DAG ID appear by scanning the bit string and the number sequence data.

The program according to claim 8, further causing the computer to execute:
a third process of generating a path DAG in which the structures of a plurality of the DOM DAGs are aggregated;
a fifth process of constructing a search index holding the path DAG together with a bit string and number sequence data that store the type of location path corresponding to each element of the DOM DAG; and
a sixth process of generating a succinct bit vector and a wavelet tree based on the bit string and the number sequence data held by the search index,
wherein the second process calculates, from the path DAG, the path DAG ID corresponding to the location path given as the search query, and calculates the locations at which the elements specified by that path DAG ID appear by rank and select operations on the succinct bit vector or the wavelet tree.

The program according to claim 11, further causing the computer to execute:
a seventh process of converting, when part of a number sequence registered in the wavelet tree is replaced with another number sequence as a result of adding annotation data, the wavelet tree into an extended wavelet tree that includes change information for the wavelet tree,
wherein the second process uses the extended wavelet tree for the rank operation and the select operation.