JP2013175053A

JP2013175053A - XML document retrieval device and program

Info

Publication number: JP2013175053A
Application number: JP2012039242A
Authority: JP
Inventors: Tomohiro Yasuda; 知弘安田
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2012-02-24
Filing date: 2012-02-24
Publication date: 2013-09-05
Anticipated expiration: 2032-02-24
Also published as: JP5695586B2

Abstract

PROBLEM TO BE SOLVED: To efficiently retrieve portions that satisfy a complex retrieval condition expressed in XPath. SOLUTION: An XML document retrieval device analyzes the set of XML documents to be searched and generates both a sequence S that stores the shape of the DOM tree, based on the structure information of the XML documents, and a sequence group T consisting of one or more sequences that store the type of structure path corresponding to each node of the DOM tree. The device then finds the portions that match the structure path given as a retrieval query by scanning the sequence S and the sequence group T.

Description

The present invention relates to an apparatus and a program for searching a plurality of XML (Extensible Markup Language) documents for locations that match a search condition specified by a user.

XML is widely used as a format for describing data for exchange and storage. With XML, diverse data can be described with a high degree of freedom in a text format that is easy to process mechanically. FIG. 1 shows an example of an XML document. The content of the document is divided into parts by tags enclosed in "<" and ">". Tags are either start tags 101, written in the form "<tag name>", or end tags 102, written in the form "</tag name>". The region delimited by a start tag and an end tag with the same tag name is called an element or an XML element.

When no tag or text region lies between a start tag and an end tag, XML allows a single tag of the form "<tag name/>" in place of the pair. In this specification, such a tag is handled as if it were written "<tag name></tag name>". A start tag can also carry attribute values, in the form "<tag name attribute1=value1 attribute2=value2 ...>". In this specification, such a tag is handled as if it were written "<tag name> <@attribute1>value1</@attribute1> <@attribute2>value2</@attribute2>". A text region is regarded as an element whose tag name is "#". An element that has no parent is called the root element. Each XML document has exactly one root element.

In XML, complex data structures can be described by nesting elements. The tag name of an end tag must match the tag name of the most recent start tag that, reading the XML text from beginning to end, has not yet been closed by a corresponding end tag. Consequently, any two elements either do not overlap at all or one completely contains the other. When an element A contains an element B, and there is no other element C that is contained in A and contains B, B is called a child of A and A is called the parent of B. Elements reachable by repeatedly following parents are called ancestors, and elements reachable by repeatedly following children are called descendants. Elements that are children of the same parent are called siblings.

These relationships between elements are generally represented as a tree. FIG. 2 shows an example. The tree in FIG. 2 is obtained by preparing a node 201 for each element and drawing a directed edge 202 from the node of each parent element to the node of each of its child elements. This tree is called a DOM (document object model) tree. In this specification, the character string obtained by joining the tag names of the elements on the route from the root of the DOM tree to a node with "/", and prefixing it with "/", is called a structure path. In FIG. 2, for example, the structure path of the rightmost "c" is "/a/c". The number of tags contained in the structure path is defined as the depth of the element.
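The definitions of structure path and depth can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: it uses the standard `xml.etree.ElementTree` parser, the function name `structure_paths` and the sample document are hypothetical, whitespace-only text is skipped, and attributes (which the specification would treat as "@" child elements) are ignored for brevity.

```python
import xml.etree.ElementTree as ET


def structure_paths(xml_text):
    """Yield (structure_path, depth) for each element in document order.

    Text regions are reported as elements named "#", following the
    convention of this specification. Attributes are ignored (assumption).
    """
    def walk(elem, parent_path):
        path = parent_path + "/" + elem.tag
        depth = path.count("/")          # number of tags in the path
        yield path, depth
        if elem.text and elem.text.strip():
            yield path + "/#", depth + 1  # text directly inside elem
        for child in elem:
            yield from walk(child, path)
            if child.tail and child.tail.strip():
                yield path + "/#", depth + 1  # text following the child

    yield from walk(ET.fromstring(xml_text), "")


# Hypothetical sample document (not the one shown in FIG. 2).
for path, depth in structure_paths("<a><b>x</b><c/></a>"):
    print(path, depth)
```

For the sample, the last line printed is the path "/a/c" with depth 2, matching the way the depth is counted from the number of tags in the path.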

Patent Document 1: JP 2006-228155 A

Non-Patent Document 1: Toshiyuki Shimizu, Makoto Onizuka, Takeharu Eda, Masatoshi Yoshikawa, XML data management and stream processing technology, IEICE Transactions D, J90-D(2): 159-184, 2007.
Non-Patent Document 2: R. Kaushik, R. Krishnamurthy, J. F. Naughton, R. Ramakrishnan, On the integration of structure indexes and inverted lists, Proc. ACM SIGMOD, pp. 779-790, 2004.
Non-Patent Document 3: Takeharu Eda, Makoto Onizuka, Masashi Yamamuro, High-speed XPath processing method using summary information of XML data, IEICE Transactions D, J89-D(2): 139-150, 2006.
Non-Patent Document 4: Kazuhito Hagio, Shuichi Mitarai, Akira Ishino, Masayuki Takeda, Fast and lightweight XML document filtering based on incremental path trie construction, DBSJ Letters 6(2): 1-4, 2007.
Non-Patent Document 5: G. Navarro and V. Mäkinen, Compressed full-text indexes, ACM Computing Surveys 39(1): Article 2, 2007.
Non-Patent Document 6: I. Witten, A. Moffat, and T. Bell, Managing Gigabytes, Morgan Kaufmann.

A standard called XPath is widely used for writing search queries over XML documents (Non-Patent Document 1). In XPath, for example, a search query designating "an element b that is a child of an element a" is written "a/b". XPath standardizes how such search queries are written. A search query written in XPath can also designate combinations of elements related as parent, child, ancestor, descendant, sibling, and so on.
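As a small illustration of the "a/b" query form, the following Python sketch uses the standard `xml.etree.ElementTree` module, which implements only a limited subset of XPath. The sample document and its wrapper element `<doc>` are hypothetical and exist only so that "a/b" can be evaluated relative to some context element.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
doc = ET.fromstring("<doc><a><b>x</b><c/></a><b>y</b></doc>")

# "a/b": every b element that is a child of an a element.
matches = doc.findall("a/b")
print([m.text for m in matches])   # the top-level <b>y</b> is not a child of a

# ".//b": every b element at any depth below the context element.
descendants = doc.findall(".//b")
print(len(descendants))
```

The first query returns only the b nested inside a, while the second also picks up the b that sits directly under the wrapper element.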

A widely used way to make searches with XPath queries efficient is to build, in advance, an index that records (1) the structures that appear in the XML data to be searched and (2) the positions at which each structure appears, and to consult it to find the locations matching the search query (see Patent Document 1 and Non-Patent Documents 2 and 3). Alternatively, the preprocessing can be omitted and the structure of the XML documents to be searched analyzed in real time when the search is executed (see Non-Patent Document 4). That approach, however, is slower than computing a search index in advance.

To execute searches with XPath queries efficiently, a data structure is needed that analyzes the structure information appearing in the XML documents, records where the locations corresponding to each structure path are, and allows relationships between elements, such as parent, child, ancestor, descendant, and sibling, to be computed efficiently. Furthermore, when large XML documents are handled, this data must be representable in as small a size as possible and readable at high speed.

The present invention therefore makes the search data used in XML document search as small as possible and allows the locations that satisfy the conditions specified by a search query to be computed at high speed. To this end, an XML document analysis unit creates search data that records the shape of the DOM tree of the XML documents, given by (1) a first numerical sequence S that contains, as a subsequence, the numbers representing the depths of the elements in their order of appearance, and (2) a sequence group T consisting of one or more sequences that record the type of structure path corresponding to each node of the DOM tree. The sequence S and the sequence group T are then scanned to compute the locations that match the structure path given as a search query.

According to the present invention, search processing with search queries written in XPath can be accelerated. Problems, configurations, and effects other than those described above will become clear from the following description of embodiments.

FIG. 1 is a diagram showing an example of XML data.
FIG. 2 is a diagram showing an example of a DOM tree.
FIG. 3 is a diagram explaining the element numbers assigned to XML elements.
FIG. 4 is a diagram showing an example block configuration of the XML document search apparatus according to the first embodiment.
FIG. 5 is a diagram explaining the cooperation between components during preprocessing according to the first embodiment.
FIG. 6 is a flowchart explaining the flow of the preprocessing operation according to the first embodiment.
FIG. 7 is a diagram showing the concept of the preprocessing operation according to the first embodiment.
FIG. 8 is a flowchart explaining the flow of the processing operation (analysis operation) executed by the XML document analysis unit according to the first embodiment.
FIG. 9 is a diagram explaining the cooperation between components during search execution according to the first embodiment.
FIG. 10 is a flowchart explaining the flow of search processing by the XML document search apparatus according to the first embodiment.
FIG. 11 is a diagram explaining an example of the search processing according to the first embodiment.
FIG. 12 is a diagram explaining the rank operation and the select operation.
FIG. 13 is a flowchart explaining the flow of operation of the structure path analysis unit according to the first embodiment.
FIG. 14 is a flowchart explaining the flow of operation of the element search unit according to the first embodiment.
FIG. 15 is a diagram explaining a bit vector.
FIG. 16 is a diagram explaining a Wavelet tree.
FIG. 17 is a diagram showing an example block configuration of the XML document search apparatus according to the second embodiment.
FIG. 18 is a diagram explaining the cooperation between components during preprocessing according to the second embodiment.
FIG. 19 is a flowchart explaining the flow of preprocessing according to the second embodiment.
FIG. 20 is a flowchart explaining the flow of the operation that computes a parent element in the third embodiment.
FIG. 21 is a flowchart explaining the flow of the operation that computes the previous sibling element in the third embodiment.
FIG. 22 is a flowchart explaining the flow of the operation that computes the next sibling element in the third embodiment.

Embodiments of the present invention are described below with reference to the drawings. The present invention is not limited to the embodiments described below, and various modifications are possible within the scope of its technical idea.

[First embodiment]
The XML document search apparatus according to this embodiment collates a search query against search data created by preprocessing an XML document set, and outputs the elements that match the structure path specified by the search query as the search result. The search result is output as element numbers, which are assigned to all elements (XML elements) appearing in the XML documents.

[Element number]
FIG. 3 shows an example of how element numbers are assigned to the elements (XML elements) that make up XML documents. In FIG. 3, the numbers corresponding to element numbers 301 are shown inside rectangular frames. An element number 301 is a number, assigned to every XML element during preprocessing for the search, that uniquely identifies the element. When the XML documents are scanned from the beginning, the element number of an element is the sum of the number of elements and the number of documents encountered so far, including that element and its document. The element number of the j-th element of the i-th XML document is E(i-1) + 1 + j, where E(i-1) is the largest element number in the (i-1)-th XML document and E(0) = 0.
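The numbering rule E(i-1) + 1 + j can be checked with a minimal Python sketch. The function name `element_numbers` and the element counts used below are illustrative assumptions, not values taken from FIG. 3; each document is assumed to contain at least one element (its root).

```python
def element_numbers(element_counts):
    """Return, per document, the element numbers of its elements.

    element_counts[i-1] is the number of elements in the i-th document
    (text regions count as "#" elements). E(0) = 0.
    """
    numbers = []
    e_prev = 0                                   # E(i-1), starting with E(0) = 0
    for count in element_counts:
        numbers.append([e_prev + 1 + j for j in range(1, count + 1)])
        e_prev = e_prev + 1 + count              # E(i): largest number in document i
    return numbers


# Two hypothetical documents with six elements each.
print(element_numbers([6, 6]))
# [[2, 3, 4, 5, 6, 7], [9, 10, 11, 12, 13, 14]]
```

The numbers skipped between documents (1 and 8 in this example) correspond to the documents themselves, since each document also counts toward the total.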

[Device configuration]
FIG. 4 shows the block configuration of an XML document search apparatus 400 according to this embodiment. The XML document search apparatus 400 includes a CPU (Central Processing Unit) 401, a main storage device 402, an auxiliary storage device 403, a removable drive 404, a user interface 406, and a network interface 407. These components are connected to one another by an internal bus or the like.

The XML document search apparatus 400 is also connected to an external storage device 430 via a network 440 such as a LAN (Local Area Network). This embodiment is not limited to any particular type of network 440, which may be wired or wireless.

The CPU 401 is a processing unit that executes programs stored in the main storage device 402. The functions of the XML document search apparatus 400 are realized by the CPU 401 executing these programs. In the following description, processing described with a program as its subject is realized by executing that program on the CPU 401.

The main storage device 402 stores the programs executed by the CPU 401 and the information needed to execute them. The main storage device 402 is assumed to be a memory such as a RAM (Random Access Memory).

The main storage device 402 stores, as programs, an XML document analysis unit 410, a structure path analysis unit 412, and an element search unit 413, and, as data, an XML document set 420, a path trie 421, a sequenced DOM tree 422, and text data 424.

The XML document analysis unit 410 parses each XML document included in the XML document set 420, recognizes its tags, and extracts the text data 424. Based on the analysis result, the XML document analysis unit 410 generates the path trie 421 and the sequenced DOM tree 422.

The structure path analysis unit 412 uses the path trie 421 to compute the depth and the path type of the structure path given as a search query. A path type is an identification number assigned to distinguish structure paths from one another. Details are described later.

Based on the depth and path type computed by the structure path analysis unit 412, the element search unit 413 enumerates all locations in the XML document set 420 that match the search query and returns them as the search result.

The XML document set 420 is the data of the one or more XML documents to be searched. The path trie 421 is a summary of the structure information contained in the XML document set 420. Details are described later. The sequenced DOM tree 422 is the structure information contained in the XML document set 420, extracted in a form that is easy to search. Its details are also described later. The text data 424 is the text enclosed between tags, extracted from the XML document set 420.

The XML document set 420 need not be stored in the main storage device 402. It may be stored, for example, in the auxiliary storage device 403, on a removable medium, or in the external storage device 430. In that case, the CPU 401 reads the XML document set 420 from the auxiliary storage device 403, the removable medium, or the external storage device 430 and stores it in the main storage device 402.

Similarly, the path trie 421, the sequenced DOM tree 422, and the text data 424 need not be stored in the main storage device 402. They may be stored, for example, in the auxiliary storage device 403 or on a removable medium. In that case, the CPU 401 reads these data from the auxiliary storage device 403 or the removable medium as needed.

In this embodiment, the XML document analysis unit 410, the structure path analysis unit 412, and the element search unit 413 are all realized as programs, but the present invention is not limited to this. For example, these functions may be realized as dedicated hardware. That is, the XML document search apparatus 400 may be configured to include an XML document analysis device, a structure path analysis device, and an element search device.

The auxiliary storage device 403 is a device that can hold information persistently, for example an HDD (Hard Disk Drive). The removable drive 404 is a device that writes data to and reads data from removable media. Removable media include optical disks such as CD-ROMs and DVDs, and magnetic disks such as floppy (registered trademark) disks.

The user interface 406 is the interface through which a user of the XML document search apparatus 400 inputs data and receives processing results. The user interface 406 includes a display device, a keyboard, a mouse, and the like. The network interface 407 is an interface for connecting to external devices via the network 440.

Next, the specific processing performed by the XML document search apparatus 400 is described. In the following description, the XML document set 420 is assumed to be stored in the auxiliary storage device 403.

[Cooperation between components during preprocessing]
FIG. 5 shows how the components of the XML document search apparatus 400 according to this embodiment cooperate when preprocessing XML documents.

First, a user of the XML document search apparatus 400 instructs the start of processing through the user interface 406 (step S101).

On receiving the start instruction, the CPU 401 reads the XML document set 420 from the auxiliary storage device 403 (step S102). The XML document set 420 that has been read is stored in the main storage device 402.

Next, the XML document analysis unit 410 (CPU 401) analyzes each XML document included in the XML document set 420 and generates the path trie 421, the sequenced DOM tree 422, and the text data 424 (step S103). The CPU 401 then outputs the path trie 421, the sequenced DOM tree 422, and the text data 424 to the auxiliary storage device 403 (step S104).

In step S101, the user may input the XML document set 420 directly, in which case step S102 can be omitted. The XML document set 420 may also be read from the external storage device 430.

[Overview of preprocessing]
FIG. 6 is a flowchart explaining the flow of the preprocessing that the XML document search apparatus 400 according to this embodiment executes before a search.

When the XML document set 420 is input, the CPU 401 executes the analysis processing of the XML document analysis unit 410 (step S201). The XML document analysis unit 410 analyzes each XML document included in the XML document set 420 and generates the path trie 421, the sequenced DOM tree 422, and the text data 424. Details of the preprocessing executed by the XML document analysis unit 410 are described later. When the processing ends, the CPU 401 outputs the path trie 421, the sequenced DOM tree 422, and the text data 424 to the auxiliary storage device 403 (step S202).

[Sequenced DOM tree]
An example of the sequenced DOM tree 422 generated by the XML document analysis unit 410 is described with reference to FIG. 7. FIG. 7 also shows an example of the path trie 421. The connections between the nodes appearing in the DOM tree (that is, the shape of the DOM tree) are recorded in the sequence S. Numbers are stored in the sequence S according to the following rules.
・The information (subsequence) corresponding to each document starts with the number 0.
・Each subsequent number in the information (subsequence) corresponding to a document is the depth of the element corresponding to a start tag found when the XML document is read in order from the beginning. As described above, text is handled as if it were enclosed by a start tag and an end tag whose tag name is "#".

In the example shown in FIG. 7, reading the first XML document (upper left) from the beginning yields tags and text in the order <a>, "text 1", <b>, "text 2", </b>, <c>, "text 3", </c>, </a>. The depths of the items other than end tags are 1, 2, 2, 3, 2, 3. Taking into account the 0 that marks the beginning of an XML document, 0, 1, 2, 2, 3, 2, 3 is therefore appended to the sequence S. Similarly, 0, 1, 2, 3, 2, 2, 3 is appended for the second XML document (upper right).
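The two rules for the sequence S can be reproduced with a short Python sketch. This is a minimal illustration rather than the apparatus itself: it assumes the standard `xml.etree.ElementTree` parser, skips whitespace-only text, ignores attributes, and uses the hypothetical names `append_depths` and `depth_sequence`. The sample string is the first document as spelled out in the paragraph above.

```python
import xml.etree.ElementTree as ET


def append_depths(elem, depth, S):
    """Append the depth of elem and of everything below it, in document order."""
    S.append(depth)
    if elem.text and elem.text.strip():
        S.append(depth + 1)                 # text region, treated as a "#" element
    for child in elem:
        append_depths(child, depth + 1, S)
        if child.tail and child.tail.strip():
            S.append(depth + 1)             # text following the child


def depth_sequence(documents):
    """Build the sequence S for a list of XML document strings."""
    S = []
    for xml_text in documents:
        S.append(0)                         # every document starts with 0
        append_depths(ET.fromstring(xml_text), 1, S)
    return S


first_document = "<a>text 1<b>text 2</b><c>text 3</c></a>"
print(depth_sequence([first_document]))     # [0, 1, 2, 2, 3, 2, 3]
```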

The sequence S can thus record the shape of the DOM tree. To serve as search data, however, information determined by the kinds of tags (the path type) is also needed. The path type, identified by a number assigned to each structure path, is therefore recorded in a sequence T[d] prepared for each depth d of the DOM tree. The number giving the path type is assigned uniquely among the structure paths that have the same depth.

Path type numbers are recorded based on the path trie 421. The path trie 421 is tree-structured data constructed so as to contain every structure path that appears in any of the XML documents making up the XML document set 420. The path trie 421 can be constructed by a known method (for example, Non-Patent Document 4). In this embodiment, a function is added to the known construction method: whenever a new node has to be added during construction of the path trie 421, a new path type number is assigned to that node. How the numbers are assigned is described later. In FIG. 7, the numbers 703 enclosed in parentheses are the path type numbers.
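To make the relationship between S and T[d] concrete, the following Python sketch finds the elements with a given depth and path type by a plain linear scan. It is only a naive illustration: it assumes that the k-th occurrence of depth d in S corresponds to the k-th entry of T[d], and that an element's number equals its 1-based position in S. The function name `find_elements` and the path type values in T are hypothetical, and this scan is not presented as the search procedure of the embodiment.

```python
def find_elements(S, T, depth, path_type):
    """Return the 1-based positions in S of elements with this depth and path type."""
    seen = 0                                  # depth-`depth` elements passed so far
    hits = []
    for position, value in enumerate(S, start=1):
        if value == depth:
            if T[depth][seen] == path_type:   # seen-th entry of T[depth] describes it
                hits.append(position)
            seen += 1
    return hits


# S for one document; the path types in T are illustrative values only.
S = [0, 1, 2, 2, 3, 2, 3]
T = {1: [1], 2: [1, 2, 3], 3: [1, 2]}
print(find_elements(S, T, 2, 3))              # [6]
```

Keeping one sequence per depth means that only T[d] for the queried depth has to be consulted, while S supplies the position of each hit.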

[Generation of sequences S and T]
FIG. 8 shows the details of the analysis operation executed by the XML document analysis unit 410 according to this embodiment. This analysis operation creates the sequence S and the sequences T[d].

First, the XML document analysis unit 410 initializes the sequences S, T[1], ..., T[D] to empty sequences (step S300). Here, D is the depth of the deepest element in the XML document set 420. The XML document analysis unit 410 also initializes all elements R[1], ..., R[D] of an array R to 1 (step S300), and initializes the path trie 421 so that it has only a root node 701 (FIG. 7).

Next, the XML document analysis unit 410 determines whether all documents included in the XML document set 420 have been processed (step S301). Until the result is affirmative, steps S302 to S309 described below are executed repeatedly.

If the result of step S301 is negative, the XML document analysis unit 410 reads an unprocessed document (step S302). This document is referred to below as X. In step S302, the XML document analysis unit 410 appends a 0, marking the beginning of a document, to the sequence S. It also prepares a variable d with initial value 0, and a variable v initialized to point to the root node 701 of the path trie.

Next, the XML document analysis unit 410 determines whether document X has been read to the end (step S303). Until the result is affirmative, steps S304 to S309 described below are executed repeatedly.

If the result of step S303 is negative, the XML document analysis unit 410 determines whether the tag at the current reading position is an end tag (step S304). If so, the XML document analysis unit 410 changes the variable v to point to its parent node in the path trie 421 and subtracts 1 from the variable d (step S304-1). It then advances the reading position to just after the end tag and returns to step S304.

If the result of step S304 is negative, the XML document analysis unit 410 determines whether the current reading position in document X is text rather than a tag (step S305). If so, the XML document analysis unit 410 reads the text and appends its content to the text data 424 (step S305-1). It then sets the variable t to "#" and proceeds to step S307 (step S305-1).

If the result of step S305 is negative, the XML document analysis unit 410 reads the start tag and advances the reading position to just after it. It also sets the variable t to the tag name of this start tag (step S306).

ステップＳ３０５−１の後又はステップＳ３０６の後、ＸＭＬ文書分析部４１０は、パストライ４２１上のノードｖに、タグ名が変数ｔの値に一致する子ノードｖ’が存在するか否か判定する（ステップＳ３０７）。否定結果が得られた場合、ＸＭＬ文書分析部４１０は、新規に子ノード（以下「ｖ’」という）を作成し、ｖ’のパス種別をＲ[d]の値とした後、Ｒ[d]に１を加える（ステップＳ３０７−１）。 After step S305-1 or step S306, the XML document analysis unit 410 determines whether the node v on the path trie 421 has a child node v′ whose tag name matches the value of the variable t (step S307). If a negative result is obtained, the XML document analysis unit 410 creates a new child node (hereinafter referred to as “v′”), sets the path type of v′ to the value of R[d], and then adds 1 to R[d] (step S307-1).

ステップＳ３０７で肯定結果が得られた場合、ＸＭＬ文書分析部４１０は、ｖの子ノードｖ’を指すように変数ｖを更新し、変数ｄに１を加える（ステップＳ３０８）。 If a positive result is obtained in step S307, the XML document analysis unit 410 updates the variable v to point to the child node v 'of v, and adds 1 to the variable d (step S308).

ステップＳ３０７−１の後又はステップＳ３０８の後、ＸＭＬ文書分析部４１０は、数列Ｓに変数ｄの値を追加し、さらに数列Ｔ[d]に更新された変数ｖのパス種別を追加し、ステップＳ３０３に戻る（ステップＳ３０９）。 After step S307-1, or after step S308, the XML document analysis unit 410 adds the value of the variable d to the sequence S, and further adds the updated path type of the variable v to the sequence T [d]. The process returns to S303 (step S309).
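Steps S301 to S309 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: each document is assumed to be already tokenized into hypothetical `("start", name)`, `("end", name)` and `("text", content)` events; the per-depth counters R are assumed to start at 1 and to be indexed by the depth of the new node; a newly created child is entered in the same way as an existing one; and a text node `#` returns to its parent immediately because it has no end tag. The names `TrieNode` and `analyze` are illustrative only.

```python
from collections import defaultdict


class TrieNode:
    """One node of the path trie: a tag name plus its per-depth path type."""

    def __init__(self, tag, path_type=0, parent=None):
        self.tag = tag
        self.path_type = path_type
        self.parent = parent
        self.children = {}


def analyze(documents):
    """Build the path trie, the depth sequence S and the sequences T[d]."""
    root = TrieNode(None)
    S = []                        # depth of every node, in document order
    T = defaultdict(list)         # T[d]: path types of the nodes at depth d
    R = defaultdict(lambda: 1)    # next unused path type per depth (assumed start: 1)
    texts = []                    # collected text contents
    for events in documents:      # S301/S302: one pass per document
        S.append(0)               # "0" marks the beginning of a document
        v, d = root, 0
        for kind, value in events:            # S303: until the document ends
            if kind == "end":                 # S304: go back to the parent
                v, d = v.parent, d - 1
                continue
            if kind == "text":                # S305: text is stored as tag "#"
                texts.append(value)
                t = "#"
            else:                             # S306: start tag
                t = value
            child = v.children.get(t)         # S307: look for a matching child
            if child is None:                 # S307-1: create it with a new path type
                child = TrieNode(t, R[d + 1], v)
                R[d + 1] += 1
                v.children[t] = child
            v, d = child, d + 1               # S308: descend
            S.append(d)                       # S309: record depth and path type
            T[d].append(v.path_type)
            if kind == "text":                # assumption: text nodes close at once
                v, d = v.parent, d - 1
    return root, S, T, texts


# Hypothetical input corresponding to <a><b>x</b><b/></a>
doc = [("start", "a"), ("start", "b"), ("text", "x"), ("end", "b"),
       ("start", "b"), ("end", "b"), ("end", "a")]
root, S, T, texts = analyze([doc])
```

Both `b` elements share one trie node, so they receive the same path type in T[2], while S records only their depths.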

［検索動作時の構成間連携］ [Inter-configuration linkage during search operations]
図９に、本形態例に係るＸＭＬ文書検索装置４００がＸＭＬ文書を検索する際の各構成間の連携動作を示す。 FIG. 9 shows a cooperative operation between components when the XML document search apparatus 400 according to this embodiment searches for an XML document.

まず、ＸＭＬ文書検索装置４００は、検索に使用するパストライ４２１と数列化されたＤＯＭ木４２２を、補助記憶装置４０３から主記憶装置４０２に予め読み出す（ステップＳ４０１）。これらのデータは、前述した前処理により事前に作成されたデータである。 First, the XML document search apparatus 400 reads in advance the path trie 421 and the numbered DOM tree 422 used for the search from the auxiliary storage device 403 to the main storage device 402 (step S401). These data are data created in advance by the preprocessing described above.

ＸＭＬ文書検索装置４００の利用者が、ユーザインタフェース４０６を通じ、検索クエリとしての構造パスを投入する（ステップＳ４０２）。検索クエリを受け取ったＣＰＵ４０１は、パストライ４２１を使用し、ＸＭＬ文書集合４２０において、検索クエリに含まれる構造パスに該当する要素の深さｄとパス種別ｔを計算する（ステップＳ４０３）。 A user of the XML document search apparatus 400 inputs a structure path as a search query through the user interface 406 (step S402). Receiving the search query, the CPU 401 uses the path trie 421 to calculate the depth d and the path type t of the element corresponding to the structural path included in the search query in the XML document set 420 (step S403).

次に、要素探索部４１３（ＣＰＵ４０１）は、数列化されたＤＯＭ木４２２を使用し、検索クエリに含まれる構造パスに合致する要素番号をすべて列挙する（ステップＳ４０４）。その後、ＣＰＵ４０１は、得られた要素番号をユーザへ送信する（ステップＳ４０５）。 Next, the element search unit 413 (CPU 401) uses the numbered DOM tree 422 to enumerate all element numbers that match the structure path included in the search query (step S404). Thereafter, the CPU 401 transmits the obtained element number to the user (step S405).

［検索動作の詳細］ [Details of search operation]
図１０に、本形態例に係るＸＭＬ文書検索装置４００がＸＭＬ文書を検索する際の処理の流れを示す。なお、ＣＰＵ４０１は、検索に用いるパストライ４２１、数列化されたＤＯＭ木４２２、テキストデータ４２４を、補助記憶装置４０３から主記憶装置４０２に事前に読み出しているものとする。 FIG. 10 shows the flow of processing when the XML document search apparatus 400 according to this embodiment searches for an XML document. It is assumed that the CPU 401 has previously read the path trie 421, the numbered DOM tree 422, and the text data 424 used for the search from the auxiliary storage device 403 to the main storage device 402.

まず、ユーザがユーザインタフェース４０６を通じ、検索クエリとしての構造パスをＸＭＬ文書検索装置４００に投入する。この後、構造パス分析部４１２は、ＸＭＬ文書集合４２０について作成されたパストライ４２１にアクセスし、検索クエリとして指定された構造パスに該当する要素の深さｄとパス種別ｔを計算する（ステップＳ５０１）。 First, the user inputs a structure path as a search query to the XML document search apparatus 400 through the user interface 406. Thereafter, the structural path analysis unit 412 accesses the path trie 421 created for the XML document set 420, and calculates the depth d and the path type t of the element corresponding to the structural path specified as the search query (step S501). ).

次に、要素探索部４１３は、数列化されたＤＯＭ木４２２にアクセスし、当該構造パスに該当する要素の番号を列挙する（ステップＳ５０２）。 Next, the element search unit 413 accesses the numbered DOM tree 422 and lists the numbers of elements corresponding to the structure path (step S502).

図１１に、前述した検索処理の概要を示す。図１１は、検索クエリとして、「／ａ／ｂ」で表される構造パスが与えられた場合について、この構造パスが出現する要素の番号を計算する概念を表している。 FIG. 11 shows an outline of the above-described search process. FIG. 11 shows the concept of calculating the number of an element in which a structural path appears when a structural path represented by “/ a / b” is given as a search query.

この構造パスは、「／ａ／ｂ」でｂが２番目のタグ名なので、深さが「２」である。さらに、パストライ４２１において、ルートノード７０１から「ａ」、「ｂ」と辿っていくと、「ｂ（２）」と書かれたノードに到達する。「（２）」は、このノードのパス種別が「２」であることを表している。従って、「／ａ／ｂ」の出現位置を全て知るためには、深さ「２」にあるパス種別が「２」の要素の全てについて、要素番号を計算すればよい。そのために、後述するｒａｎｋ演算及びｓｅｌｅｃｔ演算を実行する。 Since this structure path is “/a/b” and b is the second tag name, its depth is “2”. Further, tracing “a” and then “b” from the root node 701 in the path trie 421 reaches the node labeled “b (2)”. The “(2)” indicates that the path type of this node is “2”. Therefore, to find every position where “/a/b” appears, it suffices to calculate the element numbers of all elements at depth “2” whose path type is “2”. For this purpose, the rank and select operations defined below are executed.
（１）ｒａｎｋ（Ａ，ｃ，ｉ）＝数列Ａのｉ番目までの要素にあるｃの数 (1) rank(A, c, i) = the number of occurrences of c among the first i elements of the sequence A
（２）ｓｅｌｅｃｔ（Ａ，ｃ，ｊ）＝数列Ａにｊ番目に出現するｃの位置 (2) select(A, c, j) = the position of the j-th occurrence of c in the sequence A

図１２に、ｒａｎｋ演算およびｓｅｌｅｃｔ演算の例を示す。図１２の例の場合、数列Ｘの１０番目までの要素にある「３」の数を与えるｒａｎｋ（Ｘ，３，１０）は「２」である。また、数列Ｘについて「３」が２番目に出現する位置を与えるｓｅｌｅｃｔ（Ｘ，３，２）は「７」である。 FIG. 12 shows an example of rank operation and select operation. In the example of FIG. 12, rank (X, 3, 10) that gives the number of “3” in the elements up to the tenth in the sequence X is “2”. Further, select (X, 3, 2) which gives the position where “3” appears second in the sequence X is “7”.
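The two definitions can be written directly as linear scans. The sketch below is only meant to pin down the semantics, assuming 1-based positions as in the text; the sequence `A` is a hypothetical example and is not the sequence X of FIG. 12.

```python
def rank(A, c, i):
    """Number of occurrences of c among the first i elements of A."""
    return sum(1 for x in A[:i] if x == c)


def select(A, c, j):
    """1-based position of the j-th occurrence of c in A, or None if absent."""
    if j < 1:
        return None
    seen = 0
    for pos, x in enumerate(A, start=1):
        if x == c:
            seen += 1
            if seen == j:
                return pos
    return None


A = [1, 3, 2, 3, 1, 3]       # hypothetical sequence
count = rank(A, 3, 4)        # two 3s among the first four elements
position = select(A, 3, 2)   # the second 3 sits at position 4
```

The two operations are inverse to each other in the sense that `rank(A, c, select(A, c, j)) == j` whenever the j-th occurrence exists.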

図１１の説明に戻る。前述の「／ａ／ｂ」を検索する処理動作は、深さが「２」でパス種別が「２」の要素を抽出する処理である。 Returning to the description of FIG. The processing operation for searching for “/ a / b” described above is processing for extracting an element having a depth of “2” and a path type of “2”.

まず、深さが「２」でパス種別が「２」の要素の総数ｎは、ｎ＝ｒａｎｋ（Ｔ[2]，２，｜Ｔ[2]｜）により計算することができる。ただし、｜Ｔ[d]｜は、数列Ｔ[d]の要素数である。 First, the total number n of elements having a depth of “2” and a path type of “2” can be calculated by n = rank (T [2], 2, | T [2] |). However, | T [d] | is the number of elements of the sequence T [d].

次に、１≦ｋ≦ｎである全ての整数ｋについて、ｋ’＝ｓｅｌｅｃｔ（Ｔ[2]，２，ｋ）を計算する。この計算で得られる値ｋ’の集合は、深さが「２」の要素に限定した場合、パス種別が「２」の要素が何番目に出現するかを表している。図１１の例では、２番目と６番目である。 Next, k ′ = select (T [2], 2, k) is calculated for all integers k where 1 ≦ k ≦ n. The set of values k ′ obtained by this calculation represents the order in which the element of the path type “2” appears when the depth is limited to the element of “2”. In the example of FIG. 11, they are the second and sixth.

さらに、ｋ”＝ｓｅｌｅｃｔ（Ｓ，２，ｋ’）を計算すれば、検索対象であるＸＭＬ文書集合４２０のＤＯＭ木の形状を現す数列Ｓにおいて、深さが「２」の要素の中でｋ’番目に出現する要素が全体の何番目に位置するかを計算することができる。図１１の例では、４番目と１３番目である。この「４」と「１３」が、要素探索部４１３の検索結果となる。 Further, by calculating k″ = select(S, 2, k′), it is possible to determine where, in the sequence S representing the shape of the DOM tree of the XML document set 420 to be searched, the element that appears k′-th among the elements of depth “2” is located overall. In the example of FIG. 11, these are the fourth and thirteenth positions. These values “4” and “13” are the search results of the element search unit 413.

［構造パスの分析動作］ [Structural path analysis]
図１３に、構造パス分析部４１２が実行する構造パスの分析動作を示す。ここでは、構造パスに含まれる左からｄ番目のタグ名をＰ[d]，タグの総数を｜Ｐ｜とする。 FIG. 13 shows a structure path analysis operation performed by the structure path analysis unit 412. Here, the d-th tag name from the left included in the structure path is P [d], and the total number of tags is | P |.

まず、構造パス分析部４１２は、変数ｖをパストライ４２１のルート７０１にセットし、変数ｄを「０」にセットする（ステップＳ６０１）。 First, the structural path analysis unit 412 sets the variable v to the root node 701 of the path trie 421, and sets the variable d to “0” (step S601).

次に、構造パス分析部４１２は、ｄ≧｜Ｐ｜か否かを判定する（ステップＳ６０２）。ステップＳ６０２で肯定結果が得られるまで、構造パス分析部４１２は、ステップＳ６０３〜Ｓ６０５の処理を繰り返す。因みに、肯定結果が得られた場合（ｄ≧｜Ｐ｜の場合）、構造パス分析部４１２は、深さを与える情報として「ｄ」を出力し、パス種別７０３を与える情報として変数ｖが指すノードのパス種別を出力し、分析処理を終了する（ステップＳ６０６）。 Next, the structural path analysis unit 412 determines whether or not d ≧ | P | (step S602). The structural path analysis unit 412 repeats the processes of steps S603 to S605 until a positive result is obtained in step S602. Incidentally, when an affirmative result is obtained (when d ≧ | P |), the structure path analysis unit 412 outputs “d” as the information that gives the depth, and the variable v points to the information that gives the path type 703. The node path type is output, and the analysis process ends (step S606).

これに対し、ステップＳ６０２で否定結果が得られた場合、構造パス分析部４１２は、変数ｄに「１」を加える（ステップＳ６０３）。 On the other hand, when a negative result is obtained in step S602, the structural path analysis unit 412 adds “1” to the variable d (step S603).

続いて、構造パス分析部４１２は、変数ｖの指すノードの子にタグ名がＰ[d]のものがあるか否か判定する（ステップＳ６０４）。否定結果が得られた場合（子が存在しない場合）、構造パス分析部４１２は、「当該構造無し」と出力し、検索処理自体を終了する（ステップＳ６０７）。 Subsequently, the structural path analysis unit 412 determines whether or not there is a tag name P [d] as a child of the node indicated by the variable v (step S604). When a negative result is obtained (when there are no children), the structure path analysis unit 412 outputs “no such structure” and ends the search process itself (step S607).

一方、ステップＳ６０４において肯定結果が得られた場合（子が存在する場合）、構造パス分析部４１２は、変数ｖをタグ名がＰ[d]である子に変更する(ステップＳ６０５)。 On the other hand, when a positive result is obtained in step S604 (when the child exists), the structure path analysis unit 412 changes the variable v to the child whose tag name is P[d] (step S605).
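The walk of steps S601 to S607 can be sketched as below. This is an illustrative reading only: the trie is represented as nested dictionaries, the query is split on `/`, and the example trie (including the sibling path `/a/c`) is hypothetical. The function name `analyze_path` is not from the source.

```python
def analyze_path(trie_root, query):
    """Return (depth, path_type) for a structure path, or None if it is absent."""
    P = [tag for tag in query.split("/") if tag]   # P[1..|P|], stored 0-based
    v, d = trie_root, 0                            # S601
    while d < len(P):                              # S602: stop once d >= |P|
        d += 1                                     # S603
        child = v["children"].get(P[d - 1])        # S604: child named P[d]?
        if child is None:
            return None                            # S607: no such structure
        v = child                                  # S605
    return d, v["path_type"]                       # S606


# Hypothetical path trie in which /a/b has path type 2 at depth 2
trie = {
    "path_type": 0,
    "children": {
        "a": {
            "path_type": 1,
            "children": {
                "c": {"path_type": 1, "children": {}},
                "b": {"path_type": 2, "children": {}},
            },
        }
    },
}

found = analyze_path(trie, "/a/b")      # (2, 2)
missing = analyze_path(trie, "/a/x")    # None
```

The cost of the walk depends only on the number of tags in the query, not on the size of the document set.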

［要素探索動作］ [Element search operation]
図１４に、要素探索部４１３において実行される検索動作の詳細を示す。 FIG. 14 shows details of the search operation executed in the element search unit 413.

要素探索部４１３はまず変数ｎに、深さがｄであり、かつ、パス種別がｔである要素の「総数」をセットする。この総数は、ｒａｎｋ（Ｔ[d]，ｔ，｜Ｔ[d]｜）の計算値として与えられる。さらに、要素探索部４１３は、変数ｋを初期値「１」にセットする（ステップＳ７０１）。 The element search unit 413 first sets the variable n to the total number of elements whose depth is d and whose path type is t. This total is given by rank(T[d], t, |T[d]|). The element search unit 413 then sets the variable k to the initial value “1” (step S701).

次に、要素探索部４１３は、ｋ＞ｎか否かを判定する（ステップＳ７０２）。肯定結果が得られるまで、要素探索部４１３は、後述するステップＳ７０３〜Ｓ７０６の処理を繰り返し実行する。 Next, the element search unit 413 determines whether k> n is satisfied (step S702). Until a positive result is obtained, the element search unit 413 repeatedly executes the processes of steps S703 to S706 described later.

要素探索部４１３は、検索クエリに合致する次の要素が、深さｄの要素の中で何番目に位置するかを、ｓｅｌｅｃｔ（Ｔ[d]，ｔ，ｋ）により計算し、計算結果を変数ｋ’にセットする（ステップＳ７０３）。 The element search unit 413 calculates, by select(T[d], t, k), the position of the next element matching the search query among the elements of depth d, and sets the result in the variable k′ (step S703).

次に、要素探索部４１３は、検索クエリに合致する次の要素（ステップＳ７０３と同じ要素）が、ＸＭＬ文書全体に対応する数列Ｓの何番目の要素に位置するかを、ｓｅｌｅｃｔ（Ｓ，ｄ，ｋ’）により計算し、計算結果を変数ｋ”にセットする（ステップＳ７０４）。 Next, the element search unit 413 calculates, by select(S, d, k′), the position of that same element (the element found in step S703) in the sequence S corresponding to the entire XML document, and sets the result in the variable k″ (step S704).

要素探索部４１３は、変数ｋ”の値を出力する（ステップＳ７０５）。 The element search unit 413 outputs the value of the variable k″ (step S705).
この後、要素探索部４１３は、変数ｋに「１」を加え、ステップＳ７０２に戻る（ステップＳ７０６）。 Thereafter, the element search unit 413 adds “1” to the variable k and returns to step S702 (step S706).
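The loop of FIG. 14 can be sketched with naive rank and select as follows. The sketch assumes that the counter k runs from 1 to n so that every select call is well defined. The sequence S is the one quoted in the text for FIG. 16, which is consistent with the results “4” and “13” given for FIG. 11; the sequence `T[2]` is hypothetical and was chosen so that path type 2 occupies the second and sixth depth-2 slots.

```python
def rank(A, c, i):
    """Number of occurrences of c among the first i elements of A."""
    return sum(1 for x in A[:i] if x == c)


def select(A, c, j):
    """1-based position of the j-th occurrence of c in A, or None if absent."""
    seen = 0
    for pos, x in enumerate(A, start=1):
        if x == c:
            seen += 1
            if seen == j:
                return pos
    return None


def find_elements(S, T, d, t):
    """Element numbers of all nodes at depth d whose path type is t."""
    n = rank(T[d], t, len(T[d]))      # S701: total number of matches
    hits = []
    k = 1
    while k <= n:                     # S702: loop ends once k > n
        k1 = select(T[d], t, k)       # S703: index among the depth-d nodes
        k2 = select(S, d, k1)         # S704: index in the whole sequence S
        hits.append(k2)               # S705: output
        k += 1                        # S706
    return hits


S = [0, 1, 2, 2, 2, 3, 2, 3, 0, 1, 2, 3, 2, 2, 3]
T = {2: [1, 2, 3, 1, 1, 2, 3]}        # hypothetical path types at depth 2
hits = find_elements(S, T, d=2, t=2)  # [4, 13]
```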

［計算処理の高速化］ [Acceleration of calculation processing]
前述の処理により構造パスに合致するＸＭＬ要素の計算を高速化するには、ｒａｎｋ演算とｓｅｌｅｃｔ演算を高速に処理する必要がある。 In order to speed up the calculation of the XML element that matches the structure path by the above-described processing, it is necessary to process the rank operation and the select operation at high speed.

まず、本形態例の場合、ｒａｎｋ演算は、ステップＳ７０１において、Ｔ[d]に値ｔが幾つあるかを数えるためにしか使用しない。処理の高速化のため、本実施例では、数列化されたＤＯＭ木４２２を補助記憶装置４０３から読み出す際に、予め深さ別に各要素が何回出現したかを数え、パス種別（値ｔ）の順に当該構造パスの出現回数を並べた２次元配列Ｎ７０２（図７）を作成する。Ｎのうち、特定の深さｄに対応する配列Ｎ[d]の内容は、（Ｔ［ｄ］における値１の数、値２の数、…、値ｔの数、…）で与えられる。例えば図７の場合、深さが「２」のパス種別の構造を表す数列Ｔ[2]には、値１が２回、値２が２回、値３が１回、値４が１回出現する。このため、２次元配列Ｎのd=2の部分はＮ[2]＝（２,２,１,１）のように作成される。 First, in this embodiment, the rank operation is used only in step S701, to count how many times the value t occurs in T[d]. To speed up processing, when the sequenced DOM tree 422 is read from the auxiliary storage device 403, this embodiment counts in advance, for each depth, how many times each value appears, and creates a two-dimensional array N 702 (FIG. 7) in which the occurrence counts of the structure paths are arranged in order of path type (value t). The contents of the array N[d] for a specific depth d are given by (number of 1s in T[d], number of 2s, ..., number of values t, ...). For example, in the case of FIG. 7, in the sequence T[2] representing the path types at depth “2”, the value 1 appears twice, the value 2 twice, the value 3 once, and the value 4 once. Therefore, the d = 2 portion of the two-dimensional array N is created as N[2] = (2, 2, 1, 1).
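The count table can be built in one pass over each T[d]. The sketch below is a minimal illustration, assuming that path types at each depth are numbered consecutively from 1; the order of the values in `T[2]` is hypothetical and only the counts (two 1s, two 2s, one 3, one 4) follow the text.

```python
from collections import Counter


def build_counts(T):
    """N[d][t - 1] = number of occurrences of path type t in T[d]."""
    N = {}
    for d, seq in T.items():
        counts = Counter(seq)
        # Path types are assumed to be 1, 2, ..., max(seq) at every depth.
        N[d] = [counts[t] for t in range(1, max(seq) + 1)] if seq else []
    return N


T = {2: [1, 2, 3, 1, 4, 2]}   # hypothetical ordering of the values
N = build_counts(T)           # N[2] is [2, 2, 1, 1]
n = N[2][2 - 1]               # replaces rank(T[2], 2, |T[2]|) in step S701
```

With this table, step S701 becomes a single array lookup instead of a scan of T[d].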

続く、ステップＳ７０３のｓｅｌｅｃｔ演算では、ｋの値が１、２、３、…と順に変化し、Ｔ[d]において値がｔとなる箇所を順に計算する。この処理は、単純に数列Ｔ[d]を先頭から順に読み、値ｔが出現する箇所を計算することで実現できる。数列Ｔ[d]に値ｔが頻出すれば、この処理の計算効率は十分に良い。ただし、数列Ｔ[d]に値ｔが頻出しない場合は、数列を走査する処理時間が性能劣化を招く可能性がある。従って、数列Ｔ[d]に値ｔが頻出しないことが予測される場合には、後述する第２の形態例で説明する手法を用いることが好ましい。なお、繰り返し回数ｎは上述の方法で計算してステップＳ７０２で用いても良いが、ｋ＞ｎかを判定する代わりに、数列Ｔ[d]の要素をすべて読み終わった時点で要素探索部４１３を終了する手法を採用してもよい。 In the subsequent select operation of step S703, the value of k changes in the order 1, 2, 3, ..., and the positions in T[d] where the value is t are calculated in order. This can be realized simply by reading the sequence T[d] from the beginning and finding the positions where the value t appears. If the value t appears frequently in T[d], this is sufficiently efficient. If it does not, however, the time spent scanning the sequence may degrade performance. Therefore, when the value t is expected to be infrequent in T[d], it is preferable to use the technique described in the second embodiment below. The number of repetitions n may be calculated by the above method and used in step S702; alternatively, instead of testing whether k > n, the element search unit 413 may terminate once all the elements of the sequence T[d] have been read.

ステップＳ７０４のｓｅｌｅｃｔ演算についても、やはり数列を先頭から順番に読み、値ｄがｋ’（＝ｓｅｌｅｃｔ（Ｔ［d］，ｔ，ｋ））回目に出現する位置を求めることで計算できる。 The select operation in step S704 can also be calculated by reading the sequence in order from the top and obtaining the position where the value d appears k '(= select (T [d], t, k)).

ここで、１＜ｋ≦ｎならば、ｋ’＝ｓｅｌｅｃｔ(Ｔ［d］，ｔ，ｋ)の値が、ｋについて単調に増加する。このため、次の不等式が成立する。 Here, if 1 <k ≦ n, the value of k ′ = select (T [d], t, k) increases monotonously for k. For this reason, the following inequality holds.

select(T[d], t, k−1) < select(T[d], t, k)

同様に、ｓｅｌｅｃｔ（Ｓ，ｄ，ｋ’）の値も、ｋ’について単調に増加する。このため、次の不等式が成立する。 Similarly, the value of select(S, d, k′) also increases monotonically with k′. For this reason, the following inequality holds.

select(S, d, select(T[d], t, k−1)) < select(S, d, select(T[d], t, k))

このため、ステップＳ７０４では、前回出力した値であるｓｅｌｅｃｔ（Ｓ，ｄ，ｓｅｌｅｃｔ（Ｔ［d］，ｔ，ｋ−１））を保存しておき、その位置から数列Ｓを走査してｓｅｌｅｃｔ（Ｓ，ｄ，ｓｅｌｅｃｔ（Ｔ［d］，ｔ，ｋ））を計算すれば、計算時間を短縮することができる。 For this reason, in step S704, select (S, d, select (T [d], t, k-1)), which is the last output value, is stored, and the sequence S is scanned from that position to select ( If S, d, select (T [d], t, k)) is calculated, the calculation time can be shortened.
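Because both select values only move forward, the whole search can be done in a single left-to-right pass in which the cursor into S is never rewound. The sketch below illustrates that idea under the assumption that T[d] has exactly one entry per depth-d entry of S; the function name and the sequence `T_2` are hypothetical.

```python
def find_elements_single_pass(S, T_d, d, t):
    """Same result as repeated select calls, but S and T[d] are each read once."""
    hits = []
    s_pos = 0        # number of entries of S consumed so far
    seen_depth = 0   # number of depth-d entries of S consumed so far
    for k1, path_type in enumerate(T_d, start=1):
        if path_type != t:
            continue
        # Resume from the saved position and advance to the k1-th depth-d entry.
        # Assumes S holds at least len(T_d) entries equal to d.
        while seen_depth < k1:
            if S[s_pos] == d:
                seen_depth += 1
            s_pos += 1
        hits.append(s_pos)   # s_pos now equals the 1-based position of that entry
    return hits


S = [0, 1, 2, 2, 2, 3, 2, 3, 0, 1, 2, 3, 2, 2, 3]
T_2 = [1, 2, 3, 1, 1, 2, 3]                          # hypothetical T[2]
hits = find_elements_single_pass(S, T_2, d=2, t=2)   # [4, 13]
```

This variant also ends when T[d] is exhausted, so the count n is not needed.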

［変形例］ [Modification]
本実施形態で例示した種々のソフトウェアは、電磁的、電子的及び光学式等の種々の記録媒体に格納可能であり、インターネット等の通信網を通じて、コンピュータにダウンロード可能である。 The various software exemplified in this embodiment can be stored in various recording media such as electromagnetic, electronic, and optical, and can be downloaded to a computer through a communication network such as the Internet.

以上の説明では、数列化されたＤＯＭ木４２２の全体を主記憶装置４０２に読み込んで処理する方法を説明した。しかし、実際の運用では、主記憶装置４０２に入りきらない膨大なデータを処理可能とするため、データを補助記憶装置４０３に配置し、処理に必要な箇所をその都度、主記憶装置４０２へ読み込むことが好ましい。 In the above description, the method of reading the entire DOM tree 422 converted into a sequence into the main storage device 402 and processing it has been described. However, in actual operation, in order to be able to process a huge amount of data that cannot fit in the main storage device 402, the data is arranged in the auxiliary storage device 403, and a part necessary for processing is read into the main storage device 402 each time. It is preferable.

［第１の形態例の効果］ [Effect of the first embodiment]
本形態例に係るＸＭＬ文書検索装置４００を用いれば、ＸＰａｔｈにより記述された検索クエリによるＸＭＬ文書の検索処理を高速化することができる。 If the XML document search apparatus 400 according to this embodiment is used, it is possible to speed up the XML document search process using a search query described in XPath.

［第２の形態例］ [Second Embodiment]
［Ｗａｖｅｌｅｔ木］ [Wavelet tree]
数列をコンパクトに圧縮し、さらにデータを圧縮したままでｒａｎｋ演算及びｓｅｌｅｃｔ演算を効率よく計算できるデータ構造として、Ｗａｖｅｌｅｔ木が知られている（例えば、Navarro, G. and Makinen, V., Compressed full-text indexes, ACM Computing Surveys 39(1): Article 2, 2007.）。 The Wavelet tree is known as a data structure that compresses a sequence compactly and allows rank and select operations to be computed efficiently while the data remains compressed (for example, Navarro, G. and Makinen, V., Compressed full-text indexes, ACM Computing Surveys 39(1): Article 2, 2007).

Ｗａｖｅｌｅｔ木は、０と１の任意の並びであるビットベクトルと呼ばれるデータ構造を用いて構築される。ビットベクトルの例を、図１５に示す。ビットベクトルを数列と見たときに効率よくｒａｎｋ演算及びｓｅｌｅｃｔ演算を行うために、一定間隔でｒａｎｋ演算及びｓｅｌｅｃｔ演算の結果をサンプリングし格納するとともに、サンプリングされていない箇所の値は、短いビットベクトルのｒａｎｋ演算及びｓｅｌｅｃｔ演算の結果を事前計算して格納した１５０１のような表を併用し、高速に計算する手法が知られている（例えば非特許文献５を参照)。 The Wavelet tree is constructed using a data structure called a bit vector that is an arbitrary sequence of 0s and 1s. An example of the bit vector is shown in FIG. In order to efficiently perform the rank operation and the select operation when the bit vector is viewed as a sequence, the results of the rank operation and the select operation are sampled and stored at regular intervals, and the value of the unsampled portion is a short bit vector. There is known a technique for calculating at high speed by using a table such as 1501 in which the results of the rank operation and the select operation are pre-calculated and stored (see Non-Patent Document 5, for example).

図１６に、数列Ｓ「０１２２２３２３０１２３２２３」に対するＷａｖｅｌｅｔ木の構造例を示す。Ｗａｖｅｌｅｔ木は、木のノードにビットベクトルを格納し、全体として数列と同等の情報を記録できるデータ構造である。 FIG. 16 shows a structural example of a Wavelet tree for the sequence S “012223230123223”. A Wavelet tree is a data structure in which a bit vector is stored in a node of the tree and information equivalent to a numerical sequence can be recorded as a whole.

ルートノード１６０１には、数列に格納された値を２つのグループに分割するとき、各値がどちらのグループに属すかを記録したビットベクトルＢ１が格納される。図１６の例では、値が「２」であるか、「２」でないかでグループ分けを行っている。ルートノード１６０１の下にあるノード１６０２では、ルートノードでグループ分けされた「２」以外の値を、さらにグループ分けし、「３」であるか、「３」でないかでグループ分けしたビットベクトルＢ２が格納されている。同様に、その下のノード１６０３は、残った「０」と「１」を区別している。 The root node 1601 stores a bit vector B1 that records which group each value belongs to when the values stored in the numerical sequence are divided into two groups. In the example of FIG. 16, grouping is performed depending on whether the value is “2” or not “2”. In a node 1602 below the root node 1601, the values other than “2” grouped by the root node are further grouped, and the bit vector B2 is grouped by “3” or not “3”. Is stored. Similarly, the lower node 1603 distinguishes the remaining “0” and “1”.

Ｗａｖｅｌｅｔ木に対するｒａｎｋ演算は、ルートノード１６０１から木を辿ることによって行う。例えばｒａｎｋ（Ｓ，３，７）を計算する場合を説明すると以下のようになる。 The rank operation on a Wavelet tree is performed by traversing the tree from the root node 1601. For example, rank(S, 3, 7) is calculated as follows.

まず、「３」はルートノード１６０１では、「１」にグループ分けされている。このため、ｒａｎｋ（Ｂ１，１，７）を計算し、「４」を得る。この結果は、数列Ｓの７番目までに０，１，３が計４個あることを意味する。 First, “3” is grouped into “1” in the root node 1601. Therefore, rank (B1, 1, 7) is calculated to obtain “4”. This result means that there are a total of four, 0, 1, 3 by the seventh in the sequence S.

次のビットベクトルＢ２では、「３」が「１」にグループ分けされている。このため、ｒａｎｋ（Ｂ２，１，４）を計算し、「２」を得る。この結果は、数列Ｓの７番目までに出現する４個の０，１，３のうち、「３」が計２個であることを表す。このように、ｒａｎｋ（Ｓ，３，７）の結果として、正しく「２」が計算できたことが分かる。 In the next bit vector B2, “3” is grouped into “1”. Therefore, rank (B2, 1, 4) is calculated to obtain “2”. This result indicates that among the four 0, 1, 3 appearing up to the seventh in the sequence S, “3” is a total of two. Thus, it can be seen that “2” was correctly calculated as the result of rank (S, 3, 7).

これに対し、ｓｅｌｅｃｔ演算は、リーフからルートノードに木を辿ることによって計算する。例えばｓｅｌｅｃｔ（Ｓ，３，２）を計算する場合を説明すると以下のようになる。 On the other hand, the select operation is calculated by tracing the tree from the leaf to the root node. For example, the case of calculating select (S, 3, 2) will be described as follows.

まず、「３」を表すリーフの親ノードであるノード１６０２において、２番目の「３」に該当する位置を計算する。「３」は、ノード１６０２において、「１」にグループ分けされている。このため、ｓｅｌｅｃｔ（Ｂ２，１，２）を計算すると、「４」が得られる。この結果は、２番目の「３」が、０，１，３だけを取り出した部分数列では４番目の値であることを意味する。 First, in the node 1602 that is the parent node of the leaf representing “3”, the position corresponding to the second “3” is calculated. “3” is grouped into “1” in the node 1602. Therefore, “4” is obtained by calculating select (B2, 1, 2). This result means that the second “3” is the fourth value in the partial sequence from which only 0, 1, 3 are extracted.

さらに、この値が数列Ｓの中で何番目に位置するかを求めるには、ｓｅｌｅｃｔ（Ｂ１，１，４）を計算すれば良い。この場合、「７」が得られる。この結果は、０、１又は３が４番目に現れる位置が全体では７番目の値であることを表す。すなわち、この「７」がｓｅｌｅｃｔ（Ｓ，３，２）の計算結果となる。 Further, in order to determine what position this value is in the sequence S, select (B1, 1, 4) may be calculated. In this case, “7” is obtained. This result represents that the position where 0, 1, or 3 appears fourth is the seventh value as a whole. That is, “7” is the calculation result of select (S, 3, 2).

このように、Ｗａｖｅｌｅｔ木を用いると、ビットベクトルに対するｒａｎｋ演算やｓｅｌｅｃｔ演算を、最大でも木の高さに等しい回数分の処理の繰り返しにより実現できる。すなわち、最大でも木の高さに等しい計算処理の回数の実行により、数列に対するｒａｎｋ演算及びｓｅｌｅｃｔ演算の解を得ることができる。 As described above, when a Wavelet tree is used, a rank operation or a select operation on a bit vector can be realized by repeating the processing for the number of times equal to the height of the tree at most. That is, by executing the number of calculation processes equal to the height of the tree at the maximum, it is possible to obtain a solution of rank operation and select operation for the sequence.
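The two walks described above can be reproduced with plain lists standing in for the bit vectors. This is a sketch only: the sequence below is an assumed one, chosen so that the intermediate values 4, 2, 4 and 7 quoted in the text come out, and FIG. 16 itself is not reproduced here. The bit-vector rank and select are naive scans, whereas a practical implementation would use the sampled structures mentioned earlier.

```python
def rank(B, c, i):
    """Number of occurrences of c among the first i elements of B."""
    return sum(1 for x in B[:i] if x == c)


def select(B, c, j):
    """1-based position of the j-th occurrence of c in B, or None if absent."""
    seen = 0
    for pos, x in enumerate(B, start=1):
        if x == c:
            seen += 1
            if seen == j:
                return pos
    return None


S = [0, 1, 2, 2, 3, 2, 3, 2, 0, 1, 2, 3, 2, 2, 3]   # assumed example sequence

# Root node: bit 1 marks every value other than 2.
B1 = [0 if x == 2 else 1 for x in S]
# Second node: over the remaining values only, bit 1 marks the value 3.
B2 = [1 if x == 3 else 0 for x in S if x != 2]


def rank3(i):
    """rank(S, 3, i): walk from the root node down to the leaf for 3."""
    return rank(B2, 1, rank(B1, 1, i))


def select3(j):
    """select(S, 3, j): walk from the leaf for 3 up to the root node."""
    return select(B1, 1, select(B2, 1, j))


r = rank3(7)     # rank(B1, 1, 7) is 4, then rank(B2, 1, 4) gives 2
p = select3(2)   # select(B2, 1, 2) is 4, then select(B1, 1, 4) gives 7
```

Each query touches one bit vector per tree level, which is why the number of steps is bounded by the height of the tree.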

Ｗａｖｅｌｅｔ木の高さは、数列の長さよりも遥かに小さな値であり、特に数列の長さが非常に長く、ｒａｎｋ演算又はｓｅｌｅｃｔ演算の第二引数の値の出現頻度が小さいとき、数列を直接走査する場合に比して効率がよい。また、Ｗａｖｅｌｅｔ木は、格納される数列において各値の出現頻度に偏りがあると、圧縮が可能であることが知られている。 The height of the wavelet tree is much smaller than the length of the sequence, especially when the sequence is very long and the frequency of the second argument of the rank operation or select operation is small. It is more efficient than scanning. Further, it is known that the Wavelet tree can be compressed if the appearance frequency of each value is biased in the stored numerical sequence.

［装置構成］ [Device configuration]
本形態例では、第１の形態例において、ｒａｎｋ演算及びｓｅｌｅｃｔ演算を実行していた箇所に、Ｗａｖｅｌｅｔ木を適用する手段を提供する。 In this embodiment, a means for applying a Wavelet tree to a place where a rank operation and a select operation are executed in the first embodiment is provided.

図１７に、本形態例に係るＸＭＬ文書検索装置１７００のブロック構成を示す。図１７には、図４との対応部分に同一符号を付して示している。ＸＭＬ文書検索装置１７００は、Ｗａｖｅｌｅｔ木構築部４１１と、Ｗａｖｅｌｅｔ木群４２３を有する点で、第１の形態例と異なる。 FIG. 17 shows a block configuration of an XML document search apparatus 1700 according to this embodiment. In FIG. 17, parts corresponding to those in FIG. 4 are denoted by the same reference numerals. The XML document search apparatus 1700 differs from the first embodiment in that it has a Wavelet tree construction unit 411 and a Wavelet tree group 423.

図１８に、本形態例に係るＸＭＬ文書検索装置１７００がＸＭＬ文書を前処理する際の各構成間の連携動作を示す。 FIG. 18 shows a cooperative operation between components when the XML document search apparatus 1700 according to this embodiment pre-processes an XML document.

まず、ＸＭＬ文書検索装置４００の利用者が、ユーザインタフェース４０６を用いて、処理の開始を指示する（ステップＳ８０１）。 First, the user of the XML document search apparatus 400 instructs the start of processing using the user interface 406 (step S801).

処理の開始指示を受け付けたＣＰＵ４０１は、補助記憶装置４０３からＸＭＬ文書集合４２０を読み出す（ステップＳ８０２）。読み出されたＸＭＬ文書集合４２０は、主記憶装置４０２に格納される。 Receiving the processing start instruction, the CPU 401 reads the XML document set 420 from the auxiliary storage device 403 (step S802). The read XML document set 420 is stored in the main storage device 402.

次に、ＸＭＬ文書分析部４１０（ＣＰＵ４０１）は、ＸＭＬ文書集合４２０に含まれる各ＸＭＬ文書を分析し、パストライ４２１、数列化されたＤＯＭ木４２２、テキストデータ４２４を生成する（ステップＳ８０３）。ここまでは、第１の形態例と同じである。 Next, the XML document analysis unit 410 (CPU 401) analyzes each XML document included in the XML document set 420, and generates a path trie 421, a digitized DOM tree 422, and text data 424 (step S803). The steps so far are the same as those in the first embodiment.

Ｗａｖｅｌｅｔ木構築部４１１は、ＸＭＬ文書分析部４１０が生成した数列化されたＤＯＭ木４２２に含まれる各数列を、Ｗａｖｅｌｅｔ木群４２３に変換する（ステップＳ８０４）。Ｗａｖｅｌｅｔ木構築部４１１は、公知の方法（例えば非特許文献５を参照）を使用し、数列化されたＤＯＭ木４２２をＷａｖｅｌｅｔ木群４２３に変換する。そして、数列化されたＤＯＭ木４２２に代わり、Ｗａｖｅｌｅｔ木群４２３を補助記憶装置に出力し、同様にパストライ４２１、テキストデータ４２４も出力する（ステップＳ８０５）。数列化されたＤＯＭ木４２２は、Ｗａｖｅｌｅｔ木群４２３を構築した後に消去してもよい。 The Wavelet tree construction unit 411 converts each number sequence included in the numbered DOM tree 422 generated by the XML document analysis unit 410 into the Wavelet tree group 423 (step S804). The Wavelet tree construction unit 411 converts the DOM tree 422 converted into a sequence into a Wavelet tree group 423 using a known method (see, for example, Non-Patent Document 5). Then, the Wavelet tree group 423 is output to the auxiliary storage device instead of the DOM tree 422 converted into a sequence, and the path trie 421 and the text data 424 are also output (step S805). The sequenced DOM tree 422 may be deleted after the Wavelet tree group 423 is constructed.

［前処理の概要］ [Overview of preprocessing]
図１９に、本形態例に係るＸＭＬ文書検索装置４００が検索前に実行する前処理の流れを説明するフローチャートを示す。 FIG. 19 is a flowchart for explaining the flow of preprocessing executed by the XML document search apparatus 400 according to this embodiment before search.

ＣＰＵ４０１は、ＸＭＬ文書集合４２０が入力されると、ＸＭＬ文書分析部４１０による分析処理を実行する（ステップＳ９０１）。ＸＭＬ文書分析部４１０は、ＸＭＬ文書集合４２０に含まれる各ＸＭＬ文書を分析し、パストライ４２１、数列化されたＤＯＭ木４２２、テキストデータ４２４を生成する。 When the XML document set 420 is input, the CPU 401 executes analysis processing by the XML document analysis unit 410 (step S901). The XML document analysis unit 410 analyzes each XML document included in the XML document set 420, and generates a path trie 421, a DOM tree 422 converted into a sequence, and text data 424.

この処理が終了すると、ＣＰＵ４０１は、前述したようにＷａｖｅｌｅｔ木群４２３を構築する（ステップＳ９０２）。 When this process ends, the CPU 401 constructs the Wavelet tree group 423 as described above (step S902).

この後、ＣＰＵ４０１は、パストライ４２１、Ｗａｖｅｌｅｔ木群４２３、テキストデータ４２４を補助記憶装置４０３に出力する（ステップＳ９０３）。 Thereafter, the CPU 401 outputs the path trie 421, the Wavelet tree group 423, and the text data 424 to the auxiliary storage device 403 (step S903).

［検索処理］ [Search processing]
本形態例に係るＸＭＬ文書検索装置１７００は、検索の際、まず、パストライ４２１およびＷａｖｅｌｅｔ木群４２３を補助記憶装置４０３から読み出す。前述の通り、数列化されたＤＯＭ木４２２は不要である。 The XML document search apparatus 1700 according to the present embodiment first reads out the path trie 421 and the Wavelet tree group 423 from the auxiliary storage device 403 during the search. As described above, the DOM tree 422 converted into a number sequence is not necessary.

検索動作は、第１の形態例と同様、図１４に従って実行される。第１の形態例との違いは、検索クエリである構造パスが与えられた後に実行されるステップＳ７０１（図１４）のｒａｎｋ演算と、ステップＳ７０３（図１４）及びＳ７０４（図１４）のｓｅｌｅｃｔ演算にＷａｖｅｌｅｔ木を用いる点と、２次元配列Ｎ７０２（図７）が不要な点である。ただし、ｒａｎｋ演算よりも配列参照の処理の方が速いため、本形態例の場合にも、２次元配列Ｎ７０２を使用してもよい。 The search operation is executed according to FIG. 14, as in the first embodiment. The differences from the first embodiment are that Wavelet trees are used for the rank operation in step S701 (FIG. 14), which is executed after the structure path serving as the search query is given, and for the select operations in steps S703 (FIG. 14) and S704 (FIG. 14), and that the two-dimensional array N 702 (FIG. 7) is not required. However, since an array lookup is faster than a rank operation, the two-dimensional array N 702 may also be used in this embodiment.

［第２の形態例の効果］ [Effect of the second embodiment]
本形態例に係るＸＭＬ文書検索装置１７００を用いれば、数列Ｔ[d]に値ｔが頻出しない場合にも、ＸＰａｔｈにより記述された検索クエリによるＸＭＬ文書の検索処理を高速に実行することができる。 If the XML document search apparatus 1700 according to this embodiment is used, even when the value t does not appear frequently in the sequence T [d], the XML document search process using the search query described in XPath can be executed at high speed.

［第３の形態例］ [Third embodiment]
ＸＰａｔｈによる検索では、親要素、子要素、兄弟要素等に関する制約条件が検索クエリに盛り込まれる場合がある。そこで、本形態例では、親要素、子要素、兄弟要素等を探索する機能と、任意の２つの要素ｉ、ｊが、指定された関係にあるか否かを検査するための機能について説明する。以下の説明では、要素ｉ、ｊの深さをｄｉ、ｄｊとし、パス種別をそれぞれｔｉ、ｔｊとする。 In a search using XPath, there are cases where a constraint condition related to a parent element, a child element, a sibling element, or the like is included in a search query. Therefore, in this embodiment, a function for searching for a parent element, a child element, a sibling element, and the like and a function for checking whether or not any two elements i and j are in a specified relationship will be described. In the following description, the depths of the elements i and j are di and dj, and the path types are ti and tj, respectively.

（１）要素ｉの親要素の計算 (1) Calculation of the parent element of element i
深さｄｉ≦１の場合、要素ｉの親要素は存在しない。それ以外の場合、親要素は深さがｄｉ−１で与えられる、ｉ未満でｉに最も近い要素番号の要素である。従って、親要素の要素番号は、ｓｅｌｅｃｔ（Ｓ，ｄｉ−１，ｒａｎｋ（Ｓ，ｄｉ−１，ｉ））を計算することにより取得することができる。 When the depth di ≦ 1, element i has no parent element. Otherwise, the parent element is the element of depth di−1 whose element number is less than i and closest to i. Therefore, the element number of the parent element can be obtained by calculating select(S, di−1, rank(S, di−1, i)).
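The parent formula can be checked with naive rank and select. The sketch below is illustrative only: it takes the depth di as an argument, as the text does, and uses the sequence S quoted earlier for FIG. 16 as sample data.

```python
def rank(A, c, i):
    """Number of occurrences of c among the first i elements of A."""
    return sum(1 for x in A[:i] if x == c)


def select(A, c, j):
    """1-based position of the j-th occurrence of c in A, or None if absent."""
    seen = 0
    for pos, x in enumerate(A, start=1):
        if x == c:
            seen += 1
            if seen == j:
                return pos
    return None


def parent(S, i, di):
    """Element number of the parent of element i (depth di), or None."""
    if di <= 1:
        return None   # document markers and top-level elements have no parent
    # Last element of depth di - 1 located before position i.
    return select(S, di - 1, rank(S, di - 1, i))


S = [0, 1, 2, 2, 2, 3, 2, 3, 0, 1, 2, 3, 2, 2, 3]
p4 = parent(S, 4, 2)     # 2: the depth-1 element of the first document
p13 = parent(S, 13, 2)   # 10: the depth-1 element of the second document
p6 = parent(S, 6, 3)     # 5: the depth-2 element just before position 6
```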

（２）要素ｉの最初の子要素の計算 (2) Calculation of the first child element of element i
子要素が存在すれば、要素番号はｉ＋１で与えられる。ただし、子要素が存在しない場合があり、その場合、要素番号ｉ＋１の要素の深さはｄｉ＋１以外である。第１の形態例で述べたように数列化されたＤＯＭ木４２２を使用している場合はＳ［ｉ＋１］＝ｄｉ＋１か否かを判定すればよいが、第２の形態例で述べたようにＷａｖｅｌｅｔ木群４２３しか使用できない場合は、ｒａｎｋ（Ｓ，ｄｉ＋１，ｉ＋１）＝ｒａｎｋ（Ｓ，ｄｉ＋１，ｉ）＋１か否かを判定すればよい。後者の場合ＣＰＵ４０１は、図２０に示す処理手順により、子要素の存在を判定し（ステップＳ１００１）、存在する場合にはその要素番号ｉ＋１を出力し（ステップＳ１００２）、存在しない場合には「子要素無し」と出力する（ステップＳ１００３）。 If a child element exists, its element number is i+1. However, a child element may not exist, in which case the depth of the element with element number i+1 is other than di+1. When the sequenced DOM tree 422 is used as described in the first embodiment, it suffices to test whether S[i+1] = di+1. When only the Wavelet tree group 423 is available as described in the second embodiment, it suffices to test whether rank(S, di+1, i+1) = rank(S, di+1, i) + 1. In the latter case, the CPU 401 determines whether a child element exists according to the processing procedure shown in FIG. 20 (step S1001), outputs the element number i+1 if it exists (step S1002), and outputs “no child element” if it does not (step S1003).
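The rank-only test of FIG. 20 can be sketched as follows. Only `rank` is used, since that is all a Wavelet tree needs to answer the question; the naive `rank` here stands in for the Wavelet tree query, and the sample sequence is the S quoted earlier for FIG. 16.

```python
def rank(A, c, i):
    """Number of occurrences of c among the first i elements of A."""
    return sum(1 for x in A[:i] if x == c)


def first_child(S, i, di):
    """Element number of the first child of element i (depth di), or None."""
    # S1001: position i + 1 holds depth di + 1 exactly when the count of
    # depth-(di + 1) entries grows by one between positions i and i + 1.
    if rank(S, di + 1, i + 1) == rank(S, di + 1, i) + 1:
        return i + 1          # S1002
    return None               # S1003: no child element


S = [0, 1, 2, 2, 2, 3, 2, 3, 0, 1, 2, 3, 2, 2, 3]
c5 = first_child(S, 5, 2)     # 6: position 6 holds depth 3
c4 = first_child(S, 4, 2)     # None: position 5 holds depth 2, a sibling
```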

(3) Calculation of the sibling elements of element i
If a sibling element exists before element i, the immediately preceding sibling has the same depth di as element i and is the element whose element number is less than i and closest to i. Its element number j can therefore be given as j = select(S, di, rank(S, di, i - 1)).

Conversely, if a sibling element exists after element i, the immediately following sibling has the same depth di as element i and is the element whose element number is greater than i and closest to i. Its element number j can therefore be given as j = select(S, di, rank(S, di, i) + 1).

If no such sibling element exists, then in either case the parent element of only one of i and j lies between element i and element j. Therefore, if rank(S, di - 1, i) = rank(S, di - 1, j), element j is a sibling element; otherwise, the sibling element does not exist.

This determination process is shown as flowcharts in FIGS. 21 and 22. FIG. 21 is a flowchart for finding the immediately preceding sibling element, and FIG. 22 is a flowchart for finding the immediately following sibling element.

First, the flowchart shown in FIG. 21 is described. As stated above, the immediately preceding sibling element j has the same depth di as element i and is the element whose element number is less than i and closest to i. Accordingly, step S1101 calculates j = select(S, di, rank(S, di, i - 1)). Next, for the depth one shallower than the depth di of element i, the count up to the i-th position (= rank(S, di - 1, i)) is compared with the count up to the j-th position (= rank(S, di - 1, j)) (step S1102). If the two values are equal, the elements are children of the same parent element, so the j calculated in step S1101 is output (step S1103). If the two values differ, "no sibling element" is output (step S1104).

Next, the flowchart shown in FIG. 22 is described. As stated above, the immediately following sibling element j has the same depth di as element i and is the element whose element number is greater than i and closest to i. Accordingly, step S1201 calculates j = select(S, di, rank(S, di, i) + 1). Next, for the depth one shallower than that of element i, the count up to the i-th position (= rank(S, di - 1, i)) is compared with the count up to the j-th position (= rank(S, di - 1, j)) (step S1202). If the two values are equal, the elements are children of the same parent element, so the j calculated in step S1201 is output (step S1203). If the two values differ, "no sibling element" is output (step S1204).
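
The two flowcharts can be sketched as follows. This is a minimal illustration that assumes S holds only element depths with 1-based element numbers and uses linear-scan rank and select as stand-ins for the Wavelet tree operations. The function names and the sample document are hypothetical.

```python
def rank(S, c, i):
    """Number of occurrences of value c among the first i entries of S (1-based)."""
    return S[:i].count(c)


def select(S, c, j):
    """1-based position of the j-th occurrence of c in S, or None if there is none."""
    if j < 1:
        return None
    seen = 0
    for pos, value in enumerate(S, start=1):
        if value == c:
            seen += 1
            if seen == j:
                return pos
    return None


def previous_sibling(S, i):
    d = S[i - 1]
    j = select(S, d, rank(S, d, i - 1))          # step S1101
    if j is None:                                # no earlier element of depth di
        return None
    if rank(S, d - 1, i) == rank(S, d - 1, j):   # step S1102: same parent?
        return j                                 # step S1103
    return None                                  # step S1104


def next_sibling(S, i):
    d = S[i - 1]
    j = select(S, d, rank(S, d, i) + 1)          # step S1201
    if j is None:                                # no later element of depth di
        return None
    if rank(S, d - 1, i) == rank(S, d - 1, j):   # step S1202: same parent?
        return j                                 # step S1203
    return None                                  # step S1204


# Hypothetical document <a><b><c/><d/></b><e><f/></e></a>, elements numbered 1..6.
S = [1, 2, 3, 3, 2, 3]
print([previous_sibling(S, i) for i in range(1, 7)])  # [None, None, None, 3, 2, None]
print([next_sibling(S, i) for i in range(1, 7)])      # [None, 5, 4, None, None, None]
```

Element 6 (f) shows why the rank comparison is needed: element 4 (d) is the nearest earlier element of depth 3, but element 5 (e) of depth 2 lies between them, so the counts differ and no sibling is reported. For root elements (di = 1) the value di - 1 never occurs in S, so the comparison is trivially equal; the text does not discuss that case.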

(4) Determination of whether element i is the parent of element j
The parent element of element j is calculated by the method of (1) above, and the determination is made according to whether the calculated parent element matches element i.

(5) Determination of whether element i is an ancestor of element j
Element i is determined to be an ancestor of element j if both of the following conditions hold; otherwise it is determined not to be an ancestor.
(5-1) i < j
If this condition holds, the start tag of element i appears before the start tag of element j.
(5-2) rank(S, di, i) = rank(S, di, j)
If this condition holds, no element other than element i with a depth of di or less lies between element i and element j.
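
A minimal sketch of the ancestor test follows, written with the rank(S, c, i) argument order used for the rank operation elsewhere in this document. It assumes that S holds only element depths with 1-based element numbers. Because a linear-scan rank counts only the exact value di, the sketch additionally requires dj > di; that extra check is an assumption of this example and is not stated in conditions (5-1) and (5-2).

```python
def rank(S, c, i):
    """Number of occurrences of value c among the first i entries of S (1-based)."""
    return S[:i].count(c)


def is_ancestor(S, i, j):
    """True if element i is judged to be an ancestor of element j."""
    di, dj = S[i - 1], S[j - 1]
    if not i < j:                 # condition (5-1)
        return False
    if not dj > di:               # assumed extra check, not part of (5-1)/(5-2)
        return False
    return rank(S, di, i) == rank(S, di, j)  # condition (5-2)


# Hypothetical document <a><b><c/><d/></b><e><f/></e></a>, elements numbered 1..6.
S = [1, 2, 3, 3, 2, 3]
print(is_ancestor(S, 1, 6))  # True: a is an ancestor of f
print(is_ancestor(S, 2, 4))  # True: b is the parent of d
print(is_ancestor(S, 2, 6))  # False: e (depth 2) lies between b and f
print(is_ancestor(S, 4, 5))  # False: e is shallower than d
```

In the last call, the counts of the value 3 up to positions 4 and 5 are equal, so the depth comparison is what rejects the pair in this sketch.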

(6) Determination of whether elements i and j are sibling elements
The parent elements of element i and element j are calculated, and it is determined whether they match.

Alternatively, the elements may be determined to be sibling elements if rank(S, di - 1, i) = rank(S, dj - 1, j).
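
Both forms of the sibling test can be sketched side by side. The sketch assumes that S holds only element depths with 1-based element numbers and uses linear-scan rank and select in place of the Wavelet tree. The checks i != j and di = dj in the rank-based variant are assumptions added for this example, as are all function names.

```python
def rank(S, c, i):
    """Number of occurrences of value c among the first i entries of S (1-based)."""
    return S[:i].count(c)


def select(S, c, j):
    """1-based position of the j-th occurrence of c in S, or None if there is none."""
    if j < 1:
        return None
    seen = 0
    for pos, value in enumerate(S, start=1):
        if value == c:
            seen += 1
            if seen == j:
                return pos
    return None


def parent(S, i):
    """Parent by method (1): select(S, di - 1, rank(S, di - 1, i))."""
    d = S[i - 1]
    return None if d <= 1 else select(S, d - 1, rank(S, d - 1, i))


def are_siblings_by_parent(S, i, j):
    """Compare the computed parents of two distinct non-root elements."""
    p = parent(S, i)
    return i != j and p is not None and p == parent(S, j)


def are_siblings_by_rank(S, i, j):
    """Compare rank(S, di - 1, i) with rank(S, dj - 1, j) for equal depths."""
    di, dj = S[i - 1], S[j - 1]
    return i != j and di == dj and di > 1 and rank(S, di - 1, i) == rank(S, dj - 1, j)


# Hypothetical document <a><b><c/><d/></b><e><f/></e></a>, elements numbered 1..6.
S = [1, 2, 3, 3, 2, 3]
for i, j in [(3, 4), (2, 5), (4, 6), (2, 3)]:
    print((i, j), are_siblings_by_parent(S, i, j), are_siblings_by_rank(S, i, j))
```

The rank-based variant avoids the two select calls, which may matter when many pairs are checked.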

[Fourth embodiment]
A search query defined in XPath may match more than one structure path. For example, the search query "/a//text()" matches every text node that is a descendant of a root element whose tag name is "a". When such a search query is given, all structure paths that match the query are calculated on the path trie, and the union of their search results is taken.
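
The union step can be sketched as follows. For brevity the sketch replaces the path trie with a flat dictionary of structure paths, supports only the "/" and "//" steps, and uses made-up hit positions; all names and data are hypothetical.

```python
import re


def query_to_regex(query):
    """Translate a query using only '/' and '//' into a regex over structure paths."""
    parts = [re.escape(p) for p in query.split("//")]
    # '//' may skip zero or more intermediate steps before the next step.
    return re.compile("(?:/[^/]+)*/".join(parts))


def search(query, hits_by_path):
    """Union of the hit positions of every structure path that matches the query."""
    pattern = query_to_regex(query)
    result = set()
    for path, hits in hits_by_path.items():
        if pattern.fullmatch(path):
            result |= hits
    return sorted(result)


# Hypothetical per-path search results (positions of matching nodes).
hits_by_path = {
    "/a/text()": {3, 9},
    "/a/b/text()": {5},
    "/c/text()": {12},
}
print(search("/a//text()", hits_by_path))  # [3, 5, 9]
```

A trie would let the matching paths be enumerated without testing every stored path, but the union over the per-path results is the same.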

[Fifth embodiment]
A search query defined in XPath may include a condition on text. For example, the search query "/a//text()[contains(.,"abc")]" matches text that is a descendant of a root element whose tag name is "a" and that contains "abc". When such a search query is given, the search result for the XML elements described above is collated with the result of a text search on the text data 424, and the locations that satisfy both conditions are taken as the search result.

Any known technique can be used for the text search processing (see, for example, Non-Patent Document 6).
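
The collation of the two results amounts to a set intersection, as in the following minimal sketch. The positions, the text contents, and the naive substring scan are assumptions made for illustration; any text search method could take the place of the scan.

```python
def text_search(text_data, needle):
    """Positions whose text contains the needle (naive substring scan)."""
    return {pos for pos, text in text_data.items() if needle in text}


# Hypothetical positions of text nodes that match the structural part /a//text().
structural_hits = {3, 5, 9}

# Hypothetical text data, keyed by the position of each text node.
text_data = {3: "xxabcxx", 5: "hello", 9: "abc", 12: "abcabc"}

# Keep only the positions that satisfy both the structural and the text condition.
result = sorted(structural_hits & text_search(text_data, "abc"))
print(result)  # [3, 9]
```

Position 12 contains "abc" but is not a structural hit, and position 5 is a structural hit without "abc", so neither appears in the result.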

[Other embodiments]
The embodiments described above illustrate applications of the present invention and are not intended to limit its technical scope to the specific configurations of those embodiments. Various modifications can be made without departing from the gist of the present invention. For example, the present invention need not include all of the constituent elements of each embodiment described above. In addition, part of one embodiment can be replaced with the configuration of another embodiment, and the configuration of another embodiment can be added to the configuration of one embodiment.

Some or all of the configurations, functions, processing units, processing means, and the like described above may be implemented in hardware, for example as an integrated circuit. Information such as programs, tables, and files that implement each processing function can be stored in a storage device such as a memory, a hard disk, or an SSD (Solid State Drive), or on a storage medium such as an IC card, an SD card, or a DVD.

400 XML document search device
401 CPU
402 Main storage device
403 Auxiliary storage device
404 Removable drive
406 User interface
407 Network interface
410 XML document analysis unit
411 Wavelet tree construction unit
412 Structure path analysis unit
413 Node search unit
420 XML document set
421 Path trie
422 DOM tree converted into a sequence
423 Wavelet tree
424 Text data
430 External storage device
440 Network
1700 XML document search device

Claims

An XML document search device that has a processor, a main storage device, and an input/output device for inputting and outputting XML documents, and that, when given as a search query a structure path listing an element of an XML document and all ancestor elements of that element in order from the root element, searches for locations that match the structure path, wherein
the input/output device receives input of a set of XML documents to be searched,
the XML document search device comprises:
an XML document analysis unit that analyzes the XML documents, recognizes tag types and inclusion relationships, converts the XML documents into numerical sequences, and constructs a path trie;
a structure path analysis unit that uses the path trie to calculate the depth and path type of the structure path given as the search query; and
an element search unit that calculates, based on the depth and path type of the structure path, the locations where elements matching the structure path given as the search query appear,
the XML document analysis unit generates:
a first sequence S that, in order to record the shape of the DOM tree of the XML documents, contains as a subsequence the sequence of element depths in the order in which the elements appear in the XML documents; and
a sequence group T consisting of one or more sequences that record the type of the structure path corresponding to each node of the DOM tree, in which the sequence T[d] included in the sequence group T records the types of the structure paths of depth d, and
the element search unit
calculates the locations that match the structure path given as the search query by scanning the sequence S and the sequence group T.

The XML document search device according to claim 1, wherein
the element search unit,
where the position at which a value c appears for the j-th time in a sequence A is denoted select(A, c, j), the process of calculating this value is called a select operation, the depth of the structure path of the search query is d, its path type is t, and the total number of occurrences of the structure path is n, calculates the value k″ obtained by applying formula (1) for each integer k with 1 ≤ k ≤ n, and thereby searches the XML document set for the locations that match the search query.
[Equation 1]
k″ = select(S, d, select(T[d], t, k))   (1)

The XML document search device according to claim 2, further comprising
a Wavelet tree construction unit that, following the processing of the XML document analysis unit, converts each sequence included in the sequence S and the sequence group T into a Wavelet tree, wherein
the Wavelet tree is used for the select operation.

The XML document search device according to claim 1, further comprising
means for constructing, in order to record the shape of the DOM tree of the XML documents, a sequence that contains as a subsequence the sequence of element depths in the order in which the elements appear in the XML documents, wherein
the number of occurrences of a value c up to the i-th position of the sequence S is denoted rank(S, c, i), and the process of calculating this value is called a rank operation,
the position at which the value c appears for the j-th time in the sequence S is denoted select(S, c, j), and the process of calculating this value is called a select operation, and
for the element of the XML document corresponding to the i-th value of the sequence, where d is the depth of the structure path of that element,
the number of the parent element of the element is calculated by formula (2),
the number of the first child element of the element is calculated by formula (3),
the number of the nearest preceding sibling element of the element is calculated by formula (4), and
the number of the nearest following sibling element of the element is calculated by formula (5).
[Equation 2]
select(S, d - 1, rank(S, d - 1, i))   (2)
[Equation 3]
i + 1   (3)
[Equation 4]
select(S, d, rank(S, d, i - 1))   (4)
[Equation 5]
select(S, d, rank(S, d, i) + 1)   (5)

The XML document search device according to claim 4, further comprising
a Wavelet tree construction unit that, following the processing of the XML document analysis unit, converts each sequence included in the sequence S and the sequence group T into a Wavelet tree, wherein
the Wavelet tree is used for the rank operation and the select operation.

A program that causes a computer to execute processing for searching for locations that match a structure path when the structure path, which lists an element of an XML document and all ancestor elements of that element in order from the root element, is given as a search query, wherein
the program causes the computer to execute:
a first process of analyzing the XML documents, recognizing tag types and inclusion relationships, converting the XML documents into numerical sequences, and constructing a path trie;
a second process of calculating, using the path trie, the depth and path type of the structure path given as the search query; and
a third process of calculating, based on the depth and path type of the structure path, the locations where elements matching the structure path given as the search query appear,
the first process
generates a first sequence S that, in order to record the shape of the DOM tree of the XML documents, contains as a subsequence the sequence of element depths in the order in which the elements appear in the XML documents, and a sequence group T consisting of one or more sequences that record the type of the structure path corresponding to each node of the DOM tree, in which the sequence T[d] included in the sequence group T records the types of the structure paths of depth d, and
the third process
calculates the locations that match the structure path given as the search query by scanning the sequence S and the sequence group T.

The program according to claim 6, wherein
the third process,
where the position at which a value c appears for the j-th time in a sequence A is denoted select(A, c, j), the process of calculating this value is called a select operation, the depth of the structure path of the search query is d, its path type is t, and the total number of occurrences of the structure path is n, calculates the value k″ obtained by applying formula (6) for each integer k with 1 ≤ k ≤ n, and thereby searches the XML document set for the locations that match the search query.
[Equation 6]
k″ = select(S, d, select(T[d], t, k))   (6)

The program according to claim 7, further comprising
a fourth process that, following the first process, converts each sequence included in the sequence S and the sequence group T into a Wavelet tree, wherein
the Wavelet tree is used for the select operation.