JP5225021B2 - Full-text search method, apparatus and program

Info

Publication number: JP5225021B2
Application number: JP2008278882A
Authority: JP
Inventors: 俊文榎本; 伸幸小林; 源吾鈴木; 雅司山室
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2008-10-29
Filing date: 2008-10-29
Publication date: 2013-07-03
Anticipated expiration: 2028-10-29
Also published as: JP2010108191A

Description

The present invention relates to a full-text search method, apparatus, and program. More particularly, it relates to a full-text search method, apparatus, and program for character strings that straddle tags in XML data, for performing fast partial matching of description contents, known as full-text search, in the field of storing and searching large numbers of XML and HTML documents, which are structured data.

Full-text search is a technique for quickly finding, in an accumulated set of documents, the documents that contain a search keyword. It is realized by cutting out the words that appear in flat text (a character string) and recording in an index (a full-text index), for each word, the documents in which it appears. Words are cut out either by morphological analysis or by the N-gram method, which mechanically extracts overlapping character strings of a fixed length while shifting the window one character at a time.
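As an illustration only, the N-gram variant of this indexing can be sketched in Python as follows. The function names `ngrams` and `build_index`, the document IDs, and the sample texts are assumptions made for this sketch and do not appear in the specification.

```python
from collections import defaultdict

def ngrams(text, n=2):
    """Yield (offset, gram) pairs, sliding an n-character window one character at a time."""
    for i in range(max(len(text) - n + 1, 0)):
        yield i, text[i:i + n]

def build_index(docs, n=2):
    """Map each gram to the list of (doc_id, offset) positions where it occurs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for offset, gram in ngrams(text, n):
            index[gram].append((doc_id, offset))
    return index

# Hypothetical flat-text documents keyed by document ID.
docs = {"001": "宇宙の不思議", "002": "不思議な宇宙"}
index = build_index(docs)
print(index["宇宙"])  # every position at which the bigram "宇宙" occurs
```

A lookup of a gram then returns the documents and offsets at which it occurs without scanning the documents themselves.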

XML, on the other hand, is a markup language; data described in XML is structured, and the structure itself carries meaning. FIG. 12 shows an XML document and its structure.
XML data can be represented as a tree, each vertex of which is called a "node". In particular, the node at the root is called the "root element" ("book"), a value (description content) is called a "text node" ("宇宙の不思議", "Mysteries of the Universe"), and an item described inside a tag is called an "attribute" ("no=1").

A full-text search over an XML document (or a group of XML documents) must be able to target not only a whole document but also a specific range within a document. That is, by specifying a path, only the required part of one large XML document can be made the search target. The part of a document targeted in this way is called a "partial document" here. FIG. 13 shows examples of partial documents: part (A) of the figure shows the partial document when /book/chapter is specified, and part (B) shows the partial document when /book/chapter/section is specified.

To perform a full-text search over such partial documents at high speed, it is necessary to efficiently identify the partial documents and determine whether the description contents within their range contain the search keyword (and, further, to compute a relevance score) (see, for example, Patent Document 1).
Patent Document 1: JP 2008-146424 A

XML documents allow a structure called "mixed content", in which text and tags are intermingled. FIG. 14 shows an example, in which <b> tags are mixed into the text for emphasis. In this way, a character string that is semantically continuous may be split by tags into separate text nodes. Conventional methods, including that of Patent Document 1, process the text one text node at a time. Consequently, when a full-text search is performed for "水星" (Mercury), for example, the full-text index holds "水" and "星" as separate entries, and the search finds nothing.
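The effect of mixed content on text nodes can be reproduced with Python's standard `xml.etree.ElementTree` module. The fragment below is a hypothetical example written for this sketch (FIG. 14 itself is not reproduced); it only shows that the keyword is absent from every individual text node yet present in their concatenation.

```python
import xml.etree.ElementTree as ET

# Hypothetical mixed-content fragment: "水星" (Mercury) is split by a <b> tag.
SAMPLE = "<desc>太陽に最も近い惑星は水<b>星</b>である</desc>"

root = ET.fromstring(SAMPLE)
text_nodes = list(root.itertext())  # one string per text node, in document order

found_per_node = any("水星" in t for t in text_nodes)
found_concatenated = "水星" in "".join(text_nodes)

print(text_nodes)
print(found_per_node, found_concatenated)
```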

To solve this, words that straddle text nodes must also be indexed; however, if the text nodes are simply concatenated, the structure can no longer be preserved. The simplest approach is to extract, for every partial document, the character string formed by concatenating its text nodes and to build a set of full-text indexes over those strings. However, because the text nodes are duplicated across partial documents, the index becomes far too large, so this approach is not practical.

The present invention has been made in view of the above points. Its object is to provide a full-text search method, apparatus, and program capable of realizing a full-text search with an arbitrary structure specified over XML documents containing mixed content, without constructing a huge set of full-text indexes.

FIG. 1 is a diagram for explaining the principle of the present invention.

The present invention (claim 1) is a full-text search method for performing a full-text search on a group of XML documents, which are structured data in which text and tags are mixed.
In an apparatus having node management means for managing the structural relationships between XML nodes, full-text index means for building an index on the words in the XML documents, and search means for executing a search, the method performs:
a storage process comprising
a range label assigning step (step 1) in which the node management means assigns a document ID to each document of the input XML document group, assigns a range label that allows tags to be straddled to every node of the document, and stores the labels in range label storage means, and
a transposition table storing step (step 2) in which the full-text index means cuts out words, including words that straddle tags, from the text of the XML document and stores appearance position information, consisting of a pair of the appearance position of the word in the XML document and the document ID, in a transposition table of transposition table storage means; and
a search process comprising
a range label acquisition step (step 3) in which the search means extracts a search path and a search keyword from an input search query and acquires the document IDs and range labels corresponding to the search path from the range label storage means,
an appearance position information acquisition step (step 4) of acquiring the appearance position information matching the search keyword from the transposition table storage means, and
a node information selection step (step 5) of selecting, from the document IDs and range labels acquired in the range label acquisition step, the node information in which the appearance position information of the search keyword acquired in the appearance position information acquisition step exists.
In the range label assigning step (step 1) of the storage process,
the document ID of the XML document, the preorder value and postorder value of each node, the tag name, the start character offset when all text nodes are concatenated, and the description content are stored in the range label storage means as the range label.
In the transposition table storing step (step 2) of the storage process, the following are performed:
a word cutting step of concatenating all text nodes of the XML document, dividing the result into character strings of a fixed length, and cutting out words by the N-gram method and, when a cut-out word straddles text nodes, also cutting out the word of M characters (where M is an integer of 1 or more and less than N) up to the end of the text node; and
an appearance position information storing step of acquiring, for each cut-out word, the preorder value of the node containing its start position, the postorder value of the node containing its end position, and its start position in the concatenated text, and storing the preorder value, the postorder value, the start position, and the document ID in the transposition table of the transposition table storage means as the appearance position information.
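The word cutting and appearance position storing steps can be read as in the following Python sketch. It is one possible interpretation, not the claimed implementation: the text-node labels, the tuple layout `(doc_id, pre, post, start)`, and the names `index_ngrams` and `node_at` are assumptions, and the extra M-character word is taken here to be the part of a straddling N-gram that lies before the end of the text node in which it starts.

```python
from bisect import bisect_right

# Hypothetical text nodes of one document, in document order: (preorder, postorder, text).
TEXT_NODES = [(10, 11, "惑星は水"), (13, 14, "星"), (16, 17, "である")]

def index_ngrams(doc_id, text_nodes, n=2):
    """Build {word: [(doc_id, pre, post, start), ...]} from labelled, non-empty text nodes."""
    starts, pos = [], 0
    for _, _, text in text_nodes:
        starts.append(pos)          # start character offset of each text node
        pos += len(text)
    joined = "".join(text for _, _, text in text_nodes)

    def node_at(offset):
        """Index of the text node that contains the given character offset."""
        return bisect_right(starts, offset) - 1

    table = {}
    for i in range(len(joined) - n + 1):
        first, last = node_at(i), node_at(i + n - 1)
        pre, post = text_nodes[first][0], text_nodes[last][1]
        table.setdefault(joined[i:i + n], []).append((doc_id, pre, post, i))
        if first != last:
            # The N-gram straddles text nodes: also store the M characters (1 <= M < N)
            # that run from the start position to the end of the first text node.
            node_end = starts[first] + len(text_nodes[first][2])
            short = joined[i:node_end]
            table.setdefault(short, []).append((doc_id, pre, text_nodes[first][1], i))
    return table

table = index_ngrams("001", TEXT_NODES)
print(table["水星"], table["水"])
```

Under these assumptions, the bigram "水星" receives the widened range (10, 14), which spans both text nodes, while the one-character word "水" keeps the range of the single text node in which it lies.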

Further, in the present invention (claim 2), in the range label acquisition step (step 3) of the search process,
the search keyword is decomposed into N-grams to obtain search words,
the appearance position information of each search word is acquired from the transposition table of the transposition table storage means on the basis of the search word, and
the start positions in the appearance position information of the search words are matched against one another to obtain the appearance position information of the search keyword.
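A minimal Python sketch of this start-position matching follows. The table contents, the tuple layout `(doc_id, pre, post, start)`, and the name `match_keyword` are assumptions made for illustration; keywords shorter than N are simply not handled.

```python
# Hypothetical transposition table: word -> [(doc_id, pre, post, start), ...]
TABLE = {
    "は水": [("001", 10, 11, 2)],
    "水星": [("001", 10, 14, 3)],
    "星で": [("001", 13, 17, 4)],
    "であ": [("001", 16, 17, 5)],
}

def match_keyword(table, keyword, n=2):
    """Return (doc_id, pre, post, start) for each place where the N-grams of the
    keyword occur at consecutive start positions in the same document."""
    grams = [keyword[i:i + n] for i in range(len(keyword) - n + 1)]
    if not grams:
        return []
    hits = []
    for doc_id, pre, _, start in table.get(grams[0], []):
        post = None
        for k, gram in enumerate(grams):
            found = [p for p in table.get(gram, [])
                     if p[0] == doc_id and p[3] == start + k]
            if not found:
                break               # the k-th gram does not follow at start + k
            post = found[0][2]      # postorder value of the last matched gram so far
        else:
            hits.append((doc_id, pre, post, start))
    return hits

print(match_keyword(TABLE, "水星で"))
print(match_keyword(TABLE, "水星が"))
```

The result combines the preorder value of the first gram with the postorder value of the last, so the range of the whole keyword can then be compared with node ranges.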

FIG. 2 is a diagram showing the principle configuration of the present invention.

The present invention (claim 3) is a full-text search apparatus for performing a full-text search on a group of XML documents, which are structured data in which text and tags are mixed,
comprising node management means 120 for managing the structural relationships between XML nodes, full-text index means 130 for building an index on the words in the XML documents, and search means 140 for executing a search.
The node management means 120 has
range label assigning means 122 for assigning a document ID to each document of the input XML document group, assigning a range label that allows tags to be straddled to every node of the document, and storing the labels in range label storage means 121.
The full-text index means 130 has
transposition table storing means 132 for cutting out words, including words that straddle tags, from the text of the XML document and storing appearance position information, consisting of a pair of the appearance position of the word in the XML document and the document ID, in the transposition table of transposition table storage means 131.
The search means 140 has
range label acquisition means 141 for extracting a search path and a search keyword from an input search query and acquiring the document IDs and range labels corresponding to the search path from the range label storage means 121,
appearance position information acquisition means 142 for acquiring the appearance position information matching the search keyword from the transposition table storage means 131, and
node information selection means 143 for selecting, from the document IDs and range labels acquired by the range label acquisition means 141, the node information in which the appearance position information of the search keyword acquired by the appearance position information acquisition means 142 exists.
The range label assigning means 122 of the node management means 120 has
means for storing the document ID of the XML document, the preorder value and postorder value of each node, the tag name, the start character offset when all text nodes are concatenated, and the description content in the range label storage means 121 as the range label.
The transposition table storing means 132 of the full-text index means 130 includes
word cutting means for concatenating all text nodes of the XML document, dividing the result into character strings of a fixed length, and cutting out words by the N-gram method and, when a cut-out word straddles text nodes, also cutting out the word of M characters (where M is an integer of 1 or more and less than N) up to the end of the text node, and
appearance position information storing means for acquiring, for each cut-out word, the preorder value of the node containing its start position, the postorder value of the node containing its end position, and its start position in the concatenated text, and storing the preorder value, the postorder value, the start position, and the document ID in the transposition table of the transposition table storage means 131 as the appearance position information.
The range label acquisition means 141 of the search means 140 includes
means for decomposing the search keyword into N-grams to obtain search words, acquiring the appearance position information of each search word from the transposition table of the transposition table storage means 131 on the basis of the search word, and matching the start positions in the appearance position information of the search words against one another to obtain the appearance position information of the search keyword.

Further, in the present invention (claim 7), the range label assigning means 122 of the node management means 120 has
means for storing the document ID of the XML document, the preorder value and postorder value of each node, the tag name, and the description content in the range label storage means as the range label.
The transposition table storing means 132 of the full-text index means 130 includes
morphological analysis means for dividing the XML document into character strings of a fixed length and morphologically analyzing each character string to cut out words, including words that straddle tags, and
means for acquiring, for each cut-out word, the preorder value of the node in which the first character of the word appears and the postorder value of the node in which the last character of the word appears, and storing the preorder value, the postorder value, and the document ID in the transposition table of the transposition table storage means 131 as the appearance position information.
The range label acquisition means 141 of the search means 140 includes
means for morphologically analyzing the search keyword to obtain search words and acquiring the appearance position information of each search word from the transposition table of the transposition table storage means on the basis of the search word.

Further, in the present invention (claim 8), the range label assigning means 122 of the node management means 120 has
means for storing the document ID of the XML document, the preorder value and postorder value of each node, the tag name, the start character offset when all text nodes are concatenated, and the description content in the range label storage means as the range label.
The transposition table storing means 132 of the full-text index means 130 includes
word cutting means for concatenating all text nodes of the XML document, dividing the result into character strings of a fixed length, and cutting out words by the N-gram method and, when a cut-out word straddles text nodes, additionally cutting out the M characters (where M is an integer of 1 or more) up to the end of the text node, and
second range label assigning means for acquiring, for each cut-out word, the preorder value of the node containing its start position, the postorder value of the node containing its end position, and its start position in the concatenated text, and storing the preorder value, the postorder value, the start position, and the document ID in the transposition table of the transposition table storage means 131 as the appearance position information.
The range label acquisition means 141 of the search means 140 includes
means for decomposing the search keyword into N-grams to obtain search words, acquiring the appearance position information of each search word from the transposition table of the transposition table storage means on the basis of the search word, and matching the start positions in the appearance position information of the search words against one another to obtain the appearance position information of the search keyword.

The present invention (claim 4) is a full-text search program for causing a computer to function as each of the means constituting the full-text search apparatus according to claim 3.

As described above, according to the present invention, creating a transposition table in which the range values assigned to text nodes are extended makes it possible to perform a full-text search even for words that straddle tags (that is, words that straddle text nodes). A full-text search with an arbitrary structure specified can thus be realized for XML documents containing mixed content, without constructing a huge set of full-text indexes.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 3 is a configuration diagram of a full-text search apparatus according to an embodiment of the present invention.

The full-text search apparatus 100 shown in the figure comprises a path index unit 110, a node management unit 120, a full-text index unit 130, and a search unit 140.

The path index unit 110 is not essential to realizing the function, but it is normally used to achieve high-speed search.

The node management unit 120 identifies the document in which each node appears, assigns an ID to each document, and manages the structural relationships between nodes. In the present invention, "range labels" are used to manage the structural relationships: a range label is assigned to each node and stored as a node table in the range label storage unit 121. As described, for example, in S. Al-Khalifa, H. V. Jagadish, N. Koudas, J. M. Patel, D. Srivastava, and Y. Wu, "Structural joins: A primitive for efficient XML query pattern matching," Proc. ICDE, p. 141, 2002, range labelling assigns two values, a preorder value and a postorder value, to every node, such that the preorder and postorder values of a child node fall within those of its parent node. FIG. 4 shows an example of range labels. For example, the "desc" node with range label (9, 60) is a child of the "chapter" node with range label (6, 61), and the range (9, 60) is contained in (6, 61). The "b" node with range label (12, 15) is also contained in (6, 61), so not only direct parent-child relationships but also ancestor-descendant relationships can be determined quickly (selecting the combinations that satisfy this test is called a "structural join"). FIG. 5 shows an example of how the node table is stored in the range label storage unit.
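The containment test on range labels amounts to two comparisons, as the following Python sketch shows. The function name `contains` is an assumption; the label values are the ones quoted from FIG. 4.

```python
def contains(ancestor, descendant):
    """True if the (pre, post) range of `descendant` lies inside that of `ancestor`."""
    a_pre, a_post = ancestor
    d_pre, d_post = descendant
    return a_pre < d_pre and d_post < a_post

chapter, desc, b = (6, 61), (9, 60), (12, 15)

print(contains(chapter, desc))  # "desc" is a child of "chapter"
print(contains(chapter, b))     # "b" is a deeper descendant of "chapter"
print(contains(b, chapter))     # the reverse does not hold
```

Because the test never walks the tree, its cost does not depend on how many levels separate the two nodes.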

The full-text index unit 130 cuts out words from the text obtained by concatenating all text nodes of the XML document and records the appearance position of each word in the transposition table of the transposition table storage unit 131. The appearance position of a word records not only the identity of the document in which the word appears but also the structural position, within that document, of the partial document in which it appears, that is, the range label of the node in which it appears, as described above.

The search unit 140 receives a search query, identifies the nodes that match the specified search path and search keyword by consulting the path index unit 110, the node management unit 120, and the full-text index unit 130, and returns them.

The operation of the system, which consists of a "storage phase" and a "search phase" for XML documents, is described below.

In the storage phase, the various indexes are created from a given group of XML documents.

FIG. 6 shows the overall flow of the storage phase in an embodiment of the present invention.

In the following, the path index unit 110, the node management unit 120, and the full-text index unit 130 each process the XML document group.

Step 110) The node management unit 120 retrieves an XML document from the XML document storage device 10.

Step 120) It then assigns a document ID to the retrieved document and stores the document in a memory (not shown).

Step 130) FIG. 7 shows the detailed operation of the node management unit 120.

The node management unit 120 takes a node from the memory (not shown) (step 131), identifies its path, and passes the path to the path index unit 110 (step 132). It assigns a range label to the node (step 133) and stores node information consisting of the document ID, range label, start character offset, and so on in the node table of the range label storage unit 121 (step 134). This processing is repeated for all of the retrieved nodes (step 135).

As described above, the node management unit 120 stores the XML document shown in FIG. 4 in the node table of the range label storage unit 121 as shown in FIG. 5. It stores the document ID (docID), the range label (pre, post), the tag name (tag), the start character offset when all text nodes are concatenated (offset), and the description content (value). However, "offset" is not required when the morphological analysis method is used.
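As a rough illustration of how such a node table might be populated, the Python sketch below walks a document with the standard `xml.etree.ElementTree` module. The numbering scheme (a single counter incremented on entering and leaving each node, with text nodes labelled as well), the `#text` tag name, and the sample fragment are assumptions of this sketch; the exact labels of FIG. 4 and the layout of FIG. 5 are not reproduced.

```python
import xml.etree.ElementTree as ET
from itertools import count

def build_node_table(doc_id, xml_text):
    """Return node-table rows (docID, pre, post, tag, offset, value) in document order."""
    rows, counter, offset = [], count(1), 0

    def add_text(text):
        nonlocal offset
        if text:
            pre = next(counter)
            post = next(counter)
            rows.append({"docID": doc_id, "pre": pre, "post": post,
                         "tag": "#text", "offset": offset, "value": text})
            offset += len(text)     # offset within the concatenated text nodes

    def visit(elem):
        row = {"docID": doc_id, "pre": next(counter), "post": None,
               "tag": elem.tag, "offset": offset, "value": None}
        rows.append(row)
        add_text(elem.text)
        for child in elem:
            visit(child)
            add_text(child.tail)    # text that follows the child's closing tag
        row["post"] = next(counter)

    visit(ET.fromstring(xml_text))
    return rows

rows = build_node_table("001", "<desc>水<b>星</b>である</desc>")
for r in rows:
    print(r)
```

With this numbering, the range of every element encloses the ranges of its descendants, which is the property the range label relies on.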

Step 140) For each path, the path index unit 110 stores the relationship to the corresponding node information (group) in the node management unit 120. The present invention does not limit the specific technique, but possible techniques include, for example:
(a) assigning an ID to each path and holding it in common in the path index unit 110 and the node management unit 120;
(b) holding in the path index unit 110 the addresses (pointers or offsets) at which the node information is stored in the node management unit 120.

Step 150) The full-text index unit 130 concatenates all text nodes, cuts out words from the resulting text by the morphological analysis method or the N-gram method, and records their appearance positions in the transposition table of the transposition table storage unit 131. FIG. 8 shows the operation flow of the full-text index unit 130.

All text nodes stored in the memory (not shown) in step 120 are concatenated (step 151), and words are cut out using the morphological analysis method or the N-gram method (step 152). The node management unit 120 is queried on the basis of each cut-out word to obtain the position information of the place where the word appears (step 153). If the word has not appeared before (step 154, No), it is registered in the transposition table of the transposition table storage unit 131 (step 155). The appearance position information, such as the document ID and range label, is then added to the transposition table of the transposition table storage unit 131 (step 156). The processing from step 153 onward is repeated for all of the words stored in the memory (not shown) (step 157).

Next, the search phase is described.

FIG. 9 shows the flow of the search phase in an embodiment of the present invention.

Step 210) The search unit 140 extracts the search path and the search keyword from the input search query.

Step 220) It acquires from the path index unit 110 the relationship to the node information (group) that matches the search path.

Step 230) It acquires from the node management unit 120 the document IDs and range labels of the node information (group) identified above.

Step 240) It acquires from the full-text index unit 130 the appearance position information (group) that matches the search keyword.

Step 250) It performs a structural join between the document IDs and range labels of the node information (group) and the document IDs and range labels of the appearance position information (group) of the search keyword (that is, it selects the node information (group) in which appearance position information exists).

Step 260) If necessary, it assembles the partial documents corresponding to the selected node information (group) from the node management unit 120 and returns them.
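The selection performed in step 250 can be sketched in Python as a plain nested loop. This is not the merge-based structural join algorithm of the cited literature, only the selection condition itself; the name `structural_join`, the tuple layout `(doc_id, pre, post)`, and the sample values are assumptions made for illustration.

```python
def structural_join(nodes, occurrences):
    """Select the nodes whose range contains at least one keyword occurrence
    in the same document. Both inputs are lists of (doc_id, pre, post)."""
    selected = []
    for doc_id, pre, post in nodes:
        # Inclusive bounds, so an occurrence that coincides with a node's own
        # range also counts as lying within that node.
        if any(d == doc_id and pre <= o_pre and o_post <= post
               for d, o_pre, o_post in occurrences):
            selected.append((doc_id, pre, post))
    return selected

# Hypothetical inputs: nodes matching the search path, and keyword occurrences.
nodes = [("001", 6, 61), ("001", 62, 90), ("002", 6, 40)]
occurrences = [("001", 13, 17)]

print(structural_join(nodes, occurrences))
```

Only the node whose range encloses the occurrence in the same document is kept; a node with a matching range in another document is rejected by the document ID comparison.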

[First embodiment]
<Storage phase>
This embodiment describes an example in which the full-text index unit 130 uses the morphological analysis method to cut out words in the storage phase.

In step 152 of FIG. 8, the full-text index unit 130 performs the following processing.

(1) The text is divided into pieces of a fixed length (normally sentence by sentence), and the following processing is performed on each piece in turn.

a) Morphological analysis is performed to cut out the word (group). The following processing is performed for each word.

i. The preorder value of the node containing the start position is acquired.

ii. The postorder value of the node containing the end position is acquired.

b) The document ID, preorder value, and postorder value are recorded in the transposition table.

FIG. 10 shows an example of a created transposition table. The transposition table of the transposition table storage unit 131 is a table in which the appearance positions (group) are collected for each word. In the example in the figure, "水星" (Mercury) and "金星" (Venus) straddle tags, so their range labels, (13, 17) and (19, 23) respectively, also straddle the tags.
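The following Python sketch illustrates steps i, ii, and b) under stated assumptions. No morphological analyzer is invoked: the words and their start offsets are supplied by hand as if an analyzer had produced them. The text-node layout and labels are hypothetical, chosen only so that the resulting ranges coincide with the (13, 17) and (19, 23) quoted for "水星" and "金星"; the name `postings_for_words` is likewise an assumption.

```python
from bisect import bisect_right

# Hypothetical labelled text nodes, in document order: (preorder, postorder, text).
# They correspond to a fragment such as 水<b>星</b>・金<b>星</b>.
TEXT_NODES = [(13, 14, "水"), (16, 17, "星"), (19, 20, "・金"), (22, 23, "星")]

# Words as a morphological analyzer might return them: (word, start offset in joined text).
WORDS = [("水星", 0), ("金星", 3)]

def postings_for_words(doc_id, text_nodes, words):
    """Build {word: [(doc_id, pre, post), ...]} using the preorder value of the node
    holding the first character and the postorder value of the node holding the last."""
    starts, pos = [], 0
    for _, _, text in text_nodes:
        starts.append(pos)
        pos += len(text)
    table = {}
    for word, begin in words:
        first = bisect_right(starts, begin) - 1
        last = bisect_right(starts, begin + len(word) - 1) - 1
        entry = (doc_id, text_nodes[first][0], text_nodes[last][1])
        table.setdefault(word, []).append(entry)
    return table

table = postings_for_words("001", TEXT_NODES, WORDS)
print(table)
```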

＜検索フェイズ＞
検索部１４０において、以下の手順で検索キーワードにマッチする出現位置情報（群）を取得する。 <Search phase>
The search unit 140 acquires appearance position information (group) that matches the search keyword in the following procedure.

（１）検索キーワードを形態素解析し、検索単語を得る。 (1) A morphological analysis is performed on the search keyword to obtain a search word.

（２）検索単語に基づいて、全文インデクス部１３０の転置表記憶部１３１から当該検索単語の出現位置情報（群）を得る。 (2) Based on the search word, appearance position information (group) of the search word is obtained from the transposed table storage unit 131 of the full-text index unit 130.

As a specific example, the processing of a search query specifying the search path “/book/chapter” and the search keyword “水星・金星” (Mercury/Venus) against the node table of FIG. 5 and the transposition table of FIG. 10 is shown.

Step 1: The search path “/book/chapter” and the search keyword “水星・金星” are extracted from the search query.

Step 2: The relationship to the node information corresponding to the search path “/book/chapter” is acquired from the path index unit 110.

Step 3: The document ID and range label of each matching node are acquired from the node management unit 120. The matching node information group consists of multiple rows, including the second row of FIG. 5.

(001, 6, 61), ...

Step 4: The appearance position information (group) matching the search keyword “水星・金星” is acquired from the full-text index unit 130 by the following procedure.

(a) Morphological analysis of “水星・金星” yields the search words “水星” (Mercury) and “金星” (Venus).

(b) From the transposition table of FIG. 10, the following are obtained:
Appearance position information group of “水星”: (001, 13, 17), ...
Appearance position information group of “金星”: (001, 19, 23), ...

Step 5: A structural join is performed between the document ID and range label group corresponding to the search path “/book/chapter” obtained in step 3 and the appearance position information group matching the search keyword “水星・金星” obtained in step 4.

In this example, at least the following correct combination is selected:
(001, 6, 61) with (001, 13, 17) and (001, 19, 23)

Note that when there are multiple search words, as in this example, only node information that combines with the appearance position information of every search word is a correct result.

Step 6: When a partial document is required as the search result, the partial document is assembled from the node information selected by the structural join.

In the case of (001, 6, 61), the rows of the node table (FIG. 5) whose document ID is 001 and whose range label falls within (6, 61) (rows 2 to 9 of FIG. 5 and the rows that follow) are extracted, and the partial document is assembled and returned as the result.

Similarly, when the search path is “/book/chapter/desc/b”, the node information obtained in step 3 is
(001, 12, 15), (001, 18, 21), ...
and none of it combines with the appearance position information obtained in step 4. Thus, even when the search keyword partly overlaps a node on the search path, the node is correctly excluded from the result unless it contains the entire keyword.
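The structural join of step 5, including the rule that every search word must fall inside the node, can be sketched as below. This is an illustrative sketch only: it assumes the usual containment test for range labels (same document ID, node preorder no greater than the occurrence preorder, occurrence postorder no greater than the node postorder), and the helper names `contains` and `structural_join` are hypothetical. The label values are the ones quoted in the example above.

```python
def contains(node, occ):
    """True if the occurrence label lies inside the node's range label (same document)."""
    doc, pre, post = node
    occ_doc, occ_pre, occ_post = occ
    return doc == occ_doc and pre <= occ_pre and occ_post <= post


def structural_join(nodes, postings_per_word):
    """Keep the nodes that contain at least one occurrence of every search word."""
    return [
        node
        for node in nodes
        if all(
            any(contains(node, occ) for occ in postings)
            for postings in postings_per_word.values()
        )
    ]


# Appearance position information from step 4: (document ID, preorder, postorder).
postings = {
    "水星": [("001", 13, 17)],
    "金星": [("001", 19, 23)],
}

chapter_nodes = [("001", 6, 61)]               # search path /book/chapter
b_nodes = [("001", 12, 15), ("001", 18, 21)]   # search path /book/chapter/desc/b

print(structural_join(chapter_nodes, postings))  # [('001', 6, 61)]
print(structural_join(b_nodes, postings))        # []
```

The nested loops keep the sketch short. A production structural join would more likely merge lists sorted by document ID and preorder value, but the containment condition would stay the same.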

[Second Embodiment]
In this embodiment, words are cut out by the N-gram method. As a specific example, the case of N = 2, that is, Bi-gram, is shown.

<Storage phase>
The full-text index unit 130 performs the following processing.

Step 1: All text nodes are concatenated (step 152).

Step 2: The concatenated text is divided into character strings of a fixed length (usually sentence by sentence), and the following processing is performed on each in turn.

(a) Words are cut out by the N-gram method. In addition to the ordinary N-gram cut-out, however, the boundaries of text nodes are also taken into account. That is:

i. N characters at a time are cut out mechanically from the concatenated text.

ii. When a cut-out word straddles text nodes, the M characters up to the end of the text node are also cut out, in duplicate. This ensures that search words shorter than N characters are also handled without omission.

(b) The following processing is performed for each cut-out word.

iii. The document ID, preorder value, postorder value, and start position are recorded in the transposition table.

An example of creating a transposition table is shown in FIG. 11. Unlike the morphological analysis example, the start position (position) in the character string obtained by concatenating all text nodes is also stored as part of the appearance position information. As with “水星” and “水”, both the two-character word that straddles a tag and the single character up to the end of the text node are stored, in duplicate.
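The cut-out rule of steps i to iii can be sketched as follows for N = 2. This is a sketch under stated assumptions rather than the embodiment itself: `TextNode`, `node_index_at` and `extract_ngrams` are hypothetical names, the text-node values are inferred from the example values quoted in this description, the division into sentences is omitted, and labelling the short M-character piece with the preorder and postorder values of its own text node is an assumption.

```python
from typing import NamedTuple


class TextNode(NamedTuple):
    pre: int    # preorder value of the text node
    post: int   # postorder value of the text node
    start: int  # offset of its first character in the concatenated text
    text: str


def node_index_at(nodes, offset):
    """Index of the text node whose character range contains `offset`."""
    for i, node in enumerate(nodes):
        if node.start <= offset < node.start + len(node.text):
            return i
    raise ValueError(f"offset {offset} is outside the text")


def extract_ngrams(doc_id, nodes, n=2):
    """Yield (word, (doc_id, preorder, postorder, start)) for contiguous text nodes."""
    base = nodes[0].start
    text = "".join(node.text for node in nodes)
    for i in range(len(text) - n + 1):
        start = base + i
        first = nodes[node_index_at(nodes, start)]
        last = nodes[node_index_at(nodes, start + n - 1)]
        # i. the ordinary N-gram, labelled by the nodes of its first and last character
        yield text[i:i + n], (doc_id, first.pre, last.post, start)
        if first.pre != last.pre:
            # ii. the N-gram straddles text nodes: also emit the M (< N) characters
            # that run up to the end of the node holding the first character
            m = first.start + len(first.text) - start
            yield text[i:i + m], (doc_id, first.pre, first.post, start)


# Assumed text nodes for a fragment like <b>水</b>星・<b>金</b>星 (inferred values).
NODES = [
    TextNode(13, 14, 9, "水"),
    TextNode(16, 17, 10, "星・"),
    TextNode(19, 20, 12, "金"),
    TextNode(22, 23, 13, "星"),
]

table = {}
for word, entry in extract_ngrams("001", NODES):
    table.setdefault(word, []).append(entry)

print(table["水星"], table["水"])  # [('001', 13, 17, 9)] [('001', 13, 14, 9)]
```

With these inputs the entries for “水星”, “・金” and “金星” come out as (001, 13, 17, 9), (001, 16, 20, 11) and (001, 19, 23, 12), matching the values quoted for FIG. 11 in the search example.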

<Search phase>
Step 1: The search keyword is decomposed into N-grams to obtain the search words.

Step 2: The appearance position information (group) of each search word is obtained from the transposition table of the transposition table storage unit 131.

Step 3: The appearance start positions of the search words are matched against each other to obtain the appearance position information of the search keyword.

As a specific example, the processing of a search query specifying the search path “/book/chapter” and the search keyword “水星・金星” (Mercury/Venus) against the node table of the range label storage unit 121 of the node management unit 120 in FIG. 5 and the transposition table of the transposition table storage unit 131 in FIG. 11 is shown.

Step 1: (Same as the first embodiment) The search path “/book/chapter” and the search keyword “水星・金星” are extracted from the search query.

Step 2: (Same as the first embodiment) The relationship to the node information corresponding to the search path “/book/chapter” is acquired from the path index unit 110.

Step 3: (Same as the first embodiment) The document ID and range label of each matching node are acquired from the node management unit 120. The matching node information group consists of multiple rows, including the second row of FIG. 5.

Step 4: The appearance position information (group) matching the search keyword “水星・金星” is acquired from the full-text index unit 130 by the following procedure.

(a) Bi-gram decomposition of “水星・金星” yields the search words “水星”, “・金”, and “金星”. At the same time, their positional relationship is obtained: if the start position of “水星” is k, then that of “・金” is k + 2 and that of “金星” is k + 3.

(b) From the transposition table of FIG. 11, the following are obtained:
Appearance position information group of “水星”: (001, 13, 17, 9), ...
Appearance position information group of “・金”: (001, 16, 20, 11), ...
Appearance position information group of “金星”: (001, 19, 23, 12), ...

(c) Among the appearance position information above, combinations that share the same document ID and satisfy the start position (position) relationship are found. In this example, the first entry of each group satisfies it.

(d) The appearance position information group of the search keyword is created from these combinations.

(001, 13, 23), ...
The document ID is the common one, the preorder value is that of “水星”, and the postorder value is that of “金星”. Together they form the appearance position information of the search keyword.
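Sub-steps (a) to (d) can be sketched as follows. This is a hedged illustration: `decompose` and `match_keyword` are hypothetical names, and the decomposition rule used here (step through the keyword N characters at a time and align the final N-gram with the end of the keyword) is an assumption inferred from the example, where “星・” is not used. The entries marked as decoys are made up to show that non-matching positions are filtered out.

```python
def decompose(keyword, n=2):
    """Split a keyword into N-grams with their offsets relative to the keyword start."""
    if len(keyword) <= n:
        return [(0, keyword)]
    offsets = list(range(0, len(keyword) - n + 1, n))
    last = len(keyword) - n
    if offsets[-1] != last:
        offsets.append(last)  # the final N-gram is aligned to the end of the keyword
    return [(off, keyword[off:off + n]) for off in offsets]


def match_keyword(keyword, table, n=2):
    """Return (doc_id, preorder, postorder) labels of complete keyword occurrences."""
    grams = decompose(keyword, n)
    # Index the postings of each search word by (document ID, start position).
    by_pos = [
        {(doc, pos): (pre, post) for doc, pre, post, pos in table.get(word, [])}
        for _, word in grams
    ]
    hits = []
    for (doc, k), (pre, _) in by_pos[0].items():
        keys = [(doc, k + off) for off, _ in grams]
        if all(key in index for key, index in zip(keys, by_pos)):
            post = by_pos[-1][keys[-1]][1]  # postorder value of the last search word
            hits.append((doc, pre, post))
    return hits


# Entries: (document ID, preorder, postorder, start position).
TABLE = {
    "水星": [("001", 13, 17, 9), ("003", 2, 3, 0)],   # second entry is a made-up decoy
    "・金": [("001", 16, 20, 11)],
    "金星": [("001", 19, 23, 12), ("002", 4, 5, 3)],  # second entry is a made-up decoy
}

print(decompose("水星・金星"))             # [(0, '水星'), (2, '・金'), (3, '金星')]
print(match_keyword("水星・金星", TABLE))  # [('001', 13, 23)]
```

Keying each posting list by document ID and start position turns the position check in (c) into dictionary lookups, and the merged label in (d) takes its preorder value from the first search word and its postorder value from the last.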

Step 5: A structural join is performed between the document ID and range label group corresponding to the search path “/book/chapter” obtained in step 3 and the appearance position information group matching the search keyword “水星・金星” obtained in step 4.

In this example, at least the following correct combination is selected:
(001, 6, 61) with (001, 13, 23)

Step 6: (Same as the first embodiment) When a partial document is required as the search result, the partial document is assembled from the node information selected by the structural join.

In the case of (001, 6, 61), the rows of the node table (FIG. 5) whose document ID is 001 and whose range label falls within (6, 61) (rows 2 to 9 of the figure and the rows that follow) are extracted, and the partial document is assembled and returned as the result.

Similarly, when the search path is “/book/chapter/desc/b”, the node information obtained in step 3 is
(001, 12, 15), (001, 18, 21), ...
and none of it combines with the appearance position information obtained in step 4. Thus, even when the search keyword partly overlaps a node on the search path, the node is correctly excluded from the result unless it contains the entire keyword.

The operations of the components of the full-text search apparatus 100 shown in FIG. 3 can be implemented as a program, which can be installed on a computer used as the full-text search apparatus or distributed via a network.

The program can also be stored on a hard disk or on a portable storage medium such as a flexible disk or CD-ROM, and installed on a computer or distributed.

The present invention is not limited to the embodiments and examples described above, and various modifications and applications are possible within the scope of the claims.

The present invention is applicable to full-text search technology for structured data, particularly XML.

FIG. 1 is a diagram for explaining the principle of the present invention.
FIG. 2 is a diagram showing the principle configuration of the present invention.
FIG. 3 is a configuration diagram of the full-text search apparatus in one embodiment of the present invention.
FIG. 4 is an example of range labels in one embodiment of the present invention.
FIG. 5 is a storage example of the node table of the range label storage unit in one embodiment of the present invention.
FIG. 6 is the overall flow of the storage phase in one embodiment of the present invention.
FIG. 7 is the flow of the node management unit (storage phase) in one embodiment of the present invention.
FIG. 8 is the flow of the storage phase of the full-text index unit in one embodiment of the present invention.
FIG. 9 is the flow of the search phase in one embodiment of the present invention.
FIG. 10 is an example of creating the transposition table in the first embodiment of the present invention (morphological analysis method).
FIG. 11 is an example of creating the transposition table in the second embodiment of the present invention (Bi-gram method).
FIG. 12 is a diagram showing an XML document and its structure.
FIG. 13 is an example of the partial document indicated by a path.
FIG. 14 is an XML example of mixed content.

Explanation of symbols

10 XML document storage device
100 Full-text search apparatus
110 Path index unit
120 Node management means, node management unit
121 Range label storage means, range label storage unit
122 Range label assigning means
130 Full-text index means, full-text index unit
131 Transposition table storage means, transposition table storage unit
132 Transposition table storing means
140 Search means, search unit
141 Range label acquisition means
142 Appearance position information acquisition means
143 Node information selection means

Claims

A full-text search method for performing a full-text search on a group of XML documents, which are structured data in which text and tags are mixed,
in an apparatus having node management means for managing structural relationships between XML nodes, full-text index means for constructing an index on the words in the XML documents, and search means for executing a search, the method comprising:
a storage process consisting of:
a range labeling step in which the node management means assigns a document ID to each document of the input XML document group, assigns to all nodes of the document a range label that allows tags to be straddled, and stores the range label in range label storage means; and
a transposition table storage step in which the full-text index means cuts out, from the text of the XML document, words and words that straddle tags, and stores appearance position information, consisting of a pair of the appearance position of the word in the XML document and the document ID, in a transposition table of transposition table storage means;
and a search process consisting of:
a range label acquisition step in which the search means extracts a search path and a search keyword from the input search query, and acquires the document ID and range label corresponding to the search path from the range label storage means;
an appearance position information acquisition step of acquiring, from the transposition table storage means, the appearance position information that matches the search keyword; and
a node information selection step of selecting, from the document IDs and range labels acquired in the range label acquisition step, the node information in which the appearance position information of the search keyword acquired in the appearance position information acquisition step exists;
wherein, in the range labeling step of the storage process,
the document ID of the XML document, the preorder value and postorder value of each node, the tag name, the start character offset when all the text nodes are concatenated, and the description content are stored in the range label storage means as the range label,
and wherein, in the transposition table storage step of the storage process, the following are performed:
a word extraction step of concatenating all the text nodes of the XML document, dividing the result into character strings of a fixed length, cutting out words by the N-gram method, and, when a cut-out word straddles text nodes, also cutting out a word of the M characters up to the end of the text node (where M is an integer of 1 or more and less than N); and
an appearance position information storage step of acquiring, for each cut-out word, the preorder value identifying the node that contains the start position, the postorder value identifying the node that contains the end position, and the start position in the concatenated text, and storing the preorder value, the postorder value, the start position, and the document ID as the appearance position information in the transposition table of the transposition table storage means.

The full-text search method according to claim 1, wherein, in the range label acquisition step of the search process,
the search keyword is decomposed into N-grams to obtain search words,
the appearance position information of each search word is obtained from the transposition table of the transposition table storage means, and
the start positions in the appearance position information of the search words are matched against each other to acquire the appearance position information of the search keyword.

A full-text search apparatus that performs a full-text search on a group of XML documents, which are structured data in which text and tags are mixed, the apparatus comprising:
node management means for managing structural relationships between XML nodes, full-text index means for constructing an index on the words in the XML documents, and search means for executing a search,
wherein the node management means has
range label assigning means that assigns a document ID to each document of the input XML document group, assigns to all nodes of the document a range label that allows tags to be straddled, and stores the range label in range label storage means,
the full-text index means has
transposition table storing means that cuts out, from the text of the XML document, words and words that straddle tags, and stores appearance position information, consisting of a pair of the appearance position of the word in the XML document and the document ID, in a transposition table of transposition table storage means,
the search means has
range label acquisition means that extracts a search path and a search keyword from the input search query, and acquires the document ID and range label corresponding to the search path from the range label storage means,
appearance position information acquisition means that acquires, from the transposition table storage means, the appearance position information that matches the search keyword, and
node information selection means that selects, from the document IDs and range labels acquired by the range label acquisition means, the node information in which the appearance position information of the search keyword acquired by the appearance position information acquisition means exists,
the range label assigning means of the node management means
stores the document ID of the XML document, the preorder value and postorder value of each node, the tag name, the start character offset when all the text nodes are concatenated, and the description content in the range label storage means as the range label,
the transposition table storing means of the full-text index means includes
word extraction means that concatenates all the text nodes of the XML document, divides the result into character strings of a fixed length, cuts out words by the N-gram method, and, when a cut-out word straddles text nodes, also cuts out a word of the M characters up to the end of the text node (where M is an integer of 1 or more and less than N), and
appearance position information storage means that acquires, for each cut-out word, the preorder value identifying the node that contains the start position, the postorder value identifying the node that contains the end position, and the start position in the concatenated text, and stores the preorder value, the postorder value, the start position, and the document ID as the appearance position information in the transposition table of the transposition table storage means,
and the range label acquisition means of the search means includes
means for decomposing the search keyword into N-grams to obtain search words, obtaining the appearance position information of each search word from the transposition table of the transposition table storage means, and matching the start positions in the appearance position information of the search words against each other to acquire the appearance position information of the search keyword.

A full-text search program for causing a computer to function as each of the means constituting the full-text search apparatus according to claim 3.