JP4490930B2

JP4490930B2 - Structured document search apparatus and structured document search method

Info

Publication number: JP4490930B2
Application number: JP2006030110A
Authority: JP
Inventors: 拓也金輪
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2006-02-07
Filing date: 2006-02-07
Publication date: 2010-06-30
Anticipated expiration: 2026-02-07
Also published as: JP2007213158A

Description

The present invention relates to a structured document search apparatus and a structured document search method for searching a plurality of structured documents that have different document structures and are managed in a structured document database having a hierarchical logical structure.

In recent years, structured document databases that store and retrieve structured document data described in XML (eXtensible Markup Language) or similar formats have been developed. Queries (searches) against a structured document database are generally written in XQuery (XML Query), a query language being standardized by the W3C (World Wide Web Consortium).

XQuery supports more than filtering by conditional expressions: a virtual structured document can be embedded in the Return clause, which is the part that returns search results; conditional expressions can be nested in complex structures; and function expressions can be defined within conditional expressions. It is thus a language with diverse functions and very advanced query capability.

XQuery takes node-level information in the DOM (Document Object Model), such as elements and attributes, as its search target. For example, Patent Document 1 proposes a technique that searches node-level information of structured documents by the following method.

First, when a structured document is stored in the database, the data structure of the document is analyzed, and an index is created by embedding the analysis information for each structure (node) in lexical index information or the like. The structure analysis information in this case treats path levels that can be expressed in XPath (XML Path Language) as the same structure information (a structure template).

Next, at search time, the search query is analyzed to create a query graph, costs are calculated, and a query execution plan is created. Here, the query is analyzed at index processing time and the structural constraints that each variable must satisfy are determined in advance, so that each index is searched over a limited range and the number of intermediate candidates is reduced.

Meanwhile, in the field of full-text search, functions have been realized for searching for documents that match logical conditions on keywords (AND, OR, and so on) and for searching with conditions on the positions at which characters appear. One example of a search condition on appearance positions is a proximity search such as "find information in which 'web' and 'title' occur within five characters of each other in the <title> tag".

To realize a proximity search, the position information of the text must be retained while the AND condition is processed. However, because the search model of XQuery operates on nodes, information on appearance positions is lost when the AND condition is processed, and a proximity search could not be realized.

Thus, XQuery could not be said to have sufficient expressive power from the viewpoint of realizing full-text search functions. To address this, the W3C has proposed a language standard called XQFT (XQuery-Full-Text), which extends XQuery and integrates full-text search functions.

XQFT can describe multiple conditions, including conditions on text appearance positions, for the node being searched. For example, the proximity search above can be written as './title ftcontains "web" && "site" distance at most 5'. In XQFT, the information being searched is not at the "node" level of XQuery but at the "text appearance position" level, so a processing method that takes this into account must be realized.
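
As a rough illustration of why position information matters, the following Python sketch checks a proximity condition of the kind described above. It is not XQFT itself: the function names, the use of character offsets, and the definition of distance as the absolute difference between start offsets are all assumptions made for this example.

```python
def find_positions(text, key):
    """Return every character offset at which key starts in text."""
    positions = []
    start = text.find(key)
    while start != -1:
        positions.append(start)
        start = text.find(key, start + 1)
    return positions


def within_distance(text, key_a, key_b, max_distance):
    """True if some occurrence of key_a and some occurrence of key_b
    start at most max_distance characters apart (assumed definition)."""
    pos_a = find_positions(text, key_a)
    pos_b = find_positions(text, key_b)
    return any(abs(a - b) <= max_distance for a in pos_a for b in pos_b)


# Hypothetical title text, not taken from the patent.
title = "a web site for structured documents"
near = within_distance(title, "web", "site", 5)
far = within_distance(title, "web", "documents", 5)
```

The check needs the offsets of both keys at the same time, which is exactly the information a purely node-level AND condition discards.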

Patent Document 1: JP 2001-147933 A

However, because XQFT obtains search result candidates down to the level of text appearance positions, the number of intermediate candidates can become enormous when join processing or the like is performed, which may slow search processing and cause memory usage to explode.

For example, if "web" appears 100 times and "site" appears 200 times within a certain element, simply joining the candidates for each yields 20,000 (100 × 200) intermediate candidates. When XQFT takes text appearance positions into account, the number of candidates for each element increases, so the number of intermediate candidates obtained by joining them increases further.
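
The arithmetic can be reproduced with a small Python sketch. The position lists are placeholders invented for the example; only the counts of 100 and 200 come from the text.

```python
from itertools import product

web_positions = range(100)    # hypothetical positions of "web"
site_positions = range(200)   # hypothetical positions of "site"

# A naive join pairs every "web" candidate with every "site" candidate.
intermediate = list(product(web_positions, site_positions))
count = len(intermediate)     # 100 * 200 pairs
```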

The present invention has been made in view of the above, and its object is to provide a structured document search apparatus and a structured document search method that can realize a high-speed full-text search function for structured documents.

To solve the problems described above and achieve the object, the present invention comprises: structured document storage means for storing, in association with each other, a structured document, which is a document having a hierarchical logical structure and including elements that are actual information corresponding to structural elements serving as units of the logical structure, and a document ID that uniquely identifies the structured document; structure information storage means for storing structure information that associates a structure ID uniquely identifying each structural element with first feature information representing a feature of the element; analysis means for accepting input of a search condition for the structured documents, analyzing the accepted search condition, and obtaining a hierarchically structured search condition that has a hierarchy of nodes, each node being a structural unit corresponding to a structural element of the structured documents and associating candidate structure IDs to be searched with a search key for those candidates; acquisition means for acquiring from the structure information storage means, for each node of the hierarchically structured search condition obtained by the analysis means, the first feature information associated with the candidate structure IDs included in the node; calculation means for calculating, for each node of the hierarchically structured search condition obtained by the analysis means, second feature information representing a feature of the search key included in the node; deletion means for deleting, for each node of the hierarchically structured search condition obtained by the analysis means, from among the candidate structure IDs included in the node, any candidate for which the second feature information calculated by the calculation means for the corresponding search key does not conform to the first feature information acquired by the acquisition means; and search means for obtaining, based on the hierarchically structured search condition from which the deletion means has deleted candidate structure IDs, the document IDs corresponding to the structure IDs that satisfy the search condition, and retrieving the structured documents corresponding to the obtained document IDs from the structured document storage means.

The present invention also comprises: structured document storage means for storing, in association with each other, a structured document, which is a document having a hierarchical logical structure and including elements that are actual information corresponding to structural elements serving as units of the logical structure, and a document ID that uniquely identifies the structured document; structure information storage means for storing structure information that associates a structure ID uniquely identifying each structural element with first feature information representing a feature of the element; index storage means for storing an index that associates the document ID, an element ID that uniquely identifies the element, and the structure ID of the structural element corresponding to the element; analysis means for accepting input of a search condition for the structured documents, analyzing the accepted search condition, and obtaining a hierarchically structured search condition that has a hierarchy of nodes, each node being a structural unit corresponding to a structural element of the structured documents and associating candidate structure IDs to be searched with a search key for those candidates; acquisition means for acquiring from the structure information storage means, for each node of the hierarchically structured search condition obtained by the analysis means, the first feature information associated with the candidate structure IDs included in the node; calculation means for calculating, for each node of the hierarchically structured search condition obtained by the analysis means, second feature information representing a feature of the search key included in the node; deletion means for deleting, for each node of the hierarchically structured search condition obtained by the analysis means, from among the candidate structure IDs included in the node, any candidate for which the second feature information calculated by the calculation means for the corresponding search key does not conform to the first feature information acquired by the acquisition means; index search means for retrieving from the index storage means, based on the hierarchically structured search condition from which the deletion means has deleted candidate structure IDs, the document IDs and element IDs corresponding to the structure IDs that satisfy the search condition; and search means for retrieving from the structured document storage means the structured documents corresponding to the document IDs retrieved by the index search means.

The present invention is also a structured document search method that can be carried out by the above apparatus.

According to the present invention, feature information representing the features of the corresponding elements is stored for each structural element of the structured documents, and the number of intermediate candidates can be narrowed down by referring to that feature information at search time. This has the effect that a high-speed full-text search function for structured documents can be realized.

Exemplary embodiments of the structured document search apparatus and structured document search method according to the present invention are described in detail below with reference to the accompanying drawings.

(First embodiment)
The structured document search apparatus according to the first embodiment stores, for each structural element of a structured document, feature information representing the features of the corresponding elements, and narrows down the number of intermediate candidates by referring to that feature information at search time. This makes it possible to realize a full-text search function for structured documents, such as XQFT, at high speed.

FIG. 1 is a block diagram showing the configuration of the structured document search apparatus 100 according to the first embodiment. As shown in the figure, the structured document search apparatus 100 is connected to a client 300 via a network 200 and includes a communication unit 101, a storage processing unit 110, a search processing unit 120, a structured document storage unit 131, a structure template storage unit 132, and an index storage unit 133.

The client 300 transmits structured documents (XML documents) to be registered, and search queries over registered structured documents, to the structured document search apparatus 100, and receives the search results.

The network 200 connects the client 300 and the structured document search apparatus 100. Any network configuration can be used, for example the Internet, a wired LAN (Local Area Network), or a wireless LAN.

The communication unit 101 receives structured documents and search queries from the client 300 via the network 200 and transmits search results to the client 300.

The structured document storage unit 131 is a storage unit that stores structured documents described in XML. The description format of a structured document is explained here. FIG. 2 is an explanatory diagram showing an example of a structured document described in XML.

The figure shows an example of a structured document in which information about a patent is described in XML format. In XML, tags are used to express the structure of a document. There are start tags and end tags; by enclosing a component of the structured document between a start tag and an end tag, the boundaries of character strings (text) in the document and the component to which each text structurally belongs can be described clearly.

In XML, a unit of data defined using tags is called an element. For example, the <patent> tag, the </patent> tag, and the data enclosed by the two tags together constitute one element.

An element can also be given attributes, which add supplementary information such as whether the element can be omitted or repeated. An attribute is set in the start tag in the format '<element-name attribute-name="attribute value">'. The figure shows an example in which an "ID" attribute is specified as an attribute of the "patent" element.

A start tag is written as the element name enclosed by the symbols "<" and ">", and an end tag is written as the element name enclosed by the symbols "</" and ">". Between the start tag and the end tag, text representing the actual information of the structured document, or other elements (child elements), is placed. A component that contains no text, such as "<patentDB></patentDB>", can also be written in the shorthand form "<patentDB/>".

The document shown in the figure has the element beginning with the "patent" tag as its document root, with child elements beginning with the "title", "inventor list", "effect", and "keyword list" tags. In addition, for example, the element beginning with the "title" tag contains one text (character string), such as "structured document search device".
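
A document of this shape can be parsed with Python's standard library, as in the minimal sketch below. FIG. 2 is not reproduced in the text, so the sample is a stand-in: the English tag names, the ID value, and all text other than the title string are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document modeled on the description of FIG. 2.
SAMPLE = """
<patent ID="P001">
  <title>structured document search device</title>
  <inventor_list>
    <inventor>Inventor A</inventor>
    <inventor>Inventor B</inventor>
  </inventor_list>
  <effect>high-speed full-text search</effect>
  <keyword_list>
    <keyword>XML</keyword>
  </keyword_list>
</patent>
"""

root = ET.fromstring(SAMPLE.strip())
child_tags = [child.tag for child in root]            # child elements of the root
title_text = root.find("title").text                  # text inside <title>
inventor_count = len(root.findall("inventor_list/inventor"))
patent_id = root.get("ID")                            # attribute on the start tag
```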

Information obtained by extracting the tag names, hierarchical relationships, repetition counts, and so on from such XML structured documents is called structure information. A logical structural unit that makes up the structure information of a structured document is called a structural element. In the first embodiment, the elements, attributes, and text described above are the structural elements.

Next, the data structure of the structured documents stored in the structured document storage unit 131 is described. FIG. 3 is an explanatory diagram showing an example of the data structure of a structured document stored in the structured document storage unit 131.

The figure shows an example in which the structured document of FIG. 2 is represented as a tree-structured data structure. In FIG. 3, an elliptical node represents a folder, a hexagonal node represents a document, a single-line rectangle represents a tag, a double-line rectangle represents an attribute, and a rounded rectangle represents text.

For example, the subtree below the node representing the "inventor list" tag indicates that the "inventor list" element contains two "inventor" elements.

The structured document storage unit 131 stores such a tree-structured data structure in table form. FIG. 4 is an explanatory diagram showing an example of the data structure of a structured document expressed in table form.

As shown in the figure, each structured document in the structured document storage unit 131 is stored as rows that associate a document ID, an element ID, and the next younger brother and eldest son of the node, where the parent-child and sibling relationships of the tree are expressed in the eldest-son/next-younger-brother (first-child/next-sibling) scheme. The document ID is an identifier that uniquely identifies a structured document. The element ID is an identifier that uniquely identifies an element (node) of the tree of the structured document identified by the document ID.

For example, the node with element ID = 3 in the figure corresponds to the "title" tag node of FIG. 2; it has the "inventor list" tag node of FIG. 2 (element ID = 4) as its next younger brother and the "text" node of FIG. 2 (element ID = 5) as its eldest son.
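
The table layout can be sketched in Python as follows. This is a simplified illustration, not the layout of FIG. 4: it covers only element nodes (no folder, document, attribute, or text nodes), and the depth-first ID numbering, the field names, and the sample document are assumptions, so the IDs differ from those in the figure.

```python
import xml.etree.ElementTree as ET


def to_rows(doc_id, root):
    """Flatten an element tree into rows keyed by element ID, each holding
    the document ID, the next sibling ID, and the first child ID."""
    rows = {}
    counter = 0

    def visit(element):
        nonlocal counter
        counter += 1
        element_id = counter
        rows[element_id] = {"doc": doc_id, "tag": element.tag,
                            "next_sibling": None, "first_child": None}
        previous_child = None
        for child in element:
            child_id = visit(child)
            if previous_child is None:
                rows[element_id]["first_child"] = child_id      # eldest son
            else:
                rows[previous_child]["next_sibling"] = child_id  # next younger brother
            previous_child = child_id
        return element_id

    visit(root)
    return rows


root = ET.fromstring(
    "<patent><title/><inventor_list><inventor/><inventor/></inventor_list></patent>"
)
rows = to_rows(1, root)
```

Each row needs only two links regardless of how many children a node has, which is what makes the scheme convenient for a fixed-width table.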

The structure template storage unit 132 stores the structure information extracted from XML structured documents as described above. It is referred to when the structure of a structured document to be stored in the structured document storage unit 131 is analyzed by collating it with the structure information.

FIG. 5 is an explanatory diagram showing an example of the data structure of the structure information stored in the structure template storage unit 132 in the first embodiment. The figure shows an example in which the structure information is represented as a tree.

As shown in the figure, the tree of the structure information, like the tree of a structured document, contains elliptical nodes representing folders, hexagonal nodes representing documents, single-line rectangular nodes representing tags, double-line rectangular nodes representing attributes, and rounded rectangular nodes representing text.

Each structural element, that is, each node of the structure information, is assigned a template ID (TID), an identifier that uniquely identifies it. The structure information is obtained by extracting only the information that represents structure from a plurality of structured documents. Therefore, even information that can occur multiple times within a structured document, such as the "inventor" tag node, is consolidated into a single node in the structure information.

The structure template storage unit 132 stores such tree-structured structure information in table form. FIG. 6 is an explanatory diagram showing an example of the data structure of the structure template storage unit 132 expressed in table form.

As shown in the figure, the structure information in the structure template storage unit 132 associates a TID, a next younger brother, an eldest son, a minimum vocabulary ID, a maximum vocabulary ID, and a maximum text size.

The minimum vocabulary ID and the maximum vocabulary ID are set to the smallest and largest vocabulary IDs, respectively, among the vocabulary items that appear in the text elements corresponding to the structural element identified by the TID. A vocabulary ID is an identifier that uniquely identifies a vocabulary item stored in the index storage unit 133 and is assigned in the order in which vocabulary items appear. Details of the index storage unit 133 are described later.

The maximum text size is the character string length of the longest text element among the text elements corresponding to the structural element identified by the TID.

In this way, the structure information stores not only the information representing the parent-child and sibling relationships of the structural elements (next younger brother, eldest son) but also, as information representing the features of the text elements corresponding to each structural element (feature information), the minimum vocabulary ID, the maximum vocabulary ID, and the maximum text size. Referring to this information at search time makes it possible to narrow down the number of intermediate candidates.
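
The following Python sketch shows one way such feature information could be built and used for pruning. The passage does not specify the exact conformity test, so the rule in `may_match` is an assumption: a TID is kept only if the search key is no longer than the maximum text size and every full N-gram of the key is a known vocabulary item whose ID lies between the minimum and maximum vocabulary IDs. The function names and sample texts are likewise hypothetical.

```python
def ngrams(text, n=3):
    """N-gram division that keeps the shorter trailing pieces."""
    return [text[i:i + n] for i in range(len(text))]


def build_features(texts_by_tid, n=3):
    vocab_ids = {}   # vocabulary item -> vocabulary ID, in order of first appearance
    features = {}    # TID -> (min vocabulary ID, max vocabulary ID, max text size)
    for tid, texts in texts_by_tid.items():
        ids = []
        for text in texts:
            for gram in ngrams(text, n):
                ids.append(vocab_ids.setdefault(gram, len(vocab_ids) + 1))
        features[tid] = (min(ids), max(ids), max(len(t) for t in texts))
    return vocab_ids, features


def may_match(search_key, feature, vocab_ids, n=3):
    """Assumed pruning rule; returns False only when no text of the TID
    can contain the search key."""
    min_id, max_id, max_size = feature
    if len(search_key) > max_size:
        return False
    full_grams = [search_key[i:i + n] for i in range(len(search_key) - n + 1)]
    for gram in full_grams:
        gram_id = vocab_ids.get(gram)
        if gram_id is None or not (min_id <= gram_id <= max_id):
            return False
    return True


# Hypothetical texts: TID 1 holds "web site", TID 2 holds "search".
vocab_ids, features = build_features({1: ["web site"], 2: ["search"]})
candidates = [tid for tid in features if may_match("site", features[tid], vocab_ids)]
```

Under these assumptions the check is conservative: it can leave a TID in the candidate list that later turns out not to match, but it never removes one whose text actually contains the key.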

The index storage unit 133 stores the index used for searching structured documents, and includes a statistical information storage unit 133a and an inverted file storage unit 133b.

The statistical information storage unit 133a stores statistical information on the vocabulary that occurs in the structured documents stored in the structured document storage unit 131. A vocabulary item is a word or phrase contained in the text that represents the actual information of a structured document. In the first embodiment, the strings obtained by N-gram division of the text are used as the vocabulary. N-gram division is a method of splitting a character string into vocabulary items consisting of every run of N consecutive characters it contains.

For example, dividing the character string “ＸＭＬデータベース” ("XML database") into N-grams with N = 3 yields nine vocabulary items: “ＸＭＬ”, “ＭＬデ”, “Ｌデー”, “データ”, “ータベ”, “タベー”, “ベース”, “ース”, and “ス”.
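
The division in this example can be reproduced with a short Python sketch. The function name `ngram_split` is an assumption; the behavior of keeping the shorter pieces at the end of the string follows the nine items listed above.

```python
def ngram_split(text, n=3):
    """Take up to n characters starting at every position, so the last
    n - 1 pieces are shorter than n."""
    return [text[i:i + n] for i in range(len(text))]


vocabulary = ngram_split("ＸＭＬデータベース", n=3)
```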

The method of dividing text into vocabulary is not limited to this. Any method that extracts words or phrases contained in the text can be used, such as taking the words obtained by morphological analysis as the vocabulary.

FIG. 7 is an explanatory diagram showing an example of the data structure of the statistical information stored in the statistical information storage unit 133a. As shown in the figure, the statistical information storage unit 133a stores statistical information that associates, for each vocabulary item obtained by the division, a vocabulary ID (an identifier assigned in ascending order in order of occurrence), the frequency with which the vocabulary item occurs across all structured documents, and an inverted file number.

In general, vocabulary with a high appearance frequency is likely to be present in any document, so its vocabulary ID tends to be small; conversely, rare vocabulary tends to have a large vocabulary ID.

The inverted file number is a number that uniquely identifies an inverted file stored in the inverted file storage unit 133b, described later. The vocabulary statistics stored in the statistical information storage unit 133a and the inverted files together constitute the index used to speed up searches of structured documents.
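
A minimal Python sketch of such statistics follows. The field names, the sample texts, and the choice to reuse the vocabulary ID as the inverted file number are assumptions for illustration; the text states only that the three values are associated.

```python
def build_statistics(texts, n=3):
    """Map each vocabulary item to its vocabulary ID (order of first
    occurrence), its frequency over all texts, and an inverted file number."""
    stats = {}
    for text in texts:
        for i in range(len(text)):
            gram = text[i:i + n]
            if gram not in stats:
                new_id = len(stats) + 1
                # Assumption: the inverted file number equals the vocabulary ID.
                stats[gram] = {"vocab_id": new_id, "frequency": 0,
                               "inverted_file_no": new_id}
            stats[gram]["frequency"] += 1
    return stats


# Hypothetical texts standing in for the text elements of stored documents.
stats = build_statistics(["abcab", "abc"])
```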

The inverted file storage unit 133b stores the inverted files that form part of the index used to speed up search processing of structured documents. FIG. 8 is an explanatory diagram showing an example of the data structure of an inverted file stored in the inverted file storage unit 133b.

The figure shows the data structure of one inverted file, corresponding to one vocabulary item stored in the statistical information storage unit 133a. In practice, an inverted file with the structure shown in the figure is created for each vocabulary item and stored in the inverted file storage unit 133b.

As shown in the figure, an inverted file in the inverted file storage unit 133b stores lexical index information that associates a TID, a document ID, an element ID, and an occurrence position.

The occurrence position is information indicating the position at which the vocabulary item corresponding to the inverted file appears within the element of the structured document identified by the document ID and the element ID.
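
The per-vocabulary inverted files can be sketched in Python as below. Treating the occurrence position as a zero-based character offset within the element text is an assumption, as are the function name and the sample elements.

```python
from collections import defaultdict


def build_inverted_files(elements, n=3):
    """elements: iterable of (TID, document ID, element ID, text).
    Returns one posting list per vocabulary item; each posting is
    (TID, document ID, element ID, occurrence position)."""
    inverted = defaultdict(list)
    for tid, doc_id, element_id, text in elements:
        for position in range(len(text)):
            gram = text[position:position + n]
            inverted[gram].append((tid, doc_id, element_id, position))
    return inverted


# Hypothetical text elements from two documents that share TID 3.
elements = [(3, 1, 5, "web site"), (3, 2, 5, "my web")]
inverted = build_inverted_files(elements)
web_postings = inverted["web"]
```

Because each posting carries the TID, postings for structural elements that were pruned from the candidate list can be skipped without opening the documents themselves.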

The storage processing unit 110 executes the processing that stores a structured document received from the client 300 in the structured document storage unit 131. It includes a structure information extraction unit 111, a structure template determination unit 112, an index registration unit 113, a statistical information update unit 114, and a document registration unit 115.

構造情報抽出部１１１は、通信部１０１がクライアント３００から受信した構造化文書から構造情報を抽出するものである。具体的には、構造情報抽出部１１１は、受信した構造化文書を構文解析してＤＯＭのような木構造の形式に展開し、木構造の各ノードを構造情報として抽出する。展開した木構造の各ノードは、後述する文書登録部１１５により構造化文書記憶部１３１に記憶される。なお、文書ＩＤ、要素ＩＤは、構造情報抽出部１１１が構造情報を抽出した際に付与される。 The structure information extraction unit 111 extracts structure information from the structured document received by the communication unit 101 from the client 300. Specifically, the structure information extraction unit 111 parses the received structured document and develops it into a tree structure format such as DOM, and extracts each node of the tree structure as structure information. Each node of the expanded tree structure is stored in the structured document storage unit 131 by the document registration unit 115 described later. The document ID and element ID are given when the structure information extraction unit 111 extracts structure information.

構造テンプレート決定部１１２は、構造情報抽出部１１１が抽出した構造情報を参照し、構造テンプレート記憶部１３２に記憶された構造情報と照合することにより、展開した木構造の構造化文書の各ノードに対応するＴＩＤを決定するものである。また、構造テンプレート決定部１１２は、構造テンプレート記憶部１３２に既に記憶されている構造情報に含まれない新規の構造情報を構造情報抽出部１１１が抽出した場合は、当該新規構造情報を構造テンプレート記憶部１３２に格納する。 The structure template determination unit 112 refers to the structure information extracted by the structure information extraction unit 111 and collates it with the structure information stored in the structure template storage unit 132, thereby determining the TID corresponding to each node of the structured document expanded into a tree structure. When the structure information extraction unit 111 extracts new structure information that is not included in the structure information already stored in the structure template storage unit 132, the structure template determination unit 112 stores the new structure information in the structure template storage unit 132.

索引登録部１１３は、クライアント３００から受信した構造化文書に含まれるテキスト要素に対する語彙索引情報を作成し、転置ファイル記憶部１３３ｂに登録するものである。 The index registration unit 113 creates lexical index information for text elements included in the structured document received from the client 300 and registers it in the transposed file storage unit 133b.

具体的には、索引登録部１１３は、テキスト要素の文字列をＮグラム分割して得られた語彙に対応する転置ファイル内に、構造テンプレート決定部１１２で決定または算出されたＴＩＤ、文書ＩＤ、要素ＩＤ、および発生位置を対応づけた語彙索引情報を新たに転置ファイルに追加する。 Specifically, for each vocabulary obtained by dividing the character string of the text element into N-grams, the index registration unit 113 adds to the corresponding transposed file new lexical index information that associates the TID determined or calculated by the structure template determination unit 112 with the document ID, the element ID, and the occurrence position.
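As a rough illustration of this registration step, the sketch below splits a text element into overlapping N-grams and appends one row per gram. The gram length `n=2` is an assumption (the text does not fix N), and `ngrams` and `register_text_element` are hypothetical helper names.

```python
def ngrams(text, n=2):
    """Return (gram, position) pairs for every overlapping n-character window."""
    return [(text[i:i + n], i) for i in range(len(text) - n + 1)]


def register_text_element(index, text, tid, doc_id, element_id, n=2):
    """Append (TID, document ID, element ID, position) to each gram's posting list."""
    for gram, position in ngrams(text, n):
        index.setdefault(gram, []).append((tid, doc_id, element_id, position))


# Illustrative IDs only.
index = {}
register_text_element(index, "構造化文書", tid=7, doc_id=1, element_id=3)
```

With `n=2`, the five-character text “構造化文書” yields the four grams “構造”, “造化”, “化文”, and “文書”, each stored with its offset.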

統計情報更新部１１４は、クライアント３００から受信した構造化文書に含まれるテキスト要素から抽出した語彙の統計情報を統計情報記憶部１３３ａに登録するものである。 The statistical information update unit 114 registers the statistical information of the vocabulary extracted from the text elements included in the structured document received from the client 300 in the statistical information storage unit 133a.

具体的には、統計情報更新部１１４は、テキスト要素の文字列をＮグラム分割して得られた語彙に対応する統計情報記憶部１３３ａ内の統計情報を更新する。なお、新規の語彙が抽出された場合は、当該新規の語彙を統計情報記憶部１３３ａの統計情報に追加する。 Specifically, the statistical information update unit 114 updates the statistical information in the statistical information storage unit 133a corresponding to the vocabulary obtained by dividing the character string of the text element into N grams. When a new vocabulary is extracted, the new vocabulary is added to the statistical information in the statistical information storage unit 133a.

また、統計情報更新部１１４は、構造テンプレート記憶部１３２内の最小語彙ＩＤ、最大語彙ＩＤ、最大テキストサイズの更新が必要な場合に、それらの値を更新する処理を行う。例えば、図６のような構造情報が構造テンプレート記憶部１３２に登録されている場合に、ＴＩＤ＝４の構造要素に対応するテキスト要素として最大語彙ＩＤ＝５２２の語彙が出現した場合、最大語彙ＩＤを５２２に更新する。 Further, the statistical information update unit 114 updates the minimum vocabulary ID, the maximum vocabulary ID, and the maximum text size in the structure template storage unit 132 whenever those values need to be updated. For example, when the structure information shown in FIG. 6 is registered in the structure template storage unit 132 and a vocabulary whose vocabulary ID (522) exceeds the stored maximum appears in a text element corresponding to the structure element with TID = 4, the maximum vocabulary ID is updated to 522.
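The update rule amounts to widening a stored range. A minimal sketch follows, assuming the per-TID feature information is held as a dictionary; the function name `update_features` and the starting values for TID 4 are hypothetical and are not the values of FIG. 6.

```python
def update_features(template, tid, vocab_ids, text_size):
    """Widen the stored [min, max] vocabulary ID range and the max text size."""
    features = template[tid]
    if vocab_ids:
        features["min_vocab_id"] = min(features["min_vocab_id"], min(vocab_ids))
        features["max_vocab_id"] = max(features["max_vocab_id"], max(vocab_ids))
    features["max_text_size"] = max(features["max_text_size"], text_size)
    return features


# Hypothetical stored features for TID 4 before the new text element arrives.
template = {4: {"min_vocab_id": 30, "max_vocab_id": 400, "max_text_size": 20}}
update_features(template, 4, vocab_ids=[522], text_size=12)
```

Only the maximum vocabulary ID changes here, because 522 is above the stored maximum while the text size 12 is below the stored maximum of 20.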

文書登録部１１５は、構造情報抽出部１１１が木構造に展開した各ノードに対して、親子兄弟関係を付加し、構造化文書記憶部１３１に格納するものである。 The document registration unit 115 adds a parent-child sibling relationship to each node developed by the structure information extraction unit 111 into a tree structure, and stores it in the structured document storage unit 131.

検索処理部１２０は、クライアント３００から受信した検索クエリに従い、構造化文書記憶部１３１から構造化文書を検索する処理を実行するもので、クエリ解析部１２１と、制約付加部１２２と、クエリプランニング部１２３と、クエリ実行部１２４とを備えている。 The search processing unit 120 executes processing for searching the structured document storage unit 131 for structured documents in accordance with the search query received from the client 300, and includes a query analysis unit 121, a constraint adding unit 122, a query planning unit 123, and a query execution unit 124.

検索処理部１２０の処理は、原則として特許文献１の方法により実行される。以下では、検索処理のうち、語彙索引の処理に関する部分について詳細に説明するが、実際の処理では、語彙索引の処理以外にもさまざまな処理が実行される。 The processing of the search processing unit 120 is executed by the method of Patent Document 1 in principle. In the following, a part related to the vocabulary index process in the search process will be described in detail. However, in the actual process, various processes other than the vocabulary index process are executed.

なお、特許文献１の方法では、制約従属型のプラン作成方法に準じ、問合せ言語を解析した内部形式から、階層構造の検索条件であるクエリグラフを作成する。そして、作成したクエリグラフに含まれる全ての変数の具体化を目標として、テーブルと呼ばれる変数集合の取り得る値（候補集合）の組み合わせを表すデータを順次生成する。ここで、１つのテーブルを生成する処理単位をオペレータと呼ぶ。各オペレータの結果は、候補集合として記憶部に保存される。 In the method of Patent Document 1, a query graph, which is a hierarchical search condition, is created from the internal format obtained by analyzing the query language, in accordance with a constraint-dependent plan creation method. Then, with the goal of instantiating all the variables included in the created query graph, data called tables, each representing a combination of the values (candidate sets) that a set of variables can take, are generated one after another. Here, the processing unit that generates one table is called an operator. The result of each operator is stored in the storage unit as a candidate set.

クエリ解析部１２１は、通信部１０１がクライアント３００から受信した検索クエリを構文解析し、解析結果としてクエリグラフを作成するものである。クエリグラフの作成は、特許文献１に記載の方法など従来から用いられているあらゆる方法を適用することができる。 The query analysis unit 121 parses the search query received by the communication unit 101 from the client 300 and creates a query graph as an analysis result. For the creation of the query graph, any conventionally used method such as the method described in Patent Document 1 can be applied.

図９は、検索クエリの一例を示す説明図である。同図に示すクエリは、「タイトル要素に“構造化文書”および“XML”を含み、ＩＤ＝３である“特許”文書を取得し、“＜検索結果＞”タグで囲った検索結果データを出力する検索条件を表している。 FIG. 9 is an explanatory diagram illustrating an example of a search query. The query shown in the figure expresses the following search condition: acquire the “特許” (“patent”) documents whose title element contains “構造化文書” (“structured document”) and “XML” and whose ID = 3, and output the search result data enclosed in “＜検索結果＞” (“<search result>”) tags.

図１０は、クエリグラフの一例を示す説明図である。同図のクエリグラフは、図９の検索クエリをクエリ解析部１２１が解析して作成したクエリグラフである。 FIG. 10 is an explanatory diagram illustrating an example of a query graph. The query graph in the figure is a query graph created by the query analysis unit 121 analyzing the search query in FIG. 9.

図１０に示すように、クエリグラフは構造情報の各構造要素に対応したノードを含む木構造で表される。例えば、図１０のクエリグラフのノード２は、「特許」文書の構造要素が対応することを示している。また、例えば、ノード３はタイトルタグが、ノード４はタイトルタグ下のテキスト要素が対応することを示している。 As shown in FIG. 10, the query graph is represented by a tree structure including nodes corresponding to each structural element of the structural information. For example, node 2 in the query graph of FIG. 10 indicates that the structural element of the “patent” document corresponds. Further, for example, the node 3 corresponds to the title tag, and the node 4 corresponds to the text element under the title tag.

構造要素に対する検索条件（以下、検索キー）が存在する場合は、当該検索キーを、検索キーの検索対象となる構造要素に対応するノードに対応づける。例えば、図１０では、ノード４に対応するタイトルタグ下のテキスト要素に対して検索キーとして「ftcontains “構造化文書” && “XML”」が対応づけられている。 When a search condition for a structural element (hereinafter, a search key) exists, the search key is associated with the node corresponding to the structural element that the search key targets. For example, in FIG. 10, the search key “ftcontains “構造化文書” && “XML”” is associated with the text element under the title tag corresponding to node 4.

制約付加部１２２は、クエリ解析部１２１が作成したクエリグラフにおけるノード間の制約関係を求め、当該制約を満たす構造要素の候補を取得してクエリグラフに付加するものである。なお、このような構造要素の候補が中間候補の１つに相当する。 The constraint adding unit 122 obtains a constraint relationship between nodes in the query graph created by the query analysis unit 121, acquires a candidate for a structural element that satisfies the constraint, and adds the candidate to the query graph. Note that such a structural element candidate corresponds to one of the intermediate candidates.

制約としては、候補となる構造要素が相互に満たさなければならない構造に関する制約（以下、構造制約という。）、語彙に関する制約（以下、語彙制約という。）が存在する。 As constraints, there are constraints on structures (hereinafter referred to as structure constraints) that must be satisfied by candidate structural elements, and constraints on vocabularies (hereinafter referred to as vocabulary constraints).

構造制約とは、テキスト全体が対応するノードレベルの情報により設定される制約をいう。例えば、クエリグラフのノード４はタイトルタグ下のテキスト要素でなければならないといった制約が構造制約に該当する。この場合、ノード４には、対応する構造要素の候補として、ＴＩＤ＝７の構造要素が取得される。なお、構造情報は、図５、図６に示すような内容が格納されていることを前提とする。同様に、ノード２に対してはＴＩＤ＝４の構造要素が、ノード６に対してはＴＩＤ＝５の構造要素が候補として取得される。 The structure constraint is a constraint set by node level information corresponding to the entire text. For example, a constraint that the node 4 of the query graph must be a text element under the title tag corresponds to the structure constraint. In this case, a structural element with TID = 7 is acquired as a corresponding structural element candidate in the node 4. The structure information is premised on the contents shown in FIGS. 5 and 6 being stored. Similarly, a structural element with TID = 4 is obtained as a candidate for node 2 and a structural element with TID = 5 is obtained as a candidate for node 6.

この他、制約付加部１２２は、ある２つの構造要素は同一文書内に存在するといった制約（以下、同一文書制約という。）を構造制約とすることもできる。 In addition, the constraint adding unit 122 can also make a constraint that two certain structural elements exist in the same document (hereinafter referred to as the same document constraint) as the structure constraint.

語彙制約とは、テキストの語彙レベルの情報により設定される制約をいう。例えば、ＴＩＤ＝７のノードに対応する語彙は、語彙ＩＤが最小語彙ＩＤ＝２１から最大語彙ＩＤ＝１０４５の間に存在しなければならないといった制約が語彙制約に該当する。なお、最小語彙ＩＤ、最大語彙ＩＤは、図６に示すようなノードごとの特徴情報を含む構造テンプレート記憶部１３２から取得できる。 A vocabulary constraint is a constraint set by vocabulary-level information of the text. For example, the requirement that a vocabulary corresponding to the node with TID = 7 must have a vocabulary ID between the minimum vocabulary ID (= 21) and the maximum vocabulary ID (= 1045) is a vocabulary constraint. The minimum vocabulary ID and the maximum vocabulary ID can be acquired from the structure template storage unit 132, which holds feature information for each node as shown in FIG. 6.

また、制約付加部１２２は、近傍検索などの語彙の出現位置を条件とした制約を語彙制約とすることもできる。 In addition, the constraint adding unit 122 can also set a constraint such as a neighborhood search on the condition of the appearance position of a vocabulary as a vocabulary constraint.

制約付加部１２２は、図１に示すように、特徴情報取得部１２２ａと、特徴情報算出部１２２ｂと、ＴＩＤ候補削除部１２２ｃとを備えている。 As shown in FIG. 1, the constraint adding unit 122 includes a feature information acquisition unit 122a, a feature information calculation unit 122b, and a TID candidate deletion unit 122c.

特徴情報取得部１２２ａは、クエリグラフの各ノードに含まれるＴＩＤの候補に対応づけられた特徴情報である最小語彙ＩＤおよび最大語彙ＩＤを構造テンプレート記憶部１３２から取得するものである。 The feature information acquisition unit 122a acquires, from the structure template storage unit 132, the minimum vocabulary ID and the maximum vocabulary ID that are feature information associated with the TID candidates included in each node of the query graph.

特徴情報算出部１２２ｂは、クエリグラフの各ノードに含まれる検索キーの特徴を表す特徴情報を算出するものである。具体的には、特徴情報算出部１２２ｂは、検索キーをＮグラム分割した結果の語彙を検索キーの特徴情報として算出する。 The feature information calculation unit 122b calculates feature information representing the feature of the search key included in each node of the query graph. Specifically, the feature information calculation unit 122b calculates the vocabulary as a result of dividing the search key into N grams as the feature information of the search key.

ＴＩＤ候補削除部１２２ｃは、クエリグラフの各ノードに含まれるＴＩＤの候補から、特徴情報算出部１２２ｂが算出した特徴情報である検索キーを分割した語彙の語彙ＩＤが、特徴情報取得部１２２ａが取得した最小語彙ＩＤおよび最大語彙ＩＤの範囲外となる候補を削除するものである。 The TID candidate deletion unit 122c deletes, from the TID candidates included in each node of the query graph, any candidate for which the vocabulary ID of a vocabulary obtained by dividing the search key (the feature information calculated by the feature information calculation unit 122b) falls outside the range between the minimum vocabulary ID and the maximum vocabulary ID acquired by the feature information acquisition unit 122a.

クエリプランニング部１２３は、クエリ解析部１２１で求めたクエリグラフと、制約付加部１２２で付加した制約関係を満たす構造要素の候補の情報とを参照し、処理コストが最小になるようなプラン（処理順序）を作成するものである。プランの作成方法としては、特許文献１に記載の方法など従来から用いられているあらゆる方法を適用することができる。なお、Ｎグラム方式では、グラムレベルでその処理順序を決定するために、グラムごとのコストと、検索語彙全体のコストとが分けて計算される。 The query planning unit 123 refers to the query graph obtained by the query analysis unit 121 and to the information on the structural element candidates that satisfy the constraint relationships added by the constraint adding unit 122, and creates a plan (processing order) that minimizes the processing cost. As the plan creation method, any conventionally used method, such as the method described in Patent Document 1, can be applied. In the N-gram method, the cost of each gram and the cost of the search vocabulary as a whole are calculated separately so that the processing order can be determined at the gram level.

図１１、図１２は、処理コスト計算で用いられる語彙の頻度情報の一例を示した説明図である。図１１、図１２は、単純に語彙の頻度によって処理コストを計算する場合の例を示している。 11 and 12 are explanatory diagrams showing an example of vocabulary frequency information used in processing cost calculation. FIGS. 11 and 12 show an example in which the processing cost is simply calculated based on the vocabulary frequency.

検索結果の候補を早期に絞り込むため、クエリプランニング部１２３は、原則として頻度が小さい語彙から検索を実行するようなプランを作成する。例えば、テキスト“構造化文書”から抽出された語彙の頻度が、図１１に示すような値であった場合、最も頻度の小さい（３０）語彙である“造化”から検索を実行するプランが作成される。 In order to narrow down the search result candidates at an early stage, the query planning unit 123 in principle creates a plan that executes the search starting from the vocabulary with the lowest frequency. For example, when the frequencies of the vocabularies extracted from the text “構造化文書” are the values shown in FIG. 11, a plan is created that starts the search from “造化”, the vocabulary with the lowest frequency (30).

なお、同一テキスト内の語彙を連続して検索する必要はない。例えば、図１２に示すようなテキスト“ＸＭＬ”から抽出された語彙の頻度と、図１１に示すような語彙の頻度とから、“造化”、“化文”、“Ｌ”の順に処理するようなプランを作成してもよい。 The vocabularies of a single text need not be searched consecutively. For example, based on the frequencies of the vocabularies extracted from the text “ＸＭＬ” as shown in FIG. 12 and the vocabulary frequencies shown in FIG. 11, a plan may be created that processes “造化”, “化文”, and “Ｌ” in that order.
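The frequency-first ordering can be sketched as a simple sort over the grams of all search terms. In the table below only the value 30 for “造化” comes from the text; every other frequency is a hypothetical placeholder chosen to reproduce the order “造化”, “化文”, “Ｌ”, and `plan_order` is an illustrative name.

```python
def plan_order(frequencies):
    """Order grams so that the least frequent (most selective) comes first."""
    return sorted(frequencies, key=lambda gram: frequencies[gram])


frequencies = {
    "構造": 120, "造化": 30, "化文": 45, "文書": 200,  # grams of “構造化文書”
    "Ｘ": 80, "Ｍ": 90, "Ｌ": 60,                      # grams of “ＸＭＬ”
}
plan = plan_order(frequencies)
```

Because the grams of both search terms are sorted together, the plan interleaves them instead of finishing one term before starting the next.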

また、あるオペレータを実行することで新たに制約が付加される場合もあるが、Ｎグラム方式で分割して得た索引の場合は、１つでも条件を満たさない語彙が存在する場合は、その時点で以後の処理が中断される。 In addition, a new constraint may be added by executing a certain operator, but in the case of an index obtained by dividing it by the N-gram method, if any vocabulary that does not satisfy the condition exists, At that point, subsequent processing is interrupted.
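The early stop described above can be illustrated for a purely AND-connected set of grams: candidates are intersected in plan order, and the scan ends as soon as the intersection is empty. This is a simplified sketch that ignores occurrence positions; `scan_in_plan_order` and the (document ID, element ID) pairs are hypothetical.

```python
def scan_in_plan_order(plan, postings):
    """Intersect candidate elements gram by gram; stop at the first empty result.

    Returns (surviving candidates, gram that emptied the set or None).
    """
    candidates = None
    for gram in plan:
        hits = postings.get(gram, set())
        candidates = set(hits) if candidates is None else candidates & hits
        if not candidates:
            return set(), gram  # remaining grams are never scanned
    return (candidates or set()), None


# Hypothetical (document ID, element ID) pairs per gram.
postings = {"造化": {(1, 3), (2, 5)}, "化文": {(1, 3)}, "Ｌ": set()}
hit, stopped_at = scan_in_plan_order(["造化", "化文", "Ｌ", "Ｘ"], postings)
```

Processing the rarest gram first makes the intersection shrink quickly, so the stop tends to occur early.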

クエリ実行部１２４は、クエリプランニング部１２３が作成したプランに従ってクエリを実行するものであり、索引検索部１２４ａと、検索部１２４ｂとを備えている。 The query execution unit 124 executes a query according to the plan created by the query planning unit 123, and includes an index search unit 124a and a search unit 124b.

索引検索部１２４ａは、語彙索引の処理を行う語彙索引処理オペレータを実行するものである。 The index search unit 124a executes a vocabulary index processing operator that performs lexical index processing.

検索部１２４ｂは、残りのオペレータを実行することにより、検索条件を満たす構造化文書を検索するものである。 The search unit 124b searches for structured documents that satisfy the search conditions by executing the remaining operators.

次に、このように構成された第１の実施の形態にかかる構造化文書検索装置１００による構造化文書格納処理について説明する。図１３は、第１の実施の形態における構造化文書格納処理の全体の流れを示すフローチャートである。 Next, a structured document storage process performed by the structured document search device 100 according to the first embodiment configured as described above will be described. FIG. 13 is a flowchart showing the overall flow of structured document storage processing in the first embodiment.

まず、通信部１０１が、クライアント３００から入力データであるＸＭＬ文書を受信する（ステップＳ１３０１） First, the communication unit 101 receives an XML document that is input data from the client 300 (step S1301).

次に、構造情報抽出部１１１が、入力データを構文解析して木構造の形式に展開し、展開した木構造から構造要素を抽出する（ステップＳ１３０２）。 Next, the structure information extraction unit 111 parses the input data and expands it into a tree structure format, and extracts structure elements from the expanded tree structure (step S1302).

次に、構造テンプレート決定部１１２が、木構造と構造テンプレート記憶部１３２に格納された構造情報とを参照して、木構造の各ノードに対応する構造要素のＩＤ（ＴＩＤ）を決定する構造テンプレート決定処理を実行する（ステップＳ１３０３）。構造テンプレート決定処理の詳細については後述する。 Next, the structure template determining unit 112 refers to the tree structure and the structure information stored in the structure template storage unit 132, and determines the structure element ID (TID) corresponding to each node of the tree structure. A determination process is executed (step S1303). Details of the structure template determination process will be described later.

次に、索引登録部１１３が、語彙索引情報を生成し、転置ファイル記憶部１３３ｂに登録する（ステップＳ１３０４）。転置ファイルに登録する語彙索引情報に必要なＴＩＤ、文書ＩＤ、要素ＩＤ、および発生位置の情報は、構造テンプレート決定処理で取得または算出される。 Next, the index registration unit 113 generates lexical index information and registers it in the transposed file storage unit 133b (step S1304). Information on the TID, document ID, element ID, and occurrence position necessary for the lexical index information registered in the transposed file is acquired or calculated in the structure template determination process.

次に、統計情報更新部１１４が、統計情報記憶部１３３ａの統計情報を更新する（ステップＳ１３０５）。具体的には、統計情報更新部１１４は、テキスト要素の文字列をＮグラム分割して得られた語彙について、統計情報記憶部１３３ａの統計情報の頻度の値を、テキスト要素内での出現回数分加算した値に更新する。 Next, the statistical information update unit 114 updates the statistical information in the statistical information storage unit 133a (step S1305). Specifically, for each vocabulary obtained by dividing the character string of the text element into N-grams, the statistical information update unit 114 increases the frequency value held in the statistical information storage unit 133a by the number of times the vocabulary occurs in the text element.
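A minimal sketch of this frequency update is shown below. The dictionary layout of the statistics, the sequential allocation of new vocabulary IDs, and the gram length `n=2` are all assumptions made for illustration.

```python
from collections import Counter


def update_statistics(stats, text, n=2):
    """Add each gram's occurrence count in `text` to its stored frequency."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    for gram, occurrences in counts.items():
        if gram not in stats:
            # Hypothetical ID allocation: next unused sequential number.
            stats[gram] = {"vocab_id": len(stats) + 1, "frequency": 0}
        stats[gram]["frequency"] += occurrences
    return stats


stats = {}
update_statistics(stats, "ababab")  # grams: ab, ba, ab, ba, ab
update_statistics(stats, "ab")      # one more occurrence of "ab"
```

Counting occurrences per text element first means each vocabulary's entry is touched once per element, however often the gram repeats.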

また、統計情報更新部１１４は、構造テンプレート決定処理内で算出されたＴＩＤごとの最小語彙ＩＤ、最大語彙ＩＤ、最大テキストサイズに変更がある場合は、変更後の値で構造テンプレート記憶部１３２の特徴情報（最小語彙ＩＤ、最大語彙ＩＤ）を更新する。 Further, when there is a change in the minimum vocabulary ID, the maximum vocabulary ID, and the maximum text size for each TID calculated in the structure template determination process, the statistical information update unit 114 uses the values after the change in the structure template storage unit 132. The feature information (minimum vocabulary ID, maximum vocabulary ID) is updated.

次に、文書登録部１１５が、木構造の各ノードに対して、親子兄弟関係を付加し、表形式で表した構造化文書を構造化文書記憶部１３１に登録する（ステップＳ１３０６）。例えば、図２のようなＸＭＬ文書を入力し、構造情報抽出部１１１により図３に示すような木構造が得られた場合、図４のような情報により木構造を表形式で表した構造化文書が登録される。 Next, the document registration unit 115 adds parent, child, and sibling relationships to each node of the tree structure and registers the structured document, expressed in table form, in the structured document storage unit 131 (step S1306). For example, when an XML document as shown in FIG. 2 is input and the structure information extraction unit 111 obtains a tree structure as shown in FIG. 3, a structured document in which the tree structure is expressed in table form using information as shown in FIG. 4 is registered.
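One way to flatten a tree into rows that carry parent, child, and sibling links is sketched below with Python's standard `xml.etree.ElementTree`. The column names and the pre-order numbering of element IDs are assumptions; the actual layout of FIG. 4 is not reproduced here, and text nodes are omitted for brevity.

```python
import xml.etree.ElementTree as ET


def tree_to_rows(xml_text):
    """Flatten an XML tree into rows with parent / first-child / next-sibling links."""
    rows = []

    def visit(elem, parent_id):
        row = {"element_id": len(rows) + 1, "tag": elem.tag, "parent": parent_id,
               "first_child": None, "next_sibling": None}
        rows.append(row)
        previous = None
        for child in elem:
            child_row = visit(child, row["element_id"])
            if previous is None:
                row["first_child"] = child_row["element_id"]
            else:
                previous["next_sibling"] = child_row["element_id"]
            previous = child_row
        return row

    visit(ET.fromstring(xml_text), None)
    return rows


rows = tree_to_rows("<patent><title>t</title><abstract>a</abstract></patent>")
```

Storing only the first child and the next sibling keeps every row at a fixed width regardless of how many children a node has.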

以上が、構造化文書格納処理の全体の流れである。次に、ステップＳ１３０３の構造テンプレート決定処理について説明する。図１４は、第１の実施の形態における構造テンプレート決定処理の全体の流れを示すフローチャートである。 The above is the overall flow of the structured document storage process. Next, the structure template determination process in step S1303 will be described. FIG. 14 is a flowchart illustrating an overall flow of the structure template determination process according to the first embodiment.

まず、構造テンプレート決定部１１２は、構造情報抽出部１１１が抽出した構造要素のうち、テキスト要素に対応する構造要素について、当該テキスト要素に対する語彙分割を実行する（ステップＳ１４０１）。語彙分割方法としては、Ｎグラム方式の分割方法を採用する。 First, the structure template determination unit 112 executes vocabulary division for a text element among the structure elements extracted by the structure information extraction unit 111 (step S1401). As the vocabulary dividing method, an N-gram method is adopted.

次に、構造テンプレート決定部１１２は、分割した語彙から重複した語彙を削除し、新規の語彙に対しては語彙ＩＤを付加する（ステップＳ１４０２）。なお、既存の語彙の場合は、統計情報記憶部１３３ａから語彙ＩＤを取得する。また、語彙の発生回数は、後に統計情報記憶部１３３ａの頻度を更新する際に利用するのでＲＡＭ（Random Access Memory）などの記憶部（図示せず）に記憶しておく。 Next, the structure template determination unit 112 deletes duplicate vocabulary from the divided vocabulary, and adds a vocabulary ID to the new vocabulary (step S1402). In the case of an existing vocabulary, the vocabulary ID is acquired from the statistical information storage unit 133a. The number of occurrences of the vocabulary is stored in a storage unit (not shown) such as a RAM (Random Access Memory) because it is used later when the frequency of the statistical information storage unit 133a is updated.

次に、構造テンプレート決定部１１２は、構造テンプレート記憶部１３２に記憶するＴＩＤごとの特徴情報（最大語彙ＩＤ、最小語彙ＩＤ、最大テキストサイズ）を算出する（ステップＳ１４０３）。具体的には、構造テンプレート決定部１１２は、ステップＳ１４０２で付加または取得した語彙ＩＤのうち、最大値を最大語彙ＩＤとし、最小値を最小語彙ＩＤとして算出する。また、対応するテキスト要素の文字列長を最大テキストサイズとして算出する。 Next, the structure template determination unit 112 calculates feature information (maximum vocabulary ID, minimum vocabulary ID, maximum text size) for each TID stored in the structure template storage unit 132 (step S1403). Specifically, the structure template determination unit 112 calculates the maximum value as the maximum vocabulary ID and the minimum value as the minimum vocabulary ID among the vocabulary IDs added or obtained in step S1402. Further, the character string length of the corresponding text element is calculated as the maximum text size.
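Steps S1401 to S1403 can be sketched for a single text element as follows. The mapping `vocab_ids`, the sequential allocation of IDs for new vocabulary, and the gram length `n=2` are assumptions, and the sample IDs are illustrative rather than the values of FIG. 7 or FIG. 11.

```python
def text_features(text, vocab_ids, n=2):
    """Split a text element, assign vocabulary IDs, and derive its feature values."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]  # S1401
    unique = list(dict.fromkeys(grams))                        # S1402: drop duplicates
    for gram in unique:
        if gram not in vocab_ids:                              # new vocabulary
            vocab_ids[gram] = max(vocab_ids.values(), default=0) + 1
    ids = [vocab_ids[gram] for gram in unique]
    return {                                                   # S1403
        "min_vocab_id": min(ids, default=None),
        "max_vocab_id": max(ids, default=None),
        "max_text_size": len(text),
    }


vocab_ids = {"構造": 22, "造化": 178}  # hypothetical existing vocabulary
features = text_features("構造化文書", vocab_ids)
```

The result describes one text element only; merging it into the stored values for the TID is a separate widening step.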

次に、構造テンプレート決定部１１２は、構造テンプレート記憶部１３２を参照し、展開した木構造の構造要素から１つの構造要素を取得し、木構造の構造制約に合致する構造要素を取得する（ステップＳ１４０４）。 Next, the structure template determination unit 112 takes one structural element from the structural elements of the expanded tree structure and, referring to the structure template storage unit 132, acquires a stored structural element that matches the structural constraints of the tree structure (step S1404).

次に、構造テンプレート決定部１１２は、合致する構造要素が取得されたか否かを判断し（ステップＳ１４０５）、取得されなかった場合は（ステップＳ１４０５：ＮＯ）、新規に構造要素を作成する（ステップＳ１４０６）。 Next, the structure template determination unit 112 determines whether or not a matching structural element has been acquired (step S1405). If none has been acquired (step S1405: NO), it creates a new structural element (step S1406).

次に、構造テンプレート決定部１１２は、作成した構造要素のＴＩＤを、現在処理している木構造の構造要素に該当するＴＩＤとして設定する（ステップＳ１４０７）。 Next, the structure template determination unit 112 sets the TID of the created structural element as the TID corresponding to the structural element of the tree structure currently being processed (step S1407).

ステップＳ１４０５で、合致する構造要素が取得された場合は（ステップＳ１４０５：ＹＥＳ）、構造テンプレート決定部１１２は、取得した構造要素のＴＩＤを、現在処理している木構造の構造要素に該当するＴＩＤとして設定する（ステップＳ１４０８）。 If a matching structural element is acquired in step S1405 (step S1405: YES), the structural template determination unit 112 uses the TID of the acquired structural element as the TID corresponding to the structural element of the currently processed tree structure. (Step S1408).

次に、構造テンプレート決定部１１２は、すべての木構造の構造要素を処理したか否かを判断し（ステップＳ１４０９）、すべての構造要素を処理していない場合は（ステップＳ１４０９：ＮＯ）、次の構造要素について処理を繰り返す（ステップＳ１４０４）。 Next, the structure template determination unit 112 determines whether or not all structural elements of the tree structure have been processed (step S1409). If not all structural elements have been processed (step S1409: NO), the process is repeated for the next structural element (step S1404).

すべての構造要素を処理した場合は（ステップＳ１４０９：ＹＥＳ）、構造テンプレート決定処理を終了する。 If all the structural elements have been processed (step S1409: YES), the structural template determination process ends.
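The loop of steps S1404 to S1409 can be sketched as a lookup that creates a new entry on a miss. Identifying a structural element by its root-to-node path, and numbering new TIDs sequentially, are assumptions made for this sketch; `determine_tids` is a hypothetical name.

```python
def determine_tids(node_paths, templates):
    """Assign a TID to every node, registering unseen structures as they appear."""
    assigned = []
    for path in node_paths:            # S1404, repeated until S1409 says YES
        tid = templates.get(path)      # S1405: is there a matching structure?
        if tid is None:
            tid = len(templates) + 1   # S1406: create a new structural element
            templates[path] = tid
        assigned.append(tid)           # S1407 / S1408: set the TID for this node
    return assigned


# Hypothetical template store and document paths.
templates = {"/patent": 1, "/patent/title": 2}
paths = ["/patent", "/patent/title", "/patent/title/#text", "/patent/title"]
tids = determine_tids(paths, templates)
```

Nodes with the same structure share one TID, so the template store grows with the number of distinct structures and not with the number of documents.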

次に、このように構成された第１の実施の形態にかかる構造化文書検索装置１００による構造化文書検索処理について説明する。図１５は、第１の実施の形態における構造化文書検索処理の全体の流れを示すフローチャートである。 Next, a structured document search process performed by the structured document search apparatus 100 according to the first embodiment configured as described above will be described. FIG. 15 is a flowchart showing an overall flow of the structured document search process according to the first embodiment.

まず、通信部１０１が、クライアント３００から検索クエリを受信する（ステップＳ１５０１）。次に、クエリ解析部１２１が、受信した検索クエリを解析し、クエリグラフを作成する（ステップＳ１５０２）。 First, the communication unit 101 receives a search query from the client 300 (step S1501). Next, the query analysis unit 121 analyzes the received search query and creates a query graph (step S1502).

次に、制約付加部１２２が、クエリ解析部１２１が作成したクエリグラフに対して制約を付加する制約付加処理を実行する（ステップＳ１５０３）。制約付加処理の詳細については後述する。 Next, the constraint adding unit 122 executes a constraint adding process for adding a constraint to the query graph created by the query analyzing unit 121 (step S1503). Details of the constraint addition processing will be described later.

次に、クエリプランニング部１２３が、制約付加部１２２により制約が付加されたクエリグラフを参照し、処理コストが最小になる検索実行のプランを作成する（ステップＳ１５０４）。 Next, the query planning unit 123 refers to the query graph to which the constraint is added by the constraint adding unit 122, and creates a search execution plan that minimizes the processing cost (step S1504).

次に、索引検索部１２４ａが、検索処理のうち、語彙索引に関する処理である転置ファイルスキャン処理を実行する（ステップＳ１５０５）。転置ファイルスキャン処理では、付加された制約に従い、検索対象となる索引の候補を絞り込む処理が実行される。転置ファイルスキャン処理の詳細については後述する。 Next, the index search unit 124a executes a transposed file scan process that is a process related to the vocabulary index in the search process (step S1505). In the transposed file scan processing, processing for narrowing down index candidates to be searched is executed according to the added constraint. Details of the transposed file scanning process will be described later.

次に、検索部１２４ｂが、検索処理のうち、転置ファイルスキャン処理以外の残りのプランに該当する処理を実行し、検索条件を満たす構造化文書を検索する（ステップＳ１５０６）。 Next, the search unit 124b executes a process corresponding to the remaining plan other than the transposed file scan process in the search process, and searches for a structured document that satisfies the search condition (step S1506).

次に、通信部１０１が、検索結果である構造化文書をクライアント３００に送信し（ステップＳ１５０７）、構造化文書検索処理を終了する。 Next, the communication unit 101 transmits a structured document as a search result to the client 300 (step S1507), and the structured document search process is terminated.

次に、ステップＳ１５０３の制約付加処理の詳細について説明する。図１６は、第１の実施の形態における制約付加処理の全体の流れを示すフローチャートである。 Next, details of the constraint addition processing in step S1503 will be described. FIG. 16 is a flowchart showing an overall flow of the constraint addition processing in the first embodiment.

まず、制約付加部１２２は、クエリグラフの各ノードの構造制約を満たすＴＩＤの候補の集合（以下、ＴＩＤ候補集合という。）を取得する（ステップＳ１６０１）。例えば、図１０に示すようなクエリグラフのノード４に対しては、タイトルタグの下のテキスト要素であることから、図５に示すような構造情報のＴＩＤ７が候補として取得できる。 First, the constraint adding unit 122 acquires a TID candidate set (hereinafter referred to as a TID candidate set) that satisfies the structure constraints of each node of the query graph (step S1601). For example, since the node 4 of the query graph as shown in FIG. 10 is a text element under the title tag, the structure information TID 7 as shown in FIG. 5 can be acquired as a candidate.

なお、ステップＳ１６０１の処理により、構造制約を満たすＴＩＤの候補が取得できる。以下のステップＳ１６０２からステップＳ１６１０では、ＴＩＤの候補から語彙制約を満たさない候補を削除する処理を行う。 Note that TID candidates satisfying the structural constraints can be acquired by the processing in step S1601. In steps S1602 to S1610 below, processing is performed to delete candidates that do not satisfy the lexical constraints from the TID candidates.

次に、制約付加部１２２は、未処理のノードに対するＴＩＤ候補集合Ｐｉを取得する（ステップＳ１６０２）。この際、特徴情報取得部１２２ａが、各ＴＩＤの候補に対応づけられた最小語彙ＩＤおよび最大語彙ＩＤを構造テンプレート記憶部１３２から取得しておく。 Next, the constraint adding unit 122 acquires a TID candidate set Pi for an unprocessed node (step S1602). At this time, the feature information acquisition unit 122a acquires the minimum vocabulary ID and the maximum vocabulary ID associated with each TID candidate from the structure template storage unit 132 in advance.

次に、特徴情報算出部１２２ｂは、Ｐｉに対応する検索キーの語彙を分割する（ステップＳ１６０３）。例えば、図１０のクエリグラフのノード４に対しては、「ftcontains “構造化文書” && “XML”」という検索キーが対応づけられている。このため、制約付加部１２２は、“構造化文書”および“XML”をそれぞれ語彙分割する。語彙分割は、上述のようなＮグラム分割による方法で行う。 Next, the feature information calculation unit 122b divides the search key corresponding to Pi into vocabularies (step S1603). For example, the search key “ftcontains “構造化文書” && “XML”” is associated with node 4 of the query graph of FIG. 10. The constraint adding unit 122 therefore divides “構造化文書” and “XML” into vocabularies separately. The vocabulary division is performed by the N-gram division method described above.

次に、制約付加部１２２は、Ｐｉに対応する論理条件を取得する（ステップＳ１６０４）。例えば、図１０のクエリグラフのノード４の場合は、ＡＮＤ条件（“＆＆”）が取得される。 Next, the constraint adding unit 122 acquires a logical condition corresponding to Pi (step S1604). For example, in the case of node 4 in the query graph of FIG. 10, an AND condition (“&&”) is acquired.

次に、制約付加部１２２は、取得した論理条件がＡＮＤ条件のみか否かを判断する（ステップＳ１６０５）。ＡＮＤ条件のみの場合は（ステップＳ１６０５：ＹＥＳ）、Ｐｉの各ＴＩＤ候補のうち、検索キーから分割した語彙の語彙ＩＤのいずれかが、ＴＩＤに対応する最小語彙ＩＤと最大語彙ＩＤとの間の範囲外となるＴＩＤ候補を取得する（ステップＳ１６０６）。 Next, the constraint adding unit 122 determines whether or not the acquired logical condition consists only of AND conditions (step S1605). If it consists only of AND conditions (step S1605: YES), the constraint adding unit 122 acquires, from the TID candidates in Pi, every candidate for which at least one of the vocabulary IDs of the vocabularies divided from the search key falls outside the range between the minimum vocabulary ID and the maximum vocabulary ID corresponding to that TID (step S1606).

例えば、検索キーが“構造化文書”の場合、図１１に示すように検索キーから分割した語彙の語彙ＩＤは、２２から２３５の間の値を取る。このため、例えば、図６に示すような構造情報が設定されていた場合を前提とすると、語彙ＩＤの値は、ＴＩＤ７の最小語彙ＩＤ（＝２１）と最大語彙ＩＤ（＝１０４５）との範囲内に収まる。したがって、ＴＩＤ７は本ステップでは取得されない。一方、ＴＩＤ６については、最小語彙ＩＤ（＝２５）と最大語彙ＩＤ（＝１０３）との範囲外となる語彙ＩＤが存在する（語彙ＩＤ＝２２，１７８，２３５）。このため、本ステップではＴＩＤ６が該当するＴＩＤ候補として取得される。 For example, when the search key is “構造化文書”, the vocabulary IDs of the vocabularies divided from the search key take values between 22 and 235, as shown in FIG. 11. Assuming that the structure information shown in FIG. 6 is set, these vocabulary IDs all fall within the range between the minimum vocabulary ID (= 21) and the maximum vocabulary ID (= 1045) of TID 7. Therefore, TID 7 is not acquired in this step. For TID 6, on the other hand, some vocabulary IDs (22, 178, and 235) fall outside the range between the minimum vocabulary ID (= 25) and the maximum vocabulary ID (= 103). TID 6 is therefore acquired in this step as a candidate to be deleted.

ステップＳ１６０５で、ＡＮＤ条件のみでないと判断した場合は（ステップＳ１６０５：ＮＯ）、制約付加部１２２は、取得した論理条件がＯＲ条件のみか否かを判断する（ステップＳ１６０７）。ＯＲ条件のみの場合は（ステップＳ１６０７：ＹＥＳ）、Ｐｉの各ＴＩＤ候補のうち、検索キーから分割した語彙の語彙ＩＤのすべてが、ＴＩＤに対応する最小語彙ＩＤと最大語彙ＩＤとの間の範囲外となるＴＩＤ候補を取得する（ステップＳ１６０８）。 If it is determined in step S1605 that the logical condition does not consist only of AND conditions (step S1605: NO), the constraint adding unit 122 determines whether or not the acquired logical condition consists only of OR conditions (step S1607). If it consists only of OR conditions (step S1607: YES), the constraint adding unit 122 acquires, from the TID candidates in Pi, every candidate for which all of the vocabulary IDs of the vocabularies divided from the search key fall outside the range between the minimum vocabulary ID and the maximum vocabulary ID corresponding to that TID (step S1608).

次に、制約付加部１２２は、ステップＳ１６０６またはステップＳ１６０８で、該当するＴＩＤ候補が取得されたか否かを判断する（ステップＳ１６０９）。該当するＴＩＤ候補が取得された場合は（ステップＳ１６０９：ＹＥＳ）、ＴＩＤ候補削除部１２２ｃは、Ｐｉから該当するＴＩＤ候補を削除する（ステップＳ１６１０）。 Next, the constraint adding unit 122 determines whether or not a corresponding TID candidate has been acquired in step S1606 or step S1608 (step S1609). When the corresponding TID candidate is acquired (step S1609: YES), the TID candidate deletion unit 122c deletes the corresponding TID candidate from Pi (step S1610).

このようにして、論理条件がＡＮＤ条件かＯＲ条件かに応じて、ＴＩＤの候補を適切に絞り込むことが可能となる。 In this way, TID candidates can be appropriately narrowed down according to whether the logical condition is an AND condition or an OR condition.
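The AND/OR pruning of steps S1605 to S1610 can be sketched as follows. The ranges for TID 7 (21 to 1045) and TID 6 (25 to 103) and the out-of-range IDs 22, 178, and 235 come from the example above; the fourth vocabulary ID (60) is a hypothetical in-range value, and `prune_tid_candidates` is an illustrative name.

```python
def prune_tid_candidates(candidates, key_vocab_ids, ranges, condition):
    """Drop TID candidates that cannot contain the search key's vocabulary."""
    kept = []
    for tid in candidates:
        low, high = ranges[tid]
        outside = [not (low <= vid <= high) for vid in key_vocab_ids]
        if condition == "AND":
            drop = any(outside)                    # S1606: one miss is enough
        elif condition == "OR":
            drop = bool(outside) and all(outside)  # S1608: every ID must miss
        else:
            drop = False                           # mixed conditions: no pruning
        if not drop:
            kept.append(tid)
    return kept


ranges = {7: (21, 1045), 6: (25, 103)}
key_ids = [22, 60, 178, 235]  # 60 is a hypothetical in-range ID
and_result = prune_tid_candidates([7, 6], key_ids, ranges, "AND")
or_result = prune_tid_candidates([7, 6], key_ids, ranges, "OR")
```

Under the AND condition TID 6 is removed because some IDs fall outside its range; under the OR condition it survives because at least one ID still falls inside.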

ステップＳ１６０７で、ＯＲ条件のみでないと判断した場合は（ステップＳ１６０７：ＮＯ）、ＴＩＤ候補の削除処理を行わずにステップＳ１６１１以降の位置条件の判定処理を実行する。 If it is determined in step S1607 that the logical condition is not OR-only (step S1607: NO), the position condition determination process from step S1611 onward is executed without deleting any TID candidates.

ステップＳ１６１０でＴＩＤ候補を削除した後、または、ステップＳ１６０７でＯＲ条件のみでないと判断した場合は、制約付加部１２２は、Ｐｉに対応する検索キーから、近傍検索などの位置に関する条件指定（位置条件）を取得する（ステップＳ１６１１）。 After the TID candidates are deleted in step S1610, or when it is determined in step S1607 that the logical condition is not OR-only, the constraint adding unit 122 acquires, from the search key corresponding to Pi, a condition concerning position (position condition), such as a proximity search (step S1611).

次に、制約付加部１２２は、各ＴＩＤ候補が、取得した位置条件を満たすか否かを判断し（ステップＳ１６１２）、満たさない場合は（ステップＳ１６１２：ＮＯ）、該当するＴＩＤ候補をＰｉから削除する（ステップＳ１６１３）。 Next, the constraint adding unit 122 determines whether each TID candidate satisfies the acquired position condition (step S1612) and, if a candidate does not (step S1612: NO), deletes that TID candidate from Pi (step S1613).

例えば、「"XML"&&"データベース" distance at most 5」という位置条件が指定されていた場合、少なくとも１４（＝条件中の文字列の長さ＋間隔の値＝９＋５）文字が、テキスト要素内に存在しなければならない。これに対し、例えば、当該テキスト要素に対応する構造要素のＴＩＤに対応づけられた最大テキストサイズが１４未満であれば、位置条件を満たさないとしてＰｉから削除することができる。 For example, if the position condition 「"XML" && "データベース" distance at most 5」 is specified, at least 14 characters (= length of the character strings in the condition + distance value = 9 + 5) must be present in the text element. Accordingly, if the maximum text size associated with the TID of the structural element corresponding to that text element is less than 14, the TID cannot satisfy the position condition and can be deleted from Pi.

なお、最大テキストサイズは、構造テンプレート記憶部１３２から取得することができる。このようにして、語彙の位置条件に応じてＴＩＤの候補を絞り込むことが可能となる。 The maximum text size can be acquired from the structure template storage unit 132. In this way, TID candidates can be narrowed down according to the vocabulary position conditions.
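
The size check behind steps S1612 and S1613 can be sketched as follows. The lower bound follows the formula used in the description (total length of the search terms plus the distance value). The function names and the per-TID maximum text sizes are hypothetical and are not taken from the figures.

```python
def min_text_length(terms, max_distance):
    """Lower bound used in the description's example:
    total length of the search terms plus the distance value."""
    return sum(len(t) for t in terms) + max_distance


def prune_by_text_size(max_text_size, terms, max_distance):
    """Keep only TIDs whose largest registered text reaches the lower bound."""
    needed = min_text_length(terms, max_distance)
    return {tid for tid, size in max_text_size.items() if size >= needed}


needed = min_text_length(["XML", "データベース"], 5)
# Hypothetical maximum text sizes per TID (illustrative values only).
survivors = prune_by_text_size({7: 40, 15: 12}, ["XML", "データベース"], 5)
```

With these illustrative sizes, the bound is 14 characters, so the TID whose maximum text size is 12 is removed before any index entry for it is read.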

次に、制約付加部１２２は、すべてのノードを処理したか否かを判断し（ステップＳ１６１４）、すべてのノードを処理していない場合は（ステップＳ１６１４：ＮＯ）、次のノードに対して処理を繰り返す（ステップＳ１６０２）。 Next, the constraint adding unit 122 determines whether all nodes have been processed (step S1614). If not (step S1614: NO), the process is repeated for the next node (step S1602).

すべてのノードを処理した場合は（ステップＳ１６１４：ＹＥＳ）、制約付加部１２２は、すべてのＴＩＤ候補集合内のＴＩＤに対して、同一文書内に存在しなければならないＴＩＤ間の関係を取得し、同一文書制約として付加する（ステップＳ１６１５）。 When all nodes have been processed (step S1614: YES), the constraint adding unit 122 acquires, for the TIDs in all TID candidate sets, the relationships between TIDs that must exist in the same document, and adds them as same-document constraints (step S1615).

例えば、図１０のようなクエリグラフの場合、ノード４およびノード６に対応するＴＩＤ＝７およびＴＩＤ＝５は、それぞれ「特許」文書（ＴＩＤ＝４）内に存在しなければならないため、同一文書制約が付加される。なお、以下では、同一文書制約を「ＴＩＤ７＜−＞ＴＩＤ５」のような形式で記述する。 For example, in the case of the query graph shown in FIG. 10, TID = 7 and TID = 5, which correspond to node 4 and node 6, must both exist within a “patent” document (TID = 4), so a same-document constraint is added. In the following, a same-document constraint is written in the form “TID7 <-> TID5”.

このように、制約付加処理では、構造制約だけでなく、構造化文書の登録時に格納されたテキスト要素の語彙の特徴を表す特徴情報で表される語彙制約により、中間候補であるＴＩＤの候補数を絞り込むことができる。このため、構造化文書に対する高速な全文検索機能を実現することができる。 As described above, the constraint addition process narrows down the number of TID candidates, which are intermediate candidates, not only by structure constraints but also by vocabulary constraints expressed by feature information that represents the vocabulary features of text elements and is stored when a structured document is registered. A high-speed full-text search function for structured documents can therefore be realized.

次に、ステップＳ１５０５の転置ファイルスキャン処理の詳細について説明する。図１７は、第１の実施の形態における転置ファイルスキャン処理の全体の流れを示すフローチャートである。 Next, details of the transposed file scanning process in step S1505 will be described. FIG. 17 is a flowchart illustrating an overall flow of the transposed file scan process according to the first embodiment.

まず、索引検索部１２４ａは、制約付加部１２２により付加された構造制約Ｔと、検索キーから分割した語彙（以下、検索語彙という。）の語彙ＩＤの集合Ｇとを取得する（ステップＳ１７０１）。 First, the index search unit 124a acquires the structural constraint T added by the constraint adding unit 122 and the set G of vocabulary IDs of the vocabulary (hereinafter referred to as search vocabulary) divided from the search key (step S1701).

次に、索引検索部１２４ａは、クエリプランニング部１２３が作成したプランに従い、クエリを実行するノードを取得する（ステップＳ１７０２）。例えば、最初の処理では、頻度が最も小さい検索語彙に対応するノードを取得する。 Next, the index search unit 124a acquires a node that executes the query according to the plan created by the query planning unit 123 (step S1702). For example, in the first process, the node corresponding to the search vocabulary with the lowest frequency is acquired.

次に、索引検索部１２４ａは、検索語彙に対応する転置ファイルを、転置ファイル記憶部１３３ｂから取得する（ステップＳ１７０３）。具体的には、検索語彙の語彙ＩＤに対応する転置ファイル番号を統計情報記憶部１３３ａから取得し、取得した転置ファイル番号で識別される転置ファイルを、転置ファイル記憶部１３３ｂから取得する。 Next, the index search unit 124a acquires the transposed file corresponding to the search vocabulary from the transposed file storage unit 133b (step S1703). Specifically, the transposed file number corresponding to the vocabulary ID of the search vocabulary is acquired from the statistical information storage unit 133a, and the transposed file identified by the acquired transposed file number is acquired from the transposed file storage unit 133b.

次に、索引検索部１２４ａは、転置ファイルから語彙索引情報を取り出す（ステップＳ１７０４）。次に、索引検索部１２４ａは、語彙索引情報のＴＩＤが、構造制約Ｔに含まれるか否かを判断する（ステップＳ１７０５）。 Next, the index search unit 124a extracts lexical index information from the transposed file (step S1704). Next, the index search unit 124a determines whether or not the TID of the lexical index information is included in the structure constraint T (step S1705).

構造制約Ｔに含まれる場合は（ステップＳ１７０５：ＹＥＳ）、索引検索部１２４ａは、現在の語彙索引情報を索引の候補集合に追加する（ステップＳ１７０６）。 When included in the structure constraint T (step S1705: YES), the index search unit 124a adds the current lexical index information to the index candidate set (step S1706).

ステップＳ１７０６で候補集合に追加した後、または、ステップＳ１７０５で、語彙索引情報のＴＩＤが構造制約Ｔに含まれないと判断した場合は（ステップＳ１７０５：ＮＯ）、索引検索部１２４ａは、すべての語彙索引情報を処理したか否かを判断する（ステップＳ１７０７）。 After the addition to the candidate set in step S1706, or when it is determined in step S1705 that the TID of the lexical index information is not included in the structure constraint T (step S1705: NO), the index search unit 124a determines whether all lexical index information has been processed (step S1707).

すべての語彙索引情報を処理していない場合は（ステップＳ１７０７：ＮＯ）、転置ファイルから次の語彙索引情報を取得して処理を繰り返す（ステップＳ１７０４）。 If all lexical index information has not been processed (step S1707: NO), the next lexical index information is acquired from the transposed file and the process is repeated (step S1704).

すべての語彙索引情報を処理した場合は（ステップＳ１７０７：ＹＥＳ）、索引検索部１２４ａは、すべての検索語彙を処理したか否かを判断する（ステップＳ１７０８）。すべての検索語彙を処理していない場合は（ステップＳ１７０８：ＮＯ）、次の検索語彙に対して処理を繰り返す（ステップＳ１７０３）。 If all vocabulary index information has been processed (step S1707: YES), the index search unit 124a determines whether all search vocabularies have been processed (step S1708). If not all the search vocabularies have been processed (step S1708: NO), the process is repeated for the next search vocabulary (step S1703).

すべての検索語彙を処理した場合は（ステップＳ１７０８：ＹＥＳ）、索引検索部１２４ａは、すべてのノードを処理したか否かを判断する（ステップＳ１７０９）。すべてのノードを処理していない場合は（ステップＳ１７０９：ＮＯ）、次のノードを取得して処理を繰り返す（ステップＳ１７０２）。 If all search vocabularies have been processed (step S1708: YES), the index search unit 124a determines whether all nodes have been processed (step S1709). If all the nodes have not been processed (step S1709: NO), the next node is acquired and the process is repeated (step S1702).

すべてのノードを処理した場合は（ステップＳ１７０９：ＹＥＳ）、転置ファイルスキャン処理を終了する。 If all nodes have been processed (step S1709: YES), the transposed file scan process is terminated.

なお、上述のような処理で得られた索引の候補集合に含まれるＴＩＤから、重複を排除してＴＩＤ集合が作成され、作成されたＴＩＤ集合は次のオペレータ処理を実行する際の新たな制約条件として利用される。 From the TIDs included in the index candidate set obtained by the above processing, a TID set is created by eliminating duplicates, and the created TID set is used as a new constraint condition when the next operator process is executed.
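
The inner loops of the transposed file scan (steps S1703 to S1708) and the final de-duplication of TIDs can be sketched as follows. This is an illustrative sketch only: the `Posting` record, its field names, and the in-memory dictionary standing in for the transposed file storage unit 133b are assumptions, and the outer loop over query-graph nodes (steps S1702 and S1709) is omitted for brevity.

```python
from collections import namedtuple

# One piece of lexical index information; field names are illustrative.
Posting = namedtuple("Posting", ["tid", "doc_id", "elem_id", "position"])


def scan_transposed_files(files_by_vocab, search_vocab_ids, allowed_tids):
    """Collect postings whose TID satisfies the structure constraint T."""
    candidates = []
    for vocab_id in search_vocab_ids:                      # S1703 / S1708
        for posting in files_by_vocab.get(vocab_id, []):   # S1704 / S1707
            if posting.tid in allowed_tids:                # S1705
                candidates.append(posting)                 # S1706
    # De-duplicated TIDs become the constraint for the next operator.
    next_tids = {p.tid for p in candidates}
    return candidates, next_tids


# Hypothetical data: vocabulary ID 43 has three postings.
files = {43: [Posting(7, 1, 3, 0), Posting(6, 2, 3, 4), Posting(15, 9, 2, 1)]}
found, next_tids = scan_transposed_files(files, [43], {7, 15})
```

Here the posting with TID 6 is skipped because TID6 is not in the structure constraint, and the resulting TID set {7, 15} is what would be handed to the next operator.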

次に、第１の実施の形態にかかる構造化文書検索装置１００による構造化文書検索処理の具体例について説明する。図１８は、構造テンプレート記憶部１３２に記憶されている構造情報の一例を示す説明図である。また、図１９は、入力された検索クエリの一例を示す説明図である。 Next, a specific example of structured document search processing by the structured document search apparatus 100 according to the first embodiment will be described. FIG. 18 is an explanatory diagram illustrating an example of the structure information stored in the structure template storage unit 132. FIG. 19 is an explanatory diagram showing an example of the input search query.

以下では、図６および図１８に示すような構造情報が構造テンプレート記憶部１３２に記憶されており、図１９に示すような検索クエリが入力された場合の構造化文書検索処理について説明する。 Hereinafter, the structured document search process when the structure information as shown in FIGS. 6 and 18 is stored in the structure template storage unit 132 and the search query as shown in FIG. 19 is input will be described.

図１９に示す検索クエリは、ＸＱＦＴの言語規格に従い記述されたクエリである。また、本クエリは、“「タイトル」要素の候補として、「構造化文書」かつ「XML」を含み、かつ両者が２文字以内の距離に存在するという制約を満たし、さらに「発明者」要素に「田中」を含む特許文書を検索せよ”、という検索条件を意味する。 The search query shown in FIG. 19 is written in accordance with the XQFT language specification. It expresses the search condition “search for patent documents whose ‘title’ element contains both ‘構造化文書’ (structured document) and ‘XML’ within two characters of each other, and whose ‘inventor’ element contains ‘田中’ (Tanaka)”.

図２０は、検索クエリを解析した結果であるクエリグラフの一例を示す説明図である。同図は、図１９の検索クエリを解析した結果のクエリグラフを表す。構造制約としては、以下に示すような４つの制約（Ａ１〜Ａ４）が付加される。 FIG. 20 is an explanatory diagram illustrating an example of a query graph obtained by analyzing a search query. The figure shows the query graph obtained by analyzing the search query of FIG. 19. The following four constraints (A1 to A4) are added as structure constraints.
Ａ１：ノード６（ＴＩＤ１１、ＴＩＤ１９）→田中 A1: Node 6 (TID11, TID19) → 田中 (Tanaka)
Ａ２：ノード４（ＴＩＤ６、ＴＩＤ７、ＴＩＤ１５）→ＸＭＬ A2: Node 4 (TID6, TID7, TID15) → XML
Ａ３：ノード４（ＴＩＤ６、ＴＩＤ７、ＴＩＤ１５）→構造化文書 A3: Node 4 (TID6, TID7, TID15) → 構造化文書 (structured document)
Ａ４：ノード４とノード６は同一文書制約を満たす A4: Node 4 and node 6 satisfy the same-document constraint

なお、例えば、“ノード６（ＴＩＤ１１、ＴＩＤ１９）→田中”とは、ノード６は構造要素の候補がＴＩＤ１１またはＴＩＤ１９でなければならないこと、および、対応するテキストの検索キーとして「田中」が設定されていることを表している。 For example, “Node 6 (TID11, TID19) → Tanaka” means that node 6 must have TID11 or TID19 as a structural element candidate, and “Tanaka” is set as a search key for the corresponding text. It represents that.

また、Ａ４は同一文書制約の構造制約を表し、例えば、ノード４がＴＩＤ６のときは、ノード６にはＴＩＤ１１が存在しなければならない制約である。したがって、Ａ４の同一文書制約は、「｛ＴＩＤ６、ＴＩＤ７｝＜−＞｛ＴＩＤ１１｝」および「｛ＴＩＤ１５｝＜−＞｛ＴＩＤ１９｝」と表すこともできる。 A4 represents the structure constraint of the same document constraint. For example, when the node 4 is TID6, the node 6 must have TID11. Therefore, the same document restriction of A4 can also be expressed as “{TID6, TID7} <−> {TID11}” and “{TID15} <−> {TID19}”.
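
One way to hold the same-document constraint A4 and apply it once one side has been narrowed down is sketched below. The representation as pairs of TID groups mirrors the notation “{TID6, TID7} <-> {TID11}” and “{TID15} <-> {TID19}”. The names `SAME_DOC` and `restrict_partner` are hypothetical and are not part of the described apparatus.

```python
# Same-document constraint A4 written as pairs of TID groups:
# {TID6, TID7} <-> {TID11} and {TID15} <-> {TID19}.
SAME_DOC = [({6, 7}, {11}), ({15}, {19})]


def restrict_partner(remaining_node4, node6_candidates, constraint=SAME_DOC):
    """Keep node-6 TIDs that can share a document with a remaining node-4 TID."""
    allowed = set()
    for node4_group, node6_group in constraint:
        if node4_group & remaining_node4:
            allowed |= node6_group
    return node6_candidates & allowed


only_19 = restrict_partner({15}, {11, 19})
both = restrict_partner({7, 15}, {11, 19})
```

If only TID15 remains for node 4, node 6 is restricted to TID19. While TID7 and TID15 both remain, neither node-6 candidate can be discarded.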

また、語彙制約としては以下に示すような２つの制約（Ｂ１、Ｂ２）が付加される。 In addition, the following two constraints (B1, B2) are added as vocabulary constraints.
Ｂ１：ノード４には、「構造」、「造化」、「化文」、「文書」、「書」、「ＸＭＬ」、「ＭＬ」、「Ｌ」に対応する語彙ＩＤをすべて含む B1: Node 4 contains all of the vocabulary IDs corresponding to “構造”, “造化”, “化文”, “文書”, “書”, “XML”, “ML”, and “L”
Ｂ２：ノード６には、「田中」、「中」に対応する語彙ＩＤを含む B2: Node 6 contains the vocabulary IDs corresponding to “田中” and “中”

なお、ftcontainsなど部分一致検索の場合は、Ｎグラム以下の文字列長の語彙に関しては条件に加えないが、ここでは説明の都合上付加している。 In the case of partial match search such as ftcontains, a vocabulary having a character string length of N grams or less is not included in the conditions, but is added here for convenience of explanation.
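
The vocabulary lists above are consistent with taking an N-gram at every character position of the search key and letting the grams at the end of the key run short. The sketch below reproduces the lists on that assumption. The function name `split_search_key` and the choice of N per key (2 for the kanji keys, 3 for "XML") are guesses made only to match the lists; the actual splitting rule is the one defined with FIG. 11.

```python
def split_search_key(key, n):
    """Return the n-gram starting at every character position.

    Grams near the end of the key are shorter than n, which is how the
    single-character entries in the vocabulary lists arise.
    """
    return [key[i:i + n] for i in range(len(key))]


title_grams = split_search_key("構造化文書", 2)
xml_grams = split_search_key("XML", 3)
name_grams = split_search_key("田中", 2)
```

Each resulting gram would then be looked up to obtain its vocabulary ID, giving the ID sets that are compared against the minimum and maximum vocabulary IDs of each TID.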

例えば、検索キー「構造化文書」に対する検索語彙は図１１に示すように分割されるため、語彙ＩＤの分布は｛２２，３２，５５，１７８，２３５｝となる。 For example, since the search vocabulary for the search key “structured document” is divided as shown in FIG. 11, the distribution of the vocabulary ID is {22, 32, 55, 178, 235}.

ＴＩＤ６の最小語彙ＩＤは図６より２５であり、２５より小さい語彙ＩＤ（２２）が存在するためＴＩＤ６は候補から削除される（ステップＳ１６０９：ＹＥＳ、ステップＳ１６１０）。ＴＩＤ７およびＴＩＤ１５は制約を満たすため（ステップＳ１６０９：ＮＯ）、候補から削除されない。 The minimum vocabulary ID of TID6 is 25 from FIG. 6, and since there is a vocabulary ID (22) smaller than 25, TID6 is deleted from the candidates (step S1609: YES, step S1610). Since TID7 and TID15 satisfy the restriction (step S1609: NO), they are not deleted from the candidates.

同様に、検索キー「ＸＭＬ」に対する検索語彙の語彙ＩＤの分布は｛４３，５８，１２３｝であり、ＴＩＤ７、ＴＩＤ１５は制約を満たすため候補から削除されない。 Similarly, the vocabulary ID distribution of the search vocabulary for the search key “XML” is {43, 58, 123}, and TID7 and TID15 are not deleted from the candidates because they satisfy the constraints.

ノード６に対しても同様に検索語彙である「田中」および「中」の語彙ＩＤと、ＴＩＤ１１およびＴＩＤ１９に対応する最小語彙ＩＤおよび最大語彙ＩＤとが比較される。ここでは、両者ともに候補として残されたことを前提とする。 Similarly for node 6, the vocabulary IDs of “Tanaka” and “Middle”, which are the search vocabulary, are compared with the minimum vocabulary ID and the maximum vocabulary ID corresponding to TID11 and TID19. Here, it is assumed that both are left as candidates.

また、位置条件による制約として、以下のような制約（Ｃ１）が付加される。 In addition, the following constraint (C1) is added as a constraint based on the position condition.
Ｃ１：ノード４は、「構造化文書」と「ＸＭＬ」とが２文字以内の距離に存在する C1: In node 4, “構造化文書” (structured document) and “XML” occur within two characters of each other

この場合、ノード４に対応するテキストのテキストサイズとして少なくとも１０（＝「構造化文書」のサイズ＋「ＸＭＬ」のサイズ＋２＝５＋３＋２）文字が必要となるため、例えば、対応する最大テキストサイズが１０未満であるＴＩＤ候補が存在すれば、この段階で除外することができる（ステップＳ１６１３）。 In this case, the text corresponding to node 4 must be at least 10 characters long (= size of “構造化文書” + size of “XML” + 2 = 5 + 3 + 2). Therefore, any TID candidate whose corresponding maximum text size is less than 10 can be excluded at this stage (step S1613).

図２１は、このような処理により候補が絞り込まれた状態における、クエリグラフの一例を示す説明図である。図２０と比較して、図２１では、ノード４に対するＴＩＤ候補が、ＴＩＤ７、ＴＩＤ１５に絞り込まれていることが示されている。 FIG. 21 is an explanatory diagram illustrating an example of a query graph in a state where candidates are narrowed down by such processing. Compared with FIG. 20, FIG. 21 shows that TID candidates for the node 4 are narrowed down to TID7 and TID15.

このようにして制約付加部１２２で制約を求めた後、クエリプランニング部１２３によるクエリプランニング処理が実行される。クエリプランニング部１２３は、頻度が低い検索語彙を優先的に処理するようなプランを作成する。 In this way, after the constraint is added by the constraint adding unit 122, the query planning process by the query planning unit 123 is executed. The query planning unit 123 creates a plan that preferentially processes a search vocabulary with a low frequency.

図２２は、検索語彙ごと頻度を示した説明図である。同図に示すように、語彙「田中」に対する頻度が最も低い（２０）ので、当該語彙に対応するノードから転置ファイルスキャン処理を実行する（ステップＳ１７０２）。 FIG. 22 is an explanatory diagram showing the frequency for each search vocabulary. As shown in the figure, since the frequency for the vocabulary “Tanaka” is the lowest (20), the transposed file scan process is executed from the node corresponding to the vocabulary (step S1702).
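
The planning rule (scan the rarest search vocabulary first) amounts to ordering by frequency, as in the sketch below. Only the frequency 20 for “田中” comes from the description. The other two frequencies and the function name `plan_by_frequency` are made up for illustration.

```python
def plan_by_frequency(vocab_frequency):
    """Order search vocabularies so that the rarest one is scanned first."""
    return sorted(vocab_frequency, key=vocab_frequency.get)


# Only the value 20 for "田中" comes from the description;
# the other two frequencies are hypothetical.
frequencies = {"構造化文書": 350, "XML": 1200, "田中": 20}
plan = plan_by_frequency(frequencies)
```

Starting with the rarest vocabulary keeps the first candidate set small, so later scans are filtered against a tighter TID set.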

このように、第１の実施の形態にかかる構造化文書検索装置では、構造化文書の構造要素ごとに対応する要素の特徴を表す特徴情報を格納し、当該特徴情報を参照して転置ファイルスキャン時に不要な語彙索引情報の読み出し数を早い段階で削減することが可能となる。その結果、語彙の出現位置レベルまで考慮した検索条件による検索時に必要となる候補の結合処理コストを削減することを可能とする。このため、ＸＱＦＴのような位置情報を指定と構造制約と組み合わせたクエリによる構造化文書の全文検索機能を高速に実行することができる。 As described above, the structured document search apparatus according to the first embodiment stores, for each structural element of a structured document, feature information representing the features of the corresponding element, and refers to that feature information to reduce, at an early stage, the number of unnecessary lexical index information entries read during the transposed file scan. As a result, the cost of joining candidates, which is required for searches whose conditions extend to the level of vocabulary occurrence positions, can be reduced. A full-text search of structured documents by a query that combines position specifications with structure constraints, as in XQFT, can therefore be executed at high speed.

（第２の実施の形態） (Second Embodiment)
第２の実施の形態にかかる構造化文書検索装置は、語彙索引情報内に、対応するテキスト要素に含まれる語彙の特徴を表す特徴値を格納し、転置ファイルスキャン処理時に参照して候補数を絞り込むことにより、さらに高速な検索を可能とするものである。 The structured document search apparatus according to the second embodiment stores, in the lexical index information, a feature value representing the features of the vocabulary contained in the corresponding text element, and refers to it during the transposed file scan process to narrow down the number of candidates, thereby enabling an even faster search.

第２の実施の形態では、転置ファイル記憶部１３３ｂに、テキスト要素に含まれる語彙から算出された特徴値を対応づけた語彙索引情報を含む転置ファイルを格納する点が、第１の実施の形態と異なっている。特徴値は、構造テンプレート決定部１１２により、構造テンプレート決定処理内で算出される。 In the second embodiment, a transposed file including lexical index information in which feature values calculated from vocabulary included in a text element are associated is stored in the transposed file storage unit 133b. Is different. The feature value is calculated by the structure template determination unit 112 in the structure template determination process.

また、第２の実施の形態では、転置ファイルスキャン処理時に、索引検索部１２４ａが、特徴値を参照して索引の候補を絞り込む点が、第１の実施の形態と異なっている。 Further, the second embodiment is different from the first embodiment in that the index search unit 124a refers to the feature value to narrow down index candidates during the transposed file scan process.

図２３は、第２の実施の形態における転置ファイル記憶部１３３ｂに格納された転置ファイルのデータ構造の一例を示す説明図である。 FIG. 23 is an explanatory diagram illustrating an example of a data structure of an inverted file stored in the inverted file storage unit 133b according to the second embodiment.

同図に示すように、転置ファイル記憶部１３３ｂの転置ファイルは、ＴＩＤと、特徴値と、文書ＩＤと、要素ＩＤと、発生位置とを対応づけた語彙索引情報を格納している。 As shown in the figure, the transposed file of the transposed file storage unit 133b stores lexical index information in which TIDs, feature values, document IDs, element IDs, and occurrence positions are associated with each other.

特徴値とは、文書ＩＤと要素ＩＤとで識別される構造化文書の要素の特徴を表す情報であり、構造テンプレート決定部１１２により算出される。 The feature value is information representing the feature of the element of the structured document identified by the document ID and the element ID, and is calculated by the structure template determination unit 112.

第２の実施の形態における構造テンプレート決定部１１２は、構造化文書のテキスト要素に対する特徴値を算出する処理を行う。特徴値は、テキストに含まれる語彙の語彙ＩＤの最小値および最大値である最小語彙ＩＤおよび最大語彙ＩＤから算出する。具体的には、最小語彙ＩＤと最大語彙ＩＤとを変換係数α（例えば、１０）で除算した値を特徴値として算出する。 The structure template determination unit 112 in the second embodiment performs a process of calculating feature values for text elements of the structured document. The feature value is calculated from the minimum vocabulary ID and the maximum vocabulary ID which are the minimum and maximum vocabulary IDs of the vocabulary included in the text. Specifically, a value obtained by dividing the minimum vocabulary ID and the maximum vocabulary ID by a conversion coefficient α (for example, 10) is calculated as a feature value.

例えば、テキストに含まれる語彙の最小語彙ＩＤが２２、最大語彙ＩＤが３０１であった場合、それらをそれぞれα＝１０で除算した結果である（２、３０）が、当該テキストの特徴値として算出される。 For example, when the minimum vocabulary ID of the vocabulary included in the text is 22 and the maximum vocabulary ID is 301, the result of dividing each by α = 10 (2, 30) is calculated as the feature value of the text. Is done.
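
The feature value calculation can be sketched as follows. The example values (22 → 2, 301 → 30) imply that the remainder of the division is discarded, so the sketch uses integer division; that reading, and the function name `feature_value`, are assumptions.

```python
ALPHA = 10  # conversion coefficient α from the example


def feature_value(vocab_ids, alpha=ALPHA):
    """Return (min_id // alpha, max_id // alpha) for one text element."""
    return (min(vocab_ids) // alpha, max(vocab_ids) // alpha)


fv = feature_value([22, 301])
```

Dividing by α coarsens the range, which keeps the value stored with each piece of lexical index information small at the cost of a less selective filter.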

なお、特徴値の算出方法はこれに限られるものではなく、テキストの特徴を表す情報であればあらゆる方法により算出することができる。例えば、英数字や記号などの文字コード値を利用する方法、予め文字列パターンを複数用意しておき、いずれのパターンに含まれるかをパターン番号によって指定する方法を適用可能である。 Note that the feature value calculation method is not limited to this, and can be calculated by any method as long as the information represents the feature of the text. For example, a method of using character code values such as alphanumeric characters and symbols, or a method of preparing a plurality of character string patterns in advance and designating which pattern is included by a pattern number can be applied.

第２の実施の形態における索引検索部１２４ａは、語彙索引の処理時、特徴値を参照して索引の候補を絞り込む処理を行う。具体的には、索引検索部１２４ａは、検索キーに含まれる語彙の語彙ＩＤに、特徴値算出時と同様の変換処理を行い、候補である語彙索引情報の特徴値の範囲内に変換後の語彙ＩＤの値が含まれない場合に、当該語彙索引情報を索引の候補から除外する。なお、特徴値の範囲とは、例えば特徴値が（２、３０）の場合、２以上３０以下の範囲をいう。 The index search unit 124a according to the second embodiment performs processing for narrowing down index candidates with reference to feature values when processing a vocabulary index. Specifically, the index search unit 124a performs a conversion process similar to that used when calculating the feature value on the vocabulary ID of the vocabulary included in the search key, and converts the vocabulary ID within the range of the feature value of the candidate vocabulary index information. If the vocabulary ID value is not included, the lexical index information is excluded from index candidates. Note that the feature value range refers to a range from 2 to 30 when the feature value is (2, 30), for example.

次に、このように構成された第２の実施の形態にかかる構造化文書検索装置による構造テンプレート決定処理について説明する。図２４は、第２の実施の形態における構造テンプレート決定処理の全体の流れを示すフローチャートである。 Next, the structure template determination process by the structured document search apparatus according to the second embodiment configured as described above will be described. FIG. 24 is a flowchart illustrating an overall flow of the structure template determination process according to the second embodiment.

第２の実施の形態では、テキスト要素の特徴値を算出する処理がステップＳ２４０４に挿入された点が、第１の実施の形態と異なっている。その他の処理は、第１の実施の形態と同様の処理なので、その説明を省略する。 The second embodiment is different from the first embodiment in that the process of calculating the feature value of the text element is inserted in step S2404. The other processes are the same as those in the first embodiment, and a description thereof will be omitted.

ステップＳ２４０４では、構造テンプレート決定部１１２が、ステップＳ２４０３で算出した最大語彙ＩＤ、最小語彙ＩＤからテキストの特徴値を算出する（ステップＳ２４０４）。 In step S2404, the structure template determination unit 112 calculates a text feature value from the maximum vocabulary ID and the minimum vocabulary ID calculated in step S2403 (step S2404).

図２５は、算出された特徴値を説明するための説明図である。同図は、“ＸＭＬデータベース”というテキストについて算出された特徴値を表す図である。 FIG. 25 is an explanatory diagram for explaining the calculated feature values. The figure shows the feature values calculated for the text “XML database”.

なお、同図は説明のために、統計情報記憶部１３３ａの情報の一部（語彙、語彙ＩＤ）と、転置ファイル記憶部１３３ｂの情報（発生位置、特徴値、文書ＩＤ、要素ＩＤ、ＴＩＤ）とを対応づけた情報を表している。 For the sake of explanation, this figure shows a part of the information in the statistical information storage unit 133a (vocabulary and vocabulary ID) and the information in the transposed file storage unit 133b (occurrence position, feature value, document ID, element ID, TID). Represents information associated with.

同図に示す例では、語彙“データ”の語彙ＩＤ＝１２が最小語彙ＩＤであり、語彙“タベー”の語彙ＩＤ＝１４７が最大語彙ＩＤである。したがって、最小語彙ＩＤおよび最大語彙ＩＤを変換係数α＝１０で除算して並べた値（１，１４）が特徴値として算出されている。 In the example shown in the figure, the vocabulary ID = 12 of the vocabulary “データ” is the minimum vocabulary ID, and the vocabulary ID = 147 of the vocabulary “タベー” is the maximum vocabulary ID. The feature value is therefore the pair (1, 14), obtained by dividing the minimum and maximum vocabulary IDs by the conversion coefficient α = 10.

このようにして算出された特徴値は、最終的に転置ファイル記憶部１３３ｂの転置ファイル内に格納される。上記例では、文書ＩＤ、要素ＩＤ、ＴＩＤ、および特徴値が同一で、発生位置が異なる９個の語彙索引情報が、新たに転置ファイルに格納される。 The feature value calculated in this way is finally stored in a transposed file in the transposed file storage unit 133b. In the above example, nine pieces of lexical index information that share the same document ID, element ID, TID, and feature value but differ in occurrence position are newly stored in the transposed file.

なお、同一テキスト要素内に複数回数発生する語彙に関しては、語彙ＩＤの分布を複数の区間に分割して特徴値を算出するように構成してもよい。図２６は、このような構成で算出された特徴値を説明するための説明図である。同図は、“文書と文書の関係”というテキストについて算出された特徴値を表す図である。 For a vocabulary that occurs more than once in the same text element, the distribution of vocabulary IDs may be divided into a plurality of sections and a feature value calculated for each. FIG. 26 is an explanatory diagram of feature values calculated in this way. The figure shows the feature values calculated for the text “文書と文書の関係” (the relationship between a document and a document).

この例では、“文書”という語彙が同一テキスト中に２回発生するので、語彙ＩＤ分布（２２，２２，２６，５５，１０３，１７８，２３５，３０１）を２区間（２２，２２，２６，５５）および（１０３，１７８，２３５，３０１）に分割する。そして、分割した区間それぞれで算出した特徴値（２，５）および（１０，３０）を、それぞれ語彙“文書”に対応づけて格納している。 In this example, since the vocabulary “document” occurs twice in the same text, the lexical ID distribution (22, 22, 26, 55, 103, 178, 235, 301) is divided into two sections (22, 22, 26, 55) and (103, 178, 235, 301). The characteristic values (2, 5) and (10, 30) calculated in each of the divided sections are stored in association with the vocabulary “document”.

この場合は、図２５と異なり、語彙“文書”に対しては特徴値が（２，５）または（１０，３０）、その他の語彙に対しては特徴値が（２，３０）である８つの語彙索引情報が転置ファイルに格納される。 In this case, unlike FIG. 25, eight pieces of lexical index information are stored in the transposed file: those for the vocabulary “文書” (document) have the feature value (2, 5) or (10, 30), and those for the other vocabularies have the feature value (2, 30).
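
The sectioned variant can be sketched as below. The description does not say how the section boundary is chosen; splitting the sorted distribution into equal-sized runs is an assumption that happens to reproduce the two sections of the example. The function name `sectioned_feature_values` is hypothetical.

```python
def sectioned_feature_values(vocab_ids, sections=2, alpha=10):
    """Split the sorted ID distribution into equal-sized runs and
    compute one (min // alpha, max // alpha) pair per run."""
    ids = sorted(vocab_ids)
    if not ids:
        return []
    size = -(-len(ids) // sections)  # ceiling division
    chunks = [ids[i:i + size] for i in range(0, len(ids), size)]
    return [(c[0] // alpha, c[-1] // alpha) for c in chunks]


values = sectioned_feature_values([22, 22, 26, 55, 103, 178, 235, 301])
```

Each occurrence of the repeated vocabulary would then be stored with the pair of the section it belongs to, while the remaining vocabularies keep the overall pair.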

このように、発生頻度が高く、処理コストが高くなるような語彙に関しては、より詳細な特徴値を付与できるため、特徴値による絞込みの精度が向上するという効果が得られる。 As described above, since a more detailed feature value can be assigned to a vocabulary having a high occurrence frequency and a high processing cost, an effect of improving the accuracy of narrowing down by the feature value can be obtained.

次に、このように構成された第２の実施の形態にかかる構造化文書検索装置による転置ファイルスキャン処理について説明する。図２７は、第２の実施の形態における転置ファイルスキャン処理の全体の流れを示すフローチャートである。 Next, a transposed file scan process by the structured document search apparatus according to the second embodiment configured as described above will be described. FIG. 27 is a flowchart illustrating an overall flow of the transposed file scan process according to the second embodiment.

ステップＳ２７０１からステップＳ２７０５までの、ノード取得処理、転置ファイル取得処理、語彙索引情報取得処理は、第１の実施の形態にかかる構造化文書検索装置におけるステップＳ１７０１からステップＳ１７０５までと同様の処理なので、その説明を省略する。 The node acquisition process, transposed file acquisition process, and lexical index information acquisition process in steps S2701 to S2705 are the same as steps S1701 to S1705 in the structured document search apparatus according to the first embodiment, so their description is omitted.

ステップＳ２７０５で、語彙索引情報のＴＩＤが構造制約Ｔに含まれると判断した場合は（ステップＳ２７０５：ＹＥＳ）、索引検索部１２４ａは、語彙索引情報内の特徴値の範囲と比較するため、集合Ｇ内の語彙ＩＤの値を変換する（ステップＳ２７０６）。具体的には、特徴値を算出する際に用いる変換係数αで、検索語彙の語彙ＩＤの値を除算する。例えば、検索語彙の語彙ＩＤが１００であったとすると、変換係数α＝１０で除算することにより、語彙ＩＤの値が１０に変換される。 If it is determined in step S2705 that the TID of the lexical index information is included in the structure constraint T (step S2705: YES), the index search unit 124a compares the feature value range in the lexical index information with the set G The value of the vocabulary ID is converted (step S2706). Specifically, the vocabulary ID value of the search vocabulary is divided by the conversion coefficient α used when calculating the feature value. For example, if the vocabulary ID of the search vocabulary is 100, the value of the vocabulary ID is converted to 10 by dividing by the conversion coefficient α = 10.

次に、索引検索部１２４ａは、語彙索引情報内の特徴値の範囲内に、変換した語彙ＩＤの値が含まれるか否かを判断する（ステップＳ２７０７）。例えば、特徴値が（２，５）であり、変換した語彙ＩＤの値が１０であった場合は、範囲内（２から５の間）に含まれないと判断する。 Next, the index search unit 124a determines whether or not the converted vocabulary ID value is included in the range of feature values in the vocabulary index information (step S2707). For example, if the feature value is (2, 5) and the converted vocabulary ID value is 10, it is determined that it is not included in the range (between 2 and 5).

語彙索引情報内の特徴値の範囲内に、変換した語彙ＩＤの値が含まれると判断した場合は（ステップＳ２７０７：ＹＥＳ）、索引検索部１２４ａは、現在の語彙索引情報を索引の候補集合に追加する（ステップＳ２７０８）。 If it is determined that the converted vocabulary ID value is included in the range of feature values in the vocabulary index information (step S2707: YES), the index search unit 124a sets the current vocabulary index information as an index candidate set. It adds (step S2708).
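
The conversion and range test of steps S2706 and S2707 can be sketched as follows, using the values from the examples above. The function name `passes_feature_check` is hypothetical, and integer division is assumed for the conversion, as in the feature value examples.

```python
def passes_feature_check(search_vocab_id, feature, alpha=10):
    """S2706-S2707: convert the search vocabulary ID and test the range."""
    converted = search_vocab_id // alpha   # same conversion as at indexing time
    low, high = feature
    return low <= converted <= high


in_narrow = passes_feature_check(100, (2, 5))    # 100 -> 10, outside 2..5
in_wide = passes_feature_check(100, (2, 30))     # 10 lies inside 2..30
```

Because integer division preserves ordering, an ID that lies between the original minimum and maximum vocabulary IDs always converts to a value inside the stored range, so this test can only discard entries that cannot match.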

なお、変換処理のコスト、候補件数の関係などを考慮した処理コストに応じて特徴値による判定の可否を決定するように構成してもよいし、特徴値のうちのある１つの値（最小値または最大値）だけを利用するように構成してもよい。 Note that it may be configured to determine whether or not the determination by the feature value is possible according to the processing cost in consideration of the cost of the conversion processing, the relationship between the number of candidates, and the like, or one value (minimum value) of the feature values Alternatively, only the maximum value may be used.

ステップＳ２７０９からステップＳ２７１１までの、終了判定処理は、第１の実施の形態にかかる構造化文書検索装置におけるステップＳ１７０７からステップＳ１７０９までと同様の処理なので、その説明を省略する。 The end determination process in steps S2709 to S2711 is the same as steps S1707 to S1709 in the structured document search apparatus according to the first embodiment, so its description is omitted.

このように、転置ファイルスキャン処理では、索引作成時に転置ファイルに登録されたテキストの特徴値を参照し、検索条件に含まれる語彙から算出した特徴値に相当する値が、索引の特徴値に適合しない場合に、当該索引を候補から除外することができる。このため、語彙索引処理時の候補数をさらに絞り込むことができ、構造化文書に対する高速な全文検索機能を実現することができる。 As described above, in the inverted file scan process, the feature value of the text registered in the inverted file is referred to when the index is created, and the value corresponding to the feature value calculated from the vocabulary included in the search condition matches the feature value of the index. If not, the index can be excluded from the candidates. For this reason, the number of candidates at the time of lexical index processing can be further narrowed down, and a high-speed full-text search function for structured documents can be realized.

ここで、転置ファイルスキャン処理で処理されるデータの具体例について説明する。図２８は、転置ファイルスキャン処理で処理される候補の一例を示す説明図である。同図は、構造制約Ｔとして｛ＴＩＤ７、ＴＩＤ１１｝（ＴＩＤ７とＴＩＤ１１が構造制約を満たす構造要素のＴＩＤであることを表す）が取得され、語彙ＩＤ集合Ｇとして｛１３５、２９２、３５６｝が取得された場合を前提としている（ステップＳ２７０１）。 Here, a specific example of data processed in the transposed file scanning process will be described. FIG. 28 is an explanatory diagram of an example of candidates processed in the transposed file scan process. In the figure, {TID7, TID11} (representing that TID7 and TID11 are TIDs of structural elements satisfying the structure constraint) is acquired as the structure constraint T, and {135, 292, 356} is acquired as the vocabulary ID set G. (Step S2701).

なお、検索キーとしては“ＸＭＬ”が指定されたことを前提とする。したがって、語彙ＩＤ集合Ｇは、それぞれ語彙“ＸＭＬ”、語彙“ＭＬ”、語彙“Ｌ”に対応する語彙ＩＤを含んでいる。 It is assumed that “XML” is designated as the search key. Therefore, the vocabulary ID set G includes vocabulary IDs corresponding to the vocabulary “XML”, the vocabulary “ML”, and the vocabulary “L”, respectively.

さらに、同図では、語彙“ＸＭＬ”に対応する転置ファイルとして５つのＴＩＤの語彙索引情報を含む転置ファイルが転置ファイル記憶部１３３ｂに記憶されていることを前提とする。なお、同図では、語彙索引情報のうち、要素ＩＤと発生位置とは省略している。 Further, in the figure, it is assumed that a transposed file including lexical index information of five TIDs is stored in the transposed file storage unit 133b as a transposed file corresponding to the vocabulary “XML”. In the figure, the element ID and the occurrence position are omitted from the lexical index information.

上記のような前提では、構造制約条件により、転置ファイルのＴＩＤは３件に絞り込まれる（ステップＳ２７０５）。また、集合Ｇ内の語彙ＩＤの変換値は、それぞれ１３，２９，３５となるため（ステップＳ２７０６）、これらの変換値が特徴値の範囲外となる文書ＩＤ＝２００の語彙索引情報が除外される（ステップＳ２７０７）。 Under the above assumptions, the structure constraint narrows the TID entries of the transposed file down to three (step S2705). The converted values of the vocabulary IDs in the set G are 13, 29, and 35, respectively (step S2706), so the lexical index information of document ID = 200, for which these converted values fall outside the feature value range, is excluded (step S2707).

このように、構造制約のみによれば候補はＴＩＤ７とＴＩＤ１１であるが、特徴値を参照することにより、ＴＩＤ７に対応する索引のみに候補を絞り込むことができる。 As described above, the candidates are TID7 and TID11 according to the structure constraint alone, but the candidates can be narrowed down to only the index corresponding to TID7 by referring to the feature value.

次に、処理を繰り返すことにより制約がさらに絞り込まれる例について説明する。図２９、図３０は、転置ファイルスキャン処理で処理される候補の別の一例を示す説明図である。 Next, an example in which the restriction is further narrowed down by repeating the process will be described. 29 and 30 are explanatory diagrams illustrating another example of candidates processed in the transposed file scanning process.

この例では、図１８のような構造情報が構造テンプレート記憶部１３２に記憶されていることを前提とする。 In this example, it is assumed that structure information as illustrated in FIG. 18 is stored in the structure template storage unit 132.

この状態で、転置ファイル１のスキャンによりＴＩＤ７、ＴＩＤ１５の候補が、ＴＩＤ１５の候補のみに絞り込まれ（図２９）、次に構造制約｛ＴＩＤ１１、ＴＩＤ１９｝を処理すると仮定する（図３０）。 In this state, it is assumed that the candidates for TID7 and TID15 are narrowed down to only candidates for TID15 by scanning the transposed file 1 (FIG. 29), and then the structure constraints {TID11, TID19} are processed (FIG. 30).

In this case, of TID11 and TID19, only TID19 satisfies the same-document constraint with TID15 under the structure information shown in FIG. 18, so the index information for TID11 need not be read (step S2705: NO). Therefore, in transposed file 2, only the vocabulary index information of document ID = 62, which has TID19, needs to be treated as a candidate.

Next, a specific example of the structured document search process performed by the structured document search apparatus according to the second embodiment will be described.

The description here builds on the specific example of the structured document search process described for the first embodiment with reference to FIGS. 18 to 22. That is, it is assumed that the processing described with reference to FIGS. 18 to 22 has generated a query graph such as that shown in FIG. 21, and that the frequencies of the search vocabularies in the query graph take the values shown in FIG. 22.

FIG. 31 is an explanatory diagram showing an example of candidates processed in the transposed file scan process in this case. Assume that the index search unit 124a performs the transposed file scan process with [TID11, TID19] as the structure constraint and that, as a result, only TID11 remains in the candidate set.

There are three candidates for TID11, but the candidate for document ID = 200 is removed from the candidates by comparing feature values. This is because, if the vocabulary ID of “田中” (Tanaka) is 136, the converted vocabulary ID is 13 (step S1706), which does not fall within the feature value range of 80 to 90.
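The feature-value comparison described above can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the `Posting` record, the `filter_postings` function, the document IDs 1 and 2, and their ranges are hypothetical. Only the converted value 13 and the 80 to 90 range of document ID = 200 come from the text. The function that converts a vocabulary ID into the compared value is not reproduced here, so the converted value is passed in directly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Posting:
    """One vocabulary index entry (hypothetical layout)."""
    doc_id: int
    tid: int
    feature_min: int  # lower bound of the stored feature value range
    feature_max: int  # upper bound of the stored feature value range


def filter_postings(postings, allowed_tids, converted_value):
    """Keep entries that satisfy the structure constraint and whose
    feature value range contains the converted vocabulary ID."""
    return [
        p for p in postings
        if p.tid in allowed_tids
        and p.feature_min <= converted_value <= p.feature_max
    ]


# Three TID11 candidates; only document 200's range (80 to 90) is from the text.
postings = [
    Posting(doc_id=1, tid=11, feature_min=10, feature_max=20),
    Posting(doc_id=2, tid=11, feature_min=5, feature_max=40),
    Posting(doc_id=200, tid=11, feature_min=80, feature_max=90),
]

survivors = filter_postings(postings, allowed_tids={11}, converted_value=13)
surviving_doc_ids = [p.doc_id for p in survivors]
```

In this sketch the entry for document ID = 200 is dropped without reading any further data for it, which is the effect the paragraph above describes.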

In addition, because TID19 has been excluded from the structure constraint, TID15 is also excluded from the candidates in accordance with the same-document constraint “{TID15} <-> {TID19}”. Therefore, for node 4, which is processed next, candidates need only be extracted using TID7 alone, with TID15 removed, as the structure constraint.
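The propagation of the same-document constraint can be sketched as follows. This is an illustrative assumption, not the disclosed procedure: the function name `propagate_same_document` and the dictionary representation are hypothetical, and the pairing of TID7 with TID11 is assumed for the example (the text states only the pair {TID15} <-> {TID19}).

```python
def propagate_same_document(candidates, partner_survivors, same_document):
    """Return the candidate TIDs that still have at least one surviving
    same-document partner TID at the other node.

    In this sketch a TID with no recorded partner is dropped, because
    nothing pairs it with the other node inside one document.
    """
    kept = set()
    for tid in candidates:
        partners = same_document.get(tid, set())
        if partners & partner_survivors:
            kept.add(tid)
    return kept


# {TID15} <-> {TID19} is from the text; {TID7} <-> {TID11} is assumed.
same_document = {7: {11}, 15: {19}}

# The scan at the other node left only TID11, so TID19 is gone.
remaining = propagate_same_document({7, 15}, {11}, same_document)
```

With TID19 eliminated, TID15 has no surviving partner and is removed, leaving TID7 as the only structure constraint for the next node.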

FIG. 32 is an explanatory diagram showing an example of candidates processed in the transposed file scan process in this case.

Node 4 performs processing for the least frequent vocabulary, “造化”. Here too, narrowing by feature value ultimately extracts only the single candidate corresponding to document ID = 43. Because “造化” is one of the vocabularies that make up “構造化文書” (structured document), the index entry is retained together with its position information in this case and is used in the subsequent processing.

As described above, the structured document search apparatus according to the second embodiment stores, in the vocabulary index information, a feature value representing the features of the vocabulary contained in a text element, and refers to that value during the transposed file scan process to narrow down the number of candidates. Consequently, full-text search of structured documents by queries that combine position information with structure constraints, as in XQFT, can be executed even faster.

(Third Embodiment)
The structured document search apparatus according to the third embodiment defines, for each structural element, a division rule based on feature information. When the feature information of the value of an element corresponding to a structural element matches the division rule, the apparatus divides that structural element to create a new structural element and updates the structure template storage unit 132.

The third embodiment differs from the second embodiment in that a division rule is set for each structural element in the structure template storage unit 132. It also differs in that, during the structured document registration process, the structure template determination unit 112 refers to the division rule and divides a structural element stored in the structure template storage unit 132 to create a new structural element.

FIG. 33 is an explanatory diagram showing an example of the data structure of the structure information stored in the structure template storage unit 132 in the third embodiment. The figure shows the structure information in table form.

As shown in the figure, in the third embodiment the structure information additionally stores, in association with each structural element, the maximum in-element vocabulary occurrence count, the number of times the text size threshold has been exceeded, reference information, and a division method. The reference information and the division method together specify the division rule.

The maximum in-element vocabulary occurrence count is the largest of the occurrence counts of the individual vocabularies appearing in the corresponding text element. The number of times the text size threshold has been exceeded is the number of times the text size exceeded a predetermined threshold (for example, 100 characters).

The reference information specifies the information applied when deciding whether to divide. For example, the feature information of each structural element can be set here: the minimum vocabulary ID, the maximum vocabulary ID, the maximum text size, the maximum in-element vocabulary occurrence count, and the number of times the text size threshold has been exceeded. Other criteria, such as frequent vocabularies, can also be used. The figure shows an example in which the frequent vocabularies with vocabulary ID = 22 or 400 are extracted and set as the reference information.

The division condition specifies the condition to be met by the information set in the reference information. When that information satisfies the division condition, the corresponding structural element is divided. The figure shows an example in which the division condition is that the occurrence count of the frequent vocabularies with vocabulary ID = 22 or 400 exceeds a predetermined threshold of 30.

To reduce the processing load, it is desirable to use feature information that is inexpensive to compute as the reference information. The apparatus may be configured so that no reference information or division condition is set in the initial stage, or so that default values are set.

Alternatively, for example, the number of times the text size threshold (100) has been exceeded may be used as the reference information, with the structural element divided when that count exceeds 10. In this case, one applicable division method is to split the element into two structural elements so that text elements with a text size of 100 or more are separated from text elements with a text size of 99 or less.
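The two division rules described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation. The class and function names are hypothetical, and the frequent-vocabulary rule is read here as "any one of the listed vocabulary IDs occurs more than the threshold number of times", which is one possible interpretation of the text.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FrequentVocabularyRule:
    """Division rule: reference information = frequent vocabulary IDs,
    division condition = occurrence count exceeds the threshold."""
    vocabulary_ids: frozenset
    threshold: int

    def is_satisfied(self, occurrence_counts):
        # Assumed reading: one listed vocabulary alone must exceed the threshold.
        return any(
            occurrence_counts.get(v, 0) > self.threshold
            for v in self.vocabulary_ids
        )


def text_size_rule_satisfied(threshold_exceeded_count, limit=10):
    """Division condition on how often the text size threshold was exceeded."""
    return threshold_exceeded_count > limit


def partition_by_text_size(texts, boundary=100):
    """Division method from the text: sizes of 100 or more versus 99 or less."""
    long_texts = [t for t in texts if len(t) >= boundary]
    short_texts = [t for t in texts if len(t) < boundary]
    return long_texts, short_texts


rule = FrequentVocabularyRule(frozenset({22, 400}), threshold=30)
hit = rule.is_satisfied(Counter({22: 31, 136: 2}))
miss = rule.is_satisfied(Counter({22: 30, 400: 5}))
long_part, short_part = partition_by_text_size(["a" * 100, "b" * 99])
```

Both checks only read counters that are already kept as feature information, which matches the stated preference for reference information that is cheap to compute.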

In this way, for example, when a certain vocabulary occurs many times within the same text element, the apparatus can judge that query processing performance may degrade and can divide the structural element of the structure template in advance at registration time. This makes it possible to narrow down candidates early, at the transposed file scan stage.

In addition to determining the TID and calculating the feature value, the structure template determination unit 112 in the third embodiment determines whether a TID satisfies the division condition, divides the TID if it does, and determines the divided TID as the applicable TID.

That is, rather than determining the TID from the structure information alone, the structure template determination unit 112 may divide a structural element according to the feature information of the corresponding text. The structure template determination unit 112 then determines the structural element newly created by the division as the structural element corresponding to the node of the tree structure of the structured document.

Next, the template determination process performed by the structured document search apparatus according to the third embodiment configured as described above will be described. FIG. 34 is a flowchart showing the overall flow of the template determination process in the third embodiment.

The vocabulary division process and the vocabulary ID addition process in steps S3401 and S3402 are the same as steps S2401 and S2402 in the structured document search apparatus according to the second embodiment, so their description is omitted.

After adding the vocabulary IDs, the structure template determination unit 112 calculates the feature information to be stored in the structure template storage unit 132 for each TID (maximum vocabulary ID, minimum vocabulary ID, maximum text size, maximum in-element vocabulary occurrence count, and number of times the text size threshold has been exceeded) (step S3403). For example, the occurrence count of the vocabulary that appears most often in the text element is calculated as the maximum in-element vocabulary occurrence count.
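As a rough sketch of the feature information calculation in step S3403, the five values can be accumulated per TID as text elements are registered. The `FeatureInfo` class and its field names are hypothetical, and the vocabulary ID lists passed in are illustrative. Only the 100-character threshold and the definitions of the five values come from the text.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Sequence

TEXT_SIZE_THRESHOLD = 100  # characters; the example threshold given in the text


@dataclass
class FeatureInfo:
    """Feature information kept for one TID (hypothetical field names)."""
    min_vocabulary_id: Optional[int] = None
    max_vocabulary_id: Optional[int] = None
    max_text_size: int = 0
    max_in_element_occurrences: int = 0
    threshold_exceeded_count: int = 0

    def update(self, vocabulary_ids: Sequence[int], text_size: int) -> None:
        """Fold one text element (its vocabulary IDs and size) into the totals."""
        if vocabulary_ids:
            lo, hi = min(vocabulary_ids), max(vocabulary_ids)
            if self.min_vocabulary_id is None or lo < self.min_vocabulary_id:
                self.min_vocabulary_id = lo
            if self.max_vocabulary_id is None or hi > self.max_vocabulary_id:
                self.max_vocabulary_id = hi
            # Occurrence count of the vocabulary that appears most often
            # within this single text element.
            top_count = Counter(vocabulary_ids).most_common(1)[0][1]
            self.max_in_element_occurrences = max(
                self.max_in_element_occurrences, top_count
            )
        self.max_text_size = max(self.max_text_size, text_size)
        if text_size > TEXT_SIZE_THRESHOLD:
            self.threshold_exceeded_count += 1


info = FeatureInfo()
info.update([22, 136, 22], text_size=40)   # vocabulary 22 appears twice
info.update([400], text_size=120)          # exceeds the 100-character threshold
```

Each update touches only a handful of integers, so the bookkeeping adds little cost to registration.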

The feature value calculation process and the structural element acquisition process in steps S3404 to S3406 are the same as steps S2404 to S2406 in the structured document search apparatus according to the second embodiment, so their description is omitted.

If no structural element is acquired in step S3406 (step S3406: NO), the structure template determination unit 112 creates a new structural element, sets the default reference information and division condition for it, and registers it in the structure template storage unit 132 (step S3407).

Next, the structure template determination unit 112 sets the TID of the created structural element as the TID corresponding to the structural element of the tree structure currently being processed (step S3408).

If a structural element is acquired in step S3406 (step S3406: YES), the structure template determination unit 112 selects the best-fitting structural element according to the feature information corresponding to each structural element (step S3409). This step exists to choose one structural element when several have been acquired, so it is not executed when only one structural element has been acquired.

For example, assume that structure information such as that shown in FIGS. 18 and 33 is stored in the structure template storage unit 132, and that TID6 and TID7 have been acquired as the structural elements corresponding to the text element under the title tag. In this case, if the text element contains “文書” (document), whose vocabulary ID is 22, then TID7, whose range between the minimum vocabulary ID and the maximum vocabulary ID contains that vocabulary ID, is selected as the best-fitting structural element.
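The best-fit selection in step S3409 can be sketched as choosing the candidate whose vocabulary ID range covers the most vocabulary IDs of the text element. This scoring is an assumption made for illustration, since the text gives only the single-vocabulary case. The function name and the ranges assigned to TID6 and TID7 are hypothetical because the figures are not reproduced here.

```python
def select_best_fit(candidate_ranges, vocabulary_ids):
    """Return the TID whose [min, max] vocabulary ID range contains the
    largest number of the text element's vocabulary IDs.

    Ties go to the candidate listed first in candidate_ranges.
    """
    def covered(tid):
        lo, hi = candidate_ranges[tid]
        return sum(1 for v in vocabulary_ids if lo <= v <= hi)

    return max(candidate_ranges, key=covered)


# Hypothetical (min vocabulary ID, max vocabulary ID) per candidate TID.
candidate_ranges = {6: (100, 300), 7: (10, 50)}

# The text element contains the vocabulary with ID 22.
chosen = select_best_fit(candidate_ranges, [22])
```

With these assumed ranges, vocabulary ID 22 falls only inside the range of TID7, so TID7 is selected, as in the example above.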

Next, the structure template determination unit 112 refers to the structure template storage unit 132 and acquires the reference information and division condition corresponding to the selected structural element (step S3410).

Next, the structure template determination unit 112 refers to the acquired reference information and division condition, together with the feature information calculated in step S3403, and determines whether the structural element satisfies the division condition (step S3411).

If the division condition is not satisfied (step S3411: NO), the TID of the structural element selected in step S3409 is set as the TID corresponding to the structural element of the tree structure currently being processed (step S3412).

If the division condition is satisfied (step S3411: YES), the structure template determination unit 112 creates a new structural element, sets the default reference information and division condition for it, and registers it in the structure template storage unit 132 (step S3407).

The end confirmation process in step S3413 is the same as step S2410 in the structured document search apparatus according to the second embodiment, so its description is omitted.
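The branching of steps S3406 to S3412 can be summarized in a short Python sketch. It is only an outline under stated assumptions: `decide_tid` and its callable parameters are hypothetical names, the selection and division checks are passed in as functions rather than implemented, and new TIDs are drawn from an arbitrary counter.

```python
def decide_tid(candidates, select_best_fit, satisfies_division, create_element):
    """Decide the TID for the tree node currently being processed.

    candidates         -- TIDs acquired for the node (may be empty)
    select_best_fit    -- callable(list of TIDs) -> TID       (step S3409)
    satisfies_division -- callable(TID) -> bool               (steps S3410, S3411)
    create_element     -- callable() -> newly registered TID  (step S3407)
    """
    if not candidates:                       # step S3406: NO
        return create_element()              # steps S3407 and S3408
    if len(candidates) == 1:
        tid = candidates[0]                  # selection is skipped for one candidate
    else:
        tid = select_best_fit(candidates)    # step S3409
    if satisfies_division(tid):              # step S3411: YES
        return create_element()              # step S3407
    return tid                               # step S3412


# Illustrative wiring: new TIDs start at 100, and TID7 is always the best fit.
new_tids = iter(range(100, 200))


def create():
    return next(new_tids)


created = decide_tid([], lambda c: 7, lambda t: False, create)
reused = decide_tid([6, 7], lambda c: 7, lambda t: False, create)
divided = decide_tid([6, 7], lambda c: 7, lambda t: True, create)
```

In this wiring the first call registers a new element, the second reuses TID7, and the third replaces TID7 with a newly created element because the division condition holds.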

Next, a specific example of the structured document registration process performed by the structured document search apparatus according to the third embodiment will be described.

FIG. 35 is an explanatory diagram showing an example of the structure information stored in the structure template storage unit 132. FIGS. 36 and 37 are explanatory diagrams showing examples of structured documents to be registered.

Here, an example of registering structured documents such as those shown in FIGS. 36 and 37 is described, on the premise that structure information such as that shown in FIGS. 33 and 35 is stored in the structure template storage unit 132. The description below focuses on the processing of the title element.

For the title element, two candidate structural elements are acquired from the structure information shown in FIG. 35 (step S3405; step S3406: YES). The structure template determination unit 112 therefore selects the best-fitting structural element according to the feature information corresponding to each structural element (step S3409).

In this example, the text element contains the vocabulary “文書” (document; vocabulary ID = 22), so TID7, whose range between the minimum vocabulary ID and the maximum vocabulary ID contains that vocabulary ID, is selected as the best-fitting structural element.

Next, the structure template determination unit 112 determines whether the division condition of the structural element is satisfied (step S3411). Here, the text element contains two occurrences of the frequent vocabulary “文書” (document; vocabulary ID = 22), and it is assumed that the division condition of 30 or more occurrences is thereby satisfied (step S3411: YES).

In this case, the structure template determination unit 112 divides TID7 and creates a new structural element (step S3407). The division may split only the structural element corresponding to the text element under the title tag, or it may create new structural elements that include the parent element as well.

FIG. 38 is an explanatory diagram showing an example of the structure information stored in the structure template storage unit 132 after the division. The figure shows an example in which the structural element is divided together with its parent element, the patent document element.

As described above, the structured document search apparatus according to the third embodiment defines, for each structural element, a division rule based on feature information and, when the feature information of the value of an element corresponding to a structural element matches the division rule, divides that structural element to create a new structural element and updates the structure template storage unit. As a result, portions where the number of processing candidates grows and that could become a bottleneck in search processing can be detected in advance and their structural elements divided. This makes it possible to narrow down candidates early, at the transposed file scan stage.

As described above, the structured document search apparatus and the structured document search method according to the present invention are suited to apparatuses and methods that manage a plurality of structured documents with different document structures in a structured document database having a hierarchical logical structure and that perform searches whose conditions specify position information, such as full-text search.

FIG. 1 is a block diagram showing the configuration of the structured document search apparatus according to the first embodiment.
FIG. 2 is an explanatory diagram showing an example of a structured document described in XML.
FIG. 3 is an explanatory diagram showing an example of the data structure of a structured document stored in the structured document storage unit.
FIG. 4 is an explanatory diagram showing an example of the data structure of a structured document represented in table form.
FIG. 5 is an explanatory diagram showing an example of the data structure of the structure information stored in the structure template storage unit in the first embodiment.
FIG. 6 is an explanatory diagram showing an example of the data structure of the structure template storage unit represented in table form.
FIG. 7 is an explanatory diagram showing an example of the data structure of the statistical information stored in the statistical information storage unit.
FIG. 8 is an explanatory diagram showing an example of the data structure of a transposed file stored in the transposed file storage unit.
FIG. 9 is an explanatory diagram showing an example of a search query.
FIG. 10 is an explanatory diagram showing an example of a query graph.
FIG. 11 is an explanatory diagram showing an example of the vocabulary frequency information used in processing cost calculation.
FIG. 12 is an explanatory diagram showing an example of the vocabulary frequency information used in processing cost calculation.
FIG. 13 is a flowchart showing the overall flow of the structured document storage process in the first embodiment.
FIG. 14 is a flowchart showing the overall flow of the structure template determination process in the first embodiment.
FIG. 15 is a flowchart showing the overall flow of the structured document search process in the first embodiment.
FIG. 16 is a flowchart showing the overall flow of the constraint addition process in the first embodiment.
FIG. 17 is a flowchart showing the overall flow of the transposed file scan process in the first embodiment.
FIG. 18 is an explanatory diagram showing an example of the structure information stored in the structure template storage unit.
FIG. 19 is an explanatory diagram showing an example of an input search query.
FIG. 20 is an explanatory diagram showing an example of a query graph obtained by analyzing the search query.
FIG. 21 is an explanatory diagram showing an example of a query graph.
FIG. 22 is an explanatory diagram showing the frequency of each search vocabulary.
FIG. 23 is an explanatory diagram showing an example of the data structure of a transposed file stored in the transposed file storage unit in the second embodiment.
FIG. 24 is a flowchart showing the overall flow of the structure template determination process in the second embodiment.
FIG. 25 is an explanatory diagram for explaining the calculated feature value.
FIG. 26 is an explanatory diagram for explaining the feature value.
FIG. 27 is a flowchart showing the overall flow of the transposed file scan process in the second embodiment.
FIG. 28 is an explanatory diagram showing an example of candidates processed in the transposed file scan process.
FIG. 29 is an explanatory diagram showing another example of candidates processed in the transposed file scan process.
FIG. 30 is an explanatory diagram showing another example of candidates processed in the transposed file scan process.
FIG. 31 is an explanatory diagram showing an example of candidates processed in the transposed file scan process.
FIG. 32 is an explanatory diagram showing an example of candidates processed in the transposed file scan process.
FIG. 33 is an explanatory diagram showing an example of the data structure of the structure information stored in the structure template storage unit in the third embodiment.
FIG. 34 is a flowchart showing the overall flow of the template determination process in the third embodiment.
FIG. 35 is an explanatory diagram showing an example of the structure information stored in the structure template storage unit.
FIG. 36 is an explanatory diagram showing an example of a structured document to be registered.
FIG. 37 is an explanatory diagram showing an example of a structured document to be registered.
FIG. 38 is an explanatory diagram showing an example of the structure information stored in the structure template storage unit after division.

Explanation of symbols

100 Structured document search apparatus
101 Communication unit
110 Storage processing unit
111 Structure information extraction unit
112 Structure template determination unit
113 Index registration unit
114 Statistical information update unit
115 Document registration unit
120 Search processing unit
121 Query analysis unit
122 Constraint addition unit
122a Feature information acquisition unit
122b Feature information calculation unit
122c TID candidate deletion unit
123 Query planning unit
124 Query execution unit
124a Index search unit
124b Search unit
131 Structured document storage unit
132 Structure template storage unit
133 Index storage unit
133a Statistical information storage unit
133b Transposed file storage unit
200 Network
300 Client

Claims

A structured document retrieval apparatus comprising:
structured document storage means for storing, in association with each other, a structured document, which is a document having a hierarchical logical structure and containing elements that are the actual information corresponding to structural elements that are units of the logical structure, and a document ID that uniquely identifies the structured document;
structure information storage means for storing structure information in which a structure ID that uniquely identifies the structural element is associated with first feature information representing a feature of the element;
analysis means for accepting input of a search condition for the structured document, analyzing the accepted search condition, and obtaining a hierarchical search condition that has hierarchically arranged nodes corresponding to the structural elements of the structured document, each node associating candidate structure IDs to be searched with a search key for those candidates;
acquisition means for acquiring from the structure information storage means, for each node of the hierarchical search condition obtained by the analysis means, the first feature information associated with the candidate structure IDs included in the node;
calculation means for calculating, for each node of the hierarchical search condition obtained by the analysis means, second feature information representing a feature of the search key included in the node;
deletion means for deleting, for each node of the hierarchical search condition obtained by the analysis means, from among the candidate structure IDs included in the node, any candidate for which the second feature information calculated by the calculation means for the corresponding search key does not match the first feature information acquired by the acquisition means; and
search means for obtaining, on the basis of the hierarchical search condition from which the deletion means has deleted candidate structure IDs, the document ID corresponding to a structure ID that satisfies the hierarchical search condition, and retrieving the structured document corresponding to the obtained document ID from the structured document storage means.

Structured document storage means for storing a structured document, which is a document having a hierarchical logical structure and includes elements that are actual information corresponding to structural elements that are units of the logical structure, in association with a document ID that uniquely identifies the structured document;
Structure information storage means for storing structure information in which a structure ID that uniquely identifies the structural element is associated with first feature information that represents a feature of the element;
Index storage means for storing an index that associates the document ID, an element ID that uniquely identifies the element, and the structure ID of the structural element corresponding to the element;
Analysis means for accepting input of a search condition for the structured document, analyzing the accepted search condition, and obtaining a hierarchical search condition whose hierarchically structured nodes correspond to the structural elements of the structured document, each node associating candidate structure IDs to be searched with a search key for those candidates;
Acquisition means for acquiring from the structure information storage means, for each node of the hierarchical search condition obtained by the analysis means, the first feature information associated with each candidate structure ID included in the node;
Calculation means for calculating, for each node of the hierarchical search condition obtained by the analysis means, second feature information representing a feature of the search key included in the node;
Deletion means for deleting, for each node of the hierarchical search condition obtained by the analysis means, from among the candidate structure IDs included in the node, each candidate for which the second feature information calculated by the calculation means for the corresponding search key does not match the first feature information acquired by the acquisition means;
Index search means for searching the index storage means, based on the hierarchical search condition from which the deletion means has deleted structure ID candidates, for the document ID and the element ID corresponding to each structure ID that satisfies the hierarchical search condition;
Search means for searching the structured document storage means for the structured document corresponding to the document ID found by the index search means;
A structured document search apparatus characterized by comprising the above means.
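
As a non-limiting illustration, the flow of the means recited above (pruning the candidate structure IDs of each node, searching the index, then fetching documents) can be sketched in Python. Every name and data value here is hypothetical. The sketch assumes that the first feature information is the maximum element length per structure ID and that the second feature information is the length of the search key; it also ignores the parent-child relationship between nodes and omits the final comparison of the search key against element text.

```python
from dataclasses import dataclass

# Hypothetical data, for illustration only.
STRUCTURE_INFO = {1: 5, 2: 40}  # structure ID -> first feature (assumed: max element length)
INDEX = [("d1", 10, 1), ("d1", 11, 2), ("d2", 20, 2)]  # (document ID, element ID, structure ID)
DOCUMENTS = {"d1": "<doc>first</doc>", "d2": "<doc>second</doc>"}


@dataclass
class Node:
    candidates: list  # candidate structure IDs for this node
    search_key: str


def second_feature(search_key):
    # Calculation means: the assumed feature of a search key is its length.
    return len(search_key)


def prune(node):
    # Acquisition and deletion means: drop every candidate whose stored
    # feature shows that the key cannot occur under that structure.
    wanted = second_feature(node.search_key)
    kept = [sid for sid in node.candidates if wanted <= STRUCTURE_INFO[sid]]
    return Node(kept, node.search_key)


def index_search(node):
    # Index search means: look up (document ID, element ID) pairs
    # for the structure IDs that survived pruning.
    return [(doc, elem) for doc, elem, sid in INDEX if sid in node.candidates]


def search(nodes):
    # Search means: a document qualifies only if every node has a hit in it.
    doc_sets = [{doc for doc, _ in index_search(prune(n))} for n in nodes]
    doc_ids = sorted(set.intersection(*doc_sets)) if doc_sets else []
    return [(doc_id, DOCUMENTS[doc_id]) for doc_id in doc_ids]
```

In this sketch, a ten-character key removes structure ID 1 (maximum length 5) before the index is consulted, so only entries for structure ID 2 are scanned.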

The structured document search apparatus according to claim 2, wherein:
The index storage means stores the index further associating, for each vocabulary that is a character string included in the element, a vocabulary ID that uniquely identifies the vocabulary with the occurrence frequency of the vocabulary in all structured documents of the structured document storage means;
The structure information storage means stores the structure information in which the minimum vocabulary ID, which is the minimum value of the vocabulary IDs associated with the vocabulary included in the element, and the maximum vocabulary ID, which is the maximum value, are associated as the first feature information;
The calculation means calculates, for each node of the hierarchical search condition obtained by the analysis means, the vocabulary ID of the vocabulary included in the search key of the node as the second feature information; and
The deletion means deletes, for each node of the hierarchical search condition obtained by the analysis means, from among the candidate structure IDs included in the node, each candidate for which the vocabulary ID calculated by the calculation means for the corresponding search key does not fall between the minimum vocabulary ID and the maximum vocabulary ID that are the first feature information acquired by the acquisition means.
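
By way of example only, the range check between the minimum vocabulary ID and the maximum vocabulary ID can be sketched as follows. The vocabulary table, the ranges, and the function name are hypothetical, and the sketch assumes a search key that consists of a single vocabulary. Treating an unregistered vocabulary as matching no candidate is also an assumption of this sketch.

```python
# Hypothetical vocabulary table: vocabulary -> vocabulary ID.
VOCAB_IDS = {"xml": 3, "query": 7, "patent": 12}

# Hypothetical structure information:
# structure ID -> (minimum vocabulary ID, maximum vocabulary ID).
STRUCTURE_RANGE = {1: (1, 5), 2: (6, 12)}


def prune_by_vocab_range(candidates, search_key):
    """Keep only candidates whose [min, max] vocabulary ID range covers the key."""
    vocab_id = VOCAB_IDS.get(search_key)
    if vocab_id is None:
        # Assumption: a vocabulary absent from the index occurs in no element.
        return []
    kept = []
    for sid in candidates:
        low, high = STRUCTURE_RANGE[sid]
        if low <= vocab_id <= high:
            kept.append(sid)
    return kept
```

The check is a necessary condition only: a vocabulary ID inside the range may still be absent from the elements, so surviving candidates are still looked up in the index.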

The structured document search apparatus according to claim 3, wherein the index storage means stores the index further associating, for each vocabulary consisting of a plurality of consecutive characters included in the element, taken over all such character sequences, a vocabulary ID that uniquely identifies the vocabulary with the occurrence frequency of the vocabulary in all structured documents of the structured document storage means.
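
As a minimal sketch, a vocabulary made of consecutive characters can be built with character n-grams. The choice of two-character sequences, the assignment of vocabulary IDs in sorted order, and the function names are assumptions made for illustration and are not taken from the claims.

```python
from collections import Counter


def ngrams(text, n=2):
    """Return every run of n consecutive characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_vocabulary(elements, n=2):
    """Return (vocabulary -> vocabulary ID, vocabulary -> occurrence frequency)."""
    freq = Counter(gram for text in elements for gram in ngrams(text, n))
    # Assumption: IDs are assigned in sorted order starting from 1.
    vocab_ids = {gram: i for i, gram in enumerate(sorted(freq), start=1)}
    return vocab_ids, freq
```

For the element strings "abab" and "bc", this yields the vocabularies "ab", "ba", and "bc", with "ab" counted twice.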

The structured document search apparatus according to claim 3, wherein the index storage means stores the index further associating, for each vocabulary that is a word obtained by morphological analysis of the element, a vocabulary ID that uniquely identifies the vocabulary with the occurrence frequency of the vocabulary in all structured documents of the structured document storage means.

The structured document search apparatus according to claim 2, wherein:
The index storage means stores an index in which the document ID, the element ID, the structure ID of the structural element corresponding to the element, and third feature information representing a feature of the element are associated with each other;
The apparatus further comprises second acquisition means for acquiring, from the index storage means, the third feature information associated with the document ID and the element ID found by the index search means;
The apparatus further comprises second deletion means for deleting, from the document IDs and the element IDs found by the index search means, each document ID and element ID for which the second feature information calculated by the calculation means for the search key included in the node of the hierarchical search condition used in the search does not match the third feature information acquired by the second acquisition means; and
The search means searches the structured document storage means for the structured document corresponding to the document ID remaining after the deletion by the second deletion means.

The structured document search apparatus according to claim 6, wherein:
The index storage means stores the index further associating, for each vocabulary that is a character string included in the element, a vocabulary ID that uniquely identifies the vocabulary with the occurrence frequency of the vocabulary in all structured documents of the structured document storage means, and associating the document ID, the element ID, and the structure ID of the structural element corresponding to the element with the third feature information calculated based on a minimum vocabulary ID, which is the minimum value of the vocabulary IDs associated with the vocabulary included in the element, and a maximum vocabulary ID, which is the maximum value;
The calculation means calculates the second feature information for each node of the hierarchical search condition obtained by the analysis means based on the vocabulary ID of the vocabulary included in the search key of the node; and
The second deletion means deletes, from the document IDs and the element IDs found by the index search means, each document ID and element ID for which the vocabulary ID calculated by the calculation means for the search key included in the node of the hierarchical search condition used in the search does not match the third feature information acquired by the second acquisition means.
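
As a non-limiting sketch, the second deletion can be pictured as a per-element filter applied to the index search results. The sketch assumes that the third feature information is simply the pair of minimum and maximum vocabulary IDs of each element; the claim only states that it is calculated based on those two values, so another encoding is equally possible. All names and data are hypothetical.

```python
# Hypothetical per-element index entries:
# (document ID, element ID) -> (minimum vocabulary ID, maximum vocabulary ID).
ELEMENT_RANGE = {("d1", 11): (6, 9), ("d2", 20): (10, 12)}


def second_deletion(hits, key_vocab_id):
    """Drop index hits whose element-level range cannot contain the key's vocabulary ID."""
    kept = []
    for hit in hits:
        low, high = ELEMENT_RANGE[hit]
        if low <= key_vocab_id <= high:
            kept.append(hit)
    return kept
```

Compared with the structure-level check, this filter works on individual elements, so it can discard hits that share a surviving structure ID but cannot contain the search key.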

The structured document search apparatus according to claim 7, wherein the index storage means stores an index that associates the document ID, the element ID, the structure ID of the structural element corresponding to the element, the appearance position of the vocabulary included in the element, and the third feature information representing the feature of the element for each appearance position.

The structured document search apparatus according to claim 2, wherein the structure information storage means stores the structure information further associated with a condition, based on the first feature information, for dividing the structural element, the apparatus further comprising:
Receiving means for receiving input of the structured document;
Extraction means for extracting a hierarchical structure including at least one structural element from the structured document received by the receiving means;
Structure information determination means for determining the structure ID corresponding to each structural element included in the hierarchical structure extracted by the extraction means;
Determination means for determining whether the first feature information of the element corresponding to the structural element included in the hierarchical structure extracted by the extraction means satisfies the condition corresponding to the structure ID determined by the structure information determination means; and
Structure information dividing means for, when the determination means determines that the condition is satisfied, dividing the structure information corresponding to the structure ID determined by the structure information determination means to create new structure information, and registering the new structure information in the structure information storage means.

The structured document search apparatus according to claim 9, wherein the structure information storage means uses, as the first feature information, the maximum character string length, which is the maximum value of the character string length of the element corresponding to the structural element, and stores the structure information further associated with the condition that the structural element is divided when the maximum character string length exceeds a predetermined threshold.
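
The following is a minimal sketch of one possible reading of this division condition, not the claimed implementation. It assumes that, when an incoming element would push the maximum character string length of a structure past the threshold, a new structure ID is created for the same path and long elements are registered under it. The class names, the use of a path string to identify a structural element, and the routing rule are all hypothetical.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class StructureInfo:
    path: str
    max_len: int = 0  # first feature information: maximum element length


class StructureRegistry:
    def __init__(self, threshold):
        self.threshold = threshold
        self.infos = {}            # structure ID -> StructureInfo
        self._next_id = count(1)
        self._base = {}            # path -> original structure ID
        self._split = {}           # path -> structure ID created by division

    def _new(self, path):
        sid = next(self._next_id)
        self.infos[sid] = StructureInfo(path)
        return sid

    def register(self, path, text):
        """Return the structure ID assigned to an element with this path and text."""
        if path not in self._base:
            self._base[path] = self._new(path)
        sid = self._base[path]
        if len(text) > self.threshold:
            # Division condition met: route the element to a separate structure ID.
            if path not in self._split:
                self._split[path] = self._new(path)
            sid = self._split[path]
        info = self.infos[sid]
        info.max_len = max(info.max_len, len(text))
        return sid
```

Under this reading, the original structure ID keeps a small maximum length, so a long search key can still eliminate it during candidate pruning even after long elements have been registered for the same path.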

The structured document search apparatus according to claim 9, wherein the structure information storage means uses, as the first feature information, the maximum vocabulary number, which is the maximum value of the number of vocabularies in the element corresponding to the structural element, and stores the structure information further associated with the condition that the structural element is divided when the maximum vocabulary number exceeds a predetermined threshold.

The structured document search apparatus according to claim 9, wherein the structure information storage means uses, as the first feature information, the number of times that the character string length of the element corresponding to the structural element has exceeded a predetermined first threshold, and stores the structure information further associated with the condition that the structural element is divided when that number of times exceeds a predetermined second threshold.

The structured document search apparatus according to claim 9, wherein:
The index storage means stores the index further associating, for each vocabulary that is a character string included in the element, a vocabulary ID that uniquely identifies the vocabulary with the occurrence frequency of the vocabulary in all structured documents of the structured document storage means; and
The structure information storage means stores the structure information further associated with the condition that the structural element is divided when the number of times that a vocabulary whose occurrence frequency stored in the index storage means is greater than a predetermined first threshold appears in the element corresponding to the structural element exceeds a predetermined second threshold.

A structured document search apparatus executes:
An analysis step of accepting input of a search condition for a structured document, which is a document having a hierarchical logical structure and includes elements that are actual information corresponding to structural elements that are units of the logical structure, analyzing the accepted search condition, and obtaining a hierarchical search condition having nodes that are structural units corresponding to the structural elements of the structured document, each node associating candidate structure IDs, which uniquely identify the structural elements of the structured document to be searched, with a search key for those candidates;
An acquisition step of acquiring, for each node of the hierarchical search condition obtained in the analysis step, the first feature information associated with each candidate structure ID included in the node from structure information storage means that stores structure information in which the structure ID is associated with first feature information representing a feature of the element;
A calculation step of calculating, for each node of the hierarchical search condition obtained in the analysis step, second feature information representing a feature of the search key included in the node;
A deletion step of deleting, for each node of the hierarchical search condition obtained in the analysis step, from among the candidate structure IDs included in the node, each candidate for which the second feature information calculated in the calculation step for the corresponding search key does not match the first feature information acquired in the acquisition step;
An index search step of searching index storage means, which stores an index that associates a document ID that uniquely identifies the structured document, an element ID that uniquely identifies the element, and the structure ID of the structural element corresponding to the element, for the document ID and the element ID corresponding to each structure ID that satisfies the hierarchical search condition, based on the hierarchical search condition from which the deletion step has deleted structure ID candidates;
A search step of searching structured document storage means, which stores the structured document in association with the document ID, for the structured document corresponding to the document ID found in the index search step;
A structured document search method characterized in that the above steps are executed.