JP2009187224A

JP2009187224A - Information processor and information processing program

Info

Publication number: JP2009187224A
Application number: JP2008025635A
Authority: JP
Inventors: Kazutaka Hayashi; 千登林
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2008-02-05
Filing date: 2008-02-05
Publication date: 2009-08-20

Abstract

PROBLEM TO BE SOLVED: To provide an information processing apparatus that, when searching for structure patterns that appear multiple times in a plurality of tree structures, performs the search by recursive processing for each place where a node is added.
SOLUTION: A first search means of the information processing apparatus searches for a structure pattern that appears multiple times in a plurality of tree structures among the nodes below the node currently being processed in the tree structure. A second search means searches for the structure pattern with respect to a node above the node currently being processed in the tree structure, for each unsearched node below that higher node. When no nodes remain to be searched, the first search means and the second search means return to the original node from which the search was started.
COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to an information processing apparatus and an information processing program.

In addition to dramatic increases in computer processing power and storage capacity, the spread of IT and networking has made it easy to collect large amounts of information. Expectations are growing for discovering information about market opportunities and risks at an early stage, and for uncovering hidden knowledge, from the collected information.
However, the amount of collected information often far exceeds human processing capacity. For this reason, discovering risks or extracting and using knowledge from the large volume of collected information has in practice been difficult and labor-intensive.

Meanwhile, advances in techniques such as pattern mining have made it possible to extract information, for example patterns of products that are purchased together, from such large amounts of information. Techniques for extracting patterns of items purchased together and patterns in the order of purchase have attracted attention and been researched and developed in response to demands such as the analysis of customer purchasing behavior. Recently, as various kinds of information have become increasingly structured or semi-structured, pattern mining techniques that extract patterns having a structure, such as a tree structure, have been attracting attention. Among pattern mining techniques that extract structural information, tree structures in particular are used to structure many kinds of information, such as document structuring and knowledge representation, beginning with XML (eXtensible Markup Language), so expectations for tree pattern extraction are high.

Techniques for extracting subtree patterns from a set of tree-structured data fall broadly into two groups: induced subtree mining, which extracts only structures whose parent-child relationships match exactly, and embedded subtree mining, which extracts a structure as long as the ancestor-descendant relationships hold, even if the parent-child relationships are somewhat disturbed.
In data generated in the real world, for example records of human operations such as document operation histories, the parent-child relationships in the operation history data often do not match even when a person believes they performed the work in the same way. In such cases it is desirable to apply embedded subtree mining. Because real-world data often exhibits this kind of variation, expectations for embedded subtree mining are high. Disclosed techniques that realize embedded subtree mining include TreeMiner, Dryade, and MB3-Miner.
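The difference between the two matching criteria can be sketched as follows. This is a minimal illustration and is not taken from the specification: the three-node data tree and the helper names `is_parent` and `is_ancestor` are hypothetical.

```python
# Hypothetical data tree A -> B -> C, stored as a child-to-parent map.
parent = {"A": None, "B": "A", "C": "B"}


def is_parent(p, c):
    """Induced-style test: p must be the direct parent of c."""
    return parent[c] == p


def is_ancestor(a, d):
    """Embedded-style test: a may be any ancestor of d."""
    node = parent[d]
    while node is not None:
        if node == a:
            return True
        node = parent[node]
    return False


# Pattern edge A -> C against the data tree A -> B -> C.
induced_match = is_parent("A", "C")     # B sits between A and C
embedded_match = is_ancestor("A", "C")  # A is still an ancestor of C
print(induced_match, embedded_match)
```

Under these assumptions the pattern edge A -> C is rejected by the parent-child test but accepted by the ancestor-descendant test, which is the kind of variation embedded subtree mining is meant to tolerate.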

As a related technique, for example, Patent Document 1 addresses the problem of providing a method and system for detecting important patterns contained in a set of data. It discloses a system that detects frequent patterns from a database containing a data set represented as tree-structured data, using candidate patterns to be counted, the system comprising (1) means for counting patterns in the database that match the candidate patterns, (2) means for detecting patterns with a high frequency of appearance from the counts, and (3) means for generating, from the detected patterns, the candidate patterns to be counted next.

Also, for example, Patent Document 2 addresses the problem of providing an extraction device and the like suitable for extracting patterns that appear frequently in ordered trees. It discloses the following. An input reception unit of the extraction device receives one or more ordered trees as input, a conversion unit converts each received ordered tree into a sequence representation, and an extraction unit extracts, from the patterns contained in the converted sequence representations, those that appear at or above a predetermined frequency. The sequence representation is formed by a depth-first traversal of the ordered tree, placing a mark representing the name of each node passed when advancing along a branch and a backtrack mark when returning along a branch. A pattern is the sequence of marks encountered from a first mark to a last mark when, taking any of the name marks in the sequence representation as the first mark, projection is repeated zero or more times from it. Whether a projection holds is determined from the sequence context of the mark sequence and the value of the projection context.
Patent Document 1: JP 2001-134575 A
Patent Document 2: JP 2004-355457 A
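The kind of sequence representation described for Patent Document 2 (name marks on the way down, backtrack marks on the way back) can be illustrated with a short sketch. It is not the encoding of that document: the nested-tuple tree format, the `^` backtrack token, and the function name `encode` are assumptions made for illustration.

```python
BACKTRACK = "^"  # arbitrary token chosen for this sketch


def encode(tree):
    """Depth-first encoding of an ordered tree given as (label, [children])."""
    label, children = tree
    seq = [label]              # name mark when the node is reached
    for child in children:
        seq.extend(encode(child))
        seq.append(BACKTRACK)  # backtrack mark when returning from the child
    return seq


# Hypothetical ordered tree: A has children B and D; B has child C.
example = ("A", [("B", [("C", [])]), ("D", [])])
encoded = encode(example)
print(" ".join(encoded))
```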

An object of the present invention is to provide an information processing apparatus and an information processing program that, in searching for structure patterns that appear multiple times in a plurality of tree structures, perform the search by recursive processing for each place where a node is added.

The gist of the present invention for achieving this object lies in the inventions set out in the following items.
The information processing apparatus of claim 1 comprises: a first search means that performs a search for a structure pattern appearing multiple times in a plurality of tree structures with respect to nodes below the node currently being processed in the tree structure; and a second search means that performs the search for the structure pattern with respect to a node above the node currently being processed in the tree structure, for each unsearched node below that higher node, wherein the first search means and the second search means return to the original node from which the search was started when no nodes remain to be searched.

The information processing apparatus of claim 2 is the information processing apparatus according to claim 1, further comprising a first search range determination means that determines the range in which to search for descendants based on the nodes in the tree structure that match the node currently being processed in the structure pattern, using the topmost node for any such nodes that are hierarchically related to one another, wherein the first search means performs the search based on the range determined by the first search range determination means.

The information processing apparatus of claim 3 is the information processing apparatus according to claim 1, further comprising a second search range determination means that determines the search range based on the nodes in the tree structure that match the parent node in the structure pattern, using the topmost node for any such nodes that are hierarchically related to one another, and on the nodes in the tree structure that match the child node in the structure pattern, using the bottommost node for any such nodes that are hierarchically related to one another, wherein the second search means performs the search based on the range determined by the second search range determination means.

The information processing apparatus of claim 4 is the information processing apparatus according to claim 3, further comprising a holding means that holds, from among the appearance locations of the ancestor nodes of the node currently being processed in the structure pattern, the appearance locations corresponding to nodes that include among their descendants a node at an appearance location of the node currently being processed.

The information processing apparatus of claim 5 is the information processing apparatus according to claim 4, further comprising a deletion means that deletes from the holding means the appearance locations other than the topmost and bottommost ones in a branch-free range of the tree structure.

The information processing apparatus of claim 6 is the information processing apparatus according to claim 4, further comprising a selection means that, for the appearance locations to be held by the holding means, selects those from which everything other than the appearance locations other than the topmost and bottommost ones in a branch-free range of the tree structure has been deleted.

The information processing program of claim 7 causes a computer to function as: a first search means that performs a search for a structure pattern appearing multiple times in a plurality of tree structures with respect to nodes below the node currently being processed in the tree structure; and a second search means that performs the search for the structure pattern with respect to a node above the node currently being processed in the tree structure, for each unsearched node below that higher node, wherein the first search means and the second search means return to the original node from which the search was started when no nodes remain to be searched.

According to the information processing apparatus of claim 1, in searching for structure patterns that appear multiple times in a plurality of tree structures, the search can be performed by recursive processing for each place where a node is added.

According to the information processing apparatus of claim 2, increases in storage capacity and processing time in the search processing can be suppressed.

According to the information processing apparatus of claim 3, increases in storage capacity and processing time in the search processing can be suppressed.

According to the information processing apparatus of claim 4, the search state can be held and the recursive processing in the search can be executed.

According to the information processing apparatus of claim 5, the storage capacity used in the search processing can be further reduced.

According to the information processing apparatus of claim 6, the storage capacity used in the search processing can be further reduced.

According to the information processing program of claim 7, in searching for structure patterns that appear multiple times in a plurality of tree structures, the search can be performed by recursive processing for each place where a node is added.

First, the TreeMiner, Dryade, and MB3-Miner techniques mentioned above are described.
Dryade has the functional restriction that sibling nodes may not include identical ones, so it is not suited to mining data such as document operation histories, in which that situation occurs frequently.
MB3-Miner uses breadth-first search, and the number of pattern candidates prepared for each level of processing becomes enormous, which limits its applicability to large-scale data.
TreeMiner is claimed to be a depth-first search, but in practice it passes all of the branch candidates that arise along the path from the root of the tree to a leaf node down into the depth of the recursion. The same problem as with breadth-first search therefore arises: when applied to large-scale data, candidate generation produces an enormous number of pattern candidates and processing becomes infeasible.

In addition, MB3-Miner and TreeMiner manage the appearance positions of a tree (the correspondence between each node in the pattern tree and the nodes in the data tree). In embedded subtree mining, however, the number of such positions grows exponentially because of the combinations of descendant node candidates that spread in the depth direction. TreeMiner includes a countermeasure for this in TreeMinerD, but in practice it does not work well enough.
This situation is not limited to document operation histories. In protein structure data, for example, it is not unusual for nodes with the same label to appear many times within a single sequence. In such cases a large amount of information must be managed for the combinations of appearance positions. This increases both the storage capacity required for processing and the processing cost caused by the growth in data to be processed, making it difficult to process large-scale data with realistic time and resources.
In other words, conventional techniques performed the search after converting the tree structure into a string format or into something equivalent. The present embodiment instead searches the tree structure itself.

The present embodiment relates to a technique for extracting relational structures that appear multiple times (subtree patterns; hereinafter also called structure patterns, structure pattern trees, or patterns) from a group of data in which the relationships set between elements can be treated as tree structures.
The present embodiment is an information extraction apparatus that performs search processing by depth-first search using a node appearance information management mechanism. This mechanism uses data on the paths from the root node to the leaf nodes (hereinafter also called Epaths) and on the range of Epaths to which each node belongs (hereinafter also called EpRange), and manages this data in step with the search.
Furthermore, to improve the efficiency of pattern extraction, the apparatus includes a tree data management mechanism that uses the EpRange.

An example of a preferred embodiment for realizing the present invention is described below with reference to the drawings.
FIG. 1 is a conceptual configuration diagram of a system suitable for applying the present embodiment. The system has a structure information DB 110, an information collection device 120, an information extraction device 130, and an extracted information management device 140. These are connected via communication lines. All of these devices may be built into a single device, or some of them may be built into a single device.

The structure information DB 110 is connected to the information collection device 120 and the information extraction device 130. It accumulates and manages the tree-structured data received from the information collection device 120, which is the data from which information is to be extracted, and is accessed by the information extraction device 130.
The information collection device 120 is connected to the structure information DB 110. It collects information from other devices (not shown), or receives information sent from other devices (not shown), shapes the information if necessary (converting it into tree-structured data that the information extraction device 130 can handle), and stores it in the structure information DB 110.
The information extraction device 130 is connected to the structure information DB 110 and the extracted information management device 140. It accesses the structure information DB 110, extracts frequent information from the tree-structured data, and sends it to the extracted information management device 140.
The extracted information management device 140 is connected to the information extraction device 130. It receives the frequent information sent from the information extraction device 130 and accumulates it or sends it to other devices (not shown) such as a display device or a printing device.

FIG. 2 is a conceptual module configuration diagram of an example configuration of the information extraction device 130 of the present embodiment.
A module generally refers to a logically separable component such as software (a computer program) or hardware. Accordingly, a module in the present embodiment refers not only to a module in a computer program but also to a module in a hardware configuration. The present embodiment therefore also serves as a description of a computer program, a system, and a method. For convenience of explanation, the terms "store", "cause to store", and their equivalents are used; when the embodiment is a computer program, these terms mean storing in a storage device or controlling so as to store in a storage device. Modules correspond almost one-to-one to functions, but in implementation one module may consist of one program, multiple modules may consist of one program, or conversely one module may consist of multiple programs. Multiple modules may be executed by one computer, or one module may be executed by multiple computers in a distributed or parallel environment. One module may contain another module. In the following, "connection" covers logical connections (exchange of data, instructions, reference relationships between data, and so on) as well as physical connections.
A system or apparatus may be configured by connecting multiple computers, pieces of hardware, devices, and the like through communication means such as a network (including one-to-one communication connections), or may be realized by a single computer, piece of hardware, device, or the like. "Apparatus" and "system" are used as synonymous terms.

As shown in FIG. 2, the information extraction device 130 shown in FIG. 1 has a structure information management module 210, an appearance information selection module 220, an appearance information management module 230, an investigation range processing module 240, a search processing module 250, a search state management module 260, and an extracted information processing module 270.

The structure information management module 210 is connected to the appearance information selection module 220, the appearance information management module 230, and the investigation range processing module 240. It examines the structure information in the structure information DB 110 as specified by the investigation range processing module 240 and tallies the appearances of nodes in the tree-structured data for each label. Depending on the configuration, it may also collect appearance position information for each label. The resulting appearance information is sent to the appearance information management module 230. The appearance information includes both the appearance locations and a count indicating in how many trees the label appeared. If necessary, the structure information management module 210 may have a storage means (not shown) and may convert the structure information in the structure information DB 110 into a data structure suited to processing and store it.

The appearance information selection module 220 is connected to the structure information management module 210 and the appearance information management module 230, and selects the labels to be targeted based on the appearance information generated by the structure information management module 210. That is, it receives the per-label appearance information for the nodes whose appearance was confirmed by the structure information management module 210 and separates the elements that satisfy a predetermined criterion (for example, appearing at least a preset number of times) from those that do not. Alternatively, it organizes the appearance information by, for example, discarding appearance information that does not meet conditions specified by the user through an input device (not shown).
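The per-label tallying attributed to the structure information management module 210 and the threshold-based selection attributed to the appearance information selection module 220 can be sketched as follows. This is an illustration under assumptions, not the implementation of the embodiment: the input format (tree identifier mapped to a node-to-label dictionary), the sample trees, and the function names `collect_appearances` and `select_frequent` are all hypothetical.

```python
from collections import defaultdict


def collect_appearances(trees):
    """Map each label to {tree id: [node ids where the label appears]}."""
    appearances = defaultdict(lambda: defaultdict(list))
    for tree_id, labels in trees.items():
        for node_id, label in labels.items():
            appearances[label][tree_id].append(node_id)
    return appearances


def select_frequent(appearances, min_trees):
    """Keep only labels that appear in at least min_trees different trees."""
    return {
        label: {tree_id: list(nodes) for tree_id, nodes in per_tree.items()}
        for label, per_tree in appearances.items()
        if len(per_tree) >= min_trees
    }


# Hypothetical input: three small trees, node id -> label.
trees = {
    "T1": {"v0": "A", "v1": "B", "v2": "A"},
    "T2": {"v0": "A", "v1": "C"},
    "T3": {"v0": "B", "v1": "D"},
}
frequent = select_frequent(collect_appearances(trees), min_trees=2)
print(sorted(frequent))
```

Here the count of trees, rather than the raw number of nodes, is used as the criterion, matching the statement that the appearance information records in how many trees a label appeared.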

The appearance information management module 230 is connected to the appearance information selection module 220, the search processing module 250, and the structure information management module 210. It receives and manages the node labels and appearance information selected by the appearance information selection module 220, and sends this information to the search processing module 250 in order, in response to requests from the search processing module 250.

The investigation range processing module 240 is connected to the structure information management module 210, the search processing module 250, and the search state management module 260. It determines the search range for a structure pattern within the tree-structured data based on the search state stored in the search state management module 260 and on the appearance information that was generated by the structure information management module 210 and then copied or moved from the appearance information management module 230 into the search state management module 260 in accordance with instructions from the search processing module 250.

The search processing module 250 is connected to the appearance information management module 230, the investigation range processing module 240, the search state management module 260, and the extracted information processing module 270, and controls the processing performed by these modules. Based on the search range determined by the investigation range processing module 240, it obtains from the appearance information management module 230 the labels and appearance information that were extracted by the structure information management module 210 and selected by the appearance information selection module 220, searches for structure patterns that appear multiple times in the tree-structured data, and passes the search results to the extracted information processing module 270. This search processing is executed recursively, updating the search state in the search state management module 260 as needed.

The search state management module 260 is connected to the investigation range processing module 240 and the search processing module 250, and stores and manages the intermediate state of the search processing. The stored intermediate state is used in the recursive search processing performed by the search processing module 250. The module is responsible for holding the pattern candidate information being processed by the search processing module 250 and the appearance information, within each piece of structure information, of the nodes in the structure pattern, and for updating and restoring the appearance information as processing proceeds.
The extracted information processing module 270 is connected to the search processing module 250. It receives the extracted patterns from the search processing module 250 and is responsible for storing them in a storage device (not shown) and for output processing to an output device (not shown).

FIG. 3 is a flowchart showing an example of processing according to the present embodiment.
In step S302 (preparation of scanning information), the structure information management module 210 converts the structure information stored in the structure information DB 110 and stores it in a storage means (not shown), as preparation for executing the subsequent processing efficiently.

図８は、説明のために用いる木構造データの例である。
例示した３つの木構造データＴｒ１（図８（ａ）），Ｔｒ２（図８（ｂ）），Ｔｒ３（図８（ｃ））には、それぞれ１６個、１６個、１１個のノードがある。説明の簡易化のために、兄弟間には順序関係があるとし、深さ優先探索でノードを辿った場合の順番をもとに、ノードにｖ０、ｖ１、・・・と識別子を設定した。以降、木構造を示す必要がない場合には、単にｖ０やｖ２等の表記でノードを指し、どの木データであるかを示す必要がある場合にはｖ０_Ｔｒ１、ｖ２_Ｔｒ２のようにそれぞれのノードが属する木データの識別子を添えて表記する。
図８に示した例では、ノードのラベルをＡ，Ｂ，Ｃ，・・・とした。例に現れるノードのラベルは、Ａ，Ｂ，Ｃ，Ｄ，Ｅ，Ｆ，Ｇ，Ｈ，Ｉ，Ｊ，Ｋである。図８の丸内のアルファベットはそのノードのラベルである。 FIG. 8 is an example of tree structure data used for explanation.
The three illustrated pieces of tree structure data, Tr1 (FIG. 8A), Tr2 (FIG. 8B), and Tr3 (FIG. 8C), have 16, 16, and 11 nodes, respectively. To simplify the explanation, siblings are assumed to be ordered, and the identifiers v0, v1, ... are assigned to the nodes in the order in which they are visited by a depth-first traversal. Hereafter, when the tree does not need to be indicated, a node is referred to simply as v0, v2, and so on; when the tree does need to be indicated, the identifier of the tree data to which the node belongs is appended, as in v0_Tr1 or v2_Tr2.
In the example shown in FIG. 8, the nodes are labeled A, B, C, .... The labels that appear in the example are A, B, C, D, E, F, G, H, I, J, and K. The letter inside each circle in FIG. 8 is the label of that node.

また、パターン抽出の条件を２つ以上の木構造データで出現するパターンとして説明する。
図９に示す例は、それぞれの木構造データについて、それぞれのルートノードからリーフノードまでのパスを示した図である。図中でそれぞれのパスにＥｐ１〜Ｅｐ６までの識別子をつけている。これも同様に木を示す必要がある場合には、Ｅｐ１_Ｔｒ１のように木データの識別子を添えて表記する。例えばＥｐ４_Ｔｒ１（Ｔｒ１のＥｐ４）は、ｖ０，ｖ１０，ｖ１１，ｖ１２，ｖ１３を通るパスであり、Ｅｐ２_Ｔｒ３（Ｔｒ３のＥｐ２）は、ｖ０，ｖ１，ｖ２，ｖ３，ｖ４，ｖ６を通るパスである。
Ｅｐａｔｈをこのように設定することで、各ノードは少なくとも１つのＥｐａｔｈ上にあり、ノードによっては複数のＥｐａｔｈ上にある。例えば、ｖ１２_Ｔｒ１は、Ｅｐ４_Ｔｒ１とＥｐ５_Ｔｒ１の上にあり、ｖ７_Ｔｒ２は、Ｅｐ３_Ｔｒ２とＥｐ４_Ｔｒ２とＥｐ５_Ｔｒ２の上にある。 Also, the pattern extraction condition will be described as a pattern appearing in two or more tree structure data.
FIG. 9 shows, for each piece of tree structure data, the paths from the root node to each leaf node. In the figure, the paths are given the identifiers Ep1 to Ep6. Here too, when the tree needs to be indicated, the identifier of the tree data is appended, as in Ep1_Tr1. For example, Ep4_Tr1 (Ep4 of Tr1) is the path through v0, v10, v11, v12, and v13, and Ep2_Tr3 (Ep2 of Tr3) is the path through v0, v1, v2, v3, v4, and v6.
With the Epaths defined in this way, every node lies on at least one Epath, and some nodes lie on several. For example, v12_Tr1 is on Ep4_Tr1 and Ep5_Tr1, and v7_Tr2 is on Ep3_Tr2, Ep4_Tr2, and Ep5_Tr2.
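The numbering of nodes and the enumeration of Epaths can be illustrated with a minimal sketch. The `Node` class, the function names, and the small tree used here are assumptions made for illustration; the tree is not one of Tr1 to Tr3 from FIG. 8.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)
    ident: int = -1  # depth-first identifier (the number in v0, v1, ...)


def number_depth_first(root):
    """Assign identifiers 0, 1, 2, ... in depth-first (pre-order) order."""
    counter = 0
    stack = [root]
    while stack:
        node = stack.pop()
        node.ident = counter
        counter += 1
        # Push children in reverse so the leftmost sibling is visited first.
        stack.extend(reversed(node.children))
    return counter


def enumerate_epaths(root):
    """Return every root-to-leaf path as a list of identifiers, left to right."""
    paths = []

    def walk(node, prefix):
        prefix = prefix + [node.ident]
        if not node.children:
            paths.append(prefix)
        for child in node.children:
            walk(child, prefix)

    walk(root, [])
    return paths


# Hypothetical ordered tree A[B[C,D],E]
tree = Node("A", [Node("B", [Node("C"), Node("D")]), Node("E")])
node_count = number_depth_first(tree)
epaths = enumerate_epaths(tree)  # epaths[0] plays the role of Ep1, and so on
```

In this sketch the root (identifier 0) lies on all three paths, while each leaf lies on exactly one, which mirrors the statement that every node is on at least one Epath.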

各ノードが、どのＥＰａｔｈ上にあるかを示す情報ＥｐＲａｎｇｅを各ノードの関数として定義する。例えば、ＥｐＲａｎｇｅ（ｖ１２_Ｔｒ１）＝｛Ｅｐ４_Ｔｒ１，Ｅｐ５_Ｔｒ１｝であり、ＥｐＲａｎｇｅ（ｖ７_Ｔｒ２）＝｛Ｅｐ３_Ｔｒ２，Ｅｐ４_Ｔｒ２，Ｅｐ５_Ｔｒ２｝である。
ここで、対象としている木構造は順序木を仮定しているため、ＥｐＲａｎｇｅに現れるＥＰａｔｈの番号は、連続する。そこで、説明の簡易化のためＥｐＲａｎｇｅを単純にＥＰａｔｈの番号の一番小さいものと一番大きいもので示すこととする。すなわち、次のように表記する。例えば、ＥｐＲａｎｇｅ（ｖ１２_Ｔｒ１）＝［４，５］、ＥｐＲａｎｇｅ（ｖ７_Ｔｒ２）＝［３，５］である。
このとき、ＥｐＲａｎｇｅの小さい側の番号をＥｐＲａｎｇｅＬ、大きい側の番号をＥｐＲａｎｇｅＲとして同様に関数で表わす。例えば、ＥｐＲａｎｇｅＬ（ｖ１２_Ｔｒ１）＝４、ＥｐＲａｎｇｅＬ（ｖ７_Ｔｒ２）＝３であり、ＥｐＲａｎｇｅＲ（ｖ１２_Ｔｒ１）＝５、ＥｐＲａｎｇｅＲ（ｖ７_Ｔｒ２）＝５である。 EpRange, the information indicating which Epaths each node lies on, is defined as a function of the node. For example, EpRange(v12_Tr1) = {Ep4_Tr1, Ep5_Tr1} and EpRange(v7_Tr2) = {Ep3_Tr2, Ep4_Tr2, Ep5_Tr2}.
Because the target tree structures are assumed to be ordered trees, the Epath numbers that appear in an EpRange are consecutive. To simplify the description, an EpRange is therefore written simply as its smallest and largest Epath numbers. For example, EpRange(v12_Tr1) = [4, 5] and EpRange(v7_Tr2) = [3, 5].
The smaller number of an EpRange is called EpRangeL and the larger number EpRangeR, and both are likewise written as functions. For example, EpRangeL(v12_Tr1) = 4 and EpRangeL(v7_Tr2) = 3, while EpRangeR(v12_Tr1) = 5 and EpRangeR(v7_Tr2) = 5.
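As a rough illustration of how EpRangeL and EpRangeR could be obtained, the following sketch computes both values for every node in a single depth-first pass. The nested-tuple tree representation, the function name `ep_ranges`, and the sample tree are assumptions; the sample is not taken from FIG. 8.

```python
def ep_ranges(tree):
    """tree is a nested tuple (label, [children]).

    Returns a list of (identifier, label, EpRangeL, EpRangeR) in depth-first
    order, with Epath numbers starting at 1 as in Ep1, Ep2, ...
    """
    records = []
    next_path = 1

    def walk(node):
        nonlocal next_path
        label, children = node
        index = len(records)   # depth-first identifier of this node
        records.append(None)   # reserve the slot to keep depth-first order
        left = next_path       # first Epath that passes through this node
        if not children:
            next_path += 1     # a leaf closes exactly one Epath
        for child in children:
            walk(child)
        right = next_path - 1  # last Epath that passes through this node
        records[index] = (f"v{index}", label, left, right)

    walk(tree)
    return records


# Hypothetical ordered tree A[B[C,D],E]
tree = ("A", [("B", [("C", []), ("D", [])]), ("E", [])])
records = ep_ranges(tree)
```

Because the Epaths through a node of an ordered tree are consecutive, the pair (EpRangeL, EpRangeR) is enough to describe the whole EpRange, which is why the sketch keeps only these two numbers.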

また、ＥｐＲａｎｇｅをノードの関数としてではなく、単にＥＰａｔｈの範囲として参照することも行う。例えば、｛Ｅｐ２_Ｔｒ３，Ｅｐ３_Ｔｒ３，Ｅｐ４_Ｔｒ３｝を参照したい際に、単にＴｒ３のＥｐＲａｎｇｅ_Ｔｒ３［２，４］と示すことも行う。また同様に、ＥｐＲａｎｇｅＬをＥｐＲａｎｇｅの小さい番号、ＥｐＲａｎｇｅＲをＥｐＲａｎｇｅの大きい番号としても用いる。 EpRange is also used not as a function of a node but simply to refer to a range of Epaths. For example, to refer to {Ep2_Tr3, Ep3_Tr3, Ep4_Tr3}, it may be written simply as EpRange_Tr3[2, 4] of Tr3. Similarly, EpRangeL is also used to mean the smaller number of an EpRange, and EpRangeR the larger number.

図１０に、図８に示した例のＴｒ１，Ｔｒ２，Ｔｒ３中の各ノードについて、（ノード識別子、ラベル、ＥｐＲａｎｇｅＬ，ＥｐＲａｎｇｅＲ）の組を示した。図１０（ａ）では、それぞれノードの識別子１００１ａ、ノードのラベル１００２ａ、ＥｐＲａｎｇｅＬ１００３ａ、ＥｐＲａｎｇｅＲ１００４ａが該当する。例えば、図１０（ａ）に示すＴｒ１のノード識別子：Ｖ０は、Ａというノードのラベルであり、ＥｐＲａｎｇｅＬは１、ＥｐＲａｎｇｅＲは６である。 FIG. 10 shows the tuple (node identifier, label, EpRangeL, EpRangeR) for each node in Tr1, Tr2, and Tr3 of the example shown in FIG. 8. In FIG. 10A, these correspond to the node identifier 1001a, the node label 1002a, EpRangeL 1003a, and EpRangeR 1004a, respectively. For example, the node with identifier V0 in Tr1, shown in FIG. 10A, has the label A, an EpRangeL of 1, and an EpRangeR of 6.

構造情報管理モジュール２１０は、各木構造データに対してノードの存在範囲やＥｐＲａｎｇｅ範囲を指定してノードを走査する機能を有するために、図示しない内部の記憶手段に各ツリー情報中の必要な情報を別のデータ構造で保持することもできる。 So that it can scan the nodes of each piece of tree structure data by specifying a node existence range or an EpRange, the structure information management module 210 may also hold the necessary information from each piece of tree information in a separate data structure in internal storage means (not shown).

図１１、図１２、図１３には、ツリー情報を内部に格納する一例を示した。図１１、図１２、図１３は、それぞれＴｒ１，Ｔｒ２，Ｔｒ３について、各ノードをＥｐＲａｎｇｅＬの値から辿りやすく格納した例であり、ツリーを深さ優先に辿りながらそれぞれのノードのＥｐＲａｎｇｅＬの場所にデータを追加することで構成できる。つまり、例えば図１１に示すＴｒ１のＥｐ情報１１００は、Ｅｐ１１１０１、Ｅｐ２１１０２、Ｅｐ３１１０３、Ｅｐ４１１０４、Ｅｐ５１１０５、Ｅｐ６１１０６へのリンク情報を含むものである。また、図示する都合上１１０１，１１０２、１１０３，１１０４，１１０５，１１０６が個別に示してあるが、これらが図１０に示したような順で連続領域に配置されてもよい。 FIGS. 11, 12, and 13 show an example of how the tree information is stored internally. They are examples in which the nodes of Tr1, Tr2, and Tr3, respectively, are stored so that they can easily be traced from the value of EpRangeL; the structure can be built by traversing the tree depth-first and adding the data of each node at the location for its EpRangeL. That is, for example, the Ep information 1100 of Tr1 shown in FIG. 11 includes link information to Ep1 1101, Ep2 1102, Ep3 1103, Ep4 1104, Ep5 1105, and Ep6 1106. Although 1101, 1102, 1103, 1104, 1105, and 1106 are drawn separately for convenience of illustration, they may be placed in a contiguous area in the order shown in FIG. 10.

図１１、図１２、図１３に示すツリー情報例の各ノードに対応するデータは、図１０に示した（ノード識別子、ラベル、ＥｐＲａｎｇｅＬ，ＥｐＲａｎｇｅＲ）の組に加えて、一番右に番号を加えている。この番号は、各データの同じＥｐＲａｎｇｅＬの中での位置を示す番号である。この番号は、同じＥｐＲａｎｇｅＬで、次に下位のデータを探すといった場合に用いるためのものである。
例えばＴｒ２について、ＥＰａｔｈ３，ＥＰａｔｈ４の上にあるノードを列挙したいとすると、ＥｐＲａｎｇｅＬが３あるいは４であるノードを辿って、ＥｐＲａｎｇｅＲが５より大きくならないものを探して列挙するという方法で簡単に実現できる。ただし、Ｅｐ３，Ｅｐ４の上にあるノードは、これだけが全てではなく、ＥｐＲａｎｇｅＬがＥｐ１であるｖ０，ｖ１が残される。しかし、後述する追加処理によって、このｖ０，ｖ１のようなノードも簡単に列挙することができる。
本実施の形態では、ＥｐＲａｎｇｅを指定して効率的にノードを列挙できる方法の一例として図１１、図１２、図１３に示すデータ構造を使って説明を行うが、他のデータ管理方法、例えばバケット分割を用いる方法や探索のための探索木による方法を適用してもよい。 In the tree information examples shown in FIGS. 11, 12, and 13, the data for each node consists of the tuple (node identifier, label, EpRangeL, EpRangeR) shown in FIG. 10 plus a number added at the far right. This number indicates the position of the data item among the items with the same EpRangeL. It is used, for example, when looking for the next lower data item with the same EpRangeL.
For example, to enumerate the nodes of Tr2 that lie on EPath3 and EPath4, it suffices to trace the nodes whose EpRangeL is 3 or 4 and list those whose EpRangeR does not exceed 5. These are not all of the nodes on Ep3 and Ep4, however: v0 and v1, whose EpRangeL is Ep1, are left out. Nodes such as v0 and v1 can nevertheless be enumerated easily by an additional process described later.
In the present embodiment, the data structures shown in FIGS. 11, 12, and 13 are used in the description as one example of a method that can enumerate nodes efficiently for a specified EpRange; other data management methods, for example a method using bucket partitioning or a method using a search tree, may also be applied.
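One possible way to realize storage keyed by EpRangeL is sketched below. The dictionary-of-lists layout, the function names, the zero-based position numbers, and the sample records are all assumptions for illustration; FIGS. 11 to 13 may organize the data differently. The additional process for ancestors that start on an earlier Epath is not covered here.

```python
from collections import defaultdict


def build_buckets(records):
    """Group (identifier, label, EpRangeL, EpRangeR) records by EpRangeL.

    records must be in depth-first order. A position number within the same
    EpRangeL is appended to each record (zero-based in this sketch).
    """
    buckets = defaultdict(list)
    for ident, label, left, right in records:
        position = len(buckets[left])
        buckets[left].append((ident, label, left, right, position))
    return buckets


def nodes_within(buckets, lo, hi):
    """List identifiers of nodes whose whole EpRange lies inside [lo, hi]."""
    found = []
    for left in range(lo, hi + 1):
        for ident, _label, _left, right, _pos in buckets.get(left, []):
            if right <= hi:
                found.append(ident)
    return found


# Hypothetical records for the ordered tree A[B[C,D],E]
records = [
    ("v0", "A", 1, 3),
    ("v1", "B", 1, 2),
    ("v2", "C", 1, 1),
    ("v3", "D", 2, 2),
    ("v4", "E", 3, 3),
]
buckets = build_buckets(records)
in_range = nodes_within(buckets, 2, 3)
```

As in the Tr2 example in the text, the enumeration misses ancestors such as v0 and v1, whose EpRangeL lies before the requested range even though they are on the requested paths.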

ステップＳ３０４（要素の出現を集計）では、構造情報管理モジュール２１０によって、各ラベルがいくつのツリーで出現したかを集計する。この処理は前述のステップＳ３０２の際に同時に実行することもできる。処理の結果、各ラベルがいくつの木構造データで出現したかを判定できる情報を得て、パターン抽出の条件に見合うものだけを残した頻出要素（本実施の形態ではノードのラベル）のリストを作成する。構成により、各頻出要素の出現位置情報（以降簡単に出現情報とも呼ぶ）も合わせて処理結果となる。
ここでの例では、Ａ，Ｂ，Ｃ，Ｄ，Ｆが頻出要素のリストに残る。この頻出要素のリストにしたがい、リスト中の頻出要素それぞれをルートノードとした構造パターンを抽出する処理を以降の繰り返し処理で行う。 In step S304 (aggregation of element occurrences), the structure information management module 210 counts the number of trees in which each label appears. This processing can also be executed together with step S302 described above. The processing yields information from which the number of pieces of tree structure data containing each label can be determined, and a list of frequent elements (node labels in the present embodiment) is created that retains only those satisfying the pattern extraction condition. Depending on the configuration, the appearance position information of each frequent element (hereinafter also simply called appearance information) is included in the processing result.
In this example, A, B, C, D, and F remain in the list of frequent elements. Following this list, the processing that extracts structure patterns having each frequent element in the list as the root node is carried out in the subsequent iterative processing.
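The counting in step S304 can be sketched as follows, assuming the extraction condition stated earlier (a label must appear in two or more pieces of tree structure data). The function name, the `min_trees` parameter, and the sample label lists are hypothetical and do not reproduce Tr1 to Tr3.

```python
from collections import Counter


def frequent_labels(trees, min_trees=2):
    """Return labels that appear in at least min_trees different trees.

    trees maps a tree identifier to the list of labels of its nodes.
    """
    support = Counter()
    for labels in trees.values():
        # A label counts once per tree, however often it occurs inside it.
        support.update(set(labels))
    return sorted(label for label, n in support.items() if n >= min_trees)


# Hypothetical label lists
trees = {
    "T1": ["A", "B", "A", "C"],
    "T2": ["A", "C", "D"],
    "T3": ["B", "E"],
}
frequent = frequent_labels(trees)
```

Counting each label once per tree, rather than once per node, matches the condition that a pattern must appear in a given number of trees.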

ステップＳ３０６（頻出要素処理終了？）では、頻出要素のリスト内のノードのラベル（例に示したＡ，Ｂ，Ｃ，Ｄ，Ｆ）それぞれについて処理が繰り返されたことを、検査し、処理の流れを制御する。頻出要素のリスト中の全てのラベルについて処理が終わったところで、処理を終了する（ステップＳ３１４）。 In step S306 (frequent element processing finished?), it is checked whether the processing has been repeated for each of the node labels in the list of frequent elements (A, B, C, D, and F in the example), and the flow of processing is controlled accordingly. When all labels in the list of frequent elements have been processed, the processing ends (step S314).

ステップＳ３０８（処理対象要素選択）では、出現情報選択モジュール２２０が、頻出要素のリスト内の未処理のものから一つ選ぶ。例ではノードのラベル、Ａ，Ｂ，Ｃ，Ｄ，Ｆの中から未処理のものを一つ選ぶことになる。どのラベルを選択した場合も処理の流れは共通であるため、以降ではラベルＡを選択した場合について説明を行う。 In step S308 (processing target element selection), the appearance information selection module 220 selects one from the unprocessed elements in the frequent element list. In the example, one unprocessed item is selected from the node labels A, B, C, D, and F. Since the process flow is the same regardless of which label is selected, the case where label A is selected will be described below.

ステップＳ３１０（探索状態作成）では、探索状態管理モジュール２６０が、選択したラベルが各構造データにおいてどこで出現したかを示す情報を用意する。
このステップＳ３１０は、ステップＳ３０４において各ラベルの出現位置情報も合わせて出現情報管理モジュール２３０に格納される構成の場合には、その情報を取り出してくるだけの処理となる。構成によっては、記憶容量などの関係から出現位置の情報が出現情報管理モジュール２３０に記憶されていない場合もあり、そのような場合には、この段階で構造情報を走査して出現位置の情報を抽出するようにしてもよい。
例えば、ラベルＡの出現情報は、図１４に示すように、Ｔｒ１の出現情報１４１０内のｖ０_Ｔｒ１、ｖ２_Ｔｒ１、ｖ４_Ｔｒ１、ｖ１２_Ｔｒ１、Ｔｒ２の出現情報１４２０内のｖ１_Ｔｒ２、ｖ３_Ｔｒ２、ｖ９_Ｔｒ２、ｖ１１_Ｔｒ２、Ｔｒ３の出現情報１４３０内のｖ０_Ｔｒ３、ｖ８_Ｔｒ３、となる。ここで、図１４に示した出現情報は、図１１、図１２、図１３に格納されている情報と同じものを用いた。つまり、（１）「ノードの識別子」、（２）「ラベル」、（３）「ＥｐＲａｎｇｅＬ」、（４）「ＥｐＲａｎｇｅＲ」、（５）「ＥｐＲａｎｇｅＬ」を同じとして格納されている情報内での番号の組として示してある。 In step S310 (search state creation), the search state management module 260 prepares information indicating where the selected label appears in each structure data.
In the case where the appearance position information of each label is also stored in the appearance information management module 230 in step S304, this step S310 is a process of only extracting the information. Depending on the configuration, the appearance position information may not be stored in the appearance information management module 230 due to the storage capacity and the like. In such a case, the structure information is scanned at this stage to obtain the appearance position information. You may make it extract.
For example, as shown in FIG. 14, the appearance information of label A is v0_Tr1, v2_Tr1, v4_Tr1, and v12_Tr1 in the appearance information 1410 of Tr1; v1_Tr2, v3_Tr2, v9_Tr2, and v11_Tr2 in the appearance information 1420 of Tr2; and v0_Tr3 and v8_Tr3 in the appearance information 1430 of Tr3. The appearance information shown in FIG. 14 uses the same form as the information stored in FIGS. 11, 12, and 13. That is, each entry is shown as a tuple of (1) the node identifier, (2) the label, (3) EpRangeL, (4) EpRangeR, and (5) the number within the information stored under the same EpRangeL.

この出現情報に加えて、選択したラベルをルートとした構造パターン木を作成し、その構造パターン木中の現在処理位置を設定して、図１５の例に示す探索状態ノードを作成する。つまり、探索状態ノードは、構造パターン木１５１０、出現情報１５２０、変更情報スタック１５３０を有している。構造パターン木１５１０は、ステップＳ３０８で選択されたラベルのノードをルートとした構造パターン木とその構造パターン木中の現在処理位置を記憶している。出現情報１５２０内のＴｒ１１５２１、Ｔｒ２１５２２、Ｔｒ３１５２３は、図１４で示した出現情報１４１０、出現情報１４２０、出現情報１４３０と同じものである。変更情報スタック１５３０は、後述の処理で使用するものである。 In addition to the appearance information, a structure pattern tree having the selected label as a root is created, a current processing position in the structure pattern tree is set, and a search state node shown in the example of FIG. 15 is created. That is, the search state node has a structure pattern tree 1510, appearance information 1520, and change information stack 1530. The structure pattern tree 1510 stores the structure pattern tree having the label node selected in step S308 as a root and the current processing position in the structure pattern tree. Tr1 1521, Tr2 1522, and Tr3 1523 in the appearance information 1520 are the same as the appearance information 1410, the appearance information 1420, and the appearance information 1430 shown in FIG. The change information stack 1530 is used in processing described later.

この探索状態ノードをルートノードとして、探索状態を図１６の例に示すように作成して、探索状態管理モジュール２６０に格納する。図１６は、探索状態ノードに格納されている構造パターン木に対応するパターンの構造を文字列で示したものを探索状態ノード（Ａ）１６１０に示したものである。
なお、構造パターン木の文字列の表記は、ここでは、｛親｝［｛子供１｝，｛子供２｝，・・・］という書き方を再帰的に適用することにより示す。ただし、子供がないノードについては［］を省略する。 Using this search state node as a root node, a search state is created as shown in the example of FIG. 16 and stored in the search state management module 260. FIG. 16 shows a search state node (A) 1610 that shows the structure of the pattern corresponding to the structure pattern tree stored in the search state node as a character string.
Here, the notation of the character string of the structure pattern tree is shown by recursively applying the notation {parent} [{child 1}, {child 2},...]. However, [] is omitted for nodes having no children.
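The string notation described above can be produced by a short recursive function. The nested-tuple representation of the pattern tree and the function name are assumptions made for this sketch.

```python
def pattern_to_string(node):
    """Render a pattern tree (label, [children]) as parent[child1,child2,...].

    The brackets are omitted for a node without children.
    """
    label, children = node
    if not children:
        return label
    inner = ",".join(pattern_to_string(child) for child in children)
    return f"{label}[{inner}]"


single = pattern_to_string(("A", []))
chain = pattern_to_string(("A", [("B", [("C", [])])]))
branching = pattern_to_string(("A", [("B", []), ("C", [])]))
```

The first two results correspond to the forms A and A[B[C]] used for search state nodes in the text; the third shows how siblings are separated by commas.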

ステップＳ３１２では、探索処理モジュール２５０が、頻出構造パターン探索を行う。ステップＳ３１２については、図４に示すフローチャートを用いて説明する。
なお、ステップＳ３１０（探索状態作成）は、ここではステップＳ３１２（頻出構造パターン探索）の前で行ったが、構成によってステップＳ３１０はステップＳ３１２の処理の中に組み込むこともできる。 In step S312, the search processing module 250 performs a frequent structure pattern search. Step S312 will be described with reference to the flowchart shown in FIG.
Note that step S310 (search state creation) is performed here before step S312 (frequent structure pattern search), but step S310 can be incorporated into the processing of step S312 depending on the configuration.

図４は、本実施の形態による頻出構造パターン探索の処理例を示したフローチャートである。ここで２種類の探索を行っている。つまり、第１の探索である下位探索（ステップＳ４０４）と第２の探索である横枝探索（ステップＳ４１０）である。
また、下位探索処理（ステップＳ４０４）は、複数の木構造データ内で複数回現れる構造パターンの探索を、前記木構造データ内の現在の処理対象となっているノードより下位のノードに対して行う。
そして、横枝探索処理（ステップＳ４１０）では、前記構造パターンの探索を、前記木構造データ内の現在の処理対象となっているノードより上位のノードであって、その上位のノードの下位にあり、かつ未探索のノード毎に行う。
下位探索処理と横枝探索処理では、探索の対象とすべきノードがなくなった場合に、その探索を始めた元のノードに戻るようにしている。 FIG. 4 is a flowchart showing a processing example of a frequent structure pattern search according to this embodiment. Two types of searches are performed here. That is, a lower search (step S404) that is the first search and a lateral search (step S410) that is the second search.
In the lower search process (step S404), the search for structure patterns that appear multiple times in the plurality of pieces of tree structure data is performed on nodes lower than the node currently being processed in the tree structure data.
In the lateral branch search process (step S410), the search for the structure pattern is performed, for each node higher than the node currently being processed in the tree structure data, on the unsearched nodes below that higher node.
In both the lower search process and the lateral branch search process, when no nodes remain to be searched, processing returns to the node from which that search was started.

ステップＳ４０２（構造パターン情報出力）では、その時点での探索状態ノードの構造パターン木のデータを出力する。この出力の手前で予め定められた基準にしたがって、出力するか否かの判定を行ってもよい。 In step S402 (structure pattern information output), the structure pattern tree data of the search state node at that time is output. It may be determined whether or not to output in accordance with a predetermined standard before this output.

ステップＳ４０４（下位探索）では、子孫方向に頻出する部分構造を探索する。この下位探索を行うときに、構造パターン内の現在の処理対象である現処理対象となっているノードと一致する前記木構造データ内のノード（ノードの間で上下関係があるものについては最上位のノード）に基づいて、子孫を探索する範囲を決定するようにしており、下位探索処理は、その決定された範囲に基づいて探索を行う。なお、ステップＳ４０４については、図５に示すフローチャートを用いて説明する。
そして、下位探索とは別に親ノードや先祖ノードから横方向に部分構造を探索する、横枝探索（ステップＳ４１０）の繰り返し処理が行われる。この横枝探索を行うときに、構造パターン内での親ノードと一致する前記木構造データ内のノード（ノード間で上下関係のあるものについては最上位のノード）及び前記構造パターン内での子ノードと一致する前記木構造データ内のノード（ノード間に上下関係があるときには最下位のノード）に基づいて、探索範囲を決定するようにしており、横枝探索処理は、その決定された範囲に基づいて探索を行う。
なお、この繰り返し処理中のステップＳ４０６（横枝探索箇所残りあり？）や、ステップＳ４０８（横枝探索箇所選択）は、パターン木中にノードが複数ある方が説明しやすいため、これらのステップの説明は後で行う。ステップＳ４１０（横枝探索）も別途後述する。 In step S404 (lower search), partial structures that appear frequently in the descendant direction are searched for. When this lower search is performed, the range in which descendants are searched is determined based on the nodes in the tree structure data that match the node currently being processed in the structure pattern (the highest node where such nodes are in an ancestor-descendant relationship), and the lower search process searches within the determined range. Step S404 is described with reference to the flowchart shown in FIG. 5.
Separately from the lower search, the lateral branch search (step S410), which searches for partial structures sideways from the parent node and ancestor nodes, is performed repeatedly. When the lateral branch search is performed, the search range is determined based on the nodes in the tree structure data that match the parent node in the structure pattern (the highest node where such nodes are in an ancestor-descendant relationship) and the nodes in the tree structure data that match the child node in the structure pattern (the lowest node where such nodes are in an ancestor-descendant relationship), and the lateral branch search process searches within the determined range.
Step S406 (lateral branch search locations remaining?) and step S408 (lateral branch search location selection) in this loop are easier to explain when the pattern tree contains several nodes, so they are explained later. Step S410 (lateral branch search) is also described later.

図５は、本実施の形態による下位探索の処理例を示したフローチャートである。
ステップＳ５０２（下位探索範囲情報準備）では、子孫ノードの探索範囲の準備を行う。このステップでは、具体的には各構造データに対して子孫ノードを探索する範囲を指定する。ここではその指定をＥｐＲａｎｇｅにより行う例を示す。
パターン中のノードをラベルＡで選び、そのときの探索状態ノードは図１５に示すようになる。このとき、例えば、Ｔｒ１においてラベルＡの子孫の候補は、Ｔｒ１１５２１に基づいて、ｖ０_Ｔｒ１の子孫、あるいはｖ２_Ｔｒ１の子孫、あるいはｖ４_Ｔｒ１の子孫、あるいはｖ１２_Ｔｒ１の子孫である。 FIG. 5 is a flowchart showing an example of a low-order search process according to this embodiment.
In step S502 (preparation of lower search range information), a search range of descendant nodes is prepared. In this step, specifically, a range for searching for descendant nodes is designated for each structure data. Here, an example in which the designation is performed by EpRange is shown.
With label A selected as the node in the pattern, the search state node is as shown in FIG. 15. In this case, for example, based on Tr1 1521, the candidates for descendants of label A in Tr1 are the descendants of v0_Tr1, v2_Tr1, v4_Tr1, or v12_Tr1.

ｖ０_Ｔｒ１の子孫はＥｐＲａｎｇｅ（ｖ０_Ｔｒ１）＝［１，６］の範囲にあり、ｖ２_Ｔｒ１の子孫はＥｐＲａｎｇｅ（ｖ２_Ｔｒ１）＝［１，３］の範囲にあり、ｖ４_Ｔｒ１の子孫はＥｐＲａｎｇｅ（ｖ４_Ｔｒ１）＝［１，２］の範囲にあり、ｖ１２_Ｔｒ１の子孫はＥｐＲａｎｇｅ（ｖ１２_Ｔｒ１）＝［４，５］の範囲に存在する。 The descendants of v0_Tr1 lie in the range EpRange(v0_Tr1) = [1, 6], those of v2_Tr1 in the range EpRange(v2_Tr1) = [1, 3], those of v4_Tr1 in the range EpRange(v4_Tr1) = [1, 2], and those of v12_Tr1 in the range EpRange(v12_Tr1) = [4, 5].

下の条件全て（（１）と（２））を満たすノードは先祖にラベルＡを持つノードが存在する。
（１）ＥｐＲａｎｇｅ（ｖ０_Ｔｒ１）＝［１，６］，ＥｐＲａｎｇｅ（ｖ２_Ｔｒ１）＝［１，３］，ＥｐＲａｎｇｅ（ｖ４_Ｔｒ１）＝［１，２］，ＥｐＲａｎｇｅ（ｖ１２_Ｔｒ１）＝［４，５］のいずれかのＥｐＲａｎｇｅの範囲にある。
（２）ｖ０_Ｔｒ１，ｖ２_Ｔｒ１，ｖ４_Ｔｒ１，ｖ１２_Ｔｒ１のいずれのＥｐＲａｎｇｅも、そのノードのＥｐＲａｎｇｅよりも広いものが存在しないなら、同じＥｐＲａｎｇｅを持つものでより上位のものが存在する。
ただし、前述の２つの条件は、本実施の形態で用いた構造情報の管理方法において、重複なく子孫ノードを列挙する際の一方法のための条件である。パターンに対応する出現情報から、全ての子孫ノード候補を列挙できるのであれば他の方法を用いても構わない。 Nodes that satisfy all of the following conditions ((1) and (2)) have a node with label A as an ancestor.
(1) The node lies within one of the ranges EpRange(v0_Tr1) = [1, 6], EpRange(v2_Tr1) = [1, 3], EpRange(v4_Tr1) = [1, 2], or EpRange(v12_Tr1) = [4, 5].
(2) If none of the EpRanges of v0_Tr1, v2_Tr1, v4_Tr1, and v12_Tr1 is wider than the EpRange of the node, then one of them has the same EpRange as the node and is located higher than the node.
However, the above-mentioned two conditions are conditions for one method when enumerating descendant nodes without duplication in the structure information management method used in the present embodiment. Other methods may be used as long as all descendant node candidates can be enumerated from the appearance information corresponding to the pattern.

前述の条件にしたがって、子孫ノードを列挙する一方法を示す。
まず各構造データ（例では、Ｔｒ１，Ｔｒ２，Ｔｒ３）毎に出現データのＥｐＲａｎｇｅの包含関係を調べ、他のＥｐＲａｎｇｅに含まれないものだけを残す。なお、ＥｐＲａｎｇｅが同じであれば、同じＥｐＲａｎｇｅＬを持つものの間でつけた番号がより小さいものを残す。この処理により、Ｔｒ１ではｖ０_Ｔｒ１、Ｔｒ２ではｖ１_Ｔｒ２、Ｔｒ３ではｖ０_Ｔｒ３がそれぞれ残る。 One method for enumerating descendant nodes according to the above conditions is shown.
First, for each piece of structure data (Tr1, Tr2, and Tr3 in the example), the inclusion relations among the EpRanges of the appearance data are examined, and only the entries not contained in another EpRange are kept. If two EpRanges are identical, the entry with the smaller number among those with the same EpRangeL is kept. As a result of this processing, v0_Tr1 remains for Tr1, v1_Tr2 for Tr2, and v0_Tr3 for Tr3.
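The filtering step just described can be sketched as follows. The EpRange values for the label A entries of Tr1 are those given in the text; the position numbers in the fourth field, the tuple layout, and the function name are assumptions made for illustration.

```python
def keep_outermost(entries):
    """Keep only entries whose EpRange is not contained in another entry's.

    Each entry is (identifier, EpRangeL, EpRangeR, position). When two
    entries have the same EpRange, the one with the smaller position wins.
    """

    def hidden_by(entry, other):
        _, left, right, pos = entry
        _, o_left, o_right, o_pos = other
        if (o_left, o_right) == (left, right):
            return o_pos < pos            # tie: smaller number is kept
        return o_left <= left and right <= o_right  # contained in other

    return [
        e for e in entries
        if not any(hidden_by(e, o) for o in entries if o is not e)
    ]


# Label A in Tr1: ranges from the description, positions hypothetical
tr1_label_a = [
    ("v0", 1, 6, 0),
    ("v2", 1, 3, 2),
    ("v4", 1, 2, 4),
    ("v12", 4, 5, 1),
]
kept = keep_outermost(tr1_label_a)
```

With these values only v0 survives, since [1, 3], [1, 2], and [4, 5] are all contained in [1, 6], which agrees with the statement that v0_Tr1 remains for Tr1.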

これらの出現情報をもとに子孫ノードの探索を行う。
ノードＡからの子孫ノードの探索例は、Ｔｒ１ではｖ１_Ｔｒ１以下全てのノード、Ｔｒ２ではｖ２_Ｔｒ２以下全てのノード、Ｔｒ３ではｖ３_Ｔｒ３以下全てのノードとなる。
この走査の過程で、ステップＳ５０４（要素の出現を集計）において、各ラベル毎に出現情報が集計され集められる。
この段階でも頻出と判定されるラベルはＡ，Ｂ，Ｃ，Ｄ，Ｆとなる。 The descendant node is searched based on the appearance information.
In this example, the search for descendant nodes from node A covers v1_Tr1 and all subsequent nodes in Tr1, v2_Tr2 and all subsequent nodes in Tr2, and v3_Tr3 and all subsequent nodes in Tr3.
In this scanning process, appearance information is totaled and collected for each label in step S504 (aggregation of element occurrences).
At this stage, labels that are determined to appear frequently are A, B, C, D, and F.

次にこれらの頻出ラベルのそれぞれを処理対象要素と選んだ処理が、それぞれステップＳ５０６（頻出要素処理終了？）によって繰り返される。
例えば、ステップＳ５０８（処理対象要素選択）で、Ｂを選んだとして説明を行う。このときのＢの出現情報は、図１７に示すように、Ｔｒ１の出現情報１７１０内のｖ１_Ｔｒ１、ｖ３_Ｔｒ１、ｖ６_Ｔｒ１、ｖ１１_Ｔｒ１、ｖ１３_Ｔｒ１、Ｔｒ２の出現情報１７２０内のｖ２_Ｔｒ２、ｖ４_Ｔｒ２、ｖ７_Ｔｒ２、ｖ１２_Ｔｒ２、Ｔｒ３の出現情報１７３０内のｖ２_Ｔｒ３、ｖ１０_Ｔｒ３となる。 Next, the process of selecting each of these frequent labels as a process target element is repeated in step S506 (end frequent element process?).
The following description assumes, for example, that B is selected in step S508 (selection of processing target element). As shown in FIG. 17, the appearance information of B is then v1_Tr1, v3_Tr1, v6_Tr1, v11_Tr1, and v13_Tr1 in the appearance information 1710 of Tr1; v2_Tr2, v4_Tr2, v7_Tr2, and v12_Tr2 in the appearance information 1720 of Tr2; and v2_Tr3 and v10_Tr3 in the appearance information 1730 of Tr3.

そして、次にステップＳ５１０（探索状態作成・更新）の処理を行う。つまり、ここでは、構造パターン内の現在の処理対象となっているノードの先祖のノードの出現箇所から、その現在の処理対象となっているノードの出現箇所のノードを子孫に含むノードに対応する出現箇所を保持するようにしている。
構造パターン木のルートノードＡの下にＢが子供のノードとしてついたパターンを登録し、構造パターン木の中の現在位置情報はＢの位置を指す。また、出現情報として図１７に示す出現情報１７１０、出現情報１７２０、出現情報１７３０を登録して、図１８の例に示すように探索状態ノードを作成して探索状態管理モジュール２６０に登録する。つまり、図１８の例に示す探索状態ノードは、構造パターン木１８１０にラベルＡのノードの下にラベルＢのノードを有している構造パターン木とその構造パターン木の中の現在の位置情報（ラベルＢのノード）を記憶しており、出現情報１８２０に図１７に示したものと同様のＴｒ１１８２１、Ｔｒ２１８２２、Ｔｒ３１８２３を記憶しており、さらに、変更情報スタック１８３０に変更情報１８３１を記憶している。 Then, the process of step S510 (search state creation / update) is performed. That is, from among the appearance locations of the ancestor nodes of the node currently being processed in the structure pattern, the appearance locations corresponding to nodes that include, among their descendants, a node at an appearance location of the node currently being processed are retained.
A pattern in which B is attached as a child node under the root node A of the structure pattern tree is registered, and the current position information in the structure pattern tree points to the position of B. In addition, the appearance information 1710, 1720, and 1730 shown in FIG. 17 is registered as the appearance information, and a search state node is created as shown in the example of FIG. 18 and registered in the search state management module 260. That is, the search state node shown in the example of FIG. 18 stores, in the structure pattern tree 1810, a structure pattern tree that has a node with label B under a node with label A, together with the current position information in that structure pattern tree (the node with label B); it stores, in the appearance information 1820, Tr1 1821, Tr2 1822, and Tr3 1823, which are the same as those shown in FIG. 17; and it stores the change information 1831 in the change information stack 1830.

ここでの探索状態は、図１９の例に示すようになる。つまり、図１９の例に示す探索状態は、探索状態ノード（Ａ）１９１０下に探索状態ノード（Ａ[Ｂ]）１９２０が接続されている。 The search state here is as shown in the example of FIG. That is, in the search state shown in the example of FIG. 19, the search state node (A [B]) 1920 is connected under the search state node (A) 1910.

そして、探索状態の更新を行う。具体的には、登録した探索状態ノードから順に上位の探索状態ノードを辿り、下位の探索状態ノードに登録されている出現位置を一つも含まない出現情報をいったん削除して変更情報スタック１８３０に格納する。
この例の状態では、下位の探索状態ノードがＡ［Ｂ］、上位の探索状態ノードがＡであり、探索状態ノードＡ［Ｂ］の出現情報を一つも範囲に含まないものを探索状態ノードＡの出現情報から移動して、変更情報スタック１８３０に格納する。ここでは、この条件に該当する出現情報が、探索状態ノードＡにないため、なにもしない。もし、構造パターン木中の先祖方向の木の状態を更新した場合には、その更新した出現状態ノードの情報を上位方向の更新を開始する際の探索状態ノード（この場合探索状態ノードＡ［Ｂ］）の変更情報スタック１８３０に格納する。 The search state is then updated. Specifically, the search state nodes are traced upward in order from the newly registered search state node, and any appearance information that contains none of the appearance positions registered in the lower search state node is removed for the time being and stored in the change information stack 1830.
In this example, the lower search state node is A[B] and the upper search state node is A; any entry in the appearance information of search state node A whose range contains none of the appearance information of search state node A[B] is moved out and stored in the change information stack 1830. Here, search state node A has no appearance information that meets this condition, so nothing is done. If the state of a tree in the ancestor direction of the structure pattern tree is updated, the updated appearance state node information is stored in the change information stack 1830 of the search state node from which the upward update was started (in this case, search state node A[B]).
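The move-and-restore behavior of the change information stack might be sketched as follows. This is a simplified illustration under stated assumptions, not the embodiment itself: the class and function names are hypothetical, an entry is reduced to (depth-first identifier, EpRangeL, EpRangeR), and an upper entry is treated as an ancestor of a lower entry when its EpRange contains the lower one and its identifier is smaller. The A entries use the EpRange values given in the text for Tr1; the B entries and their ranges are assumed purely for illustration and represent a state in which some B entries have already been removed.

```python
class SearchStateNode:
    def __init__(self, pattern, appearances):
        self.pattern = pattern
        self.appearances = list(appearances)  # (identifier, EpRangeL, EpRangeR)
        self.change_stack = []                # (updated node, removed entries)


def is_ancestor(upper, lower):
    """Assumed test: range containment plus earlier depth-first identifier."""
    (u_id, u_left, u_right), (l_id, l_left, l_right) = upper, lower
    return u_id < l_id and u_left <= l_left and l_right <= u_right


def prune_upper(upper, lower, initiator):
    """Move entries of `upper` that cover no entry of `lower` onto the
    change stack of `initiator`, the node that started the upward update."""
    removed = [e for e in upper.appearances
               if not any(is_ancestor(e, c) for c in lower.appearances)]
    if removed:
        upper.appearances = [e for e in upper.appearances if e not in removed]
        initiator.change_stack.append((upper, removed))
    return removed


def restore(initiator):
    """Undo, in reverse order, every update recorded on the initiator."""
    while initiator.change_stack:
        upper, removed = initiator.change_stack.pop()
        upper.appearances = sorted(upper.appearances + removed)


state_a = SearchStateNode("A", [(0, 1, 6), (2, 1, 3), (4, 1, 2), (12, 4, 5)])
state_ab = SearchStateNode("A[B]", [(1, 1, 3), (3, 1, 3), (11, 4, 5)])  # assumed
removed = prune_upper(state_a, state_ab, initiator=state_ab)
after_prune = list(state_a.appearances)
restore(state_ab)
after_restore = list(state_a.appearances)
```

Recording the removed entries on the node that triggered the update means that backtracking from that node is enough to bring every upper search state node back to its earlier contents.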

そして、再帰的にステップＳ５１２（頻出構造パターン探索）の処理に入る。
頻出構造パターン探索処理では、構造パターン情報の出力処理において、予め定められた条件にしたがって、構造パターン木の情報Ａ［Ｂ］が出力される。次に、同様に下位探索に入る。
次の下位探索の処理では、Ｂの出現情報からの探索範囲は下のノードの子孫を探索することになる。すなわち、Ｔｒ１ではｖ１_Ｔｒ１とｖ１１_Ｔｒ１の子孫、Ｔｒ２ではｖ２_Ｔｒ２とｖ７_Ｔｒ２の子孫、Ｔｒ３ではｖ２_Ｔｒ３の子孫の探索である。
この範囲で探索すると、例えばｖ１０_Ｔｒ１，ｖ１５_Ｔｒ１，ｖ６_Ｔｒ２，ｖ１１_Ｔｒ３などは探索の範囲にはいってこないなど、探索範囲が狭められてくるが、この段階ではまだ頻出ノードはＡ，Ｂ，Ｃ，Ｄ，Ｆのままである。 Then, the process recursively enters step S512 (frequent structure pattern search).
In the frequent structure pattern search process, the structure pattern tree information A [B] is output in accordance with a predetermined condition in the structure pattern information output process. Next, the lower search is similarly performed.
In the next lower search process, the search range derived from the appearance information of B consists of the descendants of the following nodes: in Tr1, the descendants of v1_Tr1 and v11_Tr1; in Tr2, the descendants of v2_Tr2 and v7_Tr2; and in Tr3, the descendants of v2_Tr3.
Searching within this range narrows the search range, in that, for example, v10_Tr1, v15_Tr1, v6_Tr2, and v11_Tr3 no longer fall within it; at this stage, however, the frequent nodes are still A, B, C, D, and F.

次に、ここでラベルＣのノードが選ばれたとする。すると、Ｃの出現情報はｖ８_Ｔｒ１、ｖ１４_Ｔｒ１、ｖ５_Ｔｒ２、ｖ８_Ｔｒ２、ｖ１３_Ｔｒ２、ｖ３_Ｔｒ３、ｖ９_Ｔｒ３となる。この結果探索状態ノードは図２０に示すようになり、探索状態は図２１に示すようになる。つまり、図２０の例に示す探索状態ノードは、構造パターン木２０１０にラベルＡのノードの下にラベルＢのノード、そのラベルＢのノードの下にラベルＣのノードを有している構造パターン木とその構造パターン木の中の現在の位置情報（ラベルＣのノード）を記憶しており、出現情報２０２０にＴｒ１２０２１、Ｔｒ２２０２２、Ｔｒ３２０２３を記憶しており、さらに、変更情報スタック２０３０に変更情報２０３１を記憶している。図２１に示す探索状態は、探索状態ノード（Ａ）１９１０の下に探索状態ノード（Ａ[Ｂ]）１９２０が接続され、探索状態ノード（Ａ[Ｂ]）１９２０の下に探索状態ノード（Ａ[Ｂ[Ｃ]]）２１３０が接続されている。 Next, assume that the node of label C is selected here. Then, the appearance information of C becomes v8 _Tr1 , v14 _Tr1 , v5 _Tr2 , v8 _Tr2 , v13 _Tr2 , v3 _Tr3 , v9 _Tr3 . As a result, the search state node becomes as shown in FIG. 20, and the search state becomes as shown in FIG. That is, the search state node shown in the example of FIG. 20 has a structure pattern tree in which a structure pattern tree 2010 has a label B node under the label A node and a label C node under the label B node. And the current position information (node of label C) in the structure pattern tree, Tr1 2021, Tr2 2022, and Tr3 2023 are stored in the appearance information 2020, and the change information stack 2030 is changed. Information 2031 is stored. In the search state shown in FIG. 21, the search state node (A [B]) 1920 is connected under the search state node (A) 1910, and the search state node (A [B]) 1920 is connected under the search state node (A [B]) 1920. [B [C]]) 2130 is connected.

そして、探索状態ノードの上位のものの更新が行われる。図２０に示すＣの出現情報２０２０を下位に持たない、すなわちＥｐＲａｎｇｅが図２０のＣのＴｒ１２０２１、Ｔｒ２２０２２、Ｔｒ３２０２３の出現情報のいずれよりも大きくない、あるいは同じであったとしても番号が大きいものは、変更情報２０３１に加えられて変更情報スタック２０３０に格納される。
この場合、ｖ６_Ｔｒ１、ｖ１３_Ｔｒ１、ｖ４_Ｔｒ２、ｖ１２_Ｔｒ２、ｖ１０_Ｔｒ３を変更情報２０３１に登録して、変更情報スタック２０３０に格納する。その結果探索状態ノードＡ［Ｂ］は、図２２に示すように更新される。つまり、図２２の例に示す探索状態ノードは、構造パターン木２２１０にラベルＡのノードの下にラベルＢのノードを有している構造パターン木とその構造パターン木の中の現在の位置情報（ラベルＢのノード）を記憶しており、出現情報２２２０にＴｒ１２２２１、Ｔｒ２２２２２、Ｔｒ３２２２３を記憶しており、さらに、変更情報スタック２２３０に変更情報２２３１、変更情報２２３２を記憶している。そして、この更新した情報を、更新の原因となった探索状態ノードＡ［Ｂ［Ｃ］］の変更情報２２３１に格納する。 Then, the search state nodes above are updated. Entries that have none of the appearance information 2020 of C shown in FIG. 20 below them, that is, entries whose EpRange is not larger than any of the appearance information of C in Tr1 2021, Tr2 2022, and Tr3 2023 of FIG. 20, or whose EpRange is the same but whose number is larger, are added to the change information 2031 and stored in the change information stack 2030.
In this case, v6 _Tr1 , v13 _Tr1 , v4 _Tr2 , v12 _Tr2 , and v10 _Tr3 are registered in the change information 2031 and stored in the change information stack 2030. As a result, the search state node A [B] is updated as shown in FIG. That is, the search state node shown in the example of FIG. 22 includes a structure pattern tree having a label B node under a label A node in the structure pattern tree 2210 and current position information in the structure pattern tree ( Node B), Tr1 2221, Tr2 2222, and Tr3 2223 are stored in the appearance information 2220, and the change information 2231 and the change information 2232 are stored in the change information stack 2230. Then, the updated information is stored in the change information 2231 of the search state node A [B [C]] that has caused the update.

そして、下位の探索状態ノードＡ［Ｂ］が更新されたので、その上位の探索状態ノードＡの更新検査を行う。この場合、出現情報のうちｖ４_Ｔｒ１、ｖ１２_Ｔｒ１、ｖ３_Ｔｒ２、ｖ９_Ｔｒ２、ｖ１１_Ｔｒ２、ｖ８_Ｔｒ３が下位に該当するＢを持たなくなるため、出現情報からはずされて変更情報スタックに移動される。そして、この更新した情報を、更新の原因となった探索状態ノードＡ［Ｂ［Ｃ］］の変更情報に格納する。この探索状態ノードＡの更新結果を図２３に示す。つまり、図２３の例に示す探索状態ノードは、構造パターン木２３１０にラベルＡのノードを有している構造パターン木とその構造パターン木の中の現在の位置情報（ラベルＡのノード）を記憶しており、出現情報２３２０にＴｒ１２３２１、Ｔｒ２２３２２、Ｔｒ３２３２３を記憶しており、さらに、変更情報スタック２３３０に変更情報２３３１、変更情報２３３２を記憶している。そして、この更新した情報を、更新の原因となった探索状態ノードＡ［Ｂ［Ｃ］］の変更情報２２３１に格納する。 Then, since the lower search state node A [B] has been updated, an update check of the upper search state node A is performed. In this case, v4 _Tr1 , v12 _Tr1 , v3 _Tr2 , v9 _Tr2 , v11 _Tr2 , and v8 _Tr3 out of the appearance information do not have B corresponding to the lower order, and thus are removed from the appearance information and moved to the change information stack. Then, the updated information is stored in the change information of the search state node A [B [C]] that has caused the update. The update result of the search state node A is shown in FIG. That is, the search state node shown in the example of FIG. 23 stores the structure pattern tree having the node of label A in the structure pattern tree 2310 and the current position information (node of label A) in the structure pattern tree. Tr1 2321, Tr2 2322, and Tr3 2323 are stored in the appearance information 2320, and the change information 2331 and the change information 2332 are stored in the change information stack 2330. Then, the updated information is stored in the change information 2231 of the search state node A [B [C]] that has caused the update.

Next, processing continues into the frequent structure pattern search (step S512). At this stage too, a frequent label can be found under the node having label C, but the same processing flow would simply be repeated, so the description resumes from the point where the recursive execution of the lower search has finished and processing has moved on to the next step.

In the flowchart shown in FIG. 4, the processing from step S406 (is any horizontal branch search location left?) onward (steps S406 to S410) traces the search state nodes upward through the search state. Starting from the most recently registered search state node and moving toward the parent search state nodes, the horizontal branch search (step S410) is executed for each parent-child pair.
In this example, the search state consists of three search state nodes, as shown in FIG. 21: the search state node (A[B]) 1920 is connected under the search state node (A) 1910, and the search state node (A[B[C]]) 2130 is connected under the search state node (A[B]) 1920, forming a two-level parent-child relationship. This loop is therefore executed for the parent-child relationship between search state nodes A[B] and A[B[C]], and for the parent-child relationship between search state nodes A and A[B].
This processing is preferably performed in order from the lower search state nodes to the higher ones. Here, the processing for the parent-child relationship between search state nodes A[B] and A[B[C]] is described first; that is, this parent-child relationship is assumed to have been selected as the horizontal branch search location (step S408).
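The bottom-up loop over parent-child pairs can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the class `SearchStateNode`, its `parent` field, and the function `parent_child_pairs` are hypothetical names introduced here, and each search state node is assumed to keep a link to its parent.

```python
from dataclasses import dataclass
from typing import Iterator, Optional, Tuple


@dataclass
class SearchStateNode:
    pattern: str                                  # e.g. "A[B[C]]"
    parent: Optional["SearchStateNode"] = None    # assumed parent link


def parent_child_pairs(
    newest: SearchStateNode,
) -> Iterator[Tuple[SearchStateNode, SearchStateNode]]:
    """Yield (parent, child) pairs from the newest node up to the root."""
    child = newest
    while child.parent is not None:
        yield child.parent, child
        child = child.parent


# Search state of FIG. 21 as described in the text: A -> A[B] -> A[B[C]].
a = SearchStateNode("A")
a_b = SearchStateNode("A[B]", a)
a_b_c = SearchStateNode("A[B[C]]", a_b)

for parent, child in parent_child_pairs(a_b_c):
    print(parent.pattern, "/", child.pattern)
```

The pairs come out lowest first, which matches the order the description calls preferable.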

Processing then enters step S410 (horizontal branch search).
FIG. 7 shows a flowchart of an example of the horizontal branch search.
The horizontal branch search first executes step S702 (preparation of horizontal branch search range information). Specifically, this step prepares the horizontal branch search range information for the selected parent-child relationship between search state nodes, which in this example is the one between search state nodes A[B] and A[B[C]].

This processing selects, from the appearance information of the upper search state node, the entries whose EpRange is not contained in the EpRange of any other entry. For each selected EpRange, it then picks, from the appearance information of the lower search state node, the entry with the smallest EpRangeR among those whose EpRange is contained in the upper EpRange (hereinafter called the lower leftmost EpRange). The horizontal branch search range is calculated from the EpRangeR of this lower leftmost EpRange and the upper EpRange.

For example, in Tr1, among the appearance information of the upper search state node, the EpRange of v3_Tr1 is contained in the EpRange of v1_Tr1. (When two entries have the same EpRange, this description treats the one with the smaller rightmost number as containing the one with the larger number; since the EpRange is the same, reversing this does not affect the result.) The EpRanges of the upper search state node are therefore those of v1_Tr1 and v11_Tr1. The appearance information of the lower search state node corresponding to v1_Tr1 is the EpRange of v8_Tr1. The upper EpRange in this case is EpRange(v1_Tr1) = [1, 3] and the lower leftmost EpRange is EpRange(v8_Tr1) = [2, 2]. As a result, the search range for v1_Tr1 is [2 + 1, 3], that is, EpRange [3, 3].

For the other EpRange, that of v11_Tr1, the appearance information of the lower search state node is the EpRange of v14_Tr1. The upper EpRange is EpRange(v11_Tr1) = [4, 5] and the lower leftmost EpRange is EpRange(v14_Tr1) = [5, 5]. The search range for v11_Tr1 would therefore be [5 + 1, 5], which is a contradictory range, so there is no corresponding search range. Consequently, the only search range for Tr1 is EpRange [3, 3]. Similarly, for Tr2, EpRange [5, 5] is the horizontal branch search range obtained from the upper EpRange of v7_Tr2 and the corresponding lower leftmost EpRange of v13_Tr2. For Tr3, EpRange [4, 4] is the horizontal branch search range obtained from the upper EpRange of v2_Tr3 and the corresponding lower leftmost EpRange of v9_Tr3.
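The range calculation worked through above for Tr1 can be sketched in a few lines. This is an illustrative reading of the description, not the embodiment's code: the function names, the dictionary layout (node number mapped to an `(EpRangeL, EpRangeR)` pair), the tie-break by node number for identical ranges, and the choice to skip an upper entry with no lower entry inside it are all assumptions made here.

```python
def maximal_ranges(occ):
    """Keep entries whose EpRange is not contained in another entry's EpRange.

    For identical ranges, the entry with the smaller node number is treated
    as containing the other (an assumed tie-break).
    """
    keep = {}
    for nid, (l, r) in occ.items():
        contained = any(
            ol <= l and r <= orr and ((ol, orr) != (l, r) or oid < nid)
            for oid, (ol, orr) in occ.items()
            if oid != nid
        )
        if not contained:
            keep[nid] = (l, r)
    return keep


def horizontal_ranges(upper, lower):
    """Return {upper node number: (left, right)} for non-empty search ranges."""
    result = {}
    for nid, (ul, ur) in maximal_ranges(upper).items():
        inside = [(l, r) for (l, r) in lower.values() if ul <= l and r <= ur]
        if not inside:
            continue                      # assumed: nothing to compare against
        left = min(r for _, r in inside) + 1   # EpRangeR of lower leftmost + 1
        if left <= ur:                    # drop contradictory ranges like [6, 5]
            result[nid] = (left, ur)
    return result


# Values for Tr1 taken from the text: v1 = [1, 3], v11 = [4, 5] (upper),
# v8 = [2, 2], v14 = [5, 5] (lower).
upper = {1: (1, 3), 11: (4, 5)}
lower = {8: (2, 2), 14: (5, 5)}
print(horizontal_ranges(upper, lower))  # prints {1: (3, 3)}
```

The result agrees with the text: only EpRange [3, 3] survives for Tr1, because the candidate range for v11_Tr1 is empty.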

Each tree structure data is searched according to these search ranges (steps S704 to S714). The present embodiment shows a method using the data structures of FIGS. 11, 12, and 13 as one way of performing this search efficiently. Other data structures and processing methods may be used, but this method at least makes it easy to search each tree structure data for nodes within a specified EPath range.

Searching Tr1 with EpRange [3, 3], only v9_Tr1 falls within the search range, and the label that appears is F. Searching Tr2 with EpRange [5, 5], only v14_Tr2 and v15_Tr2 fall within the search range, and the labels that appear are G and F. Searching Tr3 with EpRange [4, 4], only v10_Tr3 falls within the search range, and the label that appears is B. The only frequent label at this point is F (step S704).
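The selection of frequent labels from the per-tree search results can be illustrated as below. This is only a sketch: the function name `frequent_labels`, the threshold value 2, and the choice to count a label once per tree are assumptions made for illustration, since this passage does not state the threshold or the counting rule.

```python
from collections import Counter


def frequent_labels(labels_per_tree, min_support):
    """Return labels found in at least `min_support` trees (assumed rule)."""
    counts = Counter()
    for labels in labels_per_tree.values():
        counts.update(set(labels))        # count each label once per tree
    return sorted(label for label, n in counts.items() if n >= min_support)


# Labels found in the search ranges, as given in the text.
found = {"Tr1": ["F"], "Tr2": ["G", "F"], "Tr3": ["B"]}
print(frequent_labels(found, min_support=2))  # prints ['F']
```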

This label F is selected (step S708), and the search state creation/update processing (step S710) is performed. In step S710, of the appearance locations of the ancestor nodes of the node currently being processed in the structure pattern, those corresponding to nodes that include, among their descendants, a node at an appearance location of the node currently being processed are retained.
A search state node, shown in FIG. 24, is created using the appearance information of F described above and registered in the search state. The search state at this time is shown in FIG. 25. That is, the search state node shown in the example of FIG. 24 stores, in the structure pattern tree 2410, a structure pattern tree having a node of label B under a node of label A and nodes of label C and label F under the node of label B, together with the current position information in that structure pattern tree (the node of label F); stores Tr1 2421 and Tr2 2422 in the appearance information 2420; and stores the change information 2431 in the change information stack 2430. The search state shown in FIG. 25 consists of four search state nodes: the search state node (A[B]) 1920 is connected under the search state node (A) 1910, and the search state node (A[B[C]]) 2130 and the search state node (A[B[CF]]) 2540 are connected under the search state node (A[B]) 1920. Here, the search state node A[B[CF]] is related as a child node of the search state node A[B], which was selected as the parent-side search state node in the horizontal branch search.

Next, the effect of registering the search state node A[B[CF]] is examined and the search state nodes are updated.
The search state nodes checked for whether an update is needed are the ancestors of the search state node A[B[CF]], namely the search state nodes A[B] and A. In this case, checking the search state node A[B] already shows that no update is needed, so no search state node is updated.

Processing then moves on to the frequent structure pattern search (step S712).
In the frequent structure pattern search, output processing is performed for the pattern candidate A[B[CF]] if necessary, and the lower search is performed.
In this lower search there is no frequent label, so processing returns from the lower search immediately.
Next, the horizontal branch search is repeated. The parent-child relationships between search state nodes that are candidates for the horizontal branch search at this point are two: the one between search state nodes A[B[CF]] and A[B], and the one between search state nodes A and A[B].
The flow of this processing has already been described.
Here, however, no frequent label appears in either parent-child relationship.

This completes the processing for the search state node A[B[CF]]. The frequent structure pattern search corresponding to the search state node A[B[CF]] thus ends, and control returns to the calling process, which in this case is the horizontal branch search. The search state recovery processing of the next step (step S714) is then performed.

The search state recovery processing (step S714, step S514) is described next.
The change information stack of the search state node A[B[CF]] is referenced. The search state nodes that were updated, A[B] and A, are recorded there.
For both of these search state nodes, recovery is performed based on the change information at the top of the search state stack. Specifically, for the search state node A[B], v6_Tr1, v13_Tr1, v4_Tr2, v12_Tr2, and v10_Tr3 are returned to the appearance information. The restored result is the same as that shown in FIG. 22. The search state node A is processed in the same way and becomes the same as that shown in FIG. 23.
As described above, executing the processing recursively makes it possible to extract all tree structures of ancestor-descendant relationships that appear at least a pre-specified number of times.
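The interplay between pruning appearance information and restoring it from a change information stack can be sketched as an undo log. This is a simplified model under stated assumptions, not the data layout of the embodiment: the class and function names are hypothetical, each change record is assumed to hold the affected node together with the removed entries, and the `kept_*` entries are placeholders added only so that something remains after pruning.

```python
class SearchStateNode:
    def __init__(self, pattern, occurrences):
        self.pattern = pattern
        self.occurrences = set(occurrences)   # appearance information
        self.change_stack = []                # changes this node caused


def prune(cause, target, removed):
    """Remove entries from `target` and log the change on `cause`."""
    removed = set(removed) & target.occurrences
    target.occurrences -= removed
    cause.change_stack.append((target, removed))


def recover(cause):
    """Undo every logged change, most recent first."""
    while cause.change_stack:
        target, removed = cause.change_stack.pop()
        target.occurrences |= removed


# Entries named in the text for search state node A[B]; "kept_*" are placeholders.
removed = {"v6_Tr1", "v13_Tr1", "v4_Tr2", "v12_Tr2", "v10_Tr3"}
a_b = SearchStateNode("A[B]", removed | {"kept_Tr1", "kept_Tr2"})
cause = SearchStateNode("child", set())

before = set(a_b.occurrences)
prune(cause, a_b, removed)
after_prune = set(a_b.occurrences)
recover(cause)
print(a_b.occurrences == before)  # prints True
```

Keeping the log on the node that caused the change means that finishing that node's search is enough to know which ancestors to restore.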

The descendant node scanning process is described with reference to the flowchart shown in FIG. 6.
In step S602, it is determined whether processing of the upper appearance information has finished. If it has, the descendant node scanning process ends (step S612); if not, processing proceeds to step S604.
In step S604, one piece of appearance information is selected.

In step S606, the EpRangeL and EpRangeR of the appearance information selected in step S604 are set as EpRangeLP and EpRangeRP, respectively.
In step S608, entries whose EpRangeL equals EpRangeLP are scanned, starting from the number following that of the appearance information.
In step S610, entries whose EpRangeR is no greater than EpRangeRP are scanned within the range from EpRangeLP + 1 to EpRangeRP.
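Steps S606 to S610 can be illustrated with a small scan over a flat node list. This is a sketch under assumptions rather than the embodiment's procedure: nodes are assumed to be numbered in depth-first order, EpRangeL and EpRangeR are assumed to be the first and last root-to-leaf path indices passing through a node, and the five-node tree and all names are hypothetical. A linear scan stands in for the EpTree lookup.

```python
from typing import List, NamedTuple


class Occurrence(NamedTuple):
    node_id: int   # depth-first node number (assumed)
    ep_l: int      # EpRangeL
    ep_r: int      # EpRangeR


def scan_descendants(parent: Occurrence, nodes: List[Occurrence]) -> List[Occurrence]:
    lp, rp = parent.ep_l, parent.ep_r                        # S606
    same_left = [n for n in nodes                            # S608
                 if n.ep_l == lp and n.node_id > parent.node_id]
    inner = [n for n in nodes                                # S610
             if lp + 1 <= n.ep_l <= rp and n.ep_r <= rp]
    return sorted(same_left + inner)


# Hypothetical tree: 1 -> (2, 5), 2 -> (3, 4); three root-to-leaf paths.
nodes = [Occurrence(1, 1, 3), Occurrence(2, 1, 2), Occurrence(3, 1, 1),
         Occurrence(4, 2, 2), Occurrence(5, 3, 3)]
print([n.node_id for n in scan_descendants(nodes[1], nodes)])  # prints [3, 4]
```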

In the example shown in the present embodiment, appearance information with the same EpRange was all treated equally. However, for appearance information sharing the same EpRange (that is, lying in a range without branching), exactly the same effect is obtained by handling only representatives (namely the highest and lowest nodes). The required storage capacity falls by the amount of appearance information no longer stored, and the cost of processing such as checks falls as well. In other words, appearance locations other than the highest and lowest in a non-branching range of the tree structure may be deleted from the search state management module 260. Alternatively, the selection may be made before storage in the search state management module 260, or at the time of storage in the structure information DB 110.
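The reduction to representatives can be sketched as a grouping step. The function name, the tuple layout `(node number, EpRangeL, EpRangeR)`, and the sample data are hypothetical; the sketch also assumes depth-first numbering, so that within one EpRange the smallest node number is the highest node and the largest is the lowest.

```python
from collections import defaultdict


def keep_representatives(occurrences):
    """Keep only the highest and lowest entry for each distinct EpRange."""
    groups = defaultdict(list)
    for node_id, ep_l, ep_r in occurrences:
        groups[(ep_l, ep_r)].append(node_id)
    kept = []
    for (ep_l, ep_r), ids in groups.items():
        for node_id in {min(ids), max(ids)}:   # one entry if the group is a single node
            kept.append((node_id, ep_l, ep_r))
    return sorted(kept)


# Hypothetical entries of one tree: nodes 2, 3, 4 share EpRange [1, 2].
sample = [(2, 1, 2), (3, 1, 2), (4, 1, 2), (5, 1, 1)]
print(keep_representatives(sample))  # prints [(2, 1, 2), (4, 1, 2), (5, 1, 1)]
```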

A method is also known in which the nodes of a tree structure are numbered by depth-first search and the range of a node's descendants is expressed using those node numbers. This information, called the scope, is expressed for v7 of Tr2 shown in FIG. 7, for example, as [7, 15], using its own number 7 and the number of the descendant node with the largest number.
The present embodiment uses EpTree to search a tree structure for nodes within a specified range, but a configuration that searches using this scope information instead is a modification readily made by those skilled in the art. Because the scope information indicates the range of each node's descendants, the range of descendants from a given node can be determined. In the horizontal branch search as well, the scope information of the upper appearance information and that of the lower appearance information can be combined by calculation, so that, when applied to the corresponding part of the present embodiment, the horizontal branch search range can also be determined using the scope information.
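Scope information of this kind can be computed in a single depth-first pass. The sketch below uses a hypothetical five-node tree given as a child-list dictionary; the function names are introduced here for illustration only.

```python
def compute_scopes(children, root):
    """Return {node: (own number, largest descendant number)} by depth-first numbering."""
    scopes = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        number = counter
        for child in children.get(node, []):
            visit(child)
        scopes[node] = (number, counter)   # counter is now the last descendant's number

    visit(root)
    return scopes


def is_descendant(scopes, ancestor, node):
    """True if `node` lies strictly inside the scope of `ancestor`."""
    a_first, a_last = scopes[ancestor]
    return a_first < scopes[node][0] <= a_last


# Hypothetical tree: a -> (b, e), b -> (c, d).
tree = {"a": ["b", "e"], "b": ["c", "d"]}
scopes = compute_scopes(tree, "a")
print(scopes["b"])  # prints (2, 4)
```

With scopes in hand, a descendant test is a pair of integer comparisons, which is what makes the scope usable in place of an EpRange for range-restricted searches.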

The hardware configuration of the computer on which the program of the present embodiment is executed is that of a general computer, as illustrated in FIG. 26, specifically a personal computer, a computer that can serve as a server, or the like. It comprises a CPU 2601 that executes programs such as the structure information management module 210, appearance information selection module 220, appearance information management module 230, survey range processing module 240, search processing module 250, search state management module 260, and extraction information processing module 270; a RAM 2602 that stores the programs and data; a ROM 2603 that stores programs for starting the computer and the like; an HD 2604 serving as an auxiliary storage device (for example, a hard disk can be used); an input device 2606 for entering data, such as a keyboard and a mouse; an output device 2605 such as a CRT or a liquid crystal display; a communication line interface 2607 for connecting to a communication network (for example, a network interface card can be used); and a bus 2608 that connects them for exchanging data. A plurality of such computers may be connected to one another by a network.

Of the embodiments described above, those implemented by a computer program are realized by loading the computer program, which is software, into a system of this hardware configuration, so that the software and hardware resources cooperate.
The hardware configuration shown in FIG. 26 is one example; the present embodiment is not limited to the configuration shown in FIG. 26 and may be any configuration capable of executing the modules described in the present embodiment. For example, some modules may be implemented in dedicated hardware (for example, an ASIC), some modules may reside in an external system and be connected by a communication line, and a plurality of the systems shown in FIG. 26 may be connected to one another by communication lines and operate cooperatively. In addition to personal computers, the configuration may in particular be incorporated into information appliances, copiers, fax machines, scanners, printers, multifunction machines (image processing apparatuses having two or more of the functions of a scanner, printer, copier, fax machine, and the like), and so on.

The program described above may be provided stored on a recording medium, or may be provided by communication means. In that case, for example, the program described above may be regarded as an invention of a "computer-readable recording medium on which a program is recorded."
A "computer-readable recording medium on which a program is recorded" means a computer-readable recording medium on which a program is recorded and which is used for installing, executing, or distributing the program.
Recording media include, for example: digital versatile discs (DVD), such as "DVD-R, DVD-RW, DVD-RAM, etc.", which are standards established by the DVD Forum, and "DVD+R, DVD+RW, etc.", which are standards established by DVD+RW; compact discs (CD), such as read-only memory (CD-ROM), CD recordable (CD-R), and CD rewritable (CD-RW); magneto-optical disks (MO); flexible disks (FD); magnetic tape; hard disks; read-only memory (ROM); electrically erasable and rewritable read-only memory (EEPROM); flash memory; and random access memory (RAM).
The program or a part of it may be recorded on the recording medium for storage, distribution, and so on. It may also be transmitted by communication, using a transmission medium such as a wired network used for a local area network (LAN), metropolitan area network (MAN), wide area network (WAN), the Internet, an intranet, an extranet, or the like, a wireless communication network, or a combination of these, or it may be carried on a carrier wave.
Furthermore, the program may be part of another program, or may be recorded on a recording medium together with a separate program. It may also be divided and recorded on a plurality of recording media. It may be recorded in any form, such as compressed or encrypted, as long as it can be restored.

FIG. 1 is an explanatory diagram showing a conceptual configuration example of a system to which the present embodiment is suitably applied.
FIG. 2 is a conceptual module configuration diagram of a configuration example of the present embodiment.
FIG. 3 is a flowchart showing a processing example according to the present embodiment.
FIG. 4 is a flowchart showing a processing example of the frequent structure pattern search according to the present embodiment.
FIG. 5 is a flowchart showing a processing example of the lower search according to the present embodiment.
FIG. 6 is a flowchart showing an example of the descendant node scanning process according to the present embodiment.
FIG. 7 is a flowchart showing a processing example of the horizontal branch search according to the present embodiment.
FIG. 8 is an explanatory diagram showing an example of the target tree structure data.
FIG. 9 is an explanatory diagram showing an example of Epath.
FIG. 10 is an explanatory diagram showing an example of EpRange at each node.
FIG. 11 is an explanatory diagram showing an example of the EpTree of tree structure data Tr1.
FIG. 12 is an explanatory diagram showing an example of the EpTree of tree structure data Tr2.
FIG. 13 is an explanatory diagram showing an example of the EpTree of tree structure data Tr3.
FIG. 14 is an explanatory diagram showing an example of the appearances of label A in each tree structure data.
FIG. 15 is an explanatory diagram showing an example of a search state node.
FIG. 16 is an explanatory diagram showing an example of the search state when label A is selected as the root node of the pattern.
FIG. 17 is an explanatory diagram showing an example of the appearances of label B in each tree structure data.
FIG. 18 is an explanatory diagram showing an example of the search state node A[B] when label B is selected as a descendant of label A.
FIG. 19 is an explanatory diagram showing an example of the search state when label B is selected as a descendant of label A.
FIG. 20 is an explanatory diagram showing an example of the search state node A[B[C]] when label C is selected as a descendant of label B.
FIG. 21 is an explanatory diagram showing an example of the search state when label C is selected as a descendant of label B.
FIG. 22 is an explanatory diagram showing an example of the search state node A[B].
FIG. 23 is an explanatory diagram showing an example of the search state node A.
FIG. 24 is an explanatory diagram showing an example of the search state node A[B[CF]].
FIG. 25 is an explanatory diagram showing an example of the search state A[B[CF]].
FIG. 26 is a block diagram showing a hardware configuration example of a computer that implements the present embodiment.

Explanation of symbols

110 ... Structure information DB
120 ... Information collection device
130 ... Information extraction device
140 ... Extraction information management device
210 ... Structure information management module
220 ... Appearance information selection module
230 ... Appearance information management module
240 ... Survey range processing module
250 ... Search processing module
260 ... Search state management module
270 ... Extraction information processing module

Claims

An information processing apparatus comprising: first search means for searching for a structure pattern that appears multiple times in a plurality of tree structures, with respect to nodes lower than the node currently being processed in the tree structure; and
second search means for performing the search for the structure pattern, with respect to a node higher than the node currently being processed in the tree structure, for each unsearched node lower than that higher node,
wherein the first search means and the second search means return to the original node from which the search started when no node to be searched exists.

The information processing apparatus according to claim 1, further comprising first search range determining means for determining, for the nodes in the tree structure that match the node currently being processed in the structure pattern, a range in which to search for descendants based on the highest-order node among those nodes that stand in a vertical relationship to one another,
wherein the first search means performs the search based on the range determined by the first search range determining means.

The information processing apparatus according to claim 1, further comprising second search range determining means for determining a search range based on the highest-order node, among the nodes in the tree structure that match a parent node in the structure pattern and stand in a vertical relationship to one another, and on the lowest-order node, among the nodes in the tree structure that match a child node in the structure pattern and stand in a vertical relationship to one another,
wherein the second search means performs the search based on the range determined by the second search range determining means.

The information processing apparatus according to claim 3, further comprising holding means for holding, among the appearance locations of the ancestor nodes of the node currently being processed in the structure pattern, the appearance locations corresponding to nodes whose descendants include a node at an appearance location of the node currently being processed.

5. The information processing apparatus according to claim 4, further comprising deletion means for deleting, from the holding means, appearance locations other than the highest and lowest in a range of the tree structure without branching.

The information processing apparatus according to claim 4, further comprising selecting means for selecting the appearance locations to be held by the holding means, the selecting means selecting appearance locations excluding those other than the highest and lowest in a range of the tree structure without branching.

An information processing program causing a computer to function as:
first search means for searching for a structure pattern that appears multiple times in a plurality of tree structures, with respect to nodes lower than the node currently being processed in the tree structure; and
second search means for performing the search for the structure pattern, with respect to a node higher than the node currently being processed in the tree structure, for each unsearched node lower than that higher node,
wherein the first search means and the second search means return to the original node from which the search started when no node to be searched exists.