JP5701830B2

JP5701830B2 - Document structure analysis apparatus and program

Info

Publication number: JP5701830B2
Application number: JP2012194305A
Authority: JP
Inventors: 精一紺谷; 明通田中; 匡内山
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2012-09-04
Filing date: 2012-09-04
Publication date: 2015-04-15
Anticipated expiration: 2032-09-04
Also published as: JP2014049088A

Description

The present invention relates to a document structure analysis apparatus and program, and more particularly to a document structure analysis apparatus and program for analyzing the hierarchical structure of the elements of a structured document.

Non-Patent Document 1 proposes a technique called a web wrapper: given pairs of an HTML document and the portion to be extracted as training examples, it uses machine learning to automatically generate a document structure extraction program.

I. Muslea, S. Minton, and C. Knoblock, "A Hierarchical Approach to Wrapper Induction," pp. 190-197, AGENTS '99, 1999.

However, with the method of Non-Patent Document 1, web pages have a different HTML structure on each site even when they contain similar information. Acquiring information comprehensively from various sites therefore requires a large number of training examples, each a pair of an HTML document and the portion to be extracted, and the manual effort is large. In addition, when a template is updated, for example when an EC site is redesigned, the HTML structure of the product pages changes and the learned rules can no longer be applied, so the training examples must be recreated and the rules relearned.

The present invention has been made in view of the above circumstances. Its object is to provide a document structure analysis apparatus and program that, by clustering a plurality of elements in a structured document, can cluster the elements of the structured document accurately and obtain the portion to be extracted.

To achieve the above object, a document structure analysis apparatus according to the present invention includes: hierarchical structure analysis means for analyzing, for each of at least one structured document to be analyzed, the hierarchical structure of the elements of the structured document; position information analysis means for analyzing the display position of each element of the structured document when the structured document analyzed by the hierarchical structure analysis means is displayed; structural similarity calculation means for calculating, based on the analysis result of the hierarchical structure analysis means, a structural similarity concerning element structure for each pair of a first element and a second element among the plurality of elements in the at least one structured document, by comparing the first element and its descendant elements with the second element and its descendant elements; position similarity calculation means for calculating, based on the analysis result of the position information analysis means, a position similarity concerning element display position for each pair of two elements among the plurality of elements in the at least one structured document; and clustering means for clustering the plurality of elements in the at least one structured document based on the structural similarity calculated by the structural similarity calculation means and the position similarity calculated by the position similarity calculation means.

According to the present invention, the hierarchical structure analysis means analyzes, for each of at least one structured document to be analyzed, the hierarchical structure of the elements of the structured document. The position information analysis means analyzes the display position of each element of the structured document when the structured document analyzed by the hierarchical structure analysis means is displayed.

Then, based on the analysis result of the hierarchical structure analysis means, the structural similarity calculation means calculates a structural similarity concerning element structure for each pair of a first element and a second element among the plurality of elements in the at least one structured document, by comparing the first element and its descendant elements with the second element and its descendant elements.

Then, based on the analysis result of the position information analysis means, the position similarity calculation means calculates a position similarity concerning element display position for each pair of two elements among the plurality of elements in the at least one structured document.

Then, the clustering means clusters the plurality of elements in the at least one structured document based on the structural similarity calculated by the structural similarity calculation means and the position similarity calculated by the position similarity calculation means.

In this way, the apparatus analyzes the hierarchical structure of the elements of a structured document, analyzes the display position of each element when the analyzed structured document is displayed, calculates for each pair of two elements among the plurality of elements in the structured document a structural similarity concerning element structure and a position similarity concerning element display position, and clusters the plurality of elements based on the structural similarity and the position similarity. As a result, the elements of the structured document can be clustered accurately, and the portion to be extracted can be obtained.

The structural similarity calculation means may, based on the analysis result of the hierarchical structure analysis means, calculate for each element in the at least one structured document a histogram of the attribute names of that element and its descendant elements, and calculate, for each pair of a first element and a second element, the similarity between the histogram of the first element and the histogram of the second element as the structural similarity.

The position similarity calculation means may, based on the analysis result of the position information analysis means, calculate for each pair of two elements a similarity based on the display positions and sizes of the two elements when the structured document is displayed, and use it as the position similarity.

The apparatus may further include priority calculation means for calculating, for each cluster produced by the clustering means, a priority of the cluster based on the words contained in the elements belonging to the cluster and on predetermined keywords.

The apparatus may further include candidate search means for searching, based on the structural similarity calculated by the structural similarity calculation means, for a plurality of cluster candidates, each containing two elements among the plurality of elements in the at least one structured document that are to be clustered into the same cluster. In that case, the clustering means may cluster the plurality of elements in the at least one structured document based on the structural similarity and the position similarity between the elements contained in each cluster candidate found by the candidate search means.

For each of the plurality of elements in the at least one structured document, the candidate search means may select, as the element to be included in a cluster candidate together with that element, the element having the maximum structural similarity among elements that are in a parent-child relationship with one another.

The candidate search means may search for the plurality of cluster candidates such that each contains two elements that are to be clustered into the same cluster and whose display positions, when the structured document is displayed, are separated by a distance no greater than a predetermined value.

A program according to the present invention is a program for causing a computer to function as each means of the above document structure analysis apparatus.

As described above, the document structure analysis apparatus and program of the present invention analyze the hierarchical structure of the elements of a structured document, analyze the display position of each element when the analyzed structured document is displayed, calculate for each pair of two elements among the plurality of elements in the structured document a structural similarity concerning element structure and a position similarity concerning element display position, and cluster the plurality of elements based on the structural similarity and the position similarity. This yields the effect that the elements of the structured document can be clustered accurately and the portion to be extracted can be obtained.

A schematic diagram showing the configuration of the document structure analysis apparatus according to the embodiment of the present invention.
A schematic diagram of the keyword list KW^ used in the embodiment of the present invention.
A schematic diagram of the hierarchical structure information table K^ used in the embodiment of the present invention.
A schematic diagram of the position information table P^ used in the embodiment of the present invention.
A schematic diagram of the cluster table C^ used in the embodiment of the present invention.
A schematic diagram of the structural similarity table L^str^ used in the embodiment of the present invention.
A schematic diagram of the position similarity table L^pos^ used in the embodiment of the present invention.
A diagram showing the web page of operation example 1 used in the description of the embodiment of the present invention.
A diagram showing the HTML tag structure of the web page of operation example 1 used in the description of the embodiment of the present invention.
A diagram showing the DOM tree structure of operation example 1 used in the description of the embodiment of the present invention.
A diagram showing the hierarchical structure information table of operation example 1 used in the description of the embodiment of the present invention.
A flowchart showing the structured document information extraction processing routine in the document structure analysis apparatus according to the embodiment of the present invention.
A flowchart showing the first half of the candidate search processing routine called within the structured document information extraction processing routine.
A flowchart showing the second half of the candidate search processing routine called within the structured document information extraction processing routine.
An algorithm showing the first half of the candidate search processing routine called within the structured document information extraction processing routine.
An algorithm showing the second half of the candidate search processing routine called within the structured document information extraction processing routine.
An algorithm showing the firstChild routine called within the candidate search processing routine.
An algorithm showing the nextSibling routine called within the candidate search processing routine.
An algorithm showing the nextNode routine called within the candidate search processing routine.
A flowchart showing the clustering processing routine called within the structured document information extraction processing routine.
An algorithm of the clustering processing routine called within the structured document information extraction processing routine.
An algorithm showing the priority calculation processing routine called within the structured document information extraction processing routine.
A diagram showing the keyword list KW^ of operation example 1 used in the description of the embodiment of the present invention.
A diagram showing the position information table P^ of operation example 1 used in the description of the embodiment of the present invention.
A diagram showing the structural similarity table L^str^ of operation example 1 used in the description of the embodiment of the present invention.
A diagram showing the position similarity table L^pos^ of operation example 1 used in the description of the embodiment of the present invention.
A diagram showing the cluster table C^ of operation example 1 used in the description of the embodiment of the present invention.
A diagram showing the clustering result of operation example 1 used in the description of the embodiment of the present invention.
A diagram showing the output result of operation example 1 used in the description of the embodiment of the present invention.
A diagram showing the web page of operation example 2 used in the description of the embodiment of the present invention.
A diagram showing the HTML tag structure of the web page of operation example 2 used in the description of the embodiment of the present invention.
A diagram showing the keyword list KW^ of operation example 2 used in the description of the embodiment of the present invention.
A diagram showing the output result of operation example 2 used in the description of the embodiment of the present invention.
An algorithm showing another configuration example 1 of the priority calculation processing routine called within the structured document information extraction processing routine.
An algorithm showing another configuration example 2 of the priority calculation processing routine called within the structured document information extraction processing routine.
An algorithm showing another configuration example 3 of the priority calculation processing routine called within the structured document information extraction processing routine.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings.

<System configuration>

The document structure analysis apparatus according to the present embodiment is implemented as a computer including a CPU, a RAM, and a ROM storing a program for executing the structured document information extraction processing routine described later. Functionally, it is configured as follows.

As shown in FIG. 1, the document structure analysis apparatus according to the present embodiment includes an input unit 1, a storage unit 2, a calculation unit 3, and an output unit 4. The calculation unit 3 includes a hierarchical structure analysis unit 10, a position information analysis unit 12, a structural similarity calculation unit 14, a position similarity calculation unit 16, a candidate search unit 18, a clustering unit 20, and a priority calculation unit 22.

The storage unit 2 may be provided externally and connected to the document structure analysis apparatus via a network.

The input unit 1 accepts input from outside. For example, via a network or a file, the input unit 1 accepts a group of web pages, the parameters described later (structural similarity threshold θ_str, position similarity threshold θ_pos, and search range thresholds γ_0 and γ_1), and the keyword list KW^ described later.

The storage unit 2 stores the various data input through the input unit 1. Specifically, it stores the web page group, the parameters (structural similarity threshold θ_str, position similarity threshold θ_pos, and search range thresholds γ_0 and γ_1), and the keyword list KW^ shown in FIG. 2. The storage unit 2 also stores the results of the processes described later: specifically, the hierarchical structure information table K^ shown in FIG. 3, the position information table P^ shown in FIG. 4, the cluster table C^ shown in FIG. 5, the structural similarity table L^str^ shown in FIG. 6, the position similarity table L^pos^ shown in FIG. 7, and so on. The "^" appended to a symbol indicates that the symbol is a matrix, a multidimensional array, or a vector. The storage unit 2 also stores the results of the other processes described later.

The hierarchical structure analysis unit 10 analyzes, for each web page input through the input unit 1 and stored in the storage unit 2, the hierarchical structure of the elements indicated by HTML tags. Specifically, it generates from the web page a DOM (Document Object Model) tree containing a node for each element indicated by an HTML tag, creates for each node in the DOM tree a histogram of the node names (attribute names) of the node itself and its descendant nodes, and creates the hierarchical structure information table K^ shown in FIG. 3.
For example, when the web page of operation example 1 (described later) shown in FIG. 8 is input, it analyzes the hierarchical structure of the elements indicated by HTML tags as shown in FIG. 9 and generates the tree shown in FIG. 10. It then creates, for each node in the DOM tree, a histogram as shown in FIG. 11.
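The tree-and-histogram step can be sketched roughly as follows. This is a minimal illustration using Python's standard `html.parser`, not the apparatus's actual implementation. The names `Node`, `TreeBuilder`, `fill_histograms`, `parse`, and `VOID_TAGS` are hypothetical, and the layout of the table K^ is not modeled.

```python
from collections import Counter
from html.parser import HTMLParser

# Void elements never get an end tag, so they must not stay on the open-element stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One DOM-like node: tag name, parent, children, and a name histogram."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.hist = Counter()


class TreeBuilder(HTMLParser):
    """Builds a simple element tree (text nodes are ignored)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")  # hypothetical pseudo-root
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.stack[-1])
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].name == tag:
                del self.stack[i:]
                break


def fill_histograms(node):
    """Histogram of node names over the node itself and all its descendants."""
    node.hist = Counter({node.name: 1})
    for child in node.children:
        node.hist += fill_histograms(child)
    return node.hist


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    fill_histograms(builder.root)
    return builder.root
```

Calling `parse(html)` returns the pseudo-root; each node's `hist` then counts the tag names in its own subtree, which is the per-node information the text describes.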

The position information analysis unit 12 calculates, for each web page input through the input unit 1 and stored in the storage unit 2, the display position on the web browser and the width and height of the display area of each element indicated by an HTML tag, and creates the position information table P^ shown in FIG. 4. The width and height of an element's display area are one example of the size of the element when displayed.

Based on the histogram of each node calculated by the hierarchical structure analysis unit 10, the structural similarity calculation unit 14 calculates, for each pair of two nodes, a similarity concerning the hierarchical structure of the two elements. For example, given node A and node B, their structural similarity S^str(A, B) is calculated according to equation (1) below.

Here, hist^A and hist^B are the histograms of nodes A and B calculated by the hierarchical structure analysis unit 10.
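Equation (1) itself does not appear in this text, so the following sketch uses cosine similarity between the two histograms purely as an assumed stand-in. It shows one plausible way to compare hist^A and hist^B and should not be read as the patented formula. The function name `structural_similarity` is hypothetical.

```python
import math


def structural_similarity(hist_a, hist_b):
    """Cosine similarity of two node-name histograms (assumption, not equation (1))."""
    dot = sum(count * hist_b.get(name, 0) for name, count in hist_a.items())
    norm_a = math.sqrt(sum(c * c for c in hist_a.values()))
    norm_b = math.sqrt(sum(c * c for c in hist_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty histogram is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

With this choice, two subtrees with identical tag counts score 1 (up to rounding) and subtrees sharing no tag names score 0, which fits the threshold-based use of θ_str described elsewhere in the text.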

Based on the position information calculated by the position information analysis unit 12, the position similarity calculation unit 16 calculates, for each pair of two nodes, the similarity of the positional relationship between the two elements. For example, given node A and node B, their position similarity S^pos(A, B) is calculated according to equation (2) below.

Here, (x_A, y_A, w_A, h_A) and (x_B, y_B, w_B, h_B) are the position information of nodes A and B calculated by the position information analysis unit 12. (W_A, H_A) and (W_B, H_B) are the width and height, on the web browser, of the display area (including the scrollable portion) of the web pages containing A and B, respectively.
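Equation (2) is likewise absent from this text. As an assumed stand-in only, the sketch below normalizes each box by its page's width and height and returns one minus the mean absolute difference of the four normalized values. `Box` and `position_similarity` are hypothetical names, and the real equation may weight or combine the terms differently.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Display position (x, y) and size (w, h) of an element, in pixels."""
    x: float
    y: float
    w: float
    h: float


def position_similarity(a, page_a, b, page_b):
    """Assumed similarity in [0, 1] when both boxes lie inside their pages.

    page_a and page_b are (W, H) tuples: the full page width and height.
    """
    (wa, ha), (wb, hb) = page_a, page_b
    na = (a.x / wa, a.y / ha, a.w / wa, a.h / ha)
    nb = (b.x / wb, b.y / hb, b.w / wb, b.h / hb)
    return 1.0 - sum(abs(p - q) for p, q in zip(na, nb)) / 4
```

Normalizing by (W, H) makes elements on pages of different sizes comparable, which is presumably why the page dimensions appear alongside the element coordinates in the text.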

Based on the parameters input through the input unit 1 (structural similarity threshold θ_str and search range thresholds γ_0 and γ_1), the structural similarity calculated by the structural similarity calculation unit 14, and the node position information calculated by the position information analysis unit 12, the candidate search unit 18 searches the nodes of the DOM tree generated by the hierarchical structure analysis unit 10 for sets of nodes that are cluster candidates. As the result of the candidate search, it creates the structural similarity table L^str^ shown in FIG. 6, which stores the structural similarity S^str(A, B) calculated by equation (1) for each cluster candidate node pair (i, j). It likewise creates the position similarity table L^pos^ shown in FIG. 7, which stores the position similarity S^pos(A, B) calculated by equation (2) for each cluster candidate node pair (i, j). It then creates the cluster table C^ shown in FIG. 5, in which each node belonging to at least one cluster candidate forms its own cluster.
Specifically, it traverses the nodes of the DOM tree and detects pairs of two nodes whose structural similarity, calculated by the structural similarity calculation unit 14, is at or above the threshold and whose display positions, calculated by the position information analysis unit 12, lie within the search range. Then, for one node of a detected pair, so that no parent-child relationship remains within the group of nodes that can serve as the other node, it selects from among the nodes that are in a parent-child relationship the one with the maximum structural similarity to that one node as the other node, and treats the pair as a cluster candidate. Finally, it stores the structural similarity and position similarity of each cluster candidate to create the structural similarity table L^str^ and the position similarity table L^pos^, and stores them in the storage unit 2.
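The parent-child pruning described above can be illustrated with a greedy sketch. It rests on assumptions: "parent-child relationship" is read as any ancestor-descendant relation, the search-range check with γ_0 and γ_1 is omitted, and `is_ancestor`, `select_partners`, and the `parent_of` mapping are hypothetical names and data structures.

```python
def is_ancestor(parent_of, a, b):
    """True if node a is a proper ancestor of node b in the tree."""
    node = parent_of.get(b)
    while node is not None:
        if node == a:
            return True
        node = parent_of.get(node)
    return False


def select_partners(partners, parent_of, theta_str):
    """Pick partner nodes for one fixed node.

    partners maps a partner node id to its structural similarity with the
    fixed node. Partners below theta_str are dropped; among partners related
    as ancestor and descendant, only the most similar one is kept.
    """
    kept = []
    ranked = sorted(partners.items(), key=lambda item: item[1], reverse=True)
    for node, sim in ranked:
        if sim < theta_str:
            break  # everything after this is below the threshold too
        related = any(
            is_ancestor(parent_of, k, node) or is_ancestor(parent_of, node, k)
            for k in kept
        )
        if not related:
            kept.append(node)
    return kept
```

Visiting partners in descending order of similarity guarantees that whenever two related nodes both pass the threshold, the one with the larger structural similarity is the one retained.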

Based on the structural similarity table L^str^, the position similarity table L^pos^, and the cluster table C^ created by the candidate search unit 18, the clustering unit 20 clusters the nodes by repeatedly merging clusters in the cluster table C^ and updating the table. In each iteration, among the cluster candidates whose position similarity (calculated by the position similarity calculation unit 16) and structural similarity (calculated by the structural similarity calculation unit 14) are both at or above the specified thresholds, it merges the two clusters corresponding to the node pair of the cluster candidate with the maximum structural similarity and updates the cluster table C^.
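The merge loop can be sketched as follows. This is an assumed simplification: the full clustering routine is given in figures not reproduced here, and the stopping rule used below (stop once no eligible candidate remains) is a guess. `cluster_nodes` and the tuple layout of `candidates` are hypothetical.

```python
def cluster_nodes(candidates, theta_str, theta_pos):
    """Merge clusters along eligible candidate pairs, most similar pair first.

    candidates: list of (node_i, node_j, structural_sim, position_sim).
    Returns a dict mapping each node to the id of its final cluster.
    """
    # Initially every node that appears in a candidate is its own cluster.
    cluster_of = {}
    for i, j, _, _ in candidates:
        cluster_of.setdefault(i, i)
        cluster_of.setdefault(j, j)

    # Keep only pairs that clear both thresholds, ordered by structural similarity.
    eligible = [c for c in candidates if c[2] >= theta_str and c[3] >= theta_pos]
    eligible.sort(key=lambda c: c[2], reverse=True)

    for i, j, _, _ in eligible:
        ci, cj = cluster_of[i], cluster_of[j]
        if ci == cj:
            continue  # already in the same cluster
        for node in list(cluster_of):
            if cluster_of[node] == cj:
                cluster_of[node] = ci  # fold cluster cj into ci
    return cluster_of
```

Processing the eligible pairs in one pass sorted by structural similarity gives the same merges as repeatedly picking the maximum remaining pair, because the similarities do not change between iterations in this simplified version.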

The priority calculation unit 22 calculates the priority of each cluster generated by the clustering unit 20. Specifically, based on the keyword list KW^ input through the input unit 1, the hierarchical structure information table K^ created by the hierarchical structure analysis unit 10, and the cluster table C^ obtained by the clustering unit 20, it calculates, for each node belonging to a cluster, the sum of the weights of the keywords the node contains, and takes the maximum of these sums as the priority of the cluster.

Based on the priority of each cluster calculated by the priority calculation unit 22, the output unit 4 outputs the clustering result generated by the clustering unit 20 in order of priority.
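The priority rule (the maximum over a cluster's nodes of the summed keyword weights) followed by priority-ordered output can be sketched as below. The data shapes are assumptions: `clusters` maps a cluster id to its node ids, `node_words` maps a node id to the set of words it contains, and `keyword_weights` stands in for the keyword list KW^. All function names are hypothetical.

```python
def cluster_priorities(clusters, node_words, keyword_weights):
    """Priority of a cluster = max over its nodes of the summed keyword weights."""
    priorities = {}
    for cluster_id, nodes in clusters.items():
        node_scores = (
            sum(weight for keyword, weight in keyword_weights.items()
                if keyword in node_words.get(node, ()))
            for node in nodes
        )
        priorities[cluster_id] = max(node_scores, default=0.0)
    return priorities


def clusters_by_priority(priorities):
    """Cluster ids ordered from highest to lowest priority."""
    return sorted(priorities, key=priorities.get, reverse=True)
```

Taking the maximum rather than the total means one keyword-rich node is enough to rank its whole cluster highly, regardless of how many nodes the cluster contains.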

<Operation>

Next, the operation of the document structure analysis apparatus according to the present embodiment will be described. First, the web page group, the parameters (structural similarity threshold θ_str, position similarity threshold θ_pos, and search range thresholds γ_0 and γ_1), and the keyword list KW^ are input to the document structure analysis apparatus and stored in the storage unit 2. The document structure analysis apparatus then executes the structured document information extraction processing routine shown in FIG. 12.

First, in step S100, the web page group, the parameters (structural similarity threshold θ_str, position similarity threshold θ_pos, and search range thresholds γ_0 and γ_1), and the keyword list KW^ input through the input unit 1 are read from the storage unit 2.

Next, in step S102, the hierarchical structure of a web page is analyzed. Specifically, one web page input through the input unit 1 is read from the storage unit 2, and a DOM (Document Object Model) tree containing a node for each element indicated by an HTML tag is generated. For each node in the DOM tree, a histogram of the node names (attribute names) of the node itself and its descendant nodes is created, and the hierarchical structure information table K^ shown in FIG. 3 is created and stored in the storage unit 2.

Next, in step S104, the position information of the web page is analyzed. Specifically, one web page input through the input unit 1 is read from the storage unit 2, the display position on the web browser of each element indicated by an HTML tag is calculated, and the position information table P^ shown in FIG. 4 is created and stored in the storage unit 2.

In step S106, it is determined whether the structural information analysis of step S102 and the position information analysis of step S104 have been completed for every web page in the web page group stored in the storage unit 2. If the analysis has not been completed for all of the web pages, the process returns to step S102. If it has been completed for all of them, the process proceeds to step S108.

In step S108, the candidate search processing routine shown in FIGS. 13 and 14 is executed to search the nodes for pairs of nodes that are cluster candidates. FIGS. 15 and 16 show the algorithm corresponding to this routine.

<Candidate search processing routine>
The candidate search processing routine searches the nodes for pairs of nodes that are cluster candidates. It is executed for each pair of web pages obtained from the web page group, and also for each pair that consists of the same web page taken twice.

First, in step S200, the pair of web pages to be processed (pageA, pageB), the parameters (structural similarity threshold θ_str and the search range parameters γ_0 and γ_1), and the position information table P^ created in step S104 are read in again.

Next, in step S202, it is determined whether pageA and pageB are the same page. If they are the same, the process proceeds to step S204. If they are not, the process proceeds to step S206.

In step S204, the value of γ_0 is assigned to the threshold γ that specifies the search range. In step S206, the value of γ_1 is assigned to γ. Here, γ_0 and γ_1 are the parameters that specify the search range within the same page and across different pages, respectively.

Next, in step S208, the firstChild routine shown in FIG. 17 is called with pageA as input, and its return value is assigned to the node number n1, which indicates one node of a pair of nodes that can be a cluster candidate. The firstChild routine returns the node number of the first child node of pageA if pageA has child nodes, and returns φ (empty) otherwise.

Next, in step S210, it is determined whether the value of n1 calculated in step S208 is φ (empty). If n1 is not φ (empty), the process proceeds to step S212. If n1 is φ (empty), the process proceeds to step S306.

Next, in step S212, φ (empty) is assigned to bnode, the node number indicating the node that is the current candidate for the other node of the cluster candidate, and −1 is assigned to score.

Next, in step S214, the position information corresponding to node number n1 in the position information table P^A^ for pageA is assigned to (x_1, y_1, w_1, h_1).

In step S216, it is determined whether pageA and pageB are the same page. If they are the same, the process proceeds to step S218. If they are not, the process proceeds to step S220.

In step S218, the nextSibling routine shown in FIG. 18 is called with n1 as input, and its return value is assigned to the node number n2, which indicates the other node of a pair of nodes that can be a cluster candidate. If the node corresponding to the input node number n1 has a sibling node, the nextSibling routine returns the node number of the first sibling node that follows it. If the node has no such sibling node, the routine returns φ (empty).

In step S220, the firstChild routine shown in FIG. 17 is called with pageB as input, and its return value is assigned to n2.

Next, in step S222, it is determined whether the current value of n2 (calculated in step S218 or step S220, or in the previous execution of step S240 described later) is φ (empty). If n2 is not φ (empty), the process proceeds to step S224. If n2 is φ (empty), the process proceeds to step S300.

Next, in step S224, the structural similarity S^str(n1, n2) is calculated from the current n1 (calculated in step S208 or in the previous execution of step S304 described later) and the current n2 (calculated in step S218 or step S220, or in the previous execution of step S240 described later). The result is assigned to sim, the structural similarity value used for comparison.

Next, in step S226, it is determined whether sim calculated in step S224 is equal to or greater than the structural similarity threshold θ_str read in step S200. If sim is equal to or greater than θ_str, the process proceeds to step S228. If sim is less than θ_str, the process proceeds to step S240.

Next, in step S228, the position information corresponding to node number n2 in the position information table P^B^ for pageB is assigned to (x_2, y_2, w_2, h_2).

Next, in step S230, a search range test is performed using the value of γ assigned in step S204 or step S206, the values (x_1, y_1, w_1, h_1) assigned in step S214, and the values (x_2, y_2, w_2, h_2) assigned in step S228. Specifically, if x_2 − x_1 ≤ γw_1, x_1 − x_2 ≤ γw_2, y_2 − y_1 ≤ γh_1, and y_1 − y_2 ≤ γh_2 all hold, the process proceeds to step S232. Otherwise, the process proceeds to step S240.
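The choice of γ in steps S202 to S206 and the four inequalities of step S230 can be sketched as follows. The function names `select_gamma` and `within_search_range` are hypothetical, and positions are assumed to be plain (x, y, w, h) tuples. The sample values are those quoted in operation example 1 of this description.

```python
def select_gamma(page_a, page_b, gamma0, gamma1):
    """Steps S202 to S206: gamma0 for a page paired with itself, gamma1 otherwise."""
    return gamma0 if page_a == page_b else gamma1


def within_search_range(gamma, pos1, pos2):
    """Step S230: pos1 and pos2 are (x, y, w, h) tuples for nodes n1 and n2."""
    x1, y1, w1, h1 = pos1
    x2, y2, w2, h2 = pos2
    return (
        x2 - x1 <= gamma * w1
        and x1 - x2 <= gamma * w2
        and y2 - y1 <= gamma * h1
        and y1 - y2 <= gamma * h2
    )


gamma = select_gamma("pageA", "pageA", gamma0=1.5, gamma1=0.5)
in_range = within_search_range(gamma, (0.56, 0.37, 12, 8.5), (0.56, 9.25, 4, 0.7))
print(gamma, in_range)
```

With the tighter value γ = 0.5 the same two positions fail the third inequality, since 9.25 − 0.37 exceeds 0.5 × 8.5.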

Next, in step S232, a determination is made based on the current bnode (assigned in step S212, or in the previous execution of step S234 or step S238 described later) and the current n2 (assigned in step S218 or step S220, or in the previous execution of step S240 described later). Specifically, if bnode ≠ φ (empty) and n2 is not in a parent-child relationship with bnode, the process proceeds to step S234. Otherwise, the process proceeds to step S236.

In step S234, the current value of score is stored in L^str({n1, bnode}) of the structural similarity table L^str^. This score is the maximum structural similarity among the candidate nodes in a parent-child relationship with bnode, as set in the previous execution of this step S234 or of step S238 described later. The value of the position similarity S^pos(n1, bnode) is then stored in L^pos({n1, bnode}) of the position similarity table L^pos^. In the cluster table C^, the current value of n1 (assigned in step S208 or in the previous execution of step S304 described later) is stored in C(n1), and the current value of bnode (assigned in the previous execution of this step S234 or of step S238 described later) is stored in C(bnode). Finally, the current value of n2 (calculated in step S218 or step S220, or in the previous execution of step S240 described later) is assigned to bnode, and the value of sim calculated in step S224 is assigned to score.

In step S236, the current value of score (assigned in step S212, or in the previous execution of step S234 or of step S238 described later) is compared with the value of sim calculated in step S224. If score is less than sim, the process proceeds to step S238. If score is equal to or greater than sim, the process proceeds to step S240.

Next, in step S238, the current value of n2 (calculated in step S218 or step S220, or in the previous execution of step S240 described later) is assigned to bnode. The value of sim calculated in step S224 is then assigned to score, the maximum structural similarity among the candidate nodes in a parent-child relationship with bnode.

Next, in step S240, the nextNode routine shown in FIG. 19 is called with pageB and n2 as input, and its return value is assigned to n2. If node n2 in the input pageB has child nodes, the nextNode routine returns the node number of the first child node of node n2. If node n2 has no child nodes, it returns the node number of the first sibling node that follows node n2. If node n2 has no such sibling node, it returns the node number of the next sibling node of the parent node of node n2. Otherwise, it returns φ (empty).
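The three traversal routines (firstChild, nextSibling, nextNode) can be sketched as follows. This is an illustration under assumptions: the hypothetical `Node` class stands in for a DOM node, `None` stands in for φ (empty), and the routines take node objects rather than a page and a node number. The text above mentions only the parent's next sibling; climbing through further ancestors, as done here, is an assumption made so that the walk visits every node in document order.

```python
class Node:
    """Hypothetical stand-in for a DOM node identified by a node number."""

    def __init__(self, number, children=()):
        self.number = number
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self


def first_child(node):
    """Return the first child, or None when the node has no children."""
    return node.children[0] if node.children else None


def next_sibling(node):
    """Return the sibling that directly follows the node, or None."""
    if node.parent is None:
        return None
    siblings = node.parent.children
    index = siblings.index(node)
    return siblings[index + 1] if index + 1 < len(siblings) else None


def next_node(node):
    """Return the next node in document (pre-order) order, or None at the end."""
    child = first_child(node)
    if child is not None:
        return child
    # Climb until the node itself or one of its ancestors has a following sibling.
    while node is not None:
        sibling = next_sibling(node)
        if sibling is not None:
            return sibling
        node = node.parent
    return None


page = Node(0, [Node(1, [Node(2, [Node(3), Node(4)]), Node(5)])])
order = []
n = first_child(page)
while n is not None:
    order.append(n.number)
    n = next_node(n)
print(order)
```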

In step S300, it is determined whether the current value of bnode (assigned in step S212, step S234, or step S238) is φ (empty). If bnode is not φ (empty), the process proceeds to step S302. If bnode is φ (empty), the process proceeds to step S304.

In step S302, the value of score calculated in step S234 or step S238 is stored in L^str({n1, bnode}) of the structural similarity table L^str^. The value of the position similarity S^pos(n1, bnode) is then stored in L^pos({n1, bnode}) of the position similarity table L^pos^. In the cluster table C^, the current value of n1 (calculated in step S208 or in the previous execution of step S304 described later) is stored in C(n1), and the value of bnode calculated in step S234 or step S238 is stored in C(bnode).
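The candidate bookkeeping of steps S212, S232 to S238, and S300 to S302 can be sketched for one fixed n1 as follows. The function name `select_candidates`, the `related` predicate, and the list-of-tuples interface are hypothetical. The sketch assumes that `matches` already contains only those n2 that passed the tests of steps S226 and S230, in traversal order. The similarity values 0.8 and 0.9 in the second call are invented for illustration.

```python
def select_candidates(matches, related):
    """Return the (bnode, score) pairs that would be saved for one n1.

    matches: list of (n2, sim) in traversal order.
    related(a, b): True when nodes a and b are in a parent-child relationship.
    """
    saved = []
    bnode, score = None, -1.0                      # S212
    for n2, sim in matches:
        if bnode is not None and not related(n2, bnode):   # S232
            saved.append((bnode, score))           # S234: save the previous candidate
            bnode, score = n2, sim                 # S234: start a new candidate
        elif score < sim:                          # S236
            bnode, score = n2, sim                 # S238: better candidate in the same subtree
    if bnode is not None:                          # S300
        saved.append((bnode, score))               # S302: save the last candidate
    return saved


def related(a, b):
    # Illustrative relation: only nodes 8 and 10 are parent and child.
    return {a, b} == {8, 10}


print(select_candidates([(14, 0.92), (25, 0.92)], related))
print(select_candidates([(8, 0.8), (10, 0.9)], related))
```

In the first call both nodes are saved because they are unrelated. In the second call only the better-scoring node of the related pair is saved.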

Next, in step S304, the nextNode routine shown in FIG. 19 is called with pageA and n1 as input, and its return value is assigned to n1.

In step S306, the cluster table C^, the structural similarity table L^str^, and the position similarity table L^pos^ are output as the result of the candidate search processing routine and stored in the storage unit 2.

<Clustering processing routine>
After the candidate search processing routine ends, the process returns to step S110 of the structured document information extraction processing routine shown in FIG. 12. In step S110, clusters in the cluster table C^ are merged repeatedly, based on the cluster table C^, the structural similarity table L^str^, and the position similarity table L^pos^ created by the candidate search processing routine of step S108. To do this, step S110 executes the clustering processing routine shown in FIG. 20. FIG. 21 shows the algorithm corresponding to this routine.

In step S400, the cluster table C^, the structural similarity table L^str^, and the position similarity table L^pos^ calculated in step S108, together with the parameters (structural similarity threshold θ_str and position similarity threshold θ_pos), are read from the storage unit 2.

Next, in step S402, −1 is assigned as the initial value to p and q, the cluster numbers indicating the clusters that are candidates for merging, and to maxS^str, the maximum structural similarity among the pairs of clusters that are candidates for merging.

In step S403, a cluster pair ({i, j}, S^str) to be processed is selected from the structural similarity table L^str^.

Next, in step S404, maxS^str is compared with the structural similarity S^str. If maxS^str is less than S^str, the process proceeds to step S406. If maxS^str is greater than S^str, the process proceeds to step S412.

In step S406, the value L^pos({i, j}) of the position similarity table L^pos^ read in step S400 is assigned to the position similarity S^pos^. Here, S^pos^ = (Sx^pos, Sy^pos, Sw^pos, Sh^pos).

Next, in step S408, the position similarity S^pos^ = (Sx^pos, Sy^pos, Sw^pos, Sh^pos) assigned in step S406 is compared with the position similarity threshold θ_pos. If Sx^pos ≥ θ_pos and Sw^pos ≥ θ_pos, or if Sy^pos ≥ θ_pos and Sh^pos ≥ θ_pos, the process proceeds to step S410. Otherwise, the process proceeds to step S412.

In step S410, the values (i, j, S^str) are assigned to p, q, and maxS^str, which were initialized in step S402.

Next, in step S412, it is determined whether the processing of steps S404 to S410 has been completed for all entries ({i, j}, S^str) in the structural similarity table L^str^. If it has, the process proceeds to step S414. If it has not, the process returns to step S403.

Next, in step S414, maxS^str as updated in step S410 is compared with the structural similarity threshold θ_str. If maxS^str is less than θ_str, the process proceeds to step S434. If maxS^str is equal to or greater than θ_str, the process proceeds to step S416.

In step S416, cluster C^(q) is merged into cluster C^(p) of the cluster table C^, and the cluster C^(p) stored in the storage unit 2 is updated.

Next, in step S418, φ (empty) is stored in C(q) of the cluster table C^.

Next, in step S420, φ (empty) is stored in L^str({p, q}) of the structural similarity table L^str^ stored in the storage unit 2, for the merged cluster pair p, q.

Next, in step S422, φ (empty) is stored in L^pos({p, q}) of the position similarity table L^pos^ stored in the storage unit 2, for the merged cluster pair p, q.

In step S423, a cluster C_i to be processed is selected from the cluster table C^.

Next, in step S424, the larger of the values L^str({p, i}) and L^str({q, i}) of the structural similarity table L^str^ is stored in L^str({p, i}) of the structural similarity table L^str^ stored in the storage unit 2.

Next, in step S426, φ (empty) is stored in L^str({q, i}) of the structural similarity table L^str^ stored in the storage unit 2.

Next, in step S428, the larger of the values L^pos({p, i}) and L^pos({q, i}) of the position similarity table L^pos^ is stored in L^pos({p, i}) of the position similarity table L^pos^ stored in the storage unit 2.

Next, in step S430, φ (empty) is stored in L^pos({q, i}) of the position similarity table L^pos^ stored in the storage unit 2.

Next, in step S432, it is determined whether the processing of steps S424 to S430 has been completed for all non-empty clusters C_i. If it has, the process returns to step S402. If it has not, the process returns to step S423.

In step S434, the cluster table C^ stored in the storage unit 2 is output as the result of the clustering processing routine.
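The merge loop of steps S402 to S434 can be sketched as follows. This is an illustration under assumptions, not the routine of FIG. 20 itself: the tables are plain dictionaries keyed by `frozenset({i, j})`, deleting an entry stands in for storing φ (empty), and the "larger value" of two position similarities in step S428 is taken element-wise, which the text does not specify. The function name `merge_clusters` and the position values for the pair {2, 14} are hypothetical. The values for the pair {2, 25} are those quoted in operation example 1.

```python
def merge_clusters(clusters, l_str, l_pos, theta_str, theta_pos):
    """clusters: {number: set of nodes}; l_str: {pair: float};
    l_pos: {pair: (Sx, Sy, Sw, Sh)}. The inputs are modified in place."""
    while True:
        best, max_s = None, -1.0                          # S402
        for pair, s_str in l_str.items():                 # S403 to S412
            if max_s < s_str:                             # S404
                sx, sy, sw, sh = l_pos[pair]              # S406
                if (sx >= theta_pos and sw >= theta_pos) or (
                    sy >= theta_pos and sh >= theta_pos   # S408
                ):
                    best, max_s = pair, s_str             # S410
        if best is None or max_s < theta_str:             # S414
            return {k: v for k, v in clusters.items() if v}   # S434
        p, q = sorted(best)
        clusters[p] |= clusters[q]                        # S416
        clusters[q] = set()                               # S418
        del l_str[best]                                   # S420
        del l_pos[best]                                   # S422
        others = [k for k, v in clusters.items() if v and k != p]
        for i in others:                                  # S423 to S432
            pi, qi = frozenset({p, i}), frozenset({q, i})
            if qi not in l_str:
                continue
            if pi in l_str:
                l_str[pi] = max(l_str[pi], l_str[qi])     # S424
                # S428: element-wise maximum is an assumption.
                l_pos[pi] = tuple(map(max, l_pos[pi], l_pos[qi]))
            else:
                l_str[pi], l_pos[pi] = l_str[qi], l_pos[qi]
            del l_str[qi]                                 # S426
            del l_pos[qi]                                 # S430


clusters = {2: {2}, 14: {14}, 25: {25}}
l_str = {frozenset({2, 14}): 0.92, frozenset({2, 25}): 0.92}
l_pos = {
    frozenset({2, 14}): (1.0, 0.5, 0.9, 0.9),      # illustrative values
    frozenset({2, 25}): (0.36, 1.0, 0.67, 0.67),   # from operation example 1
}
result = merge_clusters(clusters, l_str, l_pos, theta_str=0.7, theta_pos=0.8)
print(result)
```

Nodes 2 and 14 end up in one cluster, while node 25 stays separate because its position similarity to node 2 fails the test of step S408.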

After the clustering processing routine ends, the process returns to step S112 of the structured document information extraction processing routine shown in FIG. 12. In step S112, the priority of each cluster is calculated based on the keyword list KW^ input in step S100, the hierarchical structure information table K^ obtained in step S102, and the cluster table C^ obtained by the clustering processing routine of step S110. FIG. 22 shows the priority calculation algorithm executed in step S112.

In step S112, using the keyword list KW^ input in step S100, the hierarchical structure information table K^ obtained in step S102, and the cluster table C^ created in step S110, the sum of the weights of the keywords contained in a node is calculated for each node belonging to each cluster. The maximum of these sums over the nodes of a cluster is taken as the priority of that cluster.

Next, in step S114, the output unit 4 outputs the clustering results calculated in step S110 in the order of the priorities calculated in step S112.

Next, actual operation examples of the document structure analysis apparatus according to this embodiment are described.

<Operation example 1>
Operation example 1 describes the extraction of repeated portions within a single page. In operation example 1, the web page shown in FIG. 8 is input to the document structure analysis apparatus. FIG. 9 shows the HTML tag structure below the body element of the web page shown in FIG. 8. The keyword list shown in FIG. 23 is input to the input unit 1. The parameters input to the input unit 1 are the structural similarity threshold θ_str = 0.7, the position similarity threshold θ_pos = 0.8, and the search range threshold γ_0 = 1.5.

In step S102, the DOM (Document Object Model) tree shown in FIG. 10 is generated, and the hierarchical structure information table K^ shown in FIG. 11 is created.

In step S104, the position information table P^ shown in FIG. 24 is created.

Next, in step S108, the candidate search processing routine shown in FIGS. 13 and 14 is executed.

In the first execution of step S208, n1 ← 1 is assigned.

Next, in step S216, it is determined whether the two web pages of the input pair are the same. In operation example 1 the paired web pages are the same, so the process proceeds to step S218.

In the following step S218, n1 has no sibling node, so n2 ← φ (empty) is assigned.

Then, in step S304, n1 ← 2 is assigned.

Then, in the second execution of step S218, n2 ← 14 is assigned.

Then, in step S224, the structural similarity calculation unit 16 is called. The structural similarity calculation unit 16 reads K^(2) = (0, 10, 2) and K^(14) = (0, 9, 2) from the hierarchical structure information table K^ shown in FIG. 11, which was calculated by the hierarchical structure analysis unit 10 in step S102. Substituting these values into equation (1) above gives the structural similarity as follows.

The structural similarity S^str calculated by equation (3) above is assigned to sim.

Then, in the next step S226, it is determined whether the structural similarity S^str calculated by equation (3) above is equal to or greater than the structural similarity threshold θ_str. The structural similarity calculated by equation (3) is S^str(2, 14) ≈ 0.92 and the structural similarity threshold is θ_str = 0.7, so sim, which holds the value of S^str(2, 14), satisfies sim ≥ θ_str. The process therefore proceeds to step S228.
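The value 0.92 can be reproduced with a short calculation. Equation (1) is defined earlier in the description and is not reproduced in this section, so the sketch below rests on an assumption: it takes the structural similarity to be the sum of the element-wise minima of the two histograms divided by the sum of their element-wise maxima. That form is chosen only because it yields 11/12 ≈ 0.92 for the two histograms quoted above, and the actual equation may differ. The function name `structural_similarity` is hypothetical.

```python
def structural_similarity(hist_a, hist_b):
    """Assumed form: sum of element-wise minima over sum of element-wise maxima."""
    numerator = sum(min(a, b) for a, b in zip(hist_a, hist_b))
    denominator = sum(max(a, b) for a, b in zip(hist_a, hist_b))
    return numerator / denominator if denominator else 0.0


theta_str = 0.7
sim = structural_similarity((0, 10, 2), (0, 9, 2))   # K^(2) and K^(14)
print(round(sim, 2), sim >= theta_str)
```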

Next, in step S228, because n2 ← 14 was assigned in step S218, the position information corresponding to node number 14 in the position information table P^B^ is assigned to (x_2, y_2, w_2, h_2).

Next, in step S230, the position information table P^ of operation example 1 shown in FIG. 24 gives (x_1, y_1, w_1, h_1) = (0.56, 0.37, 12, 8.5) and (x_2, y_2, w_2, h_2) = (0.56, 9.25, 4, 0.7), and γ = 1.5 was assigned in step S204. Applying the test of step S230 to these values, the node is determined to be within range.

Because the candidate bnode is still φ (empty) at this point, the candidate is updated in step S238: bnode ← 14 and score ← 0.92 are assigned.

Next, in step S240, n2 ← 15 is assigned.

Next, in the second execution of step S224, the above processing is performed for n1 (= 2) and n2 (= 15). Here the structural similarity of n1 (= 2) and n2 (= 15) is S^str(2, 15) ≈ 0.41, which is below the structural similarity threshold θ_str = 0.7, so nothing is done and the next node is examined. The same applies to every node up to node 24: the structural similarity S^str is below θ_str = 0.7, so nothing is done.

Then, when n2 reaches node 25, the structural similarity of n1 = 2 and n2 = 25 is S^str(2, 25) ≈ 0.92, and the display position tested in step S230 is also within range, so the process proceeds to step S232.

Then, in step S232, the current candidate bnode (= 14) is compared with node 25. Node 25 is not in a parent-child relationship with node 14, so the process proceeds to step S234.

Then, in step S234, the value of score, which holds the structural similarity of the pair of node 2 and node 14, is stored in the structural similarity table L^str^. In addition, the position similarity of the pair of node 2 and node 14 is calculated and stored in the position similarity table L^pos^. The current candidate is then updated to bnode ← 25.

Continuing in the same way, when the process reaches n1 = 5 and n2 = 8, the structural similarity condition of step S226 and the display position condition of step S230 are both satisfied, so the candidate bnode is updated in step S238.

Next, when n1 = 5 and n2 = 10, the structural similarity condition of step S226 and the display position condition of step S230 are satisfied again. However, nodes 8 and 10 are in a parent-child relationship, so the process proceeds to step S236, and because node 10 has the higher structural similarity, it replaces node 8 as the candidate. The pair of node 5 and node 8 is therefore not saved.

As a result of this processing, the candidate search processing routine of step S108 shown in FIG. 12 produces the structural similarity table L^str^ shown in FIG. 25, the position similarity table L^pos^ shown in FIG. 26, and the cluster table C^ shown in FIG. 27.

Next, in step S110 shown in FIG. 12, the clustering processing routine shown in FIG. 20 is executed.

For example, for nodes 2 and 25, the structural similarity S^str(2, 25) ≈ 0.92 shown in FIG. 25 is equal to or greater than the threshold (θ_str = 0.7). However, the position similarity S^pos(2, 25) ≈ (0.36, 1, 0.67, 0.67) shown in FIG. 26 does not satisfy the condition of step S408 of the clustering processing routine shown in FIG. 20, so the two nodes are not merged.

As a result of this processing, the clustering result shown in FIG. 28 is obtained.

After the clustering processing routine ends, the process returns to step S112 of the structured document information extraction processing routine shown in FIG. 12. In step S112, the priority of each cluster is calculated based on the cluster table C^ created by the clustering processing routine of step S110.

Accordingly, in step S112, the algorithm shown in FIG. 22 is applied in turn to each cluster shown in FIG. 28. For example, cluster 2 in FIG. 28 contains the nodes {2, 14}. As shown in FIG. 11, node 2 contains both of the keywords "タイトル" (title) and "著者" (author) shown in FIG. 23, so its priority is 2. Node 14 likewise has a priority of 2. Because the priority of a cluster is the maximum of the priorities of its nodes, the priority of cluster 2 is 2. In cluster 6, node 6 contains "タイトル" and node 7 contains "著者", so each node has a priority of 1, and the cluster priority, being the maximum of these, is 1.
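The priority calculation walked through above can be sketched as follows. This is a minimal illustration, not the algorithm of FIG. 22 itself: the function names are hypothetical, each keyword is assumed to carry a weight of 1 (which is consistent with the priorities 2 and 1 in the example), and the node texts are made-up stand-ins because FIG. 11 is not reproduced here.

```python
def node_priority(node_text, keyword_weights):
    """Sum the weights of the keywords that occur in the node's text."""
    return sum(w for kw, w in keyword_weights.items() if kw in node_text)


def cluster_priority(cluster_nodes, node_texts, keyword_weights):
    """Cluster priority is the maximum node priority within the cluster."""
    return max(node_priority(node_texts[n], keyword_weights) for n in cluster_nodes)


# Assumption: both keywords have weight 1.
keyword_weights = {"タイトル": 1, "著者": 1}

# Assumption: illustrative node texts only; the real page content is in FIG. 11.
node_texts = {
    2: "タイトル: A 著者: X",
    14: "タイトル: B 著者: Y",
    6: "タイトル: A",
    7: "著者: X",
}

# Clusters keyed by cluster number, as in the example (cluster 2 and cluster 6).
clusters = {2: [2, 14], 6: [6, 7]}

priorities = {
    cid: cluster_priority(nodes, node_texts, keyword_weights)
    for cid, nodes in clusters.items()
}

# Clusters listed from highest to lowest priority.
ranked = sorted(priorities, key=priorities.get, reverse=True)
```

With these inputs, cluster 2 receives priority 2 and cluster 6 receives priority 1, so cluster 2 is listed first.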

In the next step S114, the clustering results calculated in step S110 are output in order of the priorities calculated in step S112. Accordingly, in step S114, the clusters are output in order of priority, as shown in FIG. 29.

Comparing FIG. 8 with FIG. 29 shows that the hierarchically nested repetitions are extracted correctly.

<Operation example 2>
Operation example 2 illustrates the extraction of repeated portions across different pages. In this example, pageA and pageB shown in FIG. 30 are input to the document structure analysis apparatus as web pages. FIG. 31 shows the HTML tag structure below the body element of the web pages shown in FIG. 30. The keyword list shown in FIG. 32 is also input to the input unit 1, together with the following parameters: structural similarity threshold θ_str = 0.7, position similarity threshold θ_pos = 0.8, and search range threshold γ_1 = 0.5.

When the data for operation example 2 are input, the structured document information extraction processing routine shown in FIG. 12 is executed, and the result shown in FIG. 33 is output.

Comparing FIG. 30 with FIG. 33 shows that the correspondence between elements on the two pages is extracted correctly.

As described above, the document structure analysis apparatus according to the present embodiment analyzes the hierarchical structure of the elements of a structured document; analyzes the display position of each element when the analyzed structured document is displayed; calculates, for each pair of elements among the plurality of elements in the structured document, a structural similarity relating to element structure and a position similarity relating to element display position; and clusters the elements based on both similarities. The elements of the structured document can thereby be clustered accurately.

The document structure analysis apparatus according to the present embodiment can also extract repetitive structures that are likely to be meaningful to a human reader, that is, structures whose items are laid out adjacent to one another, as in a table or a list.

In addition, according to the present embodiment, the candidate search unit can restrict cluster candidates to elements that are not in a parent-child relationship, so repetitive structures can be extracted efficiently even when they are hierarchically nested.

Because priorities are assigned to the repetitive structures, the results can be presented efficiently even when many clusters are generated.

Furthermore, using a histogram of node names for the structural similarity calculation allows structures to be compared quickly, at a constant computational cost (the number of tag names) that does not depend on the number of child nodes.
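To illustrate why the cost depends only on the number of distinct tag names, the following sketch compares two subtrees through their tag-name histograms. The similarity measure used here, cosine similarity, is an assumption chosen for illustration and is not necessarily the measure used by the structural similarity calculation unit 14; the function names and the sample tag lists are likewise hypothetical.

```python
from collections import Counter
from math import sqrt


def tag_histogram(tag_names):
    """Count how often each tag name occurs in a subtree (node plus descendants)."""
    return Counter(tag_names)


def histogram_similarity(h1, h2):
    """Cosine similarity of two tag-name histograms (assumed measure).

    The loop runs over distinct tag names only, so the cost is bounded by
    the number of tag names rather than by the number of child nodes.
    """
    names = set(h1) | set(h2)
    dot = sum(h1[name] * h2[name] for name in names)  # Counter yields 0 if absent
    norm1 = sqrt(sum(v * v for v in h1.values()))
    norm2 = sqrt(sum(v * v for v in h2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


# Hypothetical subtrees: two table rows that differ by one cell.
row_a = tag_histogram(["tr", "td", "td", "a"])
row_b = tag_histogram(["tr", "td", "td", "td", "a"])
similarity = histogram_similarity(row_a, row_b)
```

Two rows that differ by a single cell score close to 1 under this measure, while subtrees with no tag names in common score 0.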

When training data consisting of pairs of an HTML document and the portion to be extracted is created from scratch, the keywords can be changed to reflect the data that is expected, which changes the order in which the clustering results are presented.

For re-training after a site update, setting previously extracted data as the keywords makes it possible to obtain the elements that serve as training data efficiently from the clustering results.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

In the present embodiment, the structural similarity calculation unit 14 calculates, for each pair of nodes, the similarity of the hierarchical structure between the two elements based on the node histograms calculated by the hierarchical structure analysis unit 10; however, other configurations may be used.

For example, given a node A and a node B, the structural similarity calculation unit 14 may treat each HTML start tag and end tag in the subtrees rooted at those nodes as a single character, calculate the normalized edit distance (ND) between the two resulting strings, and return 1 - ND as the structural similarity. Normalized edit distance is described in, for example, B. Liu, "Web Data Mining," pp. 384-386, Springer, 2007.
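The edit-distance variant described above can be sketched as follows. Each start tag and end tag becomes one token, the Levenshtein distance between the two token sequences is computed, and 1 - ND is returned. Normalizing by the length of the longer sequence is an assumption made here for illustration; the cited reference may define the normalization differently. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Collect start and end tags as single tokens, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append("<" + tag + ">")

    def handle_endtag(self, tag):
        self.tokens.append("</" + tag + ">")


def tag_tokens(html):
    parser = TagSequence()
    parser.feed(html)
    parser.close()
    return parser.tokens


def edit_distance(a, b):
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def structural_similarity(html_a, html_b):
    """Return 1 - ND, with ND normalized by the longer sequence (assumption)."""
    a, b = tag_tokens(html_a), tag_tokens(html_b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


# Hypothetical subtrees for node A and node B.
node_a = "<li><a>x</a></li>"
node_b = "<li><a>y</a><b>z</b></li>"
similarity = structural_similarity(node_a, node_b)
```

Unlike the histogram approach, this comparison is sensitive to tag order, at the price of a cost that grows with the product of the two sequence lengths.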

Also, in the present embodiment, the priority calculation unit 22 uses the keyword list KW^ to take, for each node in a cluster, the sum of the weights of the keywords the node contains as that node's priority, and takes the maximum of these values as the cluster priority; however, other configurations may be used.

For example, given a keyword list and a cluster, the priority calculation unit 22 may return 1 if any single constituent element (node) of the cluster contains all of the keywords, and 0 otherwise. An example of this algorithm is shown in FIG. 34.

Alternatively, given a cluster, the priority calculation unit 22 may return the number of constituent elements (nodes) of the cluster as the priority. An example of this algorithm is shown in FIG. 35.
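The two alternative priority schemes just described can be sketched as follows. This is an illustration under stated assumptions rather than a transcription of FIG. 34 or FIG. 35: the function names are hypothetical, and the node texts are made-up stand-ins for the page content.

```python
def priority_all_keywords(cluster_nodes, node_texts, keywords):
    """Return 1 if any single node contains every keyword, otherwise 0."""
    return int(any(
        all(kw in node_texts[n] for kw in keywords)
        for n in cluster_nodes
    ))


def priority_cluster_size(cluster_nodes):
    """Return the number of nodes in the cluster as its priority."""
    return len(cluster_nodes)


keywords = ["タイトル", "著者"]

# Assumption: illustrative node texts only.
node_texts = {
    2: "タイトル: A 著者: X",
    14: "タイトル: B 著者: Y",
    6: "タイトル: A",
    7: "著者: X",
}

all_kw_cluster_2 = priority_all_keywords([2, 14], node_texts, keywords)
all_kw_cluster_6 = priority_all_keywords([6, 7], node_texts, keywords)
size_cluster_2 = priority_cluster_size([2, 14])
```

The first scheme favors clusters whose individual items are complete records, while the second simply favors the longest repetitions.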

As a further alternative, given a cluster, the priority calculation unit 22 may proceed as follows for each node in the cluster: let N be the number of leaf nodes in the web page and n be the number of leaf nodes contained in the subtree rooted at that node, and define the priority of the node by an expression in N and n (the expression is not reproduced in this text). The average of these node priorities is then returned as the cluster priority. An example of this algorithm is shown in FIG. 36.

Although the document structure analysis apparatus described above contains a computer system, the term "computer system" also covers a homepage providing environment (or display environment) where a WWW system is used.

Although this specification describes an embodiment in which the program is installed in advance, the program may also be provided stored on a computer-readable recording medium.

Description of Symbols
10 Hierarchical structure analysis unit
12 Position information analysis unit
14 Structural similarity calculation unit
16 Position similarity calculation unit
18 Candidate search unit
20 Clustering unit
22 Priority calculation unit

Claims

A document structure analysis apparatus comprising:
hierarchical structure analysis means for analyzing, for each of at least one structured document to be analyzed, the hierarchical structure of the elements of the structured document;
position information analysis means for analyzing the display position of each element of the structured document when the structured document analyzed by the hierarchical structure analysis means is displayed;
structural similarity calculation means for, based on the analysis result of the hierarchical structure analysis means, comparing, for each pair of a first element and a second element among the plurality of elements in the at least one structured document, the first element and its descendant elements with the second element and its descendant elements, and calculating a structural similarity relating to the structure of the elements;
position similarity calculation means for calculating, based on the analysis result of the position information analysis means, a position similarity relating to the display positions of the elements for each pair of two elements among the plurality of elements in the at least one structured document; and
clustering means for clustering the plurality of elements in the at least one structured document based on the structural similarity calculated by the structural similarity calculation means and the position similarity calculated by the position similarity calculation means.

The document structure analysis apparatus according to claim 1, wherein the structural similarity calculation means calculates, based on the analysis result of the hierarchical structure analysis means and for each element in the at least one structured document, a histogram of the attribute names of the element and of its descendant elements, and calculates, for each pair of a first element and a second element, the similarity between the histogram of the first element and the histogram of the second element as the structural similarity.

The document structure analysis apparatus according to claim 1, wherein the position similarity calculation means calculates, as the position similarity for each pair of two elements and based on the analysis result of the position information analysis means, a similarity based on the display positions and sizes of the two elements when the structured document is displayed.

The document structure analysis apparatus according to any one of claims 1 to 3, further comprising priority calculation means for calculating, for each cluster produced by the clustering means, a priority of the cluster based on the words contained in each element belonging to the cluster and on predetermined keywords.

The document structure analysis apparatus according to any one of claims 1 to 4, further comprising candidate search means for searching, based on the structural similarity calculated by the structural similarity calculation means, for a plurality of cluster candidates, each including two elements among the plurality of elements in the at least one structured document that are to be clustered into the same cluster,
wherein the clustering means clusters the plurality of elements in the at least one structured document based on the structural similarity and the position similarity between the elements included in each cluster candidate found by the candidate search means.

The document structure analysis apparatus according to claim 5, wherein, for each of the plurality of elements in the at least one structured document, the candidate search means takes, as the element to be included in a cluster candidate, the element having the highest structural similarity among elements that are in a parent-child relationship with one another.

The document structure analysis apparatus according to claim 5 or 6, wherein the candidate search means searches for the plurality of cluster candidates, each including two elements that are to be clustered into the same cluster and whose display positions, when the structured document is displayed, are separated by a distance no greater than a predetermined value.

A program for causing a computer to function as each means of the document structure analysis apparatus according to any one of claims 1 to 7.