JP5225331B2

JP5225331B2 - Data extraction apparatus and method

Info

Publication number: JP5225331B2
Application number: JP2010150011A
Authority: JP
Inventors: 圭吾町永
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2010-06-30
Filing date: 2010-06-30
Publication date: 2013-07-03
Anticipated expiration: 2030-06-30
Also published as: JP2012014412A

Description

The present invention relates to a data extraction apparatus and method.

In recent years, with the spread of the Internet and similar networks, users can access the Internet from a terminal and easily browse a wide variety of web pages. A web page consists of a document written in HTML (HyperText Markup Language) or a similar language together with image data, and is viewed with a web browser. By browsing and searching such web pages, users can easily obtain the information they are looking for.

Techniques for retrieving information from web pages include techniques that extract specific information by exploiting a description format such as HTML. These techniques rely on the tag structure of HTML and similar formats and extract information based on, for example, commonalities in that tag structure. Patent Documents 1 to 3 are known examples.

Patent Document 1 discloses a system that creates extraction rules capable of identifying only the body text within information provided on the WWW. The system disclosed in Patent Document 1 comprises: body text extraction means for extracting the body text from web page data collected in advance; extraction rule creation means for analyzing the web page data and creating an extraction rule that expresses, as a data structure, the location where the body text extracted by the body text extraction means appears; and applied extraction rule creation means for grouping a plurality of URLs to which the same extraction rule created by the extraction rule creation means applies, and associating the grouped URLs with that extraction rule.

Patent Document 2 discloses a system that lets ordinary users easily retrieve and use text content carrying useful information, without analyzing tags or creating extraction rules. The system disclosed in Patent Document 2 comprises a storage unit that stores pattern formats having regular expressions, an extraction rule generation unit that generates an extraction rule for retrieving text content matching a pattern format from an HTML page, and a format conversion unit that converts the extraction rule into a predetermined format.

Patent Document 3 discloses an information extraction device that extracts, in units of blocks, specific information related to a given search term from pages of unspecified websites. The information extraction device disclosed in Patent Document 3 extracts candidates for the specific information related to the search term, in units of blocks, from a set of web pages. It does so based on a pattern set consisting of patterns that represent the structural positional relationship between a search term and the specific information related to that search term in semi-structured information written in HTML or the like. It then selects the specific information from the extracted candidates using an information classification technique. For example, the information extraction device generates patterns (a pattern tree, distances between nodes, and so on) by a learning process from the structural positional relationship (a tree structure based on the tag structure) between a search term and its related specific information in the acquired web pages. The information extraction device then applies the patterns to a web page, extracts the information contained in the subtree rooted at a node that matches both the pattern tree and a target node whose inter-node distance is within a limit, treats that information as candidates for the specific information, and selects the specific information using an information classification technique.

Patent Document 1: JP 2004-220251 A
Patent Document 2: JP 2006-236262 A
Patent Document 3: JP 2007-47974 A

However, because the invention disclosed in Patent Document 1 identifies and extracts only the body text from web page data, it cannot extract data of the same kind as data that the user has specifically designated. Because the invention disclosed in Patent Document 2 retrieves text content matching a pattern format from an HTML page, the user must understand and specify the pattern format in order to obtain the desired data. Because the invention disclosed in Patent Document 3 also treats as a match any node whose inter-node distance is within the limit, it also extracts data whose meaning differs from that of the data the user designated. Furthermore, because the invention disclosed in Patent Document 3 selects the specific information by applying an information classification technique to the extracted candidates, extraction is expected to take time.

There is therefore a need for a data extraction apparatus and method that can extract the data a user wants from an HTML document easily and quickly.

An object of the present invention is to provide a data extraction apparatus and method that can extract the data a user wants from an HTML document easily and quickly.

The present invention provides the following solutions.

(1) A data extraction apparatus comprising: HTML document storage means for storing an HTML document; target node reception means for receiving designation of two or more target nodes from among the nodes constituting the HTML document stored in the HTML document storage means; ancestor node specifying means for specifying a common ancestor node, which is an upper node common to all the target nodes received by the target node reception means; path extraction means for extracting all paths from the common ancestor node specified by the ancestor node specifying means to the designated target nodes; search path generation means for generating, based on the paths extracted by the path extraction means and the repetitive structure of the nodes constituting the HTML document, a search path from the common ancestor node to an extraction target node to be extracted, the search path representing an extraction rule for that extraction target node; and data extraction means for extracting, from the HTML document including the common ancestor node, the extraction target nodes identified according to the search path generated by the search path generation means.

According to the configuration of (1), the data extraction apparatus according to the present invention has HTML document storage means for storing an HTML document. The data extraction apparatus receives designation of two or more target nodes from among the nodes constituting the HTML document stored in the HTML document storage means, and specifies a common ancestor node, which is an upper node common to all the received target nodes. Next, the data extraction apparatus extracts all paths from the specified common ancestor node to the designated target nodes. Based on the extracted paths and the repetitive structure of the nodes constituting the HTML document, it generates a search path from the common ancestor node to an extraction target node to be extracted, the search path representing an extraction rule for that extraction target node. The data extraction apparatus then extracts, from the HTML document including the common ancestor node, the extraction target nodes identified according to the generated search path.

That is, the data extraction apparatus according to the present invention specifies the common ancestor node of all the designated target nodes, generates a search path from the specified common ancestor node to the extraction target node, and extracts the extraction target nodes identified according to the generated search path. The data extraction apparatus according to the present invention can therefore receive the designation of target nodes from the user as hints and, based on those hints, extract the data the user wants from the HTML document easily and quickly.

(2) The data extraction apparatus according to (1), wherein the search path generation means generates, based on all the paths extracted by the path extraction means, the search path having an expression that partly includes a wildcard matching a plurality of patterns.

Accordingly, the data extraction apparatus according to (2) generates, based on all the extracted paths, a search path whose expression partly includes a wildcard matching a plurality of patterns. As a result, even when the repetitive structure of the HTML document contains variations, the data the user wants can be extracted from the HTML document easily.
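The role of such a wildcard can be illustrated with a minimal Python sketch. The text does not define the exact wildcard semantics, so the sketch assumes that "*" stands for any run of zero or more tags; the function name matches and the sample tag paths are hypothetical.

```python
def matches(pattern, path):
    """Return True if the tag path fits the pattern.

    Assumption for this sketch: "*" stands for any run of zero or
    more tags. The source does not fix the wildcard semantics.
    """
    if not pattern:
        return not path
    head, rest = pattern[0], pattern[1:]
    if head == "*":
        # Let the wildcard absorb 0, 1, 2, ... leading tags of the path.
        return any(matches(rest, path[k:]) for k in range(len(path) + 1))
    return bool(path) and path[0] == head and matches(rest, path[1:])


PATTERN = ("dl", "*", "dt")

print(matches(PATTERN, ("dl", "dt")))          # True
print(matches(PATTERN, ("dl", "div", "dt")))   # True
print(matches(PATTERN, ("dl", "div", "dd")))   # False
```

With this reading, a single search path covers list items that are wrapped in an extra element as well as items that are not, which is one way a varying repetitive structure could be tolerated.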

(3) The data extraction apparatus according to (2), wherein the search path generation means performs DP (Dynamic Programming) matching on all the paths extracted by the path extraction means to generate the search path having an expression that partly includes the wildcard.

Accordingly, even when the extracted paths do not match exactly, the data extraction apparatus according to (3) performs DP matching on these paths to estimate and generate a search path whose expression partly includes a wildcard. As a result, even when the repetitive structure of the HTML document contains variations, the data the user wants can be extracted from the HTML document even more easily.
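One way to picture DP matching on two paths is the following Python sketch. It is not the matching procedure of the embodiment, whose cost function is not given here. It substitutes a longest-common-subsequence alignment, keeps the tags that both paths share, and writes a single "*" for every stretch where the paths disagree. The function name generalize and the sample paths are hypothetical.

```python
def generalize(a, b):
    """Align two tag paths with an LCS table and wildcard the differences."""
    n, m = len(a), len(b)
    # lcs[i][j] is the LCS length of the suffixes a[i:] and b[j:].
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    out, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            out.append(a[i])          # tag shared by both paths
            i += 1
            j += 1
        else:
            if not out or out[-1] != "*":
                out.append("*")       # one wildcard per differing stretch
            if lcs[i + 1][j] >= lcs[i][j + 1]:
                i += 1
            else:
                j += 1
    # Leftover tags on either side also form a differing stretch.
    if (i < n or j < m) and (not out or out[-1] != "*"):
        out.append("*")
    return out


print(generalize(["dl", "div", "dt"], ["dl", "span", "dt"]))  # ['dl', '*', 'dt']
print(generalize(["dl", "dt"], ["dl", "div", "dt"]))          # ['dl', '*', 'dt']
```

Because the alignment tolerates inserted or substituted tags, two paths that differ only in an intermediate wrapper element collapse into one generalized path.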

(4) The data extraction apparatus according to (1), wherein: the ancestor node specifying means includes means for combining the two or more designated target nodes in order, starting from the target nodes at the lower levels of the HTML document, to obtain a first common node that is the common upper node of two target nodes, means for obtaining a second common node that is the common upper node of two first common nodes, and means for repeating this until a single common upper node remains and specifying that single common upper node as the common ancestor node; the path extraction means includes means for extracting the path from the common ancestor node to the common upper node one level below it, means for extracting, when that common upper node is not an extraction target node, the path from that common upper node to the common upper node a further level below, and repetition means for repeating this to extract the paths down to the extraction target node; and the search path generation means includes means for generating the search path from the common ancestor node to the common upper node one level below by performing DP matching on the extracted paths, means for generating, when that common upper node is not an extraction target node, the search path from that common upper node to the common upper node a further level below by performing DP matching on the extracted paths, and means for repeating this to generate the search path down to the extraction target node.

According to the configuration of (4), the data extraction apparatus according to (4), building on (1), combines the two or more designated target nodes in order, starting from the target nodes at the lower levels of the HTML document, and obtains a first common node that is the common upper node of two target nodes. It then obtains a second common node that is the common upper node of two first common nodes, repeats this until a single common upper node remains, and specifies that single common upper node as the common ancestor node. Next, the data extraction apparatus according to (4) extracts the path from the common ancestor node to the common upper node one level below it. When that common upper node is not an extraction target node, it extracts the path from that common upper node to the common upper node a further level below, and repeats this to extract the paths down to the extraction target node. The data extraction apparatus according to (4) then generates the search path from the common ancestor node to the common upper node one level below by performing DP matching on the extracted paths. When that common upper node is not an extraction target node, it generates the search path from that common upper node to the common upper node a further level below by performing DP matching on the extracted paths, and repeats this to generate the search path down to the extraction target node.

That is, the data extraction apparatus according to (4) specifies the common ancestor node of all the designated target nodes, generates the search path from the specified common ancestor node to the extraction target node one level at a time, and uses the search path generated level by level to extract the extraction target nodes it identifies. The data extraction apparatus according to (4) can therefore receive the designation of target nodes from the user as hints and, based on those hints, extract the data the user wants from the HTML document still more easily and quickly.

(5) A method executed by a data extraction apparatus having HTML document storage means for storing an HTML document, the method comprising: a target node reception step of receiving designation of two or more target nodes from among the nodes constituting the HTML document stored in the HTML document storage means; an ancestor node specifying step of specifying a common ancestor node, which is an upper node common to all the target nodes received in the target node reception step; a path extraction step of extracting all paths from the common ancestor node specified in the ancestor node specifying step to the designated target nodes; a search path generation step of generating, based on the paths extracted in the path extraction step and the repetitive structure of the nodes constituting the HTML document, a search path from the common ancestor node to an extraction target node to be extracted, the search path representing an extraction rule for that extraction target node; and a data extraction step of extracting, from the HTML document including the common ancestor node, the extraction target nodes identified according to the search path generated in the search path generation step.

Accordingly, as with (1), the method according to the present invention can receive data from the user and extract the data the user wants from an HTML document easily and quickly.

According to the present invention, the data a user wants can be extracted from an HTML document easily and quickly.

FIG. 1 is a diagram showing an example of HTML document data for explaining the features of the present invention.
FIG. 2 is a functional block diagram showing the functional configuration of a data extraction apparatus according to an embodiment of the present invention.
FIG. 3 is a diagram showing another example of HTML document data for the data extraction apparatus according to an embodiment of the present invention.
FIG. 4 is a diagram showing an example in which DP matching is performed when the data extraction apparatus according to an embodiment of the present invention generates a search path.
FIG. 5 is a flowchart showing the processing performed by the data extraction apparatus according to an embodiment.
FIG. 6 is a flowchart showing the common ancestor node specifying process of the data extraction apparatus according to an embodiment.
FIG. 7 is a flowchart showing the search path generation process of the data extraction apparatus according to an embodiment.
FIG. 8 is a diagram showing an example of data extraction by the data extraction apparatus according to an embodiment of the present invention.

Embodiments of the present invention are described below with reference to the drawings.

The data extraction apparatus 10 of this embodiment is applied to a computer and its peripheral devices. Each unit in this embodiment is implemented by the hardware of the computer and its peripheral devices and by software that controls that hardware.

The hardware includes a CPU (Central Processing Unit) serving as a control unit, as well as a storage unit, a communication device, a display device, and an input device. Examples of the storage unit include memory (RAM: Random Access Memory, ROM: Read Only Memory, and so on), a hard disk drive (HDD: Hard Disk Drive), and an optical disc drive (CD: Compact Disc, DVD: Digital Versatile Disc, and so on). Examples of the communication device include various wired and wireless interface devices. Examples of the display device include various displays such as liquid crystal displays and plasma displays. Examples of the input device include a keyboard and a pointing device (mouse, trackball, and so on).

The software includes computer programs and data that control the hardware. The computer programs and data are stored in the storage unit and are executed or referenced by the control unit as appropriate. The computer programs and data can also be distributed over a communication line, or recorded on a computer-readable medium such as a CD-ROM and distributed.

FIG. 1 shows an example of HTML document data used to explain the features of the present invention. FIG. 1(a) shows an example of HTML document data that uses the ingredients listed under "How to make curry rice" as a concrete example. FIG. 1(b) shows the HTML document data of FIG. 1(a) as a tree structure. In this example, the data extraction apparatus 10 receives designation of target nodes, which serve as hints for the extraction target nodes that are the object of data extraction. Specifically, in the example of FIG. 1(a), as described in detail later with reference to FIG. 8, the data extraction apparatus 10 receives click operations made on the display screen with a mouse pointer or the like, and thereby accepts "How to make curry rice", "beef", "200 g", "curry roux", and "1/2 pack" as target nodes that are hints for the extraction target nodes. Then, based on the repetitive structure of the nodes constituting the HTML document, the data extraction apparatus 10 automatically extracts as extraction target nodes not only the target nodes that were directly designated as hints, such as "How to make curry rice", "beef", and "200 g", but also, for example, "carrot" and "1 piece".

That is, the data extraction apparatus 10 receives designation of, for example, "How to make curry rice", "beef", "200 g", "curry roux", and "1/2 pack" as target nodes, which are hints for extracting the data the user wants (the extraction target nodes) from the HTML document. Next, the data extraction apparatus 10 combines the target nodes in order, starting from those at the lower levels of the HTML document, and places the designated target node corresponding to "beef" and the target node corresponding to "200 g" in a first group. The data extraction apparatus 10 likewise places the target node corresponding to "curry roux" and the target node corresponding to "1/2 pack", which are at the same level as the first group, in a second group. It places the node corresponding to "How to make curry rice", a target node at a higher level than these, in a third group. The data extraction apparatus 10 then extracts <dl>, the common upper node of the first and second groups, and merges the first and second groups into a fourth group. Next, the data extraction apparatus 10 extracts <body>, the common upper node of the merged fourth group and the third group, and merges the fourth and third groups into a fifth group. Because no other group remains from which to extract an upper node, the data extraction apparatus 10 specifies <body> as the common ancestor node.
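The grouping procedure above can be sketched in Python as a repeated pairwise merge. The sketch assumes that each node is identified by a tuple of tag labels from the root; the positional suffixes such as dt[1] and dt[5], and the function names common_prefix and common_ancestor, are hypothetical and do not come from FIG. 1.

```python
def common_prefix(a, b):
    """Return the longest shared leading part of two node paths."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def common_ancestor(targets):
    """Merge the two deepest nodes into their common upper node
    and repeat until a single node remains."""
    nodes = sorted(set(targets), key=len, reverse=True)
    while len(nodes) > 1:
        merged = common_prefix(nodes[0], nodes[1])
        nodes = sorted(set(nodes[2:] + [merged]), key=len, reverse=True)
    return nodes[0]


# Hypothetical paths for the five designated target nodes.
TARGETS = [
    ("body", "h1"),            # "How to make curry rice"
    ("body", "dl", "dt[1]"),   # "beef"
    ("body", "dl", "dd[1]"),   # "200 g"
    ("body", "dl", "dt[5]"),   # "curry roux"
    ("body", "dl", "dd[5]"),   # "1/2 pack"
]

print(common_ancestor(TARGETS))  # ('body',)
```

Merging the deepest nodes first mirrors the order in the text: the four list entries collapse into <dl> before <dl> and <h1> collapse into <body>.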

Next, starting from the specified common ancestor node <body> and working through the above process in reverse order, the data extraction apparatus 10 successively estimates the search path to the extraction target node, which represents the extraction rule for the extraction target nodes to be extracted. That is, from the common ancestor node <body> specified for the fifth group, the data extraction apparatus 10 takes the path <body>-<h1> to the target node one level below that corresponds to "How to make curry rice" in the third group, and the path <body>-<dl> to the common upper node corresponding to the fourth group, and from these estimates and generates the search path <body>-<h1>-<dl>. Next, the data extraction apparatus 10 starts from <dl>, the common upper node corresponding to the fourth group, and takes the path <dl>-<dt> to the target node one level below that corresponds to "beef" in the first group, the path <dl>-<dd> to the target node corresponding to "200 g" in the same first group, the path <dl>-<dt> to the target node corresponding to "curry roux" in the second group, and the path <dl>-<dd> to the target node corresponding to "1/2 pack" in the same second group. From these it estimates and generates the search path <body>-<h1>-<dl>-<dt>-<dd> to the extraction target nodes. Because "beef", "200 g", "curry roux", and "1/2 pack" are the designated hints (target nodes), the data extraction apparatus 10 completes the search path estimation and generation process at this point.
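A simplified Python sketch of this level-by-level generation follows. It records, for each parent path, the distinct child tags seen among the designated target nodes, then flattens the levels into the hyphenated notation used above. This is only one reading of that notation; the function names search_levels and flatten are hypothetical, and the DP matching step is omitted because the sample paths already agree.

```python
def search_levels(paths):
    """Map each parent path to the distinct child tags seen under it,
    in first-seen order. Positional indices are dropped so that the
    repeated structure generalizes to undesignated items."""
    levels = {}
    for path in paths:
        for depth in range(1, len(path)):
            parent, child = path[:depth], path[depth]
            seen = levels.setdefault(parent, [])
            if child not in seen:
                seen.append(child)
    return levels


def flatten(levels, root):
    """Walk the levels breadth-first and join the tags with hyphens."""
    out = [root[-1]]
    queue = [root]
    while queue:
        parent = queue.pop(0)
        for child in levels.get(parent, []):
            out.append(child)
            queue.append(parent + (child,))
    return "-".join(f"<{tag}>" for tag in out)


# Tag paths from the common ancestor to the five designated target nodes.
PATHS = [
    ("body", "h1"),          # "How to make curry rice"
    ("body", "dl", "dt"),    # "beef"
    ("body", "dl", "dd"),    # "200 g"
    ("body", "dl", "dt"),    # "curry roux"
    ("body", "dl", "dd"),    # "1/2 pack"
]

levels = search_levels(PATHS)
print(flatten(levels, ("body",)))  # <body>-<h1>-<dl>-<dt>-<dd>
```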

Then, from the HTML document including the common ancestor node, the data extraction apparatus 10 extracts "carrot", "1 piece", "onion", "2 pieces", "potato", and "2 pieces", which correspond to the extraction target nodes identified according to the estimated and generated search path. The data extraction apparatus 10 can therefore receive, for example, "How to make curry rice", "beef", "200 g", "curry roux", and "1/2 pack" from the user and extract the ingredients and quantities needed to make the curry rice the user wants from the HTML document easily and quickly.
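The extraction step can be sketched in Python as a walk over the document tree guided by the search path. The tree below is a hypothetical reconstruction of the curry rice example (the item order is assumed), each node is a (tag, text, children) tuple, and the nested dictionary PATTERN is an assumed encoding of the search path <body>-<h1>-<dl>-<dt>-<dd>, in which an empty dictionary marks a node whose text is extracted.

```python
# Hypothetical tree for the curry rice example: (tag, text, children).
TREE = ("body", "", [
    ("h1", "How to make curry rice", []),
    ("dl", "", [
        ("dt", "beef", []), ("dd", "200 g", []),
        ("dt", "carrot", []), ("dd", "1 piece", []),
        ("dt", "onion", []), ("dd", "2 pieces", []),
        ("dt", "potato", []), ("dd", "2 pieces", []),
        ("dt", "curry roux", []), ("dd", "1/2 pack", []),
    ]),
])

# Assumed encoding of the search path; {} marks an extraction target.
PATTERN = {"h1": {}, "dl": {"dt": {}, "dd": {}}}


def extract(node, pattern):
    """Collect the text of every node reached by following the pattern."""
    _tag, _text, children = node
    results = []
    for child in children:
        sub = pattern.get(child[0])
        if sub is None:
            continue                      # tag is not on the search path
        if sub:
            results.extend(extract(child, sub))
        else:
            results.append(child[1])      # leaf of the search path
    return results


for text in extract(TREE, PATTERN):
    print(text)
```

Because the pattern names tags rather than individual positions, the undesignated items (carrot, onion, potato and their quantities) are returned together with the five designated hints.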

FIG. 2 is a functional block diagram showing the functional configuration of the data extraction apparatus 10 according to an embodiment of the present invention. The data extraction apparatus 10 comprises: a web data DB 31 serving as web data storage means, which stores web data acquired from a web server 50 connected via the Internet 70; a target node reception unit 11 serving as target node reception means; an ancestor node specifying unit 12 serving as ancestor node specifying means; a path extraction unit 13 serving as path extraction means; a search path generation unit 14 serving as search path generation means; and a data extraction unit 15 serving as data extraction means. The functions of the data extraction apparatus 10 are described in detail below, unit by unit.

The web data DB 31 stores web data including HTML documents. The description format of an HTML document has, for example, a tag structure made up of tags (see FIG. 1(a)). In the present invention, each tag and the text contained in it are mapped to a node, so that a tree structure is applied to the HTML document.
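As a rough illustration of how tags and their text can be mapped to the nodes of a tree, the following Python sketch builds such a tree with the standard library module html.parser. The Node and TreeBuilder classes, the parse function, and the sample markup are assumptions made for illustration; the sample only imitates the kind of document shown in FIG. 1(a).

```python
from html.parser import HTMLParser


class Node:
    """One tag of the document together with the text it contains."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""
        if parent is not None:
            parent.children.append(self)


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # Each opening tag becomes a child of the currently open node.
        self.current = Node(tag, self.current)

    def handle_endtag(self, tag):
        # Climb to the matching open tag; ignore stray closing tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text += data.strip()


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


SAMPLE_HTML = (
    "<html><body><h1>How to make curry rice</h1>"
    "<dl><dt>beef</dt><dd>200 g</dd><dt>carrot</dt><dd>1 piece</dd></dl>"
    "</body></html>"
)

body = parse(SAMPLE_HTML).children[0].children[0]
print([child.tag for child in body.children])              # ['h1', 'dl']
print([child.tag for child in body.children[1].children])  # ['dt', 'dd', 'dt', 'dd']
```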

The target node reception unit 11 receives designation of two or more target nodes from among the nodes constituting an HTML document included in the web data stored in the web data DB 31. For example, the target node reception unit 11 receives designation of "How to make curry rice", "beef", "200 g", "curry roux", and "1/2 pack" as target nodes for extracting the data the user wants (the extraction target nodes).

The ancestor node specifying unit 12 specifies a common ancestor node, that is, an upper node common to all the target nodes received by the target node receiving unit 11. Specifically, the ancestor node specifying unit 12 combines the two or more received target nodes in order from the target nodes lowest in the HTML document, and obtains a first common node, which is the common upper node of two target nodes. Next, the ancestor node specifying unit 12 obtains a second common node, which is the common upper node of two first common nodes. The ancestor node specifying unit 12 repeats this extraction of upper nodes until a single common upper node remains, and specifies that node as the common ancestor node.

For example, in an HTML document whose tree structure is formed according to tag type as in FIG. 1, the ancestor node specifying unit 12 combines the target nodes "How to make curry rice", "beef", "200 g", "curry roux", and "1/2 pack" in order from the target nodes lowest in the HTML document: "beef" and "200 g" form a first group, "curry roux" and "1/2 pack" form a second group, and "How to make curry rice", a target node higher than these, forms a third group. Next, the ancestor node specifying unit 12 extracts <dl>, the common upper node of the first and second groups, and merges the first and second groups into a fourth group. Next, the data extraction device 10 extracts <body>, the common upper node of the fourth group and the third group, and merges the fourth and third groups into a fifth group. The ancestor node specifying unit 12 then specifies <body>, the single remaining common upper node, as the common ancestor node.
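The bottom-up search for the common ancestor can be sketched as follows. This is a simplified illustration, not the procedure of the embodiment itself: instead of keeping explicit groups, it folds the target nodes pairwise, starting from the deepest ones. The `PARENT` map and the ids `dt1`, `dd1`, `dt2`, and `dd2` are hypothetical stand-ins for the tree of FIG. 1.

```python
# Hypothetical parent map (child -> parent) for the curry rice example.
# Element nodes use illustrative ids; text nodes use their text.
PARENT = {
    "html": None,
    "body": "html",
    "h1": "body",
    "dl": "body",
    "dt1": "dl", "dd1": "dl",
    "dt2": "dl", "dd2": "dl",
    "How to make curry rice": "h1",
    "beef": "dt1", "200 g": "dd1",
    "curry roux": "dt2", "1/2 pack": "dd2",
}


def chain_to_root(node):
    """Return the node followed by all of its upper nodes, up to the root."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def common_upper_node(a, b):
    """Return the lowest node that lies above (or at) both a and b."""
    above_b = set(chain_to_root(b))
    for node in chain_to_root(a):
        if node in above_b:
            return node
    raise ValueError("nodes are not in the same tree")


def common_ancestor(targets):
    """Fold the targets pairwise, deepest first, until one node remains."""
    ordered = sorted(targets, key=lambda n: len(chain_to_root(n)), reverse=True)
    current = ordered[0]
    for node in ordered[1:]:
        current = common_upper_node(current, node)
    return current


targets = ["How to make curry rice", "beef", "200 g", "curry roux", "1/2 pack"]
print(common_ancestor(targets))
```

In this sketch the four ingredient nodes first collapse to `dl`, and combining that with the heading yields `body`, which mirrors the first-to-fifth group sequence in the example above.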

The path extraction unit 13 extracts all paths from the common ancestor node specified by the ancestor node specifying unit 12 to the designated target nodes. Specifically, the path extraction unit 13 first extracts the paths from the common ancestor node to the common upper nodes one level below it. Next, if a common upper node one level below is not an extraction target node to be extracted, the path extraction unit 13 extracts the paths from that node to the common upper nodes one further level below. The path extraction unit 13 repeats this level-by-level extraction until the node one level below is an extraction target node, and thereby estimates and extracts the paths to the extraction target nodes.

For example, the path extraction unit 13 extracts the paths from the common ancestor node <body> to the extraction target nodes in the reverse order of the processing by the ancestor node specifying unit 12. Specifically, from the common ancestor node <body>, the path extraction unit 13 extracts the path <body>-<h1> to "How to make curry rice" in the third group one level below, and the path <body>-<dl> to <dl>, the common upper node corresponding to the fourth group. Next, from the fourth group, the path extraction unit 13 extracts the path <dl>-<dt> to "beef" and the path <dl>-<dd> to "200 g" in the first group one further level below, as well as the path <dl>-<dt> to "curry roux" and the path <dl>-<dd> to "1/2 pack" in the second group.
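The path extraction can be illustrated with a short sketch. It assumes that a path is simply the sequence of tags on the parent chain between the common ancestor and the element enclosing the target text, which corresponds to concatenating the per-level paths listed above. The `TREE` table and its node ids are hypothetical.

```python
# Hypothetical tree: node id -> (tag, parent id). Text nodes have tag None.
TREE = {
    "body": ("body", None),
    "h1": ("h1", "body"),
    "dl": ("dl", "body"),
    "dt1": ("dt", "dl"), "dd1": ("dd", "dl"),
    "dt2": ("dt", "dl"), "dd2": ("dd", "dl"),
    "How to make curry rice": (None, "h1"),
    "beef": (None, "dt1"), "200 g": (None, "dd1"),
    "curry roux": (None, "dt2"), "1/2 pack": (None, "dd2"),
}


def path_from(ancestor, target):
    """Return the list of tags from the ancestor down to the target."""
    tags = []
    node = target
    while node is not None:
        tag, parent = TREE[node]
        if tag is not None:
            tags.append(tag)  # text nodes contribute no tag to the path
        if node == ancestor:
            return list(reversed(tags))
        node = parent
    raise ValueError("target is not below the given ancestor")


for target in ["How to make curry rice", "beef", "200 g", "curry roux", "1/2 pack"]:
    print(target, "-".join("<%s>" % t for t in path_from("body", target)))
```

Walking upward from each target and reversing the result avoids searching the whole tree for every path.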

The search path generation unit 14 generates, based on the paths extracted by the path extraction unit 13 and the repetition structure of the nodes constituting the HTML document, a search path from the common ancestor node to an extraction target node; the search path expresses the extraction rule for the extraction target nodes to be extracted. For example, the search path generation unit 14 performs DP (Dynamic Programming) matching (see FIG. 4, described later) on all the paths extracted by the path extraction unit 13, and generates a search path whose expression partially includes a wildcard that matches a plurality of patterns. Specifically, the search path generation unit 14 generates the search path from the common ancestor node to the common upper nodes one level below by performing DP matching on the extracted paths. Next, if a common upper node one level below is not an extraction target node, the search path generation unit 14 generates the search path from that node to the common upper nodes one further level below, again by performing DP matching on the extracted paths. By repeating this level-by-level generation, the search path generation unit 14 estimates and generates the search path to the extraction target nodes.

For example, as the search path from the common ancestor node <body> to the common upper nodes <h1> and <dl> one level below, the search path generation unit 14 performs DP matching on the extracted paths <body>-<h1> and <body>-<dl> and generates the search path <body>-<h1>-<dl>. Next, because the common upper node <dl> one level below is not an extraction target node, the search path generation unit 14 performs DP matching on the extracted paths <dl>-<dt> and <dl>-<dd>, as the search path from <dl> to the common upper nodes <dt> and <dd> one further level below, and generates <body>-<h1>-<dl>-<dt>-<dd>. By performing DP matching on the paths, the search path generation unit 14 can estimate and generate an optimal search path even when the extracted paths do not match completely (for example, in the case of FIG. 3, described later), by expressing the differences as wildcards that match a plurality of patterns. DP matching of paths is described in detail with reference to FIGS. 3 and 4.

The data extraction unit 15 extracts, from the HTML document including the common ancestor node, the extraction target nodes found according to the search path generated by the search path generation unit 14. For example, the data extraction unit 15 extracts "carrot", which corresponds to a node found according to the generated search path <body>-<dl>-<dt>, and "1 piece", which corresponds to a node found according to <body>-<dl>-<dt>-<dd> (see FIG. 1).

FIG. 3 is a diagram showing another example of HTML document data handled by the data extraction device 10 according to an embodiment of the present invention. FIG. 3(a) shows an example of HTML document data that contains <img>.

In the example of FIG. 3, the data extraction device 10 generates <body>-<dl>-<dt>-<img>?-<dd>, the search path from the specified common ancestor node <body> to the extraction target nodes "How to make curry rice", "beef", "200 g", "curry roux", and "1/2 cup". Here, <img>? is a wildcard expression indicating that the <img> immediately before the "?" may or may not appear.

That is, the data extraction device 10 merges the search path <body>-<dl>-<dt>-<dd> to "200 g" and the search path <body>-<dl>-<dt>-<img>-<dd> to "1/2 cup" to generate <body>-<dl>-<dt>-<img>?-<dd>. The data extraction device 10 performs this merging of search paths using, for example, DP matching.
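The meaning of the "?" wildcard can be illustrated by translating a search path into a regular expression. This is only one possible encoding, assumed here for illustration; the specification does not state how the wildcard is evaluated, and the names `compile_search_path` and `matches` are hypothetical.

```python
import re


def compile_search_path(steps):
    """Turn steps such as ["dt", "img?", "dd"] into a compiled regex.

    Each tag is written as "tag/"; a step ending in "?" becomes optional.
    """
    parts = []
    for step in steps:
        if step.endswith("?"):
            parts.append("(?:" + re.escape(step[:-1]) + "/)?")
        else:
            parts.append(re.escape(step) + "/")
    return re.compile("^" + "".join(parts) + "$")


def matches(pattern, path):
    """Check whether a concrete tag path satisfies the search path."""
    return pattern.match("".join(tag + "/" for tag in path)) is not None


pattern = compile_search_path(["body", "dl", "dt", "img?", "dd"])
print(matches(pattern, ["body", "dl", "dt", "dd"]))
print(matches(pattern, ["body", "dl", "dt", "img", "dd"]))
```

Both concrete paths from the example satisfy the single merged pattern, which is why one search path suffices for entries with and without an image.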

FIG. 4 is a diagram showing an example of the DP matching performed when the data extraction device 10 according to an embodiment of the present invention generates a search path. The example of FIG. 4 shows the paths to the target node for input 1, <body>-<dl>-<dt>-<img>-<dd>, in which <img> appears before the target node is reached (as for "1/2 cup"), and for input 2, <body>-<dl>-<dt>-<dd>, in which <img> does not appear (as for "200 g").

In the example of FIG. 4, the data extraction device 10 assigns a score of +1 to a path on which input 1 and input 2 match, a score of -1 to a path that omits input 1, and a score of -1 to a path that omits input 2. The data extraction device 10 cannot use path 533, on which input 1 and input 2 do not match, as part of a route. The data extraction device 10 calculates the score of each route that reaches the target node and takes the route with the highest score as the optimal route.

For example, among the search paths from the start node 901 to the target node 902, scores are calculated for search path A and search path B. Search path A reaches the target node 902 via path 511 (score +1), path 522 (score +1), path 5033 (omission of input 1: score -1), path 5304 (omission of input 2: score -1), path 5404 (omission of input 2: score -1), path 554 (score +1), and path 565 (score +1), so its calculated score is +1. Search path B reaches the target node 902 via path 511 (score +1), path 522 (score +1), path 5303 (omission of input 2: score -1), path 543 (score +1), path 554 (score +1), and path 565 (score +1), so its calculated score is +3. The data extraction device 10 therefore judges search path B to be better than search path A. In the same way, the data extraction device 10 calculates the scores of the other paths, compares them with search path B, judges search path B to be the optimal route, and generates <dl>-<dt>-<img>?-<dd>, expressed with a wildcard that matches a plurality of patterns.
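The scoring rule described for FIG. 4 (match: +1, omission of either input: -1, mismatch not allowed) can be sketched as a small dynamic programming alignment. This is a minimal sketch under those stated scores; it does not model the individual path numbers of FIG. 4, and the function name `merge_paths` and the move labels are hypothetical.

```python
def merge_paths(a, b):
    """Align two tag paths; return (best score, merged path).

    A tag present in only one input is emitted with a trailing "?".
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            # A diagonal step is allowed only when the two tags are equal.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                candidates.append((score[i - 1][j - 1] + 1, "match"))
            if i > 0:
                candidates.append((score[i - 1][j] - 1, "only_in_a"))
            if j > 0:
                candidates.append((score[i][j - 1] - 1, "only_in_b"))
            score[i][j], move[i][j] = max(candidates, key=lambda c: c[0])

    # Trace the best route back from the end to recover the merged path.
    merged = []
    i, j = n, m
    while i > 0 or j > 0:
        if move[i][j] == "match":
            merged.append(a[i - 1])
            i, j = i - 1, j - 1
        elif move[i][j] == "only_in_a":
            merged.append(a[i - 1] + "?")
            i -= 1
        else:
            merged.append(b[j - 1] + "?")
            j -= 1
    merged.reverse()
    return score[n][m], merged


input1 = ["body", "dl", "dt", "img", "dd"]
input2 = ["body", "dl", "dt", "dd"]
print(merge_paths(input1, input2))  # (3, ['body', 'dl', 'dt', 'img?', 'dd'])
```

For the two inputs of FIG. 4 the best alignment has four matches and one omission, giving a score of +3 and the merged path with an optional `img` step.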

FIG. 5 is a flowchart showing the processing performed by the data extraction device 10 according to an embodiment.

In step S101, the CPU (target node receiving unit 11) receives target nodes for extracting the data the user wants. The CPU then moves the process to step S102.

In step S102, the CPU (ancestor node specifying unit 12) performs the common ancestor node specifying process and specifies the common ancestor node of the target nodes received in step S101. The CPU then moves the process to step S103.

In step S103, the CPU (path extraction unit 13, search path generation unit 14) performs the search path generation process and generates a search path from the common ancestor node to the extraction target nodes, which expresses the extraction rule for the extraction target nodes to be extracted. The CPU then moves the process to step S104.

In step S104, the CPU (data extraction unit 15) extracts data using the search path generated in step S103. The CPU then ends the process.

FIG. 6 is a flowchart showing the common ancestor node specifying process of the data extraction device 10 according to an embodiment.

In step S201, the CPU (ancestor node specifying unit 12) groups the received target nodes. More specifically, the CPU groups the target nodes by combining them in order from the lowest nodes, based on the tree structure of the HTML document acquired from the web data DB 31. The CPU then moves the process to step S202.

In step S202, the CPU (ancestor node specifying unit 12) extracts the common upper node of two groups. More specifically, the CPU ranks the groups according to tag type, and extracts the common upper node of the lowest-ranked group and the group ranked at the same level as, or at the next level above, that group. The CPU then moves the process to step S203.

In step S203, the CPU (ancestor node specifying unit 12) stores the groups one level below in association with the extracted common upper node. More specifically, the CPU stores, in association with the extracted common upper node, the groups that have that common upper node as their upper node. The CPU then moves the process to step S204.

In step S204, the CPU (ancestor node specifying unit 12) merges the groups whose common upper node was extracted into one group. The CPU then moves the process to step S205.

In step S205, the CPU (ancestor node specifying unit 12) determines whether two or more groups exist. That is, the CPU determines whether two or more groups remain after the groups have been merged. If the determination is YES, the CPU moves the process to step S202; if NO, the CPU moves the process to step S206.

In step S206, the CPU (ancestor node specifying unit 12) specifies the last extracted common upper node as the common ancestor node. The CPU then ends this process and moves to the step following the step that invoked it.

FIG. 7 is a flowchart showing the search path generation process of the data extraction device 10 according to an embodiment.

In step S301, the CPU (path extraction unit 13) sets the specified common ancestor node as the start node. More specifically, the CPU sets the common ancestor node specified in step S206 as the start node. The CPU then moves the process to step S302.

In step S302, the CPU (path extraction unit 13) acquires, on the paths to the extraction target nodes, the groups one level below that have the start node as their common upper node. More specifically, from the groups stored in association with common upper nodes in step S203, the CPU acquires the groups whose common upper node is the start node. The CPU then moves the process to step S303.

In step S303, the CPU (search path generation unit 14) obtains the optimal route to the groups by DP matching based on the acquired groups (see FIG. 4). The CPU then moves the process to step S304.

In step S304, the CPU (search path generation unit 14) generates, based on the obtained optimal route, an optimal route from the common ancestor node to the groups that includes wildcard expressions matching a plurality of patterns. The CPU then moves the process to step S305.

In step S305, the CPU (search path generation unit 14) determines whether the nodes constituting the groups are extraction target nodes. If the determination is YES, the CPU moves the process to step S307; if NO, the CPU moves the process to step S306.

In step S306, the CPU (search path generation unit 14) sets a node constituting the groups as the start node. The CPU then moves the process to step S302.

In step S307, the CPU (search path generation unit 14) sets the generated optimal route as the search path. The CPU then ends this process and moves to the step following the step that invoked it.

FIG. 8 is a diagram showing an example of data extraction by the data extraction device 10 according to an embodiment of the present invention. In the example of FIG. 8, the HTML document consists of "How to make curry rice" and "How to make beef with grated daikon sauce", as follows.
<html>
<head>
<title>Dishes using beef</title>
</head>
<body>
<h1>How to make curry rice</h1>
<dl>
<dt>beef</dt><dd>200 g</dd>
<dt><img src="curry.jpg">curry roux</dt><dd>1/2 pack</dd>
<dt>carrot</dt><dd>1 piece</dd>
<dt>onion</dt><dd>2 pieces</dd>
<dt>potato</dt><dd>2 pieces</dd>
</dl>
</body>
<body>
<h1>How to make beef with grated daikon sauce</h1>
<dl>
<dt><img src="tokusan_gyuu.jpg">beef chuck roll</dt><dd>400 g</dd>
<dt>daikon radish</dt><dd>150 g</dd>
<dt>bean sprouts</dt><dd>200 g</dd>
<dt>green pepper</dt><dd>2 pieces</dd>
</dl>
</body>
</html>

The example of FIG. 8(a) shows that, in the image displayed on the display device of the data extraction device 10, "How to make curry rice", "beef", "200 g", "curry roux", and "1/2 pack" have been designated with the mouse pointer 601 and received as target nodes, that is, as hints for extracting the extraction target nodes. The example of FIG. 8(a) also shows that the data extraction process is started with the extraction button 610.

The example of FIG. 8(b) shows that the data extraction device 10 has estimated the items to be extracted from the HTML document on the basis of the target nodes received as hints, and has additionally extracted "carrot" and "1 piece" through "potato" and "2 pieces", "How to make beef with grated daikon sauce", and "beef chuck roll" and "400 g" through "green pepper" and "2 pieces", regardless of the presence or absence of the image data 611 ("curry.jpg") and the image data 612 ("tokusan_gyuu.jpg").
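The kind of result shown in FIG. 8(b) can be approximated with a short sketch that collects every text whose enclosing tag path is one of the wanted paths. It is a simplified illustration, assuming Python's standard `html.parser`; it sidesteps the wildcard by treating <img> as an element that does not open a new path level, which is an assumption of this sketch and not the method of the embodiment. The names `PathTextCollector` and `extract_by_paths` are hypothetical.

```python
from html.parser import HTMLParser


class PathTextCollector(HTMLParser):
    """Records each non-empty text together with the tags that enclose it."""

    VOID = {"img", "br", "hr"}  # assumption: these never enclose text

    def __init__(self):
        super().__init__()
        self.stack = []
        self.found = []  # list of (tag path tuple, text)

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop until the matching open tag has been removed.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.found.append((tuple(self.stack), text))


def extract_by_paths(html, wanted_paths):
    """Return the texts whose tag path is in wanted_paths, in document order."""
    collector = PathTextCollector()
    collector.feed(html)
    collector.close()
    return [text for path, text in collector.found if path in wanted_paths]


DOC = """<html><head><title>Dishes using beef</title></head><body>
<h1>How to make curry rice</h1>
<dl>
<dt>beef</dt><dd>200 g</dd>
<dt><img src="curry.jpg">curry roux</dt><dd>1/2 pack</dd>
<dt>carrot</dt><dd>1 piece</dd>
</dl>
</body></html>"""

WANTED = {
    ("html", "body", "h1"),
    ("html", "body", "dl", "dt"),
    ("html", "body", "dl", "dd"),
}
print(extract_by_paths(DOC, WANTED))
```

The entry with an image and the entries without one are returned alike, while the page title under <head> is left out because its path is not among the wanted paths.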

According to the present embodiment, the data extraction device 10 has a web data DB 31 that stores HTML documents. The data extraction device 10 receives from the user the designation of two or more target nodes from among the nodes constituting an HTML document stored in the web data DB 31, and specifies a common ancestor node, which is an upper node common to all the received target nodes. Next, the data extraction device 10 extracts all paths from the specified common ancestor node to the designated target nodes and, based on the extracted paths and the repetition structure of the nodes constituting the HTML document, performs DP matching to estimate and generate a search path from the common ancestor node to the extraction target nodes. The search path expresses the extraction rule for the extraction target nodes to be extracted, and its expression partially includes a wildcard that matches a plurality of patterns. The data extraction device 10 then extracts, from the HTML document including the common ancestor node, the extraction target nodes found according to the generated search path. The data extraction device 10 can therefore receive the designation of target nodes as hints from the user and extract the data the user wants from the HTML document easily and at high speed.

Although embodiments of the present invention have been described above, the present invention is not limited to those embodiments. The effects described in the embodiments merely list the most preferable effects arising from the present invention, and the effects of the present invention are not limited to those described in the embodiments.

Description of Symbols
10 Data extraction device
11 Target node receiving unit
12 Ancestor node specifying unit
13 Path extraction unit
14 Search path generation unit
15 Data extraction unit
31 Web data DB
50 Web server
70 Internet

Claims

1. A data extraction device comprising:
HTML document storage means for storing HTML documents;
target node receiving means for receiving designation of two or more target nodes from among the nodes constituting an HTML document stored in the HTML document storage means;
ancestor node specifying means for specifying a common ancestor node that is an upper node common to all the target nodes received by the target node receiving means;
path extraction means for extracting all paths from the common ancestor node specified by the ancestor node specifying means to the designated target nodes;
search path generation means for generating, based on the paths extracted by the path extraction means and the repetition structure of the nodes constituting the HTML document, a search path from the common ancestor node to an extraction target node, the search path indicating an extraction rule for the extraction target node to be extracted; and
data extraction means for extracting, from the HTML document including the common ancestor node, the extraction target node found according to the search path generated by the search path generation means.

2. The data extraction device according to claim 1, wherein the search path generation means generates, based on all the paths extracted by the path extraction means, the search path having an expression that partially includes a wildcard matching a plurality of patterns.

3. The data extraction device, wherein the search path generation means performs DP (Dynamic Programming) matching on all the paths extracted by the path extraction means to generate the search path having an expression that partially includes the wildcard.

4. The data extraction device according to claim 1, wherein
the ancestor node specifying means includes:
means for obtaining a first common node that is a common upper node of two target nodes, by combining the two or more designated target nodes in order from the target nodes lowest in the HTML document; and
means for obtaining a second common node that is a common upper node of two of the first common nodes,
and repeats this until a single common upper node remains, and specifies that single common upper node as the common ancestor node;
the path extraction means includes:
means for extracting a path from the common ancestor node to a common upper node one level below; and
means for extracting, when the common upper node one level below is not the extraction target node, a path from that common upper node to a common upper node one further level below,
and repeats this to extract the path to the extraction target node; and
the search path generation means includes:
means for generating a search path from the common ancestor node to a common upper node one level below by performing DP matching on the extracted paths; and
means for generating, when the common upper node one level below is not the extraction target node, a search path from that common upper node to a common upper node one further level below by performing DP matching on the extracted paths,
and repeats this to generate the search path to the extraction target node.

5. A method executed by a data extraction device having HTML document storage means for storing HTML documents, the method comprising:
a target node receiving step of receiving designation of two or more target nodes from among the nodes constituting an HTML document stored in the HTML document storage means;
an ancestor node specifying step of specifying a common ancestor node that is an upper node common to all the target nodes received in the target node receiving step;
a path extraction step of extracting all paths from the common ancestor node specified in the ancestor node specifying step to the designated target nodes;
a search path generation step of generating, based on the paths extracted in the path extraction step and the repetition structure of the nodes constituting the HTML document, a search path from the common ancestor node to an extraction target node, the search path indicating an extraction rule for the extraction target node to be extracted; and
a data extraction step of extracting, from the HTML document including the common ancestor node, the extraction target node found according to the search path generated in the search path generation step.