JP5559104B2 - Information extraction method, information extraction apparatus, and information extraction program

Info

Publication number: JP5559104B2
Application number: JP2011166460A
Authority: JP
Inventors: 正之杉崎; 裕一郎関口; 健司江崎; 匡内山
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-07-29
Filing date: 2011-07-29
Publication date: 2014-07-23
Anticipated expiration: 2031-07-29
Also published as: JP2013030041A

Description

The present invention relates to a technique for extracting the body text from a structured document such as a Web page.

In recent years, computer networks such as the Internet have made it possible to use large volumes of digitized documents and to publish information to an unspecified large audience. Electronic documents on a computer network use forms of expression that exploit the medium, and Web pages on the WWW (World Wide Web) are commonly written as structured documents in a markup language.

For example, a document written in HTML (HyperText Markup Language) not only describes information but also provides "hyperlinks" for referring to documents written by other people and hosted on other computers. Hyperlinks are used, for example, to supplement a document's own information by relying on another document, or to point to a document with similar content.

On the other hand, an HTML file may also contain information unrelated to what it is primarily meant to convey, such as "advertisement" and "menu" portions. Search services, analysis of access logs for HTML files, and similar applications therefore need to extract the information that the HTML file was originally intended to convey. In this description, the extracted portion is called the "body text", and extracting it is called "body text extraction".

"Web scraping" (sometimes simply called scraping) is an existing technique for extracting body text using the HTML tag information in an HTML file. One known way to perform scraping is to specify which part of the HTML (XML) file to extract using XSLT (Extensible Stylesheet Language Transformations), a simple language for transforming a document written in XML into another XML document. This method is described in Non-Patent Document 1.

Non-Patent Document 1: Erik T. Ray (translated by 山本和彦, 中原晃司, 梶浦正規, 豊田公児), "入門XML", Ohmsha, Ltd., September 28, 2001, pp. 203-241
Non-Patent Document 2: "要素の親子関係／HTML基礎講座" (Parent-child relationships of elements / HTML basic course), [online], first published May 18, 2002, last updated March 7, 2004, [retrieved July 13, 2011], Internet <URL:http://www,scollabo.com/banban/lectur/ht6.html>
Non-Patent Document 3: "ブロックレベル要素とインライン要素" (Block-level elements and inline elements), [online], updated April 5, 2001, [retrieved July 13, 2011], Internet <URL:http://www.kanzaki.com/docs/html/element-level.html>

However, the method of Non-Patent Document 1 is practical only when the HTML files to be processed are few, because an extraction "rule" written in XSLT must be prepared to suit each file. When there is a large number of HTML files, for example when collecting every HTML file on the Internet, preparing a body text extraction rule for each HTML file in advance is extremely difficult.

Moreover, there is no guarantee that a body text extraction rule, once prepared, can be used indefinitely; if the content of the HTML file changes, the rule may no longer extract the body text.

The present invention has been made to solve the above problems of the prior art, and its object is to extract body text from a structured document without depending on body text extraction rules.

To that end, the present invention uses the reference relationship between structured documents and extracts, from the link-destination structured document, the portion carrying information that also appears in the link-source structured document. In other words, by extracting from the referenced (link-destination) structured document the portion that contains information present in the referring (link-source) structured document, information in unnecessary portions such as advertisements and menus is excluded and only the body text is extracted.

The information extraction method of the present invention comprises: a link source information extraction step of extracting, from each structured document in a collected group of structured documents, the links present in that document and the text information surrounding each link (link peripheral text information); and a text extraction step of identifying the link-destination structured document based on a link extracted in the link source information extraction step and extracting, as the body text, the representative portion of the link-destination structured document that contains the link peripheral text information extracted in that step.

The information extraction apparatus of the present invention comprises: a link source information extraction unit that extracts, from each structured document in a collected group of structured documents, the links present in that document and the text information surrounding each link (link peripheral text information); and a text extraction unit that identifies the link-destination structured document based on a link extracted by the link source information extraction unit and extracts, as the body text, the representative portion of the link-destination structured document that contains the link peripheral text information extracted by that unit.

The present invention may also be embodied as a program that causes a computer to function as the above apparatus. The program can be provided over a network or on a recording medium.

According to the present invention, body text can be extracted from a structured document without depending on body text extraction rules.

FIG. 1 is a block diagram of an information extraction apparatus according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the processing of the link source information extraction unit.
FIG. 3 shows example documents containing reference information in the form of hyperlinks.
FIG. 4(a) shows an example of the HTML file (source) of document A in FIG. 3, and FIG. 4(b) shows an example of the HTML file (source) of document B in FIG. 3.
FIG. 5 is a flowchart showing the processing of the text extraction unit.

An information extraction apparatus according to an embodiment of the present invention is described below. The apparatus processes structured documents written in a markup language, mainly Web pages on the WWW, and extracts body text by using reference expressions.

The following describes an example of processing based on the reference expressions provided by hyperlinks in HTML documents. That is, the body text is extracted from a link-destination HTML file by using both the link-source HTML file and the link-destination HTML file. Information written in the link-source HTML file is extracted, and if that information is present in the link-destination HTML file, the corresponding tag (for example, a tag that contains all of the information extracted from the link-source file) is picked out.

≪Configuration example≫
A configuration example of the information extraction apparatus is described with reference to FIG. 1. The information extraction apparatus 1 is implemented on a computer and has ordinary computer hardware resources such as a CPU, memory (RAM), and a storage device such as a hard disk drive.

Through cooperation between these hardware resources and software resources (an OS, applications, and so on), the information extraction apparatus 1 implements a document set recording unit 2, a document set DB 3, a link source information extraction unit 4, a text extraction unit 5, and an output unit 6. The recording unit 2 records information on the documents to be processed, for example HTML files, received through an input unit (not shown), in the DB 3. The HTML files may have been collected by, for example, search engine crawling. The DB 3 is assumed to be built on the storage device.

The link source information extraction unit 4 extracts the hyperlinks that realize reference relationships in the HTML file of each document recorded in the DB 3, together with the text surrounding each link (hereinafter, "link peripheral text"). The unit may extract the character string enclosed by the hyperlink, or, based on the hierarchical structure (parent-child relationships) of HTML elements (tags), the character strings within the n-th upper block element (n being a positive integer) that contains the URL of the link-destination document.

For each document recorded in the DB 3, the text extraction unit 5 identifies the document as a link-destination document if the link source information extraction unit 4 has extracted a hyperlink that refers to its HTML file. The unit compares the character strings of the text information in the HTML file of the identified link-destination document with the character strings of the link peripheral text extracted by the link source information extraction unit 4, and extracts the representative portion of the link-destination document as the body text. Specifically, it extracts the character strings under the N-th upper element (parent tag; N being a positive integer) that contains all of the character strings extracted by the link source information extraction unit 4.

As a result, information in unnecessary portions such as advertisements and menus is excluded, and only the portion corresponding to the body text is extracted. The extracted body text is output through the output unit 6, for example as the portion representing the HTML file when a search engine builds its search index. The processing of the information extraction apparatus 1 is described in detail below, divided into the link source information extraction unit 4 and the text extraction unit 5.

≪Processing example of the link source information extraction unit 4≫
First, a processing example of the link source information extraction unit 4 is described with reference to FIG. 2. Documents A to D in FIG. 3 are the processing targets. Documents A to D are HTML files provided by an Internet site that offers a product purchasing service; document A lists products, and documents B to D describe products 1 to 3 in detail, respectively.

In FIG. 3, the underlines on product names 1 to 3 in document A indicate hyperlinks that refer to documents B to D: the hyperlink on product name 1 refers to document B, the hyperlink on product name 2 refers to document C, and the hyperlink on product name 3 refers to document D. The HTML files of documents A to D are assumed to be recorded in the DB 3.

When processing starts, the link source information extraction unit 4 first acquires an HTML file to be processed from the DB 3 (S01). Here, the HTML file of document A is acquired as an example. The unit then checks whether the HTML file acquired in S01 contains a hyperlink (S02). If there is no hyperlink, the processing of the link source information extraction unit 4 ends. If there is a hyperlink, the hyperlink and its link peripheral text are extracted (S03), and the processing ends. The extracted information is defined by one of the following (1) and (2).
(1) The character string enclosed by the hyperlink (the ANCHOR tag among the HTML tags), including the URL
(2) The character strings within the block element that contains the URL of the hyperlink-destination document and appears n-th (n being a positive integer) above the hyperlink
The range of the extracted information can be changed through the choice between definitions (1) and (2) and through the value of n. For example, selecting definition (1) extracts the minimum range, namely the character string enclosed by the hyperlink (the ANCHOR tag). Selecting definition (2) with n set to its maximum value (the number of levels in the HTML element hierarchy) extracts the maximum range, namely the full text (the entire contents of the BODY tag). A condition specifying whether the extracted character strings include HTML tags ("<>") can also be given.

As shown in Non-Patent Document 2, HTML elements (tags) form a hierarchical structure in which one element contains another element, which in turn contains yet another. This hierarchy is commonly described in terms of parent-child relationships (parent, child, grandchild elements, and so on), with every element having such a relationship; the "n-th upper" element in definition (2) refers to this parent-child relationship between elements.

As shown in Non-Patent Document 3, elements are classified into block elements, which form blocks in the rendered page (headings, paragraphs, and so on), and inline elements, which appear continuous with the block element when rendered. In the HTML file of document A shown in FIG. 4(a), DIV is a block element, while SPAN and A (ANCHOR) are inline elements.

Seen from an A (ANCHOR) tag, the DIV tag with "id=TR" is the first upper block element, and the DIV tag with "id=TABLE" is the second upper block element. The href attribute of the A (ANCHOR) tag specifies the URL of the link destination.
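
The upward walk from an anchor to its enclosing block elements can be sketched as follows. This is only an illustration: the markup is a hypothetical stand-in for FIG. 4(a), which is not reproduced in the text, the set of block element names is an assumption, and the well-formed sample is parsed with Python's `xml.etree.ElementTree` (real-world HTML would need a more tolerant parser).

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for FIG. 4(a); the figure is not reproduced in the text.
DOC_A = """
<body>
  <div id="TABLE">
    <div id="TR">
      <span><a href="b.html">Product name 1</a></span>
      <span>Price: 100 yen</span>
    </div>
  </div>
</body>
"""

# Assumed set of block elements; the source names only DIV explicitly.
BLOCK_ELEMENTS = {"body", "div", "p", "table", "ul", "ol", "li"}


def block_ancestors(element, parents):
    """Return the block elements enclosing `element`, nearest first."""
    found = []
    node = parents.get(element)
    while node is not None:
        if node.tag in BLOCK_ELEMENTS:
            found.append(node)
        node = parents.get(node)
    return found


root = ET.fromstring(DOC_A.strip())
# ElementTree keeps no parent pointers, so build a child -> parent map.
parents = {child: parent for parent in root.iter() for child in parent}
anchor = root.find(".//a")
ancestor_ids = [node.get("id") for node in block_ancestors(anchor, parents)]
print(ancestor_ids)  # ['TR', 'TABLE', None]  (the last entry is <body>, which has no id)
```

The SPAN around the anchor is skipped because it is an inline element, so the first entry corresponds to n = 1 and the second to n = 2.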

An example of extraction from the HTML file of document A follows. As shown in FIG. 4(a), the HTML file of document A contains three A (ANCHOR) tags, so S02 confirms that three hyperlinks are embedded. As their href attributes show, these A (ANCHOR) tags refer to documents B to D, respectively. As an example, assume that the extraction range from the link-source document in S03 is set to definition (2) with n = 1, and that the condition "human-readable text only, without tags" is also given.

Seen from each A (ANCHOR) tag, the DIV tag with "id=TR" is the first upper block element, so the character strings enclosed by the SPAN tags under that DIV tag are extracted. For document B, therefore, the content under the upper "DIV id=TR" is extracted: the URL of document B given in the href attribute and the link peripheral text "Product name 1", "Price: 100 yen", and "Color: red".

For document C, the content under the middle "DIV id=TR" is extracted: the URL of document C given in the href attribute and the link peripheral text "Product name 2", "Price: 300 yen", and "Color: blue".

For document D, the content under the lower "DIV id=TR" is extracted: the URL of document D given in the href attribute and the link peripheral text "Product name 3", "Price: 300 yen", and "Color: yellow". The extracted link-destination URLs and link peripheral text are stored in the storage device.
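
Steps S01 to S03 on document A might look roughly like the sketch below. It is not the patent's implementation: the markup is a hypothetical reconstruction of FIG. 4(a), the function names, the block element set, and the file names `b.html` to `d.html` are assumptions, and falling back to the whole document when n exceeds the nesting depth is one reading of "n = maximum value".

```python
import xml.etree.ElementTree as ET

# Hypothetical reconstruction of FIG. 4(a); ids follow the description.
DOC_A = """
<body>
  <div id="MENU"><span>Menu</span></div>
  <div id="TABLE">
    <div id="TR">
      <span><a href="b.html">Product name 1</a></span>
      <span>Price: 100 yen</span><span>Color: red</span>
    </div>
    <div id="TR">
      <span><a href="c.html">Product name 2</a></span>
      <span>Price: 300 yen</span><span>Color: blue</span>
    </div>
    <div id="TR">
      <span><a href="d.html">Product name 3</a></span>
      <span>Price: 300 yen</span><span>Color: yellow</span>
    </div>
  </div>
</body>
"""

BLOCK_ELEMENTS = {"body", "div", "p", "table", "ul", "ol", "li"}  # assumed subset


def visible_text(element):
    """Human-readable text fragments under `element`, with tags removed."""
    return [t.strip() for t in element.itertext() if t.strip()]


def nth_upper_block(element, parents, n):
    """Return the n-th block element above `element`, or None if there are fewer."""
    node, count = parents.get(element), 0
    while node is not None:
        if node.tag in BLOCK_ELEMENTS:
            count += 1
            if count == n:
                return node
        node = parents.get(node)
    return None


def extract_link_sources(html, definition=2, n=1):
    """Map each link URL to its link peripheral text (definition (1) or (2))."""
    root = ET.fromstring(html.strip())
    parents = {child: parent for parent in root.iter() for child in parent}
    extracted = {}
    for anchor in root.iter("a"):  # S02: no anchors means nothing is extracted
        url = anchor.get("href")
        if url is None:
            continue
        if definition == 1:
            scope = anchor  # only the string enclosed by the hyperlink
        else:
            scope = nth_upper_block(anchor, parents, n)
            if scope is None:  # n beyond the hierarchy: take the whole document
                scope = root
        extracted[url] = visible_text(scope)  # S03
    return extracted


link_index = extract_link_sources(DOC_A, definition=2, n=1)
for url, texts in link_index.items():
    print(url, texts)
```

With definition (2) and n = 1 the mapping for `b.html` holds the three strings given in the description; switching to definition (1) narrows it to the product name alone.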

≪Processing of the text extraction unit 5≫
Next, a processing example of the text extraction unit 5 is described with reference to FIG. 5. When processing starts, the text extraction unit 5 acquires an HTML file to be processed from the DB 3 (S11). Here, the HTML files of documents B to D are acquired as an example.

The unit checks whether a hyperlink referring to the acquired HTML file was extracted in S03 (S12). This check can use the link-destination URLs stored in the storage device. If no such hyperlink is found, processing of that file ends and the unit returns to S11 to acquire the next HTML file.

If such a hyperlink is found, the acquired HTML file is identified as a link-destination document, the link peripheral text extracted in S03 is read from the storage device, and the representative portion of the link-destination document that contains the link peripheral text is extracted as the body text (S13). This requires a condition defining how much of the link-destination document counts as the representative portion, that is, the body text. In principle, any portion that contains the character strings of the link peripheral text extracted in S03 would do, but taking every character string in the link-destination HTML file would increase the data volume with useless information.

The range of the representative portion is therefore specified as follows: the unit searches for the N-th upper parent element (tag; N being a positive integer) that contains all of the character strings of the link peripheral text extracted in S03, and takes the character strings under that element as the body text. Changing this body text extraction condition, that is, the value of N, changes the character strings extracted from the HTML file. A condition specifying whether the body text includes HTML tags can also be given.

A processing example for the HTML file of document B follows. The body text extraction condition is set to N = 1, so the first upper parent element is searched for. For document B, assume that S12 has confirmed the hyperlink extracted in S03 and that the storage device holds the link peripheral text "Product name 1", "Price: 100 yen", and "Color: red" extracted in S03.

In the HTML of document B, as shown in FIG. 4(b), "Product name 1" is enclosed by the SPAN tag with "id=name", "Price: 100 yen" by the SPAN tag with "id=price", and "Color: red" by the SPAN tag with "id=color". The search for the first upper parent tag that contains all of them returns the DIV tag.

Accordingly, "the character strings under that element" are the character strings under the DIV tag. Assume the condition "human-readable text only, without tags" is given. The character string "Product name 1, Photo, Model number: 123456, Price: 100 yen, Color: red, Summary: Selling well" is then extracted as the representative portion of document B, that is, its body text. If the condition is instead to include HTML tags, the character strings are extracted with their tags. The extracted body text is output to the output unit 6 (S14), and the processing ends. The body text output to the output unit 6 is provided to, for example, a search engine, where it helps identify the information the HTML file was originally meant to convey when the search index is built.
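
Steps S11 to S14 for document B can be sketched along the following lines. The markup is a hypothetical reconstruction of FIG. 4(b) (only the ids `name`, `price`, and `color` come from the description; the advertisement and menu blocks and the other ids are assumptions), the function names are illustrative, and matching link peripheral text by exact equality of text fragments is a simplifying assumption. The tag-preserving output condition is not covered.

```python
import xml.etree.ElementTree as ET

# Hypothetical reconstruction of FIG. 4(b).
DOC_B = """
<html><body>
  <div id="AD"><span>Advertisement</span></div>
  <div id="MAIN">
    <span id="name">Product name 1</span>
    <span id="photo">Photo</span>
    <span id="model">Model number: 123456</span>
    <span id="price">Price: 100 yen</span>
    <span id="color">Color: red</span>
    <span id="summary">Summary: Selling well</span>
  </div>
  <div id="MENU"><span>Menu</span></div>
</body></html>
"""

# Result of S03 as held in the storage device: destination URL -> link peripheral text.
LINK_INDEX = {"b.html": ["Product name 1", "Price: 100 yen", "Color: red"]}


def visible_text(element):
    """Human-readable text fragments under `element`, with tags removed."""
    return [t.strip() for t in element.itertext() if t.strip()]


def contains_all(element, strings):
    """True if every string occurs as a text fragment under `element`."""
    fragments = visible_text(element)
    return all(s in fragments for s in strings)


def extract_body(html, link_texts, N=1):
    """Return the text under the N-th upper element containing all link_texts."""
    root = ET.fromstring(html.strip())
    if not contains_all(root, link_texts):
        return None  # the link peripheral text does not occur in this document
    parents = {child: parent for parent in root.iter() for child in parent}
    node = root
    while True:  # descend to the lowest element that still contains every string
        deeper = next((c for c in node if contains_all(c, link_texts)), None)
        if deeper is None:
            break
        node = deeper
    for _ in range(N - 1):  # N = 1 keeps that element; larger N widens the range
        node = parents.get(node, node)
    return visible_text(node)


def extract_bodies(documents, link_index, N=1):
    """S11 to S13 over a set of documents keyed by URL."""
    bodies = {}
    for url, html in documents.items():  # S11: take the next HTML file
        if url not in link_index:  # S12: no hyperlink refers to this file
            continue
        body = extract_body(html, link_index[url], N)  # S13
        if body is not None:
            bodies[url] = body
    return bodies


documents = {
    "b.html": DOC_B,
    "about.html": "<html><body><div>About us</div></body></html>",
}
bodies = extract_bodies(documents, LINK_INDEX, N=1)
print(", ".join(bodies["b.html"]))  # S14: hand the body text to the output stage
```

With N = 1 the advertisement and menu blocks are left out because neither lies under the element that contains all three link strings; with N = 2 the range widens to the whole BODY in this sample.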

In this way, the information extraction apparatus 1 can extract body text while excluding unnecessary portions such as advertisements and menus, without depending on conventional body text extraction rules, which reduces the cost of maintaining such rules. It is particularly useful in that, even for HTML files created by others on the Internet, body text extraction can adapt as those files are updated.

The present invention is not limited to the above embodiment and can be modified within the scope of the claims. For example, the choice between definitions (1) and (2) and the value of n in S03, and the body text extraction condition (the value of N) in S13, may differ for each link-source HTML file and for each link-destination HTML file. That is, they may be the same for all documents to be processed, or they may be selected and set individually for each Internet site.
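
One way to hold such per-site settings is a lookup keyed by host name with a shared default, as in the sketch below. The class, the field names `link_n` and `body_n` (standing for n and N), and the host names are all hypothetical; the description does not say how the settings are stored.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class ExtractionSettings:
    definition: int = 2  # definition (1) or (2) used in S03
    link_n: int = 1      # n: which upper block element bounds the link peripheral text
    body_n: int = 1      # N: which upper element bounds the body text in S13


# Shared default, plus overrides for individual sites (hypothetical hosts).
DEFAULT = ExtractionSettings()
PER_SITE = {
    "shop.example.com": ExtractionSettings(definition=1, body_n=2),
}


def settings_for(url):
    """Return the settings for the site hosting `url`, or the default."""
    host = urlparse(url).hostname
    return PER_SITE.get(host, DEFAULT)


print(settings_for("http://shop.example.com/b.html"))
print(settings_for("http://other.example.org/x.html"))
```

Leaving `PER_SITE` empty gives the uniform configuration, in which every document is processed with the same values.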

To keep the information extracted as body text uniform, however, the choice between definitions (1) and (2) and the value of n are preferably the same for all link-source HTML files. Likewise, the body text extraction condition (the value of N) is preferably the same for all link-destination HTML files.

The present invention is not limited to HTML documents; structured documents written in other markup languages such as XML can also be processed. In that case, XLink (XML Linking Language), which defines links between XML documents, can be used.
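
For XML input, the link source side could collect XLink references instead of anchor tags. The sketch below is an assumption-laden illustration: the catalog markup and file names are invented, only simple links carrying an `xlink:href` attribute are handled, and the text of the linking element itself stands in for the link peripheral text.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"  # namespace defined by the XLink specification

# Hypothetical XML document with two simple XLink links.
CATALOG = """
<catalog xmlns:xlink="http://www.w3.org/1999/xlink">
  <item xlink:type="simple" xlink:href="b.xml">Product name 1</item>
  <item xlink:type="simple" xlink:href="c.xml">Product name 2</item>
  <note>No link here</note>
</catalog>
"""


def extract_xlinks(xml_text):
    """Map each xlink:href target to the text of the linking element."""
    root = ET.fromstring(xml_text.strip())
    href_key = f"{{{XLINK_NS}}}href"  # ElementTree stores it as '{namespace}href'
    links = {}
    for element in root.iter():
        url = element.get(href_key)
        if url is not None:
            links[url] = "".join(element.itertext()).strip()
    return links


print(extract_xlinks(CATALOG))
```

The resulting mapping plays the same role as the URL and link peripheral text pairs stored after S03 in the HTML case.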

≪Programs and the like≫
The present invention can also be configured as a document search program that causes a computer to function as some or all of the units 2 to 6 of the information extraction apparatus 1. Such a program allows a computer to execute some or all of S01 to S03 and S11 to S14.

The program can be provided over a network, for example through a Web site or by e-mail. The program can also be recorded on a recording medium such as a CD-ROM, DVD-ROM, CD-R, CD-RW, DVD-R, DVD-RW, MO, HDD, BD-ROM, BD-R, or BD-RE for storage and distribution. Such a recording medium is read by a recording medium drive, and the program code itself realizes the processing of the above embodiment, so the recording medium also constitutes the present invention.

DESCRIPTION OF SYMBOLS
1 ... Information extraction apparatus
2 ... Document set recording unit
3 ... Document set DB
4 ... Link source information extraction unit
5 ... Text extraction unit
6 ... Output unit

Claims

An information extraction method performed by an apparatus that extracts body text from a link-destination structured document based on structured documents in which links realizing reference relationships are expressed, the method comprising:
a link source information extraction step of extracting, from each structured document in a collected group of structured documents, the links present in that document and the link peripheral text information of each link; and
a text extraction step of identifying the link-destination structured document based on a link extracted in the link source information extraction step, and extracting, as the body text, a representative portion of the link-destination structured document that contains the link peripheral text information extracted in that step,
wherein the link source information extraction step extracts all character strings enclosed between the start and the end of the tag of the n-th upper block element (n being a positive integer) from the link element that contains the URL of the link-destination structured document, and
wherein the text extraction step extracts, as the representative portion, the character strings under the N-th upper element (N being a positive integer) that contains all of the character strings extracted in the link source information extraction step.

An information extraction device that extracts body text from a link-destination structured document based on structured documents in which links realizing reference relationships are expressed, the device comprising:
a link source information extraction unit that extracts, from each structured document in a collected group of structured documents, the links present in that document and the link peripheral text information of each link; and
a text extraction unit that identifies the link-destination structured document based on a link extracted by the link source information extraction unit, and extracts, as the body text, a representative portion of the link-destination structured document that contains the link peripheral text information extracted by that unit,
wherein the link source information extraction unit extracts all character strings enclosed between the start and the end of the tag of the n-th upper block element (n being a positive integer) from the link element that contains the URL of the link-destination structured document, and
wherein the text extraction unit extracts, as the representative portion, the character strings under the N-th upper element (N being a positive integer) that contains all of the character strings extracted by the link source information extraction unit.

An information extraction program for causing a computer to function as each unit of the information extraction device according to claim 2.