JP5564442B2

JP5564442B2 - Text search device

Info

Publication number: JP5564442B2
Application number: JP2011003037A
Authority: JP
Inventors: 泰彦宮崎; 豪東野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-01-11
Filing date: 2011-01-11
Publication date: 2014-07-30
Anticipated expiration: 2031-01-11
Also published as: JP2012146065A

Description

The present invention relates to an apparatus that searches for and outputs documents on a network, and in particular to a text search device that can extract only the main text from a document at output time.

Today, the network hosts a large number of documents that are addressed by URL and written mainly in HTML (hereinafter, Web documents). In addition to the body text that forms its main content (referred to as the main text), however, a typical Web document also contains advertisements, links to related sites, links for navigating to other Web documents on the same site, and so on.

When such a Web document is displayed on a device with a limited screen size, such as a mobile terminal, or on a signage display installed in a public space, it is desirable to extract and output only the body text.

Likewise, when a user enters keywords into a search engine to search Web documents on the network, keyword hits in advertisements, links to related sites, or navigation links to other Web documents on the same site do not necessarily serve the purpose of the search, so here too it is desirable to extract only the body text.

However, in the HTML that describes a Web document, the body text, advertisements, and navigation links do not differ in any particular way in their tag structure, so extracting only the body text is not necessarily easy.

As prior art addressing this need, Patent Document 1 discloses a technique that converts the data of a Web document into plain text and applies a statistical method to extract the portion estimated to be the body text.

As related background, a technology called RSS is widely known. RSS is variously expanded as "RDF Site Summary", "Rich Site Summary", or "Really Simple Syndication", and its versions differ in their specifications, as Reference Sites 1 and 2 show. What the versions share is that they deliver a summary of each distributed Web document in XML format. In most cases the summary contains the title of the document's body text and a partial excerpt of it. RSS is now offered by a very large number of sites, particularly frequently updated ones such as news sites and blog sites. A similar technology is ATOM (Reference Site 3), which, like RSS, describes and delivers summaries of Web documents in XML format. Hereinafter, all RSS versions and ATOM are collectively referred to simply as RSS. Because RSS specifies which tag carries each item, such as the title or the excerpt, each item can be extracted easily by analyzing the tag structure, unlike HTML. Although RSS versions differ, every RSS document (including ATOM) states its own version, so reading that information and parsing according to the stated version avoids the problem of version differences.

Reference Site 1: http://web.resource.org/rss/1.0/spec
Reference Site 2: http://www.rssboard.org/rss-specification
Reference Site 3: http://www.ietf.org/wg/concluded/atompub

Patent Document 1: JP 2006-338364 A

However, because the method of Patent Document 1 relies on a statistical technique, enough data must be collected for a statistically significant estimate. If the data is insufficient, no valid estimate can be obtained. Moreover, even with sufficient data, some estimation error is unavoidable.

With RSS, the data delivered as XML only needs to be parsed according to its specification, so the problems inherent in statistical methods do not arise. In many cases, however, RSS delivers only an excerpt that contains a small part of the body text. In other words, the full body text cannot be obtained from RSS alone.

In addition, most current Web documents include design specifications called CSS, and users view a Web document as rendered according to those specifications. The conventional techniques above, by contrast, mainly extract only the text data of the Web document. The design of the original Web document therefore cannot be applied to the extracted result, which makes it hard for a user looking at the result to recognize that it was extracted from the original Web document.

To solve these problems, the present invention uses the RSS information distributed together with the Web document. It searches the tree-structured data (DOM) obtained by converting the HTML for the location that closely resembles the "title" given in the RSS feed, identifies that location while taking its style into account as needed, and thereby extracts the body text of the Web document.

A first aspect of the present invention provides, as the solution, a text search device comprising: a summary document acquisition unit that reads data which contains an access destination and a title for a document on a network and which is distributed in a predefined format, and extracts the access destination and title for that document; a tree structure data storage unit that accesses the access destination extracted by the summary document acquisition unit, acquires the document data and the style data of the document, converts the acquired document data into tree-structured data, and stores it; a title node search unit that measures the similarity between the title extracted by the summary document acquisition unit and the text data at nodes of the tree-structured data stored in the tree structure data storage unit, and identifies the title in the tree-structured data based on the measured similarity; and a text node search unit that, while tracing nodes from the title identified by the title node search unit toward the root of the tree, determines based on the style data whether the style of each node satisfies a predetermined condition, and thereby identifies the node containing the main text in the tree-structured data.

A second aspect of the present invention provides, as the solution, a text search device comprising: a summary document acquisition unit that reads data which contains an access destination and a title for a document on a network and which is distributed in a predefined format, and extracts the access destination, title, and summary for that document; a tree structure data storage unit that accesses the access destination extracted by the summary document acquisition unit, acquires the document data and the style data of the document, converts the acquired document data into tree-structured data, and stores it; a title node search unit that measures the similarity between the title extracted by the summary document acquisition unit and the text data at nodes of the tree-structured data stored in the tree structure data storage unit, and identifies the title in the tree-structured data based on the measured similarity; and a text node search unit that, while tracing nodes from the title identified by the title node search unit toward the root of the tree, identifies the node containing the main text in the tree-structured data based on the Levenshtein distance between the summary text acquired by the summary document acquisition unit and text cut out, according to a predetermined rule, from the beginning of the text of the node under comparison.

A third aspect of the present invention provides, as the solution, the text search device of the first or second aspect further comprising an output unit that displays only the main text of the document by deleting or hiding the nodes that are not in a parent-child relationship with the node in the tree-structured data identified by the text node search unit.

According to the present invention, the main text in a Web document can be extracted with high accuracy without relying on statistical methods. In addition, because the main text is identified on the DOM, the extraction result can retain the design of the original document.

FIG. 1 is a configuration diagram of the text search device according to the first embodiment.
FIG. 2 is a diagram showing processing flow (1) of the summary document acquisition unit 11.
FIG. 3 is a diagram showing processing flow (2) of the summary document acquisition unit 11.
FIG. 4 is a diagram showing processing flow (1) of the title node search unit 13.
FIG. 5 is a diagram showing processing flow (2) of the title node search unit 13.
FIG. 6 is a diagram showing an HTML sample.
FIG. 7 is a diagram showing a display example of the HTML sample.
FIG. 8 is a diagram showing an example of the HTML sample converted to a DOM.
FIG. 9 is a configuration diagram of the text search device 1 including an output unit 15.
FIG. 10 is a diagram showing a display example before this embodiment is applied.
FIG. 11 is a diagram showing a display example after this embodiment is applied.

Embodiments of the present invention are described below with reference to the drawings.

(First embodiment)
FIG. 1 is an overall configuration diagram of the text search device according to an embodiment of the present invention.

The text search device 1 comprises a summary document acquisition unit 11 that acquires summary data of a Web document distributed by RSS; a tree structure data storage unit 12 that acquires the data of the Web document, converts it into tree-structured data, and stores it; a title node search unit 13 that searches the tree-structured data for the node serving as the title of the main text of the Web document and identifies it; and a text node search unit 14 that, starting from the identified title node, searches the tree-structured data for the node of the main text and identifies it.

This embodiment can be implemented by running, on a general-purpose computer that is connected to a network and equipped with a CPU, memory, an HDD, and the like, a program that causes the functions of the units described below to operate. The units may all run on the same computer, or each unit may run on a separate computer, with the necessary parameters sent between them over the network.

(Operation of the first embodiment)
The processing flows of the summary document acquisition unit 11 are shown in FIGS. 2 and 3.

First, an HTTP request is sent to the URL designated for acquiring the RSS, and the response (RSS data) is received (S1). As for how the URL is designated, many Web sites explicitly provide a link for obtaining the RSS for the Web documents they distribute, so one method is for the user to set that link URL in advance. Many current Web documents also embed the RSS location in a meta tag, so the device can alternatively be configured to parse the meta tag and access the location it gives.

On access, the RSS data is returned in XML format, and that XML data is parsed (S3). For example, the Universal Feed Parser library published at http://www.feedparser.org/ absorbs the differences between RSS versions (including ATOM) when parsing the acquired data. Using Universal Feed Parser is not essential to practicing the present invention. However, because the tag names of the items to be extracted differ slightly between RSS versions, the following description uses the names given in the Universal Feed Parser documentation [http://www.feedparser.org/docs/reference.html].

An RSS feed contains information on multiple Web documents. The information for each Web document is available as an array called entries; for the i-th Web document, the title (entries[i].title) and the access URL (entries[i].link) are obtained. RSS defines a wide range of items, but relatively few of them are actually populated in practice. The title and access URL, however, are included in almost all RSS feeds.
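As a minimal sketch of the parsing step (S3), assuming an RSS 2.0 feed and using only the Python standard library rather than Universal Feed Parser, the title and access URL of each entry might be extracted as follows. The sample feed, its URLs, and the function name `parse_entries` are hypothetical and serve only as illustration; other RSS versions and ATOM use different element names and are not handled here.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed used only for illustration.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example site</title>
    <link>http://example.com/</link>
    <item>
      <title>First article</title>
      <link>http://example.com/a1.html</link>
      <description>Excerpt of the first article...</description>
    </item>
    <item>
      <title>Second article</title>
      <link>http://example.com/a2.html</link>
    </item>
  </channel>
</rss>"""


def parse_entries(rss_text):
    """Return a list of {"title", "link"} dicts, one per <item> (RSS 2.0 only)."""
    root = ET.fromstring(rss_text)
    entries = []
    for item in root.iter("item"):
        entries.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
        })
    return entries


entries = parse_entries(SAMPLE_RSS)
print(entries[0]["title"], entries[0]["link"])
```

The list of dicts plays the role of the entries array in the description, with `entries[i]["title"]` and `entries[i]["link"]` standing in for entries[i].title and entries[i].link.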

How processing continues after parsing depends on the intended use of this embodiment; three examples are described below.

If the URL to be extracted is fixed in advance (call it Target_URL), the array is searched for the entry whose entries[i].link matches it, and that link is passed together with entries[i].title to the title node search unit 13 as parameters (FIG. 2: S5, S7).

If no URL is fixed, the first element of the array, that is, entries[0].title and entries[0].link, may for example be passed to the title node search unit 13 as parameters (not illustrated).

Alternatively, entries[i].title and entries[i].link may be passed to the title node search unit 13 one after another, incrementing i by 1 at fixed time intervals (FIG. 3: S11 to S15).
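The three hand-off strategies above can be sketched as follows. This is an illustration under assumptions: `entries` is a list of dicts with `title` and `link` keys, `handle` stands in for the call to the title node search unit 13, and all function names are hypothetical.

```python
import time


def select_by_url(entries, target_url):
    """FIG. 2 (S5, S7): return the entry whose link equals target_url, or None."""
    for entry in entries:
        if entry["link"] == target_url:
            return entry
    return None


def select_first(entries):
    """When no target URL is fixed: use the first entry, if there is one."""
    return entries[0] if entries else None


def feed_sequentially(entries, handle, interval_sec=0.0):
    """FIG. 3 (S11 to S15): pass each entry to handle(title, link) in turn,
    waiting interval_sec seconds between entries."""
    for i, entry in enumerate(entries):
        if i > 0:
            time.sleep(interval_sec)
        handle(entry["title"], entry["link"])


entries = [
    {"title": "First article", "link": "http://example.com/a1.html"},
    {"title": "Second article", "link": "http://example.com/a2.html"},
]
hit = select_by_url(entries, "http://example.com/a2.html")
seen = []
feed_sequentially(entries, lambda title, link: seen.append(link))
```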

The tree structure data storage unit 12 sends an HTTP request to the designated URL and obtains HTML as the response. The obtained HTML is converted into a tree-shaped data structure in accordance with the Document Object Model (DOM) specification [http://www.w3.org/DOM/DOMTR] and stored in memory or the like. In the tree-structured DOM data, each HTML tag forms a node of the tree, and among the data that is displayed the node for the BODY tag is the topmost. The unit also parses Cascading Style Sheets (CSS), which part of the DOM specification covers, and converts styles such as the size of each data structure. CSS may be written directly in the HTML of the Web document, or the HTML may give only a URL from which the CSS must be fetched. This kind of conversion, including loading external CSS in the latter case, is a well-known technique implemented in almost all current browser software.
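To make the conversion of HTML into a tree concrete, the following is a minimal sketch using Python's standard `html.parser` module. It is not a conforming DOM implementation and it ignores CSS entirely; the `Node` and `TreeBuilder` classes and the helper functions are hypothetical names introduced only for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, for illustration).
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    """Very small stand-in for a DOM element node."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""  # text directly inside this element


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag name, then close it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def find_all(node, tag):
    """Return all descendant nodes with the given tag name, in document order."""
    found = []
    for child in node.children:
        if child.tag == tag:
            found.append(child)
        found.extend(find_all(child, tag))
    return found


def text_under(node):
    """Concatenate the text of a node and of all its descendants.
    Direct text comes first, so interleaved ordering is not preserved."""
    return node.text + "".join(text_under(c) for c in node.children)


builder = TreeBuilder()
builder.feed('<html><body><div class="ent">'
             '<h2>Title <em>here</em></h2><p>Body</p></div></body></html>')
h2 = find_all(builder.root, "h2")[0]
print(text_under(h2), h2.parent.attrs.get("class"))
```

A real implementation would more likely rely on a browser engine, since the later steps need computed styles and rendered sizes that this sketch does not provide.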

Furthermore, on a Web document expanded into tree-structured DOM data, data search operations such as the following can be carried out easily using the technology generally known as JavaScript.

・Finding the higher-level or lower-level nodes of a given node by following the tree structure
・Finding nodes in the DOM that have a given tag name
・Finding nodes in the DOM that have a given attribute value
・Finding the attribute values of a given DOM node
・Finding the text data belonging to a given DOM node
・Finding the text data belonging to all nodes under a given DOM node
・For tags that normally have no text data of their own (e.g., the img tag), finding the value of the tag's alt attribute as a substitute for text data
・Finding whether a given DOM node is displayed or hidden
・When a given DOM node is displayed, finding its displayed size (height and width)
In this embodiment as well, these known techniques allow the tree structure data storage unit 12 to provide each of the operations listed above to the other units of the present invention.

The processing flows of the title node search unit 13 are shown in FIGS. 4 and 5.

The title node search unit 13 of this embodiment can operate according to either processing flow; the simpler flow of FIG. 4 is described first.

The title node search unit 13 starts when the summary document acquisition unit 11 passes it the title (title) and the access URL (link) as parameters (S101).

First, it accesses the tree structure data storage unit 12 and obtains the DOM data for the URL passed as link (S103). If the DOM data for that URL is not yet stored there, it instructs the tree structure data storage unit 12 to fetch the document over the network and store it.

The text delivered as the title in RSS is, in many cases, written in the HTML within one of the tags H1, H2, H3, H4, or H5 (hereinafter collectively called heading tags). The unit therefore first retrieves the heading tag nodes from the DOM and obtains the text contained under each retrieved node. For an img tag, the alt attribute value may be used in place of text. These operations can be carried out using the functions provided by the tree structure data storage unit 12.

Next, the text data obtained in this way is compared with the title passed as a parameter to find data with a high degree of similarity.

To calculate the similarity, each piece of text data is first normalized. Normalization here means processing such as the following, which reduces notational differences between the pieces of text data.

・Unify the text encoding (for example, convert everything to Unicode)
・Unify full-width and half-width forms of alphanumerics, symbols, katakana, and so on (for example, convert everything to half-width)
・Unify upper and lower case of alphabetic characters (for example, convert everything to lower case)
・Unify whitespace (for example, convert all full-width spaces, tabs, and line breaks to half-width spaces, replace runs of two or more half-width spaces with a single half-width space, and delete spaces at the beginning and end of the text)
After normalizing in this way, the title is compared with the text data of each heading tag, and the heading tag data that matches is obtained.
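The normalization steps above might be sketched in Python as follows. This is one possible realization, not the method prescribed by the description: it assumes the input has already been decoded to a Python `str` (so encoding is already unified), and it uses Unicode NFKC normalization, which folds full-width alphanumerics to half-width but folds half-width katakana to full-width rather than converting everything to half-width. The function name `normalize` is hypothetical.

```python
import re
import unicodedata


def normalize(text):
    """Reduce notational differences between two strings before comparison."""
    # NFKC maps full-width alphanumerics and the full-width space to their
    # half-width forms, and half-width katakana to full-width katakana,
    # so each character class ends up in a single form.
    text = unicodedata.normalize("NFKC", text)
    # Unify case.
    text = text.lower()
    # Collapse any run of whitespace (spaces, tabs, newlines) to one space.
    text = re.sub(r"\s+", " ", text)
    # Drop leading and trailing spaces.
    return text.strip()


print(normalize("\u3000Ｈｅｌｌｏ\t\n ＷＯＲＬＤ\u3000１２３ "))
```

Both the RSS title and each candidate node text would be passed through the same function before they are compared.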

Depending on the site, some characters may be added when the RSS is generated, part of the title may be removed when it exceeds a site-specific character limit, or some characters may be replaced with others. In such cases the texts are not identical even after normalization, so the exact-match condition may be relaxed and heading tag data obtained that meets one or more conditions such as the following:
・One text is contained in the other
・The Levenshtein distance between the two texts is smaller than a predetermined threshold
The Levenshtein distance is the number of character-level edit operations needed to convert one text into the other; the smaller the value, the more similar the two texts can be judged to be. It can be calculated by a method such as the one described at [http://www.merriampark.com/ld.htm].
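The relaxed matching conditions can be illustrated with a short sketch: a standard dynamic-programming Levenshtein distance plus a predicate that accepts exact matches, containment, or a small edit distance. The function names and the default threshold of 4 are assumptions chosen for illustration; the description leaves the threshold unspecified.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def is_similar(title, candidate, distance_limit=4):
    """Relaxed match: exact, containment, or edit distance below the limit."""
    if not title or not candidate:
        return False
    if title == candidate:
        return True
    if title in candidate or candidate in title:
        return True
    return levenshtein(title, candidate) < distance_limit


print(levenshtein("kitten", "sitting"))
print(is_similar("breaking news", "breaking news - example site"))
```

Both arguments are expected to be normalized already; the empty-string guard is there because an empty string is trivially contained in any text.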

Each heading tag node judged in this way to have "high similarity" is stored in a temporary array as a candidate for the title node to be identified (S105, S1051, S1052).

More than one heading tag node may be obtained; that is, the candidate array may hold several nodes. When multiple heading tag nodes are obtained, the most suitable node is selected based on the CSS of each, for example by the following method (S107).

・The node with the largest font size
・If several nodes share the largest font size, the one placed highest on the screen
The node selected in this way is regarded as the data representing the title of the body text and is sent as a parameter to the text node search unit 14 (S109).
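The selection rule in S107 amounts to a two-key ordering, which might be sketched as below. The candidate records, with a `font_size` and a `top` coordinate in pixels, are hypothetical; in practice those values would come from the computed CSS and layout held by the tree structure data storage unit 12.

```python
def pick_title_node(candidates):
    """Pick the candidate with the largest font size; break ties by the
    smallest top coordinate (the node placed highest on the screen)."""
    if not candidates:
        return None
    return min(candidates, key=lambda n: (-n["font_size"], n["top"]))


# Hypothetical candidates for illustration only.
candidates = [
    {"name": "H1 (site banner)", "font_size": 18.0, "top": 10},
    {"name": "H2 (article)", "font_size": 24.0, "top": 120},
    {"name": "H2 (related)", "font_size": 24.0, "top": 800},
]
best = pick_title_node(candidates)
print(best["name"])
```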

Next, FIG. 5, which shows a more preferable example than FIG. 4, is described.

In the flow of FIG. 4, it is possible that no heading tag data satisfies the similarity condition. This mainly happens when the HTML deliberately avoids heading tags and instead writes the title in, for example, a DIV tag or a TD tag inside a TABLE tag, with CSS set so that the user still perceives it as a title. To handle such cases, the normalized-text comparison and optimal-tag selection described above can be applied to all displayed tags rather than only to heading tags. Whether a tag is displayed can be determined using the functions of the tree structure data storage unit 12. This reduces the chance that nothing is selected, but the number of tags to process grows, which may cause processing-time problems. A more preferable implementation therefore processes the heading tags first and processes all displayed tags only if no match is found among the heading tags. If the title cannot be found even among all displayed tags, the HTML is judged, for whatever reason, not to correspond to the RSS, and processing is aborted with an error.

The flowchart of FIG. 5 embodies this. Steps S101 through S1052 are the same as in FIG. 4, and step S111 checks the length of the array temporarily holding the candidates. If the length is 1 or more, the same processing as in FIG. 4 follows; if the length is 0, processing proceeds to step S113 and onward. Steps S113, S1131, and S1132 perform the same processing as FIG. 4, but on all displayed tag nodes rather than only on heading tag nodes.

That is, steps S1131 and S1132 below are performed for every displayed tag in the DOM (S113). Step S1131 calculates the similarity between the text under the target tag and the parameter title. Step S1132 then adds the target tag to the "candidates" array if the similarity is greater than a predetermined value.

Next, step S115 checks the length of the array temporarily holding the candidates. If the length is 1 or more, the same processing as in FIG. 4 is performed (S107, S109). If the length is 0, processing is aborted with an error (S117).
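The two-stage flow of FIG. 5 can be summarized in a short sketch. It assumes each node is a dict with `tag`, `text`, and `visible` keys and that the similarity test is supplied as a function; these structures, the exception class, and the function name are hypothetical stand-ins for the DOM access provided by the tree structure data storage unit 12.

```python
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5"}


class TitleNotFound(Exception):
    """Raised when no displayed node resembles the RSS title (S117)."""


def find_title_candidates(nodes, title, is_similar):
    """Return candidate title nodes: heading tags first, then all displayed tags."""
    # S105: look only at heading tags.
    candidates = [n for n in nodes
                  if n["tag"] in HEADING_TAGS and is_similar(title, n["text"])]
    if candidates:                  # S111: length is 1 or more
        return candidates
    # S113: widen the search to every displayed tag.
    candidates = [n for n in nodes
                  if n["visible"] and is_similar(title, n["text"])]
    if candidates:                  # S115: length is 1 or more
        return candidates
    raise TitleNotFound(title)      # S117: abort with an error


nodes = [
    {"tag": "h1", "text": "example site", "visible": True},
    {"tag": "td", "text": "today's headline", "visible": True},
    {"tag": "div", "text": "today's headline", "visible": False},
]
exact = lambda a, b: a == b  # simplest possible similarity test
found = find_title_candidates(nodes, "today's headline", exact)
print([n["tag"] for n in found])
```

Running the cheap heading-only pass first keeps the common case fast, and the wider pass is paid for only when the site styles its title without heading tags.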

The text node search unit 14 receives the title position in the DOM found by the title node search unit 13 and extracts the body text that follows it. In the present invention, the title and the body text together are called the main text. To do this, the unit walks the tree-structured DOM step by step from the title node toward the BODY tag node, which is the root of the tree-structured data displayed as the Web document, and obtains the size of each node it visits from the tree structure data storage unit 12. In the DOM data, the title and the body text are normally fairly close to each other. That is, a node that combines them as the main text exists somewhere on the path from the title node to the BODY node, and because the body text is attached to the title at that node, the displayed size (especially the height) increases sharply there. Therefore, when a condition such as one of the following is satisfied, the walk can be stopped and the DOM node reached at that point identified as the node of the main text to be extracted.

Condition example 1: The height exceeds a predetermined value (for example, 300 pixels or more).
Condition example 2: The ratio of the node's height to the title's height is computed and exceeds a predetermined ratio (for example, 5 times the title's height or more).
Condition example 3: While the nodes are processed in order toward BODY, the ratio of the node's area to that of the immediately preceding node is computed and exceeds a predetermined ratio (for example, 5 times the area of the preceding node or more).
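The three example conditions can be written as simple predicates. The sketch below assumes that each node's rendered width and height in pixels are already available (in the device they would come from the tree structure data storage unit 12); the `Box` class and the function names are hypothetical, and the default values are the example values quoted in the conditions.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical rendered size of a DOM node, in pixels."""
    width: float
    height: float

    @property
    def area(self) -> float:
        return self.width * self.height


def exceeds_absolute_height(node: Box, min_height: float = 300.0) -> bool:
    """Condition example 1: the node is at least min_height pixels tall."""
    return node.height >= min_height


def exceeds_title_height_ratio(node: Box, title: Box, ratio: float = 5.0) -> bool:
    """Condition example 2: the node is at least `ratio` times as tall as the title."""
    return title.height > 0 and node.height / title.height >= ratio


def exceeds_area_ratio(node: Box, previous: Box, ratio: float = 5.0) -> bool:
    """Condition example 3: the node's area is at least `ratio` times that of
    the node visited immediately before it on the way toward BODY."""
    return previous.area > 0 and node.area / previous.area >= ratio
```

Which condition to use, and with what threshold, is a tuning decision; the text presents all three only as examples.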
The above processing is described below using the HTML sample shown in FIGS. 6, 7, and 8.

FIG. 7 shows how the HTML of FIG. 6 appears when displayed in an ordinary browser. A tag whose class is specified by the class attribute is written in "." (dot) notation. For example, the DOM data node specified by <DIV class="bttl"> is written as "DIV.bttl".

FIG. 8 shows the tree structure data obtained when the HTML is expanded into a DOM. The shaded "H2" node is the title identified by the title node search unit 13.

The text node search unit 14 follows the tree from this "H2" through "DIV.ettl", "DIV.ent", and "DIV.left" to "BODY", in that order. First, it moves from "H2" to "DIV.ettl" and compares the heights.

In this case the height hardly changes, so the predetermined condition is not satisfied; the node is judged not to be the main text, and processing moves on to the next node, "DIV.ent". The height of "DIV.ent" is much greater, so the predetermined condition is satisfied and "DIV.ent" is judged to be the main text.
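The walk from "H2" toward "BODY" can be reproduced with a small sketch. The pixel heights below are invented for illustration (the figures are not reproduced here), the stop rule is condition example 2 with a ratio of 5, and returning `None` when BODY is reached without a match is an assumption, since the text does not say what happens in that case.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical DOM node carrying only what the walk needs."""
    name: str
    height: float                    # rendered height in pixels (made-up values)
    parent: Optional[Node] = None


def find_main_text_node(title: Node, ratio: float = 5.0) -> Optional[Node]:
    """Climb from the title toward BODY and return the first ancestor whose
    height is at least `ratio` times the title's height."""
    node = title.parent
    while node is not None and node.name != "BODY":
        if node.height >= ratio * title.height:
            return node
        node = node.parent
    return None  # assumption: no main-text node found before reaching BODY


# The path named in the text: H2 -> DIV.ettl -> DIV.ent -> DIV.left -> BODY
body = Node("BODY", 2000)
left = Node("DIV.left", 1800, body)
ent = Node("DIV.ent", 600, left)
ettl = Node("DIV.ettl", 44, ent)
h2 = Node("H2", 40, ettl)

main = find_main_text_node(h2)
print(main.name if main else None)  # prints "DIV.ent"
```

With these heights, "DIV.ettl" (44 px) is barely taller than the title (40 px) and is skipped, while "DIV.ent" (600 px) is 15 times taller and stops the walk.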

With the embodiment described above, the main text, which combines the title and the body text of a Web document, can be extracted. In addition, because the title portion is identified during processing as described above, the portion of the extracted main text that excludes the title can also be identified and extracted as the body text.

As an effect of the present embodiment, examples of using the main text extracted as described above are given below.

For example, in a system for displaying Web documents, such as a Web browser, the HTML data is converted into a DOM together with the CSS data, and then every node that has no parent-child (ancestor-descendant) relationship with the DOM node of the main text extracted by the present embodiment is deleted or hidden. This yields a Web document display device that shows only the main text, without advertisements, navigation links, and the like. Moreover, because the CSS specifications of the original Web document remain intact, design properties such as font, size, and color are applied correctly, so the body text can be extracted together with its design. FIG. 9 shows a configuration diagram of the text search device 1 including an output unit 15 that has this function of deleting or hiding nodes that have no parent-child relationship with the main text before output. FIGS. 10 and 11 show an example in which the present embodiment was actually applied. In this example, the URL "http://www.ntt.co.jp/RD/OFIS/keyword/" and the title "キーワードでわかる先端技術" (advanced technology understood through keywords) are acquired via RSS, the technique of the present invention is applied, and the DOM nodes other than the main text of the article are hidden, producing a display such as that in FIG. 11.
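The hiding step of the output unit 15 can be sketched as follows. The `Node` class and `hide_unrelated` are hypothetical names, and the `hidden` flag is assumed to behave like CSS `display: none`, so marking a node hides its whole subtree.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Node:
    """Hypothetical DOM node with a parent link and a hidden flag."""
    name: str
    parent: Optional[Node] = field(default=None, repr=False)
    children: list[Node] = field(default_factory=list, repr=False)
    hidden: bool = False

    def add(self, name: str) -> Node:
        child = Node(name, parent=self)
        self.children.append(child)
        return child


def hide_unrelated(root: Node, main: Node) -> None:
    """Hide every node that is neither an ancestor nor a descendant of `main`."""
    ancestors = set()
    node = main.parent
    while node is not None:
        ancestors.add(id(node))
        node = node.parent

    def visit(current: Node) -> None:
        if current is main:
            return                      # keep the main text and its whole subtree
        if id(current) not in ancestors:
            current.hidden = True       # assumed to hide the subtree as well
            return
        for child in current.children:
            visit(child)                # ancestors stay visible; check siblings

    visit(root)
```

Ancestors of the main-text node are left untouched, so style rules inherited from them still apply to the remaining content.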

As another example, when the technique is used for a search engine, only the text data belonging to the main text identified by the present embodiment is targeted and stored in the search engine's database or the like, which makes it possible to avoid search results that the user did not intend.

(Second Embodiment)
In the second embodiment, the summary document acquisition unit 11 first acquires entries[i].summary in addition to the items described in the first embodiment. This summary usually contains the body text, or part of it (in particular, roughly the first few lines), as plain text or HTML source.

The tree structure data storage unit 12 and the title node search unit 13 are configured in the same way as in the first embodiment.

The text node search unit 14 is started with two parameters: the title position on the DOM extracted by the title node search unit 13, and the summary obtained from the summary document acquisition unit 11. As in the first embodiment, the second embodiment traverses the tree node by node from the title toward the BODY node. The condition for stopping, however, is not checked against the design as in the first embodiment but by comparison with the summary received as a parameter: when a node is judged to match the summary well, the traversal is stopped and the node being processed at that point is extracted as the node of the main text.

(Operation of Second Embodiment)
As a more specific example of implementation, if the summary is HTML source, it is first converted to plain text by removing the tags, and the normalization described in the first embodiment is then applied. If the summary is plain text, it is simply normalized. The normalized result is called text 1.
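Producing text 1 from the summary might look like the sketch below. Tag removal uses the standard-library `html.parser`. The `normalize` function is only a placeholder: the normalization of the first embodiment is not described in this passage, so NFKC folding plus removal of all whitespace is an assumption, as are all function names.

```python
import unicodedata
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects character data and ignores the tags themselves."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_tags(html_source: str) -> str:
    """Convert HTML source to plain text by dropping the tags."""
    collector = _TextCollector()
    collector.feed(html_source)
    collector.close()
    return "".join(collector.parts)


def normalize(text: str) -> str:
    """Placeholder normalization (assumed): fold full-width and half-width
    variants with NFKC, then remove all whitespace."""
    return "".join(unicodedata.normalize("NFKC", text).split())


def summary_to_text1(summary: str, is_html: bool) -> str:
    """Return text 1: the normalized plain-text form of the summary."""
    return normalize(strip_tags(summary) if is_html else summary)
```

Whatever normalization is chosen, the same function has to be applied to the node text as well, so that both sides of the comparison are in the same form.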

Meanwhile, for each node to be compared, the text data under that node is extracted. Text data under the node already extracted as the title is regarded as title text and excluded. The text data with the title excluded is likewise normalized and called text 2. Text 1 and text 2 are compared and, for example, if text 1 is contained in text 2, that node is identified as the main-text node and the traversal is stopped. Alternatively, allowing for the possibility that text 1 is not strictly contained, a substring may be cut from the beginning of text 2 with a length of about R times the number of characters in text 1 (R is fixed in advance at about 1.2), and text 1 may be judged "almost contained" when it is similar to that substring, for example when their Levenshtein distance is no more than a predetermined distance. The substring is cut from the beginning, with a length set by this rule, because a summary for RSS distribution is typically produced by taking a certain number of characters from the beginning of the body text and editing the text slightly where necessary. If the Levenshtein distance to the whole of text 2 were computed, the distance contributed by the text after the cut would be added; the cut prevents the Levenshtein distance from becoming unnecessarily large.
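The comparison of text 1 and text 2 can be sketched as below. The Levenshtein distance is the standard dynamic-programming form. The function name `matches_summary`, the default `max_distance` of 10, and the guard against an empty summary are assumptions; R = 1.2 is the example value from the text.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def matches_summary(text1: str, text2: str, r: float = 1.2,
                    max_distance: int = 10) -> bool:
    """Return True when text 1 (summary) is contained, or almost contained,
    in text 2 (node text with the title removed)."""
    if not text1:
        return False                       # assumed guard: nothing to compare
    if text1 in text2:
        return True                        # strict containment
    # Compare only against the head of text 2, about R times the length of
    # text 1, so the rest of a long body does not inflate the distance.
    prefix = text2[: int(len(text1) * r)]
    return levenshtein(text1, prefix) <= max_distance
```

In the traversal, this predicate would be evaluated for each node on the path from the title toward BODY, stopping at the first node for which it returns True.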

The second embodiment has the same effects as the first embodiment. As in the first embodiment, it may also be configured to include the output unit 15.

DESCRIPTION OF SYMBOLS
1 ... Text search device
11 ... Summary document acquisition unit
12 ... Tree structure data storage unit
13 ... Title node search unit
14 ... Text node search unit
15 ... Output unit

Claims

A text search device comprising: a summary document acquisition unit having a function of reading data, distributed in a predetermined format, that includes an access destination and a title of a document on a network, and extracting the access destination and the title of the document;
a tree structure data storage unit that accesses the access destination of the document extracted by the summary document acquisition unit, acquires document data and style data of the document, converts the acquired document data into tree structure data, and stores it;
a title node search unit having a function of measuring the similarity between the title extracted by the summary document acquisition unit and the text data in the nodes of the tree structure data stored in the tree structure data storage unit, and identifying the title on the tree structure data based on the measured similarity; and
a text node search unit having a function of identifying a node containing the main text on the tree structure data by determining, based on the style data, whether the style of each node satisfies a predetermined condition while tracing the nodes from the title identified on the tree structure data by the title node search unit toward the root of the tree structure.

A text search device comprising: a summary document acquisition unit having a function of reading data, distributed in a predetermined format, that includes an access destination and a title of a document on a network, and extracting the access destination, the title, and a summary of the document;
a tree structure data storage unit that accesses the access destination of the document extracted by the summary document acquisition unit, acquires document data and style data of the document, converts the acquired document data into tree structure data, and stores it;
a title node search unit having a function of measuring the similarity between the title extracted by the summary document acquisition unit and the text data in the nodes of the tree structure data stored in the tree structure data storage unit, and identifying the title on the tree structure data based on the measured similarity; and
a text node search unit having a function of identifying a node containing the main text on the tree structure data, while tracing the nodes from the title identified on the tree structure data by the title node search unit toward the root of the tree structure, based on the Levenshtein distance between the summary text acquired by the summary document acquisition unit and a text cut out, according to a predetermined rule, from the beginning of the text of the node being compared.

The text search device according to claim 1, further comprising an output unit having a function of displaying only the main text of the document by deleting or hiding nodes that have no parent-child relationship with the node identified on the tree structure data by the text node search unit.

A computer program for causing a computer to function as the text search device according to claim 1.