JP2012064132A

JP2012064132A - Blog body specification device and blog body specification method

Info

Publication number: JP2012064132A
Application number: JP2010209674A
Authority: JP
Inventors: Kenji Yoshida; 健児吉田
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2010-09-17
Filing date: 2010-09-17
Publication date: 2012-03-29
Anticipated expiration: 2030-09-17
Also published as: JP5068356B2

Abstract

PROBLEM TO BE SOLVED: To provide a blog body specification device and a blog body specification method that specify a blog body while suppressing the control load. SOLUTION: A first origin specification part (2) of a blog body specification device (1) analyzes a plurality of blog pages associated with the same user and generates, for each blog page, external path information representing the path within the source code of each tag included in that source code. From the external path information, it specifies as the origin path the deepest path that both uniquely identifies its tag on every page (common path) and includes the blog body (body-including path). Blog body extraction means of the blog body specification device (1) then extracts the body of the user's blog pages based on the origin path.

Description

The present invention relates to a blog text specifying device and a blog text specifying method for specifying the text position that is common to a plurality of blogs of the same user.

With the spread of the Internet, users can browse a large amount of data on the network, and search systems are continually being developed to extract, from this large amount of data, the data that meets a user's needs. Robot-type search systems are one known kind of search system. A robot-type search system automatically extracts important keywords from the large amount of data (Web pages) published on the Internet and indexes them, thereby providing the desired data to the user.

Document data such as blogs, however, contains a great deal of information besides the main text that is unsuitable as a source of keywords: calendar information, information written in headers and footers, information indicating past logs, advertising information, and so on. If keywords are extracted from such information and an index is generated from them, search accuracy may be reduced.

The noise removal system disclosed in Patent Document 1 is a known technique for extracting the main text from a blog (text data). When the appearance frequency of a character string across the text data exceeds a certain level, this noise removal system judges the character string to be a fixed character string such as a header or footer and removes it from each piece of text data, thereby removing noise other than the main text.

Patent Document 1: JP 2009-271796 A

In addition, blog sites support RSS (RDF Site Summary, Rich Site Summary, Really Simple Syndication) by default, so a search system can automatically collect a summary of the main text (a character string consisting of a prescribed number of sentences from the beginning of the text) from the description element of the RSS. By extracting from the blog a character string that matches the summary contained in the description element, the search system can also identify at least a part of the main text contained in the blog.

However, an enormous number of blogs exist on the network, and extracting the main text from them by the above methods places an enormous load on the search system. In the noise removal system of Patent Document 1, noise removal must be performed individually for every blog (document); for example, whenever a new blog is created, noise removal must be performed individually for that blog as well. The same applies to text identification using RSS: a character string search must be performed individually for every blog. Furthermore, no technique has been proposed for determining the extent of the main text from the identified part of it.

The present invention has been made in view of these problems, and its object is to provide a blog text specifying device and a blog text specifying method that can specify a blog text while suppressing the control load.

(1) A blog text specifying device comprising: a blog page DB that stores blog pages, each including a blog text created by a user and summary information of that blog text, in association with information identifying the user; external path information generation means that analyzes a plurality of blog pages associated with the same user and generates, for each of the plurality of blog pages, external path information indicating the paths, within the source code, of the tags included in the source code; external common path acquisition means that acquires, from the external path information generated by the external path information generation means, external common paths that are common to each of the plurality of blog pages of the same user; text inclusion external path acquisition means that acquires, for each of the plurality of blog pages of the same user, text inclusion external paths, that is, those paths in the external path information that include the blog text; and column origin specifying means that specifies, as a starting path including the blog text, the deepest path among the paths that are external common paths and are also text inclusion external paths in all of the plurality of Web pages.

According to the blog text specifying device of (1), the external common path acquisition means acquires, from the paths indicating the positions of the tags included in the source code of a plurality of blog pages created by the same user, the external common paths that are common to each of the blog pages, and the text inclusion external path acquisition means acquires, from the tag path information, the text inclusion external paths that include the blog text for each of the plurality of blog pages of the same user. The column origin specifying means then specifies, among the paths that are both external common paths and text inclusion external paths, the path closest to the blog text (that is, the deepest such path) as the starting path that includes the blog text and is common to the blog pages of that user.
This makes it possible to specify, for a plurality of blog pages of the same user, a starting path that is common to those blog pages and indicates the position of a tag including the blog text. Moreover, once the starting path has been specified, the position of the blog text can easily be found from it unless the tag structure of the user's blog pages changes significantly, so the position of the blog text can be specified while suppressing the control load.

(2) The blog text specifying device according to (1), further comprising: internal path information generation means that generates, for each of the plurality of blog pages of the same user, internal path information indicating the paths of the tags below the tag specified by the starting path that the column origin specifying means has specified; internal common path acquisition means that acquires, from the internal path information generated by the internal path information generation means, internal common paths that are common to each of the plurality of blog pages of the same user; text inclusion internal path acquisition means that acquires, for each of the plurality of blog pages of the same user, text inclusion internal paths, that is, those paths in the internal path information that include the blog text; and text position specifying means that specifies, as a text path including the blog text, the deepest path among the paths that are internal common paths and are also text inclusion internal paths in all of the plurality of Web pages.

According to the blog text specifying device of (2), internal common paths and text inclusion internal paths are additionally acquired for the tags below the tag specified by the starting path that the column origin specifying means has specified, and a text path including the blog text is specified. The position of the text can thus be specified in more detail, and because the position of the blog text can easily be found once the path has been specified, the position of the blog text can be specified while suppressing the control load.

(3) A blog text specifying device comprising: a blog page DB that stores blog pages, each including a blog text created by a user and summary information of that blog text, in association with information identifying the user; path information generation means that analyzes a plurality of the blog pages associated with the same user and generates, for each of the plurality of blog pages, path information indicating the paths, within the source code, of the tags included in the source code; common path acquisition means that acquires, from the path information generated by the path information generation means, common paths that are common to each of the plurality of blog pages of the same user; text inclusion path acquisition means that acquires, for each of the plurality of blog pages of the same user, text inclusion paths, that is, those paths in the path information that include the blog text; column origin specifying means that specifies, as a starting path, the deepest path among the paths that are common paths and are also text inclusion paths in all of the plurality of Web pages; and specification possibility determination means that determines whether the column origin specifying means was able to specify the starting path; wherein, on condition that the specification possibility determination means determines that the column origin specifying means was able to specify the starting path, the path information generation means further generates path information indicating the paths of the tags below the tag specified by that starting path, the common path acquisition means and the text inclusion path acquisition means acquire the common paths and the text inclusion paths from the path information of those lower tags, and the column origin specifying means specifies a new starting path based on those common paths and text inclusion paths; the device further comprising text position specifying means that, on condition that the specification possibility determination means determines that the column origin specifying means cannot specify a starting path, specifies the starting path information as the text inclusion path including the blog text.

According to the blog text specifying device of (3), new starting paths are specified from the path information of the lower tags until the column origin specifying means can no longer specify a new starting path. The position of the text can thus be specified in more detail, and because the position of the blog text can easily be found once the path has been specified, the position of the blog text can be specified while suppressing the control load.

(4) A blog text specifying method by which a computer, provided with a blog page DB that stores blog pages, each including a blog text created by a user and summary information of that blog text, in association with information identifying the user, specifies the text position of the blog pages of the same user, the method comprising the steps of: analyzing a plurality of blog pages associated with the same user and generating, for each of the plurality of blog pages, external path information indicating the paths, within the source code, of the tags included in the source code; acquiring, from the generated external path information, external common paths that are common to each of the plurality of blog pages of the same user; acquiring, for each of the plurality of blog pages of the same user, text inclusion external paths, that is, those paths in the generated external path information that include the blog text; and specifying, as a starting path including the blog text, the deepest path among the paths that are external common paths and are also text inclusion external paths in all of the plurality of Web pages.

The blog text specifying method of (4) provides the same effect as the blog text specifying device of (1).

According to the present invention, a blog text can be specified while suppressing the control load.

FIG. 1 is a diagram showing the functional configuration of the blog text specifying device of this embodiment.
FIG. 2 is a diagram showing the databases stored in the storage unit.
FIG. 3 is a diagram showing an example of the generation of external path information by the external path information generation means.
FIG. 4 is a diagram showing an example of the generated external path information.
FIG. 5 is a diagram showing an example of the generation of a starting path by the column origin specifying means.
FIG. 6 is a diagram showing an example of the generation of a starting path by the column origin specifying means.
FIG. 7 is a diagram showing an example of the generation of a starting path by the column origin specifying means.
FIG. 8 is a diagram showing an example of the generation of internal path information by the internal path information generation means.
FIG. 9 is a diagram showing an example of the generation of a text path by the text position specifying means.
FIG. 10 is a flowchart showing the flow of processing of the blog text specifying device.
FIG. 11 is a diagram showing the functional configuration of a blog text specifying device of another embodiment.
FIG. 12 is a diagram showing a blog page and its source code.
FIG. 13 is a diagram showing a blog page and its source code.

Embodiments of the present invention are described below with reference to the drawings.
First, an outline of the present invention is given with reference to FIGS. 12 and 13. FIG. 12 shows a Web page 100, which is a blog page of a certain user, and its source code 110. FIG. 13 shows a Web page 200, which is a blog page of the same user from another day, and its source code 210.

Referring to FIG. 12, the Web page 100 includes a blog title field 101, an advertisement field 102, a calendar field 103, and a text field 104. Referring to FIG. 13, the Web page 200 likewise includes a blog title field 201, an advertisement field 202, a calendar field 203, and a text field 204. The structure of the blog pages of the same user is thus basically common across a plurality of Web pages. The present invention focuses on this common structure to specify the position of the text of a blog page, and in particular specifies that position by the position information (path) of a tag included in the source code.

[First Embodiment]
Next, the blog text specifying device 1 of the first embodiment is described with reference to FIGS. 1 to 10.

[Configuration of Blog Text Specifying Device 1]
The blog text specifying device 1 of the present invention is described with reference to FIG. 1. The blog text specifying device 1 includes a first starting point specifying unit 2, a second starting point specifying unit 3, a storage unit 4, and blog text extraction means 5.

<Configuration of the first starting point specifying unit 2>
The first starting point specifying unit 2 includes external path information generation means 21, external common path acquisition means 22, text inclusion external path acquisition means 23, and column origin specifying means 24.

The external path information generation means 21 analyzes a plurality of blog pages associated with the same user and generates, for each of the blog pages, external path information indicating the paths, within the source code, of the tags included in the source code. To do so, the external path information generation means 21 reads the HTML data stored in the blog page DB 41 of the storage unit 4.
As shown in FIGS. 3(2) and 3(3), the external path information generation means 21 generates the external path information of the tags included in the source code as both hard paths and soft paths. A "hard path" is a tag path defined by the tag type, the tag attribute, and the occurrence count of the tag (tag[attribute][occurrence count]). A "soft path" is a tag path defined by the tag type and the tag attribute only (tag[attribute]).

Referring to FIG. 3, FIG. 3(1) is the source code of the Web page 100, FIG. 3(2) is the external path information that expresses the paths of the tags included in the source code as hard paths, and FIG. 3(3) is the external path information that expresses them as soft paths.

The external path information generation means 21 generates a path within the source code (a hard path and a soft path) for each tag included in the source code. Take the tag 11 on line 4 of the source code as an example: this tag 11 is the first "tr" tag to appear and has no attribute, so the external path information generation means 21 generates the external path 11A (hard path) "tr[][0]" and the external path 11B (soft path) "tr[]".
Paths may be specified in any suitable manner. In this embodiment, tags such as the "html" tag, which are used in many Web pages and whose position does not vary, are omitted.

Next, consider the path of the tag 12 on line 16. This tag 12 lies under the "div" tag on line 15, which lies under the "td" tag on line 13, which in turn lies under the "tr" tag on line 9. The "tr" tag on line 9 is the second "tr" tag to appear and has no attribute. The "td" tag on line 13 is the second "td" tag within the "tr" tag on line 9 and has no attribute. The "div" tag on line 15 is the second "div" tag within the "td" tag on line 13 and has no attribute. The tag 12 on line 16 is itself the first "center" tag within the "div" tag on line 15 and has no attribute.
The external path information generation means 21 therefore generates, as the path of the tag 12 on line 16, the external path 12A (hard path) "tr[][1]/td[][1]/div[][1]/center[][0]" and the external path 12B (soft path) "tr[]/td[]/div[]/center[]".
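The hard-path and soft-path generation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation disclosed in the patent: the names `build_paths`, `OMITTED_TAGS`, and `VOID_TAGS`, the sample HTML, the set of omitted tags (the text only names "html"), and the choice to count occurrences per tag-and-attribute pair among siblings are all assumptions made for the example.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumption: structural tags left out of the paths (the text only names "html").
OMITTED_TAGS = {"html", "head", "body", "table", "tbody"}
# Void tags have no end tag, so they are skipped to keep the stack balanced.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class _PathBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        # Each frame: (tag, hard path, soft path, counter of child segments).
        self.stack = [("", "", "", Counter())]
        self.paths = []  # (hard path, soft path) per tag, in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        _, parent_hard, parent_soft, siblings = self.stack[-1]
        if tag in OMITTED_TAGS:
            # Transparent frame: children are numbered as if this tag were absent.
            self.stack.append((tag, parent_hard, parent_soft, siblings))
            return
        attr_text = " ".join(f"{name}={value}" for name, value in attrs)
        segment = f"{tag}[{attr_text}]"
        index = siblings[segment]  # 0 for the first occurrence
        siblings[segment] += 1
        hard = f"{parent_hard}/{segment}[{index}]".lstrip("/")
        soft = f"{parent_soft}/{segment}".lstrip("/")
        self.paths.append((hard, soft))
        self.stack.append((tag, hard, soft, Counter()))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; the root sentinel is never removed.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def build_paths(html):
    """Return a list of (hard path, soft path) pairs for the tags in html."""
    builder = _PathBuilder()
    builder.feed(html)
    builder.close()
    return builder.paths


SAMPLE_HTML = """<html><body><table>
<tr><td>Blog title</td></tr>
<tr><td>Calendar</td><td><div>Ad</div><div><center>Body text</center></div></td></tr>
</table></body></html>"""

print(build_paths(SAMPLE_HTML)[-1])
# ('tr[][1]/td[][1]/div[][1]/center[][0]', 'tr[]/td[]/div[]/center[]')
```

With this sample, the last tag reproduces the hard path and soft path given above for the tag 12.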

FIG. 4 shows the external path information of the tags included in the source code of the Web page 100 (FIG. 12) and the Web page 200 (FIG. 13). FIG. 4(1) is the external path information (hard paths) of the Web page 100, FIG. 4(2) is the external path information (soft paths) of the Web page 100, FIG. 4(3) is the external path information (hard paths) of the Web page 200, and FIG. 4(4) is the external path information (soft paths) of the Web page 200. The external path information generated by the external path information generation means 21 is supplied to the external common path acquisition means 22 and the text inclusion external path acquisition means 23.

Returning to FIG. 1, the external common path acquisition means 22 acquires, from the external path information generated by the external path information generation means 21, the external common paths: paths that are common to each of the plurality of blog pages of the same user and that uniquely identify the corresponding tag.
The text inclusion external path acquisition means 23 acquires, for each of the plurality of blog pages of the same user, the text inclusion external paths, that is, those paths in the external path information generated by the external path information generation means 21 that include the blog text.

The acquisition of the external common paths and the text inclusion external paths is described with reference to FIGS. 5 to 7.
Because the present invention specifies the position of the blog text by a tag path, specifying the starting path requires finding an external path that is common to each of the plurality of blog pages of the same user and that uniquely identifies the corresponding tag.

The external common path acquisition means 22 therefore counts, for each of the plurality of blog pages, the number of appearances of each external path (hard path and soft path) included in the external path information. As shown in the count column of FIG. 5, it is preferable to count the appearances so that the digits used for the Web page 100 and those used for the Web page 200 do not overlap. In this embodiment, the appearance counts of the Web page 200 are multiplied by 100, so a value of 100 in the count column means one appearance. The external common path acquisition means 22 then adds the appearance count of each external path in the Web page 100 to its appearance count in the Web page 200. From the result of this addition, the external common path acquisition means 22 acquires the external common paths, that is, the external paths that are common to each of the plurality of blog pages of the same user and that uniquely identify the corresponding tag.

The count column of FIG. 5 shows the appearance counts of the external paths (hard paths) of the Web page 100 and of the Web page 200, and the count column of FIG. 6 shows the result of adding these hard-path counts. Likewise, the count column of FIG. 7(1) shows the appearance counts of the external paths (soft paths) of the Web page 100 and of the Web page 200, and the count column of FIG. 7(2) shows the result of adding these soft-path counts.

An addition result of "001" indicates an external path that appears once in the Web page 100 but not in the Web page 200, and an addition result of "100" indicates an external path that does not appear in the Web page 100 but appears once in the Web page 200. An addition result of "101" indicates an external path that appears once in each of the Web page 100 and the Web page 200, and an addition result of "303" indicates an external path that appears three times in each of them.

An external path with the addition result "303" is common to the Web page 100 and the Web page 200, but because it appears three times, the tag corresponding to it cannot be uniquely identified. In this embodiment, therefore, the external paths with the addition result "101" are taken as the external common paths, that is, the external paths that are common to each of the plurality of blog pages of the same user and that uniquely identify the corresponding tag. In FIGS. 6 and 7(2), the external common paths are indicated by dotted lines.
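The counting scheme can be sketched in a few lines. The function names and the sample soft-path lists are hypothetical and are not taken from the figures; the sketch also assumes that no path appears 100 or more times on a page, since the per-page digit fields would otherwise overlap.

```python
from collections import Counter

PAGE_WEIGHT = 100  # counts of page k are multiplied by 100**k, as in the text


def combined_counts(pages):
    """Add up per-page appearance counts, one pair of decimal digits per page."""
    total = Counter()
    for k, paths in enumerate(pages):
        for path, n in Counter(paths).items():
            total[path] += n * PAGE_WEIGHT ** k
    return total


def external_common_paths(pages):
    """Paths that appear exactly once on every page (sum 101 for two pages)."""
    once_everywhere = sum(PAGE_WEIGHT ** k for k in range(len(pages)))
    totals = combined_counts(pages)
    return {path for path, total in totals.items() if total == once_everywhere}


# Hypothetical soft paths for two pages of the same user.
page_100 = ["tr[]", "tr[]", "tr[]", "tr[]/td[class=main]", "tr[]/td[class=old]"]
page_200 = ["tr[]", "tr[]", "tr[]", "tr[]/td[class=main]", "tr[]/td[class=new]"]

totals = combined_counts([page_100, page_200])
common = external_common_paths([page_100, page_200])
```

Here `totals["tr[]"]` is 303 (common but ambiguous), the two page-specific paths sum to 1 and 100, and only `"tr[]/td[class=main]"` reaches 101 and is kept as an external common path.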

In addition, because the present invention specifies the position of the blog text by the position information (path) of a tag, the tag specified by the starting path must include the text.

The text inclusion external path acquisition means 23 therefore determines whether the tag specified by each external path (hard path and soft path) included in the external path information includes the blog text. This determination is made based on the RSS data and the HTML data stored in the blog page DB 41 (FIG. 2) of the storage unit 4. Specifically, the text inclusion external path acquisition means 23 identifies, in the HTML data, the tag that includes the RSS data (the summary of the text contained in the description element), and determines that an external path encompassing that tag is an external path corresponding to a tag that includes the blog text.
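As a rough illustration of this check, the summary can be read from the `description` element of each feed item and then searched for in the text of a tag. The sketch assumes an RSS 2.0 feed without XML namespaces; the feed content, the URL, and the function name `rss_summaries` are made up for the example.

```python
import xml.etree.ElementTree as ET

SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>Example blog</title>
  <item>
    <link>http://blog.example/entry/1</link>
    <description>Today I went hiking.</description>
  </item>
</channel></rss>"""


def rss_summaries(rss_xml):
    """Map each item's link to the summary held in its description element."""
    root = ET.fromstring(rss_xml)
    summaries = {}
    for item in root.iter("item"):
        link = item.findtext("link", default="").strip()
        summaries[link] = item.findtext("description", default="").strip()
    return summaries


summary = rss_summaries(SAMPLE_RSS)["http://blog.example/entry/1"]
tag_text = "Today I went hiking. The weather was perfect."
contains_body = summary in tag_text  # True when the tag holds the summary text
```

A plain substring test is the simplest possible match; a real page may need whitespace or markup normalization before comparing, which is outside the scope of this sketch.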

Referring to FIG. 3(1), in Web page 100 the blog text is contained in the "center" tag on line 16. An external path that contains the "center" tag on line 16 is therefore determined to be an external path corresponding to a tag that contains the blog text. Specifically, the external path (hard path) "tr[][1]/td[][1]/div[][1]" contains the "center" tag on line 16 and thus corresponds to a tag that contains the blog text. By contrast, the external path (hard path) "tr[][1]/td[][0]" does not contain the "center" tag on line 16 and thus corresponds to a tag that does not contain the blog text. In the present embodiment, an external path that contains the blog text is called a text inclusion external path. In FIGS. 6 and 7(2), text inclusion external paths are indicated by two-dot chain lines. In FIG. 7, some external paths are marked "△" in the text inclusion column; this means that, for an external path that appears multiple times, some occurrences contain the text and others do not.
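The text inclusion determination can be illustrated with a short Python sketch. It rests on several assumptions that are not stated in the description: the RSS summary is matched as a plain substring of the tag text, the attribute bracket of each path step is left empty, the index counts same-name sibling tags from zero, and the page is well-formed enough for `xml.etree.ElementTree` (real blog HTML would generally need a tolerant HTML parser). The function name and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

def body_inclusion_paths(html, summary):
    """Return the hard paths of all tags whose text contains `summary`."""
    root = ET.fromstring(html)
    found = []

    def walk(element, prefix):
        seen = {}  # zero-based counter per tag name among siblings
        for child in element:
            index = seen.get(child.tag, 0)
            seen[child.tag] = index + 1
            step = f"{child.tag}[][{index}]"
            path = f"{prefix}/{step}" if prefix else step
            # itertext() yields the text of the tag and of all its descendants.
            if summary in "".join(child.itertext()):
                found.append(path)
            walk(child, path)

    walk(root, "")
    return found

# Hypothetical page; the summary string stands in for the RSS description element.
sample = (
    "<table>"
    "<tr><td>Header</td></tr>"
    "<tr><td>Menu</td><td><div>Image</div>"
    "<div><center>Today I went hiking.</center></div></td></tr>"
    "</table>"
)
for path in body_inclusion_paths(sample, "went hiking"):
    print(path)
```

With this sample, "tr[][1]/td[][1]/div[][1]" is reported because it encloses the "center" tag, while "tr[][1]/td[][0]" is not.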

Returning to FIG. 1, once the external common path acquisition means 22 and the text inclusion external path acquisition means 23 have acquired the external common paths and the text inclusion external paths, the column origin specifying means 24 specifies, as the origin path, the deepest (that is, the lowest-level) external path among those that are external common paths and are also text inclusion external paths in all of the plurality of Web pages.
For example, referring to FIG. 6, the hard paths that are external common paths (dotted lines) and are also text inclusion external paths of both Web page 100 and Web page 200 (the two-dot chain line paths marked "○" for both Web page 100 and Web page 200) are "tr[][1]" and "tr[][1]/td[][1]". Of these, "tr[][1]/td[][1]" is deeper by the step "td[][1]". For the soft paths, on the other hand, as shown in FIG. 7, no external path is both an external common path and a text inclusion external path in all of the plurality of Web pages.

Accordingly, the column origin specifying means 24 specifies the external path (hard path) "tr[][1]/td[][1]" as the origin path. If the soft paths also include an external path that is both an external common path and a text inclusion external path in all of the plurality of Web pages (hereinafter, an "origin path candidate"), the deepest path among the hard-path and soft-path origin path candidates is specified as the origin path. If a hard-path candidate and a soft-path candidate have the same depth, the present embodiment specifies the soft-path candidate as the origin path, because a soft path has a more concise notation than a hard path and is better suited to managing the text positions of a large number of blog pages. The hard-path candidate may of course be specified as the origin path instead.
When the column origin specifying means 24 specifies the origin path, the origin path is stored in the text position DB 42 (FIG. 2(2)) of the storage unit 4. In FIG. 2, "H'" indicates a hard path and "S'" indicates a soft path.
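The selection rule, deepest candidate first with the soft path preferred on a tie, can be sketched in Python. The function name `choose_origin_path` is hypothetical, and depth is assumed here to be the number of "/"-separated steps in the path string.

```python
def choose_origin_path(hard_candidates, soft_candidates):
    """Pick the deepest candidate; at equal depth prefer the soft path."""
    def depth(path):
        return len(path.split("/"))

    # Mark each candidate with the H'/S' prefix used in the text position DB.
    tagged = [("H'", p) for p in hard_candidates]
    tagged += [("S'", p) for p in soft_candidates]
    if not tagged:
        return None  # no origin path can be specified
    # Compare by depth first; for equal depth, a soft path ranks higher.
    kind, path = max(tagged, key=lambda kp: (depth(kp[1]), kp[0] == "S'"))
    return kind + path

# Candidates from the example above: only hard paths qualify.
print(choose_origin_path(["tr[][1]", "tr[][1]/td[][1]"], []))
# H'tr[][1]/td[][1]
```

Swapping the second element of the sort key would give the alternative mentioned above, in which the hard-path candidate wins a tie.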

In this way, the first origin specifying unit 2 specifies "H'tr[][1]/td[][1]" as the origin path for locating the text of Web page 100 and Web page 200. Referring to the source code 110 of Web page 100 shown in FIG. 12, the tag specified by this origin path is the second "td" tag under the second "tr" tag (line 9), that is, the "td" tag on line 13. The origin path therefore establishes that the text of Web page 100 lies between the "td" tag on line 13 and the "/td" tag on line 19.

However, the span from the "td" tag on line 13 to the "/td" tag on line 19 contains not only the text (line 16) but also an image (line 14) and a footer (line 18). Images and footers (for example, the topics, comments (1), and trackback items in FIG. 12) are not suitable targets for blog keyword extraction, and generating an index from keywords extracted with such information included may reduce search accuracy.
In the present embodiment, therefore, for the tags below the tag specified by the origin path that the first origin specifying unit 2 specified, a path is specified once more that is common to each of the plurality of blog pages, uniquely identifies a tag, and contains the text.

<Second origin specifying unit 3>
Returning to FIG. 1, the second origin specifying unit 3 includes internal path information generation means 31, internal common path acquisition means 32, text inclusion internal path acquisition means 33, and text position specifying means 34.

The internal path information generation means 31 generates, for each of the plurality of blog pages of the same user, internal path information indicating the paths of the tags below the tag specified by the origin path that the column origin specifying means 24 specified. To do so, it reads the origin path stored in the text position DB 42 of the storage unit 4 and the HTML data stored in the blog page DB 41 of the storage unit 4.

Internal path information is generated in basically the same way as external path information, so a detailed description is omitted; the internal path information generation means 31 generates the internal path information (hard paths and soft paths) shown in FIGS. 8(1) to 8(4).

The internal common path acquisition means 32 acquires, from the internal path information generated by the internal path information generation means 31, the internal common paths, that is, internal paths that are common to each of the plurality of blog pages of the same user and that uniquely identify the corresponding tag.
Specifically, the internal common path acquisition means 32 acquires, as an internal common path, each internal path for which the result of adding its number of occurrences in Web page 100 and its number of occurrences in Web page 200 is "101".
The text inclusion internal path acquisition means 33 acquires, for each of the plurality of blog pages of the same user, the text inclusion internal paths, that is, the internal paths in the internal path information generated by the internal path information generation means 31 that contain the blog text.
FIG. 9 shows the result of adding the occurrence counts of each internal path and whether each internal path contains the blog text.

Once the internal common path acquisition means 32 and the text inclusion internal path acquisition means 33 have acquired the internal common paths and the text inclusion internal paths, the text position specifying means 34 specifies, as the text path, the deepest (that is, the lowest-level) internal path among those that are internal common paths and are also text inclusion internal paths in all of the plurality of Web pages.
For example, referring to FIG. 9(2), the soft path that is an internal common path (dotted line) and is also a text inclusion internal path of both Web page 100 and Web page 200 is "div[]/center[]". For the hard paths, on the other hand, as shown in FIG. 9(1), no internal path is both an internal common path and a text inclusion internal path in all of the plurality of Web pages.

Accordingly, the text position specifying means 34 specifies the internal path (soft path) "div[]/center[]" as the text path. When the text position specifying means 34 specifies the text path, the text path is stored in the text position DB 42 (FIG. 2(2)) of the storage unit 4.

<Blog text extraction means 5>
Returning to FIG. 1, the blog text extraction means 5 extracts the blog text from the user's blog pages on the basis of the tags specified by the origin path and the text path stored in the text position DB 42 of the storage unit 4, and stores the extracted text in the blog text DB 43 of the storage unit 4.
For example, referring to FIG. 12, the blog text extraction means 5 first narrows the position of the blog text, using the origin path "H'tr[][1]/td[][1]", to the span from the "td" tag on line 13 to the "/td" tag on line 19, and then narrows it further, using the text path "S'div[]/center[]", to line 16. As a result, the image on line 14 and the footer on line 18, which would be included if only the origin path were used, are excluded, and the blog text of the same user's blog pages can be extracted more accurately.
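The two-stage narrowing can be sketched in Python as follows. The sketch is illustrative only: the page markup is hypothetical rather than the source code of FIG. 12, the "H'"/"S'" markers are assumed to have been stripped from the stored paths, a hard-path step "tag[][n]" is assumed to select the n-th (zero-based) child with that tag name, and a soft-path step "tag[]" is assumed to select all children with that tag name.

```python
import xml.etree.ElementTree as ET

def resolve_hard_path(root, path):
    """Follow steps such as 'td[][1]' from `root` to a single tag."""
    node = root
    for step in path.split("/"):
        tag = step.split("[", 1)[0]
        index = int(step.rsplit("[", 1)[1].rstrip("]"))
        node = node.findall(tag)[index]  # direct children with this tag name
    return node

def resolve_soft_path(root, path):
    """Follow steps such as 'div[]'; the path must match exactly one tag."""
    nodes = [root]
    for step in path.split("/"):
        tag = step.split("[", 1)[0]
        nodes = [child for node in nodes for child in node.findall(tag)]
    if len(nodes) != 1:
        raise ValueError("soft path does not identify exactly one tag")
    return nodes[0]

def extract_body(html, origin_path, text_path):
    root = ET.fromstring(html)
    origin = resolve_hard_path(root, origin_path)  # first narrowing
    body = resolve_soft_path(origin, text_path)    # second narrowing
    return "".join(body.itertext()).strip()

# Hypothetical page: the second cell of the second row holds image, text, footer.
page = (
    "<table>"
    "<tr><td>Header</td></tr>"
    "<tr><td>Menu</td><td><img src='a.png'/>"
    "<div><center>Today I went hiking.</center></div>"
    "<div>Comments (1)</div></td></tr>"
    "</table>"
)
print(extract_body(page, "tr[][1]/td[][1]", "div[]/center[]"))
```

The origin path alone would return the footer text along with the body; the text path removes it.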

<Storage unit 4>
Returning to FIG. 1, the storage unit 4 includes a blog page DB 41, a text position DB 42, and a blog text DB 43.
As shown in FIG. 2(2), the blog page DB 41 stores a user ID, a URL, and Web page information in association with one another. The Web page information includes HTML data and RSS data. The external path information generation means 21 and the internal path information generation means 31 generate the external path information and the internal path information, respectively, from the HTML data. The text inclusion external path acquisition means 23 and the text inclusion internal path acquisition means 33 use the RSS data to determine whether the tag specified by an external path or an internal path contains the blog text.

As shown in FIG. 2(3), the text position DB 42 stores a user ID, a URL, and a blog text position in association with one another. The blog text position includes the origin path specified by the column origin specifying means 24 and the text path specified by the text position specifying means 34.
Although not illustrated, the blog text DB 43 stores the blog text extracted by the blog text extraction means 5 in association with the user ID and the URL. Keywords are extracted from the blog text stored in the blog text DB 43, and an index for the blog pages is generated.
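The associations held by the three databases might be modeled as plain records, as in the following Python sketch. The class names, field names, and sample values are hypothetical; the description specifies only which items are stored in association with one another, not a schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BlogPageRecord:
    """One entry of the blog page DB 41."""
    user_id: str
    url: str
    html: str  # HTML data, the source of the path information
    rss: str   # RSS data, used for the text inclusion determination

@dataclass
class TextPositionRecord:
    """One entry of the text position DB 42."""
    user_id: str
    url: str
    origin_path: str                 # e.g. "H'tr[][1]/td[][1]"
    text_path: Optional[str] = None  # e.g. "S'div[]/center[]"

@dataclass
class BlogTextRecord:
    """One entry of the blog text DB 43."""
    user_id: str
    url: str
    body: str

# Hypothetical user ID and URL.
position = TextPositionRecord(
    user_id="user-001",
    url="http://example.com/blog/",
    origin_path="H'tr[][1]/td[][1]",
    text_path="S'div[]/center[]",
)
print(position)
```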

[Hardware configuration of blog text specifying device 1]
The hardware of the blog text specifying device 1 described above can be implemented by a general-purpose computer. Such a computer includes, as appropriate, a central processing unit (CPU) as a control unit; memory (RAM, ROM), a hard disk (HDD), and an optical disk (CD, DVD, etc.) as a storage unit; various wired and wireless LAN devices as network communication devices; displays such as a liquid crystal display or a plasma display as display devices; and a keyboard and a pointing device (mouse, trackball, etc.) as input devices, all connected by a bus line. In such a computer, the CPU controls the blog text specifying device 1 as a whole and, by reading and executing various programs as appropriate, cooperates with the hardware described above to realize the various functions according to the present invention.

[Processing of blog text specifying device 1]
Next, the processing performed by the blog text specifying device 1 is described with reference to FIG. 10.

S1: First, the external path information generation means 21 analyzes the HTML data stored in the blog page DB 41 and generates, as hard paths and soft paths, external path information indicating the paths of the tags included in the source code.
S2: Next, the external common path acquisition means 22 acquires, from the generated external path information, the external common paths, that is, external paths that are common to each of the plurality of blog pages of the same user and that uniquely identify the corresponding tag.
S3: The text inclusion external path acquisition means 23 acquires, for each of the plurality of blog pages of the same user, the text inclusion external paths, that is, the external paths in the generated external path information that contain the blog text.
S4: The column origin specifying means 24 then specifies, as the origin path, the deepest external path among those that are external common paths acquired in S2 and are also text inclusion external paths acquired in S3. The specified origin path is stored in the text position DB 42.

S5: Next, the internal path information generation means 31 analyzes the HTML data stored in the blog page DB 41 and generates, as hard paths and soft paths, internal path information indicating the paths of the tags below the tag specified by the origin path specified in S4.
S6: The internal common path acquisition means 32 then acquires, from the generated internal path information, the internal common paths, that is, internal paths that are common to each of the plurality of blog pages of the same user and that uniquely identify the corresponding tag.
S7: The text inclusion internal path acquisition means 33 acquires, for each of the plurality of blog pages of the same user, the text inclusion internal paths, that is, the internal paths in the generated internal path information that contain the blog text.
S8: The text position specifying means 34 then specifies, as the text path, the deepest internal path among those that are internal common paths acquired in S6 and are also text inclusion internal paths acquired in S7. The specified text path is stored in the text position DB 42.

S9: Finally, the blog text extraction means 5 extracts the blog text from the user's blog pages on the basis of the tags specified by the origin path and the text path stored in the text position DB 42, and stores the extracted text in the blog text DB 43 of the storage unit 4.

The blog text specifying device 1 of the first embodiment has been described above. According to the blog text specifying device 1 of the first embodiment, a path that designates the tag containing the blog text can be generated for the plurality of blog pages of the same user. As a result, as long as the tag structure of that user's blog pages does not change significantly, the position of the blog text can be located easily; there is no need to identify the text of the body each time a new page is generated, and the position of the blog text can be specified while keeping the control load low.
In addition, in the blog text specifying device 1 of the first embodiment, the second origin specifying unit 3 further narrows the text position narrowed by the first origin specifying unit 2, so the blog text of the same user's blog pages can be extracted more accurately.

[Second Embodiment]
Next, a blog text specifying device 1A according to a second embodiment is described with reference to FIG. 11. Whereas the blog text specifying device 1 of the first embodiment narrows the position of the blog text in two stages, the blog text specifying device 1A of the second embodiment is characterized in that it narrows the position in a plurality of stages.
FIG. 11(1) shows the configuration of the blog text specifying device 1A, and FIG. 11(2) shows the configuration of the text position DB 42A. Components identical to those of the blog text specifying device 1 of the first embodiment are given the same reference signs, and their detailed description is omitted.

[Configuration of blog text specifying device 1A]
The blog text specifying device 1A includes an origin specifying unit 2A, a storage unit 4A, and the blog text extraction means 5.
The origin specifying unit 2A includes path information generation means 21A, common path acquisition means 22A, text inclusion path acquisition means 23A, column origin specifying means 24A, specifiability determination means 25A, and text position specifying means 26A. The storage unit 4A includes the blog page DB 41, a text position DB 42A, and the blog text DB 43.

The path information generation means 21A, the common path acquisition means 22A, the text inclusion path acquisition means 23A, and the column origin specifying means 24A have the same functions as the external path information generation means 21 (internal path information generation means 31), the external common path acquisition means 22 (internal common path acquisition means 32), the text inclusion external path acquisition means 23 (text inclusion internal path acquisition means 33), and the column origin specifying means 24 (text position specifying means 34), respectively.
That is, among the paths in the path information generated by analyzing the blog pages, these means specify as the origin path the deepest path that is common to each of the plurality of blog pages of the same user and uniquely identifies a tag (a common path) and that also contains the text (a text inclusion path).

The specifiability determination means 25A determines whether the column origin specifying means 24A was able to specify an origin path. If it was, the path information generation means 21A further generates path information indicating the paths of the tags below the tag specified by that origin path, the common path acquisition means 22A and the text inclusion path acquisition means 23A acquire the common paths and the text inclusion paths from the path information of those lower tags, and the column origin specifying means 24A specifies a new origin path on the basis of these common paths and text inclusion paths.
If, on the other hand, the column origin specifying means 24A was unable to specify an origin path, the text position specifying means 26A specifies the position of the blog text from the origin paths specified up to that point.

In other words, the blog text specifying device 1A of the second embodiment repeatedly specifies a new origin path until no lower origin path can be specified. Put differently, the processing of the path information generation means 21A through the column origin specifying means 24A is repeated until there is no longer any path that is common to each of the plurality of blog pages of the same user, uniquely identifies a tag (common path), and contains the text (text inclusion path).
Accordingly, as shown in FIG. 11(2), the text position DB 42A stores the repeatedly specified origin paths as the blog text position.
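The repetition performed in the second embodiment can be sketched as a simple loop in Python. Here `find_origin_path` is a hypothetical callback standing in for the processing of means 21A to 24A: it receives the origin paths found so far and returns the next, lower origin path, or `None` when none can be specified. The `max_rounds` bound is an assumption of this sketch and does not appear in the description.

```python
def refine_origin_paths(find_origin_path, max_rounds=20):
    """Collect origin paths until no lower origin path can be specified."""
    found = []
    for _ in range(max_rounds):  # safety bound (assumption of this sketch)
        path = find_origin_path(found)
        if path is None:
            break  # corresponds to the "cannot be specified" branch of means 25A
        found.append(path)
    return found

# Stub callback that replays the two paths of the first embodiment.
levels = ["H'tr[][1]/td[][1]", "S'div[]/center[]"]

def stub(found_so_far):
    depth = len(found_so_far)
    return levels[depth] if depth < len(levels) else None

print(refine_origin_paths(stub))
```

The returned list corresponds to the sequence of origin paths stored as the blog text position in the text position DB 42A.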

With the blog text specifying device 1A, new origin paths are specified from the path information of successively lower tags until no new origin path can be specified. The position of the text can thus be specified in finer detail, and, once specified, the position of the blog text can thereafter be located easily, so the position of the blog text can be specified while keeping the control load low.

The embodiments of the present invention have been described above, but the present invention is not limited to these embodiments. The effects described in the embodiments are merely a list of the most preferable effects arising from the present invention, and the effects of the present invention are not limited to those described in the embodiments.

In the embodiments above, the path information of the blog page source code (including external path information and internal path information) is generated first, and then the common paths that are shared by the plurality of pages and uniquely identify the corresponding tag (including external common paths and internal common paths) and the text inclusion paths that contain the blog text (including text inclusion external paths and text inclusion internal paths) are acquired. The order in which the common paths and the text inclusion paths are acquired from the path information can be designed freely.
Nor is it necessary to determine, for every path in the path information, whether it is a common path and whether it is a text inclusion path; the determination may be made for only some of the paths in the path information. For example, whether a path contains the text may be determined only for paths whose occurrence-count addition result is "101", or the occurrence-count addition result may be calculated only for paths that contain the text.
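One of the orderings mentioned above, checking text inclusion only for paths that already passed the uniqueness test, can be sketched in Python. The function name, the `contains_body` callback, and the sample data are hypothetical, and per-page counts are used in place of the addition-result encoding.

```python
from collections import Counter

def origin_candidates(pages, contains_body):
    """Filter by uniqueness first, then test text inclusion on the survivors.

    `pages` holds one path list per page; `contains_body(page_index, path)`
    is a caller-supplied predicate (hypothetical).
    """
    if not pages:
        return []
    counts = [Counter(paths) for paths in pages]
    unique = [p for p in counts[0] if all(c[p] == 1 for c in counts)]
    # The possibly expensive inclusion test runs only for unique common paths.
    return [p for p in unique
            if all(contains_body(i, p) for i in range(len(pages)))]

pages = [["a[]", "b[]", "b[]", "a[]/c[]"], ["a[]", "b[]", "b[]", "a[]/c[]"]]
body_paths = {"a[]", "a[]/c[]"}
checked = []

def contains_body(page_index, path):
    checked.append(path)
    return path in body_paths

print(origin_candidates(pages, contains_body))  # ['a[]', 'a[]/c[]']
```

In this run the predicate is never called for "b[]", which occurs twice per page and is discarded before the inclusion test.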

Further, because a hard path by its nature always appears exactly once, it is also possible to calculate the addition result only for soft paths and, for hard paths, to determine only whether the path is common to the plurality of blog pages.
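Under that simplification, the check for hard paths reduces to a set intersection across pages, as in the following Python sketch. The function name and sample paths are hypothetical.

```python
def common_hard_paths(pages):
    """Return the hard paths present in every page.

    Each hard path occurs at most once per page, so no counting is needed.
    """
    if not pages:
        return []
    common = set(pages[0])
    for paths in pages[1:]:
        common &= set(paths)
    return sorted(common)

page_100 = ["tr[][0]", "tr[][1]", "tr[][1]/td[][0]", "tr[][1]/td[][1]"]
page_200 = ["tr[][0]", "tr[][1]", "tr[][1]/td[][1]", "tr[][2]"]
print(common_hard_paths([page_100, page_200]))
# ['tr[][0]', 'tr[][1]', 'tr[][1]/td[][1]']
```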

In the present embodiment, tags whose position does not vary and that are used in many Web pages, such as the "html" tag, are omitted when paths are generated, but the kinds of tags to omit can be designed freely. For example, a blog page may keep the same basic structure while the font used changes from day to day. Tags that do not affect the structure of the blog page, such as the "font" tag, may therefore also be omitted when paths are generated.
Attributes that make up a path may likewise be omitted when paths are generated, and which attributes to omit can be designed freely. For example, some blog pages change the background of the blog text column from day to day. Attributes such as "bgcolor", an attribute of the "td" tag, may therefore be omitted when paths are generated.
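The omission of tags and attributes during path generation can be sketched in Python. Several points are assumptions of this sketch rather than statements of the description: the omission lists, the "name=value" form used inside the attribute bracket, the treatment of an omitted tag (its children are handled as children of its parent), and the restriction to soft paths. The markup is hypothetical and must be well-formed for `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

SKIP_TAGS = {"html", "body", "font"}  # hypothetical omission list
SKIP_ATTRS = {"bgcolor"}              # hypothetical omission list

def effective_children(element):
    """Yield children, replacing each omitted tag by its own children."""
    for child in element:
        if child.tag in SKIP_TAGS:
            yield from effective_children(child)
        else:
            yield child

def soft_paths(element, prefix=""):
    """Yield a 'tag[attrs]' soft path for every kept tag below `element`."""
    for child in effective_children(element):
        attrs = ",".join(f"{k}={v}" for k, v in sorted(child.attrib.items())
                         if k not in SKIP_ATTRS)
        step = f"{child.tag}[{attrs}]"
        path = f"{prefix}/{step}" if prefix else step
        yield path
        yield from soft_paths(child, path)

# Two hypothetical pages that differ only in font and background colour.
page_a = ("<html><body><table><tr><td bgcolor='red'>"
          "<font size='2'><div>Hi</div></font></td></tr></table></body></html>")
page_b = ("<html><body><table><tr><td bgcolor='blue'>"
          "<div>Hi</div></td></tr></table></body></html>")

print(list(soft_paths(ET.fromstring(page_a))))
print(list(soft_paths(ET.fromstring(page_b))))
```

Both pages yield the same path list, so day-to-day changes of font or background colour would not prevent a path from being treated as common.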

DESCRIPTION OF SYMBOLS
1 Blog text specifying device
2 First origin specifying unit
21 External path generation means
22 External common path acquisition means
23 Text inclusion external path acquisition means
24 Column origin specifying means
3 Second origin specifying unit
31 Internal path information generation means
32 Internal common path acquisition means
33 Text inclusion internal path acquisition means
34 Text position specifying means
4 Storage unit
41 Blog page DB
42 Text position DB
43 Blog text DB
5 Blog text extraction means

Claims

A blog text identification device comprising:
a blog page DB for storing a blog page, which includes a blog text created by a user, and summary information of the blog text in association with information identifying the user;
external path information generating means for analyzing a plurality of blog pages associated with the same user and generating, for each of the plurality of blog pages, external path information indicating the path, within the source code, of each tag included in the source code;
external common path acquisition means for acquiring, from the external path information generated by the external path information generating means, an external common path that is common to the plurality of blog pages of the same user;
text inclusion external path acquisition means for acquiring, for each of the plurality of blog pages of the same user, a text inclusion external path that includes the blog text, from the external path information generated by the external path information generating means; and
column origin specifying means for specifying, as an origin path including the blog text, the deepest path among the paths that are both an external common path and a text inclusion external path in all of the plurality of Web pages.
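
As a non-limiting illustration, the means recited in the claim above can be sketched in Python as follows. This is a sketch under stated assumptions, not the claimed implementation: paths are modeled as tuples of `tag[attr=value]` steps, "including the blog text" is modeled as the summary string being a substring of the text beneath a path, and the names `PathTextParser`, `external_paths` and `origin_path` as well as the sample pages are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = frozenset({"br", "hr", "img", "input", "link", "meta"})


class PathTextParser(HTMLParser):
    """Maps each tag path (a tuple of steps) to the text found beneath it."""

    def __init__(self):
        super().__init__()
        self.open_tags = []     # tag names of the currently open tags
        self.steps = []         # matching path steps, e.g. div[id=main]
        self.text_by_path = {}

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        ordered = sorted(attrs, key=lambda kv: kv[0])
        step = tag + "".join(f"[{k}={v or ''}]" for k, v in ordered)
        self.open_tags.append(tag)
        self.steps.append(step)
        self.text_by_path.setdefault(tuple(self.steps), "")

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close any unclosed descendants, then the tag itself.
            while self.open_tags.pop() != tag:
                self.steps.pop()
            self.steps.pop()

    def handle_data(self, data):
        # Text counts toward every open ancestor, not only the innermost tag.
        for depth in range(1, len(self.steps) + 1):
            self.text_by_path[tuple(self.steps[:depth])] += data


def external_paths(html):
    """External path information for one page: path -> text beneath it."""
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return parser.text_by_path


def origin_path(pages):
    """pages: (html, summary) pairs for one user. Returns a path or None."""
    per_page = [(external_paths(html), summary) for html, summary in pages]
    # External common paths: paths that exist in every page.
    common = set.intersection(*(set(paths) for paths, _ in per_page))
    # Keep the common paths that include the text on every page.
    candidates = [p for p in sorted(common)
                  if all(summary in paths[p] for paths, summary in per_page)]
    # The deepest surviving path is taken as the origin path.
    return max(candidates, key=len) if candidates else None


pages = [
    ('<html><body><div id="side">Links</div><div id="main"><h2>Title A</h2>'
     '<p>Today I went hiking.</p></div></body></html>', "went hiking"),
    ('<html><body><div id="side">Links</div><div id="main"><h2>Title B</h2>'
     '<div class="photo"><p>I stayed home and read.</p></div></div>'
     '</body></html>', "stayed home"),
]
print(origin_path(pages))  # ('html', 'body', 'div[id=main]')
```

In this example the paragraph holding the text sits at a different path on each page, so the deepest path that is both common and text-including is the enclosing `div[id=main]`.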

The blog text identification device according to claim 1, further comprising:
internal path information generating means for generating, for each of the plurality of blog pages of the same user, internal path information indicating the paths of tags below the tag specified by the origin path that the column origin specifying means has specified;
internal common path acquisition means for acquiring, from the internal path information generated by the internal path information generating means, an internal common path that is common to the plurality of blog pages of the same user;
text inclusion internal path acquisition means for acquiring, for each of the plurality of blog pages of the same user, a text inclusion internal path that includes the blog text, from the internal path information generated by the internal path information generating means; and
text position specifying means for specifying, as a text path including the blog text, the deepest path among the paths that are both an internal common path and a text inclusion internal path in all of the plurality of Web pages.

A blog text identification device comprising:
a blog page DB for storing a blog page, which includes a blog text created by a user, and summary information of the blog text in association with information identifying the user;
path information generating means for analyzing a plurality of the blog pages associated with the same user and generating, for each of the plurality of blog pages, path information indicating the path, within the source code, of each tag included in the source code;
common path acquisition means for acquiring, from the path information generated by the path information generating means, a common path that is common to the plurality of blog pages of the same user;
text inclusion path acquisition means for acquiring, for each of the plurality of blog pages of the same user, a text inclusion path that includes the blog text, from the path information generated by the path information generating means;
column origin specifying means for specifying, as an origin path, the deepest path among the paths that are both a common path and a text inclusion path in all of the plurality of Web pages; and
identifiability determining means for determining whether or not the column origin specifying means has specified the origin path,
wherein, on condition that the identifiability determining means determines that the column origin specifying means has specified the origin path, the path information generating means further generates path information indicating the paths of tags below the tag specified by the origin path, the common path acquisition means and the text inclusion path acquisition means acquire the common path and the text inclusion path from the path information of the lower tags, and the column origin specifying means specifies a new origin path based on that common path and that text inclusion path,
the device further comprising text position specifying means for specifying the origin path information as the text inclusion path including the blog text, on condition that the identifiability determining means determines that the column origin specifying means could not specify the origin path.
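
As a non-limiting illustration, the repeated narrowing recited in the claim above can be sketched in Python as follows. The sketch makes two assumptions that the claim does not state: each pass generates path information only one level below the current origin, and each page is already available as a tree of `Node` objects. The names `Node`, `full_text`, `find`, `specify_origin` and `text_path`, and the sample pages, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    step: str                  # tag name plus any retained attributes
    text: str = ""             # text directly inside this tag
    children: list = field(default_factory=list)


def full_text(node):
    """All text at or beneath node."""
    return node.text + "".join(full_text(c) for c in node.children)


def find(node, path):
    """Follow path (a tuple of steps) down from node; first match per step."""
    for step in path:
        node = next((c for c in node.children if c.step == step), None)
        if node is None:
            return None
    return node


def specify_origin(pages, origin):
    """One pass: extend origin by a step that is common to every page and
    whose subtree includes that page's summary. None if no such step exists."""
    per_page = [(find(root, origin), summary) for root, summary in pages]
    common = set.intersection(
        *({c.step for c in node.children} for node, _ in per_page))
    for step in sorted(common):
        if all(summary in full_text(find(node, (step,)))
               for node, summary in per_page):
            return origin + (step,)
    return None


def text_path(pages):
    """Repeat the pass until no new origin can be specified; the last origin
    is then taken as the path including the blog text."""
    origin = ()                # the empty path refers to the root tag itself
    while True:
        new_origin = specify_origin(pages, origin)
        if new_origin is None:
            return origin
        origin = new_origin


def page(title, entry_child):
    return Node("html", children=[Node("body", children=[
        Node("div[id=side]", "Links"),
        Node("div[id=main]", children=[
            Node("h2", title),
            Node("div[class=entry]", children=[entry_child]),
        ]),
    ])])


pages = [
    (page("Title A", Node("p", "Today I went hiking.")), "went hiking"),
    (page("Title B", Node("blockquote", "I stayed home.")), "stayed home"),
]
print(text_path(pages))  # ('body', 'div[id=main]', 'div[class=entry]')
```

The loop stops at `div[class=entry]` because the tags below it differ between the two pages, so no further common path can be specified.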

A blog text identification method by which a computer, including a blog page DB that stores a blog page including a blog text created by a user and summary information of the blog text in association with information identifying the user, identifies the text position in blog pages of the same user, the method including:
analyzing a plurality of blog pages associated with the same user and generating, for each of the plurality of blog pages, external path information indicating the path, within the source code, of each tag included in the source code;
acquiring, from the generated external path information, an external common path that is common to the plurality of blog pages of the same user;
acquiring, for each of the plurality of blog pages of the same user, a text inclusion external path that includes the blog text, from the generated external path information; and
specifying, as an origin path including the blog text, the deepest path among the paths that are both an external common path and a text inclusion external path in all of the plurality of Web pages.