JP2006236221A

JP2006236221A - Management server for web page retrieval

Info

Publication number: JP2006236221A
Application number: JP2005053134A
Authority: JP
Inventors: Kazuhiko Mori; 和彦森
Original assignee: Individual
Current assignee: Individual
Priority date: 2005-02-28
Filing date: 2005-02-28
Publication date: 2006-09-07

Abstract

PROBLEM TO BE SOLVED: To provide a device that retrieves web pages related to a given web page even when they are not associated with it by links or the like.

SOLUTION: An element information generation means 10 analyzes the constituent elements of the character information of every acquired web page and generates web page element information. The generated web page element information is recorded for every web page in a recording part 20. A receiving means 12 receives, from a terminal device 4, a retrieval request that specifies a target web page. An element information reading means 14 reads the element information of the target web page indicated by the retrieval request from the recording part 20. A retrieval means 18 retrieves from the recording part 20 other web pages whose element information is similar to the read element information of the target web page. A retrieval result transmitting means 16 transmits attribute information of the retrieved web pages, such as their URLs, to the requesting terminal device 4.

COPYRIGHT: (C)2006, JPO & NCIPI

Description

The present invention relates to a technique for finding web pages related to a target web page, such as one's own web page.

The relationships between web pages on the Internet (bulletin boards, blogs, homepages, and the like) can be determined by examining how links are placed between them. For example, a search engine (such as google (trademark)) can be used to find out which other web pages link to one's own web page.

Patent Document 1 discloses an apparatus that displays the reference relationships among articles on a bulletin board as a tree structure, so that their interrelationships can be grasped at a glance.

Patent Document 1: JP-A-9-106331

However, both the method of finding link sources with a search engine and the method of displaying the relationships among articles as in Patent Document 1 require that the web pages or articles be associated with each other by some technical means. Consequently, even if a web page with similar content exists, it is difficult to find unless it is associated by a link or the like.

An object of the present invention is to solve the above problem and to provide an apparatus and a method capable of retrieving web pages related to a given web page even when they are not associated with it by links or the like.

(1)(2) A management server device according to the present invention searches a network for other web pages that are highly relevant to a target web page, and comprises: character information acquisition means for accessing each web page and acquiring the character information constituting it; element information generation means for extracting elements contained in the acquired character information of each web page, generating web page element information, and recording it in a recording unit in association with web page attribute information that includes location-specifying information for the web page; receiving means for receiving, from a terminal device, a search request specifying a target web page; element information reading means for reading from the recording unit the web page element information of the target web page specified by the search request; search means for referring to the recording unit, searching for web pages whose web page element information is similar to the read element information of the target web page, and acquiring their web page attribute information; and search result transmitting means for transmitting the web page attribute information of the retrieved web pages to the terminal device.

Therefore, other web pages whose content is related to the target web page can be retrieved even if they are not associated with it by links or the like.

(3) In the management server device according to the present invention, the element information generation means may comprise: morpheme extraction means for performing morphological analysis on the character information of each web page and extracting morphemes as elements; score calculation means for calculating a score for each morpheme based on the frequency of occurrence of each extracted morpheme; and morpheme table recording means for recording the score of each morpheme of each web page in the recording unit as a morpheme table. The search means may comprise: search morpheme extraction means for extracting, based on the scores, a predetermined number of morphemes from the plurality of morphemes constituting the target web page element information and using them as search morphemes; and web page selection means for referring to the morpheme table, summing for each web page the scores given to the search morphemes, and selecting similar web pages based on the sums.

Therefore, the character information of web pages is broken down into morphemes and scored, so that related web pages can be retrieved accurately.

(4) In the management server device according to the present invention, the score calculation means may calculate the score of each morpheme extracted from a web page based on the ratio of the number of occurrences of that morpheme on the web page to the total number of morphemes on the web page.

Therefore, the score for each morpheme can be calculated taking into account not merely the number of occurrences but also the weight the morpheme carries within the page as a whole.

(5) In the management server device according to the present invention, the character information acquisition means and the element information generation means may operate at predetermined intervals.

Therefore, the processing load on the management server device can be reduced.

(6) In the management server device according to the present invention, the element information reading means and the search means may, before any search request is received, execute the web page search in advance with each web page as the target web page and record the results in the recording unit. On receiving a search request from a terminal device, the search result transmitting means then transmits to that terminal device the recorded web page attribute information of the web pages similar to the target web page.

Therefore, the related web page attribute information can be presented quickly in response to a search request.

(7) In the management server device according to the present invention, the search result transmitting means may group the retrieved web pages according to the morphemes that represent them and transmit the grouped results to the terminal device.

Therefore, the user can be shown which groups of web pages are related through which morphemes.

(8) The management server device according to the present invention may further comprise: second element information reading means for treating a web page retrieved by the search means as a second target web page and reading the web page element information of the second target web page from the recording unit; and second search means for referring to the recording unit and searching for web pages whose web page element information is similar to the read element information of the second target web page. The search result transmitting means may then transmit to the terminal device the web page attribute information of the web pages found by both the search means and the second search means.

Therefore, the user can additionally learn which web pages are related to the related web pages.

(9) In the management server device according to the present invention, the element information generation means may generate and record the web page element information of each web page separately for each predetermined period, and the search means may search for related web pages with the period specified, based on the target web page element information for each predetermined period.

Therefore, related web pages can be retrieved for a specified period in the chronological evolution of a web page.

(10) In the management server device according to the present invention, the web page is a blog.

Applying the present invention, which can find related pages, to blogs, which are generally updated frequently, makes it all the more useful.

(11) A web page search method according to the present invention is a method in which a computer searches a network for other web pages that are highly relevant to a target web page, wherein the computer: accesses each web page and acquires the character information constituting it; extracts elements contained in the acquired character information of each web page, generates web page element information, and records it in a recording unit in association with web page attribute information that includes location-specifying information for the web page; receives, from a terminal device, a search request specifying a target web page; reads from the recording unit the web page element information of the target web page specified by the search request; refers to the recording unit, searches for web pages whose web page element information is similar to the read element information of the target web page, and acquires their web page attribute information; and transmits the web page attribute information of the retrieved web pages to the terminal device.

In the present invention, a “web page” means content that can be browsed on the Internet; the concept includes blogs, homepages, and the like. It covers not only content consisting of a single page but also content made up of a plurality of pages.

In the embodiment, the “character information acquisition means” corresponds to step S1 in FIG. 3.

In the embodiment, the “element information generation means” corresponds to steps S2 to S5 in FIG. 3.

In the embodiment, the “receiving means” corresponds to step S11 in FIG. 7.

In the embodiment, the “element information reading means” corresponds to step S12.

In the embodiment, the “search means” corresponds to steps S14 and S15.

In the embodiment, the “search result transmitting means” corresponds to step S16.

A “program” is a concept that includes not only programs directly executable by a CPU but also programs in source form, compressed programs, encrypted programs, and programs that become operable once installed from a hard disk or the like.

BEST MODE FOR CARRYING OUT THE INVENTION

FIG. 1 shows the configuration of a related web page search system according to an embodiment of the present invention. Web servers W1, W2, ..., Wn are connected to the Internet 2, as is a terminal device 4. Only one terminal device 4 is shown in the figure, but in practice many terminal devices are connected. The terminal device 4 is typically a PC used by a user.

A management server device 6 is also connected to the Internet 2. The management server device 6 comprises character information acquisition means 8, element information generation means 10, receiving means 12, element information reading means 14, search result transmitting means 16, search means 18, and a recording unit 20. The character information acquisition means 8 visits the web pages on the web servers W1 to Wn in turn via the Internet 2 and acquires their character information. A web server may hold a single web page or two or more web pages. In this embodiment, pages created by the same person are treated as a single web page even if they span multiple pages.

The element information generation means 10 analyzes the constituent elements of the character information of each acquired web page and generates web page element information. The generated web page element information is recorded in the recording unit 20 for each web page, together with attributes of the web page such as its URL and its administrator (that is, its creator).

The receiving means 12 receives, from the terminal device 4, a search request specifying a target web page. The element information reading means 14 reads the element information of the target web page indicated by the search request from the recording unit 20. The search means 18 searches the recording unit 20 for other web pages whose element information is similar to the read element information of the target web page. The search result transmitting means 16 transmits attribute information of the retrieved web pages, such as their URLs, to the terminal device 4 that made the request.

In this way, a user of the terminal device 4 can learn which web pages are related to a particular web page. For example, by designating his or her own web page as that particular page, the user can find other people's web pages that are similar to it.

FIG. 2 shows the hardware configuration of the management server device shown in FIG. 1. A display 30, a communication circuit 34, a keyboard/mouse 36, a memory 38, a hard disk 40, and a CD-ROM drive 42 are connected to a CPU 32. The communication circuit 34 is a circuit for connecting to the Internet. The hard disk 40 stores an operating system (OS) 46 such as WINDOWS (trademark), a management server program 48, a morpheme table 50, a circulation table 52, and so on. The management server program 48 performs its functions in cooperation with the OS 46. These programs were recorded on a recording medium such as a CD-ROM 44 and installed through the CD-ROM drive 42. They may instead be downloaded via the Internet.

FIGS. 3 and 7 show flowcharts of the management server program recorded on the hard disk 40. FIG. 3 shows the process of creating the morpheme table, and FIG. 7 shows the process of searching for web pages.

The following description assumes that the party operating the web server on which many users create their blogs also operates the management server device 6. The operator of the management server device 6 can therefore easily know which member users are creating which blogs.

The creation of the morpheme table shown in FIG. 3 is executed at predetermined intervals (for example, every day at 24:00). In step S1, the CPU 32 connects to the Internet through the communication circuit 34 and collects the content (character information) of the web pages. It connects to each web page by referring to the circulation table 52, which lists the top-page URL of each web page to be visited.

The CPU 32 first obtains the URL of the first web page listed in the circulation table 52 and connects to it. The CPU 32 acquires the character information of the top page it has connected to and records it in the memory 38. It likewise acquires and records pages created by the same creator as the top page. In this embodiment, a page is regarded as created by the same person if it is linked from the top page and its URL lies below the URL of the top page. For example, if the top-page URL listed in the circulation table is "http://www.furutani.co.jp/matsusihta" and the URL of a linked page is "http://www.furutani.co.jp/matsusihta/today.html", that page is judged to have been created by the same person.
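The URL half of this same-creator test can be sketched as follows. This is only an illustration under stated assumptions: the function name `is_same_author` is hypothetical, the comparison on a directory boundary is an assumption about what "below the top-page URL" means, and the requirement that the page be linked from the top page is not checked here.

```python
from urllib.parse import urlsplit


def is_same_author(top_url: str, linked_url: str) -> bool:
    """Return True if linked_url lies below top_url on the same host."""
    top, linked = urlsplit(top_url), urlsplit(linked_url)
    if top.netloc.lower() != linked.netloc.lower():
        return False
    # Compare on a directory boundary so that a sibling such as
    # ".../matsusihta2/" is not mistaken for a page below ".../matsusihta".
    base = top.path.rstrip("/") + "/"
    return linked.path.startswith(base)


top = "http://www.furutani.co.jp/matsusihta"
print(is_same_author(top, "http://www.furutani.co.jp/matsusihta/today.html"))  # True
print(is_same_author(top, "http://www.furutani.co.jp/other/today.html"))       # False
```

A crawler built along these lines would apply the check only to URLs actually found as links on the top page.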

Having acquired the character information of a web page in this way, the CPU 32 performs morphological analysis on it with reference to a morpheme dictionary (not shown) (step S2), and at the same time counts how many times each morpheme appears on the web page. In this embodiment, only nouns are extracted from the result of the morphological analysis.
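As a rough illustration of step S2, the sketch below segments text by greedy longest match against a miniature dictionary and counts only the nouns. The dictionary contents, the function names `analyze` and `count_nouns`, and the longest-match strategy are all assumptions made for illustration; the embodiment does not specify the analysis algorithm, and a practical system would presumably rely on a full morphological analyzer.

```python
from collections import Counter

# Hypothetical miniature morpheme dictionary: surface form -> part of speech.
DICTIONARY = {
    "特許": "noun", "弁理士": "noun", "発明": "noun",
    "の": "particle", "を": "particle", "は": "particle", "が": "particle",
    "する": "verb",
}


def analyze(text, dictionary=DICTIONARY):
    """Segment text by greedy longest match; characters with no entry are skipped."""
    max_len = max(len(surface) for surface in dictionary)
    i, tokens = 0, []
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            surface = text[i:i + length]
            if surface in dictionary:
                tokens.append((surface, dictionary[surface]))
                i += length
                break
        else:
            i += 1  # no dictionary entry starts at this position
    return tokens


def count_nouns(text):
    """Count occurrences of each noun, discarding all other parts of speech."""
    return Counter(surface for surface, pos in analyze(text) if pos == "noun")


counts = count_nouns("弁理士は発明の特許を出願する。特許は発明を守る。")
print(counts["特許"], counts["発明"], counts["弁理士"])  # 2 2 1
```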

For example, given a web page such as that shown in FIG. 5, the number of occurrences is counted for each morpheme, such as "patent", "intellectual property", and "patent attorney". The CPU 32 calculates a score for each morpheme based on the counted number of occurrences. In this embodiment, the score is calculated as follows.

Score = (number of occurrences of the morpheme / total number of morphemes) × 100

In other words, the score expresses the proportion of all morpheme occurrences on the web page that the morpheme accounts for. A higher score indicates that the morpheme is more important on that web page. The CPU 32 records the scores calculated in this way in the morpheme table, as shown in FIG. 6 (step S3).
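The score formula can be written out as a short sketch. The function name `morpheme_scores` is hypothetical, and the sketch assumes that the "total number of morphemes" is the length of the list of extracted morphemes passed in; the text does not say whether morphemes discarded during noun extraction count toward the total.

```python
from collections import Counter


def morpheme_scores(morphemes):
    """Score = occurrences of the morpheme / total morpheme occurrences * 100."""
    total = len(morphemes)
    if total == 0:
        return {}  # an empty page has no scores
    counts = Counter(morphemes)
    return {m: c / total * 100 for m, c in counts.items()}


scores = morpheme_scores(["patent", "patent", "invention", "attorney"])
print(scores["patent"])     # 50.0
print(scores["invention"])  # 25.0
```

Because the score is a percentage of the page's own morphemes, a long page and a short page that devote the same share of their text to a topic receive the same score for it.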

In this embodiment, the morpheme table is created once a day (at 24:00), so its contents are overwritten each time.

Next, the CPU 32 determines whether the web page just processed is the last entry in the circulation table (that is, whether every web page in the table has been visited) (step S4). If it is not, the CPU 32 takes the next web page in the table as the current web page (step S5) and repeats the processing from step S1.

When the last web page has been processed in this way, the morpheme table creation process ends. FIG. 6 shows an example of the resulting morpheme table. In this embodiment, the table records not only the score of each morpheme but also attributes of the web page such as its title, URL, and administrator (creator). For example, when the administrator of a website that hosts blogs creates a morpheme table for the web pages of its member blog authors, the attributes can be obtained from the member information.
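The loop of steps S1 to S5 might be sketched as below. Everything here is illustrative: `build_morpheme_table` is a hypothetical name, the fetching and noun-extraction steps are passed in as functions so that the sketch runs without network access, and the assumption that each circulation table entry carries a title and an author alongside the URL is made only to show where the attributes would be stored.

```python
from collections import Counter


def build_morpheme_table(circulation_table, fetch_text, extract_nouns):
    """Visit every entry of the circulation table and record scores and attributes."""
    table = {}
    for entry in circulation_table:          # S4/S5: loop until the last entry
        text = fetch_text(entry["url"])      # S1: acquire the character information
        nouns = extract_nouns(text)          # S2: morphological analysis, nouns only
        total = len(nouns)
        counts = Counter(nouns)
        scores = {m: c / total * 100 for m, c in counts.items()} if total else {}
        table[entry["url"]] = {              # S3: record scores with page attributes
            "title": entry["title"],
            "author": entry["author"],
            "scores": scores,
        }
    return table


# Stand-ins for the crawler and the analyzer (hypothetical data).
pages = {
    "http://example.com/a": "patent patent invention",
    "http://example.com/b": "",
}
circulation_table = [
    {"url": "http://example.com/a", "title": "A", "author": "alice"},
    {"url": "http://example.com/b", "title": "B", "author": "bob"},
]
table = build_morpheme_table(circulation_table, pages.get, str.split)
print(sorted(table["http://example.com/a"]["scores"]))  # ['invention', 'patent']
```

Rebuilding the whole dictionary on each run mirrors the daily overwrite described for this embodiment.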

Next, the processing performed when a search request arrives from a user's terminal device 4 is described with reference to FIG. 7.

A member user first connects to the management server device 6 from the browser program of the terminal device 4. The user then receives a search request screen such as that shown in FIG. 8 and sends a search request to the management server device 6. Here it is assumed that the user has entered the URL of his or her own blog in the input box BOX1 and sent it.

The CPU 32 of the management server device 6 receives this search request through the communication circuit 34 (step S11). The CPU 32 identifies the target web page from the URL contained in the search request and obtains the morphemes of that web page from the morpheme table (step S12). Here the CPU 32 obtains only morphemes whose score exceeds 0 (that is, morphemes that appear at least once). For example, if the target web page is "Akanebo" in FIG. 6, the morpheme "invention" is extracted, whereas morphemes such as "patent" and "patent attorney" are not.

Next, from the morphemes obtained, the CPU 32 extracts a predetermined number (five in this embodiment) in descending order of score (step S13). The CPU 32 then extracts from the morpheme table the web pages that contain any of the five extracted morphemes (that is, pages for which the score of that morpheme exceeds 0) (step S14). For example, if the five extracted morphemes are "patent", "intellectual property", "patent attorney", "invention", and "development", web pages such as "Intellectual Property Diary", "Akanebo", and "Orange Pink" are selected.

For each of the five morphemes, the CPU 32 extracts from the selected web pages a predetermined number (three in this embodiment) with the highest scores. That is, it extracts the three web pages with the highest scores for the morpheme "patent", the three with the highest scores for the morpheme "intellectual property", and so on, for a total of 15 web pages, and stores them in the memory 38 (step S15).
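Steps S12 to S15 can be sketched against an in-memory morpheme table. The table layout, the function name `search_related`, and the sample URLs are hypothetical, and excluding the target page from its own results is an assumption (the text describes retrieving "other" web pages but does not state this step explicitly).

```python
def search_related(table, target_url, n_morphemes=5, n_pages=3):
    """Return, for each key morpheme of the target, the top-scoring other pages."""
    target_scores = table[target_url]["scores"]
    # S12: keep only morphemes that occur at least once on the target page.
    present = {m: s for m, s in target_scores.items() if s > 0}
    # S13: take the highest-scoring morphemes as search morphemes.
    search_morphemes = sorted(present, key=present.get, reverse=True)[:n_morphemes]
    results = {}
    for morpheme in search_morphemes:
        # S14: other pages on which this morpheme has a score above 0.
        candidates = [
            (url, page["scores"].get(morpheme, 0.0))
            for url, page in table.items()
            if url != target_url and page["scores"].get(morpheme, 0.0) > 0
        ]
        # S15: keep the pages with the highest score for this morpheme.
        candidates.sort(key=lambda pair: pair[1], reverse=True)
        results[morpheme] = [url for url, _ in candidates[:n_pages]]
    return results


table = {
    "http://example.com/target": {"scores": {"patent": 10.0, "invention": 5.0, "blog": 0.0}},
    "http://example.com/a": {"scores": {"patent": 8.0, "invention": 0.0}},
    "http://example.com/b": {"scores": {"patent": 2.0, "invention": 7.0}},
    "http://example.com/c": {"scores": {"cooking": 9.0}},
}
related = search_related(table, "http://example.com/target", n_morphemes=2, n_pages=1)
print(related)
# {'patent': ['http://example.com/a'], 'invention': ['http://example.com/b']}
```

With the embodiment's values of five morphemes and three pages each, the result holds at most 15 entries, and the same page may appear under more than one morpheme.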

The CPU 32 renders the web pages extracted for each morpheme as an image such as that shown in FIG. 9 and transmits it through the communication circuit 34 to the terminal device 4 that made the request (step S16). In FIG. 9, the name of the target web page is displayed in the center with the five morphemes arranged around it, and the names of the extracted web pages are shown for each morpheme. Each displayed web page name is a link to that web page; the CPU 32 can create these links by obtaining the URLs from the morpheme table.

The user of the terminal device 4 can thereby learn of other people's blogs that are related to the key morphemes of his or her own blog. Because the morpheme table is updated daily, the blogs returned are those that are similar at that point in time.

In the above embodiment, the titles of the web pages are displayed in FIG. 9, but the creators of the blogs may be displayed instead.

The above embodiment describes the case where the party operating the web server on which many users create their blogs also operates the management server device 6. However, the system can also be operated to cover web servers run by others.

In the above embodiment, the search is performed in response to a request from the user, and the related web pages are then transmitted. Alternatively, the search may be performed in advance and its results recorded, so that the recorded results are transmitted when a request arrives.
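One way to picture this precomputation variant is a small cache that runs the search for every known page ahead of time and answers requests from the stored results. The class name `PrecomputedSearch` and its methods are hypothetical, and the search itself is passed in as a function so that the sketch stays independent of any particular search implementation.

```python
class PrecomputedSearch:
    """Run the search for every page in advance; serve requests from the results."""

    def __init__(self, search):
        self._search = search   # function mapping a target URL to related pages
        self._results = {}

    def refresh(self, urls):
        """Recompute results for all pages, e.g. after the morpheme table is rebuilt."""
        self._results = {url: self._search(url) for url in urls}

    def lookup(self, url):
        """Answer a search request without running the search again."""
        return self._results.get(url, [])


calls = []


def slow_search(url):
    # Stand-in for the real search; records how often it is invoked.
    calls.append(url)
    return [url + "/related"]


cache = PrecomputedSearch(slow_search)
cache.refresh(["http://example.com/a", "http://example.com/b"])
print(cache.lookup("http://example.com/a"))  # ['http://example.com/a/related']
print(len(calls))                            # 2
```

The trade-off is the usual one: request latency drops, while results are only as fresh as the last refresh.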

In the above embodiment, the target web page is specified by its URL on the screen of FIG. 8. However, it may instead be specified by the name of the web page, the e-mail address of its creator, or the like.

In the above embodiment, the management server device 6 is implemented on a single computer, but it may be implemented on a plurality of computers.

In the above embodiment, the morpheme table is updated every day and the old morpheme table is discarded (overwritten). However, the morpheme table for the first day of each month may be kept on the hard disk 40 rather than discarded. This makes it possible to search past web pages as well as current ones. In this case, the search results are displayed with a note identifying the period alongside the name of the web page, as shown for example in FIG. 10. For example, the notation "2002.5" for "Patent Office Diary" means that the web page, judged comprehensively on its content from its start until May 2002, is similar to the target web page.

Furthermore, in the case of a diary or the like where the period of creation can be clearly determined from the URL, a morpheme table may be generated for each period, as shown in FIG. 11. In the figure, the directory http://www.furutani.co.jp/2003 and everything below it hold the diary entries written in 2003. If the morpheme table is built in this way, each period of each web page can be searched for related pages.
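A minimal sketch of deriving the period from the URL, assuming the year appears as a four-digit first path segment as in the example directory. The function names `period_of` and `group_by_period` and the file names under each year are hypothetical.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit


def period_of(url):
    """Return the four-digit year in the first path segment, or None if absent."""
    first_segment = urlsplit(url).path.strip("/").split("/")[0]
    return int(first_segment) if re.fullmatch(r"\d{4}", first_segment) else None


def group_by_period(urls):
    """Group page URLs by period so that one morpheme table can be built per period."""
    groups = defaultdict(list)
    for url in urls:
        groups[period_of(url)].append(url)
    return dict(groups)


urls = [
    "http://www.furutani.co.jp/2003/0412.html",  # hypothetical entry names
    "http://www.furutani.co.jp/2003/0413.html",
    "http://www.furutani.co.jp/2004/0101.html",
]
groups = group_by_period(urls)
print(sorted(groups))     # [2003, 2004]
print(len(groups[2003]))  # 2
```

Each group's text would then be scored separately, giving one row of the morpheme table per web page and period.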

The search may also be restricted to a single web page, with each of its periods as a search target.

In the above embodiment, a blog is used as the example of a web page, but the invention can be applied in the same way to homepages and the like.

In the above embodiment, related web pages are extracted for each of the morphemes extracted in step S13. Alternatively, from among the web pages that contain all five morphemes, those with the highest total score across the morphemes may be extracted. Furthermore, each extracted web page may itself be used as a target web page to search for further related web pages. In this way, as shown in FIG. 12, related web pages can be displayed spreading outward from the target web page "Intellectual Property Diary".
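The score-sum variant can be sketched as follows. The function name `rank_by_total`, the flat table layout (URL mapped directly to a score dictionary), and the sample data are hypothetical; excluding the target page itself is an assumption.

```python
def rank_by_total(table, target_url, search_morphemes, limit=3):
    """Rank pages containing every search morpheme by the sum of their scores."""
    ranked = []
    for url, scores in table.items():
        if url == target_url:
            continue
        # Only pages on which all search morphemes occur are eligible.
        if all(scores.get(m, 0.0) > 0 for m in search_morphemes):
            total = sum(scores[m] for m in search_morphemes)
            ranked.append((total, url))
    ranked.sort(reverse=True)
    return [url for _, url in ranked[:limit]]


table = {
    "http://example.com/target": {"patent": 10.0, "invention": 5.0},
    "http://example.com/a": {"patent": 5.0, "invention": 5.0},
    "http://example.com/b": {"patent": 9.0, "invention": 0.0},
    "http://example.com/c": {"patent": 2.0, "invention": 1.0},
}
print(rank_by_total(table, "http://example.com/target", ["patent", "invention"]))
# ['http://example.com/a', 'http://example.com/c']
```

Feeding each returned URL back in as a new target would give the second-level expansion mentioned above.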

In the above embodiment, web pages similar to the user's own web page are searched for, but web pages similar to another person's web page can also be searched for. In addition, the input screen of FIG. 8 may be arranged to accept two web pages so that the similarity between them is calculated. As the similarity, one can use the value obtained by taking the top-scoring morphemes of the two web pages, summing the scores of those that appear on both web pages, and dividing the sum by two.
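The similarity calculation admits more than one reading, so the following is a sketch under stated assumptions: "top-scoring morphemes" is taken to mean the top `n` of each page, a morpheme counts as shared when it has a score on both pages, and its contribution is the sum of its two scores. The function name and the default `n=5` are likewise illustrative.

```python
def similarity(scores_a, scores_b, n=5):
    """Half the summed scores of top morphemes present on both pages.

    scores_a, scores_b -- dicts {morpheme: score} for the two web pages
    """
    top_a = sorted(scores_a, key=scores_a.get, reverse=True)[:n]
    top_b = sorted(scores_b, key=scores_b.get, reverse=True)[:n]
    # Union of both top lists without duplicates, keeping first-seen order.
    top = list(dict.fromkeys(top_a + top_b))
    shared = [m for m in top if m in scores_a and m in scores_b]
    return sum(scores_a[m] + scores_b[m] for m in shared) / 2
```

Under this reading the value is the average, over the two pages, of the scores of their shared top morphemes, so two pages with no shared top morphemes have similarity 0.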

FIG. 1 is a functional block diagram of a related web page search system including a management server device according to an embodiment of the present invention.
FIG. 2 is a diagram showing the hardware configuration of the management server device.
FIG. 3 is a flowchart of the management server program.
FIG. 4 is a diagram showing a crawl table.
FIG. 5 is a diagram showing an example of a web page.
FIG. 6 is a diagram showing an example of a morpheme table.
FIG. 7 is a flowchart of the management server program.
FIG. 8 is a diagram showing an example of a search screen.
FIG. 9 is a display example of search results.
FIG. 10 is a display example of search results.
FIG. 11 is a diagram showing an example of a morpheme table according to another example.
FIG. 12 is a display example of search results.

Explanation of symbols

8 ... Character information acquisition means
10 ... Element information generating means
12 ... Receiving means
14 ... Element information reading means
16 ... Search result transmitting means
18 ... Search means
20 ... Recording unit

Claims

A management server device that searches for other web pages highly relevant to a target web page on a network, comprising:
character information acquisition means for accessing each web page and acquiring character information constituting each web page;
element information generating means for extracting elements included in the acquired character information of each web page, generating web page element information, and recording it in a recording unit in association with web page attribute information including location specifying information of each web page;
receiving means for receiving a search request, sent from a terminal device, that specifies the target web page;
element information reading means for reading, from the recording unit, the web page element information of the target web page specified by the search request;
search means for referring to the recording unit, searching for a web page having web page element information similar to the read target web page element information, and obtaining its web page attribute information; and
search result transmitting means for transmitting the web page attribute information of the retrieved web page to the terminal device.

A management server program for causing a computer to realize the following means of a management server device that searches for other web pages highly relevant to a target web page on a network:
character information acquisition means for accessing each web page and acquiring character information constituting each web page;
element information generating means for extracting elements included in the acquired character information of each web page, generating web page element information, and recording it in a recording unit in association with web page attribute information including location specifying information of each web page;
receiving means for receiving a search request, sent from a terminal device, that specifies the target web page;
element information reading means for reading, from the recording unit, the web page element information of the target web page specified by the search request;
search means for referring to the recording unit, searching for a web page having web page element information similar to the read target web page element information, and obtaining its web page attribute information; and
search result transmitting means for transmitting the web page attribute information of the retrieved web page to the terminal device.

In the management server device or program of Claim 1 or 2,
the element information generating means comprises:
morpheme extraction means for extracting morphemes as the elements by performing morphological analysis on the character information of each web page;
score calculating means for calculating a score for each morpheme based on the appearance frequency of each extracted morpheme; and
morpheme table recording means for recording the score of each morpheme for each web page as a morpheme table in the recording unit;
and the search means comprises:
search morpheme extraction means for extracting, based on the scores, a predetermined number of morphemes as search morphemes from the plurality of morphemes that constitute the target web page element information; and
web page selection means for referring to the morpheme table, summing for each web page the scores given to the search morphemes, and selecting similar web pages based on the sums.

In the management server device or program of Claim 3,
the score calculating means calculates the score of each morpheme extracted from a web page based on the ratio of the number of occurrences of that morpheme on the web page to the total number of morphemes on the web page.

In the management server device or program of any one of Claims 1 to 4,
the character information acquisition means and the element information generating means perform their operations at predetermined intervals.

In the management server device or program of any one of Claims 1 to 5,
before a search request is received, the element information reading means and the search means execute the search with each web page as a target web page and record the results in the recording unit; and
the search result transmitting means, upon receiving a search request from a terminal device, transmits to the terminal device the web page attribute information, recorded in the recording unit, of the web pages similar to the target web page.

In the management server device or program of any one of Claims 1 to 6,
the search result transmitting means groups the retrieved web pages based on a morpheme representing each retrieved web page and transmits them to the terminal device.

In the management server device or program of any one of Claims 1 to 7, further comprising:
second element information reading means for taking a web page found by the search means as a second target web page and reading the web page element information of the second target web page from the recording unit; and
second search means for referring to the recording unit and searching for a web page having web page element information similar to the read second target web page element information;
wherein the search result transmitting means transmits the web page attribute information of the web pages found by both the search means and the second search means to the terminal device.

In the management server device or program of any one of Claims 1 to 8,
the element information generating means generates and records the web page element information of each web page separately for each predetermined period; and
the search means searches for related web pages, with the period specified, based on the target web page element information for each predetermined period.

In the management server device or program of any one of Claims 1 to 9,
the web page is a blog.

A web page search method in which a computer searches for other web pages highly relevant to a target web page on a network, wherein the computer:
accesses each web page and acquires character information constituting each web page;
extracts elements included in the acquired character information of each web page, generates web page element information, and records it in a recording unit in association with web page attribute information including location specifying information of each web page;
receives a search request, sent from a terminal device, that specifies the target web page;
reads, from the recording unit, the web page element information of the target web page specified by the search request;
refers to the recording unit, searches for a web page having web page element information similar to the read target web page element information, and obtains its web page attribute information; and
transmits the web page attribute information of the retrieved web page to the terminal device.