JP2010086210A

JP2010086210A - Retrieval method, program, and server for preferentially displaying page corresponding to amount of information

Info

Publication number: JP2010086210A
Application number: JP2008253465A
Authority: JP
Inventors: Yukiko Mori; 有紀子森; Masaru Ichikawa; 勝市川
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2008-09-30
Filing date: 2008-09-30
Publication date: 2010-04-15
Anticipated expiration: 2028-09-30
Also published as: JP5072792B2

Abstract

PROBLEM TO BE SOLVED: To provide a retrieval method, a program, and a server for preferentially outputting not a page of a small amount of information but a page of a large amount of information in retrieving a Web page.

SOLUTION: A retrieval server 1 extracts a feature word included in a Web page to be retrieved through morphological analysis, and calculates a weight of the feature word by using TF-IDF. The retrieval server 1 is also provided with a relevant word DB 24 for specifying whether or not a plurality of feature words are relevant to each other, and decides whether or not the plurality of feature words included in the Web page are relevant to each other by using the relevant word DB 24. Then, the retrieval server 1 calculates the total weight of the plurality of relevant feature words, and adjusts the retrieval result by using the total weight.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a search method, a program, and a server that, when searching Web pages, preferentially display pages with a large amount of information about a search keyword rather than pages with little.

In recent years, websites, blogs, and other content accessible via the Internet have become searchable through various search services, which are now indispensable not only in academia but also in the daily lives of ordinary people and in corporate activities.

Many such search services search content on the Internet based on an entered search keyword. To efficiently find content that matches the user's search purpose, as inferred from the keyword, each service employs various techniques. For example, content with a high so-called page rank is often displayed higher in the search results.

This page rank is determined so that a page ranks higher not only when it contains the search keyword more often, but also when more links point to it from other pages (that is, it has many inbound links) or when users have viewed it more often (it has a high click count).

For example, according to Patent Document 1, when the many Web pages on the Internet are searched and displayed in ranked order, the page rank is determined by calculating a score (weight) for each page with reference to its number of inbound linking domains.
Patent Document 1: JP 2007-114903 A

In this way, conventional search services have tried to display search results in an order suited to the user's needs by using various data that characterize Web pages.

Meanwhile, blog sites have been increasing in recent years. Blog sites update their Web pages frequently, but because the theme changes with each update, the information becomes scattered, and many of them contain little information on any single theme. In addition, some blog sites opened by celebrities have become extremely popular.

In this situation, if the ranking of search results reflects the number of inbound links, the click count, and the like, a popular blog site is displayed high in the search results even when it contains little information about the search keyword.

This problem is described concretely with reference to FIGS. 9 and 10. In FIG. 9(1), only text 200 about the "Olympics" is displayed, and its content is substantial. In FIG. 9(2), by contrast, text 202 and text 203 on other themes are displayed in addition to text 201 about the "Olympics", and the content of text 201 is thin. Web page 2 is an example of a popular blog site.

FIG. 10(1) shows an example of the Web page weights for Web pages 1 and 2, comprising the feature word weight, which is the importance (appearance frequency) of a feature word within the page, the click history, and the link data count. A feature word is a characteristic word among those extracted by morphological analysis of the text data in a Web page (for example, the words that remain after particles, auxiliary verbs, and the like are removed from the morphological analysis output). The feature word weight is an index of how heavily a feature word is weighted within a Web page, calculated using TF-IDF (Term Frequency-Inverse Document Frequency), a measure based on the appearance frequency of words. FIG. 10(2) shows the weight (score) of each Web page when the user enters "Olympics" as the search keyword in this case.

According to FIG. 10(1), Web page 1, whose content on the "Olympics" is substantial, has a TF-IDF weight of "30" for the feature word "Olympics", whereas Web page 2, whose content on the "Olympics" is thin, has a weight of "20". However, because Web page 2 is popular as a blog site, its click history, link data count, and the like are larger than those of Web page 1. As a result, as shown in FIG. 10(2), Web page 2 may end up with a higher weight than Web page 1 even though its content on the "Olympics" is thinner. In that case, when the user enters "Olympics" as the search keyword, the thin, thematically scattered Web page 2 is displayed in preference to the substantial Web page 1.

From the user's point of view, the search keyword was deliberately set to "Olympics" rather than the name of the celebrity who writes the blog, so this result does not meet the user's needs.

Accordingly, an object of the present invention is to provide a search method, a program, and a server that, when searching Web pages, preferentially output pages with a large amount of information rather than pages with little.

The inventors found a mechanism for preferentially displaying pages with a large amount of information in search results by using the feature word weights of feature words related to the search keyword, and thereby completed the present invention. Specifically, the present invention provides the following.

(1) A search method in which a computer connected to a terminal via a communication network performs:
a Web page analysis step of analyzing a Web page to be searched and extracting a plurality of feature words that characterize the Web page; and
a feature word weight calculation step of calculating, for each of the plurality of feature words, a feature word weight indicating the appearance frequency of the extracted feature word in the Web page,
wherein the computer comprises a related word DB that stores relationships among a plurality of feature words,
the search method further comprising:
a related word determination step of determining, using the related word DB, whether the plurality of feature words extracted in the Web page analysis step are related to one another;
a Web page weight calculation step of calculating, as a related word weight, the sum of the feature word weights of the feature words determined to be related; and
an indexing storage step of storing the calculated related word weight in an indexing DB in association with link data of the Web page.

With this configuration, the related word weight, which is the sum of the feature word weights of the feature words determined to be related, is stored in the indexing DB in association with the link data of the Web page. This makes it possible to properly determine the amount of information on a specific theme within the Web page. Consequently, in computer processing such as outputting search results, the search results can be adjusted so that pages with a large amount of information are output in preference to pages with little.

(2) The search method according to (1), in which the computer further performs:
a reception step of receiving request data including a search keyword from the terminal;
a search step of searching for Web pages containing the search keyword, based on the search keyword received in the reception step;
a Web page adjustment step of adjusting the search result using, for the Web pages found in the search step, the related word weight related to the search keyword; and
a transmission step of transmitting content including link data of the found Web pages to the terminal, based on the adjustment result of the Web page adjustment step.

With this configuration, when Web pages are searched by search keyword, Web pages with a large amount of information about the search keyword can be output preferentially on the user's terminal, so a search method suited to the user's needs can be provided.

(3) The search method according to (2), in which the computer further performs:
an average weight calculation step of calculating the average of the feature word weights of all feature words in the Web page; and
a comparison weight calculation step of calculating the ratio by which the related word weight calculated in the Web page weight calculation step deviates from the average,
wherein the Web page adjustment step adjusts the search result using the ratio calculated in the comparison weight calculation step.

With this configuration, the ratio by which the related word weight deviates from the average within the Web page is used to identify Web pages with a large amount of information, so the importance within each page of the feature words corresponding to the search keyword can be properly compared across two or more Web pages.
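The comparison described in (3) might be sketched in Python as follows. This is only an illustration under stated assumptions: the text says a ratio of deviation from the average is calculated but gives no formula, so the relative deviation `(related - mean) / mean`, the function name `deviation_ratio`, and the sample weights are all hypothetical.

```python
from statistics import mean

def deviation_ratio(related_word_weight, feature_word_weights):
    """Relative deviation of a related word weight from the page's mean feature word weight.

    Assumed formula (not given in the source): (related - mean) / mean.
    """
    average = mean(feature_word_weights)
    if average == 0:
        raise ValueError("mean feature word weight is zero")
    return (related_word_weight - average) / average

# Illustrative weights only: a focused page and a scattered page.
focused = deviation_ratio(70, [30, 25, 15])    # related words dominate the page
scattered = deviation_ratio(20, [20, 15, 10])  # related words are a small share
print(round(focused, 2), round(scattered, 2))
```

Because the ratio is normalized by each page's own average, it can be compared between pages whose absolute TF-IDF scales differ.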

(4) The search method according to any one of (1) to (3), in which the computer performs an acquisition step of periodically crawling Web servers connected via the communication network to acquire the Web pages to be searched.

With this configuration, the target Web pages are acquired periodically, so searches can always use up-to-date Web page data.

(5) A search program for causing a computer to execute the steps of the method according to any one of (1) to (4).

(6) A search server connected to a terminal via a communication network, comprising:
Web page analysis means for analyzing a Web page to be searched and extracting a plurality of feature words that characterize the Web page;
feature word weight calculation means for calculating, for each of the plurality of feature words, a feature word weight indicating the appearance frequency of the extracted feature word in the Web page;
a related word DB that stores relationships among a plurality of feature words;
related word determination means for determining, using the related word DB, whether the plurality of feature words extracted by the Web page analysis means are related to one another;
Web page weight calculation means for calculating, as a related word weight, the sum of the feature word weights of the feature words determined to be related; and
indexing storage means for storing the calculated related word weight in an indexing DB in association with link data of the Web page.

According to the present invention, pages with a large amount of information can be output in preference to pages with little when searching Web pages.

The best mode for carrying out the present invention is described below with reference to the drawings. This is only an example, and the technical scope of the present invention is not limited to it.

(Embodiment)
[Overall configuration of the search system and functional configuration of the search server]
FIG. 1 shows the overall configuration of the search system 100 and the functional configuration of the search server 1 according to this embodiment.

The search system 100 consists of a search server 1, a content server 2, a communication network 3, and a terminal 4. As shown in FIG. 1, the search server 1, the content server 2, and the terminal 4 are connected so that they can communicate with one another via the communication network 3, typified by a communication line such as the Internet.

The search server 1 includes a control unit 10 and a storage unit 20. The control unit 10 includes Web page acquisition means 11, Web page analysis means 12, feature word weight calculation means 13, related word determination means 14, Web page weight calculation means 15, reception means 16, search means 17, and content transmission means 19. The storage unit 20 includes a collected Web page database 22 (hereinafter, "database" is also written "DB"), a related word DB 24, and an indexing DB 26. The contents of each DB are described later.

There is no limit on the number of hardware units in the search server 1, which may consist of one or more units as needed. When it consists of multiple units, they may be connected via the communication network 3. For example, each function described below may run on a separate server, and the functions of this embodiment may be realized by having the servers cooperate through signals exchanged between them.

The Web page acquisition means 11 acts as a crawler that periodically visits the content server 2, which stores Web pages (content), and collects (acquires) newly created and updated Web pages. Collected Web pages are stored in the collected Web page DB 22 as they are acquired.

The Web page analysis means 12 performs morphological analysis on the text data of the Web pages stored in the collected Web page DB 22 and extracts feature words from the resulting words.
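The filtering half of this step could look like the following Python sketch. It assumes the morphological analyzer has already produced `(surface, part_of_speech)` pairs; the tag names in `EXCLUDED_POS`, the function name `extract_feature_words`, and the sample tokens are hypothetical, and the only rule taken from the source is that particles, auxiliary verbs, and the like are dropped.

```python
from collections import Counter

# Hypothetical part-of-speech labels; a real morphological analyzer defines its own tag set.
EXCLUDED_POS = {"particle", "auxiliary_verb"}

def extract_feature_words(tokens):
    """Count candidate feature words in (surface, part_of_speech) pairs."""
    return Counter(surface for surface, pos in tokens if pos not in EXCLUDED_POS)

# Hand-written analyzer output for illustration.
tokens = [
    ("オリンピック", "noun"), ("は", "particle"),
    ("北京", "noun"), ("で", "particle"),
    ("開催", "noun"), ("です", "auxiliary_verb"),
    ("オリンピック", "noun"),
]
print(extract_feature_words(tokens))
```

Returning counts rather than a plain list keeps the occurrence numbers that a later frequency-based weighting needs.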

The feature word weight calculation means 13 calculates a feature word weight indicating the importance (appearance frequency) of each extracted feature word within the Web page. The feature word weight is calculated using TF-IDF (Term Frequency-Inverse Document Frequency), a measure based on the appearance frequency of words.
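As a rough illustration, one common TF-IDF formulation can be written in Python as below. The source does not say which TF-IDF variant or scaling is used (its example weights such as 30 and 20 are on a different scale), so the formula `tf = count / page_length`, `idf = log(N / df)`, the function name `tf_idf_weights`, and the toy corpus are assumptions.

```python
import math
from collections import Counter

def tf_idf_weights(corpus, page_index):
    """Return {feature word: TF-IDF weight} for one page of a corpus.

    corpus: list of pages, each a list of feature-word occurrences.
    Assumed variant: tf = count / page length, idf = log(N / document frequency).
    """
    page = corpus[page_index]
    counts = Counter(page)
    weights = {}
    for word, count in counts.items():
        # The page itself is in the corpus, so the document frequency is at least 1.
        document_frequency = sum(1 for other in corpus if word in other)
        tf = count / len(page)
        idf = math.log(len(corpus) / document_frequency)
        weights[word] = tf * idf
    return weights

# Toy corpus of three pages (hypothetical data).
corpus = [
    ["olympics", "olympics", "beijing"],
    ["sea", "midtown"],
    ["olympics", "sea"],
]
print(tf_idf_weights(corpus, 0))
```

In this toy corpus "beijing" outweighs "olympics" on page 0 even though it occurs less often, because it appears in fewer pages overall.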

The related word determination means 14 uses the related word DB 24 to determine, for each Web page, whether the feature words for which weights were calculated have related words. A related word is another feature word that is related to a given feature word. Whether feature words are related can be set freely by the server administrator and changed as needed.

The Web page weight calculation means 15 calculates the related word weight of the related words determined by the related word determination means 14 (for example, the sum of the weights of the related feature words), and calculates a Web page weight comprising this related word weight, the feature word weights, the click history, and the link data count. The Web page weight calculation means 15 then stores the calculated Web page weight in the indexing DB 26 (see FIG. 4(2), described later).

The reception means 16 receives request data including a search keyword from the terminal 4. The search means 17 searches for Web pages containing the search keyword. The Web page adjustment means 18 adjusts the search result using the search keyword and the Web page weight calculated by the Web page weight calculation means 15. The content transmission means 19 transmits content including the link data of the Web pages to the terminal 4, based on the adjustment result of the Web page adjustment means 18.
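The adjustment step might be sketched as follows in Python. This is a simplification under assumptions: it orders candidate pages purely by the related word weight of the group containing the keyword, whereas the text leaves open how that weight is combined with the other parts of the Web page weight. The function name `adjust_results`, the dictionary layout, and the `example.com` URLs are hypothetical.

```python
def adjust_results(keyword, related_word_db, candidates):
    """Order candidate pages by the related word weight of the keyword's group(s).

    related_word_db: {related word ID: set of related feature words}
    candidates: list of dicts with "url" and "related_word_weights" ({ID: weight})
    """
    group_ids = [rid for rid, words in related_word_db.items() if keyword in words]

    def score(page):
        weights = page["related_word_weights"]
        return sum(weights.get(rid, 0) for rid in group_ids)

    return sorted(candidates, key=score, reverse=True)

related_word_db = {"010": {"Olympics", "Olympic Games", "Beijing", "London", "Tokyo"}}
candidates = [
    {"url": "http://example.com/page2", "related_word_weights": {"010": 20}},
    {"url": "http://example.com/page1", "related_word_weights": {"010": 70}},
]
ranked = adjust_results("Olympics", related_word_db, candidates)
print([page["url"] for page in ranked])
```

With these inputs the page whose related word weight is 70 is listed before the page whose weight is 20, which is the ordering the invention aims for.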

The content server 2 is a content DB server that stores a plurality of Web pages. Any content DB server storing Web pages anywhere in the world qualifies as a content server 2, as long as it is connected to the communication network 3.

The terminal 4 is a device having an input unit, such as a keyboard and mouse, with which the user enters operations for playing content, and a display screen for displaying content. Using the terminal 4, the user can enter search keywords and view content.

[Hardware configuration of the search server 1]
FIG. 2 shows the hardware configuration of the search server 1 according to this embodiment. The server on which the present invention is implemented may be a standard one; an example configuration is given below.

The search server 1 includes a CPU (Central Processing Unit) 1010 constituting the control unit 10 (in a multiprocessor configuration, additional CPUs such as a CPU 1012 may be added), a bus line 1005, a communication I/F (interface) 1040, a main memory 1050, a BIOS (Basic Input Output System) 1060, a display device 1022, an I/O controller 1070, an input device 1100 such as a keyboard and mouse, a hard disk 1074, an optical disc drive 1076, and a semiconductor memory 1078. The hard disk 1074, the optical disc drive 1076, and the semiconductor memory 1078 are collectively called the storage unit 20.

The control unit 10 controls the search server 1 as a whole. By reading and executing the various programs stored on the hard disk 1074 as appropriate, it cooperates with the hardware described above to realize the various functions of the present invention.

The communication I/F 1040 is a network adapter through which the search server 1 exchanges information with the terminal 4 (FIG. 1) via the communication network 3 (FIG. 1). The communication I/F 1040 may include a modem, a cable modem, and an Ethernet (registered trademark) adapter.

The BIOS 1060 stores a boot program that the CPU 1010 executes when the search server 1 starts, programs that depend on the hardware of the search server 1, and the like.

The display device 1022 includes a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD).

Storage devices making up the storage unit 20, such as the hard disk 1074, the optical disc drive 1076, and the semiconductor memory 1078, can be connected to the I/O controller 1070.

The input device 1100 accepts input from the administrator of the search server 1.

The hard disk 1074 stores various programs for making this hardware function as the search server 1, programs that execute the functions of the present invention, and the DB tables and records described later. The search server 1 can also use a separately provided external hard disk (not shown) as an external storage device.

The optical disc drive 1076 may be, for example, a DVD-ROM drive, a CD-ROM drive, a DVD-RAM drive, or a CD-RAM drive, used with an optical disc 1077 that matches the drive. A program or data can also be read from the optical disc 1077 by the optical disc drive 1076 and supplied to the main memory 1050 or the hard disk 1074 via the I/O controller 1070.

In the present invention, a computer means an information processing device having a storage device, a control unit, and the like. The search server 1 is an information processing device having the storage unit 20, the control unit 10, and the like, and this information processing device falls within the concept of a computer in the present invention.

[Hardware configuration of the content server 2]
The content server 2 has the same configuration as the search server 1 described above. Note that the search server 1, like the content server 2, may also have a content DB that stores Web pages.

[Hardware configuration of the terminal 4]
The terminal 4 also has the same configuration as the search server 1 described above. For example, its storage unit 20 may include a drive that accepts external memory in place of the optical disc drive 1076. The display device 1022 may be a touch panel that also serves as the input device 1100. The terminal 4 may further include a sensor, such as an acceleration sensor, that serves as the input device 1100.

[Flowchart of main processing]
FIG. 3 is a flowchart of the indexing process of the search server 1 according to this embodiment. The indexing process is performed periodically, for example at predetermined times set by the administrator of the search server 1.

S1: The control unit 10 (Web page acquisition means 11) acquires Web pages by crawling the content server 2. The control unit 10 then stores the acquired Web pages in the collected Web page DB 22.

S2: The control unit 10 (Web page analysis means 12) performs Web page analysis. Specifically, the Web page analysis means 12 performs morphological analysis on the Web pages stored in the collected Web page DB 22 and extracts words from their text data. From these words it then extracts those that are characteristic within the Web page, that is, the feature words.

S3: The control unit 10 (feature word weight calculation means 13) calculates the feature word weight, which is the weighting of each feature word within the Web page.

S4: The control unit 10 (related word determination means 14) determines whether the feature words for which weights were calculated have related words. The related word DB 24 (see FIG. 4(1)) is used for this determination.

S5: The control unit 10 (Web page weight calculation means 15) calculates a Web page weight that characterizes the Web page and stores it in the storage unit 20 (indexing DB 26). The Web page weight comprises the related word weight, the feature word weights, the click history, and the link data count. The feature word weight is obtained for each feature word by calculating its weight within the Web page using TF-IDF. The related word weight is the sum of the weights of the feature words determined to be related. The click history is the number of times the Web page was displayed on the terminal 4 during a predetermined period, and the link data count is the number of Web pages that link to the Web page.

[Related word DB 24 and indexing DB 26]
FIG. 4 shows the related word DB 24 and the indexing DB 26 stored in the storage unit 20 according to this embodiment. The related word DB 24 is the DB that the control unit 10 uses in S4 of FIG. 3 to determine whether the feature words in a Web page are related. Whether words are related can be set freely in advance by the administrator of the search server 1. The indexing DB 26 is the database in which the control unit 10 stores the Web page weight, including the related word weight, in S5 of FIG. 3.

The related word DB 24, an example of which is shown in FIG. 4(a), stores, for each related word ID, the related words freely set by the administrator. For example, under "related word ID 010", "Olympic Games" (五輪), "Beijing", "London", "Tokyo", and so on are stored as feature words related to "Olympics" (オリンピック).
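A related word DB of this shape, together with the lookup that the related word determination relies on, could be sketched in Python as follows. The dictionary layout and the function name `related_groups` are assumptions; the group contents follow the "related word ID 010" example, with オリンピック rendered as "Olympics" and 五輪 as "Olympic Games".

```python
# Related word DB: related word ID -> feature words the administrator marked as related.
RELATED_WORD_DB = {
    "010": {"Olympics", "Olympic Games", "Beijing", "London", "Tokyo"},
}

def related_groups(feature_words, db=RELATED_WORD_DB):
    """Return {related word ID: feature words of the page that belong to that group}."""
    groups = {}
    for related_id, words in db.items():
        hits = [word for word in feature_words if word in words]
        if hits:
            groups[related_id] = hits
    return groups

print(related_groups(["Olympics", "Beijing", "Olympic Games"]))
print(related_groups(["Olympics", "Midtown", "Sea"]))
```

Keeping each group as a set makes the membership test cheap, and because the administrator may change the groups, the DB is passed in as a parameter rather than hard-coded in the function body.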

The indexing DB 26, an example of which is shown in FIG. 4(b), uses the Web page ID as key information and stores the URL (Uniform Resource Locator) indicating the location of the Web page together with the corresponding Web page weight, which includes the feature word weights, the related word weight, the click history, and the number of link data. FIG. 4(b) shows, as an example, information about Web page 1 (FIG. 8(1)) and Web page 2 (FIG. 8(2)).

The related word weight field stores, for each related word ID, the sum of the weights of the plurality of feature words determined to be related. For example, for "Web page ID 001", the feature words "Olympics (weight: 30)", "Beijing (weight: 25)", "Gorin (五輪, weight: 15)", and so on are calculated, and all of these feature words are set as related under "related word ID 010". The related word weight for "related word ID 010" is therefore the sum of the weights of these feature words, namely "70" (= 30 + 25 + 15). Similarly, for "Web page ID 002", the feature words "Olympics (weight: 20)", "Midtown (weight: 15)", "sea (weight: 10)", and so on are calculated, but only "Olympics" is determined to be related under "related word ID 010". The related word weight for "related word ID 010" is therefore "20".
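The arithmetic in this example can be reproduced with a few lines of code. This is a minimal sketch: the function name `related_word_weight` and the dictionary layout are assumptions, and only the feature words actually listed in the text are included (the description says "and so on", so the real pages may contain more).

```python
def related_word_weight(feature_weights, related_words):
    """Sum the weights of the feature words that belong to one related word group."""
    return sum(w for word, w in feature_weights.items() if word in related_words)

# Related word ID 010, using English glosses of the words in FIG. 4(a).
GROUP_010 = {"Olympics", "Gorin", "Beijing", "London", "Tokyo"}

# Feature word weights quoted in the text for Web page IDs 001 and 002.
page_001 = {"Olympics": 30, "Beijing": 25, "Gorin": 15}
page_002 = {"Olympics": 20, "Midtown": 15, "Sea": 10}

print(related_word_weight(page_001, GROUP_010))  # 70
print(related_word_weight(page_002, GROUP_010))  # 20
```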

In the present embodiment, the click history and the number of link data are stored as part of the Web page weight, but they may be replaced with other indices indicating characteristics of the Web page.

[Flowchart of Search Processing]
Next, the search processing using the related word weight described above will be explained. FIG. 5 is a flowchart of the search processing in the search server 1 according to the present embodiment.

S21: The control unit 10 (receiving means 16) receives request data including a search keyword from the terminal 4.

S22: The control unit 10 (search means 17) searches for Web pages containing the keyword, based on the received search keyword. The Web pages stored in the collected Web page DB 22 can be used for this search.

S23: The control unit 10 (Web page adjustment means 18) uses the indexing DB 26 to adjust the order in which the link data of the search results are arranged. In this adjustment, the related word weight of the related word ID corresponding to the search keyword is first extracted for each Web page in the search results. Next, the extracted related word weight is corrected using indices indicating characteristics of the Web page, such as the click history and the number of link data, to calculate a total weight. The Web pages in the search results are then output in descending order of the calculated total weight.

For example, consider "Web page ID 001" and "Web page ID 002" when the search keyword is "Olympics". The related word ID corresponding to the search keyword "Olympics" is "010", and the related word weight of "Web page ID 001" is "70", as shown in FIG. 4(2). When this is corrected using the click history, the number of link data, and so on, the total weight of "Web page ID 001" becomes "90", as shown in FIG. 6. Similarly, the related word weight of "Web page ID 002" is "20", as shown in FIG. 4(2), and correcting it gives a total weight of "55" for "Web page ID 002", as shown in FIG. 6.
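The ordering in S23 can be sketched as follows. The description does not say how the click history and the number of link data enter the correction, so the correction is passed in as a function. The additive `popularity_bonus` values below are purely hypothetical and were chosen only so that the totals match the 90 and 55 quoted above.

```python
def rank_pages(related_weights, correct):
    """Return (page IDs in descending total weight, {page ID: total weight})."""
    totals = {pid: correct(pid, w) for pid, w in related_weights.items()}
    order = sorted(totals, key=totals.get, reverse=True)
    return order, totals

# Related word weights for related word ID 010, as quoted in the text.
related_weights = {"001": 70, "002": 20}

# Hypothetical additive correction standing in for click history and link data.
popularity_bonus = {"001": 20, "002": 35}

order, totals = rank_pages(related_weights, lambda pid, w: w + popularity_bonus[pid])
print(order)   # ['001', '002']
print(totals)  # {'001': 90, '002': 55}
```

Even though Web page 002 receives the larger popularity correction in this sketch, Web page 001 still ranks first because its related word weight dominates.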

S24: Based on the adjustment result, the control unit 10 creates content in which the link data of the Web pages are arranged. In this case, the URL values shown in FIG. 4(2) can be arranged.

S25: The control unit 10 (content transmission means 19) transmits the created content to the terminal 4.

As described above, in the present embodiment, the total weight of a page with a large amount of information on a given theme can be made higher than that of a page with a small amount of information, and as a result, pages with a large amount of information can be displayed at the top of the search results. That is, conventionally, even a page with little information on a given theme was displayed at the top of the search results when its click history, number of link data, and the like were large (see FIGS. 9 and 10). By grouping the related feature words within a Web page and calculating them as a related word weight, pages with a large amount of information can be displayed preferentially.

On the other hand, because indices indicating characteristics of the Web page, such as the click history and the number of link data, can also be reflected in the search results, the more popular Web page is displayed preferentially when the amount of information is the same, and the search results can be displayed in an order that matches the user's needs.

[Preferred Example for Comparing Web Pages]
The feature word weight calculated using TF-IDF is a relative value within each Web page, and may therefore be unsuitable for comparing Web pages. For example, Web page 3 (ID: 003) in FIG. 7(1) is a blog-style Web page in which the entry dated August 1 contains text 200, whose content is exactly the same as Web page 1, and the entries from August 2 onward contain additional text 205.

In this case, when the feature word weights of Web page 3 are calculated using TF-IDF, the result is "Olympics (weight: 20)", "Beijing (weight: 15)", and "Gorin (五輪, weight: 5)", as shown in FIG. 7(2), and the related word weight for "related word ID 010" is "40". By contrast, as shown in FIG. 4(2), Web page 1, which contains text 200 with exactly the same content, has "Olympics (weight: 30)", "Beijing (weight: 25)", and "Gorin (weight: 15)", and its related word weight for "related word ID 010" is "70".

That is, even when exactly the same text appears in two Web pages, the feature word weights calculated using TF-IDF differ depending on the other text in each page, and as a result the related word weights differ greatly. In this case, when "Olympics" is entered as the search keyword, the question arises as to which of Web page 1 and Web page 3 should be displayed higher in the search results.

In this regard, in the present embodiment, the ratio by which the related word weight deviates from the average of the feature word weights of all feature words in the Web page (the average weight) may be calculated as a comparison weight, and the Web pages may be compared by comparing this ratio. The ratio can be calculated as "(related word weight - average weight) / related word weight".

Specifically, for Web page 1, the average of the feature word weights of all feature words in the Web page is "18", so the comparison weight for the related word weight of "related word ID 010" is "0.74" (= (70 - 18) / 70). Similarly, for Web page 3, the average within the Web page is "10", so the comparison weight is "0.75" (= (40 - 10) / 40).
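The two comparison weights can be checked directly from the formula "(related word weight - average weight) / related word weight". In this small sketch the function name `comparison_weight` is an assumption, and the inputs are the values quoted in the text.

```python
def comparison_weight(related_weight, average_weight):
    """Ratio by which the related word weight deviates from the page's average weight."""
    return (related_weight - average_weight) / related_weight

page_1 = comparison_weight(70, 18)  # Web page 1: related word weight 70, average 18
page_3 = comparison_weight(40, 10)  # Web page 3: related word weight 40, average 10

print(f"{page_1:.2f}")  # 0.74
print(f"{page_3:.2f}")  # 0.75
```

Although the raw related word weights differ by 30, the two ratios differ by less than 0.01, which is why the two pages end up at nearly the same rank.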

By calculating and comparing the comparison weights in this way, Web page 1 and Web page 3 can be given almost the same rank in the search results. That is, Web pages with the same amount of information on a given theme can be given the same search rank regardless of the information in the rest of each page. As a result, in the present embodiment, the search results can be adjusted using an index that is absolute across all Web pages (the ratio of deviation from the average, that is, the importance within the Web page), and more appropriate search results can be provided to the user.

Specifically, this can be realized as follows: in S3 of FIG. 3, the control unit 10 (feature word weight calculation means 13) calculates the feature word weights in the Web page and also the average of those feature word weights, and in S5 of FIG. 3, the control unit 10 (Web page weight calculation means 15) calculates the related word weight and also the comparison weight.

(Modifications)
[Modification of the Related Word DB]
In the above embodiment, the weights of the feature words determined to be related are simply added together, but as shown in FIG. 8, the proportion added may be varied according to the degree of relatedness. For example, both "Beijing" and "Tokyo" can be considered related to "Olympics" under "related word ID 010", and "Beijing", for which the date of the Games is close, may be treated as more closely related than "Tokyo", for which the date of the Games is distant. These degrees of relatedness can be set arbitrarily by the administrator of the search server 1 and can be changed as appropriate.
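One way to picture this modification is to attach a degree of relatedness to each related word and to scale each feature word weight by that degree before summing. The sketch below does this under stated assumptions: the degree values are hypothetical (the values of FIG. 8 are not reproduced in the text), and the function name `graded_related_word_weight` is illustrative only.

```python
# Hypothetical degrees of relatedness for related word ID 010.
# "Beijing" is given a higher degree than "Tokyo", following the example in the text.
DEGREES_010 = {"Olympics": 1.0, "Gorin": 1.0, "Beijing": 0.75, "Tokyo": 0.25}

def graded_related_word_weight(feature_weights, degrees):
    """Sum feature word weights, each scaled by its degree of relatedness."""
    return sum(w * degrees[word] for word, w in feature_weights.items() if word in degrees)

# Feature word weights quoted earlier for Web page ID 001.
page_001 = {"Olympics": 30, "Beijing": 25, "Gorin": 15}

graded = graded_related_word_weight(page_001, DEGREES_010)
print(graded)  # 63.75 with these assumed degrees (30 + 18.75 + 15)
```

Setting every degree to 1.0 recovers the simple sum of the main embodiment, so the two schemes can share one code path.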

[Calculation of Feature Word Weights]
In the above embodiment, the related word weight is calculated (S5) in the indexing processing that is performed periodically at predetermined timings, but the calculation is not limited to this. For example, the related word weight may be calculated after the search keyword entered by the user is received in the search processing. With such a configuration, it suffices to calculate the related word weight only for the related word ID corresponding to that search keyword.
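The query-time variant can be sketched as a function that first looks up the related word IDs containing the search keyword and then sums weights only for those IDs. Everything here is illustrative: the function name `related_weights_for_query`, the data layout, and the second group "020" are assumptions added to show that unrelated groups are never computed.

```python
def related_weights_for_query(keyword, related_db, pages):
    """Compute related word weights only for the groups that contain the keyword."""
    ids = [rid for rid, words in related_db.items() if keyword in words]
    return {
        page_id: {
            rid: sum(w for word, w in weights.items() if word in related_db[rid])
            for rid in ids
        }
        for page_id, weights in pages.items()
    }

related_db = {
    "010": {"Olympics", "Gorin", "Beijing", "London", "Tokyo"},
    "020": {"Sea", "Beach"},  # hypothetical second group, not in the source
}
pages = {
    "001": {"Olympics": 30, "Beijing": 25, "Gorin": 15},
    "002": {"Olympics": 20, "Midtown": 15, "Sea": 10},
}

on_demand = related_weights_for_query("Olympics", related_db, pages)
print(on_demand)  # {'001': {'010': 70}, '002': {'010': 20}}
```

The trade-off is that this work moves from the periodic indexing processing to each search request, in exchange for skipping groups that no query asks about.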

FIG. 1 is a diagram showing the overall configuration of the search system and the functional configuration of the search server.
FIG. 2 is a diagram showing the hardware configuration of the search server.
FIG. 3 is a flowchart of the indexing processing of the search server.
FIG. 4 is a diagram showing the related word DB and the indexing DB stored in the storage unit.
FIG. 5 is a flowchart of the search processing in the search server.
FIG. 6 is a diagram showing an example of search results in the search server.
FIG. 7 is a diagram showing a preferred example for comparing Web pages.
FIG. 8 is a diagram showing another embodiment of the related word DB.
FIG. 9 is a diagram showing an example of Web page 1 and Web page 2.
FIG. 10 is a diagram showing an example of calculating the total weight of Web pages.

Explanation of symbols

1 Search server
2 Content server
4 Terminal
10 Control unit
11 Web page acquisition means
12 Web page analysis means
13 Feature word weight calculation means
14 Related word determination means
15 Web page weight calculation means
16 Receiving means
17 Search means
18 Web page adjustment means
19 Content transmission means
20 Storage unit
22 Collected Web page DB
24 Related word DB
26 Indexing DB
100 Search system

Claims

1. A search method in which a computer connected to a terminal via a communication network executes:
a Web page analysis step of analyzing a Web page to be searched and extracting a plurality of feature words indicating features of the Web page; and
a feature word weight calculation step of calculating, for each of the plurality of feature words, a feature word weight indicating the appearance frequency of the extracted feature word in the Web page,
wherein a related word DB storing the relevance of a plurality of feature words is provided, and the search method further includes:
a related word determination step of determining, using the related word DB, whether or not the plurality of feature words extracted in the Web page analysis step are related to each other;
a Web page weight calculation step of calculating, as a related word weight, the sum of the feature word weights of the feature words determined to be related; and
an indexing storage step of storing the calculated related word weight in an indexing DB in association with link data of the Web page.

2. The search method according to claim 1, wherein the computer further executes:
a receiving step of receiving request data including a search keyword from the terminal;
a search step of searching for Web pages including the search keyword, based on the search keyword received in the receiving step;
a Web page adjustment step of adjusting the search result using, for the Web pages found in the search step, the related word weight related to the search keyword; and
a transmission step of transmitting, to the terminal, content including link data of the found Web pages, based on the adjustment result of the Web page adjustment step.

3. The search method according to claim 2, wherein the computer further executes:
an average weight calculation step of calculating the average value of the feature word weights of all feature words in the Web page; and
a comparison weight calculation step of calculating the ratio by which the related word weight calculated in the Web page weight calculation step deviates from the average value,
and wherein the Web page adjustment step adjusts the search result using the ratio calculated in the comparison weight calculation step.

4. The search method according to any one of claims 1 to 3, wherein the computer executes an acquisition step of periodically visiting a Web server connected via the communication network to acquire the Web page to be searched.

5. A search program for causing a computer to execute the steps of the method according to any one of claims 1 to 4.

6. A search server connected to a terminal via a communication network, comprising:
Web page analysis means for analyzing a Web page to be searched and extracting a plurality of feature words indicating features of the Web page;
feature word weight calculation means for calculating, for each of the plurality of feature words, a feature word weight indicating the appearance frequency of the extracted feature word in the Web page;
a related word DB for storing the relevance of a plurality of feature words;
related word determination means for determining, using the related word DB, whether or not the plurality of feature words extracted by the Web page analysis means are related to each other;
Web page weight calculation means for calculating, as a related word weight, the sum of the feature word weights of the feature words determined to be related; and
indexing storage means for storing the calculated related word weight in an indexing DB in association with link data of the Web page.