JP2002245089A

JP2002245089A - Web page retrieval system, secondary information collecting device and interface unit

Info

Publication number: JP2002245089A
Application number: JP2001041492A
Authority: JP
Inventors: Masaru Kamata; 賢鎌田
Original assignee: Hitachi Engineering Co Ltd
Current assignee: Hitachi Engineering Co Ltd
Priority date: 2001-02-19
Filing date: 2001-02-19
Publication date: 2002-08-30

Abstract

PROBLEM TO BE SOLVED: To provide a secondary information collecting device whose accuracy approaches that of manual keyword extraction, and a web page retrieval interface device that automatically adjusts the retrieval condition at re-retrieval. SOLUTION: The character strings written inside or near each anchor tag on a web page are analyzed, and the resulting keywords are stored in a database as features of the link destination. In the web page retrieval device, web page information is obtained from the database based on a keyword and converted into feature vectors. The suitability of each retrieval result is judged, the feature vectors of the suitable and of the unsuitable results are each averaged, a degree of fitness is calculated for each retrieval result, and the results are output in descending order of fitness. Web pages are thus retrieved more accurately.

Description

DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

[Technical Field of the Invention] The present invention relates to a web page search system, and more particularly to a web page search system that treats an anchor tag and the character strings around it as characterizing the linked web page, and that performs searches based on the user's suitable/unsuitable evaluations at search time. It also relates to a secondary information collecting device and an interface device used in that system.

【０００２】[0002]

[Description of the Related Art] A conventional web page search apparatus comprises a database server computer that registers and retrieves summary vocabulary corresponding to the body text of web pages; a web robot that collects web page body text by following links across the Internet; a secondary information collecting device that extracts keywords from the body text and generates pairs of a web page URL and a keyword group; and an interface device that obtains URLs in response to requests from multiple users and presents them to those users. The secondary information collecting device extracts keywords from the body of each web page collected by the web robot, generates a pair of the page's URL and its keyword group, and asks the database server to register it. The interface device, in response to search requests from multiple users, queries the database server, obtains the URLs of web pages that match the search keywords, and presents them to the users.

[0003] Japanese Patent Application Laid-Open No. Hei 10-207756 analyzes the structure of a web page (home page) from its anchor tags and gives priority in analysis to the anchors of home pages that satisfy some search condition. Japanese Patent Application Laid-Open No. Hei 11-39332 describes an image search method in which multiple images are given suitable/unsuitable weights in order to retrieve the image the user wants.

【０００４】[0004]

[Problems to be Solved by the Invention] Conventional secondary information collecting devices have the following problems. Such a device picks, from the words in the body of a web page, those presumed to be representative (for example because they appear in the title or occur frequently) and extracts them as the page's keywords. Because the subject of a web page does not necessarily coincide with the words that occur most often in its body, some pages end up characterized inappropriately. Moreover, by deliberately listing keywords unrelated to a page's subject, an author can even induce an inappropriate characterization.
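The weakness of frequency-based extraction described above can be illustrated with a minimal sketch. The function name `top_keywords`, the tokenizer, and the sample texts are all hypothetical and are not taken from the publication; the sketch only shows how padding a page with an unrelated word can displace its real subject.

```python
import re
from collections import Counter

def top_keywords(body: str, k: int = 3) -> list[str]:
    """Return the k most frequent words of a page body (illustrative only)."""
    words = re.findall(r"[a-z]+", body.lower())
    return [word for word, _ in Counter(words).most_common(k)]

# Hypothetical page bodies: the second is the first padded with an unrelated word.
honest = "fiba driver notes for linux fiba install linux"
stuffed = honest + " casino" * 10

print(top_keywords(honest, 2))   # the words that really describe the page
print(top_keywords(stuffed, 1))  # the padded word now dominates
```

Under these assumptions, a page's own text is an unreliable source of keywords once its author has an incentive to manipulate it.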

[0005] There are also implementations that, on the idea that the number of links from third-party web pages expresses the reputation of the linked page, use the number of inbound links from other domains as a measure of the page's prominence and hence its usefulness. This excludes pages that were deliberately mischaracterized, but the first problem above, unintended mischaracterization, remains. Furthermore, Japanese Patent Application Laid-Open No. Hei 10-207756 prioritizes analysis by analyzing anchor tags, but the degree of relevance between a web page and the features that characterize it remains unknown, so the narrowing of web pages is still insufficient. One way to solve these problems is to attach classification keywords to web pages by hand, but it is impossible to process manually all of the enormous number of web pages that change every day.

[0006] The conventional interface device described above, for its part, has the following problems. It obtains the URLs of web pages that contain more of the search keywords specified by the user and presents them to the user. Because pages with different intent can contain the same keywords, URLs of pages that do not serve the user's purpose are presented. To mitigate this, the feature known as refined search lets the user describe the search condition in detail by specifying required keywords, keywords that must not be included, desirable keywords, and undesirable keywords. With this method, when the user views a URL presented as a search result and judges the page to be off-target, the user must devise a search condition that excludes that URL. Japanese Patent Application Laid-Open No. Hei 11-39332 has the same problem in that the user must decide the weights.
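A conventional refined search of the kind described above might look like the following sketch. The function `refined_search`, its parameter names, and the scoring rule (one point per desirable keyword, minus one per undesirable keyword) are assumptions made for illustration; the publication does not specify how such a search is scored.

```python
def refined_search(pages, required=(), forbidden=(), desired=(), undesired=()):
    """Filter and rank pages given as {url: set_of_keywords} (illustrative only)."""
    hits = []
    for url, words in pages.items():
        if not set(required) <= words:      # every required keyword must appear
            continue
        if set(forbidden) & words:          # no forbidden keyword may appear
            continue
        score = len(set(desired) & words) - len(set(undesired) & words)
        hits.append((score, url))
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return [url for _, url in hits]

# Hypothetical index: "fiba" appears on pages with quite different intent.
pages = {
    "u1": {"fiba", "linux", "driver"},
    "u2": {"fiba", "basketball"},
    "u3": {"fiba", "linux", "windows"},
}
print(refined_search(pages, required=["fiba"], forbidden=["basketball"],
                     desired=["linux"], undesired=["windows"]))
```

The point of the sketch is that the user has to work out the `forbidden` and `undesired` lists by hand after inspecting the results, which is the burden the invention aims to remove.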

[0007] The present invention has been made in view of the above problems. Its object is to provide a secondary information collecting device whose accuracy approaches that of manual keyword extraction, a web page search interface device that automatically adjusts the search conditions at re-search time, and a web page search system using them.

【０００８】[0008]

[Means for Solving the Problems] The present invention discloses a web page search system comprising: a secondary information collecting device that searches for and obtains new or changed web pages serving as link sources, analyzes their anchor tags and the surrounding text, and, based on the result, attaches the analysis result to the link-destination web page indicated by each anchor tag; a database server that maintains and manages the data obtained from the secondary information collecting device and the information on the new or changed link-source web pages, and that persistently updates and accumulates the data; and an interface device that provides search results for the web pages requested by a user, based on the information in the database server.

[0009] The present invention further discloses a web page search system comprising a secondary information collecting device that searches for and obtains new or changed web pages serving as link sources, analyzes their anchor tags and the surrounding text, and, based on the result, attaches the analysis result to the link-destination web page indicated by each anchor tag; a database server that maintains and manages the data obtained from the secondary information collecting device and the information on the new or changed link-source web pages, and that persistently updates and accumulates the data; and an interface device that provides search results for the web pages requested by a user, based on the information in the database server, wherein the secondary information collecting device comprises: new-registration/changed web page search means for finding newly registered or changed web pages; web page acquisition means for acquiring the web pages found by the new-registration/changed web page search means; anchor tag search means for checking whether an anchor tag is present in a web page acquired by the web page acquisition means; anchor tag analysis means for analyzing, when the anchor tag search means finds an anchor tag in the acquired web page, the character strings written inside and near that anchor tag; new-registration/changed web page information change/storage means for recording that analysis of a web page found by the new-registration/changed web page search means has finished; link-destination feature information change/storage means for storing the link-destination URL, the characterizing keywords, and their degrees of relevance obtained by the anchor tag analysis means; and task management means for controlling and managing the exchange of information among, and the order of operation of, the six means listed above.

[0010] The present invention further discloses a web page search system comprising a secondary information collecting device that searches for and obtains new or changed web pages serving as link sources, analyzes their anchor tags and the surrounding text, and, based on the result, attaches the analysis result to the link-destination web page indicated by each anchor tag; a database server that maintains and manages the data obtained from the secondary information collecting device and the information on the new or changed link-source web pages, and that persistently updates and accumulates the data; and an interface device that provides search results for the web pages requested by a user, based on the information in the database server, wherein the interface device comprises: keyword input means for entering a search condition; web page feature storage means in which web page URLs, the keywords characterizing each page, and the degree of relevance of each keyword are set and stored in advance; query/information acquisition means for querying the web page feature storage means for web page information based on the keyword entered through the keyword input means and acquiring that information; feature vector conversion means for converting the web page information obtained by the query/information acquisition means into feature vectors; search result storage means for storing the results obtained by the feature vector conversion means; result display means for outputting the information stored in the search result storage means; suitable/unsuitable input means for entering a suitable or unsuitable judgment for a web page displayed by the result display means; suitable/unsuitable feature vector averaging means for averaging and normalizing the feature vectors of the suitable pages and of the unsuitable pages, using the suitable/unsuitable input means and the search result storage means; fitness calculation means for calculating a degree of fitness from the output of the suitable/unsuitable feature vector averaging means and the web page information stored in the search result storage means; and result sorting means for sorting the web pages stored in the search result storage means in descending order of the fitness calculated by the fitness calculation means.

[0011] The present invention further discloses a secondary information collecting device for a web page search system, which searches for and obtains new or changed web pages serving as link sources, analyzes their anchor tags and the surrounding text, and, based on the result, attaches the analysis result to the link-destination web page indicated by each anchor tag, the device comprising: new-registration/changed web page search means for finding newly registered or changed web pages; web page acquisition means for acquiring the web pages found by the new-registration/changed web page search means; anchor tag search means for checking whether an anchor tag is present in a web page acquired by the web page acquisition means; anchor tag analysis means for analyzing, when the anchor tag search means finds an anchor tag in the acquired web page, the character strings written inside and near that anchor tag; new-registration/changed web page information change/storage means for recording that analysis of a web page found by the new-registration/changed web page search means has finished; link-destination feature information change/storage means for storing the link-destination URL, the characterizing keywords, and their degrees of relevance obtained by the anchor tag analysis means; and task management means for controlling and managing the exchange of information among, and the order of operation of, the six means listed above.

[0012] The present invention further discloses an interface device for a web page search system, which provides search results for the web pages requested by a user based on the information in a database server, the device comprising: keyword input means for entering a search condition; web page feature storage means in which web page URLs, the keywords characterizing each page, and the degree of relevance of each keyword are set and stored in advance; query/information acquisition means for querying the web page feature storage means for web page information based on the keyword entered through the keyword input means and acquiring that information; feature vector conversion means for converting the web page information obtained by the query/information acquisition means into feature vectors; search result storage means for storing the results obtained by the feature vector conversion means; result display means for outputting the information stored in the search result storage means; suitable/unsuitable input means for entering a suitable or unsuitable judgment for a web page displayed by the result display means; suitable/unsuitable feature vector averaging means for averaging and normalizing the feature vectors of the suitable pages and of the unsuitable pages, using the suitable/unsuitable input means and the search result storage means; fitness calculation means for calculating a degree of fitness from the output of the suitable/unsuitable feature vector averaging means and the web page information stored in the search result storage means; and result sorting means for sorting the web pages stored in the search result storage means in descending order of the fitness calculated by the fitness calculation means.

[0013] This configuration yields a secondary information collecting device whose accuracy approaches that of manual keyword extraction, and a web page search interface device that automatically adjusts the search conditions at re-search time.

【００１４】[0014]

[Embodiments of the Invention] The configuration of a web page search apparatus according to one embodiment of the present invention is described below with reference to FIG. 1. The apparatus is broadly divided into three parts: a secondary information collecting device 1100, a database server 1200, and an interface device 1300. The secondary information collecting device 1100 searches for and obtains new or changed web pages serving as link sources, analyzes their anchor tags and the surrounding text, and, based on the result, attaches the analysis result as a feature to the link-destination web page indicated by each anchor tag. It consists of new-registration/changed web page search means 1110, web page acquisition means 1120, anchor tag search means 1130, anchor tag analysis means 1140, and task management means 1150.

[0015] The database server 1200 maintains and manages the data obtained from the secondary information collecting device and the information on new or changed link-source web pages, and persistently updates and accumulates the data. It consists of new-registration/changed web page information change/storage means 1210, link-destination feature information change/storage means 1220, and web page feature storage means 1230, which stores web page features using the storage means 1220 or another means that holds equivalent information.

[0016] The interface device 1300 presents search results for the web pages requested by a user, based on the information in the database server 1200. It consists of keyword input means 1310, query/information acquisition means 1320, feature vector conversion means 1330, result display means 1350, suitable/unsuitable input means 1360, suitable/unsuitable feature vector averaging means 1370, fitness calculation means 1380, and result sorting means 1390, together with an auxiliary search result storage means 1340, which temporarily stores the search results that satisfy the user's request.

[0017] Next, the operation of the secondary information collecting device 1100 and the database server 1200 is described with reference to FIGS. 1, 2, and 3. FIG. 2 is a flowchart of the processing of the secondary information collecting device 1100, and FIG. 3 illustrates the anchor tag analysis means. The new-registration/changed web page search means 1110 detects newly registered or changed web pages (step 2100 in FIG. 2). When such a page is detected, the web page acquisition means 1120 acquires it (step 2200). The anchor tag search means 1130 checks whether the page contains an anchor tag (step 2300). If an anchor tag is found, the anchor tag analysis means 1140 analyzes the anchor tag and the surrounding character strings and extracts the words that serve as keywords together with their degrees of relevance (step 2400). The link-destination feature information change/storage means 1220 stores the link-destination URL together with the keywords and degrees of relevance extracted in step 2400 (step 2500). The new-registration/changed web page information change/storage means 1210 stores in the database server 1200 the URL of the web page acquired in step 2200 and the date and time of processing (step 2600).
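The flow of steps 2100 through 2600 can be sketched roughly as follows. This is only an illustration under stated assumptions: `run_collection`, the callables `fetch` and `analyze`, and the two dictionaries standing in for storage means 1220 and 1210 are hypothetical names, and recording the processing time even for pages without anchors is an assumption, since FIG. 2 is not reproduced in this text.

```python
from datetime import datetime, timezone

def run_collection(changed_urls, fetch, analyze, link_features, processed):
    """Hypothetical driver for steps 2100-2600; all names are illustrative."""
    for url in changed_urls:                          # step 2100: new/changed pages
        html = fetch(url)                             # step 2200: acquire the page
        if "<a" in html.lower():                      # step 2300: crude anchor check
            for dest, keywords in analyze(html):      # step 2400: analyze anchors
                stored = link_features.setdefault(dest, {})
                for word, relevance in keywords.items():
                    stored[word] = stored.get(word, 0) + relevance  # step 2500
        processed[url] = datetime.now(timezone.utc)   # step 2600: record URL and time
```

Here `link_features` maps each link-destination URL to its keywords and degrees of relevance, and `processed` maps each analyzed link-source URL to the time it was handled.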

[0018] The processing in steps 2100 through 2600, and the exchange and ordering of information among them, are controlled and managed by the task management means 1150. Step 2400, in which the anchor tag analysis means 1140 analyzes an anchor tag and the surrounding character strings and extracts keywords and their degrees of relevance, is described in detail below with reference to FIG. 3.

[0019] Suppose there are web page A (3100), web page B (3200), and web page C (3300), and that web pages B and C link to web page A through link B1 (3210) and link C2 (3310), respectively. Suppose further that link B1 is written as the anchor tag shown at 3220 and link C2 as the anchor tag shown at 3320. "The anchor tag and its surroundings" means, as shown for B1 (3220), the range that begins with "<a" and is closed by "</a>". When an image is used as the anchor, the anchor tag may also contain an image tag. For link B1 (3220), the character strings of the anchor tag and its surroundings are "linux/fiba.html", enclosed in double quotation marks, and "Notes on Fiba and Linux" (in the original, "FibaとLinuxについての注意"), enclosed between ">" and "<". Analyzing a character string means splitting it into words, classifying the words under given conditions, and extracting those that serve as keywords. One effective classification method is to extract independent (content) words such as verbs, nouns, adjectives, and adverbs. In this way, words such as those shown at 3230 are extracted as the keywords by which web page B (3200) characterizes web page A (3100), and their degree of relevance (frequency of use) is recorded. Likewise, words such as those shown at 3330 are extracted as the keywords by which web page C (3300) characterizes web page A (3100), and their degree of relevance is recorded.
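The anchor tag analysis of step 2400 could be approximated as in the sketch below, which uses Python's standard `html.parser`. Several things here are assumptions rather than the publication's method: the class `AnchorCollector`, the function `anchor_keywords`, splitting on non-word characters in place of real morphological analysis, and the small `STOPWORDS` set standing in for the content-word filter. The sample anchor uses the English rendering of the text at 3220.

```python
import re
from collections import Counter
from html.parser import HTMLParser

STOPWORDS = {"on", "and", "the", "of", "html", "htm"}  # stand-in for a content-word filter

class AnchorCollector(HTMLParser):
    """Collects (href, anchor text) pairs from <a ...>...</a> ranges."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def anchor_keywords(html):
    """Map each link destination to keyword counts (count used as relevance)."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    features = {}
    for href, text in parser.links:
        words = re.findall(r"\w+", f"{href} {text}".lower())
        counts = Counter(w for w in words if w not in STOPWORDS)
        features.setdefault(href, Counter()).update(counts)
    return {href: dict(counts) for href, counts in features.items()}

print(anchor_keywords('<p><a href="linux/fiba.html">Notes on Fiba and Linux</a></p>'))
```

Japanese anchor text would need a morphological analyzer to split words and identify parts of speech; the regular expression above is only adequate for space-delimited text.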

[0020] The anchor tag analyses above are then merged, so that words such as those shown at 3400, with their degrees of relevance, become the keywords that characterize web page A (3100). The character strings inside and near an anchor tag are generally written by the author of the link-source page as a description of the page being referenced, so they can be expected to come close to a hand-written abstract of the link-destination page. Characterizing pages in this way therefore makes it possible to build a secondary information collecting device whose accuracy approaches that of manual keyword extraction.
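One plausible way to merge the per-source analyses into a single feature for the link destination is to sum the relevance values keyword by keyword. Summation is an assumption (the text says only that the analyses are integrated), and the function name `merge_link_features` and the sample counts are hypothetical, not the contents of 3230, 3330, or 3400.

```python
from collections import Counter

def merge_link_features(per_source):
    """Sum keyword relevance values contributed by each link-source page."""
    merged = Counter()
    for counts in per_source:
        merged.update(counts)   # Counter.update adds counts instead of replacing them
    return dict(merged)

# Hypothetical contributions from pages B and C toward page A.
from_b = {"fiba": 1, "linux": 1}
from_c = {"fiba": 1, "notes": 1}
print(merge_link_features([from_b, from_c]))
```

With summation, a keyword used by many independent link sources accumulates a higher degree of relevance than one used by a single source.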

【００２１】次に図１、図４を用いて本発明のデータベ
ースサーバ１２００とインターフェース装置１３００の
動作について説明する。図４はインターフェース装置１
３００の処理を示すフローチャートである。キーワード
入力手段１３１０は、検索条件を入力するための手段で
あり、例えばマウスやキーボード等により実現可能であ
る。前記キーワード入力手段１３１０により、検索条件
の入力があった場合（図４ステップ４１００）、データ
ベースサーバ１２００内のウェブページ特徴記憶手段１
２３０により記憶された情報（ウェブページＵＲＬ、特
徴となるキーワード及びその関連度）を前記キーワード
で以って問い合わせた後に情報を取得する（問い合わせ
・情報取得手段１３２０、ステップ４２００）。ステッ
プ４２００により得られたページの特徴となるキーワー
ドとそのキーワードの関連度を特徴ベクトル変換手段１
３３０により特徴ベクトルに変換する（ステップ４３０
０）。ステップ４３００で得た検索結果およびステップ
４２００により得られたウェブページのＵＲＬを検索結
果記憶手段１３４０によりある利用者の要求に対する検
索結果として一時的に記憶する（ステップ４４００）。
適合度算出手段１３８０によりステップ４４００で得た
結果と前記キーワードを比較し、適合度を算出する（ス
テップ４５００）。結果並べ替え手段１３９０によりス
テップ４５００で得た適合度の高い順にステップ４４０
０で得た結果を並べ替え、前記検索結果記憶手段１３４
０に再び記憶する（ステップ４６００）。結果表示手段
１３５０により、ステップ４６００で並べ替えた結果を
モニタ等に表示する（ステップ４７００）。ステップ４
７００により、表示された検索結果に基き、前記利用者
はマウスやキーボード等を用いた適・不適入力手段１３
６０により、幾つかの検索結果について、適・不適の評
価を入力する（ステップ４８００）。入力がなければ終
了する。ステップ４８００により、適・不適の入力があ
れば、それぞれのウェブページの特徴ベクトルを平均す
る（ステップ４９００）。ステップ４５００で、ステッ
プ４４００により得られた検索結果及びステップ４１０
０により得られた前記キーワードの代わりに、それぞれ
ステップ４６００により並べ替えられている検索結果
と、ステップ４９００により適・不適のそれぞれの平均
・規格化された特徴ベクトルを用いて再び適合度算出手
段１３８０により適合度を算出する。以後、同様にステ
ップ４７００へ進みステップ４８００で適・不適入力が
ある限り、ステップ４５００からステップ４９００を繰
り返す。
Next, the operation of the database server 1200 and the interface device 1300 of the present invention will be described with reference to the drawings. FIG. 4 is a flowchart showing the processing of the interface device 1300. The keyword input means 1310 is a means for inputting a search condition and can be realized by, for example, a mouse or a keyboard. When a search condition is input through the keyword input means 1310 (step 4100 in FIG. 4), the information stored by the web page feature storage means 1230 in the database server 1200 (web page URL, characterizing keywords, and their relevance) is queried using the keyword and then acquired (inquiry/information acquisition means 1320, step 4200). The keywords characterizing each page obtained in step 4200 and their relevance are converted into a feature vector by the feature vector conversion means 1330 (step 4300). The result obtained in step 4300 and the URL of the web page obtained in step 4200 are temporarily stored in the search result storage means 1340 as the search result for the user's request (step 4400). The fitness calculating means 1380 compares the result of step 4400 with the keyword and calculates the fitness (step 4500). The result sorting means 1390 rearranges the results of step 4400 in descending order of the fitness obtained in step 4500 and stores them again in the search result storage means 1340 (step 4600). The result display means 1350 displays the results sorted in step 4600 on a monitor or the like (step 4700). Based on the search results displayed in step 4700, the user enters a suitable/unsuitable evaluation for some of the search results through the suitable/unsuitable input means 1360, using a mouse, a keyboard, or the like (step 4800). If there is no input, the process ends. If suitable/unsuitable evaluations are entered in step 4800, the feature vectors of the web pages in each group are averaged (step 4900). Then, in step 4500, the fitness calculating means 1380 calculates the fitness again, using the search results sorted in step 4600 in place of the search results obtained in step 4400, and the averaged and normalized suitable and unsuitable feature vectors obtained in step 4900 in place of the keyword obtained in step 4100. Thereafter the process proceeds to step 4700 in the same way, and steps 4500 to 4900 are repeated as long as a suitable/unsuitable input is made in step 4800.
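The control flow of steps 4100 to 4900 can be sketched as a loop. This is a minimal illustration rather than the patented implementation: `run_search_session` and its callable parameters (`fetch`, `score_by_keywords`, `score_by_feedback`, `ask_user`) are hypothetical names introduced here, and the data layout (a page as a URL plus a keyword-to-relevance dictionary) is an assumption.

```python
from typing import Callable, Dict, List, Optional, Tuple

Vector = Dict[str, float]                 # keyword -> relevance (assumed layout)
Page = Tuple[str, Vector]                 # (URL, feature vector)
Feedback = Tuple[List[str], List[str]]    # (suitable URLs, unsuitable URLs)


def run_search_session(
    keywords: List[str],
    fetch: Callable[[List[str]], List[Page]],
    score_by_keywords: Callable[[Vector, List[str]], float],
    score_by_feedback: Callable[[Vector, List[Vector], List[Vector]], float],
    ask_user: Callable[[List[Page]], Optional[Feedback]],
) -> List[Page]:
    """Hypothetical driver for steps 4100-4900; all callables are injected."""
    # Steps 4200-4400: query the database and hold the converted results.
    results = fetch(keywords)
    # The first pass of step 4500 scores each page against the keywords.
    score: Callable[[Vector], float] = lambda v: score_by_keywords(v, keywords)
    while True:
        # Steps 4500-4600: compute the fitness and sort in descending order.
        results = sorted(results, key=lambda page: score(page[1]), reverse=True)
        # Steps 4700-4800: display the results and collect evaluations.
        feedback = ask_user(results)
        if feedback is None:
            return results  # no input, so the process ends
        suitable_urls, unsuitable_urls = feedback
        good = [v for url, v in results if url in suitable_urls]
        bad = [v for url, v in results if url in unsuitable_urls]
        # Step 4900, then step 4500 again: feedback vectors replace the keywords.
        score = lambda v, g=good, b=bad: score_by_feedback(v, g, b)
```

Passing the scoring functions in as parameters keeps the loop independent of how the fitness is actually computed, which the following paragraphs describe.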

【００２２】[0022] Following the flow of processing from step 4100 to step 4900, the feature vector of a web page produced by the feature vector conversion means 1330 and the fitness calculated by the fitness calculating means 1380 will be described in detail with reference to FIGS. 5 and 6. FIG. 5 is an explanatory diagram of the feature vector and the fitness calculation, and FIG. 6 shows a specific example. The description uses FIG. 5 and takes web page A (3100) of FIG. 3 as an example. If web page A (3100) of FIG. 3 is denoted 5000, and the keywords characterizing the web page together with their relevance, as shown at 3400, are denoted 5100, the feature vector can be expressed as 5200. The features of web page A (5000) are expressed as a vector $v_A$ (5300) in a parameter space whose parameters are the keywords $w_i$ (for example, "Fiba"), which are the elements on the left side of 5200. The value of each parameter represents the strength of the association between the page and the parameter, for example $v_A(w = \text{"Fiba"}) = 2$ (the relevance of the keyword "Fiba").
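A sparse mapping is one natural way to hold such a feature vector. The sketch below is an illustration under assumptions: the function names are hypothetical, the keyword "basketball" and its relevance are invented for the example, and treating a keyword that does not characterize the page as having value 0 is an assumption. Only the value 2 for the keyword "Fiba" comes from the text.

```python
from typing import Dict, Iterable, Tuple

FeatureVector = Dict[str, float]  # keyword w_i -> relevance


def to_feature_vector(records: Iterable[Tuple[str, float]]) -> FeatureVector:
    """Convert (keyword, relevance) records for one page into a vector."""
    return {keyword: float(relevance) for keyword, relevance in records}


def component(vector: FeatureVector, keyword: str) -> float:
    """Return v(w = keyword); absent keywords are assumed to have value 0."""
    return vector.get(keyword, 0.0)


# "Fiba" with relevance 2 follows the text; "basketball" is a made-up entry.
v_a = to_feature_vector([("Fiba", 2), ("basketball", 1)])
print(component(v_a, "Fiba"))    # 2.0
print(component(v_a, "soccer"))  # 0.0
```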

【００２３】[0023] The search condition is represented by a vector $d$ (5400) whose values are the weights of the keywords for the search purpose (hereinafter, search purpose keywords) and whose length is normalized. For example, if the search purpose keywords are 「漢方薬」 (Kampo medicine) and 「胃薬」 (stomach medicine), they can be decomposed into keywords such as 「漢方」 (Kampo), 「胃」 (stomach), and 「薬」 (medicine) (hereinafter, sub-search purpose keywords). The sub-search purpose keyword "medicine", which appears in both search purpose keywords, then has twice the weight of "Kampo" or "stomach". The weights are set in a temporary vector $D$ as $D(w=\text{"Kampo"}) = 1$, $D(w=\text{"stomach"}) = 1$, $D(w=\text{"medicine"}) = 2$, and its length

$$\|D\| = \left( D(w=\text{"Kampo"})^2 + D(w=\text{"stomach"})^2 + D(w=\text{"medicine"})^2 \right)^{1/2}$$

is used to normalize the length of $d$ to 1 as $d = D / \|D\|$. The fitness of a page is evaluated as the least-squares approximation coefficient of $v_A$ in the (one-dimensional) subspace spanned by the vector $d$, which here is the length $L_A$ (5500). The length $L_A$ is calculated as the inner product of $d$ and $v_A$,

$$L_A = (d, v_A) = \sum_i d(w=w_i)\, v_A(w=w_i),$$

where the sum runs over all keywords $w_i$. The pages are presented to the user in descending order of this evaluation value.
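The construction of the search condition vector and the first-pass fitness can be sketched as follows. This is a hedged illustration: `normalize`, `inner`, and `query_vector` are hypothetical helper names, the page `v_a` is made up, and counting how often each sub-keyword occurs is an assumed way of arriving at the weights 1, 1, 2 given in the text.

```python
import math
from collections import Counter
from typing import Dict, List

Vector = Dict[str, float]


def normalize(vector: Vector) -> Vector:
    """Scale a vector to length 1: d = D / ||D||."""
    length = math.sqrt(sum(x * x for x in vector.values()))
    if length == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return {w: x / length for w, x in vector.items()}


def inner(a: Vector, b: Vector) -> float:
    """Inner product (a, b): sum over all keywords of a(w) * b(w)."""
    return sum(x * b.get(w, 0.0) for w, x in a.items())


def query_vector(sub_keywords: List[str]) -> Vector:
    """Weight each sub-keyword by its number of occurrences, then normalize."""
    return normalize({w: float(n) for w, n in Counter(sub_keywords).items()})


# "Kampo medicine" -> Kampo + medicine; "stomach medicine" -> stomach + medicine
d = query_vector(["Kampo", "medicine", "stomach", "medicine"])
v_a = {"medicine": 3.0, "stomach": 1.0}  # hypothetical page feature vector
fitness = inner(d, v_a)                  # L_A = (d, v_A)
```

Here $\|D\| = \sqrt{6}$, so the fitness of the hypothetical page is $(2 \cdot 3 + 1 \cdot 1)/\sqrt{6} = 7/\sqrt{6}$.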

【００２４】[0024] The user browses the presented URLs and marks pages $g_1, g_2, \ldots$ (5600) that match the search purpose as "suitable" and pages $b_1, b_2, \ldots$ (5700) that do not as "unsuitable". In general there are several pages of each kind, and the re-search condition is constructed from the vectors $d_g$ (5800) and $d_b$ (5900), obtained by averaging the feature vectors $v_{g1}, v_{g2}, \ldots$ and $v_{b1}, v_{b2}, \ldots$ respectively as vectors and normalizing the length to 1. For every keyword $w_i$, this operation sets (5810)

$$D_g(w=w_i) = v_{g1}(w=w_i) + v_{g2}(w=w_i) + \cdots, \qquad D_b(w=w_i) = v_{b1}(w=w_i) + v_{b2}(w=w_i) + \cdots$$

and, using the lengths (5820)

$$\|D_g\| = \Big( \sum_i D_g(w=w_i)^2 \Big)^{1/2}, \qquad \|D_b\| = \Big( \sum_i D_b(w=w_i)^2 \Big)^{1/2},$$

computes $d_g = D_g / \|D_g\|$ and $d_b = D_b / \|D_b\|$. Here the length is normalized to 1, but as an alternative the vectors can be emphasized by using $d_g = D_g / \|D_g\|^2$ and $d_b = D_b / \|D_b\|^2$.
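The averaging and normalization of one feedback group can be sketched as below. The function name `feedback_direction`, the `emphasize` flag, and the sample vectors are hypothetical; the two scalings correspond to dividing the summed vector by its length or by its squared length, as described above.

```python
import math
from typing import Dict, List

Vector = Dict[str, float]


def feedback_direction(vectors: List[Vector], emphasize: bool = False) -> Vector:
    """Sum the feature vectors of one group and rescale the sum.

    emphasize=False gives d = D / ||D|| (length 1);
    emphasize=True gives the alternative d = D / ||D||^2.
    """
    total: Vector = {}
    for vector in vectors:
        for w, x in vector.items():
            total[w] = total.get(w, 0.0) + x
    norm_sq = sum(x * x for x in total.values())
    if norm_sq == 0.0:
        raise ValueError("at least one non-zero feature vector is required")
    divisor = norm_sq if emphasize else math.sqrt(norm_sq)
    return {w: x / divisor for w, x in total.items()}


suitable = [{"a": 1.0, "b": 2.0}, {"a": 2.0, "b": 2.0}]  # hypothetical pages
d_g = feedback_direction(suitable)                        # D = (3, 4), ||D|| = 5
d_g_emphasized = feedback_direction(suitable, emphasize=True)
```

Because the sum is rescaled afterwards, dividing by the number of pages first would not change the normalized result, which is why the sketch sums rather than averages.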

【００２５】[0025] The fitness for the re-search is defined as the difference between the coefficients of the least-squares approximation $L_g d_g + L_b d_b$ of $v_A$ (5310) in the (two-dimensional) subspace spanned by $d_g$ and $d_b$ (the coefficient $L_b$ of $d_b$ is subtracted from the coefficient $L_g$ of $d_g$). As a result, pages that are similar to $d_g$ and dissimilar to $d_b$ receive a high evaluation. With the inner product of $d$ and $c$ defined as $(d, c) = \sum_i d(w=w_i)\, c(w=w_i)$ over all keywords $w_i$, the least-squares approximation coefficients $L_g$ and $L_b$ are calculated using the biorthogonal basis corresponding to $d_g$ and $d_b$,

$$d_g^{*} = \frac{(d_b, d_b)\, d_g - (d_g, d_b)\, d_b}{(d_g, d_g)(d_b, d_b) - (d_g, d_b)(d_b, d_g)}, \qquad d_b^{*} = \frac{(d_g, d_g)\, d_b - (d_g, d_b)\, d_g}{(d_g, d_g)(d_b, d_b) - (d_g, d_b)(d_b, d_g)},$$

as $L_g = (v_A, d_g^{*})$ and $L_b = (v_A, d_b^{*})$.
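The biorthogonal-basis expressions follow from the normal equations of the least-squares problem. The derivation below is a sketch added for clarity; the symbol $\Delta$ for the common denominator and $c$ for $(d_g, d_b)$ are introduced here, and the final simplification assumes $d_g$ and $d_b$ have length 1 and are not parallel.

```latex
\[
\begin{aligned}
&\min_{L_g,\,L_b}\ \bigl\| v_A - L_g d_g - L_b d_b \bigr\|^2
\;\Longrightarrow\;
\begin{cases}
(d_g, d_g)\,L_g + (d_g, d_b)\,L_b = (v_A, d_g)\\
(d_g, d_b)\,L_g + (d_b, d_b)\,L_b = (v_A, d_b)
\end{cases}\\[4pt]
&\Delta = (d_g, d_g)(d_b, d_b) - (d_g, d_b)^2 \neq 0\\[4pt]
&L_g = \frac{(d_b, d_b)(v_A, d_g) - (d_g, d_b)(v_A, d_b)}{\Delta} = (v_A, d_g^{*}),
\qquad d_g^{*} = \frac{(d_b, d_b)\,d_g - (d_g, d_b)\,d_b}{\Delta}\\[4pt]
&L_b = \frac{(d_g, d_g)(v_A, d_b) - (d_g, d_b)(v_A, d_g)}{\Delta} = (v_A, d_b^{*}),
\qquad d_b^{*} = \frac{(d_g, d_g)\,d_b - (d_g, d_b)\,d_g}{\Delta}\\[4pt]
&\text{If } (d_g, d_g) = (d_b, d_b) = 1 \text{ and } c = (d_g, d_b) \neq 1:\qquad
L_g - L_b = \frac{(v_A,\, d_g - d_b)}{1 - c}
\end{aligned}
\]
```

One can check that $(d_g^{*}, d_g) = 1$ and $(d_g^{*}, d_b) = 0$, and likewise for $d_b^{*}$ with the roles exchanged, which is the defining property of a biorthogonal basis.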

【００２６】[0026] FIG. 6 shows a specific example. Suppose that, when searching with the search purpose keyword 「魚料理」 (fish dishes), a database search with the sub-search purpose keywords "fish" and "dish" returns web page A (6100), web page B (6200), and web page C (6300), and that the keywords characterizing each of these pages and their relevance are 6110, 6210, and 6310 respectively. First, how the feature vectors of the web pages are represented is shown. Since the keywords that overlap among web page A (6100), web page B (6200), and web page C (6300) are "fish", "dish", and "pet", the feature vectors of web page A (6100), web page B (6200), and web page C (6300) are 6120, 6220, and 6320 respectively.

【００２７】[0027] Next, the fitness with respect to the search purpose keyword is compared. The sub-search purpose keywords here are "fish" and "dish"; averaging and normalizing the feature vectors of these two sub-search purpose keywords gives 6400. This is done by taking the sum of the two feature vectors for "fish" and "dish" and scaling the result to length 1.

【００２８】[0028] To compare the position of feature vector 6400 with those of 6120, 6220, and 6320, the vectors are drawn on the same plane, where feature vectors 6400, 6120, 6220, and 6320 become 6410, 6130, 6230, and 6330 respectively. Feature vectors 6130, 6230, and 6330 contain the components of feature vector 6410 in decreasing amounts in the order 6130, 6330, 6230, and lie correspondingly close to it. In other words, web page A, web page C, and web page B, in that order, are close to the search purpose keyword "fish dishes" and contain many of its elements. The least-squares approximation coefficient expresses this numerically and is used as the fitness. Because feature vector 6410 is normalized, the least-squares approximation coefficients of web page A (6100), web page B (6200), and web page C (6300) are the lengths $L_A$, $L_B$, and $L_C$, where $L_A$ is shown as 6140, $L_B$ as 6240 (not drawn because its value is zero), and $L_C$ as 6340. The keyword feature vectors are averaged and normalized so that the evaluation stays constant regardless of the number of keywords. In step 4600, the pages are sorted in the order web page A, web page C, web page B according to the values of the least-squares approximation coefficients.
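The first-pass ranking of this example can be reproduced with a short sketch. The relevance values below are hypothetical, chosen only so that the outcome matches the text (order A, C, B, with $L_B = 0$); the actual numbers of FIG. 6 are not given here, and the helper names are assumptions.

```python
import math
from typing import Dict, List, Tuple

Vector = Dict[str, float]


def normalize(vector: Vector) -> Vector:
    """Scale a vector to length 1."""
    length = math.sqrt(sum(x * x for x in vector.values()))
    if length == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return {w: x / length for w, x in vector.items()}


def inner(a: Vector, b: Vector) -> float:
    """Inner product over all keywords; absent keywords count as 0."""
    return sum(x * b.get(w, 0.0) for w, x in a.items())


def rank(pages: Dict[str, Vector], d: Vector) -> List[Tuple[str, float]]:
    """Return (page, fitness) pairs in descending order of fitness (d, v)."""
    scored = [(name, inner(d, v)) for name, v in pages.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical relevance values for web pages A, B, and C.
pages = {
    "A": {"fish": 2.0, "dish": 2.0},
    "B": {"pet": 3.0},
    "C": {"fish": 1.0, "pet": 2.0},
}
d = normalize({"fish": 1.0, "dish": 1.0})  # sum of "fish" and "dish", length 1
ranking = rank(pages, d)
print([name for name, _ in ranking])  # ['A', 'C', 'B']
```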

【００２９】[0029] When web page A (6100) is marked "suitable" and web page B (6200) and web page C (6300) are marked "unsuitable", the averaged and normalized "suitable" vector $d_g$ is represented by 6500, and the averaged and normalized "unsuitable" vector $d_b$ is represented by 6600. $d_g$ is feature vector 6120 scaled to length 1, and $d_b$ is obtained by taking the sum of feature vectors 6220 and 6320 and scaling it to length 1.

【００３０】[0030] If the feature vector of a web page p is $v_p$ (6700), then, to evaluate $v_p$, it is projected onto the plane spanned by $d_g$ (6510) and $d_b$ (6610), giving $v_p'$ (6710). Because $d_g$ and $d_b$ are normalized, the difference of the least-squares approximation coefficients (the fitness) is $L_g - L_b$ (6720). In this figure, $v_p$ is close to $d_g$ (that is, suitable).
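The re-search fitness $L_g - L_b$ can be sketched for the same situation, with web page A marked suitable and web pages B and C marked unsuitable. The relevance values are hypothetical, the function names are assumptions, and the coefficients are obtained by solving the two-by-two normal equations, which is equivalent to taking inner products with the biorthogonal basis.

```python
import math
from typing import Dict, List, Tuple

Vector = Dict[str, float]


def inner(a: Vector, b: Vector) -> float:
    """Inner product over all keywords; absent keywords count as 0."""
    return sum(x * b.get(w, 0.0) for w, x in a.items())


def unit_sum(vectors: List[Vector]) -> Vector:
    """Sum the given vectors and scale the sum to length 1."""
    total: Vector = {}
    for vector in vectors:
        for w, x in vector.items():
            total[w] = total.get(w, 0.0) + x
    length = math.sqrt(inner(total, total))
    if length == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return {w: x / length for w, x in total.items()}


def coefficients(v: Vector, d_g: Vector, d_b: Vector) -> Tuple[float, float]:
    """Least-squares coefficients (L_g, L_b) of v in the span of d_g and d_b."""
    gg, bb, gb = inner(d_g, d_g), inner(d_b, d_b), inner(d_g, d_b)
    det = gg * bb - gb * gb
    if abs(det) < 1e-12:
        raise ValueError("d_g and d_b are (nearly) parallel")
    vg, vb = inner(v, d_g), inner(v, d_b)
    return (bb * vg - gb * vb) / det, (gg * vb - gb * vg) / det


def fitness(v: Vector, d_g: Vector, d_b: Vector) -> float:
    """Re-search fitness: L_g minus L_b."""
    l_g, l_b = coefficients(v, d_g, d_b)
    return l_g - l_b


# Hypothetical relevance values for web pages A, B, and C.
pages = {
    "A": {"fish": 2.0, "dish": 2.0},
    "B": {"pet": 3.0},
    "C": {"fish": 1.0, "pet": 2.0},
}
d_g = unit_sum([pages["A"]])                 # "suitable" direction
d_b = unit_sum([pages["B"], pages["C"]])     # "unsuitable" direction
scores = {name: fitness(v, d_g, d_b) for name, v in pages.items()}
```

With these values page A keeps the highest fitness, while B and C, which resemble the unsuitable direction, receive negative scores.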

【００３１】[0031]

[Effects of the Invention] According to the present invention, it is possible to provide a secondary information collecting device whose accuracy is comparable to that of manual keyword extraction, and a web page search interface device that automatically adjusts the search conditions at the time of re-search, so that web pages can be searched more accurately.

[Brief description of the drawings]

FIG. 1 is a configuration diagram of the present invention.

FIG. 2 is a diagram explaining the operation of the secondary information collecting device and the database server of the present invention.

FIG. 3 is an explanatory diagram of the anchor tag analysis means of the present invention.

FIG. 4 is a diagram explaining the operation of the web page search interface device and the database server of the present invention.

FIG. 5 is an explanatory diagram of the feature vector and the fitness calculating means of the present invention.

FIG. 6 is a specific example of the fitness calculating means of the present invention.

[Explanation of symbols]

1100 Secondary information collecting device
1110 New registration/change web page search means
1120 Web page acquisition means
1130 Anchor tag search means
1140 Anchor tag analysis means
1150 Task management means
1200 Database server
1210 New registration/change web page information change/storage means
1220 Link destination feature information change/storage means
1230 Web page feature storage means
1300 Interface device
1310 Keyword input device
1320 Inquiry/information acquisition means
1330 Feature vector conversion means
1340 Search result storage means
1350 Result display means
1360 Suitable/unsuitable input means
1370 Suitable/unsuitable feature vector averaging means
1380 Fitness calculating means
1390 Result sorting means
3100 Web page A
3200 Web page B
3210 Link B1
3220 Anchor tag of link B1
3230 Words extracted from the anchor tag of link B1
3300 Web page C
3310 Link C2
3320 Anchor tag of link C2
3330 Words extracted from the anchor tag of link C2
3400 Relevance of web page A
5100 Keywords characterizing a web page and their relevance
5200 Feature vector
5300 Vector $v_p$ in the parameter space
5400 Vector $d$ whose values are the weights of the keywords for the search purpose (length normalized to 1)
5500 Length $L_A$ of the feature vector $v_A$ of web page A in the direction of $d$
5600 Suitable pages
5700 Unsuitable pages
5800 "Suitable" vector, averaged as a vector and normalized to length 1
5810 Relationship between vector $D$, keyword $w_i$, and relevance $g_n$
5820 Length of vector $D$
5900 "Unsuitable" vector, averaged as a vector and normalized to length 1
6100 Web page A
6110 Keywords characterizing web page A and their relevance
6120 Feature vector of web page A
6130 Feature vector of web page A
6140 Fitness $L_A$
6200 Web page B
6210 Keywords characterizing web page B and their relevance
6220 Feature vector of web page B
6230 Feature vector of web page B
6240 Fitness $L_B$
6300 Web page C
6310 Keywords characterizing web page C and their relevance
6320 Feature vector of web page C
6330 Feature vector of web page C
6340 Fitness $L_C$
6500 Averaged and normalized "suitable" vector $d_g$
6510 Averaged and normalized "suitable" vector $d_g$
6600 Averaged and normalized "unsuitable" vector $d_b$
6610 Averaged and normalized "unsuitable" vector $d_b$
6700 Feature vector $v_p$ of web page p
6710 Vector $v_p'$ obtained by projecting the feature vector $v_p$ onto the plane generated by $d_g$ and $d_b$
6720 Difference of the least-squares approximation coefficients (fitness)

Continuation of front page: (51) Int.Cl.⁷ G06F 13/00 (identification symbol 540); FI G06F 13/00 540E

Claims

[Claims]

1. A web page search system comprising: a secondary information collecting device that searches for and acquires new or changed web pages serving as link sources, analyzes each anchor tag and its surroundings, and, based on the result, assigns the analysis result to the link-destination web page indicated by the anchor tag; a database server that maintains and manages the data obtained from the secondary information collecting device and the information on the new or changed web pages serving as link sources, and permanently updates and accumulates the data; and an interface device that provides search results for web pages requested by a user based on the information in the database server.

2. A web page search system comprising: a secondary information collecting device that searches for and acquires new or changed web pages serving as link sources, analyzes each anchor tag and its surroundings, and, based on the result, assigns the analysis result to the link-destination web page indicated by the anchor tag; a database server that maintains and manages the data obtained from the secondary information collecting device and the information on the new or changed web pages serving as link sources, and permanently updates and accumulates the data; and an interface device that provides search results for web pages requested by a user based on the information in the database server, the system comprising: new registration/change web page search means for searching for newly registered or changed web pages; web page acquisition means for acquiring the web pages found by the new registration/change web page search means; anchor tag search means for searching the web page acquired by the web page acquisition means for the presence or absence of anchor tags; anchor tag analysis means for analyzing the character strings written inside and around an anchor tag when the anchor tag search means finds that an anchor tag exists in the web page acquired by the web page acquisition means; new registration/change web page information change/storage means for storing the fact that analysis of the web page obtained from the new registration/change web page search means has been completed; link destination feature information change/storage means for storing the link destination URL, the characterizing keywords, and their relevance obtained by the anchor tag analysis means; and task management means for controlling and managing the exchange and ordering of information among the new registration/change web page search means, the web page acquisition means, the anchor tag search means, the anchor tag analysis means, the new registration/change web page information change/storage means, and the link destination feature information change/storage means.

3. A web page search system comprising: a secondary information collecting device that searches for and acquires new or changed web pages serving as link sources, analyzes each anchor tag and its surroundings, and, based on the result, assigns the analysis result to the link-destination web page indicated by the anchor tag; a database server that maintains and manages the data obtained from the secondary information collecting device and the information on the new or changed web pages serving as link sources, and permanently updates and accumulates the data; and an interface device that provides search results for web pages requested by a user based on the information in the database server, wherein the interface device comprises: keyword input means for inputting search conditions; web page feature storage means in which the URL of each web page, the keywords characterizing the page, and the relevance of those keywords are set and stored in advance; inquiry/information acquisition means for querying the web page feature storage means for web page information based on the keyword input through the keyword input means and acquiring that information; feature vector conversion means for converting the web page information obtained by the inquiry/information acquisition means into feature vectors; search result storage means for storing the results obtained by the feature vector conversion means; result display means for outputting the information stored in the search result storage means; suitable/unsuitable input means for inputting a suitable or unsuitable evaluation for the web pages displayed by the result display means; suitable/unsuitable feature vector averaging means for averaging and normalizing the suitable feature vectors and the unsuitable feature vectors, respectively, based on the suitable/unsuitable input means and the search result storage means; fitness calculating means for calculating a fitness based on the suitable/unsuitable feature vector averaging means and the web page information stored in the search result storage means; and result sorting means for rearranging the web pages stored in the search result storage means in descending order of the fitness calculated by the fitness calculating means.

4. A secondary information collecting device in a web page search system, the device searching for and acquiring new or changed web pages serving as link sources, analyzing each anchor tag and its surroundings, and, based on the result, assigning the analysis result to the link-destination web page indicated by the anchor tag, the device comprising: new registration/change web page search means for searching for newly registered or changed web pages; web page acquisition means for acquiring the web pages found by the new registration/change web page search means; anchor tag search means for searching the web page acquired by the web page acquisition means for the presence or absence of anchor tags; anchor tag analysis means for analyzing the character strings written inside and around an anchor tag when the anchor tag search means finds that an anchor tag exists in the web page acquired by the web page acquisition means; new registration/change web page information change/storage means for storing the fact that analysis of the web page obtained from the new registration/change web page search means has been completed; link destination feature information change/storage means for storing the link destination URL, the characterizing keywords, and their relevance obtained by the anchor tag analysis means; and task management means for controlling and managing the exchange and ordering of information among the new registration/change web page search means, the web page acquisition means, the anchor tag search means, the anchor tag analysis means, the new registration/change web page information change/storage means, and the link destination feature information change/storage means.

5. An interface device in a web page search system that provides search results for web pages requested by a user based on information in a database server, the interface device comprising: keyword input means for inputting search conditions; web page feature storage means in which the URL of each web page, the keywords characterizing the page, and the relevance of those keywords are set and stored in advance; inquiry/information acquisition means for querying the web page feature storage means for web page information based on the keyword input through the keyword input means and acquiring that information; feature vector conversion means for converting the web page information obtained by the inquiry/information acquisition means into feature vectors; search result storage means for storing the results obtained by the feature vector conversion means; result display means for outputting the information stored in the search result storage means; suitable/unsuitable input means for inputting a suitable or unsuitable evaluation for the web pages displayed by the result display means; suitable/unsuitable feature vector averaging means for averaging and normalizing the suitable feature vectors and the unsuitable feature vectors, respectively, based on the suitable/unsuitable input means and the search result storage means; fitness calculating means for calculating a fitness based on the suitable/unsuitable feature vector averaging means and the web page information stored in the search result storage means; and result sorting means for rearranging the web pages stored in the search result storage means in descending order of the fitness calculated by the fitness calculating means.