JP2003108595A

JP2003108595A - Information retrieving device, information retrieving method and information retrieving program

Info

Publication number: JP2003108595A
Application number: JP2001302623A
Authority: JP
Inventors: Masaru Kitsuregawa (喜連川 優); Takayuki Tamura (田村 孝之)
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2001-09-28
Filing date: 2001-09-28
Publication date: 2003-04-11
Anticipated expiration: 2021-09-28
Also published as: JP3898016B2

Abstract

PROBLEM TO BE SOLVED: When a search returns multiple results, their order does not necessarily reflect how well each result suits the purpose of the search, so the user must check the validity of each result in turn, and this work takes a long time as the number of results grows. SOLUTION: Character strings having the same attribute as a search keyword and character strings having attributes different from the search keyword are extracted from the web pages collected by a WWW information collection unit 24, and a score is set for each web page according to the extent to which such character strings are described on that page.

Description

Detailed Description of the Invention

[0001]

[Field of the Invention] The present invention relates to an information retrieval device, an information retrieval method, and an information retrieval program for retrieving documents (web pages) that match a search condition.

[0002]

[Description of the Related Art] Today, a wide variety of information is published on the Internet through the rapidly spreading WWW (World Wide Web) service, but large amounts of information are flooding out in a disorderly manner, a phenomenon sometimes called an information flood. The address of a piece of information on the WWW is expressed as a URL (Uniform Resource Locator), but because URLs follow no regular naming scheme and often change, WWW users rarely specify a URL directly.

[0003] Accordingly, it has been reported that in nearly 90% of cases users reach the target URL either by following hyperlinks in other web pages written in HTML (HyperText Markup Language) or by using a search engine that returns the URLs of web pages containing specified keywords (GVU Center, Georgia Institute of Technology, 10th WWW User Survey, 1998, http://www.cc.gatech.edu/gvu/user_surveys/survey-1998-10/).

[0004] FIG. 25 is a block diagram showing a conventional information retrieval device. In the figure, 1 is a WWW server, 2 is the Internet, 3 is a user terminal, and 4 is an Internet search engine that provides the user terminal 3 with a list of URLs matching the keywords of a search condition. 5 is a WWW information collection unit that collects a large number of web pages from the WWW server 1; 6 is an index generation unit that generates an index for keyword search from copies of the web pages collected by the WWW information collection unit 5; 7 is an index storage device that stores the index generated by the index generation unit 6; and 8 is a search query processing unit that, on receiving a search condition from the user terminal 3, refers to the index stored in the index storage device 7 and generates a list of URLs matching the search condition.

[0005] Next, the operation will be described. First, the WWW information collection unit 5 collects a large number of web pages from the WWW server 1. Specifically, once the URL of a web page has been set as the initial value, the unit downloads that web page. It then extracts the HTML hyperlinks described in the page and downloads the web pages linked from it. By repeating this process, it downloads web pages one after another while following links from page to page.

[0006] When the WWW information collection unit 5 has collected a large number of web pages, the index generation unit 6 generates an index for keyword search from copies of those pages. Specifically, the index generation unit 6 separates the HTML tags from the ordinary text of each web page collected by the WWW information collection unit 5. It then splits the sentences of the ordinary text into words and generates index information for each lexical element (word). FIG. 26 is an explanatory diagram showing an example of index information; in this example, the index information lists, for each word, the URLs of the web pages in which that word appears.

[0007] On receiving a search condition from the user terminal 3, the search query processing unit 8 retrieves the relevant index entries stored in the index storage device 7 and generates a list of URLs matching the search condition. If the search condition specifies that all of several keywords must be included, the unit generates the URL list by taking the intersection of the URL sets of the corresponding entries. If the search condition specifies that at least one of several keywords must be included, it generates the URL list by taking the union of the URL sets of the corresponding entries.

[0008] For example, given the index information of FIG. 26, a search for pages containing both "Mitsubishi" and "Tokyo" yields {URL1}, the intersection of {URL1, URL3} and {URL1, URL2, URL4, URL5}. Similarly, a search for pages containing either "Mitsubishi" or "Tokyo" yields {URL1, URL2, URL3, URL4, URL5}.
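The intersection and union steps of the conventional search query processing can be sketched as follows. This is a minimal illustration, not the patent's implementation: the dictionary `INDEX` and the function names `search_all` and `search_any` are hypothetical, and the assignment of each URL set to each keyword is assumed from the order in which the text lists them.

```python
# Hypothetical inverted index modeled on the FIG. 26 example:
# each word maps to the set of URLs of the pages in which it appears.
INDEX = {
    "Mitsubishi": {"URL1", "URL3"},
    "Tokyo": {"URL1", "URL2", "URL4", "URL5"},
}


def search_all(index, keywords):
    """AND condition: URLs of pages that contain every keyword."""
    url_sets = [index.get(word, set()) for word in keywords]
    if not url_sets:
        return set()
    return set.intersection(*url_sets)


def search_any(index, keywords):
    """OR condition: URLs of pages that contain at least one keyword."""
    return set().union(*(index.get(word, set()) for word in keywords))
```

A keyword that is missing from the index contributes an empty set, so an AND search that includes it returns no URLs, while an OR search simply ignores it.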

[0009] In practice, however, when searching in order to identify a company's home page, the search must be repeated under several different conditions in order to narrow the results down to the most appropriate one while avoiding missed hits. Specifically, as shown in FIG. 27, the first condition is the company's telephone number (the number as written, separated by hyphens); the second condition is the company's telephone number (an AND condition over the parts that remain when the hyphens are removed); the third condition is an AND condition over the full company name and part of its location; and the fourth condition is an AND condition over part of the company name and part of its location.

[0010] The user searches under the first condition and, if one or more results are obtained, either selects the most appropriate result among them or judges that there is no appropriate result, and ends the search. If the first condition yields no results, the user searches under the second condition and makes the same judgment. The same applies to the third and fourth conditions; if the search under the fourth condition yields no results, the user judges that there is no appropriate result and ends the search.
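The manual procedure described above amounts to trying the conditions in a fixed order and stopping at the first one that returns anything. A minimal sketch follows, assuming a caller-supplied `run_query` function that stands in for the search engine; both `cascade_search` and `run_query` are hypothetical names introduced only for illustration.

```python
def cascade_search(conditions, run_query):
    """Try each search condition in order and stop at the first non-empty result.

    conditions: search conditions ordered from strictest to loosest.
    run_query:  hypothetical callable that returns a list of URLs for a condition.
    Returns (condition_number, results); (None, []) if every condition fails.
    """
    for number, condition in enumerate(conditions, start=1):
        results = run_query(condition)
        if results:
            # The user would now inspect these results by hand.
            return number, results
    return None, []
```

Even in this form, the user still has to inspect the returned results by hand, which is the burden the invention aims to remove.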

[0011]

[Problems to be Solved by the Invention] Because the conventional information retrieval device is configured as described above, when multiple search results are obtained, their order does not necessarily match their validity for the purpose of the search. The user therefore has to check the validity of each search result in turn, and the work takes a long time when there are many results. Moreover, obtaining as few results as possible requires repeating the search under several search conditions, which also takes a long time.

[0012] The present invention has been made to solve the above problems, and its object is to obtain an information retrieval device, an information retrieval method, and an information retrieval program that make it possible to omit the user's check of the validity of search results.

[0013]

[Means for Solving the Problems] The information retrieval device according to the present invention is provided with rank determination means that determines the rank of each web page in which a character string found by the search means is described, by referring to the score of that web page.

[0014] The information retrieval device according to the present invention is provided with rank determination means that determines the rank of each web page collected by the collection means by referring to the score of that web page.

[0015] In the information retrieval device according to the present invention, the rank determination means outputs the addresses of the web pages in descending order of rank.

[0016] In the information retrieval device according to the present invention, the rank determination means outputs partial contents of the web pages in descending order of rank.

[0017] In the information retrieval device according to the present invention, the rank determination means determines the rank of the WWW server to which a web page belongs on the basis of the score of that web page.

[0018] In the information retrieval device according to the present invention, the search means normalizes the character strings extracted by the extraction means and the search keyword.

[0019] In the information retrieval device according to the present invention, the search means uses a telephone number as the search keyword.

[0020] The information retrieval device according to the present invention is provided with rank determination means that determines the rank of each web page collected by the collection means by referring to the score of that web page, and changes the address of the web page to the address of a higher-level web page.

[0021] The information retrieval device according to the present invention is provided with search means that searches, among the character strings extracted by the extraction means, for character strings that do not match the search keyword.

[0022] In the information retrieval method according to the present invention, a character string matching the search keyword is searched for, and the rank of each web page in which that character string is described is determined by referring to the score of that web page.

[0023] In the information retrieval method according to the present invention, the rank of each collected web page is determined by referring to the score of that web page.

[0024] In the information retrieval method according to the present invention, the addresses of the web pages are output in descending order of rank.

[0025] In the information retrieval method according to the present invention, partial contents of the web pages are output in descending order of rank.

[0026] In the information retrieval method according to the present invention, the rank of the WWW server to which a web page belongs is determined on the basis of the score of that web page.

[0027] In the information retrieval method according to the present invention, the extracted character strings and the search keyword are normalized.

[0028] In the information retrieval method according to the present invention, a telephone number is used as the search keyword.

[0029] In the information retrieval method according to the present invention, the rank of each collected web page is determined by referring to the score of that web page, and the address of the web page is changed to the address of a higher-level web page.

[0030] In the information retrieval method according to the present invention, among the extracted character strings, character strings that do not match the search keyword are searched for.

[0031] The information retrieval program according to the present invention is provided with a rank determination processing procedure that determines the rank of each web page in which a character string found by the search processing procedure is described, by referring to the score of that web page.

[0032] The information retrieval program according to the present invention is provided with a rank determination processing procedure that determines the rank of each web page collected by the collection processing procedure by referring to the score of that web page.

[0033] The information retrieval program according to the present invention is provided with a rank determination processing procedure that determines the rank of each web page collected by the collection processing procedure by referring to the score of that web page, and changes the address of the web page to the address of a higher-level web page.

[0034] The information retrieval program according to the present invention is provided with a search processing procedure that searches, among the character strings extracted by the extraction processing procedure, for character strings that do not match the search keyword.

[0035]

[Embodiments of the Invention] An embodiment of the present invention is described below.

Embodiment 1. FIG. 1 is a block diagram showing an information retrieval device according to Embodiment 1 of the present invention. In the figure, 21 is a WWW server, 22 is the Internet, and 23 is an Internet information retrieval system. 24 is a WWW information collection unit (collection means) that accesses the WWW server 21 and collects web pages from it. 25 is an information extraction unit (extraction means) that extracts, from the web pages collected by the WWW information collection unit 24, character strings having the same attribute as a search keyword (for example, character strings indicating a telephone number, a name, or a location) and character strings having attributes different from the search keyword, and that sets a score for each web page according to the extent to which such character strings are described on that page. 26 is an extracted information storage device that stores an extracted information list in which the character strings extracted by the information extraction unit 25 and the scores are associated with the URLs of the web pages.

[0036] 27 is a search target information storage device that stores search target information including search keywords that designate search targets. 28 is a join operation processing unit (search means) that searches the character strings extracted by the information extraction unit 25 for character strings matching a search keyword stored in the search target information storage device 27. 29 is a ranking processing unit (rank determination means) that determines the rank of each web page in which a character string found by the join operation processing unit 28 is described, by referring to the score of that web page, and outputs the URLs of the web pages in descending order of rank. 30 is a result information storage device that stores the URLs of the web pages output from the ranking processing unit 29.

[0037] FIG. 2 is a block diagram showing the internal configuration of the WWW information collection unit 24. In the figure, 24a is an acquisition request URL queue that holds the URLs of the web pages to be collected. 24b is a download unit that downloads, from the WWW server 21, the web pages for the URLs held in the acquisition request URL queue 24a. 24c is an acquired URL storage device that stores a list of the URLs of the web pages already downloaded by the download unit 24b, so that the download unit 24b does not download the same web page twice. 24d is a URL content storage device that stores the web pages downloaded by the download unit 24b. 24e is a link extraction unit that extracts HTML hyperlinks from the web pages stored in the URL content storage device 24d and inserts the URLs of the link destinations into the acquisition request URL queue 24a.

[0038] FIG. 3 is a block diagram showing the internal configuration of the information extraction unit 25. In the figure, 25a is an HTML analysis unit that separates the HTML tags from the ordinary text of each web page stored in the URL content storage device 24d. 25b is a lexical element analysis unit that splits the sentences of the text separated by the HTML analysis unit 25a into words and performs morphological analysis to attach part-of-speech information to each word. 25c is a syntax analysis unit that refers to syntax rules to find character strings having the same attribute as the search keyword and character strings having attributes different from the search keyword. 25d is an extracted information management unit that sets the score of each web page according to the extent to which such character strings are described on it, and generates an extracted information list in which the character strings and the score are associated with the URL of the web page.

[0039] Next, the operation will be described. Here it is assumed that the search target information storage device 27 stores, as search target information, company information including a name, a location, and a telephone number, as shown in FIG. 7. That is, to allow multiple search conditions to be specified, it stores a list of one or more search conditions, each consisting of one or more items of search target information. FIG. 7 shows only one search condition.

[0040] First, as shown in FIG. 4, when the URL of a known WWW server 21 is set in the acquisition request URL queue 24a as the initial value (step ST11), the download unit 24b of the WWW information collection unit 24 takes that URL from the acquisition request URL queue 24a (step ST12). It then refers to the URL list stored in the acquired URL storage device 24c to check whether the web page for that URL has already been downloaded (step ST13). At this stage no web page has been downloaded yet, so it determines that the page for that URL has not been downloaded. To download many web pages, it is desirable to specify as the initial URL a WWW server that contains links to many WWW servers 21.

[0041] If the web page for the URL has not been downloaded, the download unit 24b obtains the IP address of the WWW server 21 corresponding to the URL (step ST14), issues an HTTP GET request to that WWW server 21 (step ST15), and thereby downloads the web page from the WWW server 21 and stores it in the URL content storage device 24d (step ST16).

[0042] When the download unit 24b has stored the web page in the URL content storage device 24d, the link extraction unit 24e adds the URL of that web page to the URL list stored in the acquired URL storage device 24c (step ST17), extracts link information from the content of the web page, and inserts the URLs of the link destinations into the acquisition request URL queue 24a (step ST18).

[0043] The download unit 24b determines whether a new URL has been inserted into the acquisition request URL queue 24a, and ends the process if the queue is empty. If a new URL has been inserted, it returns to step ST12 and repeats steps ST12 to ST18 (step ST19).
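The collection loop of steps ST11 to ST19 can be sketched roughly as below. This is an illustrative outline under stated assumptions rather than the patented implementation: the `fetch` callable stands in for the name resolution and HTTP GET of steps ST14 to ST16, the `max_pages` limit is an added safety bound that the text does not mention, and the class and function names are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects href values of <a> tags (the role of link extraction unit 24e)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first collection of pages starting from seed_url.

    fetch: hypothetical callable taking a URL and returning the page's HTML text.
    """
    queue = deque([seed_url])   # acquisition request URL queue 24a (ST11)
    acquired = set()            # acquired URL storage device 24c
    contents = {}               # URL content storage device 24d
    while queue and len(contents) < max_pages:  # ST19: stop when the queue is empty
        url = queue.popleft()                   # ST12
        if url in acquired:                     # ST13: skip pages already downloaded
            continue
        html = fetch(url)                       # ST14 to ST16
        contents[url] = html
        acquired.add(url)                       # ST17
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:               # ST18: enqueue link destinations
            queue.append(urljoin(url, href))
    return contents
```

Passing the download step in as a function keeps the sketch runnable without network access; a real collector would also need error handling and politeness controls, which the text does not discuss.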

[0044] When the download unit 24b has stored web pages in the URL content storage device 24d, the HTML analysis unit 25a of the information extraction unit 25 separates the HTML tags from the ordinary text of each web page. When the HTML analysis unit 25a has separated out the text, the lexical element analysis unit 25b splits the sentences of that text into words and performs morphological analysis to attach part-of-speech information to each word. The part-of-speech information includes not only broad categories such as "noun", "particle", and "verb" but also finer detail such as "noun-proper noun-place name", and it is used in the syntax rules referred to by the syntax analysis unit 25c.
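The tag and text separation performed by the HTML analysis unit 25a might look like the following sketch, which uses Python's standard `html.parser` module. The class name `TextSeparator` and the function `separate` are hypothetical, and the subsequent word splitting and morphological analysis are left out because they depend on a language-specific analyzer that the text does not name.

```python
from html.parser import HTMLParser


class TextSeparator(HTMLParser):
    """Separates HTML start tags from the ordinary text of a page."""

    def __init__(self):
        super().__init__()
        self.tags = []        # names of the start tags encountered, in order
        self.text_parts = []  # non-empty runs of ordinary text, in order

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


def separate(html):
    """Return (tags, text_parts) for one web page."""
    parser = TextSeparator()
    parser.feed(html)
    parser.close()
    return parser.tags, parser.text_parts
```

Only the text parts would be handed on to the lexical element analysis unit 25b.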

[0045] From the sequence of words and parts of speech produced by the lexical element analysis unit 25b, the syntax analysis unit 25c extracts the subsequences that match the recognition patterns specified in the syntax rules (see FIG. 9) and outputs each character string together with the name of the matched recognition pattern. In FIG. 9, the recognition patterns <telephone number>, <name>, and <location> are described, so character strings considered to be a telephone number, a name, or a location are output in the form "<telephone number> string value", "<name> string value", or "<location> string value" respectively, and no output is generated for other character strings. A recognition pattern corresponds to an "attribute" representing the meaning of a character string, and the output of the syntax analysis unit 25c is in general a sequence of pairs of a string attribute and a string value.

[0046] The syntax rules can be written, for example, in ABNF (Augmented Backus-Naur Form) as defined in RFC 2234 (http://rfc.net/rfc2234.html). Recognition according to the syntax rules can be implemented using LR parsing, a well-known compiler technique. Input that does not match any pattern described in the syntax rules results in an error, but here such errors are ignored and nothing is output.
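To make the attribute and value output concrete, the sketch below recognizes only the <telephone number> pattern, and it substitutes a regular expression for the ABNF rules and LR parser that the text describes. The pattern itself is an assumption about what a Japanese-style telephone number looks like; it is not taken from FIG. 9, and names and locations are omitted because recognizing them needs the part-of-speech information.

```python
import re

# Assumed shape of a telephone number: an area code starting with 0 (optionally
# in parentheses), an exchange number, a hyphen, and a four-digit subscriber number.
PHONE_PATTERN = re.compile(r"\(?0\d{1,4}\)?[-\s]?\d{1,4}-\d{4}")


def extract_attributes(text):
    """Return (attribute, value) pairs for every substring matching a pattern.

    Text that matches no pattern produces no output, mirroring the way
    the syntax analysis unit ignores parse errors.
    """
    return [("telephone number", match.group(0))
            for match in PHONE_PATTERN.finditer(text)]
```

Each pair corresponds to one "<telephone number> string value" item in the output format described above.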

[0047] On the basis of the output of the syntax analysis unit 25c, the extracted information management unit 25d sets the score of each web page according to the number of distinct values and the frequency of occurrence of the string attributes and string values contained in that page. Specifically, it sets the score by referring to a correspondence table such as the one shown in FIG. 11, which gives scores according to string attribute and string value. For example, if a web page yields 1 distinct string value for the attribute <telephone number>, 4 distinct string values for the attribute <name>, and 2 distinct string values for the attribute <location>, the score of the web page is set to 90.

[0048] In the example of FIG. 11, the score is lowered for web pages that contain no string value for a given string attribute. This is to exclude pages whose description is sparse and that do not appear intended to convey official contact details. The score is also lowered for web pages that contain too many distinct string values for a given string attribute. This is to exclude pages published by third parties, such as directories; for example, the score is lowered for web pages that list four or more distinct telephone numbers.
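The scoring idea can be illustrated with the sketch below. The table of FIG. 11 is not reproduced in the text, so every number here is an invented placeholder: the base score, the two penalties, and the threshold of four distinct values applied uniformly to all attributes are assumptions, chosen only so that the single worked example given in the text (1 telephone number, 4 names, 2 locations) also comes out at 90.

```python
ATTRIBUTES = ("telephone number", "name", "location")

# All values below are hypothetical placeholders, not taken from FIG. 11.
BASE_SCORE = 100
MISSING_PENALTY = 30      # an attribute has no value on the page
TOO_MANY_PENALTY = 10     # an attribute has too many distinct values
TOO_MANY_THRESHOLD = 4    # "four or more" distinct values


def page_score(extracted):
    """Score one page from its list of (attribute, value) pairs."""
    distinct = {attribute: set() for attribute in ATTRIBUTES}
    for attribute, value in extracted:
        if attribute in distinct:
            distinct[attribute].add(value)
    score = BASE_SCORE
    for values in distinct.values():
        if not values:
            score -= MISSING_PENALTY       # sparse page, probably not official
        elif len(values) >= TOO_MANY_THRESHOLD:
            score -= TOO_MANY_PENALTY      # directory-like page by a third party
    return max(score, 0)
```

A table lookup as in FIG. 11 could replace the penalty arithmetic without changing the interface.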

[0049] Having set the score of a web page as described above, the extracted information management unit 25d generates, from the URL and score of the web page and the sequence of string attribute and string value pairs (corresponding to one row of FIG. 10(a)), extracted information index entries keyed on the string values of a particular string attribute, and stores them in the extracted information storage device 26 (see FIG. 10(b)).

[0050] Here, the telephone number is used as the key string attribute, and its value is a normalized telephone number consisting only of digits, with hyphens, parentheses, spaces, and the like removed. Normalized telephone numbers are used because they are expected to show less variation (ambiguity) in notation than names or locations. In general, zero or more extracted information index entries are generated for one web page. For each key (normalized telephone number), if no corresponding entry exists in the extracted information storage device 26, the key together with the URL and score pair is inserted as a new entry; if a corresponding entry already exists, the URL and score pair is added to that entry, which is then written back to the extracted information storage device 26.
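The insert-or-append behavior of the extracted information index can be sketched with an in-memory dictionary standing in for the extracted information storage device 26. The function name `add_page` and the data layout (a list of URL and score pairs per key) are assumptions made for illustration.

```python
def add_page(index, url, score, normalized_phones):
    """Register one page under each normalized telephone number found on it.

    index: dict mapping a normalized telephone number to a list of (url, score).
    A page with no telephone numbers adds no entries at all.
    """
    for phone in sorted(set(normalized_phones)):
        if phone not in index:
            index[phone] = [(url, score)]      # no entry yet: insert a new one
        else:
            index[phone].append((url, score))  # entry exists: add the pair to it
    return index
```

Because several pages can mention the same telephone number, each key can accumulate multiple URL and score pairs, which is what later allows the pages to be ranked against one another.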

【００５１】結合演算処理部２８は、情報抽出部２５が
全てのウェブページを処理した後に動作を開始する。結
合演算処理部２８は、情報抽出部２５により抽出された
文字列のうち、検索対象情報記憶装置２７に格納されて
いる検索キーワードと一致する文字列を検索する。具体
的には図５に示すように、結合演算処理部２８は、ま
ず、検索対象情報記憶装置２７から検索対象情報（検索
キーワード）を１つ取得する（ステップＳＴ２１）。こ
こでは、説明の便宜上、検索対象情報記憶装置２７から
検索キーワードとして電話番号を取得するものとする。
結合演算処理部２８は、検索キーワードとして電話番号
を取得すると、その電話番号の正規化を行う（ステップ
ＳＴ２２）。ここで、正規化とは、表記の揺れを吸収す
るために行う処理であり、ハイフン、括弧や空白など、
数字以外の区切り文字を取り除き、数字のみの文字列と
して電話番号を表す処理である。The join operation processing unit 28 starts operating after the information extraction unit 25 has processed all the web pages. It searches the character strings extracted by the information extraction unit 25 for character strings that match the search keywords stored in the search target information storage device 27. Specifically, as shown in FIG. 5, the join operation processing unit 28 first acquires one item of search target information (a search keyword) from the search target information storage device 27 (step ST21). For convenience of explanation, it is assumed here that a telephone number is acquired as the search keyword. Having acquired the telephone number, the join operation processing unit 28 normalizes it (step ST22). Normalization here is a process for absorbing variations in notation: delimiters other than digits, such as hyphens, parentheses and spaces, are removed so that the telephone number is expressed as a character string of digits only.

【００５２】結合演算処理部２８は、検索キーワードで
ある電話番号の正規化を行うと、情報抽出部２５により
抽出された文字列のうち、正規化後の電話番号と一致す
る文字列を検索する（ステップＳＴ２３）。After normalizing the telephone number that is the search keyword, the join operation processing unit 28 searches the character strings extracted by the information extraction unit 25 for a character string that matches the normalized telephone number (step ST23).

【００５３】結合演算処理部２８は、正規化後の電話番
号と一致する文字列が存在する場合には、その文字列
（＝正規化後の電話番号）と当該文字列が記述されてい
るウェブページのＵＲＬ及び当該ウェブページのスコア
との対応表である出現ＵＲＬリスト（図１０（ｂ）を参
照）をランキング処理部２９に出力する（ステップＳＴ
２５）。なお、結合演算処理部２８は、全ての検索対象
情報について処理を実施したか否かを判断し、全ての検
索対象情報について処理を実施していない場合には、ス
テップＳＴ２１に戻り、ステップＳＴ２１〜ＳＴ２５の
処理を繰り返し実行する（ステップＳＴ２６）。If a character string matching the normalized telephone number exists, the join operation processing unit 28 outputs to the ranking processing unit 29 an appearance URL list (see FIG. 10(b)), which is a correspondence table between that character string (= the normalized telephone number), the URLs of the web pages on which it appears, and the scores of those web pages (step ST25). The join operation processing unit 28 then determines whether all items of search target information have been processed; if not, it returns to step ST21 and repeats steps ST21 to ST25 (step ST26).
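
The loop of steps ST21 to ST26 can be sketched as below. This is only an illustrative sketch, assuming the extracted information index is a dictionary from normalized telephone number to a list of (URL, score) pairs; `join_keywords` and the sample data are hypothetical names and values.

```python
import re

def normalize_phone(raw):
    """Remove every non-digit character so only digits remain."""
    return re.sub(r"\D", "", raw)

def join_keywords(keywords, index):
    """Return (key, appearance URL list) for every keyword found in the index."""
    results = []
    for raw in keywords:                 # ST21, repeated until all are done (ST26)
        key = normalize_phone(raw)       # ST22: normalize the search keyword
        hits = index.get(key)            # ST23: look for a matching string
        if hits:                         # a matching character string exists
            results.append((key, hits))  # ST25: output the appearance URL list
    return results

# Hypothetical index and keywords.
sample_index = {"0312345678": [("http://www.a.co.jp/gaiyou.html", 90)]}
matches = join_keywords(["03-1234-5678", "06-9999-0000"], sample_index)
```

Keywords with no matching entry simply produce no output, which mirrors the flow in FIG. 5 as described in the text.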

【００５４】ランキング処理部２９は、結合演算処理部
２８が文字列の検索処理を完了すると、検索された文字
列が記述されているウェブページの順位を当該ウェブペ
ージのスコアを参照して決定し、順位の高いウェブペー
ジのＵＲＬから順番に出力する。図８の例では、上位２
つのウェブページのＵＲＬが出力されていることを示し
ている。When the join operation processing unit 28 completes the character string search, the ranking processing unit 29 determines the rank of each web page on which a retrieved character string appears by referring to the score of that web page, and outputs the URLs in descending order of rank. The example of FIG. 8 shows the URLs of the top two web pages being output.

【００５５】ここでは、ランキング処理部２９がウェブ
ページの順位を当該ウェブページのスコアを参照して決
定するものについて示したが、ウェブページのスコアに
基づいて当該ウェブページが属するＷＷＷサーバ２１の
順位を決定するようにしてもよい。具体的には図６に示
す通りである。The description so far has the ranking processing unit 29 determine the rank of each web page by referring to its score, but the rank of the WWW server 21 to which a web page belongs may be determined from the web page scores instead. The details are as shown in FIG. 6.

【００５６】ランキング処理部２９は、まず、結合演算
処理部２８の検索結果を１つ取り出し（ステップＳＴ３
１）、出現ＵＲＬリストにおける複数のＵＲＬをサーバ
毎にまとめる処理を実行する（ステップＳＴ３２）。即
ち、ＨＴＴＰにおいては、ｈｔｔｐ：／／に続く部分が
サーバ名であるので、そのサーバ名を参照して、複数の
ＵＲＬをサーバ毎にまとめる処理を実行する。The ranking processing unit 29 first takes out one search result of the join operation processing unit 28 (step ST31) and groups the URLs in the appearance URL list by server (step ST32). In HTTP, the part that follows http:// is the server name, so the URLs are grouped by server with reference to that name.

【００５７】次に、各ウェブページのスコアに基づき、
各ＷＷＷサーバ２１に属するウェブページのスコアの最
大値をＷＷＷサーバ２１のスコアとして設定する（ステ
ップＳＴ３３）。そして、複数のＷＷＷサーバ２１の順
位を、当該ＷＷＷサーバ２１のスコアや当該ＷＷＷサー
バ２１に属するウェブページの個数によってランキング
する（ステップＳＴ３４）。Next, based on the score of each web page, the maximum score among the web pages belonging to each WWW server 21 is set as the score of that WWW server 21 (step ST33). The WWW servers 21 are then ranked according to their scores and the number of web pages belonging to each of them (step ST34).

【００５８】ランキング処理部２９は、上位にランキン
グされたＷＷＷサーバ２１、例えば、上位３つ以内のＷ
ＷＷサーバ２１のＵＲＬのみを順番に出力する（ステッ
プＳＴ３５）。ランキング処理部２９は、全ての検索結
果について処理を実行したか否かを判定し、未処理の検
索結果があれば、ステップＳＴ３１に戻り、ステップＳ
Ｔ３１〜ＳＴ３５の処理を繰り返し実行する（ステップ
ＳＴ３６）。The ranking processing unit 29 outputs, in order, only the URLs of the highest-ranked WWW servers 21, for example the top three (step ST35). It then determines whether all search results have been processed; if an unprocessed search result remains, it returns to step ST31 and repeats steps ST31 to ST35 (step ST36).
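
Steps ST32 to ST35 can be illustrated with the following sketch. It assumes that servers are ordered first by server score and then by page count; the text names both criteria but does not state how they are combined, so that ordering, the function name `rank_servers` and the sample data are assumptions.

```python
from urllib.parse import urlsplit

def rank_servers(url_scores, top_n=3):
    """Group (URL, score) pairs by server and return the top server names."""
    servers = {}
    for url, score in url_scores:
        host = urlsplit(url).netloc                  # ST32: part after http://
        best, count = servers.get(host, (score, 0))
        servers[host] = (max(best, score), count + 1)  # ST33: max page score
    # ST34: assumed ordering, higher score first, then more pages,
    # then server name to keep the result deterministic.
    ranked = sorted(servers.items(),
                    key=lambda kv: (-kv[1][0], -kv[1][1], kv[0]))
    return [host for host, _ in ranked[:top_n]]      # ST35: top servers only

# Hypothetical appearance URL list for one search keyword.
pages = [
    ("http://www.a.co.jp/gaiyou.html", 90),
    ("http://www.a.co.jp/index.html", 70),
    ("http://www.b.co.jp/list.html", 40),
    ("http://www.c.co.jp/x.html", 90),
]
top = rank_servers(pages)
```

With this sample, www.a.co.jp and www.c.co.jp share the same server score, and the page count breaks the tie in favor of www.a.co.jp.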

【００５９】以上で明らかなように、この実施の形態１
によれば、ＷＷＷ情報収集部２４により収集されたウェ
ブページから検索キーワードと同一属性の文字列及び検
索キーワードと異なる属性の文字列を抽出するととも
に、そのウェブページに対する当該文字列の記述の程度
に応じて当該ウェブページのスコアを設定するように構
成したので、ユーザによる検索結果の妥当性のチェック
を省略することができる効果を奏する。As is clear from the above, according to Embodiment 1, character strings having the same attribute as the search keyword and character strings having attributes different from the search keyword are extracted from the web pages collected by the WWW information collection unit 24, and the score of each web page is set according to the degree to which those character strings are described on it. This has the effect that the user can omit checking the validity of the search results.

【００６０】なお、この実施の形態１では、インターネ
ット情報検索システム２３の構成要素をＩＣなどの専用
のハードウエアを用いて構成してもよいし、ソフトウエ
ア（情報検索プログラム）を実行するコンピュータを用
いて構成してもよい。図１２はコンピュータを用いて構
成する場合のハードウエア構成図であり、図において、
４１は情報検索プログラムを実行するＣＰＵであり、Ｗ
ＷＷ情報収集部２４、情報抽出部２５、結合演算処理部
２８及びランキング処理部２９の機能を有している。４
２は情報検索プログラムやプログラムの実行に必要なデ
ータを格納するメモリ、４３はコンソール入出力装置４
４との入出力を行うコンソールインタフェース、４４は
コンソール入出力装置、４５はハードディスク装置４６
をアクセスするディスクインタフェース、４６はハード
ディスク装置、４７はインターネット２２に接続するネ
ットワークインタフェースである。In Embodiment 1, the components of the Internet information search system 23 may be implemented with dedicated hardware such as ICs, or with a computer that executes software (an information retrieval program). FIG. 12 is a hardware configuration diagram for the computer-based implementation. In the figure, 41 is a CPU that executes the information retrieval program and provides the functions of the WWW information collection unit 24, the information extraction unit 25, the join operation processing unit 28 and the ranking processing unit 29; 42 is a memory that stores the information retrieval program and the data needed to execute it; 43 is a console interface that performs input and output with the console input/output device 44; 44 is the console input/output device; 45 is a disk interface for accessing the hard disk device 46; 46 is the hard disk device; and 47 is a network interface connected to the Internet 22.

【００６１】実施の形態２．上記実施の形態１では、順
位の高いウェブページのＵＲＬから順番に出力するもの
について示したが、順位の高いウェブページの部分内容
から順番に出力するようにしてもよい。例えば、検索対
象情報として製品名を用いることにより、ウェブページ
上で製品と共に記述されている価格を出力するようにす
る。これにより、ユーザは実際にウェブページをアクセ
スすることなく、ウェブページの部分内容、即ち、目的
の情報を知ることができる効果を奏する。Embodiment 2. In Embodiment 1 above, the URLs of web pages are output in descending order of rank, but partial contents of the web pages may be output in that order instead. For example, by using a product name as the search target information, the price described together with the product on a web page can be output. This has the effect that the user can learn the partial content of the web page, that is, the desired information, without actually accessing the web page.

【００６２】実施の形態３．図１３はこの発明の実施の
形態３による情報検索装置を示す構成図であり、図にお
いて、図１と同一符号は同一または相当部分を示すので
説明を省略する。５１は検索キーワードと一致する文字
列を含むウェブページのＵＲＬを検索する検索エンジ
ン、５２は検索キーワードを検索エンジン５１に与え
て、検索エンジン５１から検索結果であるウェブページ
のＵＲＬを取得する検索エンジン問合せ部、５３は検索
エンジン問合せ部５２により取得されたＵＲＬを有する
ウェブページをＷＷＷサーバ２１からダウンロードする
ダウンロード部である。なお、検索エンジン問合せ部５
２及びダウンロード部５３から収集手段が構成されてい
る。Embodiment 3. FIG. 13 is a block diagram showing an information retrieval device according to Embodiment 3 of the present invention. In the figure, the same reference numerals as in FIG. 1 denote the same or corresponding parts, and their description is omitted. 51 is a search engine that searches for the URLs of web pages containing a character string that matches a search keyword; 52 is a search engine inquiry unit that gives a search keyword to the search engine 51 and acquires from it the URLs of the web pages in the search result; and 53 is a download unit that downloads the web pages having the URLs acquired by the search engine inquiry unit 52 from the WWW servers 21. The search engine inquiry unit 52 and the download unit 53 constitute the collecting means.

【００６３】５４はダウンロード部５３によりダウンロ
ードされたウェブページから検索キーワードと同一属性
の文字列及び検索キーワードと異なる属性の文字列を抽
出する属性抽出部、５５はダウンロード部５３によりダ
ウンロードされたウェブページに対する当該文字列の記
述の程度に応じて当該ウェブページのスコアを設定する
スコア設定部である。なお、属性抽出部５４及びスコア
設定部５５から抽出手段が構成されている。５６はダウ
ンロード部５３によりダウンロードされたウェブページ
の順位を当該ウェブページのスコアを参照して決定する
ランキング処理部（順位決定手段）である。54 is an attribute extraction unit that extracts, from a web page downloaded by the download unit 53, character strings having the same attribute as the search keyword and character strings having attributes different from the search keyword; 55 is a score setting unit that sets the score of the downloaded web page according to the degree to which those character strings are described on it. The attribute extraction unit 54 and the score setting unit 55 constitute the extracting means. 56 is a ranking processing unit (rank determining means) that determines the rank of each web page downloaded by the download unit 53 by referring to its score.

【００６４】次に動作について説明する。検索エンジン
問合せ部５２は、検索エンジン５１に対して検索キーワ
ード（例えば、企業名）を渡して検索を要求することに
より、検索エンジン５１の検索結果として１以上のウェ
ブページのＵＲＬを取得する。検索エンジン５１のイン
タフェースは、ＨＴＴＰプロトコルやＨＴＭＬフォーマ
ットといった標準仕様に基づいており、検索キーワード
の指定や結果からのＵＲＬ抽出は容易に実現できる。Next, the operation will be described. The search engine inquiry unit 52 passes a search keyword (for example, a company name) to the search engine 51 and requests a search, thereby acquiring the URLs of one or more web pages as the search result. Because the interface of the search engine 51 is based on standard specifications such as the HTTP protocol and the HTML format, specifying the search keyword and extracting URLs from the result are easy to implement.

【００６５】ダウンロード部５３は、検索エンジン問合
せ部５２が１以上のウェブページのＵＲＬを取得する
と、ＷＷＷサーバ２１からインターネット２２を介し
て、それらのＵＲＬに対応するウェブページ（ＨＴＭＬ
ファイル）をダウンロードし、そのウェブページを属性
抽出部５４に出力する。When the search engine inquiry unit 52 has acquired the URLs of one or more web pages, the download unit 53 downloads the web pages (HTML files) corresponding to those URLs from the WWW servers 21 via the Internet 22 and outputs them to the attribute extraction unit 54.

【００６６】属性抽出部５４は、ダウンロード部５３か
ら１以上のウェブページを受け取ると、まず、情報抽出
部２５のＨＴＭＬ解析部２５ａと同様に、各ウェブペー
ジに対してＨＴＭＬのタグと通常のテキストを分離する
処理を実行する。次に、属性抽出部５４は、情報抽出部
２５の字句要素解析部２５ｂと同様に、その分離したテ
キストを構成する文章を単語単位に分解し、形態素解析
を実行して各単語に品詞情報を付加する。なお、品詞情
報は「名詞」、「助詞」、「動詞」などの大分類だけで
なく、「名詞−固有名詞−地名」といった詳細まで含む
ものであり、以下、属性抽出部５４が参照する構文ルー
ル中で使用される。On receiving one or more web pages from the download unit 53, the attribute extraction unit 54 first separates the HTML tags from the ordinary text of each web page, in the same way as the HTML analysis unit 25a of the information extraction unit 25. Next, in the same way as the lexical element analysis unit 25b of the information extraction unit 25, it breaks the sentences of the separated text into words and performs morphological analysis to attach part-of-speech information to each word. The part-of-speech information covers not only broad classes such as "noun", "particle" and "verb" but also details such as "noun - proper noun - place name", and is used in the syntax rules that the attribute extraction unit 54 refers to.

【００６７】次に、属性抽出部５４は、情報抽出部２５
の構文解析部２５ｃと同様に、その分解した単語と品詞
の列から、構文ルールに指定された認識パターンに合致
する部分列を取り出し、合致した認識パターンの名称と
ともに文字列を出力する。図９の構文ルールの例では、
この実施の形態３の検索目的に合わせて、＜電話番号
＞、＜名称＞、＜所在地＞の認識パターンが記述されて
いるので、これらのいずれかと考えられる文字列に対し
て、それぞれ「＜電話番号＞文字列値」、「＜名称＞文
字列値」、「＜所在地＞文字列値」という形式で出力
し、それ以外の文字列に対しては出力を生成しない。認
識パターンは文字列の意味を表す「属性」に対応してお
り、属性抽出部５４の出力は一般的には文字列属性と文
字列値の組の列となる（図１４を参照）。Next, in the same way as the syntax analysis unit 25c of the information extraction unit 25, the attribute extraction unit 54 takes from the sequence of words and parts of speech the subsequences that match a recognition pattern specified in the syntax rules, and outputs each matching character string together with the name of the pattern it matched. In the example syntax rules of FIG. 9, recognition patterns for <telephone number>, <name> and <location> are described to suit the search purpose of Embodiment 3. A character string judged to be one of these is therefore output in the form "<telephone number> character string value", "<name> character string value" or "<location> character string value", and no output is generated for any other character string. A recognition pattern corresponds to an "attribute" that expresses the meaning of a character string, so the output of the attribute extraction unit 54 is in general a sequence of character string attribute and character string value pairs (see FIG. 14).

【００６８】ここで、構文ルールは、例えば、ＲＦＣ２
２３４（ｈｔｔｐ：／／ｒｆｃ．ｎｅｔ／ｒｆｃ２２３
４．ｈｔｍｌ）に規定されたＡＢＮＦ（Ａｕｇｍｅｎｔ
ｅｄＢａｃｋｕｓ−ＮａｕｒＦｏｒｍ、拡張Ｂａｃｋ
ｕｓ−Ｎａｕｒ記法）を用いて記述することができる。
構文ルールに従った認識は公知のコンパイラ技術である
ＬＲ構文解析を用いて実現することができる。また、構
文ルールに記述したパターン以外はエラーとなるが、こ
こではエラーは無視して何も出力しない。The syntax rules can be written, for example, in ABNF (Augmented Backus-Naur Form) as defined in RFC 2234 (http://rfc.net/rfc2234.html). Recognition according to the syntax rules can be implemented with LR parsing, a well-known compiler technique. Input that does not match any pattern written in the syntax rules results in an error, but here such errors are ignored and nothing is output.
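
The shape of the attribute extraction output, a sequence of attribute and value pairs with nothing emitted for non-matching text, can be illustrated with the heavily simplified sketch below. It is not the method of the specification: a single regular expression stands in for morphological analysis plus LR parsing of ABNF rules, only the <telephone number> attribute is covered, and the pattern, function name and sample text are assumptions made for illustration.

```python
import re

# Assumed pattern for Japanese-style numbers such as 03-1234-5678 or
# (03) 1234 5678; real syntax rules (FIG. 9) would be far more thorough.
PHONE_PATTERN = re.compile(r"\(?0\d{1,4}\)?[-\s]?\d{1,4}[-\s]?\d{4}")

def extract_attributes(text):
    """Return (attribute, value) pairs; non-matching text yields nothing."""
    return [("telephone number", m.group(0))
            for m in PHONE_PATTERN.finditer(text)]

# Hypothetical page text after the HTML tags have been stripped.
pairs = extract_attributes("TEL: 03-1234-5678 / FAX: (03) 1234 5679")
```

The values are returned as written on the page; normalization to digits only would happen later, when the value is used as an index key.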

【００６９】スコア設定部５５は、情報抽出部２５の抽
出情報管理部２５ｄと同様に、属性抽出部５４の出力に
基づき、１つのウェブページに含まれていた文字列属性
と文字列値の種類や出現頻度に応じて当該ウェブページ
のスコアを設定する。具体的には、図１１に示すような
対応表（文字列属性と文字列値に応じたスコアを示して
いる）を参照してウェブページのスコアを設定するが、
例えば、あるウェブページから検索された属性＜電話番
号＞に対応する文字列値の種類が“１”、属性＜名称＞
に対応する文字列値の種類が“４”、属性＜所在地＞に
対応する文字列値の種類が“２”であれば、ウェブペー
ジのスコアは“９０”に設定される。Like the extracted information management unit 25d of the information extraction unit 25, the score setting unit 55 sets the score of a web page, based on the output of the attribute extraction unit 54, according to the kinds and frequency of the character string attributes and character string values contained in that web page. Specifically, it sets the score by referring to a correspondence table such as that shown in FIG. 11, which gives scores according to character string attributes and character string values. For example, if a web page yields one distinct character string value for the attribute <telephone number>, four for the attribute <name> and two for the attribute <location>, the score of the web page is set to "90".

【００７０】なお、図１１の例では、各文字列属性に対
応する文字列値が存在しないウェブページのスコアを下
げているが、これは記述が充実しておらず、正規の連絡
先を伝える意図がないと思われるウェブページを除外す
るためである。一方、ある文字列属性に対応する文字列
値の種類が多過ぎるウェブページもスコアを下げるよう
にしているが、これは名簿などのように第三者によって
公開されているウェブページを除くためであり、例えば
４種類以上の電話番号が書かれたウェブページのスコア
を下げている。スコア設定部５５は、上記のようにし
て、ウェブページのスコアを設定すると、当該ウェブペ
ージのＵＲＬとスコアとを組にしてランキング処理部５
６に出力する（図１５を参照）。In the example of FIG. 11, the score is lowered for a web page on which no character string value exists for a given character string attribute; this excludes web pages whose description is sparse and which do not appear intended to convey official contact information. Conversely, the score is also lowered for a web page with too many distinct character string values for a given attribute; this excludes web pages published by third parties, such as directories. For example, a web page listing four or more distinct telephone numbers has its score lowered. Having set the score of a web page as described above, the score setting unit 55 outputs the URL of the web page paired with its score to the ranking processing unit 56 (see FIG. 15).
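
The two scoring rules stated above (lower the score when an attribute has no value, and when it has too many distinct values) can be sketched as follows. The correspondence table of FIG. 11 is not reproduced in this text, so the base score, the penalty values and the function name are invented for illustration; they were merely chosen so that the example in the description (one telephone number, four names, two locations) comes out at 90.

```python
ATTRIBUTES = ("telephone number", "name", "location")

def page_score(kinds_per_attribute, max_kinds=3):
    """Score a page from the number of distinct values found per attribute.

    All numeric values below are assumptions, not taken from FIG. 11.
    """
    score = 100                      # assumed base score
    for attribute in ATTRIBUTES:
        kinds = kinds_per_attribute.get(attribute, 0)
        if kinds == 0:
            score -= 30              # assumed penalty: attribute missing
        elif kinds > max_kinds:
            score -= 10              # assumed penalty: four or more kinds
    return max(score, 0)

example = page_score({"telephone number": 1, "name": 4, "location": 2})
```

A table lookup as in FIG. 11 would replace the arithmetic here; the sketch only shows the direction in which each rule moves the score.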

【００７１】ランキング処理部５６は、検索エンジン５
１の全ての検索結果に対応するＵＲＬとスコアの組とを
スコア設定部５５から受け取ると、それらの複数のＵＲ
ＬをＷＷＷサーバ２１毎にまとめる処理を実行する。即
ち、ＨＴＴＰにおいてはｈｔｔｐ：／／に続く部分がＷ
ＷＷサーバ名であるので、そのＷＷＷサーバ名を参照し
て複数のＵＲＬをＷＷＷサーバ２１毎にまとめる処理を
実行する。次に、各ＵＲＬのスコアに基づき、各ＷＷＷ
サーバ２１に属するＵＲＬスコアの最大値をＷＷＷサー
バ２１のスコアとして設定する。On receiving from the score setting unit 55 the URL and score pairs corresponding to all the search results of the search engine 51, the ranking processing unit 56 groups the URLs by WWW server 21. In HTTP, the part that follows http:// is the WWW server name, so the URLs are grouped by WWW server 21 with reference to that name. Next, based on the score of each URL, the maximum score among the URLs belonging to each WWW server 21 is set as the score of that WWW server 21.

【００７２】そして、複数のＷＷＷサーバ２１の順位
を、当該ＷＷＷサーバ２１のスコアや当該ＷＷＷサーバ
２１に属するＵＲＬの個数によってランキングする。ラ
ンキング処理部５６は、上位にランキングされたＷＷＷ
サーバ２１、例えば、上位３つ以内のＷＷＷサーバ２１
について、当該ＷＷＷサーバ２１内で最大スコアを持つ
ＵＲＬのみを順番に出力する（図１６を参照）。The WWW servers 21 are then ranked according to their scores and the number of URLs belonging to each of them. For the highest-ranked WWW servers 21, for example the top three, the ranking processing unit 56 outputs in order only the URL having the maximum score within each WWW server 21 (see FIG. 16).

【００７３】以上で明らかなように、この実施の形態３
によれば、検索エンジン５１による検索結果の複数ウェ
ブページに対し、検索キーワード以外の文字列も対象に
して文字列属性とその値を取り出し、その記述の程度に
応じて設定した当該ウェブページのスコアに基づいて当
該ウェブページの順位を決定するように構成したので、
検索キーワードの指定だけでは除外できない不適当な結
果を除外することができ、ユーザが妥当な結果を選択す
る作業を省略することができる効果を奏する（図１７を
参照）。As is clear from the above, according to Embodiment 3, character string attributes and their values are extracted from the web pages returned by the search engine 51, covering character strings other than the search keyword as well, and the rank of each web page is determined from a score set according to the degree of that description. Inappropriate results that cannot be excluded by specifying the search keyword alone can therefore be excluded, with the effect that the user can omit the work of selecting valid results (see FIG. 17).

【００７４】なお、この実施の形態３においても、上記
実施の形態１と同様に、インターネット情報検索システ
ム２３の構成要素をＩＣなどの専用のハードウエアを用
いて構成してもよいし、ソフトウエア（情報検索プログ
ラム）を実行するコンピュータを用いて構成してもよ
い。また、この実施の形態３では、ＵＲＬを返すものに
ついて示したが、抽出した情報（例えば、電話番号から
企業名、製品名から価格）を返すようにしてもよい。In Embodiment 3 as well, as in Embodiment 1, the components of the Internet information search system 23 may be implemented with dedicated hardware such as ICs, or with a computer that executes software (an information retrieval program). Although Embodiment 3 has been described as returning URLs, extracted information may be returned instead (for example, the company name for a telephone number, or the price for a product name).

【００７５】実施の形態４．図１８はこの発明の実施の
形態４による情報検索装置を示す構成図であり、図にお
いて、図１及び図１３等と同一符号は同一または相当部
分を示すので説明を省略する。６１はスコア設定部５５
から受け取ったウェブページのＵＲＬ及びスコアと、文
字列属性及び文字列値の組の列とから、ある文字列属性
の文字列値をキーとする抽出情報インデックスを生成す
る抽出情報管理部、６２はウェブページのＵＲＬを上位
のウェブページのＵＲＬに変更するＵＲＬ修正部（順位
決定手段）である。図２０はＵＲＬ修正部６２の処理内
容を示すフローチャートである。Embodiment 4. FIG. 18 is a block diagram showing an information retrieval device according to Embodiment 4 of the present invention. In the figure, the same reference numerals as in FIG. 1, FIG. 13 and so on denote the same or corresponding parts, and their description is omitted. 61 is an extracted information management unit that generates an extracted information index, keyed by the character string value of a particular character string attribute, from the URL and score of a web page received from the score setting unit 55 and the sequence of character string attribute and character string value pairs; 62 is a URL correction unit (rank determining means) that changes the URL of a web page to the URL of a higher-level web page. FIG. 20 is a flowchart showing the processing of the URL correction unit 62.

【００７６】次に動作について説明する。抽出情報管理
部６１は、上記実施の形態３と同様にしてスコア設定部
５５がウェブページのスコアを設定すると、スコア設定
部５５から受け取ったウェブページのＵＲＬ及びスコア
と、文字列属性及び文字列値の組の列（図１０（ａ）の
1行に相当）とから、ある文字列属性の文字列値をキー
とする抽出情報インデックスを生成して抽出情報記憶装
置２６に格納する（図１９を参照）。Next, the operation will be described. When the score setting unit 55 has set the score of a web page as in Embodiment 3, the extracted information management unit 61 takes the URL and score of the web page received from the score setting unit 55, together with the sequence of character string attribute and character string value pairs (corresponding to one row of FIG. 10(a)), generates an extracted information index keyed by the character string value of a particular character string attribute, and stores it in the extracted information storage device 26 (see FIG. 19).

【００７７】ここでは、文字列属性として電話番号を用
い、その値はハイフン、カッコ、空白などを除き数字の
みで表される正規化電話番号としている。正規化電話番
号を用いる理由は、名称や所在地などと比較して表記の
ゆれ（曖昧さ）が少ないと期待されるからである。一般
的には、１つのウェブページに対応して０個から複数個
の抽出情報インデックスが生成されるが、それぞれのキ
ー（正規化電話番号）に対し、抽出情報記憶装置２６に
対応するエントリが存在しない場合は、キーとＵＲＬお
よびスコアの組が新たなエントリとして挿入され、抽出
情報記憶装置２６に対応するエントリが既に存在してい
る場合は、当該エントリに対してＵＲＬおよびスコアの
組を追加した上で抽出情報記憶装置２６に書き戻され
る。Here, the telephone number is used as the character string attribute, and its value is a normalized telephone number consisting only of digits, with hyphens, parentheses, spaces and the like removed. The normalized telephone number is used because its notation is expected to vary less (be less ambiguous) than that of a name or a location. In general, zero or more extracted information index entries are generated for one web page. For each key (normalized telephone number), if no corresponding entry exists in the extracted information storage device 26, the set of the key, the URL and the score is inserted as a new entry; if a corresponding entry already exists, the URL and score pair is added to that entry, which is then written back to the extracted information storage device 26.

【００７８】ランキング処理部５６は、ＷＷＷ情報収集
部２４による情報収集が完了し、収集した全てのウェブ
ページに対応する抽出情報が抽出情報記憶装置２６に格
納された時点で動作を開始する。ランキング処理部５６
は、抽出情報記憶装置２６からエントリを１つずつ取り
出し、上記実施の形態３と同様の処理を実行して各エン
トリのキー（正規化電話番号）に対応するＵＲＬの内で
スコアの高いウェブページおよびＷＷＷサーバ２１に関
するものを出力する。The ranking processing unit 56 starts operating once the WWW information collection unit 24 has finished collecting information and the extracted information corresponding to all the collected web pages has been stored in the extracted information storage device 26. It takes the entries out of the extracted information storage device 26 one at a time, performs the same processing as in Embodiment 3, and outputs, from among the URLs corresponding to the key (normalized telephone number) of each entry, those relating to high-scoring web pages and WWW servers 21.

【００７９】ＵＲＬ修正部６２は、ランキング処理部５
６からキー（正規化電話番号）とＵＲＬ（複数）を受け
取ると、図２０に示すように、既取得ＵＲＬ記憶装置２
４ｃとＵＲＬコンテンツ記憶装置２４ｄに格納された情
報を参照してＵＲＬ文字列に修正を加え、修正後のＵＲ
Ｌ文字列を結果情報記憶装置３０に格納する。即ち、Ｕ
ＲＬに対応するＨＴＭＬファイルをＵＲＬコンテンツ記
憶装置２４ｄから検索し、内容として上位のＵＲＬを指
すハイパーリンクが含まれていたら、そのＵＲＬで元の
ＵＲＬを置き換える。On receiving a key (normalized telephone number) and its URLs from the ranking processing unit 56, the URL correction unit 62 corrects each URL character string by referring to the information stored in the acquired URL storage device 24c and the URL content storage device 24d, as shown in FIG. 20, and stores the corrected URL character string in the result information storage device 30. That is, it retrieves the HTML file corresponding to the URL from the URL content storage device 24d and, if its content includes a hyperlink pointing to an upper URL, replaces the original URL with that URL.

【００８０】ここで、上位のＵＲＬとは、ＵＲＬ文字列
を“／”で区切って最後の要素（ファイル名に対応）を
無視した場合、要素数が元のＵＲＬより少なく、全ての
要素が元のＵＲＬに含まれているものを指す。例えば、
元のＵＲＬがｈｔｔｐ：／／ｗｗｗ．ａ．ｃｏ．ｊｐ／
ｐｒｏｄｕｃｔｓ／ｐｃ．ｈｔｍｌであった場合、ｈｔ
ｔｐ：／／ｗｗｗ．ａ．ｃｏ．ｊｐ／ｉｎｄｅｘ．ｈｔ
ｍｌは上位ＵＲＬであるが、ｈｔｔｐ：／／ｗｗｗ．
ａ．ｃｏ．ｊｐ／ｐｒｏｄｕｃｔｓ／やｈｔｔｐ：／／
ｗｗｗ．ａ．ｃｏ．ｊｐ／ｐｒｏｄｕｃｔｓ／ｉｎｄｅ
ｘ．ｈｔｍｌは上位ＵＲＬとはみなさない。Here, an upper URL is a URL that, when the URL character string is split at "/" and the last element (corresponding to the file name) is ignored, has fewer elements than the original URL and whose elements are all contained in the original URL. For example, if the original URL is http://www.a.co.jp/products/pc.html, then http://www.a.co.jp/index.html is an upper URL, whereas http://www.a.co.jp/products/ and http://www.a.co.jp/products/index.html are not regarded as upper URLs.
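
The definition of an upper URL can be checked with a short sketch. It interprets "all elements are contained in the original URL" as "the candidate's directory elements form a leading prefix of the original's", which is an assumption about the intended meaning; the function names are hypothetical.

```python
def directory_elements(url):
    """Split at "/" and ignore the last element (the file name, possibly empty)."""
    return url.split("/")[:-1]

def is_upper_url(candidate, original):
    """True if candidate has fewer directory elements, all shared with original."""
    cand = directory_elements(candidate)
    orig = directory_elements(original)
    return len(cand) < len(orig) and orig[:len(cand)] == cand

original = "http://www.a.co.jp/products/pc.html"
checks = [
    is_upper_url("http://www.a.co.jp/index.html", original),
    is_upper_url("http://www.a.co.jp/products/", original),
    is_upper_url("http://www.a.co.jp/products/index.html", original),
]
```

The three checks correspond to the three URLs discussed in the example above: only the first sits in a directory strictly above the original page.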

【００８１】さらに、ＵＲＬ文字列がファイル名を表し
ている場合（“／”で終わっていない場合）、ＵＲＬ文
字列の末尾のファイル名部分を取り除くことでディレク
トリのＵＲＬを生成し、既取得ＵＲＬ記憶装置２４ｃを
検索して当該ディレクトリＵＲＬへのアクセス記録が存
在したら当該ディレクトリＵＲＬが有効であると判断し
て元のＵＲＬをディレクトリＵＲＬに置き換える。例え
ば、元のＵＲＬがｈｔｔｐ：／／ｗｗｗ．ａ．ｃｏ．ｊ
ｐ／ｇａｉｙｏｕ．ｈｔｍｌである場合、対応するディ
レクトリＵＲＬとしてｈｔｔｐ：／／ｗｗｗ．ａ．ｃ
ｏ．ｊｐ／を生成し、既取得ＵＲＬ記憶装置２４ｃを検
索の上、当該ＵＲＬの存在が確認できたらＵＲＬをｈｔ
ｔｐ：／／ｗｗｗ．ａ．ｃｏ．ｊｐ／に書き換える。Furthermore, if the URL character string denotes a file name (that is, it does not end with "/"), a directory URL is generated by removing the file name portion at the end of the URL character string, and the acquired URL storage device 24c is searched. If an access record for the directory URL exists, the directory URL is judged to be valid and the original URL is replaced with it. For example, if the original URL is http://www.a.co.jp/gaiyou.html, the corresponding directory URL http://www.a.co.jp/ is generated; if a search of the acquired URL storage device 24c confirms that this URL exists, the URL is rewritten to http://www.a.co.jp/.

【００８２】また、既取得ＵＲＬ記憶装置２４ｃに当該
ディレクトリＵＲＬへのアクセス記録が存在しなかった
場合、当該ディレクトリＵＲＬが無効であると判断す
る。この場合、当該ディレクトリ内のファイルＵＲＬの
うち、ホームページのＵＲＬとして最も妥当性が高いも
のを選択する。例えば、元のＵＲＬｈｔｔｐ：／／ｗ
ｗｗ．ａ．ｃｏ．ｊｐ／ｇａｉｙｏｕ．ｈｔｍｌに対
し、既取得ＵＲＬ記憶装置２４ｃを検索して得られた同
一ディレクトリ内のファイルＵＲＬがｈｔｔｐ：／／ｗ
ｗｗ．ａ．ｃｏ．ｊｐ／ｇａｉｙｏｕ．ｈｔｍｌ（便宜
上元のＵＲＬ自体も含める）、ｈｔｔｐ：／／ｗｗｗ．
ａ．ｃｏ．ｊｐ／ｐｒｏｄｕｃｔｓ．ｈｔｍｌ、ｈｔｔ
ｐ：／／ｗｗｗ．ａ．ｃｏ．ｊｐ／ｉｎｄｅｘ．ｈｔｍ
ｌであったとすると、ホームページＵＲＬとしては一般
にｉｎｄｅｘ．ｈｔｍｌの妥当性が高いので、ＵＲＬを
ｈｔｔｐ：／／ｗｗｗ．ａ．ｃｏ．ｊｐ／ｉｎｄｅｘ．
ｈｔｍｌに書き換える。妥当性の優劣が付けられない場
合にはＵＲＬの書き換えは行なわない。なお、図２１は
ＵＲＬ修正部６２におけるファイル名の妥当性の設定例
を示す説明図である。If no access record for the directory URL exists in the acquired URL storage device 24c, the directory URL is judged to be invalid. In this case, the file URL in that directory that has the highest validity as the URL of a home page is selected. For example, suppose that for the original URL http://www.a.co.jp/gaiyou.html, a search of the acquired URL storage device 24c yields the file URLs http://www.a.co.jp/gaiyou.html (the original URL itself is included for convenience), http://www.a.co.jp/products.html and http://www.a.co.jp/index.html in the same directory. Since index.html generally has the highest validity as a home page URL, the URL is rewritten to http://www.a.co.jp/index.html. If the candidates cannot be ranked by validity, the URL is not rewritten. FIG. 21 is an explanatory diagram showing an example of the validity settings for file names in the URL correction unit 62.
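
The directory URL rewrite and the fallback to the most valid file name can be sketched together as below. The acquired URL storage device 24c is modeled as a plain set of URL strings, and the validity table is hypothetical because FIG. 21 is not reproduced in this text; treating file names absent from the table as unrankable is likewise an assumption.

```python
# Hypothetical validity values for file names (higher is more valid).
FILE_VALIDITY = {"index.html": 2, "index.htm": 1}

def directory_url(url):
    """Remove the file name after the last "/" to obtain the directory URL."""
    return url[: url.rfind("/") + 1]

def correct_url(url, acquired_urls):
    """Rewrite a file URL to its directory URL or to the most valid sibling."""
    if url.endswith("/"):
        return url                           # already a directory URL
    directory = directory_url(url)
    if directory in acquired_urls:
        return directory                     # access record exists: valid
    # Directory URL judged invalid: compare the file URLs in the same directory
    # (the original URL itself is included for convenience).
    candidates = {u for u in acquired_urls
                  if not u.endswith("/") and directory_url(u) == directory}
    candidates.add(url)
    validity = {u: FILE_VALIDITY.get(u[len(directory):], 0) for u in candidates}
    best = max(validity.values())
    winners = [u for u, v in validity.items() if v == best]
    if best == 0 or len(winners) != 1:
        return url                           # no clear winner: do not rewrite
    return winners[0]

acquired = {
    "http://www.a.co.jp/gaiyou.html",
    "http://www.a.co.jp/products.html",
    "http://www.a.co.jp/index.html",
}
corrected = correct_url("http://www.a.co.jp/gaiyou.html", acquired)
```

The sketch assumes well-formed URLs that contain a path; a bare host such as http://www.a.co.jp without a trailing "/" is not handled.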

【００８３】結果情報記憶装置３０から企業ホームペー
ジを検索する際には、対象企業の電話番号を正規化（ハ
イフン、カッコ、空白など非数字文字の除去）した文字
列で検索することでＵＲＬが得られる。To look up a company home page in the result information storage device 30, the telephone number of the target company is normalized (non-digit characters such as hyphens, parentheses and spaces are removed) and the resulting character string is used as the search key, which yields the URL.

【００８４】以上で明らかなように、この実施の形態４
によれば、ＷＷＷサーバ２１から収集した情報を元に検
索結果としてのウェブページの妥当性を判定し、より妥
当なウェブページのＵＲＬを結果とするようにしたの
で、検索キーワードが含まれているとは限らない妥当な
結果を返すことができ、ユーザが結果を修正する作業を
省略することができる効果を奏する（図２２を参照）。As is clear from the above, according to Embodiment 4, the validity of a web page as a search result is judged from the information collected from the WWW servers 21, and the URL of a more valid web page is returned as the result. A valid result can therefore be returned even when it does not necessarily contain the search keyword, with the effect that the user can omit the work of correcting the result (see FIG. 22).

【００８５】なお、この実施の形態４においても、上記
実施の形態１と同様に、インターネット情報検索システ
ム２３の構成要素をＩＣなどの専用のハードウエアを用
いて構成してもよいし、ソフトウエア（情報検索プログ
ラム）を実行するコンピュータを用いて構成してもよ
い。In Embodiment 4 as well, as in Embodiment 1, the components of the Internet information search system 23 may be implemented with dedicated hardware such as ICs, or with a computer that executes software (an information retrieval program).

【００８６】実施の形態５．図２３はこの発明の実施の
形態５による情報検索装置を示す構成図であり、図にお
いて、図１８と同一符号は同一または相当部分を示すの
で説明を省略する。７１は既知の正規化電話番号（検索
キーワード）等を記憶する既知情報記憶装置、７２は抽
出情報記憶装置２６に記憶されている文字列のうち、既
知情報記憶装置７１に記憶されている検索キーワードと
一致しない文字列を検索する反結合演算部（検索手
段）、７３は反結合演算部７２により検索された文字列
を記憶する未知情報記憶装置である。Embodiment 5. FIG. 23 is a block diagram showing an information retrieval device according to Embodiment 5 of the present invention. In the figure, the same reference numerals as in FIG. 18 denote the same or corresponding parts, and their description is omitted. 71 is a known information storage device that stores known normalized telephone numbers (search keywords) and the like; 72 is an anti-join operation unit (search means) that searches the character strings stored in the extracted information storage device 26 for those that do not match any search keyword stored in the known information storage device 71; and 73 is an unknown information storage device that stores the character strings retrieved by the anti-join operation unit 72.

【００８７】次に動作について説明する。反結合演算部
７２は、ＷＷＷ情報収集部２４による情報収集が完了
し、収集した全てのウェブページに対応する抽出情報が
抽出情報記憶装置２６に格納された時点で動作を開始す
る。反結合演算部７２は、抽出情報記憶装置２６に格納
されたエントリのうち、キー（正規化電話番号）が既知
情報記憶装置７１に存在しないものを未知情報記憶装置
７３に格納する。Next, the operation will be described. The anti-join operation unit 72 starts operating once the WWW information collection unit 24 has finished collecting information and the extracted information corresponding to all the collected web pages has been stored in the extracted information storage device 26. Of the entries stored in the extracted information storage device 26, it stores in the unknown information storage device 73 those whose key (normalized telephone number) does not exist in the known information storage device 71.

【００８８】ここで、既知情報記憶装置７１には、抽出
情報記憶装置２６と同様に、予め正規化された形式でキ
ー（電話番号）が格納されているものとする（図２４を
参照）。具体的には、反結合演算部７２は、抽出情報記
憶装置２６からエントリのキーを１つ取得し、既知情報
記憶装置７１を検索して当該キーが存在しなかった場合
のみ当該エントリを出力する、という処理を抽出情報記
憶装置２６の全てのエントリに対して実行する。Here, it is assumed that the known information storage device 71, like the extracted information storage device 26, stores its keys (telephone numbers) in a previously normalized form (see FIG. 24). Specifically, the anti-join operation unit 72 takes the key of one entry from the extracted information storage device 26, searches the known information storage device 71, and outputs the entry only if the key is not found; it performs this process for every entry in the extracted information storage device 26.
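
The anti-join described above amounts to keeping only the entries whose key is absent from the known set. A minimal sketch follows, assuming the extracted information storage device 26 is a dictionary keyed by normalized telephone number and the known information storage device 71 is a set of already normalized keys; the function name and sample data are hypothetical.

```python
def anti_join(extracted_entries, known_keys):
    """Keep the entries whose key does not exist among the known keys."""
    return {key: entry
            for key, entry in extracted_entries.items()
            if key not in known_keys}

# Hypothetical sample data.
extracted = {
    "0312345678": [("http://www.a.co.jp/gaiyou.html", 90)],
    "0611112222": [("http://www.b.co.jp/list.html", 40)],
}
known = {"0312345678"}
unknown = anti_join(extracted, known)
```

The result plays the role of the unknown information storage device 73: telephone numbers found on the web that were not previously known.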

【００８９】以上のように、この実施の形態５によれ
ば、ＷＷＷサーバ２１から収集した情報から検索対象の
属性（電話番号）を有する文字列を抽出するようにした
ので、具体的に検索対象の値を指定することが不可能な
検索を実行することができる効果を奏する。As described above, according to Embodiment 5, character strings having the attribute of the search target (telephone number) are extracted from the information collected from the WWW server 21, so that a search can be executed even when the value to be searched for cannot be specified concretely.

【００９０】なお、この実施の形態５においても、上記
実施の形態１と同様に、インターネット情報検索システ
ム２３の構成要素をＩＣなどの専用のハードウエアを用
いて構成してもよいし、ソフトウエア（情報検索プログ
ラム）を実行するコンピュータを用いて構成してもよ
い。In Embodiment 5, as in Embodiment 1, the components of the Internet information search system 23 may be implemented with dedicated hardware such as ICs, or with a computer that executes software (an information retrieval program).

【００９１】[0091]

【発明の効果】以上のように、この発明によれば、検索
手段により検索された文字列が記述されているウェブペ
ージの順位を当該ウェブページのスコアを参照して決定
する順位決定手段を設けるように構成したので、ユーザ
による検索結果の妥当性のチェックを省略することがで
きる効果がある。As described above, according to the present invention, rank determining means is provided that determines the rank of each web page in which a character string found by the search means is described, with reference to the score of that web page, so that the user need not check the validity of each search result.
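
As a rough illustration of this effect, the sketch below selects the extracted entries whose key matches a search keyword and orders the corresponding web pages by score. The entry fields (`key`, `url`, `score`), the function name `rank_matching_pages`, and the sample values are assumptions made for the example; how the score itself is computed is not reproduced here.

```python
def rank_matching_pages(extracted_entries, keyword):
    """Return entries whose key equals keyword, highest score first."""
    # Search step: keep only entries whose extracted string matches the keyword.
    matches = [e for e in extracted_entries if e["key"] == keyword]
    # Ranking step: order the matching pages by their page score, descending.
    return sorted(matches, key=lambda e: e["score"], reverse=True)


# Hypothetical extracted information with illustrative scores.
entries = [
    {"key": "0312345678", "url": "http://example.com/links.html", "score": 0.2},
    {"key": "0312345678", "url": "http://example.com/company.html", "score": 0.9},
    {"key": "0398765432", "url": "http://example.org/index.html", "score": 0.8},
]

ranked = rank_matching_pages(entries, "0312345678")
for entry in ranked:
    print(entry["url"], entry["score"])
```

Because the pages come back already ordered by score, a user could start from the first result instead of inspecting every match.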

【００９２】この発明によれば、収集手段により収集さ
れたウェブページの順位を当該ウェブページのスコアを
参照して決定する順位決定手段を設けるように構成した
ので、検索キーワードの指定だけでは除外できない不適
当な結果を除外することができるようになり、その結
果、ユーザが妥当な結果を選択する作業を省略すること
ができる効果がある。According to the present invention, rank determining means is provided that determines the rank of the web pages collected by the collecting means with reference to the scores of those web pages, so that inappropriate results that cannot be excluded merely by specifying a search keyword can be excluded; as a result, the user is spared the work of selecting valid results.

【００９３】この発明によれば、順位決定手段が順位の
高いウェブページのアドレスから順番に出力するように
構成したので、妥当性の高いウェブページからアクセス
することが可能になる効果がある。According to the present invention, the rank determining means outputs the addresses of the web pages in descending order of rank, so that the user can access the web pages starting from the most valid one.

【００９４】この発明によれば、順位決定手段が順位の
高いウェブページの部分内容から順番に出力するように
構成したので、ユーザが実際にウェブページをアクセス
することなく、目的の情報を知ることができる効果があ
る。According to the present invention, the rank determining means outputs partial contents of the web pages in descending order of rank, so that the user can obtain the desired information without actually accessing the web pages.

【００９５】この発明によれば、順位決定手段がウェブ
ページのスコアに基づいて当該ウェブページが属するＷ
ＷＷサーバの順位を決定するように構成したので、妥当
性の高いウェブページを保有するＷＷＷサーバを認識す
ることができる効果がある。According to the present invention, the rank determining means determines the rank of the WWW server to which a web page belongs based on the score of that web page, so that WWW servers holding highly valid web pages can be identified.
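
One way to picture server-level ranking is sketched below. The text here does not say how page scores are combined per server, so taking the maximum page score per host is purely an assumption of this example, as are the function name `rank_servers` and the sample URLs.

```python
from urllib.parse import urlsplit


def rank_servers(page_scores):
    """Rank hosts by the best score among their pages (assumed aggregation).

    page_scores: dict mapping page URL -> score.
    Returns host names ordered from highest to lowest best-page score.
    """
    best = {}
    for url, score in page_scores.items():
        host = urlsplit(url).hostname  # server part of the URL
        # Keep the highest page score seen so far for this host.
        best[host] = max(best.get(host, score), score)
    return sorted(best, key=best.get, reverse=True)


# Hypothetical page scores.
page_scores = {
    "http://a.example/x.html": 0.9,
    "http://b.example/y.html": 0.5,
    "http://b.example/z.html": 0.7,
}

servers = rank_servers(page_scores)
```

A sum or average per host would be an equally plausible choice; the maximum is used only to keep the sketch short.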

【００９６】この発明によれば、検索手段が抽出手段に
より抽出された文字列及び検索キーワードの正規化を実
施するように構成したので、表記の揺れを吸収して検索
精度を高めることができる効果がある。According to the present invention, the search means normalizes the character strings extracted by the extraction means and the search keyword, so that variations in notation are absorbed and search accuracy is improved.
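
To illustrate what absorbing notational variation can mean for telephone numbers, the sketch below folds full-width characters to ASCII and drops every non-digit. These particular rules and the function name `normalize_phone` are assumptions for the example; the normalization rules actually used by the device are not reproduced here.

```python
import unicodedata

ASCII_DIGITS = "0123456789"


def normalize_phone(text):
    """Return only the ASCII digits of a telephone number string."""
    # NFKC folds full-width digits and punctuation to their ASCII forms.
    folded = unicodedata.normalize("NFKC", text)
    # Drop hyphens, parentheses, spaces and any other non-digit characters.
    return "".join(ch for ch in folded if ch in ASCII_DIGITS)


# Two differently written forms of the same (made-up) number.
a = normalize_phone("０３－１２３４－５６７８")
b = normalize_phone("(03) 1234-5678")
```

Applying the same function to both the extracted strings and the search keyword lets differently written forms of one number compare as equal.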

【００９７】この発明によれば、検索手段が電話番号を
検索キーワードとして用いるように構成したので、企業
や団体についての情報を的確に検索することができる効
果がある。According to the present invention, the search means uses a telephone number as the search keyword, so that information about a company or organization can be retrieved accurately.

【００９８】この発明によれば、収集手段により収集さ
れたウェブページの順位を当該ウェブページのスコアを
参照して決定するとともに、そのウェブページのアドレ
スを上位のウェブページのアドレスに変更する順位決定
手段を設けるように構成したので、検索キーワードが含
まれているとは限らない妥当な結果を返すことができる
ように、その結果、ユーザが結果を修正する作業を省略
することができる効果がある。According to the present invention, rank determining means is provided that determines the rank of the web pages collected by the collecting means with reference to the scores of those web pages and changes the address of a web page to the address of a higher-level web page, so that valid results that do not necessarily contain the search keyword can be returned; as a result, the user is spared the work of correcting the results.
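
The sketch below only enumerates candidate higher-level addresses for a page, assuming that "higher-level" means closer to the site root in the URL path hierarchy. The function name `ancestor_urls` and the sample URL are hypothetical, and the criteria by which the URL correction unit 62 chooses among such candidates are not modeled.

```python
from urllib.parse import urlsplit, urlunsplit


def ancestor_urls(url):
    """List directory-level ancestors of url, nearest first."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    ancestors = []
    # Drop one trailing path segment at a time until the root is reached.
    for depth in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:depth])
        if not path.endswith("/"):
            path += "/"
        ancestors.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return ancestors


candidates = ancestor_urls("http://example.com/company/info/access.html")
```

A page deep in a site, such as a hypothetical access-map page, yields its enclosing directories up to the root, any of which could be returned in place of the original address.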

【００９９】この発明によれば、抽出手段により抽出さ
れた文字列のうち、検索キーワードと一致しない文字列
を検索する検索手段を設けるように構成したので、具体
的に検索対象の値を指定することが不可能な検索を実行
することができる効果がある。According to the present invention, search means is provided that searches the character strings extracted by the extraction means for character strings that do not match the search keyword, so that a search can be executed even when the value to be searched for cannot be specified concretely.

【０１００】この発明によれば、検索キーワードと一致
する文字列を検索し、その文字列が記述されているウェ
ブページの順位を当該ウェブページのスコアを参照して
決定するように構成したので、ユーザによる検索結果の
妥当性のチェックを省略することができる効果がある。According to the present invention, a character string that matches the search keyword is searched for and the rank of each web page in which that character string is described is determined with reference to the score of the web page, so that the user need not check the validity of each search result.

【０１０１】この発明によれば、収集したウェブページ
の順位を当該ウェブページのスコアを参照して決定する
ように構成したので、検索キーワードの指定だけでは除
外できない不適当な結果を除外することができるように
なり、その結果、ユーザが妥当な結果を選択する作業を
省略することができる効果がある。According to the present invention, since the ranking of the collected web pages is determined by referring to the score of the web page, it is possible to exclude inappropriate results that cannot be excluded only by specifying the search keyword. As a result, the user can omit the work of selecting a proper result.

【０１０２】この発明によれば、順位の高いウェブペー
ジのアドレスから順番に出力するように構成したので、
妥当性の高いウェブページからアクセスすることが可能
になる効果がある。According to the present invention, the addresses of the web pages are output in descending order of rank, so that the user can access the web pages starting from the most valid one.

【０１０３】この発明によれば、順位の高いウェブペー
ジの部分内容から順番に出力するように構成したので、
ユーザが実際にウェブページをアクセスすることなく、
目的の情報を知ることができる効果がある。According to the present invention, partial contents of the web pages are output in descending order of rank, so that the user can obtain the desired information without actually accessing the web pages.

【０１０４】この発明によれば、ウェブページのスコア
に基づいて当該ウェブページが属するＷＷＷサーバの順
位を決定するように構成したので、妥当性の高いウェブ
ページを保有するＷＷＷサーバを認識することができる
効果がある。According to the present invention, the rank of the WWW server to which a web page belongs is determined based on the score of that web page, so that WWW servers holding highly valid web pages can be identified.

【０１０５】この発明によれば、抽出した文字列及び検
索キーワードの正規化を実施するように構成したので、
表記の揺れを吸収して検索精度を高めることができる効
果がある。According to the present invention, the extracted character strings and the search keyword are normalized, so that variations in notation are absorbed and search accuracy is improved.

【０１０６】この発明によれば、電話番号を検索キーワ
ードとして用いるように構成したので、企業や団体につ
いての情報を的確に検索することができる効果がある。According to the present invention, since the telephone number is used as the search keyword, there is an effect that information about a company or a group can be accurately searched.

【０１０７】この発明によれば、収集したウェブページ
の順位を当該ウェブページのスコアを参照して決定する
とともに、そのウェブページのアドレスを上位のウェブ
ページのアドレスに変更するように構成したので、検索
キーワードが含まれているとは限らない妥当な結果を返
すことができるように、その結果、ユーザが結果を修正
する作業を省略することができる効果がある。According to the present invention, the rank of the collected web pages is determined with reference to the scores of those web pages and the address of a web page is changed to the address of a higher-level web page, so that valid results that do not necessarily contain the search keyword can be returned; as a result, the user is spared the work of correcting the results.

【０１０８】この発明によれば、抽出した文字列のう
ち、検索キーワードと一致しない文字列を検索するよう
に構成したので、具体的に検索対象の値を指定すること
が不可能な検索を実行することができる効果がある。According to the present invention, the extracted character strings are searched for those that do not match the search keyword, so that a search can be executed even when the value to be searched for cannot be specified concretely.

【０１０９】この発明によれば、検索処理手順により検
索された文字列が記述されているウェブページの順位を
当該ウェブページのスコアを参照して決定する順位決定
処理手順を設けるように構成したので、ユーザによる検
索結果の妥当性のチェックを省略することができる効果
がある。According to the present invention, a rank determination processing procedure is provided that determines the rank of each web page in which a character string found by the search processing procedure is described, with reference to the score of that web page, so that the user need not check the validity of each search result.

【０１１０】この発明によれば、収集処理手順により収
集されたウェブページの順位を当該ウェブページのスコ
アを参照して決定する順位決定処理手順を設けるように
構成したので、検索キーワードの指定だけでは除外でき
ない不適当な結果を除外することができるようになり、
その結果、ユーザが妥当な結果を選択する作業を省略す
ることができる効果がある。According to the present invention, a rank determination processing procedure is provided that determines the rank of the web pages collected by the collection processing procedure with reference to the scores of those web pages, so that inappropriate results that cannot be excluded merely by specifying a search keyword can be excluded; as a result, the user is spared the work of selecting valid results.

【０１１１】この発明によれば、収集処理手順により収
集されたウェブページの順位を当該ウェブページのスコ
アを参照して決定するとともに、そのウェブページのア
ドレスを上位のウェブページのアドレスに変更する順位
決定処理手順を設けるように構成したので、検索キーワ
ードが含まれているとは限らない妥当な結果を返すこと
ができるように、その結果、ユーザが結果を修正する作
業を省略することができる効果がある。According to the present invention, a rank determination processing procedure is provided that determines the rank of the web pages collected by the collection processing procedure with reference to the scores of those web pages and changes the address of a web page to the address of a higher-level web page, so that valid results that do not necessarily contain the search keyword can be returned; as a result, the user is spared the work of correcting the results.

【０１１２】この発明によれば、抽出処理手順により抽
出された文字列のうち、検索キーワードと一致しない文
字列を検索する検索処理手順を設けるように構成したの
で、具体的に検索対象の値を指定することが不可能な検
索を実行することができる効果がある。According to the present invention, a search processing procedure is provided that searches the character strings extracted by the extraction processing procedure for character strings that do not match the search keyword, so that a search can be executed even when the value to be searched for cannot be specified concretely.

[Brief description of drawings]

【図１】この発明の実施の形態１による情報検索装置
を示す構成図である。FIG. 1 is a configuration diagram showing an information search device according to a first embodiment of the present invention.

【図２】ＷＷＷ情報収集部の内部を示す構成図であ
る。FIG. 2 is a configuration diagram showing the inside of a WWW information collecting unit.

【図３】情報抽出部の内部を示す構成図である。FIG. 3 is a configuration diagram showing the inside of an information extraction unit.

【図４】ＷＷＷ情報収集部の処理内容を示すフローチ
ャートである。FIG. 4 is a flowchart showing the processing content of a WWW information collection unit.

【図５】結合演算処理部の処理内容を示すフローチャ
ートである。FIG. 5 is a flowchart showing the processing contents of a join operation processing unit.

【図６】ランキング処理部の処理内容を示すフローチ
ャートである。FIG. 6 is a flowchart showing the processing contents of a ranking processing unit.

【図７】検索対象情報を示す説明図である。FIG. 7 is an explanatory diagram showing search target information.

【図８】結果情報を示す説明図である。FIG. 8 is an explanatory diagram showing result information.

【図９】構文ルールを示す説明図である。FIG. 9 is an explanatory diagram showing syntax rules.

【図１０】抽出情報リストや出現ＵＲＬリストを示す
説明図である。FIG. 10 is an explanatory diagram showing an extraction information list and an appearance URL list.

【図１１】スコアの設定方法を説明するための説明図
である。FIG. 11 is an explanatory diagram for explaining a score setting method.

【図１２】コンピュータ用いて構成する場合のハード
ウエア構成図である。FIG. 12 is a hardware configuration diagram when configured using a computer.

【図１３】この発明の実施の形態３による情報検索装
置を示す構成図である。FIG. 13 is a configuration diagram showing an information search device according to a third embodiment of the present invention.

【図１４】属性抽出部の出力情報を示す説明図であ
る。FIG. 14 is an explanatory diagram showing output information of an attribute extraction unit.

【図１５】スコア設定部の出力情報を示す説明図であ
る。FIG. 15 is an explanatory diagram showing output information of a score setting unit.

【図１６】ランキング処理部の出力情報を示す説明図
である。FIG. 16 is an explanatory diagram showing output information of a ranking processing unit.

【図１７】この発明の実施の形態３による情報検索装
置の効果を示す説明図である。FIG. 17 is an explanatory diagram showing effects of the information search device according to the third embodiment of the present invention.

【図１８】この発明の実施の形態４による情報検索装
置を示す構成図である。FIG. 18 is a configuration diagram showing an information search device according to a fourth embodiment of the present invention.

【図１９】抽出情報インデックスを示す説明図であ
る。FIG. 19 is an explanatory diagram showing an extraction information index.

【図２０】ＵＲＬ修正部の処理内容を示すフローチャ
ートである。FIG. 20 is a flowchart showing the processing contents of a URL correction unit.

【図２１】ＵＲＬ修正部におけるファイル名の妥当性
の設定例を示す説明図である。FIG. 21 is an explanatory diagram showing a setting example of validity of a file name in the URL correction unit.

【図２２】この発明の実施の形態４による情報検索装
置の効果を示す説明図である。FIG. 22 is an explanatory diagram showing effects of the information search device according to the fourth embodiment of the present invention.

【図２３】この発明の実施の形態５による情報検索装
置を示す構成図である。FIG. 23 is a configuration diagram showing an information search device according to a fifth embodiment of the present invention.

【図２４】抽出情報記憶装置及び未知情報記憶装置の
記憶内容を示す説明図である。FIG. 24 is an explanatory diagram showing stored contents of an extracted information storage device and an unknown information storage device.

【図２５】従来の情報検索装置を示す構成図である。FIG. 25 is a block diagram showing a conventional information retrieval device.

【図２６】インデックス情報の一例を示す説明図であ
る。FIG. 26 is an explanatory diagram showing an example of index information.

【図２７】検索対象情報を示す説明図である。FIG. 27 is an explanatory diagram showing search target information.

[Explanation of symbols]

２１ＷＷＷサーバ、２２インターネット、２３イ
ンターネット情報検索システム、２４ＷＷＷ情報収集
部（収集手段）、２４ａ取得要求ＵＲＬキュー、２４
ｂダウンロード部、２４ｃ既取得ＵＲＬ記憶装置、
２４ｄＵＲＬコンテンツ記憶装置、２４ｅリンク抽
出部、２５情報抽出部（抽出手段）、２５ａＨＴＭ
Ｌ解析部、２５ｂ字句要素解析部、２５ｃ構文解析
部、２５ｄ抽出情報管理部、２６抽出情報記憶装
置、２７検索対象情報記憶装置、２８結合演算処理
部（検索手段）、２９ランキング処理部（順位決定手
段）、３０結果情報記憶装置、４１ＣＰＵ、４２
メモリ、４３コンソールインタフェース、４４コン
ソール入出力装置、４５ディスクインタフェース、４
６ハードディスク装置、４７ネットワークインタフ
ェース、５１検索エンジン、５２検索エンジン問合
せ部（収集手段）、５３ダウンロード部（収集手
段）、５４属性抽出部（抽出手段）、５５スコア設
定部（抽出手段）、５６ランキング処理部（順位決定
手段）、６１抽出情報管理部、６２ＵＲＬ修正部
（順位決定手段）、７１既知情報記憶装置、７２反
結合演算部（検索手段）、７３未知情報記憶装置。21 WWW server, 22 Internet, 23 Internet information search system, 24 WWW information collection unit (collecting means), 24a acquisition request URL queue, 24b download unit, 24c acquired URL storage device, 24d URL content storage device, 24e link extraction unit, 25 information extraction unit (extraction means), 25a HTML analysis unit, 25b lexical element analysis unit, 25c syntax analysis unit, 25d extracted information management unit, 26 extracted information storage device, 27 search target information storage device, 28 join operation processing unit (search means), 29 ranking processing unit (rank determining means), 30 result information storage device, 41 CPU, 42 memory, 43 console interface, 44 console input/output device, 45 disk interface, 46 hard disk device, 47 network interface, 51 search engine, 52 search engine inquiry unit (collecting means), 53 download unit (collecting means), 54 attribute extraction unit (extraction means), 55 score setting unit (extraction means), 56 ranking processing unit (rank determining means), 61 extracted information management unit, 62 URL correction unit (rank determining means), 71 known information storage device, 72 anti-join operation unit (search means), 73 unknown information storage device.

フロントページの続き (72)発明者田村孝之東京都千代田区丸の内二丁目２番３号三菱電機株式会社内Ｆターム(参考） 5B075 KK07 ND16 PR08 UU40 Continuation of front page: (72) Inventor: Takayuki Tamura, c/o Mitsubishi Electric Corporation, 2-2-3 Marunouchi, Chiyoda-ku, Tokyo. F-terms (reference): 5B075 KK07 ND16 PR08 UU40

Claims

[Claims]

1. An information retrieval device comprising: collecting means for collecting web pages from a WWW server; extraction means for extracting, from a web page collected by the collecting means, a character string having the same attribute as a search keyword, and for setting a score of the web page according to the degree of description of the character string in the web page; search means for searching the character strings extracted by the extraction means for a character string that matches the search keyword; and rank determining means for determining the rank of each web page in which the character string found by the search means is described, with reference to the score of the web page.

2. An information retrieval device comprising: collecting means for collecting, from a WWW server, web pages containing a character string that matches a search keyword; extraction means for extracting, from a web page collected by the collecting means, a character string having the same attribute as the search keyword, and for setting a score of the web page according to the degree of description of the character string in the web page; and rank determining means for determining the rank of the web pages collected by the collecting means with reference to the scores of the web pages.

3. The information retrieval device according to claim 1, wherein the rank determining means outputs the addresses of the web pages in descending order of rank.

4. The information retrieval device according to claim 1, wherein the rank determining means outputs partial contents of the web pages in descending order of rank.

5. The information retrieval device described, wherein the rank determining means determines the rank of the WWW server to which a web page belongs based on the score of the web page.

6. The information retrieval device according to claim 1, wherein the search means normalizes the character strings extracted by the extraction means and the search keyword.

7. The information retrieval device according to claim 1, wherein the search means uses a telephone number as the search keyword.

8. An information retrieval device comprising: collecting means for collecting web pages from a WWW server; extraction means for extracting, from a web page collected by the collecting means, a character string having the same attribute as a search keyword, and for setting a score of the web page according to the degree of description of the character string in the web page; and rank determining means for determining the rank of the web pages collected by the collecting means with reference to the scores of the web pages, and for changing the address of a web page to the address of a higher-level web page.

9. An information retrieval device comprising: collecting means for collecting web pages from a WWW server; extraction means for extracting, from the web pages collected by the collecting means, character strings having the same attribute as a search keyword; and search means for searching the character strings extracted by the extraction means for character strings that do not match the search keyword.

10. An information retrieval method comprising: upon collecting web pages from a WWW server, extracting from each web page a character string having the same attribute as a search keyword and setting a score of the web page according to the degree of description of the character string in the web page; searching for a character string that matches the search keyword; and determining the rank of each web page in which that character string is described, with reference to the score of the web page.

11. An information retrieval method comprising: upon collecting, from a WWW server, web pages containing a character string that matches a search keyword, extracting from each web page a character string having the same attribute as the search keyword and setting a score of the web page according to the degree of description of the character string in the web page; and determining the rank of the collected web pages with reference to the scores of the web pages.

12. The information retrieval method according to claim 10, wherein the addresses of the web pages are output in descending order of rank.

13. The information retrieval method according to claim 10, wherein partial contents of the web pages are output in descending order of rank.

14. The information retrieval method according to claim 10, wherein the rank of the WWW server to which a web page belongs is determined based on the score of the web page.

15. The information retrieval method according to claim 10, wherein the extracted character strings and the search keyword are normalized.

16. The information retrieval method according to claim 10, wherein a telephone number is used as the search keyword.

17. An information retrieval method comprising: upon collecting web pages from a WWW server, extracting from each web page a character string having the same attribute as a search keyword and setting a score of the web page according to the degree of description of the character string in the web page; determining the rank of the collected web pages with reference to the scores of the web pages; and changing the address of a web page to the address of a higher-level web page.

18. An information retrieval method comprising: upon collecting web pages from a WWW server, extracting from the web pages character strings having the same attribute as a search keyword; and searching the extracted character strings for character strings that do not match the search keyword.

19. An information retrieval program for causing a computer to execute: a collection processing procedure for collecting web pages from a WWW server; an extraction processing procedure for extracting, from a web page collected by the collection processing procedure, a character string having the same attribute as a search keyword, and for setting a score of the web page according to the degree of description of the character string in the web page; a search processing procedure for searching the character strings extracted by the extraction processing procedure for a character string that matches the search keyword; and a rank determination processing procedure for determining the rank of each web page in which the character string found by the search processing procedure is described, with reference to the score of the web page.

20. An information retrieval program for causing a computer to execute: a collection processing procedure for collecting, from a WWW server, web pages containing a character string that matches a search keyword; an extraction processing procedure for extracting, from a web page collected by the collection processing procedure, a character string having the same attribute as the search keyword, and for setting a score of the web page according to the degree of description of the character string in the web page; and a rank determination processing procedure for determining the rank of the web pages collected by the collection processing procedure with reference to the scores of the web pages.

21. An information retrieval program for causing a computer to execute: a collection processing procedure for collecting web pages from a WWW server; an extraction processing procedure for extracting, from a web page collected by the collection processing procedure, a character string having the same attribute as a search keyword, and for setting a score of the web page according to the degree of description of the character string in the web page; and a rank determination processing procedure for determining the rank of the web pages collected by the collection processing procedure with reference to the scores of the web pages, and for changing the address of a web page to the address of a higher-level web page.

22. An information retrieval program for causing a computer to execute: a collection processing procedure for collecting web pages from a WWW server; an extraction processing procedure for extracting, from the web pages collected by the collection processing procedure, character strings having the same attribute as a search keyword; and a search processing procedure for searching the character strings extracted by the extraction processing procedure for character strings that do not match the search keyword.