JP2001265774A

JP2001265774A - Method and device for retrieving information, recording medium with recorded information retrieval program and hypertext information retrieving system

Info

Publication number: JP2001265774A
Application number: JP2000074079A
Authority: JP
Inventors: Masanori Harada; 昌紀原田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2000-03-16
Filing date: 2000-03-16
Publication date: 2001-09-28

Abstract

PROBLEM TO BE SOLVED: To reduce the probability of search omissions in an information retrieval system targeting a hypertext information system by improving the accuracy with which relevance to the search condition entered by a user is calculated. SOLUTION: An anchor extraction unit 1 extracts, from the structured text constituting a page, the anchors that express links to other pages. An index creation unit 2 performs indexing with position attributes and stores in a database 3, as position attribute information, not only the content of the structured text constituting a page but also the anchor text on other pages that expresses links to that page. A search processing unit 4 uses the position attribute information stored in the database 3 to retrieve pages that match a given search condition. For the search results found by the search processing unit 4, a relevance calculation unit 5 calculates relevance to the search condition, taking the position attribute information into account.

Description

[Detailed description of the invention]

[0001]

[Technical field of the invention] The present invention relates to an information retrieval apparatus that, in an information retrieval system, indexes a set of documents to be searched and retrieves documents matching a search condition entered by a user.

[0002]

[Prior art] An information retrieval apparatus searches for documents that satisfy a condition entered by a user and presents a list of them to the user. However, some of the retrieved documents match the search intent closely, while others hardly match it at all. Techniques have therefore been developed that calculate, for each document in the search results, its relevance to the search condition and present the documents in order of relevance, which helps the user browse and use the results.

[0003] Based on a particular retrieval model, an information retrieval apparatus calculates the relevance between a search condition and a document from information such as the number of occurrences of a term in the document, the size of the document, the total number of occurrences of the term in the entire set of documents to be searched, the positions at which the term occurs, and the attributes of those positions. Typical retrieval models include the vector space model and the probabilistic model (reference: R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999).

[0004] Meanwhile, today's document information systems often manage documents in a data structure called hypertext, which makes documents easier to write and browse. In hypertext, a document consists of pages, each a coherent unit of text information, and links that connect related pages. The implementation differs from one hypertext information system to another, but a common approach is to represent a page as structured text and to give words or phrases in the text the role of links to other pages. Hereinafter, the part of the structured text that functions as a link is called an anchor.

[0005] The purpose of an information retrieval system for hypertext is to present the user with a list of pages that satisfy a search condition. It is therefore customary to regard each page of the hypertext as one document and to process it in the same way as a general information retrieval system does.

[0006]

[Problems to be solved by the invention] As described above, in conventional hypertext information retrieval systems the information used to calculate relevance is confined to each individual page. In a hypertext information system, however, the number of pages to be searched grows large while the amount of text on each page is often small. Conventional relevance calculation methods therefore cannot calculate relevance with high reliability.

[0007] In addition, in a conventional information retrieval system, a page that does not contain the terms specified in the search condition has a relevance of 0, which results in a search omission. When many pages contain only a small amount of text, search omissions occur frequently. Furthermore, in a networked hypertext information system such as the WWW, pages are created independently by many authors and terminology is not controlled, so search omissions are especially likely.

[0008] An object of the present invention is to provide an information retrieval method, an information retrieval apparatus, and a recording medium on which an information retrieval program is recorded that, in an information retrieval system targeting a hypertext information system, improve the accuracy of calculating relevance to the search condition entered by a user and reduce the probability of search omissions.

[0009]

[Means for solving the problems] The information retrieval apparatus of the present invention comprises: anchor extraction means that inputs the structured text constituting a page to be searched, interprets the structured text, extracts anchors, and outputs the address of the page referenced by the link that each extracted anchor represents together with the anchor text used for the visible representation of the link; a database; index creation means that receives the structured text constituting a page and the address of that page, further receives the referenced address and the anchor text, interprets the anchor text as part of the structured text, and outputs the position attribute information required for search processing and stores it in the database; search processing means that receives a search condition, reads from the database the position attribute information corresponding to the terms contained in the search condition, collates it with the search condition to obtain the position attribute information of the pages that satisfy the search condition, outputs it together with the search condition, adds the relevance to the addresses of the pages that satisfy the search condition, and outputs them as search results; and relevance calculation means that receives the search condition and the position attribute information and calculates the relevance.

[0010] In the present invention, when a term contained in the anchor text that expresses a link on a referring page is specified as a search condition, not only the referring page on which the anchor text appears but also the referenced page is judged to satisfy the search condition, and the terms in the anchor text of the referring page are used to calculate the relevance of the referenced page.

[0011]

[Embodiments of the invention] Next, embodiments of the present invention will be described with reference to the drawings.

[0012] Referring to FIG. 1, an information retrieval apparatus according to one embodiment of the present invention comprises an anchor extraction unit 1, an index creation unit 2, a database 3, a search processing unit 4, and a relevance calculation unit 5.

[0013] As shown in FIG. 2, the anchor extraction unit 1 receives from the hypertext information system the structured text constituting a page to be searched (step 11), interprets the structured text (step 12), and extracts the anchors (step 13). Of the information constituting each extracted anchor, it sends to the index creation unit 2 the address of the page referenced by the link that the anchor represents and the anchor text used for the visible representation of the link (step 14).
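Steps 11 to 14 can be sketched as follows. This is a minimal illustration, assuming the structured text is HTML (as in the WWW embodiment of this document) and using Python's standard `html.parser`; the names `AnchorExtractor` and `extract_anchors` are hypothetical and do not come from the patent.

```python
from html.parser import HTMLParser


class AnchorExtractor(HTMLParser):
    """Collects (referenced address, anchor text) pairs from one page."""

    def __init__(self):
        super().__init__()
        self.anchors = []   # finished (href, anchor text) pairs
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # visible text collected inside that anchor

    def handle_starttag(self, tag, attrs):
        # Step 13: an A element with an HREF attribute is an anchor.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only text between the start and end tag of an anchor is kept.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.anchors.append((self._href, text))
            self._href = None


def extract_anchors(structured_text):
    """Steps 11-14: parse one page and return its anchors."""
    parser = AnchorExtractor()
    parser.feed(structured_text)   # step 12: interpret the structured text
    parser.close()
    return parser.anchors          # step 14: hand over to the indexer
```

The pairs returned by `extract_anchors` are what the index creation unit would receive; relative addresses are returned unresolved in this sketch.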

[0014] As shown in FIG. 3, the index creation unit 2 receives from the hypertext information system the structured text constituting a page and the address of that page (step 21). It also receives from the anchor extraction unit 1 the referenced address of the link that an anchor represents and the anchor text (step 22), interprets the anchor text as part of the structured text corresponding to the referenced address (step 23), generates position attribute information, and passes it to the database 3 (step 24).

[0015] As shown in FIG. 4, the search processing unit 4 receives a search condition from the search reception unit (step 31), interprets the search condition (step 32), reads from the database 3 the position attribute information corresponding to the terms contained in the search condition (step 33), collates it with the search condition to obtain the position attribute information of the pages that satisfy the search condition (step 34), and passes it together with the search condition to the relevance calculation unit 5 (step 35). It then receives the relevance calculated by the relevance calculation unit 5 (step 36), adds the relevance to the addresses of the pages that satisfy the search condition, and passes them to the search reception unit as search results (step 37).
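Steps 31 to 37 can be sketched as below. This is only an illustration under stated assumptions: the search condition is taken to be a conjunction (AND) of whitespace-separated terms, the index is a dictionary mapping each term to `(address, position, attribute)` tuples, and `relevance_fn` is a hypothetical callback standing in for the relevance calculation unit 5.

```python
def search(index, query, relevance_fn):
    """Returns [(page address, relevance), ...] for pages matching all terms."""
    terms = query.lower().split()                          # step 32
    postings = {t: index.get(t, []) for t in terms}        # step 33
    page_sets = [{addr for addr, _, _ in p} for p in postings.values()]
    # Step 34: a page satisfies the condition if every term occurs on it.
    matching = set.intersection(*page_sets) if page_sets else set()
    results = []
    for address in sorted(matching):
        hits = [h for p in postings.values() for h in p if h[0] == address]
        # Steps 35-36: hand the hits to the relevance calculator.
        results.append((address, relevance_fn(terms, hits)))
    return results                                         # step 37
```

Because occurrences that come from anchor text are stored under the referenced page's address, such a page can match here even when its own text lacks the term.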

[0016] As shown in FIG. 5, the relevance calculation unit 5 receives the search condition and the position attribute information from the search processing unit 4 (step 41), calculates the relevance using the frequency of occurrence of the terms (step 42), and then performs a more detailed relevance calculation using the position attribute information (step 43). At this stage, the contribution of the "source anchor" attribute to the relevance is also evaluated.
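The two-stage calculation of steps 42 and 43 might look like the sketch below. The patent does not give a formula, so the additive scheme, the weight values in `ATTRIBUTE_WEIGHTS`, and the function name `relevance` are all assumptions chosen only to make the idea concrete.

```python
# Assumed weights per attribute; the patent specifies no concrete values.
ATTRIBUTE_WEIGHTS = {
    "body": 1.0,
    "title": 2.0,
    "heading": 1.5,
    "keyword": 1.5,
    "source_anchor": 3.0,
}


def relevance(hits, weights=ATTRIBUTE_WEIGHTS):
    """hits: [(position, attribute), ...] for the query terms on one page."""
    # Step 42: a base score from the number of occurrences.
    base = len(hits)
    # Step 43: refine using the attribute of each occurrence position.
    bonus = sum(weights.get(attribute, 1.0) - 1.0 for _, attribute in hits)
    return base + bonus
```

With these assumed weights, an occurrence in anchor text on a referring page counts three times as much as an occurrence in ordinary body text.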

[0017] FIG. 6 is a configuration diagram showing the above-described information retrieval apparatus of the present invention applied to an information retrieval system targeting the WWW, which is one kind of hypertext information system.

[0018] The WWW information retrieval system 100 comprises the anchor extraction unit 1, the index creation unit 2, the database 3, the search processing unit 4, and the relevance calculation unit 5 (which together constitute the information retrieval apparatus of FIG. 1), as well as a search reception unit 6, a page collection unit 7, a page storage unit 8, and an anchor storage unit 9.

[0019] Prior to search processing, the WWW information retrieval system 100 creates the index required for search processing. The page collection unit 7 accesses the WWW servers that hold the pages to be searched by means conforming to the HTTP protocol, acquires structured text in HTML format from the WWW information system 200, and stores it in the page storage unit 8. The HTML text accumulated in the page storage unit 8 is first passed to the anchor extraction unit 1, which extracts only the anchor portions from the HTML text in accordance with the HTML specification and stores them in the anchor storage unit 9. Here, an anchor portion corresponds to the A element defined in the HTML specification: the HREF attribute of the A element is the referenced address of the link that the anchor represents, and the visible part of the content written between the start tag and the end tag of the A element is the anchor text (reference: T. Berners-Lee and D. Connolly, Hypertext Markup Language - 2.0, Request for Comments 1866, Internet Engineering Task Force, 1995). The index creation unit 2 receives from the page storage unit 8 the HTML structured text constituting one page, interprets it in accordance with the HTML specification, sets the occurrence positions of the terms in the text based on the address of that page, assigns attributes such as title, heading, and keyword, and outputs the result as position attribute information, which it passes to the database 3. The index creation unit 2 also receives as input the anchors stored in the anchor storage unit 9, sets the occurrence positions of the terms in the anchor text based on the address of the page referenced by the link that the anchor represents, and outputs them together with the attribute "source anchor", passing them to the database 3. The database 3 stores the position attribute information output by the index creation unit 2 in a form suited to search processing.
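The in-page part of this indexing, in which each term is given a position and an attribute such as title or heading, can be sketched with Python's standard `html.parser`. The class name `PageIndexer`, the record layout, and the attribute labels are assumptions for illustration; keyword metadata and the anchor-derived records described above are left out to keep the sketch short.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class PageIndexer(HTMLParser):
    """Emits (term, page address, position, attribute) records for one page."""

    def __init__(self, address):
        super().__init__()
        self.address = address
        self.records = []
        self._open = []   # stack of open title/heading attributes

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._open.append("title")
        elif tag in HEADINGS:
            self._open.append("heading")

    def handle_endtag(self, tag):
        if (tag == "title" or tag in HEADINGS) and self._open:
            self._open.pop()

    def handle_data(self, data):
        # Text outside a title or heading gets the default "body" attribute.
        attribute = self._open[-1] if self._open else "body"
        for term in data.lower().split():
            position = len(self.records)
            self.records.append((term, self.address, position, attribute))
```

A caller would create one `PageIndexer` per stored page, call `feed()` with the HTML and then `close()`, and write `records` to the database.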

[0020] Search processing in the WWW information retrieval system 100 starts when the user 300 enters a search condition. The search reception unit 6 receives the search condition from the user 300 and passes it to the search processing unit 4. From the position attribute information stored in the database 3, the search processing unit 4 obtains the position attribute information that matches the search condition and passes it together with the search condition to the relevance calculation unit 5. From the position attribute information, the relevance calculation unit 5 obtains the total number of occurrences of each search term, the number of occurrences of the search term on each page that satisfies the search condition, the attributes at the positions where the search term occurs, and so on, calculates the relevance from them, and passes it to the search processing unit 4. In doing so, it applies weights so that the "source anchor" attribute contributes more to the relevance to the search condition than the standard attributes do. The search processing unit 4 obtains the addresses of the pages that satisfy the search condition, sorts them in descending order of the relevance calculated by the relevance calculation unit 5, and passes the sorted list to the search reception unit 6 as the search results. The search reception unit 6 passes some or all of the search results, arranged in order of relevance, to the user 300, and the search processing ends.
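The ranking described in this paragraph can be illustrated as follows. The scoring formula here (each occurrence contributes its attribute weight divided by the term's total occurrence count) is an assumption made for the sketch, loosely in the spirit of term-frequency and inverse-frequency weighting; the patent states only which quantities are used and that the source anchor attribute is weighted more heavily. The function name `rank` and all data shapes are hypothetical.

```python
def rank(hits_by_page, total_occurrences, weights):
    """hits_by_page: {address: [(term, attribute), ...]}
    total_occurrences: {term: occurrences in the whole collection}
    weights: {attribute: weight}; unknown attributes count as 1.0.
    Returns [(address, relevance), ...] in descending order of relevance.
    """
    scored = []
    for address, hits in hits_by_page.items():
        score = sum(
            weights.get(attribute, 1.0) / total_occurrences[term]
            for term, attribute in hits
        )
        scored.append((address, score))
    # Highest relevance first; ties are broken by address for stable output.
    return sorted(scored, key=lambda item: (-item[1], item[0]))
```

For instance, with a source anchor weight of 3.0, a page reached by one matching anchor outranks a page whose body text contains the term twice.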

[0021] Referring to FIG. 7, an information retrieval apparatus according to another embodiment of the present invention comprises an input device 51, storage devices 52 and 53, an output device 54, a recording medium 55, and a data processing device 56.

[0022] The input device 51 is an input device such as a keyboard for entering structured text, search conditions, and the like. The storage device 52 corresponds to the database in FIG. 1. The storage device 53 is a hard disk. The output device 54 is an output device such as a display or a printer to which the search results are output. The recording medium 55 is a recording medium such as a floppy disk, a CD-ROM, or a magneto-optical disk on which is recorded an information retrieval program consisting of the anchor extraction unit 1, the index creation unit 2, the search processing unit 4, and the relevance calculation unit 5 of FIG. 1. The data processing device 56 includes a CPU, various interfaces, and the like; it loads the information retrieval program from the recording medium 55 into the storage device 53 and executes it.

[0023]

[Effects of the invention] As described above, according to the present invention, when a search is performed with a search condition that includes a term contained in anchor text, the page referenced by the link that the anchor represents also becomes a search result. This makes it possible to search with terms that do not appear in the text of the referenced page, so the probability of search omissions decreases. In addition, anchor text often summarizes the referenced page concisely and accurately, so relevance can be calculated more accurately than when only the text within the page is used. Furthermore, in a networked hypertext information system such as the WWW, pages containing useful information tend to be linked from many pages. In a hypertext information retrieval system to which the present invention is applied, a page that is referenced from many pages by anchors containing a common term obtains a high relevance. As a result, pages containing useful information appear near the top of search results sorted by relevance, which improves convenience for users of the retrieval system.

[0024] For example, the official home page of Nippon Telegraph and Telephone Corporation is linked from many WWW pages by anchor text such as "NTT" and "日本電信電話" (Nippon Telegraph and Telephone). In the WWW information retrieval system of the present invention, when the term "NTT" is specified as the search condition, the relevance of such a page becomes high and the page is displayed with priority when the search results are arranged in order of relevance, which improves convenience for the user. Moreover, even if the term "エヌティティ" (the katakana spelling of "NTT") is not used anywhere on the official home page, as long as the page is linked by anchor text containing "エヌティティ", it is included in the search results when "エヌティティ" is used as the search condition, so the probability of search omissions decreases.
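The "エヌティティ" case can be reproduced in a few lines. This is a toy illustration: the addresses under `example.com`, the page texts, and the helper `find` are invented for the sketch, and terms are split on whitespace purely for simplicity.

```python
from collections import defaultdict

OFFICIAL = "http://example.com/official"   # hypothetical official home page
FAN = "http://example.com/fan"             # hypothetical page linking to it

# The official page never uses the katakana spelling in its own text.
pages = {
    OFFICIAL: "NTT 日本電信電話",
    FAN: "リンク集 エヌティティ",
}
# The fan page links to the official page with the anchor text "エヌティティ".
anchors = [(OFFICIAL, "エヌティティ")]

index = defaultdict(set)
for address, text in pages.items():
    for term in text.split():
        index[term].add((address, "body"))
for referenced, anchor_text in anchors:
    for term in anchor_text.split():
        # Anchor text is indexed under the referenced page as well.
        index[term].add((referenced, "source_anchor"))


def find(term):
    """Addresses of all pages that match the term, in sorted order."""
    return sorted({address for address, _ in index.get(term, ())})
```

Searching for "エヌティティ" with `find` returns both the referring page and the official page, although the official page's own text does not contain the term.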

[0025] The present invention can easily be applied to any information retrieval apparatus that already uses attribute information such as title, heading, and keyword in its relevance calculation, simply by adding the new attribute "source anchor". Furthermore, by making the contribution of the "source anchor" attribute to the relevance variable, it is possible to switch between the conventional relevance calculation, which uses only the text within the page, and the relevance calculation that takes source anchors into account.
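One way to make the contribution variable is to expose the source anchor weight as a parameter, as in the sketch below. The function name `relevance`, the default weight of 3.0, and the simple additive scoring are assumptions for illustration only.

```python
def relevance(attributes, anchor_weight=3.0):
    """attributes: attribute labels of the matched occurrences on one page.

    anchor_weight controls how much a "source_anchor" occurrence counts;
    every other occurrence counts 1.0 in this sketch.
    """
    return sum(
        anchor_weight if attribute == "source_anchor" else 1.0
        for attribute in attributes
    )
```

Setting `anchor_weight=0.0` ignores anchor-derived occurrences, which reproduces a calculation based only on the text within the page; any positive value takes source anchors into account.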

[Brief description of the drawings]

[FIG. 1] A block diagram showing the configuration of an information retrieval apparatus according to one embodiment of the present invention.

[FIG. 2] A flowchart showing the processing of the anchor extraction unit 1 used in the information retrieval apparatus of FIG. 1.

[FIG. 3] A flowchart showing the processing of the index creation unit 2 used in the information retrieval apparatus of FIG. 1.

[FIG. 4] A flowchart showing the processing of the search processing unit 4 used in the information retrieval apparatus of FIG. 1.

[FIG. 5] A flowchart showing the processing of the relevance calculation unit 5 used in the information retrieval apparatus of FIG. 1.

[FIG. 6] A diagram showing the configuration when the information retrieval apparatus of the present invention is applied to a WWW information retrieval system.

[FIG. 7] A block diagram showing an information retrieval apparatus according to another embodiment of the present invention.

[Explanation of symbols]

1 Anchor extraction unit
2 Index creation unit
3 Database
4 Search processing unit
5 Relevance calculation unit
6 Search reception unit
7 Page collection unit
8 Page storage unit
9 Anchor storage unit
11-14, 21-24, 31-37, 41-44 Steps
100 WWW information retrieval system
200 WWW information system
300 User
51 Input device
52, 53 Storage devices
54 Output device
55 Recording medium
56 Data processing device

Claims

[Claims]

1. An information retrieval method for hypertext, comprising: an anchor extraction step of inputting the structured text constituting a page to be searched, interpreting the structured text, extracting anchors, and outputting the address of the page referenced by the link that each extracted anchor represents and the anchor text used for the visible representation of the link; an index creation step of receiving the structured text constituting a page and the address of that page, further receiving the referenced address and the anchor text, interpreting the anchor text as part of the structured text, and outputting the position attribute information required for search processing and storing it in a database; a search processing step of receiving a search condition, reading from the database the position attribute information corresponding to the terms contained in the search condition, collating it with the search condition to obtain the position attribute information of the pages that satisfy the search condition, and outputting it together with the search condition; a relevance calculation step of receiving the search condition and the position attribute information and calculating the relevance; and a search result output step of adding the relevance to the addresses of the pages that satisfy the search condition and outputting them as search results.

2. An information retrieval apparatus for hypertext, comprising: anchor extraction means for inputting the structured text constituting a page to be searched, interpreting the structured text, extracting anchors, and outputting the address of the page referenced by the link that each extracted anchor represents and the anchor text used for the visible representation of the link; a database; index creation means for receiving the structured text constituting a page and the address of that page, further receiving the referenced address and the anchor text, interpreting the anchor text as part of the structured text, and outputting the position attribute information required for search processing and storing it in the database; search processing means for receiving a search condition, reading from the database the position attribute information corresponding to the terms contained in the search condition, collating it with the search condition to obtain the position attribute information of the pages that satisfy the search condition, outputting it together with the search condition, adding the relevance to the addresses of the pages that satisfy the search condition, and outputting them as search results; and relevance calculation means for receiving the search condition and the position attribute information and calculating the relevance.

3. A recording medium on which an information retrieval program for hypertext is recorded, the program causing a computer to execute: an anchor extraction process of inputting the structured text constituting a page to be searched, interpreting the structured text, extracting anchors, and outputting the address of the page referenced by the link that each extracted anchor represents and the anchor text used for the visible representation of the link; an index creation process of receiving the structured text constituting a page and the address of that page, further receiving the referenced address and the anchor text, interpreting the anchor text as part of the structured text, and outputting the position attribute information required for search processing and storing it in a database; a search process of receiving a search condition, reading from the database the position attribute information corresponding to the terms contained in the search condition, collating it with the search condition to obtain the position attribute information of the pages that satisfy the search condition, outputting it together with the search condition, adding the relevance to the addresses of the pages that satisfy the search condition, and outputting them as search results; and a relevance calculation process of receiving the search condition and the position attribute information and calculating the relevance.

4. A hypertext information retrieval system comprising: page storage means; page collection means for acquiring structured text from a hypertext information system and storing it in the page storage means; anchor storage means; anchor extraction means for inputting from the page storage means the structured text constituting a page to be searched, interpreting the structured text, extracting anchors, and storing in the anchor storage means the address of the page referenced by the link that each extracted anchor represents and the anchor text used for the visible representation of the link; a database; index creation means for reading from the page storage means the structured text constituting a page and the address of that page, reading the referenced address and the anchor text from the anchor storage means, interpreting the anchor text as part of the structured text, and outputting the position attribute information required for search processing and storing it in the database; search reception means for receiving a search condition from a user and presenting search results to the user; search processing means for receiving the search condition from the search reception means, reading from the database the position attribute information corresponding to the terms contained in the search condition, collating it with the search condition to obtain the position attribute information of the pages that satisfy the search condition, outputting it together with the search condition, adding the relevance to the addresses of the pages that satisfy the search condition, and outputting them to the search reception means as search results; and relevance calculation means for receiving the search condition and the position attribute information and calculating the relevance.