JP4826622B2

JP4826622B2 - Document search apparatus, search method and program

Info

Publication number: JP4826622B2
Application number: JP2008287280A
Authority: JP
Inventors: 英紀河合; 健二立石; 俊一福島
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2008-11-10
Filing date: 2008-11-10
Publication date: 2011-11-30
Anticipated expiration: 2022-08-14
Also published as: JP2009032292A

Description

The present invention relates to a document search apparatus, a search method, and a program, and more particularly to a document search apparatus, search method, and program that target hypertext.

Hypertext is a set of documents structured by hyperlinks (links): documents are the nodes, and links connect them. For a link from document A to document B, the character string in document A that serves as the link source for document B is called an anchor character string. A representative example of hypertext is the WWW (World Wide Web). The WWW is hypertext written in HTML (Hyper Text Markup Language), in which links and anchor character strings are marked by <A> tags. Although the WWW is a representative example of hypertext, the present invention is not limited to the WWW. Hypertext can also be written not only in HTML but also in XML (Extensible Markup Language), SGML (Standard Generalized Markup Language), and the like.
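As an illustration of how links and anchor character strings marked by <A> tags can be read out of an HTML document, the following is a minimal sketch using Python's standard `html.parser` module. The class name `AnchorExtractor`, the function `extract_links`, and the sample URL are assumptions made for this example and do not appear in the patent.

```python
from html.parser import HTMLParser


class AnchorExtractor(HTMLParser):
    """Collect (href, anchor character string) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, anchor string) pairs
        self._href = None    # href of the <a> element currently open, if any
        self._text = []      # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...> also matches.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return the list of (link destination, anchor string) pairs in `html`."""
    parser = AnchorExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page content, not taken from the patent.
sample = '<p>See <a href="http://gourmet.example/">Restaurant search</a>.</p>'
print(extract_links(sample))
```

<A> elements without an `href` attribute are skipped, since they do not define a link destination.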

With the recent spread of the Internet, it has become easy to access large numbers of documents structured by links. Documents structured by links usually form document clusters called sites. The unit of the document cluster that constitutes a site is not always clear, but a document cluster sharing the same domain name, or a document cluster on the same theme maintained by the same administrator, is often regarded as one site. Each site has an entry document called a top page, from which a viewer can reach the document for each piece of content in the site by following links.

For example, in FIG. 2 described later, document 11, document 21, document 31, and document 41 are the top pages of site 1, site 2, site 3, and site 4, respectively. A link consists of a URL that indicates where the cited document is stored and an anchor character string that describes the content of the cited document. A link can be placed freely to any accessible document, but by Internet custom, links from a different site tend to specify the top page as the URL and to use the site title or a string describing the site's content as the anchor character string.

Links to content documents within the same site, on the other hand, often use abbreviated wording on the assumption that the viewer follows links in order from the top page. For example, if site 1 is a site that handles gourmet information from across the country, the anchor character strings "L203" and "L302" of links placed from different sites such as site 2 and site 3 to document 11, the top page, often use expressions that give the site title itself or describe the site's content, such as "the Gourmet page" or "Restaurant search".

In contrast, the anchor character strings "L101", "L103", and "L106" of links to content within the site tend to use only the minimum necessary wording, such as "Kansai", "Nara Prefecture", and "Chinese food" respectively, on the assumption that the viewer follows links in order from the top page. As a result, the anchor character string alone often conveys only part of the content of the linked document. The content of the documents themselves is written on the same assumption; for example, the document containing the anchor character string "Nara Prefecture" may contain nothing but a list of prefecture names, so the content often cannot be fully understood from the document alone.

Conventional techniques for searching and classifying documents structured by such links include methods that treat the anchor character string at the link source as a search or classification target. Examples are the search apparatus presented in the paper "GENVL and WWWW: Tools for Taming the Web" published in Non-Patent Document 1, the hypertext search apparatus described in Patent Document 1, the document classification system described in Patent Document 2, the keyword assignment method using link information described in Patent Document 3, and the related document display apparatus described in Patent Document 4.

These search apparatuses, document classification systems, and keyword assignment methods register the anchor character string at the link source in the search index, or add it to the document feature vector, in addition to the keywords contained in the document body. In doing so, they try to improve search and classification accuracy by exploiting the fact that the anchor character string at the link source describes the linked document.

Proceedings of The 1st International Conference on the World Wide Web, 1994
Japanese Patent No. 3108015
Japanese Patent Laid-Open No. 10-254899
Republished patent WO99/14690
JP 2000-339320 A

However, the document search apparatuses and search methods described above have the following problem. When, in order to narrow the search target, the search condition is entered as several words separated by a space (for example "Nara" and "restaurant information"), as an expression of the form "A の B" ("B of A", for example 「奈良のレストラン情報」, "restaurant information of Nara"), or as a compound word (for example 「奈良レストラン情報」, "Nara restaurant information"), the narrowing search cannot be performed accurately. Because links and page content within a site tend to be written on the assumption that the viewer follows links in order from the top page, the body text of a single page and the anchor character strings at its link sources are sometimes not enough to narrow down the content.

Even if, to avoid this problem, the anchor character strings at the link sources are simply traced back a fixed number of steps and made search and classification targets, search accuracy does not necessarily improve. The reason is that simply tracing back a fixed number of anchor character strings ends up including keywords unrelated to the features of the document in the search and classification targets.

For example, in FIG. 2 described later, suppose that the series of anchor character strings obtained by tracing back three link sources is made the search target. The series for document 17 is then "L106 ← L103 ← L101", so keywords that describe document 17 well can be expected to be extracted. The series for document 12, however, is "L101 ← L203 ← L201" or "L101 ← L302 ← L301".

In this case, the anchor character strings "L201" and "L301" of links inside site 2 and site 3 are very likely unrelated to document 12. Furthermore, the series for document 11 is "L203 ← L201" or "L302 ← L301 ← L403". Here, in addition to the anchor character strings "L201" and "L301" of links unrelated to document 11, the anchor character string "L403" of a further unrelated link is also included in the search target, so no improvement in search and classification accuracy can be expected.
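The effect described above can be reproduced with a small sketch that traces inbound links back a fixed number of steps while ignoring site boundaries. FIG. 2 is not reproduced in this text, so the link table below is an assumption: only the anchor labels and the link from document 11 to document 12 come from the description, and the remaining source and destination document numbers are chosen so that the series quoted above result.

```python
# (anchor character string, link source document, link destination document)
# Assumed reconstruction of part of FIG. 2; document numbers 14, 22, 32 and 41
# as link endpoints are hypothetical.
LINKS = [
    ("L101", 11, 12), ("L103", 12, 14), ("L106", 14, 17),
    ("L201", 21, 22), ("L203", 22, 11),
    ("L301", 31, 32), ("L302", 32, 11),
    ("L403", 41, 31),
]


def naive_series(doc, depth):
    """Trace inbound links back up to `depth` steps, ignoring site boundaries."""
    inbound = [(anchor, src) for anchor, src, dst in LINKS if dst == doc]
    if depth == 0 or not inbound:
        return [[]]
    return [[anchor] + rest
            for anchor, src in inbound
            for rest in naive_series(src, depth - 1)]


for doc in (17, 12, 11):
    print(doc, [" ← ".join(series) for series in naive_series(doc, 3)])
```

Under these assumptions, the series for documents 12 and 11 pick up anchor character strings from inside site 2, site 3, and site 4, which is the noise the passage describes.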

The present invention has been made in view of the above points. Its object is to provide a document search apparatus, search method, and program capable of effective narrowing searches: when keywords are entered in the search condition separated by a space, joined by "の" ("no"), or as a compound word, the keywords are split and keywords that represent the site structure are made search targets.

To achieve the above object, a document search apparatus of the present invention searches, from a plurality of hierarchically organized documents included in each of document clusters formed by a plurality of documents structured by hyperlinks, for a group of documents that match an input keyword condition, and is characterized by comprising: first index storage means storing a first index in which, for each of the plurality of documents in the same document cluster, a word group characterizing the relationship between the document cluster and the document, consisting of keywords extracted from the series of anchor character strings obtained by tracing hyperlinks to the document back over a plurality of steps, is associated with the documents containing the word group and the frequency of appearance in those documents; second index storage means storing a second index in which a word group characterizing the content of the document itself, consisting of keywords that appear in the body text of each of the plurality of documents, is associated with the documents containing the word group and the frequency of appearance in those documents; and index search means that, when the keyword condition contains n keywords (n ≥ 2), refers to the first index stored in the first index storage means and, on finding that m of the n keywords (1 ≤ m ≤ n−1) are contained in the first index, obtains from the first index a first search result indicating the frequency of appearance of the m keywords and the documents in which the m keywords appear, then refers with the remaining n−m keywords to the second index stored in the second index storage means to obtain a second search result indicating the frequency of appearance of those of the n−m keywords that are contained in the second index and the documents in which they appear, and outputs the first and second search results.
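The lookup order described above (first index for m of the n keywords, then second index for the remaining n−m) can be sketched as follows. This is only an illustration under assumptions: each index is modelled as a dictionary from keyword to a `{document: frequency}` mapping, and the function name `index_search` and the sample data are hypothetical. How the two results are combined or ranked is not stated in this passage, so the sketch simply returns both.

```python
def index_search(keywords, first_index, second_index):
    """Return (first_result, second_result) for a condition with n >= 2 keywords.

    Each index maps keyword -> {document: frequency of appearance}.
    first_index holds keywords from anchor character string series;
    second_index holds keywords from document body text.
    """
    n = len(keywords)
    if n < 2:
        raise ValueError("the keyword condition needs at least two keywords")
    in_first = [k for k in keywords if k in first_index]
    m = len(in_first)
    if not 1 <= m <= n - 1:
        return None  # case outside what this passage describes
    first_result = {k: first_index[k] for k in in_first}
    remaining = [k for k in keywords if k not in first_index]
    second_result = {k: second_index[k] for k in remaining if k in second_index}
    return first_result, second_result


# Hypothetical index contents, for illustration only.
first_index = {"Nara": {"document 17": 1, "document 16": 1}}
second_index = {"restaurant": {"document 17": 3, "document 25": 2}}
print(index_search(["Nara", "restaurant"], first_index, second_index))
```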

Also to achieve the above object, a document search apparatus of the present invention searches, from a plurality of hierarchically organized documents included in each of document clusters formed by a plurality of documents structured by hyperlinks, for a group of documents that match an input search condition sentence, and is characterized by comprising: first index storage means storing a first index in which, for each of the plurality of documents in the same document cluster, a word group characterizing the relationship between the document cluster and the document, consisting of keywords extracted from the series of anchor character strings obtained by tracing hyperlinks to the document back over a plurality of steps, is associated with the documents containing the word group and the frequency of appearance in those documents; second index storage means storing a second index in which a word group characterizing the content of the document itself, consisting of keywords that appear in the body text of each of the plurality of documents, is associated with the documents containing the word group and the frequency of appearance in those documents; and index search means that, when the search condition sentence consists of two keywords that are joined by "の" ("no"), separated by a space, or form a compound word, refers with the two keywords to the first index stored in the first index storage means and, on finding that one of the two keywords is contained in the first index, obtains from the first index a first search result indicating the frequency of appearance of that keyword and the documents in which it appears, then refers with the other keyword to the second index stored in the second index storage means to obtain a second search result indicating the frequency of appearance of the other keyword contained in the second index and the documents in which it appears, and outputs the first and second search results.
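Splitting a search condition sentence into its two keywords can be sketched as below. The patent text here does not say how the split is performed, so everything in this block is an assumption: the function name `split_condition` is hypothetical, a space or the particle "の" is treated as a separator at its first occurrence (which would mis-split a keyword that itself contains "の"), and compound words are cut by trying each position against a small known vocabulary rather than by real morphological analysis.

```python
def split_condition(sentence, vocabulary=()):
    """Split a two-keyword search condition sentence into (left, right).

    Returns None when no split is found.
    """
    sentence = sentence.strip()
    # Case 1 and 2: separated by a space (ASCII or full-width) or joined by "の".
    for separator in (" ", "\u3000", "の"):
        left, found, right = sentence.partition(separator)
        if found and left and right:
            return left.strip(), right.strip()
    # Case 3: compound word. Try every cut point against a known vocabulary.
    for i in range(1, len(sentence)):
        if sentence[:i] in vocabulary and sentence[i:] in vocabulary:
            return sentence[:i], sentence[i:]
    return None


# Hypothetical vocabulary for the compound-word case.
vocab = {"奈良", "レストラン情報"}
for condition in ("奈良 レストラン情報", "奈良のレストラン情報", "奈良レストラン情報"):
    print(condition, "->", split_condition(condition, vocab))
```

All three input forms yield the same pair of keywords, which can then be looked up in the first and second indexes as the text describes.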

Also to achieve the above object, a document search method of the present invention is a method in which a data processing apparatus searches, from a plurality of hierarchically organized documents included in each of document clusters formed by a plurality of documents structured by hyperlinks, for a group of documents that match an input keyword condition, and is characterized by including: a first registration step in which the data processing apparatus registers, in first index storage means, a first index in which, for each of the plurality of documents in the same document cluster, a word group characterizing the relationship between the document cluster and the document, consisting of keywords extracted from the series of anchor character strings obtained by tracing hyperlinks to the document back over a plurality of steps, is associated with the documents containing the word group and the frequency of appearance in those documents; a second registration step in which the data processing apparatus registers, in second index storage means, a second index in which a word group characterizing the content of the document itself, consisting of keywords that appear in the body text of each of the plurality of documents, is associated with the documents containing the word group and the frequency of appearance in those documents; and an index search step in which the data processing apparatus, when the keyword condition contains n keywords (n ≥ 2), refers to the first index registered in the first index storage means and, on finding that m of the n keywords (1 ≤ m ≤ n−1) are contained in the first index, obtains from the first index a first search result indicating the frequency of appearance of the m keywords and the documents in which the m keywords appear, then refers with the remaining n−m keywords to the second index registered in the second index storage means to obtain a second search result indicating the frequency of appearance of those of the n−m keywords that are contained in the second index and the documents in which they appear, and outputs the first and second search results.

Also to achieve the above object, a document search method of the present invention is a method in which a data processing apparatus searches, from a plurality of hierarchically organized documents included in each of document clusters formed by a plurality of documents structured by hyperlinks, for a group of documents that match an input search condition sentence, and is characterized by including: a first registration step in which the data processing apparatus registers, in first index storage means, a first index in which, for each of the plurality of documents in the same document cluster, a word group characterizing the relationship between the document cluster and the document, consisting of keywords extracted from the series of anchor character strings obtained by tracing hyperlinks to the document back over a plurality of steps, is associated with the documents containing the word group and the frequency of appearance in those documents; a second registration step in which the data processing apparatus registers, in second index storage means, a second index in which a word group characterizing the content of the document itself, consisting of keywords that appear in the body text of each of the plurality of documents, is associated with the documents containing the word group and the frequency of appearance in those documents; and an index search step in which the data processing apparatus, when the search condition sentence consists of two keywords that are joined by "の" ("no"), separated by a space, or form a compound word, refers with the two keywords to the first index registered in the first index storage means and, on finding that one of the two keywords is contained in the first index, obtains from the first index a first search result indicating the frequency of appearance of that keyword and the documents in which it appears, then refers with the other keyword to the second index registered in the second index storage means to obtain a second search result indicating the frequency of appearance of the other keyword contained in the second index and the documents in which it appears, and outputs the first and second search results.

Also to achieve the above object, a document search program of the present invention is a program for causing a computer to search, from a plurality of hierarchically organized documents included in each of document clusters formed by a plurality of documents structured by hyperlinks, for a group of documents that match an input keyword condition or search condition sentence, and is characterized by causing the computer to execute each step of the document search method of the present invention.

The present invention provides the following effects.

(1) Search and classification that reflect both the content of the whole site and the position of a document within the site can be performed on a group of hypertext documents. This is because the series of anchor character strings obtained by tracing back the links from outside the site, which represent the content of the whole site, and the links inside the site, which represent the position of the document within the site, are extracted as the word group characterizing the relationship between the document cluster and the document and are made search and classification targets.

(2) Effective narrowing searches can be performed for multiple keywords. This is because, when keywords are entered in the search condition separated by a space, joined by "の" ("no"), or as a compound word, the keywords are split, and keywords serving as meta information that describes the content of the document are searched separately from keywords in the document body.

(3) Searches that reflect the content and meaning of a document can also be performed on general documents. This is because keywords in the meta information that describes the content of the document are distinguished from keywords in the document itself, and each is searched as a separate index.

Next, embodiments of the present invention will be described in detail with reference to the drawings.

[First Embodiment]
FIG. 1 is a block diagram of the first embodiment of the present invention. As shown in the figure, the first embodiment of the document search apparatus of the present invention includes a data processing apparatus 1 that operates under program control and a storage device 2 that stores information.

The storage device 2 includes a hypertext database 21 and a document keyword storage unit 22. For a group of documents structured by hyperlinks as shown in FIG. 2, the hypertext database 21 stores each document's URL, local address, body text, link destination documents and their anchor character strings, and so on. An example of the hypertext database 21 is the Web on the Internet or on an intranet.

The document keyword storage unit 22 stores, for each document, the keywords determined by the document keyword determination unit 14 described later. The keywords determined by the document keyword determination unit 14 are of two types: the series of anchor character strings obtained by tracing back links within the same site (in-site keywords), and the anchor character strings of links from a different site to the top page of the site (off-site keywords).

When the hypertext database 21 is a group of documents structured by hyperlinks as in FIG. 2, the keywords stored in the document keyword storage unit 22 are as shown in the example of FIG. 3. In FIG. 3, the document keyword storage unit 22 stores, for each document, the document name in association with its off-site keywords and in-site keywords. For example, document 15 has "L203, L302" stored as off-site keywords and "L104 ← L101" stored as in-site keywords.

The data processing apparatus 1 of FIG. 1, for its part, includes a hypertext access unit 11, a document cluster information acquisition unit 12, a target designation unit 13, and a document keyword determination unit 14.

The hypertext access unit 11 reads the documents stored in the hypertext database 21 and passes them to the document cluster information acquisition unit 12. When the hypertext database 21 is the WWW, documents can be accessed via HTTP (Hyper Text Transfer Protocol). Such a function has conventionally been implemented in web browsers such as IE (Internet Explorer) and in web crawlers (spiders/robots).

The document cluster information acquisition unit 12 extracts the link information contained in the documents read by the hypertext access unit 11, identifies the document clusters that constitute sites based on the conditions specified by the target designation unit 13, and generates a document reference relation table and a document cluster table. FIG. 4 shows an example of the document reference relation table, and FIG. 5 shows an example of the document cluster table.

As shown in FIG. 4, the document reference relation table is a list that associates an anchor character string with its link source document and link destination document. For example, it shows that a link with the anchor character string "L101" is placed from document 11 to document 12. As shown in FIG. 5, the document cluster table is a list that associates a document cluster with its top page and the documents in the cluster. For example, it shows that the top page of the document cluster "site 1" is document 11 and that the cluster contains documents 12 to 19.
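The two tables can be modelled as simple row types, as in the following minimal sketch. The row values are the ones quoted in the text for FIG. 4 and FIG. 5; the type names `LinkRow` and `ClusterRow`, the field names, and the helper `cluster_of` are assumptions made for this illustration.

```python
from typing import NamedTuple, Optional, Tuple


class LinkRow(NamedTuple):
    """One row of the document reference relation table (FIG. 4)."""
    anchor: str
    source: str
    destination: str


class ClusterRow(NamedTuple):
    """One row of the document cluster table (FIG. 5)."""
    cluster: str
    top_page: str
    members: Tuple[str, ...]


reference_table = [LinkRow("L101", "document 11", "document 12")]
cluster_table = [
    ClusterRow("site 1", "document 11",
               tuple(f"document {n}" for n in range(12, 20))),
]


def cluster_of(document: str) -> Optional[str]:
    """Return the cluster a document belongs to, or None if it is not registered."""
    for row in cluster_table:
        if document == row.top_page or document in row.members:
            return row.cluster
    return None


print(cluster_of("document 15"))
```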

The target designation unit 13 in the data processing apparatus 1 of FIG. 1 gives the document cluster information acquisition unit 12 the conditions for a document cluster that should be regarded as one site. These conditions include a "condition for the top page of a site" and a "condition for documents included in the same site". For example, to regard the document cluster stored on servers with the same domain name as one site, the "condition for the top page of a site" can be specified as "the URL of the document is 'http://domain name/' or 'http://domain name/index.html'", and the "condition for documents included in the same site" can be specified as "the domain name is the same".
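The two example conditions can be written as predicates over URLs, as in this minimal sketch using Python's standard `urllib.parse`. The function names `is_top_page` and `same_site` and the `gourmet.example` URLs are assumptions for illustration; the patent only states the conditions in words.

```python
from urllib.parse import urlsplit


def is_top_page(url):
    """Condition for the top page of a site:
    the URL is 'http://<domain name>/' or 'http://<domain name>/index.html'."""
    parts = urlsplit(url)
    return (parts.scheme == "http"
            and bool(parts.hostname)
            and parts.path in ("/", "/index.html")
            and not parts.query)


def same_site(url_a, url_b):
    """Condition for documents included in the same site: same domain name."""
    return urlsplit(url_a).hostname == urlsplit(url_b).hostname


# Hypothetical URLs, not taken from the patent.
print(is_top_page("http://gourmet.example/index.html"))
print(same_site("http://gourmet.example/", "http://gourmet.example/kansai/nara.html"))
```

Other site definitions mentioned earlier in the description, such as "same theme maintained by the same administrator", would need different predicates and are not covered by this sketch.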

The document keyword determination unit 14 in the data processing apparatus 1 refers to the document reference relation table and the document cluster table generated by the document cluster information acquisition unit 12, determines as the keywords of a document both the series of anchor character strings obtained by tracing back within the same document cluster and the anchor character strings of links from different document clusters, and stores them in the document keyword storage unit 22.

Next, the operation of the first embodiment will be described in detail with reference to the block diagram of FIG. 1 through the flowchart of FIG. 6. First, the hypertext access unit 11 reads each document stored in the hypertext database 21 and passes it to the document cluster information acquisition unit 12. The document cluster information acquisition unit 12 extracts link information from each document it receives and generates a document reference relation table such as the one shown in FIG. 4 (step S1).

Next, the document cluster information acquisition unit 12 determines whether each document it receives is a top page, based on the "condition for the top page of a site" specified by the target designation unit 13. Here, the top page is the highest-level document in a document cluster, determined from its position in the directory hierarchy (in FIG. 2, document 11 for site 1 and document 31 for site 3).

If the document is a top page, one row is added to the document cluster table shown in FIG. 5 to register it (step S2). For example, if the "site top page condition" is specified as "the URL of the document is 'http://domain name/' or 'http://domain name/index.html'", one top page per domain name is registered in the document cluster table.

If the document cluster information acquisition means 12 determines that the given document is not a top page, it decides which site the document belongs to, based on the "condition for documents included in the same site" specified by the target designating means 13, and registers the document in the intra-cluster documents column of the document cluster table shown in FIG. 5 (step S3). For example, if the "condition for documents included in the same site" is specified as "the domain name is the same", every document with the same domain name as the top page is registered as an intra-cluster document.
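Steps S2 and S3 under the domain-name example can be sketched as follows. This is only an illustration, assuming the URL-based conditions quoted above; the function names, the dictionary layout standing in for the document cluster table of FIG. 5, and the `example` domains are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def is_top_page(url):
    """Site top page condition from the example: http://domain/ or http://domain/index.html."""
    return urlparse(url).path in ("/", "/index.html")


def build_cluster_table(urls):
    """Group documents by domain name, one row per site (hypothetical table layout)."""
    table = defaultdict(lambda: {"top_page": None, "documents": []})
    for url in urls:
        domain = urlparse(url).netloc  # same-site condition: same domain name
        if is_top_page(url):
            table[domain]["top_page"] = url         # step S2
        else:
            table[domain]["documents"].append(url)  # step S3
    return dict(table)


table = build_cluster_table([
    "http://a.example/",
    "http://a.example/products/list.html",
    "http://b.example/index.html",
])
```

If a domain exposes both `/` and `/index.html`, this sketch simply keeps whichever URL it sees last as the top page; a real implementation would need to decide how to treat such aliases.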

Next, the document keyword determination means 14 refers to the document reference relation table and the document cluster table generated by the document cluster information acquisition means 12 and, for the top page of each site, stores the anchor character strings of the links pointing to it from outside the site in the document keyword storage unit 22 as off-site keywords (step S4).

Further, the document keyword determination means 14 refers to the document reference relation table and the document cluster table generated by the document cluster information acquisition means 12 and, for each intra-cluster document, stores the series of anchor character strings obtained by tracing links backward through documents of the same cluster in the document keyword storage unit 22 as in-site keywords (step S5). At this time, the off-site keywords of every document in a site are set equal to the off-site keywords of that site's top page. Accordingly, the off-site keywords of documents 12 to 19 in FIG. 2 are "L203, L302", the same as the off-site keywords of document 11.

When tracing links backward, the documents already visited are remembered so that the trace does not loop. For example, if the links to document 16 in FIG. 2 are traced naively, then in addition to the series "L105←L102", loops generate an unbounded number of series such as "L105←L109", "L105←L109←L105←L102", and "L105←L109←L105←L109←...". If a document that has already been visited is never visited a second time within the same series of anchor character strings, the only in-site keyword of document 16 is "L105←L102".

On the other hand, when the same document is reached through different series of anchor character strings, each series is registered as a separate keyword. For example, for document 19 in FIG. 2, "L108←L104←L101" and "L110←L105←L102" both trace back to document 11, but because they are different series, both are stored as in-site keywords. A series such as "L110←L105←L109←L105←L102" is also conceivable here, but it visits document 13 and document 16 twice each within the same series, so it is not stored as an in-site keyword.
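The loop-free backward trace of step S5 can be sketched as a depth-first search over incoming links. The link list below is a toy graph loosely modeled on the examples in the text; FIG. 2 itself is not reproduced here, so the link endpoints (in particular the intermediate document 15) are assumptions, as is the choice to record only series that reach the top page.

```python
def in_site_keywords(target, top_page, links):
    """Return the anchor series obtained by tracing in-site links back to the top page.

    links: iterable of (source_doc, destination_doc, anchor) tuples within one site.
    """
    incoming = {}
    for src, dst, anchor in links:
        incoming.setdefault(dst, []).append((src, anchor))

    results = []

    def walk(doc, visited, series):
        if doc == top_page:
            if series:
                results.append("←".join(series))
            return
        for src, anchor in incoming.get(doc, []):
            if src in visited:  # never visit a document twice within one series
                continue
            walk(src, visited | {src}, series + [anchor])

    walk(target, {target}, [])
    return results


# Hypothetical in-site links: (source, destination, anchor)
LINKS = [
    (11, 13, "L102"), (13, 16, "L105"), (16, 13, "L109"), (16, 19, "L110"),
    (11, 12, "L101"), (12, 15, "L104"), (15, 19, "L108"),
]

keywords_16 = in_site_keywords(16, 11, LINKS)
keywords_19 = in_site_keywords(19, 11, LINKS)
```

Because the visited set is copied per branch rather than shared globally, two different series may pass through the same document, which matches the rule that distinct series are registered separately.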

Although this embodiment describes the hypertext access means 11 accessing the hypertext database 21 stored in the storage device 2, it is also possible to access the Internet directly and store the hypertext database 21 in the storage device 2; the present invention is not limited to the method described in this embodiment.

This embodiment has been described using the example in which the "site top page condition" specified by the target designating means 13 is "the URL of the document is 'http://domain name/' or 'http://domain name/index.html'" and the "condition for documents included in the same site" is "the domain name is the same". Alternatively, the "site top page condition" may be "a document that receives at least a certain number of links from pages with different domain names", and the "condition for documents included in the same site" may be "a document in the same domain whose URL directory level is the same as or deeper than that of the top page". By convention, a directory name beginning with a tilde "~" can also be regarded as the site of an individual user of that server.

Another option is to set the "site top page condition" to "the destination document of a link whose anchor character string is an expression likely to refer to a top page, such as 'HomePage', 'To Top', or 'Back to start'", and the "condition for documents included in the same site" to "the source document of a link whose anchor character string is such an expression". A list of URLs specified manually in advance may also be used as the "site top page condition"; the invention is not limited to the method described in this embodiment.

In this embodiment, the document keyword determination means 14 uses as in-site keywords the series of anchor character strings obtained by tracing links backward through documents of the same cluster. However, when a document that is not a top page is linked from outside the site, the series of anchor character strings obtained by tracing that link back by just one step may also be stored as an in-site keyword. It is also possible to limit the trace to a specified number of links instead of always following every link back to the top page; the invention is not limited to the method described in this embodiment.

In this embodiment, the document keyword determination means 14 excludes series of anchor character strings that contain looped links from the in-site keywords. Alternatively, keywords unsuitable for search and classification, such as "Return", "Back", "To Top", "HomePage", "Previous", and "Next", may be held in advance as a dictionary, and any series containing one of those strings is then not registered as an in-site keyword. Other options are to not register a series once the number of documents traced exceeds a certain length, or to register only the top s series with the fewest documents traced; the invention is not limited to the method described in this embodiment.

In this embodiment, the document keyword determination means 14 determines keywords from anchor character strings. In addition to the anchor character string, the keywords may also include the document title, a fixed-length character string around the anchor character string, a character string enclosed in table tags or list tags around the anchor character string, a character string delimited by <BR> or <P> tags around the anchor character string, and character strings in the document that are emphasized by an <H> tag, font size, or color; the invention is not limited to the method described in this embodiment.

In this embodiment, the document keyword storage unit 22 stores only off-site keywords and in-site keywords, but the document title, body text, and the like may also be stored as keywords. Also, the operation has been described with step S3, which identifies the document cluster, executed after step S2, which identifies the top page; step S3 may instead be executed first, followed by step S2. The invention is not limited to the method described in this embodiment.

Likewise, the operation has been described with step S5, which determines the in-site keywords, executed after step S4, which determines the off-site keywords; step S5 may instead be executed first, followed by step S4. The invention is not limited to the method described in this embodiment.

Next, the effects of the first embodiment of the present invention are described. In this embodiment, the anchor character strings of links from outside the site, which represent the content of the site as a whole, and the series of anchor character strings obtained by tracing links within the site, which represent the position of a document within the site, are extracted as the group of words that characterizes the relationship between the document cluster and the document. As a result, keywords that reflect both the content of the whole site and the position of the document within the site can be obtained for each document.

[Second Embodiment]
Next, a second embodiment of the present invention is described with reference to the drawings. FIG. 7 is a block diagram of the second embodiment. As shown in the figure, the second embodiment of the document search apparatus of the present invention comprises a data processing apparatus 5 that operates under program control, a storage device 6 that stores information, input means 3, and output means 4. In the figure, components identical to those in FIG. 1 bear the same reference numerals, and their description is omitted.

The second embodiment differs from the first in that the data processing apparatus 5 has index creation means 15 and index search means 16 in addition to the configuration of the data processing apparatus 1 of the first embodiment shown in FIG. 1. It also differs in that the storage device 6 has a first index storage unit 23 in addition to the configuration of the storage device 2 of the first embodiment, and in that it has input means 3 such as a keyboard and output means 4 such as a display device or a printing device.

In FIG. 7, the first index storage unit 23 in the storage device 6 stores the index that the index creation means 15 generates from the data in the document keyword storage unit 22. The index creation means 15 in the data processing apparatus 5 reads the off-site keywords and in-site keywords of each document stored in the document keyword storage unit 22, creates an index recording which keyword appears in the off-site keywords or in-site keywords of which document, and stores it in the first index storage unit 23. The index search means 16 in the data processing apparatus 5 searches the first index storage unit 23 according to the search condition entered through the input means 3 and sends the result to the output means 4.

Next, the operation of the second embodiment is described in detail with reference to the drawings. This embodiment has two kinds of processing that run at different times: the registration process shown in the flowchart of FIG. 8(A) and the search process shown in the flowchart of FIG. 8(B). The search process runs every time a user submits input, whereas the registration process only needs to be run once in advance.

First, the registration process of the second embodiment is described with reference to the flowchart of FIG. 8(A). In FIG. 8(A), processing steps identical to those in FIG. 6 bear the same reference numerals. That is, in steps S1 to S5 of the registration process, the hypertext access means 11, the document cluster information acquisition means 12, the target designating means 13, and the document keyword determination means 14 operate exactly as the corresponding means 11, 12, 13, and 14 of the first embodiment, so their description is omitted.

In the first embodiment, processing ended once the in-site keywords had been determined in step S5. In this embodiment, based on the document keywords produced by step S5, the index creation means 15 creates an index of the off-site keywords that records which word is registered for which document (step S6). The index creation means 15 then creates the same kind of index for the in-site keywords (step S7). This completes the registration process.
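Steps S6 and S7 amount to building one inverted index per keyword type. The sketch below assumes a hypothetical in-memory layout for the document keyword storage unit 22 and splits keyword strings on whitespace purely for illustration; Japanese anchor text would need a real segmenter.

```python
from collections import defaultdict


def build_index(doc_keywords, field):
    """Map each word to {document: occurrence count} for one keyword type.

    doc_keywords: {doc_id: {"off_site": [str, ...], "in_site": [str, ...]}}
    field: "off_site" (step S6) or "in_site" (step S7).
    """
    index = defaultdict(dict)
    for doc, fields in doc_keywords.items():
        for text in fields.get(field, []):
            for word in text.split():  # assumption: whitespace-delimited words
                index[word][doc] = index[word].get(doc, 0) + 1
    return dict(index)


# Hypothetical contents of the document keyword store
DOC_KEYWORDS = {
    "d1": {"off_site": ["NEC products"], "in_site": ["printers", "laser printers"]},
    "d2": {"off_site": ["NEC products"], "in_site": ["support"]},
}

off_site_index = build_index(DOC_KEYWORDS, "off_site")  # step S6
in_site_index = build_index(DOC_KEYWORDS, "in_site")    # step S7
```

Keeping the two indexes separate is what later allows a search to treat off-site and in-site matches differently.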

Next, the search process is described with reference to the flowchart of FIG. 8(B). First, a search condition is entered through the input means 3 (step T1). Besides keywords, the search condition may be a question written in natural language or another document similar to the document being sought.

Next, the index search means 16 determines the n keywords to be used for the search from the entered search condition (step T2). Keyword determination involves two operations: splitting the text and selecting keywords. For example, morphological analysis can be used to split the text, and keywords can be selected by discarding function words such as "の" ("no") and using the remaining words.

Next, the index search means 16 checks whether any of the n keywords appear in the off-site keywords. If so, it stores the m matching keywords (1≦m≦n−1), their occurrence frequencies, and the documents in which they appear as search result candidates (step T3).

Next, from among the search result candidates, the index search means 16 adds to the search result list those documents whose in-site keywords contain the remaining n−m keywords, together with the keyword occurrence frequencies (step T4). It then sorts the search result list by keyword occurrence frequency and displays the results to the user through the output means 4 (step T5).
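One possible reading of steps T3 to T5 is sketched below. It assumes indexes shaped as word → {document: frequency}, applies the split between off-site and in-site matches per document, requires all remaining words to match in-site, and scores by summed frequency; each of these choices is an assumption rather than something the description fixes.

```python
def search(query_words, off_index, in_index):
    """Return [(doc, score)] sorted by descending keyword frequency."""
    words = list(dict.fromkeys(query_words))  # drop duplicate query words
    n = len(words)

    # Step T3: documents whose off-site keywords contain some of the query words
    candidates = {}
    for word in words:
        for doc, freq in off_index.get(word, {}).items():
            candidates.setdefault(doc, {})[word] = freq

    # Step T4: keep candidates whose in-site keywords contain the remaining words
    results = []
    for doc, hits in candidates.items():
        m = len(hits)
        if not 1 <= m <= n - 1:  # bound on m taken from the description
            continue
        remaining = [w for w in words if w not in hits]
        if all(doc in in_index.get(w, {}) for w in remaining):
            score = sum(hits.values()) + sum(in_index[w][doc] for w in remaining)
            results.append((doc, score))

    # Step T5: sort by frequency (ties broken by document id for determinism)
    results.sort(key=lambda r: (-r[1], r[0]))
    return results


# Hypothetical indexes
OFF_INDEX = {"NEC": {"d1": 1, "d2": 2}}
IN_INDEX = {"printers": {"d1": 2}, "support": {"d2": 1}}

hits = search(["NEC", "printers"], OFF_INDEX, IN_INDEX)
```

With this reading, a query such as "NEC printers" narrows the documents of sites described externally as "NEC" down to those reached internally through a "printers" link.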

The present invention is not limited to the second embodiment as described and includes the following variations. In the second embodiment, the document keyword storage unit 22 stores only off-site keywords and in-site keywords, but the site title and site body text may also be stored as keywords and searched after the search keywords have been split. Also, although this embodiment sorts the search result list by keyword occurrence frequency, the occurrence frequency in off-site keywords and the occurrence frequency in in-site keywords may each be multiplied by a different weight and the list sorted by the result.

This embodiment assumes a search method or search model based on keyword matching, but a vector space model, a probabilistic model, or a Boolean model that performs AND and OR operations may be used instead.

In this embodiment, step S7, which creates the index of in-site keywords, is executed after step S6, which creates the index of off-site keywords, but step S6 may instead be executed after step S7.

In this embodiment, steps S6 and S7, which create the off-site and in-site keyword indexes, are executed after both step S4 (determining the off-site keywords) and step S5 (determining the in-site keywords) have finished. The steps may instead be interleaved: step S6 is executed right after step S4, and step S7 right after step S5.

This embodiment describes splitting the text by morphological analysis in step T2, which determines the search keywords. Other methods include splitting by character type (kanji, alphanumerics, katakana, hiragana, and so on), splitting into chunks of a fixed number of characters, splitting at spaces and punctuation marks, and splitting at function words such as "の" ("no"); the invention is not limited to the method described in this embodiment.
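The character-type variant of step T2 can be sketched with a regular expression over Unicode ranges, followed by removal of function words. The ranges and the one-entry stop list below are simplifying assumptions and do not cover every Japanese character.

```python
import re

# Runs of kanji, katakana (including the long-vowel mark), hiragana, or ASCII alphanumerics
SCRIPT_RUN = re.compile(r"[一-龥]+|[ァ-ヶー]+|[ぁ-ん]+|[A-Za-z0-9]+")

# Hypothetical stop list of function words to discard
FUNCTION_WORDS = {"の"}


def split_by_script(text):
    """Split text wherever the character type changes."""
    return SCRIPT_RUN.findall(text)


def select_keywords(text):
    """Split by character type, then drop function words."""
    return [w for w in split_by_script(text) if w not in FUNCTION_WORDS]


keywords = select_keywords("NECのプリンタ")
```

Splitting by character type needs no dictionary, but it cannot separate two adjacent words written in the same script, which is one reason morphological analysis may still be preferable.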

This embodiment describes excluding function words such as "の" ("no") during keyword selection in step T2. Alternatively, words that occur frequently in documents in general, such as "information" and "method", may be excluded as stop words or given only a small score when they match. Conversely, words whose frequency in the query is high relative to their frequency in general documents may be treated as important words and given a high score when they match. The invention is not limited to the method described in this embodiment.

This embodiment describes storing a document as a search result candidate when one or more keywords appear in its off-site keywords in step T3. Alternatively, if none of the keywords match in any document, all documents may be treated as search result candidates and step T4, which searches the in-site keywords, executed on them. Another option is to include a document in the search result list if it matches in either the off-site keywords or the in-site keywords, and in step T5 to sort with a document score weight that depends on whether the match was off-site or in-site.

Next, the effects of the second embodiment are described. In this embodiment, the anchor character strings of links from outside the site, which represent the content of the site as a whole, and the series of anchor character strings obtained by tracing links within the site, which represent the position of a document within the site, are extracted as the group of words that characterizes the relationship between the document cluster and the document, and an index is built from them. This enables searches that reflect both the content of the whole site and the position of the document within the site.

Also, in this embodiment, when the search condition contains keywords separated by spaces, joined by "の" ("no"), or entered as a compound word, the keywords are split and searched separately against the anchor character strings of links from outside the site and against the series of anchor character strings obtained by tracing links within the site. This enables effective narrowing searches that reflect the site structure.

[Third Embodiment]
Next, a third embodiment of the present invention is described in detail with reference to the drawings. FIG. 9 is a block diagram of the third embodiment. As shown in the figure, the third embodiment of the document search apparatus of the present invention comprises a data processing apparatus 7 that operates under program control and a storage device 8 that stores information. In the figure, components identical to those in FIG. 1 bear the same reference numerals, and their description is omitted.

As shown in FIG. 9, the third embodiment differs from the first in that the data processing apparatus 7 has document vector creation means 17 and similarity calculation means 18 in addition to the configuration of the data processing apparatus 1 of the first embodiment shown in FIG. 1. It also differs in that the storage device 8 has a document vector storage unit 24, a category condition storage unit 25, and a classification result storage unit 26 in addition to the configuration of the storage device 2 of the first embodiment.

The document vector storage unit 24 stores the feature vector of each document, created by the document vector creation means 17 from the keywords held in the document keyword storage unit 22. A document's feature vector is, for example, a multidimensional vector representing each keyword that appears in the document together with its occurrence frequency.

Once feature vectors have been determined for a number of documents, the similarity between documents can be calculated from, for example, the Euclidean distance between the feature vectors or the angle between them. In addition, the sum or the centroid of the feature vectors of the documents belonging to a category can be taken as the feature vector of that category; the category to which an unclassified document belongs can then be decided by calculating the similarity between each category's feature vector and the document's feature vector.
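The category centroid and the Euclidean distance mentioned here can be sketched with sparse vectors held as dictionaries. The representation and the function names are illustrative assumptions only.

```python
import math


def centroid(vectors):
    """Component-wise mean of sparse vectors given as {keyword: weight} dicts."""
    total = {}
    for vec in vectors:
        for key, value in vec.items():
            total[key] = total.get(key, 0.0) + value
    return {key: value / len(vectors) for key, value in total.items()}


def euclidean_distance(a, b):
    """Euclidean distance between two sparse vectors; missing keys count as 0."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


# Hypothetical feature vectors of two documents already assigned to one category
category_vector = centroid([{"printer": 2, "laser": 1}, {"printer": 4}])
distance = euclidean_distance({"printer": 3}, category_vector)
```

With distance as the measure, a smaller value means a closer match, the opposite direction from an angle-based similarity such as the cosine.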

The category condition storage unit 25 stores, for each category into which documents are to be classified, its characteristic keywords and their occurrence frequencies as a feature vector. The classification result storage unit 26 stores the cosine between each document vector and each category's feature vector, as calculated by the similarity calculation means 18. The larger this value, the more strongly the document can be judged to belong to the category.

Based on the document keywords stored in the document keyword storage unit 22, the document vector creation means 17 records, for each document, how many times each keyword appears in each part (off-site keywords, in-site keywords, title, body text, and so on) and stores this as a document vector in the document vector storage unit 24.

For the document vector of each document stored in the document vector storage unit 24, the similarity calculation means 18 calculates the cosine with the feature vector of each category stored in the category condition storage unit 25 and stores the result in the classification result storage unit 26.

Next, the operation of this embodiment is described in detail with reference to the flowchart of FIG. 10. In FIG. 10, processing steps identical to those in FIG. 6 bear the same reference numerals. That is, in steps S1 to S5 of FIG. 10, the hypertext access means 11, the document cluster information acquisition means 12, the target designating means 13, and the document keyword determination means 14 operate exactly as the corresponding means 11, 12, 13, and 14 of the first embodiment, so their description is omitted.

In the first embodiment, processing ended once the in-site keywords had been determined in step S5. In this embodiment, based on the document keywords produced by step S5, the document vector creation means 17 records, for each document, how many times each keyword appears in each part (off-site keywords, in-site keywords, title, body text, and so on) and stores this as a document vector in the document vector storage unit 24 (step S8).

Next, for the document vector of each document stored in the document vector storage unit 24, the similarity calculation means 18 calculates the cosine with the feature vector of each category stored in the category condition storage unit 25 and stores the result in the classification result storage unit 26 (step S9).
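Steps S8 and S9 can be sketched as follows, assuming a document vector keyed by (part, keyword) pairs and category vectors supplied in the same form. The whitespace split, the part names, and the sample categories are hypothetical.

```python
import math
from collections import Counter


def document_vector(fields):
    """Step S8: count how often each keyword occurs in each part of a document.

    fields: {"off_site": [str, ...], "in_site": [str, ...], ...}
    """
    vec = Counter()
    for part, texts in fields.items():
        for text in texts:
            for word in text.split():  # assumption: whitespace-delimited words
                vec[(part, word)] += 1
    return vec


def cosine(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(value * b.get(key, 0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(doc_vec, category_vectors):
    """Step S9: score the document against every category; highest cosine wins."""
    scores = {name: cosine(doc_vec, vec) for name, vec in category_vectors.items()}
    return max(scores, key=scores.get), scores


doc = document_vector({"off_site": ["NEC printers"], "in_site": ["laser printers"]})
CATEGORIES = {
    "hardware": {("off_site", "printers"): 1, ("in_site", "printers"): 1},
    "travel": {("in_site", "hotels"): 1},
}
best, scores = classify(doc, CATEGORIES)
```

Keying the vector by part as well as keyword keeps an off-site occurrence and an in-site occurrence of the same word in separate dimensions, so they can later be weighted differently or merged.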

In this embodiment, the document vector uses the keyword, the part in which it appears (off-site keywords, in-site keywords, title, body text, and so on), and its occurrence frequency. It is also possible to ignore the distinction between parts, to weight the occurrence frequency according to the part, or to use only the information of whether a keyword appears at all rather than its occurrence frequency.

In the present embodiment, the cosine between vectors is used to calculate document similarity, but the Euclidean distance between vectors may be used instead.
In the present embodiment, the feature vector of each category is specified as the category condition. Alternatively, instead of specifying a feature vector for each category, documents that actually belong to the category may be specified as training data, and the model obtained by training with a machine learning method such as an SVM may be used as the category condition; the similarity calculation means 18 then uses this model to classify documents that were not part of the training data. Details of document classification using SVMs are described in, for example, Proceedings of the 10th European Conference on Machine Learning, pp. 137-142, 1998.

Next, the effect of the third embodiment will be described. In the present embodiment, the series of anchor character strings obtained by tracing back links from outside the site, which represent the content of the site as a whole, and links within the site, which represent the position of the document within the site, are extracted as a group of words that characterizes the relationship between the document cluster and the document, and a document vector is created from them. This makes it possible to perform classification that reflects both the content of the entire site and the position of the document within the site.

[Fourth Embodiment]
Next, a fourth embodiment of the present invention will be described in detail with reference to the drawings. FIG. 11 is a block diagram of the fourth embodiment. As shown in the figure, the fourth embodiment of the keyword extraction device, document search device, and document classification device of the present invention includes a data processing device 9 that operates under program control, a storage device 10 that stores information, input means 3, and output means 4. In the figure, components identical to those in FIG. 1 are denoted by the same reference numerals, and their description is omitted.

As shown in FIG. 11, the fourth embodiment differs in that the data processing device 9 omits the hypertext access means 11, the document cluster information acquisition means 12, the target designation means 13, the document keyword determination means 14, and the index creation means 15 from the configuration of the data processing device 5 of the second embodiment shown in FIG. 7. It also differs in that the storage device 10 omits the hypertext database 21 and the document keyword storage unit 22 from the configuration of the storage device 6 of the second embodiment shown in FIG. 7 and newly includes a second index storage unit 27.

The first index storage unit 23 stores an index of off-site keywords and in-site keywords as meta information representing the content of documents. The second index storage unit 27 stores an index of the keywords that appear in the body text of documents.

Next, the operation of the present embodiment will be described in detail with reference to the flowchart of FIG. 12. In FIG. 12, processing steps identical to those in FIG. 8(B) are denoted by the same reference numerals, and their description is omitted. The operation of the index search means 16 in steps T1, T2, and T5 of FIG. 12 is the same as that of the index search means 16 in the second embodiment, so its description is omitted.

In the second embodiment, after step T2, in which the keywords are determined, the index search means 16 searched the off-site keywords and the in-site keywords separately. In the present embodiment, the index search means 16 searches the first index, which indexes the off-site and in-site keywords, and the second index, which indexes the keywords that appear in the body text of documents.

First, the index search means 16 checks whether any of the n keywords determined in step T2 is registered in the first index storage unit 23 (the first index). If so, it stores the m matching keywords (1 ≦ m ≦ n−1), their appearance frequencies, and the documents in which they appear as search result candidates (step U3).

Next, for each search result candidate document, the index search means 16 checks whether any of the remaining n−m keywords is registered in the second index storage unit 27 (the second index). If so, it adds the document in which the keyword is registered and the appearance frequency of the keyword to the search result list (step U4). The index search means 16 then sorts the search result list by keyword appearance frequency and displays the search results to the user through the output means 4 (step T5).
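The flow of steps U3, U4, and T5 can be sketched as follows. This is only an illustration under stated assumptions: both indexes are modeled as mappings from a word to a per-document appearance frequency, a candidate is kept when at least one remaining keyword hits in the second index, and the score is the plain sum of frequencies. The function name, the data layout, and the sample entries are hypothetical.

```python
def two_stage_search(keywords, first_index, second_index):
    """Search the first index, then narrow the candidates with the second index.

    Both indexes map word -> {document id: appearance frequency}.
    """
    # Step U3: keywords registered in the first index produce the candidates.
    hits = [word for word in keywords if word in first_index]
    rest = [word for word in keywords if word not in first_index]
    candidates = {}
    for word in hits:
        for doc, freq in first_index[word].items():
            candidates[doc] = candidates.get(doc, 0) + freq

    # Step U4: look up the remaining keywords in the second index,
    # but only for documents that are already candidates.
    results = {}
    for doc, base in candidates.items():
        extra = sum(second_index.get(word, {}).get(doc, 0) for word in rest)
        if extra > 0:
            results[doc] = base + extra

    # Step T5: sort by total appearance frequency, highest first.
    return sorted(results.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample indexes (assumption, not from the figures).
first_index = {"gourmet": {111: 1, 114: 1}}
second_index = {"Nara": {114: 2, 200: 5}}
ranked = two_stage_search(["Nara", "gourmet"], first_index, second_index)
```

Document 200 is absent from the result although it mentions "Nara" most often, because it never became a candidate in the first index. That is the sense in which the first index is searched preferentially.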

The present invention is not limited to this embodiment, and the following modifications are possible. In the fourth embodiment, the search result list is sorted by keyword appearance frequency, but the appearance frequency in the first index and that in the second index may instead be multiplied by different weights and summed, and the list sorted by the result. Also, in the present embodiment the first index storage unit 23 holds the off-site and in-site keywords extracted from hypertext, but it may instead hold keywords that appear in other meta information representing the content of a document. For example, when the search targets are academic papers, the passages in citing papers that introduce a paper serve as this meta information. When the search targets are books, bibliographic data and articles introducing the book serve as this meta information.

Likewise, although the present embodiment sorts the search result list by keyword appearance frequency, the appearance frequencies in the first index and in the second index may be multiplied by different weights and the list sorted by the result. In addition, the present embodiment described storing a document as a search result candidate when one or more keywords appear in step U3, in which the first index is searched. Alternatively, when none of the keywords hits in any document, step U4, in which the second index is searched, may be executed with all documents as search result candidates.

As another alternative, a document may be included in the search result list if it hits in either the first index or the second index, and in step T5, in which the search results are output, the documents may be sorted with the document score weighted differently depending on which index produced the hit. The invention is not limited to the method described in this embodiment.
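The weighted variants can be sketched as below. The weights 2.0 and 1.0, the function name, and the sample indexes are illustrative assumptions; the specification states only that the two indexes may be given different weights.

```python
def weighted_rank(keywords, first_index, second_index, w_first=2.0, w_second=1.0):
    """Rank every document that hits in either index.

    A hit contributes its appearance frequency multiplied by the weight
    of the index in which it was found.
    """
    scores = {}
    for word in keywords:
        for doc, freq in first_index.get(word, {}).items():
            scores[doc] = scores.get(doc, 0.0) + w_first * freq
        for doc, freq in second_index.get(word, {}).items():
            scores[doc] = scores.get(doc, 0.0) + w_second * freq
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample indexes (assumption): "A" hits in the meta information,
# "B" hits only in body text.
first_index = {"gourmet": {"A": 1}}
second_index = {"gourmet": {"B": 1}, "Nara": {"A": 1, "B": 1}}
ranked = weighted_rank(["Nara", "gourmet"], first_index, second_index)
```

With a larger weight on the first index, a document whose meta information matches outranks one that matches the same number of keywords only in its body text.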

Next, the effect of the present embodiment will be described. In the present embodiment, the first index is created from keywords contained in meta information representing the content of documents, and this index is searched preferentially. This makes it possible to perform a search that reflects the content of the documents.

Also, in the present embodiment, when the keywords entered as the search condition are separated by spaces, joined by the Japanese particle 「の」 ("no"), or entered as a compound word, the search condition is split into individual keywords, and the first index and the second index are each searched. This makes it possible to perform an effective narrowing search that reflects the content of the documents.

[Fifth Embodiment]
Next, a fifth embodiment of the present invention will be described in detail with reference to the drawings. FIG. 13 is a block diagram of the fifth embodiment. As shown in the figure, the fifth embodiment includes an input device 31, a data processing device 32, an output device 33, and a storage device 34, and further includes a recording medium 30 on which a program for realizing the keyword extraction device of the first embodiment is recorded. The recording medium 30 may be a magnetic disk, a semiconductor memory, a CD-ROM, or any other recording medium.

The input device 31 is a device, such as a mouse or keyboard, for entering instructions from the operator. The output device 33 is a device that outputs the processing results of the data processing device 32, such as a display device or printer. The program for realizing the keyword extraction device is read from the recording medium 30 into the data processing device 32, controls the operation of the data processing device 32, and creates an input memory 35 and a work memory 36 in the storage device 34. Under the control of this program, the data processing device 32 executes the same processing as in the first embodiment.

The data processing device 1 in FIG. 1 corresponds to the data processing device 32 in FIG. 13, and the storage device 2 in FIG. 1 corresponds to the storage device 34 in FIG. 13. The hypertext database 21 to be processed may be read from the recording medium 30, or the data processing device 32 may obtain it by accessing an external database over a network (for example, the Internet).

[Sixth Embodiment]
Next, a sixth embodiment of the present invention will be described in detail with reference to the drawings. Like the fifth embodiment, the sixth embodiment uses the configuration of FIG. 13. A program for realizing the document search device is read from the recording medium 30 into the data processing device 32 and controls the operation of the data processing device 32. Under the control of this program, the data processing device 32 executes the same processing as in the second embodiment.

The data processing device 5 in FIG. 7 corresponds to the data processing device 32 in FIG. 13, and the storage device 6 in FIG. 7 corresponds to the storage device 34 in FIG. 13. The hypertext database 21 to be processed may be read from the recording medium 30, or the data processing device 32 may obtain it by accessing an external database over a network (for example, the Internet).

The description here assumes that the hypertext access means 11, the document cluster information acquisition means 12, the target designation means 13, the document keyword determination means 14, the index creation means 15, and the index search means 16 in FIG. 7 are all read by the data processing device 32 from a single recording medium 30, but they may be recorded across a plurality of recording media. For example, the programs for the hypertext access means 11, the document cluster information acquisition means 12, the target designation means 13, and the document keyword determination means 14 may be read from the recording medium of the fifth embodiment, while the programs for the index creation means 15 and the index search means 16 are provided on a separate recording medium. Furthermore, the program for the index creation means 15 and the program for the index search means 16 may be placed on separate recording media.

[Seventh Embodiment]
Next, a seventh embodiment of the present invention will be described in detail with reference to the drawings. Like the fifth and sixth embodiments, the seventh embodiment uses the configuration of FIG. 13. A program for realizing the document classification device is read from the recording medium 30 into the data processing device 32 and controls the operation of the data processing device 32. Under the control of this program, the data processing device 32 executes the same processing as in the third embodiment.

The data processing device 7 in FIG. 9 corresponds to the data processing device 32 in FIG. 13, and the storage device 8 in FIG. 9 corresponds to the storage device 34 in FIG. 13. The hypertext database 21 to be processed may be read from the recording medium 30, or the data processing device 32 may obtain it by accessing an external database over a network (for example, the Internet).

The description here assumes that the hypertext access means 11, the document cluster information acquisition means 12, the target designation means 13, the document keyword determination means 14, the document vector creation means 17, and the similarity calculation means 18 in FIG. 9 are all read into the data processing device 32 from a single recording medium 30, but they may be recorded across a plurality of recording media.

For example, the programs for the hypertext access means 11, the document cluster information acquisition means 12, the target designation means 13, and the document keyword determination means 14 may be read from the recording medium of the fifth embodiment, while the programs for the document vector creation means 17 and the similarity calculation means 18 are provided on a separate recording medium. Furthermore, the program for the document vector creation means 17 and the program for the similarity calculation means 18 may be placed on separate recording media.

[Eighth Embodiment]
Next, an eighth embodiment of the present invention will be described in detail with reference to the drawings. Like the fifth, sixth, and seventh embodiments, the eighth embodiment uses the configuration of FIG. 13. A program for realizing the document search device is read from the recording medium 30 into the data processing device 32 and controls the operation of the data processing device 32. Under the control of this program, the data processing device 32 executes the same processing as in the fourth embodiment. The data processing device 9 in FIG. 11 corresponds to the data processing device 32 in FIG. 13, and the storage device 10 in FIG. 11 corresponds to the storage device 34 in FIG. 13.

[First Example]
Next, a first example of the present invention will be described with reference to the drawings. The first example corresponds to the first embodiment of the present invention. In this example, a personal computer serves as the data processing device 1 shown in FIG. 1, and a magnetic disk storage device serves as the storage device 2.

The personal computer has a central processing unit that functions as the hypertext access means 11, the document cluster information acquisition means 12, the target designation means 13, and the document keyword determination means 14 shown in FIG. 1. The magnetic disk storage device holds the hypertext database 21 and the document keyword storage unit 22 shown in FIG. 1. FIG. 14 shows an example of the group of hypertext documents stored in the hypertext database 21.

First, the hypertext access means 11 reads each document stored in the hypertext database 21 and passes it to the document cluster information acquisition means 12.
The document cluster information acquisition means 12 extracts link information from the documents it is given and generates a document reference relation table, as shown in FIG. 15, in which each anchor character string is associated with its link source document and link destination document.
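As a rough illustration of how such a document reference relation table might be built from HTML, the sketch below collects one row per `<A>` tag using Python's standard `html.parser`. The class and function names, the row layout, and the `example.com` URLs are assumptions made for this sketch and do not come from the specification.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect (anchor string, link source URL, link destination URL) rows."""

    def __init__(self, source_url):
        super().__init__()
        self.source_url = source_url
        self.rows = []
        self._href = None   # destination of the <a> element currently open
        self._text = []     # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the source document URL.
                self._href = urljoin(self.source_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            anchor = "".join(self._text).strip()
            self.rows.append((anchor, self.source_url, self._href))
            self._href = None


def reference_table(documents):
    """documents maps URL -> HTML text; returns the rows for all documents."""
    rows = []
    for url, html in documents.items():
        parser = AnchorCollector(url)
        parser.feed(html)
        parser.close()
        rows.extend(parser.rows)
    return rows


# Hypothetical two-page site (assumption).
documents = {
    "http://example.com/": '<a href="kanto.html">Kanto</a>',
    "http://example.com/kanto.html": '<A HREF="tokyo.html">Tokyo</A>',
}
rows = reference_table(documents)
```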

Next, based on the "condition for a site's top page" and the "condition for documents included in the same site" specified by the target designation means 13, the document cluster information acquisition means 12 generates a document cluster table, as shown in FIG. 16, in which each document cluster is associated with its top page and its in-cluster documents. In this example, the condition for a site's top page is that the document's URL is "http://domain-name/" or "http://domain-name/index.html", and the condition for documents included in the same site is that they have the same domain name.
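The two conditions used in this example can be expressed as a short sketch. The function names, the table layout, and the sample URLs are hypothetical; the checks follow the stated conditions literally (an `http` URL whose path is `/` or `/index.html`, and grouping by identical domain name).

```python
from urllib.parse import urlsplit


def is_top_page(url):
    """True for http://domain/ and http://domain/index.html."""
    parts = urlsplit(url)
    return (
        parts.scheme == "http"
        and parts.path in ("/", "/index.html")
        and not parts.query
    )


def cluster_table(urls):
    """Group documents by domain name and record each cluster's top page."""
    table = {}
    for url in urls:
        domain = urlsplit(url).netloc
        entry = table.setdefault(domain, {"top_page": None, "documents": []})
        entry["documents"].append(url)
        if entry["top_page"] is None and is_top_page(url):
            entry["top_page"] = url
    return table


# Hypothetical URLs (assumption).
urls = [
    "http://a.example/",
    "http://a.example/kanto/tokyo.html",
    "http://b.example/index.html",
    "http://b.example/about.html",
]
table = cluster_table(urls)
```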

Next, the document keyword determination means 14 refers to the document reference relation table and the document cluster table generated by the document cluster information acquisition means 12. It takes the anchor character strings of links pointing to each site's top page from outside the site as off-site keywords, and, for each in-cluster document, takes the series of anchor character strings obtained by tracing back the links among documents in the same cluster as in-site keywords, and stores both in the document keyword storage unit 22. FIG. 17 shows an example of the resulting document keywords.

In this example, the off-site keywords of a document included in a site are the same as the off-site keywords of that site's top page. Accordingly, the off-site keywords of the documents 112 to 119 in FIG. 14 are "gourmet information, restaurant search", the same as those of the document 111.

Also, when links are traced back, the documents already traced are remembered so that the trace does not loop. For example, simply tracing back the links to the document 116 in FIG. 14 yields, in addition to the anchor character string series "Tokyo←Kanto", an endless number of series generated by the loop, such as "Tokyo←Kanto←Back←Tokyo←Kanto" and "Tokyo←Kanto←Back←Tokyo←Kanto←Back←Tokyo...". Therefore, a document that has been traced once is not traced a second time within the same anchor character string series. As a result, the in-site keywords of the document 116 do not include "Back".

Also, in this example, a link from outside the site to a page that is not the top page is traced back one step only and included in the in-site keywords. The document 116 therefore has two in-site keywords: "Tokyo←Kanto" and "Recommended shops in Tokyo". The document 119 in FIG. 14 is registered in the same way; when the same document is reached through different anchor character string series, each series is registered as a separate keyword.

That is, "Chinese←Tokyo←Kanto" and "Chinese←Osaka Prefecture←Kansai" are both anchor character string series that trace back to the document 111, but because they are different series, both are stored as in-site keywords. Here too, a series such as "Chinese←Tokyo←Kanto←Back←Tokyo←Kanto" is conceivable, but it is not stored as an in-site keyword because it traces the documents 113 and 116 twice each within the same series. In addition, because a link from outside the site to a page that is not the top page is traced back one step and included in the in-site keywords, "Chinese←Recommended shops in Tokyo" is also stored as an in-site keyword of the document 119.
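The backward trace with loop avoidance can be sketched as a depth-first walk. This is a sketch under assumptions, not the specification's procedure: a series is emitted when the walk reaches the top page or takes one step over a link from outside the site, and a document already on the current series is never entered again. The document identifiers and the link list are a hypothetical reconstruction of the situation described in the text ("top", "kanto", "tokyo", and "chinese" stand in for the documents 111, 113, 116, and 119; "kansai" and "osaka" are invented names).

```python
def in_site_keywords(target, top_page, in_site_links, external_links):
    """Return the anchor-string series obtained by tracing links back from target.

    in_site_links: (anchor, source doc, destination doc) for links inside the site.
    external_links: (anchor, destination doc) for links from outside the site
    that point at a page other than the top page.
    """
    incoming = {}
    for anchor, src, dst in in_site_links:
        incoming.setdefault(dst, []).append((anchor, src))
    external = {}
    for anchor, dst in external_links:
        external.setdefault(dst, []).append(anchor)

    series = []

    def walk(doc, anchors, visited):
        if doc == top_page:
            if anchors:
                series.append("←".join(anchors))
            return
        for anchor, src in incoming.get(doc, []):
            # Never enter a document twice within one series (loop avoidance).
            if src not in visited:
                walk(src, anchors + [anchor], visited | {src})
        # A link from outside the site is traced back one step only.
        for anchor in external.get(doc, []):
            series.append("←".join(anchors + [anchor]))

    walk(target, [], {target})
    return series


# Hypothetical link structure (assumption).
in_site_links = [
    ("Kanto", "top", "kanto"),
    ("Kansai", "top", "kansai"),
    ("Tokyo", "kanto", "tokyo"),
    ("Back", "tokyo", "kanto"),
    ("Osaka Prefecture", "kansai", "osaka"),
    ("Chinese", "tokyo", "chinese"),
    ("Chinese", "osaka", "chinese"),
]
external_links = [("Recommended shops in Tokyo", "tokyo")]

tokyo_keywords = in_site_keywords("tokyo", "top", in_site_links, external_links)
chinese_keywords = in_site_keywords("chinese", "top", in_site_links, external_links)
```

Tracking the visited set per series, rather than globally, is what lets the same document appear in several different series while still cutting every loop.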

[Second Example]
Next, a second example of the present invention will be described with reference to the drawings. The second example corresponds to the second embodiment of the present invention. In this example, a personal computer serves as the data processing device 5 of the second embodiment shown in FIG. 7, and a magnetic disk storage device serves as the storage device 6.

The central processing unit of this personal computer has the same functions as in the first example, and differs from the first example in that it also functions as the index creation means 15 and the index search means 16 shown in FIG. 7. This example also differs from the first example in that a keyboard is provided as the input device and a display as the output device, and in that the magnetic disk storage device also holds the first index storage unit 23 shown in FIG. 7. FIG. 14 shows an example of the group of hypertext documents stored in the hypertext database 21 of this example.

In this example there are two kinds of processing, registration and search, which run at different times. A search is performed each time the user enters input, whereas registration needs to be performed only once in advance. In the registration process, the hypertext access means 11 first reads each document stored in the hypertext database 21 and passes it to the document cluster information acquisition means 12. The document cluster information acquisition means 12 extracts link information from the documents it is given and generates a document reference relation table as shown in FIG. 15.

Next, based on the "condition for a site's top page" and the "condition for documents included in the same site" specified by the target designation means 13, the document cluster information acquisition means 12 generates a document cluster table as shown in FIG. 16. In this example, as in the first example, the condition for a site's top page is that the document's URL is "http://domain-name/" or "http://domain-name/index.html", and the condition for documents included in the same site is that they have the same domain name.

Next, the document keyword determination means 14 refers to the document reference relation table and the document cluster table generated by the document cluster information acquisition means 12. It takes the anchor character strings of links pointing to each site's top page from outside the site as off-site keywords, and, for each in-cluster document, takes the series of anchor character strings obtained by tracing back the links among documents in the same cluster as in-site keywords, and stores both in the document keyword storage unit 22. FIG. 17 shows an example of the resulting document keywords.

In this example, the off-site keywords of a document included in a site are the same as the off-site keywords of that site's top page. Accordingly, the off-site keywords of the documents 112 to 119 in FIG. 14 are "gourmet information, restaurant search", the same as those of the document 111.

Also, when links are traced back, the documents already traced are remembered so that the trace does not loop. For example, simply tracing back the links to the document 116 in FIG. 14 yields, in addition to the anchor character string series "Tokyo←Kanto", an endless number of series generated by the loop, such as "Tokyo←Kanto←Back←Tokyo←Kanto" and "Tokyo←Kanto←Back←Tokyo...". Therefore, a document that has been traced once is not traced a second time within the same anchor character string series. As a result, the in-site keywords of the document 116 do not include "Back".

Also, in this example, a link from outside the site to a page that is not the top page is traced back one step only and included in the in-site keywords. The document 116 therefore has two in-site keywords: "Tokyo←Kanto" and "Recommended shops in Tokyo". The document 119 in FIG. 14 is registered in the same way; when the same document is reached through different anchor character string series, each series is registered as a separate keyword.

That is, "Chinese←Tokyo←Kanto" and "Chinese←Osaka Prefecture←Kansai" are both anchor character string series that trace back to the document 111, but because they are different series, both are stored as in-site keywords. Here too, a series such as "Chinese←Tokyo←Kanto←Back←Tokyo←Kanto" is conceivable, but it is not stored as an in-site keyword because it traces the documents 113 and 116 twice each within the same series. In addition, because a link from outside the site to a page that is not the top page is traced back one step and included in the in-site keywords, "Chinese←Recommended shops in Tokyo" is also stored as an in-site keyword of the document 119.

Next, the index creation means 15 creates an index recording which words are registered for which documents as off-site keywords, and then creates an index recording which words are registered for which documents as in-site keywords.
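The index creation step can be pictured as building an inverted index from the stored keyword series. The sketch below is an assumption-laden illustration: `build_index` is a hypothetical name, each anchor character string is treated as a single term (the text does not say how anchor strings are tokenized at this stage), and the sample data contains only the three in-site series the text gives for document 119.

```python
from collections import defaultdict
from typing import Dict, List


def build_index(doc_keywords: Dict[int, List[str]]) -> Dict[str, Dict[int, int]]:
    """Map each term to {document id: appearance frequency}."""
    index: Dict[str, Dict[int, int]] = defaultdict(lambda: defaultdict(int))
    for doc_id, series_list in doc_keywords.items():
        for series in series_list:
            # A series looks like "Chinese ← Tokyo ← Kanto"; each anchor is one term here.
            for term in series.split("←"):
                index[term.strip()][doc_id] += 1
    return {term: dict(postings) for term, postings in index.items()}


# In-site keyword series of document 119 as listed in the text.
in_site_keywords = {
    119: [
        "Chinese ← Tokyo ← Kanto",
        "Chinese ← Osaka Prefecture ← Kansai",
        "Chinese ← Recommended shops in Tokyo",
    ],
}
in_site_index = build_index(in_site_keywords)
print(in_site_index["Chinese"])
```

Calling the same function once on the off-site keywords and once on the in-site keywords would give the two separate indexes mentioned above. Counting every series separately is what makes "Chinese" appear three times for document 119.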

Next, the search process is described in detail. Suppose that the search condition "Nara gourmet" (奈良グルメ) is entered from the keyboard. The index search means 16 then splits the search condition into the two keywords "Nara" and "gourmet", either by dividing it at spaces or at the particle "の" ("no"), or by morphological analysis.

Next, the index search means 16 checks whether either "Nara" or "gourmet" appears among the off-site keywords. If one does, the keyword, its appearance frequency, and the documents in which it appears are stored as search result candidates. When the document keywords are those of FIG. 17, "gourmet" appears once in the off-site keywords of each of documents 111 to 119, so these documents become search result candidates.

Next, from among the search result candidates, the index search means 16 adds to the search result list the documents whose in-site keywords contain the remaining keyword "Nara", together with the keyword appearance frequencies. When the document keywords are those of FIG. 17, among the candidate documents 111 to 119, "Nara" appears in the in-site keywords of documents 114, 117, and 118, once in each. Finally, the index search means 16 sorts the search result list by keyword appearance frequency and presents the search results to the user on the display.

In this example the keyword appearance frequency is one for each of documents 114, 117, and 118. The search results may instead be sorted by a score in which the appearance frequency is weighted according to where the keyword appears (in the off-site keywords, in the in-site keywords, near the beginning of an in-site keyword series, or in the body text).

Suppose next that the search condition "Nara gourmet search" (奈良グルメ検索) is entered from the keyboard. The index search means 16 splits the search condition into "Nara", "gourmet", and "search", either by dividing it at spaces or at the particle "の" ("no"), or by morphological analysis.

Next, the index search means 16 checks whether any of "Nara", "gourmet", and "search" appears among the off-site keywords. If one does, the keyword, its appearance frequency, and the documents in which it appears are stored as search result candidates. When the document keywords are those of FIG. 17, "gourmet" and "search" each appear once in the off-site keywords of each of documents 111 to 119, so these documents become search result candidates.

Next, from among the search result candidates, the index search means 16 adds to the search result list the documents whose in-site keywords contain the remaining keyword "Nara", together with the keyword appearance frequencies. When the document keywords are those of FIG. 17, among the candidate documents 111 to 119, "Nara" appears in the in-site keywords of documents 114, 117, and 118, once in each. Finally, the index search means 16 sorts the search result list by keyword appearance frequency and presents the search results to the user on the display.

In this example the keyword appearance frequency is one for each of documents 114, 117, and 118. The search results may instead be sorted by a score in which the appearance frequency is weighted according to where the keyword appears (in the off-site keywords, in the in-site keywords, near the beginning of an in-site keyword series, or in the body text).

Suppose next that the search condition "Nara Chinese" (奈良中華) is entered from the keyboard. The index search means 16 splits the search condition into "Nara" and "Chinese", either by dividing it at spaces or at the particle "の" ("no"), or by morphological analysis.

Next, the index search means 16 checks whether either "Nara" or "Chinese" appears among the off-site keywords. If one does, the keyword, its appearance frequency, and the documents in which it appears are stored as search result candidates. When the document keywords are those of FIG. 17, neither "Nara" nor "Chinese" appears as an off-site keyword.

Next, from among all documents, the index search means 16 adds to the search result list the documents whose in-site keywords contain both "Nara" and "Chinese", together with the keyword appearance frequencies. When the document keywords are those of FIG. 17, "Nara" and "Chinese" each appear once in document 117, so document 117 is registered in the search result list. Finally, the index search means 16 sorts the search result list by keyword appearance frequency and presents the search results to the user on the display.

Suppose next that the search condition "Chinese restaurant" (中華レストラン) is entered from the keyboard. The index search means 16 splits the search condition into "Chinese" and "restaurant", either by dividing it at spaces or at the particle "の" ("no"), or by morphological analysis.

Next, the index search means 16 checks whether either "Chinese" or "restaurant" appears among the off-site keywords. If one does, the keyword, its appearance frequency, and the documents in which it appears are stored as search result candidates. When the document keywords are those of FIG. 17, "restaurant" appears once in the off-site keywords of each of documents 111 to 119, so these documents become search result candidates.

Next, from among the search result candidates, the index search means 16 adds to the search result list the documents whose in-site keywords contain the remaining keyword "Chinese", together with the keyword appearance frequencies. When the document keywords are those of FIG. 17, among the candidate documents 111 to 119, "Chinese" appears in the in-site keywords of documents 117 and 119, once and three times, respectively. Finally, the index search means 16 sorts the search result list by keyword appearance frequency and presents the search results to the user on the display.

In this example, the in-site keywords of document 119 were taken to be the three series "Chinese ← Osaka Prefecture ← Kansai", "Chinese ← Tokyo ← Kanto", and "Chinese ← Recommended shops in Tokyo", and "Chinese" was counted as appearing three times. However, because every occurrence of "Chinese" derives from the same link, its appearance frequency may be counted as one. Alternatively, the in-site keyword of document 119 may be stored as "Chinese ← Osaka Prefecture, Tokyo, Recommended shops in Tokyo ← Kansai, Kanto", with "Chinese" counted as appearing once.
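The queries walked through in this example follow one pattern: keywords found among the off-site keywords select the candidate documents, and the remaining keywords are then looked up in the in-site keywords of those candidates. The sketch below illustrates that pattern under several assumptions that the text leaves open: candidates are intersected when more than one keyword matches, the sort key is the sum of all matched frequencies, and ties are broken by document number. The name `two_stage_search` is hypothetical, and the two index dictionaries stand in for FIG. 17, filled in only with the frequencies stated in the text.

```python
from typing import Dict, List, Tuple

Index = Dict[str, Dict[int, int]]  # keyword -> {document id: appearance frequency}

DOCS = list(range(111, 120))
OFF_SITE: Index = {word: {d: 1 for d in DOCS} for word in ("gourmet", "search", "restaurant")}
IN_SITE: Index = {
    "Nara": {114: 1, 117: 1, 118: 1},
    "Chinese": {117: 1, 119: 3},
}


def two_stage_search(keywords: List[str], off_site: Index, in_site: Index,
                     all_docs: List[int]) -> List[Tuple[int, int]]:
    """Return (document, total frequency) pairs, highest frequency first."""
    matched = [k for k in keywords if k in off_site]        # stage 1: off-site keywords
    remaining = [k for k in keywords if k not in off_site]  # stage 2: in-site keywords
    candidates = set(all_docs)  # if nothing matches off-site, every document is a candidate
    for k in matched:
        candidates &= set(off_site[k])
    for k in remaining:
        candidates &= set(in_site.get(k, {}))
    scores = {
        d: sum(off_site[k][d] for k in matched) + sum(in_site[k][d] for k in remaining)
        for d in candidates
    }
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


print(two_stage_search(["Chinese", "restaurant"], OFF_SITE, IN_SITE, DOCS))
```

For "Nara" and "gourmet" this returns documents 114, 117, and 118; for "Nara" and "Chinese" only document 117; and for "Chinese" and "restaurant" document 119 ahead of document 117, because "Chinese" is counted three times for document 119.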

[Third Example]
Next, a third example of the present invention is described with reference to the drawings. The third example corresponds to the third embodiment of the present invention. As in the first example, the data processing device 9 of the third embodiment shown in FIG. 9 is a personal computer and the storage device 8 is a magnetic disk storage device. This example differs from the first example in that the central processing unit of the personal computer also functions as the document vector creation means 17 and the similarity calculation means 18 shown in FIG. 9, and in that the magnetic disk storage device also holds the document vector storage unit 24, the category condition storage unit 25, and the classification result storage unit 26 shown in FIG. 9.

Next, the operation of this embodiment is described. First, the hypertext access means (11 in FIG. 9) reads each document stored in the hypertext database (21 in FIG. 9) and passes it to the document cluster information acquisition means (12 in FIG. 9). FIG. 14 shows an example of the hypertext group stored in the hypertext database 21. The document cluster information acquisition means 12 extracts link information from the documents it receives and generates a document reference relation table such as the one shown in FIG. 15.

Next, the document cluster information acquisition means 12 generates a document cluster table such as the one shown in FIG. 16, based on the "condition for the top page of a site" and the "condition for documents belonging to the same site" specified by the target designation means (13 in FIG. 9). In this example, the "condition for the top page of a site" is that the URL of the document is "http://(domain name)/" or "http://(domain name)/index.html", and the "condition for documents belonging to the same site" is that the domain name is the same.
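The two conditions used in this example are simple enough to express directly. The following is a minimal sketch using Python's standard `urllib.parse`; the function names `is_top_page` and `same_site` and the `example.com` URLs are hypothetical and serve only to illustrate the conditions stated above.

```python
from urllib.parse import urlparse


def is_top_page(url: str) -> bool:
    """True if the URL is http://(domain name)/ or http://(domain name)/index.html."""
    parts = urlparse(url)
    return (
        parts.scheme == "http"
        and bool(parts.hostname)
        and parts.path in ("/", "/index.html")
        and not parts.query
        and not parts.fragment
    )


def same_site(url_a: str, url_b: str) -> bool:
    """True if both URLs have the same domain name (compared case-insensitively)."""
    host_a = urlparse(url_a).hostname
    host_b = urlparse(url_b).hostname
    return host_a is not None and host_a == host_b


print(is_top_page("http://example.com/index.html"))
print(same_site("http://example.com/kanto/tokyo.html", "http://example.com/"))
```

Grouping documents by `same_site` and flagging the one that satisfies `is_top_page` would give the rows of a document cluster table of the kind the text refers to. Other site definitions, such as same administrator and same theme, would need a different predicate.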

Next, the document keyword determination means (14 in FIG. 9) refers to the document reference relation table and the document cluster table generated by the document cluster information acquisition means 12. For the top page of each site, it takes the series of anchor character strings of the links made to that page from outside the site as off-site keywords. For each document in a cluster, it takes the series of anchor character strings obtained by tracing back the links among documents in the same cluster as in-site keywords. Both are stored in the document keyword storage unit (22 in FIG. 9). FIG. 17 shows an example of the resulting document keywords.

In this example, a link from outside the site to a page other than the top page is traced back by exactly one step and included in the in-site keywords. Accordingly, document 116 has two in-site keywords: "Tokyo ← Kanto" and "Recommended shops in Tokyo". Document 119 in FIG. 14 is registered in the same way; when the same document is reached through different anchor character string series, each series is registered as a separate keyword.

Next, for each document, the document vector creation means 17 stores in the document vector storage unit 24 a document vector recording how many times each keyword appears in each part of the document (off-site keywords, in-site keywords, title, body text, and so on).

Then, for the document vector of each document stored in the document vector storage unit 24, the similarity calculation means 18 calculates the cosine with the feature vector of each category stored in the category condition storage unit 25, and stores the results in the classification result storage unit 26.
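The cosine between a document vector and a category feature vector can be sketched as follows. The vectors are represented here as sparse dictionaries keyed by (part, keyword), which is one possible reading of "which keyword appears in which part how many times"; the names `cosine` and `classify`, the sample document, and the category names "food" and "travel" are all hypothetical.

```python
import math
from typing import Dict, Tuple

Vector = Dict[Tuple[str, str], float]  # (part, keyword) -> count or weight


def cosine(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(doc: Vector, categories: Dict[str, Vector]) -> Dict[str, float]:
    """Similarity of one document vector to every category feature vector."""
    return {name: cosine(doc, feature) for name, feature in categories.items()}


doc_vector: Vector = {("off_site", "gourmet"): 1, ("in_site", "Nara"): 1, ("in_site", "Chinese"): 1}
category_vectors: Dict[str, Vector] = {
    "food": {("off_site", "gourmet"): 1, ("in_site", "Chinese"): 1},
    "travel": {("body", "hotel"): 1},
}
print(classify(doc_vector, category_vectors))
```

Keeping the part in the key means that a keyword in the in-site keywords and the same keyword in the body text count as different dimensions. How the stored similarities are turned into a category assignment is not specified in the text, so the sketch only returns the scores.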

[Fourth Example]
Next, a fourth example of the present invention is described with reference to the drawings. The fourth example corresponds to the fourth embodiment of the present invention. As in the second example, the data processing device 9 of the fourth embodiment shown in FIG. 11 is a personal computer and the storage device 10 is a magnetic disk storage device. This example differs from the second example in that the central processing unit of the personal computer functions only as the index search means, and in that the magnetic disk storage device holds the second index storage unit 27 shown in FIG. 11 instead of the hypertext database and the document keyword storage unit.

FIG. 18 shows an example of the index created from document meta information and stored in the first index storage unit 23 shown in FIG. 11. FIG. 18 records each keyword, the documents in which it appears, and its appearance frequency. For example, the documents registered under the keyword "hotel" are documents 211, 212, and 214, in which "hotel" appears 3 times, 1 time, and 5 times, respectively.

FIG. 19 shows an example of the index created from the body text of the documents and stored in the second index storage unit 27. The second index has the same format as the first index and records each keyword, the documents in which it appears, and its appearance frequency. For example, the documents registered under the keyword "Tokyo" are documents 212, 213, 214, 217, 218, and 219, in which "Tokyo" appears 1, 4, 6, 8, 1, and 2 times, respectively.

Suppose that the search condition "Nara gourmet" (奈良グルメ) is entered from the keyboard. The index search means 16 splits the search condition into the keywords "Nara" and "gourmet", either by dividing it at spaces or at the particle "の" ("no"), or by morphological analysis.

Next, the index search means 16 checks whether either of the keywords "Nara" and "gourmet" is registered in the first index storage unit 23. If one is, the keyword, its appearance frequency, and the documents in which it appears are stored as search result candidates. When the first index is that of FIG. 18, "gourmet" is registered, so documents 211, 212, 213, and 214 become search result candidates.

Next, from among the search result candidates, the index search means 16 adds to the search result list the documents for which the remaining keyword "Nara" is registered in the second index storage unit 27, together with the keyword appearance frequencies. When the second index is that of FIG. 19, among the candidate documents 211, 212, 213, and 214, "Nara" appears only in document 213, so document 213 is registered in the search result list.

Next, the index search means 16 sorts the search result list by keyword appearance frequency and shows the search results on the display. In this case, the search result that is output is document 213, the only document registered in the search result list.

Suppose next that the search condition "Osaka library" (大阪の図書館) is entered from the keyboard. The index search means 16 splits the search condition into the keywords "Osaka" and "library", either by dividing it at spaces or at the particle "の" ("no"), or by morphological analysis.
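The delimiter-based part of this split can be sketched with a regular expression. This is a deliberately naive illustration: the function name `split_query` is hypothetical, every "の" is treated as a separator (which would wrongly split words that contain that character), and the morphological analysis alternative mentioned in the text is not implemented.

```python
import re
from typing import List


def split_query(condition: str) -> List[str]:
    """Split a search condition at whitespace (including full-width spaces) or at 'の'."""
    return [part for part in re.split(r"[\s\u3000の]+", condition) if part]


print(split_query("大阪の図書館"))
print(split_query("奈良 グルメ"))
```

A condition such as 奈良グルメ, which contains neither a space nor "の", comes back as a single keyword from this function; that is the case where morphological analysis would be needed to obtain "Nara" and "gourmet".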

Next, the index search means 16 checks whether either of the keywords "Osaka" and "library" is registered in the first index storage unit 23. If one is, the keyword, its appearance frequency, and the documents in which it appears are stored as search result candidates. When the first index is that of FIG. 18, "library" is registered, so documents 215, 216, 217, 218, and 219 become search result candidates. The keyword "library" appears in them 1, 5, 2, 7, and 4 times, respectively.

Next, from among the search result candidates, the index search means 16 adds to the search result list the documents for which the remaining keyword "Osaka" is registered in the second index storage unit 27, together with the keyword appearance frequencies. When the second index is that of FIG. 19, among the candidate documents 215, 216, 217, 218, and 219, the keyword "Osaka" appears in documents 216, 217, and 219, with appearance frequencies of 2, 4, and 8, respectively.

Next, the index search means 16 sorts the search result list by keyword appearance frequency and shows the search results on the display. The total appearance frequency of the keywords "library" and "Osaka" is 7 for document 216 (5 + 2), 6 for document 217 (2 + 4), and 12 for document 219 (4 + 8), so the display lists the documents in the order document 219, document 216, document 217.

In this example the results are sorted simply by the total keyword appearance frequency. Alternatively, the keyword appearance frequency in the first index and the keyword appearance frequency in the second index may each be multiplied by a different weight, and the results sorted by the score obtained by summing the weighted values.
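Both the plain total and the weighted variant can be written as one scoring function. The sketch below uses the frequencies given in the text for "library" (first index) and "Osaka" (second index); the function name `rank`, the parameter names `w1` and `w2`, and the example weights 4.0 and 0.5 are hypothetical, since the text does not give concrete weight values.

```python
from typing import Dict, List, Tuple

Postings = Dict[int, int]  # document id -> appearance frequency

# Frequencies stated in the text for the "Osaka library" query.
LIBRARY_FIRST_INDEX: Postings = {215: 1, 216: 5, 217: 2, 218: 7, 219: 4}
OSAKA_SECOND_INDEX: Postings = {216: 2, 217: 4, 219: 8}


def rank(first: Postings, second: Postings,
         w1: float = 1.0, w2: float = 1.0) -> List[Tuple[int, float]]:
    """Score documents found in both indexes by w1 * first + w2 * second, best first."""
    docs = first.keys() & second.keys()
    scores = {d: w1 * first[d] + w2 * second[d] for d in docs}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


print(rank(LIBRARY_FIRST_INDEX, OSAKA_SECOND_INDEX))
print(rank(LIBRARY_FIRST_INDEX, OSAKA_SECOND_INDEX, w1=4.0, w2=0.5))
```

With both weights equal to 1 the totals are 12, 7, and 6, which reproduces the order document 219, document 216, document 217. With the illustrative weights 4.0 and 0.5 the scores become 21, 20, and 10 for documents 216, 219, and 217, so a heavier weight on the first index moves document 216 to the top.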

FIG. 1 is a block diagram showing the configuration of the first embodiment of the present invention.
FIG. 2 is a diagram showing an example of the hypertext group stored in the hypertext database in the first embodiment of the present invention.
FIG. 3 is a diagram showing an example of the document keywords stored in the document keyword storage unit in the first embodiment of the present invention.
FIG. 4 is a diagram showing an example of the document reference relation table generated by the document cluster information acquisition unit in the first embodiment of the present invention.
FIG. 5 is a diagram showing an example of the document cluster table generated by the document cluster information acquisition unit in the first embodiment of the present invention.
FIG. 6 is a flowchart showing the operation of the first embodiment of the present invention.
FIG. 7 is a block diagram showing the configuration of the second embodiment of the present invention.
FIG. 8 is a flowchart showing the operation of the second embodiment of the present invention.
FIG. 9 is a block diagram showing the configuration of the third embodiment of the present invention.
FIG. 10 is a flowchart showing the operation of the third embodiment of the present invention.
FIG. 11 is a block diagram showing the configuration of the fourth embodiment of the present invention.
FIG. 12 is a flowchart showing the operation of the fourth embodiment of the present invention.
FIG. 13 is a block diagram showing the configuration of the fifth to eighth embodiments of the present invention.
FIG. 14 is a diagram showing an example of the hypertext group stored in the hypertext database in the first example of the present invention.
FIG. 15 is a diagram showing an example of the document reference relation table generated by the document cluster information acquisition unit in the first example of the present invention.
FIG. 16 is a diagram showing an example of the document cluster table generated by the document cluster information acquisition unit in the first example of the present invention.
FIG. 17 is a diagram showing an example of the document keywords stored in the document keyword storage unit in the first example of the present invention.
FIG. 18 is a diagram showing an example of the index stored in the first index storage unit in the fourth example of the present invention.
FIG. 19 is a diagram showing an example of the index stored in the second index storage unit in the fourth example of the present invention.

Explanation of symbols

1, 5, 7 Data processing device
2, 6, 8 Storage device
3 Input means
4 Output means
11 Hypertext access means
12 Document cluster information acquisition means
13 Target designation means
14 Document keyword determination means
15 Index creation means
16 Index search means
17 Document vector creation means
18 Similarity calculation means
21 Hypertext database
22 Document keyword storage unit
23 First index storage unit
24 Document vector storage unit
25 Category condition storage unit
26 Classification result storage unit
27 Second index storage unit
30 Storage medium
31 Input device
32 Data processing device
33 Output device
34 Storage device
35 Input memory
36 Work memory

Claims

A document search apparatus that searches, from among a plurality of hierarchized documents included in each of document clusters formed by a plurality of documents structured by hyperlinks, for a document group that matches an input keyword condition, the apparatus comprising:
first index storage means for storing, for each of a plurality of documents in the same document cluster, a first index that associates a word group characterizing the relationship between the document cluster and the document, the word group being keywords extracted from a series of anchor character strings obtained by tracing back a plurality of hyperlinks to the document, with the documents containing the word group and the appearance frequency in those documents;
second index storage means for storing a second index that associates a word group characterizing the content of the document itself, the word group being keywords that appear in the body text of each of the plurality of documents, with the documents containing the word group and the appearance frequency in those documents; and
index search means which, when the keyword condition contains n (n ≧ 2) keywords, refers to the first index stored in the first index storage means and, on finding that m (1 ≦ m ≦ n − 1) of the n keywords are contained in the first index, obtains from the first index a first search result indicating the appearance frequency of the m keywords and the documents in which they appear, then refers to the second index stored in the second index storage means with the remaining n − m keywords to obtain a second search result indicating the appearance frequency of those of the n − m keywords that are contained in the second index and the documents in which they appear, and outputs the first and second search results.

A document search apparatus that searches, from among a plurality of hierarchized documents included in each of document clusters formed by a plurality of documents structured by hyperlinks, for a document group that matches an input search condition sentence, the apparatus comprising:
first index storage means for storing, for each of a plurality of documents in the same document cluster, a first index that associates a word group characterizing the relationship between the document cluster and the document, the word group being keywords extracted from a series of anchor character strings obtained by tracing back a plurality of hyperlinks to the document, with the documents containing the word group and the appearance frequency in those documents;
second index storage means for storing a second index that associates a word group characterizing the content of the document itself, the word group being keywords that appear in the body text of each of the plurality of documents, with the documents containing the word group and the appearance frequency in those documents; and
index search means which, when the search condition sentence consists of two keywords that are joined by the particle "no", separated by a space, or combined into a compound word, refers to the first index stored in the first index storage means with the two keywords and, on finding that one of the two keywords is contained in the first index, obtains from the first index a first search result indicating the appearance frequency of that keyword and the documents in which it appears, then refers to the second index stored in the second index storage means with the other keyword to obtain a second search result indicating the appearance frequency of the other keyword contained in the second index and the documents in which it appears, and outputs the first and second search results.

The document search apparatus according to claim 1, wherein the index search means
applies different weights to the appearance frequency in the first index of a first keyword retrieved from the first index stored in the first index storage means and to the appearance frequency in the second index of a second keyword retrieved from the second index stored in the second index storage means, calculates, for each document containing the first and second keywords, the sum of the weighted appearance frequencies, and outputs the search results in order of the sum.

The document search apparatus according to claim 1, wherein the first index storage means stores, for each of a plurality of documents in the same document cluster, in-site keywords of the document as the word group characterizing the relationship between the document cluster and the document, the in-site keywords indicating a series of anchor character strings, each being a character string in a link source document, obtained by treating the document as a link destination and tracing links back toward the link source documents in upper layers, as determined from the hyperlink relationships in the document cluster and/or the positional relationships in the directory hierarchy, up to the topmost document of the document cluster.

The document search apparatus according to claim 1, wherein the first index storage means stores, for each of a plurality of documents in the same document cluster, as the word group characterizing the relationship between the document cluster and the document, both in-site keywords of the document and off-site keywords, the in-site keywords indicating a series of anchor character strings, each being a character string in a link source document, obtained by treating the document as a link destination and tracing links back toward the link source documents in upper layers, as determined from the hyperlink relationships in the document cluster and/or the positional relationships in the directory hierarchy, up to the topmost document of the document cluster, and the off-site keywords indicating anchor character strings that are character strings in documents of other document clusters linked to the topmost document of the document cluster.

A document search method in which a data processing device searches, from among a plurality of hierarchized documents included in each of document clusters formed by a plurality of documents structured by hyperlinks, for a document group that matches an input keyword condition, the method comprising:
a first registration step in which the data processing device registers, in the first index storage means, for each of a plurality of documents in the same document cluster, a first index that associates a word group characterizing the relationship between the document cluster and the document, the word group being keywords extracted from a series of anchor character strings obtained by tracing back a plurality of hyperlinks to the document, with the documents including the word group and the appearance frequency in those documents;
a second registration step in which the data processing device registers, in the second index storage means, a second index that associates a word group characterizing the content of the document itself, the word group being keywords that appear in the text of each of the plurality of documents, with the documents including the word group and the appearance frequency in those documents; and
an index search step in which, when the keyword condition includes n (n ≥ 2) keywords, the data processing device refers to the first index registered in the first index storage means with the n keywords and, when the search finds that m (1 ≤ m ≤ n−1) of the keywords are included in the first index, obtains from the first index a first search result indicating the appearance frequency of the m keywords and the documents in which they appear, then refers to the second index registered in the second index storage means with the remaining n−m keywords, obtains a second search result indicating the appearance frequency of those of the n−m keywords that are included in the second index and the documents in which they appear, and outputs the first and second search results.
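
The index search step of this claim might be sketched as follows. Both indexes are modelled as plain dictionaries of the form `{keyword: {doc_id: frequency}}`; that layout, the function name, and the sample data are assumptions for illustration. Returning `None` for cases outside 1 ≤ m ≤ n−1 is also a choice made here, since the claim does not say what happens in those cases.

```python
def index_search(keywords, first_index, second_index):
    """Split n keywords between the two indexes as described in the claim.

    first_index / second_index: {keyword: {doc_id: appearance frequency}}
    (a hypothetical in-memory stand-in for the index storage means).
    """
    n = len(keywords)
    # The m keywords that are found in the first index.
    in_first = [k for k in keywords if k in first_index]
    m = len(in_first)
    if n < 2 or not (1 <= m <= n - 1):
        return None  # case not covered by this claim
    first_result = {k: first_index[k] for k in in_first}
    # The remaining n - m keywords are looked up in the second index.
    remaining = [k for k in keywords if k not in first_index]
    second_result = {k: second_index[k] for k in remaining if k in second_index}
    return first_result, second_result


first_index = {"nec": {"top.html": 2, "address.html": 1}}
second_index = {
    "nec": {"top.html": 9},
    "address": {"address.html": 4, "map.html": 1},
}
result = index_search(["nec", "address"], first_index, second_index)
uncovered = index_search(["address", "map"], first_index, second_index)
```

In the example, "nec" is resolved against the first index and "address" against the second, even though "nec" also occurs in the second index.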

A document search method in which a data processing device searches, from among a plurality of hierarchized documents included in each of document clusters formed by a plurality of documents structured by hyperlinks, for a document group that matches an input search condition sentence, the method comprising:
a first registration step in which the data processing device registers, in the first index storage means, for each of a plurality of documents in the same document cluster, a first index that associates a word group characterizing the relationship between the document cluster and the document, the word group being keywords extracted from a series of anchor character strings obtained by tracing back a plurality of hyperlinks to the document, with the documents including the word group and the appearance frequency in those documents;
a second registration step in which the data processing device registers, in the second index storage means, a second index that associates a word group characterizing the content of the document itself, the word group being keywords that appear in the text of each of the plurality of documents, with the documents including the word group and the appearance frequency in those documents; and
an index search step in which, when the search condition sentence consists of two keywords that are connected by the particle "の", separated by a space, or that together form a compound word, the data processing device refers to the first index registered in the first index storage means with the two keywords and, when the search finds that one of the two keywords is included in the first index, obtains from the first index a first search result indicating the appearance frequency of that keyword and the documents in which it appears, then refers to the second index registered in the second index storage means with the other keyword, obtains a second search result indicating the appearance frequency of the other keyword included in the second index and the documents in which it appears, and outputs the first and second search results.

The index search step includes:
The appearance frequency in the first index of a first keyword retrieved from the first index stored in the first index storage means, and the appearance frequency in the second index of a second keyword retrieved from the second index stored in the second index storage means, are weighted differently, and for each document including the first and second keywords the sum of the weighted appearance frequencies is calculated. 8. The document search method according to claim 6, wherein search results are output in order of the sum.

In the first registration step, for each of a plurality of documents in the same document cluster, the data processing device treats the document as a link destination document and traces links back toward the upper-layer link source documents up to the top-level document in the document cluster, the top-level document being determined from the hyperlink relationship in the document cluster and/or the positional relationship in the directory hierarchy, thereby obtaining a series of anchor character strings, each a character string in a link source document, as the intra-site keywords of the document. 9. The document search method according to claim 6, wherein the intra-site keywords are registered in the first index storage means as the word group that characterizes the relationship between the document cluster and the document.

In the first registration step, for each of a plurality of documents in the same document cluster, the data processing device treats the document as a link destination document and traces links back toward the upper-layer link source documents up to the top-level document in the document cluster, the top-level document being determined from the hyperlink relationship in the document cluster and/or the positional relationship in the directory hierarchy, thereby obtaining a series of anchor character strings, each a character string in a link source document, as the intra-site keywords of the document; the intra-site keywords, together with off-site keywords indicating anchor character strings that are character strings in documents of other document clusters linking to the top-level document of the document cluster, are registered in the first index storage means as the word group that characterizes the relationship between the document cluster and the document. 7. The document search method as claimed in any one of.

A document search program for causing a computer to realize a function of searching, from among a plurality of hierarchized documents included in each of document clusters formed by a plurality of documents structured by hyperlinks, for a document group that matches an input keyword condition,
the program causing the computer to execute:
a first registration step of registering, in the first index storage means, for each of a plurality of documents in the same document cluster, a first index that associates a word group characterizing the relationship between the document cluster and the document, the word group being keywords extracted from a series of anchor character strings obtained by tracing back a plurality of hyperlinks to the document, with the documents including the word group and the appearance frequency in those documents;
a second registration step of registering, in the second index storage means, a second index that associates a word group characterizing the content of the document itself, the word group being keywords that appear in the text of each of the plurality of documents, with the documents including the word group and the appearance frequency in those documents; and
an index search step of, when the keyword condition includes n (n ≥ 2) keywords, referring to the first index registered in the first index storage means with the n keywords and, when the search finds that m (1 ≤ m ≤ n−1) of the keywords are included in the first index, obtaining from the first index a first search result indicating the appearance frequency of the m keywords and the documents in which they appear, then referring to the second index registered in the second index storage means with the remaining n−m keywords, obtaining a second search result indicating the appearance frequency of those of the n−m keywords that are included in the second index and the documents in which they appear, and outputting the first and second search results.
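
The two registration steps could be sketched as follows. The input layout (`anchors` holding the anchor strings traced back for a document, `body` holding its text), the whitespace tokenization, and the function name are assumptions made for illustration; the claim does not specify how words are extracted or how the indexes are stored.

```python
from collections import Counter, defaultdict


def register_indexes(documents):
    """Build the first (anchor-based) and second (body-based) indexes.

    documents: {doc_id: {"anchors": [anchor strings], "body": text}}
    Returns two dicts of the form {word: {doc_id: appearance frequency}}.
    """
    first_index = defaultdict(dict)
    second_index = defaultdict(dict)
    for doc_id, doc in documents.items():
        # Words taken from the anchor strings that lead to this document.
        anchor_words = Counter(
            word for anchor in doc["anchors"] for word in anchor.lower().split()
        )
        # Words taken from the document's own text.
        body_words = Counter(doc["body"].lower().split())
        for word, freq in anchor_words.items():
            first_index[word][doc_id] = freq
        for word, freq in body_words.items():
            second_index[word][doc_id] = freq
    return dict(first_index), dict(second_index)


documents = {
    "spec.html": {
        "anchors": ["Products", "Personal computers"],
        "body": "memory and disk memory",
    },
    "top.html": {"anchors": [], "body": "products news"},
}
first_index, second_index = register_indexes(documents)
```

In this example "products" appears in the first index for `spec.html` (through an anchor string) and in the second index for `top.html` (through body text), which is the distinction the two indexes are meant to capture.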

A document search program for causing a computer to realize a function of searching, from among a plurality of hierarchized documents included in each of document clusters formed by a plurality of documents structured by hyperlinks, for a document group that matches an input search condition sentence,
the program causing the computer to execute:
a first registration step of registering, in the first index storage means, for each of a plurality of documents in the same document cluster, a first index that associates a word group characterizing the relationship between the document cluster and the document, the word group being keywords extracted from a series of anchor character strings obtained by tracing back a plurality of hyperlinks to the document, with the documents including the word group and the appearance frequency in those documents;
a second registration step of registering, in the second index storage means, a second index that associates a word group characterizing the content of the document itself, the word group being keywords that appear in the text of each of the plurality of documents, with the documents including the word group and the appearance frequency in those documents; and
an index search step of, when the search condition sentence consists of two keywords that are connected by the particle "の", separated by a space, or that together form a compound word, referring to the first index registered in the first index storage means with the two keywords and, when the search finds that one of the two keywords is included in the first index, obtaining from the first index a first search result indicating the appearance frequency of that keyword and the documents in which it appears, then referring to the second index registered in the second index storage means with the other keyword, obtaining a second search result indicating the appearance frequency of the other keyword included in the second index and the documents in which it appears, and outputting the first and second search results.