JPH11259509A

JPH11259509A - Information retrieval and classification method and system therefor

Info

Publication number: JPH11259509A
Application number: JP10060915A
Authority: JP
Inventors: Hiroyuki Kaji; 博行梶; Yasutsugu Morimoto; 康嗣森本; Toshiko Aizono; 敏子相薗
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1998-03-12
Filing date: 1998-03-12
Publication date: 1999-09-24

Abstract

PROBLEM TO BE SOLVED: To sufficiently reduce the amount of computation required for clustering and to retrieve and classify information at high speed. SOLUTION: Information is retrieved and classified by combining query-based retrieval with clustering. Specifically, a term correlation extraction unit 1d extracts knowledge about term correlations in advance from the set of documents in a document file 10a (for each term, the terms highly correlated with it are recorded together with their degrees of correlation) and registers it in a term correlation file 10d. When documents are retrieved in response to a user's search query, the terms highly correlated with the input terms are extracted from the term correlation file 10d and clustered based on the degrees of correlation, and the documents retrieved for the search query are then clustered based on the resulting term clusters.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to techniques for retrieving and classifying information by clustering, and more particularly to an information retrieval and classification method and an information retrieval and classification system suitable for classifying search results at high speed.

[0002]

[Prior Art] With advances in computer technology, information retrieval systems that search a database holding a large number of documents for documents matching a user's information need have come into use in many fields. In such systems, the user generally has to enter a search query using terms that express the information need.

[0003] However, it is not always easy for a user to think of terms that accurately express an information need. Moreover, a user can often come up with only a few terms at most. Consequently, in a system that stores a large number of documents, the number of retrieved documents becomes quite large, and relevant and non-relevant documents are mixed among them.

[0004] To deal with this problem, some systems compute the relevance between the search query and each document and output the documents in descending order of relevance. However, this does not guarantee that all relevant documents are ranked near the top. The burden on the user of picking out the relevant documents from the retrieved documents is therefore very heavy.

[0005] Systems based on a new approach have also been proposed that, instead of responding to a search query, let the user find documents matching an information need while browsing the entire database. This approach suits situations in which the user has an information need whose content is not necessarily clear. An example is the system described in D. R. Cutting, et al., "Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections," Proceedings of the 15th International Conference on Information Retrieval, pp. 318-329 (June 1992, Copenhagen).

[0006] In this technique, the system clusters the documents stored in the database based on inter-document similarity and presents the clusters to the user. The user selects the clusters that appear relevant to the information need. Based on this selection, the system clusters the documents in the selected clusters again and presents the result to the user. Through this interaction, the user can narrow the collection down to clusters containing documents that match the information need.

[0007] However, this technique has the following problems. When the process starts from clustering the entire database, the clusters obtained at first often contain quite diverse documents and have unclear characteristics. In addition, although interactivity is the essence of this technique, it is not easy to achieve response times that do not impair it.

[0008] The computational cost of this clustering is as follows. Clustering n elements (documents, in this case) generally requires computation on the order of n², including similarity calculations for all pairs of elements. To reduce this cost, Scatter/Gather proposes, instead of clustering all n elements at once, splitting the process into a first step that clusters m elements chosen at random from the n elements to generate cluster seeds, and a second step that assigns each of the remaining (n − m) elements to its nearest cluster.

[0009] In this way, the first step requires computation on the order of m² and the second step on the order of (n − m) × c, which is less than clustering all n elements at once. Here, c is the number of cluster seeds. However, to obtain good clustering results with this technique, m must be proportional to n. It is therefore not a sufficient solution for real-time document clustering.
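The cost comparison in paragraphs [0008] and [0009] can be made concrete with a small calculation. The following is only an illustrative sketch: the function names and the values of n, m, and c are assumptions chosen for the example, and constant factors are ignored.

```python
def full_clustering_cost(n: int) -> int:
    """Order-of-magnitude cost of clustering all n elements at once: n^2."""
    return n * n


def two_step_cost(n: int, m: int, c: int) -> int:
    """Cost of the two-step scheme: m^2 for seeding plus (n - m) * c for assignment."""
    return m * m + (n - m) * c


n, c = 100_000, 10                        # illustrative values only
all_at_once = full_clustering_cost(n)     # 10,000,000,000
fixed_m = two_step_cost(n, 1_000, c)      # m held constant: 1,990,000
scaled_m = two_step_cost(n, n // 10, c)   # m proportional to n: 100,900,000
```

With m held constant the two-step cost grows only linearly in n, but once m is tied to n (here m = n / 10) the m² term dominates and the cost is again quadratic in n, which is the limitation the text points out.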

[0010]

[Problems to Be Solved by the Invention] The problem to be solved is that the conventional techniques cannot sufficiently reduce the amount of computation required for clustering. An object of the present invention is to solve these problems of the prior art and to provide an information retrieval and classification method and an information retrieval and classification system capable of retrieving and classifying information at high speed.

[0011]

[Means for Solving the Problems] To achieve the above object, the information retrieval and classification method and system of the present invention retrieve and classify information by combining query-based retrieval with clustering. Specifically, knowledge about term correlations is extracted in advance from the set of documents stored in a document file and registered (for each term, the terms highly correlated with it are registered together with their degrees of correlation). When documents are retrieved in response to a search query that the user has entered using terms, the terms highly correlated with the user's terms are extracted from the registered terms and clustered based on the degrees of correlation between them, and the documents retrieved for the search query are then clustered based on these term clusters. The document clusters are displayed, the system waits for the user's selection, and term clustering and document clustering are repeated for the clusters the user selects. Because the process starts from the user's search query and repeats the cycle of clustering the retrieved relevant documents and selecting clusters, based on the query terms and their correlated terms, the processing time required for clustering can be shortened and the user can reach relevant documents efficiently.

[0012]

[Embodiments of the Invention] Embodiments of the present invention are described in detail below with reference to the drawings. FIG. 1 is a block diagram showing one embodiment of the configuration of the information retrieval and classification system according to the present invention, and FIG. 2 is a block diagram showing an example configuration of a computer constituting the information retrieval and classification system of FIG. 1.

[0013] As shown in FIG. 2, the information retrieval and classification system of this example is a computer comprising a processing unit 21 that has a CPU (Central Processing Unit) and performs stored-program computer processing, a storage device 22 consisting of RAM, a document input device 23 that inputs to the processing unit 21 the documents to be retrieved and classified, which are stored on a hard disk or the like, an input device 24 consisting of a keyboard, a mouse, and the like, and a display device 25 consisting of a CRT (Cathode Ray Tube) or the like.

[0014] The processing unit 21 executes the processing of the storage subsystem 1 and the search subsystem 2 in FIG. 1, and the storage device 22 stores the document file 10a, document feature file 10b, index file 10c, and term correlation file 10d in FIG. 1. The document input device 23 is a device for inputting document data; besides a hard disk drive, a CD-ROM drive, a floppy disk drive, or the like is used depending on the medium on which the document data is obtained. Using the input device 24 and the display device 25, the user operates the search subsystem interactively.

[0015] In the processing unit 21 of a computer configured in this way, a program in which the procedure of the information retrieval and classification method according to the present invention is recorded on an optical disk is loaded via a CD-ROM drive or the like. This configures the modules: the processing units of the storage subsystem 1 in FIG. 1 (document registration unit 1a, document feature extraction unit 1b, index file generation unit 1c, and term correlation extraction unit 1d) and, in the search subsystem 2, the related term selection unit 2a, term cluster generation unit 2b, document search unit 2c, document cluster generation unit 2d, document cluster display control unit 2e, and so on.

[0016] The configuration of the information retrieval and classification system of this example is described below with reference to FIG. 1. In FIG. 1, reference numeral 1 denotes a storage subsystem that generates and stores in advance the files used for retrieving and classifying information, and 2 denotes a search subsystem that retrieves and classifies information using the files stored by the storage subsystem 1.

[0017] The storage subsystem 1 has a document registration unit 1a, a document feature extraction unit 1b, an index file generation unit 1c, and a term correlation extraction unit 1d, whose results are registered in the document file 10a, document feature file 10b, index file 10c, and term correlation file 10d, respectively. Before describing the processing of these units, the files they create are described with reference to FIGS. 3 to 6.

[0018] FIG. 3 is an explanatory diagram showing an example record structure of the document file in FIG. 1, FIG. 4 is an explanatory diagram showing an example record structure of the document feature file in FIG. 1, FIG. 5 is an explanatory diagram showing an example record structure of the index file in FIG. 1, and FIG. 6 is an explanatory diagram showing an example record structure of the term correlation file in FIG. 1.

[0019] As shown in FIG. 3, the document file 10a consists of records each containing a document ID 31 and text data 32, which is the content of the document. The file is structured so that the record with a given document ID can be accessed directly using the document ID as a key. A file with such a structure can be implemented by various techniques, such as creating a B-tree index or hashing; since these techniques are well known, their description is omitted.

[0020] As shown in FIG. 4, the document feature file 10b consists of records each containing a document ID 41 and a term vector 42 of the document. The term vector 42 is a list of pairs of a term 42a and the weight 42b of that term. The file is structured so that records can be accessed directly using the document ID 41 as a key.

[0021] As shown in FIG. 5, the index file 10c consists of records each containing a term 51 and term occurrence document information 52. The term occurrence document information 52 is a list of pairs of a document ID 52a and the weight 52b of the term 51 in that document. The file is structured so that records can be accessed directly using the term 51 as a key.

[0022] As shown in FIG. 6, the term correlation file 10d consists of records each containing a term 61 and correlated term information 62. The correlated term information 62 is a list of pairs of a term 62a and a degree of correlation 62b. The file is structured so that records can be accessed directly using the term 61 as a key.
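The four record layouts of FIGS. 3 to 6 can be pictured as keyed maps. The sketch below is a hypothetical in-memory stand-in, not the file format of the embodiment: Python dictionaries provide the direct keyed access that the text attributes to a B-tree index or hashing, and all sample terms, IDs, and values are invented for illustration.

```python
# Document file 10a: document ID 31 -> text data 32
document_file: dict[int, str] = {
    1: "clustering of search results",
    2: "term correlation and clustering",
}

# Document feature file 10b: document ID 41 -> term vector 42,
# a list of (term 42a, weight 42b) pairs
document_feature_file: dict[int, list[tuple[str, float]]] = {
    1: [("clustering", 0.8), ("search", 0.5)],
    2: [("clustering", 0.6), ("correlation", 0.9)],
}

# Index file 10c: term 51 -> term occurrence document information 52,
# a list of (document ID 52a, weight 52b) pairs
index_file: dict[str, list[tuple[int, float]]] = {
    "clustering": [(1, 0.8), (2, 0.6)],
    "search": [(1, 0.5)],
    "correlation": [(2, 0.9)],
}

# Term correlation file 10d: term 61 -> correlated term information 62,
# a list of (term 62a, degree of correlation 62b) pairs
term_correlation_file: dict[str, list[tuple[str, float]]] = {
    "clustering": [("correlation", 1.2), ("search", 0.7)],
    "search": [("clustering", 0.7)],
}
```

Note that the index file holds the same (document, term, weight) triples as the document feature file, only keyed by term instead of by document.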

[0023] Next, the processing of each unit of the storage subsystem 1 in FIG. 1 (document registration unit 1a, document feature extraction unit 1b, index file generation unit 1c, and term correlation extraction unit 1d) is described. The document registration unit 1a reads documents one at a time from the document input device 23 in FIG. 2, assigns each a document ID, and outputs it to the document file 10a together with the text data of the document.

[0024] The document feature extraction unit 1b reads each document in the document file 10a, obtains the terms and their occurrence frequencies from the text data, and from these calculates the weight of each term in each document. Then, for each document, it selects the terms whose weight is at or above a predetermined threshold and outputs them to the document feature file 10b together with their weights. Here, the term weights are calculated based on the tf-idf (term frequency - inverse document frequency) concept.

[0025] That is, the weight w(di, tj) of term tj in document di is calculated by the following formula.

w(di, tj) = fi,j × log(N / nj)

Here, fi,j is the occurrence frequency of term tj in document di, N is the number of documents in the document file, and nj is the number of documents in which term tj occurs.
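The weight formula and the threshold selection of paragraph [0024] can be sketched as follows. This is a minimal illustration under stated assumptions: documents are already split into term lists (term extraction is not specified here), the logarithm is taken as the natural logarithm because the text does not state a base, and the function name `term_weights` is hypothetical.

```python
import math
from collections import Counter


def term_weights(docs: dict[int, list[str]],
                 threshold: float) -> dict[int, list[tuple[str, float]]]:
    """Return, per document, the (term, weight) pairs with weight >= threshold."""
    n_docs = len(docs)  # N: number of documents in the document file
    # n_j: number of documents in which each term occurs
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    features = {}
    for doc_id, tokens in docs.items():
        freq = Counter(tokens)  # f_ij: occurrences of term j in document i
        weighted = [(t, f * math.log(n_docs / doc_freq[t])) for t, f in freq.items()]
        features[doc_id] = [(t, w) for t, w in weighted if w >= threshold]
    return features


features = term_weights({1: ["a", "a", "b"], 2: ["b", "c"]}, threshold=0.1)
```

In this example the term "b" occurs in every document, so log(N / nj) = 0 and it is dropped by the threshold, while "a" in document 1 receives the weight 2 × log 2.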

[0026] The index file generation unit 1c reads the document feature file 10b and generates an inverted file. That is, whereas the document feature file 10b stores, for each document, a list of pairs of a term and its weight in that document, the index file generation unit 1c builds, for each term, a list of pairs of a document and the weight of that term in the document. It creates such records and outputs them to the index file 10c.
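The inversion described above amounts to regrouping the same (document, term, weight) triples by term. A minimal sketch, assuming the document feature file is available as a dictionary from document ID to term vector (the function name `build_index` is hypothetical):

```python
from collections import defaultdict


def build_index(features: dict[int, list[tuple[str, float]]]) -> dict[str, list[tuple[int, float]]]:
    """Invert document -> [(term, weight)] into term -> [(document ID, weight)]."""
    index = defaultdict(list)
    for doc_id, term_vector in features.items():
        for term, weight in term_vector:
            index[term].append((doc_id, weight))
    return dict(index)


index = build_index({1: [("a", 0.5), ("b", 0.2)], 2: [("b", 0.7)]})
```

Each term record lists its documents in the order in which the documents were read.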

[0027] The term correlation extraction unit 1d reads the documents in the document file 10a, obtains from the text data 32 the pairs of co-occurring terms and their co-occurrence frequencies, and from these calculates the degree of correlation between terms. For each term, it selects the terms whose degree of correlation is at or above a predetermined threshold and outputs them to the term correlation file 10d together with their degrees of correlation.

[0028] Co-occurring terms are extracted based on the window co-occurrence definition. That is, when a term t' occurs in the text at a position within W words of an occurrence of term t, t and t' are considered to co-occur. W is the window size, whose value is set in advance. Pairs of co-occurring terms are extracted and their frequencies counted while the window is moved along the text.
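Window co-occurrence counting can be sketched as below. The text does not say whether pairs are ordered, whether a term can co-occur with itself, or how overlapping windows are counted; this sketch assumes unordered pairs, skips identical terms, and counts each pair of positions once by looking only forward from each position. The function name is hypothetical.

```python
from collections import Counter


def count_cooccurrences(tokens: list[str], window: int) -> Counter:
    """Count unordered term pairs that occur within `window` words of each other."""
    pairs = Counter()
    for i, t in enumerate(tokens):
        # Look only forward so that each pair of positions is counted once.
        for u in tokens[i + 1 : i + 1 + window]:
            if u != t:
                pairs[tuple(sorted((t, u)))] += 1
    return pairs


pairs = count_cooccurrences(["a", "b", "c", "a"], window=2)
```

For the four-word text above with W = 2, the pairs (a, b) and (a, c) are each counted twice and (b, c) once.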

[0029] Mutual information is calculated as the degree of correlation between terms. That is, the degree of correlation α(ti, tj) between term ti and term tj is calculated by the following formula (Equation 1).

[Equation 1]

Here, f(ti) is the occurrence frequency of term ti, and g(ti, tj) is the co-occurrence frequency of term ti and term tj.
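Equation 1 itself is not reproduced in this text, so its exact form cannot be confirmed here. As a rough illustration only, the sketch below uses a common pointwise mutual information form, log(g(ti, tj) × T / (f(ti) × f(tj))), where the normalizing total T is an assumption introduced for this example and does not appear in the source.

```python
import math


def mutual_information(f_i: int, f_j: int, g_ij: int, total: int) -> float:
    """Assumed PMI form: log of the observed co-occurrence rate over the chance rate.

    f_i, f_j: occurrence frequencies f(ti), f(tj)
    g_ij:     co-occurrence frequency g(ti, tj)
    total:    assumed normalizer T (not given in the source)
    """
    if g_ij == 0:
        return float("-inf")  # the terms never co-occur
    return math.log((g_ij * total) / (f_i * f_j))


alpha = mutual_information(f_i=10, f_j=20, g_ij=5, total=1000)
```

Under this assumed form, a positive value means the two terms co-occur more often than independent occurrence would predict.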

[0030] Next, the search subsystem 2 in FIG. 1 is described with reference to FIG. 7. FIG. 7 is a block diagram showing a detailed example configuration of the search subsystem in FIG. 1. The search subsystem of this example consists of a search query reading unit 2f, a document search unit 2c, a related term selection unit 2a, a term cluster extraction unit 2b, a document cluster generation unit 2d, a document cluster display control unit 2e, a cluster number reading unit 2g, and a term set / document set reduction unit 2h.

[0031] The search query reading unit 2f and the cluster number reading unit 2g wait for user input from the input device 24 in FIG. 2 and then start processing. The related term selection unit 2a, term cluster extraction unit 2b, document search unit 2c, document cluster generation unit 2d, document cluster display control unit 2e, and term set / document set reduction unit 2h start processing when they receive the necessary data from other units.

[0032] Before describing the processing of these units of the search subsystem (related term selection unit 2a, term cluster extraction unit 2b, document search unit 2c, document cluster generation unit 2d, document cluster display control unit 2e, search query reading unit 2f, cluster number reading unit 2g, and term set / document set reduction unit 2h), the structure of the data exchanged between the units (search query data 10e, retrieved document set data 10f, related term set data 10g, term cluster data 10h, document cluster data 10i, and cluster number data 10j) is described with reference to FIGS. 8 to 13.

[0033] FIG. 8 is an explanatory diagram showing an example structure of the search query data in FIG. 7, FIG. 9 is an explanatory diagram showing an example structure of the retrieved document set data in FIG. 7, FIG. 10 is an explanatory diagram showing an example structure of the related term set data in FIG. 7, FIG. 11 is an explanatory diagram showing an example structure of the term cluster data in FIG. 7, FIG. 12 is an explanatory diagram showing an example structure of the document cluster data in FIG. 7, and FIG. 13 is an explanatory diagram showing an example structure of the cluster number data in FIG. 7.

[0034] As shown in FIG. 8, the search query data 10e consists of one logical operator 81 and one or more terms 82. The value of the logical operator 81 is "AND", "OR", or blank. As shown in FIG. 9, the retrieved document set data 10f consists of zero or more document IDs 91.

[0035] As shown in FIG. 10, the related term set data 10g consists of zero or more terms 101. As shown in FIG. 11, the term cluster data 10h consists of records corresponding to each of L clusters. Each record consists of a cluster number 111 and one or more terms 112.

[0036] As shown in FIG. 12, the document cluster data 10i consists of records corresponding to each of L clusters, and each record consists of a cluster number 121 and one or more pairs of a document ID 122 and a similarity 123. As shown in FIG. 13, the cluster number data 10j consists of one or more cluster numbers 131.

[0037] Next, the processing that each unit of the search subsystem in FIG. 7 performs on these data is described. The search query reading unit 2f reads a search query from the input device 24 in FIG. 2 and outputs values to the logical operator 81 and terms 82 of the search query data 10e. In this example, the form of a search query is limited to three types: a single term, a conjunction (AND) of multiple terms, and a disjunction (OR) of multiple terms. The values of the logical operator 81 and terms 82 in each case are as follows.

[0038] When the search query is a single term, that is, q = t₁, the logical operator 81 is left blank and t₁ is set in the terms 82. When the search query is a conjunction of multiple terms, that is, q = t₁ ∧ t₂ ∧ … ∧ tp, the logical operator 81 is set to AND and t₁, t₂, …, tp are set in the terms 82. When the search query is a disjunction of multiple terms, that is, q = t₁ ∨ t₂ ∨ … ∨ tp, the logical operator 81 is set to OR and t₁, t₂, …, tp are set in the terms 82.
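The three permitted query forms map onto a small record with one operator field and a list of terms. The sketch below is a hypothetical representation of the search query data 10e; the class and function names, and the choice to reject multi-term queries without an operator, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchQuery:
    operator: str           # logical operator 81: "AND", "OR", or "" (blank)
    terms: tuple[str, ...]  # terms 82: one or more terms


def make_query(terms: list[str], operator: str = "") -> SearchQuery:
    """Build search query data for one of the three permitted query forms."""
    if not terms:
        raise ValueError("a query needs at least one term")
    if len(terms) == 1:
        return SearchQuery("", (terms[0],))  # single term: operator is blank
    if operator not in ("AND", "OR"):
        raise ValueError("a multi-term query must use AND or OR")
    return SearchQuery(operator, tuple(terms))


single = make_query(["clustering"])
conjunction = make_query(["clustering", "retrieval"], "AND")
```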

[0039] The document search unit 2c reads the search query data 10e, looks up the records of the index file 10c using the terms 82 as keys, calculates the relevance to the search query of each document whose document ID appears in the term occurrence document information 52 of FIG. 5, and outputs the document IDs of the documents with high relevance to the retrieved document set data 10f.

[0040] The relevance R(d, q) between a document d and a search query q is calculated as follows according to the value of the logical operator 81.

When the logical operator is blank, that is, q = t₁: R(d, q) = w(d, t₁)

When the logical operator is AND, that is, q = t₁ ∧ t₂ ∧ … ∧ tp: R(d, q) = min{w(d, t₁), w(d, t₂), …, w(d, tp)}

When the logical operator is OR, that is, q = t₁ ∨ t₂ ∨ … ∨ tp: R(d, q) = max{w(d, t₁), w(d, t₂), …, w(d, tp)}

Here, the value of w(d, t) is as follows.

[0041] If the document IDs 52a in the record of the index file 10c retrieved with term t as the key include document d, the corresponding weight 52b is taken as the value of w(d, t). If they do not include document d, the value of w(d, t) is set to 0.

[0042] The retrieved document set is determined from the relevance R(d, q) obtained in this way by setting in advance an upper limit M on the size of the retrieved document set and a lower limit θ₁ on the relevance. That is, at most M documents are selected in descending order of R(d, q) from among those satisfying R(d, q) > θ₁. The document IDs of the selected documents d₁, d₂, …, dm are output as the retrieved document set data 10f.
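The relevance calculation and the selection of at most M documents above θ₁ can be sketched together. This is an illustration under assumptions: the index file is a dictionary from term to (document ID, weight) pairs, the function names are hypothetical, and ties in relevance are broken by document ID, which the text does not specify.

```python
def relevance(index, doc_id, terms, operator):
    """R(d, q): the single weight, or the min (AND) / max (OR) of the term weights."""
    # w(d, t) is the indexed weight, or 0 if the document is not listed for t.
    weights = [dict(index.get(t, [])).get(doc_id, 0.0) for t in terms]
    if operator == "AND":
        return min(weights)
    if operator == "OR":
        return max(weights)
    return weights[0]  # blank operator: single-term query


def search(index, terms, operator, max_docs, theta1):
    """Return up to max_docs document IDs with R(d, q) > theta1, best first."""
    candidates = {doc_id for t in terms for doc_id, _ in index.get(t, [])}
    scored = [(relevance(index, d, terms, operator), d) for d in candidates]
    kept = [(r, d) for r, d in scored if r > theta1]
    kept.sort(key=lambda rd: (-rd[0], rd[1]))  # descending relevance, then ID
    return [d for _, d in kept[:max_docs]]


index = {"a": [(1, 0.9), (2, 0.4)], "b": [(1, 0.3), (3, 0.8)]}
and_hits = search(index, ["a", "b"], "AND", max_docs=10, theta1=0.1)
or_hits = search(index, ["a", "b"], "OR", max_docs=2, theta1=0.1)
```

Because a missing term contributes a weight of 0, an AND query effectively keeps only documents indexed under every query term whenever θ₁ is non-negative.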

[0043] The related term selection unit 2a reads the search query data 10e, looks up the records of the term correlation file 10d using the terms 82 in FIG. 8 as keys, calculates the degree of correlation with the search query for each term contained in the correlated term information 62 in FIG. 6, and outputs the terms with a high degree of correlation to the related term set data 10g.

[0044] The degree of correlation A(t, q) between a term t and a search query q is calculated as follows according to the value of the logical operator 81 in FIG. 8.

When the logical operator is blank, that is, q = t₁: A(t, q) = α(t, t₁)

When the logical operator is AND, that is, q = t₁ ∧ t₂ ∧ … ∧ tp: A(t, q) = min{α(t, t₁), α(t, t₂), …, α(t, tp)}

When the logical operator is OR, that is, q = t₁ ∨ t₂ ∨ … ∨ tp: A(t, q) = max{α(t, t₁), α(t, t₂), …, α(t, tp)}

[0045] Here, the value of α(t, ti) is as follows. If term t is included among the terms 62a in the correlated term information 62 of the record of the term correlation file 10d retrieved with term ti as the key, the corresponding degree of correlation 62b is taken as the value of α(t, ti). If term t is not included, the value of α(t, ti) is set to 0.

[0046] The related term set is determined from the degree of correlation A(t, q) by setting in advance an upper limit K on the size of the related term set and a lower limit θ₂ on the degree of correlation. That is, at most K terms are selected in descending order of A(t, q) from among those satisfying A(t, q) > θ₂. The selected terms t₁, t₂, …, tk are output as the related term set data 10g.
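Related term selection follows the same min/max pattern as document retrieval, applied to the term correlation file. The sketch below assumes that file is a dictionary from term to (correlated term, degree of correlation) pairs; the function names and the alphabetical tie-break are assumptions for illustration.

```python
def query_correlation(corr_file, term, query_terms, operator):
    """A(t, q): the single alpha, or the min (AND) / max (OR) over the query terms."""
    # alpha(t, ti) is the stored correlation, or 0 if t is not listed under ti.
    alphas = [dict(corr_file.get(ti, [])).get(term, 0.0) for ti in query_terms]
    if operator == "AND":
        return min(alphas)
    if operator == "OR":
        return max(alphas)
    return alphas[0]  # blank operator: single-term query


def select_related_terms(corr_file, query_terms, operator, max_terms, theta2):
    """Return up to max_terms terms with A(t, q) > theta2, most correlated first."""
    candidates = {t for ti in query_terms for t, _ in corr_file.get(ti, [])}
    scored = [(query_correlation(corr_file, t, query_terms, operator), t)
              for t in candidates]
    kept = [(a, t) for a, t in scored if a > theta2]
    kept.sort(key=lambda at: (-at[0], at[1]))  # descending correlation, then term
    return [t for _, t in kept[:max_terms]]


corr_file = {
    "car": [("engine", 2.0), ("road", 1.0)],
    "bike": [("road", 1.5), ("helmet", 0.5)],
}
or_terms = select_related_terms(corr_file, ["car", "bike"], "OR", 5, 0.6)
and_terms = select_related_terms(corr_file, ["car", "bike"], "AND", 5, 0.6)
```

With OR the related terms are drawn from either query term's neighbors, whereas with AND only terms correlated with every query term survive the threshold.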

【００４７】The term cluster extraction unit 2b reads the related term set data 10g, and then reads the degrees of correlation between the terms belonging to the related term set from the term correlation file 10d. Based on these, it clusters the related term set and outputs the resulting clusters to the term cluster data 10h. In this example, the group average method, one of the agglomerative clustering methods, is used as the clustering method. The number of clusters to be generated is a predetermined value L.

【００４８】Let the related term set to be clustered be {t₁, t₂, ..., tₖ}. L clusters are then generated by the following procedure. First, as the initial state, k clusters C₁ = {t₁}, C₂ = {t₂}, ..., Cₖ = {tₖ} are created. Then the following step is repeated until the total number of clusters equals L: the pair of clusters with the highest degree of correlation is selected and merged into one cluster.

【００４９】Here, the degree of correlation αc(C, D) between clusters C and D is defined by the following equation (Equation 2), where α(t, u) is the degree of correlation between two terms t and u.

【数２】 (Equation 2)
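The merge procedure of paragraph 0048 can be sketched in Python as follows. Equation 2 does not survive in this text, so `cluster_degree` assumes the usual group average definition, the mean of α(t, u) over all t in C and u in D, which is what the method named in paragraph 0047 normally denotes. The pair-keyed dictionary, the symmetric lookup, and all function names are likewise illustrative assumptions.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

PairDegrees = Dict[Tuple[str, str], float]


def term_degree(t: str, u: str, degrees: PairDegrees) -> float:
    """alpha(t, u), treated as symmetric; unlisted pairs count as 0 (assumption)."""
    return degrees.get((t, u), degrees.get((u, t), 0.0))


def cluster_degree(c: Set[str], d: Set[str], degrees: PairDegrees) -> float:
    """Assumed form of Equation 2: mean of alpha(t, u) over t in C and u in D."""
    total = sum(term_degree(t, u, degrees) for t in c for u in d)
    return total / (len(c) * len(d))


def group_average_clusters(terms: List[str], degrees: PairDegrees,
                           num_clusters: int) -> List[Set[str]]:
    """Merge the most correlated pair of clusters until num_clusters (L) remain."""
    if num_clusters < 1:
        raise ValueError("num_clusters must be at least 1")
    clusters = [{t} for t in terms]  # initial state: one cluster per term
    while len(clusters) > num_clusters:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_degree(clusters[p[0]], clusters[p[1]], degrees),
        )
        clusters[i] |= clusters[j]  # i < j, so deleting j keeps index i valid
        del clusters[j]
    return clusters
```

This version recomputes every cluster average on each pass and is written for clarity rather than speed.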

【００５０】For each document whose document ID is listed in the search document set data 10f, the document cluster generation unit 2d reads the term vector from the document feature file 10b and calculates its similarity to each cluster in the term cluster data 10h. By assigning each document to the term cluster with the highest similarity, it generates document clusters seeded by the term clusters. The result is output to the document cluster data 10i.

【００５１】The similarity between the term vector of a document and a term cluster is calculated as follows. First, the term vector of document d read from the document feature file 10b (the term vector 42 in FIG. 4) is converted, for the similarity calculation, into a vector representation d defined as follows. The related term set is assumed to be {t₁, t₂, ..., tₖ}. If the term vector 42 of document d includes the term tᵢ, its weight is used as the value of the i-th element of vector d. If the term vector 42 of document d does not include the term tᵢ, the value of the i-th element of vector d is set to 0.

【００５２】Next, a vector representation c of the term cluster C is created as follows. Here too, the related term set is assumed to be {t₁, t₂, ..., tₖ}. If the term cluster C includes the term tᵢ, the value of the i-th element of vector c is set to 1. If the term cluster C does not include the term tᵢ, the value of the i-th element of vector c is set to 0. Using the two vectors d and c created in this way, the similarity ρ(d, C) between document d and term cluster C is calculated by the following equation, where d · c is the inner product of the two vectors and |d| and |c| are their lengths.
ρ(d, C) = (d · c) / (|d| × |c|)
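The vector construction and the similarity ρ(d, C) can be sketched in Python as follows. The dictionary form of the term vector 42, the function names, and the choice to return 0 when either vector is all zeros are assumptions of this sketch; the text does not address the zero-vector case or ties between clusters.

```python
import math
from typing import Dict, List, Set


def cluster_similarity(term_weights: Dict[str, float], cluster: Set[str],
                       related_terms: List[str]) -> float:
    """rho(d, C) = (d . c) / (|d| * |c|), with both vectors indexed by related_terms."""
    d = [term_weights.get(t, 0.0) for t in related_terms]      # document vector d
    c = [1.0 if t in cluster else 0.0 for t in related_terms]  # cluster vector c
    dot = sum(x * y for x, y in zip(d, c))
    norm = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in c))
    # Zero vectors would divide by zero; returning 0 here is an assumption.
    return dot / norm if norm > 0 else 0.0


def assign_to_cluster(term_weights: Dict[str, float], clusters: List[Set[str]],
                      related_terms: List[str]) -> int:
    """Index of the most similar term cluster; clusters must be non-empty.
    On ties the first cluster wins (an assumption)."""
    sims = [cluster_similarity(term_weights, c, related_terms) for c in clusters]
    return max(range(len(clusters)), key=sims.__getitem__)
```

Terms of the document that fall outside the related term set do not enter vector d, so they affect neither the inner product nor |d|.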

【００５３】The document cluster display control unit 2e outputs a digest of each document cluster in the document cluster data 10i to the display device 25 in FIG. 2. A digest consists of a list of terms characterizing the document cluster and a document representative of the document cluster. For the former, the record of the term cluster data 10h having the same cluster number as the document cluster is output. For the latter, the document adopted is the one whose document ID 122 corresponds to the maximum value of the similarity 123 in FIG. 12 among the records of that document cluster in the document cluster data 10i.

【００５４】The text data of that document (the text data 32 shown in FIG. 3) is read from the document file 10a, and its first part (a predetermined number of characters) is output. FIG. 14 shows an example of a document cluster digest output in this way, as displayed by the document cluster display control unit in FIG. 7. The example shows the search and classification results when "semiconductor" is specified as the term in the search query data. For cluster 1, for example, the keyword column shows the terms of the term cluster ("Japan-US", "trade", ...), and documents 1 and 2 are displayed as the document cluster, each with part of its content.
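Building one digest entry can be sketched in Python as follows. The list of (document ID, similarity) pairs standing in for the records of the document cluster data 10i, the dictionary standing in for the document file 10a, and the returned dictionary layout are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple


def build_digest(cluster_terms: List[str],
                 members: List[Tuple[str, float]],
                 texts: Dict[str, str],
                 excerpt_chars: int) -> Dict[str, object]:
    """Combine the seeding term cluster with an excerpt of the most similar document.

    members holds (document ID, similarity) pairs of one document cluster.
    """
    if not members:
        # An empty document cluster has no representative (handling is an assumption).
        return {"terms": cluster_terms, "document_id": None, "excerpt": ""}
    doc_id, _ = max(members, key=lambda member: member[1])
    return {
        "terms": cluster_terms,
        "document_id": doc_id,
        "excerpt": texts[doc_id][:excerpt_chars],  # first part of the text data
    }
```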

【００５５】Returning to the description of FIG. 7, the cluster number reading unit 2g reads the cluster numbers input by the user from the input device 24 in FIG. 2 and sets them in the cluster number data 10j. The term set/document set reduction unit 2h reads, from the term cluster data 10h, the records whose cluster number matches a cluster number in the cluster number data 10j. It then obtains the union of the terms listed in those records and outputs it to the related term set data 10g. Similarly, it reads from the document cluster data 10i the records whose cluster number matches a cluster number in the cluster number data 10j, obtains the union of the document IDs listed in those records, and outputs it to the search document set data 10f.
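The reduction step is a pair of set unions over the selected cluster numbers. In the sketch below, the dictionaries keyed by cluster number stand in for the term cluster data 10h and the document cluster data 10i; these structures and the function name are assumptions.

```python
from typing import Dict, Iterable, Set, Tuple


def reduce_sets(selected: Iterable[int],
                term_clusters: Dict[int, Set[str]],
                document_clusters: Dict[int, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Return the new related term set and the new search document set."""
    terms: Set[str] = set()
    documents: Set[str] = set()
    for number in selected:
        # Unknown cluster numbers contribute nothing (an assumption).
        terms |= term_clusters.get(number, set())
        documents |= document_clusters.get(number, set())
    return terms, documents
```

The two returned sets replace the related term set data 10g and the search document set data 10f for the next round of clustering.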

【００５６】The processing of each unit in the search subsystem has been described above. The search subsystem as a whole operates as follows. When the user inputs a search query from the input device 24 in FIG. 2, the search query reading unit 2f, the document search unit 2c, the related term selection unit 2a, the term cluster extraction unit 2b, the document cluster generation unit 2d, and the document cluster display control unit 2e operate in sequence, and the result of classifying the documents highly relevant to the search query into a plurality of clusters is displayed on the display device 25 in FIG. 2.

【００５７】When the user selects, from the displayed clusters, a cluster that appears strongly related to the information need and inputs its number from the input device 24, the cluster number reading unit 2g, the term set/document set reduction unit 2h, the term cluster extraction unit 2b, the document cluster generation unit 2d, and the document cluster display control unit 2e operate in sequence to classify the selected cluster into smaller clusters and display the result on the display device 25. By repeating this process of selecting and subdividing clusters as many times as necessary, the user can find the relevant documents.

【００５８】The computational cost of the clustering is as follows. The processing units introduced in this example to cluster the retrieved documents are the related term selection unit 2a, the term cluster extraction unit 2b, and the document cluster generation unit 2d. The cost of the related term selection unit 2a is of order t, where t is the number of terms included in the search query. The cost of the term cluster extraction unit 2b is of order k², where k is the number of terms in the related term set. The cost of the document cluster generation unit 2d is of order m × L, where m is the number of documents in the search document set and L is the number of term clusters.

【００５９】Here, the values of t and L are very small compared with m. In addition, k may be set to a fixed value, which can be made sufficiently smaller than the number of documents m that the search document set must be allowed to contain. Moreover, the coefficient multiplying k² in the cost of the term cluster extraction unit 2b is small, because the degrees of correlation between terms have already been calculated and stored in the term correlation file 10d. It is therefore clear that the computational cost of document clustering in this example is small compared with that of clustering the search document set directly (a cost of order m²).
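The comparison can be made concrete with a short worked calculation. The numerical values below (t = 2, k = 100, L = 10, m = 10,000) are illustrative assumptions only and do not come from this specification; constant factors are ignored.

```latex
\begin{align*}
T_{\text{this example}} &= O(t) + O(k^{2}) + O(mL), &
T_{\text{direct}} &= O(m^{2}) \\
t + k^{2} + mL &= 2 + 100^{2} + 10\,000 \times 10 = 110\,002, &
m^{2} &= 10\,000^{2} = 100\,000\,000
\end{align*}
```

With k and L held fixed, the first expression grows linearly in m, while direct clustering grows quadratically.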

【００６０】A further advantage of this example is the following. At the point when the document clusters are generated, the list of terms characterizing each document cluster has already been obtained. There is therefore no need for a separate process to create digests from the document clusters.

【００６１】FIGS. 15 and 16 are flowcharts showing one embodiment of the information search and classification method of the present invention. First, in FIG. 15, the storage subsystem performs the following processing. The document feature extraction unit 1b in FIG. 1 extracts the terms appearing in each document in the document file 10a and their frequencies of appearance (step 1501), calculates the weight of each term in each document based on them (step 1502), and outputs to the document feature file 10b a term vector whose elements are the term and weight pairs for which the weight is equal to or greater than a predetermined value (step 1503). Then the term correlation extraction unit 1d extracts pairs of co-occurring terms and their co-occurrence frequencies from the documents in the document file (step 1504), calculates the correlation between terms based on the co-occurrence frequencies, and outputs the result to the term correlation file 10d (step 1505).
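Steps 1501 to 1505 can be sketched in Python as follows. This passage does not give the weighting or correlation formulas, so the sketch substitutes common stand-ins and labels them as such: relative term frequency as the weight, and the Dice coefficient over document-level co-occurrence as the degree of correlation. The token-list input and all function names are also assumptions.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def term_vector(tokens: List[str], min_weight: float) -> Dict[str, float]:
    """Steps 1501-1503: count terms, weight them, keep weights >= min_weight.

    Weight = relative frequency is an assumption of this sketch.
    """
    counts = Counter(tokens)
    total = len(tokens)
    weights = {term: n / total for term, n in counts.items()}
    return {term: w for term, w in weights.items() if w >= min_weight}


def term_correlations(documents: List[List[str]]) -> Dict[Tuple[str, str], float]:
    """Steps 1504-1505: count co-occurring term pairs and score each pair.

    Co-occurrence within the same document and the Dice coefficient
    2 * f(t, u) / (f(t) + f(u)) are both assumptions of this sketch.
    """
    doc_freq: Counter = Counter()
    pair_freq: Counter = Counter()
    for tokens in documents:
        terms = sorted(set(tokens))
        doc_freq.update(terms)
        pair_freq.update(combinations(terms, 2))  # pairs in alphabetical order
    return {(t, u): 2 * n / (doc_freq[t] + doc_freq[u])
            for (t, u), n in pair_freq.items()}
```

Both functions run once, ahead of any query, which is why the later term clustering can look up degrees of correlation instead of computing them.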

【００６２】After the data has been stored in this way, the search subsystem performs the following processing in FIG. 16. When the user inputs a search query (step 1506), the search query reading unit 2f in FIG. 7 reads it (step 1507). Next, the document search unit 2c refers to the index file 10c, calculates the degree of relevance between the search query and each document, selects the documents with a high degree of relevance, and outputs the selected document IDs as the search document set data 10f (step 1508). The related term selection unit 2a refers to the term correlation file 10d, selects the terms strongly correlated with the search query, and outputs them as the related term set data 10g (step 1509).

【００６３】Further, the term cluster extraction unit 2b refers to the term correlation file 10d, clusters the related term set data 10g, extracts a plurality of term clusters each consisting of mutually strongly correlated terms, and outputs them as the term cluster data 10h (step 1510). Then, for each document in the search document set data 10f, the document cluster generation unit 2d reads the term vector from the document feature file 10b (step 1511), calculates the similarity between the term vector and each term cluster (step 1512), and assigns the document to the term cluster with the highest similarity, thereby generating document clusters seeded by the term clusters (step 1513).

【００６４】Then, for each of the generated document clusters, the document cluster display control unit 2e displays the term cluster that seeded it and the document with the highest similarity to that term cluster (step 1514). Looking at this display, the user selects one or several clusters that appear strongly related to the information need (step 1515).

【００６５】The cluster number reading unit 2g then reads the cluster numbers selected by the user (step 1516). The term set/document set reduction unit 2h obtains the union of the term clusters that seeded the selected document clusters and replaces the related term set data 10g with it (step 1517), and also obtains the union of the selected document clusters and replaces the search document set data 10f with it (step 1518). Processing then returns to step 1510, and clustering is repeated using the replaced related term set data and search document set data.

【００６６】As described above with reference to FIGS. 1 to 16, the information search and classification system and method of this embodiment are divided into the storage subsystem 1 and the search subsystem 2. As shown in FIG. 1, the storage subsystem 1 includes the document feature extraction unit 1b, which extracts a term vector characterizing the content of each document in the document file 10a, and the term correlation extraction unit 1d, which analyzes the documents in the document file 10a, extracts knowledge about term correlations, and stores it in the term correlation file 10d.

【００６７】As shown in FIG. 7, the search subsystem 2 comprises: the search query reading unit 2f, which reads the user's search query; the document search unit 2c, which retrieves documents strongly related to the read search query; the related term selection unit 2a, which refers to the term correlation file 10d and selects terms strongly correlated with the search query as related terms; the term cluster extraction unit 2b, which refers to the term correlation file and extracts a plurality of term clusters from the set of related terms; the document cluster generation unit 2d, which calculates the similarity between the term vector of each retrieved document and each extracted term cluster and generates document clusters by assigning each document to the term cluster with the highest similarity; the document cluster display control unit 2e, which displays, as the digest of each generated document cluster, the term cluster that seeded it and (part of) the document with the highest similarity to that term cluster; the cluster number reading unit 2g, which lets the user select one or several of the document clusters; and the term set/document set reduction unit 2h, which replaces the set of related terms with the union of the term clusters that seeded the selected document clusters and replaces the set of retrieved documents with the union of the selected document clusters.

【００６８】With this configuration, when the user inputs a search query, the information search and classification system of this example retrieves the documents highly relevant to the search query, extracts clusters consisting of mutually similar documents, and displays a digest of each cluster. When the user selects, from the displayed clusters, a cluster that appears strongly related to the information need, processing returns to the clustering stage, smaller clusters are extracted from the selected cluster, and their digests are displayed.

【００６９】In this way, by combining term-based search with clustering, the user starts from a search query that comes to mind and repeats the cycle of clustering and cluster selection, narrowing down step by step the clusters that contain relevant documents, and can thus reach the relevant documents efficiently. In addition, fast clustering that is little affected by the number of retrieved documents is possible, so interactivity is not impaired as it is in the prior art. The present invention is not limited to the embodiment described with reference to FIGS. 1 to 16, and can be modified in various ways without departing from its gist.

【００７０】[0070]

[Effects of the Invention] According to the present invention, the amount of computation required for clustering can be made sufficiently small, information can be searched and classified at high speed, and the user can find the desired information efficiently without the stress of waiting during interactive operation.

[Brief description of the drawings]

FIG. 1 is a block diagram showing one embodiment of the configuration of the information search and classification system according to the present invention.

FIG. 2 is a block diagram showing a configuration example of a computer constituting the information search and classification system in FIG. 1.

FIG. 3 is an explanatory diagram showing an example of the record configuration of the document file in FIG. 1.

FIG. 4 is an explanatory diagram showing an example of the record configuration of the document feature file in FIG. 1.

FIG. 5 is an explanatory diagram showing an example of the record configuration of the index file in FIG. 1.

FIG. 6 is an explanatory diagram showing an example of the record configuration of the term correlation file in FIG. 1.

FIG. 7 is a block diagram showing a detailed configuration example of the search subsystem in FIG. 1.

FIG. 8 is an explanatory diagram showing a configuration example of the search query data in FIG. 7.

FIG. 9 is an explanatory diagram showing a configuration example of the search document set data in FIG. 7.

FIG. 10 is an explanatory diagram showing a configuration example of the related term set data in FIG. 7.

FIG. 11 is an explanatory diagram showing a configuration example of the term cluster data in FIG. 7.

FIG. 12 is an explanatory diagram showing a configuration example of the document cluster data in FIG. 7.

FIG. 13 is an explanatory diagram showing a configuration example of the cluster number data in FIG. 7.

FIG. 14 is an explanatory diagram showing a display example of a document cluster digest displayed by the document cluster display control unit in FIG. 7.

FIG. 15 is the storage subsystem portion of a flowchart showing one embodiment of the information search and classification method of the present invention.

FIG. 16 is the search subsystem portion of a flowchart showing one embodiment of the information search and classification method of the present invention.

[Explanation of symbols]

1: storage subsystem, 1a: document registration unit, 1b: document feature extraction unit, 1c: index file generation unit, 1d: term correlation extraction unit, 2: search subsystem, 2a: related term selection unit, 2b: term cluster extraction unit, 2c: document search unit, 2d: document cluster generation unit, 2e: document cluster display control unit, 2f: search query reading unit, 2g: cluster number reading unit, 2h: term set/document set reduction unit, 10a: document file, 10b: document feature file, 10c: index file, 10d: term correlation file, 10e: search query data, 10f: search document set data, 10g: related term set data, 10h: term cluster data, 10i: document cluster data, 10j: cluster number data, 21: processing device, 22: storage device, 23: document input device, 24: input device, 25: display device, 31: document ID, 32: text data, 41: document ID, 42: term vector, 42a: term, 42b: weight, 51: term, 52: term occurrence document information, 52a: document ID, 52b: weight, 61: term, 62: correlation term information, 62a: term, 62b: degree of correlation, 81: logical operator, 82: term, 91: document ID, 101: term, 111: cluster number, 112: term, 121: cluster number, 122: document ID, 123: similarity, 131: cluster number.

Claims

[Claims]

1. An information search and classification method for an information search and classification apparatus that searches for and classifies, from a set of documents stored in a document file, documents strongly related to an input search query based on their relationship to terms included in the search query, the method comprising: a first step of extracting and registering in advance, from the set of documents stored in the document file, information on term correlations; and a second step of generating document clusters, using the information on term correlations, from the set of documents retrieved as being strongly related to the search query, wherein a search and classification result is output for each of the document clusters.

2. The information search and classification method according to claim 1, wherein the generation of document clusters in the second step is repeated for the documents included in a document cluster selected by a user from the output document clusters.

3. An information search and classification method for an information search and classification system that searches for and classifies, from a set of documents stored in a document file, documents strongly related to an input search query based on their relationship to terms included in the search query, the method comprising: a first step of extracting and registering in advance, from the set of documents stored in the document file, information on term correlations; a second step of selecting, using the information on term correlations, terms related to the terms included in the search query, and generating term clusters from the set of selected terms; and a third step of generating document clusters, seeded by the term clusters, from the set of documents retrieved as being strongly related to the search query, wherein a search and classification result is output for each document cluster.

4. The information search and classification method according to claim 3, wherein the second step selects the terms related to the terms included in the search query based on the information on term correlations extracted in the first step, and generates the term clusters from the set of selected terms based on the information on term correlations.

5. The information search and classification method according to claim 3, wherein the third step determines the similarity between a set of terms associated in advance with each retrieved document and each term cluster, and generates the document clusters by assigning each document to the term cluster with the highest similarity.

6. The information search and classification method according to claim 3, wherein, for the documents included in a document cluster selected by a user from the output document clusters and for the term cluster that seeded the document cluster selected by the user, the generation of term clusters using the information on term correlations in the second step and the generation of document clusters seeded by those term clusters in the third step are repeated.

7. The information search and classification method according to claim 3, wherein the term cluster that seeded each document cluster is output as one item of the search and classification result for that document cluster.

8. The information search and classification method according to claim 1, wherein the first step obtains pairs of terms that co-occur within a range of predetermined size in the documents, together with their co-occurrence frequencies, calculates the degree of correlation between the terms based on the co-occurrence frequencies, and extracts the information on term correlations by associating the pairs of terms whose degree of correlation exceeds a predetermined value, together with that degree of correlation.

9. The information search and classification method according to claim 1, wherein a part of a document included in each document cluster is output as one item of the search and classification result for that document cluster.

10. An information search and classification system that searches for and classifies, from a set of documents stored in a document file, documents strongly related to an input search query based on their relationship to terms included in the search query, the system comprising: registration means for extracting and registering in advance, from the set of documents stored in the document file, information on term correlations; term cluster generation means for selecting, from the registration means, terms related to the terms included in the search query and generating a plurality of term clusters each composed of mutually strongly correlated terms; document cluster generation means for retrieving documents strongly related to the search query from the document file, assigning each to the term cluster with the highest similarity, and thereby generating document clusters seeded by the term clusters; and presentation means for presenting to a user, for each document cluster, the term cluster that seeded that document cluster and a part of a document.

11. The information search and classification system according to claim 10, wherein the registration means includes means for obtaining pairs of terms that co-occur within a range of predetermined size in the documents, together with their co-occurrence frequencies, calculating the degree of correlation between the terms based on the co-occurrence frequencies, and extracting the information on term correlations by associating the pairs of terms whose degree of correlation exceeds a predetermined value, together with that degree of correlation.

12. The information search and classification system according to claim 10, further comprising means for narrowing the terms used by the term cluster generation means to generate the term clusters, and the documents used by the document cluster generation means to generate the document clusters, to the terms and documents of the cluster selected by the user from the document clusters presented by the presentation means, wherein the document clusters are narrowed down by repeating the presentation by the presentation means and the selection of a document cluster by the user.