JPH11250102A

JPH11250102A - Information retrieving method and its device

Info

Publication number: JPH11250102A
Application number: JP10069271A
Authority: JP
Inventors: Keiko Aoki; 圭子青木; Kazunori Matsumoto; 一則松本; Kazuo Hashimoto; 和夫橋本
Original assignee: KDD Corp
Current assignee: KDDI Corp
Priority date: 1998-03-05
Filing date: 1998-03-05
Publication date: 1999-09-17

Abstract

PROBLEM TO BE SOLVED: To increase speed without degrading classification accuracy by stopping the calculation of the evaluation function partway when calculating evaluation values in clustering, thereby reducing the amount of evaluation value calculation. SOLUTION: U(Ci) is obtained for all clusters in the input cluster set S = (C1, C2, ..., CN) (S202, S203). The evaluation values E(Ci, Cj) for all cluster combinations (Ci, Cj) are computed partway (S204). The calculation of E(Ci, Cj) is then continued for the top t combinations (t is a positive integer) (S205). The combination of clusters that maximizes E(Ci, Cj) is merged as the most similar pair, and a cluster Ck with clusters Ci and Cj as child nodes is created (S206). Clustering then proceeds toward the upper layers by setting S = S - Ci - Cj + Ck (S207).

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of the Invention] The present invention relates to an information retrieval method and apparatus, and more particularly to an information retrieval method and apparatus that speed up the calculation of evaluation values in clustering based on Bayesian clustering.

[0002]

[Description of the Related Art] FIG. 6 is a block diagram of a conventional information retrieval apparatus. An information retrieval apparatus that searches the document information held by a plurality of computers 63 connected to the Internet 61 is usually positioned as an information search server 62. Also connected to the Internet 61 are a very large number of computers 63 that hold pages, and clients 64 that wish to search. The information search server 62 manages page information, namely the URLs of the pages held by the computers 63, and provides the URLs of pages that match the information specified by a client 64 as search results.

[0003] The information search server 62 has a content database 621, a cluster database 622, and a control unit 623. The content database 621 stores a plurality of pieces of page information, and the cluster database 622 stores the node information used to cluster the page information.

[0004] The control unit 623 has leaf node information selecting means 623a, partial cluster generating means 623b, recursive clustering means 623c, and page updating/searching means 623d. The leaf node information selecting means 623a selects a predetermined number of optimal pieces of page information from the plurality of pieces of page information. The partial cluster generating means 623b generates a cluster by assigning the remaining, unselected page information to similar leaf nodes of the cluster. The recursive clustering means 623c instructs that the leaf node information selecting means 623a and the partial cluster generating means 623b be applied again, proceeding toward the leaf nodes of the generated cluster. The page updating/searching means 623d adds page information to the generated cluster, updates it, and searches the cluster for page information.

[0005] FIG. 7 is a flowchart of the recursive clustering function that generates a cluster in the conventional information retrieval apparatus. Although not illustrated, it resembles the recursive function commonly used to traverse a binary tree data structure. Its input (step 701) is a pointer to a node that represents a set of pages. When a cluster is built, the root node to which all pages are assigned is given as the input.

[0006] First, the number of pages assigned to the input node is checked (step 702). If this number is equal to or greater than a predetermined maximum, max, a cluster located below the node is generated. If the number of pages is less than max, the pages are clustered exhaustively with higher similarity accuracy (step 708).

[0007] If the number of pages assigned to the node is max or more, the cluster generation function is called (step 702). This function clusters the pages located below the node whose pointer is given as input. Its output is a pointer to the root node of the generated partial cluster.

[0008] Next, the function calls itself recursively for each leaf node of the generated cluster (step 704) to carry the clustering forward. For a given leaf node, it first determines whether any pages are assigned to that leaf node (step 705). If so, it calls itself recursively (step 706) and continues clustering toward the lower layers of the cluster. The root node of the cluster returned by the recursive call is then merged in place of the leaf node (step 707).

[0009] FIG. 8 is a flowchart of the cluster generation function. The function is divided into two processing stages: a leaf node selection stage and a partial cluster generation stage. The leaf node selection stage selects, from the plurality of pages, a predetermined number of optimal pages that give the minimum code length when clustered. The partial cluster generation stage takes the selected pages as leaf nodes and completes the cluster by assigning the remaining, unselected pages to similar leaf nodes.

[0010] First, from the pages assigned to the input node pointer (step 801), a page set P[t] of max pages is selected (step 803). The larger max is, the larger the unit classified in one pass. Here t is a counter that increases by one each time the series of steps is repeated. Given the flow of the recursive clustering stage, the node passed to the cluster generation function always has at least max pages assigned to it.

[0011] Next, the selected page set P[t] is clustered with a known algorithm (step 804). Because similarity is judged exhaustively only among the max pages, the amount of computation does not grow significantly.

[0012] Then, for the cluster generated from the page set P[t], the remaining, unselected pages are assigned to similar leaf nodes of that cluster (step 805).

[0013] Next, the code length L[t] of the generated cluster is obtained (step 806). To optimize the set of information, the selection is made according to the MDL (Minimum Description Length) criterion so that the code length of the classification result is minimized. The code length L is obtained as the sum of L_1, the amount of node information required for the cluster, and L_2, the code length required for classification as determined by the number of pages assigned to each leaf node.

[0014] The binary tree itself is encoded by visiting the tree in preorder, outputting 1 when an internal node is visited and 0 when a leaf node is visited. With k (a positive integer) denoting the number of leaf nodes (= max), the amount of node information needed to describe the tree itself is L_1 = 2k - 1.

[0015] Let n_i be the number of pages assigned to leaf node i, and let p_i = n_i / Σ_j n_j be the probability that a page drawn from all pages belongs to leaf node i. The code length required for classification, determined by the number of pages assigned to each leaf node, is then L_2 = Σ_i n_i log p_i. The code length of the cluster is obtained as L = L_1 + L_2.
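The code length of paragraphs [0013] to [0015] can be sketched as follows. This is only an illustration under stated assumptions: the function name `code_length` is hypothetical, the logarithm is taken to base 2, and the sum is negated so that the length is non-negative (the text writes L_2 = Σ n_i log p_i without a sign).

```python
import math


def code_length(pages_per_leaf):
    """Return L = L_1 + L_2 for a cluster, given the page count of each leaf."""
    k = len(pages_per_leaf)      # number of leaf nodes (= max)
    l1 = 2 * k - 1               # symbols needed to describe the binary tree
    total = sum(pages_per_leaf)
    # Assumption: base-2 logarithm and a leading minus sign, so that L_2 is
    # a non-negative length. Empty leaves contribute nothing.
    l2 = -sum(n * math.log2(n / total) for n in pages_per_leaf if n > 0)
    return l1 + l2
```

For two leaves holding two pages each, `code_length([2, 2])` gives 3 for the tree plus 4 for the assignment, that is, 7.0.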

[0016] The cluster code length L obtained here is compared with the minimum code length L_min stored from earlier iterations. If the newly obtained code length L[t] is smaller than the stored minimum L_min, L[t] is stored as the new L_min (step 807).

[0017] Repeating this series of steps a predetermined number of times c (step 802) selects the page set P[t] that has the minimum code length. Because the page sets are chosen at random, a larger c makes it more likely that a near-optimal page set is selected.
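The selection loop of steps 802 to 807 can be sketched as below. It is a minimal illustration, not the patented implementation: `select_leaf_pages` and the callback `build_and_score` (which stands in for steps 804 to 806 and returns a code length) are hypothetical names, and the fixed random seed is only there to make the sketch reproducible.

```python
import random


def select_leaf_pages(pages, max_pages, c, build_and_score, seed=0):
    """Try c random page sets and keep the one with the smallest code length."""
    rng = random.Random(seed)
    best_set, best_len = None, float("inf")
    for _ in range(c):                                 # step 802
        candidate = rng.sample(pages, max_pages)       # step 803: P[t]
        length = build_and_score(candidate, pages)     # steps 804-806: L[t]
        if length < best_len:                          # step 807: update L_min
            best_set, best_len = candidate, length
    return best_set, best_len
```

Because each trial is independent, the cost grows linearly with c, which is the trade-off between generation time and quality that the paragraph describes.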

[0018] In the cluster generation stage, the page set P[t] selected in the leaf node selection stage is clustered according to similarity (step 808), and the remaining, unselected pages are then assigned to similar leaf nodes of the generated cluster (step 809). In this way the cluster with the minimum code length L_min is generated.

[0019] In a conventional information retrieval apparatus that uses the clustering method described above, sets of max or more pages are clustered quickly at somewhat reduced similarity accuracy, while sets of fewer than max pages are clustered exhaustively at higher similarity accuracy. Clusters can therefore be generated with a balance between generation time and similarity accuracy.

[0020] In such a conventional information retrieval apparatus, one approach to document similarity search is the statistical approach of ranking the documents to be searched by the probability that they are relevant to an input document. One such method applies Bayesian clustering to a document set using the occurrence probabilities of the words in the documents. This method is disclosed in Makoto IWAYAMA, Takenobu TOKUNAGA, "Cluster-Based Text Categorization: A Comparison of Category Search Strategies", Proc. of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 273-280, 1995, and was also proposed by the present inventors in "Evaluation of a Clustering Method for Large Document Collections", Information Processing Society of Japan, 55th National Convention (second half of 1997), 3-208, 1997.

[0021] The Bayesian clustering in these references regards a set of documents as a cluster and merges clusters one pair at a time. When clusters are merged, an evaluation function E(c_i, c_j) that takes the two candidate clusters c_i and c_j as arguments is used: the pair c_i, c_j that maximizes E(c_i, c_j) is found, and c_i and c_j are merged.

[0022] According to the program listing in the former reference, E(c_i, c_j) is calculated as follows.

[0023] E(c_i, c_j) = M(c_i, c_j) - U(c_i) - U(c_j)

[0024] In this calculation method U(c_i ∪ c_j) = M(c_i, c_j) holds, so U(c_i) is calculated as M(c_i, c_j). M(c_i, c_j) itself is obtained as follows.

[0025]
1: function M(c_i, c_j)
2:   out = 0.0;
3:   forall documents d ∈ cluster c_i ∪ c_j
4:     tmp = 0.0;
5:     forall words w ∈ d
6:       tmp = tmp + rate(w, d, c_i ∪ c_j);
7:     out = out + log(tmp);
8:   return out;
9: endfunc

[0026] Here, rate(w, d, c) = (relative frequency of w in d) × (relative frequency of w in c) / (relative frequency of w in all documents).
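The conventional calculation above can be written out as a short Python sketch. Several points are assumptions made for illustration: a document is represented as a `collections.Counter` of word counts, a cluster as a list of such documents, `corpus` is the word-count table of all documents, and U(c) is evaluated by running the same loop over the single cluster c, which is how this sketch reads the identity U(c_i ∪ c_j) = M(c_i, c_j). The helper names are hypothetical.

```python
import math
from collections import Counter


def rel_freq(word, counts):
    """Relative frequency of `word` in a word-count table."""
    return counts[word] / sum(counts.values())


def rate(w, d, cluster, corpus):
    """rate(w, d, c) as defined in paragraph [0026]."""
    return rel_freq(w, d) * rel_freq(w, cluster) / rel_freq(w, corpus)


def m_value(ci, cj, corpus):
    """M(c_i, c_j): one log term per document of the merged cluster."""
    docs = ci + cj
    cluster = sum(docs, Counter())          # word counts of c_i ∪ c_j
    out = 0.0
    for d in docs:
        tmp = sum(rate(w, d, cluster, corpus) for w in d)   # inner loop
        out += math.log(tmp)
    return out


def u_value(c, corpus):
    # Assumption: U of one cluster is M with nothing merged into it.
    return m_value(c, [], corpus)


def e_value(ci, cj, corpus):
    """E(c_i, c_j) = M(c_i, c_j) - U(c_i) - U(c_j)."""
    return m_value(ci, cj, corpus) - u_value(ci, corpus) - u_value(cj, corpus)
```

With three one-word documents, two containing "a" and one containing "b", merging the two identical documents scores higher than merging "a" with "b", which matches the intent that the pair maximizing E is the most similar one.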

[0027] FIG. 9 is a flowchart of this method.

[0028] The input (step 901) is the frequency tables of the documents contained in clusters c_i and c_j. First, for each cluster (c) of the input clusters c_i and c_j (steps 902, 903), all the words in each document d contained in the cluster are sorted by their relative frequency in all documents (steps 904 to 906). The resulting values are accumulated, the logarithm is taken and accumulated in turn, and E(c_i, c_j) is output (steps 907, 908).

[0029]

[Problems to be Solved by the Invention] In this conventional method, most of the time in the prior-art algorithm is spent in the inner loop of M(c_i, c_j) (lines 5 and 6). The value of rate has to be computed for every word of every document in the cluster, for every cluster pair under construction, so processing a large number of documents requires a great deal of computation time.

[0030] The present invention is intended to solve these problems. Its object is to provide an information retrieval method and apparatus that greatly reduce the amount of evaluation value calculation and that, by sorting words by relative frequency in all documents or a similar measure before evaluating the function instead of evaluating them in random order, also run faster without degrading classification accuracy.

[0031]

[Means for Solving the Problems] To solve the problems of the conventional example described above, the present invention provides an information retrieval apparatus in which a plurality of computers holding document information are connected to a network, the apparatus having a content database that stores index information for the plurality of pieces of document information and a control unit that searches and updates the document information using the content database. The control unit has: leaf node information selecting means for selecting, from the plurality of pieces of document information, a predetermined number of pieces of information to become leaf nodes of a cluster; partial cluster generating means for assigning the remaining, unselected information to similar leaf nodes; recursive clustering means for instructing that the leaf node information selecting means and the partial cluster generating means be repeated toward the leaf nodes of the generated cluster; and page updating/searching means for adding page information to the generated cluster, updating it, and searching the cluster for page information. The apparatus is characterized in that the control unit has evaluation value calculating means that stops the calculation of the evaluation function partway and performs the evaluation value calculation used in clustering. The evaluation value calculating means may be provided in the partial cluster generating means and/or the recursive clustering means. With the apparatus of the present invention configured in this way, an information retrieval apparatus that runs faster without degrading classification accuracy can be realized.

[0032] The invention is also characterized by an information retrieval method for clustering based on Bayesian clustering, in which a set of documents is regarded as a cluster and, as clusters are merged one pair at a time, an evaluation function taking two clusters as arguments is used to find the combination of clusters that maximizes the evaluation function and those clusters are merged, wherein the calculation of the evaluation function is stopped partway and the evaluation value calculation used in clustering is performed. After the calculation of the evaluation function has been stopped partway, it is continued for the combinations up to a predetermined number from the top. Furthermore, when the evaluation function is calculated, words are sorted by relative frequency in all documents, relative frequency in each document, and relative frequency in each cluster, in that order. As a result, the amount of evaluation value calculation in particular can be greatly reduced.

[0033] Therefore, the present invention can provide an information retrieval method and apparatus that greatly reduce the amount of evaluation value calculation and that, by sorting words by relative frequency in all documents or a similar measure before evaluating the function instead of evaluating them in random order, run faster without degrading classification accuracy.

[0034]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings. First, consider the conventional Bayesian clustering calculation. If the value of out grows little while lines 3 to 7 are executing, M(c_i, c_j) will be small even when computed to the end, and the cluster pair c_i, c_j is unlikely to maximize the evaluation function. The conventional example nevertheless has to complete the calculation.

[0035] The information retrieval method of the present invention therefore does not compute the evaluation functions M and E over every word in the cluster. Instead, it predicts the value of E once M has been computed partway, and suspends the calculation of E for cluster pairs whose estimated E is low. The value of E then needs to be computed only for cluster pairs whose estimated E is higher than a predetermined value. Accordingly, rather than computing E with all the words in a document, the method uses one of the following word sets:

[0036] ・the words sorted by relative frequency in the document, taking the specified fraction (r) from the most frequent downward; or

[0037] ・the words sorted by relative frequency in the cluster (c_i), taking the specified fraction (r) from the most frequent downward; or

[0038] ・the words sorted by relative frequency in the cluster (c_i ∪ c_j), taking the specified fraction (r) from the most frequent downward; or

[0039] ・the words sorted by relative frequency in all documents, taking the specified fraction (r) from the most frequent downward.

[0040] The combinations with high evaluation values are predicted from the chosen word set.

[0041] However, to preserve the classification accuracy of the evaluation function E, the evaluation value is computed over all words for any document in the cluster whose number of distinct words is at or below a specified size (s).

[0042] Next, an information retrieval apparatus according to an embodiment of the present invention is described. FIG. 1 is a block diagram showing its configuration. Components that are the same as in FIG. 6 carry the same reference numbers. The component that differs is the evaluation value calculating means 11. It computes M(c_i, c_j) partway and, using the intermediate results m(i, j, d) of M(c_i, c_j), obtains E(c_i, c_j) from M(c_i, c_j), U(c_i), and U(c_j). Keeping only the top t combinations, it then continues the calculation of M(c_i, c_j) and obtains E(c_i, c_j) from M(c_i, c_j), U(c_i), and U(c_j). The evaluation value calculating means 11 may be included in the partial cluster generating means 623b and/or the recursive clustering means 623c.

[0043] Next, the case in which the function that computes the evaluation value E until the fraction of words reaches r sorts the words by relative frequency in all documents is described with reference to the flowchart in FIG. 2.

[0044] The inputs (step 201) are the cluster set S and the parameters r, s, and t. First, U(c_i) is obtained for every cluster in the input cluster set S = (c_1, c_2, ..., c_N) (steps 202, 203). The evaluation value E(c_i, c_j) is computed partway for every combination (c_i, c_j) of clusters (step 204). For the top t combinations (t is a positive integer), the calculation of E(c_i, c_j) is then continued (step 205). The combination of clusters that maximizes E(c_i, c_j) is merged as the most similar pair, and a cluster c_k with c_i and c_j as child nodes is created (step 206). The cluster set is then updated to S = S - c_i - c_j + c_k, and clustering proceeds toward the upper layers of the cluster (step 207).
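One pass through steps 204 to 207 can be sketched as follows. This is an illustrative outline only: `merge_step`, `partial_e`, and `full_e` are hypothetical names, the two scoring callbacks stand in for the partial and completed evaluations of E, and the merged node c_k is represented simply as a tuple of its two children.

```python
from itertools import combinations


def merge_step(clusters, partial_e, full_e, t):
    """Merge the best pair of clusters and return the updated cluster list."""
    pairs = list(combinations(range(len(clusters)), 2))
    # Step 204: cheap, partial evaluation of E for every pair.
    ranked = sorted(
        pairs,
        key=lambda p: partial_e(clusters[p[0]], clusters[p[1]]),
        reverse=True,
    )
    # Step 205: complete the evaluation only for the top t pairs.
    i, j = max(ranked[:t], key=lambda p: full_e(clusters[p[0]], clusters[p[1]]))
    merged = (clusters[i], clusters[j])     # step 206: c_k with two children
    # Step 207: S = S - c_i - c_j + c_k
    rest = [c for n, c in enumerate(clusters) if n not in (i, j)]
    return rest + [merged]
```

Calling `merge_step` repeatedly until one cluster remains would build the full tree; the saving comes from `full_e` being invoked t times per pass instead of once per pair.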

[0045] FIG. 3 shows the evaluation values obtained when words are sorted by relative frequency in all documents. The horizontal axis is the fraction of distinct words used, and the vertical axis is the value of the evaluation function E. The figure uses five example documents (d_1, d_2, d_3, d_4, d_5) and shows how E(d_1, d_2), E(d_1, d_3), E(d_1, d_4), and E(d_1, d_5) grow. As the figure shows, E increases roughly in proportion to the fraction of words. This indicates that the final value of E can be estimated accurately from a moderate fraction r.

[0046] The function that computes the evaluation function E until the fraction of words reaches r, with words sorted by relative frequency in all documents, is shown below.

[0047]
function newM(c_i, c_j)
  out = 0.0;
  forall documents d ∈ (c_i ∪ c_j) {
    if (number of distinct words N(d) of d > s)
      N_x = rN(d);
    else
      N_x = N(d);
    sort the words in d by relative frequency in all documents;
    tmp = 0.0;
    for (w = 1; w <= N_x; w++)
      tmp = tmp + rate(F(w), d, c_i ∪ c_j);
    m(i, j, d) = tmp;
    out = out + log(tmp);
  }
  return out;
endfunc

[0048] m(i, j, d) is used later. F(w) is the frequency of the w-th most frequent word.

[0049] Next, the case in which the function that computes M(c_i, c_j), used to obtain the evaluation value E(c_i, c_j) until the fraction of words reaches r, sorts the words by relative frequency in all documents is described with reference to the flowchart in FIG. 4.

[0050] The input (step 401) is the frequency tables of the documents contained in clusters c_i and c_j. For each document d contained in the cluster c_i ∪ c_j (steps 402, 403), N_x is set to rN(d) if the number of distinct words N(d) is greater than s, and to N(d) otherwise (steps 404 to 406). The words in document d are sorted by relative frequency in all documents (step 407), and rate is computed for the top N_x words of document d (steps 408, 409). The resulting values are accumulated, the intermediate result m(i, j, d) is saved (step 410), and the logarithm is taken and accumulated to output M(c_i, c_j) (steps 411, 412).

[0051] After that, the evaluation values are computed further for the specified top t combinations, taken in descending order of evaluation value, and the combination that is finally the maximum is found.

[0052] The function used to continue the calculation of the evaluation value is shown below.

[0053]
function cntM(c_i, c_j)
  out = 0.0;
  forall documents d ∈ (c_i ∪ c_j) {
    tmp = m(i, j, d);
    if (number of distinct words N(d) of d > s) {
      for (w = rN(d) + 1; w <= N(d); w++)
        tmp = tmp + rate(F(w), d, c_i ∪ c_j);
    }
    out = out + log(tmp);
  }
  return out;
endfunc
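The pair of functions newM and cntM can be sketched together in Python to show that resuming from the saved intermediate results reproduces the full value of M. The representation is an assumption: documents are `collections.Counter` objects, `docs` is the list of documents in c_i ∪ c_j, and `corpus` is the word-count table of all documents. Rounding rN(d) down with a floor of one word, and breaking frequency ties alphabetically, are also choices made here, not taken from the text.

```python
import math
from collections import Counter


def rel_freq(word, counts):
    """Relative frequency of `word` in a word-count table."""
    return counts[word] / sum(counts.values())


def ranked_words(d, corpus):
    """Words of d, most frequent in the whole corpus first (ties by word)."""
    return sorted(d, key=lambda w: (-rel_freq(w, corpus), w))


def rate_sum(words, d, cluster, corpus):
    """Sum of rate(w, d, c) over the given words."""
    return sum(rel_freq(w, d) * rel_freq(w, cluster) / rel_freq(w, corpus)
               for w in words)


def prefix_len(d, r, s):
    """N_x: every word for small documents, else a fraction r (at least 1)."""
    n = len(d)
    return max(1, int(r * n)) if n > s else n


def new_m(docs, corpus, r, s):
    """Partial M; also returns the intermediate results m(i, j, d)."""
    cluster = sum(docs, Counter())
    partial = [
        rate_sum(ranked_words(d, corpus)[:prefix_len(d, r, s)], d, cluster, corpus)
        for d in docs
    ]
    return sum(math.log(p) for p in partial), partial


def cnt_m(docs, corpus, r, s, partial):
    """Complete M by adding the words that new_m left out."""
    cluster = sum(docs, Counter())
    out = 0.0
    for d, tmp in zip(docs, partial):
        rest = ranked_words(d, corpus)[prefix_len(d, r, s):]
        out += math.log(tmp + rate_sum(rest, d, cluster, corpus))
    return out
```

Since every rate term is positive, the partial value never exceeds the completed one, so ranking pairs by the partial value is a pessimistic but cheap first filter.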

[0054] Next, the case in which the evaluation values are computed further for the specified top t combinations, taken in descending order of evaluation value, and the combination that is finally the maximum is found, is described with reference to the flowchart in FIG. 5.

[0055] The input (step 501) is the cluster pairs c_i, c_j that have the top t evaluation values E'(c_i, c_j). For each cluster (c) of the input clusters c_i and c_j (steps 502, 503, 504), and for each document d contained in the cluster, the intermediate result m(i, j, d) is used as it is if the number of distinct words N(d) is s or less. If N(d) is greater than s (step 505), rate is computed and accumulated for the remaining words F(N_x + 1), ..., F(N(d)) in the frequency table F (steps 506, 507). The logarithm of the result is taken and accumulated further, and M(c_i, c_j) is output (steps 508, 509).

[0056] The configuration of each embodiment described above is merely an example; the embodiments may be combined, and such combinations may be configured arbitrarily. The embodiments described above illustrate the present invention and do not limit it, and the present invention can be implemented in various other modified and altered forms. Therefore, the scope of the present invention is defined only by the claims and their equivalents.

【００５７】[0057]

[Effect of the Invention] As described in detail above, according to the present invention, the amount of calculation of the evaluation value can be greatly reduced. In addition, in the calculation of M(c_i,c_j) for obtaining the evaluation function, sorting by relative frequency in all documents or the like before calculating, rather than calculating in random order, makes it possible to increase speed without degrading classification accuracy.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of an information retrieval apparatus according to the present invention.

FIG. 2 is a flowchart showing sorting by relative frequency in all documents according to the present invention.

FIG. 3 is a characteristic diagram showing the relationship between the word count ratio and the evaluation value in the present invention.

FIG. 4 is a flowchart showing the processing performed for the top t combinations specified in descending order of evaluation value in the present invention.

FIG. 5 is a flowchart showing the processing for obtaining the combination that is finally the maximum in the present invention.

FIG. 6 is a block diagram showing the configuration of a conventional information retrieval apparatus.

FIG. 7 is a flowchart showing the processing of a conventional recursive clustering function.

FIG. 8 is a flowchart showing the processing of a conventional cluster generation function.

FIG. 9 is a flowchart showing a conventional Bayesian clustering method.

[Explanation of symbols]

11 Evaluation value calculation means
61 Internet
62 Information search server
63 Computer
64 Client
621 Content database
622 Cluster database
623 Control unit
623a Leaf node information selection means
623b Partial cluster generation means
623c Recursive clustering means
623d Page update/search means

Claims

[Claims]

1. An information retrieval method for clustering using Bayesian clustering, in which a set of documents is regarded as a cluster and, when clusters are merged successively, an evaluation function taking two clusters as arguments is used to obtain the combination of clusters that maximizes the evaluation function and those clusters are merged, wherein the calculation of the evaluation function is stopped partway when the evaluation value in clustering is calculated.

2. The information retrieval method according to claim 1, wherein the calculation of the evaluation function is stopped partway, and the calculation of the evaluation function is then continued for the top-ranked combinations up to a predetermined number.

3. The information retrieval method according to claim 1, wherein, when the evaluation function is calculated, the calculation is performed after sorting in order of relative frequency in all documents, relative frequency in each document, and relative frequency in each cluster.

4. An information retrieval apparatus connected via a network to a plurality of computers having document information, the apparatus having a content database that stores index information of the plural pieces of document information and a control unit that searches and updates the document information using the content database, the control unit having: leaf node information selecting means for selecting, from the plural pieces of document information, a predetermined number of pieces of information to become leaf nodes of a cluster; partial cluster generating means; recursive clustering means for instructing that the processing be repeated toward the leaf nodes of the cluster generated by the leaf node information selecting means and the partial cluster generating means; and page update/search means for adding and updating page information in the generated cluster and searching for page information from the cluster, wherein the control unit comprises evaluation value calculation means that stops the calculation of an evaluation function partway and calculates the evaluation value.

5. The information retrieval apparatus according to claim 4, wherein the evaluation value calculation means stops the calculation of the evaluation function in the middle and continuously calculates the evaluation function for up to a predetermined number of combinations.

6. The information retrieval apparatus according to claim 4, wherein the evaluation value calculation means, when calculating the evaluation function, sorts and calculates in order of relative frequency in all documents, relative frequency in each document, and relative frequency in each cluster.

7. The information retrieval apparatus according to claim 4, wherein said evaluation value calculation means is provided in said partial cluster generation means and / or said recursive clustering means.