JP2006338157A

JP2006338157A - Document group processor, document group processing method, document group processing program and recording medium with the same program stored therein

Info

Publication number: JP2006338157A
Application number: JP2005159777A
Authority: JP
Inventors: Tomoharu Iwata; 具治岩田; Kazumi Saito; 和巳斉藤
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2005-05-31
Filing date: 2005-05-31
Publication date: 2006-12-14

Abstract

PROBLEM TO BE SOLVED: To provide a method for extracting a topic group adapted to a document group by using an existing topic group attached to an existing document group, and a method for classifying the document group by each topic adapted to the document group.

SOLUTION: This document group processing method comprises an existing topic group probability model construction step for constructing the probability model of an existing topic group, an adaptive topic group extraction step for extracting a topic group matched with a document group by using the probability model constructed in the existing topic group probability model construction step, and a document group classification step for classifying the document group by each adaptive topic group extracted in the adaptive topic group extraction step.

COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a document group processing apparatus, a document group processing method, a document group processing program, and a recording medium storing the document group processing program, for clustering a given document group by topic.

Clustering a document group by topic enables a user to find a target document easily. Here, clustering is a technique for grouping the objects to be classified by bringing similar objects together. Much clustering is still done by hand, but in recent years an enormous number of electronic documents has been accumulating, so techniques for clustering mechanically have become extremely important, and many clustering methods have been proposed (see, for example, Non-Patent Document 1).
Non-Patent Document 1: R. Duda, P. Hart, and D. Stork, translation supervised by Morio Oshiage, "Pattern Identification" (パターン識別), 2nd edition, (USA), New Technology Communications, 2001, pp. 519-601

However, the technique disclosed in Non-Patent Document 1, for example, cannot cluster a document group appropriately. Nor can it assign an appropriate topic to each class produced by the clustering.

For example, in the case of Web pages, a directory-type search engine such as the category search of the search site goo (registered trademark) (http://www.goo.ne.jp) labels a large number of pages by topic. However, even the second level of the goo (registered trademark) category search contains 240 or more topics, and clustering a document group with all of those topics does not give a useful clustering result. For a user to find a target document easily, a topic count of about 5 to 20 is considered desirable. A moderately sized topic group is therefore needed, which means selecting, from these many topics, the topics that match the document group.

The present invention has been made in view of the above problems. Its object is to provide a document group processing apparatus, a document group processing method, a document group processing program, and a recording medium storing the document group processing program that can extract a topic group matching a document group by using an existing topic group attached to an existing document group. A further object is to provide a document group processing method, a document group processing program, and a recording medium storing the document group processing program that additionally make it possible to cluster the document group by each extracted matching topic group.

The present invention was devised to solve the above problems. The document group processing apparatus of the present invention comprises an existing topic group probability model construction unit that constructs a probability model of an existing topic group, and a matching topic group extraction unit that extracts a topic group matching a document group by using the probability model constructed by the existing topic group probability model construction unit.

With this configuration, the document group processing apparatus can extract a topic group matching the document group by using the existing topic group attached to the existing document group. Here, a topic group matching the document group is a collection of topics that is suitable for describing the document group and that contains a moderate number of topics.

The document group processing method of the present invention includes an existing topic group probability model construction step of constructing a probability model of an existing topic group, and a matching topic group extraction step of extracting a topic group matching a document group by using the probability model constructed in the existing topic group probability model construction step.

With this method, the document group processing apparatus can extract a topic group matching the document group by using the existing topic group attached to the existing document group.

The document group processing method of the present invention may also include an existing topic group probability model construction step of constructing a probability model of an existing topic group, a matching topic group extraction step of extracting a topic group matching a document group by using the probability model constructed in the existing topic group probability model construction step, and a document group classification step of classifying the document group by each matching topic group extracted in the matching topic group extraction step.

With this configuration, the document group processing apparatus can extract a topic group matching the document group by using the existing topic group attached to the existing document group. It can furthermore cluster the document group by each extracted matching topic group.

The existing topic group probability model construction step in the document group processing method of the present invention includes a word occurrence probability estimation step of estimating word occurrence probabilities from the word appearance frequencies of the existing document group. The word appearance frequency and the word occurrence probability are described in detail in the best mode for carrying out the invention.

With this method, the document group processing apparatus can estimate the word occurrence probabilities from the word appearance frequencies of the existing document group. It can also use the estimated word occurrence probabilities to extract a topic group matching the document group.

The matching topic group extraction step in the document group processing method of the present invention includes a likelihood calculation step of calculating the likelihood of a topic group using the word occurrence probabilities, and a topic group extraction step of extracting a topic group using the likelihood calculated in the likelihood calculation step.

With this method, the document group processing apparatus can calculate the likelihood of a topic group using the word occurrence probabilities. It can also use the calculated likelihood to extract a topic group matching the document group.

The document group classification step in the document group processing method of the present invention includes a classification step of classifying the document group based on a calculation using the likelihood.

With this method, the document group processing apparatus can classify the document group based on a calculation using the likelihood.

A document group processing program that causes a computer to execute such a document group processing method allows the computer to perform the same functions as the document group processing apparatus. Furthermore, a recording medium storing a program that causes a computer to execute such a document group processing method allows the program that gives the computer the same functions as the document group processing apparatus to be kept on the recording medium.

According to the present invention, a topic group matching a document group can be extracted by using the existing topic group attached to an existing document group. The document group can also be clustered by each extracted matching topic group.

Next, the best mode for carrying out the present invention (hereinafter referred to as the "embodiment") will be described with reference to the drawings.

FIG. 1 is a block diagram of a document group processing apparatus according to an embodiment of the present invention. As shown in FIG. 1, the document group processing apparatus 1 includes computing means 2, input means 3, storage means 4, and output means 5. Each of these means is connected to a bus line 11.

The computing means 2 includes an existing topic group probability model construction unit 21, a matching topic group extraction unit 22, and a memory 23. The computing means 2 reads the existing topic group probability model construction program 41 and the matching topic group extraction program 42 from the storage means 4, stores them in the memory 23, and executes them, thereby realizing the existing topic group probability model construction unit 21 and the matching topic group extraction unit 22. The configurations of the existing topic group probability model construction unit 21 and the matching topic group extraction unit 22 are described in detail later. The computing means 2 includes, for example, a CPU (Central Processing Unit) that performs arithmetic processing and a RAM (Random Access Memory) that stores information.

The input means 3 consists of a keyboard, a disk drive device, and the like. The existing document group and the document group to be clustered can be input through the input means 3 and stored in the storage means 4.

The storage means 4 consists of a hard disk device or the like. The storage means 4 can store the existing topic group probability model construction program 41 and the matching topic group extraction program 42, from which the existing topic group probability model construction unit 21 and the matching topic group extraction unit 22 are realized. The storage means 4 also includes an existing document group table 43, a word occurrence probability table 44, and a document group table 45.

Here, the existing document group table 43 is a table for storing a set of pages labeled with topics, such as the set of pages registered in a directory-type search engine.

The word occurrence probability table 44 is a table for storing the word occurrence probabilities calculated by the existing topic group probability model construction unit 21.

The document group table 45 is a table for storing the set of pages to be clustered, such as a set of pages obtained by searching with a directory-type search engine.

The output means 5 is, for example, a graphics board and a monitor connected to it, and displays results such as the outcome of clustering the document group.

The configurations of the existing topic group probability model construction unit 21 and the matching topic group extraction unit 22 are described below with reference to FIGS. 2 and 3. The computing means 2 first calls the existing topic group probability model construction unit 21 and, when its processing has finished, calls the matching topic group extraction unit 22.

(Description of the existing topic group probability model construction unit 21)
FIG. 2 is a block diagram of the existing topic group probability model construction unit according to the embodiment of the present invention. As shown in FIG. 2, the existing topic group probability model construction unit 21 includes an existing document group reading unit 211, a word occurrence probability estimation unit 212, and a word occurrence probability writing unit 213.

(Description of the existing document group reading unit 211)
The existing document group reading unit 211 reads the existing document group X' from the existing document group table 43 (see FIG. 1) and stores it in the memory 23. As described above, the existing document group table 43 (see FIG. 1) holds a set of labeled pages. X' consists of X'_k (k = 1, ..., K'), where K' is the number of topics in the entire existing document group. X'_k is the set of documents belonging to topic k and is defined by the following equation (1).

$$X'_k = \{x'_{kn}\}_{n=1}^{N'_k} \tag{1}$$

Here, N'_k is the number of pages belonging to topic k. Each element x'_kn is the V-dimensional word appearance frequency vector of page x'_n belonging to topic k, and V is the total number of words, meaning the total number of words in the entire existing document group X'. The word appearance frequency vector x'_kn is defined by the following equation (2).

$$x'_{kn} = (x'_{kn1}, \ldots, x'_{knV}) \tag{2}$$

Here, x'_knj is the appearance frequency of word w_j in page x'_n belonging to topic k. The appearance frequency is the number of times a specific word appears within a given page range.
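
The data layout described above can be illustrated with a minimal Python sketch. It assumes that V counts the distinct words of the existing document group and that pages are split on whitespace; the function names `build_vocabulary` and `frequency_vector` and the sample pages are hypothetical and do not come from the text.

```python
from collections import Counter


def build_vocabulary(pages_by_topic):
    """Map each distinct word w_j of the existing document group X' to an index j."""
    words = sorted({w for pages in pages_by_topic.values()
                    for page in pages for w in page.split()})
    return {w: j for j, w in enumerate(words)}


def frequency_vector(page, word_index):
    """Return the V-dimensional word appearance frequency vector of one page."""
    vec = [0] * len(word_index)
    for word, count in Counter(page.split()).items():
        j = word_index.get(word)
        if j is not None:  # assumption: words outside the vocabulary are ignored
            vec[j] = count
    return vec


# Hypothetical labeled pages: topic label -> list of page texts
existing_pages = {
    "sports": ["soccer goal soccer", "tennis match"],
    "cooking": ["recipe salt", "salt pepper recipe recipe"],
}
word_index = build_vocabulary(existing_pages)

# X_prime[k][n][j] plays the role of x'_knj
X_prime = {k: [frequency_vector(p, word_index) for p in pages]
           for k, pages in existing_pages.items()}
```

Because `frequency_vector` only looks up words already in `word_index`, the same function can be reused for the pages to be clustered, which are expressed over the vocabulary of the existing document group.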

(Description of the word occurrence probability estimation unit 212)
The word occurrence probability estimation unit 212 calculates the word occurrence probability θ_kj from the existing document groups X'_k (k = 1, ..., K') that the existing document group reading unit 211 stored in the memory 23, and stores the result in the memory 23. Here, the word occurrence probability θ_kj is the probability that word w_j appears in a page belonging to topic k. The word occurrence probability θ_kj can be estimated by the following equation (3), for example by adopting an NB (Naive Bayes) model as the probability model of each topic.

The NB model is described in, for example, "McCallum, A., Nigam, K. (1998) A comparison of event models for naive Bayes text classification. In: AAAI-98 Workshop on Learning for Text Categorization". λ_k is a smoothing parameter and can be calculated using the cross-validation method. The cross-validation method is described in, for example, Non-Patent Document 1 mentioned above.
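
Equation (3) is not reproduced in this text. As a minimal sketch only, assuming the additively smoothed estimate commonly used with a multinomial NB model, θ_kj = (Σ_n x'_knj + λ_k) / (Σ_j Σ_n x'_knj + λ_k V), the estimation could look like the following. The function name `estimate_theta`, the toy counts, and the hand-picked λ_k values are illustrative assumptions; the text selects λ_k by cross-validation.

```python
def estimate_theta(X_k, lam):
    """Smoothed word occurrence probabilities theta_k for one topic.

    X_k is the list of word frequency vectors x'_kn of the topic,
    lam is the smoothing parameter lambda_k (assumed given here).
    """
    V = len(X_k[0])
    totals = [sum(x[j] for x in X_k) for j in range(V)]  # sum over pages n
    denom = sum(totals) + lam * V
    return [(t + lam) / denom for t in totals]


# Toy data: two topics over a three-word vocabulary
X_prime = {0: [[1, 0, 2], [0, 1, 1]], 1: [[0, 3, 0]]}
lam = {0: 1.0, 1: 0.5}  # hypothetical values, not tuned by cross-validation

theta = {k: estimate_theta(X_k, lam[k]) for k, X_k in X_prime.items()}
```

With a positive λ_k every θ_kj is strictly positive, so a word that never appears in topic k does not drive a later log-likelihood to minus infinity.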

(Description of the word occurrence probability writing unit 213)
The word occurrence probability writing unit 213 writes the word occurrence probability θ_kj, which was estimated by the word occurrence probability estimation unit 212 and stored in the memory 23, to the word occurrence probability table 44 (see FIG. 1). The word occurrence probability θ_kj stored in the word occurrence probability table 44 (see FIG. 1) is used by the matching topic group extraction unit 22.

(Description of the matching topic group extraction unit 22)
FIG. 3 is a block diagram of the matching topic group extraction unit according to the embodiment of the present invention. As shown in FIG. 3, the matching topic group extraction unit 22 includes a word occurrence probability reading unit 221, a document group reading unit 222, an existing topic group acquisition unit 223, a log likelihood calculation unit 224, a topic group sorting unit 225, and a topic deletion unit 226.

(Description of the word occurrence probability reading unit 221)
The word occurrence probability reading unit 221 reads the word occurrence probability θ_kj from the word occurrence probability table 44 (see FIG. 1) and stores it in the memory 23.

(Description of the document group reading unit 222)
The document group reading unit 222 reads the document group X from the document group table 45 (see FIG. 1) and stores it in the memory 23. As described above, the document group table 45 (see FIG. 1) holds the set of pages to be clustered. X consists of x_n (n = 1, ..., N), where N is the number of pages in the entire document group. x_n is the V-dimensional word appearance frequency vector of page X_n, and V is the total number of words. As described above, the total number of words means the total number of words in the entire existing document group X'. The word appearance frequency vector x_n is defined by the following equation (4) using the word appearance frequencies x_nj.

$$x_n = (x_{n1}, \ldots, x_{nV}) \tag{4}$$

Here, the word appearance frequency x_nj is the appearance frequency of word w_j in page X_n.

(Description of the existing topic group acquisition unit 223)
The existing topic group acquisition unit 223 obtains the existing topic group G' from the word occurrence probability θ_kj that the word occurrence probability reading unit 221 stored in the memory 23, and stores G' in the memory 23. The existing topic group G' obtained by the existing topic group acquisition unit 223 is defined as G' = {1, ..., K'}. Because the word occurrence probability θ_kj is a value obtained by specifying a topic k and a word w_j, it carries the information of the existing topic group G' = {1, ..., K'}. The unit also sets the initial value of the matching topic group G by assigning the existing topic group G' to it. The matching topic group G is used to extract the topics that match the document group X and satisfies G ⊂ {1, ..., K'}.

(Description of the log likelihood calculation unit 224)
The log likelihood calculation unit 224 calculates the log likelihood log p(x_n | k) using the word occurrence probability θ_kj that the word occurrence probability reading unit 221 read into the memory 23 and the document group X that the document group reading unit 222 read into the memory 23, and stores the result in the memory 23. The log likelihood log p(x_n | k) is calculated for every combination of a page X_n (n = 1, ..., N) of the document group X and a topic k (k = 1, ..., K') of the existing document group X'. The log likelihood log p(x_n | k) is calculated by the following equation (5).
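
Equation (5) is not reproduced in this text. A minimal sketch, assuming the multinomial NB form log p(x_n | k) = Σ_j x_nj log θ_kj with the term that does not depend on k omitted, is shown below. The function name `log_likelihood` and the sample values of θ and X are hypothetical.

```python
import math


def log_likelihood(x, theta_k):
    """Assumed form: sum_j x_nj * log(theta_kj) for one page and one topic."""
    # Words with zero count contribute nothing, so they are skipped.
    return sum(c * math.log(t) for c, t in zip(x, theta_k) if c > 0)


# Hypothetical word occurrence probabilities for two topics, three words
theta = {0: [0.25, 0.25, 0.5], 1: [0.1, 0.8, 0.1]}
# Hypothetical document group X with two pages
X = [[0, 0, 2], [1, 3, 0]]

# loglik[n][k] holds log p(x_n | k) for every page/topic combination
loglik = [{k: log_likelihood(x, th) for k, th in theta.items()} for x in X]
```

Storing the values as one row per page keeps every page/topic combination available for the later steps, which only need lookups and comparisons.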

(Description of the topic group sorting unit 225)
The topic group sorting unit 225 calculates the estimated topic c_n(G) using the log likelihood log p(x_n | k) that was calculated by the log likelihood calculation unit 224 and stored in the memory 23, and stores the result in the memory 23. The estimated topic c_n(G) is calculated by the following equation (6).

The topic group sorting unit 225 also uses the estimated topics c_n(G) stored in the memory 23 to calculate the number of pages N_k (k = 1, ..., K') corresponding to each estimated topic k, and stores the result in the memory 23.
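
Equation (6) is not reproduced in this text. As a sketch under the assumption that the estimated topic is the topic in G with the highest log likelihood, c_n(G) = argmax over k in G of log p(x_n | k), the two calculations above could be written as follows. The names `estimate_topics` and `page_counts` and the sample log likelihoods are hypothetical.

```python
def estimate_topics(loglik, G):
    """c_n(G): for each page n, the topic in G with the largest log p(x_n | k)."""
    return [max(G, key=lambda k: row[k]) for row in loglik]


def page_counts(c, topics):
    """N_k: number of pages whose estimated topic is k."""
    counts = {k: 0 for k in topics}
    for k in c:
        counts[k] += 1
    return counts


# Hypothetical log likelihoods: loglik[n][k] = log p(x_n | k)
loglik = [
    {0: -1.4, 1: -4.6, 2: -3.0},
    {0: -5.5, 1: -3.0, 2: -9.1},
    {0: -2.0, 1: -2.5, 2: -7.0},
]
G = [0, 1, 2]

c = estimate_topics(loglik, G)
N = page_counts(c, G)
```

In this toy example topic 2 receives no page, which is the situation handled by the deletion step described next.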

For each estimated topic k (k = 1, ..., K'), the topic group sorting unit 225 also deletes the topic from the matching topic group G in the memory 23 when no page corresponds to it, that is, when the number of pages N_k corresponding to estimated topic k is not positive. Here, K' is the number of elements of the existing topic group G'. Through this processing, the topic group sorting unit 225 can extract the matching topic group G = {1, ..., K} consisting of the topics for which corresponding pages exist.

Furthermore, the topic group sorting unit 225 creates a list L = {L_1, ..., L_K} in which the matching topic group G is sorted in ascending order of the number of pages N_k of each estimated topic k in the memory 23, and stores the list in the memory 23.
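
The deletion of empty topics and the ascending sort can be sketched in a few lines of Python. The function name `prune_and_sort` and the sample page counts are hypothetical.

```python
def prune_and_sort(G, N):
    """Drop topics without pages, then list the rest by ascending page count N_k."""
    G = [k for k in G if N[k] > 0]       # keep only topics with N_k > 0
    L = sorted(G, key=lambda k: N[k])    # L_1 is the topic with the fewest pages
    return G, L


# Hypothetical page counts N_k for four existing topics
G, L = prune_and_sort([0, 1, 2, 3], {0: 5, 1: 0, 2: 1, 3: 3})
```

Python's `sorted` is stable, so topics with equal page counts keep their original relative order; the text does not say how ties should be broken.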

(Description of the topic deletion unit 226)
The topic deletion unit 226 deletes topics that do not qualify as matching topics, so that the optimal matching topic group G is extracted by selecting the matching topic group G that minimizes the value of AIC (Akaike's Information Criterion). AIC is described in, for example, "Akaike, H. (1973). Information theory and extension of the maximum likelihood principle".

One conceivable way to select the model with the minimum AIC is, for example, to compare AIC(G) with AIC(G_{-m}) and, when AIC(G) > AIC(G_{-m}) holds, delete L_m from the matching topic group G, repeating this for m = 1, ..., K in order, that is, starting from the topic with the fewest pages N_k. Here, G_{-m} is the matching topic group G with L_m removed. The value of AIC(G) is calculated by the following equation (7), using the matching topic group G, the log likelihoods log p(x_n | k), and the estimated topics c_n(G) held in the memory 23.

Here, |G| is the number of elements of the matching topic group G. The value of AIC(G_{-m}) can be calculated in the same way.

Note that when the matching topic group G changes, the estimated topics c_n(G) and the value of AIC(G) change as well. Therefore, if c_n(G) and AIC(G) are to be used again after L_m has been deleted from the matching topic group G, the values held in the memory 23 must first be updated to the correct values. The same applies to the topic group G_{-m}.
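
Equation (7) is not reproduced in this text, so the following is only a sketch of the greedy deletion procedure under an assumed form of the criterion: AIC(G) = -2 Σ_n log p(x_n | c_n(G)) + 2 d |G|, where d is a hypothetical number of free parameters per topic (`params_per_topic`). The function names, the sample log likelihoods, and the guard that keeps at least one topic are likewise assumptions.

```python
def aic(loglik, G, params_per_topic=1):
    """Assumed AIC: -2 * sum_n log p(x_n | c_n(G)) + 2 * params_per_topic * |G|."""
    # max over k in G equals the log likelihood of the estimated topic c_n(G)
    fit = sum(max(row[k] for k in G) for row in loglik)
    return -2.0 * fit + 2.0 * params_per_topic * len(G)


def select_topics(loglik, L, params_per_topic=1):
    """Try to delete topics in the order of L (fewest pages first)."""
    G = list(L)
    for topic in L:
        if len(G) == 1:
            break  # assumption: keep one topic so every page still has a label
        G_minus = [k for k in G if k != topic]
        # Both values are recomputed from the current G, so they always
        # reflect earlier deletions.
        if aic(loglik, G, params_per_topic) > aic(loglik, G_minus, params_per_topic):
            G = G_minus
    return G


# Hypothetical log likelihoods: loglik[n][k] = log p(x_n | k)
loglik = [
    {0: -1.0, 1: -5.0, 2: -1.2},
    {0: -1.5, 1: -6.0, 2: -1.4},
    {0: -7.0, 1: -1.0, 2: -6.0},
    {0: -1.0, 1: -8.0, 2: -3.0},
]
L = [1, 2, 0]  # topics in ascending order of page count

G = select_topics(loglik, L)
```

In this toy example topic 2 is removed because topic 0 explains its only page almost as well, while topic 1 is kept because deleting it would worsen the fit by more than the penalty saved.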

In addition, because topics k with no corresponding page are deleted from the matching topic group G in advance, the topic deletion unit 226 can skip the sorting and the AIC(G) calculation for a matching topic group G that would otherwise include such topics.

Furthermore, applying the matching topic group G extracted by the matching topic group extraction unit 22 and the document group X to equation (7) makes it possible to cluster the document group X by each matching topic group G.
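
One reading of this step, assumed here, is that each page is placed in the cluster of the topic in the final G that gives it the highest log likelihood. A minimal sketch under that assumption follows; the function name `cluster_by_topic` and the sample values are hypothetical.

```python
def cluster_by_topic(loglik, G):
    """Group page indices by the topic in G with the highest log p(x_n | k)."""
    clusters = {k: [] for k in G}
    for n, row in enumerate(loglik):
        best = max(G, key=lambda k: row[k])
        clusters[best].append(n)
    return clusters


# Hypothetical log likelihoods and an already extracted matching topic group
loglik = [
    {0: -1.0, 1: -5.0, 2: -1.2},
    {0: -1.5, 1: -6.0, 2: -1.4},
    {0: -7.0, 1: -1.0, 2: -6.0},
    {0: -1.0, 1: -8.0, 2: -3.0},
]
G = [1, 0]

clusters = cluster_by_topic(loglik, G)
```

Each key of `clusters` is a topic label from the existing topic group, so every cluster comes with a topic attached.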

Next, the processing performed by the computing means 2 (see FIG. 1) is described with reference to FIG. 4 (and FIG. 1 as appropriate). FIG. 4 is a flowchart showing the processing of the computing means according to the embodiment of the present invention.

As shown in FIG. 4, the computing means 2 first performs the existing topic group probability model construction processing with the existing topic group probability model construction unit 21 (S10). It then performs the matching topic group extraction processing with the matching topic group extraction unit 22 (S20) and ends the processing. The details of S10 are described below with reference to FIG. 5, and the details of S20 with reference to FIGS. 6, 7, and 8.

The processing performed by the existing topic group probability model construction unit 21 (see FIG. 1) is described with reference to FIG. 5 (and FIGS. 1 and 2 as appropriate). FIG. 5 is a flowchart showing the processing of the existing topic group probability model construction unit according to the embodiment of the present invention.

As shown in FIG. 5, the existing document group reading unit 211 reads the existing document group X' from the existing document group table 43 and stores it in the memory 23 (S11). Next, the word occurrence probability estimation unit 212 estimates the word occurrence probability θ_kj from the existing document group stored in the memory 23 and stores it in the memory 23 (S12). The word occurrence probability θ_kj can be calculated by equation (3) above. The word occurrence probability writing unit 213 then writes the word occurrence probability θ_kj stored in the memory 23 to the word occurrence probability table 44 (S13) and ends the processing.

The processing performed by the matching topic group extraction unit 22 (see FIG. 1) will be described with reference to FIGS. 6, 7 and 8 (and FIGS. 1 and 3 as appropriate). FIGS. 6, 7 and 8 are flowcharts showing the processing of the matching topic group extraction unit according to the embodiment of the present invention.

As shown in FIG. 6, the word occurrence probability reading unit 221 reads the word occurrence probability θ_kj from the word occurrence probability table 44 and stores it in the memory 23 (S201). Next, the document group reading unit 222 reads the document group X from the document group table 45 and stores it in the memory 23 (S202).

The existing topic group acquisition unit 223 acquires the existing topic group G′ from the word occurrence probability θ_kj in the memory 23 and stores it in the memory 23 (S203). It also allocates the matching topic group G in the memory 23 and initializes it by assigning the existing topic group G′ to it (S204).

The log likelihood calculation unit 224 calculates the log likelihood log p(x_n|k) using the word occurrence probability θ_kj and the document group X in the memory 23, and stores it in the memory 23 (S205). The log likelihood log p(x_n|k) can be calculated by Equation (5) above.
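A minimal sketch of step S205 follows. Equation (5) is not reproduced in this section; the code assumes a bag-of-words model in which log p(x_n|k) is the sum over words of the word count times log θ_kj, with any term that does not depend on k omitted. The function names and the dictionary layout are hypothetical.

```python
import math


def log_likelihood(doc_counts, theta_k):
    """Return log p(x_n | k) for one document under one topic.

    doc_counts maps each word of the document to its count; theta_k maps
    each word to its occurrence probability under topic k. Every word is
    assumed to have a positive probability (for example via smoothing).
    """
    return sum(count * math.log(theta_k[word]) for word, count in doc_counts.items())


def log_likelihood_table(docs, theta):
    """Compute log p(x_n | k) for every document n and every topic k."""
    return [
        {k: log_likelihood(doc_counts, theta_k) for k, theta_k in theta.items()}
        for doc_counts in docs
    ]
```

Computing the whole table once matches the flow described here, where the stored log likelihoods are reused in later steps rather than recomputed.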

Next, as shown in FIG. 7, the topic group sorting unit 225 calculates the estimated topic c_n(G) using the log likelihood log p(x_n|k) in the memory 23 and stores it in the memory 23 (S206). The estimated topic c_n(G) can be calculated by Equation (6) above. It then uses the estimated topic c_n(G) in the memory 23 to calculate the number of pages N_k assigned to each estimated topic k, and stores it in the memory 23 (S207).
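Steps S206 and S207 can be sketched as follows. Equation (6) is not reproduced in this section; the code assumes that the estimated topic c_n(G) is the topic in G with the highest log likelihood for document n. The function names and the list-of-dictionaries layout of the log likelihoods are hypothetical.

```python
def estimate_topics(loglik, topics):
    """Return c_n(G) for each document: the topic in `topics` maximizing log p(x_n | k).

    loglik is a list with one dict per document, mapping topic -> log p(x_n | k).
    """
    return [max(topics, key=lambda k: row[k]) for row in loglik]


def count_pages(assignments, topics):
    """Return N_k, the number of pages whose estimated topic is k."""
    counts = {k: 0 for k in topics}
    for k in assignments:
        counts[k] += 1
    return counts
```

Topics that win no page keep a count of zero, which is the condition checked for each topic in the loop that follows.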

The topic group sorting unit 225 allocates a counter t in the memory 23 and sets t to 1 (S208). It then determines whether the number of pages N_t is positive (S209). If N_t is not positive (No in S209), topic t is deleted from the matching topic group G in the memory 23 (S210), and the process proceeds to S211. If N_t is positive (Yes in S209), nothing is done and the process proceeds to S211. In S211, 1 is added to the counter t, and it is then determined whether t exceeds the number of elements K′ of the existing topic group G′ (S212). If t does not exceed K′ (No in S212), the process returns to S209. If t exceeds K′ (Yes in S212), the process proceeds to S213.

In S213, the topic group sorting unit 225 creates a list L = [L_1, ..., L_K] in which the matching topic group G is sorted in ascending order of the number of pages N_k of each estimated topic k in the memory 23, and stores the list in the memory 23 (S213).
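The loop of S208 to S212 and the sort of S213 amount to dropping topics with no pages and ordering the rest by page count. The sketch below expresses that directly; the function name `prune_and_sort` and the dictionary of page counts are hypothetical, and the tie-breaking order (original topic order, via Python's stable sort) is an assumption, since the text does not specify one.

```python
def prune_and_sort(topics, page_counts):
    """Return the list L: topics with N_k > 0, in ascending order of N_k.

    topics is the existing topic group G' in its original order and
    page_counts maps each topic k to its number of pages N_k.
    """
    # S209-S210: delete every topic whose page count is not positive.
    kept = [k for k in topics if page_counts.get(k, 0) > 0]
    # S213: ascending order of N_k, so the smallest topics come first.
    return sorted(kept, key=lambda k: page_counts[k])
```

Placing the least-populated topics first means they are the first candidates for deletion in the subsequent AIC-based loop.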

Next, as shown in FIG. 8, in S214 the topic deletion unit 226 calculates AIC(G) using the matching topic group G, the log likelihood log p(x_n|k), and the estimated topic c_n(G) in the memory 23, and stores it in the memory 23 (S214). AIC(G) can be calculated by Equation (7) above. It then allocates a counter m in the memory 23 and sets m to 1 (S215).

The topic deletion unit 226 defines G_{-m} as the topic group obtained by deleting L_m from the matching topic group G, calculates the estimated topic c_n(G_{-m}) using the log likelihood log p(x_n|k) in the memory 23, and stores it in the memory 23 (S216). It then calculates AIC(G_{-m}) using G_{-m}, the log likelihood log p(x_n|k), and the estimated topic c_n(G_{-m}), and stores it in the memory 23 (S217). The values of AIC(G) and AIC(G_{-m}) are then compared. If AIC(G_{-m}) is not smaller than AIC(G) (No in S218), the process proceeds to S222. If AIC(G_{-m}) is smaller than AIC(G) (Yes in S218), L_m is deleted from the matching topic group G (S219), the value of c_n(G) is updated with c_n(G_{-m}) (S220), the value of AIC(G) is updated with AIC(G_{-m}) (S221), and the process proceeds to S222.

In S222, 1 is added to the counter m, and it is then determined whether m exceeds the number of elements K of the list L (S223). If m does not exceed K (No in S223), the process returns to S216. If m exceeds K (Yes in S223), the estimated topic c_n(G) is calculated using the log likelihood log p(x_n|k) in the memory 23 and the extracted matching topic group G, and is stored in the memory 23 (S224), and the process ends.
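The deletion loop of S214 to S224 can be sketched as a single greedy pass over the list L. Equation (7) is not reproduced in this section, so the AIC used here is an assumption: minus twice the summed log likelihood of each page under its estimated topic, plus twice a per-topic parameter count times the number of topics. The names `assign`, `aic`, `greedy_delete` and `params_per_topic` are hypothetical, and the guard against deleting the last remaining topic is an addition not stated in the text.

```python
def assign(loglik, topics):
    """c_n(G): for each document, the topic in `topics` with the highest log likelihood."""
    return [max(topics, key=lambda k: row[k]) for row in loglik]


def aic(loglik, topics, assignments, params_per_topic):
    """Assumed AIC form: -2 * fit + 2 * (number of free parameters)."""
    fit = sum(row[c] for row, c in zip(loglik, assignments))
    return -2.0 * fit + 2.0 * params_per_topic * len(topics)


def greedy_delete(loglik, sorted_topics, params_per_topic):
    """Try deleting each L_m once, in list order, keeping deletions that lower the AIC."""
    current = list(sorted_topics)
    c = assign(loglik, current)
    best = aic(loglik, current, c, params_per_topic)           # S214
    for candidate in sorted_topics:                             # m = 1, ..., K
        trial = [k for k in current if k != candidate]          # G_{-m}
        if not trial:
            continue  # added guard: never delete the last topic
        c_trial = assign(loglik, trial)                         # S216
        score = aic(loglik, trial, c_trial, params_per_topic)   # S217
        if score < best:                                        # S218
            current, c, best = trial, c_trial, score            # S219-S221
    return current, c                                           # c is c_n(G) of S224
```

Because the log likelihoods are computed once and only the assignments change, each trial deletion costs one pass over the stored table.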

Through the above steps, the document group processing apparatus 1 can extract the matching topic group G. In addition, by applying the extracted matching topic group G and the document group X to Equation (7), the document group X can be clustered by topic of the matching topic group G.

In this embodiment, the document group processing apparatus 1 also has a function of clustering the document group X by topic of the matching topic group G, by applying the extracted matching topic group G and the document group X to Equation (7). However, the document group processing apparatus can also be realized simply as an apparatus that has only the function of extracting the matching topic group G that matches the document group X.

In this embodiment, AIC is used as the method for selecting the matching topic group G, but the selection method is not limited to this. For example, another model selection criterion such as MDL (Minimum Description Length) can be used instead of AIC. MDL is described in, for example, Rissanen, J. (1983). A universal prior for integers and estimation by minimum description length. The Annals of Statistics, Vol. 11, No. 2, pp. 416-431.

In this embodiment, no limit is placed on the number of matching topics. However, it is also possible to set a limit and to end the process of selecting matching topics once the number of matching topics falls to or below that limit. Setting such a limit makes it possible to extract a moderate number of matching topics.
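The variation with a topic-count limit can be sketched by adding one termination check to the selection loop. This is an illustration only: `select_with_limit`, `score` and `max_topics` are hypothetical names, and `score` stands for whatever model selection criterion is in use (lower is better), supplied by the caller as a function of the candidate topic list.

```python
def select_with_limit(sorted_topics, score, max_topics):
    """Greedy topic deletion that ends once at most `max_topics` topics remain.

    sorted_topics is the list L of candidate topics; score(topics) returns
    the selection criterion for a topic list, where lower values are better.
    """
    current = list(sorted_topics)
    best = score(current)
    for candidate in sorted_topics:
        # End the selection process once the limit has been reached.
        if len(current) <= max_topics:
            break
        trial = [k for k in current if k != candidate]
        trial_score = score(trial)
        if trial_score < best:
            current, best = trial, trial_score
    return current
```

For example, with a toy criterion such as `len`, calling `select_with_limit(["a", "b", "c", "d"], len, 2)` deletes the first two candidates and then stops with two topics remaining.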

(Evaluation of the document group processing apparatus)
To evaluate the effectiveness of the document group processing apparatus according to the embodiment of the present invention, groups of pages obtained as Web search results were clustered. The existing topic group consisted of the 242 topics at the second level of the goo (registered trademark) category search, and the 74,233 pages contained in them were used to construct the probability model of each topic. The total number of words V was 50,129.

The roughly 1,000 pages returned for the search term "ハブ" (romanized "habu", which is also the Japanese rendering of the English word "hub") were clustered by the document group processing apparatus according to the embodiment of the present invention. As a result, a total of 12 topics were extracted: "Biology", "Pets", "Cooking, Gourmet", "Auctions", "Specialty Stores", "PC Shops", "Hardware", "Network-related", "Business News", "Humanities", "Dictionaries and Lexicons", and "Broadband Knowledge".

With conventional clustering methods it is difficult to attach a topic label (such as "Biology" or "Pets") to each cluster, whereas the clusters produced in the embodiment of the present invention carry easy-to-understand labels that were assigned by hand.

Looking at the pages under each topic label, "Biology" and "Pets" contained pages about the habu snake; "Cooking, Gourmet" contained pages about habu tea; "Auctions", "Specialty Stores", "PC Shops", "Hardware", and "Network-related" contained pages about network hubs; and "Humanities" contained pages about a research project named 史資料ハブ地域文化研究拠点 (literally, a historical materials hub and regional culture research center).

The term "ハブ" has various meanings, and classifying the results by topic in this way can be expected to make searching more efficient. Beyond search efficiency, it also becomes possible to learn in what senses a search term is used on the Web. In this example, pages about the research project named 史資料ハブ地域文化研究拠点 were found, which reveals that the term is also used in the humanities.

FIG. 1 is a block diagram of the document group processing apparatus according to the embodiment of the present invention.
FIG. 2 is a block diagram of the existing topic group probability model construction unit according to the embodiment of the present invention.
FIG. 3 is a block diagram of the matching topic group extraction unit according to the embodiment of the present invention.
FIG. 4 is a flowchart showing the processing of the computing means according to the embodiment of the present invention.
FIG. 5 is a flowchart showing the processing of the existing topic group probability model construction unit according to the embodiment of the present invention.
FIG. 6 is a flowchart showing the processing of the matching topic group extraction unit according to the embodiment of the present invention.
FIG. 7 is a flowchart showing the processing of the matching topic group extraction unit according to the embodiment of the present invention.
FIG. 8 is a flowchart showing the processing of the matching topic group extraction unit according to the embodiment of the present invention.

Explanation of symbols

1 Document group processing apparatus
2 Computing means
3 Input means
4 Storage means
5 Output means
11 Bus line
21 Existing topic group probability model construction unit
22 Matching topic group extraction unit
23 Memory
41 Existing topic group probability model construction program
42 Matching topic group extraction program
43 Existing document group table
44 Word occurrence probability table
45 Document group table
211 Existing document group reading unit
212 Word occurrence probability estimation unit
213 Word occurrence probability writing unit
221 Word occurrence probability reading unit
222 Document group reading unit
223 Existing topic group acquisition unit
224 Log likelihood calculation unit
225 Topic group sorting unit
226 Topic deletion unit

Claims

A document group processing apparatus that extracts a topic group matching a document group by using an existing topic group attached to an existing document group, the document group processing apparatus comprising:
an existing topic group probability model construction unit that constructs a probability model of the existing topic group; and
a matching topic group extraction unit that extracts a topic group matching the document group by using the probability model constructed by the existing topic group probability model construction unit.

A document group processing method performed by a document group processing apparatus that extracts a topic group matching a document group by using an existing topic group attached to an existing document group, the document group processing method comprising:
an existing topic group probability model construction step of constructing a probability model of the existing topic group; and
a matching topic group extraction step of extracting a topic group matching the document group by using the probability model constructed in the existing topic group probability model construction step.

A document group processing method performed by a document group processing apparatus that extracts a topic group matching a document group by using an existing topic group attached to an existing document group and classifies the document group by each extracted matching topic group, the document group processing method comprising:
an existing topic group probability model construction step of constructing a probability model of the existing topic group;
a matching topic group extraction step of extracting a topic group matching the document group by using the probability model constructed in the existing topic group probability model construction step; and
a document group classification step of classifying the document group by each matching topic group extracted in the matching topic group extraction step.

The document group processing method according to claim 2, wherein the existing topic group probability model construction step includes a word occurrence probability estimation step of estimating a word occurrence probability by using the word appearance frequencies of the existing document group.

The document group processing method according to claim 2, wherein the matching topic group extraction step includes:
a likelihood calculation step of calculating the likelihood of the topic group by using the word occurrence probability; and
a topic group extraction step of extracting a topic group by using the likelihood of the topic group calculated in the likelihood calculation step.

The document group processing method according to claim 3, wherein the document group classification step includes a classification step of classifying the document group based on a calculation using the likelihood.

A document group processing program for causing a computer to execute the document group processing method according to claim 2.

A recording medium storing the document group processing program according to claim 7.