JP4963341B2

JP4963341B2 - Document relationship visualization method, visualization device, visualization program, and recording medium recording the program

Info

Publication number: JP4963341B2
Application number: JP2004178752A
Authority: JP
Inventors: 具治岩田; 和巳斉藤
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2004-06-16
Filing date: 2004-06-16
Publication date: 2012-06-27
Anticipated expiration: 2024-06-16
Also published as: JP2006004105A

Description

The present invention relates to an inter-document relationship visualization method and a visualization device for visualizing the relationships between documents that have been classified into categories.

In recent years, large numbers of documents have been accumulated electronically, and such document groups are usually classified into sets of related documents so that a user can easily find a target document. For example, on search sites such as "open directory project" and "goo (R)", a directory-type search engine classifies a large number of web pages by topic. Documents other than web pages, such as electronic books, patent documents, papers, and electronic files for computers, are likewise classified by category.
However, a classified document group may contain documents that were misclassified through human error, or documents that belong to none of the categories already defined (referred to as "singular documents"). The classification scheme may also be biased toward particular fields. Documents are therefore often not classified appropriately.
Classification errors can be found by manually checking the documents one by one. However, because manually checking a huge number of documents is difficult, the appropriateness of a document classification often goes unevaluated.

Therefore, visualization methods that express the relationships between documents in an easy-to-understand manner are used to evaluate the appropriateness of a document classification. Such visualization brings out the structural features latent in the documents and may provide important clues for discovering new knowledge about the document group (for example, evaluating the document classification).

Conventionally, there have been two such visualization methods. One obtains the similarity between documents from their word frequency vectors and visualizes the relationships between the documents based on this similarity. This is the technique called MDS (Multi-Dimensional Scaling) (see, for example, Non-Patent Document 1). The other represents the documents as a network in which links between documents are links between nodes, and visualizes the document network from the distances between nodes (see, for example, Non-Patent Document 2).
Non-Patent Document 1: M. Chalmers and P. Chitson, "Bead: Explorations in information visualization", SIGIR '92, Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, 1992, pp. 330-337
Non-Patent Document 2: J. B. Tenenbaum, V. de Silva and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction", Science, 290, 2000, pp. 2319-2323

However, with both of the methods disclosed in Non-Patent Document 1 and Non-Patent Document 2, many nodes are placed in a concentrated manner, which makes the relationships between documents difficult to understand.

The present invention has been made to solve the above problem. Its object is to provide an inter-document relationship visualization method, a visualization device, a visualization program, and a recording medium on which the program is recorded, each of which can express the relationships between categorized documents in an easy-to-understand manner.

To solve the above problem, the present invention provides an inter-document relationship visualization method performed by a visualization device that comprises: a storage device that stores, for a document group classified into pre-registered categories, a document generation model for each category representing the word frequency distribution of the words appearing in the documents, together with information on a two-dimensional or three-dimensional visualization space for visually expressing the relationship between each document in the document group and each category; a posterior probability estimation unit that creates, for each document in the document group, a posterior probability vector whose components are the probability of belonging to each registered category and the probability of belonging to none of the registered categories; and a visualization unit that performs visualization processing of the document group.
The method is characterized by executing: a step in which the posterior probability estimation unit reads the document generation model for each category from the storage device, calculates, for each document in the document group, the probability of belonging to each registered category based on the per-category document generation models, calculates the probability of belonging to none of the registered categories based on the document generation model of the category with the highest posterior probability, and creates a posterior probability vector whose components are the calculated probabilities; a step in which the visualization unit, in accordance with the probabilities represented by the created posterior probability vector, arranges the coordinates of each document and the center coordinates of each category in the visualization space whose information is stored in the storage device, such that the higher the probability that a document belongs to a category, the shorter the distance from the coordinates of that document to the center coordinates of that category; and a step of externally outputting the information of the visualization space in which the coordinates of each document and the center coordinates of each category have been arranged.

The present invention also provides a visualization device that comprises: a storage device that stores, for a document group classified into pre-registered categories, a document generation model for each category representing the word frequency distribution of the words appearing in the documents, together with information on a two-dimensional or three-dimensional visualization space for visually expressing the relationship between each document in the document group and each category; a posterior probability estimation unit that creates, for each document in the document group, a posterior probability vector whose components are the probability of belonging to each registered category and the probability of belonging to none of the registered categories; and a visualization unit that performs visualization processing of the document group.
The posterior probability estimation unit has an estimation function that reads the document generation model for each category from the storage device, calculates, for each document in the document group, the probability of belonging to each registered category based on the per-category document generation models, calculates the probability of belonging to none of the registered categories based on the document generation model of the category with the highest posterior probability, and creates a posterior probability vector whose components are the calculated probabilities. The visualization unit has a visualization function that, in accordance with the probabilities represented by the created posterior probability vector, arranges the coordinates of each document and the center coordinates of each category in the visualization space whose information is stored in the storage device, such that the higher the probability that a document belongs to a category, the shorter the distance from the coordinates of that document to the center coordinates of that category, and an output function that externally outputs the information of the visualization space in which the coordinates of each document and the center coordinates of each category have been arranged.

The present invention is also a visualization program for causing a computer to execute the inter-document relationship visualization method described above, and a computer-readable recording medium on which such a visualization program is recorded.

According to the present invention, the relationships between categorized documents can be expressed in an easy-to-understand manner.

Hereinafter, an embodiment of the present invention will be described with reference to FIGS. 1 to 6.
FIG. 1 is a block diagram showing a system that includes a visualization device according to the embodiment of the present invention.
In FIG. 1, the visualization device 10 is connected to a user terminal 30 via a communication network 20 such as the Internet.
The visualization device 10 includes a communication device 11, a storage device 12, and a processing device 13. A computer such as a server device is used for it, for example. The communication device 11 is an input/output interface, the storage device 12 is a memory, a hard disk, or the like, and the processing device 13 is a CPU or the like.
Although FIG. 1 shows a single visualization device 10, the visualization device 10 may be configured to perform distributed processing using a plurality of computers.

The storage device 12 stores, for each category, a document generation model for the document group classified into the pre-registered categories. The categories are used to classify electronic documents such as Web pages, electronic books, and patent gazettes for management purposes. Topics are one example.
As described in detail later, a document generation model represents the word frequency distribution of the words appearing in documents. The document generation models are saved in the storage device 12 in, for example, a file format.

The processing device 13 includes a document generation model construction unit 131, a posterior probability vector estimation unit 132, and a visualization unit 133. The functions of the units 131 to 133 are described later.

The user terminal 30 is a computer such as a personal computer and has the following general configuration: an input device such as a keyboard, a display device such as a computer display, a storage device such as a hard disk or memory, and a processing device such as a CPU.

Next, the characteristic processing of the visualization device 10 will be described.
[Document generation model construction process]
In the visualization device 10, the document generation model construction unit 131 constructs the document generation models using the functions of equations (1) to (3) below. The case described here is that of constructing a document generation model for each category from a document group that has been classified in advance.

First, the pre-classified document group and related quantities are defined as follows. Let the classified document group be D = [d_1, d_2, ..., d_N], where d_n is the n-th document and N is the total number of given documents. Let the set of categories be C = [1, 2, ..., k, ..., K], where K is the total number of categories. Each document is assumed to be classified into exactly one category, and the category of the n-th document is c_n ∈ C. Further, let the set of all words contained in the given documents be W = [w_1, w_2, ..., w_V], where w_j is the j-th word and V is the total number of words.

The content of a document is expressed as a word frequency vector. That is, a document representation called BOW (Bag-of-Words) is used, which ignores, for example, word order and dependency relations within the document. BOW representations are often used in text classification. Let the word frequency vector of the n-th document be x_n = (x_n1, x_n2, ..., x_nV), where x_nj is the number of occurrences of the j-th word in the n-th document.
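The word frequency vector can be illustrated with a minimal Python sketch. The function name `word_frequency_vector`, the vocabulary, and the tokenized document are hypothetical examples and are not taken from the patent; tokenization is assumed to have been done beforehand.

```python
from collections import Counter

def word_frequency_vector(tokens, vocabulary):
    """Return x_n: the count of each vocabulary word in one tokenized document."""
    counts = Counter(tokens)
    # Words outside the vocabulary are ignored; absent words count as 0.
    return [counts[w] for w in vocabulary]

vocabulary = ["sumo", "ring", "ball", "goal"]     # hypothetical word set W (V = 4)
document = ["sumo", "ring", "sumo", "wrestler"]   # hypothetical tokenized document
x_n = word_frequency_vector(document, vocabulary)
print(x_n)  # [2, 1, 0, 0]
```

Word order is discarded, which is exactly the simplification the BOW representation makes.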

As the probabilistic model of document generation for each category, that is, the document generation model, a naive Bayes model (hereinafter abbreviated as "NB model") is used, for example. The NB model works on the word frequency vector representation of a document's content. Although the NB model is used here as the document generation model, an n-gram model or the like may be used instead.
In the NB model, the generation probability of a document d_n belonging to category c_n is assumed to follow the multinomial distribution of equation (1) below.
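Equation (1) is not reproduced in this text. As a hedged reconstruction, the usual multinomial form of an NB document model, written with the symbols defined above, would be the following; the patent's exact expression (for example, its normalizing constant) may differ.

```latex
P(d_n \mid c_n) \;\propto\; \prod_{j=1}^{V} \theta_{c_n j}^{\,x_{nj}},
\qquad \theta_{kj} \ge 0,\quad \sum_{j=1}^{V} \theta_{kj} = 1
```

Here \theta_{kj} is the probability that the j-th word is generated in the k-th category.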

The parameters θ_k = [θ_k1, θ_k2, ..., θ_kV] of the document generation model of the k-th category are estimated by maximizing the likelihood L_k of equation (2) below. Equation (2) introduces a prior probability hyperparameter λ_k.

The hyperparameter λ_k in equation (2) prevents the generation probability of a word that never appears in the given documents from becoming 0, and also improves generalization performance. Generalization performance here means the performance for aggregating into upper nodes.

The maximum likelihood estimate of θ_kj is given by equation (3) below.

To prevent overfitting, the hyperparameter λ_k in equation (3) is estimated here by cross-validation.
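Equations (2) and (3) are not reproduced in this text. The sketch below assumes the common additive-smoothing form, in which λ_k is added to every word count; the patent's exact parameterization may differ. The names `estimate_theta` and `log_likelihood` are hypothetical.

```python
import math

def estimate_theta(doc_vectors, lam):
    """Smoothed word probabilities theta_k for one category.

    doc_vectors: word frequency vectors x_n of the documents in the category.
    lam: smoothing hyperparameter lambda_k (assumed > 0).
    Assumption: theta_kj = (count_j + lam) / (total count + lam * V).
    """
    vocab_size = len(doc_vectors[0])
    totals = [sum(x[j] for x in doc_vectors) for j in range(vocab_size)]
    denom = sum(totals) + lam * vocab_size
    return [(t + lam) / denom for t in totals]

def log_likelihood(x, theta):
    """Multinomial log-likelihood of x, omitting the count-dependent constant."""
    return sum(c * math.log(p) for c, p in zip(x, theta))

# Two documents over a 3-word vocabulary; the third word never occurs.
theta_k = estimate_theta([[2, 1, 0], [1, 1, 0]], lam=1.0)
print(theta_k)  # [0.5, 0.375, 0.125]
```

Because of the smoothing term, the unseen third word still receives a nonzero probability, so the log-likelihood of a new document that contains it stays finite.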

[Posterior probability vector estimation process]
In the visualization device 10, the posterior probability vector estimation unit 132 estimates the posterior probability vectors using the functions of equations (4) to (12) below.
A posterior probability vector represents, for a document, the probability of belonging to each category (each registered category) and the probability of belonging to no category (none of the classified categories) as a vector. An estimation method based on the maximum entropy method is described here as an example (see K. Nigam, J. Lafferty and A. McCallum, "Using maximum entropy for text classification", In IJCAI-99 Workshop on Machine Learning for Information Filtering, 61-67, 1999).
Let f(d_n, k) be a function representing the relationship between the n-th document and the k-th category (called a feature function). The maximum entropy method estimates the parameters so that the entropy of the probability P(k | d_n) is maximized while the constraint of equation (4) below is satisfied. The probability P(k | d_n) is the probability that the n-th document, once given, belongs to the k-th category. The function of equation (4), which is the maximum entropy condition, is stored in the storage device 12.

Then there exists a unique distribution that maximizes the entropy of the probability P(k | d_n). This distribution is expressed by equation (5) below.

In equation (5), β is the parameter to be estimated, and k takes integer values from 1 to K+1. The values 1 to K denote the existing categories, and K+1 denotes a category other than the existing ones. P(K+1 | d_n) is the probability that the n-th document belongs to none of the existing categories.
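Equation (5) is not reproduced in this text. A maximum entropy distribution under a feature constraint has an exponential (softmax) form, so the sketch below assumes P(k | d_n) proportional to exp(β f(d_n, k)) with a single scalar β. The function name `maxent_posterior` and the feature values are hypothetical.

```python
import math

def maxent_posterior(features, beta):
    """Softmax over beta * f(d_n, k) for k = 1..K+1 (assumed form of equation (5)).

    features: list of K+1 feature values for one document; the last entry
    stands for the "none of the existing categories" class.
    """
    scaled = [beta * f for f in features]
    m = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical log-likelihood features for K = 2 categories plus the extra class.
q_n = maxent_posterior([-10.0, -12.0, -15.0], beta=1.0)
```

Subtracting the maximum before exponentiating matters in practice, because document log-likelihoods are large negative numbers that would otherwise underflow.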

Next, the feature f(d_n, k) between the n-th document and the k-th (already classified) category is expressed by equation (6) below. Here, the log-likelihood of the n-th document given the k-th category under the NB model is used.

According to equation (6), the more closely the word frequencies of the n-th document resemble the document generation model of the k-th category, the higher the feature f(d_n, k) becomes, and the higher the posterior probability P(k | d_n) of belonging to the k-th category becomes. Using the log-likelihood as the feature is therefore considered appropriate for estimating the posterior probability.
When only the K existing categories are considered, the posterior probability under the NB model is given by Bayes' theorem as equation (7) below.
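Equation (7) is not reproduced in this text. Bayes' theorem restricted to the K existing categories gives the following standard form, where P(k) is the prior probability of the k-th category and P(d_n | k) is the NB generation probability; the notation is an assumption.

```latex
P_{NB}(k \mid d_n) = \frac{P(k)\, P(d_n \mid k)}{\sum_{l=1}^{K} P(l)\, P(d_n \mid l)}
```

With a uniform prior P(k) = 1/K, the prior cancels and the posterior is proportional to \exp(\log P(d_n \mid k)), that is, to the exponential of the log-likelihood feature.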

According to equation (7), P_NB(k | d_n) with a uniform prior probability P(k) is equal to P(k | d_n) obtained by the maximum entropy method with β = 1. The method can therefore be regarded as a natural extension of the NB model. Under the favorable condition of maximizing entropy, the maximum entropy method makes it possible to appropriately estimate the posterior probabilities P_NB(k | d_n), including the probability of belonging to none of the existing categories.

As the feature for estimating the posterior probability that the n-th document belongs to none of the existing categories, the log-likelihood at 3 sigma, that is, the 3-sigma value, of the following model is used: the document generation model of the category with the highest posterior probability for the n-th document under the NB model, c_n = argmax_k P_NB(k | d_n). For a normal distribution with mean μ and standard deviation σ, the probability that a random variable following the distribution falls in the interval from μ - 3σ to μ + 3σ is 0.997. Accordingly, a document whose 3-sigma value is relatively higher than its features for the existing categories is considered to have a low probability of belonging to an existing category.
To obtain the 3-sigma value, consider a random variable X that satisfies equation (8) below.

In equation (8), by the central limit theorem, the mean of X over M trials approaches a normal distribution as M becomes large. This is expressed by equations (9) to (11) below. In equation (9), X_i is the value of X in the i-th trial, and N(μ, σ) denotes the normal distribution with mean μ and standard deviation σ.

Next, the feature f(d_n, K+1) is expressed by equation (12) below. The feature f(d_n, K+1) is used to obtain the probability that the n-th document belongs to none of the existing categories, and it is the 3-sigma value given by equation (12). In equation (12), M_n is the number of words in the n-th document.
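Equations (8) to (12) are not reproduced in this text, so the following is only a sketch under stated assumptions: X is taken to be the log-probability of a single word drawn from the best category's model (value log θ_j with probability θ_j), and the 3-sigma value is taken to be the lower 3-sigma bound on the log-likelihood of an M_n-word document. The function name `three_sigma_feature` is hypothetical.

```python
import math

def three_sigma_feature(theta, num_words):
    """Lower 3-sigma bound on the log-likelihood of a num_words-word document.

    Assumption: X = log(theta_j) with probability theta_j, so the document
    log-likelihood is about num_words times the mean of X, and by the central
    limit theorem that mean is roughly N(mu, sigma / sqrt(num_words)).
    All theta_j are assumed > 0 (guaranteed by smoothing).
    """
    mu = sum(p * math.log(p) for p in theta)
    second_moment = sum(p * math.log(p) ** 2 for p in theta)
    sigma = math.sqrt(max(second_moment - mu ** 2, 0.0))
    return num_words * (mu - 3.0 * sigma / math.sqrt(num_words))

theta = [0.5, 0.3, 0.2]   # hypothetical word probabilities of the best category
f_extra = three_sigma_feature(theta, num_words=100)
```

Under these assumptions, a document whose log-likelihood under its best category falls below `f_extra` is less typical than about 99.7% of documents that the model itself would generate, which is what pushes probability toward the K+1 class.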

In this way, the visualization device 10 determines the features by equations (1) to (12) and then estimates the unknown parameter β, from which the posterior probabilities P_NB(k | d_n) are obtained. The vector formed from the obtained posterior probabilities of the n-th document is called the posterior probability vector and is written q_n = (P(1 | d_n), ..., P(K+1 | d_n)).

In other respects, the posterior probability vector estimation process is performed by the known method disclosed in the non-patent document K. Nigam, J. Lafferty and A. McCallum, "Using maximum entropy for text classification", In IJCAI-99 Workshop on Machine Learning for Information Filtering, 61-67, 1999.

[Visualization process by the posterior probability preserving embedding method]
In the visualization device 10, the visualization unit 133 further performs visualization processing of the document group based on the posterior probability vectors q_n, using the functions of equations (13) to (16) below.
First, the visualization space is described. The visualization space is the space in which the visualization of the document group is realized, and a Euclidean space is used here.
Let φ_k = (φ_k1, ..., φ_kD) be the coordinates in the visualization space of the center of the document group belonging to the k-th category, where D is the dimension of the visualization space. Let r_n = (r_n1, ..., r_nD) be the coordinates of the n-th document.

Consider embedding the relationships among the posterior probabilities in a low-dimensional, that is, two-dimensional or three-dimensional, visualization space so that the user can grasp the relationships between documents visually and intuitively. The relationships embedded in the visualization space preserve the relationships among the posterior probabilities P_NB(k | d_n) obtained by equations (1) to (12) above.
To this end, φ_k and r_n are arranged so that the Euclidean distance u_nk of equation (13) below satisfies the following conditions. In the visualization space, the higher the probability that the n-th document belongs to the k-th category, the smaller (closer) the Euclidean distance u_nk between the k-th category and the n-th document is made. Conversely, the lower that probability, the larger (farther) the Euclidean distance u_nk is made.

For the visualization space described above, consider the probability s_nk = ρ(u_nk) that the n-th document belongs to the k-th category. ρ(u_nk) is a monotonically decreasing function of u ≥ 0, with 0 ≤ ρ(u_nk) ≤ 1, ρ(0) = 1, ρ(∞) = 0, and Σ ρ(u_nk) = 1 (summed over k = 1, ..., K+1). Then, as the Euclidean distance u_nk becomes smaller, the probability s_nk approaches 1, and conversely, as u_nk becomes larger, s_nk asymptotically approaches 0. In this way, the Euclidean distance u_nk between each document and each category is adjusted in the visualization space according to the probability s_nk of belonging to a registered category.
A typical example of such a function ρ(u_nk) is shown in equation (14) below. The function of equation (14) is used in this embodiment.

(14)
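The body of equation (14) is not reproduced in this text. One function that decreases monotonically with distance and sums to 1 over the categories is a normalized Gaussian kernel, s_nk proportional to exp(-u_nk^2 / 2); the sketch below assumes that form, which may differ from the patent's equation. The function name `membership_probabilities` and the coordinates are hypothetical.

```python
import math

def membership_probabilities(r_n, centers):
    """s_nk for one document: normalized exp(-u_nk^2 / 2) over category centers.

    r_n: coordinates of the document in the visualization space.
    centers: list of category center coordinates phi_k.
    The Gaussian-kernel form is an assumption, not the patent's stated formula.
    """
    # Squared Euclidean distances u_nk^2 = ||r_n - phi_k||^2.
    sq_dists = [sum((a - b) ** 2 for a, b in zip(r_n, phi)) for phi in centers]
    m = min(sq_dists)  # shift by the minimum for numerical stability
    weights = [math.exp(-(d - m) / 2.0) for d in sq_dists]
    z = sum(weights)
    return [w / z for w in weights]

centers = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]   # hypothetical phi_1..phi_3
s_n = membership_probabilities((0.0, 0.0), centers)
```

A document sitting on a category center receives the largest probability for that category, and the probabilities of the other categories fall off with their distance.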

Let s_n = (s_n1, ..., s_n(K+1)) be the vector representing which category the n-th document belongs to. In the visualization space, it suffices for s_n to approximate the posterior probability vector q_n. Therefore, q_n and s_n are regarded as discrete probability distributions, and the distance between these two distributions is minimized over all documents, so that s_n approximates q_n. The Kullback-Leibler divergence KL(q_n, s_n), which represents the distance between the two probability distributions, is given by equation (15) below.
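Equation (15) is not reproduced in this text. The standard definition of the Kullback-Leibler divergence between two discrete distributions is KL(q, s) = Σ_k q_k log(q_k / s_k), and the sketch below computes it. The function name `kl_divergence` is hypothetical, and s_k is assumed to be positive wherever q_k is positive.

```python
import math

def kl_divergence(q, s):
    """KL(q, s) = sum_k q_k * log(q_k / s_k); terms with q_k == 0 contribute 0."""
    return sum(qk * math.log(qk / sk) for qk, sk in zip(q, s) if qk > 0.0)

q_n = [0.7, 0.2, 0.1]   # hypothetical posterior probability vector
s_n = [0.6, 0.3, 0.1]   # hypothetical probabilities induced by the layout
print(kl_divergence(q_n, s_n) >= 0.0)  # True
```

The divergence is zero only when the two distributions coincide, so minimizing it over all documents drives each s_n toward its q_n.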

In this case, the objective function L_v (to be maximized here) for obtaining the coordinates R of all documents is given by equation (16) below.

In practice, R is obtained with equation (16) by alternately optimizing L_v with respect to R = [r_1, ..., r_N] and with respect to Φ = [φ_1, ..., φ_K], repeating until both converge. When Φ is given, L_v is strictly concave (convex upward) in R, which has the favorable property that the converged value is guaranteed to be the global optimum. As the initial value of Φ, the coordinates obtained by compressing the estimated parameters Θ = [θ_1, ..., θ_K] into a low-dimensional (two-dimensional or three-dimensional) space with MDS are used. This is so that the initial values place categories with similar word occurrence probabilities close to each other in the visualization space, which makes it possible to visualize the documents in a way that reflects the relationships between categories. FIG. 2 is an explanatory diagram of this visualization technique.
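The alternating procedure can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: equation (16) is assumed to be L_v = Σ_n Σ_k q_nk log s_nk, s_nk is assumed to be the normalized Gaussian kernel exp(-u_nk^2 / 2), plain gradient ascent with a fixed step replaces whatever optimizer the patent uses, every column of Q is given its own center, and the initial centers are supplied by the caller instead of being computed with MDS. All names are hypothetical.

```python
import math

def soft_assign(r, centers):
    """s_nk for one document: normalized exp(-||r - phi_k||^2 / 2) (assumed form)."""
    sq = [sum((a - b) ** 2 for a, b in zip(r, phi)) for phi in centers]
    m = min(sq)
    w = [math.exp(-(d - m) / 2.0) for d in sq]
    z = sum(w)
    return [x / z for x in w]

def objective(Q, R, Phi):
    """Assumed L_v = sum_n sum_k q_nk * log(s_nk), to be maximized."""
    total = 0.0
    for q, r in zip(Q, R):
        s = soft_assign(r, Phi)
        total += sum(qk * math.log(sk) for qk, sk in zip(q, s) if qk > 0.0)
    return total

def embed(Q, Phi, dim=2, steps=200, lr=0.1):
    """Alternate gradient-ascent updates of document coordinates R and centers Phi."""
    Phi = [list(p) for p in Phi]  # copy so the caller's list is not modified
    # Start each document at the q-weighted average of the category centers.
    R = [[sum(qk * p[d] for qk, p in zip(q, Phi)) for d in range(dim)] for q in Q]
    for _ in range(steps):
        # R step (Phi fixed): dL/dr_n = sum_k (q_nk - s_nk) * (phi_k - r_n)
        for n, q in enumerate(Q):
            s = soft_assign(R[n], Phi)
            grad = [sum((qk - sk) * (p[d] - R[n][d]) for qk, sk, p in zip(q, s, Phi))
                    for d in range(dim)]
            R[n] = [R[n][d] + lr * grad[d] for d in range(dim)]
        # Phi step (R fixed): dL/dphi_k = sum_n (q_nk - s_nk) * (r_n - phi_k)
        S = [soft_assign(r, Phi) for r in R]
        for k in range(len(Phi)):
            grad = [sum((Q[n][k] - S[n][k]) * (R[n][d] - Phi[k][d]) for n in range(len(Q)))
                    for d in range(dim)]
            Phi[k] = [Phi[k][d] + lr * grad[d] for d in range(dim)]
    return R, Phi

Q = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]   # hypothetical q_n vectors
Phi0 = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]                 # hypothetical initial centers
R, Phi = embed(Q, Phi0)
```

A fixed number of steps stands in for the convergence test described in the text; a fuller version would stop when the change in `objective` falls below a tolerance.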

FIGS. 2(a) and 2(b) show the case of document groups (clusters) classified into five categories: “sumo”, “baseball”, “rugby”, “soccer”, and “professional wrestling”. In FIGS. 2(a) and 2(b), the documents are in practice visualized with a different hue for each category. For example, documents classified as “sumo”, “baseball”, “rugby”, “soccer”, and “professional wrestling” are colored blue, red, green, light blue, and yellow, respectively.
When the visualization processing is applied to the document groups shown in FIG. 2(a), the similarity between categories is also reflected, and the documents are arranged as shown in FIG. 2(b). In FIG. 2(b), the document groups belonging to the two categories “sumo” and “professional wrestling” are arranged adjacent to each other. This is because “sumo” and “professional wrestling” both belong to the same genre of sport, namely combat sports, so the similarity between the two categories is judged to be high.

Here, the “misclassification”, “multiple category”, and “singular document” shown in FIG. 2(b) are explained in order.
A “misclassification” is a document that has been classified incorrectly. Such a document is placed near the document group of a category different from the one under which it is registered. In FIG. 2(b), a “professional wrestling” document is shown as a misclassification and is placed near the “sumo” document group. It is therefore considered highly likely that the document belongs to the “sumo” category rather than to “professional wrestling”.

A “multiple category” document is a document that can be regarded as belonging to more than one category. In the visualization, such a document is placed between several categories. In FIG. 2(b), a document placed between “soccer” and “baseball” is shown as a multiple-category document. It is therefore considered highly likely that this document belongs not only to the “baseball” category but also to the “soccer” category.

A “singular document” is a document that has a high probability of not belonging to any registered category. In the visualization space, such a document is placed away from the places (regions) where the classified document groups gather. In FIG. 2(b), a “baseball” document is placed at a location apart from the document groups of the five categories. Singular documents can be divided into two cases according to their content. In the first case, the document contains useful content that is not found in documents belonging to the registered categories. In the second case, the document contains content unrelated to the classification system or content that is not worth managing (meaningless content). In the first case, because the content is useful, finding the singular document is of great value to the user.
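
In the embodiment these three kinds of document are found by inspecting the visualization. As a rough numerical counterpart, the following sketch flags them directly from a posterior probability vector q_n. The function `flag_document` and both thresholds are hypothetical and are not part of the specification.

```python
def flag_document(q, registered_label, singular_threshold=0.5, multi_threshold=0.3):
    """Heuristically label one document from its posterior vector.

    q: posterior vector of length K+1; the last component is the probability
       of belonging to none of the registered categories.
    registered_label: index (0 .. K-1) of the category the document is filed under.
    The thresholds are illustrative assumptions, not values from the source.
    """
    k = len(q) - 1
    if q[k] >= singular_threshold:
        return "singular document"
    strong = [i for i in range(k) if q[i] >= multi_threshold]
    if len(strong) >= 2:
        return "multiple category"
    best = max(range(k), key=lambda i: q[i])
    if best != registered_label:
        return "misclassification"
    return "consistent"
```

For example, a document filed under category 0 whose posterior mass is concentrated on category 1 would be flagged as a misclassification, matching the “professional wrestling” document placed near the “sumo” cluster.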

Next, the overall processing, from the construction of the document generation model described above through to the visualization processing, is described.
FIG. 3 is a diagram showing the operation procedure of the visualization device 10. The operation of the visualization device 10 is realized by the processing device 13 sequentially executing the visualization program stored in the storage device 12. The visualization program may be read from a computer-readable recording medium, such as a CD-ROM, a semiconductor memory, or a magnetic disk.

First, the user operates the user terminal 30 to access the visualization device 10 via the communication network 20. In the processing device 13 of the visualization device 10, the document generation model construction unit 131 then receives from the user terminal 30 a request specifying the document group to be processed (one already classified into categories). Using the functions of equations (1) to (3) described above, the document generation model construction unit 131 constructs a document generation model for each category of the requested document group (S1: this is called the “construction function”). As a result, the frequency of each word appearing in a document is obtained for every document.
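
A minimal sketch of step S1 follows. Equations (1) to (3) are not reproduced in this passage, so the sketch assumes a multinomial word model per category estimated with additive smoothing. The function name `build_category_models` and the parameter `alpha` are hypothetical.

```python
from collections import Counter


def build_category_models(docs, vocab, alpha=1.0):
    """Estimate a word distribution theta_k for each category.

    docs: iterable of (list_of_words, category_label) pairs
    vocab: list of words that make up the vocabulary
    alpha: additive smoothing constant (an assumption of this sketch)
    Returns {category: {word: probability}}.
    """
    vocab_set = set(vocab)
    counts = {}
    for words, category in docs:
        # Accumulate per-category word frequencies, ignoring out-of-vocabulary words.
        counts.setdefault(category, Counter()).update(
            w for w in words if w in vocab_set
        )
    models = {}
    for category, counter in counts.items():
        total = sum(counter.values())
        denom = total + alpha * len(vocab)
        models[category] = {w: (counter[w] + alpha) / denom for w in vocab}
    return models
```

Each resulting distribution sums to 1 over the vocabulary, so it can serve as the per-category parameter vector that the later estimation step reads from storage.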

Subsequently, the posterior probability vector estimation unit 132 reads the document generation model for each category constructed in S1 from the storage device 12. Using the functions of equations (4) to (12) described above, the posterior probability vector estimation unit 132 then estimates the posterior probability vector q_n on the basis of the document generation models that were read (S2: this is called the “estimation function”). In more detail, in accordance with the maximum entropy condition (equation (4)) read from the storage device 12, the posterior probability vector estimation unit 132 estimates the posterior probability vector q_n using the document generation model of the category for which the posterior probability P_NB(k|d_n) of the document is highest. This makes it possible to express, for each document, the probabilities of belonging to each registered category and the probability of belonging to no registered category as a vector.

Next, using the functions of equations (13) to (16) described above, the visualization unit 133 performs visualization processing of each document in the document group requested in S1, on the basis of the posterior probability vectors q_n estimated in S2 (S3: this is called the “visualization function”). In more detail, the visualization unit 133 adjusts the distances between each document and each category in the visualization space according to the probabilities of belonging to the registered categories expressed by the posterior probability vector q_n estimated in S2, and places them accordingly. This makes it possible to reflect the similarity between documents and categories in the visualization space.

Then, as output processing, the visualization unit 133 transmits the visualization result of S3 to the user terminal 30 via the communication network (S4: this is called the “output function”). As a result, for example, the user terminal 30 shows on its display device a visualization space that reflects the relationships between the registered categories (see FIG. 2). The user can therefore discover new knowledge about the document group, such as misclassifications, multiple-category documents, and singular documents, and can grasp the relationships between categorized documents more accurately. The specific knowledge obtained from misclassifications, multiple-category documents, and singular documents is described later.

[Evaluation of the visualization of the present invention]
Next, to evaluate the effectiveness of the present invention, visualization was performed using Japanese web pages classified under the top-level categories of the Open Directory Project (ODP, http://dmoz.org/).
The web pages were sampled as follows. First, web pages containing 50 or fewer words were removed from the Japanese web pages registered in ODP. Web pages classified under more than one category were also removed. Then 100 web pages were sampled at random for each category. The sampling yielded 1300 web pages belonging to the following categories.

The 13 categories used were arts, online-shop, computer, sports, news, business, recreation, health, reference, home, society, science, and regional. The games category, which is also one of the ODP top-level categories, was excluded because it did not contain 100 eligible web pages.
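
The sampling procedure described above can be sketched as follows. The page representation (a dict with `words` and `categories` keys), the function name `sample_pages`, and the fixed seed are assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_pages(pages, per_category=100, max_excluded_words=50, seed=0):
    """Filter pages and draw a fixed-size random sample per category.

    pages: list of dicts such as {"words": [...], "categories": [...]}
           (a hypothetical representation, not defined in the source)
    Pages with max_excluded_words or fewer words, and pages listed under more
    than one category, are dropped. Categories with fewer than per_category
    remaining pages are skipped, as the games category was.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for page in pages:
        if len(page["words"]) <= max_excluded_words:
            continue
        if len(page["categories"]) != 1:
            continue
        by_category[page["categories"][0]].append(page)
    return {
        category: rng.sample(items, per_category)
        for category, items in by_category.items()
        if len(items) >= per_category
    }
```

With 13 categories that each pass the size check and `per_category=100`, the result holds 1300 pages, matching the sample size reported in the evaluation.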

[Visualization result]
FIG. 4 is a diagram showing the visualization result for the sampled web pages. In FIG. 4, the web pages are color-coded by category. For example, the “arts” web pages are red, the “online-shop” web pages are blue, the “computer” web pages are pink, and the “sports” web pages are green.
In FIG. 4, web pages belonging to the same category form clusters (groups of documents belonging to the same category). There are 14 clusters in total: 13 clusters for the 13 categories, plus 1 cluster for singular documents.
Looking at the positional relationship of the clusters, the clusters of related categories are placed close to each other. For example, the “online-shop” and “business” clusters, and the “sports” and “health” clusters, are placed close to each other.

Among these clusters, as shown in FIG. 4, there is a web page a that is placed apart from the category under which it is classified. This is considered to be a misclassification. Specifically, web page a in FIG. 4 is classified as “regional” but is considered highly likely to belong to the “health” category. On checking its content, web page a turned out to be the page of a hospital. If web page a is classified from the viewpoint that local residents near the hospital use it, it can indeed be called a “regional” page. However, the web pages belonging to the “health” category include many hospital pages. Taking this actual state of the classification into account, it seemed more appropriate to include web page a in the “health” category.
In this way, a web page that may have been misclassified could be found through the visualization result of FIG. 4.

Also in FIG. 4, web page b is placed almost midway between the “sports” and “recreation” clusters, even though it is classified as “recreation”. Web page b was a page about the soccer lottery (toto). A soccer lottery can certainly be called a kind of recreation, but it is also closely related to sports. This is considered to be the reason why web page b appeared almost midway between the “sports” and “recreation” clusters. Many people who are interested in soccer would probably also be interested in web page b about the soccer lottery.
In this way, the visualization result of FIG. 4 makes it possible to find web pages that may belong to multiple categories. It also makes it possible to find web pages that, judging from their content, are difficult to assign to a single category. Furthermore, when the category classification is reviewed, it becomes possible to check which kinds of content the classification should emphasize.

Furthermore, web pages that are singular documents can be found from the visualization result of FIG. 4. In FIG. 4, for example, a web page introducing e-mail magazines was one such page. This web page had been classified under the computer category. However, an examination of the text used on the page showed that it contained not only computer-related content but also a large amount of other content. It was therefore found that the web page is not really about computers.

[Comparison with conventional methods]
Next, the 1300 web pages obtained by the sampling described above were visualized with the MDS method (see Non-Patent Document 1), which is the conventional method mentioned earlier. Two variants were used: one that visualizes the word frequency vectors after dimensionality reduction, and one that visualizes the posterior probability vectors described above after dimensionality reduction.
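
For reference, a baseline of this kind can be sketched with classical (Torgerson) MDS. The specification only cites “the MDS method”, so the choice of the classical variant with Euclidean distances is an assumption, as is the function name `classical_mds`. The input `X` would be either the word frequency vectors or the posterior probability vectors, one row per document.

```python
import numpy as np


def classical_mds(X, dim=2):
    """Embed the rows of X into `dim` dimensions with classical MDS.

    Builds the full N x N matrix of squared Euclidean distances, double-centers
    it, and keeps the eigenvectors with the largest eigenvalues.
    """
    X = np.asarray(X, dtype=float)
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T  # squared distances, N x N
    n = len(X)
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ d2 @ centering  # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)  # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:dim]
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))  # clip tiny negative values
    return eigvecs[:, top] * scale
```

Because the method works on all pairwise document distances, it does not use the category structure at all, which is the limitation the comparison in this section examines.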

FIG. 5 is a diagram showing the result of visualizing the word frequency vectors after dimensionality reduction by the MDS method. In FIG. 5, the clusters of the categories are crowded together and the web pages overlap one another, so the similarity relationships among the web pages are hard to see. As a result, the characteristics of the documents cannot be discerned. This is because the visualization was performed without reflecting the relationships between categories.

FIG. 6 is a diagram showing the result of visualizing the posterior probability vectors after dimensionality reduction by the MDS method. In FIG. 6, a cluster was formed for each category, and misclassifications could be found. However, the clusters were crowded together, and it was difficult to identify similarity relationships between categories, multiple-category documents, and singular documents.

As this comparison with the conventional method shows, in both cases (FIGS. 5 and 6) the clusters are crowded together, so the similarity relationships among the web pages are harder to see than in FIG. 4, which was obtained with the present invention.
In addition, the conventional method performs the visualization by comparing documents with one another without taking the number of categories K into account, so for N documents its computational complexity is O(N²). The present invention performs the visualization by comparing each document with each category, so its computational complexity is O(NK). Consequently, with the conventional method the amount of computation grows with the square of the number of documents, whereas with the present invention it grows far more slowly. The visualization device 10 is therefore easier to apply to large document collections.
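
To make the difference concrete, the following sketch counts comparisons for the evaluation setting described earlier (N = 1300 documents, 13 registered categories plus one extra component for singular documents). The function names are hypothetical, and the counts ignore constant factors and iteration counts.

```python
def pairwise_comparisons(n_docs):
    """Document-to-document comparisons, on the order of N^2."""
    return n_docs * n_docs


def doc_category_comparisons(n_docs, n_categories):
    """Document-to-category comparisons, on the order of N * K."""
    return n_docs * n_categories


n_docs = 1300
n_categories = 13 + 1  # 13 registered categories plus the singular-document component

print(pairwise_comparisons(n_docs))                    # 1690000
print(doc_category_comparisons(n_docs, n_categories))  # 18200
```

In this setting the document-to-category count is roughly two orders of magnitude smaller, and the gap widens as N grows while K stays fixed.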

The present invention is not limited to the embodiment described above. The data structure of the storage device 12 and the order of program processing can be modified in various ways using known techniques.

Although the visualization device 10 has been described as being configured as a server device, it may instead be configured as a single computer such as a personal computer.

FIG. 1 is a block diagram showing a system including a visualization device according to an embodiment of the present invention.
FIG. 2 is an explanatory diagram showing the visualization technique used by the visualization device of FIG. 1.
FIG. 3 is a diagram showing the operation procedure of the visualization method in the visualization device of FIG. 1.
FIG. 4 is a diagram showing an example of a result visualized by the visualization device of FIG. 1.
FIG. 5 is a diagram showing the result of visualizing word frequency vectors after dimensionality reduction by the MDS method.
FIG. 6 is a diagram showing the result of visualizing posterior probability vectors after dimensionality reduction by the MDS method.

Explanation of symbols

10 Visualization device
12 Storage device
13 Processing device
20 Communication network
30 User terminal

Claims

A method for visualizing relationships between documents, performed by a visualization device comprising:
a storage device that stores, for a document group classified into pre-registered categories, a document generation model representing, for each category, the word frequency distribution of words appearing in the documents, and information on a two-dimensional or three-dimensional visualization space for visually representing the relationship between each document in the document group and each category;
a posterior probability estimation unit that creates, for each document in the document group, a posterior probability vector whose components are the probabilities of belonging to each registered category and the probability of belonging to none of the registered categories; and
a visualization unit that performs visualization processing of the document group,
the method being characterized in that
the posterior probability estimation unit executes a step of
reading the document generation model for each of the categories from the storage device and, for each document d_n in the document group,
using a feature function f(d_n, k) that takes a larger value the higher the correlation between the word frequencies of the document d_n and the model parameters of category k, and a feature function f(d_n, K+1), given by the following equation, that represents the relationship between the document d_n and a category other than the registered categories,
μ: mean of the parameters of c_n
σ: standard deviation of the parameters of c_n
M_n: number of words in document d_n
c_n: document generation model of the category for which the posterior probability of document d_n is highest
so as to satisfy the constraint of the following expression,
N: total number of documents
c_n: category assigned to document d_n
calculating, such that the entropy of P(k|d_n) is maximized, the probability P(k|d_n) that the document d_n belongs to each registered category k and the probability P(K+1|d_n) that it belongs to none of the registered categories, and creating a posterior probability vector q_n of the document d_n having the calculated probabilities as components, and
the visualization unit executes
a step of, using a function ρ that converts the distance u_nk between the coordinates r_n of the document d_n and the center coordinates φ_k of category k into a probability s_nk that the document d_n belongs to category k, the function ρ being such that the probability s_nk is greater than zero, decreases monotonically with the distance u_nk, and sums to 1 over k from 1 to K+1, determining, for an objective function L_v that minimizes the sum over all documents of the distance KL_n between the vector s_n obtained from the probabilities s_nk and the posterior probability vector q_n of each document d_n, the coordinate set R and the center coordinate set Φ by repeating the optimization of L_v with respect to the coordinate set R of the document group, whose elements are the coordinates r_n, and of L_v with respect to the center coordinate set Φ of the category group, whose elements are the center coordinates φ_k, until convergence; and
a step of externally outputting information on the visualization space in which the coordinate set R of the document group and the center coordinate set Φ of the category group are arranged.

The inter-document relationship visualization method according to claim 1, wherein the feature function f(d_n, k) representing the relationship between the document d_n and each registered category k is given by the following equation,
x_nj: word frequency in document d_n
θ_kj: parameter of the document generation model of category k
and the probability P(k|d_n) and the probability P(K+1|d_n) are obtained by estimating the parameter β in the following equation.

The inter-document relationship visualization method according to claim 1 or 2, wherein the function ρ is given by the following equation,
and the objective function L_v is given by the following equation.

A visualization device comprising:
a storage device that stores, for a document group classified into pre-registered categories, a document generation model representing, for each category, the word frequency distribution of words appearing in the documents, and information on a two-dimensional or three-dimensional visualization space for visually representing the relationship between each document in the document group and each category;
a posterior probability estimation unit that creates, for each document in the document group, a posterior probability vector whose components are the probabilities of belonging to each registered category and the probability of belonging to none of the registered categories; and
a visualization unit that performs visualization processing of the document group,
wherein the posterior probability estimation unit has an estimation function of
reading the document generation model for each of the categories from the storage device and, for each document d_n in the document group,
using a feature function f(d_n, k) that takes a larger value the higher the correlation between the word frequencies of the document d_n and the model parameters of category k, and a feature function f(d_n, K+1), given by the following equation, that represents the relationship between the document d_n and a category other than the registered categories,
μ: mean of the parameters of c_n
σ: standard deviation of the parameters of c_n
M_n: number of words in document d_n
c_n: document generation model of the category for which the posterior probability of document d_n is highest
so as to satisfy the constraint of the following expression,
N: total number of documents
c_n: category assigned to document d_n
calculating, such that the entropy of P(k|d_n) is maximized, the probability P(k|d_n) that the document d_n belongs to each registered category k and the probability P(K+1|d_n) that it belongs to none of the registered categories, and creating a posterior probability vector q_n of the document d_n having the calculated probabilities as components, and
wherein the visualization unit has
a visualization function of, using a function ρ that converts the distance u_nk between the coordinates r_n of the document d_n and the center coordinates φ_k of category k into a probability s_nk that the document d_n belongs to category k, the function ρ being such that the probability s_nk is greater than zero, decreases monotonically with the distance u_nk, and sums to 1 over k from 1 to K+1, determining, for an objective function L_v that minimizes the sum over all documents of the distance KL_n between the vector s_n obtained from the probabilities s_nk and the posterior probability vector q_n of each document d_n, the coordinate set R and the center coordinate set Φ by repeating the optimization of L_v with respect to the coordinate set R of the document group, whose elements are the coordinates r_n, and of L_v with respect to the center coordinate set Φ of the category group, whose elements are the center coordinates φ_k, until convergence; and
an output function of externally outputting information on the visualization space in which the coordinate set R of the document group and the center coordinate set Φ of the category group are arranged.

The visualization device according to claim 4, wherein the feature function f(d_n, k) representing the relationship between the document d_n and each registered category k is given by the following equation,
x_nj: word frequency in document d_n
θ_kj: parameter of the document generation model of category k
and the probability P(k|d_n) and the probability P(K+1|d_n) are obtained by estimating the parameter β in the following equation.

The visualization device according to claim 4 or 5, wherein the function ρ is given by the following equation,
and the objective function L_v is given by the following equation.

A visualization program for causing a computer to execute the inter-document relationship visualization method according to any one of claims 1 to 3.

A computer-readable recording medium recording a visualization program for causing a computer to execute the inter-document relationship visualization method according to any one of claims 1 to 3.