JP2010267017A - Device, method and program for classifying document

Info

Publication number: JP2010267017A
Application number: JP2009116899A
Authority: JP
Inventors: Noriaki Kawamae; 徳章川前; Takeshi Yamada; 武士山田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2009-05-13
Filing date: 2009-05-13
Publication date: 2010-11-25

Abstract

PROBLEM TO BE SOLVED: To associate documents having similar contents with one another, across a large number of documents, while reflecting changes in an author's interests.

SOLUTION: A document classification device 1000 includes: a storage part 2 that stores a plurality of documents, prescribed calculation expressions used in a probability distribution model, and a prescribed calculation end condition; an initial setting part 3 that randomly sets initial values of the document classes to which the respective documents belong and the topic classes to which the words belong; a document class evaluation part 4 that estimates, for each document, the document class to which the document belongs; a topic class evaluation part 5 that estimates, for each word, the topic class to which the word belongs; a convergence decision part 6 that makes the document class evaluation part 4 and the topic class evaluation part 5 repeat the estimation until the prescribed calculation end condition is satisfied; and an output part 7 that outputs a calculation result including the contents of the document classes.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a technique that, for a large number of documents (electronic data such as Web pages, papers, patent specifications, and blog articles), associates documents having similar contents with one another and associates authors who created documents having similar contents with one another.

In general, documents to be searched carry metadata such as the author, the author's affiliation, cited references, and the creation time. Methods are known that use such metadata to associate documents having similar contents and to associate authors who created documents having similar contents.

The Author-Topic model disclosed in Non-Patent Document 1 associates authors with topics by giving, for each author, the likelihood that the author handles (selects) each topic as a probability. The distribution of these topic selection probabilities for an author is called a topic selection probability distribution. The association is realized by treating the topic with the highest probability in an author's topic selection probability distribution as the topic most relevant to that author.

The Author-Persona-Topic model disclosed in Non-Patent Document 2 considers a plurality of personalities, called personas, for each author and introduces a topic selection probability distribution for each persona, so that a single author can have a plurality of topic selection probability distributions.

Non-Patent Document 1: M. Steyvers, P. Smyth, M. Rosen-Zvi, and T. Griffiths: "Probabilistic Author-Topic Models for Information Discovery", ACM KDD (International conference on Knowledge Discovery and Data mining), pp. 306-315 (2004).
Non-Patent Document 2: D. Mimno and A. McCallum: "Expertise Modeling for Matching Papers with Reviewers", ACM KDD (International conference on Knowledge Discovery and Data mining), pp. 500-509 (2007).

However, in the method of Non-Patent Document 1, each author is fixed to a single topic selection probability distribution. The method therefore cannot reflect changes in an author's interests.

In the method of Non-Patent Document 2, each author can have a plurality of topic selection probability distributions, but because personas are defined separately within each author, personas cannot be associated across different authors. In other words, when entirely unconnected authors (authors with no co-authorship, friendship, or similar relationship) write documents with similar content, a persona is set on a different scale (with different keywords) for each author, so it cannot be determined whether the two authors' personas are similar. Consequently, unconnected authors who create documents having similar contents cannot be associated with each other.

The present invention has been made in view of the problems described above. One object is to associate documents having similar contents with one another, across a large number of documents, while reflecting changes in an author's interests. Another object is to associate authors who created documents having similar contents with one another.

To solve the above problems, the present invention is a document classification device that performs calculation on a plurality of documents, which are electronic data, based on a predetermined probability distribution model. The model probabilistically assigns each document to one of a plurality of document classes, each being a set of documents with similar contents; probabilistically assigns each word constituting the documents to one or more of a plurality of topic classes, each being a set of words with similar contents; assigns each document author to one or more document classes; and associates each document class with one or more topic classes. The device comprises: a storage unit that stores the plurality of documents, a first calculation formula used in the probability distribution model to estimate, for each document, the document class to which it belongs, a second calculation formula used in the probability distribution model to estimate, for each word, the topic class to which it belongs, and a predetermined calculation end condition; an initial setting unit that randomly sets initial values of the document class to which each document belongs and the topic class to which each word belongs; a document class evaluation unit that, based on the first calculation formula, calculates for each document the probability of belonging to each of all the document classes and estimates the document class with the highest probability as the document class to which the document and its author belong; a topic class evaluation unit that, based on the second calculation formula, calculates for each word the probability of belonging to each of all the topic classes and estimates the topic class with the highest probability as the topic class to which the word belongs; a convergence determination unit that causes the document class evaluation unit and the topic class evaluation unit to repeat the estimation until the predetermined calculation end condition is satisfied; and an output unit that, when the convergence determination unit determines that the predetermined calculation end condition has been satisfied, outputs a calculation result including the documents and authors belonging to each document class.

According to this invention, each document author can be made to belong to one or more document classes while each document class is associated with one or more topic classes. In other words, one author can be indirectly associated with one or more topic classes, so when one author is associated with a plurality of topic classes, the change in that author's interests is reflected. On that basis, documents having similar contents can be associated with one another (each document is classified into one of the plurality of document classes).

The present invention is also a document classification device that performs calculation on a plurality of documents, which are electronic data, based on a predetermined probability distribution model. The model probabilistically assigns each document to one of a plurality of document classes, each being a set of documents with similar contents; probabilistically assigns each word constituting each document to one or more of a plurality of topic classes, each being a set of words with similar contents; probabilistically assigns each document author to one of the author classes, each being a set of authors who created documents with similar contents; probabilistically assigns each author class to one or more document classes; and associates each document class with one or more topic classes. The device comprises: a storage unit that stores the plurality of documents, a first calculation formula used in the probability distribution model to estimate, for each document, the document class to which it belongs, a second calculation formula used in the probability distribution model to estimate, for each word, the topic class to which it belongs, a third calculation formula used in the probability distribution model to estimate, for each author, the author class to which the author belongs, and a predetermined calculation end condition; an initial setting unit that randomly sets initial values of the author class to which each author belongs, the document class to which each document belongs, and the topic class to which each word belongs; an author class evaluation unit that, based on the third calculation formula, calculates for each author the probability of belonging to each of all the author classes and estimates the author class with the highest probability as the author class to which the author belongs; a document class evaluation unit that, based on the first calculation formula, calculates for each document the probability of belonging to each of all the document classes and estimates the document class with the highest probability as the document class to which the document and its author class belong; a topic class evaluation unit that, based on the second calculation formula, calculates for each word the probability of belonging to each of all the topic classes and estimates the topic class with the highest probability as the topic class to which the word belongs; a convergence determination unit that causes the author class evaluation unit, the document class evaluation unit, and the topic class evaluation unit to repeat the estimation until the predetermined calculation end condition is satisfied; and an output unit that, when the convergence determination unit determines that the predetermined calculation end condition has been satisfied, outputs a calculation result including the contents of the author classes and the contents of the document classes.

According to this invention, in addition to the effects of the invention described above, each document author can be made to belong to one of the author classes, and each author class can be associated with one or more document classes. In other words, by placing authors who created documents having similar contents in the same author class, those authors can be associated with one another.

The present invention is further characterized in that the convergence determination unit uses, as the predetermined calculation end condition, either that the error between the previous and the latest probability distributions calculated by at least one of the author class evaluation unit, the document class evaluation unit, and the topic class evaluation unit is equal to or less than a predetermined threshold, or that the number of repeated calculations has reached a predetermined number.

According to this invention, setting a specific and appropriate end condition, namely either the threshold on the error of the probability distribution or the number of repeated calculations, allows the condition to be chosen to suit the purpose, such as calculation accuracy or processing time.
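The end condition described above can be sketched as follows. This is a minimal illustration only: the source does not specify how the error between two distributions is measured, so the largest absolute difference is an assumption, and the names `should_stop`, `threshold`, and `max_iterations` are hypothetical.

```python
def should_stop(prev_dist, curr_dist, iteration, threshold=1e-4, max_iterations=1000):
    """Return True when either end condition holds.

    prev_dist / curr_dist: probability values from the previous and latest pass.
    The max-absolute-difference error measure is an assumption for illustration.
    """
    error = max(abs(p - c) for p, c in zip(prev_dist, curr_dist))
    # Stop when the change is small enough OR the iteration budget is used up.
    return error <= threshold or iteration >= max_iterations
```

A tight threshold favors accuracy, while a small iteration limit bounds the processing time.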

The present invention is also a document classification program for causing a computer to function as the document classification device described above.
According to this invention, a computer on which the program is installed can realize each function based on the program.

According to the present invention, documents having similar contents can be associated with one another, across a large number of documents, while reflecting changes in an author's interests. In addition, authors who created documents having similar contents can be associated with one another.

FIG. 1 is an explanatory diagram of the document/topic classification model in the first embodiment.
FIG. 2 is a configuration diagram of the document classification device in the first embodiment.
FIG. 3 is a configuration diagram of the document class evaluation unit.
FIG. 4 is a configuration diagram of the topic class evaluation unit.
FIG. 5 is a flowchart showing the processing of the document/topic classification method in the first embodiment.
FIG. 6 is a flowchart showing the document class processing.
FIG. 7 is a flowchart showing the document class sampling processing.
FIG. 8 is a flowchart showing the topic class update processing.
FIG. 9 is a flowchart showing the topic class sampling processing.
FIG. 10 is an explanatory diagram of the author/document/topic classification model in the second embodiment.
FIG. 11 is a configuration diagram of the document classification device in the second embodiment.
FIG. 12 is a flowchart showing the processing of the document/topic classification method that takes author classes into account in the second embodiment.
FIG. 13 is a flowchart showing the author class update processing.
FIG. 14 is a diagram showing the distribution of authors and document classes in an example.
FIG. 15 is a diagram showing the distribution of author classes and document classes in an example.
FIG. 16 is a diagram showing the distribution of documents and topic classes in an example.
FIG. 17 is a diagram showing the distribution of document classes and topic classes in an example.

Modes for carrying out the present invention (hereinafter, "embodiments") are described below with reference to the drawings, divided into a first embodiment and a second embodiment. In the following, a token corresponds to an abstraction of a word. That is, a document is composed of a sequence of tokens, and it becomes a concrete document when concrete words fill the tokens. In the description of the embodiments, "evaluation" means "estimation".

<First Embodiment>
In the first embodiment, a set of documents having similar contents is called a document class, and a set of similar words (tokens) is called a topic class. Documents having similar contents are associated with one another by evaluating the document class and topic class to which each document and each author belong. Documents belonging to the same document class can be judged to be highly related (to handle similar contents), so classifying each document and each author into document classes realizes the association between documents having similar contents.

Examples of document classes are "sports" and "politics". Examples of topic classes associated with the document class "sports" are "baseball" and "soccer". Examples of words belonging to the topic class "baseball" are "professional baseball" and "major league".

Specifically, the probability $P(c_d = j \mid c_{\setminus d}, z, \gamma, \delta)$ that the document class $c_d$ to which document $d$ belongs is $j$ is calculated first. Here, $c_{\setminus d}$ is the set of document classes excluding that of document $d$, $z$ is the set of topics, and $\gamma$ and $\delta$ are hidden parameters (described in detail later). For each document $d$, the probability is calculated for all document classes, and the document class with the highest probability is associated with the document as the document class to which it belongs.

Similarly, the probability $P(z_{di} = t \mid j, z_{\setminus di}, w, \beta, \delta)$ that the topic class related to the $i$-th token $di$ of document $d$ is $t$ is calculated. Here, $z_{di}$ represents the topic class to which token $di$ belongs (hereinafter also referred to as token (word) $z_{di}$), $z_{\setminus di}$ is the set of topics obtained by removing $z_{di}$ from $z$, $w$ is the set of words, and $\beta$ is a hidden parameter (described in detail later). For each word, the probability is calculated for all topic classes, and the topic class with the highest probability is associated with the word as the topic class to which it belongs. Because a topic class is associated with each word in a document, a plurality of topic classes may be associated with one document.
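The "pick the class with the highest probability" step applies to both documents and tokens and can be sketched as below. The helper name `most_probable_class` and the probability values are hypothetical, chosen only for illustration.

```python
def most_probable_class(probabilities):
    """Return the 1-based index of the class with the highest probability."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return best_index + 1

# Illustrative values only: one document scored against J = 3 document classes,
# and two tokens each scored against T = 2 topic classes.
doc_class = most_probable_class([0.1, 0.7, 0.2])
token_topics = [most_probable_class(p) for p in [[0.6, 0.4], [0.2, 0.8]]]
```

Because each token receives its own topic class, the two tokens here end up in different topic classes even though they belong to the same document.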

The first embodiment describes a method (document/topic classification method: LIT0) that simultaneously performs association between documents and topics (associating a document with the topics it handles) and association between documents (associating documents having similar contents).

First, the document/topic classification model in the first embodiment is described with reference to FIG. 1. In the first embodiment, the association is performed using a model that assumes items a. to e. below.

a. One document is probabilistically associated with one document class.
b. One word (token) in a document is associated with one topic class. Different words may be associated with the same topic class.
c. One document class can select one or more topic classes; that is, a plurality of topic classes can be associated with one document class.
d. One topic class can select one or more tokens. Because a plurality of words (tokens) may be associated with one topic class, different words (tokens) may be selected by the same topic class (for example, words $w_1$ and $w_3$ in FIG. 1).
e. Each document class has its own probability distribution for topic class selection. In the example of FIG. 1, even though document class $j_1$ and document class $j_2$ both select topic class $t_4$, the probability that $j_1$ selects $t_4$ is not necessarily the same as the probability that $j_2$ selects $t_4$.
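One way to picture assumptions a. to e. is as a toy generative process: a document's class picks a topic class for each token, and that topic class picks the word. The sketch below is only an illustration under that reading; the probability tables, class labels, and function names are hypothetical and do not come from the source.

```python
import random

# Hypothetical toy parameters; none of these values come from the source.
topic_given_class = {          # assumption e: one topic distribution per document class
    "j1": {"t1": 0.7, "t4": 0.3},
    "j2": {"t4": 0.9, "t2": 0.1},
}
word_given_topic = {           # assumption d: a topic class can select several words
    "t1": {"w1": 0.5, "w3": 0.5},
    "t2": {"w2": 1.0},
    "t4": {"w4": 1.0},
}

def draw(dist, rng):
    """Draw one key from a {key: probability} mapping."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys], k=1)[0]

def generate_document(doc_class, length, rng):
    """Generate (topic, word) pairs for one document of the given class (assumption a)."""
    tokens = []
    for _ in range(length):
        topic = draw(topic_given_class[doc_class], rng)   # assumption c
        word = draw(word_given_topic[topic], rng)         # assumptions b and d
        tokens.append((topic, word))
    return tokens

rng = random.Random(0)
doc = generate_document("j1", 5, rng)
```

Both `j1` and `j2` can select `t4`, but with different probabilities (0.3 versus 0.9), which mirrors the FIG. 1 example in item e.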

(Configuration)
Next, the configuration of the document classification device of the first embodiment is described. As shown in FIG. 2, the document classification device 1000 is a computer device and comprises an input unit 1, a storage unit 2, a calculation unit 100, and an output unit 7.

The input unit 1 is a means for inputting information and is composed of, for example, a keyboard, a mouse, and a disk drive device.

The storage unit 2 is a means for storing information and is composed of, for example, a general hard disk device. It stores information such as the operation program, the document data to be associated (including metadata such as author information), the parameters, and the formulas. As described above, document data is electronic data such as Web pages, papers, patent specifications, and blog articles.

The calculation unit 100 is a main control device composed of, for example, a CPU (Central Processing Unit) and a RAM (Random Access Memory), and includes an initial setting unit 3, a document class evaluation unit 4, a topic class evaluation unit 5, and a convergence determination unit 6.
The document class evaluation unit 4 includes a sample author selection unit 41, a sample document selection unit 42, an initial setting unit 43, and a document class update unit 44 (described in detail later).
The topic class evaluation unit 5 includes a sample author selection unit 51, a sample document selection unit 52, an initial setting unit 53, and a topic class update unit 54 (described in detail later).

The output unit 7 is a means for outputting information and is, for example, a graphics board (output interface) and a monitor connected to it.

(Processing)
Next, the processing of the document classification device 1000 is described with reference to FIG. 5 (and FIGS. 1 to 4 as appropriate). The step numbers in FIG. 5 correspond to the reference numerals of the components in FIG. 2 (for example, step S1 and input unit 1, step S3 and initial setting unit 3; the same applies to FIG. 12 and FIG. 11).

[Input unit 1]
First, the input unit 1 accepts the user's input of the number J of document classes and the number T of topic classes into which the data is to be classified (step S1).

[Initial setting unit 3]
Next, the initial setting unit 3 randomly sets (selects) the initial values of the document class to which each document belongs and of the topic classes (step S3). These initial values can be chosen arbitrarily, but each document is given exactly one document class. In addition, each word in each document is given one topic class to which the word belongs.
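The random initialization in step S3 can be sketched as follows. The function name `initialize`, the list-based data layout, and the `seed` parameter are assumptions made for illustration; classes are numbered 1 to J and 1 to T.

```python
import random

def initialize(documents, num_doc_classes, num_topic_classes, seed=None):
    """Randomly assign one document class per document and one topic class per token.

    documents: list of token lists. Classes are numbered 1..J and 1..T.
    """
    rng = random.Random(seed)
    # Exactly one document class per document.
    doc_classes = [rng.randint(1, num_doc_classes) for _ in documents]
    # Exactly one topic class per token of each document.
    token_topics = [[rng.randint(1, num_topic_classes) for _ in doc] for doc in documents]
    return doc_classes, token_topics

docs = [["baseball", "league"], ["election", "party", "vote"]]
c, z = initialize(docs, num_doc_classes=2, num_topic_classes=3, seed=0)
```

Here `c` holds one class per document and `z` mirrors the shape of `docs`, with one topic class per token.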

[Document class evaluation unit 4]
Next, the document class evaluation unit 4 performs document class update processing (step S4). Specifically, it uses Gibbs sampling to evaluate the probability $P(c_d = j \mid c_{\setminus d}, z, \gamma, \delta)$ that the document class to which document $d$ belongs is $j$ ($j = 1, 2, 3, \ldots, J$). In Gibbs sampling, one document to be sampled is selected, the conditional probability (equation (3c) below) is calculated, and the document class is updated by comparing a randomly generated value (random number) with the value of the conditional probability. This process is repeated until convergence (determined in step S6).

The document class update processing of step S4 is now described with reference to FIG. 6.
First, the sample author selection unit 41 selects, from the set A of authors, one author a to be sampled who has not yet been selected (step S41).
Next, the sample document selection unit 42 selects, from the documents created by author a, one document d to be sampled that has not yet been selected (step S42).

In the subsequent processing, the probability $P(c_d = j \mid c_{\setminus d}, z, \gamma, \delta)$ is evaluated for the selected document $d$ with respect to every document class $j$ ($j = 1, 2, 3, \ldots, J$).

First, the initial setting unit 43 sets the initial values of the parameters used in the subsequent probability calculation (step S43). For example, the total number of documents stored in the storage unit 2 is set as the value of the parameter $n$. The initial values of the hidden parameters $\beta$, $\gamma$, and $\delta$ used for document class evaluation are also determined. Here, $\gamma$ is a J-dimensional vector (J is the total number of document classes entered through the input unit 1), and $\delta$ is a T-dimensional vector (T is the total number of topic classes entered through the input unit 1). The elements of $\gamma$ and $\delta$ are set to arbitrary values that satisfy relational expressions (1) and (2) below.

$0 < \gamma_j \le n_{aj \setminus d}, \quad j = 1, 2, 3, \ldots, J$ ... Equation (1)
$0 < \delta_t \le n_{jt \setminus d}, \quad t = 1, 2, 3, \ldots, T$ ... Equation (2)
Here, $n_{aj \setminus d}$ is the total number of documents created by author $a$ among all documents belonging to document class $j$, excluding the document $d$ being sampled. $n_{jt \setminus d}$ is the number of times topic class $t$ has been selected in all documents belonging to document class $j$, excluding the document $d$ being sampled.
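The two counts can be tallied directly from the current assignments. The sketch below is an illustration only: the function name `counts_excluding` and the list-based layout are hypothetical, and it assumes one author per document for simplicity.

```python
from collections import Counter

def counts_excluding(d, a, doc_authors, doc_classes, token_topics):
    """Tally n_aj and n_jt with the sampled document (index d) left out.

    doc_authors[i]: author of document i (one author per document is assumed here).
    doc_classes[i]: current document class of document i.
    token_topics[i]: current topic class of each token in document i.
    """
    n_aj = Counter()   # j -> number of documents by author a in class j
    n_jt = Counter()   # (j, t) -> number of tokens with topic t in documents of class j
    for i, j in enumerate(doc_classes):
        if i == d:
            continue                      # exclude the document being sampled
        if doc_authors[i] == a:
            n_aj[j] += 1
        for t in token_topics[i]:
            n_jt[(j, t)] += 1
    return n_aj, n_jt

n_aj, n_jt = counts_excluding(
    d=0, a="a1",
    doc_authors=["a1", "a1", "a2"],
    doc_classes=[1, 1, 2],
    token_topics=[[1, 2], [2, 2], [3]],
)
```

A `Counter` returns 0 for combinations that never occur, which is exactly the situation in which a count of zero can arise for some class or topic.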

Next, the document class update unit 44 performs document class sampling processing (step S44). The document class sampling processing of step S44 is described with reference to FIG. 7 (and other figures as appropriate).

As shown in FIG. 7, the document class update unit 44 first selects one document class $j$ from the J document classes (step S441) and performs the following processing.

The probability $P(c_d = j \mid c_{\setminus d}, z, \gamma, \delta)$ is calculated by equation (3c) below (the first calculation formula) (step S442). To that end, equation (3a) below is calculated first.

Next, for $j = 1, 2, \ldots, J$, equation (3b) below is calculated to obtain a cumulative probability, that is, the sum of the probabilities of the document classes whose value of $j$ is smaller than that of the class itself.

Then, for all $j = 1, 2, \ldots, J$, normalization is performed by equation (3c). As a result, every probability $P(c_d = j \mid c_{\setminus d}, z, \gamma, \delta)$ takes a value of 1 or less.
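The cumulative-sum and normalization steps can be sketched as below. The sketch takes the unnormalized per-class values of equation (3a) as given input rather than computing them, since only the summing and normalizing steps are described in prose here. The function name is hypothetical, and including class j itself in its running total is an assumption.

```python
from itertools import accumulate

def cumulative_normalized(weights):
    """Turn unnormalized class weights into normalized cumulative probabilities.

    weights[j-1] stands in for the unnormalized value of equation (3a) for class j.
    Including class j itself in its running total is an assumption.
    """
    running = list(accumulate(weights))      # running totals over j = 1..J
    total = running[-1]
    return [r / total for r in running]      # every value is at most 1

cum = cumulative_normalized([2.0, 1.0, 1.0])   # [0.5, 0.75, 1.0]
```

After normalization the last entry equals 1, so a uniform random number below 1 always falls under some class.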

Here, the symbol ∝ means that the value on the left-hand side is proportional to the value on the right-hand side. $n_{dt}$ is the number of times topic class t has been selected in document d, and $n_d$ is the total number of tokens contained in document d. $\gamma_j$ denotes the j-th component of γ, and $\delta_t$ denotes the t-th component of δ.
The hidden parameters γ and δ set by the initial setting unit 3 serve as smoothing parameters: when $n_{aj \setminus d}$ or $n_{jt \setminus d}$ becomes 0 in equation (3a), they keep the denominator of equation (3a) from becoming 0, which would make the probability impossible to calculate.
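The smoothing role described above can be illustrated with a generic additive-smoothing ratio. This is only a sketch: equation (3a) is not reproduced in this text, so the ratio below is an assumed, simplified stand-in, and the name `smoothed_ratio` is hypothetical.

```python
def smoothed_ratio(count, total, prior, prior_sum):
    """Generic additive smoothing: (count + prior) / (total + prior_sum).

    Assumption: a simplified stand-in, not the patent's equation (3a).
    With prior > 0 the denominator stays positive even when all counts are 0.
    """
    return (count + prior) / (total + prior_sum)


# Without smoothing, 0 / 0 would raise ZeroDivisionError.
# With gamma_j = 0.5 for each of J = 4 classes the ratio is well defined.
J = 4
gamma = [0.5] * J
p_empty = smoothed_ratio(count=0, total=0, prior=gamma[0], prior_sum=sum(gamma))
```

With all counts at zero, each class simply receives its share of the prior mass, so the computation can proceed instead of failing.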

After step S442, a uniform random number R with 0 ≤ R < 1 is generated (step S443). If the value of the probability $P(c_d = j \mid c_{\setminus d}, z, \gamma, \delta)$ calculated in step S442 is greater than R (Yes in step S444), the class to which document d belongs is changed to j (step S445). If the value of the probability calculated in step S442 is R or less (No in step S444), the document class is not updated (S445 is skipped) and the processing proceeds to step S446.

If not all document classes have been updated for document d, that is, if there is a document class for which the processing from step S442 onward has not been performed (No in step S446), a document class j′ that has not yet been examined is selected (step S441) and steps S442 to S445 are repeated. When the update processing has been completed for all document classes j = 1, 2, ..., J (Yes in step S446), the processing returns to FIG. 6, ends step S44, and proceeds to step S45.
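One common way to draw a class from unnormalized weights follows the same three stages as equations (3a) to (3c): compute a weight per class, accumulate the weights, then normalize and compare with a uniform random number. The sketch below is an assumption about how such a draw can be realized (inverse-CDF sampling), not a transcription of FIG. 7. The weights stand in for equation (3a), which is not reproduced here, and `draw_class` is a hypothetical name.

```python
from itertools import accumulate


def draw_class(weights, r):
    """Pick an index by inverse-CDF sampling.

    weights: unnormalized, non-negative weights, one per class
        (assumed stand-in for equation (3a)).
    r: uniform random number with 0 <= r < 1.
    Returns the 0-based index of the first class whose normalized
    cumulative probability exceeds r.
    """
    cumulative = list(accumulate(weights))      # running sums
    total = cumulative[-1]
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    for index, value in enumerate(cumulative):
        if value / total > r:                   # normalized to at most 1
            return index
    return len(weights) - 1                     # guard against rounding
```

In practice `r` would come from a generator such as `random.random()`; passing it in explicitly keeps the function deterministic and easy to check.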

The document class evaluation unit 4 selects, from the documents created by author a, a document d′ that has not yet been selected as a sampling target as the new document to be sampled (No in step S45 of FIG. 6, then step S42), and repeats the processing of steps S43 and S44. This processing is repeated for all documents created by author a (until Yes in step S45).

Next, if not all authors have been sampled, that is, if there is an author for whom sampling has not yet been completed (No in step S46), the document class evaluation unit 4 returns to step S41 and selects an author a′ to be sampled who has not yet been selected. The processing of steps S42 to S45 is then repeated. When sampling has been completed for all authors (Yes in S46), the processing returns to FIG. 5, ends step S4, and moves on to step S5.
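The nested iteration of steps S41 to S46 (every author once, and every document of that author once) can be sketched as follows. The data layout and the names `sweep_documents` and `update_document` are hypothetical; the callback stands in for steps S43 and S44.

```python
def sweep_documents(docs_by_author, update_document):
    """Visit every author once and every document of that author once.

    docs_by_author: dict mapping an author id to a list of document ids.
    update_document: callback(author, doc) standing in for steps S43 and S44.
    Returns the number of documents visited.
    """
    visited = 0
    for author, docs in docs_by_author.items():   # S41 / S46: next author
        for doc in docs:                          # S42 / S45: next document
            update_document(author, doc)
            visited += 1
    return visited
```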

[Topic class evaluation unit 5]
After step S4, the topic class evaluation unit 5 performs topic class update processing (step S5). Specifically, using a technique based on Gibbs sampling in the same way as the document class evaluation unit 4, it evaluates the probability $P(z_{di} = t \mid j, z_{\setminus di}, w, \beta, \delta)$ that the i-th token di in a document d belongs to topic class t (t = 1, 2, 3, ..., T).

The topic class update processing of step S5 is described below with reference to FIG. 8.
First, the sample author selection unit 51 selects, from the set A of authors, one author a to be sampled who has not yet been selected (step S51).
Next, the sample document selection unit 52 selects, from the documents created by author a, one document d to be sampled that has not yet been selected (step S52).

Next, the initial setting unit 53 sets the initial values of the parameters used in the subsequent probability calculations (step S53). For example, the total number of documents stored in the storage unit 2 is set as the value of the parameter n. The initial values of the hidden parameters β, γ, and δ used for evaluating the topic class are also determined. Here, γ and δ are the same as γ and δ in the document class evaluation unit 4 and are set to arbitrary values satisfying equations (1) and (2). β is a V-dimensional vector (V is the number of distinct words (word types) contained in all documents), and each element of β is set to an arbitrary value satisfying the following relation.

$0 < \beta_v \le n_{tv \setminus di}, \quad v = 1, 2, 3, \ldots, V$ ... Equation (4)
Here, $n_{tv \setminus di}$ is the total number of tokens that currently select word v among the tokens belonging to topic class t, excluding the i-th token of the document d being sampled.

Next, the topic class update unit 54 performs topic class sampling processing (step S54). Specifically, the following processing is repeated for all tokens in document d (all tokens in document d except those for which the probability has already been calculated).

The topic class sampling processing of step S54 is described below with reference to FIG. 9.
First, the topic class update unit 54 selects one token di (taken to be the i-th word) in document d that has not yet been sampled (step S541).

Next, one topic class t that has not yet been sampled is selected (step S542).
Next, the probability $P(z_{di} = t \mid j, z_{\setminus di}, w, \beta, \delta)$ is calculated by equation (5c) (the second calculation formula) (step S543). To do so, equation (5a) is calculated first.

Next, for t = 1, 2, ..., T, equation (5b) is calculated to obtain the sum of the probabilities of the topic classes whose value of t is smaller than that of the class itself.

Then, for all t = 1, 2, ..., T, normalization is performed by equation (5c). As a result, every probability $P(z_{di} = t \mid j, z_{\setminus di}, w, \beta, \delta)$ takes a value of 1 or less.

Here, $n_{jt \setminus di}$ is the total number of tokens that currently select topic class t among all documents belonging to document class j, excluding the i-th token of the document d being sampled. $\beta_v$ denotes the v-th component of β.

The hidden parameters β and δ set by the initial setting unit 53 serve as smoothing parameters: when $n_{tv \setminus di}$ or $n_{jt \setminus di}$ becomes 0 in equation (5a), they keep the denominator of equation (5a) from becoming 0, which would make the probability impossible to calculate.
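Counts such as $n_{jt \setminus di}$ exclude the token currently being sampled. A common bookkeeping pattern for this, shown here as an assumption rather than as the patent's stated implementation, is to remove the token from the counts, choose its new topic, and then add it back. The names `reassign_token` and `choose_topic` are hypothetical.

```python
from collections import Counter


def reassign_token(topic_counts, old_topic, choose_topic):
    """Move one token from old_topic to a newly chosen topic.

    topic_counts: Counter mapping topic id -> number of tokens assigned.
    choose_topic: callback that receives the counts *without* the current
        token and returns the new topic id (stand-in for steps S543 to S546).
    """
    topic_counts[old_topic] -= 1        # exclude the token being sampled
    new_topic = choose_topic(topic_counts)
    topic_counts[new_topic] += 1        # re-add the token under its new topic
    return new_topic


counts = Counter({0: 2, 1: 1})
chosen = reassign_token(counts, old_topic=0, choose_topic=lambda c: 1)
```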

Next, a uniform random number R with 0 ≤ R < 1 is generated (step S544). If the value of the probability $P(z_{di} = t \mid j, z_{\setminus di}, w, \beta, \delta)$ calculated in step S543 is greater than R (Yes in step S545), the topic class to which the i-th token $z_{di}$ of document d belongs is changed to t (step S546). If the value of the probability calculated in step S543 is R or less (No in step S545), the topic class is not updated (S546 is skipped) and the processing proceeds to step S547.

In step S547, if not all topic classes have been updated, that is, if there is a topic class for which the probability has not yet been calculated for word di (No in step S547), that topic class is selected (step S542) and steps S543 to S546 are repeated, so that the probability $P(z_{di} = t \mid j, z_{\setminus di}, w, \beta, \delta)$ is calculated for all topic classes t.

When the sampling processing has been completed for all topic classes (Yes in step S547), a token in the currently selected document d for which the conditional probability has not yet been calculated is selected (No in step S548, then step S541), and the processing of steps S542 to S547 is repeated. In this way, the probability $P(z_{di} = t \mid j, z_{\setminus di}, w, \beta, \delta)$ of belonging to each topic class (t = 1, 2, ..., T) can be calculated for every token (word) in document d. After Yes in step S548, the processing returns to FIG. 8, ends step S54, and proceeds to step S55.

Next, if any of the documents created by author a has not yet been selected as a sampling target (No in step S55), the topic class evaluation unit 5 selects a document d′ that has not yet been selected as the new document to be sampled (step S52) and repeats steps S53 and S54. This is done for all documents created by author a (until Yes in step S55).

After Yes in step S55, if there is an author for whom sampling has not yet been completed (No in S56), an author a′ for whom sampling has not yet been completed is newly selected (step S51), and steps S52 to S55 are repeated in the same way. When sampling has been completed for all authors (Yes in step S56), the processing returns to FIG. 5, ends step S5, and proceeds to step S6.

[Convergence determination unit 6]
The convergence determination unit 6 determines whether the probability distributions of equations (3c) and (5c) calculated by the document class evaluation unit 4 and the topic class evaluation unit 5 have converged (step S6), and repeats the processing of the document class evaluation unit 4 and the topic class evaluation unit 5 (steps S4 and S5) until these probability distributions converge (while step S6 is No).

When the probability distributions have converged (Yes in step S6), the obtained probability distributions of the document classes and topic classes are output to the output unit 7. Whether the probability distributions have converged may be determined by comparing the probability distribution of the previous step with the one obtained this time and judging convergence when the error between them is at or below a predetermined threshold. Alternatively, the number of repetitions of the document class update processing (step S4) and the topic class update processing (step S5) may be counted, and when a preset number of repetitions is reached, the probability distributions may be judged to have converged and the processing ended. The end condition of the calculation can thus be set according to the purpose, such as calculation accuracy or processing time.
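The two end conditions described above can be sketched in one helper. The text presents them as alternatives and does not name an error measure, so combining them and using the largest absolute difference are both assumptions; `has_converged` is a hypothetical name.

```python
def has_converged(previous, current, threshold, iteration, max_iterations):
    """Return True when the distributions are close or the budget is spent.

    previous, current: sequences of probabilities of equal length.
    The error measure (largest absolute difference) is an assumption;
    the text only says that an error is compared with a threshold.
    """
    if iteration >= max_iterations:
        return True
    error = max(abs(p - q) for p, q in zip(previous, current))
    return error <= threshold
```

A tight threshold favors accuracy, while a small iteration budget favors processing time, which matches the trade-off the text mentions.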

[Output unit 7]
After Yes in step S6, the output unit 7 outputs the document classes and topic classes based on the probability distributions obtained from the convergence determination unit 6 (step S7).
For the document class, the j that maximizes the probability $P(c_d = j \mid c_{\setminus d}, z, \gamma, \delta)$ evaluated for document d is output as the document class to which document d belongs.

For the topic class, the t that maximizes the probability $P(z_{di} = t \mid j, z_{\setminus di}, w, \beta, \delta)$ calculated for the i-th token (word) di in document d is taken as the topic class $z_{di}$ to which word di belongs. The set obtained by selecting this maximizing t for every word in document d (i = 1, 2, ..., $N_d$) is output as the topic classes selected by document d.
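The output rule above amounts to taking, for each document or token, the class with the largest evaluated probability. A minimal sketch follows, with hypothetical function names and with class indices starting at 1 to match j = 1, ..., J and t = 1, ..., T.

```python
def most_probable_class(probabilities):
    """Return the 1-based class index with the largest probability."""
    best = max(range(len(probabilities)), key=lambda k: probabilities[k])
    return best + 1


def topics_for_document(token_probabilities):
    """Apply the same rule to every token of one document.

    token_probabilities: one list of T topic probabilities per token.
    """
    return [most_probable_class(p) for p in token_probabilities]
```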

Thus, according to the document classification device 1000 of the first embodiment, each author of a document can be associated with one or more document classes while each document class is associated with one or more topic classes. In other words, one author can be indirectly associated with one or more topic classes. When one author is associated with a plurality of topic classes, this reflects a change in that author's interests, and on that basis documents with similar contents can be associated with one another (the documents can be classified into a plurality of document classes).

<Second Embodiment>
Next, the document classification device 1000a (1000) of the second embodiment is described. Elements (configuration and processing) that are the same as in the document classification device 1000 of the first embodiment are given the same reference signs, and duplicate description is omitted as appropriate.

In the second embodiment, in addition to the contents of the first embodiment, a class of authors who create documents with similar contents is defined as an author class; an author belongs to an author class, and an author class selects document classes. This makes it possible to judge that authors belonging to the same author class are closely related (have similar interests).

Specifically, the probability $P(s_a = h \mid s_{\setminus a}, c, \alpha, \gamma)$ that the author class (community) $s_a$ to which author a belongs is h is calculated. Here, $s_{\setminus a}$ is the set of author classes excluding author a, c is the set of document classes, and α is a hidden parameter. For each author, the probability is calculated for all author classes, and the h with the highest probability is associated with author a as the author class to which author a belongs.

The second embodiment describes a technique (author/document/topic classification method: LIT1) that simultaneously performs association between documents and topics (associating a document with the topics it deals with), association between documents (associating documents with similar contents), and association between authors (associating authors who created documents with similar contents).

As shown in FIG. 10, the model for associating authors, documents, and topic classes in the second embodiment (the author/document/topic classification model) assumes the following f. to h. in addition to a. to e. of the first embodiment.
f. One author is probabilistically associated with one author class.
g. Authors who deal with similar contents (a set of authors who select the same document class) belong to the same author class (community).
h. One author class can select one or more document classes.

(Configuration)
Next, the configuration of the document classification device 1000a of the second embodiment is described. As shown in FIG. 2, compared with the document classification device 1000 of the first embodiment, the document classification device 1000a of the second embodiment has a configuration in which an author class evaluation unit 8 for evaluating author classes is added to the arithmetic unit 100a (100). The author class evaluation unit 8 includes a sample author selection unit 81, an initial setting unit 82, and a document class update unit 83 (described in detail later).

(Processing)
Next, the processing of the document classification device 1000a is described with reference to FIG. 12. In outline, this document/topic classification method that takes author classes into account adds author class update processing (step S8), which evaluates the class to which an author belongs, to the document class update processing (step S4) and topic class update processing (step S5) of the first embodiment, and repeats the probability calculation and update processing as in the first embodiment. It thereby evaluates the probability $P(s_a = h \mid s_{\setminus a}, c, \alpha, \gamma)$ that the author class (community) to which an author a belongs is h (h = 1, 2, 3, ..., H), the probability $P(c_d = j \mid c_{\setminus d}, z, \gamma, \delta)$ that the document class to which document d belongs is j (j = 1, 2, 3, ..., J), and the probability $P(z_{di} = t \mid j, z_{\setminus di}, w, \beta, \delta)$ that the i-th token in a document belongs to (is related to) topic class t (t = 1, 2, 3, ..., T).

[Input unit 1]
First, the input unit 1 accepts the user's input of the number H of author classes, in addition to the number J of document classes and the number T of topic classes into which the data is to be classified (step S1a).

[Initial setting unit 3]
Next, the initial setting unit 3 sets (selects) initial values for the author class to which each author belongs, the document class to which each document belongs, and the topic classes (step S3a). As in the first embodiment, these initial values can be chosen arbitrarily, with one document class and a plurality of topic classes selectable for one document. Each author belongs to exactly one author class.
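The initial assignment of step S3a can be sketched as a random draw per author, per document, and per token. The data layout, the fixed seed, and the name `initialize` are assumptions made for illustration.

```python
import random


def initialize(authors, docs, tokens_per_doc, H, J, T, seed=0):
    """Randomly assign initial classes.

    Each author gets one author class in 1..H, each document one document
    class in 1..J, and each token of each document one topic class in 1..T.
    tokens_per_doc: dict mapping a document id to its number of tokens.
    """
    rng = random.Random(seed)
    author_class = {a: rng.randint(1, H) for a in authors}
    doc_class = {d: rng.randint(1, J) for d in docs}
    token_topic = {d: [rng.randint(1, T) for _ in range(tokens_per_doc[d])]
                   for d in docs}
    return author_class, doc_class, token_topic
```

Because topics are assigned per token, a document with several tokens naturally ends up with several topic classes while keeping a single document class.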

[Author class evaluation unit 8]
Next, the author class evaluation unit 8 performs author class update processing (step S8). The author class update processing of step S8 is described below with reference to FIG. 13. Like the document class evaluation unit 4 and the topic class evaluation unit 5 of the first embodiment, the author class evaluation unit 8 uses the Gibbs sampling technique to evaluate the probability $P(s_a = h \mid s_{\setminus a}, c, \alpha, \gamma)$ that the author class (community) to which an author a belongs is h (h = 1, 2, 3, ..., H).

First, the sample author selection unit 81 selects, from the set A of authors, one author a to be sampled who has not yet been selected (step S81).
Next, the initial setting unit 82 sets the initial values of the parameters used in the subsequent probability calculations (step S82). For example, the initial values of the hidden parameters α and γ used for the evaluation are determined. As in the first embodiment, γ is a J-dimensional vector (J is the total number of document classes input through the input unit 1) and is set to arbitrary values satisfying equation (1) above. α is an H-dimensional vector (H is the total number of author classes input through the input unit 1) and is set to arbitrary values satisfying the following relation (6).
$0 < \alpha_h \le n_{h \setminus a}, \quad h = 1, 2, 3, \ldots, H$ ... Equation (6)
Here, $n_{h \setminus a}$ is the number of documents belonging to author class h, excluding the documents created by author a.

Next, the document class update unit 83 selects one author class h that has not yet been selected (for which the update processing has not been completed) (step S83) and performs the following update processing.
The probability $P(s_a = h \mid s_{\setminus a}, c, \alpha, \gamma)$ is calculated by equation (7c) (the third calculation formula) (step S84). To do so, equation (7a) is calculated first.

Next, for h = 1, 2, ..., H, equation (7b) is calculated to obtain the sum of the probabilities of the author classes whose value of h is smaller than that of the class itself.

Then, for all h = 1, 2, ..., H, normalization is performed by equation (7c). As a result, every probability $P(s_a = h \mid s_{\setminus a}, c, \alpha, \gamma)$ takes a value of 1 or less.

Here, $n_{hj \setminus a}$ is the number of documents that currently belong to document class j among the documents created by all authors belonging to author class h, excluding the documents created by author a. $n_{aj}$ is the number of documents created by author a that belong to document class j, and $n_a$ is the number of documents created by author a.

Next, the document class update unit 83 generates a uniform random number R with 0 ≤ R < 1 (step S85). If the value of the probability $P(s_a = h \mid s_{\setminus a}, c, \alpha, \gamma)$ calculated in step S84 is greater than R (Yes in step S86), the class to which author a belongs is updated to h (step S87). If the value of the probability calculated in step S84 is R or less (No in step S86), the author class is not updated (S87 is skipped) and the processing proceeds to step S88.

Steps S83 to S87 are repeated until the update has been completed for all h (h = 1, 2, ..., H) (while step S88 is No, until it becomes Yes). After Yes in step S88, if not all authors have been sampled, that is, if there is an author a′ for whom sampling has not yet been completed (No in step S89), the processing returns to step S81. An author a′ for whom sampling has not yet been completed is then selected (step S81), and steps S82 to S88 are repeated, so that the probability $P(s_a = h \mid s_{\setminus a}, c, \alpha, \gamma)$ is evaluated for all authors (Yes in step S89). The processing then returns to FIG. 12, ends step S8, and proceeds to step S4. The processing of steps S4 and S5 is the same as in the first embodiment.

[Convergence determination unit 6]
As shown in FIG. 12, after step S5, the convergence determination unit 6 determines whether each of the probability distributions calculated in steps S8, S4, and S5 has converged (step S6a), and the processing of steps S8, S4, and S5 is repeated until the probability distributions converge (until Yes in step S6a). When each probability distribution has converged, the processing ends (Yes in step S6a), and the obtained probability distributions of the author classes, document classes, and topic classes are output to the output unit 7. Whether each probability distribution has converged may be determined by comparing the probability distribution of the previous step with the one obtained this time and judging convergence when the error between them is at or below a predetermined threshold. Alternatively, when a preset number of repetitions is reached, the probability distributions may be judged to have converged and the processing ended.
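The repetition order of the second embodiment (S8, then S4, then S5, then the check in S6a) can be sketched as a small driver loop. The callbacks, the name `run`, and the default iteration cap are hypothetical placeholders for the units described above.

```python
def run(update_author_classes, update_document_classes, update_topic_classes,
        converged, max_iterations=10000):
    """Repeat S8 -> S4 -> S5 until the convergence check (S6a) says stop.

    converged: callback(iteration) -> bool, standing in for step S6a.
    Returns the number of iterations performed.
    """
    for iteration in range(1, max_iterations + 1):
        update_author_classes()       # step S8
        update_document_classes()     # step S4
        update_topic_classes()        # step S5
        if converged(iteration):      # step S6a
            return iteration
    return max_iterations
```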

[Output unit 7]
After Yes in step S6a, the output unit 7 outputs the author classes, document classes, and topic classes based on the probability distributions obtained from the convergence determination unit 6 (step S7a).
For the author class, the h that maximizes the probability $P(s_a = h \mid s_{\setminus a}, c, \alpha, \gamma)$ evaluated for author a is output as the author class to which the author belongs.
The document classes and topic classes are output as described for the output unit 7 of the first embodiment.

Thus, according to the document classification device 1000a of the second embodiment, in addition to the effects of the first embodiment, each author of a document can be made to belong to one of the author classes, and each author class can be associated with one or more of the document classes. In other words, by placing authors who created documents with similar contents in the same author class, those authors can be associated with one another.

That is, according to the document classification device 1000a of the second embodiment, sets of documents on similar topics (document classes) and sets of authors who created documents with similar contents (author classes) can be extracted from a large number of documents and the data on the authors who created them. This clustering can be performed from the document and author information alone, without specifying names for the document classes (for example, "hobbies", "fashion", "gourmet", or "sports" when the target documents are blogs or Web pages, or "chemistry", "physics", "mechanical engineering", or "information engineering" when they are patent specifications or academic papers). This makes the following possible.

(Search for documents with similar contents)
By extracting the documents that belong to the same document class as a query document, documents with contents similar to the query document can be retrieved.
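Once every document has a document class, this lookup reduces to filtering by class. A minimal sketch, assuming the assignments are held in a dictionary; the name `similar_documents` is hypothetical.

```python
def similar_documents(doc_class, query_doc):
    """Return the other documents assigned to the same document class.

    doc_class: dict mapping a document id to its document class.
    """
    target = doc_class[query_doc]
    return sorted(d for d, c in doc_class.items()
                  if c == target and d != query_doc)
```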

(Community search)
Clustering blog articles or Web pages classifies their authors into sets of authors who deal with similar contents. By extracting the authors who belong to the same author class as oneself, a user can extract a community of authors who deal with topics of interest or search for authors with similar interests.

(Collaborative filtering (recommendation system))
By applying this technique with information on the products users have purchased in place of documents, and with the users who purchased the products in place of authors, sets of similar products can be extracted as document classes and sets of users who purchased similar products can be extracted as author classes. This makes it possible to recommend to a user the products purchased by other users who belong to the same author class. In addition, by identifying the class of users who purchased products in the document class to which a given product belongs, users who are likely to like that product can be identified.
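The recommendation use can be sketched as follows: suggest the items bought by users in the same class that the target user has not bought. The data layout and the name `recommend` are assumptions; the class assignments would come from the clustering described above.

```python
def recommend(purchases, user_class, user):
    """Suggest items bought by users in the same class that `user` lacks.

    purchases: dict user -> set of item ids (plays the role of documents).
    user_class: dict user -> class id (plays the role of the author class).
    """
    own = purchases[user]
    peers = [u for u, c in user_class.items()
             if c == user_class[user] and u != user]
    suggested = set()
    for peer in peers:
        suggested |= purchases[peer] - own
    return sorted(suggested)
```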

In the prior art, when documents were classified using only the words in the documents, a single author who had created a large number of documents could have an excessive influence on the result. In contrast, the document classification device 1000 of the first and second embodiments associates document classes with each author or each author class, so the influence of any single author can be kept from becoming too large.

Furthermore, by creating a program to be executed by the computer that constitutes the document classification device 1000 and installing it on the computer, the computer can realize each of the functions based on that program.

This concludes the description of the embodiments, but aspects of the present invention are not limited to them. The specific configuration and processing can be changed as appropriate without departing from the gist of the present invention.

Next, the results of a qualitative and a quantitative evaluation of the proposed models (the models of the first and second embodiments) are described, focusing on the extraction of authors' interests from documents. Two measures were used for the quantitative evaluation: test set perplexity and KL divergence.

The document data were academic papers: 364 authors (researchers), 14,907 documents, and 27,748 distinct words. The number of topics T was fixed at 100. The parameters were set to α = 1/T, β = 0.1, γ = 1 (LIT), and δ = 1. Gibbs sampling was run for 10,000 iterations; the convergence determination unit 6 judged that the calculation had converged when the iteration count reached 10,000 and ended the processing.
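
For readers who want to reproduce a comparable setup, the settings above might be collected as in the following sketch. The class `ExperimentConfig` and the function `has_converged` are hypothetical names; only the numeric values come from the text, and the stopping rule shown is the fixed iteration count used in this experiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Settings reported for the experiment (field names are illustrative)."""
    num_topics: int = 100          # T
    beta: float = 0.1
    gamma: float = 1.0
    delta: float = 1.0
    max_iterations: int = 10_000   # Gibbs sampling iterations

    @property
    def alpha(self) -> float:
        # alpha is tied to the number of topics: alpha = 1 / T.
        return 1.0 / self.num_topics


def has_converged(iteration: int, config: ExperimentConfig) -> bool:
    """Stop once the iteration count reaches the configured maximum."""
    return iteration >= config.max_iterations


config = ExperimentConfig()
print(config.alpha, has_converged(10_000, config))
```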

[Qualitative evaluation]
For the method of the second embodiment, FIG. 14 shows part of the distribution over the document classes selected by each author, grouped by author class, and FIG. 15 shows the distribution over the document classes selected by each author class. These distributions can be calculated using the following formula (8) (for FIG. 14) and formula (9) (for FIG. 15), respectively.

Here, c is a document class, a is an author, and n_ac is the number of documents created by author a whose document class is c. γ_c denotes the c-th dimension of the hidden parameter γ. The hat (^) indicates the average of the estimates obtained in the experiments.

Here, c is a document class, s is an author class, and n_sc is the number of documents created by all authors in author class s whose document class is c.
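
A minimal sketch of how a distribution of this kind might be computed from the counts and the hidden parameter defined above. It assumes the usual smoothed form (n_ac + γ_c) / Σ_c' (n_ac' + γ_c'); that form is an assumption consistent with the definitions, not a transcription of formula (8) or (9), and the function name and sample counts are hypothetical.

```python
def smoothed_row_distribution(counts, prior):
    """Normalize one row of counts after adding a per-class pseudo-count."""
    if len(counts) != len(prior):
        raise ValueError("counts and prior must have the same length")
    smoothed = [n + g for n, g in zip(counts, prior)]
    total = sum(smoothed)
    return [value / total for value in smoothed]


# Hypothetical counts n_ac for one author over three document classes,
# with gamma_c = 1 for every class as in the reported settings.
n_ac = [3, 0, 1]
gamma = [1.0, 1.0, 1.0]
print(smoothed_row_distribution(n_ac, gamma))
```

The same function would apply to the counts n_sc of an author class by passing the counts summed over the authors of that class.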

In FIG. 14 and FIG. 15, the document classes of the documents created by each author, or by the authors belonging to the same author class, are counted, and each row is normalized. FIG. 14 shows the author/document class distribution, and FIG. 15 shows the author class/document class distribution. In FIG. 14, the vertical axis corresponds to the ID of each author, with the author classes separated by lines l1, and the horizontal axis corresponds to the document class ID of the documents created by each author. In FIG. 15, the vertical axis corresponds to the author class ID, and the horizontal axis again corresponds to the document class ID of the documents created by each author. The portion enclosed by frame e1 in FIG. 15 corresponds to the author classes shown in FIG. 14.

In FIG. 14 and FIG. 15, the filled area is larger when the number of corresponding documents is large and smaller when it is small. These figures therefore show which document classes are characteristic of each author class.

Similarly, FIG. 16 shows part of the distribution over the topic classes selected in each document, grouped by document class, and FIG. 17 shows the distribution over the topic classes selected by each document class. These distributions can be calculated using the following formula (10) (for FIG. 16) and formula (11) (for FIG. 17), respectively.

Here, z is a topic class and d is a document. n_dz is the number of times topic class z was selected in document d. δ_z denotes the z-th dimension of the hidden parameter δ.

Here, z is a topic class and c is a document class. n_cz is the number of times topic class z was selected in all documents of document class c.

FIG. 16 is a graph showing the document/topic class distribution, and FIG. 17 is a graph showing the document class/topic class distribution. In FIG. 16, the vertical axis corresponds to the ID of each document, with the document classes separated by lines l2, and the horizontal axis corresponds to the topic class ID. In FIG. 17, the vertical axis corresponds to the ID of each document class, and the horizontal axis corresponds to the topic class ID. The portion enclosed by frame e2 in FIG. 17 corresponds to the document classes shown in FIG. 16.

In FIG. 16 and FIG. 17, the filled area is larger when the number of corresponding topics is large and smaller when it is small. These figures therefore show which topic classes are characteristic of each document class.

[Quantitative evaluation]
Next, to evaluate the proposed models (the models of the first and second embodiments) quantitatively as generative models, the test set perplexity is calculated using the parameters estimated by each model, including the comparative examples, and the results are compared.

The proposed models are characterized in that the generation probability of a new document can be calculated for each author (first embodiment) and for the authors as a whole (second embodiment).

For example, in the first embodiment, the probability P(c_d = j | c_{\d}, z, γ, δ) is calculated for each document created by each author. This gives the generation probability, that is, the proportions in which the author produces documents belonging to each document class. In addition, calculating the probability P(z_di = t | j, z_{\di}, w, β, δ) shows what sequence of words, belonging to which topic classes, a document newly created by that author will consist of. The probabilities calculated in the first embodiment therefore constitute a model that can calculate the generation probability of a new document for each author.

In the second embodiment, the probability P(s_a = h | s_{\a}, c, α, γ) is calculated in addition to the calculations of the first embodiment. This makes it possible to evaluate which document classes an author belonging to a given author class is likely to produce. Because this gives the probabilities of the document class of the next new document produced by an author in a given author class, and of the topic classes contained in that document, a model that calculates the generation probability of a new document for the authors as a whole is obtained.

The quantitative evaluation is therefore performed for the second embodiment as follows. First, 10% of all documents are selected at random as test documents. The remaining 90% of the documents are used as training documents. The parameters are then estimated from the training documents. Finally, the perplexity of the words in the test documents is calculated using these parameters. Perplexity is a measure used in the field of language modeling; the lower the value, the better the language model. The test set perplexity PPX is given by the following formula (12).

Here, θ_t is the selection probability of topic class t given to document d by the model, φ_tv is the occurrence probability of word v, W is the number of words, and D_text is the number of documents. The method of Non-Patent Document 1 (AT), the method of Non-Patent Document 2 (APT), and the method of the second embodiment can each give a topic selection probability for each author. To calculate the test set perplexity when those topic classes are used, formula (12) was modified into the following formula (13).

Here, θ_{t|j} is the selection probability of topic class t under document class j given to document d by the model, and θ_{j|a} is the selection probability of document class j by author a given to document d by the model. The evaluation was performed five times with different test document sets, and the average perplexity was calculated. In practice, the averages of the estimates defined in formulas (8) and (9) were used for θ_{t|j} and θ_{j|a}. Table 1 shows the results. The second embodiment has generally low perplexity, which indicates that it performs well as a model.
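
The following sketch illustrates a perplexity calculation of this kind. It assumes the common definition, the exponential of the negative average log-likelihood per test word, with each word's likelihood given by Σ_t θ_t φ_tv, and it assumes that the author-specific topic probabilities are obtained by summing θ_{t|j} θ_{j|a} over document classes j. Both are assumptions consistent with the symbol definitions rather than transcriptions of formulas (12) and (13); the function names and sample values are hypothetical.

```python
import math


def author_topic_mixture(theta_t_given_j, theta_j_given_a):
    """Marginalize document classes: theta[t] = sum_j theta_t|j[j][t] * theta_j|a[j]."""
    num_topics = len(theta_t_given_j[0])
    return [
        sum(theta_t_given_j[j][t] * theta_j_given_a[j]
            for j in range(len(theta_j_given_a)))
        for t in range(num_topics)
    ]


def holdout_perplexity(test_docs, topic_mixtures, phi):
    """exp(-(sum of log word likelihoods) / (number of test words))."""
    log_likelihood = 0.0
    num_words = 0
    for words, theta in zip(test_docs, topic_mixtures):
        for v in words:
            # Likelihood of word v under the document's topic mixture.
            p = sum(theta[t] * phi[t][v] for t in range(len(theta)))
            log_likelihood += math.log(p)
            num_words += 1
    return math.exp(-log_likelihood / num_words)


# Hypothetical toy model: 2 document classes, 2 topics, 3 vocabulary words.
theta_t_given_j = [[0.9, 0.1], [0.2, 0.8]]
theta_j_given_a = [0.5, 0.5]
phi = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
theta = author_topic_mixture(theta_t_given_j, theta_j_given_a)
print(holdout_perplexity([[0, 2, 2, 1]], [theta], phi))
```

A model that assigns uniform probability over V words has perplexity V under this definition, which gives a convenient sanity check.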

Furthermore, in the second embodiment, a document class can be given to each document using the distribution of author classes and the document selection distribution of each author class. So that the test set perplexity PPX using those document classes could be evaluated, formula (12) was modified into the following formula (14).

Here, θ_{j|h} is the selection probability of document class j by author class h given to document d by the model. λ_h can be expressed as in the following formula (15).

Here, n_h is the number of documents created by the authors belonging to author class h, and α_h denotes the h-th dimension of the hidden parameter α.

The evaluation was performed five times with different test document sets, and the average perplexity was used. Table 2 shows the results.
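
The repeated hold-out procedure (a random 10% of the documents as the test set, repeated five times with the scores averaged) could be organized as in the sketch below. The helper names, the `evaluate` callback, and the seed are hypothetical; how the model is trained and scored inside `evaluate` is left open.

```python
import random
from statistics import mean


def holdout_split(doc_ids, test_fraction, rng):
    """Randomly split document IDs into (training, test) lists."""
    shuffled = list(doc_ids)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def average_over_splits(doc_ids, evaluate, repeats=5, test_fraction=0.1, seed=0):
    """Average `evaluate(train, test)` over several random hold-out splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        train, test = holdout_split(doc_ids, test_fraction, rng)
        scores.append(evaluate(train, test))
    return mean(scores)


# Placeholder evaluation that only reports the test-set size; a real run
# would estimate parameters on `train` and return the perplexity on `test`.
print(average_over_splits(range(100), lambda train, test: len(test)))
```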

No comparison is made here because the prior art cannot calculate author classes, but judging from its magnitude, the perplexity of the second embodiment appears to be low.

Table 3 shows the distances between the word distributions of all pairs of topics, calculated using KL divergence. A large value means that the word distributions differ greatly, so the topics can be clearly distinguished from one another. The results show that the second embodiment extracts independent topics more accurately than the prior art.
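
As an illustration of this measure, the sketch below computes the KL divergence between the word distributions of every ordered pair of topics and averages the results. Averaging over ordered pairs is an assumption, since the text does not say how the pairwise values were aggregated, and the function names and toy distributions are hypothetical. The code also assumes every word has non-zero probability in each topic, which holds when the distributions are smoothed.

```python
import math
from itertools import permutations


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mean_pairwise_kl(phi):
    """Average KL divergence over all ordered pairs of distinct topics."""
    pairs = list(permutations(range(len(phi)), 2))
    return sum(kl_divergence(phi[i], phi[j]) for i, j in pairs) / len(pairs)


# Two hypothetical topic sets over a two-word vocabulary.
well_separated = [[0.9, 0.1], [0.1, 0.9]]
overlapping = [[0.6, 0.4], [0.4, 0.6]]
print(mean_pairwise_kl(well_separated) > mean_pairwise_kl(overlapping))  # True
```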

DESCRIPTION OF SYMBOLS
1 Input unit
2 Storage unit
3 Initial setting unit
4 Document class evaluation unit
5 Topic class evaluation unit
6 Convergence determination unit
7 Output unit
8 Author class evaluation unit
100, 100a Calculation unit
1000, 1000a Document classification device

Claims

A document classification device that performs, on a plurality of documents that are electronic data, a calculation based on a predetermined probability distribution model in which each of the plurality of documents probabilistically belongs to one of a plurality of document classes, each being a set of documents having similar contents, each of the words constituting the plurality of documents probabilistically belongs to one of a plurality of topic classes, each being a set of words having similar contents, each author of the documents probabilistically belongs to one or more of the document classes, and each of the document classes is associated with one or more of the topic classes, the document classification device comprising:
A storage unit that stores the plurality of documents, a first calculation formula used in the probability distribution model for estimating, for each document, the document class to which the document belongs, a second calculation formula used in the probability distribution model for estimating, for each word, the topic class to which the word belongs, and a predetermined calculation end condition;
An initial setting unit that randomly sets initial values of the document class to which each of the documents belongs and the topic class to which each of the words belongs;
A document class evaluation unit that, based on the first calculation formula, calculates for each document the probability of belonging to each of the document classes and estimates the document class having the highest probability as the document class to which the document and its author belong;
A topic class evaluation unit that, based on the second calculation formula, calculates for each word the probability of belonging to each of the topic classes and estimates the topic class having the highest probability as the topic class to which the word belongs;
A convergence determination unit that causes the document class evaluation unit and the topic class evaluation unit to repeat the estimation until the predetermined calculation end condition is satisfied; and
An output unit that outputs a calculation result including the documents and authors belonging to each document class when the convergence determination unit determines that the predetermined calculation end condition is satisfied.

A document classification device that performs, on a plurality of documents that are electronic data, a calculation based on a predetermined probability distribution model in which each of the plurality of documents probabilistically belongs to one of a plurality of document classes, each being a set of documents having similar contents, each of the words constituting the plurality of documents probabilistically belongs to one of a plurality of topic classes, each being a set of words having similar contents, each author of the documents probabilistically belongs to one of a plurality of author classes, each being a set of authors who have created similar documents, each of the author classes probabilistically belongs to one or more of the document classes, and each of the document classes is associated with one or more of the topic classes, the document classification device comprising:
A storage unit that stores the plurality of documents, a first calculation formula used in the probability distribution model for estimating, for each document, the document class to which the document belongs, a second calculation formula used in the probability distribution model for estimating, for each word, the topic class to which the word belongs, a third calculation formula used in the probability distribution model for estimating, for each author, the author class to which the author belongs, and a predetermined calculation end condition;
An initial setting unit that randomly sets initial values of the author class to which each of the authors belongs, the document class to which each of the documents belongs, and the topic class to which each of the words belongs;
An author class evaluation unit that, based on the third calculation formula, calculates for each author the probability of belonging to each of the author classes and estimates the author class having the highest probability as the author class to which the author belongs;
A document class evaluation unit that, based on the first calculation formula, calculates for each document the probability of belonging to each of the document classes and estimates the document class having the highest probability as the document class to which the document and the author class belong;
A topic class evaluation unit that, based on the second calculation formula, calculates for each word the probability of belonging to each of the topic classes and estimates the topic class having the highest probability as the topic class to which the word belongs;
A convergence determination unit that causes the author class evaluation unit, the document class evaluation unit, and the topic class evaluation unit to repeat the estimation until the predetermined calculation end condition is satisfied; and
An output unit that outputs a calculation result including the content of the author classes and the content of the document classes when the convergence determination unit determines that the predetermined calculation end condition is satisfied.

The document classification device according to claim 2, wherein the convergence determination unit uses, as the predetermined calculation end condition, the condition that the error between the previous probability distribution and the latest probability distribution calculated by at least one of the author class evaluation unit, the document class evaluation unit, and the topic class evaluation unit is equal to or less than a predetermined threshold, or that the number of repeated calculations has reached a predetermined number.

A document classification method performed by a document classification device that performs, on a plurality of documents that are electronic data, a calculation based on a predetermined probability distribution model in which each of the plurality of documents probabilistically belongs to one of a plurality of document classes, each being a set of documents having similar contents, each of the words constituting the plurality of documents probabilistically belongs to one of a plurality of topic classes, each being a set of words having similar contents, each author of the documents probabilistically belongs to one or more of the document classes, and each of the document classes is associated with one or more of the topic classes,
The document classification device having a storage unit that stores the plurality of documents, a first calculation formula used in the probability distribution model for estimating, for each document, the document class to which the document belongs, a second calculation formula used in the probability distribution model for estimating, for each word, the topic class to which the word belongs, and a predetermined calculation end condition,
And further having an initial setting unit, a document class evaluation unit, a topic class evaluation unit, a convergence determination unit, and an output unit, wherein:
The initial setting unit randomly sets initial values of the document class to which each of the documents belongs and the topic class to which each of the words belongs;
The document class evaluation unit, based on the first calculation formula, calculates for each document the probability of belonging to each of the document classes and estimates the document class having the highest probability as the document class to which the document and its author belong;
The topic class evaluation unit, based on the second calculation formula, calculates for each word the probability of belonging to each of the topic classes and estimates the topic class having the highest probability as the topic class to which the word belongs;
The convergence determination unit causes the document class evaluation unit and the topic class evaluation unit to repeat the estimation until the predetermined calculation end condition is satisfied; and
The output unit outputs a calculation result including the documents and authors belonging to each document class when the convergence determination unit determines that the predetermined calculation end condition is satisfied.

A document classification method performed by a document classification device that performs, on a plurality of documents that are electronic data, a calculation based on a predetermined probability distribution model in which each of the plurality of documents probabilistically belongs to one of a plurality of document classes, each being a set of documents having similar contents, each of the words constituting the plurality of documents probabilistically belongs to one of a plurality of topic classes, each being a set of words having similar contents, each author of the documents probabilistically belongs to one of a plurality of author classes, each being a set of authors who have created similar documents, each of the author classes probabilistically belongs to one or more of the document classes, and each of the document classes is associated with one or more of the topic classes,
The document classification device having a storage unit that stores the plurality of documents, a first calculation formula used in the probability distribution model for estimating, for each document, the document class to which the document belongs, a second calculation formula used in the probability distribution model for estimating, for each word, the topic class to which the word belongs, a third calculation formula used in the probability distribution model for estimating, for each author, the author class to which the author belongs, and a predetermined calculation end condition,
And further having an initial setting unit, an author class evaluation unit, a document class evaluation unit, a topic class evaluation unit, a convergence determination unit, and an output unit, wherein:
The initial setting unit randomly sets initial values of the author class to which each of the authors belongs, the document class to which each of the documents belongs, and the topic class to which each of the words belongs;
The author class evaluation unit, based on the third calculation formula, calculates for each author the probability of belonging to each of the author classes and estimates the author class having the highest probability as the author class to which the author belongs;
The document class evaluation unit, based on the first calculation formula, calculates for each document the probability of belonging to each of the document classes and estimates the document class having the highest probability as the document class to which the document and the author class belong;
The topic class evaluation unit, based on the second calculation formula, calculates for each word the probability of belonging to each of the topic classes and estimates the topic class having the highest probability as the topic class to which the word belongs;
The convergence determination unit causes the author class evaluation unit, the document class evaluation unit, and the topic class evaluation unit to repeat the estimation until the predetermined calculation end condition is satisfied; and
The output unit outputs a calculation result including the content of the author classes and the content of the document classes when the convergence determination unit determines that the predetermined calculation end condition is satisfied.

The document classification method according to claim 5, wherein the convergence determination unit uses, as the predetermined calculation end condition, the condition that the error between the previous probability distribution and the latest probability distribution calculated by at least one of the author class evaluation unit, the document class evaluation unit, and the topic class evaluation unit is equal to or less than a predetermined threshold, or that the number of repeated calculations has reached a predetermined number.

A document classification program for causing a computer to function as the document classification device according to any one of claims 1 to 3.