JPH11250100A

JPH11250100A - Hierarchical document classifying device and machine-readable recording medium recording program

Info

Publication number: JPH11250100A
Application number: JP10064682A
Authority: JP
Inventors: Ko Ri; 航李; Kenji Yamanishi; 健司山西
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1998-02-27
Filing date: 1998-02-27
Publication date: 1999-09-17
Anticipated expiration: 2018-02-27
Also published as: JP3178406B2

Abstract

PROBLEM TO BE SOLVED: To classify a document into a hierarchy of categories based on the distribution of the words (keywords) appearing in the document. SOLUTION: Document classification is treated as a statistical testing problem. A category hierarchy storing part 1 stores the category hierarchy. A probability model storing part 2 stores linear combination models. A learning part 3 refers to the category hierarchy stored in part 1, learns the linear combination model of each category from documents already classified into the categories, and stores the learned models in part 2. A document classifying part 4 receives a new document, refers to each category in the hierarchy stored in part 1 and to the corresponding linear combination model in part 2, calculates the negative log likelihood of each linear combination model for the input document, and classifies the document into the category whose linear combination model has the smallest negative log likelihood.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to the classification and retrieval of information, such as automatic classification of Internet home pages, document retrieval in electronic libraries, retrieval of patent application information, automatic classification of digitized newspaper articles, and automatic classification of multimedia information.

[0002]

2. Description of the Related Art

In the field of information classification and retrieval, the development of text classification (also called document classification) devices is a major challenge. Text classification here means the following: a human defines categories in advance, decides which category each of a set of documents belongs to, and stores the documents in the system under those categories; the system then automatically acquires knowledge from the stored information and thereafter uses that knowledge to classify newly input documents automatically.

[0003] Because documents are classified into categories, a search only needs to cover the documents in the relevant categories, which makes retrieval efficient and accurate.

[0004] Several document classification devices have been proposed. The best known is the one proposed by Salton et al. (G. Salton and M. J. McGill, Introduction to Modern Information Retrieval, New York: McGraw-Hill, 1983). That device treats the cosine between the word frequency vector of a document and the word frequency vector of a category as the distance between the document and the category, and classifies the document into the category with the smallest distance.
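
The prior-art scheme described above can be illustrated with a minimal sketch. It assumes that the distance is taken as 1 minus the cosine, so that the largest cosine corresponds to the smallest distance; the function names and the sample word counts are hypothetical and do not come from the cited reference.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine of the angle between two sparse word-frequency vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def classify_by_cosine(doc_words, category_vectors):
    """Return the category whose frequency vector is closest to the document."""
    doc_vec = Counter(doc_words)
    # Assumed distance: 1 - cosine (largest cosine = smallest distance).
    return min(category_vectors,
               key=lambda c: 1.0 - cosine(doc_vec, category_vectors[c]))

# Hypothetical category frequency vectors.
categories = {
    "politics": Counter({"diet": 3, "party": 2, "vote": 1}),
    "sports": Counter({"match": 4, "goal": 2, "vote": 1}),
}
print(classify_by_cosine(["party", "vote", "diet"], categories))
```

Note that every category is compared on the same flat level here, which is the limitation the invention addresses.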

[0005]

PROBLEMS TO BE SOLVED BY THE INVENTION

However, most conventional methods classify documents into a flat set of parallel categories, and no device has automatically classified documents into categories that form a hierarchy. For example, suppose the category "politics" is further divided into the subcategories "Diet" and "political parties". If a document classified under "politics" is further classified under "Diet" or "political parties", later searches become even faster.

[0006] An object of the present invention is to make it possible to classify documents automatically into categories that form a hierarchy, rather than into parallel categories.

[0007] Another object of the present invention is to realize highly reliable automatic classification of documents.

[0008]

MEANS FOR SOLVING THE PROBLEMS

In the present invention, the categories are arranged in a hierarchy, and each category is associated with either a probability model called a linear combination model or a set of probability models. When a new document is input, the negative log likelihood of each linear combination model for the document, or the stochastic complexity of the document with respect to each set of probability models, is calculated, and the document is classified into the category with the smallest negative log likelihood or the smallest stochastic complexity.

[0009] In other words, the present invention classifies a document into a category based on the distribution of words in the document. In particular, it is characterized in that documents are classified by a statistical test using probabilistic models.

[0010] Specifically, the first hierarchical document classification device of the present invention comprises: a category hierarchy storage unit that stores the category hierarchy as a graph in which nodes represent the categories into which documents are classified and links represent the superordinate/subordinate relationships between categories; a probability model storage unit that stores a linear combination model for each category, where, for each category in the hierarchy stored in the category hierarchy storage unit, the linear combination model of that category is the weighted average of the probability models over the word space of its lower categories; a learning unit that, based on the documents classified into each category of the hierarchy stored in the category hierarchy storage unit, learns the linear combination model of each category from the linear combination models of its lower categories and stores the learned linear combination models in the probability model storage unit; and a document classification unit that receives a new document, regards the input document as a data sequence of words, calculates, for each category in the hierarchy stored in the category hierarchy storage unit, the negative log likelihood, for the input document, of the linear combination model of that category stored in the probability model storage unit, and classifies the input document into the category with the smallest calculated negative log likelihood.

[0011] In the first hierarchical document classification device configured in this way, the learning unit learns the linear combination model of each category from the linear combination models of its lower categories, based on documents that have been classified, for example manually in advance, into each category of the hierarchy stored in the category hierarchy storage unit, and stores the learned linear combination models in the probability model storage unit. When a document to be classified automatically is subsequently input, the document classification unit receives the document, regards it as a data sequence of words, calculates, for each category in the hierarchy stored in the category hierarchy storage unit, the negative log likelihood, for the input document, of the linear combination model of that category stored in the probability model storage unit, and classifies the input document into the category with the smallest calculated negative log likelihood.

[0012] The second hierarchical document classification device of the present invention comprises: a category hierarchy storage unit that stores the category hierarchy as a graph in which nodes represent the categories into which documents are classified and links represent the superordinate/subordinate relationships between categories; a probability model set storage unit that stores all elements of the set of probability models of each category, where, for each category in the hierarchy stored in the category hierarchy storage unit, the set of probability models of that category is formed from the sets of probability models over the word space of its lower categories; a learning unit that, based on the documents classified into each category of the hierarchy stored in the category hierarchy storage unit, learns the set of probability models of each category from the sets of probability models over the word space of its lower categories and stores all elements of each learned set in the probability model set storage unit; and a document classification unit that receives a new document, regards the input document as a data sequence of words, calculates, for each category in the hierarchy stored in the category hierarchy storage unit, the stochastic complexity of the input document with respect to the set of probability models of that category stored in the probability model set storage unit, and classifies the input document into the category with the smallest calculated stochastic complexity.

[0013] In the second hierarchical document classification device configured in this way, the learning unit learns the set of probability models of each category from the sets of probability models over the word space of its lower categories, based on documents that have been classified, for example manually in advance, into each category of the hierarchy stored in the category hierarchy storage unit, and stores all elements of each learned set in the probability model set storage unit. When a document to be classified automatically is subsequently input, the document classification unit receives the document, regards it as a data sequence of words, calculates, for each category in the hierarchy stored in the category hierarchy storage unit, the stochastic complexity of the input document with respect to the set of probability models of that category stored in the probability model set storage unit, and classifies the input document into the category with the smallest calculated stochastic complexity.

[0014]

DESCRIPTION OF THE PREFERRED EMBODIMENTS

Embodiments of the present invention will now be described in detail with reference to the drawings.

[0015] Referring to FIG. 1, the first embodiment of the present invention comprises a category hierarchy storage unit 1, a probability model storage unit 2, a learning unit 3, and a document classification unit 4.

[0016] The category hierarchy storage unit 1 stores the category hierarchy. The hierarchical structure of the categories is represented as a graph in which nodes represent categories and links represent the superordinate/subordinate relationships between them. Each category also holds the documents that have already been classified into it. FIG. 2 shows an example of a category hierarchy. In this example the hierarchy is a tree, but in general it may be a more complex graph.
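
The category hierarchy held by the category hierarchy storage unit 1 could be represented, for example, as in the following sketch. It is restricted to the tree case; the class name `Category` and its fields are hypothetical, and the category names are borrowed from the "politics" example given earlier.

```python
from dataclasses import dataclass, field

@dataclass
class Category:
    """One node of the category hierarchy (hypothetical structure)."""
    name: str
    documents: list = field(default_factory=list)  # documents already classified here
    children: list = field(default_factory=list)   # links to lower categories

    def add_child(self, child):
        """Link a lower category under this one and return it."""
        self.children.append(child)
        return child

    def is_leaf(self):
        return not self.children

    def descendants(self):
        """Yield this node and every node below it, depth first."""
        yield self
        for child in self.children:
            yield from child.descendants()

politics = Category("politics")
diet = politics.add_child(Category("Diet", documents=[["bill", "session"]]))
parties = politics.add_child(Category("political parties"))
print([c.name for c in politics.descendants()])
```

A general graph, in which a category may have several parents, would need a shared node registry instead of nested children; that case is not covered here.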

[0017] The probability model storage unit 2 stores one probability model for each category in the hierarchy. In this embodiment, the probability model takes the form of a linear combination model. The linear combination model of a category is defined as the weighted average of the probability models of its lower categories. An example of a linear combination model follows.

[0018] Example of a linear combination model: In a tree of categories, each node represents a category. The linear combination model of the category at node c is defined, as follows, as a linear combination of the linear combination models of the categories at its child nodes and the probability model belonging to node c itself.

(Equation 1)

$$P(W \mid c) = P(c' \mid c)\,P(W \mid c') + \sum_{i=1}^{n} P(c_i \mid c)\,P(W \mid c_i)$$

In Equation 1, the random variable W takes values in the set of words {w1, w2, ..., ws}. P(W|c1), P(W|c2), ..., P(W|cn) are the linear combination models of the categories at the child nodes c1, c2, ..., cn of c. P(W|c') is the probability model belonging to node c itself; that is, it is the probability model that belongs to the category represented by c and not to the categories represented by c1, ..., cn. P(c'|c), P(c1|c), ..., P(cn|c) are the prior probabilities of c', c1, ..., cn.
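
The definition of the linear combination model can be sketched numerically as follows, assuming each probability model is held as a dictionary mapping words to probabilities and the prior probabilities sum to 1. The function name `linear_combination` and the sample probabilities are hypothetical.

```python
def linear_combination(own_model, child_models, own_prior, child_priors):
    """Mix the node's own model P(W|c') with its children's models P(W|c_i).

    Returns P(W|c) = P(c'|c) P(W|c') + sum_i P(c_i|c) P(W|c_i).
    """
    vocab = set(own_model)
    for model in child_models:
        vocab |= set(model)
    mixed = {}
    for w in vocab:
        p = own_prior * own_model.get(w, 0.0)
        for prior, model in zip(child_priors, child_models):
            p += prior * model.get(w, 0.0)
        mixed[w] = p
    return mixed

# Hypothetical models over the words w1, w2, w3.
own = {"w1": 0.5, "w2": 0.5}
children = [{"w1": 1.0}, {"w2": 0.25, "w3": 0.75}]
model = linear_combination(own, children, own_prior=0.5, child_priors=[0.25, 0.25])
print(model)
```

Because the priors sum to 1 and each component is a probability distribution, the mixture is again a probability distribution over the words.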

[0019] The learning unit 3 refers to the category hierarchy stored in the category hierarchy storage unit 1, learns the linear combination model of each category from the documents already classified into the categories, and stores the learned linear combination models in the probability model storage unit 2.

[0020] The document classification unit 4 receives a new document and regards it as a data sequence of words. It refers to each category in the hierarchy stored in the category hierarchy storage unit 1 and, for each category, to the corresponding linear combination model in the probability model storage unit 2. It then calculates the negative log likelihood of each linear combination model for the document and classifies the document into the category whose linear combination model has the smallest negative log likelihood.

[0021] The learning unit 3 can learn the linear combination models in several ways. For example, the linear combination models of the lower categories can be estimated as histograms. The weight coefficients can also be learned with the EM algorithm.

[0022] An example of the learning algorithm of the learning unit 3 is given here. The graph representing the hierarchy is assumed to be a tree. The learning unit 3 refers to the tree-structured category hierarchy and learns the linear combination models of the categories bottom-up. The learning algorithm is as follows, and its flowchart is shown in FIG. 3.

[0023]
Input: node c. Initially, the root node of the tree is input.
if node c is a leaf node
then learn the linear combination model of c from the documents classified into the category of node c, and return.
else refer to the linear combination models of the child nodes ci (i = 1, 2, ..., n) of node c.
  if the linear combination model of node ci has not yet been learned
  then apply this algorithm recursively to node ci.
  else learn the linear combination model of node c from the linear combination models of the nodes ci and the probability model of c itself, and return.
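
The bottom-up learning procedure above can be sketched as follows. The sketch assumes that each model is a word histogram, and that the weight of each component is its share of the documents under node c (which equals f(ci)/N at the root); that normalization, the class `Node`, and the sample documents are assumptions made for illustration.

```python
from collections import Counter

class Node:
    """A category node; field names are hypothetical."""
    def __init__(self, name, docs=(), children=()):
        self.name = name
        self.docs = list(docs)          # word lists classified directly into this category
        self.children = list(children)  # lower categories
        self.model = None               # linear combination model, once learned

def histogram(docs):
    """Relative word frequencies over a list of documents."""
    counts = Counter(w for d in docs for w in d)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()} if total else {}

def count_docs(node):
    """Documents at this node plus all nodes below it."""
    return len(node.docs) + sum(count_docs(ch) for ch in node.children)

def learn(node):
    """Learn the model of `node`, learning unlearned children first."""
    if not node.children:                       # leaf: histogram of its own documents
        node.model = histogram(node.docs)
        return node.model
    for child in node.children:
        if child.model is None:
            learn(child)                        # recursive, bottom-up
    total = count_docs(node)
    mixed = Counter()
    if total:
        # Node's own model, weighted by its share of the documents (assumed prior).
        for w, p in histogram(node.docs).items():
            mixed[w] += len(node.docs) / total * p
        for child in node.children:
            weight = count_docs(child) / total  # assumed prior of the child
            for w, p in child.model.items():
                mixed[w] += weight * p
    node.model = dict(mixed)
    return node.model

c2 = Node("c2", docs=[["w1", "w1", "w2"]])
c3 = Node("c3", docs=[["w3", "w3"]])
c1 = Node("c1", docs=[["w1", "w2"]], children=[c2, c3])
learn(c1)
print(c1.model)
```

Calling `learn` on the root is enough: every child without a model is learned before its parent's mixture is formed.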

[0024] The document classification unit 4 classifies documents by statistical hypothesis testing. An example of the algorithm of the document classification unit 4 follows, and its flowchart is shown in FIG. 4.

[0025]
Let d be the input document.
Input: node c and document d. Initially, the root node of the tree is input.
if node c is a leaf node
then conclude that document d belongs to the category of node c, and terminate.
else calculate the negative log likelihood L(d|c) of the linear combination model of node c for document d.
  Also calculate the negative log likelihoods L(d|ci) of the child nodes ci (i = 1, 2, ..., n) of node c.
  Find the minimum of the calculated L(d|c) and L(d|ci).
  if the negative log likelihood of some child node ci is the minimum
  then apply this algorithm recursively to node ci.
  else conclude that document d belongs to the category of node c, and terminate.
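
The top-down classification procedure above can be sketched as follows. The sketch assumes that the models have already been learned and are given as word-to-probability dictionaries; the dictionary layout, the `floor` value used for words with zero probability, and the choice to stay at the parent on a tie are assumptions, not part of the described method.

```python
import math

def neg_log_likelihood(words, model, floor=1e-9):
    """L(d|c) = -sum of log2 P(w|c) over the words of d.

    `floor` replaces zero probabilities to avoid log(0); it is an
    assumption of this sketch.
    """
    return -sum(math.log2(max(model.get(w, 0.0), floor)) for w in words)

def classify(node, words):
    """Descend from `node` while some child scores lower than the node itself."""
    if not node["children"]:                    # leaf: d belongs here
        return node["name"]
    own = neg_log_likelihood(words, node["model"])
    scored = [(neg_log_likelihood(words, ch["model"]), ch) for ch in node["children"]]
    best_score, best_child = min(scored, key=lambda pair: pair[0])
    if best_score < own:                        # a child attains the minimum
        return classify(best_child, words)
    return node["name"]                         # the node itself attains the minimum

# Hypothetical learned models for a three-node tree.
c2 = {"name": "c2", "model": {"w1": 0.75, "w2": 0.25}, "children": []}
c3 = {"name": "c3", "model": {"w3": 1.0}, "children": []}
c1 = {"name": "c1", "model": {"w1": 0.5, "w2": 0.25, "w3": 0.25}, "children": [c2, c3]}

print(classify(c1, ["w1", "w1", "w2"]))
print(classify(c1, ["w1", "w2", "w3"]))
```

With these sample models, a document using only w1 and w2 descends to c2, while a document that mixes all three words stays at c1, because neither child explains it better than the parent's mixture.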

[0026] Next, the method by which the learning unit 3 learns the linear combination models and the method by which the document classification unit 4 calculates the negative log likelihood are explained through a more concrete example. Assume the category hierarchy shown in FIG. 5, in which c1, c2, and c3 are categories and d1, d2, and d3 are documents that have already been classified. FIG. 6 shows the frequencies of the words w1, w2, and w3 in the documents d1, d2, and d3. The words w1, w2, and w3 are predetermined keywords.

[0027] Example of learning linear combination models: Since c2 and c3 are leaf nodes, the linear combination models of their categories are learned as histograms of the words in their documents, as shown in FIG. 7(a).

[0028] From the document d2 classified into c1, the probability model belonging to c1 itself is learned as a word histogram. This model is denoted P(W|c1'). That is, it is the probability model that belongs to the category of c1 and not to the categories of c2 and c3, and it is learned as shown in FIG. 7(b).

[0029] Meanwhile, the prior distribution over the models is learned as follows.

(Equation 2)

[0030] Here, f(ci) is the number of documents belonging to node ci and the nodes it dominates, and N is the total number of documents. The prior distribution over the models is therefore learned as shown in FIG. 7(c).
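
As an illustration of the quantities f(ci) and N, the following sketch counts documents per subtree and divides by the total. Equation 2 is not reproduced in this text, so the form prior = f(ci)/N is an assumption, as is the placement of d1 under c2 and d3 under c3 (only the placement of d2 under c1 is stated above).

```python
def count_documents(node):
    """f(c): documents at this node plus all nodes it dominates."""
    return len(node["docs"]) + sum(count_documents(ch) for ch in node["children"])

def priors(node):
    """Assumed prior of each component model of `node`: document share f(ci)/N."""
    n = count_documents(node)
    result = {node["name"] + "'": len(node["docs"]) / n}  # the node's own model c'
    for ch in node["children"]:
        result[ch["name"]] = count_documents(ch) / n
    return result

# Hypothetical assignment of the documents of FIG. 5 to categories.
c2 = {"name": "c2", "docs": ["d1"], "children": []}
c3 = {"name": "c3", "docs": ["d3"], "children": []}
c1 = {"name": "c1", "docs": ["d2"], "children": [c2, c3]}
print(priors(c1))
```

Under this assumed assignment each of the three component models receives the same prior, because each accounts for one of the three documents.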
Therefore, the prior distribution of each model is learned as shown in FIG.

[0031] Furthermore, in accordance with the definition of the linear combination model, the linear combination model at node c1 can be learned as follows.

(Equation 3)

[0032] That is, the linear combination model at node c1 is as shown in FIG. 7(d).

[0033] Example of calculating the negative log likelihood: Assume that the distribution of words in a new document d is as shown in FIG. 8; that is, the document classification unit 4 has detected two occurrences of the word w1, one of w2, and one of w3 in the input document. The negative log likelihood of c1 for d is calculated as follows, with logarithms taken to base 2.

(Equation 4)

[0034] Similarly, the negative log likelihoods of c2 and c3 are calculated.

(Equation 5)

(Equation 6)

[0035] Since the negative log likelihood L(d|c1) is the smallest, d is classified into c1.
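
The calculation in paragraphs [0033] to [0035] can be sketched as follows. The word counts of d are those stated above (w1 twice, w2 and w3 once each); the model probabilities are illustrative placeholders, since the values of FIG. 7 are not reproduced in this text, so the resulting numbers are not those of Equations 4 to 6.

```python
import math
from collections import Counter

def neg_log_likelihood(word_counts, model):
    """L(d|c) = -sum_w n(w) * log2 P(w|c), with logarithms to base 2."""
    return -sum(n * math.log2(model[w]) for w, n in word_counts.items())

# Word counts of the new document d as stated in the text.
d = Counter({"w1": 2, "w2": 1, "w3": 1})

# Placeholder models; NOT the values of FIG. 7.
models = {
    "c1": {"w1": 0.5, "w2": 0.25, "w3": 0.25},
    "c2": {"w1": 0.25, "w2": 0.5, "w3": 0.25},
    "c3": {"w1": 0.125, "w2": 0.125, "w3": 0.75},
}

scores = {c: neg_log_likelihood(d, m) for c, m in models.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

With these placeholder values the smallest negative log likelihood also falls on c1, which is the category the text selects.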

[0036] Referring to FIG. 9, the second embodiment of the present invention comprises a category hierarchy storage unit 1, a probability model set storage unit 5, a learning unit 6, and a document classification unit 7.

[0037] The category hierarchy storage unit 1 stores the category hierarchy. In the hierarchy, nodes represent categories and links represent superordinate/subordinate relationships. FIG. 2, described above, is an example of a category hierarchy.

[0038] The probability model set storage unit 5 stores sets of probability models. One set of probability models is defined and stored for each category in the hierarchy. An example of a set of probability models follows.

[0039] Example of a set of probability models: Suppose the set of probability models of node c contains the probability models P(W|c'), P(W|c1), ..., P(W|cn). P(W|c1), ..., P(W|cn) are the probability models (probability distributions) contained in the sets of probability models of the child nodes c1, ..., cn of c. P(W|c') is the probability model belonging to node c itself; that is, it is the probability model that belongs to the category of c and not to the categories of c1, ..., cn. Suppose further that the probability models have prior probabilities P(c'|c), P(c1|c), ..., P(cn|c). The probability models P(W|c'), P(W|c1), ..., P(W|cn) are represented, for example, as histograms.

[0040] The set of probability models of each category consists of the probability model over the word space derived from the documents belonging to the category itself and the probability models over the word space derived from the documents belonging to its lower categories.

[0041] The learning unit 6 refers to the category hierarchy stored in the category hierarchy storage unit 1, learns the set of probability models of each category from the documents already classified into the categories, and stores the learned sets of probability models in the probability model set storage unit 5.

[0042] The document classification unit 7 receives a new document and regards it as a data sequence of words. It refers to each category in the hierarchy stored in the category hierarchy storage unit 1 and, for each category, to the corresponding set of probability models in the probability model set storage unit 5. It then calculates the stochastic complexity of the document with respect to each set of probability models and classifies the document into the category whose set of probability models gives the smallest stochastic complexity.

[0043] Stochastic complexity is a quantity representing the minimum description length needed to describe data using a set of probability models, and was proposed by Rissanen (Jorma Rissanen, Stochastic Complexity in Statistical Inquiry, World Scientific Publishing Co., Singapore, 1989). In this embodiment, the stochastic complexity is calculated as the negative logarithm of the weighted average of the likelihoods, for the data, of the probability models in the set.
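
The calculation described for this embodiment, the negative logarithm of the prior-weighted average of the likelihoods, can be sketched as follows. The function names and the sample models are hypothetical, and base-2 logarithms are assumed to match the worked examples.

```python
import math

def likelihood(word_counts, model):
    """P(d|model) = product over words of P(w|model) ** n(w)."""
    p = 1.0
    for w, n in word_counts.items():
        p *= model.get(w, 0.0) ** n
    return p

def stochastic_complexity(word_counts, models, priors):
    """SC(d|c) = -log2( sum_i prior_i * P(d|model_i) )."""
    mixture = sum(pr * likelihood(word_counts, m) for pr, m in zip(priors, models))
    return -math.log2(mixture) if mixture > 0 else math.inf

# Hypothetical document and models.
d = {"w1": 2, "w2": 1}
m1 = {"w1": 0.5, "w2": 0.5}
m2 = {"w1": 0.25, "w2": 0.75}

print(stochastic_complexity(d, [m1], [1.0]))           # single-model set
print(stochastic_complexity(d, [m1, m2], [0.5, 0.5]))  # uniform prior over two models
```

When the set holds a single model with prior 1, the value reduces to that model's negative log likelihood. For long documents the product of probabilities can underflow, so a practical implementation would work with log probabilities throughout.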

[0044] Next, an example of the learning algorithm of the learning unit 6 is given. The graph representing the hierarchy is assumed to be a tree. The learning unit 6 refers to the tree-structured category hierarchy and learns the sets of probability models of the categories bottom-up. The algorithm is as follows, and its flowchart is shown in FIG. 10.

[0045]
Input: node c. Initially, the root node of the tree is input.
if node c is a leaf node
then learn all elements of the set of probability models of c from the documents classified into the category of node c, and return.
else refer to the sets of probability models of the child nodes ci (i = 1, 2, ..., n) of node c.
  if the set of probability models of node ci has not yet been learned
  then apply this algorithm recursively to node ci.
  else learn the set of probability models of node c from the sets of probability models of the nodes ci and the probability model of the documents classified into c, and return.

[0046] The document classification unit 7 classifies documents by statistical hypothesis testing. An example of the algorithm of the document classification unit 7 follows. FIG. 11 is its flowchart.

[0047]
Let d be the input document.
Input: node c and document d. Initially, the root node of the tree is input.
if node c is a leaf node
then conclude that document d belongs to the category of node c, and terminate.
else calculate the stochastic complexity SC(d|c) of document d at node c.
  Also calculate the stochastic complexities SC(d|ci) at the child nodes ci (i = 1, 2, ..., n) of node c.
  Find the minimum of the calculated SC(d|c) and SC(d|ci).
  if the stochastic complexity of some child node ci is the minimum
  then apply this algorithm recursively to node ci.
  else conclude that document d belongs to the category of node c, and terminate.

[0048] Next, an example of calculating the stochastic complexity is given.

[0049] Assume the category hierarchy shown in FIG. 5, and assume that the frequencies of the words (keywords) in the documents are as shown in FIG. 6.

[0050] Since nodes c2 and c3 are leaf nodes, the set of probability models of each contains a single probability model. These probability models are learned as histograms, as shown in FIG. 12(a).

[0051] The probability model belonging to node c1 itself is also learned as a histogram, as shown in FIG. 12(b).

[0052] Accordingly, the set of probability models of c1 contains the probability models P(W|c1), P(W|c2), and P(W|c3). The prior probabilities P(ci|c) of these probability models are assumed to form a uniform distribution.

[0053] Assume that the frequencies of the words in a new document d are as shown in FIG. 12(c).
[0054] The stochastic complexity of d with respect to c1 is calculated as follows, with logarithms taken to base 2.

(Equation 7)

[0055] The stochastic complexities of d with respect to c2 and c3 are calculated as follows.

(Equation 8)

(Equation 9)

[0056] Since SC(d|c3) is the smallest, d is classified into c3.

[0057] FIG. 13 is a block diagram of a third embodiment of the hierarchical document classification device of the present invention. The device of this example comprises a computer 104 including a CPU 101, a main memory 102, and an auxiliary memory 103; an input/output device 108 connected to the computer 104 and including a display device 105, an input device 106, and a file 107; and a recording medium 109 on which a hierarchical document classification program is recorded. The recording medium 109 is a machine-readable recording medium such as a CD-ROM or a semiconductor memory. The hierarchical document classification program recorded on it is read by the computer 104 and controls the operation of the computer 104, thereby realizing on the computer 104 either the category hierarchy storage unit 1, probability model storage unit 2, learning unit 3, and document classification unit 4 shown in FIG. 1, or the category hierarchy storage unit 1, probability model set storage unit 5, learning unit 6, and document classification unit 7 shown in FIG. 9.

[0058]

[Effects of the Invention] As described above, according to the present invention, sentences can be automatically classified into hierarchically structured categories, and the classification is statistically highly reliable because it is based on the theory of the likelihood ratio test.

[Brief description of the drawings]

FIG. 1 is a block diagram of a first embodiment of the hierarchical sentence classification device of the present invention.

FIG. 2 is a diagram showing an example of a category hierarchy.

FIG. 3 is a flowchart showing an example of the learning algorithm in the first embodiment of the hierarchical sentence classification device of the present invention.

FIG. 4 is a flowchart showing an example of the sentence classification algorithm in the first embodiment of the hierarchical sentence classification device of the present invention.

FIG. 5 is a diagram showing an example of a category hierarchy.

FIG. 6 is a diagram showing an example of word distributions in sentences.

FIG. 7 is an explanatory diagram of an example of learning a linear combination model.

FIG. 8 is an explanatory diagram of an example of calculating the negative log-likelihood.

FIG. 9 is a block diagram of a second embodiment of the hierarchical sentence classification device of the present invention.

FIG. 10 is a flowchart showing an example of the learning algorithm in the second embodiment of the hierarchical sentence classification device of the present invention.

FIG. 11 is a flowchart showing an example of the sentence classification algorithm in the second embodiment of the hierarchical sentence classification device of the present invention.

FIG. 12 is an explanatory diagram of an example of calculating the stochastic complexity.

FIG. 13 is a block diagram of a third embodiment of the hierarchical sentence classification device of the present invention.

[Explanation of symbols]

1 Category hierarchy storage unit
2 Probability model storage unit
3 Learning unit
4 Sentence classification unit
5 Probability model set storage unit
6 Learning unit
7 Sentence classification unit

Claims


1. A hierarchical sentence classification device comprising: a category hierarchy storage unit that stores a hierarchy of categories as a graph in which nodes represent the categories into which sentences are classified and links represent the upper/lower relationships between the categories; a probability model storage unit that, for each category in the category hierarchy stored in the category hierarchy storage unit, takes a weighted average of the probability models over the word space of its lower categories as the linear combination model of that category, and stores the linear combination model of each category; a learning unit that, based on sentences already classified into each category of the category hierarchy stored in the category hierarchy storage unit, learns the linear combination model of each category from the linear combination models of its lower categories, and stores the learned linear combination model of each category in the probability model storage unit; and a sentence classification unit that receives a new sentence as input, regards the input sentence as a data sequence of words, calculates, for each category of the category hierarchy stored in the category hierarchy storage unit, the negative log-likelihood, with respect to the input sentence, of the linear combination model of that category stored in the probability model storage unit, and classifies the input sentence into the category with the smallest calculated negative log-likelihood.

2. A hierarchical sentence classification device comprising: a category hierarchy storage unit that stores a hierarchy of categories as a graph in which nodes represent the categories into which sentences are classified and links represent the upper/lower relationships between the categories; a probability model set storage unit that, for each category in the category hierarchy stored in the category hierarchy storage unit, takes the set of probability models over the word space of its lower categories as the set of probability models of that category, and stores all elements of the set of probability models of each category; a learning unit that, based on sentences already classified into each category of the category hierarchy stored in the category hierarchy storage unit, learns the set of probability models of each category from the sets of probability models of its lower categories, and stores all elements of the learned set of probability models of each category in the probability model set storage unit; and a sentence classification unit that, for each category of the category hierarchy stored in the category hierarchy storage unit, calculates the stochastic complexity of an input sentence with respect to the set of probability models of that category stored in the probability model set storage unit, and classifies the input sentence into the category with the smallest calculated stochastic complexity.

3. A machine-readable recording medium on which is recorded a program that causes a computer to function as the category hierarchy storage unit, the probability model storage unit, the learning unit, and the sentence classification unit according to claim 1.

4. A machine-readable recording medium on which is recorded a program that causes a computer to function as the category hierarchy storage unit, the probability model set storage unit, the learning unit, and the sentence classification unit according to claim 2.