JP2000035963A

JP2000035963A - Automatic sentence classification device and method

Info

Publication number: JP2000035963A
Application number: JP10202575A
Authority: JP
Inventors: Ko Ri (航李); Naoki Abe (直樹安倍)
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1998-07-17
Filing date: 1998-07-17
Publication date: 2000-02-02
Anticipated expiration: 2018-07-17
Also published as: JP3266106B2

Abstract

PROBLEM TO BE SOLVED: To provide a device that automatically classifies sentences into sentence clusters based on the occurrence frequencies of words in the sentences. SOLUTION: A statistical processing part 1 receives a plurality of sentences and creates a matrix of the occurrence frequencies of words in the sentences. A sentence classification part 2 receives the matrix from the statistical processing part 1, classifies the sentences into sentence clusters based on the data in the matrix, and outputs the result. The automatic sentence classification problem is treated as the problem of estimating a probability model defined on the direct product of a partition of the word set and a partition of the sentence set. In the probability model, each sentence is assumed to be generated from the cluster to which it belongs with a certain probability, and each word is likewise assumed to be generated from the cluster to which it belongs with a certain probability; the probability model is selected using an information criterion. Clustering is performed alternately on the sentence set and the word set in a bottom-up manner.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to an automatic sentence classification apparatus, and more particularly to an automatic sentence classification apparatus and method for automatically organizing and classifying electronic sentences (texts, documents) such as e-mail.

[0002]

[Description of the Related Art] In multimedia information processing, the classification and retrieval of large volumes of electronic sentences (texts, documents) are expected to become even more frequent and commonplace than they are today. Meeting this need requires establishing techniques that organize and classify sentences accurately and at high speed. In this context, there is active research on so-called "automatic sentence classification", or "sentence clustering", which automatically classifies sentences into sentence clusters (groups) based on the distribution pattern of word occurrence frequencies in the sentences.

[0003] As a conventional method, there is, for example, the method shown in Reference 1 (G. Salton, M. McGill, Introduction to Modern Information Retrieval, New York, McGraw-Hill, 1983), which automatically classifies sentences based on the distribution pattern of word occurrence frequencies in the sentences. In this method, a word occurrence frequency vector is obtained for each sentence, and sentences are classified by taking the size of the angle (cosine value) between the word frequency vectors of two sentences as the distance between them. With this method, sentences can be classified very simply.
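The cosine measure described above can be sketched in a few lines of Python. This is only an illustration of the conventional method: the function name `cosine_similarity` and the four-word vocabulary with its counts are hypothetical and are not taken from the cited reference or from the figures.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two word-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty sentence is treated as unrelated to everything
    return dot / (norm_u * norm_v)

# Hypothetical counts over the vocabulary (racket, shot, kick, goal).
text1 = [2, 1, 0, 0]
text2 = [1, 2, 0, 0]
text3 = [0, 0, 2, 1]

similar = cosine_similarity(text1, text2)    # sentences sharing vocabulary
unrelated = cosine_similarity(text1, text3)  # sentences with no shared words
```

A value near 1 indicates similar frequency patterns, and 0 indicates no shared words.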

[0004] However, this method is not based on the theory of mathematical statistics, and there is therefore no theoretical guarantee regarding, for example, the quality of the resulting classification.

[0005] To address this problem, Japanese Patent Application Laid-Open No. Hei 8-263510 (Japanese Patent Application No. 07-0065722), for example, proposes an automatic sentence classification system that classifies sentences automatically using probabilistic and statistical techniques from mathematical statistics and information theory. This system regards a sentence as a word frequency vector and treats sentence classification as a probability model estimation problem using the MDL criterion. Specifically, it assumes that the word frequency vectors of the sentences to be classified are generated from some probability model and, based on the MDL criterion, selects from the class of possible models the one probability model that best explains the observed data (the features of the sentences). The grouping of sentences in the selected model is taken as the result of the sentence classification.

[0006] The MDL criterion (the Minimum Description Length principle, also called the MDL principle) was proposed in mathematical statistics and information theory, and is a criterion for selecting the optimal model from among a plurality of probability models on the basis of input data.

[0007] Specifically, the MDL criterion asserts that the optimal model is the one that minimizes the sum of a quantity called the "model description length" and a quantity called the "data description length". See, for example, Reference (1) (J. Rissanen, Modeling by Shortest Data Description, Automatica, Vol. 14, 1978) and Reference (2) (J. Rissanen, Universal Coding, Information, Prediction, and Estimation, IEEE Trans. on IT, Vol. IT-30, 1984).

[0008] It has been shown theoretically that probability model selection by the MDL criterion has many desirable properties. The automatic sentence classification system described in Japanese Patent Application Laid-Open No. Hei 8-263510 is noteworthy as the first application of the MDL criterion to automatic sentence classification.

[0009]

[Problems to Be Solved by the Invention] However, the automatic sentence classification system proposed in Japanese Patent Application Laid-Open No. Hei 8-263510 has the following two problems.

[0010] The first problem is that there is a limit to using word occurrence frequency information as-is for sentence classification.

[0011] Words have synonyms, near-synonyms, and related words, and many words that differ on the surface refer to substantially the same meaning. It is therefore preferable, where possible, to classify sentences based on the occurrence frequencies of groups of synonyms, near-synonyms, and related words, that is, of so-called "word clusters".

[0012] The automatic sentence classification system proposed in Japanese Patent Application Laid-Open No. Hei 8-263510 attempts to solve this problem by referring to a dictionary prepared in advance. However, such word dictionaries differ from field to field, and preparing them exhaustively is very difficult in practice.

[0013] The second problem is that the automatic sentence classification system proposed in Japanese Patent Application Laid-Open No. Hei 8-263510 uses an annealing method, and therefore also has a problem in terms of processing efficiency.

[0014] The present invention has been made in view of the above problems, and its object is to provide an automatic sentence classification apparatus that automatically classifies sentences into sentence clusters based on the occurrence frequencies of words in the sentences, improving both accuracy and processing efficiency.

[0015]

[Means for Solving the Problems] To achieve the above object, the present invention comprises: statistical processing means that receives a plurality of sentences as input, counts the occurrence frequency of words in each sentence, and creates a matrix of the word occurrence frequencies in each sentence; and sentence classification means that receives the matrix from the statistical processing means and, based on the matrix, alternately performs operations of classifying sentences and words into their respective clusters (referred to as "sentence clusters" and "word clusters"), thereby outputting the clusters of the classified sentences. The sentence classification means regards the data in the matrix of word occurrence frequencies as having been generated from a joint probability model of sentences and words. In this joint probability model, the joint probability of a sentence and a word is calculated by multiplying the joint probability of the sentence cluster and the word cluster by the conditional occurrence probability of the sentence given the sentence cluster to which it belongs and the conditional occurrence probability of the word given the word cluster to which it belongs. The operations of classifying sentences and words are treated as the problem of estimating this probability model, and the model is estimated using an information criterion.

[0016]

[Embodiments of the Invention] Embodiments of the present invention are described below. The present invention discards the assumption that the word occurrence frequency vector of a sentence is generated from some probability model, and instead treats the sentence classification problem as the problem of estimating a joint probability model of sentences and words defined on the direct product of a partition of the sentence set and a partition of the word set.

[0017] More specifically, the joint probability of a sentence and a word is taken to be the joint probability of the sentence cluster and the word cluster, multiplied by the conditional occurrence probability of the word given the word cluster to which it belongs and the conditional occurrence probability of the sentence given the sentence cluster to which it belongs.

[0018] As a result, words are grouped into clusters and sentences can be classified based on the occurrence frequencies of word clusters, which can be expected to improve the accuracy of sentence classification.

[0019] Second, sentences and words are classified into their clusters bottom-up and alternately, which makes sentence classification more efficient.

[0020] In a preferred embodiment, the automatic sentence classification apparatus of the present invention comprises: a statistical processing unit that receives a plurality of sentences (texts, documents) as input, counts the occurrence frequency of words in each sentence, and creates a matrix of the word occurrence frequencies in each sentence; and a sentence classification unit that receives the matrix from the statistical processing unit, alternately performs operations of classifying sentences and words into their respective clusters based on the matrix, and outputs the clusters of the classified sentences.

[0021] The data in the matrix of word occurrence frequencies in each sentence is regarded as having been generated from a joint probability model of sentences and words. This joint probability model takes the form of the joint probability of the sentence cluster and the word cluster multiplied by the conditional occurrence probability of the sentence given the sentence cluster to which it belongs and the conditional occurrence probability of the word given the word cluster to which it belongs. The operations of classifying sentences and words are treated as the problem of estimating this probability model, and the model is estimated using an information criterion. In the embodiment of the present invention, the functions of the statistical processing unit and the sentence classification unit can be realized by programs executed on a computer, and the present invention is carried out by loading these programs from a recording medium on which they are recorded and executing them.

[0022]

[Examples] An example of the present invention is described below with reference to the drawings. FIG. 1 shows the configuration of one example of the automatic sentence classification apparatus of the present invention. Referring to FIG. 1, the apparatus comprises a statistical processing unit 1 and a sentence classification unit 2. The statistical processing unit 1 receives a plurality of sentences as input, counts the occurrence frequency of words in each sentence, and creates a matrix of the word occurrence frequencies in each sentence. The sentence classification unit 2 then receives the matrix from the statistical processing unit 1.

[0023] As initial processing, the sentence classification unit 2 treats each sentence as forming one sentence cluster and each word as forming one word cluster.

[0024] Next, it alternately repeats two operations: selecting two sentence clusters and combining (merging) them into one sentence cluster using an information criterion, and selecting two word clusters and merging them into one word cluster.

[0025] Finally, it traces the history of the sentence clusters created by the repeated operations and outputs a sentence classification tree made up of the sentence clusters.

[0026] FIG. 2 shows an example of the matrix of word occurrence frequencies in each sentence created by the statistical processing unit 1. In the example shown in FIG. 2, there are four sentences (texts 1 to 4) and five words. The rows of the matrix correspond to words and the columns to sentences. Texts 1 and 2 are sentences about tennis, and texts 3 and 4 are sentences about soccer.
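As a rough illustration of what the statistical processing unit produces, the following Python sketch builds a word-by-sentence frequency matrix with rows for words and columns for sentences. The function name `build_frequency_matrix`, the whitespace tokenization, and the sample sentences are assumptions made for this sketch; the source does not specify how words are extracted, and Japanese text would need a proper word segmenter.

```python
from collections import Counter

def build_frequency_matrix(texts):
    """Return (vocabulary, matrix) with matrix[i][j] = count of word i in text j."""
    # Assumption: words are separated by whitespace and compared case-insensitively.
    counts = [Counter(text.lower().split()) for text in texts]
    vocabulary = sorted(set().union(*counts))
    # Rows correspond to words, columns to sentences.
    matrix = [[c[word] for c in counts] for word in vocabulary]
    return vocabulary, matrix

# Hypothetical input, not the data of FIG. 2.
vocabulary, matrix = build_frequency_matrix(["racket shot racket", "kick goal"])
```

Each row can then be read as the occurrence pattern of one word across sentences, and each column as the occurrence pattern of words within one sentence.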

[0027] The first two sentences (texts 1 and 2) have similar word occurrence frequency patterns, as do the last two (texts 3 and 4). The sentence classification unit classifies sentences based on this similarity of occurrence frequency patterns.

[0028] Because this automatic sentence classification apparatus groups sentences into sentence clusters recursively, it ultimately creates a sentence classification tree made up of hierarchical sentence clusters. FIG. 3 shows the sentence classification tree created from the data shown in FIG. 2. The root branches into two nodes; texts 1 and 2 branch from the first node and texts 3 and 4 from the second, and each of texts 1 to 4 forms a leaf.

[0029] The sentence classification unit 2 classifies words into word clusters at the same time as it classifies sentences into sentence clusters.

[0030] For example, the words "shot" and "racket" are related to tennis and appear often in tennis-related sentences, while the words "kick" and "goal" are related to soccer and appear often in soccer-related sentences. Words can therefore also be classified based on the distribution pattern of their occurrence frequencies across sentences.

[0031] FIG. 4 shows the word classification tree created from the data shown in FIG. 2. Synonyms, near-synonyms, and related words are assigned to the same word cluster. This makes the occurrence frequency patterns of word clusters made up of synonyms, near-synonyms, and related words clearer, and sentences can then be classified with higher accuracy on that basis.

[0032] The processing of the sentence classification unit 2 is described in more detail below.

[0033] The classification of sentences in the sentence set and of words in the word set is performed recursively as follows.

[0034] As the initial setting, each sentence in the sentence set forms one sentence cluster and, likewise, each word in the word set forms one word cluster.

[0035] First, clustering is performed once on the sentence clusters in the sentence set. Here, "clustering" means selecting the two most similar clusters and combining them into one cluster.

[0036] Next, with the resulting sentence clusters held fixed, clustering is performed once on the word clusters in the word set.

[0037] In this way, clustering is performed alternately on the sentence set and the word set until no further clustering is possible.

[0038] In the end, a sentence classification tree and a word classification tree are obtained. From the viewpoint of the classification trees, the clustering proceeds bottom-up.

[0039] In clustering, two clusters are selected from the current sentence clusters or word clusters and combined into one. The most important decisions in clustering are which clusters are most similar to each other and whether they should be combined into one cluster.

[0040] The method of combining two clusters is described below.

[0041] In this example, the word occurrence frequency data for the sentences is assumed to have been generated from a probability model defined using the direct product of a particular partition of the sentence set and a particular partition of the word set.

[0042] In general, a given sentence set and word set can be partitioned in many ways, and one probability model corresponds to each combination of one partition of the sentence set and one partition of the word set.

[0043] In this example, the one probability model that best explains the word occurrence frequency data while remaining reasonably simple is selected, and the grouping of sentences and the grouping of words in that model are taken as the results of sentence classification and word classification.

[0044] In bottom-up clustering, this amounts to deciding whether the current clusters should be combined any further and, if so, which clusters are best combined.

[0045] The probability model is defined as follows. It is defined using the direct product of a partition of the sentence set and a partition of the word set: the joint probability of a sentence and a word is the joint probability of the sentence cluster and the word cluster to which they belong, multiplied by the conditional occurrence probability of the sentence given its sentence cluster and the conditional occurrence probability of the word given its word cluster.

[0046] In this model, sentences belonging to the same sentence cluster are not generated uniformly but, in general, with different probabilities. Likewise, words belonging to the same word cluster are not generated uniformly but, in general, with different probabilities.

[0047] The probability model is given by the following equation (1).

[0048]

$$P(t, w) = P(t \mid C_t) \times P(w \mid C_w) \times P(C_t, C_w) \qquad (1)$$

[0049] This is a probability model for the occurrence frequency of words in sentences, that is, for the occurrence probability P(t, w) of a sentence-word pair (t, w).

[0050] C_t and C_w denote the sentence cluster to which t belongs and the word cluster to which w belongs, respectively.

[0051] P(t | C_t) is the conditional occurrence probability of the sentence t given the sentence cluster C_t to which it belongs, P(w | C_w) is the conditional occurrence probability of the word w given the word cluster C_w to which it belongs, and P(C_t, C_w) is the joint probability of the sentence cluster C_t and the word cluster C_w.
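The factorization in equation (1) can be made concrete with a short Python sketch. The source does not state how the three probabilities are estimated, so this sketch assumes simple relative frequencies (maximum-likelihood estimates); the function name `estimate_joint_model`, the cluster labels, and the counts are all hypothetical.

```python
def estimate_joint_model(freq, sentence_cluster, word_cluster):
    """Return a function p(t, w) that evaluates equation (1).

    freq: dict mapping (sentence, word) -> occurrence count.
    sentence_cluster / word_cluster: dict mapping each item to a cluster label.
    Assumption: every probability is estimated as a relative frequency.
    """
    total = sum(freq.values())
    n_t, n_w, n_ct, n_cw, n_pair = {}, {}, {}, {}, {}
    for (t, w), n in freq.items():
        ct, cw = sentence_cluster[t], word_cluster[w]
        for table, key in ((n_t, t), (n_w, w), (n_ct, ct), (n_cw, cw), (n_pair, (ct, cw))):
            table[key] = table.get(key, 0) + n

    def p(t, w):
        ct, cw = sentence_cluster[t], word_cluster[w]
        p_t = n_t[t] / n_ct[ct]                 # P(t | C_t)
        p_w = n_w[w] / n_cw[cw]                 # P(w | C_w)
        p_c = n_pair.get((ct, cw), 0) / total   # P(C_t, C_w)
        return p_t * p_w * p_c

    return p

# Hypothetical counts (not the data of FIG. 2 or FIG. 5).
FREQ = {
    ("text1", "racket"): 2, ("text1", "shot"): 1,
    ("text2", "racket"): 1, ("text2", "shot"): 2,
    ("text3", "kick"): 2, ("text3", "goal"): 1,
    ("text4", "kick"): 1, ("text4", "goal"): 2,
}
SENTENCE_CLUSTER = {"text1": "tennis", "text2": "tennis", "text3": "soccer", "text4": "soccer"}
WORD_CLUSTER = {"racket": "A", "shot": "A", "kick": "B", "goal": "B"}

p = estimate_joint_model(FREQ, SENTENCE_CLUSTER, WORD_CLUSTER)
```

Under these estimates the probabilities of all sentence-word pairs sum to 1, because each conditional distribution sums to 1 within its cluster.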

[0052] An example of the probability model is described with reference to FIG. 5. In this probability model, for example, text 1 and text 2 are classified into the same cluster, and the word "racket" and the word "shoot" are classified into the same cluster. FIG. 5 also shows the conditional occurrence probabilities of the words, the conditional occurrence probabilities of the sentences, and the joint probabilities of the sentence clusters and word clusters.

[0053] The next question is the criterion for selecting a probability model. In this example, the probability model is selected using an information criterion called the MDL criterion.

[0054] The MDL criterion asserts that the best model is the one with the smallest sum of the model description length and the data description length (the total description length). In general, the model with the minimum total description length explains the data well and is also simple. Model selection by the MDL criterion is realized by computing the total description length and selecting the model for which it is smallest.

[0055] In bottom-up clustering, this corresponds, for example, to choosing, when deciding which two sentence clusters to combine, the two sentence clusters whose merger produces the model with the minimum total description length, and combining them.

[0056] Given a probability model, its total description length for the input data is calculated by the following equation (2).

[0057]

$$L = -\sum_{t \in T,\, w \in W} \log P(t, w) + \frac{k}{2} \log S \qquad (2)$$

[0058] The first term of equation (2) is the data description length and the second term is the model description length. Here t denotes a sentence and T the sentence set; w denotes a word and W the word set; k is the number of free parameters in the probability model; and S is the number of data items.
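A minimal Python sketch of equation (2) follows, under several assumptions that the source leaves open: probabilities are relative frequencies, the sum runs over every observed occurrence (so each pair is weighted by its count), logarithms are base 2, S is the total number of occurrences, and the number of free parameters is counted as k = (|CT| × |CW| − 1) + (|T| − |CT|) + (|W| − |CW|). The function name `description_length` and the counts are hypothetical.

```python
import math

def description_length(freq, sentence_cluster, word_cluster):
    """Total description length L of equation (2) for one pair of partitions."""
    total = sum(freq.values())  # S: number of data items (assumed: occurrences)
    n_t, n_w, n_ct, n_cw, n_pair = {}, {}, {}, {}, {}
    for (t, w), n in freq.items():
        ct, cw = sentence_cluster[t], word_cluster[w]
        for table, key in ((n_t, t), (n_w, w), (n_ct, ct), (n_cw, cw), (n_pair, (ct, cw))):
            table[key] = table.get(key, 0) + n
    # Data description length: -sum of log P(t, w) over observed occurrences.
    data_len = 0.0
    for (t, w), n in freq.items():
        if n == 0:
            continue
        ct, cw = sentence_cluster[t], word_cluster[w]
        prob = (n_t[t] / n_ct[ct]) * (n_w[w] / n_cw[cw]) * (n_pair[(ct, cw)] / total)
        data_len -= n * math.log2(prob)
    # Model description length: (k / 2) * log S, with k counted as assumed above.
    n_sc = len(set(sentence_cluster.values()))
    n_wc = len(set(word_cluster.values()))
    k = (n_sc * n_wc - 1) + (len(sentence_cluster) - n_sc) + (len(word_cluster) - n_wc)
    return data_len + k / 2 * math.log2(total)

# Hypothetical counts (not the data of FIG. 2).
FREQ = {
    ("text1", "racket"): 2, ("text1", "shot"): 1,
    ("text2", "racket"): 1, ("text2", "shot"): 2,
    ("text3", "kick"): 2, ("text3", "goal"): 1,
    ("text4", "kick"): 1, ("text4", "goal"): 2,
}
texts = ["text1", "text2", "text3", "text4"]
words = ["racket", "shot", "kick", "goal"]

# Every item in its own cluster versus a two-by-two grouping.
singletons = description_length(FREQ, {t: t for t in texts}, {w: w for w in words})
merged = description_length(
    FREQ,
    {"text1": "tennis", "text2": "tennis", "text3": "soccer", "text4": "soccer"},
    {"racket": "A", "shot": "A", "kick": "B", "goal": "B"},
)
```

On these counts the grouped model has the shorter total description length: it describes the data slightly less precisely but needs far fewer parameters, which is the trade-off the MDL criterion balances.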

[0059] In bottom-up clustering, for example, the total description length of the current model is computed first. Next, the total description length is computed for the probability model obtained by each possible merge of two sentence clusters, the reduction in total description length is computed for each of these models, and the model with the largest reduction is selected. If no merge of two sentence clusters reduces the total description length of the resulting probability model, further clustering is not desirable from the viewpoint of the MDL criterion, and clustering is therefore terminated.

[0060] The grouping of sentences in the finally selected model is taken as the result of the sentence classification.

[0061] Next, the actual sentence classification algorithm is described. This algorithm is called "Clustering". Clustering classifies the sentence set and the word set alternately.

[0062] Clustering (T, W), where T denotes the sentence set and W the word set.

[0063] Step 1. Initialize the set of sentence clusters CT and the set of word clusters CW as follows.

[0064] CT = {{t} | t ∈ T}, CW = {{w} | w ∈ W}

Step 2. Repeat the following steps.

[0065] Step 2.1: Update CT by Sentence Merge (CT, CW).

[0066] Step 2.2: Update CW by Word Merge (CW, CT).

[0067] Step 2.3: If neither CT nor CW has changed, go to Step 3.

[0068] Step 3. Convert the history of past CT values into a tree structure and output it as the sentence classification tree.
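Step 3 does not say how the history is turned into a tree, so the following Python sketch shows one possible reading. It assumes the history is a list of successive partitions, each a list of `frozenset` clusters starting from singletons, and represents the tree as nested tuples. The function name `history_to_tree` and the sample history are hypothetical.

```python
def history_to_tree(history):
    """Convert successive partitions into a nested-tuple classification tree."""
    # Leaves: each singleton cluster maps to the item it contains.
    node = {c: next(iter(c)) for c in history[0]}
    for prev, cur in zip(history, history[1:]):
        for c in cur:
            if c not in node:
                # A new cluster's children are the previous clusters it absorbed.
                children = sorted((p for p in prev if p <= c), key=min)
                node[c] = tuple(node[p] for p in children)
    # Join whatever clusters remain at the end under a single root.
    roots = [node[c] for c in sorted(history[-1], key=min)]
    return roots[0] if len(roots) == 1 else tuple(roots)

# Hypothetical history in which texts 1 and 2 merge first, then texts 3 and 4.
s = lambda *names: frozenset(names)
history = [
    [s("text1"), s("text2"), s("text3"), s("text4")],
    [s("text3"), s("text4"), s("text1", "text2")],
    [s("text1", "text2"), s("text3", "text4")],
]
tree = history_to_tree(history)
```

For this history the result has a root with two nodes, one holding texts 1 and 2 and the other holding texts 3 and 4, which matches the shape described for FIG. 3.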

[0069] Each single round of clustering is carried out by the algorithm called "Merge" below. First, the case where the word clusters are held fixed and sentence clusters are merged is described.

[0070] Merge (CT, CW)

Step 1. For every pair of sentence clusters in the set of sentence clusters CT, compute the reduction in the total description length of the probability model that would result from merging the pair, and sort the sentence cluster pairs in descending order of that reduction.

[0071] Step 2. Select the pair with the largest reduction in total description length and merge it.

[0072] Step 3. Output the current set of sentence clusters CT and terminate.

[0073] Next, the case where the sentence clusters are held fixed and word clusters are merged is described.

[0074] Merge (CW, CT)

Step 1. For every pair of word clusters in the set of word clusters CW, compute the reduction in the total description length of the probability model that would result from merging the pair, and sort the word cluster pairs in descending order of that reduction.

[0075] Step 2. Select the pair with the largest reduction in total description length and merge it.

[0076] Step 3. Output the current set of word clusters CW and terminate.
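The Clustering and Merge procedures can be sketched together in Python as follows. This is an illustrative reading, not the patented implementation: it assumes relative-frequency estimates, base-2 logarithms, S equal to the total number of occurrences, and k = (|CT| × |CW| − 1) + (|T| − |CT|) + (|W| − |CW|). It also merges a pair only when doing so strictly reduces the total description length, and it takes the best pair with a linear scan instead of an explicit sort. All function names and counts are hypothetical.

```python
import math
from itertools import combinations

def description_length(freq, s_parts, w_parts):
    """Total description length for partitions given as lists of frozensets."""
    ct_of = {t: i for i, c in enumerate(s_parts) for t in c}
    cw_of = {w: j for j, c in enumerate(w_parts) for w in c}
    total = sum(freq.values())
    n_t, n_w, n_ct, n_cw, n_pair = {}, {}, {}, {}, {}
    for (t, w), n in freq.items():
        ct, cw = ct_of[t], cw_of[w]
        for table, key in ((n_t, t), (n_w, w), (n_ct, ct), (n_cw, cw), (n_pair, (ct, cw))):
            table[key] = table.get(key, 0) + n
    data_len = -sum(
        n * math.log2(n_t[t] / n_ct[ct_of[t]] * n_w[w] / n_cw[cw_of[w]]
                      * n_pair[ct_of[t], cw_of[w]] / total)
        for (t, w), n in freq.items() if n > 0)
    k = (len(s_parts) * len(w_parts) - 1
         + len(ct_of) - len(s_parts) + len(cw_of) - len(w_parts))
    return data_len + k / 2 * math.log2(total)

def merge(parts, length_of):
    """One Merge step: join the pair with the largest reduction, if any."""
    best, best_len = None, length_of(parts)
    for a, b in combinations(parts, 2):
        candidate = [c for c in parts if c not in (a, b)] + [a | b]
        candidate_len = length_of(candidate)
        if candidate_len < best_len:
            best, best_len = candidate, candidate_len
    return parts if best is None else best  # same object means "unchanged"

def clustering(freq):
    """Alternate sentence and word merges until neither partition changes."""
    ct = [frozenset([t]) for t in sorted({t for t, _ in freq})]
    cw = [frozenset([w]) for w in sorted({w for _, w in freq})]
    history = [ct]
    while True:
        new_ct = merge(ct, lambda p: description_length(freq, p, cw))
        new_cw = merge(cw, lambda p: description_length(freq, new_ct, p))
        if new_ct is ct and new_cw is cw:
            return history, cw
        if new_ct is not ct:
            history.append(new_ct)
        ct, cw = new_ct, new_cw

# Hypothetical counts (not the data of FIG. 2).
FREQ = {
    ("text1", "racket"): 2, ("text1", "shot"): 1,
    ("text2", "racket"): 1, ("text2", "shot"): 2,
    ("text3", "kick"): 2, ("text3", "goal"): 1,
    ("text4", "kick"): 1, ("text4", "goal"): 2,
}
history, word_clusters = clustering(FREQ)
```

On these counts the loop stops with two sentence clusters (texts 1 and 2, texts 3 and 4) and two word clusters (racket and shot, kick and goal), because merging further would increase the total description length. Recomputing the description length from scratch for every candidate pair keeps the sketch short; an efficient implementation would update the counts incrementally.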

[0077]

[Effects of the Invention] As described above, the present invention has the effect of enabling highly accurate and efficient classification of sentences. The reason is,

[Brief description of the drawings]

【図１】本発明の一実施例の文章自動分類装置の構成を
示す図である。FIG. 1 is a diagram showing a configuration of an automatic sentence classification apparatus according to an embodiment of the present invention.

【図２】本発明の一実施例を説明するための図であり、
文章における単語の出現頻度のデータの例を示す図であ
る。FIG. 2 is a diagram for explaining one embodiment of the present invention;
FIG. 9 is a diagram illustrating an example of data of the frequency of occurrence of words in a sentence.

FIG. 3 is a diagram for explaining an embodiment of the present invention, showing an example of a sentence classification tree.

FIG. 4 is a diagram for explaining an embodiment of the present invention, showing an example of a word classification tree.

FIG. 5 is a diagram for explaining an embodiment of the present invention, showing an example of a probability model.

[Explanation of symbols]

1 Statistical processing unit
2 Sentence classification unit

Claims

[Claims]

1. An automatic sentence classification apparatus comprising: statistical processing means for inputting a plurality of sentences, tallying the frequency of occurrence of words in each of the sentences, and creating a matrix composed of the word occurrence frequencies in each of the sentences; and sentence classification means for receiving the matrix from the statistical processing means and outputting clusters of classified sentences by alternately performing, based on the matrix, operations of classifying sentences and words into respective clusters (referred to as “sentence clusters” and “word clusters”, respectively), wherein the sentence classification means regards the data in the matrix composed of the frequency of occurrence of words in each sentence as generated from a joint probability model of sentences and words, the probability model calculates the joint probability of a sentence and a word by multiplying the joint probability of a sentence cluster and a word cluster by the conditional occurrence probability of the sentence from the sentence cluster to which the sentence belongs and the conditional occurrence probability of the word from the word cluster to which the word belongs, and the classification of sentences and words is treated as an estimation problem of the probability model, the probability model being estimated using an information amount criterion.

2. The automatic sentence classification apparatus according to claim 1, wherein an MDL criterion is used as said information amount criterion.

3. An automatic sentence classification apparatus comprising: statistical processing means for inputting a plurality of sentences, tallying the frequency of occurrence of words in each of the sentences, and creating a matrix composed of the word occurrence frequencies in each of the sentences; and sentence classification means for receiving the matrix from the statistical processing means and outputting clusters of classified sentences by alternately performing, based on the matrix, operations of classifying sentences and words into respective clusters (referred to as “sentence clusters” and “word clusters”, respectively).

4. The automatic sentence classification apparatus according to claim 3, wherein the sentence classification means, as initial processing, treats each sentence as forming one sentence cluster and each word as forming one word cluster, then alternately repeats an operation of selecting two sentence clusters and merging them into one sentence cluster using an information amount criterion and an operation of selecting two word clusters and merging them into one word cluster, and outputs a sentence classification tree composed of sentence clusters by tracing the history of the sentence clusters created by the repeated operations.

5. The automatic sentence classification apparatus according to claim 3, wherein the sentence classification means assumes that the word occurrence frequency data in the sentences is generated from a probability model defined using the direct product of a specific partition of the sentence set and a specific partition of the word set, the probability model calculates the joint probability of a sentence and a word by multiplying the joint probability of the sentence cluster and the word cluster to which they belong by the conditional occurrence probability of the sentence from the sentence cluster to which the sentence belongs and the conditional occurrence probability of the word from the word cluster to which the word belongs, and, given one probability model, the sentence classification means selects and merges the two sentence clusters that yield the model whose total description length with respect to the input data is minimal.

6. A recording medium recording a program for causing a computer to function as: (a) statistical processing means for inputting a plurality of sentences, tallying the frequency of occurrence of words in each of the sentences, and creating a matrix composed of the word occurrence frequencies in each of the sentences; and (b) sentence classification means for receiving the matrix from the statistical processing means and outputting clusters of classified sentences by alternately performing, based on the matrix, operations of classifying sentences and words into respective clusters (hereinafter referred to as “sentence clusters” and “word clusters”), wherein the sentence classification means, as initial processing, treats each sentence as forming one sentence cluster and each word as forming one word cluster, then repeats an operation of selecting two sentence clusters and merging them into one sentence cluster using an information amount criterion and an operation of selecting two word clusters and merging them into one word cluster, and outputs a sentence classification tree composed of sentence clusters by tracing the history of the sentence clusters created by the repeated operations.

7. An automatic sentence classification method comprising the steps of: (a) inputting a plurality of sentences, tallying the frequency of occurrence of words in each of the sentences, and creating a matrix composed of the word occurrence frequencies in each of the sentences; and (b) outputting clusters of classified sentences by alternately performing, based on the matrix, operations of classifying sentences and words into respective clusters (referred to as “sentence clusters” and “word clusters”, respectively), wherein in step (b), as initial processing, each sentence is treated as forming one sentence cluster and each word as forming one word cluster, an operation of selecting two sentence clusters and merging them into one sentence cluster using an information amount criterion and an operation of selecting two word clusters and merging them into one word cluster are then alternately repeated, and a sentence classification tree composed of sentence clusters is output by tracing the history of the sentence clusters created by the repeated operations.

8. The automatic sentence classification method according to claim 7, wherein in step (b), the word occurrence frequency data in the sentences is assumed to be generated from a probability model defined using the direct product of a specific partition of the sentence set and a specific partition of the word set, the probability model calculates the joint probability of a sentence and a word by multiplying the joint probability of the sentence cluster and the word cluster to which they belong by the conditional occurrence probability of the sentence from the sentence cluster to which the sentence belongs and the conditional occurrence probability of the word from the word cluster to which the word belongs, and, when determining which two sentence clusters should be merged, given one probability model, the two sentence clusters that yield the model whose total description length with respect to the input data is minimal are selected and merged.