JP4125951B2

JP4125951B2 - Text automatic classification method and apparatus, program, and recording medium

Info

Publication number: JP4125951B2
Application number: JP2002373868A
Authority: JP
Inventors: 正之杉崎; 俊朗牧野; 勝宮本; 久茨木
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2002-12-25
Filing date: 2002-12-25
Publication date: 2008-07-30
Anticipated expiration: 2022-12-25
Also published as: JP2004206355A

Description

【０００１】
【発明の属する技術分野】
本発明は、例えばインターネット上の文書のような大量のテキスト情報を分類するために利用されるテキスト自動分類方法及び装置並びにプログラム及び記録媒体に関する。
【０００２】
【従来の技術】
例えばインターネットなどのコンピュータネットワークにおいては、カテゴリ，発信者，作者，発信地，作成日時などが不特定多数の電子化された大量の文書情報が公開されている。これらの文書情報の多くは、文字列情報で構成されるテキストが主体になっている。
【０００３】
このような不特定多数の大量の文書情報の中から必要な情報や重要な情報を獲得するために、様々な分析が行われている。
例えば、多数の情報をいくつかのグループに分類する際に良く用いられるのはクラスター分析と呼ばれる方法である。
クラスター分析は、多変量解析手法の１つであり、「異質なものの混ざりあっている対象（それは個体＝ものの場合もあるし、変数の場合もある）を、それらの間に何らかの意味で定義された類似度（similarity）を手がかりにして似たものをあつめ、いくつかの均質なものの集落（クラスター）に分類する方法を総称したもの」である（非特許文献１）。
【０００４】
すなわち、類似した情報同士を結び付け、互いに関連のある情報をクラスターと呼ばれるグループに分類する。
クラスター分析のアルゴリズムにおいては、一般に次のような処理が行われる（図５，図６参照）。
（Ｓ１）初期設定：Ｎ個のデータ（ｄ１，ｄ２，・・・，ｄＮ）が存在する場合に各データを要素とする要素数が１のクラスター（Ｃ１，Ｃ２，・・・，ＣＮ）の集合Ｄ＝｛Ｃ１，Ｃ２，・・・，ＣＮ｝を形成する。
【０００５】
（Ｓ２）クラスター集合Ｄを探索し、この中で最も類似したクラスター同士をクラスター組（Ｃｉ，Ｃｊ）として抽出する。
（Ｓ３）クラスターＣｉ，Ｃｊから新たなクラスターＣｋを生成し、Ｃｋをクラスター集合Ｄに加える。
【０００６】
（Ｓ４）クラスター集合ＤからクラスターＣｉ，Ｃｊを削除する。
（Ｓ５）所定の終了条件を満たすまで上記(Ｓ２)〜(Ｓ４)の処理を繰り返す。
終了条件としては、例えば「クラスターの数がｍ個まで」や、前記ステップＳ２で「類似度の値によって類似していると判断されなくなった場合」などが想定される。
【０００７】
新たに作られたクラスターとそれ以外のクラスターとの類似度を計算する場合には、計算コストの関係から、一般的に新たに作られたクラスターから代表的な値を使って計算される。これは「重心法」や「メジアン法」と呼ばれる。
クラスター分析の用途としては文書の自動分類などがある。例えば、新聞記事などの大量の文書を集めておき、この中から類似した内容の文書同士を分類するために利用できる。
【０００８】
この場合、各文書の特徴を、要素が文書内に存在する単語と１対１に対応したベクトル（特徴ベクトル）として表現する。また、各単語に対応する要素は、文書に対するその単語の重要度（重み）を実数値化して表現する。
単語の重みを計算する場合には、その単語の出現頻度や、文書集合全体における分布の割合や、文字数の長さなどに基づいて決定するのが一般的である（非特許文献２）。
【０００９】
このようにして得られるベクトルを用いて、各文書間の類似度（距離）を定義する。例えば、ユークリッド空間におけるベクトル同士のなす角度（ｃｏｓθ）などを利用し、値が大きいベクトル同士は類似しているものとみなす。
このような類似度に基づいて、類似した文書同士を同じクラスターに分類していき、クラスター同士の類似性がなくなるまで処理を繰り返す。この結果、大量の文書をいくつかの文書グループに分類できる。
【非特許文献１】
“多変量統計解析法”，田中，脇本，現代数学社，１９８３初版
【非特許文献２】
“Automatic Text Processing”，Gerard Salton, ADDISON-WESLEY pub. 1989
【００１０】
【発明が解決しようとする課題】
上記のような従来のクラスター分析においては、再帰的な処理を行うので、処理が進むにつれてクラスターの数が徐々に減少する。従って、いつまでも処理を続けると同じクラスターの中に互いに全く類似しないデータが分類されることになり、分類の誤りが発生する。
【００１１】
このような誤りの発生をなくすために、クラスター間の類似性の有無を閾値などを用いて判断し、類似性がなくなった段階で処理を終了する。しかし、扱うデータの種類に応じて適切に閾値を決定しないと、分類処理が十分に行われる前に処理が終了したり、類似性のないデータが同じクラスターに混入する場合が多いのが実状である。
【００１２】
本発明は、分類精度を改善することが可能なテキスト自動分類方法及び装置並びにプログラム及び記録媒体を提供することを目的とする。
【００１３】
【課題を解決するための手段】
請求項１は、文字列情報で構成されるテキストを処理対象とし、互いに独立した複数のＮ個のテキストを入力して、各テキストをテキスト間の類似性に基づいてＮよりも小さい数のテキストを要素とするクラスターに分類するテキスト自動分類装置において、互いに異なる２つのクラスター間の関連度を関連度テーブルに格納し、最も関連度が高い２つのクラスターをクラスター組として検出するクラスター組検出手段と、前記クラスター組の一方のクラスターに含まれる複数の各要素と、前記クラスター組のもう一方のクラスターに含まれる複数の各要素との、要素同士の類似度を求め類似度テーブルに格納する要素間類似度検出手段と、前記類似度を求めた要素同士の組み合わせ総数に対する、類似度が第１の閾値以上の組み合わせの個数の割合を算出する類似度割合算出手段と、前記類似度割合算出手段が算出した割合を第２の閾値と比較する類似度割合比較手段と、前記割合が第２の閾値未満であれば前記クラスター組を統合しないクラスター組除外手段と、前記割合が第２の閾値以上であれば前記クラスター組を統合して新たなクラスターを生成するクラスター組統合手段とを設けたことを特徴とする。
【００１４】
一般的なクラスター分析においては、クラスター組が検出されると無条件でそのクラスター組から新たなクラスターが生成され、その元になる各クラスターは削除され、クラスターの数が徐々に減少する。従って、互いに関連の小さい要素（テキスト情報）が多数含まれる新たなクラスターが生成される可能性も高い。請求項１においては、クラスター組が検出されると、前記クラスター組の各クラスターを構成する複数の要素について、クラスター間で要素同士の類似度を求め、それを閾値と比較するので、クラスター組として検出された各クラスターに含まれる要素同士の類似度が小さい場合には、そのクラスター組が新たなクラスターとして統合されるのを未然に防止できる。
【００１５】
なお、例えば前記クラスター組として検出されたクラスター間の類似度に０を割り当てることにより、それらのクラスターが再び同じクラスター組として検出されるのを防止し、クラスター間の類似性判断対象から除外することができる。これにより、クラスターがどんどん肥大化していき、クラスター内に属する要素に関連のない組合せが現れるのを避けることができ、関連の高い要素同士を集めたクラスターだけを高精度に抽出することができる。
【００１６】
なお、非特許文献１の技術（群平均法）は本発明の技術と関連があるが、計算にかかるコストは重心法を元にしている本発明のほうが小さい。
【００１７】
また、請求項１においては、クラスター組を検出した場合に、クラスター間でそれらに含まれる各要素の類似度が第１の閾値の条件を満たす割合が第２の閾値よりも大きいか小さいかに応じて前記クラスター組が適切か否かを区別するので、要素間の関連が大きい要素の数に比べて要素間の関連が小さい要素の数が多い場合には、それらのクラスター組を１つのクラスターに統合するのを中止できる。従って、精度の高い分類処理が可能である。
【００２０】
請求項２のテキスト分類プログラムは、請求項１に記載のテキスト自動分類装置を構成する各手段としてコンピュータを機能させることを特徴とする。
【００２１】
請求項２のプログラムを実行することにより、請求項１と同様の動作を実現できる。
【００２２】
請求項３は、請求項２のプログラムを記録したコンピュータで読みとり可能な記録媒体である。
【００２３】
【発明の実施の形態】
本発明のテキスト自動分類方法及び装置並びにプログラム及び記録媒体の１つの実施の形態について、図１〜図４を参照して説明する。この形態は全ての請求項に対応する。
【００２４】
図１はこの形態のテキスト分類処理を示すフローチャートである。図２は文書情報を扱う装置の構成例を示すブロック図である。図３はクラスター生成の具体例を示す模式図である。図４は関連度テーブルの構成例を示す模式図である。
この形態では、請求項１のクラスター組検出手段，要素間類似度検出手段，要素間類似度比較手段及びクラスター組除外手段は、それぞれステップＳ１３，Ｓ１５，Ｓ１６及びＳ２０に対応する。
【００２５】
また、請求項１のクラスター組検出手段，要素間類似度検出手段，類似割合算出手段，類似割合比較手段及びクラスター組除外手段は、それぞれステップＳ１３，Ｓ１５，Ｓ１６，Ｓ１７，Ｓ２０に対応する。更に、請求項１のクラスター組統合手段は、Ｓ１８とＳ１９とに対応する。
図２に示す装置においては、分類処理装置１０がテキストの文書情報の分類処理を行う場合を想定している。分類処理装置１０は、例えばパソコンとその上で動作する分類処理用のプログラムとで構成される。分類処理装置１０に備わった記憶装置１１には、収集された大量の文書情報が保持される。また、独立した複数の文書情報間の類似度を表す行列が、類似度テーブル１２として分類処理装置１０上に作成される。更に、クラスター間の関連度を表す情報が関連度テーブル１３上に作成される。
【００２６】
この例では、分類処理装置１０はインターネット２０及びＬＡＮ３０に接続されている。従って、分類処理装置１０はインターネット２０に接続された様々なサーバ２１(1)，２１(2)，２１(3)，・・・のデータベース２２やＬＡＮ３０に接続された様々なサーバ３１(1)，３１(2)，・・・のデータベース３２から文書情報、例えばＨＴＭＬ形式のテキストデータファイル（文字列情報）を収集することができる。
【００２７】
分類処理装置１０の動作について、図１を参照しながら説明する。
ステップＳ１０では、分類処理装置１０はインターネット２０やＬＡＮ３０を介して処理対象となる文書情報群を収集する。
また、文書情報間の類似度の計算に必要な関数が予め用意されていない場合や、選択可能な複数種類の関数が用意されている場合には、分類処理装置１０は関数の選択入力や直接入力をオペレータに対して促し、１つの関数を取得する。更に、後述する処理で用いられる各閾値を変更可能な場合には、各閾値の入力をオペレータに対して促す。
【００２８】
ステップＳ１１では、Ｓ１０で収集された全ての文書情報について、関数に従ってそれぞれの文書間の類似度を計算し、その結果の類似度行列を類似度テーブル１２に登録する。なお、類似度テーブル１２内の行列のうち半分の要素は同じ組み合わせであり同じ類似度なので不要である。
処理対象となる文書情報としては、例えばある期間に掲載された新聞の記事などが想定される。
【００２９】
このような文書情報間の類似度を求める際には、文書内に存在する単語と１対１に対応したベクトル（特徴ベクトル）を検出し、各単語の文書に対する重要度（重み）を実数値化して表現する。また、単語の重みを計算する場合には、その単語の出現頻度や、文書集合全体における分布の割合や、文字数の長さなどに基づいて決定すればよい。このようにして得られる特徴ベクトルを用いて、各文書間の類似度（距離）を例えばユークリッド空間におけるベクトル同士のなす角度（ｃｏｓθ）を利用し、値が大きいベクトル同士は類似しているものとみなす。
【００３０】
ステップＳ１２ではクラスター分析の初期化を行う。すなわち、要素数が１のクラスターを順次に作成し、各クラスターに処理対象の文書情報を１つずつ割り当てる。例えば、Ｎ個の文書情報を処理対象とする場合には、最初はＮ個のクラスター群Ｃ１〜ＣＮが作成される。そして、これらの集まりを集合Ｄとする。
ステップＳ１３では、集合Ｄに含まれるクラスター群Ｃ１〜ＣＮから２つのクラスターを全ての組み合わせについて順次に抽出し、全ての組み合わせの中でクラスター同士の関係が最も類似しているものをクラスター組として検出する。
【００３１】
実際には、クラスター毎に求められる特徴の中心ベクトル（重心ベクトル）を用いて、全クラスターの組み合わせの中で、中心ベクトル間の距離が最も近い（関連度が高い）ものをクラスター組として検出する。
集合Ｄにおける各クラスター間の関連度は、例えば図４に示すような形の関連度テーブル１３として求められる。図４の例では、クラスターＣ２，Ｃ３間の関連度（０．３５）が最も大きい（近い）ので、クラスターＣ２，Ｃ３の組み合わせが候補のクラスター組（Ｃ２，Ｃ３）としてステップＳ１３で検出される。
【００３２】
ステップＳ１４では、終了条件を満たすか否かを識別する。例えば、Ｓ１３で検出された候補のクラスター組におけるクラスター間の距離などを所定の閾値ｔｈ１と比較することにより、処理を終了すべきか否かを識別する。
終了条件を満たさなければステップＳ１５以降の処理を繰り返し実行する。
ステップＳ１５では、Ｓ１３で検出された候補のクラスター組（Ｃｉ，Ｃｊ）について、ｉ番目のクラスターＣｉに含まれている各要素（１つ又は複数の文書情報）ｘとｊ番目のクラスターＣｊに含まれている各要素ｙとの全ての要素の組み合わせについて、類似度Ｓ（ｘ，ｙ）をそれぞれ求める。これらの類似度Ｓ（ｘ，ｙ）は予めステップＳ１１で計算され類似度テーブル１２に保持されているので、類似度テーブル１２を参照して得ることができる。
【００３３】
例えば、図３に示すクラスターＣ２，Ｃ３を候補のクラスター組として検出した場合には、クラスターＣ２に３つの文書情報が要素として存在し、クラスターＣ３に４つの文書情報が要素として存在しているので要素（ｘ，ｙ）の組み合わせが１２組あり、これらの組み合わせについてそれぞれ類似度Ｓ（ｘ，ｙ）を求めることになる。
【００３４】
ステップＳ１６では、各々の類似度Ｓ（ｘ，ｙ）を閾値ｔｈ２と比較し、候補のクラスター組（Ｃｉ，Ｃｊ）における全組み合わせ数Ｎ１と、閾値ｔｈ２の条件を満たす数Ｎ２との割合Ｒを算出する（Ｒ＝Ｎ２／Ｎ１）。
例えば、図３に示す候補のクラスター組（Ｃ２，Ｃ３）の場合は（Ｎ１＝１２）であるので、仮に（Ｓ（ｘ，ｙ）≧ｔｈ２）の条件を満たす数Ｎ２が９であれば、（Ｒ＝０．７５）になる。
【００３５】
そして、割合Ｒを閾値ｔｈ３（例えば０．６）と比較し、（Ｒ≧ｔｈ３）であればステップＳ１７からＳ１８に進み、（Ｒ＜ｔｈ３）ならステップＳ２０に進む。
すなわち、（Ｒ≧ｔｈ３）であれば候補のクラスター組（Ｃｉ，Ｃｊ）をステップＳ１８で新たなクラスターとして生成し集合Ｄに追加する。また、新たなクラスターの元になるクラスターＣｉ，Ｃｊは不要なのでステップＳ１９で集合Ｄから削除する。
【００３６】
一方、（Ｒ＜ｔｈ３）であれば候補のクラスター組（Ｃｉ，Ｃｊ）をクラスター組として採用しない。また、これ以降クラスターＣｉとクラスターＣｊとの組み合わせをステップＳ１３での関連性判断の対象にしないようにステップＳ２０で処理する。例えば、関連度テーブル１３の行列において該当するクラスターＣｉ，Ｃｊ間の関連度を０に変更することにより、これらを関連性判断の対象から除外できる。
【００３７】
このため、各クラスターの中心ベクトル間の距離が近い場合であっても、クラスター間で文書情報同士の類似度が小さい文書の割合が大きい場合には、これらのクラスターの組み合わせを新たなクラスターとして統合しないので、同じクラスター内に類似性の小さい文書が分類されるのを抑制できる。
ステップＳ１４で所定の終了条件を満たす場合には、この分類処理を終了し、分類の結果を分類処理装置１０上に表示する。
【００３８】
なお、分類処理装置１０における処理については、専用のハードウェアで実現することもできるし、コンピュータ上で実行されるプログラムとして実現することもできる。また、プログラムはＣＤ−ＲＯＭなどの記録媒体からコンピュータに読み込むことができる。
【００３９】
【発明の効果】
以上のように、本発明によれば従来のクラスター分析よりも高い精度でテキスト情報の分類処理を行うことができる。
【図面の簡単な説明】
【図１】実施の形態のテキスト分類処理を示すフローチャートである。
【図２】文書情報を扱う装置の構成例を示すブロック図である。
【図３】クラスター生成の具体例を示す模式図である。
【図４】関連度テーブルの構成例を示す模式図である。
【図５】一般的なクラスター分析処理を示すフローチャートである。
【図６】一般的なクラスター分析の処理を示す模式図である。
【符号の説明】
１０分類処理装置
１１記憶装置
１２類似度テーブル
１３関連度テーブル
２０インターネット
２１サーバ
２２データベース
３０ＬＡＮ
３１サーバ
３２データベース
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to an automatic text classification method and apparatus, a program, and a recording medium used for classifying a large amount of text information such as documents on the Internet.
[0002]
[Prior art]
For example, on a computer network such as the Internet, a large amount of digitized document information is publicly available whose categories, senders, authors, places of origin, creation dates, and the like are many and unspecified. Most of this document information consists mainly of text composed of character string information.
[0003]
Various analyses are performed in order to obtain necessary or important information from such a large amount of unspecified document information.
For example, a method called cluster analysis is often used to classify a large amount of information into several groups.
Cluster analysis is one of the multivariate analysis methods and is "a general term for methods that take a mixture of heterogeneous objects (which may be individuals, i.e. things, or may be variables), gather similar ones by using a similarity defined between them in some sense as a clue, and classify them into several homogeneous groups (clusters)" (Non-Patent Document 1).
[0004]
That is, similar information is linked and information related to each other is classified into a group called a cluster.
In the cluster analysis algorithm, the following processing is generally performed (see FIGS. 5 and 6).
(S1) Initial setting: When there are N pieces of data (d1, d2, ..., dN), a set D = {C1, C2, ..., CN} of clusters (C1, C2, ..., CN) is formed, each cluster containing exactly one piece of data as its element.
[0005]
(S2) The set D of clusters is searched, and the two most similar clusters in it are extracted as a cluster set (a pair) (Ci, Cj).
(S3) A new cluster Ck is generated from the clusters Ci and Cj, and Ck is added to the cluster set D.
[0006]
(S4) The clusters Ci and Cj are deleted from the cluster set D.
(S5) The above processes (S2) to (S4) are repeated until a predetermined end condition is satisfied.
Possible end conditions include, for example, "until the number of clusters has been reduced to m" or "when the clusters examined in step S2 are no longer judged to be similar based on the similarity value".
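Steps S1 to S5 above can be sketched roughly as follows. This is only an illustrative sketch, not the procedure of any cited reference: the function names, the use of the negative squared distance between centroids as the similarity, and the "stop at m clusters" end condition are all assumptions made for the example.

```python
from itertools import combinations

def centroid(cluster):
    """Mean vector of a cluster given as a list of equal-length tuples."""
    n = len(cluster)
    return tuple(sum(col) / n for col in zip(*cluster))

def closeness(a, b):
    """Assumed similarity: negative squared distance between the centroids."""
    ca, cb = centroid(a), centroid(b)
    return -sum((p - q) ** 2 for p, q in zip(ca, cb))

def conventional_clustering(data, m):
    # S1: one single-element cluster per data item.
    D = [[d] for d in data]
    # S5: repeat until only m clusters remain (assumed end condition).
    while len(D) > max(m, 1):
        # S2: find the most similar pair (Ci, Cj) in D.
        i, j = max(combinations(range(len(D)), 2),
                   key=lambda p: closeness(D[p[0]], D[p[1]]))
        # S3: build the new cluster Ck; S4: delete Ci and Cj from D.
        merged = D[i] + D[j]
        D = [c for k, c in enumerate(D) if k not in (i, j)]
        D.append(merged)
    return D

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
clusters = conventional_clustering(points, m=2)
```

Note that the pair found in S2 is always merged here, whatever its elements look like; this unconditional merge is the behavior the invention later modifies.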
[0007]
When calculating the degree of similarity between a newly created cluster and other clusters, it is generally calculated using a representative value from the newly created cluster because of the calculation cost. This is called the “centroid method” or “median method”.
Cluster analysis is used for automatic document classification. For example, a large number of documents such as newspaper articles can be collected and used to classify documents having similar contents.
[0008]
In this case, the features of each document are expressed as a vector (feature vector) whose elements correspond one-to-one to the words in the document. The element corresponding to each word expresses the importance (weight) of that word for the document as a real value.
The weight of a word is generally determined based on its frequency of appearance, its distribution ratio across the entire document set, its length in characters, and the like (Non-Patent Document 2).
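As one possible illustration of such a weighting, the sketch below combines the frequency of a word in a document with how widely the word is spread over the document set, in the manner of the well-known TF-IDF scheme. The text does not prescribe this particular formula; the function name `tfidf_vectors` and the sample documents are assumptions for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document."""
    n = len(docs)
    # Document frequency: the number of documents in which each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # term frequency within this document
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors

docs = [["stock", "market", "stock"], ["market", "rain"], ["rain", "cloud"]]
vecs = tfidf_vectors(docs)
```

With this choice, a word that appears in every document gets weight 0, and a word that is frequent in one document but rare elsewhere gets a large weight.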
[0009]
Using the vectors obtained in this way, the similarity (distance) between documents is defined. For example, the cosine of the angle (cos θ) between two vectors in Euclidean space is used, and vectors for which this value is large are regarded as similar to each other.
Based on such similarity, similar documents are classified into the same cluster, and the process is repeated until there is no similarity between the clusters. As a result, a large number of documents can be classified into several document groups.
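The cos θ similarity mentioned above can be computed as in the following minimal sketch, which assumes that each feature vector is held as a sparse `{word: weight}` dictionary. The function name and the convention of returning 0 for an empty vector are choices made for this example.

```python
import math

def cosine_similarity(u, v):
    """cos θ between two sparse vectors given as {word: weight} dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero or empty vectors
    return dot / (norm_u * norm_v)

a = {"stock": 2.0, "market": 1.0}
b = {"stock": 1.0, "market": 0.5}   # same direction as a
c = {"rain": 1.0}                   # no word in common with a
```

Here `cosine_similarity(a, b)` is 1 up to rounding because the two vectors point in the same direction, and `cosine_similarity(a, c)` is 0 because the documents share no words.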
[Non-Patent Document 1]
“Multivariate Statistical Analysis”, Tanaka, Wakimoto, Gendai Sugakusha, 1983 First Edition
[Non-Patent Document 2]
“Automatic Text Processing”, Gerard Salton, ADDISON-WESLEY pub. 1989
[0010]
[Problems to be solved by the invention]
In the conventional cluster analysis as described above, since recursive processing is performed, the number of clusters gradually decreases as processing proceeds. Therefore, if the processing is continued indefinitely, data that are not similar to each other are classified in the same cluster, and a classification error occurs.
[0011]
To eliminate such errors, the presence or absence of similarity between clusters is judged using a threshold or the like, and the process ends once no similarity remains. In practice, however, unless the threshold is determined appropriately for the type of data being handled, the process often ends before classification has been carried out sufficiently, or data with no similarity end up mixed into the same cluster.
[0012]
An object of the present invention is to provide an automatic text classification method and apparatus, a program, and a recording medium capable of improving classification accuracy.
[0013]
[Means for Solving the Problems]
Claim 1 is an automatic text classification device that processes text composed of character string information, receives a plurality of N mutually independent texts as input, and classifies the texts, based on the similarity between them, into clusters each having fewer than N texts as elements, the device being characterized by comprising: cluster set detection means for storing the degree of association between two different clusters in a relevance table and detecting the two clusters having the highest degree of association as a cluster set; inter-element similarity detection means for obtaining the similarity between each of the elements included in one cluster of the cluster set and each of the elements included in the other cluster of the cluster set, and storing the similarities in a similarity table; similarity ratio calculation means for calculating the ratio of the number of combinations whose similarity is equal to or greater than a first threshold to the total number of element combinations for which the similarity was obtained; similarity ratio comparison means for comparing the ratio calculated by the similarity ratio calculation means with a second threshold; cluster set exclusion means for not integrating the cluster set if the ratio is less than the second threshold; and cluster set integration means for integrating the cluster set to generate a new cluster if the ratio is equal to or greater than the second threshold.
[0014]
In general cluster analysis, when a cluster set is detected, a new cluster is generated from it unconditionally, the clusters it was built from are deleted, and the number of clusters gradually decreases. There is therefore a high possibility that a new cluster will be generated that contains many elements (text information) only weakly related to each other. In claim 1, when a cluster set is detected, the similarity between elements is obtained across the two clusters of the cluster set, for the elements constituting each cluster, and is compared with a threshold. Thus, when the similarity between the elements contained in the detected clusters is small, the cluster set can be prevented in advance from being integrated into a new cluster.
[0015]
Note that, for example, by assigning 0 to the similarity between the clusters detected as the cluster set, these clusters are prevented from being detected again as the same cluster set and can be excluded from subsequent similarity determination between clusters. This keeps clusters from growing ever larger and acquiring combinations of unrelated elements, so that only clusters gathering highly related elements are extracted, with high accuracy.
[0016]
Although the technique of Non-Patent Document 1 (the group average method) is related to the technique of the present invention, the computational cost is lower for the present invention, which is based on the centroid method.
[0017]
Further, in claim 1, when a cluster set is detected, whether the cluster set is appropriate is decided according to whether the proportion of element pairs across the two clusters whose similarity satisfies the first threshold condition is larger or smaller than the second threshold. Therefore, when weakly related elements outnumber strongly related elements, integration of the cluster set into a single cluster can be cancelled. Highly accurate classification processing is therefore possible.
[0020]
According to a second aspect of the present invention, there is provided a text classification program for causing a computer to function as each means constituting the automatic text classification apparatus according to the first aspect.
[0021]
By executing the program of claim 2, the same operation as that of claim 1 can be realized.
[0022]
A third aspect of the present invention is a computer-readable recording medium on which the program of the second aspect is recorded.
[0023]
DETAILED DESCRIPTION OF THE INVENTION
An embodiment of the automatic text classification method and apparatus, program, and recording medium of the present invention will be described with reference to FIGS. 1 to 4. This embodiment corresponds to all the claims.
[0024]
FIG. 1 is a flowchart showing the text classification processing of this embodiment. FIG. 2 is a block diagram illustrating a configuration example of an apparatus that handles document information. FIG. 3 is a schematic diagram showing a specific example of cluster generation. FIG. 4 is a schematic diagram illustrating a configuration example of the relevance table.
In this embodiment, the cluster set detection means, the element similarity detection means, the element similarity comparison means, and the cluster set exclusion means of claim 1 correspond to steps S13, S15, S16, and S20, respectively.
[0025]
Further, the cluster set detecting means, the element similarity detecting means, the similarity ratio calculating means, the similarity ratio comparing means, and the cluster set excluding means of claim 1 correspond to steps S13, S15, S16, S17, and S20, respectively. Further, the cluster set integrating means of claim 1 corresponds to S18 and S19.
In the apparatus shown in FIG. 2, it is assumed that the classification processing device 10 performs classification processing of text document information. The classification processing device 10 is composed of, for example, a personal computer and a classification processing program that runs on it. The storage device 11 provided in the classification processing device 10 holds a large amount of collected document information. A matrix representing the similarity between the independent pieces of document information is created on the classification processing device 10 as the similarity table 12. Further, information representing the degree of association between clusters is created in the relevance table 13.
[0026]
In this example, the classification processing device 10 is connected to the Internet 20 and the LAN 30. Accordingly, the classification processing device 10 can collect document information, for example HTML-format text data files (character string information), from the databases 22 of the various servers 21(1), 21(2), 21(3), ... connected to the Internet 20 and from the databases 32 of the various servers 31(1), 31(2), ... connected to the LAN 30.
[0027]
The operation of the classification processing apparatus 10 will be described with reference to FIG.
In step S10, the classification processing apparatus 10 collects the document information groups to be processed via the Internet 20 or the LAN 30.
Further, when the function needed for calculating the similarity between pieces of document information has not been prepared in advance, or when several selectable functions have been prepared, the classification processing apparatus 10 prompts the operator to select or directly enter a function, and thereby obtains one function. Furthermore, when the threshold values used in the processing described later can be changed, the operator is prompted to enter each threshold value.
[0028]
In step S11, the similarity between every pair of documents is calculated according to the function for all the document information collected in S10, and the resulting similarity matrix is registered in the similarity table 12. Note that the matrix is symmetric: half of its elements correspond to the same combinations with the same similarity and are therefore unnecessary.
As the document information to be processed, for example, articles of newspapers published in a certain period are assumed.
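Because the similarity matrix of step S11 is symmetric, a similarity table only needs one entry per unordered pair of documents. The sketch below shows one way to hold it; the dictionary layout, the function names, and the dot product used as a stand-in similarity function are assumptions for illustration only.

```python
from itertools import combinations

def build_similarity_table(vectors, sim):
    """Store S(x, y) once per unordered pair, keyed by (i, j) with i < j."""
    return {(i, j): sim(vectors[i], vectors[j])
            for i, j in combinations(range(len(vectors)), 2)}

def lookup(table, x, y):
    """Symmetric lookup for x != y: S(x, y) equals S(y, x)."""
    return table[(x, y) if x < y else (y, x)]

def dot(u, v):
    """Stand-in similarity function used only for this example."""
    return sum(p * q for p, q in zip(u, v))

table = build_similarity_table([(1, 0), (1, 1), (0, 1)], dot)
```

For N documents the table holds N(N-1)/2 entries instead of N×N, and `lookup(table, 2, 1)` returns the same value as `lookup(table, 1, 2)`.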
[0029]
When obtaining the similarity between such pieces of document information, a vector (feature vector) whose elements correspond one-to-one to the words in the document is derived, and each element expresses the importance (weight) of the word for the document as a real value. The weight of a word may be determined based on its frequency of appearance, its distribution ratio across the entire document set, its length in characters, and the like. Using the feature vectors obtained in this way, the similarity (distance) between documents is defined, for example, by the cosine of the angle (cos θ) between the vectors in Euclidean space, and vectors for which this value is large are regarded as similar to each other.
[0030]
In step S12, cluster analysis is initialized. That is, clusters with one element are sequentially created, and document information to be processed is assigned to each cluster one by one. For example, when N pieces of document information are to be processed, initially N cluster groups C1 to CN are created. These collections are set as a set D.
In step S13, every combination of two clusters is extracted in turn from the cluster groups C1 to CN included in the set D, and among all the combinations the one whose clusters are most similar to each other is detected as the cluster set.
[0031]
In practice, the feature center vector (centroid vector) computed for each cluster is used, and among all combinations of clusters the one with the shortest distance between center vectors (the highest degree of association) is detected as the cluster set.
The degree of association between the clusters in the set D is obtained, for example, as a relevance table 13 of the form shown in FIG. 4. In the example of FIG. 4, since the degree of association (0.35) between the clusters C2 and C3 is the largest (closest), the combination of the clusters C2 and C3 is detected as the candidate cluster set (C2, C3) in step S13.
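Selecting the candidate in step S13 amounts to finding the largest entry of the relevance table. In the sketch below only the value 0.35 for (C2, C3) comes from the text; the other entries, the dictionary layout, and the function name are made up for illustration, since the full contents of FIG. 4 are not reproduced here.

```python
relevance = {
    ("C1", "C2"): 0.10,   # made-up value
    ("C1", "C3"): 0.20,   # made-up value
    ("C2", "C3"): 0.35,   # the value quoted for FIG. 4
}

def most_related_pair(relevance):
    """Return the pair with the highest degree of association, or None if empty."""
    if not relevance:
        return None
    return max(relevance, key=relevance.get)

candidate = most_related_pair(relevance)
```

With these values `candidate` is `("C2", "C3")`, matching the candidate cluster set described for FIG. 4.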
[0032]
In step S14, it is identified whether an end condition is satisfied. For example, by comparing the distance between clusters in the candidate cluster set detected in S13 with a predetermined threshold th1, it is identified whether or not the process should be terminated.
If the end condition is not satisfied, the processes after step S15 are repeatedly executed.
In step S15, for the candidate cluster set (Ci, Cj) detected in S13, the similarity S(x, y) is obtained for every combination of an element x contained in the i-th cluster Ci and an element y contained in the j-th cluster Cj (each cluster contains one or more pieces of document information as elements). Since these similarities S(x, y) were calculated in advance in step S11 and are held in the similarity table 12, they can be obtained by referring to the similarity table 12.
[0033]
For example, when the clusters C2 and C3 shown in FIG. 3 are detected as the candidate cluster set, three pieces of document information exist as elements in cluster C2 and four pieces of document information exist as elements in cluster C3, so there are 12 combinations of elements (x, y), and the similarity S(x, y) is obtained for each of these combinations.
[0034]
In step S16, each similarity S(x, y) is compared with the threshold th2, and the ratio R of the number N2 of combinations that satisfy the condition of the threshold th2 to the total number N1 of combinations in the candidate cluster set (Ci, Cj) is calculated (R = N2 / N1).
For example, for the candidate cluster set (C2, C3) shown in FIG. 3, N1 = 12, so if the number N2 of combinations that satisfy the condition S(x, y) ≧ th2 is 9, then R = 0.75.
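The ratio R of step S16 can be reproduced with a few lines of code. The sketch below rebuilds the 3 × 4 example (N1 = 12, N2 = 9); the element labels, the similarity values 0.8 and 0.2, and th2 = 0.5 are hypothetical and chosen only so that nine of the twelve pairs satisfy the condition.

```python
def similar_ratio(cluster_i, cluster_j, S, th2):
    """R = N2 / N1 over all cross-cluster element pairs (x, y)."""
    pairs = [(x, y) for x in cluster_i for y in cluster_j]
    n1 = len(pairs)                                  # total combinations
    n2 = sum(1 for x, y in pairs if S(x, y) >= th2)  # pairs meeting th2
    return n2 / n1

# Illustrative data: 3 x 4 = 12 pairs, 9 of which reach th2.
C2 = ["a", "b", "c"]
C3 = ["p", "q", "r", "s"]
low = {("a", "s"), ("b", "s"), ("c", "s")}  # hypothetical dissimilar pairs

def S(x, y):
    return 0.2 if (x, y) in low else 0.8

R = similar_ratio(C2, C3, S, th2=0.5)
```

Here `R` equals 0.75, which is at least the example threshold th3 = 0.6, so this candidate would be merged.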
[0035]
Then, the ratio R is compared with a threshold th3 (for example, 0.6). If (R ≧ th3), the process proceeds from step S17 to S18, and if (R <th3), the process proceeds to step S20.
That is, if (R ≧ th3), a new cluster is generated from the candidate cluster set (Ci, Cj) in step S18 and added to the set D. Further, since the clusters Ci and Cj from which the new cluster was built are no longer needed, they are deleted from the set D in step S19.
[0036]
On the other hand, if (R < th3), the candidate cluster set (Ci, Cj) is not adopted as a cluster set. In addition, step S20 ensures that the combination of cluster Ci and cluster Cj is no longer a target of the relevance determination in step S13 from then on. For example, by changing the degree of association between the corresponding clusters Ci and Cj to 0 in the matrix of the relevance table 13, they can be excluded from the relevance determination.
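The exclusion of step S20 can be sketched as a single update of the relevance table. As before, the table layout and all values other than 0.35 are assumptions made for this example.

```python
def exclude_pair(relevance, ci, cj):
    """S20: set the degree of association of the rejected pair to 0."""
    key = (ci, cj) if (ci, cj) in relevance else (cj, ci)
    if key not in relevance:
        raise KeyError(key)
    relevance[key] = 0.0

relevance = {("C1", "C2"): 0.10, ("C1", "C3"): 0.20, ("C2", "C3"): 0.35}
exclude_pair(relevance, "C3", "C2")   # the order of the two names does not matter
next_candidate = max(relevance, key=relevance.get)
```

After the update, the next search in step S13 picks `("C1", "C3")` instead of returning the rejected pair again.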
[0037]
For this reason, even if the center vectors of two clusters are close to each other, when a large proportion of the document pairs across those clusters have low similarity, the combination of these clusters is not integrated into a new cluster. It is therefore possible to keep documents with low similarity from being classified into the same cluster.
If the predetermined end condition is satisfied in step S14, the classification process is terminated, and the classification result is displayed on the classification processing apparatus 10.
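Putting steps S11 to S20 together, the whole procedure might be sketched as follows. This is a simplified illustration rather than the implementation of the embodiment: it assumes dense vectors, cosine similarity both between documents and between centroids, an end condition of "best degree of association below th1", and a set of excluded pairs that plays the role of writing 0 into the relevance table. All function and variable names are hypothetical.

```python
import math
from itertools import combinations

def cos_sim(u, v):
    dot = sum(p * q for p, q in zip(u, v))
    nu = math.sqrt(sum(p * p for p in u))
    nv = math.sqrt(sum(q * q for q in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors, members):
    n = len(members)
    return [sum(vectors[m][k] for m in members) / n
            for k in range(len(vectors[0]))]

def classify(vectors, th1, th2, th3):
    n = len(vectors)
    # S11: similarity table, one entry per unordered pair of documents.
    S = {(i, j): cos_sim(vectors[i], vectors[j])
         for i, j in combinations(range(n), 2)}
    sim = lambda x, y: S[(x, y) if x < y else (y, x)]
    # S12: one single-element cluster per document (cluster id -> members).
    D = {i: (i,) for i in range(n)}
    next_id = n
    excluded = set()  # S20: pairs treated as having degree of association 0
    while True:
        # S13: find the pair of clusters with the highest degree of association.
        best, best_rel = None, 0.0
        for a, b in combinations(sorted(D), 2):
            if (a, b) in excluded:
                continue
            rel = cos_sim(centroid(vectors, D[a]), centroid(vectors, D[b]))
            if rel > best_rel:
                best, best_rel = (a, b), rel
        # S14: end condition (assumed form).
        if best is None or best_rel < th1:
            return sorted(D.values())
        a, b = best
        # S15, S16: ratio of cross-cluster element pairs with S(x, y) >= th2.
        pairs = [(x, y) for x in D[a] for y in D[b]]
        R = sum(1 for x, y in pairs if sim(x, y) >= th2) / len(pairs)
        if R >= th3:                 # S17
            D[next_id] = D[a] + D[b]  # S18: add the merged cluster
            del D[a], D[b]            # S19: delete the source clusters
            next_id += 1
        else:
            excluded.add(best)        # S20: never propose this pair again

vectors = [(1, 0), (0.9, 0.1), (0, 1), (0.1, 0.9)]
groups = classify(vectors, th1=0.5, th2=0.5, th3=0.6)
```

With these sample vectors the result is two clusters, documents 0 and 1 in one and documents 2 and 3 in the other. Every pass either merges two clusters or excludes one pair, so the loop always terminates.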
[0038]
Note that the processing in the classification processing apparatus 10 can be realized either with dedicated hardware or as a program executed on a computer. The program can be read into a computer from a recording medium such as a CD-ROM.
[0039]
[Effects of the Invention]
As described above, according to the present invention, text information classification processing can be performed with higher accuracy than conventional cluster analysis.
[Brief description of the drawings]
FIG. 1 is a flowchart illustrating a text classification process according to an embodiment.
FIG. 2 is a block diagram illustrating a configuration example of an apparatus that handles document information.
FIG. 3 is a schematic diagram showing a specific example of cluster generation.
FIG. 4 is a schematic diagram illustrating a configuration example of an association degree table.
FIG. 5 is a flowchart showing a general cluster analysis process.
FIG. 6 is a schematic diagram showing a general cluster analysis process.
[Explanation of symbols]
10 Classification processing device
11 Storage device
12 Similarity table
13 Relevance table
20 Internet
21 Server
22 Database
30 LAN
31 Server
32 Database

Claims

An automatic text classification device that processes text composed of character string information, receives a plurality of N mutually independent texts as input, and classifies the texts, based on the similarity between the texts, into clusters each having fewer than N texts as elements, the device comprising:
cluster set detection means for storing the degree of association between two different clusters in a relevance table and detecting the two clusters having the highest degree of association as a cluster set;
inter-element similarity detection means for obtaining the similarity between each of the elements included in one cluster of the cluster set and each of the elements included in the other cluster of the cluster set, and storing the similarities in a similarity table;
similarity ratio calculation means for calculating the ratio of the number of combinations whose similarity is equal to or greater than a first threshold to the total number of combinations of elements for which the similarity was obtained;
similarity ratio comparison means for comparing the ratio calculated by the similarity ratio calculation means with a second threshold;
cluster set exclusion means for not integrating the cluster set if the ratio is less than the second threshold; and
cluster set integration means for integrating the cluster set to generate a new cluster if the ratio is equal to or greater than the second threshold.

A text classification program for causing a computer to function as each means constituting the automatic text classification device according to claim 1.

A computer-readable recording medium on which the program according to claim 2 is recorded.