JP4143234B2

JP4143234B2 - Document classification apparatus, document classification method, and storage medium

Info

Publication number: JP4143234B2
Application number: JP28201499A
Authority: JP
Inventors: 栄治剣持
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1999-10-01
Filing date: 1999-10-01
Publication date: 2008-09-03
Anticipated expiration: 2019-10-01
Also published as: JP2001101227A

Description

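[0001]
[Technical field to which the invention belongs]
The present invention relates to a document classification apparatus and the like that automatically classifies a document set into a plurality of document subsets according to document content, and in particular to a document classification apparatus and the like capable of extracting many partial document sets that differ in classification criterion.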
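[0002]
[Prior art]
In recent years, the spread of the Internet and similar technologies has made large amounts of document information accessible. As a result, there is a growing need to classify the collected document information into meaningful groups (for example, by topic) so that desired document information can be accessed efficiently and large document sets can be analyzed efficiently.
However, if users classify a large amount of document information manually, the cost in labor and time becomes enormous. For this reason, apparatuses that can automatically classify a document set according to document content have come to be provided in recent years.
In such automatic classification, for example, natural language processing such as Japanese morphological analysis is used to extract from each document the words that constitute it, so that the document is represented in a space as a vector of word occurrence frequencies (a document feature vector). This technique is called the vector space model of documents and is widely used. In such a vector space model, the similarity between vectors can be defined by computing the distance, inner product, cosine, or the like between any two document feature vectors in the space, so documents can be classified automatically by content using statistical methods, and various automatic document classification methods have been provided (for example, the invention described in JP-A-7-114572).
Many of these methods aim to improve the quality of the generated partial document sets (for example, the invention described in JP-A-11-45247). Because the generated partial document sets are the units on which various tasks are to be performed efficiently, their quality is certainly an important issue. At the same time, however, it is equally important how many of the various topics latent in the document set being classified can be extracted as partial document sets. No method that deals directly with this issue has been found.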
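[0003]
[Problems to be solved by the invention]
As described above, the prior art can extract only some of the topics contained in a document set when classifying it into partial document sets, and therefore cannot support a comprehensive analysis of the document set.
An object of the present invention is to solve this problem of the prior art by dynamically operating on the feature dimensions of the document feature vectors based on a specific criterion and repeating automatic document classification. Because the similarity between feature vectors used for classification then changes from pass to pass, many partial document sets with different classification criteria can be extracted automatically, and a document classification apparatus and the like that enables comprehensive analysis of a document set can be provided.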
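[0004]
To solve the above problem, the invention according to claim 1 is a document classification apparatus that automatically classifies a document set according to document content, comprising: document analysis means for extracting the words in the document data of each document in a document set consisting of a plurality of documents and counting, for each document, the number of occurrences of each extracted word; feature vector generation means for generating, based on the words obtained by the document analysis means and their occurrence counts, a matrix in which the documents and the words appearing in them correspond to the matrix components and each matrix element is the occurrence count of the word counted for the document, and for obtaining document feature vectors by applying singular value decomposition to the matrix; feature vector correction means for correcting the document feature vectors by deleting feature dimensions in descending order of the corresponding singular values; and document classification means for classifying the document set into a plurality of partial document sets based on the similarity between document feature vectors, including the document feature vectors corrected by the feature vector correction means, and for storing the classification result in classification result storage means; wherein, after the document classification means has stored the classification result, a determination is made using a predetermined repetition condition and, when repetition is determined, the operation in which the feature vector correction means corrects the document feature vectors and the operation in which the document classification means classifies the document set into partial document sets and stores the classification result in the classification result storage means are repeated.
The invention according to claim 2 is the document classification apparatus according to claim 1, further comprising feature vector storage means for storing the document feature vectors obtained by the feature vector generation means, wherein the feature vector correction means, when repeatedly correcting the document feature vectors, corrects the feature vectors stored in the feature vector storage means.
The invention according to claim 3 is the document classification apparatus according to claim 1 or 2, wherein statistical information is calculated from the classification results stored in the classification result storage means, and the feature dimension to be deleted is determined using the calculated statistical information.
The invention according to claim 4 is the document classification apparatus according to claim 3, wherein the statistical information is the variance of the feature dimensions in each partial document set.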
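[0005]
The invention according to claim 5 is a document classification method executed by a document classification apparatus that has document analysis means, feature vector generation means, feature vector correction means, document classification means, and classification result storage means and that automatically classifies a document set according to document content, the method comprising: a step in which the document analysis means extracts the words in the document data of each document in a document set consisting of a plurality of documents and counts, for each document, the number of occurrences of each extracted word; a step in which the feature vector generation means generates, based on the words obtained by the document analysis means and their occurrence counts, a matrix whose rows correspond to the documents, whose columns correspond to the words, and whose elements are the occurrence counts of the words counted for each document, and obtains document feature vectors by applying singular value decomposition to the matrix; a step in which the feature vector correction means corrects the document feature vectors by deleting feature dimensions in descending order of the corresponding singular values; a step in which the document classification means classifies the document set into a plurality of partial document sets based on the similarity between document feature vectors, including the document feature vectors corrected by the feature vector correction means, and stores the classification result in the classification result storage means; and a step in which, after the document classification means has stored the classification result, a determination is made using a predetermined repetition condition and, when repetition is determined, the operation in which the feature vector correction means corrects the document feature vectors and the operation in which the document classification means classifies the document set into partial document sets and stores the classification result in the classification result storage means are repeated.
The invention according to claim 6 is the document classification method according to claim 5, further comprising a step in which feature vector storage means stores the document feature vectors first obtained by the feature vector generation means, wherein the feature vector correction means, when repeatedly correcting the document feature vectors, corrects the first-obtained document feature vectors stored in the feature vector storage means.
The invention according to claim 7 is the document classification method according to claim 5 or 6, wherein statistical information is calculated from the classification results stored in the classification result storage means, and the feature dimension to be deleted is determined using the calculated statistical information.
The invention according to claim 8 is the document classification method according to claim 7, wherein the statistical information is the variance of the feature dimensions in each partial document set.
The invention according to claim 9 is a computer-readable storage medium storing a program for executing the document classification method according to any one of claims 5 to 8.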
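[0006]
With the above means, in the inventions according to claims 1 and 6, the words in the document data of each document in a document set consisting of a plurality of documents are analyzed, document feature vectors are obtained from the analysis results, and the document set is classified into a plurality of partial document sets based on the similarity between the document feature vectors. When repetition is then selected according to the condition, the feature dimensions of the document feature vectors are corrected based on a predetermined criterion, and the document set is classified into a plurality of partial document sets based on the similarity between document feature vectors including the corrected ones. When repetition is selected again according to the condition, the operation of correcting the document feature vectors and the operation of classifying into partial document sets and storing the result are repeated.
In the inventions according to claims 2 and 7, in the invention according to claim 1 or claim 6, the feature dimensions of the generated document feature vectors are ordered according to a predetermined criterion, and the feature dimension to be operated on is determined according to that order.
In the inventions according to claims 3 and 8, in the invention according to claim 6 or claim 7, the first-obtained document feature vectors are stored, and when the document feature vectors are repeatedly corrected, the stored first-obtained feature vectors are corrected.
In the inventions according to claims 4 and 9, in the inventions according to claims 1 to 3 or claims 6 to 8, statistical information is calculated from the stored classification results, and the feature dimension to be operated on is determined using the calculated statistical information.
In the inventions according to claims 5 and 10, in the invention according to claim 4 or claim 9, the variance of the feature dimensions in each partial document set is calculated from the stored classification results, and the feature dimension to be operated on is determined using the calculated variance. In the invention according to claim 11, a program written according to the document classification method of claims 6 to 10 is stored in, for example, a removable storage medium.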
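[0007]
[Embodiments of the invention]
Embodiments of the present invention are described in detail below with reference to the drawings.
FIG. 1 is a block diagram showing the configuration of a document classification apparatus according to a first embodiment of the present invention. As shown in the figure, the document classification apparatus of this embodiment includes: a document input unit 1 that inputs the document data of each document in a document set consisting of a plurality of documents; a document analysis unit 2, serving as the document analysis means, that analyzes the words in each piece of document data input by the document input unit 1; a feature vector generation unit 3, serving as the feature vector generation means, that obtains document feature vectors based on the analysis results of the document analysis unit 2; a feature vector correction unit 4, serving as the feature vector correction means, that corrects the document feature vectors by operating on their feature dimensions based on a predetermined criterion; a document classification unit 5, serving as the document classification means, that classifies the document set into a plurality of partial document sets based on the similarity between document feature vectors, including corrected ones; a classification result storage unit 6, serving as the classification result storage means, that stores the classification results produced by the document classification unit 5; and a repetition determination unit 7 that causes the operations from document feature vector correction onward to be repeated according to a predetermined repetition condition. The document analysis unit 2, feature vector generation unit 3, feature vector correction unit 4, document classification unit 5, and repetition determination unit 7 have a shared memory (for example, RAM) that stores programs and data, and a shared or dedicated CPU that operates according to those programs. Each unit is described further below.
The document input unit 1 includes a keyboard, an OCR device, a removable storage medium, a network interface unit, and the like, uses them to input a group of document data, and stores the data in a document storage unit (not shown).
The document analysis unit 2 performs natural language analysis on each piece of input document data and extracts words, their parts of speech, and the like. It can also perform document analysis that takes into account the order in which words appear in the document data and document meta-information (attribute information) such as the author and creation date. After extracting the words, it assigns a unique word identification code (ID) to each word that appears in the document group and counts the number of word occurrences for each document.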
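[0008]
Based on the document analysis data generated by the document analysis unit 2 (words, word IDs, word occurrence counts, part-of-speech information, and so on), the feature vector generation unit 3 generates document-word matrix data whose rows correspond to document IDs, whose columns correspond to word IDs, and whose elements are the number of occurrences of the word with each word ID in the document with each document ID. Each row vector of this document-word matrix is a document feature vector. FIG. 2 shows an example of document-word matrix data and document feature vectors. Normalization can also be applied to the document feature vectors. In addition, to deal with the polysemy and synonymy of words, document feature vectors can be generated by applying a multidimensional scaling technique such as factor analysis, quantification method type III, or singular value decomposition to the generated document-word matrix.
For example, in the method that generates document feature vectors from the document-word matrix by singular value decomposition, the document-word matrix (document feature vectors) X of size d × t (d is the number of documents, t is the number of words) is decomposed into several matrices as in formula (1). In formula (1), svd( ) is the operator that applies singular value decomposition to a matrix. The singular values are values produced by the singular value decomposition; for example, a document that contains many words appearing in common across many documents takes high values in the dimensions that correspond to large singular values in the singular value matrix L.
Formula (1) X = svd(X) = A L U^T [T denotes matrix transposition]
In formula (1), A, L, and U are all matrices, and A is a matrix of size d × k (k does not exceed t). That is, each row vector of the d × k matrix A is a document feature vector. Here k is an integer with 1 ≦ k ≦ r, where r is the rank of the matrix X and is no larger than the smaller of d and t. The matrix L is a k × k diagonal matrix of singular values, and the matrix U is a t × k matrix that can be regarded as mapping each word into a k-dimensional latent structure space.
To manage the document feature vectors efficiently, the feature vector generation unit 3 also generates additional information that accompanies the document-word matrix data, for example word-to-word-ID map data describing the correspondence between each word ID (a column of the document-word matrix data) and its word, and word-ID-to-part-of-speech map data describing the correspondence between each word ID and the part-of-speech information of that word.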
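[0009]
The feature vector correction unit 4 corrects the document feature vectors by successively operating on their feature dimensions based on a predetermined criterion. (A feature dimension is a dimension of the vector; each dimension can be regarded as being composed, approximately, of several words that behave similarly in the document set.) The operations on feature dimensions can include weighting, deletion, linear transformation, and the like.
For example, to delete specific dimensions from the document feature vectors, let the document feature vectors be the matrix A of size d × k, and let P_{k'} be the correction matrix of size k × k' obtained by deleting from the k × k identity matrix the columns that correspond to the feature dimensions to be deleted. The corrected document feature vectors A' are then given by formula (2) (this formula is a general expression that is not limited to the singular value decomposition case).
Formula (2) A' = A P_{k'}
Feature dimensions can also be deleted by using as the correction matrix the k × k identity matrix with the diagonal elements corresponding to the deleted feature dimensions set to 0; in that case the corrected document feature vectors have the same number of dimensions as before correction. During repeated execution, the correction of formula (2) is applied successively. The order in which feature dimensions are deleted may be the sorted order starting from the first feature dimension, or it may be decided by generating random numbers between 1 and the number of feature dimensions. In this way, documents can be classified in a feature space that excludes the features expressed by the successively deleted feature dimensions, so that other topics hidden behind the most central topic (feature) become the viewpoint for classification.
In particular, when the document feature vectors are generated by the singular value decomposition described above, each dimension of the document feature vectors is ranked by the size of its singular value. Each feature dimension can be regarded as being composed, approximately, of several words that behave similarly, and can therefore be interpreted as one of the topics latent in the document data. The size of the singular value corresponding to a feature dimension is considered to express how dominant that topic is in the document data: the larger the singular value, the more dominant the topic that the feature dimension represents. Deleting feature dimensions gradually, starting from the largest singular value, during repeated execution therefore makes it possible to classify documents in a feature space from which the influence of the dominant topics has been removed one after another.
The feature vector correction unit 4 is bypassed on the first pass of the repeated execution.
The document classification unit 5 classifies documents by applying a statistical method to the generated document feature vectors. Documents whose feature vectors are close to each other are similar, so documents with nearby feature vectors are gathered to generate a plurality of partial document sets. Classification techniques such as discriminant analysis and cluster analysis can be applied as the statistical method; any classification technique applicable to vector data may be used.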
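[0010]
FIG. 3 shows the operation flow of the first embodiment. The operation of this embodiment is described below with reference to FIG. 3 and other figures.
First, the document input unit 1 inputs the group of document data (document set) to be classified via a keyboard, an OCR device, a removable storage medium, a network interface unit, or the like, and stores it in the document storage unit (not shown) (step S1).
Next, the document analysis unit 2 performs natural language analysis on each piece of input document data and extracts words, their parts of speech, and the like (step S2). It then assigns a unique word identification code (ID) to each word that appears in the document data group and counts the number of word occurrences for each document (step S2).
Subsequently, based on the document analysis data generated by the document analysis unit 2 (words, word IDs, word occurrence counts, part-of-speech information, and so on), the feature vector generation unit 3 generates document-word matrix data whose rows correspond to document IDs, whose columns correspond to word IDs, and whose elements are the number of occurrences of the word with each word ID in the document with each document ID (step S3). Each row vector of this document-word matrix is a document feature vector (see FIG. 2).
The document classification unit 5 then classifies the documents by applying a statistical method to the generated document feature vectors (step S5). Documents whose feature vectors are close to each other are similar, so documents with nearby feature vectors are gathered to generate a plurality of partial document sets.
After that, the document classification unit 5 stores the generated classification result in the classification result storage unit 6 (step S6), and the repetition determination unit 7 uses a predetermined repetition condition to determine whether to correct the document feature vectors and repeat the classification (step S7). The repetition condition can be a preset number of repetitions, or it can be decided with reference to, for example, the number of dimensions of the document feature vectors. The user can also look at the classification result and indicate whether to repeat. If repetition is determined (Yes in step S7), the document feature vectors are corrected as described above (step S4). For example, one feature dimension of the document feature vectors is selected by a predetermined criterion and deleted.
Subsequently, the document classification unit 5 classifies the documents again using the corrected feature vectors (step S5) and stores the classification result in the classification result storage unit 6 (step S6).
Thus, as described above, when the document feature vectors are generated by singular value decomposition, for example, each dimension of the document feature vectors is ranked by the size of its singular value, feature dimensions are deleted one after another starting from the largest singular value, and documents can be classified in a feature space from which the influence of the dominant topics has been removed one after another.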
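The loop of steps S4 to S7 can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes the feature columns are already ordered by descending singular value, uses a fixed number of rounds as the repetition condition, and substitutes a small deterministic k-means (`kmeans`, a hypothetical helper) for the unspecified statistical classification method.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(vectors, n_clusters, n_iter=20):
    """Tiny k-means with farthest-point initialisation (illustrative stand-in)."""
    centers = [list(vectors[0])]
    while len(centers) < n_clusters:
        centers.append(list(max(vectors, key=lambda v: min(sq_dist(v, c) for c in centers))))
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        labels = [min(range(n_clusters), key=lambda c: sq_dist(v, centers[c])) for v in vectors]
        for c in range(n_clusters):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def repeated_classification(features, n_clusters, n_rounds):
    """Classify, then repeatedly drop the leading dimension and classify again."""
    results = []                                    # plays the role of storage unit 6
    current = [list(row) for row in features]
    for round_no in range(n_rounds):                # S7: fixed round count (assumption)
        if round_no > 0:                            # S4 is bypassed on the first pass
            current = [row[1:] for row in current]  # delete the most dominant dimension
        if not current[0]:
            break                                   # no dimensions left to classify on
        results.append(kmeans(current, n_clusters)) # S5 and S6
    return results

# Dimension 0 separates documents {0, 2} from {1, 3}; dimension 1 separates {0, 1} from {2, 3}.
features = [[5.0, 1.0], [-5.0, 1.0], [5.0, -1.0], [-5.0, -1.0]]
results = repeated_classification(features, n_clusters=2, n_rounds=2)
```

In this toy data the first pass groups documents along the dominant dimension, and the second pass, with that dimension deleted, groups them along the topic it had been hiding.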
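[0011]
FIG. 4 is a block diagram showing the configuration of a document classification apparatus according to a second embodiment of the present invention. Components identical to those of the first embodiment (see FIG. 1) carry the same numbers. As shown in the figure, this embodiment adds to the configuration of the first embodiment a feature vector storage unit 8, serving as the feature vector storage means, that stores the document feature vectors obtained by the feature vector generation unit 3. The feature vector storage unit 8 also stores the additional information that the feature vector generation unit 3 generates alongside the document-word matrix data to manage the document feature vectors efficiently, for example the word-to-word-ID map data describing the correspondence between each word ID (a column of the document-word matrix data) and its word, and the word-ID-to-part-of-speech map data describing the correspondence between each word ID and the part-of-speech information of that word.
With the feature vector storage unit 8 added, the feature vector correction unit 4 in this embodiment can, each time it corrects the document feature vectors, take the document feature vectors stored in the feature vector storage unit 8 as the vectors to be operated on (corrected). This makes it possible to classify documents using document feature vectors that do not inherit the effect (result) of earlier operations on the document feature vectors (for example, the deletion of one dimension).
For example, suppose the document feature vectors are generated by singular value decomposition and the n-th feature dimension is deleted at the n-th repetition. If the correction matrix at that time is P_n, the document feature vectors stored in the feature vector storage unit 8 are A_0, and the corrected document feature vectors are A_n, then
Formula (3) A_n = A_0 P_n
In the first embodiment, by contrast,
Formula (4) A_n = A_0 P_n P_{n-1} ... P_0
In other words, the second embodiment makes it possible to classify documents in a feature space from which only the topic expressed by the feature dimension being deleted has been removed.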
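The difference between formulas (3) and (4) can be seen in a short Python sketch. The helper names (`drop_dims`, `cumulative_views`, `independent_views`) are hypothetical, dimensions are zero-indexed here (so round n removes index n - 1), and deletion is done by dropping columns directly rather than by multiplying with an explicit correction matrix.

```python
def drop_dims(features, removed):
    """Return a copy of the feature matrix without the columns listed in `removed`."""
    return [[x for j, x in enumerate(row) if j not in removed] for row in features]

def cumulative_views(A0, n_rounds):
    """First embodiment: round n has lost every dimension deleted so far."""
    return [drop_dims(A0, set(range(n))) for n in range(1, n_rounds + 1)]

def independent_views(A0, n_rounds):
    """Second embodiment: round n starts from the stored A0 and loses only one dimension."""
    return [drop_dims(A0, {n - 1}) for n in range(1, n_rounds + 1)]

A0 = [[1, 2, 3], [4, 5, 6]]
first = cumulative_views(A0, 2)    # [[[2, 3], [5, 6]], [[3], [6]]]
second = independent_views(A0, 2)  # [[[2, 3], [5, 6]], [[1, 3], [4, 6]]]
```

Both schemes agree in the first correction round; from the second round on, the stored-vector scheme keeps the previously deleted dimensions available again.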
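[0012]
FIG. 5 is a block diagram showing the configuration of a document classification apparatus according to a third embodiment of the present invention. In FIG. 5, components identical to those of the first embodiment (see FIG. 1) and the second embodiment (see FIG. 4) carry the same numbers. As shown in the figure, the third embodiment adds to the configuration of the second embodiment: a partial document set extraction unit 9 that extracts, from the stored classification results, the document feature vectors belonging to each partial document set; a partial document set variance calculation unit 10 that calculates the variance of each feature dimension across the document feature vectors in each extracted partial document set; and an operation-target feature dimension determination unit 11 that determines the feature dimension to be operated on using statistical information such as the calculated variance of each feature dimension.
With this configuration, the embodiment calculates statistical information, for example the variance of the feature dimensions in each partial document set, from the classification results stored in the classification result storage unit 6, and uses the calculated variances to determine the feature dimension to be operated on. The rationale is that the variance of a feature dimension within a partial document set can be regarded as indicating how much that feature dimension contributes to grouping the partial document set. A feature dimension with small variance makes the partial document set compact, so its contribution to the grouping can be regarded as high. For each partial document set, a feature dimension with small variance is therefore considered to be strongly related to the topic that the partial document set expresses, and classifying documents in a feature vector space from which this dimension has been deleted is expected to extract partial document sets that express topics other than that one. The units added in this embodiment are described further below.
The partial document set extraction unit 9 extracts, from the classification results stored in the classification result storage unit 6, the document feature vectors belonging to each of the generated partial document sets. The partial document sets considered may be only those generated most recently or all those generated so far.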
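[0013]
The partial document set variance calculation unit 10 calculates, for every partial document set extracted by the partial document set extraction unit 9, the variance of each feature dimension across the document feature vectors belonging to that set. In doing so it calculates, for each partial document set, the rank of each feature dimension by variance, and also, for each feature dimension, the rank of each partial document set by that dimension's variance.
The operation-target feature dimension determination unit 11 determines the feature dimension that the feature vector correction unit 4 operates on, based on the information calculated by the partial document set variance calculation unit 10: the variance of each feature dimension in each partial document set, the rank of each feature dimension by variance within each partial document set, and the rank of each partial document set for each feature dimension's variance. For example, it selects as operation targets the feature dimensions whose variance is at or below a fixed value in all partial document sets, or the feature dimensions whose variance rank is always at or below a fixed rank (that is, whose variance is small) in all partial document sets.
When only the most recently generated partial document sets are extracted, the feature dimension that the feature vector correction unit 4 operates on is determined from the variance of each feature dimension in those partial document sets and the rank of each feature dimension by variance within them.
In this way, this embodiment classifies documents in a feature vector space from which the selected feature dimensions have been deleted, and can extract partial document sets that express topics other than those expressed by the earlier partial document sets.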
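The threshold rule described for the third embodiment (select the dimensions whose variance is at or below a fixed value in every partial document set) can be sketched in Python as follows. The function names and the threshold value are illustrative assumptions, and population variance is used because the text does not say which variance estimator is intended.

```python
from statistics import pvariance

def per_cluster_variances(features, labels):
    """Variance of every feature dimension inside each partial document set."""
    clusters = {}
    for row, label in zip(features, labels):
        clusters.setdefault(label, []).append(row)
    return {label: [pvariance(col) for col in zip(*rows)]
            for label, rows in clusters.items()}

def dims_to_delete(features, labels, threshold):
    """Dimensions whose variance is at or below `threshold` in all partial document sets."""
    variances = per_cluster_variances(features, labels)
    n_dims = len(features[0])
    return [j for j in range(n_dims)
            if all(v[j] <= threshold for v in variances.values())]

features = [[5.0, 1.0], [5.0, -1.0], [-5.0, 1.0], [-5.0, -1.0]]
labels = [0, 0, 1, 1]  # a stored classification result that split on dimension 0
selected = dims_to_delete(features, labels, threshold=0.5)
```

Here dimension 0 is constant inside both clusters, so it is the one that made the clusters compact and is selected for deletion, while dimension 1 still varies within each cluster and is kept.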
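The description above covers document classification apparatuses with the configurations shown in FIG. 1, FIG. 4, and FIG. 5. A program written according to the document classification method of the present invention as described in each embodiment can also be stored in, for example, a removable storage medium; loading that medium into an information processing apparatus such as a personal computer that could not previously perform document classification by the method of the present invention enables that apparatus to perform the document classification.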
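[0014]
[Effects of the invention]
As described above, according to the inventions of claims 1 and 5, many partial document sets on the different topics latent in the document set being classified can be extracted automatically, so a comprehensive analysis of the document set can be performed. In addition, the feature dimensions can be operated on efficiently.
According to the inventions of claims 2 and 6, the effect of each successive operation on the feature dimensions of the document feature vectors applies only to the document classification performed immediately afterward. That is, partial document sets can be generated that do not inherit the effects of earlier feature dimension operations, so topics different from those obtained with the invention of claim 1 or 5 can also be extracted.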
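[0015]
According to the inventions of claims 3 and 7, many partial document sets on different topics can be extracted automatically by a method different from that of the invention of claim 1 or claim 5, so the effect of the invention of claim 1 or claim 5 can be further improved.
According to the inventions of claims 4 and 8, the effect of the invention of claim 3 or claim 7 can be achieved easily.
According to the invention of claim 9, the effects of the invention according to any one of claims 5 to 8 can be obtained on an information processing apparatus.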
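[Brief description of the drawings]
[FIG. 1] A block diagram showing the configuration of a document classification apparatus according to the first embodiment of the present invention.
[FIG. 2] An explanatory diagram of the document classification method according to the first embodiment of the present invention.
[FIG. 3] An operation flow diagram of the document classification method according to the first embodiment of the present invention.
[FIG. 4] A block diagram showing the configuration of a document classification apparatus according to the second embodiment of the present invention.
[FIG. 5] A block diagram showing the configuration of a document classification apparatus according to the third embodiment of the present invention.
[Explanation of reference numerals]
1 Document input unit
2 Document analysis unit
3 Feature vector generation unit
4 Feature vector correction unit
5 Document classification unit
6 Classification result storage unit
7 Repetition determination unit
8 Feature vector storage unit
9 Partial document set extraction unit
10 Partial document set variance calculation unit
11 Operation-target feature dimension determination unit
[0001]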
BACKGROUND OF THE INVENTION
The present invention relates to a document group classification apparatus that automatically classifies a document group into a plurality of document subsets according to the contents of the document, and more particularly to a document classification apparatus that can extract a large number of partial document sets with different classification criteria.
[0002]
[Prior art]
In recent years, the spread of the Internet and similar technologies has made large amounts of document information accessible. As a result, there is a growing need to classify the collected document information into meaningful groups (for example, by topic) so that desired document information can be accessed efficiently and large document sets can be analyzed efficiently.
However, if users classify a large amount of document information manually, the cost in labor and time becomes enormous. For this reason, apparatuses that can automatically classify a document set according to document content have come to be provided in recent years.
In such automatic classification, for example, natural language processing such as Japanese morphological analysis is used to extract from each document the words that constitute it, so that the document is represented in a space as a vector of word occurrence frequencies (a document feature vector). This technique is called the vector space model of documents and is widely used. In such a vector space model, the similarity between vectors can be defined by computing the distance, inner product, cosine, or the like between any two document feature vectors in the space, so documents can be classified automatically by content using statistical methods, and various automatic document classification methods have been provided (for example, the invention described in JP-A-7-114572).
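As a rough illustration of the vector space model, the following Python sketch turns short texts into word-count vectors and compares them by cosine similarity. Whitespace tokenization stands in for the morphological analysis mentioned in the text, and the sample documents and function names are made up for illustration.

```python
import math
from collections import Counter

def term_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())  # naive tokenizer (assumption)
    return [counts[word] for word in vocabulary]

def cosine(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = ["printer paper jam", "paper jam in printer tray", "network cable unplugged"]
vocabulary = sorted({word for doc in docs for word in doc.lower().split()})
vectors = [term_vector(doc, vocabulary) for doc in docs]

similar = cosine(vectors[0], vectors[1])    # documents sharing three words
unrelated = cosine(vectors[0], vectors[2])  # documents sharing no words
```

Documents with overlapping vocabulary score close to 1, while documents with no words in common score exactly 0, which is the property a clustering step can exploit.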
Many of these methods aim to improve the quality of the generated partial document sets (for example, the invention described in JP-A-11-45247). Because the generated partial document sets are the units on which various tasks are to be performed efficiently, their quality is certainly an important issue. At the same time, however, it is equally important how many of the various topics latent in the document set being classified can be extracted as partial document sets. No method that deals directly with this issue has been found.
[0003]
[Problems to be solved by the invention]
As described above, the prior art can extract only some of the topics contained in a document set when classifying it into partial document sets, and therefore cannot support a comprehensive analysis of the document set.
An object of the present invention is to solve this problem of the prior art by dynamically operating on the feature dimensions of the document feature vectors based on a specific criterion and repeating automatic document classification. Because the similarity between feature vectors used for classification then changes from pass to pass, many partial document sets with different classification criteria can be extracted automatically, and a document classification apparatus and the like that enables comprehensive analysis of a document set can be provided.
[0004]
To solve the above problem, the invention according to claim 1 is a document classification apparatus that automatically classifies a document set according to document content, comprising: document analysis means for extracting the words in the document data of each document in a document set consisting of a plurality of documents and counting, for each document, the number of occurrences of each extracted word; feature vector generation means for generating, based on the words obtained by the document analysis means and their occurrence counts, a matrix in which the documents and the words appearing in them correspond to the matrix components and each matrix element is the occurrence count of the word counted for the document, and for obtaining document feature vectors by applying singular value decomposition to the matrix; feature vector correction means for correcting the document feature vectors by deleting feature dimensions in descending order of the corresponding singular values; and document classification means for classifying the document set into a plurality of partial document sets based on the similarity between document feature vectors, including the document feature vectors corrected by the feature vector correction means, and for storing the classification result in classification result storage means; wherein, after the document classification means has stored the classification result, a determination is made using a predetermined repetition condition and, when repetition is determined, the operation in which the feature vector correction means corrects the document feature vectors and the operation in which the document classification means classifies the document set into partial document sets and stores the classification result in the classification result storage means are repeated.
The invention according to claim 2 is the document classification apparatus according to claim 1, further comprising feature vector storage means for storing the document feature vector obtained by the feature vector generation means, and the feature vector correction The means corrects the feature vector stored in the feature vector storage means when repeatedly correcting the document feature vector.
The invention according to claim 3 is the document classification apparatus according to claim 1 or 2, wherein statistical information is calculated from the classification results stored in the classification result storage means, and the feature dimension to be deleted is determined using the calculated statistical information.
According to a fourth aspect of the present invention, in the document classification device according to the third aspect, the statistical information is a variance value of a feature dimension in each partial document set.
[0005]
The invention according to claim 5 is a document classification method executed by a document classification apparatus that has document analysis means, feature vector generation means, feature vector correction means, document classification means, and classification result storage means and that automatically classifies a document set according to document content, the method comprising: a step in which the document analysis means extracts the words in the document data of each document in a document set consisting of a plurality of documents and counts, for each document, the number of occurrences of each extracted word; a step in which the feature vector generation means generates, based on the words obtained by the document analysis means and their occurrence counts, a matrix whose rows correspond to the documents, whose columns correspond to the words, and whose elements are the occurrence counts of the words counted for each document, and obtains document feature vectors by applying singular value decomposition to the matrix; a step in which the feature vector correction means corrects the document feature vectors by deleting feature dimensions in descending order of the corresponding singular values; a step in which the document classification means classifies the document set into a plurality of partial document sets based on the similarity between document feature vectors, including the document feature vectors corrected by the feature vector correction means, and stores the classification result in the classification result storage means; and a step in which, after the document classification means has stored the classification result, a determination is made using a predetermined repetition condition and, when repetition is determined, the operation in which the feature vector correction means corrects the document feature vectors and the operation in which the document classification means classifies the document set into partial document sets and stores the classification result in the classification result storage means are repeated.
The invention according to claim 6 is the document classification method according to claim 5, wherein the feature vector storage means stores the document feature vector first obtained by the feature vector generation means. The feature vector correcting means corrects the first obtained document feature vector stored in the feature vector storage means when repeatedly correcting the document feature vector.
The invention according to claim 7 is the document classification method according to claim 5 or 6, wherein statistical information is calculated from the classification results stored in the classification result storage means, and the feature dimension to be deleted is determined using the calculated statistical information.
The invention according to claim 8 is the document classification method according to claim 7, characterized in that the statistical information is a variance value of feature dimensions in each partial document set.
According to a ninth aspect of the present invention, there is provided a computer-readable storage medium storing a program for executing the document classification method according to any one of the fifth to eighth aspects.
[0006]
With the above means, in the inventions according to claims 1 and 6, the words in the document data of each document in a document set consisting of a plurality of documents are analyzed, document feature vectors are obtained from the analysis results, and the document set is classified into a plurality of partial document sets based on the similarity between the document feature vectors. When repetition is then selected according to the condition, the feature dimensions of the document feature vectors are corrected based on a predetermined criterion, and the document set is classified into a plurality of partial document sets based on the similarity between document feature vectors including the corrected ones. When repetition is selected again according to the condition, the operation of correcting the document feature vectors and the operation of classifying into partial document sets and storing the result are repeated.
In the inventions according to claims 2 and 7, in the invention according to claim 1 or claim 6, the feature dimensions of the generated document feature vectors are ordered according to a predetermined criterion, and the feature dimension to be operated on is determined according to that order.
In the inventions according to claims 3 and 8, in the invention according to claim 6 or claim 7, the first-obtained document feature vectors are stored, and when the document feature vectors are repeatedly corrected, the stored first-obtained feature vectors are corrected.
In the inventions according to claims 4 and 9, in the inventions according to claims 1 to 3 or claims 6 to 8, statistical information is calculated from the stored classification results, and the feature dimension to be operated on is determined using the calculated statistical information.
In the inventions according to claims 5 and 10, in the invention according to claim 4 or claim 9, the variance of the feature dimensions in each partial document set is calculated from the stored classification results, and the feature dimension to be operated on is determined using the calculated variance. In the invention according to claim 11, a program written according to the document classification method of claims 6 to 10 is stored in, for example, a removable storage medium.
[0007]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
FIG. 1 is a block diagram showing the configuration of a document classification apparatus according to a first embodiment of the present invention. As shown in the figure, the document classification apparatus of this embodiment includes: a document input unit 1 that inputs the document data of each document in a document set consisting of a plurality of documents; a document analysis unit 2, serving as the document analysis means, that analyzes the words in each piece of document data input by the document input unit 1; a feature vector generation unit 3, serving as the feature vector generation means, that obtains document feature vectors based on the analysis results of the document analysis unit 2; a feature vector correction unit 4, serving as the feature vector correction means, that corrects the document feature vectors by operating on their feature dimensions based on a predetermined criterion; a document classification unit 5, serving as the document classification means, that classifies the document set into a plurality of partial document sets based on the similarity between document feature vectors, including corrected ones; a classification result storage unit 6, serving as the classification result storage means, that stores the classification results produced by the document classification unit 5; and a repetition determination unit 7 that causes the operations from document feature vector correction onward to be repeated according to a predetermined repetition condition. The document analysis unit 2, feature vector generation unit 3, feature vector correction unit 4, document classification unit 5, and repetition determination unit 7 have a shared memory (for example, RAM) that stores programs and data, and a shared or dedicated CPU that operates according to those programs. Each unit is described further below.
The document input unit 1 includes a keyboard, an OCR device, a removable storage medium, a network interface unit, and the like, uses them to input a group of document data, and stores the data in a document storage unit (not shown).
In addition, the document analysis unit 2 performs natural language analysis on each of the input document data, and extracts a word and its part of speech. Furthermore, it is possible to perform document analysis including the appearance order of words in the document data and document meta information (attribute information) such as the document creator and creation date. After extracting the word, a unique word identification code (ID) is assigned to the word that appears in the document group, and the word appearance count is counted for each document.
[0008]
Based on the document analysis data generated by the document analysis unit 2 (words, word IDs, word occurrence counts, part-of-speech information, and so on), the feature vector generation unit 3 generates document-word matrix data whose rows correspond to document IDs, whose columns correspond to word IDs, and whose elements are the number of occurrences of the word with each word ID in the document with each document ID. Each row vector of this document-word matrix is a document feature vector. FIG. 2 shows an example of document-word matrix data and document feature vectors. Normalization can also be applied to the document feature vectors. In addition, to deal with the polysemy and synonymy of words, document feature vectors can be generated by applying a multidimensional scaling technique such as factor analysis, quantification method type III, or singular value decomposition to the generated document-word matrix.
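The construction of the document-word matrix can be sketched in a few lines of Python. The sketch assumes the documents have already been tokenized by the analysis step; the function name and the tiny sample input are hypothetical.

```python
def build_document_word_matrix(tokenized_docs):
    """Assign word IDs in order of first appearance and count occurrences per document."""
    word_ids = {}
    for tokens in tokenized_docs:
        for word in tokens:
            word_ids.setdefault(word, len(word_ids))  # unique ID per distinct word
    matrix = [[0] * len(word_ids) for _ in tokenized_docs]
    for doc_id, tokens in enumerate(tokenized_docs):
        for word in tokens:
            matrix[doc_id][word_ids[word]] += 1       # row = document ID, column = word ID
    return matrix, word_ids

tokenized_docs = [["paper", "jam", "paper"], ["jam", "tray"]]
matrix, word_ids = build_document_word_matrix(tokenized_docs)
# matrix   -> [[2, 1, 0], [0, 1, 1]]
# word_ids -> {"paper": 0, "jam": 1, "tray": 2}
```

The returned `word_ids` dictionary corresponds to the word-to-word-ID map that the text says is generated alongside the matrix.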
For example, in the method that generates document feature vectors from the document-word matrix by singular value decomposition, the document-word matrix (document feature vectors) X of size d × t (d is the number of documents, t is the number of words) is decomposed into several matrices as in formula (1). In formula (1), svd( ) is the operator that applies singular value decomposition to a matrix. The singular values are values produced by the singular value decomposition; for example, a document that contains many words appearing in common across many documents takes high values in the dimensions that correspond to large singular values in the singular value matrix L.
Formula (1) X = svd(X) = A L U^T [T denotes matrix transposition]
In formula (1), A, L, and U are all matrices, and A is a matrix of size d × k (k does not exceed t). That is, each row vector of the d × k matrix A is a document feature vector. Here k is an integer with 1 ≦ k ≦ r, where r is the rank of the matrix X and is no larger than the smaller of d and t. The matrix L is a k × k diagonal matrix of singular values, and the matrix U is a t × k matrix that can be regarded as mapping each word into a k-dimensional latent structure space.
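A minimal NumPy sketch of formula (1) follows, assuming NumPy is available. `numpy.linalg.svd` returns the factors as (left vectors, singular values, transposed right vectors) with the singular values in descending order, so the sketch renames them to the A, L, U of the text; the function name and sample matrix are illustrative only.

```python
import numpy as np

def svd_document_features(X, k):
    """Return A (d x k), L (k x k) and U (t x k) for the k largest singular values."""
    left, singular_values, right_t = np.linalg.svd(X, full_matrices=False)
    A = left[:, :k]                   # each row is a k-dimensional document feature vector
    L = np.diag(singular_values[:k])  # diagonal matrix of the k largest singular values
    U = right_t[:k, :].T              # maps each word into the k-dimensional space
    return A, L, U

# 4 documents x 4 words: two documents per topic, with no shared vocabulary across topics.
X = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 1, 2],
              [0, 0, 2, 1]], dtype=float)
A, L, U = svd_document_features(X, k=2)
```

With k equal to the rank of X the product A L U^T reproduces X; with smaller k it is only an approximation, which is the usual trade-off when reducing dimensions.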
To manage the document feature vectors efficiently, the feature vector generation unit 3 also generates additional information that accompanies the document-word matrix data, for example word-to-word-ID map data describing the correspondence between each word ID (a column of the document-word matrix data) and its word, and word-ID-to-part-of-speech map data describing the correspondence between each word ID and the part-of-speech information of that word.
[0009]
The feature vector correction unit 4 corrects the document feature vectors by successively operating on their feature dimensions based on a predetermined criterion. (A feature dimension is a dimension of the vector; each dimension can be regarded as being composed, approximately, of several words that behave similarly in the document set.) The operations on feature dimensions can include weighting, deletion, linear transformation, and the like.
For example, to delete specific dimensions from the document feature vectors, let the document feature vectors be the matrix A of size d × k, and let P_{k'} be the correction matrix of size k × k' obtained by deleting from the k × k identity matrix the columns that correspond to the feature dimensions to be deleted. The corrected document feature vectors A' are then given by formula (2) (this formula is a general expression that is not limited to the singular value decomposition case).
Formula (2) A' = A P_{k'}
Feature dimensions can also be deleted by using as the correction matrix the k × k identity matrix with the diagonal elements corresponding to the deleted feature dimensions set to 0; in that case the corrected document feature vectors have the same number of dimensions as before correction. During repeated execution, the correction of formula (2) is applied successively. The order in which feature dimensions are deleted may be the sorted order starting from the first feature dimension, or it may be decided by generating random numbers between 1 and the number of feature dimensions. In this way, documents can be classified in a feature space that excludes the features expressed by the successively deleted feature dimensions, so that other topics hidden behind the most central topic (feature) become the viewpoint for classification.
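Both forms of the correction matrix can be illustrated with a small pure-Python sketch. The names `deletion_matrix` and `matmul` are hypothetical helpers, and dimensions are zero-indexed here, unlike the one-based numbering in the text.

```python
def deletion_matrix(k, removed, keep_size=False):
    """Build a correction matrix from the k x k identity matrix.

    keep_size=False drops the columns in `removed` (result is k x k');
    keep_size=True zeroes their diagonal entries instead (result stays k x k).
    """
    identity = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    if keep_size:
        for j in removed:
            identity[j][j] = 0.0
        return identity
    return [[row[j] for j in range(k) if j not in removed] for row in identity]

def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

A = [[0.9, 0.1, 0.3],
     [0.8, 0.2, 0.7]]
A_dropped = matmul(A, deletion_matrix(3, {0}))                  # 2 x 2
A_zeroed = matmul(A, deletion_matrix(3, {0}, keep_size=True))   # 2 x 3
```

Dropping the column shrinks the vectors, while zeroing the diagonal keeps their length; distances and inner products between documents come out the same either way, since the removed coordinate contributes nothing in both cases.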
In particular, when the document feature vectors are generated by the singular value decomposition described above, each dimension of the document feature vectors is ranked by the size of its singular value. Each feature dimension can be regarded as being composed, approximately, of several words that behave similarly, and can therefore be interpreted as one of the topics latent in the document data. The size of the singular value corresponding to a feature dimension is considered to express how dominant that topic is in the document data: the larger the singular value, the more dominant the topic that the feature dimension represents. Deleting feature dimensions gradually, starting from the largest singular value, during repeated execution therefore makes it possible to classify documents in a feature space from which the influence of the dominant topics has been removed one after another.
The feature vector correction unit 4 is bypassed at the first iteration.
The document classification unit 5 performs document classification by applying a statistical method to the generated document feature vector. Since documents having similar document feature vector values are similar documents, documents having similar document feature vector values are collected to generate a plurality of partial document sets. As a statistical method to be applied, a classification method such as a discriminant analysis method or a cluster analysis method can be applied. However, any method can be used here as long as it is a classification method to which vector data can be applied.
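As one illustration of grouping documents whose feature vectors are similar, the following sketch uses a simple greedy grouping by cosine similarity. This technique is swapped in purely for illustration; the specification leaves the choice of statistical method open (discriminant analysis, cluster analysis, and so on). The function names, the threshold of 0.8, and the sample vectors are assumptions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is a zero vector."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0
    return float(a @ b / (na * nb))

def leader_clustering(vectors, threshold=0.8):
    """Assign each vector to the first group whose first member (its
    "leader") is similar enough; otherwise start a new group."""
    leaders = []   # index of the first member of each group
    labels = []    # group number assigned to each document
    for i, v in enumerate(vectors):
        for group, leader in enumerate(leaders):
            if cosine_similarity(v, vectors[leader]) >= threshold:
                labels.append(group)
                break
        else:
            leaders.append(i)
            labels.append(len(leaders) - 1)
    return labels

# Hypothetical document feature vectors: two documents per topic.
vectors = np.array([[2.0, 1.0, 0.0],
                    [4.0, 2.0, 0.1],
                    [0.0, 0.2, 3.0],
                    [0.1, 0.0, 2.0]])
labels = leader_clustering(vectors)   # [0, 0, 1, 1]
```

Each distinct label corresponds to one partial document set.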
[0010]
FIG. 3 shows an operation flow of the first embodiment. The operation of this embodiment will be described below with reference to FIG.
First, a document data group (document set) to be classified is input by the document input unit 1 via a keyboard, an OCR device, a removable storage medium, a network interface unit, or the like, and is stored in a document storage unit (not shown) (step S1).
Next, the document analysis unit 2 performs natural language analysis on each input document data, and extracts the words and their parts of speech (step S2). Then, a unique word identification code (ID) is assigned to each word that appears in the document data group, and the number of occurrences of each word is counted for each document (step S2).
Subsequently, based on the document analysis data generated by the document analysis unit 2 (words, word IDs, word occurrence counts, part-of-speech information, etc.), the feature vector generation unit 3 generates document-word matrix data in which the row components are document IDs, the column components are word IDs, and each matrix element is the number of occurrences, in the document with that document ID, of the word with that word ID (step S3). Each row vector of this document-word matrix is then taken as a document feature vector (see FIG. 2).
Next, the document classification unit 5 performs document classification by applying a statistical method to the generated document feature vectors (step S5). Since documents having similar document feature vector values are similar documents, documents having similar document feature vector values are collected to generate a plurality of partial document sets.
Thereafter, the document classification unit 5 stores the generated document classification result in the classification result storage unit 6 (step S6), and the repeat determination unit 7 determines, using a predetermined repetition condition, whether or not to correct the document feature vectors and repeat the document classification (step S7). The predetermined repetition condition may be a predetermined number of repetitions, or the determination may be made with reference to the number of dimensions of the document feature vector. Alternatively, the user may inspect the classification result and instruct whether to repeat. If it is determined to repeat (Yes in step S7), the document feature vectors are corrected as described above (step S4). For example, one feature dimension constituting the document feature vector is selected based on a predetermined criterion, and that feature dimension is deleted.
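Steps S2 and S3 can be sketched as follows, assuming that the documents have already been split into words. The natural language analysis itself is outside the scope of this sketch, and the function name `build_document_word_matrix` and the sample documents are illustrative assumptions.

```python
import numpy as np

def build_document_word_matrix(tokenized_docs):
    """Assign a word ID to each distinct word in order of first appearance
    and count occurrences per document (rows: document IDs, columns: word IDs)."""
    word_ids = {}
    for tokens in tokenized_docs:
        for word in tokens:
            if word not in word_ids:
                word_ids[word] = len(word_ids)
    matrix = np.zeros((len(tokenized_docs), len(word_ids)), dtype=int)
    for doc_id, tokens in enumerate(tokenized_docs):
        for word in tokens:
            matrix[doc_id, word_ids[word]] += 1
    return matrix, word_ids

# Hypothetical, already tokenized documents.
docs = [["printer", "toner", "printer"],
        ["network", "printer"],
        ["network", "router", "router"]]

matrix, word_ids = build_document_word_matrix(docs)
# Each row of `matrix` is the document feature vector of one document.
```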
Subsequently, the document classification unit 5 performs document classification again using the corrected feature vector (step S5), and stores the classification result in the classification result storage unit 6 (step S6).
Thus, as described above, when the document feature vectors are generated using singular value decomposition, for example, the dimensions of the document feature vector are ranked by the magnitude of the corresponding singular values, and the feature dimensions are deleted one at a time in descending order of singular value. This makes it possible to perform document classification in a feature space from which the influence of the major topics is eliminated one after another.
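The overall flow of steps S3 to S7 in the first embodiment can be sketched as a loop. The sketch makes several assumptions that are not in the specification: the document feature vectors are the rows of `U * s`, the repetition condition is a fixed number of repetitions capped by the number of feature dimensions, and `classify` is a simple stand-in (greedy grouping by cosine similarity) for whichever statistical method is actually used.

```python
import numpy as np

def classify(vectors, threshold=0.8, eps=1e-12):
    """Stand-in classifier: greedy grouping by cosine similarity to the
    first member of each group. Near-zero vectors match nothing."""
    leaders, labels = [], []
    for i, v in enumerate(vectors):
        assigned = None
        for group, j in enumerate(leaders):
            denom = np.linalg.norm(v) * np.linalg.norm(vectors[j])
            if denom > eps and (v @ vectors[j]) / denom >= threshold:
                assigned = group
                break
        if assigned is None:
            leaders.append(i)
            assigned = len(leaders) - 1
        labels.append(assigned)
    return labels

def repeated_classification(X, max_iterations):
    """Classify, then repeatedly delete the dimension with the largest
    remaining singular value and classify again."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    vectors = U * s                                   # assumed feature vectors
    max_iterations = min(max_iterations, vectors.shape[1])
    results = []                                      # classification result storage
    for iteration in range(max_iterations):
        if iteration > 0:                             # S4 is bypassed on the first pass
            vectors = vectors.copy()
            vectors[:, iteration - 1] = 0.0           # cumulative deletion
        results.append(classify(vectors))             # S5 and S6
    return results

# Hypothetical count matrix with two clearly separated topics.
X = np.array([[2, 1, 0, 0],
              [4, 2, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 2, 2]], dtype=float)
results = repeated_classification(X, max_iterations=3)
```

Each entry of `results` is one stored classification result, so later passes can be compared with the first one to see which groupings appear only after the dominant dimensions have been removed.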
[0011]
FIG. 4 is a block diagram showing the configuration of the document classification apparatus according to the second embodiment of the present invention. Components identical to those of the first embodiment (see FIG. 1) are given the same reference numerals. As illustrated, in addition to the configuration of the first embodiment, this embodiment includes a feature vector storage unit 8 serving as feature vector storage means that stores the document feature vectors obtained by the feature vector generation unit 3. To manage the document feature vectors efficiently, the feature vector storage unit 8 also stores additional information accompanying the document-word matrix data generated by the feature vector generation unit 3, for example, word ID-word map data describing the correspondence between each word ID (a column component of the document-word matrix data) and its word, and word ID-part-of-speech map data describing the correspondence between each word ID and the part-of-speech information of its word.
By adding the feature vector storage unit 8, the feature vector correction unit 4 in this embodiment can, each time it corrects the document feature vectors, obtain the document feature vectors to be operated on (corrected) from the feature vector storage unit 8. This makes it possible to perform document classification using document feature vectors that do not inherit the effect (result) of operations (for example, deletion of one dimension) previously performed on the document feature vectors.
For example, suppose that the document feature vectors are generated by singular value decomposition and that the n-th feature dimension is deleted at the n-th iteration. Let $P_n$ be the correction matrix at that iteration, $A_0$ the document feature vectors stored in the feature vector storage unit 8, and $A_n$ the corrected document feature vectors. Then
Formula (3)  $A_n = A_0 P_n$
holds. In the first embodiment, by contrast,
Formula (4)  $A_n = A_0 P_n P_{n-1} \cdots P_0$
holds. In other words, in the second embodiment, document classification can be performed in a feature space that excludes only the topic expressed by the feature dimension currently being deleted.
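The difference between formulas (3) and (4) can be checked numerically with a small sketch. It assumes dimension-preserving correction matrices (the identity with one diagonal element set to 0) and 0-based dimension indices. The helper `deletion_matrix`, the sample values in `A0`, and the indexing of the matrices are illustrative and do not follow the iteration numbering of the specification.

```python
import numpy as np

def deletion_matrix(k, dim):
    """k x k identity with the diagonal element of one dimension set to 0."""
    P = np.eye(k)
    P[dim, dim] = 0.0
    return P

# Hypothetical stored document feature vectors A_0 (2 documents, k = 3).
A0 = np.array([[5.0, 3.0, 1.0],
               [4.0, 2.0, 0.5]])
k = A0.shape[1]
n = 1   # dimension deleted at the current iteration (0-based)

# Second embodiment, formula (3): only the current correction is applied
# to the stored vectors A_0.
A_second = A0 @ deletion_matrix(k, n)

# First embodiment, formula (4): the corrections accumulate. Diagonal
# matrices commute, so the order of multiplication does not matter here.
A_first = A0.copy()
for i in range(n + 1):
    A_first = A_first @ deletion_matrix(k, i)
```

In `A_second` only dimension 1 is zeroed, whereas in `A_first` dimensions 0 and 1 are both zeroed.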
[0012]
FIG. 5 is a block diagram showing the configuration of the document classification apparatus according to the third embodiment of the present invention. In FIG. 5, components identical to those of the first embodiment (see FIG. 1) and the second embodiment (see FIG. 4) are given the same reference numerals. As shown in the drawing, in addition to the configuration of the second embodiment, the third embodiment includes a partial document set extraction unit 9 that extracts the document feature vectors belonging to each partial document set from the stored classification results, a partial document set variance calculation unit 10 that calculates the variance value of each feature dimension among the document feature vectors in each extracted partial document set, and an operation target feature dimension determination unit 11 that determines the feature dimension to be operated on using statistical information such as the calculated variance value of each feature dimension.
With this configuration, in this embodiment, for example, the variance value of each feature dimension in each partial document set is calculated as statistical information from the classification results stored in the classification result storage unit 6, and the feature dimension to be operated on is determined using the calculated variance values. The basis of this determination method is that the variance of a feature dimension within a partial document set can be considered to indicate how much that feature dimension contributes to grouping the partial document set. In other words, a partial document set is densely concentrated along a feature dimension with small variance, so that feature dimension can be considered to contribute highly to the grouping. Therefore, for each partial document set, a feature dimension with small variance is considered to be strongly related to the topic expressed by that partial document set. By performing document classification in a feature vector space from which this feature dimension has been deleted, for example, it is considered that partial document sets expressing topics other than the topic expressed by that partial document set can be extracted. The units added in this embodiment are described further below.
First, the partial document set extraction unit 9 extracts document feature vectors belonging to all the generated partial document sets from the classification results stored in the classification result storage unit 6. It should be noted that the target partial document set may be only the partial document set generated immediately before, or all generated partial document sets.
[0013]
In addition, the partial document set variance calculation unit 10 calculates the variance value of each feature dimension among the document feature vectors belonging to each partial document set extracted by the partial document set extraction unit 9. At this time, for each partial document set, the rank order of the variance values of the feature dimensions is calculated, and for each feature dimension, the rank order of the partial document sets with respect to its variance value is also calculated.
The operation target feature dimension determination unit 11 determines the feature dimension to be operated on by the feature vector correction unit 4 based on the variance value of each feature dimension in each partial document set calculated by the partial document set variance calculation unit 10, the rank order of the variance values of the feature dimensions in each partial document set, and the rank order of the partial document sets with respect to the variance value of each feature dimension. For example, a feature dimension whose variance value is small across all partial document sets is selected as the feature dimension to be operated on, or a feature dimension whose variance rank is always at or below a certain rank in all partial document sets (that is, one whose variance is small) is selected as the feature dimension to be operated on.
When only the partial document set generated immediately before is extracted, the feature dimension to be operated on by the feature vector correction unit 4 is determined using the variance value of each feature dimension in that partial document set and the rank order of the variance values of the feature dimensions in that partial document set.
Thus, in this embodiment, document classification is performed in the feature vector space from which the selected feature dimension has been deleted, so that partial document sets expressing topics other than those expressed by the existing partial document sets can be extracted.
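The variance-based selection of the third embodiment can be sketched as follows. The selection rule used here, choosing the dimension whose worst variance rank over all partial document sets is lowest, is only one possible reading of "rank always at or below a certain rank" and is an assumption of this sketch. The function name, the sample vectors, and the labels are also hypothetical.

```python
import numpy as np

def select_dimension_by_variance(vectors, labels):
    """Return the index of the feature dimension whose variance is
    consistently small within every partial document set, together with
    the per-set variance table."""
    vectors = np.asarray(vectors, dtype=float)
    labels = np.asarray(labels)
    # One row of per-dimension variances for each partial document set.
    variances = np.array([vectors[labels == c].var(axis=0)
                          for c in np.unique(labels)])
    # Rank 0 is the smallest variance within each partial document set.
    ranks = variances.argsort(axis=1).argsort(axis=1)
    worst_rank = ranks.max(axis=0)      # worst rank of each dimension over all sets
    return int(worst_rank.argmin()), variances

# Hypothetical feature vectors and a stored classification result.
vectors = [[1.0, 0.0, 5.0],
           [1.1, 4.0, 2.0],
           [3.0, 2.0, 0.0],
           [3.1, 9.0, 6.0]]
labels = [0, 0, 1, 1]

target_dim, variances = select_dimension_by_variance(vectors, labels)
```

In this example dimension 0 has the smallest variance in both partial document sets, so it would be the dimension passed to the feature vector correction unit 4 for deletion.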
The above description concerns a document classification apparatus having the configuration shown in FIGS. 1, 4, and 5. However, document classification may also be performed, for example, by storing a program written according to the document classification method of the present invention, as described in each embodiment, on a removable storage medium and loading that storage medium into an information processing apparatus, such as a personal computer, that was previously unable to perform document classification according to the method of the present invention.
[0014]
[Effects of the invention]
As explained above, according to the inventions described in claims 1 and 5, a large number of partial document sets on different topics existing in the document set to be classified can be automatically extracted, and thus a comprehensive analysis of the document set can be performed. Furthermore, the feature dimensions can be operated on efficiently.
According to the inventions described in claims 2 and 6, the effect of each successive feature dimension operation on the document feature vectors applies only to the document classification performed immediately afterward. That is, partial document sets can be generated that do not inherit the effects of the successively performed feature dimension operations, so topics different from those extracted by the invention described in claim 1 or 5 can also be extracted.
[0015]
According to the inventions described in claims 3 and 7, a large number of partial document sets on different topics can be automatically extracted in a manner different from that of the invention described in claim 1 or claim 5, so the effects of the invention described in claim 1 or claim 5 can be further improved.
According to the inventions described in claims 4 and 8, the effects of the invention described in claim 3 or claim 7 can be easily realized.
According to the invention described in claim 9, the effects of the invention according to any one of claims 5 to 8 can be obtained in an information processing apparatus.
[Brief description of the drawings]
FIG. 1 is a block diagram showing the configuration of a document classification apparatus according to a first embodiment of the present invention.
FIG. 2 is an explanatory diagram of a document classification method according to the first embodiment of this invention.
FIG. 3 is an operation flowchart of the document classification method according to the first embodiment of the present invention.
FIG. 4 is a block diagram showing the configuration of a document classification apparatus according to a second embodiment of the present invention.
FIG. 5 is a block diagram showing the configuration of a document classification apparatus according to a third embodiment of the present invention.
[Explanation of symbols]
1 Document input unit
2 Document analysis unit
3 Feature vector generation unit
4 Feature vector correction unit
5 Document classification unit
6 Classification result storage unit
7 Repeat determination unit
8 Feature vector storage unit
9 Partial document set extraction unit
10 Partial document set variance calculation unit
11 Operation target feature dimension determination unit

Claims

1. A document classification apparatus that automatically classifies a document set according to the contents of documents, comprising:
document analysis means for extracting words from the document data of each document in a document set consisting of a plurality of documents, and counting the number of occurrences of the extracted words for each document;
feature vector generation means for generating, based on the words obtained by the document analysis means and the numbers of occurrences of the words, a matrix in which the documents and the words appearing in the documents correspond respectively to the matrix components and each matrix element is the number of occurrences of the word counted for each document, and for obtaining document feature vectors by applying singular value decomposition to the matrix;
feature vector correction means for correcting the document feature vectors by deleting feature dimensions in descending order of the corresponding singular values; and
document classification means for classifying the document set into a plurality of partial document sets based on the similarity between document feature vectors, including the document feature vectors corrected by the feature vector correction means, and storing the classification result in classification result storage means,
wherein,
after the document classification means stores the classification result, a determination is made using a predetermined repetition condition, and when it is determined to repeat, the operation in which the feature vector correction means corrects the document feature vectors and the operation in which the document classification means classifies the document set into partial document sets and stores the classification result in the classification result storage means are repeated.

2. The document classification apparatus according to claim 1, further comprising
feature vector storage means for storing the document feature vectors obtained by the feature vector generation means,
wherein the feature vector correction means, when repeatedly correcting the document feature vectors, corrects the feature vectors stored in the feature vector storage means.

3. The document classification apparatus according to claim 1 or 2,
wherein statistical information is calculated from the classification results stored in the classification result storage means, and the feature dimension to be deleted is determined using the calculated statistical information.

4. The document classification apparatus according to claim 3, wherein the statistical information is a variance value of a feature dimension in each partial document set.

5. A document classification method executed by a document classification apparatus that includes document analysis means, feature vector generation means, feature vector correction means, document classification means, and classification result storage means, and that automatically classifies a document set according to the contents of documents, the method comprising:
a step in which the document analysis means extracts words from the document data of each document in a document set consisting of a plurality of documents, and counts the number of occurrences of the extracted words for each document;
a step in which the feature vector generation means generates, based on the words obtained by the document analysis means and the numbers of occurrences of the words, a matrix in which the row components correspond to the documents, the column components correspond to the words, and each matrix element is the number of occurrences of the word counted for each document, and obtains document feature vectors by applying singular value decomposition to the matrix;
a step in which the feature vector correction means corrects the document feature vectors by deleting feature dimensions in descending order of the corresponding singular values;
a step in which the document classification means classifies the document set into a plurality of partial document sets based on the similarity between document feature vectors, including the document feature vectors corrected by the feature vector correction means, and stores the classification result in the classification result storage means; and
a step of repeating, after the document classification means stores the classification result and when it is determined to repeat according to a determination using a predetermined repetition condition, the operation in which the feature vector correction means corrects the document feature vectors and the operation in which the document classification means classifies the document set into partial document sets and stores the classification result in the classification result storage means.

6. The document classification method according to claim 5, wherein
the document classification apparatus further includes feature vector storage means for storing the document feature vectors first obtained by the feature vector generation means, and
the feature vector correction means, when repeatedly correcting the document feature vectors, corrects the first-obtained document feature vectors stored in the feature vector storage means.

7. The document classification method according to claim 5 or 6,
wherein statistical information is calculated from the classification results stored in the classification result storage means, and the feature dimension to be deleted is determined using the calculated statistical information.

8. The document classification method according to claim 7, wherein the statistical information is a variance value of a feature dimension in each partial document set.

9. A computer-readable storage medium storing a program for executing the document classification method according to claim 5.