JP4063489B2

JP4063489B2 - Document classification apparatus, document classification method, and computer-readable recording medium storing a program for causing a computer to execute the method

Info

Publication number: JP4063489B2
Application number: JP2000306537A
Authority: JP
Inventors: 雅樹辻井
Original assignee: 株式会社ジャストシステム
Priority date: 2000-10-05
Filing date: 2000-10-05
Publication date: 2008-03-19
Anticipated expiration: 2020-10-05
Also published as: JP2002117046A

Description

【０００１】
【発明の属する技術分野】
この発明は、電子化された文書をあらかじめ設定されたカテゴリのうちいずれか一つに分類する文書分類装置、文書分類方法およびその方法をコンピュータに実行させるプログラムを記録したコンピュータ読み取り可能な記録媒体に関する。
【０００２】
【従来の技術】
コーパスと呼ばれる文書群をもちいて、与えられた分類対象文書（群）をあらかじめ設定されたいずれかのカテゴリに自動分類する文書分類装置が従来から知られている。この種の装置では、あらかじめ多種多様な内容を有する大量の文書（コーパス）について、人手によりそれぞれの分類先のカテゴリを決定しておき、各カテゴリに分類される文書の特徴から逆に各カテゴリの特徴を算出しておく。そして、分類対象文書が与えられたときは、当該文書の特徴と各カテゴリの特徴とを順次比較して、もっとも類似度の高いカテゴリへ当該文書を分類する。
【０００３】
ただ、分類される文書の特性によっては、あらかじめ設定された分類体系が不適切となる場合もあるので、分類対象文書の傾向にあわせて、自動的に分類体系を変更してゆく文書分類装置も提案されている。このような従来技術としては、たとえば特開平０７−０４９８７５や特開２０００−０１１００４を挙げることができる。
【０００４】
特開平０７−０４９８７５における文書収集サーバシステムは、自動的に複数の情報源に接続して新文書を取得し、適合度計算によって、あらかじめユーザーが記述した検索条件との適合度を調べる。そして、検索条件間の関係から分類体系を構成し、適合した文書を分類してフォルダに格納する。さらに、各フォルダへの情報の集まり具合を監視し、自動的にフォルダの細分化、統合、構造の変更をおこなう。
【０００５】
また、特開２０００−０１１００４における情報自動分類装置は、あるカテゴリに誤って分類された文書がある場合に、当該カテゴリに付随する仮カテゴリを設けてそこに上記文書を分類するようにしている。そして、以後の文書は正規のカテゴリまたは仮カテゴリのうちより類似度の高い方に分類される。すなわち、既存のカテゴリの範疇を超える文書が出現した場合には、正規のカテゴリに隣接してそれに準ずる仮カテゴリが自動的に生成されることになる。この仮カテゴリを正規のカテゴリに格上げすることもできる。
【０００６】
【発明が解決しようとする課題】
しかしながら、上記従来技術のうち特開平０７−０４９８７５においては、既存のカテゴリを分割（細分化）してゆくことができるのみなので、まったく新しい分野の文書が出現した場合には当該文書にふさわしいカテゴリを生成することはできない。また、文書がほとんどあるいはまったく分類されていない、不必要なカテゴリを削除することもできない。また、特開２０００−０１１００４によっても、既存のカテゴリとの類似性の低いカテゴリを生成することはできず、またカテゴリの分割や削除もおこなうことができない。
【０００７】
すなわち、上記従来技術は分類対象文書群の質的変化に弱く、既存のカテゴリに分類できない新規な文書が現れてくると、分類体系が分類対象文書群の特性にそぐわないものである結果、適切な分類ができず分類精度が低下してしまうという問題点があった。
【０００８】
この発明は、上述した従来技術による問題点を解消するため、分類対象文書群の特性に見合った分類体系を常に維持することができ、したがってその変化にともなう分類精度の低下を防止することが可能な文書分類装置、文書分類方法およびその方法をコンピュータに実行させるプログラムを記録したコンピュータ読み取り可能な記録媒体を提供することを目的とする。
【０００９】
【課題を解決するための手段】
上述した課題を解決し、目的を達成するため、この発明にかかる文書分類装置は、電子化された文書をあらかじめ設定されたカテゴリのうちいずれか一つに分類する文書分類装置において、訓練用文書集合を構成する各文書の特徴ベクトルから前記各カテゴリの特徴ベクトル空間を算出する第１の特徴ベクトル空間算出手段と、すべてのベクトルの長さが等しい特徴ベクトル空間、訓練用文書集合の特徴ベクトル空間および分類対象文書集合の特徴ベクトル空間の重み付き平均から分類不能カテゴリの特徴ベクトル空間を算出する第２の特徴ベクトル空間算出手段と、前記第１の特徴ベクトル空間算出手段により算出された各カテゴリの特徴ベクトル空間および前記第２の特徴ベクトル空間算出手段により算出された分類不能カテゴリの特徴ベクトル空間と、分類対象文書集合を構成する各文書の特徴ベクトルとを比較することにより、前記各カテゴリまたは前記分類不能カテゴリのうちいずれか一つを前記各文書の分類先のカテゴリと算出する分類カテゴリ算出手段と、前記分類カテゴリ算出手段により分類不能カテゴリが分類先のカテゴリと算出される頻度が一定の閾値を上回ったかどうかを判定するカテゴリ追加要否判定手段と、前記カテゴリ追加要否判定手段により分類不能カテゴリが分類先のカテゴリと算出される頻度が一定の閾値を上回ったと判定された場合に、新たなカテゴリの追加を操作者に対して推奨するカテゴリ追加推奨手段と、を備えたことを特徴とする。
【００１０】
この発明によれば、分類不能カテゴリに分類される文書が多くなると、操作者に対して新たなカテゴリの追加が推奨される。
【００１１】
また、この発明にかかる文書分類装置は、電子化された文書をあらかじめ設定されたカテゴリのうちいずれか一つに分類する文書分類装置において、訓練用文書集合を構成する各文書の特徴ベクトルから前記各カテゴリの特徴ベクトル空間を算出する特徴ベクトル空間算出手段と、前記特徴ベクトル空間算出手段により算出された各カテゴリの特徴ベクトル空間と分類対象文書集合を構成する各文書の特徴ベクトルとを比較することにより、前記各カテゴリのうちいずれか一つを前記各文書の分類先のカテゴリと算出する分類カテゴリ算出手段と、前記分類カテゴリ算出手段により分類先のカテゴリと算出される頻度が一定の閾値を下回ったカテゴリがあるかどうかを判定するカテゴリ削除・併合要否判定手段と、前記カテゴリ削除・併合要否判定手段により分類先のカテゴリと算出される頻度が一定の閾値を下回ったカテゴリがあると判定された場合に、当該カテゴリの削除または併合を操作者に対して推奨するカテゴリ削除・併合推奨手段と、を備えたことを特徴とする。
【００１２】
この発明によれば、あるカテゴリに分類される文書が少なくなると、操作者に対して当該カテゴリの削除または併合が推奨される。
【００１３】
また、この発明にかかる文書分類装置は、上記発明において、さらに、前記分類カテゴリ算出手段により分類先のカテゴリと算出される頻度が一定の閾値を上回ったカテゴリがあるかどうかを判定するカテゴリ分割要否判定手段と、前記カテゴリ分割要否判定手段により分類先のカテゴリと算出される頻度が一定の閾値を上回ったカテゴリがあると判定された場合に、当該カテゴリの分割を操作者に対して推奨するカテゴリ分割推奨手段と、を備えたことを特徴とする。
【００１４】
この発明によれば、あるカテゴリに分類される文書が多くなると、操作者に対して当該カテゴリの分割が推奨される。
【００１５】
また、この発明にかかる文書分類方法は、電子化された文書をあらかじめ設定されたカテゴリのうちいずれか一つに分類する文書分類方法において、訓練用文書集合を構成する各文書の特徴ベクトルから前記各カテゴリの特徴ベクトル空間を算出する第１の特徴ベクトル空間算出工程と、すべてのベクトルの長さが等しい特徴ベクトル空間、訓練用文書集合の特徴ベクトル空間および分類対象文書集合の特徴ベクトル空間の重み付き平均から分類不能カテゴリの特徴ベクトル空間を算出する第２の特徴ベクトル空間算出工程と、前記第１の特徴ベクトル空間算出工程で算出された各カテゴリの特徴ベクトル空間および前記第２の特徴ベクトル空間算出工程で算出された分類不能カテゴリの特徴ベクトル空間と、分類対象文書集合を構成する各文書の特徴ベクトルとを比較することにより、前記各カテゴリまたは前記分類不能カテゴリのうちいずれか一つを前記各文書の分類先のカテゴリと算出する分類カテゴリ算出工程と、前記分類カテゴリ算出工程で分類不能カテゴリが分類先のカテゴリと算出される頻度が一定の閾値を上回ったかどうかを判定するカテゴリ追加要否判定工程と、前記カテゴリ追加要否判定工程で分類不能カテゴリが分類先のカテゴリと算出される頻度が一定の閾値を上回ったと判定された場合に、新たなカテゴリの追加を操作者に対して推奨するカテゴリ追加推奨工程と、を含んだことを特徴とする。
【００１６】
この発明によれば、分類不能カテゴリに分類される文書が多くなると、操作者に対して新たなカテゴリの追加が推奨される。
【００１７】
また、この発明にかかる文書分類方法は、電子化された文書をあらかじめ設定されたカテゴリのうちいずれか一つに分類する文書分類方法において、訓練用文書集合を構成する各文書の特徴ベクトルから前記各カテゴリの特徴ベクトル空間を算出する特徴ベクトル空間算出工程と、前記特徴ベクトル空間算出工程で算出された各カテゴリの特徴ベクトル空間と分類対象文書集合を構成する各文書の特徴ベクトルとを比較することにより、前記各カテゴリのうちいずれか一つを前記各文書の分類先のカテゴリと算出する分類カテゴリ算出工程と、前記分類カテゴリ算出工程で分類先のカテゴリと算出される頻度が一定の閾値を下回ったカテゴリがあるかどうかを判定するカテゴリ削除・併合要否判定工程と、前記カテゴリ削除・併合要否判定工程で分類先のカテゴリと算出される頻度が一定の閾値を下回ったカテゴリがあると判定された場合に、当該カテゴリの削除または併合を操作者に対して推奨するカテゴリ削除・併合推奨工程と、を含んだことを特徴とする。
【００１８】
この発明によれば、あるカテゴリに分類される文書が少なくなると、操作者に対して当該カテゴリの削除または併合が推奨される。
【００１９】
また、この発明にかかる文書分類方法は、上記発明において、さらに、前記分類カテゴリ算出工程で分類先のカテゴリと算出される頻度が一定の閾値を上回ったカテゴリがあるかどうかを判定するカテゴリ分割要否判定工程と、前記カテゴリ分割要否判定工程で分類先のカテゴリと算出される頻度が一定の閾値を上回ったカテゴリがあると判定された場合に、当該カテゴリの分割を操作者に対して推奨するカテゴリ分割推奨工程と、を含んだことを特徴とする。
【００２０】
この発明によれば、あるカテゴリに分類される文書が多くなると、操作者に対して当該カテゴリの分割が推奨される。
【００２１】
また、この発明にかかる記録媒体は、上記方法をコンピュータに実行させるプログラムを記録したことで、当該プログラムをコンピュータで読み取ることが可能となり、これによって、上記方法をコンピュータによって実施することが可能となる。
【００２２】
【発明の実施の形態】
以下に添付図面を参照して、この発明にかかる文書分類装置、文書分類方法およびその方法をコンピュータに実行させるプログラムを記録したコンピュータ読み取り可能な記録媒体の好適な実施の形態を詳細に説明する。
【００２３】
（実施の形態）
まず、この発明の実施の形態による文書分類装置のハードウエア構成について説明する。図１は、この発明の実施の形態による文書分類装置のハードウエア構成を示す説明図である。同図において、１０１はシステム全体を制御するＣＰＵを、１０２は基本入出力プログラムを記憶したＲＯＭを、１０３はＣＰＵ１０１のワークエリアとして使用されるＲＡＭを、それぞれ示している。
【００２４】
また、１０４はＣＰＵ１０１の制御にしたがってＨＤ（ハードディスク）１０５に対するデータのリード／ライトを制御するＨＤＤ（ハードディスクドライブ）を、１０５はＨＤＤ１０４の制御にしたがって書き込まれたデータを記憶するＨＤを、それぞれ示している。また、１０６はＣＰＵ１０１の制御にしたがってＦＤ（フロッピーディスク）１０７に対するデータのリード／ライトを制御するＦＤＤ（フロッピーディスクドライブ）を、１０７はＦＤＤ１０６の制御にしたがって書き込まれたデータを記憶する着脱自在のＦＤを、それぞれ示している。
【００２５】
また、１０８はカーソル、メニュー、ウィンドウ、あるいは文字や画像などの各種データを表示するディスプレイを、１０９は通信回線１１０を介してネットワークＮＥＴに接続され、そのネットワークＮＥＴとＣＰＵ１０１とのインターフェースとして機能するネットワークボードを、それぞれ示している。また、１１１は文字、数値、各種指示などの入力のための複数のキーを備えたキーボードを、１１２は各種指示の選択や実行、処理対象の選択、カーソルの移動などをおこなうマウスを、それぞれ示している。
【００２６】
また、１１３は文字や画像を光学的に読み取るスキャナを、１１４はＣＰＵ１０１の制御にしたがって文字や画像を印刷するプリンタを、１１５は着脱可能な記録媒体であるＣＤ−ＲＯＭを、１１６はＣＤ−ＲＯＭ１１５に対するデータのリードを制御するＣＤ−ＲＯＭドライブを、１００は上記各部を接続するためのバスまたはケーブルを、それぞれ示している。
【００２７】
つぎに、この発明の実施の形態にかかる文書分類装置の機能的構成について説明する。図２は、この発明の実施の形態にかかる文書分類装置の構成を機能的に示す説明図である。
【００２８】
２０１は、文書記憶部であり、後述する分類カテゴリ推定部２０３による分類の対象となる文書群（分類対象文書集合）２０１ａを保持している。それぞれの文書は、ネットワーク上の他の情報処理装置などから通信回線１１０を介して取得されたり、キーボード１１１によって入力されたり、紙媒体の文書からスキャナ１１３を介して取り込まれたり、あるいはＣＤ−ＲＯＭ１１５などの各種記録媒体から読み込まれたりしたものである。なお、それぞれの文書の取得・蓄積時に差異があってもよい（５０個の文書のうち、１０個は一週間前に取得・蓄積され、２５個は昨日取得・蓄積され、１５個は今日取得・蓄積された、など）。
【００２９】
２０２は、特徴ベクトル空間推定部であり、初期状態では訓練用文書集合２０２ａと文書−カテゴリ対応表２０２ｂとを保持している。訓練用文書集合２０２ａはいわゆるコーパスであって、様々な分野に属する文書をまんべんなく含む文書集合、たとえば一年分の新聞記事からなる文書集合である。また、文書−カテゴリ対応表２０２ｂは、訓練用文書集合２０２ａの各文書に対応づけて、当該文書が分類されるカテゴリを記憶したテーブルである。このカテゴリの名称や上位・下位関係は、後述するカテゴリ体系管理部２０４のカテゴリ体系表２０４ａで定義されている。
【００３０】
特徴ベクトル空間推定部２０２は、訓練用文書集合２０２ａと文書−カテゴリ対応表２０２ｂとから、分類用データベース２０２ｃを作成する。図３は、分類用データベース２０２ｃの作成方法を具体的に説明するための説明図である。訓練用文書集合２０２ａが図３（ａ）に示すＤ１〜Ｄ１０の１０個の文書からなり、それらの文書が図３（ｂ）の文書−カテゴリ対応表２０２ｂに示すように、Ａ、ＢまたはＣのいずれかのカテゴリに分類されていたものとする。
【００３１】
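The feature vector space estimation unit 202 first creates, for each document in the training document set 202a, a vector (feature vector) that approximately represents the semantic content of that document. Various conventional techniques exist for creating feature vectors. For convenience of explanation, the simplest one is adopted here: the frequency (number of occurrences) in each document of every word appearing in the document set (the whole vocabulary, excluding stop words and the like) is counted and used as the value of the corresponding vector element (referred to in this description as the "length" of each vector).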
【００３２】
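If the words appearing in the training document set 202a are the five words W1 to W5, the feature vector created for each document is a five-dimensional vector whose elements are the frequencies of W1, W2, W3, W4 and W5 in that document. FIG. 3(c) shows an example of the feature vectors created for the documents. For example, document D1, in which W1 appears 0 times, W2 twice, W3 once, W4 0 times and W5 three times, has the feature vector V1 = (0, 2, 1, 0, 3). Similarly, the feature vector of document D5 is V5 = (1, 5, 1, 0, 4), and that of document D8 is V8 = (0, 2, 0, 0, 2).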
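The counting scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `feature_vector` and the token list for D1 are hypothetical, and tokenization and stop-word removal are assumed to have happened already.

```python
from collections import Counter

def feature_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vocabulary = ["W1", "W2", "W3", "W4", "W5"]
# Hypothetical token sequence for document D1 (W2 twice, W3 once, W5 three times).
d1_tokens = ["W2", "W5", "W3", "W2", "W5", "W5"]
v1 = feature_vector(d1_tokens, vocabulary)
print(v1)
```

A plain list keeps the sketch dependency-free; a sparse representation would be more practical for a realistic vocabulary.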
【００３３】
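Having created a feature vector for each document in the training document set 202a, the feature vector space estimation unit 202 next calculates the feature vector space of each category by averaging the feature vectors of the documents classified into that category. For example, the feature vector space VA of category A is the average of the feature vectors of documents D1, D5 and D8, which belong to category A: VA = ((0+1+0)/3, (2+5+2)/3, (1+1+0)/3, (0+0+0)/3, (3+4+2)/3), that is, VA = (0.33, 3, 0.66, 0, 3), with fractional values truncated to two decimal places. From this VA, together with VB and VC obtained in the same way, the classification database 202c shown in FIG. 3(d) is created.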
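As a minimal sketch of this averaging step, the following Python computes the category A vector from the three example documents. The helper name `centroid` is an assumption made for illustration; the vectors are those given for D1, D5 and D8.

```python
def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

category_a_docs = {
    "D1": [0, 2, 1, 0, 3],
    "D5": [1, 5, 1, 0, 4],
    "D8": [0, 2, 0, 0, 2],
}
v_a = centroid(list(category_a_docs.values()))
print([round(x, 2) for x in v_a])
```

Rounding to two decimals gives 0.67 for the third element, where the description truncates to 0.66.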
【００３４】
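Here, in addition to A, B and C, the classification database 202c is provided with an "unclassifiable" category. The feature vector space of this unclassifiable category is calculated by the following steps (1) to (4).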
【００３５】
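(1) Calculation of the feature vector space Vα in which all vectors have equal length
First, the feature vector space estimation unit 202 creates, for each document in the classification target document set 201a held in the document storage unit 201, a feature vector whose element values are the frequencies in that document of all words appearing in the document set, or of a number of preselected words. Here, the classification target document set 201a is assumed to consist of the five documents D'1 to D'5 shown in FIG. 4(a), with the feature vectors shown in FIG. 4(b) created for them.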
【００３６】
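The feature vector space estimation unit 202 then divides the total number of occurrences of the words W1 to W7 in the training document set 202a and the classification target document set 201a (in other words, the sum of the vector lengths; here 87 + 59 = 146) by the total number of documents (here 10 + 5 = 15) to obtain the average number of word occurrences per document (here 146/15 = 9.73). It further divides this by the vocabulary size (in other words, the number of vectors; here 5) to obtain the average number of occurrences of each word per document (here 9.73/5 = 1.95). It then creates the space Vα represented by the feature vector (1.95, 1.95, 1.95, 1.95, 1.95, 1.95, 1.95), in which this value is the length of the vector for each of the words W1 to W7.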
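The arithmetic for Vα can be reproduced with a short Python sketch. All variable names are hypothetical. Note that the description divides by a vocabulary size of 5 while building a seven-element vector for W1 to W7; the sketch follows the numbers as stated rather than resolving that difference.

```python
train_total, target_total = 87, 59   # word occurrences in each document set
n_train, n_target = 10, 5            # number of documents in each set
vocabulary_divisor = 5               # divisor used in the description
n_dimensions = 7                     # W1 to W7

# Average number of word occurrences per document.
per_document = (train_total + target_total) / (n_train + n_target)
# Average number of occurrences of each word per document.
per_word = per_document / vocabulary_divisor

# Every element of V_alpha carries the same value.
v_alpha = [round(per_word, 2)] * n_dimensions
print(round(per_document, 2), v_alpha)
```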
【００３７】
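(2) Calculation of the feature vector space Vβ of the training document set 202a
Next, the feature vector space estimation unit 202 calculates how many times on average each of the words W1 to W7 appears per document in the training document set 202a, and creates the feature vector space Vβ of the training document set 202a. Here, the feature vector representing Vβ is (12/10, 11/10, 13/10, 17/10, 34/10, 0/10, 0/10), that is, (1.2, 1.1, 1.3, 1.7, 3.4, 0, 0).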
【００３８】
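(3) Calculation of the feature vector space Vγ of the classification target document set 201a
Further, the feature vector space estimation unit 202 calculates how many times on average each of the words W1 to W7 appears per document in the classification target document set 201a, and creates the feature vector space Vγ of the classification target document set 201a. Here, the feature vector representing Vγ is (7/5, 6/5, 8/5, 10/5, 18/5, 10/5, 0/5), that is, (1.4, 1.2, 1.6, 2, 3.6, 2, 0).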
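Steps (2) and (3) apply the same operation to two different document sets, which the following Python sketch makes explicit. The function name `per_document_mean` is an assumption; the per-word totals are the numerators given in the description.

```python
def per_document_mean(word_totals, n_documents):
    """Average occurrences of each word per document in one document set."""
    return [total / n_documents for total in word_totals]

# Total occurrences of W1..W7 in the training set (10 documents).
v_beta = per_document_mean([12, 11, 13, 17, 34, 0, 0], 10)
# Total occurrences of W1..W7 in the classification target set (5 documents).
v_gamma = per_document_mean([7, 6, 8, 10, 18, 10, 0], 5)
print(v_beta)
print(v_gamma)
```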
【００３９】
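(4) Calculation of the feature vector space V of the unclassifiable category
Finally, the feature vector space estimation unit 202 takes the weighted average of Vα, Vβ and Vγ calculated above to obtain the feature vector space V of the unclassifiable category. For example, if the weights given to Vα, Vβ and Vγ are 0.15, 0.45 and 0.40 respectively, the feature vector representing V is ((1.95×0.15)+(1.2×0.45)+(1.4×0.40), (1.95×0.15)+(1.1×0.45)+(1.2×0.40), ...), that is, (1.39, 1.27, 1.52, 1.86, 3.26, 1.09, 0.29). This V is then registered in the classification database 202c as the feature vector space of the unclassifiable category.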
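The weighted average in step (4) can be checked with the sketch below, which uses the example vectors and the example weights 0.15, 0.45 and 0.40. The function name `weighted_average` is hypothetical.

```python
def weighted_average(vectors, weights):
    """Element-wise weighted sum of vectors; weights are assumed to sum to 1."""
    return [
        sum(w * x for w, x in zip(weights, components))
        for components in zip(*vectors)
    ]

v_alpha = [1.95] * 7
v_beta = [1.2, 1.1, 1.3, 1.7, 3.4, 0.0, 0.0]
v_gamma = [1.4, 1.2, 1.6, 2.0, 3.6, 2.0, 0.0]

v = weighted_average([v_alpha, v_beta, v_gamma], [0.15, 0.45, 0.40])
v_unclassifiable = [round(x, 2) for x in v]
print(v_unclassifiable)
```

The last two elements are non-zero only because of Vα and Vγ, which is what lets documents containing words unseen in training lean toward the unclassifiable category.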
【００４０】
このように、分類不能カテゴリの特徴ベクトル空間Ｖを「すべてのベクトルの長さが等しい特徴ベクトル空間Ｖα」「訓練用文書集合２０２ａの特徴ベクトル空間Ｖβ」および「分類対象文書集合２０１ａの特徴ベクトル空間Ｖγ」の加重平均によって推定することの意味は以下の通りである。
【００４１】
まず、すべてのベクトルの長さが等しい特徴ベクトル空間Ｖαは、未知のカテゴリの特徴ベクトル空間を仮定する役割を担っている。未知のカテゴリでは、各語彙がどのような分布を取るかが不明なので、すべての語彙の出現頻度が同じというベクトル空間をこのＶαによって仮定する。
【００４２】
もっとも、このＶαは自然な言語空間の表現とはなっていない。たとえば、英語の「ｔｈｅ」などはカテゴリに関係なくどの文書でも出現頻度が高いため、分類不能カテゴリにおいても頻度が高くなっているのが自然である。そこで、訓練用文書集合２０２ａの特徴ベクトル空間Ｖβを考慮することにより、出現頻度がカテゴリに依存しない語（図３の例ではＷ５）のベクトル長さを改善することができる。
【００４３】
分類対象文書集合２０１ａの特徴ベクトル空間Ｖγを考慮に入れるのも、基本的には上記と同様である。ただし、訓練用文書集合２０２ａはその規模に限界があるので（あらかじめ各文書の内容を検討して、人手によってカテゴリを付与しなければならないため）、その規模を補う意味で、Ｖβに加えてＶγを考慮する。また、訓練用文書を収集したときと語彙のパターンが変わった場合など、Ｖβが実際に分類をおこないたい文書集合に適したものとなっていない場合に、Ｖγを考慮することで、分類対象文書にあわせた最適化をおこなうことができる。たとえば、図３（ｄ）に示すようにカテゴリＡ〜Ｃの特徴ベクトル空間ＶＡ〜ＶＣでは、Ｗ６およびＷ７のベクトル長さはともに０であるが、分類不能カテゴリの特徴ベクトル空間Ｖではそれぞれ１．０９、０．２９であるので、Ｗ６やＷ７、すなわち訓練用文書集合２０２ａには現れていなかった新たな語や新たな分野の語などを含む文書は、そのぶん分類不能カテゴリに分類される可能性が高くなる。
【００４４】
なお、上記ではＶαおよびＶγの算出に、分類対象文書集合２０１ａのすべての文書をもちいるようにしたが、その数が多いときは処理の負荷が大きいので、分類対象文書集合２０１ａから無作為に抽出した一部の文書をもちいるようにしてもよい。
【００４５】
なお、この特徴ベクトル空間推定部２０２が、請求項にいう「第１の特徴ベクトル空間算出手段」「第２の特徴ベクトル空間算出手段」あるいは「特徴ベクトル空間算出手段」に相当し、そのおこなう処理の中に、請求項にいう「第１の特徴ベクトル空間算出工程」「第２の特徴ベクトル空間算出工程」あるいは「特徴ベクトル空間算出工程」が含まれる。
【００４６】
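Returning to FIG. 2, the remaining functional units are described. 203 is a classification category estimation unit, which compares the feature vector of each document in the classification target document set 201a with the feature vector space of each category calculated by the feature vector space estimation unit 202 (stored in the classification database 202c) and calculates their similarity in turn. Many conventional techniques exist for calculating similarity, such as TF-IDF, naive Bayes, the least squares method and the maximum entropy method, and any of them may be adopted here.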
【００４７】
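The category with the highest similarity is then estimated to be the classification destination of the document, and the document and its destination category are associated with each other to create a classification result list 203a as illustrated in FIG. 5. The classification category estimation unit 203 corresponds to the "classification category calculating means" in the claims, and the processing it performs corresponds to the "classification category calculating step" in the claims.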
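The description leaves the similarity measure open. As one possible illustration only, the sketch below uses cosine similarity, which is an assumption swapped in here and not a method named in the text. The two test documents are hypothetical; the category vectors are the example values for category A (padded with zeros for W6 and W7) and for the unclassifiable category.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def classify(doc_vector, database):
    """Return the name of the category whose vector is most similar."""
    return max(database, key=lambda name: cosine(doc_vector, database[name]))

database = {
    "A": [0.33, 3, 0.66, 0, 3, 0, 0],
    "unclassifiable": [1.39, 1.27, 1.52, 1.86, 3.26, 1.09, 0.29],
}
familiar_doc = [0, 2, 1, 0, 3, 0, 0]   # uses only training vocabulary
novel_doc = [0, 0, 0, 0, 1, 4, 0]      # dominated by the new word W6
print(classify(familiar_doc, database), classify(novel_doc, database))
```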
【００４８】
２０４は、カテゴリ体系管理部であり、あらかじめカテゴリ体系表２０４ａと、カテゴリ体系変更規則２０４ｂとを保持している。カテゴリ体系表２０４ａでは、文書が分類される各カテゴリの名称と、その上位・下位関係とが定義されている。カテゴリの名称と関係とが分かるのであれば、カテゴリ体系表２０４ａのデータ構造はどのようなものであってもよい。
【００４９】
また、カテゴリ体系変更規則２０４ｂとは、カテゴリ体系表２０４ａに新たなカテゴリを追加したり、すでに登録されているカテゴリを分割・削除あるいは併合したりするのが望ましいかどうかを判定するための複数の規則（判定基準）である。
【００５０】
たとえば、カテゴリ追加に関しては「（ａ−１）分類不能と判定される頻度が閾値を上回った（具体的には「全文書中、分類不能カテゴリに属すると判定された文書の割合が２０％を上回った」「分類不能カテゴリに属すると判定された文書が３０以上連続して出現した」など）」という規則があり、この規則に該当する場合に、新たなカテゴリの追加が望ましいと判定する。
【００５１】
カテゴリ追加の規則としては、上記のほか「（ａ−２）分類不能カテゴリに属する文書の数が閾値を上回った」、「（ａ−３）上記頻度および／または文書数をもちいた評価関数の結果が閾値を上回った（具体的には「分類不能カテゴリに属すると判定された文書の連続出現回数が、全文書数の１５％にあたる数値を上回った」など）」、などがある。以下ではこれらの規則のうち、最初に挙げた「全文書中、分類不能カテゴリに属すると判定された文書の割合が２０％を上回った」を、カテゴリ追加規則として採用する。
【００５２】
また、カテゴリ分割に関する規則としては、たとえば「（ｂ−１）あるカテゴリが分類先であると判定される頻度が閾値を上回った（具体的には「全文書中、あるカテゴリに属すると判定された文書の割合が２０％を上回った」「あるカテゴリに属すると判定された文書が３０以上連続して出現した」など）」、「（ｂ−２）あるカテゴリに属する文書の数が閾値を上回った」、「（ｂ−３）上記頻度および／または文書数をもちいた評価関数の結果が閾値を上回った（具体的には「あるカテゴリに属すると判定された文書の連続出現回数が、全文書数の１５％にあたる数値を上回った」など）」、などがある。以下ではこれらの規則のうち、最初に挙げた「全文書中、あるカテゴリに属すると判定された文書の割合が２０％を上回った」を、カテゴリ分割規則として採用する。
【００５３】
また、カテゴリ削除または併合に関する規則としては、「（ｃ−１）あるカテゴリが分類先であると判定される頻度が閾値を下回った（具体的には「全文書中、あるカテゴリに属すると判定された文書の割合が５％を下回った」「あるカテゴリ以外に分類される文書が５０以上連続して出現した」など）」がある。ここでは最初に挙げた「全文書中、あるカテゴリに属すると判定された文書の割合が５％を下回った」を、カテゴリ削除・併合規則として採用する。
【００５４】
カテゴリ体系管理部２０４は、分類カテゴリ推定部２０３による文書の分類が終了すると、作成された分類結果一覧表２０３ａを参照して、カテゴリの追加や分割、削除あるいは併合が望ましいかどうか、具体的には、カテゴリ体系変更規則２０４ｂの各規則に該当する事象が発生しているかどうかを判定する。
【００５５】
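For example, suppose the classification result list 203a is as shown in FIG. 5 and the category system change rules 204b consist only of the addition rule "the proportion of all documents determined to belong to the unclassifiable category exceeds 20%". FIG. 5 shows that the number of documents in the unclassifiable category (here 2) exceeds 20% of the total number of documents (here 5), so the unit determines that adding a new category is desirable. Whether categories should be split, deleted or merged is determined in the same way.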
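The three adopted rules (more than 20% unclassifiable, more than 20% in one category, less than 5% in one category) can be sketched as a single check over the classification result list. The function name, the tuple format of the result, and the individual document assignments below are hypothetical; only the counts (2 unclassifiable out of 5) come from the description.

```python
from collections import Counter

UNCLASSIFIABLE = "unclassifiable"

def recommend_changes(assignments, categories,
                      add_ratio=0.20, split_ratio=0.20, delete_ratio=0.05):
    """Return (action, category) pairs for every rule that fires."""
    total = len(assignments)
    if total == 0:
        return []
    counts = Counter(assignments.values())
    recommendations = []
    if counts[UNCLASSIFIABLE] / total > add_ratio:
        recommendations.append(("add", UNCLASSIFIABLE))
    for category in categories:
        share = counts[category] / total
        if share > split_ratio:
            recommendations.append(("split", category))
        elif share < delete_ratio:
            recommendations.append(("delete_or_merge", category))
    return recommendations

# Hypothetical result list: five documents, two of them unclassifiable.
results = {"D'1": "A", "D'2": UNCLASSIFIABLE, "D'3": "B",
           "D'4": UNCLASSIFIABLE, "D'5": "C"}
recommendations = recommend_changes(results, ["A", "B", "C"])
print(recommendations)
```

The comparisons are strict, matching the wording "exceeds" and "falls below", so a category holding exactly 20% of the documents does not trigger a split recommendation.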
【００５６】
なお、このカテゴリ体系管理部２０４が、請求項にいう「カテゴリ追加要否判定手段」「カテゴリ分割要否判定手段」あるいは「カテゴリ削除・併合要否判定手段」に相当し、またそのおこなう処理の中に、請求項にいう「カテゴリ追加要否判定工程」「カテゴリ分割要否判定工程」あるいは「カテゴリ削除・併合要否判定工程」が含まれる。
【００５７】
ユーザーインターフェース部２０５は、分類カテゴリ推定部２０３により作成された分類結果一覧表２０３ａ、およびカテゴリ体系管理部２０４のカテゴリ体系表２０４ａを読み込んで、カテゴリのツリー構造と各カテゴリに分類される文書のタイトルとをグラフィカルに画面表示する。また、カテゴリ体系管理部２０４によってカテゴリの追加・分割・削除あるいは併合が望ましいと判定された場合に、カテゴリの追加などを実行するよう操作者に推奨するダイアログ、またその実行に際して必要なパラメータ（追加するカテゴリの名称など）を入力させるダイアログを表示する。これらの表示例については後述する。
【００５８】
なお、このユーザーインターフェース部２０５が、請求項にいう「カテゴリ追加推奨手段」「カテゴリ分割推奨手段」あるいは「カテゴリ削除・併合推奨手段」に相当し、またそのおこなう処理の中に、請求項にいう「カテゴリ追加推奨工程」「カテゴリ分割推奨工程」あるいは「カテゴリ削除・併合推奨工程」が含まれる。
【００５９】
なお、文書記憶部２０１、特徴ベクトル空間推定部２０２、分類カテゴリ推定部２０３、カテゴリ体系管理部２０４およびユーザーインターフェース部２０５は、それぞれＲＯＭ１０２、ＲＡＭ１０３またはハードディスク１０５、フロッピーディスク１０７などの記録媒体に記録されたプログラムに記載された命令にしたがってＣＰＵ１０１などが命令処理を実行することにより、各部の機能を実現するものである。
【００６０】
つぎに、この発明の実施の形態による文書分類装置の文書分類処理の手順について説明する。図６は、この発明の実施の形態による文書分類装置の文書分類処理の手順を示すフローチャートである。分類対象文書集合２０１ａに新たな文書が追加されたり、あるいは操作者から文書の分類が指示されたりした場合に、本フローチャートによる処理を開始する。
【００６１】
ステップＳ６０１において、特徴ベクトル空間推定部２０２は、訓練用文書集合２０２ａの各文書についてその特徴ベクトルを作成する。そして、ステップＳ６０２で、文書−カテゴリ対応表２０２ｂの各カテゴリごとに文書の特徴ベクトルを平均して、各カテゴリの特徴ベクトル空間を算出する。
【００６２】
なお、これ以前にも本フローチャートによる処理をおこなったことがあれば（すなわち、２回目以降の分類処理の場合は）、以前の処理時にも訓練用文書集合２０２ａにもとづく各カテゴリの特徴ベクトル空間の算出がおこなわれ、その結果が分類用データベース２０２ｃに保存されているので、ステップＳ６０１およびＳ６０２の処理は省略することができる。
【００６３】
さらに、ステップＳ６０３において、分類不能カテゴリの特徴ベクトル空間Ｖを推定する。これは上述のように、すべてのベクトルの長さが等しい特徴ベクトル空間Ｖα（分類対象文書集合２０１ａおよび訓練用文書集合２０２ａの双方から算出）と、訓練用文書集合２０２ａの特徴ベクトル空間Ｖβと、分類対象文書集合２０１ａの特徴ベクトル空間Ｖγとの加重平均によって算出する。そして、ステップＳ６０４で、ステップＳ６０２およびステップＳ６０３で算出された各カテゴリの特徴ベクトル空間を、分類用データベース２０２ｃに保存する。
【００６４】
ステップＳ６０５で、分類カテゴリ推定部２０３は、分類対象文書集合２０１ａの各文書についてその特徴ベクトルを作成する。なお、ステップＳ６０３でＶαやＶγが分類対象文書の全部をもちいて算出された場合は、その際にすでにすべての文書の特徴ベクトルが作成されているので、このステップＳ６０５の処理は省略することができる。また、ステップＳ６０３でＶαやＶγが分類対象文書の一部をもちいて算出された場合には、そこに含まれなかった文書についてのみ特徴ベクトルを作成すればよい。
【００６５】
ステップＳ６０６で、分類カテゴリ推定部２０３は、ステップＳ６０５で作成された各文書の特徴ベクトルと、ステップＳ６０４で作成された分類用データベース２０２ｃの各カテゴリの特徴ベクトル空間とを順次比較して、もっとも類似度の高いカテゴリを当該文書の分類先と推定する。そして、ステップＳ６０７で、各文書とその分類先のカテゴリとを分類結果一覧表２０３ａに保存する。
【００６６】
ステップＳ６０８で、ユーザーインターフェース部２０５は、ステップＳ６０７で作成された分類結果一覧表２０３ａを読み込んで、図７に示すような分類結果一覧ウィンドウを表示する。そして、本フローチャートによる処理を終了する。
【００６７】
つぎに、この発明の実施の形態による文書分類装置のカテゴリ追加処理の手順について説明する。図８は、この発明の実施の形態による文書分類装置のカテゴリ追加処理の手順を示すフローチャートである。
【００６８】
図６に示す文書分類処理が終了した直後に、本フローチャートによる処理を開始する（あるいは、分類処理とは無関係にこの処理のみを定期的におこなうようにしてもよい）。なお、この開始時点で画面表示されている分類結果一覧ウィンドウは、図７に示すようなものであったとする。同図は分類不能カテゴリに大量の文書（たとえば全文書数の２０％超）が分類されている状態を示している。
【００６９】
ステップＳ８０１で、カテゴリ体系管理部２０４は、図６のステップＳ６０７で作成された分類結果一覧表２０３ａを参照して、カテゴリ体系変更規則２０４ｂのカテゴリ追加規則に該当する事象が発生しているかどうかを判定する。そして、カテゴリ追加規則に該当する事象が発生しているときは（ステップＳ８０１肯定）ステップＳ８０２に移行し、発生していないときは（ステップＳ８０１否定）そのまま本フローチャートによる処理を終了する。
【００７０】
ステップＳ８０２で、ユーザーインターフェース部２０５は、図９に示すような変更推奨ダイアログを表示する。ここでは新たなカテゴリの追加を推奨している。そして、ステップＳ８０３でいずれかのボタンがマウスクリックされるのを待ち、マウスクリックされたのが変更開始ボタン９０１であったときは（ステップＳ８０３肯定）ステップＳ８０４に移行する。また、マウスクリックされたのがキャンセルボタン９０２であったときは（ステップＳ８０３否定）、そのまま本フローチャートによる処理を終了する。
【００７１】
ステップＳ８０４で、ユーザーインターフェース部２０５は、ステップＳ８０２で表示された変更推奨ダイアログを消去するとともに、図１０に示すようなカテゴリ追加ダイアログを表示する。これは追加する新たなカテゴリの名称を操作者に入力させるためのダイアログである。そして、ステップＳ８０５でいずれかのボタンがマウスクリックされるのを待ち、マウスクリックされたのが「進む」ボタン１００１であったときは（ステップＳ８０５肯定）ステップＳ８０７に移行し、「進む」ボタン１００１でなかったときは（ステップＳ８０５否定）、ステップＳ８０６に移行して、クリックされたその他のボタンに応じた処理をおこなう。
【００７２】
ステップＳ８０７で、ステップＳ８０４で表示されたカテゴリ追加ダイアログを消去するとともに、図１１に示すような文書割付ダイアログを表示する。これは、追加する新たなカテゴリ（ここでは情報家電カテゴリ）に分類する文書を操作者に指定させるためのダイアログであり、分類不能カテゴリに分類されているすべての文書が一覧表示される。そして、ステップＳ８０８でいずれかのボタンがマウスクリックされるのを待ち、マウスクリックされたのが「完了」ボタン１１０１であったときは（ステップＳ８０８肯定）ステップＳ８１０に移行し、「完了」ボタン１１０１でなかったときは（ステップＳ８０８否定）、ステップＳ８０９に移行して、クリックされたその他のボタンに応じた処理をおこなう。
【００７３】
ステップＳ８１０で、ユーザーインターフェース部２０５は、ステップＳ８０５で「進む」ボタン１００１がクリックされた時点でカテゴリ追加ダイアログに入力されていたカテゴリ名称（ここでは「情報家電」）を、カテゴリ体系管理部２０４に対して通知する。これを受けたカテゴリ体系管理部２０４は、カテゴリ体系表２０４ａに新たなカテゴリ「情報家電」を追記する。
【００７４】
また、ステップＳ８１１で、ユーザーインターフェース部２０５は、ステップＳ８０８で文書割付ダイアログの「完了」ボタン１１０１がクリックされた時点で選択されていた文書（ここでは分類不能カテゴリのすべての文書が選択されていたものとする）を、特徴ベクトル空間推定部２０２に対して通知する。これを受けた特徴ベクトル空間推定部２０２は、通知された文書について図６のステップＳ６０５で作成された特徴ベクトルを取得して、それらの平均を取ることで、追加するカテゴリの特徴ベクトル空間を算出する。そして、これをカテゴリ名称と対応づけて、分類用データベース２０２ｃに登録する。
【００７５】
さらに、ステップＳ８１２で、ステップＳ８０７で表示された文書割付ダイアログを消去した後、図６のステップＳ６０６に移行して、更新された分類用データベース２０２ｃをもとに分類対象文書集合２０１ａの分類を再度実行する。図１２は、新たなカテゴリ「情報家電」が追加された直後の分類結果一覧ウィンドウの一例を示す説明図である。図７で分類不能カテゴリに分類されていた文書が、図１２では情報家電カテゴリに分類されている。
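Steps S810 and S811 amount to appending a name to the category table and registering the mean vector of the documents the operator selected. A minimal sketch follows, with hypothetical function names, category vectors and data structures (a list for the category system table 204a and a dict for the classification database 202c).

```python
def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def add_category(taxonomy, database, name, selected_vectors):
    taxonomy.append(name)                        # S810: register the name
    database[name] = centroid(selected_vectors)  # S811: register its vector

taxonomy = ["mobile terminals"]
database = {"mobile terminals": [2.0, 0.0, 1.0]}
# Hypothetical feature vectors of the documents chosen in the assignment dialog.
add_category(taxonomy, database, "information appliances",
             [[0, 4, 2], [0, 2, 4]])
print(taxonomy, database["information appliances"])
```

After this update the classification of step S606 would be run again against the enlarged database.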
【００７６】
つぎに、この発明の実施の形態による文書分類装置のカテゴリ分割処理の手順について説明する。図１３は、この発明の実施の形態による文書分類装置のカテゴリ分割処理の手順を示すフローチャートである。図８に示すカテゴリ追加処理が終了した直後に、本フローチャートによる処理を開始する。なお、この開始時点で画面表示されている分類結果一覧ウィンドウは、図１４に示すようなものであったとする。同図は携帯端末カテゴリに大量の文書（たとえば全文書数の２０％超）が分類されている状態を示している。
【００７７】
ステップＳ１３０１で、カテゴリ体系管理部２０４は、図８のステップＳ８１１で更新された分類結果一覧表２０３ａを参照して、カテゴリ体系変更規則２０４ｂのカテゴリ分割規則に該当する事象が発生しているかどうかを判定する。そして、カテゴリ分割規則に該当する事象が発生しているときは（ステップＳ１３０１肯定）ステップＳ１３０２に移行し、発生していないときは（ステップＳ１３０１否定）そのまま本フローチャートによる処理を終了する。
【００７８】
ステップＳ１３０２で、ユーザーインターフェース部２０５は、図１５に示すような変更推奨ダイアログを表示する。ここでは携帯端末カテゴリの分割を推奨している。そして、ステップＳ１３０３でいずれかのボタンがマウスクリックされるのを待ち、マウスクリックされたのが変更開始ボタン１５０１であったときは（ステップＳ１３０３肯定）ステップＳ１３０４に移行する。また、マウスクリックされたのがキャンセルボタン１５０２であったときは（ステップＳ１３０３否定）、そのまま本フローチャートによる処理を終了する。
【００７９】
ステップＳ１３０４で、ユーザーインターフェース部２０５は、ステップＳ１３０２で表示された変更推奨ダイアログを消去するとともに、図１６に示すようなカテゴリ分割ダイアログを表示する。これは分割後の各カテゴリの名称を操作者に入力させるためのダイアログである。そして、ステップＳ１３０５でいずれかのボタンがマウスクリックされるのを待ち、マウスクリックされたのが「進む」ボタン１６０１であったときは（ステップＳ１３０５肯定）ステップＳ１３０７に移行し、「進む」ボタン１６０１でなかったときは（ステップＳ１３０５否定）、ステップＳ１３０６に移行して、クリックされたその他のボタンに応じた処理をおこなう。
【００８０】
ステップＳ１３０７で、ステップＳ１３０４で表示されたカテゴリ分割ダイアログを消去するとともに、図１７に示すような文書割付ダイアログを表示する。これは、分割後のカテゴリの一方（ここでは携帯電話カテゴリ）に分類する文書を操作者に指定させるためのダイアログであり、分割前のカテゴリ（ここでは携帯端末カテゴリ）に分類されているすべての文書が一覧表示される。
【００８１】
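Then, in step S1308, the apparatus waits for one of the buttons to be clicked with the mouse. If the "Next" button 1701 was clicked (step S1308: Yes), the process proceeds to step S1310; if it was not the "Next" button 1701 (step S1308: No), the process proceeds to step S1309 and performs the processing corresponding to the button that was clicked.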
【００８２】
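In step S1310, the document assignment dialog displayed in step S1307 is closed, and a similar document assignment dialog is displayed for the other post-split category (here, the mobile PC category). This dialog, however, lists the documents of the pre-split category excluding those assigned to the other post-split category in step S1308. Then, in step S1311, the apparatus waits for one of the buttons to be clicked with the mouse. If the "Done" button 1702 was clicked (step S1311: Yes), the process proceeds to step S1313; if it was not the "Done" button 1702 (step S1311: No), the process proceeds to step S1312 and performs the processing corresponding to the button that was clicked.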
【００８３】
ステップＳ１３１３で、ユーザーインターフェース部２０５は、分割前のカテゴリ名称（ここでは「携帯端末」）および分割後のカテゴリ名称（ステップＳ１３０５で「進む」ボタン１６０１がクリックされた時点でカテゴリ分割ダイアログに入力されていたカテゴリ名称。ここでは「携帯電話」「携帯ＰＣ」）を、カテゴリ体系管理部２０４に対して通知する。これを受けたカテゴリ体系管理部２０４は、カテゴリ体系表２０４ａからカテゴリ「携帯端末」を削除するとともに、新たなカテゴリ「携帯電話」および「携帯ＰＣ」を追記する。
【００８４】
また、ステップＳ１３１４で、ユーザーインターフェース部２０５は、分割前後のカテゴリ名称およびステップＳ１３０８およびステップＳ１３１１で分割後の各カテゴリに割り付けられた文書を、特徴ベクトル空間推定部２０２に対して通知する。これを受けた特徴ベクトル空間推定部２０２は、分割前のカテゴリの特徴ベクトル空間を分類用データベース２０２ｃから削除するとともに、分割後の各カテゴリの特徴ベクトル空間を算出して、カテゴリ名称と対応づけて上記データベースに登録する。
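The database update in steps S1313 and S1314 can be sketched as removing the pre-split entry and registering one mean vector per post-split category. Function names, data structures and the example vectors are assumptions made for illustration.

```python
def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def split_category(taxonomy, database, old_name, assigned):
    """assigned maps each new category name to its documents' vectors."""
    taxonomy.remove(old_name)        # S1313: drop the pre-split category
    database.pop(old_name, None)     # S1314: drop its feature vector
    for new_name, vectors in assigned.items():
        taxonomy.append(new_name)
        database[new_name] = centroid(vectors)

taxonomy = ["mobile terminals"]
database = {"mobile terminals": [2.0, 2.0]}
split_category(taxonomy, database, "mobile terminals", {
    "mobile phones": [[4, 0], [2, 0]],
    "mobile PCs": [[0, 3], [0, 1]],
})
print(taxonomy, database)
```

Because `assigned` is a mapping, the same sketch covers a split into more than two categories.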
【００８５】
さらに、ステップＳ１３１５で、ステップＳ１３１０で表示された文書割付ダイアログを消去した後、図６のステップＳ６０６に移行して、更新された分類用データベース２０２ｃをもとに分類対象文書集合２０１ａの分類を再度実行する。図１８は、携帯端末カテゴリが携帯電話カテゴリおよび携帯ＰＣカテゴリに分割された直後の分類結果一覧ウィンドウの一例を示す説明図である。
【００８６】
図１４で携帯端末カテゴリに分類されていた文書の一部が、図１８では携帯ＰＣカテゴリに分類されている。また、カテゴリ分割後に再度分類対象文書集合２０１ａの全文書について分類をおこなった結果、分割前の携帯端末カテゴリには含まれていなかった文書（たとえば「携帯ＰＣの音声入力規格をＩ社が提案。Ｊ社なども賛同」）が、分割後の携帯ＰＣカテゴリに含まれるようになっている。なお、図示は省略するが、携帯電話カテゴリも上記と同様である。なお、上記ではカテゴリを２つに分割したが、この個数はいくつであってもよい。
【００８７】
つぎに、この発明の実施の形態による文書分類装置のカテゴリ削除・併合処理の手順について説明する。図１９は、この発明の実施の形態による文書分類装置のカテゴリ削除・併合処理の手順を示すフローチャートである。図１３に示すカテゴリ分割処理が終了した直後に、本フローチャートによる処理を開始する。なお、この開始時点で画面表示されている分類結果一覧ウィンドウは、図２０に示すようなものであったとする。同図はｉＮｅｔ専用端末カテゴリに分類された文書が少ない（たとえば全文書数の５％未満）状態を示している。
【００８８】
ステップＳ１９０１で、カテゴリ体系管理部２０４は、図１３のステップＳ１３１４で更新された分類結果一覧表２０３ａを参照して、カテゴリ体系変更規則２０４ｂのカテゴリ削除・併合規則に該当する事象が発生しているかどうかを判定する。そして、カテゴリ削除・併合規則に該当する事象が発生しているときは（ステップＳ１９０１肯定）ステップＳ１９０２に移行し、発生していないときは（ステップＳ１９０１否定）そのまま本フローチャートによる処理を終了する。
【００８９】
ステップＳ１９０２で、ユーザーインターフェース部２０５は、図２１に示すような変更推奨ダイアログを表示する。ここではｉＮｅｔ専用端末カテゴリの削除または併合を推奨している。そして、ステップＳ１９０３でいずれかのボタンがマウスクリックされるのを待ち、マウスクリックされたのが変更開始ボタン２１０１であったときは（ステップＳ１９０３肯定）ステップＳ１９０４に移行する。また、マウスクリックされたのがキャンセルボタン２１０２であったときは（ステップＳ１９０３否定）、そのまま本フローチャートによる処理を終了する。
【００９０】
ステップＳ１９０４で、ユーザーインターフェース部２０５は、ステップＳ１９０２で表示された変更推奨ダイアログを消去するとともに、図２２に示すようなカテゴリ削除・併合ダイアログを表示する。これは文書数の少ないカテゴリを削除するのか、あるいは併合するのかを操作者に指定させるためのダイアログである。なお、削除が指定されている間は「進む」ボタン２２０１を、併合が指定されている間は「完了」ボタン２２０２を、それぞれグレーアウトして押下できないようにする。
【００９１】
そして、ステップＳ１９０５でいずれかのボタンが押下されるのを待ち、押下されたのが「完了」ボタン２２０２であったときは（ステップＳ１９０５肯定）、カテゴリの削除が指定されたので、削除するカテゴリの名称（ここでは「ｉＮｅｔ専用端末」）をステップＳ１９０６でカテゴリ体系管理部２０４に通知して、カテゴリ体系表２０４ａから削除させる。また、ステップＳ１９０７で特徴ベクトル空間推定部２０２にも通知して、文書−カテゴリ対応表２０２ｂおよび分類用データベース２０２ｃから削除させる。そして、ステップＳ１９０８で、ステップＳ１９０４で表示されたカテゴリ削除・併合ダイアログを消去した後、図６のステップＳ６０６に移行する。
【００９２】
また、ステップＳ１９０５で押下されたのが「完了」ボタン２２０２でなかったときは（ステップＳ１９０５否定）、ステップＳ１９０９に移行して、それが「進む」ボタン２２０１であったかどうかを判定する。そして、「進む」ボタン２２０１でもなかったときは（ステップＳ１９０９否定）、ステップＳ１９１０で押下されたその他のボタンに応じた処理をおこなうが、「進む」ボタン２２０１であったときは（ステップＳ１９０９肯定）、ステップＳ１９１１に移行する。
【００９３】
この場合はカテゴリの併合が指定されたので、ステップＳ１９１１で、ステップＳ１９０４で表示されたカテゴリ削除・併合ダイアログを消去するとともに、図２３に示すような併合先選択ダイアログを表示する。これは、併合により削除されるカテゴリの文書を他のどのカテゴリに分類するかを操作者に指定させるためのダイアログであり、併合されるカテゴリ以外のすべてのカテゴリが一覧表示される。
【００９４】
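Then, in step S1912, the apparatus waits for one of the buttons to be pressed. If the button pressed was not the "Next" button 2301 (step S1912: No), the processing corresponding to the pressed button is performed in step S1913; if it was the "Next" button 2301 (step S1912: Yes), the process proceeds to step S1914.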
【００９５】
ステップＳ１９１４で、ステップＳ１９１１で表示した併合先選択ダイアログを消去するとともに、図２４に示すような文書割付ダイアログを表示する。これは、併合により削除されるカテゴリの文書のうちいずれの文書を、併合先のカテゴリ（ステップＳ１９１２で「進む」ボタン２３０１が押下された時点で選択されていたカテゴリ。ここでは情報家電カテゴリであったものとする）に分類するかを操作者に指定させるためのダイアログであり、併合されるカテゴリのすべての文書の一覧が表示される。
【００９６】
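Then, in step S1915, the apparatus waits for one of the buttons to be pressed. If the button pressed was not the "Done" button 2401 (step S1915: No), the processing corresponding to the pressed button is performed in step S1916; if it was the "Done" button 2401 (step S1915: Yes), the process proceeds to step S1917.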
【００９７】
ステップＳ１９１７で、ユーザーインターフェース部２０５は、併合されるカテゴリの名称（ここでは「ｉＮｅｔ専用端末」）をカテゴリ体系管理部２０４に通知して、カテゴリ体系表２０４ａから削除させる。また、ステップＳ１９１８で、併合元および併合先のカテゴリ名称、および併合先に割り付けられる文書（ステップＳ１９１５で「完了」ボタン２４０１が押下された時点で選択されていた文書）を、特徴ベクトル空間推定部２０２に対して通知する。これを受けた特徴ベクトル空間推定部２０２は、文書−カテゴリ対応表２０２ｂの併合元カテゴリを併合先カテゴリに書き換え、併合先のカテゴリの特徴ベクトル空間を再計算して分類用データベース２０２ｃに登録するとともに、併合元のカテゴリの特徴ベクトル空間を上記データベースから削除する。
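The merge update in steps S1917 and S1918 can be sketched as relabelling documents in the document-category table, recomputing the destination vector, and deleting the source vector. The sketch assumes, for simplicity, that every document of the merged category is moved to the destination; the description lets the operator choose which ones. All names and vectors are hypothetical.

```python
def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def merge_category(doc_to_category, doc_vectors, database, source, target):
    # Rewrite the source category to the target in the document-category table.
    for doc, category in doc_to_category.items():
        if category == source:
            doc_to_category[doc] = target
    # Recompute the target vector from all of its documents, old and new.
    members = [doc_vectors[d] for d, c in doc_to_category.items() if c == target]
    database[target] = centroid(members)
    # Remove the source category's vector from the database.
    database.pop(source, None)

doc_to_category = {"d1": "iNet terminals",
                   "d2": "information appliances",
                   "d3": "information appliances"}
doc_vectors = {"d1": [3, 0], "d2": [1, 2], "d3": [2, 4]}
database = {"iNet terminals": [3.0, 0.0], "information appliances": [1.5, 3.0]}
merge_category(doc_to_category, doc_vectors, database,
               "iNet terminals", "information appliances")
print(database)
```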
【００９８】
そして、ステップＳ１９１９で、ステップＳ１９１４で表示された文書割付ダイアログを消去した後、図６のステップＳ６０６に移行して、更新された分類用データベース２０２ｃをもとに分類対象文書集合２０１ａの分類を再度実行する。図２５は、ｉＮｅｔ専用端末カテゴリが削除、あるいは情報家電カテゴリに併合された直後の分類結果一覧ウィンドウの一例を示す説明図である。
【００９９】
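As described above, according to this embodiment, every time document classification is executed in response to the addition of a document or a classification instruction, the apparatus automatically checks whether the documents in a particular category (including the unclassifiable category) have become too numerous or, conversely, too few, and recommends adding, splitting, deleting or merging a category based on the result. By simply entering the required items in accordance with this recommendation, the operator can always maintain a classification system suited to the characteristics of the classification target documents (easier maintenance of the classification system). Consequently, appropriate classification results can be obtained at all times even when the classification target documents change qualitatively, for example when the trend in their subject areas shifts.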
【０１００】
また、分類不能カテゴリの特徴ベクトル空間Ｖを、すべてのベクトルの長さが等しい特徴ベクトル空間Ｖα、訓練用文書集合２０２ａの特徴ベクトル空間Ｖβおよび分類対象文書集合２０１ａの特徴ベクトル空間Ｖγの加重平均によって推定するので、既存のカテゴリとの類似度の低い文書が出現した場合にも、まったく新しい分野のカテゴリを生成してその中に分類することが可能である。
【０１０１】
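The document classification method described in the above embodiment (including the category addition method, the category splitting method and the category deletion/merging method) is realized by executing a program prepared in advance on a computer such as a personal computer or a workstation. This program is recorded on a computer-readable recording medium such as a hard disk, floppy disk, CD-ROM, MO or DVD, and is executed by being read from the recording medium by the computer. The program can also be distributed through such a recording medium or over a network such as the Internet.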
【０１０２】
【発明の効果】
以上説明したようにこの発明にかかる文書分類装置は、電子化された文書をあらかじめ設定されたカテゴリのうちいずれか一つに分類する文書分類装置において、第１の特徴ベクトル空間算出手段が、訓練用文書集合を構成する各文書の特徴ベクトルから前記各カテゴリの特徴ベクトル空間を算出し、第２の特徴ベクトル空間算出手段が、すべてのベクトルの長さが等しい特徴ベクトル空間、訓練用文書集合の特徴ベクトル空間および分類対象文書集合の特徴ベクトル空間の重み付き平均から分類不能カテゴリの特徴ベクトル空間を算出し、分類カテゴリ算出手段が、前記第１の特徴ベクトル空間算出手段により算出された各カテゴリの特徴ベクトル空間および前記第２の特徴ベクトル空間算出手段により算出された分類不能カテゴリの特徴ベクトル空間と、分類対象文書集合を構成する各文書の特徴ベクトルとを比較することにより、前記各カテゴリまたは前記分類不能カテゴリのうちいずれか一つを前記各文書の分類先のカテゴリと算出し、カテゴリ追加要否判定手段が、前記分類カテゴリ算出手段により分類不能カテゴリが分類先のカテゴリと算出される頻度が一定の閾値を上回ったかどうかを判定し、カテゴリ追加推奨手段が、前記カテゴリ追加要否判定手段により分類不能カテゴリが分類先のカテゴリと算出される頻度が一定の閾値を上回ったと判定された場合に、新たなカテゴリの追加を操作者に対して推奨するので、分類不能カテゴリに分類される文書が多くなると、操作者に対して新たなカテゴリの追加が推奨され、これによって、分類対象文書群の特性に見合った分類体系を常に維持することができ、したがってその変化にともなう分類精度の低下を防止することが可能な文書分類装置が得られるという効果を奏する。
【０１０３】
また、この発明にかかる文書分類装置は、電子化された文書をあらかじめ設定されたカテゴリのうちいずれか一つに分類する文書分類装置において、特徴ベクトル空間算出手段が、訓練用文書集合を構成する各文書の特徴ベクトルから前記各カテゴリの特徴ベクトル空間を算出し、分類カテゴリ算出手段が、前記特徴ベクトル空間算出手段により算出された各カテゴリの特徴ベクトル空間と分類対象文書集合を構成する各文書の特徴ベクトルとを比較することにより、前記各カテゴリのうちいずれか一つを前記各文書の分類先のカテゴリと算出し、カテゴリ削除・併合要否判定手段が、前記分類カテゴリ算出手段により分類先のカテゴリと算出される頻度が一定の閾値を下回ったカテゴリがあるかどうかを判定し、カテゴリ削除・併合推奨手段が、前記カテゴリ削除・併合要否判定手段により分類先のカテゴリと算出される頻度が一定の閾値を下回ったカテゴリがあると判定された場合に、当該カテゴリの削除または併合を操作者に対して推奨するので、あるカテゴリに分類される文書が少なくなると、操作者に対して当該カテゴリの削除または併合が推奨され、これによって、分類対象文書群の特性に見合った分類体系を常に維持することができ、したがってその変化にともなう分類精度の低下を防止することが可能な文書分類装置が得られるという効果を奏する。
【０１０４】
また、この発明にかかる文書分類装置は、上記発明において、さらに、前記分類カテゴリ算出手段により分類先のカテゴリと算出される頻度が一定の閾値を上回ったカテゴリがあるかどうかを判定するカテゴリ分割要否判定手段と、前記カテゴリ分割要否判定手段により分類先のカテゴリと算出される頻度が一定の閾値を上回ったカテゴリがあると判定された場合に、当該カテゴリの分割を操作者に対して推奨するカテゴリ分割推奨手段と、を備えたので、あるカテゴリに分類される文書が多くなると、操作者に対して当該カテゴリの分割が推奨され、これによって、分類対象文書群の特性に見合った分類体系を常に維持することができ、したがってその変化にともなう分類精度の低下を防止することが可能な文書分類装置が得られるという効果を奏する。
【０１０５】
また、この発明にかかる文書分類方法は、電子化された文書をあらかじめ設定されたカテゴリのうちいずれか一つに分類する文書分類方法において、第１の特徴ベクトル空間算出工程で、訓練用文書集合を構成する各文書の特徴ベクトルから前記各カテゴリの特徴ベクトル空間を算出し、第２の特徴ベクトル空間算出工程で、すべてのベクトルの長さが等しい特徴ベクトル空間、訓練用文書集合の特徴ベクトル空間および分類対象文書集合の特徴ベクトル空間の重み付き平均から分類不能カテゴリの特徴ベクトル空間を算出し、分類カテゴリ算出工程で、前記第１の特徴ベクトル空間算出工程で算出された各カテゴリの特徴ベクトル空間および前記第２の特徴ベクトル空間算出工程で算出された分類不能カテゴリの特徴ベクトル空間と、分類対象文書集合を構成する各文書の特徴ベクトルとを比較することにより、前記各カテゴリまたは前記分類不能カテゴリのうちいずれか一つを前記各文書の分類先のカテゴリと算出し、カテゴリ追加要否判定工程で、前記分類カテゴリ算出工程で分類不能カテゴリが分類先のカテゴリと算出される頻度が一定の閾値を上回ったかどうかを判定し、カテゴリ追加推奨工程で、前記カテゴリ追加要否判定工程で分類不能カテゴリが分類先のカテゴリと算出される頻度が一定の閾値を上回ったと判定された場合に、新たなカテゴリの追加を操作者に対して推奨するので、分類不能カテゴリに分類される文書が多くなると、操作者に対して新たなカテゴリの追加が推奨され、これによって、分類対象文書群の特性に見合った分類体系を常に維持することができ、したがってその変化にともなう分類精度の低下を防止することが可能な文書分類方法が得られるという効果を奏する。
【０１０６】
また、この発明にかかる文書分類方法は、電子化された文書をあらかじめ設定されたカテゴリのうちいずれか一つに分類する文書分類方法において、特徴ベクトル空間算出工程で、訓練用文書集合を構成する各文書の特徴ベクトルから前記各カテゴリの特徴ベクトル空間を算出し、分類カテゴリ算出工程で、前記特徴ベクトル空間算出工程で算出された各カテゴリの特徴ベクトル空間と分類対象文書集合を構成する各文書の特徴ベクトルとを比較することにより、前記各カテゴリのうちいずれか一つを前記各文書の分類先のカテゴリと算出し、カテゴリ削除・併合要否判定工程で、前記分類カテゴリ算出工程で分類先のカテゴリと算出される頻度が一定の閾値を下回ったカテゴリがあるかどうかを判定し、カテゴリ削除・併合推奨工程で、前記カテゴリ削除・併合要否判定工程で分類先のカテゴリと算出される頻度が一定の閾値を下回ったカテゴリがあると判定された場合に、当該カテゴリの削除または併合を操作者に対して推奨するので、あるカテゴリに分類される文書が少なくなると、操作者に対して当該カテゴリの削除または併合が推奨され、これによって、分類対象文書群の特性に見合った分類体系を常に維持することができ、したがってその変化にともなう分類精度の低下を防止することが可能な文書分類方法が得られるという効果を奏する。
【０１０７】
また、この発明にかかる文書分類方法は、上記発明において、さらに、前記分類カテゴリ算出工程で分類先のカテゴリと算出される頻度が一定の閾値を上回ったカテゴリがあるかどうかを判定するカテゴリ分割要否判定工程と、前記カテゴリ分割要否判定工程で分類先のカテゴリと算出される頻度が一定の閾値を上回ったカテゴリがあると判定された場合に、当該カテゴリの分割を操作者に対して推奨するカテゴリ分割推奨工程と、を含んだので、あるカテゴリに分類される文書が多くなると、操作者に対して当該カテゴリの分割が推奨され、これによって、分類対象文書群の特性に見合った分類体系を常に維持することができ、したがってその変化にともなう分類精度の低下を防止することが可能な文書分類方法が得られるという効果を奏する。
【０１０８】
また、この発明にかかる記録媒体は、上記方法をコンピュータに実行させるプログラムを記録したことで、当該プログラムをコンピュータで読み取ることが可能となり、これによって、上記方法をコンピュータによって実施することが可能な記録媒体が得られるという効果を奏する。
【図面の簡単な説明】
【図１】この発明の実施の形態による文書分類装置のハードウエア構成を示す説明図である。
【図２】この発明の実施の形態による文書分類装置の構成を機能的に示す説明図である。
【図３】この発明の実施の形態による文書分類装置の、特徴ベクトル空間推定部２０２による分類用データベース２０２ｃの作成方法を具体的に説明するための説明図である。
【図４】この発明の実施の形態による文書分類装置の、分類対象文書集合２０１ａおよびその各文書について作成される特徴ベクトルの一例を示す説明図である。
【図５】この発明の実施の形態による文書分類装置の、分類カテゴリ推定部２０３により作成される分類結果一覧表２０３ａの一例を示す説明図である。
【図６】この発明の実施の形態による文書分類装置の、文書分類処理の手順を示すフローチャートである。
【図７】この発明の実施の形態による文書分類装置の、ユーザーインターフェース部２０５により作成される分類結果一覧ウィンドウの一例を示す説明図である。
【図８】この発明の実施の形態による文書分類装置の、カテゴリ追加処理の手順を示すフローチャートである。
【図９】この発明の実施の形態による文書分類装置の、ユーザーインターフェース部２０５により表示される変更推奨ダイアログの一例を示す説明図である。
【図１０】この発明の実施の形態による文書分類装置の、ユーザーインターフェース部２０５により表示されるカテゴリ追加ダイアログの一例を示す説明図である。
【図１１】この発明の実施の形態による文書分類装置の、ユーザーインターフェース部２０５により表示される文書割付ダイアログの一例を示す説明図である。
【図１２】この発明の実施の形態による文書分類装置の、ユーザーインターフェース部２０５により表示される分類結果一覧ウィンドウの他の一例を示す説明図である。
【図１３】この発明の実施の形態による文書分類装置の、カテゴリ分割処理の手順を示すフローチャートである。
【図１４】この発明の実施の形態による文書分類装置の、ユーザーインターフェース部２０５により表示される分類結果一覧ウィンドウの他の一例を示す説明図である。
【図１５】この発明の実施の形態による文書分類装置の、ユーザーインターフェース部２０５により表示される変更推奨ダイアログの他の一例を示す説明図である。
【図１６】この発明の実施の形態による文書分類装置の、ユーザーインターフェース部２０５により表示されるカテゴリ分割ダイアログの一例を示す説明図である。
【図１７】この発明の実施の形態による文書分類装置の、ユーザーインターフェース部２０５により表示される文書割付ダイアログの他の一例を示す説明図である。
【図１８】この発明の実施の形態による文書分類装置の、ユーザーインターフェース部２０５により表示される分類結果一覧ウィンドウの他の一例を示す説明図である。
【図１９】この発明の実施の形態による文書分類装置の、カテゴリ削除・併合処理の手順を示すフローチャートである。
【図２０】この発明の実施の形態による文書分類装置の、ユーザーインターフェース部２０５により表示される分類結果一覧ウィンドウの他の一例を示す説明図である。
【図２１】この発明の実施の形態による文書分類装置の、ユーザーインターフェース部２０５により表示される変更推奨ダイアログの他の一例を示す説明図である。
【図２２】この発明の実施の形態による文書分類装置の、ユーザーインターフェース部２０５により表示されるカテゴリ削除・併合ダイアログの一例を示す説明図である。
【図２３】この発明の実施の形態による文書分類装置の、ユーザーインターフェース部２０５により表示される併合先選択ダイアログの一例を示す説明図である。
【図２４】この発明の実施の形態による文書分類装置の、ユーザーインターフェース部２０５により表示される文書割付ダイアログの他の一例を示す説明図である。
【図２５】この発明の実施の形態による文書分類装置の、ユーザーインターフェース部２０５により表示される分類結果一覧ウィンドウの他の一例を示す説明図である。
【符号の説明】
１００バス
１０１ＣＰＵ
１０２ＲＯＭ
１０３ＲＡＭ
１０４ＨＤＤ
１０５ＨＤ
１０６ＦＤＤ
１０７ＦＤ
１０８ディスプレイ
１０９Ｉ／Ｆ
１１０通信回線
１１１キーボード
１１２マウス
１１３スキャナ
１１４プリンタ
１１５ＣＤ−ＲＯＭ
１１６ＣＤ−ＲＯＭドライブ
２０１文書記憶部
２０１ａ分類対象文書集合
２０２特徴ベクトル空間推定部
２０２ａ訓練用文書集合
２０２ｂ文書−カテゴリ対応表
２０２ｃ分類用データベース
２０３分類カテゴリ推定部
２０３ａ分類結果一覧表
２０４カテゴリ体系管理部
２０４ａカテゴリ体系表
２０４ｂカテゴリ体系変更規則
２０５ユーザーインターフェース部
[0001]
[Technical field to which the invention belongs]
The present invention relates to a document classification apparatus that classifies an electronic document into one of a set of preset categories, a document classification method, and a computer-readable recording medium storing a program that causes a computer to execute the method.
[0002]
[Prior art]
Document classification apparatuses that use a group of documents called a corpus to automatically classify a given classification target document (or group of documents) into one of a set of preset categories are conventionally known. In this type of apparatus, the destination category of each of a large number of documents with widely varied content (the corpus) is determined manually in advance, and the features of each category are then derived from the features of the documents classified into it. When a classification target document is given, its features are compared in turn with the features of each category, and the document is classified into the category with the highest similarity.
[0003]
However, depending on the characteristics of the documents to be classified, the preset classification system may be inappropriate, so document classification apparatuses that automatically change the classification system to follow the tendency of the classification target documents have also been proposed. Examples of such prior art include JP-A-07-049875 and JP-A-2000-011004.
[0004]
The document collection server system of JP-A-07-049875 automatically connects to a plurality of information sources to acquire new documents and, by calculating a degree of match, checks how well each document matches search conditions written in advance by the user. It then constructs a classification system from the relationships between the search conditions, and classifies the matching documents and stores them in folders. In addition, it monitors how information accumulates in each folder and automatically subdivides folders, integrates them, and changes their structure.
[0005]
The automatic information classification apparatus of JP-A-2000-011004, when a document has been mistakenly classified into a certain category, provides a temporary category attached to that category and classifies the document there. Subsequent documents are classified into whichever of the regular category and the temporary category has the higher similarity. That is, when a document falling outside the scope of an existing category appears, a temporary category adjacent and subordinate to the regular category is automatically generated. This temporary category can also be promoted to a regular category.
[0006]
[Problems to be solved by the invention]
However, of the above prior art, JP-A-07-049875 can only divide (subdivide) existing categories, so when documents from a completely new field appear, it cannot generate a category suited to them. Nor can it delete unnecessary categories into which few or no documents are classified. JP-A-2000-011004 likewise cannot generate a category with low similarity to the existing categories, and it cannot divide or delete categories.
[0007]
In other words, the above prior art is vulnerable to qualitative changes in the group of classification target documents. When new documents that cannot be classified into the existing categories appear, the classification system no longer matches the characteristics of the classification target documents, so appropriate classification cannot be performed and classification accuracy falls.
[0008]
In order to eliminate the above-mentioned problems of the prior art, an object of the present invention is to provide a document classification device, a document classification method, and a computer-readable recording medium recording a program for causing a computer to execute the method, which can always maintain a classification system suited to the characteristics of the classification target document group and can therefore prevent the decrease in classification accuracy that accompanies changes in those characteristics.
[0009]
[Means for Solving the Problems]
In order to solve the above-described problems and achieve the object, a document classification device according to the present invention is a document classification device that classifies an electronic document into any one of preset categories, and is characterized by comprising: first feature vector space calculating means for calculating the feature vector space of each category from the feature vectors of the documents constituting a training document set; second feature vector space calculating means for calculating the feature vector space of an unclassifiable category from a weighted average of a feature vector space in which all vectors have the same length, the feature vector space of the training document set, and the feature vector space of a classification target document set; classification category calculation means for calculating any one of the categories or the unclassifiable category as the classification destination category of each document, by comparing the feature vector space of each category calculated by the first feature vector space calculating means and the feature vector space of the unclassifiable category calculated by the second feature vector space calculating means with the feature vector of each document constituting the classification target document set; category addition necessity determination means for determining whether the frequency at which the unclassifiable category is calculated as a classification destination category by the classification category calculation means exceeds a certain threshold; and category addition recommendation means for recommending that the operator add a new category when the category addition necessity determination means determines that the frequency at which the unclassifiable category is calculated as a classification destination category exceeds the threshold.
[0010]
According to the present invention, when the number of documents classified into the unclassifiable category increases, it is recommended that the operator add a new category.
[0011]
Further, a document classification device according to the present invention is a document classification device that classifies an electronic document into any one of preset categories, and is characterized by comprising: feature vector space calculating means for calculating the feature vector space of each category from the feature vectors of the documents constituting a training document set; classification category calculation means for calculating any one of the categories as the classification destination category of each document, by comparing the feature vector space of each category calculated by the feature vector space calculating means with the feature vector of each document constituting a classification target document set; category deletion/merging necessity determination means for determining whether there is a category whose frequency of being calculated as a classification destination category by the classification category calculation means falls below a certain threshold; and category deletion/merging recommendation means for recommending that the operator delete or merge the category when the category deletion/merging necessity determination means determines that there is a category whose frequency of being calculated as a classification destination category falls below the threshold.
[0012]
According to the present invention, when the number of documents classified into a certain category decreases, the operator is recommended to delete or merge the category.
[0013]
In the document classification device according to the present invention, the above-described invention further comprises: category division necessity determination means for determining whether there is a category whose frequency of being calculated as a classification destination category by the classification category calculation means exceeds a certain threshold; and category division recommendation means for recommending that the operator divide the category when the category division necessity determination means determines that there is a category whose frequency of being calculated as a classification destination category exceeds the threshold.
[0014]
According to the present invention, when the number of documents classified into a certain category increases, the operator is recommended to divide the category.
[0015]
A document classification method according to the present invention is a document classification method for classifying an electronic document into any one of preset categories, and is characterized by including: a first feature vector space calculating step of calculating the feature vector space of each category from the feature vectors of the documents constituting a training document set; a second feature vector space calculating step of calculating the feature vector space of an unclassifiable category from a weighted average of a feature vector space in which all vectors have the same length, the feature vector space of the training document set, and the feature vector space of a classification target document set; a classification category calculation step of calculating any one of the categories or the unclassifiable category as the classification destination category of each document, by comparing the feature vector space of each category calculated in the first feature vector space calculating step and the feature vector space of the unclassifiable category calculated in the second feature vector space calculating step with the feature vector of each document constituting the classification target document set; a category addition necessity determination step of determining whether the frequency at which the unclassifiable category is calculated as a classification destination category in the classification category calculation step exceeds a certain threshold; and a category addition recommendation step of recommending that the operator add a new category when it is determined in the category addition necessity determination step that the frequency at which the unclassifiable category is calculated as a classification destination category exceeds the threshold.
[0016]
According to the present invention, when the number of documents classified into the unclassifiable category increases, it is recommended that the operator add a new category.
[0017]
Further, a document classification method according to the present invention is a document classification method for classifying an electronic document into any one of preset categories, and is characterized by including: a feature vector space calculating step of calculating the feature vector space of each category from the feature vectors of the documents constituting a training document set; a classification category calculation step of calculating any one of the categories as the classification destination category of each document, by comparing the feature vector space of each category calculated in the feature vector space calculating step with the feature vector of each document constituting a classification target document set; a category deletion/merging necessity determination step of determining whether there is a category whose frequency of being calculated as a classification destination category in the classification category calculation step falls below a certain threshold; and a category deletion/merging recommendation step of recommending that the operator delete or merge the category when it is determined in the category deletion/merging necessity determination step that there is a category whose frequency of being calculated as a classification destination category falls below the threshold.
[0018]
According to the present invention, when the number of documents classified into a certain category decreases, the operator is recommended to delete or merge the category.
[0019]
Further, in the document classification method according to the present invention, the above-described invention further includes: a category division necessity determination step of determining whether there is a category whose frequency of being calculated as a classification destination category in the classification category calculation step exceeds a certain threshold; and a category division recommendation step of recommending that the operator divide the category when it is determined in the category division necessity determination step that there is a category whose frequency of being calculated as a classification destination category exceeds the threshold.
[0020]
According to the present invention, when the number of documents classified into a certain category increases, the operator is recommended to divide the category.
[0021]
Further, the recording medium according to the present invention records a program for causing a computer to execute the above method, so that the program can be read by a computer and the operation of the method can thereby be carried out by the computer.
[0022]
DETAILED DESCRIPTION OF THE INVENTION
Exemplary embodiments of a document classification device, a document classification method, and a computer-readable recording medium recording a program for causing a computer to execute the method will be described below in detail with reference to the accompanying drawings.
[0023]
(Embodiment)
First, the hardware configuration of the document classification apparatus according to the embodiment of the present invention will be described. FIG. 1 is an explanatory diagram showing a hardware configuration of a document classification apparatus according to an embodiment of the present invention. In the figure, 101 indicates a CPU that controls the entire system, 102 indicates a ROM that stores basic input / output programs, and 103 indicates a RAM that is used as a work area of the CPU 101.
[0024]
Reference numeral 104 denotes an HDD (hard disk drive) that controls reading and writing of data to and from the HD (hard disk) 105 under the control of the CPU 101, and 105 denotes the HD that stores data written under the control of the HDD 104. Reference numeral 106 denotes an FDD (floppy disk drive) that controls reading and writing of data to and from the FD (floppy disk) 107 under the control of the CPU 101, and 107 denotes a removable FD that stores data written under the control of the FDD 106.
[0025]
Reference numeral 108 denotes a display that shows a cursor, menus, windows, and various data such as characters and images, and 109 denotes a network board that is connected to the network NET via the communication line 110 and functions as an interface between the network NET and the CPU 101. Reference numeral 111 denotes a keyboard having a plurality of keys for inputting characters, numerical values, and various instructions, and 112 denotes a mouse for selecting and executing various instructions, selecting a processing target, moving the cursor, and the like.
[0026]
Reference numeral 113 denotes a scanner that optically reads characters and images, 114 denotes a printer that prints characters and images under the control of the CPU 101, 115 denotes a CD-ROM that is a removable recording medium, and 116 denotes a CD-ROM drive that controls the reading of data from the CD-ROM 115. Reference numeral 100 denotes a bus or cable that connects the above units.
[0027]
Next, a functional configuration of the document classification device according to the embodiment of the present invention will be described. FIG. 2 is an explanatory diagram functionally showing the configuration of the document classification apparatus according to the embodiment of the present invention.
[0028]
A document storage unit 201 holds a document group (classification target document set) 201a to be classified by a classification category estimation unit 203 described later. Each document is acquired from another information processing apparatus on the network via the communication line 110, input from the keyboard 111, captured from a paper document via the scanner 113, or read from a recording medium such as the CD-ROM 115. The documents may also differ in when they were acquired and stored (for example, of 50 documents, 10 were acquired and stored one week ago, 25 yesterday, and 15 today).
[0029]
A feature vector space estimation unit 202 holds a training document set 202a and a document-category correspondence table 202b in its initial state. The training document set 202a is a so-called corpus: a document set containing documents from various fields, for example one year of newspaper articles. The document-category correspondence table 202b is a table that holds, for each document in the training document set 202a, the category into which that document is classified. The names of the categories and their upper/lower relationships are defined in a category system table 204a of the category system management unit 204 described later.
[0030]
The feature vector space estimation unit 202 creates a classification database 202c from the training document set 202a and the document-category correspondence table 202b. FIG. 3 is an explanatory diagram for specifically explaining a method of creating the classification database 202c. Here, it is assumed that the training document set 202a is made up of the 10 documents D1 to D10 shown in FIG. 3A, and that each of these documents is classified into one of the categories as shown in the document-category correspondence table 202b in FIG. 3.
[0031]
The feature vector space estimation unit 202 first creates, for each document in the training document set 202a, a vector (feature vector) that approximately represents the semantic content of the document. Various conventional techniques exist for creating feature vectors. Here, for convenience of explanation, the simplest method is adopted: for every word appearing in the document set (the whole vocabulary, excluding unnecessary words and the like), its appearance frequency (number of occurrences) in each document is counted and used as the corresponding element value of the vector (the length of each vector).
[0032]
When five types of words W1 to W5 appear in the training document set 202a, the feature vector created for each document is a 5-dimensional vector whose vector lengths are the appearance frequencies of W1, W2, W3, W4, and W5. FIG. 3C shows an example of the feature vector created for each document. For example, the feature vector of the document D1, in which the word W1 appears 0 times, W2 2 times, W3 1 time, W4 0 times, and W5 3 times, is V1 = (0, 2, 1, 0, 3). Similarly, the feature vector of the document D5 is V5 = (1, 5, 1, 0, 4), and the feature vector of the document D8 is V8 = (0, 2, 0, 0, 2).
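The counting method described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function name `feature_vector` and the token sequence for D1 are hypothetical, chosen only so that the result matches V1 in the text.

```python
from collections import Counter


def feature_vector(tokens, vocabulary):
    """Return the number of occurrences of each vocabulary word in one document."""
    counts = Counter(tokens)
    # A Counter returns 0 for words that never occur in the document.
    return [counts[word] for word in vocabulary]


vocabulary = ["W1", "W2", "W3", "W4", "W5"]

# Hypothetical token sequence for document D1 (W2 twice, W3 once, W5 three times).
d1_tokens = ["W2", "W5", "W3", "W2", "W5", "W5"]

v1 = feature_vector(d1_tokens, vocabulary)
print(v1)  # [0, 2, 1, 0, 3]
```

Words outside the vocabulary (for example, unnecessary words that were excluded) are simply ignored by this sketch.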
[0033]
After creating the feature vector for each document in the training document set 202a, the feature vector space estimation unit 202 calculates the feature vector space of each category by averaging the feature vectors of the documents classified into that category. For example, the feature vector space VA of category A is obtained by averaging the feature vectors of the documents D1, D5, and D8 belonging to category A: VA = ((0 + 1 + 0) / 3, (2 + 5 + 2) / 3, (1 + 1 + 0) / 3, (0 + 0 + 0) / 3, (3 + 4 + 2) / 3), that is, VA = (0.33, 3, 0.66, 0, 3). Then, a classification database 202c shown in FIG. 3D is created from this VA and from VB and VC obtained in the same manner.
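The averaging for category A can be checked with a short sketch. The function name `category_centroid` is an assumption made for illustration; the input vectors are V1, V5, and V8 from the text. Note that the text gives the third element as 0.66 (truncated), whereas rounding to two places gives 0.67.

```python
def category_centroid(vectors):
    """Average the feature vectors of one category element by element."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


v1 = [0, 2, 1, 0, 3]  # document D1
v5 = [1, 5, 1, 0, 4]  # document D5
v8 = [0, 2, 0, 0, 2]  # document D8

va = category_centroid([v1, v5, v8])
print([round(x, 2) for x in va])  # [0.33, 3.0, 0.67, 0.0, 3.0]
```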
[0034]
Here, in addition to A, B, and C, an “unclassifiable” category is provided in the classification database 202c. The feature vector space of this unclassifiable category is calculated according to the following procedures (1) to (4).
[0035]
(1) Calculation of feature vector space Vα in which all vectors have the same length
First, for each document in the classification target document set 201a held in the document storage unit 201, the feature vector space estimation unit 202 creates a feature vector whose element values are the appearance frequencies, in that document, of all the words that appear in the document set or of some words selected in advance. Here, it is assumed that the classification target document set 201a is composed of the five documents D'1 to D'5 shown in FIG. 4A, and that a feature vector as shown in FIG. 4B is created for each document.
[0036]
Then, the feature vector space estimation unit 202 obtains the total number of times the words W1 to W7 appear in the training document set 202a and the classification target document set 201a (which may also be called the total vector length; here, 87 + 59 = 146), and divides it by the total number of documents (here, 10 + 5 = 15) to calculate the average number of word appearances per document (here, 146 / 15 = 9.73). It further divides this by the total vocabulary size (the number of vectors; here, 5) to calculate the average number of appearances of each word per document (here, 9.73 / 5 = 1.95). The space represented by the feature vector (1.95, 1.95, 1.95, 1.95, 1.95, 1.95, 1.95), in which the length of each vector corresponding to the words W1 to W7 is this value, is Vα.
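The arithmetic of procedure (1) can be reproduced with the following sketch. The function name and parameter names are hypothetical. The divisor is kept as an explicit parameter because the text divides by 5 while the resulting vector has one element for each of W1 to W7; the sketch simply follows the figures given in the text.

```python
def uniform_vector(total_occurrences, total_documents, divisor, dimensions):
    """Build a vector whose elements all equal the average count per word per document."""
    per_document = total_occurrences / total_documents  # 146 / 15 = 9.73...
    value = per_document / divisor                      # 9.73... / 5 = 1.95 (rounded)
    return [value] * dimensions


# 87 occurrences in the training set, 59 in the classification target set,
# 10 + 5 documents; divisor 5 and 7 dimensions are taken from the text.
v_alpha = uniform_vector(87 + 59, 10 + 5, divisor=5, dimensions=7)
print([round(x, 2) for x in v_alpha])  # [1.95, 1.95, 1.95, 1.95, 1.95, 1.95, 1.95]
```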
[0037]
(2) Calculation of feature vector space Vβ of training document set 202a
Next, the feature vector space estimation unit 202 calculates the average number of times each of the words W1 to W7 appears per document in the training document set 202a, and creates the feature vector space Vβ of the training document set 202a. Here, the feature vector representing the feature vector space Vβ is (12/10, 11/10, 13/10, 17/10, 34/10, 0/10, 0/10), that is, (1.2, 1.1, 1.3, 1.7, 3.4, 0, 0).
[0038]
(3) Calculation of feature vector space Vγ of classification target document set 201a
Further, the feature vector space estimation unit 202 calculates the average number of times each of the words W1 to W7 appears per document in the classification target document set 201a, and creates the feature vector space Vγ of the classification target document set 201a. Here, the feature vector representing the feature vector space Vγ is (7/5, 6/5, 8/5, 10/5, 18/5, 10/5, 0/5), that is, (1.4, 1.2, 1.6, 2, 3.6, 2, 0).
[0039]
(4) Calculation of feature vector space V of unclassifiable category
Then, the feature vector space estimation unit 202 calculates the feature vector space V of the unclassifiable category by taking the weighted average of Vα, Vβ, and Vγ calculated above. For example, if the weight given to Vα is 0.15, the weight given to Vβ is 0.45, and the weight given to Vγ is 0.40, the feature vector representing the feature vector space V is ((1.95 × 0.15) + (1.2 × 0.45) + (1.4 × 0.40), (1.95 × 0.15) + (1.1 × 0.45) + (1.2 × 0.40), ...), that is, (1.39, 1.27, 1.52, 1.86, 3.26, 1.09, 0.29). Then, this V is registered in the classification database 202c as the feature vector space of the unclassifiable category.
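The weighted average of procedure (4) can be verified with the following sketch, which uses the values of Vα, Vβ, and Vγ and the example weights given in the text. The function name `weighted_average` is an assumption; normalizing by the sum of the weights is a small generalization that has no effect here because the weights sum to 1.

```python
def weighted_average(vectors, weights):
    """Combine equal-length vectors element by element using the given weights."""
    total_weight = sum(weights)
    return [
        sum(w * x for w, x in zip(weights, column)) / total_weight
        for column in zip(*vectors)
    ]


v_alpha = [1.95] * 7                                # all vectors the same length
v_beta = [1.2, 1.1, 1.3, 1.7, 3.4, 0.0, 0.0]        # training document set 202a
v_gamma = [1.4, 1.2, 1.6, 2.0, 3.6, 2.0, 0.0]       # classification target set 201a

v = weighted_average([v_alpha, v_beta, v_gamma], [0.15, 0.45, 0.40])
print([round(x, 2) for x in v])  # [1.39, 1.27, 1.52, 1.86, 3.26, 1.09, 0.29]
```

The last two elements are nonzero only because of Vα and Vγ, which is what lets documents containing W6 or W7 lean toward the unclassifiable category.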
[0040]
The significance of estimating the feature vector space V of the unclassifiable category in this way, as a weighted average of the “feature vector space Vα in which all vectors have the same length”, the “feature vector space Vβ of the training document set 202a”, and the “feature vector space Vγ of the classification target document set 201a”, is as follows.
[0041]
First, the feature vector space Vα in which all vectors have the same length serves as an assumed feature vector space for an unknown category. Since it is unknown what distribution each word takes in an unknown category, Vα assumes a vector space in which the appearance frequencies of all words are the same.
[0042]
However, this Vα is not a natural representation of a language space. For example, “the” in English appears frequently in any document regardless of category, so it is natural for its frequency to be high in the unclassifiable category as well. Therefore, by also taking into account the feature vector space Vβ of the training document set 202a, the vector length of a word whose appearance frequency does not depend on the category (W5 in the example of FIG. 3) can be corrected.
[0043]
The feature vector space Vγ of the classification target document set 201a is taken into consideration for basically the same reason. However, since the scale of the training document set 202a is limited (because the content of each document must be examined in advance and a category assigned manually), Vγ is considered in addition to Vβ in order to supplement that scale. In addition, when Vβ is not suited to the document set to be classified, for example because the pattern of vocabulary has changed since the training documents were collected, considering Vγ allows optimization to match the documents to be classified. For example, as shown in FIG. 3D, the vector lengths of W6 and W7 are both 0 in the feature vector spaces VA to VC of the categories A to C, but are 1.09 and 0.29 in the feature vector space V of the unclassifiable category. A document containing W6 or W7, that is, a new word that did not appear in the training document set 202a, a word from a new field, or the like, is therefore more likely to be classified into the unclassifiable category.
[0044]
In the above description, Vα and Vγ are calculated using all the documents in the classification target document set 201a. However, when the number of documents is large and the processing load would be heavy, only a subset of documents extracted from the set may be used.
[0045]
Note that the feature vector space estimation unit 202 corresponds to the “first feature vector space calculating means”, “second feature vector space calculating means”, or “feature vector space calculating means” in the claims, and the processing it performs includes the “first feature vector space calculating step”, “second feature vector space calculating step”, or “feature vector space calculating step” in the claims.
[0046]
Returning to FIG. 2, the description of the remaining functional units continues. Reference numeral 203 denotes a classification category estimation unit, which sequentially calculates the similarity between the feature vector of each document of the classification target document set 201a and the feature vector space of each category calculated by the feature vector space estimation unit 202 (stored in the classification database 202c). Many conventional techniques exist for calculating the similarity, such as TF-IDF, naive Bayes, least squares, and maximum entropy, and any of them may be adopted here.
[0047]
Then, the category having the highest similarity is estimated as the classification destination of the document, and the classification result list 203a as illustrated in FIG. 5 is created by associating the document with the classification destination category. The classification category estimation unit 203 corresponds to “classification category calculation means” in the claims, and the processing performed by the classification category estimation unit 203 corresponds to “classification category calculation step” in the claims.
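The selection of the most similar category can be sketched as follows. The text leaves the similarity measure open, so cosine similarity is used here purely as an assumed stand-in. VA (padded with zeros for W6 and W7) and V are taken from the text; categories B and C are omitted because their values are not given, and the two document vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(document_vector, category_spaces):
    """Return the name of the category whose vector is most similar to the document."""
    return max(
        category_spaces,
        key=lambda name: cosine_similarity(document_vector, category_spaces[name]),
    )


category_spaces = {
    "A": [0.33, 3, 0.66, 0, 3, 0, 0],
    "unclassifiable": [1.39, 1.27, 1.52, 1.86, 3.26, 1.09, 0.29],
}

known_document = [0, 2, 1, 0, 3, 0, 0]   # hypothetical: resembles category A
novel_document = [0, 0, 0, 0, 1, 3, 0]   # hypothetical: dominated by the new word W6

print(classify(known_document, category_spaces))  # A
print(classify(novel_document, category_spaces))  # unclassifiable
```

Under this assumed measure, the document dominated by W6 falls into the unclassifiable category because only that category has a nonzero W6 component.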
[0048]
A category system management unit 204 holds a category system table 204a and a category system change rule 204b in advance. In the category system table 204a, the name of each category into which the document is classified and its upper / lower relationship are defined. As long as the names and relationships of the categories are known, the data structure of the category system table 204a may be anything.
[0049]
The category system change rule 204b is a set of rules (judgment criteria) for determining whether it is desirable to add a new category to the category system table 204a or to divide, delete, or merge an already registered category.
[0050]
For example, with regard to category addition, there is a rule such as “(a-1) the frequency at which documents are determined to be unclassifiable has exceeded a threshold (specifically, ‘the ratio of documents determined to belong to the unclassifiable category among all documents has exceeded 20%’, ‘30 or more documents determined to belong to the unclassifiable category have appeared consecutively’, etc.)”. When this rule applies, it is determined that adding a new category is desirable.
[0051]
The rules for adding a category also include “(a-2) the number of documents belonging to the unclassifiable category has exceeded a threshold” and “(a-3) the result of an evaluation function using the above frequency and/or number of documents has exceeded a threshold (specifically, ‘the number of consecutive appearances of documents determined to belong to the unclassifiable category has exceeded a value corresponding to 15% of the total number of documents’, etc.)”. In the following, of these rules, the first-mentioned “the ratio of documents determined to belong to the unclassifiable category among all documents has exceeded 20%” is adopted as the category addition rule.
[0052]
Further, the rules regarding category division include, for example, “(b-1) the frequency at which a certain category is determined to be a classification destination has exceeded a threshold (specifically, ‘the ratio of documents determined to belong to a certain category among all documents has exceeded 20%’, ‘30 or more documents determined to belong to a certain category have appeared consecutively’, etc.)”, “(b-2) the number of documents belonging to a certain category has exceeded a threshold”, and “(b-3) the result of an evaluation function using the above frequency and/or number of documents has exceeded a threshold (specifically, ‘the number of consecutive appearances of documents determined to belong to a certain category has exceeded a value corresponding to 15% of the total number of documents’, etc.)”. In the following, of these rules, the first-mentioned “the ratio of documents determined to belong to a certain category among all documents has exceeded 20%” is adopted as the category division rule.
[0053]
Further, as a rule regarding category deletion or merging, there is, for example, “(c-1) the frequency at which a certain category is determined to be a classification destination has fallen below a threshold (specifically, ‘the ratio of documents determined to belong to a certain category among all documents has fallen below 5%’, ‘50 or more documents classified into categories other than a certain category have appeared consecutively’, etc.)”. Here, the first-mentioned “the ratio of documents determined to belong to a certain category among all documents has fallen below 5%” is adopted as the category deletion/merging rule.
[0054]
When the classification category estimation unit 203 finishes classifying the documents, the category system management unit 204 refers to the created classification result list 203a and, in order to determine whether it is desirable to add, divide, delete, or merge categories, determines whether an event corresponding to any rule of the category system change rule 204b has occurred.
[0055]
For example, suppose that the classification result list 203a is as shown in FIG. 5 and that the category system change rule 204b contains the addition rule “the ratio of documents determined to belong to the unclassifiable category among all documents has exceeded 20%”. FIG. 5 shows that the number of documents in the unclassifiable category (here, 2) exceeds 20% of the total number of documents (here, 5), so it is determined that adding a new category is desirable. Whether categories should be divided or deleted/merged is determined in the same manner.
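The rule check can be sketched as follows, using the three ratio rules adopted in the text (addition above 20%, division above 20%, deletion/merging below 5%). The function and variable names are hypothetical, and the sample list only mirrors the stated counts of FIG. 5 (2 unclassifiable documents out of 5); the assignments of the other three documents are assumptions.

```python
from collections import Counter

UNCLASSIFIABLE = "unclassifiable"


def recommend_changes(results, categories,
                      add_ratio=0.20, split_ratio=0.20, delete_ratio=0.05):
    """Return (action, category) pairs suggested by the ratio rules."""
    total = len(results)
    recommendations = []
    if total == 0:
        return recommendations
    counts = Counter(results.values())
    # Addition rule: share of unclassifiable documents exceeds the threshold.
    if counts[UNCLASSIFIABLE] / total > add_ratio:
        recommendations.append(("add", None))
    for category in categories:
        ratio = counts[category] / total
        if ratio > split_ratio:        # division rule
            recommendations.append(("split", category))
        elif ratio < delete_ratio:     # deletion/merging rule
            recommendations.append(("delete_or_merge", category))
    return recommendations


# Hypothetical classification result list with 2 of 5 documents unclassifiable.
results = {
    "D'1": "A",
    "D'2": UNCLASSIFIABLE,
    "D'3": "B",
    "D'4": UNCLASSIFIABLE,
    "D'5": "A",
}

print(recommend_changes(results, ["A", "B", "C"]))
# [('add', None), ('split', 'A'), ('delete_or_merge', 'C')]
```

The sketch only produces recommendations; as in the embodiment, carrying out the change is left to the operator.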
[0056]
The category system management unit 204 corresponds to the “category addition necessity determination means”, “category division necessity determination means”, or “category deletion/merging necessity determination means” in the claims, and the processing it performs includes the “category addition necessity determination step”, “category division necessity determination step”, or “category deletion/merging necessity determination step” in the claims.
[0057]
The user interface unit 205 reads the classification result list 203a created by the classification category estimation unit 203 and the category system table 204a of the category system management unit 204, and graphically displays on the screen the category tree structure and the titles of the documents classified into each category. In addition, when the category system management unit 204 determines that addition, division, deletion, or merging of categories is desirable, it displays a dialog recommending that the operator carry out the addition or other change, and a dialog for entering the parameters necessary for carrying it out (such as the name of the category to be added). Display examples will be described later.
[0058]
The user interface unit 205 corresponds to the “category addition recommendation means”, “category division recommendation means”, or “category deletion/merging recommendation means” in the claims, and the processing it performs includes the “category addition recommendation step”, “category division recommendation step”, or “category deletion/merging recommendation step” in the claims.
[0059]
The functions of the document storage unit 201, the feature vector space estimation unit 202, the classification category estimation unit 203, the category system management unit 204, and the user interface unit 205 are each realized by the CPU 101 or the like executing processing in accordance with instructions described in a program recorded on a recording medium such as the ROM 102, the RAM 103, the hard disk 105, or the floppy disk 107.
[0060]
Next, a procedure for document classification processing of the document classification device according to the embodiment of the present invention will be described. FIG. 6 is a flowchart showing a procedure of document classification processing of the document classification device according to the embodiment of the present invention. When a new document is added to the classification target document set 201a or when an operator instructs the classification of the document, the processing according to this flowchart is started.
[0061]
In step S601, the feature vector space estimation unit 202 creates a feature vector for each document in the training document set 202a. In step S602, document feature vectors are averaged for each category in the document-category correspondence table 202b to calculate a feature vector space for each category.
[0062]
If the processing according to this flowchart has been performed before (that is, in the second and subsequent classification runs), the feature vector space of each category based on the training document set 202a was already calculated during the previous run and the result is stored in the classification database 202c, so the processes in steps S601 and S602 can be omitted.
[0063]
In step S603, the feature vector space V of the unclassifiable category is estimated. As described above, it is calculated as a weighted average of the feature vector space Vα in which all vectors have the same length (calculated from both the classification target document set 201a and the training document set 202a), the feature vector space Vβ of the training document set 202a, and the feature vector space Vγ of the classification target document set 201a. In step S604, the feature vector space of each category calculated in steps S602 and S603 is stored in the classification database 202c.
[0064]
In step S605, the classification category estimation unit 203 creates a feature vector for each document in the classification target document set 201a. If Vα and Vγ were calculated using all of the classification target documents in step S603, the feature vectors of all the documents have already been created at that time, and the processing in step S605 can therefore be omitted. If Vα and Vγ were calculated using only some of the classification target documents in step S603, feature vectors need to be created only for the documents that were not included.
[0065]
In step S606, the classification category estimation unit 203 sequentially compares the feature vector of each document created in step S605 with the feature vector space of each category in the classification database 202c created in step S604, and estimates the category with the highest similarity as the classification destination of the document. In step S607, each document and its classification destination category are stored in the classification result list 203a.
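The comparison in step S606 can be sketched as below. This passage does not state which similarity measure is used, so cosine similarity is an assumption made for illustration, and `cosine`, `classify`, and `database` are hypothetical names.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(doc_vector, category_vectors):
    """Return the category whose vector is most similar (cf. step S606)."""
    return max(category_vectors,
               key=lambda cat: cosine(doc_vector, category_vectors[cat]))


# Hypothetical database holding one regular category and the
# unclassifiable category estimated in step S603.
database = {"mobile terminal": [1.0, 0.0], "unclassifiable": [1.0, 1.0]}
first = classify([0.9, 0.1], database)
second = classify([1.0, 1.2], database)
```

Because the unclassifiable category is stored alongside the regular categories, a document that resembles none of the regular categories ends up there simply by having the highest similarity to it.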
[0066]
In step S608, the user interface unit 205 reads the classification result list 203a created in step S607 and displays a classification result list window as shown in FIG. 7. The process according to this flowchart then ends.
[0067]
Next, the procedure of the category addition process of the document classification device according to the embodiment of the present invention will be described. FIG. 8 is a flowchart showing the procedure of the category addition process of the document classification device according to the embodiment of the present invention.
[0068]
The process according to this flowchart is started immediately after the document classification process shown in FIG. 6 is completed (alternatively, this process alone may be performed periodically, independently of the classification process). It is assumed that the classification result list window displayed on the screen at the start is as shown in FIG. 7. This figure shows a state in which a large number of documents (for example, more than 20% of the total number of documents) are classified in the unclassifiable category.
[0069]
In step S801, the category system management unit 204 refers to the classification result list 203a created in step S607 of FIG. 6 and determines whether an event corresponding to the category addition rule of the category system change rule 204b has occurred. When such an event has occurred (Yes at step S801), the process proceeds to step S802. When no such event has occurred (No at step S801), the processing according to this flowchart ends.
[0070]
In step S802, the user interface unit 205 displays a change recommendation dialog as shown in FIG. 9. Here, the addition of a new category is recommended. In step S803, the process waits for a button to be clicked with the mouse. If the change start button 901 is clicked (Yes at step S803), the process proceeds to step S804. If the cancel button 902 is clicked (No at step S803), the processing according to this flowchart ends.
[0071]
In step S804, the user interface unit 205 deletes the change recommendation dialog displayed in step S802 and displays a category addition dialog as shown in FIG. 10. This is a dialog for allowing the operator to input the name of the new category to be added. In step S805, the process waits for a button to be clicked with the mouse. If the “forward” button 1001 is clicked (Yes in step S805), the process proceeds to step S807; otherwise (No in step S805), the process proceeds to step S806, where processing corresponding to the clicked button is performed.
[0072]
In step S807, the category addition dialog displayed in step S804 is deleted and a document allocation dialog as shown in FIG. 11 is displayed. This is a dialog for allowing the operator to specify the documents to be classified into the new category to be added (in this case, the information home appliance category), and it displays a list of all documents classified in the unclassifiable category. In step S808, the process waits for a button to be clicked with the mouse. If the “complete” button 1101 is clicked (Yes in step S808), the process proceeds to step S810; otherwise (No in step S808), the process proceeds to step S809, where processing corresponding to the clicked button is performed.
[0073]
In step S810, the user interface unit 205 notifies the category system management unit 204 of the category name that had been entered in the category addition dialog when the “forward” button 1001 was clicked in step S805 (in this case, “information home appliance”). In response, the category system management unit 204 adds the new category “information home appliance” to the category system table 204a.
[0074]
In step S811, the user interface unit 205 notifies the feature vector space estimation unit 202 of the documents that were selected in the document assignment dialog when the “complete” button 1101 was clicked in step S808 (here, it is assumed that all the documents in the unclassifiable category were selected). In response, the feature vector space estimation unit 202 acquires the feature vectors created in step S605 of FIG. 6 for the notified documents and calculates the feature vector space of the category to be added by averaging them. The result is then associated with the category name and registered in the classification database 202c.
[0075]
Further, in step S812, the document assignment dialog displayed in step S807 is deleted, and the process then proceeds to step S606 in FIG. 6, where the classification target document set 201a is classified again based on the updated classification database 202c. FIG. 12 is an explanatory diagram illustrating an example of the classification result list window immediately after the new category “information home appliance” is added. The documents classified in the unclassifiable category in FIG. 7 are classified in the information home appliance category in FIG. 12.
[0076]
Next, the procedure of category division processing of the document classification device according to the embodiment of the present invention will be described. FIG. 13 is a flowchart showing the procedure of category division processing of the document classification device according to the embodiment of the present invention. The process according to this flowchart is started immediately after the category addition process shown in FIG. 8 is completed. It is assumed that the classification result list window displayed on the screen at the start is as shown in FIG. 14. This figure shows a state in which a large number of documents (for example, more than 20% of the total number of documents) are classified in the portable terminal category.
[0077]
In step S1301, the category system management unit 204 refers to the classification result list 203a updated in step S811 in FIG. 8 and determines whether an event corresponding to the category division rule of the category system change rule 204b has occurred. When such an event has occurred (Yes at step S1301), the process proceeds to step S1302; when it has not (No at step S1301), the processing according to this flowchart ends.
[0078]
In step S1302, the user interface unit 205 displays a change recommendation dialog as shown in FIG. 15. Here, the division of the mobile terminal category is recommended. In step S1303, the process waits for a button to be clicked with the mouse. If the change start button 1501 is clicked (Yes at step S1303), the process proceeds to step S1304. If the cancel button 1502 is clicked (No at step S1303), the processing according to this flowchart ends.
[0079]
In step S1304, the user interface unit 205 deletes the change recommendation dialog displayed in step S1302 and displays a category division dialog as shown in FIG. 16. This is a dialog for allowing the operator to input the name of each category after division. In step S1305, the process waits for a button to be clicked with the mouse. If the “forward” button 1601 is clicked (Yes in step S1305), the process proceeds to step S1307; otherwise (No at step S1305), the process proceeds to step S1306, where processing corresponding to the clicked button is performed.
[0080]
In step S1307, the category division dialog displayed in step S1304 is deleted, and a document allocation dialog as shown in FIG. 17 is displayed. This is a dialog for allowing the operator to specify the documents to be classified into one of the categories after division (here, the mobile phone category), and it displays a list of all documents classified in the category before division (here, the mobile terminal category).
[0081]
In step S1308, the process waits for a button to be clicked with the mouse. If the “forward” button 1701 is clicked (Yes in step S1308), the process proceeds to step S1310; otherwise (No in step S1308), the process proceeds to step S1309, where processing corresponding to the clicked button is performed.
[0082]
In step S1310, the document allocation dialog displayed in step S1307 is deleted, and a similar document allocation dialog is displayed for the other category after division (in this case, the mobile PC category). This dialog, however, lists all the documents in the category before division except those assigned in step S1308 to the first category after division. In step S1311, the process waits for a button to be clicked with the mouse. If the “complete” button 1702 is clicked (Yes in step S1311), the process proceeds to step S1313; otherwise (No in step S1311), the process proceeds to step S1312, where processing corresponding to the clicked button is performed.
[0083]
In step S1313, the user interface unit 205 notifies the category system management unit 204 of the category name before division (here, “mobile terminal”) and the category names after division (the names that had been entered in the category division dialog when the “forward” button 1601 was clicked in step S1305; here, “mobile phone” and “mobile PC”). In response, the category system management unit 204 deletes the category “mobile terminal” from the category system table 204a and adds the new categories “mobile phone” and “mobile PC”.
[0084]
In step S1314, the user interface unit 205 notifies the feature vector space estimation unit 202 of the category names before and after the division and of the documents assigned to the respective categories after division in steps S1308 and S1311. In response, the feature vector space estimation unit 202 deletes the feature vector space of the category before division from the classification database 202c, calculates the feature vector space of each category after division, associates each with its category name, and registers them in the classification database 202c.
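The database update of step S1314 can be sketched as follows. This is an illustration under assumptions: the classification database 202c is modeled as a plain dictionary, each category's feature vector space is taken to be the mean of its assigned documents' feature vectors (as in the category addition process), and `split_category` and the sample data are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def split_category(database, doc_vectors, old_name, assignments):
    """Replace one category by several (cf. step S1314).

    assignments maps each new category name to the ids of the
    documents the operator assigned to it.
    """
    # Delete the feature vector space of the category before division.
    database.pop(old_name, None)
    for new_name, doc_ids in assignments.items():
        if not doc_ids:
            raise ValueError(f"no documents assigned to {new_name!r}")
        # Register the averaged vector under the new category name.
        database[new_name] = mean_vector([doc_vectors[d] for d in doc_ids])
    return database


# Hypothetical toy data.
database = {"mobile terminal": [1.0, 1.0], "audio": [0.0, 5.0]}
doc_vectors = {"a": [2.0, 0.0], "b": [4.0, 0.0], "c": [0.0, 2.0]}
split_category(database, doc_vectors, "mobile terminal",
               {"mobile phone": ["a", "b"], "mobile PC": ["c"]})
```

Since `assignments` may hold any number of entries, the same sketch covers division into more than two categories.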
[0085]
In step S1315, the document allocation dialog displayed in step S1310 is deleted, and the process then proceeds to step S606 in FIG. 6, where the classification target document set 201a is classified again based on the updated classification database 202c. FIG. 18 is an explanatory diagram illustrating an example of the classification result list window immediately after the mobile terminal category is divided into the mobile phone category and the mobile PC category.
[0086]
Some of the documents classified into the portable terminal category in FIG. 14 are classified into the portable PC category in FIG. 18. In addition, because all the documents of the classification target document set 201a are classified again after the category division, documents that were not included in the portable terminal category before the division (for example, “Company I proposes a voice input standard for portable PCs. Company J agrees”) are also included in the portable PC category after the division. Although not illustrated, the same applies to the mobile phone category. In the above description the category is divided into two, but it may be divided into any number of categories.
[0087]
Next, the procedure of category deletion / merging processing of the document classification device according to the embodiment of the present invention will be described. FIG. 19 is a flowchart showing the procedure of category deletion / merging processing of the document classification device according to the embodiment of the present invention. Immediately after the category dividing process shown in FIG. 13 is completed, the process according to this flowchart is started. It is assumed that the classification result list window displayed on the screen at the start time is as shown in FIG. This figure shows a state where there are few documents classified into the iNet dedicated terminal category (for example, less than 5% of the total number of documents).
[0088]
In step S1901, the category system management unit 204 refers to the classification result list 203a updated in step S1314 in FIG. 13 and determines whether an event corresponding to the category deletion / merging rule of the category system change rule 204b has occurred. If such an event has occurred (Yes at step S1901), the process proceeds to step S1902; if not (No at step S1901), the process according to this flowchart ends.
[0089]
In step S1902, the user interface unit 205 displays a change recommendation dialog as shown in FIG. 21. Here, deletion or merging of the iNet dedicated terminal category is recommended. In step S1903, the process waits for a button to be clicked with the mouse. If the change start button 2101 is clicked (Yes at step S1903), the process proceeds to step S1904. If the cancel button 2102 is clicked (No at step S1903), the processing according to this flowchart ends.
[0090]
In step S1904, the user interface unit 205 deletes the change recommendation dialog displayed in step S1902 and displays a category deletion / merging dialog as shown in FIG. 22. This is a dialog for allowing the operator to specify whether a category with few documents is to be deleted or merged. Note that the “forward” button 2201 is grayed out and cannot be pressed while deletion is specified, and the “complete” button 2202 is grayed out and cannot be pressed while merging is specified.
[0091]
In step S1905, the process waits for a button to be pressed. If the “complete” button 2202 is pressed (Yes in step S1905), deletion of the category has been specified, so in step S1906 the category to be deleted (in this case, “iNet dedicated terminal”) is notified to the category system management unit 204 and deleted from the category system table 204a. In step S1907, the feature vector space estimation unit 202 is also notified, and the category is deleted from the document-category correspondence table 202b and the classification database 202c. In step S1908, the category deletion / merging dialog displayed in step S1904 is deleted, and the process proceeds to step S606 in FIG. 6.
[0092]
If the button pressed in step S1905 is not the “complete” button 2202 (No in step S1905), the process proceeds to step S1909, where it is determined whether the button is the “forward” button 2201. If it is not the “forward” button 2201 (No in step S1909), processing corresponding to the pressed button is performed in step S1910. If it is the “forward” button 2201 (Yes in step S1909), the process proceeds to step S1911.
[0093]
In this case, since merging of categories is designated, in step S1911, the category deletion / merging dialog displayed in step S1904 is deleted and a merging destination selection dialog as shown in FIG. 23 is displayed. This is a dialog for allowing the operator to specify in which other category a document of a category deleted by merging is classified, and all categories other than the merged category are displayed in a list.
[0094]
In step S1912, the process waits for a button to be pressed. If the button is not the “forward” button 2301 (No in step S1912), processing corresponding to the pressed button is performed in step S1913. If it is the “forward” button 2301 (Yes in step S1912), the process proceeds to step S1914.
[0095]
In step S1914, the merge destination selection dialog displayed in step S1911 is deleted, and a document allocation dialog as shown in FIG. 24 is displayed. This is a dialog for allowing the operator to specify which documents of the category to be deleted by merging are to be classified into the merge destination category (the category that was selected when the “forward” button 2301 was pressed in step S1912; here, the information appliance category), and it displays a list of all documents in the category to be merged.
[0096]
In step S1915, the process waits for a button to be pressed. If the button is not the “complete” button 2401 (No in step S1915), processing corresponding to the pressed button is performed in step S1916. If it is the “complete” button 2401 (Yes in step S1915), the process proceeds to step S1917.
[0097]
In step S1917, the user interface unit 205 notifies the category system management unit 204 of the name of the category to be merged (here, “iNet dedicated terminal”), and the category is deleted from the category system table 204a. In step S1918, the feature vector space estimation unit 202 is notified of the category names of the merge source and merge destination and of the documents assigned to the merge destination (the documents that were selected when the “complete” button 2401 was pressed in step S1915). In response, the feature vector space estimation unit 202 rewrites the merge source category in the document-category correspondence table 202b to the merge destination category, recalculates the feature vector space of the merge destination category, and registers it in the classification database 202c. The feature vector space of the merged category is then deleted from the database.
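One possible reading of the update in step S1918 is sketched below. The passage does not spell out how the selected documents enter the recalculation, so this sketch assumes that they are recorded under the merge destination and that the destination's feature vector space is the mean over all documents then labeled with it. The dictionaries standing in for the document-category correspondence table 202b and the classification database 202c, and the name `merge_category`, are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def merge_category(database, correspondence, doc_vectors,
                   source, destination, selected_doc_ids):
    """Merge `source` into `destination` (cf. step S1918)."""
    # Rewrite the merge source category to the merge destination category.
    for doc_id, category in correspondence.items():
        if category == source:
            correspondence[doc_id] = destination
    # Assumption: the operator-selected documents join the destination.
    for doc_id in selected_doc_ids:
        correspondence[doc_id] = destination
    members = [doc_vectors[d] for d, c in correspondence.items()
               if c == destination]
    if members:
        database[destination] = mean_vector(members)
    # Delete the feature vector space of the merged category.
    database.pop(source, None)


# Hypothetical toy data.
database = {"iNet dedicated terminal": [9.0, 9.0],
            "information appliance": [1.0, 1.0]}
correspondence = {"t1": "information appliance",
                  "t2": "iNet dedicated terminal"}
doc_vectors = {"t1": [1.0, 1.0], "t2": [3.0, 1.0], "n1": [2.0, 4.0]}
merge_category(database, correspondence, doc_vectors,
               "iNet dedicated terminal", "information appliance", ["n1"])
```

Plain deletion (steps S1906 and S1907) corresponds to removing the category's entries without relabeling or recalculating anything.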
[0098]
In step S1919, the document allocation dialog displayed in step S1914 is deleted, and the process then proceeds to step S606 in FIG. 6, where the classification target document set 201a is classified again based on the updated classification database 202c. FIG. 25 is an explanatory diagram illustrating an example of the classification result list window immediately after the iNet dedicated terminal category is deleted or merged into the information appliance category.
[0099]
As described above, according to this embodiment, every time document classification is performed, whether triggered by the addition of documents or by a classification instruction, the device automatically checks whether the number of documents in a specific category (including the unclassifiable category) has grown too large or, conversely, has become too small, and recommends adding, dividing, deleting, or merging categories based on the result. Simply by entering the necessary items as recommended, the operator can always maintain an optimal classification system that matches the characteristics of the classification target documents (ease of maintenance of the classification system). Therefore, even when the classification target documents change qualitatively, for example through a shift in the fields they cover, an appropriate classification result can always be obtained.
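The three checks summarized above can be sketched as a single function. The 20% and 5% thresholds are the example values mentioned in the description; the function name, the rule labels, and the idea of evaluating all rules in one pass are assumptions for illustration (the embodiment runs the addition, division, and deletion / merging processes one after another, reclassifying in between).

```python
from collections import Counter

UNCLASSIFIABLE = "unclassifiable"


def recommend_changes(result_list, categories, upper=0.20, lower=0.05):
    """Return (action, category) recommendations from a classification result.

    result_list maps document ids to their classification destination;
    categories lists every category name, including empty ones.
    """
    total = len(result_list)
    if total == 0:
        return []
    counts = Counter(result_list.values())
    recommendations = []
    for category in categories:
        share = counts[category] / total
        if category == UNCLASSIFIABLE:
            if share > upper:
                recommendations.append(("add", category))
        elif share > upper:
            recommendations.append(("divide", category))
        elif share < lower:
            recommendations.append(("delete_or_merge", category))
    return recommendations


# Hypothetical result list of 100 documents.
sample_counts = {"unclassifiable": 25, "mobile terminal": 30,
                 "iNet dedicated terminal": 3, "audio": 15,
                 "video": 15, "printer": 12}
result_list = {}
for category, n in sample_counts.items():
    for i in range(n):
        result_list[f"{category}-{i}"] = category
recommendations = recommend_changes(result_list, list(sample_counts))
```

In the device, each recommendation would lead to a change recommendation dialog rather than being applied automatically; the operator decides whether to proceed.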
[0100]
Further, the feature vector space V of the unclassifiable category is obtained as a weighted average of the feature vector space Vα in which all vectors have the same length, the feature vector space Vβ of the training document set 202a, and the feature vector space Vγ of the classification target document set 201a. Therefore, even when documents having low similarity to the existing categories appear, a category in a completely new field can be generated and those documents can be classified into it.
[0101]
The document classification methods (including the category addition method, category division method, and category deletion / merging method) described in the above embodiment are realized by executing a program prepared in advance on a computer such as a personal computer or a workstation. This program is recorded on a computer-readable recording medium such as a hard disk, floppy disk, CD-ROM, MO, or DVD, and is executed by being read from the recording medium by the computer. The program can also be distributed via the recording medium or via a network such as the Internet.
[0102]
[Effect of the invention]
As described above, the document classification device according to the present invention is a document classification device that classifies an electronic document into any one of preset categories. The first feature vector space calculating means calculates the feature vector space of each category from the feature vectors of the documents constituting the training document set. The second feature vector space calculating means calculates the feature vector space of an unclassifiable category from a weighted average of a feature vector space in which all vectors have the same length, the feature vector space of the training document set, and the feature vector space of the classification target document set. The classification category calculating means compares the feature vector space of each category calculated by the first feature vector space calculating means and the feature vector space of the unclassifiable category calculated by the second feature vector space calculating means with the feature vector of each document constituting the classification target document set, and thereby calculates one of the categories or the unclassifiable category as the classification destination category of each document. The category addition necessity determination means determines whether the frequency with which the unclassifiable category is calculated as the classification destination category by the classification category calculating means exceeds a certain threshold. When the category addition necessity determination means determines that this frequency exceeds the threshold, the category addition recommendation means recommends that the operator add a new category. Consequently, when the number of documents classified into the unclassifiable category increases, the operator is prompted to add a new category, so that a classification system matching the characteristics of the classification target document group can always be maintained. This has the effect of providing a document classification device that can prevent the decrease in classification accuracy that accompanies a change in those characteristics.
[0103]
The document classification device according to the present invention is also a document classification device that classifies an electronic document into any one of preset categories. The feature vector space calculation means calculates the feature vector space of each category from the feature vectors of the documents constituting the training document set. The classification category calculation means compares the feature vector space of each category calculated by the feature vector space calculation means with the feature vector of each document constituting the classification target document set, and thereby calculates one of the categories as the classification destination category of each document. The category deletion / merging necessity determination means determines whether there is a category whose frequency of being calculated as the classification destination category by the classification category calculation means falls below a certain threshold. When the category deletion / merging necessity determination means determines that such a category exists, the category deletion / merging recommendation means recommends that the operator delete or merge that category. Consequently, when fewer documents are classified into a certain category, the operator is prompted to delete or merge that category, so that a classification system matching the characteristics of the classification target document group can always be maintained. This has the effect of providing a document classification device that can prevent the decrease in classification accuracy that accompanies a change in those characteristics.
[0104]
Further, the document classification device according to the present invention, in the above-described invention, further comprises category division necessity determination means for determining whether there is a category whose frequency of being calculated as the classification destination category by the classification category calculation means exceeds a certain threshold, and category division recommendation means for recommending that the operator divide the category when the category division necessity determination means determines that such a category exists. Consequently, when the number of documents classified into a certain category increases, the operator is prompted to divide that category, so that a classification system matching the characteristics of the classification target document group can always be maintained. This has the effect of providing a document classification device that can prevent the decrease in classification accuracy that accompanies a change in those characteristics.
[0105]
The document classification method according to the present invention is a document classification method for classifying an electronic document into any one of preset categories. In the first feature vector space calculation step, the feature vector space of each category is calculated from the feature vectors of the documents constituting the training document set. In the second feature vector space calculation step, the feature vector space of an unclassifiable category is calculated from a weighted average of a feature vector space in which all vectors have the same length, the feature vector space of the training document set, and the feature vector space of the classification target document set. In the classification category calculation step, the feature vector space of each category calculated in the first feature vector space calculation step and the feature vector space of the unclassifiable category calculated in the second feature vector space calculation step are compared with the feature vector of each document constituting the classification target document set, and one of the categories or the unclassifiable category is thereby calculated as the classification destination category of each document. In the category addition necessity determination step, it is determined whether the frequency with which the unclassifiable category is calculated as the classification destination category in the classification category calculation step exceeds a certain threshold. In the category addition recommendation step, when it has been determined in the category addition necessity determination step that this frequency exceeds the threshold, the operator is recommended to add a new category. Consequently, when the number of documents classified into the unclassifiable category increases, the operator is prompted to add a new category, so that a classification system matching the characteristics of the classification target document group can always be maintained. This has the effect of providing a document classification method that can prevent the decrease in classification accuracy that accompanies a change in those characteristics.
[0106]
The document classification method according to the present invention is also a document classification method for classifying an electronic document into any one of preset categories. In the feature vector space calculation step, the feature vector space of each category is calculated from the feature vectors of the documents constituting the training document set. In the classification category calculation step, the feature vector space of each category calculated in the feature vector space calculation step is compared with the feature vector of each document constituting the classification target document set, and one of the categories is thereby calculated as the classification destination category of each document. In the category deletion / merging necessity determination step, it is determined whether there is a category whose frequency of being calculated as the classification destination category in the classification category calculation step falls below a certain threshold. In the category deletion / merging recommendation step, when it has been determined in the category deletion / merging necessity determination step that such a category exists, the operator is recommended to delete or merge that category. Consequently, when fewer documents are classified into a certain category, the operator is prompted to delete or merge that category, so that a classification system matching the characteristics of the classification target document group can always be maintained. This has the effect of providing a document classification method that can prevent the decrease in classification accuracy that accompanies a change in those characteristics.
[0107]
Further, the document classification method according to the present invention, in the above-described invention, further includes a category division necessity determination step of determining whether there is a category whose frequency of being calculated as the classification destination category in the classification category calculation step exceeds a certain threshold, and a category division recommendation step of recommending that the operator divide the category when it has been determined in the category division necessity determination step that such a category exists. Consequently, when the number of documents classified into a certain category increases, the operator is prompted to divide that category, so that a classification system matching the characteristics of the classification target document group can always be maintained. This has the effect of providing a document classification method that can prevent the decrease in classification accuracy that accompanies a change in those characteristics.
[0108]
Further, the recording medium according to the present invention records a program that causes a computer to execute the above method, so that the program is computer-readable and the operations of the above method can thereby be carried out by a computer. This has the effect of providing such a recording medium.
[Brief description of the drawings]
FIG. 1 is an explanatory diagram showing a hardware configuration of a document classification device according to an embodiment of the present invention.
FIG. 2 is an explanatory diagram functionally showing the configuration of the document classification device according to the embodiment of the present invention.
FIG. 3 is an explanatory diagram for specifically explaining a method of creating a classification database 202c by a feature vector space estimation unit 202 in the document classification device according to the embodiment of the present invention.
FIG. 4 is an explanatory diagram showing an example of a classification target document set 201a and a feature vector created for each document in the document classification device according to the embodiment of the present invention.
FIG. 5 is an explanatory diagram showing an example of a classification result list 203a created by a classification category estimation unit 203 of the document classification device according to the embodiment of the present invention.
FIG. 6 is a flowchart showing a procedure of document classification processing of the document classification device according to the embodiment of the present invention.
FIG. 7 is an explanatory diagram showing an example of a classification result list window created by the user interface unit 205 of the document classification device according to the embodiment of the present invention.
FIG. 8 is a flowchart showing a procedure of category addition processing of the document classification device according to the embodiment of the present invention.
FIG. 9 is an explanatory diagram showing an example of a change recommendation dialog displayed by the user interface unit 205 of the document classification device according to the embodiment of the present invention.
FIG. 10 is an explanatory diagram showing an example of a category addition dialog displayed by the user interface unit 205 of the document classification device according to the embodiment of the present invention.
FIG. 11 is an explanatory diagram showing an example of a document assignment dialog displayed by the user interface unit 205 of the document classification device according to the embodiment of the present invention.
FIG. 12 is an explanatory diagram showing another example of a classification result list window displayed by the user interface unit 205 of the document classification device according to the embodiment of the present invention.
FIG. 13 is a flowchart showing a procedure of category division processing of the document classification device according to the embodiment of the present invention.
FIG. 14 is an explanatory diagram showing another example of the classification result list window displayed by the user interface unit 205 of the document classification device according to the embodiment of the present invention.
FIG. 15 is an explanatory diagram showing another example of the change recommendation dialog displayed by the user interface unit 205 of the document classification device according to the embodiment of the present invention.
FIG. 16 is an explanatory diagram showing an example of a category division dialog displayed by the user interface unit 205 of the document classification device according to the embodiment of the present invention.
FIG. 17 is an explanatory diagram showing another example of the document assignment dialog displayed by the user interface unit 205 of the document classification device according to the embodiment of the present invention.
FIG. 18 is an explanatory diagram showing another example of the classification result list window displayed by the user interface unit 205 of the document classification device according to the embodiment of the present invention.
FIG. 19 is a flowchart showing a procedure of category deletion/merging processing of the document classification device according to the embodiment of the present invention.
FIG. 20 is an explanatory diagram showing another example of a classification result list window displayed by the user interface unit 205 of the document classification device according to the embodiment of the present invention.
FIG. 21 is an explanatory diagram showing another example of the change recommendation dialog displayed by the user interface unit 205 of the document classification device according to the embodiment of the present invention.
FIG. 22 is an explanatory diagram showing an example of a category deletion/merging dialog displayed by the user interface unit 205 of the document classification device according to the embodiment of the present invention.
FIG. 23 is an explanatory diagram showing an example of a merge destination selection dialog displayed by the user interface unit 205 of the document classification device according to the embodiment of the present invention.
FIG. 24 is an explanatory diagram showing another example of the document assignment dialog displayed by the user interface unit 205 of the document classification device according to the embodiment of the present invention.
FIG. 25 is an explanatory diagram showing another example of the classification result list window displayed by the user interface unit 205 of the document classification device according to the embodiment of the present invention.
[Explanation of symbols]
100 Bus
101 CPU
102 ROM
103 RAM
104 HDD
105 HD
106 FDD
107 FD
108 Display
109 I/F
110 Communication line
111 Keyboard
112 Mouse
113 Scanner
114 Printer
115 CD-ROM
116 CD-ROM drive
201 Document storage unit
201a Classification target document set
202 Feature vector space estimation unit
202a Training document set
202b Document-category correspondence table
202c Classification database
203 Classification category estimation unit
203a Classification result list
204 Category system management unit
204a Category system chart
204b Category system change rules
205 User interface unit

Claims

A document classification apparatus that classifies an electronic document into one of a plurality of preset categories, the apparatus comprising:
First feature vector space calculation means for calculating a feature vector space of each category from the feature vectors of the documents constituting a training document set;
Second feature vector space calculation means for calculating a feature vector space of an unclassifiable category from a weighted average of a feature vector space in which all vectors have equal length, a feature vector space of the training document set, and a feature vector space of a classification target document set;
Classification category calculation means for calculating, as the classification destination category of each document, one of the categories or the unclassifiable category, by comparing the feature vector space of each category calculated by the first feature vector space calculation means and the feature vector space of the unclassifiable category calculated by the second feature vector space calculation means with the feature vector of each document constituting the classification target document set;
Category addition necessity determination means for determining whether the frequency with which the unclassifiable category is calculated as the classification destination category by the classification category calculation means exceeds a certain threshold; and
Category addition recommendation means for recommending that the operator add a new category when the category addition necessity determination means determines that the frequency with which the unclassifiable category is calculated as the classification destination category exceeds the certain threshold.
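
As a rough illustration only, the processing recited in the claim above can be sketched in Python. The sketch rests on several assumptions that the claim does not state: each "feature vector space" is represented by a single mean vector, the space in which all vectors have equal length is represented by a vector of ones, the three weights are equal, comparison is done by cosine similarity, and "frequency" is a plain document count. All function and variable names are hypothetical.

```python
import math
from collections import Counter

UNCLASSIFIABLE = "unclassifiable"  # hypothetical label for the unclassifiable category

def mean_vector(vectors):
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def category_centroids(training):
    """First calculation: one representative vector per category (assumed: the mean)."""
    return {cat: mean_vector(vs) for cat, vs in training.items()}

def unclassifiable_centroid(training, targets, weights=(1.0, 1.0, 1.0)):
    """Second calculation: weighted average of a uniform vector, the training-set
    mean and the target-set mean. The weights are an assumption."""
    uniform = [1.0] * len(targets[0])
    train_mean = mean_vector([v for vs in training.values() for v in vs])
    target_mean = mean_vector(targets)
    w_u, w_tr, w_ta = weights
    total = w_u + w_tr + w_ta
    return [(w_u * u + w_tr * a + w_ta * b) / total
            for u, a, b in zip(uniform, train_mean, target_mean)]

def classify(targets, centroids):
    """Assign each target document to the most similar category."""
    return [max(centroids, key=lambda c: cosine(v, centroids[c])) for v in targets]

def recommend_addition(labels, threshold):
    """True when the unclassifiable count exceeds the threshold."""
    return Counter(labels)[UNCLASSIFIABLE] > threshold

# Toy data: two trained categories and two target documents on an unseen topic.
training = {"sports": [[1, 0, 0], [0.9, 0.1, 0]],
            "politics": [[0, 1, 0], [0.1, 0.9, 0]]}
targets = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]

centroids = category_centroids(training)
centroids[UNCLASSIFIABLE] = unclassifiable_centroid(training, targets)
labels = classify(targets, centroids)
print(labels, recommend_addition(labels, threshold=1))
```

With this toy data the two documents on the unseen topic fall into the unclassifiable category, so a threshold of 1 triggers the recommendation to add a category. How the weights and threshold are actually chosen is not specified here and would need tuning in practice.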

The document classification apparatus according to claim 1, further comprising:
Category division necessity determination means for determining whether there is a category whose frequency of being calculated as the classification destination category by the classification category calculation means exceeds a certain threshold; and
Category division recommendation means for recommending division of the category to the operator when the category division necessity determination means determines that there is a category whose frequency of being calculated as the classification destination category exceeds the certain threshold.
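
The division check in the claim above amounts to counting how often each category is chosen as a classification destination and flagging those above a threshold. A minimal sketch follows, assuming that "frequency" is a plain count and that the unclassifiable category is excluded from division candidates (neither is stated in the claim); the function name and label strings are hypothetical.

```python
from collections import Counter

def categories_to_divide(labels, threshold, unclassifiable="unclassifiable"):
    """Return the regular categories whose assignment count exceeds the threshold."""
    counts = Counter(labels)
    return sorted(cat for cat, n in counts.items()
                  if cat != unclassifiable and n > threshold)

# Toy classification result: "sports" has grown large enough to warrant division.
labels = ["sports"] * 5 + ["politics"] * 2 + ["unclassifiable"] * 4
print(categories_to_divide(labels, threshold=3))
```

In this sketch the function only produces candidates; consistent with the claim, the decision to divide is left to the operator.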

A document classification method for classifying an electronic document into one of a plurality of preset categories, executed on a computer comprising first feature vector space calculation means, second feature vector space calculation means, classification category calculation means, category addition necessity determination means, and category addition recommendation means, the method comprising:
A first feature vector space calculation step in which the first feature vector space calculation means calculates a feature vector space of each category from the feature vectors of the documents constituting a training document set;
A second feature vector space calculation step in which the second feature vector space calculation means calculates a feature vector space of an unclassifiable category from a weighted average of a feature vector space in which all vectors have equal length, a feature vector space of the training document set, and a feature vector space of a classification target document set;
A classification category calculation step in which the classification category calculation means calculates, as the classification destination category of each document, one of the categories or the unclassifiable category, by comparing the feature vector space of each category calculated in the first feature vector space calculation step and the feature vector space of the unclassifiable category calculated in the second feature vector space calculation step with the feature vector of each document constituting the classification target document set;
A category addition necessity determination step in which the category addition necessity determination means determines whether the frequency with which the unclassifiable category is calculated as the classification destination category in the classification category calculation step exceeds a certain threshold; and
A category addition recommendation step in which the category addition recommendation means recommends that the operator add a new category when the category addition necessity determination step determines that the frequency with which the unclassifiable category is calculated as the classification destination category exceeds the certain threshold.

The document classification method according to claim 3, wherein the computer further comprises category division necessity determination means and category division recommendation means, the method further comprising:
A category division necessity determination step in which the category division necessity determination means determines whether there is a category whose frequency of being calculated as the classification destination category in the classification category calculation step exceeds a certain threshold; and
A category division recommendation step in which the category division recommendation means recommends division of the category to the operator when the category division necessity determination step determines that there is a category whose frequency of being calculated as the classification destination category exceeds the certain threshold.

A computer-readable recording medium recording a program for a document classification device that classifies, using a computer, an electronic document into one of a plurality of preset categories, the program causing the computer to function as:
First feature vector space calculation means for calculating a feature vector space of each category from the feature vectors of the documents constituting a training document set;
Second feature vector space calculation means for calculating a feature vector space of an unclassifiable category from a weighted average of a feature vector space in which all vectors have equal length, a feature vector space of the training document set, and a feature vector space of a classification target document set;
Classification category calculation means for calculating, as the classification destination category of each document, one of the categories or the unclassifiable category, by comparing the feature vector space of each category calculated by the first feature vector space calculation means and the feature vector space of the unclassifiable category calculated by the second feature vector space calculation means with the feature vector of each document constituting the classification target document set;
Category addition necessity determination means for determining whether the frequency with which the unclassifiable category is calculated as the classification destination category by the classification category calculation means exceeds a certain threshold; and
Category addition recommendation means for recommending that the operator add a new category when the category addition necessity determination means determines that the frequency with which the unclassifiable category is calculated as the classification destination category exceeds the certain threshold.