JP4669642B2

JP4669642B2 - Document classification apparatus, document classification method, and computer-readable recording medium storing a program for causing a computer to execute the document classification method

Info

Publication number: JP4669642B2
Application number: JP2001257049A
Authority: JP
Inventors: 敦夫嶋田
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2001-08-27
Filing date: 2001-08-27
Publication date: 2011-04-13
Anticipated expiration: 2021-08-27
Also published as: JP2003067398A

Description

【０００１】
【発明の属する技術分野】
本発明は、文書の内容（トピックス）に基づいて文書を分類する技術の領域、特にベクトル空間モデル(ＶｅｃｔｏｒＳｐａｃｅＭｏｄｅｌ)に基づく文書分類技術におけるベクトル空間の修正を行なって文書を分類する文書分類装置および文書分類方法、並びに文書分類方法をコンピュータに実行させるプログラムを記録したコンピュータ読み取り可能な記録媒体に関する。
【０００２】
【従来の技術】
近年、ＩＴ（情報技術）の普及の急激な進歩および普及に伴い、ネットワークを介して国内外の大量の電子化文書へのアクセスが可能になってきており、その普及に比例して業務上蓄積される情報の量も増大化しつつある。このような状況の中、収集した大量の文書情報を意味のあるカテゴリ（ｃａｔｅｇｏｒｉａｌ）に分類するなどの知的作業が行なわれるようになってきている。この文書を意味に分類するか、あるいは話題毎に分類するという作業の目的は以下に記述する（１）、（２）の２つがある。
【０００３】
（１）検索容易性の向上を図る。すなわち、膨大な文書群を分類名称（内容名）を手がかりに探索できるので、所望の文書の属する文書集合を取りこむことができる。
（２）情報群全体の構造を把握する。すなわち、文書群全体がどのような内容（個々の分類）で構成されているかを把握する。
【０００４】
しかし、大量の文書情報をユーザが手動で分類する場合、分類の正確性に優れるものの、人的および時間的なコストが増大するため、膨大な文書を扱う近年の文書環境においては実質上不可能になっており、自動文書分類装置が提案されるようになってきている。
【０００５】
文書の自動分類として、上記（１）を目的としたものが、たとえば特開平７−３６８９７号公報、特開平１０−２６０９９１号公報、特開平１０−１７８２３号公報、特開平１０−２６０９９１号公報、「ＰｒｏｊｅｃｔｉｏｎｆｏｒＥｆｆｉｃｉｅｎｔＤｏｃｕｍｅｎｔＣｌｕｓｔｅｒｉｎｇ，Ｈｉｎｒｉｃｈ
ＳｃｈｕｔｚｅａｎｄＣｒａｉｎｇＳｉｌｖｅｒｓｔｏｎｅ，１９９７，ＰｒｏｃｅｅｄｉｎｇｓｏｆＳＩＧＩＲ９７，ｐｐ７４−８１，ＡＣＭ」に開示されている。ここでは、文書を、単語を特徴とする文書ベクトルと見なし、ベクトル間の類似度(距離)を測度として、クラスタリング手法を用いてこれらの文書の群分けをし、文書を自動分類している。
【０００６】
一方、上記（２）を目的としたものが、たとえば特開平１１−１５８３５号公報に開示されている。ここでは、上記と同様に類似度測度に基づいて文書を自動的に分類している。
【０００７】
クラスタリングが、文書ベクトル間の類似度にしたがってアドホック分類体系を生成するアルゴリズムであるのに対し、カテゴライゼーションあるいはクラシフィケーションと呼ばれる方法がある。この方法は、あらかじめ幾つかのカテゴリ（分類）を設けておき、各ドキュメントがそれぞれどのカテゴリに属するかを判断することにより、文書を自動的に分類するものである。その中でも古典的なものであるが、Ｓａｌｔｏｎらの提案する分類装置が、「ＩｎｔｒｏｄｕｃｔｉｏｎｔｏＭｏｄｅｒｎＩｎｆｏｒｍａｔｉｏｎＲｅｔｒｉｅｖａｌ，Ｇ．ＳａｌｔｏｎａｎｄＭ．Ｊ．ＭｃＧｉｌｌ，１９８３，ＮｅｗＹｏｒｋｍＮｃＧｒａｗＨｉｌｌ」に開示されている。
【０００８】
この装置では、文書ベクトルと、カテゴリのベクトルとの間の類似度（余弦）を計算し、当該文書が、どのカテゴリにもっとも類似しているかにより、文書を自動分類するものである。さらに、特許第２９４０５０１号公報では、分類に用いる単語のクラスタ化に関する改良を行ない、自動分類の精度を向上させている。
【０００９】
上述した自動分類の方法には、基本的に文書から抽出した単語により構成した空間にベクトルとして文書を配置し、文書間の類似度を計算し、クラスタリングやカテゴライゼーションを行なう特徴がある。したがって、どのような空間を構成するかによって文書分類の結果が左右される。
【００１０】
ところで、文書には、特定の情報が本文以外に付与されることが多い。たとえば、文書作成者名（ａｕｔｈｏｒ）や文書作成日などの書誌事項がこれに該当する。また、ＳＧＭＬ（ＳｔａｎｄａｒｄＧｅｎｅｒａｉｚｅｄＭａｒｋｕｐＬａｎｇｕａｇｅ：標準一般化マーク付け言語）やＸＭＬ（ｅＸｔｅｎｓｉｂｌｅＭａｒｋｕｐＬａｎｇｕａｇｅ：拡張可能なマーク付け言語）などの構造化文書では、本文中に現れるこうした特定の情報にあらかじめタグ（ｔａｇ）が付与され、管理できるようになっている。
【００１１】
さらに、こうした特定情報を自動的に抽出する技術として「情報抽出技術（ｉｎｆｏｒｍａｔｉｏｎｅｘｔｒａｃｔｉｏｎ）」が開示され、現在も利用可能になっている。たとえば、１９９９年に日本で開催されたＩＲＥＸ（ＩｎｆｏｒｍａｔｉｏｎＲｅｔｒｉｅｖａｌａｎｄＥｘｔｒａｃｔｉｏｎＥｘｅｒｃｉｓｅ）では、テキスト中（本文中）に現れる組織名（ｅｘ．米軍）や人名（ｅｘ．クリントン）、地名、固有物などの自動抽出技術が開示されている。
【００１２】
【発明が解決しようとする課題】
しかしながら、上記に示されるような従来の技術にあっては、抽出された特定情報（単語）が文書分類の際の空間構成に利用されると、その特定情報（部分情報）が生成されるため、特定情報の抽出と文書分類とを併用するメリットが半減するという不具合があった。すなわち、特定情報を含む文書集合を求める場合は、その特定情報をキーとして文書を検索すればよいので、文書分類技術では、特定情報とは異なる観点から分類されることが望ましい。
【００１３】
本発明は、上記に鑑みてなされたものであって、抽出された特定情報を、文書分類の際の空間構成から排除することにより、効果的な文書分類を実現することを目的とする。
【００１４】
【課題を解決するための手段】
上記の目的を達成するために、請求項１にかかる文書分類装置にあっては、分類対象の文書情報を入力する文書入力手段と、前記文書入力手段により入力された前記文書情報を解析し、形態素解析の結果である言語解析情報を得る言語解析手段と、前記文書入力手段により入力された前記文書情報を解析し、付属情報または固有名称である特定情報を抽出する情報抽出手段と、前記情報抽出手段による前記特定情報の出力にしたがって、前記言語解析情報から前記特定情報を除去することにより、前記言語解析情報を修正する言語解析情報修正手段と、前記言語解析情報修正手段により前記言語解析情報を修正した複数の単語の出力にしたがって前記文書情報に対する文書特徴ベクトルを生成する文書特徴ベクトル生成手段と、前記文書特徴ベクトル生成手段により生成された前記文書特徴ベクトルにしたがって複数の前記文書情報を分類し、前記文書情報で構成される文書グループを複数生成する文書分類手段と、を備え、前記情報抽出手段は、構造化文書に付随するタグ情報にしたがって特定情報を得るものである。
【００１５】
この発明によれば、文書入力手段により入力された文書情報を、言語解析手段で解析して形態素解析の結果である言語解析情報を取得し、さらに情報抽出手段が文書情報から付属情報または固有名称である特定情報を抽出し、言語解析情報修正手段が上記言語解析情報から上記特定情報を除去することにより言語解析情報を修正し、文書特徴ベクトル生成手段が文書特徴ベクトルを生成し、その文書特徴ベクトルにしたがって、文書分類手段がたとえばクラスタリングやカテゴライゼーションなどの手法を用いて文書を分類し、文書情報で構成される文書グループを複数生成することにより、話題による分類結果と、特定情報による分類（任意の特定情報（タグ、属性名）を持つ文書をグループ化すること）の結果とが、内容的に重複しないようにすることが可能になる。また、分類対象の文書情報がたとえばＳＧＭＬやＸＭＬなどの構造化文書である場合、情報抽出手段が構造化文書に付随するタグ情報から特定情報を抽出することができる。
【００１６】
また、請求項２にかかる文書分類装置にあっては、分類対象の文書情報を入力する文書入力手段と、前記文書入力手段により入力された前記文書情報を解析し、形態素解析の結果である言語解析情報を得る言語解析手段と、前記文書入力手段により入力された前記文書情報を解析し、付属情報または固有名称である特定情報を抽出する情報抽出手段と、前記言語解析手段により得られた前記言語解析情報にしたがって前記文書情報に対する文書特徴ベクトルを生成する文書特徴ベクトル生成手段と、前記情報抽出手段による前記特定情報の出力にしたがって、前記文書特徴ベクトルから前記特定情報を除去することにより、前記文書特徴ベクトルを修正する文書特徴ベクトル修正手段と、前記文書特徴ベクトル修正手段のベクトル修正の出力にしたがって複数の前記文書情報を分類し、文書情報で構成される文書グループを複数生成する文書分類手段と、を備え、前記情報抽出手段は、構造化文書に付随するタグ情報にしたがって特定情報を得るものである。
【００１７】
この発明によれば、文書入力手段により入力された文書情報を、言語解析手段で解析して形態素解析の結果である言語解析情報を取得し、さらに情報抽出手段が文書情報から付属情報または固有名称である特定情報を抽出し、文書特徴ベクトル生成手段が上記言語解析情報にしたがって文書情報に対する文書特徴ベクトルを生成し、さらに文書特徴ベクトルから上記特定情報を除去することにより文書特徴ベクトルを修正し、その修正された文書特徴ベクトルに基づいて、文書分類手段が文書を分類し、文書情報で構成される文書グループを複数生成することにより、話題による分類結果と、特定情報による分類（任意の特定情報（タグ、属性名）を持つ文書をグループ化すること）の結果とが、内容的に重複しないようにすることが可能になる。また、分類対象の文書情報がたとえばＳＧＭＬやＸＭＬなどの構造化文書である場合、情報抽出手段が構造化文書に付随するタグ情報から特定情報を抽出することができる。
【００１８】
また、請求項３にかかる文書分類装置にあっては、前記情報抽出手段は、固有名詞表現を抽出することにより特定情報を得るものである。
【００１９】
この発明によれば、請求項１または２において、情報抽出手段が入力された文書情報から固有名詞表現の特定情報を取得することにより、この固有名詞表現の影響を排除した文書特徴ベクトルを生成することが可能になる。
【００２２】
また、請求項４にかかる文書分類装置にあっては、さらに、前記情報抽出手段により抽出された特定情報を表示する表示手段と、１つ以上の特定情報の選択を受け付ける抽出情報選択手段と、を備えたものである。
【００２３】
この発明によれば、情報抽出手段により抽出された特定情報を表示し、ユーザがこの表示された特定情報から、分類計算の際に排除すべき特定情報を選択して指定することにより、柔軟な分類処理が可能になる。
【００２４】
また、請求項５にかかる文書分類方法にあっては、あらかじめ用意されたプログラムをコンピュータ上で実行することにより実現される文書分類方法であって、前記プログラムを実行することにより、前記コンピュータが、分類対象の文書情報を入力する文書入力工程と、前記文書入力工程により入力された前記文書情報を解析し、形態素解析の結果である言語解析情報を得る言語解析工程と、前記文書入力工程により入力された前記文書情報を解析し、付属情報または固有名称である特定情報を抽出する情報抽出工程と、前記情報抽出工程による前記特定情報の出力にしたがって前記言語解析情報を修正する言語解析情報修正工程と、前記言語解析情報修正工程により前記言語解析情報を修正した複数の単語の出力にしたがって前記文書情報に対する文書特徴ベクトルを生成する文書特徴ベクトル生成工程と、前記文書特徴ベクトル生成工程により生成された文書特徴ベクトルにしたがって複数の前記文書情報を分類し、前記文書情報で構成される文書グループを複数生成する文書分類工程と、を実行し、前記情報抽出工程は、構造化文書に付随するタグ情報にしたがって特定情報を得るものである。
【００２５】
この発明によれば、文書入力工程により入力された文書情報を、言語解析工程で解析して形態素解析の結果である言語解析情報を取得し、さらに情報抽出工程が文書情報から付属情報または固有名称である特定情報を抽出し、言語解析情報修正工程が上記言語解析情報から上記特定情報を除去することにより言語解析情報を修正し、文書特徴ベクトル生成工程が文書特徴ベクトルを生成し、その文書特徴ベクトルにしたがって、文書分類工程がたとえばクラスタリングやカテゴライゼーションなどの手法を用いて文書を分類し、文書情報で構成される文書グループを複数生成することにより、話題による分類結果と、特定情報による分類（任意の特定情報（タグ、属性名）を持つ文書をグループ化すること）の結果とが、内容的に重複しないようにすることが可能になる。また、分類対象の文書情報がたとえばＳＧＭＬやＸＭＬなどの構造化文書である場合、情報抽出工程は構造化文書に付随するタグ情報から特定情報を抽出することができる。
【００２６】
また、請求項６にかかる文書分類方法にあっては、あらかじめ用意されたプログラムをコンピュータ上で実行することにより実現される文書分類方法であって、前記プログラムを実行することにより、前記コンピュータが、分類対象の文書情報を入力する文書入力工程と、前記文書入力工程により入力された前記文書情報を解析し、形態素解析の結果である言語解析情報を得る言語解析工程と、前記文書入力工程により入力された前記文書情報を解析し、付属情報または固有名称である特定情報を抽出する情報抽出工程と、前記言語解析工程により得られた前記言語解析情報にしたがって前記文書情報に対する文書特徴ベクトルを生成する文書特徴ベクトル生成工程と、前記情報抽出工程による前記特定情報の出力にしたがって前記文書特徴ベクトルから前記特定情報を除去することにより、前記文書特徴ベクトルを修正する文書特徴ベクトル修正工程と、前記文書特徴ベクトル修正工程のベクトル修正の出力にしたがって複数の前記文書情報を分類し、前記文書情報で構成される文書グループを複数生成する文書分類工程と、を実行し、前記情報抽出工程は、構造化文書に付随するタグ情報にしたがって特定情報を得るものである。
【００２７】
この発明によれば、文書入力工程により入力された文書情報を、言語解析工程で解析して形態素解析の結果である言語解析情報を取得し、さらに情報抽出工程が文書情報から付属情報または固有名称である特定情報を抽出し、文書特徴ベクトル生成工程が上記言語解析情報にしたがって文書情報に対する文書特徴ベクトルを生成し、さらに文書特徴ベクトルから前記特定情報を除去することにより文書特徴ベクトルを修正し、その修正された文書特徴ベクトルに基づいて、文書分類工程が文書を分類し、文書情報で構成される文書グループを複数生成することにより、話題による分類結果と、特定情報による分類（任意の特定情報（タグ、属性名）を持つ文書をグループ化すること）の結果とが、内容的に重複しないようにすることが可能になる。また、分類対象の文書情報がたとえばＳＧＭＬやＸＭＬなどの構造化文書である場合、情報抽出工程は構造化文書に付随するタグ情報から特定情報を抽出することができる。
【００２８】
また、請求項７にかかる文書分類方法にあっては、前記情報抽出工程は、固有名詞表現を抽出することにより特定情報を得るものである。
【００２９】
この発明によれば、請求項５または６において、情報抽出工程が、入力された文書情報から固有名詞表現の特定情報を取得することにより、この固有名詞表現の影響を排除した文書特徴ベクトルを生成することが可能になる。
【００３２】
また、請求項８にかかる文書分類方法にあっては、さらに、前記情報抽出手段により抽出された特定情報を表示する表示工程と、１つ以上の特定情報の選択を受け付ける抽出情報選択工程と、を含むものである。
【００３３】
この発明によれば、情報抽出工程により抽出された特定情報を表示し、ユーザがこの表示された特定情報から、分類計算の際に排除すべき特定情報を選択して指定することにより、柔軟な分類処理が可能になる。
【００３４】
また、請求項９にかかるコンピュータ読み取り可能な記録媒体にあっては、前記請求項５〜８の何れか１つに記載の文書分類方法をコンピュータに実行させるプログラムを記録したものである。
【００３５】
この発明によれば、請求項５〜８の何れか１つに記載の文書分類方法を、プログラム化してコンピュータ読み取り可能な記録媒体に記録することにより、コンピュータ上でこの文書分類方法を実行させることが可能になる。
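[0036]
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
Preferred embodiments of the document classification apparatus, the document classification method, and the computer-readable recording medium storing a program for causing a computer to execute the document classification method according to the present invention are described in detail below with reference to the accompanying drawings. The present invention is not limited to these embodiments.
[0037]
(Embodiment 1)
FIG. 1 is a block diagram showing the system configuration of a document classification apparatus according to Embodiment 1 of the present invention. In this apparatus, the following units are connected to a bus 100: a document input unit 101 that inputs the document information to be classified; a language analysis unit 102 that analyzes the document information input by the document input unit 101 and obtains language analysis information; an information extraction unit 103 that analyzes the same document information and obtains specific information; a language analysis information correction unit 104 that corrects the language analysis information according to the output of the information extraction unit 103; a document feature vector generation unit 105 that generates a document feature vector for the document information according to the output of the language analysis information correction unit 104; and a document classification unit 106 that classifies the document information according to the document feature vectors generated by the document feature vector generation unit 105 and generates subsets of documents.
[0038]
FIG. 2 is a flowchart showing the basic document classification procedure of the apparatus in FIG. 1. First, the document input unit 101 executes document input processing to input the document information to be classified (step S11), and the language analysis unit 102 analyzes the input document information to obtain language analysis information (step S12). Next, the information extraction unit 103 analyzes the input document information to obtain specific information (step S13), and the language analysis information correction unit 104 corrects the language analysis information according to the output of the information extraction processing (step S14). The document feature vector generation unit 105 then generates a document feature vector for the document information according to the output of the correction processing (step S15), and the document classification unit 106 classifies the document information according to the generated document feature vectors and generates subsets of documents (step S16).
[0039]
As a suitable example in which both the extraction of specific information and document classification based on the content (topics) of a document set matter, the analysis of free-text responses obtained from a questionnaire survey is assumed and described below with a concrete example.
[0040]
In recent years it has become possible to collect several thousand to several hundred thousand free-text responses on a computer in a short period, for example via the Internet, so large amounts of text information can be gathered in this way.
[0041]
As an example of a large amount of text information obtained from a questionnaire survey, suppose that respondents are asked to describe their requests concerning the printer they own. The questionnaire consists of the free-text question "Requests concerning the printer" (Q1) together with the name of the printer (product name), the manufacturer of the printer, and the degree of satisfaction with the printer. Each response is treated as one document, and N responses are assumed to have been received in total.
[0042]
In this free-text response example, the document set takes a form such as that shown in FIG. 3. Assume that the analyst (the operator of the invention), as one of the analysis activities, wants to understand what kinds of opinions (topics) the response set (document set) contains and how they relate to manufacturers and products.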
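[0043]
First, the response set to be classified is loaded into the system via the document input unit 101. The input is normally recorded and saved for later processing. The language analysis unit 102 then extracts the words and compound words (or specific contiguous character strings) contained in each document (each response). A known language analysis algorithm such as morphological analysis is used for this processing. The following is an example in which the language analysis unit 102 has extracted nouns, adjectives, and adjectival nouns.
[0044]
ID0001 →
XL・100 (unregistered word), consumables (common noun), high (adjective), printing (verbal noun)
ID0002 →
Company A (unregistered word), salesman (common noun), technology (common noun), knowledge (common noun), abundant (adjectival noun), trust (common noun)
ID0003 →
Company B (unregistered word), PRX・4000 (unregistered word), printing (verbal noun), speed (common noun), satisfaction (verbal noun), in-house newsletter (common noun), business (common noun), manual (common noun), use (verbal noun)
ID000N →
Company A (unregistered word), trust (common noun), high (adjective), usage (verbal noun)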
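[0045]
Next, the information extraction unit 103 extracts specific information from the response set. The first extraction method outputs to the language analysis information correction unit 104 the bibliographic items attached to each document, or the attribute values of attributes other than the text attribute to be classified (in the FIG. 3 example, attributes other than Q1), namely the value of the product name field and the value of the manufacturer field. Which attributes have their values output to the language analysis information correction unit 104 may be stored in advance in a file or the like. In this example, the attribute values of "product name" and "manufacturer" are extracted, and the attribute value of "satisfaction" is not.
[0046]
The second extraction method can be used when attached information such as "product name" or "manufacturer" has not been obtained in advance as attribute values. It relies on the known technique called information extraction, which automatically extracts proper names such as the following from the text written in Q1.
[0047]
Organization name (company name)
Person name
Place name
Product name
Date
Time
Amount of money
Percentage
etc.
[0048]
With this information extraction technique, "XL・100" is extracted as a product name from response ID0001, "Company A" as a company name from ID0002, "Company B" as a company name and "PRX・4000" as a product name from ID0003, and "Company A" as a company name from ID000N.
[0049]
Next, a method of correcting the vector space that represents documents and words according to the specific information extracted in this way is described. There is a known technique called a "stop word list", in which arbitrary words (tokens) that should not contribute to document classification are listed in a file or the like. The present invention makes the specific information extracted by the information extraction unit 103 function like a stop word list, either automatically or by user selection.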
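[0050]
The language analysis information correction unit 104 removes the specific information extracted by the information extraction unit 103 from the language analysis information output by the language analysis unit 102. For response ID0001, for example, the language analysis unit 102 outputs the following language analysis information.
[0051]
ID0001 →
XL・100 (unregistered word), consumables (common noun), high (adjective), printing (verbal noun)
[0052]
The language analysis information correction unit 104 removes "XL・100", which was extracted by the information extraction unit 103, from this language analysis information and outputs: ID0001 → consumables (common noun), high (adjective), printing (verbal noun).
[0053]
Similarly, ID0002 through ID000N become:
ID0002 →
salesman (common noun), technology (common noun), knowledge (common noun), abundant (adjectival noun), trust (common noun)
ID0003 →
printing (verbal noun), speed (common noun), satisfaction (verbal noun), in-house newsletter (common noun), business (common noun), manual (common noun), use (verbal noun)
ID000N →
trust (common noun), high (adjective), usage (verbal noun)
[0054]
As a result, the subsequent generation of document feature vectors and the document classification can be carried out without using the tokens extracted by the information extraction unit 103.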
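The removal step performed by the language analysis information correction unit 104 can be illustrated with a minimal Python sketch. It assumes that the language analysis output is available as a list of token strings per response (English glosses are used here and the part-of-speech labels are omitted for brevity) and that the extracted specific information is held as one set applied to all responses, in the manner of a stop word list. The function name `remove_specific_info` is hypothetical and does not appear in the patent.

```python
def remove_specific_info(tokens_by_doc, specific_info):
    """Return a copy of the token lists with every extracted specific item removed."""
    return {
        doc_id: [tok for tok in tokens if tok not in specific_info]
        for doc_id, tokens in tokens_by_doc.items()
    }

# Tokens as produced by the language analysis step (illustrative English glosses).
tokens_by_doc = {
    "ID0001": ["XL・100", "consumables", "high", "printing"],
    "ID0002": ["Company A", "salesman", "technology", "knowledge", "abundant", "trust"],
}

# Specific information from the information extraction step, used like a stop word list.
specific_info = {"XL・100", "Company A", "Company B", "PRX・4000"}

cleaned = remove_specific_info(tokens_by_doc, specific_info)
print(cleaned["ID0001"])
```

Keeping the extracted items in a single set makes the correction a plain filter, so the same set could also be narrowed by a user before it is applied.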
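[0055]
Next, according to the above output of the language analysis information correction unit 104, the document feature vector generation unit 105 generates a "token" × "document (response)" matrix in which tokens (feature description units) such as words are the columns, the documents are the rows, and each element is, for example, the frequency of occurrence of the word.
[0056]
The term "token" is used above because a language analysis unit 102 with general morphological analysis and syntactic analysis functions can obtain part-of-speech information, compound words (phrases), and syntactic information in addition to the extracted words. This makes the following (1) and (2) possible, for example.
[0057]
(1) Treat "国際連合" (United Nations) as a single compound-word token rather than as the two words "国際 (common noun)" and "連合 (common noun)".
(2) Treat the word "説明" (explanation; verbal noun) according to where it appears, for example as "the verbal noun '説明' appearing in the predicate", and distinguish it from occurrences in the subject.
[0058]
The document feature vector generation unit 105 obtains document vectors from this "token" × "document" matrix. There are three methods for doing so, and any of them may be used in the present invention.
[0059]
(1) Use the row of the matrix that corresponds to each document as its document feature vector as is.
(2) Weight the values in consideration of the length of each document (measured by the number of characters, the number of pages, or the like) and the frequency of each token within the document set to be classified, and then use the result as the document feature vector.
(3) Compute the inner-product matrix between documents from the above matrix, apply singular value decomposition to it to construct a latent semantic space, and use the position of each document in that space as its vector. This technique can be implemented by referring to "Projections for Efficient Document Clustering", Hinrich Schütze and Craig Silverstein, 1997, Proceedings of SIGIR '97, pp. 74-81, ACM.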
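As a rough illustration of methods (1) and (2), the following Python sketch builds a count matrix with one row per document and one column per token, then applies a simple length normalization. The normalization shown (dividing each count by the document's total token count) is only one possible weighting chosen for this sketch; the patent does not fix a formula. The names `build_matrix` and `length_normalize` are hypothetical.

```python
from collections import Counter

def build_matrix(docs):
    """Return (vocab, rows): one row of token counts per document, one column per token."""
    vocab = sorted({tok for toks in docs.values() for tok in toks})
    rows = {}
    for doc_id, toks in docs.items():
        counts = Counter(toks)
        rows[doc_id] = [counts[tok] for tok in vocab]
    return vocab, rows

def length_normalize(row):
    """One possible weighting for method (2): divide counts by the document length."""
    total = sum(row)
    return [count / total for count in row] if total else list(row)

# Corrected token lists (specific information already removed), as English glosses.
docs = {
    "ID0001": ["consumables", "high", "printing"],
    "ID000N": ["trust", "high", "usage"],
}

vocab, rows = build_matrix(docs)
weighted = {doc_id: length_normalize(row) for doc_id, row in rows.items()}
```

Method (3) would additionally require a singular value decomposition, which is omitted here to keep the sketch free of external libraries.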
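[0060]
The document classification unit 106 classifies documents using the similarity between the document feature vectors output by the document feature vector generation unit 105. Possible similarity measures include the inner product, the cosine, the Euclidean distance, and the Mahalanobis distance, and any of them can be used.
[0061]
Document classification methods include clustering, a bottom-up method that groups similar documents according to the similarity between document feature vectors, and categorization, in which several categories (classes) are defined in advance and documents are classified automatically by determining which category each document belongs to. The clustering method and the categorization method are described below.
[0062]
First, the clustering method is described. Many computation methods are known for clustering, including hierarchical and non-hierarchical algorithms, and any of them can be used. The K-means algorithm is taken up here as a representative example. This algorithm partitions the set of n document feature vectors generated by the document feature vector generation unit 105 into k sets of vectors according to their similarity. The value of k must be given in advance.
[0063]
The computation procedure of this algorithm is as follows.
(1) Select initial values for the k cluster centroids from among the n vectors.
(2) Assign each of the n vectors to the cluster centroid to which it is most similar.
(3) For each of the k clusters, compute the mean of the vectors it contains and use it as the new cluster centroid.
(4) Repeat (2) and (3) until a termination condition is satisfied, for example until the cluster centroids no longer move.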
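The four steps of the K-means procedure can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: squared Euclidean distance is used as the (dis)similarity measure, the initial centroids are simply the first k vectors, and an iteration cap `max_iter` is added as a safeguard. None of these choices is prescribed by the patent, and the function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def k_means(vectors, k, max_iter=100):
    # (1) Initial centroids: here the first k vectors (an assumption of this sketch).
    centroids = [list(v) for v in vectors[:k]]
    assignment = []
    for _ in range(max_iter):
        # (2) Assign each vector to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # (3) Recompute each centroid as the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            new_centroids.append(mean(members) if members else centroids[c])
        # (4) Stop when the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return assignment, centroids

vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
assignment, centroids = k_means(vectors, k=2)
print(assignment)
```

A cluster that loses all its members keeps its previous centroid in this sketch; other implementations re-seed such clusters instead.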
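[0064]
With this clustering method, multiple document sets (clusters), each made up of mutually similar documents, are obtained. Because the specific information extracted by the information extraction unit 103 is not used in computing the similarity between document feature vectors, the clusters are generated without being affected by it. In other words, no cluster is formed that is dominated by specific information (for example, a particular manufacturer).
[0065]
Next, the categorization method is described. Various methods have already been proposed; among them, the method of G. Salton and M. J. McGill (Introduction to Modern Information Retrieval, 1983, New York: McGraw-Hill) is well known.
[0066]
In this categorization method, the categories into which documents are to be classified are defined in advance. If each category is defined by, for example, designating sample documents, a vector for each category can be generated from its sample documents. When a category has multiple sample documents, its category vector is obtained by adding the sample document vectors. To classify the target documents appropriately, the similarity between each document feature vector and each category vector is computed, and each document is assigned to the most similar category. The document-word (token) space on which these similarity computations are based does not contain the specific information extracted by the information extraction unit 103, so documents can be placed in categories without being influenced by the specific information.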
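The categorization step described above can be sketched as follows: category vectors are formed by summing sample document vectors, and each target document is assigned to the category with the highest cosine similarity. The category names and all numeric vectors are hypothetical values made up for this illustration, and cosine is only one of the similarity measures the text allows.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def category_vector(sample_vectors):
    """Sum the sample document vectors of one category component-wise."""
    return [sum(column) for column in zip(*sample_vectors)]

def categorize(doc_vector, categories):
    """Return the name of the category whose vector is most similar to doc_vector."""
    return max(categories, key=lambda name: cosine(doc_vector, categories[name]))

# Hypothetical categories, each defined by sample document vectors.
categories = {
    "running cost": category_vector([[2, 1, 0, 0], [1, 1, 0, 0]]),
    "sales support": category_vector([[0, 0, 1, 2]]),
}

label = categorize([1, 0, 0, 0], categories)
print(label)
```

Because the token columns are assumed to exclude the extracted specific information already, a product or company name cannot pull a document toward a category.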
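[0067]
Embodiment 1 as described above therefore makes it possible to examine the relationship between the specific information extracted by the information extraction unit 103 (for example, product names) and the content-based classification results produced by the document classification unit 106 (for example, the content of user requests). In the example above, analysis such as using cross tabulation to identify differences in user requests for each product name can be carried out.
[0068]
(Embodiment 2)
FIG. 4 is a block diagram showing the system configuration of a document classification apparatus according to Embodiment 2 of the present invention. In this apparatus, the following units are connected to the bus 100: the document input unit 101 that inputs the document information to be classified; the language analysis unit 102 that analyzes the document information input by the document input unit 101 and obtains language analysis information; the information extraction unit 103 that analyzes the same document information and extracts specific information; the document feature vector generation unit 105 that generates a document feature vector for the document information according to the language analysis information obtained by the language analysis unit 102; a vector correction unit 107 that corrects the document feature vectors according to the output of the information extraction unit 103; and the document classification unit 106 that classifies the document information according to the output of the vector correction unit 107 and generates subsets of documents.
[0069]
FIG. 5 is a flowchart showing the basic document classification procedure of the apparatus in FIG. 4. First, the document input unit 101 executes document input processing to input the document information to be classified (step S21), and the language analysis unit 102 analyzes the input document information to obtain language analysis information (step S22). Next, the information extraction unit 103 analyzes the input document information to obtain specific information (step S23), and the document feature vector generation unit 105 generates a document feature vector for the document information according to the language analysis information from the language analysis unit 102 (step S24). The document feature vectors are then corrected according to the output of the information extraction unit 103 (step S25), and the document classification unit 106 classifies the document information according to the corrected document feature vectors and generates subsets of documents (step S26).
[0070]
That is, Embodiment 2 differs from Embodiment 1 in that the correction based on the specific information extracted by the information extraction unit 103 is applied to the generated document feature vectors. The document feature vector generation unit 105 generates a vector for each document from the tokens extracted by the language analysis unit 102, as in Embodiment 1. Because the vectors are represented as shown in FIG. 6, for example, the specific information extracted by the information extraction unit 103 can be excluded by deleting the corresponding column vector.
[0071]
For example, if the specific information extracted by the information extraction unit 103 is token 3 in FIG. 6, the vector correction unit 107 can eliminate its influence by deleting the column for token 3. When singular value decomposition or the like has been applied and the feature dimensions are obtained not as tokens but as composite dimensions (m1, m2, m3, ..., mI, ..., mM), the same effect is achieved by finding the dimension mI that correlates most highly with the specific information to be excluded and removing it.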
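The column deletion performed by the vector correction unit 107 can be illustrated with a short Python sketch. The matrix is assumed to be stored as a list of rows (one per document) alongside a list of column labels; the token labels and counts below are hypothetical and are not the values of FIG. 6. The function name `drop_token_columns` is likewise an assumption of this sketch.

```python
def drop_token_columns(vocab, rows, excluded):
    """Delete every column whose token label is in `excluded`."""
    keep = [i for i, tok in enumerate(vocab) if tok not in excluded]
    new_vocab = [vocab[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_vocab, new_rows

# Hypothetical document-by-token count matrix (rows: documents, columns: tokens).
vocab = ["token1", "token2", "token3", "token4"]
rows = [
    [1, 0, 2, 0],
    [0, 1, 1, 3],
]

# "token3" stands for a token that was extracted as specific information.
new_vocab, new_rows = drop_token_columns(vocab, rows, {"token3"})
print(new_vocab, new_rows)
```

The sketch covers only the token-column case; removing the most correlated composite dimension after a singular value decomposition would need additional code.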
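[0072]
The information extraction unit 103 in Embodiment 1 or Embodiment 2 may obtain the specific information by extracting proper nouns.
[0073]
The information extraction unit 103 may also extract specific information based on the tag information attached to a structured document. The case in which the documents input from the document input unit 101 are structured documents such as SGML or XML is described below as an example.
[0074]
In a structured document, a document such as "Press release: High-speed color printer [XL-1000] released. Company A, Ltd. will launch the new high-speed color printer XL-1000 into the office market on January 7, 2000. The printing speed is 60 pages per minute in full color. The selling price is planned to be 800,000 yen, and the printer will be sold through dealers nationwide." is written in the following form.
[0075]
<document>
<h1>Press release: High-speed color printer "XL-1000" released</h1><p><sendername>Company A, Ltd.</sendername> will launch the new high-speed <producttype>color printer</producttype> <productname>XL-1000</productname> into the office market on <releasedate>January 7, 2000</releasedate>. The printing speed is 60 pages per minute in full color. The selling price is planned to be <price>800,000 yen</price>, and the printer will be sold through dealers nationwide.</p>
</document>
[0076]
In a structured document, important words and phrases are tagged by enclosing them between <tagname> and </tagname>, so they can be extracted easily. In the example above, tags such as the sender of the news (sendername), the release date (releasedate), the product name (productname), and the price (price) are embedded.
[0077]
The information extraction unit 103 of the present invention analyzes the tag structure of the structured document, associates a tag name such as sendername with a value such as Company A, Ltd., and extracts them as a (tag name, value) pair. Such pairs are obtained by extracting the portion that begins at <xxx> and ends at </xxx>.
[0078]
That is, in the example above, the following are extracted:
sendername, Company A, Ltd.
releasedate, January 7, 2000
producttype, color printer
productname, XL-1000
price, 800,000 yen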
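The extraction of (tag name, value) pairs from the span between `<xxx>` and `</xxx>` can be sketched with a regular expression. This is a simplified illustration, not a full SGML/XML parser: the pattern only matches elements whose content is plain text, so container elements with nested tags (such as `<p>` below) are skipped, and attributes are not handled. The shortened sample string and the function name `extract_tagged_values` are assumptions made for this sketch.

```python
import re

# Matches <name>text</name> where the text contains no further tags.
LEAF_ELEMENT = re.compile(r"<(\w+)>([^<]*)</\1>")

def extract_tagged_values(text):
    """Return a list of (tag name, value) pairs for text-only elements."""
    return [(m.group(1), m.group(2)) for m in LEAF_ELEMENT.finditer(text)]

# Shortened, hypothetical sample in the style of the press release example.
sample = (
    "<p><sendername>Company A</sendername> will launch the "
    "<productname>XL-1000</productname> at <price>800,000 yen</price>.</p>"
)

pairs = extract_tagged_values(sample)
for tag, value in pairs:
    print(tag, value, sep=", ")
```

For real structured documents a proper parser would be the safer choice, since regular expressions cannot handle arbitrary nesting or attributes.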
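[0079]
(Embodiment 3)
FIG. 7 is a block diagram showing the system configuration of a document classification apparatus according to Embodiment 3 of the present invention. In this apparatus, the following units are connected to the bus 100: the document input unit 101 that inputs the document information to be classified; the language analysis unit 102 that analyzes the document information input by the document input unit 101 and obtains language analysis information; the information extraction unit 103 that analyzes the same document information and extracts specific information; the document feature vector generation unit 105 that generates a document feature vector for the document information according to the language analysis information obtained by the language analysis unit 102; the vector correction unit 107 that corrects the document feature vectors according to the output of the information extraction unit 103; an extraction information selection unit 108 that accepts from the user a selection of one or more pieces of specific information; a specific information display unit 109 that displays the specific information extracted by the information extraction unit 103; and the document classification unit 106 that classifies the document information according to the output of the vector correction unit 107 and generates subsets of documents.
[0080]
FIG. 8 is a flowchart showing the basic document classification procedure of the apparatus in FIG. 7. First, the document input unit 101 executes document input processing to input the document information to be classified (step S31), and the language analysis unit 102 analyzes the input document information to obtain language analysis information (step S32). Next, the information extraction unit 103 analyzes the input document information to obtain specific information (step S33), and the specific information is displayed on the specific information display unit 109, such as a CRT (step S34).
[0081]
The extraction information selection unit 108 then accepts a selection of one or more pieces of specific information from the user (step S35). After that, the processing of Embodiment 1 or Embodiment 2 described above is executed.
[0082]
In the case of Embodiment 1, the language analysis information correction unit 104 corrects the language analysis information according to the output of the information extraction processing (step S36). The document feature vector generation unit 105 then generates a document feature vector for the document information according to the output of the correction processing (step S37), and the document classification unit 106 classifies the document information according to the generated document feature vectors and generates subsets of documents (step S39).
[0083]
In the case of Embodiment 2, the document feature vectors generated by the document feature vector generation unit 105 are corrected according to the output of the information extraction unit 103 (step S38), and the document classification unit 106 classifies the document information according to the corrected document feature vectors and generates subsets of documents (step S39).
[0084]
That is, Embodiment 3 presents the specific information extracted by the information extraction unit 103 to the user and allows the user to designate freely which specific information should not contribute to the classification.
[0085]
First, the specific information extracted by the information extraction unit 103 is displayed by the specific information display unit 109, for example on a CRT. FIG. 9 shows an example of this screen. The screen shown in FIG. 9 displays "company name, product name, announcement date" as tag names or attribute names, "Company A, Company B, XL-100, XL-100" as values, and the frequency of occurrence of each value in the whole document set, among other items.
[0086]
The extraction information selection unit 108 lets the user specify, by selecting check boxes, the tag names that are not to be used in the classification computation. In the example above, tokens having the company name or product name tag or attribute name are not used in the classification computation, and the extraction information selection unit 108 outputs this information to the language analysis information correction unit 104 in the case of Embodiment 1, or to the vector correction unit 107 in the case of Embodiment 2.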
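The effect of the user's selection can be sketched as a simple filter over the extracted (tag name, value) pairs: only the values whose tag name the user has checked are passed on as tokens to exclude. The pairs below are hypothetical (in particular the date value is made up for illustration), and the function name `tokens_to_exclude` is an assumption of this sketch.

```python
def tokens_to_exclude(extracted_pairs, selected_tags):
    """Collect the values whose tag or attribute name was selected by the user."""
    return {value for tag, value in extracted_pairs if tag in selected_tags}

# Hypothetical output of the information extraction step.
extracted_pairs = [
    ("company name", "Company A"),
    ("company name", "Company B"),
    ("product name", "XL-100"),
    ("announcement date", "2000-01-07"),  # illustrative value only
]

# Tag names the user ticked as "do not use in the classification computation".
selected_tags = {"company name", "product name"}

excluded = tokens_to_exclude(extracted_pairs, selected_tags)
print(sorted(excluded))
```

The resulting set could then be handed either to a token-list filter (Embodiment 1) or to a column-deletion step (Embodiment 2).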
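[0087]
The document classification method described so far can also be implemented as a program, recorded on a computer-readable recording medium, and executed on a computer. Part of the document classification method may also reside on a network and be realized through a communication line.
[0088]
That is, the document classification method described in these embodiments is realized, as shown in FIG. 10, by executing a program prepared in advance on a computer (CPU 20) such as a personal computer or a workstation. The program is recorded, by operating a keyboard 25 or the like, on a computer-readable recording medium such as a memory 21, a hard disk 24, a floppy disk (FD) 27, a CD-ROM 26, an MO, or a DVD, and is executed by being read from the recording medium by the computer (CPU 20). The data of the document classification processing can also be sent to and received from an external device via a communication device 22 as necessary.
[0089]
As shown in FIG. 11, the program can be distributed via the recording medium to devices 31 to 33 such as personal computers over a network such as the Internet 30. Where an Internet function is provided, TCP/IP (Transmission Control Protocol/Internet Protocol), for example, is used as the communication protocol. Networks are classified into WANs (Wide Area Networks), which connect to the outside via public or leased lines, and LANs (Local Area Networks), which are built within the same premises; either type may be used.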
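[0090]
[Effects of the Invention]
As described above, according to the document classification apparatus of the present invention (claim 1), the language analysis means analyzes the document information input by the document input means and obtains language analysis information that is the result of morphological analysis; the information extraction means extracts from the document information specific information that is attached information or a proper name; the language analysis information correction means corrects the language analysis information by removing the specific information from it; the document feature vector generation means generates document feature vectors; and the document classification means classifies the documents according to those vectors, using a technique such as clustering or categorization, and generates a plurality of document groups composed of the document information. As a result, the results of classification by topic and the results of classification by specific information (grouping documents that have given specific information (tag, attribute name)) can be kept from overlapping in content, so an effective document classification apparatus that is not affected by the specific information is realized.
[0091]
According to the document classification apparatus of the present invention (claim 2), the language analysis means analyzes the document information input by the document input means and obtains language analysis information that is the result of morphological analysis; the information extraction means extracts from the document information specific information that is attached information or a proper name; the document feature vector generation means generates a document feature vector for the document information according to the language analysis information; the document feature vector is corrected by removing the specific information from it; and the document classification means classifies the documents based on the corrected document feature vectors and generates a plurality of document groups composed of the document information. As a result, the results of classification by topic and the results of classification by specific information (grouping documents that have given specific information (tag, attribute name)) can be kept from overlapping in content, so an effective document classification apparatus that is not affected by the specific information is realized. Further, when the document information to be classified is a structured document such as SGML or XML, the information extraction means can extract the specific information from the tag information attached to the structured document.
[0092]
According to the document classification apparatus of the present invention (claim 3), in claim 1 or 2, the information extraction means obtains specific information in the form of proper noun expressions from the input document information, so document feature vectors free of the influence of those proper noun expressions can be generated.
[0094]
According to the document classification apparatus of the present invention (claim 4), the specific information extracted by the information extraction means is displayed, and the user can select and designate, from the displayed specific information, the specific information to be excluded from the classification computation. Flexible classification processing suited to the type and content of the material to be classified is therefore possible.
[0095]
According to the document classification method of the present invention (claim 5), the language analysis step analyzes the document information input in the document input step and obtains language analysis information that is the result of morphological analysis; the information extraction step extracts from the document information specific information that is attached information or a proper name; the language analysis information correction step corrects the language analysis information by removing the specific information from it; the document feature vector generation step generates document feature vectors; and the document classification step classifies the documents according to those vectors, using a technique such as clustering or categorization, and generates a plurality of document groups composed of the document information. As a result, the results of classification by topic and the results of classification by specific information (grouping documents that have given specific information (tag, attribute name)) can be kept from overlapping in content, so an effective document classification method that is not affected by the specific information is realized. Further, when the document information to be classified is a structured document such as SGML or XML, the information extraction step can extract the specific information from the tag information attached to the structured document.
[0096]
According to the document classification method of the present invention (claim 6), the language analysis step analyzes the document information input in the document input step and obtains language analysis information that is the result of morphological analysis; the information extraction step extracts from the document information specific information that is attached information or a proper name; the document feature vector generation step generates a document feature vector for the document information according to the language analysis information; the document feature vector is corrected by removing the specific information from it; and the document classification step classifies the documents based on the corrected document feature vectors and generates a plurality of document groups composed of the document information. As a result, the results of classification by topic and the results of classification by specific information (grouping documents that have given specific information (tag, attribute name)) can be kept from overlapping in content, so an effective document classification method that is not affected by the specific information is realized. Further, when the document information to be classified is a structured document such as SGML or XML, the information extraction step can extract the specific information from the tag information attached to the structured document.
[0097]
According to the document classification method of the present invention (claim 7), in claim 5 or 6, the information extraction step obtains specific information in the form of proper noun expressions from the input document information, so document feature vectors free of the influence of those proper noun expressions can be generated.
[0099]
According to the document classification method of the present invention (claim 8), the specific information extracted in the information extraction step is displayed, and the user can select and designate, from the displayed specific information, the specific information to be excluded from the classification computation. Flexible classification processing suited to the type and content of the material to be classified is therefore possible.
[0100]
According to the computer-readable recording medium of the present invention (claim 9), the document classification method according to any one of claims 5 to 8 is implemented as a program and recorded on a computer-readable recording medium, so the document classification method can be executed on a computer.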
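[Brief Description of the Drawings]
[FIG. 1] A block diagram showing the system configuration of the document classification apparatus according to Embodiment 1 of the present invention.
[FIG. 2] A flowchart showing the basic document classification procedure in FIG. 1.
[FIG. 3] A table showing an example of document information according to Embodiment 1 of the present invention.
[FIG. 4] A block diagram showing the system configuration of the document classification apparatus according to Embodiment 2 of the present invention.
[FIG. 5] A flowchart showing the basic document classification procedure in FIG. 4.
[FIG. 6] A table showing an example of document vectors according to Embodiment 2 of the present invention.
[FIG. 7] A block diagram showing the system configuration of the document classification apparatus according to Embodiment 3 of the present invention.
[FIG. 8] A flowchart showing the basic document classification procedure in FIG. 7.
[FIG. 9] An explanatory diagram showing an example of the display screen according to Embodiment 3 of the present invention.
[FIG. 10] A block diagram showing an example of a computer system that implements, in software, the document classification method according to the embodiments of the present invention.
[FIG. 11] A block diagram showing an example of a system that implements the document classification method according to the embodiments of the present invention on a network.
[Explanation of Reference Numerals]
101 Document input unit
102 Language analysis unit
103 Information extraction unit
104 Language analysis information correction unit
105 Document feature vector generation unit
106 Document classification unit
107 Vector correction unit
108 Extraction information selection unit
109 Specific information display unit
[0001]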
BACKGROUND OF THE INVENTION
The present invention relates to the field of technology for classifying documents based on their content (topics), and in particular to a document classification apparatus and a document classification method that classify documents by correcting the vector space used in document classification based on the vector space model, and to a computer-readable recording medium storing a program for causing a computer to execute the document classification method.
[0002]
[Prior art]
In recent years, with the rapid progress and spread of IT (information technology), it has become possible to access large numbers of electronic documents in Japan and overseas via networks, and the amount of information accumulated in the course of business is growing in proportion. Under these circumstances, intellectual work such as classifying large amounts of collected document information into meaningful categories is increasingly being carried out. Classifying documents by meaning or by topic serves the two purposes (1) and (2) described below.
[0003]
(1) To improve searchability. A huge document group can be explored using classification names (content names) as clues, so the document set to which a desired document belongs can be retrieved.
(2) To grasp the structure of the whole body of information. That is, to understand what content (individual classes) the document group as a whole consists of.
[0004]
However, when a user manually classifies a large amount of document information, although the accuracy of classification is excellent, human and time costs increase, so it is practically impossible in a recent document environment that handles a large number of documents. Therefore, an automatic document classification device has been proposed.
[0005]
Automatic document classification aimed at (1) above is disclosed, for example, in JP-A-7-36897, JP-A-10-260991, JP-A-10-17823, and "Projections for Efficient Document Clustering", Hinrich Schütze and Craig Silverstein, 1997, Proceedings of SIGIR '97, pp. 74-81, ACM. In these, documents are regarded as document vectors whose features are words, the similarity (distance) between vectors is used as the measure, and the documents are grouped by a clustering method to classify them automatically.
[0006]
Automatic classification aimed at (2) above is disclosed, for example, in Japanese Patent Application Laid-Open No. 11-15835. There too, documents are classified automatically based on a similarity measure.
[0007]
While clustering is an algorithm that generates an ad hoc classification system according to the similarity between document vectors, there is also a method called categorization or classification. In this method, several categories (classes) are defined in advance, and documents are classified automatically by determining which category each document belongs to. A classic example is the classification apparatus proposed by Salton et al., disclosed in "Introduction to Modern Information Retrieval", G. Salton and M. J. McGill, 1983, New York: McGraw-Hill.
[0008]
This apparatus calculates a similarity (cosine) between a document vector and a category vector, and automatically classifies the document according to which category the document is most similar to. Furthermore, Japanese Patent No. 2940501 improves the accuracy of automatic classification by improving the word clustering used for classification.
[0009]
The automatic classification method described above is characterized in that documents are arranged as vectors in a space composed of words extracted from documents, similarity between documents is calculated, and clustering and categorization are performed. Therefore, the result of document classification depends on what kind of space is configured.
[0010]
Documents are often given specific information in addition to the body text. Bibliographic items such as the name of the document's author and the document creation date are examples. In structured documents such as SGML (Standard Generalized Markup Language) and XML (eXtensible Markup Language), such specific information appearing in the body text is tagged in advance so that it can be managed.
[0011]
Furthermore, "information extraction" technology for extracting such specific information automatically has been disclosed and is now available. For example, IREX (Information Retrieval and Extraction Exercise), held in Japan in 1999, disclosed techniques for automatically extracting organization names (e.g., US Army), person names (e.g., Clinton), place names, names of specific artifacts, and the like that appear in the body text.
[0012]
[Problems to be solved by the invention]
However, in the conventional techniques described above, if the extracted specific information (words) is used to construct the space for document classification, the classification simply regenerates that specific information (partial information), so the benefit of combining specific-information extraction with document classification is halved. That is, a document set containing given specific information can be obtained simply by searching for documents with that specific information as a key, so it is desirable for document classification to classify from a viewpoint different from the specific information.
[0013]
The present invention has been made in view of the above, and its object is to realize effective document classification by excluding the extracted specific information from the construction of the space used for document classification.
[0014]
[Means for Solving the Problems]
  In order to achieve the above object, the document classification apparatus according to claim 1 comprises: document input means for inputting document information to be classified; language analysis means for analyzing the document information input by the document input means and obtaining language analysis information that is the result of morphological analysis; information extraction means for analyzing the document information input by the document input means and extracting specific information that is attached information or a proper name; language analysis information correction means for correcting the language analysis information by removing the specific information from the language analysis information according to the specific information output by the information extraction means; document feature vector generation means for generating a document feature vector for the document information according to the plurality of words output by the language analysis information correction means after correction of the language analysis information; and document classification means for classifying a plurality of pieces of the document information according to the document feature vectors generated by the document feature vector generation means and generating a plurality of document groups composed of the document information, wherein the information extraction means obtains the specific information according to tag information attached to a structured document.
[0015]
  According to this invention, the language analysis means analyzes the document information input by the document input means and obtains language analysis information that is the result of morphological analysis; the information extraction means extracts from the document information specific information that is attached information or a proper name; the language analysis information correction means corrects the language analysis information by removing the specific information from it; the document feature vector generation means generates document feature vectors; and the document classification means classifies the documents according to those vectors, using a technique such as clustering or categorization, and generates a plurality of document groups composed of the document information. As a result, the results of classification by topic and the results of classification by specific information (grouping documents that have given specific information (tag, attribute name)) can be kept from overlapping in content. Further, when the document information to be classified is a structured document such as SGML or XML, the information extraction means can extract the specific information from the tag information attached to the structured document.
[0016]
  The document classification apparatus according to claim 2 comprises: document input means for inputting document information to be classified; language analysis means for analyzing the document information input by the document input means and obtaining language analysis information that is the result of morphological analysis; information extraction means for analyzing the document information input by the document input means and extracting specific information that is attached information or a proper name; document feature vector generation means for generating a document feature vector for the document information according to the language analysis information obtained by the language analysis means; document feature vector correction means for correcting the document feature vector by removing the specific information from the document feature vector according to the specific information output by the information extraction means; and document classification means for classifying a plurality of pieces of the document information according to the corrected vectors output by the document feature vector correction means and generating a plurality of document groups composed of the document information, wherein the information extraction means obtains the specific information according to tag information attached to a structured document.
[0017]
  According to this invention, the language analysis means analyzes the document information input by the document input means and obtains language analysis information that is the result of morphological analysis; the information extraction means extracts from the document information specific information that is attached information or a proper name; the document feature vector generation means generates a document feature vector for the document information according to the language analysis information; the document feature vector is corrected by removing the specific information from it; and the document classification means classifies the documents based on the corrected document feature vectors and generates a plurality of document groups composed of the document information. As a result, the results of classification by topic and the results of classification by specific information (grouping documents that have given specific information (tag, attribute name)) can be kept from overlapping in content. Further, when the document information to be classified is a structured document such as SGML or XML, the information extraction means can extract the specific information from the tag information attached to the structured document.
[0018]
In the document classification apparatus according to claim 3, the information extraction means obtains specific information by extracting proper noun expressions.
[0019]
According to the present invention, in claim 1 or 2, the information extraction means acquires proper noun expressions from the input document information as the specific information, which makes it possible to generate a document feature vector that excludes the influence of proper noun expressions.
[0022]
The document classification apparatus according to claim 4 further comprises display means for displaying the specific information extracted by the information extraction means, and extracted information selection means for accepting selection of one or more pieces of specific information.
[0023]
According to the present invention, the specific information extracted by the information extraction means is displayed, and the user can select and specify, from the displayed specific information, the specific information to be excluded at the time of classification calculation, which makes classification processing based on that selection possible.
[0024]
The document classification method according to claim 5 is a document classification method realized by executing, on a computer, a program prepared in advance. By executing the program, the computer performs: a document input step of inputting document information to be classified; a language analysis step of analyzing the document information input in the document input step and obtaining language analysis information that is the result of morphological analysis; an information extraction step of analyzing the document information input in the document input step and extracting specific information that is attached information or a proper name; a language analysis information correction step of correcting the language analysis information according to the specific information output by the information extraction step; a document feature vector generation step of generating a document feature vector for the document information according to the plurality of words whose language analysis information has been corrected in the language analysis information correction step; and a document classification step of classifying the plural pieces of document information according to the document feature vector generated in the document feature vector generation step and generating a plurality of document groups composed of the document information, wherein the information extraction step obtains the specific information according to tag information attached to a structured document.
[0025]
According to the present invention, the language analysis step analyzes the document information input in the document input step and acquires language analysis information that is the result of morphological analysis; the information extraction step extracts, from the document information, specific information that is attached information or a proper name; the language analysis information correction step corrects the language analysis information by removing the specific information from it; the document feature vector generation step generates a document feature vector; and the document classification step classifies the documents according to the document feature vector, using a technique such as clustering or categorization, and generates a plurality of document groups consisting of the document information. This makes it possible to ensure that the result of classification by topic and the result of classification by specific information (grouping documents that have arbitrary specific information such as a tag or attribute name) do not overlap in content. When the document information to be classified is a structured document such as SGML or XML, for example, the information extraction step can extract the specific information from tag information attached to the structured document.
[0026]
The document classification method according to claim 6 is a document classification method realized by executing, on a computer, a program prepared in advance. By executing the program, the computer performs: a document input step of inputting document information to be classified; a language analysis step of analyzing the document information input in the document input step and obtaining language analysis information that is the result of morphological analysis; an information extraction step of analyzing the document information input in the document input step and extracting specific information that is attached information or a proper name; a document feature vector generation step of generating a document feature vector for the document information according to the language analysis information obtained in the language analysis step; a document feature vector correction step of correcting the document feature vector by removing the specific information from the document feature vector according to the specific information output by the information extraction step; and a document classification step of classifying the plural pieces of document information according to the vector correction output of the document feature vector correction step and generating a plurality of document groups composed of the document information, wherein the information extraction step obtains the specific information according to tag information attached to a structured document.
[0027]
According to the present invention, the language analysis step analyzes the document information input in the document input step and acquires language analysis information that is the result of morphological analysis; the information extraction step extracts, from the document information, specific information that is attached information or a proper name; the document feature vector generation step generates a document feature vector for the document information according to the language analysis information; the document feature vector correction step corrects the document feature vector by removing the specific information from it; and the document classification step classifies the documents based on the corrected document feature vector and generates a plurality of document groups consisting of the document information. This makes it possible to ensure that the result of classification by topic and the result of classification by specific information (grouping documents that have arbitrary specific information such as a tag or attribute name) do not overlap in content. When the document information to be classified is a structured document such as SGML or XML, for example, the information extraction step can extract the specific information from tag information attached to the structured document.
[0028]
In the document classification method according to claim 7, the information extraction step obtains the specific information by extracting proper noun expressions.
[0029]
According to the present invention, in claim 5 or 6, the information extraction step acquires proper noun expressions from the input document information as the specific information, which makes it possible to generate a document feature vector that excludes the influence of proper noun expressions.
[0032]
The document classification method according to claim 8 further includes a display step of displaying the specific information extracted in the information extraction step, and an extracted information selection step of accepting selection of one or more pieces of specific information.
[0033]
According to this invention, the specific information extracted in the information extraction step is displayed, and the user can select and specify, from the displayed specific information, the specific information to be excluded at the time of classification calculation, which makes classification processing based on that selection possible.
[0034]
The computer-readable recording medium according to claim 9 records a program for causing a computer to execute the document classification method according to any one of claims 5 to 8.
[0035]
According to the present invention, the document classification method according to any one of claims 5 to 8 is programmed and recorded on a computer-readable recording medium, so that the document classification method can be executed on a computer.
[0036]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, preferred embodiments of a document classification apparatus, a document classification method, and a computer-readable recording medium recording a program for causing a computer to execute the document classification method according to the present invention will be described in detail with reference to the accompanying drawings. The present invention is not limited to these embodiments.
[0037]
(Embodiment 1)
FIG. 1 is a block diagram showing the system configuration of the document classification apparatus according to the first embodiment of the present invention. In this apparatus, the following units are connected to a bus 100: a document input unit 101 that inputs document information to be classified; a language analysis unit 102 that analyzes the document information input by the document input unit 101 and obtains language analysis information; an information extraction unit 103 that analyzes the document information input by the document input unit 101 and obtains specific information; a language analysis information correction unit 104 that corrects the language analysis information according to the output of the information extraction unit 103; a document feature vector generation unit 105 that generates a document feature vector for the document information according to the output of the language analysis information correction unit 104; and a document classification unit 106 that classifies the document information according to the document feature vector generated by the document feature vector generation unit 105 and generates subsets of the documents.
[0038]
FIG. 2 is a flowchart showing the basic operation procedure of document classification in the apparatus of the first embodiment. First, the document input unit 101 executes document input processing for inputting the document information to be classified (step S11), and the language analysis unit 102 executes language analysis processing that analyzes the input document information and obtains language analysis information (step S12). Subsequently, the information extraction unit 103 executes information extraction processing that analyzes the input document information and obtains specific information (step S13), and the language analysis information correction unit 104 executes language analysis information correction processing that corrects the language analysis information according to the output of the information extraction processing (step S14). Further, the document feature vector generation unit 105 executes vector generation processing that generates a document feature vector for the document information according to the output of the language analysis information correction processing (step S15), and the document classification unit 106 executes document classification processing that classifies the document information according to the document feature vector generated by the vector generation processing and generates subsets of the documents (step S16).
[0039]
Here, as a suitable example of a case in which both the extraction of specific information and document classification based on the contents (topics) of a document set are important, the analysis of free-description answers obtained through a questionnaire survey or the like is assumed and described.
[0040]
In recent years it has become possible, for example, to collect thousands to hundreds of thousands of free-description answers on a computer in a short period via the Internet or the like, so a large amount of text information can be collected in this way.
[0041]
Here, as an example of a large amount of text information obtained by a questionnaire survey, it is assumed that respondents describe their requests concerning the printers they own. The questionnaire consists of a free-description item, "requests for the printer" (Q1), and question items for the printer name (product name), the printer manufacturer, and the degree of satisfaction with the printer. Suppose that N responses were received in total.
[0042]
In this example, the set of free-description answers to the questionnaire (the document set) has, for example, the format shown in FIG. 3. Assume that, as one of the analysis activities, the analyst (the operator of the invention) is trying to determine what kinds of opinions (topics) the answer set (document set) contains and how these relate to the manufacturer and the target product.
[0043]
First, the answer set to be classified is incorporated into the system via the document input unit 101. This input information is usually recorded and stored for later processing on the captured answer set. Subsequently, for the collected answer set, the language analysis unit 102 extracts words and compound words (or specific continuous character strings) included in each document (each answer). For this process, a known language analysis algorithm such as morphological analysis is used. An example in which words of nouns, adjectives, and adjective verbs are extracted by the language analysis unit 102 is shown below.
[0044]
ID0001 →
XL·100 (unregistered word), consumables (general noun), high (adjective), printing (sa variable noun)
ID0002 →
Company A (unregistered word), salesman (general noun), technology (general noun), knowledge (general noun), abundance (adjective verb), trust (general noun)
ID0003 →
Company B (unregistered word), PRX·4000 (unregistered word), printing (sa variable noun), speed (general noun), satisfaction (sa variable noun), company newsletter (general noun), business (general noun), manual (general noun), use (sa variable noun)
ID000N →
Company A (unregistered word), trust (general noun), high (adjective), use (sa variable noun)
[0045]
Next, the information extraction unit 103 extracts specific information from the answer set. The first method of information extraction outputs, to the language analysis information correction unit 104, the bibliographic items attached to each document or the attribute values of attributes other than the text attribute to be classified (that is, other than Q1 in the example of FIG. 3 described above), such as the value of the target product name field and the value of the manufacturer field. The attribute values of any such attribute may be stored in advance, in a file or the like, in the language analysis information correction unit 104. In this example, the attribute values of "target product name" and "manufacturer" are extracted, and the attribute value of "satisfaction" is not extracted.
[0046]
The second method of information extraction can be used when attached information such as "target product name" and "manufacturer" has not been acquired in advance as attribute values. It uses a known technique called information extraction, which automatically extracts proper names such as the following from the text described in Q1.
[0047]
Organization name (company name)
Person name
Place name
Product name
Date
Time
Amount of money
Percentage
etc.
[0048]
With this information extraction technique, "XL·100" is extracted as a product name from the answer ID0001, "Company A" as a company name from ID0002, "Company B" as a company name and "PRX·4000" as a product name from ID0003, and "Company A" as a company name from ID000N.
[0049]
Next, a method for correcting the vector space that expresses documents or words according to the specific information extracted in this way will be described. There is a known technique called a "stop word list", in which arbitrary words (tokens) that should not contribute to document classification are listed in a file or the like. In the present invention, the specific information extracted by the information extraction unit 103 is made to function as a "stop word list", either automatically or through user selection.
[0050]
The language analysis information correction unit 104 removes the specific information extracted by the information extraction unit 103 from the language analysis information output from the language analysis unit 102. For example, for the answer of ID0001, the language analysis unit 102 outputs the following language analysis information.
[0051]
ID0001 →
XL·100 (unregistered word), consumables (general noun), high (adjective), printing (sa variable noun)
[0052]
The language analysis information correction unit 104 removes "XL·100", which was extracted by the information extraction unit 103, from this language analysis information and outputs ID0001 → consumables (general noun), high (adjective), printing (sa variable noun).
[0053]
Similarly, ID0002 to ID000N become:
ID0002 →
salesman (general noun), technology (general noun), knowledge (general noun), abundance (adjective verb), trust (general noun)
ID0003 →
printing (sa variable noun), speed (general noun), satisfaction (sa variable noun), company newsletter (general noun), business (general noun), manual (general noun), use (sa variable noun)
ID000N →
trust (general noun), high (adjective), use (sa variable noun)
[0054]
As a result, the subsequent generation of document feature vectors and the document classification can be performed without using the tokens extracted by the information extraction unit 103.
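The correction described above can be sketched as follows. This is only an illustrative sketch: it assumes the language analysis information is held as a list of token strings per answer ID, and the function name correct_language_analysis and the sample data are hypothetical, not taken from the disclosure.

```python
# Hypothetical data mirroring the ID0001 and ID0002 examples in the text.
analysis = {
    "ID0001": ["XL·100", "consumables", "high", "printing"],
    "ID0002": ["Company A", "salesman", "technology", "knowledge",
               "abundance", "trust"],
}
# Specific information assumed to have come from the information extraction unit.
specific_info = {"XL·100", "Company A", "Company B", "PRX·4000"}


def correct_language_analysis(analysis, specific_info):
    """Drop every token that matches an item of extracted specific information."""
    return {
        doc_id: [tok for tok in tokens if tok not in specific_info]
        for doc_id, tokens in analysis.items()
    }


corrected = correct_language_analysis(analysis, specific_info)
```

In this sketch the extracted specific information plays the role of the stop word list: the remaining tokens are what the vector generation would see.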
[0055]
Next, according to the output from the language analysis information correction unit 104, the document feature vector generation unit 105 generates a "token" × "document (answer)" matrix in which tokens (feature description units) such as words are the columns, the documents are the rows, and each element is, for example, the appearance frequency of the word.
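As a minimal sketch of this matrix, assuming each document is already a list of token strings and that raw appearance frequency is used as the element value, the construction might look like the following. The name build_matrix and the sample documents are hypothetical.

```python
from collections import Counter


def build_matrix(docs):
    """Return (vocabulary, matrix); matrix[i][j] is the frequency of
    vocabulary[j] in the i-th document (rows = documents, columns = tokens)."""
    vocabulary = sorted({tok for tokens in docs for tok in tokens})
    index = {tok: j for j, tok in enumerate(vocabulary)}
    matrix = []
    for tokens in docs:
        row = [0] * len(vocabulary)
        for tok, count in Counter(tokens).items():
            row[index[tok]] = count
        matrix.append(row)
    return vocabulary, matrix


# Hypothetical corrected token lists for two answers.
docs = [
    ["consumables", "high", "printing"],
    ["trust", "high", "use"],
]
vocabulary, matrix = build_matrix(docs)
```

Sorting the vocabulary is only a convenience that makes the column order reproducible; any fixed token order would serve.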
[0056]
Although tokens have been described above as words, when a language analysis unit 102 having general morphological analysis and syntax analysis functions is used, part-of-speech information, compound words (phrases), and syntactic information can be obtained at the same time as the words are extracted. Tokens such as those in (1) and (2) below are therefore also possible.
[0057]
(1) "International Union" is treated as one compound-word token instead of the two words "international (general noun)" and "union (general noun)".
(2) The word "explanation (sa variable noun)" is treated, according to where it appears, as "the sa variable noun 'explanation' appearing in the predicate part", and is distinguished from the case where it appears in the subject part.
[0058]
The document feature vector generation unit 105 obtains a document vector from this "token" × "document" matrix. The following three methods are available, and any of them may be used in the present invention.
[0059]
(1) The components of the matrix corresponding to each document are used as they are as the document feature vector.
(2) The values are weighted in consideration of the length of each document (measured by the number of characters, the number of pages, etc.) and the appearance frequency of each token in the document set to be classified, and the result is used as the document feature vector.
(3) An inner product matrix between documents is calculated from the above matrix, singular value decomposition is applied to it to construct a latent semantic space, and the position of each document in that space is obtained and used as the vector. This technique can be realized by referring to "Projections for Efficient Document Clustering", Hinrich Schütze and Craig Silverstein, 1997, Proceedings of SIGIR '97, pp. 74-81, ACM.
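Method (2) does not fix a particular weighting formula. The sketch below is one possible reading, assuming a TF-IDF-style weight in which each count is divided by the document length (measured here in tokens rather than characters or pages) and multiplied by the logarithm of the inverse document frequency. The function name weight_matrix and this choice of formula are assumptions made for illustration.

```python
import math


def weight_matrix(matrix):
    """Weight raw counts by document length and by how rare each token is
    across the document set (a TF-IDF-style scheme, assumed here)."""
    n_docs = len(matrix)
    n_tokens = len(matrix[0])
    # Document frequency: number of documents that contain each token.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_tokens)]
    weighted = []
    for row in matrix:
        length = sum(row) or 1  # document length in tokens; avoid dividing by 0
        weighted.append([
            (row[j] / length) * math.log(n_docs / df[j]) if df[j] else 0.0
            for j in range(n_tokens)
        ])
    return weighted


# Rows are documents, columns are tokens (hypothetical counts).
weighted = weight_matrix([[1, 1, 1, 0, 0], [0, 1, 0, 1, 1]])
```

With this weighting, a token that appears in every document receives weight zero, so it no longer influences the similarity between documents.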
[0060]
The document classification unit 106 classifies the document using the similarity of the document feature vector that is the output of the document feature vector generation unit 105. As measures of similarity, inner products, cosines, Euclidean distances, Mahalanobis distances, and the like can be considered, and any measure can be used.
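Three of the measures named above can be written compactly as follows, assuming document feature vectors are plain lists of numbers of equal length. The function names are illustrative only, and the Mahalanobis distance is omitted because it additionally requires a covariance matrix of the document set.

```python
import math


def inner_product(u, v):
    """Sum of element-wise products of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    norm = math.sqrt(inner_product(u, u)) * math.sqrt(inner_product(v, v))
    return inner_product(u, v) / norm if norm else 0.0


def euclidean_distance(u, v):
    """Straight-line distance between u and v (smaller means more similar)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Note that the inner product and cosine grow with similarity, whereas the Euclidean distance shrinks, so a classifier must treat the two kinds of measure in opposite senses.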
[0061]
Document classification methods include clustering, a bottom-up method that groups similar documents according to the similarity between document feature vectors, and categorization, in which several categories (classifications) are defined in advance and documents are classified automatically by determining which category each document belongs to. Hereinafter, the clustering method and the categorization method will be described.
[0062]
First, the clustering method will be described. Various calculation methods for clustering are known, including hierarchical and non-hierarchical algorithms, and any of them can be used. Here the K-means algorithm is taken up as a typical example. This algorithm classifies the set of n document feature vectors generated by the document feature vector generation unit 105 into k vector sets according to their similarity. Note that k must be given in advance.
[0063]
The calculation procedure of this algorithm is as follows.
(1) The initial values of the k cluster centroids are selected from the n vectors.
(2) Each of the n vectors is assigned to the most similar cluster centroid.
(3) For each of the k clusters, the mean of the vectors included in the cluster is computed and used as the new cluster centroid.
(4) Steps (2) and (3) are repeated until a termination condition is satisfied, for example when the positions of the cluster centroids no longer change.
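The four steps of this procedure can be sketched as follows. The sketch assumes that squared Euclidean distance is the similarity measure (smallest distance meaning most similar), that the initial centroids are drawn at random with a fixed seed, and that an unchanged assignment is the termination condition. The names k_means and squared_distance are illustrative.

```python
import random


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def k_means(vectors, k, max_iter=100, seed=0):
    """Cluster `vectors` into k groups; returns (assignments, centroids)."""
    rng = random.Random(seed)
    # (1) Pick k of the n vectors as the initial centroids.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignments = None
    for _ in range(max_iter):
        # (2) Assign each vector to its nearest (most similar) centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # (4) Stop once the assignments, and hence the centroids, stop changing.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # (3) Recompute each centroid as the mean of its member vectors.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # keep the old centroid if a cluster became empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Hypothetical two-dimensional document feature vectors.
vectors = [[0, 0], [0, 1], [10, 10], [10, 11]]
assignments, centroids = k_means(vectors, k=2)
```

The cluster numbers themselves depend on which vectors are drawn as initial centroids, so only the grouping, not the label values, should be relied upon.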
[0064]
By using this clustering method, it is possible to obtain a plurality of document sets (clusters) composed of similar document groups. Here, since the specific information extracted by the information extraction unit 103 is not used for calculating the similarity between the document feature vectors, a cluster can be generated without being affected by the specific information. That is, a dominant cluster including specific information (for example, a specific manufacturer) is not formed.
[0065]
Next, the categorization method will be described. Various methods have already been proposed. Among them, the method of G. Salton and M. J. McGill (Introduction to Modern Information Retrieval, 1983, New York: McGraw-Hill) is well known.
[0066]
In this categorization method, the categories into which documents are to be classified are first defined in advance. If each category is defined by, for example, specifying a sample document, the vector of each category can be generated from that sample document. When a plurality of documents are given as samples, the category vector may be obtained by adding the sample document vectors. To classify each document into the appropriate category, the similarity between the document feature vector to be classified and each category vector is calculated, and the document is classified into the most similar category. The document-word (token) space, which is the premise for calculating the similarity between vectors in this classification as well, does not include the specific information extracted by the information extraction unit 103, so documents can be placed in categories regardless of the specific information.
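A minimal sketch of this categorization follows, assuming the cosine is used as the similarity measure and that a category vector is the element-wise sum of its sample document vectors. The category names, token columns, and function names are hypothetical and serve only to illustrate the flow.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def category_vector(sample_vectors):
    """Sum the sample document vectors element-wise."""
    return [sum(col) for col in zip(*sample_vectors)]


def categorize(doc_vector, categories):
    """Return the name of the category whose vector is most similar."""
    return max(categories, key=lambda name: cosine(doc_vector, categories[name]))


# Hypothetical categories over the token columns [price, speed, trust].
categories = {
    "cost": category_vector([[2, 0, 0], [1, 0, 1]]),
    "performance": category_vector([[0, 3, 0], [0, 1, 0]]),
}
label_a = categorize([1, 0, 0], categories)
label_b = categorize([0, 2, 1], categories)
```

Because the token columns are assumed to exclude the extracted specific information, a product or manufacturer name cannot pull a document toward a category in this sketch.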
[0067]
Therefore, according to the first embodiment described above, the relationship between the specific information (for example, the product name) extracted by the information extraction unit 103 and the content-based classification result (for example, the content of the user's request) produced by the document classification unit 106 can be examined. In the above example, analysis work such as grasping the differences in user requests for each product name by cross tabulation or the like becomes possible.
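The cross tabulation mentioned here can be illustrated with a short sketch, assuming that each answer carries one extracted product name and one cluster label from the classification step. The function name cross_tabulate and the sample values are hypothetical.

```python
from collections import Counter


def cross_tabulate(product_names, cluster_labels):
    """Count documents for each (product name, cluster label) pair."""
    return Counter(zip(product_names, cluster_labels))


# Hypothetical per-answer data: extracted product name and assigned cluster.
products = ["XL·100", "PRX·4000", "XL·100", "XL·100"]
clusters = [0, 1, 0, 1]
table = cross_tabulate(products, clusters)
```

Each count then shows how many answers about a given product fell into a given topic cluster, which is the kind of comparison the analyst is assumed to be after.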
[0068]
(Embodiment 2)
FIG. 4 is a block diagram showing the system configuration of the document classification apparatus according to the second embodiment of the present invention. In this apparatus, the following units are connected to a bus 100: a document input unit 101 that inputs document information to be classified; a language analysis unit 102 that analyzes the document information input by the document input unit 101 and obtains language analysis information; an information extraction unit 103 that analyzes the document information input by the document input unit 101 and extracts specific information; a document feature vector generation unit 105 that generates a document feature vector for the document information according to the language analysis information obtained by the language analysis unit 102; a vector correction unit 107 that corrects the document feature vector according to the output of the information extraction unit 103; and a document classification unit 106 that classifies the document information according to the output of the vector correction unit 107 and generates subsets of the documents.
[0069]
FIG. 5 is a flowchart showing the basic operation procedure of document classification in the apparatus of the second embodiment. First, the document input unit 101 executes document input processing for inputting the document information to be classified (step S21), and the language analysis unit 102 executes language analysis processing that analyzes the input document information and obtains language analysis information (step S22). Subsequently, the information extraction unit 103 executes information extraction processing that analyzes the input document information and obtains specific information (step S23), and the document feature vector generation unit 105 generates a document feature vector for the document information according to the language analysis information obtained by the language analysis unit 102 (step S24). Further, the vector correction unit 107 corrects the document feature vector according to the output of the information extraction unit 103 (step S25), and the document classification unit 106 executes document classification processing that classifies the document information according to the document feature vector corrected by the vector correction processing and generates subsets of the documents (step S26).
[0070]
That is, the second embodiment differs from the first embodiment described above in that the correction based on the specific information extracted by the information extraction unit 103 is applied to the generated document feature vector. As in the first embodiment, the document feature vector generation unit 105 generates a vector for each document over the tokens extracted by the language analysis unit 102. Since the vectors are expressed as shown in FIG. 6, for example, the corresponding column vector may be deleted in order to exclude the specific information extracted by the information extraction unit 103.
[0071]
For example, in FIG. 6, if the specific information extracted by the information extraction unit 103 is token 3, the vector correction unit 107 can eliminate its influence by deleting the column of token 3. In addition, when singular value decomposition is performed and the feature dimensions are obtained not as tokens but as combined dimensions (m1, m2, m3, ..., mI, ...), the correction is realized by finding the dimension mI that has the highest correlation with the specific information and eliminating it.
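For the case in which the feature dimensions are tokens, the column deletion can be sketched as follows, assuming the matrix is a list of rows (one per document) with one column per token. The function name remove_token_columns and the sample values are hypothetical, and the singular value decomposition case is not covered by this sketch.

```python
def remove_token_columns(vocabulary, matrix, specific_info):
    """Delete the columns whose token is in `specific_info`."""
    keep = [j for j, tok in enumerate(vocabulary) if tok not in specific_info]
    new_vocabulary = [vocabulary[j] for j in keep]
    new_matrix = [[row[j] for j in keep] for row in matrix]
    return new_vocabulary, new_matrix


# Hypothetical matrix in the style of FIG. 6: rows are documents.
vocabulary = ["token1", "token2", "token3"]
matrix = [[1, 0, 2], [0, 3, 1]]
new_vocabulary, new_matrix = remove_token_columns(vocabulary, matrix, {"token3"})
```

The result is the same set of document rows with the token 3 column absent, so later similarity calculations cannot be affected by it.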
[0072]
By the way, the information extraction unit 103 in the first embodiment or the second embodiment described above may obtain specific information by extracting proper nouns.
[0073]
Furthermore, the information extraction unit 103 may extract specific information based on tag information attached to the structured document. Here, a case where the document input from the document input unit 101 is a structured document such as SGML or XML will be described as an example.
[0074]
As an example of a structured document, the text "Press release: High-speed color printer [XL-1000] is released. Company A Co., Ltd. will introduce the new high-speed color printer XL-1000 to the office market from January 7, 2000. The printing speed is 60 sheets per minute in full color. The sales price is scheduled to be 800,000 yen, and the printer is to be sold through distributors nationwide." is described in the following format.
[0075]
<Document>
<H1>Press Release: High-speed color printer "XL-1000" released</H1>
<p><sendername>Company A</sendername> will introduce the new high-speed <producttype>color printer</producttype> <productname>XL-1000</productname> into the office market from <releasedate>January 7, 2000</releasedate>. The printing speed is 60 sheets per minute in full color. The selling price is planned to be <price>800,000 yen</price>, and the printer is scheduled to be sold through distributors nationwide.</p>
</Document>
[0076]
In structured documents, important words appearing in the document are tagged by being enclosed between <tagname> and </tagname>, so they can be extracted easily. In the above example, tags such as the sender name (sendername), release date (releasedate), product name (productname), and price (price) are embedded.
[0077]
The information extraction unit 103 of the present invention analyzes the tagging structure of the structured document and extracts each tag name, such as sendername, together with its value, such as Company A, as a (tag name, value) pair. These pairs can be obtained by extracting the portion from a location starting with <xxx> to the location ending with </xxx>.
[0078]
That is, in the above example, the following are extracted:
sendername, Company A
releasedate, January 7, 2000
producttype, color printer
productname, XL-1000
price, 800,000 yen
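The extraction of (tag name, value) pairs can be sketched with a regular expression, assuming well-formed tags without attributes, spaces, or nesting, and assuming that layout tags such as p, H1, and Document are not treated as specific information. The pattern, the ignore list, and the shortened sample text are illustrative assumptions; a full SGML or XML parser would be the more robust choice in practice.

```python
import re

# Matches <tag>value</tag> where the value itself contains no nested tags.
TAG_PATTERN = re.compile(r"<(\w+)>([^<>]*)</\1>")


def extract_tagged_values(text, ignore=("p", "H1", "Document")):
    """Return (tag name, value) pairs, skipping the layout tags in `ignore`."""
    return [
        (tag, value.strip())
        for tag, value in TAG_PATTERN.findall(text)
        if tag not in ignore
    ]


# Shortened, hypothetical version of the press release example.
sample = (
    "<p><sendername>Company A</sendername> will release the "
    "<producttype>color printer</producttype> "
    "<productname>XL-1000</productname> on "
    "<releasedate>January 7, 2000</releasedate> at "
    "<price>800,000 yen</price>.</p>"
)
pairs = extract_tagged_values(sample)
```

The back-reference in the pattern ensures that each opening tag is paired with a closing tag of the same name, which corresponds to extracting the portion between <xxx> and </xxx>.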
[0079]
(Embodiment 3)
FIG. 7 is a block diagram showing the system configuration of the document classification apparatus according to the third embodiment of the present invention. In this apparatus, the following units are connected to a bus 100: a document input unit 101 that inputs document information to be classified; a language analysis unit 102 that analyzes the document information input by the document input unit 101 and obtains language analysis information; an information extraction unit 103 that analyzes the document information input by the document input unit 101 and extracts specific information; a document feature vector generation unit 105 that generates a document feature vector for the document information according to the language analysis information obtained by the language analysis unit 102; a vector correction unit 107 that corrects the document feature vector according to the output of the information extraction unit 103; an extraction information selection unit 108 that accepts selection of one or more pieces of specific information from a user; a specific information display unit 109 that displays the specific information extracted by the information extraction unit 103; and a document classification unit 106 that classifies the document information according to the output of the vector correction unit 107 and generates subsets of the documents.
[0080]
FIG. 8 is a flowchart showing the basic operation procedure of document classification in the apparatus of the third embodiment. First, the document input unit 101 executes document input processing for inputting the document information to be classified (step S31), and the language analysis unit 102 executes language analysis processing that analyzes the input document information and obtains language analysis information (step S32). Subsequently, the information extraction unit 103 executes information extraction processing that analyzes the input document information and obtains specific information (step S33), and the specific information is displayed on the specific information display unit 109, such as a CRT (step S34).
[0081]
Further, the extracted information selection unit 108 accepts selection of one or more specific information from the user (step S35). Subsequently, the processing of the first or second embodiment described above is executed.
[0082]
In the case of the first embodiment, the language analysis information correction unit 104 performs language analysis information correction processing for correcting the language analysis information according to the output of the information extraction processing (step S36). Further, the document feature vector generation unit 105 executes a vector generation process for generating a document feature vector for the document information in accordance with the output of the language analysis information correction process (step S37), and the document classification unit 106 performs the vector generation process. Document information is classified according to the generated document feature vector, and document classification processing for generating a document subset is executed (step S39).
[0083]
On the other hand, in the case of the second embodiment, the vector correction unit 107 corrects the document feature vector generated by the document feature vector generation unit 105 according to the output of the information extraction unit 103 (step S38), and the document classification unit 106 executes document classification processing that classifies the document information according to the corrected document feature vector and generates document subsets (step S39).
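The two alternative paths through steps S36 to S38 can be sketched as follows. This is only an illustrative sketch, not the implementation disclosed here: the function names, the token examples, and the use of a plain term-frequency vector are all assumptions made for the example.

```python
from collections import Counter

def build_vector(tokens):
    """Build a term-frequency document feature vector (an assumed weighting)."""
    return Counter(tokens)

def vector_via_token_correction(tokens, excluded):
    """First-embodiment path: correct the language analysis information by
    dropping excluded tokens (step S36), then generate the vector (step S37)."""
    corrected_tokens = [t for t in tokens if t not in excluded]
    return build_vector(corrected_tokens)

def vector_via_vector_correction(tokens, excluded):
    """Second-embodiment path: generate the vector first, then correct it by
    removing the components that correspond to excluded tokens (step S38)."""
    vector = build_vector(tokens)
    return Counter({t: w for t, w in vector.items() if t not in excluded})

# Hypothetical output of language analysis for one document.
tokens = ["company_a", "printer", "xl-100", "printer", "toner"]
# Hypothetical specific information selected for exclusion.
excluded = {"company_a", "xl-100"}

v1 = vector_via_token_correction(tokens, excluded)
v2 = vector_via_vector_correction(tokens, excluded)
```

With a plain term-frequency weighting both paths yield the same vector; they could differ under weightings that depend on document length or on corpus statistics computed before the correction.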
[0084]
That is, the third embodiment presents the specific information extracted by the information extraction unit 103 to the user and lets the user freely specify which specific information should not contribute to the classification.
[0085]
First, the specific information extracted by the information extraction unit 103 is displayed by the specific information display unit 109, for example on a CRT. An example of this screen display is shown in FIG. 9. The screen shown in FIG. 9 displays “company name, product name, announcement date” as tag names or attribute names, “company A, company B, XL-100, XL-100” as values, and the appearance frequency of each individual value in the entire document set.
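The per-value appearance frequencies shown on such a screen could be tallied as in the following sketch. The data layout (one list of tag name and value pairs per document), the sample values, and the function name are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical extraction output: one list of (tag name, value) pairs per document.
extracted = [
    [("company name", "Company A"), ("product name", "XL-100")],
    [("company name", "Company B"), ("product name", "XL-100")],
    [("company name", "Company A"), ("announcement date", "2001-01-01")],
]

def frequency_table(extracted_per_document):
    """Count how often each (tag name, value) pair appears in the whole set."""
    counts = Counter()
    for pairs in extracted_per_document:
        counts.update(pairs)
    return counts

table = frequency_table(extracted)

# Print one row per value: tag name, value, appearance frequency.
for (tag, value), count in sorted(table.items()):
    print(f"{tag}\t{value}\t{count}")
```

This sketch counts every occurrence; counting the number of documents that contain each value instead would require de-duplicating the pairs within each document first.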
[0086]
The extraction information selection unit 108 lets the user specify, by selecting check boxes, the tag names that are not to be used for the classification calculation. In the above example, tokens having the company name or product name tag or attribute name are not used for the classification calculation. The extraction information selection unit 108 outputs this selection to the language analysis information correction unit 104 in the case of the first embodiment, and to the vector correction unit 107 in the case of the second embodiment.
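A minimal sketch of how checked tag names might be turned into a set of tokens to exclude, and how that selection might be routed, is shown below. The function names, the data layout, and the sample values are assumptions for illustration; only the unit names and numbers come from the description.

```python
def tokens_to_exclude(extracted_pairs, checked_tag_names):
    """Return the values whose tag name the user checked for exclusion."""
    return {value for tag, value in extracted_pairs if tag in checked_tag_names}

def route_selection(embodiment):
    """Name the unit that receives the selection in each embodiment."""
    if embodiment == 1:
        return "language analysis information correction unit 104"
    if embodiment == 2:
        return "vector correction unit 107"
    raise ValueError("embodiment must be 1 or 2")

# Hypothetical specific information extracted from the document set.
pairs = [
    ("company name", "Company A"),
    ("company name", "Company B"),
    ("product name", "XL-100"),
    ("announcement date", "2001-01-01"),
]
# Tag names the user checked on the selection screen.
checked = {"company name", "product name"}

excluded = tokens_to_exclude(pairs, checked)
```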
[0087]
The document classification method described so far can be implemented as a program, recorded on a computer-readable recording medium, and executed on a computer. Part of the document classification method can also be realized on a network through a communication line.
[0088]
That is, the document classification method described in this embodiment is realized by executing a program prepared in advance on a computer (CPU 20), such as a personal computer or a workstation, as shown in FIG. 10. This program is recorded on a computer-readable recording medium such as the memory 21, the hard disk 24, the floppy disk (FD) 27, the CD-ROM 26, an MO, or a DVD, and is executed by being read from the recording medium by the computer (CPU 20) in response to operation of the keyboard 25. In addition, data for the document classification processing can be exchanged with an external device through the communication device 22 as necessary.
[0089]
In addition, as shown in FIG. 11, this program can be distributed to devices 31 to 33, such as personal computers, via the recording medium or via a network such as the Internet 30. When the Internet is used, for example, TCP/IP (Transmission Control Protocol / Internet Protocol) is used as the communication protocol. Networks are classified into WANs (Wide Area Networks), which connect to the outside via a public line or a dedicated line, and LANs (Local Area Networks), which are built within a single site; either may be used.
[0090]
[Effect of the invention]
As described above, according to the document classification apparatus (claim 1) of the present invention, the language analysis means analyzes the document information input by the document input means and acquires language analysis information that is the result of morphological analysis; the information extraction means extracts specific information, which is attached information or a proper name, from the document information; the language analysis information correction means corrects the language analysis information by removing the specific information from it; the document feature vector generation means generates a document feature vector; and the document classification means classifies the documents according to the document feature vector, using a technique such as clustering or categorization, and generates a plurality of document groups consisting of the document information. As a result, the classification by topic and the classification by specific information (grouping documents that have arbitrary specific information (tag, attribute name)) do not overlap in content, so an effective document classification apparatus that is not affected by specific information is realized.
[0091]
According to the document classification apparatus (claim 2) of the present invention, the language analysis means analyzes the document information input by the document input means and acquires language analysis information that is the result of morphological analysis; the information extraction means extracts specific information, which is attached information or a proper name, from the document information; the document feature vector generation means generates a document feature vector for the document information according to the language analysis information; the document feature vector is corrected by removing the specific information from it; and the document classification means classifies the documents based on the corrected document feature vector and generates a plurality of document groups consisting of the document information. As a result, the classification by topic and the classification by specific information (grouping documents that have arbitrary specific information (tag, attribute name)) do not overlap in content, so an effective document classification apparatus that is not affected by specific information is realized. Further, when the document information to be classified is a structured document such as SGML or XML, the information extraction means can extract the specific information from the tag information attached to the structured document.
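Extraction of specific information from the tags of a structured document might look like the following sketch, which uses Python's standard xml.etree.ElementTree parser. The tag names, the sample document, and the function name are hypothetical, and the sketch handles well-formed XML only; SGML would need a different parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names that mark specific information in the documents.
SPECIFIC_TAGS = {"company", "product", "date"}

def extract_specific_info(xml_text, specific_tags=SPECIFIC_TAGS):
    """Return (tag name, value) pairs for elements whose tag marks specific information."""
    root = ET.fromstring(xml_text)
    pairs = []
    for element in root.iter():  # walks the root and all descendants in document order
        if element.tag in specific_tags and element.text:
            pairs.append((element.tag, element.text.strip()))
    return pairs

# Hypothetical structured document.
doc = (
    "<article><company>Company A</company> announced the "
    "<product>XL-100</product> printer.</article>"
)
pairs = extract_specific_info(doc)
```

The resulting pairs could then be used to remove the corresponding tokens from the language analysis information or from the document feature vector.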
[0092]
According to the document classification apparatus (claim 3) of the present invention, in claim 1 or 2, the information extraction means acquires specific information consisting of proper noun expressions from the input document information, so a document feature vector that eliminates the influence of these proper noun expressions can be generated.
[0094]
According to the document classification apparatus (claim 4) of the present invention, the specific information extracted by the information extraction means can be displayed, and the user can select and specify, from the displayed specific information, the specific information to be excluded from the classification calculation. This makes flexible classification processing possible according to the type and content of the classification target.
[0095]
According to the document classification method (claim 5) of the present invention, the language analysis step analyzes the document information input in the document input step and acquires language analysis information that is the result of morphological analysis; the information extraction step extracts specific information, which is attached information or a proper name, from the document information; the language analysis information correction step corrects the language analysis information by removing the specific information from it; the document feature vector generation step generates a document feature vector; and the document classification step classifies the documents according to the document feature vector, using a technique such as clustering or categorization, and generates a plurality of document groups consisting of the document information. As a result, the classification by topic and the classification by specific information (grouping documents that have arbitrary specific information (tag, attribute name)) do not overlap in content, so an effective document classification method that is not affected by specific information is realized. When the document information to be classified is a structured document such as SGML or XML, the information extraction step can extract the specific information from the tag information attached to the structured document.
[0096]
According to the document classification method (claim 6) of the present invention, the language analysis step analyzes the document information input in the document input step and acquires language analysis information that is the result of morphological analysis; the information extraction step extracts specific information, which is attached information or a proper name, from the document information; the document feature vector generation step generates a document feature vector for the document information according to the language analysis information; the document feature vector is corrected by removing the specific information from it; and the document classification step classifies the documents based on the corrected document feature vector and generates a plurality of document groups consisting of the document information. As a result, the classification by topic and the classification by specific information (grouping documents that have arbitrary specific information (tag, attribute name)) do not overlap in content, so an effective document classification method that is not affected by specific information is realized. In addition, when the document information to be classified is a structured document such as SGML or XML, the information extraction step can extract the specific information from the tag information attached to the structured document.
[0097]
According to the document classification method (claim 7) of the present invention, in claim 5 or 6, the information extraction step acquires specific information consisting of proper noun expressions from the input document information, so a document feature vector that eliminates the influence of these proper noun expressions can be generated.
[0099]
According to the document classification method (claim 8) of the present invention, the specific information extracted in the information extraction step can be displayed, and the user can select and specify, from the displayed specific information, the specific information to be excluded from the classification calculation. This makes flexible classification processing possible according to the type and content of the classification target.
[0100]
According to the computer-readable recording medium (claim 9) of the present invention, the document classification method according to any one of claims 5 to 8 is implemented as a program and recorded on a computer-readable recording medium, so the document classification method can be executed on a computer.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a system configuration of a document classification apparatus according to a first embodiment of the present invention.
FIG. 2 is a flowchart showing a basic operation procedure for document classification in FIG. 1;
FIG. 3 is a chart showing an example of document information according to the first embodiment of the present invention.
FIG. 4 is a block diagram showing a system configuration of a document classification apparatus according to a second embodiment of the present invention.
FIG. 5 is a flowchart showing a basic operation procedure of document classification in FIG. 4;
FIG. 6 is a chart showing an example of each document vector according to the second embodiment of the present invention.
FIG. 7 is a block diagram showing a system configuration of a document classification apparatus according to a third embodiment of the present invention.
FIG. 8 is a flowchart showing a basic operation procedure of document classification in FIG. 7;
FIG. 9 is an explanatory diagram showing an example of a display screen according to the third embodiment of the present invention.
FIG. 10 is a block diagram showing an example of a computer system that implements the document classification method according to the embodiment of the present invention by software.
FIG. 11 is a block diagram showing an example of a system for realizing the document classification method according to the embodiment of the present invention on a network.
[Explanation of symbols]
101 Document input unit
102 Language analysis unit
103 Information extraction unit
104 Language analysis information correction unit
105 Document feature vector generation unit
106 Document classification unit
107 Vector correction unit
108 Extraction information selection unit
109 Specific information display unit

Claims

1. A document classification apparatus comprising: document input means for inputting document information to be classified; language analysis means for analyzing the document information input by the document input means and obtaining language analysis information that is a result of morphological analysis; information extraction means for analyzing the document information input by the document input means and extracting specific information that is accompanying information or a proper name; language analysis information correction means for correcting the language analysis information by removing the specific information from the language analysis information in accordance with the specific information output by the information extraction means; document feature vector generation means for generating a document feature vector for the document information in accordance with the plurality of words output as the language analysis information corrected by the language analysis information correction means; and document classification means for classifying a plurality of pieces of the document information in accordance with the document feature vectors generated by the document feature vector generation means and generating a plurality of document groups consisting of the document information,
wherein the information extraction means obtains the specific information according to tag information attached to a structured document.

2. A document classification apparatus comprising: document input means for inputting document information to be classified; language analysis means for analyzing the document information input by the document input means and obtaining language analysis information that is a result of morphological analysis; information extraction means for analyzing the document information input by the document input means and extracting specific information that is accompanying information or a proper name; document feature vector generation means for generating a document feature vector for the document information in accordance with the language analysis information obtained by the language analysis means; document feature vector correction means for correcting the document feature vector by removing the specific information from the document feature vector in accordance with the specific information output by the information extraction means; and document classification means for classifying a plurality of pieces of the document information in accordance with the output of the document feature vector correction means and generating a plurality of document groups consisting of the document information,
wherein the information extraction means obtains the specific information according to tag information attached to a structured document.

3. The document classification apparatus according to claim 1, wherein the information extraction means obtains the specific information by extracting proper noun expressions.

4. The document classification apparatus according to any one of claims 1 to 3, further comprising: display means for displaying the specific information extracted by the information extraction means; and extraction information selection means for accepting selection of one or more pieces of the specific information.

5. A document classification method realized by executing a program prepared in advance on a computer,
wherein, by executing the program, the computer executes: a document input step of inputting document information to be classified; a language analysis step of analyzing the document information input in the document input step and obtaining language analysis information that is a result of morphological analysis; an information extraction step of analyzing the document information input in the document input step and extracting specific information that is accompanying information or a proper name; a language analysis information correction step of correcting the language analysis information in accordance with the specific information output by the information extraction step; a document feature vector generation step of generating a document feature vector for the document information in accordance with the plurality of words output as the language analysis information corrected in the language analysis information correction step; and a document classification step of classifying a plurality of pieces of the document information in accordance with the document feature vectors generated in the document feature vector generation step and generating a plurality of document groups consisting of the document information,
and wherein the information extraction step obtains the specific information according to tag information attached to a structured document.

6. A document classification method realized by executing a program prepared in advance on a computer,
wherein, by executing the program, the computer executes: a document input step of inputting document information to be classified; a language analysis step of analyzing the document information input in the document input step and obtaining language analysis information that is a result of morphological analysis; an information extraction step of analyzing the document information input in the document input step and extracting specific information that is accompanying information or a proper name; a document feature vector generation step of generating a document feature vector for the document information in accordance with the language analysis information obtained in the language analysis step; a document feature vector correction step of correcting the document feature vector by removing the specific information from the document feature vector in accordance with the specific information output by the information extraction step; and a document classification step of classifying a plurality of pieces of the document information in accordance with the output of the document feature vector correction step and generating a plurality of document groups consisting of the document information,
and wherein the information extraction step obtains the specific information according to tag information attached to a structured document.

7. The document classification method according to claim 5, wherein the information extraction step obtains the specific information by extracting proper noun expressions.

8. The document classification method according to any one of claims 5 to 7, wherein, by executing the program, the computer further executes: a display step of displaying the specific information extracted in the information extraction step; and an extraction information selection step of accepting selection of one or more pieces of the specific information.

9. A computer-readable recording medium on which a program for causing a computer to execute the document classification method according to claim 5 is recorded.