JP2004341948A

JP2004341948A - Concept extraction system, concept extraction method, program therefor, and storing medium thereof

Info

Publication number: JP2004341948A
Application number: JP2003139336A
Authority: JP
Inventors: Atsuo Shimada; 敦夫嶋田
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2003-05-16
Filing date: 2003-05-16
Publication date: 2004-12-02
Anticipated expiration: 2023-05-16
Also published as: JP4359075B2

Abstract

PROBLEM TO BE SOLVED: To provide a concept extraction technology that can accurately extract the various topics and concepts contained in text data by repeating clustering in a way that reflects the user's intention.

SOLUTION: The concept extraction system extracts concepts from text data by automatically classifying a plurality of text data items multiple times. The system comprises: a text vector generator 1 that counts, for each text data item, the appearance frequency of each word in that item, and generates a matrix of text data items by words with the appearance frequencies as its elements; a text vector storing part 5 that stores the matrix information; a clustering part 8 that classifies the text data items into a plurality of clusters based on the stored matrix information; a cluster selecting part 10 that selects some of the clusters; a selected cluster storing part 11 that stores identification information for the selected clusters; and a text vector amendment part 12 that amends the stored matrix information based on the cluster identification information.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、パーソナルコンピュータなど汎用の情報処理装置や、専用装置上に実現される文書検索システム、文書分類システム、およびテキスト情報分析システムなど、文書処理システムに用いることができる、テキストデータの概念を抽出する概念抽出技術に係り、特に、内容に基づきテキストデータを分類する文書分類技術を利用してテキストデータ集合を対象に概念抽出をおこなえる概念抽出技術に関する。
【０００２】
【従来の技術】
文書クラスタリング技術の利用目的は、概ね二つに大別される。一つは膨大な文書集合を話題ごとに分類することにより、目的の文書に短時間で到達するようにすることである。もう一つは、膨大なドキュメントや自由記述回答データの集合から、典型的な内容、話題、概念を取り出し、テキストデータ集合の概要を把握することである。なお、文書クラスタリングとは、特開平８−２６３５１０号公報記載の「文章自動分類システム」、または、特開平１０−１７１８２３号公報記載の「文書の自動分類方法およびその装置」に記載されているように、文書をそのなかに含まれる単語情報などを特徴とするベクトルとして表現し、各文書間の距離や余弦を類似度測度としてクラスタリングする方法である。クラスタリングの結果、対象のテキストデータ集合が類似するテキストデータから成るいくつかのクラスタに分割されるので、対象テキストデータ集合に含まれる代表的な話題や概念（トピック）が得られることになる。
このような文書クラスタリング技術のうち、Ｓｃａｔｔｅｒ／Ｇａｔｈｅｒ方式や特開平１１−２１３０００号公報記載の従来技術では複数回のクラスタリングをおこなって文書を分類する。例えば特開平１１−２１３０００号公報記載の従来技術では、利用者の入力した検索語にマッチする文書集合に対してクラスタリングをおこない、その結果を利用者に提示する。その際、繰り返してクラスタリングがおこなわれるが、Ｎ回目のクラスタリング結果がＮ−１回目以前のクラスタリング結果に影響されない。
【０００３】
それに対して、特開２００２−１８３１７１号公報に示された従来技術では、質の良いクラスタを漸次的に求める。クラスタ数は事前に決定せず、クラスタリング対象に応じてクラスタ数を決める。具体的には、各文書の特徴ベクトルの組について特異値分解をおこない、その結果から文書類似ベクトルを作成し、その文書類似ベクトルを用いて各対象文書とクラスタ重心との距離を算出し、さらに、同一の対象文書に対して１回目の分類に利用した文書類似ベクトルの次元数を増加させて２回目の分類をおこない、双方の結果を比較し、変化の少ないクラスタを安定クラスタとする。そして、安定クラスタの文書を対象から除いて分類を繰り返すことにより、対象に応じたクラスタ数のクラスタを決定する。
【特許文献１】特開平８−２６３５１０号公報
【特許文献２】特開平１０−１７１８２３号公報
【特許文献３】特開平１１−２１３０００号公報
【特許文献４】特開２００２−１８３１７１号公報
【０００４】
【発明が解決しようとする課題】
しかしながら、前記した特開平８−２６３５１０号公報および特開平１０−１７１８２３号公報記載の従来技術は、本質的には単語で構成される多次元空間内に布置した文書を統計的に分類するので、得られるクラスタは、単語の統計的振る舞いという観点から求められたに過ぎず、しばしば理解不能なクラスタが含まれることは否めない。１回のクラスタリングでは、その一部しか利用者にとって有益でないことになる。このようなことから、前記したＳｃａｔｔｅｒ／Ｇａｔｈｅｒ方式や特開平１１−２１３０００号公報記載の従来技術ではクラスタリングを複数回実行させているが、これらの従来技術は、目的が文書検索であるので、複数回のクラスタリングが話題や概念の抽出のためでなく、検索対象文書の絞り込みに利用されている。
また、特開平１１−２１３０００号公報記載の従来技術では、Ｎ回目のクラスタリング結果がＮ−１回目以前のクラスタリング結果に影響されない。つまり、利用者が選択したクラスタの再クラスタリングを繰り返すことにより話題や概念の詳細化は可能であるが、様々な話題や概念の抽出は可能でないのである。
それに対して、特開２００２−１８３１７１号公報に示された従来技術では、Ｎ回目のクラスタリング結果がＮ−１回目以前のクラスタリング結果に影響される。しかし、この従来技術では一般的なクラスタリングの１回分が２回のクラスタリングにより実現されるので効率が悪いし、前記したＮ回の繰り返しに利用者の意図を反映させることができない。
本発明の目的は、このような従来技術の問題を解決することにあり、具体的には、利用者の意図を反映させつつクラスタリングを繰り返すことにより、テキストデータの様々な話題や概念を高精度で抽出できる概念抽出技術を提供することにある。
【０００５】
【課題を解決するための手段】
前記の課題を解決するために、請求項１記載の発明では、複数のテキストデータをインタラクティブに複数回にわたって自動分類することによりテキストデータの概念を抽出する概念抽出システムにおいて、テキストデータ中に出現する各言語の出現頻度をテキストデータ毎に計数し、その出現頻度を行列要素とする前記各テキストデータと前記各言語の行列情報を生成するテキストベクトル生成手段と、前記行列情報を記憶しておくテキストベクトル記憶手段と、そのテキストベクトル記憶手段に記憶されている前記行列情報に基づいて前記各テキストデータを複数のクラスタに分類するクラスタリング手段と、その複数のクラスタ中から一部のクラスタを選択させるクラスタ選択手段と、選択されたクラスタを示すクラスタ識別情報を記憶しておく選択クラスタ記憶手段と、前記クラスタ識別情報に基づき前記テキストベクトル記憶手段に記憶されている前記行列情報を修正するテキストベクトル修正手段とを備えた。
また、請求項２記載の発明では、複数のテキストデータをインタラクティブに複数回にわたって自動分類することによりテキストデータの概念を抽出する概念抽出方法において、テキストデータ中に出現する各言語の出現頻度をテキストデータ毎に計数し、その出現頻度を行列要素とする前記各テキストデータと前記各言語の行列情報を生成し、その行列情報を記憶し、さらに、その行列情報に基づき前記各テキストデータを複数のクラスタに分類するクラスタリングをおこない、その複数のクラスタ中から一部のクラスタを選択させ、選択されたクラスタに基づき前記行列情報を修正し、前記クラスタリングから後を繰り返す構成にした。
また、請求項３記載の発明では、請求項２記載の発明において、前記行列情報の修正は、前記選択されたクラスタに所属するテキストデータに対応する行列情報を保持されている行列情報から削除する構成にした。
また、請求項４記載の発明では、請求項２または請求項３記載の発明において、前記選択されたクラスタに所属するテキストデータから、選択されたクラスタごとのクラスタ特徴量を算出する構成とし、前記行列情報の修正は、そのクラスタ特徴量に基づいて前記行列情報を修正する構成にした。
また、請求項５記載の発明では、請求項４記載の発明において、前記クラスタ特徴量は前記選択されたクラスタに所属するテキストデータ中に出現する特徴的な言語およびその頻度情報であって、前記行列情報の修正は、前記行列情報の対応する行列要素の値から対応する前記頻度情報の値を減ずる構成にした。
また、請求項６記載の発明では、請求項２乃至請求項５記載の発明において、Ｎ回目までのクラスタ選択により選択されているクラスタと、Ｎ＋１回目のクラスタリングにより生成されたクラスタとの間の類似度を算出する構成にした。
また、請求項７記載の発明では、請求項６記載の発明において、前記類似度の大きいクラスタをＮ＋１回目のクラスタリング結果から除く構成にした。
また、請求項８記載の発明では、情報処理装置上で実行されるプログラムにおいて、請求項２乃至請求項７のいずれか１項に記載の概念抽出方法によった概念抽出を実行させるようにプログラミングされている構成にした。
また、請求項９記載の発明では、プログラムを記憶した記憶媒体において、請求項８記載のプログラムを記憶した。
【０００６】
【発明の実施の形態】
以下、図面により本発明の実施の形態を詳細に説明する。なお、ここで扱うテキストデータとは、文書および文書の一部（例えば概要部分や、文書をいくつかの部分に分割したもの）、またはメール文書やコールセンターの問い合わせ記録など、自然言語により記述されたデータ単位である。また、以下の説明では、テキストデータ集合からの漸次的概念抽出の好適な例としてアンケート調査などにより得られた自由記述回答データの分析場面を想定する。
図１は、本発明の第１の実施例を示す、概念抽出システムの構成ブロック図である。図示したように、この実施例の概念抽出システムは、テキストデータの形態素解析をおこなって単語（またはトークン）とその品詞など言語情報を抽出するテキストデータ解析部２、および抽出された単語などの出現頻度を計数するテキストデータ計数部３を有して、その出現頻度に基づいたテキストベクトルを生成するテキストベクトル生成部１、そのテキストベクトルを記憶しておくテキストベクトル記憶部５、およびテキストデータ解析部２による形態素解析結果を記憶しておくテキスト解析結果記憶部６などから成る解析結果記憶部４、前記テキストベクトル記憶部５に記憶されているテキストベクトルに基づいてテキストデータを複数のクラスタに分類するクラスタリング部８、および分類された各クラスタの特徴を求めるクラスタ特徴算出部９などから成るクラスタ処理部７、前記各クラスタ中から一部のクラスタを選択させるクラスタ選択部１０、選択されたクラスタに係るクラスタ情報を記憶しておく選択クラスタ記憶部１１、そのクラスタ情報などに基づきテキストベクトル記憶部５に記憶されているテキストベクトルを修正するテキストベクトル修正部１２などを備えている。
なお、請求項１記載のテキストベクトル生成手段、テキストベクトル記憶手段、クラスタリング手段、クラスタ選択手段、選択クラスタ記憶手段、およびテキストベクトル修正手段はそれぞれ、テキストベクトル生成部１、テキストベクトル記憶部５、クラスタリング部８、クラスタ選択部１０、選択クラスタ記憶部１１、およびテキストベクトル修正部１２により実現される。また、前記したテキストベクトル生成部１、クラスタリング部８を含むクラスタ処理部７、クラスタ選択部１０、およびテキストベクトル修正部１２はプログラムを記憶したＲＡＭおよびそのプログラムに従って動作するＣＰＵなどにより実現され、テキストベクトル記憶部５を含む解析結果記憶部４および選択クラスタ記憶部１１はＲＡＭおよびハードディスク記憶装置それぞれの一記憶領域として実現される。
【０００７】
図８に、この実施例の動作フローを示す。以下、図８に従って、この動作フローを説明する。
まず、テキストデータ解析部２が、公知の形態素解析アルゴリズムを用い、入力されたテキストデータに含まれる単語（または、単なる単語でなく、ルールを用いて複数の形態素を変換して新たに生成した例えば複合語や異表記を統一したものなどを含むトークンと呼ばれるもの）とその品詞など言語情報を抽出する（Ｓ１）。例えば、［ソフトウェアの操作方法は難しくいつも苦労する］というテキストデータからは、ソフトウェア−（名詞）／の−（助詞）／操作−（名詞）／方法−（名詞）／は−（助詞）／難しい−（形容詞）／いつも−（副詞）／苦労−（名詞）／する−（助動詞）が抽出される。
続いて、テキストデータ計数部３が、予め設定されたストップワード（計数除外単語や計数除外品詞）テーブルを参照して、各テキストデータにおける有効な単語（またはトークン）の出現頻度を単語（またはトークン）毎に計数し（Ｓ２）、図２に示したように、テキストデータ毎に、抽出されたすべての有効単語（または有効トークン）のベクトルとして表現される行列に計数結果を書き込んで行く（ｍは全テキストデータ数、ｎは全テキストデータ集合内に出現する全有効単語数）。なお、行列の要素に単純に出現頻度を用いるのではなく、テキストデータの長さや、単語の全テキストデータ集合内での総出現頻度を用いて重み付けをおこなってもよい。
テキストデータがある限り（Ｓ３でＹ）前記したステップＳ１、Ｓ２をくり返し、対象とするテキストデータがなくなると（Ｓ３でＮ）、テキストデータ計数部３は行列の生成を完了する（Ｓ４）。そして、その行列のデータをテキストベクトル記憶部５に保存する（Ｓ５）。
【０００８】
次に、クラスタリング部８が、テキストベクトル記憶部５に格納された行列のデータを用い、テキストベクトル間の余弦（内積や距離でもよい）を測度としてｋ−ｍｅａｎｓ法、最大距離法などの既知のアルゴリズムを用いてクラスタを生成し、ｋ個のクラスタに全テキストデータを割り当てる（Ｓ６）。また、クラスタ特徴算出部９は、生成された各クラスタの特徴を表すクラスタ特徴トークンを求める。この実施例では、クラスタｋのクラスタ特徴トークンを、“トークンｉが出現するクラスタｋ所属のテキストデータ数／全テキストデータセットにおけるトークンｉが出現するテキストデータ数”が一定以上の値をとるトークンｉを特徴トークンとしている。
例えば、クラスタｋの特徴トークンは、次のようにして計算される。まず、テキストベクトル記憶部５に格納されている行列に基づいて、各トークン毎に列方向に要素が１以上であればカウンタを順次加算し、全テキストデータにおける各トークンの出現データ数を算出する。次いで、クラスタｋに所属するテキストデータの識別子をもとにテキストベクトル記憶部５からそのテキストデータのみからなる部分行列を生成し、同様に、各トークン毎に列方向に要素が１以上であればカウンタを順次加算し、当該クラスタ所属のテキストデータ数を算出する。順次このような計算を繰り返すことで、各トークンの全テキストデータに対する出現データ数と、クラスタｋに所属するテキストデータにおける出現データ数とが計算できる。次いで、順次各トークン毎に２つの数を用いて除算することで特徴量を求め、その値があらかじめ定めたＭ以上のトークンを特徴トークンとして、その識別子をクラスタ選択部１０へ出力する。クラスタ選択部１０は、渡されたトークン識別子をキーとして、テキスト解析結果記憶部６からそのトークン表記を検索して表示する。例えば、図３においては、クラスタ１について“管理”、“ダウン”、“多忙”、“システム”を特徴トークンとして表示するのである。
【０００９】
こうして、クラスタ選択部１０は、表示手段と入力手段とを用いたユーザーインターフェースにより、利用者にクラスタを選択させる。図３に示したような画面のユーザーインターフェースを用いて、利用者は有益な概念であると判断したクラスタを選択・指示するのである。なお、この実施例のクラスタ選択部１０では、クラスタを表示する際、クラスタリング部８によって生成されたクラスタを以下のクラスタ重要度スコアを用いてソートし、所属テキストデータ数が多く、クラスタ凝集度が高いクラスタが上位に表示されるようしている。
Ｔｋ＝（１／ＮｋΣ（Ｓｋｉ−Ｈｋ）^２）^１／２Ｎｋ／Ｎ
Ｔｋ・・クラスタｋのクラスタ重要度スコア
Ｎｋ・・クラスタｋ所属のテキストデータ数
Ｎ・・全テキストデータ数
Ｓｋｉ・・クラスタｋ所属のテキストｉの類似度スコア（距離、内積、余弦）
Ｈｋ・・クラスタｋの平均類似度スコア
クラスタ選択部１０により、利用者は、保存したいクラスタ（クラスタリングが適切で、利用者が有益であると判断したクラスタ）をマウスなどを用いて選択することができる。図３において、■印は選択されたことを示し、クラスタ特徴トークンとして、特徴単語を示している。また、メンバ数とは、各クラスタに属するテキストデータ数である。
保存するクラスタを選択後、再実行ボタン（図３参照）を押下すると、クラスタ選択部１０は、選択されたクラスタの識別子を選択クラスタ記憶部１１に、クラスタ識別子、所属テキストデータの識別子、およびクラスタ特徴トークンの識別子をテキストベクトル修正部１２に渡す（Ｓ７でＹ）。選択クラスタ記憶部１１は渡されたクラスタ識別子を保持する。なお、終了ボタンを押下した場合は、概念抽出処理を終了する（Ｓ７でＮ）。
【００１０】
テキストベクトル修正部１２では、テキストベクトル記憶部５からそこに保持されている行列を呼び出し、選択されたクラスタに所属する所属テキストデータの識別子を用いて当該テキストデータの行を削除する修正をおこなった後（図４において右側が削除後の状態を示している）（Ｓ８）、修正した行列をテキストベクトル記憶部５に記憶する（Ｓ５）。図４はテキストベクトル修正部１２にテキストデータの識別子２、４、５が渡された場合の作用を示したもので、行列からｔｅｘｔｄａｔａ−２、ｔｅｘｔｄａｔａ−４、ｔｅｘｔｄａｔａ−５が削除される。そして、修正された行列がテキストベクトル記憶部５に格納されると、再びクラスタリング部８が残りのテキストデータを対象に前回と同様にクラスタリングをおこなう（Ｓ６）。
こうして、この実施例によれば、クラスタリング処理の結果の一部しか有益でない場合でも、各回の有益なクラスタを保存し、次回のクラスタリングでは有益なクラスタのテキストデータを除いたテキストデータについてクラスタリングを再度おこなうので、漸次的に概念を抽出することができ、テキストの様々な話題や概念を高精度で抽出できる。
本発明の第２の実施例では、テキストベクトル修正部１２は、テキストベクトル記憶部５からそこに記憶されている行列を呼び出し、選択されたクラスタに所属するテキストデータから選択されたクラスタ毎のクラスタ特徴トークンの識別子を用いて対応するトークン列を削除した後、修正した行列をテキストベクトル記憶部５に保存する。図５は、テキストベクトル修正部１２にトークン識別子１、３、５、６が渡された場合の作用を示したもので、テキストベクトル修正部１２は行列からｔｏｋｅｎ−１、ｔｏｋｅｎ−３、ｔｏｋｅｎ−５、ｔｏｋｅｎ−６の列を削除する。そして、修正した行列をテキストベクトル記憶部５に格納すると、クラスタリング部８が再びクラスタリングをおこなう。
こうして、この実施例によれば、一つのテキストデータが複数のクラスタに所属しうるケースにおいて、選択しなかったクラスタに属するテキストデータ行が行列から削除されてしまうという第１の実施例の事態を避けることができるので、より精度高く、漸次的に概念を抽出することができる。
【００１１】
本発明の第３の実施例では、テキストベクトル修正部１２はテキストベクトル記憶部５からそこに記憶されている行列を呼び出し、選択されたクラスタに所属するテキストデータ識別子およびクラスタ特徴トークンの識別子を用いて対応する要素の値を０にした後、修正した行列をテキストベクトル記憶部５に保存する。図６では、テキストベクトル修正部１２にテキストデータ識別子１、３、５、クラスタ特徴トークン識別子１、３、４、６が渡された場合の作用を示したもので、テキストデータ識別子１、３、５のクラスタ特徴トークン識別子１、３、４、６の値を０にする。そして、修正された行列がテキストベクトル記憶部５に格納されると、クラスタリング部８が再びクラスタリングをおこなう。
こうして、この実施例によれば、一つのテキストデータが複数のクラスタに所属しうるケースにおいて、選択しなかったクラスタに属するテキストデータ行が行列から削除されてしまうという第１の実施例の事態を避けることができるし、選択されたクラスタのクラスタ特徴トークンについてはそのクラスタに所属するテキストデータの要素に対してのみ選択結果を反映させるので、さらに精度高く、漸次的に概念を抽出することができる。
【００１２】
図７は、本発明の第４の実施例を示す、概念抽出システムの構成ブロック図である。図示したように、この実施例の概念抽出システムは、図１に示した第１の実施例の構成に加えてクラスタ間類似度算出部１３を備える。そして、クラスタ間類似度算出部１３が、クラスタリング部８により生成された各クラスタと既に生成されてクラスタ記憶部１１に保持される各クラスタとの間の類似度を以下のように算出する。
つまり、この実施例では、クラスタリング部８により生成されたクラスタｉに所属するテキストデータの識別子集合と、クラスタ記憶部１１内の各クラスタｊに所属するテキストデータの識別子集合との類似度を次式により算出する。
Ｆ＝（β^２＋１）・ｐ・ｒ／（β^２・ｐ＋ｒ）
ここで、ｐは、クラスタｉとクラスタｊとの積集合の要素数を、クラスタｉの要素数で除したものである。また、ｒはクラスタｉとクラスタｊとの積集合の要素数を、クラスタｊの要素数で除したものである。また、βは調整パラメータであり、０〜１の値をとるが、この実施例では０．５を設定している。
【００１３】
ここで、クラスタリング部８により生成されたクラスタｉに所属するテキストデータの識別子が“ｔｅｘｔｄａｔａ−２４６、ｔｅｘｔｄａｔａ−５６７、ｔｅｘｔｄａｔａ−１２、ｔｅｘｔｄａｔａ−３２１、ｔｅｘｔｄａｔａ−９”であり、選択クラスタ記憶部１１に格納されているクラスタｊに所属するテキストデータの識別子が“ｔｅｘｔｄａｔａ−１、ｔｅｘｔｄａｔａ−２４６、ｔｅｘｔｄａｔａ−４５６、ｔｅｘｔｄａｔａ−１１２、ｔｅｘｔｄａｔａ−３２１、ｔｅｘｔｄａｔａ−９”であると、その積集合は、“ｔｅｘｔｄａｔａ−２４６、ｔｅｘｔｄａｔａ−３２１、ｔｅｘｔｄａｔａ−９”である。したがって、ｐは３／５であり、ｒは３／６となる。
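The similarity calculation in this paragraph can be checked with a short Python sketch. It is only an illustration under stated assumptions: the function name `cluster_f_score` is hypothetical, the identifiers are written in ASCII, and returning 0.0 for empty clusters is an assumption the source does not cover. β = 0.5, p = 3/5, and r = 3/6 come from the text above.

```python
def cluster_f_score(cluster_i, cluster_j, beta=0.5):
    """F = (beta^2 + 1) * p * r / (beta^2 * p + r) for two sets of identifiers."""
    a, b = set(cluster_i), set(cluster_j)
    if not a or not b:
        return 0.0  # assumption: empty clusters are treated as dissimilar
    common = len(a & b)
    p = common / len(a)  # intersection size / size of cluster i
    r = common / len(b)  # intersection size / size of cluster j
    denom = beta ** 2 * p + r
    return (beta ** 2 + 1) * p * r / denom if denom else 0.0

cluster_i = ["textdata-246", "textdata-567", "textdata-12",
             "textdata-321", "textdata-9"]
cluster_j = ["textdata-1", "textdata-246", "textdata-456",
             "textdata-112", "textdata-321", "textdata-9"]
f = cluster_f_score(cluster_i, cluster_j)  # p = 3/5, r = 3/6
```

With these values F is about 0.58. Whether such a cluster is discarded depends on the predetermined threshold, which the source does not state.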
このようにして類似度を求めた後、クラスタ間類似度算出部１３は、算出したｉ×ｊ個のＦ値が予め与えた所定値以上の、選択済みのクラスタと類似のクラスタを破棄し、Ｆ値がその所定値より小さいクラスタの識別子をクラスタ選択部１０に渡す。これにより、クラスタ選択部１０は渡された識別子のクラスタを選択対象として表示させる。
こうして、この実施例によれば、選択済みのクラスタに類似したクラスタは続いておこなわれるクラスタ提示の際、自動的に除かれるので、概念抽出を効率的におこなうことができる。
以上、図１および図７に示した構成の場合で本発明の実施例を説明したが、説明したような概念抽出方法に従ってプログラミングしたプログラムを着脱可能な記憶媒体に記憶し、その記憶媒体をこれまで本発明によった概念抽出をおこなえなかったパーソナルコンピュータなど情報処理装置に装着することにより、または、そのようなプログラムをネットワークを介してそのような情報処理装置へ転送することにより、そのような情報処理装置においても本発明によった概念抽出をおこなうことができる。
【００１４】
【発明の効果】
以上説明したように、本発明によれば、請求項１および請求項２記載の発明では、複数のテキストデータをインタラクティブに複数回にわたって自動分類することによりテキストデータの概念を抽出する際、テキストデータ中に出現する各言語の出現頻度をテキストデータ毎に計数し、その出現頻度を行列要素とする各テキストデータと各言語の行列情報を生成し、その行列情報を記憶し、さらに、その行列情報に基づき複数のテキストデータを複数のクラスタに分類するクラスタリングをおこない、その複数のクラスタ中から一部のクラスタを利用者に選択させ、選択されたクラスタに基づき前記行列情報を修正し、クラスタリングから後を繰り返すことができるので、利用者から見てクラスタリング処理の結果の一部しか有益でない場合でも、Ｎ回目までに既知になったクラスタ（選択されたクラスタ）がＮ＋１回目には算出されないように制御でき、したがって、テキストデータの様々な話題や概念を漸次的に高精度で抽出できる。
また、請求項３記載の発明では、請求項２記載の発明において、選択されたクラスタに所属するテキストデータのみに対応する行列情報を保持されている行列情報から削除するので、一つのテキストデータが複数のクラスタに所属しうるケースにおいて、選択しなかったクラスタに属するテキストデータ行が行列から削除されてしまうという事態を避けることができ、したがって、より精度高く、漸次的に概念を抽出することができる。
また、請求項４記載の発明では、請求項２または請求項３記載の発明において、選択されたクラスタに所属するテキストデータから、選択されたクラスタごとのクラスタ特徴量を算出し、そのクラスタ特徴量に基づいて行列情報を修正するので、選択しなかったクラスタに属するテキストデータ行が行列から削除されてしまうという事態を避けることができるし、選択されたクラスタのクラスタ特徴トークンについてはそのクラスタに所属するテキストデータの要素に対してのみ選択結果を反映でき、したがって、さらに精度高く、漸次的に概念を抽出することができる。
【００１５】
また、請求項５記載の発明では、請求項４記載の発明において、クラスタ特徴量は選択されたクラスタに所属するテキストデータ中に出現する特徴的な言語およびその頻度情報であって、行列情報の対応する行列要素の値から対応する頻度情報の値を減ずる行列情報修正をおこなうので、請求項４記載の発明を容易に実現できる。
また、請求項６記載の発明では、請求項２乃至請求項５記載の発明において、Ｎ回目までのクラスタ選択により選択されているクラスタと、Ｎ＋１回目のクラスタリングにより生成されたクラスタとの間の類似度を算出することができるので、選択済みのクラスタに類似したクラスタを、続いておこなわれるクラスタ提示の際、自動的に除くことができる。
また、請求項７記載の発明では、請求項６記載の発明において、前記類似度の大きいクラスタをＮ＋１回目のクラスタリング結果から除くことができるので、選択済みのクラスタに類似したクラスタは続いておこなわれるクラスタ提示の際、自動的に除かれ、したがって、概念抽出を効率的におこなうことができる。
また、請求項８記載の発明では、請求項２乃至請求項７のいずれか１項に記載の概念抽出方法によった概念抽出を実行させるようにプログラミングされているプログラムを情報処理装置上で実行させることができるので、情報処理装置を用いて請求項２乃至請求項７のいずれか１項に記載の発明の効果を得ることができる。
また、請求項９記載の発明では、請求項８記載のプログラムを着脱可能な記憶媒体に記憶できるので、その記憶媒体をこれまで請求項２乃至請求項７のいずれか１項に記載の発明によった概念抽出をおこなえなかったパーソナルコンピュータなど情報処理装置に装着することにより、そのような情報処理装置においても請求項２乃至請求項７のいずれか１項に記載の発明の効果を得ることができる。
【図面の簡単な説明】
【図１】本発明の第１の実施例を示す、概念抽出システムの構成ブロック図。
【図２】本発明の第１の実施例を示す、概念抽出システム要部のデータ構成図。
【図３】本発明の第１の実施例を示す、概念抽出システムの画面図。
【図４】本発明の第１の実施例を示す、概念抽出方法の説明図。
【図５】本発明の第２の実施例を示す、概念抽出方法の説明図。
【図６】本発明の第３の実施例を示す、概念抽出方法の説明図。
【図７】本発明の第４の実施例を示す、概念抽出システムの構成ブロック図。
【図８】本発明の第１の実施例を示す、概念抽出方法の動作フロー図。
【符号の説明】
１テキストベクトル生成部、２テキストデータ解析部、３テキストデータ計数部、４解析結果記憶部、５テキストベクトル記憶部、６テキスト解析結果記憶部、７クラスタ処理部、８クラスタリング部、９クラスタ特徴算出部、１０クラスタ選択部、１１選択クラスタ記憶部、１２テキストベクトル修正部、１３クラスタ間類似度算出部
[0001]
[Technical field of the invention]
The present invention relates to a concept extraction technique for extracting the concepts of text data, usable in document processing systems such as document search systems, document classification systems, and text information analysis systems implemented on general-purpose information processing devices such as personal computers or on dedicated devices. In particular, it relates to a concept extraction technique that extracts concepts from a set of text data by using a document classification technique that classifies text data based on its content.
[0002]
[Prior art]
Document clustering is used for roughly two purposes. One is to classify an enormous document set by topic so that a target document can be reached quickly. The other is to extract typical content, topics, and concepts from an enormous collection of documents or free-text answer data, and thereby grasp an overview of the text data set. Document clustering, as described in "Sentence Automatic Classification System" (JP-A-8-263510) or "Document Automatic Classification Method and Apparatus" (JP-A-10-171823), is a method in which each document is represented as a vector whose features are, for example, the words it contains, and documents are clustered using the distance or cosine between documents as a similarity measure. As a result of the clustering, the target text data set is divided into several clusters of similar text data, which yields the representative topics and concepts contained in the set.
Among such document clustering techniques, the Scatter/Gather method and the conventional technique described in JP-A-11-213000 classify documents by performing clustering multiple times. For example, in the technique of JP-A-11-213000, clustering is performed on the document set that matches a search term entered by the user, and the result is presented to the user. Clustering is repeated in this process, but the N-th clustering result is not affected by the results of the (N-1)-th and earlier clusterings.
[0003]
In contrast, the conventional technique disclosed in JP-A-2002-183171 obtains high-quality clusters incrementally. The number of clusters is not fixed in advance but is determined according to the clustering target. Specifically, singular value decomposition is performed on the set of feature vectors of the documents, document similarity vectors are created from the result, and the distance between each target document and each cluster centroid is calculated using these vectors. The same target documents are then classified a second time with more dimensions in the document similarity vectors than were used for the first classification; the two results are compared, and clusters that change little are regarded as stable clusters. By repeating the classification with the documents of the stable clusters removed from the target, a number of clusters appropriate to the target is determined.
[Patent Document 1] JP-A-8-263510
[Patent Document 2] JP-A-10-171823
[Patent Document 3] JP-A-11-213000
[Patent Document 4] JP-A-2002-183171
[0004]
[Problems to be solved by the invention]
However, the conventional techniques described in JP-A-8-263510 and JP-A-10-171823 essentially classify, by statistical means, documents placed in a multidimensional space whose dimensions are words. The resulting clusters are therefore obtained only from the viewpoint of the statistical behavior of words, and they undeniably often include clusters that cannot be interpreted. With a single clustering pass, only part of the result is useful to the user. For this reason, the Scatter/Gather method and the technique of JP-A-11-213000 run clustering multiple times; however, because the purpose of those techniques is document retrieval, the repeated clustering is used to narrow down the documents to be searched, not to extract topics and concepts.
Furthermore, in the technique of JP-A-11-213000, the N-th clustering result is not affected by the results of the (N-1)-th and earlier clusterings. In other words, repeatedly re-clustering the cluster selected by the user can refine a topic or concept, but it cannot extract a variety of topics and concepts.
In the technique of JP-A-2002-183171, by contrast, the N-th clustering result is affected by the results of the (N-1)-th and earlier clusterings. However, this technique is inefficient because what would ordinarily be a single clustering pass requires two clustering runs, and the user's intention cannot be reflected in the N rounds of repetition.
An object of the present invention is to solve these problems of the prior art. Specifically, it is to provide a concept extraction technique that can extract the various topics and concepts of text data with high accuracy by repeating clustering while reflecting the user's intention.
[0005]
[Means for Solving the Problems]
To solve the above problems, the invention of claim 1 is a concept extraction system that extracts the concepts of text data by automatically classifying a plurality of text data items interactively over multiple rounds, comprising: text vector generation means for counting, for each text data item, the appearance frequency of each word appearing in the text data and generating matrix information of text data items by words with the appearance frequencies as matrix elements; text vector storage means for storing the matrix information; clustering means for classifying the text data items into a plurality of clusters based on the matrix information stored in the text vector storage means; cluster selection means for having some of the clusters selected from among the plurality of clusters; selected cluster storage means for storing cluster identification information indicating the selected clusters; and text vector correction means for correcting the matrix information stored in the text vector storage means based on the cluster identification information.
The invention of claim 2 is a concept extraction method that extracts the concepts of text data by automatically classifying a plurality of text data items interactively over multiple rounds, in which: the appearance frequency of each word appearing in the text data is counted for each text data item; matrix information of text data items by words is generated with the appearance frequencies as matrix elements; the matrix information is stored; clustering is performed to classify the text data items into a plurality of clusters based on the matrix information; some of the clusters are selected from among the plurality of clusters; the matrix information is corrected based on the selected clusters; and the steps from the clustering onward are repeated.
In the invention of claim 3, in the invention of claim 2, the matrix information is corrected by deleting, from the held matrix information, the matrix information corresponding to the text data belonging to the selected clusters.
In the invention of claim 4, in the invention of claim 2 or claim 3, a cluster feature amount is calculated for each selected cluster from the text data belonging to that cluster, and the matrix information is corrected based on the cluster feature amount.
In the invention of claim 5, in the invention of claim 4, the cluster feature amount consists of the characteristic words appearing in the text data belonging to the selected cluster and their frequency information, and the matrix information is corrected by subtracting the corresponding frequency value from the value of the corresponding matrix element.
In the invention of claim 6, in the invention of any of claims 2 to 5, the similarity is calculated between the clusters selected in the cluster selections up to the N-th round and the clusters generated by the (N+1)-th clustering.
In the invention of claim 7, in the invention of claim 6, clusters with a high similarity are excluded from the (N+1)-th clustering result.
In the invention of claim 8, a program executed on an information processing apparatus is programmed to carry out concept extraction by the concept extraction method according to any one of claims 2 to 7.
In the invention of claim 9, a storage medium storing a program stores the program according to claim 8.
[0006]
[Embodiments of the invention]
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. The text data handled here is a unit of data written in natural language, such as a document, a part of a document (for example, a summary section or one of several parts into which a document has been divided), an e-mail, or a call center inquiry record. In the following description, the analysis of free-text answer data obtained by a questionnaire survey or the like is assumed as a suitable example of incremental concept extraction from a text data set.
FIG. 1 is a block diagram showing the configuration of a concept extraction system according to a first embodiment of the present invention. As shown in the figure, the concept extraction system of this embodiment comprises: a text vector generation unit 1, which includes a text data analysis unit 2 that performs morphological analysis of text data to extract words (or tokens) and linguistic information such as their parts of speech, and a text data counting unit 3 that counts the appearance frequency of the extracted words, and which generates text vectors based on those frequencies; an analysis result storage unit 4, which includes a text vector storage unit 5 that stores the text vectors and a text analysis result storage unit 6 that stores the morphological analysis results produced by the text data analysis unit 2; a cluster processing unit 7, which includes a clustering unit 8 that classifies the text data into a plurality of clusters based on the text vectors stored in the text vector storage unit 5 and a cluster feature calculation unit 9 that obtains the features of each resulting cluster; a cluster selection unit 10 that lets the user select some of the clusters; a selected cluster storage unit 11 that stores cluster information on the selected clusters; and a text vector correction unit 12 that corrects the text vectors stored in the text vector storage unit 5 based on that cluster information and the like.
The text vector generation means, text vector storage means, clustering means, cluster selection means, selected cluster storage means, and text vector correction means of claim 1 are realized by the text vector generation unit 1, the text vector storage unit 5, the clustering unit 8, the cluster selection unit 10, the selected cluster storage unit 11, and the text vector correction unit 12, respectively. The text vector generation unit 1, the cluster processing unit 7 including the clustering unit 8, the cluster selection unit 10, and the text vector correction unit 12 are realized by a RAM storing a program, a CPU operating according to that program, and the like. The analysis result storage unit 4 including the text vector storage unit 5, and the selected cluster storage unit 11, are each realized as a storage area in the RAM and in the hard disk storage device.
[0007]
FIG. 8 shows an operation flow of this embodiment. Hereinafter, this operation flow will be described with reference to FIG.
First, the text data analysis unit 2 uses a known morphological analysis algorithm to extract the words contained in the input text data, together with linguistic information such as their parts of speech (S1). Instead of plain words, the units may be so-called tokens, which include items newly generated by converting multiple morphemes with rules, for example compound words or unified spelling variants. For example, from the text data [ソフトウェアの操作方法は難しくいつも苦労する] (roughly, "the software is difficult to operate and I always struggle with it"), the following are extracted: ソフトウェア (noun) / の (particle) / 操作 (noun) / 方法 (noun) / は (particle) / 難しい (adjective) / いつも (adverb) / 苦労 (noun) / する (auxiliary verb).
Next, the text data counting unit 3 refers to a preset stop word table (words and parts of speech excluded from counting), counts the appearance frequency of each valid word (or token) in each text data item (S2), and, as shown in FIG. 2, writes the counts into a matrix in which each text data item is represented as a vector over all extracted valid words (or valid tokens), where m is the total number of text data items and n is the total number of valid words appearing in the whole text data set. Instead of using raw appearance frequencies as the matrix elements, the values may be weighted using the length of the text data or the total appearance frequency of the word in the whole text data set.
As long as there is text data (Y in S3), the above steps S1 and S2 are repeated. When there is no more text data to be processed (N in S3), the text data counting unit 3 completes the generation of the matrix (S4). Then, the data of the matrix is stored in the text vector storage unit 5 (S5).
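Steps S2 to S4 can be sketched as follows. This is a minimal illustration rather than the patent's implementation: it assumes the morphological analysis of S1 has already produced token lists, and the names `build_matrix`, `docs`, and `stop_words`, as well as the English sample tokens, are hypothetical.

```python
from collections import Counter

def build_matrix(docs, stop_words):
    """Build an m x n count matrix from pre-tokenized text data.

    docs: list of token lists (output of step S1, assumed given).
    stop_words: tokens excluded from counting (hypothetical table).
    Returns (vocab, matrix); matrix[i][j] is the count of vocab[j] in docs[i].
    """
    # Count the valid tokens of each text data item (step S2).
    counts = [Counter(t for t in doc if t not in stop_words) for doc in docs]
    # Columns are all valid tokens seen in the whole set, in sorted order.
    vocab = sorted(set().union(*counts)) if counts else []
    matrix = [[c[tok] for tok in vocab] for c in counts]
    return vocab, matrix

docs = [
    ["software", "operation", "method", "is", "difficult"],
    ["software", "system", "down", "system"],
]
vocab, matrix = build_matrix(docs, stop_words={"is"})
```

The optional weighting by text length or by total frequency mentioned above is left out of this sketch.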
[0008]
Next, the clustering unit 8 uses the matrix data stored in the text vector storage unit 5 to generate clusters with a known algorithm such as the k-means method or the maximum distance method, using the cosine between text vectors as the measure (the inner product or distance may be used instead), and assigns all text data to k clusters (S6). The cluster feature calculation unit 9 then obtains the cluster feature tokens that characterize each generated cluster. In this embodiment, a token i is a cluster feature token of cluster k if the ratio "number of text data items belonging to cluster k in which token i appears / number of text data items in the whole text data set in which token i appears" is at least a certain value.
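As one possible reading of step S6, the sketch below assigns count vectors to clusters by cosine similarity using a simple k-means-style loop. It is an assumption-laden illustration: the source names k-means only as one example of a known algorithm, the function names are hypothetical, and the initial centroids are passed in explicitly because the source does not say how they are chosen.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def kmeans_cosine(rows, init_centroids, max_iter=20):
    """Assign each row to the centroid with the highest cosine similarity."""
    centroids = [list(c) for c in init_centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            max(range(len(centroids)), key=lambda k: cosine(row, centroids[k]))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for k in range(len(centroids)):
            members = [r for r, lab in zip(rows, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

rows = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 1, 2], [0, 0, 2, 1]]
labels = kmeans_cosine(rows, init_centroids=[rows[0], rows[2]])
```

Here the first two rows share tokens and end up in one cluster, and the last two in the other.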
For example, the feature tokens of cluster k are calculated as follows. First, based on the matrix stored in the text vector storage unit 5, for each token a counter is incremented for every element in that token's column that is 1 or more, giving the number of text data items in the whole set in which the token appears. Next, a submatrix consisting only of the text data belonging to cluster k is generated from the text vector storage unit 5 using the identifiers of that text data, and in the same way a counter is incremented for every element of 1 or more in each token's column, giving the number of text data items in the cluster in which the token appears. Repeating this yields, for each token, the number of text data items containing it in the whole set and the number containing it within cluster k. The feature amount of each token is then obtained by dividing the latter by the former, and tokens whose value is at least a predetermined threshold M are treated as feature tokens, whose identifiers are output to the cluster selection unit 10. The cluster selection unit 10 uses each received token identifier as a key to look up the token's surface form in the text analysis result storage unit 6 and displays it. For example, in FIG. 3, "management", "down", "busy", and "system" are displayed as the feature tokens of cluster 1.
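The feature token calculation described above can be sketched as follows. The function name `feature_tokens`, the use of row and column indices as identifiers, and the sample matrix are assumptions for illustration; `threshold` plays the role of the predetermined value M.

```python
def feature_tokens(matrix, member_rows, threshold):
    """Return column indices of tokens whose ratio
    (items in the cluster containing the token) / (items in the whole set
    containing the token) is at least `threshold`."""
    n_cols = len(matrix[0]) if matrix else 0
    # Number of text data items in the whole set that contain each token.
    df_all = [sum(1 for row in matrix if row[j] >= 1) for j in range(n_cols)]
    # The same count restricted to the cluster's rows (the submatrix).
    df_cluster = [
        sum(1 for i in member_rows if matrix[i][j] >= 1) for j in range(n_cols)
    ]
    return [
        j for j in range(n_cols)
        if df_all[j] > 0 and df_cluster[j] / df_all[j] >= threshold
    ]

matrix = [
    [1, 0, 2, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 3],
    [1, 0, 0, 1],
]
# Cluster k consists of rows 0 and 1; ratios per column are 2/3, 1/2, 1/1, 0/2.
tokens = feature_tokens(matrix, member_rows=[0, 1], threshold=0.6)
```

With a threshold of 0.6, columns 0 and 2 qualify as feature tokens in this toy data.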
[0009]
In this way, the cluster selection unit 10 lets the user select clusters through a user interface that uses display means and input means. Using a screen such as the one shown in FIG. 3, the user selects and designates the clusters judged to represent useful concepts. When displaying clusters, the cluster selection unit 10 of this embodiment sorts the clusters generated by the clustering unit 8 by the following cluster importance score, so that clusters with many member text data items and high cohesion are displayed at the top.
$T_k = \left( \frac{1}{N_k} \sum_{i} (S_{ki} - H_k)^2 \right)^{1/2} \cdot \frac{N_k}{N}$
$T_k$: cluster importance score of cluster k
$N_k$: number of text data items belonging to cluster k
$N$: total number of text data items
$S_{ki}$: similarity score of text i belonging to cluster k (distance, inner product, or cosine)
$H_k$: average similarity score of cluster k
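The score can be computed as in the sketch below. It assumes the formula is read as the square root of the mean squared deviation of the similarity scores, multiplied by Nk / N; the printed formula does not make the extent of the exponent fully clear, so this reading, the function name `importance_score`, and the sample values are all assumptions.

```python
import math

def importance_score(similarities, total_count):
    """T_k for one cluster under the reading
    sqrt((1/N_k) * sum((S_ki - H_k)^2)) * N_k / N."""
    n_k = len(similarities)
    if n_k == 0 or total_count == 0:
        return 0.0
    h_k = sum(similarities) / n_k  # average similarity score H_k
    spread = math.sqrt(sum((s - h_k) ** 2 for s in similarities) / n_k)
    return spread * n_k / total_count

# Four member items with hypothetical similarity scores, out of N = 8 items.
score = importance_score([0.5, 0.75, 1.0, 0.75], total_count=8)
```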
Through the cluster selection unit 10, the user can select, using a mouse or the like, the clusters to be saved (clusters that are appropriately formed and that the user judges to be useful). In FIG. 3, the ■ mark indicates that a cluster has been selected, and characteristic words are shown as the cluster feature tokens. The number of members is the number of text data items belonging to each cluster.
When the re-execute button (see FIG. 3) is pressed after the clusters to be saved have been selected, the cluster selection unit 10 passes the identifiers of the selected clusters to the selected cluster storage unit 11, and passes the cluster identifiers, the identifiers of the member text data, and the identifiers of the cluster feature tokens to the text vector correction unit 12 (Y in S7). The selected cluster storage unit 11 holds the cluster identifiers it receives. If the end button is pressed instead, the concept extraction process ends (N in S7).
[0010]
The text vector correction unit 12 reads the matrix held in the text vector storage unit 5, deletes the rows of the text data belonging to the selected clusters using the identifiers of that text data (S8; the right side of FIG. 4 shows the state after deletion), and stores the corrected matrix in the text vector storage unit 5 (S5). FIG. 4 shows the operation when text data identifiers 2, 4, and 5 are passed to the text vector correction unit 12: textdata-2, textdata-4, and textdata-5 are deleted from the matrix. Once the corrected matrix is stored in the text vector storage unit 5, the clustering unit 8 again clusters the remaining text data in the same manner as before (S6).
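The row deletion of step S8 can be sketched as follows. The identifiers 2, 4, and 5 follow the FIG. 4 example; the function name `delete_rows` and the matrix values are hypothetical, since the figure's contents are not reproduced in the text.

```python
def delete_rows(row_ids, matrix, selected_ids):
    """Drop the rows of the text data that belong to the selected clusters."""
    selected = set(selected_ids)
    kept = [(rid, row) for rid, row in zip(row_ids, matrix) if rid not in selected]
    return [rid for rid, _ in kept], [row for _, row in kept]

row_ids = [1, 2, 3, 4, 5, 6]  # textdata-1 .. textdata-6
matrix = [[1, 0], [0, 2], [3, 0], [0, 1], [1, 1], [2, 0]]  # made-up counts
row_ids, matrix = delete_rows(row_ids, matrix, selected_ids=[2, 4, 5])
```

Keeping the identifiers alongside the matrix lets later rounds still refer to the original text data numbers.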
Thus, according to this embodiment, even if only a part of the result of the clustering process is useful, each useful cluster is saved and the next clustering is performed again on the text data excluding the text data of the useful cluster. As a result, concepts can be extracted gradually, and various topics and concepts of text can be extracted with high accuracy.
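The row deletion of FIG. 4 can be sketched as follows. This is a minimal illustration only: the function name `delete_text_rows`, the list-of-lists matrix layout, and the frequency values are assumptions made for the example and do not appear in the specification.

```python
def delete_text_rows(matrix, text_ids, selected_text_ids):
    """Drop the rows whose text data identifier belongs to a selected cluster."""
    selected = set(selected_text_ids)
    kept = [(tid, row) for tid, row in zip(text_ids, matrix) if tid not in selected]
    new_ids = [tid for tid, _ in kept]
    new_matrix = [row for _, row in kept]
    return new_matrix, new_ids


# Illustrative frequency matrix (assumed values): 6 text data (rows) x 4 tokens (columns).
text_ids = [1, 2, 3, 4, 5, 6]
matrix = [
    [2, 0, 1, 0],
    [0, 3, 0, 1],
    [1, 1, 0, 0],
    [0, 2, 0, 2],
    [0, 1, 0, 3],
    [4, 0, 0, 0],
]

# Text data identifiers 2, 4, and 5 are passed, as in the FIG. 4 example.
remaining_matrix, remaining_ids = delete_text_rows(matrix, text_ids, [2, 4, 5])
print(remaining_ids)     # [1, 3, 6]
print(remaining_matrix)  # [[2, 0, 1, 0], [1, 1, 0, 0], [4, 0, 0, 0]]
```

The sketch returns a new matrix rather than modifying the stored one, so the next clustering pass would simply be run on `remaining_matrix`.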
In the second embodiment of the present invention, the text vector correction unit 12 reads the stored matrix from the text vector storage unit 5 and, using the identifiers of the cluster feature tokens calculated for each selected cluster from the text data belonging to that cluster, deletes the corresponding token columns. It then stores the corrected matrix in the text vector storage unit 5. FIG. 5 shows the operation when the token identifiers 1, 3, 5, and 6 are passed to the text vector correction unit 12, and the text vector correction unit 12 deletes the columns of token-1, token-3, token-5, and token-6. Once the corrected matrix has been stored in the text vector storage unit 5, the clustering unit 8 performs clustering again.
Thus, according to this embodiment, when one piece of text data can belong to a plurality of clusters, the situation in which text data rows belonging to an unselected cluster are deleted from the matrix can be avoided, so that concepts can be extracted gradually and with higher accuracy.
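The column deletion of FIG. 5 can be sketched in the same way. The function name `delete_token_columns`, the seven-token layout, and the frequency values are assumptions made for illustration.

```python
def delete_token_columns(matrix, token_ids, feature_token_ids):
    """Drop the columns of the cluster feature tokens; every text data row is kept."""
    drop = set(feature_token_ids)
    keep = [j for j, tok in enumerate(token_ids) if tok not in drop]
    new_matrix = [[row[j] for j in keep] for row in matrix]
    new_token_ids = [token_ids[j] for j in keep]
    return new_matrix, new_token_ids


# Illustrative frequency matrix (assumed values): 3 text data (rows) x 7 tokens (columns).
token_ids = [1, 2, 3, 4, 5, 6, 7]
matrix = [
    [1, 0, 2, 0, 1, 3, 0],
    [0, 2, 0, 1, 0, 0, 1],
    [3, 0, 1, 0, 2, 1, 0],
]

# Token identifiers 1, 3, 5, and 6 are passed, as in the FIG. 5 example.
reduced_matrix, reduced_token_ids = delete_token_columns(matrix, token_ids, [1, 3, 5, 6])
print(reduced_token_ids)  # [2, 4, 7]
print(reduced_matrix)     # [[0, 0, 0], [2, 1, 1], [0, 0, 0]]
```

Note that all three rows survive, which is the point of the second embodiment: text data that also belongs to an unselected cluster stays available for the next clustering pass.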
[0011]
In the third embodiment of the present invention, the text vector correction unit 12 reads the stored matrix from the text vector storage unit 5 and, using the identifiers of the text data belonging to the selected cluster and the identifiers of the cluster feature tokens, sets the values of the corresponding elements to 0. It then stores the corrected matrix in the text vector storage unit 5. FIG. 6 shows the operation when the text data identifiers 1, 3, and 5 and the cluster feature token identifiers 1, 3, 4, and 6 are passed to the text vector correction unit 12: the values of the elements for cluster feature tokens 1, 3, 4, and 6 in text data 1, 3, and 5 are set to 0. Once the corrected matrix has been stored in the text vector storage unit 5, the clustering unit 8 performs clustering again.
Thus, according to this embodiment, when one piece of text data can belong to a plurality of clusters, the situation in which text data rows belonging to an unselected cluster are deleted from the matrix can be avoided. In addition, the selection result for the cluster feature tokens of the selected cluster is reflected only in the elements of the text data belonging to that cluster, so that concepts can be extracted gradually and with higher accuracy.
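The element-wise correction of FIG. 6 can be sketched as follows. The function name `zero_selected_elements` and the frequency values are assumptions made for illustration; only the identifier sets (text data 1, 3, 5 and tokens 1, 3, 4, 6) come from the description.

```python
def zero_selected_elements(matrix, text_ids, token_ids, sel_text_ids, sel_token_ids):
    """Set to 0 only the elements at (selected text data, cluster feature token)."""
    rows = set(sel_text_ids)
    cols = set(sel_token_ids)
    return [
        [0 if (tid in rows and tok in cols) else value
         for tok, value in zip(token_ids, row)]
        for tid, row in zip(text_ids, matrix)
    ]


# Illustrative frequency matrix (assumed values): 5 text data (rows) x 6 tokens (columns).
text_ids = [1, 2, 3, 4, 5]
token_ids = [1, 2, 3, 4, 5, 6]
matrix = [
    [2, 1, 1, 3, 0, 1],
    [1, 0, 2, 0, 1, 1],
    [0, 2, 1, 1, 1, 4],
    [1, 1, 0, 2, 0, 0],
    [3, 0, 2, 1, 2, 1],
]

corrected = zero_selected_elements(matrix, text_ids, token_ids, [1, 3, 5], [1, 3, 4, 6])
for row in corrected:
    print(row)
# [0, 1, 0, 0, 0, 0]
# [1, 0, 2, 0, 1, 1]
# [0, 2, 0, 0, 1, 0]
# [1, 1, 0, 2, 0, 0]
# [0, 0, 0, 0, 2, 0]
```

Rows 2 and 4 are untouched, and tokens 2 and 5 keep their counts even in the selected rows, so neither whole rows nor whole columns are lost.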
[0012]
FIG. 7 is a block diagram showing the configuration of a concept extraction system according to a fourth embodiment of the present invention. As shown in the figure, the concept extraction system of this embodiment includes an inter-cluster similarity calculation unit 13 in addition to the configuration of the first embodiment shown in FIG. 1. The inter-cluster similarity calculation unit 13 calculates the similarity between each cluster generated by the clustering unit 8 and each previously selected cluster held in the selected cluster storage unit 11, as follows.
That is, in this embodiment, the similarity between the set of identifiers of the text data belonging to a cluster i generated by the clustering unit 8 and the set of identifiers of the text data belonging to each cluster j in the selected cluster storage unit 11 is calculated by the following equation:
$$F = \frac{(\beta^2 + 1) \cdot p \cdot r}{\beta^2 \cdot p + r}$$
Here, p is the number of elements in the intersection of cluster i and cluster j divided by the number of elements in cluster i, and r is the number of elements in the intersection of cluster i and cluster j divided by the number of elements in cluster j. β is an adjustment parameter that takes a value from 0 to 1; in this embodiment it is set to 0.5.
[0013]
Suppose that the identifiers of the text data belonging to the cluster i generated by the clustering unit 8 are “text data-246, text data-567, text data-12, text data-321, text data-9”, and that the identifiers of the text data belonging to the selected cluster j held in the selected cluster storage unit 11 are “text data-1, text data-246, text data-456, text data-112, text data-321, text data-9”. The intersection is then “text data-246, text data-321, text data-9”. Therefore, p is 3/5 and r is 3/6.
After calculating the similarities in this way, the inter-cluster similarity calculation unit 13 examines the i × j calculated F values. It discards each generated cluster whose F value is equal to or greater than a predetermined value given in advance, treating it as similar to an already selected cluster, and passes the identifiers of the clusters whose F values are smaller than the predetermined value to the cluster selection unit 10. As a result, the cluster selection unit 10 displays only the clusters with the passed identifiers as selection targets.
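As a worked check of the equation with the example above, the following sketch computes F for p = 3/5, r = 3/6, and β = 0.5. The function name `f_value` and the use of Python sets are assumptions made for illustration; returning 0 for disjoint clusters is likewise an added convention, used only to avoid dividing by zero.

```python
def f_value(cluster_i, cluster_j, beta=0.5):
    """F = (beta^2 + 1) * p * r / (beta^2 * p + r) for two sets of text data identifiers."""
    ids_i, ids_j = set(cluster_i), set(cluster_j)
    common = len(ids_i & ids_j)
    if common == 0:
        return 0.0  # assumed convention: no overlap means no similarity
    p = common / len(ids_i)  # intersection size divided by size of cluster i
    r = common / len(ids_j)  # intersection size divided by size of cluster j
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)


cluster_i = [246, 567, 12, 321, 9]       # newly generated cluster
cluster_j = [1, 246, 456, 112, 321, 9]   # previously selected cluster

score = f_value(cluster_i, cluster_j)
print(round(score, 3))  # 0.577
```

With these values the numerator is 1.25 × 0.6 × 0.5 = 0.375 and the denominator is 0.25 × 0.6 + 0.5 = 0.65, giving F ≈ 0.577.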
Thus, according to this embodiment, a cluster similar to the selected cluster is automatically removed at the time of the subsequent cluster presentation, so that concept extraction can be performed efficiently.
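The filtering step performed by the inter-cluster similarity calculation unit 13 can be sketched as follows. The names `f_value` and `clusters_to_present`, the dictionary layout, the cluster labels, and the threshold of 0.5 are all assumptions made for illustration; the specification says only that the predetermined value is given in advance.

```python
def f_value(cluster_i, cluster_j, beta=0.5):
    """F value between two sets of text data identifiers (0.0 when they are disjoint)."""
    ids_i, ids_j = set(cluster_i), set(cluster_j)
    common = len(ids_i & ids_j)
    if common == 0:
        return 0.0
    p = common / len(ids_i)
    r = common / len(ids_j)
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)


def clusters_to_present(new_clusters, selected_clusters, threshold, beta=0.5):
    """Return identifiers of new clusters whose F value against every selected
    cluster is below the threshold; the others are discarded as similar."""
    presented = []
    for cluster_id, members in new_clusters.items():
        is_similar = any(
            f_value(members, selected, beta) >= threshold
            for selected in selected_clusters.values()
        )
        if not is_similar:
            presented.append(cluster_id)
    return presented


new_clusters = {
    "new-1": [246, 567, 12, 321, 9],  # overlaps the selected cluster (F is about 0.577)
    "new-2": [700, 701, 702],         # no overlap (F = 0)
}
selected_clusters = {"selected-1": [1, 246, 456, 112, 321, 9]}

print(clusters_to_present(new_clusters, selected_clusters, threshold=0.5))  # ['new-2']
```

Raising the threshold above about 0.577 would let "new-1" through as well, so the choice of the predetermined value controls how aggressively near-duplicates of saved clusters are hidden.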
Although the embodiments of the present invention have been described with the configurations shown in FIGS. 1 and 7, a program written according to the concept extraction method described above may be stored in a removable storage medium, and the storage medium may be mounted on an information processing apparatus, such as a personal computer, that has so far been unable to perform the concept extraction according to the present invention. Alternatively, the program may be transferred to such an information processing apparatus via a network. In either case, the concept extraction according to the present invention can also be performed on that information processing apparatus.
[0014]
[Effects of the Invention]
As described above, according to the inventions of claims 1 and 2, when the concepts of text data are extracted by automatically and interactively classifying a plurality of text data a plurality of times, the appearance frequency of each language appearing in the text data is counted for each text data; matrix information of each text data and each language, having the appearance frequencies as matrix elements, is generated and stored; clustering is performed to classify the plurality of text data into a plurality of clusters based on the matrix information; the user selects some of the plurality of clusters; the matrix information is corrected based on the selected clusters; and the processing from the clustering onward can be repeated. Therefore, even when only a part of the result of a clustering process is useful to the user, the clusters that have become known by the N-th round (the selected clusters) can be kept from being calculated again in the (N+1)-th round, so that various topics and concepts of the text data can be extracted gradually and with high accuracy.
Further, in the invention according to claim 3, in the invention according to claim 2, only the matrix information corresponding to the text data belonging to the selected cluster is deleted from the held matrix information. Therefore, when one piece of text data can belong to a plurality of clusters, the situation in which text data rows belonging to an unselected cluster are deleted from the matrix can be avoided, so that concepts can be extracted gradually and with higher accuracy.
In the invention according to claim 4, in the invention according to claim 2 or 3, a cluster feature amount for each selected cluster is calculated from the text data belonging to the selected cluster, and the matrix information is corrected based on the cluster feature amount. Therefore, the situation in which text data rows belonging to an unselected cluster are deleted from the matrix can be avoided, and the selection result for the cluster feature tokens of the selected cluster can be reflected only in the elements of the text data belonging to that cluster, so that concepts can be extracted gradually and with higher accuracy.
[0015]
In the invention according to claim 5, in the invention according to claim 4, the cluster feature amount is a characteristic language appearing in the text data belonging to the selected cluster together with its frequency information, and the matrix information is corrected by subtracting the value of the corresponding frequency information from the value of the corresponding matrix element. The invention according to claim 4 can therefore be realized easily.
In the invention according to claim 6, in the invention according to any one of claims 2 to 5, the similarity between a cluster selected in the N-th cluster selection and a cluster generated in the (N+1)-th clustering can be calculated, so that clusters similar to a selected cluster can be automatically removed in subsequent cluster presentations.
Further, in the invention according to claim 7, in the invention according to claim 6, clusters with a large similarity can be excluded from the result of the (N+1)-th clustering. Clusters similar to a selected cluster are therefore automatically removed from subsequent cluster presentations, so that concept extraction can be performed efficiently.
According to an eighth aspect of the present invention, a program programmed to execute the concept extraction by the concept extraction method according to any one of the second to seventh aspects is executed on the information processing apparatus. Therefore, the effect of the invention described in any one of claims 2 to 7 can be obtained by using the information processing device.
According to the ninth aspect of the present invention, the program according to the eighth aspect can be stored in a removable storage medium. By mounting the storage medium on an information processing apparatus, such as a personal computer, that has so far been unable to perform the concept extraction according to any one of the second to seventh aspects, the effect of the invention according to any one of claims 2 to 7 can be obtained on that information processing apparatus as well.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a concept extraction system according to a first embodiment of the present invention.
FIG. 2 is a data configuration diagram of a main part of the concept extraction system, showing the first embodiment of the present invention.
FIG. 3 is a screen view of a concept extraction system according to the first embodiment of the present invention.
FIG. 4 is an explanatory diagram of a concept extracting method according to the first embodiment of the present invention.
FIG. 5 is an explanatory diagram of a concept extracting method according to a second embodiment of the present invention.
FIG. 6 is a diagram illustrating a concept extracting method according to a third embodiment of the present invention.
FIG. 7 is a block diagram showing a configuration of a concept extracting system according to a fourth embodiment of the present invention.
FIG. 8 is an operation flowchart of a concept extracting method according to the first embodiment of the present invention.
[Explanation of symbols]
1 text vector generation section, 2 text data analysis section, 3 text data counting section, 4 analysis result storage section, 5 text vector storage section, 6 text analysis result storage section, 7 cluster processing section, 8 clustering section, 9 cluster feature calculation Section, 10 cluster selection section, 11 selected cluster storage section, 12 text vector correction section, 13 inter-cluster similarity calculation section

Claims

1. A concept extraction system that extracts concepts of text data by automatically and interactively classifying a plurality of text data a plurality of times, the system comprising: text vector generating means for counting, for each text data, the appearance frequency of each language appearing in the text data and generating matrix information of each text data and each language having the appearance frequencies as matrix elements; text vector storage means for storing the matrix information; clustering means for classifying each of the text data into a plurality of clusters based on the matrix information stored in the text vector storage means; cluster selecting means for selecting some of the plurality of clusters; selected cluster storage means for storing cluster identification information indicating the selected clusters; and text vector correction means for correcting the matrix information stored in the text vector storage means based on the cluster identification information.

2. A concept extraction method for extracting concepts of text data by automatically and interactively classifying a plurality of text data a plurality of times, the method comprising: counting, for each text data, the appearance frequency of each language appearing in the text data; generating matrix information of each text data and each language having the appearance frequencies as matrix elements; storing the matrix information; performing clustering for classifying each of the text data into a plurality of clusters based on the matrix information; selecting some of the plurality of clusters; correcting the matrix information based on the selected clusters; and repeating the processing from the clustering onward.

3. The concept extraction method according to claim 2, wherein the correction of the matrix information deletes, from the held matrix information, the matrix information corresponding to the text data belonging to the selected cluster.

4. The concept extraction method according to claim 2, wherein a cluster feature amount for each selected cluster is calculated from the text data belonging to the selected cluster, and the correction of the matrix information corrects the matrix information based on the cluster feature amount.

5. The concept extraction method according to claim 4, wherein the cluster feature amount is a characteristic language appearing in the text data belonging to the selected cluster and frequency information thereof, and the correction of the matrix information subtracts the value of the corresponding frequency information from the value of the corresponding matrix element of the matrix information.

6. The concept extraction method according to claim 2, wherein a similarity between a cluster selected in the N-th cluster selection and a cluster generated in the (N+1)-th clustering is calculated.

7. The concept extraction method according to claim 6, wherein a cluster for which the similarity is large is excluded from the result of the (N+1)-th clustering.

8. A program executed on an information processing apparatus, the program being programmed to execute concept extraction by the concept extraction method according to any one of claims 2 to 7.

9. A storage medium storing the program according to claim 8.