JP2004234051A - Sentence classifying device and its method

Info

Publication number: JP2004234051A
Application number: JP2003018295A
Authority: JP
Inventors: Keiko Shimazu; 恵子島津; Yohei Yamane; 洋平山根; Atsukimi Monma; 敦仁門馬; Tetsushi Sakurai; 哲志桜井
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2003-01-28
Filing date: 2003-01-28
Publication date: 2004-08-19

Abstract

PROBLEM TO BE SOLVED: To classify sentences collected over a wide range into groups that contain useful information and to provide them to users.
SOLUTION: Consultation and inquiry text data is entered at an operator terminal 32 in a call center 3. A clustering device 4 receives and stores the text data and relates the words in the stored text data to each other according to their dependency relations. The clustering device 4 finds correlations between the related words, narrows them down, and classifies them to create a clustering rule A. Based on the clustering rule A, the clustering device 4 classifies the stored text data into classes. The classes obtained in this way are distributed to the other-department systems 220.
COPYRIGHT: (C)2004, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、文章を、その内容に応じて分類する文章分類装置およびその方法に関する。
【０００２】
【従来の技術】
例えば、非特許文献１，２などには、市場調査などに用いられるデータマイニングの一種であり、テキストデータを処理の対象とするテキストマイニングを開示する。
また、非特許文献３には、文章を単語に分解するソフトウェア（茶筅）が開示されている。
また、非特許文献４には、文章に含まれる単語の相関関係を示す規則（相関規則；相関ルール）を求める方法が開示されており、さらに、非特許文献５には、相関規則を求めるソフトウェア（Ａｐｒｉｏｒｉ）が開示されている。
また、非特許文献６には、テキストの集合からその特徴を抽出するソフトウェア（Ａｌｅｐｈ）が開示されている。
また、非特許文献７には、日本語分の単語の係り受け関係（例えば、主語と動詞、動詞と目的語・補語）を解析するためのソフトウェア（ＣａｂｏＣｈａ）が開示されている。
【０００３】
ここで、コンピュータやＯＡ機器のメーカー・商社では、ユーザの相談を受け付ける部門が設けられ、このような部門は、コールセンターなどと呼ばれることが多い。
このコールセンターに受け付けられる相談には、製品開発のためのヒントが多く含まれるが、ユーザと製品開発者との間で、文章に用いる言葉が違うことがある。
従って、コールセンターが受け付けた相談をデータベース化しても、製品開発者が、ユーザが相談に用いる言葉を知らなければ、有用な情報を上手く引き出すことができない。
【０００４】
このような点に対し、例えば、非特許文献８，９は、コールセンターで受け付けられたテキストに対して、単語間の相関の抽出を行うことにより、知識を得るための方法を開示する。
しかしながら、これら非特許文献８，９には、係り受け関係を有する複数の単語の間の相関関係を抽出し、さらに、この相関関係を、事前確信度と事後確信度との差を用いて絞り込み、データマイニングを行う方法を開示してはいない。
【０００５】
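[Non-Patent Document 1] Special Issue "Text Mining", Journal of the Japanese Society for Artificial Intelligence, vol. 16, No. 2, 2001
[Non-Patent Document 2] Special Issue "Knowledge Management and Its Support Technology", Journal of the Japanese Society for Artificial Intelligence, vol. 16, No. 1, 2001
[Non-Patent Document 3] http://chasen.aist-nara.ac.jp/index.html.ja
[Non-Patent Document 4] Data Mining (Data Science Series 3, Fukuda et al., Kyoritsu Shuppan (first edition, first printing, September 1, 2001), ISBN 4-320-12002-7)
[Non-Patent Document 5] http://fuzzy.cs.uni-magdeburg.de/~borgelt/apriori/
[Non-Patent Document 6] http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/Aleph/
[Non-Patent Document 7] http://cl.aist-nara.ac.jp/~taku-ku/software/cabocha/
[Non-Patent Document 8] Text Mining in a Call Center (Journal of the Japanese Society for Artificial Intelligence, vol. 16, No. 2, pp. 220-225, Nasukawa)
[Non-Patent Document 9] Text Mining: Knowledge Acquisition from Massive Text Data - Recognition of Intention - (Proceedings of the 57th National Convention of the Information Processing Society of Japan (second half of 1998); 3-75, Nasukawa et al.)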
【０００６】
【発明が解決しようとする課題】
本発明は、上述した背景からなされたものであり、広く集められた文章を、有用な情報を含むグループに分類し、利用者に提供することができる文章分類装置およびその方法を提供することを目的とする。
また、本発明は、上述した背景からなされたものであり、係り受け関係を有する複数の単語の間の相関関係を抽出し、さらに、この相関関係を絞り込んでデータマイニングを行うための文章分類装置およびその方法を提供することを目的とする。
【０００７】
【課題を解決するための手段】
［文章分類装置］
上述した目的を達成するために、本発明にかかる文章分類装置は、複数の単語を含む文書を、０以上のグループに分類する文章分類装置であって、前記複数の文章それぞれに含まれる複数の単語を抽出する単語抽出手段と、前記文章それぞれに含まれる複数の単語を、それぞれ関連する２つ以上の単語を含む関連単語に分類する単語分類手段と、前記分類された関連単語の間の相関性に基づいて、前記複数の文章を前記０以上のグループに分類するための分類規則を作成する分類規則作成手段とを有する。
【０００８】
好適には、前記作成された分類規則に基づいて、前記複数の文章を、前記０以上のグループに分類する文章分類手段をさらに有する。
【０００９】
好適には、前記文章分類手段は、前記分類規則が作成された後は、新たに分類の対象とされた文章を、既に作成された前記分類規則に基づいて、前記グループの内の０以上に分類する。
【００１０】
好適には、前記相関規則作成手段は、同じ前記文章の単語から得られた関連単語の組み合わせを作成する組み合わせ作成手段と、前記作成された組み合わせを、所定の条件に合わせるように処理する組み合わせ処理手段とを有する。
【００１１】
好適には、前記組み合わせは、前記文章に、１つ以上の第１の関連単語が含まれる場合に、同一の前記文章に、他の１つ以上の第２の関連単語が含まれることを示す。
【００１２】
好適には、前記組み合わせ処理手段は、前記作成された組み合わせの内、所定の割合以上または所定数以上の前記文章に適合する組み合わせを選択する処理を行う。
【００１３】
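Preferably, the combination processing means selects combinations for which the proportion of sentences containing the first related words is at least a predetermined ratio, or combinations for which the difference between the proportion of sentences containing the second related words among the sentences containing the first related words and the proportion of sentences containing the second related words is at least a predetermined value.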
【００１４】
好適には、前記単語抽出手段は、前記複数の単語それぞれの品詞をさらに識別し、前記単語分類手段は、前記識別された単語それぞれの品詞に基づいて、前記文章それぞれに含まれる複数の単語を、それぞれ関連する２つ以上の単語を含む関連単語に分類する。
【００１５】
［文章分類方法］
また、本発明にかかる文章分類方法は、複数の単語を含む文書を、０以上のグループに分類する文章分類方法であって、前記複数の文章それぞれに含まれる複数の単語を抽出し、前記文章それぞれに含まれる複数の単語を、それぞれ関連する２つ以上の単語を含む関連単語に分類し、前記分類された関連単語の間の相関性に基づいて、前記複数の文章を前記０以上のグループに分類するための分類規則を作成する。
【００１６】
好適には、前記作成された分類規則に基づいて、前記複数の文章を、前記０以上のグループに分類する。
【００１７】
好適には、複数の単語を含む文書を、０以上のグループに分類するためのプログラムであって、前記複数の文章それぞれに含まれる複数の単語を抽出し、前記抽出された単語それぞれの品詞を識別するステップと、前記抽出された単語それぞれの品詞に基づいて、前記文章それぞれに含まれる複数の単語を、それぞれ関連する２つ以上の単語を含む関連単語に分類するステップと、前記分類された関連単語の間の相関性に基づいて、前記複数の文章を前記０以上のグループに分類するための分類規則を作成するステップとをコンピュータに実行させる。
【００１８】
好適には、前記作成された分類規則に基づいて、前記複数の文章を、前記０以上のグループに分類するステップをコンピュータに実行させる。
【００１９】
【発明の実施の形態】
以下、本発明の実施形態を説明する。
【００２０】
［データマイニングシステム１］
図１は、本発明にかかる文章データ分類方法が適用されるデータマイニングシステム１の構成を示す図である。
図１に示すように、データマイニングシステム１は、電話ネットワーク２０に接続されたコールセンター３、クラスタリング装置４、および、ＬＡＮ・ＷＡＮなどのプライベートネットワーク２２から構成される。
The network 22 connects the clustering device 4 with the other-department systems 220-1 to 220-n.
電話ネットワーク２０は、例えば一般的な公衆電話回線であって、多数のユーザ側電話機２００と、１台以上のセンタ側の電話機２０２−１〜２０２−ｍ（ｍは１以上の整数）とが接続される。
コールセンター３は、ＬＡＮ３４を介してクラスタリング装置４と接続されるコール受付装置３０−１〜３０−ｍを含む。
コール受付装置３０−１〜３０−ｍそれぞれは、オペレータ端末３２−１〜３２−ｍそれぞれと、電話機２０２−１〜２０２−ｍそれぞれとを含む。
【００２１】
データマイニングシステム１は、これらの構成部分により、例えば、コンピュータ・ＯＡ機器メーカにおいて、ユーザからの相談・問い合わせなどを受け付け、受け付けた相談・問い合わせの文章を、開発部門などの他部門それぞれの業務に有用な情報を含むグループ（クラス）に分類し、他部門のシステム２２０−１〜２２０−ｎに提供する。
なお、コール受付装置３０−１〜３０−ｍなど、複数ある構成部分のいずれかを、特定せずに示す場合には、単にコール受付装置３０などと記載することがある。
【００２２】
［ハードウェア構成］
次に、オペレータ端末３２，クラスタリング装置４および他部門システム２２０のハードウェア構成を説明する。
図２は、図１に示したオペレータ端末３２、クラスタリング装置４および他部門のシステム２２０のハードウェア構成を例示する図である。
図２に示すように、オペレータ端末３２、クラスタリング装置４および他部門のシステム２２０は、それぞれ、ＣＰＵ１０２およびメモリ１０４などを含む本体１０、オペレータ端末３２・クラスタリング装置４・他部門システム２２０の間の通信を行う通信装置１２、ＨＤＤ・ＣＤ装置などの記録装置１４、および、ＬＣＤ表示装置・キーボード・マウスなどを含む表示・入力装置１６から構成される。
つまり、オペレータ端末３２、クラスタリング装置４および他部門システム２２０は、通信機能を有する一般的なコンピュータとしての構成部分を有する。
【００２３】
［コール受付装置３０］
次に、データマイニングシステム１のコール受付装置３０および他部門システム２２０の動作を説明する。
コール受付装置３０は、オペレータ（図示せず）が、ユーザからの相談・問い合わせの電話を受け付け、相談・問い合わせの内容を入力するために用いられる。
つまり、ユーザが、ユーザ側電話機２００からセンタ側電話機２０２に電話をかけ、コール受付装置３０のオペレータに音声で相談・問い合わせをすると、コール受付装置３０のオペレータは、その内容を記した文章を、オペレータ端末３２に入力し、相談・問い合わせのテキストデータを作成する。
なお、コール受付装置３０は、ユーザからの相談・問い合わせを、電子メールを利用して受け、受けた電子メールをそのまま、相談・問い合わせの内容を示す文章のテキストデータとして作成してもよい。
コール受付装置３０のオペレータ端末３２は、このようにして作成したテキストデータを、ＬＡＮ３４を介して、クラスタリング装置４に対して送信する。
【００２４】
［他部門システム２２０］
他部門システム２２０は、例えば、開発部門・営業部門など、各部門に設置され、クラスタリング装置４から、意味や内容に基づいて分類されたテキストデータのグループ（クラス）を受け、部門の構成員（図示せず）に示す。
【００２５】
［クラスタリング装置４］
次に、クラスタリング装置４上で動作するクラスタリングプログラム５の構成および動作を説明する。
図３は、図１，図２に示したクラスタリング装置４において実行されるクラスタリングプログラム５の構成を示す図である。
図３に示すように、クラスタリングプログラム５は、前処理部５０、相関規則作成部５２、意味付け・分類部５４およびクラスタリング処理部５６から構成される。
【００２６】
前処理部５０は、テキスト受信部５００、テキストデータベース（テキストＤＢ）５０２、分かち書き処理部５０４、テキスト・単語ＤＢ５０６および関連付け処理部５０８を含む。
相関規則作成部５２は、相関規則作成処理部５２０、絞り込み処理部５２２および相関規則ＤＢ５２４を含む。
意味付け・分類部５４は、意味付け処理部５４０、クラスタリング規則ＤＢ５４２、分類処理部５４４、意味付け・分類ＤＢ５４６およびユーザインターフェース（ＵＩ）・処理制御部５４８を含む。
【００２７】
クラスタリング処理部５６は、クラスタリング処理部５６０、クラスＤＢ５６２およびクラス配信部５６４から構成され、図３中に点線で示すように、必要に応じて、クラスタリング規則作成部５６６をさらに含む。
なお、クラスタリングプログラム５を、各データベースの内、共用可能なものは、一体化した構成としてもよい。
クラスタリングプログラム５は、これらの構成部分により、コールセンター３から相談・問い合わせのテキストデータを受け、各部門に有用なクラスに分類して、他部門システム２２０に配信する。
【００２８】
図６は、図３に示したクラスタリングプログラム５の前処理部５０の処理を示す図であって、（Ａ）は、テキストＤＢ５０２に記憶されるテキストデータＱの集合を模式的に示し、（Ｂ）は、テキストデータＱの内容を例示し、（Ｃ）は、分かち書き処理部５０４が、テキストデータを処理して得られる単語（分かち書き結果）を示す。
但し、以下の各図において、クラスタリングプログラム５の各構成部分が処理の対象としている文章は、必ずしも同一ではない。
【００２９】
なお、図６（Ａ）〜（Ｃ）には、（Ａ）に示すテキストデータの１つ「Ｑ」の文章が、（Ｂ）に示すように、「ＹＹ使用中。２００ＤＰＩくらいで図面を取り込んでいるが、かなり綺麗に入る。これを、ＸＸのオブジェに持ってゆくと、かなり荒れてしまうが、何かいい方法はないですか？」である場合が示されている。
また、図６（Ｃ）には、この文章が、「ＹＹ（機種名）」、「使用中」などの単語に分解され、さらに、これらの単語が、それぞれ固有名詞、名詞などと識別される場合が示されている。
【００３０】
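In the pre-processing unit 50, the text receiving unit 500 receives text data (consultation and inquiry sentences) from each operator terminal 32, unifies the format, and stores the data in the text DB 502 as shown in FIG. 6(A).
For the word segmentation unit 504, the software "ChaSen" mentioned above is used, for example. It separates all the words contained in the sentence represented by each piece of text data stored in the text DB 502 (hereinafter, "the sentence represented by each piece of text data" is also written simply as "the sentence of the text data"), for example the consultation and inquiry sentence shown in FIG. 6(B), as shown in FIG. 6(C), and determines the part of speech of each separated word (word segmentation processing).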
【００３１】
なお、この部分および以下の説明において、「茶筅」など、具体的なソフトウェアが例示される場合があるが、これは、発明の説明の明確化のためであって、本発明の技術的範囲の限定を意図するものではない。
具体例として挙げられたソフトウェアは、他の同等な機能・性能を有する他の手段に置換可能である。
【００３２】
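FIG. 4 is a diagram illustrating the phrase structure of a sentence.
As shown in FIG. 4, a Japanese sentence can also be decomposed into a phrase structure, following the syntactic analysis of English sentences.
Of the words obtained by the word segmentation processing, the word segmentation unit 504 takes the important words used in the processing of the correlation rule creating unit 52 and the meaning/classification unit 54, for example words judged to be nouns, verbs, adjectives, adjectival nouns, or proper nouns (the word segmentation unit 504 may judge a proper noun to be a word of unknown part of speech), and stores these words, their parts of speech, and the text data of the sentences containing them in the text/word DB 506 in association with one another.
The part-of-speech information obtained by the word segmentation unit 504 can also be used, as needed, in the creation of clustering rule B by the clustering rule creating unit 566 described later.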
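The selection of important words described above can be sketched as follows. This is a minimal illustration, assuming the output of the morphological analyzer is already available as (surface, part-of-speech) pairs. The tag names, `CONTENT_POS`, and `select_content_words` are hypothetical and do not reflect ChaSen's actual output format.

```python
# Hypothetical part-of-speech tags; a real analyzer such as ChaSen uses its own tag set.
CONTENT_POS = {"noun", "verb", "adjective", "adjectival-noun", "proper-noun", "unknown"}


def select_content_words(tokens):
    """Keep only (surface, pos) pairs whose part of speech marks an important word."""
    return [(surface, pos) for surface, pos in tokens if pos in CONTENT_POS]


# Illustrative tokens for part of the example sentence in FIG. 6(B).
tokens = [
    ("YY", "unknown"),      # a model name may be tagged as an unknown word
    ("使用中", "noun"),
    ("。", "symbol"),
    ("図面", "noun"),
    ("を", "particle"),
    ("取り込ん", "verb"),
]
content_words = select_content_words(tokens)
```

Particles and symbols are dropped, so only the words that later stages correlate remain attached to the text.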
【００３３】
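FIG. 5 is a diagram illustrating the dependency structure of the words in a sentence.
As shown in FIG. 5, dependency relations exist between the words in a sentence.
For the association processing unit 508, the software "CaboCha" mentioned above is used, for example.
Of the words obtained by the word segmentation unit 504 and stored in the text/word DB 506, the association processing unit 508 associates two or more words that are in a dependency relation and stores the associated words in the text/word DB 506 as related words W(RW) ({W(RW); w_1 to w_p; rw_1 to rw_q}, where a word rw is a word that receives a word w, and p, q ≥ 1).
The processing of the association processing unit 508 is described further with a concrete example.
Suppose that the word segmentation unit 504 has separated "拡大コピー" (enlarged copy; noun) and "取る" (to take; verb), as well as "紙詰まり" (paper jam; noun) and "解決する" (to resolve; verb), from the sentence of the text data and has identified their parts of speech.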
【００３４】
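In this case, the association processing unit 508 treats the noun "拡大コピー" as the subject of the verb "取る", so that "拡大コピー" is a word that depends on "取る" ("取る" is the word that receives "拡大コピー"). It associates "拡大コピー" with "取る" and stores them in the text/word DB 506 as the related word (拡大コピー→取る).
Similarly, the association processing unit 508 treats the noun "紙詰まり" as the object of the verb "解決する", so that "紙詰まり" is a word that depends on "解決する" ("解決する" is the word that receives "紙詰まり"). It associates "紙詰まり" with "解決する" and stores them in the text/word DB 506 as the related word (紙詰まり→解決する).
The dependency information obtained by the association processing unit 508 can also be used, as needed, in the creation of clustering rule B by the clustering rule creating unit 566 described later.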
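The construction of related words from dependency pairs can be sketched as follows. The sketch assumes a dependency parser has already produced (dependent, head) pairs. `RelatedWord` and `make_related_words` are hypothetical names, and CaboCha's real output format is not modeled here.

```python
from typing import NamedTuple


class RelatedWord(NamedTuple):
    dependent: str  # the word w that depends on the head
    head: str       # the word rw that receives the dependent

    def __str__(self):
        return f"{self.dependent}→{self.head}"


def make_related_words(dependencies, content_words):
    """Build related words from (dependent, head) pairs, keeping only pairs
    in which both words are important words stored for the text."""
    return {
        RelatedWord(dep, head)
        for dep, head in dependencies
        if dep in content_words and head in content_words
    }


content_words = {"拡大コピー", "取る", "紙詰まり", "解決する"}
dependencies = [("拡大コピー", "取る"), ("紙詰まり", "解決する"), ("を", "取る")]
related = make_related_words(dependencies, content_words)
```

Here the pair involving the particle "を" is discarded because it is not an important word, leaving the two related words from the example in the text.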
【００３５】
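The correlation rule creation processing unit 520 is, for example, the above-mentioned software for obtaining correlation rules (Apriori). It reads from the text/word DB 506 the words of each sentence of the text data and their parts of speech, obtained by the word segmentation unit 504, together with the related words obtained by the association processing unit 508, and creates correlation rules that express the correlations between related words.
A correlation rule indicates that when the sentence of a piece of text data contains the related words W_1 to W_r, the same sentence also contains the related words RW_1 to RW_s (r and s are integers of 1 or more). The correlation rule creation processing unit 520 creates many such combinations of related words, within an appropriate range, as correlation rules.
A correlation rule is expressed in a form such as {W_1 to W_r; RW_1 to RW_s}.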
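As a rough illustration of how candidate rules of the form {W; RW} arise from related words that occur in the same text, consider the sketch below. It is a simplified stand-in, not the Apriori algorithm itself: each side of a rule holds a single related word, and `candidate_rules` is a hypothetical helper.

```python
from itertools import permutations


def candidate_rules(texts):
    """Enumerate candidate rules (antecedent, consequent) from related words
    that co-occur in the same text. Each side holds one related word here,
    whereas the rule form in the description allows several on each side."""
    rules = set()
    for related_words in texts.values():
        for w, rw in permutations(sorted(related_words), 2):
            rules.add((w, rw))
    return rules


# Text ID -> related words found in that text (illustrative data).
texts = {
    "Q1": {"拡大コピー→取る", "紙詰まり→解決する"},
    "Q2": {"拡大コピー→取る"},
}
rules = candidate_rules(texts)
```

Only Q1 contributes rules, because a rule needs at least two related words in the same text.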
【００３６】
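The narrowing processing unit 522 applies each of the many correlation rules created by the correlation rule creation processing unit 520 to the sentence of each piece of text data stored in the text/word DB 506, and determines how many pieces of text data, or what percentage of the text data, each correlation rule applies to.
The narrowing processing unit 522 then narrows down the correlation rules by selecting those that apply to at least a predetermined number of pieces of text data, those that apply to at least a predetermined percentage of the text data, or both, and stores the selected rules in the correlation rule DB 524.
The correlation rule creation processing unit 520 and the narrowing processing unit 522 may be invoked alternately, or the narrowing processing unit 522 may perform its narrowing after the correlation rule creation processing unit 520 has finished all of its processing.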
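The count-based and percentage-based narrowing can be sketched as follows, assuming a rule applies to a text when the text contains every related word on both sides of the rule. `narrow_by_coverage` and its parameters are hypothetical, and the collection of texts is assumed to be non-empty.

```python
def narrow_by_coverage(rules, texts, min_count=None, min_percent=None):
    """Keep rules that apply to at least min_count texts and, if given,
    to at least min_percent of all texts."""
    kept = []
    for antecedent, consequent in rules:
        needed = set(antecedent) | set(consequent)
        count = sum(1 for words in texts.values() if needed <= words)
        percent = 100.0 * count / len(texts)
        if min_count is not None and count < min_count:
            continue
        if min_percent is not None and percent < min_percent:
            continue
        kept.append((antecedent, consequent))
    return kept


# Placeholder related words "a", "b", "c" stand in for real ones.
texts = {"Q1": {"a", "b"}, "Q2": {"a", "b"}, "Q3": {"a"}, "Q4": {"c"}}
rules = [(("a",), ("b",)), (("a",), ("c",))]
kept_by_count = narrow_by_coverage(rules, texts, min_count=2)
kept_by_percent = narrow_by_coverage(rules, texts, min_percent=60)
```

The rule {a; b} applies to two of the four texts, so it survives a threshold of two texts but not a threshold of 60 percent.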
【００３７】
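The narrowing processing unit 522 can also narrow down the correlation rules in other ways. One method calculates, for a rule {W_1 to W_r; RW_1 to RW_s}, the proportion of all text data that contains {W_1 to W_r, RW_1 to RW_s} (this proportion is also called the support) and selects the correlation rules whose support exceeds a predetermined value. Another method calculates the proportion of the text data containing {W_1 to W_r} that also contains {RW_1 to RW_s} (the posterior confidence) and, conversely, the proportion of all text data that contains {RW_1 to RW_s} (the prior confidence), and selects the correlation rules for which the difference between the prior confidence and the posterior confidence exceeds a predetermined value.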
【００３８】
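That is, when a correlation rule is written as {W_1 to W_r; RW_1 to RW_s}, the support is defined as [(support (%)) = 100 × (number of text data containing W_1 to W_r and RW_1 to RW_s) / (total number of text data)].
The prior confidence is defined as [(prior confidence (%)) = 100 × (number of text data containing RW_1 to RW_s) / (total number of text data)].
The posterior confidence is defined as [(posterior confidence (%)) = 100 × (number of text data containing RW_1 to RW_s and W_1 to W_r) / (number of text data containing W_1 to W_r)].
When a correlation rule is written as {A, B; C}, its confidence is defined as [(confidence (%)) = 100 × (support of A, B, C) / (support of A, B)].
Accordingly, the prior confidence of the correlation rule {A, B; C} equals the confidence of the correlation rule {φ; C}, which is [(confidence (%)) = 100 × (support of C) / (100)].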
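The three measures defined above can be computed as in the following sketch. The function names are hypothetical. The sign convention of the difference (posterior minus prior) is an assumption, since the description only speaks of "the difference" between the two values.

```python
def rule_measures(antecedent, consequent, texts):
    """Return (support, prior confidence, posterior confidence) in percent
    for the rule {antecedent; consequent} over a non-empty list of texts,
    where each text is the set of related words it contains."""
    w, rw = set(antecedent), set(consequent)
    total = len(texts)
    n_w = sum(1 for words in texts if w <= words)
    n_rw = sum(1 for words in texts if rw <= words)
    n_both = sum(1 for words in texts if (w | rw) <= words)
    support = 100.0 * n_both / total
    prior = 100.0 * n_rw / total
    posterior = 100.0 * n_both / n_w if n_w else 0.0
    return support, prior, posterior


def passes_confidence_gap(antecedent, consequent, texts, min_gap):
    """Assumed criterion: posterior confidence exceeds prior confidence
    by more than min_gap percentage points."""
    _, prior, posterior = rule_measures(antecedent, consequent, texts)
    return posterior - prior > min_gap


texts = [{"A", "C"}, {"A", "C"}, {"B"}, {"B"}]
measures = rule_measures({"A"}, {"C"}, texts)
selected = passes_confidence_gap({"A"}, {"C"}, texts, min_gap=15)
```

In this toy data, C appears in half of all texts but in every text that contains A, so the rule {A; C} raises the confidence in C from 50% to 100% and is selected.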
【００３９】
ＵＩ・処理制御部５４８は、表示・入力装置１６（図２）に対してユーザインターフェース用の画像（ＵＩ画像）を表示し、このＵＩ画像に対するユーザの操作を受け入れて、クラスタリングプログラム５の構成部分それぞれに対して出力する。
また、ＵＩ・処理制御部５４８は、ユーザの操作などに応じて、クラスタリングプログラム５の処理全体を制御する。
【００４０】
意味付け・分類ＤＢ５４６は、意味付け処理部５４０および分類処理部５４４における処理において用いられる知識、例えば、相関規則ＤＢ５２４に記憶された相関規則と、その意味とを対応づけるために用いられる情報、相関規則を上位概念にまとめて意味付けするために用いられる情報、および、相関規則の意味を分類するために用いられる情報を記憶し、意味付け処理部５４０および分類処理部５４４に対して提供する。
【００４１】
意味付け処理部５４０は、相関規則ＤＢ５２４から相関規則を読み出して、意味付け・分類ＤＢ５４６に記憶されている情報を参照し、相関規則それぞれに対応する意味、および、相関規則の上位概念としてとらえられる意味、またはこれらのいずれかを作成し、クラスタリング規則ＤＢ５４２に記憶する。
なお、意味付け処理部５４０は、図３に点線で示すように、ＵＩ・処理制御部５４８を介して、相関規則と、相関規則それぞれに対する意味づけをユーザに求めるＵＩ画像を、表示・入力装置１６（図２）に表示し、このＵＩ画像に対するユーザの操作に基づいて、相関規則それぞれの意味を作成してもよい。
【００４２】
図７は、図３に示した分類処理部５４４により分類されたクラス、および、クラスに含まれるテキストデータを、表形式で例示する図である。
分類処理部５４４は、図７に示すように、クラスタリング規則ＤＢ５４２に記憶された相関規則それぞれの意味を読み出し、意味付け・分類ＤＢ５４６に記憶されている情報を参照して、読み出した相関規則それぞれの意味を分類し、クラスタリング規則Ａを作成する。
分類処理部５４４は、作成したクラスタリング規則Ａを、クラスタリング規則ＤＢ５４２に記憶する。
なお、分類処理部５４４は、図３に点線で示すように、意味付け処理部５４０と同様に、ＵＩ・処理制御部５４８を介して、相関規則それぞれの意味と、相関規則の意味の分類をユーザに求めるＵＩ画像を、表示・入力装置１６（図２）に表示し、このＵＩ画像に対するユーザの操作に基づいて、相関規則の意味の分類を行ってもよい。
【００４３】
図７には、分類処理部５４４が、相関規則の意味を分類して、左から１，２番目の欄に示すように、「Ｃ０１；ファイルのダウンロードのＨＴＴＰとＦＴＰとの違い」・「Ｃ０２；バージョンアップ版購入」などのクラスタリング規則Ａを作成し、これらのクラスタリング規則Ａそれぞれに１つ以上の相関規則の意味（図示せず）を含めたことを示している。
また、図７の左から３つめの欄には、クラスタリング規則Ａによるクラスタリングの結果として得られるクラスに含まれるテキストデータそれぞれの識別子（ＩＤ）が示されている。
【００４４】
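FIG. 8 is a diagram schematically illustrating the classes obtained by clustering the set of text data shown in FIG. 6(A).
Based on the clustering rules A stored in the clustering rule DB 542, the clustering processing unit 560 processes the text data and its words stored in the text/word DB 506, clusters the text data into groups (classes) as shown in FIG. 8, and stores the result in the class DB 562.
The clustering processing unit 560 uses, for example, the meanings of the one or more correlation rules contained in each clustering rule A shown in FIG. 7 as an OR condition: when a piece of text data matches any of the correlation rules contained in a clustering rule A, the text data is classified into the class corresponding to that clustering rule A.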
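The OR-condition matching described above can be sketched as follows. The sketch assumes a correlation rule matches a text when the text contains every related word on both sides of the rule. `classify` is a hypothetical name, and `rw1` to `rw6` are placeholder related words; only the class IDs C01 and C02 come from FIG. 7.

```python
def classify(text_words, clustering_rules):
    """Return the IDs of every class whose clustering rule contains at least
    one correlation rule matched by the text (OR condition). The result may
    hold zero, one, or several class IDs."""
    matched = []
    for class_id, correlation_rules in clustering_rules.items():
        if any(set(a) | set(c) <= text_words for a, c in correlation_rules):
            matched.append(class_id)
    return matched


# Class ID -> list of correlation rules (antecedent, consequent).
clustering_rules = {
    "C01": [(("rw1",), ("rw2",)), (("rw3",), ("rw4",))],
    "C02": [(("rw5",), ("rw6",))],
}
both = classify({"rw3", "rw4", "rw5", "rw6"}, clustering_rules)
none = classify({"rw1"}, clustering_rules)
```

A text can therefore land in several classes at once or in no class at all, which matches the "zero or more groups" wording used throughout the description.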
【００４５】
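FIG. 9 is a diagram showing how newly input text data Qnew is classified into the existing classes shown in FIG. 8.
As described later, once the clustering rule creating unit 566 has created clustering rules B (second classification rules) from the existing classes, the clustering processing unit 560 classifies text data Qnew that has been newly input from an operator terminal 32 (FIG. 1), processed by the word segmentation unit 504, and stored in the text/word DB 506 into the existing classes, as shown in FIG. 9, based on the clustering rules B stored in the clustering rule DB 542, and stores the result in the class DB 562.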
【００４６】
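As shown in FIG. 8, clustering based on clustering rules A and B does not necessarily place each piece of text data in a single class: a piece of text data may be classified into several classes at once, or into none.
Depending on the content and nature of the data, the clustering processing unit 560 may also continue to use clustering rule A for subsequent clustering instead of clustering rule B.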
【００４７】
クラス配信部５６４は、他部門システム２２０からの要求に応じて、あるいは、ＵＩ・処理制御部５４８に対するユーザの操作に応じて、クラスＤＢ５６２に記憶されたクラスに属するテキストデータを読み出し、ネットワーク２２を介して、他部門システム２２０に配信する。
【００４８】
上述したように、クラスタリング規則作成部５６６は、クラスタリングプログラム５のクラスタリング処理部５６（図３）に、必要に応じて、選択的に付加され、以下に示すような処理を行う。
図１０は、図３に示したクラスタリング規則作成部５６６により作成されるクラスタリング規則Ｂを例示する図である。
なお、図１０には、２つのクラスタリング規則Ｂが示されている。
【００４９】
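The clustering rule creating unit 566 is, for example, the above-mentioned software for extracting features from a set of texts (Aleph). It extracts the features of the text data contained in each class (FIG. 8 and others) obtained with the clustering rules A created by the classification processing unit 544, creates the clustering rules B (FIG. 10) that are used to classify new text into one of the existing classes when new text data is input from an operator terminal 32 to the clustering device 4 (clustering program 5), and stores them in the clustering rule DB 542.
The clustering rules A and B stored in the clustering rule DB 542 can be output to a recording medium 140 or the like as appropriate and used for clustering in other devices that perform the same processing as the clustering device 4 (FIG. 1 and others).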
【００５０】
なお、あるクラスについて複数のクラスタリング規則Ｂが作成された場合には、そのクラスに含まれるテキストデータは、複数のクラスタリング規則Ｂの０個以上にマッチしている。
また、クラスタリング規則Ｂの作成のためには、クラスタリング規則Ａに基づいて得られたクラスそれぞれに含まれるテキストデータをそのまま用いてもよいし、ユーザが、クラスタリング規則Ａに基づいて得られたクラスそれぞれから適宜、選択したテキストデータを用いてもよい。
また、クラスタリング規則Ｂの作成のためには、クラスタリング規則Ａに基づいて得られたクラスのいずれにも属さないテキストデータから適宜、選択したテキストデータを用いてもよい。
【００５１】
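In FIG. 10, "has_w(Sentence, Word)" indicates that the sentence "Sentence" contains the word "Word".
"label(Word, "LABEL")" indicates that the actual character string of the word "Word" is "LABEL".
"word_distance(word1, word2, near/close/middle/far)" indicates that the distance between the words "word1" and "word2" is "near", "very near", "intermediate", or "far", respectively.
"dependence(A, B)" indicates a grammatical dependency relation, and "part(A, '名詞−形容動詞語幹')" (noun, adjectival-noun stem) indicates part-of-speech information.
"class(Sentence, Class)" indicates that the sentence "Sentence" belongs to the class "Class".
A clustering rule B means that when all the conditions written on the right of ":-" are satisfied, the statement written on the left of ":-" holds.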
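To make the reading of such a rule concrete, the sketch below checks the body of one hypothetical rule against a sentence. The rule itself, the function `matches_rule`, and the interpretation of `dependence(A, B)` as "A depends on B" are assumptions for illustration; they are not taken from FIG. 10.

```python
from itertools import permutations


def matches_rule(words, dependencies, label_a, label_b):
    """Check the body of a hypothetical rule
        class(S, c) :- has_w(S, A), label(A, label_a),
                       has_w(S, B), label(B, label_b), dependence(A, B).
    `words` maps word IDs in sentence S to their surface strings, and
    `dependencies` is a set of (dependent ID, head ID) pairs."""
    for a, b in permutations(words, 2):
        if words[a] == label_a and words[b] == label_b and (a, b) in dependencies:
            return True  # every condition on the right of ":-" holds
    return False


words = {1: "紙詰まり", 2: "解決する", 3: "方法"}
dependencies = {(1, 2)}
in_class = matches_rule(words, dependencies, "紙詰まり", "解決する")
reversed_order = matches_rule(words, dependencies, "解決する", "紙詰まり")
```

The head of the rule holds only when some pair of words in the sentence satisfies all body conditions together, which is why the reversed labels fail even though both words are present.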
【００５２】
図１１は、図３に示したクラスタリング規則作成部５６６が、クラスタリング規則Ｂ（図１０）を作成するために用いる正例を例示する図である。
図１２は、図３に示したクラスタリング規則作成部５６６が、クラスタリング規則Ｂ（図１０）を作成するために用いる負例を例示する図である。
クラスタリング規則作成部５６６には、ＵＩ・処理制御部５４８などから、あるクラスからクラスタリング規則Ｂを作成する際に、特徴を抽出する対象のクラス（またはクラスタ）に属するテキストを示す正例、および、特徴を抽出する対象のクラス（またはクラスタ）に属さないテキストを示す負例（図１１，図１２）、および、特徴を抽出するための背景知識（図示せず）が設定され、クラスタリング規則作成部５６６は、これらの情報を用いて、各クラスの特徴を抽出し、図１０に示したクラスタリング規則Ｂとする。
【００５３】
［クラスタリング装置４（クラスタリングプログラム５）の動作］
以下、クラスタリング装置４（クラスタリングプログラム５）の動作を説明する。
オペレータ端末３２（図１，図２）に対して、相談・問い合わせの文章のテキストデータが入力されると、オペレータ端末３２は、入力されたテキストデータをクラスタリング装置４に対して出力する。
オペレータ端末３２からクラスタリング装置４に入力されたテキストデータは、テキスト受信部５００（図３）により、順次、テキストＤＢ５０２に記憶される。
【００５４】
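FIG. 13 is a flowchart showing the operation of the clustering program 5 shown in FIG. 3 when the clustering rule creating unit 566 is not used.
As shown in FIG. 13, in step 100 (S100), the clustering program 5 (FIG. 3) is started on the clustering device 4 (FIGS. 1 and 2).
In step 102 (S102), the UI/processing control unit 548 (FIG. 3) searches the clustering rule DB 542 and determines whether a clustering rule A has already been created.
If a clustering rule A already exists, the clustering program 5 proceeds to S110; otherwise, it proceeds to S120 of the clustering rule A creation processing (S12) described with reference to FIGS. 6 and 7.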
【００５５】
ステップ１２０（Ｓ１２０）において、クラスタリングプログラム５（図３）の分かち書き処理部５０４は、テキストデータに対する分かち書き処理を行い、単語の抽出およびその品詞の識別を行い、その結果をテキスト・単語ＤＢ５０６に記憶する。
各構成部分は、図６，図７を参照して説明したように、クラスタリング規則Ａを作成する。
【００５６】
ステップ１２２（Ｓ１２２）において、関連付け処理部５０８は、テキスト・単語ＤＢ５０６に記憶された単語の係り受け関係に基づき、関連単語を作成し、テキスト・単語ＤＢ５０６に記憶する。
ステップ１２４（Ｓ１２４）において、相関規則作成処理部５２０は、テキスト・単語ＤＢ５０６に記憶された関連単語の相関関係を求め、相関規則ＤＢ５２４に記憶する。
【００５７】
ステップ１２６（Ｓ１２６）において、絞り込み処理部５２２は、相関規則ＤＢ５２４に記憶された相関関係を絞り込み、クラスタリング規則Ａとする。
【００５８】
ステップ１０４（Ｓ１０４）において、クラスタリング処理部５６０（図３）は、ステップ１２（Ｓ１２）の処理により作成されたクラスタリング規則Ａを用いて、テキスト・単語ＤＢ５０６に記憶されたテキストを、図７，図８を参照して説明したようにクラスタリングし、クラスを作成する。
なお、上述のように、Ｓ１０６の処理において、クラスタリング規則Ａにより作成されたクラスに含まれるテキストデータは、適宜、ユーザによる選択を受ける場合がある。
【００５９】
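In step 106 (S106), the clustering rule creating unit 566 (FIG. 3) extracts the features of each class created in S104, as described with reference to FIGS. 10 to 12, and creates the clustering rule B that the clustering processing unit 560 uses, when new text is input from an operator terminal 32, to decide which of the existing classes the new text should be classified into.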
【００６０】
ステップ１１０（Ｓ１１０）において、ＵＩ・処理制御部５４８（図３）は、新たなテキストデータがオペレータ端末３２から入力されたか否かを判断する。
クラスタリングプログラム５は、新たなテキストデータが入力された場合にはＳ１１２の処理に進み、これ以外の場合には処理を終了する。
【００６１】
ステップ１１２（Ｓ１１２）において、クラスタリング処理部５６０（図３）は、クラスタリング規則Ａを用いて、新たに入力されたテキストデータを、既存のクラスに分類する。
【００６２】
図１４は、図３に示したクラスタリングプログラム５において、クラスタリング規則作成部５６６が用いられる場合の処理（Ｓ１４）を示すフローチャートである。
図１４に示すように、ステップ１４０（Ｓ１４０）において、クラスタリング装置４（図１，図２）上で、クラスタリングプログラム５（図３）が起動される。
【００６３】
ステップ１４２（Ｓ１４２）において、ＵＩ・処理制御部５４８（図３）は、クラスタリング規則ＤＢ５４２を検索し、既にクラスタリング規則Ｂが作成されているか否かを判断する。
クラスタリングプログラム５は、第２のクラスタリング規則Ｂが既に存在する場合にはＳ１５０の処理に進み、これ以外の場合には図６，図７，図１３を参照して説明したクラスタリング規則Ａの作成処理（Ｓ１２）に進む。
【００６４】
ステップ１４４（Ｓ１４４）において、クラスタリング処理部５６０（図３）は、ステップ１２（Ｓ１２）の処理により作成されたクラスタリング規則Ａを用いて、テキスト・単語ＤＢ５０６に記憶されたテキストを、図７，図８を参照して説明したようにクラスタリングし、クラスを作成する。
なお、上述のように、Ｓ１４６の処理において、クラスタリング規則Ａにより作成されたクラスに含まれるテキストデータは、適宜、ユーザによる選択を受ける場合がある。
【００６５】
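In step 146 (S146), the clustering rule creating unit 566 (FIG. 3) extracts the features of each class created in S144, as described with reference to FIGS. 10 to 12, and creates the clustering rule B that the clustering processing unit 560 uses, when new text is input from an operator terminal 32, to decide which of the existing classes the new text should be classified into.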
【００６６】
ステップ１５０（Ｓ１５０）において、ＵＩ・処理制御部５４８（図３）は、新たなテキストデータがオペレータ端末３２から入力されたか否かを判断する。
クラスタリングプログラム５は、新たなテキストデータが入力された場合にはＳ１５２の処理に進み、これ以外の場合には処理を終了する。
【００６７】
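In step 152 (S152), the clustering processing unit 560 (FIG. 3) classifies the newly input text data into the existing classes using clustering rule B.
The classes (FIGS. 8 and 9) created as described above with reference to FIGS. 13 and 14 are distributed by the class distribution unit 564 to the other-department systems 220 (FIGS. 1 and 2) over the network 22 as appropriate.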
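The control flow of FIG. 14 described above can be summarized in a short sketch. The function `run`, the dictionary `rule_db`, and the three callables are hypothetical stand-ins for the clustering rule DB 542 and the processing units; the placeholder rules below match on a keyword only to keep the example small.

```python
def run(rule_db, stored_texts, new_texts, make_rule_a, make_rule_b, classify):
    """Follow the flow of FIG. 14: build rules A and B only when rule B is
    missing, then classify newly entered texts with rule B."""
    if "B" not in rule_db:                                # S142
        rule_db["A"] = make_rule_a(stored_texts)          # S12 (S120 to S126)
        classes = classify(rule_db["A"], stored_texts)    # S144
        rule_db["B"] = make_rule_b(classes)               # S146
    if not new_texts:                                     # S150
        return {}
    return classify(rule_db["B"], new_texts)              # S152


def make_rule_a(texts):
    return {"紙詰まり": "C01"}   # placeholder rule: keyword -> class ID


def make_rule_b(classes):
    return {"紙詰まり": "C01"}   # placeholder: reuse the same keyword rule


def classify(rule, texts):
    return {tid: [cls for kw, cls in rule.items() if kw in body]
            for tid, body in texts.items()}


rule_db = {}
result = run(rule_db, {"Q1": "紙詰まりを解決したい"}, {"Qnew": "紙詰まりが起きた"},
             make_rule_a, make_rule_b, classify)
```

On a second call with the same `rule_db`, the rule-building branch is skipped and only the classification of new text runs.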
【００６８】
［実施例］
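The following describes an example in which, in the clustering device 4 (FIGS. 1 and 2) of the data mining system 1, text data is classified with clustering rule A without using the clustering rule creating unit 566 (FIG. 3) of the clustering program 5, as shown in FIG. 13.
FIG. 15 is a table comparing, for the clustering program 5 (FIG. 3), the case in which text data is classified into classes by a clustering rule A generated from individual words without using the association processing unit 508 and the case in which text data is classified into classes, as shown in FIG. 13, by a clustering rule A generated from related words using the association processing unit 508.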
【００６９】
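As described above, in the clustering program 5 the association processing unit 508 is used: the words in a text are associated through their dependency relations, the correlation rule creation processing unit 520 extracts the correlations between the resulting related words, and based on these correlations the narrowing processing unit 522, the meaning assignment unit 540, and the classification processing unit 544 create clustering rule A.
Even without the association processing unit 508, however, the correlation rule creation processing unit 520 can extract correlations between the words themselves, and the classification processing unit 544 and other units can create clustering rule A from these word correlations.
FIG. 15 shows the result of classifying text data into classes with a clustering rule A generated using the association processing unit 508 and the result with a clustering rule A generated without it.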
【００７０】
なお、図１５に示した例においては、ある企業のコールセンターにおいて、２００２年４月１日から同年７月３１日までの間に受け付けられた実際の問い合わせを示す６０２個のテキストデータ（以下、ソースデータとも記す）が処理対象とされている。
また、この例においては、分かち書き処理部５０４などには、具体例として示した各ソフトウェアが用いられている。
また、分かち書き処理部５０４には、パーソナルコンピュータの代表的ＯＳの名称などを１つの語句として捉え、分かち書きの結果として得られる語句が細かくなりすぎないようにＩＴ用語リストを参照させている。
処理対象のテキストデータ１つには、平均して１２．５個の単語が含まれ、６０２個のテキストデータに含まれる総単語数は７５１７個で、関連付け処理部５０８の処理により、３１１６個の異なる関連単語が得られた。
【００７１】
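In FIG. 15, the minimum support is defined as [(minimum support (%)) = 100 × (number of text data containing W_1 to W_r) / (total number of text data)].
The prior/posterior confidence difference is the difference between the prior confidence and the posterior confidence (see the description of the narrowing processing unit 522).
The smaller the minimum support and the prior/posterior confidence difference, the fewer pieces of text data the clustering program 5 needs to create clustering rule A. The comparison shows that using the association processing unit 508 allows clustering rule A to be derived from fewer pieces of text data than when it is not used.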
【００７２】
また、いずれの場合でも、導出されるルール（クラスタリング規則Ａ）の数には大差はないが、専門家が、ソースデータと導出されたクラスタリング規則Ａとを比較し、意味解釈可能だと判断したクラスタリング規則Ａの数、および、その割合は、関連付け処理部５０８を用いることにより、大幅に増加することがわかる。
また、専門家により、有用性が高いと判断された高有用性ルール（クラスタリング規則Ａ）の数は、いずれの場合でも大差ないが、これを求めるために用いられた平均のテキストデータ数は、関連付け処理部５０８を用いる場合の方が、大幅に少なくなっており、関連付け処理部５０８を用いると、少ないソースデータから、有用なクラスタリング規則Ａが得られることがわかる。
【００７３】
図１６は、絞り込み処理部５２２（図３）に対して設定される事前確信度と事後確信度との差と、関連付け処理部５０８を用いて、図１３に示したようにクラスタリング規則Ａを用いてテキストを分類した結果とを対比して示す図表である。
図１６を参照してわかるように、絞り込み処理部５２２に対して、事前確信度と事後確信度との差が１５％程度になるように、相関規則の絞り込みを行わせると、クラスタリングプログラム５により、良好なテキストデータの分類結果が得られることがわかる。
【００７４】
図１７は、絞り込み処理部５２２（図３）に対して設定される最小確信度と、関連付け処理部５０８を用いずに、図１３に示したようにクラスタリング規則Ａを用いてテキストを分類した結果とを対比して示す図である。
図１７に示す最小確信度の定義は、上述の事後確信度の定義と同じであって、この最小確信度は、絞り込み処理部５２２に対して設定される。
図１７に示すように、関連付け処理部５０８を用いずに、クラスタリング規則Ａを用いたテキストの分類を行うと、絞り込み処理部５２２に対して、どのような最小確信度を設定しても、結果に含まれるルール（クラスタリング規則Ａ）の内、有用性が高いものの数に変化が生じない。
【００７５】
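FIG. 18 is a diagram showing the precision and recall of the result of classifying texts with clustering rule A, using the association processing unit 508, as shown in FIG. 13.
Precision is the proportion of the texts contained in each obtained class that an expert judged to belong properly in that class.
Recall is the proportion of the texts that should be contained in each obtained class that were actually contained in that class.
FIG. 18 shows that when text data is classified into classes using the association processing unit 508 as shown in FIG. 13, high precision and recall are obtained, with a few exceptions.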
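The two measures can be computed per class as in the sketch below. The function name and the example text IDs are hypothetical; the values in FIG. 18 are not reproduced here.

```python
def precision_recall(assigned, relevant):
    """assigned: IDs of the texts placed in a class.
    relevant: IDs of the texts an expert judges to belong to that class.
    Returns (precision, recall) as fractions between 0 and 1."""
    assigned, relevant = set(assigned), set(relevant)
    correct = len(assigned & relevant)
    precision = correct / len(assigned) if assigned else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    return precision, recall


precision, recall = precision_recall(
    assigned={"Q1", "Q2", "Q3", "Q4"},
    relevant={"Q1", "Q2", "Q3", "Q5", "Q6"},
)
```

In this example three of the four assigned texts are correct (precision 0.75) and three of the five texts that belong in the class were found (recall 0.6).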
【００７６】
［関連出願］
本発明は、本出願人による特願２００２−３６６６９０号に関連する。
【００７７】
【発明の効果】
以上説明したように、本発明にかかる文章分類装置およびその方法によれば、広く集められた文章を、有用な情報を含むグループに分類し、利用者に提供することができる。
また、本発明にかかる文章分類装置およびその方法によれば、係り受け関係を有する複数の単語の間の相関関係を抽出し、さらに、この相関関係を絞り込んでデータマイニングを行うことができる。
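[Brief description of the drawings]
[FIG. 1] A diagram showing the configuration of a data mining system to which the sentence data classification method according to the present invention is applied.
[FIG. 2] A diagram illustrating the hardware configuration of the operator terminal, the clustering device, and the systems of other departments shown in FIG. 1.
[FIG. 3] A diagram showing the configuration of the clustering program executed in the clustering device shown in FIGS. 1 and 2.
[FIG. 4] A diagram illustrating the phrase structure of a sentence.
[FIG. 5] A diagram illustrating the dependency structure of the words in a sentence.
[FIG. 6] A diagram showing the processing of the pre-processing unit of the clustering program shown in FIG. 3, in which (A) schematically shows the set of text data Q stored in the text DB, (B) illustrates the content of a piece of text data Q, and (C) shows the words (word segmentation result) obtained when the word segmentation unit processes the text data.
[FIG. 7] A diagram illustrating, in table form, the classes produced by the classification processing unit shown in FIG. 3, the number of correlation rules contained in each class, and the text data contained in each class.
[FIG. 8] A diagram schematically illustrating the classes obtained by clustering the set of text data shown in FIG. 6(A).
[FIG. 9] A diagram showing how newly input text data Qnew is classified into the existing classes shown in FIG. 8.
[FIG. 10] A diagram illustrating the clustering rules B created by the clustering rule creating unit 566 shown in FIG. 3.
[FIG. 11] A diagram illustrating the positive examples that the clustering rule creating unit shown in FIG. 3 uses to create the clustering rules B (FIG. 10).
[FIG. 12] A diagram illustrating the negative examples that the clustering rule creating unit shown in FIG. 3 uses to create the clustering rules B (FIG. 10).
[FIG. 13] A flowchart showing the operation of the clustering program shown in FIG. 3 when the clustering rule creating unit is not used.
[FIG. 14] A flowchart showing the processing (S14) of the clustering program shown in FIG. 3 when the clustering rule creating unit is used.
[FIG. 15] A table comparing, for the clustering program (FIG. 3), the case in which text data is classified into classes by a clustering rule A generated from individual words without using the association processing unit and the case in which text data is classified into classes, as shown in FIG. 13, by a clustering rule A generated from related words using the association processing unit.
[FIG. 16] A table comparing the difference between the prior confidence and the posterior confidence set for the narrowing processing unit (FIG. 3) with the result of classifying texts with clustering rule A, using the association processing unit, as shown in FIG. 13.
[FIG. 17] A diagram comparing the minimum confidence set for the narrowing processing unit (FIG. 3) with the result of classifying texts with clustering rule A, without using the association processing unit, as shown in FIG. 13.
[FIG. 18] A diagram showing the precision and recall of the result of classifying texts with clustering rule A, using the association processing unit, as shown in FIG. 13.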
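[Explanation of reference signs]
1 ... Data mining system,
3 ... Call center,
30 ... Call accepting device,
32 ... Operator terminal,
34 ... LAN,
4 ... Clustering device,
5 ... Clustering program,
50 ... Pre-processing unit,
500 ... Text receiving unit,
502 ... Text DB,
504 ... Word segmentation unit,
506 ... Text/word DB,
508 ... Association processing unit,
52 ... Correlation rule creating unit,
520 ... Correlation rule creation processing unit,
522 ... Narrowing processing unit,
524 ... Correlation rule DB,
54 ... Meaning/classification unit,
540 ... Meaning assignment unit,
542 ... Clustering rule DB,
546 ... Meaning/classification DB,
544 ... Classification processing unit,
548 ... UI/processing control unit,
56 ... Clustering processing unit,
560 ... Clustering processing unit,
562 ... Class DB,
564 ... Class distribution unit,
566 ... Clustering rule creating unit,
22 ... Network,
220 ... Other department system,
10 ... Main body,
102 ... CPU,
104 ... Memory,
12 ... Communication device,
14 ... Recording device,
140 ... Recording medium,
16 ... Display/input device,
20 ... Telephone network,
200, 202 ... Telephone
[0001]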
[Technical field of the invention]
The present invention relates to a sentence classifying apparatus for classifying sentences according to their contents and a method therefor.
[0002]
[Prior art]
For example, Non-Patent Documents 1 and 2 disclose text mining, a type of data mining that is used for market research and the like and that takes text data as its processing target.
Non-Patent Document 3 discloses software (ChaSen) that decomposes a sentence into words.
Non-Patent Document 4 discloses a method for obtaining rules (correlation rules) that indicate the correlations between the words in a sentence, and Non-Patent Document 5 discloses software (Apriori) for obtaining correlation rules.
Non-Patent Document 6 discloses software (Aleph) that extracts features from a set of texts.
Non-Patent Document 7 discloses software (CaboCha) for analyzing the dependency relations between words in Japanese sentences (e.g., subject and verb, verb and object or complement).
[0003]
Here, manufacturers and trading companies of computers and OA equipment are provided with a department for receiving user consultation, and such a department is often called a call center or the like.
The consultation accepted by the call center includes many hints for product development, but the words used in the text may differ between the user and the product developer.
Therefore, even if the consultation accepted by the call center is compiled into a database, useful information cannot be successfully extracted unless the product developer knows the words used by the user for the consultation.
[0004]
In view of such a point, for example, Non-Patent Documents 8 and 9 disclose a method for obtaining knowledge by extracting a correlation between words in a text received at a call center.
However, Non-Patent Documents 8 and 9 do not disclose a method of performing data mining by extracting the correlations between a plurality of words that have a dependency relationship and then narrowing down those correlations using the difference between the prior confidence and the posterior confidence.
[0005]
[Non-Patent Document 1] Special Issue "Text Mining", Journal of the Japanese Society for Artificial Intelligence vol. 16, No. 2, 2001
[Non-Patent Document 2] Special Issue "Knowledge Management and Its Support Technology", Journal of the Japanese Society for Artificial Intelligence vol. 16, No. 1, 2001
[Non-Patent Document 3] http://chasen.aist-nara.ac.jp/index.html.ja
[Non-Patent Document 4] Data Mining (Data Science Series 3, Fukuda et al., Kyoritsu Shuppan (first edition, first printing, September 1, 2001), ISBN 4-320-12002-7)
[Non-Patent Document 5] http://fuzzy.cs.uni-magdeburg.de/~borgelt/apriori/
[Non-Patent Document 6] http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/Aleph/
[Non-Patent Document 7] http://cl.aist-nara.ac.jp/~taku-ku/software/cabocha/
[Non-Patent Document 8] Text Mining in a Call Center (Journal of the Japanese Society for Artificial Intelligence, vol. 16, No. 2, pp. 220-225, Nasukawa)
[Non-Patent Document 9] Text Mining: Knowledge Acquisition from Massive Text Data - Recognition of Intention - (Proceedings of the 57th National Convention of the Information Processing Society of Japan (second half of 1998); 3-75, Nasukawa et al.)
[0006]
[Problems to be solved by the invention]
The present invention has been made in view of the above background, and an object thereof is to provide a sentence classification device and a method therefor that can classify widely collected sentences into groups containing useful information and provide them to users.
Another object of the present invention, also made in view of the above background, is to provide a sentence classification device and a method therefor for performing data mining by extracting the correlations between a plurality of words that have a dependency relationship and then narrowing down those correlations.
[0007]
[Means for Solving the Problems]
[Sentence Classifier]
In order to achieve the above object, a sentence classification device according to the present invention is a sentence classification device that classifies documents containing a plurality of words into zero or more groups, and includes: word extracting means for extracting the plurality of words contained in each of the plurality of sentences; word classifying means for classifying the plurality of words contained in each of the sentences into related words, each of which contains two or more words related to each other; and classification rule creating means for creating, based on the correlations between the classified related words, a classification rule for classifying the plurality of sentences into the zero or more groups.
[0008]
Preferably, there is further provided a sentence classifying means for classifying the plurality of sentences into the zero or more groups based on the created classification rule.
[0009]
Preferably, after the classification rule has been created, the sentence classifying means classifies a sentence newly targeted for classification into zero or more of the groups based on the already created classification rule.
[0010]
Preferably, the correlation rule creating means includes combination creating means for creating combinations of related words obtained from the words of the same sentence, and combination processing means for processing the created combinations so that they satisfy a predetermined condition.
[0011]
Preferably, the combination indicates that when a sentence contains one or more first related words, the same sentence also contains one or more other, second related words.
[0012]
Preferably, the combination processing means performs a process of selecting, from the created combinations, combinations that match a predetermined ratio or more or a predetermined number or more of the sentences.
[0013]
Preferably, the combination processing means selects combinations for which the proportion of sentences containing the first related words is at least a predetermined ratio, or combinations for which the difference between the proportion of sentences containing the second related words among the sentences containing the first related words and the proportion of sentences containing the second related words is at least a predetermined value.
[0014]
Preferably, the word extracting means further identifies the part of speech of each of the plurality of words, and the word classifying means classifies the plurality of words contained in each of the sentences into related words, each containing two or more related words, based on the identified part of speech of each word.
[0015]
[Sentence classification method]
A sentence classification method according to the present invention is a sentence classification method for classifying documents containing a plurality of words into zero or more groups, the method comprising: extracting the plurality of words contained in each of the plurality of sentences; classifying the plurality of words contained in each of the sentences into related words, each containing two or more related words; and creating, based on the correlations between the classified related words, a classification rule for classifying the plurality of sentences into the zero or more groups.
[0016]
Preferably, the plurality of sentences are classified into the zero or more groups based on the created classification rule.
[0017]
Preferably, a program for classifying documents containing a plurality of words into zero or more groups causes a computer to execute the steps of: extracting the plurality of words contained in each of the plurality of sentences and identifying the part of speech of each extracted word; classifying, based on the part of speech of each extracted word, the plurality of words contained in each of the sentences into related words, each containing two or more related words; and creating, based on the correlations between the classified related words, a classification rule for classifying the plurality of sentences into the zero or more groups.
[0018]
Preferably, the program causes the computer to execute a step of classifying the plurality of sentences into the zero or more groups based on the created classification rule.
[0019]
[Embodiments of the invention]
Hereinafter, embodiments of the present invention will be described.
[0020]
[Data mining system 1]
FIG. 1 is a diagram showing a configuration of a data mining system 1 to which a sentence data classification method according to the present invention is applied.
As shown in FIG. 1, the data mining system 1 includes a call center 3 connected to a telephone network 20, a clustering device 4, and a private network 22 such as a LAN / WAN.
The network 22 connects the clustering device 4 and the systems 220-1 to 220-n of other departments.
The telephone network 20 is, for example, a general public telephone line, to which a large number of user-side telephones 200 and one or more center-side telephones 202-1 to 202-m (m is an integer of 1 or more) are connected.
The call center 3 includes call accepting devices 30-1 to 30-m connected to the clustering device 4 via the LAN 34.
Each of the call receiving devices 30-1 to 30-m includes an operator terminal 32-1 to 32-m, and a telephone 202-1 to 202-m.
[0021]
With these components, the data mining system 1, for example at a computer and OA equipment manufacturer, receives consultations and inquiries from users, classifies the sentences of the received consultations and inquiries into groups (classes) that contain information useful for the work of each of the other departments, such as the development department, and provides them to the systems 220-1 to 220-n of the other departments.
When any one of a plurality of components such as the call accepting devices 30-1 to 30-m is indicated without being specified, it may be simply referred to as the call accepting device 30 or the like.
[0022]
[Hardware configuration]
Next, the hardware configuration of the operator terminal 32, the clustering device 4, and the other department system 220 will be described.
FIG. 2 is a diagram illustrating the hardware configuration of the operator terminal 32, the clustering device 4, and the system 220 of another department shown in FIG. 1.
As shown in FIG. 2, the operator terminal 32, the clustering device 4, and the system 220 of another department each consist of a main body 10 including a CPU 102, a memory 104, and the like; a communication device 12 that performs communication among the operator terminal 32, the clustering device 4, and the other department system 220; a recording device 14 such as an HDD or CD device; and a display/input device 16 including an LCD display device, a keyboard, a mouse, and the like.
That is, the operator terminal 32, the clustering device 4, and the other department system 220 have components as a general computer having a communication function.
[0023]
[Call accepting device 30]
Next, operations of the call receiving device 30 and the other department system 220 of the data mining system 1 will be described.
The call receiving device 30 is used by an operator (not shown) to receive a consultation / inquiry call from a user and to input the contents of the consultation / inquiry.
That is, when the user calls the center-side telephone 202 from the user-side telephone 200 and consults / inquires with the operator of the call accepting device 30 by voice, the operator inputs a sentence describing the contents into the operator terminal 32 to create text data of the consultation / inquiry.
Note that the call receiving device 30 may receive a consultation / inquiry from a user using an e-mail, and create the received e-mail as it is as text data of a sentence indicating the contents of the consultation / inquiry.
The operator terminal 32 of the call accepting device 30 transmits the text data thus created to the clustering device 4 via the LAN 34.
[0024]
[Other department system 220]
The other department system 220 is installed in each department, for example, a development department or a sales department, receives groups (classes) of text data classified by meaning and content from the clustering device 4, and presents them to members of the department (not shown).
[0025]
[Clustering device 4]
Next, the configuration and operation of the clustering program 5 that operates on the clustering device 4 will be described.
FIG. 3 is a diagram showing a configuration of the clustering program 5 executed in the clustering device 4 shown in FIGS. 1 and 2.
As shown in FIG. 3, the clustering program 5 includes a pre-processing unit 50, a correlation rule creating unit 52, a meaning / classification unit 54, and a clustering processing unit 56.
[0026]
The preprocessing unit 50 includes a text receiving unit 500, a text database (text DB) 502, a segmentation processing unit 504, a text / word DB 506, and an association processing unit 508.
The correlation rule creation unit 52 includes a correlation rule creation processing unit 520, a narrow-down processing unit 522, and a correlation rule DB 524.
The meaning / classification unit 54 includes a meaning processing unit 540, a clustering rule DB 542, a classification processing unit 544, a meaning / classification DB 546, and a user interface (UI) / processing control unit 548.
[0027]
The clustering processing unit 56 includes a clustering processing unit 560, a class DB 562, and a class distribution unit 564, and further includes a clustering rule creation unit 566 as necessary, as indicated by a dotted line in FIG.
Note that, among the databases of the clustering program 5, those that can be shared may be integrated into one.
The clustering program 5 receives the text data of the consultation / inquiry from the call center 3 by these components, classifies the text data into a class useful for each department, and distributes it to the other department system 220.
[0028]
FIGS. 6A to 6C are diagrams showing the processing of the preprocessing unit 50 of the clustering program 5 shown in FIG. 3, where (A) schematically shows a set of text data Q stored in the text DB 502, (B) illustrates the contents of the text data Q, and (C) illustrates the words (segmentation result) obtained by the segmentation processing unit 504 processing the text data.
However, in each of the following drawings, the sentences that are the processing targets of each component of the clustering program 5 are not necessarily the same.
[0029]
In FIGS. 6A to 6C, as shown in FIG. 6B, the text of one piece of the text data Q shown in FIG. 6A is “YY in use. However, if you bring it to an XX object, it will be quite rough, but is there any good way?”
FIG. 6C shows the case where this sentence is broken down into words such as “YY (model name)” and “in use”, and these words are identified as a proper noun, a noun, and so on, respectively.
[0030]
In the preprocessing unit 50, the text receiving unit 500 receives text data (consultation / inquiry sentences) from each of the operator terminals 32, unifies the format, and stores it in the text DB 502 as shown in FIG. 6A.
As the segmentation processing unit 504, for example, the above-described software “chasen” is used. It separates all the words included in the sentence indicated by each text data stored in the text DB 502 (hereinafter, the “sentence indicated by each text data” is simply referred to as the “sentence of the text data”), for example, the consultation / inquiry sentence shown in FIG. 6B, as shown in FIG. 6C, and determines the part of speech of each separated word (segmentation processing).
[0031]
In this part and the following description, specific software such as “chasen” may be given as an example, but this is for the purpose of making the description of the invention concrete and clear, and is not intended to limit the technical scope of the invention.
The software cited as a specific example can be replaced with another means having the same function and performance.
[0032]
FIG. 4 is a diagram illustrating a phrase structure of a sentence.
As shown in FIG. 4, a Japanese sentence can also be decomposed into a phrase structure following the parsing of an English sentence.
The segmentation processing unit 504 takes, from among the words obtained as a result of the segmentation processing, the important words used in the processing of the correlation rule creation unit 52 and the meaning / classification unit 54, for example, nouns, verbs, adjectives, adjectival verbs, and proper nouns (the segmentation processing unit 504 may identify a proper noun as a word of unknown part of speech), associates these words and their parts of speech with the text data of the sentence containing them, and stores them in the text / word DB 506.
Note that the information indicating the part of speech of each word obtained by the segmentation processing unit 504 can also be used in the process of creating the clustering rule B by the clustering rule creation unit 566, described later.
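The selection of important words can be illustrated with a minimal Python sketch. It assumes the segmentation result is already available as (word, part of speech) pairs; the tag names, the set `IMPORTANT_POS`, and the function `select_important_words` are hypothetical and do not reflect the output format of any particular segmentation software.

```python
# Hypothetical part-of-speech tags; a real segmenter defines its own tag set.
# "unknown" is kept because a proper noun may be identified as unknown.
IMPORTANT_POS = {"noun", "verb", "adjective", "adjectival verb",
                 "proper noun", "unknown"}

def select_important_words(text_id, tokens):
    """Keep only important words and associate them with their text data.

    tokens: list of (word, part_of_speech) pairs from the segmentation step.
    Returns a record of the kind that could be stored in a text / word DB.
    """
    kept = [(word, pos) for word, pos in tokens if pos in IMPORTANT_POS]
    return {"text_id": text_id, "words": kept}

record = select_important_words(
    "Q1",
    [("YY", "proper noun"), ("in use", "noun"), ("but", "conjunction"),
     ("rough", "adjective"), ("is", "particle")],
)
print(record["words"])
```

In this sketch the conjunction and the particle are dropped, and the remaining words stay linked to the identifier of the text data they came from.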
[0033]
FIG. 5 is a diagram illustrating a dependency structure of a word included in a sentence.
As shown in FIG. 5, there is a dependency relationship between words included in a sentence.
As the association processing unit 508, for example, the above-described software “CaboCha” is used.
The association processing unit 508 associates two or more words that are in a dependency relationship among the words obtained as a result of the segmentation processing by the segmentation processing unit 504 and stored in the text / word DB 506, and stores the associated words in the text / word DB 506 as a related word W (RW), where $W(RW) = \{w_1 \sim w_p;\ rw_1 \sim rw_q\}$, the word $rw$ is the word receiving the word $w$, and $p, q \geq 1$.
The processing of the association processing unit 508 will be further described with a specific example.
For example, suppose that the segmentation processing unit 504 separates “enlarged copy (noun)” and “take (verb)”, and “paper jam (noun)” and “solve (verb)”, from the text of the text data and identifies their parts of speech.
[0034]
In this case, the association processing unit 508 determines that the noun “enlarged copy” is the subject of the verb “take” and that “enlarged copy” is a word related to “take” (“take” is the word that receives “enlarged copy”), associates “enlarged copy” with “take”, and stores them in the text / word DB 506 as a related word (enlarged copy → take).
Similarly, the association processing unit 508 determines that the noun “paper jam” is an object of the verb “solve” and that “paper jam” is a word related to “solve” (“solve” is the word that receives “paper jam”), associates “paper jam” with “solve”, and stores them in the text / word DB 506 as a related word (paper jam → solve).
Note that the information indicating the dependency between words obtained by the association processing unit 508 can also be used in the later-described clustering rule B creation processing by the clustering rule creation unit 566 as needed.
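The association of words in a dependency relationship can be sketched as follows. This is a minimal illustration, assuming that a dependency parser has already produced, for each word, the index of the word that receives it; the function `make_related_words` and the `heads` list are hypothetical and are not the interface of CaboCha.

```python
def make_related_words(tokens, heads):
    """Pair each word with the word that receives it.

    tokens: list of words in the sentence.
    heads:  heads[i] is the index of the word receiving tokens[i],
            or None if tokens[i] depends on no other word.
    Returns (w, rw) pairs, where rw is the word receiving w.
    """
    return [(tokens[i], tokens[h]) for i, h in enumerate(heads) if h is not None]

tokens = ["enlarged copy", "take", "paper jam", "solve"]
heads = [1, None, 3, None]  # assumed parser output for this example
related = make_related_words(tokens, heads)
print(related)
```

Each resulting pair corresponds to one related word of the form (w → rw), such as (enlarged copy → take).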
[0035]
The correlation rule creation processing unit 520 is, for example, the above-described software (Apriori) that obtains correlation rules. It reads, from the text / word DB 506, the words of the text of each text data obtained by the processing of the segmentation processing unit 504, their parts of speech, and the related words obtained by the processing of the association processing unit 508, and creates correlation rules indicating correlations between the related words.
A correlation rule indicates that, when the related words $W_1 \sim W_r$ are included in the sentence of a certain text data, the related words $RW_1 \sim RW_s$ are included in the same sentence ($r$ and $s$ are integers of 1 or more). The correlation rule creation processing unit 520 creates a large number of combinations of the related words, within an appropriate range, as correlation rules.
Such a correlation rule is expressed, for example, in a format such as $\{W_1 \sim W_r;\ RW_1 \sim RW_s\}$.
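As a rough illustration of creating many candidate rules from co-occurring related words, the following Python sketch enumerates one-to-one rules. It is a naive enumeration written for this explanation, not the Apriori algorithm itself, and the function name `candidate_rules` and the string form of the related words are assumptions.

```python
from itertools import permutations

def candidate_rules(sentences):
    """Enumerate candidate rules (antecedent, consequent) from co-occurrence.

    sentences: list of sets of related words, one set per text data.
    Only rules with one related word on each side are generated, for brevity.
    """
    rules = set()
    for related_words in sentences:
        for w, rw in permutations(sorted(related_words), 2):
            rules.add((w, rw))
    return rules

sentences = [
    {"enlarged copy->take", "paper jam->solve"},
    {"enlarged copy->take"},
]
rules = candidate_rules(sentences)
print(sorted(rules))
```

Candidates produced in this way would then be passed to a narrowing-down step that discards rules applying to too few text data.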
[0036]
The narrowing-down processing unit 522 applies each of the many correlation rules created by the correlation rule creation processing unit 520 to the text of each piece of text data stored in the text / word DB 506, and determines to how many pieces of text data, or to what percentage of the text data, each rule applies.
Further, the narrowing-down processing unit 522 selects the correlation rules that apply to a predetermined number or more of text data, the correlation rules that apply to a predetermined percentage or more of text data, or either of these, and stores the selected correlation rules in the correlation rule DB 524.
The correlation rule creation processing unit 520 and the narrowing-down processing unit 522 may be activated alternately to perform their respective processing, or the narrowing-down processing unit 522 may perform the narrowing-down processing after all the processing by the correlation rule creation processing unit 520 is completed.
[0037]
In addition to the above, the narrowing-down processing unit 522 can also narrow down the correlation rules $\{W_1 \sim W_r;\ RW_1 \sim RW_s\}$ by either of the following methods. One is to calculate the ratio of the sentence data including $\{W_1 \sim W_r, RW_1 \sim RW_s\}$ to all the sentence data (this ratio is also referred to as the support rate) and select the correlation rules whose support rate exceeds a predetermined value. The other is to calculate the proportion of sentence data containing $\{RW_1 \sim RW_s\}$ among the sentence data containing $\{W_1 \sim W_r\}$ (posterior confidence) and the ratio of the sentence data including $\{RW_1 \sim RW_s\}$ to all the sentence data (prior confidence), and select the correlation rules for which the difference between the prior confidence and the posterior confidence exceeds a predetermined value.
[0038]
That is, when the correlation rule is expressed as $\{W_1 \sim W_r;\ RW_1 \sim RW_s\}$, the support rate is [(support rate (%)) = 100 × (number of sentence data including $W_1 \sim W_r$, $RW_1 \sim RW_s$) / (total number of sentence data)].
In addition, the prior confidence is [(prior confidence (%)) = 100 × (number of sentence data including $RW_1 \sim RW_s$) / (total number of sentence data)].
In addition, the posterior confidence is [(posterior confidence (%)) = 100 × (number of sentence data including $RW_1 \sim RW_s$, $W_1 \sim W_r$) / (number of sentence data including $W_1 \sim W_r$)].
When the correlation rule is expressed as {A, B; C}, the confidence is [(confidence (%)) = 100 × (support of A, B, C) / (support of A, B)].
Therefore, the prior confidence of the correlation rule {A, B; C} is equal to the confidence of the correlation rule {φ; C}, that is, [(confidence (%)) = 100 × (support of C) / (100)].
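The three measures can be computed as in the following Python sketch. It is a minimal illustration that assumes each sentence is represented as a set of related words; the function names and the toy data are hypothetical, and the thresholds passed to `keep_rule` are arbitrary values chosen for the example.

```python
def count_containing(sentences, items):
    """Number of sentences (sets of related words) containing all items."""
    return sum(1 for s in sentences if items <= s)

def rule_measures(sentences, antecedent, consequent):
    """Return (support, prior confidence, posterior confidence) in percent."""
    total = len(sentences)
    n_w = count_containing(sentences, antecedent)
    n_both = count_containing(sentences, antecedent | consequent)
    n_rw = count_containing(sentences, consequent)
    support = 100.0 * n_both / total
    prior = 100.0 * n_rw / total
    posterior = 100.0 * n_both / n_w if n_w else 0.0
    return support, prior, posterior

def keep_rule(measures, min_support, min_diff):
    """Keep a rule whose support and confidence difference exceed thresholds."""
    support, prior, posterior = measures
    return support > min_support and (posterior - prior) > min_diff

sentences = [{"a", "b"}, {"a", "b"}, {"a"}, {"b"}, {"c"}]
m = rule_measures(sentences, {"a"}, {"b"})
print(m, keep_rule(m, 30, 5), keep_rule(m, 30, 15))
```

For the toy data, two of five sentences contain both words (support 40%), three of five contain "b" (prior confidence 60%), and two of the three sentences containing "a" also contain "b" (posterior confidence about 66.7%), so the rule passes a difference threshold of 5 points but not one of 15 points.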
[0039]
The UI / process control unit 548 displays an image for a user interface (UI image) on the display / input device 16 (FIG. 2), accepts user operations on the UI image, and outputs them to each component of the clustering program 5.
Further, the UI / processing control unit 548 controls the entire processing of the clustering program 5 according to a user operation or the like.
[0040]
The meaning / classification DB 546 stores knowledge used in the processing by the meaning processing unit 540 and the classification processing unit 544, for example, information used for associating the correlation rules stored in the correlation rule DB 524 with meanings, information used for grouping correlation rules under a superordinate concept and giving it a meaning, and information used for classifying the meanings of the correlation rules, and provides it to the meaning processing unit 540 and the classification processing unit 544.
[0041]
The meaning processing unit 540 reads the correlation rules from the correlation rule DB 524, refers to the information stored in the meaning / classification DB 546, creates a meaning corresponding to each correlation rule, a meaning regarded as a superordinate concept of correlation rules, or either of these, and stores it in the clustering rule DB 542.
As shown by the dotted line in FIG. 3, the meaning processing unit 540 may display, via the UI / process control unit 548, the correlation rules and a UI image requesting the user to give a meaning to each of them on the display / input device 16 (FIG. 2), and create the meaning of each correlation rule based on the user's operation on the UI image.
[0042]
FIG. 7 is a diagram illustrating, in a table form, classes classified by the classification processing unit 544 illustrated in FIG. 3 and text data included in the classes.
As shown in FIG. 7, the classification processing unit 544 reads the meaning of each of the correlation rules stored in the clustering rule DB 542, refers to the information stored in the meaning / classification DB 546, classifies the meanings of the read correlation rules, and creates the clustering rule A.
The classification processing unit 544 stores the created clustering rule A in the clustering rule DB 542.
Note that, as indicated by the dotted line in FIG. 3, the classification processing unit 544 may, like the meaning processing unit 540, display the meaning of each correlation rule and a UI image requesting the user to classify those meanings on the display / input device 16 (FIG. 2) via the UI / processing control unit 548, and classify the meanings of the correlation rules based on the user's operation on the UI image.
[0043]
In FIG. 7, the classification processing unit 544 classifies the meanings of the correlation rules and, as shown in the first and second columns from the left, creates clustering rules A such as “C01; Difference between HTTP and FTP for file download” and “C02; purchase version upgrade”. Each of these clustering rules A includes the meanings (not shown) of one or more correlation rules.
The third column from the left in FIG. 7 shows the identifier (ID) of each text data included in the class obtained as a result of the clustering according to the clustering rule A.
[0044]
FIG. 8 is a diagram schematically illustrating a class obtained by clustering the set of text data shown in FIG. 6A.
As shown in FIG. 8, the clustering processing unit 560 processes the text data and the words stored in the text / word DB 506 based on the clustering rule A stored in the clustering rule DB 542, classifies them into groups (classes), and stores them in the class DB 562.
Note that the clustering processing unit 560 uses, for example, the meanings of the one or more correlation rules included in each of the clustering rules A shown in FIG. 7, and if text data matches any one of these, classifies the text data into the class corresponding to that clustering rule A.
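Classification by "matching any one" of the correlation rules belonging to a clustering rule A can be sketched as below. The representation is an assumption made for illustration: each correlation rule is reduced to a set of related words that must all occur in the text, and the class IDs and related words in the example are hypothetical.

```python
def classify(text_related_words, clustering_rules):
    """Return the IDs of all classes whose clustering rule the text matches.

    clustering_rules: dict mapping a class ID to a list of correlation rules,
    each given as a set of related words that must all occur in the text.
    A text matches a class if it satisfies any one of that class's rules.
    """
    return [
        class_id
        for class_id, rules in clustering_rules.items()
        if any(rule <= text_related_words for rule in rules)
    ]

clustering_rules = {
    "C01": [{"file->download", "HTTP->use"}, {"file->download", "FTP->use"}],
    "C02": [{"version upgrade->purchase"}],
}
matched = classify({"file->download", "FTP->use"}, clustering_rules)
unmatched = classify({"printer->buy"}, clustering_rules)
print(matched, unmatched)
```

Because the function returns a list, a text may fall into several classes or into none, which is consistent with the classification behavior described for FIG. 8.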
[0045]
FIG. 9 is a diagram showing how newly input text data Qnew is classified into the existing classes shown in FIG. 8.
Further, as described later, after the clustering rule creation unit 566 creates the clustering rule B (second classification rule) from each of the existing classes, the clustering processing unit 560, as shown in FIG. 9, classifies text data Qnew that is newly input from the operator terminal 32 (FIG. 1), processed by the segmentation processing unit 504, and stored in the text / word DB 506 into an existing class based on the clustering rule B stored in the class DB 562, and stores it in the class DB 562.
[0046]
As shown in FIG. 8, clustering based on the clustering rules A and B does not necessarily classify each text data into a single class; some text data are classified into a plurality of classes, and some are not classified into any class.
Further, the clustering processing unit 560 may perform the clustering process using the clustering rule A without using the clustering rule B according to the contents and properties of the data.
[0047]
The class distribution unit 564 reads out the text data belonging to a class stored in the class DB 562 in response to a request from the other department system 220 or a user operation on the UI / processing control unit 548, and distributes it to the other department system 220 via the network 22.
[0048]
As described above, the clustering rule creation unit 566 is selectively added to the clustering processing unit 56 (FIG. 3) of the clustering program 5 as necessary, and performs the following processing.
FIG. 10 is a diagram illustrating a clustering rule B created by the clustering rule creation unit 566 shown in FIG.
FIG. 10 shows two clustering rules B.
[0049]
The clustering rule creation unit 566 is, for example, the above-described software (Aleph) for extracting features from a set of texts. It extracts the features of the text data included in each of the classes (FIG. 8 and the like) obtained based on the clustering rule A created by the classification processing unit 544, creates the clustering rule B, which is used to classify new text data into one of the existing classes when it is input from the operator terminal 32 to the clustering device 4 (clustering program 5), and stores it in the clustering rule DB 542.
Note that the clustering rules A and B stored in the clustering rule DB 542 can be output to the recording medium 140 or the like as appropriate and provided for clustering processing in another apparatus that performs processing similar to that of the clustering device 4 (FIG. 1 and the like).
[0050]
When a plurality of clustering rules B are created for a certain class, the text data included in the class matches zero or more of the plurality of clustering rules B.
To create the clustering rule B, the text data included in each of the classes obtained based on the clustering rule A may be used as it is, or text data that the user has appropriately selected from the text data of those classes may be used.
To create the clustering rule B, text data appropriately selected from text data that does not belong to any of the classes obtained based on the clustering rule A may be used.
[0051]
In FIG. 10, “has_w (Sentence, Word)” indicates that the sentence “Sentence” includes the word “Word”.
“Label (Word, “LABEL”)” indicates that the actual character string representing the word “Word” is “LABEL”.
Further, “word_distance (word1, word2, near / close / middle / far)” indicates that the distance between the word “word1” and the word “word2” is “close”, “very close”, “intermediate”, or “far”, respectively.
"Dependence (A, B)" indicates a grammatical dependency relationship, and "part (A, 'noun-adjective verb stem')" indicates part of speech information.
“Class (Sentence, Class)” indicates that the sentence “Sentence” belongs to the class “Class”.
Also, the clustering rule B means that when all the elements described on the right side of ":-" are satisfied, the elements described on the left side of ":-" are satisfied.
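How a rule of this form is applied can be sketched in Python as follows. This is a simplified illustration written for this explanation: it does not perform the variable binding of a logic programming system, the sentence representation is an assumption, and the helper names `has_word_labeled`, `depends`, and `apply_rule` as well as the example rule are hypothetical and are not taken from FIG. 10.

```python
def has_word_labeled(label):
    """Condition: the sentence contains a word whose character string is `label`.

    Roughly combines has_w(Sentence, Word) with the label predicate.
    """
    return lambda sentence: label in sentence["words"]

def depends(word, receiver):
    """Condition: `word` depends on `receiver` in the sentence."""
    return lambda sentence: (word, receiver) in sentence["dependencies"]

def apply_rule(rule, sentence):
    """Return the head class if every body condition holds, else None."""
    head_class, body = rule
    return head_class if all(cond(sentence) for cond in body) else None

# Head "C02" holds when all conditions on the right side of ":-" hold.
rule_b = ("C02", [has_word_labeled("upgrade"), depends("upgrade", "purchase")])
sentence = {"words": ["upgrade", "purchase"],
            "dependencies": {("upgrade", "purchase")}}
print(apply_rule(rule_b, sentence))
```

The list of conditions plays the role of the right side of ":-", and the returned class plays the role of the left side.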
[0052]
FIG. 11 is a diagram illustrating a positive example used by the clustering rule creation unit 566 shown in FIG. 3 to create the clustering rule B (FIG. 10).
FIG. 12 is a diagram illustrating a negative example used by the clustering rule creation unit 566 shown in FIG. 3 to create the clustering rule B (FIG. 10).
When creation of the clustering rule B from a certain class is instructed from the UI / processing control unit 548 or the like, positive examples indicating texts that belong to the class (or cluster) whose features are to be extracted, negative examples indicating texts that do not belong to that class (or cluster) (FIGS. 11 and 12), and background knowledge (not shown) for extracting the features are set in the clustering rule creation unit 566. The clustering rule creation unit 566 extracts the features of each class using these pieces of information and sets them as the clustering rule B shown in FIG. 10.
[0053]
[Operation of Clustering Device 4 (Clustering Program 5)]
Hereinafter, the operation of the clustering device 4 (clustering program 5) will be described.
When the text data of the text of the consultation / inquiry is input to the operator terminal 32 (FIGS. 1 and 2), the operator terminal 32 outputs the input text data to the clustering device 4.
The text data input from the operator terminal 32 to the clustering device 4 is sequentially stored in the text DB 502 by the text receiving unit 500 (FIG. 3).
[0054]
FIG. 13 is a flowchart showing the operation when the clustering rule creation unit 566 is not used in the clustering program 5 shown in FIG. 3.
As shown in FIG. 13, in step 100 (S100), the clustering program 5 (FIG. 3) is activated on the clustering device 4 (FIGS. 1 and 2).
In step 102 (S102), the UI / processing control unit 548 (FIG. 3) searches the clustering rule DB 542 to determine whether the clustering rule A has already been created.
If the clustering rule A already exists, the clustering program 5 proceeds to the process of S110. Otherwise, it proceeds to the process (S12) of creating the clustering rule A described with reference to FIGS. 6 and 7, starting from S120.
[0055]
In step 120 (S120), the segmentation processing unit 504 of the clustering program 5 (FIG. 3) performs segmentation processing on the text data, extracts words and identifies their parts of speech, and stores the result in the text / word DB 506.
In the following steps, each component creates the clustering rule A as described with reference to FIGS. 6 and 7.
[0056]
In step 122 (S122), the association processing unit 508 creates a related word based on the dependency relation of the words stored in the text / word DB 506, and stores the related word in the text / word DB 506.
In step 124 (S124), the correlation rule creation processing unit 520 obtains the correlation between the related words stored in the text / word DB 506 and stores the correlation in the correlation rule DB 524.
[0057]
In step 126 (S126), the narrowing-down processing unit 522 narrows down the correlation stored in the correlation rule DB 524 and sets it as the clustering rule A.
[0058]
In step 104 (S104), the clustering processing unit 560 (FIG. 3) clusters the text stored in the text / word DB 506 using the clustering rule A created by the processing of step 12 (S12), as described with reference to FIG. 8 and the related drawings, to create classes.
As described above, the text data included in the classes created according to the clustering rule A may be appropriately selected by the user for use in the process of S106.
[0059]
In step 106 (S106), the clustering rule creation unit 566 (FIG. 3) extracts the features of each class created by the process of S104 and, as described with reference to FIG. 10 and the related drawings, creates the clustering rule B, which is used in the clustering process of the clustering processing unit 560 to determine into which of the existing classes a new text should be classified when it is input from the operator terminal 32.
[0060]
In step 110 (S110), the UI / process control unit 548 (FIG. 3) determines whether new text data has been input from the operator terminal 32.
The clustering program 5 proceeds to the process of S112 when new text data is input, and ends the process otherwise.
[0061]
In step 112 (S112), the clustering processing unit 560 (FIG. 3) classifies the newly input text data into an existing class using the clustering rule A.
[0062]
FIG. 14 is a flowchart showing a process (S14) when the clustering rule creation unit 566 is used in the clustering program 5 shown in FIG.
As shown in FIG. 14, in step 140 (S140), the clustering program 5 (FIG. 3) is activated on the clustering device 4 (FIGS. 1 and 2).
[0063]
In step 142 (S142), the UI / processing control unit 548 (FIG. 3) searches the clustering rule DB 542 to determine whether the clustering rule B has already been created.
If the clustering rule B (second classification rule) already exists, the clustering program 5 proceeds to the processing of S150. Otherwise, it proceeds to the processing (S12) of creating the clustering rule A described with reference to FIGS. 6, 7, and 13.
[0064]
In step 144 (S144), the clustering processing unit 560 (FIG. 3) clusters the text stored in the text / word DB 506 using the clustering rule A created by the processing of step 12 (S12), as described with reference to FIG. 8 and the related drawings, to create classes.
As described above, the text data included in the classes created according to the clustering rule A may be appropriately selected by the user for use in the process of S146.
[0065]
In step 146 (S146), the clustering rule creation unit 566 (FIG. 3) extracts the features of each class created by the process of S144 and, as described with reference to FIG. 10 and the related drawings, creates the clustering rule B, which is used in the clustering process of the clustering processing unit 560 to determine into which of the existing classes a new text should be classified when it is input from the operator terminal 32.
[0066]
In step 150 (S150), the UI / process control unit 548 (FIG. 3) determines whether new text data has been input from the operator terminal 32.
The clustering program 5 proceeds to the process of S152 when new text data is input, and ends the process otherwise.
[0067]
In step 152 (S152), the clustering processing unit 560 (FIG. 3) classifies the newly input text data into an existing class using the clustering rule B.
The classes (FIGS. 10 and 9) created as described above with reference to FIGS. 13 and 14 are distributed as appropriate by the class distribution unit 564 to the other department system 220 (FIGS. 1 and 2) via the network 22.
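The flow of FIG. 14 can be summarized in the following Python sketch. It is only an outline of the control flow under stated assumptions: the rule database is modeled as a dictionary, and the four callables passed to the hypothetical function `run_clustering` are placeholders standing in for the processing units, not implementations of them.

```python
def run_clustering(rule_db, new_texts, create_rule_a, cluster,
                   create_rule_b, classify):
    """Control flow loosely following S140 to S152.

    rule_db: dict that may already hold "B" (clustering rule B).
    The four callables are placeholders for the processing units.
    """
    if "B" not in rule_db:                      # S142: rule B not yet created
        rule_db["A"] = create_rule_a()          # S12 (S120 to S126)
        classes = cluster(rule_db["A"])         # S144
        rule_db["B"] = create_rule_b(classes)   # S146
    # S150 / S152: classify any newly input text data using rule B
    return [classify(text, rule_db["B"]) for text in new_texts]

db = {}
result = run_clustering(
    db, ["new text"],
    create_rule_a=lambda: "rule A",
    cluster=lambda rule_a: ["class 1"],
    create_rule_b=lambda classes: "rule B",
    classify=lambda text, rule_b: (text, rule_b),
)
print(result, db)
```

On a second call with the same `db`, the rule creation steps are skipped because the clustering rule B already exists, which mirrors the branch at S142.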
[0068]
[Example]
Hereinafter, as shown in FIG. 13, the clustering device 4 of the data mining system 1 (FIGS. 1 and 2) does not use the clustering rule creation unit 566 (FIG. 3) of the clustering program 5 and uses the clustering rule A to generate text data. An embodiment in the case of classifying will be described.
FIG. 15 is a chart comparing the case where the clustering program 5 (FIG. 3) classifies text data into classes as shown in FIG. 13 according to the clustering rule A generated from related words using the association processing unit 508 with the case where text data is classified into classes according to a clustering rule A generated from words without using the association processing unit 508.
[0069]
As described above, in the clustering program 5, when the association processing unit 508 is used, the words included in the text that are in a dependency relation are associated with each other, the correlation rule creation processing unit 520 extracts the correlation between the related words, and, based on this correlation, the narrowing-down processing unit 522, the meaning processing unit 540, and the classification processing unit 544 create the clustering rule A.
On the other hand, in the clustering program 5, even when the association processing unit 508 is not used, the correlation rule creation processing unit 520 can extract the correlation between the words themselves included in the text, and the classification processing unit 544 and the like can create a clustering rule A based on this word correlation.
FIG. 15 shows, side by side, the result of classifying the text data into classes according to the clustering rule A generated using the association processing unit 508 and the result of classifying the text data according to the clustering rule A generated without using the association processing unit 508.
[0070]
In the example shown in FIG. 15, 602 pieces of text data (hereinafter also referred to as “source data”) indicating actual inquiries received at the call center of a certain company from April 1, 2002 to July 31, 2002 were used.
Further, in this example, the software shown above as specific examples was used for the segmentation processing unit 504 and the like.
Further, the segmentation processing unit 504 treated names such as those of representative personal computer OSs as single words and referred to a list of IT terms so that the words obtained as a result of the segmentation would not be split too finely.
One text data to be processed contains 12.5 words on average, and the total number of words contained in 602 pieces of text data is 7517 words. Different related words were obtained.
[0071]
In FIG. 15, the minimum support is [(minimum support (%)) = 100 × (number of sentence data including $W_1 \sim W_r$) / (total number of sentence data)].
The prior / posterior confidence difference indicates the difference between the prior confidence and the posterior confidence (see the description of the narrowing-down processing unit 522).
The smaller the minimum support and the prior / posterior confidence difference, the smaller the amount of text data from which the clustering program 5 can create the clustering rule A. It can thus be seen that, when the association processing unit 508 is used, the clustering rule A can be derived from a smaller number of text data than when it is not used.
[0072]
In either case, the number of derived rules (clustering rules A) does not differ much. However, the number and ratio of clustering rules A that an expert, comparing the source data with the derived clustering rules A, judged to be semantically interpretable increase significantly when the association processing unit 508 is used.
Likewise, the number of rules (clustering rules A) judged by the expert to be highly useful does not differ much in either case, but the average number of text data used to obtain them is significantly smaller when the association processing unit 508 is used. This shows that using the association processing unit 508 makes it possible to obtain useful clustering rules A from a small amount of source data.
[0073]
FIG. 16 is a table showing, in association with each other, the difference between the prior confidence and the posterior confidence set for the narrowing-down processing unit 522 (FIG. 3) and the result of classifying texts using the association processing unit 508 and the clustering rule A as shown in FIG. 13.
As can be seen from FIG. 16, the clustering program 5 obtains good text data classification results when the narrowing-down processing unit 522 narrows down the correlation rules so that the difference between the prior confidence and the posterior confidence is about 15%.
[0074]
FIG. 17 is a table showing, in association with each other, the minimum confidence set for the narrowing-down processing unit 522 (FIG. 3) and the result of classifying the text using the clustering rule A as shown in FIG. 13 without using the association processing unit 508.
The definition of the minimum confidence shown in FIG. 17 is the same as the definition of the posterior confidence described above, and the minimum confidence is set for the narrowing-down processing unit 522.
As shown in FIG. 17, when the text is classified using the clustering rule A without using the association processing unit 508, changing the minimum confidence set for the narrowing-down processing unit 522 does not change the number of highly useful rules among the resulting rules (clustering rules A).
[0075]
FIG. 18 is a diagram illustrating the accuracy and recall of the result of classifying the text using the clustering rule A as illustrated in FIG. 13 using the association processing unit 508.
The accuracy (precision) is the proportion, among the texts included in each obtained class, of texts that the expert judged appropriate for inclusion in that class.
The recall is the proportion, among the texts that should be included in each obtained class, of texts that are actually included in that class.
Referring to FIG. 18, it is understood that when the text data is classified into classes as shown in FIG. 13 using the association processing unit 508, high accuracy and recall can be obtained with some exceptions.
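The two measures can be computed per class as in the following minimal sketch. The function name `precision_recall` and the sample text IDs are hypothetical; the expert's judgment is represented simply as the set of texts that should belong to the class.

```python
def precision_recall(assigned, relevant):
    """Compute accuracy (precision) and recall for one class.

    assigned: set of text IDs the classifier placed in the class.
    relevant: set of text IDs the expert judges should be in the class.
    """
    hits = len(assigned & relevant)
    # Precision: share of the assigned texts that are appropriate.
    precision = hits / len(assigned) if assigned else 0.0
    # Recall: share of the texts that should be there that were found.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical class: 4 texts assigned, 5 texts should have been included,
# and 3 texts are in both sets.
p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6})
```

In this example three of the four assigned texts are appropriate (precision 0.75) and three of the five texts that should be in the class were found (recall 0.6).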
[0076]
[Related application]
The present invention relates to Japanese Patent Application No. 2002-366690 filed by the present applicant.
[0077]
[Effects of the invention]
As described above, according to the sentence classification apparatus and the method thereof according to the present invention, widely collected sentences can be classified into groups including useful information and provided to users.
Further, according to the sentence classification device and the method thereof according to the present invention, it is possible to extract a correlation between a plurality of words having a dependency relationship, and further perform data mining by narrowing down the correlation.
[Brief description of the drawings]
FIG. 1 is a diagram showing a configuration of a data mining system to which a sentence data classification method according to the present invention is applied.
FIG. 2 is a diagram illustrating an example of a hardware configuration of an operator terminal, a clustering device, and a system of another department illustrated in FIG. 1.
FIG. 3 is a diagram showing a configuration of a clustering program executed in the clustering device shown in FIGS. 1 and 2.
FIG. 4 is a diagram illustrating a phrase structure of a sentence.
FIG. 5 is a diagram illustrating a dependency structure of words included in a sentence.
FIG. 6 is a diagram illustrating the processing of the pre-processing unit of the clustering program illustrated in FIG. 3, in which (A) schematically illustrates a set of text data Q stored in the text DB, (B) illustrates the contents of the text data Q, and (C) shows the words (separation result) obtained by the text processing unit processing the text data.
FIG. 7 is a diagram illustrating, in table form, the classes classified by the classification processing unit illustrated in FIG. 3, the number of correlation rules included in each class, and the text data included in the classes.
FIG. 8 is a diagram schematically illustrating a class obtained by clustering a set of text data shown in FIG. 6A.
FIG. 9 is a diagram illustrating a mode in which newly input text data Qnew is classified into the existing class shown in FIG. 8;
FIG. 10 is a diagram illustrating a clustering rule B created by a clustering rule creation unit 566 shown in FIG. 3;
FIG. 11 is a diagram illustrating a positive example used by the clustering rule creation unit shown in FIG. 3 to create a clustering rule B (FIG. 10).
FIG. 12 is a diagram illustrating a negative example used by the clustering rule creation unit shown in FIG. 3 to create a clustering rule B (FIG. 10).
FIG. 13 is a flowchart showing an operation when a clustering rule creation unit is not used in the clustering program shown in FIG. 3;
FIG. 14 is a flowchart showing a process (S14) when a clustering rule creating unit is used in the clustering program shown in FIG. 3;
FIG. 15 is a chart comparing a case where text data is classified into classes according to a clustering rule A generated without using the association processing unit in the clustering device (FIG. 3) with a case where text data is classified into classes according to a clustering rule A generated using the association processing unit.
FIG. 16 is a table showing the difference between the prior certainty factor and the posterior certainty factor set for the narrowing-down processing unit (FIG. 3) in comparison with the results of classifying text with the clustering rule A, as shown in FIG. 13, using the association processing unit.
FIG. 17 is a chart showing the minimum certainty factor set for the narrowing-down processing unit (FIG. 3) in comparison with the results of classifying text using the clustering rule A, as shown in FIG. 13, without using the association processing unit.
FIG. 18 is a diagram showing the accuracy and recall of the result of classifying text using the clustering rule A as shown in FIG. 13 using the association processing unit.
[Explanation of symbols]
1 ... data mining system,
3 ... call center,
30 ... call accepting device,
32 ... operator terminal,
34 ... LAN,
4 ... clustering device,
5 ... clustering program,
50 ... pre-processing unit,
500 ... text receiving unit,
502 ... text DB,
504 ... separation processing unit,
506 ... text / word DB,
508 ... association processing unit,
52 ... correlation rule creation unit,
520 ... correlation rule creation processing unit,
522 ... narrowing-down processing unit,
524 ... correlation rule DB,
54 ... meaning setting processing unit,
540 ... meaning processing unit,
542 ... clustering rule DB,
546 ... meaning / classification DB,
544 ... classification processing unit,
548 ... UI / process control unit,
56 ... clustering processing unit,
560 ... clustering processing unit,
562 ... class DB,
564 ... class distribution unit,
566 ... clustering rule creation unit,
22 ... network,
220 ... other department system,
10 ... body,
102 ... CPU,
104 ... memory,
12 ... communication device,
14 ... recording device,
140 ... recording medium,
16 ... display / input device,
20 ... telephone network,
200, 202 ... telephone

Claims

A sentence classification device for classifying a plurality of sentences, each including a plurality of words, into zero or more groups, comprising:
word extraction means for extracting the plurality of words included in each of the plurality of sentences;
word classification means for classifying the plurality of words included in each of the sentences into related words, each including two or more related words; and
classification rule creating means for creating a classification rule for classifying the plurality of sentences into the zero or more groups based on the correlation between the classified related words.

The sentence classification device according to claim 1, further comprising sentence classification means for classifying the plurality of sentences into the zero or more groups based on the created classification rule.

The sentence classification device according to claim 2, wherein the sentence classification means, after the classification rule is created, classifies a sentence to be newly classified into the zero or more groups based on the already created classification rule.

The sentence classification device according to any one of claims 1 to 3, wherein the correlation rule creation means comprises:
combination creating means for creating combinations of related words obtained from the words of the same sentence; and
combination processing means for processing the created combinations so as to meet predetermined conditions.

The sentence classification device according to claim 4, wherein the combination indicates that, when a sentence includes one or more first related words, the same sentence includes one or more other, second related words.

The sentence classification apparatus according to claim 4, wherein the combination processing unit performs a process of selecting a combination that matches a predetermined ratio or more or a predetermined number or more of the sentences from the created combinations.

The sentence classification device according to claim 4 or 5, wherein the combination processing means performs a process of selecting a combination in which the ratio of sentences containing the second related word among the sentences containing the first related word is equal to or greater than a predetermined ratio, or a combination in which the difference between the ratio of sentences containing the second related word among the sentences containing the first related word and the ratio of sentences containing the second related word is equal to or greater than a predetermined value.

The sentence classification device according to any of the above claims, wherein the word extraction means further identifies a part of speech of each of the plurality of words, and
the word classification means classifies the plurality of words included in each of the sentences into related words, each including two or more related words, based on the identified part of speech of each of the words.

A sentence classification method for classifying a plurality of sentences, each including a plurality of words, into zero or more groups, the method comprising:
extracting the plurality of words included in each of the plurality of sentences;
classifying the plurality of words included in each of the sentences into related words, each including two or more related words; and
creating a classification rule for classifying the plurality of sentences into the zero or more groups based on the correlation between the classified related words.

The sentence classification method according to claim 9, wherein the plurality of sentences are classified into the zero or more groups based on the created classification rule.

A program for classifying a plurality of sentences, each including a plurality of words, into zero or more groups, the program causing a computer to execute the steps of:
extracting the plurality of words included in each of the plurality of sentences, and identifying the part of speech of each of the extracted words;
classifying the plurality of words included in each of the sentences into related words, each including two or more related words, based on the part of speech of each of the extracted words; and
creating a classification rule for classifying the plurality of sentences into the zero or more groups based on the correlation between the classified related words.

The program according to claim 11, wherein the program causes a computer to execute a step of classifying the plurality of sentences into the zero or more groups based on the created classification rule.