JP3583631B2

JP3583631B2 - Information mining method, information mining device, and computer-readable recording medium recording information mining program

Info

Publication number: JP3583631B2
Application number: JP34430998A
Authority: JP
Inventors: 洋一藤井; 修森口; 克志鈴木
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1998-12-03
Filing date: 1998-12-03
Publication date: 2004-11-04
Anticipated expiration: 2018-12-03
Also published as: JP2000172691A

Description

【０００１】
【発明の属する技術分野】
本発明は、コンピュータ上に蓄積される電子化されたテキスト、例えば、ヘルプデスク業務のように顧客からの様々な問い合わせと回答内容のようなテキストが蓄積されるテキストから、営業、マニュアル作成、Ｑ＆Ａ作成等に有効となる情報をマイニングする方法等に関するものである。
【０００２】
【従来の技術】
従来のテキストデータからのマイニングは、「ＶｅｘｔＳｅａｒｃｈ」（（株）コマツソフト製品：日経エレクトロニクス１９９７．１２．１５（Ｎｏ．７０５）ｐｐ．６３‐７０および´９７年１０月時カタログ）に代表される。「ＶｅｘｔＳｅａｒｃｈ」は蓄積されたテキストから名詞、動詞、形容詞、副詞と接頭辞の一部といった単語を抽出し、助詞や助動詞を取り除いて、そのテキスト中の単語の出現頻度からテキストをベクトル（以下、文書ベクトルと呼ぶ）で表現する。そして、２つの文書ベクトルの内積値を類似度として定義する。その上で、あらかじめ与えられたサンプルテキストのどれに近いかで自動分類したり、分類数を指定して分類対象のテキストをクラスタリングすることで、テキストをグループ化してテキスト集合の特徴を捉えていた。
【０００３】
【発明が解決しようとする課題】
以上のように、従来の情報マイニング方法においては、テキスト中から抽出した単語の出現頻度に基づきテキストを一つの固まりと考えてテキスト間の類似度を計算し、類似度によって分類を行なっていたので、「コンピュータのプリンタに関する内容」といったレベルでのグループ化しかできない。従って、「プリンタの電源が入らないので印刷ができない」と、「プリンタの電源は入るが印刷できない」といった内容的には異なるが使われている単語が同じものを区別することはできないという問題があった。よって、システム管理者は、プリンタに関する問い合わせが多いことは分析できても、プリンタのどういった現象に対する問い合わせが多いかを分析することはできず、顧客からの大量の問い合わせの中から、優先的に対応すべき具体的な問題を抽出する事ができないといった問題があった。
【０００４】
本発明は、以上の事情を考慮してなされたもので、事例データベースのように、日々蓄積されるテキストデータの中から具体的内容に基づき有効な情報をマイニングして取り出すことで、事例データベースシステム管理者が顧客からの問い合わせを減少させるための製品マニュアルの充実や、Ｑ＆Ａの事例の充実を図ったり、事例データベースが対象とする製品の製品開発者が優先的に対処すべき項目を容易に発見できるようにする情報マイニング方法等を提供することを目的とする。
【０００５】
【課題を解決するための手段】
上記の目的に鑑み、この発明は、蓄積されたテキスト集合から有効な相関情報を見つけだすための情報マイニング方法において、蓄積された各テキストから単語を切り出す単語切り出しステップと、この単語切り出しステップによって切り出した単語の係り受け構造を解析する係り受け解析ステップと、この係り受け解析ステップで係り受け解析された係り受け構造の類似度を判定する文構造類似度判定ステップと、この文構造類似度判定ステップによって判定された値によって文をグループ化するグループ化ステップと、上記単語切り出しステップで切り出した中から特定単語を抽出する特定キーワード抽出ステップと、この特定キーワードとグループ化された文の出現回数を集計するデータ集計ステップと、このデータ集計ステップで集計したデータの相関関係を分析する情報マイニングステップと、相関が強い項目を取り出して表示する結果表示ステップと、を備えたことを特徴とする情報マイニング方法にある。
【０００６】
またこの発明は、上記係り受け解析ステップにおいて、日本語のテキストの場合、助詞などの情報が欠落した単語を最も近い用言に係るように処理することを特徴とする情報マイニング方法にある。
【０００７】
またこの発明、上記単語切り出しステップの結果に対して、重要文を抽出する重要文抽出ステップをさらに備え、この重要文抽出ステップによって抽出した文のみを上記係り受け解析ステップで処理することを特徴とする情報マイニング方法にある。
【０００８】
またこの発明は、上記重要文抽出ステップにおいて、テキスト中に高頻度で出現するキーワードを含む文を抽出対象とすることを特徴とする情報マイニング方法にある。
【０００９】
またこの発明は、上記重要文抽出ステップにおいて、特定のパターンにマッチする表現が出現した文を抽出対象とすることを特徴とする情報マイニング方法にある。
【００１０】
またこの発明は、上記類似文判定ステップにおいて、シソーラス辞書を使い単語の関連度を元に類似度を判定することを特徴とする情報マイニング方法にある。
【００１１】
またこの発明は、上記特定キーワード抽出ステップにおいて、マニュアル等の目次見出しを特定キーワードとすることを特徴とする情報マイニング方法にある。
【００１２】
またこの発明は、上記特定キーワード抽出ステップにおいて、製品のファミリーツリーの部品名を特定キーワードとすることを特徴とする情報マイニング方法にある。
【００１３】
またこの発明は、上記情報マイニングステップにおいて、グループ化された文を１つの軸とし、特定キーワードをもう１つの軸としてカイ二乗統計によって特異点を見つけ出すことを特徴とする情報マイニング方法にある。
【００１４】
またこの発明は、上記単語切り出しステップにおいて、構造化されたテキストの特定部分を処理対象とすることを特徴とする情報マイニング方法にある。
【００１５】
またこの発明は、上記結果表示ステップにおいて、上記マイニングステップで評価した結果の値を、２次元平面上で色の濃淡として表示することを特徴とする情報マイニング方法にある。
【００１６】
またこの発明は、蓄積されたテキスト集合から情報を見つけだすための情報マイニング装置において、蓄積された各テキストから単語を切り出す単語切り出し手段と、この単語切り出し手段によって切り出した単語の係り受け構造を解析する係り受け解析手段と、この係り受け解析手段で係り受け解析された係り受け構造の類似度を判定する文構造類似度判定手段と、この文構造類似度判定手段によって判定された値によって文をグループ化するグループ化手段と、上記単語切り出し手段で切り出した中から特定単語を抽出する特定キーワード抽出手段と、この特定キーワードとグループ化された文の出現回数を集計するデータ集計手段と、このデータ集計手段で集計したデータの相関関係を分析する情報マイニング手段と、相関が強い項目を取り出して表示する結果表示手段と、を備えたことを特徴とする情報マイニング装置にある。
【００１７】
またこの発明は、コンピュータによる蓄積されたテキストから情報を見つけだす情報マイニングプログラムを記録したコンピュータ読み取り可能な記録媒体において、蓄積されたテキストから単語を切り出す単語切り出し手順と、この単語切り出し手順によって切り出した単語の係り受け構造を解析する係り受け解析手順と、この係り受け解析手順で係り受け解析された係り受け構造の類似度を判定する文構造類似度判定手順と、この文構造類似度判定手順によって判定された値によって文をグループ化するグループ化手順と、上記単語切り出し手順で切り出した中から特定単語を抽出する特定キーワード抽出手順と、この特定キーワードとグループ化された文の出現回数を集計するデータ集計手順と、このデータ集計手順で集計したデータの相関関係を分析する情報マイニング手順と、相関が強い項目を取り出して表示する結果表示手順と、を含むことを特徴とする情報マイニングプログラムを記録したコンピュータ読み取り可能な記録媒体にある。
【００１８】
【発明の実施の形態】
以下、この発明の実施の形態を図について説明する。図１は、本発明の情報マイニング装置を示す構成図である。１１は、顧客からの問い合わせ事例などを蓄積するテキストＤＢ（データベース）で、顧客からの問い合わせ内容、製品名、それに対する回答内容などを蓄積する。さらに、テキストから形態素解析した形態素情報、重要文抽出によって抽出された重要文情報、文と文との類似度を計算した類似度値等を格納する。１２は、単語辞書で、単語切り出し処理での解析用辞書として使用する。さらに、１３は、各単語間の関係を記述したシソーラス辞書である。これらはデータベース部５１に格納されている。
【００１９】
１は、テキストＤＢ１１に格納されたテキストに対して単語を抽出する単語切り出し手段である。２は、単語切り出し手段１にて切り出したテキストの中から重要文を特定して抽出する重要文抽出手段である。３は、重要文抽出手段２で抽出した重要文に対して係り受け関係を解析する係り受け解析手段である。４は、係り受け解析手段３で解析した係り受け構造と、シソーラス辞書１３の情報を基に文の類似度を計算する文構造類似度判定手段である。
【００２０】
一方、５は、重要文抽出手段２で抽出した重要文に対して指定された特定キーワードを抽出する特定キーワード抽出手段である。６は、文構造類似度判定手段４によって類似度計算された情報を基に、類似する文をグループ化するグループ化手段である。
【００２１】
７は、特定キーワード抽出手段５で抽出対象となった特定キーワードと、グループ化手段６でグループ化した文グループの２つを軸とし、出現頻度を基に集計するデータ集計手段である。８は、データ集計手段７によって集計した出現頻度の表に対して、統計計算によって特徴を抽出する情報マイニング手段である。９は、情報マイニング手段８によって、特徴を抽出した結果、特徴量の大きい項目を表示する結果表示手段である。これらは格納されたプログラムに従って動作するコンピュータ５０により構成される。さらに５２は、表示のための表示器である。
【００２２】
図２は、本発明の情報マイニング装置の動作を示すフローチャート図である。各ステップは、図１の構成図の処理を行うための手段に対応し、１から９が、Ｓ１からＳ９に対応する。
【００２３】
図３は、テキストＤＢ１１に格納されているテキストの例ある。テキストは構造化されており、２１は製品名例、２２は問い合わせ内容例である。
【００２４】
次に動作について説明する。単語切り出し手段１は単語切り出しステップＳ１によってテキストＤＢ１１に格納されたテキストに対して単語の切り出しを行なう。単語の切り出しには、単語辞書１２を使い、一般に文の解析に利用される形態素解析方法を用いることで、文から名詞、動詞、および形容詞などの自立語とその活用形、および助詞、助動詞などの付属語とその活用形などを特定する。分割された形態素の情報は、単語切り出しの対象となったテキストと対応づけて、テキストＤＢ１１に格納する。
【００２５】
図３は、テキストＤＢ１１に格納されているテキストの例を示しており、単純なテキストではなく、文書番号、製品名、問い合わせ、回答といった構造を持ったテキストである。ここでは、問い合わせに関して情報マイニングを行うとして、問い合わせ内容例２２の部分を取り出して、単語切り出しステップＳ１によって単語を切り出しテキストＤＢ１１に格納する。
【００２６】
次に、重要分抽出手段２では、重要分抽出ステップＳ２によって、解析対象のテキスト中から重要な文を抽出して、重要な文に印を付けたテキスト情報をテキストＤＢ１１に格納する。重要文抽出ステップＳ２の処理としては、テキストの抄録作成手段として用いられる統計的手法による方法を用いる。たとえば、１つのテキスト中に多く含まれた自立単語を含む文を指定した割合で抽出することで実現する。
【００２７】
【数１】

【００２８】
式（１）では、Ｗｉがｉ番目の文の重要度を表しており、Ｗｉの値の順に一定の割合の文を重要文として抽出する。
【００２９】
図３の問い合わせ内容例２２では、文が一つしか存在しないので、問い合わせ内容例２２がそのまま重要文となる。
【００３０】
重要文抽出手段２で重要文を選択しテキストＤＢ１１に格納すると、係り受け解析手段３では係り受け解析ステップＳ３で、一般に知られている構文解析処理によってテキスト中の重要文に対して係り受けを抽出し、係り受け構造をテキストＤＢ１１に格納する。この時、主たる用言に対して（テンス、アスペクト、モダリティ）の情報も同様に格納する。
【００３１】
図４は、問い合わせ内容例２２を係り受け解析ステップＳ３で解析した結果を示す係り受け解析例である。
【００３２】
係り受け解析手段３で、テキスト中の重要文に関して係り受け構造を解析したら、文構造類似度判定手段４では、文構造類似度計算ステップＳ４にて、シソーラス辞書１３を利用しながらテキスト中の文の類似度をテキストＤＢ１１に格納されたすべての文に対して計算する。類似度の計算方法として、テキストＤＢ１１に格納されているすべての重要文に関して類似度を単純に計算すると、計算量が非常に多くなるので、あらかじめ係り受け構造を比較する前に、シソーラス辞書１３を利用して、関連する単語を限定する。たとえば、「〜が印刷できない」と、「〜がプリントできない」は、図５のシソーラス辞書上で直接の上位概念を持つので類似度計算の対象とするが、「〜が印刷できない」と「〜が入力できない」は類似度を０とする。
【００３３】
【数２】

【００３４】
（２）式は、構文上で対応する単語の類似度を基に文の類似度を定義したものである。これによって、文として同じ用語が用いられていなくても類似度を計算することができる。
【００３５】
次に、特定キーワード抽出手段５では、特定キーワード抽出ステップＳ５で予め指定されたキーワードとマッチするかどうかを判定し、マッチすればその情報をテキストＤＢ１１に格納する。この時、特定キーワードは、予め製品マニュアルの目次項目（目次見出し）や、製品のファミリーツリーなどから人手、または機械的に部品名等が抽出されているものとする。
【００３６】
図６はプリンタマニュアルの目次から抽出した特定キーワードの例である。問い合わせ内容例２２の文には、「印刷」という単語が存在し、図５のシソーラス辞書上で「プリント」という単語が同義と定義されているので、特定キーワード抽出ステップＳ５によって、「プリント」が問い合わせ内容例２２の特定キーワードとなる。
【００３７】
次に、グループ化手段６では、グループ化ステップＳ６によって、上記文構造類似度判定手段４で計算された類似度に基づき、類似文をグループ化する。この時、類似度を（テンス、アスペクト、モダリティ）の一致するものに限定してグループ化を行なう。グループ化するに当たっては、予め設定した類似度の閾値に基づき、文をグループ化するものとする。
【００３８】
ここで設定する閾値を変更することで、問い合わせ内容を大まかにグループ化するか、細かくグループ化するかを選択することができる。
【００３９】
グループ化手段６によって、グループ化が終了すると、データ集計手段７では、データ集計ステップＳ７で、２次元の表上に頻度集計する。２次元の表で２つの軸のうち１つは、グループ化した文を配置し、もう一つの軸には特定キーワード抽出手段５で抽出した特定キーワードを配置する。
【００４０】
図７はデータ集計手段７で集計するためのテーブルの例で、横軸方向に特定キーワード、縦軸方向にグループ化手段６によってグループ化された文が配置される。問い合わせ内容例２２の文に対しては、特定キーワード「プリント」が対応しているので、３１の位置の頻度をプラス１することになる。
【００４１】
次に情報マイニング手段８では、情報マイニングステップＳ８によって、データ集計ステップＳ７で集計した２次元の表に対して、（３）、（４）の式の適用によってカイ二乗検定による統計的に特異（特徴的）な点を抽出する。
【００４２】
【数３】

【００４３】
上記（４）式のＹｉｊは理論頻度と実際の頻度がどれだけ離れているかを表す値で、この値が大きいほど特徴的に現れたことを示している。
【００４４】
最後に情報マイニング手段８で計算されたＹｉｊに対して、結果表示手段９では、結果表示ステップＳ９に基づき、Ｙｉｊの値が大きなものを順番に、特定キーワード、グループ化された文を代表する文、Ｙｉｊの値の組みを表示器５２に表示する。
【００４５】
図８は、Ｙｉｊの値が大きい順に情報マイニングした結果を表示したもので、プリント（印刷）に関して、「電源が入っているのに印刷ができない」という問い合わせが非常に多く、特徴的であった場合には上位に表示されることを示している。
【００４６】
さらに結果表示手段９では、式（３）で計算された値を色の濃淡で表示することで、利用者は特徴的に現れる問題（たとえば、製品の特定の機能に関して問い合わせが多いといった情報）を全体の中から把握することができる。
【００４７】
図９は、情報マイニング結果を２次元平面上に表示したもので、図８で１位であった項目４１が濃い色で表示されている。
【００４８】
これによって、テキストＤＢ１１に格納されたテキストのうちで、高頻度で現れる内容をシステム管理者に提示することができ、マニュアルの改良や、Ｑ＆Ａ事例の追加を効果的に進めることができる。さらに、特定キーワードを製品のファミリーツリー中の部品名とし、テキスト処理対象をＱ＆Ａ事例のＡに適用することで、特定部品に関する質問が頻発していることから、製品改良へのフィードバックをするために必要となる情報を開発者が入手することが可能となる。
【００４９】
なお、重要文抽出ステップＳ２の処理として、テキスト中に出現する自立語の出現頻度を元に重要文を抽出する処理に換えて、「〜できない」、「〜について知りたい」といった特定の形態素パターンを用意しておき、そのパターンに一致する文を重要文として抽出することもできる。これにより、問い合わせ履歴の分析といった特定の内容に関するＤＢに対しては、統計的手法による重要文抽出より適切な文を選択することが可能となる。
【００５０】
また、係り受け解析処理として、日本語のテキストの場合、一般の構文解析処理に換えて、「プリンタ印刷できない」といった助詞が欠落する文を許容するために助詞が欠落する場合には最も近くの用言に係り受け構造を設定するようにすることもできる。
【００５１】
さらに、特定キーワード抽出処理として、テキスト中に現れる特定キーワードを抽出する方法に換えて、製品名や、部品名などがテキストＤＢ１１中で所定の書誌項目としてあらかじめ分かっている場合には、テキスト中から抽出することなく、所定の書誌情報フィールド（図３の製品名例２１に対応する部分）から取り出しマッチングを取るようにすることもできる。
【００５２】
【発明の効果】
以上のようにこの発明によれば、蓄積されたテキスト集合から有効な相関情報を見つけだすための情報マイニング方法において、蓄積された各テキストから単語を切り出す単語切り出しステップと、この単語切り出しステップによって切り出した単語の係り受け構造を解析する係り受け解析ステップと、この係り受け解析ステップで係り受け解析された係り受け構造の類似度を判定する文構造類似度判定ステップと、この文構造類似度判定ステップによって判定された値によって文をグループ化するグループ化ステップと、上記単語切り出しステップで切り出したの中から特定単語を抽出する特定キーワード抽出ステップと、この特定キーワードとグループ化された文の出現回数を集計するデータ集計ステップと、このデータ集計ステップで集計したデータの相関関係を分析する情報マイニングステップと、相関が強い項目を取り出して表示する結果表示ステップと、を備えたことを特徴とする情報マイニング方法およびこれにを実行する情報マイニング装置、さらには情報マイニングプログラムを記録したコンピュータ読み取り可能な記録媒体を提供する。これにより、事例データベースのように、日々蓄積されるテキストデータの中から有効な情報をマイニングして取り出すことで、事例データベースシステム管理者が顧客からの問い合わせを減少させるための製品マニュアルの充実や、Ｑ＆Ａの事例の充実を図ったり、事例データベースが対象とする製品の製品開発者が優先的に対処すべき項目を容易に発見できるようにするという効果がある。
【００５３】
またこの発明では、上記係り受け解析ステップにおいて、日本語のテキストの場合、助詞などの情報が欠落した単語を最も近い用言に係るように処理することを特徴とするので、より応用力のある情報マイニング方法等が提供できる。
【００５４】
またこの発明で、上記単語切り出しステップの結果に対して、重要文を抽出する重要文抽出ステップをさらに備え、この重要文抽出ステップによって抽出した文のみを上記係り受け解析ステップで処理することを特徴とするので、より効率のよい情報マイニング方法等が提供できる。
【００５５】
またこの発明では、上記重要文抽出ステップにおいて、テキスト中に高頻度で出現するキーワードを含む文を抽出対象とすることを特徴とするので、よい効率のよい情報マイニング方法等が提供できる。
【００５６】
またこの発明では、上記重要文抽出ステップにおいて、特定のパターンにマッチする表現が出現した文を抽出対象とすることを特徴とするので、より効率のよい情報マイニング方法が提供できる。
【００５７】
またこの発明では、上記類似文判定ステップにおいて、シソーラス辞書を使い単語の関連度を元に類似度を判定することを特徴とするので、より効率のよい情報マイニング方法等が提供できる。
【００５８】
またこの発明では、上記特定キーワード抽出ステップにおいて、マニュアル等の目次見出しを特定キーワードとすることを特徴とするので、マニュアル製造等に適したより効率のよい情報マイニング方法等を提供できる。
【００５９】
またこの発明では、上記特定キーワード抽出ステップにおいて、製品のファミリーツリーの部品名を特定キーワードとすることを特徴とするので、製品製造等に適したより効率のよい情報マイニング方法等を提供できる。
【００６０】
またこの発明では、上記情報マイニングステップにおいて、グループ化された文を１つの軸とし、特定キーワードをもう１つの軸としてカイ二乗統計によって特異点を見つけ出すことを特徴とするので、より効率のよい情報マイニング方法等が提供できる。
【００６１】
またこの発明では、上記単語切り出しステップにおいて、構造化されたテキストの特定部分を処理対象とすることを特徴とするので、より効率のよい情報マイニング方法等が提供できる。
【００６２】
またこの発明では、上記結果表示ステップにおいて、上記マイニングステップで評価した結果の値を、２次元平面上で色の濃淡として表示することを特徴とすので、評価結果が分かりやすい情報マイニング方法等を提供できる。
【図面の簡単な説明】
【図１】本発明の情報マイニング装置の構成を示す図である。
【図２】本発明の処理動作を示すフローチャート図である。
【図３】本発明のテキストＤＢに格納されたテキストの例を示す図である。
【図４】本発明の係り受け解析結果の例を示す図である。
【図５】本発明のシソーラス辞書に格納されたデータの例を示す図である。
【図６】本発明における特定キーワードをマニュアルから抽出した例を示す図である。
【図７】本発明における集計テーブルの例を示す図である。
【図８】本発明における分析結果リストの画面の例を示す図である。
【図９】本発明における分析結果リストの２次元濃淡表示の例を示す図である。
【符号の説明】
１単語切り出し手段、２重要文抽出手段、３係り受け解析手段、４文構造類似度判定手段、５特定キーワード抽出手段、６グループ化手段、７データ集計手段、８情報マイニング手段、９結果表示手段、１１テキストＤＢ、１２単語辞書、１３シソーラス辞書、２１製品名例、２２問い合わせ内容例、５０コンピュータ部、５１データベース部、５２表示器。
[0001]
[Technical field of the invention]
The present invention relates to a method and the like for mining information that is useful for sales, manual preparation, Q&A preparation, and similar purposes from digitized text accumulated on a computer, for example text in which various customer inquiries and the corresponding answers accumulate, as in help desk operations.
[0002]
[Prior art]
Conventional mining from text data is typified by "VextSearch" (a product of Komatsu Soft Co., Ltd.; see Nikkei Electronics, 1997.12.15 (No. 705), pp. 63-70, and the catalog as of October 1997). "VextSearch" extracts words such as nouns, verbs, adjectives, adverbs, and some prefixes from the accumulated text, removes particles and auxiliary verbs, and represents each text as a vector (hereinafter called a document vector) based on the frequency of occurrence of the words in that text. The inner product of two document vectors is defined as their similarity. Texts are then either classified automatically according to which of a set of previously given sample texts they are closest to, or clustered into a specified number of classes, so that the texts are grouped and the characteristics of the text set are captured.
[0003]
[Problems to be solved by the invention]
As described above, the conventional information mining method treats each text as a single block, calculates the similarity between texts from the frequency of occurrence of the words extracted from them, and classifies the texts by that similarity. It can therefore group texts only at a level such as "content related to computer printers". As a result, it cannot distinguish texts that differ in content but use the same words, such as "the printer's power does not turn on, so it cannot print" and "the printer's power turns on, but it cannot print". A system administrator can thus determine that there are many inquiries about printers, but cannot determine which printer phenomena those inquiries concern, and so cannot extract, from a large volume of customer inquiries, the specific problems that should be addressed with priority.
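The limitation described above can be illustrated with a minimal sketch. The word lists, the function names `document_vector` and `inner_product`, and the choice of tokens are assumptions made for illustration only; they are not taken from VextSearch. Once particles, auxiliary verbs, and the negation they carry are dropped, the two printer inquiries reduce to the same bag of words, so their document vectors are identical.

```python
from collections import Counter

def document_vector(words):
    """Represent a text as a sparse term-frequency vector."""
    return Counter(words)

def inner_product(u, v):
    """Inner product of two sparse vectors; used here as the similarity."""
    return sum(count * v[term] for term, count in u.items())

# Hypothetical content words for the two printer inquiries, after particles,
# auxiliary verbs, and the negation they carry have been removed.
power_off = document_vector(["printer", "power", "turn_on", "print"])
power_on = document_vector(["printer", "power", "turn_on", "print"])

similarity = inner_product(power_off, power_on)
self_similarity = inner_product(power_off, power_off)
```

Because `similarity` equals `self_similarity`, a frequency-based method of this kind has no way to tell the two inquiries apart.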
[0004]
The present invention has been made in view of the above circumstances. Its object is to provide an information mining method and the like that mine and extract useful information, based on specific content, from text data that accumulates daily, as in a case database. This allows the administrator of a case database system to enrich product manuals and Q&A cases so as to reduce customer inquiries, and allows developers of the products covered by the case database to easily find the items that should be addressed with priority.
[0005]
[Means for Solving the Problems]
In view of the above object, the present invention resides in an information mining method for finding useful correlation information in an accumulated text set, the method comprising: a word extracting step of extracting words from each accumulated text; a dependency analyzing step of analyzing the dependency structure of the words extracted in the word extracting step; a sentence structure similarity determining step of determining the similarity of the dependency structures analyzed in the dependency analyzing step; a grouping step of grouping sentences according to the values determined in the sentence structure similarity determining step; a specific keyword extracting step of extracting specific words from among the words extracted in the word extracting step; a data aggregation step of counting the number of occurrences of the specific keywords and the grouped sentences; an information mining step of analyzing correlations in the data aggregated in the data aggregation step; and a result display step of extracting and displaying items with strong correlation.
[0006]
The present invention also resides in an information mining method characterized in that, in the dependency analyzing step, for Japanese text, a word lacking information such as a particle is treated as depending on the nearest declinable word.
[0007]
The present invention also resides in an information mining method further comprising an important sentence extracting step of extracting important sentences from the result of the word extracting step, wherein only the sentences extracted in the important sentence extracting step are processed in the dependency analyzing step.
[0008]
Further, the present invention is the information mining method, characterized in that in the important sentence extracting step, a sentence including a keyword appearing frequently in a text is to be extracted.
[0009]
The present invention also resides in an information mining method characterized in that in the important sentence extracting step, a sentence in which an expression matching a specific pattern appears is to be extracted.
[0010]
The present invention also resides in an information mining method characterized in that in the similar sentence determination step, similarity is determined based on the relevance of words using a thesaurus dictionary.
[0011]
The present invention also resides in an information mining method, wherein in the specific keyword extracting step, table-of-contents headings of a manual or the like are used as the specific keywords.
[0012]
The present invention also resides in an information mining method, wherein in the specific keyword extracting step, a part name of a family tree of a product is used as a specific keyword.
[0013]
Further, the present invention is the information mining method, characterized in that in the information mining step, a singular point is found by chi-square statistics using the grouped sentences as one axis and a specific keyword as another axis.
[0014]
The present invention also resides in an information mining method, characterized in that in the word extracting step, a specific portion of a structured text is to be processed.
[0015]
Further, the present invention is the information mining method, wherein in the result display step, a value of a result evaluated in the mining step is displayed as a shade of color on a two-dimensional plane.
[0016]
The present invention also resides in an information mining apparatus for finding information in an accumulated text set, the apparatus comprising: word extracting means for extracting words from each accumulated text; dependency analyzing means for analyzing the dependency structure of the words extracted by the word extracting means; sentence structure similarity determining means for determining the similarity of the dependency structures analyzed by the dependency analyzing means; grouping means for grouping sentences according to the values determined by the sentence structure similarity determining means; specific keyword extracting means for extracting specific words from among the words extracted by the word extracting means; data aggregation means for counting the number of occurrences of the specific keywords and the grouped sentences; information mining means for analyzing correlations in the data aggregated by the data aggregation means; and result display means for extracting and displaying items with strong correlation.
[0017]
The present invention also resides in a computer-readable recording medium on which is recorded an information mining program for finding information, by computer, in accumulated text, the program including: a word extracting procedure of extracting words from the accumulated text; a dependency analyzing procedure of analyzing the dependency structure of the words extracted in the word extracting procedure; a sentence structure similarity determining procedure of determining the similarity of the dependency structures analyzed in the dependency analyzing procedure; a grouping procedure of grouping sentences according to the values determined in the sentence structure similarity determining procedure; a specific keyword extracting procedure of extracting specific words from among the words extracted in the word extracting procedure; a data aggregation procedure of counting the number of occurrences of the specific keywords and the grouped sentences; an information mining procedure of analyzing correlations in the data aggregated in the data aggregation procedure; and a result display procedure of extracting and displaying items with strong correlation.
[0018]
[Embodiments of the invention]
Hereinafter, an embodiment of the present invention will be described with reference to the drawings. FIG. 1 is a configuration diagram showing an information mining device of the present invention. Reference numeral 11 denotes a text DB (database) for storing examples of inquiries from customers, and stores the contents of inquiries from customers, product names, contents of answers to the inquiries, and the like. Further, it stores morphological information obtained by morphological analysis from text, important sentence information extracted by extracting important sentences, a similarity value calculated by calculating the similarity between sentences, and the like. Reference numeral 12 denotes a word dictionary, which is used as an analysis dictionary in word extraction processing. Reference numeral 13 denotes a thesaurus dictionary describing relationships between words. These are stored in the database unit 51.
[0019]
Reference numeral 1 denotes a word extracting unit that extracts a word from the text stored in the text DB 11. Reference numeral 2 denotes an important sentence extracting unit that specifies and extracts an important sentence from the text extracted by the word extracting unit 1. Reference numeral 3 denotes a dependency analyzing unit that analyzes a dependency relationship with respect to the important sentence extracted by the important sentence extracting unit 2. Reference numeral 4 denotes a sentence structure similarity determination unit that calculates a sentence similarity based on the dependency structure analyzed by the dependency analysis unit 3 and information in the thesaurus dictionary 13.
[0020]
Meanwhile, reference numeral 5 denotes a specific keyword extracting unit that extracts designated specific keywords from the important sentences extracted by the important sentence extracting unit 2. Reference numeral 6 denotes a grouping unit that groups similar sentences based on the similarity information calculated by the sentence structure similarity determination unit 4.
[0021]
Reference numeral 7 denotes a data tabulation unit that tabulates frequencies of appearance along two axes: the specific keywords extracted by the specific keyword extraction unit 5 and the sentence groups formed by the grouping unit 6. Reference numeral 8 denotes an information mining unit that extracts features by statistical calculation from the frequency table produced by the data tabulation unit 7. Reference numeral 9 denotes a result display unit that displays the items with large feature values found by the information mining unit 8. These units are constituted by a computer 50 operating according to stored programs. Reference numeral 52 denotes a display device.
[0022]
FIG. 2 is a flowchart illustrating the operation of the information mining device of the present invention. Each step corresponds to a unit for performing the processing of the configuration diagram of FIG. 1, and 1 to 9 correspond to S1 to S9.
[0023]
FIG. 3 is an example of text stored in the text DB 11. The text is structured, 21 is an example of a product name, and 22 is an example of inquiry contents.
[0024]
Next, the operation will be described. In the word extracting step S1, the word extracting means 1 extracts words from the text stored in the text DB 11. Word extraction uses the word dictionary 12 and a morphological analysis method of the kind generally used for sentence analysis, and identifies, in each sentence, independent words such as nouns, verbs, and adjectives together with their inflected forms, as well as attached words such as particles and auxiliary verbs together with their inflected forms. The resulting morpheme information is stored in the text DB 11 in association with the text from which the words were extracted.
[0025]
FIG. 3 shows an example of text stored in the text DB 11, which is not a simple text but a text having a structure such as a document number, a product name, an inquiry, and an answer. Here, assuming that information mining is performed for an inquiry, a part of the inquiry content example 22 is extracted, and words are extracted and stored in the text DB 11 in a word extraction step S1.
[0026]
Next, in the important sentence extraction step S2, the important sentence extracting means 2 extracts important sentences from the text to be analyzed and stores, in the text DB 11, text information in which the important sentences are marked. The important sentence extraction step S2 uses a statistical method of the kind used for creating text abstracts. For example, sentences containing independent words that occur frequently within a single text are extracted at a specified ratio.
[0027]
(Equation 1)

[0028]
In Expression (1), Wi represents the importance of the i-th sentence, and a fixed percentage of sentences are extracted as important sentences in the order of Wi values.
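Equation (1) itself is not reproduced in this text, so the following is only a sketch of a frequency-based importance score under a stated assumption: the score of sentence i is taken to be the sum, over its independent words, of each word's frequency in the whole text. The function name `extract_important` and the rounding rule for the ratio are likewise assumptions.

```python
import math
from collections import Counter

def extract_important(sentences, ratio):
    """Return the indices of the top-scoring sentences.

    sentences: list of sentences, each a list of independent words.
    ratio: fraction of sentences to keep (at least one is always kept).
    """
    # Frequency of each independent word within the whole text.
    freq = Counter(word for sentence in sentences for word in sentence)
    # Assumed score: sum of the text-wide frequencies of the sentence's words.
    scores = [sum(freq[word] for word in sentence) for sentence in sentences]
    keep = max(1, math.ceil(len(sentences) * ratio))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])

# Hypothetical three-sentence inquiry.
text = [["printer", "power"], ["printer", "print", "error"], ["thanks"]]
selected = extract_important(text, 0.5)
```

With a single-sentence text the function always returns that sentence, whatever the ratio.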
[0029]
In the inquiry content example 22 of FIG. 3, there is only one sentence, so the inquiry content example 22 becomes the important sentence as it is.
[0030]
Once the important sentence extracting means 2 has selected the important sentences and stored them in the text DB 11, the dependency analyzing means 3, in the dependency analyzing step S3, extracts the dependencies of each important sentence in the text by a generally known syntactic analysis process and stores the dependency structure in the text DB 11. At this time, the (tense, aspect, modality) information for the main declinable word is stored as well.
[0031]
FIG. 4 is a dependency analysis example showing the result of analyzing the inquiry content example 22 in the dependency analysis step S3.
[0032]
Once the dependency analysis unit 3 has analyzed the dependency structure of the important sentences in the text, the sentence structure similarity determination unit 4, in the sentence structure similarity calculation step S4, calculates the similarity of the sentences in the text against all the sentences stored in the text DB 11, using the thesaurus dictionary 13. If the similarity were simply calculated for all the important sentences stored in the text DB 11, the amount of calculation would become very large, so before the dependency structures are compared, the thesaurus dictionary 13 is used to restrict the comparison to related words. For example, "〜が印刷できない" ("cannot print ...", using the word 印刷) and "〜がプリントできない" ("cannot print ...", using the word プリント) share a direct superordinate concept in the thesaurus dictionary of FIG. 5 and are therefore subject to similarity calculation, whereas the similarity between "〜が印刷できない" and "〜が入力できない" ("cannot input ...") is set to 0.
[0033]
(Equation 2)

[0034]
Equation (2) defines the similarity of sentences based on the similarity of syntactically corresponding words. This makes it possible to calculate similarity even when the sentences do not use the same terms.
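Equation (2) is not reproduced in this text. The sketch below only illustrates the general idea of comparing syntactically corresponding words through a thesaurus, under several assumptions: a sentence is reduced to a predicate plus a mapping from particle to dependent word, the hypothetical `PARENT` table stands in for the thesaurus of FIG. 5, a shared direct parent scores 0.5, and the sentence score is a plain average. None of these weights come from the patent.

```python
# Hypothetical thesaurus: word -> direct superordinate concept.
PARENT = {"印刷": "出力", "プリント": "出力", "入力": "操作"}

def word_similarity(a, b):
    """1.0 for identical words, 0.5 for a shared direct parent, else 0.0."""
    if a == b:
        return 1.0
    parent_a, parent_b = PARENT.get(a), PARENT.get(b)
    if parent_a is not None and parent_a == parent_b:
        return 0.5
    return 0.0

def sentence_similarity(s1, s2):
    """Each sentence is (predicate, {particle: dependent word})."""
    predicate_score = word_similarity(s1[0], s2[0])
    if predicate_score == 0.0:
        return 0.0  # unrelated predicates are excluded before any comparison
    slots = set(s1[1]) | set(s2[1])
    total = predicate_score
    for slot in slots:
        if slot in s1[1] and slot in s2[1]:
            total += word_similarity(s1[1][slot], s2[1][slot])
    return total / (1 + len(slots))

insatsu = ("印刷", {"が": "電源"})
purinto = ("プリント", {"が": "電源"})
nyuryoku = ("入力", {"が": "電源"})
```

Here `sentence_similarity(insatsu, purinto)` is positive even though the predicates differ on the surface, while `sentence_similarity(insatsu, nyuryoku)` is 0.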
[0035]
Next, in the specific keyword extraction step S5, the specific keyword extracting means 5 determines whether the text matches any keyword designated in advance and, if it does, stores that information in the text DB 11. The specific keywords, such as part names, are assumed to have been extracted beforehand, manually or mechanically, from the table-of-contents items (headings) of a product manual, the family tree of a product, or the like.
[0036]
FIG. 6 shows an example of specific keywords extracted from the table of contents of a printer manual. The sentence of the inquiry content example 22 contains the word "印刷" (printing), and the thesaurus dictionary of FIG. 5 defines the word "プリント" (print) as a synonym, so the specific keyword extraction step S5 makes "プリント" (print) the specific keyword of the inquiry content example 22.
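The synonym-aware matching described here might be sketched as follows. The keyword set, the synonym table, and the sample word list are all hypothetical stand-ins for FIG. 6, FIG. 5, and the inquiry content example 22, whose actual contents are not reproduced in this text.

```python
# Hypothetical specific keywords taken from a manual's table of contents.
SPECIFIC_KEYWORDS = {"プリント", "給紙", "トナー"}
# Hypothetical synonym table: surface word -> keyword it is synonymous with.
SYNONYMS = {"印刷": "プリント"}

def extract_specific_keywords(words):
    """Return the specific keywords matched by a sentence's words, in order."""
    found = []
    for word in words:
        canonical = SYNONYMS.get(word, word)  # map synonyms onto the keyword
        if canonical in SPECIFIC_KEYWORDS and canonical not in found:
            found.append(canonical)
    return found

# Hypothetical extracted words of one inquiry sentence.
keywords = extract_specific_keywords(["プリンタ", "電源", "印刷"])
```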
[0037]
Next, in the grouping step S6, the grouping unit 6 groups similar sentences based on the similarity calculated by the sentence structure similarity determination unit 4. Grouping is restricted to sentences whose (tense, aspect, modality) match. Sentences are grouped according to a preset similarity threshold.
[0038]
By changing the threshold value set here, it is possible to select whether the inquiry contents are roughly grouped or finely grouped.
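The effect of the threshold can be seen in a small sketch. The patent does not specify a grouping algorithm, so the greedy scheme below (compare each sentence with the first member of each existing group) is an assumption, as are the sentence ids, the (tense, aspect, modality) tuples, and the similarity values in `SIM`.

```python
def group_sentences(sentences, similarity, threshold):
    """Greedily group sentence ids.

    sentences: list of (sentence_id, tam) where tam is a
    (tense, aspect, modality) tuple. A sentence joins the first group whose
    representative has the same tam and a similarity of at least threshold.
    """
    groups = []
    for sentence_id, tam in sentences:
        for group in groups:
            if group["tam"] == tam and similarity(group["rep"], sentence_id) >= threshold:
                group["members"].append(sentence_id)
                break
        else:
            groups.append({"rep": sentence_id, "tam": tam, "members": [sentence_id]})
    return [group["members"] for group in groups]

# Hypothetical precomputed similarities between sentence ids.
SIM = {frozenset({"s1", "s2"}): 0.75, frozenset({"s1", "s3"}): 0.4,
       frozenset({"s1", "s4"}): 0.9}

def sim(a, b):
    return SIM.get(frozenset({a, b}), 0.0)

PRESENT = ("present", "state", "negative")
PAST = ("past", "state", "negative")
data = [("s1", PRESENT), ("s2", PRESENT), ("s3", PRESENT), ("s4", PAST)]

fine = group_sentences(data, sim, 0.5)
coarse = group_sentences(data, sim, 0.3)
```

Lowering the threshold from 0.5 to 0.3 merges `s3` into the first group, while `s4` stays separate at either setting because its tense differs.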
[0039]
When the grouping is completed by the grouping unit 6, the data totaling unit 7 totals the frequencies on a two-dimensional table in a data totaling step S7. In the two-dimensional table, one of the two axes arranges the grouped sentences, and the other axis arranges the specific keyword extracted by the specific keyword extracting means 5.
[0040]
FIG. 7 shows an example of a table for counting by the data counting means 7, in which specific keywords are arranged in the horizontal axis direction and sentences grouped by the grouping means 6 in the vertical axis direction. Since the specific keyword “print” corresponds to the sentence of the inquiry content example 22, the frequency of the position 31 is increased by one.
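A table of this kind can be held as a counter keyed by (sentence group, specific keyword). This is a minimal sketch; the group ids and keywords are hypothetical, and the cell labelled 31 in FIG. 7 corresponds to whichever key the inquiry falls under.

```python
from collections import Counter

def tally(observations):
    """Count occurrences of each (group_id, specific_keyword) pair."""
    table = Counter()
    for group_id, keyword in observations:
        table[(group_id, keyword)] += 1  # add one to the matching cell
    return table

# Hypothetical observations, one per sentence-keyword match.
table = tally([("G1", "プリント"), ("G1", "プリント"), ("G2", "給紙")])
```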
[0041]
Next, in the information mining step S8, the information mining means 8 applies equations (3) and (4) to the two-dimensional table tabulated in the data tabulation step S7 and extracts statistically singular (characteristic) points by the chi-square test.
[0042]
(Equation 3)

[0043]
Yij in the above equation (4) is a value indicating how far the theoretical frequency is apart from the actual frequency, and the larger this value is, the more characteristic it appears.
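Equations (3) and (4) are not reproduced in this text. As a hedged illustration, the sketch below uses the textbook Pearson chi-square quantities, which may differ in detail from the patent's own formulas: the theoretical frequency of a cell is assumed to be (row total × column total) / grand total, and the cell's value is assumed to be (observed − theoretical)² / theoretical.

```python
def chi_square_cells(table):
    """Return (expected, contribution) matrices for a table of observed counts.

    table: non-empty list of equal-length rows with a positive grand total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Theoretical frequency under independence of the two axes.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    # Per-cell distance between observed and theoretical frequency.
    contribution = [
        [(o - e) ** 2 / e if e > 0 else 0.0 for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(table, expected)
    ]
    return expected, contribution

# Hypothetical 2x2 table: rows are sentence groups, columns are keywords.
observed = [[8, 2], [2, 8]]
expected, contribution = chi_square_cells(observed)
```

Cells whose contribution is large are the ones where a sentence group and a keyword co-occur far more, or far less, often than independence would predict.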
[0044]
Finally, in the result display step S9, the result display means 9 displays on the display 52, in descending order of the Yij values calculated by the information mining means 8, sets each consisting of a specific keyword, a sentence representing the grouped sentences, and the Yij value.
[0045]
FIG. 8 shows the information mining results in descending order of Yij. It shows that, for the keyword "print" (printing), when inquiries of the form "printing is not possible even though the power is on" are very numerous and characteristic, they are displayed at the top.
[0046]
Further, the result display means 9 displays the values calculated by equation (3) as shades of color, so that the user can pick out, from the overall picture, problems that appear characteristically (for example, the fact that there are many inquiries about a specific function of the product).
[0047]
FIG. 9 shows the result of the information mining on a two-dimensional plane, and the item 41 which is the first in FIG. 8 is displayed in dark color.
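A shaded two-dimensional view of this kind can be approximated in plain text. The sketch below is an assumption-laden stand-in for FIG. 9: it maps each value onto one of five hypothetical character "shades", scaled linearly against the largest value in the table.

```python
SHADES = " .:*#"  # lightest to darkest; an arbitrary illustrative palette

def shade_rows(values):
    """Render a matrix of non-negative values as rows of shade characters."""
    peak = max(max(row) for row in values)
    lines = []
    for row in values:
        cells = []
        for value in row:
            # Scale the value to an index into SHADES; all-zero tables stay blank.
            level = 0 if peak == 0 else round(value / peak * (len(SHADES) - 1))
            cells.append(SHADES[level])
        lines.append("".join(cells))
    return lines

# Hypothetical evaluation values: rows are sentence groups, columns are keywords.
picture = shade_rows([[0.0, 2.0], [4.0, 1.0]])
```

The darkest character marks the cell with the largest value, which corresponds to the top-ranked item.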
[0048]
As a result, the content that appears frequently in the texts stored in the text DB 11 can be presented to the system administrator, so that manuals can be improved and Q&A cases added effectively. Furthermore, by using the part names in the product's family tree as the specific keywords and applying the text processing to the A portion of the Q&A cases, developers can obtain the information they need to feed back into product improvement, for example that questions about a specific part occur frequently.
[0049]
Note that, in the important sentence extraction step S2, instead of extracting important sentences based on the frequency of occurrence of the independent words appearing in the text, specific morpheme patterns such as "〜できない" ("cannot ...") and "〜について知りたい" ("I want to know about ...") may be prepared in advance, and sentences matching those patterns may be extracted as important sentences. For a DB devoted to specific content, such as one used to analyze inquiry histories, this makes it possible to select more appropriate sentences than important sentence extraction by the statistical method.
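Pattern-based extraction of this kind might be sketched as follows. The text speaks of morpheme patterns; matching regular expressions against the surface string, as done here, is a simplifying assumption, and the sample sentences are hypothetical.

```python
import re

# The two patterns named in the text, matched on the surface string.
PATTERNS = [re.compile("できない"), re.compile("について知りたい")]

def is_important(sentence):
    """True when the sentence contains any of the prepared patterns."""
    return any(pattern.search(sentence) for pattern in PATTERNS)

# Hypothetical sentences from an inquiry.
sentences = [
    "いつもお世話になっております",
    "プリンタの電源は入るが印刷できない",
    "給紙方法について知りたい",
]
important = [s for s in sentences if is_important(s)]
```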
[0050]
In addition, for Japanese text, the dependency analysis process may, instead of general syntactic analysis, tolerate sentences in which particles are omitted, such as "プリンタ印刷できない" ("printer print cannot", with the particles dropped): when a particle is missing, the dependency is set to the nearest declinable word.
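The fallback for omitted particles can be sketched as below. The chunk representation (dictionaries with hypothetical keys `text`, `is_predicate`, and `particle`) is an assumption, and so is the choice to look only at declinable words that follow the chunk, which relies on Japanese dependents normally preceding their heads.

```python
def attach_particleless(chunks):
    """Return, for each chunk, the index of its head or None.

    Only chunks that are not predicates and carry no particle are resolved
    here; they are attached to the nearest following declinable word.
    """
    heads = []
    for index, chunk in enumerate(chunks):
        head = None
        if not chunk["is_predicate"] and chunk["particle"] is None:
            for later in range(index + 1, len(chunks)):
                if chunks[later]["is_predicate"]:
                    head = later  # nearest following declinable word
                    break
        heads.append(head)
    return heads

# Hypothetical chunking of "プリンタ印刷できない".
chunks = [
    {"text": "プリンタ", "is_predicate": False, "particle": None},
    {"text": "印刷できない", "is_predicate": True, "particle": None},
]
heads = attach_particleless(chunks)
```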
[0051]
Further, in the specific keyword extraction process, instead of extracting specific keywords that appear in the text, when product names, part names, and the like are already known as predetermined bibliographic items in the text DB 11, they may be taken from the predetermined bibliographic information field (the part corresponding to the product name example 21 in FIG. 3) and matched, without being extracted from the text.
[0052]
[Effects of the invention]
As described above, the present invention provides an information mining method for finding useful correlation information in an accumulated text set, the method comprising: a word extracting step of extracting words from each accumulated text; a dependency analyzing step of analyzing the dependency structure of the words extracted in the word extracting step; a sentence structure similarity determining step of determining the similarity of the dependency structures analyzed in the dependency analyzing step; a grouping step of grouping sentences according to the values determined in the sentence structure similarity determining step; a specific keyword extracting step of extracting specific words from among the words extracted in the word extracting step; a data aggregation step of counting the number of occurrences of the specific keywords and the grouped sentences; an information mining step of analyzing correlations in the data aggregated in the data aggregation step; and a result display step of extracting and displaying items with strong correlation. It also provides an information mining apparatus that executes this method and a computer-readable recording medium on which an information mining program is recorded. By mining and extracting useful information from text data that accumulates daily, as in a case database, the invention allows the administrator of a case database system to enrich product manuals and Q&A cases so as to reduce customer inquiries, and allows developers of the products covered by the case database to easily find the items that should be addressed with priority.
[0053]
Further, in the present invention, in the dependency analysis step, in the case of Japanese text, a word that lacks information such as a particle is treated as depending on the nearest declinable word, so that a more widely applicable information mining method and the like can be provided.
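The fallback rule above can be sketched in Python as follows. This is only an illustration under stated assumptions: the chunk representation (`word`, `particle`, `declinable`) and the function name `attach_particleless_words` are hypothetical, and "nearest" is read here as the nearest declinable word that follows, since Japanese dependents normally precede their heads. The text itself does not fix these details.

```python
def attach_particleless_words(chunks):
    """Return {dependent_index: head_index} for chunks that lack a particle.

    Each chunk is a dict with hypothetical keys:
      "word"       - surface form
      "particle"   - attached particle, or None when it is missing
      "declinable" - True for declinable words (verbs, adjectives)
    Chunks that carry a particle are left to the ordinary dependency analysis.
    """
    heads = {}
    for i, chunk in enumerate(chunks):
        if chunk["declinable"] or chunk["particle"] is not None:
            continue
        # Assumption: search forward for the nearest declinable word.
        for j in range(i + 1, len(chunks)):
            if chunks[j]["declinable"]:
                heads[i] = j
                break
    return heads


# Telegraphic inquiry such as "printer power won't-turn-on" (particles omitted).
chunks = [
    {"word": "printer", "particle": None, "declinable": False},
    {"word": "power", "particle": None, "declinable": False},
    {"word": "does-not-turn-on", "particle": None, "declinable": True},
]
heads = attach_particleless_words(chunks)
```

In this example both particle-less nouns are attached to the predicate at index 2.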
[0054]
Further, the present invention comprises an important sentence extraction step of extracting important sentences from the result of the word extraction step, and only the sentences extracted in the important sentence extraction step are processed in the dependency analysis step, so that a more efficient information mining method and the like can be provided.
[0055]
Further, according to the present invention, in the important sentence extracting step, a sentence including a keyword appearing frequently in a text is set as an extraction target, so that a good and efficient information mining method or the like can be provided.
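A minimal Python sketch of frequency-based important sentence extraction is given here for illustration. It assumes the word extraction step has already produced a list of word lists, one per sentence. The function name `extract_important_sentences` and the `top_k` cutoff are hypothetical choices, not taken from the text.

```python
from collections import Counter


def extract_important_sentences(sentences, top_k=2):
    """Keep sentences that contain at least one of the top_k most frequent words.

    `sentences` is a list of word lists (output of a word extraction step).
    """
    counts = Counter(word for words in sentences for word in words)
    frequent = {word for word, _ in counts.most_common(top_k)}
    return [words for words in sentences if frequent & set(words)]


sentences = [
    ["printer", "power", "on", "fail"],
    ["printer", "print", "fail"],
    ["thank", "you"],
]
important = extract_important_sentences(sentences, top_k=2)
```

Here "printer" and "fail" are the two most frequent words, so the greeting sentence is dropped before dependency analysis.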
[0056]
Further, in the present invention, in the important sentence extracting step, a sentence in which an expression matching a specific pattern appears is set as an extraction target, so that a more efficient information mining method can be provided.
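Pattern-based extraction could look roughly like the following Python sketch. The regular expressions are invented for illustration (the text does not list concrete patterns), and English sentences stand in for the Japanese inquiries.

```python
import re

# Hypothetical trouble-report patterns; the source does not enumerate them.
PATTERNS = [
    re.compile(r"\bcannot\b", re.IGNORECASE),
    re.compile(r"\bdoes not\b", re.IGNORECASE),
    re.compile(r"\berror\b", re.IGNORECASE),
]


def matches_pattern(sentence):
    """Return True when any of the predefined patterns occurs in the sentence."""
    return any(pattern.search(sentence) for pattern in PATTERNS)


texts = [
    "The printer does not turn on.",
    "Thank you for your help.",
    "I cannot print from the application.",
]
extracted = [sentence for sentence in texts if matches_pattern(sentence)]
```

Only the two sentences that describe a problem pass the filter.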
[0057]
Further, in the present invention, in the similar sentence determination step, similarity is determined based on the relevance of words using a thesaurus dictionary, so that a more efficient information mining method or the like can be provided.
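One way to picture thesaurus-based similarity is the Python sketch below. It rests on several assumptions that go beyond the text: each sentence is represented as a set of (dependent, head) word pairs, the thesaurus is a plain word-to-concept dictionary, and similarity is measured with the Jaccard coefficient. All names and the sample thesaurus entries are hypothetical.

```python
def normalize(pair, thesaurus):
    """Map both words of a (dependent, head) pair to their thesaurus concept."""
    dependent, head = pair
    return (thesaurus.get(dependent, dependent), thesaurus.get(head, head))


def structure_similarity(pairs_a, pairs_b, thesaurus):
    """Jaccard similarity of two dependency-pair sets after thesaurus lookup."""
    a = {normalize(pair, thesaurus) for pair in pairs_a}
    b = {normalize(pair, thesaurus) for pair in pairs_b}
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical thesaurus: related words share one concept label.
thesaurus = {
    "power": "POWER_SUPPLY",
    "power supply": "POWER_SUPPLY",
    "turn on": "START",
    "switch on": "START",
}
s1 = [("power", "turn on"), ("print", "cannot")]
s2 = [("power supply", "switch on"), ("print", "cannot")]
s3 = [("paper", "jam")]
same = structure_similarity(s1, s2, thesaurus)
different = structure_similarity(s1, s3, thesaurus)
```

With this lookup, `s1` and `s2` are treated as the same structure even though their surface words differ, while `s3` shares nothing with `s1`.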
[0058]
Further, in the present invention, in the specific keyword extraction step, the table-of-contents entries of a manual or the like are used as specific keywords, so that a more efficient information mining method and the like suitable for manual preparation and similar work can be provided.
[0059]
Also, in the present invention, in the specific keyword extraction step, the part names in a product's family tree are used as specific keywords, so that a more efficient information mining method and the like suitable for product manufacturing and similar work can be provided.
[0060]
Further, in the present invention, in the information mining step, singular points are found by chi-square statistics with the grouped sentences on one axis and the specific keywords on the other axis, so that a more efficient information mining method and the like can be provided.
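The chi-square idea can be illustrated with a small Python sketch. It assumes an aggregation table whose rows are sentence groups and whose columns are specific keywords, and it flags as "singular points" the cells whose observed count exceeds the expected count and whose contribution (O - E)^2 / E is at least a threshold. Using 3.84 (the 5% critical value of chi-square with one degree of freedom) as a per-cell screen is a heuristic chosen for this example, not something the text prescribes. The counts are invented.

```python
def singular_points(table, threshold=3.84):
    """Return (row, col, contribution) for over-represented cells, largest first.

    table[i][j] is the number of texts in which sentence group i
    co-occurs with specific keyword j.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    if total == 0:
        return []
    hits = []
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected == 0:
                continue
            contribution = (observed - expected) ** 2 / expected
            if observed > expected and contribution >= threshold:
                hits.append((i, j, contribution))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Invented counts: rows = sentence groups, columns = keywords.
table = [
    [20, 2, 3],
    [3, 10, 12],
    [2, 13, 10],
]
hits = singular_points(table)
```

Every cell has an expected count of 25 x 25 / 75, so only the cell at row 0, column 0 stands out strongly enough to be reported.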
[0061]
Further, in the present invention, in the word extraction step, only a specific portion of the structured text is processed, so that a more efficient information mining method and the like can be provided.
[0062]
Also, in the present invention, in the result display step, the values evaluated in the information mining step are displayed as shades of color on a two-dimensional plane, so that an information mining method and the like can be provided.
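As a rough illustration of shading on a two-dimensional plane, the Python sketch below maps each evaluated value to one of five text characters, from light to dark. The character ramp, the linear scaling, and the function name `to_shades` are assumptions made for this example; an actual display would use pixel shades instead of characters.

```python
def to_shades(values, ramp=" .:*#"):
    """Render a 2-D list of scores as rows of characters, darker = larger."""
    flat = [v for row in values for v in row]
    low, high = min(flat), max(flat)
    span = (high - low) or 1.0  # avoid division by zero for constant input
    rows = []
    for row in values:
        chars = []
        for v in row:
            level = int((v - low) / span * len(ramp))
            chars.append(ramp[min(level, len(ramp) - 1)])
        rows.append("".join(chars))
    return rows


# Invented scores: rows = sentence groups, columns = specific keywords.
scores = [
    [0, 5, 10],
    [10, 3, 0],
]
shaded = to_shades(scores)
```

The darkest characters mark the group and keyword combinations with the highest evaluated values.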
[Brief description of the drawings]
FIG. 1 is a diagram showing a configuration of an information mining device of the present invention.
FIG. 2 is a flowchart showing a processing operation of the present invention.
FIG. 3 is a diagram showing an example of text stored in a text DB according to the present invention.
FIG. 4 is a diagram illustrating an example of a dependency analysis result of the present invention.
FIG. 5 is a diagram showing an example of data stored in a thesaurus dictionary of the present invention.
FIG. 6 is a diagram showing an example in which a specific keyword in the present invention is extracted from a manual.
FIG. 7 is a diagram illustrating an example of an aggregation table according to the present invention.
FIG. 8 is a diagram showing an example of an analysis result list screen according to the present invention.
FIG. 9 is a diagram showing an example of two-dimensional grayscale display of an analysis result list in the present invention.
[Explanation of symbols]
1 word extraction means, 2 important sentence extraction means, 3 dependency analysis means, 4 structural similarity determination means, 5 specific keyword extraction means, 6 grouping means, 7 data aggregation means, 8 information mining means, 9 result display means, 11 text DB, 12 word dictionary, 13 thesaurus dictionary, 21 product name example, 22 inquiry content example, 50 computer section, 51 database section, 52 display.

Claims

1. An information mining method for finding valid correlation information from an accumulated text set, comprising:
a word extraction step of extracting words from each accumulated text;
a dependency analysis step of analyzing the dependency structure of the words extracted in the word extraction step;
a sentence structure similarity determination step of determining the similarity of the dependency structures analyzed in the dependency analysis step;
a grouping step of grouping sentences according to the values determined in the sentence structure similarity determination step;
a specific keyword extraction step of extracting specific words from among the words extracted in the word extraction step;
a data aggregation step of counting the number of occurrences of the specific keywords and the grouped sentences;
an information mining step of analyzing correlations in the data aggregated in the data aggregation step; and
a result display step of extracting and displaying items having a strong correlation.

2. The information mining method according to claim 1, wherein in the dependency analysis step, in the case of Japanese text, a word that lacks information such as a particle is treated as depending on the nearest declinable word.

3. The information mining method according to claim 1, further comprising an important sentence extraction step of extracting important sentences from the result of the word extraction step, wherein only the sentences extracted in the important sentence extraction step are processed in the dependency analysis step.

4. The information mining method according to claim 3, wherein in the important sentence extracting step, a sentence including a keyword appearing frequently in a text is to be extracted.

5. The information mining method according to claim 3, wherein in the important sentence extracting step, a sentence in which an expression matching a specific pattern appears is to be extracted.

6. The information mining method according to claim 1, wherein in the similar sentence determination step, the similarity is determined based on the relevance of words using a thesaurus dictionary.

7. The information mining method according to claim 1, wherein in the specific keyword extracting step, the table-of-contents entries of a manual or the like are used as specific keywords.

8. The information mining method according to claim 1, wherein in the specific keyword extracting step, a part name of a family tree of a product is used as a specific keyword.

9. The information mining method according to claim 1, wherein in the information mining step, a singular point is found by chi-square statistics using the grouped sentences as one axis and a specific keyword as another axis.

10. The information mining method according to claim 1, wherein a specific portion of the structured text is processed in the word extracting step.

11. The information mining method according to claim 1, wherein in the result displaying step, a value of a result evaluated in the mining step is displayed as a shade of color on a two-dimensional plane.

12. An information mining device for finding information from a stored text set, comprising:
word extraction means for extracting words from each accumulated text;
dependency analysis means for analyzing the dependency structure of the words extracted by the word extraction means;
sentence structure similarity determination means for determining the similarity of the dependency structures analyzed by the dependency analysis means;
grouping means for grouping sentences according to the values determined by the sentence structure similarity determination means;
specific keyword extraction means for extracting specific words from among the words extracted by the word extraction means;
data aggregation means for counting the number of occurrences of the specific keywords and the grouped sentences;
information mining means for analyzing correlations in the data aggregated by the data aggregation means; and
result display means for extracting and displaying items having a strong correlation.

13. A computer-readable recording medium on which is recorded an information mining program for finding information from text stored by a computer, the information mining program comprising:
a word extraction procedure for extracting words from the stored text;
a dependency analysis procedure for analyzing the dependency structure of the words extracted in the word extraction procedure;
a sentence structure similarity determination procedure for determining the similarity of the dependency structures analyzed in the dependency analysis procedure;
a grouping procedure for grouping sentences according to the values determined in the sentence structure similarity determination procedure;
a specific keyword extraction procedure for extracting specific words from among the words extracted in the word extraction procedure;
a data aggregation procedure for counting the number of occurrences of the specific keywords and the grouped sentences;
an information mining procedure for analyzing correlations in the data aggregated in the data aggregation procedure; and
a result display procedure for extracting and displaying items having a strong correlation.