JP3960530B2

JP3960530B2 - Text mining program, method and apparatus

Info

Publication number: JP3960530B2
Application number: JP2002177956A
Authority: JP
Inventors: 安彦内田
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2002-06-19
Filing date: 2002-06-19
Publication date: 2007-08-15
Anticipated expiration: 2022-06-19
Also published as: JP2004021763A

Description

【０００１】
【発明の属する技術分野】
本発明は、蓄積されたテキストデータを分析し、特徴や傾向を把握したり、未知の情報を発見したりするプロセスを支援するテキストマイニングプログラム、方法、及び装置に関する
【０００２】
【従来の技術】
従来、蓄積されたテキストデータを解析し、特徴や傾向を把握したり、未知の情報を発見する技術として、文書分類方法、重要語の抽出方法、抽出した単語の分類方法、及び抽出した単語間の関連の表示方法など、多くのテキストマイニングの技術が提案されている。その中でテキストマイニングの可視化技術として、単語間の連想関係を図１のようなネットワーク形式で表示する方法が提案されている。図１において、矩形で囲んだ各単語は分析対象のテキストから抽出したキーワードを示し、各キーワード間をつなぐ経路に付された数値はそれらのキーワード間の関連性を示す。また、人工知能学会誌１６巻第２号（２００１年３月）の「ビジュアルテキストマイニング」では、テキストマイニングの可視化技術として、単語マップ、アンカーマップ、及びスケルトンマップが提案されている。
【０００３】
これらは、主に注目しているキーワードと直接関連のあるキーワードをネットワーク表示することにより、傾向を把握するための表示方法である。さらに、特開２００１−１１７９３５や特表２００１−５１３２４２には、ネットワーク表示されたキーワードをクリックすると、クリックしたキーワードに関連するキーワードを展開し、間接的な関連を見せるという方法が提案されてはいる。ただし、間接的な関連を見るためには利用者がキーワードを指定しなければならなかった。
【０００４】
【発明が解決しようとする課題】
上記従来技術では、指定されたキーワードあるいは複数のキーワードに直接関連の強いキーワードをネットワーク形式あるいはリスト形式で表示するため、直接的な関連あるいは直接的な結びつきを把握することは可能であるが、間接的な結びつきを把握することができなかった。つまり、語と語の直接的な結びつきを見ることはできるが、語と語の間にどのような語が介在しているかを見ることはできなかった。そのため、利用者はある程度推測できる連想関係しか見ることができないという問題があった。また、あるキーワードを選択し、その関連語を徐々に表示していく場合においても、利用者が探索する方向を決定し、操作しなければならず、限られた経路しか表示されないという問題があった。
【０００５】
本発明の目的は、第１キーワードから第２キーワードへ至る関連語の経路をあらかじめ複数表示することにより、２つのキーワードがどのような語を経由して結びついているのかという情報を提示し、このような関連語の経路を表示することにより、利用者に今まで気がつかなかった未知の情報を提示することにある。
【０００６】
【課題を解決するための手段】
上記の目的を達成するため、本発明では、指定された２つのキーワードを結ぶ関連語の経路を探索し、探索した経路を表示することを特徴とする。例えば、指定された第１キーワードからの距離が所定値以内の範囲内で、該第１キーワードに関連する関連語及び何れかの関連語を介して該第１キーワードとつながる関連語を探索し、その探索結果から該第１キーワードの関連語リストを作成し、指定された第２キーワードについても同様にして第２キーワードの関連語リストを作成し、それらの関連語リストから、両関連語リストに出現する共通関連語を探索し、その共通関連語を介して前記第１キーワードから第２キーワードに至る経路を求め、求めた経路を表示する。
【０００７】
具体的には、文書データベースから関連語辞書を作成する関連語辞書作成部と関連語辞書からキーワード間の経路を求める関連語経路探索部を設ける。これにより、指定された２つのキーワード間の経路情報を作成することが可能になる。入力画面にはキーワード指定エリアの他に距離（関連語の数）を指定するエリアを設ける。これにより、第１キーワードから第２キーワードに至る距離（関連語の数）をしきい値として、関連語の経路を探索することが可能となる。また、同じくキーワード間の関連の強さ（関連度）を指定する入力エリアを設ける。これにより、関連の強さ（関連度）をしきい値として、関連語の経路を探索することが可能となる。以上のように、指定された２つのキーワード、距離、関連度を入力として探索処理を実行し、経路情報を作成する。
【０００８】
さらに、作成した経路情報を表示するためのオプションとして、表示する経路の順序を指定するエリア、キーワード間の関連度、キーワードの出現頻度によって、表示する経路の表示色や、キーワードの表示色を指定するエリア等を設ける。表示する経路の順序を指定する方法として、最短経路順あるいは最長経路順の指定と、経路の関連の強さの平均値の昇順または降順による指定方法を設ける。最短経路順が指定された場合には、経路の長さが短い順に表示する。最長経路順が指定された場合には、経路の長さが長い順に表示する。経路の関連の強さの平均値の昇順が指定された場合には、経路の要素である各キーワード間の関連の強さの平均値が小さい順に表示する。経路の関連の強さの平均値の降順が指定された場合には、経路の要素である各キーワード間の関連の強さの平均値が大きい順に表示する。また、これらの表示順序の指定については、経路を優先するか、関連度の平均値を優先するかを指定できるようにする。
【０００９】
さらに、キーワード間の関連の強さによって経路の表示色を指定するオプションやキーワードの出現頻度によって表示色を指定するオプションを設けてもよい。キーワード間の関連の強さを色分け表示することにより、関連の強弱を把握することができる。また、キーワードの頻度情報を色分け表示することにより、キーワードそのものの情報つまり低頻度語なのか高頻度語なのかという情報も同時に把握することができる。このように、表示オプションを設けることにより、作成した経路情報を複数のパターンで表示することが可能となる。
【００１０】
【発明の実施の形態】
以下、本発明の実施の形態を図面に基づいて説明する。
【００１１】
図２は、本発明の実施の形態であるテキストマイニング装置の構成を示す。本装置は、処理装置１０、入力装置６０、及び表示装置７０を備える。処理装置１０は、入力装置６０から入力された情報に従って処理を行い、結果を表示装置７０の入力／出力画面８０に表示する。処理装置１０は、あらかじめ文書データベース２０から関連語辞書４０を生成する関連語辞書作成部３０と、指定された２つのキーワードを結ぶ経路を探索し表示する関連語経路探索部５０とを備える。関連語辞書作成部３０は、単語抽出部３１、及び関連語抽出部３２を備える。関連語経路探索部５０は、関連語リスト作成部５１と、関連語リスト作成部５１から出力される関連語リスト５２と、関連語経路作成部５３と、関連語経路作成部５３から出力される関連語経路リスト５４と、関連語経路リスト５４を表示するための関連語経路表示部５５とを備える。
【００１２】
図３は、図２の関連語辞書作成部３０が文書データベース２０から関連語辞書４０を作成するまでの処理過程を示したフローチャートである。単語抽出部３１では、図４に示すような文書データを文書データベース２０から読み込み（ステップ１０１）、図５に示すように単語の切り出しを行い（ステップ１０２）、図６に示すような単語テーブルを作成し、関連語辞書４０に登録する（ステップ１０３）。単語抽出の方法としては、辞書データを参照して語を切り出す方法、文中で漢字やひらがな等の文字の種類を目印として切り出す方法などがあるが、ここでは、その方式は特に制限しない。
【００１３】
関連語抽出部３２では、単語間の共起関係を抽出し、１つの単語に対して関連のある語を抽出し、図７に示すような共起頻度テーブルに登録する（ステップ１０４）。ここでいう共起関係とは、１文中に共に使用される単語同士を意味する。図７の共起頻度テーブルの共起頻度とは、単語１と単語２とが１文中で共に使用されている回数を表すものである。共起関係の抽出については、単に同一文中に出現する単語というだけではなく、一般的な構文解析方式により、主語、述語の関係や係り受けの関係を求めることもできるが、その方式は特に制限しない。抽出した単語と共起関係をもとに単語間の関連の強さを求め、図８に示すような関連度テーブルを作成し、結果を関連語辞書４０に登録する（ステップ１０５）。なお、共起頻度を単語間の強度（関連度）としてもよいし、単語間の関係の強さを求める手法として知られている図９に示す相互情報量を強度にしてもよい。本実施の形態では、単語間の相互情報量を強度（関連度）とする。
【００１４】
図１０は、図２の関連語リスト作成部５１で関連語リスト５２を作成する処理過程を示したフローチャートである。関連語リスト作成部５１では、ユーザにより指定された２つのキーワードの関連語リスト５２を作成する。まず、図２の入力装置６０から入力された第１キーワードを変数Ａに、距離を変数Ｄに、関連度を変数Ｒに、初期値の距離０を変数Ｄ１に代入し（ステップ２０１）、これらを引数として関連語リスト作成関数を呼び出し（ステップ２０２）、図１１に示すような第１キーワードの関連語リストを作成する。
【００１５】
図１１では、説明の便宜上、第１キーワードをキーワードＡとし、キーワードＡの関連語を関連語Ａ１、関連語Ａ２というように記号で関連語を示している。なお、図１１の関連語リストは行単位のデータの集まりで構成されている。その各行のリストは、先頭要素と、その先頭要素に関連する関連語のリストとを、並べたものである。例えば、図１１の第１行目のリストである［キーワードＡ，［関連語Ａ１，関連語Ａ２，関連語Ａ３，関連語Ａ４］］は、キーワードＡの関連語が［関連語Ａ１，関連語Ａ２，関連語Ａ３，関連語Ａ４］であることを示している。第２行目以降の各行のデータも同様の表現形式であり、さらに後述する図１２の第２キーワード（キーワードＢ）の関連語リストも同じ表現形式である。また、図１１や図１２のデータ全体を「関連語リスト」と呼ぶほか、説明の便宜上、図１１や図１２の各行の先頭要素に関連する関連語のリストも「関連語リスト」と呼ぶものとする。例えば、図１１の第１行目のリストである［キーワードＡ，［関連語Ａ１，関連語Ａ２，関連語Ａ３，関連語Ａ４］］の中で、キーワードＡに関連する関連語を並べたリストである［関連語Ａ１，関連語Ａ２，関連語Ａ３，関連語Ａ４］も「関連語リスト」と呼ぶ。
【００１６】
ステップ２０１，２０２で第１キーワードの関連語リストを作成した後、第１キーワードと同様にして、図２の入力装置６０から入力された第２キーワードをＢに、距離をＤに、関連度をＲに、初期値の距離０をＤ１に代入し（ステップ２０３）、これらを引数として関連語リスト作成関数を呼び出し（ステップ２０４）、図１２に示すような第２キーワードの関連語リストを作成する。図１２では、説明の便宜上、第２キーワードをキーワードＢとし、キーワードＢの関連語を関連語Ｂ１、関連語Ｂ２というように記号で関連語を示している。
【００１７】
図１３は、図１０のステップ２０２とステップ２０４で呼び出している関連語リスト作成関数の処理過程を示したフローチャートである。関連語リスト作成関数では、指定された関連度と距離の範囲内で関連語リストを作成する処理を行う。まず、最初に引数として入力した距離Ｄ１が指定された距離Ｄ以下かを判定し（ステップ３０１）、すでに指定されている距離Ｄ（関連語の数）を超えていたら、リターンする。距離Ｄ１が指定された距離Ｄの範囲内にあれば、引数として入力したキーワードＸの関連語の探索が終了したかをチェックする（ステップ３０２）。この探索は、図８の関連度テーブルからキーワードＸを探索するものである。探索するべきキーワードＸの関連語がある場合には、Ｘの関連語を取得し、Ｘ１に代入する（ステップ３０３）。そして、さらにＸとＸ１の関連度をＲ１に代入する（ステップ３０４）。関連度Ｒ１が指定された関連度Ｒ以下であるかを判定し（ステップ３０５）、指定された関連度Ｒ以下の場合は、Ｘ１を関連語としては取らずに、ステップ３０２に戻って、Ｘの次の関連語を取得する処理を行う。ステップ３０５でＲ１が関連度Ｒ以上の関連度であれば、Ｘ１をＸの関連語リストに追加する処理を行い（ステップ３０６）、ステップ３０２に戻って、Ｘの次の関連語を取得する処理を行う。このようにＸの関連語についてチェックを行い、Ｘの関連語リストを作成する。
【００１８】
Ｘの関連語の探索が終了したら、次にＸの関連語リストの各要素について同じ処理を繰り返す。すなわち、Ｘの関連語リストの各要素について、その要素に関連する関連語を探索して取得する処理を行う。まず、Ｘの関連語リストの探索が終了したかをチェックする（ステップ３０７）。Ｘの関連語リストの探索が終了している場合にはリターンする。Ｘの関連語リストの探索が終了していない場合には、距離Ｄ１に１加算し（ステップ３０８）、Ｘの関連語リストの各要素の探索が終了しているかをチェックする（ステップ３０９）。探索が終了している場合にはリターンする。探索が終了していない場合には、Ｘの関連語リストから要素（未だ探索を行っていない要素）を取り出してＹへ代入する（ステップ３１０）。そして、Ｘの関連語リストの要素Ｙ、指定された距離Ｄ、指定された関連度Ｒ、及び変数Ｄ１を引数として、本関連語リスト作成関数の再帰呼び出しを行う（ステップ３１１）。以上のような処理を行い、図１１に示す第１キーワードの関連語リストと図１２に示す第２キーワードの関連語リストを作成する。
【００１９】
図１４は、図１１と図１２の関連語リストをもとに関連語経路リスト５４を作成する関連語経路作成部５３の処理過程を示したフローチャートである。まず、図１１に示した第１キーワード（キーワードＡ）の関連語リストと図１２に示した第２キーワード（キーワードＢ）の関連語リストから、図１５に示す共通関連語リストを作成する（ステップ４０１）。次に、図１５に示す共通関連語リストの左側の要素から図１６に示すキーワードＡに至る部分リストを作成する（ステップ４０２）。さらに、図１５に示す共通関連語リストの右側の要素から図１７に示すキーワードＢに至る部分リストを作成する（ステップ４０３）。図１６及び図１７の部分リストを作成したら、図１５に示す共通関連語リストの左側の要素から図１６に示すキーワードＡに至る部分リストを利用して、図１８に示すキーワードＡに至る関連語経路リストを作成する（ステップ４０４）。またさらに、図１５に示す共通関連語リストの右側の要素から図１７に示すキーワードＢに至る部分リストを利用して、図１９に示すキーワードＢに至る関連語経路リストを作成する（ステップ４０５）。そして最後に、キーワードＡに至る関連語経路リストとキーワードＢに至る関連語経路リストを結合して、図２０に示すようなキーワードＡからキーワードＢに至る関連語経路リストを作成する（ステップ４０６）。
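[0020]
The processing of each step in FIG. 14 is described in detail below, in order.
[0021]
FIG. 21 is a flowchart showing the process of creating the common related word list in step 401 of FIG. 14. First, it is checked whether the search of the related word list of keyword A, the first keyword (FIG. 11), has been completed (step 501); if it has, the process ends. If it has not, it is checked whether the search of each element (keyword) in the related word list of keyword A has been completed (step 502); if it has, the process returns to step 501. If it has not, the next element is taken from the related word list of keyword A and substituted into X (step 503), and the process proceeds to step 504.
[0022]
Step 501 checks, while processing proceeds with each row of the related word list shown in FIG. 11 as the unit of processing, whether all rows have been processed. That is, processing starts from the first row of the related word list of FIG. 11, and each time one row has been processed, step 501 makes the next row the processing target and the process proceeds to step 502. When no rows remain to be processed, the process ends at step 501. Step 502 checks whether all elements of the related word list in the row being processed have been processed. The element taken out in step 503 is an element of the related word list in the row being processed.
[0023]
If step 502 finds that the search of all elements of the related word list in the current row has not been completed, the next element is taken out and substituted into X (step 503), and it is checked whether the search of the related word list of keyword B (FIG. 12) has been completed (step 504). If it has, the process returns to step 502. If it has not, it is checked whether the search of each element (keyword) among the related words of keyword B has been completed (step 505). If it has, the process returns to step 504. If it has not, an element (keyword) is taken from the related word list of keyword B and substituted into Y (step 506), and it is determined whether X and Y are the same (step 507). If X and Y are the same, an entry of the common related word list shown in FIG. 15 is created from keyword X and the head element (keyword) of the related word list containing the matching Y, in the list form [X, head element of Y's related word list], and the process returns to step 506. If X and Y are not the same, the process returns to step 505.
[0024]
Steps 504, 505, and 506 are the same processing as steps 501, 502, and 503, respectively, except that the related word list being processed is that of keyword B in FIG. 12 and the variable into which step 506 substitutes the element is Y.
[0025]
For example, the elements in the related word list of keyword A in FIG. 11 that are also contained in the related word list of keyword B in FIG. 12 are related word A3, related word A11, and related word A12. Their head elements (the head elements in the related word list of FIG. 12) are related word B3 for related word A3, related words B2 and B11 for related word A11, and related words B1 and B11 for related word A12, so the list becomes as shown in FIG. 15. The left element of each row of the common related word list in FIG. 15 is an element of the right-hand related word list of some row in FIG. 11 that also appears in the right-hand related word list of some row in FIG. 12. The right element of each row is the head element (in the related word list of FIG. 12) corresponding to that left element.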
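[0026]
FIG. 22 is a flowchart showing the process of creating the partial lists leading to keyword A in step 402 of FIG. 14. First, it is checked whether the search of the common related word list of FIG. 15 has been completed (step 601); if it has, the process ends. That is, processing starts from the first row of the common related word list, and each time one row has been processed, step 601 makes the next row the processing target and the process proceeds to step 602. When no rows remain, the process ends at step 601. If the search has not been completed at step 601, the left element of the row being processed is taken out and substituted into X (step 602), and it is checked whether X is keyword A, that is, the terminal keyword (step 603). If X is the same as keyword A, the process returns to step 601 and takes the next row of the common related word list as the processing target.
[0027]
If X is not the same as keyword A at step 603, the head element of a related word list in FIG. 11 that contains X is substituted into Y (step 604). That is, a row whose right-hand related word list contains X is found in FIG. 11, and its head element is substituted into Y. Next, it is checked whether the partial list for Y and X has already been created (step 605); if so, the process returns to step 601. If not, the partial list for Y and X is created (step 606). This partial list has the form [X, Y]. Next, it is checked whether Y is keyword A, that is, the terminal keyword (step 607). If Y is the same as keyword A, the process returns to step 601; if not, Y is substituted into X (step 608) and the process returns to step 604, repeating the creation of the partial lists leading to keyword A shown in FIG. 16.
[0028]
For example, searching the related word list of keyword A in FIG. 11 with related word A11, the left element of the second row of the common related word list in FIG. 15, as the key finds the lists [related word A1, [related word A11, related word A12, related word A13]] and [related word A2, [related word A21, related word A22, related word A23, related word A11]]. Their head elements are related word A1 and related word A2, so [related word A11, related word A1] and [related word A11, related word A2] are created as partial lists leading to keyword A.
[0029]
FIG. 23 is a flowchart showing the process of creating the partial lists leading to keyword B in step 403 of FIG. 14. First, it is checked whether the search of the common related word list of FIG. 15 has been completed (step 701); if it has, the process ends. That is, processing starts from the first row of the common related word list, and each time one row has been processed, step 701 makes the next row the processing target and the process proceeds to step 702. When no rows remain, the process ends at step 701. If the search has not been completed at step 701, the right element of the row being processed is taken out and substituted into X (step 702), and it is checked whether X is keyword B, that is, the terminal keyword (step 703). If X is the same as keyword B, the process returns to step 701 and takes the next row of the common related word list as the processing target.
[0030]
If X is not the same as keyword B at step 703, the head element of a related word list in FIG. 12 that contains X is substituted into Y (step 704). That is, a row whose right-hand related word list contains X is found in FIG. 12, and its head element is substituted into Y. Next, it is checked whether the partial list for Y and X has already been created (step 705); if so, the process returns to step 701. If not, the partial list for Y and X is created (step 706). This partial list has the form [X, Y]. Next, it is checked whether Y is keyword B, that is, the terminal keyword (step 707). If Y is the same as keyword B, the process returns to step 701; if not, Y is substituted into X (step 708) and the process returns to step 704, repeating the creation of the partial lists leading to keyword B shown in FIG. 17.
[0031]
For example, searching the related word list of keyword B in FIG. 12 with related word B11, the right element of the third row of the common related word list in FIG. 15, as the key finds the lists [related word B1, [related word B11, related word A12, related word B12, related word B13]] and [related word B2, [related word B21, related word B11, related word A11]]. Their head elements are related word B1 and related word B2, so [related word B11, related word B1] and [related word B11, related word B2] are created as partial lists leading to keyword B.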
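[0032]
FIG. 24 is a flowchart showing the process of creating the related word path list leading to keyword A in step 404 of FIG. 14. First, it is checked whether the search of the common related word list of FIG. 15 has been completed (step 801); if it has, the process ends. That is, processing starts from the first row of the common related word list, and each time one row has been processed, step 801 makes the next row the processing target and the process proceeds to step 802. When no rows remain, the process ends at step 801. If the search has not been completed at step 801, the left element of the row being processed is taken out and substituted into X (step 802), and a related word path list for X is created (step 803). The initial related word path list is [X], a list whose only element is X. Next, X is substituted into X1 (step 804), and it is checked whether the search of the partial lists leading to keyword A in FIG. 16 has been completed (step 805). This search looks, among the partial lists of FIG. 16, for lists whose left element matches X1.
[0033]
If the search of the partial lists leading to keyword A in FIG. 16 has been completed, the process returns to step 801 and takes the next row of the common related word list of FIG. 15 as the processing target. If not, the right element Y of a partial list in FIG. 16 whose left element matches X1 is obtained (step 806), and Y is added to the related word path list of X (step 807). This inserts Y as the head element of X's related word path list. Next, it is checked whether Y is keyword A, that is, the terminal keyword (step 808). If it is, the process returns to step 801; if not, Y is substituted into X1 (step 809) and the process returns to step 806, repeating the creation of the related word path list leading to keyword A shown in FIG. 18.
[0034]
For example, searching the partial lists leading to keyword A in FIG. 16 with related word A12, the left element of the fourth row of the common related word list in FIG. 15, as the key finds [related word A12, related word A1] and [related word A12, related word A11] as the lists whose left element matches related word A12. Taking the former, its right element is related word A1, so related word A1 is first added to the related word path list [related word A12] of related word A12 to create [related word A1, related word A12]. Since related word A1 is not keyword A, the terminal keyword, the partial lists leading to keyword A are searched again, and the list [related word A1, keyword A], whose left element matches related word A1, is found. Its right element, keyword A, is added to the related word path list of related word A12 to create the list [keyword A, related word A1, related word A12].
[0035]
FIG. 25 is a flowchart showing the process of creating the related word path list leading to keyword B in step 405 of FIG. 14. First, it is checked whether the search of the common related word list of FIG. 15 has been completed (step 901); if it has, the process ends. That is, processing starts from the first row of the common related word list, and each time one row has been processed, step 901 makes the next row the processing target and the process proceeds to step 902. When no rows remain, the process ends at step 901. If the search has not been completed at step 901, the right element of the row being processed is taken out and substituted into X (step 902), and a related word path list for X is created (step 903). The initial related word path list is [X], a list whose only element is X. Next, X is substituted into X1 (step 904), and it is checked whether the search of the partial lists leading to keyword B in FIG. 17 has been completed (step 905). This search looks, among the partial lists of FIG. 17, for lists whose left element matches X1.
[0036]
If the search of the partial lists leading to keyword B in FIG. 17 has been completed, the process returns to step 901 and takes the next row of the common related word list of FIG. 15 as the processing target. If not, the right element Y of a partial list in FIG. 17 whose left element matches X1 is obtained (step 906), and Y is added to the related word path list of X (step 907). This inserts Y as the last element of X's related word path list. Next, it is checked whether Y is keyword B, that is, the terminal keyword (step 908). If it is, the process returns to step 901; if not, Y is substituted into X1 (step 909) and the process returns to step 906, repeating the creation of the related word path list leading to keyword B shown in FIG. 19.
[0037]
For example, searching the partial lists leading to keyword B in FIG. 17 with related word B11, the right element of the third row of the common related word list in FIG. 15, as the key finds [related word B11, related word B1] and [related word B11, related word B2] as the lists whose left element matches related word B11. Taking the former, its right element is related word B1, so related word B1 is first added to the related word path list [related word B11] of related word B11 to create [related word B11, related word B1]. Since related word B1 is not keyword B, the terminal keyword, the partial lists leading to keyword B are searched again, and the list [related word B1, keyword B], whose left element matches related word B1, is found. Its right element, keyword B, is added to the related word path list of related word B11 to create the list [related word B11, related word B1, keyword B].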
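[0038]
FIG. 26 is a flowchart showing the process of creating the related word path list from keyword A to keyword B in step 406 of FIG. 14. First, it is checked whether the search of the common related word list of FIG. 15 has been completed (step 1001); if it has, the process ends. That is, processing starts from the first row of the common related word list, and each time one row has been processed, step 1001 makes the next row the processing target and the process proceeds to step 1002. When no rows remain, the process ends at step 1001. If the search has not been completed at step 1001, the left element of the row being processed is taken out and substituted into X (step 1002), and the right element is taken out and substituted into Y (step 1003). Next, it is checked whether the search of the related word path list leading to keyword A in FIG. 18 has been completed (step 1004). This search looks, in the related word path list of FIG. 18, for lists whose rightmost element matches X.
[0039]
If the search of the related word path list leading to keyword A in FIG. 18 has been completed, the process returns to step 1001 and takes the next row of the common related word list of FIG. 15 as the processing target. If not, a list whose rightmost element matches X is obtained from the related word path list of FIG. 18 and substituted into L1 (step 1005). Next, it is checked whether the search of the related word path list leading to keyword B in FIG. 19 has been completed (step 1006). This search looks, in the related word path list of FIG. 19, for lists whose leftmost element matches Y.
[0040]
If the search of the related word path list leading to keyword B in FIG. 19 has been completed, the process returns to step 1004. If not, a list whose leftmost element matches Y is obtained from the related word path list of FIG. 19 and substituted into L2 (step 1007). List L1 and list L2 are then joined to form a related word path list from keyword A to keyword B (step 1008). After the lists are joined, the process returns to step 1006 and repeats the joining with other lists.
[0041]
For example, consider the first row of the common related word list in FIG. 15, [related word A3, related word B3]. Its left element is related word A3, and the list in the related word path list leading to keyword A in FIG. 18 whose rightmost element matches related word A3 is [keyword A, related word A3]. The right element of the same row is related word B3, and the list in the related word path list leading to keyword B in FIG. 19 whose leftmost element matches related word B3 is [related word B3, keyword B]. These two lists are therefore joined to create the list [keyword A, related word A3, related word B3, keyword B]. Through the above processing, the related word path list from the first keyword (keyword A) to the second keyword (keyword B) as in FIG. 20 is created. In FIG. 20, each row is a list representing one path from keyword A to keyword B.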
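[0042]
FIG. 27 is a flowchart showing the process by which the related word path display unit 55 of FIG. 2 creates display data. The related word path display unit 55 creates display data from the related word path list 54 (FIG. 20). First, the average relevance of each path is calculated (step 1101), and the display order is determined (step 1102). The average relevance of a path is the average of the relevance values between adjacent elements in the path's list. The display order is the order specified by the user. For shortest-path order, the lists in the related word path list are displayed starting from the one with the fewest elements; for longest-path order, starting from the one with the most elements. For descending order of average relationship strength, paths are displayed in descending order of the average calculated in step 1101. The display order may also be determined by a combination of these. Once the paths to display have been determined, the display data is created (step 1103). As an option, the display color of paths whose relevance is no greater than a predetermined value can be changed. The display color of keywords whose appearance frequency is no greater than a predetermined value can also be changed. For example, if red is specified as the display color for paths with relevance 3.0 or less, the path between the corresponding keywords is set to red. If blue is specified as the display color for related words with an appearance frequency of 50 or less, the background of the corresponding keyword is set to blue.
[0043]
FIG. 28 shows an example of display data created in HTML (HyperText Markup Language) format. In the example of FIG. 28, the path between "characters" and "hard to read" is set to red, and the background of "hard to read" is set to blue. Display data is created according to the specified options in this way, and the search result is displayed on the input/output screen 80 of the display device 70 in FIG. 2.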
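The display-data step (step 1103) can be illustrated with a minimal sketch. The markup of FIG. 28 is not reproduced in the text, so the `<span>` elements, the inline styles, the function name `render_path`, and all sample relevance and frequency values below are hypothetical; only the thresholds (relevance 3.0 or less shown in red, frequency 50 or less shown with a blue background) come from the description.

```python
from html import escape


def render_path(path, relevance, frequency,
                weak_relevance=3.0, rare_frequency=50,
                edge_color="red", word_color="blue"):
    """Return an HTML fragment for one path (hypothetical markup)."""
    parts = []
    for i, word in enumerate(path):
        # Keywords whose appearance frequency is at or below the threshold
        # get a coloured background.
        style = (f' style="background-color:{word_color}"'
                 if frequency[word] <= rare_frequency else "")
        parts.append(f"<span{style}>{escape(word)}</span>")
        if i + 1 < len(path):
            # Links whose relevance is at or below the threshold are coloured.
            r = relevance[frozenset((word, path[i + 1]))]
            style = f' style="color:{edge_color}"' if r <= weak_relevance else ""
            parts.append(f"<span{style}>&rarr;</span>")
    return "".join(parts)


# Hypothetical sample values, loosely following the example keywords.
sample_path = ["PC", "characters", "hard to read", "elderly people"]
sample_relevance = {
    frozenset(("PC", "characters")): 4.2,
    frozenset(("characters", "hard to read")): 2.5,
    frozenset(("hard to read", "elderly people")): 3.8,
}
sample_frequency = {"PC": 300, "characters": 120,
                    "hard to read": 40, "elderly people": 200}

markup = render_path(sample_path, sample_relevance, sample_frequency)
print(markup)
```

Other display attributes (line thickness, highlighting, blinking) could be switched in the same place where the inline style is chosen.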
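[0044]
Although an example of changing the display color has been described here, the display attribute to be changed is not limited to color. For example, line thickness, highlighting, or blinking may be changed instead.
[0045]
FIG. 29 shows a display example of search results, obtained by entering "PC" and "elderly people" as the keywords. In the example of FIG. 29, keywords such as "Internet", "e-mail", and "motivation to learn" appear frequently between "PC" and "elderly people". The path between "difficult" and "motivation to learn" and the path between "characters" and "hard to read" are highlighted in red as specified by the user, and the keyword "hard to read" is highlighted in blue. From this, one can infer, as a requirement for a PC for elderly people, something like "operation, including e-mail, should be simple, and the characters on the keyboard and on the display should be larger." Displaying the related words between keywords in this way makes it possible to obtain information that could not be obtained with conventional network-format display methods.
[0046]
The present invention is not limited to the embodiment described with reference to FIGS. 1 to 29, and various modifications are possible without departing from its gist. For example, although display in HTML format has been described in the above embodiment, a graphic image may be created and displayed instead.
[0047]
According to the above embodiment, a plurality of related word paths from the first keyword to the second keyword are displayed, so it is possible to grasp through which words two words are connected. Displaying such related word paths makes it possible to present the user with unknown information that had not been noticed before. In addition, because the display can be switched among shortest-path order, longest-path order, and ascending or descending order of average relationship strength, the connections between words can be viewed from different perspectives. From various reports such as daily sales reports and store managers' journals, or from general newspaper data, it becomes possible not only to judge trends in how keywords of interest are related but also, by examining the connections, to discover new knowledge. For example, latent needs, such as which functions are required when developing products like PCs or mobile phones for elderly people, can be explored even when sufficient questionnaire data cannot be collected.
[0048]
[Effects of the invention]
As described above, according to the present invention, by displaying in advance a plurality of related word paths from the first keyword to the second keyword, information on through which words the two keywords are connected is presented, and by displaying such related word paths, unknown information that had not been noticed before can be presented to the user.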
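[Brief description of the drawings]
[FIG. 1] A diagram displaying associative relationships between words in a network format.
[FIG. 2] A diagram showing the configuration of a text mining apparatus according to an embodiment of the present invention.
[FIG. 3] A flowchart showing the process of creating the related word dictionary from the document database.
[FIG. 4] A diagram showing an example of document data read from the document database.
[FIG. 5] A diagram showing an example of words analyzed and segmented by the word extraction unit.
[FIG. 6] A diagram showing an example of the word table created by the word extraction unit.
[FIG. 7] A diagram showing an example of the co-occurrence frequency table created by the related word extraction unit.
[FIG. 8] A diagram showing an example of the relevance table created by the related word extraction unit.
[FIG. 9] A diagram showing the formula for the mutual information used in creating the relevance table.
[FIG. 10] A flowchart showing the process by which the related word list creation unit creates the related word list.
[FIG. 11] A diagram showing an example of the related word list of the first keyword.
[FIG. 12] A diagram showing an example of the related word list of the second keyword.
[FIG. 13] A flowchart showing the process of the related word list creation function.
[FIG. 14] A flowchart showing the process by which the related word path creation unit creates the related word path list.
[FIG. 15] A diagram showing an example of the list of related words common to the related word list of the first keyword and the related word list of the second keyword.
[FIG. 16] A diagram showing an example of the partial lists leading to the first keyword (keyword A).
[FIG. 17] A diagram showing an example of the partial lists leading to the second keyword (keyword B).
[FIG. 18] A diagram showing an example of the related word path list leading to the first keyword (keyword A).
[FIG. 19] A diagram showing an example of the related word path list leading to the second keyword (keyword B).
[FIG. 20] A diagram showing an example of the related word path list from the first keyword (keyword A) to the second keyword (keyword B).
[FIG. 21] A flowchart showing the process of creating the common related word list.
[FIG. 22] A flowchart showing the process of creating the partial lists leading to the first keyword (keyword A).
[FIG. 23] A flowchart showing the process of creating the partial lists leading to the second keyword (keyword B).
[FIG. 24] A flowchart showing the process of creating the related word path list leading to the first keyword (keyword A).
[FIG. 25] A flowchart showing the process of creating the related word path list leading to the second keyword (keyword B).
[FIG. 26] A flowchart showing the process of creating the related word path list from the first keyword (keyword A) to the second keyword (keyword B).
[FIG. 27] A flowchart showing the process by which the related word path display unit creates display data.
[FIG. 28] A diagram showing an example of display data created in HTML format.
[FIG. 29] A diagram showing the search results for the related word paths from the first keyword to the second keyword.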
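[Explanation of reference numerals]
10: Processing device
20: Document database
30: Related word dictionary creation unit
31: Word extraction unit
32: Related word extraction unit
40: Related word dictionary
50: Related word path search unit
51: Related word list creation unit
52: Related word list
53: Related word path creation unit
54: Related word path list
55: Related word path display unit
60: Input device
70: Display device
80: Input/output screen
[0001]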
[Technical field of the invention]
The present invention relates to a text mining program, method, and apparatus for supporting a process of analyzing accumulated text data, grasping characteristics and trends, and discovering unknown information.
[0002]
[Prior art]
Conventionally, many text mining techniques have been proposed for analyzing accumulated text data to grasp characteristics and trends and to discover unknown information, including document classification methods, important-word extraction methods, methods for classifying extracted words, and methods for displaying the relationships between extracted words. Among them, as a text mining visualization technique, a method of displaying the associative relationships between words in a network format as shown in FIG. 1 has been proposed. In FIG. 1, each word enclosed in a rectangle is a keyword extracted from the text under analysis, and the numerical value attached to a path connecting two keywords indicates the relevance between those keywords. In addition, "Visual Text Mining" in the Journal of the Japanese Society for Artificial Intelligence, Vol. 16, No. 2 (March 2001) proposes word maps, anchor maps, and skeleton maps as text mining visualization techniques.
[0003]
These are display methods for grasping trends, mainly by displaying in a network the keywords that are directly related to the keyword of interest. Furthermore, Japanese Patent Laid-Open No. 2001-117935 and Published Japanese Translation of PCT Application No. 2001-513242 propose a method in which clicking a keyword displayed in the network expands the keywords related to the clicked keyword, thereby showing indirect relationships. However, to see indirect relationships, the user had to specify the keywords.
[0004]
[Problems to be solved by the invention]
In the above prior art, keywords strongly and directly related to a specified keyword or to a plurality of keywords are displayed in a network format or a list format. It is therefore possible to grasp direct relationships or direct connections, but not indirect connections. In other words, the direct connection between words can be seen, but the words that lie between them cannot. Consequently, the user can see only associative relationships that can already be guessed to some extent. Moreover, even when a keyword is selected and its related words are displayed step by step, the user has to decide and operate the direction of the search, so only a limited set of paths is displayed.
[0005]
The object of the present invention is to present information on which words connect two keywords by displaying in advance a plurality of related word paths from the first keyword to the second keyword, and, by displaying such related word paths, to present the user with unknown information that had not been noticed before.
[0006]
[Means for Solving the Problems]
To achieve the above object, the present invention is characterized by searching for paths of related words that connect two specified keywords and displaying the paths found. For example, within a range in which the distance from the specified first keyword is no more than a predetermined value, related words of the first keyword and related words connected to the first keyword via any of those related words are searched, and a related word list of the first keyword is created from the search result. A related word list of the specified second keyword is created in the same way. Common related words that appear in both related word lists are then searched, paths from the first keyword to the second keyword via those common related words are obtained, and the obtained paths are displayed.
[0007]
Specifically, a related word dictionary creation unit that creates a related word dictionary from a document database and a related word path search unit that obtains the paths between keywords from the related word dictionary are provided. This makes it possible to create path information between two specified keywords. In addition to the keyword specification area, the input screen has an area for specifying the distance (number of related words). This makes it possible to search for related word paths using the distance (number of related words) from the first keyword to the second keyword as a threshold. Likewise, an input area for specifying the strength of the relationship between keywords (relevance) is provided. This makes it possible to search for related word paths using the strength of the relationship (relevance) as a threshold. In this way, the search processing is executed with the two specified keywords, the distance, and the relevance as inputs, and path information is created.
[0008]
Further, as options for displaying the created path information, an area for specifying the order in which paths are displayed and areas for specifying the display color of paths and of keywords according to the relevance between keywords and the appearance frequency of keywords are provided. The order of the displayed paths can be specified either as shortest-path order or longest-path order, or as ascending or descending order of the average relationship strength of a path. When shortest-path order is specified, paths are displayed in order of increasing length. When longest-path order is specified, paths are displayed in order of decreasing length. When ascending order of average relationship strength is specified, paths are displayed in increasing order of the average relationship strength between the keywords that make up each path. When descending order is specified, paths are displayed in decreasing order of that average. In addition, for these display orders, it is possible to specify whether the path length or the average relevance takes priority.
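The ordering options described above can be sketched as a two-level sort. This is only an illustration under assumptions: the function names, the parameter names, the representation of a path as a list of keywords, and the relevance table keyed by unordered keyword pairs are all hypothetical and not taken from the patent; the sample values are made up.

```python
from statistics import mean


def average_relevance(path, relevance):
    """Average relevance between adjacent keywords of a path."""
    return mean(relevance[frozenset(pair)] for pair in zip(path, path[1:]))


def order_paths(paths, relevance, length_order="shortest",
                strength_order="descending", priority="length"):
    """Sort paths by length and by average relevance.

    `priority` selects which criterion is compared first; the other one
    only breaks ties.
    """
    def length_key(path):
        return len(path) if length_order == "shortest" else -len(path)

    def strength_key(path):
        avg = average_relevance(path, relevance)
        return avg if strength_order == "ascending" else -avg

    if priority == "length":
        return sorted(paths, key=lambda p: (length_key(p), strength_key(p)))
    return sorted(paths, key=lambda p: (strength_key(p), length_key(p)))


# Hypothetical sample data.
relevance = {
    frozenset(("A", "X")): 4.0, frozenset(("X", "B")): 2.0,
    frozenset(("A", "Y")): 5.0, frozenset(("Y", "B")): 5.0,
    frozenset(("A", "Z")): 1.0, frozenset(("Z", "W")): 1.0,
    frozenset(("W", "B")): 1.0,
}
paths = [["A", "X", "B"], ["A", "Z", "W", "B"], ["A", "Y", "B"]]

for p in order_paths(paths, relevance):
    print(p, average_relevance(p, relevance))
```

With the default arguments the shortest paths come first and, among paths of equal length, the one with the higher average relevance comes first.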
[0009]
Furthermore, an option for specifying the display color of a path according to the strength of the relationship between keywords and an option for specifying the display color according to the appearance frequency of a keyword may be provided. Color-coding the strength of the relationship between keywords makes it possible to grasp how strong or weak each relationship is. Color-coding the keyword frequency information also makes it possible to grasp, at the same time, information about the keyword itself, namely whether it is a low-frequency or a high-frequency word. By providing display options in this way, the created path information can be displayed in multiple patterns.
[0010]
[Embodiments of the invention]
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0011]
FIG. 2 shows the configuration of a text mining apparatus according to an embodiment of the present invention. The apparatus includes a processing device 10, an input device 60, and a display device 70. The processing device 10 performs processing according to the information input from the input device 60 and displays the result on the input/output screen 80 of the display device 70. The processing device 10 includes a related word dictionary creation unit 30 that generates a related word dictionary 40 from the document database 20 in advance, and a related word path search unit 50 that searches for and displays paths connecting two specified keywords. The related word dictionary creation unit 30 includes a word extraction unit 31 and a related word extraction unit 32. The related word path search unit 50 includes a related word list creation unit 51, a related word list 52 output from the related word list creation unit 51, a related word path creation unit 53, a related word path list 54 output from the related word path creation unit 53, and a related word path display unit 55 for displaying the related word path list 54.
[0012]
FIG. 3 is a flowchart showing the process by which the related word dictionary creation unit 30 in FIG. 2 creates the related word dictionary 40 from the document database 20. The word extraction unit 31 reads document data as shown in FIG. 4 from the document database 20 (step 101), segments it into words as shown in FIG. 5 (step 102), creates a word table as shown in FIG. 6, and registers it in the related word dictionary 40 (step 103). Word extraction methods include segmenting words by referring to dictionary data and segmenting at changes in character type, such as kanji and hiragana, within a sentence; the method is not particularly limited here.
[0013]
The related word extraction unit 32 extracts co-occurrence relationships between words, extracts the words related to each word, and registers them in a co-occurrence frequency table as shown in FIG. 7 (step 104). A co-occurrence relationship here means words that are used together in one sentence. The co-occurrence frequency in the table of FIG. 7 is the number of times word 1 and word 2 are used together in one sentence. Co-occurrence relationships need not be limited to words appearing in the same sentence; subject-predicate relationships and dependency relationships can also be obtained with a general syntactic analysis method, and the method is not particularly limited. Based on the extracted words and co-occurrence relationships, the strength of the relationship between words is obtained, a relevance table as shown in FIG. 8 is created, and the result is registered in the related word dictionary 40 (step 105). The co-occurrence frequency itself may be used as the strength (relevance) between words, or the mutual information shown in FIG. 9, a known measure of the strength of the relationship between words, may be used. In the present embodiment, the mutual information between words is used as the strength (relevance).
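Steps 104 and 105 can be sketched as follows. The formula of FIG. 9 is not reproduced in the text, so this sketch assumes the common pointwise form of mutual information, log2(P(x, y) / (P(x) P(y))), with probabilities estimated as the fraction of sentences containing the word or the word pair. The function name, the input format (sentences already segmented into words), and the sample sentences are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def build_relevance_table(sentences):
    """Return (co-occurrence counts, relevance) for segmented sentences.

    `sentences` is a list of word lists. Pairs are stored as tuples in
    alphabetical order so that (w1, w2) and (w2, w1) share one entry.
    """
    word_freq = Counter()  # number of sentences containing each word
    pair_freq = Counter()  # number of sentences containing both words
    for words in sentences:
        unique = sorted(set(words))
        word_freq.update(unique)
        pair_freq.update(combinations(unique, 2))

    n = len(sentences)
    relevance = {}
    for (w1, w2), count in pair_freq.items():
        # Assumed formula: pointwise mutual information in bits.
        p_xy = count / n
        p_x = word_freq[w1] / n
        p_y = word_freq[w2] / n
        relevance[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return pair_freq, relevance


# Hypothetical, already segmented sample sentences.
sentences = [
    ["pc", "mail"],
    ["pc", "mail"],
    ["pc", "net"],
    ["senior", "net"],
]
pair_freq, relevance = build_relevance_table(sentences)
print(pair_freq[("mail", "pc")], relevance[("net", "senior")])
```

Using the raw co-occurrence count as the relevance, which the text also allows, would amount to returning `pair_freq` directly.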
[0014]
FIG. 10 is a flowchart showing the process by which the related word list creation unit 51 of FIG. 2 creates the related word list 52. The related word list creation unit 51 creates a related word list 52 for each of the two keywords specified by the user. First, the first keyword input from the input device 60 of FIG. 2 is substituted into variable A, the distance into variable D, the relevance into variable R, and the initial distance 0 into variable D1 (step 201). The related word list creation function is then called with these as arguments (step 202) to create the related word list of the first keyword as shown in FIG. 11.
[0015]
In FIG. 11, for convenience of explanation, the first keyword is called keyword A, and the related words of keyword A are denoted by symbols such as related word A1 and related word A2. The related word list in FIG. 11 is a collection of rows. Each row consists of a head element followed by the list of related words related to that head element. For example, the first row in FIG. 11, [keyword A, [related word A1, related word A2, related word A3, related word A4]], indicates that the related words of keyword A are [related word A1, related word A2, related word A3, related word A4]. The second and subsequent rows have the same form, as does the related word list of the second keyword (keyword B) in FIG. 12, described later. The whole of the data in FIG. 11 or FIG. 12 is called a "related word list"; for convenience of explanation, the list of related words related to the head element of each row in FIG. 11 or FIG. 12 is also called a "related word list". For example, within the first row of FIG. 11, [keyword A, [related word A1, related word A2, related word A3, related word A4]], the list [related word A1, related word A2, related word A3, related word A4] of related words related to keyword A is also called a "related word list".
[0016]
After the related word list of the first keyword has been created in steps 201 and 202, the second keyword input from the input device 60 of FIG. 2 is, in the same way as the first keyword, substituted into B, the distance into D, the relevance into R, and the initial distance 0 into D1 (step 203). The related word list creation function is called with these as arguments (step 204) to create the related word list of the second keyword as shown in FIG. 12. In FIG. 12, for convenience of explanation, the second keyword is called keyword B, and the related words of keyword B are denoted by symbols such as related word B1 and related word B2.
[0017]
FIG. 13 is a flowchart showing the process of the related word list creation function called in steps 202 and 204 of FIG. 10. The function creates a related word list within the specified relevance and distance. First, it determines whether the distance D1 passed as an argument is no greater than the specified distance D (step 301); if D1 already exceeds the specified distance D (number of related words), the function returns. If D1 is within the specified distance D, it checks whether the search for related words of the keyword X passed as an argument has been completed (step 302). This search looks up keyword X in the relevance table of FIG. 8. If a related word of X remains to be examined, it is obtained and substituted into X1 (step 303), and the relevance between X and X1 is substituted into R1 (step 304). It is then determined whether R1 is no greater than the specified relevance R (step 305). If it is, X1 is not taken as a related word, and the process returns to step 302 to obtain the next related word of X. If R1 is greater than R at step 305, X1 is added to the related word list of X (step 306), and the process returns to step 302 to obtain the next related word of X. The related words of X are checked in this way to create the related word list of X.
[0018]
When the search for the related word of X is completed, the same process is repeated for each element of the related word list of X. That is, for each element in the X related word list, a process for searching for and acquiring related words related to the element is performed. First, it is checked whether or not the search for the related word list of X is completed (step 307). If the search for the related word list of X has been completed, the process returns. If the search of the related word list of X is not completed, 1 is added to the distance D1 (step 308), and it is checked whether the search of each element of the related word list of X is completed (step 309). If the search is complete, return. If the search has not ended, an element (an element that has not been searched yet) is taken out from the related word list of X and assigned to Y (step 310). Then, the related word list creation function is recursively called with the element Y of the related word list of X, the specified distance D, the specified relevance level R, and the variable D1 as arguments (step 311). The above processing is performed to create a related word list of the first keyword shown in FIG. 11 and a related word list of the second keyword shown in FIG.
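The recursive procedure of FIG. 13 can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the function name `build_related_word_list`, the dictionary layout of the relevance table, and the sample values are all assumptions, and step 307 is interpreted here as "return when no related word was kept".

```python
def build_related_word_list(keyword, max_distance, min_relevance, relevance_table):
    """Return rows of the form [word, [related words]] starting from keyword."""
    rows = []

    def visit(x, d1):
        # Step 301: stop once the current distance exceeds the specified distance D.
        if d1 > max_distance:
            return
        related = []
        # Steps 302-306: skip related words whose relevance is R or less, keep the rest.
        for x1, r1 in relevance_table.get(x, []):
            if r1 <= min_relevance:
                continue
            related.append(x1)
        # Step 307 (as interpreted here): nothing to expand if no word was kept.
        if not related:
            return
        rows.append([x, related])
        # Steps 308-311: recurse into each kept related word, one step further away.
        for y in related:
            visit(y, d1 + 1)

    visit(keyword, 0)  # the initial distance 0 is substituted into D1
    return rows


# Hypothetical relevance table: word -> [(related word, relevance), ...]
relevance_table = {
    "A": [("A1", 5.0), ("A2", 1.0)],
    "A1": [("A11", 4.0)],
    "A11": [("A111", 4.0)],
}
related_word_list = build_related_word_list("A", 1, 2.0, relevance_table)
print(related_word_list)
```

With the specified distance 1 and relevance 2.0, "A2" is dropped by the relevance check and "A11" is not expanded because the distance limit has been exceeded by then.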
[0019]
FIG. 14 is a flowchart showing the processing of the related word path creation unit 53, which creates the related word path list 54 from the related word lists of FIGS. 11 and 12. First, the common related word list shown in FIG. 15 is created from the related word list of the first keyword (keyword A) shown in FIG. 11 and the related word list of the second keyword (keyword B) shown in FIG. 12 (step 401). Next, the partial list shown in FIG. 16, which leads from the left-side elements of the common related word list of FIG. 15 to keyword A, is created (step 402). Likewise, the partial list shown in FIG. 17, which leads from the right-side elements of the common related word list of FIG. 15 to keyword B, is created (step 403). Using the partial list of FIG. 16, the related word path list reaching keyword A shown in FIG. 18 is then created (step 404), and using the partial list of FIG. 17, the related word path list reaching keyword B shown in FIG. 19 is created (step 405). Finally, the related word path list reaching keyword A and the related word path list reaching keyword B are combined to create the related word path list from keyword A to keyword B shown in FIG. 20 (step 406).
[0020]
Hereinafter, details of the processing of each step of FIG. 14 will be described in order.
[0021]
FIG. 21 is a flowchart showing the process of creating the common related word list in step 401 of FIG. 14. First, it is checked whether the search of the related word list (FIG. 11) of keyword A, the first keyword, has been completed (step 501). If it has, the process ends. If it has not, it is checked whether the search of each element (keyword) in the related word list of keyword A has been completed (step 502). If it has, the process returns to step 501. If it has not, the next element is extracted from the related word list of keyword A and substituted for X (step 503), and the process proceeds to step 504.
[0022]
Note that step 501 checks whether the processing has been completed for all the row data when the processing is carried out using each row data of the related word list shown in FIG. 11 as a processing unit. That is, when the process is started from the first line data of the related word list in FIG. 11 and one line data is processed, the process proceeds to step 502 with the next line data as a processing target in step 501. If there is no more row data to be processed, the process ends from step 501. In step 502, it is checked whether the processing has been completed for all elements of the related word list in the row data to be processed. The elements extracted in step 503 are each element of the related word list in the row data to be processed.
[0023]
If the search of all elements of the related word list in the current row data has not been completed in step 502, the next element is extracted and substituted for X (step 503), and it is checked whether the search of the related word list of keyword B (FIG. 12) has been completed (step 504). If it has, the process returns to step 502. If it has not, it is checked whether the search of each element (keyword) in the related word list of keyword B has been completed (step 505). If it has, the process returns to step 504. If it has not, the next element (keyword) is extracted from the related word list of keyword B and substituted for Y (step 506), and it is determined whether X and Y are the same (step 507). If X and Y are the same, an entry of the common related word list shown in FIG. 15 is created from X and the head element (keyword) of the related word list in which the matching Y was found, in the list format [X, head element of the related word list containing Y], and the process returns to step 506. If X and Y are not the same, the process returns to step 505.
[0024]
Steps 504, 505, and 506 are the same processes as steps 501, 502, and 503, respectively. However, the related word list to be processed is the related word list of keyword B in FIG. 12, and the variable into which the element is substituted in step 506 is Y.
[0025]
For example, the elements that appear in the related word list of keyword A in FIG. 11 and also in the related word list of keyword B in FIG. 12 are related word A3, related word A11, and related word A12. Their head elements (the head elements in the related word list of FIG. 12) are related word B3 for related word A3, related word B2 and related word B11 for related word A11, and related word B1 and related word B11 for related word A12, so the list is as shown in FIG. 15. The left-side element of each row of the common related word list in FIG. 15 is an element of a right-side related word list in FIG. 11 that also appears in a right-side related word list in FIG. 12. The right-side element of each row of the common related word list in FIG. 15 is the head element corresponding to the left-side element (the head element in the related word list of FIG. 12).
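The nested search of FIG. 21 can be sketched as below. The function name, the row layout [head, [related words]], and the sample lists are hypothetical and only loosely modeled on the A/B example; skipping duplicate entries is an assumption of this sketch that the text does not state.

```python
def build_common_related_word_list(list_a, list_b):
    """Pair each word found in both related word lists with its head in list_b."""
    common = []
    for _head_a, words_a in list_a:            # steps 501-502: rows of keyword A's list
        for x in words_a:                      # step 503: next element X
            for head_b, words_b in list_b:     # steps 504-505: rows of keyword B's list
                for y in words_b:              # step 506: next element Y
                    if x == y:                 # step 507: X and Y are the same
                        entry = [x, head_b]
                        # Assumption: an entry is recorded only once.
                        if entry not in common:
                            common.append(entry)
    return common


# Hypothetical related word lists in the [head, [related words]] row format.
list_a = [
    ["keyword A", ["A1", "A2", "A3"]],
    ["A1", ["A11", "A12", "A13"]],
    ["A2", ["A21", "A22", "A23", "A11"]],
]
list_b = [
    ["keyword B", ["B1", "B2", "B3"]],
    ["B1", ["B11", "A12", "B12", "B13"]],
    ["B2", ["B21", "B11", "A11"]],
    ["B3", ["A3"]],
]
common_list = build_common_related_word_list(list_a, list_b)
print(common_list)
```

Each resulting row has the shared word on the left and, on the right, the head of the keyword B row in which it was found.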
[0026]
FIG. 22 is a flowchart showing the process of creating the partial list reaching keyword A in step 402 of FIG. 14. First, it is checked whether the search of the common related word list in FIG. 15 has been completed (step 601). If it has, the process ends. That is, processing starts from the first row of the common related word list in FIG. 15, and each time one row has been processed, step 601 takes the next row as the processing target and the process proceeds to step 602. When no row remains to be processed, the process ends at step 601. If the search has not been completed in step 601, the left-side element of the row being processed in the common related word list in FIG. 15 is extracted and substituted for X (step 602), and it is checked whether X is the same as keyword A (step 603). If X is the same as keyword A, the process returns to step 601 and takes the next row of the common related word list in FIG. 15 as the processing target.
[0027]
If X is not the same as keyword A in step 603, the head element of the related word list that contains X, within the related word list of keyword A in FIG. 11, is substituted for Y (step 604). That is, the row of FIG. 11 whose right-side related word list contains X is found, and its head element is substituted for Y. Next, it is checked whether the partial list for X and Y has already been created (step 605). If it has, the process returns to step 601. If it has not, the partial list for X and Y is created (step 606). This partial list has the format [X, Y]. Next, it is checked whether Y is keyword A, that is, the terminal keyword (step 607). If Y is the same as keyword A, the process returns to step 601. If Y is not the same as keyword A, Y is substituted for X (step 608), and the process returns to step 604, repeating the creation of the partial list reaching keyword A shown in FIG. 16.
[0028]
For example, when the related word list of keyword A in FIG. 11 is searched using as the key related word A11, the left-side element of the second row of the common related word list in FIG. 15, the rows [related word A1, [related word A11, related word A12, related word A13]] and [related word A2, [related word A21, related word A22, related word A23, related word A11]] are found. Their head elements are related word A1 and related word A2, so [related word A11, related word A1] and [related word A11, related word A2] are created as partial lists leading to keyword A.
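A rough sketch of the partial list creation of FIG. 22 follows. The flowchart text follows a single head element at a time, whereas the worked example yields one pair per matching row, so this sketch follows every matching row by means of a work stack; that choice, the function name, and the sample data are assumptions made for illustration.

```python
def build_partial_list(common, related_list, keyword, side):
    """Collect [word, head] pairs leading from common words back to keyword.

    side is 0 to start from the left-side elements of the common list
    (toward keyword A) or 1 to start from the right-side elements.
    """
    pairs = []
    for row in common:                          # step 601: next row of the common list
        stack = [row[side]]                     # step 602: X
        while stack:
            x = stack.pop()
            if x == keyword:                    # step 603: X is already the keyword
                continue
            for head, words in related_list:    # step 604: rows whose list contains X
                if x not in words:
                    continue
                pair = [x, head]
                if pair in pairs:               # step 605: pair already created
                    continue
                pairs.append(pair)              # step 606: create [X, Y]
                if head != keyword:             # steps 607-608: continue from Y
                    stack.append(head)
    return pairs


# Hypothetical data in the same row formats as the figures describe.
list_a = [
    ["keyword A", ["A1", "A2", "A3"]],
    ["A1", ["A11", "A12", "A13"]],
    ["A2", ["A21", "A22", "A23", "A11"]],
]
common_list = [["A3", "B3"], ["A11", "B2"], ["A12", "B1"]]
partial_to_a = build_partial_list(common_list, list_a, "keyword A", 0)
print(partial_to_a)
```

For the key "A11" this produces both ["A11", "A1"] and ["A11", "A2"], matching the worked example. The same function can be pointed at keyword B's list with `side=1`.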
[0029]
FIG. 23 is a flowchart showing the process of creating the partial list reaching keyword B in step 403 of FIG. 14. First, it is checked whether the search of the common related word list in FIG. 15 has been completed (step 701). If it has, the process ends. That is, processing starts from the first row of the common related word list in FIG. 15, and each time one row has been processed, step 701 takes the next row as the processing target and the process proceeds to step 702. When no row remains to be processed, the process ends at step 701. If the search has not been completed in step 701, the right-side element of the row being processed in the common related word list in FIG. 15 is extracted and substituted for X (step 702), and it is checked whether X is the same as keyword B (step 703). If X is the same as keyword B, the process returns to step 701 and takes the next row of the common related word list in FIG. 15 as the processing target.
[0030]
If X is not the same as keyword B in step 703, the head element of the related word list that contains X, within the related word list of keyword B in FIG. 12, is substituted for Y (step 704). That is, the row of FIG. 12 whose right-side related word list contains X is found, and its head element is substituted for Y. Next, it is checked whether the partial list for X and Y has already been created (step 705). If it has, the process returns to step 701. If it has not, the partial list for X and Y is created (step 706). This partial list has the format [X, Y]. Next, it is checked whether Y is keyword B, that is, the terminal keyword (step 707). If Y is the same as keyword B, the process returns to step 701. If Y is not the same as keyword B, Y is substituted for X (step 708), and the process returns to step 704, repeating the creation of the partial list reaching keyword B shown in FIG. 17.
[0031]
For example, when the related word list of keyword B in FIG. 12 is searched using as the key related word B11, the right-side element of the third row of the common related word list in FIG. 15, the rows [related word B1, [related word B11, related word A12, related word B12, related word B13]] and [related word B2, [related word B21, related word B11, related word A11]] are found. Their head elements are related word B1 and related word B2, so [related word B11, related word B1] and [related word B11, related word B2] are created as partial lists leading to keyword B.
[0032]
FIG. 24 is a flowchart showing the process of creating the related word path list reaching keyword A in step 404 of FIG. 14. First, it is checked whether the search of the common related word list in FIG. 15 has been completed (step 801). If it has, the process ends. That is, processing starts from the first row of the common related word list in FIG. 15, and each time one row has been processed, step 801 takes the next row as the processing target and the process proceeds to step 802. When no row remains to be processed, the process ends at step 801. If the search has not been completed in step 801, the left-side element of the row being processed in the common related word list in FIG. 15 is extracted and substituted for X (step 802), and a related word path list for X is created (step 803). The initial related word path list is the list [X], which has only X as an element. Next, X is substituted into X1 (step 804), and it is checked whether the search of the partial list reaching keyword A in FIG. 16 has been completed (step 805). In this search, the partial list reaching keyword A in FIG. 16 is searched for a list whose left-side element matches X1.
[0033]
If the search of the partial list reaching keyword A in FIG. 16 has been completed, the process returns to step 801 and takes the next row of the common related word list in FIG. 15 as the processing target. If it has not, the right-side element Y of a list whose left-side element matches X1 in the partial list reaching keyword A in FIG. 16 is acquired (step 806), and Y is added to the related word path list of X (step 807). Here, Y is inserted as the head element of the related word path list of X. Next, it is checked whether Y is keyword A, that is, the terminal keyword (step 808). If Y is the same as keyword A, the process returns to step 801. If Y is not the same as keyword A, Y is substituted into X1 (step 809), and the process returns to step 806, repeating the creation of the related word path list reaching keyword A shown in FIG. 18.
[0034]
For example, when the partial list reaching keyword A in FIG. 16 is searched using related word A12, the left-side element of the fourth row of the common related word list in FIG. 15, the lists whose left-side element matches related word A12 are [related word A12, related word A1] and [related word A12, related word A11]. Taking the former, the right-side element of [related word A12, related word A1] is related word A1, so related word A1 is first added to the related word path list [related word A12] of related word A12 to create [related word A1, related word A12]. Since related word A1 is not keyword A, that is, not the terminal keyword, the partial list reaching keyword A is searched further, and the list [related word A1, keyword A], whose left-side element matches related word A1, is found. Its right-side element, keyword A, is added to the related word path list of related word A12 to create the list [keyword A, related word A1, related word A12].
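The path construction of FIG. 24 can be sketched as follows. The sketch enumerates every branch recursively rather than following the flowchart loop literally, and it refuses to revisit a word already on the path; both choices, along with the function name and the sample partial list, are assumptions for illustration.

```python
def build_paths_to_keyword_a(common, partial, keyword):
    """Extend each left-side common word backward until keyword is reached."""
    paths = []

    def extend(path):
        x1 = path[0]                               # current head of the path (X1)
        if x1 == keyword:                          # step 808: terminal keyword reached
            if path not in paths:
                paths.append(path)
            return
        for left, right in partial:                # steps 805-806: left element matches X1
            # Assumption: a word already on the path is not revisited.
            if left == x1 and right not in path:
                extend([right] + path)             # step 807: insert Y at the head

    for left, _right in common:                    # steps 801-803: start from [X]
        extend([left])
    return paths


# Hypothetical common list and partial list of [word, head] pairs.
common_list = [["A3", "B3"], ["A11", "B2"], ["A12", "B1"]]
partial_to_a = [
    ["A3", "keyword A"],
    ["A11", "A1"],
    ["A11", "A2"],
    ["A2", "keyword A"],
    ["A1", "keyword A"],
    ["A12", "A1"],
]
paths_to_a = build_paths_to_keyword_a(common_list, partial_to_a, "keyword A")
print(paths_to_a)
```

The path for "A12" grows from ["A12"] to ["A1", "A12"] and then to ["keyword A", "A1", "A12"], as in the worked example. For keyword B the same idea applies with Y appended at the tail instead of inserted at the head.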
[0035]
FIG. 25 is a flowchart showing the process of creating the related word path list reaching keyword B in step 405 of FIG. 14. First, it is checked whether the search of the common related word list in FIG. 15 has been completed (step 901). If it has, the process ends. That is, processing starts from the first row of the common related word list in FIG. 15, and each time one row has been processed, step 901 takes the next row as the processing target and the process proceeds to step 902. When no row remains to be processed, the process ends at step 901. If the search has not been completed in step 901, the right-side element of the row being processed in the common related word list in FIG. 15 is extracted and substituted for X (step 902), and a related word path list for X is created (step 903). The initial related word path list is the list [X], which has only X as an element. Next, X is substituted into X1 (step 904), and it is checked whether the search of the partial list reaching keyword B in FIG. 17 has been completed (step 905). In this search, the partial list reaching keyword B in FIG. 17 is searched for a list whose left-side element matches X1.
[0036]
If the search of the partial list reaching keyword B in FIG. 17 has been completed, the process returns to step 901 and takes the next row of the common related word list in FIG. 15 as the processing target. If it has not, the right-side element Y of a list whose left-side element matches X1 in the partial list reaching keyword B in FIG. 17 is acquired (step 906), and Y is added to the related word path list of X (step 907). Here, Y is inserted as the last element of the related word path list of X. Next, it is checked whether Y is keyword B, that is, the terminal keyword (step 908). If Y is the same as keyword B, the process returns to step 901. If Y is not the same as keyword B, Y is substituted into X1 (step 909), and the process returns to step 906, repeating the creation of the related word path list reaching keyword B shown in FIG. 19.
[0037]
For example, when the partial list reaching keyword B in FIG. 17 is searched using related word B11, the right-side element of the third row of the common related word list in FIG. 15, the lists whose left-side element matches related word B11 are [related word B11, related word B1] and [related word B11, related word B2]. Taking the former, the right-side element of [related word B11, related word B1] is related word B1, so related word B1 is first added to the related word path list [related word B11] of related word B11 to create [related word B11, related word B1]. Since related word B1 is not keyword B, that is, not the terminal keyword, the partial list reaching keyword B is searched further, and the list [related word B1, keyword B], whose left-side element matches related word B1, is found. Its right-side element, keyword B, is added to the related word path list of related word B11 to create the list [related word B11, related word B1, keyword B].
[0038]
FIG. 26 is a flowchart showing the process of creating the related word path list from keyword A to keyword B in step 406 of FIG. 14. First, it is checked whether the search of the common related word list in FIG. 15 has been completed (step 1001). If it has, the process ends. That is, processing starts from the first row of the common related word list in FIG. 15, and each time one row has been processed, step 1001 takes the next row as the processing target and the process proceeds to step 1002. When no row remains to be processed, the process ends at step 1001. If the search has not been completed in step 1001, the left-side element of the row being processed in the common related word list in FIG. 15 is extracted and substituted for X (step 1002), and the right-side element is extracted and substituted for Y (step 1003). Next, it is checked whether the search of the related word path list reaching keyword A in FIG. 18 has been completed (step 1004). In this search, the related word path list reaching keyword A in FIG. 18 is searched for a list whose rightmost element matches X.
[0039]
If the search of the related word path list reaching keyword A in FIG. 18 has been completed, the process returns to step 1001 and takes the next row of the common related word list in FIG. 15 as the processing target. If it has not, a list whose rightmost element matches X is acquired from the related word path list reaching keyword A in FIG. 18 and substituted into L1 (step 1005). Subsequently, it is checked whether the search of the related word path list reaching keyword B in FIG. 19 has been completed (step 1006). In this search, the related word path list reaching keyword B in FIG. 19 is searched for a list whose leftmost element matches Y.
[0040]
If the search of the related word path list reaching keyword B in FIG. 19 has been completed, the process returns to step 1004. If it has not, a list whose leftmost element matches Y is acquired from the related word path list reaching keyword B in FIG. 19 and substituted into L2 (step 1007). The list L1 and the list L2 are then combined to form a related word path list from keyword A to keyword B (step 1008). After the lists are combined, the process returns to step 1006 to repeat the combination with the other lists.
[0041]
For example, consider the list [related word A3, related word B3] in the first row of the common related word list in FIG. 15. Its left-side element is related word A3, and in the related word path list reaching keyword A in FIG. 18, the list whose rightmost element matches related word A3 is [keyword A, related word A3]. Its right-side element is related word B3, and in the related word path list reaching keyword B in FIG. 19, the list whose leftmost element matches related word B3 is [related word B3, keyword B]. These two lists are therefore combined to create the list [keyword A, related word A3, related word B3, keyword B]. By the above processing, the related word path list from the first keyword (keyword A) to the second keyword (keyword B) shown in FIG. 20 is created. In FIG. 20, the list in each row represents one route from keyword A to keyword B.
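The combination step of FIG. 26 amounts to joining every path that ends at the left-side common word with every path that starts at the right-side word. A minimal sketch, with a hypothetical function name and the single-row data from the example above:

```python
def combine_paths(common, paths_to_a, paths_to_b):
    """Join paths ending at X with paths starting at Y for each common row [X, Y]."""
    combined = []
    for x, y in common:                    # steps 1001-1003: X and Y of the current row
        for l1 in paths_to_a:              # steps 1004-1005: rightmost element matches X
            if l1[-1] != x:
                continue
            for l2 in paths_to_b:          # steps 1006-1007: leftmost element matches Y
                if l2[0] != y:
                    continue
                combined.append(l1 + l2)   # step 1008: combine L1 and L2
    return combined


common_list = [["related word A3", "related word B3"]]
paths_to_a = [["keyword A", "related word A3"]]
paths_to_b = [["related word B3", "keyword B"]]
full_paths = combine_paths(common_list, paths_to_a, paths_to_b)
print(full_paths)
```

Because both inner loops run over all matching lists, a common row with several paths on either side yields every pairing of them.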
[0042]
FIG. 27 is a flowchart showing the process of creating display data in the related word path display unit 55. The related word path display unit 55 creates display data based on the related word path list 54 (FIG. 20). First, the average relevance of each route is calculated (step 1101), and the display order is determined (step 1102). The average relevance of a route is the average of the degrees of association between adjacent elements in that route's list. The display order is the order specified by the user. In shortest path order, the routes are displayed starting from the list with the fewest elements in the related word path list, and in longest path order, starting from the list with the most elements. In descending order of average relevance, the routes are displayed in descending order of the average value calculated in step 1101. The display order may also be determined by a combination of these. Once the display order of the routes has been determined, the display data is created (step 1103). As options, the user can specify that the display color of routes at or below a predetermined relevance be changed, and that the display color of keywords whose appearance frequency is at or below a predetermined value be changed. For example, when red is designated as the display color for routes with a relevance of 3.0 or less, the route between the corresponding keywords is set to red. Likewise, when blue is designated as the display color for related words with an appearance frequency of 50 or less, the background of the corresponding keyword is set to blue.
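The ordering in steps 1101 and 1102 can be sketched as follows. The mode names, the function names, and the relevance lookup keyed by unordered word pairs are assumptions of this sketch, and every route is assumed to contain at least two elements.

```python
def average_relevance(path, relevance):
    """Step 1101: average the relevance of adjacent element pairs in one route."""
    pairs = list(zip(path, path[1:]))
    return sum(relevance[frozenset(pair)] for pair in pairs) / len(pairs)


def order_paths(paths, relevance, mode):
    """Step 1102: order routes according to the user-specified mode."""
    if mode == "shortest":
        return sorted(paths, key=len)
    if mode == "longest":
        return sorted(paths, key=len, reverse=True)
    if mode == "average_descending":
        return sorted(paths, key=lambda p: average_relevance(p, relevance), reverse=True)
    if mode == "average_ascending":
        return sorted(paths, key=lambda p: average_relevance(p, relevance))
    raise ValueError(f"unknown display order: {mode}")


# Hypothetical routes and pairwise relevance values.
paths = [["A", "x", "B"], ["A", "y", "z", "B"]]
relevance = {
    frozenset(("A", "x")): 4.0,
    frozenset(("x", "B")): 2.0,
    frozenset(("A", "y")): 5.0,
    frozenset(("y", "z")): 5.0,
    frozenset(("z", "B")): 5.0,
}
print(order_paths(paths, relevance, "shortest"))
print(order_paths(paths, relevance, "average_descending"))
```

Here the shorter route has the lower average relevance (3.0 against 5.0), so the two orderings put different routes first.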
[0043]
FIG. 28 shows an example of display data created in HTML (HyperText Markup Language) format. In this example, the route between “character” and “difficult to see” is designated in red and the background of “difficult to see” is designated in blue. In this way, display data is created according to the designated options, and the search result is displayed on the input / output screen 80 of the display device 70.
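As an illustration of step 1103, the sketch below renders one route as an HTML fragment with the two color options applied. The markup is an assumption chosen for brevity and is not the format of FIG. 28; the function name, the thresholds as default arguments, and the sample relevance and frequency values are likewise hypothetical.

```python
from html import escape


def render_path_html(path, relevance, frequency, max_relevance=3.0, max_frequency=50):
    """Render one route, coloring weak links red and infrequent keywords blue."""
    parts = []
    for i, word in enumerate(path):
        style = ""
        # Keywords with no recorded frequency are treated as frequent (assumption).
        if frequency.get(word, max_frequency + 1) <= max_frequency:
            style = ' style="background-color: blue"'
        parts.append(f"<span{style}>{escape(word)}</span>")
        if i + 1 < len(path):
            link = relevance[(word, path[i + 1])]
            color = "red" if link <= max_relevance else "black"
            parts.append(f'<span style="color: {color}"> - </span>')
    return "<p>" + "".join(parts) + "</p>"


# Hypothetical values for the two keywords mentioned in the text.
path = ["character", "difficult to see"]
relevance = {("character", "difficult to see"): 2.5}
frequency = {"character": 120, "difficult to see": 40}
html_fragment = render_path_html(path, relevance, frequency)
print(html_fragment)
```

The connector between the two keywords is drawn in red because its relevance is 3.0 or less, and the second keyword gets a blue background because its frequency is 50 or less.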
[0044]
Although an example of changing the display color has been described here, the display mode to be changed is not limited to the display color. For example, the line thickness, highlighting, or the presence or absence of blinking may be changed instead.
[0045]
FIG. 29 shows a display example of search results. In this example, “personal computer” and “elderly” are input as the keywords and searched. The example of FIG. 29 shows that keywords such as “Internet”, “mail”, and “motivation to learn” frequently lie between “personal computer” and “elderly”. The routes “difficult” - “motivation to learn” and “character” - “difficult to see” are highlighted in red according to the user's designation, and the keyword “difficult to see” is highlighted in blue. From this, one can see, for example, that it would be better to enlarge the characters shown on the display. Thus, by displaying the related words that lie between keywords, it is possible to obtain information that could not be obtained with a conventional network-type display method.
[0046]
The present invention is not limited to the embodiment described with reference to FIGS. 1 to 29, and various modifications can be made without departing from the scope of the invention. For example, in the above-described embodiment, an example of displaying in the HTML format as the display method has been described. However, it is also possible to create and display a graphic image.
[0047]
According to the above embodiment, a plurality of related word paths from the first keyword to the second keyword are displayed, so it is possible to grasp through which words the two keywords are connected. Displaying such related word paths makes it possible to present unknown information that the user had not noticed until now. In addition, because the display order can be switched among shortest path order, longest path order, and ascending or descending order of the average relevance, the connections between words can be viewed from different viewpoints. This makes it possible not only to judge trends among the keywords in various text data such as daily business reports, store managers' diaries, and general newspaper data, but also to discover new knowledge by observing how the keywords are connected. For example, even when sufficient questionnaire data is not available, it is possible to search for potential needs, such as what functions are necessary when developing products such as a personal computer for elderly people or a mobile phone for elderly people.
[0048]
[Effects of the Invention]
As described above, according to the present invention, a plurality of related word paths from the first keyword to the second keyword are displayed in advance, which presents information on the words through which the two keywords are connected. Displaying such related word paths makes it possible to present unknown information that the user had not noticed before.
[Brief description of the drawings]
FIG. 1 is a diagram showing association relationships between words in a network format.
FIG. 2 is a diagram showing a configuration of a text mining device according to an embodiment of the present invention.
FIG. 3 is a flowchart showing a processing process until a related word dictionary is created from a document database.
FIG. 4 is a diagram showing an example of document data read from a document database.
FIG. 5 is a diagram illustrating an example of words that are analyzed and cut out by a word extraction unit.
FIG. 6 is a diagram showing an example of a word table created by a word extraction unit.
FIG. 7 is a diagram showing an example of a co-occurrence frequency table created by a related word extraction unit.
FIG. 8 is a diagram showing an example of a relevance level table created by a related word extraction unit.
FIG. 9 is a diagram showing a calculation formula for mutual information used in the relevance degree table creation process.
FIG. 10 is a flowchart showing a process of creating a related word list by a related word list creating unit.
FIG. 11 is a diagram illustrating an example of a related word list of a first keyword.
FIG. 12 is a diagram illustrating an example of a related word list of a second keyword.
FIG. 13 is a flowchart showing a process of a related word list creation function.
FIG. 14 is a flowchart showing a process of creating a related word path list by a related word path creating unit.
FIG. 15 is a diagram showing an example of a common related word list of the related word list of the first keyword and the related word list of the second keyword.
FIG. 16 is a diagram showing an example of a partial list that reaches the first keyword (keyword A).
FIG. 17 is a diagram showing an example of a partial list that reaches the second keyword (keyword B).
FIG. 18 is a diagram showing an example of a related word path list that reaches the first keyword (keyword A).
FIG. 19 is a diagram showing an example of a related word path list that reaches the second keyword (keyword B).
FIG. 20 is a diagram showing an example of a related word path list from a first keyword (keyword A) to a second keyword (keyword B).
FIG. 21 is a flowchart showing a process of creating a common related word list.
FIG. 22 is a flowchart showing a process of creating a partial list up to the first keyword (keyword A).
FIG. 23 is a flowchart showing a process of creating a partial list up to the second keyword (keyword B).
FIG. 24 is a flowchart showing a process of creating a related word path list up to the first keyword (keyword A).
FIG. 25 is a flowchart showing a process of creating a related word path list up to the second keyword (keyword B).
FIG. 26 is a flowchart showing a process of creating a related word path list from the first keyword (keyword A) to the second keyword (keyword B).
FIG. 27 is a flowchart showing a process of creating display data in a related word path display unit.
FIG. 28 is a diagram showing an example of creating display data in HTML format.
FIG. 29 is a diagram showing a search result of a related word list from the first keyword to the second keyword.
[Explanation of symbols]
10 …… Processing device
20 …… Document database
30 …… Related word dictionary creation unit
31 …… Word extraction unit
32 …… Related word extraction unit
40 …… Related word dictionary
50 …… Related word path search unit
51 …… Related word list creation unit
52 …… Related word list
53 …… Related word path creation unit
54 …… Related word path list
55 …… Related word path display unit
60 …… Input device
70 …… Display device
80 …… Input / output screen

Claims

A text mining program for analyzing text data to grasp features and trends and to discover unknown information,
the program causing a computer to execute:
a step of searching for a path of related words connecting two specified keywords; and
a step of displaying the searched paths in shortest path order or longest path order according to a user's specification.

A text mining program for analyzing text data to grasp features and trends and to discover unknown information,
the program causing a computer to execute:
a step of searching for a path of related words connecting two specified keywords; and
a step of displaying the searched paths, according to a user's specification, in ascending or descending order of the average value of the strength of association between the keywords that are the elements of each path.

A text mining method, performed in a text mining device constructed using a computer, for analyzing text data to grasp features and trends and to discover unknown information, wherein
path search means provided in the computer executes a step of searching for a path of related words connecting two specified keywords, and
display means provided in the computer executes a step of displaying the searched paths in shortest path order or longest path order according to a user's specification.

A text mining method, performed in a text mining device constructed using a computer, for analyzing text data to grasp features and trends and to discover unknown information, wherein
path search means provided in the computer executes a step of searching for a path of related words connecting two specified keywords, and
display means provided in the computer executes a step of displaying the searched paths, according to a user's specification, in ascending or descending order of the average value of the strength of association between the keywords that are the elements of each path.

A text mining device for analyzing text data to grasp features and trends and to discover unknown information, comprising:
means for searching for a path of related words connecting two specified keywords; and
means for displaying the searched paths in shortest path order or longest path order according to a user's specification.

A text mining device for analyzing text data to grasp features and trends and to discover unknown information, comprising:
means for searching for a path of related words connecting two specified keywords; and
means for displaying the searched paths, according to a user's specification, in ascending or descending order of the average value of the strength of association between the keywords that are the elements of each path.