JP2004078542A - Text mining processor, text mining processing method, its program, and record medium

Info

Publication number: JP2004078542A
Application number: JP2002237689A
Authority: JP
Inventors: Naoyuki Horai; 蓬莱　尚幸; Kiyoshi Nitta; 新田　清
Original assignee: Celestar Lexico Sciences Inc
Current assignee: Celestar Lexico Sciences Inc
Priority date: 2002-08-16
Filing date: 2002-08-16
Publication date: 2004-03-11

Abstract

PROBLEM TO BE SOLVED: To provide a text mining processor capable of making the analysis performed by text mining more sophisticated, more efficient, and automated.

SOLUTION: The system improves the precision, efficiency, and automation of the analysis of tabulation results in text mining processing. Precision is improved by providing methods for evaluating an analysis procedure (original text display, dictionary entry search, trace result display, and the like) and an analysis method that uses syntactic structure. Efficiency is improved by providing methods for working with tabulation results, such as viewing analysis screens side by side (multi-windowing and the like) and sorting or clustering the category items of a 2-D map. The system also provides analysis automation methods (operation history collection, automatic operation execution, and the like) and large-scale concept management methods (tree-structured hierarchization, intermediate node tabulation, and the like).

COPYRIGHT: (C)2004,JPO

Description

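【0001】
【Technical Field of the Invention】
The present invention relates to a text mining processing apparatus, a text mining processing method, a program, and a recording medium, and in particular to a text mining processing apparatus, text mining processing method, program, and recording medium capable of making analysis by text mining more sophisticated, more efficient, and automated.
【0002】
【Prior Art】
In recent years, literature databases that accumulate technical documents such as papers have been built and are widely used over the Internet and similar networks. One example is PubMed, through which the U.S. National Center for Biotechnology Information (NCBI) provides literature data from the U.S. National Library of Medicine (NLM) and other sources (URL of PubMed on the Internet: http://www.ncbi.nlm.gov/entrez/).
【0003】
To improve search efficiency, conventional literature database search services use a "notation dictionary", which maps the canonical form of each term to its variant notations, and a "category dictionary", which classifies each term into categories.
【0004】
For example, TAKMI (product name) from IBM (company name) is a text mining system that uses existing notation dictionaries and category dictionaries (URL of the text mining technology page of the IBM Tokyo Research Laboratory: http://www.trl.ibm.com/projects/s7710/tm/index.htm; URL of the TAKMI introduction page: http://www.trl.ibm.com/projects/s7710/tm/takmi/takmi.htm).
【0005】
In addition, MeSH (Medical Subject Headings) is a thesaurus search service for medical terms (URL of the NLM MeSH home page: http://www.nlm.nih.gov/mesh/meshhome.html; URL of a paper giving an overview of MeSH: http://www.nlm.nih.gov/mesh/patterns.html; URL of the MeSH Browser service: http://www.ncbi.nih.gov/entrez/meshbrowser.cgi).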
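【0006】
【Problems to Be Solved by the Invention】
An overview of a text mining system is given here with reference to FIG. 1 and FIG. 2. FIG. 1 is a conceptual diagram outlining text mining processing.
【0007】
As shown in FIG. 1, the system executes the following procedure to map the character strings of the words that appear in each document of the analysis target document set to concepts.
【0008】
First, a notation dictionary is created (by hand) and applied to each word of the document information, which is written in English, Japanese, or another language (step SA-1).
【0009】
Next, for the document information, which now has partial word boundaries, technical terms are identified according to identification rules (step SA-2), and syntactic analysis is then applied (step SA-3). The notation dictionary and the syntactic analysis may be applied in either order, or in parallel.
【0010】
Next, a category dictionary is created (by hand) and applied both to the sentence structure obtained from syntactic analysis and to the result of applying the notation dictionary, in order to categorize the terms. The terms corresponding to each category are tabulated and an index is created (step SA-4).
【0011】
Next, the frequencies of occurrence of the categorized concepts are calculated and tabulated (step SA-5). The results are formatted and displayed as a frequency graph that plots how often words appear in the document information, as a time-series graph that plots frequencies by publication date, or as a 2-D map such as the one shown in FIG. 2 (step SA-6). The user then extracts the desired information from the displayed frequencies manually and by eye.
【0012】
FIG. 2 is a conceptual diagram of the 2-D map displayed in step SA-6 of FIG. 1.
As shown in FIG. 2, a 2-D map displays frequencies of occurrence in two dimensions, by category. Each cell of the 2-D map normally shows two values: the number of documents that contain terms belonging to both the category of its column (vertical direction) and the category of its row (horizontal direction), and the proportion of that number relative to the total for the row. Users normally focus on cells with a high proportion (the value yyy in the cells of FIG. 2) to extract the desired information.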
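The cell contents described for the 2-D map can be illustrated with a minimal sketch. It assumes that each document has already been reduced to the set of categories its terms belong to; the function name `build_2d_map` and the sample data are hypothetical and are not taken from the patent.

```python
from collections import Counter

def build_2d_map(docs, row_categories, col_categories):
    """Return {(row, col): (document_count, share_of_row_total)}.

    docs is a list of sets; each set holds the categories of one document.
    """
    counts = Counter()
    for cats in docs:
        for r in row_categories:
            if r in cats:
                for c in col_categories:
                    if c in cats:
                        counts[(r, c)] += 1  # document belongs to both categories

    cells = {}
    for r in row_categories:
        row_total = sum(counts[(r, c)] for c in col_categories)
        for c in col_categories:
            n = counts[(r, c)]
            ratio = n / row_total if row_total else 0.0
            cells[(r, c)] = (n, ratio)
    return cells

# Hypothetical sample: four documents, already categorized.
docs = [
    {"gene", "disease"},
    {"gene", "drug"},
    {"gene", "disease", "drug"},
    {"protein", "drug"},
]
cells = build_2d_map(docs, ["gene", "protein"], ["disease", "drug"])
for key, (n, ratio) in cells.items():
    print(key, n, f"{ratio:.2f}")
```

The second value in each cell corresponds to the proportion that the text calls yyy; a reader scanning the map would look for cells where it is high.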
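【0013】
In existing text mining systems, therefore, the end user reaches the original text through a series of interactive analysis operations. The reliability of each operation depends on the technique used to process the original text, but the end user has no direct means of obtaining that reliability. In other words, it has been difficult to examine directly which terms were extracted from which documents. As a result, extracting useful information with an existing text mining system requires experience and skill. To lower the barrier for general users, a system needs to provide information that helps them estimate the reliability of interactive analysis operations, but no such text mining system existed.
【0014】
In addition, because conventional methods always tabulate words with the same notation under the same category, they could not accurately handle words whose meaning changes with context.
【0015】
In addition, because multiple document sets and analysis axes were handled by switching a single screen, the analysis depended on the end user's memory.
【0016】
In addition, when a 2-D map analysis involved a large number of category elements, finding the category elements of interest was difficult.
【0017】
In addition, because the analysis advances through interactive user operations, it took a great deal of time when there were many analysis targets or many kinds of analysis.
【0018】
In addition, when a large-scale concept dictionary was used (several thousand to several tens of thousands of categories), listing and searching concept items with a one-dimensional list was not workable.
【0019】
Conventional systems thus had a number of problems, and as a result were inconvenient and inefficient for both users and administrators.
【0020】
The prior art and the problems described so far are not limited to literature information database search systems for natural-science literature such as biology, medicine, and chemistry. They apply equally to any system that searches literature information in any field.
【0021】
The present invention has been made in view of the above problems, and its object is to provide a text mining processing apparatus, a text mining processing method, a program, and a recording medium capable of making analysis by text mining more sophisticated, more efficient, and automated.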
【００２２】
【課題を解決するための手段】
このような目的を達成するため、請求項１に記載のテキストマイニング処理装置は、分析対象文書に登場する各用語の出現頻度を集計するテキストマイニング処理装置であって、上記分析対象文書の原文情報と、当該原文情報に含まれ、かつ、集計の対象となる用語のリストであって、当該用語毎に、用語の型、および／または、上記用語の格納先アドレスへのリンクボタンを対応付けた集計キーリスト情報とを出力装置に出力するように制御する原文表示制御手段を備えたことを特徴とする。
【００２３】
この装置によれば、分析対象文書の原文情報と、当該原文情報に含まれ、かつ、集計の対象となる用語のリストであって、用語毎に、用語の型、および／または、用語の格納先アドレスへのリンクボタンを対応付けた集計キーリスト情報とを出力装置に出力するので、原文を集計のキーになった語のリストと共に表示することで、エンドユーザは一連の分析操作のうちどの操作でその文献を取得することになったかを容易に把握できるようになる。その結果、経験の少ないユーザでもノイズの原因となる操作を回避し精度の高い分析作業が可能になる。また、原文中に外部データベースへのリンクを張ることによって、エンドユーザは取得した文献が何を主題にしているのかを正確に知ることができる。この情報は検索ノイズを生み出す操作の学習に活用されることにより、分析作業の精度向上につながる。
【００２４】
また、請求項２に記載のテキストマイニング処理装置は、分析対象文書に登場する各用語の出現頻度を集計するテキストマイニング処理装置であって、利用者が入力した検索語と、当該検索語に基づいて表記辞書情報を検索して抽出した対応する正規形とその表記辞書エントリに関する情報と、当該検索語に基づいてカテゴリ辞書情報を検索して抽出した対応するカテゴリとそのカテゴリ辞書エントリに関する情報とを出力装置に出力するように制御する辞書エントリ検索画面制御手段を備えたことを特徴とする。
【００２５】
この装置によれば、利用者が入力した検索語と、検索語に基づいて表記辞書情報を検索して抽出した対応する正規形とその表記辞書エントリに関する情報と、検索語に基づいてカテゴリ辞書情報を検索して抽出した対応するカテゴリとそのカテゴリ辞書エントリに関する情報とを出力装置に出力するように制御するので、特定の語についての表記辞書およびカテゴリ辞書の適用可能性を検索することによって、目的とするカテゴリに文献を分離するのにふさわしい語を選別することができるようになる。また、語検索を繰り返し行うことで、本来は分離したい多数のカテゴリ項目群に展開される可能性のある語が頻出する辞書ファイルを選別することができる。またそれらのカテゴリ群の精度を推測することができる。さらに、あるカテゴリを特徴付ける既知の用語がわかっている場合、その語に関する辞書エントリの存在を確認することによって、カテゴリの再現率を推測することができる。
【００２６】
また、請求項３に記載のテキストマイニング処理装置は、分析対象文書に登場する各用語の出現頻度を集計するテキストマイニング処理装置であって、上記分析対象文書の原文情報と、当該原文情報に含まれ、かつ、集計の対象となる用語に対して、表記辞書の検索結果、構文解析処理による品詞情報、カテゴリ辞書の検索結果のうち少なくとも一つを含むトレース結果情報とを出力装置に出力するように制御する辞書エントリ検索画面制御手段を備えたことを特徴とする。
【００２７】
この装置によれば、分析対象文書の原文情報と、原文情報に含まれ、かつ、集計の対象となる用語に対して、表記辞書の検索結果、構文解析処理による品詞情報、カテゴリ辞書の検索結果のうち少なくとも一つを含むトレース結果情報とを出力装置に出力するように制御するので、原文の各要素がどの辞書の項目で正規化されカテゴライズされたかを表示することで、さらに詳細にその文献を取得することになった理由を把握できるようになる。また、どのように連語をまとめたかや構文解析系が付与する正規形など、用語を見ただけでは判断できない適用情報も正確に把握することができる。さらに、これにより無関係と思われる原文を取得した場合にどの用語が原因でそうなったかを特定できるようになる。
【００２８】
また、請求項４に記載のテキストマイニング処理装置は、分析対象文書に登場する各用語の出現頻度を集計するテキストマイニング処理装置であって、上記分析対象文書の原文情報に対する構文解析の結果に従って、当該原文情報に含まれる名詞と動詞のｎ個の順序付き組み合わせを一つのカテゴリとして上記分析対象文書に対するテキストマイニングの集計処理を行う構文構造分析手段を備えたことを特徴とする。
【００２９】
この装置によれば、分析対象文書の原文情報に対する構文解析の結果に従って、原文情報に含まれる名詞と動詞のｎ個の順序付き組み合わせを一つのカテゴリとして分析対象文書に対するテキストマイニングの集計処理を行うので、ｎ項関係のパターンを集計対象とすることで、用語の種類だけでは判別できなかった文献を切り分けることができ、さらに分析精度を向上させることができる。
【００３０】
また、請求項５に記載のテキストマイニング処理装置は、分析対象文書に登場する各用語の出現頻度を集計するテキストマイニング処理装置であって、テキストマイニング用の一つの検索ウィンドウによる検索結果から、別の検索ウィンドウを用いてさらに検索条件を絞って検索を行う場合に、これらの関連する検索ウィンドウおよび検索結果表示ウィンドウをマルチウィンドウ化して表示し、いずれかのウィンドウの表示内容が変更されたときにはその変更内容が他のウィンドウにも自動的に反映されるように制御するマルチウィンドウ化手段を備えたことを特徴とする。
【００３１】
この装置によれば、別の検索ウィンドウを用いてさらに検索条件を絞って検索を行う場合に、これらの関連する検索ウィンドウおよび検索結果表示ウィンドウをマルチウィンドウ化して表示し、いずれかのウィンドウの表示内容が変更されたときにはその変更内容が他のウィンドウにも自動的に反映されるように制御するので、任意の作業状態を必要に応じて残すことによって、エンドユーザが分析のために記憶しておく情報の量が減る。これにより分析作業を効率化することができる。また、複数画面を備えた計算機端末の表示領域を有効に使うことができる。
【００３２】
また、請求項６に記載のテキストマイニング処理装置は、分析対象文書に登場する各用語の出現頻度を集計するテキストマイニング処理装置であって、テキストマイニング結果を表示する２−Ｄマップに対して、列または行に対応する各カテゴリ項目をソートまたはクラスタリングして２−Ｄマップウィンドウを出力装置に出力する２−Ｄマップ表示画面制御手段を備えたことを特徴とする。
【００３３】
この装置によれば、テキストマイニング結果を表示する２−Ｄマップに対して、列または行に対応する各カテゴリ項目をソートして２−Ｄマップウィンドウを出力装置に出力するので、注目すべきカテゴリ項目がオリジナルのカテゴリ定義順に特定位置に固定されるような場合、オリジナル順でソートしておくことでそれらのカテゴリ項目の発見が容易になる。また、注目すべきカテゴリ項目が出現頻度の高いものである場合、高い頻度順でソートしておくことでそれらのカテゴリ項目の発見が容易になる。さらに、注目すべきカテゴリ項目が特定の名前で始まるものである場合、アルファベット順にソートしておくことでそれらのカテゴリ項目の発見が容易になる。
【００３４】
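Also, according to this apparatus, the category items corresponding to the columns or rows of the 2-D map that displays the text mining results are clustered and the 2-D map window is output to the output device. Grouping items that share a characteristic pattern into clusters therefore reduces the burden of searching for category items and makes the analysis work more efficient.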
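【0035】
The text mining processing apparatus according to claim 7 is a text mining processing apparatus that tabulates the frequency of occurrence of each term appearing in analysis target documents, and is characterized by comprising operation history collection means for collecting operation history information on at least one of the following for each operation performed during text mining: operation time, user identifier, operation name, operation arguments, operation target, operation result, and a user comment on the intent of the operation.
【0036】
According to this apparatus, operation history information on at least one of the operation time, user identifier, operation name, operation arguments, operation target, operation result, and user comment on the intent of the operation is collected for each operation performed during text mining, so the registered contents of the notation dictionary and the category dictionary can be checked on the basis of the operation history. In addition, by using the history as a template when creating the instructions (batch script) for the automatic operation execution processing (batch processing) described later, cumbersome analysis processing can easily be turned into batch processing. In addition, because user comments on operation intent are stored, the places to be batched can be found quickly from the user's stated intent even when a large number of interactive operations are recorded in the history, which makes batch script creation more efficient. Finally, if the user adds a comment at each point that is to be batched later, less effort is needed to examine the batch contents when the script is written, which again makes batch script creation more efficient.
【0037】
The text mining processing apparatus according to claim 8 is the text mining processing apparatus according to claim 7, characterized by comprising automatic operation execution means for creating a batch script on the basis of the operation history information collected by the operation history collection means and executing the batch script.
【0038】
According to this apparatus, a batch script is created on the basis of the collected operation history information and then executed, so an analysis consisting of a series of operations can be executed repeatedly as a batch, which shortens the time the end user is tied to the tool. Analysis processing that is performed at regular intervals can also be run automatically. Furthermore, heavy analysis processing can be run while the system is lightly loaded.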
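The relationship between the operation history and a batch script can be sketched as follows. This is an illustration only: the record fields follow the items listed in the text, but the JSON script format, the `batch:` comment marker, and the handler functions are all assumptions introduced for this sketch, not details given in the patent.

```python
import json
from dataclasses import dataclass

@dataclass
class OperationRecord:
    time: str          # operation time
    user_id: str       # user identifier
    name: str          # operation name
    args: dict         # operation arguments
    target: str        # operation target (for example, a window)
    result: str = ""   # operation result
    comment: str = ""  # user's comment on the intent of the operation

def make_batch_script(history, marker):
    """Use the history as a template: keep the records whose comment contains marker."""
    steps = [
        {"name": r.name, "args": r.args, "target": r.target}
        for r in history
        if marker in r.comment
    ]
    return json.dumps(steps, ensure_ascii=False, indent=2)

def run_batch(script, handlers):
    """Replay each step of the script through the matching handler."""
    results = []
    for step in json.loads(script):
        handler = handlers[step["name"]]
        results.append(handler(step["target"], **step["args"]))
    return results

# Hypothetical history; only commented steps are selected for batching.
history = [
    OperationRecord("2002-08-16T10:00:00", "u1", "search",
                    {"keyword": "kw1"}, "w1", "120 docs", "batch: weekly report"),
    OperationRecord("2002-08-16T10:01:00", "u1", "open_original",
                    {"doc_id": "d7"}, "w2"),
    OperationRecord("2002-08-16T10:02:00", "u1", "sort_2d_map",
                    {"mode": "frequency"}, "w3", "", "batch: weekly report"),
]
handlers = {
    "search": lambda target, keyword: f"search {keyword} in {target}",
    "sort_2d_map": lambda target, mode: f"sort {target} by {mode}",
}
script = make_batch_script(history, "batch:")
results = run_batch(script, handlers)
print(results)
```

Selecting steps by comment mirrors the point made in the text that user comments help locate the part of a long history that should be batched.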
【００３９】
また、請求項９に記載のテキストマイニング処理装置は、分析対象文書に登場する各用語の出現頻度を集計するテキストマイニング処理装置であって、テキストマイニング処理に用いるカテゴリ辞書情報に登録された各カテゴリの集計結果を、木構造により階層化して出力装置に出力するカテゴリ階層化手段と、上記カテゴリ階層化手段により出力された、木構造により階層化されたカテゴリのうち少なくとも一部を選択するカテゴリ選択手段とを備えたことを特徴とする。
【００４０】
この装置によれば、テキストマイニング処理に用いるカテゴリ辞書情報に登録された各カテゴリの集計結果を、木構造により階層化して出力装置に出力するので、木構造で階層化し折畳みや展開操作を備えることによってユーザ対話インターフェースを介して画面上に一度に表示する概念項目数を抑制することができる。また、これにより、目的とする概念項目の探索が容易になる。
【００４１】
また、この装置によれば、出力された、木構造により階層化されたカテゴリのうち少なくとも一部を選択するので、対話的なテキストマイニング操作を行う際に、カテゴリを木構造により階層化して表示した画面から利用者が目的の部分カテゴリを選択することができるようになり、最終的な出力だけでなく操作の途中でも階層カテゴリを活用することができるようになる。また、これにより、カテゴリ部分を指定する必要のある対話的なテキストマイニング分析操作を、大規模なカテゴリ構造が対象である場合においても効率的に遂行できるようになる。
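【0042】
The text mining processing apparatus according to claim 10 is the text mining processing apparatus according to claim 9, characterized by comprising intermediate node tabulation means which, when an intermediate node is treated as a concept item in the tabulation results of the categories hierarchized into a tree structure by the category hierarchization means, takes the tabulation results of the leaf-node concept items that are descendants of the intermediate node as the tabulation result for the intermediate node, and/or, when a canonical form or variant form corresponding to the intermediate node is defined in the notation dictionary used for text mining processing, takes the tabulation result of the analysis target documents that contain the canonical form or variant form as the tabulation result for the intermediate node.
【0043】
According to this apparatus, when an intermediate node is treated as a concept item in the tabulation results of the categories hierarchized into a tree structure, either the tabulation results of the leaf-node concept items that are descendants of the intermediate node are taken as the tabulation result for the intermediate node (first tabulation method), and/or, when a canonical form or variant form corresponding to the intermediate node is defined in the notation dictionary used for text mining processing, the tabulation result of the analysis target documents that contain the canonical form or variant form is taken as the tabulation result for the intermediate node (second tabulation method). The first method can handle a concept category structure in which no canonical term corresponds to an intermediate node. It also allows flexible design of the category structure, such as dividing a large-scale concept category structure at an appropriate granularity. The second method, on the other hand, counts documents accurately when canonical terms corresponding to intermediate nodes do exist. This is often the case in concept category structures built from existing data structures, which can therefore be put to use. Choosing between the two methods case by case, or combining them, lowers the cost of creating a concept category structure and makes large-scale category concepts easier to use.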
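The two tabulation methods for an intermediate node can be sketched as below. The sketch assumes that a "tabulation result" is the set of matching document identifiers, so that a document reached through two leaves is counted once; the patent does not state this, and the tree, dictionary, and document data are hypothetical.

```python
def descendant_leaves(tree, node):
    """Collect the leaf nodes below node; tree maps a node to its child list."""
    children = tree.get(node, [])
    if not children:
        return [node]
    leaves = []
    for child in children:
        leaves.extend(descendant_leaves(tree, child))
    return leaves

def tabulate_by_descendants(tree, leaf_docs, node):
    """First method: merge the results of all descendant leaf items."""
    docs = set()
    for leaf in descendant_leaves(tree, node):
        docs |= leaf_docs.get(leaf, set())
    return docs

def tabulate_by_own_forms(notation_dict, doc_terms, node):
    """Second method: documents containing the node's own canonical or variant forms."""
    forms = notation_dict.get(node)
    if forms is None:
        return set()  # no canonical term is defined for this node
    return {doc_id for doc_id, terms in doc_terms.items() if terms & forms}

# Hypothetical category tree and data.
tree = {"enzyme": ["kinase", "protease"]}
leaf_docs = {"kinase": {"d1", "d2"}, "protease": {"d2", "d3"}}
notation_dict = {"enzyme": {"enzyme", "enzymes"}}
doc_terms = {
    "d1": {"kinase"},
    "d2": {"kinase", "protease"},
    "d3": {"protease"},
    "d4": {"enzymes"},
}

first = tabulate_by_descendants(tree, leaf_docs, "enzyme")
second = tabulate_by_own_forms(notation_dict, doc_terms, "enzyme")
combined = first | second  # using both methods together
print(len(first), len(second), len(combined))
```

In this sample the second method finds a document (d4) that mentions only the general term, which the leaf-based method alone would miss.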
【００４４】
また、本発明はテキストマイニング処理方法に関するものであり、請求項１１に記載のテキストマイニング処理方法は、分析対象文書に登場する各用語の出現頻度を集計するテキストマイニング処理方法であって、上記分析対象文書の原文情報と、当該原文情報に含まれ、かつ、集計の対象となる用語のリストであって、当該用語毎に、用語の型、および／または、上記用語の格納先アドレスへのリンクボタンを対応付けた集計キーリスト情報とを出力方法に出力するように制御する原文表示制御ステップを含むことを特徴とする。
【００４５】
この方法によれば、分析対象文書の原文情報と、当該原文情報に含まれ、かつ、集計の対象となる用語のリストであって、用語毎に、用語の型、および／または、用語の格納先アドレスへのリンクボタンを対応付けた集計キーリスト情報とを出力方法に出力するので、原文を集計のキーになった語のリストと共に表示することで、エンドユーザは一連の分析操作のうちどの操作でその文献を取得することになったかを容易に把握できるようになる。その結果、経験の少ないユーザでもノイズの原因となる操作を回避し精度の高い分析作業が可能になる。また、原文中に外部データベースへのリンクを張ることによって、エンドユーザは取得した文献が何を主題にしているのかを正確に知ることができる。この情報は検索ノイズを生み出す操作の学習に活用されることにより、分析作業の精度向上につながる。
【００４６】
また、請求項１２に記載のテキストマイニング処理方法は、分析対象文書に登場する各用語の出現頻度を集計するテキストマイニング処理方法であって、利用者が入力した検索語と、当該検索語に基づいて表記辞書情報を検索して抽出した対応する正規形とその表記辞書エントリに関する情報と、当該検索語に基づいてカテゴリ辞書情報を検索して抽出した対応するカテゴリとそのカテゴリ辞書エントリに関する情報とを出力方法に出力するように制御する辞書エントリ検索画面制御ステップを含むことを特徴とする。
【００４７】
この方法によれば、利用者が入力した検索語と、検索語に基づいて表記辞書情報を検索して抽出した対応する正規形とその表記辞書エントリに関する情報と、検索語に基づいてカテゴリ辞書情報を検索して抽出した対応するカテゴリとそのカテゴリ辞書エントリに関する情報とを出力方法に出力するように制御するので、特定の語についての表記辞書およびカテゴリ辞書の適用可能性を検索することによって、目的とするカテゴリに文献を分離するのにふさわしい語を選別することができるようになる。また、語検索を繰り返し行うことで、本来は分離したい多数のカテゴリ項目群に展開される可能性のある語が頻出する辞書ファイルを選別することができる。またそれらのカテゴリ群の精度を推測することができる。さらに、あるカテゴリを特徴付ける既知の用語がわかっている場合、その語に関する辞書エントリの存在を確認することによって、カテゴリの再現率を推測することができる。
【００４８】
また、請求項１３に記載のテキストマイニング処理方法は、分析対象文書に登場する各用語の出現頻度を集計するテキストマイニング処理方法であって、上記分析対象文書の原文情報と、当該原文情報に含まれ、かつ、集計の対象となる用語に対して、表記辞書の検索結果、構文解析処理による品詞情報、カテゴリ辞書の検索結果のうち少なくとも一つを含むトレース結果情報とを出力方法に出力するように制御する辞書エントリ検索画面制御ステップを含むことを特徴とする。
【００４９】
この方法によれば、分析対象文書の原文情報と、原文情報に含まれ、かつ、集計の対象となる用語に対して、表記辞書の検索結果、構文解析処理による品詞情報、カテゴリ辞書の検索結果のうち少なくとも一つを含むトレース結果情報とを出力方法に出力するように制御するので、原文の各要素がどの辞書の項目で正規化されカテゴライズされたかを表示することで、さらに詳細にその文献を取得することになった理由を把握できるようになる。また、どのように連語をまとめたかや構文解析系が付与する正規形など、用語を見ただけでは判断できない適用情報も正確に把握することができる。さらに、これにより無関係と思われる原文を取得した場合にどの用語が原因でそうなったかを特定できるようになる。
【００５０】
また、請求項１４に記載のテキストマイニング処理方法は、分析対象文書に登場する各用語の出現頻度を集計するテキストマイニング処理方法であって、上記分析対象文書の原文情報に対する構文解析の結果に従って、当該原文情報に含まれる名詞と動詞のｎ個の順序付き組み合わせを一つのカテゴリとして上記分析対象文書に対するテキストマイニングの集計処理を行う構文構造分析ステップを含むことを特徴とする。
【００５１】
この方法によれば、分析対象文書の原文情報に対する構文解析の結果に従って、原文情報に含まれる名詞と動詞のｎ個の順序付き組み合わせを一つのカテゴリとして分析対象文書に対するテキストマイニングの集計処理を行うので、ｎ項関係のパターンを集計対象とすることで、用語の種類だけでは判別できなかった文献を切り分けることができ、さらに分析精度を向上させることができる。
【００５２】
また、請求項１５に記載のテキストマイニング処理方法は、分析対象文書に登場する各用語の出現頻度を集計するテキストマイニング処理方法であって、テキストマイニング用の一つの検索ウィンドウによる検索結果から、別の検索ウィンドウを用いてさらに検索条件を絞って検索を行う場合に、これらの関連する検索ウィンドウおよび検索結果表示ウィンドウをマルチウィンドウ化して表示し、いずれかのウィンドウの表示内容が変更されたときにはその変更内容が他のウィンドウにも自動的に反映されるように制御するマルチウィンドウ化ステップを含むことを特徴とする。
【００５３】
この方法によれば、別の検索ウィンドウを用いてさらに検索条件を絞って検索を行う場合に、これらの関連する検索ウィンドウおよび検索結果表示ウィンドウをマルチウィンドウ化して表示し、いずれかのウィンドウの表示内容が変更されたときにはその変更内容が他のウィンドウにも自動的に反映されるように制御するので、任意の作業状態を必要に応じて残すことによって、エンドユーザが分析のために記憶しておく情報の量が減る。これにより分析作業を効率化することができる。また、複数画面を含む計算機端末の表示領域を有効に使うことができる。
【００５４】
また、請求項１６に記載のテキストマイニング処理方法は、分析対象文書に登場する各用語の出現頻度を集計するテキストマイニング処理方法であって、テキストマイニング結果を表示する２−Ｄマップに対して、列または行に対応する各カテゴリ項目をソートまたはクラスタリングして２−Ｄマップウィンドウを出力方法に出力する２−Ｄマップ表示画面制御ステップを含むことを特徴とする。
【００５５】
この方法によれば、テキストマイニング結果を表示する２−Ｄマップに対して、列または行に対応する各カテゴリ項目をソートして２−Ｄマップウィンドウを出力方法に出力するので、注目すべきカテゴリ項目がオリジナルのカテゴリ定義順に特定位置に固定されるような場合、オリジナル順でソートしておくことでそれらのカテゴリ項目の発見が容易になる。また、注目すべきカテゴリ項目が出現頻度の高いものである場合、高い頻度順でソートしておくことでそれらのカテゴリ項目の発見が容易になる。さらに、注目すべきカテゴリ項目が特定の名前で始まるものである場合、アルファベット順にソートしておくことでそれらのカテゴリ項目の発見が容易になる。
【００５６】
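Also, according to this method, the category items corresponding to the columns or rows of the 2-D map that displays the text mining results are clustered and the 2-D map window is output to the output device. Grouping items that share a characteristic pattern into clusters therefore reduces the burden of searching for category items and makes the analysis work more efficient.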
【００５７】
また、請求項１７に記載のテキストマイニング処理方法は、分析対象文書に登場する各用語の出現頻度を集計するテキストマイニング処理方法であって、テキストマイニング時の各操作に対する操作時刻、ユーザ識別子、操作名、操作引数、操作対象、操作結果、操作意図に関するユーザのコメントのうち少なくとも一つに関する操作履歴情報を収集する操作履歴収集ステップを含むことを特徴とする。
【００５８】
この方法によれば、テキストマイニング時の各操作に対する操作時刻、ユーザ識別子、操作名、操作引数、操作対象、操作結果、操作意図に関するユーザのコメントのうち少なくとも一つに関する操作履歴情報を収集するので、操作履歴に基づいて、表記辞書やカテゴリ辞書の登録内容をチェックすることができるようになる。また、後述する操作自動実行処理（バッチ処理）の指示（バッチスクリプト）を作成する際に雛型として用いることで、繁雑に行う分析処理を容易にバッチ処理化することができる。また、操作意図に関するユーザのコメントを格納することにより、作業履歴において対話操作が多数記録された場合でも、ユーザの作業意図等を手掛りにバッチ化する箇所を迅速に探すことができ、バッチスクリプト作成を効率化できる。また、ユーザが後でバッチ化したい箇所にコメントを入れることで、バッチスクリプト作成時にバッチの内容を検討する作業が軽減され、バッチスクリプトの作成を効率化できる。
【００５９】
また、請求項１８に記載のテキストマイニング処理方法は、請求項１７に記載のテキストマイニング処理方法において、上記操作履歴収集ステップにより収集された上記操作履歴情報に基づいて、バッチスクリプトを作成し、当該バッチスクリプトを実行する操作自動実行ステップを含むことを特徴とする。
【００６０】
この方法によれば、収集された操作履歴情報に基づいて、バッチスクリプトを作成し、バッチスクリプトを実行するので、一連の操作からなる分析をバッチ処理により繰り返し実行することで、エンドユーザをツール使用に拘束する時間を短縮することができる。また、一定期間毎に行う分析処理を自動的に行うことができるようになる。さらに、システム閑散期に負荷のかかる分析処理を実行することができるようになる。
【００６１】
また、請求項１９に記載のテキストマイニング処理方法は、分析対象文書に登場する各用語の出現頻度を集計するテキストマイニング処理方法であって、テキストマイニング処理に用いるカテゴリ辞書情報に登録された各カテゴリの集計結果を、木構造により階層化して出力方法に出力するカテゴリ階層化ステップと、上記カテゴリ階層化ステップにより出力された、木構造により階層化されたカテゴリのうち少なくとも一部を選択するカテゴリ選択ステップとを含むことを特徴とする。
【００６２】
この方法によれば、テキストマイニング処理に用いるカテゴリ辞書情報に登録された各カテゴリの集計結果を、木構造により階層化して出力方法に出力するので、木構造で階層化し折畳みや展開操作を備えることによってユーザ対話インターフェースを介して画面上に一度に表示する概念項目数を抑制することができる。また、これにより、目的とする概念項目の探索が容易になる。
【００６３】
また、この方法によれば、出力された、木構造により階層化されたカテゴリのうち少なくとも一部を選択するので、対話的なテキストマイニング操作を行う際に、カテゴリを木構造により階層化して表示した画面から利用者が目的の部分カテゴリを選択することができるようになり、最終的な出力だけでなく操作の途中でも階層カテゴリを活用することができるようになる。また、これにより、カテゴリ部分を指定する必要のある対話的なテキストマイニング分析操作を、大規模なカテゴリ構造が対象である場合においても効率的に遂行できるようになる。
【００６４】
また、請求項２０に記載のテキストマイニング処理方法は、請求項１９に記載のテキストマイニング処理方法において、上記カテゴリ階層化ステップにより木構造に階層化された各カテゴリの集計結果について中間ノードを概念項目として扱う場合に、当該中間ノードの子孫となる各リーフノード概念項目に対応する集計結果を当該中間ノードに対応する集計結果とするか、および／または、テキストマイニング処理に用いる表記辞書に当該中間ノードに対応する正規形または別表記形が定義されている場合に、当該正規形または当該別表記形を含む上記分析対象文書の集計結果を当該中間ノードに対応する集計結果とする、中間ノード集計ステップを含むことを特徴とする。
【００６５】
この方法によれば、木構造に階層化された各カテゴリの集計結果について中間ノードを概念項目として扱う場合に、中間ノードの子孫となる各リーフノード概念項目に対応する集計結果を中間ノードに対応する集計結果とするか（第１の集計方法）、および／または、テキストマイニング処理に用いる表記辞書に中間ノードに対応する正規形または別表記形が定義されている場合に、正規形または別表記形を含む分析対象文書の集計結果を中間ノードに対応する集計結果とする（第２の集計方法）。第１の集計方法を用いることにより、中間ノードに正規語が対応しない概念カテゴリ構造であっても処理することができる。また、大規模な概念カテゴリ構造を適切な粒度に分割するなど、自由度の高いカテゴリ構造の設計を可能にする。一方、第２の集計方法を用いることにより、中間ノードに対応する正規語が存在する概念カテゴリ構造である場合、精度よく文書数を集計することができる。また、既存データ構造を用いて作成した概念カテゴリ構造ではこのような場合が多く、それらが活用できる。これらの２つの方法を場合に応じて使い分けること、または、組み合わせて使うことにより、概念カテゴリ構造を作成するコストを下げることができ、大規模カテゴリ概念の利用が容易になる。
【００６６】
また、本発明はプログラムに関するものであり、請求項２１に記載のプログラムは、分析対象文書に登場する各用語の出現頻度を集計するプログラムであって、上記分析対象文書の原文情報と、当該原文情報に含まれ、かつ、集計の対象となる用語のリストであって、当該用語毎に、用語の型、および／または、上記用語の格納先アドレスへのリンクボタンを対応付けた集計キーリスト情報とを出力プログラムに出力するように制御する原文表示制御ステップを含むテキストマイニング処理方法をコンピュータに実行させることを特徴とする。
【００６７】
このプログラムによれば、分析対象文書の原文情報と、当該原文情報に含まれ、かつ、集計の対象となる用語のリストであって、用語毎に、用語の型、および／または、用語の格納先アドレスへのリンクボタンを対応付けた集計キーリスト情報とを出力プログラムに出力するので、原文を集計のキーになった語のリストと共に表示することで、エンドユーザは一連の分析操作のうちどの操作でその文献を取得することになったかを容易に把握できるようになる。その結果、経験の少ないユーザでもノイズの原因となる操作を回避し精度の高い分析作業が可能になる。また、原文中に外部データベースへのリンクを張ることによって、エンドユーザは取得した文献が何を主題にしているのかを正確に知ることができる。この情報は検索ノイズを生み出す操作の学習に活用されることにより、分析作業の精度向上につながる。
【００６８】
また、請求項２２に記載のプログラムは、分析対象文書に登場する各用語の出現頻度を集計するプログラムであって、利用者が入力した検索語と、当該検索語に基づいて表記辞書情報を検索して抽出した対応する正規形とその表記辞書エントリに関する情報と、当該検索語に基づいてカテゴリ辞書情報を検索して抽出した対応するカテゴリとそのカテゴリ辞書エントリに関する情報とを出力プログラムに出力するように制御する辞書エントリ検索画面制御ステップを含むテキストマイニング処理方法をコンピュータに実行させることを特徴とする。
【００６９】
このプログラムによれば、利用者が入力した検索語と、検索語に基づいて表記辞書情報を検索して抽出した対応する正規形とその表記辞書エントリに関する情報と、検索語に基づいてカテゴリ辞書情報を検索して抽出した対応するカテゴリとそのカテゴリ辞書エントリに関する情報とを出力プログラムに出力するように制御するので、特定の語についての表記辞書およびカテゴリ辞書の適用可能性を検索することによって、目的とするカテゴリに文献を分離するのにふさわしい語を選別することができるようになる。また、語検索を繰り返し行うことで、本来は分離したい多数のカテゴリ項目群に展開される可能性のある語が頻出する辞書ファイルを選別することができる。またそれらのカテゴリ群の精度を推測することができる。さらに、あるカテゴリを特徴付ける既知の用語がわかっている場合、その語に関する辞書エントリの存在を確認することによって、カテゴリの再現率を推測することができる。
【００７０】
また、請求項２３に記載のプログラムは、分析対象文書に登場する各用語の出現頻度を集計するプログラムであって、上記分析対象文書の原文情報と、当該原文情報に含まれ、かつ、集計の対象となる用語に対して、表記辞書の検索結果、構文解析処理による品詞情報、カテゴリ辞書の検索結果のうち少なくとも一つを含むトレース結果情報とを出力プログラムに出力するように制御する辞書エントリ検索画面制御ステップを含むテキストマイニング処理方法をコンピュータに実行させることを特徴とする。
【００７１】
このプログラムによれば、分析対象文書の原文情報と、原文情報に含まれ、かつ、集計の対象となる用語に対して、表記辞書の検索結果、構文解析処理による品詞情報、カテゴリ辞書の検索結果のうち少なくとも一つを含むトレース結果情報とを出力プログラムに出力するように制御するので、原文の各要素がどの辞書の項目で正規化されカテゴライズされたかを表示することで、さらに詳細にその文献を取得することになった理由を把握できるようになる。また、どのように連語をまとめたかや構文解析系が付与する正規形など、用語を見ただけでは判断できない適用情報も正確に把握することができる。さらに、これにより無関係と思われる原文を取得した場合にどの用語が原因でそうなったかを特定できるようになる。
【００７２】
また、請求項２４に記載のプログラムは、分析対象文書に登場する各用語の出現頻度を集計するプログラムであって、上記分析対象文書の原文情報に対する構文解析の結果に従って、当該原文情報に含まれる名詞と動詞のｎ個の順序付き組み合わせを一つのカテゴリとして上記分析対象文書に対するテキストマイニングの集計処理を行う構文構造分析ステップを含むテキストマイニング処理方法をコンピュータに実行させることを特徴とする。
【００７３】
このプログラムによれば、分析対象文書の原文情報に対する構文解析の結果に従って、原文情報に含まれる名詞と動詞のｎ個の順序付き組み合わせを一つのカテゴリとして分析対象文書に対するテキストマイニングの集計処理を行うので、ｎ項関係のパターンを集計対象とすることで、用語の種類だけでは判別できなかった文献を切り分けることができ、さらに分析精度を向上させることができる。
【００７４】
また、請求項２５に記載のプログラムは、分析対象文書に登場する各用語の出現頻度を集計するプログラムであって、テキストマイニング用の一つの検索ウィンドウによる検索結果から、別の検索ウィンドウを用いてさらに検索条件を絞って検索を行う場合に、これらの関連する検索ウィンドウおよび検索結果表示ウィンドウをマルチウィンドウ化して表示し、いずれかのウィンドウの表示内容が変更されたときにはその変更内容が他のウィンドウにも自動的に反映されるように制御するマルチウィンドウ化ステップを含むテキストマイニング処理方法をコンピュータに実行させることを特徴とする。
【００７５】
このプログラムによれば、別の検索ウィンドウを用いてさらに検索条件を絞って検索を行う場合に、これらの関連する検索ウィンドウおよび検索結果表示ウィンドウをマルチウィンドウ化して表示し、いずれかのウィンドウの表示内容が変更されたときにはその変更内容が他のウィンドウにも自動的に反映されるように制御するので、任意の作業状態を必要に応じて残すことによって、エンドユーザが分析のために記憶しておく情報の量が減る。これにより分析作業を効率化することができる。また、複数画面を含む計算機端末の表示領域を有効に使うことができる。
【００７６】
また、請求項２６に記載のプログラムは、分析対象文書に登場する各用語の出現頻度を集計するプログラムであって、テキストマイニング結果を表示する２−Ｄマップに対して、列または行に対応する各カテゴリ項目をソートまたはクラスタリングして２−Ｄマップウィンドウを出力プログラムに出力する２−Ｄマップ表示画面制御ステップを含むテキストマイニング処理方法をコンピュータに実行させることを特徴とする。
【００７７】
このプログラムによれば、テキストマイニング結果を表示する２−Ｄマップに対して、列または行に対応する各カテゴリ項目をソートして２−Ｄマップウィンドウを出力プログラムに出力するので、注目すべきカテゴリ項目がオリジナルのカテゴリ定義順に特定位置に固定されるような場合、オリジナル順でソートしておくことでそれらのカテゴリ項目の発見が容易になる。また、注目すべきカテゴリ項目が出現頻度の高いものである場合、高い頻度順でソートしておくことでそれらのカテゴリ項目の発見が容易になる。さらに、注目すべきカテゴリ項目が特定の名前で始まるものである場合、アルファベット順にソートしておくことでそれらのカテゴリ項目の発見が容易になる。
【００７８】
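Also, according to this program, the category items corresponding to the columns or rows of the 2-D map that displays the text mining results are clustered and the 2-D map window is output to the output device. Grouping items that share a characteristic pattern into clusters therefore reduces the burden of searching for category items and makes the analysis work more efficient.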
【００７９】
また、請求項２７に記載のプログラムは、分析対象文書に登場する各用語の出現頻度を集計するプログラムであって、テキストマイニング時の各操作に対する操作時刻、ユーザ識別子、操作名、操作引数、操作対象、操作結果、操作意図に関するユーザのコメントのうち少なくとも一つに関する操作履歴情報を収集する操作履歴収集ステップを含むテキストマイニング処理方法をコンピュータに実行させることを特徴とする。
【００８０】
このプログラムによれば、テキストマイニング時の各操作に対する操作時刻、ユーザ識別子、操作名、操作引数、操作対象、操作結果、操作意図に関するユーザのコメントのうち少なくとも一つに関する操作履歴情報を収集するので、操作履歴に基づいて、表記辞書やカテゴリ辞書の登録内容をチェックすることができるようになる。また、後述する操作自動実行処理（バッチ処理）の指示（バッチスクリプト）を作成する際に雛型として用いることで、繁雑に行う分析処理を容易にバッチ処理化することができる。また、操作意図に関するユーザのコメントを格納することにより、作業履歴において対話操作が多数記録された場合でも、ユーザの作業意図等を手掛りにバッチ化する箇所を迅速に探すことができ、バッチスクリプト作成を効率化できる。また、ユーザが後でバッチ化したい箇所にコメントを入れることで、バッチスクリプト作成時にバッチの内容を検討する作業が軽減され、バッチスクリプトの作成を効率化できる。
【００８１】
また、請求項２８に記載のプログラムは、請求項２７に記載のプログラムにおいて、上記操作履歴収集ステップにより収集された上記操作履歴情報に基づいて、バッチスクリプトを作成し、当該バッチスクリプトを実行する操作自動実行ステップを含むテキストマイニング処理方法をコンピュータに実行させることを特徴とする。
【００８２】
このプログラムによれば、収集された操作履歴情報に基づいて、バッチスクリプトを作成し、バッチスクリプトを実行するので、一連の操作からなる分析をバッチ処理により繰り返し実行することで、エンドユーザをツール使用に拘束する時間を短縮することができる。また、一定期間毎に行う分析処理を自動的に行うことができるようになる。さらに、システム閑散期に負荷のかかる分析処理を実行することができるようになる。
【００８３】
また、請求項２９に記載のプログラムは、分析対象文書に登場する各用語の出現頻度を集計するプログラムであって、テキストマイニング処理に用いるカテゴリ辞書情報に登録された各カテゴリの集計結果を、木構造により階層化して出力プログラムに出力するカテゴリ階層化ステップと、上記カテゴリ階層化ステップにより出力された、木構造により階層化されたカテゴリのうち少なくとも一部を選択するカテゴリ選択ステップとを含むテキストマイニング処理方法をコンピュータに実行させることを特徴とする。
【００８４】
このプログラムによれば、テキストマイニング処理に用いるカテゴリ辞書情報に登録された各カテゴリの集計結果を、木構造により階層化して出力プログラムに出力するので、木構造で階層化し折畳みや展開操作を備えることによってユーザ対話インターフェースを介して画面上に一度に表示する概念項目数を抑制することができる。また、これにより、目的とする概念項目の探索が容易になる。
【００８５】
また、このプログラムによれば、出力された、木構造により階層化されたカテゴリのうち少なくとも一部を選択するので、対話的なテキストマイニング操作を行う際に、カテゴリを木構造により階層化して表示した画面から利用者が目的の部分カテゴリを選択することができるようになり、最終的な出力だけでなく操作の途中でも階層カテゴリを活用することができるようになる。また、これにより、カテゴリ部分を指定する必要のある対話的なテキストマイニング分析操作を、大規模なカテゴリ構造が対象である場合においても効率的に遂行できるようになる。
【００８６】
また、請求項３０に記載のプログラムは、請求項２９に記載のプログラムにおいて、上記カテゴリ階層化ステップにより木構造に階層化された各カテゴリの集計結果について中間ノードを概念項目として扱う場合に、当該中間ノードの子孫となる各リーフノード概念項目に対応する集計結果を当該中間ノードに対応する集計結果とするか、および／または、テキストマイニング処理に用いる表記辞書に当該中間ノードに対応する正規形または別表記形が定義されている場合に、当該正規形または当該別表記形を含む上記分析対象文書の集計結果を当該中間ノードに対応する集計結果とする、中間ノード集計ステップを含むテキストマイニング処理方法をコンピュータに実行させることを特徴とする。
【００８７】
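According to this program, when an intermediate node is treated as a concept item in the tabulation results of the categories hierarchized into a tree structure, either the tabulation results of the leaf-node concept items that are descendants of the intermediate node are taken as the tabulation result for the intermediate node (first tabulation method), and/or, when a canonical form or variant form corresponding to the intermediate node is defined in the notation dictionary used for text mining processing, the tabulation result of the analysis target documents that contain the canonical form or variant form is taken as the tabulation result for the intermediate node (second tabulation method). The first method can handle a concept category structure in which no canonical term corresponds to an intermediate node. It also allows flexible design of the category structure, such as dividing a large-scale concept category structure at an appropriate granularity. The second method, on the other hand, counts documents accurately when canonical terms corresponding to intermediate nodes do exist. This is often the case in concept category structures built from existing data structures, which can therefore be put to use. Choosing between the two methods case by case, or combining them, lowers the cost of creating a concept category structure and makes large-scale category concepts easier to use.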
【００８８】
また、本発明は記録媒体に関するものであり、請求項３１に記載の記録媒体は、上記請求項２１から３０のいずれか一つに記載されたプログラムを記録したことを特徴とする。
【００８９】
この記録媒体によれば、当該記録媒体に記録されたプログラムをコンピュータに読み取らせて実行することによって、請求項２１から３０のいずれか一つに記載されたプログラムをコンピュータを利用して実現することができ、これら各方法と同様の効果を得ることができる。
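【0090】
【Embodiments of the Invention】
Embodiments of the text mining processing apparatus, text mining processing method, program, and recording medium according to the present invention are described in detail below with reference to the drawings. The invention is not limited by these embodiments.
In particular, the following embodiments describe an example in which the invention is applied to a literature information database search system for natural-science literature such as biology, medicine, and chemistry. The invention is not limited to this case and can be applied in the same way to any system that searches literature information in any field.
【0091】
[Overview of the Invention]
An overview of the invention is given first, followed by a detailed description of its configuration and processing.
In outline, the invention has the following basic features. It aims to make the analysis of tabulation results more precise, more efficient, and automated in the text mining processing shown in FIG. 1. Specifically, it makes text mining analysis more precise by providing techniques for evaluating the analysis procedure (original text display, dictionary entry search, trace result display, and the like) and an analysis technique that uses syntactic structure. It also provides techniques for working with tabulation results efficiently, such as viewing analysis screens side by side (multi-windowing and the like) and sorting and clustering the category items of the 2-D map shown in FIG. 2. It further provides analysis automation techniques (operation history collection, automatic operation execution, and the like) and large-scale concept management techniques (tree-structured hierarchization, intermediate node tabulation, and the like). Each of these techniques is described later.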
【００９２】
［システム構成］
まず、本システムの構成について説明する。図３は、本発明が適用される本システムの構成の一例を示すブロック図であり、該構成のうち本発明に関係する部分のみを概念的に示している。本システムは、概略的に、テキストマイニング処理装置１００と、文献情報や配列情報等に関する外部データベースや各種検索処理等の外部プログラム等を提供する外部システム２００とを、ネットワーク３００を介して通信可能に接続して構成されている。
【００９３】
図３においてネットワーク３００は、テキストマイニング処理装置１００と外部システム２００とを相互に接続する機能を有し、例えば、インターネット等である。
【００９４】
図３において外部システム２００は、ネットワーク３００を介して、テキストマイニング処理装置１００と相互に接続され、利用者に対して文献情報や配列情報等に関する外部データベースや各種検索処理等の外部プログラム等を実行するウェブサイトを提供する機能を有する。
【００９５】
ここで、外部システム２００は、ＷＥＢサーバやＡＳＰサーバ等として構成してもよく、そのハードウェア構成は、一般に市販されるワークステーション、パーソナルコンピュータ等の情報処理装置およびその付属装置により構成してもよい。また、外部システム２００の各機能は、外部システム２００のハードウェア構成中のＣＰＵ、ディスク装置、メモリ装置、入力装置、出力装置、通信制御装置等およびそれらを制御するプログラム等により実現される。
【００９６】
図３においてテキストマイニング処理装置１００は、概略的に、テキストマイニング処理装置１００の全体を統括的に制御するＣＰＵ等の制御部１０２、通信回線等に接続されるルータ等の通信装置（図示せず）に接続される通信制御インターフェース部１０４、入力装置１１２や出力装置１１４に接続される入出力制御インターフェース部１０８、および、各種のデータベースやテーブルなどを格納する記憶部１０６を備えて構成されており、これら各部は任意の通信路を介して通信可能に接続されている。さらに、このテキストマイニング処理装置１００は、ルータ等の通信装置および専用線等の有線または無線の通信回線を介して、ネットワーク３００に通信可能に接続されている。
【００９７】
記憶部１０６に格納される各種のデータベースやテーブル（表記辞書情報ファイル１０６ａ〜バッチスクリプトファイル１０６ｆ）は、固定ディスク装置等のストレージ手段であり、各種処理に用いる各種のプログラムやテーブルやファイルやデータベースやウェブページ用ファイル等を格納する。
【００９８】
これら記憶部１０６の各構成要素のうち、表記辞書情報ファイル１０６ａは、各用語の正規形と別表記形との対応関係を定義する表記辞書情報を格納した表記辞書情報格納手段である。図１７は、表記辞書情報ファイル１０６ａに格納される表記辞書情報の一例を示す図である。この表記辞書情報ファイル１０６ａに格納される表記辞書情報は、図１７に示すように、正規形と別表記形との対応関係を定義している。
【００９９】
また、カテゴリ辞書情報ファイル１０６ｂは、正規形の所属するカテゴリを定義するカテゴリ辞書情報を格納するカテゴリ辞書情報格納手段である。図１８は、カテゴリ辞書情報ファイル１０６ｂに格納されるカテゴリ辞書情報の一例を示す図である。このカテゴリ辞書情報ファイル１０６ｂに格納されるカテゴリ辞書情報は、図１８に示すように、カテゴリと正規形との対応関係、および、カテゴリ構造（図１８ではカテゴリ構造の概念を示しており、実際のファイルにはノード（カテゴリ）毎の親ノードと子ノードの情報等を定義している。）を定義している。
【０１００】
また、分析対象文書ファイル１０６ｃは、解析対象の文書情報の原文情報や、その原文情報に設定されたリンク先のＵＲＬ等のアドレス情報等を格納する文書情報格納手段である。ここでアドレス情報は、原文中の一部分が外部データベースの識別子と解釈できる部分があれば、その外部データベースのハイパーリンク（ＷＷＷリンク）情報等を格納してもよい。
【０１０１】
また、操作履歴情報ファイル１０６ｄは、テキストマイニング時の各操作に対する操作時刻、ユーザ識別子、操作名、操作引数、操作対象、操作結果、操作意図に関するユーザのコメント等に関する操作履歴情報を格納する操作履歴情報格納手段である。
【０１０２】
また、処理結果ファイル１０６ｅは、制御部による各処理の処理結果や中間結果などのワークファイル等を格納する処理結果格納手段である。
【０１０３】
また、バッチスクリプトファイル１０６ｆは、バッチスクリプトに関する情報等を格納するバッチスクリプト情報格納手段である。
【０１０４】
また、図３において、通信制御インターフェース部１０４は、テキストマイニング処理装置１００とネットワーク３００（またはルータ等の通信装置）との間における通信制御を行う。すなわち、通信制御インターフェース部１０４は、他の端末と通信回線を介してデータを通信する機能を有する。
【０１０５】
また、図３において、入出力制御インターフェース部１０８は、入力装置１１２や出力装置１１４の制御を行う。ここで、出力装置１１４としては、モニタ（家庭用テレビを含む）の他、スピーカを用いることができる（なお、以下においては出力装置１１４をモニタとして記載する場合がある）。また、入力装置１１２としては、キーボード、マウス、および、マイク等を用いることができる。また、モニタも、マウスと協働してポインティングデバイス機能を実現する。
【０１０６】
また、図３において、制御部１０２は、ＯＳ（Ｏｐｅｒａｔｉｎｇ　Ｓｙｓｔｅｍ）等の制御プログラム、各種の処理手順等を規定したプログラム、および所要データを格納するための内部メモリを有し、これらのプログラム等により、種々の処理を実行するための情報処理を行う。制御部１０２は、機能概念的に、分析手順評価部１０２ａ、構文構造分析部１０２ｂ、マルチウィンドウ化部１０２ｃ、２−Ｄマップ表示画面制御部１０２ｄ、操作履歴収集部１０２ｅ、操作自動実行部１０２ｆ、カテゴリ階層化部１０２ｇ、中間ノード集計部１０２ｈ、および、テキストマイニング部１０２ｐを備えて構成されている。
【０１０７】
このうち、分析手順評価部１０２ａは、テキストマイニング部１０２ｐによるテキストマイニング処理の分析手順を評価する分析手順評価手段である。図４に示すように、分析手順評価部１０２ａは、原文表示画面制御部１０２ｉ、辞書エントリ検索画面制御部１０２ｊ、および、トレース結果表示画面制御部１０２ｋを備えて構成される。ここで、原文表示画面制御部１０２ｉは、分析対象文書の原文情報と、当該原文情報に含まれ、かつ、集計の対象となる用語のリストであって、当該用語毎に、用語の型、および／または、用語の格納先アドレスへのリンクボタンを対応付けた集計キーリスト情報とを出力装置に出力するように制御する原文表示制御手段である。また、辞書エントリ検索画面制御部１０２ｊは、利用者が入力した検索語と、当該検索語に基づいて表記辞書情報を検索して抽出した対応する正規形とその表記辞書エントリに関する情報と、当該検索語に基づいてカテゴリ辞書情報を検索して抽出した対応するカテゴリとそのカテゴリ辞書エントリに関する情報とを出力装置に出力するように制御する辞書エントリ検索画面制御手段である。また、トレース結果表示画面制御部１０２ｋは、分析対象文書の原文情報と、当該原文情報に含まれ、かつ、集計の対象となる用語に対して、表記辞書の検索結果、構文解析処理による品詞情報、カテゴリ辞書の検索結果のうち少なくとも一つを含むトレース結果情報とを出力装置に出力するように制御する辞書エントリ検索画面制御手段である。
【０１０８】
また、構文構造分析部１０２ｂは、分析対象文書の原文情報に対する構文解析の結果に従って、当該原文情報に含まれる名詞と動詞のｎ個の順序付き組み合わせを一つのカテゴリとして分析対象文書に対するテキストマイニングの集計処理を行う構文構造分析手段である。
【０１０９】
また、マルチウィンドウ化部１０２ｃは、テキストマイニング用の一つの検索ウィンドウによる検索結果から、別の検索ウィンドウを用いてさらに検索条件を絞って検索を行う場合に、これらの関連する検索ウィンドウおよび検索結果表示ウィンドウをマルチウィンドウ化して表示し、いずれかのウィンドウの表示内容が変更されたときにはその変更内容が他のウィンドウにも自動的に反映されるように制御するマルチウィンドウ化手段である。
【０１１０】
また、２−Ｄマップ表示画面制御部１０２ｄは、テキストマイニング結果を表示する２−Ｄマップに対して、列または行に対応する各カテゴリ項目をソートまたはクラスタリングして２−Ｄマップウィンドウを出力装置に出力する２−Ｄマップ表示画面制御手段である。ここで、２−Ｄマップ表示画面制御部１０２ｄは、図５に示すように、項目ソート部１０２ｍ、および、項目クラスタリング部１０２ｎを備えて構成される。項目ソート部１０２ｍは、テキストマイニング結果を表示する２−Ｄマップに対して、列または行に対応する各カテゴリ項目をソートして２−Ｄマップウィンドウを出力装置に出力する項目ソート手段である。また、項目クラスタリング部１０２ｎは、テキストマイニング結果を表示する２−Ｄマップに対して、列または行に対応する各カテゴリ項目をクラスタリングして２−Ｄマップウィンドウを出力装置に出力する項目クラスタリング手段である。
【０１１１】
また、操作履歴収集部１０２ｅは、テキストマイニング時の各操作に対する操作時刻、ユーザ識別子、操作名、操作引数、操作対象、操作結果、操作意図に関するユーザのコメントのうち少なくとも一つに関する操作履歴情報を収集する操作履歴収集手段である。
【０１１２】
また、操作自動実行部１０２ｆは、操作履歴収集手段により収集された操作履歴情報に基づいて、バッチスクリプトを作成し、当該バッチスクリプトを実行する操作自動実行手段である。
【０１１３】
また、カテゴリ階層化部１０２ｇは、テキストマイニング処理に用いるカテゴリ辞書情報に登録された各カテゴリの集計結果を、木構造により階層化して出力装置に出力するカテゴリ階層化手段である。
【０１１４】
また、中間ノード集計部１０２ｈは、カテゴリ階層化手段により木構造に階層化された各カテゴリの集計結果について中間ノードを概念項目として扱う場合に、当該中間ノードの子孫となる各リーフノード概念項目に対応する集計結果を当該中間ノードに対応する集計結果とするか、および／または、テキストマイニング処理に用いる表記辞書に当該中間ノードに対応する正規形または別表記形が定義されている場合に、当該正規形または当該別表記形を含む上記分析対象文書の集計結果を当該中間ノードに対応する集計結果とする、中間ノード集計手段である。
【０１１５】
また、テキストマイニング部１０２ｐは、例えば上述した図１に示すテキストマイニング処理により情報抽出結果に対して統計・分析処理を実行するテキストマイニング手段である。
なお、これら各部によって行なわれる処理の詳細については、後述する。
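【0116】
[System Processing]
Next, an example of the processing performed by the system of this embodiment, configured as described above, is described in detail with reference to FIG. 3 to FIG. 16.
【0117】
[Original Text Display Screen Control Processing]
The original text display screen control processing is described in detail with reference to FIG. 6 and other figures.
Through the processing of the original text display screen control unit 102i, the text mining processing apparatus 100 displays the original text information stored in the analysis target document file 106c on the output device 114 together with a list of the terms (keys) subject to tabulation. For example, when tabulating the frequency of occurrence of a canonical form that corresponds to a category registered in the category dictionary information file 106b, or of a variant form of that canonical form registered in the notation dictionary information file 106a, that canonical form and that variant form are the terms (keys) to be processed. In addition, if a key in the original text contains a part that can be interpreted as an identifier in an external database, the original text display screen control unit 102i attaches a hyperlink (WWW link) to it and displays the link as well.
【0118】
FIG. 6 shows an example of the original text display screen displayed on the output device 114. As shown in FIG. 6, one original text display window is provided per document. Each window consists of an original text information display area MA-1, a tabulation key list information display area MA-2, and so on. The tabulation key list information display area MA-2 consists of a display area MA-3 for the term type (part of speech, the specialist field the term belongs to, and so on), a display area MA-4 for the terms that appear in the original text, and, where needed, hyperlink buttons MA-5 to external databases. The items of the tabulation key list may be obtained from the intermediate output produced when the text mining system pre-processes the target text.
This completes the original text display screen control processing.
【0119】
[Dictionary Entry Search Screen Control Processing]
The dictionary entry search screen control processing is described in detail with reference to FIG. 7 and other figures. Through the processing of the dictionary entry search screen control unit 102j, the text mining processing apparatus 100 takes as input any word or multi-word expression specified by the user, searches the notation dictionary stored in the notation dictionary information file 106a and the category dictionary stored in the category dictionary information file 106b by the technique below, extracts the dictionary entries that are hit, and outputs them to the output device 114.
【0120】
First, the dictionary entry search screen control unit 102j searches the notation dictionary with the input search term and obtains the set of canonical forms that are hit. Next, it searches the category dictionary using both the input search term and each element of the canonical form set, and obtains the set of categories that are hit.
【0121】
Then, as the search result, it displays on the output device 114 the input term, its canonical forms, the categories they belong to, the name of the file or database containing each dictionary entry used for the conversion, and the identifier or position of that dictionary entry within the file or database.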
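The two-stage dictionary lookup can be sketched as follows. The entry layout (dicts with `file`, `id`, and so on) and the exact-match rule are assumptions made for illustration. The sample data loosely follows the example the patent gives for FIG. 7 (canonical form t1 from entry e1 of dictionary D1, category c1 from entry e2 of D2, c1 and c3 in the same dictionary), but the remaining file names and identifiers are hypothetical.

```python
def lookup(term, notation_entries, category_entries):
    """Search the notation dictionary, then the category dictionary.

    Returns (notation hits, category hits); each category hit is a pair of
    the key that matched and the category dictionary entry.
    """
    # Stage 1: canonical forms hit by the input term.
    canon_hits = [
        e for e in notation_entries
        if term == e["canonical"] or term in e["variants"]
    ]
    # Stage 2: search categories with the input term and every canonical form.
    keys = [term] + [e["canonical"] for e in canon_hits if e["canonical"] != term]
    cat_hits = [(k, e) for k in keys for e in category_entries if e["term"] == k]
    return canon_hits, cat_hits

notation_entries = [
    {"file": "D1", "id": "e1", "canonical": "t1", "variants": {"input"}},
]
category_entries = [
    {"file": "D2", "id": "e2", "category": "c1", "term": "input"},
    {"file": "D3", "id": "e3", "category": "c2", "term": "input"},  # hypothetical
    {"file": "D2", "id": "e4", "category": "c3", "term": "input"},  # hypothetical id
    {"file": "D2", "id": "e5", "category": "c1", "term": "t1"},
    {"file": "D4", "id": "e6", "category": "c4", "term": "t1"},
    {"file": "D5", "id": "e7", "category": "c5", "term": "t1"},     # hypothetical
]

canon_hits, cat_hits = lookup("input", notation_entries, category_entries)
categories = sorted({e["category"] for _, e in cat_hits})
for key, e in cat_hits:
    print(f'{key} -> {e["category"]} ({e["file"]}:{e["id"]})')
print(categories)
```

Keeping the file name and entry identifier with every hit is what lets the screen show which dictionary entry caused a term to land in a category.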
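【0122】
FIG. 7 shows an example of the dictionary entry search screen displayed on the output device 114. As shown in FIG. 7, the dictionary entry search screen consists of a search term input field MB-1, a search button MB-2, and a result display area MB-3.
【0123】
When the user enters a word or multi-word expression in the search term input field MB-1 and then selects the search button MB-2 with an input device 112 such as a mouse, the text mining processing apparatus 100 displays the search result in the result display area MB-3 through the processing of the dictionary entry search screen control unit 102j. In this example, searching the notation dictionary with the input term hit the canonical forms t1, t2, t3, ... (t2 onward are omitted in FIG. 7). Searching the category dictionary with the input term hit the categories c1, c2, c3, .... Category c1 was hit through the dictionary item with identifier e2 in the category dictionary with file name D2. The same applies to categories c2 and c3. Categories c1 and c3 are dictionary items that belong to the same dictionary (D2). Canonical form t1 was hit through the dictionary item with identifier e1 in the notation dictionary named D1. Canonical form t1 in turn hit the categories c1, c4, and c5. It follows that documents containing the input term belong at least to categories c1, c2, c3, c4, and c5.
This completes the dictionary entry search screen control processing.
【0124】
[Trace Result Display Screen Control Processing]
The trace result display screen control processing is described in detail with reference to FIG. 8 and other figures. Through the processing of the trace result display screen control unit 102k, the text mining processing apparatus 100 takes as input either original text information stored in the analysis target document file 106c or original text specified freely by the user, such as natural-language English sentences. Using the technique below, it applies the series of text mining pre-processing steps to that text in trace mode and displays trace information that makes clear how each element of the input text is recognized by the text mining system.
【0125】
First, the trace result display screen control unit 102k applies the notation dictionary stored in the notation dictionary information file 106a to the input original text and groups multi-word expressions into element structures. It then applies the technical term identification rules to the result and again groups multi-word expressions into element structures. It then applies a known syntactic analysis system to the result and attaches part-of-speech information to the element structures. Finally, it applies the category dictionary to the result and attaches category information to the element structures.
【0126】
The trace result display screen control unit 102k then displays the input and output items of each step as trace result information. For the notation dictionary and the category dictionary, it may also display trace information such as the name of the file or database containing each dictionary entry used and the identifier or position of that entry within the file or database.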
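A much-simplified sketch of such a trace is shown below. It assumes whitespace-separated input, greedy longest-match grouping of multi-word expressions, and a caller-supplied part-of-speech function standing in for the "known syntactic analysis system"; the technical term identification step is omitted. All names and sample entries are hypothetical.

```python
def trace(text, notation, pos_tagger, categories):
    """Return one trace record per element structure of the input text.

    notation:   {surface form: (canonical form, dictionary entry reference)}
    categories: {canonical form: [(category, dictionary entry reference), ...]}
    """
    words = text.split()
    max_len = max((len(k.split()) for k in notation), default=1)
    elements = []
    i = 0
    while i < len(words):
        # Try the longest multi-word expression first.
        for n in range(min(max_len, len(words) - i), 0, -1):
            surface = " ".join(words[i:i + n])
            if surface in notation:
                canonical, entry = notation[surface]
                elements.append({"surface": surface, "canonical": canonical,
                                 "notation_entry": entry})
                i += n
                break
        else:
            # No dictionary entry applied; the word is its own canonical form.
            elements.append({"surface": words[i], "canonical": words[i],
                             "notation_entry": None})
            i += 1
    for el in elements:
        el["pos"] = pos_tagger(el["surface"])
        el["categories"] = categories.get(el["canonical"], [])
    return elements

notation = {"tumor necrosis factor": ("TNF", "D1:e1")}
categories = {"TNF": [("cytokine", "D2:e5")]}
verbs = {"induces"}
tagger = lambda surface: "V" if surface in verbs else "N"  # toy stand-in for a parser

result = trace("tumor necrosis factor induces apoptosis", notation, tagger, categories)
for el in result:
    print(el)
```

Each record carries the dictionary entry references that were applied, which is the information a user needs in order to see why a given document ended up in a given category.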
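【0127】
FIG. 8 shows an example of the trace result display screen displayed on the output device 114.
As shown in FIG. 8, the trace result display screen consists of an original text input field MC-1 and a trace result display area MC-2. The text to be processed may be typed directly into the original text input field MC-1, or a document identifier may be entered in the document identifier input field MC-3 and the fetch button MC-4 pressed to obtain the text from the analysis target document file 106c. When the user selects the trace button MC-5, the trace result information is displayed in the trace result display area MC-2.
【0128】
The trace result display area repeats the following information for each element structure (word) of the original text. In the example of FIG. 8, the notation of word 1 was normalized to canonical form t1, canonical form t2, and so on. Item e1 of notation dictionary D1 was applied to normalize the word to canonical form t1 and to assign part of speech N. Technical term identification rule F was applied for canonical form t2. Syntactic analysis assigned no part of speech to word 1. Canonical form t1 belongs to categories c1, c4, and so on. Item e5 of category dictionary D2 was applied for membership in category c1, and item e6 of category dictionary D4 for membership in category c4.
This completes the trace result display screen control processing.
【0129】
[Syntactic Structure Analysis Processing]
The syntactic structure analysis processing is described in detail with reference to FIG. 9 and other figures. Through the processing of the syntactic structure analysis unit 102b, the text mining processing apparatus 100 follows the result of syntactic analysis of the original text of the analysis target documents stored in the analysis target document file 106c, and performs text mining tabulation on those documents by treating each ordered combination of n nouns and verbs contained in the text as one category. That is, from the result of the syntactic analysis executed by the text mining unit 102p, the syntactic structure analysis unit 102b takes every ordered combination of n nouns and verbs that appear in one sentence, treats each as a category, tabulates the analysis target documents accordingly, and uses the result for analyses such as the 2-D map.
【0130】
Any several of these ordered combination patterns may be regarded as belonging to the same category for tabulation and analysis. Patterns are grouped into one category by either of the following two techniques, or by a combination of them. The first treats combination patterns that differ only in the order of some of their parts as the same category. The second treats combination patterns as the same category when they differ only in words that belong to the same category.
【0131】
FIG. 9 is a conceptual diagram showing an example of the syntactic structure analysis processing of the invention. As shown in FIG. 9, documents containing a sentence in which specific words appear in a specific order are treated as belonging to the same category for text mining analysis. In the example of FIG. 9, documents containing a sentence that begins with noun n1 and in which verb v1 and a noun belonging to category c1 appear in either order are tabulated under the same category. In the pattern notation of FIG. 9, "*" means that any word element may appear, "(notation 1|notation 2)" means that either notation 1 or notation 2 is acceptable, and the sequence of notations gives their order.
This completes the syntactic structure analysis processing.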
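The pattern notation described for FIG. 9 can be sketched as a small matcher. The sketch assumes that "*" matches zero or more word elements, that a pattern must cover the whole sentence, and that a `cat:` prefix marks "any word of this category"; these conventions and the sample tokens are assumptions for illustration, not the patent's own syntax.

```python
def matches(pattern, tokens, categories):
    """Check whether a token sequence fits a pattern.

    pattern:    list whose items are "*" or a tuple of alternatives;
                an alternative "cat:c1" accepts any token in category c1.
    categories: {token: set of category names}
    """
    def item_ok(alternatives, token):
        return any(
            token == alt
            or (alt.startswith("cat:") and alt[4:] in categories.get(token, ()))
            for alt in alternatives
        )

    def rec(pi, ti):
        if pi == len(pattern):
            return ti == len(tokens)
        if pattern[pi] == "*":
            # "*" may absorb zero or more tokens.
            return any(rec(pi + 1, tj) for tj in range(ti, len(tokens) + 1))
        return (ti < len(tokens)
                and item_ok(pattern[pi], tokens[ti])
                and rec(pi + 1, ti + 1))

    return rec(0, 0)

# n1 first, then v1 and a c1 noun in either order: two patterns, one category.
patterns = [
    [("n1",), "*", ("v1",), "*", ("cat:c1",), "*"],
    [("n1",), "*", ("cat:c1",), "*", ("v1",), "*"],
]
categories = {"n2": {"c1"}, "n3": {"c1"}}

def in_category(tokens):
    return any(matches(p, tokens, categories) for p in patterns)

print(in_category(["n1", "x", "v1", "n2"]))   # v1 before the c1 noun
print(in_category(["n1", "n3", "y", "v1"]))   # c1 noun before v1
print(in_category(["v1", "n1", "n2"]))        # n1 is not first
```

Treating the two orderings as one category corresponds to the first grouping technique in the text, and matching by category rather than by literal word corresponds to the second.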
【０１３２】
［マルチウィンドウ化処理］
次に、マルチウィンドウ化処理の詳細について図１０等を参照して説明する。テキストマイニング処理装置１００は、マルチウィンドウ化部１０２ｃの処理により、テキストマイニング用の一つの検索ウィンドウによる検索結果から、別の検索ウィンドウを用いてさらに検索条件を絞って検索を行う場合に、これらの関連する検索ウィンドウおよび検索結果表示ウィンドウをマルチウィンドウ化して表示し、いずれかのウィンドウの表示内容が変更されたときにはその変更内容が他のウィンドウにも自動的に反映されるように制御する。すなわち、マルチウィンドウ化部１０２ｃは、テキストマイニング部１０２ｐ等により出力装置１１４に出力される検索ウィンドウ、頻度グラフウィンドウ、２−Ｄマップウィンドウ、時系列グラフウィンドウなどをそれぞれ独立のウィンドウとし、それぞれ複数の実体を持ちつつ、それらの情報に連携が取れるようにする。
【０１３３】
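FIG. 10 is a diagram showing an example of a multi-window display screen displayed on the output device 114. FIG. 10 shows an example in which three search windows (w1, w2, and w4) and two 2-D map windows (w3 and w5) are all displayed at the same time. Search window (w1) holds the document set used as the population. Search window (w2) holds the document set obtained by narrowing the set of search window (w1) with keyword kw1. 2-D map window (w3) displays the 2-D map analysis result for the document set of search window (w2). Search window (w4) holds the document set obtained by narrowing the set of search window (w1) with keyword kw2. 2-D map window (w5) displays the 2-D map analysis result for the document set of search window (w4).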
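One way to picture the linkage among w1 to w5 is a small dependency graph in which refreshing a window also refreshes the windows derived from it. The classes below are a hypothetical sketch under that assumption; the keyword filter is a plain substring test and the 2-D map analysis is replaced by a document count as a stand-in.

```python
class Window:
    """Base window: caches a result and refreshes the windows derived from it."""

    def __init__(self, name, source=None):
        self.name = name
        self.source = source
        self.dependents = []
        self.result = None
        if source is not None:
            source.dependents.append(self)

    def compute(self):
        raise NotImplementedError

    def refresh(self):
        self.result = self.compute()
        for window in self.dependents:
            window.refresh()


class SearchWindow(Window):
    def __init__(self, name, source=None, keyword=None, documents=None):
        super().__init__(name, source)
        self.keyword = keyword
        self.documents = documents if documents is not None else {}
        self.refresh()

    def compute(self):
        if self.source is None:
            return dict(self.documents)  # population window such as w1
        return {doc_id: text for doc_id, text in self.source.result.items()
                if self.keyword in text}


class MapWindow(Window):
    def __init__(self, name, source):
        super().__init__(name, source)
        self.refresh()

    def compute(self):
        return len(self.source.result)  # stand-in for a real 2-D map analysis


w1 = SearchWindow("w1", documents={"d1": "kw1 kw2", "d2": "kw1"})
w2 = SearchWindow("w2", source=w1, keyword="kw1")
w3 = MapWindow("w3", source=w2)
w4 = SearchWindow("w4", source=w1, keyword="kw2")
w5 = MapWindow("w5", source=w4)
```

Adding a document to `w1.documents` and calling `w1.refresh()` updates w2 through w5 without any further user action.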
This completes the multi-windowing processing.
【０１３４】
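[2-D map display screen control processing]
Next, the 2-D map display screen control processing is described in detail with reference to FIG. 11, FIG. 12, and other figures.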
【０１３５】
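Through the processing of the 2-D map display screen control unit 102d, the text mining processing apparatus 100 sorts or clusters the category items corresponding to the columns or rows of the 2-D map that displays the text mining results of the text mining unit 102p, and outputs the 2-D map window to the output device 114.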
【０１３６】
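For example, through the processing of the item sorting unit 102m, the 2-D map display screen control unit 102d sorts and displays the category items corresponding to the columns or rows of the 2-D map in original mode, frequency-order mode, alphabetical-order mode, or the like. In original mode, the category items are arranged in the order in which they are defined (stored) in the category dictionary or the like stored in the category dictionary information file 106b. In frequency-order mode, the sum of the frequency values of the column or row belonging to a category item is taken as the frequency value of that item, and the items are arranged in descending or ascending order of frequency value. In alphabetical-order mode, the items are arranged so that their name strings are in dictionary order.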
【０１３７】
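FIG. 11 is a diagram showing an example of the control (sort processing) of the 2-D map display screen displayed on the output device 114. As shown in FIG. 11, 2-D map window (w1) shows the state in which both the vertical and horizontal axes are sorted alphabetically by item name. 2-D map window (w2) shows the state in which the vertical axis is sorted alphabetically by item name and the horizontal axis by frequency. In 2-D map window (w2), the column frequency totals of the horizontal 2-D map items a, b, c, and f are 14, 18, 8, and 15 respectively, so the items are sorted with b, which has the largest total, at the far left and c, which has the smallest, at the far right. 2-D map window (w3) shows the state in which the vertical axis is sorted by frequency and the horizontal axis alphabetically by item name. In 2-D map window (w3), the row frequency totals of the vertical 2-D map items j, k, and p are 20, 19, and 16 respectively, so sorting causes no reordering. 2-D map window (w4) shows the state in which both axes are sorted by frequency.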
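The three sort modes can be sketched as below. The individual cell values are illustrative assumptions chosen only so that the column totals (14, 18, 8, 15) and row totals (20, 19, 16) agree with the FIG. 11 description; the function names are likewise hypothetical.

```python
# Cell values are illustrative; only the totals match the FIG. 11 description.
FREQ = {
    "j": {"a": 5, "b": 7, "c": 3, "f": 5},
    "k": {"a": 5, "b": 6, "c": 3, "f": 5},
    "p": {"a": 4, "b": 5, "c": 2, "f": 5},
}
ROWS = list(FREQ)
COLS = ["a", "b", "c", "f"]


def column_totals():
    return {c: sum(FREQ[r][c] for r in ROWS) for c in COLS}


def row_totals():
    return {r: sum(FREQ[r].values()) for r in ROWS}


def sort_items(items, mode, totals=None, original_order=None):
    """Return items ordered by dictionary definition order, total frequency
    (largest first), or name."""
    if mode == "original":
        return sorted(items, key=original_order.index)
    if mode == "frequency":
        return sorted(items, key=lambda item: totals[item], reverse=True)
    if mode == "alphabetical":
        return sorted(items)
    raise ValueError(f"unknown sort mode: {mode}")
```

With these totals, sorting the columns by frequency puts b first and c last, and sorting the rows by frequency leaves j, k, p unchanged, as in windows (w2) and (w3).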
【０１３８】
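Through the control of the item clustering unit 102n, the 2-D map display screen control unit 102d also clusters the category items of the columns or rows of the 2-D map by characterizing each item with a vector whose elements correspond to the items on the opposite axis. The item clustering unit 102n may define the similarity between category items as, for example, the inner product of their vectors. It may use an existing clustering algorithm and display the row and column category items as separate hierarchies.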
【０１３９】
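The item clustering unit 102n also rearranges the category items so that they fit within that hierarchy. It may rearrange them by either of the following methods or a combination of them. In the first method, several category items of interest are designated, and the clusters and category elements are rearranged so that clusters containing many designated category items gather at the origin (upper left) and the designated categories lie as close to the origin as possible. In the second method, the clusters are rearranged so that clusters containing many category elements lie closer to the origin (upper left).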
【０１４０】
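FIG. 12 is a diagram showing an example of the control (clustering) of the 2-D map display screen displayed on the output device 114. The example of FIG. 12 shows a 2-D map clustered along both columns and rows. For the rows, category items aa, ab, and ac are grouped into cluster c1; ad, ae, af, and ag into c2; ah and ai into c3; aj into c4; ak, al, and am into c5; and as into c7. Clusters c1 and c2 are further grouped into cluster c8; c3 and c4 into c9; c5, c6, and c7 into c10; c8 and c9 into c11; and c10 into c12. For the columns, category items ba and bb are grouped into cluster c20; bc, bd, and be into c21; bf and bg into c22; bh and bi into c23; bj into c24; bk and bl into c25; bm into c26; and bz into c28. Clusters c20 and c21 are further grouped into cluster c29; c22 and c23 into c30; c24, c25, and c26 into c31; c27 and c28 into c32; c29 and c30 into c33; and c31 and c32 into c34. The category items are rearranged, as in the figure, so that the cluster tree structure can be drawn in the plane.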
【０１４１】
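The item clustering unit 102n may also cluster the items by the following procedure.
(1) First, the item clustering unit 102n clusters the category items in the row direction (aa to as) by the following method.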
【０１４２】
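(1-1) Define a feature vector for each category item
The item clustering unit 102n takes, as the feature vector of a row category item, the vector whose elements are its co-occurrence frequencies with the column category items. For example, the feature vector of row category item aa is defined as ((aa, ba), (aa, bb), (aa, bc), ..., (aa, bz)), where (aa, ba) denotes the co-occurrence frequency of row category item aa and column category item ba (the frequency of documents containing both category items).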
【０１４３】
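(1-2) Cluster based on the similarity between category items, then rearrange and display
The item clustering unit 102n defines and computes the similarity between any two row category items as the inner product of their feature vectors defined above. A general clustering algorithm is then applied so that row category items with high similarity are grouped together.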
【０１４４】
(2) Next, the item clustering unit 102n clusters the category items in the column direction (ba to bz) by the method of (1) above with rows and columns interchanged.
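Steps (1-1) and (1-2) can be sketched with co-occurrence vectors and inner-product similarity. The text only says that a general clustering algorithm is applied, so the single-linkage agglomerative merge with a stopping threshold used here is an assumption made for illustration, as are the function names and the sample vectors.

```python
def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))


def cluster_items(vectors, threshold):
    """Greedy agglomerative clustering: repeatedly merge the two most similar
    clusters until no pair reaches the similarity threshold."""
    clusters = [[name] for name in vectors]

    def similarity(c1, c2):
        # Single linkage: the best inner product between any pair of members.
        return max(inner_product(vectors[a], vectors[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        best, i, j = max(
            (similarity(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical co-occurrence counts of row items against four column items.
ROW_VECTORS = {
    "aa": [3, 2, 0, 0],
    "ab": [2, 3, 0, 0],
    "ac": [0, 0, 4, 1],
    "ad": [0, 0, 1, 4],
}
```

Clustering the columns, as in step (2), amounts to calling the same function on the transposed frequency table. The raw inner product favors high-frequency items, so a normalized measure such as cosine similarity may be preferable in practice.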
This completes the 2-D map display screen control processing.
【０１４５】
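[Operation history collection processing]
Next, the operation history collection processing is described in detail with reference to FIG. 13.
Through the processing of the operation history collection unit 102e, the text mining processing apparatus 100 automatically records, in the operation history information file 106d, operation history information on interactively performed text mining operations, including items such as the operation time, user identifier, operation name, operation arguments, operation target, and operation result. In addition to these items, the operation history collection unit 102e may also record the user's comments on the intent of an operation. The user may enter such a comment by instructing the analysis tool to perform a comment input operation. A comment may take the form of text data, audio data, still image data, moving image data, or any combination of these. The operation history collection unit 102e can refer as needed to the operation history information collected in the operation history information file 106d, create an operation history collection screen, and display it on the output device 114.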
【０１４６】
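FIG. 13 is a diagram showing an example of the operation history collection screen displayed on the output device 114. As shown in FIG. 13, the operation history collection screen outputs the automatically backed-up work history information. In this example each row represents one history item and consists of seven columns: display area MD-1 for the reference identification number of the history item (history item number), display area MD-2 for the time of the operation, display area MD-3 for the identifier of the user who performed the operation, display area MD-4 for the name and type of the operation, display area MD-5 for the parameters and arguments of the operation, display area MD-6 for the data or file (identifier, etc.) that was the target of the operation, and display area MD-7 for the data (identifier, etc.) resulting from the operation. The system uses the reference identification number (history item number) to manage the history items.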
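The seven columns MD-1 to MD-7, plus the optional comment, map naturally onto a record type. The sketch below is an assumption about how such a log could be held in memory; the class and field names are hypothetical, and the sample rows only approximate the history items described for FIG. 13 (the operation name and target of item 371 and the target of item 370 are guesses).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HistoryItem:
    item_no: int      # MD-1: history item number
    time: str         # MD-2: time of the operation
    user: str         # MD-3: user identifier
    operation: str    # MD-4: operation name / type
    argument: str     # MD-5: parameters / arguments
    target: str       # MD-6: data or file operated on
    result: str       # MD-7: resulting data
    comment: Optional[str] = None  # optional note on the user's intent


class HistoryLog:
    def __init__(self, first_item_no=1):
        self.items = []
        self._next_no = first_item_no

    def record(self, time, user, operation, argument, target, result, comment=None):
        """Append one history item, numbering it automatically."""
        item = HistoryItem(self._next_no, time, user, operation,
                           argument, target, result, comment)
        self.items.append(item)
        self._next_no += 1
        return item

    def between(self, first_no, last_no):
        """Cut out a partial history, e.g. as raw material for a batch script."""
        return [i for i in self.items if first_no <= i.item_no <= last_no]


log = HistoryLog(first_item_no=370)
log.record("16:44", "KN", "Open db", "all", "", "Article set all")
log.record("16:45", "KN", "Load history", "1990-2002", "Article set all", "Article set 128")
log.record("16:46", "KN", "Search", "Protein A", "Article set 128", "Article set 129")
```

Automatic numbering keeps item numbers stable, so a later step can refer to a range of items when cutting out a partial history.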
【０１４７】
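In this example the operation history information is displayed in chronological order from top to bottom. The meaning of each history item is explained below in order.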
【０１４８】
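First, at 16:44 (history item number 370), user KN performed the "Open db" operation with "all" (all data handled by the text mining system) as the argument, making all the data available to analysis operations as "Article set all".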
【０１４９】
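Then, at 16:45 (history item number 371), the history of a search operation targeting data from 1990 to 2002 was loaded, generating "Article set 128" (where 128 is the identification number the text mining system uses to manage the document set).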
【０１５０】
Then, at 16:46 (history item number 372), document set 128 was searched for "Protein A", generating "Article set 129".
【０１５１】
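Then, at 16:47 (history item number 373), "Category M", located directly under the root of the category tree, was selected in the frequency graph window, and the cursor was moved to M.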
【０１５２】
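Then, at 16:51 (history item number 374), an expand operation was performed on Category M in the frequency graph window, displaying the category items that are the immediate children of M in the tree structure.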
【０１５３】
Then, at 16:51 (history item number 375), Category M/D, one of the children of M, was selected in the frequency graph window, and the cursor was moved to D.
【０１５４】
Then, at 16:52 (history item number 376), a frequency graph (Frequency graph 37) for category D and its children was generated for document set 129 and displayed in the frequency graph window.
【０１５５】
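Then, at 16:53 (history item number 377), in the 2-D map window, a 2-D map (2-D map 51) was generated with document set 129 as the argument, using as its vertical and horizontal axes the children of category D and the children of category "P/D/A" (where category A is a child of D, D is a child of P and is distinct from M/D, and P is a child of the root). Here 51 is the identification number the text mining system uses to manage the 2-D map.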
【０１５６】
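From 17:15 (history item number 378) to 17:36 (history item number 383), work similar to history item numbers 372 to 377 above was performed, except that the search operation of history item number 378 used "Protein B" (not A) as the search key.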
【０１５７】
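Then, at 18:05 (history item number 384), user KN selected the text-data comment input operation in the 2-D map window displaying "2-D map 52". As a result, the user's analysis intent, conclusions, and so on regarding "2-D map 52" were recorded as the operation argument of the history item.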
【０１５８】
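Then, at 18:06 (history item number 385), in the 2-D map window displaying "2-D map 52", the cell at the intersection of the 22nd category item of category D and the 3rd category item of category A was selected, and the set of documents in document set 130 in which those category items co-occur was generated as "Article set 131".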
This completes the operation history collection processing.
【０１５９】
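[Automatic operation execution processing]
Next, the automatic operation execution processing is described in detail with reference to FIG. 14. FIG. 14 is a conceptual diagram showing an example of the automatic operation execution processing.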
【０１６０】
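Through the processing of the automatic operation execution unit 102f, the text mining processing apparatus 100 creates a batch script based on the operation history information collected in the operation history information file 106d and executes that batch script. That is, the text mining processing apparatus 100 enables any sequence of interactive operations of the text mining tool to be executed as a batch by any one of the following three methods or a combination of them.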
【０１６１】
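In the first method, each function of the text mining system is made callable as a library of an existing programming language, and batch processing is written and executed in that language (execution may take the form of a stored procedure in Java or the like).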
【０１６２】
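In the second method, the text mining system is designed to be separated into an aggregation processing server and an interactive operation client, and batch processing is written and executed by a module that runs the prescribed communication protocol in place of the client.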
【０１６３】
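In the third method, an interpreter for a dedicated batch scripting language is built into the text mining system, and batch processing is written and executed in that scripting language.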
【０１６４】
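FIG. 14 shows an example of an embodiment. In this example, the text mining system comprises the automatic operation execution unit 102f, which includes a user interaction interface and a batch processing system, together with the operation history collection unit 102e, the text mining unit 102p, the analysis target document file 106c, the operation history information file 106d, and the batch script file 106f. The operation history collection unit 102e automatically accumulates the history of operations performed by the text mining unit 102p in the operation history information file 106d, for example by the "automatic backup of commented operation history" method described above. The user interaction interface of the automatic operation execution unit 102f, working with the operation history collection unit 102e as needed, provides a function for locating the desired partial history within the operation history in the operation history information file 106d.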
【０１６５】
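The user interaction interface of the automatic operation execution unit 102f, again working with the operation history collection unit 102e as needed, also provides a function for creating a batch script, either from scratch or with reference to a partial history, and registering it in the batch script file 106f. The batch processing system of the automatic operation execution unit 102f provides a function for receiving the identifier of a batch script and the ranges of its parameters from the user interface, retrieving the corresponding batch script from the batch script file 106f, and executing it.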
【０１６６】
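The lower part of FIG. 14 shows an example of a batch script. Batch script A in the figure was created with reference to history item numbers 372 to 377 of the history example shown in FIG. 13. First, history item numbers 372 to 377 were cut out, through the user interaction interface, from the operation history information accumulated by the operation history collection unit 102e. The argument (search keyword) and target (document set to be searched) of the "Search" operation were changed to the script parameters "PARAMETER 1" and "PARAMETER 2" respectively. The result of the "Search" operation was changed to the script variable "Article set a". Accordingly, the targets of the "Show" and "2-D map" operations were changed to the variable "Article set a". The result of the "Show" operation was changed to the variable "Frequency graph b", and the result of the "2-D map" operation to the variable "2-D map c".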
【０１６７】
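Batch script A is executed, for example, as follows. Suppose the user specifies the range of "PARAMETER 1" as (kw1, kw2, ..., kwn) and the range of "PARAMETER 2" as (Article set 100, Article set 101, ..., Article set 199), and instructs execution of batch script A. The batch processing system executes batch script A for every combination of the two parameters (100 × n combinations). At execution time, the parameter portions of the script are replaced with the actual data and the script is executed in order. For each script variable, data of the appropriate type is newly created at its first occurrence, and subsequent references in the script are replaced with that data. For example, if "Article set" items up to 172 exist when the "Search" operation is executed, "Article set 173" (= a) is newly created to hold the result, and "Article set a", the target of the "Show" and "2-D map" operations, is replaced with 173.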
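The expansion of batch script A over all parameter combinations can be sketched as follows. The textual script format, the `expand_batch` name, and the choice to resolve only the variable "Article set a" (leaving "Frequency graph b" and "2-D map c" untouched for brevity) are assumptions for illustration; the syntax actually shown in FIG. 14 is not reproduced in the text.

```python
from itertools import product

# Hypothetical textual form of batch script A.
SCRIPT_A = [
    "Search PARAMETER 1 in PARAMETER 2 -> Article set a",
    "Show Article set a -> Frequency graph b",
    "2-D map Article set a -> 2-D map c",
]


def expand_batch(script, keywords, article_sets, next_set_id):
    """Instantiate the script once per (keyword, article set) combination,
    allocating a fresh article-set number for variable 'a' in each run."""
    runs = []
    for keyword, target in product(keywords, article_sets):
        new_set = f"Article set {next_set_id}"
        next_set_id += 1
        runs.append([
            step.replace("PARAMETER 1", keyword)
                .replace("PARAMETER 2", target)
                .replace("Article set a", new_set)
            for step in script
        ])
    return runs
```

With n keywords and the 100 article sets numbered 100 to 199, this yields 100 × n instantiated scripts. The example passes a starting identifier of 200 only so that newly created sets do not collide with the input sets.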
This completes the automatic operation execution processing.
【０１６８】
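[Category hierarchization processing]
Next, the category hierarchization processing is described in detail with reference to FIG. 15.
Through the processing of the category hierarchization unit 102g, the text mining processing apparatus 100 arranges the aggregation results of the categories registered in the category dictionary information used for text mining processing, which is stored in the category dictionary information file 106b, into a hierarchy with a tree structure and outputs them to the output device 114. That is, the text mining processing apparatus 100 manages a large-scale set of concepts (thousands to tens of thousands) by giving it a tree structure. The tree structure may be generated from an existing data structure or newly provided, and it may be provided by any known technique. The category hierarchization unit 102g may have any interactive interface function for handling concepts and may use it to implement operations such as selecting, folding, and expanding tree nodes. Analysis operations are executed on the concept items or nodes that are immediate children of the selected concept node.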
【０１６９】
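FIG. 15 is a diagram showing an example of a hierarchized category display screen displayed on the output device 114. FIG. 15 shows an embodiment in which categories are hierarchized in a tree structure and displayed. The window on the left of FIG. 15 (w1) shows an example in which the concept category items are displayed as a one-dimensional list without hierarchization. All the concept items handled are listed vertically, and the scroll bar on the right is used to find an item in window (w1). The window on the right of FIG. 15 (w2) shows an example in which the concept category items are hierarchized in a tree structure and displayed in the style of an outline processor. A "+" or "-" button is attached to the left of each item: ordinary items and expanded nodes have a "-" button, and folded nodes have a "+" button. Pressing the "-" button of an expanded node (for example, "Category p3") folds the nodes below it (m1 and m2) and changes the button to "+". Conversely, pressing the "+" button of a folded node expands and displays its immediate children and changes the button to "-". If the expanded items do not fit in the window, the scroll bar is used to adjust the display area.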
This completes the category hierarchization processing.
【０１７０】
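[Intermediate node aggregation processing]
Next, the intermediate node aggregation processing is described in detail with reference to FIG. 16. FIG. 16 is a conceptual diagram showing an example of the intermediate node aggregation processing.
Through the processing of the intermediate node aggregation unit 102h, when an intermediate node is treated as a concept item in the aggregation results of the categories hierarchized into a tree structure by the category hierarchization unit 102g, the text mining processing apparatus 100 takes as the aggregation result for that intermediate node the aggregation results of the leaf node concept items that are its descendants, and/or, when a normal form or alternative notation form corresponding to the intermediate node is defined in the notation dictionary information file 106a used for text mining processing, the aggregation result of the analysis target documents containing that normal form or alternative notation form. That is, when an intermediate node of the tree-structured concept hierarchy described above is treated as a concept item (for example, a category the user has designated as an aggregation target), the intermediate node aggregation unit 102h associates it with documents by either of the following two methods or a combination of them.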
【０１７１】
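In the first method, the aggregation results of the leaf node concept items that are descendants of the intermediate node are taken as the aggregation result for that intermediate node. When counting documents, the count may be a total count that includes duplicates, or duplicates may be removed before counting.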
【０１７２】
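In the second method, when a normal form or alternative notation form is defined for the intermediate node itself, the aggregation result of the documents containing those words is taken as the aggregation result for that intermediate node.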
【０１７３】
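In the example shown in FIG. 16, normal forms kw1 and kw2 correspond to the intermediate node concept item p3, normal form kw3 corresponds to leaf node concept item m1, a child of p3, and normal forms kw4, kw5, and kw6 correspond to leaf node concept item m2, also a child of p3. In the document set being operated on, the numbers of documents hit by normal forms kw1, kw2, ..., kw6 are n1, n2, ..., n6 respectively. Under a policy of counting hits as a total count, the number of hit documents is n3 for concept item m1 and n4 + n5 + n6 for m2. The number of hit documents for the intermediate node concept item p3 is then as follows. With the first method it is the sum of the children's hit counts, n3 + n4 + n5 + n6. With the second method it is the sum of the hit counts of the corresponding normal forms, n1 + n2.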
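The two aggregation methods for the FIG. 16 example can be checked with a short sketch. The hit counts stand in for n1 to n6 and are made-up numbers, counting is by total count (duplicates included), and the data layout and function names are hypothetical.

```python
# Hypothetical hit counts standing in for n1..n6.
HITS = {"kw1": 4, "kw2": 1, "kw3": 7, "kw4": 2, "kw5": 3, "kw6": 5}

TREE = {
    "p3": {"normal_forms": ["kw1", "kw2"], "children": ["m1", "m2"]},
    "m1": {"normal_forms": ["kw3"], "children": []},
    "m2": {"normal_forms": ["kw4", "kw5", "kw6"], "children": []},
}


def own_hits(node):
    """Second method: hits of the normal forms defined on the node itself."""
    return sum(HITS[kw] for kw in TREE[node]["normal_forms"])


def descendant_leaf_hits(node):
    """First method: total hits of all leaf nodes below the node."""
    children = TREE[node]["children"]
    if not children:
        return own_hits(node)
    return sum(descendant_leaf_hits(child) for child in children)
```

For p3, the first method gives n3 + n4 + n5 + n6 (17 with these numbers) and the second gives n1 + n2 (5). Counting with duplicates removed would require sets of document identifiers instead of bare counts.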
This completes the intermediate node aggregation processing.
【０１７４】
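[Other embodiments]
Although embodiments of the present invention have been described above, the present invention may be implemented in various other embodiments within the scope of the technical idea set forth in the claims.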
【０１７５】
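For example, although the case in which the text mining processing apparatus 100 performs processing in a stand-alone form has been described, it may be configured to perform processing in response to a request from a client terminal housed separately from the text mining processing apparatus 100 and to return the processing result to that client terminal.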
【０１７６】
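Of the processes described in the embodiments, all or part of those described as being performed automatically may be performed manually, and all or part of those described as being performed manually may be performed automatically by known methods.
In addition, the processing procedures, control procedures, specific names, information including parameters such as various registered data and search conditions, screen examples, and database configurations shown in this document and the drawings may be changed arbitrarily unless otherwise specified.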
【０１７７】
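The illustrated components of the text mining processing apparatus 100 are functional and conceptual and need not be physically configured as illustrated.
For example, all or any part of the processing functions of the units or devices of the text mining processing apparatus 100, in particular the processing functions performed by the control unit 102, can be realized by a CPU (Central Processing Unit) and a program interpreted and executed by the CPU, or as hardware using wired logic. The program is recorded on a recording medium described later and is mechanically read by the text mining processing apparatus 100 as needed.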
【０１７８】
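That is, a computer program that cooperates with the OS (Operating System) to give instructions to the CPU and perform various processes is recorded in the storage unit 106, such as a ROM or HD. This computer program is executed by being loaded into RAM or the like and, in cooperation with the CPU, constitutes the control unit 102. The computer program may also be recorded on an application program server connected to the text mining processing apparatus 100 via an arbitrary network 300, and all or part of it may be downloaded as needed.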
【０１７９】
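The program according to the present invention can also be stored on a computer-readable recording medium. Here, the "recording medium" includes any "portable physical medium" such as a flexible disk, magneto-optical disk, ROM, EPROM, EEPROM, CD-ROM, MO, or DVD; any "fixed physical medium" such as a ROM, RAM, or HD built into a computer system; and a "communication medium" that holds the program for a short time, such as a communication line or carrier wave used when the program is transmitted over a network such as a LAN, WAN, or the Internet.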
【０１８０】
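A "program" is a data processing method written in any language or notation, regardless of form, such as source code or binary code. A "program" is not necessarily limited to a single unit; it includes programs distributed as multiple modules or libraries and programs that achieve their function in cooperation with a separate program such as the OS (Operating System). Well-known configurations and procedures can be used for the specific configuration for reading the recording medium in each device shown in the embodiments, the reading procedure, and the installation procedure after reading.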
【０１８１】
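The various databases and the like stored in the storage unit 106 (notation dictionary information file 106a to batch script file 106f) are storage means such as memory devices (RAM, ROM, etc.), fixed disk devices such as hard disks, flexible disks, and optical disks, and store the programs, tables, files, databases, web page files, and the like used for various processes and for providing websites.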
【０１８２】
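The text mining processing apparatus 100 may also be realized by connecting peripheral devices such as a printer, monitor, and image scanner to an information processing device such as a known personal computer or workstation and installing on that device software (including programs, data, and the like) that implements the method of the present invention.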
【０１８３】
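Furthermore, the specific form of distribution and integration of the text mining processing apparatus 100 is not limited to the one illustrated; all or part of it can be functionally or physically distributed or integrated in arbitrary units according to various loads and the like. For example, each database may be configured as an independent database device, and part of the processing may be realized using CGI (Common Gateway Interface).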
【０１８４】
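The network 300 has the function of interconnecting the text mining processing apparatus 100 and the external system 200, and may include, for example, any of the Internet, an intranet, a LAN (both wired and wireless), a VAN, a personal computer communication network, a public telephone network (both analog and digital), a leased line network (both analog and digital), a CATV network, a mobile circuit-switched or packet-switched network such as IMT2000, GSM, or PDC/PDC-P, a radio paging network, a local wireless network such as Bluetooth, a PHS network, and a satellite communication network such as CS, BS, or ISDB. That is, the system can send and receive various data via any network, wired or wireless.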
【０１８５】
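[Effects of the invention]
As described in detail above, according to the present invention, the original text information of an analysis target document is output to the output device together with aggregation key list information, which is a list of the terms contained in the original text information that are subject to aggregation, with each term associated with its type and/or a link button to the address where the term is stored. Because the original text is displayed together with the list of words that served as aggregation keys, the end user can easily see which operation in a series of analysis operations caused the document to be retrieved. As a result, even an inexperienced user can avoid operations that cause noise and perform accurate analysis. In addition, by placing links to external databases in the original text, the end user can learn exactly what the retrieved document is about. Using this information to learn which operations generate search noise improves the accuracy of the analysis work.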
【０１８６】
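Further, according to the present invention, control is performed so that the output device outputs the search word entered by the user, the corresponding normal form extracted by searching the notation dictionary information for the search word together with information on its notation dictionary entry, and the corresponding category extracted by searching the category dictionary information for the search word together with information on its category dictionary entry. By checking how the notation dictionary and category dictionary apply to a specific word, the user can select words suited to separating documents into the intended categories. By repeating word searches, the user can identify dictionary files that frequently contain words likely to be expanded into many groups of category items that should be kept separate, and can estimate the precision of those category groups. Furthermore, when a known term that characterizes a category is available, the recall of the category can be estimated by checking whether a dictionary entry for that term exists.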
【０１８７】
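Further, according to the present invention, control is performed so that the output device outputs the original text information of an analysis target document together with trace result information that includes, for each term contained in the original text information and subject to aggregation, at least one of the notation dictionary search result, the part-of-speech information from syntax analysis, and the category dictionary search result. By showing which dictionary entry normalized and categorized each element of the original text, the reason the document was retrieved can be understood in greater detail. Application information that cannot be judged from the terms alone, such as how collocations were grouped and the normal forms assigned by the syntax analysis system, can also be grasped accurately. This also makes it possible to identify which term caused a seemingly irrelevant original text to be retrieved.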
【０１８８】
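Further, according to the present invention, text mining aggregation is performed on the analysis target documents with each ordered combination of n nouns and verbs contained in the original text information treated as one category, in accordance with the result of syntax analysis of the original text information. By aggregating patterns of n-ary relations, documents that could not be distinguished by the kinds of terms alone can be separated, further improving analysis accuracy.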
【０１８９】
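Further, according to the present invention, when a further search with narrowed conditions is performed in another search window, the related search windows and search result display windows are displayed as multiple windows, and control is performed so that a change in the display content of any window is automatically reflected in the other windows. Because any working state can be kept on screen as needed, the amount of information the end user must remember for the analysis is reduced, which makes the analysis work more efficient. The display area of a computer terminal with multiple screens can also be used effectively.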
【０１９０】
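Further, according to the present invention, the category items corresponding to the columns or rows of the 2-D map displaying the text mining results are sorted before the 2-D map window is output to the output device. When the category items of interest sit at fixed positions in the original category definition order, sorting in original order makes them easy to find. When the category items of interest are those with a high frequency of appearance, sorting in descending order of frequency makes them easy to find. When the category items of interest begin with a specific name, sorting alphabetically makes them easy to find.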
【０１９１】
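Further, according to the present invention, the category items corresponding to the columns or rows of the 2-D map displaying the text mining results are clustered before the 2-D map window is output to the output device. Grouping items that share a characteristic pattern into clusters reduces the burden of searching for category items and makes the analysis work more efficient.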
【０１９２】
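Further, according to the present invention, operation history information on at least one of the operation time, user identifier, operation name, operation arguments, operation target, operation result, and user's comment on the intent of the operation is collected for each operation during text mining, so the registered contents of the notation dictionary and category dictionary can be checked based on the operation history. By using the history as a template when creating the instructions (batch script) for the automatic operation execution processing (batch processing), frequently performed analyses can easily be turned into batch processes. Storing the user's comments on operation intent allows the portion to be batched to be found quickly, using the user's work intent as a clue, even when many interactive operations are recorded in the work history, which makes batch script creation more efficient. Moreover, if the user leaves a comment at a point that is to be batched later, the work of deciding the contents of the batch when creating the batch script is reduced, again making batch script creation more efficient.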
【０１９３】
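Further, according to the present invention, a batch script is created based on the collected operation history information and then executed. Repeating an analysis consisting of a series of operations as a batch process shortens the time the end user is tied to the tool. Analyses performed at regular intervals can be run automatically, and analyses that impose a heavy load can be run during periods of low system activity.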
【０１９４】
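Further, according to the present invention, the aggregation results of the categories registered in the category dictionary information used for text mining processing are hierarchized in a tree structure and output to the output device. Hierarchizing with a tree structure and providing fold and expand operations limits the number of concept items displayed on screen at one time through the user interaction interface, which makes it easier to find the desired concept item.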
【０１９５】
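Further, according to the present invention, at least part of the output categories hierarchized in a tree structure is selected. During interactive text mining operations, the user can therefore select the desired partial category from a screen that displays the categories hierarchized in a tree structure, so the hierarchical categories can be used during operations as well as in the final output. This allows interactive text mining analysis operations that require specifying part of the categories to be carried out efficiently even when the category structure is large.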
【０１９６】
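Furthermore, according to the present invention, when an intermediate node is treated as a concept item in the aggregation results of the categories hierarchized into a tree structure, the aggregation results of the leaf node concept items that are descendants of the intermediate node are taken as the aggregation result for the intermediate node (first aggregation method), and/or, when a normal form or alternative notation form corresponding to the intermediate node is defined in the notation dictionary used for text mining processing, the aggregation result of the analysis target documents containing that normal form or alternative notation form is taken as the aggregation result for the intermediate node (second aggregation method). The first aggregation method can handle concept category structures in which no normal-form word corresponds to an intermediate node, and it permits flexible category structure design, such as dividing a large concept category structure into parts of suitable granularity. The second aggregation method counts documents accurately when the concept category structure has normal-form words corresponding to intermediate nodes; concept category structures built from existing data structures are often of this kind and can thus be put to use. Choosing between the two methods as appropriate, or combining them, lowers the cost of creating concept category structures and makes large-scale category concepts easier to use.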
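[Brief description of the drawings]
[FIG. 1] A conceptual diagram showing an outline of text mining processing.
[FIG. 2] A conceptual diagram showing the concept of the 2-D map displayed in step SA-6 of FIG. 1.
[FIG. 3] A block diagram showing an example of the configuration of the system to which the present invention is applied.
[FIG. 4] A block diagram showing an example of the configuration of the analysis procedure evaluation unit 102a to which the present invention is applied.
[FIG. 5] A block diagram showing an example of the configuration of the 2-D map display screen control unit 102d to which the present invention is applied.
[FIG. 6] A diagram showing an example of the original text display screen displayed on the output device 114.
[FIG. 7] A diagram showing an example of the dictionary entry search screen displayed on the output device 114.
[FIG. 8] A diagram showing an example of the trace result display screen displayed on the output device 114.
[FIG. 9] A conceptual diagram showing an example of the syntax structure analysis processing of the present invention.
[FIG. 10] A diagram showing an example of the multi-window display screen displayed on the output device 114.
[FIG. 11] A diagram showing an example of the control (sort processing) of the 2-D map display screen displayed on the output device 114.
[FIG. 12] A diagram showing an example of the control (clustering) of the 2-D map display screen displayed on the output device 114.
[FIG. 13] A diagram showing an example of the operation history collection screen displayed on the output device 114.
[FIG. 14] A conceptual diagram showing an example of the automatic operation execution processing.
[FIG. 15] A diagram showing an example of the hierarchized category display screen displayed on the output device 114.
[FIG. 16] A conceptual diagram showing an example of the intermediate node aggregation processing.
[FIG. 17] A diagram showing an example of the notation dictionary information stored in the notation dictionary information file 106a.
[FIG. 18] A diagram showing an example of the category dictionary information stored in the category dictionary information file 106b.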
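[Explanation of reference signs]
100 Text mining processing apparatus
102 Control unit
102a Analysis procedure evaluation unit
102b Syntax structure analysis unit
102c Multi-windowing unit
102d 2-D map display screen control unit
102e Operation history collection unit
102f Automatic operation execution unit
102g Category hierarchization unit
102h Intermediate node aggregation unit
102i Original text display screen control unit
102j Dictionary entry search screen control unit
102k Trace result display screen control unit
102m Item sorting unit
102n Item clustering unit
102p Text mining unit
104 Communication control interface unit
106 Storage unit
106a Notation dictionary information file
106b Category dictionary information file
106c Analysis target document file
106d Operation history information file
106e Processing result file
106f Batch script file
108 Input/output control interface unit
112 Input device
114 Output device
200 External system
300 Network
[0001]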
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a text mining processing apparatus, a text mining processing method, a program, and a recording medium, and in particular to a text mining processing apparatus, text mining processing method, program, and recording medium capable of making analysis by text mining more sophisticated, more efficient, and automated.
[0002]
[Prior art]
In recent years, document databases storing various technical documents such as papers have been built and are widely used via the Internet and other channels. One example is PubMed, through which the U.S. National Center for Biotechnology Information (NCBI) provides literature data from the U.S. National Library of Medicine (NLM) and other sources (URL of PubMed on the Internet: http://www.ncbi.nlm.gov/entrez/).
[0003]
In conventional document database search services, a "notation dictionary" for mapping between the normal form and the notation forms of each term, a "category dictionary" for classifying each term into categories, and the like are used to improve search efficiency.
[0004]
For example, TAKMI (product name) of IBM (company name) is a text mining system that uses existing notation dictionaries and category dictionaries (URL of the IBM Tokyo Research Laboratory page introducing its text mining technology: http://www.trl.ibm.com/projects/s7710/tm/index.htm; URL of the TAKMI introduction page: http://www.trl.ibm.com/projects/s7710/tm/takmi/tamki/takmi/takmi/takmi/tamik).
[0005]
There are also thesaurus search services for medical terms, such as MeSH (Medical Subject Headings) (URL of the NLM MeSH home page: http://www.nlm.nih.gov/mesh/meshhome.html; URL of the page describing MeSH: http://www.nlm.nih.gov/mesh/patterns.html; URL of the MeSH Browser service: http://www.ncbi.nih.gov/entrez. cgi).
[0006]
[Problems to be solved by the invention]
Here, an outline of the text mining system is described with reference to the drawings. FIG. 1 is a conceptual diagram showing an outline of the text mining process.
[0007]
As shown in FIG. 1, the system performs the following procedure to associate concepts with the character strings of words appearing in each item of document information in the group of documents to be analyzed.
[0008]
First, a notation dictionary is created (manually created), and the notation dictionary is applied to each word of the document information described in English or Japanese (step SA-1).
[0009]
Then, technical terms are identified in the partially word-segmented document information according to discrimination rules (step SA-2), after which syntax analysis is applied (step SA-3). The order of applying the notation dictionary and performing the syntax analysis is arbitrary, and the two may be executed in parallel.
[0010]
Then a category dictionary is created (manually) and applied to the appropriate sentence structures of the document information obtained from the syntax analysis and to the result of applying the notation dictionary, so that the terms are categorized. The terms corresponding to each category are tallied, and an index is created (step SA-4).
[0011]
Then the frequency of appearance of the categorized concepts and the like is calculated and tabulated (step SA-5), and the result is displayed in a format such as a frequency graph that plots the frequency of appearance of words in the document information, a time-series graph that plots frequency by document publication date or the like, or a 2-D map as shown in FIG. 2 (step SA-6). The user then extracts the desired information manually and visually from the displayed information, such as the appearance frequencies.
[0012]
Here, FIG. 2 is a conceptual diagram showing the concept of the 2-D map displayed in step SA-6 of FIG.
As shown in FIG. 2, the 2-D map displays appearance frequencies two-dimensionally by category. Normally, each cell of the 2-D map displays the appearance frequency of documents containing terms that belong to both of the two categories corresponding to its column and row, together with the ratio of that frequency to the total appearance frequency of its row. Desired information is then usually extracted by focusing on cells with a high appearance frequency ratio (the yyy values in the cells of FIG. 2).
[0013]
As described above, in existing text mining systems the end user reaches the original text through a series of interactive analysis operations. The reliability of each operation differs depending on how the original text was processed, but the end user has no means of obtaining that reliability directly. That is, it has been difficult to check directly which terms were extracted from which documents. Experience and skill are therefore required to extract useful information with existing text mining systems. To lower the barrier for general users, information useful for estimating the reliability of interactive analysis operations needs to be provided, but no text mining system has done so.
[0014]
Further, in the conventional method, words having the same notation are always counted as the same category, so that words whose meaning changes depending on the context cannot be correctly handled.
[0015]
Further, since a single screen is used to handle a plurality of document sets and analysis axes, the analysis has relied on the end user's memory.
[0016]
In addition, when the number of category elements grows in 2-D map analysis, it becomes difficult to find the category elements of interest.
[0017]
In addition, since the analysis is advanced by the user performing an interactive operation, the processing time is long when there are many analysis targets and types of analysis.
[0018]
In addition, when a large-scale concept dictionary (thousands to tens of thousands of categories) is used, listing or searching concept items with a one-dimensional list is impractical.
[0019]
As described above, conventional systems have a number of problems, and as a result they are inconvenient for both users and administrators and are used inefficiently.
[0020]
The prior art and the problems to be solved by the invention described so far are not limited to literature database search systems for natural science fields such as biology, medicine, and chemistry; the same considerations apply to systems that search literature information in any field.
[0021]
The present invention has been made in view of the above problems, and its object is to provide a text mining processing apparatus, a text mining processing method, a program, and a recording medium that can make analysis by text mining more sophisticated, more efficient, and automated.
[0022]
[Means for Solving the Problems]
To achieve this object, the text mining processing apparatus according to claim 1 is a text mining processing apparatus that counts the frequency of appearance of each term appearing in analysis target documents, and comprises original text display control means for performing control so that the output device outputs the original text information of an analysis target document together with aggregation key list information, which is a list of the terms contained in the original text information that are subject to aggregation, with each term associated with its type and/or a link button to the address where the term is stored.
[0023]
According to this apparatus, the original text information of an analysis target document is output to the output device together with aggregation key list information, which is a list of the terms contained in the original text information that are subject to aggregation, with each term associated with its type and/or a link button to the address where the term is stored. Because the original text is displayed together with the list of words that served as aggregation keys, the end user can easily see which operation in a series of analysis operations caused the document to be retrieved. As a result, even an inexperienced user can avoid operations that cause noise and perform accurate analysis. In addition, by placing links to external databases in the original text, the end user can learn exactly what the retrieved document is about. Using this information to learn which operations generate search noise improves the accuracy of the analysis work.
[0024]
The text mining processing apparatus according to claim 2 is a text mining processing apparatus that counts the frequency of appearance of each term appearing in analysis target documents, and comprises dictionary entry search screen control means for performing control so that the output device outputs the search word entered by the user, the corresponding normal form extracted by searching the notation dictionary information for the search word together with information on its notation dictionary entry, and the corresponding category extracted by searching the category dictionary information for the search word together with information on its category dictionary entry.
[0025]
According to this apparatus, control is performed so that the output device outputs the search word entered by the user, the corresponding normal form extracted by searching the notation dictionary information for the search word together with information on its notation dictionary entry, and the corresponding category extracted by searching the category dictionary information for the search word together with information on its category dictionary entry. By checking how the notation dictionary and category dictionary apply to a specific word, the user can select words suited to separating documents into the intended categories. By repeating word searches, the user can identify dictionary files that frequently contain words likely to be expanded into many groups of category items that should be kept separate, and can estimate the precision of those category groups. Furthermore, when a known term that characterizes a category is available, the recall of the category can be estimated by checking whether a dictionary entry for that term exists.
[0026]
The text mining processing apparatus according to claim 3 is a text mining processing apparatus that counts the frequency of appearance of each term appearing in analysis target documents, and comprises dictionary entry search screen control means for performing control so that the output device outputs the original text information of an analysis target document together with trace result information that includes, for each term contained in the original text information and subject to aggregation, at least one of the notation dictionary search result, the part-of-speech information from syntax analysis, and the category dictionary search result.
[0027]
According to this apparatus, control is performed so that the output device outputs the original text information of an analysis target document together with trace result information that includes, for each term contained in the original text information and subject to aggregation, at least one of the notation dictionary search result, the part-of-speech information from syntax analysis, and the category dictionary search result. By showing which dictionary entry normalized and categorized each element of the original text, the reason the document was retrieved can be understood in greater detail. Application information that cannot be judged from the terms alone, such as how collocations were grouped and the normal forms assigned by the syntax analysis system, can also be grasped accurately. This also makes it possible to identify which term caused a seemingly irrelevant original text to be retrieved.
[0028]
The text mining processing apparatus according to claim 4 is a text mining processing apparatus that counts the frequency of appearance of each term appearing in analysis target documents, and comprises syntax structure analysis means for performing text mining aggregation on the analysis target documents, treating each ordered combination of n nouns and verbs contained in the original text information as one category, in accordance with the result of syntax analysis of the original text information of the analysis target documents.
[0029]
According to this device, according to the result of the syntax analysis of the original text information of the analysis target document, the text mining tally process is performed on the analysis target document with n ordered combinations of nouns and verbs included in the original text information as one category. Therefore, by using the pattern of the n-term relation as an aggregation target, it is possible to separate documents that cannot be determined only by the type of term, and to further improve the analysis accuracy.
[0030]
The text mining processing apparatus according to claim 5 is a text mining processing apparatus that counts the frequency of appearance of each term appearing in analysis target documents, and comprises multi-windowing means that, when a further search with narrowed search conditions is performed in another search window on the search results of one text mining search window, displays the related search windows and search result display windows as multiple windows and performs control so that a change in the display content of any window is automatically reflected in the other windows.
[0031]
According to this apparatus, when a further search with narrowed conditions is performed in another search window, the related search windows and search result display windows are displayed as multiple windows, and control is performed so that a change in the display content of any window is automatically reflected in the other windows. Because any working state can be kept on screen as needed, the amount of information the end user must remember for the analysis is reduced, which makes the analysis work more efficient. The display area of a computer terminal with multiple screens can also be used effectively.
[0032]
The text mining processing apparatus according to claim 6 is a text mining processing apparatus that counts the frequency of appearance of each term appearing in analysis target documents, and comprises 2-D map display screen control means for sorting or clustering the category items corresponding to the columns or rows of a 2-D map that displays text mining results and outputting the 2-D map window to the output device.
[0033]
According to this apparatus, the category items corresponding to the columns or rows of the 2-D map displaying the text mining results are sorted before the 2-D map window is output to the output device. When the category items of interest sit at fixed positions in the original category definition order, sorting in original order makes them easy to find. When the category items of interest are those with a high frequency of appearance, sorting in descending order of frequency makes them easy to find. When the category items of interest begin with a specific name, sorting alphabetically makes them easy to find.
[0034]
Further, according to this device, for a 2-D map that displays a text mining result, the category items corresponding to the columns or rows are clustered and a 2-D map window is output to the output device. Grouping items that share a common pattern into clusters reduces the burden of searching for category items and makes the analysis work more efficient.
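The specification does not state which clustering algorithm is used. As a deliberately simple stand-in, the following sketch groups row items whose rows have the same set of non-empty columns; the function `cluster_rows` and the sample table are hypothetical.

```python
from collections import defaultdict

# Hypothetical 2-D map: row category item -> {column category item -> count}.
table = {
    "kinase": {"apoptosis": 4, "growth": 0},
    "receptor": {"apoptosis": 0, "growth": 6},
    "caspase": {"apoptosis": 2, "growth": 0},
    "ligand": {"apoptosis": 0, "growth": 1},
}


def cluster_rows(table):
    """Group row items that share the same pattern of non-zero columns."""
    clusters = defaultdict(list)
    for row, counts in table.items():
        signature = frozenset(col for col, n in counts.items() if n > 0)
        clusters[signature].append(row)
    # Clusters appear in order of first occurrence; members are sorted by name.
    return [sorted(members) for members in clusters.values()]


clusters = cluster_rows(table)
display_order = [row for members in clusters for row in members]
```

Displaying the rows in `display_order` places items with a common pattern next to each other; a real implementation would more likely use a distance-based method over the count vectors.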
[0035]
A text mining processing device according to claim 7 is a text mining processing device that counts the frequency of appearance of each term appearing in a document to be analyzed, and comprises an operation history collection unit for collecting, for each operation during text mining, operation history information on at least one of an operation time, a user identifier, an operation name, an operation argument, an operation target, an operation result, and a user's comment on the operation intention.
[0036]
According to this device, operation history information on at least one of the operation time, user identifier, operation name, operation argument, operation target, operation result, and user's comment on the operation intention is collected for each operation during text mining, so the registered contents of the notation dictionary and the category dictionary can be checked on the basis of the operation history. In addition, by using the operation history as a template when creating the instructions (batch script) for the automatic operation execution process (batch process) described later, frequently performed analysis processes can easily be turned into batches. In addition, because the user's comments on the operation intention are stored, the places to be turned into a batch can be found quickly, using the user's work intention and the like as a clue, even when many interactive operations are recorded in the work history, which makes batch script creation more efficient. Also, when the user leaves a comment at a place that he or she wants to turn into a batch later, less work is needed to examine the batch contents when the batch script is created, which again makes batch script creation more efficient.
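One way to picture the collected history is as a list of records with the seven fields named above. The record type `OperationRecord`, the class `OperationHistory`, and the sample values below are assumptions for illustration only; the specification does not prescribe a storage format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class OperationRecord:
    """One logged operation; field names mirror the items listed in the text."""
    time: str
    user_id: str
    name: str
    arguments: dict
    target: str
    result: str
    comment: Optional[str] = None  # user's note on the intention, if any


class OperationHistory:
    def __init__(self):
        self.records = []

    def log(self, record):
        self.records.append(record)

    def commented(self):
        """Records the user annotated, e.g. as candidates for a later batch."""
        return [r for r in self.records if r.comment]

    def to_json_lines(self):
        return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in self.records)


history = OperationHistory()
history.log(OperationRecord("2002-08-16T10:00:00", "user01", "search",
                            {"term": "kinase"}, "corpus-A", "2 documents",
                            comment="start of weekly report"))
history.log(OperationRecord("2002-08-16T10:01:00", "user01", "narrow",
                            {"term": "growth"}, "window-1", "1 document"))
```

Filtering on the comment field, as `commented()` does, corresponds to using the user's stated intention as a clue when looking for the part of the history to turn into a batch.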
[0037]
Further, the text mining processing device according to claim 8 is the text mining processing device according to claim 7, further comprising an automatic operation execution unit for creating a batch script based on the operation history information collected by the operation history collection unit and executing the batch script.
[0038]
According to this device, a batch script is created based on the collected operation history information and then executed, so an analysis consisting of a series of operations can be executed repeatedly by batch processing, which shortens the time the end user spends operating the tool. In addition, analysis processes that are performed at regular intervals can be carried out automatically. Further, analysis processing that places a heavy load on the system can be executed during the system's off-peak periods.
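The step from history to batch script can be sketched as replaying the logged operation names and arguments. Everything in this example (the history layout, the functions `build_batch_script` and `run_batch`, and the three toy operations) is hypothetical and stands in for whatever operations the actual tool exposes.

```python
# Illustrative corpus and operations; none of these names come from the source.
corpus = {
    1: "kinase inhibits apoptosis",
    2: "kinase activates growth",
    3: "receptor inhibits growth",
}


def search(state, term):
    return {doc_id for doc_id, text in corpus.items() if term in text}


def narrow(state, term):
    return {doc_id for doc_id in state if term in corpus[doc_id]}


def count(state):
    return len(state)


operations = {"search": search, "narrow": narrow, "count": count}

# Assumed shape of the collected operation history.
history = [
    {"name": "search", "arguments": {"term": "kinase"}, "comment": "weekly report"},
    {"name": "narrow", "arguments": {"term": "growth"}, "comment": None},
    {"name": "count", "arguments": {}, "comment": None},
]


def build_batch_script(history):
    """Keep only what is needed to replay each operation."""
    return [(entry["name"], entry["arguments"]) for entry in history]


def run_batch(script, operations, state=None):
    """Execute the operations in order, feeding each result to the next."""
    for name, arguments in script:
        state = operations[name](state, **arguments)
    return state


script = build_batch_script(history)
result = run_batch(script, operations)
```

Because the script is plain data, it could be stored and run by a scheduler at regular intervals or during off-peak hours, as the paragraph suggests.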
[0039]
A text mining processing device according to claim 9 is a text mining processing device that counts the frequency of appearance of each term appearing in a document to be analyzed, and comprises category hierarchization means for arranging the aggregation result of each category registered in the category dictionary information used for the text mining process into a hierarchy by means of a tree structure and outputting it to an output device, and category selection means for selecting at least a part of the categories hierarchized by the tree structure and output by the category hierarchization means.
[0040]
According to this device, the aggregation result of each category registered in the category dictionary information used for the text mining process is hierarchized by a tree structure and output to the output device. Thus, the number of concept items displayed on the screen at one time via the user interactive interface can be suppressed. In addition, this makes it easy to search for a target concept item.
[0041]
In addition, according to this device, at least a part of the output categories hierarchized by the tree structure is selected. When performing an interactive text mining operation, the user can therefore select the desired partial categories from a screen on which the categories are displayed hierarchically as a tree, so the hierarchical categories can be used during the operation and not only in the final output. This also allows interactive text mining analysis that requires selecting part of the categories to be performed efficiently even when a large-scale category structure is targeted.
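The hierarchical display and the selection of a partial category can be sketched with a small tree type. `CategoryNode`, `render`, `select_subtree`, `leaf_names`, and the sample categories are hypothetical; counts are attached only to leaves here, since how intermediate nodes are counted is the subject of claim 10.

```python
class CategoryNode:
    """Hypothetical node of the category tree."""

    def __init__(self, name, count=None, children=None):
        self.name = name
        self.count = count
        self.children = children or []


def render(node, expanded, depth=0):
    """List the visible lines; children appear only under expanded nodes."""
    label = node.name if node.count is None else f"{node.name} ({node.count})"
    lines = ["  " * depth + label]
    if node.name in expanded:
        for child in node.children:
            lines.extend(render(child, expanded, depth + 1))
    return lines


def select_subtree(node, name):
    """Return the subtree rooted at the named category, or None."""
    if node.name == name:
        return node
    for child in node.children:
        found = select_subtree(child, name)
        if found is not None:
            return found
    return None


def leaf_names(node):
    if not node.children:
        return [node.name]
    return [leaf for child in node.children for leaf in leaf_names(child)]


root = CategoryNode("protein", children=[
    CategoryNode("enzyme", children=[
        CategoryNode("kinase", 5),
        CategoryNode("protease", 2),
    ]),
    CategoryNode("receptor", 6),
])
visible = render(root, expanded={"protein"})
selected = leaf_names(select_subtree(root, "enzyme"))
```

With only `protein` expanded, three lines are shown instead of five, which is the sense in which the tree limits the number of concept items on screen at one time.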
[0042]
A text mining processing device according to claim 10 is the text mining processing device according to claim 9, further comprising an intermediate node tabulation unit which, when an intermediate node is treated as a concept item in the aggregation results of the categories hierarchized into a tree structure by the category hierarchization means, sets the aggregation results corresponding to the leaf-node concept items that are descendants of the intermediate node as the aggregation result corresponding to the intermediate node, and/or, when a normal form or another notation form corresponding to the intermediate node is defined in the notation dictionary used for the text mining process, sets the aggregation result of the analysis target documents containing the normal form or the other notation form as the aggregation result corresponding to the intermediate node.
[0043]
According to this device, when an intermediate node is treated as a concept item in the aggregation results of the categories hierarchized into a tree structure, the aggregation results corresponding to the leaf-node concept items that are descendants of the intermediate node are set as the aggregation result corresponding to the intermediate node (first counting method). Alternatively or additionally, when a normal form or another notation form corresponding to the intermediate node is defined in the notation dictionary used for the text mining process, the aggregation result of the analysis target documents containing the normal form or the other notation form is set as the aggregation result corresponding to the intermediate node (second counting method). The first counting method makes it possible to process even a concept category structure in which no normal-form word corresponds to an intermediate node. It also allows a category structure to be designed with a high degree of freedom, for example by dividing a large-scale concept category structure into appropriate granularities. The second counting method, on the other hand, allows the number of documents to be counted with high accuracy when the concept category structure has a normal-form word corresponding to the intermediate node. Such cases are common in concept category structures created from existing data structures, so those structures can be put to use. By choosing between these two methods, or combining them, depending on the case, the cost of creating the concept category structure can be reduced and large-scale category concepts become easy to use.
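The two counting methods can be contrasted in a short sketch. The documents, the notation dictionary, and the tree below are invented for illustration, and the first method is implemented here as the union of the leaf descendants' document sets, which is one possible reading of the text (it avoids counting a document twice).

```python
# Illustrative data only.
documents = {
    1: "a kinase phosphorylates the substrate",
    2: "the protease cleaves the kinase",
    3: "enzymes are catalysts",
    4: "the receptor binds a ligand",
}
# Hypothetical notation dictionary: normal form -> notation forms.
notation = {
    "enzyme": ["enzyme", "enzymes"],
    "kinase": ["kinase"],
    "protease": ["protease"],
}
# Hypothetical category tree: intermediate node -> child nodes.
tree = {"enzyme": ["kinase", "protease"]}


def docs_for_term(term):
    """Documents containing the normal form or any other notation form."""
    forms = notation.get(term, [term])
    return {doc_id for doc_id, text in documents.items()
            if any(form in text.split() for form in forms)}


def leaf_descendants(node):
    children = tree.get(node, [])
    if not children:
        return [node]
    return [leaf for child in children for leaf in leaf_descendants(child)]


def count_first(node):
    """First method: combine the results of the leaf descendants."""
    docs = set()
    for leaf in leaf_descendants(node):
        docs |= docs_for_term(leaf)
    return docs


def count_second(node):
    """Second method: documents matching the intermediate node's own forms."""
    return docs_for_term(node)


first_result = count_first("enzyme")
second_result = count_second("enzyme")
combined = first_result | second_result
```

In this toy data the two methods find disjoint document sets, which shows why combining them can be useful when an intermediate node has both descendants and a dictionary entry of its own.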
[0044]
Further, the present invention relates to a text mining processing method. A text mining processing method according to claim 11 is a text mining processing method for counting the frequency of appearance of each term appearing in a document to be analyzed, and includes an original text display control step of controlling so as to output, to an output device, the original text information of the document to be analyzed and aggregation key list information, which is a list of the terms that are included in the original text information and were subject to aggregation, in which each term is associated with the type of the term and/or a link button to the storage address of the term.
[0045]
According to this method, the original text information of the document to be analyzed and aggregation key list information, which is a list of the terms that are included in the original text information and were subject to aggregation, with each term associated with the type of the term and/or a link button to the storage address of the term, are output to the output device. Because the original text is displayed together with the list of words that served as aggregation keys, the end user can easily grasp which operation in a series of analysis operations caused the document to be acquired. As a result, even an inexperienced user can avoid operations that cause noise and perform highly accurate analysis work. In addition, because the original text is provided with links to an external database, the end user can know exactly what the acquired document is about. This information helps the user learn which operations generate search noise, thereby improving the accuracy of the analysis work.
[0046]
A text mining processing method according to claim 12 is a text mining processing method for counting the frequency of appearance of each term appearing in a document to be analyzed, and includes a dictionary entry search screen control step of controlling so as to output, to an output device, a search term entered by a user, information on the corresponding normal form and its notation dictionary entry extracted by searching notation dictionary information based on the search term, and information on the corresponding category and its category dictionary entry extracted by searching category dictionary information based on the search term.
[0047]
According to this method, a search term entered by a user, information on the corresponding normal form and its notation dictionary entry extracted by searching the notation dictionary information based on the search term, and information on the corresponding category and its category dictionary entry extracted by searching the category dictionary information based on the search term are output to the output device. By checking how the notation dictionary and the category dictionary apply to a specific word, the user can select words suited to separating documents into categories. Further, by repeating such word searches, the user can select a dictionary file in which words that are likely to expand into the many category item groups to be separated appear frequently. The accuracy of those category groups can also be estimated. Further, if a term characterizing a certain category is already known, the recall of the category can be estimated by confirming whether a dictionary entry for the term exists.
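A dictionary entry search of this kind can be sketched as two lookups. The dictionary contents and the function `lookup` are hypothetical, and the sketch assumes that the category dictionary is keyed by normal form, which the text does not state explicitly.

```python
# Hypothetical notation dictionary: notation form -> normal form.
notation_dictionary = {
    "IL-2": "interleukin 2",
    "IL2": "interleukin 2",
    "interleukin 2": "interleukin 2",
}
# Hypothetical category dictionary: normal form -> categories (assumed keying).
category_dictionary = {
    "interleukin 2": ["cytokine"],
}


def lookup(search_term):
    """Return what a dictionary entry search screen would need to display."""
    normal_form = notation_dictionary.get(search_term)
    categories = category_dictionary.get(normal_form, []) if normal_form else []
    return {
        "search_term": search_term,
        "normal_form": normal_form,
        "categories": categories,
    }


hit = lookup("IL2")
miss = lookup("unregistered term")
```

A miss such as `miss` tells the user that a term known to characterize a category has no dictionary entry, which is the recall check the paragraph describes.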
[0048]
A text mining processing method according to claim 13 is a text mining processing method for counting the frequency of appearance of each term appearing in a document to be analyzed, and includes a dictionary entry search screen control step of controlling so as to output, to an output device, the original text information of the document to be analyzed and, for each term that is included in the original text information and was subject to aggregation, trace result information including at least one of a notation dictionary search result, part-of-speech information obtained by syntactic analysis processing, and a category dictionary search result.
[0049]
According to this method, the original text information of the document to be analyzed and, for each term that is included in the original text information and was subject to aggregation, trace result information including at least one of a notation dictionary search result, part-of-speech information obtained by syntactic analysis processing, and a category dictionary search result are output to the output device. Because the display shows the dictionary entries by which each element of the original text was normalized and categorized, the user can understand in more detail why the document was acquired. In addition, the user can accurately grasp how the dictionaries and the parser were applied, which cannot be determined just by looking at the terms, such as how a collocation was grouped or which normal form the syntactic analysis system assigned. Furthermore, this makes it possible to identify which term caused the acquisition of an original text that appears to be irrelevant.
[0050]
A text mining processing method according to claim 14 is a text mining processing method for counting the frequency of appearance of each term appearing in a document to be analyzed, and includes a syntax structure analysis step of performing, in accordance with the result of syntactic analysis of the original text information of the document to be analyzed, text mining aggregation processing on the document to be analyzed with an ordered n-term combination of nouns and verbs included in the original text information treated as one category.
[0051]
According to this method, in accordance with the result of syntactic analysis of the original text information of the document to be analyzed, text mining aggregation processing is performed on the document with an ordered n-term combination of nouns and verbs included in the original text information treated as one category. Because patterns of n-term relations are thus used as aggregation targets, documents that cannot be distinguished by the types of terms alone can be separated, which further improves the analysis accuracy.
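The idea of treating an ordered n-term combination as one category can be sketched as follows. The part-of-speech tags are written by hand in place of a real parser's output, the tag names and the function names are hypothetical, and the sketch reads "ordered combination" as any n nouns and verbs of a sentence taken in sentence order.

```python
from collections import Counter
from itertools import combinations


def ordered_combinations(tagged_sentence, n):
    """All n-term combinations of nouns and verbs, kept in sentence order."""
    content = [word for word, pos in tagged_sentence if pos in ("NOUN", "VERB")]
    return list(combinations(content, n))


def tally(tagged_documents, n):
    """Count, per category (an ordered n-tuple), the documents containing it."""
    counts = Counter()
    for document in tagged_documents:
        seen = set()
        for sentence in document:
            seen.update(ordered_combinations(sentence, n))
        counts.update(seen)  # each category counts once per document
    return counts


# Hand-tagged stand-ins for parser output; two documents, one sentence each.
doc_a = [[("kinase", "NOUN"), ("inhibits", "VERB"), ("apoptosis", "NOUN")]]
doc_b = [[("apoptosis", "NOUN"), ("inhibits", "VERB"), ("kinase", "NOUN")]]
counts = tally([doc_a, doc_b], n=3)
```

Both documents contain exactly the same three terms, yet they fall into different categories because the order differs, which illustrates how n-term patterns can separate documents that term types alone cannot.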
[0052]
A text mining processing method according to claim 15 is a text mining processing method for counting the frequency of appearance of each term appearing in a document to be analyzed, and includes a multi-windowing step of, when a search with further narrowed search conditions is performed in another search window using the search result of one text mining search window, displaying the related search windows and search result display windows as multiple windows and, when the display content of any one of the windows is changed, controlling the display so that the change is automatically reflected in the other windows.
[0053]
According to this method, when a search with further narrowed search conditions is performed using another search window, the related search windows and search result display windows are displayed as multiple windows, and when the display content of any one of the windows is changed, the change is automatically reflected in the other windows. Any work state can therefore be kept on screen as needed, which reduces the amount of information the end user has to memorize for the analysis and makes the analysis work more efficient. The display area of a computer terminal having a plurality of screens can also be used effectively.
[0054]
A text mining processing method according to claim 16 is a text mining processing method for counting the frequency of appearance of each term appearing in a document to be analyzed, and includes a 2-D map display screen control step of sorting or clustering, in a 2-D map that displays a text mining result, the category items corresponding to the columns or rows, and outputting a 2-D map window to an output device.
[0055]
According to this method, for a 2-D map that displays a text mining result, the category items corresponding to the columns or rows are sorted and a 2-D map window is output to the output device. If the category items of interest occupy fixed positions in the original category definition order, sorting in the original order makes them easier to find. If the category items of interest are those with a high frequency of appearance, sorting in descending order of frequency makes them easier to find. Furthermore, if the category items of interest begin with a specific name, sorting in alphabetical order makes them easier to find.
[0056]
In addition, according to this method, for a 2-D map that displays a text mining result, the category items corresponding to the columns or rows are clustered and a 2-D map window is output to the output device. Grouping items that share a common pattern into clusters reduces the burden of searching for category items and makes the analysis work more efficient.
[0057]
A text mining processing method according to claim 17 is a text mining processing method for counting the frequency of appearance of each term appearing in a document to be analyzed, and includes an operation history collection step of collecting, for each operation during text mining, operation history information on at least one of an operation time, a user identifier, an operation name, an operation argument, an operation target, an operation result, and a user's comment on the operation intention.
[0058]
According to this method, operation history information on at least one of the operation time, user identifier, operation name, operation argument, operation target, operation result, and user's comment on the operation intention is collected for each operation during text mining, so the registered contents of the notation dictionary and the category dictionary can be checked on the basis of the operation history. In addition, by using the operation history as a template when creating the instructions (batch script) for the automatic operation execution process (batch process) described later, frequently performed analysis processes can easily be turned into batches. In addition, because the user's comments on the operation intention are stored, the places to be turned into a batch can be found quickly, using the user's work intention and the like as a clue, even when many interactive operations are recorded in the work history, which makes batch script creation more efficient. Also, when the user leaves a comment at a place that he or she wants to turn into a batch later, less work is needed to examine the batch contents when the batch script is created, which again makes batch script creation more efficient.
[0059]
The text mining processing method according to claim 18 is the text mining processing method according to claim 17, further including an automatic operation execution step of creating a batch script based on the operation history information collected in the operation history collection step and executing the batch script.
[0060]
According to this method, a batch script is created based on the collected operation history information and then executed, so an analysis consisting of a series of operations can be executed repeatedly by batch processing, which shortens the time the end user spends operating the tool. In addition, analysis processes that are performed at regular intervals can be carried out automatically. Further, analysis processing that places a heavy load on the system can be executed during the system's off-peak periods.
[0061]
A text mining processing method according to claim 19 is a text mining processing method for counting the frequency of appearance of each term appearing in a document to be analyzed, and includes a category hierarchization step of arranging the aggregation result of each category registered in the category dictionary information used for the text mining process into a hierarchy by means of a tree structure and outputting it to an output device, and a category selection step of selecting at least a part of the categories hierarchized by the tree structure and output in the category hierarchization step.
[0062]
According to this method, the aggregation result of each category registered in the category dictionary information used for the text mining process is hierarchized by a tree structure and output to an output device. Thus, the number of concept items displayed on the screen at one time via the interactive user interface can be kept small. In addition, this makes it easy to search for a target concept item.
[0063]
Further, according to this method, at least a part of the output categories hierarchized by the tree structure is selected. When performing an interactive text mining operation, the user can therefore select the desired partial categories from a screen on which the categories are displayed hierarchically as a tree, so the hierarchical categories can be used during the operation and not only in the final output. This also allows interactive text mining analysis that requires selecting part of the categories to be performed efficiently even when a large-scale category structure is targeted.
[0064]
A text mining processing method according to claim 20 is the text mining processing method according to claim 19, further including an intermediate node tabulation step of, when an intermediate node is treated as a concept item in the aggregation results of the categories hierarchized into a tree structure in the category hierarchization step, setting the aggregation results corresponding to the leaf-node concept items that are descendants of the intermediate node as the aggregation result corresponding to the intermediate node, and/or, when a normal form or another notation form corresponding to the intermediate node is defined in the notation dictionary used for the text mining process, setting the aggregation result of the analysis target documents containing the normal form or the other notation form as the aggregation result corresponding to the intermediate node.
[0065]
According to this method, when an intermediate node is treated as a concept item in the aggregation results of the categories hierarchized into a tree structure, the aggregation results corresponding to the leaf-node concept items that are descendants of the intermediate node are set as the aggregation result corresponding to the intermediate node (first counting method). Alternatively or additionally, when a normal form or another notation form corresponding to the intermediate node is defined in the notation dictionary used for the text mining process, the aggregation result of the analysis target documents containing the normal form or the other notation form is set as the aggregation result corresponding to the intermediate node (second counting method). The first counting method makes it possible to process even a concept category structure in which no normal-form word corresponds to an intermediate node. It also allows a category structure to be designed with a high degree of freedom, for example by dividing a large-scale concept category structure into appropriate granularities. The second counting method, on the other hand, allows the number of documents to be counted with high accuracy when the concept category structure has a normal-form word corresponding to the intermediate node. Such cases are common in concept category structures created from existing data structures, so those structures can be put to use. By choosing between these two methods, or combining them, depending on the case, the cost of creating the concept category structure can be reduced and large-scale category concepts become easy to use.
[0066]
The present invention also relates to a program. A program according to claim 21 is a program for counting the frequency of appearance of each term appearing in a document to be analyzed, and causes a computer to execute a text mining processing method including an original text display control step of controlling so as to output, to an output device, the original text information of the document to be analyzed and aggregation key list information, which is a list of the terms that are included in the original text information and were subject to aggregation, in which each term is associated with the type of the term and/or a link button to the storage address of the term.
[0067]
According to this program, the original text information of the document to be analyzed and aggregation key list information, which is a list of the terms that are included in the original text information and were subject to aggregation, with each term associated with the type of the term and/or a link button to the storage address of the term, are output to the output device. Because the original text is displayed together with the list of words that served as aggregation keys, the end user can easily grasp which operation in a series of analysis operations caused the document to be acquired. As a result, even an inexperienced user can avoid operations that cause noise and perform highly accurate analysis work. In addition, because the original text is provided with links to an external database, the end user can know exactly what the acquired document is about. This information helps the user learn which operations generate search noise, thereby improving the accuracy of the analysis work.
[0068]
A program according to claim 22 is a program for counting the frequency of appearance of each term appearing in a document to be analyzed, and causes a computer to execute a text mining processing method including a dictionary entry search screen control step of controlling so as to output, to an output device, a search term entered by a user, information on the corresponding normal form and its notation dictionary entry extracted by searching notation dictionary information based on the search term, and information on the corresponding category and its category dictionary entry extracted by searching category dictionary information based on the search term.
[0069]
According to this program, a search term entered by a user, information on the corresponding normal form and its notation dictionary entry extracted by searching the notation dictionary information based on the search term, and information on the corresponding category and its category dictionary entry extracted by searching the category dictionary information based on the search term are output to the output device. By checking how the notation dictionary and the category dictionary apply to a specific word, the user can select words suited to separating documents into categories. Further, by repeating such word searches, the user can select a dictionary file in which words that are likely to expand into the many category item groups to be separated appear frequently. The accuracy of those category groups can also be estimated. Further, if a term characterizing a certain category is already known, the recall of the category can be estimated by confirming whether a dictionary entry for the term exists.
[0070]
A program according to claim 23 is a program for counting the frequency of appearance of each term appearing in a document to be analyzed, and causes a computer to execute a text mining processing method including a dictionary entry search screen control step of controlling so as to output, to an output device, the original text information of the document to be analyzed and, for each term that is included in the original text information and was subject to aggregation, trace result information including at least one of a notation dictionary search result, part-of-speech information obtained by syntactic analysis processing, and a category dictionary search result.
[0071]
According to this program, the original text information of the document to be analyzed and, for each term that is included in the original text information and was subject to aggregation, trace result information including at least one of a notation dictionary search result, part-of-speech information obtained by syntactic analysis processing, and a category dictionary search result are output to the output device. Because the display shows the dictionary entries by which each element of the original text was normalized and categorized, the user can understand in more detail why the document was acquired. In addition, the user can accurately grasp how the dictionaries and the parser were applied, which cannot be determined just by looking at the terms, such as how a collocation was grouped or which normal form the syntactic analysis system assigned. Furthermore, this makes it possible to identify which term caused the acquisition of an original text that appears to be irrelevant.
[0072]
A program according to claim 24 is a program for counting the frequency of appearance of each term appearing in a document to be analyzed, and causes a computer to execute a text mining processing method including a syntax structure analysis step of performing, in accordance with the result of syntactic analysis of the original text information of the document to be analyzed, text mining aggregation processing on the document to be analyzed with an ordered n-term combination of nouns and verbs included in the original text information treated as one category.
[0073]
According to this program, in accordance with the result of syntactic analysis of the original text information of the document to be analyzed, text mining aggregation processing is performed on the document with an ordered n-term combination of nouns and verbs included in the original text information treated as one category. Because patterns of n-term relations are thus used as aggregation targets, documents that cannot be distinguished by the types of terms alone can be separated, which further improves the analysis accuracy.
[0074]
A program according to claim 25 is a program for counting the frequency of appearance of each term appearing in a document to be analyzed, and causes a computer to execute a text mining processing method including a multi-windowing step of, when a search with further narrowed search conditions is performed in another search window using the search result of one text mining search window, displaying the related search windows and search result display windows as multiple windows and, when the display content of any one of the windows is changed, controlling the display so that the change is automatically reflected in the other windows.
[0075]
According to this program, when a search with further narrowed search conditions is performed using another search window, the related search windows and search result display windows are displayed as multiple windows, and when the display content of any one of the windows is changed, the change is automatically reflected in the other windows. Any work state can therefore be kept on screen as needed, which reduces the amount of information the end user has to memorize for the analysis and makes the analysis work more efficient. The display area of a computer terminal having a plurality of screens can also be used effectively.
[0076]
A program according to claim 26 is a program for counting the frequency of appearance of each term appearing in a document to be analyzed, and causes a computer to execute a text mining processing method including a 2-D map display screen control step of sorting or clustering, in a 2-D map that displays a text mining result, the category items corresponding to the columns or rows, and outputting a 2-D map window to an output device.
[0077]
According to this program, for a 2-D map that displays a text mining result, the category items corresponding to the columns or rows are sorted and a 2-D map window is output to the output device. If the category items of interest occupy fixed positions in the original category definition order, sorting in the original order makes them easier to find. If the category items of interest are those with a high frequency of appearance, sorting in descending order of frequency makes them easier to find. Furthermore, if the category items of interest begin with a specific name, sorting in alphabetical order makes them easier to find.
[0078]
Also, according to this program, for a 2-D map that displays a text mining result, the category items corresponding to the columns or rows are clustered and a 2-D map window is output to the output device. Grouping items that share a common pattern into clusters reduces the burden of searching for category items and makes the analysis work more efficient.
[0079]
The program according to claim 27 is a program for counting the frequency of appearance of each term appearing in the analysis target document, and causes a computer to execute a text mining processing method including an operation history collection step of collecting, for each operation during text mining, operation history information on at least one of an operation time, a user identifier, an operation name, an operation argument, an operation target, an operation result, and a user's comment on the operation intention.
[0080]
According to this program, operation history information relating to at least one of the operation time, the user identifier, the operation name, the operation argument, the operation target, the operation result, and the user's comment on the operation intention is collected for each operation at the time of text mining, so that the registered contents of the notation dictionary and the category dictionary can be checked based on the operation history. In addition, by using the operation history as a template when creating the instructions (batch script) for the operation automatic execution process (batch process) described later, frequently performed analysis processing can easily be turned into a batch. In addition, by storing the user's comments on the operation intention, even if many interactive operations are recorded in the work history, the places to be turned into a batch can be found quickly using the user's work intention and the like as a clue, and the creation of the batch script can be made more efficient. Also, by putting a comment at a place that the user wants to turn into a batch later, the work of examining the contents of the batch when creating the batch script can be reduced, and the creation of the batch script can be made more efficient.
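One possible shape for an operation history record, using the seven items listed above, is sketched below. The class names, field names, and the `commented` lookup are assumptions made for illustration; the specification does not define a storage format.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class OperationRecord:
    time: datetime                 # operation time
    user_id: str                   # user identifier
    name: str                      # operation name
    args: List[str]                # operation arguments
    target: str                    # operation target
    result: str                    # operation result
    comment: Optional[str] = None  # user's comment on the operation intention


@dataclass
class OperationHistory:
    records: List[OperationRecord] = field(default_factory=list)

    def collect(self, record: OperationRecord) -> None:
        """Append one operation to the history."""
        self.records.append(record)

    def commented(self, keyword: str) -> List[OperationRecord]:
        """Find operations whose intention comment mentions the keyword."""
        return [r for r in self.records if r.comment and keyword in r.comment]


history = OperationHistory()
history.collect(OperationRecord(datetime(2002, 8, 16, 9, 0), "u1", "search",
                                ["kw1"], "w1", "120 documents"))
history.collect(OperationRecord(datetime(2002, 8, 16, 9, 5), "u1", "map2d",
                                ["gene", "disease"], "w2", "ok",
                                comment="batch this weekly"))
to_batch = history.commented("batch")
```

Searching by comment corresponds to using the user's stated intention as the clue for locating the operations worth turning into a batch.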
[0081]
The program according to claim 28 is the program according to claim 27, and causes a computer to execute a text mining processing method further including an operation automatic execution step of creating a batch script based on the operation history information collected in the operation history collection step and executing the batch script.
[0082]
According to this program, a batch script is created based on the collected operation history information and is executed, so that an analysis consisting of a series of operations can be executed repeatedly by batch processing, and the time the end user spends operating the tool can be shortened. In addition, analysis processing performed at regular intervals can be carried out automatically. Further, heavy-load analysis processing can be executed during the off-peak period of the system.
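A minimal sketch of turning collected operations into a replayable batch script follows. The script format (a list of operation names with arguments) and the handler table are assumptions; the specification leaves the batch script format open.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Tuple[str, ...]]  # (operation name, operation arguments)


def make_batch_script(history: List[dict]) -> List[Step]:
    """Use the recorded operations as a template for a batch script."""
    return [(h["name"], tuple(h["args"])) for h in history]


def run_batch(script: List[Step], handlers: Dict[str, Callable[..., str]]) -> List[str]:
    """Execute each step in order by dispatching on the operation name."""
    return [handlers[name](*args) for name, args in script]


# Hypothetical recorded operations and stand-in handlers.
history = [
    {"name": "search", "args": ["kw1"]},
    {"name": "map2d", "args": ["gene", "disease"]},
]
handlers = {
    "search": lambda kw: f"searched:{kw}",
    "map2d": lambda rows, cols: f"map:{rows}x{cols}",
}
results = run_batch(make_batch_script(history), handlers)
```

Because the script is plain data, it could be stored and handed to a scheduler for periodic or off-peak execution.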
[0083]
A program according to claim 29 is a program for counting the frequency of appearance of each term appearing in a document to be analyzed, and causes a computer to execute a text mining processing method including a category hierarchization step of hierarchizing, by a tree structure, the aggregation results of each category registered in the category dictionary information used for the text mining process and outputting them to an output program, and a category selection step of selecting at least a part of the categories hierarchized by the tree structure and output in the category hierarchization step.
[0084]
According to this program, the aggregation results of each category registered in the category dictionary information used for the text mining process are hierarchized by a tree structure and output to an output program. Thus, the number of concept items displayed on the screen at one time via the user interactive interface can be suppressed. In addition, this makes it easy to search for a target concept item.
[0085]
Further, according to this program, at least a part of the output categories hierarchized by the tree structure is selected. Thus, when performing an interactive text mining operation, the user can select the desired partial categories from a screen on which the categories are displayed hierarchically by the tree structure, and the hierarchized categories can be utilized not only in the final output but also during the operation. In addition, an interactive text mining analysis operation that requires selecting part of the categories can be performed efficiently even when a large-scale category structure is targeted.
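The hierarchical display and partial selection described above can be sketched as follows. The tree is held as a parent-to-children mapping; the function names and the sample categories are assumptions for illustration.

```python
from typing import Dict, List, Set, Tuple


def visible_items(tree: Dict[str, List[str]], root: str,
                  expanded: Set[str]) -> List[Tuple[int, str]]:
    """List (depth, category) pairs, descending only into expanded nodes.

    Collapsed nodes hide their descendants, which limits the number of
    concept items shown on the screen at one time.
    """
    out: List[Tuple[int, str]] = []

    def walk(node: str, depth: int) -> None:
        out.append((depth, node))
        if node in expanded:
            for child in tree.get(node, []):
                walk(child, depth + 1)

    walk(root, 0)
    return out


def select_subtree(tree: Dict[str, List[str]], node: str) -> List[str]:
    """Select a node together with all of its descendants."""
    selected = [node]
    for child in tree.get(node, []):
        selected.extend(select_subtree(tree, child))
    return selected


# Hypothetical category structure.
tree = {"disease": ["cancer", "infection"], "cancer": ["leukemia", "lymphoma"]}
shown = visible_items(tree, "disease", expanded={"disease"})
picked = select_subtree(tree, "cancer")
```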
[0086]
A program according to claim 30 is the program according to claim 29, and causes a computer to execute a text mining processing method further including an intermediate node aggregation step in which, when an intermediate node is treated as a concept item with respect to the aggregation results of the categories hierarchized in a tree structure in the category hierarchization step, the aggregation results corresponding to the leaf node concept items that are descendants of the intermediate node are set as the aggregation result corresponding to the intermediate node, and/or, when a normal form or an alternative notation form corresponding to the intermediate node is defined in the notation dictionary used for the text mining process, the aggregation result of the analysis target documents including the normal form or the alternative notation form is set as the aggregation result corresponding to the intermediate node.
[0087]
According to this program, when an intermediate node is treated as a concept item for the aggregation results of the categories hierarchized in a tree structure, the aggregation results corresponding to the leaf node concept items that are descendants of the intermediate node are set as the aggregation result corresponding to the intermediate node (first counting method). Alternatively or additionally, when a normal form or an alternative notation form corresponding to the intermediate node is defined in the notation dictionary used for the text mining process, the aggregation result of the analysis target documents including that normal form or alternative notation form is set as the aggregation result corresponding to the intermediate node (second counting method). By using the first counting method, even a concept category structure in which no normal form corresponds to an intermediate node can be processed. Further, a category structure with a high degree of freedom can be designed, for example by dividing a large-scale concept category structure into appropriate granularities. On the other hand, by using the second counting method, the number of documents can be counted with high accuracy when the concept category structure has a normal form corresponding to the intermediate node. Such a case is common in concept category structures created from existing data structures, which can thus be put to use. By using these two methods selectively or in combination, the cost of creating a concept category structure can be reduced, and the use of large-scale category concepts can be facilitated.
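The two counting methods can be contrasted in a short sketch. It assumes that an aggregation result is a set of document identifiers, that the first method takes the union over descendant leaves, and that the second method uses case-insensitive substring matching; all three choices, and every name below, are illustrative assumptions rather than details from the specification.

```python
from typing import Dict, List, Set


def leaves_under(tree: Dict[str, List[str]], node: str) -> List[str]:
    """Return the leaf nodes that are descendants of node."""
    children = tree.get(node, [])
    if not children:
        return [node]
    leaves: List[str] = []
    for child in children:
        leaves.extend(leaves_under(tree, child))
    return leaves


def count_by_descendants(tree: Dict[str, List[str]],
                         leaf_docs: Dict[str, Set[str]], node: str) -> Set[str]:
    """First counting method: combine the results of the descendant leaves."""
    docs: Set[str] = set()
    for leaf in leaves_under(tree, node):
        docs |= leaf_docs.get(leaf, set())
    return docs


def count_by_own_term(notations: Dict[str, List[str]],
                      documents: Dict[str, str], node: str) -> Set[str]:
    """Second counting method: documents containing the node's own normal
    form or one of its alternative notation forms."""
    terms = [node] + notations.get(node, [])
    return {doc_id for doc_id, text in documents.items()
            if any(t.lower() in text.lower() for t in terms)}


# Hypothetical data.
tree = {"cancer": ["leukemia", "lymphoma"]}
leaf_docs = {"leukemia": {"d1", "d2"}, "lymphoma": {"d2", "d3"}}
notations = {"cancer": ["tumor", "neoplasm"]}
documents = {"d1": "Acute leukemia case", "d4": "A rare tumor was found",
             "d5": "Cancer screening"}

first = count_by_descendants(tree, leaf_docs, "cancer")
second = count_by_own_term(notations, documents, "cancer")
combined = first | second  # using both methods together
```

In this sample the two methods find disjoint document sets, which shows why combining them can matter for an intermediate node that also has its own term.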
[0088]
Further, the present invention relates to a recording medium, wherein a recording medium according to claim 31 records the program according to any one of claims 21 to 30.
[0089]
According to this recording medium, the program recorded on the recording medium is read and executed by a computer, whereby the program according to any one of claims 21 to 30 can be realized using the computer, and the same effects as those of each of these methods can be obtained.
[0090]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of a text mining processing apparatus, a text mining processing method, a program, and a recording medium according to the present invention will be described in detail with reference to the drawings. It should be noted that the present invention is not limited by the embodiment.
In particular, the following embodiment describes an example in which the present invention is applied to a literature information database search system for documents in the natural sciences such as biology, medicine, and chemistry. However, the present invention is not limited to this case, and can be applied in the same way to any system for searching document information in any field.
[0091]
[Summary of the present invention]
Hereinafter, the outline of the present invention will be described, and then the configuration, processing, and the like of the present invention will be described in detail.
The present invention generally has the following basic features. That is, in the text mining process shown in FIG. 1, the present invention aims to increase the accuracy, efficiency, and automation of the analysis of the aggregation results. Specifically, the present invention improves the accuracy of text mining analysis by providing methods for evaluating the analysis procedure (original text display, dictionary entry search, trace result display, etc.) and an analysis method that uses the syntax structure in text mining processing. In addition, the present invention provides methods for making the analysis of the aggregation results more efficient (a method for listing analysis screens (multi-windowing, etc.) and a method for sorting and clustering the category items of the 2-D map shown in FIG. 2). The present invention also provides an analysis automation method (operation history collection, operation automatic execution, etc.) and a large-scale concept management method (tree structure hierarchization, intermediate node aggregation, etc.). Each of these methods will be described later.
[0092]
[System configuration]
First, the configuration of the present system will be described. FIG. 3 is a block diagram showing an example of the configuration of the present system to which the present invention is applied, and conceptually shows only those parts of the configuration related to the present invention. In this system, a text mining processing apparatus 100 and an external system 200, which provides external databases for document information, sequence information, and the like and external programs for various search processing and the like, are communicably connected via a network 300.
[0093]
In FIG. 3, a network 300 has a function of interconnecting the text mining processing apparatus 100 and the external system 200, and is, for example, the Internet.
[0094]
In FIG. 3, an external system 200 is mutually connected to the text mining processing apparatus 100 via a network 300, and has a function of providing the user with external databases for document information, sequence information, and the like, and with a website that executes external programs such as various search processes.
[0095]
Here, the external system 200 may be configured as a WEB server, an ASP server, or the like, and its hardware may be configured by an information processing device such as a commercially available workstation or personal computer and its accompanying devices. Each function of the external system 200 is realized by a CPU, a disk device, a memory device, an input device, an output device, a communication control device, and the like in the hardware configuration of the external system 200, a program for controlling them, and the like.
[0096]
In FIG. 3, a text mining processing apparatus 100 generally includes a control unit 102 such as a CPU that comprehensively controls the entire text mining processing apparatus 100, a communication control interface unit 104 connected to a communication device (not shown) such as a router connected to a communication line or the like, an input/output control interface unit 108 connected to the input device 112 and the output device 114, and a storage unit 106 for storing various databases and tables. These units are communicably connected via an arbitrary communication path. Further, the text mining processing apparatus 100 is communicably connected to a network 300 via a communication device such as a router and a wired or wireless communication line such as a dedicated line.
[0097]
The storage unit 106, which stores various databases and tables (notation dictionary information file 106a to batch script file 106f), is a storage means such as a fixed disk device, and stores various programs, tables, files, databases, web page files, and the like used for various processes.
[0098]
Among the constituent elements of the storage unit 106, the notation dictionary information file 106a is a notation dictionary information storage unit that stores notation dictionary information that defines the correspondence between the normal form of each term and another notation form. FIG. 17 is a diagram illustrating an example of the notation dictionary information stored in the notation dictionary information file 106a. The notation dictionary information stored in the notation dictionary information file 106a defines a correspondence between a normal form and another notation form, as shown in FIG.
[0099]
The category dictionary information file 106b is a category dictionary information storage unit that stores category dictionary information that defines the category to which each normal form belongs. FIG. 18 is a diagram illustrating an example of the category dictionary information stored in the category dictionary information file 106b. As shown in FIG. 18, the category dictionary information stored in the category dictionary information file 106b includes the correspondence between each category and its normal forms, and the category structure. (FIG. 18 shows the category structure conceptually; the file defines information such as parent node information and child node information for each node (category).)
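As a rough illustration of how the two dictionaries might be held in memory, the sketch below stores the notation dictionary as a mapping from alternative notation to normal form, and each category as a node with parent and child information. The data layout, the names, and the sample terms are all assumptions; FIG. 17 and FIG. 18 are not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CategoryNode:
    name: str
    parent: Optional[str] = None                            # parent node information
    children: List[str] = field(default_factory=list)       # child node information
    normal_forms: List[str] = field(default_factory=list)   # normal forms in this category


# Notation dictionary: alternative notation form -> normal form (hypothetical entries).
notation_dict: Dict[str, str] = {
    "TNF-alpha": "tumor necrosis factor",
    "TNFa": "tumor necrosis factor",
}

# Category dictionary keyed by category name (hypothetical entries).
category_dict: Dict[str, CategoryNode] = {
    "protein": CategoryNode("protein", children=["cytokine"]),
    "cytokine": CategoryNode("cytokine", parent="protein",
                             normal_forms=["tumor necrosis factor"]),
}


def category_paths(term: str) -> List[List[str]]:
    """Normalize a term, then return each matching category with its ancestors."""
    normal = notation_dict.get(term, term)
    paths: List[List[str]] = []
    for node in category_dict.values():
        if normal in node.normal_forms:
            path: List[str] = []
            current: Optional[CategoryNode] = node
            while current is not None:
                path.append(current.name)
                current = category_dict.get(current.parent) if current.parent else None
            paths.append(path)
    return paths
```

For example, `category_paths("TNFa")` first resolves the alternative notation to its normal form and then walks from the matching category up to the root.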
[0100]
The analysis target document file 106c is a document information storage unit that stores original text information of the document information to be analyzed and address information such as a URL of a link destination set in the original text information. Here, the address information may store hyperlink (WWW link) information or the like of the external database as long as a part of the original text can be interpreted as an identifier of the external database.
[0101]
Further, the operation history information file 106d is an operation history information storage means that stores, for each operation during text mining, operation history information such as operation times, user identifiers, operation names, operation arguments, operation targets, operation results, and user comments on operation intentions.
[0102]
The processing result file 106e is a processing result storage unit that stores a work file or the like of a processing result or an intermediate result of each processing by the control unit.
[0103]
The batch script file 106f is a batch script information storage unit that stores information related to the batch script and the like.
[0104]
In FIG. 3, the communication control interface unit 104 controls communication between the text mining processing device 100 and the network 300 (or a communication device such as a router). That is, the communication control interface unit 104 has a function of communicating data with another terminal via a communication line.
[0105]
In FIG. 3, the input/output control interface unit 108 controls the input device 112 and the output device 114. Here, as the output device 114, in addition to a monitor (including a home television), a speaker can be used (in the following, the output device 114 may be described as a monitor). As the input device 112, a keyboard, a mouse, a microphone, and the like can be used. The monitor also realizes a pointing device function in cooperation with the mouse.
[0106]
In FIG. 3, the control unit 102 has an internal memory for storing a control program such as an OS (Operating System), programs defining various processing procedures and the like, and required data, and performs information processing for executing various processes using these programs. The control unit 102 functionally and conceptually includes an analysis procedure evaluation unit 102a, a syntax structure analysis unit 102b, a multi-windowing unit 102c, a 2-D map display screen control unit 102d, an operation history collection unit 102e, an operation automatic execution unit 102f, a category hierarchy unit 102g, an intermediate node tallying unit 102h, and a text mining unit 102p.
[0107]
Among these, the analysis procedure evaluation unit 102a is an analysis procedure evaluation means that evaluates the analysis procedure of the text mining process by the text mining unit 102p. As shown in FIG. 4, the analysis procedure evaluation unit 102a includes an original text display screen control unit 102i, a dictionary entry search screen control unit 102j, and a trace result display screen control unit 102k. Here, the original text display screen control unit 102i is an original text display control means that controls to output, to an output device, the original text information of the document to be analyzed and aggregation key list information, which is a list of the terms included in the original text information that are to be counted and in which each term is associated with the type of the term and/or a link button to a storage address of the term. The dictionary entry search screen control unit 102j is a dictionary entry search screen control means that controls to output, to an output device, a search term input by the user, the corresponding normal forms extracted by searching the notation dictionary information based on the search term together with information on those notation dictionary entries, and the corresponding categories extracted by searching the category dictionary information based on the search term together with information on those category dictionary entries. In addition, the trace result display screen control unit 102k is a trace result display screen control means that controls to output, to the output device, trace information including at least one of the notation dictionary search results and the category dictionary search results for the original text information of the document to be analyzed and for the terms included in the original text information that are to be counted.
[0108]
Further, the syntactic structure analysis unit 102b is a syntax structure analysis means that, according to the result of the syntax analysis of the original text information of the analysis target document, treats each ordered combination of n nouns and verbs included in the original text information as one category, and performs the text mining aggregation process on the analysis target document.
[0109]
In addition, the multi-windowing unit 102c is a multi-windowing means that, when a search is performed by narrowing down the search conditions in another search window from the search result of one search window for text mining, displays these related search windows and search result display windows in a multi-window form, and performs control so that, when the display contents of any one of the windows are changed, the change is automatically reflected in the other windows.
[0110]
Further, the 2-D map display screen control unit 102d is a 2-D map display screen control means that sorts or clusters each category item corresponding to a column or a row of a 2-D map displaying a text mining result, and outputs a 2-D map window to an output device. Here, the 2-D map display screen control unit 102d includes an item sorting unit 102m and an item clustering unit 102n, as shown in FIG. The item sorting unit 102m is an item sorting means that sorts each category item corresponding to a column or a row of a 2-D map displaying a text mining result, and outputs a 2-D map window to an output device. The item clustering unit 102n is an item clustering means that clusters each category item corresponding to a column or a row of a 2-D map displaying a text mining result, and outputs a 2-D map window to an output device.
[0111]
Further, the operation history collection unit 102e is an operation history collection means that collects, for each operation during text mining, operation history information relating to at least one of an operation time, a user identifier, an operation name, an operation argument, an operation target, an operation result, and a user comment regarding the operation intention.
[0112]
The operation automatic execution unit 102f is an operation automatic execution unit that creates a batch script based on the operation history information collected by the operation history collection unit and executes the batch script.
[0113]
The category hierarchy unit 102g is a category hierarchization means that hierarchizes, by a tree structure, the aggregation results of each category registered in the category dictionary information used for the text mining process, and outputs the result to an output device.
[0114]
Further, the intermediate node tallying unit 102h is an intermediate node tallying means that, when an intermediate node is treated as a concept item with respect to the aggregation results of the categories hierarchized into a tree structure by the category hierarchy unit, sets the aggregation results corresponding to the leaf node concept items that are descendants of the intermediate node as the aggregation result corresponding to the intermediate node, and/or, when a normal form or another notation form corresponding to the intermediate node is defined in the notation dictionary used for the text mining process, sets the aggregation result of the analysis target documents including the normal form or the other notation form as the aggregation result corresponding to the intermediate node.
[0115]
The text mining unit 102p is a text mining unit that executes statistical / analysis processing on the information extraction result by the text mining processing illustrated in FIG. 1 described above, for example.
The details of the processing performed by these units will be described later.
[0116]
[System processing]
Next, an example of the processing of the present system configured as described above according to the present embodiment will be described in detail below with reference to FIGS.
[0117]
[Original display screen control processing]
Next, details of the original text display screen control processing will be described with reference to FIG.
Through the processing of the original text display screen control unit 102i, the text mining processing device 100 displays, on the output device 114, the original text information stored in the analysis target document file 106c together with a list of the terms (keys) to be processed. For example, when the frequency of appearance is counted for the normal forms corresponding to the categories registered in the category dictionary information file 106b, or for the other notation forms corresponding to those normal forms registered in the notation dictionary information file 106a, these normal forms and other notation forms are the terms (keys) to be processed. Further, if a key in the original text can be interpreted as an identifier of an external database, the original text display screen control unit 102i sets a hyperlink (WWW link) on it and displays it together.
[0118]
FIG. 6 is a diagram illustrating an example of the original text display screen displayed on the output device 114. As shown in FIG. 6, one window for an original text display screen is prepared for each document. Each window includes an original text information display area MA-1 and an aggregation key list information display area MA-2. Here, the aggregation key list information display area MA-2 includes a display area MA-3 for displaying the type of the term (part of speech and the technical field to which the term belongs), a display area MA-4 for the term appearing in the original text, and a hyperlink button MA-5 to an external database. The items of the aggregation key list may be obtained from an intermediate product generated when the text mining system performs text mining processing (pre-processing) on the target text in advance.
Thus, the original text display screen control processing ends.
[0119]
[Dictionary entry search screen control processing]
Next, details of the dictionary entry search screen control processing will be described with reference to FIG. Through the processing of the dictionary entry search screen control unit 102j, the text mining processing apparatus 100 receives an arbitrary word or collocation specified by the user, searches the notation dictionary stored in the notation dictionary information file 106a or the category dictionary stored in the category dictionary information file 106b by the following method, extracts the dictionary entries that are hit, and outputs them to the output device 114.
[0120]
First, the dictionary entry search screen control unit 102j searches the notation dictionary using the input search word, and obtains a set of hit normal forms. Next, a category dictionary is searched using the input search word and each element of the normal form set, and a set of hit categories is obtained.
[0121]
Then, as the search result, the input word, its normal form, the category to which it belongs, the name of the file/database containing the dictionary entry used for the conversion, and the identifier/position of that dictionary entry within the file/database are displayed on the output device 114.
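The two-step search described above (notation dictionary first, then the category dictionary using both the input word and each normal form found) can be sketched as follows. The entry layout, with `file` and `id` standing for the file/database name and the entry identifier, and the sample entries are assumptions for illustration.

```python
from typing import Dict, List

# Hypothetical notation dictionary entries.
NOTATION_DICT = [
    {"file": "D1", "id": "e1", "normal": "t1", "notations": ["w", "w-alt"]},
]
# Hypothetical category dictionary entries; "term" is the word or normal form
# that the entry assigns to "category".
CATEGORY_DICT = [
    {"file": "D2", "id": "e2", "category": "c1", "term": "w"},
    {"file": "D3", "id": "e3", "category": "c2", "term": "w"},
    {"file": "D2", "id": "e4", "category": "c4", "term": "t1"},
]


def search_dictionary_entries(word: str) -> Dict[str, List[dict]]:
    """Return the notation and category dictionary entries hit by the input word."""
    # Step 1: search the notation dictionary with the input word.
    normal_hits = [e for e in NOTATION_DICT
                   if word == e["normal"] or word in e["notations"]]
    # Step 2: search the category dictionary with the input word
    # and with each normal form obtained in step 1.
    keys = [word] + [e["normal"] for e in normal_hits]
    category_hits = [e for e in CATEGORY_DICT if e["term"] in keys]
    return {"normal_forms": normal_hits, "categories": category_hits}


result = search_dictionary_entries("w")
```

Each hit keeps its `file` and `id`, so the result can report which dictionary entry produced each normal form or category.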
[0122]
FIG. 7 is a diagram showing an example of a dictionary entry search screen displayed on the output device 114. As shown in FIG. 7, the dictionary entry search screen includes a search word input field MB-1, a search instruction button MB-2, and a result display area MB-3.
[0123]
After the user inputs a desired word or collocation in the search word input field MB-1 and selects the search instruction button MB-2 with the input device 112 such as a mouse, the text mining processing device 100 displays the search result in the result display area MB-3 through the processing of the dictionary entry search screen control unit 102j. In this example, searching the notation dictionary with the input word hit the normal forms t1, t2, t3, ... (t2 and later are omitted in FIG. 7). Searching the category dictionary with the input word hit the categories c1, c2, c3, .... Category c1 was hit by the dictionary entry with identifier e2 in the category dictionary with file name D2. The same applies to categories c2 and c3. Categories c1 and c3 are hit by entries belonging to the same dictionary (D2). The normal form t1 was hit by the dictionary entry with identifier e1 in the notation dictionary named D1. The normal form t1 in turn hit the categories c1, c4, and c5. As a result, it can be seen that a document including the input word belongs to at least the categories c1, c2, c3, c4, and c5.
Thus, the dictionary entry search screen control processing ends.
[0124]
[Trace result display screen control processing]
Next, details of the trace result display screen control processing will be described with reference to FIG. Through the processing of the trace result display screen control unit 102k, the text mining processing apparatus 100 receives as input the original text information stored in the notation dictionary information file 106a or original text information such as a natural-language English sentence arbitrarily specified by the user, traces the application of the series of text mining preprocessing steps to that original text information, and displays trace information that clarifies how each element in the input original text information is recognized by the text mining system.
[0125]
First, the trace result display screen control unit 102k applies the notation dictionary stored in the notation dictionary information file 106a to the input original text information and combines collocations into element structures. It then applies the technical term discrimination rules to the result and combines further collocations into element structures. It then applies a known syntax analysis processing system to the result and adds part-of-speech information to the element structures. Finally, it applies the category dictionary to the result and assigns category information to the element structures.
[0126]
Then, the trace result display screen control unit 102k displays the input and output items of each process as trace result information. Here, the trace result display screen control unit 102k may also display trace information such as the name of the file/database containing each dictionary entry used from the notation dictionary and the category dictionary, and the identifier/position of that dictionary entry within the file/database.
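The four preprocessing steps and their trace output can be sketched as a small pipeline. Everything below is a simplified stand-in: the technical term rule (a token containing a digit), the part-of-speech lookup table used in place of a real syntax analyzer, and all names and sample data are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def trace_preprocessing(
    tokens: List[str],
    notation: Dict[str, Tuple[str, str]],          # surface -> (entry id, normal form)
    pos_tags: Dict[str, str],                      # stand-in for a syntax analyzer
    categories: Dict[str, List[Tuple[str, str]]],  # normal form -> [(entry id, category)]
) -> List[dict]:
    """Apply each preprocessing step per token and log what was applied."""
    trace = []
    for tok in tokens:
        elem = {"surface": tok, "normal": tok, "pos": None, "categories": [], "log": []}
        # 1) Notation dictionary: normalize the surface form.
        if tok in notation:
            entry_id, normal = notation[tok]
            elem["normal"] = normal
            elem["log"].append(f"notation entry {entry_id}: {tok} -> {normal}")
        # 2) Technical term rule (hypothetical rule: token contains a digit).
        elif any(ch.isdigit() for ch in tok):
            elem["log"].append("technical term rule: contains digit")
        # 3) Syntax analysis: attach part-of-speech information if available.
        elem["pos"] = pos_tags.get(tok)
        elem["log"].append(f"syntax analysis: pos={elem['pos']}")
        # 4) Category dictionary: assign categories to the normal form.
        for entry_id, category in categories.get(elem["normal"], []):
            elem["categories"].append(category)
            elem["log"].append(f"category entry {entry_id}: {category}")
        trace.append(elem)
    return trace


trace = trace_preprocessing(
    ["TNFa", "induces", "p53"],
    notation={"TNFa": ("e1", "tumor necrosis factor")},
    pos_tags={"induces": "V"},
    categories={"tumor necrosis factor": [("e5", "cytokine")]},
)
```

The `log` list of each element plays the role of the trace result: it records which dictionary entry or rule was applied at each step, including steps that assigned nothing.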
[0127]
FIG. 8 is a diagram illustrating an example of a trace result display screen displayed on the output device 114.
The trace result display screen includes an original text input field MC-1 and a trace result display area MC-2 as shown in FIG. In the original text input field MC-1, the original text to be processed may be typed directly, or a text identifier may be entered in the text identifier input field MC-3 and the original text acquisition button MC-4 pressed to acquire the original text information. When the user selects the trace button MC-5, trace result information is displayed in the trace result display area MC-2.
[0128]
In the trace result display area, the following information is displayed repeatedly for each element structure (word) of the original text. In the example of FIG. 8, the notation of word 1 has been normalized to the normal forms t1, t2, .... Entry e1 of the notation dictionary D1 was applied for the normalization to the normal form t1 and the assignment of the part of speech N. The technical term discrimination rule F was applied for the normal form t2. No part of speech was assigned to word 1 by the syntax analysis. The normal form t1 belongs to the categories c1, c4, .... Entry e5 of the category dictionary D2 was applied for membership in category c1, and entry e6 of the category dictionary D4 was applied for membership in category c4.
This completes the trace result display screen control processing.
[0129]
[Syntax structure analysis processing]
Next, details of the syntax structure analysis processing will be described with reference to FIG. Through the processing of the syntactic structure analysis unit 102b, the text mining processing device 100 treats each ordered combination of n nouns and verbs included in the original text information as one category, according to the result of the syntax analysis of the original text information of the analysis target document stored in the analysis target document file 106c, and performs the text mining aggregation process on the analysis target document. That is, based on the result of the syntax analysis performed by the text mining unit 102p, the syntactic structure analysis unit 102b treats each ordered combination of n nouns and verbs appearing in one sentence as a category, performs the text mining aggregation process on the analysis target document, and uses the result for analysis such as the 2-D map.
[0130]
Any plurality of the above ordered combination patterns may be regarded as belonging to the same category for aggregation and analysis. Whether patterns belong to the same category is determined by one of the following two methods or a combination thereof. The first regards combination patterns that differ only in the order of some of their elements as the same category. The second regards patterns as the same category when the only difference between them is a difference between words that belong to the same category.
[0131]
FIG. 9 is a conceptual diagram showing an example of the syntax structure analysis processing of the present invention. As shown in FIG. 9, text mining analysis is performed by treating documents that include sentences in which specific words appear in a specific order as belonging to the same category. In the example of FIG. 9, documents having a sentence matching a pattern in which the noun n1 is at the head and the verb v1 and a noun belonging to the category c1 appear in an arbitrary order are totaled in the same category. In the pattern notation illustrated in FIG. 9, “*” indicates that any word element may be included, “(notation 1 | notation 2)” indicates that either notation 1 or notation 2 may be used, and the order in which the notations are written indicates the order of appearance.
Thus, the syntax structure analysis processing ends.
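As an illustration of the FIG. 9 pattern, the following is a minimal sketch, not the implementation of the syntactic structure analysis unit 102b. It assumes that each sentence has already been parsed into tokens carrying a normal form, a part of speech, and an optional category; the `Token` type and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Token:
    """One parsed word (hypothetical structure, not from the specification)."""
    normal_form: str
    pos: str                        # "N" for noun, "V" for verb, anything else otherwise
    category: Optional[str] = None  # category assigned by the category dictionary


def matches_fig9_pattern(sentence: List[Token]) -> bool:
    """Noun n1 at the head, then verb v1 and a noun of category c1 in either order.

    Any other word elements may appear in between (the "*" of the notation).
    """
    if not sentence:
        return False
    head = sentence[0]
    if head.pos != "N" or head.normal_form != "n1":
        return False
    rest = sentence[1:]
    has_verb = any(t.pos == "V" and t.normal_form == "v1" for t in rest)
    has_c1_noun = any(t.pos == "N" and t.category == "c1" for t in rest)
    return has_verb and has_c1_noun


def count_documents(documents: List[List[List[Token]]]) -> int:
    """Number of documents containing at least one matching sentence."""
    return sum(any(matches_fig9_pattern(s) for s in doc) for doc in documents)
```

Because the check on the remainder of the sentence ignores order, "n1 ... v1 ... (noun in c1)" and "n1 ... (noun in c1) ... v1" fall into the same category, which corresponds to the first of the two same-category methods described above.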
[0132]
[Multi-window processing]
Next, the details of the multi-window processing will be described with reference to FIG. Through the processing of the multi-windowing unit 102c, when a text mining search is performed by using another search window to narrow down the search result of one search window, the text mining processing apparatus 100 displays the related search windows and search result display windows in multi-window form and, when the display content of any one of the windows is changed, automatically reflects the change in the other windows. That is, the multi-windowing unit 102c allows the search window, the frequency graph window, the 2-D map window, the time-series graph window, and the like, which the text mining unit 102p and others output to the output device 114, to exist as independent windows while linking their information with one another.
[0133]
FIG. 10 is a diagram illustrating an example of a multi-window display screen displayed on the output device 114. FIG. 10 shows an example in which three search windows (w1, w2 and w4) and two 2-D map windows (w3 and w5) are all displayed simultaneously. The search window (w1) holds a document set as a population. The search window (w2) holds a document set obtained by narrowing down the set of the search window (w1) by the keyword kw1. The 2-D map window (w3) displays a 2-D map analysis result for the document set in the search window (w2). The search window (w4) holds a document set obtained by narrowing down the set of the search window (w1) by the keyword kw2. The 2-D map window (w5) displays a 2-D map analysis result for the document set in the search window (w4).
Thus, the multi-windowing process ends.
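The linkage among the windows of FIG. 10 can be sketched as a simple dependency chain in which each window refreshes the windows derived from it. This is only one possible design under stated assumptions: the class names are hypothetical, a document set is modeled as a dictionary from document identifier to text, keyword narrowing is modeled as a substring test, and the 2-D map window is reduced to the list of document identifiers it would analyze.

```python
class Window:
    """Base window that propagates changes to the windows derived from it."""

    def __init__(self, name, source=None):
        self.name = name
        self.source = source
        self.dependents = []
        if source is not None:
            source.dependents.append(self)

    def recompute(self):
        """Rebuild this window's content from its source (no-op by default)."""

    def refresh(self):
        self.recompute()
        for window in self.dependents:
            window.refresh()


class SearchWindow(Window):
    def __init__(self, name, documents=None, source=None, keyword=None):
        super().__init__(name, source)
        self.keyword = keyword
        self.documents = dict(documents or {})
        self.recompute()

    def recompute(self):
        # A derived search window narrows its source by a keyword.
        if self.source is not None:
            self.documents = {doc_id: text
                              for doc_id, text in self.source.documents.items()
                              if self.keyword in text}

    def set_documents(self, documents):
        """Change the content of this window and reflect it in the others."""
        self.documents = dict(documents)
        self.refresh()


class MapWindow(Window):
    """Stand-in for a 2-D map window; it only tracks the analyzed documents."""

    def __init__(self, name, source):
        super().__init__(name, source)
        self.doc_ids = []
        self.recompute()

    def recompute(self):
        self.doc_ids = sorted(self.source.documents)


w1 = SearchWindow("w1", {1: "kw1 kw2", 2: "kw1", 3: "kw2"})
w2 = SearchWindow("w2", source=w1, keyword="kw1")
w3 = MapWindow("w3", w2)
w4 = SearchWindow("w4", source=w1, keyword="kw2")
w5 = MapWindow("w5", w4)
```

Calling `w1.set_documents(...)` changes the population and, through `refresh`, updates w2 through w5 without any further action by the user.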
[0134]
[2-D map display screen control processing]
Next, details of the 2-D map display screen control processing will be described with reference to FIGS.
[0135]
Through the processing of the 2-D map display screen control unit 102d, the text mining processing device 100 sorts or clusters the category items corresponding to the columns or rows of the 2-D map that displays the text mining result of the text mining unit 102p, and outputs the 2-D map window to the output device 114.
[0136]
For example, through the processing of the item sorting unit 102m, the 2-D map display screen control unit 102d sorts and displays the category items corresponding to the columns or rows of the 2-D map in the original mode, the frequency order mode, the alphabetical order mode, or the like. In the original mode, the category items are arranged in the order in which they are defined (stored) in the category dictionary or the like stored in the category dictionary information file 106b. In the frequency order mode, the sum of the frequency values of the column or row belonging to a category item is used as the frequency value of that category item, and the category items are arranged in descending order of the frequency value. In the alphabetical order mode, the category items are arranged so that their name character strings are in dictionary order.
[0137]
Here, FIG. 11 is a diagram illustrating an example of control (sort processing) of the 2-D map display screen displayed on the output device 114. As shown in FIG. 11, the 2-D map window (w1) shows a state in which both the vertical and horizontal directions are sorted in alphabetical order of item names. The 2-D map window (w2) shows a state in which the vertical direction is sorted in alphabetical order of item names and the horizontal direction is sorted in frequency order. In the 2-D map window (w2), the column totals of the horizontal 2-D map items a, b, c, and f are 14, 18, 8, and 15, respectively, so the items are sorted such that b, which has the largest value, is leftmost and c, which has the smallest value, is rightmost. The 2-D map window (w3) shows a state in which the vertical direction is sorted in frequency order and the horizontal direction is sorted in alphabetical order of item names. Here, in the 2-D map window (w3), the row totals of the vertical 2-D map items j, k, and p are 20, 19, and 16, respectively, so no reordering occurs. The 2-D map window (w4) shows a state in which both the vertical and horizontal directions are sorted in frequency order.
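The three sort modes can be illustrated with a short sketch. The row and column totals below are those quoted for FIG. 11; the individual cell frequencies are hypothetical values chosen only so that they add up to those totals, and the function names are assumptions made for illustration.

```python
# Hypothetical cell frequencies; only the row and column totals come from FIG. 11.
cells = {
    "j": {"a": 5, "b": 7, "c": 3, "f": 5},
    "k": {"a": 5, "b": 6, "c": 3, "f": 5},
    "p": {"a": 4, "b": 5, "c": 2, "f": 5},
}


def compute_totals(cells):
    """Return (row totals, column totals) of a 2-D map given as nested dicts."""
    row_totals = {row: sum(values.values()) for row, values in cells.items()}
    col_totals = {}
    for values in cells.values():
        for col, freq in values.items():
            col_totals[col] = col_totals.get(col, 0) + freq
    return row_totals, col_totals


def sort_items(items, mode, totals=None):
    """Order category items in the original, alphabetical, or frequency mode."""
    if mode == "original":
        return list(items)        # keep the order defined in the category dictionary
    if mode == "alphabetical":
        return sorted(items)      # dictionary order of the item names
    if mode == "frequency":
        # Descending total frequency; ties keep their original relative order.
        return sorted(items, key=lambda item: totals[item], reverse=True)
    raise ValueError(f"unknown sort mode: {mode}")


row_totals, col_totals = compute_totals(cells)
print(sort_items(col_totals, "frequency", col_totals))  # ['b', 'f', 'a', 'c']
print(sort_items(row_totals, "frequency", row_totals))  # ['j', 'k', 'p']
```

With these totals the horizontal items become b, f, a, c in frequency order, while the vertical items j, k, p keep their alphabetical order because their totals are already descending.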
[0138]
Further, under the control of the item clustering unit 102n, the 2-D map display screen control unit 102d characterizes each row item or column item of the 2-D map by a vector whose elements correspond to the items of the other axis. Here, the item clustering unit 102n may define the similarity between category items by the inner product of these vectors or the like. In addition, the item clustering unit 102n may use an existing method as the clustering algorithm and display the category items of the rows and the columns hierarchically.
[0139]
Further, the item clustering unit 102n rearranges the category items so that they fit the resulting hierarchy. Here, the item clustering unit 102n may perform the rearrangement by either of the following methods or a combination thereof. In the first method, a plurality of category items of interest are specified, clusters containing many of the specified category items are gathered toward the origin (upper left), and the clusters and category elements are rearranged so that the specified category items are as close to the origin as possible. In the second method, the clusters are rearranged such that clusters containing more category elements are closer to the origin (upper left).
[0140]
FIG. 12 is a diagram illustrating an example of control (clustering) of the 2-D map display screen displayed on the output device 114. The example of FIG. 12 shows a 2-D map obtained by performing clustering on both the columns and the rows. In this figure, regarding the rows, the category items aa, ab, and ac are grouped into cluster c1, ad, ae, af, and ag into c2, ah and ai into c3, aj into c4, ak, al, and am into c5, ..., and as into c7. Further, clusters c1 and c2 are combined into cluster c8, c3 and c4 into c9, c5, c6 and c7 into c10, c8 and c9 into c11, and c10 into c12. In the column direction, the category items ba and bb are grouped into cluster c20, bc, bd and be into c21, bf and bg into c22, bh and bi into c23, bj into c24, bk and bl into c25, bm into c26, ..., and bz into c28. Further, clusters c20 and c21 are combined into cluster c29, c22 and c23 into c30, c24, c25 and c26 into c31, c27 and c28 into c32, c29 and c30 into c33, and c31 and c32 into c34. The category items are rearranged so that the tree structure of the clusters can be drawn on a plane, as shown in FIG. 12.
[0141]
The item clustering unit 102n may perform clustering of each item in the following procedure.
(1) First, the item clustering unit 102n clusters the category items (aa to as) in the row direction by the following method.
[0142]
(1-1) Define feature vectors for each category item
The item clustering unit 102n uses, as the feature vector of a row category item, the vector of its co-occurrence frequencies with the category items in the column direction. For example, the feature vector of the row category item aa is defined as ((aa, ba), (aa, bb), (aa, bc), ..., (aa, bz)). Here, (aa, ba) represents the co-occurrence frequency of the row category item aa and the column category item ba (the frequency of documents that include both category items).
[0143]
(1-2) Cluster and rearrange the category items based on their similarity
The item clustering unit 102n defines the similarity between any two row category items as the inner product of the feature vectors defined above, and calculates it. A general clustering algorithm is then applied so that row category items with high similarity are gathered together.
[0144]
(2) Next, the item clustering unit 102n clusters the category items (ba to bz) in the column direction by applying the method of (1) with the rows and columns interchanged.
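Steps (1-1) and (1-2) can be sketched as follows. The specification only calls for "a general clustering algorithm", so the greedy average-linkage agglomeration used here is an assumption chosen for brevity, and the function names and sample frequencies are hypothetical.

```python
def inner_product(u, v):
    """Similarity between two category items, as in step (1-2)."""
    return sum(x * y for x, y in zip(u, v))


def cluster_order(vectors):
    """Greedy agglomerative clustering of category items.

    vectors maps each row category item to its feature vector of
    co-occurrence frequencies with the column category items (step (1-1)).
    Returns the item order implied by the cluster tree and the merge list.
    """
    clusters = [[name] for name in vectors]
    merges = []

    def similarity(c1, c2):
        # Average pairwise inner product between two clusters (assumed linkage).
        total = sum(inner_product(vectors[a], vectors[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = max(pairs, key=lambda p: similarity(clusters[p[0]], clusters[p[1]]))
        merges.append((tuple(clusters[i]), tuple(clusters[j])))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0], merges


# Hypothetical co-occurrence frequencies of four row items with four column items.
row_vectors = {
    "aa": [5, 4, 0, 0],
    "ab": [4, 5, 0, 0],
    "ac": [0, 0, 3, 4],
    "ad": [0, 1, 4, 3],
}
order, merges = cluster_order(row_vectors)
```

Concatenating the members of merged clusters keeps similar items adjacent, so the resulting order lets the cluster tree be drawn on a plane. For step (2), the same function can be applied to the column vectors of the transposed frequency table. Note that the raw inner product favors high-frequency items; a normalized similarity is a possible variation that the specification leaves open with "or the like".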
Thus, the 2-D map display screen control processing ends.
[0145]
[Operation history collection processing]
Next, details of the operation history collection processing will be described with reference to FIG.
Through the processing of the operation history collection unit 102e, the text mining processing apparatus 100 automatically records, in the operation history information file 106d, operation history information on text mining operations performed interactively, including items such as the operation time, user identifier, operation name, operation arguments, operation target, and operation result. In addition to these items, the operation history collection unit 102e may record the user's comment on the intention of the operation. The user may enter such a comment by instructing the analysis tool to perform a comment input operation. The comment may take the form of text data, audio data, still image data, moving image data, or the like, or any combination thereof. The operation history collection unit 102e can then create an operation history collection screen by referring as appropriate to the operation history information collected in the operation history information file 106d, and display the screen on the output device 114.
[0146]
FIG. 13 is a diagram illustrating an example of the operation history collection screen displayed on the output device 114. As shown in FIG. 13, the automatically backed-up operation history information is output on the operation history collection screen. In the example of this figure, one row represents one history item, and each row consists of seven columns: a display area MD-1 for the reference number (history item number) of the history item, a display area MD-2 for the time of the operation, a display area MD-3 for the identifier of the user who performed the operation, a display area MD-4 for the name and type of the operation, a display area MD-5 for the parameters and arguments of the operation, a display area MD-6 for the data file targeted by the operation (identifier or the like), and a display area MD-7 for the operation result data (identifier or the like). The reference number (history item number) of each item of the operation history information is used by the present system to manage the history items.
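The seven columns MD-1 to MD-7, plus the optional comment, map naturally onto a record type. The following is a minimal sketch under stated assumptions: the class and field names are hypothetical, history item numbers are assumed to be assigned sequentially, and the argument string used for the date-range search is an invented encoding.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HistoryItem:
    number: int                    # MD-1: history item number
    time: str                      # MD-2: time of the operation
    user: str                      # MD-3: identifier of the operating user
    operation: str                 # MD-4: name and type of the operation
    argument: str                  # MD-5: parameters and arguments
    target: str                    # MD-6: data the operation was applied to
    result: str                    # MD-7: data produced by the operation
    comment: Optional[str] = None  # optional note on the operation intention


class OperationHistory:
    """In-memory stand-in for the operation history information file 106d."""

    def __init__(self, first_number: int = 1):
        self.items: List[HistoryItem] = []
        self._next_number = first_number

    def record(self, time, user, operation, argument="", target="",
               result="", comment=None) -> HistoryItem:
        item = HistoryItem(self._next_number, time, user, operation,
                           argument, target, result, comment)
        self.items.append(item)
        self._next_number += 1
        return item

    def between(self, first: int, last: int) -> List[HistoryItem]:
        """Cut out a partial history by history item number."""
        return [i for i in self.items if first <= i.number <= last]


history = OperationHistory(first_number=370)
history.record("16:44", "KN", "Open db", argument="all", result="Article set all")
# The user of the next two items and the "1990-2002" encoding are assumptions.
history.record("16:45", "KN", "Search", argument="1990-2002",
               target="Article set all", result="Article set 128")
history.record("16:46", "KN", "Search", argument="Protein A",
               target="Article set 128", result="Article set 129")
```

A method such as `between` is what a later step would need in order to cut out a run of history items, for example as the starting point of a batch script.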
[0147]
In the example of this figure, the operation history information is displayed from top to bottom in chronological order. Hereinafter, the meanings of the history items will be described in order.
[0148]
First, at time 16:44 (history item number 370), the user KN performed an “Open db” operation with all (all data handled by the text mining system) as the argument, making that data available in analysis operations as “Article set all”.
[0149]
Then, at time 16:45 (history item number 371), a search operation restricting the target to data from 1990 to 2002 was performed, and as a result “Article set 128” was generated (here, 128 is the identification number with which the text mining system manages the document set).
[0150]
At 16:46 (history item number 372), the document set 128 was searched with “Protein A”, and as a result, “Article set 129” was generated.
[0151]
Then, at 16:47 (history item number 373), “Category M” immediately below the root of the category tree was selected in the frequency graph window, and the cursor was moved to M.
[0152]
Then, at 16:51 (history item number 374), an expansion operation was performed on Category M in the frequency graph window to display the category items of the children immediately below M on the tree structure.
[0153]
At 16:51 (history item number 375), Category M / D, one of the children of M, was selected in the frequency graph window, and the cursor was moved to D.
[0154]
Then, at 16:52 (history item number 376), a frequency graph (Frequency graph 37) for category D and its children was generated and displayed for the document set 129 in the frequency graph window.
[0155]
At time 16:53 (history item number 377), in the 2-D map window, a 2-D map (2-D map 51) was generated for the children of category D and for category “P/D/A”, using the document set 129 as an argument. Here, category A is a child of D and D is a child of P; this D is different from M/D, and P is a child of the root. The number 51 is the identification number with which the text mining system manages the 2-D map.
[0156]
Then, from time 17:15 (history item number 378) to time 17:36 (history item number 383), work similar to the above-described history item numbers 372 to 377 was performed. However, in the search operation of the history item number 378, “Protein B (not A)” was used as a search key.
[0157]
Then, at 18:05 (history item number 384), the user KN selected a comment input operation using text data in the 2-D map window displaying “2-D map 52”. As a result, the user's analysis intention, conclusion, and the like regarding “2-D map 52” are recorded as operation arguments of the history item.
[0158]
Then, at 18:06 (history item number 385), in the 2-D map window displaying “2-D map 52”, the intersection of the 22nd category item of category D and the 3rd category item of category A was selected, and the set of documents in the document set 130 in which those category items co-occur was generated as “Article set 131”.
Thus, the operation history collection processing ends.
[0159]
[Operation automatic execution process]
Next, details of the operation automatic execution process will be described with reference to FIG. FIG. 14 is a conceptual diagram illustrating an example of the operation automatic execution process.
[0160]
Through the processing of the operation automatic execution unit 102f, the text mining processing apparatus 100 creates a batch script based on the operation history information collected in the operation history information file 106d, and executes the batch script. In other words, the text mining processing apparatus 100 enables any sequence of interactive operations of the text mining tool to be executed as a batch by any one of the following three methods or a combination thereof.
[0161]
The first method makes each function of the text mining system callable as a library of an existing programming language, and describes and executes the batch process in that programming language (the execution form may be a stored procedure, for example in Java).
[0162]
The second method is a method in which the text mining system is designed so as to be separated into an aggregation processing server and an interactive operation client, and a batch processing is described and executed by a module that executes a predetermined communication protocol on behalf of the client.
[0163]
A third method is to incorporate a dedicated batch script language interpreting system into a text mining system and describe and execute batch processing in the script language.
[0164]
FIG. 14 shows an example of the embodiment. In the example of this figure, the text mining system includes, in addition to an operation automatic execution unit 102f comprising a user interaction interface and a batch processing system, an operation history collection unit 102e, a text mining unit 102p, an analysis target document file 106c, an operation history information file 106d, and a batch script file 106f. The operation history collection unit 102e automatically accumulates the history of the operations performed by the text mining unit 102p in the operation history information file 106d by the above-described “automatic backup of operation history with comment” or the like. The user interaction interface of the operation automatic execution unit 102f provides a function of searching the operation history in the operation history information file 106d for a target partial history, in cooperation with the operation history collection unit 102e as necessary.
[0165]
Further, the user interaction interface of the operation automatic execution unit 102f provides a function of creating a batch script, either from scratch or with reference to a partial history, and registering it in the batch script file 106f, in cooperation with the operation history collection unit 102e as necessary. The batch processing system of the operation automatic execution unit 102f provides a function of receiving a batch script identifier and the range of values of each parameter from the user interaction interface, and acquiring the corresponding batch script from the batch script file 106f and executing it.
[0166]
An example of a batch script is shown in the lower part of FIG. In this figure, the batch script A is created with reference to the history item numbers 372 to 377 in the history example shown in FIG. First, history item numbers 372 to 377 are cut out from the operation history information accumulated in the operation history collection unit 102e via the user interactive interface. The argument (search keyword) and target (search target document set) of the operation "Search" were changed to script parameters "PARAMETER 1" and "PARAMETER 2", respectively. The result of the operation “Search” was changed to a variable “Article set a” of the script. Accordingly, the operation target of the operations “Show” and “2-D map” is changed to the variable “Article set a”. The result of the operation “Show” was changed to a variable “Frequency graph b”, and the result of the operation “2-D map” was changed to a variable “2-D map c”.
[0167]
The batch script A is executed, for example, as follows. Assume that the user designates the range of values of “PARAMETER1” as (kw1, kw2, ..., kwn) and the range of values of “PARAMETER2” as (Article set 100, Article set 101, ..., Article set 199), and instructs execution of the batch script A. The batch processing system executes the batch script A for all combinations (100 × n) of the two parameters. At execution time, the parameters of the script are replaced with actual data, and the operations are executed in the order of the script. When a script variable first appears, new data of the appropriate type is generated, and the places in the script that refer to the variable are replaced with that data. For example, if “Article set” data has been created up to 172 when the operation “Search” is executed, “Article set 173” (= a) is newly created to hold the result, and “Article set a” in the targets of the operations “Show” and “2-D map” is replaced with 173.
Thus, the operation automatic execution process ends.
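The parameter sweep described for batch script A can be sketched as below. This is a simplified illustration, not the batch processing system itself: the script is reduced to the three operations named in the text (the select and expand operations of history items 373 to 375 are omitted), the operations are only logged rather than executed, and the argument strings "Category D" and "D x P/D/A" are hypothetical encodings.

```python
from itertools import product

# Each step: (operation, argument, target, result variable).
SCRIPT_A = [
    ("Search", "PARAMETER1", "PARAMETER2", "Article set a"),
    ("Show", "Category D", "Article set a", "Frequency graph b"),
    ("2-D map", "D x P/D/A", "Article set a", "2-D map c"),
]


def run_batch(script, ranges, counters):
    """Run the script once for every combination of parameter values.

    ranges maps parameter names to the values they may take; counters maps
    a data type such as "Article set" to the last identifier already in use.
    Returns the concrete operations in execution order.
    """
    log = []
    for values in product(*ranges.values()):
        binding = dict(zip(ranges.keys(), values))   # parameters for this run
        for operation, argument, target, result in script:
            argument = binding.get(argument, argument)
            target = binding.get(target, target)
            # A result variable receives newly created data of its type.
            kind = result.rsplit(" ", 1)[0]          # "Article set a" -> "Article set"
            counters[kind] = counters.get(kind, 0) + 1
            binding[result] = f"{kind} {counters[kind]}"
            log.append((operation, argument, target, binding[result]))
    return log


counters = {"Article set": 172}
ranges = {
    "PARAMETER1": ["kw1", "kw2"],
    "PARAMETER2": ["Article set 100", "Article set 101"],
}
log = run_batch(SCRIPT_A, ranges, counters)
print(log[0])  # ('Search', 'kw1', 'Article set 100', 'Article set 173')
```

With article sets already created up to 172, the first "Search" produces "Article set 173", and the following "Show" and "2-D map" steps of the same run take that set as their target, as in the example above.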
[0168]
[Category hierarchical processing]
Next, the details of the category hierarchy processing will be described with reference to FIG.
Through the processing of the category hierarchy unit 102g, the text mining processing apparatus 100 organizes into a tree structure the aggregation results of the categories registered in the category dictionary information that is stored in the category dictionary information file 106b and used for the text mining process, and outputs the result to the output device 114. That is, the text mining processing apparatus 100 manages a large-scale concept set (thousands to tens of thousands of concepts) by giving it a tree structure. The tree structure may be generated from an existing data structure or may be newly provided, and it may be given by a known method. In addition, the category hierarchy unit 102g may have an arbitrary interactive interface function for handling concepts, and may implement operations such as selecting, folding, and expanding tree nodes by means of that function. The analysis operation is performed on the concept items or nodes that are the immediate children of the selected concept node.
[0169]
FIG. 15 is a diagram illustrating an example of a category display screen in which the categories are displayed hierarchically on the output device 114. FIG. 15 shows an embodiment in which the categories are displayed hierarchically in a tree structure. The window (w1) on the left side of FIG. 15 shows an example in which concept category items are displayed as a one-dimensional list without being hierarchized. All the concept items to be handled are listed vertically, and the scroll bar on the right is used to look for items in the window (w1). The window (w2) on the right side of FIG. 15 shows an example in which concept category items are hierarchized in a tree structure and displayed in the manner of an outline processor. A button with a “+” or “-” mark is attached to the left of each item. A normal item or an expanded node has a button with a “-” mark, and a folded node has a button with a “+” mark. When the “-” button of an expanded node (for example, “Category p3”) is pressed, the nodes below it (m1 and m2) are folded, and the button changes to a “+” mark. Conversely, pressing the “+” button of a folded node expands and displays its immediate children and changes the button to a “-” mark. If the expanded items do not fit in the window, the scroll bar is used to adjust the display area.
Thus, the category hierarchy processing is completed.
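The fold and expand behavior of the window (w2) can be modeled with a small tree structure. The sketch below is an illustration only: the class name, the bracketed button rendering, and the root node are hypothetical, while the rule for the “+” and “-” marks follows the description of FIG. 15.

```python
class CategoryNode:
    """A concept category item in the tree (hypothetical structure)."""

    def __init__(self, name, children=None, expanded=True):
        self.name = name
        self.children = list(children or [])
        self.expanded = expanded

    def toggle(self):
        """Press the button: fold an expanded node or expand a folded one."""
        if self.children:
            self.expanded = not self.expanded


def visible_lines(node, depth=0):
    """Lines shown in the window, one per visible item."""
    # A folded node shows "+"; a normal item or an expanded node shows "-".
    folded = bool(node.children) and not node.expanded
    mark = "+" if folded else "-"
    lines = ["  " * depth + f"[{mark}] {node.name}"]
    if not folded:
        for child in node.children:
            lines.extend(visible_lines(child, depth + 1))
    return lines


p3 = CategoryNode("Category p3", [CategoryNode("m1"), CategoryNode("m2")])
root = CategoryNode("root", [p3])
print("\n".join(visible_lines(root)))
```

Calling `p3.toggle()` hides m1 and m2 and changes the mark of "Category p3" to "+"; calling it again restores the expanded view.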
[0170]
[Intermediate node aggregation process]
Next, details of the intermediate node tallying process will be described with reference to FIG. FIG. 16 is a conceptual diagram illustrating an example of the intermediate node tallying process.
Through the processing of the intermediate node tallying unit 102h, when an intermediate node is treated as a concept item in the aggregation results of the categories hierarchized into a tree structure by the category hierarchy unit 102g, the text mining processing apparatus 100 uses, as the aggregation result corresponding to the intermediate node, the aggregation results corresponding to the leaf node concept items that are descendants of the intermediate node, and/or, when a normal form or alternative notation forms corresponding to the intermediate node are defined in the notation dictionary information file 106a used for the text mining process, the aggregation result of the analysis target documents containing that normal form or those alternative notation forms. That is, when an intermediate node of the above-described tree-structured concept hierarchy is treated as a concept item (for example, a category specified by the user as an aggregation target), the intermediate node tallying unit 102h associates the node with documents by one of the following two methods or a combination of them.
[0171]
The first method uses, as the aggregation result corresponding to the intermediate node, the aggregation results corresponding to the leaf node concept items that are descendants of the intermediate node. Here, when totaling the number of documents and the like, the total may be taken either as the gross total or as the total after removing duplicate documents and the like.
[0172]
The second method is a method in which, when a normal form or another notation form is defined in the intermediate node itself, a total result of a document or the like including those words is set as a total result corresponding to the intermediate node.
[0173]
In the example shown in FIG. 16, the normal forms kw1 and kw2 correspond to the intermediate node concept item p3, the normal form kw3 corresponds to the leaf node concept item m1, which is a child of p3, and the normal forms kw4, kw5, and kw6 correspond to the leaf node concept item m2, which is also a child of p3. In the document set being operated on, the numbers of hit documents for the normal forms kw1, kw2, ..., kw6 are n1, n2, ..., n6, respectively. If the policy is to count hit documents by the gross total, the number of hit documents of the concept item m1 is n3, and that of m2 is n4 + n5 + n6. The number of hit documents of the intermediate node concept item p3 is then as follows. With the first method, it is the total number of hit documents of the children, n3 + n4 + n5 + n6. With the second method, it is the total number of hit documents of the corresponding normal forms, n1 + n2.
With this, the intermediate node tallying process ends.
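The two tallying methods, and the choice between gross totals and duplicate removal, can be sketched as follows. The tree mirrors FIG. 16 (p3 with kw1 and kw2, m1 with kw3, m2 with kw4 to kw6); the class name, the function names, and the document identifiers in the index are hypothetical.

```python
class Concept:
    """A node of the concept tree with its own normal forms (hypothetical)."""

    def __init__(self, name, normal_forms=(), children=()):
        self.name = name
        self.normal_forms = list(normal_forms)
        self.children = list(children)


def leaves(node):
    """Leaf node concept items that are descendants of node (or node itself)."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def hit_count(node, index, method, dedupe=False):
    """Number of hit documents for a concept item.

    method 1: total over the descendant leaf node concept items.
    method 2: total over the normal forms defined on the node itself.
    index maps a normal form to the set of documents that contain it.
    """
    if method == 1:
        sources = leaves(node)
    elif method == 2:
        sources = [node]
    else:
        raise ValueError("method must be 1 or 2")
    doc_sets = [index.get(kw, set()) for s in sources for kw in s.normal_forms]
    if dedupe:
        return len(set().union(*doc_sets))   # each document counted once
    return sum(len(docs) for docs in doc_sets)  # gross total


m1 = Concept("m1", ["kw3"])
m2 = Concept("m2", ["kw4", "kw5", "kw6"])
p3 = Concept("p3", ["kw1", "kw2"], [m1, m2])

# Hypothetical hits: n1=2, n2=2, n3=1, n4=2, n5=1, n6=3.
index = {"kw1": {1, 2}, "kw2": {2, 3}, "kw3": {4},
         "kw4": {4, 5}, "kw5": {6}, "kw6": {5, 6, 7}}

print(hit_count(p3, index, method=1))               # n3+n4+n5+n6 = 7
print(hit_count(p3, index, method=2))               # n1+n2 = 4
print(hit_count(p3, index, method=1, dedupe=True))  # 4 distinct documents
```

With these sample sets the gross total for p3 under the first method is 7, while removing duplicates gives 4, which shows how much the counting policy can change the figure displayed for an intermediate node.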
[0174]
[Other embodiments]
Although the embodiments of the present invention have been described above, the present invention is not limited to these embodiments and may be implemented in various different embodiments within the scope of the technical idea described in the claims.
[0175]
For example, the case where the text mining processing apparatus 100 performs the processing in a stand-alone form has been described as an example, but the processing is performed in response to a request from a client terminal configured in a separate housing from the text mining processing apparatus 100, The processing result may be returned to the client terminal.
[0176]
Further, among the processes described in the embodiment, all or part of the processes described as being performed automatically may be performed manually, and all or part of the processes described as being performed manually may be performed automatically by a known method.
In addition, the processing procedures, control procedures, specific names, information including parameters such as various registration data and search conditions, screen examples, and database configurations shown in the above-described documents and drawings, except where otherwise noted, It can be changed arbitrarily.
[0177]
Also, regarding the text mining processing apparatus 100, the components illustrated in the drawings are functionally conceptual, and need not necessarily be physically configured as illustrated.
For example, all or any part of the processing functions provided in each unit or each device of the text mining processing device 100, in particular the processing functions performed by the control unit 102, can be realized by a CPU (Central Processing Unit) and a program interpreted and executed by the CPU, or can be realized as hardware by wired logic. The program is recorded on a recording medium described later, and is mechanically read by the text mining processing device 100 as necessary.
[0178]
That is, a computer program for giving instructions to the CPU in cooperation with an OS (Operating System) and performing various processes is recorded in the storage unit 106 such as a ROM or an HD. This computer program is executed by being loaded into a RAM or the like, and configures the control unit 102 in cooperation with the CPU. This computer program may also be recorded in an application program server connected to the text mining processing apparatus 100 via an arbitrary network 300, and all or part of it may be downloaded as needed.
[0179]
Further, the program according to the present invention can be stored in a computer-readable recording medium. Here, the “recording medium” includes any “portable physical medium” such as a flexible disk, a magneto-optical disk, a ROM, an EPROM, an EEPROM, a CD-ROM, an MO, or a DVD; any “fixed physical medium” such as a ROM, a RAM, or an HD built into various computer systems; and a “communication medium” that holds the program for a short time, such as a communication line or a carrier wave used when the program is transmitted via a network represented by a LAN, a WAN, or the Internet.
[0180]
The “program” is a data processing method described in an arbitrary language or description method, and may be in any format such as source code or binary code. The “program” is not necessarily limited to a single program; it includes programs distributed as a plurality of modules or libraries and programs that achieve their functions in cooperation with a separate program represented by an OS (Operating System). Note that known configurations and procedures can be used for the specific configuration for reading the recording medium, the reading procedure, the installation procedure after reading, and the like in each apparatus described in the embodiments.
[0181]
The various databases and the like (notation dictionary information file 106a to batch script file 106f) stored in the storage unit 106 are held in storage devices such as memory devices (RAM, ROM, and the like), fixed disk devices such as hard disks, flexible disks, and optical disks, and store the various programs, tables, files, databases, web page files, and the like used for the various processes and for providing websites.
[0182]
Further, the text mining processing apparatus 100 may be realized by connecting peripheral devices such as a printer, a monitor, and an image scanner to an information processing apparatus such as a known personal computer, workstation, or other information processing terminal, and installing on that information processing apparatus software (including programs, data, and the like) that implements the method of the present invention.
[0183]
Further, the specific form of distribution and integration of the text mining processing apparatus 100 is not limited to the illustrated one; all or part of it may be functionally or physically distributed or integrated in arbitrary units according to various loads and the like. For example, each database may be configured independently as a separate database device, and part of the processing may be realized using a CGI (Common Gateway Interface).
[0184]
Further, the network 300 has a function of interconnecting the text mining processing apparatus 100 and the external system 200, and may include, for example, any of the Internet, an intranet, a LAN (including both wired and wireless), a VAN, a personal computer communication network, a public telephone network (including both analog and digital), a private line network (including both analog and digital), a CATV network, a mobile circuit-switched network or mobile packet-switched network such as the IMT2000 system, the GSM system, or the PDC / PDC-P system, a radio paging network, a local radio network such as Bluetooth, a PHS network, and a satellite communication network such as CS, BS or ISDB. That is, the present system can transmit and receive various data via any network, whether wired or wireless.
[0185]
[Effects of the Invention]
As described above in detail, according to the present invention, the original text information of the analysis target document is output to the output device together with aggregation key list information, which lists the terms that are contained in the original text information and were subject to aggregation and associates with each term the type of the term and/or a link button to the storage address of the term. Because the original text is displayed together with the list of words that served as aggregation keys, the end user can easily understand which operation in the series of analysis operations caused the document to be acquired. As a result, even an inexperienced user can avoid operations that cause noise and perform highly accurate analysis work. In addition, by providing links from the original text to external databases, the end user can know exactly what the acquired document is about. This information helps the user learn which operations generate search noise, thereby improving the accuracy of analysis work.
[0186]
Also, according to the present invention, the following are output to the output device: a search term input by the user; information on the corresponding normal form, and on its notation dictionary entry, retrieved from the notation dictionary information based on the search term; and information on the corresponding category, and on its category dictionary entry, retrieved from the category dictionary information based on the search term. By checking in this way how the notation dictionary and the category dictionary apply to a specific term, the user can select terms suitable for separating documents into the target categories. Further, by repeating the term search, the user can select a dictionary file that frequently contains terms likely to expand into the many category items that are to be separated. In addition, the accuracy of those category groups can be estimated. Further, if a term characterizing a certain category is already known, the recall of that category can be estimated by confirming whether a dictionary entry exists for the term.
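The dictionary entry search described above can be illustrated by the following minimal sketch. The dictionary contents, the names NOTATION_DICT, CATEGORY_DICT, and lookup_entry, and the assumption that the category dictionary is keyed by normal form are all hypothetical and are given only for illustration; they are not taken from the specification.

```python
# Hypothetical dictionary contents, for illustration only.
NOTATION_DICT = {          # notation form -> normal form
    "IL-2": "interleukin 2",
    "interleukin-2": "interleukin 2",
    "interleukin 2": "interleukin 2",
}
CATEGORY_DICT = {          # normal form -> categories (assumed keying)
    "interleukin 2": ["cytokine"],
}


def lookup_entry(search_term):
    """Return the normal form and the categories found for a search term."""
    normal_form = NOTATION_DICT.get(search_term)
    categories = CATEGORY_DICT.get(normal_form, []) if normal_form else []
    return {
        "term": search_term,
        "normal_form": normal_form,
        "categories": categories,
    }
```

A term with no notation dictionary entry yields a normal form of None and an empty category list, which is the kind of result a user could use to judge whether a dictionary covers a known term.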
[0187]
Further, according to the present invention, trace result information including at least one of a notation dictionary search result, part-of-speech information obtained by syntactic analysis processing, and a category dictionary search result is output to the output device for the terms that are included in the original text information and are subject to aggregation. By displaying which dictionary entry normalized and categorized each element of the original text, the reason a document was acquired can be understood in more detail. In addition, application information that cannot be determined from the terms alone, such as how a collocation was grouped or which normal form the parsing system assigned, can be grasped accurately. Furthermore, this makes it possible to identify which term caused the acquisition of an original text that appears to be irrelevant.
[0188]
Further, according to the present invention, based on the result of syntactic analysis of the original text information of the document to be analyzed, text mining aggregation is performed on the document with each ordered combination of n nouns and verbs included in the original text information treated as one category. By using such n-term relation patterns as aggregation targets, documents that could not be distinguished by term type alone can be separated, further improving the analysis accuracy.
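As a rough illustration of aggregating n-term relations, the following sketch counts ordered (noun, verb, noun) triples, that is, the case n = 3. It assumes that part-of-speech tagging has already been performed by some parser, and it approximates the syntax structure by looking only at adjacent nouns and verbs; the tag names and function names are hypothetical.

```python
from collections import Counter


def ternary_relations(tagged_sentence):
    """Yield ordered (noun, verb, noun) triples from one tagged sentence."""
    # Keep only nouns and verbs, preserving their order in the sentence.
    content = [(w, p) for w, p in tagged_sentence if p in ("NOUN", "VERB")]
    for i in range(len(content) - 2):
        (w1, p1), (w2, p2), (w3, p3) = content[i:i + 3]
        if (p1, p2, p3) == ("NOUN", "VERB", "NOUN"):
            yield (w1, w2, w3)


def count_relations(documents):
    """Count, for each triple, the number of documents containing it."""
    counts = Counter()
    for doc in documents:
        seen = set()
        for sentence in doc:
            seen.update(ternary_relations(sentence))
        counts.update(seen)  # each document contributes at most once per triple
    return counts
```

Because the triples are ordered, a document stating that A activates B and a document stating that B activates A fall into different categories, even though both contain the same terms.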
[0189]
According to the present invention, when a search is performed by narrowing the search results of one text mining search window using search conditions in another search window, the related search windows and search result display windows are displayed as multiple windows, and when the display content of any window is changed, the change is automatically reflected in the other windows. Any work state can therefore be left on screen as needed, which reduces the amount of information the end user must remember for the analysis. This makes the analysis work more efficient. Further, the display area of a computer terminal having a plurality of screens can be used effectively.
[0190]
Also, according to the present invention, for the 2-D map displaying the text mining result, the category items corresponding to the columns or rows are sorted and the 2-D map window is output to the output device. If the category items of interest are located at specific positions in the original category definition order, sorting in the original order makes them easier to find. If the category items of interest are those with a high frequency of appearance, sorting in descending order of frequency makes them easier to find. Furthermore, if the category items of interest begin with a specific name, sorting in alphabetical order makes them easier to find.
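The three sort orders mentioned above can be sketched as follows. The representation of the 2-D map as a list of rows of counts, the function name sort_columns, and the mode names are assumptions made for this example only.

```python
def sort_columns(matrix, columns, mode="original"):
    """Reorder the columns of a 2-D map.

    matrix:  list of rows, each a list of counts aligned with `columns`
    columns: category item names in their original definition order
    mode:    "original", "frequency" (descending total), or "alphabetical"
    """
    idx = list(range(len(columns)))
    if mode == "frequency":
        totals = [sum(row[j] for row in matrix) for j in idx]
        idx.sort(key=lambda j: -totals[j])   # stable: ties keep original order
    elif mode == "alphabetical":
        idx.sort(key=lambda j: columns[j])
    elif mode != "original":
        raise ValueError(f"unknown sort mode: {mode}")
    sorted_columns = [columns[j] for j in idx]
    sorted_matrix = [[row[j] for j in idx] for row in matrix]
    return sorted_columns, sorted_matrix
```

The same function can be applied to rows by transposing the matrix before and after the call.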
[0191]
Further, according to the present invention, for the 2-D map displaying the text mining result, the category items corresponding to the columns or rows are clustered and the 2-D map window is output to the output device. By grouping items having common patterns into clusters, the effort of searching for category items can be reduced and the analysis work can be made more efficient.
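The specification does not fix a particular clustering algorithm here. As one simple possibility, the following sketch groups category items whose columns are non-zero in exactly the same rows; the function name and the choice of "same occurrence pattern" as the similarity criterion are assumptions for illustration.

```python
from collections import defaultdict


def cluster_columns(matrix, columns):
    """Group column names that share the same occurrence pattern over rows."""
    groups = defaultdict(list)
    for j, name in enumerate(columns):
        # The pattern records in which rows this category item occurs at all.
        pattern = tuple(row[j] > 0 for row in matrix)
        groups[pattern].append(name)
    return list(groups.values())
```

Columns that end up in the same cluster can then be displayed next to each other, so that items with a common pattern are found together.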
[0192]
According to the present invention, operation history information on at least one of the operation time, user identifier, operation name, operation arguments, operation target, operation result, and the user's comment on the intent of the operation is collected for each operation during text mining, so the registered contents of the notation dictionary and the category dictionary can be checked based on the operation history. In addition, by using the operation history as a template when creating the instructions (batch script) for the automatic operation execution process (batch process) described below, frequently performed analyses can easily be turned into batch processes. In addition, because the user's comments on operation intent are stored, even when many interactive operations are recorded in the work history, the places to be batched can be found quickly using those comments as clues, making batch script creation more efficient. Also, by leaving a comment at a place the user intends to batch later, the work of examining what to batch when creating the batch script is reduced, making batch script creation more efficient.
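One way to hold the operation history items listed above is a simple record type, sketched below. The field names, the JSON serialization, and the use of an in-memory list as the log are assumptions for this example; the specification only names the kinds of information collected.

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone


@dataclass
class OperationRecord:
    user_id: str
    operation: str
    arguments: dict
    target: str
    result: str
    comment: str = ""   # user's note on the intent of the operation
    time: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def append_history(log, record):
    """Append one operation record to the log as a JSON string."""
    log.append(json.dumps(asdict(record), ensure_ascii=False))
```

A comment such as "batch from here" stored with a record gives a later search through the history a clear starting point.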
[0193]
Further, according to the present invention, a batch script is created based on the collected operation history information and is then executed, so the time the user must spend operating the tool can be reduced. In addition, analysis processes performed at regular intervals can be run automatically. Further, heavily loaded analysis processing can be executed during off-peak periods of the system.
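The creation and execution of a batch script from the operation history might be sketched as follows. Here the history is assumed to be a list of dictionaries with "operation", "arguments", and "comment" keys, the batch script is represented simply as a list of steps, and the handlers passed to run_batch are hypothetical stand-ins for the actual text mining operations.

```python
def build_batch(history, start_comment):
    """Return the steps from the first record carrying start_comment onward."""
    for i, rec in enumerate(history):
        if rec.get("comment") == start_comment:
            return [(r["operation"], r["arguments"]) for r in history[i:]]
    return []   # no record carries the comment


def run_batch(steps, operations):
    """Execute each step by calling the handler registered under its name."""
    return [operations[name](**args) for name, args in steps]
```

Such a step list could be stored and run on a schedule, for example during off-peak hours, without the user repeating the interactive operations.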
[0194]
Further, according to the present invention, the aggregation results for the categories registered in the category dictionary information used for text mining are hierarchized into a tree structure and output to the output device. By hierarchizing the results in a tree structure and allowing folding and expansion operations, the number of concept items displayed on the screen at one time through the user interaction interface can be kept small. This also makes it easy to search for a target concept item.
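The effect of folding and expansion on the number of displayed items can be illustrated with the short sketch below. The tree is assumed to be a dictionary mapping each parent to its list of children, and the set of expanded nodes stands in for the user's interactive state; both are hypothetical representations.

```python
def visible_items(tree, expanded, root, depth=0):
    """Return the indented lines shown when only `expanded` nodes are open."""
    lines = ["  " * depth + root]
    if root in expanded:
        for child in tree.get(root, []):
            lines.extend(visible_items(tree, expanded, child, depth + 1))
    return lines
```

With every node folded, only the root is listed; expanding a node adds just its direct children, so the user controls how many concept items appear at once.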
[0195]
Further, according to the present invention, at least a part of the output categories hierarchized by the tree structure can be selected. When performing an interactive text mining operation, the user can therefore select the desired subset of categories from a screen on which the categories are displayed hierarchically in a tree structure, so the hierarchical categories can be used not only in the final output but also during operation. This allows interactive text mining analysis operations that require selecting a portion of the categories to be performed efficiently even when a large-scale category structure is targeted.
[0196]
Furthermore, according to the present invention, when an intermediate node is treated as a concept item in the aggregation results of categories hierarchized in a tree structure, the aggregation results corresponding to the leaf node concept items that are descendants of the intermediate node are used as the aggregation result for the intermediate node (first aggregation method), and/or, when a normal form or a variant notation form corresponding to the intermediate node is defined in the notation dictionary used for text mining, the aggregation result for the documents to be analyzed that contain the normal form or the variant notation form is used as the aggregation result for the intermediate node (second aggregation method). With the first aggregation method, even a concept category structure in which no normal-form term corresponds to an intermediate node can be processed. It also allows a category structure to be designed with a high degree of freedom, for example by dividing a large-scale concept category structure at an appropriate granularity. With the second aggregation method, the number of documents can be counted with high accuracy when the concept category structure has a normal-form term corresponding to the intermediate node. Such cases are common in concept category structures created from existing data structures, which can thereby be put to use. By choosing between these two methods or combining them as the case requires, the cost of creating a concept category structure can be reduced and large-scale category concepts become easier to use.
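The two aggregation methods for intermediate nodes can be contrasted in the following sketch. It assumes that aggregation results are held as sets of document identifiers keyed by normal form (with variant notation forms already normalized), and that the first method combines leaf results by set union so that a document matching several leaves is counted once; these representations and the function names are assumptions for illustration.

```python
def leaves(tree, node):
    """Return the leaf nodes under `node` (the node itself if it is a leaf)."""
    children = tree.get(node, [])
    if not children:
        return [node]
    return [leaf for child in children for leaf in leaves(tree, child)]


def aggregate_by_descendants(tree, node, docs_by_term):
    """First method: documents matched by any descendant leaf concept item."""
    docs = set()
    for leaf in leaves(tree, node):
        docs |= docs_by_term.get(leaf, set())
    return docs


def aggregate_by_own_term(node, docs_by_term):
    """Second method: documents containing the node's own normal form."""
    return set(docs_by_term.get(node, set()))


def aggregate_combined(tree, node, docs_by_term):
    """Both methods together."""
    return aggregate_by_descendants(tree, node, docs_by_term) | aggregate_by_own_term(node, docs_by_term)
```

When the intermediate node has no term of its own in the dictionary, the second method returns an empty set and only the first method contributes, which corresponds to the case of a category structure whose intermediate nodes are purely organizational.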
[Brief description of the drawings]
FIG. 1 is a conceptual diagram showing an outline of a text mining process.
FIG. 2 is a conceptual diagram showing a concept of a 2-D map displayed in step SA-6 of FIG.
FIG. 3 is a block diagram illustrating an example of a configuration of the present system to which the present invention is applied.
FIG. 4 is a block diagram showing an example of a configuration of an analysis procedure evaluation unit 102a to which the present invention is applied.
FIG. 5 is a block diagram illustrating an example of a configuration of a 2-D map display screen control unit 102d to which the present invention is applied.
FIG. 6 is a diagram showing an example of an original text display screen displayed on the output device 114.
FIG. 7 is a diagram showing an example of a dictionary entry search screen displayed on the output device 114.
FIG. 8 is a diagram showing an example of a trace result display screen displayed on the output device 114.
FIG. 9 is a conceptual diagram illustrating an example of a syntax structure analysis process according to the present invention.
FIG. 10 is a diagram showing an example of a multi-window display screen displayed on the output device 114.
FIG. 11 is a diagram illustrating an example of control (sort processing) of a 2-D map display screen displayed on the output device 114.
FIG. 12 is a diagram illustrating an example of control (clustering) of a 2-D map display screen displayed on the output device 114.
FIG. 13 is a diagram showing an example of an operation history collection screen displayed on the output device 114.
FIG. 14 is a conceptual diagram illustrating an example of an operation automatic execution process.
FIG. 15 is a diagram showing an example of a category display screen in which categories are displayed hierarchically on the output device 114.
FIG. 16 is a conceptual diagram illustrating an example of an intermediate node aggregation process.
FIG. 17 is a diagram showing an example of the notation dictionary information stored in the notation dictionary information file 106a.
FIG. 18 is a diagram illustrating an example of category dictionary information stored in a category dictionary information file 106b.
[Explanation of symbols]
100 Text mining processor
102 Control unit
102a Analysis procedure evaluation unit
102b Syntax structure analysis unit
102c Multi-windowing unit
102d 2-D map display screen control unit
102e Operation history collection unit
102f Operation automatic execution unit
102g Category hierarchization unit
102h Intermediate node aggregation unit
102i Original text display screen control unit
102j Dictionary entry search screen control unit
102k Trace result display screen control unit
102m Item sorting unit
102n Item clustering unit
102p Text mining unit
104 Communication control interface unit
106 Storage unit
106a Notation dictionary information file
106b Category dictionary information file
106c Analysis target document file
106d Operation history information file
106e Processing result file
106f Batch script file
108 I/O control interface
112 Input device
114 Output device
200 External system
300 Network

Claims

1. A text mining processing device that counts the appearance frequency of each term appearing in a document to be analyzed, comprising:
original text display control means for controlling output, to an output device, of
the original text information of the document to be analyzed, and
aggregation key list information, which is a list of the terms included in the original text information that are subject to aggregation, in which each term is associated with a type of the term and/or a link button to a storage address of the term.

2. A text mining processing device that counts the appearance frequency of each term appearing in a document to be analyzed, comprising:
dictionary entry search screen control means for controlling output, to an output device, of
a search term entered by a user,
information on the corresponding normal form retrieved from notation dictionary information based on the search term, and on its notation dictionary entry, and
information on the corresponding category retrieved from category dictionary information based on the search term, and on its category dictionary entry.

3. A text mining processing device that counts the appearance frequency of each term appearing in a document to be analyzed, comprising:
dictionary entry search screen control means for controlling output, to an output device, of
the original text information of the document to be analyzed, and
trace result information including at least one of a notation dictionary search result, part-of-speech information obtained by syntactic analysis processing, and a category dictionary search result for the terms included in the original text information that are subject to aggregation.

4. A text mining processing device that counts the appearance frequency of each term appearing in a document to be analyzed, comprising:
syntax structure analysis means for performing, based on the result of syntactic analysis of the original text information of the document to be analyzed, text mining aggregation on the document with each ordered combination of n nouns and verbs included in the original text information treated as one category.

5. A text mining processing device that counts the appearance frequency of each term appearing in a document to be analyzed, comprising:
multi-windowing means for displaying, when a search is performed by narrowing the search results of one text mining search window using search conditions in another search window, the related search windows and search result display windows as multiple windows, and for controlling the windows so that, when the display content of any window is changed, the change is automatically reflected in the other windows.

6. A text mining processing device that counts the appearance frequency of each term appearing in a document to be analyzed, comprising:
2-D map display screen control means for sorting or clustering the category items corresponding to the columns or rows of a 2-D map displaying a text mining result and outputting a 2-D map window to an output device.

7. A text mining processing device that counts the appearance frequency of each term appearing in a document to be analyzed, comprising:
an operation history collection unit for collecting, for each operation during text mining, operation history information on at least one of an operation time, a user identifier, an operation name, operation arguments, an operation target, an operation result, and a user comment on the intent of the operation.

8. The text mining processing device according to claim 7, further comprising:
an operation automatic execution unit that creates a batch script based on the operation history information collected by the operation history collection unit and executes the batch script.

9. A text mining processing device that counts the appearance frequency of each term appearing in a document to be analyzed, comprising:
category hierarchization means for hierarchizing, into a tree structure, the aggregation results of the categories registered in the category dictionary information used for text mining and outputting the result to an output device; and
category selection means for selecting at least a part of the categories hierarchized by the tree structure and output by the category hierarchization means.

10. The text mining processing device according to claim 9, further comprising:
intermediate node aggregation means that, when an intermediate node is treated as a concept item in the aggregation results of the categories hierarchized into a tree structure by the category hierarchization means,
uses the aggregation results corresponding to the leaf node concept items that are descendants of the intermediate node as the aggregation result corresponding to the intermediate node, and/or,
when a normal form or a variant notation form corresponding to the intermediate node is defined in the notation dictionary used for text mining, uses the aggregation result of the documents to be analyzed that contain the normal form or the variant notation form as the aggregation result corresponding to the intermediate node.

11. A text mining processing method for counting the appearance frequency of each term appearing in a document to be analyzed, comprising:
an original text display control step of controlling output, to an output device, of
the original text information of the document to be analyzed, and
aggregation key list information, which is a list of the terms included in the original text information that are subject to aggregation, in which each term is associated with a type of the term and/or a link button to a storage address of the term.

12. A text mining processing method for counting the appearance frequency of each term appearing in a document to be analyzed, comprising:
a dictionary entry search screen control step of controlling output, to an output device, of
a search term entered by a user,
information on the corresponding normal form retrieved from notation dictionary information based on the search term, and on its notation dictionary entry, and
information on the corresponding category retrieved from category dictionary information based on the search term, and on its category dictionary entry.

13. A text mining processing method for counting the appearance frequency of each term appearing in a document to be analyzed, comprising:
a dictionary entry search screen control step of controlling output, to an output device, of
the original text information of the document to be analyzed, and
trace result information including at least one of a notation dictionary search result, part-of-speech information obtained by syntactic analysis processing, and a category dictionary search result for the terms included in the original text information that are subject to aggregation.

14. A text mining processing method for counting the appearance frequency of each term appearing in a document to be analyzed, comprising:
a syntax structure analysis step of performing, based on the result of syntactic analysis of the original text information of the document to be analyzed, text mining aggregation on the document with each ordered combination of n nouns and verbs included in the original text information treated as one category.

15. A text mining processing method for counting the appearance frequency of each term appearing in a document to be analyzed, comprising:
a multi-windowing step of displaying, when a search is performed by narrowing the search results of one text mining search window using search conditions in another search window, the related search windows and search result display windows as multiple windows, and of controlling the windows so that, when the display content of any window is changed, the change is automatically reflected in the other windows.

16. A text mining processing method for counting the appearance frequency of each term appearing in a document to be analyzed, comprising:
a 2-D map display screen control step of sorting or clustering the category items corresponding to the columns or rows of a 2-D map displaying a text mining result and outputting a 2-D map window to an output device.

17. A text mining processing method for counting the appearance frequency of each term appearing in a document to be analyzed, comprising:
an operation history collection step of collecting, for each operation during text mining, operation history information on at least one of an operation time, a user identifier, an operation name, operation arguments, an operation target, an operation result, and a user comment on the intent of the operation.

18. The text mining processing method according to claim 17, further comprising:
an operation automatic execution step of creating a batch script based on the operation history information collected in the operation history collection step and executing the batch script.

19. A text mining processing method for counting the appearance frequency of each term appearing in a document to be analyzed, comprising:
a category hierarchization step of hierarchizing, into a tree structure, the aggregation results of the categories registered in the category dictionary information used for text mining and outputting the result to an output device; and
a category selection step of selecting at least a part of the categories hierarchized by the tree structure and output in the category hierarchization step.

20. The text mining processing method according to claim 19, further comprising:
an intermediate node aggregation step that, when an intermediate node is treated as a concept item in the aggregation results of the categories hierarchized into a tree structure in the category hierarchization step,
uses the aggregation results corresponding to the leaf node concept items that are descendants of the intermediate node as the aggregation result corresponding to the intermediate node, and/or,
when a normal form or a variant notation form corresponding to the intermediate node is defined in the notation dictionary used for text mining, uses the aggregation result of the documents to be analyzed that contain the normal form or the variant notation form as the aggregation result corresponding to the intermediate node.

21. A program that causes a computer to execute a text mining processing method for counting the appearance frequency of each term appearing in a document to be analyzed, the method comprising:
an original text display control step of controlling output, to an output device, of
the original text information of the document to be analyzed, and
aggregation key list information, which is a list of the terms included in the original text information that are subject to aggregation, in which each term is associated with a type of the term and/or a link button to a storage address of the term.

22. A program that causes a computer to execute a text mining processing method for counting the appearance frequency of each term appearing in a document to be analyzed, the method comprising:
a dictionary entry search screen control step of controlling output, to an output device, of
a search term entered by a user,
information on the corresponding normal form retrieved from notation dictionary information based on the search term, and on its notation dictionary entry, and
information on the corresponding category retrieved from category dictionary information based on the search term, and on its category dictionary entry.

23. A program that causes a computer to execute a text mining processing method for counting the appearance frequency of each term appearing in a document to be analyzed, the method comprising:
a dictionary entry search screen control step of controlling output, to an output device, of
the original text information of the document to be analyzed, and
trace result information including at least one of a notation dictionary search result, part-of-speech information obtained by syntactic analysis processing, and a category dictionary search result for the terms included in the original text information that are subject to aggregation.

24. A program that causes a computer to execute a text mining processing method for counting the appearance frequency of each term appearing in a document to be analyzed, the method comprising:
a syntax structure analysis step of performing, based on the result of syntactic analysis of the original text information of the document to be analyzed, text mining aggregation on the document with each ordered combination of n nouns and verbs included in the original text information treated as one category.

25. A program that causes a computer to execute a text mining processing method for counting the appearance frequency of each term appearing in a document to be analyzed, the method comprising:
a multi-windowing step of displaying, when a search is performed by narrowing the search results of one text mining search window using search conditions in another search window, the related search windows and search result display windows as multiple windows, and of controlling the windows so that, when the display content of any window is changed, the change is automatically reflected in the other windows.

26. A program that causes a computer to execute a text mining processing method for counting the appearance frequency of each term appearing in a document to be analyzed, the method comprising:
a 2-D map display screen control step of sorting or clustering the category items corresponding to the columns or rows of a 2-D map displaying a text mining result and outputting a 2-D map window to an output device.

27. A program that causes a computer to execute a text mining processing method for counting the appearance frequency of each term appearing in a document to be analyzed, the method comprising:
an operation history collection step of collecting, for each operation during text mining, operation history information on at least one of an operation time, a user identifier, an operation name, operation arguments, an operation target, an operation result, and a user comment on the intent of the operation.

28. The program according to claim 27, wherein the text mining processing method further comprises:
an operation automatic execution step of creating a batch script based on the operation history information collected in the operation history collection step and executing the batch script.

29. A program that causes a computer to execute a text mining processing method for counting the appearance frequency of each term appearing in a document to be analyzed, the method comprising:
a category hierarchization step of hierarchizing, into a tree structure, the aggregation results of the categories registered in the category dictionary information used for text mining and outputting the result to an output device; and
a category selection step of selecting at least a part of the categories hierarchized by the tree structure and output in the category hierarchization step.

30. The program according to claim 29, wherein the text mining processing method further comprises:
an intermediate node aggregation step that, when an intermediate node is treated as a concept item in the aggregation results of the categories hierarchized into a tree structure in the category hierarchization step,
uses the aggregation results corresponding to the leaf node concept items that are descendants of the intermediate node as the aggregation result corresponding to the intermediate node, and/or,
when a normal form or a variant notation form corresponding to the intermediate node is defined in the notation dictionary used for text mining, uses the aggregation result of the documents to be analyzed that contain the normal form or the variant notation form as the aggregation result corresponding to the intermediate node.

31. A computer-readable recording medium on which the program according to any one of claims 21 to 30 is recorded.