JP3746233B2 - Knowledge analysis system and knowledge analysis method

Info

Publication number: JP3746233B2
Application number: JP2001395284A
Authority: JP
Inventors: 将池田; 和典島川
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2001-12-26
Filing date: 2001-12-26
Publication date: 2006-02-15
Anticipated expiration: 2021-12-26
Also published as: JP2003196294A

Description

【０００１】
【発明の属する技術分野】
本発明は、ナレッジマネジメントシステムで用いられる知識分析のためのシステムおよび方法に関し、特に重要語を用いて蓄積文書の検索やクラスタリングを行うことによって知識分析の支援を行う知識分析システムおよび知識分析方法に関する。
【０００２】
【従来の技術】
近年、企業を中心に複数のユーザ間で情報共有を行うためのグループウェアの導入が進められている。代表的なグループウェアとしては、電子メールシステムやワークフローシステムなどが知られているが、最近では、知識や情報の共有支援を図るためのナレッジマネジメントシステムも開発され始めている。
【０００３】
このナレッジマネジメントシステムは、Ｗｅｂ情報や電子ファイル情報などに加え、個人のノウハウなどを知識データベースとして蓄積・管理するためのものであり、自然言語検索などの検索機能と組み合わせることにより、知識、情報の効率的な活用が可能となる。
【０００４】
ところで、このようなナレッジマネジメントシステムにおいては、個人のノウハウなどの知識をどのように収集・蓄積するかが重要なポイントとなる。個人のノウハウなどの知識は、いわゆる暗黙知であって、Ｗｅｂ情報や電子ファイル情報などのように形式化されたものではないため、それを自動的に収集、蓄積することは困難であるからである。
【０００５】
そこで、最近では、知識蓄積支援機能を持つナレッジマネジメントシステムの開発が要求されている。個人のノウハウなどの知識を自動的に収集・蓄積する仕組みを実現することにより、暗黙知としての知識をもＷｅｂ情報や電子ファイル情報などのような形式化された形式知と同様に活用することが可能となる。
【０００６】
また、このようにして蓄積された知識や情報を容易に検索するナレッジマネジメントシステムの開発も並行して行われている。典型例としては、自然言語の質問文を入力して、有用な知識や情報を検索する知識検索支援機能を持つ自然言語検索システムが挙げられる。
【０００７】
【発明が解決しようとする課題】
ところで、この種のナレッジマネジメントシステムでは、知識や情報を簡単に整理または閲覧したり、あるいは、初めて利用するユーザに対しても、どのような知識情報が検索できるのかを分かりやすく提示する等、知識データの有効活用を図るための知識の体系化が強く求められる。これに伴い、蓄積した知識をカテゴリ別に構造化（分類）すること等が行われている。
【０００８】
知識を分析してそれを構造化（分類）する処理はクラスタリングと呼ばれるが、従来では、そのクラスタリングの際の軸となる語句は、重要語としてユーザが直接指定しなければならないため、その重要語のリストアップに手間が掛かるといった問題があった。またこの場合、ユーザが重要語として指定した語句に基づいて、クラスタリングのための知識の分析および分類が行われることになるので、ユーザが重要語として指定した語句によっては、必ずしも的確な分類結果が得られない場合もある。このような状況はクラスタリングを行う場合のみならず、情報検索のための検索キーワードをユーザに指定させる場合にも同様に発生する。
【０００９】
本発明はこのような事情を考慮してなされたものであり、文書から自動的に重要語を抽出および決定してそれを用いて検索やクラスタリングを実行することにより、ユーザに重要語を指定させることなく、蓄積文書の検索又はクラスタリングによる的確な知識分析を自動的に実行することが可能な知識分析システムおよび知識分析方法を提供することを目的とする。
【００１０】
【課題を解決するための手段】
上述の課題を解決するため、本発明は、複数のクライアント端末とネットワークを介して接続可能に構成され、各クライアント端末からの要求に応じて、サーバに蓄積されている複数の文書情報の分類および分析を支援する知識分析システムであって、前記蓄積されている複数の文書情報を対象に、前記複数の文書情報それぞれに埋め込まれている強調タグに基づいて、前記複数の文書情報の各々から強調属性を持つ単語を抽出する単語抽出手段と、前記単語抽出手段によって抽出された各単語の強調属性の種別と当該単語の出現頻度とに基づいて当該単語毎に重要度を判定し、その重要度の判定結果を前記複数の文書情報にわたって累積する処理を実行して、累積された重要度の値が高い上位所定数の単語を、前記複数の文書情報にわたる文書情報全体の重要語として決定する重要語決定手段と、
前記重要語決定手段によって決定された前記文書情報全体の重要語と前記各文書情報から決定される重要語との突き合わせによって前記複数の文書情報をクラスタリングすることにより、前記複数の文書情報を分類および整理したクラスタデータベースを作成する知識分析手段とを具備することを特徴とする。
【００１４】
この知識分析システムにおいては、既に蓄積されている複数の文書情報全体を対象として、単語抽出処理と、単語抽出処理によって各文書から抽出された各単語の強調属性の種別と当該単語の出現頻度とに基づいて当該単語毎に重要度を判定し、その重要度の判定結果を複数の文書情報にわたって累積する処理とを実行して、累積された重要度の値が高い上位所定数の単語を複数の文書情報全体の重要語として決定する。そして、文書情報全体の重要語と前記各文書情報から決定される重要語との突き合わせによって前記複数の文書情報がクラスタリングされる。これにより、蓄積文書の分析に要する時間を短縮できるようになると共に、蓄積文書全体の特徴を考慮して決定された重要語を用いてクラスタリングを行うことが可能となる。
【００１５】
【発明の実施の形態】
以下、図面を参照して本発明の実施形態を説明する。
【００１６】
図１には、本発明の一実施形態に係る知識分析システムの構成が示されている。この知識分析システムは、複数のクライアント端末１１からＬＡＮ等のコンピュータネットワーク１３を介してアクセス可能なサーバコンピュータ１２にて実現されている。サーバコンピュータ１２とクライアント端末１１には、それぞれ、図示しないが、ＣＰＵ、メインメモリ、記憶装置としての磁気ディスク装置、及びキーボードやマウスなどの入力部とディスプレイなどの表示部とを持つ入出力装置が設けられている。
【００１７】
サーバコンピュータ１２上で実現された知識分析システムは、各クライアント端末１１に対して、文書情報の蓄積機能、蓄積された複数の文書情報から目的の文書を検索する検索機能、蓄積された複数の文書を分類および整理するクラスタリング機能などを提供する。クラスタリング機能によって蓄積文書の分類および整理を行い、サーバコンピュータ１２からその内容を蓄積文書の分析結果としてクライアント端末１１に提供することにより、ある目的で集められた雑多な文書群から読み取れる傾向等を知識としてユーザに提供することが出来る。ここで、クラスタリング機能の概要について説明する。
【００１８】
クラスタリング機能とは、サーバコンピュータ１２に集められた大量の文書情報それぞれの特徴を分析して、類似する内容のもの同士のグループに仕分けし、大量の文書情報をその特徴別に分類および整理したクラスタデータベースを作成する機能である。クラスタリングの実行に際しては図２のような分析条件指定画面がクライアント端末１１に提供され、その分析条件指定画面上で様々な分析条件がユーザによって指定される。図２は、「１万件ｄｂ」という名称のデータベースと「○○新聞記事」という名称のデータベースが分析対象データベースとして選択された場合の例であり、以下の指定項目がある。
【００１９】
・分析結果名称：分析結果を保存する際の名称を入力するフィールドである。
・分析対象期間：どの期間内に作成または保存された文書を分析対象とすべきかを指定するフィールドである。入力がない場合は分析対象データベース内に蓄積されている全文書が分析対象になる。
【００２０】
・絞込キーワード：蓄積文書から分析対象の文書を検索によって絞り込むための検索用キーワードを入力するフィールドである。入力がない場合は分析対象データベース内に蓄積されている全文書が分析対象になる。
・絞込件数：分析対象の文書を何件にまで絞り込むかを指定するフィールドである。
【００２１】
・階層数：クラスタリングで使用する分類体系の階層数を指定するフィールドである。「１階層」又は「多階層」のいずれかを選択。デフォルトは「多階層」であり、カテゴリ別に階層化された構造で文書の分類が行われる。
・知識の重複：１つの文書が複数の分類（クラスタ）に重複して登録されることを許可するかどうかを指定するフィールドである。「あり」又は「なし」のいずれかを選択。
【００２２】
・最上位クラスタの最大個数：最上位の階層に作成されるクラスタの最大個数を指定するフィールドである。入力がない場合は、指定なしとしてクラスタリングを行う。
【００２３】
・重要語：クラスタリングを行う際の軸となる語句を重要語として入力するためのフィールドである。
通常は、クラスタリングで使用する重要語はユーザ自身が直接指定する必要があるが、本実施形態では、重要語を自動生成する機能を有しており、ユーザに重要語を指定させることなく、自動生成した重要語を用いて蓄積文書の分析および分類を行うことが出来る。どのような観点で蓄積文書の内容を分析し、それをどのような分類体系で分類するかは、クラスタリングの軸を生成する重要語に基づいて決定されるので、いわば、重要語は蓄積情報を分析および分類する際の切り口となるものと言える。
【００２４】
次に、図１に戻り、本知識分析システムを構成する各要素について説明する。
【００２５】
クライアント端末１１では、Ｗｅｂブラウザ１１１が動作している。サーバコンピュータ１２上に構築された知識分析のためのリソースを示すＵＲＬ（ＵｎｉｆｏｒｍＲｅｓｏｕｒｃｅＬｏｃａｔｏｒ）をＷｅｂブラウザ１１１から指定することにより、サーバコンピュータ１２による知識分析処理を各クライアント端末１１から利用することができる。
【００２６】
また、クライアント端末１１には、イメージスキャナ１１４が接続されており、ユーザは紙の文書の取り込みを文字認識ツール１１３を用いて行う。さらに文書編集ツール１１２が動作しており、ユーザはこれを用いて電子文書を作成する。こうして取りこまれたイメージ文書データと、作成された電子文書データは、例えばＳＧＭＬ（Standard Generalized Markup Language）、ＨＴＭＬ（Hyper Text Markup Language）、ＸＭＬ（eXtensible Markup Language）などのマークアップ言語で記述された文書形式に自動的に変換された後に、ネットワーク１３を介してサーバコンピュータ１２に登録文書として、または重要語を決定するための入力文書として送られる。マークアップ言語で記述された文書形式の文書情報においては、文書中の語句の強調属性の種別が強調タグによって表現される。よって、各語句の強調属性に対応した強調タグが埋め込まれている文書情報が、クライアント端末１１からサーバコンピュータ１２に送られることになる。サーバコンピュータ１２では、文書情報中の強調タグを基に重要語を決定する処理が行われる。
【００２７】
サーバコンピュータ１２の知識分析機能は、主に、Ｗｅｂサーバ１２１、制御モジュール１２１１、ナレッジサーバ１２２、強調タグ設定モジュール１２２１、登録モジュール１２２２、検索モジュール１２２３、クラスタリングモジュール１２２４などのソフトウェアと、これらソフトウェアによって知識分析のために利用される管理情報と実データとによって実現される。管理情報には、各クライアント端末１１に対してユーザ認証を行うためのログイン管理情報１２１２が存在する。実データは、知識データベース１２２６、中間処理用のデータベース１２２８および分析結果データベース（クラスタＤＢ）１２２９と、強調タグテーブル１２２５と、重要語テーブル１２２７とによって構成されている。
【００２８】
強調タグテーブル１２２５は、強調属性の種別それぞれに対応した強調タグとその強調タグに対する重み付け値（ポイント）との関係を定義するためのものであり、ユーザがクライアント端末１１のＷｅｂブラウザ１１１を介して強調タグ設定モジュール１２２１を起動することにより、強調タグテーブル１２２５の内容をユーザが設定することができる。
【００２９】
重要語テーブル１２２７は、クライアント端末１１から入力された入力文書、または知識データベース１２２６上に既に蓄積されている文書情報から抽出された重要語を登録するためのものであり、入力文書、または蓄積文書から強調タグで囲まれた語句を抽出し形態素解析して単語に分解し、その単語を強調タグの種別と出現頻度によりスコア付けすることで、重要語テーブル１２２７が作成される。
【００３０】
これら強調タグテーブル１２２５と重要語テーブル１２２７は、クラスタリングモジュール１２２４が中間データベース１２２８から分析結果データベース（クラスタＤＢ）１２２９を作成するときに使用される。
【００３１】
制御モジュール１２１１は、Ｗｅｂサーバ１２１上で動作し、知識分析に関する全体の動作を制御するためのものであり、この知識分析システムの中核プログラムであるナレッジサーバ１２２およびＷｅｂサーバ１２１と、クライアント端末１１のＷｅｂブラウザ１１１との間の仲介機能を初め、各クライアント端末１１がナレッジサーバ１２２にログインする際のユーザ認証機能を持つ。このユーザ認証のために、制御モジュール１２１１は、ログイン管理情報１２１２を管理している。このログイン管理情報１２１２には、この知識分析システムに参加しているユーザそれぞれのユーザＩＤとパスワード等が格納されている。このユーザ認証により、各クライアント端末１１からの知識分析等の為になされるナレッジサーバ１２２に対するアクセスの許可・禁止の制御が行われる。
【００３２】
ナレッジサーバ１２２は、複数のクライアント端末１１からの要求に応じて知識データベース１２２６や分析結果データベース（クラスタＤＢ）１２２９の管理、運用を行うためのものであり、各クライアント端末１１から指定された分析条件で知識分析を行うことにより、知識データベース１２２６に登録されている蓄積文書の分析支援を行う。
【００３３】
次に、図３のフローチャートを参照して、クライアント端末１１から入力された入力文書情報から重要語を自動的に決定し、その重要語を用いて知識データベース１２２６の検索や知識分析を行う場合の処理の概要について説明する。
【００３４】
まず、対象となる入力文書情報（ＸＭＬ，ＨＴＭＬ，ＳＧＭＬ，ｅｔｃ）中に含まれる強調タグに基づいて、その強調タグで囲まれた語句が入力文書情報から抽出される（ステップＳ１０１）。次いで、抽出した各語句を形態素解析して単語に分割することにより、強調属性を持つ単語の抽出、および重要語テーブル１２２７への登録が行われる（ステップＳ１０２，Ｓ１０３）。そして、抽出された単語毎にその強調タグに対するポイントが強調タグテーブル１２２５から取得され、そのポイントが重要語テーブル１２２７上に、該当する単語の優先度（重要度）を示すスコアとして加算されて行く（ステップＳ１０４）。これにより、抽出された単語毎に、当該単語の出現頻度とその時の強調タグの種別に対応したポイントとの積によって表現されたスコアの値が重要語テーブル１２２７に登録されることになる。重要語テーブル１２２７に登録された単語群はスコアの高い順にソートされ、そしてスコアの高い上位の幾つかの単語が優先的に重要語として決定される。もちろん、重要語テーブル１２２７上に登録された全ての単語を重要語として決定してもよい。
【００３５】
入力文書情報のキー概念に合致する文書を知識データベース１２２６から検索する場合には、重要語として決定された単語それぞれから検索用キーワードが生成され、その検索用キーワードを用いて、全文検索などの検索処理が検索モジュール１２２３によって実行される（ステップＳ１０５）。
【００３６】
一方、知識データベース１２２６の蓄積文書をクラスタリングする場合には、重要語として決定された単語それぞれをクラスタリングの軸とする文書分析および分類が蓄積文書を対象に実行され、これにより知識データベース１２２６の蓄積文書がクラスタリングされる（ステップＳ１０６）。
【００３７】
なお、実際には、重要語を軸とする文書分析および分類処理は、中間データベース１２２８に登録されている絞り込み後の文書群を対象に行われ、そしてその分類結果がクラスタデータベース１２２９に登録されることになる。また、重要語として決定された単語そのものを軸とする分析を行っても良いが、重要語として決定された単語から派生する類義語、同義語等を用いて分析を行っても良い。分析処理では、重要語を切り口として分析対象となる各文書の特徴が分析され、そして同じ特徴を持つもの同士が、その特徴を的確に表す分類名が付された一つのクラスタにまとめられていく。もちろん、重要語そのものを分類名として使用することも出来る。
【００３８】
各文書の特徴分析の手法としては様々なものがあるが、例えば、上述した入力文書からの重要語決定処理と同様の手法で分析対象の各蓄積文書毎に重要語を決定し、そしてどのような重要語を含むかを分析対象の蓄積文書毎に調べることによってそれら文書それぞれの特徴を分析することが出来る。この場合、例えば、入力文書から決定した重要語と、分析対象の各文書から決定した重要語とを突きあわせることにより、入力文書から決定した重要語を軸としたクラスタリングを容易に行うことが可能となる。
【００３９】
さらに、入力文書情報から重要語を決定する代わりに、上述した入力文書からの重要語決定処理と同様の手法で、例えば、中間データベース１２２８に登録されている分析対象の文書群全体から重要語を決定し、それをクラスタリングの軸として用いて、分析対象の文書群全体をクラスタリングするという手法も好適である。この場合には、まず、分析対象の文書群全体を対象として、ステップＳ１０１からＳ１０３の単語抽出処理を行い、そしてステップＳ１０４では、各文書から抽出された各単語の強調属性の種別と当該単語の出現頻度とに基づいて当該単語毎にスコア付けを行い、そのスコアを分析対象の文書群全体にわたって累積する処理が実行される。累積されたスコアの値が高い上位所定数の単語が重要語として決定される。
【００４０】
そして、ステップＳ１０６では、重要語として決定された上位所定数の単語をクラスタリングの軸として、中間データベース１２２８に登録されている文書群の分析および分類が実行される。この場合においても、例えば、蓄積文書全体から決定した重要語と、分析対象の各蓄積文書から決定した重要語とを文書毎に突きあわせることにより、蓄積文書全体から決定した重要語を軸としたクラスタリングを容易に行うことが可能となる。
【００４１】
次に、図４および図５を参照して、クライアント端末１１に設けられた文字認識ツール１１３や文書編集ツール１１２で実行される文書形式変換処理について説明する。なお、以下では、強調タグが埋め込まれた文書情報としてＸＭＬを利用する場合を例示して説明することとする。
【００４２】
図４はＸＭＬ変換前の文書形式を示し、また、図５はＸＭＬ変換後の文書形式を示している。図４に示すように、変換前の文書には文字の強調が各所になされており、これらが語句の強調属性である。例えば、＜大きい文字＞、＜太い文字＞、＜斜体＞、＜下線＞、＜反転＞、＜引用符＞、＜背景色＞などがある。
【００４３】
図４の文書形式から図５の文書形式への変換は、クライアント端末１１の文字認識ツール１１３や文書編集ツール１１２が行い、Ｗｅｂブラウザ１１１を介してナレッジサーバ１２２の登録モジュール１２２２に渡され、知識データベース１２２６に登録される。文字認識ツール１１３を用いた場合には、紙の文書上に施されている語句の強調属性が強調タグに変換されてＸＭＬ形式の文書情報が生成されることになる。以下に変換の例を示す。
【００４４】
例えば、図４の文書上においては、語句「膨大な情報や知識を的確に分析」は太い文字で記述され、且つ各文字の背景に色が付けられている。この場合、太い文字、背景色という２種類の強調属性が付加されていることになるので、この語句を図５のＸＭＬ文書に変換すると、
＜背景色＞＜太字＞膨大な情報や知識を的確に分析＜／太字＞＜／背景色＞
となる。
【００４５】
また、図４の文書上の語句「迅速な意思決定」には下線が引かれており、この語句を図５のＸＭＬ文書に変換すると、
＜下線＞迅速な意思決定＜／下線＞
となる。
【００４６】
このように、１つの語句に複数の強調タグが付加されていると、それだけその語句が重要であることを意味することになる。上述の例では、前者の方が後者よりも重要度が高いと判定される。
【００４７】
図６は、強調タグテーブル１２２５の例を示している。強調タグテーブル１２２５は、強調タグを設定するための強調タグフィールド１２２５−１と、当該強調タグに対応するポイントを設定するためのポイントフィールド１２２５−２とから構成される。ユーザはクライアント端末１１のＷｅｂブラウザ１１１を介して強調タグ設定モジュール１２２１を起動し、図４の変換前の文書に出現する語句の強調属性に対応する強調タグとそれに対応するポイントとを予め強調タグテーブル１２２５に登録しておくことで、強調属性の種別毎にそれに対応する重要度（ポイント）を設定することが出来る。
【００４８】
図６の強調タグテーブル１２２５においては、例えば、＜大きい文字＞という強調タグに対して１０ポイントが与えられており、また＜斜体＞という強調タグに対しては９ポイント、＜太字＞という強調タグに対しては８ポイント、＜色文字＞という強調タグに対しては７ポイント、さらに＜背景色＞という強調タグに対しては６ポイントが与えられている。
【００４９】
上述の「＜背景色＞＜太字＞膨大な情報や知識を的確に分析＜／太字＞＜／背景色＞」という語句については、その語句内の各単語には、＜太字＞という強調タグに対する８ポイントと、＜背景色＞という強調タグに対する６ポイントとが加算され、計１４ポイントが与えられることになる。このように、ある単語が様々な強調属性を混合した強調属性で強調されている場合であっても、それをＸＭＬ文書に変換した場合には、当該混合強調属性は、種別の異なる複数の強調タグに変換されるので、各強調タグ毎にポイントを加算するだけで容易に当該単語の強調属性に対応するポイントを算出することが出来る。
【００５０】
なお、強調タグテーブル１２２５のデフォルトの設定内容を予めシステム内で決めておき、それを用いて重要度の計算を行っても良い。
【００５１】
図７乃至図９には、重要語テーブル１２２７が生成される様子が示されている。図７は１つの文書の解析中の状態を、図８は１つの文書の解析後の状態を、図９は全文書の解析後の状態をそれぞれ表したものである。
【００５２】
図７および図８は、図５のＸＭＬ文書を解析する場合に対応するものである。図７に示されているように、強調タグで囲まれた語句（例えば、「膨大な情報や知識を的確に分析」）を形態素解析することによって得られた単語、すなわち、「膨大」、「情報」、「知識」、「的確」、「分析」…が重要語テーブル１２２７の重要語フィールド１２２７−１に登録されると共に、対応するスコアフィールド１２２７−２にはそれら単語が出現するたびに該当するポイントの値が順次加算されていく。そして図５のＸＭＬ文書の解析の結果、図８の重要語テーブル１２２７が生成される。
【００５３】
図５のＸＭＬ文書から決定した重要語をクラスタリングの軸として蓄積文書のクラスタリングを行う場合には、図８の重要語テーブル１２２７に登録された単語の内、スコアの高いもの（「知識」「蓄積」「共有」「創造」…）から順次、クラスタリングで優先すべき重要語として決定されることになる。この場合、上位所定数の単語（重要語テーブル１２２７に登録された単語数が重要語として使用可能な上限以下であれば全ての単語）が、クラスタの軸として使用される。使用される重要語の数はシステムが規定値として持っていてもよいし、ユーザがクライアント端末１１のＷｅｂブラウザ１１１から与えても構わない。
【００５４】
図９は、中間データベース１２２８に蓄積された分析対象の文書群全てを解析することによって得られた重要語テーブル１２２７の例であり、ここではスコアが高い順にソートした結果を示している。中間データベース１２２８に蓄積された分析対象の文書群から重要語を決定する場合は、図９の重要語テーブル１２２７に登録された単語の内、上位所定数の単語（重要語テーブル１２２７に登録された単語数が重要語として使用可能な上限以下であれば全ての単語）が、クラスタの軸として使用される。
【００５５】
次に、図１０のフローチャートを参照して、本知識分析システムの処理の手順を説明する。ここでは、各クライアント端末１１から入力されるＸＭＬ形式の文書を知識データベース１２２６に順次登録していき、クライアント端末１１から知識データベース１２２６の分析要求を受けたときに、分析対象の全文書から重要語を自動抽出してクラスタリングを行う場合を想定する。
【００５６】
各クライアント端末１１のユーザは、Ｗｅｂブラウザ１１１からＷｅｂサーバ１２１の制御モジュール１２１１を起動し、ナレッジサーバ１２２にログインする（ステップＡ０１）。そしてＷｅｂブラウザ１１１からナレッジサーバ１２２の強調タグ設定モジュール１２２１を起動し、ＸＭＬ文書中に現れる文字の強調属性に対する強調タグとポイント（強さの度数）を強調タグテーブル１２２５上に設定する（ステップＡ０２）。
【００５７】
次に、クライアント端末１１に接続されているイメージスキャナ１１４を使用して帳票に記載された文書を読み取り、その読み取り文字を文字認識ツール１１３で認識してＸＭＬ文書に変換する時、あるいは文書編集ツール１１２でＸＭＬ文書を作成する時、語句に強調属性が付加されていたら、その語句を強調タグで囲んで、ＸＭＬ形式の強調語句に変換する（ステップＡ０３）。その処理が終了すると、Ｗｅｂブラウザ１１１からナレッジサーバ１２２の登録モジュール１２２２を起動し、ＸＭＬ文書を登録モジュール１２２２に渡す。登録モジュール１２２２は受け取ったＸＭＬ文書を知識データベース１２２６に登録する（ステップＡ０４）。
【００５８】
そして、Ｗｅｂブラウザ１１１からナレッジサーバ１２２の検索モジュール１２２３を起動し、知識データベース１２２６内に構築されているどの知識データベースを分析するかの選択を行う（ステップＡ０５）。検索モジュール１２２３は選択された知識データベースをユーザの与えた検索キーワードを用いて検索することにより、選択された知識データベースからユーザの与えた検索キーワードにより絞り込んだ知識だけからなる中間データベース１２２８を作成する（ステップＡ０６）。
【００５９】
検索モジュール１２２３は中間データベース１２２８の作成時にＸＭＬ文書それぞれをスキャンして、強調タグテーブル１２２５に登録されている強調タグで囲まれた語句を見つけて形態素解析し、単語と強調タグのポイントを重要語テーブル１２２７に登録する。そして、その単語が出現するたびに対応するポイントをその単語スコアに加算していき、更にスコアの大きい順にソートし、重要語テーブル１２２７を作成する（ステップＡ０７）。クラスタリングモジュール１２２４は中間データベース１２２８を入力してクラスタリングするときに、重要語がスコア順に並んだ重要語テーブル１２２７からシステムのデフォルト数分、あるいはユーザの指定数分の重要語をスコアの高いものから順に取り出してクラスタリングに使用して、中間データベース１２２８の文書群を分類および整理したクラスタデータベース１２２９を作成する（ステップＡ０８）。
【００６０】
図１１は、図１０のステップＡ０７で行われる処理を詳細に説明したフローチャートであり、重要語テーブル１２２７を作成する手順を示したものである。
【００６１】
検索モジュール１２２３は、中間データベース１２２８を作成するときに、まず知識データベース１２２６から取り出す各文書を解析して重要語テーブル１２２７を作成する。その処理を説明すると、まず、重要語テーブル１２２７をクリアし（ステップＢ０１）、知識データベース１２２６の最初のＸＭＬ文書にポジショニングする（ステップＢ０２）。そして、このＸＭＬ文書から１つ以上の強調タグで囲まれた語句を取り出し（ステップＢ０３）、その取り出した語句を形態素解析し、名詞を抽出し、重要語テーブル１２２７にエントリーする（ステップＢ０４）。次に、強調タグテーブル１２２５からこれらの強調タグのポイントを取り出して、重要語テーブル１２２７上の該当する語句のスコアフィールド１２２７−２に加算し（ステップＢ０５）、このＸＭＬ文書の解析が終了するまでステップＢ０３〜ステップＢ０５を繰り返す（ステップＢ０６）。
【００６２】
最初のＸＭＬ文書の解析が終了すると（ステップＢ０６のＹＥＳ）、次のＸＭＬ文書にポジショニングし（ステップＢ０７）、全ＸＭＬ文書の解析が終了するまでステップＢ０３〜ステップＢ０７を繰り返し（ステップＢ０８のＮＯ）、終了であれば（ステップＢ０８のＹＥＳ）、重要語テーブル１２２７の重要語フィールド１２２７−１をスコア１２２７−２の高い順にソートし結果を反映する（ステップＢ０９）。
【００６３】
図１２は、図１０のステップＡ０８の処理を詳細に説明したフローチャートであり、クラスタリングモジュール１２２４がクラスタデータベース１２２９を作成する処理のフローチャートである。
【００６４】
ユーザがクラスタリングの権限制御を受けるために、クライアント端末１１のＷｅｂブラウザ１１１よりサーバコンピュータ１２の制御部１２１１へログイン要求する（ステップＣ０１）。すると、サーバコンピュータ１２の制御部１２１１は、ユーザから入力されたユーザＩＤおよびパスワードが登録されているか否かを調べるためにログイン管理情報１２１２にアクセスし（ステップＣ０２）、このログインを許可するかどうかを判定するためのユーザ認証を行う（ステップＣ０３）。ユーザＩＤおよびパスワードがログイン管理情報１２１２に登録されておらず、ログインが失敗した場合（ステップＣ０４のＮＯ）、制御部１２１１は、ログイン失敗をＷｅｂサーバ１２１を通じてクライアント端末１１のＷｅｂブラウザ１１１に返して、この処理を終了する（ステップＣ０５）。
【００６５】
一方、ユーザＩＤおよびパスワードがログイン管理情報１２１２に登録されており、ログインが成功した場合には（ステップＣ０４のＹＥＳ）、分析対象の知識データベースを選択し（ステップＣ０６）、重要語テーブル１２２７から重要語を取り出すと（ステップＣ０７）、取り出した重要語をもとに、クラスタリングモジュール１２２４はクラスタリングを実行し、クラスタデータベース１２２９を作成する（ステップＣ０８）。
【００６６】
作成されたクラスタデータベース１２２９においては、重要語をもとに決定された分類体系で各文書がその特徴に応じて分類および整理されており、分類対象の特徴に応じた分類項目別に、そこに何件の文書が知識として登録されているかなどを分析結果としてユーザに提供することができる。またユーザは分析結果を基に特定の分類項目を指定し、その分類項目を対象としたさらなる情報検索なども行うことが出来る。
【００６７】
このように、この知識分析システムにおいては、強調属性を持つ単語はその文書中のキー概念となる語であるということに着目して、ＸＭＬ文書から強調タグをもとに重要語を決定し、この重要語に基づいたクラスタリング処理を可能としたものである。
【００６８】
なお、本実施形態の知識分析システムの機能は全てコンピュータプログラムにより実現されているので、そのコンピュータプログラムをコンピュータ読み取り可能な記憶媒体に記憶しておき、その記憶媒体を通じてコンピュータプログラムを、コンピュータネットワーク接続可能な通常のコンピュータに導入して実行させるだけで、本実施形態と同様の効果を得ることができる。
【００６９】
また本発明は、上記実施形態に限定されるものではなく、実施段階ではその要旨を逸脱しない範囲で種々に変形することが可能である。更に、上記実施形態には種々の段階の発明が含まれており、開示される複数の構成要件における適宜な組み合わせにより種々の発明が抽出され得る。例えば、実施形態に示される全構成要件から幾つかの構成要件が削除されても、発明が解決しようとする課題の欄で述べた課題が解決でき、発明の効果の欄で述べられている効果が得られる場合には、この構成要件が削除された構成が発明として抽出され得る。
【００７０】
【発明の効果】
以上説明したように、本発明によれば、自動決定した重要語によって情報検索やクラスタリングを行うことにより正確な知識分析を可能とする。
【図面の簡単な説明】
【図１】本発明の一実施形態に係る知識分析システムのシステム構成を示すブロック図。
【図２】同実施形態の知識分析システムで用いられる分析条件指定画面の一例を示す図。
【図３】同実施形態の知識分析システムにおける知識分析処理の手順の概要を示すフローチャート。
【図４】同実施形態の知識分析システムにおけるＸＭＬ変換前の文書形式を説明するための図。
【図５】同実施形態の知識分析システムにおけるＸＭＬ変換後の文書形式を説明するための図。
【図６】同実施形態の知識分析システムで使用される強調タグテーブルの形式を説明するための図。
【図７】同実施形態の知識分析システムにおける１文書解析中における重要語テーブルの状態を示す図。
【図８】同実施形態の知識分析システムにおける１文書解析後における重要語テーブルの状態を示す図。
【図９】同実施形態の知識分析システムにおける全文書解析後における重要語テーブルの状態を示す図。
【図１０】同実施形態の知識分析システムによる知識分析処理の手順を示すフローチャート。
【図１１】同実施形態の知識分析システムによる重要語決定処理の手順を示すフローチャート。
【図１２】同実施形態の知識分析システムによるクラスタリング処理の手順を示すフローチャート。
【符号の説明】
１１…クライアント端末
１２…サーバコンピュータ
１１１…Ｗｅｂブラウザ
１１２…文書編集ツール
１１３…文字認識ツール
１１４…イメージスキャナ
１２１…Ｗｅｂサーバ
１２２…ナレッジサーバ
１２２１…強調タグ設定モジュール
１２２２…登録モジュール
１２２３…検索モジュール
１２２４…クラスタリングモジュール
１２２５…強調タグテーブル
１２２６…知識データベース
１２２７…重要語テーブル
１２２８…中間データベース
１２２９…クラスタデータベース

[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a system and method for knowledge analysis used in a knowledge management system, and more particularly to a knowledge analysis system and a knowledge analysis method for supporting knowledge analysis by searching and clustering stored documents using important words.
[0002]
[Prior art]
In recent years, introduction of groupware for sharing information among a plurality of users has been promoted mainly by companies. As typical groupware, an e-mail system, a workflow system, and the like are known, but recently, a knowledge management system for supporting sharing of knowledge and information has begun to be developed.
[0003]
This knowledge management system accumulates and manages personal know-how and the like as a knowledge database, in addition to Web information and electronic file information. Combining it with a search function such as natural language search makes efficient use of knowledge and information possible.
[0004]
In such a knowledge management system, how to collect and accumulate knowledge such as individual know-how is an important point. This is because knowledge such as personal know-how is so-called tacit knowledge and is not formalized like Web information or electronic file information, so it is difficult to collect and accumulate it automatically.
[0005]
Therefore, there has recently been a demand for the development of a knowledge management system having a knowledge accumulation support function. By realizing a mechanism for automatically collecting and accumulating knowledge such as personal know-how, tacit knowledge can be utilized in the same way as formalized explicit knowledge such as Web information and electronic file information.
[0006]
In addition, a knowledge management system that easily searches for knowledge and information accumulated in this way is being developed in parallel. A typical example is a natural language search system having a knowledge search support function for searching for useful knowledge and information by inputting a natural language question sentence.
[0007]
[Problems to be solved by the invention]
In this kind of knowledge management system, there is a strong demand for systematizing knowledge so that knowledge data can be used effectively: for example, so that knowledge and information can be easily organized or browsed, and so that even first-time users are shown clearly what kind of knowledge can be searched. Accordingly, accumulated knowledge is structured (classified) by category.
[0008]
The process of analyzing knowledge and structuring (classifying) it is called clustering. Conventionally, the words that serve as the axes of clustering must be specified directly by the user as important words, so listing the important words takes time and effort. In addition, because knowledge analysis and classification for clustering are performed based on the words the user specified as important words, an accurate classification result is not necessarily obtained, depending on the words specified. This situation arises not only when clustering is performed but also when the user is made to specify search keywords for information retrieval.
[0009]
The present invention has been made in consideration of such circumstances. Its object is to provide a knowledge analysis system and a knowledge analysis method that automatically extract and determine important words from documents and use them to execute search and clustering, so that accurate knowledge analysis by searching or clustering stored documents can be executed automatically without making the user specify important words.
[0010]
[Means for Solving the Problems]
In order to solve the above problems, the present invention provides a knowledge analysis system that is configured to be connectable to a plurality of client terminals via a network and that supports classification and analysis of a plurality of pieces of document information stored in a server in response to a request from each client terminal, the system comprising: word extraction means for extracting, from each of the plurality of pieces of stored document information, words having an emphasis attribute based on emphasis tags embedded in each piece of document information; important word determination means for determining an importance for each word based on the type of emphasis attribute of each word extracted by the word extraction means and the appearance frequency of the word, accumulating the importance determination results over the plurality of pieces of document information, and determining a predetermined number of top words having the highest accumulated importance values as important words of the entire document information; and
knowledge analysis means for creating a cluster database in which the plurality of pieces of document information are classified and organized, by clustering the plurality of pieces of document information through matching of the important words of the entire document information determined by the important word determination means against the important words determined from each piece of document information.
[0014]
In this knowledge analysis system, the word extraction process is executed for the whole of the plurality of pieces of document information already accumulated, together with a process that determines an importance for each word based on the type of emphasis attribute of each word extracted from each document and the appearance frequency of the word, and accumulates the importance determination results over the plurality of pieces of document information. A predetermined number of top words having the highest accumulated importance values are determined as important words of the entire document information. The plurality of pieces of document information are then clustered by matching the important words of the entire document information against the important words determined from each piece of document information. As a result, the time required for analyzing the stored documents can be shortened, and clustering can be performed using important words determined in consideration of the characteristics of the stored documents as a whole.
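The accumulation step described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `corpus_important_words`, the use of a dictionary of per-document scores, and the sample words are assumptions, not part of the patent text.

```python
from collections import Counter

def corpus_important_words(per_document_scores, top_n):
    """Accumulate per-document word scores and return the top_n words.

    per_document_scores: iterable of dicts mapping word -> score
    (hypothetical representation of each document's importance results).
    """
    total = Counter()
    for scores in per_document_scores:
        total.update(scores)  # Counter.update adds the scores word by word
    return [word for word, _ in total.most_common(top_n)]

# Illustrative data only; the scores are made up for the example.
per_doc = [
    {"knowledge": 14, "sharing": 8},
    {"knowledge": 6, "creation": 10},
    {"sharing": 5},
]
top_words = corpus_important_words(per_doc, top_n=2)
```

Here `top_n` plays the role of the "predetermined number" of top words in the text.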
[0015]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0016]
FIG. 1 shows the configuration of a knowledge analysis system according to an embodiment of the present invention. This knowledge analysis system is realized by a server computer 12 accessible from a plurality of client terminals 11 via a computer network 13 such as a LAN. Although not shown, the server computer 12 and the client terminals 11 are each provided with a CPU, a main memory, a magnetic disk device as a storage device, and an input/output device having an input unit such as a keyboard and a mouse and a display unit such as a display.
[0017]
The knowledge analysis system realized on the server computer 12 provides each client terminal 11 with a document information storage function, a search function for retrieving a target document from the stored document information, a clustering function for classifying and organizing the stored documents, and the like. By classifying and organizing the stored documents with the clustering function and providing the result from the server computer 12 to the client terminal 11 as an analysis result of the stored documents, trends that can be read from a miscellaneous group of documents collected for a certain purpose can be provided to the user as knowledge. Here, an overview of the clustering function will be described.
[0018]
The clustering function analyzes the features of each of a large amount of document information collected in the server computer 12, sorts the documents into groups with similar contents, and creates a cluster database in which the large amount of document information is classified and organized by feature. When clustering is executed, an analysis condition designation screen as shown in FIG. 2 is provided to the client terminal 11, and the user specifies various analysis conditions on that screen. FIG. 2 shows an example in which a database named “10,000 db” and a database named “XX newspaper article” are selected as analysis target databases, and the screen has the following items.
[0019]
-Analysis result name: This is a field for inputting the name under which the analysis result is saved.
-Analysis target period: This is a field for designating the period within which documents must have been created or saved in order to be analyzed. If there is no input, all documents stored in the analysis target database are analyzed.
[0020]
-Refinement keyword: This is a field for inputting a search keyword for narrowing down, by search, the documents to be analyzed from the stored documents. If there is no input, all documents stored in the analysis target database are analyzed.
-Refinement count: This field specifies the number of documents to which the analysis targets are narrowed down.
[0021]
-Number of hierarchies: This field specifies the number of hierarchies of the classification system used in clustering. Select either “1 layer” or “Multi layer”. The default is “multi-hierarchy”, and the documents are classified in a hierarchical structure by category.
-Duplicate knowledge: a field for designating whether one document is permitted to be registered in a plurality of classifications (clusters). Select either “Yes” or “No”.
[0022]
-Maximum number of top-level clusters: This field specifies the maximum number of clusters created at the top level of the hierarchy. If there is no input, clustering is performed with no limit specified.
[0023]
-Important word: This is a field for inputting, as important words, the words or phrases that serve as the axes of clustering.
Normally, the user must directly specify the important words used in clustering. The present embodiment, however, has a function for automatically generating important words, so stored documents can be analyzed and classified using the automatically generated important words without making the user specify them. The viewpoint from which the contents of the stored documents are analyzed, and the classification system into which they are classified, are determined by the important words that generate the axes of clustering. In that sense, the important words are the angle from which the stored information is analyzed and classified.
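As a rough illustration, the items of the analysis condition designation screen could be held in a single record. The sketch below is an assumption for explanatory purposes: the class name `AnalysisConditions`, the field names, and the Python types are hypothetical, and only the defaults stated in the text (multi-level hierarchy, empty fields meaning "no restriction") are taken from the description.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Tuple

@dataclass
class AnalysisConditions:
    """Hypothetical record of the items on the analysis condition screen."""
    result_name: str                                   # analysis result name
    allow_overlap: bool                                # "duplicate knowledge": Yes / No
    period: Optional[Tuple[date, date]] = None         # None: all stored documents
    refine_keywords: List[str] = field(default_factory=list)  # empty: no narrowing
    refine_count: Optional[int] = None                 # number of documents to narrow to
    multi_level: bool = True                           # default is multi-level hierarchy
    max_top_clusters: Optional[int] = None             # None: no limit specified
    important_words: List[str] = field(default_factory=list)

    def uses_auto_important_words(self) -> bool:
        # With no user-specified important words, the system would fall
        # back to automatically generated ones.
        return not self.important_words

conditions = AnalysisConditions(result_name="trial-1", allow_overlap=False)
```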
[0024]
Next, returning to FIG. 1, each element constituting the knowledge analysis system will be described.
[0025]
In the client terminal 11, a Web browser 111 is operating. By specifying a URL (Uniform Resource Locator) indicating a resource for knowledge analysis constructed on the server computer 12 from the Web browser 111, the knowledge analysis processing by the server computer 12 can be used from each client terminal 11.
[0026]
Further, an image scanner 114 is connected to the client terminal 11, and the user uses a character recognition tool 113 to capture a paper document. Further, a document editing tool 112 is operating, and the user creates an electronic document using this. The captured image document data and the created electronic document data are described in a markup language such as SGML (Standard Generalized Markup Language), HTML (Hyper Text Markup Language), or XML (eXtensible Markup Language). After being automatically converted into a document format, it is sent to the server computer 12 via the network 13 as a registered document or as an input document for determining important words. In document information in a document format described in a markup language, the type of emphasis attribute of a word or phrase in the document is expressed by an emphasis tag. Accordingly, document information in which an emphasis tag corresponding to the emphasis attribute of each word is embedded is sent from the client terminal 11 to the server computer 12. The server computer 12 performs processing for determining an important word based on the emphasis tag in the document information.
[0027]
The knowledge analysis function of the server computer 12 is realized mainly by software such as the Web server 121, the control module 1211, the knowledge server 122, the emphasis tag setting module 1221, the registration module 1222, the search module 1223, and the clustering module 1224, together with the management information and actual data that this software uses for knowledge analysis. The management information includes login management information 1212 for performing user authentication for each client terminal 11. The actual data consists of a knowledge database 1226, an intermediate processing database 1228, an analysis result database (cluster DB) 1229, an emphasis tag table 1225, and an important word table 1227.
[0028]
The emphasis tag table 1225 defines the relationship between the emphasis tag corresponding to each type of emphasis attribute and the weighting value (points) for that emphasis tag. The user can set the contents of the emphasis tag table 1225 by starting the emphasis tag setting module 1221 via the Web browser 111 of the client terminal 11.
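A minimal sketch of such a table is a mapping from tag name to points. The point values below follow the example given for FIG. 6 in the original Japanese description (large characters 10, italic 9, bold 8, colored characters 7, background color 6); the English tag names and the helper functions `set_points` and `points_for` are assumptions introduced for illustration.

```python
# Hypothetical in-memory form of the emphasis tag table 1225.
# Tag names are English stand-ins for the tags described for FIG. 6.
DEFAULT_EMPHASIS_POINTS = {
    "large": 10,       # large characters
    "italic": 9,
    "bold": 8,
    "color": 7,        # colored characters
    "background": 6,   # background color
}

def set_points(table, tag, points):
    """Return a copy of the table with one tag's points set by the user."""
    updated = dict(table)
    updated[tag] = points
    return updated

def points_for(table, tags):
    """Sum the points of every emphasis tag applied to a phrase."""
    return sum(table.get(tag, 0) for tag in tags)

# A phrase that is both bold and on a colored background earns 8 + 6 points.
combined = points_for(DEFAULT_EMPHASIS_POINTS, ["bold", "background"])
```

Summing per tag is what lets a phrase with mixed emphasis be scored without a separate entry for every combination.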
[0029]
The important word table 1227 registers important words extracted from an input document input from the client terminal 11 or from the document information already stored in the knowledge database 1226. The important word table 1227 is created by extracting the phrases enclosed by emphasis tags from the input document or the stored documents, breaking them into words by morphological analysis, and scoring each word according to the type of emphasis tag and the appearance frequency.
[0030]
The emphasis tag table 1225 and the important word table 1227 are used when the clustering module 1224 creates an analysis result database (cluster DB) 1229 from the intermediate database 1228.
[0031]
The control module 1211 operates on the Web server 121 and controls the overall operation related to knowledge analysis. In addition to mediating between the Web browser 111 of the client terminal 11 and the knowledge server 122, which is the core program of this knowledge analysis system, and the Web server 121, it has a user authentication function used when each client terminal 11 logs in to the knowledge server 122. For this user authentication, the control module 1211 manages login management information 1212. The login management information 1212 stores the user ID, password, and the like of each user participating in the knowledge analysis system. This user authentication controls whether access to the knowledge server 122 for knowledge analysis and the like from each client terminal 11 is permitted or prohibited.
[0032]
The knowledge server 122 manages and operates the knowledge database 1226 and the analysis result database (cluster DB) 1229 in response to requests from the plurality of client terminals 11. By performing knowledge analysis under the analysis conditions designated by each client terminal 11, it supports analysis of the stored documents registered in the knowledge database 1226.
[0033]
Next, with reference to the flowchart of FIG. 3, an overview will be given of the processing in which important words are automatically determined from the input document information input from the client terminal 11 and are then used to search the knowledge database 1226 or to perform knowledge analysis.
[0034]
First, based on the emphasis tag included in the target input document information (XML, HTML, SGML, etc), a phrase surrounded by the emphasis tag is extracted from the input document information (step S101). Next, each extracted phrase is subjected to morphological analysis and divided into words, whereby a word having an emphasis attribute is extracted and registered in the important word table 1227 (steps S102 and S103). Then, for each extracted word, a point for the emphasis tag is acquired from the emphasis tag table 1225, and the point is added to the important word table 1227 as a score indicating the priority (importance) of the corresponding word. (Step S104). Thus, for each extracted word, the score value expressed by the product of the appearance frequency of the word and the point corresponding to the type of the emphasis tag at that time is registered in the important word table 1227. The word groups registered in the important word table 1227 are sorted in descending order of score, and several words having higher scores are preferentially determined as important words. Of course, all the words registered on the important word table 1227 may be determined as important words.
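Steps S101 to S104 can be sketched as follows. This is a simplified illustration under stated assumptions: the tag names and the `underline` point value are hypothetical, and whitespace splitting in `tokenize` stands in for real morphological analysis, which the patent text does not specify in detail.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical tag-to-points table; the "underline" value is assumed.
POINTS = {"bold": 8, "background": 6, "underline": 5}

def tokenize(text):
    # Stand-in for morphological analysis: split on whitespace.
    return text.split()

def score_document(xml_text, points=POINTS):
    """Return a Counter mapping each emphasized word to its score."""
    scores = Counter()

    def add_words(text, active):
        # Words are scored only while at least one emphasis tag is open.
        if active and text:
            for word in tokenize(text):
                scores[word] += active

    def walk(elem, active):
        # 'active' is the sum of points of all enclosing emphasis tags.
        if elem.tag in points:
            active += points[elem.tag]
        add_words(elem.text, active)
        for child in elem:
            walk(child, active)
            add_words(child.tail, active)  # tail text sits in elem's context

    walk(ET.fromstring(xml_text), 0)
    return scores

doc = ("<doc>plain <background><bold>knowledge analysis</bold></background>"
       " and <underline>knowledge</underline></doc>")
ranked = score_document(doc).most_common()
```

Each occurrence of a word adds the points of the tags around it, so the final score reflects both the emphasis type and the appearance frequency. In this example "knowledge" collects 14 points from the bold, background-colored phrase and 5 more from the underlined occurrence.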
[0035]
When documents matching the key concepts of the input document information are to be retrieved from the knowledge database 1226, a search keyword is generated from each word determined as an important word, and the search module 1223 executes search processing, such as a full-text search, using those search keywords (step S105).
[0036]
On the other hand, when the stored documents in the knowledge database 1226 are to be clustered, the stored documents are analyzed and classified, and thereby clustered, with each word determined as an important word serving as a clustering axis (step S106).
[0037]
In practice, the document analysis and classification processing using the important words as axes is performed on the narrowed-down document group registered in the intermediate database 1228, and the classification result is registered in the cluster database 1229. The analysis may use the words determined as important words themselves as axes, or it may use synonyms and near-synonyms derived from those words. In the analysis process, the features of each document to be analyzed are examined starting from the important words, and documents with the same features are collected into one cluster under a classification name that accurately represents those features. The important words themselves can also be used as classification names.
[0038]
There are various methods for analyzing the features of each document. For example, important words can be determined for each stored document to be analyzed by the same method as the important word determination process for the input document described above, and the features of each document can then be analyzed by examining which important words it contains. In this case, clustering around the important words determined from the input document can be performed easily, for example by matching those important words against the important words determined from each document to be analyzed.
[0039]
Further, instead of determining the important words from the input document information, it is also suitable to determine them from the entire analysis target document group registered in the intermediate database 1228, by the same method as the important word determination process for the input document described above, and to cluster the entire document group using the determined important words as clustering axes. In this case, the word extraction processing in steps S101 to S103 is first performed on the entire document group to be analyzed. In step S104, each word is scored based on the type of emphasis attribute of the word extracted from each document and on the appearance frequency of the word, and the scores are accumulated over the entire document group to be analyzed. A predetermined number of words with the highest accumulated scores are determined as important words.
[0040]
In step S106, the document group registered in the intermediate database 1228 is analyzed and classified using the predetermined number of top-ranked words determined as important words as clustering axes. In this case too, clustering around the important words determined from the entire stored document group can be performed easily, for example by matching those important words against the important words determined from each stored document to be analyzed.
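The matching described for step S106 can be sketched in Python as follows. This is only one possible reading: the function name `cluster_by_important_words`, the rule that a document may fall into several clusters, and the "(unclassified)" bucket are all assumptions that the text does not specify.

```python
from collections import defaultdict


def cluster_by_important_words(axis_words, doc_keywords):
    """Group documents under each axis word found among their own keywords.

    axis_words   -- important words determined from the whole document group
    doc_keywords -- {document id: set of important words of that document}
    """
    clusters = defaultdict(list)
    for doc_id, keywords in doc_keywords.items():
        matched = [word for word in axis_words if word in keywords]
        for word in matched:
            clusters[word].append(doc_id)
        if not matched:
            # Assumption: documents matching no axis word go to a catch-all.
            clusters["(unclassified)"].append(doc_id)
    return dict(clusters)


axis_words = ["knowledge", "analysis"]
doc_keywords = {
    "doc1": {"knowledge", "sharing"},
    "doc2": {"analysis", "knowledge"},
    "doc3": {"schedule"},
}
clusters = cluster_by_important_words(axis_words, doc_keywords)
print(clusters)
```

Here each axis word doubles as the classification name of its cluster, which is one of the options the description mentions.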
[0041]
Next, the document format conversion process executed by the character recognition tool 113 and the document editing tool 112 provided in the client terminal 11 will be described with reference to FIGS. 4 and 5. In the following, a case where XML is used as the document information in which emphasis tags are embedded will be described as an example.
[0042]
FIG. 4 shows the document format before XML conversion, and FIG. 5 shows the document format after XML conversion. As shown in FIG. 4, characters are emphasized at various places in the document before conversion; these emphases are the emphasis attributes of words and phrases. Examples include <large character>, <bold character>, <italic>, <underline>, <reverse>, <quotation mark>, and <background color>.
[0043]
The conversion from the document format of FIG. 4 to the document format of FIG. 5 is performed by the character recognition tool 113 or the document editing tool 112 of the client terminal 11. The converted document is passed to the registration module 1222 of the knowledge server 122 via the Web browser 111 and registered in the knowledge database 1226. When the character recognition tool 113 is used, the emphasis attributes applied to phrases in a paper document are converted into emphasis tags, and XML-format document information is generated. An example of the conversion is shown below.
[0044]
For example, in the document of FIG. 4, the phrase “accurately analyze a huge amount of information and knowledge” is written in bold characters, and the background of each character is colored. Since two kinds of emphasis attributes, bold and background color, are applied, converting this phrase into the XML document of FIG. 5 yields:
<background color><bold>accurately analyze a huge amount of information and knowledge</bold></background color>
[0045]
Also, the phrase “rapid decision making” in the document of FIG. 4 is underlined, so converting this phrase into the XML document of FIG. 5 yields:
<underline>rapid decision making</underline>
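A minimal Python sketch of this kind of conversion is given below, assuming the recognition or editing tool has already split the document into text runs, each with a list of emphasis attributes. The function name `to_emphasis_xml`, the `<document>` root element, and the underscore tag names are hypothetical; underscores are used because XML element names cannot contain spaces.

```python
from xml.sax.saxutils import escape

# Hypothetical mapping from a detected emphasis attribute to an XML tag name.
ATTRIBUTE_TO_TAG = {
    "bold": "bold",
    "background color": "background_color",
    "underline": "underline",
}


def to_emphasis_xml(runs):
    """Convert (text, [attributes]) runs into XML with nested emphasis tags."""
    parts = []
    for text, attributes in runs:
        tags = [ATTRIBUTE_TO_TAG[name] for name in attributes]
        opening = "".join(f"<{tag}>" for tag in tags)
        # Close in reverse order so the nesting stays well-formed.
        closing = "".join(f"</{tag}>" for tag in reversed(tags))
        parts.append(f"{opening}{escape(text)}{closing}")
    return "<document>" + "".join(parts) + "</document>"


runs = [
    ("accurately analyze a huge amount of information and knowledge",
     ["background color", "bold"]),
    (" enables ", []),
    ("rapid decision making", ["underline"]),
]
print(to_emphasis_xml(runs))
```

Each emphasis attribute becomes its own tag, so a phrase with two attributes ends up wrapped in two nested tags.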
[0046]
Thus, the more emphasis tags are attached to a phrase, the more important the phrase is considered to be. In the above example, the former phrase is determined to be more important than the latter.
[0047]
FIG. 6 shows an example of the emphasis tag table 1225. The emphasis tag table 1225 includes an emphasis tag field 1225-1 for setting an emphasis tag and a point field 1225-2 for setting the points corresponding to that emphasis tag. The user activates the emphasis tag setting module 1221 via the Web browser 111 of the client terminal 11 and registers in the emphasis tag table 1225 the emphasis tags corresponding to the emphasis attributes of the words and phrases that appear in the pre-conversion document of FIG. 4. In this way, the importance (points) corresponding to each type of emphasis attribute can be set.
[0048]
In the emphasis tag table 1225 of FIG. 6, for example, 10 points are assigned to the emphasis tag <large character>, 9 points to <italic>, 8 points to <bold>, 7 points to <color character>, and 6 points to <background color>.
[0049]
For the phrase “<background color><bold>accurately analyze a huge amount of information and knowledge</bold></background color>” described above, each word in the phrase receives the 8 points for the emphasis tag <bold> plus the 6 points for the emphasis tag <background color>, for a total of 14 points. Thus, even when a phrase is emphasized with a combination of several emphasis attributes, the combination is converted into a plurality of separate emphasis tags when the phrase is converted into an XML document, so the points for the phrase can be calculated easily, simply by adding up the points for each emphasis tag.
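The addition of points across nested emphasis tags can be sketched in Python with the standard `xml.etree.ElementTree` parser. The point values follow the example table in the text. The underscore tag names, the `<document>` root, and the function name `emphasized_phrases` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Points from the example table in the text. The underscore tag names are an
# assumption, since XML element names cannot contain spaces.
EMPHASIS_POINTS = {
    "large_character": 10,
    "italic": 9,
    "bold": 8,
    "color_character": 7,
    "background_color": 6,
}


def emphasized_phrases(element, inherited=0):
    """Yield (phrase, points) for text enclosed by registered emphasis tags.

    `inherited` carries the points of the enclosing emphasis tags, so nested
    tags simply add up. Tags missing from the table contribute no points.
    """
    points = inherited + EMPHASIS_POINTS.get(element.tag, 0)
    if element.text and element.text.strip() and points > 0:
        yield element.text.strip(), points
    for child in element:
        yield from emphasized_phrases(child, points)
        # Text after a child belongs to the current element, not the child.
        if child.tail and child.tail.strip() and points > 0:
            yield child.tail.strip(), points


doc = ET.fromstring(
    "<document>"
    "<background_color><bold>accurately analyze a huge amount of "
    "information and knowledge</bold></background_color>"
    " is the goal."
    "</document>"
)
phrases = list(emphasized_phrases(doc))
print(phrases)
```

For the sample document, the phrase inside both tags is reported with 8 + 6 = 14 points, while the unemphasized trailing text is skipped.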
[0050]
The default setting contents of the emphasis tag table 1225 may be determined in advance in the system, and the importance may be calculated using this.
[0051]
FIGS. 7 to 9 show how the important word table 1227 is generated. FIG. 7 shows the state during analysis of one document, FIG. 8 shows the state after analysis of one document, and FIG. 9 shows the state after analysis of all documents.
[0052]
FIGS. 7 and 8 correspond to the case of analyzing the XML document of FIG. 5. As shown in FIG. 7, the words obtained by morphological analysis of a phrase enclosed by emphasis tags (for example, “accurately analyze a huge amount of information and knowledge”), namely “huge”, “information”, “knowledge”, “accurate”, “analysis”, and so on, are registered in the important word field 1227-1 of the important word table 1227, and each time a word appears, the corresponding points are added to its score field 1227-2. Analyzing the XML document of FIG. 5 in this way produces the important word table 1227 of FIG. 8.
[0053]
When the stored documents are clustered using the important words determined from the XML document of FIG. 5 as clustering axes, the words registered in the important word table 1227 of FIG. 8 (“share”, “creation”, etc.) are taken in order as the important words to be prioritized in clustering. In this case, a predetermined number of top-ranked words (or all words, if the number of words registered in the important word table 1227 does not exceed the upper limit on the number of important words) are used as cluster axes. The number of important words to use may be supplied by the system as a default value, or the user may specify it from the Web browser 111 of the client terminal 11.
[0054]
FIG. 9 shows an example of the important word table 1227 obtained by analyzing the entire document group to be analyzed stored in the intermediate database 1228, sorted in descending order of score. When important words are determined from the analysis target document group stored in the intermediate database 1228, a predetermined number of top-ranked words among those registered in the important word table 1227 of FIG. 9 (or all words, if the number of registered words does not exceed the upper limit on the number of important words) are used as cluster axes.
[0055]
Next, the processing procedure of the knowledge analysis system will be described with reference to the flowchart of FIG. 10. Here, it is assumed that XML documents input from each client terminal 11 are sequentially registered in the knowledge database 1226 and that, when an analysis request for the knowledge database 1226 is received from a client terminal 11, important words are automatically extracted from all the documents to be analyzed and clustering is performed.
[0056]
The user of each client terminal 11 activates the control module 1211 of the Web server 121 from the Web browser 111 and logs in to the knowledge server 122 (step A01). The user then starts the emphasis tag setting module 1221 of the knowledge server 122 from the Web browser 111 and sets, in the emphasis tag table 1225, the emphasis tags and points (degrees of emphasis) for the emphasis attributes of characters appearing in the XML documents (step A02).
[0057]
Next, when a document written on a paper form is read using the image scanner 114 connected to the client terminal 11 and the read characters are recognized by the character recognition tool 113 and converted into an XML document, or when an XML document is created with the document editing tool 112, any phrase to which an emphasis attribute has been applied is enclosed in emphasis tags and thus converted into XML-format emphasis (step A03). When this processing ends, the registration module 1222 of the knowledge server 122 is activated from the Web browser 111 and the XML document is passed to it. The registration module 1222 registers the received XML document in the knowledge database 1226 (step A04).
[0058]
Then, the search module 1223 of the knowledge server 122 is activated from the Web browser 111, and the knowledge database to be analyzed is selected from within the knowledge database 1226 (step A05). The search module 1223 searches the selected knowledge database using a search keyword given by the user, thereby creating an intermediate database 1228 consisting only of the knowledge narrowed down from the selected knowledge database by that search keyword (step A06).
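The narrowing in step A06 might be sketched in Python as below, with a case-insensitive substring match standing in for whatever search the search module actually performs. The function name `build_intermediate_database` and the dictionary representation of the databases are assumptions for illustration.

```python
def build_intermediate_database(knowledge_db, search_keyword):
    """Return only the documents whose text contains the search keyword.

    knowledge_db -- {document id: document text}; a stand-in for the selected
                    knowledge database.
    """
    needle = search_keyword.casefold()
    return {
        doc_id: text
        for doc_id, text in knowledge_db.items()
        if needle in text.casefold()
    }


knowledge_db = {
    "doc1": "Knowledge sharing across teams",
    "doc2": "Quarterly sales report",
    "doc3": "Capturing tacit knowledge",
}
intermediate_db = build_intermediate_database(knowledge_db, "knowledge")
print(sorted(intermediate_db))
```

Later steps then operate only on `intermediate_db`, which keeps important word extraction and clustering focused on the documents the user asked about.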
[0059]
When creating the intermediate database 1228, the search module 1223 scans each XML document, finds the phrases enclosed by emphasis tags registered in the emphasis tag table 1225, performs morphological analysis on them, and registers each resulting word, together with the points of its emphasis tags, in the important word table 1227. Each time a word appears, the corresponding points are added to that word's score, and the table is then sorted in descending order of score to complete the important word table 1227 (step A07). When the clustering module 1224 takes the intermediate database 1228 as input and performs clustering, it takes important words from the top of the score-ordered important word table 1227, up to the system default number or the user-specified number, and uses them for clustering. The cluster database 1229 is thereby created, in which the document group of the intermediate database 1228 is classified and organized (step A08).
[0060]
FIG. 11 is a flowchart illustrating in detail the processing performed in step A07 of FIG. 10, and shows the procedure for creating the important word table 1227.
[0061]
When creating the intermediate database 1228, the search module 1223 first analyzes each document extracted from the knowledge database 1226 and creates the important word table 1227. This process is described below. First, the important word table 1227 is cleared (step B01), and processing is positioned at the first XML document in the knowledge database 1226 (step B02). Then, a phrase enclosed by one or more emphasis tags is extracted from the XML document (step B03), the extracted phrase is subjected to morphological analysis, and the nouns obtained are entered into the important word table 1227 (step B04). Next, the points of those emphasis tags are taken from the emphasis tag table 1225 and added to the score field 1227-2 of the corresponding words in the important word table 1227 (step B05). Steps B03 to B05 are repeated until the analysis of the XML document is completed (step B06).
[0062]
When the analysis of the first XML document is completed (YES in step B06), processing moves to the next XML document (step B07), and steps B03 to B07 are repeated until the analysis of all XML documents is completed (NO in step B08). When it is completed (YES in step B08), the important word table 1227 is sorted in descending order of the score field 1227-2 and the result is reflected in the table (step B09).
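The loop of steps B01 to B09 can be sketched in Python as follows. To keep the control flow visible, the sketch assumes that phrase extraction (step B03) has already produced (phrase, tags) pairs for each document, and it replaces morphological analysis with a crude split on whitespace plus a stop-word list. The names `extract_nouns` and `build_important_word_table` and all sample values are hypothetical.

```python
from collections import defaultdict

# Hypothetical stand-ins for the emphasis tag table and for noun extraction.
EMPHASIS_POINTS = {"bold": 8, "background_color": 6}
STOP_WORDS = {"a", "an", "and", "of", "the"}


def extract_nouns(phrase):
    """Crude substitute for morphological analysis: split, drop stop words."""
    return [word for word in phrase.lower().split() if word not in STOP_WORDS]


def build_important_word_table(documents):
    """documents: list of documents, each a list of (phrase, [tags]) pairs."""
    table = defaultdict(int)                    # B01: start from an empty table
    for document in documents:                  # B02, B07, B08: every document
        for phrase, tags in document:           # B03, B06: every emphasized phrase
            points = sum(EMPHASIS_POINTS[tag] for tag in tags)
            for word in extract_nouns(phrase):  # B04: register the words
                table[word] += points           # B05: add the tag points
    # B09: sort in descending order of score (ties broken alphabetically).
    return sorted(table.items(), key=lambda item: (-item[1], item[0]))


documents = [
    [("information and knowledge", ["background_color", "bold"]),
     ("knowledge sharing", ["bold"])],
    [("knowledge analysis", ["bold"])],
]
table = build_important_word_table(documents)
print(table)
```

Because the table is shared across documents, a word that is emphasized in several documents accumulates a higher score than one emphasized only once.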
[0063]
FIG. 12 is a flowchart illustrating in detail the processing in step A08 of FIG. 10, in which the clustering module 1224 creates the cluster database 1229.
[0064]
So that the user's authority to perform clustering can be checked, the Web browser 111 of the client terminal 11 issues a login request to the control module 1211 of the server computer 12 (step C01). The control module 1211 then accesses the login management information 1212 to check whether the user ID and password entered by the user are registered (step C02), and performs user authentication to determine whether to permit the login (step C03). If the user ID and password are not registered in the login management information 1212 and the login fails (NO in step C04), the control module 1211 returns a login failure to the Web browser 111 of the client terminal 11 through the Web server 121 and terminates the process (step C05).
[0065]
On the other hand, when the user ID and password are registered in the login management information 1212 and the login succeeds (YES in step C04), the knowledge database to be analyzed is selected (step C06) and important words are extracted using the important word table 1227 (step C07). The clustering module 1224 then executes clustering based on the extracted important words to create the cluster database 1229 (step C08).
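The gating of clustering behind user authentication (steps C01 to C08) could look roughly like the Python sketch below. The in-memory `LOGIN_INFO` dictionary, the function names, and the shape of the returned dictionaries are all assumptions. Note also that a real deployment would store salted password hashes rather than plain passwords.

```python
import hmac

# Hypothetical login management information: user ID -> password.
# Plain-text passwords are used here only to keep the sketch short.
LOGIN_INFO = {"alice": "s3cret"}


def authenticate(user_id, password):
    """C02 to C04: check that the user ID is registered and the password matches."""
    registered = LOGIN_INFO.get(user_id)
    if registered is None:
        return False
    # Constant-time comparison to avoid leaking information through timing.
    return hmac.compare_digest(registered.encode(), password.encode())


def handle_clustering_request(user_id, password, important_words, cluster):
    """Run `cluster` only for an authenticated user."""
    if not authenticate(user_id, password):
        return {"status": "login failed"}               # C05: report the failure
    # C07, C08: cluster using the extracted important words.
    return {"status": "ok", "clusters": cluster(important_words)}


def dummy_cluster(words):
    """Placeholder clustering function that creates one empty cluster per word."""
    return {word: [] for word in words}


print(handle_clustering_request("alice", "s3cret", ["knowledge"], dummy_cluster))
print(handle_clustering_request("mallory", "guess", ["knowledge"], dummy_cluster))
```

Passing the clustering routine in as a parameter keeps the authentication check separate from whichever clustering method is used.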
[0066]
In the created cluster database 1229, each document is classified and organized according to its features within a classification system determined from the important words, so the user can be presented with an analysis result showing what kinds of documents are registered as knowledge. The user can also designate a specific classification item based on the analysis result and perform a further information search within that item.
[0067]
Thus, this knowledge analysis system focuses on the fact that words with emphasis attributes are the key concepts of a document: it determines important words from XML documents based on their emphasis tags and makes clustering based on those important words possible.
[0068]
Since all the functions of the knowledge analysis system of this embodiment are realized by a computer program, the same effects as those of the present embodiment can be obtained simply by storing the computer program on a computer-readable storage medium and, through that storage medium, installing and executing it on an ordinary computer connected to a computer network.
[0069]
Further, the present invention is not limited to the above-described embodiment, and various modifications can be made at the implementation stage without departing from the scope of the invention. In addition, the above embodiment includes inventions at various stages, and various inventions can be extracted by appropriately combining the disclosed constituent elements. For example, even if some constituent requirements are deleted from all those shown in the embodiment, the resulting configuration can be extracted as an invention as long as the problem described in the section on the problem to be solved by the invention can be solved and the effect described in the section on the effect of the invention can be obtained.
[0070]
[Effect of the invention]
As described above, according to the present invention, it is possible to perform accurate knowledge analysis by performing information retrieval and clustering with automatically determined important words.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a system configuration of a knowledge analysis system according to an embodiment of the present invention.
FIG. 2 is a view showing an example of an analysis condition designation screen used in the knowledge analysis system of the embodiment.
FIG. 3 is an exemplary flowchart illustrating an outline of a procedure of knowledge analysis processing in the knowledge analysis system according to the embodiment.
FIG. 4 is an exemplary view for explaining a document format before XML conversion in the knowledge analysis system of the embodiment.
FIG. 5 is an exemplary view for explaining a document format after XML conversion in the knowledge analysis system according to the embodiment.
FIG. 6 is an exemplary view for explaining a format of an emphasis tag table used in the knowledge analysis system of the embodiment.
FIG. 7 is an exemplary view showing a state of an important word table during one document analysis in the knowledge analysis system of the embodiment.
FIG. 8 is a view showing a state of an important word table after one document analysis in the knowledge analysis system of the embodiment.
FIG. 9 is an exemplary view showing a state of an important word table after all documents are analyzed in the knowledge analysis system according to the embodiment.
FIG. 10 is an exemplary flowchart illustrating a procedure of knowledge analysis processing by the knowledge analysis system according to the embodiment.
FIG. 11 is an exemplary flowchart illustrating a procedure of important word determination processing by the knowledge analysis system according to the embodiment.
FIG. 12 is an exemplary flowchart illustrating a procedure of clustering processing by the knowledge analysis system according to the embodiment.
[Explanation of symbols]
11 ... Client terminal
12 ... Server computer
111 ... Web browser
112 ... Document editing tool
113 ... Character recognition tool
114 ... Image scanner
121 ... Web server
122 ... Knowledge Server
1221 ... Emphasis tag setting module
1222 ... Registration module
1223 ... Search module
1224 ... Clustering module
1225 ... Emphasis tag table
1226 ... Knowledge database
1227 ... Important word table
1228 ... Intermediate database
1229 ... Cluster database

Claims

A knowledge analysis system configured to be connectable to a plurality of client terminals via a network and supporting classification and analysis of a plurality of pieces of document information stored in a server in response to a request from each client terminal, the system comprising:
word extraction means for extracting, from each of the plurality of pieces of stored document information, words having an emphasis attribute, based on emphasis tags embedded in each piece of document information;
important word determination means for determining an importance for each word based on the type of emphasis attribute of each word extracted by the word extraction means and the appearance frequency of the word, accumulating the determined importance over the plurality of pieces of document information, and determining a predetermined number of words with the highest accumulated importance as the important words of the document information as a whole; and
knowledge analysis means for creating a cluster database in which the plurality of pieces of document information are classified and organized, by clustering them through matching of the important words of the document information as a whole, determined by the important word determination means, against the important words determined from each piece of document information.

The knowledge analysis system according to claim 1, further comprising means for holding definition information indicating the relationship between each type of emphasis attribute and a corresponding weighting value,
wherein the important word determination means includes
means for determining, for each word extracted by the word extraction means, the importance as the product of the weighting value corresponding to the emphasis attribute of the word, acquired from the definition information, and the appearance frequency of the word.

A knowledge analysis method by which a server supports classification and analysis of a plurality of pieces of document information stored in the server in response to a request from each client terminal, the method comprising:
a word extraction step in which the server extracts, from each of the plurality of pieces of stored document information, words having an emphasis attribute, based on emphasis tags embedded in each piece of document information;
an important word determination step in which the server determines an importance for each word based on the type of emphasis attribute of each word extracted in the word extraction step and the appearance frequency of the word, accumulates the determined importance over the plurality of pieces of document information, and determines a predetermined number of words with the highest accumulated importance as the important words of the document information as a whole; and
a knowledge analysis step in which the server creates a cluster database in which the plurality of pieces of document information are classified and organized, by clustering them through matching of the important words of the document information as a whole, determined in the important word determination step, against the important words determined from each piece of document information.

The knowledge analysis method according to claim 3, wherein, in the important word determination step, the server determines the importance of each word extracted in the word extraction step as the product of the weighting value corresponding to the emphasis attribute of the word, acquired from definition information indicating the relationship between each type of emphasis attribute and a corresponding weighting value, and the appearance frequency of the word.

A program for causing a server to support classification and analysis of a plurality of pieces of document information stored in the server in response to a request from each client terminal, the program causing the server to execute:
a word extraction procedure for extracting, from each of the plurality of pieces of stored document information, words having an emphasis attribute, based on emphasis tags embedded in each piece of document information;
an important word determination procedure for determining an importance for each word based on the type of emphasis attribute of each word extracted by the word extraction procedure and the appearance frequency of the word, accumulating the determined importance over the plurality of pieces of document information, and determining a predetermined number of words with the highest accumulated importance as the important words of the document information as a whole; and
a knowledge analysis procedure for creating a cluster database in which the plurality of pieces of document information are classified and organized, by clustering them through matching of the important words of the document information as a whole, determined by the important word determination procedure, against the important words determined from each piece of document information.