JP4010711B2 - Storage medium storing term evaluation program

Info

Publication number: JP4010711B2
Application number: JP18539699A
Authority: JP
Inventors: 淳高藤; 勝彦水戸部; 功志土居; 浩之三ツ矢
Original assignee: 株式会社ジャストシステム
Priority date: 1999-06-30
Filing date: 1999-06-30
Publication date: 2007-11-21
Anticipated expiration: 2019-06-30
Also published as: JP2001014330A

Description

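[0001]
[Technical Field of the Invention]
The present invention relates to a term evaluation device that evaluates terms extracted from document data, and in particular to a method of evaluating terms.
[0002]
[Prior Art and Problems to Be Solved by the Invention]
Today, attempts are being made to store large amounts of document information as data and to discover desired information and knowledge from that document information (hereinafter referred to as "mining"). One proposed technique is to extract terms from documents and perform the mining on the basis of the extracted terms.
[0003]
However, documents contain a miscellany of terms, and no established method existed for the criteria by which terms should be extracted. Consequently, the extracted terms could not be evaluated.
[0004]
An object of the present invention is to solve the above problem and to provide a term evaluation device, or a method therefor, that can evaluate terms extracted from a plurality of documents. A further object is to determine the term extraction target documents from among a plurality of documents.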
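[0005]
[Means for Solving the Problems and Effects of the Invention]
1) In the recording medium storing the program according to the present invention, the program causes the computer to execute the following processing. A) A plurality of corresponding related keywords are stored for each viewpoint profile, and when a plurality of viewpoint profiles are given, the following term extraction target document determination process and term statistic calculation process are executed for each viewpoint profile: a1) a term extraction target document determination process that uses the plurality of related words associated with the viewpoint profile to specify, from a plurality of document data, one or more documents to be subject to term extraction; a2) a term statistic calculation process that obtains statistics of the terms present in the specified term extraction target documents and calculates them as the term statistics for that viewpoint profile. B) For each term present in the specified term extraction target documents, a numerical vector is obtained whose elements are the term statistics for the respective viewpoint profiles and whose dimension equals the number of viewpoint profiles, and each term for which a numerical vector has been obtained is displayed together with its numerical vector.
In this way, determining the extraction target documents from the document data for each viewpoint profile makes it possible to extract terms from documents that match the plurality of related words associated with that viewpoint profile. In addition, by obtaining, for each term, a numerical vector whose elements are the term statistics for the respective viewpoint profiles and whose dimension equals the number of viewpoint profiles, and displaying each such term together with its numerical vector, terms extracted from a plurality of documents can be evaluated. Analysis in which the terms are sorted and classified also becomes possible.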
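[0006]
2) In the recording medium storing the program according to the present invention, terms for which only a specific vector element has a value at or above, or at or below, a predetermined threshold are extracted on the basis of the numerical vector of each term, and the extracted terms are displayed together with their numerical vectors. Terms unique to a specific interest of the operator can therefore be extracted and displayed on the basis of the vectors.
[0007]
3) In the recording medium storing the program according to the present invention, terms for which all of a plurality of specific vector elements have values at or above, or at or below, a predetermined threshold are extracted on the basis of the numerical vector of each term, and the extracted terms are displayed together with their numerical vectors. Terms common to the operator's specific interests can therefore be extracted and displayed.
[0008]
4) In the recording medium storing the program according to the present invention, terms for which a specific vector element value is singularly large or singularly small compared with other terms are extracted on the basis of the numerical vector of each term, and the extracted terms are displayed together with their numerical vectors. Terms that are particularly related to a specific interest of the operator can therefore be extracted from the extracted terms and displayed together with their numerical vectors.
[0009]
5) In the recording medium storing the program according to the present invention, the plurality of viewpoint profiles are viewpoint profiles whose ordering is meaningful, and terms for which the value of a certain vector element differs from the preceding and following vector elements by more than a predetermined difference are extracted on the basis of the numerical vector of each term, and the extracted terms are displayed together with their numerical vectors. Terms that change at a certain point in the time series for a single viewpoint profile can therefore be extracted and displayed.
[0010]
6) In the recording medium storing the program according to the present invention, predetermined character strings that are contained in terms and that express the intention of the document author are classified by intention and stored; when the operator selects one of the classifications, only terms containing the designated character strings are extracted, and the extracted terms are displayed together with their numerical vectors. Terms containing character strings that express the document author's intention can therefore be extracted and displayed. This makes it easy to extract terms that match the operator's intention. Examples of character strings that express the document author's intention include auxiliary verbs, suffixes, and counter words (classifiers).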
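[0011]
7) The term evaluation method according to the present invention is a method in which a computer evaluates the terms that make up document data, wherein the computer stores a plurality of corresponding related keywords for each viewpoint profile and executes the following processing. A) When a plurality of viewpoint profiles, each with a plurality of related words defined for it, are given, the following a1) term extraction target document determination process and a2) term statistic calculation process are performed for each viewpoint profile. a1) A term extraction target document determination process that determines the term extraction target documents by using the plurality of related words associated with the viewpoint profile to specify, from a plurality of document data, one or more documents to be subject to term extraction. a2) A term statistic calculation process that calculates statistics of the terms present in the specified term extraction target documents as the term statistics for that viewpoint profile. B) For each term, a numerical vector is obtained whose elements are the term statistics for the respective viewpoint profiles and whose dimension equals the number of viewpoint profiles, and each term for which a numerical vector has been obtained is displayed together with its numerical vector.
In this way, determining the extraction target documents from the document data for each viewpoint profile makes it possible to extract terms from documents that match the plurality of related words associated with that viewpoint profile. In addition, by obtaining, for each term, a numerical vector whose elements are the term statistics for the respective viewpoint profiles and whose dimension equals the number of viewpoint profiles, and displaying each such term together with its numerical vector, terms extracted from a plurality of documents can be evaluated. Analysis in which the terms are sorted and classified also becomes possible.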
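[0012]
8) In the method of evaluating the terms that make up document data according to the present invention, when a character string is given that is used attached to a verb or adjective and that expresses the document author's intention with respect to that verb or adjective, only terms containing the designated character string are extracted, and the extracted terms are displayed together with their numerical vectors. Terms containing a character string that expresses the document author's intention can therefore be extracted. This makes it easy to extract terms that match the operator's intention.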
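[0013]
9) In the term evaluation device according to the present invention, the term extraction target document specifying means stores a plurality of corresponding related keywords for each viewpoint profile and, when a plurality of viewpoint profiles are given, uses the corresponding related keywords of each given viewpoint profile to specify, from a plurality of document data, one or more documents to be subject to term extraction. The term extraction target document specifying means specifies, for each viewpoint profile and on the basis of the given viewpoint profile, one or more documents to be subject to term extraction from the plurality of document data. The term statistic calculation means calculates statistics of the terms present in the specified term extraction target documents and outputs them as the term statistics for that viewpoint profile. The evaluation means obtains, for each term, a numerical vector whose dimension equals the number of viewpoint profiles and whose elements are the term statistics for the respective viewpoint profiles, and each term for which a numerical vector has been obtained is displayed together with its numerical vector.
In this way, determining the extraction target documents from the document data for each viewpoint profile makes it possible to extract terms from documents that match the plurality of related words associated with that viewpoint profile. In addition, for each term, a numerical vector is obtained whose elements are the term statistics for the respective viewpoint profiles and whose dimension equals the number of viewpoint profiles, and each such term is displayed together with its numerical vector. Because each term is thereby expressed as statistics taken from the standpoint of the plurality of viewpoint profiles, various numerical operations become possible. As a result, terms extracted from a plurality of documents can be evaluated. Analysis in which the terms are sorted and classified also becomes possible.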
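10) In the term evaluation device according to the present invention, the display means extracts, on the basis of the numerical vector of each term, terms for which only a specific vector element has a value at or above, or at or below, a predetermined threshold, and displays the extracted terms together with their numerical vectors. Terms unique to a specific interest of the operator can therefore be extracted and displayed on the basis of the vectors.
11) In the term evaluation device according to the present invention, the display means extracts, on the basis of the numerical vector of each term, terms for which all of a plurality of specific vector elements have values at or above, or at or below, a predetermined threshold, and displays the extracted terms together with their numerical vectors. Terms common to the operator's specific interests can therefore be extracted and displayed.
12) In the term evaluation device according to the present invention, the display means extracts, on the basis of the numerical vector of each term, terms for which a specific vector element value is singularly large or singularly small compared with other terms, and displays the extracted terms together with their numerical vectors. Terms that are particularly related to a specific interest of the operator can therefore be extracted from the extracted terms and displayed together with their numerical vectors.
13) In the term evaluation device according to the present invention, the plurality of viewpoint profiles are viewpoint profiles whose ordering is meaningful, and the display means extracts, on the basis of the numerical vector of each term, terms for which the value of a certain vector element differs from the preceding and following vector elements by more than a predetermined difference, and displays the extracted terms together with their numerical vectors. Terms that change at a certain point in the time series for a single viewpoint profile can therefore be extracted and displayed.
14) In the term evaluation device according to the present invention, predetermined character strings that are contained in terms and that express the intention of the document author are classified by intention and stored; when the operator designates one of the classifications, the display means extracts only terms containing the designated character strings and displays the extracted terms together with their numerical vectors. Terms containing character strings that express the document author's intention can therefore be extracted and displayed. This makes it easy to extract terms that match the operator's intention. Examples of character strings that express the document author's intention include auxiliary verbs, suffixes, and counter words (classifiers).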
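[0020]
[Embodiments of the Invention]
1. Description of the functional block diagram
An embodiment of the present invention will be described with reference to the drawings. The term evaluation device 1 shown in FIG. 1 comprises document storage means 3, term statistic calculation means 5, evaluation means 7, analysis means 8, and notification means 9.
[0021]
The document storage means 3 stores a plurality of documents. The term statistic calculation means 5 specifies, on the basis of a given search condition, one or more documents to be subject to term extraction from the plurality of document data stored in the document storage means 3, and calculates statistics of the terms present in the specified term extraction target documents as the term statistics for that search condition. When term statistics for a plurality of different search conditions are given by the term statistic calculation means 5, the evaluation value determination means 7 determines an evaluation value for each term as a numerical vector whose dimension equals the number of search conditions given.
[0022]
The analysis means 8 analyzes the terms on the basis of the evaluation value of each term. The notification means 9 reports the analysis result.
[0023]
In this embodiment, the term statistic calculation means 5 stores a plurality of items of viewpoint profile correspondence information, each consisting of a viewpoint profile and a plurality of related words associated with that viewpoint profile. When a viewpoint profile is given as a search condition, it specifies the term extraction target documents using the plurality of related words.
[0024]
Narrowing down the term extraction documents by viewpoint profile in this way means that the terms extracted match the viewpoint profile in which the operator is interested. Representing the evaluation value of each term as a numerical vector also makes it possible to analyze the terms.
[0025]
In this embodiment, display means is used as the notification means, but other notification means may be used.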
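[0026]
2. Hardware configuration
(2.1) Outline
The hardware configuration of the term evaluation device 1 shown in FIG. 1 will be described. The computer system 40 shown in FIG. 2 comprises an input device 41, a control device 43, a display device 45, and a storage device 47. The input device 41 is used to input various commands. The storage device 47 stores a program that performs predetermined processing on the basis of a given command. The control device 43 performs predetermined data processing on the basis of the program stored in the storage device 47.
[0027]
(2.2) Details
FIG. 3 shows an example of a hardware configuration in which the computer system 40 shown in FIG. 2 is implemented using a CPU.
[0028]
The computer system 40 comprises a CPU 23, a memory 27, a hard disk 26, a CRT 30, an FDD 25, a keyboard 28, a mouse 31, and a bus line 29. The CPU 23 controls each part via the bus line 29 in accordance with a control program stored on the hard disk 26.
[0029]
This control program has been read, via the FDD 25, from a flexible disk 25a on which the program is stored, and installed on the hard disk 26. Instead of a flexible disk, the program may be installed on the hard disk from another computer-readable recording medium in which the program is tangibly embodied, such as a CD-ROM or an IC card. It may also be downloaded over a communication line.
[0030]
In this embodiment, the program stored on the flexible disk is executed by the computer indirectly, by installing it from the flexible disk onto the hard disk 26. However, the invention is not limited to this, and the program stored on the flexible disk may be executed directly from the FDD 25. Programs executable by a computer include not only those that can be executed directly once installed as they are, but also those that must first be converted into another form (for example, compressed data that must be decompressed) and those that become executable in combination with other module parts.
[0031]
The hard disk 26 has a program storage section 26a, a correspondence table storage section 26b, a document storage section 26c, and an evaluation storage section 26d. The program storage section 26a stores the program described below. The correspondence table storage section 26b stores a plurality of viewpoint profiles as shown in FIG. 4, and a plurality of corresponding related keywords are stored for each viewpoint profile. The document storage section 26c stores a plurality of documents to be evaluated. In this embodiment, each document consists of its creation date and time, its title, and its content. The evaluation storage section 26d stores the evaluation result for each term. The memory 27 additionally stores various calculation results and the like.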
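[0032]
3. Flowchart
Next, the program stored in the program storage section 26a of the hard disk 26 will be described using the flowcharts of FIGS. 5 to 7. The following description takes as an example the case in which each term is evaluated using "economic boom" and "recession" as the viewpoint profiles.
[0033]
First, the operator inputs a plurality of queries. In this embodiment, viewpoint profiles are used as the queries, so the operator displays a selection box 62 such as that shown in FIG. 8 on the CRT 30 and selects the viewpoint profiles "economic boom" and "recession". FIG. 8 shows the state after the viewpoint profile "economic boom" has been selected and the viewpoint profile "recession" has then been selected.
[0034]
If the required viewpoint profile does not exist in the selection box 62 shown in FIG. 8, the operator clicks the button 64 and adds the required viewpoint profile and its corresponding related keywords. These are then added to the correspondence table shown in FIG. 4.
[0035]
To add or delete related keywords for an existing viewpoint profile, the operator adds them to or deletes them from the related keyword box 63. By creating a viewpoint profile from the viewpoint of interest to the operator and specifying the term extraction target documents in this way, documents in line with the viewpoint the operator wants can be extracted.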
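[0036]
When the CPU 23 determines that a plurality of queries have been input (step S1 in FIG. 5), it initializes the query number k (step S3). It then determines the extraction target documents for the 0th query, the viewpoint profile "economic boom" (step S5). The extraction target document determination process is described with reference to FIG. 6. In this embodiment, as described below, the similarity between each document and the viewpoint profile is evaluated, and documents whose similarity exceeds a predetermined threshold are determined to be term extraction documents.
[0037]
The CPU 23 initializes the document number m (step S50). It determines whether the vectorization process has been completed for each document (step S51). In this case the vectorization process has not been completed, so the process proceeds to step S53, where morphological analysis is performed on each document and the words (terms) appearing in the document are extracted. Important terms are then determined from the extracted terms using the tf-idf method (step S55). The tf-idf method is a keyword determination technique used in information retrieval. It weights terms using tf (term frequency), which indicates how often the term appears in a given document, and idf (inverse document frequency), which indicates the rarity of the term, that is, how few of all the documents contain it.
[0038]
Using the important terms extracted in this way, each document is vectorized as a numerical vector whose dimension equals the number of important terms. For example, if there are 100 important terms, a 100-dimensional numerical vector is obtained. Some documents may not contain a given important term. In that case, the value of that dimension for that document is 0 (the vector is sparse). The CPU 23 stores the numerical vectors obtained in this way in the memory 27.
[0039]
Next, the CPU 23 vectorizes the viewpoint profile (step S59). In this embodiment, all the related keywords of the k-th viewpoint profile are read from the hard disk 26. The tf-idf value of each related keyword is obtained from the tf of that keyword within the viewpoint profile and its idf over all documents, and the viewpoint profile is vectorized as a numerical vector whose dimension equals the number of related keywords.
[0040]
Next, the similarity between the viewpoint profile and the m-th document is calculated (step S61). This similarity can be obtained by calculating the inner product of the numerical vector obtained in step S57 and the numerical vector obtained in step S59.
[0041]
Next, the CPU 23 determines whether the similarity calculation has been completed for all documents (step S63). If it has not, the CPU 23 increments the document number m (step S65) and performs the processing from step S61 onward.
[0042]
When the similarity calculation has been completed for all documents in step S63, the documents whose similarity exceeds the predetermined threshold are determined to be the extraction documents (step S67).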
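The document determination steps (S50 to S67) can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names are hypothetical, documents are assumed to be already tokenized (the morphological analysis of step S53 is skipped), every term is kept instead of selecting important terms (step S55), idf is assumed to be log(N / df), and the vectors are held as sparse dictionaries over a shared vocabulary so that the inner product of step S61 is well defined.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict) per tokenized document, plus the idf table."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / count) for term, count in df.items()}  # assumed idf formula
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def inner_product(u, v):
    """Inner product of two sparse vectors; missing dimensions count as 0."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def select_target_documents(docs, related_keywords, threshold):
    """Return the indices of documents whose similarity to the profile exceeds the threshold."""
    doc_vectors, idf = tfidf_vectors(docs)
    # Profile vector: tf of each keyword within the profile times its idf over all documents.
    profile = {k: c * idf.get(k, 0.0) for k, c in Counter(related_keywords).items()}
    return [m for m, vec in enumerate(doc_vectors) if inner_product(profile, vec) > threshold]
```

For example, `select_target_documents(docs, ["boom", "growth"], 1.0)` keeps only those documents whose weighted overlap with the two keywords is above 1.0.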
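[0043]
Next, the CPU 23 calculates statistics of the terms present in the extraction target documents. This calculation process is described with reference to FIG. 7.
[0044]
The CPU 23 initializes the document-of-interest number i and the term-of-interest number j (step S25). The i-th document is taken as the document of interest (step S27). The j-th term of the document of interest is taken as the term of interest (step S29).
[0045]
The CPU 23 determines whether the term of interest is appearing for the first time in the document of interest (step S31). If it is, the CPU 23 increments the term's document count bi, the number of documents in which the term appears (step S33).
[0046]
The CPU 23 determines whether the term of interest exists in an extracted term table (not shown) in the memory 27 (step S35). If it already exists, the CPU 23 increments the term's occurrence frequency ti (step S37). If it does not exist, the CPU 23 adds the term to the extracted term table (step S39).
[0047]
The CPU 23 determines whether the term of interest is the last term (step S41). If it is not, the CPU 23 increments the term-of-interest number j (step S43) and repeats the processing from step S29 onward. If it is the last term, the CPU 23 determines whether term extraction has been completed for all documents (step S45). If documents remain for which term extraction has not been completed, the CPU 23 increments the document-of-interest number i (step S47), initializes the term-of-interest number j (step S49), and repeats the processing from step S27 onward. In this way, the document count bi and the occurrence frequency ti are obtained for each term.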
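The counting loop of FIG. 7 can be sketched as follows. This is an illustrative sketch only: `term_statistics` is a hypothetical name, the documents are assumed to be token lists, and a term added to the table for the first time is assumed to start with an occurrence frequency of 1, which the text does not state explicitly.

```python
from collections import Counter


def term_statistics(target_docs):
    """Return (doc_count, frequency): b_i and t_i for every term in the target documents."""
    doc_count = Counter()   # b_i: number of documents in which the term appears
    frequency = Counter()   # t_i: total number of occurrences of the term
    for doc in target_docs:              # outer loop over documents (index i)
        seen_in_doc = set()
        for term in doc:                 # inner loop over terms (index j)
            if term not in seen_in_doc:  # step S31: first occurrence in this document
                seen_in_doc.add(term)
                doc_count[term] += 1     # step S33
            frequency[term] += 1         # steps S35 to S39 (assumed to start at 1)
    return doc_count, frequency
```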
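[0048]
Next, the CPU 23 determines whether processing has been completed for all queries (step S9 in FIG. 5). If it has not, the CPU 23 increments the query number k (step S11) and repeats the processing from step S5 onward.
[0049]
When processing has been completed for all queries, the CPU 23 converts each term into a numerical vector over the queries (step S13). In this embodiment, any one of the following can be selected as the vector element used for this conversion: 1) the term occurrence frequency ti, 2) the term document count bi, 3) the term occurrence frequency ti divided by the term document count bi, or 4) the degree of relevance.
[0050]
The degree of relevance is a numerical value representing how strongly a term is related, within the documents, to the group of related words that make up the viewpoint profile, and it varies depending on the evaluation algorithm used. In this embodiment, it is obtained from the term document count bi and the rarity of the term over all documents. Accordingly, the larger the term document count, the higher the degree of relevance, and the smaller the term document count, the lower the degree of relevance. In addition, a term that seldom appears in documents other than the term extraction documents but does appear in the term extraction documents has a higher degree of relevance. In other words, the degree of relevance is high when the term document count is large and the term seldom appears in documents other than the term extraction documents while appearing frequently in the term extraction documents.
[0051]
Note that 1) the term occurrence frequency ti and 3) the term occurrence frequency ti divided by the term document count bi are considered useful for extracting terms that co-occur with the keywords used in the query set, that is, for extracting events related to the operator's interest. 2) The term document count bi is considered useful for extracting the main topics related to the operator's interest and the number of times they occur. 4) The degree of relevance is considered particularly useful for extracting events that have a rare relationship with the operator's interest.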
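As a rough illustration of step S13, the per-profile statistics can be assembled into one numerical vector per term. The sketch below is an assumption-laden example, not the patented code: `term_vectors` and its `element` argument are hypothetical, only vector elements 1) to 3) are covered, and the degree of relevance is omitted because the text gives no formula for it.

```python
def term_vectors(stats_per_profile, element="doc_count"):
    """Build {term: [value for profile 0, value for profile 1, ...]}.

    stats_per_profile holds one (doc_count, frequency) pair of dicts per viewpoint profile.
    """
    def value(doc_count, frequency, term):
        b = doc_count.get(term, 0)
        t = frequency.get(term, 0)
        if element == "frequency":   # 1) t_i
            return t
        if element == "doc_count":   # 2) b_i
            return b
        if element == "ratio":       # 3) t_i / b_i, taken as 0 when the term is absent
            return t / b if b else 0.0
        raise ValueError(f"unsupported vector element: {element}")

    terms = sorted({term for doc_count, _ in stats_per_profile for term in doc_count})
    return {term: [value(b, t, term) for b, t in stats_per_profile] for term in terms}
```

With two profiles, each term is mapped to a two-dimensional vector, so the dimension always equals the number of profiles.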
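[0052]
FIG. 9 shows an example in which each term is represented as a numerical vector. In this case, the degree of relevance has been obtained as the vector element for the viewpoint profiles "economic boom" and "recession", and the threshold is set so that terms for which any element is 0 or more (that is, all terms) are displayed in the display area 72. To set the vector element and threshold, the operator clicks the threshold setting button 71 with the mouse, which displays the threshold setting dialog 74, and sets the desired conditions.
[0053]
FIG. 10 shows a display example in which the settings have been changed to extract common terms. In FIG. 10, the vector element is the term document count bi and the threshold condition is that the value is 1 or more for every query, and the terms satisfying this are displayed in the display area 72. A term for which all of the designated vector elements have values at or above, or at or below, the threshold designated by the operator is called a common term.
[0054]
A common term is a term extracted in common by all the queries. When, as in this embodiment, the queries are viewpoint profiles representing the user's interests, a common term is often a term (topic) common to all of the user's interests.
[0055]
FIG. 11 shows a display example in which the settings have been changed to extract unique terms. In FIG. 11, the vector element is the term document count bi and the threshold condition is that the value is 1 or more for only one query, and the terms satisfying this are displayed in the display area 72. A term for which only a specific vector element has a value at or above, or at or below, the threshold designated by the operator is called a unique term.
[0056]
A unique term is a term extracted only by the corresponding query. When, as in this embodiment, the queries are viewpoint profiles representing the user's interests, a unique term is often a term unique to a specific interest of the user.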
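The common-term and unique-term conditions reduce to simple predicates over each term's vector. The following sketch assumes hypothetical function names and covers only the "at or above the threshold" variant; it also treats "only a specific element" as "exactly one element", which is one possible reading.

```python
def common_terms(vectors, threshold):
    """Terms whose every vector element is at or above the threshold."""
    return {t: v for t, v in vectors.items() if all(x >= threshold for x in v)}


def unique_terms(vectors, threshold):
    """Terms for which exactly one vector element is at or above the threshold."""
    return {t: v for t, v in vectors.items()
            if sum(x >= threshold for x in v) == 1}
```

With document counts as vector elements and a threshold of 1, these correspond to the settings described for FIG. 10 and FIG. 11.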
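[0057]
Next, the CPU 23 extracts feature terms (step S15 in FIG. 5). The extraction criteria for feature terms may be supplied by the operator. In this embodiment, terms containing a specific character string are extracted. In this case, feature terms are extracted as follows by setting filtering conditions.
[0058]
FIG. 12 shows a display example in which only terms containing the character string "会社" (company) are extracted. To set this filtering condition, the operator clicks the filter button 75, which displays the filtering dialog 76, and sets "会社" as the string filter.
[0059]
In addition to the string filter, a dictionary filter and a pattern filter can be set as filtering conditions, either alone or in combination.
[0060]
The dictionary filter is a glossary storing one or more words desired by the operator. In this embodiment, auxiliary verbs corresponding to each user intention, as shown in FIG. 13, are stored in the dictionary filter in advance so that the operator can select them. Specifically, when the operator selects "hope" as the intention, filtering is performed with the corresponding words, such as "たい" (-tai, "want to"), applied as string filters under an OR condition. This makes it possible to extract terms containing "たい", such as "知りたい" (want to know) and "調べたい" (want to look up).
[0061]
Registering in a dictionary in advance the auxiliary verbs that serve as guides for mining in this way makes mining easier.
[0062]
One or more suffixes may also be stored in the dictionary filter in advance for each user intention so that they can be selected. A suffix is a non-independent word that is attached to the end of a word to add meaning or to give it the function of a particular part of speech, for example "的" (-teki) and "性" (-sei). This makes it possible to extract terms such as "具体的" (concrete) and "革新性" (innovativeness), and to find out which words such terms modify in the documents.
[0063]
One or more counter words (classifiers) may also be stored in the dictionary filter in advance for each user intention so that they can be selected, for example "円" (yen) and "MHz". Since this allows terms composed of a numeral and a counter word to be extracted, it is possible to find out which words such terms modify in the documents. In the case of prices in particular, a price threshold can be set so that only information on products priced at or below, or at or above, that threshold is extracted.
[0064]
The dictionary filter may further be given a hierarchical structure for each user intention.
[0065]
Although this embodiment uses auxiliary verbs corresponding to user intentions that are stored in the dictionary filter in advance, the user may instead supply them directly to the string filter without using such a dictionary filter.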
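The string filter and dictionary filter can be sketched as below. The names and the dictionary layout are hypothetical, and the dictionary holds only the single "hope" entry mentioned in the text, since the actual contents of FIG. 13 are not reproduced here.

```python
# Hypothetical intent dictionary: intention -> character strings expressing it.
INTENT_DICTIONARY = {"hope": ["たい"]}


def string_filter(vectors, strings):
    """Keep terms containing at least one of the given strings (OR condition)."""
    return {t: v for t, v in vectors.items() if any(s in t for s in strings)}


def dictionary_filter(vectors, intent):
    """Look up the strings registered for an intention and apply them as string filters."""
    return string_filter(vectors, INTENT_DICTIONARY[intent])
```

Calling `string_filter(vectors, ["会社"])` corresponds to the string filter setting described for FIG. 12.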
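[0066]
In addition to the above, feature terms can be extracted according to the following criteria.
[0067]
1) Singular terms: terms for which a certain vector element value is singularly large or singularly small compared with other terms
A singular term is a term for which the difference between its vector element values is larger than a predetermined threshold. Although not to the same degree as a unique term, it is often a term related to a specific interest of the user. For example, a term is extracted as a singular term both when its three vector element values are "1, 1, 6" with a threshold of "5" and when they are "1, 3, 6" with a threshold of "5".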
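One way to read the singular-term criterion is as a test on the spread between the largest and smallest vector elements. The sketch below uses that reading; `is_singular` is a hypothetical name, and the comparison is written as "at or above the threshold" purely so that both worked examples in the text (spread 5, threshold 5) are extracted. The actual comparison used in the embodiment is not specified beyond those examples.

```python
def is_singular(vector, threshold):
    """Assumed test: the spread between the largest and smallest element reaches the threshold."""
    return max(vector) - min(vector) >= threshold


def singular_terms(vectors, threshold):
    """Keep only the terms whose vectors pass the assumed singularity test."""
    return {t: v for t, v in vectors.items() if is_singular(v, threshold)}
```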
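[0068]
2) Difference terms: for queries whose ordering is meaningful, terms for which the difference between a certain query and the queries before and after it is large
As an example of difference terms, suppose that a series of mutually related queries tracing the shift in best-selling car categories is used: a first viewpoint profile "sedan", a second viewpoint profile "sports", a third viewpoint profile "RV", and so on. It then becomes possible to extract, for example, new topics that co-occur with each shift. Difference terms are also effective when the extraction period is limited, as described later.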
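A difference-term filter can be sketched by comparing each vector element with its neighbour in the ordered sequence of queries. This is only one interpretation: `difference_terms` is a hypothetical name, and a term is assumed to qualify when any pair of adjacent elements differs by more than the given amount.

```python
def difference_terms(vectors, min_difference):
    """Keep terms with at least one jump between adjacent elements larger than min_difference."""
    return {t: v for t, v in vectors.items()
            if any(abs(later - earlier) > min_difference
                   for earlier, later in zip(v, v[1:]))}
```

For the ordered profiles "sedan", "sports", "RV", a vector such as `[0, 1, 6]` would flag a term that rises sharply only at the third profile.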
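[0069]
As described above, in this embodiment the extraction target documents are determined for each viewpoint profile, the evaluation of each term for that viewpoint profile is calculated, and this is repeated for a plurality of viewpoint profiles so that each term is evaluated as a numerical vector over the plurality of viewpoint profiles. Because the term extraction target documents are specified from the standpoint the operator wants, terms are extracted from documents related to the operator's interest, and terms closely related to that interest can therefore be extracted. In addition, because the total number of terms can be narrowed down, term analysis becomes easier.
[0070]
Furthermore, representing each extracted term as a vector over a plurality of viewpoint profiles allows the terms to be evaluated from multiple angles, from the standpoints the operator wants. This makes it easier to discover unknown knowledge.
[0071]
This embodiment uses a pattern filter that designates a pattern of the vector elements of each term. This makes it possible to extract, for example, the common terms, unique terms, singular terms, and difference terms described above.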
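[0072]
4. Trend analysis
In the above embodiment, different viewpoint profiles were used as the plurality of search conditions. However, a plurality of search conditions may instead be formed from a single viewpoint profile and the creation periods of the documents. For example, as shown in FIG. 14, the viewpoint profile "recession" is selected, and weights for the related keywords of this viewpoint profile are set in the weighting dialog 83. In this case, all related keywords are set to a weight of "1". In addition, the period of interest is set with a start date of 1997/1/1, a period length of one month, and a repetition count of 5.
[0073]
FIG. 15 shows the result of execution under these conditions. The term extraction documents are displayed in date order in the display area 77. The terms extracted from the term extraction documents are displayed in the display area 72, sorted by their degree of relevance. In this embodiment, the sorting is performed according to the flowchart shown in FIG. 16.
[0074]
The CPU 23 first initializes the period number n (step S71 in FIG. 16). It sorts the terms by their n-th value (step S73). In this case, the terms are sorted in descending order of their value for the 0th period (1/1 to 1/31).
[0075]
Next, the CPU 23 extracts the terms whose value for the n-th period is "0" and sorts them by their value for the (n+1)-th period (step S75). That is, in this case, the terms whose value for the 0th period is "0" and whose value for the 1st period is not "0" are sorted in descending order of their value for the 1st period.
[0076]
The CPU 23 determines whether all periods up to the final period have been examined (step S77). If the final period has not been reached, it increments the period number n (step S79) and repeats the processing from step S75 onward. When all periods up to the final period have been examined, the processing ends.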
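The sorting procedure of FIG. 16 can be sketched as a sequence of stable sorts over a shrinking tail of the list. The sketch is illustrative only: `trend_sort` is a hypothetical name, all values are assumed to be non-negative so that zero-valued terms end up at the tail after each descending sort, and ties are left in their existing order.

```python
def trend_sort(vectors):
    """Order terms so that those first appearing in period 0 come first, then period 1, etc."""
    terms = sorted(vectors, key=lambda t: vectors[t][0], reverse=True)  # step S73
    periods = len(vectors[terms[0]]) if terms else 0
    start = 0  # index of the first term whose values are 0 in all periods examined so far
    for n in range(periods - 1):                                        # steps S75 to S79
        # Skip the terms that are non-zero in period n; the rest form the zero-valued tail.
        while start < len(terms) and vectors[terms[start]][n] != 0:
            start += 1
        # Re-sort only that tail by the value of the next period, largest first.
        terms[start:] = sorted(terms[start:],
                               key=lambda t: vectors[t][n + 1], reverse=True)
    return terms
```

After sorting, the terms that first appear in a given period form a contiguous block, which is what makes newly appearing terms easy to spot.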
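[0077]
Sorting in this way makes it easier to find terms that appear for the first time in the period of interest. For example, in FIG. 15, if the period of interest is the 2nd period (3/1 to 3/31), it can be seen that the term "souvenir and marine products wholesale" did not appear in the preceding period, 2/1 to 2/28, and appeared for the first time in the period of interest, 3/1 to 3/31.
[0078]
A string filter or a pattern filter may be used together with this term analysis. FIGS. 17 and 18 show the analysis result when such filtering conditions are set. In FIG. 17, the viewpoint profile "株" (stock) is selected, and weights for the related keywords of this viewpoint profile are set in the weighting dialog 63. In this case, all weights are "1". In addition, the period of interest is set in the period setting dialog 67. In this case, the start date is 1997/1/1, the period length is three months, and the repetition count is 4.
[0079]
FIG. 18 shows the result of execution under these conditions. The term extraction documents are displayed in date order in the display area 77. The terms extracted from the term extraction documents are displayed in the display area 72. In this case, the character string "株" is additionally set as a string filter and "monotonic increase" is set as a pattern filter, so that, of the extracted terms, those that contain the character string "株" and whose values increase monotonically from one three-month period to the next are displayed.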
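Combining a string filter with a "monotonic increase" pattern filter can be sketched as follows. The function names are hypothetical, and "monotonic increase" is assumed here to mean strictly increasing from each period to the next; a non-decreasing reading would replace `<` with `<=`.

```python
def monotonically_increasing(vector):
    """Assumed pattern: every value is strictly larger than the one before it."""
    return all(earlier < later for earlier, later in zip(vector, vector[1:]))


def increasing_terms_containing(vectors, substring):
    """Apply the string filter and the monotonic-increase pattern filter together."""
    return {t: v for t, v in vectors.items()
            if substring in t and monotonically_increasing(v)}
```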
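[0080]
Trend analysis of this kind makes it possible to extract terms that appear in common in the documents of every period, terms that appear only in the documents of a specific period, and terms that newly arise or disappear over time. It is useful, for example, for extracting events that gradually increase or decrease.
[0081]
To make such trend analysis easier to grasp visually, the results may be displayed as a graph (such as a line graph), either for each term or for several terms grouped together.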
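[0082]
5. Other embodiments
In the above embodiment, a pair of viewpoint profiles with opposite directions, "economic boom" and "recession", was used as the plurality of viewpoint profiles. However, the invention is not limited to this, and the user is free to set any plurality of viewpoint profiles.
[0083]
Although viewpoint profiles were given as the queries, several terms related to the user's interest may instead be given directly as keywords. Furthermore, when a natural-language sentence such as "I want to know about the relationship between stock prices and economic trends" is given, the sentence may be semantically analyzed and keywords set from it to generate the query. Constraints that restrict the query, such as a search time range or terms that must be included, may also be added.
[0084]
Because the document data subject to term extraction, and the number of such documents, differ for each query, a correction based on statistical data may be applied. For example, the term occurrence frequency may be divided by the number of documents containing the term or by the number of extracted documents.
[0085]
The numerical vector data may also be classified using conventional classification techniques. For example, the following three techniques can be used.
[0086]
1) Cluster analysis
When there are no classification criteria in advance, cluster analysis may be performed. After classification, the meaning of each group can also be inferred by extracting a typical example of the group. A typical example can be extracted by, for example, taking the term closest to the center of the group.
[0087]
Conventionally used techniques can be adopted as the clustering technique. For example, not only hierarchical cluster analysis but also non-hierarchical cluster analysis may be used.
[0088]
2) Classification
Several clusters are prepared in advance, and the terms are classified by assigning each numerical vector to the nearest cluster.
[0089]
3) Simple classification
Terms whose largest value falls on the same vector element are grouped into one cluster. As many classification clusters are generated as there are queries, and each classification cluster contains the terms related to the corresponding query.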
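The simple classification in 3) amounts to grouping terms by the index of their largest vector element. The sketch below is a minimal illustration with a hypothetical function name; when several elements tie for the maximum, it assumes the first such element decides the cluster, which the text does not specify.

```python
def classify_by_largest_element(vectors):
    """Group terms by the index (query number) of their largest vector element."""
    num_queries = len(next(iter(vectors.values()), []))
    clusters = {k: [] for k in range(num_queries)}  # one cluster per query
    for term, vector in vectors.items():
        # max() returns the first index among ties (assumed tie-breaking rule).
        largest = max(range(num_queries), key=vector.__getitem__)
        clusters[largest].append(term)
    return clusters
```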
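[0090]
Through such classification processing, the characteristic terms of a given classification group can be extracted and compared, which also makes it possible to discover new knowledge.
[0091]
In this embodiment, the extraction target documents are specified by the viewpoint profiles in which the user is interested, so relatively accurate term analysis is possible. Compared with extraction from all documents, the total number of terms is smaller, classification takes less time, and analysis by the user also becomes easier.
[0092]
In the above embodiment, a pair of viewpoint profiles with opposite directions was used as the plurality of viewpoint profiles. However, the invention is not limited to this, and viewpoint profiles that seem unrelated at first glance, for example "politics" and "birth rate", may be used.
[0093]
In this embodiment, "economic boom" and "recession" were used as viewpoint profiles representing states, but profiles representing changes, such as "increased production" and "increased profit", may be used. Using profiles that represent states or changes in this way makes data mining easier. The viewpoint profiles are not limited to these and may be anything the operator desires.
[0094]
The processing performed when determining the extraction documents, in which terms are extracted from each document and the document is vectorized, can be carried out in advance once the content of all documents is fixed. It may therefore be executed when a new document is added, and the result stored.
[0095]
In this embodiment, important terms are extracted from each document in the extraction document determination process. However, the terms used are not limited to these. For example, all terms may be extracted rather than only the important terms, or terms may be extracted according to some other extraction criterion.
[0096]
In this embodiment, the similarity between two numerical vectors is determined by calculating their inner product, but the cosine of the two numerical vectors may be used as the similarity instead.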
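The cosine alternative mentioned above normalizes the inner product by the lengths of the two vectors, so long documents are not favoured merely for containing more terms. The following sketch assumes sparse dictionary vectors and a hypothetical function name, and returns 0.0 when either vector has zero length.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for empty vectors
    return dot / (norm_u * norm_v)
```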
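[0097]
Although this embodiment has been described for documents in Japanese, it can be applied in the same way to other languages, for example English, Chinese, and Korean.
[0098]
In this embodiment, the functions shown in FIG. 1 are implemented in software using the CPU 23. However, some or all of them may be implemented in hardware such as logic circuits.
[0099]
As described above, the terms present in the documents are converted into numerical vectors of scores with respect to a plurality of viewpoint profiles in which the operator is interested. Compared with representing terms by values that depend on their occurrence frequency, this allows the operator to grasp easily the relationship between his or her own interests and each term. The terms converted into numerical vectors can also be analyzed using various statistical analysis techniques. Furthermore, it becomes easier to attach meaning to the analysis results.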
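[Brief Description of the Drawings]
[FIG. 1] A functional block diagram of the term evaluation device 1 according to the present invention.
[FIG. 2] A diagram showing an example of the hardware configuration of the term evaluation device shown in FIG. 1.
[FIG. 3] A diagram showing an example of a hardware configuration in which the computer system 40 shown in FIG. 2 is implemented using the CPU 23.
[FIG. 4] A diagram showing the correspondence between viewpoint profiles and related keywords.
[FIG. 5] An overall flowchart of the term evaluation process.
[FIG. 6] A detailed flowchart of the extraction document determination process.
[FIG. 7] A detailed flowchart of the statistic determination process.
[FIG. 8] An example of a dialog for determining queries.
[FIG. 9] An example of the display of extracted terms.
[FIG. 10] An example of the display of extracted terms.
[FIG. 11] An example of the display of extracted terms.
[FIG. 12] An example of the display of extracted terms.
[FIG. 13] The data structure of the dictionary filter.
[FIG. 14] A settings dialog for trend analysis.
[FIG. 15] A result of trend analysis.
[FIG. 16] A detailed flowchart of the sorting process.
[FIG. 17] A settings dialog for trend analysis.
[FIG. 18] A result of trend analysis.
[Description of Reference Numerals]
23 ... CPU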
27 ... Memory
[0001]
[Technical Field of the Invention]
The present invention relates to a term evaluation device for evaluating a term extracted from document data, and more particularly to a term evaluation method.
[0002]
[Background Art and Problems to be Solved by the Invention]
Today, attempts are being made to store large amounts of document information as data and to discover desired information and knowledge from that document information (hereinafter referred to as "mining"). One proposed technique is to extract terms from documents and perform the mining on the basis of the extracted terms.
[0003]
However, documents contain a miscellany of terms, and no established method existed for the criteria by which terms should be extracted. Consequently, the extracted terms could not be evaluated.
[0004]
An object of the present invention is to solve the above problem and to provide a term evaluation device, or a method therefor, that can evaluate terms extracted from a plurality of documents. A further object is to determine the term extraction target documents from among a plurality of documents.
[0005]
[Means for Solving the Problems and Effects of the Invention]
1) In the recording medium storing the program according to the present invention, the program causes the computer to execute the following processing. A) A plurality of corresponding related keywords are stored for each viewpoint profile, and when a plurality of viewpoint profiles are given, the following term extraction target document determination process and term statistic calculation process are executed for each viewpoint profile: a1) a term extraction target document determination process that uses the plurality of related words associated with the viewpoint profile to specify, from a plurality of document data, one or more documents to be subject to term extraction; a2) a term statistic calculation process that obtains statistics of the terms present in the specified term extraction target documents and calculates them as the term statistics for that viewpoint profile. B) For each term present in the specified term extraction target documents, a numerical vector is obtained whose elements are the term statistics for the respective viewpoint profiles and whose dimension equals the number of viewpoint profiles, and each term for which a numerical vector has been obtained is displayed together with its numerical vector.
In this way, determining the extraction target documents from the document data for each viewpoint profile makes it possible to extract terms from documents that match the plurality of related words associated with that viewpoint profile. In addition, by obtaining, for each term, a numerical vector whose elements are the term statistics for the respective viewpoint profiles and whose dimension equals the number of viewpoint profiles, and displaying each such term together with its numerical vector, terms extracted from a plurality of documents can be evaluated. Analysis in which the terms are sorted and classified also becomes possible.
[0006]
2) In the recording medium storing the program according to the present invention, terms for which only a specific vector element has a value at or above, or at or below, a predetermined threshold are extracted on the basis of the numerical vector of each term, and the extracted terms are displayed together with their numerical vectors. Terms unique to a specific interest of the operator can therefore be extracted and displayed on the basis of the vectors.
[0007]
3) In the recording medium storing the program according to the present invention, terms for which all of a plurality of specific vector elements have values at or above, or at or below, a predetermined threshold are extracted on the basis of the numerical vector of each term, and the extracted terms are displayed together with their numerical vectors. Terms common to the operator's specific interests can therefore be extracted and displayed.
[0008]
4) In the recording medium storing the program according to the present invention, terms for which a specific vector element value is singularly large or singularly small compared with other terms are extracted on the basis of the numerical vector of each term, and the extracted terms are displayed together with their numerical vectors. Terms that are particularly related to a specific interest of the operator can therefore be extracted from the extracted terms and displayed together with their numerical vectors.
[0009]
5) In the recording medium storing the program according to the present invention, the plurality of viewpoint profiles are viewpoint profiles whose ordering is meaningful, and terms for which the value of a certain vector element differs from the preceding and following vector elements by more than a predetermined difference are extracted on the basis of the numerical vector of each term, and the extracted terms are displayed together with their numerical vectors. Terms that change at a certain point in the time series for a single viewpoint profile can therefore be extracted and displayed.
[0010]
6) In the recording medium storing the program according to the present invention, predetermined character strings that are contained in terms and that express the intention of the document author are classified by intention and stored; when the operator selects one of the classifications, only terms containing the designated character strings are extracted, and the extracted terms are displayed together with their numerical vectors. Terms containing character strings that express the document author's intention can therefore be extracted and displayed. This makes it easy to extract terms that match the operator's intention. Examples of character strings that express the document author's intention include auxiliary verbs, suffixes, and counter words (classifiers).
[0011]
7) The term evaluation method according to the present invention is a method in which a computer evaluates the terms that make up document data, wherein the computer stores a plurality of corresponding related keywords for each viewpoint profile and executes the following processing. A) When a plurality of viewpoint profiles, each with a plurality of related words defined for it, are given, the following a1) term extraction target document determination process and a2) term statistic calculation process are performed for each viewpoint profile. a1) A term extraction target document determination process that determines the term extraction target documents by using the plurality of related words associated with the viewpoint profile to specify, from a plurality of document data, one or more documents to be subject to term extraction. a2) A term statistic calculation process that calculates statistics of the terms present in the specified term extraction target documents as the term statistics for that viewpoint profile. B) For each term, a numerical vector is obtained whose elements are the term statistics for the respective viewpoint profiles and whose dimension equals the number of viewpoint profiles, and each term for which a numerical vector has been obtained is displayed together with its numerical vector.
In this way, determining the extraction target documents from the document data for each viewpoint profile makes it possible to extract terms from documents that match the plurality of related words associated with that viewpoint profile. In addition, by obtaining, for each term, a numerical vector whose elements are the term statistics for the respective viewpoint profiles and whose dimension equals the number of viewpoint profiles, and displaying each such term together with its numerical vector, terms extracted from a plurality of documents can be evaluated. Analysis in which the terms are sorted and classified also becomes possible.
[0012]
8) In the method of evaluating the terms that make up document data according to the present invention, when a character string is given that is used attached to a verb or adjective and that expresses the document author's intention with respect to that verb or adjective, only terms containing the designated character string are extracted, and the extracted terms are displayed together with their numerical vectors. Terms containing a character string that expresses the document author's intention can therefore be extracted. This makes it easy to extract terms that match the operator's intention.
[0013]
9) In the term evaluation device according to the present invention, the term extraction target document specifying means stores a plurality of corresponding related keywords for each viewpoint profile and, when a plurality of viewpoint profiles are given, uses the corresponding related keywords of each given viewpoint profile to specify, from a plurality of document data, one or more documents to be subject to term extraction. The term extraction target document specifying means specifies, for each viewpoint profile and on the basis of the given viewpoint profile, one or more documents to be subject to term extraction from the plurality of document data. The term statistic calculation means calculates statistics of the terms present in the specified term extraction target documents and outputs them as the term statistics for that viewpoint profile. The evaluation means obtains, for each term, a numerical vector whose dimension equals the number of viewpoint profiles and whose elements are the term statistics for the respective viewpoint profiles, and each term for which a numerical vector has been obtained is displayed together with its numerical vector.
In this way, determining the extraction target documents from the document data for each viewpoint profile makes it possible to extract terms from documents that match the plurality of related words associated with that viewpoint profile. In addition, for each term, a numerical vector is obtained whose elements are the term statistics for the respective viewpoint profiles and whose dimension equals the number of viewpoint profiles, and each such term is displayed together with its numerical vector. Because each term is thereby expressed as statistics taken from the standpoint of the plurality of viewpoint profiles, various numerical operations become possible. As a result, terms extracted from a plurality of documents can be evaluated. Analysis in which the terms are sorted and classified also becomes possible.
10) In the term evaluation device according to the present invention, the display means extracts, on the basis of the numerical vector of each term, terms for which only a specific vector element has a value at or above, or at or below, a predetermined threshold, and displays the extracted terms together with their numerical vectors. Terms unique to a specific interest of the operator can therefore be extracted and displayed on the basis of the vectors.
11) In the term evaluation device according to the present invention, the display means extracts, on the basis of the numerical vector of each term, terms for which all of a plurality of specific vector elements have values at or above, or at or below, a predetermined threshold, and displays the extracted terms together with their numerical vectors. Terms common to the operator's specific interests can therefore be extracted and displayed.
12) In the term evaluation device according to the present invention, the display means extracts, on the basis of the numerical vector of each term, terms for which a specific vector element value is singularly large or singularly small compared with other terms, and displays the extracted terms together with their numerical vectors. Terms that are particularly related to a specific interest of the operator can therefore be extracted from the extracted terms and displayed together with their numerical vectors.
13) In the term evaluation device according to the present invention, the plurality of viewpoint profiles are viewpoint profiles whose ordering is meaningful, and the display means extracts, on the basis of the numerical vector of each term, terms for which the value of a certain vector element differs from the preceding and following vector elements by more than a predetermined difference, and displays the extracted terms together with their numerical vectors. Terms that change at a certain point in the time series for a single viewpoint profile can therefore be extracted and displayed.
14) In the term evaluation device according to the present invention, predetermined character strings that are contained in terms and that express the intention of the document author are classified by intention and stored; when the operator designates one of the classifications, the display means extracts only terms containing the designated character strings and displays the extracted terms together with their numerical vectors. Terms containing character strings that express the document author's intention can therefore be extracted and displayed. This makes it easy to extract terms that match the operator's intention. Examples of character strings that express the document author's intention include auxiliary verbs, suffixes, and counter words (classifiers).
[0020]
[Embodiments of the Invention]
1. Description of the functional block diagram
An embodiment of the present invention will be described with reference to the drawings. The term evaluation device 1 shown in FIG. 1 comprises document storage means 3, term statistic calculation means 5, evaluation means 7, analysis means 8, and notification means 9.
[0021]
The document storage means 3 stores a plurality of documents. The term statistic calculation means 5 specifies, on the basis of a given search condition, one or more documents to be subject to term extraction from the plurality of document data stored in the document storage means 3, and calculates statistics of the terms present in the specified term extraction target documents as the term statistics for that search condition. When term statistics for a plurality of different search conditions are given by the term statistic calculation means 5, the evaluation value determination means 7 determines an evaluation value for each term as a numerical vector whose dimension equals the number of search conditions given.
[0022]
The analysis means 8 analyzes the term based on the evaluation value of each term. The notification means 9 notifies the analysis result.
[0023]
In this embodiment, the term statistic calculation means 5 stores a plurality of pieces of viewpoint profile correspondence information, each composed of a viewpoint profile and a plurality of related words related to that viewpoint profile. When a viewpoint profile is given as a search condition, the term extraction target documents are specified using the plurality of related words.
[0024]
In this way, by determining the term extraction target documents with a viewpoint profile, terms are extracted in accordance with the viewpoint profile in which the operator is interested. Also, representing the evaluation value of each term as a numerical vector makes it possible to analyze the terms.
[0025]
In the present embodiment, the display means is employed as the notification means, but other notification means may be employed.
[0026]
2. Hardware configuration
(2.1) Outline
A hardware configuration of the term evaluation device 1 shown in FIG. 1 will be described. The computer system 40 shown in FIG. 2 includes an input device 41, a control device 43, a display device 45, and a storage device 47. The input device 41 is for inputting various commands. The storage device 47 stores a program for performing a predetermined process based on a given command. The control device 43 performs predetermined data processing based on a program stored in the storage device 47.
[0027]
(2.2) Details
FIG. 3 shows an example of a hardware configuration in which the computer system 40 shown in FIG. 2 is realized using a CPU.
[0028]
The computer system 40 includes a CPU 23, a memory 27, a hard disk 26, a CRT 30, an FDD 25, a keyboard 28, a mouse 31, and a bus line 29. The CPU 23 controls each unit via the bus line 29 according to a control program stored in the hard disk 26.
[0029]
This control program is read via the FDD 25 from the flexible disk 25a storing the program and is installed in the hard disk 26. Instead of a flexible disk, the program may be installed on the hard disk from another computer-readable recording medium that physically holds the program, such as a CD-ROM or an IC card. Furthermore, it may be downloaded using a communication line.
[0030]
In the present embodiment, the program stored in the flexible disk is executed by the computer indirectly, by installing it from the flexible disk to the hard disk 26. However, the present invention is not limited to this, and the program stored in the flexible disk may be executed directly from the FDD 25. Note that programs executable by a computer include not only those that can be executed directly once installed, but also those that must first be converted to another form (for example, compressed programs that must be decompressed), and those that are executable only in combination with other module parts.
[0031]
The hard disk 26 includes a program storage unit 26a, a correspondence table storage unit 26b, a document storage unit 26c, and an evaluation storage unit 26d. The program storage unit 26a stores a program to be described later. The correspondence table storage unit 26b stores a plurality of viewpoint profiles as shown in FIG. Each viewpoint profile stores a plurality of corresponding keywords. A plurality of documents to be evaluated are stored in the document storage unit 26c. In the present embodiment, each document includes a creation date and time, a title of the document, and contents of each document. The evaluation storage unit 26d stores evaluation results for each term. In addition, the memory 27 stores various calculation results and the like.
[0032]
3. Flowchart
Next, the program stored in the program storage unit 26a of the hard disk 26 will be described with reference to the flowcharts of FIGS. In the following, an example will be described in which each term is evaluated using “boom” and “recession” as the viewpoint profiles.
[0033]
First, the operator inputs a plurality of queries. In the present embodiment, since viewpoint profiles are used as the queries, a selection box 62 as shown in FIG. 8 may be displayed on the CRT 30 so that the viewpoint profiles “boom” and “recession” can be selected. FIG. 8 shows the state in which the viewpoint profile “recession” has been selected after the viewpoint profile “boom”.
[0034]
If the viewpoint profile does not exist in the selection box 62 shown in FIG. 8, the button 64 may be clicked to add the required viewpoint profile and the corresponding related keyword. Thereby, it is added to the correspondence table shown in FIG.
[0035]
Note that, for an existing viewpoint profile, related keywords may be added or deleted in the related keyword box 63. In this way, by creating a viewpoint profile from the viewpoint in which the operator is interested and specifying the term extraction target documents with it, documents can be extracted in accordance with the viewpoint desired by the operator.
[0036]
When the CPU 23 determines that a plurality of queries have been input (step S1 in FIG. 5), it initializes the query number k (step S3). Then, the extraction target documents are determined for the viewpoint profile “boom”, which is the 0th query (step S5). The extraction target document determination process will be described with reference to FIG. In the present embodiment, as described below, the similarity between each document and the viewpoint profile is calculated, and documents whose similarity exceeds a predetermined threshold are determined to be term extraction target documents.
[0037]
The CPU 23 initializes the document number m (step S50). It then determines whether or not the vectorization process has been completed for every document (step S51). In this case, since the vectorization process has not been completed, the process proceeds to step S53, where morphological analysis is performed on each document and the words (terms) appearing in the document are extracted. Then, important terms are determined from the extracted terms using the tfidf method (step S55). The tfidf method is a method for determining keywords in information retrieval. It weights a term using tf (term frequency), which indicates how frequently the term appears in a document, and idf (inverse document frequency), which indicates how rare the term is, that is, in how few of all the documents it appears.
[0038]
Using the important terms extracted in this way, each document is vectorized as a numerical vector having the same number of dimensions as the number of important terms. For example, when there are 100 important terms, a 100-dimensional numerical vector is obtained. Some documents may not include some of the important terms that have been determined. In that case, the value of the corresponding dimension for that document is 0, so the vector is sparse. The CPU 23 stores the numerical vectors thus obtained in the memory 27.
[0039]
Next, the CPU 23 vectorizes the viewpoint profile (step S59). In the present embodiment, all the related keywords of the kth viewpoint profile are read from the hard disk 26 and vectorized as a numerical vector having the same dimension as the number of each related keyword in the viewpoint profile. In this embodiment, the tfidf value for each related keyword is obtained from the tf of each related keyword in the viewpoint profile and the idf in all documents, and is vectorized as a numerical vector having the same dimension as the number of the related keywords.
[0040]
Next, the similarity between the viewpoint profile and the mth document is calculated (step S61). Such similarity can be obtained by calculating the inner product of the numerical vector obtained in step S57 and the numerical vector obtained in step S59.
[0041]
Next, the CPU 23 determines whether or not the similarity calculation has been completed for all documents (step S63). If the similarity calculation has not been completed for all the documents, the document number m is incremented (step S65), and the processing from step S61 is performed.
[0042]
When the similarity calculation is completed for all documents in step S63, a document exceeding a predetermined threshold is determined as an extracted document (step S67).
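The document selection in steps S50 to S67 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the particular idf formula (log(N/df) + 1), and the use of all terms rather than only the important terms are assumptions made for brevity, and documents are assumed to be already split into terms by morphological analysis.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns (idf table, one sparse tf-idf dict per document)."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))                        # document frequency of each term
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}  # assumed idf variant
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return idf, vectors

def inner_product(u, v):
    """Inner product of two sparse vectors stored as dicts."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def select_documents(docs, profile_keywords, threshold):
    """Return the indices of documents whose similarity to the profile exceeds threshold."""
    idf, vectors = tfidf_vectors(docs)
    tf = Counter(profile_keywords)
    # Keywords that occur in no document get weight 0.
    profile = {t: tf[t] * idf.get(t, 0.0) for t in tf}
    return [m for m, vec in enumerate(vectors)
            if inner_product(profile, vec) > threshold]

docs = [["boom", "sales", "rise"],
        ["recession", "sales", "fall"],
        ["weather", "rain"]]
selected = select_documents(docs, ["boom", "rise"], threshold=0.5)
```

With this hypothetical data only the first document shares terms with the profile, so `selected` contains only index 0.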
[0043]
Next, the CPU 23 calculates a statistical amount of terms existing in the extraction target document. Such calculation processing will be described with reference to FIG.
[0044]
The CPU 23 initializes the target document number i and the target term number j (step S25). The i-th document is set as the document of interest (step S27). The j-th term of the document of interest is set as the term of interest (step S29).
[0045]
The CPU 23 determines whether or not the term of interest is the term that first appears in the document of interest (step S31). If the term first appears in the document of interest, the term appearing document number bi is incremented (step S33).
[0046]
The CPU 23 determines whether or not the term of interest exists in an extraction term table (not shown) in the memory 27 (step S35). If it already exists, the term appearance frequency ti is incremented (step S37). On the other hand, if it does not exist, it is added to the extraction term table (step S39).
[0047]
The CPU 23 determines whether or not the term of interest is the final term (step S41). If it is not the final term, the target term number j is incremented (step S43), and the processing from step S29 is repeated. On the other hand, if it is the final term, it is determined whether or not the term extraction has been completed for all documents (step S45). If there is a document that has not been term-extracted, the document number i of interest is incremented (step S47), the term of interest number j is initialized (step S49), and the processes in and after step S27 are repeated. Thereby, the number of appearance documents bi and the appearance frequency ti are obtained for each term.
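The counting loop of steps S25 to S49 can be sketched as follows. This is an illustration only: the function and variable names are hypothetical, and it assumes that the appearance frequency counts every occurrence of a term, including the first one, which the flowchart description does not state explicitly.

```python
from collections import Counter

def term_statistics(documents):
    """documents: list of token lists (the term extraction target documents).

    Returns (appearance_docs, frequency), where appearance_docs[term] is the
    number of documents containing the term (b) and frequency[term] is the
    total number of occurrences (t).
    """
    appearance_docs = Counter()
    frequency = Counter()
    for tokens in documents:
        seen = set()                      # terms already met in this document
        for term in tokens:
            if term not in seen:          # first appearance in this document
                seen.add(term)
                appearance_docs[term] += 1
            frequency[term] += 1
    return appearance_docs, frequency

b, t = term_statistics([["stock", "price", "stock"], ["stock", "bank"]])
```

Here "stock" appears in two documents and three times overall, so `b["stock"]` is 2 and `t["stock"]` is 3.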
[0048]
Next, the CPU 23 determines whether or not processing has been completed for all queries (step S9 in FIG. 5). If the processing has not been completed for all the queries, the CPU 23 increments the query number k (step S11) and repeats the processing from step S5.
[0049]
When processing is completed for all queries, the CPU 23 converts each term into a numerical vector with one element per query (step S13). In this embodiment, any one of the following can be selected as the vector element: 1) the term appearance frequency ti, 2) the term appearance document number bi, 3) the term appearance frequency ti / the term appearance document number bi, or 4) the degree of relevance.
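As a rough sketch of step S13, the per-query statistics can be assembled into one vector per term as shown below. The function name and the `element` option names are hypothetical, and option 4) is omitted because the degree of relevance depends on the evaluation algorithm.

```python
def term_vectors(stats_per_query, element="docs"):
    """stats_per_query: list of (appearance_docs, frequency) dict pairs, one pair per query.

    Returns {term: [value for query 0, value for query 1, ...]}.
    """
    terms = set()
    for b, _ in stats_per_query:
        terms.update(b)
    vectors = {}
    for term in sorted(terms):
        vec = []
        for b, t in stats_per_query:
            if element == "freq":              # 1) appearance frequency
                vec.append(t.get(term, 0))
            elif element == "docs":            # 2) number of appearance documents
                vec.append(b.get(term, 0))
            elif element == "freq_per_doc":    # 3) frequency / number of documents
                vec.append(t.get(term, 0) / b[term] if b.get(term) else 0.0)
            else:
                raise ValueError(f"unknown element type: {element}")
        vectors[term] = vec
    return vectors

stats = [({"bank": 2, "stock": 1}, {"bank": 4, "stock": 1}),   # query 0
         ({"stock": 3}, {"stock": 6})]                          # query 1
doc_vectors = term_vectors(stats, "docs")
ratio_vectors = term_vectors(stats, "freq_per_doc")
```

A term that was not extracted for a query simply gets 0 in that query's position, so every vector has as many dimensions as there are queries.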
[0050]
The degree of relevance is a numerical value indicating how strongly a term is related, within the documents, to the related word group constituting the viewpoint profile, and it varies depending on the evaluation algorithm. In this embodiment, it is determined based on the term appearance document number bi and the rarity of the term in all documents. Accordingly, the larger the term appearance document number, the higher the degree of relevance, and the smaller the term appearance document number, the lower the degree of relevance. Further, a term that appears in few documents other than the term extraction target documents but exists in the term extraction target documents has a high degree of relevance. That is, the degree of relevance is high when the term appears in many of the term extraction target documents and in few documents other than them.
[0051]
Note that 1) the term appearance frequency ti and 3) the term appearance frequency ti / the term appearance document number bi are considered useful for extracting terms that co-occur with the keywords used in the query set, that is, for extracting events related to the operator's interest. Also, 2) the term appearance document number bi is considered useful for extracting the main topics related to the operator's interest and how often they occur. 4) The degree of relevance is considered useful for extracting events that have a rare relationship with the operator's interest.
[0052]
FIG. 9 shows an example in which each term is represented by a numerical vector. In this case, the degree of relevance is obtained as the vector element for the viewpoint profiles “boom” and “recession”, and the terms whose values are equal to or greater than the threshold value 0 (that is, all terms) are displayed in the display area 72. The vector element and the threshold value can be set as follows: when the threshold value setting button 71 is clicked with the mouse, a threshold value setting dialog 74 is displayed, in which the operator can set the desired conditions.
[0053]
FIG. 10 shows a display example when common terms are extracted by changing the setting conditions. In FIG. 10, the term appearance document number bi is used as the vector element, and the terms whose values are equal to or greater than the threshold value 1 for all queries are displayed in the display area 72. A term for which all of a plurality of designated vector elements have values equal to or greater than, or equal to or less than, a threshold value designated by the operator is referred to as a common term.
[0054]
The common term is a term that is extracted in common by all queries. When the queries are viewpoint profiles that represent the user's interests as in the present embodiment, the common term is in many cases a term common to those interests.
[0055]
FIG. 11 shows a display example when unique terms are extracted by changing the setting conditions. In FIG. 11, the term appearance document number bi is used as the vector element, and the terms whose values are equal to or greater than the threshold value 1 for only one query are displayed in the display area 72. A term for which only a specific vector element has a value equal to or greater than, or equal to or less than, a threshold value designated by the operator is referred to as a unique term.
[0056]
The unique term is a term that is extracted only by the corresponding query. When the queries are viewpoint profiles that represent the user's interests as in this embodiment, the unique term is in many cases a term specific to one particular interest of the user.
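The two conditions can be expressed compactly in code. The following Python sketch is illustrative only: the function names and sample vectors are hypothetical, and it implements only the “equal to or greater than the threshold” direction used in FIGS. 10 and 11.

```python
def common_terms(vectors, threshold, indices=None):
    """Terms whose designated elements (default: all elements) are all >= threshold."""
    result = {}
    for term, vec in vectors.items():
        idx = range(len(vec)) if indices is None else indices
        if all(vec[i] >= threshold for i in idx):
            result[term] = vec
    return result

def unique_terms(vectors, threshold, index):
    """Terms for which only the element at `index` is >= threshold."""
    return {term: vec for term, vec in vectors.items()
            if vec[index] >= threshold
            and all(v < threshold for i, v in enumerate(vec) if i != index)}

# Hypothetical vectors: element 0 = first query, element 1 = second query.
vectors = {"bank": [2, 0], "stock": [1, 3], "export": [0, 4]}
common = common_terms(vectors, threshold=1)
only_first = unique_terms(vectors, threshold=1, index=0)
```

With a threshold of 1, "stock" is a common term because both elements reach the threshold, while "bank" is unique to the first query.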
[0057]
Next, the CPU 23 extracts feature terms (step S15 in FIG. 5). The feature terms may be extracted by the operator providing an extraction criterion. In this embodiment, a term including a specific character string is extracted. In this case, feature term extraction is performed as follows by setting filtering conditions.
[0058]
FIG. 12 shows a display example when only terms including the character string “company” are extracted. In this case, in order to set the filtering condition, when the operator clicks the filter button 75, a filtering dialog 76 is displayed. Therefore, “company” may be set as the character string filter.
[0059]
As a filtering condition, in addition to the character string filter, a dictionary filter and a pattern filter can be set alone or in combination.
[0060]
A dictionary filter is a glossary that stores one or more terms desired by the operator. In the present embodiment, the auxiliary verbs corresponding to each intention of the user are stored in advance in the dictionary filter, as shown in FIG. 13, so that the operator can select them. Specifically, when the operator selects “hope” as the intention, the character strings stored for that intention, such as “tai” (“want to”), are applied as character string filters combined under an OR condition. As a result, terms including “tai”, for example terms such as “want to know” and “want to check”, can be extracted.
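A dictionary filter of this kind amounts to an OR over substring tests. The sketch below is a hypothetical illustration: the dictionary contents, the romanized sample terms, and the function name are assumptions, not data from FIG. 13.

```python
# Hypothetical dictionary: intention -> character strings expressing it.
# "tai" is the string mentioned in the text; "hoshii" is an invented second entry.
INTENT_DICTIONARY = {"hope": ["tai", "hoshii"]}

def dictionary_filter(vectors, intention, dictionary=INTENT_DICTIONARY):
    """Keep terms that contain at least one string registered for the intention (OR condition)."""
    strings = dictionary[intention]
    return {term: vec for term, vec in vectors.items()
            if any(s in term for s in strings)}

# Romanized sample terms: "shiritai" (want to know), "shirabetai" (want to check),
# "kabuka" (stock price).
vectors = {"shiritai": [1, 0], "shirabetai": [0, 2], "kabuka": [3, 3]}
hopes = dictionary_filter(vectors, "hope")
```

Only the two terms containing "tai" pass the filter, and each keeps its numerical vector for display.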
[0061]
In this way, mining is facilitated by previously creating a dictionary of auxiliary verbs that serve as guidelines for mining.
[0062]
Further, one or more suffixes may be stored in advance in the dictionary filter for each user's intention so that they can be selected. A suffix is an element that attaches to the end of a word to add meaning or to determine its part of speech. Examples are the Japanese suffixes “-teki” and “-sei”. As a result, terms such as “specific” and “innovative” can be extracted. It is then possible to know what word the term relates to in the document.
[0063]
Further, one or more classifiers may be stored in advance in the dictionary filter for each user's intention so that they can be selected. For example, “Yen”, “MHz”, and the like. Thereby, since a term composed of a number and a classifier can be extracted, it is possible to know what word the term relates to in the document. In particular, in the case of a price, it is possible to set a threshold value for the price and extract only information on products with a price lower than or higher than that.
[0064]
Note that such a dictionary filter may further have a hierarchical structure for each user's intention.
[0065]
In this embodiment, auxiliary verbs corresponding to the user's intentions are stored in advance in the dictionary filter and used. However, even without such a dictionary filter, the user may give these character strings directly to the character string filter for filtering.
[0066]
In addition to the above, the feature terms can be extracted based on the following criteria.
[0067]
1) Singular term: A term whose vector element value is specifically larger or smaller than other terms
A singular term is a term in which the difference between vector element values is larger than a predetermined threshold. Although it is not as distinctive as a unique term, it is often a term related to a specific interest of the user. For example, a term is extracted as a singular term not only when its three vector element values are “1, 1, 6” and the threshold value is “5”, but also when they are “1, 3, 6” and the threshold value is “5”.
[0068]
2) Difference term: For queries that have meaning in permutation, a term with a large difference between a query and the query before and after it.
For example, when the first viewpoint profile “sedan”, the second viewpoint profile “sports”, the third viewpoint profile “RV”, and so on are used, the difference terms make it possible to extract new topics and the like that co-occur with the change. Note that the difference term is also effective when the extraction period is limited, as described later.
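The two criteria above might be tested as in the following Python sketch. Both tests rest on assumptions: the singular-term test treats the spread between the largest and smallest element as the “difference” and uses “greater than or equal to”, a reading chosen so that both “1, 1, 6” and “1, 3, 6” qualify at threshold 5; the difference-term test only examines elements that have both a preceding and a following element. The function names are hypothetical.

```python
def is_singular(vec, threshold):
    """Assumed test: the spread between the largest and smallest element reaches threshold."""
    return max(vec) - min(vec) >= threshold

def difference_positions(vec, min_diff):
    """Indices whose value differs from BOTH neighbouring elements by more than min_diff.

    Assumption: only interior elements (those with a preceding and a following
    element) are examined, since the queries are ordered.
    """
    return [i for i in range(1, len(vec) - 1)
            if abs(vec[i] - vec[i - 1]) > min_diff
            and abs(vec[i] - vec[i + 1]) > min_diff]

singular_a = is_singular([1, 1, 6], 5)
singular_b = is_singular([1, 3, 6], 5)
peaks = difference_positions([1, 7, 2], 3)
```

For the ordered vector [1, 7, 2], the middle element differs from both neighbours by more than 3, so position 1 is reported.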
[0069]
As described above, in this embodiment, the extraction target documents are determined for each viewpoint profile, the evaluation of each term in that viewpoint profile is calculated, and this is repeated for a plurality of viewpoint profiles, so that each term is evaluated by a numerical vector over the plurality of viewpoint profiles. By specifying the term extraction target documents from the viewpoint desired by the operator, terms are extracted from documents related to the operator's interest, so that terms deeply related to the operator's interest can be extracted. Moreover, since the total number of terms can be narrowed down, term analysis becomes easier.
[0070]
In addition, by expressing each extracted term as a vector over multiple viewpoint profiles, the term can be evaluated from multiple perspectives. Therefore, discovery of unknown knowledge is facilitated.
[0071]
In the present embodiment, a pattern filter that designates the pattern of the vector elements of each term is employed. Thereby, for example, the common terms, unique terms, singular terms, and difference terms can be extracted.
[0072]
4. Trend analysis
In the above embodiment, different viewpoint profiles are used as the plurality of search conditions. However, the plurality of search conditions may instead be formed from one viewpoint profile and the creation periods of the documents. For example, as shown in FIG. 14, the viewpoint profile “recession” is selected, and weights for the related keywords of this viewpoint profile are set in the weighting dialog 83. In this case, all the related keywords are given the weight “1”. Furthermore, five periods of interest of one month each are set, starting from 1997/1/1.
[0073]
An execution result under the above conditions is shown in FIG. In the display area 77, term-extracted documents are displayed in order of date. In the display area 72, the terms are sorted and displayed according to the degree of relevance of terms extracted from the term extraction document. In the present embodiment, rearrangement is performed based on the flowchart shown in FIG.
[0074]
First, the CPU 23 initializes the period number n (step S71 in FIG. 16). The terms are then rearranged by their values in the nth period (step S73). In this case, the terms are rearranged in descending order of their values in the 0th period (1/1 to 1/31).
[0075]
Next, the CPU 23 extracts terms whose value in the nth period is “0” and sorts them by the value in the (n + 1) th period (step S75). That is, in this case, terms whose value in the 0th period is “0” and whose value in the 1st period is not “0” are rearranged in descending order of the value in the 1st period.
[0076]
The CPU 23 determines whether or not the examination has been completed up to the final period (step S77). If it is not the final period, the period number n is incremented (step S79), and the processes after step S75 are repeated. On the other hand, when the examination is completed until the final period, the process is terminated.
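The net effect of steps S71 to S79 can be approximated with a single sort key, as in the Python sketch below. This is an equivalent reading rather than a transcription of the flowchart: terms are grouped by the first period in which their value is not 0 and ordered within each group by that value in descending order. Placing terms that are 0 in every period last, and breaking ties alphabetically, are assumptions.

```python
def trend_sort(vectors):
    """Order terms by the period of first appearance, then by value in that period (descending)."""
    def key(item):
        term, vec = item
        for n, value in enumerate(vec):
            if value != 0:
                return (n, -value, term)   # first non-zero period, larger values first
        return (len(vec), 0, term)         # never appears: placed at the end (assumption)
    return [term for term, _ in sorted(vectors.items(), key=key)]

# Hypothetical values for three consecutive periods.
vectors = {"a": [3, 0, 1], "b": [5, 1, 0], "c": [0, 2, 0],
           "d": [0, 4, 1], "e": [0, 0, 7]}
ordered = trend_sort(vectors)
```

Terms present in period 0 come first ("b", then "a"), followed by those that first appear in period 1 ("d", then "c"), and finally "e", which first appears in period 2.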
[0077]
Such rearrangement makes it easier to find terms that first appear during the period of interest. For example, in FIG. 15, if the period of interest is the second period (3/1 to 3/31), it can be seen that the term “souvenir / fishery wholesale” does not appear in the preceding period 2/1 to 2/28 and first appears in the period of interest 3/1 to 3/31.
[0078]
A character string filter and a pattern filter may be used together in this analysis. FIGS. 17 and 18 show the settings and the analysis result when such filtering conditions are set. In FIG. 17, the viewpoint profile “stock” is selected, and weights for the related keywords of this viewpoint profile are set in the weighting dialog 63. In this case, the weights are all “1”. Further, the periods of interest are set in the period setting dialog 67. In this case, four periods of three months each are set, starting from 1997/1/1.
[0079]
An execution result under the above conditions is shown in FIG. In the display area 77, term-extracted documents are displayed in order of date. In the display area 72, terms extracted from the term extraction document are displayed. In this case, as the filtering conditions, the character string “stock” is set as the character string filter and “monotonically increasing” is set as the pattern filter. Thus, among the extracted terms, those that include the character string “stock” and whose values increase monotonically are displayed.
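Combining a character string filter with a “monotonically increasing” pattern filter can be sketched as follows. The function names and sample values are hypothetical, and treating “monotonically increasing” as strictly increasing by default is an assumption.

```python
def monotonically_increasing(vec, strict=True):
    """True if each value is larger than (or, if not strict, at least) the previous one."""
    pairs = list(zip(vec, vec[1:]))
    if strict:
        return all(a < b for a, b in pairs)
    return all(a <= b for a, b in pairs)

def filter_terms(vectors, substring, pattern):
    """Keep terms that contain `substring` and whose vector satisfies `pattern`."""
    return {term: vec for term, vec in vectors.items()
            if substring in term and pattern(vec)}

# Hypothetical values for four consecutive three-month periods.
vectors = {"stock price": [1, 2, 4, 7],
           "stock option": [3, 2, 5, 6],
           "bond": [1, 2, 3, 4]}
rising = filter_terms(vectors, "stock", monotonically_increasing)
```

"stock option" is rejected by the pattern filter and "bond" by the character string filter, leaving only "stock price".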
[0080]
By such trend analysis, it is possible to extract terms that appear in common in documents in each period, terms that appear only in documents in a specific period, and terms that newly appear or disappear over time. For example, it is useful for extracting events that gradually increase or decrease.
[0081]
To make the results easier to grasp visually, such a trend analysis may be displayed as a graph (such as a line graph), either for each term or for several terms together.
[0082]
5. Other embodiments
In the above embodiment, a pair of viewpoint profiles with opposite meanings, “boom” and “recession”, is adopted as the plurality of viewpoint profiles. However, the present invention is not limited to this, and the user can freely set any viewpoint profiles.
[0083]
Although viewpoint profiles are given as the queries, some terms related to the user's interest may be given directly as keywords. For example, when a natural sentence such as “I want to know the relationship between stock prices and economic trends” is given, the natural sentence may be subjected to semantic analysis, keywords may be set, and a query may be generated. Furthermore, constraint conditions that restrict a query, such as a search time range or a term that must be included, may be added.
[0084]
In addition, since the document data and the number of terms to be extracted are different for each query, correction using statistical data may be performed. For example, the term appearance frequency may be divided by the number of documents including the term or the number of extracted documents.
[0085]
The numerical vector data may be classified using a conventional classification method. For example, the following three methods can be employed.
[0086]
1) Cluster analysis
If there is no classification criterion in advance, a cluster analysis may be performed. After classification, the meaning of each group can be estimated by extracting typical examples of each group. For example, a term that is closest to the center of the group may be extracted.
[0087]
As a clustering method, a conventionally used method can be employed. For example, not only hierarchical cluster analysis but also non-hierarchical cluster analysis may be performed.
[0088]
2) Classification
Several clusters are prepared in advance, and the terms are classified by assigning each numerical vector to the nearest cluster.
[0089]
3) Simple classification
Terms whose largest value lies in the same vector element form one cluster. The same number of classification clusters as the number of queries is generated, and the terms related to the corresponding query are classified into each classification cluster.
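Methods 2) and 3) can be illustrated with a short Python sketch. The names are hypothetical, Euclidean distance to a cluster centre is an assumed notion of “nearest”, and “simple classification” is read here as grouping each term under the query in which its vector element is largest, which is an interpretation of the text rather than a confirmed definition.

```python
import math

def nearest_cluster(vec, centres):
    """Method 2: index of the prepared cluster centre closest to vec (Euclidean distance assumed)."""
    return min(range(len(centres)), key=lambda c: math.dist(vec, centres[c]))

def simple_classification(vectors):
    """Method 3 (assumed reading): one cluster per query; a term joins the cluster of its largest element."""
    clusters = {}
    for term, vec in vectors.items():
        k = max(range(len(vec)), key=lambda i: vec[i])   # ties go to the earliest query
        clusters.setdefault(k, []).append(term)
    return clusters

vectors = {"bank": [2, 0], "stock": [1, 3], "export": [0, 4]}
assigned = nearest_cluster([1, 5], centres=[[0, 0], [2, 6]])
groups = simple_classification(vectors)
```

The vector [1, 5] is closer to the centre [2, 6] than to the origin, so it is assigned to cluster 1; in the simple classification, "bank" falls under query 0 and the other two terms under query 1.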
[0090]
By such classification processing, it becomes possible to discover new knowledge by extracting characteristic terms of a certain classification group and comparing them.
[0091]
In the present embodiment, since the extraction target document is specified by the viewpoint profile that the user is interested in, term analysis with relatively high accuracy is possible. In addition, the total number of terms is reduced compared to the case of extracting from all documents, classification is possible in a short time, and user analysis is facilitated.
[0092]
In the above embodiment, a pair of viewpoint profiles with opposite meanings is adopted as the plurality of viewpoint profiles. However, the present invention is not limited to this, and viewpoint profiles that appear unrelated at first glance, for example “politics” and “birth rate”, may be used.
[0093]
In the present embodiment, “boom” and “recession” are used as viewpoint profiles representing a state, but profiles representing a change, such as “increased production” and “increased profit”, may also be used. By using a profile representing such a state or change, data mining becomes easier. The viewpoint profile is not limited to these; any viewpoint profile desired by the operator may be used.
[0094]
Note that the process of extracting and vectorizing terms from each document when determining the extraction target documents can be performed in advance once the contents of all documents are fixed. It may therefore be executed, and its results stored, each time a new document is added.
[0095]
In the present embodiment, the important terms are extracted from each document in the extraction target document determination process. However, the present invention is not limited to this; for example, all terms rather than only the important terms may be extracted, or terms may be extracted based on some other extraction criterion.
[0096]
In the present embodiment, the similarity between two numerical vectors is determined by calculating the inner product of both numerical vectors, but the cosine value of both numerical vectors may be used as the similarity.
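The cosine alternative differs from the inner product only in that it divides by the lengths of the two vectors, so long documents are not favoured merely for their size. A minimal Python sketch follows; the function name is hypothetical, and returning 0.0 for an all-zero vector is an assumption.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numerical vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0   # assumption: similarity is 0 if either vector is all zeros

same_direction = cosine_similarity([1, 0], [2, 0])
orthogonal = cosine_similarity([1, 0], [0, 3])
```

Vectors pointing in the same direction score 1.0 regardless of their lengths, and vectors sharing no non-zero dimension score 0.0.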
[0097]
In the present embodiment, the case of a Japanese document has been described. However, the present invention can be similarly applied to other languages such as English, Chinese, and Korean.
[0098]
In the present embodiment, the CPU 23 is used to realize the function shown in FIG. 1, and this is realized by software. However, some or all of them may be realized by hardware such as a logic circuit.
[0099]
In this way, the terms existing in the documents are converted into numerical vectors by their scores for a plurality of viewpoint profiles in which the operator is interested. Thereby, compared with the case where terms are represented by values depending only on their appearance frequency, the operator can easily grasp the relationship between his or her own interests and each term. In addition, terms converted into numerical vectors can be analyzed using various statistical analysis methods. Furthermore, the meaning of the analysis results becomes easier to interpret.
[Brief description of the drawings]
FIG. 1 is a functional block diagram of a term evaluation device 1 according to the present invention.
FIG. 2 is a diagram showing an example of a hardware configuration of the term evaluation device shown in FIG. 1.
FIG. 3 is a diagram showing an example of a hardware configuration in which the computer system 40 shown in FIG. 2 is realized by using a CPU 23.
FIG. 4 is a diagram illustrating a correspondence between viewpoint profiles and related keywords.
FIG. 5 is an overall flowchart of term evaluation processing.
FIG. 6 is a detailed flowchart of extracted document determination processing.
FIG. 7 is a detailed flowchart of a statistic determination process.
FIG. 8 is an example of a dialog for query determination.
FIG. 9 is an example of a display of an extraction term.
FIG. 10 is an example of a display of an extraction term.
FIG. 11 is an example of a display of an extraction term.
FIG. 12 is an example of a display of an extraction term.
FIG. 13 shows a data structure of a dictionary filter.
FIG. 14 shows a setting dialog for trend analysis.
FIG. 15 shows the result of trend analysis.
FIG. 16 is a detailed flowchart of the sorting process.
FIG. 17 shows a setting dialog for trend analysis.
FIG. 18 shows a result of trend analysis.
[Explanation of symbols]
23 ... CPU
27 ... Memory

Claims

In a recording medium storing a program for causing a computer including an input device, a control device, and a storage device to function as a term evaluation device,
The program causes the computer to execute the following processing:
A) A plurality of corresponding related keywords are stored for each viewpoint profile, and when a plurality of viewpoint profiles are given, the following term extraction target document determination process and term statistic calculation process are executed for each viewpoint profile:
a1) Term extraction target document determination processing for specifying one or more documents to be subject to term extraction from a plurality of document data using a plurality of related words predetermined for the viewpoint profile;
a2) Term statistic calculation processing for calculating a term statistic existing in the specified term extraction target document and calculating as a term statistic in the viewpoint profile;
B) For each term existing in the specified term extraction target document, obtaining a numerical vector in which the value of each vector element is represented by the term statistic of each viewpoint profile and which has the same number of dimensions as the number of viewpoint profiles;
C) displaying the term for which the numerical vector is obtained together with the numerical vector;
A recording medium storing a program characterized by the above.

In the recording medium storing the program according to claim 1,
Based on the numerical vector of each term, extracting a term in which only a specific vector element has a value greater than or equal to a predetermined threshold value, and displaying the extracted term together with the numerical vector;
It is characterized by.

A recording medium storing a program according to claim 1,
Based on the numerical vector of each term, extracting a term in which a specific plurality of vector elements all have a value equal to or greater than, or equal to or less than, a predetermined threshold value, and displaying the extracted term together with the numerical vector;
It is characterized by.
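
The extraction of terms common to several viewpoints might look like the following sketch. The function name, the index-based selection of vector elements, and the sample vectors are assumptions for illustration.

```python
def terms_common_to(vectors, indices, threshold):
    """Keep terms whose elements at every position in `indices` are >= threshold."""
    return {
        term: vec
        for term, vec in vectors.items()
        if all(vec[i] >= threshold for i in indices)
    }

# Hypothetical term vectors over two viewpoint profiles.
vectors = {"battery": [2, 0], "is": [1, 1], "screen": [0, 1]}

common = terms_common_to(vectors, indices=(0, 1), threshold=1)
print(common)  # {'is': [1, 1]}
```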

The recording medium storing the program according to claim 1,
wherein, based on the numerical vector of each term, a term whose specific vector element value is specifically larger or specifically smaller than that of other terms is extracted, and the extracted term is displayed together with its numerical vector.
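
The claim does not say how "specifically larger or smaller" is measured. The sketch below assumes one possible criterion, a z-score of the chosen element across all terms; the function name, the limit of 1.5, and the sample vectors are hypothetical.

```python
from statistics import mean, pstdev

def outlier_terms(vectors, index, z_limit=1.5):
    """Keep terms whose element at `index` deviates strongly from the other terms.

    Assumed criterion: absolute z-score of at least `z_limit`.
    """
    values = [vec[index] for vec in vectors.values()]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {}  # all terms have the same value, so nothing stands out
    return {
        term: vec
        for term, vec in vectors.items()
        if abs(vec[index] - mu) / sigma >= z_limit
    }

# Hypothetical term vectors; only "e" is unusual in the first element.
vectors = {"a": [1, 3], "b": [1, 3], "c": [1, 3], "d": [1, 3], "e": [10, 3]}

outliers = outlier_terms(vectors, index=0)
print(list(outliers))  # ['e']
```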

In the recording medium which memorize | stored the program of Claim 1,
Wherein the plurality of viewpoints profiles are perspective profile is meaningful permutations, on the basis of the numeric vector of each term, the value of a vector element in comparison with before and after the vector element that extracts the larger terms than a predetermined difference, Displaying the extracted terms along with their numeric vectors,
It is characterized by.
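
When the viewpoint profiles are ordered (for example, consecutive time periods), this extraction amounts to finding a peak in the vector. A minimal sketch follows, assuming that elements at either end of the vector are compared only with their single neighbor; the function name and sample vectors are hypothetical.

```python
def peak_terms(vectors, min_diff):
    """Keep terms with an element exceeding each adjacent element by more than `min_diff`."""
    hits = {}
    for term, vec in vectors.items():
        for i, value in enumerate(vec):
            # Preceding and following elements, where they exist.
            neighbours = vec[max(i - 1, 0):i] + vec[i + 1:i + 2]
            if neighbours and all(value - n > min_diff for n in neighbours):
                hits[term] = vec
                break
    return hits

# Hypothetical vectors over three ordered viewpoint profiles.
vectors = {"recall": [1, 9, 2], "price": [4, 5, 4]}

peaks = peak_terms(vectors, min_diff=3)
print(peaks)  # {'recall': [1, 9, 2]}
```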

The recording medium storing the program according to claim 1,
wherein predetermined character strings that are included in terms and that express the intention of the document creator are classified and stored for each intention, and when the operator designates one of the classifications, only terms in which a character string of the designated classification exists are extracted, and the extracted terms are displayed together with their numerical vectors.
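
Filtering by intention can be sketched as a lookup of stored character strings followed by a substring test. The classification names, the English marker strings, and the sample terms below are hypothetical stand-ins chosen for the example.

```python
# Assumed store of intention-expressing strings, grouped by classification.
intention_strings = {
    "request": ["want to", "please"],
    "complaint": ["cannot", "does not"],
}

def terms_with_intention(vectors, classification):
    """Keep terms containing any string stored under the designated classification."""
    markers = intention_strings[classification]
    return {
        term: vec
        for term, vec in vectors.items()
        if any(marker in term for marker in markers)
    }

# Hypothetical term vectors over two viewpoint profiles.
vectors = {"want to print": [3, 0], "cannot print": [0, 4], "printer": [5, 5]}

complaints = terms_with_intention(vectors, "complaint")
print(complaints)  # {'cannot print': [0, 4]}
```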

An evaluation method for evaluating terms constituting document data by a computer, wherein the computer stores a plurality of corresponding keywords for each viewpoint profile and executes the following processing:
A) When a plurality of viewpoint profiles are given, the following a1) term extraction target document determination processing and a2) term statistic calculation processing are executed for each viewpoint profile:
a1) Term extraction target document determination processing for specifying one or more documents to be subject to term extraction from a plurality of document data using a plurality of related words predetermined for the viewpoint profile;
a2) Term statistic calculation processing for calculating the term statistic existing in the specified term extraction target document as the term statistic in the viewpoint profile;
B) For each term, obtaining a numerical vector in which the value of each vector element is represented by the term statistic of each viewpoint profile and which has the same dimension as the number of viewpoint profiles;
C) displaying the term for which the numerical vector is obtained together with the numerical vector;
A term evaluation method characterized by the above.

The term evaluation method according to claim 7,
wherein, when the operator gives a character string that is used attached to a verb or adjective and that adds the intention of the document creator to the verb or adjective, a term in which the given character string exists and which has a predetermined feature based on the numerical vector of each term is extracted, and the extracted term is displayed together with its numerical vector.

Term extraction target document specifying means that stores a plurality of corresponding keywords for each viewpoint profile and that, when a plurality of viewpoint profiles are given, specifies, for each given viewpoint profile, one or more documents to be subject to term extraction from a plurality of document data using the plurality of corresponding keywords;
Term statistic calculating means for obtaining statistics of the terms existing in the specified term extraction target document and outputting them as the term statistics for the viewpoint profile;
Numerical vector computing means for obtaining, for each term, a numerical vector in which the value of each vector element is represented by the term statistic of each viewpoint profile and which has the same dimension as the number of viewpoint profiles;
Display means for displaying each term for which the numerical vector computing means has obtained a numerical vector, together with that numerical vector;
A term evaluation device comprising the above means.

The term evaluation device of claim 9,
wherein the display means extracts, based on the numerical vector of each term, a term in which only a specific vector element has a value greater than or equal to a predetermined threshold, and displays the extracted term together with its numerical vector.

The term evaluation device of claim 9,
wherein the display means extracts, based on the numerical vector of each term, a term in which all of a plurality of specific vector elements have values greater than or equal to a predetermined threshold, and displays the extracted term together with its numerical vector.

The term evaluation device of claim 9,
wherein the display means extracts, based on the numerical vector of each term, a term whose specific vector element value is specifically larger or specifically smaller than that of the other terms, and displays the extracted term together with its numerical vector.

The term evaluation device of claim 9,
wherein the plurality of viewpoint profiles are viewpoint profiles whose order is meaningful, and
the display means extracts, based on the numerical vector of each term, a term in which the value of a certain vector element, compared with the preceding and following vector elements, is larger by more than a predetermined difference, and displays the extracted term together with its numerical vector.

The term evaluation device of claim 9,
wherein predetermined character strings that are included in terms and that express the intention of the document creator are classified and stored for each intention, and
the display means, when any classification is designated by the operator, extracts only the terms in which a character string of the designated classification exists, and displays the extracted terms together with their numerical vectors.