JP2004246491A - Text mining system and text mining program

Info

Publication number: JP2004246491A
Application number: JP2003034059A
Authority: JP
Inventors: Yamahiko Ito; 山彦伊藤; Takeyuki Aikawa; 勇之相川; Yasuhiro Takayama; 泰博高山; Katsushi Suzuki; 克志鈴木
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2003-02-12
Filing date: 2003-02-12
Publication date: 2004-09-02

Abstract

PROBLEM TO BE SOLVED: To perform analysis based on a new criterion not defined in the original documents. SOLUTION: This text mining device stores index information for retrieving/mining an object document group 101 in an index storage part 103. A retrieval/mining execution part 110 executes retrieval/mining processing on the object document group 101 by referring to the index information according to a retrieval/mining condition. A retrieval/mining result saving part 111 stores the retrieval/mining result in a retrieval/mining result storage part 107, attaching to each document its degree of relevance to the retrieval/mining condition. Further, a retrieval/mining result editing part 112 reads a plurality of designated retrieval/mining results from the storage part 107 and, for a designated attribute name, selects one of the read retrieval/mining results for each document to determine the attribute value. COPYRIGHT: (C)2004, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
この発明は、大量の電子化文書の集合に対し、様々な観点から分析を行うことを可能にするテキストマイニング装置及びテキストマイニングプログラムに関するものである。
【０００２】
【従来の技術】
アンケートや電子ニュース等、大量の文書データの分析を支援するための技術として、テキストマイニングが注目されている。テキストマイニングでは、処理対象の文書データのテキスト情報から単語の頻度や関連性を抽出して、新たな知識を発見する技術である。従来、大量の文書データの分析作業を対話的に進めるための支援の方法が提案されており、例えば、特許文献１に開示されたテキストマイニング装置では、以下のような処理が行われていた。
【０００３】
テキストマイニングの処理対象となる対象文書集合から、対象文書集合に特徴的に出現する語句（特徴語句）を抽出し、その中からユーザが指定した分析軸の構成要素と共起する語句を取り出す。例えば、“Ｏ１５７”に関する新聞記事集合を処理対象とした場合、特徴語句「小学校、集団感染、患者、複数、出血性、下痢、症状、入院、…」が抽出される。その中から、分析軸として、新聞の掲載月を指定し、その構成要素（７月、８月、９月）と共起する語句を取得する。この結果、“７月”と対応付けて「感染、患者、症状、入院、…」、“８月”と対応付けて「衝撃、給食、入院、集団感染、…」、“９月”と対応付けて「売上、マイナス、食料品、生鮮、…」といった共起語句が取得される。
【０００４】
この分析軸及び分析結果を分析履歴として保存しておくと共に、複数の異なる分析軸を有するようにし、分析軸の追加や任意の分析軸の変更を行った際に、複数の分析軸の構成要素の各々と共起する可能性が高い語句を分析履歴を用いて絞り込み、異なる分析軸の組み合わせによる分析を実現している。
【０００５】
指定された特徴語句を含む分析軸を追加する際には、指定された構成要素にその分析軸を追加して、指定された構成要素と追加した分析軸の構成要素の組を作成すると共に、作成した構成要素の組の各々と共起する可能性が高い語句を共起語句候補として分析履歴から抽出する。そして抽出した共起語句候補の中から、作成した構成要素の組の各々と予め定められた範囲内（同一文書内、同一段落内、同一文章内、又はｍ語以内、ｎ文以内）で共起する語句を取得する。例えば、構成要素“７月”と対応付けて取得した共起語句“感染”及び“症状”を含む分析軸を追加する。
【０００６】
そして、指定された構成要素“７月”と追加した分析軸の構成要素の組“７月−感染”、“７月−症状”と共起する可能性が高い共起語句候補として、“７月”と共起する語句「感染、患者、症状、入院、…」を分析履歴から抽出する。そしてこの共起語句候補の中から、構成要素の組“７月−感染”、“７月−症状”と予め定められた範囲内で共起する語句を取得する。この結果、“７月−感染”と共起する語句として「患者、症状、予防法、集団、…」、“７月−症状”と共起する語句として「吐気、下痢、入院、重症、…」を得る。ここで、“７月−感染”とは、“７月”と“感染”の組であることを意味し、「“７月−感染”と予め定められた範囲内で共起する」とは、「“７月”と予め定められた範囲内で共起し、かつ、“感染”と予め定められた範囲内で共起する」ことを意味する。
【０００７】
追加した分析軸を変更する際には、分析軸をユーザの指示に従って変更すると共に、構成要素の組と共起する可能性が高い語句を共起語句候補として分析履歴から抽出する。そして抽出した共起語句候補の中から、構成要素の組と予め定められた範囲内で共起する語句を取得する。「構成要素“７月”に分析軸を追加した」ことを消去して、“８月”と対応付けて取得した共起語句“給食”を含む分析軸を追加した場合、指定された構成要素“８月”と追加した分析軸の構成要素の組“８月−給食”と共起する可能性が高い共起語句候補として、“８月”と共起する語句「衝撃、給食、入院、集団感染、…」を分析結果から抽出する。そして、この共起語句候補の中から、構成要素の組“８月−給食”と予め定められた範囲内で共起する語句を取得する。
【０００８】
テキストマイニング結果から指定された共起語句と共に、代表文書取得指示がユーザから入力された場合には、指定された共起語句及びその共起語句と対応する構成要素に含まれる語句でテキストマイニング対象となった文書集合の検索を行い、得点の高い文書や最新の文書、指定された書誌情報を持つ文書等を代表文書として取得する。
【０００９】
このように、特許文献１のテキストマイニング装置では、分析軸及び分析結果を分析履歴として保存しておくと共に、複数の異なる分析軸を有するようにし、分析軸の追加や任意の分析軸の変更を行った際に、複数の分析軸の構成要素の各々と共起する可能性が高い語句を分析履歴を用いて絞込み、複数の異なる分析軸の組み合わせによる分析を実現している。また、分析履歴を用いて共起語句を絞り込んでから、実際に構成要素の各々と共起する語句を調べるため、全ての特徴語句について共起するか否かを調べる場合に比べ高速な分析が行える。
【００１０】
【特許文献１】
特開２００１−３１８９３９公報（第１４頁、第４図）
【００１１】
【発明が解決しようとする課題】
従来のテキストマイニング装置は、以上のようになされていたので、対話的に分析作業を進める際に、分析は常に絞り込む方向に行われ、保存された分析結果は、別の条件でさらに絞り込みを行うための中間情報として活用されるに止まっていた。そのため、保存された複数の分析結果を関連付けてユーザが新たな分析の基準を作成し、その基準に基づく分析を行うことができないという課題があった。
【００１２】
例えば、“給食”、“売上”、及び“食料品”という特徴語句によって得られた分析結果に対して「社会的影響」という新たな分析の軸（属性）を作成し、“予防”、“加熱調理”、及び“情報公開”という特徴語句によって得られた分析結果に対して、“対策”という新たな分析の軸（属性）を作成し、「社会的影響」と「対策」の２つの属性の関係を分析するような処理を行うことができない。
【００１３】
この発明は上記のような課題を解決するためになされたもので、保存されている分析結果同士の関連付けを行うことによって、新たな分析対象や新たな属性の作成を可能にし、元の文書に定義されていないユーザが定義した基準（分析軸）に基づく分析が行えるテキストマイニング装置及びテキストマイニングプログラムを得ることを目的とする。
【００１４】
すなわち、特許文献１では、図３３に示すように、対象文書集合４０１に分析処理４０２を施して出力した分析結果４０３に対して、細分化を行う分析処理４０４を繰り返し施していた。これに対し、この発明では、図３４に示すように、対象文書集合４１１に分析処理４１２を施して複数の分析結果４１３，４１４を出力し、それぞれの分析結果４１３，４１４を関連付けて分析処理４１５を行い、新たな分析結果４１６を得る。これによって、この発明では、ユーザが定義した複数の分析軸を横断した分析を行うことができるようになる。
【００１５】
【課題を解決するための手段】
この発明に係るテキストマイニング装置は、対象文書集合を検索・マイニングするための索引情報を保存している索引格納部と、指定された検索・マイニング条件に従って上記索引格納部に保存されている索引情報を参照して上記対象文書集合の検索・マイニングの処理を実行する検索・マイニング実行部と、上記検索・マイニング実行部による検索・マイニング結果を、各文書と検索・マイニング条件との関連度を上記各文書に付与して検索・マイニング結果格納部に保存する検索・マイニング結果保存部と、指定された複数の検索・マイニング結果を上記検索・マイニング結果格納部から読み込んで、指定された属性名に対応して、各文書毎に読み込んだ検索・マイニング結果を選択して属性値を決定する検索・マイニング結果編集部とを備えたものである。
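[0016]
[Embodiments of the invention]
An embodiment of the present invention is described below.
Embodiment 1.
FIG. 1 is a block diagram showing the configuration of a text mining apparatus according to Embodiment 1 of the present invention. An index generation unit 102 generates, from a target document set 101, index information for searching and mining the target document set 101, and stores it in an index storage unit 103. An execution unit 104 reads the index information from the index storage unit 103, executes search or mining processing according to a search condition or mining condition entered from an input unit 105, and outputs the search/mining result to a display unit 106. When the user instructs that a search/mining result be saved, the execution unit 104 saves it in a search/mining result storage unit 107.
[0017]
In the execution unit 104, an index reading unit 108 reads the index information stored in the index storage unit 103. A search/mining result reading unit 109 reads search/mining results stored in the search/mining result storage unit 107. A search/mining execution unit 110 refers to the index information stored in the index storage unit 103 according to the specified search/mining condition and executes search/mining processing on the target document set 101 or on the search/mining result read by the search/mining result reading unit 109.
[0018]
Also in the execution unit 104, a search/mining result saving unit 111 saves the search/mining result produced by the search/mining execution unit 110 in the search/mining result storage unit 107, attaching to each document its degree of relevance to the search/mining condition. A search/mining result editing unit 112 reads a plurality of specified search/mining results from the search/mining result storage unit 107 and, for a specified attribute name, selects one of the read search/mining results for each document to determine that document's attribute value.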
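[0019]
FIG. 2 is a detailed block diagram of the text mining apparatus showing the index generation unit 102, the index storage unit 103, and the search/mining execution unit 110 of FIG. 1 in more detail. The index generation unit 102 consists of a concept dictionary generation unit 201, a document vector index generation unit 202, and a mining index generation unit 203. The concept dictionary generation unit 201 and the mining index generation unit 203 generate a concept dictionary 204 and a mining index 206, respectively, from the target document set 101; the document vector index generation unit 202 generates a document vector index 205 from the target document set 101 with reference to the concept dictionary 204. The search/mining execution unit 110 consists of a search unit 207 that executes search processing and a mining unit 208 that executes mining processing.
[0020]
FIG. 3 shows an example of a document in the target document set 101. Embodiment 1 is described using the example of analyzing a mobile phone questionnaire. A document in the target document set 101 consists of fields such as "gender", "age group", "model", "region", and "date", whose range of values is fixed in advance and whose value can be chosen from options, and fields such as "free comment", which are written as free text. Fields of the former kind are hereinafter called the attributes of the document. The item name of an attribute is called the attribute name, and its value the attribute value. For example, the document of FIG. 3 has the attribute value "male" for the attribute name "gender".
[0021]
In the example of FIG. 3 each attribute name has one attribute value, but an attribute name need not have a value, for instance when a questionnaire item was left unanswered. Conversely, when a respondent may specify two or more values, as with "model", one attribute name may have two or more attribute values.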
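[0022]
FIG. 4 is a flowchart of the processing of the concept dictionary generation unit 201 shown in FIG. 2, and FIG. 5 illustrates that processing.
In step ST11 of FIG. 4, the concept dictionary generation unit 201 performs morphological analysis on the text contained in the document set 301 of FIG. 5, splitting the character strings in the text into words. Morphological analysis is a widely known technique, so a detailed description is omitted here. The training document set 301 shown in FIG. 5 need not be the target document set 101 itself; another document set from the same field as the target document set 101 may be used.
[0023]
In step ST12, the concept dictionary generation unit 201 extracts compound words from the morphological analysis result.
FIG. 6 is a flowchart of the compound word extraction processing of the concept dictionary generation unit 201. In step ST21 of FIG. 6, the concept dictionary generation unit 201 extracts compound word candidates using the morpheme connection patterns described in a compound word candidate extraction dictionary that it holds internally.
[0024]
FIG. 7 shows an example of a compound word candidate extraction dictionary describing morpheme connection patterns, here one that extracts candidates from the connection of two morphemes. For example, pattern number 001 means "when two morphemes whose part of speech is noun appear in succession, extract that sequence as a compound word candidate". As a result, strings in which two nouns are consecutive in the text, such as "機種/変更" (model/change) and "確認/画面" (confirmation/screen), are extracted as compound word candidates.
[0025]
Similarly, pattern number 002 extracts strings in which a noun morpheme is followed by a suffix morpheme, such as "携帯/性" (portability) and "機能/的" (functional), and pattern number 003 extracts strings in which a noun morpheme is followed by a morpheme whose surface form is "感" (feel), such as "季節/感" (seasonal feel) and "使用/感" (feel in use). Here "/" is used as a symbol marking a morpheme boundary.
FIG. 8 shows the result of extracting compound word candidates from the target document set 101, focusing on the string "機種変更" (model change).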
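The pattern-based candidate extraction of step ST21 (paragraphs 0023 to 0025) can be sketched as follows. This is only an illustration: the `PATTERNS` table is a hypothetical stand-in for the dictionary of FIG. 7, the part-of-speech tags are assumed to come from some morphological analyzer that is not shown, and reporting only the first matching pattern per pair is a choice made here, not something the text specifies.

```python
# Hypothetical connection patterns modeled on pattern numbers 001-003 of FIG. 7.
# Each morpheme is a (surface, part_of_speech) pair.
PATTERNS = [
    ("001", lambda a, b: a[1] == "noun" and b[1] == "noun"),
    ("002", lambda a, b: a[1] == "noun" and b[1] == "suffix"),
    ("003", lambda a, b: a[1] == "noun" and b[0] == "感"),
]

def compound_candidates(morphemes):
    """Return (pattern_no, 'a/b') for each adjacent pair that matches a pattern."""
    found = []
    for a, b in zip(morphemes, morphemes[1:]):
        for number, matches in PATTERNS:
            if matches(a, b):
                found.append((number, f"{a[0]}/{b[0]}"))
                break  # assumption: report only the first matching pattern
    return found

# Tags are written by hand here; a real system would take them from the analyzer.
tokens = [("機種", "noun"), ("変更", "noun"), ("は", "particle"),
          ("携帯", "noun"), ("性", "suffix")]
candidates = compound_candidates(tokens)
```

Patterns over three or more morphemes, which paragraph 0027 allows, would need a wider sliding window than the pairwise `zip` used here.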
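[0026]
In step ST22 of FIG. 6, the concept dictionary generation unit 201 uses statistical information to select, from the compound word candidates extracted in step ST21, the compound words to be registered in the co-occurrence frequency table. The co-occurrence frequency table, described later, is a table showing co-occurrence relations between words. Because step ST21 yields an enormous number of candidates, word statistics are used to register only the important words. The statistical information is computed by a known method, for example the frequency of occurrence or the tf*idf value given by the following expression.
tf*idf(w) = f(w) * log(Nd / d(w))
Here f(w) is the frequency of occurrence of word w, Nd is the number of documents in the target document set, and d(w) is the number of documents in which word w occurs.
[0027]
In step ST22, compound word candidates whose statistic, such as frequency of occurrence or tf*idf value, is at or above a preset threshold are selected as the compound words to be registered in the co-occurrence frequency table. Alternatively, the number of compound words to extract may be fixed in advance and the candidates with the highest statistics selected from the top. Embodiment 1 has been described using sequences of two words, but connection relations of three or more words may also be described in the compound word candidate extraction dictionary and used for compound word extraction.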
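A minimal sketch of the selection in step ST22, using the expression tf*idf(w) = f(w) * log(Nd / d(w)) from paragraph 0026. The function names and the toy documents are assumptions made for illustration; the natural logarithm is also an assumption, since the text does not state the base.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns {word: f(w) * log(Nd / d(w))}."""
    n_docs = len(docs)
    freq = Counter(w for doc in docs for w in doc)            # f(w): total occurrences
    doc_freq = Counter(w for doc in docs for w in set(doc))   # d(w): documents containing w
    return {w: freq[w] * math.log(n_docs / doc_freq[w]) for w in freq}

def select_compounds(scores, threshold=None, top_n=None):
    """Keep candidates at or above a threshold, or the top_n highest-scoring ones."""
    ranked = sorted(scores, key=lambda w: scores[w], reverse=True)
    if threshold is not None:
        ranked = [w for w in ranked if scores[w] >= threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return ranked

docs = [["機種変更", "画面"], ["機種変更", "料金"], ["画面", "画面", "確認画面"], ["料金"]]
scores = tf_idf(docs)
best = select_compounds(scores, top_n=1)
```

A word that occurs in every document gets a score of zero under this expression, so it can never pass a positive threshold.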
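[0028]
Because the concept dictionary generation unit 201 extracts compound words in this way, words that are more specialized than their general component words are obtained, and concept indexes are generated for those compound words by the processing described later. Mining can therefore be based on more appropriate words, which improves its accuracy.
[0029]
Next, in step ST13 of FIG. 4, the concept dictionary generation unit 201 calculates the co-occurrence frequency, that is, the number of times two words appear together in each document of the document set 301, and obtains the co-occurrence frequency table 302 shown in FIG. 5. The co-occurrence frequency table 302 shows how many times each word on the vertical axis co-occurred with each word on the horizontal axis.
[0030]
In step ST14, the concept dictionary generation unit 201 applies singular value decomposition to the co-occurrence frequency table 302. Singular value decomposition is a known linear algebra technique that decomposes a matrix A, here the co-occurrence frequency table 302 shown in FIG. 5, into the product of three matrices (UΣV) 303, 304, and 305. For example, Reference 3 (Takayama et al., "InfoMAP, an information retrieval system based on word association relations", Fundamental Informatics 53-1, 1999-3) describes a document retrieval method using a concept dictionary created with singular value decomposition. Eigenvalue decomposition may be used instead of singular value decomposition.
[0031]
In step ST15, the concept dictionary generation unit 201 outputs as the concept dictionary 204 the k columns of the matrix U (left singular matrix) 303 obtained in step ST14 that correspond to the k largest singular values contained in the matrix Σ (singular value matrix) 304, where the specified k is smaller than the number of columns of the original matrix A. The horizontal axis of the concept dictionary 204 represents these k components of U. The concept dictionary 204 is dimensionally compressed relative to the co-occurrence frequency table 302, and each row can be regarded as the concept vector of a word, incorporating higher-order correlations. Hereinafter the concept vector of a word is called a word vector. A concept vector here is a vector whose components are the leftmost k dimensions of the matrix U produced by the singular value decomposition of the co-occurrence frequency table 302.
[0032]
A concept dictionary 204 created in this way has the property that words co-occurring with the same words have similar word vectors; in other words, words with similar concepts have similar word vectors.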
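The construction in steps ST14 and ST15 (paragraphs 0030 and 0031) can be written out as below. The symbols A, U, Σ, V, and k follow the text; writing the third factor as a transpose, ordering the singular values in decreasing order, and the symbols m, n, r, u_j, and w_i are conventional assumptions added here for clarity.

```latex
% A: co-occurrence frequency table 302 (m words by n words), rank r
A = U \Sigma V^{\mathsf{T}}, \qquad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r), \qquad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0

% Concept dictionary 204: the k columns of U belonging to the k largest singular values
U_k = \begin{bmatrix} u_1 & u_2 & \cdots & u_k \end{bmatrix}, \qquad k < n

% Word vector of the i-th word: the i-th row of U_k
w_i = (U_k)_{i,:} \in \mathbb{R}^{k}
```

Each word is thus represented by k numbers instead of a full row of n co-occurrence counts, which is the dimensional compression the text refers to.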
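[0033]
Next, the processing of the document vector index generation unit 202 is described.
FIG. 9 illustrates document vector index generation by the document vector index generation unit 202, which generates the document vector index 205 from the target document set 101 with reference to the concept dictionary 204.
[0034]
FIG. 10 is a flowchart of the processing of the document vector index generation unit 202. In step ST31, the unit performs morphological analysis on each document in the target document set 101 and splits the text into words. In step ST32, it calculates the frequency of each word appearing in each document. In step ST33, it retrieves the word vector for each word from the concept dictionary 204.
[0035]
In step ST34, for each document, the document vector index generation unit 202 multiplies the word vector of each word appearing in the document by the frequency calculated in step ST32, sums the resulting vectors, and outputs the sum as the entry for that document in the document vector index 205. The vector stored in the document vector index 205 for each document is called a document vector.
[0036]
FIG. 11 shows the document vector index 205, in which each document is associated with its document vector. A document vector has the same dimension as the word vectors in the concept dictionary 204, and documents with similar document vectors have similar content.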
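Steps ST32 to ST34 (paragraphs 0034 and 0035) amount to a frequency-weighted sum of word vectors. The sketch below assumes the document has already been tokenized and that the concept dictionary is a plain mapping from word to vector. The two-dimensional vectors are invented for illustration, and skipping words that are missing from the dictionary is an assumption, since the text does not say how such words are handled.

```python
from collections import Counter

def document_vector(tokens, concept_dict):
    """Sum of word vectors, each weighted by the word's frequency in the document."""
    dim = len(next(iter(concept_dict.values())))
    vec = [0.0] * dim
    for word, count in Counter(tokens).items():   # ST32: per-document word frequency
        word_vec = concept_dict.get(word)         # ST33: look up the word vector
        if word_vec is None:
            continue                              # assumption: unknown words are skipped
        for i, x in enumerate(word_vec):          # ST34: add frequency * word vector
            vec[i] += count * x
    return vec

# Invented two-dimensional word vectors standing in for rows of the concept dictionary.
concept_dict = {"画面": [1.0, 0.0], "液晶": [0.9, 0.1], "料金": [0.0, 1.0]}
doc_vec = document_vector(["画面", "画面", "料金", "未知語"], concept_dict)
```

The same function can build the search vector of step ST62, since paragraph 0043 says that vector is generated in the same way as in steps ST31 to ST34.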
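[0037]
Next, the processing of the mining index generation unit 203 is described.
FIG. 12 is a flowchart of that processing. In step ST41, the mining index generation unit 203 obtains a list of the attribute names contained in the target document set 101. In the example of FIG. 3, "gender", "age group", "model", "region", and "date" are obtained as attribute names. In step ST42, for each document in the target document set 101, it obtains the attribute values corresponding to the attribute names obtained in step ST41. For example, from the document shown in FIG. 3 it obtains "male" for "gender", "20s" for "age group", "model 1" for "model", "Tokyo" for "region", and "2002-01-14" for "date".
[0038]
The mining index generation unit 203 obtains attribute values for all documents in the target document set 101 and, in step ST43, writes the attribute names and attribute values to the mining index 206.
FIG. 13 shows an example of a mining index generated by the mining index generation unit 203.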
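[0039]
Next, the processing of the execution unit 104 is described.
FIG. 14 is a flowchart of that processing. In step ST51, the index reading unit 108 reads three pieces of index information from the index storage unit 103: the concept dictionary 204, the document vector index 205, and the mining index 206.
[0040]
In step ST52, the search/mining result reading unit 109 reads the search/mining result specified by the user from the search/mining result storage unit 107. Initially, however, nothing has been written to the search/mining result storage unit 107, so step ST52 is passed without any processing. In that case the search/mining target is the entire target document set 101 registered in the document vector index 205. In Embodiment 1, a search/mining result means a document set saved in the format of FIG. 25, described later.
[0041]
In step ST53, the search/mining execution unit 110 determines whether the user has entered a search/mining condition. If so, it receives the condition as input in step ST54 and executes the search/mining processing in step ST55.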
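[0042]
Of the search/mining processing in step ST55 of FIG. 14, the search processing executed by the search unit 207 is described first.
FIG. 15 is a flowchart of the search processing, in which the search unit 207 searches the target document set 101 using the document vector index 205. In step ST61, the search unit 207 receives as text the search condition specified by the user through the input unit 105. Here, for example, "a model with a large screen" is entered as the search condition.
[0043]
In step ST62, the search unit 207 refers to the concept dictionary 204 and generates a concept vector for the entered search condition. Here the concept vector is the combination of the word vectors for "screen", "large", and "model". Hereinafter the concept vector for a search condition is called a search vector. The search vector is generated in the same way as in steps ST31 to ST34 of FIG. 10.
[0044]
In step ST63, the search unit 207 calculates the cosine between the search vector and the stored document vector of each document in the target document set 101 and uses it as the similarity. In step ST64, the search unit 207 sorts the documents by similarity and outputs them as the search result.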
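The ranking in steps ST63 and ST64 (paragraph 0044) can be sketched as a cosine computation followed by a sort. The document index below is a hypothetical in-memory mapping with invented vectors, and returning a similarity of 0.0 for a zero vector is an assumption made to keep the function total.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def search(search_vec, doc_index):
    """ST63: score every document vector; ST64: sort by similarity, highest first."""
    scored = [(doc_id, cosine(search_vec, vec)) for doc_id, vec in doc_index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Invented document vectors standing in for the document vector index 205.
doc_index = {"doc1": [2.0, 1.0], "doc2": [0.0, 3.0], "doc3": [1.0, 0.0]}
results = search([1.0, 0.0], doc_index)
```

The optional refinements of paragraph 0046 (a similarity threshold or a maximum number of results) would be a filter or a slice applied to the returned list.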
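[0045]
FIG. 16 shows an example of a search result from the search unit 207, for the search condition "a model with a large screen". It displays a string identifying each document ("document n" in FIG. 16), the similarity, and the free-text part of the document.
[0046]
The documents displayed as the search result may be limited to those at or above a similarity threshold. Alternatively, a maximum number of results may be set in advance and only that many documents displayed, starting from the highest similarity. Narrowing by document attributes such as "gender" or "age group" may also be combined. In Embodiment 1 the similarity between vectors is calculated using the concept dictionary 204 and the document vector index 205, but any other method capable of calculating the similarity between the entered search condition and a document may be used.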
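[0047]
Next, of the search/mining processing in step ST55 of FIG. 14, the mining processing executed by the mining unit 208 is described. This processing makes it possible to analyze the relevance between keywords and attributes.
FIG. 17 is a flowchart of the mining processing executed by the mining unit 208. In step ST71, the mining unit 208 receives a mining condition from the input unit 105. The mining condition here consists of an attribute and keywords. The following describes the case where the attribute name "model" with the attribute values "model 1", "model 2", "model 3", and "model 4" is specified as the attribute, and "screen" and "ringtone" are specified as the keywords.
[0048]
In step ST72, the mining unit 208 obtains the documents of the target document set 101 that have the specified attribute values. That is, for the attribute name "model", it obtains separately the documents whose attribute values include "model 1", those that include "model 2", those that include "model 3", and those that include "model 4". In step ST73, the mining unit 208 creates an attribute vector for each of the obtained document sets. An attribute vector is created by averaging, element by element, the document vectors of the documents that share the same attribute value.
[0049]
In step ST74, the mining unit 208 obtains the relevance between each attribute value and each keyword. The relevance is calculated as the cosine between the attribute vector and the word vector retrieved from the concept dictionary 204 for the keyword. In step ST75, the mining unit 208 plots the obtained relevance between attributes and keywords as a graph and outputs it as the mining result.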
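Steps ST72 to ST74 (paragraphs 0048 and 0049) can be sketched as grouping documents by attribute value, averaging their document vectors, and taking cosines with the keyword vectors. The record layout (`attrs`, `vector`) and all numbers are hypothetical; they were chosen only so that the outcome resembles the reading of FIG. 18 given in paragraph 0050.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def attribute_vector(doc_vectors):
    """ST73: element-wise average of the document vectors sharing an attribute value."""
    return [sum(column) / len(doc_vectors) for column in zip(*doc_vectors)]

def attribute_keyword_relevance(docs, attr_name, keyword_vecs):
    """ST72-ST74: relevance of each attribute value to each keyword."""
    groups = {}
    for doc in docs:                                   # ST72: group by attribute value
        for value in doc["attrs"].get(attr_name, []):  # a document may have several values
            groups.setdefault(value, []).append(doc["vector"])
    return {
        value: {kw: cosine(attribute_vector(vecs), kv) for kw, kv in keyword_vecs.items()}
        for value, vecs in groups.items()
    }

# Invented data: axis 0 loosely means "screen", axis 1 "ringtone".
docs = [
    {"attrs": {"model": ["model 3"]}, "vector": [0.2, 1.0]},
    {"attrs": {"model": ["model 3"]}, "vector": [0.0, 1.0]},
    {"attrs": {"model": ["model 4"]}, "vector": [1.0, 0.1]},
]
keyword_vecs = {"screen": [1.0, 0.0], "ringtone": [0.0, 1.0]}
relevance = attribute_keyword_relevance(docs, "model", keyword_vecs)
```

Plotting `relevance` per attribute value would give a line graph of the kind described for step ST75.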
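[0050]
FIG. 18 shows a mining result from the mining unit 208. The attribute values of "model" are placed on the axis, and the relevance between each attribute value and each keyword is shown as a line graph. The graph of FIG. 18 shows that for "model 3" interest in the screen is low and interest in ringtones is high, whereas for "model 4" interest in the screen is high and interest in ringtones is low.
[0051]
The attribute may also be specified by attribute name alone, in which case the mining unit 208 automatically extracts all attribute values for that name from the mining index 206 and analyzes them. Alternatively, when specifying attribute values, relations of conjunction (AND), disjunction (OR), and negation (NOT) between attribute values may be specified, as in "model 1 or model 2".
[0052]
A keyword need not be a single word such as "screen" or "ringtone"; several words may be specified, as in "screen, color" or "ringtone, chord". In that case the vector for the keyword is the average of the word vectors of the words it contains. When an attribute name such as "date" is specified, step ST72 may obtain documents by a range of attribute values, such as "from 2002-01-01 to 2002-01-31", so that their relation to the keywords can be analyzed.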
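[0053]
Next, among the mining processing executed by the mining unit 208, an example is described in which the correlation between two attribute values is analyzed through the distribution of keywords.
FIG. 19 is a flowchart of this mining processing. In step ST81, the mining unit 208 receives a mining condition from the input unit 105. The mining condition entered by the user here consists of two attribute values and keywords. The following describes the case where "male" and "female", attribute values of the attribute name "gender", are specified together with the keywords "flip", "charges", "ringtone", and "character".
[0054]
In step ST82, the mining unit 208 obtains the documents of the target document set 101 that have the specified attribute values; that is, for the attribute name "gender", it obtains separately the documents whose attribute value is "male" and those whose attribute value is "female". In step ST83, the mining unit 208 creates an attribute vector for each of the obtained document sets, in the same way as in step ST73 of FIG. 17.
[0055]
In step ST84, the mining unit 208 obtains the relevance between each attribute vector and each keyword entered as the mining condition, calculated as in step ST74 of FIG. 17. In step ST85, the mining unit 208 displays the coordinates of each keyword with the attribute values as coordinate axes and outputs this as the mining result.
[0056]
FIG. 20 shows such a mining result. In this example "male" is the X axis and "female" the Y axis; the X coordinate of each keyword is the relevance between its concept vector and the attribute vector for "male", and the Y coordinate is the relevance to the attribute vector for "female". The graph of FIG. 20 shows that "character" and "ringtone" are highly relevant to women, while "flip" is highly relevant to men.
[0057]
Attributes may also be specified in combination, as in "age group is teens and gender is female". In that case step ST82 retrieves the documents whose "age group" attribute value is "teens" and whose "gender" attribute value is "female". Keywords may likewise be specified as combinations of several words. Furthermore, instead of the user specifying keywords, keywords may be extracted automatically from the target document set 101 using a dictionary or statistical information and entered as the mining condition.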
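[0058]
Next, among the mining processing executed by the mining unit 208, an example is described in which the correlation between keywords is analyzed through the distribution of documents.
FIG. 21 is a flowchart of this mining processing. In step ST91, the mining unit 208 receives a mining condition from the input unit 105. The mining condition entered by the user here consists of keywords. The following describes the case where "melody" and "tune" are specified.
[0059]
In step ST92, for each keyword, the mining unit 208 obtains the relevance between the word vector retrieved from the concept dictionary 204 and the document vector of each document in the target document set 101, as the cosine of the two vectors. In step ST93, the mining unit 208 displays the coordinates of each document with the keywords as coordinate axes and outputs this as the mining result.
[0060]
FIG. 22 shows such a mining result. In this example "melody" is the X axis and "tune" the Y axis, and each document is plotted at the point whose X coordinate is the relevance between its document vector and the concept vector of "melody" and whose Y coordinate is its relevance to "tune". The graph of FIG. 22 shows that documents highly correlated with "melody" are also highly correlated with "tune", and therefore that "melody" and "tune" are closely related words.
[0061]
FIG. 23 likewise shows a mining result from the mining unit 208, obtained by running the processing of FIG. 21 with the keywords "sound" and "color". It shows that "sound" and "color" are weakly related words.
[0062]
In Embodiment 1, the relevance between words and attributes and between words and documents is calculated using the concept dictionary 204 and the document vector index 205. In the concept dictionary 204, "screen" and "LCD", for example, co-occur with the same words, such as "size", "brightness", and "color"; the word vectors of words that appear in similar contexts are therefore close together. Consequently, when the relevance between a document and the word "screen" is calculated, the relevance is high if the document contains the word "LCD", even if it does not contain "screen". Mining that judges relevance by semantic closeness, even when the surface forms of words differ, is thus achieved.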
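[0063]
Returning to the flowchart of FIG. 14, in step ST56 the search/mining execution unit 110 displays the result of the search/mining executed in step ST55. In step ST57, the search/mining result saving unit 111 checks whether the user has instructed that the result be saved. If not, processing returns to step ST53, where the search/mining execution unit 110 receives the next search/mining condition and executes search/mining. If there is no further condition, processing ends.
[0064]
If in step ST57 the search/mining result saving unit 111 receives an instruction from the user to save the search/mining result, it saves the result in the search/mining result storage unit 107 in step ST58.
[0065]
FIG. 24 is a flowchart of the processing of the search/mining result saving unit 111. In step ST101, the user decides which document set is to be saved as the search/mining result. For a search result, for example, the result output in step ST64 of FIG. 15 may be saved as is, or it may be narrowed further by conditions such as a similarity threshold or a number of results.
[0066]
For the mining processing of FIGS. 17 and 18, which analyzes the relevance between attributes and keywords, an attribute and a keyword may be specified and the document set to be saved taken as the documents that have the specific attribute and are highly relevant to the specific keyword. For example, the documents whose attribute value is "model 4" and that are highly relevant to the keyword "screen" may be saved.
[0067]
For the mining processing of FIGS. 19 and 20, which analyzes the correlation between attributes, an attribute and a keyword may likewise be specified and the documents that have the specific attribute and are highly relevant to the specific keyword saved. For example, the documents whose "gender" attribute value is "female" and that are highly relevant to the keyword "character" may be saved.
[0068]
For the mining processing of FIGS. 21, 22, and 23, which analyzes the correlation between keywords, the documents highly relevant to the keywords specified as coordinate axes may be saved. For example, the documents highly relevant to one or both of the keywords "melody" and "tune" may be saved.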
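[0069]
Next, in step ST102 of FIG. 24, the search/mining result saving unit 111, following the user's instruction, explicitly gives the search/mining result a name so that it can be referred to later. In step ST103, it assigns a score to each document to be saved and saves the documents.
[0070]
The score is the relevance between the document and the keyword specified in the search or mining. For the search processing of FIG. 15, for example, the score is the similarity (relevance) to the search condition specified in step ST61.
[0071]
For the mining processing of FIGS. 17 and 18, which analyzes the relevance between attributes and keywords, the score is the relevance to the keyword specified in the mining condition. For example, when the documents whose "model" attribute value is "model 4" and that are highly relevant to the keyword "screen" are saved, the score is the relevance to "screen".
[0072]
For the mining processing of FIGS. 19 and 20, which analyzes the correlation between attributes, the score is likewise the relevance to the keyword specified in the mining condition. For example, when the documents whose "gender" attribute value is "female" and that are highly relevant to the keyword "character" are saved, the score is the relevance to "character".
[0073]
For the mining processing of FIGS. 21, 22, and 23, which analyzes the correlation between keywords, the score is the relevance to the keywords specified as coordinate axes: for example, the relevance between each document and one of the keywords "melody" and "tune", or its relevance to the vector obtained by combining the word vectors of both.
[0074]
FIG. 25 shows an example of search/mining results saved in the search/mining result storage unit 107. Suppose, for example, that the result the user named "fashionability" was obtained by specifying keywords such as "design, character, color, ..."; the result named "operability" from a search specifying keywords such as "key, button, operation, ..."; the result named "function" by specifying keywords such as "kanji conversion, mail, schedule, ..."; the result named "business use" by specifying keywords such as "boss, client, report, ..."; and the result named "private use" from a search specifying keywords such as "e-mail friends, children, travel, ...".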
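[0075]
Next, after step ST58 of FIG. 14, processing returns to step ST52, where the search/mining result reading unit 109 reads the search/mining result specified by the user from the search/mining result storage unit 107. The documents that have a score in the specified result become the document set for the next search/mining.
[0076]
For example, if "fashionability" is specified among the search/mining results of FIG. 25, document 1, document 3, document 5, and so on become the target document set. A new target document set can also be defined by specifying AND, OR, and NOT relations over several search/mining results.
[0077]
For example, in the processing of FIG. 24, specifying with AND the result named "fashionability" and the result named "private use" makes the documents that have a score defined in both the target document set. Specifying with OR the result named "operability" and the result named "function" makes the documents that have a score in either of them the target document set.
[0078]
By combining search/mining results in this way, target document sets with a particular meaning can be created. For example, the AND combination of "fashionability" and "private use" means something like "the questionnaire responses of people who are interested in fashionability for private use", and the OR combination of "operability" and "function" means something like "the questionnaire responses of people who are interested in practicality".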
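The AND, OR, and NOT combinations of saved results (paragraphs 0076 to 0078) map naturally onto set operations over the documents that carry a score. In the sketch below the `saved` mapping, its English result names, and every score are invented for illustration; FIG. 25 is not reproduced in the text, so only the membership of "fashionability" (documents 1, 3, 5) follows the description.

```python
# Hypothetical saved results: result name -> {document id: score}. Scores are invented.
saved = {
    "fashionability": {"doc1": 0.8, "doc3": 0.3, "doc5": 0.7},
    "private use":    {"doc1": 0.6, "doc2": 0.4, "doc5": 0.2},
    "operability":    {"doc1": 0.2, "doc3": 0.9, "doc6": 0.2},
    "function":       {"doc1": 0.6, "doc2": 0.7, "doc6": 0.4},
}

def scored_docs(name):
    """Documents that have a score in the named result."""
    return set(saved[name])

def combine_and(*names):
    return set.intersection(*(scored_docs(n) for n in names))

def combine_or(*names):
    return set.union(*(scored_docs(n) for n in names))

def combine_not(universe, name):
    """Documents of the universe that have no score in the named result."""
    return set(universe) - scored_docs(name)

fashion_private = combine_and("fashionability", "private use")
practical = combine_or("operability", "function")
```

The resulting set would then serve as the target document set for the next search or mining pass at step ST52.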
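[0079]
Thereafter, when a search/mining condition is entered in step ST53 for the document set created from the search/mining results specified in step ST52 of FIG. 14, the condition is received as input in step ST54, search/mining is executed in step ST55, and the result is displayed in step ST56.
[0080]
For example, when the mining processing of FIGS. 17 and 18, which analyzes the relevance between attributes and keywords, is applied to the document set obtained by combining "private use" and "fashionability" with AND, specifying "model" as the attribute name and "screen" and "ringtone" as the keywords makes it possible to analyze, model by model, how interested "people who are interested in fashionability for private use" are in the screen and in ringtones.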
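[0081]
Next, the processing of the search/mining result editing unit 112 is described.
FIG. 26 is a flowchart of that processing. In step ST111, the search/mining result editing unit 112 receives the target search/mining results and the attribute name to be assigned, both specified by the user. Suppose, for example, that "fashionability", "operability", and "function" of FIG. 25 are specified as the target results and "point of emphasis" as the attribute name. In step ST112, the search/mining result editing unit 112 reads the specified results from the search/mining result storage unit 107.
[0082]
In step ST113, the search/mining result editing unit 112 selects, for each document, the search/mining result with the highest score. For document 1 of FIG. 25, for example, "fashionability" has the highest score among "fashionability", "operability", and "function", so "fashionability" is selected for document 1. Similarly, "function" is selected for document 2, "operability" for document 3, "fashionability" for document 5, and "function" for document 6. For document 4, no search/mining result has a score, so no result is selected.
[0083]
In step ST114, the search/mining result editing unit 112 assigns to each document the name of the result selected in step ST113 as the attribute value for the attribute name specified in step ST111.
[0084]
FIG. 27 shows the result of editing by the search/mining result editing unit 112. As a result of step ST114 of FIG. 26, the attribute shown as "attribute name: point of emphasis" in FIG. 27 is generated. For document 4, no search/mining result was selected in step ST113, so no attribute value is defined.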
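Steps ST113 and ST114 (paragraphs 0082 to 0084) can be sketched as choosing, per document, the saved result with the highest score and using its name as the attribute value. The scores below are invented, since FIG. 25 is not reproduced in the text; they were picked so that the outcome matches the selections listed in paragraph 0082. The function name and data layout are hypothetical.

```python
# Hypothetical scores, chosen to reproduce the selections described for FIG. 25.
saved = {
    "fashionability": {"doc1": 0.8, "doc3": 0.3, "doc5": 0.7},
    "operability":    {"doc1": 0.2, "doc3": 0.9, "doc6": 0.2},
    "function":       {"doc1": 0.6, "doc2": 0.7, "doc6": 0.4},
}

def assign_attribute(saved_results, result_names, doc_ids):
    """ST113-ST114: per document, the name of the highest-scoring result, if any."""
    assigned = {}
    for doc_id in doc_ids:
        scored = [(saved_results[name][doc_id], name)
                  for name in result_names if doc_id in saved_results[name]]
        if scored:                      # documents with no score get no attribute value
            assigned[doc_id] = max(scored)[1]
    return assigned

doc_ids = ["doc1", "doc2", "doc3", "doc4", "doc5", "doc6"]
point_of_emphasis = assign_attribute(
    saved, ["fashionability", "operability", "function"], doc_ids)
```

Document 4 is absent from the result, mirroring the statement that no attribute value is defined for it.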
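[0085]
Similarly, when "business use" and "private use" are specified as the search/mining results and "field of use" as the attribute name in the processing of FIG. 26, the attribute shown as "attribute name: field of use" in FIG. 27 is generated.
[0086]
FIG. 28 is a flowchart of another implementation of the search/mining result editing unit 112. Whereas FIG. 26 assigns one attribute value to each document, FIG. 28 allows two or more attribute values per document. Steps ST121 and ST122 are the same as steps ST111 and ST112 of FIG. 26, respectively. As an example, the case is described where "fashionability", "operability", and "function" of FIG. 25 are specified as the results to read and "point of emphasis" as the attribute name.
[0087]
In step ST123 of FIG. 28, the search/mining result editing unit 112 selects, for each document, the search/mining results whose score is at or above a threshold. With a threshold of 0.5, for example, document 1 has scores at or above the threshold for "fashionability" and "function", so both are selected for document 1. Similarly, "function" is selected for document 2, "operability" for document 3, and "fashionability" for document 5.
[0088]
In step ST124, the search/mining result editing unit 112, following the user's instruction, assigns to each document the names of the results selected in step ST123 as attribute values for the attribute name specified in step ST121.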
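The threshold variant of steps ST123 and ST124 (paragraphs 0087 and 0088) differs from the highest-score variant only in the selection rule, so a document may receive several attribute values. As before, the scores are invented to be consistent with the selections listed in paragraph 0087 for a threshold of 0.5, and the names are hypothetical.

```python
# Hypothetical scores, consistent with the selections described for threshold 0.5.
saved = {
    "fashionability": {"doc1": 0.8, "doc3": 0.3, "doc5": 0.7},
    "operability":    {"doc1": 0.2, "doc3": 0.9, "doc6": 0.2},
    "function":       {"doc1": 0.6, "doc2": 0.7, "doc6": 0.4},
}

def assign_attribute_multi(saved_results, result_names, doc_ids, threshold=0.5):
    """ST123-ST124: per document, every result name whose score is >= threshold."""
    assigned = {}
    for doc_id in doc_ids:
        names = [name for name in result_names
                 if saved_results[name].get(doc_id, float("-inf")) >= threshold]
        if names:                       # no qualifying result: no attribute value
            assigned[doc_id] = names
    return assigned

doc_ids = ["doc1", "doc2", "doc3", "doc4", "doc5", "doc6"]
point_of_emphasis = assign_attribute_multi(
    saved, ["fashionability", "operability", "function"], doc_ids)
```

With these invented scores document 6 falls below the threshold everywhere and receives no value, which is one plausible reading of why paragraph 0087 does not mention it.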
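[0089]
FIG. 29 shows the result of editing by the search/mining result editing unit 112. As a result of step ST124 of FIG. 28, the attribute shown as "attribute name: point of emphasis" in FIG. 29 is generated. Similarly, when "business use" and "private use" are specified as the search/mining results and "field of use" as the attribute name, the attribute shown as "attribute name: field of use" in FIG. 29 is generated.
[0090]
Attributes generated by the processing of FIG. 26 or FIG. 28 can be handled in the mining processing of the mining unit 208 in the same way as attributes the documents originally had. That is, in the example of the target document set 101 of FIG. 3, the new attributes "point of emphasis" and "field of use" can be used in analysis in addition to the original attributes "gender", "age group", "model", "region", and "date".
[0091]
For example, FIG. 30 shows a mining result from the mining unit 208 in which the attribute values determined by the search/mining result editing unit 112 are placed on the axis and the relevance between each attribute value and each keyword entered as the mining condition is shown as a line graph. That is, it shows an example in which the attribute-keyword relevance analysis of FIGS. 17 and 18 was run with "point of emphasis" as the attribute name and "work", "price", and "screen" as the keywords. It can be read that "work" has low relevance to "fashionability" and high relevance to "function", and that the relevance of "price" differs little among "fashionability", "operability", and "function".
In this way, the mining unit 208 can analyze the relevance between a specified mining condition and the attribute values determined by the search/mining result editing unit 112.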
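[0092]
Likewise, for the example of FIGS. 19 and 20, in which the correlation between attribute values of documents is analyzed through the distribution of specified words, the mining unit 208 can analyze the correlation between attribute values determined by the search/mining result editing unit 112 through the distribution of the specified mining condition.
[0093]
Furthermore, for the example of FIGS. 21, 22, and 23, in which the correlation between words is analyzed through the distribution of documents, the mining unit 208 can likewise analyze the correlation between words through the distribution of the documents saved in the search/mining result storage unit 107.
[0094]
It is also possible to analyze the relevance between attributes created by the search/mining result editing unit 112.
FIG. 31 is a flowchart of processing that does so, as an example of mining processing in the mining unit 208. An example is described that analyzes the relation between "point of emphasis" and "field of use" shown in FIG. 27 or FIG. 29.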
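[0095]
In step ST131, the mining unit 208 obtains the document set for each attribute value of attribute name 1 specified by the user. When "point of emphasis" is specified as attribute name 1, the document set whose attribute values include "fashionability", the set whose attribute values include "operability", and the set whose attribute values include "function" are obtained. In step ST132, the mining unit 208 creates an attribute vector from each document set obtained in step ST131, by averaging the document vectors of the documents in the set.
[0096]
In step ST133, the mining unit 208 obtains the document set for each attribute value of the specified attribute name 2. When "field of use" is specified as attribute name 2, the document set whose attribute values include "business use" and the set whose attribute values include "private use" are obtained. In step ST134, the mining unit 208 creates an attribute vector from each document set obtained in step ST133.
[0097]
In step ST135, the mining unit 208 calculates the cosine between each attribute vector created in step ST132 and each attribute vector created in step ST134 and uses it as the relevance. In step ST136, the mining unit 208 displays as a graph the relevance of each attribute value of attribute name 2 to each attribute value of attribute name 1 and outputs it as the mining result.
[0098]
FIG. 32 shows a mining result from the mining unit 208, an example in which the result of the processing of FIG. 31 is displayed as a graph. From the graph of FIG. 32 it can be read, for example, that "fashionability" correlates strongly with "private use" and "function" correlates strongly with "business use".
[0099]
In this way, the mining unit 208 can analyze the relevance between attribute values determined by the search/mining result editing unit 112 for a plurality of attribute names. The attribute values analyzed need not all be ones determined by the search/mining result editing unit 112; an attribute value determined by the search/mining result editing unit 112 may also be analyzed for its relevance to other, entered attribute values.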
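[0100]
As described above, according to Embodiment 1, results obtained by search and text mining are saved and the saved search/mining results are associated with one another. This makes it possible to create new analysis targets and new attributes and to perform analysis based on user-defined criteria (analysis axes) not defined in the original documents.
[0101]
Also according to Embodiment 1, because the concept dictionary 204 and the document vector index 205 are used, the relevance between words, between words and documents, and between words and attributes can be judged by semantic closeness even for words whose surface forms differ. Search, mining, and the association of search/mining results can therefore all be based on semantic closeness, even for differently worded expressions.
[0102]
Furthermore, according to Embodiment 1, because the concept dictionary generation unit 201 extracts compound words, words that are more specialized than their general component words are obtained and concept indexes are generated for those compound words. Mining can therefore be based on more appropriate words, which improves its accuracy.
[0103]
Each of the processes described above can be realized on a computer by a text mining program installed on the computer.
[0104]
[Effects of the invention]
As described above, the present invention comprises: an index storage unit that stores index information for searching and mining a target document set; a search/mining execution unit that executes search/mining processing on the target document set by referring to the stored index information according to a specified search/mining condition; a search/mining result saving unit that saves the search/mining result produced by the search/mining execution unit in a search/mining result storage unit, attaching to each document its degree of relevance to the search/mining condition; and a search/mining result editing unit that reads a plurality of specified search/mining results from the search/mining result storage unit and, for a specified attribute name, selects one of the read search/mining results for each document to determine the attribute value. This makes it possible to create new analysis targets and new attributes and to perform analysis based on user-defined criteria not defined in the original documents.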
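[Brief description of the drawings]
[FIG. 1] A block diagram showing the configuration of the text mining apparatus according to Embodiment 1 of the present invention.
[FIG. 2] A detailed block diagram showing the configuration of the text mining apparatus according to Embodiment 1.
[FIG. 3] A diagram showing an example of a document in the target document set.
[FIG. 4] A flowchart showing the processing of the concept dictionary generation unit of the text mining apparatus according to Embodiment 1.
[FIG. 5] A diagram illustrating the processing of the concept dictionary generation unit according to Embodiment 1.
[FIG. 6] A flowchart showing the compound word extraction processing of the concept dictionary generation unit according to Embodiment 1.
[FIG. 7] A diagram showing an example of the compound word candidate extraction dictionary held by the concept dictionary generation unit according to Embodiment 1.
[FIG. 8] A diagram showing a compound word candidate extraction result of the concept dictionary generation unit according to Embodiment 1.
[FIG. 9] A diagram illustrating document vector index generation by the document vector index generation unit according to Embodiment 1.
[FIG. 10] A flowchart showing the processing of the document vector index generation unit according to Embodiment 1.
[FIG. 11] A diagram showing the document vector index according to Embodiment 1.
[FIG. 12] A flowchart showing the processing of the mining index generation unit according to Embodiment 1.
[FIG. 13] A diagram showing an example of a mining index generated by the mining index generation unit according to Embodiment 1.
[FIG. 14] A flowchart showing the processing of the execution unit according to Embodiment 1.
[FIG. 15] A flowchart showing the search processing executed by the search unit according to Embodiment 1.
[FIG. 16] A diagram showing an example of a search result from the search unit according to Embodiment 1.
[FIG. 17] A flowchart showing mining processing executed by the mining unit according to Embodiment 1.
[FIG. 18] A diagram showing a mining result from the mining unit according to Embodiment 1.
[FIG. 19] A flowchart showing mining processing executed by the mining unit according to Embodiment 1.
[FIG. 20] A diagram showing a mining result from the mining unit according to Embodiment 1.
[FIG. 21] A flowchart showing mining processing executed by the mining unit according to Embodiment 1.
[FIG. 22] A diagram showing a mining result from the mining unit according to Embodiment 1.
[FIG. 23] A diagram showing a mining result from the mining unit according to Embodiment 1.
[FIG. 24] A flowchart showing the processing of the search/mining result saving unit according to Embodiment 1.
[FIG. 25] A diagram showing an example of search/mining results saved in the search/mining result storage unit according to Embodiment 1.
[FIG. 26] A flowchart showing the processing of the search/mining result editing unit according to Embodiment 1.
[FIG. 27] A diagram showing an editing result from the search/mining result editing unit according to Embodiment 1.
[FIG. 28] A flowchart showing the processing of the search/mining result editing unit according to Embodiment 1.
[FIG. 29] A diagram showing an editing result from the search/mining result editing unit according to Embodiment 1.
[FIG. 30] A diagram showing a mining result from the mining unit according to Embodiment 1, using attribute values determined by the search/mining result editing unit.
[FIG. 31] A flowchart showing the processing of the mining unit according to Embodiment 1.
[FIG. 32] A diagram showing a mining result from the mining unit according to Embodiment 1.
[FIG. 33] A diagram illustrating conventional text mining analysis processing.
[FIG. 34] A diagram illustrating text mining analysis processing according to Embodiment 1 of the present invention.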
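[Explanation of reference numerals]
101 target document set, 102 index generation unit, 103 index storage unit, 104 execution unit, 105 input unit, 106 display unit, 107 search/mining result storage unit, 108 index reading unit, 109 search/mining result reading unit, 110 search/mining execution unit, 111 search/mining result saving unit, 112 search/mining result editing unit, 201 concept dictionary generation unit, 202 document vector index generation unit, 203 mining index generation unit, 204 concept dictionary, 205 document vector index, 206 mining index, 207 search unit, 208 mining unit.
[0001]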
[Technical field of the invention]
The present invention relates to a text mining apparatus and a text mining program that can analyze a set of a large number of digitized documents from various viewpoints.
[0002]
[Prior art]
Text mining has attracted attention as a technique for supporting the analysis of a large amount of document data such as questionnaires and electronic news. Text mining is a technique for extracting frequency and relevance of words from text information of document data to be processed to discover new knowledge. Conventionally, a method of supporting interactive analysis of a large amount of document data has been proposed. For example, in a text mining device disclosed in Patent Literature 1, the following processing is performed.
[0003]
From the target document set to be text-mined, phrases that appear characteristically in the set (characteristic phrases) are extracted, and from these the phrases that co-occur with the components of an analysis axis specified by the user are taken. For example, when a set of newspaper articles about "O157" is processed, the characteristic phrases "elementary school, mass infection, patient, multiple, hemorrhagic, diarrhea, symptom, hospitalization, ..." are extracted. The month of publication is then specified as the analysis axis, and the phrases co-occurring with its components (July, August, September) are acquired. As a result, co-occurring phrases are acquired such as "infection, patient, symptom, hospitalization, ..." for "July", "shock, school lunch, hospitalization, mass infection, ..." for "August", and "sales, minus, food products, fresh produce, ..." for "September".
[0004]
The analysis axes and analysis results are saved as an analysis history, and a plurality of different analysis axes are supported. When an analysis axis is added or any analysis axis is changed, the phrases likely to co-occur with each of the components of the plurality of analysis axes are narrowed down using the analysis history, thereby realizing analysis by a combination of different analysis axes.
[0005]
When an analysis axis containing specified characteristic phrases is added, the axis is added to the specified component, pairs of the specified component and the components of the added axis are created, and phrases likely to co-occur with each of the created pairs are extracted from the analysis history as co-occurring phrase candidates. From those candidates, the phrases that co-occur with each pair within a predetermined range (the same document, the same paragraph, the same sentence, within m words, or within n sentences) are acquired. For example, an analysis axis containing the co-occurring phrases "infection" and "symptom", acquired in association with the component "July", is added.
[0006]
Then, as candidates likely to co-occur with the pairs "July-infection" and "July-symptom", formed from the specified component "July" and the components of the added axis, the phrases "infection, patient, symptom, hospitalization, ..." that co-occur with "July" are extracted from the analysis history. From these candidates, the phrases that co-occur within the predetermined range with the pairs "July-infection" and "July-symptom" are acquired. As a result, "patient, symptom, prevention method, group, ..." are obtained as phrases co-occurring with "July-infection", and "nausea, diarrhea, hospitalization, severe, ..." as phrases co-occurring with "July-symptom". Here "July-infection" denotes the pair of "July" and "infection", and co-occurring with "July-infection" within the predetermined range means co-occurring with "July" within the predetermined range and also co-occurring with "infection" within the predetermined range.
[0007]
When the added analysis axis is changed, the axis is changed according to the user's instruction, and phrases likely to co-occur with the pairs of components are extracted from the analysis history as co-occurring phrase candidates. From those candidates, the phrases that co-occur with the pairs within the predetermined range are acquired. Suppose the addition of an analysis axis to the component "July" is undone and an analysis axis containing the co-occurring phrase "school lunch", acquired in association with "August", is added. Then, as candidates likely to co-occur with the pair "August-school lunch", formed from the specified component "August" and the component of the added axis, the phrases "shock, school lunch, hospitalization, mass infection, ..." that co-occur with "August" are extracted from the analysis results. From these candidates, the phrases that co-occur with the pair "August-school lunch" within the predetermined range are acquired.
[0008]
When the user enters a representative document acquisition instruction together with a co-occurring phrase specified from the text mining result, the document set that was the text mining target is searched using the specified co-occurring phrase and the phrases contained in the component corresponding to it, and documents with high scores, the most recent documents, documents with specified bibliographic information, and the like are acquired as representative documents.
[0009]
In this way, the text mining apparatus of Patent Document 1 saves the analysis axes and analysis results as an analysis history and supports a plurality of different analysis axes. When an analysis axis is added or any analysis axis is changed, it narrows down, using the analysis history, the phrases likely to co-occur with each of the components of the plurality of analysis axes, thereby realizing analysis by a combination of several different analysis axes. In addition, because the co-occurring phrases are narrowed down using the analysis history before checking which phrases actually co-occur with each component, analysis is faster than when every characteristic phrase is checked for co-occurrence.
[0010]
[Patent Document 1]
JP 2001-318939A (Page 14, FIG. 4)
[0011]
[Problems to be solved by the invention]
Since the conventional text mining device has been configured as described above, when performing the analysis work interactively, the analysis is always performed in a narrowing direction, and the stored analysis result is further narrowed down under another condition. It was only used as intermediate information. Therefore, there is a problem in that the user cannot create a new analysis standard by associating a plurality of stored analysis results and perform analysis based on the standard.
[0012]
For example, it is not possible to create a new analysis axis (attribute) called "social impact" for the analysis results obtained with the characteristic phrases "lunch", "sales", and "food", to create a new analysis axis (attribute) called "measures" for the analysis results obtained with the characteristic phrases "prevention", "cooking", and "information disclosure", and then to analyze the relationship between these attributes.
[0013]
The present invention has been made in order to solve the above-described problems. Its object is to obtain a text mining device and a text mining program that, by associating stored analysis results with each other, can create a new analysis target and a new attribute, and can perform analysis based on a user-defined criterion (analysis axis) that is not defined in the original documents.
[0014]
That is, in Patent Document 1, as shown in FIG. 33, an analysis process 404 for performing segmentation is repeatedly applied to an analysis result 403 output by performing an analysis process 402 on a target document set 401. In the present invention, by contrast, as shown in FIG. 34, analysis processing 412 is performed on the target document set 411 to output a plurality of analysis results 413 and 414, and analysis processing 415 associates the analysis results 413 and 414 with each other to obtain a new analysis result 416. As a result, according to the present invention, analysis across a plurality of analysis axes defined by the user can be performed.
[0015]
[Means for Solving the Problems]
A text mining device according to the present invention includes: an index storage unit that stores index information for searching and mining a target document set; a search / mining execution unit that executes search / mining processing of the target document set with reference to the index information stored in the index storage unit according to specified search / mining conditions; a search / mining result storage unit that, based on the search / mining results obtained by the search / mining execution unit, assigns to each document the degree of relevance between that document and the search / mining conditions and saves the results in a search / mining result storage unit; and a search / mining result editing unit that reads a plurality of specified search / mining results from the search / mining result storage unit and, in correspondence with a specified attribute name, selects among the read search / mining results for each document to determine the attribute value.
[0016]
[BEST MODE FOR CARRYING OUT THE INVENTION]
Hereinafter, an embodiment of the present invention will be described.
Embodiment 1.
FIG. 1 is a block diagram showing a configuration of a text mining device according to Embodiment 1 of the present invention. From the target document set 101, index information for searching and mining the target document set 101 is generated by the index generation unit 102 and stored in the index storage unit 103. The execution unit 104 reads the index information in the index storage unit 103, executes a search or mining process according to the search condition or the mining condition input from the input unit 105, and outputs a search / mining result to the display unit 106. When an instruction to save the search / mining result is received from the user, the execution unit 104 saves the search / mining result in the search / mining result storage unit 107.
[0017]
In the execution unit 104, the index reading unit 108 reads the index information stored in the index storage unit 103. The search / mining result reading unit 109 reads the search / mining result stored in the search / mining result storage unit 107. The search / mining execution unit 110 refers to the index information stored in the index storage unit 103 according to the specified search / mining conditions, and executes search / mining processing on the target document set 101 or on the search / mining result read by the search / mining result reading unit 109.
[0018]
In the execution unit 104, the search / mining result storage unit 111 assigns to each document the degree of relevance between that document and the search / mining condition, based on the search / mining result obtained by the search / mining execution unit 110, and stores the result in the search / mining result storage unit 107. The search / mining result editing unit 112 reads a plurality of specified search / mining results from the search / mining result storage unit 107 and, in accordance with the specified attribute name, selects among the read search / mining results for each document to determine the attribute value.
[0019]
FIG. 2 is a detailed block diagram showing a configuration of the text mining apparatus in which the index generation unit 102, the index storage unit 103, and the search / mining execution unit 110 in FIG. 1 are shown in detail. The index generation unit 102 includes a concept dictionary generation unit 201, a document vector index generation unit 202, and a mining index generation unit 203. The concept dictionary generation unit 201 and the mining index generation unit 203 generate a concept dictionary 204 and a mining index 206 from the target document set 101, respectively, and the document vector index generation unit 202 refers to the concept dictionary 204 to generate a document vector index 205 from the target document set 101. The search / mining execution unit 110 includes a search unit 207 that executes a search process and a mining unit 208 that executes a mining process.
[0020]
FIG. 3 is a diagram illustrating an example of documents in the target document set 101. The first embodiment will be described based on an example of analyzing a questionnaire on mobile phones. Each document in the target document set 101 is composed of fields whose values fall within a predetermined range, such as “sex”, “age”, “model”, “region”, and “date”, and a field described in free text, such as "Opinion". The former fields are hereinafter referred to as attributes in the document. The item name of an attribute is called an attribute name, and its value is called an attribute value. For example, the document in the target document set 101 shown in FIG. 3 has the attribute value “male” for the attribute name “sex”.
[0021]
In the example shown in FIG. 3, one attribute value is assigned to one attribute name. However, when a question in the questionnaire is left unanswered, an attribute value does not necessarily have to be assigned to the attribute name. Further, when a respondent may specify two or more values, as for “model”, two or more attribute values may be given to one attribute name.
[0022]
FIG. 4 is a flowchart showing the flow of the process of the concept dictionary generator 201 shown in FIG. 2, and FIG. 5 is a diagram for explaining the process of the concept dictionary generator 201.
In step ST11 in FIG. 4, the concept dictionary generation unit 201 morphologically analyzes the text included in the document set 301 in FIG. 5 to divide a character string in the text into words. Since the morphological analysis is a widely known technique, a detailed description is omitted here. At this time, the learning target document set 301 shown in FIG. 5 is not necessarily the target document set 101 itself, but may be another document set in the same field as the target document set 101.
[0023]
In step ST12, the concept dictionary generation unit 201 extracts a compound word from the morphological analysis result.
FIG. 6 is a flowchart showing the flow of the compound word extraction process of the concept dictionary generation unit 201. In step ST21 of FIG. 6, the concept dictionary generation unit 201 extracts compound word candidates based on a morpheme concatenation pattern described in a compound word candidate extraction dictionary stored therein.
[0024]
FIG. 7 is a diagram showing an example of a compound word candidate extraction dictionary that describes morpheme concatenation patterns for extracting compound word candidates; the example extracts candidates from the concatenation relationship between two morphemes. For example, pattern number 001 indicates that "when two morphemes whose part of speech is noun occur in succession, the sequence of those morphemes is extracted as a compound word candidate." As a result, character strings in which two nouns are consecutive in the text, such as “model / change” and “confirmation / screen”, are extracted as compound word candidates.
[0025]
Similarly, pattern number 002 extracts, as a compound word candidate, a character string in which a morpheme whose part of speech is noun is followed by a morpheme whose part of speech is suffix, such as “mobile / sex” and “function / target”. Pattern number 003 extracts, as a compound word candidate, a character string in which a morpheme whose part of speech is noun is followed by a morpheme whose notation is “feeling”, such as “season / feeling” and “use / feeling”. Note that “/” is used as a symbol indicating a morpheme delimiter.
FIG. 8 is a diagram illustrating a result of extracting compound word candidates from the target document set 101, focusing on a character string “model change”.
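The pattern-based candidate extraction of step ST21 can be sketched as follows. This is a minimal illustration, not the dictionary format of FIG. 7: the `PATTERNS` table, the `(notation, part of speech)` morpheme representation, and the part-of-speech tags in the sample sentence are all assumptions made for the example.

```python
# Hypothetical pattern table modeled on pattern numbers 001 to 003 in the text.
# Each entry: (pattern number, test on the first morpheme, test on the second).
# A morpheme is assumed to be a (notation, part of speech) pair.
PATTERNS = [
    ("001", lambda m: m[1] == "noun", lambda m: m[1] == "noun"),
    ("002", lambda m: m[1] == "noun", lambda m: m[1] == "suffix"),
    ("003", lambda m: m[1] == "noun", lambda m: m[0] == "feeling"),
]

def extract_candidates(morphemes):
    """Map each matching adjacent pair 'first/second' to the pattern numbers it matched."""
    found = {}
    for first, second in zip(morphemes, morphemes[1:]):
        for number, test_first, test_second in PATTERNS:
            if test_first(first) and test_second(second):
                found.setdefault(f"{first[0]}/{second[0]}", []).append(number)
    return found

# Made-up output of a morphological analyzer for illustration only.
sentence = [("model", "noun"), ("change", "noun"), ("is", "verb"),
            ("use", "noun"), ("feeling", "suffix")]
candidates = extract_candidates(sentence)
```

With this tagging, "use/feeling" matches both pattern 002 and pattern 003, which is why the sketch records every matching pattern number rather than only the first.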
[0026]
In step ST22 of FIG. 6, the concept dictionary generation unit 201 uses statistical information to select, from the compound word candidates extracted in step ST21, the compound words to be registered in the co-occurrence frequency table. The co-occurrence frequency table is a table showing the co-occurrence relationships of words, as described later. Since the number of compound word candidates extracted in step ST21 is enormous, only important words, narrowed down using word statistics, are registered in the co-occurrence frequency table. As the statistical information, a value calculated by a known method is used, such as the appearance frequency or the tf * idf value given by the following equation.
tf * idf(w) = f(w) * log(Nd / d(w))
Here, f (w) indicates the frequency of appearance of the word w, Nd indicates the number of documents in the target document set, and d (w) indicates the number of documents in which the word w appears.
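As a rough illustration of the equation above, the following sketch computes the tf * idf value over tokenized documents. It assumes that f(w) is the total number of occurrences of w in the document set and that the logarithm is the natural logarithm; the text fixes neither point, and the function name and sample data are hypothetical.

```python
import math

def tf_idf(word, documents):
    """tf*idf(w) = f(w) * log(Nd / d(w)) over a list of tokenized documents."""
    f_w = sum(doc.count(word) for doc in documents)    # f(w): total occurrences (assumption)
    d_w = sum(1 for doc in documents if word in doc)   # d(w): documents containing w
    if d_w == 0:
        return 0.0                                     # word never appears
    return f_w * math.log(len(documents) / d_w)        # Nd = len(documents), natural log (assumption)

# Made-up tokenized documents.
docs = [
    ["model", "change", "screen"],
    ["screen", "screen", "large"],
    ["ringtone", "melody"],
    ["model", "change"],
]
scores = {w: tf_idf(w, docs) for w in ["screen", "ringtone", "change"]}
```

Candidates whose score is at or above a chosen threshold, or the top-ranked candidates up to a fixed count, would then be kept, as the following paragraph describes.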
[0027]
In step ST22, a compound word candidate whose statistical information value, such as an appearance frequency value or a tf * idf value, is equal to or greater than a preset threshold is selected as a compound word to be registered in the co-occurrence frequency table. Alternatively, the number of compound words to be extracted may be determined in advance, and compound word candidates may be selected in descending order of statistical information value. In the first embodiment, the description has used sequences of two words as an example, but the present invention is not limited to two words; a concatenation relationship of three or more words may be described in the compound word candidate extraction dictionary and used for compound word extraction.
[0028]
As described above, because the concept dictionary generation unit 201 extracts compound words, more specialized words formed by combining general words are extracted, and a concept index for each compound word is generated by the processing described later. Mining can therefore be based on more appropriate words, which improves the accuracy of the mining process.
[0029]
Next, in step ST13 of FIG. 4, the concept dictionary generation unit 201 calculates the co-occurrence frequency, which is the number of times two words appear together in each document of the document set 301, and obtains the co-occurrence frequency table 302 shown in FIG. 5. The co-occurrence frequency table 302 is a table showing how many times the word on the vertical axis co-occurs with the word on the horizontal axis.
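A co-occurrence frequency table of this kind can be sketched as below. The sketch assumes that co-occurrence is counted once per document in which both words appear; the text does not state the exact counting rule, and the function name and sample documents are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_table(documents):
    """Count, for every pair of distinct words, the documents that contain both."""
    table = Counter()
    for doc in documents:
        for a, b in combinations(sorted(set(doc)), 2):
            table[(a, b)] += 1
            table[(b, a)] += 1   # keep the table symmetric
    return table

# Made-up tokenized documents.
docs = [
    ["screen", "size", "color"],
    ["liquid crystal", "size", "color"],
    ["screen", "brightness"],
]
table = cooccurrence_table(docs)
```

Here "screen" and "liquid crystal" never appear in the same document, yet both co-occur with "size" and "color", which is the situation the concept dictionary is later said to exploit.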
[0030]
In step ST14, the concept dictionary generation unit 201 performs singular value decomposition of the co-occurrence frequency table 302. Singular value decomposition is a known linear algebra technique that decomposes a matrix A, here the co-occurrence frequency table 302 shown in FIG. 5, into a product of three matrices (UΣV) 303, 304, and 305. A document search method using a concept dictionary created by singular value decomposition is described, for example, in Document 3 (“Information search system InfoMAP based on word associations, Takayama et al., Informatics Basics 53-1, 1999-3”). Note that eigenvalue decomposition may be used instead of singular value decomposition.
[0031]
In step ST15, the concept dictionary generation unit 201 takes, from the matrix U (left singular matrix) 303 obtained by the singular value decomposition in step ST14, the k columns corresponding to the k largest singular values in the matrix Σ (singular value matrix) 304 (k is smaller than the number of columns of the original matrix A), and outputs them as the concept dictionary 204. The horizontal axis of the concept dictionary 204 corresponds to these k column components of the matrix U. The concept dictionary 204 is dimensionally compressed relative to the co-occurrence frequency table 302, and each row can be regarded as a concept vector of a word that reflects higher-order correlations. Hereinafter, the concept vector of a word is referred to as a word vector. Here, the concept vector is a vector whose components are the leftmost k dimensions of the matrix U generated by the singular value decomposition of the co-occurrence frequency table 302.
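In standard notation, steps ST14 and ST15 can be summarized as follows. The text writes the product as UΣV; the usual convention writes the third factor as a transpose. The symbols n (number of columns of A), r (number of singular values), and u_j (the j-th column of U) are introduced here only for illustration.

```latex
A = U \Sigma V^{\mathsf{T}}, \qquad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r), \qquad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r \ge 0

U_k = \begin{bmatrix} u_1 & u_2 & \cdots & u_k \end{bmatrix}, \qquad k < n

\text{word vector of word } i = \left( U_{i1},\, U_{i2},\, \dots,\, U_{ik} \right)
```

Under this reading, the concept dictionary 204 is the matrix U_k, with one row per word.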
[0032]
The concept dictionary 204 created in this way has the feature that words co-occurring with the same words have similar word vectors, that is, words having similar concepts have similar word vectors.
[0033]
Next, the processing of the document vector index generation unit 202 will be described.
FIG. 9 is a diagram illustrating the generation of a document vector index by the document vector index generation unit 202. The document vector index generation unit 202 generates a document vector index 205 from the target document set 101 with reference to the concept dictionary 204.
[0034]
FIG. 10 is a flowchart showing the flow of processing of the document vector index generation unit 202. In step ST31, the document vector index generation unit 202 morphologically analyzes each document in the document set 101 and divides the text in the document into words. In step ST32, the document vector index generation unit 202 calculates the frequency of each word that appears in each document. In step ST33, the document vector index generation unit 202 extracts a word vector for each word from the concept dictionary 204.
[0035]
In step ST34, the document vector index generation unit 202 multiplies the word vector of each word appearing in a document by the frequency calculated in step ST32 as a coefficient, sums the resulting vectors, and outputs the sum to the document vector index 205 as the entry for that document. Note that the vector corresponding to each document stored in the document vector index 205 is called a document vector.
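Steps ST32 to ST34 amount to a frequency-weighted sum of word vectors, which can be sketched as below. The two-dimensional word vectors are made up, and skipping words that are absent from the concept dictionary is an assumption; the text does not say how such words are handled.

```python
from collections import Counter

def document_vector(words, concept_dictionary):
    """Sum the word vectors of a document, each weighted by the word's frequency."""
    k = len(next(iter(concept_dictionary.values())))   # dimension of the word vectors
    vector = [0.0] * k
    for word, freq in Counter(words).items():
        word_vector = concept_dictionary.get(word)
        if word_vector is None:                        # assumption: unknown words are skipped
            continue
        for i, value in enumerate(word_vector):
            vector[i] += freq * value
    return vector

# Hypothetical concept dictionary with k = 2.
concept_dictionary = {
    "screen": [0.5, 0.25],
    "large": [0.25, 0.0],
    "ringtone": [0.0, 1.0],
}
doc_vec = document_vector(["screen", "large", "screen", "battery"], concept_dictionary)
```

In this example "screen" appears twice, so its word vector contributes with coefficient 2, and "battery" contributes nothing because it is not in the dictionary.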
[0036]
FIG. 11 is a diagram showing the document vector index 205, which has a structure in which each document is associated with a document vector. The document vector has the same dimension as the word vector in the concept dictionary 204, and documents having similar document vectors have a feature that they have similar contents.
[0037]
Next, the processing of the mining index generation unit 203 will be described.
FIG. 12 is a flowchart showing the flow of processing of the mining index generation unit 203. In step ST41, the mining index generation unit 203 acquires a list of attribute names included in the target document set 101. In the example shown in FIG. 3, “sex”, “age”, “model”, “region”, and “date” are acquired as attribute names. In step ST42, the mining index generation unit 203 acquires, for each document in the target document set 101, the attribute value corresponding to each attribute name acquired in step ST41. For example, from the document shown in FIG. 3, “male” is obtained for “sex”, “20s” for “age”, “model 1” for “model”, “Tokyo” for “region”, and “2002-01-14” for “date”.
[0038]
The mining index generation unit 203 obtains attribute values for all documents in the target document set 101, and writes the obtained attribute names and attribute values to the mining index 206 in step ST43.
FIG. 13 is a diagram illustrating an example of a mining index generated by the mining index generation unit 203.
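One possible in-memory shape for such a mining index is sketched below. The layout (attribute name, then attribute value, then the set of document identifiers) is an assumption and is not the format of FIG. 13; the second document is made up to show a missing value and a multi-valued attribute.

```python
def build_mining_index(documents):
    """Map each attribute name to {attribute value: set of document ids}."""
    index = {}
    for doc_id, attributes in documents.items():
        for name, values in attributes.items():
            for value in values:   # an attribute may carry zero, one, or several values
                index.setdefault(name, {}).setdefault(value, set()).add(doc_id)
    return index

documents = {
    # Attribute values of the document described for FIG. 3.
    "document 1": {"sex": ["male"], "age": ["20s"], "model": ["model 1"],
                   "region": ["Tokyo"], "date": ["2002-01-14"]},
    # Made-up document: no answer for "age", two values for "model".
    "document 2": {"sex": ["female"], "age": [], "model": ["model 1", "model 2"]},
}
mining_index = build_mining_index(documents)
```

Storing document identifiers per attribute value makes it cheap to fetch all documents having a given attribute value, which is what the mining process needs first.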
[0039]
Next, the processing of the execution unit 104 will be described.
FIG. 14 is a flowchart illustrating the flow of the process of the execution unit 104. In step ST51, the index reading unit 108 reads, from the index storage unit 103, three pieces of index information of the concept dictionary 204, the document vector index 205, and the mining index 206.
[0040]
In step ST52, the search / mining result reading unit 109 reads the search / mining result specified by the user from the search / mining result storage unit 107. Initially, nothing has been written to the search / mining result storage unit 107, so the flow passes through step ST52 without performing any processing. In the first embodiment, the search / mining result refers to a document set stored in the format shown in FIG. 25, described later. The search / mining target at this time is the entire target document set 101 registered in the document vector index 205.
[0041]
In step ST53, the search / mining execution unit 110 determines whether or not a search / mining condition has been input by the user. If a search / mining condition has been input, it is received as input in step ST54, and the search / mining process is executed in step ST55.
[0042]
Here, the search processing executed by the search unit 207 among the search and mining processing in step ST55 shown in FIG. 14 will be described.
FIG. 15 is a flowchart illustrating the flow of a search process performed by the search unit 207. The search unit 207 searches the target document set 101 using the document vector index 205. In step ST61, the search unit 207 inputs search conditions specified by the user from the input unit 105 as text. Here, for example, “model with a large screen” is input as a search condition.
[0043]
In step ST62, the search unit 207 generates a concept vector for the input search condition with reference to the concept dictionary 204. Here, the concept vector is a combination of the word vectors for “screen”, “large”, and “model”. Hereinafter, the concept vector corresponding to the search condition is referred to as a search vector. The search vector generation processing in step ST62 is performed in the same manner as the processing in steps ST31 to ST34 in FIG. 10.
[0044]
In step ST63, the search unit 207 calculates a cosine value between the search vector and a document vector corresponding to each document of the stored target document set 101, and sets the calculated cosine value as the similarity. In step ST64, the search unit 207 outputs the search results arranged in the order of similarity.
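Steps ST63 and ST64 can be sketched as a cosine ranking over the document vectors. The vectors below are made up, and returning zero similarity for a zero-length vector is an assumption added to keep the function total.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def search(search_vector, document_vectors):
    """Return (document id, similarity) pairs in descending order of similarity."""
    scored = [(doc_id, cosine(search_vector, vec)) for doc_id, vec in document_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical two-dimensional document vectors and search vector.
document_vectors = {
    "document 1": [1.0, 0.0],
    "document 2": [1.0, 1.0],
    "document 3": [0.0, 1.0],
}
results = search([2.0, 0.0], document_vectors)
```

Because the cosine ignores vector length, the search vector [2.0, 0.0] is treated the same as [1.0, 0.0]; only direction in concept space matters.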
[0045]
FIG. 16 is a diagram illustrating an example of a search result obtained by the search unit 207. It shows the result obtained when “large model” is input as a search condition; for each document, a character string identifying the document (document n in FIG. 16), the degree of similarity, and the free description part of the document are displayed.
[0046]
Note that a similarity threshold may be set for the documents of the target document set 101 displayed as a search result, so that only documents at or above the threshold are displayed. Alternatively, a maximum number of items to be displayed may be set in advance, and only that number of documents may be displayed in descending order of similarity. Further, narrowing down by document attributes such as “sex” and “age” may be used together. Further, in the first embodiment, the similarity between vectors is calculated using the concept dictionary 204 and the document vector index 205; however, other methods may be used as long as the similarity between the input search condition and each document can be calculated.
[0047]
Next, the mining process executed by the mining unit 208 among the search / mining processes in step ST55 shown in FIG. 14 will be described. By performing this mining process, the degree of association between the keyword and the attribute can be analyzed.
FIG. 17 is a flowchart showing the flow of the mining process executed by the mining unit 208. In step ST71, the mining unit 208 receives mining conditions from the input unit 105. Here, the input mining conditions are attributes and keywords. A case will be described in which "model" is specified as the attribute name, "model 1", "model 2", "model 3", and "model 4" are specified as the attribute values, and "screen" and "ringtone" are specified as keywords.
[0048]
In step ST72, the mining unit 208 acquires the target document set 101 having the specified attribute value. That is, for the attribute name “model”, a target document set 101 including “model 1” in the attribute value, a target document set 101 including “model 2”, a target document set 101 including “model 3”, and a target document set 101 including “model 4” are acquired. In step ST73, the mining unit 208 creates an attribute vector for each of the acquired target document sets 101. The attribute vector is a vector created by averaging, element by element, the document vectors of the target document set 101 having the same attribute value.
[0049]
In step ST74, the mining unit 208 obtains the degree of association between each attribute value and the keyword. The degree of association is calculated as the cosine value between the attribute vector and the word vector extracted from the concept dictionary 204 for the keyword. In step ST75, the mining unit 208 outputs the obtained degrees of association between the attributes and the keywords as a graph, which is the mining result.
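Steps ST73 and ST74 can be sketched as follows: average the document vectors per attribute value, then take the cosine with each keyword's word vector. All vectors are made up, chosen only so that the outcome resembles the tendency later described for "model 3" and "model 4"; the variable names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def attribute_vector(vectors):
    """Element-wise average of the document vectors sharing one attribute value."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Hypothetical document vectors grouped by the value of the attribute "model".
docs_by_model = {
    "model 3": [[0.0, 2.0], [1.0, 2.0]],
    "model 4": [[2.0, 0.0], [2.0, 1.0]],
}
# Hypothetical word vectors for the keywords.
keyword_vectors = {"screen": [1.0, 0.0], "ringtone": [0.0, 1.0]}

association = {
    model: {kw: cosine(attribute_vector(vecs), kv) for kw, kv in keyword_vectors.items()}
    for model, vecs in docs_by_model.items()
}
```

The resulting nested mapping holds one degree of association per (attribute value, keyword) pair, which is the data a line graph with attribute values on the axis would plot.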
[0050]
FIG. 18 is a diagram illustrating a mining result by the mining unit 208. The attribute value of the model is used as an axis, and the value of the degree of association between each attribute value and the keyword is represented using a line graph. From the graph of FIG. 18, it can be seen that the analysis result indicates that “model 3” has a low interest in the screen and high interest in the ringtone, and conversely, “model 4” has a high interest in the screen and low interest in the ringtone.
[0051]
The attribute may be specified by specifying only the attribute name, and the mining unit 208 may automatically extract all the attribute values corresponding to the attribute name from the mining index 206 and perform the analysis. Alternatively, when specifying an attribute value, a relationship of a product (AND), a sum (OR), or a negation (NOT) of each attribute value may be specified, such as model 1 or model 2.
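The product (AND), sum (OR), and negation (NOT) of attribute values map naturally onto set operations over document identifiers. The sketch below assumes a hypothetical index fragment that stores, for each attribute value, the set of documents carrying it; the identifiers are made up.

```python
# Hypothetical index fragment: attribute value -> ids of documents carrying it.
model_index = {
    "model 1": {1, 2, 5},
    "model 2": {2, 3},
    "model 3": {4},
}
all_documents = {1, 2, 3, 4, 5}

model_1_or_2 = model_index["model 1"] | model_index["model 2"]    # sum (OR)
model_1_and_2 = model_index["model 1"] & model_index["model 2"]   # product (AND)
not_model_1 = all_documents - model_index["model 1"]              # negation (NOT)
```

Document 2 appears in both "model 1" and "model 2" here, which corresponds to a respondent who specified two models.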
[0052]
In addition, a keyword may consist not only of a single word such as “screen” or “ringtone” but also of a plurality of words such as “screen, color” or “ringtone, chord”. When a plurality of words are specified for a keyword, the vector corresponding to the keyword is created by averaging the word vectors of the words included in the keyword. When an attribute name such as “date” is specified, the target document set 101 may be acquired in step ST72 for a range of attribute values, such as “from 2002-01-01 to 2002-01-31”, and its relationship with the keyword may be analyzed.
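The two variations above can be sketched briefly: averaging word vectors for a multi-word keyword, and selecting documents whose "date" falls in a range. The word vectors, document dates, and function names are assumptions for illustration, and the range is treated as inclusive at both ends.

```python
from datetime import date

def keyword_vector(words, concept_dictionary):
    """Average the word vectors of all words that make up one keyword."""
    vectors = [concept_dictionary[w] for w in words]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def documents_in_range(dates_by_document, start, end):
    """Ids of documents whose date lies between start and end, inclusive."""
    return sorted(doc for doc, day in dates_by_document.items() if start <= day <= end)

# Hypothetical word vectors.
concept_dictionary = {"screen": [1.0, 0.0], "color": [0.5, 0.5]}
screen_color = keyword_vector(["screen", "color"], concept_dictionary)

# Made-up document dates.
dates = {
    "document 1": date(2002, 1, 14),
    "document 2": date(2002, 2, 3),
    "document 3": date(2002, 1, 31),
}
january = documents_in_range(dates, date(2002, 1, 1), date(2002, 1, 31))
```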
[0053]
Next, an example in which the correlation between two attribute values in the mining processing executed by the mining unit 208 is analyzed by keyword distribution will be described.
FIG. 19 is a flowchart showing the flow of the mining process executed by the mining unit 208. In step ST81, the mining unit 208 inputs mining conditions from the input unit 105. Here, the mining conditions input by the user are two attribute values and a keyword. A case will be described in which the attribute values “male” and “female” of the attribute name “sex” are specified as the attribute values, and “flip”, “charge”, “ringtone”, and “character” are specified as the keywords.
[0054]
In step ST82, the mining unit 208 acquires the target document set 101 having the specified attribute value. That is, for the attribute name “sex”, the target document set 101 whose attribute value is “male” and the target document set 101 whose attribute value is “female” are acquired. In step ST83, the mining unit 208 creates an attribute vector for each of the acquired target document sets 101. The method of creating the attribute vector is the same as the process of step ST73 in FIG. 17.
[0055]
In step ST84, the mining unit 208 obtains the degree of association between each attribute vector and each keyword input as the mining condition. The calculation of the degree of association is the same as the processing in step ST74 in FIG. 17. In step ST85, the mining unit 208 plots each keyword at its coordinates, using the attribute values as coordinate axes, and outputs the plot as the mining result.
[0056]
FIG. 20 is a diagram illustrating a mining result by the mining unit 208. In this example, “male” is set on the X coordinate axis and “female” on the Y coordinate axis; for each keyword, the degree of association between its concept vector and the attribute vector of “male” is used as the X coordinate, and the degree of association with the attribute vector of “female” is used as the Y coordinate. From the graph of FIG. 20, it can be seen that "character" and "ringtone" have a high degree of association with “female”, and "flip" has a high degree of association with “male”.
[0057]
The attribute may be specified by combining a plurality of attributes, such as “teens and female”. In this case, in step ST82, a process of extracting the target document set 101 in which the attribute value of the attribute name “age” is “teens” and the attribute value of the attribute name “sex” is “female” is performed. Also, a keyword may be specified by combining a plurality of words. Further, instead of the user specifying keywords, keywords may be extracted automatically from the target document set 101 using a dictionary or statistical information, and the extracted keywords may be input as mining conditions.
[0058]
Next, among mining processes executed by the mining unit 208, an example will be described in which the correlation between keywords is analyzed based on the distribution of documents.
FIG. 21 is a flowchart showing the flow of the mining process executed by the mining unit 208. In step ST91, the mining unit 208 inputs mining conditions from the input unit 105. Here, the mining conditions input by the user are keywords. A case where “melody” and “song” are designated as keywords will be described.
[0059]
In step ST92, the mining unit 208 obtains the degree of association between the word vector extracted from the concept dictionary 204 for each keyword and the document vector corresponding to each document in the target document set 101 by the cosine value of the vector. In step ST93, the mining unit 208 displays the coordinates of each document using the keyword as a coordinate axis and outputs the result as a mining result.
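Steps ST92 and ST93 can be sketched as computing one coordinate pair per document, with one cosine value per keyword axis. The word vectors and document vectors below are made up, and the plotting itself is left out.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical word vectors for the two keywords used as coordinate axes.
keyword_axes = {"melody": [1.0, 0.0], "song": [0.8, 0.6]}
# Hypothetical document vectors.
document_vectors = {"document 1": [2.0, 0.5], "document 2": [0.0, 1.0]}

# (X, Y) = (association with "melody", association with "song") for each document.
coordinates = {
    doc_id: (cosine(vec, keyword_axes["melody"]), cosine(vec, keyword_axes["song"]))
    for doc_id, vec in document_vectors.items()
}
```

When the two keyword vectors point in similar directions, documents tend to fall near the diagonal of the plot, which is how related keywords show up in such a graph.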
[0060]
FIG. 22 is a diagram illustrating a mining result by the mining unit 208. In this example, “melody” is set on the X coordinate axis and “song” on the Y coordinate axis, and each document is plotted at the point whose X coordinate is the degree of association between its document vector and the concept vector of “melody” and whose Y coordinate is the degree of association with “song”. From the graph of FIG. 22, it can be read that a document having a high degree of association with “melody” also has a high degree of association with “song”, and thus “melody” and “song” are closely related words.
[0061]
FIG. 23 is a diagram similarly showing a mining result by the mining unit 208. This example shows the mining result when the processing of FIG. 21 is performed with “sound” and “color” specified as keywords. From this graph, it can be read that "sound" and "color" are words with a low degree of relation.
[0062]
In the first embodiment, the relevance between a word and an attribute and the relevance between a word and a document are calculated by using the concept dictionary 204 and the document vector index 205. In the concept dictionary 204, for example, "screen" and "liquid crystal" co-occur with the same words, such as "size", "brightness", and "color", so that the distance between the word vectors of words that appear in similar contexts becomes small. Therefore, when calculating the degree of association between a document and the word “screen”, if the document does not include the word “screen” but includes the word “liquid crystal”, the degree of association between the document and “screen” still becomes high. In this way, even when the word notation is different, mining that determines the degree of relevance based on closeness in meaning can be realized.
[0063]
Returning to the flowchart of FIG. 14, in step ST56 the search / mining execution unit 110 displays the result of the search / mining process executed in step ST55. In step ST57, the search / mining result storage unit 111 checks whether the user has issued an instruction to save the search / mining result. If there is no instruction, the process returns to step ST53, and the search / mining execution unit 110 receives the next search / mining condition and executes a search / mining process. If there is no next search / mining condition, the process ends.
[0064]
When the search / mining result storage unit 111 receives an instruction from the user in step ST57 to save the search / mining result, it stores the search / mining result in the search / mining result storage unit 107 in step ST58.
[0065]
FIG. 24 is a flowchart showing the processing flow of the search / mining result storage unit 111 for storing the search / mining result. In step ST101, the document set to be saved as a search / mining result is determined by the user. For example, in the case of a search result, the search result output in step ST64 of FIG. 15 may be used as is as the document set to be stored, or the document set to be stored may be narrowed down from the search result by setting conditions such as a similarity threshold or a number of results.
[0066]
In the case of the mining process shown in FIGS. 17 and 18, which analyzes the degree of association between an attribute and a keyword, an attribute and a keyword may be designated, and the documents that have the specific attribute and a high degree of association with the specific keyword may be used as the document set. For example, a document set whose attribute value is “model 4” and which has a high degree of association with the keyword “screen” may be set as a document set to be stored.
[0067]
Furthermore, in the example of the mining process for analyzing the correlation between attributes described with reference to FIGS. 19 and 20, an attribute and a keyword may be designated, and the documents that have the specific attribute and a high degree of association with the specific keyword may be used as the document set. For example, a document set whose attribute value of the attribute name “sex” is “female” and which has a high degree of association with the keyword “character” may be set as a document set to be stored.
[0068]
Further, in the example of the mining process for analyzing the correlation between keywords described with reference to FIGS. 21, 22, and 23, a document set having a high relation with the keyword specified as the coordinate axis may be set as a document set to be stored. For example, a document set to be stored may be a document having a high degree of relevance to one or both of the keyword “melody” and the keyword “song”.
[0069]
Next, in step ST102 in FIG. 24, the search / mining result storage unit 111 explicitly assigns a name to the search / mining result so that it can be referred to later, in accordance with a user's instruction. In step ST103, the search / mining result storage unit 111 assigns a score to each document to be stored and stores it.
[0070]
The score is the degree of association between the document and the keyword specified in the search or mining. For example, in the case of the search processing shown in FIG. 15, the score is the similarity (relevance) with the search condition specified in step ST61.
[0071]
In the case of the mining process of FIGS. 17 and 18, which analyzes the degree of association between an attribute and a keyword, the degree of association with the keyword specified in the mining condition is used as the score. For example, if the documents whose attribute value for the attribute name “model” is “model 4” and that have a high degree of association with the keyword “screen” form the document set to be saved, the degree of association with “screen” is used as the score.
[0072]
Further, in the example of the mining process of FIGS. 19 and 20, which analyzes the correlation between attributes, the degree of association with the keyword specified in the mining condition is used as the score. For example, when the documents whose attribute value for the attribute name “sex” is “female” and that have a high degree of association with the keyword “character” form the document set to be stored, the degree of association with “character” is used as the score.
[0073]
Further, in the example of the mining process for analyzing the correlation between the keywords described in FIGS. 21, 22, and 23, the degree of relevance with the keyword specified on the coordinate axis is used as the score. For example, the degree of relevance between each document and one of the keyword “melody” and the keyword “song” or the degree of relevance to a vector obtained by combining both word vectors is used as a score.
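The saving flow of steps ST101 to ST103 can be sketched roughly as follows. This is only an illustrative sketch: the names `SavedResult` and `save_result`, the parameters `min_score` and `max_docs`, and the score values are assumptions introduced here and do not appear in the embodiment.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SavedResult:
    """A named search / mining result: document id -> score (degree of association)."""
    name: str
    scores: Dict[str, float] = field(default_factory=dict)


def save_result(store: Dict[str, SavedResult],
                name: str,
                scored_docs: Dict[str, float],
                min_score: Optional[float] = None,
                max_docs: Optional[int] = None) -> SavedResult:
    """ST101: optionally narrow the document set; ST102: name it; ST103: store scores."""
    ranked = sorted(scored_docs.items(), key=lambda kv: kv[1], reverse=True)
    if min_score is not None:          # narrowing by a similarity threshold
        ranked = [(doc, s) for doc, s in ranked if s >= min_score]
    if max_docs is not None:           # narrowing by a number of hits
        ranked = ranked[:max_docs]
    store[name] = SavedResult(name, dict(ranked))
    return store[name]


# Hypothetical similarity scores returned by a search (not taken from FIG. 16 or FIG. 25).
hits = {"document 1": 0.9, "document 2": 0.1, "document 3": 0.4, "document 5": 0.7}
store: Dict[str, SavedResult] = {}
saved = save_result(store, "fashion", hits, min_score=0.3)
```

Keeping the score next to each document, rather than storing only the document ids, is what later allows several saved results to be compared for the same document.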
[0074]
FIG. 25 is a diagram illustrating an example of the search / mining results stored in the search / mining result storage unit 107. For example, it is assumed that the search / mining result named “fashion” by the user is obtained from a search result by specifying keywords such as “design, character, color, ...”, the search / mining result named “operability” by specifying keywords such as “key, button, operation, ...”, the search / mining result named “function” by specifying keywords such as “kanji conversion, mail, ...”, the search / mining result named “business use” by specifying keywords such as “boss, business partner, report, ...”, and the search / mining result named “private use” by specifying keywords such as “e-mail friend, child, travel, ...”.
[0075]
Next, returning to step ST52 after the processing of step ST58 in FIG. 14, the search / mining result reading unit 109 reads the search / mining result specified by the user from the search / mining result storage unit 107. The documents to which a score is given in the designated search / mining result become the next document set to be searched / mined.
[0076]
For example, when “fashion” is specified from among the search / mining results of FIG. 25, document 1, document 3, document 5, ... become the set of documents to be searched / mined. In addition, a new set of documents to be searched / mined can be defined by specifying AND, OR, and NOT relationships among a plurality of search / mining results.
[0077]
For example, in the processing of FIG. 24, by specifying the search / mining result named “fashion” and the search / mining result named “private use” with AND, the documents to which a score is assigned in both “fashion” and “private use” are defined as the document set to be searched / mined. In addition, by specifying the search / mining result named “operability” and the search / mining result named “function” with OR, the documents to which a score is assigned in either “operability” or “function” become the document set to be searched / mined.
[0078]
In this way, by specifying a combination of search / mining results, a document set to be analyzed can be given a meaning. For example, the AND combination of “fashion” and “private use” yields a document set meaning “the set of questionnaire results of people who are interested in fashion in private use”, and the OR combination of “operability” and “function” yields a document set meaning “the set of questionnaire results of people who are interested in practicality”.
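The AND, OR, and NOT combinations described above correspond to ordinary set operations over the documents that carry a score in each saved result. The following is a minimal sketch under that reading; the helper `docs` and all score values are hypothetical and are not the contents of FIG. 25.

```python
# Hypothetical saved results: result name -> {document id: score}.
results = {
    "fashion":     {"document 1": 0.8, "document 3": 0.3, "document 5": 0.6},
    "operability": {"document 1": 0.2, "document 3": 0.9},
    "function":    {"document 1": 0.6, "document 2": 0.7, "document 6": 0.4},
    "private use": {"document 1": 0.7, "document 5": 0.5, "document 6": 0.6},
}


def docs(name):
    """Documents to which a score is assigned in the named result."""
    return set(results[name])


# AND: scored in both results.
fashion_and_private = docs("fashion") & docs("private use")
# OR: scored in at least one of the results.
operability_or_function = docs("operability") | docs("function")
# NOT: scored in the first result but not in the second.
fashion_not_private = docs("fashion") - docs("private use")
```

Each resulting set can then be passed to the next search / mining step as its target document set.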
[0079]
Thereafter, when a search / mining condition is input in step ST53 for the document set created by the search / mining result specified in step ST52 of FIG. 14, the search / mining condition is changed in step ST54. The search and mining process is executed in step ST55, and the result is displayed in step ST56.
[0080]
For example, when the mining process of FIGS. 17 and 18, which analyzes the degree of association between an attribute and a keyword, is performed on the document set in which “private use” and “fashion” are combined with AND, specifying “model” as the attribute and “screen” and “ringtone” as the keywords makes it possible to analyze, for each model, how much “people who are interested in fashion in private use” are interested in screens and ringtones.
[0081]
Next, the processing of the search / mining result editing unit 112 will be described.
FIG. 26 is a flowchart showing the flow of processing of the search / mining result editing unit 112. In step ST111, the search / mining result editing unit 112 receives the search / mining result of the target specified by the user and the attribute name to be given. For example, it is assumed that “fashion”, “operability”, and “function” in FIG. 25 are specified as the target search / mining result, and “point to be emphasized” is specified as the attribute name to be given. In step ST112, the search / mining result editing unit 112 reads the search / mining result specified by the user from the search / mining result storage unit 107.
[0082]
In step ST113, the search / mining result editing unit 112 selects, for each document, the search / mining result having the highest score. For example, for document 1 in FIG. 25, “fashion” has the highest score among “fashion”, “operability”, and “function”, so “fashion” is selected. Similarly, “function” is selected for document 2, “operability” for document 3, “fashion” for document 5, and “function” for document 6. Since document 4 is not assigned a score in any of the search / mining results, no search / mining result is selected for it.
[0083]
In step ST114, the search / mining result editing unit 112 assigns the name of the search / mining result selected in step ST113 to each document as an attribute value for the attribute name specified in step ST111.
[0084]
FIG. 27 is a diagram showing an editing result by the search / mining result editing unit 112. As a result of the process of step ST114 in FIG. 26, the attribute indicated by “attribute name: point to be emphasized” in FIG. 27 is generated. For document 4, no attribute value is defined, since no search / mining result was selected in step ST113.
[0085]
Similarly, in the process of FIG. 26, when “business use” and “private use” are specified as the search / mining results and “use field” is specified as the attribute name, the attribute indicated by “attribute name: use field” in FIG. 27 is generated.
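The selection in steps ST113 and ST114 amounts to taking, for each document, the saved result with the largest score and using its name as the attribute value. A rough sketch follows. The function name `assign_best` is hypothetical, and the scores are invented values chosen only to be consistent with the selections described for FIG. 25; the actual figure values are not reproduced in the text.

```python
def assign_best(results, selected_names, documents):
    """For each document, pick the name of the selected result with the highest score.

    Documents with no score in any selected result get no attribute value.
    On a tie, the result listed first in selected_names wins (an assumption;
    the embodiment does not state a tie-breaking rule).
    """
    attribute = {}
    for doc in documents:
        candidates = {name: results[name][doc]
                      for name in selected_names if doc in results[name]}
        if candidates:
            attribute[doc] = max(candidates, key=candidates.get)
    return attribute


# Hypothetical scores (illustrative only).
results = {
    "fashion":     {"document 1": 0.8, "document 3": 0.3, "document 5": 0.6},
    "operability": {"document 1": 0.2, "document 3": 0.9},
    "function":    {"document 1": 0.6, "document 2": 0.7, "document 6": 0.4},
}
documents = [f"document {i}" for i in range(1, 7)]

point_to_be_emphasized = assign_best(
    results, ["fashion", "operability", "function"], documents)
```

In this sketch document 4 simply has no entry, which mirrors the undefined attribute value in FIG. 27.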
[0086]
FIG. 28 is a flowchart showing a flow of processing of another implementation method of the search / mining result editing unit 112. FIG. 26 shows an example of the process of assigning one attribute value to one document, but FIG. 28 shows an example of the process of granting two or more attribute values to one document. The processing of step ST121 and step ST122 is the same as the processing of step ST111 and step ST112 in FIG. 26, respectively. As an example, a case will be described in which “fashion”, “operability”, and “function” in FIG. 25 are specified as the search / mining result to be read, and “point to be emphasized” is specified as the attribute name to be given.
[0087]
In step ST123 of FIG. 28, the search / mining result editing unit 112 selects a search / mining result whose score is equal to or larger than a threshold for each document. For example, if the threshold is set to 0.5, “fashion” and “function” are selected for document 1 because “fashion” and “function” have values equal to or greater than the threshold. Similarly, “function” is selected for document 2, “operability” is selected for document 3, and “fashion” is selected for document 5.
[0088]
In step ST124, the search / mining result editing unit 112 assigns the name of the search / mining result selected in step ST123 as an attribute value for the attribute name specified in step ST121 to each document according to a user instruction.
[0089]
FIG. 29 is a diagram showing an editing result by the search / mining result editing unit 112. As a result of the process of step ST124 in FIG. 28, the attribute indicated by “attribute name: point to be emphasized” in FIG. 29 is generated. Similarly, when “business use” and “private use” are specified as the search / mining results and “use field” is specified as the attribute name, the attribute indicated by “attribute name: use field” in FIG. 29 is generated.
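The threshold-based variant of steps ST123 and ST124 can be sketched in the same way, except that a document may receive several attribute values. The function name `assign_over_threshold` is hypothetical, and the scores are invented values chosen to agree with the selections described for a threshold of 0.5; they are not the values of FIG. 25.

```python
def assign_over_threshold(results, selected_names, documents, threshold):
    """For each document, collect every selected result whose score >= threshold."""
    attribute = {}
    for doc in documents:
        values = [name for name in selected_names
                  if doc in results[name] and results[name][doc] >= threshold]
        if values:                      # documents with no qualifying result get no value
            attribute[doc] = values
    return attribute


# Hypothetical scores (illustrative only).
results = {
    "fashion":     {"document 1": 0.8, "document 3": 0.3, "document 5": 0.6},
    "operability": {"document 1": 0.2, "document 3": 0.9},
    "function":    {"document 1": 0.6, "document 2": 0.7, "document 6": 0.4},
}
documents = [f"document {i}" for i in range(1, 7)]

multi_valued = assign_over_threshold(
    results, ["fashion", "operability", "function"], documents, threshold=0.5)
```

Compared with the highest-score rule, a threshold keeps secondary interests of a document (here, “function” for document 1) but leaves weakly matching documents without a value.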
[0090]
As described above, an attribute generated by the processing in FIG. 26 or FIG. 28 can be handled in the mining process of the mining unit 208 in the same manner as an attribute originally provided in the document. That is, in the example of the target document set 101 in FIG. 3, in addition to the originally provided attributes “gender”, “age”, “model”, “region”, and “date”, the new attributes “point to be emphasized” and “use field” can be used for analysis.
[0091]
For example, FIG. 30 is a diagram showing a mining result by the mining unit 208. With the attribute values determined by the search / mining result editing unit 112 used as an axis, the degree of association between each attribute value and each keyword input as a mining condition is shown as a line graph. That is, it shows an example in which, in the analysis of the degree of association between an attribute and keywords shown in FIGS. 17 and 18, “point to be emphasized” is designated as the attribute name and “work”, “price”, and “screen” are designated as the keywords. It can be read that “work” has low relevance to “fashion” and high relevance to “function”, and that the relevance of “price” does not differ much among “fashion”, “operability”, and “function”.
As described above, the mining unit 208 can analyze the degree of association between the designated mining condition and the attribute values determined by the search / mining result editing unit 112.
[0092]
Also, as in the example of FIGS. 19 and 20, in which the correlation between attribute values of documents is analyzed based on the distribution of specified words, the mining unit 208 can likewise analyze the correlation between attribute values determined by the search / mining result editing unit 112 based on the distribution of the specified mining conditions.
[0093]
Further, as in the example of FIGS. 21, 22, and 23, in which the correlation between words is analyzed based on the distribution of documents, the mining unit 208 can likewise analyze the correlation between words based on the distribution of the documents stored in the search / mining result storage unit 107.
[0094]
Further, it is also possible to analyze the degree of association between the attributes created by the search / mining result editing unit 112.
FIG. 31 is a flowchart showing, as an example of the mining process in the mining unit 208, the flow of the process of analyzing the degree of association between attributes created by the search / mining result editing unit 112. An example in which the relationship between the “point to be emphasized” and the “use field” shown in FIG. 27 or FIG. 29 is analyzed will be described.
[0095]
In step ST131, the mining unit 208 acquires a document set for each attribute value of the attribute name 1 specified by the user. When “point to be emphasized” is specified as the attribute name 1, a document set whose attribute value includes “fashion”, a document set whose attribute value includes “operability”, and a document set whose attribute value includes “function” are obtained. In step ST132, the mining unit 208 creates an attribute vector from each document set acquired in step ST131. The attribute vector is created by averaging the document vectors of the documents in the document set.
[0096]
In step ST133, the mining unit 208 acquires a document set for each attribute value of the specified attribute name 2. When “use field” is specified as the attribute name 2, a document set whose attribute value includes “business use” and a document set whose attribute value includes “private use” are obtained. In step ST134, the mining unit 208 creates an attribute vector from each document set acquired in step ST133.
[0097]
In step ST135, the mining unit 208 calculates a cosine value between each attribute vector created in step ST132 and each attribute vector created in step ST134, and sets the cosine value as the degree of association. In step ST136, the mining unit 208 displays the relevance of each attribute value of the attribute name 2 to each attribute value of the attribute name 1 on a graph and outputs the result as a mining result.
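Steps ST131 to ST135 can be sketched as follows: documents are grouped by attribute value, each group's document vectors are averaged into an attribute vector, and the cosine between attribute vectors is taken as the degree of association. The function names and the two-dimensional document vectors below are assumptions made for illustration; the embodiment's document vectors come from the document vector index 205 and their dimensionality is not assumed here.

```python
import math


def mean_vector(vectors):
    """Average a non-empty list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def attribute_vectors(doc_vectors, attribute):
    """ST131-ST134: group documents by attribute value and average their vectors."""
    groups = {}
    for doc, values in attribute.items():
        for value in values:
            groups.setdefault(value, []).append(doc_vectors[doc])
    return {value: mean_vector(vs) for value, vs in groups.items()}


def association(doc_vectors, attribute1, attribute2):
    """ST135: cosine between every pair of attribute vectors."""
    v1 = attribute_vectors(doc_vectors, attribute1)
    v2 = attribute_vectors(doc_vectors, attribute2)
    return {(a, b): cosine(x, y) for a, x in v1.items() for b, y in v2.items()}


# Hypothetical document vectors and generated attributes (illustrative only).
doc_vectors = {"document 1": [1.0, 0.0], "document 2": [0.0, 1.0], "document 5": [0.8, 0.2]}
emphasis = {"document 1": ["fashion"], "document 2": ["function"], "document 5": ["fashion"]}
use_field = {"document 1": ["private use"], "document 2": ["business use"],
             "document 5": ["private use"]}

table = association(doc_vectors, emphasis, use_field)
```

The attribute values are held as lists so that the same code works for both the single-valued attributes of FIG. 27 and the multi-valued attributes of FIG. 29.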
[0098]
FIG. 32 is a diagram illustrating a mining result by the mining unit 208, and is a diagram illustrating an example in which a result of the processing in FIG. 31 is displayed in a graph. From the graph of FIG. 32, it can be seen that, for example, the correlation between “fashion” and “private use” is high, and the correlation between “function” and “business use” is high.
[0099]
As described above, the mining unit 208 can analyze the degree of association between attribute values determined for a plurality of attribute names by the search / mining result editing unit 112. The analysis is not limited to attribute values determined by the search / mining result editing unit 112; the degree of association between an attribute value determined by the search / mining result editing unit 112 and another attribute value that has been input separately may also be analyzed.
[0100]
As described above, according to the first embodiment, the results obtained by search and text mining are saved, and the saved search / mining results are associated with one another to create new analysis targets and new attributes. This provides the effect that analysis can be performed based on a user-defined criterion (analysis axis) that is not defined in the original documents.
[0101]
In addition, according to the first embodiment, by using the concept dictionary 204 and the document vector index 205, the degree of association between words, between a word and a document, and between a word and an attribute can be determined even for words having different notations. This provides the effect that search, mining, and the association of search / mining results can be performed based on semantic proximity even for expressions whose notations differ.
[0102]
Furthermore, according to the first embodiment, the concept dictionary generation unit 201 extracts compound words, so that words with higher specificity than words obtained by merely combining general words are extracted, and a concept vector is generated for each compound word. The mining process can therefore be performed based on more appropriate words, which provides the effect that the accuracy of the mining process can be improved.
[0103]
Each of the above processes can be realized on a computer by a text mining program installed on the computer.
[0104]
[Effect of the invention]
As described above, the present invention comprises: an index storage unit that stores index information for searching and mining a target document set; a search / mining execution unit that executes search / mining processing of the target document set by referring to the index information in accordance with a specified search / mining condition; a search / mining result storage unit that attaches to each document the degree of association between the document and the search / mining condition and stores the search / mining result obtained by the search / mining execution unit in the search / mining result storage unit; and a search / mining result editing unit that reads a plurality of specified search / mining results from the search / mining result storage unit and determines an attribute value by selecting, for each document, from among the read search / mining results in accordance with a specified attribute name. Since new attributes can thereby be created, there is an effect that analysis can be performed based on a user-defined criterion that is not defined in the original documents.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a text mining device according to Embodiment 1 of the present invention.
FIG. 2 is a detailed block diagram showing a configuration of a text mining device according to Embodiment 1 of the present invention.
FIG. 3 is a diagram illustrating an example of a document in a target document set.
FIG. 4 is a flowchart showing a processing flow of a concept dictionary generation unit of the text mining device according to the first embodiment of the present invention.
FIG. 5 is a diagram for explaining processing of a concept dictionary generation unit of the text mining device according to Embodiment 1 of the present invention.
FIG. 6 is a flowchart showing a flow of compound word extraction processing of a concept dictionary generation unit of the text mining device according to Embodiment 1 of the present invention.
FIG. 7 is a diagram showing an example of a compound word candidate extraction dictionary held by the concept dictionary generation unit of the text mining device according to the first embodiment of the present invention.
FIG. 8 is a diagram showing a compound word candidate extraction result of the concept dictionary generating unit of the text mining device according to the first embodiment of the present invention.
FIG. 9 is a diagram illustrating generation of a document vector index by a document vector index generation unit of the text mining device according to the first embodiment of the present invention.
FIG. 10 is a flowchart showing a processing flow of a document vector index generation unit of the text mining device according to the first embodiment of the present invention.
FIG. 11 is a diagram showing a document vector index of the text mining device according to the first embodiment of the present invention.
FIG. 12 is a flowchart showing a processing flow of a mining index generation unit of the text mining device according to the first embodiment of the present invention.
FIG. 13 is a diagram illustrating an example of a mining index generated by a mining index generation unit of the text mining device according to the first embodiment of the present invention.
FIG. 14 is a flowchart showing a processing flow of an execution unit of the text mining device according to the first embodiment of the present invention.
FIG. 15 is a flowchart showing a flow of a search process executed by the search unit of the text mining device according to the first embodiment of the present invention.
FIG. 16 is a diagram showing an example of a search result by a search unit of the text mining device according to the first embodiment of the present invention.
FIG. 17 is a flowchart showing a flow of a mining process executed by the mining unit of the text mining device according to the first embodiment of the present invention.
FIG. 18 is a diagram showing a mining result by a mining unit of the text mining device according to the first embodiment of the present invention.
FIG. 19 is a flowchart showing a flow of a mining process executed by the mining unit of the text mining device according to the first embodiment of the present invention.
FIG. 20 is a diagram showing a mining result by the mining unit of the text mining device according to the first embodiment of the present invention.
FIG. 21 is a flowchart showing a flow of a mining process executed by the mining unit of the text mining device according to the first embodiment of the present invention.
FIG. 22 is a diagram illustrating a mining result by a mining unit of the text mining device according to the first embodiment of the present invention.
FIG. 23 is a diagram illustrating a mining result by a mining unit of the text mining device according to the first embodiment of the present invention.
FIG. 24 is a flowchart showing a processing flow of a search / mining result storage unit according to the first embodiment of the present invention.
FIG. 25 is a diagram illustrating an example of a search / mining result stored in a search / mining result storage unit of the text mining device according to the first embodiment of the present invention.
FIG. 26 is a flowchart showing a processing flow of a search / mining result editing unit of the text mining device according to the first embodiment of the present invention.
FIG. 27 is a diagram showing an editing result by a search / mining result editing unit of the text mining device according to the first embodiment of the present invention.
FIG. 28 is a flowchart showing a processing flow of a search / mining result editing unit of the text mining apparatus according to Embodiment 1 of the present invention.
FIG. 29 is a diagram showing an editing result by a search / mining result editing unit of the text mining device according to the first embodiment of the present invention.
FIG. 30 is a diagram showing a mining result by a mining unit of the text mining device according to the first embodiment of the present invention.
FIG. 31 is a flowchart showing a processing flow of a mining unit of the text mining device according to the first embodiment of the present invention.
FIG. 32 is a diagram illustrating a mining result by a mining unit of the text mining device according to the first embodiment of the present invention.
FIG. 33 is a diagram illustrating a conventional text mining analysis process.
FIG. 34 is a diagram for explaining text mining analysis processing according to Embodiment 1 of the present invention.
[Explanation of symbols]
101 target document set, 102 index generation unit, 103 index storage unit, 104 execution unit, 105 input unit, 106 display unit, 107 search / mining result storage unit, 108 index reading unit, 109 search / mining result reading unit, 110 search / mining execution unit, 111 search / mining result storage unit, 112 search / mining result editing unit, 201 concept dictionary generation unit, 202 document vector index generation unit, 203 mining index generation unit, 204 concept dictionary, 205 document vector index, 206 mining index, 207 search unit, 208 mining unit.

Claims

1. A text mining device comprising:
an index storage unit that stores index information for searching and mining a target document set;
a search / mining execution unit that executes search / mining processing of the target document set by referring to the index information stored in the index storage unit in accordance with a specified search / mining condition;
a search / mining result storage unit that attaches to each document the degree of relevance between the document and the search / mining condition, and stores the search / mining result obtained by the search / mining execution unit in a search / mining result storage unit; and
a search / mining result editing unit that reads a plurality of specified search / mining results from the search / mining result storage unit and determines an attribute value by selecting, for each document, from among the read search / mining results in accordance with a specified attribute name.

2. The text mining device according to claim 1, wherein the search / mining result editing unit selects, for each document, the search / mining result having the highest relevance, and determines the attribute value of each document for the attribute name.

3. The text mining device according to claim 1, wherein the search / mining result editing unit selects, for each document, the search / mining results whose relevance is equal to or greater than a predetermined threshold, and determines the attribute value of each document for the attribute name.

4. The text mining device according to claim 1, wherein the search / mining execution unit analyzes the degree of association between a designated mining condition and the attribute values determined by the search / mining result editing unit.

5. The text mining device according to claim 1, wherein the search / mining execution unit analyzes the correlation between the attribute values determined by the search / mining result editing unit based on the distribution of specified mining conditions.

6. The text mining device according to claim 1, wherein the search / mining execution unit analyzes the correlation between words based on the distribution of the documents stored in the search / mining result storage unit.

7. The text mining device according to claim 1, wherein the search / mining execution unit analyzes the degree of association between the attribute values determined for a plurality of attribute names by the search / mining result editing unit.

8. The text mining device according to claim 1, wherein the index information stored in the index storage unit includes a concept dictionary that describes concept vectors of the words in each document, generated by morphologically analyzing the text included in the target document set, and a document vector index that describes concept vectors of documents, generated by morphologically analyzing the documents included in the target document set and referring to the concept dictionary.

9. The text mining device according to claim 8, wherein a concept vector of a compound word extracted from the morphological analysis result of the text included in the target document set is described in the concept dictionary.

10. A text mining program that causes a computer to realize:
a first function of executing search / mining processing of a target document set by referring to index information stored in an index storage unit in accordance with a specified search / mining condition;
a second function of attaching to each document the degree of relevance between the document and the search / mining condition, and storing the search / mining result obtained by the first function in a search / mining result storage unit; and
a third function of reading a plurality of specified search / mining results from the search / mining result storage unit and determining an attribute value by selecting, for each document, from among the read search / mining results in accordance with a specified attribute name.