JP2004185306A

JP2004185306A - Dictionary construction supporting device and method

Info

Publication number: JP2004185306A
Application number: JP2002351429A
Authority: JP
Inventors: Ryohei Orihara; 良平折原; Kazuhiko Atsumi; 一彦渥美; Seiji Iwata; 誠司岩田; Kyoko Makino; 恭子牧野; Kayoko Isoo; 佳代子磯尾; Yumi Ichimura; 由美市村; Akihiro Suyama; 明弘酢山; Mitsuo Nunome; 光生布目; Kouichi Sasaki; 光一笹氣
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2002-12-03
Filing date: 2002-12-03
Publication date: 2004-07-02
Anticipated expiration: 2022-12-03
Also published as: JP3774431B2

Abstract

PROBLEM TO BE SOLVED: To provide a support environment in which a user can create and verify an information extraction dictionary for which the user can set whether morphological analysis is performed when collating expressions.

SOLUTION: A dictionary construction support program 100 stored in a storage device 50 is read into a memory 40 and executed by a CPU 10, thereby realizing a dictionary construction support device. For the process in which a dictionary verification unit 104 extracts, from given document data, the expressions of each concept stored in an information extraction dictionary 109, a condition setting unit 106 sets for each concept whether morphological analysis is performed on the document data, and stores the setting in the information extraction dictionary 109 as a collation condition.

COPYRIGHT: (C)2004,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、文書データを自動的に分類する文書分類システムにおいて利用される電子辞書をユーザによって構築可能とするための辞書構築支援装置および方法に関する。
【０００２】
【従来の技術】
さまざまな場所で日々多様な電子データが利用されているが、それらデータの８割以上が文書データであると言われている。文書データは、用途に応じて、種々の電子辞書が利用される。例えば、日本語の文書データの作成時においては、かな漢字変換辞書が用いられ、また、日本語の文書データの形態素解析を行う場合には形態素解析辞書が、文書データの構文解析を行う際には構文解析辞書が用いられる。このような電子辞書においては、基本的には明確に定義された不変的な規則が存在しているため、予め製品メーカ側で作成されて供給されるものであり、ユーザはこのような電子辞書の内容を特に意識することなく利用している。
【０００３】
ところで、多種多様な文書データのうち、特に自由な形式で記述された文書データからユーザにとって必要な情報を抽出する技術として、テキストマイニング技術が近年注目されてきている。
【０００４】
本願出願人は、特許文献１において、テキストマイニング技術の一方法を提案している。これには、抽出したい概念と該概念を表す一つ以上の様々な表現とを対応づけて保持される情報抽出用辞書を利用して、対象となる文書データ中に含まれる重要な情報を抽出する文書データの解析方法が提案されている。なお、ここで言う表現とは、例えば、単語や、句や、共起関係にある単語の組、等のことを言い、また、ここで言う概念とは、このような表現それぞれに共通した上位の意味に相当する。例えば、「オーナー」、「トレーナー」、「バイヤー」等が個々の「表現」であり、「人」は、これら「オーナー」、「トレーナー」、「バイヤー」等の上位の意味を持つ「概念」である。
【０００５】
このような情報抽出用辞書における概念と表現との対応付けは、明確に定義された不変的な規則が存在しているものではなく、分析対象とする文書データの分野や分析の視点に依存するものである。
【０００６】
これら情報抽出用辞書においても、先に示した電子辞書と同様、製品メーカ側が、テキストマイニング技術を導入するユーザの利用形態を調査し、それに併せて情報抽出用辞書を作成して提供し、また、利用後の結果を検証してその情報抽出用辞書に登録、削除、修正などのメンテナンスを行っているのが現状であった。
【０００７】
以上説明してきたように、テキストマイニング技術に用いられる従来の情報抽出用辞書は、その良し悪しにより、抽出精度に大きな影響を与えるにもかかわらず、利用者ではないメーカ側によって予め用意されているものであり、ユーザがその辞書を作成したり、作成した辞書を検証しながら編集する手段は提供されていなかった。なお、エディタなどを用いて、情報抽出用辞書を編集することは可能であったが、辞書の構造などを十分に把握し、プログラムレベルで編集する必要があり、通常のユーザにとって容易に編集できるものではなかった。
【０００８】
また、これら情報抽出用辞書のメンテナンスにおいても、製品メーカ側が、利用後のユーザの不具合等を調査して、解決されると思われる箇所を経験などから判断し、情報抽出用辞書への追加、削除、修正などを行ってテキストマイニングの精度を向上するしかなかった。
【０００９】
このような点に鑑み、本願出願人は、特許文献２において、テキストマイニングで用いられる情報抽出用辞書の作成およびメインテナンスを容易にする方法および装置を提案した。
【００１０】
また、抽出したい概念が固有名詞など形態素解析を必要としない場合があることに鑑み、本願出願人は、特許文献３において、電子辞書中の表現と文書データとを照合する際、あらかじめ設定された条件に応じて形態素解析を行ってから照合するか形態素解析を行わずに照合するかを切り替えて表現を抽出する表現抽出手段を具備することを特徴とする情報抽出システムを提案した。これによれば、抽出すべき表現あるいは概念ごとに照合条件を設定できる。例えば、文字列照合だけでは他の単語の一部になってしまう表現を抽出したい場合、形態素解析を行えば単語の境界を正しく認識できるので、形態素解析を行わない場合よりも抽出精度が高くなる。具体的な例としては、「京都」を抽出したい場合、形態素解析を行わないと「東京都」という文字列中の「京都」にもマッチしてしまうが、形態素解析を行えばそのような誤りを防げる。
【００１１】
逆に、例えば、形態素解析に失敗し未知語になるような場合、単語境界を誤る可能性が高いので、形態素解析を行うとマッチしなくなる場合がある。具体的な例としては、「コチジャン」を抽出しようとして１単語として抽出用辞書に登録したとしても、形態素解析すると「コチ／ジャン」のように２単語に分割されてしまう場合、形態素解析を行うとそのままではマッチしなくなる。
【００１２】
一般に、形態素解析辞書には登録されていない製品名や記号列などを抽出したい場合には形態素解析を行わない方が適しており、形態素解析辞書に登録されているような一般的な語を抽出したい場合には形態素解析を行う方が適していると言える。
【００１３】
【特許文献１】
特開２００１−１４７９３７号公報
【００１４】
【特許文献２】
特開２００２−１４０３３８号公報
【００１５】
【特許文献３】
特願２００２−２７８４２１号
【００１６】
【発明が解決しようとする課題】
ところが、上記条件設定を行うためには、テキストエディタ等を用いて情報抽出用辞書を編集することが必要であった。形態素解析を行わない場合の電子辞書の構築は、計算言語学上の知識を必要としないため、論理的には一般のユーザにも可能な作業であるが、プログラムレベルでの編集作業が必要であったため、実際にはプログラムレベルでの編集作業が必要であり、一般のユーザにとっては編集作業は極めて困難であった。
【００１７】
本発明は、このような事情に鑑みてなされたものであり、ユーザが表現の照合において形態素解析をするかしないかを設定できる情報抽出用の辞書を作成・検証するための支援環境を提供することを目的とする。
【００１８】
【課題を解決するための手段】
そこで、本発明の辞書構築支援装置は、複数の表現とそれら表現に共通する上位の表現である概念とを対応付けて格納される電子辞書を記憶する辞書記憶手段と、文書データから抽出された表現を記憶する表現記憶手段と、前記辞書記憶手段にて記憶される該電子辞書から少なくとも一部の概念と、前記表現記憶手段にて記憶される少なくとも一部の該抽出された表現とを同時に表示する表示手段と、前記表示手段によって表示された表現から一つ以上の表現の指定と、前記表示手段によって表示された概念から一つの概念の指定とを受けると、指定された表現を指定された概念に対応付けて前記電子辞書に追加登録する登録手段とに加え、概念毎に照合の際、形態素解析をするか否かを指定し、それを電子辞書に保存する手段を備えた。
【００１９】
また、抽出結果を表示する際、形態素解析を用いて照合されたものとそうでないものを区別して表示する手段を備えた。
【００２０】
これにより、文書データから抽出された表現を、簡単な操作で電子辞書内の所望の概念へ登録することができ、形態素解析をすべきか否かを含め電子辞書の良し悪しを容易に検証することができるようになった。
【００２１】
すなわち、文書から抽出した重要な表現を参照しながら、必要な表現を選択して情報処理用の辞書に登録することができ、同時に、形態素解析を行うか否かも指定することができ、また、情報処理結果を見て辞書の性能を検証しながら辞書を編集することができるようになった。
【００２２】
【発明の実施の形態】
以下、図面を参照して本発明に係る実施形態を説明する。
【００２３】
図１は、本発明の実施形態に係る辞書構築支援装置の構成を示した図である。
【００２４】
本実施形態における辞書構築支援装置は、例えばパーソナルコンピュータ等のコンピュータにおいて、記憶装置５０に格納された辞書構築支援プログラム１００がメモリ４０上へ読み込まれ、全体の制御を司るＣＰＵ１０によって実行されることにより実現される。
【００２５】
入力部２０は、マウスやキーボードあるいは音声入力装置等から、文字列の挿入や削除などの編集指示、機能を選択するための操作指示、処理対象となる文書や辞書の指定などのコマンド入力等を受けるものである。出力部３０は、例えばディスプレイ等の表示装置へ表示情報を供給するためのものである。
【００２６】
メモリ４０は、高速で揮発性の、例えばＤＲＡＭ等から構成され、前記したとおりＣＰＵ１０で実行される前記プログラム１００等を記憶したり、該プログラム１００が実行される際に一時的に保持される内部データ保持部１１０として利用される。
【００２７】
この内部データ保持部１１０には、テキストバッファ１１０ａ、表現リストバッファ１１０ｂ、処理結果バッファ１１０ｃ、辞書バッファ１１０ｄ、差分バッファ１１０ｅが設けられている。
【００２８】
記憶装置５０は、不揮発性の大容量記憶装置であり、例えば、ＨＤＤ、ＣＤ−ＲＯＭ、ＤＶＤ−ＲＯＭ等によって実現可能である。記憶装置５０には、制御部１０１、表現抽出部１０２ａを備えた表現登録部１０２、辞書編集部１０３、辞書検証部１０４、差分検出部１０５および条件設定部１０６からなる辞書構築支援プログラム１００と、このプログラムによって構築される情報抽出用辞書１０９が格納されている。
【００２９】
制御部１０１は、本プログラム１００の起動時に最初に実行されるメインプログラムである。
【００３０】
表現登録部１０２は、表現を情報抽出用辞書１０９へ登録するプログラムであり、制御部１０１からの所定の指示により起動され、後述する「表現」を情報抽出用辞書１０９へ登録する処理を行う。
【００３１】
表現抽出部１０２ａは、表現登録部１０２から、文書データの表現の抽出処理が必要になった際に起動されるプログラムである。表現抽出部１０２ａは、テキストバッファ１１０ａに記憶される内容を構文解析等を行って、文書中で使用される単語や句、あるいは共起関係等（以下、これらを表現と称す）をリスト化し、表現リストバッファ１１０ｂへ記憶する。なお、このリスト化されたものを、総称して表現リストと呼ぶ。この表現リストの作成方法については、たとえば特開昭６２−９９８６５号公報や特開平２−４２５７２号公報で公開されており、説明を省略する。
【００３２】
表現登録部１０２は、作成された表現リスト中からユーザによって指定された表現を、辞書バッファ１１０ｄへ登録する。
【００３３】
辞書編集部１０３は、情報抽出用辞書１０９を編集するプログラムである。この辞書編集部１０３は、表現登録部１０２や辞書検証部１０４からも起動できる。
【００３４】
辞書検証部１０４は、情報抽出用辞書１０９から読み出され格納した、辞書バッファ１１０ｄを用いて、ユーザによって指定された抽出処理を行い、抽出処理結果を処理結果バッファ１１０ｃに記憶する。また、ユーザの指示に基づき、処理結果バッファ１１０ｃに記憶される情報を表示する。
【００３５】
差分検出部１０５は、２つの情報抽出用辞書１０９の差分を検出する。
【００３６】
条件設定部１０６は、情報抽出の際の照合条件を設定する。設定された条件は、制御部１０１を通して情報抽出用辞書１０９に保存される。保存された条件は、辞書バッファ１１０ｄに読み出されることにより、辞書検証部１０４における情報抽出の動きを制御するほか、本情報抽出用辞書を用いる外部の情報抽出システムにおける情報抽出の動きを、たとえば特許文献３のように制御する。
【００３７】
情報抽出用辞書１０９は、本プログラムで構築される電子辞書である。この情報抽出用辞書１０９の構成の一例を図２に示す。この情報抽出用辞書１０９は、３階層にて構成されており、最上位の階層をクラス、中位の階層を概念、下位の階層を表現と呼んでいる。なお、クラス、概念、表現の階層を特に意識することなく、これらの一つを指す場合には、以下では単にノードと呼ぶこととする。
【００３８】
クラステーブル１２１は、クラスを示しており、ここには、「地名」、「人名」等を格納する。概念テーブル１２２は、クラスと概念を対応付けて格納している。表現テーブル１２３は、クラス、概念、表現とを対応付けて格納をしている。
【００３９】
以上が本実施形態の辞書構築支援装置の構成である。
【００４０】
次に、本実施形態の辞書構築支援装置の動作について情報抽出を例として、フローチャートを用いて説明する。
【００４１】
図３は、辞書構築プログラム１００のうち、制御部１０１のフローチャートを示している。
【００４２】
ユーザが、辞書構築プログラム１００の起動を要求すると、制御部１０１が起動される。
【００４３】
制御部１０１は、自プログラム起動後、ユーザからのキー入力を待つ（Ｓ２０１）。キー入力を受けると、まず、ユーザの入力が終了指示であるか否か判定する（Ｓ２０２）。ここで終了指示と判定した場合には自プログラムを終了する。
【００４４】
一方、ステップＳ２０２において、終了指示でないと判定した場合には、次に、ユーザの入力が文書指定であるか否か判定する（Ｓ２０３）。ここで文書指定である場合には、指定された文書データを、指定先の、例えばメモリや磁気ディスク、光ディスクなどから読み出して、メモリ４０の内部データ保持部１１０内のテキストバッファ１１０ａに記憶する（Ｓ２０４）。この処理が終了すると、ステップＳ２０１に戻り、ユーザからの次の入力を待つ。
【００４５】
一方、ステップＳ２０３において、文書指定でないと判定した場合には、次に、ユーザの入力が辞書指定であるか否か判定する（Ｓ２０５）。ここで辞書指定である場合には、指定された辞書を、指定先の、例えばメモリや磁気ディスク、光ディスクなどから読み出して、メモリ４０の内部データ保持部１１０内の辞書バッファ１１０ｄに記憶する（Ｓ２０６）。この処理が終了すると、ステップＳ２０１に戻り、ユーザからの次の入力を待つ。
【００４６】
一方、ステップＳ２０５において、辞書指定でないと判定した場合には、次に、ユーザの入力が表現登録指示であるか否か判定する（Ｓ２０７）。ここで表現登録指示であると判断した場合には、表現登録部１０２を起動する（Ｓ２０８）。表現登録部１０２の動作については後述する。なお、表現登録部１０２の動作が終了すると、ステップＳ２０１に戻り、ユーザからの次の入力を待つ。
【００４７】
一方、ステップＳ２０７において、表現登録指示でないと判断した場合には、次に、ユーザの入力が辞書検証指示であるか否か判定する（Ｓ２０９）。ここで辞書検証指示であると判定した場合には、辞書検証部１０４を起動する（Ｓ２１０）。辞書検証部１０４の動作については後述する。なお、辞書検証部１０４の動作が終了するとステップＳ２０１に戻り、ユーザからの次の入力を待つ。
【００４８】
一方、ステップＳ２０９において、辞書検証指示でないと判断した場合には、次に、ユーザの入力が辞書編集指示であるか否か判定する（Ｓ２１１）。ここで辞書編集指示である場合には、辞書編集部１０３を起動する（Ｓ２１２）。辞書編集部１０３の動作については後述する。なお、辞書編集部１０３の動作が終了するとステップＳ２０１に戻り、ユーザからの次の入力を待つ。
【００４９】
一方、ステップＳ２１１において、辞書編集指示でないと判断した場合には、次に、ユーザの入力が差分検出指示であるか否か判定する（Ｓ２１３）。ここで、差分検出指示である場合には、差分検出部１０５を起動する（Ｓ２１４）。差分検出部１０５の動作については後述する。なお、差分検出部１０５の動作が終了するとステップＳ２０１に戻り、ユーザからの次の入力を待つ。
【００５０】
一方、ステップＳ２１３において、差分検出指示でないと判断した場合には、指定された他の処理、例えば環境設定処理等のユーザの入力指示に対応する処理を行って（Ｓ２１５）、ステップＳ２０１に戻り、ユーザからの次の入力を待つ。
【００５１】
以上のように、制御部１０１は、ユーザの入力を解析して各種処理部を起動するとともに、ユーザの指示に基づき、処理対象の文書および辞書を内部データ保持部１１０に記憶する。
【００５２】
図４は、制御部１０１が起動した直後の画面表示例を示している。上部には、登録、検証、編集、差分表示、環境設定、終了の各ボタンがあり、ユーザは、これらのボタンを押すことによって、次に行う処理を選択する。例えば、終了ボタンがクリックされると、ステップＳ２０２にて、ユーザの入力が終了指示であることを判定して、終了処理が行われる。また、例えば、登録ボタンがクリックされると、ステップＳ２０７にて、ユーザの入力が表現登録指示であることを判定し、表現登録部１０２が起動される処理が行われる。
【００５３】
これらボタンの下部には、左に辞書名を入力する領域が、右に文書名を入力する領域が設けられており、ユーザは、これら領域に辞書名、文書名を入力し、実行することにより、該当する辞書あるいは文書を記憶装置５０から読み出す。例えば、文書名を入力する領域にユーザが文書名を入力して実行すると、ステップＳ２０３にて、ユーザの入力が文書指定であると判定し、続くステップＳ２０４で、入力された文書名に基づいて、該当する文書を記憶装置５０などから読み出して、テキストバッファ１１０ａに記憶する。
【００５４】
次に、表現登録部１０２の処理動作について説明する。図５は、表現登録部１０２の処理動作を示すフローチャートである。
【００５５】
制御部１０１の動作中、ステップ２０７において表現登録指示であると判断した場合には、表現登録部１０２が起動される。起動された表現登録部１０２は、まず、表現抽出部１０２ａを起動する。起動された表現抽出部１０２ａは、テキストバッファ１１０ａに記憶される文書データを所定の解析方法にて解析し、表現リストを作成する（Ｓ３０１）。この作成された表現リストを表現リストバッファ１１０ｂに保持する（Ｓ３０２）。なお、所定の解析方法は、例えば、単語の表現リストを作成する際には、文書データを形態素解析や、構文解析し、単語を抽出し、抽出した単語を羅列すればよく、またこの解析方法にとらわれることなく既知の様々な方法で作成すれば良い。抽出後、表現抽出部１０２ａは、終了する。
【００５６】
次に、辞書バッファ１１０ｄに記憶される辞書データと、表現リストバッファ１１０ｂに記憶される表現リストとを読み出して、同時に表示する（Ｓ３０３）。
【００５７】
次に、ユーザの入力が辞書登録指示であるか否か判定する（Ｓ３０４）。ここで、辞書登録指示である場合には、次に、ユーザは、表示された表現リストの中から辞書に登録したい表現を指定する（Ｓ３０５）。これにより、表現登録部１０２は、ユーザによって指定された表現を得る。また、ユーザは、表示された辞書の概念の中から、得られた表現を登録したい概念を指定する（Ｓ３０６）。これにより、表現登録部１０２は、表現が登録される概念を得る。また、ユーザは、その概念が情報抽出時に形態素解析されるべきか否かを、チェックボックス等の入力手段を用いて条件設定部１０６に与える（Ｓ３０７）。そして、これら得た表現、概念、および情報抽出の条件を合成し、表現の辞書情報（下位のテーブル）として登録する（Ｓ３０８）。この登録は、辞書バッファ１１０ｄに追加される。
【００５８】
一方、ステップＳ３０４において、辞書登録指示でない場合には、ユーザの入力が辞書編集指示であるか否か判定する（Ｓ３０９）。ここで、辞書編集指示である場合には、後述の辞書編集部１０３を起動し、辞書の編集を行う（Ｓ３１０）。辞書編集部１０３の処理が終了すると、ステップＳ３０４に戻る。
【００５９】
一方、ステップＳ３０９において、辞書編集指示でない場合には、辞書バッファ１１０ｄに記憶される情報を磁気ディスク等に保存し（Ｓ３１１）、処理を終了する。
【００６０】
このようにして、表現登録部１０２は、表現抽出部１０２ａを起動して表現リストを作成し、表現リストから辞書１０９への辞書登録を行う。
【００６１】
図６は、表現登録部１０２の処理時の画面表示例を示している。辞書名を入力する領域の下部には、登録ボタンがあり、このボタンをクリックすることによって辞書登録処理（図５のＳ３０５〜Ｓ３０８）が可能になる。この登録ボタンの下部には、辞書の内容を表示している。この例においては、辞書名「ＸＸＸＸＸＸＸ」の辞書の内容の一部が表示されている。この表示では、クラス、概念、表現の３階層の形式で表示されている。また、各語の前部にはチェックボックスが設けられており、ユーザにてチェックボックスをクリックすることにより指定可能になる。また、概念には、情報抽出時の照合条件として形態素解析を使うかどうかを指定するためのチェックボックス（ａ１，ａ２，…）があり、ユーザによる条件の入力が可能となっている。そして、このチェックボックス（ａ１，ａ２，…）を設けて、形態素解析を行うか否かを指定できるようにした点が、この辞書構築支援装置の特徴の１つであり、これにより、プログラムレベルでの編集作業が不要となり、一般のユーザでも編集作業を簡単に行えるようになる。この例では、概念「仕事」のチェックボックス（ａ１）はオフ、概念「人」のチェックポイント（ａ２）はオンとなっているため、概念「仕事」に含まれる表現を抽出する際には形態素解析は用いられず、一方、概念「人」に含まれる表現を抽出する際には形態素解析が用いられることになる。
【００６２】
一方、文書名を入力する領域の下部には、単語、共起、句の各ボタンがあり、これらボタンを押すことによって、表現リストの表示内容を選択できる。これらボタンの下部は、表現リストを表示する領域であり、この例においては、文書名「＋＋＋＋＋＋」の単語の表現リストを表示している。この表現リストの各語（この場合は単語）の前部にはチェックボックスが設けられており、ユーザにてチェックボックスをクリックすることにより指定可能になっている。
【００６３】
このように、この表現登録部１０２の処理時の画面は、辞書と表現リストが同時に表示可能となっており、また、辞書のクラス、概念、表現、および表現リストの各語は、簡単に指定可能になっており、ユーザにとって、簡単に辞書への登録が可能である。
【００６４】
図７は、表現の登録処理を模式的に示した図である。この例では、表現リストから「社員」を選択し、「人」の概念の一つの表現として追加する際の内部データの処理について示している。
【００６５】
表示された表現リストから「社員」が選択されると表現リストバッファ１１０ｂから「社員」を取得し、また、表示された辞書から概念「人」が選択されると辞書バッファ１１０ｄに記憶される概念テーブル１２２から「業界・人」を取得する。そして、これら取得した「業界、人」、「社員」を合成し、「業界・人・社員」を得て、これを辞書バッファ１１０ｄの表現テーブル１２３に登録する。その結果、表現テーブルは１２３’のようになる。
【００６６】
これにより、文書データから抽出された表現を、簡単な操作で電子辞書内の所望の概念へ登録することができるようになった。
【００６７】
次に、辞書検証部１０４の処理動作について説明する。図８は辞書検証部１０４の処理動作を示すフローチャートである。
【００６８】
制御部１０１の動作中、ステップ２０９において辞書検証指示であると判断した場合には、辞書検証部１０４が起動される。起動された辞書検証部１０４は、辞書バッファ１１０ｄに記憶される辞書データを表示する（Ｓ４０１）。
【００６９】
次に、辞書１０９を用いて情報抽出を行い、結果を処理結果バッファ１１０ｃに記憶する（Ｓ４０２）。そして、処理結果バッファ１１０ｃに記憶した情報を表示する（Ｓ４０３）。
【００７０】
次に、ユーザの入力が情報抽出結果の検証指示であるか否か判定する（Ｓ４０４）。
【００７１】
ここで、情報抽出結果の検証指示であると判定した場合には、検証したいノードを辞書中で指定する（Ｓ４０５）。そして、指定されたノードに対応する情報抽出結果を表示する（Ｓ４０６）。なお、この処理後、ステップＳ４０４に戻る。
【００７２】
一方、ステップＳ４０４において、情報抽出結果の検証指示でないと判定した場合には、次に、ユーザの入力が辞書編集指示であるか否か判定する（Ｓ４０７）。ここで、辞書編集指示であると判定した場合には、辞書編集部１０３を起動し、辞書編集を行う。辞書編集部１０３が終了すると、ステップＳ４０４に戻る。
【００７３】
一方、ステップＳ４０７において、辞書編集指示でないと判定した場合には、辞書バッファ１１０ｄに記憶される情報を磁気ディスク等に保存する（Ｓ４０９）。また、処理結果バッファ１１０ｃに記憶される情報を磁気ディスク等に保存し（Ｓ４１０）、辞書検証部１０４の処理を終了する。
【００７４】
このようにして、辞書検証部１０４は、辞書１０９による情報抽出結果を表示しながら辞書の検証を行う。
【００７５】
図９は、辞書検証部１０４の処理時の画面表示例を示している。
【００７６】
表示上、左半分の表示内容は、図６の表現登録部１０２の左半分の表示内容と比較して、登録ボタンに代えて検証ボタンがあるのみで、その他は同じである。
【００７７】
この検証ボタンを押すことによって結果検証処理（図８のＳ４０５〜Ｓ４０６）が可能になる。
【００７８】
一方、図９の右半分の表示内容は、文書名の下部には、辞書１０９で処理された抽出結果が表示される。そして、図のように抽出結果の検証後にあたって、抽出結果上、指定された概念が持つ表現がある箇所には、ユーザへ明示可能なように、ここでは下線（ｂ１，ｂ２，ｂ３，ｂ４，…）で明示している。なお、下線以外に例えば、色などの十分識別ができる方法であれば良い。
【００７９】
さらに、この際、情報抽出に当たって形態素解析を用いたかどうかにしたがって、下線の形状を変えたり、色を変えたりなどの表示方法の変化をつけることにより、情報抽出の条件をも同時に表示することができる。そして、この指定された概念が持つ表現がある箇所を、情報抽出に当たって形態素解析を用いたかどうかによってその表示方法に変化をつける点も、この辞書構築支援装置の特徴の１つである。
【００８０】
これにより、階層化された電子辞書内の概念を、簡単な操作で指定でき、指定された概念に含まれる表現を電子辞書で抽出された抽出結果の中から容易に抽出して明示し、しかも照合条件も同時に表示するようにしたので、ユーザは、電子辞書の良し悪しを容易に検証することができるようになった。
【００８１】
この例では、「サービス」、「開発」、「営業」に付された下線（ｂ１，ｂ２，ｂ３）は単線なのに対して、「バイヤー」に付された下線（ｂ４）は２重線となっているが、これは、「サービス」、「開発」、「営業」は、情報抽出に当たって形態素解析が用いられておらず、一方、「バイヤー」は、形態素解析が用いられていることを示している。つまり、先に図６で示した、「サービス」、「開発」、「営業」を含む概念「仕事」については、形態素解析を使わず、「バイヤー」を含む概念「人」については、形態素解析を使うように設定した情報抽出時の照合条件を、この下線により簡単に認識することができる。
【００８２】
次に、辞書編集部１０３の処理動作について説明する。図１０は、辞書編集部１０３の処理動作を示すフローチャートである。
【００８３】
制御部１０１の動作中、ステップＳ２１１において辞書編集指示であると判断した場合、表現登録部１０２の動作中、ステップＳ３０９において辞書編集指示であると判断した場合、および辞書検証部１０４の動作中、ステップＳ４０７において辞書編集指示であると判断した場合に、辞書編集部１０３が起動される。
【００８４】
起動された辞書編集部１０３は、辞書バッファ１１０ｄに記憶される辞書データを表示する（Ｓ５０１）。
【００８５】
次に、ユーザの入力がノードの追加指示であるか否か判定する（Ｓ５０２）。ここで、追加指示である場合には、追加ノードを指定する（Ｓ５０３）。そして、指定されたノードに子ノードを追加する（Ｓ５０４）。追加の内容は、ユーザが直接入力する。そして、ステップＳ５０２に戻る。
【００８６】
一方、ステップＳ５０２において追加指示でない場合には、次にユーザの入力がノードの削除指示であるか否か判定する（Ｓ５０５）。ここで、削除指示である場合には、削除ノードを指定する（Ｓ５０６）。そして、指定されたノードとそのノードの子ノードを全て削除する（Ｓ５０７）。そして、ステップＳ５０２に戻る。
【００８７】
一方、ステップＳ５０５において削除指示でない場合には、次にユーザの入力がノードの変更指示であるか否か判定する（Ｓ５０８）。ここで、変更指示である場合には、変更ノードを指定する（Ｓ５０９）。そして、指定されたノードの文字列や値などを変更する（Ｓ５１０）。そして、ステップＳ５０２に戻る。
【００８８】
一方、ステップＳ５０８において変更指示でない場合には、次にユーザの入力がノードの複写指示であるか否か判定する（Ｓ５１１）。ここで、複写指示である場合には、複写元ノードを指定する（Ｓ５１２）。そして、複写先ノードを指定する（Ｓ５１３）。そして、指定された複写元ノードとその子ノード全てを複写先ノードの子ノードに追加する（Ｓ５１４）。そして、ステップＳ５０２に戻る。
【００８９】
一方、ステップＳ５１１において、複写指示でない場合には、ユーザの入力がノードの移動指示であるか否か判定する（Ｓ５１５）。ここで、移動指示である場合には、移動元ノードを指定する（Ｓ５１６）。そして、移動先ノードを指定する（Ｓ５１７）。そして、指定された移動元ノードとその子ノード全てを移動先ノードの子ノードに移動する（Ｓ５１８）。そして、ステップＳ５０２に戻る。
【００９０】
一方、ステップＳ５１５において、移動指示でない場合には、辞書バッファ１１０ｄに記憶される情報を磁気ディスク等に保存し（Ｓ５１９）、処理を終了する。
【００９１】
このようにして、辞書編集部１０３は、辞書１０９のノードに対して、追加、削除、変更、複写、移動などを行い、辞書編集を行う。
【００９２】
図１１は、辞書編集部１０３の処理時の画面表示例を示している。
【００９３】
この例では、図６の表現登録部１０２の左半分の表示内容と比較して、登録ボタンに変えて、追加、削除、変更、移動、複写の各ボタンが配置されている点が異なり、他は同じである。各ボタンをクリックすることにより、上記したようなクリックされたボタンの各種編集機能が実施される。
【００９４】
以上のように、電子辞書の編集を扱いやすいユーザインターフェースにしたので、ユーザにとって辞書の編集が容易に実現できる。
【００９５】
次に、差分検出部１０５の処理動作について説明する。図１２は差分検出部１０５の処理動作を示すフローチャートである。
【００９６】
制御部１０１の動作中、ステップＳ２１３において差分検出指示であると判断した場合に、差分検出部１０５が起動される。
【００９７】
起動された差分検出部１０５は、ユーザの入力によって、比較したい辞書を２つ指定する（Ｓ６０１）。そして、指定された辞書の差分を作成して差分バッファ１１０ｅに記憶する（Ｓ６０２）。
【００９８】
次に、差分バッファ１１０ｅに記憶される差分を表示する（Ｓ６０３）。差分バッファ１１０ｅに記憶される差分を磁気ディスク等に保存し（Ｓ６０４）、処理を終了する。
【００９９】
このようにして、差分検出部１０５は、辞書１０９同士を比較して、差分を作成・表示する。
【０１００】
図１３は、差分検出部１０５の処理時の画面表示例を示している。
【０１０１】
この例では、上部には、比較したい辞書を入力する２つの領域を備える。この領域の下部には、比較結果を表示する領域であり、ここでは、概念毎の比較結果を表示している。
【０１０２】
これにより、２つの電子辞書を容易に指定可能となり、それら電子辞書間の差分を検出し、概念の単位でユーザへ提示可能となった。
【０１０３】
以上説明した本実施形態においては、文書データから抽出された表現を、簡単な操作で電子辞書内の所望の概念へ登録することができるようになった。
【０１０４】
また、階層化された電子辞書内の概念を、簡単な操作で指定でき、指定された概念に含まれる表現を電子辞書で抽出された抽出結果の中から容易に抽出して明示するようにしたので、ユーザは、電子辞書の良し悪しを容易に検証することができるようになった。
【０１０５】
また、２つの電子辞書を容易に指定可能となり、それら電子辞書間の差分を検出し、概念の単位でユーザへ提示可能となった。
【０１０６】
また、文書から抽出した重要な表現を参照しながら、必要な表現を選択して情報抽出用の辞書に登録することができ、また、情報抽出結果を見て辞書の性能を検証しながら辞書を編集することができるようになった。
【０１０７】
なお、本願発明は、前記実施形態に限定されるものではなく、実施段階ではその要旨を逸脱しない範囲で種々に変形することが可能である。更に、前記実施形態には種々の段階の発明が含まれており、開示される複数の構成要件における適宜な組み合わせにより種々の発明が抽出され得る。たとえば、実施形態に示される全構成要件から幾つかの構成要件が削除されても、発明が解決しようとする課題の欄で述べた課題が解決でき、発明の効果の欄で述べられている効果が得られる場合には、この構成要件が削除された構成が発明として抽出され得る。
【０１０８】
【発明の効果】
以上説明したように、本発明によれば、ユーザが表現の照合において形態素解析をするかしないかを設定できる情報抽出用の辞書を作成・検証するための支援環境を提供することができる。
【図面の簡単な説明】
【図１】本発明の実施形態に係る辞書構築支援装置の構成を示す図。
【図２】同実施形態における情報抽出用辞書１０９の構成の一例。
【図３】同実施形態における制御部１０１の処理動作を示すフローチャート。
【図４】同実施形態における制御部１０１が起動した直後の画面表示例。
【図５】同実施形態における表現登録部１０２の処理動作を示すフローチャート。
【図６】同実施形態における表現登録部１０２の処理時の画面表示例。
【図７】同実施形態における表現の登録処理を模式的に示した図。
【図８】同実施形態における辞書検証部１０４の処理動作を示すフローチャート。
【図９】同実施形態における辞書検証部１０４の処理時の画面表示例。
【図１０】同実施形態における辞書編集部１０３の処理動作を示すフローチャート。
【図１１】同実施形態における辞書編集部１０３の処理時の画面表示例。
【図１２】同実施形態における差分検出部１０５の処理動作を示すフローチャート。
【図１３】同実施形態における差分検出部１０５の処理時の画面表示例。
【符号の説明】
１０…ＣＰＵ
２０…入力部
３０…出力部
４０…メモリ
５０…記憶装置
１００…辞書構築プログラム
１０１…制御部
１０２…表現登録部
１０２ａ…表現抽出部
１０３…辞書編集部
１０４…辞書検証部
１０５…差分検出部
１０６…条件設定部
１０９…情報抽出用辞書
１１０…内部データ保持部
１１０ａ…テキストバッファ
１１０ｂ…表現リストバッファ
１１０ｃ…処理結果バッファ
１１０ｄ…辞書バッファ
１１０ｅ…差分バッファ
１２１…クラステーブル
１２２…概念テーブル
１２３…表現テーブル
[0001]
[Technical field of the invention]
The present invention relates to a dictionary construction support apparatus and method for enabling a user to construct an electronic dictionary used in a document classification system for automatically classifying document data.
[0002]
[Prior art]
A wide variety of electronic data is used every day in many places, and more than 80% of that data is said to be document data. Various electronic dictionaries are used with document data depending on the purpose. For example, a kana-kanji conversion dictionary is used when creating Japanese document data, a morphological analysis dictionary is used when performing morphological analysis on Japanese document data, and a syntactic analysis dictionary is used when parsing document data. Because such electronic dictionaries are basically governed by clearly defined, invariant rules, they are created and supplied in advance by the product manufacturer, and users use them without being particularly aware of their contents.
[0003]
Meanwhile, text mining technology has attracted attention in recent years as a technique for extracting the information a user needs from document data, in particular from document data written in a free format.
[0004]
In Patent Document 1, the applicant of the present application proposed one text mining method. It is a method of analyzing document data in which important information contained in the target document data is extracted using an information extraction dictionary that holds each concept to be extracted in association with one or more expressions representing that concept. An expression here means, for example, a word, a phrase, or a set of words in a co-occurrence relation, and a concept here corresponds to a higher-level meaning common to such expressions. For example, “owner”, “trainer”, and “buyer” are individual “expressions”, and “person” is a “concept” whose meaning is superordinate to “owner”, “trainer”, “buyer”, and so on.
[0005]
The correspondence between concepts and expressions in such an information extraction dictionary is not governed by clearly defined, invariant rules; it depends on the field of the document data to be analyzed and on the viewpoint of the analysis.
[0006]
For these information extraction dictionaries as well, as with the electronic dictionaries described above, the situation has been that the product manufacturer investigates how the user who introduces the text mining technology will use it, creates and provides an information extraction dictionary accordingly, and then verifies the results after use and maintains the dictionary by registering, deleting, and correcting entries.
[0007]
As described above, although the quality of a conventional information extraction dictionary used for text mining greatly affects extraction accuracy, the dictionary is prepared in advance by the manufacturer, who is not the one using it, and no means has been provided for the user to create the dictionary or to edit it while verifying it. It was possible to edit the information extraction dictionary using an editor or the like, but doing so required a thorough understanding of the dictionary structure and editing at the program level, so it was not something an ordinary user could do easily.
[0008]
In maintaining these information extraction dictionaries as well, the only way to improve the accuracy of text mining was for the product manufacturer to investigate the problems users encountered after use, judge from experience which parts were likely to resolve them, and add, delete, or correct entries in the information extraction dictionary.
[0009]
In view of these points, the applicant of the present application proposed, in Patent Document 2, a method and an apparatus that facilitate the creation and maintenance of an information extraction dictionary used in text mining.
[0010]
In addition, in view of the fact that some concepts to be extracted, such as proper nouns, do not require morphological analysis, the applicant of the present application proposed, in Patent Document 3, an information extraction system characterized by comprising expression extraction means that, when collating expressions in an electronic dictionary with document data, extracts expressions by switching, according to a preset condition, between collating after performing morphological analysis and collating without performing morphological analysis. This allows a collation condition to be set for each expression or concept to be extracted. For example, when the expression to be extracted would, under plain string matching, also match part of another word, performing morphological analysis allows word boundaries to be recognized correctly, so extraction accuracy is higher than without morphological analysis. As a specific example, when extracting “Kyoto” (京都), collation without morphological analysis also matches the “京都” inside the character string “Tokyo-to” (東京都), whereas performing morphological analysis prevents this error.
[0011]
Conversely, when morphological analysis fails and a term is treated as an unknown word, word boundaries are likely to be misidentified, so performing morphological analysis can cause a match to be missed. As a specific example, even if “Kochijan” (コチジャン) is registered in the extraction dictionary as one word, when morphological analysis splits it into two words, as in “Kochi / Jan” (コチ／ジャン), it no longer matches as registered once morphological analysis is performed.
[0012]
In general, it is more appropriate not to perform morphological analysis when extracting product names, symbol strings, and the like that are not registered in the morphological analysis dictionary, and more appropriate to perform it when extracting general words that are registered in the morphological analysis dictionary.
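The trade-off described in paragraphs [0010] to [0012] can be illustrated with a minimal sketch. It assumes the morphological analyzer's output is already available as a list of morphemes; the function names and the hand-written morpheme lists are hypothetical and are not taken from the patent or from any real analyzer.

```python
from typing import List


def match_without_morphology(expression: str, text: str) -> bool:
    """Plain substring collation: word boundaries are ignored."""
    return expression in text


def match_with_morphology(expression: str, morphemes: List[str]) -> bool:
    """Collation after analysis: the expression must equal a whole morpheme."""
    return expression in morphemes


# Hand-written analyzer output, for illustration only (an assumption).
tokyo_text = "東京都に行く"
tokyo_morphemes = ["東京都", "に", "行く"]

kochijan_text = "コチジャンを買う"
# An unknown word that the analyzer is assumed to split in two.
kochijan_morphemes = ["コチ", "ジャン", "を", "買う"]

# "京都" is wrongly found inside "東京都" unless morphemes are compared.
kyoto_plain = match_without_morphology("京都", tokyo_text)
kyoto_morph = match_with_morphology("京都", tokyo_morphemes)

# "コチジャン" is found by plain collation but lost after analysis.
kochijan_plain = match_without_morphology("コチジャン", kochijan_text)
kochijan_morph = match_with_morphology("コチジャン", kochijan_morphemes)
```

In this sketch the general word benefits from morpheme-level collation while the unregistered word benefits from plain string collation, which is the reason a per-concept switch is useful.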
[0013]
[Patent Document 1]
JP 2001-147937 A
[0014]
[Patent Document 2]
JP 2002-140338 A
[0015]
[Patent Document 3]
Japanese Patent Application No. 2002-278421
[0016]
[Problems to be solved by the invention]
However, setting the above conditions required editing the information extraction dictionary with a text editor or the like. Constructing an electronic dictionary for collation without morphological analysis requires no knowledge of computational linguistics, so in principle it is a task that ordinary users could perform. In practice, however, it required editing at the program level, which made the editing work extremely difficult for ordinary users.
[0017]
The present invention has been made in view of these circumstances, and its object is to provide a support environment for creating and verifying an information extraction dictionary in which the user can set whether or not morphological analysis is performed when collating expressions.
[0018]
[Means for Solving the Problems]
To this end, the dictionary construction support apparatus of the present invention comprises: dictionary storage means for storing an electronic dictionary in which a plurality of expressions are stored in association with a concept, which is a higher-level expression common to those expressions; expression storage means for storing expressions extracted from document data; display means for simultaneously displaying at least some of the concepts from the electronic dictionary stored in the dictionary storage means and at least some of the extracted expressions stored in the expression storage means; and registration means for, upon receiving a designation of one or more expressions from the displayed expressions and a designation of one concept from the displayed concepts, additionally registering the designated expressions in the electronic dictionary in association with the designated concept. In addition to these, the apparatus comprises means for designating, for each concept, whether or not morphological analysis is performed at the time of collation, and for storing that designation in the electronic dictionary.
[0019]
Means is further provided for displaying the extraction results so that those collated using morphological analysis are distinguished from those that were not.
[0020]
As a result, expressions extracted from document data can be registered under a desired concept in the electronic dictionary with a simple operation, and the quality of the electronic dictionary, including whether or not morphological analysis should be performed, can be verified easily.
[0021]
That is, the user can select the necessary expressions and register them in the dictionary for information processing while referring to the important expressions extracted from the document, can specify at the same time whether or not to perform morphological analysis, and can edit the dictionary while verifying its performance by viewing the information processing results.
[0022]
[Embodiments of the invention]
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0023]
FIG. 1 is a diagram showing a configuration of a dictionary construction support device according to an embodiment of the present invention.
[0024]
The dictionary construction support apparatus according to the present embodiment is realized on a computer such as a personal computer: the dictionary construction support program 100 stored in the storage device 50 is read into the memory 40 and executed by the CPU 10, which controls the entire apparatus.
[0025]
The input unit 20 receives, from a mouse, a keyboard, a voice input device, or the like, editing instructions such as insertion or deletion of character strings, operation instructions for selecting functions, and command inputs such as designation of the document or dictionary to be processed. The output unit 30 supplies display information to a display device such as a display.
[0026]
The memory 40 is a high-speed volatile memory composed of, for example, DRAM. As described above, it stores the program 100 and the like executed by the CPU 10, and it is also used as the internal data holding unit 110, which holds data temporarily while the program 100 is executed.
[0027]
The internal data holding unit 110 includes a text buffer 110a, an expression list buffer 110b, a processing result buffer 110c, a dictionary buffer 110d, and a difference buffer 110e.
[0028]
The storage device 50 is a nonvolatile large-capacity storage device and can be realized by, for example, an HDD, a CD-ROM, or a DVD-ROM. The storage device 50 stores the dictionary construction support program 100, which consists of a control unit 101, an expression registration unit 102 including an expression extraction unit 102a, a dictionary editing unit 103, a dictionary verification unit 104, a difference detection unit 105, and a condition setting unit 106, together with the information extraction dictionary 109 constructed by this program.
[0029]
The control unit 101 is a main program that is executed first when the program 100 is started.
[0030]
The expression registration unit 102 is a program for registering expressions in the information extraction dictionary 109. It is activated by a predetermined instruction from the control unit 101 and performs processing for registering the “expressions” described later in the information extraction dictionary 109.
[0031]
The expression extraction unit 102a is a program that the expression registration unit 102 starts when expressions need to be extracted from document data. The expression extraction unit 102a performs syntax analysis and the like on the content stored in the text buffer 110a, lists the words, phrases, co-occurrence relations, and the like used in the document (hereinafter referred to as expressions), and stores the list in the expression list buffer 110b. This list is collectively called an expression list. Methods of creating the expression list are disclosed in, for example, JP-A-62-99865 and JP-A-2-42572, and their description is omitted here.
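As a rough illustration of what an expression list might contain, the following sketch counts words and sentence-level co-occurrence pairs from text that is assumed to be tokenized already. It is not the method of the cited publications; the function name, the input format, and the definition of co-occurrence (two distinct words in the same sentence) are assumptions made for this example, and phrase extraction is omitted.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List


def build_expression_list(sentences: List[List[str]]) -> Dict[str, Counter]:
    """Build word and co-occurrence counts from pre-tokenized sentences."""
    words: Counter = Counter()
    cooccurrences: Counter = Counter()
    for tokens in sentences:
        words.update(tokens)
        # Each unordered pair of distinct words is counted once per sentence.
        for pair in combinations(sorted(set(tokens)), 2):
            cooccurrences[pair] += 1
    return {"words": words, "cooccurrences": cooccurrences}


# Hypothetical tokenized input standing in for the text buffer contents.
sample = [["buyer", "visits", "store"], ["buyer", "store"]]
expression_list = build_expression_list(sample)
```

The resulting counters play the role of the expression list buffer: the user would pick entries from them for registration.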
[0032]
The expression registration unit 102 registers the expression specified by the user from the created expression list in the dictionary buffer 110d.
[0033]
The dictionary editing unit 103 is a program for editing the information extraction dictionary 109. The dictionary editing unit 103 can also be activated by the expression registration unit 102 and the dictionary verification unit 104.
[0034]
The dictionary verification unit 104 performs the extraction process specified by the user using the dictionary buffer 110d, into which the information extraction dictionary 109 has been read, and stores the extraction result in the processing result buffer 110c. It also displays the information stored in the processing result buffer 110c in response to the user's instructions.
[0035]
The difference detection unit 105 detects a difference between the two information extraction dictionaries 109.
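One way to picture difference detection is a per-concept comparison of the expressions held by two dictionaries. The sketch below assumes each dictionary is simplified to a mapping from concept name to a set of expressions; this layout and the function name are hypothetical and are not specified by the description.

```python
from typing import Dict, Set

# Assumed simplified layout: concept name -> set of expressions.
Dictionary = Dict[str, Set[str]]


def diff_dictionaries(old: Dictionary, new: Dictionary) -> Dict[str, Dict[str, Set[str]]]:
    """Return, per concept, the expressions added to or removed from `old`."""
    result: Dict[str, Dict[str, Set[str]]] = {}
    for concept in sorted(set(old) | set(new)):
        before = old.get(concept, set())
        after = new.get(concept, set())
        added, removed = after - before, before - after
        if added or removed:  # unchanged concepts are left out of the report
            result[concept] = {"added": added, "removed": removed}
    return result


old_dictionary = {"person": {"owner", "trainer"}, "work": {"sales"}}
new_dictionary = {"person": {"owner", "buyer"}, "work": {"sales"}}
difference = diff_dictionaries(old_dictionary, new_dictionary)
```

Here only the concept "person" appears in the result, with "buyer" added and "trainer" removed.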
[0036]
The condition setting unit 106 sets the collation condition used for information extraction. The condition that has been set is saved in the information extraction dictionary 109 via the control unit 101. When read out into the dictionary buffer 110d, the saved condition controls the information extraction behavior of the dictionary verification unit 104, and it also controls the information extraction behavior of an external information extraction system that uses this information extraction dictionary, for example as in Patent Document 3.
[0037]
The information extraction dictionary 109 is an electronic dictionary constructed by this program. FIG. 2 shows an example of its configuration. The information extraction dictionary 109 has three layers: the highest layer is called the class, the middle layer the concept, and the lowest layer the expression. When any one of these is referred to without regard to whether it is a class, a concept, or an expression, it is simply called a node below.
[0038]
The class table 121 holds the classes and stores, for example, “place name” and “person name”. The concept table 122 stores classes and concepts in association with each other. The expression table 123 stores classes, concepts, and expressions in association with one another.
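The three tables can be sketched as a small data structure. In the sketch below the class and field names are hypothetical, and keeping the per-concept morphological analysis flag inside the concept table is an assumption: the description states only that the collation condition is saved in the dictionary for each concept. The sample values follow the registration example given elsewhere in the specification, in which the expression "employee" is registered under the concept "person" of the class "industry".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ExtractionDictionary:
    # Class table 121: list of class names.
    classes: List[str] = field(default_factory=list)
    # Concept table 122: (class, concept) -> use morphological analysis?
    # Storing the flag here is an assumption of this sketch.
    concepts: Dict[Tuple[str, str], bool] = field(default_factory=dict)
    # Expression table 123: (class, concept, expression) rows.
    expressions: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_concept(self, cls: str, concept: str, use_morphology: bool) -> None:
        """Add a concept under a class together with its collation condition."""
        if cls not in self.classes:
            self.classes.append(cls)
        self.concepts[(cls, concept)] = use_morphology

    def register(self, cls: str, concept: str, expression: str) -> Tuple[str, str, str]:
        """Register an expression under an existing concept, without duplicates."""
        if (cls, concept) not in self.concepts:
            raise KeyError(f"unknown concept: {cls}/{concept}")
        row = (cls, concept, expression)
        if row not in self.expressions:
            self.expressions.append(row)
        return row


dictionary = ExtractionDictionary()
dictionary.add_concept("industry", "person", use_morphology=True)
dictionary.add_concept("industry", "work", use_morphology=False)
registered = dictionary.register("industry", "person", "employee")
dictionary.register("industry", "person", "employee")  # repeated call adds nothing
```

Because every expression row carries its class and concept, looking up the collation condition for a matched expression is a single dictionary access on the concept table.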
[0039]
The above is the configuration of the dictionary construction support device of the present embodiment.
[0040]
Next, the operation of the dictionary construction support apparatus of the present embodiment will be described with reference to a flowchart, taking information extraction as an example.
[0041]
FIG. 3 shows a flowchart of the control unit 101 in the dictionary construction program 100.
[0042]
When the user requests activation of the dictionary construction program 100, the control unit 101 is activated.
[0043]
After its own program starts, the control unit 101 waits for a key input from the user (S201). When a key input is received, it first determines whether or not the user input is an end instruction (S202). If it is an end instruction, the control unit 101 terminates its own program.
[0044]
On the other hand, if it is determined in step S202 that the input is not an end instruction, it is next determined whether or not the user input is a document designation (S203). If a document is designated, the designated document data is read from the designated location, for example, a memory, a magnetic disk, an optical disk, or the like, and stored in the text buffer 110a in the internal data holding unit 110 of the memory 40 (S204). When this process ends, the process returns to step S201 and waits for the next input from the user.
[0045]
On the other hand, if it is determined in step S203 that the input is not a document designation, it is next determined whether or not the user input is a dictionary designation (S205). If a dictionary is designated, the designated dictionary is read from the designated location, for example, a memory, a magnetic disk, an optical disk, or the like, and stored in the dictionary buffer 110d in the internal data holding unit 110 of the memory 40 (S206). When this process ends, the process returns to step S201 and waits for the next input from the user.
[0046]
On the other hand, if it is determined in step S205 that the dictionary is not specified, it is determined whether the user input is an expression registration instruction (S207). If it is determined that the instruction is an expression registration instruction, the expression registration unit 102 is activated (S208). The operation of the expression registration unit 102 will be described later. When the operation of the expression registration unit 102 ends, the process returns to step S201, and waits for the next input from the user.
[0047]
On the other hand, if it is determined in step S207 that the input is not an expression registration instruction, it is next determined whether or not the user input is a dictionary verification instruction (S209). If it is determined that the instruction is a dictionary verification instruction, the dictionary verification unit 104 is activated (S210). The operation of the dictionary verification unit 104 will be described later. When the operation of the dictionary verification unit 104 ends, the process returns to step S201, and waits for the next input from the user.
[0048]
On the other hand, if it is determined in step S209 that the input is not a dictionary verification instruction, then it is determined whether the user input is a dictionary edit instruction (S211). If the input is a dictionary editing instruction, the dictionary editing unit 103 is activated (S212). The operation of the dictionary editing unit 103 will be described later. When the operation of the dictionary editing unit 103 ends, the process returns to step S201, and waits for the next input from the user.
[0049]
On the other hand, if it is determined in step S211 that the input is not a dictionary editing instruction, then it is determined whether the user input is a difference detection instruction (S213). If the instruction is a difference detection instruction, the difference detection unit 105 is activated (S214). The operation of the difference detection unit 105 will be described later. When the operation of the difference detection unit 105 ends, the process returns to step S201, and waits for the next input from the user.
[0050]
On the other hand, if it is determined in step S213 that the received instruction is not a difference detection instruction, another designated process corresponding to the user's input instruction, for example, an environment setting process, is performed (S215), and the process returns to step S201 and waits for the next input from the user.
[0051]
As described above, the control unit 101 analyzes the input of the user, activates various processing units, and stores the processing target document and the dictionary in the internal data holding unit 110 based on the user's instruction.
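The control flow of FIG. 3 as described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the function name `run_control_unit`, the input tuples, and the instruction keywords are hypothetical, loading a document or dictionary is reduced to storing its name, and activating a processing unit is reduced to recording the unit's name.

```python
def run_control_unit(inputs):
    """Process (kind, argument) user inputs in order until an end instruction."""
    buffers = {"text": None, "dictionary": None}
    activated = []

    def load_document(arg):       # S204: store in the text buffer 110a
        buffers["text"] = arg

    def load_dictionary(arg):     # S206: store in the dictionary buffer 110d
        buffers["dictionary"] = arg

    def activate(unit):
        # Stand-in for starting a processing unit: just record its name.
        return lambda arg: activated.append(unit)

    handlers = {
        "document": load_document,
        "dictionary": load_dictionary,
        "register": activate("expression registration unit 102"),   # S208
        "verify": activate("dictionary verification unit 104"),     # S210
        "edit": activate("dictionary editing unit 103"),            # S212
        "diff": activate("difference detection unit 105"),          # S214
    }

    for kind, arg in inputs:                  # S201: wait for the next input
        if kind == "end":                     # S202: end instruction
            break
        # S203 to S213 are tested in sequence in the flowchart; any other
        # instruction falls through to "other processing" (S215).
        handlers.get(kind, activate("other processing"))(arg)
    return buffers, activated


buffers, activated = run_control_unit([
    ("document", "doc.txt"),
    ("verify", None),
    ("environment", None),
    ("end", None),
    ("edit", None),   # never reached: the loop has already ended
])
print(buffers, activated)
```

A table lookup replaces the chain of determinations here because the instruction kinds are mutually exclusive, so the order in which they are tested does not affect the result.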
[0052]
FIG. 4 shows a screen display example immediately after the control unit 101 is started. The upper part has buttons for registration, verification, editing, difference display, environment setting, and termination. The user selects these processings by pressing these buttons. For example, when the end button is clicked, it is determined in step S202 that the user's input is an end instruction, and end processing is performed. Further, for example, when the registration button is clicked, it is determined in step S207 that the user input is an expression registration instruction, and a process of activating the expression registration unit 102 is performed.
[0053]
Below these buttons, an area for inputting a dictionary name is provided on the left, and an area for inputting a document name is provided on the right. When the user inputs a dictionary name or a document name in these areas and executes the operation, the corresponding dictionary or document is read from the storage device 50. For example, when the user inputs a document name in the area for inputting a document name and executes the operation, it is determined in step S203 that the user input is a document designation, and in step S204 the relevant document is read from the storage device 50 or the like based on the input document name and stored in the text buffer 110a.
[0054]
Next, the processing operation of the expression registration unit 102 will be described. FIG. 5 is a flowchart showing the processing operation of the expression registration unit 102.
[0055]
If it is determined in step S207 that the instruction is an expression registration instruction during the operation of the control unit 101, the expression registration unit 102 is activated. The activated expression registration unit 102 first activates the expression extraction unit 102a. The activated expression extraction unit 102a analyzes the document data stored in the text buffer 110a by a predetermined analysis method and creates an expression list (S301). The created expression list is stored in the expression list buffer 110b (S302). The predetermined analysis method is not particularly limited, and the list may be created by any of various well-known methods. For example, when creating an expression list of words, the document data may be subjected to morphological analysis or syntax analysis, the words extracted, and the extracted words listed. After the extraction, the expression extraction unit 102a ends.
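As a rough illustration of steps S301 and S302 for a word list, the sketch below builds a de-duplicated list of words from a text. It is not the analysis method of the embodiment: a regular expression over Latin letters is used purely as a stand-in for morphological analysis (Japanese text would require an actual morphological analyzer), and the function name `extract_word_expressions` is hypothetical.

```python
import re


def extract_word_expressions(text):
    """Return the distinct words of the text in order of first appearance."""
    # Stand-in for morphological analysis: split on runs of Latin letters.
    words = re.findall(r"[A-Za-z]+", text.lower())
    # dict.fromkeys keeps the first occurrence of each word and preserves order.
    return list(dict.fromkeys(words))


# S301: create the expression list; S302: keep it in an expression list buffer.
expression_list_buffer = extract_word_expressions("The buyer met the employee")
print(expression_list_buffer)
```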
[0056]
Next, the dictionary data stored in the dictionary buffer 110d and the expression list stored in the expression list buffer 110b are read out and displayed simultaneously (S303).
[0057]
Next, it is determined whether or not the user input is a dictionary registration instruction (S304). If the instruction is a dictionary registration instruction, the user specifies an expression to be registered in the dictionary from the displayed expression list (S305). The expression registration unit 102 thereby obtains the expression specified by the user. The user then designates, from the displayed dictionary concepts, the concept under which the obtained expression is to be registered (S306). The expression registration unit 102 thereby obtains the concept under which the expression is to be registered. Further, using an input means such as a check box, the user specifies to the condition setting unit 106 whether or not morphological analysis should be performed for that concept at the time of information extraction (S307). Then, the obtained expression, concept, and information extraction condition are combined and registered as dictionary information of the expression (lower-level table) (S308). This registration is added to the dictionary buffer 110d.
[0058]
On the other hand, if it is not a dictionary registration instruction in step S304, it is determined whether or not the user input is a dictionary editing instruction (S309). If the instruction is a dictionary editing instruction, the dictionary editing unit 103 described later is activated to edit the dictionary (S310). When the processing of the dictionary editing unit 103 ends, the process returns to step S304.
[0059]
On the other hand, if it is not a dictionary editing instruction in step S309, the information stored in the dictionary buffer 110d is stored on a magnetic disk or the like (S311), and the process ends.
[0060]
In this way, the expression registration unit 102 activates the expression extraction unit 102a to create an expression list, and registers expressions selected from the expression list in the dictionary 109.
[0061]
FIG. 6 shows an example of a screen display during processing by the expression registration unit 102. A registration button is provided below the area for inputting the dictionary name. Clicking this button enables the dictionary registration processing (S305 to S308 in FIG. 5). The contents of the dictionary are displayed below the registration button. In this example, part of the contents of the dictionary named “XXXXXXXXX” is displayed in a three-layer format of class, concept, and expression. A check box is provided in front of each item, and the user can select an item by clicking its check box. Each concept also has a check box (a1, a2, ...) for designating whether to use morphological analysis as a collation condition at the time of information extraction, so that the user can input the condition. One of the features of this dictionary construction support device is that these check boxes (a1, a2, ...) allow the user to designate whether or not to perform morphological analysis. This eliminates the need for laborious editing work and allows ordinary users to set the condition easily. In this example, the check box (a1) for the concept “work” is off and the check box (a2) for the concept “person” is on. Accordingly, morphological analysis is not used to extract expressions included in the concept “work”, whereas morphological analysis is used to extract expressions included in the concept “person”.
[0062]
On the other hand, below the area for inputting the document name, there are buttons for word, co-occurrence, and phrase. Pressing one of these buttons selects which kind of expression list is displayed. Below these buttons is an area for displaying the expression list. In this example, the word expression list for the document named “++++++” is displayed. A check box is provided in front of each entry (a word in this case) in the expression list, and the user can select an entry by clicking its check box.
[0063]
As described above, the screen at the time of processing of the expression registration unit 102 can simultaneously display the dictionary and the expression list, and the dictionary class, concept, expression, and each word of the expression list can be easily specified. This makes it possible for the user to easily register in the dictionary.
[0064]
FIG. 7 is a diagram schematically illustrating expression registration processing. This example shows processing of internal data when "employee" is selected from the expression list and added as one expression of the concept of "person".
[0065]
When "employee" is selected from the displayed expression list, "employee" is obtained from the expression list buffer 110b, and when the concept "person" is selected from the displayed dictionary, "industry/person" is obtained from the concept table 122 stored in the dictionary buffer 110d. The obtained "industry/person" and "employee" are then combined to give "industry/person/employee", which is registered in the expression table 123 of the dictionary buffer 110d. As a result, the expression table becomes as shown at 123'.
[0066]
Thereby, the expression extracted from the document data can be registered to a desired concept in the electronic dictionary by a simple operation.
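The registration step described for FIG. 7 can be sketched as follows. This is a minimal illustration assuming that the concept table and expression table hold slash-joined paths as in the "industry/person/employee" example; the function name `register_expression` and the sample table contents are hypothetical.

```python
def register_expression(concept_table, expression_table, concept, expression):
    """Join the concept's path with the expression and add it to the expression table."""
    # Look up the "class/concept" path whose last component is the chosen concept.
    matches = [path for path in concept_table if path.split("/")[-1] == concept]
    if not matches:
        raise KeyError(f"concept not found: {concept}")
    new_path = f"{matches[0]}/{expression}"
    if new_path not in expression_table:      # avoid duplicate registration
        expression_table.append(new_path)
    return new_path


concept_table = ["industry/work", "industry/person"]
expression_table = ["industry/person/buyer"]

registered = register_expression(concept_table, expression_table, "person", "employee")
print(registered)
print(expression_table)
```

If the same concept name appeared under more than one class, this sketch would pick the first match; a real interface would know which node the user clicked and would not need the lookup.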
[0067]
Next, the processing operation of the dictionary verification unit 104 will be described. FIG. 8 is a flowchart showing the processing operation of the dictionary verification unit 104.
[0068]
If it is determined in step S209 that the instruction is a dictionary verification instruction during the operation of the control unit 101, the dictionary verification unit 104 is activated. The activated dictionary verification unit 104 displays the dictionary data stored in the dictionary buffer 110d (S401).
[0069]
Next, information is extracted using the dictionary 109, and the result is stored in the processing result buffer 110c (S402). Then, the information stored in the processing result buffer 110c is displayed (S403).
[0070]
Next, it is determined whether or not the user input is a verification instruction of the information extraction result (S404).
[0071]
If it is determined that the instruction is a verification instruction of the information extraction result, the node to be verified is specified in the dictionary (S405). Then, the information extraction result corresponding to the designated node is displayed (S406). After this processing, the process returns to step S404.
[0072]
On the other hand, if it is determined in step S404 that the instruction is not a verification instruction of the information extraction result, it is next determined whether or not the user input is a dictionary editing instruction (S407). If it is determined that the instruction is a dictionary editing instruction, the dictionary editing unit 103 is activated to perform dictionary editing. When the dictionary editing unit 103 ends, the process returns to step S404.
[0073]
On the other hand, if it is determined in step S407 that the instruction is not a dictionary editing instruction, the information stored in the dictionary buffer 110d is stored on a magnetic disk or the like (S409). In addition, the information stored in the processing result buffer 110c is stored on a magnetic disk or the like (S410), and the processing of the dictionary verification unit 104 ends.
[0074]
In this way, the dictionary verification unit 104 verifies the dictionary while displaying the information extraction result by the dictionary 109.
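To make the role of the per-concept collation condition concrete, the following sketch extracts expressions from a text with and without a token-level check. It rests on loud assumptions: the function name `extract_information` and the dictionary layout are hypothetical, and a word-boundary regular expression is used only as a stand-in for morphological analysis, which for Japanese text would require a real morphological analyzer.

```python
import re


def extract_information(text, dictionary):
    """Return sorted (start, end, concept, expression, used_morph) tuples.

    dictionary maps concept -> {"morph": bool, "expressions": [str, ...]}.
    """
    hits = []
    for concept, entry in dictionary.items():
        for expression in entry["expressions"]:
            if entry["morph"]:
                # Stand-in for collation after morphological analysis:
                # the expression must match a whole token.
                pattern = r"\b" + re.escape(expression) + r"\b"
            else:
                # Collation without morphological analysis: plain substring match.
                pattern = re.escape(expression)
            for match in re.finditer(pattern, text):
                hits.append((match.start(), match.end(), concept,
                             expression, entry["morph"]))
    return sorted(hits)


text = "The buyer handles sales; buyers differ."
dictionary = {
    "work": {"morph": False, "expressions": ["sales"]},
    "person": {"morph": True, "expressions": ["buyer"]},
}
hits = extract_information(text, dictionary)
print(hits)
```

With the condition on for "person", the string "buyers" is not reported as an occurrence of "buyer"; turning the condition off would report it as a substring match, which is the kind of behavioral difference the setting is meant to control.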
[0075]
FIG. 9 shows an example of a screen display at the time of processing by the dictionary verification unit 104.
[0076]
On the display, the display content of the left half is the same as the display content of the left half of the expression registration unit 102 in FIG. 6 except that there is only a verification button instead of the registration button.
[0077]
By pressing this verification button, the result verification processing (S405 to S406 in FIG. 8) becomes possible.
[0078]
On the other hand, in the right half of FIG. 9, the extraction result obtained using the dictionary 109 is displayed below the document name. As shown in the figure, after verification of the extraction result, the locations where expressions of the designated concept exist are marked with underlines (b1, b2, b3, b4, ...). Any method other than underlining, such as coloring, may be used as long as the locations can be sufficiently identified.
[0079]
Furthermore, by changing the display method at this time, for example by changing the shape or color of the underline, depending on whether or not morphological analysis was used for information extraction, the conditions for information extraction can be displayed at the same time. One of the features of this dictionary construction support device is that the display method at a location where an expression of the designated concept exists is changed depending on whether or not morphological analysis was used to extract the information.
[0080]
In this way, the concepts in the hierarchical electronic dictionary can be specified by a simple operation, and the expressions included in the specified concept can be easily extracted and specified from the extraction results extracted by the electronic dictionary. Since the matching condition is also displayed at the same time, the user can easily verify the quality of the electronic dictionary.
[0081]
In this example, the underlines (b1, b2, b3) added to "service", "development", and "sales" are single lines, whereas the underline (b4) added to "buyer" is a double line. This indicates that morphological analysis was not used in extracting "service", "development", and "sales", while it was used in extracting "buyer". In other words, the collation conditions set in FIG. 6, namely that the concept “work”, which includes “service”, “development”, and “sales”, does not use morphological analysis and that the concept “person”, which includes “buyer”, does use it, can be easily recognized from the underlines.
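The single and double underlines can be imitated in plain text with a marker line, as in the sketch below. It is only an illustration of changing the display method according to the collation condition: the function name `underline` and the hit format are hypothetical, with "-" standing for a single underline and "=" for a double underline.

```python
def underline(text, hits):
    """Return the text plus a marker line; hits are (start, end, used_morph)."""
    marks = [" "] * len(text)
    for start, end, used_morph in hits:
        for i in range(start, end):
            # "=" marks extraction with morphological analysis, "-" without it.
            marks[i] = "=" if used_morph else "-"
    return text + "\n" + "".join(marks).rstrip()


rendered = underline("buyer sales", [(0, 5, True), (6, 11, False)])
print(rendered)
```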
[0082]
Next, the processing operation of the dictionary editing unit 103 will be described. FIG. 10 is a flowchart illustrating the processing operation of the dictionary editing unit 103.
[0083]
The dictionary editing unit 103 is activated when it is determined that the instruction is a dictionary editing instruction in step S211 during the operation of the control unit 101, in step S309 during the operation of the expression registration unit 102, or in step S407 during the operation of the dictionary verification unit 104.
[0084]
The activated dictionary editing unit 103 displays the dictionary data stored in the dictionary buffer 110d (S501).
[0085]
Next, it is determined whether or not the user input is a node addition instruction (S502). If the instruction is an addition instruction, the node to which a node is to be added is designated (S503). Then, a child node is added to the designated node (S504). The content to be added is entered directly by the user. Then, the process returns to step S502.
[0086]
On the other hand, if it is not an addition instruction in step S502, it is next determined whether or not the user input is an instruction to delete a node (S505). If the instruction is a deletion instruction, a deletion node is designated (S506). Then, the designated node and all child nodes of the node are deleted (S507). Then, the process returns to step S502.
[0087]
On the other hand, if it is not a deletion instruction in step S505, it is next determined whether or not the user input is a node change instruction (S508). Here, if it is a change instruction, a change node is designated (S509). Then, the character string and value of the designated node are changed (S510). Then, the process returns to step S502.
[0088]
On the other hand, if it is not a change instruction in step S508, then it is determined whether or not the user input is a node copy instruction (S511). If the instruction is a copy instruction, a copy source node is designated (S512). Then, a copy destination node is designated (S513). Then, the specified copy source node and all of its child nodes are added to the child nodes of the copy destination node (S514). Then, the process returns to step S502.
[0089]
On the other hand, if it is not a copy instruction in step S511, it is determined whether or not the user input is a node move instruction (S515). Here, in the case of a move instruction, a move source node is designated (S516). Then, the destination node is specified (S517). Then, the designated source node and all of its child nodes are moved to child nodes of the destination node (S518). Then, the process returns to step S502.
[0090]
On the other hand, if it is not a move instruction in step S515, the information stored in the dictionary buffer 110d is stored on a magnetic disk or the like (S519), and the process ends.
[0091]
In this way, the dictionary editing unit 103 edits the dictionary by performing addition, deletion, change, copying, movement, and the like on the nodes of the dictionary 109.
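The five editing operations can be sketched on a simple tree of nodes as follows. This is an illustrative model rather than the structure used in the embodiment: the `Node` class and the function names are hypothetical, and only the node's character string is modeled, not any value attached to it.

```python
import copy
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare nodes by identity so removal targets the exact node
class Node:
    name: str
    children: list = field(default_factory=list)

    def add(self, name):                  # S503 to S504: add a child node
        child = Node(name)
        self.children.append(child)
        return child

    def delete(self, child):              # S506 to S507: the child's subtree goes with it
        self.children.remove(child)

    def change(self, name):               # S509 to S510: change the node's string
        self.name = name


def copy_node(source, destination):       # S512 to S514: copy the node and its children
    clone = copy.deepcopy(source)
    destination.children.append(clone)
    return clone


def move_node(source, old_parent, destination):   # S516 to S518
    old_parent.children.remove(source)
    destination.children.append(source)


root = Node("industry")
work = root.add("work")
person = root.add("person")
buyer = person.add("buyer")
copy_node(buyer, work)           # work now holds a copy of "buyer"
move_node(buyer, person, work)   # the original "buyer" moves under work
person.change("people")
print([c.name for c in root.children], [c.name for c in work.children])
```

Because children are held by their parent, deleting, copying, or moving a node automatically carries its whole subtree, which matches the description that all child nodes are processed together with the designated node.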
[0092]
FIG. 11 shows an example of a screen display at the time of processing by the dictionary editing unit 103.
[0093]
This display is the same as the left half of the display of the expression registration unit 102 in FIG. 6, except that add, delete, change, move, and copy buttons are arranged in place of the registration button. Clicking a button performs the corresponding editing function as described above.
[0094]
As described above, since the editing of the electronic dictionary is made to be a user interface which is easy to handle, the dictionary editing can be easily realized for the user.
[0095]
Next, the processing operation of the difference detection unit 105 will be described. FIG. 12 is a flowchart showing the processing operation of the difference detection unit 105.
[0096]
During the operation of the control unit 101, if it is determined in step S213 that the instruction is a difference detection instruction, the difference detection unit 105 is activated.
[0097]
The activated difference detection unit 105 designates the two dictionaries to be compared according to the user's input (S601). Then, the difference between the designated dictionaries is created and stored in the difference buffer 110e (S602).
[0098]
Next, the difference stored in the difference buffer 110e is displayed (S603). The difference stored in the difference buffer 110e is stored on a magnetic disk or the like (S604), and the process ends.
[0099]
In this way, the difference detection unit 105 creates and displays a difference by comparing the dictionaries 109 with each other.
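A concept-by-concept comparison of two dictionaries could look like the sketch below. The description does not specify how the difference is computed, so this is an assumption: each dictionary is modeled as a mapping from concept to a set of expressions, and the function name `diff_dictionaries` and the result keys are hypothetical.

```python
def diff_dictionaries(first, second):
    """Return, per concept, the expressions found in only one of the dictionaries."""
    differences = {}
    for concept in sorted(set(first) | set(second)):
        in_first = first.get(concept, set())
        in_second = second.get(concept, set())
        if in_first != in_second:
            differences[concept] = {
                "only_in_first": sorted(in_first - in_second),
                "only_in_second": sorted(in_second - in_first),
            }
    return differences


old = {"person": {"buyer", "owner"}, "work": {"sales"}}
new = {"person": {"buyer", "employee"}, "work": {"sales"}, "place": {"Tokyo"}}

difference_buffer = diff_dictionaries(old, new)   # S602: create and keep the difference
print(difference_buffer)                          # S603: display it
```

Concepts whose expression sets are identical are omitted from the result, so the display shows only the concepts that actually differ.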
[0100]
FIG. 13 shows an example of a screen display at the time of processing of the difference detection unit 105.
[0101]
In this example, the upper part is provided with two areas for inputting a dictionary to be compared. The lower part of this area is an area for displaying the comparison result. Here, the comparison result for each concept is displayed.
[0102]
This makes it possible to easily designate two electronic dictionaries, detect a difference between the electronic dictionaries, and present the difference to the user in units of concepts.
[0103]
In the present embodiment described above, an expression extracted from document data can be registered in a desired concept in the electronic dictionary by a simple operation.
[0104]
In addition, the concept in the hierarchical electronic dictionary can be specified by a simple operation, and the expressions included in the specified concept can be easily extracted from the extraction results extracted by the electronic dictionary and specified. Therefore, the user can easily verify the quality of the electronic dictionary.
[0105]
Further, two electronic dictionaries can be easily specified, and a difference between the electronic dictionaries can be detected and presented to the user in units of concepts.
[0106]
In addition, necessary expressions can be selected and registered in the information extraction dictionary while referring to important expressions extracted from the document, and the dictionary can also be edited.
[0107]
It should be noted that the present invention is not limited to the above-described embodiment, and can be variously modified at the implementation stage without departing from the scope of the invention. Furthermore, the embodiment includes inventions at various stages, and various inventions can be extracted by appropriately combining a plurality of the disclosed constituent elements. For example, even if some constituent elements are deleted from all the constituent elements shown in the embodiment, as long as the problem described in the section on the problem to be solved by the invention can be solved and the effects described in the section on the effect of the invention are obtained, the configuration from which those constituent elements are deleted can be extracted as an invention.
[0108]
[Effect of the invention]
As described above, according to the present invention, it is possible to provide a support environment for creating and verifying an information extraction dictionary in which a user can set whether or not to perform morphological analysis in expression matching.
[Brief description of the drawings]
FIG. 1 is a diagram showing a configuration of a dictionary construction support device according to an embodiment of the present invention.
FIG. 2 is an example of a configuration of an information extraction dictionary 109 according to the embodiment.
FIG. 3 is a flowchart showing a processing operation of a control unit 101 in the embodiment.
FIG. 4 is a screen display example immediately after the control unit 101 according to the embodiment is started.
FIG. 5 is a flowchart showing a processing operation of the expression registration unit 102 in the embodiment.
FIG. 6 is a screen display example during processing of the expression registration unit 102 in the embodiment.
FIG. 7 is an exemplary view schematically showing registration processing of expressions in the embodiment.
FIG. 8 is a flowchart showing a processing operation of the dictionary verification unit 104 according to the embodiment.
FIG. 9 is an example of a screen display during processing of the dictionary verification unit 104 according to the embodiment.
FIG. 10 is an exemplary flowchart illustrating the processing operation of the dictionary editing unit 103 according to the embodiment.
FIG. 11 is a screen display example during processing of the dictionary editing unit 103 according to the embodiment.
FIG. 12 is a flowchart showing a processing operation of a difference detection unit 105 in the embodiment.
FIG. 13 is a screen display example during processing of the difference detection unit 105 in the embodiment.
[Explanation of symbols]
10 ... CPU
20 ... Input unit
30 ... Output unit
40 ... Memory
50 ... Storage device
100 ... Dictionary construction program
101 ... Control unit
102 ... Expression registration unit
102a ... Expression extraction unit
103 ... Dictionary editing unit
104 ... Dictionary verification unit
105 ... Difference detection unit
106 ... Condition setting unit
109 ... Information extraction dictionary
110 ... Internal data holding unit
110a ... Text buffer
110b ... Expression list buffer
110c ... Processing result buffer
110d ... Dictionary buffer
110e ... Difference buffer
121 ... Class table
122 ... Concept table
123 ... Expression table

Claims

1. A dictionary construction support device comprising:
dictionary storage means for storing an electronic dictionary in which a plurality of expressions are stored in association with concepts that are higher-level meanings common to those expressions;
expression storage means for storing expressions extracted from document data;
display means for simultaneously displaying at least some of the concepts from the electronic dictionary stored in the dictionary storage means and at least some of the extracted expressions stored in the expression storage means;
registration means for, upon receiving designation of one or more expressions from the expressions displayed by the display means and designation of one concept from the concepts displayed by the display means, additionally registering the designated expressions in the electronic dictionary in association with the designated concept; and
setting means for setting, for each concept, whether, when the expressions in the electronic dictionary stored in the dictionary storage means are collated with the document data, the document data is collated after being subjected to morphological analysis or collated without morphological analysis.

2. The dictionary construction support apparatus according to claim 1, wherein said setting means includes means for storing the setting as a collation condition in the electronic dictionary.

3. The dictionary construction support apparatus according to claim 2, wherein said setting means includes means for displaying the setting stored in said electronic dictionary.

4. A dictionary construction support device comprising:
dictionary storage means for storing an electronic dictionary in which a plurality of expressions are stored in association with concepts that are higher-level meanings common to those expressions;
expression storage means for storing expressions extracted from document data; and
display means for simultaneously displaying at least some of the concepts from the electronic dictionary stored in the dictionary storage means and at least some of the extracted expressions stored in the expression storage means,
wherein the display means includes display control means for changing, when displaying some of the extracted expressions, the display method according to the collation condition of whether or not morphological analysis is used at the time of collation.

5. A dictionary construction support method for a dictionary construction support device comprising:
dictionary storage means for storing an electronic dictionary in which a plurality of expressions are stored in association with concepts that are higher-level meanings common to those expressions;
expression storage means for storing expressions extracted from document data;
display means for simultaneously displaying at least some of the concepts from the electronic dictionary stored in the dictionary storage means and at least some of the extracted expressions stored in the expression storage means; and
registration means for, upon receiving designation of one or more expressions from the expressions displayed by the display means and designation of one concept from the concepts displayed by the display means, additionally registering the designated expressions in the electronic dictionary in association with the designated concept,
the method comprising a setting step of setting, for each concept, whether, when the expressions in the electronic dictionary stored in the dictionary storage means are collated with the document data, the document data is collated after being subjected to morphological analysis or collated without morphological analysis.

6. The dictionary construction support method according to claim 5, wherein the setting step includes a step of storing the setting as a collation condition in the electronic dictionary.

7. The dictionary construction support method according to claim 6, wherein the setting step includes a step of displaying the setting stored in the electronic dictionary.

8. A dictionary construction support method for a dictionary construction support device comprising:
dictionary storage means for storing an electronic dictionary in which a plurality of expressions are stored in association with concepts that are higher-level meanings common to those expressions;
expression storage means for storing expressions extracted from document data; and
display means for simultaneously displaying at least some of the concepts from the electronic dictionary stored in the dictionary storage means and at least some of the extracted expressions stored in the expression storage means,
the method comprising a display control step of changing, when some of the extracted expressions are displayed on the display means, the display method according to the collation condition of whether or not morphological analysis is used at the time of collation.

9. A program for causing a computer comprising:
dictionary storage means for storing an electronic dictionary in which a plurality of expressions are stored in association with concepts that are higher-level meanings common to those expressions;
expression storage means for storing expressions extracted from document data;
display means for simultaneously displaying at least some of the concepts from the electronic dictionary stored in the dictionary storage means and at least some of the extracted expressions stored in the expression storage means; and
registration means for, upon receiving designation of one or more expressions from the expressions displayed by the display means and designation of one concept from the concepts displayed by the display means, additionally registering the designated expressions in the electronic dictionary in association with the designated concept,
to function as setting means for setting, for each concept, whether, when the expressions in the electronic dictionary stored in the dictionary storage means are collated with the document data, the document data is collated after being subjected to morphological analysis or collated without morphological analysis.

10. The program according to claim 9, wherein the setting means includes means for storing the setting as a collation condition in the electronic dictionary.

11. The program according to claim 10, wherein the setting means includes means for displaying the setting stored in the electronic dictionary.

12. A program for causing a computer comprising:
dictionary storage means for storing an electronic dictionary in which a plurality of expressions are stored in association with concepts that are higher-level meanings common to those expressions;
expression storage means for storing expressions extracted from document data; and
display means for simultaneously displaying at least some of the concepts from the electronic dictionary stored in the dictionary storage means and at least some of the extracted expressions stored in the expression storage means,
to function as display control means for changing, when some of the extracted expressions are displayed on the display means, the display method according to the collation condition of whether or not morphological analysis is used at the time of collation.