JP2004361992A

JP2004361992A - Related word extracting device, related word extracting method, and program

Info

Publication number: JP2004361992A
Application number: JP2003155922A
Authority: JP
Inventors: Tsutomu Kobayashi; 勉小林; Yoshihisa Otake; 能久大嶽; Yukio Nakamoto; 幸夫中本; Hiroshi Yamazaki; 弘山崎; Takeshi Matsukuma; 剛松隈
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2003-05-30
Filing date: 2003-05-30
Publication date: 2004-12-24

Abstract

PROBLEM TO BE SOLVED: To properly extract related words in a certain field even when documents in other fields are not prepared, or when the related words do not necessarily co-occur in one document.

SOLUTION: An object field acquiring part 512 stores an object field input from an input device 3 in an object field storage buffer 521. A document reading part 514 reads documents from a document database 601 and retrieves a document whose field information matches the object field acquired by the object field acquiring part 512. The document having the matching field information is stored in a field temporary storage buffer 525, and morphological analysis is executed on the stored document. Related words are then extracted and tabulated from the result of the morphological analysis by referring to predetermined field-related word notation patterns that take into account the rhetorical expressions of documents having field information. Related words whose appearance frequency is low can be removed by applying a threshold as necessary.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、関連語抽出技術に係り、特に、分野情報が定まっている文書で構成される文書データベースから所定の分野に応じた関連語を抽出する関連語抽出装置、関連語抽出方法及びプログラムに関する。
【０００２】
【従来の技術】
関連語又は同義語は、文書の検索又は自動分類を行う際の有用な情報である。従来、関連語又は同義語を人手で体系的に編修する試みがある一方で、電子計算機を利用した機械処理によって関連語又は同義語を抽出する試みが為されている。
例えば、単語間の関連度として相互情報量を利用する手法がある（例えば、特許文献１）。相互情報量とは、着目している２つの単語が偶然ではなく当然に出現する程の関連強さを示した指標であり、この相互情報量が低いとき、当該２つの単語は偶然出現しているに過ぎないと解される。この特許文献１では、文書データベースに登録されている単語がクラスタリングされ、このクラスタリングにより関連語情報が作成される。ユーザにより入力された単語の関連語は、この関連語情報が参照され、当該関連度の高いクラスタ毎に提示されるものである。
【０００３】
また、相互情報量がいわゆる関連語を抽出することしかできない点に着目し、単語同士の関連度を特定するために言語的特徴を利用し、同義語、上位語、下位語といった関係まで特定するための手法が提示されている（例えば、特許文献２）。この特許文献２では、単語の共起情報を参照し、共起関係にある単語間の類似度を算出する。一方、単語の言語的特徴を参照し、単語間の類似度を算出する。これら双方の類似度を統合し、求めた類似度が所定のレベルより高いとき、類似関係にあるとして提示するものである。
【０００４】
【特許文献１】特開２００２−３２３９４号公報（第１０頁）
【特許文献２】特開２０００−２２２４２７号公報（第１２頁）
【発明が解決しようとする課題】
上記した従来技術において、共起する単語の頻度を使用する方法では、どのような文書にも出てくるような単語同士の関係を排除するのが難しい。また、相互情報量に基づく方法では、どのような文書にも出てくるような単語同士の相互情報量は低くなるように工夫されているが、この機構が意図どおりに機能するかどうかは、相互情報量を算出するのに使用する文書群の選択方法にかかっている。例えば、「発明」と「課題」という単語は相互に関連する概念と考えられるが、特許公報を用いて相互情報量を算出するとき、ほとんどの特許公報には「発明」および「課題」という単語が含まれているため、相互情報量は相対的に低くなってしまう傾向にある。「発明」と「課題」の相互情報量を高めるためには、これらの単語同士が共起していることが際立つような文書群を投入する必要がある。したがって、ある分野Ａにおける関連語を抽出しようとした場合、可能なら分野Ａとは違った分野Ｂ，分野Ｃ…などの文書を複合しないと、分野Ａの文書に頻出する特有の関連語を抽出することができないという危険性がある。さらに、ここに述べた従来技術に共通する問題として、関連語同士が共起していることが前提となっていることが挙げられる。上述した従来技術によれば、例えば、文書Ａで述べられている「コンピュータ」と、別の文書Ｂで述べられている「電子計算機」を結びつけることはできない。
【０００５】
そこで、本発明は上述した問題点を解決するためになされたものであり、ある分野の関連語を抽出するときに、他の分野の文書が用意できず、また、必ずしも関連語が一つの文書内に共起していない場合であっても、適切に関連語を抽出する関連語抽出装置、関連語抽出方法およびプログラムを提供することを目的とする。
【０００６】
【課題を解決するための手段】
上記目的を達成するために、本発明の関連語抽出装置は、複数の文書を有する文書データベースと、文書の分野情報を入力する入力手段と、入力された分野情報と一致する分野情報を有する文書を文書データベースから抽出する文書抽出手段と、文書抽出手段により抽出された文書を自然言語解析する自然言語解析手段と、文書データベース中の文書からこの文書の分野に関連する分野関連単語を抽出するための分野関連単語表記パターンを保持する分野関連単語表記パターン保持手段と、自然言語解析の結果に基づいて分野関連単語表記パターンを参照し、文書データベースから抽出した文書から分野関連単語を抽出する分野関連単語抽出手段と、抽出された分野関連単語の出現頻度を集計する分野関連単語集計手段とを具備することを特徴としている。
【０００７】
次に、本発明の関連語抽出方法は、複数の文書を有する文書データベースから、特定の分野の文書を抽出し、この抽出した文書から特定の分野に応じた関連語を抽出する関連語抽出方法であって、文書の分野情報を入力する入力ステップと、入力された分野情報と一致する分野情報を有する文書を文書データベースから抽出する文書抽出ステップと、文書データベース中の文書からこの文書の分野に関連する分野関連単語を抽出するための分野関連単語表記パターンを保持する分野関連単語表記パターン保持ステップと、自然言語解析の結果に基づいて分野関連単語表記パターンを参照し、文書データベースから抽出した文書から前記分野関連単語を抽出する分野関連単語抽出ステップと、抽出された分野関連単語の出現頻度を集計する分野関連単語集計ステップとを具備することを特徴としている。
【０００８】
さらに、本発明のプログラムは、複数の文書を有する文書データベースから、特定の分野の文書を抽出し、この抽出した文書から特定の分野に応じた関連語を抽出する関連語抽出装置に、文書の分野情報を入力する入力機能と、入力された分野情報と一致する分野情報を有する文書をデータベースから抽出する文書抽出ステップと、文書データベース中の文書からこの文書の分野に関連する分野関連単語を抽出するための分野関連単語表記パターンを保持する分野関連単語表記パターン保持機能と、自然言語解析の結果に基づいて分野関連単語表記パターンを参照し、文書データベースから抽出した文書から分野関連単語を抽出する分野関連単語抽出機能と、抽出された分野関連単語の出現頻度を集計する分野関連単語集計機能とを実現させることを特徴としている。
【０００９】
【発明の実施の形態】
以下、本発明における実施の形態について図面を参照して説明する。
本発明に係る関連語抽出装置は、予め磁気ディスク装置に複数の分野の文書群を蓄えておき、入力装置から所定の分野が入力されると、その分野情報と一致する文書を文書群から選定し、この選定した文書からその分野情報に関連する単語（「分野関連単語」という。以下において同じ。）を抽出して表示装置に表示を行う。磁気ディスク装置に格納された複数の分野の文書群のそれぞれの文書は、定型のフォーマットから成る分野情報を保持している。
【００１０】
次に、本発明における関連語抽出装置について図１乃至図９を参照して説明する。
図１は、本発明の実施の形態に係る関連語抽出装置の構成を示すブロック図である。
この関連語抽出装置は、入力された所定の分野に応じた文書を文書群から選出し、この選出された文書に見られる特定の修辞表現に着目し、この修辞表現に基づいて分野関連単語を抽出する。
関連語抽出装置１は、制御装置２、入力装置３、表示装置４、メモリ５及び磁気ディスク装置６から構成されており、各部は互いにバス７を介して接続されている。
【００１１】
制御装置２は、中央演算処理装置（ＣＰＵ）であり、磁気ディスク装置６内に格納されているＯＳ（オペレーティング・システム）、所定のプログラムを後記するプログラム部５１に読み出し、関連語抽出装置１全体の動作制御及び各装置間のデータ転送の処理を行なう。
入力装置３は、文字列、各種データ及び命令の入力が行なわれるものであり、キーボード、ＯＣＲ、ペン、マウス、タブレット又はタッチパネルからなる。
【００１２】
表示装置４は、入力装置３により入力されるデータ、関連語抽出装置１からユーザへの指示及び最終的に得られる関連語抽出結果などのデータを表示するものであり、例えばＣＲＴ又は液晶ディスプレイから構成される。
メモリ５は、図２に示すように、制御装置２が各種制御や処理を実行するために磁気ディスク装置６より所定のプログラムを読み出して記憶するためのプログラム部５１及び各処理の際に必要なデータを一時的に格納するバッファ部５２から構成されている。
【００１３】
図３は、関連語抽出装置１のメモリ５のバッファ部５２の構成を示したブロック図である。バッファ部５２は、対象分野格納バッファ５２１と、文書格納バッファ５２２と、形態素解析結果格納バッファ５２３と、分野関連単語表記パターン格納バッファ５２４と、分野一時格納バッファ５２５と、関連単語集計用バッファ５２６と、一時変数格納バッファ５２７とから構成されている。
対象分野格納バッファ５２１は、対象分野取得プログラム５１２が取得した分野情報を格納するためのものである。文書格納バッファ５２２は、文書読み出しプログラム５１４が磁気ディスク装置６から読み込んだ文書を格納するためのものである。形態素解析結果格納バッファ５２３は、形態素解析プログラム５１６が文書格納バッファ５２２に格納した文書に対して形態素解析した結果を格納するためのものである。分野関連単語表記パターン格納バッファ５２４は、分野関連単語表記パターン読み込みプログラム５１１が磁気ディスク装置６から読み出した分野関連単語表記パターン辞書６０３を格納するためのものである。分野一時格納バッファ５２５は、分野抽出プログラム５１５が文書格納バッファ５２２に格納された文書から文書の属する分野情報を抽出して格納するためのものである。関連単語集計用バッファ５２６は、分野関連単語集計プログラム５１８が分野関連単語の出現頻度を集計するときに使用するワーク領域であり、プログラムのループ用変数など一時的な変数を格納するためのものである。
【００１４】
磁気ディスク装置６は、図４に示すように、複数の文書群から構成される文書データベース６０１と、形態素解析プログラム５１６が形態素解析をする際に参照する形態素解析辞書６０２と、分野関連単語を抽出するためのパターンが格納された分野関連単語表記パターン辞書６０３と、ＯＳ（オペレーティング・システム）をはじめ、関連語抽出装置１を起動させるためのプログラム又は新規に作成されたデータが格納されるデータ格納領域６０４と、制御装置２が各種制御や処理を実行するためのプログラムが格納されている関連語抽出プログラム６０５とを記録している。
【００１５】
図５は、磁気ディスク装置６内の関連語抽出プログラム６０５に格納されている各種プログラムの構成を示したブロック図である。関連語抽出プログラム６０５は、分野関連単語表記パターン読み込みプログラム６０５１と、対象分野取得プログラム６０５２と、表示プログラム６０５３と、文書読み出しプログラム６０５４と、分野抽出プログラム６０５５と、形態素解析プログラム６０５６と、分野関連単語抽出プログラム６０５７と、分野関連単語集計プログラム６０５８と、初期化プログラム６０５９と、分野情報比較プログラム６０６０とから構成されている。
【００１６】
分野関連単語表記パターン読み込みプログラム６０５１は、磁気ディスク装置６に記憶された分野関連単語表記パターン辞書６０３をバッファ部５２に読み込むためのものである。対象分野取得プログラム６０５２は、分野関連単語を抽出するためにユーザにより入力された分野情報を取得するためのものである。表示プログラム６０５３は、入力装置３により入力されたデータ、関連語抽出装置１からユーザへの指示、及び抽出された分野関連単語を表示部４に表示するためのものである。文書読み出しプログラム６０５４は、磁気ディスク装置６に格納されている文書群のうちの一文書をバッファ部５２に読み込むためのものである。分野抽出プログラム６０５５は、バッファ部５２に格納された文書を読み込むことにより文書の属する分野情報を抽出するためのものである。形態素解析プログラム６０５６は、磁気ディスク装置６に格納された形態素解析辞書６０２を参照してバッファ部５２に格納された文書を形態素解析するためのものである。分野関連単語抽出プログラム６０５７は、バッファ部５２に読み込まれた分野関連単語表記パターンを参照してバッファ部５２に読み込まれた文書から分野関連単語を抽出するためのものである。分野関連単語集計プログラム６０５８は、抽出された分野関連単語の出現頻度を集計するためのものである。初期化プログラム６０５９は、関連語抽出装置１の電源が投入されるときに各装置の設定状態を初期化するためのものである。分野情報比較プログラム６０６０は、入力装置３を介して取得した対象分野と分野抽出プログラム６０５５に従って取得した分野情報とを比較するためのものである。
【００１７】
図６は、磁気ディスク装置６の文書データベース６０１を構成する文書群の一例を示す図である。本発明における実施の形態で使用される文書群の各文書は、図６に示すように「分野：」に続いて文書の属する分野が記述されている。本発明の実施の形態では図６に示すように分野情報とその分野情報に応じたテキスト情報との構造を有する文書例を示したが、実際に指定された形式にしたがって技術分野を特定している文書は多い。例えば、特許公報は国際特許分類、ＦＩ又はＦターム等の技術分類が付与されている。また、企業が保持する社内の技術文書も、後日の参照の便を向上させるために独自の技術分類を付与してもよいと思われる。文書データベース６０１の各文書の有する分野情報には、特許公報のように、一つの分類コードが大分類、中分類、小分類のように階層化されていてもよいとし、図６中の文書１においては、「分野：プログラム→サブルーチン定義」という表現で、「大分類」として「プログラム」、「中分類」として「サブルーチン定義」という意味を有する。
【００１８】
図７は、磁気ディスク装置６の形態素解析辞書６０２が保持する論理的情報の一例を示す図である。形態素解析辞書は検索効率を向上させるために複雑なデータ構造を持っていることが一般であるが、ここでは簡略化して論理的な情報のみを示している。図７に示した形態素解析辞書６０２は、単語の見出し、読み、品詞の３種類の情報が保持されている。一般的な形態素解析では、細分化された品詞や属性あるいは精緻な接続文法を用いるものもあるが、本発明の実施の形態で参照している形態素解析は単純化したものを使用しており、品詞の接続情報は形態素解析処理に組み込まれているものとする。本発明の実施の形態で用いている形態素解析をより一般的で精度の良いもので置き換えることも可能である。
【００１９】
図８は、分野関連単語表記パターン辞書６０３の一例を示す図である。（１）から（５）までの５つの分野関連単語表記パターンがあるが、それぞれのパターンにおいて『』で示されているのはプレースホルダである。このプレースホルダとは、いわゆるワイルドカードであり、制御装置２は、分野関連単語抽出プログラム６０５７に従い、この『』で囲まれる所定のバイト列（桁数は特に問わない。）を分野関連単語として抽出する。
【００２０】
図９は、分野関連単語集計用バッファ５２６の一例を示す図である。この分野関連単語集計用バッファ５２６は、項番、単語及び頻度で構成されるものであり、後記する関連語集計処理により分野関連単語として抽出された単語が項番毎に割り振られ、文書中の出現頻度とともに格納される。
次に、関連語抽出装置１の動作について図１０乃至図１３を参照して説明する。
図１０は、関連語抽出装置１の電源が投入されてから分野関連単語を抽出して終了するまでの処理（具体的には、対象分野取得処理、関連語集計処理及び関連語抽出処理）を体系的に説明したフローチャートである。図１１は、図１０に示したフローチャートにおける対象分野取得処理（Ｓ３）について説明するフローチャートであり、図１２は、図１０に示したフローチャートにおける関連語集計処理（Ｓ５）について説明するフローチャートである。図１３は、図１２に示した関連語集計処理によって集計された分野関連単語の候補となる対象語から最終的に分野関連単語として抽出する動作を説明するフローチャートである。
【００２１】
関連語抽出装置１の動作が開始すると、制御装置２は、関連語抽出プログラム６０５から各々のプログラムを読み取って適宜にメモリ５のプログラム部５１に記憶した後、そのプログラムに従って所定の処理を実行する。
即ち、図１０において、関連語抽出装置１の電源が投入されると、ブートストラップの起動処理が実行され、図１０に示す処理を実行するプログラムが、関連語抽出プログラム６０５からメモリ５中のプログラム部５１にロードされた後に実行される。この処理では、制御装置２は、初期化プログラム６０５９に従い、入力装置３や表示装置４等の各種デバイスの設定状態を初期化する（Ｓ１）。続いて、分野関連単語表記パターン読み込みプログラム６０５１に従い、磁気ディスク装置６の分野関連単語表記パターン辞書６０３を読み込み、そして、分野関連単語表記パターン格納バッファ５２４に格納する（Ｓ２）。このあと、制御装置２は、対象分野取得処理に入る（Ｓ３）。制御装置２は、対象分野取得処理を終了（詳しくは後記する）しない限り（Ｓ４のＮｏ）、関連語集計処理の実行に入る（Ｓ５）。また、関連語抽出装置１は、対象分野取得処理を終了するとき（Ｓ４のＹｅｓ）、システム上の情報などメモリ５上にあるデータをデータ格納領域６０４に格納する等のシャットダウンを経てこのまま終了する。
【００２２】
次に、対象分野取得処理について図１１を参照して説明する。
図１１において、制御装置２は、対象分野取得プログラム６０５２に従い、分野関連単語を求めるために必要な対象分野を関連語抽出装置１の入力装置３を介して取得する（Ｓ３０１）。ここで、制御装置２は、対象分野取得プログラム６０５２に従い、入力装置３から対象分野取得処理の終了を示すファンクション（例えば、ユーザからウィンドウ上のクローズボタンが押下されたという処理に相当する処理）が送られたか否かを判定する（Ｓ３０２）。終了でない限り（Ｓ３０２のＮｏ）、制御装置２は、取得した対象分野を対象分野格納バッファ５２１に格納する（Ｓ３０３）。終了であれば（Ｓ３０２のＹｅｓ）、制御装置２は、終了である値（例えば、プログラムに書き込む文字列の終端を表わす値であるバイナリ０）を対象分野格納バッファ５２１に格納し（Ｓ３０４）、コール元にリターンする（Ｓ４へ）。制御装置２は、対象分野取得処理の終了である値を対象分野格納バッファ５２１に格納したとき（Ｓ３０４）、図１０において、Ｓ４の判定で終了との判定をし（Ｓ４のＹｅｓ）、システム上の情報などメモリ５上にあるデータをデータ格納領域６０４に格納する等のシャットダウンを経てこのまま終了する。以下、制御装置２がプログラム部５１に記憶した対象分野取得プログラム６０５２に従って取得した対象分野は「プリンタ技術」であるとして説明する。
【００２３】
次に、関連語集計処理について図１２を参照して説明する。
図１２において、制御装置２は、文書読み出しプログラム６０５４に従い、磁気ディスク装置６の文書データベース６０１に格納された文書から一文書を読み込んで文書格納バッファ５２２に格納する（Ｓ５０１）。このあと、制御装置２は、分野抽出プログラム６０５５に従い、文書格納バッファ５２２に格納した文書に対し、各々の文書の属する分野情報を決定し、分野一時格納バッファ５２５に格納する（Ｓ５０２）。上記したように、図６に示した本発明の実施の形態で扱う文書データベース６０１内の文書は、すべて先頭にある文字列「分野：」に引き続いて文書の属する分野情報が記述されているものとするため、例えば、図６の文書Ｎの場合「プリンタ技術」という分野情報が抽出され、この分野情報は、分野一時格納バッファ５２５に格納される。
【００２４】
次に、制御装置２は、プログラム部５１に記憶した分野情報比較プログラム６０６０に従い、対象分野格納バッファ５２１に格納された対象分野と分野一時格納バッファ５２５に格納された分野情報を比較する（Ｓ５０３）。対象分野格納バッファ５２１に格納された対象分野と分野一時格納バッファ５２５に格納された分野情報とが異なるものであると判定されたとき（Ｓ５０３のＮｏ）、制御装置２は、文書読み出しプログラム６０５４に従い、Ｓ５０１で読み出した文書データベース６０１内の文書と別の文書を読み込んで文書格納バッファ５２２に格納し、対象分野格納バッファ５２１に格納された対象分野と分野一時格納バッファ５２５に格納された分野情報とが一致すると判定するまで繰り返し文書データベース６０１内の文書を読み出す（Ｓ５０７のＹｅｓ、Ｓ５０１へ）。
【００２５】
制御装置２は、対象分野格納バッファ５２１に格納された対象分野と分野一時格納バッファ５２５に格納された分野情報とが一致するものであると判定したとき（Ｓ５０３のＹｅｓ）、形態素解析プログラム６０５６に従い、このときにおける文書格納バッファ５２２に格納されている文書を形態素解析し、この形態素解析の結果を形態素解析結果格納バッファ５２３に格納する（Ｓ５０４）。制御装置２が実行する形態素解析では、図７に示した形態素解析辞書６０２の情報が参照される。本発明の実施の形態における形態素解析は、合成語処理機能を有しているものとする。この合成語処理機能とは、複数の単語の合成を新たな単語として認定する処理のことを表わす。
【００２６】
例えば、「名詞」＋「名詞」＝「名詞」のように、名詞が２つ連続しているとき、この２つの単語を新たな名詞として認定し、又は「接頭」＋「名詞」＝「名詞」のように、接頭に続いて名詞が出現するとき、この２つの単語を新たな名詞として認定する等の合成語作成ルールに基づき、制御装置２が形態素解析プログラム６０５６に従い、合成語を認定する処理のことである。例えば、図６の文書Ｎ−２を形態素に区切ると、次のようになる（「／」は単語の切れ目を表すものとする）。インクジェット（名詞）／方式（名詞）／の（格助詞）／印刷（名詞）／装置（名詞）／に（助詞）おいて（名詞）。さらに、合成語処理を行うと名詞と名詞の連続を新たな名詞として認定し、次のようになる。インクジェット方式（名詞）／の（格助詞）／印刷装置（名詞）／に（助詞）おいて（名詞）。
【００２７】
形態素解析処理の後、制御装置２は、分野関連単語抽出プログラム６０５７に従い、形態素解析結果格納バッファ５２３に格納された形態素解析の結果に基づき、分野関連単語表記パターン格納バッファ５２４に格納された分野関連単語表記パターンを参照して分野関連単語を抽出する（Ｓ５０５）。上記の例では、図８（５）の分野関連単語表記パターン（『』において）と一致し、制御装置２は、分野関連単語抽出プログラム６０５７に従い、「印刷装置」を分野関連単語として抽出する。分野関連単語が抽出されると、制御装置２は、分野関連単語集計プログラム６０５８に従い、図９に例示したように、分野関連単語集計用バッファ５２６に分野関連単語及びその分野関連単語が文書中に出現する頻度を集計し、その集計結果を格納する（Ｓ５０６）。具体的には、分野関連単語が抽出されるごとに分野関連単語の出現頻度をインクリメントする。
【００２８】
次に、制御装置２は、文書格納バッファ５２２に格納した文書を参照し、文書データベース６０１に格納されている総ての文書を読み出したか否かを判定し（Ｓ５０７）、読み込んでいない未処理の文書があるとき（Ｓ５０７のＹｅｓ）、引き続き処理を実行する（Ｓ５０１へ）。読み込んでいない未処理の文書がないとき（Ｓ５０７のＮｏ）、後記する関連語抽出処理に進み（Ｓ５０８）、コール元にリターンする。
【００２９】
次に、関連語抽出処理を図１３を用いて説明する。
ここまでの処理で分野関連単語集計用バッファ５２６の内容が、図９の状態になっているものとする。
まず、制御装置２は、分野関連単語抽出プログラム６０５７に従い、一時変数格納バッファ５２７で、図９における頻度の合計値を求め、一時的に設定した変数ＳＵＭに代入する（Ｓ５０８１）。図９の例では、１８１＋１６０＋５４＋４４＋１２０＋８＋５＋５４を計算した値６２６が変数ＳＵＭに代入される。関連語抽出装置１には、分野関連単語として抽出された結果の適性を高めるために、文書中の出現頻度が低い分野関連単語をノイズとして除去するための閾値が設定されている。仮にこの閾値を変数ＳＵＭの５％とし、図９における頻度が５％未満の単語を除外するものとする（Ｓ５０８２）。ここで、６２６の５％は３１．３であるため、図９における頻度が３１．３未満である「インクリボン」および「方式」が除外される。次に、残った単語を関連語抽出装置１の求める分野関連単語として表示装置４に表示する（Ｓ５０８３）。ここでは、「プリンタ」、「プリンター」、「印刷装置」、「印字装置」、「インクジェットプリンタ」及び「熱転写プリンタ」が分野関連単語として表示される。
【００３０】
このように、本発明の実施の形態においては、特許明細書のように、分野ごとに詳細に分類された文書中においてよく見られる特徴的な修辞表現に着目し、ユーザが求める対象分野と同一の分野情報を有する文書から、この特徴的な修辞表現に基づいて分野関連単語を抽出するものである。
また、抽出した分野関連単語の適性を高めるために、関連語抽出装置１は、文書中の出現頻度が低い分野関連単語を除去するための閾値が設定されており、この閾値より低いと算出された関連語を除去する。
【００３１】
なお、本発明は、上記実施の形態に限定されるものでなく、その要旨を逸脱しない範囲で種々変形して実施できる。例えば、磁気ディスク装置６内の文書データベース６０１内の文書を追加することは可能である。また、本発明の実施の形態においては、文書データベース６０１内の各文書の分野情報は定まっているが、分野情報が定まっていないものであってもよい。この場合、所定の文書から分野情報を抽出する既存の技術を利用し、この分野抽出技術と本発明の関連語抽出技術とを組み合わせることにより、より汎用性の高い関連語抽出技術となり得る。
【００３２】
また、上記実施の形態においては、入力装置３に入力された対象分野と同一の分野を有する文書データベース６０１内の文書を形態素解析することにより、分野関連語表記パターンと照らし合わせたが、形態素解析ではなく、他の自然言語解析を用いてもよい。例えば、構文解析又は意味解析等の自然言語解析を使用してもよい。
【００３３】
【発明の効果】
以上説明したように本発明によれば、文書に出現する単語の共起情報に依存することなく、求める分野情報と同一の分野情報を有する文書を選出し、この選出された文書に見られる所定の修辞表現に基づいて分野関連単語を抽出する関連語抽出装置を提供することができる。
【図面の簡単な説明】
【図１】関連語抽出装置の構成を示すブロック図。
【図２】メモリ５の内部構成を示す図。
【図３】メモリ５のバッファ部５２の構成を示すブロック図。
【図４】磁気ディスク装置６の構成を示すブロック図。
【図５】関連語抽出プログラム６０５の構成を示すブロック図。
【図６】文書データベース６０１を構成する文書群の一例を示す図。
【図７】形態素解析辞書６０２の保持する論理的情報の一例を示す図。
【図８】分野関連単語表記パターン辞書６０３の一例を示す図。
【図９】分野関連単語集計用バッファ５２６の一例を示す図。
【図１０】関連語抽出装置１全体の動作を体系的に説明するフローチャート。
【図１１】対象分野取得処理について説明するフローチャート。
【図１２】関連語集計処理について説明するフローチャート。
【図１３】関連語抽出処理について説明するフローチャート。
【符号の説明】
１・・・関連語抽出装置
２・・・制御装置
３・・・入力装置
４・・・表示装置
５・・・メモリ
６・・・磁気ディスク装置
７・・・バス
５１・・・プログラム部
５２・・・バッファ部
５２１・・・対象分野格納バッファ
５２２・・・文書格納バッファ
５２３・・・形態素解析結果格納バッファ
５２４・・・分野関連単語表記パターン格納バッファ
５２５・・・分野一時格納バッファ
５２６・・・分野関連単語集計用バッファ
５２７・・・一時変数格納バッファ
６０１・・・文書データベース
６０２・・・形態素解析辞書
６０３・・・分野関連単語表記パターン辞書
６０４・・・データ格納領域
６０５・・・関連語抽出プログラム
６０５１・・・分野関連単語表記パターン読み込みプログラム
６０５２・・・対象分野取得プログラム
６０５３・・・表示プログラム
６０５４・・・文書読み出しプログラム
６０５５・・・分野抽出プログラム
６０５６・・・形態素解析プログラム
６０５７・・・分野関連単語抽出プログラム
６０５８・・・分野関連単語集計プログラム
６０５９・・・初期化プログラム
６０６０・・・分野情報比較プログラム

[0001]
[Technical Field of the Invention]
The present invention relates to related word extraction technology, and more particularly to a related word extraction device, a related word extraction method, and a program for extracting related words for a predetermined field from a document database composed of documents whose field information is defined.
[0002]
[Prior art]
Related words or synonyms are useful information for performing document search or automatic classification. Conventionally, while there has been an attempt to systematically edit related words or synonyms by hand, an attempt has been made to extract related words or synonyms by mechanical processing using an electronic computer.
For example, there is a method that uses mutual information as the degree of association between words (for example, Patent Document 1). Mutual information is an index of how strongly two words of interest are related, that is, of whether they appear together as a matter of course rather than by chance; when the mutual information is low, the two words are understood to appear together merely by chance. In Patent Document 1, the words registered in a document database are clustered, and related word information is created from this clustering. Related words for a word input by the user are presented, by referring to this related word information, for each cluster having a high degree of relatedness.
[0003]
Further, noting that mutual information can only extract so-called related words, a method has been proposed that uses linguistic features to determine the degree of relatedness between words and thereby identifies relationships such as synonyms, hypernyms, and hyponyms (for example, Patent Document 2). In Patent Document 2, the similarity between words in a co-occurrence relationship is calculated by referring to word co-occurrence information. Separately, the similarity between words is calculated by referring to the linguistic features of the words. The two similarities are integrated, and when the resulting similarity is higher than a predetermined level, the words are presented as being in a similarity relationship.
[0004]
[Patent Document 1] JP-A-2002-32394 (page 10)
[Patent Document 2] JP-A-2000-222427 (page 12)
[Problems to be solved by the invention]
In the prior art described above, methods that use the frequency of co-occurring words have difficulty excluding relationships between words that appear in any kind of document. Methods based on mutual information are designed so that the mutual information between words that appear in any kind of document is low, but whether this mechanism works as intended depends on how the document group used to calculate the mutual information is selected. For example, the words "invention" and "problem" are considered to be mutually related concepts, but when mutual information is calculated from patent publications, it tends to be relatively low because almost every patent publication contains both words. To raise the mutual information between "invention" and "problem", it is necessary to supply a document group in which the co-occurrence of these words stands out. Therefore, when attempting to extract related words in a certain field A, there is a risk that the characteristic related words that appear frequently in documents of field A cannot be extracted unless documents of different fields, such as field B and field C, are combined with them where possible. A further problem common to the prior art described here is that it presupposes that related words co-occur. With the prior art described above, for example, "computer" used in document A cannot be linked to "electronic computer" used in a separate document B.
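The "invention" and "problem" effect can be reproduced with a small calculation. The sketch below assumes the common pointwise, document-level form of mutual information, log2(P(x, y) / (P(x) P(y))); the cited prior art may define the measure differently. The function name `pmi` and both toy document sets are invented for illustration.

```python
import math
from typing import List, Set


def pmi(docs: List[Set[str]], x: str, y: str) -> float:
    """Pointwise mutual information of two words, counted per document."""
    n = len(docs)
    p_x = sum(x in d for d in docs) / n
    p_y = sum(y in d for d in docs) / n
    p_xy = sum(x in d and y in d for d in docs) / n
    if p_xy == 0:
        return float("-inf")  # the words never co-occur
    return math.log2(p_xy / (p_x * p_y))


# Hypothetical corpus A: patent publications only, so both words are everywhere.
patent_only = [
    {"invention", "problem", "printer"},
    {"invention", "problem", "engine"},
    {"invention", "problem", "antenna"},
    {"invention", "problem", "valve"},
]

# Hypothetical corpus B: the same kind of documents mixed with other fields.
mixed = patent_only[:2] + [{"recipe", "flour"}, {"match", "goal"}]

score_patent_only = pmi(patent_only, "invention", "problem")
score_mixed = pmi(mixed, "invention", "problem")
```

With only patent documents the score is zero, because each word alone already predicts the other perfectly; it rises only once documents from other fields are mixed in, which is the dependency on corpus selection that the passage describes.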
[0005]
The present invention has therefore been made to solve the problems described above, and its object is to provide a related word extraction device, a related word extraction method, and a program that appropriately extract related words in a certain field even when documents of other fields cannot be prepared and the related words do not necessarily co-occur within a single document.
[0006]
[Means for Solving the Problems]
In order to achieve the above object, a related word extraction device according to the present invention comprises: a document database having a plurality of documents; input means for inputting field information of a document; document extraction means for extracting, from the document database, a document having field information that matches the input field information; natural language analysis means for performing natural language analysis on the document extracted by the document extraction means; field-related word notation pattern holding means for holding field-related word notation patterns used to extract, from a document in the document database, field-related words related to the field of that document; field-related word extraction means for extracting field-related words from the document extracted from the document database by referring to the field-related word notation patterns on the basis of the result of the natural language analysis; and field-related word counting means for counting the appearance frequency of the extracted field-related words.
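Read as a data flow, the means listed above chain together roughly as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: documents are modelled as (field, text) pairs, and `analyze` and `match_patterns` are hypothetical stand-ins for the natural language analysis means and the pattern-based extraction means. The toy analyzer and toy pattern in the usage part are invented for illustration.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple

Document = Tuple[str, str]  # (field information, body text); assumed layout


def count_field_related_words(
    documents: Iterable[Document],
    target_field: str,
    analyze: Callable[[str], List[str]],
    match_patterns: Callable[[List[str]], List[str]],
) -> Counter:
    """Select documents of the target field, extract words, tally frequencies."""
    counts: Counter = Counter()
    for field, body in documents:
        if field != target_field:      # document extraction: field must match
            continue
        tokens = analyze(body)         # natural language analysis
        counts.update(match_patterns(tokens))  # extraction plus counting
    return counts


# Toy stand-ins (hypothetical): whitespace tokenizer, and a pattern that takes
# the word immediately before the literal token "apparatus".
def toy_analyze(text: str) -> List[str]:
    return text.split()


def toy_match(tokens: List[str]) -> List[str]:
    return [tokens[i - 1] for i in range(1, len(tokens)) if tokens[i] == "apparatus"]


docs = [
    ("printer technology", "an inkjet apparatus and a thermal apparatus"),
    ("programming", "a subroutine apparatus"),
]
counts = count_field_related_words(docs, "printer technology", toy_analyze, toy_match)
```

Only the document whose field equals the target contributes to the tally, so no documents from other fields are needed as a contrast set.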
[0007]
Next, a related word extraction method according to the present invention is a method of extracting documents of a specific field from a document database having a plurality of documents and extracting related words for the specific field from the extracted documents, the method comprising: an input step of inputting field information of a document; a document extraction step of extracting, from the document database, a document having field information that matches the input field information; a field-related word notation pattern holding step of holding field-related word notation patterns used to extract, from a document in the document database, field-related words related to the field of that document; a field-related word extraction step of extracting the field-related words from the document extracted from the document database by referring to the field-related word notation patterns on the basis of the result of natural language analysis; and a field-related word counting step of counting the appearance frequency of the extracted field-related words.
[0008]
Further, a program according to the present invention causes a related word extraction device, which extracts documents of a specific field from a document database having a plurality of documents and extracts related words for the specific field from the extracted documents, to realize: an input function of inputting field information of a document; a document extraction step of extracting, from the database, a document having field information that matches the input field information; a field-related word notation pattern holding function of holding field-related word notation patterns used to extract, from a document in the document database, field-related words related to the field of that document; a field-related word extraction function of extracting field-related words from the document extracted from the document database by referring to the field-related word notation patterns on the basis of the result of natural language analysis; and a field-related word counting function of counting the appearance frequency of the extracted field-related words.
[0009]
[Embodiments of the Invention]
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
The related-word extraction device according to the present invention stores a document group of a plurality of fields in a magnetic disk device in advance, and when a predetermined field is input from an input device, selects a document matching the field information from the document group. Then, words related to the field information (hereinafter referred to as "field-related words"; the same applies hereinafter) are extracted from the selected document and displayed on the display device. Each document in the document group of a plurality of fields stored in the magnetic disk device holds field information in a fixed format.
[0010]
Next, the related word extraction device according to the present invention will be described with reference to FIGS. 1 to 9.
FIG. 1 is a block diagram showing a configuration of a related word extraction device according to an embodiment of the present invention.
This related word extraction device selects, from a document group, documents corresponding to an input predetermined field, focuses on specific rhetorical expressions found in the selected documents, and extracts field-related words on the basis of those rhetorical expressions.
The related word extraction device 1 includes a control device 2, an input device 3, a display device 4, a memory 5, and a magnetic disk device 6, and each unit is connected to each other via a bus 7.
[0011]
The control device 2 is a central processing unit (CPU). It reads the OS (operating system) and predetermined programs stored in the magnetic disk device 6 into a program unit 51 described later, and performs operation control of the entire related word extraction device 1 and data transfer between the devices.
The input device 3 is used to input a character string, various data, and a command, and includes a keyboard, an OCR, a pen, a mouse, a tablet, or a touch panel.
[0012]
The display device 4 displays data such as data input through the input device 3, instructions from the related word extraction device 1 to the user, and the finally obtained related word extraction result, and is composed of, for example, a CRT or a liquid crystal display.
As shown in FIG. 2, the memory 5 is composed of a program unit 51, into which the control device 2 reads predetermined programs from the magnetic disk device 6 and stores them in order to execute various controls and processes, and a buffer unit 52, which temporarily stores the data needed during each process.
[0013]
FIG. 3 is a block diagram showing a configuration of the buffer unit 52 of the memory 5 of the related word extraction device 1. The buffer unit 52 includes a target field storage buffer 521, a document storage buffer 522, a morphological analysis result storage buffer 523, a field related word notation pattern storage buffer 524, a field temporary storage buffer 525, and a related word counting buffer 526. , And a temporary variable storage buffer 527.
The target field storage buffer 521 is for storing the field information acquired by the target field acquisition program 512. The document storage buffer 522 is for storing the document read from the magnetic disk device 6 by the document reading program 514. The morphological analysis result storage buffer 523 is for storing the result of the morphological analysis performed by the morphological analysis program 516 on the document stored in the document storage buffer 522. The field-related word notation pattern storage buffer 524 is for storing the field-related word notation pattern dictionary 603 read from the magnetic disk device 6 by the field-related word notation pattern reading program 511. The field temporary storage buffer 525 is for storing the field information that the field extraction program 515 extracts from the document stored in the document storage buffer 522. The related word counting buffer 526 is a work area used when the field-related word counting program 518 counts the appearance frequency of the field-related words, and is for storing temporary variables such as program loop variables.
[0014]
As shown in FIG. 4, the magnetic disk device 6 records a document database 601 composed of a plurality of document groups; a morphological analysis dictionary 602 that the morphological analysis program 516 refers to when performing morphological analysis; a field-related word notation pattern dictionary 603 in which the patterns for extracting field-related words are stored; a data storage area 604 in which the OS (operating system), the programs for starting the related word extraction device 1, and newly created data are stored; and a related word extraction program 605 in which the programs for the control device 2 to execute various controls and processes are stored.
[0015]
FIG. 5 is a block diagram showing the configuration of various programs stored in the related word extraction program 605 in the magnetic disk device 6. The related word extraction program 605 includes a field related word notation pattern reading program 6051, a target field acquisition program 6052, a display program 6053, a document reading program 6054, a field extraction program 6055, a morphological analysis program 6056, and a field related word. It comprises an extraction program 6057, a field-related word counting program 6058, an initialization program 6059, and a field information comparison program 6060.
[0016]
The field-related word notation pattern reading program 6051 is for reading the field-related word notation pattern dictionary 603 stored in the magnetic disk device 6 into the buffer unit 52. The target field acquisition program 6052 is for acquiring the field information input by the user in order to extract field-related words. The display program 6053 is for displaying, on the display device 4, data input through the input device 3, instructions from the related word extraction device 1 to the user, and the extracted field-related words. The document reading program 6054 is for reading one document from the document group stored in the magnetic disk device 6 into the buffer unit 52. The field extraction program 6055 is for extracting the field information to which a document belongs by reading the document stored in the buffer unit 52. The morphological analysis program 6056 is for performing morphological analysis on the document stored in the buffer unit 52 by referring to the morphological analysis dictionary 602 stored in the magnetic disk device 6. The field-related word extraction program 6057 is for extracting field-related words from the document read into the buffer unit 52 by referring to the field-related word notation patterns read into the buffer unit 52. The field-related word counting program 6058 is for counting the appearance frequency of the extracted field-related words. The initialization program 6059 is for initializing the settings of each device when the related word extraction device 1 is powered on. The field information comparison program 6060 is for comparing the target field acquired through the input device 3 with the field information acquired by the field extraction program 6055.
[0017]
FIG. 6 is a diagram showing an example of the document group constituting the document database 601 of the magnetic disk device 6. As shown in FIG. 6, each document in the document group used in the embodiment of the present invention states the field to which it belongs after the string "field:". The embodiment uses, as shown in FIG. 6, an example document structured as field information plus the text information for that field, but in practice many documents identify their technical field according to a prescribed format. For example, patent publications are assigned technical classifications such as the International Patent Classification, FI, or F-terms. In-house technical documents held by a company may likewise be given the company's own technical classification to make later reference easier. The field information of each document in the document database 601 may be a single classification code arranged hierarchically into major, middle, and minor categories, as in patent publications. In document 1 in FIG. 6, the expression "field: program → subroutine definition" means "program" as the major category and "subroutine definition" as the middle category.
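A reader for this kind of header could look like the sketch below. It assumes that the field header occupies the first line of the document and that hierarchy levels are separated by "→", as in the document 1 example; the exact layout of FIG. 6 is not reproduced here, and the names `parse_document`, `FIELD_PREFIXES`, and `LEVEL_SEPARATOR` are hypothetical.

```python
from typing import List, Tuple

# Header markers: the original Japanese string and its translated form.
FIELD_PREFIXES = ("分野：", "field:")
LEVEL_SEPARATOR = "→"  # separates major / middle / minor categories


def parse_document(raw: str) -> Tuple[List[str], str]:
    """Split a raw document into its field hierarchy and its body text."""
    header, _, body = raw.partition("\n")
    for prefix in FIELD_PREFIXES:
        if header.startswith(prefix):
            levels = [part.strip() for part in header[len(prefix):].split(LEVEL_SEPARATOR)]
            return levels, body
    raise ValueError("document does not start with a field header")


levels, body = parse_document("分野：プログラム→サブルーチン定義\n本文")
flat_levels, flat_body = parse_document("field: printer technology\nbody text")
```

Returning the hierarchy as a list keeps the major and middle categories separately addressable, so a caller can decide whether to compare the full path or only the top level.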
[0018]
FIG. 7 is a diagram showing an example of the logical information held by the morphological analysis dictionary 602 of the magnetic disk device 6. A morphological analysis dictionary generally has a complex data structure to improve lookup efficiency, but only the logical information is shown here in simplified form. The morphological analysis dictionary 602 shown in FIG. 7 holds three kinds of information for each word: headword, reading, and part of speech. Some general-purpose morphological analyzers use finely subdivided parts of speech and attributes or an elaborate connection grammar, but the morphological analysis referred to in the embodiment of the present invention is a simplified one, and the part-of-speech connection information is assumed to be built into the morphological analysis processing. The morphological analysis used in the embodiment can also be replaced with a more general and more accurate one.
[0019]
FIG. 8 is a diagram showing an example of the field-related word notation pattern dictionary 603. There are five field-related word notation patterns, (1) to (5), and in each pattern the part shown as 『』 is a placeholder. The placeholder is a so-called wildcard: in accordance with the field-related word extraction program 6057, the control device 2 extracts the byte string enclosed by 『』 (its length is not restricted) as a field-related word.
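Placeholder matching of this kind can be sketched over the output of morphological analysis. The example uses pattern (5), which the Japanese description at paragraph 【００２７】 quotes as 『』において, together with the segmented sentence from paragraph 【００２６】 after compound word processing. Everything else is an assumption: the (prefix, suffix) pattern representation, the simplified part-of-speech tags, the rule that the placeholder must be filled by a single noun token, and the name `extract_field_words`. The other four patterns of FIG. 8 are not quoted in the text and are therefore omitted.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface form, simplified part-of-speech tag)

# Each pattern is the literal text before and after the 『』 placeholder.
PATTERNS: List[Tuple[str, str]] = [
    ("", "において"),  # pattern (5): 『』において
]


def extract_field_words(tokens: List[Token],
                        patterns: List[Tuple[str, str]] = PATTERNS) -> List[str]:
    """Return every noun token whose surrounding text matches a pattern."""
    found = []
    for i, (surface, pos) in enumerate(tokens):
        if pos != "noun":
            continue  # assumption: the placeholder is filled by one noun token
        before = "".join(s for s, _ in tokens[:i])
        after = "".join(s for s, _ in tokens[i + 1:])
        for prefix, suffix in patterns:
            if before.endswith(prefix) and after.startswith(suffix):
                found.append(surface)
                break
    return found


# Segmentation of the sample phrase after compound word processing.
tokens = [
    ("インクジェット方式", "noun"),
    ("の", "particle"),
    ("印刷装置", "noun"),
    ("に", "particle"),
    ("おいて", "other"),
]
words = extract_field_words(tokens)
```

Matching on tokens rather than on the raw string is what stops the wildcard from swallowing the whole phrase: only the noun directly in front of において is taken, which agrees with the description's result of 印刷装置.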
[0020]
FIG. 9 is a diagram showing an example of the field-related word counting buffer 526. The field-related word counting buffer 526 consists of an item number, a word, and a frequency; each word extracted as a field-related word by the related word counting process described later is assigned an item number and stored together with its appearance frequency in the documents.
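The buffer behaves like a frequency table that is later pruned. The sketch below uses the eight FIG. 9 frequencies listed in paragraph 【００２９】 of the Japanese description (181, 160, 54, 44, 120, 8, 5, 54) and the 5% threshold given there. The text does not pair each frequency with its word, so the entries are keyed by item number, on the assumption that the listed order follows the item numbers. The names `tally` and `drop_low_frequency` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def tally(extracted_words: Iterable[str]) -> Counter:
    """Increment a word's frequency each time it is extracted."""
    counts: Counter = Counter()
    for word in extracted_words:
        counts[word] += 1
    return counts


def drop_low_frequency(counts: Counter, ratio: float = 0.05) -> Dict[str, int]:
    """Keep only entries whose frequency is at least ratio * total (SUM)."""
    total = sum(counts.values())
    cutoff = total * ratio
    return {word: n for word, n in counts.items() if n >= cutoff}


# Frequencies quoted in the description; word labels are not given in the text.
frequencies = [181, 160, 54, 44, 120, 8, 5, 54]
buffer_526 = Counter({f"item {i}": n for i, n in enumerate(frequencies, start=1)})

total = sum(buffer_526.values())
kept = drop_low_frequency(buffer_526)
```

The total is 626 and 5% of it is 31.3, so the two entries with frequencies 8 and 5 are removed and six remain, matching the six words that the description says are displayed.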
Next, the operation of the related word extraction device 1 will be described with reference to FIGS. 10 to 13.
FIG. 10 is a flowchart giving an overview of the processing from when the related word extraction device 1 is powered on until it extracts field-related words and terminates (specifically, the target field acquisition process, the related word counting process, and the related word extraction process). FIG. 11 is a flowchart explaining the target field acquisition process (S3) in the flowchart shown in FIG. 10, and FIG. 12 is a flowchart explaining the related word counting process (S5) in the flowchart shown in FIG. 10. FIG. 13 is a flowchart explaining the operation of finally extracting field-related words from the candidate words tallied by the related word counting process shown in FIG. 12.
[0021]
When the operation of the related word extraction device 1 starts, the control device 2 reads each program from the related word extraction program 605, stores it in the program unit 51 of the memory 5 as appropriate, and then executes predetermined processing according to the program.
That is, in FIG. 10, when the related word extraction device 1 is powered on, a bootstrap process is executed, and the program for executing the processing shown in FIG. 10 is loaded into the program unit 51 and executed. In this processing, the control device 2 first initializes the settings of devices such as the input device 3 and the display device 4 according to the initialization program 6059 (S1). Next, according to the field-related word notation pattern reading program 6051, it reads the field-related word notation pattern dictionary 603 of the magnetic disk device 6 and stores it in the field-related word notation pattern storage buffer 524 (S2). The control device 2 then enters the target field acquisition process (S3), which is described in detail later. If the target field acquisition process has not been ended (No in S4), the control device 2 starts the related word counting process (S5). If the target field acquisition process has been ended (Yes in S4), the related word extraction device 1 performs shutdown processing, for example saving data held in the memory 5, such as system information, to the data storage area 604, and then ends processing.
[0022]
Next, the target field acquisition process will be described with reference to FIG. 11.
In FIG. 11, the control device 2 acquires, via the input device 3 of the related word extraction device 1, the target field for which field-related words are to be obtained, according to the target field acquisition program 6052 (S301). The control device 2 then determines, according to the target field acquisition program 6052, whether a command indicating the end of the target field acquisition process (for example, the command issued when the user presses the close button of a window) has been sent from the input device 3 (S302). If the process is not to be ended (No in S302), the control device 2 stores the acquired target field in the target field storage buffer 521 (S303). If the process is to be ended (Yes in S302), the control device 2 stores an end value (for example, binary 0, the value that marks the end of a character string in a program) in the target field storage buffer 521 (S304) and returns to the caller (to S4). When the end value has been stored in the target field storage buffer 521 (S304), the control device 2 determines in S4 of FIG. 10 that processing is to end (Yes in S4), performs shutdown processing, for example saving data held in the memory 5, such as system information, to the data storage area 604, and then ends processing. The following description assumes that the target field acquired by the control device 2 according to the target field acquisition program 6052 stored in the program unit 51 is “printer technology”.
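The interplay between the target field acquisition process (S3) and the end check (S4) can be sketched as follows. This is a minimal illustration under stated assumptions: `None` models the end command from the input device, the names `acquire_target_field` and `run` are hypothetical, and the return to S3 after S5 is assumed rather than stated in the text.

```python
from typing import Callable, Iterable, List, Optional

END_MARK = "\0"  # binary 0, the end value stored in the target field buffer (S304)


def acquire_target_field(user_input: Optional[str]) -> str:
    """S301 to S304: return the value to store in the target field storage buffer.

    Assumption: None models the end command (for example, the close button).
    """
    if user_input is None:   # Yes in S302
        return END_MARK      # S304
    return user_input        # S303


def run(inputs: Iterable[Optional[str]],
        count_related_words: Callable[[str], None]) -> List[str]:
    """Repeat S3 to S5 until the end value is found in S4."""
    processed: List[str] = []
    for user_input in inputs:
        target_field_buffer = acquire_target_field(user_input)  # S3
        if target_field_buffer == END_MARK:                     # Yes in S4
            break                                               # shutdown would follow
        count_related_words(target_field_buffer)                # S5
        processed.append(target_field_buffer)
    return processed


calls: List[str] = []
handled = run(["printer technology", None, "never reached"], calls.append)
```

Using an in-band end value keeps the caller's check in S4 to a single comparison against the buffer contents, which matches the flow described above.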
[0023]
Next, the related word counting process will be described with reference to FIG. 12.
In FIG. 12, the control device 2 reads one of the documents stored in the document database 601 of the magnetic disk device 6 according to the document reading program 6054 and stores it in the document storage buffer 522 (S501). Then, according to the field extraction program 6055, the control device 2 determines the field information to which the document stored in the document storage buffer 522 belongs and stores that field information in the field temporary storage buffer 525 (S502). As described above, every document in the document database 601 handled in the embodiment of the present invention, shown in FIG. 6, begins with the character string "field:" followed by the field information to which the document belongs. For example, in the case of the document N in FIG. 6, the field information “printer technology” is extracted and stored in the field temporary storage buffer 525.
[0024]
Next, the control device 2 compares the target field stored in the target field storage buffer 521 with the field information stored in the field temporary storage buffer 525 according to the field information comparison program 6060 stored in the program unit 51 (S503). If the target field stored in the target field storage buffer 521 differs from the field information stored in the field temporary storage buffer 525 (No in S503), the control device 2, according to the document reading program 6054, reads a document in the document database 601 other than the one read in S501 and stores it in the document storage buffer 522. Documents in the document database 601 are read repeatedly in this way until the target field stored in the target field storage buffer 521 is determined to match the field information stored in the field temporary storage buffer 525 (Yes in S507, to S501).
[0025]
If the control device 2 determines that the target field stored in the target field storage buffer 521 matches the field information stored in the field temporary storage buffer 525 (Yes in S503), it morphologically analyzes the document stored in the document storage buffer 522 according to the morphological analysis program 6056 and stores the result in the morphological analysis result storage buffer 523 (S504). The morphological analysis performed by the control device 2 refers to the information in the morphological analysis dictionary 602 illustrated in FIG. 7. The morphological analysis in the embodiment of the present invention has a compound word processing function, that is, processing that identifies a combination of multiple words as a new word.
[0026]
For example, the control device 2 performs compound word recognition according to the morphological analysis program 6056 on the basis of compound word creation rules such as the following: when two nouns are adjacent, as in “noun” + “noun” = “noun”, the two words are recognized as a new noun; and when a noun follows a prefix, as in “prefix” + “noun” = “noun”, the two words are recognized as a new noun. For example, when the document N-2 in FIG. 6 is divided into morphemes, the result is as follows ("/" represents a word break). Ink-jet (noun) / method (noun) / no (case particle) / printing (noun) / device (noun) / on (particle) (noun). When compound word processing is then applied, each run of consecutive nouns is recognized as a new noun, and the result is as follows. Ink-jet method (noun) / no (case particle) / printing device (noun) / in (particle) (noun).
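The two compound word creation rules can be sketched in Python as a single pass over the morpheme list. The token representation (surface string, part of speech), the function name `merge_compounds`, and joining the English renderings with a space are assumptions for illustration; the token list is a simplified rendering of the document N-2 example.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface form, part of speech)


def merge_compounds(tokens: List[Token]) -> List[Token]:
    """Apply "noun" + "noun" = "noun" and "prefix" + "noun" = "noun" from left to right."""
    merged: List[Token] = []
    for surface, pos in tokens:
        if merged and pos == "noun" and merged[-1][1] in ("noun", "prefix"):
            # Fuse the current noun with the preceding noun or prefix.
            previous_surface, _ = merged.pop()
            merged.append((previous_surface + " " + surface, "noun"))
        else:
            merged.append((surface, pos))
    return merged


morphemes = [
    ("Ink-jet", "noun"), ("method", "noun"), ("no", "case particle"),
    ("printing", "noun"), ("device", "noun"), ("in", "particle"),
]
compounds = merge_compounds(morphemes)
```

Because a merged token is itself a noun, a run of three or more nouns collapses into one compound under this sketch.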
[0027]
After the morphological analysis, the control device 2, according to the field-related word extraction program 6057, extracts field-related words from the morphological analysis result stored in the morphological analysis result storage buffer 523 by referring to the field-related word notation patterns stored in the field-related word notation pattern storage buffer 524 (S505). In the above example, the result matches the field-related word notation pattern (5) in FIG. 8, and the control device 2 extracts “printing device”, the part corresponding to “”, as a field-related word according to the field-related word extraction program 6057. When field-related words have been extracted, the control device 2, according to the field-related word counting program 6058, counts the frequency of appearance of each field-related word in the document and stores the field-related word and the count in the field-related word counting buffer 526, as illustrated in FIG. 9 (S506). Specifically, each time a field-related word is extracted, its frequency of appearance is incremented.
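Steps S505 and S506 can be sketched as pattern matching over the analyzed tokens followed by a frequency increment. The content of pattern (5) is not given in the text, so the pattern used here (a placeholder followed by the particle "in") is an assumption chosen to reproduce the "printing device" result; `SLOT`, `extract_related_words`, and the restriction of the placeholder to nouns are likewise hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple

Token = Tuple[str, str]               # (surface form, part of speech)
SLOT = None                           # stands in for the placeholder of a pattern
NotationPattern = List[Optional[str]]


def extract_related_words(tokens: List[Token], pattern: NotationPattern) -> List[str]:
    """Return the nouns that fill SLOT wherever the pattern matches (S505)."""
    found: List[str] = []
    width = len(pattern)
    for start in range(len(tokens) - width + 1):
        slot_words: List[str] = []
        for (surface, pos), expected in zip(tokens[start:start + width], pattern):
            if expected is SLOT:
                if pos != "noun":      # assumption: only nouns fill the placeholder
                    break
                slot_words.append(surface)
            elif surface != expected:
                break
        else:
            found.extend(slot_words)   # every position of the pattern matched
    return found


tokens = [("Ink-jet method", "noun"), ("no", "case particle"),
          ("printing device", "noun"), ("in", "particle")]
counting_buffer: Counter = Counter()   # plays the role of counting buffer 526
counting_buffer.update(extract_related_words(tokens, [SLOT, "in"]))  # S506
```

`Counter.update` increments the stored frequency by one for each extracted occurrence, which corresponds to the per-extraction increment described above.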
[0028]
Next, the control device 2 refers to the documents stored in the document storage buffer 522 and determines whether all the documents stored in the document database 601 have been read (S507). If an unread document remains (Yes in S507), processing continues (to S501). If no unread document remains (No in S507), the process proceeds to the related word extraction process described later (S508) and then returns to the caller.
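The control flow of the related word counting process (S501 to S507) can be summarized in a short sketch. The document layout (a "field:" line followed by body text), the names `field_of` and `count_related_words`, and the whitespace-splitting stand-in for morphological analysis and pattern matching are all assumptions for illustration.

```python
from collections import Counter
from typing import Callable, Iterable, List


def field_of(document: str) -> str:
    """S502: return the field information that follows "field:" on the first line."""
    lines = document.strip().splitlines()
    if lines and lines[0].startswith("field:"):
        return lines[0][len("field:"):].strip()
    return ""


def count_related_words(documents: Iterable[str], target_field: str,
                        extract_words: Callable[[str], List[str]]) -> Counter:
    """Loop over all documents and count words only in those of the target field."""
    counts: Counter = Counter()
    for document in documents:                     # S501, repeated while S507 is Yes
        if field_of(document) != target_field:     # No in S503: read the next document
            continue
        body = "\n".join(document.strip().splitlines()[1:])
        counts.update(extract_words(body))         # S504 to S506
    return counts                                  # No in S507: all documents read


documents = [
    "field: printer technology\nprinter head printer",
    "field: program\nsubroutine",
]
# Stand-in extractor: splitting on whitespace replaces the real analysis steps.
result = count_related_words(documents, "printer technology", str.split)
```

Documents of other fields are skipped before any analysis is performed, so the cost of morphological analysis is paid only for documents that match the target field.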
[0029]
Next, the related word extraction process will be described with reference to FIG. 13.
It is assumed that the contents of the field-related word counting buffer 526 are in the state shown in FIG. 9.
First, the control device 2, according to the field-related word extraction program 6057, calculates the total of the frequencies in FIG. 9 and assigns it to a variable SUM temporarily set up in the temporary variable storage buffer 527 (S5081). In the example of FIG. 9, the value 626, obtained as 181 + 160 + 54 + 44 + 120 + 8 + 5 + 54, is assigned to SUM. To improve the suitability of the extraction result, the related word extraction device 1 sets a threshold for removing, as noise, field-related words that appear infrequently in the documents. Here the threshold is assumed to be 5% of SUM, and words whose frequency in FIG. 9 is below this threshold are excluded (S5082). Since 5% of 626 is 31.3, “ink ribbon” and “method”, whose frequencies in FIG. 9 are less than 31.3, are excluded. The remaining words are then displayed on the display device 4 as the field-related words obtained by the related word extraction device 1 (S5083). Here, “printer”, “printer”, “printing device”, “printing device”, “inkjet printer”, and “thermal transfer printer” are displayed as field-related words.
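The threshold arithmetic of S5081 and S5082 can be checked with a short sketch. The frequencies are those listed in the text; the labels "item 1" to "item 8" are placeholders, because the assignment of words to frequencies appears only in FIG. 9, and the function name `filter_by_threshold` is hypothetical.

```python
from typing import List, Tuple

# Frequencies from the example of FIG. 9, in the order given in the text.
entries: List[Tuple[str, int]] = [
    ("item 1", 181), ("item 2", 160), ("item 3", 54), ("item 4", 44),
    ("item 5", 120), ("item 6", 8), ("item 7", 5), ("item 8", 54),
]


def filter_by_threshold(items: List[Tuple[str, int]], ratio: float = 0.05):
    """Return (SUM, threshold, kept items) for a threshold of ratio * SUM."""
    total = sum(frequency for _, frequency in items)   # S5081
    threshold = total * ratio
    # S5082: drop items whose frequency is below the threshold.
    kept = [(word, frequency) for word, frequency in items if frequency >= threshold]
    return total, threshold, kept


total, threshold, kept = filter_by_threshold(entries)
removed = [frequency for _, frequency in entries if frequency < threshold]
```

Under these inputs the two entries with frequencies 8 and 5 fall below the threshold and six entries remain, in line with the six words displayed in S5083.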
[0030]
As described above, the embodiment of the present invention focuses on the characteristic rhetorical expressions often found in documents that are finely classified by field, such as patent specifications, and uses those expressions to extract field-related words from documents whose field information is the same as the target field requested by the user.
Further, to improve the suitability of the extracted field-related words, the related word extraction device 1 sets a threshold for removing field-related words that appear infrequently in the documents, and removes the field-related words whose frequency is below this threshold.
[0031]
The present invention is not limited to the above-described embodiment, and can be implemented with various modifications without departing from the spirit of the invention. For example, documents can be added to the document database 601 in the magnetic disk device 6. Further, although the embodiment assumes that the field information of each document in the document database 601 has been determined, the field information need not be determined in advance. In that case, combining an existing technique for extracting field information from a document with the related word extraction technique of the present invention yields a more versatile related word extraction technique.
[0032]
In the above embodiment, documents in the document database 601 whose field matches the target field input from the input device 3 are morphologically analyzed and compared with the field-related word notation patterns, but other natural language analysis, such as syntactic analysis or semantic analysis, may be used instead.
[0033]
[Effect of the invention]
As described above, according to the present invention, it is possible to provide a related word extraction device that selects documents having the same field information as the desired field information and extracts field-related words based on predetermined rhetorical expressions, without depending on co-occurrence information of the words appearing in the documents.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a related word extraction device.
FIG. 2 is a diagram showing an internal configuration of a memory 5.
FIG. 3 is a block diagram showing a configuration of a buffer unit 52 of the memory 5.
FIG. 4 is a block diagram showing a configuration of a magnetic disk device 6.
FIG. 5 is a block diagram showing a configuration of a related word extraction program 605.
FIG. 6 is a diagram showing an example of a document group constituting a document database 601.
FIG. 7 is a view showing an example of logical information held by a morphological analysis dictionary 602.
FIG. 8 is a diagram showing an example of a field-related word notation pattern dictionary 603.
FIG. 9 is a diagram showing an example of a field-related word counting buffer 526.
FIG. 10 is a flowchart for systematically explaining the overall operation of the related word extraction device 1.
FIG. 11 is a flowchart illustrating target field acquisition processing.
FIG. 12 is a flowchart illustrating a related word counting process.
FIG. 13 is a flowchart illustrating a related word extraction process.
[Explanation of symbols]
1 ... Related word extraction device
2 ... Control device
3 ... Input device
4 ... Display device
5 ... Memory
6 ... Magnetic disk device
7 ... Bus
51 ... Program unit
52 ... Buffer unit
521 ... Target field storage buffer
522 ... Document storage buffer
523 ... Morphological analysis result storage buffer
524 ... Field-related word notation pattern storage buffer
525 ... Field temporary storage buffer
526 ... Field-related word counting buffer
527 ... Temporary variable storage buffer
601 ... Document database
602 ... Morphological analysis dictionary
603 ... Field-related word notation pattern dictionary
604 ... Data storage area
605 ... Related word extraction program
6051 ... Field-related word notation pattern reading program
6052 ... Target field acquisition program
6053 ... Display program
6054 ... Document reading program
6055 ... Field extraction program
6056 ... Morphological analysis program
6057 ... Field-related word extraction program
6058 ... Field-related word counting program
6059 ... Initialization program
6060 ... Field information comparison program

Claims

1. A related word extraction device comprising:
a document database having a plurality of documents;
input means for inputting field information of a document;
document extraction means for extracting, from the document database, a document having field information that matches the input field information;
natural language analyzing means for performing natural language analysis on the document extracted by the document extraction means;
field-related word notation pattern holding means for holding a field-related word notation pattern for extracting, from a document in the document database, a field-related word related to the field of that document;
field-related word extraction means for referring to the field-related word notation pattern on the basis of the result of the natural language analysis and extracting the field-related word from the document extracted from the document database; and
field-related word counting means for counting the frequency of appearance of the extracted field-related words.

2. The related word extraction device according to claim 1, wherein the natural language analyzing means executes one of morphological analysis, syntactic analysis, and semantic analysis.

3. The related word extraction device according to claim 1, wherein the field-related word counting means removes, from the counted field-related words, any field-related word whose frequency of appearance is lower than a predetermined threshold.

4. A related word extraction method for extracting a document in a specific field from a document database having a plurality of documents and extracting, from the extracted document, a related word corresponding to the specific field, the method comprising:
an input step of inputting field information of a document;
a document extraction step of extracting, from the document database, a document having field information that matches the input field information;
a field-related word notation pattern holding step of holding a field-related word notation pattern for extracting, from a document in the document database, a field-related word related to the field of that document;
a field-related word extraction step of referring to the field-related word notation pattern on the basis of the result of the natural language analysis and extracting the field-related word from the document extracted from the document database; and
a field-related word counting step of counting the frequency of appearance of the extracted field-related words.

5. The related word extraction method according to claim 4, wherein the natural language analysis step performs one of morphological analysis, syntactic analysis, and semantic analysis.

6. The related word extraction method according to claim 4, wherein the field-related word counting step removes, from the counted field-related words, any field-related word whose frequency of appearance is lower than a predetermined threshold.

7. A program for causing a related word extraction device, which extracts a document in a specific field from a document database having a plurality of documents and extracts, from the extracted document, a related word corresponding to the specific field, to realize:
an input function of inputting field information of a document;
a document extraction step of extracting, from the document database, a document having field information that matches the input field information;
a field-related word notation pattern holding function of holding a field-related word notation pattern for extracting, from a document in the document database, a field-related word related to the field of that document;
a field-related word extraction function of referring to the field-related word notation pattern on the basis of the result of the natural language analysis and extracting the field-related word from the document extracted from the document database; and
a field-related word counting function of counting the frequency of appearance of the extracted field-related words.

8. The program according to claim 7, wherein the natural language analysis function executes one of morphological analysis, syntactic analysis, and semantic analysis.

9. The program according to claim 7, wherein the field-related word counting function removes, from the counted field-related words, any field-related word whose frequency of appearance is lower than a predetermined threshold.