JP3508312B2

JP3508312B2 - Keyword extraction device

Info

Publication number: JP3508312B2
Application number: JP20855695A
Authority: JP
Inventors: 明男山下
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 1995-07-25
Filing date: 1995-07-25
Publication date: 2004-03-22
Anticipated expiration: 2015-07-25
Also published as: JPH0944522A

Description

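[Detailed Description of the Invention]

[0001] [Industrial Field of Application] The present invention relates to full-text search technology in which an index is created in advance for each document registered for searching and, at search time, the entered keywords are compared against that index to identify documents. More specifically, it relates to a keyword extraction device used to create the index.

[0002] [Prior Art] When a document is registered in a document retrieval device, it has long been the practice to extract the words that serve as keywords of the document and to create an index for the document. The following publications, for example, disclose keyword extraction techniques.

[0003] [1] Japanese Unexamined Patent Application Publication No. S61-151738 (title: "Keyword extraction device"). This publication discloses extracting keywords from a document by means of keyword extraction rules consisting of a character type and a number of characters.

[0004] [2] Japanese Unexamined Patent Application Publication No. H5-006398 (title: "Document registration device and document retrieval device"). This publication discloses dividing a document or keyword at the points where the character type changes, encoding each pair of two consecutive characters and using the codes as the index, and not encoding hiragana sections.

[0005] [3] Japanese Unexamined Patent Application Publication No. H3-116375 (title: "Information retrieval device"). This publication discloses using a word dictionary to determine whether a keyword is a compound word or a simple word and, if it is a compound word, dividing it into its constituent simple words and retaining all of the simple words as keywords.

[0006] [4] Japanese Unexamined Patent Application Publication No. H5-81328 (title: "Automatic keyword input system"). This publication discloses decomposing Japanese text data into phrases (bunsetsu), performing part-of-speech analysis on the resulting phrase data to extract noun data, and creating index data from the extracted noun data together with data indicating where in the text each noun appears.

[0007] [Problems to Be Solved by the Invention] The technique of publication No. S61-151738 extracts only strings that match the keyword extraction rules, so when an extracted keyword is a compound, its constituent words cannot be extracted. It also cannot extract readings or absorb differences in inflection.

[0008] The technique of publication No. H5-006398 does not encode hiragana sections and encodes in units of two characters, so it cannot extract keywords that mix hiragana and kanji, such as 「読み取り」 ("reading"). It also cannot extract readings or absorb differences in inflection.

[0009] The technique of publication No. H3-116375 requires compound words to be registered in the dictionary, which is difficult in practice. It also cannot extract readings or absorb differences in inflection.

[0010] The technique of publication No. H5-81328 extracts a compound word contained in a phrase as a single word, so the words that make up the compound cannot be extracted. In addition, the extracted keywords are limited to nouns, and the technique cannot extract readings or absorb differences in inflection.

[0011] As described above, the conventional techniques place many restrictions on keyword extraction, and it has been difficult to extract every keyword that is useful for searching. The object of the present invention is to solve these problems so that, when an index for full-text search is created, every keyword that might be specified in a search expression can be extracted at registration time without omission.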
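[0012] [Means for Solving the Problems]

[0013] The keyword extraction device of the present invention comprises: input storage means (11) that stores the text of the document from which keywords are to be extracted; phrase candidate analysis means (12) that analyzes combinations of words satisfying the Japanese phrase (bunsetsu) structure rules and extracts phrase candidates; phrase candidate storage means (13) that stores the results analyzed and extracted by the phrase candidate analysis means; analysis result extraction means (14) that extracts, from the contents of the phrase candidate storage means, the combination of words with the minimum cost; analysis result storage means (15) that stores the analysis result extracted by the analysis result extraction means; extraction condition storage means (16) that stores the conditions used when extracting keywords from the contents of the phrase candidate storage means or of the analysis result storage means; first keyword extraction means (17) that extracts, as keywords, the items in the phrase candidate storage means that satisfy the conditions in the extraction condition storage means; second keyword extraction means (18) that extracts, as keywords, the items in the analysis result storage means that satisfy those conditions; and extracted keyword storage means (19) that stores the keywords extracted by the first and second keyword extraction means.

[0014] [Operation] The analysis means performs morphological analysis on the text of the target document and stores information on the analysis results in the analysis result storage means. The keyword extraction means uses the keyword-determining conditions held in the condition storage means to extract keywords from the analysis information held in the analysis result storage means. This analysis information includes not only the final result of the morphological analysis but also the phrase candidates that are its intermediate results, and keywords can be extracted from those phrase candidates as well. For example, in addition to a compound word obtained as the final result, the device can extract the words that make up the compound, which appear only as intermediate results, and words contained within a katakana string. Keywords for full-text search can therefore be extracted in a way that eliminates search omissions. The analysis information also includes the reading and part of speech of each word in addition to its surface form. Restoring a word to its dictionary form (終止形), extracting its reading, and selecting keywords by part of speech can all be made conditions for keyword extraction. Searches that use keywords extracted in this way tolerate a wider range of search keywords and miss fewer documents.

[0015] The operation of a specific aspect of the invention (claim 4) is outlined as follows. The phrase candidate analysis means analyzes the text for combinations of words that satisfy the Japanese phrase structure rules, and the resulting phrase candidates are held in the phrase candidate storage means. The analysis result extraction means extracts the minimum-cost combination of words from the contents of the phrase candidate storage means, and the result is stored in the analysis result storage means. The extraction condition storage means stores the conditions for extracting keywords. The first and second keyword extraction means extract, as keywords, the items that satisfy those conditions from the phrase candidate storage means and the analysis result storage means respectively, and the extracted key storage means stores the extracted keywords. Because the prior art extracts keywords only from the final result of morphological analysis, the constituent words of a compound could not be used as keywords. The present invention also extracts keywords from the contents of the phrase candidate storage means, which are intermediate results of the morphological analysis, so keywords for full-text search can be extracted with few search omissions.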
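[0016] [Embodiment] Fig. 1 is a functional block diagram outlining a keyword extraction device according to an embodiment of the present invention. The device comprises input storage means 11, phrase candidate analysis means 12, phrase candidate storage means 13, analysis result extraction means 14, analysis result storage means 15, extraction condition storage means 16, first key extraction means 17, second key extraction means 18, and extracted key storage means 19. The function of each element is as follows.

[0017] The input storage means 11 stores the text of the document to be registered. The phrase candidate analysis means 12 refers to the analysis dictionary and connection table described later, and analyzes and extracts combinations of words that satisfy the Japanese phrase structure rules. The phrase candidate storage means 13 stores the phrase candidates obtained by the phrase candidate analysis means 12. The analysis result extraction means 14 extracts the minimum-cost combination of words from the contents of the phrase candidate storage means 13. The analysis result storage means 15 stores the result extracted by the analysis result extraction means 14. The extraction condition storage means 16 stores the conditions used when extracting keys from the contents of the phrase candidate storage means 13 and the analysis result storage means 15. The first key extraction means 17 extracts keys from those entries in the phrase candidate storage means 13 that satisfy the conditions in the extraction condition storage means 16. The second key extraction means 18 extracts keys from those entries in the analysis result storage means 15 that satisfy the same conditions. The extracted key storage means 19 stores the keys extracted by the first key extraction means 17 and the second key extraction means 18.

[0018] Japanese morphological analysis is the process of deriving word and phrase information from Japanese text, which is written without spaces between words. This embodiment is based on a form of morphological analysis called the minimum cost method. The minimum cost method extends the minimum phrase count method (an analysis method that, when several analyses are possible, prefers the one with the fewest phrases): a cost is assigned to each word candidate, and the analysis with the lowest total cost is preferred (see Yoshimura, Hidaka, and Yoshida, 「未登録語を含む日本語の形態素解析のアルゴリズム」 (an algorithm for morphological analysis of Japanese text containing unregistered words), 九州大学工学集報, Vol. 55, No. 6, 1982). The minimum phrase count method corresponds to the minimum cost method with the cost of each independent word (自立語) set to 1 and the cost of each ancillary word (付属語) set to 0.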
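The relationship between the two methods in paragraph [0018] can be illustrated with a minimal sketch. The word lists and the `"independent"` / `"ancillary"` labels below are hypothetical inputs chosen for illustration, not data from the patent. With a cost of 1 for independent words and 0 for ancillary words, the total cost of an analysis equals its number of independent words, so choosing the minimum cost favors the analysis with fewer phrases.

```python
from typing import Callable, List, Tuple

# (surface form, kind); the kind labels are illustrative assumptions.
Word = Tuple[str, str]


def total_cost(words: List[Word], cost_of: Callable[[Word], int]) -> int:
    """Sum the per-word costs of one candidate analysis."""
    return sum(cost_of(w) for w in words)


def phrase_count_cost(word: Word) -> int:
    """Cost assignment under which minimum cost equals minimum phrase count."""
    return 1 if word[1] == "independent" else 0


def best(analyses: List[List[Word]], cost_of: Callable[[Word], int]) -> List[Word]:
    """Return the analysis with the lowest total cost."""
    return min(analyses, key=lambda ws: total_cost(ws, cost_of))


analysis_a = [("テキストファイル", "independent"), ("に", "ancillary"),
              ("書き込み", "independent"), ("ます", "ancillary")]
analysis_b = [("テキスト", "independent"), ("ファイル", "independent"),
              ("に", "ancillary"), ("書き込み", "independent"),
              ("ます", "ancillary")]

print(total_cost(analysis_a, phrase_count_cost))  # 2
print(total_cost(analysis_b, phrase_count_cost))  # 3
```

Replacing `phrase_count_cost` with per-word costs taken from a dictionary gives the general minimum cost method.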
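[0019] Fig. 2 shows an example of the dictionary that the phrase candidate analysis means 12 refers to during analysis. The dictionary stores information on the words that make up Japanese phrases. For each word it holds a headword 21, a part of speech 22, a reading 23, a cost 24, and other information 25.

[0020] Fig. 3 shows an example of the connection table that the phrase candidate analysis means 12 refers to during analysis. The connection table is a two-dimensional array that uses the part-of-speech information defined in the dictionary to specify whether two adjacent words can be connected. The part of speech of a row is that of the left-hand word of the adjacent pair, and the part of speech of a column is that of the right-hand word. An array element of 1 means the two can be connected, and 0 means they cannot. One of the column entries is a "virtual phrase head"; it is provided so that the table can also be used to determine whether a given word can come at the end of a phrase, that is, whether the word can be followed by the start of a new phrase.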
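As a rough illustration of paragraphs [0019] and [0020], the dictionary entry and the connection table might be modeled as below. This is a sketch under assumptions: the part-of-speech names, the `PHRASE_HEAD` label, and every table value are invented for the example, and only the field list (headword, part of speech, reading, cost, other information) and the cost of 70 for 「テキスト」 come from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DictEntry:
    headword: str    # item 21
    pos: str         # item 22, part of speech
    reading: str     # item 23
    cost: int        # item 24
    other: str = ""  # item 25


# Rows are the left-hand word's part of speech, columns the right-hand one.
# 1 = connectable, 0 = not connectable. All values here are hypothetical.
PHRASE_HEAD = "PHRASE_HEAD"  # stands in for the "virtual phrase head" column
CONNECT = {
    "noun":     {"noun": 1, "particle": 1, PHRASE_HEAD: 1},
    "particle": {"noun": 0, "particle": 0, PHRASE_HEAD: 1},
}


def can_connect(left_pos: str, right_pos: str) -> bool:
    """True when the table allows right_pos to follow left_pos."""
    return CONNECT.get(left_pos, {}).get(right_pos, 0) == 1


def can_end_phrase(pos: str) -> bool:
    """A word can end a phrase if it can be followed by a phrase head."""
    return can_connect(pos, PHRASE_HEAD)


entry = DictEntry("テキスト", "noun", "てきすと", 70)
print(can_connect(entry.pos, "particle"), can_end_phrase("particle"))
```

Treating the virtual phrase head as an ordinary column keeps the phrase-end test a plain table lookup, which matches how the description motivates that column.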
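[0021] In the embodiment, the input sentence is analyzed according to the morphological analysis algorithm based on the minimum cost method described above. The algorithm first uses the dictionary to cut out the words contained in the input sentence, checks each against the immediately preceding word candidates for connectability, determines whether it can end a phrase, updates the costs, and thereby extracts the phrase candidates. It then extracts, from among the phrase candidates, the analysis for which the sum of the costs from the beginning of the sentence is smallest.

[0022] Fig. 4 is a flowchart showing the operation of the embodiment. In step S1, the text to be registered is stored in the input storage means 11. In step S2, a sentence is extracted from the text stored in the input storage means 11, using full stops and line breaks as cues. In step S3, it is determined whether a sentence was extracted. If not, every sentence in the text has been processed and processing ends. If a sentence was extracted, processing advances to step S4. In step S4, the sentence is analyzed to find the phrase candidates, and the result is stored in the phrase candidate storage means 13. Step S4 corresponds to the first step of the minimum cost method. In step S5, the minimum-cost phrase structure is extracted from the phrase candidate storage means 13 and stored in the analysis result storage means 15. Step S5 corresponds to the second step of the minimum cost method. In step S6, the second key extraction means 18 extracts from the analysis result storage means 15 the keys that match the extraction conditions stored in the extraction condition storage means 16, and stores them in the extracted key storage means 19. In step S7, the first key extraction means 17 extracts from the phrase candidate storage means 13 the keys that match the extraction conditions stored in the extraction condition storage means 16, and stores them in the extracted key storage means 19.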
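The control flow of Fig. 4 can be sketched as a loop over sentences. This is an illustrative outline only: `analyze_candidates`, `select_min_cost`, and `matches` are hypothetical callables standing in for steps S4, S5, and the extraction conditions, and words are reduced to plain strings.

```python
from typing import Callable, Iterator, List, Set


def extract_sentences(text: str) -> Iterator[str]:
    """Step S2: split on the full stop 「。」 and on line breaks."""
    buf: List[str] = []
    for ch in text:
        if ch == "\n":
            if buf:
                yield "".join(buf)
                buf = []
            continue
        buf.append(ch)
        if ch == "。":
            yield "".join(buf)
            buf = []
    if buf:  # text that ends without a full stop or line break
        yield "".join(buf)


def extract_keys(text: str,
                 analyze_candidates: Callable[[str], List[str]],
                 select_min_cost: Callable[[List[str]], List[str]],
                 matches: Callable[[str], bool]) -> List[str]:
    """Steps S2 to S7 for every sentence, skipping duplicate surface forms."""
    keys: List[str] = []
    seen: Set[str] = set()
    for sentence in extract_sentences(text):       # S2, S3
        candidates = analyze_candidates(sentence)  # S4: phrase candidates
        result = select_min_cost(candidates)       # S5: final analysis
        for word in list(result) + list(candidates):  # S6, then S7
            if matches(word) and word not in seen:
                seen.add(word)
                keys.append(word)
    return keys


print(list(extract_sentences("テキストファイルに書き込みます。次の文\n最後")))
```

With stub callables that return 「テキストファイル」 as the final analysis and 「テキスト」 and 「ファイル」 as intermediate candidates, `extract_keys` yields all three strings once each.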
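[0025] Fig. 5(a) illustrates an example of the extraction conditions referred to by the first key extraction means 17 and the second key extraction means 18. Each extraction condition pairs a part of speech 51 with a minimum character count 52, which expresses the length condition for words of that part of speech. In this example, general nouns are extracted if they are at least two characters long, proper nouns if they are at least one character long, and words not in the dictionary (for example, alphanumeric or katakana strings) if they are at least two characters long.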
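The condition table of Fig. 5(a) amounts to a lookup from part of speech to a minimum length. A minimal sketch follows; the English part-of-speech labels are assumptions made for the example, while the three thresholds are the ones given in paragraph [0025].

```python
# Minimum character count per part of speech (thresholds from paragraph [0025]).
MIN_LENGTH = {
    "general_noun": 2,
    "proper_noun": 1,
    "not_in_dictionary": 2,  # e.g. alphanumeric or katakana strings
}


def matches_condition(surface: str, pos: str) -> bool:
    """True when the part of speech is listed and the word is long enough."""
    limit = MIN_LENGTH.get(pos)
    return limit is not None and len(surface) >= limit


print(matches_condition("テキスト", "general_noun"))  # True
print(matches_condition("に", "particle"))            # False
```

Parts of speech that are absent from the table are never extracted, which is how particles and other function words are filtered out in this sketch.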
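[0026] Fig. 5(b) shows an example of part of the content of a document to be registered. Assume that the document contains the sentence 53 「テキストファイルに書き込みます。」 ("Write to a text file."). Fig. 5(c) illustrates the boundary positions between the characters of the sentence. These positions are the position information referred to when the range of an analysis result or an extracted key is expressed by a start point and an end point. For example, the string 「テキスト」 occupies the range from 0 to 4.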
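The boundary positions of Fig. 5(c) behave like the half-open index ranges used by Python string slicing, which gives a quick way to check the example. The helper name `span` is an assumption made for this sketch.

```python
sentence = "テキストファイルに書き込みます。"


def span(text: str, start: int, end: int) -> str:
    """Return the characters between boundary positions start and end."""
    return text[start:end]


print(span(sentence, 0, 4))    # テキスト
print(span(sentence, 4, 8))    # ファイル
print(span(sentence, 15, 16))  # 。
```

The sentence has 16 characters, so words can start at boundary positions 0 through 15 and the final full stop starts at position 15.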
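[0027] The operation of extracting keywords from the sentence shown in Fig. 5(b) is described concretely below. In step S1 of Fig. 4, text containing the sentence 「テキストファイルに書き込みます。」 is stored in the input storage means 11. In step S2, a sentence is extracted using the full stop 「。」, a line break, or the end of the file as a cue. The following describes the processing from step S2 onward at the point where the sentence 「テキストファイルに書き込みます。」 has been extracted. Since a sentence was extracted, step S3 passes processing to step S4. In step S4, the first step of the minimum cost method is executed. For each boundary position shown in Fig. 5(c) (0, 1, 2, ..., 15), the word candidates starting at that position are found by searching a dictionary such as the one illustrated in Fig. 2. A run of consecutive katakana or alphabetic characters may be an unregistered word, so a run of consecutive katakana is given suitable part-of-speech information ("not in dictionary" in the embodiment) and a cost (80 in the embodiment), cut out as a noun-like unregistered word, and treated as if it had been registered in the dictionary. Once the dictionary information has been obtained, it is used to check whether the word can be connected to the immediately preceding word candidates. If it can, the minimum cumulative cost from the beginning of the sentence is calculated and stored, together with the dictionary information, in the phrase candidate storage means 13.

[0028] The current sentence begins with 「テキスト…」, so the dictionary illustrated in Fig. 2 is searched for the substrings 「テ」, 「テキ」, 「テキス」, and 「テキスト」 of the input string. The search finds 「テキスト」. Because the first character 「テ」 is katakana, the katakana run that follows it is also cut out, giving 「テキストファイル」. These are stored in the phrase candidate storage means 13 as phrase candidates No. 1 and No. 4 of Fig. 6. Their parts of speech are noun class and not in dictionary, respectively. Their costs are 70 and 80 respectively, and since the cost at the beginning of the sentence is 0, their cumulative costs from the beginning of the sentence are also 70 and 80. For the phrase-end information, 0 means the candidate cannot end a phrase and 1 or more means it can.

[0029] The next cut-out position is the smallest of the end positions of the word candidates extracted so far. Here 「テキスト」 and 「テキストファイル」 have been extracted, so the smallest end position is 4. Using the positions defined in Fig. 5(c), phrase candidates are therefore cut out from 「ファイルに…」. In this case 「ファイル」 (not in dictionary) and 「ファイル」 (noun class, stem of a sa-row irregular verb) are cut out and stored in the phrase candidate storage means 13 as candidates No. 2 and No. 3. Their own costs are 80 and 70. Each cumulative cost from the beginning of the sentence is the minimum cumulative cost at the candidate's start position (here the minimum at position 4, which is 70) plus the candidate's own cost, giving 150 and 140. Processing continues in the same way, and after step S4 has finished, phrase candidates such as those shown in Fig. 6 are stored in the phrase candidate storage means 13.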
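The cost arithmetic in paragraphs [0028] and [0029] can be reproduced with a short forward pass. This sketch is a simplification under stated assumptions: it omits the connection-table and phrase-end checks, the `Candidate` class and English part-of-speech labels are hypothetical, and only the four candidates and their costs (70, 80, 80, 70) come from the description.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Candidate:
    surface: str
    pos: str
    start: int
    end: int
    cost: int                    # the candidate's own cost
    total: Optional[int] = None  # cumulative cost from the sentence head


def accumulate(cands: List[Candidate]) -> None:
    """Fill in each candidate's cumulative cost (connection checks omitted)."""
    best_at: Dict[int, int] = {0: 0}  # the cost at the sentence head is 0
    # Every candidate ending at position p starts before p, so ordering by
    # start guarantees best_at[p] is final before candidates starting at p.
    for c in sorted(cands, key=lambda c: (c.start, c.end)):
        if c.start not in best_at:
            continue  # no candidate reaches this start position
        c.total = best_at[c.start] + c.cost
        best_at[c.end] = min(best_at.get(c.end, c.total), c.total)


cands = [
    Candidate("テキスト", "noun", 0, 4, 70),
    Candidate("テキストファイル", "not_in_dictionary", 0, 8, 80),
    Candidate("ファイル", "not_in_dictionary", 4, 8, 80),
    Candidate("ファイル", "noun", 4, 8, 70),
]
accumulate(cands)
print([c.total for c in cands])  # [70, 80, 150, 140]
```

The two 「ファイル」 candidates both build on the minimum cumulative cost at position 4, which is 70, giving 150 and 140 as in the description.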
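[0030] In the next step, S5, the second step of the minimum cost method is executed. The minimum-cost phrase structure is extracted from the phrase candidate storage means 13, and the result is stored in the analysis result storage means 15. Specifically, among the phrase candidates at the very end of the sentence, the one with the smallest cumulative cost from the beginning is extracted, and then the phrase candidates that can connect to it are extracted one after another from the end of the sentence toward the beginning. Whether two candidates can connect is judged by whether they are adjacent, whether the connection is consistent in terms of cost, and whether the connection is defined in the connection table. In the contents of the phrase candidate storage means 13 shown in Fig. 6, the last phrase candidate is 「。」. Its start position is 15, its own cost is 300, and its cumulative cost is 445. The candidate immediately before it is the one whose end position is 15, whose cumulative cost is 145, and whose connection is defined in the connection table, namely phrase candidate No. 39 of Fig. 6. Fig. 7(a) shows the result of extracting the minimum-cost phrase candidates in the same way from the end of the sentence toward the beginning. The contents of Fig. 7(a) are what is generally called the morphological analysis result.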
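The backward extraction of paragraph [0030] can be sketched as a trace from the end of the sentence. The lattice below is a made-up fragment for 「テキストファイルに」: the cost of 10 for 「に」 and the part-of-speech labels are assumptions, and `connectable` is a hypothetical hook standing in for the connection table, which defaults to allowing everything.

```python
from typing import Callable, List, NamedTuple


class Cand(NamedTuple):
    surface: str
    pos: str
    start: int
    end: int
    cost: int   # own cost
    total: int  # cumulative cost from the sentence head


def backtrace(cands: List[Cand], length: int,
              connectable: Callable[[str, str], bool] = lambda l, r: True
              ) -> List[Cand]:
    """Follow the minimum-cost path from the sentence end back to the head."""
    finals = [c for c in cands if c.end == length]
    cur = min(finals, key=lambda c: c.total)
    path = [cur]
    while cur.start > 0:
        # A predecessor must be adjacent, consistent in cost, and allowed
        # by the connection table.
        cur = next(p for p in cands
                   if p.end == cur.start
                   and p.total + cur.cost == cur.total
                   and connectable(p.pos, cur.pos))
        path.append(cur)
    return path[::-1]


lattice = [
    Cand("テキスト", "noun", 0, 4, 70, 70),
    Cand("テキストファイル", "not_in_dictionary", 0, 8, 80, 80),
    Cand("ファイル", "noun", 4, 8, 70, 140),
    Cand("に", "particle", 8, 9, 10, 90),  # cost 10 is hypothetical
]
print([c.surface for c in backtrace(lattice, 9)])
```

The same cost-consistency test explains the figures in the description: the candidate before 「。」 must have a cumulative cost of 445 - 300 = 145.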
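[0031] In the next step, S6, the second key extraction means 18 extracts, from the analysis result storage means 15 of Fig. 7(a), the keys that match the extraction conditions stored in the extraction condition storage means 16 of Fig. 5(a), and stores them in the extracted key storage means 19. Here too, a surface form that has already been extracted is not extracted again. As a result of step S6, 「テキストファイル」 is extracted and registered in the extracted key storage means as shown in No. 1 of Fig. 7(b).

[0032] In the next step, S7, the first key extraction means 17 extracts, from the contents of the phrase candidate storage means 13 shown in Fig. 6, the keywords that match the extraction conditions stored in the extraction condition storage means 16 of Fig. 5(a), and stores them in the extracted key storage means 19. No. 2 through No. 6 of Fig. 7(b) are the keys extracted in step S7. Because identical surface forms are not extracted twice, only one of the two 「ファイル」 candidates (No. 2 and No. 3 of Fig. 6) is extracted. After step S7, processing returns to step S2, and steps S2 through S7 are repeated for each subsequent sentence until no more sentences can be extracted. The same effect is obtained if steps S6 and S7 in Fig. 4 are swapped, provided duplicate keys are still not extracted.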
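[0033] Fig. 7(a) shows the morphological analysis result for the sentence 「テキストファイルに書き込みます。」. If keys are extracted from this result alone under the extraction conditions of Fig. 5(a), only 「テキストファイル」 is extracted. With a full-text search index built from such keys, a document containing 「テキストファイルに書き込みます。」 could not be found by specifying noun-like search terms such as 「テキスト」, 「ファイル」, 「書き込み」, 「込み」, or 「ます」. In this embodiment, keys such as those shown in Fig. 7(b) are also extracted from the intermediate analysis results shown in Fig. 6. A document containing 「テキストファイルに書き込みます。」 can therefore be found from any of the noun-like search terms 「テキストファイル」, 「テキスト」, 「ファイル」, 「書き込み」, 「込み」, and 「ます」, and search omissions are eliminated.

[0034] The present invention can also be carried out with parts of the embodiment described above modified or replaced as follows.

[0035] (1) In the keyword extraction of steps S6 and S7 in Fig. 4, the reading of each keyword is extracted in addition to its surface form. When the surface form is katakana, the reading can be obtained by code conversion, so it may be left as katakana. If katakana readings are not extracted, then when a search term is given in hiragana, a katakana version of the search term is also generated and used in the search. With this configuration, keys such as 「テキストファイル」, 「テキスト」, 「ファイル」, 「書き込み」, 「込み」, 「ます」, 「かきこみ」, and 「こみ」 are extracted. As a result, the desired documents can be found from the reading both when the user knows the reading but cannot specify the exact surface form, and when the surface form varies across the registered documents depending on how much okurigana is written (書き込み, 書きこみ, 書込み).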
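The "code conversion" between katakana and hiragana mentioned in modification (1) can be illustrated with Unicode code points. This is an assumption about encoding: the patent does not name a character set, and this sketch relies on the Unicode layout in which each katakana letter from ァ to ヶ sits exactly 0x60 above its hiragana counterpart.

```python
KANA_OFFSET = 0x60  # distance between the Unicode hiragana and katakana blocks


def katakana_to_hiragana(text: str) -> str:
    """Convert katakana letters to hiragana; other characters are unchanged."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET) if "ァ" <= ch <= "ヶ" else ch
        for ch in text
    )


def hiragana_to_katakana(text: str) -> str:
    """Convert hiragana letters to katakana; other characters are unchanged."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET) if "ぁ" <= ch <= "ゖ" else ch
        for ch in text
    )


print(katakana_to_hiragana("テキスト"))   # てきすと
print(hiragana_to_katakana("かきこみ"))   # カキコミ
```

The second function corresponds to the case where katakana readings are not stored and a hiragana search term is expanded into a katakana one at search time. The long vowel mark 「ー」 lies outside the converted range and is passed through as is.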
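[0036] (2) The contents of the extraction condition storage means 16 are replaced with contents such as those shown in Fig. 8. In Fig. 8, extraction is also specified for parts of speech other than noun-like ones. For example, "godan verb" (５段動詞) means the stem of a verb of the five-grade conjugation in the カ, ガ, サ, タ, バ, マ, ラ, ナ, ハ, or ア row. By specifying the extraction conditions as in Fig. 8, keys such as 「読み込み（名詞）」 (noun), 「読む（マ行５段）」, 「込む（マ行５段）」, and 「読み込む（マ行５段）」 (all マ-row godan verbs) can be extracted from a sentence such as 「読み込みます」.

[0037] (3) When the part of speech of a key to be extracted indicates an inflecting word, the key is extracted in its dictionary form (終止形). For words that inflect regularly (for example, godan verbs and adjectives), the dictionary form is obtained by removing the inflectional ending and attaching the dictionary-form ending to the stem. For words that inflect irregularly, the dictionary form is obtained from a table that maps the irregular forms to their dictionary forms. Extracting the dictionary forms of inflecting words in this way makes searching independent of differences in inflection. For example, if the dictionary form 「読み取る」 is extracted from 「読み取り」 and 「読み取った」, documents containing 「読み取り」 or 「読み取った」 can be found with the search term 「読み取る」. Likewise, by applying the same key extraction to the search expression, the search term 「読み取る」 can be extracted from a search expression containing 「読み取り」 or 「読み取った」, and documents containing 「読み取り」 or 「読み取った」 can be found.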
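Modification (3) can be sketched as a lookup from conjugation class to dictionary-form ending, plus a table for irregular words. Everything here is illustrative: the part-of-speech labels imitate the 「マ行５段」 style used in the description, the ending table covers only a few rows, and the irregular table entries are hypothetical examples rather than the patent's table.

```python
# Dictionary-form ending for some godan (five-grade) rows.
GODAN_ENDING = {
    "カ行５段": "く",
    "サ行５段": "す",
    "タ行５段": "つ",
    "マ行５段": "む",
    "ラ行５段": "る",
}

# Hypothetical irregular-form table: inflected form -> dictionary form.
IRREGULAR = {
    "し": "する",
    "来": "来る",
}


def dictionary_form(stem: str, pos: str) -> str:
    """Return the dictionary form for a stem and its part of speech."""
    if pos in GODAN_ENDING:
        return stem + GODAN_ENDING[pos]   # regular: stem + ending
    if pos == "形容詞":
        return stem + "い"                # i-adjective
    return IRREGULAR.get(stem, stem)      # irregular, or unchanged


print(dictionary_form("読み取", "ラ行５段"))  # 読み取る
print(dictionary_form("読", "マ行５段"))      # 読む
```

Because both 「読み取り」 and 「読み取った」 share the stem 「読み取」 and the ラ-row class, they map to the same key 「読み取る」, which is what makes the search independent of inflection.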
[0038] [Effects of the Invention] According to the present invention, keywords for full-text search are extracted not only from the final result of morphological analysis but also from intermediate results, such as the candidate words contained in the input sentence. For example, in addition to compound words, the device can extract the words contained in a compound and the words contained within a katakana string. Furthermore, using the reading and part-of-speech information obtained through morphological analysis, keywords can be restored to their dictionary forms, their readings can be extracted, the keywords to be extracted can be selected by part of speech, and searches can be performed using dictionary forms. As a result, keywords for full-text search can be extracted in a way that eliminates search omissions.
H11-64300 discloses that part-of-speech analysis is performed to extract noun data, and index data is created by using the extracted noun data and data representing a location in a sentence in which the data is described. [0007] Japanese Patent Application Laid-Open No. 61-1517
In the technology described in Japanese Patent Publication No. 38, only words that match the keyword extraction rules are extracted, and therefore, when the extracted keywords are complex, the words that compose the keyword cannot be extracted. Also, it is not possible to extract readings and absorb changes in utilization. In the technique described in Japanese Patent Application Laid-Open No. 5-006398, the section of hiragana is not coded, but is coded in units of two characters, so that a keyword such as "reading" which combines hiragana and kanji can be extracted. Can not. Also, it is not possible to extract readings and absorb changes in utilization. In the technique described in Japanese Patent Application Laid-Open No. 3-116375, compound words must be registered in a dictionary, but this is practically difficult. It is not possible to extract readings and absorb changes in utilization. According to the technique described in Japanese Patent Application Laid-Open No. 5-81328, when a compound word is included in a clause, the compound word is extracted as one word, so that the words constituting the compound word cannot be extracted. In addition, the extracted keywords are limited to nouns, and the change in the extraction and utilization of reading cannot be absorbed. As described above, in the conventional technique, there are many limitations in extracting keywords, and it has been difficult to extract useful keywords for search without omission. The present invention solves the above-described problems of the conventional technology, and enables to completely extract keywords that may be specified in a search expression when creating an index for full-text search at the time of registration. It is intended for that purpose. [0013] A keyword extraction device according to the present invention includes an input storage means (11) for storing the contents of text of a document to be subjected to keyword extraction, and a Japanese phrase structure rule. 
A phrase candidate analyzing means (12) for analyzing a combination of satisfying words to extract a phrase candidate; a phrase candidate storing means (13) for storing a result analyzed and extracted by the phrase candidate analyzing means;
An analysis result extraction means (14) for extracting a combination of words having the minimum cost from the contents of the phrase candidate storage means, an analysis result storage means (15) for storing the analysis results extracted by the analysis result extraction means, and a phrase candidate An extraction condition storage means (16) for storing conditions for extracting a keyword from the contents of the storage means or the contents of the analysis result storage means, and a keyword matching the extraction condition storage means from the contents of the phrase candidate storage means. First keyword extracting means (17) for extracting as a keyword, second keyword extracting means (18) for extracting, from the contents of the analysis result storing means, those which match the conditions of the extraction condition storing means as keywords, and first keyword extracting means Extracted keyword storage means (19) for storing the extracted keywords and the keywords extracted by the second keyword extraction means; It includes those were. The analyzing means performs a morphological analysis on the text in the target document, and stores information on the analysis result in the analysis result storage means. The keyword extracting means extracts the keyword from the information obtained as a result of the analysis held by the analysis result storing means, using the condition for determining the keyword held in the condition storing means. The information on the analysis result includes not only the final result of the morphological analysis but also a phrase candidate which is an intermediate result of the morphological analysis, and a keyword can be extracted from the phrase candidate. 
For example, extract words that are elements of compound words obtained as an intermediate result of morphological analysis, as well as words that are included in katakana character strings, as well as compound words that are finally obtained as a result of morphological analysis be able to. Therefore, it is possible to extract keywords for full-text search without omission of search. In addition, the analysis result information includes not only word notation but also reading and part-of-speech information.From these readings and part-of-speech information, it is possible to return to the final form, extract readings, and extract keywords by part of speech. Sorting can be used as a condition for extracting a keyword. When a search is performed using the keywords extracted in this manner, the allowable range of setting the search keywords is widened, and the omission of the search is reduced. The outline of the operation of a specific mode of the present invention (claim 4) is as follows. The phrase candidate analysis means analyzes a combination of words satisfying the Japanese phrase structure rule with respect to the text content, and the phrase candidates obtained as a result of the analysis are stored in the phrase candidate storage means. The analysis result extraction means extracts a combination of words having the minimum cost from the contents of the phrase candidate storage means, and the result is stored in the analysis result storage means. The extraction condition storage means stores conditions for extracting a keyword. The first and second keyword extracting means extract, from the contents of the phrase candidate storing means and the analysis result storing means, those which meet the conditions of the extracting condition storing means as keywords, and the extracted key storing means stores the extracted keyword. Is stored. In the related art, a keyword is extracted from the final result of the morphological analysis. 
Therefore, in the case of a compound word, a word that is an element of the compound word cannot be used as a keyword. Since keywords are also extracted from the storage contents of the phrase candidate storage means, keywords for full-text search with few omissions in search can be extracted. FIG. 1 is a functional block diagram schematically showing a keyword extracting apparatus according to an embodiment of the present invention. This keyword extraction device includes an input storage unit 11, a phrase candidate analysis unit 12, a phrase candidate storage unit 13, an analysis result extraction unit 14,
It comprises an analysis result storage unit 15, an extraction condition storage unit 16, a first key extraction unit 17, a second key extraction unit 18, and an extraction key storage unit 19. The function of each element is as follows. The input storage means 11 stores the text content of the registered document. The phrase candidate analyzing means 12 analyzes and extracts a combination of words satisfying the Japanese phrase structure rule with reference to an analysis dictionary and a connection table described later. The phrase candidate storage unit 13 stores the phrase candidates obtained by the analysis result of the phrase candidate analysis unit 12. The analysis result extraction means 14
A combination of words having the minimum cost is extracted from the contents of the phrase candidate storage unit 13. The analysis result storage unit 15 stores the extraction result of the analysis result extraction unit 14. The extraction condition storage unit 16 stores conditions for extracting a key from the contents of the phrase candidate storage unit 13 and the analysis result storage unit 15. The first key extraction unit 17 extracts a key from the contents of the phrase candidate storage unit 13 and an analysis result that matches the condition of the extraction condition storage unit 16. The second key extracting unit 18 extracts a key from the analysis result that matches the condition of the extraction condition storing unit 16 from the content of the analysis result storing unit 15. The extracted key storage unit 19 stores the keys extracted by the first key extraction unit 17 and the second key extraction unit 18. The Japanese morphological analysis is a process of analyzing information on words and phrases from a Japanese sentence that is not separated. This embodiment is based on a morphological analysis called a minimum cost method. The minimum cost method is an extension of the minimum number of clauses method (an analysis method that gives priority to the one with the minimum number of clauses when there are multiple analysis results), and assigns costs to word candidates to minimize the overall cost. (Yoshimura / Hidaka / Yoshida: Algorithm of Japanese morphological analysis including unregistered words, Kyushu University Journal of Engineering, Vol. 55, No. 6, 1982). In addition,
The phrase number minimization method is equivalent to the cost minimization method when the cost of an independent word is set to 1 and the attached cost is set to 0. FIG. 2 is a diagram showing an example of a dictionary referred to by the phrase candidate analysis means 12 at the time of analysis. The dictionary stores information on the words that make up Japanese phrases. For each word, a heading 21, a part of speech 22, a reading 23, a cost 2
4. Other information 25 is stored. FIG. 3 is a diagram showing an example of a connection table referred to by the phrase candidate analysis means 12 at the time of analysis. The connection table is a two-dimensional array that defines whether two adjacent words can be connected using the part of speech information defined in the dictionary. The part of speech of the row represents the part of speech of the word to the left of the adjacent word, and the part of speech of the column represents the part of speech of the word to the right of the adjacent word. If the value of the array element is 1, connection is possible,
If the value is 0, it means that connection is impossible. The part of speech in the column is at the beginning of a virtual phrase, which is provided to determine whether a word can be the end of a phrase. In the embodiment, the analysis of the input sentence is performed in accordance with the morphological analysis algorithm based on the above-mentioned minimum cost method. In this algorithm, first, words included in an input sentence are cut out using a dictionary, connection check with the immediately preceding word candidate, determination of the end of a phrase, and cost update processing are performed to extract a phrase candidate. Next, an analysis result that minimizes the sum of costs from the beginning of the sentence is extracted from the phrase candidates. FIG. 4 is a flowchart showing the operation of the embodiment. In step S1, the text content to be registered is stored in the input storage unit 11. In step S2,
A sentence is extracted from the text stored in the input storage means 11 using clues and line feeds as clues. In step 3, it is determined whether a sentence has been extracted. If no text is extracted, all text in the text has been processed and the process ends. If the text is extracted, the process proceeds to step 4. In step 4, the sentence is analyzed to obtain a phrase candidate, and the result is stored in the phrase candidate storage unit 13. Step 4 corresponds to the first step of the minimum cost method. In step 5, the phrase structure with the minimum cost is extracted from the phrase candidate storage unit 13, and the extracted result is stored in the analysis result storage unit 15. Step 5 corresponds to the second step of the minimum cost method. In step 6, the second key extracting means 1
8 extracts a key that matches the extraction condition stored in the extraction condition storage unit 16 from the analysis result storage unit 15 and stores the key in the extraction key storage unit 19. In step 7, the first
The key extracting unit 17 extracts a key that matches the extraction condition stored in the extraction condition storing unit 16 from the phrase candidate storing unit 13 and stores the key in the extracted key storing unit 19. FIG. 5A shows the first key extracting means 17 and the second key extracting means 17.
FIG. 4 is an explanatory diagram showing an example of an extraction condition referred to by a key extraction unit 18. The part of speech 51 and the lower limit 52 of the number of characters indicating the condition of the word length of the part of speech are associated as the extraction condition.
In this example, as an extraction condition, a general noun having a length of two or more characters is extracted, and a proper noun having a length of one or more characters is extracted, and is not registered in a dictionary (for example, alphanumeric characters, katakana, etc.). The condition ()) is based on the condition that a character having a length of two or more characters is extracted. FIG. 5B is an example of a content showing a part of the content of the document to be registered. It is assumed that the target document has a sentence 53 “Write to a text file.” FIG. 5C illustrates a boundary position of a character in a sentence.
FIG. 14 is a diagram illustrating position information referred to when an analysis result and a range of an extraction key are represented by a start point and an end point. For example, the character string “text” exists in the range of 0 and 4. The operation of extracting a keyword from the text shown in FIG. 5B will be specifically described below. In step S1 of FIG. 4, a text including a sentence “Write to a text file.” Is stored in the input storage unit 11. In step S2, a sentence is extracted using clues ".", A line feed, and the end of the file as clues. In the following description, a description will be given of the processing from step S2 onward when a sentence “write to a text file” is extracted at a certain point in time. Since the text is extracted in step S3, the process proceeds to step S4. In step S4, the first step of the minimum cost method is executed. Each boundary position (0, 1, 2,... 1) shown in FIG.
Word candidates starting with 5) are found by searching a dictionary as illustrated in FIG. Also, continuous katakana and alphabets may be unregistered words, so continuous katakana is not appropriate in the part-of-speech information (not registered in the dictionary in the example).
And a cost (in the embodiment, 80) are added to extract the word as a noun word unregistered word, and treat it as if it had been registered in the dictionary. After obtaining the dictionary information, it checks whether it can be connected to the immediately preceding word candidate based on the dictionary contents, and if it can be connected, calculates the minimum value of the sum of the dictionary information and the cost from the beginning, and stores the phrase candidate storage means. 13 is stored. The character string at the beginning of the current sentence is "text ...", and dictionary information of a partial character string of the input character string such as "te", "text", "text", and "text" is shown in FIG. Search the dictionary and ask for it. As a result, "text" is searched. In addition, since the first character “te” is katakana, the subsequent katakana character string is also cut out. As a result, a “text file” is cut out. No in FIG.
The phrase candidates No. 1 and No. 4 are stored in the phrase candidate storage means 3. The parts of speech are nouns and out of the dictionary. Since the costs are 70 and 80, respectively, and the cost at the beginning of the sentence is 0, the total cost from the beginning of the sentence is 7
0 and 80. The phrase end information means that if it is 0, it cannot be the end of the phrase, and if it is 1 or more, it can be the end of the phrase. The next cutout position is the smallest position among the end positions of the extracted word candidates. In this case, since “text” and “text file” have been extracted, the minimum end position is 4. Accordingly, as shown in FIG. 5C, a phrase candidate is cut out for "in the file ...". In this case, a file (not registered in the dictionary) and a file (noun class / symbol stem) are cut out, and phrase candidates such as No. 2 and No. 3 in the phrase candidate storage unit 3 are stored. My own costs are 80 and 70. The sum of the costs from the beginning of the sentence is 150 or 140, which is the sum of the minimum value of the sum of the costs at the start position (in this case, the minimum value at the fourth position) 70 and the cost of itself. Thereafter, the processing proceeds in the same manner, and after the execution of step S4, the phrase candidates as shown in FIG. In the next step S5, the second step of the minimum cost method is executed. Phrase candidate storage means 13
, The phrase structure with the minimum cost is extracted, and the extracted result is stored in the analysis result storage unit 15. Specifically, from the last clause candidate, the one with the smallest total cost is extracted from the beginning, and the candidate clauses connectable to this clause structure are extracted from the end of the sentence toward the beginning of the sentence. Whether or not connection is possible is determined based on the fact that they are adjacent, that connection is possible due to cost, and whether or not a connection is defined in the connection table. In the contents of the phrase candidate storage means 13 shown in FIG. 6, the last phrase candidate is "." The starting position is 15, the own cost is 300, and the total cost is 445. The phrase candidate immediately before the connection is a phrase candidate whose end position is 15, the total cost is 145, and the connection is defined in the connection table, that is, the phrase candidate of No. 39 in FIG. In the same manner, FIG. 7A shows the result of extracting phrase candidates having the minimum cost from the sentence end to the sentence start. FIG.
The content of (a) is a so-called morphological analysis result. In the next step S6, the second key extraction means 18 extracts, from the contents of the analysis result storage means 15 shown in FIG. 7A, the keys that match the extraction conditions stored in the extraction condition storage means 16 shown in FIG. 5A, and stores them in the extracted key storage means 19. Here too, a notation that has already been extracted is not extracted again. As a result of executing step S6, "text file" is extracted, and the content shown as No. 1 in FIG. 7B is registered in the extracted key storage means 19.

In the next step S7, the first key extraction means 17 extracts, from the contents of the phrase candidate storage means 13 shown in FIG. 6, the keys that match the extraction conditions stored in the extraction condition storage means 16, and stores them in the extracted key storage means 19. No. 2 to No. 6 in FIG. 7B are the keys extracted in step S7. Because the same notation is never extracted twice, of the "file" entries listed as No. 2 and No. 3, only one is extracted. After step S7, processing returns to step S2, and steps S2 to S7 are repeated for the next sentence until no further sentence can be extracted. The same effect is obtained if steps S6 and S7 in FIG. 4 are exchanged, provided that duplicate keys are still not extracted.

FIG. 7A shows the result of the morphological analysis of the sentence "Write to a text file." When keys are extracted from this analysis result alone under the extraction conditions, only "text file" is extracted. If an index for full-text search is created from such keys alone, a document containing "Write to a text file." cannot be retrieved by specifying a noun search term such as "text", "file", "write", "contain", or "masu". According to the present embodiment, keys such as those shown in FIG. 7B are also extracted from the intermediate result of the analysis shown in FIG. 6, so a document containing "Write to a text file." can be retrieved with noun search terms such as "text", "file", "write", "contain", and "masu", and search omissions are eliminated.

The present invention can also be implemented by modifying or replacing part of the embodiment described above as follows.

(1) In the keyword extraction of steps S6 and S7, not only the notation of each keyword but also its reading is extracted. When the notation is in katakana, the reading can be obtained by code conversion, so the key may be left in katakana. If readings are not extracted for katakana notations, then when a search word is in hiragana, a katakana search word is also generated and used for the search. With this configuration, keys such as "text file", "text", "file", "write", "contain", "masu", "kakikomi", and "komi" are extracted. As a result, the desired document can be retrieved by its reading even when the user knows the reading at search time but cannot specify the exact notation, or when the notation varies within the registered documents depending on whether okurigana (the kana written after a kanji stem) is present, as with the several ways of writing "kakikomi".

(2) The contents of the extraction condition storage means 16 are replaced with contents such as those shown in FIG. 8. FIG. 8 specifies that parts of speech other than nouns are also to be extracted. Here, a five-step (godan) verb means the stem of a five-step conjugation such as the ka, ga, sa, ta, ba, ma, ra, na, and a rows, among others. By specifying extraction conditions such as those in FIG. 8, keys such as "read (noun)", "read (ma-row five-step)", and "contain (ma-row five-step)" are extracted from a sentence such as "read".

(3) When the part of speech of a key to be extracted denotes a conjugating word, the key is extracted in its final (dictionary) form. For words that conjugate regularly (for example, five-step verbs and adjectives), the final form is obtained by attaching the final-form ending to the stem, that is, to the word with its conjugation ending removed. Extracting the final form of conjugating words in this way makes the search insensitive to differences in conjugation. For example, by extracting the final form "read" from the different conjugated forms of the verb that appear in the text, a document containing any of those conjugated forms can be retrieved with the search term "read". Likewise, by applying the same key extraction processing to the search expression, the final form "read" can be extracted from a search expression that contains a conjugated form, and documents containing other conjugated forms of the same verb can be retrieved.

According to the present invention, keywords for full-text search are extracted not only from the final result of morphological analysis but also from intermediate results, such as the candidates for words contained in the input sentence. For example, it is possible to extract not only compound words but also the words contained in compound words and the words contained in katakana character strings. In addition, using the reading and part-of-speech information obtained by morphological analysis, keys can be restored to their final form, their readings can be extracted, and the keywords to be extracted can be selected by part of speech. As a result, keywords for full-text search can be extracted in a way that eliminates search omissions.
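As a rough illustration of steps S6 and S7 described above, the following Python sketch collects keys first from the final analysis result and then from the phrase candidates, skipping any notation that has already been extracted. The `Word` class, the `extract_keys` function, the part-of-speech labels, and the sample entries are hypothetical stand-ins, since the actual contents of FIG. 6 and FIG. 7A are not reproduced in this text; only the six resulting keys follow the example given in the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    notation: str  # surface form (English glosses from the description are used here)
    pos: str       # part-of-speech label (labels are assumptions for illustration)


def extract_keys(final_result, phrase_candidates, allowed_pos):
    """Extract keys from the final result (S6), then from the candidates (S7).

    A notation that has already been extracted is never extracted again.
    """
    keys = []
    seen = set()
    for word in list(final_result) + list(phrase_candidates):
        if word.pos in allowed_pos and word.notation not in seen:
            seen.add(word.notation)
            keys.append(word.notation)
    return keys


# Hypothetical final (minimum-cost) analysis of "Write to a text file."
final_result = [
    Word("text file", "noun"),
    Word("to", "particle"),
    Word("write", "verb"),
    Word("masu", "auxiliary"),
]

# Hypothetical intermediate phrase candidates for the same sentence.
phrase_candidates = [
    Word("text file", "noun"),
    Word("text", "noun"),
    Word("file", "noun"),
    Word("file", "noun"),   # the same notation appears twice among the candidates
    Word("to", "particle"),
    Word("write", "noun"),
    Word("write", "verb"),
    Word("contain", "noun"),
    Word("masu", "noun"),
    Word("masu", "auxiliary"),
]

# Extraction condition assumed here: nouns only.
keys = extract_keys(final_result, phrase_candidates, allowed_pos={"noun"})
print(keys)
```

With a nouns-only condition, the final result alone contributes just "text file"; the remaining five keys come from the intermediate candidates, which is the point the embodiment makes about avoiding search omissions.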

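Modifications (1) and (3) above can be sketched in Python under some loudly stated assumptions. The function names, the row labels, and the Japanese sample strings are illustrative choices, not taken from the patent figures. The kana conversion relies on the fact that in Unicode each hiragana code point from U+3041 to U+3096 corresponds to the katakana code point 0x60 higher; this is one possible form of the "code conversion" mentioned in the text. The ending table lists the standard final-form endings of five-step (godan) verbs.

```python
# Final-form (dictionary-form) endings of five-step (godan) verbs, keyed by row.
# The row labels are hypothetical identifiers chosen for this sketch.
GODAN_FINAL_ENDING = {
    "ka": "く", "ga": "ぐ", "sa": "す", "ta": "つ", "na": "ぬ",
    "ba": "ぶ", "ma": "む", "ra": "る", "wa": "う",
}

KANA_OFFSET = 0x60  # distance between hiragana and katakana blocks in Unicode


def final_form(stem, row):
    """Attach the final-form ending for the given row to a verb stem."""
    return stem + GODAN_FINAL_ENDING[row]


def hiragana_to_katakana(text):
    """Convert hiragana to katakana; other characters pass through unchanged."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )


def katakana_to_hiragana(text):
    """Convert katakana to hiragana; other characters pass through unchanged."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


# Illustrative inputs (assumed, not taken from the figures).
print(final_form("読", "ma"))            # stem of a ma-row verb plus its ending
print(hiragana_to_katakana("かきこみ"))  # hiragana search word to katakana
print(katakana_to_hiragana("テキスト"))  # katakana notation to a hiragana reading
```

Applying the same normalization to both the registered text and the search expression is what allows a search to succeed regardless of conjugation or kana type.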
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a functional block diagram showing the schematic configuration of a keyword extraction device according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of the dictionary referred to by the phrase candidate analysis means 12 during analysis.
FIG. 3 is an explanatory diagram showing an example of the connection table referred to by the phrase candidate analysis means 12 during analysis.
FIG. 4 is a flowchart showing the operation of the embodiment.
FIG. 5A is a diagram showing an example of the extraction conditions referred to by the first key extraction means 17 and the second key extraction means 18 in the embodiment, FIG. 5B is a diagram showing an example of part of the content of a document to be registered in the embodiment, and FIG. 5C is a diagram illustrating the boundary positions between the characters of a sentence.
FIG. 6 is a diagram showing the contents of the phrase candidate storage means for the example sentence shown in FIG. 5B.
FIG. 7A is a diagram showing the morphological analysis result for the example sentence, and FIG. 7B is a diagram showing the result of also extracting keywords from the intermediate results of the morphological analysis.
FIG. 8 is a diagram showing another example of the extraction conditions.
[Description of Reference Numerals]
11: input storage means, 12: phrase candidate analysis means, 13: phrase candidate storage means, 14: analysis result extraction means, 15: analysis result storage means, 16: extraction condition storage means, 17: first key extraction means, 18: second key extraction means, 19: extracted key storage means

Continuation of the front page
(56) References: JP-A-4-211868 (JP, A); JP-A-6-301722 (JP, A); JP-A-7-73200 (JP, A); Satoshi Ito (伊藤哲) et al., "Keyword extraction method that can be optimized according to the purpose of use," IEICE Technical Report, The Institute of Electronics, Information and Communication Engineers (Japan), December 9, 1993, Vol. 93, No. 366, pp. 41-46, NLC93-53
(58) Field surveyed (Int. Cl.⁷, DB name): G06F 17/30; JICST file (JOIS)

Claims

(57) [Claims]
[Claim 1] A keyword extraction device comprising:
input storage means for storing the text content of a registered document;
phrase candidate analysis means for extracting phrase candidates by analyzing combinations of words that satisfy Japanese phrase structure rules;
phrase candidate storage means for storing the results analyzed and extracted by the phrase candidate analysis means;
analysis result extraction means for extracting the combination of words having the minimum cost from the contents of the phrase candidate storage means;
analysis result storage means for storing the analysis result extracted by the analysis result extraction means;
extraction condition storage means for storing conditions for extracting keywords from the contents of the phrase candidate storage means or the contents of the analysis result storage means;
first keyword extraction means for extracting, as keywords, those items in the contents of the phrase candidate storage means that match the conditions in the extraction condition storage means;
second keyword extraction means for extracting, as keywords, those items in the contents of the analysis result storage means that match the conditions in the extraction condition storage means; and
extracted keyword storage means for storing the keywords extracted by the first keyword extraction means and the keywords extracted by the second keyword extraction means.