JP2004151926A - Keyword extraction device, keyword extraction method, program and recording medium

Info

Publication number: JP2004151926A
Application number: JP2002315397A
Authority: JP
Inventors: Masayuki Kameda (亀田 雅之)
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2002-10-30
Filing date: 2002-10-30
Publication date: 2004-05-27
Anticipated expiration: 2022-10-30
Also published as: JP4222811B2

Abstract

PROBLEM TO BE SOLVED: To provide a keyword extraction device that extracts keywords from a document by evaluating candidates on the basis of a corrected word frequency, which takes their constituent words into account, and of the number of constituent words.

SOLUTION: A digitized document is divided into words, which are assigned parts of speech and, if necessary, analyzed for their basic forms. On the basis of the analysis results, keyword candidates are extracted according to keyword candidate extraction rules that describe the patterns to be extracted as keyword candidates. The extracted keyword candidates are evaluated on the basis of their constituent words, and keywords are extracted from the keyword candidates according to the evaluation results.

COPYRIGHT: (C) 2004, JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、キーワード抽出装置、キーワード抽出方法並びにプログラムおよび記録媒体に関し、具体的には、短単位の単語だけではなく、複合名詞や複合名詞句のような長い単位のキーワード抽出に関し、文書検索装置や文書登録装置等に応用して好適である。
【０００２】
【従来の技術】
文書中からキーワードを抽出することは、キーワードを検索や分類のために文書の情報として付与したり、文書検索結果の一覧表示で文書の内容を簡潔に表現する補助情報として表示する際に必要となる重要な技術である。
【０００３】
通常、文書中からキーワードを抽出するためには、形態素解析技術により、単語分割し品詞付けを行い、そのうちの特定の品詞（特に名詞）の単語についてそのキーワードとするか否かを評価する。
これは、対象分野での専門用語や単語のキーワード性（特許文献１参照）に基づき判定する。また、複合語であれば、それを構成する単語のキーワード性や複合語構成上の役割といった情報を用意し、それらに基づき判定する（非特許文献１参照）。
しかしながら、こうした判定に用いるキーワードに関する情報を事前に辞書等に用意しておくことを前提としているが、これらを設定・保守することは容易ではない。
【０００４】
そこで、こうしたキーワードのための情報を必要とせずに、キーワード候補を文字種により判別したり（特許文献２参照）、単語の長さとその使用頻度に基づいてキーワード性を計算する抽出装置が提案されている（特許文献３参照）。
しかしながら、キーワード候補の評価に、特許文献２および特許文献３で行っているような単語の出現頻度を考慮する場合、短単位の単語ベースのキーワードならよいが、複合名詞等をベースにするキーワードの場合は、同一のキーワード候補単語については、出現頻度に反映されるが、同一の構成単語を含んでいる関連単語の場合は、同一視されず、それぞれの出現頻度には反映されない。
【０００５】
また、単語の文書内頻度（Ｔｆ：ＴｅｒｍＦｒｅｑｕｅｎｃｙ）と対象文書データベース中での単語の出現文書数（Ｄｆ：ＤｏｃｕｍｅｎｔＦｒｅｑｕｅｎｃｙ）の逆数（Ｉｄｆ：ＩｎｖｅｒｔｅｄＤＦ）の積をキーワード指標とするＴｆ＊Ｉｄｆ法が知られている（非特許文献２参照）。
これは、キーワードのための情報を必要としない方法であるが、対象データベースがあらかじめ定まっている必要があり、可能性のあるキーワード候補について事前に統計処理を必要とする問題とともに、出現頻度についても、上述の問題がある。
【０００６】
また、上記のキーワードのための事前の情報や統計情報の取得の問題を回避し、また出現頻度に関わる問題を解決するために、特許文献４の「キーワード抽出装置」は、単語の複合度あるいは類似単語を出現頻度に反映した疑似出現頻度を用いて、キーワード候補の評価を改善し、また、単語長から単語複合度、重複文字列の割合から疑似出現頻度を簡易に得る方法である。
【０００７】
さらに、特許文献５の「キーワード抽出装置及びキーワード表示装置」は、疑似出現頻度を取得する場合、文書内でキーワード候補同士の組合せ数の計算で、計算量が多くなってしまうのを改善するために、文書内での文字単位の出現頻度により少ない計算量で疑似出現頻度を計算する方法である。
【０００８】
【特許文献１】
特開昭６２−２８７３３７号公報
【非特許文献１】
小川他「短単位キーワードに基づくテキストデータベース装置」、
情報処理学会、データベース研究会９０−６，１９９２
【特許文献２】
特開平１−０２８７７０号公報
【特許文献３】
特開昭６３−２４４２５９号公報
【非特許文献２】
長尾真、佐藤理史編「岩波講座ソフトウェア科学１５自然言語処理」，岩波書店、１９９６．４．２６，ｐ．４１１−４２１
【特許文献４】
特開平８−９５９８２号公報
【特許文献５】
特開平９−３１１８７１号公報
【０００９】
【発明が解決しようとする課題】
しかしながら、特許文献４の技術は、文字列の重複検査の計算量の問題があり、一方、特許文献５では、文字の出現頻度に基づく処理のため、キーワード候補と無関係な同一の文字が用いられている場合、統計に雑音が多く混入する問題がある。
また、特許文献４や特許文献５では、文字や文字列に意味がある日本語の複合語に効果がある方法であったが、英語などのアルファベットを用いた言語の単語に対しては有効でない。
【００１０】
本発明は、上述の実情を考慮してなされたものであって、キーワードのための事前の情報や統計情報の取得の問題を回避し、また出現頻度に関わる問題を単語レベルでの処理により上記の問題を回避するキーワード抽出装置、その方法並びにプログラムおよび記録媒体を提供することを目的とする。
【００１１】
【課題を解決するための手段】
上記の課題を解決するために、本発明の請求項１のキーワード抽出装置は、電子化された文書を単語に分割し、品詞を付与し、また必要に応じて単語の基本形を解析する単語解析部と、キーワードの候補として抽出すべきパターンを記述したキーワード候補抽出規則と、前記単語解析部により解析された解析結果をもとに、前記キーワード候補抽出規則によりキーワード候補を抽出するキーワード候補抽出部と、前記キーワード候補の構成単語に基づいて評価するキーワード候補評価部と、前記キーワード候補評価部が評価した結果に基づいて前記キーワード候補からキーワードを抽出するキーワード抽出部とを備えることを特徴とする。
【００１２】
また、本発明の請求項２は、請求項１に記載のキーワード抽出装置において、前記キーワード候補評価部は、前記キーワード候補の構成単語の基本形に基づき評価することを特徴とする。
【００１３】
また、本発明の請求項３は、請求項１または２に記載のキーワード抽出装置において、キーワードの構成単語として扱う品詞を定めた構成単語品詞リストを有し、前記キーワード候補評価部は、前記キーワード候補の構成単語のうち前記構成単語品詞リストに定めた品詞の単語だけを対象にすることを特徴とする。
【００１４】
また、本発明の請求項４は、請求項１または２に記載のキーワード抽出装置において、キーワードの構成単語として扱わない品詞を定めた非構成単語品詞リストまたはキーワードの構成単語として扱わない単語を定めた非構成単語リストの少なくとも一方を有し、前記キーワード候補評価部は、前記キーワード候補の構成単語のうち前記非構成単語品詞リストに定めた品詞の単語、または前記非構成単語リストに定めた単語のいずれかあるいは双方を対象にしないことを特徴とする。
【００１５】
また、本発明の請求項５は、請求項１乃至４のいずれかに記載のキーワード抽出装置において、前記キーワード候補評価部は、前記キーワード候補に対し、このキーワード候補の構成単語数、あるいは、このキーワード候補の構成単語数に応じた値の一方に応じて評価値を計算することを特徴とする。
【００１６】
また、本発明の請求項６は、請求項１乃至４のいずれかに記載のキーワード抽出装置において、前記キーワード候補評価部は、前記キーワード候補に対し、このキーワード候補の構成単語ごとの当該電子化文書内における出現頻度に基づく修正単語頻度、あるいは、このキーワード候補の構成単語数に応じた値の一方に応じて評価値を計算することを特徴とする。
【００１７】
また、本発明の請求項７は、請求項１乃至４のいずれかに記載のキーワード抽出装置において、前記キーワード候補評価部は、前記キーワード候補に対し、
（１）このキーワード候補の構成単語数あるいはこの構成単語数に応じた値、
および、
（２）このキーワード候補の構成単語ごとの当該電子化文書内における出現頻度に基づく修正単語頻度あるいはこのキーワード候補の構成単語数に応じた値、
の双方の重み付き線形和の値あるいは積によって評価することを特徴とする。
【００１８】
また、本発明の請求項８は、請求項６または７に記載のキーワード抽出装置において、前記修正単語頻度は、前記キーワード候補に対し、このキーワード候補の構成単語ごとの頻度数の総和をこのキーワード候補の構成単語数で除した値とすることを特徴とする。
【００１９】
また、本発明の請求項９は、請求項５、６または７に記載のキーワード抽出装置において、前記キーワード候補評価部は、前記キーワード候補に対し、前記構成単語数に応じた値として、構成単語数の対数値を用いることを特徴とする。
【００２０】
また、本発明の請求項１０は、請求項１乃至９のいずれかに記載のキーワード抽出装置において、前記キーワード抽出部は、前記キーワード候補抽出部が抽出したキーワード候補のうち、その構成単語のすべてが他のキーワード候補の構成単語に含まれている場合、当該キーワード候補はキーワードとしては抽出しないことを特徴とする。
【００２１】
また、本発明の請求項１１は、請求項１乃至１０のいずれかに記載のキーワード抽出装置において、前記キーワード抽出部が抽出したキーワードを前記キーワード候補評価部で評価したキーワード候補に対する評価値の大きい順に並べて出力するキーワード出力部を備えることを特徴とする。
また、本発明の請求項１２は、請求項１乃至１１のいずれかに記載のキーワード抽出装置において、前記キーワード抽出部が抽出したキーワードと同一の構成単語を含むキーワードを組にして出力するキーワード出力部を備えることを特徴とする。
【００２２】
また、本発明の請求項１３のキーワード抽出方法は、電子化された文書を単語に分割し、品詞を付与し、また必要に応じて単語の基本形を解析し、解析された解析結果をもとに、キーワードの候補として抽出すべきパターンを記述したキーワード候補抽出規則によりキーワード候補を抽出し、前記キーワード候補の構成単語に基づいて評価し、この評価した結果に基づいて前記キーワード候補からキーワードを抽出することを特徴とする。
【００２３】
また、本発明の請求項１４のプログラムは、コンピュータを用いて、文書データからキーワードを抽出するためのプログラムであって、前記コンピュータに、請求項１乃至１２のいずれかに記載のキーワード抽出装置の機能を実行させるためのプログラムである。
また、本発明の請求項１５の記録媒体は、請求項１４に記載のプログラムを記録したコンピュータ読み取り可能な記録媒体である。
【００２４】
以上の構成により、複合名詞や「の」や「ｏｆ」で結ばれた複合名詞句のレベルでキーワード抽出を行う場合、キーワード候補を構成する単語（構成単語）に着目することにより、事前にキーワードのための情報を用意したり、統計処理を行うことなくキーワード評価が可能になる。
また、英語のような変化のある単語をその基本形で構成単語を見ることにより、適切な頻度集計ができるので、より適切なキーワード評価が可能になる。
また、構成単語を扱う場合、キーワード評価に影響を与える品詞、あるいは影響を与えない品詞や単語を別途用意することで、内容語を優先したり、「の」や「ｏｆ」などの機能語を無視することで、より適切なキーワード評価が可能になる。
また、構成単語数の代わりに構成単語数の対数をとってキーワード候補の評価値を計算することによって、構成単語数が多いキーワード候補の過剰な評価を回避することが可能になる。
また、キーワード候補のその構成単語のすべてが、他のキーワード候補の構成単語に含まれている場合には、キーワードとしては抽出しないようにすることによって、キーワード集合の冗長性を排除することが可能になる。
【００２５】
【発明の実施の形態】
以下、図面を参照して本発明に係る好適な実施形態を説明する。
図１は、本発明のキーワード抽出装置が適用されるキーワード抽出システムの全体構成を示すブロック図であり、同図に示すように、キーワード抽出システムは、複数の電子化された文書データを保持する文書データベース１０と、キーボード、マウス、タッチパネル等により構成されて文書データベース１０中の文書名を指定する入力装置２０と、入力装置２０によって指定された文書名の電子化文書データを文書データベース１０から読み込む文書入力装置３０と、この読み込まれた文書データからキーワードを抽出する本発明に係るキーワード抽出装置４０と、抽出されたキーワードに関する情報をディスプレイやプリンタあるいはネットワークに接続した他の装置へ出力する出力装置５０とを含んでいる。
【００２６】
＜実施形態１＞
図２は、本発明のキーワード抽出装置４０に係る実施形態の機能構成を示すブロック図であり、キーワード抽出装置４０は、単語解析部４１、単語辞書４２、キーワード候補抽出部４３、キーワード候補抽出規則４４、キーワード候補評価部４５、キーワード抽出部４６とから構成される。
【００２７】
単語解析部４１は、文書データベース１０に記憶されたキーワード抽出装置４０の入力対象となる文書データを文書入力装置３０から渡され、品詞情報や基本形情報をもった単語辞書４２を参照し、対象となる文書データの文ごとに単語に分割し、品詞を付与し、また必要に応じて単語の基本形を解析して、単語単位の解析結果を出力する。
この単語解析部４１は、従来技術の形態素解析技術によって実現される。
【００２８】
例えば、図３に示す日本語文書の第４文（本文の先頭文）を単語解析部４１で処理した結果は、図４に示すような単語に分割される。
図４において、「基本形」とは、日本語では、活用しない単語はその単語そのもの、活用する単語はその終止形である。以降の説明では、活用しない単語の基本形は、単語そのものと同じであるから図中では省略している。
また、英語のような変化のある単語の場合は、適切な頻度集計ができないため、その基本形を用いることでより適切な評価が行えるようになる。
例えば、Ｃｏｍｐｕｔｅｒｓという単語が対象文書中に出てきた場合には、その基本形（Ｃｏｍｐｕｔｅｒ）を単語辞書から取り出すようにする。
【００２９】
キーワード候補抽出部４３は、単語解析部４１の結果をもとに、抽出するべきキーワード候補の単語のパターン（構成）に関する規則（キーワード候補抽出規則４４）に従いキーワード候補を抽出する。
キーワード候補抽出規則４４は、例えば、図５に示したように、リーフノードが品詞名である書換え規則として、下記の記法のもとで記述してある。
【００３０】
Ｘ＋Ｙ：ＸとＹがこの順序で並んでいる
（Ｘ｜Ｙ｜Ｚ）：ＸかＹかＺ
｛Ｘ｝．：Ｘが省略されるか、Ｘが１つ存在する
｛Ｘ｝＊：Ｘの０以上の繰り返し
｛Ｘ｝＋：Ｘの１以上の繰り返し
【００３１】
図５に例示した規則では、キーワード候補として次の２種類を定義している。
（１）キーワード候補Ａは、前／後に接頭辞／接尾辞の付加を許す名詞類の１つ以上の並びあるいはその複合の並びであり、単独の名詞あるいは名詞連続の複合名詞がキーワード候補となる。
（２）キーワード候補Ｂは、上記のキーワード候補Ａの複合名詞が格助詞「の」で連結された並びであり、「の」で連結された名詞句がキーワード候補となる。
【００３２】
この図５の規則によれば、図３の第４文からは、図６に示すキーワード候補が抽出される。ここで、抽出された各キーワード候補には、図７に示すように、各キーワード候補を構成する単語と品詞および基本形の情報が付加されている。
【００３３】
キーワード候補の抽出方法は、上述のようなキーワード候補抽出規則を使用するものに限定はされず、従来技術によって構成されてもよい。
また、単語解析部４１とキーワード候補抽出部４３とは、対象文書全体の単語解析結果を得てからキーワード候補を抽出するようにしてもよいし、あるいは、１文ごとに単語解析結果を得て、その文についてのキーワード候補を抽出することを全文の処理が終わるまで繰り返すようにしてもよい。
【００３４】
キーワード候補評価部４５は、キーワード候補抽出部４３により抽出されたキーワード候補について、その構成単語に基づいてキーワード候補の評価を行う。
ここで、キーワード候補中の構成単語に英語のような変化のある単語が含まれている場合は、その単語の基本形に替えて頻度を収集するようにして、適切な頻度集計ができ、より適切なキーワード評価が可能となる。
この評価は、例えば、キーワード候補Ｋｊについて、次のような評価値Ｅ（Ｋｊ）によって行う。
【００３５】
Ｅ（Ｋｊ）＝Ｆ（ＷＮｊ，Ｗｆｊ）
ここで、ＷＮｊ：キーワード候補Ｋｊの構成単語数
Ｗｆｊ：キーワード候補Ｋｊの修正単語頻度

また、この修正単語頻度Ｗｆｊは次のようにして求める。
Ｗｆｊ＝Ｓｆｊ／ＷＮｊ
ここで、Ｓｆｊ：キーワード候補Ｋｊの構成単語ごとの頻度の総和
【００３６】
例えば、上記の関数Ｆは次のいずれとしてもよい。
Ｅ（Ｋｊ）＝ａ＊Ａ（ＷＮｊ）＋ｂ＊Ｂ（Ｗｆｊ）
あるいは、
Ｅ（Ｋｊ）＝Ａ（ＷＮｊ）＊Ｂ（Ｗｆｊ）
【００３７】
上記の式で、ａおよびｂは重み係数であり、Ａ（Ｘ）およびＢ（Ｘ）はＸをパラメータとする関数であり、例えば、上記第１の式で、ａ＝ｂ＝１、Ａ（Ｘ）＝Ｘ、Ｂ（Ｘ）＝Ｘとすると、
Ｅ（Ｋｊ）＝ＷＮｊ＋Ｗｆｊ
であり、以下の説明では、この評価式を使用することにする。
【００３８】
さらに、上記の第１の式で、ａ＝ｂ＝１、Ａ（Ｘ）＝ｌｏｇ（Ｘ）、Ｂ（Ｘ）＝Ｘとすると、
Ｅ（Ｋｊ）＝ｌｏｇ（ＷＮｊ）＋Ｗｆｊ
であり、また、上記第２の式で、ａ＝１、Ａ（Ｘ）＝ｌｏｇ（Ｘ）、Ｂ（Ｘ）＝Ｘとすると、
Ｅ（Ｋｊ）＝ｌｏｇ（ＷＮｊ）＊Ｗｆｊ
という評価式が使用できる。
【００３９】
図８のフローチャートをもとに、キーワード候補評価部４５の処理手順について詳細に説明する。
まず、単語解析部４１で解析された単語に対して、単語ごとの頻度を集計した単語表を作成して記憶装置等に格納する（ステップＳ１）。
単語解析部４１で図４のように分割された第４文に関する単語表は、図９に示したようになる。尚、以降で、この表を参照して、キーワード候補の構成単語ごとに頻度を求めるため、単語は辞書順にソートされている方がよい。
【００４０】
キーワード候補抽出部４３で抽出されたキーワード候補に関する表を作成して記憶装置等に格納する（ステップＳ２）。
このキーワード候補表は、キーワード候補抽出部４３で抽出の際に得られたキーワード候補Ｋｊごとに、単語構成、頻度、構成単語数ＷＮｊ、構成単語の総頻度Ｓｆｊ、修正単語頻度Ｗｆｊおよびキーワード評価値Ｅ（Ｋｊ）からなる情報を保持する。
また、キーワード候補は、単語表と同様に重複して現れることもあるので、同一のキーワード候補の重複は取り除き、頻度として計数して、表中に設定する。さらに、キーワード候補の順序は、本説明では出現順とするが、これに特に限定するものではない。
【００４１】
図１０は、第４文中のキーワード候補の設定状態を抜き出した図で、単語構成の情報と頻度が設定されている。図中、単語構成に関する欄では、「｜」で単語の切れ目、「．」で接辞（接頭辞、接尾辞）の切れ目を表すものとする。
また、図中の第４文のキーワード候補では、「工業製品」が他に１回、「輸出規制」が他に２回出現しているので、頻度が各々２、３となっているが、他の頻度は１になっている。尚、「工業製品」と「輸出規制」は、各々第１文、第２文に現れているので、他のキーワード候補より先行して出ている。
【００４２】
以下のステップＳ３からＳ９までは、キーワード候補ごとに処理が行われる。キーワード候補表から順次キーワード候補を取り出し、すべてのキーワード候補の処理が終われば（ステップＳ３の「なし」）、処理を終了する。
これにより、キーワード候補表のすべての欄が埋まり、すべてのキーワード候補の評価値が得られる。
【００４３】
キーワード候補表からキーワード候補を取り出せた場合（ステップＳ３の「あり」）、対象のキーワード候補の「構成単語数ＷＮｊ」欄と「構成単語総頻度Ｓｆｊ」欄を、以降の加算処理のためにゼロクリアする（ステップＳ４）。
例えば、図１１のキーワード候補「工業製品」の「構成単語数ＷＮｊ」欄と「構成単語総頻度Ｓｆｊ」欄がゼロに設定されている。
【００４４】
次のステップＳ５からＳ７までは、対象キーワード候補の構成単語ごとに処理が行われる。
キーワード候補の単語構成に示される構成単語について、順次構成単語Ｗｊｉを取り出し、すべての構成単語の処理が終われば（ステップＳ５の「なし」）、構成単語ごとの処理を終了し、対象キーワード候補の構成単語数と構成単語総頻度が得られるステップＳ８へ進む。
【００４５】
キーワード候補の単語構成に示される構成単語について、構成単語を取り出せた場合（ステップＳ５の「あり」）、その構成単語Ｗｊｉに対する「構成単語数ＷＮｊ」欄に１を加算する（ステップＳ６）。
例えば、単語構成の「工業｜製品」のうちの「工業」を取り出した場合、構成単語「工業」分として、「構成単語数ＷＮｊ」欄に１を加算する。
【００４６】
単語表を参照して、この構成単語Ｗｊｉの単語頻度を得て、「構成単語総頻度Ｓｆｊ」欄に加算し（ステップＳ７）、再び、ステップＳ５に戻る。
例えば、上記の単語構成の「工業｜製品」の場合には、構成単語「製品」を取り出し、「構成単語数ＷＮｊ」欄に１加算し、「製品」の単語頻度「２」を「構成単語総頻度Ｓｆｊ」欄に加算すると、キーワード候補表の「工業製品」は図１２のように設定される。
【００４７】
修正単語頻度Ｗｆｊを計算し、「修正単語頻度Ｗｆｊ」欄に設定する（ステップＳ８）。
この修正単語頻度Ｗｆｊは、例えば、「構成単語総頻度Ｓｆｊ」欄の値を「構成単語数ＷＮｊ」の値で除した値として計算する。
図１２のキーワード候補「工業製品」では、Ｓｆｊが「４」、ＷＮｊが「２」であるから、Ｗｆｊは４／２＝２となる。
【００４８】
最後に、キーワード候補の「評価値Ｅ（Ｋｊ）」を計算し、「評価値Ｅ（Ｋｊ）」欄に設定し（ステップＳ９）、再び、ステップＳ３に戻り、以降、順次キーワード候補を取り出しながら、全キーワード候補に対して、上記の処理を繰り返す。
このキーワード候補の「評価値Ｅ（Ｋｊ）」は、例えば、「構成単語数ＷＮｊ」と「修正単語頻度Ｗｆｊ」を加算した値として設定する。
【００４９】
上述した例のキーワード候補「工業製品」では、ＷＮｊが「２」、Ｗｆｊが「２」であるから、Ｅ（Ｋｊ）＝２＋２＝４となるので、「工業製品」に対するキーワード候補表は図１３のような設定になる。
さらに、第４文の最後のキーワード候補「日本」の処理が終わった時点でのキーワード候補表は図１４のようになり、また、全キーワード候補の処理が終わった時点でのキーワード候補表は図１５のようになる。
【００５０】
キーワード抽出部４６は、キーワード候補評価部４５により評価された結果に基づいて、評価値が高いキーワード候補ほどキーワード性が高いという仮定から、次のような基準でキーワード候補の中からキーワードを抽出する。
（１）評価値が上位Ｎのキーワード候補
（２）評価値がキーワード候補全体の上位Ｍ分の一のキーワード候補
（３）評価値がＡ以上のキーワード候補
【００５１】
例えば、上位１０位までを抽出基準とすると、図１６に示したようなキーワードを抽出できる（図１６では、同点があるので１１キーワードが抽出される）。
【００５２】
図１７のフローチャートを用いて、本実施形態１の処理手順を説明する。
まず、利用者あるいは利用アプリケーションが指示したキーワード抽出の対象文書について、品詞情報や基本形情報をもった単語辞書を参照し、対象となる文書の文ごとに単語に分割し、品詞を付与し、また必要に応じて単語の基本形を解析して、単語単位の解析結果を出力する（ステップＳ１１）。
例えば、図３の日本語文書の第４文（本文の先頭文）の単語解析処理した結果は図４のようになる。
【００５３】
ステップＳ１１の文書の単語解析処理の結果をもとに、抽出するべきキーワード候補の単語のパターン（構成）に関する規則（例えば、図５に示すキーワード候補抽出規則）に従ってキーワード候補を抽出する（ステップＳ１２）。
図５の規則によれば、図３の第４文部分からは、図６のキーワード候補が抽出され、抽出された各キーワード候補には、図７のようにこれを構成する単語と品詞および基本形の情報が付加されている。
【００５４】
ステップＳ１２のキーワード候補抽出処理により抽出されたキーワード候補について、その構成単語に基づいてキーワード候補の評価を行う（ステップＳ１３）。
全キーワード候補の処理が終わった時点でのキーワードの評価値は図１５に示すように設定される。
【００５５】
この評価は、例えば、キーワード候補Ｋｊについて、次のような評価値Ｅ（Ｋｊ）を計算する。
Ｅ（Ｋｊ）＝Ｆ（ＷＮｊ，Ｗｆｊ）
ここで、ＷＮｊ：Ｋｊの構成単語数
Ｗｆｊ：Ｋｊの修正単語頻度

【００５６】
この修正単語頻度Ｗｆｊは次のように求める。
Ｗｆｊ＝Ｓｆｊ／ＷＮｊ
ここで、Ｓｆｊ：Ｋｊの構成単語ごとの頻度の総和
【００５７】
例えば、これらの計算値から次のいずれかの式で評価値を求める。
Ｅ（Ｋｊ）＝ａ＊Ａ（ＷＮｊ）＋ｂ＊Ｂ（Ｗｆｊ）
または
Ｅ（Ｋｊ）＝Ａ（ＷＮｊ）＊Ｂ（Ｗｆｊ）
【００５８】
ステップＳ１３のキーワード候補評価処理により評価された結果に基づいてキーワード抽出を行う（ステップＳ１４）。
評価値が高いキーワード候補ほどキーワード性が高いので、いずれかを基準としてキーワードを抽出する。
（１）評価値が上位Ｎのキーワード候補
（２）評価値がキーワード候補全体の上位Ｍ分の一のキーワード候補
（３）評価値がＡ以上のキーワード候補
例えば、上位１０位までをキーワードとして抽出するとすると、図１６のようになる（同点があるので１１キーワード）。
【００５９】
＜実施形態２＞
上述の実施形態１では、キーワード候補のその構成単語のすべてが、他のキーワード候補の構成単語に含まれている場合であっても、キーワード候補をキーワードとして抽出していた。
本実施形態２では、キーワード候補のその構成単語のすべてが、他のキーワード候補の構成単語に含まれている場合には、キーワードとしては抽出しないようにする。これにより、キーワード集合の冗長性を排除することが可能になる。
【００６０】
このために、キーワード抽出部４６において、キーワード候補抽出部４３で記憶装置等に記憶したキーワード候補表中のキーワード候補について、その構成単語のすべてを含む他のキーワード候補があるか否かを検査し、ある場合は、キーワード候補表において無効の扱いをする。
この処理は、
・構成単語数の少ないキーワード候補から検査する
・自身の構成単語数より構成単語数が少ないキーワード候補は検査しない
等により処理の効率化ができる。
【００６１】
本実施形態２のキーワード抽出部４６によれば、図１５に示したキーワード候補のうち、
・「通常兵器」は、「通常兵器関連」に含まれているので無効とする。
・「規制」は、「輸出規制」に含まれているので無効とする。
・「対象」は、「規制対象」に含まれているので無効とする。
・「国」は、「３カ国」に含まれているので無効とする。
・「共産圏」は、「対共産圏輸出統制委員会」に含まれているので無効とする。
・「輸出」は、「輸出規制」に含まれているので無効とする。
【００６２】
これにより、上位１０位までをキーワードとして抽出すると、図１８に示すような結果になる（１１キーワード）。
【００６３】
＜実施形態３＞
上述の実施形態では、キーワード候補評価部４５で評価値を計算するために構成単語数を用いていたが、本実施形態３では構成単語数の代わりに構成単語数の対数値を用いて評価値を計算する。
図１５のキーワード候補表に対して、ｅを底とする自然対数として計算して評価値を求めた場合、図１９に示すように「評価値２」の欄を追加したキーワード候補表が求められる。ここで、図１９は、実施形態２の構成単語が含まれるキーワード候補を排除した場合である。
この図１９のキーワード候補表から評価値２の上位１０位は、図２０に示したようになり、「朝鮮民主主義人民共和国」が上位１０位からおちているのが確認できる。
このように、構成単語数の対数をとることにより構成単語数が多いキーワード候補の過剰評価を抑える効果がある。
【００６４】
＜実施形態４＞
上述の実施形態によって抽出されたキーワードを次のようにディスプレイのような表示装置やプリンタのような出力装置へ出力することによって、利用者にわかりやすく出力することができる。
（１）評価値の大きい順に出力する。
例えば、図１８または図１９のように抽出されたキーワードを順位、評価値およびキーワードを組として図２１または図２２に示すように出力する。
【００６５】
（２）同一の構成単語を含むキーワードを組にして出力する。
この出力形式は多数考えられるが、例えば、図２１に示したキーワードを図２３に示すように出力する。
これは、上位のキーワードの構成単語をキーとして、同一の構成単語をもつキーワードを評価値の順に並べる。重複して現れる場合は、省略してもよいが、図２３では括弧を付けて重複を表している。また、その構成単語を含む他のキーワードがなければ、その構成単語の行は表示しない。
【００６６】
＜実施形態５＞
上述した実施形態では、キーワード候補抽出部４３で用いたキーワード候補抽出規則４４による２種類のキーワード候補（単独の名詞あるいは名詞連続の複合名詞（キーワード候補Ａ）と、複合名詞が格助詞「の」で連結された並び（キーワード候補Ｂ））のうち、キーワード候補Ｂについて説明しなかった。
【００６７】
まず、キーワード候補評価部４５において、キーワード候補Ｂを含んだキーワード候補の評価値に影響を与える構成単語の選定について説明する。
例えば、図６および図７では、キーワード候補Ｂが
（７）通常兵器の部品
（８）工業製品の輸出規制
として抽出されている。
上述の実施形態のように、「の」を構成単語として含めると、単語解析部４１の結果「の」の頻度は１３（図９参照）であるから、「通常兵器の部品」の構成単語数は４、総頻度は１８となり、修正単語頻度は４．５、従って評価値は８．５になる。
また、「工業製品の輸出規制」の評価値は、５＋（２＋２＋７＋７＋１３）／５＝１１．２となる。
したがって、キーワード候補の評価値に、「の」による過大評価が生じてしまう。
一方、「の」を構成単語に数えないようにすると、上記の２つのキーワード候補と対象文書中の他のキーワード候補Ｂは、図２４のように評価値を与えられ、適切なキーワード評価を行える。
【００６８】
したがって、本実施形態５では、キーワード候補中の構成単語の選定のために、次のいずれか一方または双方の処理を加える。
（１）構成単語品詞リストを用意して、特定の品詞の単語だけを構成単語とする。
例えば、「一般名詞」等を含むような構成単語品詞リストを用いることによって、キーワード評価に影響を与える品詞、あるいは影響を与えない品詞を指定できるので、内容語を優先したより適切なキーワード評価が可能になる。
【００６９】
（２）非構成単語品詞リストを用意して特定の品詞の単語を構成単語から除く、あるいは、非構成単語リストを用意して特定の単語を構成単語から除く。
上述の場合、「格助詞の」や「接辞（接頭辞、接尾辞）」を含むような非構成単語品詞リストを用いることによって、「の」による過大評価を生じないようにできる。
【００７０】
このように構成単語を扱う場合、キーワード評価に影響を与える品詞、あるいは影響を与えない品詞や単語を別途用意することで、内容語を優先したり、「の」や「ｏｆ」などの機能語を無視することで、より適切なキーワード評価が可能になる。
【００７１】
＜実施形態６＞
さらに、本発明は上述した実施形態のみに限定されたものではない。上述した実施形態のキーワード抽出装置を構成する各機能（単語解析部、キーワード候補抽出部、キーワード候補評価部、キーワード抽出部等）をそれぞれプログラム化し、単語辞書、キーワード抽出規則、構成単語品詞リスト等をデータ化し、あらかじめＣＤ−ＲＯＭ等の記録媒体に書き込んでおき、コンピュータに搭載したＣＤ−ＲＯＭドライブのような媒体駆動装置にこのＣＤ−ＲＯＭを装着して、これらのプログラムをコンピュータのメモリあるいは記憶装置に格納し、それを実行することによって、本発明の目的が達成されることは言うまでもない。
この場合、記録媒体から読み出されたプログラム自体が上述した実施形態の機能を実現することになり、そのプログラムおよびそのプログラムを記録した記録媒体も本発明を構成することになる。
【００７２】
なお、記録媒体としては半導体媒体（例えば、ＲＯＭ、不揮発性メモリカード等）、光媒体（例えば、ＤＶＤ、ＭＯ、ＭＤ、ＣＤ−Ｒ等）、磁気媒体（例えば、磁気テープ、フレキシブルディスク等）等のいずれであってもよい。
【００７３】
また、ロードしたプログラムを実行することにより上述した実施形態の機能が実現されるだけでなく、そのプログラムの指示に基づき、オペレーティングシステム等が実際の処理の一部または全部を行い、その処理によって上述した実施形態の機能が実現される場合も含まれる。
【００７４】
また、上述したプログラムをサーバコンピュータの磁気ディスク等の記憶装置に格納しておき、インターネット等の通信網で接続された利用者のコンピュータからダウンロード等の形式で頒布する場合、このサーバコンピュータの記憶装置も本発明の記録媒体に含まれる。
【００７５】
【発明の効果】
以上説明したように本発明によれば、複合名詞や「の」や「ｏｆ」で結ばれた複合名詞句のレベルでキーワード抽出を行う場合、キーワード候補を構成する単語（構成単語）に着目することにより、事前にキーワードのための情報を用意したり、統計処理を行うことなくキーワード評価が可能になる。
また、英語のような変化のある単語をその基本形で構成単語を見ることにより、適切な頻度集計ができるので、より適切なキーワード評価が可能になる。
また、構成単語を扱う場合、キーワード評価に影響を与える品詞、あるいは影響を与えない品詞や単語を別途用意することで、内容語を優先したり、「の」や「ｏｆ」などの機能語を無視することで、より適切なキーワード評価が可能になる。
また、構成単語数の代わりに構成単語数の対数をとってキーワード候補の評価値を計算することによって、構成単語数が多いキーワード候補の過剰な評価を回避することが可能になる。
また、キーワード候補のその構成単語のすべてが、他のキーワード候補の構成単語に含まれている場合には、キーワードとしては抽出しないようにすることによって、キーワード集合の冗長性を排除することが可能になる。
【図面の簡単な説明】
【図１】本発明のキーワード抽出装置が適用されるキーワード抽出システムの全体構成を示すブロック図である。
【図２】本発明のキーワード抽出装置に係る実施形態１の機能構成を示すブロック図である。
【図３】キーワード抽出の説明で使用する日本語文書の例である。
【図４】図３に示す日本語文書の第４文（本文の先頭文）を単語に分割した例である。
【図５】キーワード候補抽出規則の例である。
【図６】図５のキーワード候補抽出規則によって、図３の第４文から抽出されたキーワード候補の例である。
【図７】抽出された各キーワード候補を構成する単語と品詞および基本形の情報を示す例である。
【図８】キーワード候補評価部の処理手順を示すフローチャートである。
【図９】図４の文例で単語ごとに頻度を集計した単語表の例である。
【図１０】図４の文例の第４文中のキーワード候補の設定状態を示すキーワード候補表の例である。
【図１１】キーワード候補の「構成単語数ＷＮｊ」欄と「構成単語総頻度Ｓｆｊ」欄がゼロに設定されている状態を示す図である。
【図１２】単語構成のうちの構成単語分として、「構成単語数ＷＮｊ」欄および「構成単語総頻度Ｓｆｊ」欄に頻度を加算した場合の状態を示す図である。
【図１３】キーワード候補「工業製品」に対するキーワード候補表の設定状態を示す図である。
【図１４】第４文の最後のキーワード候補「日本」の処理が終わった時点でのキーワード候補表の設定状態を示す図である。
【図１５】全キーワード候補の処理が終わった時点でのキーワード候補表の設定状態を示す図である。
【図１６】上位１０位までを抽出基準として抽出したキーワードの例を示す図である。
【図１７】実施形態１の処理手順を示すフローチャートである。
【図１８】キーワード候補のその構成単語のすべてが、他のキーワード候補の構成単語に含まれている場合に抽出されるキーワードの例を示す図である。
【図１９】キーワード候補の評価を構成単語数ではなく、構成単語数の対数とした場合のキーワード候補表の設定状態を示す図である。
【図２０】図１９のキーワード候補表から上位１０位を抽出したキーワードの例を示す図である。
【図２１】図１８のように抽出されたキーワードの出力例である。
【図２２】図１９のように抽出されたキーワードの出力例である。
【図２３】同一の構成単語を含むキーワードを組にした出力例である。
【図２４】複合名詞が格助詞「の」で連結された並びからなるキーワード候補のキーワード候補表の設定状態を示す図である。
【符号の説明】
１０…文書データベース、２０…入力装置、３０…文書入力装置、４０…キーワード抽出装置、４１…単語解析部、４２…単語辞書、４３…キーワード候補抽出部、
４４…キーワード候補抽出規則、４５…キーワード候補評価部、４６…キーワード抽出部、５０…出力装置。
[0001]
[Technical field of the invention]
The present invention relates to a keyword extraction device, a keyword extraction method, a program, and a recording medium. More specifically, it relates to extracting not only short-unit words but also long-unit keywords such as compound nouns and compound noun phrases, and is suitable for application to document retrieval devices, document registration devices, and the like.
[0002]
[Prior art]
Extracting keywords from a document is an important technique. It is needed when keywords are attached to a document as information for search and classification, and when they are displayed in a list of document search results as auxiliary information that concisely expresses the content of each document.
[0003]
Normally, to extract keywords from a document, morphological analysis is used to divide the text into words and assign parts of speech, and each word of a specific part of speech (nouns in particular) is then evaluated as to whether it should be a keyword.
This judgment is based on technical terms of the target field and on the keyword-likeness of words (see Patent Document 1). For a compound word, information such as the keyword-likeness of its constituent words and their roles in forming the compound is prepared, and the judgment is based on that information (see Non-Patent Document 1).
However, these approaches presuppose that the keyword information used for the judgment is prepared in advance in a dictionary or the like, and such information is not easy to set up and maintain.
[0004]
Therefore, extraction devices have been proposed that need no such keyword information and instead identify keyword candidates by character type (see Patent Document 2) or calculate keyword-likeness from word length and frequency of use (see Patent Document 3).
However, when word appearance frequency is taken into account in evaluating keyword candidates, as in Patent Documents 2 and 3, this works for keywords based on short-unit words but not for keywords based on compound nouns and the like. Occurrences of the identical candidate are reflected in its appearance frequency, but related words that merely share a constituent word are not treated as the same, so they are not reflected in each other's appearance frequencies.
[0005]
Also known is the Tf * Idf method, which uses as a keyword index the product of the frequency of a word within a document (Tf: Term Frequency) and the reciprocal (Idf: Inverted DF) of the number of documents in the target document database in which the word appears (Df: Document Frequency) (see Non-Patent Document 2).
This method requires no keyword information, but the target database must be fixed in advance and statistics must be computed beforehand for every possible keyword candidate. It also has the appearance-frequency problem described above.
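For reference, the Tf * Idf index can be sketched as follows. This is only an illustration of the formula as stated in the text, not code from any cited document: the function name `tf_idf` and the toy corpus are assumptions, and Idf is taken literally as the reciprocal 1/Df (practical systems often use a logarithmic variant instead).

```python
from collections import Counter

def tf_idf(doc_tokens, corpus):
    """Score each word of one document by Tf * (1 / Df).

    doc_tokens: list of words in the target document.
    corpus: list of token lists, one per document in the database.
    """
    tf = Counter(doc_tokens)
    # Df: number of documents in which each word appears at least once.
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))
    return {word: tf[word] / df[word] for word in tf if df[word] > 0}

# Hypothetical three-document database.
corpus = [
    ["export", "control", "export"],
    ["control", "rule"],
    ["rule", "control"],
]
scores = tf_idf(corpus[0], corpus)
```

Note that `corpus` has to be available before any score can be computed, which is the dependence on a predetermined database that the text points out.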
[0006]
To avoid the problem of acquiring prior information and statistical information for keywords, and to solve the problem related to appearance frequency, the "keyword extraction device" of Patent Document 4 improves the evaluation of keyword candidates by using the degree of compounding of a word, or a pseudo appearance frequency in which similar words are reflected in the appearance frequency. It obtains the degree of compounding simply from the word length, and the pseudo appearance frequency from the proportion of overlapping character strings.
[0007]
Further, when the pseudo appearance frequency is obtained, computing over the combinations of keyword candidates within a document makes the amount of calculation large. To improve on this, the "keyword extraction device and keyword display device" of Patent Document 5 calculates the pseudo appearance frequency with less computation from the appearance frequency of individual characters in the document.
[0008]
[Patent Document 1]
JP-A-62-287337
[Non-Patent Document 1]
Ogawa et al., "Text database device based on short-unit keywords", Information Processing Society of Japan, Database Research Group 90-6, 1992
[Patent Document 2]
JP-A-1-28770
[Patent Document 3]
JP-A-63-244259
[Non-Patent Document 2]
Makoto Nagao and Satoshi Sato (eds.), "Iwanami Koza Software Science 15: Natural Language Processing", Iwanami Shoten, April 26, 1996, pp. 411-421
[Patent Document 4]
JP-A-8-95982
[Patent Document 5]
JP-A-9-311871
[0009]
[Problems to be solved by the invention]
However, the technique of Patent Document 4 has the problem of the amount of calculation needed to check for overlapping character strings. Patent Document 5, on the other hand, processes character appearance frequencies, so when the same character is used in contexts unrelated to a keyword candidate, a large amount of noise enters the statistics.
Moreover, the methods of Patent Documents 4 and 5 are effective for Japanese compound words, in which characters and character strings carry meaning, but they are not effective for words in languages written with an alphabet, such as English.
[0010]
The present invention has been made in view of the above circumstances. Its object is to provide a keyword extraction device, a method therefor, a program, and a recording medium that avoid the problem of acquiring prior information and statistical information for keywords and that avoid the above problems related to appearance frequency by processing at the word level.
[0011]
[Means for Solving the Problems]
To solve the above problems, the keyword extraction device of claim 1 of the present invention comprises: a word analysis unit that divides a digitized document into words, assigns parts of speech, and, as necessary, analyzes the basic forms of the words; a keyword candidate extraction rule that describes patterns to be extracted as keyword candidates; a keyword candidate extraction unit that extracts keyword candidates according to the keyword candidate extraction rule on the basis of the analysis result of the word analysis unit; a keyword candidate evaluation unit that evaluates each keyword candidate on the basis of its constituent words; and a keyword extraction unit that extracts keywords from the keyword candidates on the basis of the evaluation by the keyword candidate evaluation unit.
[0012]
According to claim 2 of the present invention, in the keyword extraction device of claim 1, the keyword candidate evaluation unit performs the evaluation on the basis of the basic forms of the constituent words of the keyword candidate.
[0013]
According to claim 3 of the present invention, in the keyword extraction device of claim 1 or 2, the device has a constituent-word part-of-speech list that defines the parts of speech to be treated as constituent words of a keyword, and the keyword candidate evaluation unit considers only those constituent words of the keyword candidate whose part of speech is defined in the constituent-word part-of-speech list.
[0014]
According to claim 4 of the present invention, in the keyword extraction device of claim 1 or 2, the device has at least one of a non-constituent-word part-of-speech list, which defines parts of speech not to be treated as constituent words of a keyword, and a non-constituent-word list, which defines words not to be treated as constituent words of a keyword. The keyword candidate evaluation unit excludes from consideration those constituent words of the keyword candidate whose part of speech is defined in the non-constituent-word part-of-speech list, those defined in the non-constituent-word list, or both.
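The two ways of selecting constituent words described above (an allow list of parts of speech, or exclusion lists of parts of speech and words) can be sketched as below. This is a minimal illustration under assumptions: the function name `select_constituents`, its parameters, and the part-of-speech labels are hypothetical, and the sample candidate is the phrase 工業製品の輸出規制 that appears in the description.

```python
def select_constituents(words, allowed_pos=None, excluded_pos=(), excluded_words=()):
    """Return the words of a candidate that count as constituent words.

    words: list of (surface, part_of_speech) pairs.
    allowed_pos: if given, only these parts of speech are kept.
    excluded_pos / excluded_words: parts of speech or words to drop.
    """
    kept = []
    for surface, pos in words:
        if allowed_pos is not None and pos not in allowed_pos:
            continue
        if pos in excluded_pos or surface in excluded_words:
            continue
        kept.append(surface)
    return kept

# Part-of-speech labels here are illustrative only.
candidate = [("工業", "noun"), ("製品", "noun"), ("の", "case particle"),
             ("輸出", "noun"), ("規制", "noun")]
by_allow_list = select_constituents(candidate, allowed_pos={"noun"})
by_stop_list = select_constituents(candidate, excluded_words={"の"})
```

Either route leaves the four content words, so the particle "の" no longer contributes to the word count or to the frequency total of the candidate.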
[0015]
According to claim 5 of the present invention, in the keyword extraction device of any one of claims 1 to 4, the keyword candidate evaluation unit calculates an evaluation value for the keyword candidate according to either the number of constituent words of the keyword candidate or a value corresponding to that number.
[0016]
According to claim 6 of the present invention, in the keyword extraction device of any one of claims 1 to 4, the keyword candidate evaluation unit calculates an evaluation value for the keyword candidate according to either a corrected word frequency, which is based on the appearance frequency within the digitized document of each constituent word of the keyword candidate, or a value corresponding to the number of constituent words of the keyword candidate.
[0017]
According to claim 7 of the present invention, in the keyword extraction device of any one of claims 1 to 4, the keyword candidate evaluation unit evaluates the keyword candidate by the weighted linear sum, or the product, of both of the following:
(1) the number of constituent words of the keyword candidate, or a value corresponding to that number,
and
(2) a corrected word frequency based on the appearance frequency within the digitized document of each constituent word of the keyword candidate, or a value corresponding to the number of constituent words of the keyword candidate.
[0018]
According to claim 8 of the present invention, in the keyword extraction device of claim 6 or 7, the corrected word frequency of a keyword candidate is the sum of the frequencies of its constituent words divided by the number of its constituent words.
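The corrected word frequency defined above is a simple average, which the following sketch makes concrete. The function name `corrected_word_frequency` and the dictionary layout are assumptions; the frequencies are those used for the candidate 工業製品 in the description's worked example (each constituent word occurs twice, so the total is 4 over 2 words).

```python
def corrected_word_frequency(constituents, word_freq):
    """Wf = (sum of the document frequencies of the constituent words) / (their number)."""
    if not constituents:
        raise ValueError("a keyword candidate needs at least one constituent word")
    total = sum(word_freq.get(word, 0) for word in constituents)  # Sf
    return total / len(constituents)                              # Sf / WN

word_freq = {"工業": 2, "製品": 2}
wf = corrected_word_frequency(["工業", "製品"], word_freq)
```

Because the frequencies are counted per constituent word, a candidate that itself occurs only once can still receive a high value when its parts recur in other candidates.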
[0019]
According to claim 9 of the present invention, in the keyword extraction device of claim 5, 6, or 7, the keyword candidate evaluation unit uses the logarithm of the number of constituent words as the value corresponding to the number of constituent words of the keyword candidate.
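The evaluation variants described so far (weighted linear sum or product, with the raw word count or its logarithm) can be put side by side in one small sketch. The function name `evaluate` and its keyword arguments are hypothetical, and the natural logarithm is used here as in the description's third embodiment.

```python
import math

def evaluate(wn, wf, a=1.0, b=1.0, use_log=False, combine="sum"):
    """Evaluation value from the word count wn and the corrected word frequency wf."""
    size_term = math.log(wn) if use_log else float(wn)
    if combine == "sum":
        return a * size_term + b * wf      # weighted linear sum
    if combine == "product":
        return size_term * wf              # product form
    raise ValueError(f"unknown combine mode: {combine}")

plain = evaluate(2, 2.0)                   # word count used directly
damped = evaluate(2, 2.0, use_log=True)    # logarithm of the word count
```

One consequence worth keeping in mind when choosing a variant: with the logarithm and the product form, a single-word candidate has log(1) = 0 and therefore an evaluation value of 0.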
[0020]
According to claim 10 of the present invention, in the keyword extraction device of any one of claims 1 to 9, the keyword extraction unit does not extract as a keyword any keyword candidate, among those extracted by the keyword candidate extraction unit, all of whose constituent words are included in the constituent words of another keyword candidate.
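A minimal sketch of this redundancy filter follows. The function name `drop_subsumed` is hypothetical, and the use of a proper-subset test is an assumption made here so that two candidates with identical word sets do not eliminate each other; the text does not address that case. The sample candidates are taken from the description's second embodiment.

```python
def drop_subsumed(candidates):
    """Keep only candidates whose constituent words are not all contained in another candidate.

    candidates: dict mapping a candidate string to its list of constituent words.
    """
    word_sets = {key: set(words) for key, words in candidates.items()}
    return [
        key for key, words in word_sets.items()
        # "<" on sets is the proper-subset test.
        if not any(words < other for name, other in word_sets.items() if name != key)
    ]

candidates = {
    "輸出規制": ["輸出", "規制"],
    "規制": ["規制"],
    "輸出": ["輸出"],
    "日本": ["日本"],
}
kept = drop_subsumed(candidates)
```

This version compares every pair of candidates; for large candidate tables, checking candidates in order of increasing word count and comparing only against longer ones would reduce the work.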
[0021]
According to claim 11 of the present invention, the keyword extraction device of any one of claims 1 to 10 comprises a keyword output unit that outputs the keywords extracted by the keyword extraction unit in descending order of the evaluation values given to the keyword candidates by the keyword candidate evaluation unit.
According to claim 12 of the present invention, the keyword extraction device of any one of claims 1 to 11 comprises a keyword output unit that outputs, as a set, each keyword extracted by the keyword extraction unit together with the keywords that contain the same constituent word.
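Both output styles can be sketched briefly. The function names `rank_keywords` and `group_by_constituent` are hypothetical, and the scores below are made-up illustrative values, not figures from the patent. Showing a group only when at least two keywords share the word is one possible reading of "as a set".

```python
from collections import defaultdict

def rank_keywords(scored):
    """Sort (keyword, score) pairs by descending evaluation value."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def group_by_constituent(ranked, constituents):
    """Group ranked keywords under each constituent word shared by two or more of them."""
    groups = defaultdict(list)
    for keyword, _score in ranked:
        for word in constituents[keyword]:
            groups[word].append(keyword)
    return {word: keys for word, keys in groups.items() if len(keys) > 1}

# Illustrative scores only.
scored = [("日本", 3.0), ("輸出規制", 9.0), ("規制対象", 6.5)]
constituents = {"輸出規制": ["輸出", "規制"], "規制対象": ["規制", "対象"], "日本": ["日本"]}
ranked = rank_keywords(scored)
groups = group_by_constituent(ranked, constituents)
```

Because the grouping iterates over the ranked list, the keywords inside each group stay in descending order of evaluation value.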
[0022]
The keyword extraction method of claim 13 of the present invention divides a digitized document into words, assigns parts of speech, and analyzes the basic forms of the words as necessary; extracts keyword candidates, on the basis of the analysis result, according to a keyword candidate extraction rule describing patterns to be extracted as keyword candidates; evaluates each keyword candidate on the basis of its constituent words; and extracts keywords from the keyword candidates on the basis of the evaluation result.
[0023]
The program of claim 14 of the present invention is a program for extracting keywords from document data using a computer, and causes the computer to execute the functions of the keyword extraction device of any one of claims 1 to 12.
The recording medium of claim 15 of the present invention is a computer-readable recording medium on which the program of claim 14 is recorded.
[0024]
With the above configuration, when keywords are extracted at the level of compound nouns or of compound noun phrases joined by "no" (の) or "of", focusing on the words that make up a keyword candidate (its constituent words) makes it possible to evaluate keywords without preparing keyword information in advance and without performing statistical processing.
In addition, by treating inflected words, as in English, in their basic forms when they are constituent words, frequencies can be tallied appropriately, enabling more appropriate keyword evaluation.
In addition, when handling constituent words, separately specifying the parts of speech that affect keyword evaluation, or the parts of speech and words that do not, makes it possible to give priority to content words and to ignore function words such as "no" and "of", enabling more appropriate keyword evaluation.
Further, calculating the evaluation value of a keyword candidate from the logarithm of the number of constituent words instead of the number itself avoids overrating keyword candidates that have many constituent words.
Further, not extracting as a keyword a keyword candidate all of whose constituent words are included in the constituent words of another keyword candidate eliminates redundancy in the keyword set.
[0025]
[Embodiments of the invention]
Hereinafter, preferred embodiments according to the present invention will be described with reference to the drawings.
FIG. 1 is a block diagram showing the overall configuration of a keyword extraction system to which the keyword extraction device of the present invention is applied. As shown in the figure, the keyword extraction system includes: a document database 10 that holds a plurality of digitized document data; an input device 20, composed of a keyboard, a mouse, a touch panel, or the like, that specifies a document name in the document database 10; a document input device 30 that reads from the document database 10 the digitized document data of the document name specified by the input device 20; a keyword extraction device 40 according to the present invention that extracts keywords from the document data thus read; and an output device 50 that outputs information on the extracted keywords to a display, a printer, or another device connected via a network.
[0026]
<First embodiment>
FIG. 2 is a block diagram showing a functional configuration of an embodiment according to the keyword extraction device 40 of the present invention. The keyword extraction device 40 includes a word analysis unit 41, a word dictionary 42, a keyword candidate extraction unit 43, and a keyword candidate extraction rule. 44, a keyword candidate evaluation unit 45, and a keyword extraction unit 46.
[0027]
The word analysis unit 41 receives from the document input device 30 the document data stored in the document database 10, refers to the word dictionary 42, which holds part-of-speech information and basic form information, divides each sentence of the document data into words, assigns parts of speech, analyzes the basic form of the words as necessary, and outputs the analysis result in word units.
The word analysis unit 41 is realized by a morphological analysis technique of the related art.
[0028]
For example, when the word analysis unit 41 processes the fourth sentence (the first sentence of the main text) of the Japanese document shown in FIG. 3, the sentence is divided into words as shown in FIG. 4.
In FIG. 4, the “basic form” is, in Japanese, the word itself for a word that does not inflect and the dictionary (final) form for a word that inflects. Because the basic form of a word that does not inflect is the same as the word itself, it is omitted in the figures in the following description.
Further, for a language with inflected words, such as English, tallying the words as they appear does not give appropriate frequencies, so more appropriate evaluation can be performed by using the basic form.
For example, when the word "Computers" appears in the target document, its basic form (Computer) is extracted from the word dictionary.
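As a non-limiting illustration of tallying on basic forms, the following Python sketch counts word frequencies after mapping each word to its basic form. The mapping `BASIC_FORMS` is a hypothetical stand-in for the word dictionary 42, and lowercasing the input is an assumption made only for this example.

```python
from collections import Counter

# Hypothetical stand-in for the word dictionary 42: surface form -> basic form.
BASIC_FORMS = {"computers": "computer", "exports": "export", "exported": "export"}


def basic_form(word: str) -> str:
    """Return the basic form if the dictionary lists one, else the word itself."""
    lowered = word.lower()  # assumption: case is ignored in this sketch
    return BASIC_FORMS.get(lowered, lowered)


def tally(words: list[str]) -> Counter:
    """Count frequencies on basic forms rather than on the words as they appear."""
    return Counter(basic_form(w) for w in words)


counts = tally(["Computers", "computer", "exports", "exported"])
```

With this mapping, "Computers" and "computer" are counted together, which is the effect the basic form is meant to achieve.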
[0029]
The keyword candidate extraction unit 43 extracts keyword candidates from the result of the word analysis unit 41 in accordance with rules on the pattern (configuration) of the keyword candidate words to be extracted (keyword candidate extraction rules 44).
For example, as shown in FIG. 5, the keyword candidate extraction rule 44 is described as a rewriting rule in which the leaf node is a part of speech name under the following notation.
[0030]
X + Y: X and Y are arranged in this order
(X | Y | Z): X or Y or Z
{X}. : X is omitted or there is one X
{X} *: 0 or more repetitions of X
{X} +: One or more repetitions of X
[0031]
In the rule illustrated in FIG. 5, the following two types are defined as keyword candidates.
(1) Keyword candidate A is a sequence of one or more nouns, each of which may have a prefix or suffix attached before or after it; that is, a single noun or a compound noun formed by consecutive nouns is a keyword candidate.
(2) Keyword candidate B is a sequence in which compound nouns of keyword candidate A are connected by the case particle "no"; that is, a noun phrase joined by "no" is a keyword candidate.
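As a non-limiting illustration, the two candidate types can be sketched as regular expressions over a sequence of part-of-speech codes. The one-letter codes, the tag names, and the function `extract_candidates` are assumptions made for this example; they are not the rule notation or the tag set of FIG. 5.

```python
import re

# Hypothetical one-letter codes for part-of-speech tags (not the tag set of FIG. 5).
POS_CODE = {"noun": "n", "prefix": "p", "suffix": "s", "particle_no": "o"}

# Candidate A: one or more nouns, each optionally carrying a prefix or a suffix.
PATTERN_A = re.compile(r"(?:p?ns?)+")
# Candidate B: two or more candidate-A units joined by the particle "no".
PATTERN_B = re.compile(r"(?:p?ns?)+(?:o(?:p?ns?)+)+")


def extract_candidates(tokens):
    """tokens: list of (surface, pos) pairs. Returns (kind, [surface, ...]) pairs."""
    # One character per token, so string offsets equal token offsets.
    codes = "".join(POS_CODE.get(pos, "x") for _, pos in tokens)
    found = []
    for kind, pattern in (("A", PATTERN_A), ("B", PATTERN_B)):
        for m in pattern.finditer(codes):
            found.append((kind, [s for s, _ in tokens[m.start():m.end()]]))
    return found


tokens = [("industrial", "noun"), ("product", "noun"), ("no", "particle_no"),
          ("export", "noun"), ("control", "noun"), ("wo", "other")]
found = extract_candidates(tokens)
```

Running both patterns over the same sequence yields the compound nouns as candidates A and the whole "no"-linked phrase as a candidate B.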
[0032]
According to the rules in FIG. 5, the keyword candidates shown in FIG. 6 are extracted from the fourth sentence in FIG. 3. As shown in FIG. 7, each extracted keyword candidate is accompanied by information on its constituent words, their parts of speech, and their basic forms.
[0033]
The keyword candidate extraction method is not limited to the method using the keyword candidate extraction rule as described above, and may be configured by a conventional technique.
The word analysis unit 41 and the keyword candidate extraction unit 43 may extract the keyword candidates after obtaining the word analysis result for the entire target document, or may obtain the word analysis result one sentence at a time and repeat the extraction of keyword candidates for each sentence until all sentences have been processed.
[0034]
The keyword candidate evaluation unit 45 evaluates the keyword candidates extracted by the keyword candidate extraction unit 43 based on the constituent words.
Here, when the constituent words of the keyword candidates include inflected words, as in English, tallying frequencies on the basic form of each word rather than on the word as it appears gives appropriate frequency counts and enables more appropriate keyword evaluation.
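This evaluation is performed, for example, on the keyword candidate Kj using the following evaluation value E (Kj).
[0035]
E (Kj) = F (WNj, Wfj)
Here, WNj: the number of constituent words of the keyword candidate Kj
Wfj: the corrected word frequency of the keyword candidate Kj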
The corrected word frequency Wfj is obtained as follows.
Wfj = Sfj / WNj
Here, Sfj: the sum of the frequencies of each of the constituent words of the keyword candidate Kj
[0036]
For example, the above function F may be any of the following.
E (Kj) = a * A (WNj) + b * B (Wfj)
Or
E (Kj) = A (WNj) * B (Wfj)
[0037]
In the above equations, a and b are weighting factors, and A (X) and B (X) are functions of X. For example, if a = b = 1, A (X) = X, and B (X) = X in the first equation, then
E (Kj) = WNj + Wfj
In the following description, this evaluation formula will be used.
[0038]
Further, setting a = b = 1, A (X) = log (X), and B (X) = X in the first equation gives
E (Kj) = log (WNj) + Wfj
and setting A (X) = log (X) and B (X) = X in the second equation gives
E (Kj) = log (WNj) * Wfj
Either of these can also be used.
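As a non-limiting illustration, the evaluation formulas above can be sketched in Python as follows. The function names, the parameter names, and the `combine` switch are assumptions introduced for this example only.

```python
import math


def corrected_word_frequency(sf: float, wn: int) -> float:
    """Wfj = Sfj / WNj."""
    return sf / wn


def evaluate(wn, wf, a=1.0, b=1.0, A=lambda x: x, B=lambda x: x, combine="sum"):
    """E(Kj) as a weighted linear sum a*A(WNj) + b*B(Wfj) or as a product A(WNj)*B(Wfj)."""
    if combine == "sum":
        return a * A(wn) + b * B(wf)
    return A(wn) * B(wf)


# Values from the "industrial product" example: Sfj = 4, WNj = 2.
wf = corrected_word_frequency(4, 2)                             # 2.0
e_sum = evaluate(2, wf)                                         # WNj + Wfj = 4.0
e_log_sum = evaluate(2, wf, A=math.log)                         # log(WNj) + Wfj
e_log_product = evaluate(2, wf, A=math.log, combine="product")  # log(WNj) * Wfj
```

Passing A and B as functions keeps the three variants in one place; a fixed formula would serve equally well if only one variant is needed.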
[0039]
The processing procedure of the keyword candidate evaluation unit 45 will be described in detail based on the flowchart of FIG.
First, for the words analyzed by the word analysis unit 41, a word table in which the frequencies of the words are totaled is created and stored in a storage device or the like (step S1).
The word table for the fourth sentence, divided into words by the word analysis unit 41 as shown in FIG. 4, is as shown in FIG. 9. Because this table is referred to in order to obtain the frequency of each constituent word of the keyword candidates, the words should be sorted in dictionary order.
[0040]
A table relating to the keyword candidates extracted by the keyword candidate extraction unit 43 is created and stored in a storage device or the like (step S2).
The keyword candidate table holds, for each keyword candidate Kj obtained by the keyword candidate extraction unit 43, its word configuration, frequency, number of constituent words WNj, total frequency of constituent words Sfj, corrected word frequency Wfj, and keyword evaluation value E (Kj).
As in the word table, the same keyword candidate may appear more than once, so duplicates are merged into a single entry and the number of occurrences is set in the table as its frequency. In this description the keyword candidates are listed in order of appearance, but the order is not limited to this.
[0041]
FIG. 10 shows an excerpt of the keyword candidate table covering the keyword candidates in the fourth sentence, with the word configuration information and the frequency set. In the word configuration column of the figure, "|" represents a break between words, and "." represents a break between a word and an affix (prefix or suffix).
In addition, among the keyword candidates of the fourth sentence in the figure, “industrial product” appears once more and “export control” appears twice more in the document, so their frequencies are 2 and 3, respectively. The other frequencies are 1. Because “industrial product” and “export control” first appear in the first sentence and the second sentence, respectively, they are listed before the other keyword candidates.
[0042]
The following steps S3 to S9 are performed for each keyword candidate. The keyword candidates are sequentially extracted from the keyword candidate table, and when the processing of all the keyword candidates is completed (“none” in step S3), the processing ends.
Thereby, all the columns of the keyword candidate table are filled, and the evaluation values of all the keyword candidates are obtained.
[0043]
If the keyword candidates can be extracted from the keyword candidate table ("Yes" in step S3), the "number of constituent words WNj" column and the "total frequency of constituent words Sfj" column of the target keyword candidate are cleared to zero for subsequent addition processing. (Step S4).
For example, the column “number of constituent words WNj” and the column “total constituent word frequency Sfj” of the keyword candidate “industrial product” in FIG. 11 are set to zero.
[0044]
In the following steps S5 to S7, processing is performed for each constituent word of the target keyword candidate.
Constituent words Wji are taken out one at a time from the constituent words indicated in the word configuration of the keyword candidate. When all constituent words have been processed (“none” in step S5), the processing for each constituent word ends, the number of constituent words and the total frequency of constituent words of the target keyword candidate have been obtained, and the process proceeds to step S8.
[0045]
When a constituent word can be extracted from the constituent words indicated in the word configuration of the keyword candidate (“Yes” in step S5), 1 is added to the “number of constituent words WNj” column for the constituent word Wji (step S6).
For example, when "industry" is taken out of the word configuration "industry | product", 1 is added to the "number of constituent words WNj" column for the constituent word "industry".
[0046]
With reference to the word table, the word frequency of this constituent word Wji is obtained, added to the “total constituent word frequency Sfj” column (step S7), and the process returns to step S5.
For example, in the case of the word configuration “industry | product” above, when the constituent word “product” is taken out, 1 is added to the “number of constituent words WNj” column, and the word frequency “2” of “product” is added to the “total constituent word frequency Sfj” column, the entry for “industrial product” in the keyword candidate table is set as shown in FIG. 12.
[0047]
The corrected word frequency Wfj is calculated and set in the "corrected word frequency Wfj" column (step S8).
The corrected word frequency Wfj is calculated, for example, as a value obtained by dividing the value in the “total constituent word frequency Sfj” column by the value of the “number of constituent words WNj”.
In the keyword candidate “industrial product” in FIG. 12, Sfj is “4” and WNj is “2”, so Wfj is 4/2 = 2.
[0048]
Finally, the “evaluation value E (Kj)” of the keyword candidate is calculated and set in the “evaluation value E (Kj)” column (step S9), and the process returns to step S3 again. The above process is repeated for all keyword candidates.
The “evaluation value E (Kj)” of the keyword candidate is set, for example, as a value obtained by adding “the number of constituent words WNj” and “corrected word frequency Wfj”.
[0049]
In the keyword candidate “industrial product” in the above example, WNj is “2” and Wfj is “2”, so E (Kj) = 2 + 2 = 4. The keyword candidate table entry for “industrial product” is therefore set as shown in FIG. 13.
Further, the keyword candidate table at the time when the processing of the last keyword candidate “Japan” of the fourth sentence is completed is as shown in FIG. 14, and the keyword candidate table at the time when the processing of all keyword candidates is completed is as shown in FIG. 15.
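As a non-limiting illustration, steps S1 to S9 can be sketched in Python as follows. The table layout (a dictionary keyed by the tuple of constituent words), the field names, and the function `evaluate_candidates` are assumptions made for this example; the evaluation formula is E (Kj) = WNj + Wfj as used in this description.

```python
from collections import Counter


def evaluate_candidates(words, candidates):
    """words: all analyzed words; candidates: constituent-word tuples, one per occurrence."""
    word_table = Counter(words)                   # S1: word frequency table
    table = {}                                    # S2: candidate table, duplicates merged
    for cand in candidates:
        entry = table.setdefault(tuple(cand), {"freq": 0})
        entry["freq"] += 1
    for cand, entry in table.items():             # S3: take each candidate in turn
        entry["WN"], entry["Sf"] = 0, 0           # S4: clear the two columns
        for word in cand:                         # S5: take each constituent word
            entry["WN"] += 1                      # S6: count the constituent word
            entry["Sf"] += word_table[word]       # S7: add its word frequency
        entry["Wf"] = entry["Sf"] / entry["WN"]   # S8: corrected word frequency
        entry["E"] = entry["WN"] + entry["Wf"]    # S9: evaluation value
    return table


# Illustrative input in which "industry" and "product" each occur twice.
words = ["industry", "product", "export", "control", "industry", "product"]
candidates = [("industry", "product"), ("export", "control"), ("industry", "product")]
table = evaluate_candidates(words, candidates)
```

For the candidate "industry | product" this reproduces the values of the worked example: WNj = 2, Sfj = 4, Wfj = 2, and E (Kj) = 4.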
[0050]
Based on the result of the evaluation by the keyword candidate evaluation unit 45, and on the assumption that a keyword candidate with a higher evaluation value has a higher keyword property, the keyword extraction unit 46 extracts keywords from the keyword candidates according to one of the following criteria.
(1) The top N keyword candidates by evaluation value
(2) The top 1/M of all keyword candidates by evaluation value
(3) Keyword candidates with an evaluation value of A or more
[0051]
For example, if the top 10 rankings are used as extraction criteria, keywords as shown in FIG. 16 can be extracted (in FIG. 16, 11 keywords are extracted because there is a tie).
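As a non-limiting illustration, the three extraction criteria can be sketched in Python as follows. The function names are assumptions, and the handling of ties (keeping every candidate whose value equals that of the N-th candidate) is inferred from the remark that 11 keywords result from a top-10 criterion; rounding the 1/M share down with a minimum of one is likewise an assumption.

```python
def _ranked(scores):
    """Candidates sorted by evaluation value, highest first (stable for ties)."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def top_n_with_ties(scores, n):
    """Criterion (1): top N, plus any candidates tied with the N-th."""
    ranked = _ranked(scores)
    if n <= 0:
        return []
    if len(ranked) <= n:
        return ranked
    cutoff = ranked[n - 1][1]
    return [kv for kv in ranked if kv[1] >= cutoff]


def top_fraction(scores, m):
    """Criterion (2): the top 1/M of all candidates (rounded down, at least one)."""
    ranked = _ranked(scores)
    return ranked[:max(1, len(ranked) // m)]


def at_least(scores, threshold):
    """Criterion (3): candidates whose evaluation value is the threshold or more."""
    return [kv for kv in _ranked(scores) if kv[1] >= threshold]


scores = {"a": 5.0, "b": 4.0, "c": 4.0, "d": 1.0}  # hypothetical evaluation values
```

With these hypothetical values, a top-2 criterion returns three candidates because "b" and "c" are tied.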
[0052]
The processing procedure of the first embodiment will be described with reference to the flowchart of FIG. 17.
First, for the target document for keyword extraction specified by the user or by the application in use, a word dictionary holding part-of-speech information and basic form information is referred to, each sentence of the target document is divided into words, parts of speech are assigned, the basic form of the words is analyzed as necessary, and the analysis result for each word is output (step S11).
For example, the result of the word analysis of the fourth sentence (the first sentence of the main text) of the Japanese document in FIG. 3 is as shown in FIG. 4.
[0053]
Based on the result of the word analysis processing of the document in step S11, keyword candidates are extracted in accordance with rules on the pattern (configuration) of the keyword candidate words to be extracted (for example, the keyword candidate extraction rules shown in FIG. 5) (step S12).
According to the rules of FIG. 5, the keyword candidates of FIG. 6 are extracted from the fourth sentence of FIG. 3, and, as shown in FIG. 7, information on the constituent words, their parts of speech, and their basic forms is attached to each extracted keyword candidate.
[0054]
With respect to the keyword candidates extracted by the keyword candidate extraction process in step S12, the keyword candidates are evaluated based on the constituent words (step S13).
The evaluation values of the keyword candidates at the time when the processing of all keyword candidates is completed are set as shown in FIG. 15.
[0055]
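In this evaluation, for example, the following evaluation value E (Kj) is calculated for the keyword candidate Kj.
E (Kj) = F (WNj, Wfj)
Here, WNj: the number of constituent words of Kj
Wfj: the corrected word frequency of Kj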
[0056]
This corrected word frequency Wfj is obtained as follows.
Wfj = Sfj / WNj
Here, Sfj: the sum of the frequencies of the constituent words of Kj
[0057]
For example, an evaluation value is obtained from these calculated values by one of the following expressions.
E (Kj) = a * A (WNj) + b * B (Wfj)
Or
E (Kj) = A (WNj) * B (Wfj)
[0058]
Keyword extraction is performed based on the result evaluated by the keyword candidate evaluation process in step S13 (step S14).
On the assumption that a keyword candidate with a higher evaluation value has a higher keyword property, keywords are extracted based on any of the following criteria.
(1) The top N keyword candidates by evaluation value
(2) The top 1/M of all keyword candidates by evaluation value
(3) Keyword candidates with an evaluation value of A or more
For example, if the top 10 keywords are extracted as keywords, the result is as shown in FIG. 16 (11 keywords because there are ties).
[0059]
<Embodiment 2>
In the first embodiment, the keyword candidates are extracted as keywords even when all of the constituent words of the keyword candidates are included in the constituent words of other keyword candidates.
In the second embodiment, if all of the constituent words of a keyword candidate are included in the constituent words of another keyword candidate, that keyword candidate is not extracted as a keyword. This makes it possible to eliminate redundancy in the keyword set.
[0060]
For this purpose, the keyword extraction unit 46 checks, for each keyword candidate in the keyword candidate table stored in the storage device or the like by the keyword candidate extraction unit 43, whether there is another keyword candidate that includes all of its constituent words. If there is, the keyword candidate is treated as invalid in the keyword candidate table.
The efficiency of this processing can be improved by, for example:
・checking keyword candidates in ascending order of the number of constituent words;
・not checking a keyword candidate against keyword candidates that have fewer constituent words than it has.
[0061]
According to the keyword extraction unit 46 of the second embodiment, the following keyword candidates, for example, are treated as invalid:
・"Conventional weapons" is invalid because it is included in "Conventional weapons-related".
・"Regulations" is invalid because it is included in "Export restrictions".
・"Target" is invalid because it is included in "Restricted".
・"Country" is invalid because it is included in "3 countries".
・"Communist area" is invalid because it is included in the "Committee on Export Control to Communist Areas".
・"Export" is invalid because it is included in "Export Regulations".
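As a non-limiting illustration, the invalidation of keyword candidates whose constituent words are all contained in another candidate can be sketched in Python as follows. Treating constituent words as sets (ignoring order and repetition) and the function name `remove_subsumed` are assumptions made for this example.

```python
def remove_subsumed(candidates):
    """candidates: dict mapping a candidate name to its list of constituent words.

    Returns the names that remain valid, in their original order.
    """
    sets = {name: set(words) for name, words in candidates.items()}
    # Check from the candidates with the fewest constituent words.
    ordered = sorted(sets, key=lambda name: len(sets[name]))
    invalid = set()
    for i, name in enumerate(ordered):
        # Compare only against candidates with at least as many constituent words.
        if any(sets[name] <= sets[other] for other in ordered[i + 1:]):
            invalid.add(name)
    return [name for name in candidates if name not in invalid]


candidates = {
    "export": ["export"],
    "export control": ["export", "control"],
    "control": ["control"],
    "industrial product": ["industry", "product"],
}
valid = remove_subsumed(candidates)
```

Sorting by the number of constituent words first means each candidate is compared only with candidates that could possibly contain it, which corresponds to the two efficiency measures noted above.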
[0062]
As a result, when the top ten keywords are extracted as keywords, the result is as shown in FIG. 18 (11 keywords).
[0063]
<Embodiment 3>
In the above embodiments, the keyword candidate evaluation unit 45 uses the number of constituent words to calculate the evaluation value. In the third embodiment, the evaluation value is calculated using the logarithm of the number of constituent words instead.
When evaluation values are recalculated for the keyword candidate table of FIG. 15 using the natural logarithm (base e), a keyword candidate table with an added “evaluation value 2” column is obtained, as shown in FIG. 19. FIG. 19 shows the case where keyword candidates whose constituent words are all included in another keyword candidate have been excluded as in the second embodiment.
From the keyword candidate table of FIG. 19, the top ten rankings of the evaluation value 2 are as shown in FIG. 20, and it can be confirmed that “DPR Korea” falls from the top ten rankings.
In this way, taking the logarithm of the number of constituent words has the effect of suppressing over-evaluation of keyword candidates having a large number of constituent words.
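As a non-limiting illustration of this suppression effect, the following Python sketch ranks two candidates under both formulas. The candidate names and their (WNj, Wfj) values are hypothetical and are not taken from FIG. 19.

```python
import math

# Hypothetical (WNj, Wfj) pairs, chosen only to show the effect.
candidates = {"long phrase": (5, 2.0), "short word": (1, 4.0)}


def rank(cands, size_term):
    """Order candidate names by size_term(WNj) + Wfj, highest first."""
    return sorted(cands, key=lambda k: size_term(cands[k][0]) + cands[k][1],
                  reverse=True)


linear = rank(candidates, lambda wn: wn)  # E = WNj + Wfj
logged = rank(candidates, math.log)       # E = ln(WNj) + Wfj
```

Under the linear formula the five-word candidate scores 7.0 against 5.0 and ranks first; under the logarithmic formula it scores about 3.61 against 4.0 and drops to second.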
[0064]
<Embodiment 4>
By outputting the keywords extracted according to the above-described embodiment to a display device such as a display or an output device such as a printer as follows, the keywords can be output in a manner that is easy for the user to understand.
(1) Output in descending order of evaluation value.
For example, keywords extracted as shown in FIG. 18 or FIG. 19 are output as shown in FIG. 21 or FIG. 22.
[0065]
(2) A set of keywords including the same constituent word is output.
There are many possible output formats. For example, the keywords shown in FIG. 21 are output as shown in FIG. 23.
In this method, the constituent words of the higher-ranked keywords are used as keys, and the keywords that share each constituent word are arranged in order of evaluation value. A keyword that has already appeared may be omitted, but in FIG. 23 it is enclosed in parentheses to indicate the repetition. If no other keyword includes a constituent word, the line for that constituent word is not displayed.
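As a non-limiting illustration, the grouped output can be sketched in Python as follows. The line format, the function name `group_by_constituent`, and the sample keywords and evaluation values are assumptions; the actual layout of FIG. 23 is not reproduced here.

```python
def group_by_constituent(keywords):
    """keywords: list of (name, constituent_words, evaluation_value) tuples."""
    ranked = sorted(keywords, key=lambda k: k[2], reverse=True)
    shown, seen_keys, lines = set(), set(), []
    for _, words, _ in ranked:              # higher-ranked keywords supply the keys
        for key in words:
            if key in seen_keys:
                continue
            seen_keys.add(key)
            members = [n for n, ws, _ in ranked if key in ws]
            if len(members) < 2:            # no other keyword shares this word
                continue
            # Parenthesize keywords that already appeared on an earlier line.
            labels = [f"({n})" if n in shown else n for n in members]
            shown.update(members)
            lines.append(f"{key}: " + ", ".join(labels))
    return lines


keywords = [
    ("export control", ["export", "control"], 9.0),
    ("export", ["export"], 8.0),
    ("control target", ["control", "target"], 6.0),
]
lines = group_by_constituent(keywords)
```

In this example "target" produces no line because no other keyword contains it, and "export control" is parenthesized on its second appearance.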
[0066]
<Embodiment 5>
In the above-described embodiments, of the two types of keyword candidates defined by the keyword candidate extraction rules 44 used in the keyword candidate extraction unit 43 (keyword candidate A: a single noun or a compound noun that is a series of nouns; keyword candidate B: a sequence in which compound nouns are connected by the case particle “no”), keyword candidate B was not described.
[0067]
First, the selection of constituent words that affect the evaluation value of the keyword candidate including the keyword candidate B in the keyword candidate evaluation unit 45 will be described.
For example, in FIG. 6 and FIG. 7, the following have been extracted as keyword candidates B:
(7) Parts of conventional weapons
(8) Export control of industrial products
When “no” is counted as a constituent word as in the above-described embodiments, the frequency of “no” obtained by the word analysis unit 41 is 13 (see FIG. 9). For “parts of conventional weapons”, the number of constituent words is therefore 4, the total frequency is 18, and the corrected word frequency is 4.5, so the evaluation value is 8.5.
Similarly, the evaluation value of “export control of industrial products” is 5 + (2 + 2 + 7 + 7 + 13) / 5 = 11.2.
In both cases the evaluation value of the keyword candidate is overestimated because of “no”.
On the other hand, if “no” is not counted as a constituent word, the above two keyword candidates and the other keyword candidates B in the target document are given the evaluation values shown in FIG. 24, and appropriate keyword evaluation can be performed.
[0068]
Therefore, in the fifth embodiment, one or both of the following processes are added to select a constituent word in a keyword candidate.
(1) A constituent word part-of-speech list is prepared, and only words of a specific part of speech are set as constituent words.
For example, by using a part-of-speech list that includes “general nouns” and the like, it becomes possible to specify which parts of speech affect keyword evaluation and which do not.
[0069]
(2) A non-constituent word part-of-speech list is prepared and words of specific parts of speech are removed from the constituent words, or a non-constituent word list is prepared and specific words are removed from the constituent words.
In the above case, using a non-constituent word part-of-speech list that includes “case particle” and “affix (prefix, suffix)” prevents the overestimation caused by “no”.
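As a non-limiting illustration, excluding parts of speech from the constituent words can be sketched in Python as follows. The tag names, the assignment of the frequencies 2, 2, 7, 7, and 13 to individual words, and the function `evaluate` are assumptions made for this example; only the frequencies themselves come from the calculation above.

```python
# Hypothetical non-constituent word part-of-speech list.
EXCLUDED_POS = frozenset({"case_particle", "prefix", "suffix"})


def evaluate(constituents, word_freq, excluded_pos=frozenset()):
    """constituents: list of (word, pos). Returns E = WN + Sf / WN over counted words."""
    counted = [w for w, pos in constituents if pos not in excluded_pos]
    if not counted:
        return 0.0  # assumption: a candidate with no counted words scores zero
    wn = len(counted)
    sf = sum(word_freq[w] for w in counted)
    return wn + sf / wn


# Assumed word frequencies for "export control of industrial products".
word_freq = {"industry": 2, "product": 2, "no": 13, "export": 7, "control": 7}
constituents = [("industry", "noun"), ("product", "noun"), ("no", "case_particle"),
                ("export", "noun"), ("control", "noun")]

with_no = evaluate(constituents, word_freq)                   # 5 + 31/5
without_no = evaluate(constituents, word_freq, EXCLUDED_POS)  # 4 + 18/4
```

Counting “no” gives 11.2 as in the calculation above; under the same formula, leaving it out gives 4 + 18 / 4 = 8.5 for these assumed frequencies.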
[0070]
When handling constituent words in this way, providing the parts of speech that affect keyword evaluation, or the parts of speech or words that do not, makes it possible to give priority to content words and to ignore function words such as “no” or “of”, so that more appropriate keyword evaluation becomes possible.
[0071]
<Embodiment 6>
Furthermore, the present invention is not limited to the above-described embodiments. It goes without saying that the object of the present invention is also achieved when the functions constituting the keyword extraction device of the above-described embodiments (the word analysis unit, keyword candidate extraction unit, keyword candidate evaluation unit, keyword extraction unit, and so on) are implemented as programs; the programs, together with the word dictionary, keyword candidate extraction rules, constituent word part-of-speech list, and the like, are written in advance to a recording medium such as a CD-ROM; the CD-ROM is mounted in a medium drive, such as a CD-ROM drive, of a computer; and the programs are stored in the memory or storage device of the computer and executed.
In this case, the program itself read from the recording medium implements the functions of the above-described embodiment, and the program and the recording medium on which the program is recorded also constitute the present invention.
[0072]
As the recording medium, any of a semiconductor medium (for example, ROM, nonvolatile memory card, etc.), an optical medium (for example, DVD, MO, MD, CD-R, etc.), or a magnetic medium (for example, magnetic tape, flexible disk, etc.) may be used.
[0073]
Further, the invention covers not only the case where the functions of the above-described embodiments are realized by executing the loaded program, but also the case where the operating system or the like performs part or all of the actual processing based on the instructions of the program and the functions of the above-described embodiments are realized by that processing.
[0074]
When the above-described program is stored in a storage device, such as a magnetic disk, of a server computer and is distributed by being downloaded to user computers connected via a communication network such as the Internet, the storage device of the server computer is also included in the recording medium of the present invention.
[0075]
[Effects of the Invention]
As described above, according to the present invention, when performing keyword extraction at the level of a compound noun or a compound noun phrase connected by “no” or “of”, attention is paid to the words (constituent words) constituting the keyword candidates. Thus, keyword evaluation can be performed without preparing information for keywords in advance or performing statistical processing.
In addition, by treating the constituent words of a language with inflected words, such as English, in their basic forms, appropriate frequency counting can be performed, so that more appropriate keyword evaluation can be performed.
In addition, when handling constituent words, separately preparing the parts of speech that affect keyword evaluation, or the parts of speech or words that do not, makes it possible to give priority to content words and to ignore function words such as “no” and “of”, so that more appropriate keyword evaluation becomes possible.
Further, by calculating the evaluation value of the keyword candidate using the logarithm of the number of constituent words instead of the number of constituent words, it is possible to avoid excessive evaluation of keyword candidates having a large number of constituent words.
In addition, when all of the constituent words of a keyword candidate are included in the constituent words of another keyword candidate, not extracting that candidate as a keyword makes it possible to eliminate redundancy in the keyword set.
[Brief description of the drawings]
FIG. 1 is a block diagram showing an overall configuration of a keyword extraction system to which a keyword extraction device of the present invention is applied.
FIG. 2 is a block diagram illustrating a functional configuration of a keyword extracting device according to a first embodiment of the present invention.
FIG. 3 is an example of a Japanese document used for explaining keyword extraction.
FIG. 4 is an example in which the fourth sentence (the first sentence of the text) of the Japanese document shown in FIG. 3 is divided into words.
FIG. 5 is an example of a keyword candidate extraction rule.
FIG. 6 is an example of keyword candidates extracted from the fourth sentence of FIG. 3 according to the keyword candidate extraction rule of FIG. 5;
FIG. 7 is an example showing information on words, parts of speech, and basic forms constituting each extracted keyword candidate.
FIG. 8 is a flowchart illustrating a processing procedure of a keyword candidate evaluation unit.
FIG. 9 is an example of a word table in which the frequencies are totaled for each word in the sentence example of FIG. 4;
FIG. 10 is an example of a keyword candidate table showing a setting state of keyword candidates in a fourth sentence of the sentence example of FIG. 4;
FIG. 11 is a diagram showing a state in which a “number of constituent words WNj” column and a “total constituent word frequency Sfj” column of keyword candidates are set to zero.
FIG. 12 is a diagram illustrating a state in which frequencies are added to the “number of constituent words WNj” column and the “total constituent word frequency Sfj” column as constituent words of the word configuration.
FIG. 13 is a diagram showing a setting state of a keyword candidate table for a keyword candidate “industrial product”.
FIG. 14 is a diagram illustrating a setting state of a keyword candidate table at the time when the processing of the last keyword candidate “Japan” in the fourth sentence is completed.
FIG. 15 is a diagram illustrating a setting state of a keyword candidate table at the time when processing of all keyword candidates is completed.
FIG. 16 is a diagram showing an example of keywords extracted using the top 10 rankings as the extraction criterion.
FIG. 17 is a flowchart illustrating a processing procedure according to the first embodiment.
FIG. 18 is a diagram illustrating an example of keywords extracted when keyword candidates whose constituent words are all included in the constituent words of another keyword candidate are not extracted.
FIG. 19 is a diagram illustrating a setting state of a keyword candidate table in a case where the evaluation of a keyword candidate uses the logarithm of the number of constituent words instead of the number of constituent words.
FIG. 20 is a diagram illustrating an example of keywords extracted from the keyword candidate table of FIG. 19.
FIG. 21 is an output example of keywords extracted as in FIG. 18;
FIG. 22 is an output example of keywords extracted as in FIG. 19;
FIG. 23 is an output example in which keywords including the same constituent words are grouped.
FIG. 24 is a diagram showing a setting state of a keyword candidate table of keyword candidates composed of a sequence in which compound nouns are linked by case particles “no”.
[Explanation of symbols]
10: document database, 20: input device, 30: document input device, 40: keyword extraction device, 41: word analysis unit, 42: word dictionary, 43: keyword candidate extraction unit,
44: keyword candidate extraction rule, 45: keyword candidate evaluation unit, 46: keyword extraction unit, 50: output device.

Claims

1. A keyword extraction device comprising: a word analysis unit that divides a digitized document into words, assigns parts of speech, and analyzes the basic form of the words as necessary; a keyword candidate extraction rule describing a pattern to be extracted as a keyword candidate; a keyword candidate extraction unit that extracts keyword candidates according to the keyword candidate extraction rule based on the analysis result of the word analysis unit; a keyword candidate evaluation unit that evaluates the keyword candidates based on their constituent words; and a keyword extraction unit that extracts keywords from the keyword candidates based on the result of the evaluation by the keyword candidate evaluation unit.

2. The keyword extraction device according to claim 1, wherein the keyword candidate evaluation unit performs evaluation based on a basic form of a constituent word of the keyword candidate.

3. The keyword extraction device according to claim 1, further comprising a constituent word part-of-speech list that defines the parts of speech to be treated as constituent words of a keyword, wherein the keyword candidate evaluation unit targets, among the constituent words of the keyword candidate, only words of the parts of speech defined in the constituent word part-of-speech list.

4. The keyword extraction device according to claim 1, further comprising at least one of a non-constituent word part-of-speech list that defines parts of speech not treated as constituent words of a keyword and a non-constituent word list that defines words not treated as constituent words of a keyword, wherein the keyword candidate evaluation unit excludes, from among the constituent words of the keyword candidate, the words of the parts of speech defined in the non-constituent word part-of-speech list, the words defined in the non-constituent word list, or both.

5. The keyword extraction device according to claim 1, wherein the keyword candidate evaluation unit calculates, for the keyword candidate, an evaluation value according to either the number of words constituting the keyword candidate or a value corresponding to that number.

6. The keyword extraction device according to claim 1, wherein the keyword candidate evaluation unit calculates, for the keyword candidate, an evaluation value according to either a corrected word frequency based on the appearance frequency in the digitized document of each of the constituent words of the keyword candidate, or a value corresponding to the number of words constituting the keyword candidate.

7. The keyword extraction device according to any one of claims 1 to 4, wherein the keyword candidate evaluation unit evaluates the keyword candidate by a weighted linear sum or a product of both of the following values:
(1) the number of constituent words of the keyword candidate or a value corresponding to the number of constituent words,
and
(2) a corrected word frequency based on the appearance frequency in the digitized document of each constituent word of the keyword candidate, or a value corresponding to the number of constituent words of the keyword candidate.

8. The keyword extraction device according to claim 6, wherein the corrected word frequency of a keyword candidate is a value obtained by dividing the sum of the frequencies of the constituent words of the keyword candidate by the number of constituent words of the keyword candidate.

9. The keyword extraction device according to claim 5, wherein the keyword candidate evaluation unit uses the logarithm of the number of constituent words as the value corresponding to the number of constituent words of the keyword candidate.

10. The keyword extraction device according to claim 1, wherein the keyword extraction unit does not extract as a keyword any keyword candidate, among those extracted by the keyword candidate extraction unit, all of whose constituent words are included in the constituent words of another keyword candidate.

11. The keyword extraction device according to any one of claims 1 to 10, further comprising a keyword output unit that arranges the keywords extracted by the keyword extraction unit in descending order of the evaluation values given by the keyword candidate evaluation unit and outputs them.

12. The keyword extraction device according to claim 1, further comprising a keyword output unit that outputs, as a set, those keywords extracted by the keyword extraction unit that include the same constituent word.

13. A keyword extraction method comprising: dividing a digitized document into words, assigning parts of speech, and analyzing the basic form of the words as necessary; extracting keyword candidates, based on the analysis result, according to a keyword candidate extraction rule describing patterns to be extracted as keyword candidates; evaluating the keyword candidates based on their constituent words; and extracting keywords from the keyword candidates based on the evaluation result.

14. A program for extracting keywords from document data using a computer, the program causing the computer to execute the functions of the keyword extraction device according to any one of claims 1 to 12.

15. A computer-readable recording medium on which the program according to claim 14 is recorded.