JP4360122B2 - Keyword extractor

Info

Publication number: JP4360122B2
Application number: JP2003147599A
Authority: JP
Inventors: 幸寛坪下; 洋岡本; 基文福井; 正浩前田; 功山口
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2003-05-26
Filing date: 2003-05-26
Publication date: 2009-11-11
Anticipated expiration: 2023-05-26
Also published as: JP2004348637A

Description

[0001]
[Technical Field of the Invention]
The present invention relates to an associative memory technique using a neural network, and more particularly to improvement of a method for determining an activation node in a neural network.
[0002]
[Prior art]
Recognition and understanding technology based on associative memory can be applied to natural language processing. For example, Patent Document 1 discloses an invention that applies associative memory to document retrieval, with the aim of providing an information retrieval apparatus that can return information that is related but different in kind. For each of a plurality of given documents, a “keyword vector” is created by assigning 1 or 0 to each dictionary word according to whether or not it appears in the document, and the vectors are memorized in an “Associatron” as an autocorrelation associative memory. A keyword vector is likewise created from a sentence or word string entered by the user, and the Associatron performs recall with that vector as the initial condition. This realizes a pattern-recognition style of search that is entirely different from AND and OR conditions.
[0003]
However, because this method uses the Associatron, a conventional associative memory model, as it is, it does not solve the various problems that arise when recalling document patterns. A conventional associative memory model can easily learn and recall random patterns (when a pattern is expressed as a sequence of N 0s and 1s, a pattern in which the i-th and j-th values are determined independently of each other is called a random pattern), but it has difficulty learning and recalling biased patterns such as document patterns.
[0004]
When a document pattern is expressed as a vector whose elements are 1 or 0 according to whether or not each dictionary word appears in the document, document patterns have the following characteristics.
[0005]
(1) No pattern is free of noise
When actual documents are used as learning data, they usually contain words unrelated to the topic. Such words can be regarded as noise in the learning data.
[0006]
(2) A single pattern contains only a very small fraction of all words
Japanese has hundreds of thousands of words, but a relatively short document such as an e-mail contains a few hundred at most. Such a pattern is called a sparse pattern.
[0007]
(3) Word frequencies differ greatly
For example, demonstratives such as “kore” and “kono” (both roughly “this”) are used very often, whereas technical terms appear only in special cases.
[0008]
(4) The appearance frequency of similar patterns is skewed
For example, when inquiry e-mails from customers are the target, the difference in frequency between common inquiry patterns and rarely seen patterns is very pronounced.
[0009]
In order to store random sparse patterns, an associative memory using a covariance matrix has been proposed (Non-Patent Document 1, Non-Patent Document 2). In this associative memory model, nodes are selected at random, one at a time, from the N nodes, and the activation value of each selected node is updated according to equation (2). This operation is repeated until no node's activation value changes.
[Equation 1]

In this associative memory model, problem (1) above is solved by introducing an activation probability for each node's activation value, which makes it possible to distinguish co-occurrence that arises by chance from patterns that co-occur because they are truly correlated. Since the model was devised to store random sparse patterns, (2) above poses no problem. Problem (3) above is likewise solved by introducing the activation probability.
[0010]
However, this associative memory model is not suited to the case where the appearance frequency of similar patterns is skewed ((4) above), because it tends to recall only frequently occurring patterns.
[0011]
Further, Patent Document 2 studies the application of an associative memory model to kana-kanji conversion. To solve problems (1) to (4) above, it refines the covariance-matrix associative memory model as follows. When a new sentence pattern is memorized, the energy value of that pattern is calculated, and the degree to which the link weights are updated is determined from the difference between this energy value and a predetermined energy value.
[Equation 2]

With this refinement, the link values are learned so that the energy value is approximately constant whatever the document pattern. Biased patterns can therefore also be recalled appropriately (solving (4) above).
[0012]
However, when learning a document pattern, it is necessary to calculate energy each time, so that learning takes time. Further, in this method, since information such as the appearance frequency of similar patterns is not reflected in the network weight, it cannot be said that the weight matrix correctly reflects the structure of the learning data.
[Patent Document 1]
Japanese Patent No. 2832678
[Patent Document 2]
Japanese Patent No. 3364242
[Non-Patent Document 1]
S. Amari, Neural theory of association and concept-formation, Biol. Cybern., Vol. 26, pp. 175-185, 1977
[Non-Patent Document 2]
S. Amari, Characteristics of sparsely encoded associative memory, Neural Networks, Vol. 2, pp. 451-457, 1989
[0013]
[Problems to be solved by the invention]
The present invention has been made in view of the above circumstances, and an object thereof is to provide an associative memory method that can appropriately recall a biased pattern even with a small calculation cost.
[0014]
[Means for Solving the Problems]
According to this invention, in order to achieve the above-mentioned object, the configuration as described in the claims is adopted. Here, prior to describing the invention in detail, supplementary explanations of the claims will be given.
[0015]
In the basic configuration of the present invention, in an associative memory method using a covariance matrix, a suppression input proportional to the overall firing rate is added to the network, so that associative recall that depends on the initial state can be performed even for highly biased data such as document patterns.
[0016]
As a result, there are the following effects.
- Situation-dependent output is possible even with a very “biased” weight matrix.
- The energy value need not be computed each time a link is updated, so learning takes little time.
- By controlling the magnitude of the suppression input, patterns at various levels can be output, from patterns with low energy values to patterns with high energy values.
[0017]
The present invention can be realized not only as an apparatus or a system but also as a method. Of course, a part of the invention can be configured as software. Of course, software products used to cause a computer to execute such software are also included in the technical scope of the present invention.
[0018]
These and other aspects of the invention are set forth in the appended claims and will be described in detail below with reference to examples.
[0019]
[Embodiments of the Invention]
Hereinafter, an embodiment in which the present invention is applied to a keyword extraction apparatus will be described. This keyword extraction apparatus is intended for building document processing apparatuses that automatically assign keywords to a document so that its content is easier to grasp or so that it is easier to search. It can be applied to, for example, document management and processing systems in offices. In particular, it can output as keywords not only words contained in the document but also words that are absent from the document yet closely tied to its meaning, based on their relationship to the words the document does contain.
[0020]
FIG. 1 shows the keyword extraction apparatus of this embodiment as a whole. In this figure, the keyword extraction apparatus 10 includes a user interface unit 11, a syntax analysis unit 12, an inter-word correlation learning unit 13, a recall information extraction unit 14, a database 15, and the like.
[0021]
The user interface unit 11 includes a keyboard, a monitor, and the like, and allows the user to execute document input and keyword presentation.
[0022]
The syntax analysis unit 12 breaks the learning documents and the input document down into words and parses them (hereinafter, where exactly the same processing is applied to the learning document group and the input document, they are collectively referred to as “documents”). At this time, words having the same meaning are converted into one representative word. For example, “プリンター”, “プリンタ”, “Printer”, and “printer” are treated as having the same meaning and replaced with the single representative word “プリンター” (printer). Each document is then converted into one feature vector. The feature vector F_μ of document D_μ is as follows.
[Equation 3]
$$F_\mu = [w_1^\mu, w_2^\mu, \ldots, w_n^\mu]$$
The element w_i^μ of the feature vector F_μ is 1 if the word W_i appears in document D_μ and 0 if it does not. That is, the feature vector records only whether or not a word appears in the document; appearance frequency, importance, and so on are not considered. The length n of the feature vector is set in advance. The n words {W_1, W_2, ..., W_n} used when generating feature vectors are chosen by scoring the words that appear in the learning document group by their TF*IDF values and taking the top n. TF (Term Frequency) is the frequency with which an index term t occurs in a document d, and IDF (Inverse Document Frequency) represents how specific the term is.
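The vocabulary selection and binary encoding described here can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the synonym table `REPRESENTATIVE`, the function names, and the way TF*IDF scores are aggregated over the learning documents (summed per word, with IDF taken as log(M/df)) are all assumptions, since the text does not specify them. Tokenization is assumed to have been done already.

```python
import math
from collections import Counter

# Hypothetical synonym table: surface form -> representative word.
REPRESENTATIVE = {"プリンタ": "プリンター", "Printer": "プリンター", "printer": "プリンター"}


def normalize(tokens):
    """Replace each token with its representative word, if one is registered."""
    return [REPRESENTATIVE.get(t, t) for t in tokens]


def select_vocabulary(docs, n):
    """Return the n words with the highest summed TF*IDF score (assumed aggregation)."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # document frequency of each word
    scores = Counter()
    for doc in docs:
        for word, tf in Counter(doc).items():
            scores[word] += tf * math.log(len(docs) / df[word])
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:n]


def feature_vector(doc, vocab):
    """Binary vector: 1 if the vocabulary word appears in the document, else 0."""
    present = set(doc)
    return [1 if w in present else 0 for w in vocab]


docs = [
    normalize(["printer", "error", "the"]),
    normalize(["Printer", "ink", "the"]),
    normalize(["copy", "the"]),
]
vocab = select_vocabulary(docs, 4)
first = feature_vector(docs[0], vocab)
```

In this toy corpus the word "the" appears in every document, so its IDF is zero and it drops out of the vocabulary, which is the effect the TF*IDF ranking is meant to have.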
[0023]
The inter-word correlation learning unit 13 learns the relationships between words from the learning documents, based on co-occurrence, and constructs an associative memory matrix. That is, the associative memory matrix g_ij is obtained by the following equation.
[Equation 4]

Further, in order to suppress, for example, the influence of words whose frequency is conspicuously high, the associative memory matrix g_ij is normalized row by row so that the input a node receives from the network stays within a fixed range.
[Equation 5]

In this way, an associative memory matrix G _ij (1 ≦ i, j ≦ n) is obtained. The obtained associative memory matrix G _ij is stored in the database 15.
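Because the equation images are not reproduced in this text, the exact definitions of g_ij and G_ij are not available here. As a hedged sketch only, the following assumes a standard covariance form, g_ij = (1/M) Σ_μ (w_i^μ − p_i)(w_j^μ − p_j) with p_i the mean of element i and zero diagonal, and assumes the row normalization divides each row by the sum of its absolute values. Both choices, and the function names, are assumptions made for illustration.

```python
def covariance_matrix(patterns):
    """Assumed form: g[i][j] = mean over documents of (w_i - p_i) * (w_j - p_j), i != j."""
    m = len(patterns)
    n = len(patterns[0])
    p = [sum(f[i] for f in patterns) / m for i in range(n)]  # mean activity per word
    g = [[0.0] * n for _ in range(n)]
    for f in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:  # no self-connection (assumption)
                    g[i][j] += (f[i] - p[i]) * (f[j] - p[j]) / m
    return g


def normalize_rows(g):
    """Assumed normalization: divide each row by the sum of its absolute values."""
    out = []
    for row in g:
        s = sum(abs(x) for x in row)
        out.append([x / s if s > 0 else 0.0 for x in row])
    return out


# Toy learning set: words 0 and 1 co-occur; word 2 appears only without them.
patterns = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1]]
G = normalize_rows(covariance_matrix(patterns))
```

With this toy data, words that co-occur get a positive link and words that never co-occur get a negative one, and after normalization the total absolute input any node can receive is bounded by 1.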
[0024]
The recall information extraction unit 14 extracts keywords from the input document by associative recall. The recall information extraction unit 14 schematically has a configuration as shown in FIG. 2 and executes an operation as shown in FIG.
[0025]
As shown in FIG. 2, an input feature vector is generated from the word string supplied by the syntax analysis unit 12 and received by the input feature vector storage unit 141, and the associative recall feature vector held in the associative recall feature vector storage unit 142 is initialized with this input feature vector. Thereafter, the associative recall feature vector update unit 143 generates a new associative recall feature vector from the weight matrix in the database 15 and the associative recall feature vector at that point, and updates the stored value. The convergence determination unit 144 determines whether the associative recall feature vector has converged. When it has, the associative recall feature vector output unit 145 outputs the associative recall feature vector at that point, and keywords based on it are output to the user interface unit 11.
[0026]
Hereinafter, the operation will be described in detail with reference to FIG. 3. The n-dimensional vector on which associative recall is performed is defined as N = [N_1, N_2, ..., N_n].
[0027]
[Step S10]: Initialization of the feature vector
When the feature vector obtained from the input document is Q = [Q_1, Q_2, ..., Q_n], the associative recall feature vector N is initialized with the input feature vector Q. That is,
[Equation 6]
$$N_i = Q_i \quad (i = 1, 2, \ldots, n)$$
[0028]
[Step S11]: Random selection of node k
Node states are updated asynchronously. That is, only one arbitrarily chosen node changes state at a time.
[0029]
[Step S12]: Update of N_k
Basically, the dynamics are realized by simple binary model neurons that fire (N_k = 1) when the input is at or above a certain threshold value and stop firing (N_k = 0) when it is below that value. Of course, the invention is not limited to this.
[0030]
The state of the selected node N_k is updated according to the following equation.
[Equation 7]

Adding such a suppression input makes it possible to keep the number of words finally extracted within a certain number. Instead of multiplying by the proportionality constant α, a monotonically increasing function may be used.
Also, the value of
[Equation 8]

is defined as the activity of node k. It represents how large an input the node receives from the other nodes. That is, it can be used as a measure of how active the corresponding word is with respect to the input document.
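The images for Equations 7 and 8 are not reproduced in this text. One form that is consistent with the symbol definitions given in the claims (sgn{} returning 1 or 0, matrix elements G_ki, threshold θ_k, and a suppression term proportional to the number of firing nodes with constant α) would be the following. It is a reconstruction under those assumptions, not the published formula: details such as whether the sums exclude i = k, or whether the suppression term is divided by n, cannot be confirmed from the text, and the symbol a_k for the activity is introduced here only for illustration.

```latex
N_k \leftarrow \operatorname{sgn}\Bigl\{ \sum_{i} G_{ki} N_i \;-\; \theta_k \;-\; \alpha \sum_{i} N_i \Bigr\},
\qquad
a_k = \sum_{i} G_{ki} N_i
```

Under this reading, the first sum is the excitatory input from the currently firing words, and the last term grows with every additional firing node, which is what limits the number of words that can remain active.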
[0031]
[Step S13]: Convergence determination
Various methods of determining convergence are conceivable. Convergence may be declared when the state has stopped changing for a certain number of trials. Alternatively, an upper limit may be placed on the number of trials.
[0032]
[Step S14]: Output of the associative recall feature vector
After convergence has been determined, the associative recall feature vector N is output. One or more keywords are determined based on this associative recall feature vector N.
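Steps S10 to S14 can be sketched as a short loop. This is an illustration under stated assumptions rather than the patented implementation: the update is assumed to be N_k = 1 when Σ_i G_ki N_i − θ_k − α Σ_i N_i is positive and 0 otherwise, nodes are visited in a random permutation per sweep rather than drawn one at a time, and convergence is declared after a full sweep with no change. The names `recall`, `theta`, `alpha`, `max_sweeps`, and `seed` are hypothetical.

```python
import random


def recall(G, theta, q, alpha, max_sweeps=100, seed=0):
    """Asynchronous associative recall with a suppression input (assumed form)."""
    rng = random.Random(seed)
    n = len(q)
    N = list(q)                                   # Step S10: initialize N with Q
    for _ in range(max_sweeps):                   # upper limit on trials
        changed = False
        order = list(range(n))
        rng.shuffle(order)                        # Step S11: random node order
        for k in order:                           # one node changes at a time
            excite = sum(G[k][i] * N[i] for i in range(n))
            inhibit = alpha * sum(N)              # proportional to firing count
            new = 1 if excite - theta[k] - inhibit > 0 else 0   # Step S12
            if new != N[k]:
                N[k] = new
                changed = True
        if not changed:                           # Step S13: no change in a sweep
            break
    return N                                      # Step S14: output N


# Words 0, 1, 2 are mutually related; word 3 is unrelated to everything.
G = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 0],
     [0, 0, 0, 0]]
theta = [0.2] * 4
result = recall(G, theta, q=[1, 1, 0, 1], alpha=0.1)
silenced = recall(G, theta, q=[1, 1, 0, 1], alpha=1.0)
```

In the toy run, word 2 is absent from the input but fires because it is linked to words 0 and 1, while word 3 is present in the input but stops firing because nothing supports it. Raising `alpha` far enough suppresses every node.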
[0033]
According to this embodiment, even for highly biased data such as document knowledge, introducing a suppression input proportional to the total number of firing nodes makes it possible to perform associative recall that depends on the initial state, without any special processing in link learning.
[0034]
It is also possible to control the number of words finally extracted as keywords to a certain number by changing the proportional constant α of the suppression input.
[0035]
Furthermore, it is possible to rank the extracted words by regarding the activity value of each word extracted as a keyword as the importance of the positioning of the keyword with respect to the input document.
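Ranking by activity can be sketched as below. The activity of node k is assumed here to be Σ_i G_ki N_i, following the description of it as the input received from other nodes; the exact formula image is not reproduced in this text, and the function name `rank_keywords` and its `top` parameter are hypothetical.

```python
def rank_keywords(G, N, vocab, top=10):
    """Return (word, activity) pairs for firing nodes, highest activity first."""
    n = len(N)
    activity = {
        vocab[k]: sum(G[k][i] * N[i] for i in range(n))
        for k in range(n)
        if N[k] == 1                      # only words that fired are keywords
    }
    # Sort by descending activity; break ties alphabetically for determinism.
    return sorted(activity.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


G = [[0.0, 0.5, 0.2],
     [0.5, 0.0, 0.1],
     [0.2, 0.1, 0.0]]
vocab = ["a", "b", "c"]
top_two = [w for w, _ in rank_keywords(G, [1, 1, 1], vocab, top=2)]
fired_only = [w for w, _ in rank_keywords(G, [1, 0, 1], vocab)]
```

Only words whose node is firing at convergence are ranked, so a word that was in the input document but was switched off during recall does not appear in the list.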
[0036]
With the associative recall described above, even a word that is present in the input document stops firing if the input it receives from the other firing nodes is small, that is, if it is only weakly related to the other firing words. Conversely, even a word that is absent from the input document fires if it receives a large input from the other nodes, that is, if it is strongly related to the other firing words. This realizes a keyword extraction apparatus that makes use of past knowledge while extracting related keywords specific to each document, including words the document does not contain.
[0037]
Moreover, since the relationships between words are acquired automatically from a given group of learning documents, there is no need to supply field-specific knowledge in advance. Furthermore, whereas conventional approaches using concept dictionaries, semantic networks, and the like require dictionary maintenance on a daily basis, in the present invention new relationships between words can be acquired automatically simply by adding a new group of learning documents, so the maintenance effort is greatly reduced.
[Experimental example]
[0038]
Hereinafter, actual experiments will be specifically described.
[0039]
Verification was performed using inquiry e-mails sent to the customer consultation center of Fuji Xerox Co., Ltd. The number of learning documents was 11,365, and the number of words used when generating feature vectors was 1,000.
[0040]
First, FIG. 4 shows the recall results (with and without the suppression input) when 62 inquiry e-mails different from the learning documents were given as initial states. Without the suppression input, the 62 e-mails converged to almost the same pattern, whereas with the suppression input most of the e-mails converged to different patterns.
[0041]
Next, an application example for a typical complaint mail will be shown.
[0042]
The body of the e-mail can be summarized as follows:
"I purchased a copier from Fuji Xerox and am using it. The other day the ink was not coming out well, so I replaced the cartridge, and an error occurred. When I put the old one back, it worked normally. When I contacted the consultation center, I was told to send in the main unit. I sent it as instructed, but when I called to confirm, I was told it had not arrived. Even when I ask why the ink does not come out, I get vague answers. The call center's bureaucratic handling shows no consideration for the customer's position. I want the responsibility to be made clear."
In the feature vector of this input e-mail, the words whose value was 1 are shown in FIG. 5.
After associative recall, the keywords finally extracted are shown in FIG. 6. The numbers in parentheses are the final activity values, and an asterisk marks a keyword that does not appear in the body of the e-mail.
[0043]
Further, the top 10 words by activity were taken from the extracted keywords. The result is shown in FIG. 7. Among them, keywords such as “complaint” and “responsibility” can be considered to represent the content of the text more accurately than the words that appear in it, so it can be said that more appropriate keywords were obtained for the purpose of making the content of a text easier to grasp.
[0044]
Next, an example of a document processing apparatus that extracts a keyword using the above-described embodiment and registers the keyword together with the document will be described.
[0045]
FIG. 8 shows a document processing apparatus 20 that registers documents while extracting their keywords and enables keyword search of the registered documents. The document processing apparatus 20 basically has the keyword extraction apparatus 10 of FIG. 1 as its main component, with a document database 21, a document registration unit 22, and a document search unit 23 added. In FIG. 8, parts corresponding to those in FIG. 1 are given corresponding reference numerals. In this example, the keywords output on the basis of the associative recall feature vector extracted by the recall information extraction unit 14 are registered in the document database 21 in association with the document, and documents are searched by keyword using the document search unit 23.
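The registration and search roles of the document database 21, document registration unit 22, and document search unit 23 can be pictured with a small in-memory index. This is only a sketch: the class `KeywordIndex`, its method names, and the use of an inverted index are assumptions for illustration, and the keywords are assumed to have been produced already by the extraction step.

```python
from collections import defaultdict


class KeywordIndex:
    """Toy stand-in for the document database with registration and search."""

    def __init__(self):
        self.documents = {}                 # doc_id -> document text
        self.postings = defaultdict(set)    # keyword -> set of doc_ids

    def register(self, doc_id, text, keywords):
        """Store a document together with its extracted keywords."""
        self.documents[doc_id] = text
        for kw in keywords:
            self.postings[kw].add(doc_id)

    def search(self, keyword):
        """Return the ids of documents registered under the keyword, sorted."""
        return sorted(self.postings.get(keyword, ()))


index = KeywordIndex()
index.register("mail-1", "The cartridge shows an error ...", ["cartridge", "complaint"])
index.register("mail-2", "How do I order toner?", ["toner", "order"])
hits = index.search("complaint")
```

Here "mail-1" is found under "complaint" even though that word does not occur in its text, which is the practical benefit of registering recalled keywords alongside the document.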
[0046]
This is the end of the description of the embodiment of the present invention.
[0047]
The present invention is not limited to the above-described embodiment, and various modifications can be made without departing from its spirit. For example, although the embodiment uses a suppression input that is proportional to the firing rate of the entire neural network, or that increases monotonically with that firing rate, the nodes considered may be limited in advance to a predetermined range and the firing rate within that range used as the basis. With non-binary nodes, the sum of activation values may be used in place of the firing rate. The range of nodes to be considered may be determined based on, for example, the frequency of the words corresponding to the nodes. Further, depending on the situation, associative recall may be performed by outputting the converged associative recall feature vector without applying the suppression input.
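The variations of the suppression input listed here can be collected into one small helper. This is a sketch with hypothetical names (`suppression`, `scope`, `F`): it sums the states of either all nodes or a chosen subset, and applies either a proportionality constant or a caller-supplied monotonically increasing function. Because it sums the state values directly, the same code covers the binary case (a firing count) and the non-binary case (a sum of activation values).

```python
def suppression(N, alpha=0.1, scope=None, F=None):
    """Suppression input computed from node states.

    N     : list of node states (0/1, or real-valued activations)
    alpha : proportionality constant, used when F is not given
    scope : optional iterable of node indices to consider (default: all nodes)
    F     : optional monotonically increasing function of the summed state
    """
    indices = range(len(N)) if scope is None else scope
    total = sum(N[i] for i in indices)
    return F(total) if F is not None else alpha * total


whole_network = suppression([1, 0, 1, 1], alpha=0.5)
subset_only = suppression([1, 0, 1, 1], alpha=0.5, scope=[0, 1])
with_function = suppression([1, 0, 1, 1], F=lambda s: s ** 2)
```

Passing `alpha=0` yields no suppression at all, which corresponds to the last variation mentioned, where recall is run without the suppression input.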
[0048]
[Effects of the Invention]
As described above, according to the present invention, associative recall can be appropriately performed even on biased data such as a document pattern, and the calculation cost can be reduced.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a keyword extracting device according to an embodiment of the present invention.
FIG. 2 is a block diagram schematically showing the configuration of the recall information extraction unit 14 in the embodiment of FIG. 1.
FIG. 3 is a flowchart for explaining the operation of the recall information extraction unit 14 in the embodiment of FIG. 1.
FIG. 4 is a diagram for explaining an experimental example.
FIG. 5 is a diagram illustrating an experimental example.
FIG. 6 is a diagram illustrating an experimental example.
FIG. 7 is a diagram illustrating an experimental example.
FIG. 8 is a block diagram showing a configuration example of a document processing apparatus that registers documents together with keywords using the embodiment of FIG. 1.
[Explanation of symbols]
10 Keyword extraction apparatus
11 User interface unit
12 Syntax analysis unit
13 Inter-word correlation learning unit
14 Recall information extraction unit
15 Database
20 Document processing apparatus
21 Document database
22 Document registration unit
23 Document search unit
141 Input feature vector storage unit
142 Associative recall feature vector storage unit
143 Associative recall feature vector update unit
144 Convergence determination unit
145 Associative recall feature vector output unit

Claims

A keyword extraction apparatus comprising:
a database that stores data of an associative memory matrix whose elements are values representing co-occurrence between words included in a predetermined word set;
a syntax analysis unit that extracts, for an input document, a word vector whose elements are values indicating the presence or absence of each word in the word set; and
an associative recall information extraction unit,
wherein the associative recall information extraction unit includes
a storage unit that stores an associative recall feature vector, and
an update unit that updates the associative recall feature vector stored in the storage unit, and
wherein the associative recall information extraction unit
initializes the associative recall feature vector stored in the storage unit with the word vector,
uses the update unit to update, for a randomly determined word w_k, the element N_k of the associative recall feature vector according to

(where sgn{} is a function that returns 1 or 0 according to the value of its argument, G_ki is an element of the associative memory matrix, θ_k is a predetermined threshold, and F() is a monotonically increasing function),
repeats the random selection of a word w_k and the update of the corresponding element N_k until a predetermined convergence condition is satisfied, and
extracts one or more words as keywords based on the associative recall feature vector at the time the convergence condition is satisfied.

The keyword extraction apparatus according to claim 1, wherein, for the word w_k, an activity value expressed by

is obtained, and the keywords are ranked based on the magnitude of the activity value.

The keyword extracting apparatus according to claim 1, wherein the monotonically increasing function is a proportional function.