JP3612125B2

JP3612125B2 - Information filtering method and information filtering apparatus

Info

Publication number: JP3612125B2
Application number: JP32590795A
Authority: JP
Inventors: 誠司三池; 正浩梶浦; 哲也酒井; 顕司小野; 一男住田
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1995-12-14
Filing date: 1995-12-14
Publication date: 2005-01-19
Anticipated expiration: 2015-12-14
Also published as: US6052714A; JPH09167164A

Description

【０００１】
【発明の属する技術分野】
この発明は、膨大なテキスト記事からユーザの要求・興味にあったものを検索して定期的にユーザに提供する情報フィルタリング方法および情報フィルタリング装置。
【０００２】
【従来の技術】
近年、ワードプロセッサーや電子計算機の普及、インターネットなどの計算機ネットワークを介した電子メールや電子ニュースの普及に伴い、文書の電子化は加速的に進みつつある。
【０００３】
電子出版という言葉が示すように、今後は新聞、雑誌や本の情報も電子的に提供されることが一般的になると考えられる。これにより、個人にとってリアルタイムで入手可能となるテキスト情報の量は膨大になっていくと予測される。
【０００４】
これに伴い、新聞や雑誌などの膨大なテキスト記事からユーザの要求・興味にあったものを選出して定期的にユーザに提供する情報フィルタリングシステムあるいは情報フィルタリングサービスの需要が高まりつつある。
【０００５】
一方、大量の情報をコンパクトにするという観点から、文書の抄録を作成する方法についても、従来から研究が行なわれてきた。従来は、最初の段落のみを抽出する方法や予め登録しておいたキーワードを用いてキーワードを含む文のみを抽出する方法などがとられてきた。
【０００６】
【発明が解決しようとする課題】
しかしながら、最初の段落のみを抽出する方法では、必ずしもそこにユーザが必要とする情報が含まれているとは限らず、適切な方法ではなかった。また、キーワードを含む文を並べただけでは、それらの文のつながりがわからないという問題があった。
【０００７】
また、一旦登録したキーワードは、ユーザが変更しなければ追加や修正が行なわれず、これをユーザが行なうことは面倒な作業であった。
そこで、本発明は、上記問題点に鑑みてなされたものであり、情報フィルタリングにより検索された記事について、ユーザへの負担がなく、各ユーザの所望するテーマに合致する読みやすい抄録を生成できる情報フィルタリング方法および情報フィルタリング装置を提供することを目的とする。
【０００８】
【課題を解決するための手段】
本発明の情報フィルタリング方法は、複数の情報源からテキストやイメージなどの記事の配信を受け、それら配信された記事の中から、ユーザ毎に予め指定されたテーマに関する記事を検索してユーザに配信する情報フィルタリング方法において、ユーザ毎に予め指定されたテーマに基づく検索情報を基に、前記情報源から配信された記事からユーザが所望する記事を検索し、この検索された記事の抄録を前記検索情報を基に生成し、この生成された抄録をユーザに配信することにより、抄録を生成する際に、ユーザの興味や関心に関する情報を格納したプロファイルを用いるので、ユーザそれぞれが必要とする情報を含んだ抄録を生成することが可能になる。
【０００９】
また、本発明の情報フィルタリング方法は、複数の情報源からテキストやイメージなどの記事の配信を受け、それら配信された記事の中から、ユーザ毎に予め指定されたテーマに関する記事を検索してユーザに配信する情報フィルタリング方法において、ユーザ毎に予め指定されたテーマに基づく検索情報を基に、前記情報源から配信された記事からユーザが所望する記事を検索し、この検索された記事の特徴と前記検索情報を基に前記検索された記事の抄録を生成し、この生成された抄録をユーザに配信することにより、記事の特徴に応じて、当該文書に適切な抄録生成方法を用いて当該文書の抄録を生成することができる。
【００１０】
また、本発明の情報フィルタリング装置は、複数の情報源からテキストやイメージなどの記事の配信を受け、それら配信された記事の中から、ユーザ毎に予め指定されたテーマに関する記事を検索してユーザに配信する情報フィルタリング装置において、ユーザ毎に予め指定されたテーマに基づく検索情報を基に、前記情報源から配信された記事からユーザが所望する記事を検索する検索手段と、前記検索情報を基に前記検索手段で検索された記事の抄録を生成する抄録生成手段と、この抄録生成手段で生成された抄録をユーザに配信する配信手段とを具備することにより、抄録を生成する際に、ユーザの興味や関心に関する情報を格納したプロファイルを用いるので、ユーザそれぞれが必要とする情報を含んだ抄録を生成することが可能になる。
【００１１】
さらに、本発明の情報フィルタリング装置は、複数の情報源からテキストやイメージなどの記事の配信を受け、それら配信された記事の中から、ユーザ毎に予め指定されたテーマに関する記事を検索してユーザに配信する情報フィルタリング装置において、ユーザ毎に予め指定されたテーマに基づく検索情報を基に、前記情報源から配信された記事からユーザが所望する記事を検索する検索手段と、この検索手段で検索された記事の特徴に応じて、前記検索情報を基に前記検索手段で検索された記事の抄録を生成する複数の抄録生成手段と、前記検索手段で検索された記事の特徴を検出する特徴検出手段と、この特徴検出手段で検出された記事の特徴に基づき前記複数の抄録生成手段のうちの１つを選択する選択手段と、この選択手段で選択された抄録生成手段で生成された抄録をユーザに配信する配信手段とを具備することにより、記事の特徴に応じて、当該文書に適切な抄録生成方法を用いて当該文書の抄録を生成することができる。
【００１２】
【発明の実施の形態】
本発明の一実施形態について図面を参照して説明する。
まず、図１を参照して、本発明の情報フィルタリング装置を用いた情報フィルタリングシステム全体の構成について説明する。
【００１３】
情報フィルタリングシステムは、新聞社、通信社、または出版社などの複数の情報源２からテキスト記事の配信を受け、それを定期的に加入ユーザ端末３それぞれに送信する情報提供システムであり、このシステムの情報提供サービスは情報フィルタリングセンタ１によって実現されている。
【００１４】
情報フィルタリングセンタ１は、通信網を介して複数の情報源２および複数の加入ユーザ端末３に接続された１つの情報フィルタリング装置から構成され、ここには、情報フィルタリングのための制御や処理を行う中央処理装置４、プログラム並びにデータを格納する半導体メモリ、磁気ディスク、光ディスクなどの記憶装置５、回線や電波などの通信網を介して情報源２からテキスト記事を受信する受信部６、回線や電波などの通信網を介してユーザ端末３にテキスト記事を送信したり、ユーザ端末３からの回答等を受信する送受信部７などから構成されている。
【００１５】
各ユーザ端末３は、例えばパーソナルコンピュータやワークステーションなどの情報処理端末であり、情報フィルタリングセンタ１から送信されたテキスト記事を受信したり、情報フィルタリングセンタ１にテキストデータを送信するテキスト情報送受信部８と、受信したテキスト記事を画面表示する表示部９などを備えている。
【００１６】
情報フィルタリングセンタ１は、図２に示されているように、ユーザプロファイル１０と称する一種の検索条件をユーザ毎に保持しており、そのユーザプロファイル１０に従って該当するユーザに提供すベき記事を検索する。ユーザプロファイル１０は、ユーザによって指定された複数のテーマ（トピック）などを基に構成されており、それらテーマに合致する記事が検索および選出されてユーザに送られる。
【００１７】
次に、図３を参照して情報フィルタリングセンタ１の構成について説明する。図３において、ユーザプロファイル格納部５０には、記事を検索するための検索条件に対応するプロファイルが格納されている。ここで、プロファイルの記憶形式とその具体例について図４を参照して説明する。図４（ａ）に示すように、プロファイルには、利用者がほしいと思う記事に含まれると考えられる単語とその重みが記述されている。具体的には、図４（ｂ）に示すように、各ユーザ毎に複数の単語とその重みが記憶されている。なお、プロファイル作成の方法は本発明の主眼ではなく、例えば、人が利用者からどのような記事がほしいかを聞き、該当する記事に含まれると思われる単語を列挙する方法でもよい。
【００１８】
記事格納部５１には、情報源２から配信され、受信部６で受信された記事が格納される。ここで、記事格納部５１に格納されている記事データの記憶形式とその具体例について図５を参照して説明する。
【００１９】
図５（ａ）および（ｂ）に示すように、記事データは、センタ１が管理する全ての記事についてその記事を識別するための記事ＩＤ（例えば、「００１」、「００２」、「００３」…）、記事の発行元（例えば、「Ａ新聞社」、「Ｂ新聞社」、「Ｃ出版社」…）、記事の見出し（例えば、「マルチメディアパソコン発売」、「半導体売行き好調」…）、およびその記事の本文の格納位置を示す記事格納部５１に対するポインタ（例えば、１２３４５６、１２３４５７、１２３４５８）から構成されて、記事格納部５１に記憶されている。
【００２０】
記事検索部５２は、ユーザプロファイル格納部５０に記憶されているユーザプロファイルを用いて、記事格納部５１に格納されている記事を検索する。ここでの検索方法としては、例えば「ＳＭＡＲＴ情報検索システム」（ジェラルド・サルトン編、神保健二監訳、企画センター）に記載されている公知の方法を用いることができる。すなわち、この方法（ベクトル空間法）よれば、たとえば、プロファイルに記憶されている各単語のベクトル空間を仮定し、このベクトル空間上において各単語の重みベクトルと検索対象の記事における対応する単語の出現回数ベクトルとの内積を求めることによりプロファイルに対する類似度を算出し、この類似度の大きい順にランク付けされ、検索された記事を並べて出力することが可能である。
【００２１】
図６に記事検索部５２から出力される検索結果の具体例を示す。図６には、あるユーザプロファイルに対して４件の記事が検索された場合について示している。すなわち、ユーザプロファイルのＩＤとともに、算出された類似度に基づく記事のランク（例えば、「１」、「２」、「３」、「４」）と記事ＩＤから構成される検索結果が、検索結果格納部５３に格納される。
【００２２】
従来のフィルタリングシステムでは、検索結果格納部５３に格納された結果と、記事格納部５１に格納された記事データから、最終的に利用者へ送るフィルタリング結果が作成されて、検索結果格納部５３の所定の記憶領域に記憶されるようになっている。
【００２３】
図７は、利用者へ送られる従来のフィルタリング結果の具体例を示したもので、記事のランク、記事の見出し、その記事の本文などから構成されている。
抄録生成部１００は、検索結果格納部５３に格納されたフィルタリング結果について、フィルタリング結果に含まれる記事の抄録を生成するものである。抄録生成部１００は、図３に示すように、抄録生成制御部１０１、記事特徴検出部１０２、および複数（ここでは、例えば３つ）の生成部１０３−１、１０３−２、１０３−３から構成される。
【００２４】
記事特徴検出部１０２は、記事検索部５２で検索された記事から、例えば、記事の文字数、段落数といった記事の特徴を検出するようになっている。
第１の作成部１０３−１、第２の作成部１０３−２、第３の作成部１０３−３は、それぞれ記事特徴検出部１０２で検出された記事の特徴に適した抄録を生成するようになっている。
【００２５】
抄録生成制御部１０１は、抄録生成部１００全体の制御を司るものである。
次に、図８に示すフローチャートを参照して、抄録生成制御部１０１の処理動作について説明する。
【００２６】
抄録生成制御部１０１は、例えば、記事のランク（ｎ）が高い順に検索結果格納部５３に格納されているフィルタリング結果を取り出し（ステップＳ１〜ステップＳ３）、当該のフィルタリング結果の記事ＩＤについて、順に以下の処理を行なう。まず、記事ＩＤを記事特徴検出部１０２へ渡し、記事特徴検出部１０２から選択すべき生成部の番号（この場合、「１」（第１の生成部１０３−１）、「２」（第２の生成部１０３−２）、「３」（第３の生成部１０３−３）のいずれか）を受け取る（ステップＳ４〜ステップＳ５）。
【００２７】
次に、当該の番号の生成部へ、記事ＩＤとユーザプロファイルのＩＤを渡す（ステップＳ６）。そして、当該の生成部で生成された抄録を受け取り（ステップＳ７）、当該の記事について、図６に示したような検索結果格納部５３に格納されたフィルタリング結果に含まれる記事のランク、記事ＩＤにより記事格納部５１から取り出した記事見出し、および、当該記事の抄録が検索結果格納部５３の所定の記憶領域に記憶され、最終的に利用者へ配信するフィルタリング結果が生成される。
【００２８】
次に、図９に示すフローチャートを参照して記事特徴検出部１０２の処理動作について説明する。
記事特徴検出部１０２は、抄録生成制御部１０１から渡された記事ＩＤを用いて、記事格納部５１から、記事本文へのポインタをたどって、記事本文を取り出す（ステップＳ１０）。次に、取り出した記事本文の文字数と段落数をカウントする（ステップＳ１１）。段落は、行の先頭が空白文字であることを検出して、段落の始まりであるとすることができる。さらに、予め格納されている抄録生成条件テーブルを参照しながら処理を進める。
【００２９】
図１０に、記事特徴検出部１０２に記憶されている抄録生成条件テーブルの形式（図１０（ａ））とその具体例（図１０（ｂ））を示す。抄録生成条件テーブルは、複数の生成部を識別するＩＤ番号（例えば、「１」、「２」、「３」）と、その生成部にて適用される記事の文字数の最大値（例えば、「４００」、「８００」、「１００００」）および段落数の最大値（例えば、「５」、「１０」、「１００」）が格納されている。
【００３０】
さて、記事特徴検出部１０２は、図１０に示したような抄録生成条件テーブルの最初の行に示されているＩＤ番号「１」の第１の生成部１０３−１の文字数および段落数を取り出し、当該の記事本文の文字数および段落数と比較する（ステップＳ１２）。当該の記事本文の文字数と段落数の両方が、抄録作成部１のそれぞれの値より小さい場合には「１」を返して終了する（ステップＳ１３）。
【００３１】
もし、この条件に合わない場合には、抄録生成条件テーブルの次の行、すなわち、ＩＤ番号「２」の第２の生成部１０３−２の文字数および段落数を取り出して同様の処理を行ない、条件に合う場合には「２」を返して終了する（ステップＳ１４〜ステップＳ１５）。さらに、この条件に合わない場合には第３の生成部１０３−３のＩＤ番号「３」を返して終了する（ステップＳ１６）。
【００３２】
抄録生成制御部１０１は、記事特徴検出部１０２から返された数字に対応して、第１〜第３の生成部のうちの１つを選択し、選択した生成部へ記事ＩＤとユーザプロファイルを渡す。
【００３３】
次に、第１〜第３の生成部（１０３−１〜１０３−３）の処理動作について説明する。
まず、第１の生成部１０３−１の処理動作について説明する。第１の生成部１０３−１は、抄録生成制御部１０１から渡された記事ＩＤを用いて、記事格納部５１から記事の最初の段落を取り出す。文字数や段落数が少ない記事では、第１段落に主な情報が書かれていることが多いので、記事格納部５１から取り出された記事の第１段落を取り出す処理を行なっている。
【００３４】
次に、第２の生成部１０３−２の処理動作について図１１に示すフローチャートを参照して説明する。第２の生成部１０３−２は、まず、抄録生成制御部１０１から渡された記事ＩＤを用いて、記事格納部５１から記事の本文を取り出し格納し（ステップＳ２０）、その本文について各段落の類似度を算出する（ステップＳ２１）。
【００３５】
ここで、段落類似度の算出処理動作について、図１２に示すフローチャートを参照して説明する。
まず、記事格納部５１から取り出された記事本文の第１段落から順に、ひとつずつ段落を取り出し（ステップＳ３０〜ステップＳ３３）、ユーザプロファイルのＩＤを用いてユーザプロファイル格納部５０から所望のユーザプロファイルを取り出して、このユーザプロファイルを用いて段落の類似度を算出し（ステップＳ３４）、その結果を蓄えていく（ステップＳ３５）。
【００３６】
段落の類似度の算出は、前述の記事検索部５２が記事全体を対象として行なう類似度の算出処理（ベクトル空間法）を、段落を対象にして行なうことによって実現できる。
【００３７】
全ての段落について類似度を算出してその結果が蓄えられたら（ステップＳ３６、ステップＳ３１）、最後に、蓄えた結果を段落の類似度の順に並べ換える（ステップＳ３２）。なお、この段落類似度の算出結果は、第２の生成部１０３−２に割り当てられた検索結果格納部５３の所定の記憶領域に、例えば、図１３に示すように記憶される。図１３（ａ）に示すように、段落類似度の算出結果は、段落の番号、段落の類似度、段落の内容から構成されていて、段落の類似度が高いものから順に並んでいる（図１３（ｂ））。
【００３８】
図１１の説明にもどり、第２の生成部１０３−２は、図１３に示したような段落類似度算出結果を基に、最も類似度が高かった段落の番号を取り出す（ステップＳ２２）。もし、当該の段落の番号が「１」である場合には、第１段落のみを抄録生成制御部１０１へ返す（ステップＳ２３、ステップＳ２５）。そうでないない場合には、格納しておいた第１段落と、最も類似度が高かった段落の両方を、抄録生成制御部１０１へ返す（ステップＳ２３、ステップＳ２４）。
【００３９】
このように、第２の生成部１０３−２では、主な情報が書かれている第１段落と、利用者の関心が高い（類似度の最も高い）段落の両方を取り出す処理を行なっている。これは、例えば、国や地方自治体の予算に関する記事では、予算全体に関する情報が最初の段落に書かれ、予算の個々の細目については、第２段落以降の段落に分けて書かれるので、ある利用者がマルチメディアに関心がある場合、プロファイルを用いることによって、マルチメディアに関する予算の割当などについて書かれた段落を取り出すことができ、第１段落とともに利用者へ送られることになる。従って、利用者は、第１段落により、記事の全体の情報を得るとともに、自分の関心がある情報についても知ることができる。
【００４０】
なお、第１段落の他には、最も類似度の高い段落だけでなく、最も類似度の高い段落から複数の段落を選ぶようにすることは容易に実現できる。
次に、第３の生成部１０３−３の処理動作について図１４に示すフローチャートを参照して説明する。
【００４１】
第３の生成部１０３−３は、抄録生成制御部１０１から受け取った記事ＩＤを用いて、記事格納部５１から記事本文を取り出し（ステップＳ４０）、図１２に示したユーザプロファイルを用いた段落類似度算出処理を行って、全ての段落の類似度を算出し、第３の生成部１０３−２に割り当てられた検索結果格納部５３の所定の記憶領域に、例えば、図１３に示すように記憶される（ステップＳ４１）。
【００４２】
その結果をもとに、次に、例えば、段落の総数を３で割った値（整数）の数だけ、類似度の最も大きい段落から順に取り出す。段落の総数を「３」で割った値を用いることにより、約１／３の長さの抄録を生成することができる（ステップＳ４２）。例えば、記事の全体が１４の段落からなる場合には、類似度の最も高い段落から４つの段落が取り出されることになる。このように第３の生成部１０３−３は、雑誌の記事のように、記事全体の長さが長い場合に、利用者が関心のある情報が含まれると考えられる段落を取り出して抄録を作成する処理を行なっている。
【００４３】
なお、段落の総数を割る値は「３」に限るものではなく、他の値でもよい。また、利用者が予め指定した値を設定するようにすることも容易に実現できる。
以上の第１〜第３の生成部１０３−１〜１０３−３のいづれかの生成部で生成された抄録を抄録生成制御部１０１が受け取り（図８のステップＳ７）、当該の記事についての記事ランク、記事見出し、および、当該記事の抄録等が、検索結果格納部５３の所定の記憶領域に記憶され、最終的に各ユーザへ配信されるフィルタリング結果が生成される。
【００４４】
図１５に抄録生成部１００で生成されたフィルタリング結果の具体例を示す。図１５に示すように、フィルタリング結果は、記事のランク、記事の見出し、記事の全文から一部の段落を取り出して作成した抄録から構成されている。
【００４５】
このような構成で検索結果格納部５３に記憶されたフィルタリング結果は、テキスト情報送受信部７および所定の通信網を介して、例えば、電子メールやＦＡＸで利用者へ送られる。利用者は、送られてきた記事を読み、自分にとって関心がある記事であるかどうかの観点で「○」や「×」を付けて情報フィルタリングセンタ１へ送り返すことができる。すなわち、ユーザ端末から所定の通信網およびテキスト情報送受信部７を介して送り返されてきた回答は、ユーザ回答格納部６０へ格納される。
【００４６】
図１６にユーザ回答格納部６０に格納されたユーザの回答データの具体例を示す。利用者は、関心のあった記事に「○」、関心のなかった記事に「×」、どちらでもない記事はそのままとしている。ユーザの回答は、さらにユーザが必要とする記事を選択できるように、プロファイルを修正するために利用される。
【００４７】
プロファイル修正部６１は、図１６に示したようなユーザの回答と、ユーザ回答中の記事ＩＤを用いて記事格納部５１から取り出した記事本文とから、ユーザプロファイル格納部５０に格納されているユーザプロファイルを修正する。プロファイルを修正する方法は、例えば、「ＳＭＡＲＴ情報検索システム」（ジェラルド・サルトン編、神保健二監訳、企画センター）に記載されている方法が適用できよう。すなわち、この方法は、ユーザが関心のあった記事の中から単語の出現頻度を計数して、その計数値が最も大きい単語をユーザプロファイルに新たに追加するというものである。
【００４８】
以上説明したように、上記実施形態によれば、ユーザプロファイル格納部５０に格納されている各ユーザ毎のプロファイルは、ユーザの回答に基づいて常に修正・調整され、このプロファイルを用いて記事検索部５２で記事格納庫５１に格納された記事との類似度を算出してユーザが所望する情報を含んだ記事の検索を行い、さらに、この検索された記事について、抄録生成部１００で同じくプロファイルを用いて類似度を算出することにより抄録を生成することにより、ユーザが必要とする情報を含んだ抄録の作成が容易に行える。
【００４９】
また、抄録生成部１００で検索された記事の抄録を生成する際、まず、記事特徴検出部１０２で、その記事の特徴、具体的には、例えば、文字数、段落数を求めて、第１〜第３の生成部１０３−１〜１０３−３のうちの１つを選択して、その記事の特徴に適した抄録生成方法を選択することにより、ユーザが必要とする情報を含んだ抄録の作成が容易に行える。
【００５０】
さらに、このようにして生成された抄録を各ユーザに配信することにより、情報フィルタリングにより検索されて配信された記事内容から各利用者が情報を得るための時間を短縮し、労力を軽減することができる。
【００５１】
【発明の効果】
以上説明したように、本発明によれば、情報フィルタリングにより検索された記事について、ユーザが必要とする情報を含んだ抄録の作成が容易に行える情報フィルタリング方法および情報フィルタリング装置を提供できる。
【図面の簡単な説明】
【図１】本発明の一実施形態に係る情報フィルタリング装置のシステム全体の構成を示すブロック図。
【図２】図１のシステムの運用形態を概念的に示す図。
【図３】情報フィルタリング装置の構成を示したブロック図。
【図４】ユーザプロファイルの記憶形式とその具体例について説明するための図。
【図５】記事格納部に格納される記事データの記憶形式とその具体例について説明するための図。
【図６】記事検索部から出力され、検索結果格納部に格納される検索結果の記憶形式とその具体例について説明するための図。
【図７】従来のフィルタリング装置の出力形態の具体例を示した図。
【図８】抄録生成制御部の処理動作を説明するためのフローチャート。
【図９】抄録生成部の記事特徴検出部の処理動作を説明するためのフローチャート。
【図１０】記事特徴検出部が参照する抄録生成条件テーブルの記憶形式とその具体例について説明するための図。
【図１１】抄録生成部の第２の生成部の処理動作を説明するためのフローチャート。
【図１２】段落類似度算出処理について説明するためのフローチャート。
【図１３】段落類似度の算出結果の記憶形式とその具体例について説明するための図。
【図１４】抄録生成部の第３の生成部の処理動作について説明するためのフローチャート。
【図１５】抄録生成部で生成された抄録に基づき生成されて検索結果格納部に格納され各ユーザに配信される情報フィルタリング結果の具体例を示した図。
【図１６】ユーザ回答格納部に格納されたユーザの回答データの具体例を示した図。
【符号の説明】
１…情報フィルタリングセンタ、２…情報源、３…ユーザ端末、４…中央処理部、５…記憶部、６…テキスト情報受信部、７…テキスト情報送受信部、５０…ユーザプロファイル格納部、５１…記事格納部、５２…記事検索部、５３…検索結果格納部、６０…ユーザ回答格納部、６１…プロファイル修正部、１００…抄録生成部、１０１…抄録生成制御部、１０２…記事特徴検出部、１０３−１…第１の生成部、１０３−２…第２の生成部、１０３−３…第３の生成部。
[0001]
[Technical field of the invention]
The present invention relates to an information filtering method and an information filtering device that search an enormous volume of text articles for those matching a user's requests and interests and periodically provide them to the user.
[0002]
[Prior art]
In recent years, with the spread of word processors and electronic computers, and the spread of electronic mail and electronic news via computer networks such as the Internet, the digitization of documents is accelerating.
[0003]
As the term "electronic publishing" suggests, information from newspapers, magazines, and books is expected to be commonly provided in electronic form in the future. As a result, the amount of text information that an individual can obtain in real time is expected to become enormous.
[0004]
Accordingly, there is an increasing demand for information filtering systems and information filtering services that select, from a huge number of text articles such as newspapers and magazines, those matching a user's requests and interests and periodically provide them to the user.
[0005]
Meanwhile, from the viewpoint of making large amounts of information compact, methods for creating abstracts of documents have also long been studied. Conventional methods include extracting only the first paragraph, and extracting only the sentences that contain keywords registered in advance.
[0006]
[Problems to be solved by the invention]
However, the first paragraph does not necessarily contain the information the user needs, so extracting only that paragraph is not an appropriate method. Moreover, merely lining up sentences that contain keywords leaves the connections between those sentences unclear.
[0007]
Also, once keywords have been registered, they are neither added to nor modified unless the user changes them, and doing so is a tedious task for the user.
The present invention has been made in view of the above problems, and its object is to provide an information filtering method and an information filtering device that can generate, for articles retrieved by information filtering, an easy-to-read abstract that matches each user's desired themes without placing a burden on the user.
[0008]
[Means for Solving the Problems]
The information filtering method of the present invention receives articles such as text and images distributed from a plurality of information sources, searches the distributed articles for articles on themes specified in advance for each user, and distributes them to the user. In this method, articles desired by the user are retrieved from the articles distributed from the information sources on the basis of search information derived from the themes specified in advance for each user, an abstract of each retrieved article is generated on the basis of that search information, and the generated abstract is distributed to the user. Because a profile storing information about the user's interests and concerns is used when generating the abstract, an abstract containing the information each user needs can be generated.
[0009]
Further, the information filtering method of the present invention receives articles such as text and images distributed from a plurality of information sources, searches the distributed articles for articles on themes specified in advance for each user, and distributes them to the user. In this method, articles desired by the user are retrieved from the articles distributed from the information sources on the basis of search information derived from the themes specified in advance for each user, an abstract of each retrieved article is generated on the basis of the features of that article and the search information, and the generated abstract is distributed to the user. An abstract of each document can thus be generated with an abstract generation method suited to that document according to the features of the article.
[0010]
Further, the information filtering device of the present invention receives articles such as text and images distributed from a plurality of information sources, searches the distributed articles for articles on themes specified in advance for each user, and distributes them to the user. The device comprises: search means for retrieving articles desired by the user from the articles distributed from the information sources on the basis of search information derived from the themes specified in advance for each user; abstract generation means for generating, on the basis of the search information, an abstract of each article retrieved by the search means; and distribution means for distributing the abstract generated by the abstract generation means to the user. Because a profile storing information about the user's interests and concerns is used when generating the abstract, an abstract containing the information each user needs can be generated.
[0011]
Furthermore, the information filtering device of the present invention receives articles such as text and images distributed from a plurality of information sources, searches the distributed articles for articles on themes specified in advance for each user, and distributes them to the user. The device comprises: search means for retrieving articles desired by the user from the articles distributed from the information sources on the basis of search information derived from the themes specified in advance for each user; a plurality of abstract generation means, each of which generates, on the basis of the search information, an abstract of an article retrieved by the search means in a manner suited to the features of that article; feature detection means for detecting the features of the article retrieved by the search means; selection means for selecting one of the plurality of abstract generation means on the basis of the features detected by the feature detection means; and distribution means for distributing the abstract generated by the selected abstract generation means to the user. An abstract of each document can thus be generated with an abstract generation method suited to that document according to the features of the article.
[0012]
[Embodiments of the invention]
An embodiment of the present invention will be described with reference to the drawings.
First, the overall configuration of an information filtering system using the information filtering device of the present invention will be described with reference to FIG. 1.
[0013]
The information filtering system is an information providing system that receives text articles from a plurality of information sources 2, such as newspaper companies, news agencies, or publishers, and periodically transmits them to each of the subscriber user terminals 3. The information providing service of this system is realized by the information filtering center 1.
[0014]
The information filtering center 1 consists of a single information filtering device connected via a communication network to the plurality of information sources 2 and the plurality of subscriber user terminals 3. The device comprises a central processing unit 4 that performs control and processing for information filtering; a storage device 5, such as a semiconductor memory, magnetic disk, or optical disk, that stores programs and data; a receiving unit 6 that receives text articles from the information sources 2 via a communication network such as a line or radio waves; and a transmitting/receiving unit 7 that transmits text articles to the user terminals 3 and receives replies and the like from the user terminals 3 via a communication network such as a line or radio waves.
[0015]
Each user terminal 3 is an information processing terminal such as a personal computer or a workstation. It includes a text information transmitting/receiving unit 8, which receives text articles transmitted from the information filtering center 1 and transmits text data to the information filtering center 1, and a display unit 9, which displays the received text articles on a screen.
[0016]
As shown in FIG. 2, the information filtering center 1 holds, for each user, a kind of search condition called a user profile 10, and searches for articles to be provided to the corresponding user according to that user profile 10. The user profile 10 is built from a plurality of themes (topics) and the like designated by the user, and articles matching those themes are retrieved, selected, and sent to the user.
[0017]
Next, the configuration of the information filtering center 1 will be described with reference to FIG. 3. In FIG. 3, the user profile storage unit 50 stores profiles that correspond to the search conditions for retrieving articles. The storage format of a profile and a specific example will be described with reference to FIG. 4. As shown in FIG. 4A, a profile lists words that are expected to appear in the articles the user wants, together with their weights. Specifically, as shown in FIG. 4B, a plurality of words and their weights are stored for each user. The method of creating a profile is not the main point of the present invention; for example, a person may ask the user what kinds of articles are wanted and list the words expected to appear in such articles.
[0018]
The article storage unit 51 stores articles distributed from the information sources 2 and received by the receiving unit 6. The storage format of the article data stored in the article storage unit 51 and a specific example will be described with reference to FIG. 5.
[0019]
As shown in FIGS. 5A and 5B, the article data consists, for every article managed by the center 1, of an article ID that identifies the article (for example, "001", "002", "003", ...), the publisher of the article (for example, "A newspaper publisher", "B newspaper publisher", "C publisher", ...), the headline of the article (for example, "Multimedia PC released", "Semiconductor sales strong", ...), and a pointer into the article storage unit 51 indicating where the body of the article is stored (for example, 123456, 123457, 123458). This data is stored in the article storage unit 51.
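The profile format of FIG. 4 and the article data format of FIG. 5 can be pictured with a small data-structure sketch. This is only an illustration: the class name `ArticleRecord`, the field names, the user ID `user-1`, and the weight values are assumptions, since the figures themselves are not reproduced in the text.

```python
from dataclasses import dataclass


@dataclass
class ArticleRecord:
    """One row of the article data described for FIG. 5 (field names assumed)."""
    article_id: str    # e.g. "001"
    publisher: str     # e.g. "A newspaper publisher"
    headline: str
    body_pointer: int  # storage position of the article body


# Example rows using the IDs, publishers, headlines, and pointers quoted in the text.
ARTICLES = [
    ArticleRecord("001", "A newspaper publisher", "Multimedia PC released", 123456),
    ArticleRecord("002", "B newspaper publisher", "Semiconductor sales strong", 123457),
]

# Per-user profile: word -> weight. The user ID and weights are hypothetical.
PROFILES = {
    "user-1": {"multimedia": 0.8, "semiconductor": 0.5},
}
```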
[0020]
The article search unit 52 searches the articles stored in the article storage unit 51 using the user profile stored in the user profile storage unit 50. As the search method, a publicly known method can be used, for example the one described in "SMART Information Retrieval System" (edited by Gerard Salton, translation supervised by 神保健二, Kikaku Center). According to this method (the vector space method), a vector space spanned by the words stored in the profile is assumed, and the similarity of an article to the profile is calculated in that space as the inner product of the vector of word weights and the vector of occurrence counts of the corresponding words in the article being searched. The retrieved articles can then be ranked and output in descending order of similarity.
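The inner-product similarity and ranking described above can be sketched as follows. This is a minimal illustration, not the SMART implementation: the function names are hypothetical, words are split on whitespace (Japanese text would need a word segmenter), and treating zero-similarity articles as not retrieved is an assumption.

```python
from collections import Counter


def similarity(profile, text):
    """Inner product of the profile's word weights and the word counts in text."""
    counts = Counter(text.lower().split())  # assumption: whitespace tokenization
    return sum(weight * counts[word] for word, weight in profile.items())


def rank_articles(profile, articles):
    """Return (rank, article_id) pairs in descending order of similarity."""
    scored = [(similarity(profile, body), article_id)
              for article_id, body in articles.items()]
    # Assumption: articles with zero similarity are treated as not retrieved.
    hits = [pair for pair in scored if pair[0] > 0]
    hits.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(rank, article_id)
            for rank, (_, article_id) in enumerate(hits, start=1)]
```

The returned list of ranks and article IDs has the same shape as the search result described for FIG. 6.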
[0021]
FIG. 6 shows a specific example of the search result output from the article search unit 52, for a case in which four articles were retrieved for a certain user profile. The search result, consisting of the ID of the user profile together with the rank of each article based on the calculated similarity (for example, "1", "2", "3", "4") and its article ID, is stored in the search result storage unit 53.
[0022]
In a conventional filtering system, the filtering result that is finally sent to the user is created from the result stored in the search result storage unit 53 and the article data stored in the article storage unit 51, and is stored in a predetermined storage area of the search result storage unit 53.
[0023]
FIG. 7 shows a specific example of a conventional filtering result sent to the user, which consists of the rank of the article, the headline of the article, the body of the article, and the like.
The abstract generation unit 100 generates abstracts of the articles included in the filtering result stored in the search result storage unit 53. As illustrated in FIG. 3, the abstract generation unit 100 consists of an abstract generation control unit 101, an article feature detection unit 102, and a plurality of (here, for example, three) generation units 103-1, 103-2, and 103-3.
[0024]
The article feature detection unit 102 detects features of an article retrieved by the article search unit 52, such as its number of characters and number of paragraphs.
The first generation unit 103-1, the second generation unit 103-2, and the third generation unit 103-3 each generate an abstract suited to the article features detected by the article feature detection unit 102.
[0025]
The abstract generation control unit 101 controls the entire abstract generation unit 100.
Next, the processing operation of the abstract generation control unit 101 will be described with reference to the flowchart shown in FIG. 8.
[0026]
The abstract generation control unit 101 takes out the filtering results stored in the search result storage unit 53, for example in order from the highest article rank (n) (steps S1 to S3), and performs the following processing on the article ID of each result in turn. First, it passes the article ID to the article feature detection unit 102 and receives from it the number of the generation unit to be selected (in this case, one of "1" (first generation unit 103-1), "2" (second generation unit 103-2), or "3" (third generation unit 103-3)) (steps S4 to S5).
[0027]
Next, the article ID and the ID of the user profile are passed to the generation unit with that number (step S6). The abstract generated by that generation unit is then received (step S7). For the article in question, the article rank contained in the filtering result stored in the search result storage unit 53 as shown in FIG. 6, the article headline taken from the article storage unit 51 using the article ID, and the abstract of the article are stored in a predetermined storage area of the search result storage unit 53, producing the filtering result that is finally distributed to the user.
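The control flow of steps S1 to S7 can be sketched as a simple loop. This is an illustrative outline under stated assumptions: the function `build_filtering_result`, the dictionary layout of `articles`, and the callables `detect` and `generators` are hypothetical stand-ins for the article feature detection unit 102 and the generation units, and the article body and profile are passed directly rather than by ID.

```python
def build_filtering_result(search_results, articles, profile, detect, generators):
    """Assemble (rank, headline, abstract) entries in rank order.

    search_results: iterable of (rank, article_id) pairs (cf. FIG. 6)
    articles:       article_id -> {"headline": str, "body": str} (assumed layout)
    detect:         callable returning the generation unit number for a body
    generators:     generation unit number -> callable(profile, body)
    """
    filtering_result = []
    for rank, article_id in sorted(search_results):  # rank 1 first (steps S1-S3)
        article = articles[article_id]
        number = detect(article["body"])              # steps S4-S5
        abstract = generators[number](profile, article["body"])  # steps S6-S7
        filtering_result.append(
            {"rank": rank, "headline": article["headline"], "abstract": abstract})
    return filtering_result
```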
[0028]
Next, the processing operation of the article feature detection unit 102 will be described with reference to the flowchart shown in FIG. 9.
Using the article ID passed from the abstract generation control unit 101, the article feature detection unit 102 follows the pointer to the article body in the article storage unit 51 and takes out the article body (step S10). Next, it counts the number of characters and the number of paragraphs in the article body (step S11). The start of a paragraph can be identified by detecting that a line begins with a whitespace character. The processing then proceeds with reference to an abstract generation condition table stored in advance.
[0029]
FIG. 10 shows the format of the abstract generation condition table stored in the article feature detection unit 102 (FIG. 10A) and a specific example (FIG. 10B). The abstract generation condition table stores ID numbers identifying the generation units (for example, "1", "2", "3") together with, for each generation unit, the maximum number of characters (for example, "400", "800", "10000") and the maximum number of paragraphs (for example, "5", "10", "100") of the articles to which it is applied.
[0030]
The article feature detection unit 102 takes out the number of characters and the number of paragraphs for the first generation unit 103-1, which has ID number "1" and appears in the first row of the abstract generation condition table shown in FIG. 10, and compares them with the number of characters and the number of paragraphs of the article body (step S12). If both the number of characters and the number of paragraphs of the article body are smaller than the respective values for the first generation unit 103-1, "1" is returned and the process ends (step S13).
[0031]
If this condition is not met, the next row of the abstract generation condition table, that is, the number of characters and the number of paragraphs for the second generation unit 103-2 with ID number "2", is taken out and the same comparison is performed; if the condition is met, "2" is returned and the process ends (steps S14 to S15). If this condition is not met either, the ID number "3" of the third generation unit 103-3 is returned and the process ends (step S16).
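Steps S11 to S16 can be sketched as a lookup against the condition table. The table values below are the examples quoted for FIG. 10B; the function names are hypothetical, and counting every character of the body (including spaces and line breaks) is an assumption, since the text does not say how characters are counted.

```python
# (generation unit ID, maximum characters, maximum paragraphs), per FIG. 10B.
CONDITION_TABLE = [(1, 400, 5), (2, 800, 10), (3, 10000, 100)]


def count_paragraphs(body):
    """Count lines that begin with a whitespace character (paragraph starts)."""
    return sum(1 for line in body.splitlines() if line[:1].isspace())


def select_generator(body, table=CONDITION_TABLE):
    """Return the ID of the generation unit suited to this article body."""
    n_chars = len(body)  # assumption: every character counts, including spaces
    n_paragraphs = count_paragraphs(body)
    # Steps S12-S15: test each row except the last; both values must be smaller.
    for generator_id, max_chars, max_paragraphs in table[:-1]:
        if n_chars < max_chars and n_paragraphs < max_paragraphs:
            return generator_id
    # Step S16: fall through to the last generation unit.
    return table[-1][0]
```

Note that the last row's limits are never tested in this flow: any article that fails the first two rows is assigned to generation unit 3.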
[0032]
The abstract generation control unit 101 selects one of the first to third generation units according to the number returned from the article feature detection unit 102, and passes the article ID and the user profile to the selected generation unit.
[0033]
Next, processing operations of the first to third generation units (103-1 to 103-3) will be described.
First, the processing operation of the first generation unit 103-1 will be described. Using the article ID passed from the abstract generation control unit 101, the first generation unit 103-1 takes out the first paragraph of the article from the article storage unit 51. In articles with few characters and paragraphs, the main information is often written in the first paragraph, which is why this unit extracts the first paragraph of the article.
[0034]
Next, the processing operation of the second generation unit 103-2 will be described with reference to the flowchart shown in FIG. 11. Using the article ID passed from the abstract generation control unit 101, the second generation unit 103-2 first takes out the body of the article from the article storage unit 51 and stores it (step S20), and then calculates the similarity of each paragraph of that body (step S21).
[0035]
The paragraph similarity calculation process will now be described with reference to the flowchart shown in FIG. 12.
Paragraphs are taken out one at a time, in order from the first paragraph of the article body retrieved from the article storage unit 51 (steps S30 to S33). The desired user profile is taken out of the user profile storage unit 50 using the ID of the user profile, the similarity of the paragraph is calculated using this user profile (step S34), and the result is accumulated (step S35).
[0036]
The similarity of a paragraph can be calculated by applying to each paragraph the same similarity calculation (vector space method) that the article search unit 52 applies to an entire article.
[0037]
When the similarities have been calculated for all paragraphs and the results accumulated (steps S36 and S31), the accumulated results are finally sorted by paragraph similarity (step S32). The paragraph similarity calculation result is stored in a predetermined storage area of the search result storage unit 53 assigned to the second generation unit 103-2, for example as shown in FIG. 13. As shown in FIG. 13A, the result consists of the paragraph number, the paragraph similarity, and the paragraph contents, arranged in descending order of paragraph similarity (FIG. 13B).
[0038]
Returning to the description of FIG. 11, the second generation unit 103-2 takes out the number of the paragraph with the highest similarity from the paragraph similarity calculation result shown in FIG. 13 (step S22). If that paragraph number is "1", only the first paragraph is returned to the abstract generation control unit 101 (steps S23 and S25). Otherwise, both the stored first paragraph and the paragraph with the highest similarity are returned to the abstract generation control unit 101 (steps S23 and S24).
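The paragraph scoring of FIG. 12 and the selection logic of FIG. 11 can be sketched together as follows. This is a minimal illustration, assuming whitespace tokenization and a profile given as a word-to-weight dictionary; the function names are hypothetical, and ties in similarity are resolved in document order, which the text does not specify.

```python
from collections import Counter


def similarity(profile, text):
    """Inner product of profile weights and word counts (vector space method)."""
    counts = Counter(text.lower().split())  # assumption: whitespace tokenization
    return sum(weight * counts[word] for word, weight in profile.items())


def score_paragraphs(profile, paragraphs):
    """Return (paragraph number, similarity, text) rows, most similar first."""
    rows = [(number, similarity(profile, text), text)
            for number, text in enumerate(paragraphs, start=1)]
    rows.sort(key=lambda row: -row[1])  # stable sort: ties keep document order
    return rows


def second_generator(profile, paragraphs):
    """First paragraph, plus the most similar paragraph if it is a different one."""
    first = paragraphs[0]
    best_number, _, best_text = score_paragraphs(profile, paragraphs)[0]
    if best_number == 1:          # steps S23, S25
        return [first]
    return [first, best_text]     # steps S23, S24
```

The rows returned by `score_paragraphs` mirror the layout described for FIG. 13: paragraph number, similarity, and contents, sorted by similarity.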
[0039]
As described above, the second generation unit 103-2 extracts both the first paragraph, in which the main information is written, and the paragraph of greatest interest to the user (the one with the highest similarity). For example, in an article on a national or local government budget, information about the budget as a whole is written in the first paragraph, and the individual budget items are written in separate paragraphs from the second paragraph onward. If a user is interested in multimedia, using the profile makes it possible to extract the paragraph describing matters such as the budget allocation for multimedia, and this paragraph is sent to the user together with the first paragraph. The user can therefore obtain the overall information of the article from the first paragraph and also learn the information of personal interest.
[0040]
In addition to the first paragraph, it is straightforward to select not just the single most similar paragraph but several paragraphs in descending order of similarity.
Next, the processing operation of the third generation unit 103-3 will be described with reference to the flowchart shown in FIG. 14.
[0041]
The third generation unit 103-3 retrieves the article text from the article storage unit 51 using the article ID received from the abstract generation control unit 101 (step S40). Using the user profile shown in FIG., it performs the paragraph similarity calculation process to calculate the similarities of all paragraphs, and stores the results in a predetermined storage area of the search result storage unit 53 assigned to the third generation unit 103-3, for example, as shown in FIG. (step S41).
[0042]
Based on this result, paragraphs are then extracted in descending order of similarity. The number of paragraphs extracted is, for example, the total number of paragraphs divided by 3, taken as an integer, which yields an abstract about 1/3 the length of the article (step S42). For example, if the entire article consists of 14 paragraphs, the 4 paragraphs with the highest similarity are extracted. In this way, when the entire article is long, such as a magazine article, the third generation unit 103-3 creates an abstract by extracting the paragraphs considered to contain information of interest to the user.
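Step S42 amounts to an integer division followed by a slice of the sorted results. The sketch below assumes the same sorted (paragraph number, similarity, contents) records as the stored calculation result; `third_unit_abstract` and `divisor` are hypothetical names, and returning the paragraphs in similarity order rather than document order is an assumption.

```python
from typing import List, Tuple

# (paragraph number, similarity, paragraph contents), sorted by similarity.
ParagraphScore = Tuple[int, float, str]


def third_unit_abstract(ranked: List[ParagraphScore],
                        divisor: int = 3) -> List[ParagraphScore]:
    """Keep the top (total paragraphs // divisor) paragraphs by similarity."""
    # Integer division drops the remainder: 14 paragraphs // 3 gives 4.
    count = len(ranked) // divisor
    return ranked[:count]
```

With 14 paragraphs and the default divisor of 3, four records are kept, matching the example in the text. Passing a different `divisor` corresponds to using a value other than “3”.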
[0043]
The value that divides the total number of paragraphs is not limited to “3”, and may be another value. It is also easy to set a value designated in advance by the user.
The abstract generation control unit 101 receives the abstract generated by any one of the first to third generation units 103-1 to 103-3 (step S7 in FIG. 8), stores the article rank, the article headline, the abstract of the article, and the like in a predetermined storage area of the search result storage unit 53, and thereby generates the filtering result that is finally distributed to each user.
[0044]
FIG. 15 shows a specific example of the filtering result generated by the abstract generation unit 100. As shown in FIG. 15, the filtering result consists of an article rank, an article headline, and an abstract created by extracting some of the paragraphs from the entire article.
[0045]
The filtering result stored in the search result storage unit 53 in this form is sent to the user, for example by e-mail or FAX, via the text information transmission/reception unit 7 and a predetermined communication network. The user reads the articles sent and returns them to the information filtering center 1 marked with “○” or “×” according to whether each article is of interest. The answer sent back from the user terminal via the predetermined communication network and the text information transmission/reception unit 7 is stored in the user answer storage unit 60.
[0046]
FIG. 16 shows a specific example of user answer data stored in the user answer storage unit 60. The user marks “○” for articles of interest and “×” for articles not of interest, and leaves articles that are neither as they are. The user's answer is further used to modify the profile so that the articles the user needs can be selected.
[0047]
The profile correction unit 61 modifies the user profile stored in the user profile storage unit 50 based on the user's answer as shown in FIG. 16 and the article text extracted from the article storage unit 51 using the article ID in the user answer. As a method for correcting the profile, for example, the method described in “SMART information retrieval system” (edited by Gerard Salton, translated by Shinjin Koji, Planning Center) may be applied. In other words, this method counts the appearance frequency of words in the articles that the user is interested in, and newly adds the word having the largest count to the user profile.
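The frequency-counting correction described above can be sketched as follows. This is an illustration under several assumptions, not the method of the cited book: whitespace tokenization, skipping words already in the profile, and the `new_weight` value given to the added word are all hypothetical choices, as are the names `correct_profile` and `interesting_articles`.

```python
from collections import Counter
from typing import Dict, List


def correct_profile(profile: Dict[str, float],
                    interesting_articles: List[str],
                    new_weight: float = 1.0) -> Dict[str, float]:
    """Add the most frequent not-yet-registered word to a copy of the profile."""
    counts: Counter = Counter()
    for text in interesting_articles:
        # Assumed tokenization: lowercase and split on whitespace.
        counts.update(text.lower().split())
    updated = dict(profile)  # leave the caller's profile untouched
    for word, _ in counts.most_common():
        if word not in updated:
            # Assumption: the added word receives a fixed initial weight.
            updated[word] = new_weight
            break
    return updated
```

A practical version would also need to exclude function words, which otherwise tend to have the highest counts.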
[0048]
As described above, according to the above-described embodiment, the profile for each user stored in the user profile storage unit 50 is continually corrected and adjusted based on the user's answers. The article search unit 52 uses this profile to calculate the similarity with the articles stored in the article storage unit 51 and thereby searches for articles including information desired by the user. The abstract generation unit 100 likewise uses the profile to calculate similarities for the searched articles when generating abstracts, so that an abstract including the information required by the user can be created easily.
[0049]
Further, when the abstract generation unit 100 generates the abstract of a searched article, the article feature detection unit 102 first obtains the features of the article, specifically, for example, the number of characters and the number of paragraphs. One of the first to third generation units 103-1 to 103-3 is then selected according to these features, so that an abstract generation method suited to the article is chosen and an abstract including the information required by the user can be created easily.
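The selection among the three generation units can be pictured as a two-threshold classification, following the first, second, and third groups recited in the claims. The sketch below is a simplification: it compares a single size measure (for example, the character count) against two thresholds, because how the number of characters and the number of paragraphs are combined is not spelled out in this passage. The function name `classify` and the threshold values in the usage note are hypothetical.

```python
def classify(size: int, first_threshold: int, second_threshold: int) -> int:
    """Return 1, 2, or 3: the group, and hence the generation unit, to use."""
    if first_threshold >= second_threshold:
        raise ValueError("second_threshold must exceed first_threshold")
    if size < first_threshold:
        return 1  # short article: first paragraph only
    if size < second_threshold:
        return 2  # medium article: first paragraph plus best-matching one
    return 3      # long article: top paragraphs by similarity
```

For instance, with assumed thresholds of 400 and 2000 characters, `classify(1200, 400, 2000)` selects the second group.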
[0050]
Furthermore, by distributing the abstracts generated in this way to each user, the time and labor each user needs to obtain information from the articles searched and distributed by information filtering can be reduced.
[0051]
[Effect of the invention]
As described above, according to the present invention, it is possible to provide an information filtering method and an information filtering apparatus that can easily create an abstract including information required by a user for an article searched by information filtering.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of an entire system of an information filtering apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram conceptually showing an operation mode of the system of FIG.
FIG. 3 is a block diagram showing a configuration of an information filtering device.
FIG. 4 is a diagram for explaining a user profile storage format and a specific example thereof.
FIG. 5 is a diagram for explaining a storage format of article data stored in an article storage unit and a specific example thereof.
FIG. 6 is a diagram for explaining a storage format of search results output from an article search unit and stored in a search result storage unit and a specific example thereof.
FIG. 7 is a diagram showing a specific example of an output form of a conventional filtering device.
FIG. 8 is a flowchart for explaining the processing operation of the abstract generation control unit.
FIG. 9 is a flowchart for explaining the processing operation of the article feature detection unit of the abstract generation unit.
FIG. 10 is a diagram for explaining a storage format of an abstract generation condition table referred to by an article feature detection unit and a specific example thereof.
FIG. 11 is a flowchart for explaining the processing operation of the second generation unit of the abstract generation unit.
FIG. 12 is a flowchart for explaining the paragraph similarity calculation processing.
FIG. 13 is a diagram for explaining a storage format of paragraph similarity calculation results and a specific example thereof.
FIG. 14 is a flowchart for explaining a processing operation of a third generation unit of the abstract generation unit.
FIG. 15 is a diagram illustrating a specific example of an information filtering result generated based on an abstract generated by an abstract generation unit, stored in a search result storage unit, and distributed to each user.
FIG. 16 is a diagram showing a specific example of user answer data stored in a user answer storage unit.
[Explanation of symbols]
1 ... Information filtering center, 2 ... Information source, 3 ... User terminal, 4 ... Central processing unit, 5 ... Storage unit, 6 ... Text information receiving unit, 7 ... Text information transmission/reception unit, 50 ... User profile storage unit, 51 ... Article storage unit, 52 ... Article search unit, 53 ... Search result storage unit, 60 ... User answer storage unit, 61 ... Profile correction unit, 100 ... Abstract generation unit, 101 ... Abstract generation control unit, 102 ... Article feature detection unit, 103-1 ... First generation unit, 103-2 ... Second generation unit, 103-3 ... Third generation unit.

Claims

An information filtering method in a computer comprising:
  receiving means for receiving articles including text and images distributed from a plurality of information sources;
  first storage means for storing the articles received by the receiving means;
  second storage means for storing a search condition that specifies words for searching for articles related to a theme designated in advance by a user, and weight values of the words;
  computing means for performing an information filtering process that extracts articles to be distributed to the user from the articles stored in the first storage means and creates abstracts of the articles; and
  distribution means for distributing the result of the information filtering process,
  wherein the computing means executes:
  (1) a search step of calculating the similarity of each article stored in the first storage means to the search condition, using the number of appearances in the article of the words of the search condition and the weight values of the words, and retrieving a plurality of articles ranked in descending order of the similarity;
  (2) a classification step of classifying each article retrieved in the search step, according to the number of characters and the number of paragraphs of the article body, into one of a first group in which these are less than a predetermined first threshold, a second group in which they are equal to or greater than the first threshold and less than a second threshold larger than the first threshold, and a third group in which they are equal to or greater than the second threshold;
  (3) a first abstract creation step of extracting the first paragraph as an abstract for the articles classified into the first group;
  (4) a second abstract creation step of calculating, for each paragraph of the articles classified into the second group, the similarity to the search condition using the number of appearances in the paragraph of the words of the search condition and the weight values of the words, and extracting from the article, as an abstract, the first paragraph and at least the paragraph with the highest similarity; and
  (5) a third abstract creation step of calculating, for each paragraph of the articles classified into the third group, the similarity to the search condition using the number of appearances in the paragraph of the words of the search condition and the weight values of the words, and extracting from the article, as an abstract, a predetermined number of paragraphs in descending order of the similarity,
  and wherein the distribution means executes a distribution step of distributing the rank of each article retrieved in the search step and the abstract created in any of the first to third abstract creation steps.

The information filtering method according to claim 1, wherein the computing means further executes a step of adding, to the search condition, the word with the highest appearance frequency contained in those articles, among the articles distributed as abstracts in the distribution step, to which the user replied that they were of interest.

An information filtering apparatus comprising:
  receiving means for receiving articles including text and images distributed from a plurality of information sources;
  first storage means for storing the articles received by the receiving means;
  second storage means for storing a search condition that specifies words for searching for articles related to a theme designated in advance by a user, and weight values of the words;
  search means for calculating the similarity of each article stored in the first storage means to the search condition, using the number of appearances in the article of the words of the search condition and the weight values of the words, and retrieving a plurality of articles ranked in descending order of the similarity;
  classification means for classifying each article retrieved by the search means, according to the number of characters and the number of paragraphs of the article body, into one of a first group in which these are less than a predetermined first threshold, a second group in which they are equal to or greater than the first threshold and less than a second threshold larger than the first threshold, and a third group in which they are equal to or greater than the second threshold;
  first abstract creation means for extracting the first paragraph as an abstract for the articles classified into the first group;
  second abstract creation means for calculating, for each paragraph of the articles classified into the second group, the similarity to the search condition using the number of appearances in the paragraph of the words of the search condition and the weight values of the words, and extracting from the article, as an abstract, the first paragraph and at least the paragraph with the highest similarity;
  third abstract creation means for calculating, for each paragraph of the articles classified into the third group, the similarity to the search condition using the number of appearances in the paragraph of the words of the search condition and the weight values of the words, and extracting from the article, as an abstract, a predetermined number of paragraphs in descending order of the similarity; and
  distribution means for distributing the rank of each article retrieved by the search means and the abstract created by any one of the first to third abstract creation means.

The information filtering apparatus according to claim 3, further comprising means for adding, to the search condition, the word with the highest appearance frequency contained in those articles, among the articles distributed as abstracts by the distribution means, to which the user replied that they were of interest.