JP2004348523A

JP2004348523A - System for filtering document, and program

Info

Publication number: JP2004348523A
Application number: JP2003145930A
Authority: JP
Inventors: Shunsuke Doi; 俊介土井; Yuki Yoshida; 由紀吉田; Takeshi Tono; 豪東野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2003-05-23
Filing date: 2003-05-23
Publication date: 2004-12-09

Abstract

PROBLEM TO BE SOLVED: To sort documents with high precision, for example by discriminating unnecessary mail from necessary mail.

SOLUTION: When an input mail document is sorted in a mail filtering device 1, a morphological analysis part 1b morphologically analyzes the input mail document into a set of keywords; a keyword vector generating part 1c generates from the set of keywords an input keyword vector S expressing the features of the mail document; a keyword vector similarity calculating part 1d reads a reference keyword vector B, generated in advance by a keyword vector generating device 5 and stored in a storage device as a dictionary, and calculates the similarity p1 between the reference keyword vector B and the input keyword vector S; and a determining part 1e determines, by referring to a discrimination condition 13, whether the mail document is unnecessary or necessary based on the similarity p1, and sorts it. Consequently, when e-mail is the object, for example, it becomes possible to determine whether an unknown mail that cannot be discriminated by specifying mail conditions is necessary or unnecessary. Even when unnecessary mail with similar content is sent to an unspecified number of people under a falsified sender, the unnecessary mail can be discriminated.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、インターネットから受信した電子メールが不要メールか必要メールか等、文書の選別を行う技術に係わり、特に、高精度な選別を行うのに好適な文書フィルタリング技術に関するものである。
【０００２】
【従来の技術】
近年、インターネット等のネットワークにおける電子メールの普及とブロードバンド化に伴い、受信者が不要と感じる迷惑メールや広告メールが無差別に送られてくることが多くなっている。
【０００３】
このような不要メールは、電子メールの送信者のアドレスや件名、本文に指定した文字列を含む・含まないといった条件を指定することで、メールソフトやプロバイダのメールサーバにて、メール受信時に不要メールと必要メールを判別し、不要メールを自動で削除したり、分別するといった処理が行われている。
【０００４】
しかし、これらの技術では、予め不要メールと考えられるメールアドレスや、受信条件を指定していても、その指定した条件に当てはまらない不要メールは処理されない問題点がある。
【０００５】
また、特許文献１に記載の「通信サービスにおけるユーザフィルタリングシステム及び方法」では、受信したメールが不要メールであると、ネットワーク上の発信者評価パラメータ記憶装置に発信者の評価ポイントを下げて登録する。
【０００６】
そして、別途、メール受信時に、このメールの発信者のアドレスをキーにして、ネットワーク上の発信者評価パラメータ記憶装置の発信者の評価ポイントを検索し、メールの発信者のアドレスの評価ポイントが下げられて登録されている場合は、不要メールであると判別して、受信しないなどの処理を行っている。
【０００７】
しかし、この技術では、送信者が毎回異なる架空のアドレスを用いている場合は、送信者のアドレスで判別できないという問題が発生する。
【０００８】
また、特許文献２に記載の「文書分類装置」では、文書データの解析を行い、特徴ベクトルを自動的に抽出して、類似した文書を自動的に分類することを行っており、メール受信時に前記装置を用いることで、メールの自動分類は可能となる。
【０００９】
しかし、この技術では、類似した特徴をもった文章が分類されるだけで、その分類が不要メールであるかは人間が判断をしなければいけない。また、必要メールと似た単語を用いて記述されたメールは、誤分類される可能性が大きいといった問題点がある。
【００１０】
【特許文献１】
特開２００３−１８３２４号公報
【特許文献２】
特許第２９７８０４４号
【００１１】
【発明が解決しようとする課題】
解決しようとする問題点は、従来の技術では、例えば電子メールに関して、不要メールを判別するために、メールの送信者のアドレスや件名、本文に指定した文字列を含む／含まないといった条件を指定することで、不要メールと必要メールを判別していたが、未知の不要メールは上記条件指定で判別できない場合があるとの間題点と、他者の評価ポイント情報によって不要メールの送信者を判別できるようになっても、送信者を偽ることで不要メールが判別できないという問題点と、メール本文の特徴ベクトルで分類する場合、必要メールと似た単語を用いて記述がされたメールは、誤分類される可能性が大きいという問題点である。
【００１２】
本発明の目的は、これら従来技術の課題を解決し、不要メールと必要メールとの判別等、文書の選別を高精度に行うことを可能とすることである。
【００１３】
【課題を解決するための手段】
上記目的を達成するため、本発明は、電子メールを例とすると、受信メールから生成したキーワードベクトルＳと、例えば予め不要メールの文面を形態素解析をして重み付けして生成したキーワードベクトルＢとの類似度や、必要メールの文面を形態素解析をして重み付けして生成したキーワードベクトルＷとの類似度を用いて不要メールか必要メールかを判別することで、不要メールの文面が未知であっても、キーワードベクトルの類似度によって不要メールか否かを判別することを特徴とする。詳細には、▲１▼電子メールを受信して当該メールが不要メールか否かを判別するメールフィルタリング装置において、予めキーワードベクトルＢが記憶されているキーワードベクトル辞書Ｂをネットワーク上もしくはローカルに具備し、電子メールを受信する受信部と、受信メールの文面を形態素解析し、キーワードの集合にする形態素解析部と、形態素解析後のキーワード集合からキーワードベクトルＳを生成するキーワードベクトル生成部と、生成したキーワードベクトルＳと、キーワードベクトル辞書から取得したキーワードベクトルＢとで内積や余弦等の類似度を算出する演算を行い、類似度ｐ１を算出するキーワードベクトル類似度算出部と、必要メール、不要メールを判別する為の類似度の大きさの条件を記述した判別条件を参照し、類似度ｐ１の大きさによって、必要メール、不要メールかを判別する判別部と、判別部において不要メールであると判別された場合、当該メールを削除するなどのしかるべき処理を行う不要メール処理部と、判別部において必要メールであると判別された場合、当該メールをメールソフトで受信するなどのしかるべき処理を行う必要メール処理部とを具備する。▲２▼または、電子メールを受信して当該メールが不要メールか否かを判別するメールフィルタリング装置において、予めキーワードベクトルＢが記憶されているキーワードベクトル辞書、および、予めキーワードベクトルＷが記憶されているキーワードベクトル辞書をネットワーク上もしくはローカルに具備し、電子メールを受信する受信部と、メールの文面を形態素解析し、キーワードの集合にする形態素解析部と、形態素解析後のキーワード集合からキーワードベクトルＳを生成するキーワードベクトル生成部と、生成したキーワードベクトルＳと、キーワードベクトル辞書Ｂから取得したキーワードベクトルＢとで内積や余弦等の類似度を算出する演算を行い、類似度ｐ１を算出するキーワードベクトル類似度算出部と、生成したキーワードベクトルＳと、キーワードベクトル辞書Ｗから取得したキーワードベクトルＷとで内積や余弦等の類似度を算出する演算を行い、類似度ｐ２を算出するキーワードベクトル類似度算出部と、必要メール、不要メールを判別する為の類似度の大きさの条件を記述した判別条件を参照し、類似度ｐ１とｐ２の大きさによって、必要メール、不要メールかを判別する判別部と、判別部において不要メールであると判別された場合、当該メールを削除するなどのしかるべき処理を行う不要メール処理部と、判別部において必要メールであると判別された場合、当該メールをメールソフトで受信するなどのしかるべき処理を行う必要メール処理部とを具備する。▲３▼また、▲１▼のメールフィルタリング装置であって、キーワードベクトル生成部の次段において、キーワードベクトルＳとキーワードベクトル辞書Ｗから取得したキーワードベクトルＷとで積集合演算（Ｓ∩Ｗ）を行い、キーワードベクトルＳからキーワードベクトル（Ｓ∩Ｗ）を除いたキーワードベクトルＳｂ（＝Ｓ−（Ｓ∩Ｗ））を生成するキーワードベクトルフィルタリング部を具備し、生成したキーワードベクトルＳｂと、キーワードベクトル辞書Ｂから取得したキーワードベクトルＢとで内積や余弦等の類似度を算出する演算を行い、類似度ｐ１を算出するキーワードベクトル類似度算出部と、生成した類似度ｐ１を判別部への入力として処理を継続する。▲４▼また、▲２▼のメールフィルタリング装置であって、キーワードベクトル生成部の次段において、キーワードベクトルＳとキーワードベクトル辞書Ｗから取得したキーワードベクトルＷとで積集合演算（Ｓ∩Ｗ）を行い、キーワードベクトルＳからキーワードベクトル（Ｓ∩Ｗ）を除いたキーワードベクトルＳｂ（＝Ｓ−（Ｓ∩Ｗ））を生成するキーワードベクトルフィルタリング部を具備し、キーワードベクトル生成部の次段において、キーワードベクトルＳとキーワードベクトル辞書Ｂから取得したキーワードベクトルＢとで積集合演算（Ｓ∩Ｂ）を行い、キーワードベクトルＳからキーワードベクトル（Ｓ∩Ｂ）を除いたキーワードベクトルＳｗ（＝Ｓ−（Ｓ∩Ｂ））を生成するキーワードベクトルフィルタリング部を具備し、生成したキーワードベクト
ルＳｂと、キーワードベクトル辞書Ｂから取得したキーワードベクトルＢとで内積や余弦等の類似度を算出する演算を行い、類似度ｐ１を算出するキーワードベクトル類似度算出部と、生成したキーワードベクトルＳｗと、キーワードベクトル辞書Ｗから取得したキーワードベクトルＷとで内積や余弦等の類似度を算出する演算を行い、類似度ｐ２を算出するキーワードベクトル類似度算出部と、生成した類似度ｐ１、ｐ２を判別部への入力として処理を継続する。▲５▼また、▲１▼〜▲４▼におけるメールフィルタリング装置であって、必要メール、不要メールを判別する為の類似度の大きさの条件を記述した判別条件を参照し、類似度ｐ１（とｐ２）の大きさによって、必要メール、不要メール、それ以外の３つに判別する判別部と、「それ以外」と判別された場合、演算部。算出部の処理を継続して処理を繰り返し、最終段階では、必要メール、不要メールを判別する為の類似度の大きさの条件を記述した判別条件を参照し、類似度ｐ１（とｐ２）の大きさによって、必要メール、不要メールのいずれか２つに判別する判別部と、判別部において不要メールであると判別された場合、当該メールを削除するなどのしかるべき処理を行う不要メール処理部と、判別部において必要メールであると判別された場合、当該メールをメールソフトで受信するなどのしかるべき処理を行う必要メール処理部とを具備する。尚、▲５▼では、演算部。算出部の処理を多段構成で用いるが、各演算部。算出部で用いる、キーワードベクトルＢ、キーワードベクトルＷは、各段によって、置き換えても良い。また、キーワードベクトルＢを複数用いる場合であっても、内容が同一のキーワードベクトルＢであっても、それぞれ内容が異なるキーワードベクトルＢであっても良い。
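[0014]
[Embodiments of the Invention]
Embodiments of the present invention are described in detail below with reference to the drawings.
[0015]
FIG. 1 is a block diagram showing a first configuration example of the document filtering system according to the present invention, and FIG. 2 is a block diagram showing a configuration example of a mail delivery service system that uses the document filtering system of FIG. 1.
[0016]
In FIG. 2, 1 is a mail filtering device serving as the document filtering system of the present invention, 2 is a mail server device (labeled "mail server" in the figure), 3 is a mail client device (labeled "mail client" in the figure), 4 is a user of the mail client device, 5 is a keyword vector generation device that generates the keyword vector dictionaries according to the present invention, and 6a to 6d are IP networks such as the Internet or an intranet. In this example, e-mail is used as the document to be sorted.
[0017]
The mail filtering device 1, the mail server device 2, the mail client device 3, and the keyword vector generation device 5 each have a computer configuration comprising a CPU (Central Processing Unit), main memory, a display device, an input device, and an external storage device. Programs and data recorded on a storage medium such as a CD-ROM are installed into the external storage device via an optical disk drive or the like, then read from the external storage device into main memory and processed by the CPU to carry out each processing function.
[0018]
The mail filtering device 1 receives mail from the mail server device 2 or from a mail storage disk or the like (not shown). It also obtains keyword vectors from the keyword vector generation device 5 and determines whether the mail is unnecessary or necessary. When it judges the mail to be unnecessary, it performs processing such as deleting the mail from the mail server device 2, rewriting the contents of the mail, or not passing the mail to the mail client device 3.
[0019]
When it judges the mail to be necessary, it performs processing such as leaving the mail on the mail server device 2 undeleted and passing it to the mail client device 3.
[0020]
In this example, the keyword vector generation device 5 receives mail from the mail client device 3, generates a keyword vector from the received mail, stores it in a keyword vector dictionary, and provides it to the mail filtering device 1.
[0021]
There may be a plurality of keyword vector generation devices 5. For example, a device may accept only "unnecessary mail" so that it specializes in generating keyword vectors of unnecessary mail and storing them in a keyword vector dictionary. Likewise, a device may accept only "necessary mail", only "privately necessary mail", or only "privately unnecessary mail", each specializing in the corresponding keyword vectors and dictionary.
[0022]
Alternatively, a single keyword vector generation device 5 may generate the individual keyword vectors from the "unnecessary mail", "necessary mail", "privately necessary mail", and "privately unnecessary mail" described above and store them in separate keyword vector dictionaries.
[0023]
The mail sent to the keyword vector generation device 5 may be mail obtained via the mail filtering device 1 or ordinary mail.
[0024]
The mail filtering device shown in FIG. 1 is described in detail below.
[0025]
As shown in FIG. 1, the mail filtering device 1 comprises a receiving unit 1a, a morphological analysis unit 1b, a keyword vector generation unit 1c, a keyword vector similarity calculation unit 1d, a determination unit 1e, a necessary mail processing unit 1f, and an unnecessary mail processing unit 1g, each of which operates as a function executed by a computer program.
[0026]
In this example, the functional block consisting of the receiving unit 1a, the morphological analysis unit 1b, and the keyword vector generation unit 1c is called the keyword vector generation block 10, and the functional block consisting of the keyword vector similarity calculation unit 1d is called the arithmetic processing block 11.
[0027]
In this configuration, the mail filtering device 1 receives a mail 12 at its receiving unit 1a, and the morphological analysis unit 1b splits the received mail into a set of keywords. The morphological analysis unit 1b can use a morphological analysis tool such as ChaSen ("茶せん", available at http://chasen.asit-nara.ac.jp/).
[0028]
The keyword vector generation unit 1c generates a keyword vector S by assigning weights to the keyword set produced by the morphological analysis unit 1b. The weight may be the number of times the same keyword occurs in the keyword set, a "0" or "1" indicating absence or presence, or a value obtained by applying a conversion formula to the occurrence count.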
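The three weighting options named in paragraph [0028] can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `keyword_vector`, the `weighting` parameter, and in particular the logarithmic "conversion formula" are assumptions chosen for the example, since the source does not specify a formula.

```python
import math
from collections import Counter


def keyword_vector(keywords, weighting="binary"):
    """Turn a list of keywords into a {keyword: weight} mapping."""
    counts = Counter(keywords)
    if weighting == "count":
        # Weight = number of occurrences of the keyword.
        return dict(counts)
    if weighting == "binary":
        # Weight = 1 for every keyword that occurs at least once.
        return {k: 1 for k in counts}
    if weighting == "log":
        # Hypothetical conversion formula applied to the occurrence count.
        return {k: 1.0 + math.log(c) for k, c in counts.items()}
    raise ValueError(f"unknown weighting: {weighting}")


example = ["sale", "now", "sale"]
print(keyword_vector(example, "count"))
print(keyword_vector(example, "binary"))
```

Keywords that do not occur are simply absent from the mapping, which stands in for a weight of "0".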
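[0029]
For example, when the mail "Fromokakko@jpこんにちわお元気ですか？今日はもう退社ですか？" ("From okakko@jp Hello, how are you? Are you leaving work already today?") is morphologically analyzed, it is split into 17 keywords, "From/okakko/@/jp/こんにちわ/お/元気/です/か/？/今日/は/もう/退社/です/か/？". If each keyword is given a weight of "0" or "1" according to its presence, and a keyword that appears more than once is not duplicated, the resulting keyword vector is as follows.
[0030]
(Keyword) (Weight)
From "1"
okakko "1"
@ "1"
jp "1"
こんにちわ "1"
お "1"
元気 "1"
です "1"
か "1"
？ "1"
今日 "1"
は "1"
もう "1"
退社 "1"
[0031]
In this way, a keyword vector S consisting of 14 keyword elements is generated. The segmentation shown here depends on the morphological analysis tool used.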
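The step from 17 segmented keywords to a 14-element binary keyword vector can be reproduced with a short sketch. The token list is copied from the example above; the function name `binary_keyword_vector` is an assumption, and the segmentation itself is taken as given rather than produced by a morphological analyzer.

```python
# The 17 keywords from the segmentation example, in order of appearance.
TOKENS = ["From", "okakko", "@", "jp", "こんにちわ", "お", "元気", "です", "か",
          "？", "今日", "は", "もう", "退社", "です", "か", "？"]


def binary_keyword_vector(tokens):
    """Give each distinct keyword a weight of 1, dropping repeats."""
    # dict.fromkeys keeps first-seen order and ignores later duplicates.
    return dict.fromkeys(tokens, 1)


S = binary_keyword_vector(TOKENS)
print(len(TOKENS), len(S))
```

The three repeated keywords ("です", "か", "？") account for the difference between 17 and 14.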
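[0032]
The keyword vector similarity calculation unit 1d obtains the keyword vector B from the keyword vector dictionary B provided by the keyword vector generation device 5 of FIG. 2, and calculates the similarity between this keyword vector B and the keyword vector S generated by the keyword vector generation unit 1c.
[0033]
For example, suppose the keyword vector B generated from unnecessary mail is as follows.
[0034]
(Keyword) (Weight)
From "1"
Spam "1"
@ "1"
jp "1"
net "1"
未 "1"
承諾 "1"
販売 "1"
限定 "1"
アダルト "1"
激安 "1"
必見 "1"
お "1"
は "1"
！ "1"
[0035]
In this case, if the similarity between the keyword vector S and the keyword vector B is calculated as the inner product of the two keyword vectors and normalized by the number of keywords obtained by the morphological analysis, the similarity p1 is obtained as follows.
[0036]
p1 = {(S・B) / (number of keywords)} = 4/17 = 0.235
[0037]
The similarity may also be derived as p1 = (S・B), without normalizing by the number of keywords.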
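The normalized inner product can be checked with a small sketch. The function names and the dictionary representation are assumptions. One point needs care: the keyword vector B as listed contains "jp", which S also contains, so a direct computation finds five shared keywords (From, @, jp, お, は). The stated value 4/17 matches the intersection S∩B given later in the description, which omits "jp". The code computes both readings and does not settle which one the source intends.

```python
S_KEYWORDS = ["From", "okakko", "@", "jp", "こんにちわ", "お", "元気",
              "です", "か", "？", "今日", "は", "もう", "退社"]
B_KEYWORDS = ["From", "Spam", "@", "jp", "net", "未", "承諾", "販売",
              "限定", "アダルト", "激安", "必見", "お", "は", "！"]
N_KEYWORDS = 17  # keywords produced by the segmentation, repeats included


def binary_vector(keywords):
    return dict.fromkeys(keywords, 1)


def inner_product(u, v):
    """Sum of weight products over keywords present in both vectors."""
    return sum(w * v[k] for k, w in u.items() if k in v)


def similarity(u, v, n_keywords):
    """Inner product normalized by the number of segmented keywords."""
    return inner_product(u, v) / n_keywords


S = binary_vector(S_KEYWORDS)
B_listed = binary_vector(B_KEYWORDS)
# Assumption: B without "jp", as implied by the later S∩B listing.
B_without_jp = {k: w for k, w in B_listed.items() if k != "jp"}

print(round(similarity(S, B_listed, N_KEYWORDS), 3))
print(round(similarity(S, B_without_jp, N_KEYWORDS), 3))
```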
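[0038]
The determination unit 1e uses the similarity p1 calculated by the keyword vector similarity calculation unit 1d and a preset determination condition 13 to determine whether the received mail is unnecessary or necessary, and sorts it accordingly.
[0039]
An example of the determination condition 13 is: "A threshold n1 is set in advance; if the similarity p1 exceeds the threshold n1 the mail is unnecessary mail, and if the similarity p1 does not exceed the threshold n1 it is necessary mail."
[0040]
In the example above, if the threshold n1 is set to "0.700", the similarity p1 is "0.235", so "p1 < n1" holds and the mail is judged to be necessary mail.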
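The threshold rule of determination condition 13 is a single comparison. A minimal sketch, in which the function name and the string labels are assumptions:

```python
def judge_by_threshold(p1, n1):
    """Unnecessary mail if p1 exceeds the threshold n1, otherwise necessary mail."""
    return "unnecessary" if p1 > n1 else "necessary"


# Values from the worked example: p1 = 0.235, n1 = 0.700.
print(judge_by_threshold(0.235, 0.700))
```

A similarity exactly equal to the threshold does not "exceed" it and is therefore treated as necessary mail.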
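[0041]
When the mail is judged to be necessary, the necessary mail processing unit 1f performs processing such as obtaining the mail from the mail server device (2) without deleting it and passing it to the mail client device (3).
[0042]
When the mail is judged to be unnecessary, the unnecessary mail processing unit 1g performs processing such as deleting the mail from the mail server device (2), rewriting its contents, or not passing it to the mail client device (3).
[0043]
Thus, when sorting an input mail document, the mail filtering device 1 of this example uses the morphological analysis unit 1b to morphologically analyze the document into a set of keywords; uses the keyword vector generation unit 1c to generate from this set an input keyword vector S representing the features of the document; uses the keyword vector similarity calculation unit 1d to read the reference keyword vector B, which was generated in advance by the keyword vector generation device 5 and stored in a storage device as a dictionary, and to calculate the similarity p1 between the reference keyword vector B and the input keyword vector S; and uses the determination unit 1e to judge, with reference to the determination condition 13, whether the document is unnecessary or necessary on the basis of the similarity p1, and to sort it.
[0044]
In particular, in this example the reference keyword vector B represents the features of documents regarded as unnecessary. When the similarity p1 is greater than a predetermined condition value, the determination unit 1e treats the mail document as resembling the features of the reference keyword vector B and sorts it as an unnecessary mail document. In this way, mail with the same features as unnecessary mail can be identified as unnecessary mail.
[0045]
FIG. 3 is a block diagram showing a second configuration example of the document filtering system according to the present invention.
[0046]
Like the mail filtering device 1 of FIG. 1, the mail filtering device 31 shown in FIG. 3 forms part of the mail delivery service system shown in FIG. 2. It comprises, as functions executed by a computer program, the keyword vector generation block 10 (consisting of the receiving unit 1a, the morphological analysis unit 1b, and the keyword vector generation unit 1c, not shown), an arithmetic processing block 11a consisting of the keyword vector similarity calculation unit 1d and a keyword vector similarity calculation unit 1h, a determination unit 31e, the necessary mail processing unit 1f, and the unnecessary mail processing unit 1g.
[0047]
In this configuration, the keyword vector generation block 10 of the mail filtering device 31 receives the mail, morphologically analyzes it, and generates the input keyword vector S.
[0048]
In the arithmetic processing block 11a of this example, the keyword vector similarity calculation unit 1d calculates the similarity between the reference keyword vector B and the input keyword vector S using the keyword vector dictionary B, while the keyword vector similarity calculation unit 1h calculates the similarity between the reference keyword vector W and the input keyword vector S using the keyword vector dictionary W.
[0049]
The determination unit 31e then uses both the similarity p1 calculated by the keyword vector similarity calculation unit 1d and the similarity p2 calculated by the keyword vector similarity calculation unit 1h to judge and sort the received mail document as necessary or unnecessary on the basis of a determination condition 33.
[0050]
An example of the determination condition 33 is: "α is set in advance; if p1 + α > p2 the mail is unnecessary mail, otherwise it is necessary mail." For example, with "similarity p1 = 0.235", "similarity p2 = 0.500", and α preset to "-0.100", "0.235 - 0.100 < 0.500" holds and the mail is judged to be necessary mail.
[0051]
As another example of the determination condition 33, the condition can be set with a scale factor β: "if (p1 / β) > p2 the mail is unnecessary mail, otherwise it is necessary mail." In this case, with "similarity p1 = 0.235" and "similarity p2 = 0.500" as above and β preset to "0.5", "0.235 / 0.5 < 0.500" holds and the mail is judged to be necessary mail.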
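The two example forms of determination condition 33 compare p1 with p2 after an additive offset α or a division by β. A minimal sketch with assumed function names; β is taken to be positive, which the source implies but does not state.

```python
def judge_with_offset(p1, p2, alpha):
    """Unnecessary mail if p1 + alpha > p2, otherwise necessary mail."""
    return "unnecessary" if p1 + alpha > p2 else "necessary"


def judge_with_scale(p1, p2, beta):
    """Unnecessary mail if p1 / beta > p2, otherwise necessary mail (beta > 0 assumed)."""
    return "unnecessary" if p1 / beta > p2 else "necessary"


# Values from the worked examples: p1 = 0.235, p2 = 0.500.
print(judge_with_offset(0.235, 0.500, -0.100))  # 0.135 is not greater than 0.500
print(judge_with_scale(0.235, 0.500, 0.5))      # 0.470 is not greater than 0.500
```

A negative α or a β above 1 makes the rule more reluctant to label mail as unnecessary. A positive α or a β below 1 has the opposite effect.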
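[0052]
When the mail is judged to be necessary, the necessary mail processing unit 1f is executed; when it is judged to be unnecessary, the unnecessary mail processing unit 1g is executed.
[0053]
Thus, when sorting an input mail document, the mail filtering device 31 of this example calculates the similarity p1 between the reference keyword vector B and the input keyword vector S with the keyword vector similarity calculation unit 1d, calculates the similarity p2 between the reference keyword vector W and the input keyword vector S with the keyword vector similarity calculation unit 1h, and has the determination unit 31e judge and sort the mail document as necessary or unnecessary on the basis of the similarities p1 and p2.
[0054]
As an example, if the reference keyword vector B represents the features of documents regarded as unnecessary and the reference keyword vector W represents the features of documents regarded as necessary, the determination unit 31e sorts the mail document as an unnecessary document when the similarity p1 is greater than a predetermined condition value T1 and the similarity p2 is smaller than a predetermined condition value T2, and sorts it as a necessary document when the similarity p1 is smaller than the condition value T1 and the similarity p2 is greater than the condition value T2.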
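The condition-value rule with T1 and T2 can be sketched as below. The two labeled outcomes follow the paragraph above. The paragraph does not say what happens when neither pair of inequalities holds, so returning "other" for that case is an assumption, borrowed from the "other" result that the description introduces for the FIG. 6 configuration. The function name and the sample values are also assumptions.

```python
def judge_with_two_thresholds(p1, p2, t1, t2):
    """Sort by comparing p1 with T1 and p2 with T2."""
    if p1 > t1 and p2 < t2:
        return "unnecessary"
    if p1 < t1 and p2 > t2:
        return "necessary"
    # Assumption: anything else is left undecided.
    return "other"


print(judge_with_two_thresholds(0.5, 0.1, t1=0.3, t2=0.3))
print(judge_with_two_thresholds(0.1, 0.5, t1=0.3, t2=0.3))
print(judge_with_two_thresholds(0.5, 0.5, t1=0.3, t2=0.3))
```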
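[0055]
The mail filtering system of the FIG. 3 example thereby solves a problem of the FIG. 1 configuration. In the FIG. 1 example, the determination unit 1e judges whether mail is unnecessary or necessary from the similarity p1 between the keyword vector S and the keyword vector B alone. Even when the similarity between the keyword vector S and the keyword vector W is actually larger, the mail is therefore treated as unnecessary whenever the similarity p1 satisfies the condition for unnecessary mail. In the FIG. 3 configuration, the determination unit 31e judges from the relationship between the two values p1 and p2, which resolves this problem.
[0056]
FIG. 4 is a block diagram showing a third configuration example of the document filtering system according to the present invention.
[0057]
Like the mail filtering devices 1 and 31 of FIGS. 1 and 3, the mail filtering device 41 shown in FIG. 4 forms part of the mail delivery service system shown in FIG. 2. It comprises, as functions executed by a computer program, the keyword vector generation block 10 (consisting of the receiving unit 1a, the morphological analysis unit 1b, and the keyword vector generation unit 1c, not shown), an arithmetic processing block 11b consisting of a keyword vector filtering unit 1i and a keyword vector similarity calculation unit 1j, a determination unit 41e, the necessary mail processing unit 1f, and the unnecessary mail processing unit 1g.
[0058]
In this configuration, the keyword vector generation block 10 of the mail filtering device 41 receives the mail, morphologically analyzes it, and generates the input keyword vector S.
[0059]
In the arithmetic processing block 11b of this example, the keyword vector filtering unit 1i first removes from the input keyword vector S the components of the reference keyword vector W stored in the keyword vector dictionary W 45.
[0060]
For example, subtracting the intersection (S∩W) of the keyword vector S of the received mail and the keyword vector W from the keyword vector S yields a keyword vector Sb (= S - S∩W) from which the components of the keyword vector W have been removed.
[0061]
The keyword vector similarity calculation unit 1j then calculates the similarity p1 between the keyword vector Sb generated by the keyword vector filtering unit 1i and the keyword vector B stored in the keyword vector dictionary B 44, and the determination unit 41e judges and sorts the mail as unnecessary or necessary on the basis of a determination condition 43.
[0062]
For example, for the input keyword vector S illustrated in the description of FIG. 1, the intersection "S∩W" of this keyword vector S and the reference keyword vector W is as follows.
[0063]
(Keyword) (Weight)
From "1"
okakko "1"
@ "1"
jp "1"
こんにちわ "1"
お "1"
元気 "1"
？ "1"
は "1"
[0064]
"S - (S∩W)" is then as follows.
[0065]
(Keyword) (Weight)
です "1"
か "1"
今日 "1"
もう "1"
退社 "1"
[0066]
This is the keyword vector Sb.
[0067]
When the similarity between this keyword vector Sb and the keyword vector B of the keyword vector dictionary B is calculated in the same way as in the FIG. 1 example, "p1 = {(Sb・B) / (number of keywords)} = 0/17 = 0.000" is obtained.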
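The filtering step Sb = S - (S∩W) and the resulting p1 = 0/17 can be reproduced with a short sketch. The source never lists the reference keyword vector W in full, so the code assumes W contains exactly the nine keywords shown for S∩W (with the later listing's "okakko" and "？"). The function names are also assumptions.

```python
S_KEYWORDS = ["From", "okakko", "@", "jp", "こんにちわ", "お", "元気",
              "です", "か", "？", "今日", "は", "もう", "退社"]
# Assumption: W is taken to be exactly the listed intersection S∩W.
W_KEYWORDS = ["From", "okakko", "@", "jp", "こんにちわ", "お", "元気", "？", "は"]
B_KEYWORDS = ["From", "Spam", "@", "jp", "net", "未", "承諾", "販売",
              "限定", "アダルト", "激安", "必見", "お", "は", "！"]
N_KEYWORDS = 17


def remove_shared_keywords(s, ref):
    """Return s without the keywords that also occur in ref: s - (s ∩ ref)."""
    return {k: w for k, w in s.items() if k not in ref}


def inner_product(u, v):
    return sum(w * v[k] for k, w in u.items() if k in v)


S = dict.fromkeys(S_KEYWORDS, 1)
W = dict.fromkeys(W_KEYWORDS, 1)
B = dict.fromkeys(B_KEYWORDS, 1)

Sb = remove_shared_keywords(S, W)
p1 = inner_product(Sb, B) / N_KEYWORDS
print(list(Sb), p1)
```

After filtering, none of the five remaining keywords occurs in B, so the similarity to the unnecessary-mail vector drops to zero.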
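[0068]
Thus, when sorting an input mail document, the mail filtering device 41 of this example processes the input keyword vector S generated by the keyword vector generation block 10 as follows. The keyword vector filtering unit 1i reads the reference keyword vector W stored in advance in a storage device as the keyword vector dictionary W 45, computes the intersection (S∩W) of the reference keyword vector W and the input keyword vector S, and generates the keyword vector Sb (= S - S∩W) by removing the result from the input keyword vector S. The keyword vector similarity calculation unit 1j reads the reference keyword vector B stored in advance in a storage device as the keyword vector dictionary B 44 and calculates the similarity p1 between the reference keyword vector B and the keyword vector Sb. The determination unit 41e then judges and sorts the mail document as necessary or unnecessary on the basis of the similarity p1.
[0069]
For example, if the reference keyword vector B represents the features of documents regarded as unnecessary and the reference keyword vector W represents the features of documents regarded as necessary, the keyword vector filtering unit 1i removes from the keyword vector S generated from the received mail the components of the keyword vector W generated from necessary mail, so the set of keywords contained in both necessary mail and unnecessary mail is excluded.
[0070]
As a result, (1) the keyword vector similarity calculation unit 1j calculates the similarity from distinctive elements only, so the similarity value becomes more distinctive, the determination condition 43 used by the determination unit 41e becomes easier to set, and the setting burden is reduced. In addition, (2) performing the similarity calculation only on the keyword vector (Sb) needed for the judgment reduces the number of operations in the similarity calculation of the keyword vector similarity calculation unit 1j.
[0071]
FIG. 5 is a block diagram showing a fourth configuration example of the document filtering system according to the present invention.
[0072]
Like the mail filtering devices 1, 31, and 41 of FIGS. 1, 3, and 4, the mail filtering device 51 shown in FIG. 5 forms part of the mail delivery service system shown in FIG. 2. It comprises, as functions executed by a computer program, the keyword vector generation block 10 (consisting of the receiving unit 1a, the morphological analysis unit 1b, and the keyword vector generation unit 1c, not shown), an arithmetic processing block 11c consisting of the keyword vector filtering unit 1i and the keyword vector similarity calculation unit 1j together with a keyword vector filtering unit 1k and a keyword vector similarity calculation unit 1m, a determination unit 51e, the necessary mail processing unit 1f, and the unnecessary mail processing unit 1g.
[0073]
In this configuration, the keyword vector generation block 10 of the mail filtering device 51 receives the mail, morphologically analyzes it, and generates the input keyword vector S.
[0074]
The mail filtering device 51 of this example is characterized by the addition of the keyword vector filtering unit 1k and the keyword vector similarity calculation unit 1m to the arithmetic processing block 11b of the mail filtering device 41 shown in FIG. 4. The keyword vector filtering unit 1k removes from the keyword vector S the components of the keyword vector B defined in advance in the keyword vector dictionary B, and the keyword vector similarity calculation unit 1m calculates the similarity between the keyword vector output by the keyword vector filtering unit 1k and the keyword vector W defined in advance in the keyword vector dictionary W. The operation of the arithmetic processing block 11c configured in this way is described in detail below.
[0075]
The keyword vector filtering unit 1i generates the keyword vector Sb (= "S - (S∩W)") by removing, from the keyword vector S of the received mail generated by the keyword vector generation block 10, the intersection (S∩W) of S and the keyword vector W stored in the keyword vector dictionary W 55.
[0076]
The keyword vector similarity calculation unit 1j calculates the similarity p1 between the keyword vector Sb generated by the keyword vector filtering unit 1i and the keyword vector B obtained from the keyword vector dictionary B.
[0077]
For example, using the keyword vectors shown in the description of FIG. 1, "S∩W" is as follows.
[0078]
(Keyword) (Weight)
From "1"
okakko "1"
@ "1"
jp "1"
こんにちわ "1"
お "1"
元気 "1"
？ "1"
は "1"
[0079]
"S - (S∩W)" is then as follows.
[0080]
(Keyword) (Weight)
です "1"
か "1"
今日 "1"
もう "1"
退社 "1"
[0081]
This is the keyword vector Sb. When the similarity p1 between this keyword vector Sb and the keyword vector B is calculated in the same way as in the description of FIG. 1, "p1 = {(Sb・B) / (number of keywords)} = 0/17 = 0.000" is obtained.
[0082]
The keyword vector filtering unit 1k generates the keyword vector Sw (= "S - (S∩B)") by removing, from the keyword vector S of the received mail generated by the keyword vector generation block 10, the intersection (S∩B) of S and the keyword vector B stored in the keyword vector dictionary B 54. The keyword vector similarity calculation unit 1m calculates the similarity p2 between the keyword vector Sw generated by the keyword vector filtering unit 1k and the keyword vector W obtained from the keyword vector dictionary W.
[0083]
For example, using the keyword vectors shown in the description of FIG. 1, the intersection "S∩B" of the keyword vector S and the keyword vector B is as follows.
[0084]
(Keyword) (Weight)
From "1"
@ "1"
お "1"
は "1"
[0085]
"Sw = S - (S∩B)" is then as follows.
[0086]
(Keyword) (Weight)
okakko "1"
jp "1"
こんにちわ "1"
元気 "1"
です "1"
か "1"
？ "1"
今日 "1"
もう "1"
退社 "1"
[0087]
When the similarity p2 between this keyword vector Sw and the keyword vector W is calculated in the same way as in the description of FIG. 1, "p2 = {(Sw・W) / (number of keywords)} = 5/17 = 0.294" is obtained.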
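The two-sided filtering of the FIG. 5 configuration can be traced end to end with a sketch that reproduces p1 = 0/17 and p2 = 5/17. Two assumptions are needed to match the stated numbers: W is taken to be exactly the nine keywords listed for S∩W, and B is taken without "jp", as implied by the four-keyword intersection S∩B (the earlier listing of B does include "jp"). The function names are hypothetical.

```python
S_KEYWORDS = ["From", "okakko", "@", "jp", "こんにちわ", "お", "元気",
              "です", "か", "？", "今日", "は", "もう", "退社"]
# Assumption: W equals the listed S∩W; B omits "jp" to match the listed S∩B.
W_KEYWORDS = ["From", "okakko", "@", "jp", "こんにちわ", "お", "元気", "？", "は"]
B_KEYWORDS = ["From", "Spam", "@", "net", "未", "承諾", "販売",
              "限定", "アダルト", "激安", "必見", "お", "は", "！"]
N_KEYWORDS = 17


def remove_shared_keywords(s, ref):
    """Return s - (s ∩ ref)."""
    return {k: w for k, w in s.items() if k not in ref}


def inner_product(u, v):
    return sum(w * v[k] for k, w in u.items() if k in v)


S = dict.fromkeys(S_KEYWORDS, 1)
W = dict.fromkeys(W_KEYWORDS, 1)
B = dict.fromkeys(B_KEYWORDS, 1)

Sb = remove_shared_keywords(S, W)           # filtering unit 1i
Sw = remove_shared_keywords(S, B)           # filtering unit 1k
p1 = inner_product(Sb, B) / N_KEYWORDS      # similarity calculation unit 1j
p2 = inner_product(Sw, W) / N_KEYWORDS      # similarity calculation unit 1m
print(round(p1, 3), round(p2, 3))
```

With these inputs p2 exceeds p1, so any rule that compares the two similarities leans toward treating the example mail as necessary.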
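[0088]
The determination unit 51e then compares the similarity p1 with the similarity p2 in accordance with a determination condition 53, and judges and sorts the mail as unnecessary or necessary.
[0089]
Thus, when sorting an input mail document, the mail filtering device 51 of this example processes the input keyword vector S generated by the keyword vector generation block 10 as follows. The keyword vector filtering unit 1i reads the reference keyword vector W stored in advance in a storage device as the keyword vector dictionary W 55, computes the intersection (S∩W) of the reference keyword vector W and the input keyword vector S, and generates the keyword vector Sb (= S - S∩W) by removing the result from S. The keyword vector similarity calculation unit 1j reads the reference keyword vector B stored in advance in a storage device as the keyword vector dictionary B 54 and calculates the similarity p1 between the reference keyword vector B and the keyword vector Sb. In addition, the keyword vector filtering unit 1k reads the reference keyword vector B stored as the keyword vector dictionary B 54, computes the intersection (S∩B) of the reference keyword vector B and the input keyword vector S, and generates the keyword vector Sw (= S - S∩B) by removing the result from S. The keyword vector similarity calculation unit 1m reads the reference keyword vector W stored as the keyword vector dictionary W 55 and calculates the similarity p2 between the reference keyword vector W and the keyword vector Sw. The determination unit 51e then judges and sorts the mail document as necessary or unnecessary on the basis of the similarities p1 and p2.
[0090]
As an example, if the reference keyword vector B represents the features of documents regarded as unnecessary and the reference keyword vector W represents the features of documents regarded as necessary, the determination unit 51e sorts the mail document as an unnecessary document when the similarity p1 is greater than a predetermined condition value T1 and the similarity p2 is smaller than a predetermined condition value T2, and sorts it as a necessary document when the similarity p1 is smaller than the condition value T1 and the similarity p2 is greater than the condition value T2.
[0091]
Thus, in this example, the keyword vector filtering unit 1i removes from the keyword vector S generated from the received mail the components of the keyword vector W generated from necessary mail, and the keyword vector filtering unit 1k removes from the keyword vector S the components of the keyword vector B generated from unnecessary mail, so the set of keywords contained in both necessary mail and unnecessary mail is excluded.
[0092]
Consequently, judging whether mail is unnecessary or necessary from the similarity p1 between the keyword vector Sb and the keyword vector B and the similarity p2 between the keyword vector Sw and the keyword vector W evaluates the mail with the keywords common to necessary and unnecessary mail excluded.
[0093]
As a result, (1) the similarity is calculated from distinctive elements only, so the similarity values become more distinctive, the determination condition 53 used by the determination unit 51e becomes easier to set, and the setting burden is reduced. In addition, (2) performing the similarity calculation only on the keyword vectors (Sb, Sw) needed for the judgment reduces the number of operations in the similarity calculations of the keyword vector similarity calculation units 1j and 1m.
[0094]
FIG. 6 is a block diagram showing a fifth configuration example of the document filtering system according to the present invention.
[0095]
Like the mail filtering devices 1 and 31 to 51 of FIGS. 1 and 3 to 5, the mail filtering device 61 shown in FIG. 6 forms part of the mail delivery service system shown in FIG. 2. It comprises, as functions executed by a computer program, the keyword vector generation block 10 (consisting of the receiving unit 1a, the morphological analysis unit 1b, and the keyword vector generation unit 1c, not shown), an arithmetic processing block 11d made up of the processing units shown in FIGS. 1 and 3 to 5, determination units 61e and 61ee, the necessary mail processing unit 1f, and the unnecessary mail processing unit 1g.
[0096]
In this configuration, the keyword vector generation block 10 of the mail filtering device 61 receives the mail, morphologically analyzes it, and generates the input keyword vector S. The arithmetic processing block 11d calculates the similarities p1 and p2 between the input keyword vector S and the various reference keyword vectors, and the determination unit 61e judges and sorts the input mail document as necessary or unnecessary on the basis of p1 and p2 in accordance with a determination condition 63a.
[0097]
The mail filtering device 61 of this example is characterized by a mechanism that also handles an "other" result, in which the determination unit 61e can sort the input mail document as neither necessary mail nor unnecessary mail.
[0098]
That is, this example applies when, for instance, the determination units 1e, 31e, 41e, and 51e of the mail filtering devices 1, 31, 41, and 51 of FIG. 1 and FIGS. 3 to 5 return "other", and it has a mechanism that makes an arithmetic processing block 61d repeat the similarity calculation.
[0099]
In the FIG. 6 example, the arithmetic processing block 11d is configured to output the similarities p1 and p2. When the mechanism is applied to the mail filtering devices 1 and 41 shown in FIGS. 1 and 4, only the similarity p1 is output; when it is applied to the mail filtering devices 31 and 51 shown in FIGS. 3 and 5, the similarities p1 and p2 are output as shown in FIG. 6.
[0100]
The following describes an application example in which the arithmetic processing block of the mail filtering device 1 of FIG. 1 is arranged in two stages as shown in FIG. 6.
[0101]
In the first-stage arithmetic processing block 11d, the similarity of the received mail is calculated from, for example, a keyword vector B1 generated from mail that the individual user regards as unnecessary, and the determination unit 61e judges whether the mail is necessary or unnecessary. In the second-stage arithmetic processing block 61d, the similarity of the received mail is calculated from a keyword vector B2 generated from mail generally regarded as unnecessary, and the determination unit 61ee judges whether the mail is necessary or unnecessary.
[0102]
First, suppose that in the first-stage arithmetic processing block 11d the similarity p1 between the keyword vector S and the keyword vector B1 generated from personally unnecessary mail is sorted as "other" by the determination unit 61e under the determination condition 63a.
[0103]
In that case, processing continues in the second-stage arithmetic processing block 61d, which takes the keyword vector S as input and processes it in the same way as the first stage.
[0104]
Because the second stage is the final stage in this example, the final-stage determination unit 61ee sorts the mail as either "unnecessary mail" or "necessary mail".
[0105]
If the determination unit 61e that follows the first-stage arithmetic processing block 11d sorts the mail as either "unnecessary mail" or "necessary mail", processing does not continue to the second-stage arithmetic processing block 61d and instead moves directly to the necessary mail processing unit 1f or the unnecessary mail processing unit 1g.
[0106]
In this way, even for a mail document that the first-stage determination unit 61e cannot sort as either an unnecessary mail document or a necessary mail document, this example repeats the calculation of similarity based on keyword vectors and the sorting based on that similarity, so the mail document can be sorted as one or the other.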
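The two-stage arrangement can be sketched as a small cascade: the first stage may return "other", and only then does the final stage run and force a two-way decision. The source does not say how the first-stage condition 63a defines "other", so the `lower`/`upper` band below is a hypothetical choice, as are all function names, parameter names, and sample vectors.

```python
def inner_product(u, v):
    return sum(w * v[k] for k, w in u.items() if k in v)


def three_way(p, lower, upper):
    """Hypothetical first-stage rule: decide only when p is clearly high or low."""
    if p > upper:
        return "unnecessary"
    if p < lower:
        return "necessary"
    return "other"


def two_stage_filter(s, b1, b2, n_keywords, lower, upper, final_threshold):
    """Stage 1 uses the personal vector b1; stage 2 uses the general vector b2."""
    first = three_way(inner_product(s, b1) / n_keywords, lower, upper)
    if first != "other":
        return first  # decided in stage 1, stage 2 is skipped
    # Final stage: must return one of the two labels.
    p = inner_product(s, b2) / n_keywords
    return "unnecessary" if p > final_threshold else "necessary"


s = dict.fromkeys(["a", "b", "c", "d"], 1)
b1 = dict.fromkeys(["a", "b"], 1)       # stage 1 similarity: 2/4 = 0.5 -> "other"
b2 = dict.fromkeys(["a", "b", "c"], 1)  # stage 2 similarity: 3/4 = 0.75
print(two_stage_filter(s, b1, b2, 4, lower=0.25, upper=0.75, final_threshold=0.6))
```

Swapping `b1` or `b2` for a necessary-mail vector would need the comparison direction reversed in that stage, which this sketch does not cover.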
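[0107]
When the similarity calculation is repeated, the reference keyword vector used for the calculation can be replaced freely.
[0108]
For example, the first-stage arithmetic processing block 11d and determination unit 61e judge the received mail using a personally generated keyword vector B1 of personal unnecessary mail. If the first stage cannot decide whether the mail is unnecessary, the second-stage arithmetic processing block 61d and determination unit 61ee judge the received mail using a general unnecessary-mail keyword vector B2 available on the network, so that the necessity of the mail can be judged more accurately.
[0109]
Likewise, the first-stage arithmetic processing block 11d and determination unit 61e may judge the received mail using a personally generated keyword vector W1 generated from mail the individual regards as necessary. If the first stage cannot decide whether the mail is necessary, the second-stage arithmetic processing block 61d and determination unit 61ee judge the received mail using a keyword vector W2 generated from mail generally regarded as necessary on the network, so that the necessity of the mail can be judged more accurately.
[0110]
Alternatively, the first-stage arithmetic processing block 11d and determination unit 61e may judge the received mail using the personally generated keyword vector W1 generated from mail the individual regards as necessary. If the first-stage determination unit 61e cannot decide whether the mail is necessary, the second-stage arithmetic processing block 61d and determination unit 61ee judge the received mail using the keyword vector B2 generated from mail generally regarded as unnecessary, so that the necessity of the mail can be judged more accurately.
[0111]
Thus, in this example the contents of the reference keyword vectors used for the similarity calculation can be combined according to the purpose, which enables highly accurate judgment.
[0112]
As described above with reference to FIGS. 1 to 6, the mail filtering device of this example takes e-mail as the document to be sorted and judges whether a mail is unnecessary or necessary using the similarity between the keyword vector S generated from the received mail and, for example, a keyword vector B generated in advance by morphologically analyzing and weighting the text of unnecessary mail, or the similarity with a keyword vector W generated by morphologically analyzing and weighting the text of necessary mail. Even when the text of an unnecessary mail is unknown, whether it is unnecessary mail can therefore be judged from the similarity of the keyword vectors.
[0113]
For example, the mail filtering device 1 shown in FIG. 1 morphologically analyzes the text of the received mail (including header information and signature), generates the keyword vector S of the received mail, calculates the similarity p1 between S and the keyword vector B obtained from the keyword vector dictionary B that the keyword vector generation device (5) generated from unnecessary mail, and judges whether the mail is necessary or unnecessary from the magnitude of p1 and a pre-registered determination condition. Mail with the same features as unnecessary mail can thereby be identified as unnecessary mail.
[0114]
The mail filtering device 31 shown in FIG. 3 compares the similarity p1, between the keyword vector S of the received mail and the keyword vector B obtained from the keyword vector dictionary B that the keyword vector generation device (5) generated from unnecessary mail, with the similarity p2, between S and the keyword vector W obtained from the keyword vector dictionary W that the keyword vector generation device (5) generated from necessary mail, and judges whether the mail is necessary or unnecessary from the magnitudes of p1 and p2 and the determination condition. This FIG. 3 configuration solves the problem of the FIG. 1 example.
[0115]
That is, in the mail filtering device 1 of the FIG. 1 configuration, the determination unit 1e judges whether mail is unnecessary or necessary only from the similarity p1 between the keyword vector S and the keyword vector B generated from unnecessary mail. Even when the similarity between S and the keyword vector W generated from necessary mail is actually larger, the mail is therefore treated as unnecessary whenever p1 satisfies the condition for unnecessary mail. In the mail filtering device 31 of the FIG. 3 configuration, the determination unit 31e judges from the relationship between the two values p1 and p2, which resolves this problem.
[0116]
The mail filtering device 41 shown in FIG. 4 adds a keyword vector filtering unit to the mail filtering device 1 shown in FIG. 1. It uses Sb, obtained by removing from the keyword vector S of the received mail the intersection (S∩W) of S and the keyword vector W obtained from a keyword vector dictionary (generated, for example, from necessary mail), calculates the similarity p1 between this keyword vector Sb and the keyword vector B obtained from a keyword vector dictionary (generated, for example, from unnecessary mail), and judges whether the mail is necessary or unnecessary from the magnitude of p1 and the determination condition.
[0117]
In the mail filtering device 41 shown in FIG. 4, the keyword vector Sb generated by the keyword vector filtering unit 1i is thus the keyword vector S generated from the received mail with the components of the keyword vector W generated from necessary mail removed, so the mail can be evaluated with the set of keywords contained in both necessary mail and unnecessary mail excluded. As a result, (1) the similarity is calculated from distinctive elements only, so the similarity value becomes more distinctive and the determination condition of the determination unit becomes less difficult to set, and (2) performing the similarity calculation only on the keyword vector needed for the judgment reduces the number of operations in the similarity calculation.
[0118]
The mail filtering device 51 shown in FIG. 5 adds keyword vector filtering units to the mail filtering device 31 shown in FIG. 3. It uses Sb, obtained by removing from the keyword vector S of the received mail the intersection (S∩W) of S and the keyword vector W obtained from a keyword vector dictionary (generated from necessary mail), and calculates the similarity p1 between this keyword vector Sb and the keyword vector B obtained from a keyword vector dictionary (generated from unnecessary mail). It also uses Sw, obtained by removing from S the intersection (S∩B) of S and the keyword vector B obtained from a keyword vector dictionary (generated from unnecessary mail), and calculates the similarity p2 between this keyword vector Sw and the keyword vector W obtained from a keyword vector dictionary (generated from necessary mail). The determination unit compares these similarities p1 and p2 to judge whether the mail is unnecessary or necessary.
[0119]
According to this example, the keyword vector Sb generated by the added first keyword vector filtering unit is the keyword vector S generated from the received mail with the components of the keyword vector W generated from necessary mail removed, and the keyword vector Sw generated by the second keyword vector filtering unit is the keyword vector S with the components of the keyword vector B generated from unnecessary mail removed. Judging whether mail is unnecessary or necessary from the similarity p1 between Sb and B and the similarity p2 between Sw and W therefore evaluates the mail with the set of keywords contained in both necessary mail and unnecessary mail excluded.
[0120]
As a result, in this example, (1) the similarity is calculated from distinctive elements only, so the similarity values become more distinctive and the determination condition of the determination unit becomes less difficult to set, and (2) performing the similarity calculation only on the keyword vectors needed for the judgment reduces the number of operations in the similarity calculation.
[0121]
In the mail filtering device 61 shown in FIG. 6, the determination unit of each of the mail filtering devices 1 and 31 to 51 of FIG. 1 and FIGS. 3 to 5 may sort a mail as neither necessary mail nor unnecessary mail. For a mail sorted as "other", the processing of the arithmetic processing block and the determination unit is repeated, and the determination unit of the final stage refers to a determination condition describing the similarity magnitudes for sorting the mail into either necessary mail or unnecessary mail, and sorts the received mail into one of the two on the basis of the magnitudes of the similarities p1 and p2, or of the similarity p1 alone.
[0122]
For example, when this is applied to the mail filtering device 1 of FIG. 1 with two stages of arithmetic processing block and determination unit, if the first-stage arithmetic processing block calculates the similarity and the determination unit returns "other", processing moves to the second-stage arithmetic processing block and determination unit. This enables accurate judgment in which the first stage judges with a keyword vector B1 that reliably indicates "unnecessary mail" and, if the first stage cannot decide, the second stage judges with a keyword vector B2 that indicates general "unnecessary mail".
[0123]
Similarly, this arrangement also allows an operation in which the first stage judges with a keyword vector W that reliably indicates "necessary mail" and, if the first stage cannot decide, the second stage judges with a keyword vector B that indicates "unnecessary mail".
[0124]
The present invention is not limited to the examples described with reference to FIGS. 1 to 6 and can be modified in various ways without departing from its gist. For example, communication among the mail filtering device, the keyword vector generation device, the mail server device, the mail client device, and so on shown in FIG. 2 may take place over a network such as the Internet or a LAN, or may be local communication within a computer.
[0125]
For example, in the FIG. 2 example the mail filtering device refers over the network to the keyword vectors generated by the keyword vector generation device, but the keyword vectors generated by the keyword vector generation device may instead be loaded into the mail filtering device in advance, or the keyword vector generation device may be provided inside the mail filtering device.
[0126]
In the FIG. 5 example, the keyword vector filtering unit 1i and the keyword vector similarity calculation unit 1m refer to the same keyword vector dictionary W 55, and the keyword vector filtering unit 1k and the keyword vector similarity calculation unit 1j refer to the same keyword vector dictionary B 54. However, the keyword vector filtering unit 1i and the keyword vector similarity calculation unit 1m may refer to different keyword vector dictionaries (W, Wa), and the keyword vector filtering unit 1k and the keyword vector similarity calculation unit 1j may likewise refer to different keyword vector dictionaries (B, Ba).
[0127]
Although this example uses e-mail as the target of the necessary/unnecessary judgment, any text data made up of character strings may be used, and the invention is not limited to e-mail.
[0128]
The computer configuration in this example may omit a keyboard or an optical disk drive. Although an optical disk is used as the recording medium in this example, an FD (Flexible Disk) or the like may be used instead. The program may also be installed by downloading it over a network via a communication device.
[0129]
[Effects of the Invention]
According to the present invention, taking e-mail as an example, it becomes possible to judge whether unknown mail that cannot be identified by specifying mail conditions is necessary or unnecessary. Unnecessary mail can be identified even when unnecessary mail with similar content is sent to an unspecified number of people under a falsified sender. Furthermore, even for mail written using words similar to those of necessary mail, using keyword vectors of unnecessary mail that others have judged unnecessary reduces the likelihood of misclassification when judging by the feature vector of the mail body, so that documents can be sorted with high accuracy, for example in distinguishing unnecessary mail from necessary mail.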
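[Brief Description of the Drawings]
[FIG. 1] A block diagram showing a first configuration example of the document filtering system according to the present invention.
[FIG. 2] A block diagram showing a configuration example of a mail delivery service system that uses the document filtering system of FIG. 1.
[FIG. 3] A block diagram showing a second configuration example of the document filtering system according to the present invention.
[FIG. 4] A block diagram showing a third configuration example of the document filtering system according to the present invention.
[FIG. 5] A block diagram showing a fourth configuration example of the document filtering system according to the present invention.
[FIG. 6] A block diagram showing a fifth configuration example of the document filtering system according to the present invention.
[Description of Reference Signs]
1, 31, 41, 51, 61: mail filtering device; 1a: receiving unit; 1b: morphological analysis unit; 1c: keyword vector generation unit; 1d, 1h, 1j, 1m: keyword vector similarity calculation unit; 1e, 31e, 41e, 51e, 61e, 61ee: determination unit; 1f: necessary mail processing unit; 1g: unnecessary mail processing unit; 1i, 1k: keyword vector filtering unit; 2: mail server device ("mail server"); 3: mail client device ("mail client"); 4: user; 5: keyword vector generation device; 6a to 6d: IP network; 10: keyword vector generation block; 11, 11a, 11b, 11c, 11d, 61d: arithmetic processing block; 12: mail; 13, 33, 43, 53, 63a, 63b: determination condition; 14, 34, 44, 54: keyword vector dictionary B; 35, 45, 55: keyword vector dictionary W.
[0001]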
[Technical field of the invention]
The present invention relates to a technique for sorting documents, for example deciding whether an e-mail received from the Internet is unnecessary mail or necessary mail, and more particularly to a document filtering technique suited to high-precision sorting.
[0002]
[Prior art]
In recent years, with the spread of e-mail on networks such as the Internet and the shift to broadband, nuisance mail and advertising mail that recipients regard as unnecessary are increasingly sent indiscriminately.
[0003]
Such unnecessary mail is handled by specifying conditions, such as whether the sender's address, the subject, or the body contains a specified character string. Mail software or the provider's mail server then distinguishes unnecessary mail from necessary mail when mail is received, and automatically deletes or sorts the unnecessary mail.
[0004]
However, these techniques have the problem that, even if mail addresses considered to belong to unnecessary mail or reception conditions are specified in advance, unnecessary mail that does not match the specified conditions is not processed.
[0005]
In the "user filtering system and method in communication service" described in Patent Document 1, when a received mail is unnecessary mail, the sender's evaluation points are lowered and registered in a sender evaluation parameter storage device on the network.
[0006]
Separately, when a mail is received, the sender's evaluation points are looked up in the sender evaluation parameter storage device on the network using the sender's address as a key. If the evaluation points registered for that address have been lowered, the mail is judged to be unnecessary mail and processing such as refusing to receive it is performed.
[0007]
However, with this technique, when the sender uses a different fictitious address each time, the mail cannot be identified by the sender's address.
[0008]
The "document classification device" described in Patent Document 2 analyzes document data, automatically extracts feature vectors, and automatically classifies similar documents. Using this device when mail is received makes automatic classification of mail possible.
[0009]
However, this technique only groups texts with similar features, and a human must still judge whether a given class is unnecessary mail. In addition, mail written using words similar to those of necessary mail is likely to be misclassified.
[0010]
[Patent Document 1]
JP 2003-18324 A
[Patent Document 2]
Patent No. 2978044
[0011]
[Problems to be solved by the invention]
The problems to be solved are as follows. In the conventional technology, taking e-mail as an example, unnecessary mail is distinguished from necessary mail by specifying conditions such as whether the sender's address, the subject, or the body contains a specified character string, but unknown unnecessary mail may not be identifiable with such conditions. Even if the senders of unnecessary mail can be identified from evaluation point information supplied by others, unnecessary mail cannot be identified when the sender is falsified. And when mail is classified by the feature vector of its body, mail written using words similar to those of necessary mail is likely to be misclassified.
[0012]
An object of the present invention is to solve these problems of the prior art and to enable high-accuracy document sorting, such as distinguishing unnecessary mail from necessary mail.
[0013]
[Means for Solving the Problems]
To achieve the above object, the present invention, taking e-mail as an example, determines whether a mail is unnecessary or necessary using the similarity between a keyword vector S generated from the received mail and, for example, a keyword vector B generated in advance by morphologically analyzing and weighting the text of unnecessary mail, or the similarity with a keyword vector W generated by morphologically analyzing and weighting the text of necessary mail. Even when the text of an unnecessary mail is unknown, whether it is unnecessary mail can thus be determined from the similarity of the keyword vectors.

More specifically, (1) a mail filtering device that receives an e-mail and determines whether the mail is unnecessary mail has, on a network or locally, a keyword vector dictionary B in which a keyword vector B is stored in advance, and comprises: a receiving unit that receives e-mail; a morphological analysis unit that morphologically analyzes the text of the received mail into a set of keywords; a keyword vector generation unit that generates a keyword vector S from the keyword set after the morphological analysis; a keyword vector similarity calculation unit that calculates a similarity p1 by an operation such as an inner product or a cosine between the generated keyword vector S and the keyword vector B obtained from the keyword vector dictionary; a determination unit that refers to a determination condition describing the similarity magnitudes used to distinguish necessary mail from unnecessary mail and determines from the magnitude of the similarity p1 whether the mail is necessary or unnecessary; an unnecessary mail processing unit that performs appropriate processing, such as deleting the mail, when the determination unit determines that the mail is unnecessary; and a necessary mail processing unit that performs appropriate processing, such as receiving the mail with mail software, when the determination unit determines that the mail is necessary.

(2) Alternatively, a mail filtering device that receives an e-mail and determines whether the mail is unnecessary mail has, on a network or locally, a keyword vector dictionary in which a keyword vector B is stored in advance and a keyword vector dictionary in which a keyword vector W is stored in advance, and comprises: a receiving unit that receives e-mail; a morphological analysis unit that morphologically analyzes the text of the mail into a set of keywords; a keyword vector generation unit that generates a keyword vector S from the keyword set after the morphological analysis; a keyword vector similarity calculation unit that calculates a similarity p1 by an operation such as an inner product or a cosine between the generated keyword vector S and the keyword vector B obtained from the keyword vector dictionary B; a keyword vector similarity calculation unit that calculates a similarity p2 by an operation such as an inner product or a cosine between the generated keyword vector S and the keyword vector W obtained from the keyword vector dictionary W; a determination unit that refers to a determination condition describing the similarity magnitudes used to distinguish necessary mail from unnecessary mail and determines from the magnitudes of the similarities p1 and p2 whether the mail is necessary or unnecessary; an unnecessary mail processing unit that performs appropriate processing, such as deleting the mail, when the determination unit determines that the mail is unnecessary; and a necessary mail processing unit that performs appropriate processing, such as receiving the mail with mail software, when the determination unit determines that the mail is necessary.

(3) The mail filtering device of (1) may further comprise, at the stage following the keyword vector generation unit, a keyword vector filtering unit that performs an intersection operation (S∩W) on the keyword vector S and the keyword vector W obtained from the keyword vector dictionary W and generates a keyword vector Sb (= S - (S∩W)) by removing the keyword vector (S∩W) from the keyword vector S. The keyword vector similarity calculation unit calculates the similarity p1 by an operation such as an inner product or a cosine between the generated keyword vector Sb and the keyword vector B obtained from the keyword vector dictionary B, and processing continues with the similarity p1 as the input to the determination unit.

(4) The mail filtering device of (2) may further comprise, at the stage following the keyword vector generation unit, a keyword vector filtering unit that performs an intersection operation (S∩W) on the keyword vector S and the keyword vector W obtained from the keyword vector dictionary W and generates a keyword vector Sb (= S - (S∩W)) by removing the keyword vector (S∩W) from the keyword vector S, and a keyword vector filtering unit that performs an intersection operation (S∩B) on the keyword vector S and the keyword vector B obtained from the keyword vector dictionary B and generates a keyword vector Sw (= S - (S∩B)) by removing the keyword vector (S∩B) from the keyword vector S. One keyword vector similarity calculation unit calculates the similarity p1 by an operation such as an inner product or a cosine between the generated keyword vector Sb and the keyword vector B obtained from the keyword vector dictionary B, another calculates the similarity p2 by such an operation between the generated keyword vector Sw and the keyword vector W obtained from the keyword vector dictionary W, and processing continues with the similarities p1 and p2 as the inputs to the determination unit.

(5) The mail filtering device of any of (1) to (4) may comprise: a determination unit that refers to a determination condition describing the similarity magnitudes used to distinguish necessary mail from unnecessary mail and sorts the mail, from the magnitude of the similarity p1 (and p2), into three classes, namely necessary mail, unnecessary mail, and other; a mechanism that, when the result is "other", continues and repeats the processing of the operation and calculation units; a final-stage determination unit that refers to a determination condition describing the similarity magnitudes and sorts the mail, from the magnitude of the similarity p1 (and p2), into one of the two classes necessary mail and unnecessary mail; an unnecessary mail processing unit that performs appropriate processing, such as deleting the mail, when the determination unit determines that the mail is unnecessary; and a necessary mail processing unit that performs appropriate processing, such as receiving the mail with mail software, when the determination unit determines that the mail is necessary. In (5), the operation and calculation units are used in a multi-stage configuration, and the keyword vector B and the keyword vector W used by each operation and calculation unit may be replaced from stage to stage. When a plurality of keyword vectors B are used, they may have identical contents or different contents.
Also, a plurality of keyword vectors B may be used, a keyword vector B having the same contents, or a keyword vector B having different contents.
[0014]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
[0015]
FIG. 1 is a block diagram showing a first configuration example of a document filtering system according to the present invention, and FIG. 2 is a block diagram showing a configuration example of a mail distribution service system using the document filtering system in FIG. 1.
[0016]
In FIG. 2, 1 is a mail filtering device serving as the document filtering system of the present invention, 2 is a mail server device (labeled "mail server" in the drawing), 3 is a mail client device (labeled "mail client" in the drawing), 4 is a user of the mail client device, 5 is a keyword vector generation device that generates the keyword vector dictionary according to the present invention, and 6a to 6d are IP networks such as the Internet or an intranet. The following description uses e-mail as an example.
[0017]
Each of the mail filtering device 1, the mail server device 2, the mail client device 3, and the keyword vector generation device 5 has a computer configuration including a CPU (Central Processing Unit), a main memory, a display device, an input device, and an external storage device. Programs and data recorded on a storage medium such as a CD-ROM are installed in the external storage device via an optical disk drive or the like, then read from the external storage device into the main memory and processed by the CPU, thereby executing the various processing functions.
[0018]
The mail filtering device 1 receives mail from the mail server device 2 or a mail storage disk (not shown). It also obtains a keyword vector from the keyword vector generation device 5 and determines whether the mail is unnecessary mail. If the mail is determined to be unnecessary mail, the device performs processing such as deleting the mail, rewriting the contents of the mail, or not passing the mail to the mail client device 3.
[0019]
If it is determined that the mail is a necessary mail, a process such as passing the mail to the mail client device 3 without deleting the mail from the mail server device 2 is performed.
[0020]
In this example, the keyword vector generation device 5 receives a mail from the mail client device 3, generates a keyword vector using the received mail, stores the keyword vector in the keyword vector dictionary, and provides it to the mail filtering device 1.
[0021]
A plurality of keyword vector generation devices 5 may be provided. For example, there may be a keyword vector generation device that accepts only "unnecessary mail" and is specialized in generating keyword vectors of unnecessary mail and storing them in a keyword vector dictionary; one that accepts only "necessary mail" and is specialized in generating keyword vectors of necessary mail and storing them in a keyword vector dictionary; one that accepts only "private necessary mail" and is specialized in generating keyword vectors of mail needed for private use and storing them in a keyword vector dictionary; or one that accepts only "private unnecessary mail" and is specialized in generating keyword vectors of mail not needed for private use and storing them in a keyword vector dictionary.
[0022]
Alternatively, a single keyword vector generation device 5 may be provided that generates a keyword vector for each of the above categories ("unnecessary mail", "necessary mail", "private necessary mail", and "private unnecessary mail") and stores each in its own keyword vector dictionary.
[0023]
Further, the mail transmitted to the keyword vector generating device 5 may be a mail obtained via the mail filtering device 1 or a normal mail.
[0024]
Hereinafter, details of the mail filtering device shown in FIG. 1 will be described.
[0025]
As shown in FIG. 1, the mail filtering device 1 includes a receiving unit 1a, a morphological analysis unit 1b, a keyword vector generation unit 1c, a keyword vector similarity calculation unit 1d, a determination unit 1e, a necessary mail processing unit 1f, and an unnecessary mail processing unit 1g, each of which performs processing as an execution function based on a computer program.
[0026]
In this example, a functional block including a receiving unit 1a, a morphological analysis unit 1b, and a keyword vector generating unit 1c is referred to as a keyword vector generating block 10, and a functional block including a keyword vector similarity calculating unit 1d is referred to as an arithmetic processing block 11.
[0027]
In such a configuration, the mail filtering device 1 receives the mail 12 in the receiving unit 1a, and divides the received mail into a set of keywords by the morphological analysis unit 1b. At this time, the morphological analysis unit 1b can use a morphological analysis tool such as “chasen” (available at http://chasen.asit-nara.ac.jp/).
[0028]
The keyword vector generation unit 1c generates a weighted keyword vector S from the keyword set generated by the morphological analysis unit 1b. The weight may be the number of occurrences of the same keyword in the keyword set, "0" or "1" depending on the absence or presence of the keyword, or a value obtained by applying a conversion formula to the number of occurrences.
[0029]
For example, morphological analysis of the mail "From Okakoko @ jp, how are you? Are you out of the office today?" yields "From / okakoko / @ / jp / Hello / Genki / Is / Is / ? / ? / Today / is / already / leaving / is / is / ??", that is, 17 keywords. If a weight of "0" or "1" is assigned depending on whether a keyword appears, and the keyword vector is generated without duplicating keywords that appear more than once, the result is as follows.
[0030]
(Keyword) (weight)
From "1"
Okakoko "1"
＠ "1"
jp "1"
Hello "1"
Contact "1"
Fine "1"
"1"
Or "1"
? "1"
Today "1"
Is "1"
Another "1"
Leaves "1"
[0031]
As described above, the keyword vector S including the 14 keyword elements is generated. Note that the example of the division result differs depending on the type of the morphological analysis tool.
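The three weighting options described in [0028] can be illustrated with a minimal sketch. The function name `keyword_vector`, the scheme names, the `1 + log(n)` conversion formula, and the token list are all assumptions made for illustration; the text does not specify a conversion formula, and a real system would take its tokens from a morphological analyzer.

```python
import math
from collections import Counter


def keyword_vector(tokens, scheme="binary"):
    """Return a keyword vector as a dict mapping keyword -> weight."""
    counts = Counter(tokens)  # occurrences of each distinct keyword
    if scheme == "count":
        # Weight = number of occurrences of the keyword.
        return dict(counts)
    if scheme == "binary":
        # Weight = 1 for every keyword that is present (absent keywords are implicit 0).
        return {kw: 1 for kw in counts}
    if scheme == "log":
        # Assumed conversion formula applied to the occurrence count.
        return {kw: 1.0 + math.log(n) for kw, n in counts.items()}
    raise ValueError(f"unknown weighting scheme: {scheme}")


# Hypothetical token list standing in for the output of a morphological analyzer.
tokens = ["hello", "how", "are", "you", "?", "are", "you", "leaving", "?"]
S = keyword_vector(tokens)
```

Here nine tokens collapse to six distinct keywords, in the same way that the 17 keywords of the example mail collapse to 14 keyword elements.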
[0032]
The keyword vector similarity calculation unit 1d acquires the keyword vector B from the keyword vector dictionary B provided by the keyword vector generation device 5 in FIG. 2, and calculates the similarity between the keyword vector B and the keyword vector S generated by the keyword vector generation unit 1c.
[0033]
For example, the keyword vector B generated based on the unnecessary mail is as follows.
[0034]
(Keyword) (weight)
From "1"
Spam "1"
＠ "1"
jp "1"
net "1"
Not "1"
Consent "1"
Sales "1"
Limited "1"
Adult "1"
Cheap "1"
Must see "1"
Contact "1"
Is "1"
! "1"
[0035]
In this case, when the similarity between the keyword vector S and the keyword vector B is calculated as the inner product of the keyword vectors and normalized by the number of keywords produced by the morphological analysis, the similarity p1 is obtained as follows.
[0036]
p1 = (S · B) / (number of keywords) = 4/17 = 0.235
[0037]
Here, it is also possible to derive the similarity p1 = (S · B) without normalizing with the number of keywords.
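The calculation of p1 can be sketched as follows, under several labeled assumptions. Keyword vectors are represented as dictionaries, and the names `inner_product`, `cosine`, and `NUM_KEYWORDS` are illustrative. The placeholder `"(blank)"` stands in for the keyword that is empty in the table in [0030]. One further assumption: the table in [0034] lists `jp`, which also appears in S, yet the worked value 4/17 and the intersection listed in [0084] count only four shared keywords (From, @, Contact, Is), so this sketch omits `jp` from B in order to reproduce the stated result.

```python
import math


def inner_product(u, v):
    """Sum of weight products over the keywords shared by u and v."""
    return sum(w * v[kw] for kw, w in u.items() if kw in v)


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return inner_product(u, v) / (norm_u * norm_v)


# "(blank)" is a hypothetical placeholder for the keyword left empty in [0030].
S = dict.fromkeys(
    ["From", "Okakoko", "@", "jp", "Hello", "Contact", "Fine", "(blank)",
     "Or", "?", "Today", "Is", "Another", "Leaves"], 1)

# Assumption: "jp" from the table in [0034] is omitted (see the lead-in).
B = dict.fromkeys(
    ["From", "Spam", "@", "net", "Not", "Consent", "Sales", "Limited",
     "Adult", "Cheap", "Must see", "Contact", "Is", "!"], 1)

NUM_KEYWORDS = 17  # keywords produced by the morphological analysis of the mail

p1 = inner_product(S, B) / NUM_KEYWORDS  # normalized inner product
p1_cosine = cosine(S, B)                 # cosine, the other measure named in the text
```

The normalized inner product reproduces 4/17; the cosine variant gives a different value for the same vectors, so any threshold would have to be chosen to match the measure in use.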
[0038]
Using the similarity p1 calculated by the keyword vector similarity calculator 1d and the preset determination condition 13, the determination unit 1e determines whether the received mail is an unnecessary mail or a necessary mail, and selects the received mail.
[0039]
An example of the determination condition 13 is "a threshold value n1 is determined in advance; the mail is unnecessary mail if the similarity p1 exceeds the threshold value n1, and necessary mail if it does not".
[0040]
In the above example, if "0.700" is set as the threshold value n1, the similarity p1 is "0.235", so "p1 < n1" holds and the mail is determined to be necessary mail.
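As a minimal sketch, the threshold rule of determination condition 13 reduces to a single comparison. The function name `discriminate` and the string labels are illustrative assumptions.

```python
def discriminate(p1, n1):
    """Return "unnecessary" if p1 exceeds the threshold n1, else "necessary"."""
    return "unnecessary" if p1 > n1 else "necessary"


# Values from the worked example: p1 = 0.235, n1 = 0.700.
verdict = discriminate(0.235, 0.700)
```

With these values the mail is classified as necessary, matching the text. A similarity exactly equal to n1 does not "exceed" the threshold, so it is also treated as necessary.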
[0041]
When the mail is determined to be necessary mail, the necessary mail processing unit 1f performs processing such as acquiring the mail from the mail server device 2 and passing it to the mail client device 3 without deleting it.
[0042]
When the mail is determined to be unnecessary mail, the unnecessary mail processing unit 1g performs processing such as deleting the mail from the mail server device 2, rewriting the contents of the mail, or not passing it to the mail client device 3.
[0043]
As described above, when the mail filtering device 1 of this example sorts an input mail document, the morphological analysis unit 1b morphologically analyzes the input mail document into a set of keywords; the keyword vector generation unit 1c generates, from the set of keywords, an input keyword vector S representing the features of the mail document; the keyword vector similarity calculation unit 1d reads out the reference keyword vector B, which was generated in advance by the keyword vector generation device 5 and stored in the storage device as a dictionary, and calculates the similarity p1 between the reference keyword vector B and the input keyword vector S; and the determination unit 1e refers to the determination condition 13 and, based on the similarity p1, determines whether the mail document is necessary or unnecessary and sorts it accordingly.
[0044]
In particular, in this example the reference keyword vector B represents the features of documents that are not needed. If the similarity p1 is larger than a predetermined condition value, the determination unit 1e judges the mail document to be similar to the features of the reference keyword vector B and sorts it as an unnecessary mail document. In this way, a mail having the same features as unnecessary mail can be determined to be unnecessary mail.
[0045]
FIG. 3 is a block diagram showing a second configuration example of the document filtering system according to the present invention.
[0046]
The mail filtering device 31 shown in FIG. 3, which serves as the mail filtering system of the present invention, also forms part of the mail distribution service system shown in FIG. 2, like the mail filtering device 1 in FIG. 1. It includes, as execution functions based on a computer program, a keyword vector generation block 10 (consisting of the receiving unit 1a, the morphological analysis unit 1b, and the keyword vector generation unit 1c, not shown), an arithmetic processing block 11a consisting of a keyword vector similarity calculation unit 1d and a keyword vector similarity calculation unit 1h, a determination unit 31e, a necessary mail processing unit 1f, and an unnecessary mail processing unit 1g.
[0047]
In such a configuration, the mail filtering device 31 receives the mail, performs morphological analysis of the received mail, and generates the input keyword vector S in the keyword vector generation block 10.
[0048]
Then, in the arithmetic processing block 11a of this example, the keyword vector similarity calculation unit 1d calculates the similarity p1 between the reference keyword vector B and the input keyword vector S using the keyword vector dictionary B, and the keyword vector similarity calculation unit 1h calculates the similarity p2 between the reference keyword vector W and the input keyword vector S using the keyword vector dictionary W.
[0049]
The determination unit 31e then uses the two values, the similarity p1 calculated by the keyword vector similarity calculation unit 1d and the similarity p2 calculated by the keyword vector similarity calculation unit 1h, to determine on the basis of the determination condition 33 whether the received mail document is necessary or unnecessary, and sorts it.
[0050]
One example of the determination condition 33 is "α is determined in advance; the mail is unnecessary mail if p1 + α > p2, and necessary mail otherwise". For example, with "similarity p1 = 0.235", "similarity p2 = 0.500", and α set in advance to "−0.100", "0.235 − 0.100 < 0.500" holds, so the mail is determined to be necessary mail.
[0051]
As another example of the determination condition 33, the condition can be set with a magnification β, such as "unnecessary mail if (p1 / β) > p2, and necessary mail otherwise". In this case, with "similarity p1 = 0.235" and "similarity p2 = 0.500" as above and β set in advance to "0.5", "0.235 / 0.5 < 0.500" holds, so the mail is determined to be necessary mail.
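The two forms of determination condition 33 can be sketched side by side. The function names `by_offset` and `by_ratio` and the string labels are illustrative assumptions; the comparisons and the example values are those given in the text.

```python
def by_offset(p1, p2, alpha):
    """Unnecessary mail if p1 + alpha > p2, otherwise necessary mail."""
    return "unnecessary" if p1 + alpha > p2 else "necessary"


def by_ratio(p1, p2, beta):
    """Unnecessary mail if (p1 / beta) > p2, otherwise necessary mail."""
    return "unnecessary" if p1 / beta > p2 else "necessary"


# Worked values from the text: p1 = 0.235, p2 = 0.500.
offset_verdict = by_offset(0.235, 0.500, alpha=-0.100)  # 0.135 vs 0.500
ratio_verdict = by_ratio(0.235, 0.500, beta=0.5)        # 0.470 vs 0.500
```

Both rules classify the example mail as necessary. A negative α or a β above 1 makes the filter more conservative about declaring mail unnecessary, while a positive α or a β below 1 makes it more aggressive.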
[0052]
Here, if it is determined that the mail is necessary, the necessary mail processing unit 1f is executed, and if it is determined that the mail is unnecessary, the unnecessary mail processing unit 1g is executed.
[0053]
As described above, in the mail filtering device 31 of the present example, when selecting the input mail document, the keyword vector similarity calculation unit 1d calculates the similarity p1 between the reference keyword vector B and the input keyword vector S, The keyword vector similarity calculator 1h calculates the similarity p2 between the reference keyword vector W and the input keyword vector S, and the determiner 31e determines whether the mail document is necessary based on the similarities p1 and p2. And sort.
[0054]
Further, as an example, assume that the reference keyword vector B represents the features of documents that are not needed and the reference keyword vector W represents the features of documents that are needed. The determination unit 31e then sorts the mail document as an unnecessary document if the similarity p1 is larger than a predetermined condition value T1 and the similarity p2 is smaller than a predetermined condition value T2, and sorts it as a necessary document if the similarity p1 is smaller than the condition value T1 and the similarity p2 is larger than the condition value T2.
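The two-threshold rule can be sketched as below. The text defines only the two clear-cut cases; returning "other" for every remaining combination is an assumption of this sketch (it mirrors the third outcome introduced for FIG. 6), and the threshold values used in the example calls are hypothetical.

```python
def discriminate_t(p1, p2, t1, t2):
    """Classify from p1 (similarity to B) and p2 (similarity to W)."""
    if p1 > t1 and p2 < t2:
        return "unnecessary"
    if p1 < t1 and p2 > t2:
        return "necessary"
    # Assumption: combinations the text leaves open are reported as "other".
    return "other"


# Hypothetical thresholds T1 = T2 = 0.3 with the similarities from the earlier example.
verdict = discriminate_t(0.235, 0.500, t1=0.3, t2=0.3)
```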
[0055]
Thus, the mail filtering system of the example of FIG. 3 can solve a problem of the configuration example of FIG. 1. In the example of FIG. 1, the determination unit 1e decides between unnecessary mail and necessary mail solely from the similarity p1 between the keyword vector S and the keyword vector B. Consequently, even when the mail is more similar to necessary mail, it is treated as unnecessary mail whenever the similarity p1 satisfies the condition for unnecessary mail. In the configuration of FIG. 3, the determination unit 31e decides from the relationship between the two values of the similarity p1 and the similarity p2, which solves this problem.
[0056]
FIG. 4 is a block diagram showing a third configuration example of the document filtering system according to the present invention.
[0057]
The mail filtering device 41 shown in FIG. 4, which serves as the mail filtering system of the present invention, also forms part of the mail distribution service system shown in FIG. 2, like the mail filtering devices 1 and 31 in FIGS. 1 and 3. It includes, as execution functions based on a computer program, a keyword vector generation block 10 (consisting of the receiving unit 1a, the morphological analysis unit 1b, and the keyword vector generation unit 1c, not shown), an arithmetic processing block 11b consisting of a keyword vector filtering unit 1i and a keyword vector similarity calculation unit 1j, a determination unit 41e, a necessary mail processing unit 1f, and an unnecessary mail processing unit 1g.
[0058]
In such a configuration, the mail filtering device 41 receives the mail, performs morphological analysis of the received mail, and generates the input keyword vector S in the keyword vector generation block 10.
[0059]
Then, in the arithmetic processing block 11b of this example, first, the keyword vector filtering unit 1i performs a process of removing the components of the reference keyword vector W stored in the keyword vector dictionary W45 from the input keyword vector S.
[0060]
For example, the product set (S∩W) of the keyword vector S of the received mail and the keyword vector W is subtracted from the keyword vector S to generate a keyword vector Sb (= S − (S∩W)) from which the components of the keyword vector W are excluded.
[0061]
Then, the keyword vector similarity calculation unit 1j calculates the similarity p1 between the keyword vector Sb generated by the keyword vector filtering unit 1i and the keyword vector B stored in the keyword vector dictionary B44, and the determination unit 41e determines, based on the determination condition 43, whether the mail is unnecessary mail or necessary mail, and sorts it.
[0062]
For example, in the case of the input keyword vector S illustrated in the description of FIG. 1, a product set “S∩W” of the keyword vector S and the reference keyword vector W is as follows.
[0063]
(Keyword) (weight)
From "1"
Okakoko "1"
＠ "1"
jp "1"
Hello "1"
Contact "1"
Fine "1"
? "1"
Is "1"
[0064]
“S− (S∩W)” is as follows.
[0065]
(Keyword) (weight)
"1"
Or "1"
Today "1"
Another "1"
Leaves "1"
[0066]
This becomes the keyword vector Sb.
[0067]
Then, when the similarity between the keyword vector Sb and the keyword vector B of the keyword vector dictionary B is calculated in the same manner as in the example of FIG. 1, "p1 = (Sb · B) / (number of keywords) = 0/17 = 0.000" is obtained.
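The filtering step Sb = S − (S∩W) followed by the similarity calculation can be sketched as follows. The text does not list the full keyword vector W, so the W below is a hypothetical dictionary containing exactly the nine keywords of the S∩W table in [0063]; `"(blank)"` is a placeholder for the keyword that is empty in the source tables; the function names are illustrative.

```python
def remove_common(s, ref):
    """Return S - (S ∩ ref): the entries of s whose keyword is not in ref."""
    return {kw: w for kw, w in s.items() if kw not in ref}


def inner_product(u, v):
    """Sum of weight products over the keywords shared by u and v."""
    return sum(w * v[kw] for kw, w in u.items() if kw in v)


# "(blank)" is a hypothetical placeholder for the keyword left empty in the source.
S = dict.fromkeys(
    ["From", "Okakoko", "@", "jp", "Hello", "Contact", "Fine", "(blank)",
     "Or", "?", "Today", "Is", "Another", "Leaves"], 1)

# Hypothetical W: only the keywords shown in the S∩W table are known.
W = dict.fromkeys(
    ["From", "Okakoko", "@", "jp", "Hello", "Contact", "Fine", "?", "Is"], 1)

B = dict.fromkeys(
    ["From", "Spam", "@", "jp", "net", "Not", "Consent", "Sales", "Limited",
     "Adult", "Cheap", "Must see", "Contact", "Is", "!"], 1)

Sb = remove_common(S, W)            # keyword vector filtering unit 1i
p1 = inner_product(Sb, B) / 17      # keyword vector similarity calculation unit 1j
```

After filtering, Sb keeps five keywords, none of which occurs in B, so p1 drops from 0.235 to 0.000 for this necessary-looking mail.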
[0068]
As described above, when the mail filtering device 41 of this example sorts an input mail document, the keyword vector filtering unit 1i reads out the reference keyword vector W stored in advance in the storage device as the keyword vector dictionary W45, performs a product set operation (S∩W) on the reference keyword vector W and the input keyword vector S generated by the keyword vector generation block 10, and generates a keyword vector Sb (= S − (S∩W)) by removing the result of the product set operation from the input keyword vector S. The keyword vector similarity calculation unit 1j reads out the reference keyword vector B stored in advance in the storage device as the keyword vector dictionary B44 and calculates the similarity p1 between the reference keyword vector B and the keyword vector Sb generated by the keyword vector filtering unit 1i. The determination unit 41e then determines whether the mail document is necessary based on the similarity p1 and sorts it.
[0069]
For example, assuming that the reference keyword vector B represents the features of documents that are not needed and the reference keyword vector W represents the features of documents that are needed, the keyword vector filtering unit 1i removes the components of the keyword vector W, generated from necessary mail, from the keyword vector S generated from the received mail, thereby excluding the keyword set that is included in both necessary mail and unnecessary mail.
[0070]
As a result, (1) the keyword vector similarity calculation unit 1j calculates the similarity using only characteristic elements, so the similarity value becomes more distinctive, the determination condition 43 used by the determination unit 41e becomes easier to set, and the setup burden is reduced; and (2) because the similarity calculation is performed only on the keyword vector (Sb) needed for the determination, the number of operations in the similarity calculation of the keyword vector similarity calculation unit 1j is reduced.
[0071]
FIG. 5 is a block diagram showing a fourth configuration example of the document filtering system according to the present invention.
[0072]
The mail filtering device 51 shown in FIG. 5, which serves as the mail filtering system of the present invention, also forms part of the mail distribution service system shown in FIG. 2, like the mail filtering devices 1, 31, and 41 in FIGS. 1, 3, and 4. It includes, as execution functions based on a computer program, a keyword vector generation block 10 (consisting of the receiving unit 1a, the morphological analysis unit 1b, and the keyword vector generation unit 1c, not shown), an arithmetic processing block 11c consisting of a keyword vector filtering unit 1i, a keyword vector similarity calculation unit 1j, a keyword vector filtering unit 1k, and a keyword vector similarity calculation unit 1m, a determination unit 51e, a necessary mail processing unit 1f, and an unnecessary mail processing unit 1g.
[0073]
In such a configuration, the mail filtering device 51 receives the mail, performs morphological analysis of the received mail, and generates the input keyword vector S in the keyword vector generation block 10.
[0074]
A feature of the mail filtering device 51 of this example is that a keyword vector filtering unit 1k and a keyword vector similarity calculation unit 1m are added to the arithmetic processing block 11b of the mail filtering device 41 shown in FIG. 4. The keyword vector filtering unit 1k removes from the keyword vector S the components of the keyword vector B defined in advance in the keyword vector dictionary B, and the keyword vector similarity calculation unit 1m calculates the similarity between the keyword vector output from the keyword vector filtering unit 1k and the keyword vector W defined in advance in the keyword vector dictionary W. The operation of the arithmetic processing block 11c having this configuration is described in detail below.
[0075]
The keyword vector filtering unit 1i generates a keyword vector Sb (= S − (S∩W)) by removing, from the keyword vector S of the received mail generated in the keyword vector generation block 10, the product set (S∩W) of the keyword vector S and the keyword vector W stored in the keyword vector dictionary W55.
[0076]
The keyword vector similarity calculation unit 1j calculates a similarity p1 between the keyword vector Sb generated by the keyword vector filtering unit 1i and the keyword vector B acquired from the keyword vector dictionary B.
[0077]
For example, when the example of the keyword vector shown in the description of FIG. 1 is used, “S∩W” is as follows.
[0078]
(Keyword) (weight)
From "1"
Okakoko "1"
＠ "1"
jp "1"
Hello "1"
Contact "1"
Fine "1"
? "1"
Is "1"
[0079]
“S− (S∩W)” is as follows.
[0080]
(Keyword) (weight)
"1"
Or "1"
Today "1"
Another "1"
Leaves "1"
[0081]
This becomes the keyword vector Sb. When the similarity p1 between the keyword vector Sb and the keyword vector B is calculated in the same manner as described with reference to FIG. 1, "p1 = (Sb · B) / (number of keywords) = 0/17 = 0.000" is obtained.
[0082]
Also, the keyword vector filtering unit 1k generates a keyword vector Sw (= S − (S∩B)) by removing, from the keyword vector S of the received mail generated by the keyword vector generation block 10, the product set (S∩B) of the keyword vector S and the keyword vector B stored in the keyword vector dictionary B54, and the keyword vector similarity calculation unit 1m calculates the similarity p2 between the keyword vector Sw generated by the keyword vector filtering unit 1k and the keyword vector W acquired from the keyword vector dictionary W.
[0083]
Thus, for example, when the example of the keyword vector shown in the description of FIG. 1 is used, the intersection set “S∩B” of the keyword vector S and the keyword vector B is as follows.
[0084]
(Keyword) (weight)
From "1"
＠ "1"
Contact "1"
Is "1"
[0085]
“Sw = S− (S∩B)” is as follows.
[0086]
(Keyword) (weight)
Okakoko "1"
jp "1"
Hello "1"
Fine "1"
"1"
Or "1"
? "1"
Today "1"
Another "1"
Leaves "1"
[0087]
When the similarity p2 between the keyword vector Sw and the keyword vector W is calculated in the same manner as described with reference to FIG. 1, "p2 = (Sw · W) / (number of keywords) = 5/17 = 0.294" is obtained.
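The two-sided filtering of FIG. 5 can be sketched in the same style. Two assumptions are needed to reproduce the stated numbers: W is a hypothetical dictionary containing only the nine keywords of the S∩W table in [0078], and B omits the `jp` entry of the table in [0034], because the intersection listed in [0084] has only four keywords and Sw in [0086] still contains `jp`. `"(blank)"` is a placeholder for the keyword that is empty in the source tables.

```python
def remove_common(s, ref):
    """Return S - (S ∩ ref): the entries of s whose keyword is not in ref."""
    return {kw: w for kw, w in s.items() if kw not in ref}


def inner_product(u, v):
    """Sum of weight products over the keywords shared by u and v."""
    return sum(w * v[kw] for kw, w in u.items() if kw in v)


S = dict.fromkeys(
    ["From", "Okakoko", "@", "jp", "Hello", "Contact", "Fine", "(blank)",
     "Or", "?", "Today", "Is", "Another", "Leaves"], 1)

# Hypothetical W: only the keywords shown in the S∩W table are known.
W = dict.fromkeys(
    ["From", "Okakoko", "@", "jp", "Hello", "Contact", "Fine", "?", "Is"], 1)

# Assumption: "jp" is omitted from B so that S∩B matches the table in [0084].
B = dict.fromkeys(
    ["From", "Spam", "@", "net", "Not", "Consent", "Sales", "Limited",
     "Adult", "Cheap", "Must see", "Contact", "Is", "!"], 1)

NUM_KEYWORDS = 17

Sb = remove_common(S, W)                  # filtering unit 1i
Sw = remove_common(S, B)                  # filtering unit 1k
p1 = inner_product(Sb, B) / NUM_KEYWORDS  # similarity calculation unit 1j
p2 = inner_product(Sw, W) / NUM_KEYWORDS  # similarity calculation unit 1m
```

Under these assumptions Sw keeps ten keywords, five of which (Okakoko, jp, Hello, Fine, ?) also occur in W, which gives the stated p2 of 5/17.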
[0088]
Then, the discrimination unit 51e compares the similarity p1 and the similarity p2 according to the discrimination condition 53, discriminates whether it is unnecessary mail or necessary mail, and selects the mail.
[0089]
As described above, when the mail filtering device 51 of this example sorts an input mail document, the keyword vector filtering unit 1i reads out the reference keyword vector W stored in advance in the storage device as the keyword vector dictionary W55, performs a product set operation (S∩W) on the reference keyword vector W and the input keyword vector S generated by the keyword vector generation block 10, and generates a keyword vector Sb (= S − (S∩W)) by removing the result of the product set operation from the input keyword vector S. The keyword vector similarity calculation unit 1j reads out the reference keyword vector B stored in advance in the storage device as the keyword vector dictionary B54 and calculates the similarity p1 between the reference keyword vector B and the keyword vector Sb generated by the keyword vector filtering unit 1i. Further, the keyword vector filtering unit 1k reads out the reference keyword vector B stored in advance in the storage device as the keyword vector dictionary B54, performs a product set operation (S∩B) on the reference keyword vector B and the input keyword vector S, and generates a keyword vector Sw (= S − (S∩B)) by removing the result of the product set operation from the input keyword vector S. The keyword vector similarity calculation unit 1m reads out the reference keyword vector W stored in advance in the storage device as the keyword vector dictionary W55 and calculates the similarity p2 between the reference keyword vector W and the keyword vector Sw generated by the keyword vector filtering unit 1k. The determination unit 51e then determines whether the mail document is necessary based on the similarity p1 and the similarity p2 and sorts it.
[0090]
Further, as an example, assume that the reference keyword vector B represents the features of documents that are not needed and the reference keyword vector W represents the features of documents that are needed. The determination unit 51e then sorts the mail document as an unnecessary document if the similarity p1 is larger than a predetermined condition value T1 and the similarity p2 is smaller than a predetermined condition value T2, and sorts it as a necessary document if the similarity p1 is smaller than the condition value T1 and the similarity p2 is larger than the condition value T2.
[0091]
As described above, in this example the keyword vector filtering unit 1i removes the components of the keyword vector W, generated from necessary mail, from the keyword vector S generated from the received mail, and the keyword vector filtering unit 1k removes the components of the keyword vector B, generated from unnecessary mail, from the keyword vector S generated from the received mail, so that the keyword set included in both necessary mail and unnecessary mail is excluded.
[0092]
As a result, by determining whether the mail is unnecessary or necessary using the similarity p1 between the keyword vector Sb and the keyword vector B and the similarity p2 between the keyword vector Sw and the keyword vector W, the evaluation can exclude the keyword set included in both necessary mail and unnecessary mail.
[0093]
As a result, (1) the keyword vector similarity calculation units 1j and 1m calculate the similarity using only characteristic elements, so the similarity values become more distinctive, the determination condition 53 used by the determination unit 51e becomes easier to set, and the setup burden is reduced; and (2) because the similarity calculation is performed only on the keyword vectors (Sb, Sw) needed for the determination, the number of operations in the similarity calculation of the keyword vector similarity calculation units 1j and 1m is reduced.
[0094]
FIG. 6 is a block diagram showing a fifth configuration example of the document filtering system according to the present invention.
[0095]
The mail filtering device 61 shown in FIG. 6, as the mail filtering system of the present invention, also constitutes the mail distribution service system shown in FIG. 2, like the mail filtering devices 1 and 31 to 51 of the preceding configuration examples. As functions executed on the basis of a computer program, it has a keyword vector generation block 10 (comprising a receiving unit 1a, a morphological analysis unit 1b, and a keyword vector generation unit 1c, not shown), an arithmetic processing block 11d composed of the processing units shown in the preceding configuration examples, determination units 61e and 61ee, a necessary mail processing unit 1f, and an unnecessary mail processing unit 1g.
[0096]
In such a configuration, the mail filtering device 61 receives mail, morphologically analyzes the received mail, and generates the input keyword vector S in the keyword vector generation block 10; calculates the similarities p1 and p2 between the input keyword vector S and the various reference keyword vectors in the arithmetic processing block 11d; and, in the determination unit 61e, determines and sorts the input mail document as necessary or unnecessary based on the similarities p1 and p2 according to the determination condition 63a.
[0097]
The feature of the mail filtering device 61 of the present example is that the discrimination of the input mail document in the discriminating unit 61e also provides for a discrimination result of "other", meaning that the document cannot be selected as either necessary mail or unnecessary mail.
[0098]
That is, the present example applies, for example, when the determination units 1e, 31e, 41e, and 51e of the mail filtering devices 1, 31, 41, and 51 reach the result "other", and it has a mechanism for causing the arithmetic processing block 61d to repeat the process of calculating the similarity.
[0099]
In the example of FIG. 6, the similarity p1 and the similarity p2 are output from the arithmetic processing block 11d. However, when the configuration is applied to the mail filtering devices 1 and 41 shown in FIGS. 1 and 4, only the similarity p1 is output, and when it is applied to the mail filtering devices 31 and 51 shown in FIGS. 3 and 5, the similarity p1 and the similarity p2 are output as in the present example.
[0100]
In the following, an example in which the arithmetic processing block has a two-stage configuration as shown in FIG. 6 will be described as applied to the mail filtering device 1.
[0101]
In the first-stage arithmetic processing block 11d, for example, the similarity of the received mail is calculated based on the keyword vector B1 generated from personally unnecessary mail, and the determination unit 61e determines the necessity. In the second-stage arithmetic processing block 61d, the similarity of the received mail is calculated based on the keyword vector B2 generated from mail that is generally unnecessary, and the determination unit 61ee determines the necessity.
[0102]
First, suppose that, for the similarity p1 between the keyword vector S and the keyword vector B1 generated from personally unnecessary mail, calculated in the first-stage arithmetic processing block 11d, the determination unit 61e reaches the result "other" according to the determination condition 63a.
[0103]
In this case, processing continues to the second-stage arithmetic processing block 61d, which receives the keyword vector S and performs the same processing as the first stage.
[0104]
In this example, since the second stage is the final stage, the final stage discriminating unit 61ee determines either “unnecessary mail” or “necessary mail”.
[0105]
If the determination unit 61e following the first-stage arithmetic processing block 11d determines "unnecessary mail" or "necessary mail", processing does not continue to the second-stage arithmetic processing block 61d and instead proceeds immediately to the necessary mail processing unit 1f or the unnecessary mail processing unit 1g.
[0106]
As described above, in this example, for a mail document that the first-stage determination unit 61e cannot determine to be either an unnecessary mail document or a necessary mail document, the similarity calculation based on the keyword vector and the selection based on that similarity are repeated, so that the mail document can be selected as either an unnecessary mail document or a necessary mail document.
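The repetition described above can be sketched as a small cascade. The following is an assumption-laden illustration: the cosine similarity, the `low`/`high` thresholds standing in for the determination conditions 63a and 63b, and all function names are hypothetical. Setting `low == high` in the last stage guarantees that it never returns "other".

```python
import math
from typing import Dict, List, Tuple

KeywordVector = Dict[str, float]


def cosine(a: KeywordVector, b: KeywordVector) -> float:
    """Cosine similarity, used here as an assumed similarity measure."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def judge(p: float, low: float, high: float) -> str:
    """Hypothetical determination condition for an 'unnecessary mail' reference vector."""
    if p >= high:
        return "unnecessary"
    if p < low:
        return "necessary"
    return "other"


def cascade(s: KeywordVector, stages: List[Tuple[KeywordVector, float, float]]) -> str:
    """Run the stages in order and stop at the first result that is not 'other'."""
    result = "other"
    for reference, low, high in stages:
        result = judge(cosine(s, reference), low, high)
        if result != "other":
            break
    return result


B1 = {"lottery": 1.0}               # personally unnecessary mail (illustrative)
B2 = {"free": 1.0, "offer": 1.0}    # generally unnecessary mail (illustrative)
stages = [(B1, 0.2, 0.8), (B2, 0.5, 0.5)]  # final stage cannot answer "other"

print(cascade({"free": 1.0, "offer": 1.0, "lottery": 1.0}, stages))
```

In the call shown, the first stage yields "other" and the decision is therefore taken by the second stage.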
[0107]
In the repetition of the calculation of the similarity, the reference keyword vector used for calculating the similarity can be arbitrarily replaced.
[0108]
For example, the first-stage arithmetic processing block 11d and the determination unit 61e determine the received mail based on the keyword vector B1 generated from personally unnecessary mail. If the first stage cannot determine whether the received mail is unnecessary, the second-stage arithmetic processing block 61d and the determination unit 61ee determine the received mail based on the keyword vector B2 of mail generally regarded as unnecessary on the net. In this way, the necessity of mail can be determined with higher accuracy.
[0109]
Further, the first-stage arithmetic processing block 11d and the determination unit 61e may determine the received mail based on the keyword vector W1 generated from personally necessary mail. If the first stage cannot determine whether the mail is necessary, the second-stage arithmetic processing block 61d and the determination unit 61ee determine the received mail based on the keyword vector W2 generated from mail generally regarded as necessary on the net. In this way, the necessity of mail can be determined with higher accuracy.
[0110]
Alternatively, the first-stage arithmetic processing block 11d and the determination unit 61e may determine the received mail using the keyword vector W1 generated from personally necessary mail. If the first stage cannot determine whether the received mail is necessary, the second-stage arithmetic processing block 61d and the determination unit 61ee determine the received mail based on the keyword vector B2 generated from generally unnecessary mail. In this way, the necessity of mail can be determined with higher accuracy.
[0111]
As described above, in this example, the contents of the reference keyword vector used for calculating the similarity can be combined according to the purpose, and highly accurate discrimination can be performed.
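One way to express such combinations is to record, for each stage, which class its reference keyword vector characterizes. The sketch below is hypothetical: the `StageRule` type, the threshold values, and the convention that a high similarity selects the class the reference vector represents are assumptions made for illustration, and the similarities are passed in precomputed.

```python
from dataclasses import dataclass
from typing import List

OPPOSITE = {"unnecessary": "necessary", "necessary": "unnecessary"}


@dataclass(frozen=True)
class StageRule:
    name: str        # label of the reference keyword vector, e.g. "W1" or "B2"
    represents: str  # class the reference vector characterizes
    low: float       # below this similarity, select the opposite class
    high: float      # at or above this similarity, select the represented class


def judge(p: float, rule: StageRule) -> str:
    if p >= rule.high:
        return rule.represents
    if p < rule.low:
        return OPPOSITE[rule.represents]
    return "other"


def decide(similarities: List[float], rules: List[StageRule]) -> str:
    """Walk the stages in order; stop at the first stage that does not answer 'other'."""
    for p, rule in zip(similarities, rules):
        result = judge(p, rule)
        if result != "other":
            return result
    return "other"


# First stage: W1 (personally necessary); second stage: B2 (generally unnecessary).
rules = [
    StageRule("W1", "necessary", low=0.0, high=0.8),    # only ever says "necessary" or "other"
    StageRule("B2", "unnecessary", low=0.5, high=0.5),  # final stage always decides
]

print(decide([0.9, 0.0], rules))
print(decide([0.1, 0.7], rules))
```

Swapping the entries of `rules` gives the other combinations mentioned above (B1 then B2, or W1 then W2) without changing the procedure.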
[0112]
As described above with reference to FIGS. 1 to 6, in the mail filtering apparatus of the present embodiment, electronic mail is the document to be sorted. Whether a mail is unnecessary or necessary is determined using the similarity between the keyword vector S generated from the received mail and, for example, the keyword vector B generated by morphologically analyzing and weighting the text of unnecessary mail, or the similarity with the keyword vector W generated by morphologically analyzing and weighting the text of necessary mail. As a result, even if the text of an unnecessary mail is previously unknown, whether it is unnecessary mail can be determined from the similarity of the keyword vectors.
[0113]
For example, in the mail filtering device 1 shown in FIG. 1, the text of the received mail (including header information and signature) is morphologically analyzed to generate the keyword vector S of the received mail. The similarity p1 between the keyword vector S and the keyword vector B, acquired from the keyword vector dictionary B that the keyword vector dictionary generation device (5) generated from unnecessary mail, is then calculated, and the mail is judged to be necessary or unnecessary from the magnitude of the similarity p1 and a pre-registered determination condition. This makes it possible to determine that a mail having the same characteristics as unnecessary mail is itself unnecessary.
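The flow of FIG. 1 can be sketched end to end as follows. Several stand-ins are assumptions: a regular-expression tokenizer replaces real morphological analysis, raw term counts replace the weighting, cosine similarity is used as the similarity, and the determination condition is a single threshold. All names and values are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict

KeywordVector = Dict[str, float]


def keyword_vector(text: str) -> KeywordVector:
    """Stand-in for morphological analysis plus weighting: lowercase word counts."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {k: float(c) for k, c in Counter(tokens).items()}


def cosine(a: KeywordVector, b: KeywordVector) -> float:
    """Cosine similarity, used here as an assumed similarity measure."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_unnecessary(text: str, b: KeywordVector, threshold: float) -> bool:
    """Judge a mail as unnecessary when p1 exceeds the (assumed) condition value."""
    p1 = cosine(keyword_vector(text), b)
    return p1 > threshold


# Illustrative reference vector built from unnecessary mail.
B = {"free": 2.0, "prize": 1.0, "click": 1.0}

print(is_unnecessary("Click now to claim your FREE prize", B, 0.5))
print(is_unnecessary("Agenda for the project meeting", B, 0.5))
```

For Japanese mail text, the tokenizer would need to be replaced by an actual morphological analyzer; the rest of the flow is unaffected by that choice.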
[0114]
In the mail filtering device 31 shown in FIG. 3, two similarities are calculated: the similarity p1 between the keyword vector S of the received mail and the keyword vector B acquired from the keyword vector dictionary B, which the keyword vector dictionary generation device (5) generated from unnecessary mail, and the similarity p2 between the keyword vector S and the keyword vector W acquired from the keyword vector dictionary W, which the same device generated from necessary mail. Whether the mail is necessary or unnecessary is determined from the magnitudes of the similarities p1 and p2 and the determination condition. The configuration of FIG. 3 thus solves the problem of the example of FIG. 1.
[0115]
That is, in the mail filtering device 1 configured as shown in FIG. 1, the determination unit 1e determines whether a mail is unnecessary or necessary based only on the similarity p1 between the keyword vector S and the keyword vector B generated from unnecessary mail. Consequently, even if the similarity between the keyword vector S and the keyword vector W generated from necessary mail is larger, the mail is treated as unnecessary whenever the similarity p1 satisfies the condition for unnecessary mail. In the mail filtering device 31 configured as shown in FIG. 3, this problem is solved because the determination unit 31e distinguishes unnecessary mail from necessary mail based on the relationship between the two values p1 and p2.
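A small worked example may make the difference concrete. The vectors, the condition values T1 and T2, and the use of cosine similarity are all illustrative assumptions; the two boolean expressions correspond to a p1-only condition and to a condition that also requires p2 to be small.

```python
import math
from typing import Dict

KeywordVector = Dict[str, float]


def cosine(a: KeywordVector, b: KeywordVector) -> float:
    """Cosine similarity, used here as an assumed similarity measure."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# A received mail that mentions "free" once but otherwise resembles necessary mail.
S = {"free": 1.0, "meeting": 1.0, "agenda": 1.0, "project": 1.0}
B = {"free": 1.0}                                         # unnecessary mail
W = {"meeting": 1.0, "agenda": 1.0, "project": 1.0}       # necessary mail
T1, T2 = 0.4, 0.4                                         # assumed condition values

p1 = cosine(S, B)
p2 = cosine(S, W)

single_vector_says_unnecessary = p1 > T1                  # p1-only condition
two_vector_says_unnecessary = p1 > T1 and p2 < T2         # condition using p1 and p2

print(p1, p2, single_vector_says_unnecessary, two_vector_says_unnecessary)
```

Here p2 is larger than p1, yet the p1-only condition still labels the mail unnecessary, while the two-value condition does not.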
[0116]
Further, the mail filtering device 41 shown in FIG. 4 is obtained by adding a keyword vector filtering unit to the mail filtering device 1 shown in FIG. 1. Using the keyword vector Sb, obtained by excluding from the keyword vector S of the received mail its intersection (S∩W) with the keyword vector W acquired from a keyword vector dictionary (generated, for example, from necessary mail), the similarity p1 between the keyword vector Sb and the keyword vector B is calculated, and whether the mail is necessary or unnecessary is determined according to the magnitude of the similarity p1 and the determination condition.
[0117]
As described above, in the mail filtering device 41 shown in FIG. 4, the keyword vector Sb generated by the keyword vector filtering unit 1i is obtained by removing the keyword vector W component generated from the necessary mail from the keyword vector S generated from the received mail. Therefore, the keyword set included in both the necessary mail and the unnecessary mail can be excluded and evaluated. Thereby, (1) the similarity is calculated using only the characteristic elements, so that the value of the similarity becomes more characteristic, and it is possible to reduce the difficulty of setting the determination condition of the determination unit, and (2) By performing the similarity calculation using only the keyword vector necessary for the determination, the number of processes of the similarity calculation can be reduced.
[0118]
In addition, the mail filtering device 51 shown in FIG. 5 is obtained by adding keyword vector filtering units to the mail filtering device 31 shown in FIG. 3. Using the keyword vector Sb, obtained by excluding from the keyword vector S of the received mail its intersection (S∩W) with the keyword vector W acquired from the keyword vector dictionary (generated from necessary mail), the similarity p1 between the keyword vector Sb and the keyword vector B is calculated. Likewise, using the keyword vector Sw, obtained by excluding from the keyword vector S of the received mail its product set (S∩B) with the keyword vector B acquired from the keyword vector dictionary (generated from unnecessary mail), the similarity p2 between the keyword vector Sw and the keyword vector W acquired from the keyword vector dictionary (generated from necessary mail) is calculated. These similarities p1 and p2 are compared to determine whether the mail is necessary or unnecessary.
[0119]
According to this example, the keyword vector Sb generated by the added first keyword vector filtering unit is obtained by removing the keyword vector W component generated from necessary mail from the keyword vector S generated from the received mail, and the keyword vector Sw generated by the second keyword vector filtering unit is obtained by removing the keyword vector B component generated from unnecessary mail from the keyword vector S. Because the determination uses the similarity p1 between the keyword vector Sb and the keyword vector B and the similarity p2 between the keyword vector Sw and the keyword vector W, the keyword set included in both necessary mail and unnecessary mail can be excluded from the evaluation.
[0120]
Thereby, in this example, since (1) similarity is calculated only with characteristic elements, the value of similarity becomes more characteristic, and it is possible to reduce the difficulty of setting the determination conditions of the determination unit. (2) By performing the similarity calculation using only the keyword vector necessary for the determination, the number of processes of the similarity calculation can be reduced.
[0121]
In the example of the mail filtering device 61 shown in FIG. 6, for mail that the determination units of the mail filtering devices 1 and 31 to 51 determine to be "other", the processing of the arithmetic processing block and the discriminating unit is repeated in succession. In the final-stage determination unit, the received mail is determined to be necessary or unnecessary according to the magnitudes of the similarity p1 and the similarity p2, or of the similarity p1 alone, with reference to a determination condition that describes the similarity magnitude conditions for selecting necessary mail or unnecessary mail.
[0122]
For example, when the present invention is applied to the mail filtering device 1 of FIG. 1 and the arithmetic processing block and the discriminating unit are provided in two stages, the similarity is calculated in the first-stage arithmetic processing block, and if the discriminating unit determines "other", processing proceeds to the second-stage arithmetic processing block and determination unit. This enables highly accurate determination, for example by determining in the first stage with the keyword vector B1 indicating personally "unnecessary mail" and in the second stage with the keyword vector B2 indicating generally unnecessary mail.
[0123]
Similarly, it is also possible to perform a processing operation such as determining in the first stage with the keyword vector W indicating "necessary mail" and determining in the second stage with the keyword vector B indicating "unnecessary mail".
[0124]
It should be noted that the present invention is not limited to the examples described with reference to the drawings. For example, each communication among the mail filtering device, the keyword vector generation device, the mail server device, the mail client device, and the like illustrated in FIG. 2 may be communication via a network such as the Internet or a LAN, or may be local communication.
[0125]
For example, in the example of FIG. 2, the mail filtering device refers to the keyword vector generated by the keyword vector generation device via a network. However, the keyword vector generated by the keyword vector generation device may instead be stored in the mail filtering device in advance, or the mail filtering device and the keyword vector generation device may be provided together in a single device.
[0126]
In the example of FIG. 5, the keyword filtering unit 1i and the keyword vector similarity calculating unit 1m refer to the same keyword vector dictionary W55, and the keyword filtering unit 1k and the keyword vector similarity calculating unit 1j refer to the same keyword vector dictionary B54. However, the keyword filtering unit 1i and the keyword vector similarity calculating unit 1m may be configured to refer to different keyword vector dictionaries (W, Wa), and the keyword filtering unit 1k and the keyword vector similarity calculating unit 1j may be configured to refer to different keyword vector dictionaries (B, Ba).
[0127]
Further, in the present example, an e-mail is described as an example of the target of the necessity / unnecessity determination, but text data composed of a character string may be used, and the present invention is not limited to the e-mail.
[0128]
Further, as an example of a computer configuration in this example, a computer configuration without a keyboard or a drive device for an optical disk may be used. In this example, the optical disk is used as the recording medium, but an FD (Flexible Disk) or the like may be used as the recording medium. As for the installation of the program, the program may be downloaded and installed via a network via a communication device.
[0129]
[Effect of the Invention]
According to the present invention, for example in the case of electronic mail, it is possible to determine the necessity of unknown mail that cannot be determined by specifying mail conditions. Unnecessary mail can be identified even when unnecessary mail with similar contents is sent to an unspecified large number of people under a falsified sender. Furthermore, even when such mail is written using words similar to those of necessary mail, the possibility of misclassification when judging by the feature vector of the mail body can be reduced by using keyword vectors of mail that others have determined to be unnecessary. Document selection, such as discrimination between unnecessary mail and necessary mail, can thus be performed with high accuracy.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a first configuration example of a document filtering system according to the present invention.
FIG. 2 is a block diagram showing a configuration example of a mail distribution service system using the document filtering system in FIG. 1.
FIG. 3 is a block diagram showing a second configuration example of the document filtering system according to the present invention.
FIG. 4 is a block diagram showing a third configuration example of the document filtering system according to the present invention.
FIG. 5 is a block diagram showing a fourth configuration example of the document filtering system according to the present invention.
FIG. 6 is a block diagram showing a fifth configuration example of the document filtering system according to the present invention.
[Explanation of symbols]
1, 31, 41, 51, 61: mail filtering device, 1a: receiving unit, 1b: morphological analysis unit, 1c: keyword vector generation unit, 1d, 1h, 1j, 1m: keyword vector similarity calculation unit, 1e, 31e, 41e, 51e, 61e, 61ee: discriminating unit, 1f: necessary mail processing unit, 1g: unnecessary mail processing unit, 1i, 1k: keyword vector filtering unit, 2: mail server device ("mail server"), 3: mail client device ("mail client"), 4: user, 5: keyword vector generation device, 6a to 6d: IP network, 10: keyword vector generation block, 11, 11a, 11b, 11c, 11d, 61d: arithmetic processing block, 12: mail, 13, 33, 43, 53, 63a, 63b: determination condition, 1 , 34, 44, and 54: keyword vector dictionary B, 35, 45, 55: keyword vector dictionary W.

Claims

A document filtering system for sorting input documents,
A morphological analysis means for morphologically analyzing the input document into a set of keywords,
Keyword vector generation means for generating an input keyword vector S representing the characteristics of the document from the set of keywords,
Keyword vector similarity calculating means for reading a reference keyword vector B stored in advance in a storage device and calculating a similarity p1 between the reference keyword vector B and the input keyword vector S;
A document filtering system comprising: a determination unit for selecting the document based on the similarity p1.

The document filtering system according to claim 1, wherein
The reference keyword vector B represents a feature of a document that is not required,
If the similarity p1 is larger than a predetermined condition value, the determination unit selects the document as an unnecessary document.

A document filtering system for sorting input documents,
A morphological analysis means for morphologically analyzing the input document into a set of keywords,
Keyword vector generation means for generating an input keyword vector S representing the characteristics of the document from the set of keywords,
First keyword vector similarity calculating means for reading a reference keyword vector B previously stored in a storage device and calculating a similarity p1 between the reference keyword vector B and the input keyword vector S;
A second keyword vector similarity calculating means for reading a reference keyword vector W stored in a storage device in advance and calculating a similarity p2 between the reference keyword vector W and the input keyword vector S;
A document filtering system comprising: a discriminating unit for selecting the document based on the similarity p1 and the similarity p2.

The document filtering system according to claim 3, wherein
The reference keyword vector B represents a feature of a document that is not required,
The reference keyword vector W represents a required document feature,
If the similarity p1 is larger than a predetermined condition value T1 and the similarity p2 is smaller than a predetermined condition value T2, the discriminating unit selects the document as an unnecessary document, and if the similarity p1 is smaller than the condition value T1 and the similarity p2 is larger than the condition value T2, the discriminating unit selects the document as a necessary document.

A document filtering system for sorting input documents,
A morphological analysis means for morphologically analyzing the input document into a set of keywords,
Keyword vector generation means for generating an input keyword vector S representing the characteristics of the document from the set of keywords,
Keyword vector filtering means for reading a reference keyword vector W stored in advance in a storage device, performing a product set operation (S∩W) of the reference keyword vector W and the input keyword vector S, and generating a keyword vector Sb (= S−S∩W) by removing the result of the product set operation (S∩W) from the input keyword vector S;
Keyword vector similarity calculating means for reading a reference keyword vector B stored in advance in a storage device and calculating a similarity p1 between the reference keyword vector B and the keyword vector Sb;
A document filtering system comprising: a determination unit for selecting the document based on the similarity p1.

The document filtering system according to claim 5, wherein
The reference keyword vector B represents a feature of a document that is not required,
The reference keyword vector W represents a required document feature,
If the similarity p1 is larger than a predetermined condition value, the determination unit selects the document as an unnecessary document.

A document filtering system for sorting input documents,
A morphological analysis means for morphologically analyzing the input document into a set of keywords,
Keyword vector generation means for generating an input keyword vector S representing the characteristics of the document from the set of keywords,
First keyword vector filtering means for reading a reference keyword vector W stored in advance in a storage device, performing a product set operation (S∩W) of the reference keyword vector W and the input keyword vector S, and generating a keyword vector Sb (= S−S∩W) by removing the result of the product set operation (S∩W) from the input keyword vector S;
Second keyword vector filtering means for reading a reference keyword vector B stored in advance in a storage device, performing a product set operation (S∩B) of the reference keyword vector B and the input keyword vector S, and generating a keyword vector Sw (= S−S∩B) by removing the result of the product set operation (S∩B) from the input keyword vector S;
First keyword vector similarity calculating means for reading a reference keyword vector Ba previously stored in a storage device and calculating a similarity p1 between the reference keyword vector Ba and the keyword vector Sb;
A second keyword vector similarity calculating means for reading a reference keyword vector Wa stored in advance in a storage device and calculating a similarity p2 between the reference keyword vector Wa and the keyword vector Sw;
A document filtering system comprising: a determination unit configured to select the document based on the similarity p1 and the similarity p2.

The document filtering system according to claim 7, wherein:
The above-mentioned reference keyword vectors B and Ba represent the characteristics of a document that is not required,
The reference keyword vectors W and Wa represent the required document features,
If the similarity p1 is larger than a predetermined condition value T1 and the similarity p2 is smaller than a predetermined condition value T2, the discriminating unit selects the document as an unnecessary document, and if the similarity p1 is smaller than the condition value T1 and the similarity p2 is larger than the condition value T2, the discriminating unit selects the document as a necessary document.

A document filtering system according to claim 7 or claim 8, wherein:
A document filtering system, wherein the reference keyword vector B and the reference keyword vector Ba are the same, and the reference keyword vector W and the reference keyword vector Wa are the same.

The document filtering system according to any one of claims 1 to 9, wherein
For a document that cannot be determined as an unnecessary document or a necessary document by the determination unit, the calculation of the similarity based on the keyword vector and the selection based on the similarity are repeated to determine whether the document is an unnecessary document or a required document. A document filtering system comprising means for selecting one of them.

The document filtering system according to claim 10, wherein
A document filtering system characterized by arbitrarily replacing a reference keyword vector used for calculation in repeating the calculation of similarity.

The document filtering system according to any one of claims 1 to 11, wherein the input document comprises an electronic mail, and the electronic mail is selected.

A program for causing a computer to function as each unit in the document filtering system according to claim 1.