JP5366204B2

JP5366204B2 - Mail filtering system, computer program thereof, and information generation method

Info

Publication number: JP5366204B2
Application number: JP2009168408A
Authority: JP
Inventors: Toshikazu Wada (和田俊和)
Original assignee: Wakayama University (国立大学法人和歌山大学)
Priority date: 2009-07-17
Filing date: 2009-07-17
Publication date: 2013-12-11
Anticipated expiration: 2029-07-17
Also published as: JP2011022876A

Abstract

PROBLEM TO BE SOLVED: To provide information that can be used to judge the appropriateness of a delivery method for advertisement-type mail.

SOLUTION: A mail filtering system includes a classification database 13 comprising an advertisement-type mail sender database 13e, in which a set of IP addresses of advertisement-type mail senders is registered, and non-advertisement-type mail sender databases 13a to 13d, in which sets of IP addresses of non-advertisement-type mail senders are registered. A nearest neighbor search for the sender IP address found in the mail header is performed on the advertisement-type mail sender database and on the non-advertisement-type mail sender databases, and a degree of discrimination indicating the likelihood that the mail is advertisement-type mail is obtained from the nearest neighbor addresses returned by the search. The set of sender IP addresses of advertisement-type mail is output as information that can be used to judge the appropriateness of the delivery method for advertisement-type mail.

COPYRIGHT: (C)2011,JPO&INPIT

Description

The present invention relates to a mail filtering system, a computer program therefor, and an information generation method.

In recent years, spam (unsolicited mail) has become a problem that hinders the use of electronic mail. Spam sent in bulk to large numbers of unspecified recipients increases mail traffic, spreads virus-laden mail, and even causes economic losses. A mail filtering system that can detect spam appropriately is therefore in demand.

Mail filtering systems are generally of the content analysis type, which detects spam by analyzing the text (content) of a message, such as its body and subject.

Spam in the broad sense includes two kinds of mail: mail that most people find annoying because, for example, it contains harmful information or viruses (spam in the narrow sense), and mail that ordinary companies send for advertising, which is of no use to people who do not need the information but is useful to those who do (advertising mail, mail magazines, and the like).

The latter kind of mail (hereinafter "advertisement-type mail") is sent as part of the legitimate advertising activities of companies and can be considered separately from spam in the narrow sense. Until now, however, there has been no idea of filtering the two kinds of mail as distinct categories.

The present inventor has developed and operates a mail filtering system called NNIPF (Nearest Neighbor IP address based mail Filter) (see Non-Patent Document 1).
In NNIPF, the MTA (Mail Transfer Agent) that a user uses to receive mail from outside the organization is registered in the system in advance. NNIPF also holds, registered in advance, a set of IP addresses of normal senders and a set of IP addresses of spam senders.

From the Received line that the user-registered MTA (MTAOB; MTA On the Border) generates and adds to the mail header when it receives a message from outside, NNIPF determines the IP address of the mail sender.

NNIPF then performs a nearest neighbor search for the determined IP address against both the pre-registered set of normal-sender IP addresses and the pre-registered set of spam-sender IP addresses. This yields, from the normal-sender set, the IP address whose value is closest to the determined IP address (the nearest neighbor address), and likewise the nearest neighbor address from the spam-sender set.

NNIPF further obtains the distance (in the address space) between the determined IP address and each of the two nearest neighbor addresses, that of the normal senders and that of the spam senders, and from these distances computes a "degree of discrimination" (a value from 0 to 1) indicating how likely the mail is to be spam. A degree of discrimination of 0 means the mail is almost certainly normal; a value of 1 means it is almost certainly spam. For intermediate values, the closer the value is to 1, the more likely the mail is spam.
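The distance-based scoring described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the NNIPF implementation: the text only says that the degree of discrimination lies between 0 and 1 and is derived from the two nearest neighbor distances, so the ratio used here, the function names, and the use of the 32-bit integer value of an IPv4 address as the coordinate in address space are all hypothetical choices.

```python
import bisect
import ipaddress
from typing import Sequence


def ip_value(addr: str) -> int:
    """Map a dotted-quad IPv4 address to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(addr))


def nearest_distance(sorted_values: Sequence[int], query: int) -> int:
    """Distance from query to the closest value in a sorted, non-empty list."""
    i = bisect.bisect_left(sorted_values, query)
    # Only the elements just below and just above the insertion point can be nearest.
    candidates = sorted_values[max(i - 1, 0): i + 1]
    return min(abs(v - query) for v in candidates)


def discrimination(query: str, normal: Sequence[str], spam: Sequence[str]) -> float:
    """Assumed formula: 0.0 when the query sits on a normal sender, 1.0 on a spam sender."""
    q = ip_value(query)
    d_normal = nearest_distance(sorted(ip_value(a) for a in normal), q)
    d_spam = nearest_distance(sorted(ip_value(a) for a in spam), q)
    if d_normal + d_spam == 0:
        return 0.5  # address registered in both sets: undecided (assumption)
    return d_normal / (d_normal + d_spam)


normal_senders = ["192.0.2.10", "192.0.2.200"]   # documentation-range addresses
spam_senders = ["203.0.113.5"]
print(discrimination("192.0.2.50", normal_senders, spam_senders))
```

With these example sets, an address lying near the registered normal senders receives a score close to 0, and an address equal to the registered spam sender receives exactly 1.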

NNIPF thus differs from "blacklist/whitelist" classification, which only tests whether the source IP address of a received message matches an entry in a blacklist (a list of spam IP addresses) or a whitelist (a list of normal-mail IP addresses).

When a source IP address is checked against a blacklist or whitelist, it is judged "not matching" if it differs from a listed address by even one bit. To detect spam reliably, the blacklist would therefore have to contain the IP addresses of every server from which spam might be sent, which is impossible in practice.

NNIPF, by contrast, obtains the distances to the nearest neighbor address of the normal senders and to that of the spam senders and derives the degree of discrimination from these distances. It can therefore identify spam even when not all IP addresses of spam senders and normal senders have been registered in advance.
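The contrast between exact list matching and nearest neighbor lookup can be illustrated with a small sketch. The helper names and the sample addresses (taken from the documentation-only ranges) are assumptions made for this example; the linear scan stands in for whatever index a real system would use.

```python
import ipaddress
from typing import Iterable, Tuple


def ip_value(addr: str) -> int:
    """32-bit integer value of an IPv4 address."""
    return int(ipaddress.IPv4Address(addr))


def exact_match(addr: str, registered: Iterable[str]) -> bool:
    """Blacklist/whitelist style test: only an identical address counts."""
    return addr in set(registered)


def nearest(addr: str, registered: Iterable[str]) -> Tuple[str, int]:
    """Nearest registered address and its distance in address space."""
    q = ip_value(addr)
    best = min(registered, key=lambda r: abs(ip_value(r) - q))
    return best, abs(ip_value(best) - q)


blacklist = ["203.0.113.5"]
# 203.0.113.4 differs from the listed address only in its lowest bit.
print(exact_match("203.0.113.4", blacklist))  # exact matching misses it
print(nearest("203.0.113.4", blacklist))      # nearest neighbor lookup still finds the close entry
```

The exact test reports no match for the unlisted neighbor, while the nearest neighbor lookup returns the listed address together with a distance of 1, which a distance-based score can then use.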

Moreover, NNIPF has been confirmed to classify mail with an accuracy of 98% or more when operated correctly. This rests on the fact that spam is often sent from servers whose IP addresses have values close to those of servers that have sent spam in the past, while normal mail is often sent from IP addresses with values close to those of source servers from which mail has long been received.

This tendency arises because IP addresses are allocated systematically by country and by organization, so the set of spam-source IP addresses is unevenly distributed over the IP address space. For example, spam almost never comes from a well-known, reputable company, whereas an IP address assigned to a low-cost overseas provider is more likely to be a source of spam.
This tendency holds as long as the rules for allocating IP addresses do not change. The IP addresses of servers that send spam are never uniformly distributed and retain a statistically significant bias, which makes continued classification possible.

Toshikazu Wada, NNIPF, [online], 2007, [retrieved July 13, 2009], Internet <http://vrl.sys.wakayama-u.ac.jp/~twada/NNIPF.html>

For advertisement-type mail sent as part of the legitimate advertising activities of companies, it is desirable to be able to examine the appropriateness of the delivery method, for example by measuring the effect of mail delivery in the way audience ratings are measured for television broadcasts.

In this regard, the present inventor conceived the idea of making it possible to judge the appropriateness of the delivery method for advertisement-type mail from whether that mail passes through or is blocked by a mail filtering system.

For example, advertisement-type mail is often sent not from the advertiser's own server but from the server of a specialist contractor that undertakes its delivery. For the advertiser, one important criterion in choosing a contractor is therefore whether the mail is likely to reach the users the advertiser wants to reach.
In other words, a company that wishes to advertise by advertisement-type mail wants as much of its mail as possible to pass through mail filtering systems without being blocked.

As noted above, however, conventional filtering systems were not designed to single out advertisement-type mail and filter it separately, and so they could not determine whether a blocked message was advertisement-type mail.
NNIPF likewise did not filter with particular attention to advertisement-type mail.

Moreover, a conventional content-analysis mail filtering system cannot reliably extract only advertisement-type mail, because the text (content) of such mail is often hard to distinguish from that of normal mail. For the same reason, if a content-analysis filter tries to detect advertisement-type mail as spam (in the broad sense), normal mail is more likely to be detected as spam as well.
Furthermore, whether advertisement-type mail should be received or rejected depends largely on each user's preferences, so it is difficult to distinguish from the content alone between advertisement-type mail that should be delivered and advertisement-type mail that should be rejected.

Conventional content-analysis mail filtering systems are therefore difficult to use for judging the appropriateness of the delivery method for advertisement-type mail.

In view of the above problems, an object of the present invention is to provide a mail filtering system and the like that can generate information serving as a basis for judging the appropriateness of the delivery method for advertisement-type mail.

The present inventor found that the source IP addresses of advertisement-type mail (for example, the IP addresses of advertisement-type mail delivery contractors) are biased in a way that distinguishes them from the addresses of spam in the narrow sense and of normal mail. In a scheme that, like NNIPF, performs a nearest neighbor search on the mail source IP address to obtain a degree of discrimination, registering in advance the set of IP addresses of servers that have sent advertisement-type mail in the past makes it possible to identify advertisement-type mail when newly received mail is filtered and to obtain a degree of discrimination indicating how likely the mail is to be advertisement-type mail. The inventor completed the present invention on the idea that this can be used to generate information usable for judging the appropriateness of the delivery method for advertisement-type mail.

(1) The present invention is a mail filtering system comprising:
a classification database for registering sets of IP addresses for mail classification; and
a classification unit that extracts the IP address of the mail sender contained in the mail header of a received mail, obtains from the classification database, by nearest neighbor search, the nearest neighbor address closest to the extracted IP address, calculates from the nearest neighbor address a degree of discrimination for classifying the received mail, and classifies the received mail based on the degree of discrimination,
wherein the classification database includes an advertisement-type mail sender database for registering a set of IP addresses of advertisement-type mail senders and a non-advertisement-type mail sender database for registering a set of IP addresses of non-advertisement-type mail senders,
the classification unit is configured to perform the nearest neighbor search on each of the advertisement-type mail sender database and the non-advertisement-type mail sender database, to obtain from the nearest neighbor addresses found by the search a degree of discrimination indicating the likelihood of advertisement-type mail, and to classify the IP address extracted from the received mail into, and register it in, either the advertisement-type mail sender database or the non-advertisement-type mail sender database according to the degree of discrimination, and
the system further comprises output means for outputting the set of IP addresses registered in the advertisement-type mail sender database in association with user data.

(2) Preferably, the classification database includes, as the non-advertisement-type mail sender databases for registering sets of IP addresses of source servers that have sent non-advertisement-type mail, a spam sender database for registering a set of IP addresses of source servers that have sent spam other than advertisement-type mail and a normal mail sender database for registering a set of IP addresses of source servers that have sent normal mail, and the classification unit is configured to perform the nearest neighbor search on each of the advertisement-type mail sender database, the spam sender database, and the normal mail sender database and to obtain, from the nearest neighbor addresses found by the search, the degree of discrimination indicating the likelihood of advertisement-type mail.
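The arrangement in items (1) and (2), with a nearest neighbor search over an advertisement-type set, a spam set, and a normal set followed by registration of the classified address, might be sketched as below. The class name, the threshold of 0.5, the scoring ratio, and the rule for choosing between the two non-advertisement sets are all assumptions for illustration; the text does not specify them. A linear scan is used for brevity where a real system would use an ordered index.

```python
import ipaddress
from typing import Iterable


def ip_value(addr: str) -> int:
    """32-bit integer value of an IPv4 address."""
    return int(ipaddress.IPv4Address(addr))


class ClassificationDB:
    """Three sender sets, each assumed to be seeded with at least one address."""

    def __init__(self, advertisement: Iterable[str], spam: Iterable[str],
                 normal: Iterable[str]) -> None:
        self.sets = {
            "advertisement": {ip_value(a) for a in advertisement},
            "spam": {ip_value(a) for a in spam},
            "normal": {ip_value(a) for a in normal},
        }

    def _distance(self, label: str, query: int) -> int:
        # Distance to the nearest registered address in one set (linear scan).
        return min(abs(v - query) for v in self.sets[label])

    def ad_likeness(self, addr: str) -> float:
        """Assumed score in [0, 1]; values near 1 suggest advertisement-type mail."""
        q = ip_value(addr)
        d_ad = self._distance("advertisement", q)
        d_non = min(self._distance("spam", q), self._distance("normal", q))
        if d_ad + d_non == 0:
            return 0.5  # registered on both sides: undecided (assumption)
        return d_non / (d_ad + d_non)

    def classify_and_register(self, addr: str, threshold: float = 0.5) -> str:
        """Classify the sender address and add it to the chosen set."""
        q = ip_value(addr)
        if self.ad_likeness(addr) > threshold:
            label = "advertisement"
        else:
            # Among the non-advertisement sets, pick the closer one.
            label = min(("spam", "normal"), key=lambda name: self._distance(name, q))
        self.sets[label].add(q)
        return label


db = ClassificationDB(["198.51.100.10"], ["203.0.113.5"], ["192.0.2.10"])
print(db.classify_and_register("198.51.100.20"))
print(db.classify_and_register("192.0.2.99"))
```

Because each classified address is registered in the chosen set, later lookups near that address are pulled toward the same label, which is the feedback the registration step provides.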

(3) Preferably, the mail filtering system comprises:
a classification database for registering sets of IP addresses for mail classification; and
a classification unit that extracts the IP address of the mail sender contained in the mail header of a received mail, obtains from the classification database, by nearest neighbor search, the nearest neighbor address closest to the extracted IP address, calculates from the nearest neighbor address a degree of discrimination for classifying the received mail, and classifies the received mail based on the degree of discrimination,
wherein the classification database includes an advertisement-type mail sender database for registering a set of IP addresses of advertisement-type mail senders and a non-advertisement-type mail sender database for registering a set of IP addresses of non-advertisement-type mail senders,
the classification unit is configured to perform the nearest neighbor search on each of the advertisement-type mail sender database and the non-advertisement-type mail sender database, to obtain from the nearest neighbor addresses found by the search a degree of discrimination indicating the likelihood of advertisement-type mail, and to classify the IP address extracted from the received mail into, and register it in, either the advertisement-type mail sender database or the non-advertisement-type mail sender database according to the degree of discrimination,
the system further comprises output means for outputting the set of IP addresses registered in the advertisement-type mail sender database as information usable for judging the appropriateness of the delivery method for advertisement-type mail, and
the advertisement-type mail sender database is configured to register IP address sets separately for each user.

(4) Preferably, the advertisement-type mail sender database stores the source IP address of each mail classified as advertisement-type mail in association with the user data of the user who is the recipient of that mail.

(5) Preferably, the system further comprises means for counting, for mail whose source address is a designated IP address, the number of messages classified by the classification unit as advertisement-type mail and/or the number classified as non-advertisement-type mail.
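The counting means of item (5) could look like the following sketch. The class name, the label strings, and the shape of the summary are hypothetical; the text only states that classification results are tallied for mail whose source is a designated IP address.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable


class ClassificationTally:
    """Per-address counts of classification results for designated source IPs."""

    def __init__(self, designated: Iterable[str]) -> None:
        self.designated = set(designated)
        self.counts: Dict[str, Counter] = defaultdict(Counter)

    def record(self, source_ip: str, label: str) -> None:
        # Only mail from a designated source address is tallied.
        if source_ip in self.designated:
            self.counts[source_ip][label] += 1

    def summary(self, source_ip: str) -> Dict[str, int]:
        c = self.counts[source_ip]
        ads = c["advertisement"]
        return {"advertisement": ads, "non_advertisement": sum(c.values()) - ads}


tally = ClassificationTally(["198.51.100.10"])
for label in ("advertisement", "normal", "advertisement"):
    tally.record("198.51.100.10", label)
print(tally.summary("198.51.100.10"))
```

Keeping the counts per source address lets an advertiser or delivery contractor compare how often mail from a given server is treated as advertisement-type mail versus other categories.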

(6) From another viewpoint, the present invention is a computer program for causing a computer to function as the mail filtering system according to any one of (1) to (4) above.

(7) From yet another viewpoint, the present invention is an information generation method for generating information usable for judging the appropriateness of the delivery method for advertisement-type mail, in a mail filtering system that extracts the IP address of the mail sender contained in the mail header of a received mail, obtains by nearest neighbor search, from a classification database for registering sets of IP addresses for mail classification, the nearest neighbor address closest to the extracted IP address, calculates from the nearest neighbor address a degree of discrimination for classifying the received mail, and classifies the received mail based on the degree of discrimination,
wherein the classification database includes an advertisement-type mail sender database for registering a set of IP addresses of advertisement-type mail senders and a non-advertisement-type mail sender database for registering a set of IP addresses of non-advertisement-type mail senders, and
the information generation method includes the steps of:
performing the nearest neighbor search on each of the advertisement-type mail sender database and the non-advertisement-type mail sender database;
obtaining, from the nearest neighbor addresses found by the search, a degree of discrimination indicating the likelihood of advertisement-type mail;
classifying the IP address extracted from the received mail into, and registering it in, either the advertisement-type mail sender database or the non-advertisement-type mail sender database according to the degree of discrimination; and
outputting the IP addresses registered in the advertisement-type mail sender database in association with user data.

(8) From yet another viewpoint, the present invention is an information generation method for generating information usable for judging the appropriateness of the delivery method for advertisement-type mail, in a mail filtering system that extracts the IP address of the mail sender contained in the mail header of a received mail, obtains by nearest neighbor search, from a classification database for registering sets of IP addresses for mail classification, the nearest neighbor address closest to the extracted IP address, calculates from the nearest neighbor address a degree of discrimination for classifying the received mail, and classifies the received mail based on the degree of discrimination,
wherein the classification database includes an advertisement-type mail sender database for registering a set of IP addresses of advertisement-type mail senders and a non-advertisement-type mail sender database for registering a set of IP addresses of non-advertisement-type mail senders, and
the information generation method includes the steps of:
registering IP address sets in the advertisement-type mail sender database separately for each user;
performing the nearest neighbor search on each of the advertisement-type mail sender database and the non-advertisement-type mail sender database;
obtaining, from the nearest neighbor addresses found by the search, a degree of discrimination indicating the likelihood of advertisement-type mail;
classifying the IP address extracted from the received mail into, and registering it in, either the advertisement-type mail sender database or the non-advertisement-type mail sender database according to the degree of discrimination; and
outputting the set of IP addresses registered in the advertisement-type mail sender database as information usable for judging the appropriateness of the delivery method for advertisement-type mail.

According to the present invention, the mail filtering system can output information that serves as a basis for judging the appropriateness of the delivery method for advertisement-type mail.

A diagram showing the placement of the mail filtering system in relation to the MTAOB concept.
A schematic diagram showing the input/output functions of the mail filtering system.
A schematic diagram showing the internal functions of the mail filtering system.
A configuration diagram of the classification IP address database.
A configuration diagram of the classification unit.
An explanatory diagram of the conversion from an IP address to an IP address value.
A diagram showing an insertion operation on a B-tree.
A diagram showing a deletion operation on a B-tree.
A diagram showing a nearest neighbor search on a B-tree.
A diagram showing an insertion operation on a B+ tree.
A diagram showing a deletion operation on a B+ tree.
A diagram showing a nearest neighbor search on a B+ tree.
A graph showing file size and search speed for GIP with a B-tree.
A graph showing file size and search speed for BIP with a B-tree.
A graph showing file size and search speed for GIP with a B+ tree.
A graph showing file size and search speed for BIP with a B+ tree.
A logarithmic graph comparing sizes.
A graph of the average exact-match speed for GIP.
A graph of the average exact-match speed for BIP.
A graph of the average nearest neighbor search speed.
A graph of the average speed of insertion and deletion operations.

Preferred embodiments of the present invention are described below with reference to the accompanying drawings.

[1. Overview of the nearest-neighbor-search filtering system and definition of terms]
FIGS. 1 and 2 illustrate how the mail filtering system (filtering server) 1 according to the embodiment is operated. As shown in FIG. 1, the mail sending and receiving software used by a mail user, together with the computer on which it runs, is called the "UMA" (User Mail Agent). When an organization S (a body such as a company or university that has a mail server, an Internet service provider, or the like) has a plurality of MTAs (Mail Transfer Agents), the MTA that provides the UMA with the inbox for ordinary mail reception is called the "BaseMTA". The MTA within organization S that receives mail from external servers when a mail user receives mail from outside organization S is called the MTAOB (MTA On the Border).

In this embodiment, the BaseMTA is configured to forward each mail it receives to the filtering system (filtering server) 1 according to the embodiment. As shown in FIG. 2, the filtering system 1 classifies the forwarded mail and stores normal mail in a classified mailbox for each mail user. A mail user who receives mail through the filtering system 1 retrieves it from the classified mailbox of the filtering system 1 rather than from the BaseMTA. In other words, the filtering system 1 blocks mail that is not normal (spam and advertisement-type mail) and lets normal mail pass.

Based on the result of classifying the mail, the filtering system 1 also outputs information usable for judging the appropriateness of the delivery method for advertisement-type mail.

The filtering system 1 of this embodiment may be installed on the same server as the BaseMTA. The functions of the filtering system need not all reside on one server and may be distributed.

[2. Source IP address of mail]
Each time an MTA receives and forwards a mail, it adds a Received line such as that shown in FIG. 1 to the mail header. In the Received line, the IP address of the server that sent the mail to the MTA is written after "from", and the IP address of the MTA itself is written after "by".

メールは、複数のＭＴＡを転送されてくるため、一つのメールのメールヘッダには、複数のＲｅｃｅｉｖｅｄ行が付加されているのが一般的であるが、複数のＲｅｃｅｉｖｅｄ行のうち、組織Ｓの外部サーバからメールを受信するＭＴＡＯＢが「ｂｙ」の後に記載されているＲｅｃｅｉｖｅｄ行をみれば、そのＲｅｃｅｉｖｅｄ行の「Ｆｒｏｍ」の後のＩＰアドレスが、メール送信元のＩＰアドレスであることがわかる。ＭＴＡＯＢが「ｂｙ」の後に記載されているＲｅｃｅｉｖｅｄ行は、ユーザの属する組織ＳのＭＴＡＯＢが付加したものであるため、送信元ＩＰアドレスの偽装は困難であり、送信元の正しいＩＰアドレスが得られる。 Since a mail is relayed through a plurality of MTAs, the mail header of a single mail generally carries a plurality of Received lines. Among these, in the Received line in which the MTAOB, which receives mail from servers outside the organization S, appears after “by”, the IP address after “From” is the IP address of the mail transmission source. Because that Received line was added by the MTAOB of the organization S to which the user belongs, the source IP address is difficult to forge, and the correct IP address of the source can be obtained.

本フィルタリングシステム１は、メールヘッダに含まれる送信元ＩＰアドレスを用いて、メールの分類を行う。 The filtering system 1 classifies mail using the source IP address included in the mail header.

［３．メールフィルタリングシステムの構成］
図３は、本メールフィルタリングシステム（以下、単に「システム」という）１の機能ブロックを示している。なお、本システム１は、サーバ（コンピュータ）１にメールフィルタリング用コンピュータプログラムをインストールして構成されており、以下に説明するシステム１の機能は、当該プログラムがサーバ（コンピュータ）１によって実行されることで発揮されるものである。 [3. Configuration of mail filtering system]
FIG. 3 shows functional blocks of the mail filtering system (hereinafter simply referred to as “system”) 1. The system 1 is configured by installing a mail filtering computer program on a server (computer) 1, and the functions of the system 1 described below are realized when the server (computer) 1 executes that program.

本システム１は、メールユーザが各種設定などの入力を行うためのインターフェース部１１、メールを分類する分類部１２、分類のためのＩＰアドレスが登録された分類用ＩＰアドレスデータベース１３、メールユーザの情報（年齢、性別、地域、職業など）が登録されたユーザデータベース１４、指定ＩＰアドレスを記憶するための指定ＩＰアドレス記憶部１５、及び通過率データ記憶部１６、データベース１３を検索するための検索部１７、及び指定ＩＰアドレスを送信元とするメールのフィルタ通過率算出部１８を備えている。 The system 1 includes an interface unit 11 through which a mail user inputs various settings, a classification unit 12 that classifies mail, a classification IP address database 13 in which IP addresses used for classification are registered, a user database 14 in which mail user information (age, gender, region, occupation, etc.) is registered, a designated IP address storage unit 15 that stores designated IP addresses, a pass rate data storage unit 16, a search unit 17 for searching the database 13, and a filter pass rate calculation unit 18 for mail whose transmission source is a designated IP address.

前記インターフェース部１１は、Ｗｅｂサーバとしての機能を有しており、ユーザは、（ＵＭＡが搭載された）ユーザコンピュータから、Ｗｅｂブラウザ機能によって、必要な情報の閲覧や入力を行うことができる。インターフェース部１１では、分類部１２に転送された各メールの分類状況をユーザに対して表示させたり、分類未確定のメールについて正常メールであるか非正常メール（受信したくない広告型メール又は迷惑メール）であるかの区別の入力をユーザから受け付けたり、システム１が誤って分類したメールやシステムが分類不能と判断したメールについて、ユーザが手動入力で適切に分類（広告型メール、迷惑メール、正常なメールのいずれか）するためのユーザからの入力を受け付けたりすることができる。 The interface unit 11 functions as a Web server, and a user can browse and input necessary information from a user computer (on which the UMA is installed) using a Web browser. The interface unit 11 can display to the user the classification status of each mail forwarded to the classification unit 12; accept, for mail whose classification is not yet settled, the user's input distinguishing normal mail from abnormal mail (advertisement-type mail that the user does not want to receive, or spam mail); and accept the user's manual input that correctly classifies (as advertisement-type mail, spam mail, or normal mail) mail that the system 1 classified incorrectly or judged to be unclassifiable.

前記分類部１２では、当該分類部１２に転送されてきたメールを、自動的に、広告型メール、迷惑メール、正常なメールのいずれかに分類し、正常なメールを、メールユーザ毎の分類済メールボックスに格納する。このため、メールユーザは、不要な広告型メールや迷惑メールをUMAで受信することを回避できる。なお、正常なメールには、広告型メールが含まれることがある。
この分類部１２の詳細については後述する。 The classification unit 12 automatically classifies each mail forwarded to it as advertisement-type mail, spam mail, or normal mail, and stores normal mail in the classified mailbox of each mail user. The mail user can therefore avoid receiving unwanted advertisement-type mail and spam mail with the UMA. Note that normal mail may include advertisement-type mail.
Details of the classification unit 12 will be described later.

図４に示すように、前記分類用ＩＰアドレスデータベース１３には、５種類のデータベースが含まれる。すなわち、正常メール送信者ＩＰアドレス共通データベース１３ａ、迷惑メール送信者ＩＰアドレス共通データベース１３ｂ、正常メール送信者ＩＰアドレス個人用データベース１３ｃ、迷惑メール送信者ＩＰアドレス個人用データベース１３ｄ、広告型メール送信者ＩＰアドレス個人用データベース１３ｅの５種類である。 As shown in FIG. 4, the classification IP address database 13 includes five databases: the normal mail sender IP address common database 13a, the spam mail sender IP address common database 13b, the normal mail sender IP address personal database 13c, the spam mail sender IP address personal database 13d, and the advertisement-type mail sender IP address personal database 13e.

正常メール送信者ＩＰアドレス共通データベース１３ａ及び正常メール送信者ＩＰアドレス個人用データベース１３ｃは、正常メールの送信元ＩＰアドレスを登録するためのものである。 The normal mail sender IP address common database 13a and the normal mail sender IP address personal database 13c are used for registering the normal mail sender IP address.

これらの正常メールに関するデータベース１３ａ，１３ｃうち、共通データベース１３ａは、各メールユーザについて共通して利用されるデータベースである。例えば、優良な企業や公的団体からのメールは、多くのユーザにとって正常メールとして扱われるべきものであるため、そのような企業や公的団体の送信元サーバのＩＰアドレスが、予め共通データベース１３ａに登録されている。 Of these databases 13a and 13c regarding normal mail, the common database 13a is a database that is commonly used for each mail user. For example, since mails from excellent companies and public organizations should be treated as normal emails for many users, the IP address of the transmission source server of such a company or public organization is previously stored in the common database 13a. It is registered in.

一方、個人用データベース１３ｃは、メールユーザごとに設けられており、各ユーザ宛のメールのうち、正常メールと分類されたものであって、その送信元ＩＰアドレスが共通データベース１３ａに登録されていないメールの送信元ＩＰアドレスが登録される。
この個人用データベース１３ｃには、例えば、各ユーザが日常的にメールをやり取りする相手の送信元サーバのＩＰアドレスが登録されることになる。 On the other hand, the personal database 13c is provided for each mail user, and is classified as a normal mail among mails addressed to each user, and the transmission source IP address is not registered in the common database 13a. The sender IP address of the mail is registered.
In this personal database 13c, for example, the IP address of the transmission source server with which each user exchanges mail on a daily basis is registered.

迷惑メール送信者ＩＰアドレス共通データベース１３ｂ及び迷惑メール送信者ＩＰアドレス個人用データベース１３ｄは、迷惑メールの送信元ＩＰアドレスを登録するためのものである。 The spam mail sender IP address common database 13b and the spam mail sender IP address personal database 13d are used to register the sender IP address of the spam mail.

これらの迷惑メールに関するデータベース１３ｂ，１３ｄうち、共通データベース１３ｂは、各メールユーザについて共通して利用されるデータベースである。例えば、有害情報を含むメールやウィルスメールは、多くのユーザにとって迷惑メールとして扱われるべきものであるため、そのようなメールの送信元サーバのＩＰアドレスは、この共通データベース１３ｂに予め登録されている。 Of the databases 13b and 13d related to these spam mails, the common database 13b is a database that is commonly used for each mail user. For example, since mail including harmful information and virus mail should be treated as spam mail for many users, the IP address of the mail transmission source server is registered in advance in the common database 13b. .

一方、個人用データベース１３ｄは、メールユーザごとに設けられており、各ユーザ宛のメールのうち、迷惑メールと分類されたものであって、その送信元ＩＰアドレスが共通データベース１３ｂに登録されていないメールの送信元ＩＰアドレスが登録される。 On the other hand, the personal database 13d is provided for each mail user, and the mail addressed to each user is classified as spam mail, and the transmission source IP address is not registered in the common database 13b. The sender IP address of the mail is registered.

なお、個人用データベース１３ｃ，１３ｄに登録されているＩＰアドレスのうち、多くのユーザの個人用データベース１３ｃ，１３ｄに登録された結果、共通性が高まったと判断したＩＰアドレスについては、システム１が対応する共通データベース１３ａ，１３ｂに格納する。 Among the IP addresses registered in the personal databases 13c and 13d, an IP address that the system 1 judges to have become sufficiently common, as a result of being registered in the personal databases 13c, 13d of many users, is stored by the system 1 in the corresponding common database 13a or 13b.
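The promotion from personal databases to a common database could be sketched as below. The text does not say how "commonality" is judged, so the user-count threshold `min_users`, the function name, and the dictionary layout are all assumptions made for illustration only.

```python
from collections import Counter

def promote_common_addresses(personal_dbs, common_db, min_users=3):
    """Copy addresses seen in many users' personal databases into the common one.

    personal_dbs: dict mapping a user id to that user's set of IP addresses
                  (for example, every user's database 13c).
    common_db:    set of IP addresses (for example, database 13a); updated in place.
    min_users:    hypothetical threshold; the source gives no concrete criterion.
    """
    counts = Counter()
    for addresses in personal_dbs.values():
        counts.update(set(addresses))  # count each address once per user
    promoted = {ip for ip, n in counts.items()
                if n >= min_users and ip not in common_db}
    common_db |= promoted
    return promoted
```

Whether a promoted address is also removed from the personal databases is not stated in the text, so the sketch leaves them untouched.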

上記の４つのデータベース１３ａ，１３ｂ，１３ｃ，１３ｄは、広告型メール以外の非広告型メールである正常メール及び／又は迷惑メールを送信した送信元サーバのＩＰアドレスを登録する「非広告型メール送信元ＩＰアドレスデータベース」である。
これに対し、広告型メール送信元個人用データベース１３ｅは、広告型メールを送信した送信元サーバのＩＰアドレスを登録する「広告型メール送信元ＩＰアドレスデータベース」である。 The above four databases 13a, 13b, 13c, and 13d are “non-advertisement-type mail sender IP address databases” that register the IP addresses of source servers that have sent normal mail and/or spam mail, that is, non-advertisement-type mail other than advertisement-type mail.
On the other hand, the advertisement-type mail sender personal database 13e is an “advertisement-type mail sender IP address database” that registers the IP address of the sender server that sent the advertisement mail.

この広告型メール送信元個人用データベース１３ｅは、メールユーザごとに設けられており、メールユーザごとの広告型メール送信元ＩＰアドレスが登録される。
「広告型メール送信元ＩＰアドレスデータベース」としては、この広告型メール送信元個人用データベース１３ｅだけが設けられており、正常メールや迷惑メールのように、共通データベースは設けられていない。
したがって、広告型メールに分類されたメールの送信元ＩＰアドレスは、すべて、各ユーザの広告型メール送信元個人用データベース１３ｅに登録される。共通データベースを設けないことで、ユーザごとの広告型メールの受信状況が把握し易くなる。 This advertisement-type mail sender personal database 13e is provided for each mail user, and the advertisement-type mail sender IP address for each mail user is registered.
Only this advertisement-type mail sender personal database 13e is provided as the “advertisement-type mail sender IP address database”; unlike for normal mail and spam mail, no common database is provided.
Therefore, all the sender IP addresses of mail classified as advertisement mail are registered in the advertisement mail sender personal database 13e of each user. By not providing a common database, it is easy to grasp the reception status of advertisement-type mail for each user.

広告型メール送信元個人用データベース１３ｅには、興味を失った商品等の広告メールや読まなくなったメールマガジンなど、過去においては正常メールとして分類されていたが、現在は受信の必要がなくなったメールや、無差別に送信されてきて受信したくない広告型メールなどの送信元サーバのＩＰアドレスが登録される。なお、ユーザがメールを手入力で分類してデータベース１３に登録する場合、迷惑メールか広告型メールかはユーザの主観で行えば足りる。 The advertisement-type mail sender personal database 13e registers the IP addresses of the source servers of mail that was classified as normal mail in the past but no longer needs to be received, such as advertising mail for products in which the user has lost interest or mail magazines that the user no longer reads, and of advertisement-type mail that is sent indiscriminately and that the user does not want to receive. When the user classifies mail manually and registers it in the database 13, the choice between spam mail and advertisement-type mail may simply follow the user's own judgment.

［４．分類部の詳細］
図５は、分類部１２の機能ブロックを示している。この分類部１２は、転送されてきたメールを受信する受信部１２１、受信部１２１で受信したメールの送信元ＩＰアドレスを抽出するＩＰアドレス抽出部１２２、抽出したＩＰアドレスを元に分類用ＩＰアドレスデータベース１３での最近傍探索を行う最近傍探索部１２３、最近傍探索の結果から弁別度を算出する弁別度算出部１２４、弁別度に基づいて受信したメールを分類するメール分類部１２５、正常メールであると分類されたメールを格納する分類済メールボックス１２６、受信部１２１で受信したメールのうち指定ＩＰアドレスが送信元であるメールを検出する指定ＩＰアドレス検出部１２７を備えている。 [4. Details of classification section]
FIG. 5 shows functional blocks of the classification unit 12. The classification unit 12 includes a reception unit 121 that receives the forwarded mail, an IP address extraction unit 122 that extracts a transmission source IP address of the mail received by the reception unit 121, and a classification IP address based on the extracted IP address. Nearest neighbor search unit 123 that performs the nearest neighbor search in the database 13, a discrimination degree calculation unit 124 that calculates a discrimination degree from the result of the nearest neighbor search, a mail classification unit 125 that classifies received mail based on the discrimination degree, and normal mail A classified mail box 126 for storing the mail classified as, and a designated IP address detecting section 127 for detecting a mail whose designated IP address is the transmission source among the mails received by the receiving section 121.

［４．１ＩＰアドレス抽出部］
ＩＰアドレス抽出部１２２は、受信部１２１で受信したメールのメールヘッダに付加されているＲｅｃｅｉｖｅｄ行のうち、そのメールの受信者であるユーザのＭＴＡＯＢであるとしてシステム１に登録されているＭＴＡが、Ｒｅｃｅｉｖｅｄ行の「ｂｙ」の後に記載されているＲｅｃｅｉｖｅｄ行を抽出する。さらに、ＩＰアドレス抽出部１２２は、その抽出したＲｅｃｅｉｖｅｄ行における「ｆｒｏｍ」の後のＩＰアドレスを、送信元ＩＰアドレスとして抽出する。
なお、各ユーザのＭＴＡＯＢは、システム１に予め登録されている。 [4.1 IP address extraction unit]
From among the Received lines added to the mail header of the mail received by the reception unit 121, the IP address extraction unit 122 extracts the Received line in which the MTA registered in the system 1 as the MTAOB of the user who is the recipient of the mail appears after “by”. The IP address extraction unit 122 then extracts the IP address after “from” in that Received line as the source IP address.
Note that the MTAOB of each user is registered in the system 1 in advance.
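The extraction step can be illustrated with a minimal sketch using Python's standard `email` module. It assumes a simplified Received syntax of the form `from <sender info> by <MTA identifier> ...`, and that the registered MTAOB is given as the exact token that follows “by” (a host name or an IP address). The function name, the regular expressions, and the sample host names are hypothetical and are not taken from the embodiment.

```python
import re
from email import message_from_string

# "from <sender part> by <receiving MTA>" in a simplified Received line (assumed format).
RECEIVED_RE = re.compile(r"from\s+(?P<src>.+?)\s+by\s+(?P<by>\S+)",
                         re.IGNORECASE | re.DOTALL)
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def extract_source_ip(raw_mail, mtaob):
    """Return the IPv4 address after "from" in the Received line added by mtaob."""
    msg = message_from_string(raw_mail)
    for line in msg.get_all("Received", []):
        m = RECEIVED_RE.search(line)
        if m and m.group("by").lower() == mtaob.lower():
            ip = IPV4_RE.search(m.group("src"))
            if ip:
                return ip.group(0)
    return None  # no Received line was added by the registered MTAOB

SAMPLE = (
    "Received: from relay.example.org (relay.example.org [10.0.0.5])"
    " by base.example.org with ESMTP; Mon, 1 Jun 2009 10:00:02 +0900\n"
    "Received: from mail.sender.example (mail.sender.example [203.0.113.25])"
    " by mtaob.example.org with ESMTP; Mon, 1 Jun 2009 10:00:01 +0900\n"
    "Received: from forged.example ([198.51.100.1])"
    " by unknown.example with SMTP; Mon, 1 Jun 2009 09:59:59 +0900\n"
    "From: someone@sender.example\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)
```

Only the line stamped by the registered MTAOB is trusted; Received lines below it may have been written by the sender and are ignored.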

［４．２最近傍探索部］
最近傍探索部１２３は、ＮＮＩＰＦと同様の最近傍探索理論に基づいて、抽出アドレスの最近傍アドレスを探索する。
すなわち、本実施形態の最近傍探索部１２３は、ＩＰアドレス抽出部１２２で抽出されたＩＰアドレス（抽出ＩＰアドレス；未知のＩＰアドレス）の最近傍アドレスを、分類用ＩＰアドレスデータベースにおける各データベース１３ａ，１３ｂ，１３ｃ，１３ｄ，１３ｅそれぞれで探索する。 [4.2 Nearest neighbor search unit]
The nearest neighbor search unit 123 searches for the nearest neighbor address of the extracted address based on the nearest neighbor search theory similar to NNIPF.
That is, the nearest neighbor search unit 123 of the present embodiment searches each of the databases 13a, 13b, 13c, 13d, and 13e in the classification IP address database for the nearest neighbor address of the IP address extracted by the IP address extraction unit 122 (the extracted IP address, i.e., an unknown IP address).

具体的には、最近傍探索部１２３は、正常メール送信者ＩＰアドレス共通データベース１３ａにおける最近傍アドレス（第１最近傍アドレス）、迷惑メール送信者ＩＰアドレス共通データベース１３ｂにおける最近傍アドレス（第２最近傍アドレス）、メールの受信者であるユーザについての正常メール送信者ＩＰアドレス個人用データベース１３ｃにおける最近傍アドレス（第３最近傍アドレス）、メールの受信者であるユーザについての迷惑メール送信者ＩＰアドレス個人用データベース１３ｄにおける最近傍アドレス（第４最近傍アドレス）、及びメールの受信者であるユーザについての広告型メール送信者ＩＰアドレス個人用データベース１３ｅにおける最近傍アドレス（第５最近傍アドレス）を探索し、５つの最近傍アドレスを取得する。
この探索方法の詳細については、後述する。 Specifically, the nearest neighbor search unit 123 searches for the nearest neighbor address in the normal mail sender IP address common database 13a (first nearest neighbor address), the nearest neighbor address in the spam mail sender IP address common database 13b (second nearest neighbor address), the nearest neighbor address in the normal mail sender IP address personal database 13c of the user who is the recipient of the mail (third nearest neighbor address), the nearest neighbor address in the spam mail sender IP address personal database 13d of that user (fourth nearest neighbor address), and the nearest neighbor address in the advertisement-type mail sender IP address personal database 13e of that user (fifth nearest neighbor address), thereby obtaining five nearest neighbor addresses.
Details of this search method will be described later.

また、最近傍探索部１２３は、正常メールに関する第１最近傍アドレス及び第３最近傍アドレスのうち、抽出ＩＰアドレスにより近いアドレスを、「正常メールとしての最近傍アドレス」とするとともに、迷惑メールに関する第２最近傍アドレス及び第４最近傍アドレスのうち、抽出ＩＰアドレスにより近いアドレスを、「迷惑メールとしての最近傍アドレス」とする。 In addition, of the first and third nearest neighbor addresses, which relate to normal mail, the nearest neighbor search unit 123 takes the one closer to the extracted IP address as the “nearest neighbor address as normal mail”, and of the second and fourth nearest neighbor addresses, which relate to spam mail, it takes the one closer to the extracted IP address as the “nearest neighbor address as spam mail”.

ここで、正常メールとしての最近傍アドレスを「ＮＮ（ＩＰ；ＧＩＰ）」と表し、
迷惑メールについての最近傍アドレスを「ＮＮ（ＩＰ；ＢＩＰ）」と表し、
広告メールとしての最近傍アドレス（第５最近傍アドレス）を「ＮＮ（ＩＰ；ＡＩＰ）」と表し、
抽出ＩＰアドレス（未知ＩＰアドレス；ＩＰ）と、正常メールとしての最近傍アドレスＮＮ（ＩＰ；ＧＩＰ）との距離を、Ｄ_GIP（ＩＰ）＝ｄｉｓｔ（ＩＰ，ＮＮ（ＩＰ；ＧＩＰ））と表し、
抽出ＩＰアドレス（未知ＩＰアドレス；ＩＰ）と、迷惑メールとしての最近傍アドレスＮＮ（ＩＰ；ＢＩＰ）との距離を、Ｄ_BIP（ＩＰ）＝ｄｉｓｔ（ＩＰ，ＮＮ（ＩＰ；ＢＩＰ））と表し、
抽出ＩＰアドレス（未知ＩＰアドレス；ＩＰ）と、広告型メールとしての最近傍アドレスＮＮ（ＩＰ；ＡＩＰ）との距離を、Ｄ_AIP（ＩＰ）＝ｄｉｓｔ（ＩＰ，ＮＮ（ＩＰ；ＡＩＰ））と表す。
なお、距離の求め方については後述するが、最近傍探索部１２３は、抽出ＩＰアドレスと各最近傍アドレスとから、これらの距離Ｄ_GIP，Ｄ_BIP，Ｄ_AIPそれぞれ求めて出力する。 Here, the nearest neighbor address as normal mail is denoted “NN(IP; GIP)”,
the nearest neighbor address as spam mail is denoted “NN(IP; BIP)”,
the nearest neighbor address as advertisement-type mail (the fifth nearest neighbor address) is denoted “NN(IP; AIP)”,
the distance between the extracted IP address (unknown IP address; IP) and the nearest neighbor address NN(IP; GIP) as normal mail is denoted D_GIP(IP) = dist(IP, NN(IP; GIP)),
the distance between the extracted IP address (unknown IP address; IP) and the nearest neighbor address NN(IP; BIP) as spam mail is denoted D_BIP(IP) = dist(IP, NN(IP; BIP)),
and the distance between the extracted IP address (unknown IP address; IP) and the nearest neighbor address NN(IP; AIP) as advertisement-type mail is denoted D_AIP(IP) = dist(IP, NN(IP; AIP)).
The method of obtaining the distance is described later; the nearest neighbor search unit 123 obtains and outputs each of the distances D_GIP, D_BIP, and D_AIP from the extracted IP address and the respective nearest neighbor addresses.
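As an illustration of how the five searches collapse into the three distances, the following sketch keeps each database as a sorted Python list of IP address values held in memory. Two points are assumptions rather than statements of the embodiment: `dist` is taken to be the absolute difference of IP address values, and the dictionary keys `"13a"` to `"13e"` are simply labels borrowed from the reference numerals.

```python
from bisect import bisect_left

def nearest(sorted_values, x):
    """Return the element of sorted_values closest to x, or None if it is empty."""
    if not sorted_values:
        return None
    i = bisect_left(sorted_values, x)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = sorted_values[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda v: abs(v - x))

def class_distances(ip, db):
    """Compute D_GIP, D_BIP and D_AIP for the IP address value ip.

    db maps "13a".."13e" to sorted lists of IP address values.
    A distance is None when every database of that class is empty.
    """
    def closest_distance(*names):
        found = [nearest(db[name], ip) for name in names]
        dists = [abs(v - ip) for v in found if v is not None]
        return min(dists) if dists else None

    return {
        "D_GIP": closest_distance("13a", "13c"),  # normal: common + personal
        "D_BIP": closest_distance("13b", "13d"),  # spam: common + personal
        "D_AIP": closest_distance("13e"),         # advertisement: personal only
    }
```

For example, with `db = {"13a": [100, 500], "13b": [1000], "13c": [130], "13d": [], "13e": [90, 200]}` and `ip = 120`, the personal normal-mail entry 130 is the closest normal-mail address, so D_GIP is 10.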

［４．３弁別度算出部］
前記弁別度算出部１２４は、前記距離Ｄ_GIP，Ｄ_BIP，Ｄ_AIPに基づいて、抽出ＩＰアドレスの迷惑メールの送信元らしさを示す弁別度、及び広告型メールの送信元らしさ示す弁別度を算出する。 [4.3 Discrimination degree calculation unit]
Based on the distances D_GIP, D_BIP, and D_AIP, the discrimination degree calculation unit 124 calculates a discrimination degree indicating how likely the extracted IP address is to be a source of spam mail, and a discrimination degree indicating how likely it is to be a source of advertisement-type mail.

ここで、抽出ＩＰアドレス（ＩＰ）が、迷惑メールの送信元らしいかを示す弁別度は、次式で表される。

Here, the degree of discrimination indicating how likely the extracted IP address (IP) is to be a source of spam mail is expressed by the following equation.

P(Spam | IP) = (1 / D_BIP) / (1 / D_GIP + 1 / D_BIP + 1 / D_AIP)

また、抽出ＩＰアドレス（ＩＰ）が、広告型メールの送信元らしいかを示す弁別度は、次式で表される。

Further, the degree of discrimination indicating how likely the extracted IP address (IP) is to be a source of advertisement-type mail is expressed by the following equation.

P(Adv | IP) = (1 / D_AIP) / (1 / D_GIP + 1 / D_BIP + 1 / D_AIP)

これは、正常メールの事前確率Ｐ（Ｇｏｏｄ）、迷惑メールの事前確率Ｐ（Ｂａｄ）、広告型メールの事前確率Ｐ（Ａｄｖ）がともに１／３で、正規確率分布をそれぞれ、Ｐ（ＩＰ｜Ｇｏｏｄ）＝１／Ｄ_GIP，Ｐ（ＩＰ｜Ｂａｄ）＝１／Ｄ_BIP，Ｐ（ＩＰ｜Ａｄｖ）＝１／Ｄ_AIPとした場合に、下記のように迷惑メール及び広告型メールの事後確率をＢａｙｅｓ則で計算した結果と一致する。

This agrees with the result of calculating the posterior probabilities of spam mail and advertisement-type mail by Bayes' rule when the prior probability P(Good) of normal mail, the prior probability P(Bad) of spam mail, and the prior probability P(Adv) of advertisement-type mail are all 1/3 and the conditional probability distributions are P(IP | Good) = 1/D_GIP, P(IP | Bad) = 1/D_BIP, and P(IP | Adv) = 1/D_AIP.
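The Bayes' rule computation can be written out as a short worked derivation. It uses only the priors and conditional distributions stated in the text; the common factor 1/3 cancels, and the final product form is an algebraically equivalent rewriting added here for convenience.

```latex
\begin{aligned}
P(\mathrm{Spam}\mid IP)
 &= \frac{P(IP\mid \mathrm{Bad})\,P(\mathrm{Bad})}
         {P(IP\mid \mathrm{Good})\,P(\mathrm{Good})
          + P(IP\mid \mathrm{Bad})\,P(\mathrm{Bad})
          + P(IP\mid \mathrm{Adv})\,P(\mathrm{Adv})} \\
 &= \frac{1/D_{BIP}}{1/D_{GIP} + 1/D_{BIP} + 1/D_{AIP}}
  = \frac{D_{GIP}\,D_{AIP}}{D_{GIP}D_{BIP} + D_{BIP}D_{AIP} + D_{AIP}D_{GIP}}, \\[4pt]
P(\mathrm{Adv}\mid IP)
 &= \frac{1/D_{AIP}}{1/D_{GIP} + 1/D_{BIP} + 1/D_{AIP}}
  = \frac{D_{GIP}\,D_{BIP}}{D_{GIP}D_{BIP} + D_{BIP}D_{AIP} + D_{AIP}D_{GIP}}.
\end{aligned}
```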

上記のＰ（Ｓｐａｍ｜ＩＰ）は、抽出ＩＰアドレスが迷惑メール送信元ＩＰアドレスである確率（迷惑メール弁別度）を示しており、０〜１の値をとる。弁別度算出部１２４は、上記のＰ（Ｓｐａｍ｜ＩＰ）の式に基づいて弁別度を算出する。この値が、１に近いほど、迷惑メール送信元ＩＰアドレスである確率が高いことを示している。
また、上記のＰ（Ａｄｖ｜ＩＰ）は、抽出ＩＰアドレスが広告型メール送信元ＩＰである確率（広告型メール弁別度）を示しており、０〜１の値をとる。弁別度算出部１２４は、上記のＰ（Ａｄｖ｜ＩＰ）の式に基づいて弁別度を算出する。この値が、１に近いほど、広告メール送信元ＩＰアドレスである確率が高いことを示している。
また、抽出ＩＰアドレスが正常メール送信元ＩＰアドレスである確率（正常メール弁別度）は、Ｐ（Ｇｏｏｄ｜ＩＰ）＝１−（Ｐ（Ｓｐａｍ｜ＩＰ）＋Ｐ（Ａｄｖ｜ＩＰ））で求められる。 The above P (Spam | IP) indicates the probability that the extracted IP address is a spam mail source IP address (spam mail discrimination degree), and takes a value of 0 to 1. The discrimination degree calculation unit 124 calculates the discrimination degree based on the above-described expression P (Spam | IP). The closer this value is to 1, the higher the probability that it is a spam mail source IP address.
Also, the above P (Adv | IP) indicates the probability that the extracted IP address is the advertisement type mail transmission source IP (advertising type mail discrimination degree), and takes a value of 0 to 1. The discrimination degree calculation unit 124 calculates the discrimination degree based on the above expression of P (Adv | IP). The closer this value is to 1, the higher the probability that it is an advertisement mail sender IP address.
The probability that the extracted IP address is a normal mail transmission source IP address (normal mail discrimination degree) is obtained by P (Good | IP) = 1− (P (Spam | IP) + P (Adv | IP)).
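The three discrimination degrees can be computed in a few lines, as in the sketch below. The function name and the `eps` floor are assumptions: the text does not say what happens when a distance is zero (the extracted address is already registered), so a tiny floor is used here purely to avoid division by zero.

```python
def discrimination(d_gip, d_bip, d_aip, eps=1e-9):
    """Return P(Good|IP), P(Spam|IP) and P(Adv|IP) from the three distances.

    Each class is weighted by 1/D, matching equal priors of 1/3 and
    conditional distributions proportional to 1/D.
    eps is a hypothetical floor for a zero distance (exact match).
    """
    weights = {
        "Good": 1.0 / max(d_gip, eps),
        "Spam": 1.0 / max(d_bip, eps),
        "Adv": 1.0 / max(d_aip, eps),
    }
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}
```

With distances (2, 1, 2) the weights are 0.5, 1 and 0.5, giving P(Spam | IP) = 0.5 and P(Good | IP) = P(Adv | IP) = 0.25, so P(Good | IP) = 1 − (P(Spam | IP) + P(Adv | IP)) holds as stated.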

［４．４メール分類部］
前記メール分類部１２５は、上記の弁別度Ｐ（Ｓｐａｍ｜ＩＰ），Ｐ（Ａｄｖ｜ＩＰ），Ｐ（Ｇｏｏｄ｜ＩＰ）に基づいて、分類対象の受信メールを、迷惑メール、広告型メール、正常メールのいずれかに分類する。 [4.4 Mail classification section]
The mail classifying unit 125 converts the received mail to be classified into spam mail, advertisement mail, normal, based on the above-described discrimination levels P (Spam | IP), P (Adv | IP), and P (Good | IP). Categorize as either email.

メール分類部１２５は、３つの弁別度Ｐ（Ｓｐａｍ｜ＩＰ），Ｐ（Ａｄｖ｜ＩＰ），Ｐ（Ｇｏｏｄ｜ＩＰ）のうち、最も大きな値を持つ分類（迷惑メール、広告型メール、正常メール）を決定し、受信メールはその分類に振り分けられ、当該受信メールの送信元ＩＰアドレス（抽出ＩＰアドレス）は、対応するデータベース１３ａ〜１３ｅに登録される。また、正常メールについては、分類済メールボックス１２６に格納され、ＵＭＡがメールを取得可能となる。
なお、決定した分類が迷惑メール又は正常メールである場合、その抽出ＩＰアドレスが、共通データベース１３ａ，１３ｂに登録されていない場合に、個人用データベース１３ｃ，１３ｄに登録する。 The mail classification unit 125 has a classification with the largest value among the three discrimination levels P (Spam | IP), P (Adv | IP), and P (Good | IP) (spam mail, advertisement mail, normal mail). The received mail is sorted into the classification, and the transmission source IP address (extracted IP address) of the received mail is registered in the corresponding databases 13a to 13e. Moreover, normal mail is stored in the classified mail box 126, and UMA can acquire mail.
When the determined classification is spam mail or normal mail, the extracted IP address is registered in the personal database 13d or 13c, respectively, only if it is not already registered in the corresponding common database 13b or 13a.

また、決定した分類が広告型メールである場合、メール分類部１２５は、メールの受信者であるユーザのユーザデータをユーザデータベース１４から取得し、抽出ＩＰアドレスに前記ユーザデータを関連付けた上で、当該抽出ＩＰアドレスを、広告型メール送信者ＩＰアドレス個人用データベース１３ｅに登録する。なお、個人用データベース１３ｃ，１３ｄについても、抽出ＩＰアドレスとユーザデータとを関連づけて登録してもよい。 Further, when the determined classification is an advertisement type mail, the mail classification unit 125 acquires user data of a user who is a mail recipient from the user database 14, and associates the user data with an extracted IP address. The extracted IP address is registered in the advertisement-type mail sender IP address personal database 13e. The personal databases 13c and 13d may be registered in association with the extracted IP address and user data.

このように、本システム１では、広告型メールのフィルタリングが行えるとともに、ブロックされた広告型メールの送信元ＩＰアドレスとブロックしたユーザ（メールの宛先ユーザ）のユーザデータとを関連づけたデータ（データベース１３ｅ）が得られる。このデータは、広告型メールのメール配信方法の適切さの判断材料となる情報として利用される。 As described above, in the present system 1, advertisement mail can be filtered, and data (database 13e) that associates the transmission source IP address of the blocked advertisement mail with the user data of the blocked user (mail destination user). ) Is obtained. This data is used as information that is used as a material for determining the appropriateness of the mail delivery method of the advertisement-type mail.

また、弁別度を用いたメールの分類の際には、閾値を用いても良い。例えば、迷惑メール弁別度又は広告型メール弁別度は、所定の閾値Ｔ１以上でなければ、迷惑メール又は広告型メールとみなさないものとして扱うことができる。また、閾値は、２個又は３個以上設けてもよく、閾値に応じて迷惑メールらしさや広告型メールらしさを判定してもよい。また、その判定結果は、ユーザに対して、分類された各メールの色分け表示などに用いることができる。 In addition, a threshold may be used when classifying mail using the degree of discrimination. For example, if the spam discrimination level or the advertisement-type email discrimination level is not equal to or higher than a predetermined threshold T1, it can be handled as not regarded as a spam email or an advertisement-type email. Also, two or three or more threshold values may be provided, and the likelihood of junk mail or advertisement-type mail may be determined according to the threshold value. Further, the determination result can be used for color-coded display of each classified mail for the user.
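A minimal sketch of the classification rule with a single threshold T1 follows. Two choices are assumptions not fixed by the text: mail whose winning spam or advertisement degree falls below the threshold is treated as normal mail, and ties are resolved in favour of normal mail.

```python
def classify(p, threshold=0.0):
    """Pick the class with the largest discrimination degree.

    p: dict with keys "Good", "Spam", "Adv" (discrimination degrees).
    threshold: plays the role of T1; a spam or advertisement decision is
               accepted only when its degree is at least this value.
    Falling back to "Good" below the threshold is an assumption.
    """
    # max() returns the first maximal key, so listing "Good" first
    # resolves ties in favour of normal mail.
    label = max(("Good", "Spam", "Adv"), key=lambda k: p[k])
    if label != "Good" and p[label] < threshold:
        return "Good"
    return label
```

For instance, degrees of 0.2 (Good), 0.5 (Spam) and 0.3 (Adv) give "Spam" with no threshold, but "Good" when the threshold is 0.6.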

［４．５指定ＩＰアドレス検出部］
前記指定ＩＰアドレス検出部１２７は、指定ＩＰアドレス記憶部１５に記憶されている指定ＩＰアドレスが送信元ＩＰアドレスとなっているメールの受信状況（全受信数）を検出するとともに、指定ＩＰアドレスが送信元ＩＰアドレスとなっているメールが正常メールに分類された数及び／又は広告型メールに分類された数を検出する。また、迷惑メールに分類された数も検出してもよい。 [4.5 Designated IP address detector]
The designated IP address detecting unit 127 detects the reception status (the total number of received messages) of mail in which the designated IP address stored in the designated IP address storage unit 15 is the source IP address, and the designated IP address is It detects the number of mails that have been classified as normal mails and / or the number of mails classified as advertisement-type mails. Also, the number classified as junk mail may be detected.

指定ＩＰアドレス記憶部１５には、例えば、広告型メールの広告主から指定されたＩＰアドレス（広告型メールの送信元ＩＰアドレス）が入力されて記憶されている。指定ＩＰアドレス検出部１２７は、分類用ＩＰアドレスデータベース１３及び／又は分類部１２５での分類結果に基づいて、その指定ＩＰアドレスを持つ広告型メールが本システム１を通過した件数や、本システムでブロックされた件数を検出することができる。 The designated IP address storage unit 15 stores, for example, an IP address designated by an advertiser of advertisement-type mail (the source IP address of that advertisement-type mail). Based on the classification IP address database 13 and/or the classification results of the mail classification unit 125, the designated IP address detection unit 127 can detect the number of advertisement-type mails having the designated IP address that passed through the system 1 and the number that were blocked by the system 1.

これらの検出結果は、通過率算出部１８において、広告メールの通過率及び／又はブロック率の算出に用いられる。通過率は、［指定ＩＰアドレスを持つメールが正常メールに分類された件数／指定ＩＰアドレスを持つメールの全件数］で算出でき、ブロック率は、［指定ＩＰアドレスを持つメールが広告型メールに分類された件数／指定ＩＰアドレスを持つメールの全件数］で算出できる。
この算出結果は、システム１から出力され、広告型メールの広告主へ提供される情報となる。また、算出結果には、通過又はブロックしたユーザのユーザデータも付加してもよい。このような通過率又はブロック率が得られることで、広告主としては、依頼している広告メール配信業者が、適切な送信元ＩＰアドレスを持っているか否かを検討することが可能となる。 These detection results are used by the passage rate calculation unit 18 to calculate the passage rate and / or block rate of the advertisement mail. The passing rate can be calculated by [number of emails with specified IP address classified as normal email / total number of emails with specified IP address], and the blocking rate is [mail with specified IP address becomes advertising mail] The number of classified items / the total number of emails with the specified IP address] can be calculated.
This calculation result is output from the system 1 and becomes information provided to the advertiser of the advertising mail. In addition, user data of a user who has passed or blocked may be added to the calculation result. By obtaining such a passing rate or blocking rate, it becomes possible for the advertiser to examine whether or not the requested advertisement mail distributor has an appropriate source IP address.
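The two ratios defined above amount to a pair of divisions; the sketch below only adds a guard for the case where no mail from the designated IP address has been received. The function and parameter names are hypothetical.

```python
def pass_and_block_rates(n_normal, n_advertisement, n_total):
    """Return (pass rate, block rate) for one designated IP address.

    n_normal:        mails from the address classified as normal mail
    n_advertisement: mails from the address classified as advertisement-type mail
    n_total:         all mails received from the address
    Returns (None, None) when nothing has been received yet.
    """
    if n_total == 0:
        return None, None
    return n_normal / n_total, n_advertisement / n_total
```

If 100 mails arrived from a designated address, of which 30 were classified as normal and 60 as advertisement-type, the pass rate is 0.3 and the block rate is 0.6; the remaining 10 would be mail classified as spam.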

［５．分類用ＩＰアドレスデータベースの検索］
前記検索部１７では、ＩＰアドレスを検索キーにして、分類用ＩＰアドレスデータベース１３を検索し、検索キーのＩＰアドレスが、各データベース１３ａ〜１３ｅにおいて登録されているか否かを検索し、出力することができる。
また、検索部１７からＩＰアドレスを検索キーとして、広告型メール送信者ＩＰアドレス個人用データベース１３ｅを検索すると、当該データベース１３ｅにおいて、そのＩＰアドレスに関連づけて登録されているユーザデータを抽出することができる。
例えば、広告型メールの広告主が、広告型メールがどの程度、本システム１においてどのようなユーザに広告型メールとしてブロックされているかを知りたい場合、当該広告型メールの送信元ＩＰアドレスを検索キーとして入力して検索し、抽出されたユーザデータを解析すればよい。 [5. Search IP address database for classification]
The search unit 17 searches the classification IP address database 13 using the IP address as a search key, and searches for and outputs whether or not the IP address of the search key is registered in each of the databases 13a to 13e. Can do.
Further, when the advertisement type mail sender IP address personal database 13e is searched from the search unit 17 using the IP address as a search key, user data registered in association with the IP address can be extracted from the database 13e. it can.
For example, when an advertiser of advertisement-type mail wants to know to what extent, and by what kinds of users, its mail is being blocked as advertisement-type mail in the system 1, the advertiser can enter the source IP address of that mail as the search key and analyze the extracted user data.

［６．最近傍探索の詳細］
以下、前記最近傍探索部１２３による最近傍探索理論について説明する。 [6. Details of nearest neighbor search]
Hereinafter, the nearest neighbor search theory by the nearest neighbor search unit 123 will be described.

［６．１ＩＰアドレス］
本実施形態で使用するＩＰアドレスは、現在多く使用されているＩＰｖ４プロトコルに基づく。ＩＰｖ４のＩＰは、通常０〜２５５の数字４組をドットで繋いだ記法で表記される。本実施形態においては、ＩＰアドレスの距離計算を行うときは、ＮＮＩＰＦと同様に、ＩＰアドレスを２５６進数と考え１０進数表記に変換したＩＰアドレス値により行う。ＩＰアドレスからＩＰアドレス値を求めるには、図６に示すように、ＩＰアドレスの左側を上位桁として、各桁に対して、上位側から順に、２５６³、２５６²、２５６¹、２５６⁰の桁重みを乗じて、１０進数のＩＰアドレスを求める。ＩＰアドレス値による距離計算は、ＮＮＩＰＦ採用されており、上記ＩＰアドレス値による距離計算によって適切に分類が行えることが確認されている。 [6.1 IP address]
The IP addresses used in the present embodiment are based on the currently widely used IPv4 protocol. An IPv4 address is usually written as four numbers from 0 to 255 joined by dots. In the present embodiment, as in NNIPF, distances between IP addresses are calculated on IP address values obtained by regarding the IP address as a base-256 number and converting it to decimal notation. To obtain the IP address value from an IP address, as shown in FIG. 6, the leftmost part of the IP address is taken as the most significant digit, and the four digits are multiplied, in order from the most significant, by the digit weights 256³, 256², 256¹, and 256⁰ and summed to give the decimal IP address value. Distance calculation based on IP address values is adopted in NNIPF, and it has been confirmed that classification can be performed appropriately with it.
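The base-256 conversion can be sketched as follows; the function name is hypothetical, and the input validation is an addition for robustness rather than part of the described method.

```python
def ip_to_value(ip):
    """Convert a dotted IPv4 address to its IP address value (base 256 -> integer)."""
    octets = [int(part) for part in ip.split(".")]
    if len(octets) != 4 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError(f"not an IPv4 address: {ip!r}")
    value = 0
    for o in octets:
        # Horner's rule: equivalent to o1*256**3 + o2*256**2 + o3*256 + o4.
        value = value * 256 + o
    return value
```

For example, 192.168.0.1 becomes 192·256³ + 168·256² + 0·256 + 1 = 3232235521.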

データベース１３ａ〜１３ｅでは、ＩＰアドレス及びこのＩＰアドレス値が保存される。また、ＩＰアドレスは、バイナリデータで保存される。ＩＰアドレスをバイナリデータに変換すると、ＩＰアドレスの各オクテットが１バイトになるので１つのＩＰアドレスを４バイトで扱うことができる。 In the databases 13a to 13e, the IP address and the IP address value are stored. The IP address is stored as binary data. When the IP address is converted into binary data, each octet of the IP address becomes 1 byte, so that one IP address can be handled with 4 bytes.
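The 4-byte binary form can be illustrated with Python's standard `ipaddress` module. This is a sketch of the representation only; the byte order (most significant octet first) is an assumption chosen so that the bytes read as the same base-256 number as the IP address value.

```python
import ipaddress

def ip_to_bytes(ip):
    """Pack a dotted IPv4 address into 4 bytes, one byte per octet."""
    return ipaddress.IPv4Address(ip).packed

def bytes_to_value(raw):
    """Interpret 4 packed bytes as the IP address value (big-endian integer)."""
    return int.from_bytes(raw, "big")
```

Here `ip_to_bytes("192.168.0.1")` yields the four bytes C0 A8 00 01, and reading them back as a big-endian integer gives 3232235521.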

［６．２データベース１３の構築方法］
本システム１では、メールから抽出された送信元ＩＰアドレスを最近傍探索によって分類する。このため、分類数に応じた既知ＩＰアドレスの集合が必要となる。すなわち、正常メール、迷惑メール、広告型メールの３つに分類するのであれば、ＩＰ既知の正常メール送信者のＩＰアドレスの集合（データベース１３ａ，１３ｃ）、既知の迷惑メール送信者のＩＰアドレスの集合（データベース１３ｂ，１３ｄ）、既知の広告型メール送信者のＩＰアドレスの集合（データベース１３ｅ）それぞれが必要であり、これらの集合の中から、メールから抽出された送信元ＩＰアドレスの最近傍アドレスを抽出することになる。 [6.2 Construction Method of Database 13]
In the system 1, the source IP address extracted from a mail is classified by nearest neighbor search. This requires one set of known IP addresses per class. That is, to classify mail into the three classes of normal mail, spam mail, and advertisement-type mail, a set of IP addresses of known normal mail senders (databases 13a and 13c), a set of IP addresses of known spam mail senders (databases 13b and 13d), and a set of IP addresses of known advertisement-type mail senders (database 13e) are each required, and the nearest neighbor address of the source IP address extracted from the mail is found within each of these sets.

従来のＮＮＩＰＦでは、既知のＩＰアドレスの集合を記憶するために、数百〜数千のＩＰアドレスをディレクトリの階層構造を利用し、入力されたメールのＩＰアドレスに最も近いアドレス（最近傍アドレス）を探索する。このため、通常のファイルシステムではディスクスペースを大量に消費してしまうという問題点がある。この問題を回避するために、個々のデータベース１３ａ〜１３ｅを単一ファイルに格納し、それを読み込みメモリ上で最近傍探索を行う方法も考えられる。しかし、本システム１はメールが到着するたびに起動されるため、数千ものデータをファイルから読み込んでいたのでは、効率が良くない。 The conventional NNIPF stores the set of known IP addresses, several hundred to several thousand of them, using the hierarchical structure of directories, and searches that structure for the address closest to the IP address of the input mail (the nearest neighbor address). On an ordinary file system this consumes a large amount of disk space. To avoid this problem, each of the databases 13a to 13e could be stored in a single file that is read in so that the nearest neighbor search is performed in memory. However, since the system 1 is started every time a mail arrives, reading thousands of entries from a file each time is inefficient.

このような問題点を解決するため、本実施形態では、コンピュータ１の補助記憶装置（ハードディスク等）に格納されたデータをメモリに展開せずに最近傍探索を行う方法を採用する。これはソートアルゴリズムが内部ソートと外部ソートに分類されることになぞらえると、外部最近傍探索とも呼べる内容である。この外部最近傍探索のために、本実施形態では、Ｂ木又はＢ＋木を用いる。これらはもともと、一致型（Exact Match）の探索を前提としたデータ構造であるが、本実施形態ではこれらを最近傍探索に用いる。 To solve these problems, the present embodiment adopts a method of performing the nearest neighbor search without loading the data stored in the auxiliary storage device (hard disk or the like) of the computer 1 into memory. By analogy with the classification of sorting algorithms into internal sorts and external sorts, this could be called an external nearest neighbor search. For this external nearest neighbor search, the present embodiment uses a B-tree or a B+ tree. These data structures were originally designed for exact-match search, but in the present embodiment they are used for nearest neighbor search.

In the present embodiment, a file corresponding to each of the five databases 13a to 13e is provided in the auxiliary storage device (hard disk or the like) of the computer 1. Within each file, the IP address values (and IP addresses) are stored in a B-tree or B+ tree structure.
Because the B-tree or B+ tree is built inside the file, a search only has to move within the file. Not all of the data has to be loaded into memory, which is efficient.
The following describes how storing IP address values (IP addresses) in a file using a B-tree or a B+ tree reduces disk space consumption and allows the nearest neighbor search to run without loading the data into memory.
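As a minimal sketch of what "moving within the file" can look like, the following Python code stores tree nodes as fixed-size records and reads only the nodes on the search path. The record layout, the order `K = 2`, the node numbering, and the names `write_node`, `read_node`, and `contains` are assumptions made for illustration; the patent does not specify a file format.

```python
import io
import struct

K = 2  # order k: a node holds at most 2k keys (assumed value for this sketch)
# Hypothetical record layout: key count, 2k keys, 2k+1 child node numbers (-1 = none).
NODE = struct.Struct(f"<H{2 * K}I{2 * K + 1}i")


def write_node(f, index, keys, children):
    """Write one node record at its fixed position in the file."""
    padded_keys = list(keys) + [0] * (2 * K - len(keys))
    padded_children = list(children) + [-1] * (2 * K + 1 - len(children))
    f.seek(index * NODE.size)
    f.write(NODE.pack(len(keys), *padded_keys, *padded_children))


def read_node(f, index):
    """Seek to one node and read only that record."""
    f.seek(index * NODE.size)
    fields = NODE.unpack(f.read(NODE.size))
    count = fields[0]
    keys = list(fields[1:1 + count])
    children = [c for c in fields[1 + 2 * K:1 + 2 * K + count + 1] if c != -1]
    return keys, children


def contains(f, key, root=0):
    """Exact match search that touches one node per tree level."""
    index = root
    while True:
        keys, children = read_node(f, index)
        if key in keys:
            return True
        if not children:
            return False
        index = children[sum(1 for k in keys if k < key)]


# An in-memory buffer stands in for the database file in this demo.
demo = io.BytesIO()
write_node(demo, 0, [60], [1, 2])   # root
write_node(demo, 1, [20, 40], [])   # left leaf
write_node(demo, 2, [77, 88], [])   # right leaf
print(contains(demo, 77), contains(demo, 82))
```

With fixed-size records, the position of any node follows from its number, so a lookup needs one seek and one short read per level rather than a full load of the file.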

[6.2.1 Storing IP Addresses Using a B-Tree]
This section introduces the definition of the B-tree and its insertion and deletion operations, and then describes how IP addresses are stored in a file using a B-tree and how nearest neighbor search is performed on the B-tree.

[6.2.1.1 B-tree]
A B-tree is a tree structure used for indexes and file systems. It is a kind of balanced multiway tree. A B-tree has an order k that determines the number of keys in a node. A B-tree of order k is defined as follows.

・Every node other than the root stores at least k and at most 2k keys. Let the keys of a node (m of them) be a[1], ..., a[m].
・The root stores at least 1 and at most 2k keys.
・Every node other than a leaf has m + 1 pointers to subtrees, where m is the number of keys stored in that node. Let these pointers be p[0], ..., p[m].
・All leaves are on the same level.
・In every node, the key sequence a[1], ..., a[m] is sorted. Each key a[i] (1 ≦ i ≦ m) is larger than every key in the subtree pointed to by p[i−1] and smaller than every key in the subtree pointed to by p[i].
The above is the definition of a B-tree of order k.
In the present embodiment, IP address values are stored as the keys.
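The conditions in the definition above can be expressed as a small checker. The sketch below is illustrative only: the `Node` class and the `check` function are hypothetical names, and nodes are held in memory rather than in a file.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    keys: list
    children: list = field(default_factory=list)  # empty for a leaf


def check(node, k, is_root=True, low=None, high=None):
    """Return the leaf depth below `node`; raise AssertionError on a violation."""
    m = len(node.keys)
    # The root holds 1..2k keys, every other node k..2k keys.
    assert (1 if is_root else k) <= m <= 2 * k, "key count out of range"
    assert node.keys == sorted(node.keys), "keys not sorted"
    # Every key must lie strictly between the separators inherited from above.
    assert all(low is None or low < a for a in node.keys), "key below subtree range"
    assert all(high is None or a < high for a in node.keys), "key above subtree range"
    if not node.children:
        return 0
    assert len(node.children) == m + 1, "need m + 1 subtree pointers"
    bounds = [low] + node.keys + [high]
    depths = {check(child, k, False, bounds[i], bounds[i + 1])
              for i, child in enumerate(node.children)}
    assert len(depths) == 1, "leaves are not all on the same level"
    return depths.pop() + 1
```

For example, `check(Node([60], [Node([20, 40]), Node([77, 88])]), k=2)` passes, while a non-root node holding a single key fails the key count condition for k = 2.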

Because the keys within each node are sorted, a key is located within a node by binary search. A major feature of the B-tree is its buffering effect on dynamic file operations. When a node is split as a result of a key insertion, the nodes produced by the split have ample free space, so a large structural adjustment such as another split does not occur for some time. The same applies to deletion. In particular, when insertions and deletions are intermixed, the buffering effect means that node splits and node merges are rarely needed.

[6.2.1.2 Adding a key (IP address value) to a B-tree]
The operation of adding a key (IP address value) to a B-tree is described using the B-tree of order 2 in FIG. 7(a) as an example.

(1) Adding key (IP address value) 10
First, searching the B-tree shown in FIG. 7(a) for key 10 finds the node in which the key should be stored. As shown in FIG. 7(b), this node contains two keys, and the condition is still satisfied after one more key is added, so the key is simply added.

(2) Adding key (IP address value) 30
Searching the B-tree shown in FIG. 7(a) for key 30 finds the node in which the key should be stored. As shown in FIG. 7(c), this node already holds the maximum of four keys. In this case, key 30 is provisionally inserted, the node is then split into two, and the middle key (here, "30") is moved up to the parent node.

As shown in (1) and (2) above, a key is added to a B-tree either by simple insertion or by a split. A split sends the middle key to the parent node. If the parent node already holds the maximum number of keys, the parent node is split in turn and its middle key is sent to its own parent. This can repeat up the tree, and sometimes the root itself is split and the height of the tree increases by one. The height of a B-tree grows only when a split propagates all the way to the root in this manner.
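The simple insertion and split cases can be sketched in Python as follows. This is an in-memory illustration under stated assumptions, not the patented implementation: `Node`, `_insert`, and `insert` are hypothetical names, and the keys in the usage note are made up because the contents of FIG. 7 are not reproduced in the text.

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class Node:
    keys: list
    children: list = field(default_factory=list)  # empty for a leaf


def _insert(node, key, k):
    """Insert into the subtree; return (middle_key, new_right_node) if node split."""
    pos = bisect_left(node.keys, key)
    if pos < len(node.keys) and node.keys[pos] == key:
        return None  # key already stored
    if not node.children:
        node.keys.insert(pos, key)  # leaf: provisional insertion
    else:
        promoted = _insert(node.children[pos], key, k)
        if promoted is None:
            return None
        middle, right = promoted
        node.keys.insert(pos, middle)  # middle key arrives from the child
        node.children.insert(pos + 1, right)
    if len(node.keys) <= 2 * k:
        return None  # simple insertion, no structural change
    # Overflow (2k + 1 keys): keep k keys, send the middle key up, move k keys right.
    middle = node.keys[k]
    right = Node(node.keys[k + 1:], node.children[k + 1:])
    node.keys, node.children = node.keys[:k], node.children[:k + 1]
    return middle, right


def insert(root, key, k):
    """Insert `key` and return the (possibly new) root."""
    promoted = _insert(root, key, k)
    if promoted is None:
        return root
    middle, right = promoted
    return Node([middle], [root, right])  # root split: height grows by one
```

With k = 2, inserting 30 into a full node holding the hypothetical keys 20, 25, 35, 40 provisionally yields five keys; the node is split into [20, 25] and [35, 40], and 30 moves up, as in case (2).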

[6.2.1.3 Deleting a key (IP address value) from a B-tree]
The operation of deleting a key from a B-tree is described using the B-tree of order 2 in FIG. 8(a) as an example.

(1) Deleting key 62
Searching for key 62 finds a leaf node. Even after key 62 is deleted, two keys remain in the node, so the condition is still satisfied. The key is therefore simply deleted, as shown in FIG. 8(b).

(2) Deleting key 85
Searching for the node containing key 85 finds a leaf node.
When key 85 is deleted, only key 80 remains and the condition is no longer satisfied. As shown in FIG. 8(c), the adjacent sibling nodes are examined. The sibling immediately to the left holds three keys, so a key is moved to the deficient node by way of the parent node, with the parent key 77 taking part in the move. Obtaining a key from an adjacent sibling node in this way is referred to as underflow.
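The move through the parent in case (2) can be sketched for leaf nodes as a rotation. The function name `borrow_from_left` and the keys of the left sibling (62, 65, 70) and of the parent are assumptions chosen for illustration; only the parent key 77 and the remaining key 80 come from the text.

```python
def borrow_from_left(parent_keys, sep, left, deficient):
    """Rotate one key from `left` through parent_keys[sep] into `deficient`."""
    deficient.insert(0, parent_keys[sep])  # the separator key moves down
    parent_keys[sep] = left.pop()          # the largest key of the left sibling moves up


# Hypothetical leaf contents; k = 2, so every leaf needs at least two keys.
parent = [60, 77, 88]
left_sibling = [62, 65, 70]
deficient = [80]  # key 85 has just been deleted
borrow_from_left(parent, 1, left_sibling, deficient)
print(parent, left_sibling, deficient)
```

Moving the key through the parent rather than directly between siblings keeps the separator in the parent consistent with the keys on both sides of it.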

(3) Deleting key 90
Searching for key 90 finds a leaf node.
In this case too, the node holds only the minimum number of keys. In addition, none of the adjacent sibling nodes has a key to spare. In such a case, as shown in FIG. 8(d), the node is combined with an adjacent node into a single node. This operation is called a merge. In this example the parent node itself then no longer satisfies the condition, so a merge of the parent node with its own sibling is triggered in turn. Merges can thus propagate toward the root, and when the root node takes part in a merge, the height of the B-tree decreases by one.

(4) Deleting key 88
Examples (1), (2), and (3) all deleted keys contained in leaf nodes. Key 88, however, is a key in a branch node. In this case, the key that immediately follows key 88 in sorted order is searched for first. In the example, that is key 90. Key 90 is copied into the position where 88 was stored, and key 90 is then deleted from the leaf node. The deletion of key 90 from the leaf node proceeds as described in (3).

[6.2.1.4 Nearest neighbor search on a B-tree]
Nearest neighbor search on a B-tree is described using the B-tree of order 2 in FIG. 9 as an example.
Let the key (IP address value) for which the nearest neighbor is searched be 82. The search first moves to the root node. A binary search within the root node finds the position at which key n < 82 < key n+1. In FIG. 9, this is the position 60 < 82.

Once the position with key n < 82 < key n+1 has been found, the distances from the search key to key n and to key n+1 are calculated. A distance is calculated as the absolute value of the difference between two key values. In FIG. 9, 82 − 60 = 22 is the shortest distance to a key in the root node.
Next, the search follows the pointer to the right of 60 to the child node. Binary search and distance calculation are performed in the child node as well. In FIG. 9, 82 − 77 = 5 and 88 − 82 = 6, so the shortest distance to a key in this node is 5. If the B-tree is deeper, this is repeated until a leaf node is reached.

Binary search and distance calculation are also performed within the leaf node. The smallest of the shortest distances found in the visited nodes is returned as the shortest distance between the extracted address and its nearest address, and the processing ends. In FIG. 9, the binary search in the leaf node finds the position 80 < 82 < 85, and the distance calculation gives 82 − 80 = 2 and 85 − 82 = 3, so the shortest distance to a key in the leaf node is 2. The smallest of the shortest distances is 2, so 2 is returned as the distance from key 82 (the address value of the extracted IP address) to its nearest address, the IP address corresponding to key 80 is returned as the nearest address, and the processing ends.
Because the keys of a B-tree are stored in sorted order, the nearest neighbor can be found in a single descent of the tree, so the processing is fast.
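The single-descent search described above can be sketched in Python as follows. The `Node` class and the `nearest` function are hypothetical names, the tree is held in memory, and the example tree only reuses the keys named in the text (60, 77, 88, 80, 85); all other keys are invented so that the tree is a valid B-tree of order 2.

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class Node:
    keys: list
    children: list = field(default_factory=list)  # empty for a leaf


def nearest(root, query):
    """Return (distance, key) of the stored key closest to `query`."""
    best = None
    node = root
    while True:
        pos = bisect_left(node.keys, query)  # keys[pos-1] < query <= keys[pos]
        # Only the two keys that bracket the query can be nearest in this node.
        for key in node.keys[max(pos - 1, 0):pos + 1]:
            candidate = (abs(query - key), key)
            if best is None or candidate < best:
                best = candidate
        if not node.children:
            return best
        node = node.children[pos]  # descend between the bracketing keys


tree = Node([60], [
    Node([20, 40], [Node([5, 10]), Node([25, 30]), Node([45, 50])]),
    Node([77, 88], [Node([62, 70]), Node([80, 85]), Node([90, 95])]),
])
print(nearest(tree, 82))
```

The descent is sufficient because the largest stored key below the query and the smallest stored key above it both lie on the root-to-leaf search path, so no other node has to be visited.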

[6.2.2 Storing IP Addresses Using a B+ Tree]
The following describes the characteristics of the B+ tree, which is an extension of the B-tree, its key insertion and deletion operations, how IP addresses are stored in a file using a B+ tree when a B+ tree is adopted for each of the databases 13a to 13e of the present embodiment, and nearest neighbor search on the B+ tree.

[6.2.2.1 B+ tree]
A B+ tree is a balanced tree that extends the B-tree. It is divided into the root and internal nodes, which are index pages, and the leaf nodes, which are data pages.
In a B+ tree, all data is held in the leaves. The nodes from the root down to the leaves hold only index keys and branches and serve as an index, and the storage efficiency is better than that of a B-tree. The index pages of a B+ tree are built up in the course of inserting and deleting data. The basic definition is the same as that of the B-tree described above.

[6.2.2.2 Adding a key to a B+ tree]
The operation of adding a key to a B+ tree is described using the B+ tree of order 2 in FIG. 10(a) as an example.

(1) Adding key 28
Searching with key 28 finds the leaf node in which the key should be stored.
This leaf node holds two keys and has not reached the maximum number of keys, so the key is simply added. FIG. 10(b) shows the tree after the addition.

(2) Adding key 70
Searching for key 70 finds the leaf node in which the key should be stored.
However, this leaf node already holds the maximum of four keys. In this case, key 70 is provisionally added, the middle key is moved up to the parent index node, and the leaf node is split.
FIG. 10(c) shows the tree after the split.

If the parent index node already holds the maximum number of keys, its middle index key is moved up to its own parent index node and the index node is split.
FIG. 10(d) shows the tree after key 95 is added.

[6.2.2.3 Deleting a key from a B+ tree]
The operation of deleting a key from a B+ tree is described using the B+ tree of order 2 in FIG. 11(a) as an example.

(1) Deleting key 70
Searching for the node containing key 70 finds a leaf node.
This node holds 60, 65, and 70, and two keys remain after key 70 is deleted, so the condition is satisfied. The key is therefore simply deleted. FIG. 11(b) shows the tree after the deletion.

(2) Deleting key 25
Searching for key 25 finds a leaf node. This leaf node holds 25, 28, and 50, and two keys remain after key 25 is deleted, so the condition is satisfied and the key can simply be deleted from the leaf. However, the index node also contains 25, so 25 is deleted from the index node and 28 is then added to the index node. FIG. 11(c) shows the tree after key 25 is deleted.

(3) Deleting key 60
Searching for key 60 finds a leaf node. This leaf node holds only two keys, and deleting key 60 would violate the condition, so the node is merged with a sibling node.
Furthermore, the root index node contains 60, so 60 is deleted from it, and the child index nodes of the root are merged to become the new root index node. FIG. 11(d) shows the tree after key 60 is deleted.

[6.2.2.4 Nearest neighbor search on a B+ tree]
Nearest neighbor search on a B+ tree is described using the B+ tree of order 2 in FIG. 12 as an example. Let the key for which the nearest neighbor is searched be 57. The search first moves to the root node. A binary search within the root node finds the position at which key n < 57 < key n+1. In FIG. 12, this is the position 50 < 57 < 75.

Once the position with key n < 57 < key n+1 has been found, the distances from the search key to key n and to key n+1 are calculated. In FIG. 12, 57 − 50 = 7 and 75 − 57 = 18, so the shortest distance to a key in the root node is 7.

Next, the search follows the pointer between 50 and 75 to the child node. Binary search and distance calculation are performed in the child node as well. If the B+ tree is deeper, this is repeated until a leaf node is reached. Binary search and distance calculation are performed within the leaf node, the smallest of the shortest distances is returned, and the processing ends. In FIG. 12, the binary search in the leaf node finds the position 55 < 57 < 60, and the distance calculation gives 57 − 55 = 2 and 60 − 57 = 3, so the shortest distance to a key in the leaf node is 2. The value 2, which is the smallest of the shortest distances, is returned together with the corresponding IP address, and the processing ends.
Because the keys of a B+ tree are stored in sorted order just as in a B-tree, the nearest neighbor can be found in a single descent of the tree, so the processing is fast.
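The same descent can be sketched for a B+ tree. This illustration assumes that every index key is a copy of a key that is actually present in a leaf and that keys equal to an index key are stored in the subtree to its right; the patent text does not state either convention explicitly. `Node` and `nearest_bplus` are hypothetical names, and the leaf contents other than 55 and 60 are invented.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class Node:
    keys: list
    children: list = field(default_factory=list)  # empty for a leaf (data page)


def nearest_bplus(root, query):
    """Return (distance, key) of the key closest to `query` along the search path."""
    best = None
    node = root
    while True:
        pos = bisect_right(node.keys, query)  # keys[pos-1] <= query < keys[pos]
        for key in node.keys[max(pos - 1, 0):pos + 1]:
            candidate = (abs(query - key), key)
            if best is None or candidate < best:
                best = candidate
        if not node.children:
            return best
        node = node.children[pos]


# Root index [50, 75] as in the text; leaf contents are assumed for the demo.
tree = Node([50, 75], [Node([25, 28]), Node([50, 55, 60]), Node([75, 80])])
print(nearest_bplus(tree, 57))
```

Under the stated assumption, comparing against the index keys on the way down covers the case where the closest key sits at the start of the neighboring leaf, so the leaf links do not have to be followed.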

[7. Experiments]
This section reports the results of storing IP addresses in files using the B-tree and the B+ tree described above and comparing the size and search speed with those of the conventional storage method (storage using the hierarchical structure of directories).

[7.1 Comparison of file size and search speed for different values of the order k]
B-trees and B+ trees have an order k that determines the number of keys in a node. The file sizes and search times of the B-tree and the B+ tree were compared for different values of k.

The data used are the IP address sets GIP and BIP used in the current NNIPF. The numbers of IP addresses in GIP and BIP are as follows.
GIP: set of IP addresses of normal mail senders (number of IP addresses: 842)
BIP: set of IP addresses of spam mail senders (number of IP addresses: 3037)

Tables 1 to 4 show the file sizes and search speeds of the B-tree and the B+ tree as the order k is varied, and FIGS. 13 to 16 show graphs summarizing the results.

The table and graph for the B-tree on GIP (Table 1 and FIG. 13) show that the file size is smallest when k is 32 and that the exact match time is shortest when k is 8.
However, the difference in exact match time is very small, 33 [ms] for k = 8 versus 33.9 [ms] for k = 32, so it is preferable to use k = 32, which gives the smallest file size, for the B-tree of GIP.

The table and graph for the B-tree on BIP (Table 2 and FIG. 14) show that the file size is smallest when k is 32 and that the exact match time is shortest when k is 8.
However, the difference in exact match time is very small, 33.8 [ms] for k = 8 versus 35 [ms] for k = 32, so it is preferable to use k = 32, which gives the smallest file size, for the B-tree of BIP.

The table and graph for the B+ tree on GIP (Table 3 and FIG. 15) show that the file size is smallest when k is 32 and that the exact match time is shortest when k is 8.
However, the difference in exact match time is very small, 12.6 [ms] for k = 8 versus 14.0 [ms] for k = 32, so it is preferable to use k = 32, which gives the smallest file size, for the B+ tree of GIP.

The table and graph for the B+ tree on BIP (Table 4 and FIG. 16) show that the file size is smallest when k is 32 and that the exact match time is shortest when k is 8.
However, the difference in exact match time is very small, 13.3 [ms] for k = 8 versus 15.5 [ms] for k = 32, so it is preferable to use k = 32, which gives the smallest file size, for the B+ tree of BIP.
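To put the four comparisons side by side, the following short Python sketch computes the absolute and relative increase in exact match time when moving from k = 8 to k = 32. It uses only the times quoted in the text; the dictionary layout and the function name `slowdown_percent` are illustrative choices, and the file sizes from Tables 1 to 4 are not reproduced here.

```python
# Exact match times [ms] for k = 8 and k = 32, as quoted in the text.
times = {
    "B-tree, GIP": (33.0, 33.9),
    "B-tree, BIP": (33.8, 35.0),
    "B+ tree, GIP": (12.6, 14.0),
    "B+ tree, BIP": (13.3, 15.5),
}


def slowdown_percent(t8, t32):
    """Relative increase of the k = 32 time over the k = 8 time, in percent."""
    return round((t32 - t8) / t8 * 100, 1)


for name, (t8, t32) in times.items():
    print(f"{name}: +{t32 - t8:.1f} ms ({slowdown_percent(t8, t32)} %)")
```

The absolute differences lie between 0.9 ms and 2.2 ms in all four cases, which is the sense in which the text treats them as small compared with the saving in file size.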

[7.2 Comparison of size and search time with the conventional method]
The size and search speed of the B-tree and the B+ tree with the order k = 32 determined in 7.1 were compared with those of the conventional method, which uses the hierarchical structure of directories.
Table 5 shows the comparison of size between the conventional method, the B-tree, and the B+ tree. FIG. 17 shows the size comparison plotted on a logarithmic axis.

Table 5 and FIG. 17 show that the size was successfully reduced compared with storage in the hierarchical structure of directories. They also show that the file size of the B+ tree is smaller than that of the B-tree, which indicates that the B+ tree has better storage efficiency than the B-tree.

Next, the differences in search time are shown. FIGS. 18 and 19 show the comparison of exact match speed. FIGS. 18 and 19 show that, for exact match, the B-tree is slightly faster than the conventional method and the B+ tree is the fastest.
The B-tree and the B+ tree are considered to be faster than the conventional method because the conventional method opens directories four times, whereas the B-tree and the B+ tree only need to open a file once.
The B+ tree is considered to be faster than the B-tree because the internal nodes of the B+ tree have a better storage ratio than those of the B-tree, which reduces the number of steps needed to reach a leaf node.

Next, FIG. 20 shows the comparison of nearest neighbor search speed. The speed experiment for nearest neighbor search showed that the B-tree and the B+ tree are faster than the conventional method.

[7.3 Comparison of key insertion and deletion operations for the B-tree and the B+ tree]
The speeds of operations such as adding and deleting keys were compared for the case where IP addresses are stored in a file using a B-tree or a B+ tree. FIG. 21 shows the results.
In this experiment, neither the B-tree nor the B+ tree became deep, so large restructuring operations rarely occurred and insertions and deletions were also performed at high speed.

[8. Additional remarks]
The matters disclosed as embodiments in this specification are examples and do not limit the present invention. Various modifications are possible.

Description of symbols
1 Filtering system
11 Interface unit
12 Classification unit
13 Classification IP address database
13a Normal mail sender IP address common database
13b Spam mail sender IP address common database
13c Normal mail sender IP address personal database
13d Spam mail sender IP address common database
13e Advertising mail sender IP address personal database
14 User database
15 Designated IP address storage unit
16 Pass data storage unit
17 Search unit
18 Pass rate calculation unit

Claims

A mail filtering system comprising:
a classification database for registering a set of IP addresses for mail classification; and
a classification unit that extracts the IP address of the mail sender included in the mail header of a received mail, obtains the nearest address closest to the extracted IP address from the classification database by nearest neighbor search, obtains from the nearest address a degree of discrimination for classifying the received mail, and classifies the received mail based on the degree of discrimination,
wherein the classification database includes an advertisement-type mail sender database for registering an IP address set of advertisement-type mail senders, and a non-advertisement-type mail sender database for registering an IP address set of non-advertisement-type mail senders,
the classification unit is configured to perform the nearest neighbor search on each of the advertisement-type mail sender database and the non-advertisement-type mail sender database, to obtain a degree of discrimination indicating the likelihood of advertisement-type mail from the nearest addresses obtained by the nearest neighbor search, and to classify and register the IP address extracted from the received mail into either the advertisement-type mail sender database or the non-advertisement-type mail sender database according to the degree of discrimination, and
the mail filtering system further comprises output means for outputting the IP address set registered in the advertisement-type mail sender database in association with user data.

The mail filtering system according to claim 1, wherein the classification database includes, as the non-advertisement-type mail sender database for registering an IP address set of source servers that have sent non-advertisement-type mail, a spam mail sender database for registering an IP address set of source servers that have sent spam mail and a normal mail sender database for registering an IP address set of source servers that have sent normal mail, and
the classification unit is configured to perform the nearest neighbor search on each of the advertisement-type mail sender database, the spam mail sender database, and the normal mail sender database, and to obtain a degree of discrimination indicating the likelihood of advertisement-type mail from the nearest addresses obtained by the nearest neighbor search.

A mail filtering system comprising:
a classification database for registering a set of IP addresses for mail classification; and
a classification unit that extracts the IP address of the mail sender included in the mail header of a received mail, obtains the nearest address closest to the extracted IP address from the classification database by nearest neighbor search, obtains from the nearest address a degree of discrimination for classifying the received mail, and classifies the received mail based on the degree of discrimination,
wherein the classification database includes an advertisement-type mail sender database for registering an IP address set of advertisement-type mail senders, and a non-advertisement-type mail sender database for registering an IP address set of non-advertisement-type mail senders,
the classification unit is configured to perform the nearest neighbor search on each of the advertisement-type mail sender database and the non-advertisement-type mail sender database, to obtain a degree of discrimination indicating the likelihood of advertisement-type mail from the nearest addresses obtained by the nearest neighbor search, and to classify and register the IP address extracted from the received mail into either the advertisement-type mail sender database or the non-advertisement-type mail sender database according to the degree of discrimination,
the mail filtering system further comprises output means for outputting the IP address set registered in the advertisement-type mail sender database as information usable for determining the appropriateness of the mail distribution method of advertisement-type mail, and
the advertisement-type mail sender database is configured to register the IP address set separately for each user.

The advertisement-type email source address is stored in association with a source IP address of email classified as an advertisement-type email and user data of a user who is the destination of the email. The described email filtering system.

The mail filtering system of any one of -4, further comprising means for counting, among the mails whose source address is a designated IP address, the number of mails classified as advertisement-type mail and/or the number of mails classified as non-advertisement-type mail.

A computer program for causing a computer to function as the mail filtering system according to any one of claims 1 to 5.

An information generation method for generating information usable for determining the appropriateness of a mail distribution method of advertisement-type mail, in a mail filtering system that extracts the IP address of the mail sender included in the mail header of a received mail, obtains the nearest address closest to the extracted IP address by nearest neighbor search from a classification database for registering an IP address set for mail classification, obtains from the nearest address a degree of discrimination for classifying the received mail, and classifies the received mail based on the degree of discrimination,
wherein the classification database includes an advertisement-type mail sender database for registering an IP address set of advertisement-type mail senders, and a non-advertisement-type mail sender database for registering an IP address set of non-advertisement-type mail senders.
The information generation method includes:
Performing the nearest neighbor search on each of the advertisement-type mail sender database and the non-advertisement-type mail sender database;
Obtaining a degree of discrimination indicating the likelihood of advertising mail from the nearest address obtained by the nearest neighbor search;
Classifying and registering the IP address extracted from the received mail into either the advertisement-type mail sender database or the non-advertisement-type mail sender database according to the degree of discrimination;
Outputting the IP addresses registered in the advertisement-type mail sender database in association with user data;
Information generation method.

An information generation method for generating information usable for determining the appropriateness of a mail distribution method of advertisement-type mail, in a mail filtering system that extracts the IP address of the mail sender included in the mail header of a received mail, obtains the nearest address closest to the extracted IP address by nearest neighbor search from a classification database for registering an IP address set for mail classification, obtains from the nearest address a degree of discrimination for classifying the received mail, and classifies the received mail based on the degree of discrimination,
wherein the classification database includes an advertisement-type mail sender database for registering an IP address set of advertisement-type mail senders, and a non-advertisement-type mail sender database for registering an IP address set of non-advertisement-type mail senders.
The information generation method includes:
Registering the IP address set in the advertisement-type mail sender database separately for each user;
Performing the nearest neighbor search on each of the advertisement-type mail sender database and the non-advertisement-type mail sender database;
Obtaining a degree of discrimination indicating the likelihood of advertising mail from the nearest address obtained by the nearest neighbor search;
Classifying and registering the IP address extracted from the received mail into either the advertisement-type mail sender database or the non-advertisement-type mail sender database according to the degree of discrimination;
Outputting the set of IP addresses registered in the advertisement-type mail sender database as information that can be used to determine the appropriateness of the mail delivery method of the advertisement-type mail;
Information generation method.