KR20050111566A

KR20050111566A - Spam mail filtering system via link structure analysis of e-mail

Info

Publication number: KR20050111566A
Application number: KR1020050107842A
Authority: KR
Inventors: 이신영
Original assignee: 이신영
Priority date: 2005-11-11
Filing date: 2005-11-11
Publication date: 2005-11-25

Abstract

The present invention measures the number of times the web document pointed to by each hyperlink in an email is cited (linked) by other web documents. It then performs machine learning using two features, the number of hyperlinks in the email and the number of other web documents that cite those hyperlinks, and classifies spam mail accordingly.

In addition, spam mail is classified after machine learning that uses only the host URL part of each hyperlink in the email as a feature. Machine learning based on content analysis is also performed alongside this hyperlink analysis, and the content analysis results are integrated to further improve spam mail filtering efficiency.

Description

Spam Mail Filtering System via Link Structure Analysis of E-mail

The present invention relates to a system and method for blocking e-mail, and more particularly to a system and method for classifying indiscriminate spam mail sent over the Internet without the user's consent.

Spam mail is advertising email sent indiscriminately to an unspecified number of people who neither want nor requested it. Spam mail takes up individuals' working time and causes stress, and indiscriminate spam mail places an excessive load on the network. It is a representative dysfunction of the information society that causes great damage to individuals, companies, and the nation.

Existing spam mail classification algorithms can be broadly grouped as follows.

① List-Based Filtering

Blacklists and whitelists are maintained at the email-server level to block spam mail. However, spammers (senders of spam mail) keep changing their mail addresses, the lists are expensive to keep up to date, and the method reacts slowly to the newest spam mail.

② Rule-Based Filtering

This method analyzes the header, subject, and body of an email and classifies the email when it finds specific words designated for classification. However, if a spammer alters such a word by mixing in special characters or Chinese characters, for example changing "광고" (advertisement) into "광xx고" or "광**고", the rule-based method has difficulty classifying the mail.

③ Statistical Content-Based Spam Filtering

This method takes spam mail and non-spam legitimate mail as samples, removes stop words, and applies preprocessing such as stemming followed by normalization. It then trains a machine learning method such as a decision tree, a naive Bayesian classifier, a support vector machine (SVM), or an artificial neural network, and classifies email with the trained model.

However, most recent spam mail sends the body content as an image rather than as text, which limits statistical spam classification.

④ Collaborative Filtering

In this method, mail that users of one email server have reported as spam is treated as spam for the other users of that server as well. It is used mainly by email portal service providers and achieves a good classification rate, but because different email servers do not share spam information with one another, it works only within a single email server. In addition, a spammer can insert random code into copies of the same spam mail so that each copy is recognized as a different email, which lowers the classification rate. Spammers also keep changing the sender address and the address of the sending server, which lowers the classification rate further.

⑤ Spam classification through social network analysis

This is a recent method proposed by Boykin. Email senders are represented as nodes on a graph and the sending relationships as links, so that people who frequently exchange email appear as clusters on the graph. The clusters thus represent groups of acquaintances. Nodes that do not belong to a cluster can then be judged to be spammers and classified accordingly.

However, because this method must trace who sent email to whom, it can be used only within a single email server.

Unlike existing rule-based or content-based spam mail classification, the present invention classifies email by measuring the number of web documents that link to the web document pointed to by each hyperlink (hereinafter "link") in the email. The system must therefore be able to extract the various forms of link found in an email, including, at times, links that are not well formed.

To measure the number of web documents that link to the web document pointed to by an extracted link, the invention uses the Google Web API provided by Google, a search engine that uses the PageRank algorithm.

The number of links in the email and the sum of the number of times the web documents of those links are cited by other web documents are then used as features to train a decision tree, after which spam mail is classified. This is a new and highly efficient spam mail classification method.

A decision tree can also be trained on the number of web documents that link to only the host URL part of each link in the email, and the spam mail classification rate can be raised further by merging in a content-based classification method.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

An email can be regarded as a kind of web document, and most emails contain links to other web documents. Spam mail in particular is mostly sent for advertising, so it almost always contains a link to the advertised site, because email marketing for advertising purposes must lead the recipient to that site. Legitimate mail, by contrast, falls into two cases: mail such as newsletters that contains links to other sites, and mail such as personal correspondence that contains only text and no links.

The search engine Google, meanwhile, uses the PageRank algorithm to calculate the fitness of every web document and holds the indexed results. Here, fitness is a numerical measure of how many links from other web documents refer to a given web document. In other words, it quantifies the number of in-links coming into a web document. A web document with high fitness is linked by many other web documents and can therefore be said to be popular and more useful. A web document with low fitness is rarely linked by other web documents, which means it is neither popular nor useful.

Entering the 'link:' option in Google's search box returns, from the results Google has already indexed, the number and URLs of the web documents that link to a given web document. For example, searching for 'link:http://www.seoul.go.kr' shows 212 web documents (as of November 2005) that link to the Seoul city homepage. With 212 web documents linking to it, the Seoul city homepage can be said to be popular and useful. By contrast, searching with the 'link:' option for the applicant's personal blog ('link:http://blog.naver.com/bearrhee') shows 0 web documents linking to it, so the applicant's blog can be said to be unpopular and not objectively useful.

In the system of the present invention, the fitness of a link was measured using the Google Web API provided by Google.

Referring to Fig. 1, most spam mail contains links to other web documents, but almost no web documents link to those documents.

Referring to Fig. 2, by contrast, when legitimate mail contains links, relatively many web documents link to the linked documents.

Analyzing the link structure of email in this way reveals the following.

"If an email contains links and almost no web documents link to the web documents of those links, the email is likely to be spam mail. If the email contains no links (it consists only of text), or it contains links and many web documents link to the corresponding web documents, the email is likely to be legitimate mail."

Fig. 3 shows the spam mail classification procedure based on link structure analysis. It is divided broadly into a learning process and a classification process. In the learning process, a sufficient number of spam and legitimate mails are first collected as training data. From the training data, the system extracts not only links in HTML tag form but also URLs of the form 'http://' that appear in the text. For each extracted link, the Google Web API is used to measure the number of web documents that link to the web document corresponding to that link. If an email contains several links, the numbers of web documents linking to each link's web document are summed and the total is used as the fitness of the email. The number of links in the email and this summed fitness are then used as features to train a decision tree, one of several machine learning methods, and generate a model.
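As a minimal sketch of the feature-extraction step in the learning process, the Python below collects links from `<a href>` attributes and from bare `http://` URLs in text, then computes the link count L and the summed fitness F. The function names, the regular expression, the de-duplication of repeated URLs, and the `inlink_count` callable (a stand-in for the Google Web API lookup, whose actual interface is not shown in this document) are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser
from typing import Callable, List, Tuple

# Assumed pattern for bare URLs in text; the source only says 'http://'-style URLs.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


class _LinkCollector(HTMLParser):
    """Collects href values of <a> tags and bare URLs found in text nodes."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.links.extend(URL_RE.findall(data))


def extract_links(body: str) -> List[str]:
    """Return the distinct links in an email body, in order of appearance."""
    collector = _LinkCollector()
    collector.feed(body)
    collector.close()
    # Assumption: a URL repeated in the same email is counted once.
    return list(dict.fromkeys(collector.links))


def link_features(body: str, inlink_count: Callable[[str], int]) -> Tuple[int, int]:
    """Return (L, F): the number of links and the sum of their in-link counts."""
    links = extract_links(body)
    return len(links), sum(inlink_count(url) for url in links)
```

In use, `inlink_count` could wrap whatever in-link lookup is available; a dictionary of precomputed counts is enough for experimenting with the two features.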

In generating the model, trees of various sizes are produced as the confidence level of the decision tree's rule pruning is varied. That is, the number of generated rules changes with the pruning confidence level. One example of a rule set generated from real data is as follows.

Let D be the email to be classified, L the number of links in the email, and F the fitness of the web documents corresponding to those links.

Rule 1. If L > 0 and F > 42, then D is legitimate mail.

Rule 2. If L > 6 and F <= 42, then D is legitimate mail.

Rule 3. If L = 0, then D is legitimate mail.

Rule 4. If 0 < L <= 6 and F <= 42, then D is spam mail.
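The four example rules above can be written out directly as a small function. This is only a sketch of that one example rule set; the function name and the returned labels are assumptions, and the thresholds 6 and 42 are the ones quoted in the rules, not general constants.

```python
def classify_by_example_rules(num_links: int, fitness: int) -> str:
    """Apply the example rule set: num_links is L, fitness is F."""
    if num_links == 0:
        return "legitimate"  # Rule 3: no links at all
    if fitness > 42:
        return "legitimate"  # Rule 1: L > 0 and F > 42
    if num_links > 6:
        return "legitimate"  # Rule 2: L > 6 and F <= 42
    return "spam"            # Rule 4: 0 < L <= 6 and F <= 42
```

Together the four rules cover every combination of L >= 0 and F, so the function always returns a label.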

The classification process classifies a newly received email whose spam status is unknown. As in the learning process, the email is first preprocessed to extract its links (hyperlinks), and their fitness is measured with the Google Web API. In the present invention, fitness is defined simply as the total number of other web documents that link to the links present in a page, as given by the formula in Fig. 4. The number of links and the fitness are then fed to the decision tree model generated in the learning process to classify the email as spam mail or not.
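Fig. 4 is not reproduced in this text. Going only by the prose definition above, the fitness of an email D containing links u_1, ..., u_L could be written as follows. The symbols u_i and inlinks are notation assumed here and are not taken from the figure.

```latex
F(D) = \sum_{i=1}^{L} \mathrm{inlinks}(u_i)
```

Here inlinks(u_i) denotes the number of web documents that link to the web document at u_i, and F(D) = 0 when the email contains no links (L = 0).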

It is also possible to cut out the host (server) part of the URL of a link extracted from an email and measure the fitness of the host URL. For example, in addition to extracting a link such as 'http://www.makdrim.com/md/shaver/shaver.php?pcode=inventad' from an email and measuring the fitness of the corresponding web document, the host part 'http://www.makdrim.com/' can be cut out and the fitness of the web document for that host measured. The two measurements give different results.
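Cutting a link down to its host URL can be sketched with Python's standard `urllib.parse` module. The function name `host_url` is an assumption for illustration, and the sketch assumes the link is an absolute URL with a scheme.

```python
from urllib.parse import urlsplit


def host_url(link: str) -> str:
    """Return only the scheme and host of an absolute URL, with a trailing slash."""
    parts = urlsplit(link)
    return f"{parts.scheme}://{parts.netloc}/"
```

For the link in the example above, `host_url` returns 'http://www.makdrim.com/', which can then be measured separately from the full link.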

It is also possible to treat the URL of an image in an email as a kind of link and measure it. For example, when the email contains an image tag such as '<img src="http://www.seoul.go.kr/images/main_eventhead.gif">', either 'http://www.seoul.go.kr/images/main_eventhead.gif' or the host part alone, 'http://www.seoul.go.kr/', can be measured.
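Collecting image URLs so that they can be measured like any other link might look like the sketch below, which uses the standard `html.parser` module. The class and function names are assumptions for illustration.

```python
from html.parser import HTMLParser
from typing import List


class _ImageCollector(HTMLParser):
    """Collects the src attribute of every <img> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def image_urls(body: str) -> List[str]:
    """Return the image URLs found in an HTML email body, in order."""
    collector = _ImageCollector()
    collector.feed(body)
    collector.close()
    return collector.sources
```

Each returned URL, or its host part, could then be passed to the same in-link measurement used for hyperlinks.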

The fitness of each full link and of its host part can thus be measured separately, and the results can be merged with a content-analysis-based method to raise the spam mail classification rate further.

The present invention extracts URL-form links such as hyperlinks and images from an email and measures the fitness of the web documents of those links to classify whether the email is spam mail. An earlier approach compiled links from spam mail into a list and classified by pattern matching, but it cannot classify new spam mail that is not yet on the list. Because the present invention classifies by measuring the fitness of the web document that a link in the email points to, it can classify new email of any form. In addition, spam mail content is now often sent as an image rather than as text, and existing content analysis methods are limited because they cannot analyze images. The present invention overcomes this weakness of content-based methods as well, and is an entirely new and groundbreaking spam mail classification method.

The present invention is therefore robust against every form of attack known to date, such as spammers rendering the content of spam mail as images, inserting random code into the text, or frequently changing the sending server.

This is because, fundamentally, if the spammer's web page to which the spam mail links is not linked by many other web documents, the present invention judges its fitness to be low and classifies the mail as spam mail.

Furthermore, because the method of the present invention does not analyze the content of the email, it has the advantage of effectively classifying spam mail in any language, not only Korean and English.

Fig. 1 is a diagram analyzing the link structure of spam mail on the web.

Fig. 2 is a diagram analyzing the link structure of legitimate (non-spam) mail on the web.

Fig. 3 is a flowchart of the proposed spam mail classification process based on link structure analysis.

Fig. 4 shows the fitness of a web page as used in the present invention.

Claims

A spam mail classification method that extracts URLs such as hyperlinks and images from an email, measures the degree to which the web documents corresponding to those URLs are linked from other web documents, and classifies the email accordingly.

The method of claim 1, wherein, in addition to the entire URL of the link, the host (server) part of the URL is cut out and the degree to which the host URL is linked from other web documents is measured.