KR100452910B1

KR100452910B1 - Method and Apparatus for Filtering Spam Mails

Info

Publication number: KR100452910B1
Application number: KR10-2002-0009412A
Authority: KR
Inventors: 신중호; 남세동; 김영준; 김경태
Original assignee: 주식회사 네오위즈
Priority date: 2002-02-22
Filing date: 2002-02-22
Publication date: 2004-10-14
Also published as: KR20030069567A

Abstract

The present invention relates to a spam mail filtering method and apparatus that classify bulk mail based on the total number of users who have received advertising and commercial mail distributed indiscriminately and in large volume over the Internet, and that recognize spam mail based on the classified bulk mail. From the characters contained in each mail received from outside, the invention extracts the N (N is a natural number) most frequently used characters as a representative string representing the mail; accumulates, per representative string, the number of mails having the same representative string; classifies mails whose accumulated count per representative string is at least a preset number as similar mails sent in bulk; and compares the keywords contained in each mail classified as similar mail with preset keywords, recognizing a mail whose similarity is at least a preset value as spam mail. In addition, among the mails classified as similar, a mail that has the same representative characters but whose content has been only partly altered can still be judged a similar mail and recognized as spam mail. Mails that share the same representative characters merely by coincidence, but whose contents are not similar, are excluded, which improves the accuracy of spam mail recognition.

Description

Method and Apparatus for Filtering Spam Mail Based on the Identification of Bulk Mail {Method and Apparatus for Filtering Spam Mails}

The present invention relates to a method and apparatus for filtering spam mail distributed indiscriminately over the Internet, and more particularly to a method and apparatus that classify bulk mail based on the number of users who have received the same mail and filter spam mail based on the classified bulk mail.

One of the biggest problems in Internet mail service is advertising and commercial electronic mail (e-mail, hereinafter simply "mail") distributed indiscriminately over the Internet. Such advertising and commercial mail, commonly called spam mail, means in the broad sense mail that the user does not need to read, that is, mail that the user wants to delete without viewing its content. In the narrow sense it means mail distributed indiscriminately and in bulk through telecommunications or the Internet, regardless of whether the user wishes to receive it.

Because of the advertising and commercial mail distributed in bulk, called spam mail, many users receive unwanted mail regardless of their wishes and bear the administrative burden of deleting it. Unnecessary spam mail also causes enormous inefficiency, such as increased traffic across the entire Internet.

There are several methods for filtering spam mail received through an Internet mail service, as follows.

The first method maintains a list of the e-mail addresses of senders of spam mail and treats mail from any address on the list as spam mail. In other words, the mail server identifies the addresses that mainly send spam mail, builds a block list, and recognizes mail sent from those addresses as spam mail. However, this list management method requires manual work to judge whether mail is spam and to maintain each address by hand. In addition, once an address that has sent spam mail is registered on the list, all mail sent from that address is treated as spam mail, so even an occasional important mail from that address cannot be read. Furthermore, when a spammer generates the sender address arbitrarily or changes it for every mailing instead of using an existing address, the blocking function is rendered almost useless.

The second method analyzes the subject or content of received mail and recognizes the mail as spam mail when it contains at least a preset number of specific words that indicate advertising, commercial, or promotional content, such as 'advertisement', 'money-making', or 'promotion'. Unlike the first method, it judges whether a mail is spam by analyzing the mail content rather than the sender's address. However, it wrongly recognizes as spam many non-spam mails that happen to contain specific words such as 'advertisement' or 'money-making', and it is practically impossible to analyze every word that characterizes spam mail.
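The second prior-art method can be sketched as follows. This is only a minimal illustration: the word list, the threshold, and the function name `looks_like_spam` are hypothetical, and counting distinct matching words is one assumed reading of "at least a preset number of specific words".

```python
# Hypothetical word list and threshold, chosen only for illustration.
SPAM_WORDS = {"advertisement", "money-making", "promotion"}
THRESHOLD = 2


def looks_like_spam(subject: str, body: str) -> bool:
    """Flag a mail when it contains at least THRESHOLD distinct spam words."""
    text = f"{subject} {body}".lower()
    hits = sum(1 for word in SPAM_WORDS if word in text)
    return hits >= THRESHOLD


# A typical advertising mail is flagged.
flagged_ad = looks_like_spam("Advertisement", "A money-making chance!")
# A legitimate business mail that mentions two of the words is flagged too,
# which is the false-positive weakness described in the text.
flagged_legit = looks_like_spam(
    "Re: advertisement budget", "The promotion campaign starts Monday."
)
```

Both calls return `True`, which shows why word matching alone tends to misclassify ordinary mail.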

The third method uses a social filtering technique: users report to the mail server or to a spam management server whether mail they have received is spam, and spam mail is filtered on that basis. For a mail that at least a preset number of users have reported as spam, the mail server informs the other users who received the same mail that it is spam. Because the criterion is based on statistics of the users' own spam judgments about the mail, this method allows relatively accurate spam determination from the user's point of view compared with the second method described above. However, it cannot determine accurately whether a mail is spam when not enough user judgments about it have been collected, and it cannot determine quickly whether a mail is spam until reports from at least the preset number of users have accumulated for that mail.
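The report-counting logic of the social filtering method might look like the sketch below. The class name `ReportRegistry`, the threshold of 3, and the identifiers are assumptions made for illustration, not part of the source.

```python
from collections import defaultdict


class ReportRegistry:
    """Collects user spam reports per mail (hypothetical helper)."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold          # preset number of reporting users
        self._reporters = defaultdict(set)  # mail id -> ids of reporting users

    def report(self, mail_id: str, user_id: str) -> None:
        # A set is used so that repeated reports by one user count once.
        self._reporters[mail_id].add(user_id)

    def is_spam(self, mail_id: str) -> bool:
        return len(self._reporters.get(mail_id, ())) >= self.threshold


registry = ReportRegistry(threshold=3)
registry.report("m1", "u1")
registry.report("m1", "u2")
too_early = registry.is_spam("m1")   # only two reporters so far
registry.report("m1", "u3")
decided = registry.is_spam("m1")     # third reporter reaches the threshold
```

Until the third report arrives the mail cannot be classified, which is the latency drawback noted above.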

Meanwhile, one spam mail recognition method used for spam filtering is same-mail recognition. Once a particular mail has been judged to be spam, same-mail recognition is applied so that the corresponding mail can be handled as spam for every user who received the same mail. A typical same-mail recognition method examines information such as the sender's address, the subject, and the content of the mail, or a combination of these, and recognizes mails as the same mail when this information matches.

However, when same-mail recognition decides on the basis of the sender's address, the subject, and the like, as in the existing methods described above, it cannot recognize as the same mail a similar mail whose overall content is identical but in which a specific part varies, for example a mail in which only the recipient changes, as in "Dear Hong Gil-dong". Moreover, mails with identical content are not recognized as the same mail when the sender's address has been changed.
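The limitation can be seen in a small sketch of exact-match comparison. The dictionary layout, the function name `is_same_mail`, and the sample addresses are hypothetical.

```python
def is_same_mail(a: dict, b: dict, fields=("sender", "subject", "body")) -> bool:
    """Exact comparison on the chosen fields, as in typical same-mail recognition."""
    return all(a[f] == b[f] for f in fields)


original = {"sender": "ads@example.com", "subject": "Big sale",
            "body": "Dear Hong Gil-dong, visit our shop today."}
# Only the recipient's name in the greeting differs.
personalised = {"sender": "ads@example.com", "subject": "Big sale",
                "body": "Dear Kim Cheol-su, visit our shop today."}
# Identical content, but sent from a different address.
new_sender = dict(original, sender="promo@example.net")

same_greeting_variant = is_same_mail(original, personalised)
same_sender_variant = is_same_mail(original, new_sender)
```

Both comparisons return `False`, even though a human reader would treat all three messages as one mailing.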

Therefore, an object of the present invention is to provide a spam mail filtering method and apparatus that can block and manage spam mail efficiently.

Another object of the present invention is to provide a method and apparatus that judge a large number of mails with similar content to be bulk mail and filter spam mail based on that bulk mail determination.

FIG. 1 is a block diagram of a mail service system;

FIG. 2 is a block diagram of a mail server according to the present invention;

FIG. 3 is a detailed block diagram of the bulk mail analysis unit;

FIG. 4 is a detailed block diagram of the dissimilar mail analysis unit;

FIG. 5 is a detailed block diagram of the spam mail analysis unit; and

FIGS. 6 and 7 are flowcharts each illustrating post-processing of mail recognized as spam mail.

<Explanation of reference numerals for the main parts of the drawings>

110: external mail server    130: mail server

210: mail management unit    230: bulk mail analysis unit

240: dissimilar mail analysis unit    250: spam mail analysis unit

310: representative character extraction unit    320: same mail classification unit

330: bulk mail determination unit    410: character frequency calculation unit

420: frequency average calculation unit    430: frequency average comparison unit

510: morphological analysis unit    520: keyword extraction unit

540: spam mail recognition unit    550: spam term storage unit

According to a preferred embodiment of the present invention for achieving the above objects, a method of filtering spam mail in a mail server is characterized in that, for each mail received from outside, spam mail is determined based on the number of similar mails sent.

According to another embodiment of the present invention, a method of filtering spam mail in a mail server comprises: (a) a representative character extraction step of extracting, from the characters contained in each mail received from outside, the N (N is a natural number) most frequently used characters as a representative string representing the mail; (b) a mail accumulation step of accumulating, per representative string, the number of mails having the same representative string; (c) a bulk mail analysis step of classifying as bulk mail the mails whose accumulated count per representative string is at least a preset number; and (d) a spam mail analysis step of comparing a plurality of keywords contained in each mail classified as bulk mail with preset keywords and recognizing a mail whose similarity is at least a preset value as spam mail.
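Steps (a) to (d) can be sketched end to end as below. This is a rough illustration under stated assumptions: the function names, the tie-breaking rule among equally frequent characters, and the similarity measure in step (d) (the fraction of preset keywords found in the mail) are all hypothetical choices, since the claim itself does not fix them.

```python
from collections import Counter, defaultdict


def representative_string(body: str, n: int = 3) -> str:
    """Step (a): the n most frequent non-space characters of the mail."""
    counts = Counter(ch for ch in body if not ch.isspace())
    # Ties are broken by character order so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return "".join(ch for ch, _ in ranked[:n])


def filter_spam(mails, preset_keywords, bulk_threshold, similarity_threshold, n=3):
    """mails maps mail id -> body text; returns the ids recognised as spam."""
    groups = defaultdict(list)
    for mail_id, body in mails.items():            # steps (a) and (b)
        groups[representative_string(body, n)].append(mail_id)

    spam = set()
    for ids in groups.values():
        if len(ids) < bulk_threshold:              # step (c): not bulk mail
            continue
        for mail_id in ids:                        # step (d): keyword comparison
            words = set(mails[mail_id].lower().split())
            similarity = len(words & preset_keywords) / len(preset_keywords)
            if similarity >= similarity_threshold:
                spam.add(mail_id)
    return spam


mails = {
    1: "buy now big chance buy",
    2: "buy now big chance buy",
    3: "buy now big chance buy",
    4: "lunch at noon?",
}
result = filter_spam(mails, {"buy", "chance"}, bulk_threshold=3,
                     similarity_threshold=0.5)
```

Here mails 1 to 3 share a representative string and reach the bulk threshold, so only they proceed to the keyword comparison; mail 4 never does.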

According to yet another embodiment of the present invention, an apparatus for filtering commercial or advertising spam mail comprises: a mail management unit that, according to a spam mail recognition result, informs the user whether each mail received from outside is spam mail; and a bulk mail analysis unit that analyzes the content of each mail received by the mail management unit to classify mails with similar content, and that judges similar mails to be spam mail when the number of classified similar mails is at least a preset number.

According to yet another embodiment of the present invention, an apparatus for filtering commercial or advertising spam mail comprises: a mail management unit that manages each mail received from outside as spam mail according to a spam mail recognition result; a bulk mail analysis unit that extracts, from the characters contained in each mail received by the mail management unit, the N (N is a natural number) most frequently used characters as representative characters representing the mail, and classifies mails as bulk mail when the number of mails with the same extracted representative characters is at least a preset value; and a spam mail analysis unit that, among the words contained in each mail classified as bulk mail, compares the keywords representing the mail with preset spam keywords, recognizes a mail whose similarity is at least a preset value as spam mail, and provides the spam mail recognition result to the mail management unit.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

Before the detailed description of the present invention, the terms associated with the invention are defined as follows.

* Bulk mail: mail with similar content that has been sent to at least a preset number of recipients

* Spam filtering: a method or service that identifies indiscriminate advertising mail that is not useful to mail users, marks whether a mail is spam, and blocks future receipt of similar mail, for example by automatically moving spam mail to a separate folder

* Same mail: mail whose content is 100% identical

* Similar mail: mail whose content is identical to at least a preset degree

* Spam mail:

1) Broad definition: mail that the user does not need to read, that is, mail that the user may safely delete without even viewing its content

2) Narrow definition: advertising and commercial mail distributed indiscriminately and in bulk through telecommunications or the Internet, regardless of whether the user wishes to receive it

* Spam words: words such as 'advertisement', 'purchase', or 'chance' by which a mail can be recognized as advertising mail

Referring now to FIG. 1, a schematic block diagram of a spam mail filtering system constructed in accordance with the present invention is shown.

The external mail server 110 transmits mail written by an external sender to the mail server 130 of the present invention via the Internet 120. The mail server 130 receives the mail transmitted from the external mail server 110 and delivers it to the users 140 designated in the mail.

In addition, the mail server 130 determines whether a mail has been sent in bulk based on the number of users 140 who received mail with similar content, and uses this as a criterion for determining spam mail. In determining bulk mail, it applies a content-based similar mail determination method in which mails are regarded as similar when their contents have a similarity of at least a preset value. Among the mails that meet the similar mail criterion, the mail server 130 defines those containing advertising or commercial spam words as spam mail, and either manages or discards such mail separately or informs each user who received it that the delivered mail is spam mail.

The user 140 denotes a computer with a built-in web browser, or the user of that computer. The user 140 may discard received mail in response to a spam mail notification from the mail server 130.

FIG. 2 shows a detailed block diagram of the spam filtering mail server 130 according to the present invention shown in FIG. 1.

The mail server 130 includes a mail management unit 210, a mail storage unit 220, a bulk mail analysis unit 230, a dissimilar mail analysis unit 240, and a spam mail analysis unit 250.

The mail management unit 210 passes mail received from outside to the mail storage unit 220 for storage, delivers the received mail to the designated user 140, and selects spam mail from the received mails according to the analysis result of the spam mail analysis unit 250. Depending on that result, the mail management unit 210 may transmit to the user 140 only the mails that were not judged to be spam, or it may deliver all received mail to the user 140 unchanged and then notify the user of the result of the analysis by the spam mail analysis unit 250.

The mail storage unit 220 records a tag on each spam mail selected by the mail management unit 210 to indicate that the mail is spam, and lets the user 140 review the stored spam mail on request.

The bulk mail analysis unit 230 extracts, from the characters contained in each mail received by the mail management unit 210, the N (N is a natural number) most frequently used characters as a representative string representing the mail, groups together similar mails that have the same representative string, and judges those mails to be bulk mail when the number of users who received similar mail with the same representative string is at least a certain number. The process by which the bulk mail analysis unit 230 determines bulk mail is described in detail with reference to FIG. 3.

The dissimilar mail analysis unit 240 excludes from the similar mails those mails, among the bulk similar mails identified by the bulk mail analysis unit 230, that have the same representative string but whose body content is not similar. It also recognizes as similar mail a mail that has the same representative string and whose body content has been only partly changed. The operation of the dissimilar mail analysis unit 240 is described in detail with reference to FIG. 4.

Taking the bulk mail identified by the bulk mail analysis unit 230 as its input, the spam mail analysis unit 250 compares the keywords contained in each such mail with preset keywords, judges the mails whose keywords match to be spam mail, and provides the spam mail analysis result to the mail management unit 210. The operation of the spam mail analysis unit 250 is described in detail with reference to FIG. 5.

FIG. 3 shows the detailed block configuration of the bulk mail analysis unit 230 shown in FIG. 2. The bulk mail analysis unit 230 includes a representative character extraction unit 310, a same mail classification unit 320, a mail management table 330, and a bulk mail determination unit 340.

The representative character extraction unit 310 determines which characters appear in a received mail and computes the frequency, that is, the number of uses, of each character. For example, if a received mail contains "EAAEAB AD CD EEAD", the characters used in the mail are "A B C D E", and their frequencies are "5 1 1 3 4", respectively. The N most frequently used characters are extracted, in descending order of frequency, as the representative string, and the frequency of each of the N extracted characters is then assigned as that character's code, producing a representative code string of N codes. For example, when N = 3, "A E D" is selected as the representative string of the mail, and the frequencies of its characters, "5 4 3", are extracted as the representative code string of the mail. The code assigned to each character uses only the leading (first) digit of that character's frequency. The frequency of a character in a mail may of course run into the tens or hundreds, and the full value could be used in the representative code string as is, but extracting only the first digit is preferable because it keeps the representative code string short.
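The extraction described above can be reproduced with a short sketch. The function name `representative` and the rule for breaking ties between equally frequent characters (alphabetical order) are assumptions; the source does not say how ties are resolved, and whitespace is assumed not to count as a character.

```python
from collections import Counter


def representative(body: str, n: int = 3):
    """Return (representative string, representative code string) for a mail body."""
    counts = Counter(ch for ch in body if not ch.isspace())
    # Highest frequency first; ties broken alphabetically (an assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
    rep_string = "".join(ch for ch, _ in ranked)
    # Only the leading digit of each frequency is kept as the code.
    rep_code = "".join(str(count)[0] for _, count in ranked)
    return rep_string, rep_code


print(representative("EAAEAB AD CD EEAD"))  # ('AED', '543')
```

Because only the leading digit is kept, a mail with frequencies 500, 400 and 300 for its top three characters yields the same code string "543" as the short example.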

The present invention describes a method that uses representative characters to generate the representative code, but a method that generates the representative code from a content digest produced by a hash function may be applied instead. A typical hash-function-based method for generating a representative code from a content digest is the MD5 (Message Digest 5) algorithm, which digests the content of any message into a 128-bit hash code. To apply the MD5 algorithm in the present invention, the content of each received mail is digested into a 128-bit hash code, and the generated 128-bit hash code may be used, as in the present invention, as the representative code value for classifying same mails.
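A minimal sketch of the MD5 alternative, using Python's standard `hashlib` module, is shown below. The function name and the choice of UTF-8 encoding are assumptions for illustration.

```python
import hashlib


def md5_representative_code(body: str) -> str:
    """Digest the mail body into a 128-bit MD5 code, shown as 32 hex digits."""
    return hashlib.md5(body.encode("utf-8")).hexdigest()


code_a = md5_representative_code("Dear Hong Gil-dong, visit our shop today.")
code_b = md5_representative_code("Dear Hong Gil-dong, visit our shop today.")
code_c = md5_representative_code("Dear Kim Cheol-su, visit our shop today.")
```

Identical bodies give identical codes (`code_a == code_b`), while any change to the body, however small, gives a different code (`code_a != code_c`), so a digest of this kind groups same mails rather than similar ones.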

The same mail classification unit 320 groups together, from among the received mails, the mails that have the same representative code string.

The mail management table 330 may be implemented in a memory such as RAM, and accumulates, per representative code string, the mails classified by the same mail classification unit 320. Table 1 below shows an example in which the number of mails classified by the same mail classification unit 320 is accumulated per representative code string.

Table 1
Representative code string | Number of mails | Mail IDs
543 | 120 | 111, 134, 343, ...
123 | 20 | 3434, 434, 34, ...
... | ... | ...

Table 1 shows that the received mails have the representative code strings "543" and "123", that 120 and 20 mails, respectively, have accumulated under each code string, and that the IDs of the mails having each representative code string are listed.
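An in-memory structure of the kind Table 1 describes could be sketched as follows. The class and method names are hypothetical; the source only says that the table may be implemented in memory such as RAM.

```python
from collections import defaultdict


class MailManagementTable:
    """Accumulates mail IDs per representative code string (hypothetical layout)."""

    def __init__(self):
        self._mail_ids = defaultdict(list)  # code string -> list of mail IDs

    def add(self, code_string: str, mail_id: int) -> None:
        self._mail_ids[code_string].append(mail_id)

    def count(self, code_string: str) -> int:
        return len(self._mail_ids.get(code_string, []))

    def mail_ids(self, code_string: str) -> list:
        return list(self._mail_ids.get(code_string, []))


table = MailManagementTable()
for mail_id in (111, 134, 343):
    table.add("543", mail_id)
for mail_id in (3434, 434, 34):
    table.add("123", mail_id)
```

The mail count in Table 1 corresponds to `table.count(code)`, so it does not need to be stored separately from the list of IDs.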

The bulk mail determination unit 340 refers to the mail management table 330 and judges as bulk mail the mails belonging to a representative code string whose accumulated count is at least a preset number, for example 100. This means either that 100 or more users 140 have all received mail with similar content, or that 100 or more similar mails have been sent. Under this determination, the number of mails with the representative code string "543" in Table 1 is 120, which exceeds the preset value, so the mails with IDs 111, 134, and 343 are judged to be bulk mail.
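The threshold decision can be sketched with the counts from Table 1. The function name `bulk_code_strings` is hypothetical, and the preset number of 100 is the example value given in the text.

```python
def bulk_code_strings(counts: dict, preset_number: int = 100) -> set:
    """Return the code strings whose accumulated mail count reaches the preset number."""
    return {code for code, count in counts.items() if count >= preset_number}


# Accumulated counts per representative code string, as in Table 1.
table_1_counts = {"543": 120, "123": 20}
bulk = bulk_code_strings(table_1_counts)  # {'543'}
```

Every mail stored under a code string in the returned set would then be treated as bulk mail and passed on for spam analysis.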

In the present invention, bulk mail is classified based on the number of users 140 who received mail with similar content, or on the number of mails sent, and the mail classified as bulk mail serves as the basis on which the spam mail analysis unit 250, described next, determines spam mail.

Of course, the bulk mail classified by the bulk mail determination unit 340 may simply be regarded as spam mail as is, and the users 140 who received it may be informed that the mail is spam. When bulk mail classified by the bulk mail determination unit 340 is regarded as spam mail in this way, the mail addresses of the senders of that spam mail may be maintained in a list, as described for the first prior-art method, and all mail received from an address on the list may be treated as spam mail and handled accordingly. Likewise, as described for the second prior-art method, the subject or content of the spam mail may be examined, spam information containing specific advertising or commercial words may be maintained in a spam information list, and all received mail containing those specific words may be filtered as spam mail.

FIG. 4 shows the detailed block configuration of the dissimilar mail analysis unit 240 shown in FIG. 2. The dissimilar mail analysis unit 240 includes a character frequency calculation unit 410, a frequency average calculation unit 420, and a frequency average comparison unit 430.

The character frequency calculation unit 410 calculates the frequency of every character contained in each similar mail that the bulk mail analysis unit 230 of FIG. 2 has identified as bulk mail.

빈도수 평균 계산부(420)는 문자 빈도수 계산부(410)에서 계산된 각 문자의 사용 빈도수의 평균을 계산한다. 빈도수 평균 계산부(420)에서 평균값을 계산하는 과정은 하기 표 2를 참조하여 설명한다.The frequency average calculator 420 calculates an average of the frequency of use of each character calculated by the character frequency calculator 410. The process of calculating the average value in the frequency average calculation unit 420 will be described with reference to Table 2 below.

Table 2. Similar mail group (A B C D E): frequency of use of each character

          A     B     C     D     E
Mail 1    500   10    10    300   400
Mail 2    510   10    10    300   400
Mail 3    390   300   300   500   400
...

In Table 2, Mail 1, Mail 2, and Mail 3 are assumed to be mails classified as similar mails by the mass mail analysis unit 230 of FIG. 2, each containing the characters "A B C D E". When N = 3, the representative strings of Mail 1, Mail 2, and Mail 3 are "A E D", "A E D", and "D E A", respectively; they differ in order, but all three share the same representative code "5 4 3". For these mails, the average frequencies of the characters A, B, C, D, and E are 1400/3 (≈ 466.67), 320/3 (≈ 106.67), 320/3 (≈ 106.67), 1100/3 (≈ 366.67), and 1200/3 (= 400), respectively.
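As an illustration only, the following Python sketch derives the N = 3 representative strings from the Table 2 counts. The function name, the alphabetical tie-break, and the use of an unordered set as a grouping key are assumptions made for this sketch; in particular, the set is only a stand-in for the representative code, not the encoding used in the document.

```python
def representative_string(char_counts, n):
    """Return the n most frequently used characters, most frequent first."""
    # Sort by descending count; ties are broken alphabetically (an assumption).
    ranked = sorted(char_counts.items(), key=lambda item: (-item[1], item[0]))
    return [char for char, _count in ranked[:n]]


# Character counts copied from Table 2.
TABLE_2 = {
    "Mail 1": {"A": 500, "B": 10, "C": 10, "D": 300, "E": 400},
    "Mail 2": {"A": 510, "B": 10, "C": 10, "D": 300, "E": 400},
    "Mail 3": {"A": 390, "B": 300, "C": 300, "D": 500, "E": 400},
}

rep_strings = {name: representative_string(counts, 3)
               for name, counts in TABLE_2.items()}

# Order-insensitive key, used here as a hypothetical stand-in for the
# representative code that lets "A E D" and "D E A" fall into one group.
group_keys = {name: frozenset(rep) for name, rep in rep_strings.items()}

for name, rep in rep_strings.items():
    print(name, " ".join(rep))
```

The three mails produce differently ordered representative strings but the same order-insensitive key, which is why they end up in one similar mail group.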

The frequency average comparison unit 430 treats the per-character averages calculated by the frequency average calculation unit 420 and the per-character frequencies of each mail as vectors, compares them, and excludes from the similar mail group any mail whose vector similarity to the averages does not exceed a preset value, for example 90 % to 95 %. In this process, Mail 3 will be excluded.
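The comparison can be sketched in Python as follows. The text does not name the vector similarity measure used at this stage, so cosine similarity is assumed here (it is the measure mentioned later for the spam mail recognition unit), and the 95 % threshold is taken from the upper end of the example range. Function and variable names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def exclude_dissimilar(mails, threshold):
    """Keep only mails whose similarity to the group average exceeds threshold."""
    names = list(mails)
    size = len(mails[names[0]])
    # Per-character average over the whole similar mail group.
    average = [sum(mails[name][i] for name in names) / len(names)
               for i in range(size)]
    scores = {name: cosine(mails[name], average) for name in names}
    kept = [name for name in names if scores[name] > threshold]
    return average, scores, kept


# Frequencies of A, B, C, D, E copied from Table 2.
MAILS = {
    "Mail 1": [500, 10, 10, 300, 400],
    "Mail 2": [510, 10, 10, 300, 400],
    "Mail 3": [390, 300, 300, 500, 400],
}

average, scores, kept = exclude_dissimilar(MAILS, threshold=0.95)
print([round(x, 2) for x in average])
print(kept)
```

Under these assumptions Mail 1 and Mail 2 stay in the group while Mail 3, whose cosine similarity to the average is roughly 0.935, is excluded. A 90 % threshold with the same measure would keep Mail 3, so the outcome depends on both the measure and the preset value.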

Although the present invention describes using character-level vector values to represent the content of a document, a method that computes word-level vector values representing the document content may be applied instead.

Accordingly, through the process performed by the dissimilar mail analysis unit 240, a mail that happens to have the same representative code string but dissimilar content can be judged to be a dissimilar mail and excluded from the similar mails. Conversely, a mail with the same representative code string in which only part of the content has changed, for example only the recipient's name, can still be judged to be a similar mail, so mails with similar content are recognized as similar mails, as the present invention intends. The result of the analysis by the frequency average comparison unit 430 is provided to the spam mail analysis unit 250.

FIG. 5 is a detailed block diagram of the spam mail analysis unit 250 shown in FIG. 2. The spam mail analysis unit 250 includes a morpheme analysis unit 510, a keyword extraction unit 520, a stopword keyword storage unit 530, a spam mail recognition unit 540, and a spam word storage unit 550.

The morpheme analysis unit 510 performs morphological analysis, in morpheme units such as words, on the character strings contained in each mail that the mass mail analysis unit 230 or the dissimilar mail analysis unit 240 has analyzed as bulk mail or similar mail. For example, the phrase "공짜로 쉬운 돈벌기 방법" ("an easy way to make money for free") is analyzed as "공짜 (noun) + 로 (particle) + 쉽 (adjective) + 은 (ending) + 돈벌기 (noun) + 방법 (noun)". The morpheme analysis unit 510 extracts a plurality of words, such as nouns and noun phrases, from the analysis result and provides each extracted word to the keyword extraction unit 520 as a keyword candidate.

The keyword extraction unit 520 selects, from the keyword candidates provided by the morpheme analysis unit 510, keywords that represent the content of the mail. The selection criterion is that every keyword candidate not listed in the stopword keyword storage unit 530 is extracted as a keyword. The stopword keyword storage unit 530 serves as a stopword dictionary: it lists pronouns, determiners, adverbs, interjections, and frequently occurring predicate stems, which have little value as index terms, as well as nouns that generally have little value as index terms. The stopword keyword storage unit 530 may be implemented as a memory device such as a ROM. Although the keyword extraction unit 520 has been described as selecting keywords with a stopword dictionary, it may conversely select only specific keywords by using a term dictionary that is the opposite of a stopword dictionary.
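A minimal Python sketch of this selection step is given below. It assumes the morphological analysis result is already available as (surface form, part of speech) pairs, reusing the example phrase "공짜로 쉬운 돈벌기 방법"; the stopword list contents and all function names are hypothetical, and no real morphological analyzer is invoked.

```python
# Hypothetical stopword dictionary; the real contents are not given in the text.
STOPWORDS = frozenset({"방법"})

# Assumed analyzer output for the example phrase: (surface form, part of speech).
ANALYZED = [
    ("공짜", "noun"),
    ("로", "particle"),
    ("쉽", "adjective"),
    ("은", "ending"),
    ("돈벌기", "noun"),
    ("방법", "noun"),
]


def keyword_candidates(morphemes):
    """Keep nouns as keyword candidates (noun phrases are omitted for brevity)."""
    return [surface for surface, pos in morphemes if pos == "noun"]


def extract_keywords(candidates, stopwords=STOPWORDS):
    """Select every candidate that is not listed in the stopword dictionary."""
    return [word for word in candidates if word not in stopwords]


candidates = keyword_candidates(ANALYZED)
keywords = extract_keywords(candidates)
print(keywords)
```

The alternative mentioned in the text, selecting only words found in a term dictionary, would amount to replacing the `not in stopwords` test with an `in allowed_terms` test.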

The spam mail recognition unit 540 compares the spam words stored in the spam word storage unit 550 with the keywords selected by the keyword extraction unit 520 and recognizes as spam mail any mail whose similarity is at or above a preset value, for example 90 % to 95 %. The similarity may be calculated as follows. First, the keywords making up the mail document and the weights assigned to them form the keyword vector of the mail, the spam words in the spam word storage unit 550 and the weights assigned to them form the spam word vector, and the similarity between the two vectors is calculated. The weights are calculated from the term frequency, which indicates how many times a given word appears in a document, and the inverse document frequency (Inverse Frequency), which reflects how frequently the word is used across all documents; the similarity may be calculated with the cosine coefficient. The spam mail recognition unit 540 recognizes a mail as spam mail on the basis of this similarity between the spam words and the keywords and provides the recognition result to the mail management unit 210.
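The weighting and similarity calculation can be sketched in Python as follows. The text names term frequency, inverse frequency, and the cosine coefficient but gives no formulas, so the common form tf × log(N / df) is assumed; the spam word weights, document frequencies, corpus size, and the 90 % threshold are illustrative values, not data from the document.

```python
import math
from collections import Counter


def tf_idf_vector(keywords, document_frequency, total_documents):
    """Weight each keyword by term frequency times log(N / document frequency)."""
    vector = {}
    for term, tf in Counter(keywords).items():
        df = document_frequency.get(term, 1)  # unseen terms are treated as rare
        vector[term] = tf * math.log(total_documents / df)
    return vector


def cosine_coefficient(u, v):
    """Cosine coefficient between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_spam(keywords, spam_word_vector, document_frequency, total_documents,
            threshold=0.9):
    """Recognize a mail as spam when the similarity reaches the preset value."""
    mail_vector = tf_idf_vector(keywords, document_frequency, total_documents)
    return cosine_coefficient(mail_vector, spam_word_vector) >= threshold


# Illustrative values only.
SPAM_WORD_VECTOR = {"free": 2.0, "money": 1.5, "loan": 1.0}
DOCUMENT_FREQUENCY = {"free": 10, "money": 20, "loan": 25, "meeting": 50}
TOTAL_DOCUMENTS = 100

print(is_spam(["free", "free", "money", "loan"],
              SPAM_WORD_VECTOR, DOCUMENT_FREQUENCY, TOTAL_DOCUMENTS))
print(is_spam(["meeting"],
              SPAM_WORD_VECTOR, DOCUMENT_FREQUENCY, TOTAL_DOCUMENTS))
```

Storing the vectors as dictionaries keeps the comparison sparse: only keywords that also appear in the spam word vector contribute to the dot product.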

When a received mail is recognized as spam mail, the user 140 (see FIG. 1) can handle the spam mail in the manner described with reference to FIGS. 6 and 7.

FIG. 6 illustrates the case in which the mail server 130 checks each received mail for spam and then delivers only regular mail to the user.

In FIG. 6, the mail server 130 determines whether a received mail is spam mail (step S610). If the received mail is determined to be spam mail, the process proceeds to step S620; otherwise, it proceeds to step S630.

In step S620, the mail server 130 either records, on the mail stored in the mail storage unit 220, a tag indicating that the mail is spam mail, or discards the spam mail.

In step S630, the mail server 130 transmits to the user 140 the regular mail, that is, the mail other than that recognized as spam mail.

Next, the mail server 130 determines whether the user 140, after logging in to the mail server 130, has specifically requested to view the mail recognized as spam mail (step S640).

If the user 140 makes a spam mail inquiry request, the mail server 130 proceeds to step S650, accesses the mail storage unit 220, checks whether the mail the user 140 asked about is marked as spam mail, and informs the user 140 of the result.
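The FIG. 6 flow can be sketched in Python as below. This is a simplified model under stated assumptions: the class and method names are hypothetical, the spam check is passed in as an arbitrary function, spam is tagged rather than discarded in step S620, and the inquiry in step S650 simply lists the tagged mail.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Mail:
    mail_id: str
    body: str
    spam_tag: bool = False


@dataclass
class MailServer:
    is_spam: Callable[[Mail], bool]          # stand-in for the spam analysis
    storage: Dict[str, Mail] = field(default_factory=dict)  # mail storage unit

    def receive(self, mail: Mail) -> Optional[Mail]:
        """Steps S610 to S630: check first, deliver only regular mail."""
        self.storage[mail.mail_id] = mail
        if self.is_spam(mail):               # S610
            mail.spam_tag = True             # S620: tag instead of discarding
            return None
        return mail                          # S630: deliver to the user

    def spam_inquiry(self) -> List[str]:
        """Steps S640 to S650: report which stored mails are tagged as spam."""
        return [m.mail_id for m in self.storage.values() if m.spam_tag]


server = MailServer(is_spam=lambda mail: "free money" in mail.body)
delivered = server.receive(Mail("1", "meeting at noon"))
blocked = server.receive(Mail("2", "free money now"))
print(delivered, blocked, server.spam_inquiry())
```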

FIG. 7 illustrates the case in which the mail server 130 first delivers each received mail to the user 140 without checking whether it is spam mail and handles it afterwards.

First, in step S710, mail received from the external mail server 110 is delivered by the mail server 130 to the user 140 as it is.

Next, the mail server 130 determines whether the mail delivered to the user 140 is spam mail (step S720). If it is, the process proceeds to step S730, where the user 140 is informed that the delivered mail is spam mail. The user is informed by recording a tag on the mail stored in the mail storage unit 220 to indicate that the mail is spam mail.

Thereafter, a user logged in to the mail server 130 sees that the mail list contains spam mail and asks the mail server 130 to select the spam mail in a batch; the mail server 130 checks whether such a batch selection request has been received from the user 140 (step S740).

In step S750, having confirmed the batch selection request from the user 140, the mail server 130 automatically identifies the spam mail in the mail list of the user 140, compiles a spam mail list, and provides it to the user 140.

Thereafter, in response to a deletion request from the user who has reviewed the spam mail list, the mail server 130 deletes or discards in a batch the spam mail stored in the mail storage unit 220.
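The FIG. 7 flow, in which mail is delivered first and handled afterwards, can be sketched in Python as follows. The dictionary-based storage, the function names, and the keyword-based spam check are assumptions made for illustration.

```python
def deliver(storage, mail_id, body):
    """Step S710: store and deliver the mail without any spam check."""
    storage[mail_id] = {"body": body, "spam": False}


def tag_spam(storage, is_spam):
    """Steps S720 to S730: judge delivered mail and tag the spam."""
    for mail in storage.values():
        if is_spam(mail["body"]):
            mail["spam"] = True


def select_spam(storage):
    """Steps S740 to S750: build the spam mail list for a batch selection."""
    return [mail_id for mail_id, mail in storage.items() if mail["spam"]]


def delete_spam(storage, mail_ids):
    """Final step: delete the listed spam mails in a batch."""
    for mail_id in mail_ids:
        storage.pop(mail_id, None)


storage = {}
deliver(storage, "1", "meeting at noon")
deliver(storage, "2", "free money now")
deliver(storage, "3", "free money today")

tag_spam(storage, is_spam=lambda body: "free money" in body)
spam_list = select_spam(storage)
delete_spam(storage, spam_list)
print(spam_list, sorted(storage))
```

Compared with the FIG. 6 flow, nothing is withheld from the user here; the spam tag only drives the later batch selection and deletion.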

The present invention is not limited to the embodiments described above, and anyone of ordinary skill in the art will understand that various modifications and changes can be made without departing from the technical gist of the invention.

According to the present invention, spam mail filtering blocks in advance the indiscriminate delivery of mail with commercial or advertising content to users, reducing the administrative effort users spend on their mail systems.

The spam filtering technique of the present invention makes it convenient to recognize, block, and manage mail that is of low informational value and importance to the user, thereby improving the efficiency of the user's mail management.

By accurately identifying bulk mail, the present invention makes it possible to obtain accurate information about the senders of bulk mail and their sending activity and, on the basis of such clear criteria, to take the necessary measures against those senders.

In addition, identifying senders of large volumes of spam mail and taking action against them reduces unnecessary Internet traffic, which can help increase data transmission speeds on the Internet.

Claims

A spam mail filtering method for determining spam mail on the basis of the number of similar mails sent, the method comprising:

(a) a representative character extraction step of extracting, from the characters contained in each mail received from outside, the N most frequently used characters (N being a natural number) as a representative string representing the mail;

(b) a mail accumulation step of accumulating, for each representative string, the number of mails having the same representative string;

(c) a similar mail analysis step of analyzing, as similar mails, the mails for which the cumulative number accumulated for their representative string is equal to or greater than a preset number; and

(d) a spam mail determination step of determining the mails analyzed as similar mails to be spam mail.

(Deleted)

The method of claim 1, wherein step (c) comprises:

(c-1) calculating the frequency of use of every character contained in each similar mail;

(c-2) calculating, for each character, the average value of its frequency of use in the similar mails;

(c-3) comparing the frequency of use of each character with the average value; and

(c-4) recognizing as a similar mail any mail for which, as a result of the comparison, the similarity between the frequency of use of each character and the average value exceeds a preset value,

whereby, among the mails classified as similar mails, a mail that has the same representative string but whose content is only partially changed is recognized as a similar mail.

The method of claim 1, wherein step (c) comprises:

(c-4) excluding from the similar mails any mail for which the similarity between the frequency of use of each character and the average value does not exceed a preset value,

whereby, among the mails classified as similar mails, a mail that has the same representative string but dissimilar content is excluded and the remaining mails are recognized as similar mails.

A method for filtering spam mail in a mail server, the method comprising:

(b) a mail accumulation step of accumulating, for each representative string, the number of mails having the same representative string;

(c) a bulk mail analysis step of analyzing, as bulk mail, the mails for which the cumulative number accumulated for their representative string is equal to or greater than a preset number; and

(d) a spam mail analysis step of comparing a plurality of keywords contained in each mail classified as bulk mail with preset keywords and recognizing as spam mail any mail whose similarity is equal to or greater than a preset value.

The method of claim 5, further comprising:

(f) a dissimilar mail exclusion step of excluding, from the mails classified as bulk mail, mails that have the same representative string but dissimilar content.

The method of claim 6, wherein the dissimilar mail exclusion step (f) comprises:

(f-1) calculating the frequency of use of every character in each mail;

(f-2) calculating, for each character, the average value of its frequency of use in the mails;

(f-3) comparing the frequency of use of each character with the average value; and

(f-4) excluding from the bulk mail any mail for which the similarity between the frequency of use of each character and the average value does not exceed a preset value.

The method of claim 7, wherein, in step (f-4), among the mails classified as bulk mail, a mail that has the same representative string but whose content is only partially changed is recognized as bulk mail.

The method of claim 5, wherein the spam mail analysis step (d) comprises:

(d-1) selecting, as keyword candidates, a plurality of words extracted by analyzing the character strings contained in the content of the mail in morpheme units;

(d-2) selecting, as keywords representing the mail, the keyword candidates that remain after stopwords are excluded from the selected keyword candidates;

(d-3) comparing the selected keywords with preset spam information; and

(d-4) recognizing the mail as spam mail when, as a result of the comparison, the similarity between the selected keywords and the preset spam information is equal to or greater than a preset value.

A device for filtering commercial or advertising spam mail, comprising:

a mass mail analysis unit including: a representative character extraction unit that calculates the frequency of use of each character contained in each mail received from outside and extracts the N most frequently used characters as a representative string representing the mail; a same mail classification unit that classifies mails having the same representative string as same mails; a mail management table that accumulates, for each representative string, the same mails classified by the same mail classification unit; and a mass mail determination unit that refers to the mail management table and determines, as bulk mail, the mails for which the cumulative number of mails classified under their representative string is equal to or greater than a preset number; and

a mail management unit that receives the bulk mail recognition result from the mass mail analysis unit and informs a user whether each mail received from outside is spam mail.

(Deleted)

The device of claim 10, further comprising a dissimilar mail analysis unit that excludes, from the mails classified as bulk mail by the mass mail analysis unit, mails that have the same representative string but dissimilar content,

wherein the dissimilar mail analysis unit comprises:

a character frequency calculation unit that calculates the frequency of use of every character contained in the mails;

a frequency average calculation unit that calculates, for each character, the average value of the frequency of use in the similar mails calculated by the character frequency calculation unit; and

a frequency average comparison unit that excludes from the similar mails any mail for which the similarity obtained by comparing the frequency of use of each character with the average value does not exceed a preset value.

A device for filtering commercial or advertising spam mail, comprising:

a mass mail analysis unit including: a representative character extraction unit that calculates the frequency of use of each character contained in each mail received from outside and extracts the N most frequently used characters as a representative string; a same mail classification unit that classifies mails having the same representative string as same mails; a mail management table that accumulates, for each representative string, the same mails classified by the same mail classification unit; and a mass mail determination unit that refers to the mail management table and determines, as bulk mail, the mails for which the cumulative number of mails belonging to their representative string is equal to or greater than a preset number;

a mail management unit that receives the bulk mail recognition result from the mass mail analysis unit and informs a user whether each mail received from outside is spam mail; and

a spam mail analysis unit that compares keywords representing a mail, taken from the words contained in the mail classified as bulk mail, with preset spam keywords, recognizes as spam mail any mail whose similarity is equal to or greater than a preset value, and provides the spam mail recognition result to the mail management unit.

(Deleted)

The device of claim 13, wherein the spam mail analysis unit comprises:

a morpheme analysis unit that performs morpheme-unit analysis on the character strings contained in the content of each mail classified as bulk mail, extracts at least one word from the analysis result, and provides the extracted word as a keyword candidate;

a keyword extraction unit that selects, from the keyword candidates provided by the morpheme analysis unit, keywords representing the document of the mail; and

a spam mail recognition unit that compares the keywords extracted by the keyword extraction unit with preset spam information, recognizes as spam mail any mail whose similarity is equal to or greater than a preset value, and provides the spam mail recognition result to the mail management unit.

The device of claim 15, wherein the spam mail analysis unit further comprises a stopword keyword storage unit in which stopwords not to be used as keywords are stored, and

the keyword extraction unit extracts as keywords all keyword candidates, among the keyword candidates, that are not stored in the stopword keyword storage unit.

The device of claim 15, wherein the spam mail analysis unit further comprises a spam term storage unit in which the preset spam information is stored, and

the spam mail recognition unit assigns weights to the keywords and to the preset spam information and calculates the similarity between the keywords and the preset spam information.

The device of claim 17, wherein the weight is a value based on term frequency and inverse frequency, and the similarity is calculated by the cosine coefficient.

The device of claim 13, further comprising a dissimilar mail analysis unit that excludes, from the similar mails classified by the mass mail analysis unit, mails that have the same representative string but dissimilar content,

wherein the dissimilar mail analysis unit comprises:

a character frequency calculation unit that calculates the frequency of use of every character contained in the similar mails; and

a frequency average comparison unit that excludes from the similar mails any mail for which the similarity obtained by comparing the frequency of use of each character with the average value does not exceed a preset value.

The device of claim 13, wherein the mail management unit, according to the spam mail analysis result provided by the spam mail analysis unit, transmits to the user only the mails other than those determined to be spam mail.

The device of claim 13, wherein the mail management unit provides the received mail to the user as it is and notifies the user of the spam mail analysis result provided by the spam mail analysis unit.