KR20130026099A

KR20130026099A - Method and apparatus for creating classifier for spam messages in social networking websites using sender-receiver relationship and method for filtering spam messages

Info

Publication number: KR20130026099A
Application number: KR1020110089499A
Authority: KR
Inventors: 김종; 이상호; 송종혁
Original assignee: 포항공과대학교 산학협력단
Priority date: 2011-09-05
Filing date: 2011-09-05
Publication date: 2013-03-13

Abstract

PURPOSE: A method and an apparatus for generating a spam message classifier for a social network site using the relationship between sender and receiver, and a spam message filtering method, are provided to filter spam messages more accurately by judging whether a message is spam from relationship information between the receiver and the sender of the message. CONSTITUTION: A spam message classifier generating device collects actual messages exchanged between members of a specific Internet social network site (S210). The device classifies each collected message as spam or normal (S220). The device calculates the relationship distance and the connectivity between the sender and the receiver of each message (S230). The device generates a spam message classifier using characteristics derived from the calculated values (S240). [Reference numerals] (AA) Start; (BB) End; (S210) Collecting messages; (S220) Classifying messages (spam or normal); (S230) Calculating the relationship distance and connectivity between message receiver and sender (using the related-member list); (S240) Deriving the relationship distance and connectivity characteristics of spam messages and generating a spam message classifier from them

Description

METHOD AND APPARATUS FOR CREATING CLASSIFIER FOR SPAM MESSAGES IN SOCIAL NETWORKING WEBSITES USING SENDER-RECEIVER RELATIONSHIP AND METHOD FOR FILTERING SPAM MESSAGES

The present invention relates to a method and apparatus for filtering spam messages in a social network site and, more particularly, to a method and apparatus for filtering spam from the messages exchanged on a social network site by using the relationship between the sender and the receiver of each message.

Spam refers to advertising letters or messages sent to a large, unspecified number of people through e-mail, bulletin boards, text messages, telephone calls, the private-message functions of Internet portal sites, and similar channels. Because spam is distributed indiscriminately, it inconveniences its recipients.

Recently, the use of Internet sites offering a social network service, which lets users strengthen ties with acquaintances such as friends and colleagues on the web, make new connections, and build a broad personal network, has become routine. As a result, spam messages received through such sites have become a problem.

A conventional way to prevent such spam is to classify a message as spam when it contains a pre-registered character string or comes from a pre-registered sender account. For e-mail, for example, the content of the message is searched for character strings regarded as spam. Messages on social network sites, however, are relatively short, so it is difficult to decide from the text alone whether a message is spam.

An object of the present invention, devised to solve the above problem, is to provide a method for generating a spam message classifier that classifies spam among the messages received on a social network site.

Another object of the present invention is to provide an apparatus for generating a spam message classifier that classifies spam among the messages received on a social network site.

Yet another object of the present invention is to provide a method for filtering spam messages out of the messages received on a social network site.

To achieve the above object, the present invention provides a method for generating a spam message classifier using messages that have been collected from a specific Internet social network site and classified as spam or not, the method comprising: calculating, for each classified message and with reference to a related-member list stored in a member relationship database, a relationship distance based on the number of related members passed through when tracing the sender from the receiver of the message; calculating a connectivity based on at least one possible path that links the related members passed through from the message receiver to the sender; and deriving the relationship distance and connectivity characteristics of spam messages from the messages for which the relationship distance and connectivity between receiver and sender have been calculated, and generating a classifier for classifying spam messages using the derived characteristics.

Here, the relationship distance and the connectivity between the receiver and the sender are each determined on the basis of at least one shortest path, that is, the path containing the fewest related members among all possible paths linking the related members passed through to reach the sender.

Here, when the number of shortest paths is counted for the connectivity between the receiver and the sender and the same related member appears on two or more shortest paths, only one of those shortest paths is counted.

Here, the connectivity between the receiver and the sender is calculated using the minimum cut (MIN-CUT) method.

Here, when the degree of association between the receiver and the sender is calculated, a weight is calculated for each related member appearing on each shortest path using the RANDOM-WALK algorithm, and the degree of association is calculated with reference to these weights, where the weight of a related member is calculated based on the number of related members in that member's own related-member list.

Here, the spam message classifier is generated using any one of the Bagging, LibSVM, FT, J48, and BayesNet algorithms.

To achieve another of the above objects, the present invention provides a method for filtering spam by determining whether a specific message received on a specific Internet social network site is a spam message, the method comprising: calculating, for the received message and with reference to a related-member list stored in a member relationship database, a relationship distance based on the number of related members passed through when tracing the sender from the receiver of the message; calculating a connectivity based on at least one possible path that links the related members passed through from the message receiver to the sender; and applying a spam message classifier to the message for which the relationship distance and connectivity have been calculated and filtering out the message if it is classified as spam, wherein the spam message classifier is a classifier trained on actual messages from the specific Internet social network site.

Here, when the degree of association between the receiver and the sender is calculated, a weight is calculated for each related member appearing on each shortest path using the RANDOM-WALK algorithm, and the degree of association is calculated with reference to these weights, where the weight of a related member is calculated based on the number of related members in that member's own related-member list.

To achieve yet another of the above objects, the present invention provides an apparatus for generating a spam message classifier using messages that have been collected from a specific Internet social network site and classified as spam or not, the apparatus comprising: a message database storing the messages classified as spam or not; a member relationship database storing the members of the social network site, including the receivers of the messages, together with each member's related-member list; a relationship distance calculation unit that, with reference to the related-member lists of the member relationship database, calculates for each message in the message database a relationship distance based on the number of related members passed through when tracing the sender from the receiver of the message; a connectivity calculation unit that calculates a connectivity based on at least one possible path linking the related members passed through from the message receiver to the sender; and a classifier generation unit that derives the relationship distance and connectivity characteristics of spam messages from the messages for which the relationship distance and connectivity between receiver and sender have been calculated, and generates a classifier that classifies spam messages using the derived characteristics.

Here, the connectivity between the receiver and the sender is calculated using the minimum cut (MIN-CUT) method, in which, when the number of shortest paths is counted and the same related member appears on two or more shortest paths, only one of those shortest paths is counted.

Here, when the degree of association between the receiver and the sender is calculated, a weight is calculated for each related member appearing on each shortest path using the RANDOM-WALK algorithm, and the degree of association is calculated with reference to these weights, where the weight of a related member is calculated based on the number of related members in that member's own related-member list.

With the method and apparatus for generating a spam message classifier and the spam message filtering method described above, relationship information between sender and receiver is used to judge that the more distant the relationship, the more likely the message is spam, so spam messages are filtered more accurately. In addition, because the receiver-sender relationships established within a social network site are difficult for even a spammer to manipulate, spam senders are blocked effectively.

FIG. 1 is a conceptual diagram showing the schematic configuration of the process of generating a classifier for classifying spam messages according to the present invention.
FIG. 2 is a sequence chart showing the steps for generating a spam message classifier according to an embodiment of the present invention.
FIG. 3 is a conceptual diagram showing an example of relationship paths between a message receiver and a sender according to an embodiment of the present invention.
FIG. 4 is a block diagram showing the configuration of an apparatus for generating a spam message classifier according to an embodiment of the present invention.
FIG. 5 is a sequence chart showing a process for filtering spam messages according to an embodiment of the present invention.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail below. This is not intended to limit the invention to particular embodiments, and the invention should be understood to cover all modifications, equivalents, and substitutes falling within its spirit and technical scope. Like reference numerals denote like elements throughout the drawings.

Terms such as first, second, A, and B may be used to describe various elements, but the elements are not limited by these terms. The terms serve only to distinguish one element from another. For example, a first element could be called a second element, and likewise a second element could be called a first element, without departing from the scope of the present invention. The term "and/or" covers any combination of a plurality of related listed items, or any one of them.

When an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to that element, or intervening elements may be present. In contrast, when an element is said to be "directly connected" or "directly coupled" to another element, no intervening elements are present.

The terminology used in this application serves only to describe particular embodiments and is not intended to limit the invention. Singular expressions include the plural unless the context clearly indicates otherwise. In this application, terms such as "comprise" and "have" specify the presence of the stated features, numbers, steps, operations, elements, parts, or combinations thereof, and do not preclude the presence or addition of one or more other features, numbers, steps, operations, elements, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the meaning commonly understood by a person of ordinary skill in the art to which this invention belongs. Terms defined in commonly used dictionaries are to be interpreted as having meanings consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in this application.

Disclosed is a method for determining, for a message sent on a social network site, whether the message is spam on the basis of the relationship between its sender and receiver.

To characterize the relationship between sender and receiver, the present invention uses the relationship distance and the connectivity between them. The relationship distance relates to the number of other members of the site that must be passed through to trace the message sender from the message receiver. The connectivity relates to the number of different ways of choosing the members passed through while tracing the sender.

In general, the greater the relationship distance between receiver and sender, that is, the more members must be passed through to reach the sender compared with a sender who is directly related to the receiver, the more likely the message is spam. Likewise, a message is more likely to be spam when the paths by which the sender can be traced are limited, that is, when the connectivity is low, than when the sender can be traced through many different paths.

The present invention therefore proposes a method that identifies spam more accurately by analyzing actual spam messages and deriving the relationship distance and connectivity characteristics between their senders and receivers.

To derive the connectivity characteristics between sender and receiver, the present invention uses the minimum cut (min-cut) and random-walk methods.

The minimum cut method measures the connectivity between two nodes (for example, two members) by counting the paths between them; the more paths there are, the higher the connectivity between the two nodes. When the paths between two nodes are counted, the same edge may be counted more than once. Paths that share no edge are called edge-independent paths. To measure connectivity more accurately, the present invention counts edge-independent paths.
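As a minimal sketch of the edge-independent path count described above, the following Python function counts the paths between two members that share no edge by repeatedly finding unit-capacity augmenting paths; by Menger's theorem this count equals the size of a minimum edge cut. The graph representation (a dict mapping each member to a set of related members, treated as undirected) and the function name `edge_independent_paths` are assumptions made for illustration; the patent does not specify an implementation.

```python
from collections import deque

def edge_independent_paths(graph, source, target):
    """Count edge-disjoint paths between source and target (undirected graph)."""
    if source == target:
        return 0
    # Residual capacity 1 in each direction of every undirected relation.
    capacity = {}
    for u, neighbours in graph.items():
        for v in neighbours:
            capacity[(u, v)] = 1
            capacity[(v, u)] = 1
    adjacency = {}
    for (u, v) in capacity:
        adjacency.setdefault(u, set()).add(v)

    count = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {source: None}
        queue = deque([source])
        while queue and target not in parent:
            u = queue.popleft()
            for v in adjacency.get(u, ()):
                if v not in parent and capacity[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if target not in parent:
            return count
        # Push one unit of flow back along the path that was found.
        v = target
        while parent[v] is not None:
            u = parent[v]
            capacity[(u, v)] -= 1
            capacity[(v, u)] += 1
            v = u
        count += 1
```

A receiver and sender joined through two separate intermediaries yield a count of 2, whereas any number of routes that all funnel through a single relation yield 1.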

The random-walk method is the algorithm underlying PageRank, which is used by search engines. It assigns a node a higher score the more in-links the node has, and the more those in-links come from nodes that themselves have many in-links.
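The scoring behaviour described above can be illustrated with a short power-iteration sketch in Python. This is a generic PageRank-style computation, not code from the patent; the damping factor of 0.85, the iteration count, and the name `pagerank` are assumptions chosen for illustration.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Approximate random-walk (PageRank-style) scores by power iteration."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    n = len(nodes)
    score = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the same small share from random jumps.
        new_score = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = out_links.get(node, ())
            if targets:
                share = damping * score[node] / len(targets)
                for t in targets:
                    new_score[t] += share
            else:
                # A node with no out-links spreads its score evenly.
                share = damping * score[node] / n
                for t in nodes:
                    new_score[t] += share
        score = new_score
    return score
```

In a three-node graph where `a` and `b` both link to `c` and `c` links back to `a`, node `c` ends up with the highest score, and `a` scores above `b` because its only in-link comes from the well-linked node `c`.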

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a conceptual diagram showing the schematic configuration of the process of generating a classifier for classifying spam messages according to the present invention.

Referring to FIG. 1, the process of generating a classifier for classifying spam messages comprises a data collection step (110), a data classification step (120), and a data training step (130).

Each step of the process shown in FIG. 1 is described below.

The data collection step (110) collects actual messages that exist on an Internet social network site. It is advisable to collect as much data as possible in order to characterize the data, since this raises the accuracy of the result. Along with the collected messages, information is also needed about the senders and receivers and about their relationships with the people connected to them, for example friend relationships. The nature of these relationships may differ somewhat according to the policy of each social network site.

On Facebook, for example, a relationship is established between two members through a relationship request and its acceptance. On Twitter, the relationships are the follower relationship, which denotes other members who have requested a relationship with a given member, and the following relationship, which denotes the members with whom the given member has requested a relationship. The present invention is not limited by the way in which a relationship is formed and can be applied to the various kinds of relationship defined by a site's policy.
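As a minimal sketch of how the two kinds of relationship described above might be held in one related-member structure, the Python function below builds a dict that maps each member to a set of related members. The function name, the parameter names, and the choice to store a one-way relationship only on the follower's side are assumptions for illustration; a real site's policy would determine the direction.

```python
def build_relation_graph(friendships=(), follows=()):
    """Return {member: set(related members)} from two kinds of relations."""
    graph = {}
    # Mutual relation (request plus acceptance): store it in both directions.
    for a, b in friendships:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    # One-way relation: recorded from the follower to the followed member only.
    for follower, followed in follows:
        graph.setdefault(follower, set()).add(followed)
        graph.setdefault(followed, set())
    return graph
```

Keeping both relationship types in the same structure means that later distance and connectivity calculations need not know how a relationship was formed.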

The data classification step (120) classifies the collected data as spam or normal. In this step, the collected messages are classified manually as spam or not spam. For each classified message, the relationship distance and the connectivity between receiver and sender are calculated, and the message is stored in the message database (10) together with these values. That is, each message in the message database (10) is stored with an indication of whether it is spam or normal (11), the relationship distance (12) between receiver and sender, and the connectivity (13).
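One way to picture a record in the message database (10) is the following Python sketch. The class name `LabeledMessage`, its field names, and the helper `feature_rows` are hypothetical; only the three stored values (11), (12), and (13) come from the description.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LabeledMessage:
    sender: str
    receiver: str
    text: str
    is_spam: bool        # manual spam/normal label (11)
    distance: int        # relationship distance (12)
    connectivity: int    # connectivity (13)

def feature_rows(messages):
    """Turn stored messages into (features, label) pairs for training."""
    return [((m.distance, m.connectivity), m.is_spam) for m in messages]
```

Note that the message text is stored but does not enter the feature tuple, which reflects the idea that classification rests on the sender-receiver relationship rather than on content.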

The data training step (130) generates a classifier from the data collected and classified in the preceding steps (110, 120). Existing algorithms from the field of data mining can be used in this step, for example Bagging, LibSVM, FT, J48, or BayesNet. Each algorithm uses the messages stored in the database to learn, in its own way, the characteristics of those messages, and thereby produces a classifier (30) that decides, for an arbitrary message (20), whether it is a spam message (21) or a normal message (22).
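To make the training step concrete without depending on any particular toolkit, the sketch below learns a single threshold rule (a one-level decision tree, often called a decision stump) from rows of the form `((distance, connectivity), is_spam)`. It is a deliberately simplified stand-in and is not an implementation of Bagging, LibSVM, FT, J48, or BayesNet; the function name `train_stump` and the row format are assumptions.

```python
def train_stump(rows):
    """Learn one threshold rule from ((distance, connectivity), is_spam) rows."""
    best = None
    n_features = len(rows[0][0])
    for feature in range(n_features):
        for threshold in sorted({features[feature] for features, _ in rows}):
            for spam_if_greater in (True, False):
                # Count training rows that this candidate rule misclassifies.
                errors = sum(
                    ((features[feature] > threshold) == spam_if_greater) != label
                    for features, label in rows
                )
                if best is None or errors < best[0]:
                    best = (errors, feature, threshold, spam_if_greater)
    _, feature, threshold, spam_if_greater = best

    def classify(features):
        return (features[feature] > threshold) == spam_if_greater

    return classify
```

On training data in which spam comes from distant, weakly connected senders, the learned rule flags a message at distance 6 as spam and a message at distance 1 as normal; an actual deployment would more likely rely on one of the named algorithms from an established data mining library.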

Embodiments of the present invention are described in more detail below with reference to the drawings. First, a method and an apparatus for generating a spam message classifier according to an embodiment are described, followed by a method for determining whether a message is spam using that classifier.

FIG. 2 is a sequence chart showing the steps for generating a spam message classifier according to an embodiment of the present invention.

Referring to FIG. 2, generating a spam message classifier according to an embodiment of the present invention comprises a message collection step (S210), a message classification step (S220), a step of calculating the relationship distance and connectivity between receiver and sender (S230), and a spam message classifier generation step (S240).

Each step of the process shown in FIG. 2 is described below.

The message collection step (S210) collects actual messages exchanged between members of a specific Internet social network site.

The message classification step (S220) determines whether each message collected in the message collection step (S210) is spam and classifies it accordingly. Because this step prepares the sample data used for training, it may be carried out manually to ensure accurate classification.

The step of calculating the relationship distance and connectivity between receiver and sender (S230) calculates, for each message classified as spam or not in the preceding step, the relationship distance and the connectivity between the receiver and the sender of that message.

The relationship distance is, for example, a measure of whether the sender and the receiver are directly related or, if not, through how many people they are connected.

The relationship distance between receiver and sender is calculated for each classified message with reference to a member relationship database, which stores the members of the social network site together with each member's related-member list, that is, the list of members with whom that member has formed a relationship under the site's policy. A related member may be a friend of the member, a follower, or an "ilchon" (일촌, a first-degree contact). In other words, a related member is another member of the site who is linked to the member through any of the routes the site's policy allows, such that the two can exchange messages online or view each other's member information.

The relationship distance between the receiver and the sender of a message can be determined by tracing back from the receiver to the sender with reference to the related-member lists in the member relationship database and counting the related members passed through before the sender is reached. The connectivity between receiver and sender can be determined from the number of possible paths that link the related members passed through from the receiver to the sender. For example, the relationship distance can be set to grow with the number of related members passed through, and the connectivity to grow with the number of possible paths.
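The tracing described above can be sketched as a breadth-first search over the related-member lists. In this illustration the distance is taken to be the number of hops, so a directly related sender has distance 1 and each intermediate member adds 1; that convention, the dict-of-sets graph format, and the name `relationship_distance` are assumptions rather than details given in the patent.

```python
from collections import deque

def relationship_distance(graph, receiver, sender):
    """Hops from receiver to sender over related-member lists; None if unreachable."""
    if receiver == sender:
        return 0
    seen = {receiver}
    queue = deque([(receiver, 0)])
    while queue:
        member, hops = queue.popleft()
        for related in graph.get(member, ()):
            if related == sender:
                return hops + 1
            if related not in seen:
                seen.add(related)
                queue.append((related, hops + 1))
    return None
```

Returning `None` for an unreachable sender leaves it to the caller to decide how to encode "no relationship at all", for example as a large sentinel distance.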

Because a member may have several related members, the trace may follow many different paths. It is therefore preferable to determine the relationship distance and the connectivity on the basis of the shortest paths, that is, the paths containing the fewest related members among all possible paths linking the related members passed through before the sender is reached.

In addition, when the shortest paths are counted, if the same related member appears on two or more shortest paths, only one of those paths may be counted toward the number of shortest paths. To this end, the connectivity may be calculated using a minimum cut (MIN-CUT) method.

As another way of calculating the degree of association between the receiver and the sender, a RANDOM-WALK algorithm may be used to calculate a weight for each related member appearing on each shortest path, and the degree of association may be calculated with reference to those weights. The weight for a related member may be calculated on the basis of the number of related members in that member's own related member list. That is, the weights may be set so that a member on a shortest path who has many related members receives a high weight, yielding a higher degree of association, and a member who has few receives a low weight.

Alternatively, the minimum cut method and the RANDOM-WALK algorithm described above may be used in combination.

A more detailed method of calculating the relationship distance and the connectivity between the receiver and the sender is described later.

The spam message classifier generation step (S250) derives the relationship distance and connectivity characteristics of spam messages from the messages for which the relationship distance and connectivity between receiver and sender have been calculated, and uses the derived characteristics to generate a classifier that classifies spam messages.

The spam message classifier may be generated using any one of the Bagging, LibSVM, FT, J48, and BayesNet algorithms.
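The idea of learning a classifier from the two features can be illustrated with a minimal sketch. It does not use any of the algorithms named above; it substitutes a one-level decision rule (a decision stump), which is far simpler than J48 or the other listed algorithms, and it needs only the standard library. The function name `train_stump` and the six training samples are hypothetical and are made up purely for illustration; they are not data from this document.

```python
def train_stump(samples):
    """Learn a single-threshold rule from labelled samples.

    samples: list of ((relationship_distance, connectivity), is_spam).
    Returns a function classify(distance, connectivity) -> bool.
    This is a stand-in for a real learner, not one of the named algorithms.
    """
    best = None
    for feature in (0, 1):  # 0 = relationship distance, 1 = connectivity
        for threshold in sorted({x[feature] for x, _ in samples}):
            for spam_if_above in (True, False):
                # Count training samples that this candidate rule misclassifies.
                errors = sum(
                    ((x[feature] > threshold) == spam_if_above) != label
                    for x, label in samples
                )
                if best is None or errors < best[0]:
                    best = (errors, feature, threshold, spam_if_above)

    _, feature, threshold, spam_if_above = best

    def classify(distance, connectivity):
        value = (distance, connectivity)[feature]
        return (value > threshold) == spam_if_above

    return classify


# Hypothetical toy data: (relationship distance, connectivity) -> is_spam.
toy_samples = [
    ((1, 1), False),
    ((2, 4), False),
    ((2, 1), False),
    ((3, 1), True),
    ((4, 1), True),
    ((5, 0), True),
]
classify = train_stump(toy_samples)
print(classify(4, 1), classify(2, 4))
```

In practice, a sketch like this would be replaced by one of the listed algorithms trained on messages actually collected from the site.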

A method of obtaining the relationship distance and the connectivity between a message sender and a receiver is now described in more detail.

FIG. 3 is a conceptual diagram showing examples of relationship paths between a message receiver and a sender according to an embodiment of the present invention.

FIG. 3(a) shows a graph with a relationship distance of 2 and four edge-independent paths, and FIG. 3(b) shows a graph with a relationship distance of 4 and one edge-independent path.

For example, as shown in FIG. 3(a), assume that a member F sends a message to a member A, and that the sender F is traced back from the receiver A.

To determine the relationship distance between the receiver A and F, it is determined whether A and F are directly related or, if not, how many members must be passed through to reach F. In other words, it is determined whether F is a friend, a friend of a friend, or a complete stranger.

One way to determine this relationship is to check whether F is in A's related member list (hereinafter, friend list). If F is in A's friend list in FIG. 3(a), the relationship distance may be 1. If F is not in A's friend list, the friend list of every friend in A's friend list is searched for F. If F is in the friend list of one of A's friends (B, C, D, E), the relationship distance may be 2.

If F is not in the friend lists of A's friends either, the search for F is repeated over the friends of all of A's friends. Because this trace can place unnecessary load on the system when the sender cannot be found or can be reached only through a very large number of friends, a maximum trace depth may be specified. If the sender is not found within the maximum trace depth, the relationship distance may be set to a specified minimum value that is regarded as indicating spam (for example, 5).

FIG. 3(a) is a graph showing the result of tracing F, who is not in A's friend list but is in the friend lists of A's friends B, C, D, and E.

Here, as shown in the graph (301), tracing F from A passes through one of the members B, C, D, and E in the middle of each shortest path between A and F, so that each path consists of two connected edges (311-312, 321-322, 331-332, 341-342). The relationship distance between A and F is therefore 2.
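The level-by-level search of friend lists described above corresponds to a breadth-first search with a depth limit. The following is a minimal sketch under stated assumptions: friend lists are held in a plain dictionary rather than a member relationship database, the function name `relationship_distance` and the parameters `max_depth` and `fallback` are hypothetical, the default maximum trace depth of 4 is an assumption, and the fallback value 5 follows the example value mentioned in the text.

```python
from collections import deque


def relationship_distance(friends, receiver, sender, max_depth=4, fallback=5):
    """Trace the sender from the receiver through friend lists (BFS).

    friends: dict mapping each member to an iterable of related members.
    Returns the number of edges on a shortest path, or `fallback` when the
    sender is not found within `max_depth` steps.
    """
    if receiver == sender:
        return 0
    visited = {receiver}
    frontier = deque([(receiver, 0)])
    while frontier:
        member, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not look beyond the maximum trace depth
        for friend in friends.get(member, ()):
            if friend == sender:
                return depth + 1
            if friend not in visited:
                visited.add(friend)
                frontier.append((friend, depth + 1))
    return fallback


# Friend lists corresponding to the FIG. 3(a) example: A's friends are
# B, C, D and E, and each of them has F in their friend list.
fig_3a = {
    "A": ["B", "C", "D", "E"],
    "B": ["A", "F"],
    "C": ["A", "F"],
    "D": ["A", "F"],
    "E": ["A", "F"],
    "F": ["B", "C", "D", "E"],
}
print(relationship_distance(fig_3a, "A", "F"))  # 2
```

Checking for the sender while expanding each friend list mirrors the text: first A's own list is inspected, then the lists of A's friends, and so on.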

There are four edge-independent shortest paths between A and F, namely A-B-F (311-312), A-C-F (321-322), A-D-F (331-332), and A-E-F (341-342), so the connectivity may be 4.

As another example, referring to FIG. 3(b), assume that a member Q sends a message to a member H, and that Q is traced back from the message receiver H.

Here, as shown in the graph (302), tracing Q from H yields six shortest paths between H and Q (H-I-M-P-Q, H-J-M-P-Q, H-J-N-P-Q, H-K-N-P-Q, H-K-O-P-Q, and H-L-O-P-Q). Each of these shortest paths consists of four connected edges, so the distance between H and Q is 4.

Because these shortest paths share edges (353, 364, 383, 354), only one edge-independent path remains once paths with shared edges are excluded, so the connectivity between H and Q is 1.
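The two worked examples can be reproduced with a short sketch that counts edge-independent shortest paths. It assumes the relationship graph is undirected (every friendship is listed in both directions), keeps only edges that lie on some shortest path, and then counts edge-disjoint paths with a unit-capacity augmenting-path search; by the max-flow/min-cut theorem this count equals the minimum cut of that subgraph. The function names and the edge lists, which are read off the path listings in the text, are illustrative assumptions rather than the implementation described in this document.

```python
from collections import deque


def build_graph(edges):
    """Build a symmetric adjacency dict from a list of undirected edges."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def bfs_distances(graph, source):
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def connectivity(graph, receiver, sender):
    """Count edge-independent shortest paths from receiver to sender."""
    if receiver == sender:
        return 0
    from_r = bfs_distances(graph, receiver)
    from_s = bfs_distances(graph, sender)
    if sender not in from_r:
        return 0
    total = from_r[sender]

    # Unit capacities on edges lying on a shortest path; reverse edges start at 0.
    cap = {}
    for u in graph:
        for v in graph[u]:
            if u in from_r and v in from_s and from_r[u] + 1 + from_s[v] == total:
                cap[(u, v)] = 1
                cap.setdefault((v, u), 0)
    adj = {}
    for u, v in cap:
        adj.setdefault(u, []).append(v)

    def augment(node, seen):
        """Depth-first search for one augmenting path in the residual graph."""
        if node == sender:
            return True
        seen.add(node)
        for nxt in adj.get(node, ()):
            if nxt not in seen and cap[(node, nxt)] > 0 and augment(nxt, seen):
                cap[(node, nxt)] -= 1
                cap[(nxt, node)] += 1
                return True
        return False

    flow = 0
    while augment(receiver, set()):
        flow += 1
    return flow


# Edges read off the paths listed for FIG. 3(a) and FIG. 3(b).
fig_3a = build_graph([("A", m) for m in "BCDE"] + [(m, "F") for m in "BCDE"])
fig_3b = build_graph([
    ("H", "I"), ("H", "J"), ("H", "K"), ("H", "L"),
    ("I", "M"), ("J", "M"), ("J", "N"), ("K", "N"), ("K", "O"), ("L", "O"),
    ("M", "P"), ("N", "P"), ("O", "P"), ("P", "Q"),
])
print(connectivity(fig_3a, "A", "F"))  # 4
print(connectivity(fig_3b, "H", "Q"))  # 1
```

In FIG. 3(b) every shortest path must use the edge P-Q, which is why the count collapses to 1 even though six shortest paths exist.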

FIG. 4 is a block diagram showing the configuration of an apparatus for generating a spam message classifier according to an embodiment of the present invention.

Referring to FIG. 4, an apparatus for generating a spam message classifier according to an embodiment of the present invention comprises a message collection unit (410), a message classification unit (420), a relationship distance and connectivity calculation unit (430), a spam message classifier generation unit (440), a message database (10), and a member relationship database (90).

Still referring to FIG. 4, the components of the apparatus for generating a spam message classifier according to an embodiment of the present invention may be described as follows.

The message collection unit (410) collects actual messages exchanged between members from a specific internet social network site.

The message classification unit (420) determines whether each message collected by the message collection unit (410) is a spam message, classifies the messages accordingly, and stores them in the message database (10).

The relationship distance and connectivity calculation unit (430) calculates, for each message stored in the message database (10) and classified as spam or not spam, the relationship distance and the connectivity between the receiver and the sender of that message.

Here, the relationship distance between the receiver and the sender is calculated for each classified message by referring to the member relationship database (90), which stores the members of the social network site and, for each member, a list of related members who have formed a connection with that member under the site's policy. A related member may be a friend, a follower, or an ilchon (a first-degree connection) of the member. That is, a related member is another member of the site who is connected to the member through any of the various routes permitted by the site's policy and who has thereby formed a connection that allows, for example, exchanging messages online or viewing member-related information.

The relationship distance between the receiver and the sender of a message may be determined by tracing back from the message receiver to the message sender with reference to the related member lists in the member relationship database (90), according to the number of related members passed through before the sender is reached. The connectivity between the receiver and the sender may be determined according to the number of possible paths that link the related members passed through on the way from the receiver to the sender. For example, the relationship distance may be set to grow as more related members are passed through, and the connectivity may be set to grow as the number of possible paths increases.

As another way of calculating the degree of association between the receiver and the sender, a RANDOM-WALK algorithm may be used to calculate a weight for each related member appearing on each shortest path, and the degree of association may be calculated with reference to those weights. The weight for a related member may be calculated on the basis of the number of related members in that member's own related member list. That is, the weights may be set so that an intermediate member on a shortest path who has many related members receives a high weight, yielding a higher degree of association, and one who has few receives a low weight. Alternatively, the minimum cut method and the RANDOM-WALK algorithm described above may be used in combination.

The spam message classifier generation unit (440) derives the relationship distance and connectivity characteristics of spam messages from the messages for which the relationship distance and connectivity between receiver and sender have been calculated, and uses the derived characteristics to generate a classifier that classifies spam messages. The spam message classifier may be generated using any one of the Bagging, LibSVM, FT, J48, and BayesNet algorithms.

A method of determining whether an arbitrary message is a spam message using the spam message classifier according to the present invention is described below.

FIG. 5 is a sequence chart showing a process for filtering spam messages according to an embodiment of the present invention.

Referring to FIG. 5, the process for filtering spam messages according to an embodiment of the present invention comprises a message receiving step (S510), a step of calculating the relationship distance and connectivity between the receiver and the sender (S520), and a spam message filtering step (S530).

The message receiving step (S510) receives an arbitrary message sent by a member of a specific internet social network site so that it can be determined whether the message is spam.

The step of calculating the relationship distance and connectivity between the receiver and the sender (S520) calculates, for the received message, the relationship distance and the connectivity between its receiver and sender by referring to a member relationship database that stores the members of the social network site and, for each member, a list of related members who have formed a connection with that member under the site's policy.

Here, the relationship distance between the receiver and the sender may be determined by tracing back from the receiver to the sender using the related member lists, according to the number of related members passed through before the sender is reached, and the connectivity may be determined according to the number of possible paths that link the related members passed through before the sender is reached.

The detailed method of calculating the relationship distance and the connectivity has been described above and is not repeated here.

The spam message filtering step (S530) applies the spam message classifier to the message for which the relationship distance and connectivity have been calculated, and filters out the message if it is determined to be spam. The spam message classifier used here is a classifier according to the present invention, trained on messages for which the spam label and the relationship distance and connectivity between receiver and sender are known.
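The three steps S510 to S530 can be sketched as a small pipeline. The sketch below is only an outline: `filter_spam`, `compute_features`, and `is_spam` are hypothetical names, the feature lookup table stands in for the relationship distance and connectivity calculation, and the threshold rule is an arbitrary placeholder rather than a trained classifier.

```python
def filter_spam(messages, compute_features, is_spam):
    """Split incoming messages into delivered and filtered lists.

    messages: iterable of dicts with "sender" and "receiver" keys.
    compute_features(receiver, sender) -> (relationship_distance, connectivity).
    is_spam(distance, connectivity) -> bool, e.g. a trained classifier.
    """
    delivered, filtered = [], []
    for message in messages:  # S510: receive a message
        # S520: compute the sender-receiver relationship features
        distance, conn = compute_features(message["receiver"], message["sender"])
        # S530: apply the classifier and filter messages judged to be spam
        if is_spam(distance, conn):
            filtered.append(message)
        else:
            delivered.append(message)
    return delivered, filtered


# Placeholder feature values taken from the FIG. 3 examples:
# (A, F) has distance 2 and connectivity 4; (H, Q) has distance 4 and connectivity 1.
feature_table = {("A", "F"): (2, 4), ("H", "Q"): (4, 1)}


def lookup_features(receiver, sender):
    return feature_table[(receiver, sender)]


def placeholder_rule(distance, conn):
    # Arbitrary illustrative rule, not a trained classifier.
    return distance >= 4 and conn <= 1


inbox = [
    {"sender": "F", "receiver": "A", "text": "hello"},
    {"sender": "Q", "receiver": "H", "text": "click here"},
]
delivered, filtered = filter_spam(inbox, lookup_features, placeholder_rule)
print(len(delivered), len(filtered))
```

Passing the feature calculation and the classifier in as functions keeps the filtering loop independent of how either one is implemented.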

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that the invention may be modified and changed in various ways without departing from the spirit and scope of the invention as set forth in the claims below.

Claims

A method for generating a spam message classifier using messages collected from a specific internet social network site and classified as to whether they are spam messages, the method comprising:
calculating, for each of the classified messages, a relationship distance based on the number of related members passed through when the sender is traced from the message receiver, by referring to a related member list stored in a member relationship database;
calculating a connectivity based on at least one possible path linking the related members passed through from the message receiver to the sender; and
deriving relationship distance and connectivity characteristics of spam messages using the messages for which the relationship distance and the connectivity between the receiver and the sender have been calculated, and generating a classifier that classifies spam messages using the derived characteristics.

The method of claim 1,
wherein the relationship distance and the connectivity between the receiver and the sender are determined on the basis of at least one shortest path, being a path with the fewest related members among the possible paths that each link all related members passed through before the sender is reached.

The method of claim 2,
wherein, in calculating the connectivity between the receiver and the sender, when the number of shortest paths is counted and the same related member appears on two or more shortest paths, only one of those two or more shortest paths is counted toward the number of shortest paths.

The method of claim 3,
wherein the connectivity between the receiver and the sender is calculated using a minimum cut (MIN-CUT) method.

The method of claim 2 or 3,
wherein, in calculating the degree of association between the receiver and the sender, a RANDOM-WALK algorithm is used to calculate a weight for each related member appearing on each of the shortest paths, and the degree of association is calculated with reference to the weights, and
wherein the weight for a related member is calculated on the basis of the number of related members included in the related member list of that related member.

The method of claim 1,
wherein the spam message classifier is generated using any one of the Bagging, LibSVM, FT, J48, and BayesNet algorithms.

A method of filtering spam messages by determining whether a specific message received from a specific internet social network site is spam, the method comprising:
calculating, for the received message, a relationship distance based on the number of related members passed through when the sender is traced from the message receiver, by referring to a related member list stored in a member relationship database;
calculating a connectivity based on at least one possible path linking the related members passed through from the message receiver to the sender; and
applying a spam message classifier to the message for which the relationship distance and the connectivity have been calculated, and filtering the message if it is classified as a spam message,
wherein the spam message classifier is a classifier trained using actual messages from the specific internet social network site.

The method of claim 7,
wherein the relationship distance and the connectivity between the receiver and the sender are determined on the basis of at least one shortest path, being a path with the fewest related members among the possible paths that each link all related members passed through before the sender is reached.

The method of claim 8,
wherein, in calculating the connectivity between the receiver and the sender, when the number of shortest paths is counted and the same related member appears on two or more shortest paths, only one of those two or more shortest paths is counted toward the number of shortest paths.

The method of claim 9,
wherein the connectivity between the receiver and the sender is calculated using a minimum cut (MIN-CUT) method.

The method of claim 7 or 8,
wherein, in calculating the degree of association between the receiver and the sender, a RANDOM-WALK algorithm is used to calculate a weight for each related member appearing on each of the shortest paths, and the degree of association is calculated with reference to the weights, and
wherein the weight for a related member is calculated on the basis of the number of related members included in the related member list of that related member.

An apparatus for generating a spam message classifier using messages collected from a specific internet social network site and classified as to whether they are spam messages, the apparatus comprising:
a message database storing the messages classified as to whether they are spam messages;
a member relationship database storing members of the social network site, including the receivers of the messages, and a related member list for each member;
a relationship distance calculation unit that calculates, for each message in the message database, a relationship distance based on the number of related members passed through when the sender is traced from the message receiver, by referring to the related member lists in the member relationship database;
a connectivity calculation unit that calculates a connectivity based on at least one possible path linking the related members passed through from the message receiver to the sender; and
a classifier generation unit that derives relationship distance and connectivity characteristics of spam messages using the messages for which the relationship distance and the connectivity between the receiver and the sender have been calculated, and generates a classifier that classifies spam messages using the derived characteristics.

The apparatus of claim 12,
wherein the relationship distance and the connectivity between the receiver and the sender are determined on the basis of at least one shortest path, being a path with the fewest related members among the possible paths that each link all related members passed through before the sender is reached.

The apparatus of claim 13,
wherein, when the number of shortest paths is counted and the same related member appears on two or more shortest paths, the connectivity between the receiver and the sender counts only one of those two or more shortest paths toward the number of shortest paths, and is calculated using a minimum cut (MIN-CUT) method.

The apparatus of claim 12 or 13,
wherein, in calculating the degree of association between the receiver and the sender, a RANDOM-WALK algorithm is used to calculate a weight for each related member appearing on each of the shortest paths, and the degree of association is calculated with reference to the weights, and
wherein the weight for a related member is calculated on the basis of the number of related members included in the related member list of that related member.

The apparatus of claim 12,
wherein the spam message classifier is generated using any one of the Bagging, LibSVM, FT, J48, and BayesNet algorithms.