KR20160000574A

KR20160000574A - Apparatus and method for filtering image spam

Info

Publication number: KR20160000574A
Application number: KR1020140077664A
Authority: KR
Inventors: 홍충선; 강효성
Original assignee: 경희대학교 산학협력단
Priority date: 2014-06-24
Filing date: 2014-06-24
Publication date: 2016-01-05
Also published as: KR101644231B1

Abstract

The present invention provides an apparatus and method for filtering image spam. The method comprises: determining whether a received message is an image message; when the message is an image message, binarizing an image included in the message and converting it to grayscale; extracting local binary pattern (LBP) information from the grayscale image; classifying the extracted LBP information using a support vector machine (SVM) classification algorithm; and determining whether the message is image spam based on the classification result.

Description

APPARATUS AND METHOD FOR FILTERING IMAGE SPAM

Embodiments of the present invention relate to an apparatus and method for filtering image spam using image binarization.

With the rapid growth of the Internet, the number of e-mail users has increased, and the volume of spam messages has grown along with it as personal e-mail addresses are exposed on the Internet. This situation is abused against mail users and has become a cause of virus propagation and exposure of personal information.

Accordingly, various methods of blocking spam messages have been proposed.

One representative method is text-based spam filtering. Text-based spam filtering extracts the text content of a received message and compares it with text that has been classified as spam to determine whether the received message is spam.

Currently, most spam filtering algorithms deployed for mobile text messages and web e-mail are text based.

When a message is received, a text-based spam filtering technique extracts text from the content of the message, retrieves spam text data from a database of text that has been judged and classified as spam, and compares that data with the text content of the received message.

If the comparison finally yields matching information, the message can be determined to be spam.

A text-based spam filtering technique may also include a preprocessing stage. So that the text can be compared against words registered as spam, the preprocessing stage may first remove special characters from the text message, automatically insert spacing between sentences, standardize numeral phrases, and remove stop words such as slang that does not appear in an actual dictionary. After this preprocessing, the technique can determine whether the text message is spam by using a support vector machine (SVM) trained on patterns of the filtering words.
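The preprocessing stage described above can be sketched roughly as follows. This is only an illustration of two of the listed steps (special-character removal and stop-word removal); the function name `preprocess` and the contents of `STOP_WORDS` are hypothetical and do not come from the source.

```python
import re

# Hypothetical stop-word list; a real filter would load a curated dictionary.
STOP_WORDS = {"lol", "omg"}


def preprocess(text, stop_words=STOP_WORDS):
    """Strip special characters, lowercase, tokenize, and drop stop words."""
    # Replace every character that is neither a word character nor whitespace.
    cleaned = re.sub(r"[^\w\s]", " ", text)
    tokens = cleaned.lower().split()
    return [token for token in tokens if token not in stop_words]
```

The resulting token list is the kind of input that a word-pattern classifier could then consume.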

Existing text-based spam filtering techniques operate on the premise that text can be extracted from the received message. However, image spam that contains almost no text, or that contains only images or HTML, offers almost no text to compare, so it is difficult to block with existing text-based spam filtering techniques.

An embodiment of the present invention provides an apparatus and method for filtering image spam messages, which have been difficult to block with spam filtering because their content is in image form rather than text.

An embodiment of the present invention provides a method for blocking image spam included in text messages on mobile devices, e-mail received through web accounts on the Internet, and the like.

An embodiment of the present invention makes it possible to guard against viruses and malicious programs spread through image spam.

An image spam filtering apparatus according to an embodiment of the present invention includes: a message determination unit that determines whether a received message is an image message; an image conversion unit that, when the message is an image message, binarizes an image included in the message and converts it to grayscale; an LBP extraction unit that extracts local binary pattern (LBP) information from the grayscale image; and a spam determination unit that classifies the extracted LBP information using a support vector machine (SVM) classification algorithm and determines whether the message is image spam based on the classification result.

According to one aspect of the present invention, the spam determination unit may analyze the gray levels of the converted image and determine the message to be image spam when the image shows a preset frequency between preset gray levels.

The image spam filtering apparatus according to one aspect of the present invention may further include a learning unit that selects some of the patterns extracted from the LBP information as training data and trains an SVM on the selected training data.

According to one aspect of the present invention, the spam determination unit may determine whether the message is image spam by using the boundary (hyperplane) learned from the training data as the criterion, based on the patterns of all image spam learned in advance.

According to one aspect of the present invention, the LBP extraction unit may compute edge-LBP (eLBP) information by applying weights to the LBP information for each pixel unit of the converted image, and the spam determination unit may classify the eLBP information using the SVM classification algorithm and determine whether the message is image spam based on the classification result.

An image spam filtering method according to an embodiment of the present invention includes: determining whether a received message is an image message; when the message is an image message, binarizing an image included in the message and converting it to grayscale; extracting local binary pattern (LBP) information from the grayscale image; and classifying the extracted LBP information using a support vector machine (SVM) classification algorithm and determining whether the message is image spam based on the classification result.

The image spam filtering method according to one aspect of the present invention may further include analyzing the gray levels of the converted image and determining the message to be image spam when the image shows a preset frequency between preset gray levels.

The image spam filtering method according to one aspect of the present invention may further include selecting some of the patterns extracted from the LBP information as training data, and training an SVM on the selected training data.

The image spam filtering method according to one aspect of the present invention may further include determining whether the message is image spam by using the boundary (hyperplane) learned from the training data as the criterion, based on the patterns of all image spam learned in advance.

The image spam filtering method according to one aspect of the present invention may further include computing edge-LBP (eLBP) information by applying weights to the LBP information for each pixel unit of the converted image, and classifying the eLBP information using the SVM classification algorithm and determining whether the message is image spam based on the classification result.

According to an embodiment of the present invention, image spam messages, which have been difficult to block with spam filtering because their content is in image form rather than text, can be filtered.

According to an embodiment of the present invention, image spam included in text messages on mobile devices, e-mail received through web accounts on the Internet, and the like can be blocked.

According to an embodiment of the present invention, it is possible to guard against viruses and malicious programs spread through image spam.

FIG. 1 is a block diagram illustrating the configuration of an image spam filtering apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating an example of image conversion according to one aspect of the present invention.
FIG. 3 is a diagram illustrating an example of extracting LBP information according to one aspect of the present invention.
FIG. 4 is a diagram illustrating an eLBP (edge-LBP) feature extraction process according to one aspect of the present invention.
FIG. 5 is a diagram illustrating an example of classification into categories by an SVM.
FIG. 6 is a diagram illustrating an example of selecting a hyperplane in the SVM classification algorithm.
FIG. 7 is a flowchart illustrating an image spam filtering method according to one aspect of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings and the content shown in them, but the present invention is not restricted or limited by the embodiments.

In describing the present invention, a detailed description of related known functions or configurations is omitted where it is judged that it could unnecessarily obscure the gist of the invention. The terminology used in this specification is chosen to appropriately express the embodiments of the present invention, and it may vary according to the intent of a user or operator or the conventions of the field to which the invention belongs. The definitions of these terms should therefore be based on the content of this specification as a whole.

FIG. 1 is a block diagram illustrating the configuration of an image spam filtering apparatus according to an embodiment of the present invention.

Referring to FIG. 1, an image spam filtering apparatus according to an embodiment of the present invention includes a message determination unit 110, an image conversion unit 120, an LBP extraction unit 130, a spam determination unit 140, and a learning unit 150.

According to one aspect of the present invention, the image spam filtering apparatus may include a control unit 160, and the control unit 160 may control the overall operation of the image spam filtering apparatus. For example, the control unit 160 may relay data between connected modules, or may generate control data to control each module. Each module may be driven under the control of the control unit 160, or may be implemented independently of the control unit 160.

The message determination unit 110 determines whether the received message is an image message.

For example, the message determination unit 110 may determine the message to be an image message when the received message contains an image file, or when it also contains a text file but the proportion of image files is high.
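The determination rule in this example can be sketched as below. The description does not say how a message's parts are represented or what counts as a "high" proportion, so the MIME-type list and the `min_image_ratio` default of 0.5 are assumptions made for illustration only.

```python
def is_image_message(mime_types, min_image_ratio=0.5):
    """Decide whether a message should be treated as an image message.

    mime_types: MIME type strings of the message parts (assumed input shape).
    min_image_ratio: hypothetical threshold for a "high" proportion of images.
    """
    if not mime_types:
        return False
    image_count = sum(1 for mime in mime_types if mime.startswith("image/"))
    if image_count == 0:
        return False
    if image_count == len(mime_types):
        # The message contains only image files.
        return True
    # Mixed content: require a high proportion of image files.
    return image_count / len(mime_types) >= min_image_ratio
```

A message with one image and two text parts would fall below the assumed threshold, while two images and one text part would pass it.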

When the message is an image message, the image conversion unit 120 binarizes the image included in the message and converts it to grayscale.

For example, once the message is determined to be an image message, the image conversion unit 120 may binarize the image file included in the image message and convert it to grayscale (for example, by a method that binarizes the image).

FIG. 2 is a diagram illustrating an example of image conversion according to one aspect of the present invention.

Referring to FIG. 2, when the images included in the received message are color images 211 and 212, the image conversion unit 120 may binarize the color images 211 and 212 and convert them into grayscale images 221 and 222.
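A minimal sketch of such a conversion is shown below. The description does not give a conversion formula, so the luminance weights (the common 0.299, 0.587, 0.114 weighting) and the binarization threshold of 128 are assumptions, as are the function names.

```python
def to_grayscale(rgb_rows):
    """Convert rows of (r, g, b) pixels to rows of gray levels in 0..255.

    The luminance weights are an assumption, not taken from the description.
    """
    return [
        [int(round(0.299 * r + 0.587 * g + 0.114 * b)) for (r, g, b) in row]
        for row in rgb_rows
    ]


def binarize(gray_rows, threshold=128):
    """Map each gray level to 1 (at or above threshold) or 0 (below)."""
    return [[1 if value >= threshold else 0 for value in row] for row in gray_rows]
```

The grayscale rows produced here are the form of input that the later feature extraction sketches assume.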

The LBP extraction unit 130 extracts local binary pattern (LBP) information from the grayscale image.

FIG. 3 is a diagram illustrating an example of extracting LBP information according to one aspect of the present invention.

Referring to FIGS. 2 and 3, the LBP extraction unit 130 may extract, from the grayscale images 221 and 222 of FIG. 2, LBP information that represents the texture of the image as binary data, and build histograms 311 and 312 from it. LBP is a method of extracting image information as binary patterns, and it can express features of image texture.
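For orientation, the widely used basic 3×3 LBP operator and its histogram can be sketched as follows. This is the textbook formulation, not necessarily the exact variant used in the embodiment; the neighbor ordering and the function names are assumptions.

```python
# Neighbor offsets, clockwise from the top-left pixel (ordering is an assumption).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(gray, row, col):
    """Return the 8-bit LBP code of the pixel at (row, col)."""
    center = gray[row][col]
    code = 0
    for bit, (dr, dc) in enumerate(OFFSETS):
        # A neighbor at least as bright as the center contributes a 1 bit.
        if gray[row + dr][col + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(gray):
    """Count LBP codes over all interior pixels; returns a 256-bin list."""
    hist = [0] * 256
    for row in range(1, len(gray) - 1):
        for col in range(1, len(gray[0]) - 1):
            hist[lbp_code(gray, row, col)] += 1
    return hist
```

The 256-bin histogram is a fixed-length feature vector, which is what makes it convenient as classifier input regardless of image size.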

According to one aspect of the present invention, the LBP extraction unit 130 may also apply edge-LBP (eLBP), which expresses a large amount of texture information with few dimensions.

To aid understanding of the present invention, the eLBP feature extraction technique is described below. eLBP feature extraction is a scheme designed to express the boundary (edge) information of textures.

FIG. 4 is a diagram illustrating an eLBP (edge-LBP) feature extraction process according to one aspect of the present invention.

Referring to FIG. 4, the eLBP feature extraction technique operates on 3×3 pixel blocks 410: each position is expressed as 1 if the pixel value of the center pixel is greater than the difference between the pixel values of the center pixel and the neighboring pixel, and as 0 if it is smaller (420).

The values of the resulting pixel block 420 may each be multiplied by a weight and summed separately for the vertical and horizontal components. For example, the values of the pixel block 420 may be multiplied by the weights 1, 1, 2, 2, 4, 4, 8, and 8, respectively, and summed by vertical and horizontal component (430, 440).

[Equation 1]

$$eLBP_{V-h} = 2$$

$$eLBP_{diag} = 2 + 4 = 6$$

[Equation 2]

$$\therefore\ eLBP_{total} = eLBP_{V-h} + eLBP_{diag} = 2 + 6 = 8$$

Here, eLBP_V-h is a value expressing the vertical and horizontal information of the pixels, and eLBP_diag is a value expressing the diagonal information of the pixels.

Finally, the eLBP_total value of the image can be expressed as the sum of eLBP_V-h and eLBP_diag.
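The weighted sums can be reproduced with a short sketch. The description lists the weights 1, 1, 2, 2, 4, 4, 8, 8 but does not state which neighbor receives which weight, so the assignment below (weights 1, 2, 4, 8 for top, right, bottom, left, and the same weights for the four diagonal neighbors clockwise from top-left) is purely an assumption, chosen so that a sample block yields the values in Equations 1 and 2.

```python
# Assumed weight assignment; the source only lists 1, 1, 2, 2, 4, 4, 8, 8.
WEIGHTS = (1, 2, 4, 8)


def split_block(bits):
    """Split a thresholded 3x3 block into (vertical/horizontal, diagonal) bits.

    Assumed orders: (top, right, bottom, left) and
    (top-left, top-right, bottom-right, bottom-left).
    """
    vh = (bits[0][1], bits[1][2], bits[2][1], bits[1][0])
    diag = (bits[0][0], bits[0][2], bits[2][2], bits[2][0])
    return vh, diag


def elbp(bits):
    """Return (eLBP_V-h, eLBP_diag, eLBP_total) for a thresholded 3x3 block."""
    vh_bits, diag_bits = split_block(bits)
    vh = sum(w * b for w, b in zip(WEIGHTS, vh_bits))
    diag = sum(w * b for w, b in zip(WEIGHTS, diag_bits))
    return vh, diag, vh + diag


# Hypothetical thresholded block (the center value is unused).
sample = [[0, 0, 1],
          [0, 0, 1],
          [0, 0, 1]]
print(elbp(sample))
```

Under this assumed layout, the sample block gives 2 for the vertical and horizontal part, 2 + 4 = 6 for the diagonal part, and 8 in total, which matches the worked values in the equations.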

The spam determination unit 140 classifies the extracted LBP information using a support vector machine (SVM) classification algorithm and determines whether the message is image spam based on the classification result.

For example, the spam determination unit 140 may analyze the gray levels of the converted image and determine the message to be image spam when the image shows a preset frequency between preset gray levels.

Specifically, as shown in FIG. 3, image spam on average tends to show high frequencies at gray levels between 155 and 170 and between 220 and 230. These characteristics can serve as a basis for setting an accurate classification threshold in SVM training and can be an important criterion for deciding whether an image is spam.
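A gray-level check of this kind might look like the sketch below. The two ranges come from the description; the idea of measuring the fraction of pixels in those ranges and the `min_fraction` threshold of 0.5 are assumptions, since the description only speaks of a "preset frequency".

```python
# Gray-level ranges named in the description.
SPAM_BANDS = ((155, 170), (220, 230))


def band_fraction(gray, bands=SPAM_BANDS):
    """Return the fraction of pixels whose gray level lies in any band."""
    pixels = [value for row in gray for value in row]
    if not pixels:
        return 0.0
    hits = sum(1 for value in pixels if any(lo <= value <= hi for lo, hi in bands))
    return hits / len(pixels)


def looks_like_image_spam(gray, min_fraction=0.5):
    """Flag the image when enough pixels fall in the bands.

    min_fraction is a hypothetical stand-in for the "preset frequency".
    """
    return band_fraction(gray) >= min_fraction
```

In practice such a heuristic would more plausibly feed the classifier as one signal among several than act as the sole decision rule.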

In addition, the LBP extraction unit 130 may compute edge-LBP (eLBP) information by applying weights to the LBP information for each pixel unit of the converted image, and the spam determination unit 140 may classify the eLBP information using the SVM classification algorithm and determine whether the message is image spam based on the classification result.

Referring again to FIG. 1, the learning unit 150 may select some of the patterns extracted from the LBP information as training data and train an SVM on the selected training data.

The spam determination unit 140 may determine whether the message is image spam by using the boundary (hyperplane) learned from the training data as the criterion, based on the patterns of all image spam learned in advance.

The support vector machine (SVM) referred to throughout this description is a machine learning technique; the SVM classification algorithm aims to find the point (boundary) that separates given data.

FIG. 5 is a diagram illustrating an example of classification into categories by an SVM, and FIG. 6 is a diagram illustrating an example of selecting a hyperplane in the SVM classification algorithm.

Referring to FIGS. 5 and 6, the SVM classification algorithm can be represented in a vector space and can classify data whether the separation is linear or nonlinear. Among vectors such as those in FIGS. 5 and 6, the polygon formed by connecting the outermost vectors of a category is called the convex hull. The vectors inside the convex hull do not affect how the groups are separated, whereas the outermost vectors can have a large influence.

As shown in FIG. 5, the outermost vectors are called support vectors, and the line or plane that separates the support vectors is called a hyperplane.

As shown in FIG. 6, multiple candidate hyperplanes (H1, H2, H3) may exist, but among them H2, the hyperplane with the greatest distance from the support vectors, separates the two groups most effectively.
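The selection criterion can be illustrated by computing, for each candidate hyperplane, its smallest distance to any correctly oriented training point and picking the candidate with the largest such margin. The points and the coefficients given to H1, H2, and H3 below are made up for illustration and do not correspond to the actual figure.

```python
import math


def margin(w, b, points):
    """Smallest signed distance y * (w.x + b) / ||w|| over labeled 2-D points."""
    norm = math.hypot(w[0], w[1])
    return min(y * (w[0] * x[0] + w[1] * x[1] + b) / norm for x, y in points)


def best_hyperplane(candidates, points):
    """Return the name of the candidate hyperplane with the largest margin."""
    return max(candidates, key=lambda name: margin(*candidates[name], points))


# Hypothetical labeled points: +1 for one group, -1 for the other.
points = [((2, 0), 1), ((3, 1), 1), ((-2, 0), -1), ((-3, -1), -1)]

# Hypothetical candidates as (w, b) for the line w.x + b = 0.
candidates = {
    "H1": ((1, 0), 1.5),  # separates the groups, but passes close to one of them
    "H2": ((1, 0), 0.0),  # separates the groups with the widest gap
    "H3": ((0, 1), 0.0),  # passes through points of both groups
}
print(best_hyperplane(candidates, points))
```

A real SVM solver finds the maximum-margin hyperplane by optimization rather than by comparing a fixed set of candidates; the comparison here only mirrors the way FIG. 6 is explained.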

The learning unit 150 may also randomly extract only some of the spam and non-spam patterns obtained through eLBP feature extraction and select them as training data. For example, the learning unit 150 may randomly extract only about 20% of the spam and non-spam patterns and select them as training data.
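Random selection of roughly 20% of the patterns can be sketched with the standard library. The function name, the `seed` parameter, and the choice to keep at least one sample are assumptions for illustration.

```python
import random


def sample_training_data(patterns, fraction=0.2, seed=None):
    """Randomly pick about `fraction` of the patterns as training data."""
    if not patterns:
        return []
    rng = random.Random(seed)  # seed is optional and only aids reproducibility
    count = max(1, round(len(patterns) * fraction))
    return rng.sample(patterns, count)
```

Calling this separately on the spam and non-spam pattern lists would keep both classes represented in the training set.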

According to one aspect of the present invention, the selected training data may be used to train the SVM, and the spam determination unit 140 may take the patterns of all image spam as input and decide whether the received message is image spam by using the hyperplane selected from the trained data as the criterion.
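To make the train-then-classify flow concrete, here is a small linear SVM trained by batch subgradient descent on the regularized hinge loss. This is a simplified stand-in chosen for readability, not the solver used in the embodiment (the description names none); the hyperparameters and the two-dimensional toy features are hypothetical.

```python
def train_linear_svm(samples, labels, lr=0.1, lam=0.01, epochs=100):
    """Fit w and b by batch subgradient descent on the hinge loss.

    samples: feature vectors (for example, LBP histograms); labels: +1 or -1.
    """
    dim = len(samples[0])
    count = len(samples)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]  # gradient of the L2 regularizer
        grad_b = 0.0
        for x, y in zip(samples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score < 1:  # sample lies inside the margin
                for i in range(dim):
                    grad_w[i] -= y * x[i] / count
                grad_b -= y / count
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 (treated here as spam) or -1 according to the hyperplane side."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Toy, clearly separable training set with hypothetical features.
train_x = [(2.0, 2.0), (3.0, 3.0), (-2.0, -2.0), (-3.0, -3.0)]
train_y = [1, 1, -1, -1]
w, b = train_linear_svm(train_x, train_y)
```

After training, `predict(w, b, features)` classifies a new feature vector by the side of the learned hyperplane on which it falls.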

An image spam filtering method according to an embodiment of the present invention is described below.

The image spam filtering method according to one aspect of the present invention can be performed using the image spam filtering apparatus described above, and it is therefore described from the perspective of that apparatus.

FIG. 7 is a flowchart illustrating an image spam filtering method according to one aspect of the present invention.

Referring to FIG. 7, the image spam filtering apparatus receives a message (710) and determines whether the received message is an image message (720).

When the message is an image message, the image spam filtering apparatus binarizes the image included in the message and converts it to grayscale (730).

The image spam filtering apparatus extracts local binary pattern (LBP) information from the grayscale image (740).

The image spam filtering apparatus classifies the extracted LBP information using a support vector machine (SVM) classification algorithm (750).

The image spam filtering apparatus determines whether the message is image spam based on the classification result (760).
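The flow of steps 720 to 760 can be summarized in a short sketch in which every stage is passed in as a callable. All names here, the dictionary shape of `message`, the `"spam"` label, and the choice to return `False` for non-image messages are assumptions; the description does not specify what happens to messages that are not image messages.

```python
def filter_message(message, is_image_message, to_grayscale, extract_lbp, classify):
    """Run the flowchart steps; each callable is a hypothetical stand-in."""
    if not is_image_message(message):       # step 720
        return False                        # assumed: left to other filters
    gray = to_grayscale(message["image"])   # step 730
    features = extract_lbp(gray)            # step 740
    label = classify(features)              # step 750
    return label == "spam"                  # step 760
```

Passing the stages in as parameters keeps the sketch independent of any particular image library or classifier.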

The image spam filtering apparatus may analyze the gray levels of the converted image and determine the message to be image spam when the image shows a preset frequency between preset gray levels.

In addition, the image spam filtering apparatus may select some of the patterns extracted from the LBP information as training data and train an SVM on the selected training data (770).

The image spam filtering apparatus may also determine whether the message is image spam by using the boundary (hyperplane) learned from the training data as the criterion, based on the patterns of all image spam learned in advance.

The image spam filtering apparatus may compute edge-LBP (eLBP) information by applying weights to the LBP information for each pixel unit of the converted image, classify the eLBP information using the SVM classification algorithm, and determine whether the message is image spam based on the classification result.

The image spam filtering method according to one aspect of the present invention can block image spam in text messages on mobile devices and in e-mail received through web accounts on the Internet.

The image spam filtering method according to one aspect of the present invention can prevent the distribution of viruses and malicious programs spread through image spam.

The image spam filtering method according to one aspect of the present invention can raise the blocking rate for image spam and reduce the spread of malicious programs and viruses that are distributed by bypassing text-based spam filtering systems.

The image spam filtering method according to one aspect of the present invention can be applied to the mail servers of portal sites without having to redevelop the existing mailing systems.

The image spam filtering method according to one aspect of the present invention can be developed as background software for the text messaging and mobile messenger services of smart mobile devices, so it can be easily installed and applied to those services.

The image spam filtering method according to one aspect of the present invention can also be applied to antivirus programs to guard against malicious software and viruses distributed through image spam.

The method according to an embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiment, and vice versa.

이상과 같이 실시예들이 비록 한정된 실시예와 도면에 의해 설명되었으나, 해당 기술분야에서 통상의 지식을 가진 자라면 상기의 기재로부터 다양한 수정 및 변형이 가능하다. 예를 들어, 설명된 기술들이 설명된 방법과 다른 순서로 수행되거나, 및/또는 설명된 시스템, 구조, 장치, 회로 등의 구성요소들이 설명된 방법과 다른 형태로 결합 또는 조합되거나, 다른 구성요소 또는 균등물에 의하여 대치되거나 치환되더라도 적절한 결과가 달성될 수 있다. While the present invention has been particularly shown and described with reference to exemplary embodiments thereof, it is to be understood that the invention is not limited to the disclosed exemplary embodiments. For example, it is to be understood that the techniques described may be performed in a different order than the described methods, and / or that components of the described systems, structures, devices, circuits, Lt; / RTI > or equivalents, even if it is replaced or replaced.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the following claims.

110: Message determination unit
120: Image conversion unit
130: LBP extraction unit
140: Spam determination unit
150: Learning unit
160: Control unit

Claims

An image spam filtering apparatus comprising:
a message determination unit configured to determine whether a received message is an image message;
an image conversion unit configured to binarize an image included in the message and convert the image into grayscale when the message is an image message;
an LBP extraction unit configured to extract local binary pattern (LBP) information from the image converted into grayscale; and
a spam determination unit configured to classify the extracted LBP information using a support vector machine (SVM) classification algorithm and to determine whether the message is image spam based on the classification result.
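As a non-limiting illustration of the grayscale conversion and LBP extraction recited above, the following minimal Python sketch computes a basic 8-neighbor LBP code for each interior pixel and a normalized 256-bin histogram. The function names, the BT.601 luma weights, the clockwise neighbor ordering, and the `>=` comparison are assumptions made for illustration; this section does not specify them.

```python
# Illustrative sketch only; names and parameter choices are assumptions.

def to_grayscale(rgb):
    """Convert rows of (r, g, b) tuples to 8-bit gray levels (BT.601 luma weights, assumed)."""
    return [[int(round(0.299 * r + 0.587 * g + 0.114 * b)) for (r, g, b) in row]
            for row in rgb]

# Neighbor offsets (dy, dx), clockwise from the top-left pixel (assumed ordering).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_codes(gray):
    """Return the 8-bit LBP code of every interior pixel of a 2D gray-level image."""
    height, width = len(gray), len(gray[0])
    codes = []
    for y in range(1, height - 1):
        row = []
        for x in range(1, width - 1):
            center = gray[y][x]
            code = 0
            for bit, (dy, dx) in enumerate(OFFSETS):
                # Set the bit when the neighbor is at least as bright as the center.
                if gray[y + dy][x + dx] >= center:
                    code |= 1 << bit
            row.append(code)
        codes.append(row)
    return codes

def lbp_histogram(codes):
    """Normalized 256-bin histogram of LBP codes, usable as a feature vector."""
    hist = [0] * 256
    for row in codes:
        for code in row:
            hist[code] += 1
    total = sum(hist)
    return [count / total for count in hist] if total else [0.0] * 256
```

In such a sketch, the histogram returned by `lbp_histogram` would play the role of the LBP information passed to the classifier.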

The image spam filtering apparatus according to claim 1,
wherein the spam determination unit analyzes the gray levels of the converted image and determines the message to be image spam if the gray levels falling between predetermined levels occur with a predetermined frequency.
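By way of a hedged example, the gray-level analysis above could be sketched as a check on the fraction of pixels whose gray level lies in a given band. The band limits (`low`, `high`), the threshold `min_fraction`, and the direction of the comparison are hypothetical placeholders; the claim only states that the levels and the frequency are predetermined.

```python
# Illustrative sketch only; thresholds below are hypothetical placeholders.

def gray_level_fraction(gray, low, high):
    """Fraction of pixels whose gray level lies in the inclusive band [low, high]."""
    total = 0
    in_band = 0
    for row in gray:
        for value in row:
            total += 1
            if low <= value <= high:
                in_band += 1
    return in_band / total if total else 0.0

def looks_like_image_spam(gray, low=200, high=255, min_fraction=0.5):
    """Flag the image when the in-band fraction reaches the (assumed) threshold."""
    return gray_level_fraction(gray, low, high) >= min_fraction
```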

The image spam filtering apparatus according to claim 1,
further comprising a learning unit configured to select some of the patterns extracted from the extracted LBP information as learning data and to learn the selected learning data through a support vector machine (SVM).

The image spam filtering apparatus according to claim 3,
wherein the spam determination unit determines whether the message is image spam based on a hyperplane of the learned learning data, the hyperplane being based on patterns of the entire image spam learned in advance.
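As a non-limiting illustration of learning a separating hyperplane and classifying against it, the sketch below trains a linear SVM by full-batch subgradient descent on the regularized hinge loss. This is a simplified stand-in chosen for illustration, not the training procedure of the embodiments; the label convention (+1 for spam, -1 for non-spam) and the hyperparameters `lam`, `lr`, and `epochs` are assumptions.

```python
# Illustrative sketch only; a simplified linear SVM, not the embodiments' trainer.

def train_linear_svm(samples, labels, lam=0.01, lr=0.1, epochs=200):
    """Learn a hyperplane (w, b) by subgradient descent on the regularized hinge loss."""
    n = len(samples)
    dim = len(samples[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]  # gradient of the L2 regularizer
        grad_b = 0.0
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1.0:  # sample violates the margin: hinge loss is active
                for i in range(dim):
                    grad_w[i] -= y * x[i] / n
                grad_b -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def classify(w, b, x):
    """Return +1 (assumed: spam) or -1 (assumed: non-spam) by the side of the hyperplane."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1
```

In practice the feature vectors would be LBP histograms rather than the toy two-dimensional points such a sketch is easiest to check with.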

The image spam filtering apparatus according to claim 1,
wherein the LBP extraction unit calculates edge-LBP (eLBP) information by applying a weight to the LBP information for each pixel of the converted image, and
the spam determination unit classifies the calculated eLBP information using a support vector machine (SVM) classification algorithm and determines whether the message is image spam based on the classification result.
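One possible reading of the per-pixel weighting above is sketched below: each interior pixel contributes to the histogram bin of its LBP code in proportion to a local edge strength. The specific weight used here (sum of absolute central differences) is a hypothetical choice for illustration; this section does not define the weight, and all function names are assumptions.

```python
# Illustrative sketch only; the edge weight below is a hypothetical choice.

# Neighbor offsets (dy, dx), clockwise from the top-left pixel (assumed ordering).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_code(gray, y, x):
    """8-bit LBP code of the interior pixel at (y, x)."""
    center = gray[y][x]
    code = 0
    for bit, (dy, dx) in enumerate(OFFSETS):
        if gray[y + dy][x + dx] >= center:
            code |= 1 << bit
    return code

def edge_weight(gray, y, x):
    """Assumed edge strength: sum of absolute horizontal and vertical central differences."""
    gx = gray[y][x + 1] - gray[y][x - 1]
    gy = gray[y + 1][x] - gray[y - 1][x]
    return abs(gx) + abs(gy)

def elbp_histogram(gray):
    """256-bin histogram in which each pixel's LBP code is weighted by its edge strength."""
    hist = [0.0] * 256
    for y in range(1, len(gray) - 1):
        for x in range(1, len(gray[0]) - 1):
            hist[lbp_code(gray, y, x)] += edge_weight(gray, y, x)
    total = sum(hist)
    return [value / total for value in hist] if total > 0 else hist
```

With this choice, pixels in flat regions contribute nothing, so the histogram is dominated by patterns located on edges such as text strokes.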

An image spam filtering method comprising the steps of:
determining whether a received message is an image message;
converting an image included in the message into grayscale if the message is an image message;
extracting local binary pattern (LBP) information from the image converted into grayscale; and
classifying the extracted LBP information using a support vector machine (SVM) classification algorithm, and determining whether the message is image spam based on the classification result.

The method according to claim 6,
further comprising the step of analyzing the gray levels of the converted image and determining the message to be image spam if the gray levels falling between predetermined levels occur with a predetermined frequency.

The method according to claim 6, further comprising the steps of:
selecting, as learning data, some of the patterns extracted from the extracted LBP information; and
learning the selected learning data through a support vector machine (SVM).

The method according to claim 8,
further comprising the step of determining whether the message is image spam based on a hyperplane of the learned learning data, the hyperplane being based on patterns of the entire image spam learned in advance.

The method according to claim 6, further comprising the steps of:
calculating edge-LBP (eLBP) information by applying a weight to the LBP information for each pixel of the converted image; and
classifying the calculated eLBP information using a support vector machine (SVM) classification algorithm, and determining whether the message is image spam based on the classification result.