KR102500115B1

KR102500115B1 - Spam message blocking apparatus and method

Info

Publication number: KR102500115B1
Application number: KR1020200136173A
Authority: KR
Inventors: 최보성; 정인정; 이세희; 김승환
Original assignee: 주식회사 엘지유플러스
Priority date: 2020-10-20
Filing date: 2020-10-20
Publication date: 2023-02-16
Also published as: KR20220052156A

Abstract

An exemplary embodiment of the present invention relates to a spam message blocking apparatus and method that improve spam blocking efficiency by preventing the blocking errors that occur when spam messages are blocked with a learning-based (deep learning) approach. A spam message blocking apparatus according to one aspect of the present invention may include: a first classification unit for classifying an input message as spam or non-spam through a deep-learning-based spam filter trained on training data; a calculation unit for calculating a similarity between a message classified as spam by the deep-learning-based spam filter and the training data; and a second classification unit for reclassifying the message classified as spam as spam or non-spam based on the calculated similarity.

Description

Spam message blocking apparatus and method

The present invention relates to spam message blocking technology and, more particularly, to a spam message blocking apparatus and method for improving spam blocking efficiency by reducing the blocking errors that occur when spam messages are blocked with a learning-based (deep learning) approach.

In general, spammers use e-mail or text messages to pursue their goals maliciously. The early spread of the Internet mainly gave rise to the problem of spam e-mail, but the recent growth in smartphone usage has sharply increased the volume of spam text messages as well as spam e-mail.

To block such spam (text) messages, a method has been proposed and used that blocks a message when it matches a spam character string or phrase registered on the basis of a spam text data set. However, spam text data sets are difficult to obtain publicly, and even when one is obtained, the volume of spam data is vast and new patterns are continually added, so it is difficult for people to extract and respond to every spam phrase by hand.

For this reason, learning-based (deep learning) classification models that learn and classify character patterns and features have recently been used to improve spam blocking efficiency. However, the following problems have been found with this learning-based approach: 1) false blocking occurs when test data is absent from the existing training data and the spam probability rises unexpectedly; 2) false blocking occurs when the model has learned the features of surrounding ordinary words rather than the spam phrase itself; and 3) false blocking occurs because of an operational problem in which the criteria for spam change in real time (for example, because of customer complaints or a change of classification staff).

Registered Patent Publication No. 10-2001375 (2019.07.12.)

The present invention is intended to solve the conventional problems described above, and its object is to provide a spam message blocking apparatus and method that improve spam blocking efficiency by preventing the blocking errors that occur when spam messages are blocked with a learning-based (deep learning) approach.

To achieve the above object, a spam message blocking apparatus according to one aspect of the present invention may include: a first classification unit for classifying an input message as spam or non-spam through a deep-learning-based spam filter trained on training data; a calculation unit for calculating a similarity between a message classified as spam by the deep-learning-based spam filter and the training data; and a second classification unit for reclassifying the message classified as spam as spam or non-spam based on the calculated similarity. The apparatus may further include a pre-processing unit for classifying an input message as spam or non-spam based on pre-registered spam phrases, and a message classified as non-spam by the pre-processing unit may be provided as the input message of the first classification unit.

When the similarity value calculated by the calculation unit is less than a preset reference value, the second classification unit may reclassify the message classified as spam by the first classification unit as a non-spam message.

When the similarity value calculated by the calculation unit is greater than or equal to the preset reference value, the second classification unit may reclassify the message classified as spam by the first classification unit as a spam message.

When similar data in the training data corresponding to the message classified as spam by the first classification unit carries contradictory spam and non-spam labels, the second classification unit may reclassify that message as a non-spam message.

For a message classified as spam by the deep-learning-based spam filter, the calculation unit may obtain the average of first similarities to each of a plurality of spam data items in the training data (hereinafter, the average spam similarity) and the average of second similarities to each of a plurality of non-spam data items (hereinafter, the average non-spam similarity), and may calculate, as the similarity, the difference obtained by subtracting the average non-spam similarity from the average spam similarity (hereinafter, the final score value).

The average spam similarity and the average non-spam similarity may each be the average of the top n values, among the calculated first or second similarities respectively, that are greater than or equal to a set minimum value.

The calculation unit may calculate the similarity through at least one of a search similarity comparison method, a string similarity comparison method, and a sentence-embedding-based similarity comparison method, and the search similarity comparison method may include the BM25 algorithm.

To achieve the above object, a spam message blocking method according to one aspect of the present invention may include: (a) classifying an input message as spam or non-spam through a deep-learning-based spam filter trained on training data; (b) calculating a similarity between a message classified as spam by the deep-learning-based spam filter and the training data; and (c) reclassifying the message classified as spam as spam or non-spam based on the calculated similarity. The method may further include (d) a pre-processing step of classifying an input message as spam or non-spam based on pre-registered spam phrases, and a message classified as non-spam in the pre-processing step may be provided as the input message of step (a).

In step (c), when the similarity value calculated in step (b) is less than a preset reference value, the message classified as spam in step (a) may be reclassified as a non-spam message.

In step (c), when the similarity value calculated in step (b) is greater than or equal to the preset reference value, the message classified as spam in step (a) may be reclassified as a spam message.

In step (c), when similar data in the training data corresponding to the message classified as spam in step (a) carries contradictory spam and non-spam labels, the message classified as spam in step (a) may be reclassified as a non-spam message.

In step (b), for a message classified as spam by the deep-learning-based spam filter, the average of first similarities to each of a plurality of spam data items in the training data (hereinafter, the average spam similarity) and the average of second similarities to each of a plurality of non-spam data items (hereinafter, the average non-spam similarity) may be obtained, and the difference obtained by subtracting the average non-spam similarity from the average spam similarity (hereinafter, the final score value) may be calculated as the similarity.

The average spam similarity and the average non-spam similarity may each be calculated as the average of the top n values, among the calculated first or second similarities respectively, that are greater than or equal to a set minimum value.

In step (b), the similarity may be calculated through at least one of a search similarity comparison method, a string similarity comparison method, and a sentence-embedding-based similarity comparison method, and the search similarity comparison method may include the BM25 algorithm.

To achieve the above object, according to another aspect of the present invention, a computer-readable recording medium storing a program for executing the spam message blocking method on a computer may be provided.

To achieve the above object, according to yet another aspect of the present invention, an application stored on a computer-readable recording medium may be provided to execute the spam message blocking method in combination with hardware.

To achieve the above object, according to yet another aspect of the present invention, a computer program stored on a computer-readable recording medium may be provided to execute the spam message blocking method on a computer.

As described above, according to various aspects of the present invention, spam blocking efficiency is improved by preventing the blocking errors that occur when spam messages are blocked with a learning-based (deep learning) approach.

That is, the conventional learning-based spam blocking approach suffers from the following problems: 1) false blocking occurs when test data is absent from the existing training data and the spam probability rises unexpectedly; 2) false blocking occurs when the model has learned the features of surrounding ordinary words rather than the spam phrase itself; and 3) false blocking occurs because of an operational problem in which the criteria for spam change in real time (for example, because of customer complaints or a change of classification staff). The present invention addresses all of these problems and prevents false blocking, thereby improving spam blocking efficiency.

FIG. 1 is a block diagram of a spam message blocking apparatus according to an exemplary embodiment of the present invention;
FIG. 2 is a flowchart of a spam message blocking method according to an exemplary embodiment of the present invention;
FIG. 3 is a detailed flowchart of the similarity calculation step of FIG. 2;
FIG. 4 is a detailed flowchart of the second classification step of FIG. 2; and
FIG. 5 is a graph showing the spam false-blocking prevention rate according to an exemplary embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. In assigning reference numerals to the components of each drawing, the same components are given the same numerals wherever possible, even when they appear in different drawings. In addition, where a detailed description of a related known configuration or function is judged likely to obscure the gist of the present invention, that detailed description is omitted.

FIG. 1 is a block diagram of a spam message blocking apparatus according to an exemplary embodiment of the present invention. As shown in the figure, the spam message blocking apparatus 10 may include a pre-processing unit 11, a first classification unit 13, a similarity calculation unit 15, and a second classification unit 17.

The pre-processing unit 11 classifies an input message as spam or non-spam based on pre-registered spam phrases. For example, if the input message matches a pre-registered spam phrase, the unit classifies the message as spam and blocks it; otherwise it classifies the message as non-spam and processes it normally, that is, provides it as input to the first classification unit 13.
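
The phrase-matching behaviour of the pre-processing unit can be illustrated with a minimal sketch. The function name `prefilter`, the case-insensitive substring test, and the label strings are assumptions made for illustration; the text above states only that a message matching a registered phrase is classified as spam.

```python
from typing import Iterable


def prefilter(message: str, registered_phrases: Iterable[str]) -> str:
    """Return "spam" if the message contains a registered phrase, else "non-spam".

    Assumption: "matching" is modelled as a case-insensitive substring test.
    """
    text = message.lower()
    for phrase in registered_phrases:
        # Skip empty entries, which would otherwise match every message.
        if phrase and phrase.lower() in text:
            return "spam"
    return "non-spam"
```

Messages labelled "non-spam" here would be the ones passed on to the first classification unit.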

The first classification unit 13 classifies an input message as spam or non-spam through a deep-learning-based spam filter trained on training data. For example, it receives a message that the pre-processing unit 11 has classified as non-spam, classifies that message as spam or non-spam through the deep-learning-based spam filter, outputs messages classified as non-spam normally, and provides messages classified as spam as input to the similarity calculation unit 15 and the second classification unit 17.

The similarity calculation unit 15 calculates the similarity between a message classified as spam by the deep-learning-based spam filter of the first classification unit 13 and the training data used to train that filter. For example, the similarity may be calculated through at least one of a search similarity comparison method, a string similarity comparison method, and a sentence-embedding-based similarity comparison method.

In this embodiment, the similarity may be calculated through the following BM25 algorithm, which is widely used as a search similarity comparison method.
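
The equation itself is not reproduced in this text. As a reference point, and as an assumption about what the omitted equation shows, BM25 is commonly written in the following standard form for a query Q with terms q_1, ..., q_n and a document D. IDF(q_i) denotes the inverse document frequency weight of q_i, whose exact form varies between implementations.

```latex
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
```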

Here, f(qi, D) is the frequency of qi in document D, |D| is the length of document D, and avgdl is the average document length. k1 and b are free parameters; typically k1 is chosen in [1.2, 2.0] and b is set to 0.75.

The calculation unit 15 calculates a first similarity (BM25 score) between a first message classified as spam by the deep-learning-based spam filter and each of a plurality of spam data items in the training data, and calculates the average of the calculated first similarities (the top n) (hereinafter, the average spam similarity).

It also calculates a second similarity (BM25 score) between the first message and each of a plurality of non-spam data items in the training data, and calculates the average of the calculated second similarities (the top n) (hereinafter, the average non-spam similarity).
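
As a rough sketch of how per-item BM25 scores of this kind might be computed, the function below scores one tokenised message against every tokenised training item. The function name `bm25_scores`, the default k1 of 1.5, and the smoothed IDF variant used here are assumptions chosen for illustration, not details taken from the patent. Tokenisation is left to the caller; Korean text would typically need a morphological analyser rather than whitespace splitting.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Return one BM25 score per document in `docs` for the tokenised `query`."""
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # a term absent from the document contributes nothing
            # Smoothed IDF (assumed variant), always positive.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Calling the function once with the spam items and once with the non-spam items would give the first and second similarities, respectively.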

The difference obtained by subtracting the average non-spam similarity from the average spam similarity (hereinafter, the final score value) may then be calculated and used as the final similarity value.

Final score value = average spam similarity - average non-spam similarity

The average spam similarity and the average non-spam similarity may each be calculated as the average of the top n values, among the calculated first or second similarities respectively, that are greater than or equal to a set minimum value.
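
The averaging and subtraction described above can be sketched as follows. The names `top_n_average` and `final_score`, the default values, and the choice to return 0.0 when no score reaches the minimum are assumptions for illustration only.

```python
from typing import Sequence


def top_n_average(scores: Sequence[float], n: int, min_score: float) -> float:
    """Average the n highest scores that are >= min_score (0.0 if none qualify)."""
    kept = sorted((s for s in scores if s >= min_score), reverse=True)[:n]
    return sum(kept) / len(kept) if kept else 0.0


def final_score(spam_scores: Sequence[float], non_spam_scores: Sequence[float],
                n: int = 3, min_score: float = 0.0) -> float:
    """Average spam similarity minus average non-spam similarity."""
    return (top_n_average(spam_scores, n, min_score)
            - top_n_average(non_spam_scores, n, min_score))
```

With this definition, a message that resembles many spam items and few non-spam items gets a high final score, while a message that resembles nothing in the training data gets a score near zero.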

The second classification unit 17 finds any false-spam messages (that is, messages that are not spam but were classified as spam) among the messages classified as spam by the first classification unit 13 and reclassifies them as normal. It reclassifies each message classified as spam by the first classification unit 13 as spam or non-spam based on the similarity calculated by the similarity calculation unit 15, thereby reducing false-spam messages.

As one example, when the similarity value (that is, the final score value) calculated by the calculation unit 15 for a message classified as spam by the first classification unit 13 is less than a preset reference value, the second classification unit 17 may reclassify that spam message as a non-spam message.

As another example, when the similarity value (that is, the final score value) calculated by the calculation unit 15 for a message classified as spam by the first classification unit 13 is greater than or equal to the preset reference value, the second classification unit 17 may again classify that message as a spam message.

As yet another example, when the (similarity-based) similar data in the training data corresponding to a message classified as spam by the first classification unit 13 carries contradictory spam and non-spam labels, the second classification unit 17 may reclassify that spam message as a non-spam message.
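
One way to read the contradictory-labelling condition is that both labels appear among the training items most similar to the message. The sketch below follows that reading; the function name `has_contradictory_labels`, the `(score, label)` pair layout, and the `min_similarity` cutoff for what counts as "similar" are all assumptions, since the text does not specify how similar data is selected.

```python
from typing import Iterable, Tuple


def has_contradictory_labels(neighbours: Iterable[Tuple[float, str]],
                             min_similarity: float) -> bool:
    """True if sufficiently similar training items include both labels.

    Each neighbour is a (similarity score, label) pair, where the label is
    "spam" or "non-spam".
    """
    labels = {label for score, label in neighbours if score >= min_similarity}
    return {"spam", "non-spam"} <= labels
```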

As in the example shown in FIG. 5, a reclassification test was run with the reference value of the second classification unit 17 of the spam message blocking apparatus 10 set to 5. For the same input, the conventional approach using only the deep-learning-based spam filter (that is, without reclassification) produced 81 false blocks, whereas reclassification through the second classification unit 17 reduced the number of false blocks to 2, a reduction of about 97% relative to the conventional approach.
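
The stated reduction follows directly from the two counts. As a quick check of the arithmetic:

```latex
\frac{81 - 2}{81} = \frac{79}{81} \approx 0.975
```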

FIG. 2 is a flowchart of a spam message blocking method according to an exemplary embodiment of the present invention. Because the method is applied to the spam message blocking apparatus 10 of FIG. 1, it is described together with the operation of that apparatus 10.

First, when a message is input to the spam message blocking apparatus 10, in the pre-processing step (S210) the pre-processing unit 11 compares the input message with pre-registered spam messages (including words, phrases, sentences, and the like). If they match, the message is classified as spam and blocked; if they do not match, it is classified as non-spam and provided as input to the first classification unit 13 (S210).

Next, in the first classification step (S220), the first classification unit 13 receives the message that the pre-processing unit 11 classified as non-spam in step S210 and classifies it as spam or non-spam through the deep-learning-based spam filter. Messages classified as non-spam are output normally, and messages classified as spam are provided as input to the similarity calculation unit 15 and the second classification unit 17 (S220).

Next, in the similarity calculation step (S230), the similarity calculation unit 15 calculates the similarity between the message classified as spam in step S220 by the deep-learning-based spam filter of the first classification unit 13 and the training data used to train that filter. The similarity may be calculated through at least one of a search similarity comparison method, a string similarity comparison method, and a sentence-embedding-based similarity comparison method; in this embodiment it is calculated through the BM25 algorithm, which is widely used as a search similarity comparison method (S230).

Finally, in the second classification step (S240), the second classification unit 17 reclassifies each message classified as spam by the first classification unit 13 in the first classification step (S220) as spam or non-spam, based on the similarity calculated by the calculation unit 15 in the similarity calculation step (S230). Any false-spam messages (that is, messages that are not spam but were wrongly classified as spam) among the messages classified as spam in the first classification step (S220) are thus filtered out and reclassified as normal, which reduces false-spam messages (S240).
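
The order of steps S210 to S240 can be summarised in a short sketch. The function `block_spam` and its three callable parameters are hypothetical placeholders: each stage is passed in as a function so that the control flow is visible without committing to any particular filter or similarity method.

```python
from typing import Callable

Label = str  # "spam" or "non-spam"


def block_spam(message: str,
               pre_process: Callable[[str], Label],
               first_classify: Callable[[str], Label],
               second_classify: Callable[[str], Label]) -> Label:
    """Run a message through the stages in the order described for FIG. 2."""
    # S210: registered-phrase check; a match is blocked immediately.
    if pre_process(message) == "spam":
        return "spam"
    # S220: deep-learning-based filter; non-spam is output normally.
    if first_classify(message) == "non-spam":
        return "non-spam"
    # S230 + S240: similarity calculation and reclassification, applied
    # only to messages the first classifier flagged as spam.
    return second_classify(message)
```

Note that the second stage only ever sees messages that the first classifier flagged, so it can reduce false blocks but cannot add new ones.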

FIG. 3 is a flowchart illustrating an example of the detailed configuration of the similarity calculation step (S230) of FIG. 2.

First, a first similarity (BM25 score) between a first message classified as spam by the deep-learning-based spam filter and each of a plurality of spam data items in the training data is calculated using the BM25 algorithm, and the average of the calculated first similarities (the top n) (hereinafter, the average spam similarity) is calculated (S231).

Next, a second similarity (BM25 score) between the first message and each of a plurality of non-spam data items in the training data is calculated using the BM25 algorithm, and the average of the calculated second similarities (the top n) (hereinafter, the average non-spam similarity) is calculated (S233).

Finally, the difference obtained by subtracting the average non-spam similarity of step S233 from the average spam similarity calculated in step S231 (hereinafter, the final score value) is calculated as the final similarity value to be used in S240 (S235).

The average spam similarity of step S231 and the average non-spam similarity of step S233 may each be calculated as the average of the top n values, among the calculated first similarities and second similarities respectively, that are greater than or equal to a preset minimum value.

FIG. 4 is a flowchart illustrating an example of the detailed configuration of the second classification step (S240) of FIG. 2.

First, the second classification unit 17 determines whether the similarity value (that is, the final score value) calculated by the calculation unit 15 for the message classified as spam by the first classification unit 13 is less than a preset reference value (S241).

If step S241 determines that the similarity value is less than the preset reference value, the message classified as spam by the first classification unit 13 is reclassified as non-spam (S243). If step S241 determines that the similarity value is greater than or equal to the preset reference value, the message classified as spam by the first classification unit 13 is kept as spam and processed accordingly (S245).

Meanwhile, in the second classification step (S240), when the (similarity-based) similar data in the training data corresponding to the message classified as spam by the first classification unit 13 carries contradictory spam and non-spam labels, the second classification unit 17 may reclassify that spam message as a non-spam message.
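
The decision rules of steps S241 to S245, together with the contradictory-label rule, can be combined in a small sketch. The function name `reclassify` and the order in which the two rules are checked are assumptions; the text presents them as alternative examples and does not say which takes precedence.

```python
def reclassify(final_score: float, threshold: float,
               contradictory_labels: bool = False) -> str:
    """Second-stage decision for a message the first classifier flagged as spam."""
    # Assumed precedence: contradictory training labels release the message.
    if contradictory_labels:
        return "non-spam"
    # S241 -> S243: below the reference value, treat as a false block.
    if final_score < threshold:
        return "non-spam"
    # S241 -> S245: at or above the reference value, keep as spam.
    return "spam"
```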

According to the method of FIGS. 2 to 4 described above, and as in the example shown in FIG. 5, a reclassification test was run with the reference value of the second classification step (S240) of the spam message blocking method set to 5. For the same input messages, the conventional approach using only the deep-learning-based spam filter (that is, without reclassification) produced 81 false blocks, whereas reclassification through the second classification step (S240) reduced the number of false blocks to 2, a reduction of about 97% relative to the conventional approach.

According to the present invention described above, conventional learning-based spam blocking suffers from problems such as: 1) false blocks that occur when test data is absent from the existing training data and the spam probability rises unexpectedly; 2) false blocks that occur when the model has learned the characteristics of surrounding ordinary words rather than the spam phrases; and 3) false blocks caused by an operational problem in which the criteria for spam change in real time (for example, because of customer complaints or replacement of labeling staff). The present invention resolves all of these existing problems and prevents false blocking, thereby improving spam message blocking efficiency. Examples are described below.

1) False blocks caused by an unexpected rise in spam probability when the test data is not in the existing training data:

For example, consider an input message such as "[Web발신] 대왕암아구찜 모든메뉴 배달됩니다. 배달료는 무료 매일 정상영업 합니다, 코로나19 조심하세요" ("[Web-originated] Daewangam Agujjim: all menu items are delivered. Delivery is free and we are open as usual every day. Take care against COVID-19"). Under the conventional learning-based spam blocking method, the training data contains no similar spam or non-spam phrase, yet the spam probability rose to 99% for no identifiable reason. This appears to be the out-of-distribution (OOD) problem that commonly arises in deep learning: when a message is absent from the training data the probability should come out at 50%, but it does not. In contrast, the apparatus and method of the present invention use a post-processor that checks, by similarity comparison, whether the message is present in the training data. If it is not, the final similarity score is lower, and when the score falls below the reference value (threshold) the message is reclassified as non-spam, which resolves the problem.
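
The claims define the final score as the difference between the average similarity to spam training data and the average similarity to non-spam training data, each average taken over the top n similarities that are at or above a set minimum. A minimal sketch under those definitions follows. The function names, the parameters `n` and `min_value`, and the choice to treat an empty selection as an average of 0.0 are assumptions, not details taken from the patent.

```python
from typing import Sequence


def top_n_mean(similarities: Sequence[float], n: int, min_value: float) -> float:
    """Average of the top-n similarities that are at or above min_value."""
    kept = sorted((s for s in similarities if s >= min_value), reverse=True)[:n]
    # Assumption: with no qualifying training item, the average is taken as 0.0.
    return sum(kept) / len(kept) if kept else 0.0


def final_score(
    spam_sims: Sequence[float],
    non_spam_sims: Sequence[float],
    n: int,
    min_value: float,
) -> float:
    """Spam similarity average minus non-spam similarity average."""
    return top_n_mean(spam_sims, n, min_value) - top_n_mean(non_spam_sims, n, min_value)
```

Under these assumptions, a message with no sufficiently similar training item on either side receives a final score of 0.0, which falls below any positive reference value and leads to reclassification as non-spam.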

2) False blocks that occur when the model has learned the characteristics of surrounding ordinary words rather than spam phrases:

For example, consider an input message such as "내일 간단한 내쟝 작업하실 수 있어요 일산인력 010 6810 9000" ("You can do a simple interior job tomorrow. Ilsan Manpower 010 6810 9000") or "안녕하세요^^ 애견호텔1번가입니다. 4주년 감사이벤트 진행합니다. ?'호텔1박 만원'? 3월1일부터~(회원가입필수)" ("Hello ^^ This is Pet Hotel 1st Street. We are running a 4th anniversary thank-you event. 'One hotel night for 10,000 won' from March 1 (membership registration required)"). Under the conventional learning-based spam blocking method, "쟝" and "주년 감사이벤트" ("anniversary thank-you event") are not spam phrases, but the training data contains many spam items that include them, so the model focuses on those specific phrases and a false block occurs. With the apparatus and method of the present invention, spam status is reclassified by comparing the similarity of the entire message rather than of some phrases, so the false block can be prevented.
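
The difference between matching a phrase and comparing whole messages can be illustrated with a string similarity measure. The sketch below uses Python's standard `difflib.SequenceMatcher` as a stand-in; the patent names string similarity only as one possible comparison method and does not specify this function. All three messages are hypothetical English examples invented for the illustration.

```python
from difflib import SequenceMatcher


def whole_message_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] computed over the full text of both messages."""
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical data, invented for this illustration only.
spam_training_message = (
    "anniversary thanks event: free casino chips, join now at the link below"
)
repeated_spam = (
    "anniversary thanks event: free casino chips, join today at the link below"
)
ordinary_message = (
    "Hello, this is the pet hotel. We are holding a 4th anniversary thanks event, "
    "one night for 10,000 won."
)

# Both incoming messages share the same phrase with the spam training item.
shared_phrase = "anniversary thanks event"
assert shared_phrase in repeated_spam and shared_phrase in ordinary_message

sim_repeat = whole_message_similarity(repeated_spam, spam_training_message)
sim_ordinary = whole_message_similarity(ordinary_message, spam_training_message)
print(f"repeated spam: {sim_repeat:.2f}, ordinary message: {sim_ordinary:.2f}")
```

The near-duplicate spam scores much higher than the ordinary message even though both contain the shared phrase, which is the property the whole-message comparison relies on.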

In addition, because similarity is compared over the entire message rather than over some phrases, repeatedly received smishing messages can be blocked without blocking ordinary messages.

[Table 1]

For example, when a deep learning model trained on the phrases in Table 1 falsely blocks "폰이 안되서 인터넷으로 보내요" ("My phone isn't working, so I'm sending this over the internet") as spam, the apparatus and method of the present invention can prevent the false block because, although the message partially matches the training data, its whole-message similarity to the smishing messages is low.

3) False blocks caused by an operational problem in which the criteria for spam change in real time (customer complaints and replacement of labeling staff):

For example, when the spam and non-spam labels of similar data are inconsistent, the apparatus and method of the present invention yield a lower overall similarity score, so the message is reclassified as non-spam and the false block can be prevented.

Meanwhile, the spam message blocking method described above may be implemented as a computer-readable recording medium on which a program for executing the method on a computer is recorded.

The spam message blocking method described above may also be implemented as an application stored on a computer-readable recording medium in order to execute the method in combination with hardware.

The spam message blocking method described above may further be implemented as a computer program stored on a computer-readable recording medium in order to execute the method on a computer.

For example, as described above, the spam message blocking method according to an exemplary embodiment of the present invention may be implemented as a computer-readable recording medium containing program instructions for performing various computer-implemented operations, or as an application stored on such a recording medium. The computer-readable recording medium may include program instructions, local data files, local data structures, and the like, alone or in combination. The recording medium may be specially designed and constructed for embodiments of the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like.

The above description merely illustrates the technical idea of the present invention, and those of ordinary skill in the art to which the present invention pertains may make various modifications and variations without departing from its essential characteristics. Accordingly, the embodiments disclosed herein are intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within a scope equivalent to them should be construed as falling within the scope of rights of the present invention.

10: spam message blocking device
11: preprocessing unit
13: first classification unit
15: similarity calculation unit
17: second classification unit

Claims

A spam message blocking device comprising:
a first classification unit for classifying an input message as spam or non-spam through a deep learning-based spam filter trained on training data;
a calculation unit for calculating a similarity between a message classified as spam through the deep learning-based spam filter and the training data; and
a second classification unit for reclassifying the message classified as spam as spam or non-spam based on the calculated similarity,
wherein, when a plurality of similar data in the training data that correspond to the message classified as spam by the first classification unit and have a similarity value equal to or greater than a predetermined value carry contradictory spam and non-spam labels (that is, when similar data labeled as spam and similar data labeled as non-spam are mixed), the second classification unit reclassifies the message classified as spam by the first classification unit as a non-spam message.

According to claim 1,
Wherein the second classification unit reclassifies the message classified as spam by the first classification unit as a non-spam message when the similarity value calculated by the calculation unit is less than a predetermined reference value.

According to claim 1,
Wherein the second classification unit reclassifies the corresponding message classified as spam by the first classification unit as a spam message when the similarity value calculated by the calculation unit is equal to or greater than a predetermined reference value.

(Deleted)

According to claim 1,
The spam message blocking device, wherein the calculation unit obtains, for the message classified as spam through the deep learning-based spam filter, an average of first similarities to each of a plurality of spam data in the training data (hereinafter, the spam similarity average) and an average of second similarities to each of a plurality of non-spam data (hereinafter, the non-spam similarity average), and calculates the difference between the spam similarity average and the non-spam similarity average (hereinafter, the final score) as the similarity value.

According to claim 5,
The spam message blocking device, wherein each of the spam similarity average and the non-spam similarity average is the average of the top n values, among the calculated first or second similarities respectively, that are equal to or greater than a set minimum value.

According to claim 1,
The spam message blocking device, wherein the calculation unit calculates the similarity through at least one of a search similarity comparison method, a string similarity comparison method, and a sentence embedding-based similarity comparison method.

According to claim 7,
The spam message blocking device, characterized in that the search similarity comparison method comprises a BM25 algorithm.

According to claim 1,
The spam message blocking device, further comprising a preprocessing unit that classifies an input message as spam or non-spam based on pre-registered spam phrases and provides a message classified as non-spam by the preprocessing unit to the first classification unit as its input message.

A spam message blocking method comprising:
(a) classifying an input message as spam or non-spam through a deep learning-based spam filter trained on training data;
(b) calculating a similarity between a message classified as spam through the deep learning-based spam filter and the training data; and
(c) reclassifying the message classified as spam as spam or non-spam based on the calculated similarity,
wherein, in step (c), when a plurality of similar data in the training data that correspond to the message classified as spam in step (a) and have a similarity value equal to or greater than a predetermined value carry contradictory spam and non-spam labels (that is, when similar data labeled as spam and similar data labeled as non-spam are mixed), the message classified as spam in step (a) is reclassified as a non-spam message.

According to claim 10,
The spam message blocking method, wherein in step (c), when the similarity value calculated in step (b) is less than a predetermined reference value, the message classified as spam in step (a) is reclassified as a non-spam message.

According to claim 10,
The spam message blocking method, wherein in step (c), when the similarity value calculated in step (b) is equal to or greater than a predetermined reference value, the message classified as spam in step (a) is reclassified as a spam message.

(Deleted)

According to claim 10,
The spam message blocking method, wherein in step (b), for the message classified as spam through the deep learning-based spam filter, an average of first similarities to each of a plurality of spam data in the training data (hereinafter, the spam similarity average) and an average of second similarities to each of a plurality of non-spam data (hereinafter, the non-spam similarity average) are obtained, and the difference between the spam similarity average and the non-spam similarity average (hereinafter, the final score) is calculated as the similarity.

According to claim 14,
The spam message blocking method, wherein each of the spam similarity average and the non-spam similarity average is the average of the top n values, among the calculated first or second similarities respectively, that are equal to or greater than a set minimum value.

According to claim 10,
Wherein step (b) calculates the similarity through at least one of a search similarity comparison method, a string similarity comparison method, and a sentence embedding-based similarity comparison method.

According to claim 16,
The spam message blocking method, characterized in that the search similarity comparison method comprises a BM25 algorithm.

According to claim 10,
The spam message blocking method, further comprising (d) a preprocessing step of classifying an input message as spam or non-spam based on pre-registered spam phrases and providing a message classified as non-spam in the preprocessing step as the input message of step (a).

A computer-readable recording medium storing a program for executing the spam message blocking method according to any one of claims 10 to 12 and 14 to 18 in a computer.

An application stored on a computer-readable recording medium for executing the spam message blocking method according to any one of claims 10 to 12 and 14 to 18 in combination with hardware.

A computer program stored in a computer-readable recording medium in order to execute the spam message blocking method according to any one of claims 10 to 12 and 14 to 18 in a computer.