KR20150006930A

KR20150006930A - Fax spam detection apparatus, method and system

Info

Publication number: KR20150006930A
Application number: KR1020130080263A
Authority: KR
Inventors: 이지형; 김재광; 김형식
Original assignee: 성균관대학교산학협력단
Priority date: 2013-07-09
Filing date: 2013-07-09
Publication date: 2015-01-20
Also published as: KR101508258B1

Abstract

The present invention relates to a fax spam document blocking algorithm, and more particularly, to a method for intelligently filtering received fax spam documents. According to the present invention, the method for blocking a fax spam document comprises the steps of: generating a spam classification algorithm based on at least one of an analysis fax spam document group and an analysis fax general document group; determining whether a received target fax document is a spam document by using the spam classification algorithm; and determining whether to output the target fax document based on the determination result. Therefore, when the fax spam blocking system is implemented, unnecessary resource consumption can be reduced and the user's work efficiency can be improved, increasing productivity.

Description

The present invention relates to a fax spam blocking algorithm, and more particularly, to a method for intelligently filtering received fax spam.

Conventional methods for blocking fax reception rely on the user directly managing (registering, deleting, or modifying) the telephone numbers of parties that send spam faxes. That is, a sender's telephone number is simply registered, and any document transmitted from a registered spam telephone number is blocked. This approach is inconvenient because the user must perform the registration and deletion. Because every document sent from a registered telephone number is blocked regardless of its content, the list of telephone numbers must also be updated periodically. Conversely, when a fax spammer sends spam from many different telephone numbers, each number must be registered separately, and the spam must be received at least once before it can be blocked.

FIG. 1 is a flowchart schematically illustrating a conventional fax spam blocking method.

Referring to FIG. 1, the conventional fax spam blocking device receives a fax document (S110). It then determines whether the telephone number that transmitted the received fax document is a previously registered spam telephone number (S120). If the document was received from a spam telephone number, the fax data is blocked (S130). Conversely, if the document was received from a number that is not a spam telephone number, it is judged to be a general document and the fax document is output (S140).
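The conventional flow of FIG. 1 can be sketched in a few lines of Python. This is a minimal illustration only: the function name, the return values, and the telephone numbers are hypothetical and do not come from the patent.

```python
# Hypothetical sketch of the conventional flow in FIG. 1 (S110 to S140).
# The numbers below are made up; a real list would be maintained by the user.
SPAM_NUMBERS = frozenset({"02-555-0100", "031-555-0199"})


def handle_fax_conventional(sender_number, spam_numbers=SPAM_NUMBERS):
    """Return "block" for a registered spam number (S130), else "print" (S140)."""
    # S120: the decision looks only at the sender number, never at the content.
    if sender_number in spam_numbers:
        return "block"
    return "print"
```

The sketch makes the limitation visible: a spammer who switches to an unregistered number passes the check unchanged, whatever the document contains.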

Spam documents may also be blocked using blacklists and whitelists, but such methods are vulnerable to evasion techniques that work around the known lists. That is, if spam is detected using prohibited keywords, an attacker can easily bypass the detection system by using other keywords in place of the prohibited ones.

An object of the present invention, made to solve the problems described above, is to provide a fax spam blocking device, method, and system that analyze the information in received faxes to generate an intelligent, automatic fax spam algorithm and use that algorithm to block fax spam effectively.

This prevents the waste of resources caused by receiving unnecessary fax spam and increases work efficiency.

To achieve the above object, a fax spam document blocking method according to the present invention may include: generating a spam classification algorithm based on at least one of an analysis fax spam document group and an analysis fax general document group; a determination step of determining, using the spam classification algorithm, whether a received target fax document is a spam document; and an output step of determining whether to output the target fax document based on the determination result.

An analysis fax spam document may be a document determined to be spam based on the content of the fax document, and an analysis fax general document may be a document determined to be suitable for reception based on the content of the fax document.

The classification algorithm generating step may include: scanning the fax spam document group and the fax general document group individually; calculating, individually, the frequency of occurrence of the words contained in the scanned fax spam document group and the scanned fax general document group; performing fax spam document modeling and fax general document modeling individually based on the frequency of occurrence; and generating the classification algorithm based on at least one of the modeled fax spam document and the modeled fax general document.

The frequency calculating step may include preprocessing the scanned documents to remove stop words and extract only words, and calculating the frequency of occurrence based on the extracted words.

The modeling step may include selecting features based on the frequency of occurrence, and performing fax spam document modeling and fax general document modeling using the selected features as the feature vector of a support vector machine (SVM).

The feature selecting step may include extracting the top N words with the highest frequency of occurrence, where N is an arbitrary natural number.

The spam classification algorithm may be generated using a naive Bayesian classifier.

In the output determining step, when the target fax document is determined to be a spam document, the document is not output and may instead be transmitted automatically to a designated online location.

The online location may be a user e-mail address or a user-designated web storage service (web hard).

A target fax document whose determination is complete may be added to either the analysis fax spam document group or the analysis fax general document group according to the determination result.
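The feedback step described in the last paragraph above, in which a classified document is added to one of the two analysis groups, might look as follows. This is a sketch under the assumption that each group is held as a plain Python list of document texts; the function name is hypothetical.

```python
def update_analysis_groups(document_text, is_spam, spam_group, normal_group):
    """Append a classified target document to the matching analysis group.

    Both groups are plain lists here (an assumption); a later retraining pass
    could rebuild the classifier from the enlarged groups.
    """
    target_group = spam_group if is_spam else normal_group
    target_group.append(document_text)
    return target_group
```

Growing the analysis groups this way is what would let the classifier be regenerated over time rather than being fixed at installation.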

To achieve the above object, a fax spam document blocking device according to the present invention may include: a classification algorithm generating unit that generates a spam classification algorithm based on at least one of an analysis fax spam document group and an analysis fax general document group; a determination unit that determines, using the spam classification algorithm, whether a received target fax document is a spam document; and an output determining unit that determines whether to output the target fax document based on the determination result.

The classification algorithm generating unit may include: a scanning unit that scans the fax spam document group and the fax general document group individually; a frequency calculating unit that calculates, individually, the frequency of occurrence of the words contained in the scanned fax spam document group and the scanned fax general document group; a modeling unit that performs fax spam document modeling and fax general document modeling individually based on the frequency of occurrence; and an algorithm generating unit that generates the classification algorithm based on at least one of the modeled fax spam document and the modeled fax general document.

The frequency calculating unit may include a word extracting unit that preprocesses the scanned documents to remove stop words and extract only words, and a calculating unit that calculates the frequency of occurrence based on the extracted words.

The modeling unit may include a top word extracting unit that extracts the top N words with the highest frequency of occurrence, where N is an arbitrary natural number, and a fax document modeling unit that performs fax spam document modeling and fax general document modeling using the extracted words as the feature vector of a support vector machine (SVM).

When the target fax document is determined to be a spam document, the output determining unit does not output the document and may instead transmit it automatically to a designated online location.

To achieve the above object, a fax spam document blocking system according to the present invention may include: a transmitting fax device that transmits a target fax document to a receiving fax device; and a receiving fax device that generates a spam classification algorithm based on at least one of an analysis fax spam document group and an analysis fax general document group, determines, using the spam classification algorithm, whether the target fax document received from the transmitting fax device is a spam document, and determines whether to output the target fax document based on the determination result.

According to the fax spam blocking device, method, and system of the present invention, unnecessary resource consumption can be reduced when a fax spam blocking system is implemented, and the user's work efficiency is increased, which in turn increases productivity.

In addition, according to the fax spam blocking device, method, and system of the present invention, generating and updating the classifier maintains high spam blocking accuracy and minimizes maintenance costs.

FIG. 1 is a flowchart schematically illustrating a conventional fax spam blocking method.
FIG. 2 is a diagram schematically illustrating a system to which a fax spam blocking method according to an embodiment of the present invention can be applied.
FIG. 3 is a flowchart schematically illustrating a fax spam blocking method according to an embodiment of the present invention.
FIG. 4 is a detailed flowchart illustrating the feature extraction step of the fax spam blocking method according to an embodiment of the present invention.
FIG. 5 is a detailed flowchart illustrating the output determination step of the fax spam blocking method according to an embodiment of the present invention.
FIG. 6 is a diagram for explaining the processing performed when a document is determined to be spam by the fax spam blocking method according to an embodiment of the present invention.
FIG. 7 is a block diagram schematically illustrating a fax spam blocking device according to an embodiment of the present invention.
FIG. 8 is a detailed block diagram illustrating the classification algorithm generating unit of the fax spam blocking device according to an embodiment of the present invention.
FIG. 9 is a detailed block diagram illustrating the frequency calculating unit of the fax spam blocking device according to an embodiment of the present invention.
FIG. 10 is a detailed block diagram illustrating the modeling unit of the fax spam blocking device according to an embodiment of the present invention.
FIG. 11 is a diagram showing the confusion matrix used to test the performance of the fax spam blocking method according to an embodiment of the present invention.
FIG. 12A is a table showing the ACC results of three classification schemes of the fax spam blocking method according to an embodiment of the present invention.
FIG. 12B is a table showing the Pre_spam results of three classification schemes of the fax spam blocking method according to an embodiment of the present invention.
FIG. 12C is a table showing the Rec_spam results of three classification schemes of the fax spam blocking method according to an embodiment of the present invention.
FIG. 12D is a table showing the Rec_norm results of three classification schemes of the fax spam blocking method according to an embodiment of the present invention.
FIG. 13 is a graph comparing the F-measures of three classification schemes of the fax spam blocking method according to an embodiment of the present invention under an advanced spam attack.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail.

However, this is not intended to limit the present invention to the specific embodiments, and the invention should be understood to include all modifications, equivalents, and alternatives falling within its spirit and technical scope.

Terms such as first and second may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be referred to as a second component, and similarly, a second component may be referred to as a first component. The term "and/or" includes any combination of a plurality of related listed items or any one of a plurality of related listed items.

When a component is referred to as being "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that other components may be present in between. On the other hand, when a component is referred to as being "directly connected" or "directly coupled" to another component, it should be understood that no other components are present in between.

The terminology used in this application is used only to describe specific embodiments and is not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "include" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the relevant art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in this application.

Hereinafter, preferred embodiments of the present invention are described in more detail with reference to the accompanying drawings. To facilitate overall understanding, the same reference numerals are used for the same components in the drawings, and duplicate descriptions of the same components are omitted.

Fax spam blocking system

FIG. 2 is a diagram schematically illustrating a system to which a fax spam blocking method according to an embodiment of the present invention can be applied. As shown in FIG. 2, the fax spam blocking system according to an embodiment of the present invention may include transmitting fax devices (10-1, 10-2, ..., 10-N) and a receiving fax device (20).

Referring to FIG. 2, the transmitting fax devices (10-1, 10-2, ..., 10-N) transmit fax documents to the receiving fax device (20) and may do so over a wireless or wired network. The transmitting fax devices (10-1, 10-2, ..., 10-N) may transmit either a spam document containing unwanted advertising information or a general document. Here, a general document may be a document that the user wishes to receive, judged on the basis of the content of the fax document.

The receiving fax device (20) receives fax documents over a wireless or wired network. The receiving fax device (20) may generate a spam classification algorithm based on at least one of fax spam documents and fax general documents. It may then use the generated spam classification algorithm to determine whether a fax document received from the transmitting fax devices (10-1, 10-2, ..., 10-N) is a spam document. The receiving fax device (20) may determine whether to output the received fax document based on the determination result. If the received fax document is a general document, the document is output; otherwise, it is not output and may be transmitted to an online location designated by the user.

Fax spam blocking method

FIG. 3 is a flowchart schematically illustrating a fax spam blocking method according to an embodiment of the present invention.

Referring to FIG. 3, the fax spam blocking device according to an embodiment of the present invention first receives a fax document (S310).

It then performs feature extraction (S320). Feature extraction creates new features through transformation and appropriate combination. The feature extraction may be performed by extracting features from a fax spam document group and a fax general document group and generating a spam classification algorithm. Here, the fax spam document group and the fax general document group are the analysis populations used to generate the spam classification algorithm. To this end, classifier training must be performed using fax spam documents and fax general documents as defined by the user. A fax spam document is not simply a fax on a particular topic or a fax sent from a particular telephone number; it is a fax document that the user does not want, judged by its content, and the user can define it as a document to be used for training. As described above, a fax general document can be defined as a document that the user wishes to receive, judged on the basis of the content of the fax document.

When feature extraction is complete, the spam document blocking device determines whether the received fax document is a spam document (S330), using the classification algorithm generated in the feature extraction step (S320).

If the document is determined not to be spam, it is judged to be a general document that the user wants, and the fax document is output (S340). Conversely, if it is a spam document, the fax data is blocked (S350).

FIG. 4 is a detailed flowchart illustrating the feature extraction step of the fax spam blocking method according to an embodiment of the present invention.

Referring to FIG. 4, the overall flow is that the fax spam blocking device receives a target fax document (S410), generates a spam classification algorithm for determining whether a document is spam (S430), and then determines whether the target fax document is spam (S450). The step of generating the spam classification algorithm (S430) is the core of the method and is described in detail below.

According to an embodiment of the present invention, the spam document blocking device generates the spam classification algorithm from at least one of a fax spam document model and a fax general document model. The two models can therefore be generated separately. That is, depending on the user's settings, the device may use only the fax spam document model, only the fax general document model, or both, and the modeling process proceeds accordingly.

To this end, the spam document blocking device scans the fax spam documents and/or the fax general documents using OCR (Optical Character Reader) (S431, S441). That is, it scans the image and converts it into a machine-readable format. The OCR scanning may use any of a number of currently known techniques.

Next, the spam document blocking device preprocesses the scanned documents to extract only words (S433, S443). Because the scanned documents contain many stop words, which carry no particular meaning, these are removed so that only meaningful words remain. The spam document blocking device may perform the preprocessing using a stop-word dictionary.
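The preprocessing step (S433, S443) can be sketched as below, assuming the OCR output is already available as a string. The tiny stop-word set and the tokenization rule (runs of letters, lower-cased) are illustrative assumptions; the patent does not specify a stop-word dictionary or a tokenizer.

```python
import re

# Illustrative stop-word dictionary (an assumption; the patent does not list one).
STOP_WORDS = frozenset({"the", "a", "an", "of", "to", "and", "is", "for"})


def extract_words(ocr_text, stop_words=STOP_WORDS):
    """Lower-case the OCR text, keep runs of letters, and drop stop words."""
    # [^\W\d_]+ matches one or more Unicode letters (no digits, no underscore).
    tokens = re.findall(r"[^\W\d_]+", ocr_text.lower())
    return [token for token in tokens if token not in stop_words]
```

For Korean fax text, a morphological analyzer would likely be needed in place of the regular expression; that choice is outside what the source describes.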

Then, the frequency of occurrence of each word is calculated (S435, S445). That is, for feature extraction, the spam document blocking device can analyze how often each word occurs in the fax documents.
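The frequency calculation (S435, S445) amounts to counting words over one document group. A minimal sketch, assuming each document has already been reduced to a list of words; the function name is hypothetical:

```python
from collections import Counter


def word_frequencies(documents):
    """Count how often each word occurs across one document group.

    `documents` is an iterable of word lists, one list per scanned fax.
    """
    counts = Counter()
    for words in documents:
        counts.update(words)
    return counts
```

The function would be called once for the spam group and once for the general group, since the source computes the two frequency tables individually.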

Next, to construct a support vector machine (SVM) or a naive Bayesian classifier, the spam document blocking device selects features for the two classes, fax spam documents and/or fax general documents (S437, S447). Feature selection is a different concept from feature extraction: it means picking the best subset of the input feature set. In the present invention, the word frequencies of each class are selected as the features of the classifier. The frequencies of all words could be used as features, but this may be inefficient. The spam document blocking device therefore selects the features with the greatest impact among the features of the documents, that is, the words with a high frequency of occurrence. Frequency-based feature selection of this kind is widely used in text mining.
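Frequency-based feature selection (S437, S447) can be sketched as taking the N most frequent words from a frequency table. The tie-breaking rule (alphabetical) is an assumption added here only to make the result deterministic.

```python
from collections import Counter


def select_top_n(frequencies, n):
    """Return the n most frequent words, ties broken alphabetically."""
    ranked = sorted(frequencies.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _count in ranked[:n]]


# Usage with a made-up frequency table:
example = Counter({"loan": 3, "free": 2, "call": 2, "meeting": 1})
top_two = select_top_n(example, 2)  # ["loan", "call"]
```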

According to another embodiment of the present invention, documents can be classified by using the N most frequent words and raising the probability that a document belongs to each group according to how often those words appear in it. Alternatively, the N most frequent words can be used as the feature vector of a support vector machine to generate the classification algorithm.

Finally, the spam document blocking device performs fax spam document modeling and fax general document modeling using one of two representative methods (for example, a support vector machine or naive Bayes) (S439, S449). When a support vector machine is used, modeling can be performed, as described above, using the N most frequent words as the feature vector of the support vector machine. Generating the support vector machine model may involve a training stage and a test stage.
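One way to turn the N selected words into the numeric input an SVM expects is a term-count vector per document. The sketch below covers only that conversion; it does not train an SVM, and the function name and labels are illustrative assumptions rather than anything defined in the patent.

```python
from collections import Counter


def to_feature_vector(words, vocabulary):
    """Map a document's word list to counts over the selected top-N words.

    `vocabulary` is the ordered list of selected words; words outside it
    are ignored.
    """
    counts = Counter(words)
    return [counts[word] for word in vocabulary]


# Usage with made-up data: label 1 for spam, 0 for general documents.
vocabulary = ["loan", "free", "invoice"]
X = [
    to_feature_vector(["free", "loan", "loan", "call"], vocabulary),
    to_feature_vector(["invoice", "meeting"], vocabulary),
]
y = [1, 0]
```

Lists like `X` and `y` could then be split into training and test portions and handed to any SVM implementation, for example scikit-learn's `sklearn.svm.SVC`, which is an assumption here since the source names no library.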

Alternatively, the naive Bayesian classification method may be used. It builds a discrete model for separating the classes on the assumptions that the words appearing in each class are features representing that class, and that each word in a document is independent of the other words in that document. Modeling can therefore be performed on the basis of the N most frequent words. Modeling with the naive Bayesian classification method has the advantages of being simple to implement and fast at modeling documents. It works well with a bag-of-words model and is well suited to document modeling.
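A small multinomial naive Bayes classifier over the selected words illustrates the idea. The add-one (Laplace) smoothing and the use of log probabilities are standard choices assumed here, not details given in the patent, and all names are hypothetical.

```python
import math
from collections import Counter


def train_naive_bayes(spam_docs, normal_docs, vocabulary):
    """Fit per-class log priors and smoothed log likelihoods.

    Each document is a list of words; both groups must be non-empty.
    """
    vocab = list(vocabulary)
    total_docs = len(spam_docs) + len(normal_docs)
    model = {}
    for label, docs in (("spam", spam_docs), ("normal", normal_docs)):
        counts = Counter(w for doc in docs for w in doc if w in vocab)
        total = sum(counts.values())
        model[label] = {
            "log_prior": math.log(len(docs) / total_docs),
            # Add-one smoothing keeps unseen words from zeroing the product.
            "log_likelihood": {
                w: math.log((counts[w] + 1) / (total + len(vocab))) for w in vocab
            },
        }
    return model


def classify(model, words):
    """Return the label with the highest posterior log score."""
    scores = {}
    for label, params in model.items():
        likelihood = params["log_likelihood"]
        # Independence assumption: per-word log likelihoods simply add up.
        scores[label] = params["log_prior"] + sum(
            likelihood[w] for w in words if w in likelihood
        )
    return max(scores, key=scores.get)
```

Summing log likelihoods is exactly where the word-independence assumption from the paragraph above enters the computation.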

이렇게 두 가지 방법 중 어느 하나를 사용하여 팩스 스팸 문서 모델링 및 팩스 일반 문서 모델링을 수행할 수 있고, 상기 두 모델링된 문서 중 적어도 어느 하나를 사용하여 분류 알고리즘을 생성할 수 있다.Fax spam document modeling and fax general document modeling can be performed using either one of the two methods, and a classification algorithm can be generated using at least one of the two modeled documents.

FIG. 5 is a detailed flowchart illustrating the output determination step of the fax spam blocking method according to an embodiment of the present invention.

Referring to FIG. 5, the fax spam blocking device uses the generated spam classification algorithm to determine whether the received target fax document is a spam document (S510). If the document is determined to be spam, it is transmitted to a designated online point (S520). If it is not spam, the fax document is output (S530). This prevents spam documents from being output unconditionally and thereby reduces wasted power and paper.
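The decision flow of FIG. 5 can be sketched as a small routing function. The function name `handle_received_fax` and its three callback parameters are hypothetical; they stand in for the classifier, the transmission to the online point, and the printer, none of which the patent defines at the code level.

```python
def handle_received_fax(document, is_spam, send_to_online_point, print_document):
    """Route one received fax according to the classification result.

    is_spam, send_to_online_point and print_document are caller-supplied
    callables (assumed interfaces for this sketch).
    """
    if is_spam(document):                 # S510: apply the classification algorithm
        send_to_online_point(document)    # S520: forward instead of printing
        return "forwarded"
    print_document(document)              # S530: output the general document
    return "printed"
```

Keeping the three actions as injected callables means the same routing logic works whether the online point is an e-mail address or a web storage service.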

FIG. 6 is a diagram illustrating how a document determined to be spam is handled under the fax spam blocking method according to an embodiment of the present invention.

Referring to FIG. 6, when the fax spam blocking device of the present invention determines that a document is spam, the document is transmitted to an online point designated by the user, which prevents it from being lost unconditionally. For example, through a user interface, the user can configure spam documents to be stored temporarily at the user's e-mail address 620 or on a web hard (online storage) 630 accessible over the Internet. According to this setting, the fax spam blocking device transmits a received fax document determined to be spam to the user's e-mail address 620 or the web hard 630. The user can check the spam document at the e-mail address 620 or the web hard 630 and, if it is not actually spam, restore and output it. This prevents a document from being discarded outright merely because it was classified as spam.

Once the output decision for a received fax document has been made, that document may itself be added to the fax spam document group for analysis or the general fax document group for analysis, according to whether it was classified as a spam document or a general document. The analysis data is thus continually refreshed, so that the spam classification algorithm is ultimately kept up to date.
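The feedback step can be sketched as follows. The function name `add_to_analysis_group`, the dictionary layout with the keys "spam" and "general", and the choice to recompute word frequencies immediately are assumptions for illustration; the patent only states that classified documents may join the analysis groups.

```python
from collections import Counter


def add_to_analysis_group(groups, tokens, classified_as_spam):
    """Add a classified document to its analysis group and refresh word counts.

    groups maps "spam" and "general" to lists of tokenized documents.
    Returns the updated per-group word frequencies, from which the
    classification model could be rebuilt.
    """
    label = "spam" if classified_as_spam else "general"
    groups[label].append(tokens)
    return {
        name: Counter(word for doc in docs for word in doc)
        for name, docs in groups.items()
    }
```

Because misclassified documents would also be fed back, a real system would presumably let the user's restore action at the online point correct the label before retraining.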

Fax spam blocking device

FIG. 7 is a block diagram schematically illustrating a fax spam blocking device according to an embodiment of the present invention. As shown in FIG. 7, the fax spam blocking device according to an embodiment of the present invention may include a classification algorithm generation unit 710, a determination unit 720, and an output determination unit 730.

Referring to FIG. 7, the classification algorithm generation unit 710 performs feature extraction. It generates the spam classification algorithm by extracting features from the fax spam document group and the general fax document group. Here, the fax spam document group and the general fax document group are the analysis populations used to generate the spam classification algorithm. For this purpose, the classifier must be trained using fax spam documents and general fax documents as defined by the user. A fax spam document is not simply a fax on a particular topic or a fax sent from a particular telephone number. It is a fax document that the user does not want on the basis of its content, and the user can designate it as a document to be used for training. Likewise, a general fax document can be defined, on the basis of its content, as a document that the user wishes to receive.

The determination unit 720 determines whether a received fax document is a spam document. It makes this determination using the classification algorithm generated by the classification algorithm generation unit 710.

The output determination unit 730 determines whether to output the received fax document according to the determination result of the determination unit 720. If the document is not a spam document, it is treated as a general document that the user wants, and the fax document is output. Conversely, if it is a spam document, the fax data is blocked. A fax document that is classified as spam and blocked is not deleted immediately but may be transmitted to a designated online point. This prevents spam documents from being output unconditionally and thereby reduces wasted power and paper. Through a user interface, the user can configure spam documents to be stored temporarily at the user's e-mail address or on a web hard accessible over the Internet. According to this setting, the output determination unit 730 transmits a received fax document determined to be spam to the user's e-mail address or web hard. The user can check the spam document at the configured location and, if it is not actually spam, restore and output it.

Once the output decision for a received fax document has been made, that document may itself be added to the fax spam document group for analysis or the general fax document group for analysis, according to whether it was classified as a spam document or a general document.

FIG. 8 is a detailed block diagram illustrating the classification algorithm generation unit 710 of the fax spam blocking device according to an embodiment of the present invention. As shown in FIG. 8, the classification algorithm generation unit 710 may include a scan performing unit 810, an appearance frequency calculating unit 820, a modeling unit 830, and an algorithm generation unit 840.

Referring to FIG. 8, the scan performing unit 810 performs an OCR (Optical Character Reader) scan of the fax spam documents and/or the general fax documents so that the spam classification algorithm can be generated from at least one of the fax spam document model and the general fax document model. That is, it scans each image and converts it into a machine-readable format. Any of several currently known OCR scanning techniques may be used.

For feature extraction, the appearance frequency calculating unit 820 can analyze the appearance frequency of words in the fax spam document group or the general fax document group.

The modeling unit 830 may select features on the basis of the analyzed appearance frequency and perform fax spam document modeling and general fax document modeling. A support vector machine or a naive Bayesian classification method may be used for this purpose.

The algorithm generation unit 840 generates the spam classification algorithm from at least one of the fax spam document model and the general fax document model. The algorithm generation unit 840 can generate the two models individually. That is, according to the user's setting, it is possible to choose whether to use only the fax spam document model, only the general fax document model, or both, and the modeling process proceeds accordingly.

FIG. 9 is a detailed block diagram illustrating the appearance frequency calculating unit 820 of the fax spam blocking device according to an embodiment of the present invention. As shown in FIG. 9, the appearance frequency calculating unit 820 according to an embodiment of the present invention may include a word extracting unit 910 and a calculating unit 920.

Referring to FIG. 9, the word extracting unit 910 preprocesses the scanned document to extract only words. A scanned document contains many stop words, that is, words that carry no particular meaning, so these are removed and only the meaningful words are extracted. The spam document blocking device can perform this preprocessing using a stop-word dictionary.

The calculating unit 920 calculates the appearance frequency. For feature extraction, the calculating unit 920 can analyze the appearance frequency of the words in the fax documents.
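The work of the word extracting unit 910 and the calculating unit 920 can be sketched together. The stop-word set, the regular-expression tokenizer, and the function names `extract_words` and `appearance_frequency` are assumptions for illustration; the patent does not specify the contents of the stop-word dictionary or the tokenization rule.

```python
import re
from collections import Counter

# Hypothetical stop-word dictionary; a real one would be far larger
# and language specific.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "for"}


def extract_words(ocr_text, stop_words=STOP_WORDS):
    """Lowercase the OCR output, keep runs of letters, and drop stop words."""
    tokens = re.findall(r"[^\W\d_]+", ocr_text.lower())
    return [token for token in tokens if token not in stop_words]


def appearance_frequency(documents, stop_words=STOP_WORDS):
    """Count word occurrences over all documents of one analysis group."""
    frequency = Counter()
    for text in documents:
        frequency.update(extract_words(text, stop_words))
    return frequency
```

The function is applied once to the fax spam document group and once to the general fax document group, giving one frequency table per class.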

FIG. 10 is a detailed block diagram illustrating the modeling unit 830 of the fax spam blocking device according to an embodiment of the present invention. As shown in FIG. 10, the modeling unit 830 may include a feature selecting unit 1010 and a fax document modeling unit 1020.

The feature selecting unit 1010 selects the features of the two classes, fax spam documents and/or general fax documents, in order to construct a support vector machine (SVM) or a naive Bayesian classifier. Feature selection is a different concept from feature extraction: it means picking the best subset of the input feature set. The feature selecting unit 1010 can select the word appearance frequencies of each class as the features of the classifier. The appearance frequencies of all words could be used as features, but this may be inefficient. The spam document blocking device therefore selects the features that have the greatest impact, namely the words with a high appearance frequency. Such frequency-based feature selection is widely used in text mining.

According to another embodiment of the present invention, the feature selecting unit 1010 may use the N words with the highest appearance frequency for classification by raising the probability that a document belongs to a given group according to how often those words appear in it. Alternatively, the feature selecting unit 1010 may use the N words with the highest appearance frequency as the feature vector of a support vector machine to generate the classification algorithm.
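Selecting the N most frequent words can be sketched in a few lines. The function name `select_top_n_words` and the alphabetical tie-breaking rule are assumptions added so that the result is deterministic; the patent only requires the N words with the highest appearance frequency.

```python
def select_top_n_words(frequency, n):
    """Return the n words with the highest counts.

    frequency maps each word to its appearance count. Ties are broken
    alphabetically (an assumption) so repeated runs give the same features.
    """
    ranked = sorted(frequency.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]
```

The returned list fixes the order of the feature-vector dimensions, so the same list must be used when training the model and when classifying incoming faxes.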

The fax document modeling unit 1020 performs fax spam document modeling and general fax document modeling using one of two representative methods, for example a support vector machine or a naive Bayesian classifier. When a support vector machine is used, the N words with the highest appearance frequency may be used as its feature vector, as described above. The support vector machine model may be generated through a training step and a testing step.

Alternatively, the fax document modeling unit 1020 may use a naive Bayesian classification method. This method generates a discrete model that separates the classes, based on the assumptions that the words appearing in each class are features representing that class and that, within any document in which it appears, each word is independent of the other words. The fax document modeling unit 1020 can therefore perform modeling on the basis of the N words with the highest appearance frequency. Naive Bayesian modeling is simple to implement and models documents quickly. It works well with a bag-of-words model and is well suited to document modeling.

Simulation results

A simulation was performed to verify the performance of the fax spam blocking method of the present invention. First, fax spam documents and general fax documents were collected in order to generate the fax spam classification algorithm. The collected documents covered a variety of content and were sorted into the two groups according to the user's subjective judgment. The collected fax documents were OCR-scanned group by group, gathered into sets of words, and then preprocessed. Next, the appearance frequency of each word was determined, the most frequent words in each group and their frequencies were identified, and these were used to select the features of the spam documents and the general documents. Modeling was then performed using the support vector machine and the naive Bayesian classification method.

FIG. 11 is a diagram showing the confusion matrix used to test the performance of the fax spam blocking method according to an embodiment of the present invention.

Referring to FIG. 11, the accuracy (ACC), the precision of spam detection, the recall of spam detection, and the recall of general document detection were calculated from the matrix. These can be calculated as follows.
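The formulas themselves are not reproduced in this text, so the sketch below uses the standard confusion-matrix definitions as an assumption. The cell names (`tp`, `fn`, `fp`, `tn`) and the function name `confusion_metrics` are hypothetical; the result keys follow the metric names used in the discussion of FIGS. 12a to 12d.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Compute the four metrics from the cells of a 2x2 confusion matrix.

    tp: spam classified as spam        fn: spam classified as general
    fp: general classified as spam     tn: general classified as general
    """
    def ratio(numerator, denominator):
        # Guard against empty rows or columns in the matrix.
        return numerator / denominator if denominator else 0.0

    return {
        "ACC": ratio(tp + tn, tp + fn + fp + tn),  # overall accuracy
        "Pre_spam": ratio(tp, tp + fp),            # precision of spam detection
        "Rec_spam": ratio(tp, tp + fn),            # recall of spam detection
        "Rec_norm": ratio(tn, tn + fp),            # recall of general documents
    }
```

Under these definitions, 1 - Rec_norm is the share of general documents wrongly blocked as spam, which is why a high Rec_norm corresponds to a low false positive rate.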

Accuracy (ACC) represents the overall performance of the fax spam system according to the present invention. It increases when the system classifies actual spam as spam and actual general documents as general documents.

FIG. 12a is a table showing the ACC results of three classification methods for the fax spam blocking method according to an embodiment of the present invention.

Referring to FIG. 12a, RB denotes modeling with a rule-based filtering method, SVM denotes modeling with a support vector machine classifier, and NB denotes the naive Bayesian method. As shown in FIG. 12a, the rule-based filtering method has the lowest accuracy at 55.35%, whereas SVM and NB, which are used by the fax blocking system according to the embodiment of the present invention, give almost identical results with a high accuracy of about 91%. The numbers 10, 20, ..., 100 at the top of the table indicate the number of selected features.

Returning to FIG. 11, Pre_spam, the precision of spam detection, indicates how well the system detects spam. That is, it represents the detection capability of the spam detection system.

FIG. 12b is a table showing the Pre_spam results of the three classification methods for the fax spam blocking method according to an embodiment of the present invention.

As shown in FIG. 12b, the Pre_spam of the RB classification method increases with the number of features used. In contrast, the Pre_spam results of SVM and NB are 100% for every number of features.

Returning to FIG. 11, Rec_spam indicates whether the system detects as many spam documents as it can. That is, it represents how many of all the spam fax documents the system is able to detect.

FIG. 12c is a table showing the Rec_spam results of the three classification methods for the fax spam blocking method according to an embodiment of the present invention.

As shown in FIG. 12c, the Rec_spam values of the RB, SVM, and NB classification methods are almost the same. Although the ACC and Pre_spam of RB were lower than those of the other methods, the Rec_spam of RB was higher. High recall combined with low precision and low accuracy means that RB reaches its recall by marking many documents as spam, including general ones, which indicates that RB performs poorly.

Returning to FIG. 11, Rec_norm, the recall of general document detection, reflects the false positive probability. The higher Rec_norm is, the lower the probability of a false positive.

FIG. 12d is a table showing the Rec_norm results of the three classification methods for the fax spam blocking method according to an embodiment of the present invention.

As shown in FIG. 12d, when only 10 features were used, NB achieved 100%, whereas RB achieved 28.99% and SVM achieved 78.77%.

Based on the above results, the F-measure, which combines precision and recall into a single value, can be calculated.
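The text does not state which form of the F-measure was used. The sketch below assumes the common balanced form, the harmonic mean of precision and recall (often called F1); the function name `f_measure` is hypothetical.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall (assumed form)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because the harmonic mean is dominated by the smaller of the two inputs, a method with high recall but low precision still receives a low F-measure.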

FIG. 13 is a graph comparing the F-measures of the three classification methods for the fax spam blocking method according to an embodiment of the present invention under an advanced spam attack.

As shown in FIG. 13, the F-measures of RB and SVM are strongly affected by changes in the number of features. In contrast, the F-measure of NB is stable across the entire x-axis (the number of features). NB can therefore be recommended for fax spam detection.

Although the invention has been described above with reference to the drawings and embodiments, this does not mean that the scope of protection of the present invention is limited by those drawings or embodiments. Those skilled in the art will understand that the present invention can be variously modified and changed without departing from the spirit and scope of the invention as set out in the following claims.

Claims

1. A method for blocking a fax spam document, the method comprising:
generating a spam classification algorithm based on at least one of a fax spam document group for analysis and a general fax document group for analysis;
determining whether a received target fax document is a spam document using the spam classification algorithm; and
determining whether to output the target fax document based on the determination result.

2. The method according to claim 1,
wherein the fax spam document for analysis is a document determined to be a spam document based on the contents of the fax document, and the general fax document for analysis is a document determined to be suitable for reception based on the contents of the fax document.

3. The method of claim 2, wherein generating the spam classification algorithm comprises:
scanning the fax spam document group and the general fax document group separately;
calculating separately the appearance frequency of the words included in the scanned fax spam document group and in the scanned general fax document group;
performing fax spam document modeling and general fax document modeling separately based on the appearance frequency; and
generating the classification algorithm based on at least one of the modeled fax spam document and the modeled general fax document.

4. The method of claim 3, wherein calculating the appearance frequency comprises:
preprocessing the scanned documents to remove stop words and extract only words; and
calculating the appearance frequency based on the extracted words.

5. The method of claim 3, wherein the modeling comprises:
selecting a feature based on the appearance frequency; and
performing fax spam document modeling and general fax document modeling using the selected feature as a feature vector of a support vector machine (SVM).

6. The method of claim 5, wherein selecting the feature comprises
extracting the top N words with the highest appearance frequency, where N is an arbitrary natural number.

7. The method according to claim 1,
wherein the spam classification algorithm is generated using a naive Bayesian classifier.

8. The method according to claim 1,
wherein, if the target fax document is determined to be a spam document, the target fax document is not output but is automatically transmitted to a designated online point.

9. The method of claim 8,
wherein the online point is a user's e-mail address or a user-designated web hard.

10. The method according to claim 1,
wherein the classified fax document is included in the fax spam document group for analysis or the general fax document group for analysis according to the determination result.

11. A fax spam document blocking apparatus comprising:
a classification algorithm generation unit that generates a spam classification algorithm based on at least one of a fax spam document group for analysis and a general fax document group for analysis;
a determination unit that determines whether a received target fax document is a spam document using the spam classification algorithm; and
an output determination unit that determines whether to output the target fax document based on the determination result.

12. The apparatus of claim 11,
wherein the fax spam document for analysis is a document determined to be a spam document based on the contents of the fax document, and the general fax document for analysis is a document determined to be suitable for reception based on the contents of the fax document.

13. The apparatus of claim 11, wherein the classification algorithm generation unit comprises:
a scan performing unit that separately scans the fax spam document group and the general fax document group;
an appearance frequency calculating unit that separately calculates the appearance frequency of the words included in the scanned fax spam document group and in the scanned general fax document group;
a modeling unit that separately performs fax spam document modeling and general fax document modeling based on the appearance frequency; and
an algorithm generation unit that generates the classification algorithm based on at least one of the modeled fax spam document and the modeled general fax document.

14. The apparatus of claim 13, wherein the appearance frequency calculating unit comprises:
a word extracting unit that preprocesses the scanned documents to remove stop words and extract only words; and
a calculating unit that calculates the appearance frequency based on the extracted words.

15. The apparatus of claim 13, wherein the modeling unit comprises:
a top-word extracting unit that extracts the top N words with the highest appearance frequency, where N is an arbitrary natural number; and
a fax document modeling unit that performs fax spam document modeling and general fax document modeling using the extracted words as a feature vector of a support vector machine (SVM).

16. The apparatus of claim 11,
wherein the spam classification algorithm is generated using a naive Bayesian classifier.

17. The apparatus of claim 11, wherein the output determination unit,
when the target fax document is determined to be a spam document, does not output the target fax document but automatically transmits it to a designated online point.

18. The apparatus of claim 17,
wherein the online point is a user's e-mail address or a user-designated web hard.

19. The apparatus of claim 11,
wherein the classified fax document is included in the fax spam document group for analysis or the general fax document group for analysis according to the determination result.

20. A fax spam document blocking system comprising:
a transmitting fax device that transmits a target fax document to a destination fax device; and
a device that generates a spam classification algorithm based on at least one of a fax spam document group for analysis and a general fax document group for analysis, determines whether the target fax document received from the transmitting fax device is a spam document using the spam classification algorithm, and determines whether to output the target fax document based on the determination result.