KR102641092B1

KR102641092B1 - Method for detecting phishing data and computing device using the same

Info

Publication number: KR102641092B1
Application number: KR1020230124377A
Authority: KR
Inventors: 문일주; 김진
Original assignee: (주)시큐레이어
Priority date: 2023-09-18
Filing date: 2023-09-18
Publication date: 2024-02-28

Abstract

Disclosed are a method for detecting phishing data using a Siamese Network and a KD-Tree algorithm with reference to patterned data obtained by applying regular expressions to the phishing data, and a computing device using the same. The method includes: (a) on condition that (i) each of a plurality of phishing data has been patterned on the basis of its respective regular expression to obtain a plurality of patterned data, (ii) each of the plurality of phishing data has been input to a first CNN of the Siamese Network and each of the plurality of patterned data has been input to a second CNN of the Siamese Network, so that the first CNN outputs first features and the second CNN outputs second features, and (iii) the Siamese Network has been trained so that the distance between the first feature and the second feature is less than or equal to a preset threshold, when a KD-tree device is connected to the output terminal of the Siamese Network, representing, by the KD-tree device, target patterned data features corresponding to target patterned data for target data on a KD-tree; (b) when new data is input to the first CNN of the Siamese Network and a new data feature is output, mapping, by the KD-tree device, the new data feature onto the KD-tree; and (c) determining, by the KD-tree device, whether the new data corresponds to phishing data with reference to the distance between the new data feature corresponding to the new data and the target patterned data features represented on the KD-tree.

Description

Method for detecting phishing data and computing device using the same {METHOD FOR DETECTING PHISHING DATA AND COMPUTING DEVICE USING THE SAME}

The present invention relates to a method for detecting phishing data and a computing device using the same.

A phishing attack is an act in which an attacker uses a domain designed for malicious purposes to steal a user's personal or financial information through email or text messages.

Signature-based detection, which detects malicious code by storing previously known phishing attacks, cannot by itself cope with the many variant attacks and new forms of attack that phishing attackers produce.

Accordingly, in order to detect a wider variety of phishing patterns than existing signature-based detection methods, the inventors of the present invention developed a method in which a Siamese Network learns various patterns with reference to patterned data obtained by applying regular expressions to phishing data, and a model combining the trained Siamese Network with a KD-Tree device then detects phishing data of the various patterns derived from that phishing data.

An object of the present invention is to solve the above-mentioned problems.

Another object of the present invention is to train a Siamese Network, using phishing data and the patterned data corresponding thereto, so that it recognizes the phishing data and the patterned data as the same.

Still another object of the present invention is to increase the accuracy of phishing detection by, after the Siamese Network has been trained and with target patterned data features corresponding to target data represented on a KD-Tree, mapping onto the KD-Tree the new data feature that the Siamese Network outputs when arbitrary new data is input to it.

The characteristic configuration of the present invention for achieving the objects described above and realizing the characteristic effects described later is as follows.

According to one aspect of the present invention, there is disclosed a method for detecting phishing data using a Siamese Network and a KD-Tree algorithm with reference to patterned data obtained by applying regular expressions to the phishing data, the method including: (a) on condition that (i) each of a plurality of phishing data has been patterned on the basis of its respective regular expression to obtain a plurality of patterned data, (ii) each of the plurality of phishing data has been input to a first CNN of the Siamese Network and each of the plurality of patterned data has been input to a second CNN of the Siamese Network, so that the first CNN outputs first features and the second CNN outputs second features, and (iii) the Siamese Network has been trained so that the distance between the first feature and the second feature is less than or equal to a preset threshold, when a KD-tree device is connected to the output terminal of the Siamese Network, representing, by the KD-tree device, target patterned data features corresponding to target patterned data for target data on a KD-tree; (b) when new data is input to the first CNN of the Siamese Network and a new data feature is output, mapping, by the KD-tree device, the new data feature onto the KD-tree; and (c) determining, by the KD-tree device, whether the new data corresponds to phishing data with reference to the distance between the new data feature corresponding to the new data and the target patterned data features represented on the KD-tree.
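Steps (a) through (c) of the KD-tree stage can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the 2-D feature tuples are made-up values standing in for Siamese Network outputs, the names `build_kdtree`, `nearest`, and `is_phishing` are hypothetical, and the decision rule (a match when the nearest-neighbor distance is at or below a threshold) is one plausible reading of "with reference to the distance".

```python
import math

def build_kdtree(points, depth=0):
    """Recursively build a KD-tree (nested dicts) from equal-length tuples."""
    if not points:
        return None
    axis = depth % len(points[0])            # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                   # the median point becomes this node
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    """Return (distance, point) for the stored point closest to `query`."""
    if node is None:
        return best
    d = math.dist(node["point"], query)
    if best is None or d < best[0]:
        best = (d, node["point"])
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the best hit so far.
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best

def is_phishing(tree, new_feature, threshold):
    """Assumed rule: flag the new data when it lies close to a target feature."""
    distance, _ = nearest(tree, new_feature)
    return distance <= threshold

# Step (a): represent (made-up) target patterned data features on the tree.
target_features = [(0.1, 0.2), (0.9, 0.8), (0.4, 0.7), (0.3, 0.3), (0.6, 0.1)]
tree = build_kdtree(target_features)
# Steps (b) and (c): map a new feature and decide by distance.
print(is_phishing(tree, (0.12, 0.22), threshold=0.5))
```

A KD-tree is chosen here because it answers nearest-neighbor queries without comparing the new feature against every stored feature, which is the usual motivation for the structure.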

As an example, there is disclosed a method wherein, in step (a), after each of the plurality of phishing data is input to the first CNN of the Siamese Network and each of the plurality of patterned data is input to the second CNN of the Siamese Network, with the first CNN and the second CNN of the Siamese Network having a shared weight (W), (i) if a phishing data and a patterned data are determined to be the same data, the Siamese Network is trained so that the distance between the first feature output from the first CNN and the second feature output from the second CNN is less than or equal to the preset threshold, and (ii) if the phishing data and the patterned data are determined to be different data, the Siamese Network is trained so that the distance between the first feature and the second feature exceeds the preset threshold.
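The pull-together and push-apart behavior described above resembles what is commonly called a contrastive loss. The text does not name its loss function, so the sketch below is an assumption: `contrastive_loss` is a hypothetical name, and `margin` plays the role of the preset threshold.

```python
import math

def contrastive_loss(feat_a, feat_b, same, margin=1.0):
    """Contrastive loss for one pair of feature vectors.

    same=True  -> penalize any distance (pull the pair together).
    same=False -> penalize only distances below `margin` (push the pair apart).
    """
    d = math.dist(feat_a, feat_b)
    if same:
        return d ** 2
    return max(0.0, margin - d) ** 2

# A matching pair that is far apart is penalized; a non-matching pair
# that is already farther apart than the margin is not.
print(contrastive_loss((0.0, 0.0), (3.0, 4.0), same=True))
print(contrastive_loss((0.0, 0.0), (3.0, 4.0), same=False))
```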

As an example, there is disclosed a method wherein, in step (a), on condition that each of the plurality of phishing data has been converted to integers on the basis of a predetermined dictionary to form a phishing data array composed of numbers, and each of the plurality of patterned data has been converted to integers to form a patterned data array composed of numbers, when each phishing data array is input to the first CNN of the Siamese Network and each patterned data array is input to the second CNN, so that the first CNN outputs a phishing data array feature corresponding to each phishing data array and the second CNN outputs a patterned data array feature corresponding to each patterned data array, the Euclidean distance between the phishing data array feature and the patterned data array feature is calculated to measure the similarity between the phishing data and the patterned data, and the Siamese Network is trained, using a loss function that refers to the similarity, so as to minimize that distance-based similarity measure between each of the phishing data and each of the patterned data.
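The integer conversion step can be illustrated with a character-level dictionary. The text does not say how the dictionary is built or how arrays are sized, so everything here is an assumption: `build_dictionary` and `encode` are hypothetical names, 0 is used both for padding and for unknown characters, and arrays are cut or padded to a fixed length.

```python
def build_dictionary(samples):
    """Map each distinct character to a positive integer; 0 is reserved."""
    chars = sorted({ch for s in samples for ch in s})
    return {ch: i + 1 for i, ch in enumerate(chars)}

def encode(text, dictionary, length):
    """Turn `text` into a fixed-length list of integers.

    Characters missing from the dictionary become 0, and short inputs
    are right-padded with 0 so every array has the same length.
    """
    ids = [dictionary.get(ch, 0) for ch in text[:length]]
    return ids + [0] * (length - len(ids))

dictionary = build_dictionary(["ab", "bc"])
print(dictionary)
print(encode("cab", dictionary, 5))
```

In the scheme described above, one such array per phishing data and one per patterned data would be the inputs to the first and second CNN respectively.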

As an example, there is disclosed a method wherein, in step (a), after each of the plurality of phishing data is obtained and each of the plurality of phishing data is patterned on the basis of its respective regular expression to obtain each of the plurality of patterned data, each of the plurality of phishing data and each of the plurality of patterned data are input to the Siamese Network, the Siamese Network outputs each of the first features and each of the second features as feature vectors, and, when the Siamese Network outputs a distance from the feature vectors, the Siamese Network is trained with reference to that distance.

As an example, there is disclosed a method wherein, in step (a), when each of the phishing data and each of the patterned data are input to the first CNN and the second CNN of the Siamese Network, respectively, with the first CNN and the second CNN having a shared weight (W) and a bias b, the bias b is added to each of the phishing weight data and each of the patterned weight data obtained for each of the phishing data and each of the patterned data, and an activation function is then applied to output a first feature vector for the first feature and a second feature vector for the second feature, whereupon the Euclidean distance between the first feature vector and the second feature vector is calculated to measure the similarity between the phishing data and the patterned data, and the Siamese Network is trained, using a loss function that refers to the similarity, so as to minimize that distance-based similarity measure between each of the phishing data and each of the patterned data.
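The shared-weight computation can be sketched with a single dense layer standing in for each CNN branch. This is a simplification and an assumption: the text specifies CNNs and does not name the activation function, so ReLU and the helper name `branch` are illustrative choices, and the numbers are made up.

```python
import math

def branch(x, W, b):
    """One branch: activation(W.x + b), with ReLU as the assumed activation."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + bi)
        for row, bi in zip(W, b)
    ]

# Both branches use the SAME W and b; that sharing is what makes the network Siamese.
W = [[1.0, 0.0], [0.0, -1.0]]
b = [0.5, 0.0]

first_feature = branch([1.0, 2.0], W, b)    # stands in for the phishing data input
second_feature = branch([2.0, 2.0], W, b)   # stands in for the patterned data input
print(math.dist(first_feature, second_feature))  # Euclidean distance between the features
```

Because the parameters are shared, identical inputs always produce identical feature vectors, so their distance is zero regardless of the values of W and b.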

As an example, there is disclosed a method wherein, in step (a), with a loss value between the phishing data and the patterned data calculated using the loss function that refers to the similarity between the phishing data and the patterned data, backpropagation is performed to update the shared weight (W) and the bias b so that the loss value between the phishing data and the patterned data is minimized, whereby the Siamese Network is trained.
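The update of the shared weight and bias can be illustrated on a deliberately tiny model. The sketch assumes a scalar input, a single shared weight `w` and bias `b`, a tanh activation, and a squared-distance loss for one matching pair; none of these specifics come from the text, and the gradients are written out by hand rather than produced by a framework.

```python
import math

def feature(x, w, b):
    """Scalar stand-in for a branch output: tanh(w*x + b)."""
    return math.tanh(w * x + b)

def loss(x1, x2, w, b):
    """Squared distance between the two branch outputs for a matching pair."""
    return (feature(x1, w, b) - feature(x2, w, b)) ** 2

def backprop_step(x1, x2, w, b, lr=0.1):
    """One gradient-descent update of the shared w and b via the chain rule."""
    f1, f2 = feature(x1, w, b), feature(x2, w, b)
    diff = f1 - f2
    g1, g2 = 1.0 - f1 ** 2, 1.0 - f2 ** 2        # derivative of tanh at each branch
    grad_w = 2.0 * diff * (g1 * x1 - g2 * x2)
    grad_b = 2.0 * diff * (g1 - g2)
    return w - lr * grad_w, b - lr * grad_b

w, b = 1.0, 0.0
before = loss(1.0, 0.5, w, b)
for _ in range(200):
    w, b = backprop_step(1.0, 0.5, w, b)
after = loss(1.0, 0.5, w, b)
print(before, after)
```

Training on matching pairs alone can drive all features toward the same value, which is why the scheme above also trains non-matching pairs to stay farther apart than the threshold.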

As an example, there is disclosed a method wherein, in step (a), each normal partial data and each remaining data other than the normal partial data are extracted from each of the phishing data on the basis of special characters, and, with each remaining data divided into n abnormal partial data, each normal partial data is kept as is while the abnormal partial data are replaced, on the basis of the regular expression for each abnormal partial data, with (1_1)-th patterned data through (1_n)-th patterned data, thereby generating each patterned data so as to include the normal partial data and the (1_1)-th patterned data through the (1_n)-th patterned data.

As an example, there is disclosed a method wherein, in step (a), in generating each patterned data from each of the phishing data, with each normal partial data and the n abnormal partial data extracted from each of the phishing data on the basis of the special characters, each of the n abnormal partial data is patterned by determining which of the following patterns it corresponds to: a first pattern corresponding to integers excluding the normal domain, a second pattern corresponding to integers and characters excluding the normal domain, a third pattern corresponding to the normal domain and integers, a fourth pattern corresponding to characters and the normal domain, a fifth pattern corresponding to the normal domain and characters, and a sixth pattern corresponding to characters excluding the normal domain.
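The patterning step can be sketched with ordinary regular expressions. The actual regular expressions are not given in the text, so this is an interpretation under loud assumptions: the placeholder tokens `<P1>` through `<P6>`, the function names `classify` and `patternize`, and the sample domain `example` are all hypothetical; the fourth and fifth patterns are read as "letters before the normal domain" and "letters after the normal domain"; and only a token equal to the normal domain is treated as normal partial data.

```python
import re

def classify(token, domain):
    """Replace one abnormal token with the placeholder of the pattern it matches."""
    d = re.escape(domain)
    rules = [
        ("<P3>", rf"{d}[0-9]+"),     # normal domain followed by digits
        ("<P4>", rf"[a-z]+{d}"),     # letters followed by the normal domain
        ("<P5>", rf"{d}[a-z]+"),     # normal domain followed by letters
        ("<P1>", r"[0-9]+"),         # digits only, no normal domain
        ("<P6>", r"[a-z]+"),         # letters only, no normal domain
        ("<P2>", r"[a-z0-9]+"),      # mix of digits and letters, no normal domain
    ]
    for label, pattern in rules:
        if re.fullmatch(pattern, token):
            return label
    return token

def patternize(data, domain):
    """Split on special characters, keep the normal part, pattern the rest."""
    out = []
    # The capturing group keeps the special characters in the split result.
    for part in re.split(r"([^a-z0-9])", data.lower()):
        is_token = re.fullmatch(r"[a-z0-9]+", part) is not None
        out.append(classify(part, domain) if is_token and part != domain else part)
    return "".join(out)

print(patternize("example.4821.a1b2", "example"))
print(patternize("example123.secure-login.com", "example"))
```

The rule order matters in this sketch: the domain-bearing patterns are tested first so that a token such as `example123` is not swallowed by the generic digits-and-letters rule.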

As an example, there is disclosed a method wherein, in step (b), a plurality of new data are input, and, in step (c), the KD-tree device determines, with reference to the distances between each of the new data features corresponding to the respective new data and each of the target patterned data features represented on the KD-tree, into which of at least one normal domain each of the new data features is clustered.
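Assigning several new data features to normal domains can be sketched as a nearest-labelled-feature lookup. For brevity this sketch uses a linear scan in place of a KD-tree query, which returns the same nearest neighbor but does not scale as well; the domain labels `bank-a` and `shop-b`, the 2-D features, and the name `cluster_by_domain` are made up for illustration.

```python
import math
from collections import defaultdict

def cluster_by_domain(new_features, targets):
    """Group each new feature under the domain of its nearest target feature.

    targets: list of (feature, domain) pairs.
    Returns {domain: [new features assigned to that domain]}.
    """
    clusters = defaultdict(list)
    for feat in new_features:
        _, domain = min(targets, key=lambda t: math.dist(feat, t[0]))
        clusters[domain].append(feat)
    return dict(clusters)

targets = [((0.0, 0.0), "bank-a"), ((10.0, 10.0), "shop-b")]
new_features = [(1.0, 1.0), (9.0, 8.0), (0.0, 2.0)]
print(cluster_by_domain(new_features, targets))
```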

As an example, there is provided a method wherein, in step (c), the KD-tree device, with the distribution of the new data features corresponding to the respective new data and of the target patterned data features represented on the KD-tree converted into two dimensions, clusters each new data feature that is similar to one of the normal domains into the at least one specific normal domain, among the normal domains, to which that new data feature corresponds.

According to another aspect of the present invention, there is provided a computing device for detecting phishing data using a Siamese Network and a KD-Tree algorithm with reference to patterned data obtained by applying regular expressions to the phishing data, the computing device including: at least one memory that stores instructions; and at least one processor configured to execute the instructions, wherein the processor performs: (I) a process of, on condition that (i) each of a plurality of phishing data has been patterned on the basis of its respective regular expression to obtain a plurality of patterned data, (ii) each of the plurality of phishing data has been input to a first CNN of the Siamese Network and each of the plurality of patterned data has been input to a second CNN of the Siamese Network, so that the first CNN outputs first features and the second CNN outputs second features, and (iii) the Siamese Network has been trained so that the distance between the first feature and the second feature is less than or equal to a preset threshold, when a KD-tree device is connected to the output terminal of the Siamese Network, representing target patterned data features corresponding to target patterned data for target data on a KD-tree; (II) a process of, when new data is input to the first CNN of the Siamese Network and a new data feature is output, mapping the new data feature onto the KD-tree; and (III) a process of determining whether the new data corresponds to phishing data with reference to the distance between the new data feature corresponding to the new data and the target patterned data features represented on the KD-tree.

As an example, there is provided a computing device wherein, in process (I), after each of the plurality of phishing data is input to the first CNN of the Siamese Network and each of the plurality of patterned data is input to the second CNN of the Siamese Network, with the first CNN and the second CNN of the Siamese Network having a shared weight (W), (i) if a phishing data and a patterned data are determined to be the same data, the Siamese Network is trained so that the distance between the first feature output from the first CNN and the second feature output from the second CNN is less than or equal to the preset threshold, and (ii) if the phishing data and the patterned data are determined to be different data, the Siamese Network is trained so that the distance between the first feature and the second feature exceeds the preset threshold.

As an example, there is provided a computing device wherein, in process (I), on condition that each of the plurality of phishing data has been converted to integers on the basis of a predetermined dictionary to form a phishing data array composed of numbers, and each of the plurality of patterned data has been converted to integers to form a patterned data array composed of numbers, when each phishing data array is input to the first CNN of the Siamese Network and each patterned data array is input to the second CNN, so that the first CNN outputs a phishing data array feature corresponding to each phishing data array and the second CNN outputs a patterned data array feature corresponding to each patterned data array, the Euclidean distance between the phishing data array feature and the patterned data array feature is calculated to measure the similarity between the phishing data and the patterned data, and the Siamese Network is trained, using a loss function that refers to the similarity, so as to minimize that distance-based similarity measure between each of the phishing data and each of the patterned data.

As an example, there is provided a computing device wherein, in process (I), after each of the plurality of phishing data is obtained and each of the plurality of phishing data is patterned on the basis of its respective regular expression to obtain each of the plurality of patterned data, each of the plurality of phishing data and each of the plurality of patterned data are input to the Siamese Network, the Siamese Network outputs each of the first features and each of the second features as feature vectors, and, when the Siamese Network outputs a distance from the feature vectors, the Siamese Network is trained with reference to that distance.

As an example, there is provided a computing device wherein, in process (I), when each of the phishing data and each of the patterned data are input to the first CNN and the second CNN of the Siamese Network, respectively, with the first CNN and the second CNN having a shared weight (W) and a bias b, the bias b is added to each of the phishing weight data and each of the patterned weight data obtained for each of the phishing data and each of the patterned data, and an activation function is then applied to output a first feature vector for the first feature and a second feature vector for the second feature, whereupon the Euclidean distance between the first feature vector and the second feature vector is calculated to measure the similarity between the phishing data and the patterned data, and the Siamese Network is trained, using a loss function that refers to the similarity, so as to minimize that distance-based similarity measure between each of the phishing data and each of the patterned data.

As an example, there is provided a computing device wherein, in process (I), with a loss value between the phishing data and the patterned data calculated using the loss function that refers to the similarity between the phishing data and the patterned data, backpropagation is performed to update the shared weight (W) and the bias b so that the loss value between the phishing data and the patterned data is minimized, whereby the Siamese Network is trained.

As an example, there is provided a computing device wherein, in process (I), each normal partial data and each remaining data other than the normal partial data are extracted from each of the phishing data on the basis of special characters, and, with each remaining data divided into n abnormal partial data, each normal partial data is kept as is while the abnormal partial data are replaced, on the basis of the regular expression for each abnormal partial data, with (1_1)-th patterned data through (1_n)-th patterned data, thereby generating each patterned data so as to include the normal partial data and the (1_1)-th patterned data through the (1_n)-th patterned data.

As an example, there is provided a computing device wherein, in process (I), in generating each patterned data from each of the phishing data, with each normal partial data and the n abnormal partial data extracted from each of the phishing data on the basis of the special characters, each of the n abnormal partial data is patterned by determining which of the following patterns it corresponds to: a first pattern corresponding to integers excluding the normal domain, a second pattern corresponding to integers and characters excluding the normal domain, a third pattern corresponding to the normal domain and integers, a fourth pattern corresponding to characters and the normal domain, a fifth pattern corresponding to the normal domain and characters, and a sixth pattern corresponding to characters excluding the normal domain.

As an example, there is provided a computing device wherein, in process (II), a plurality of new data are input, and, in process (III), the processor determines, with reference to the distances between each of the new data features corresponding to the respective new data and each of the target patterned data features represented on the KD-tree, into which of at least one normal domain each of the new data features is clustered.

As an example, there is provided a computing device wherein, in process (III), the processor, with the distribution of the new data features corresponding to the respective new data and of the target patterned data features represented on the KD-tree converted into two dimensions, clusters each new data feature that is similar to one of the normal domains into the at least one specific normal domain, among the normal domains, to which that new data feature corresponds.

The present invention has the following effects.

The present invention makes it possible to train a Siamese Network, using phishing data and the patterned data corresponding thereto, so that it recognizes the phishing data and the patterned data as the same.

In addition, the present invention makes it possible to increase the accuracy of phishing detection by, after the Siamese Network has been trained and with target patterned data features corresponding to target data represented on a KD-Tree, mapping onto the KD-Tree the new data feature that the Siamese Network outputs when arbitrary new data is input to it.

The drawings attached below for use in describing embodiments of the present invention are only some of the embodiments of the present invention, and a person having ordinary skill in the art to which the present invention pertains (hereinafter, "a person skilled in the art") can obtain other drawings based on these drawings without inventive work.
Figure 1 is a diagram schematically showing the configuration of a device that detects phishing data using a Siamese Network and a KD-Tree algorithm, according to an embodiment of the present invention.
Figure 2 is a flowchart schematically showing the process of detecting new phishing data using a KD-Tree algorithm and a Siamese Network trained with reference to phishing data and patterned data, according to an embodiment of the present invention.
Figure 3 is a diagram for explaining the process of training the Siamese Network using phishing data input to the first CNN of the Siamese Network and patterned data input to the second CNN of the Siamese Network, according to an embodiment of the present invention.
Figure 4 is a diagram for explaining the process of, after the Siamese Network has been trained, mapping a new data feature, output when new data is input to the first CNN of the Siamese Network, onto the KD-tree on which the target patterned data features are represented, according to an embodiment of the present invention.

The detailed description of the present invention given below refers to the accompanying drawings, which show, by way of example, specific embodiments in which the present invention may be practiced, in order to make clear the objects, technical solutions, and advantages of the present invention. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention.

In addition, throughout the detailed description and claims of the present invention, the word "comprise" and its variations are not intended to exclude other technical features, additions, components, or steps. Other objects, advantages, and features of the present invention will become apparent to those skilled in the art, partly from this description and partly from practice of the invention. The examples and drawings below are provided by way of illustration and are not intended to limit the present invention.

Moreover, the present invention encompasses all possible combinations of the embodiments shown in this specification. It should be understood that the various embodiments of the present invention differ from one another but need not be mutually exclusive. For example, a specific shape, structure, or characteristic described herein in connection with one embodiment may be implemented in another embodiment without departing from the spirit and scope of the present invention. It should also be understood that the position or arrangement of individual components within each disclosed embodiment may be changed without departing from the spirit and scope of the present invention. Accordingly, the detailed description that follows is not to be taken in a limiting sense, and the scope of the present invention, if properly described, is limited only by the appended claims together with the full range of equivalents to which those claims are entitled. Like reference numerals in the drawings refer to the same or similar functions throughout the several aspects.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings so that those skilled in the art can easily practice the present invention.

FIG. 1 is a diagram showing the schematic configuration of a computing device that detects phishing data using the Siamese Network and KD-Tree algorithm, according to an embodiment of the present invention.

As shown in FIG. 1, the computing device 100 that detects phishing data using the Siamese Network and KD-Tree algorithm may include a memory 110 and a processor 120. Here, the computing device 100 may be a concept that encompasses both the Siamese Network and a KD-tree device equipped with the KD-Tree algorithm, but is not limited thereto.

The memory 110 of the computing device 100, which detects phishing data using the Siamese Network and KD-Tree algorithm with reference to patterned data obtained through regular expressions for the phishing data, may store instructions to be executed by the processor 120. Specifically, the instructions are code generated to cause the computing device 100 to function in a particular manner in order to provide content, and may be stored in a computer-usable or computer-readable memory that can direct a computer or other programmable data processing equipment. The instructions may perform processes for executing the functions described in this specification.

The processor 120 of the computing device 100, which detects phishing data using the Siamese Network and KD-Tree algorithm with reference to patterned data obtained through regular expressions for the phishing data, may include hardware components such as an MPU (Micro Processing Unit) or CPU (Central Processing Unit), a cache memory, and a data bus. The computing device 100 may further include an operating system and the software components of an application that performs a specific purpose.

The computing device 100 may also be linked to a database. Here, the database may include at least one type of storage medium among a flash memory type, a hard disk type, a multimedia card micro type, a card-type memory (for example, SD or XD memory), RAM (Random Access Memory), SRAM (Static Random Access Memory), ROM (Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), PROM (Programmable Read-Only Memory), magnetic memory, a magnetic disk, and an optical disk; it is not limited to these and may include any medium capable of storing data. The database may be installed separately from the computing device 100 or, alternatively, installed inside the computing device 100 to transmit data or record received data. Unlike what is shown, it may also be implemented as two or more separate databases, depending on the conditions under which the invention is practiced.

A method using the computing device 100 configured as described above according to an embodiment of the present invention is described below with reference to FIG. 2.

FIG. 2 is a flowchart schematically showing the process of detecting new phishing data using the KD-Tree algorithm and the Siamese Network trained with reference to phishing data and patterned data, according to an embodiment of the present invention.

Referring to FIG. 2, the Siamese Network may be trained first. Specifically, once each of a plurality of patterned data is obtained by patterning each of a plurality of phishing data based on the regular expression for that phishing data, each of the plurality of phishing data is input to the first CNN of the Siamese Network and each of the plurality of patterned data is input to the second CNN of the Siamese Network, and the Siamese Network may be trained so that the distance between each first feature output by the first CNN and each second feature output by the second CNN is less than or equal to a preset threshold. With the Siamese Network trained in this way, the computing device 100 may connect the KD-tree device to the output terminal of the Siamese Network and then represent, on the KD-tree, the target patterned data feature corresponding to the target patterned data for the target data (S201).

Here, the patterned data may be created by applying a regular expression based on a predetermined criterion to the phishing data, but is not limited thereto.

Specifically, consider patterning phishing data ('webmaildaum.com' and 'cafe.midnaver.com') into patterned data through regular expressions. Using the special characters ('.', '-') as delimiters, the normal partial data ('daum', 'naver') and the remaining data other than the normal partial data ('webmail', 'com'; 'cafe', 'mid', 'com') are extracted from each piece of phishing data, and the remaining data may in turn be divided into at least one piece of abnormal partial data. The normal partial data ('daum', 'naver') is then kept as is, and each piece of abnormal partial data is replaced according to criteria including the following: {1. an integer other than the normal domain becomes the #P1# pattern; 2. an integer plus characters other than the normal domain becomes the #P2# pattern; 3. the normal domain followed by an integer becomes the domain name#P3# pattern; 4. characters followed by the normal domain become the domain name#P4# pattern; 5. the normal domain followed by characters becomes the domain name#P5# pattern; 6. characters other than the normal domain become the #P6# pattern}. This yields the patterned data '#P4#daum.#P6#' and '#P6#.#P4#naver.#P6#', respectively. Of course, the invention is not limited to this example, and various modifications are possible.
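The patterning rules above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the regular expressions actually used: the list `NORMAL_DOMAINS` and the function names are hypothetical, any prefix before a normal domain is treated as rule 4, and the placement of `#P4#` in front of the domain name follows the worked example ('#P4#daum').

```python
import re

# Hypothetical list of legitimate (normal) domain names; not from the source.
NORMAL_DOMAINS = ("daum", "naver")


def pattern_token(token: str) -> str:
    """Replace one delimiter-free token with its pattern."""
    for name in NORMAL_DOMAINS:
        idx = token.find(name)
        if idx < 0:
            continue
        prefix, suffix = token[:idx], token[idx + len(name):]
        out = name  # the normal part is kept as is
        if prefix:
            # Rule 4 (simplified): anything in front of the normal domain.
            out = "#P4#" + out
        if suffix:
            # Rule 3 for a numeric suffix, rule 5 otherwise.
            out += "#P3#" if suffix.isdigit() else "#P5#"
        return out
    if token.isdigit():
        return "#P1#"  # rule 1: integers only
    if re.search(r"\d", token):
        return "#P2#"  # rule 2: integers mixed with characters
    return "#P6#"      # rule 6: characters only


def patternize(domain: str) -> str:
    """Split on '.' and '-' (keeping the delimiters) and pattern each token."""
    parts = re.split(r"([.\-])", domain.lower())
    return "".join(p if p in (".", "-") else pattern_token(p) for p in parts if p)


print(patternize("webmaildaum.com"))
print(patternize("cafe.midnaver.com"))
```

The two calls reproduce the patterned data given in the example. A production rule set would likely need explicit handling of cases the text does not cover, such as a token that contains more than one normal domain name.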

Meanwhile, the phishing data and the patterned data may each be converted to integers, turned into arrays of integers, and then input to the Siamese Network 300.

Specifically, each of the plurality of phishing data and each of the plurality of patterned data may be converted to integers by matching each digit, each letter, and each special symbol to an integer, based on a Dictionary that divides digits, letters, and special symbols according to a predetermined criterion and maps each to an integer.

For example, to convert each of the plurality of phishing data and each of the plurality of patterned data to integers, a Dictionary containing the correspondence {'0': 0, '1': 1, '2': 2, '3': 3, '4': 4, '5': 5, '6': 6, '7': 7, '8': 8, '9': 9, 'a': 10, 'b': 11, 'c': 12, 'd': 13, 'e': 14, 'f': 15, 'g': 16, 'h': 17, 'i': 18, 'j': 19, 'k': 20, 'l': 21, 'm': 22, 'n': 23, 'o': 24, 'p': 25, 'q': 26, 'r': 27, 's': 28, 't': 29, 'u': 30, 'v': 31, 'w': 32, 'x': 33, 'y': 34, 'z': 35, '.': 36, '-': 37, '_': 38} may be used to replace the characters of each piece of data with integers, producing arrays of integers. Each piece of phishing data may then be input to the first CNN 310 of the Siamese Network 300 as an integer array, and each piece of patterned data may be input to the second CNN 320 as an integer array. Of course, the contents of the Dictionary are not limited to letters and digits. When a symbol that is not in the Dictionary must be replaced, for example when a symbol such as '#' is added, it may be matched to an integer of 39 or more and 40 or less.
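The Dictionary-based conversion can be illustrated with a short Python sketch. The mapping for digits, letters, '.', '-', and '_' follows the text; assigning '#' to 39 is one choice within the range the text allows, and lowercasing the input (so that the 'P' in a pattern such as '#P4#' maps to 'p') is an assumption of this sketch, as are the names `DICTIONARY` and `encode`.

```python
import string

# '0'-'9' -> 0-9, 'a'-'z' -> 10-35, '.' -> 36, '-' -> 37, '_' -> 38 (from the text).
CHARS = string.digits + string.ascii_lowercase + ".-_"
DICTIONARY = {ch: i for i, ch in enumerate(CHARS)}
DICTIONARY["#"] = 39  # assumption: '#' takes the first free integer


def encode(text: str) -> list[int]:
    """Convert a domain or patterned string into an array of integers."""
    # Assumption: input is lowercased before lookup.
    return [DICTIONARY[ch] for ch in text.lower()]


print(encode("daum.com"))
print(encode("#P4#daum.#P6#"))
```

A character outside the Dictionary raises `KeyError` here, which makes gaps in the mapping visible rather than silently encoding them.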

FIG. 3 is a diagram illustrating in more detail the process of training the Siamese Network using phishing data input to the first CNN of the Siamese Network and patterned data input to the second CNN of the Siamese Network, according to an embodiment of the present invention.

For example, referring to FIG. 3, when phishing data is input to the first CNN 310 of the Siamese Network 300 and patterned data is input to the second CNN 320 of the Siamese Network 300, the first CNN 310 outputs a first feature and the second CNN 320 outputs a second feature.

After each of the plurality of phishing data is input to the first CNN 310 of the Siamese Network 300 and each of the plurality of patterned data is input to the second CNN 320 of the Siamese Network 300 in this way, the Siamese Network 300 may be trained as follows. When a piece of phishing data and a piece of patterned data are the same data, the network is trained so that the first feature output from the first CNN 310 and the second feature output from the second CNN 320 are close to each other. When they are different data, the network is trained so that the first feature and the second feature are far apart.

Specifically, in the process of training the Siamese Network 300, let the input to the Siamese Network 300 be input(x_1, x_2) = input(phishing data, patterned data). When each piece of phishing data (x_1) and each piece of patterned data (x_2) are input to the first CNN 310 and the second CNN 320 of the Siamese Network 300, respectively, and the first CNN 310 and the second CNN 320 share a weight (W) and a bias b, the phishing weight data (Wx_1) and the patterned weight data (Wx_2) are computed for the phishing data (x_1) and the patterned data (x_2), the bias b is added, and the LeakyReLU activation function is applied, which gives f_1 = LeakyReLU(Wx_1+b) and f_2 = LeakyReLU(Wx_2+b). Here, LeakyReLU may mean f(x) = max(0.01x, x).
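The shared-weight computation f = LeakyReLU(Wx+b) can be sketched in plain Python as follows. This is a simplified illustration rather than the CNN itself: it assumes a single fully connected layer, a weight matrix `W` given as nested lists, and a bias vector `b` (the text mentions only a bias b); the function names are hypothetical.

```python
def leaky_relu(x: float, slope: float = 0.01) -> float:
    """LeakyReLU as defined in the text: f(x) = max(0.01x, x)."""
    return max(slope * x, x)


def shared_branch(x, W, b):
    """Compute LeakyReLU(Wx + b) for one input vector.

    Both branches call this with the same W and b, which is what
    "shared weight" means in a Siamese network.
    """
    return [
        leaky_relu(sum(w * xi for w, xi in zip(row, x)) + bias)
        for row, bias in zip(W, b)
    ]


# Toy parameters (assumed values, for illustration only).
W = [[1.0, 0.0], [0.0, -1.0]]
b = [0.0, 0.0]

f_1 = shared_branch([2.0, 3.0], W, b)  # branch fed with x_1
f_2 = shared_branch([2.0, 3.5], W, b)  # branch fed with x_2, same W and b
print(f_1, f_2)
```

Because the two branches use identical parameters, inputs that are close to each other produce features that are close to each other, which is the property the training step relies on.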

Here, the first feature, which is the output of the first CNN 310, may be expressed as the first feature vector (f_1) given by f_1 = LeakyReLU(Wx_1+b), and the second feature, which is the output of the second CNN 320, may be expressed as the second feature vector (f_2) given by f_2 = LeakyReLU(Wx_2+b). The Euclidean distance (d) between the first feature vector (f_1) and the second feature vector (f_2) is then calculated to obtain a value corresponding to the similarity between the phishing data and the patterned data (formula: $d = \lVert f_1 - f_2 \rVert_2 = \sqrt{\sum_i (f_{1,i} - f_{2,i})^2}$).

Thereafter, the Siamese Network 300 may be trained, using a loss function that refers to the similarity (RMSE, formula: $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$), in the direction that minimizes the similarity value (that is, the distance d) between each piece of phishing data and the corresponding patterned data. Specifically, the Siamese Network 300 may be trained by performing backpropagation with the loss value between the phishing data and the patterned data and updating the shared weight (W) and the bias b.
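The distance and the loss described above can be written out in a few lines of Python. This is a sketch under assumptions: the text names RMSE but this page does not preserve what the loss compares, so the `targets` argument (for instance 0 for a matching pair) is hypothetical, as are the function names.

```python
import math


def euclidean_distance(f1, f2) -> float:
    """Euclidean distance d between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))


def rmse_loss(distances, targets) -> float:
    """Root mean squared error between pair distances and target distances.

    Assumption: the target is 0 for a (phishing, patterned) pair that
    represents the same data, so minimizing the loss pulls the pair together.
    """
    n = len(distances)
    return math.sqrt(sum((d - t) ** 2 for d, t in zip(distances, targets)) / n)


# Two matching pairs; their features should end up close (target 0).
d_a = euclidean_distance([0.0, 0.0], [3.0, 4.0])
d_b = euclidean_distance([1.0, 1.0], [1.0, 1.0])
print(d_a, d_b, rmse_loss([d_a, d_b], [0.0, 0.0]))
```

In training, this loss value would be backpropagated to update the shared weight W and bias b; that step is omitted here because it depends on the network architecture.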

Next, when new data is input to the first CNN 310 of the Siamese Network 300 and a new data feature is output, the KD-tree device may map the new data feature onto the KD-tree (S202).

FIG. 4 is a diagram illustrating the process in which, according to an embodiment of the present invention, after the Siamese Network is trained, new data is input to the first CNN of the Siamese Network, a new data feature is output, and the new data feature is mapped onto the KD-tree on which the target patterned data features are represented.

Specifically, referring to FIG. 4, with the target patterned data feature for the target data already represented on the KD-tree in step S201, when new data is input to the first CNN 310 and a new data feature is output, the new data feature output from the first CNN 310 may be mapped onto the KD-tree 400.

At this point, unlike step S201 in which the Siamese Network 300 was being trained, patterned data for the new data need not be input to the second CNN 320. The Siamese Network 300 has already been trained so that new data lies close to its corresponding patterned data, so even without generating new patterned data from the new data, the desired result can be obtained by mapping onto the KD-tree only the new data feature generated from the new data alone.

Accordingly, the KD-tree device 400 may determine whether the new data corresponds to phishing data with reference to the distance between the new data feature corresponding to the new data and the target patterned data feature represented on the KD-tree (S203).

Specifically, referring to FIG. 4, when new data is input to the first CNN 310, a new data feature is output, and the new data feature output from the first CNN 310 is mapped onto the KD-tree, the KD-tree 400 calculates the distance between the target patterned data feature already represented on the KD-tree and the new data feature mapped onto the KD-tree. If the distance is greater than or equal to a predetermined threshold, the new data may be determined to be a normal domain; if the distance is less than the predetermined threshold, the new data may be determined to be a phishing domain.
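The nearest-feature lookup and threshold decision can be sketched with a small KD-tree in plain Python. This is an illustrative implementation under assumptions, not the KD-tree device of the embodiment: the class and function names, the toy 2-dimensional features, and the threshold value are all hypothetical.

```python
import math


class Node:
    def __init__(self, point, axis, left, right):
        self.point, self.axis, self.left, self.right = point, axis, left, right


def build(points, depth=0):
    """Build a KD-tree by splitting on the median along alternating axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(points[mid], axis,
                build(points[:mid], depth + 1),
                build(points[mid + 1:], depth + 1))


def nearest(node, query, best=None):
    """Return (distance, point) of the stored point closest to query."""
    if node is None:
        return best
    d = math.dist(node.point, query)
    if best is None or d < best[0]:
        best = (d, node.point)
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the best hit.
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best


def is_phishing(tree, feature, threshold):
    """Distance below the threshold means phishing, as stated in the text."""
    distance, _ = nearest(tree, feature)
    return distance < threshold


# Toy target patterned data features (assumed values).
targets = [(0.0, 0.0), (5.0, 5.0), (1.0, 1.0), (9.0, 0.0)]
tree = build(targets)
print(is_phishing(tree, (1.2, 0.9), threshold=0.5))
print(is_phishing(tree, (20.0, 20.0), threshold=0.5))
```

The tree stores only the target patterned data features, which correspond to phishing domains, so a new feature that lands close to any of them is flagged, while one that is far from all of them is treated as normal.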

Here, the target patterned data feature refers to the data output as a target patterned data feature after target patterned data is generated through a regular expression for the target data to be detected and is input to the Siamese Network 300. The target data corresponds to a phishing domain.

In addition, the KD-tree device 400 may calculate the distances after converting the distribution of each new data feature corresponding to each piece of new data and of each target patterned data feature represented on the KD-tree into two dimensions, so that if a piece of new data is similar to a normal domain, it is clustered near the normal domain to which it corresponds.
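The clustering step can be illustrated with a short Python sketch. The text does not specify how the features are converted to two dimensions, so this sketch assumes the 2-dimensional coordinates are already given; the reference points for the normal domains, the `max_distance` cutoff, and the function name are hypothetical.

```python
import math


def assign_clusters(new_points, domain_points, max_distance):
    """Group each new 2-D feature with the nearest normal domain.

    new_points:    {label: (x, y)} for the new data features
    domain_points: {domain name: (x, y)} reference positions (assumed given)
    A feature farther than max_distance from every domain is left unassigned.
    """
    clusters = {name: [] for name in domain_points}
    for label, point in new_points.items():
        name, anchor = min(domain_points.items(),
                           key=lambda kv: math.dist(point, kv[1]))
        if math.dist(point, anchor) <= max_distance:
            clusters[name].append(label)
    return clusters


domains = {"daum": (0.0, 0.0), "naver": (10.0, 0.0)}
features = {"a": (0.5, 0.2), "b": (9.5, 0.1), "c": (5.0, 50.0)}
print(assign_clusters(features, domains, max_distance=1.0))
```

Here feature "c" is similar to neither domain and therefore joins no cluster.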

The embodiments of the present invention described above may be implemented in the form of program instructions that can be executed by various computer components and recorded on a computer-readable recording medium. The computer-readable recording medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the computer-readable recording medium may be specially designed and configured for the present invention, or may be known to and usable by those skilled in the field of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware device may be configured to operate as one or more software modules to perform the processing according to the present invention, and vice versa.

Although the present invention has been described above with reference to specific matters such as particular components, and to limited embodiments and drawings, these are provided only to aid a more general understanding of the present invention. The present invention is not limited to the above embodiments, and a person having ordinary skill in the art to which the present invention pertains can make various modifications and variations from this description.

Therefore, the spirit of the present invention should not be limited to the embodiments described above, and not only the claims set out below but also everything modified equally or equivalently to those claims falls within the scope of the spirit of the present invention.

Claims

A method of detecting phishing data using a Siamese Network and a KD-Tree algorithm with reference to patterned data obtained through regular expressions for the phishing data, the method comprising:
(a) in a state in which, after each of a plurality of patterned data is obtained by patterning each of a plurality of phishing data based on the regular expression for each of the plurality of phishing data, each of the plurality of phishing data is input to a first CNN of the Siamese Network, each of the plurality of patterned data is input to a second CNN of the Siamese Network, first features are output by the first CNN and second features are output by the second CNN, and the Siamese Network has been trained so that the distance between the first feature and the second feature is less than or equal to a preset threshold, when a KD-tree device is connected to the output terminal of the Siamese Network, representing, by the KD-tree device, a target patterned data feature corresponding to target patterned data for target data on a KD-tree;
(b) when new data is input to the first CNN of the Siamese Network and a new data feature is output, mapping, by the KD-tree device, the new data feature onto the KD-tree; and
(c) determining, by the KD-tree device, whether the new data corresponds to phishing data with reference to the distance between the new data feature corresponding to the new data and the target patterned data feature represented on the KD-tree,
wherein,
in step (a),
each of the plurality of phishing data is converted to integers based on a predetermined dictionary to produce a phishing data array composed of numbers, each of the plurality of patterned data is converted to integers to produce a patterned data array composed of numbers, and, when each phishing data array is input to the first CNN of the Siamese Network and each patterned data array is input to the second CNN, so that a phishing data array feature corresponding to the phishing data array is output by the first CNN and a patterned data array feature corresponding to the patterned data array is output by the second CNN, the Euclidean distance between the phishing data array feature and the patterned data array feature is calculated to measure the similarity between the phishing data and the patterned data, and the Siamese Network is trained, using a loss function that refers to the similarity, to minimize the similarity between each of the phishing data and each of the patterned data.

The method according to claim 1,
wherein, in step (a),
after each of the plurality of phishing data is input to the first CNN of the Siamese Network and each of the plurality of patterned data is input to the second CNN of the Siamese Network, with a shared weight (W) between the first CNN and the second CNN of the Siamese Network, if each of the plurality of phishing data and each of the plurality of patterned data are determined to be the same data, the Siamese Network is trained so that the distance between each first feature output from the first CNN and each second feature output from the second CNN is less than or equal to a preset threshold, and if each of the plurality of phishing data and each of the plurality of patterned data are determined to be different data, the Siamese Network is trained so that the distance between each first feature and each second feature exceeds the preset threshold.

(Deleted)

The method of claim 1,
wherein, in step (a),
after the plurality of phishing data are acquired and the plurality of patterned data are obtained by patterning each of the plurality of phishing data based on the regular expression for each of the plurality of phishing data, when each of the plurality of phishing data and each of the plurality of patterned data are input to the Siamese Network, each first feature and each second feature are output as feature vectors by the Siamese Network, and the distance between the feature vectors is output by the Siamese Network, the Siamese Network is trained with reference to the distance.

A method of detecting phishing data using a Siamese Network and a KD-tree algorithm with reference to patterned data obtained through regular expressions for the phishing data, the method comprising:
(a) in a state in which a plurality of patterned data have been obtained by patterning each of a plurality of phishing data based on a regular expression for each of the plurality of phishing data, each of the plurality of phishing data has been input to a first CNN of the Siamese Network and each of the plurality of patterned data has been input to a second CNN of the Siamese Network, first features have been output by the first CNN and second features have been output by the second CNN, and the Siamese Network has been trained so that the distance between the first feature and the second feature is less than or equal to a preset threshold, when a KD-tree device is connected to the output terminal of the Siamese Network, expressing, by the KD-tree device, a target patterned data feature corresponding to target patterned data for target data on a KD-tree;
(b) when new data is input to the first CNN of the Siamese Network and a new data feature is output, mapping, by the KD-tree device, the new data feature onto the KD-tree; and
(c) determining, by the KD-tree device, whether the new data corresponds to phishing data with reference to the distance between the new data feature corresponding to the new data and the target patterned data feature expressed on the KD-tree,
wherein, in step (a),
when each of the phishing data and each of the patterned data are input to the first CNN and the second CNN of the Siamese Network, respectively, in a state in which a weight (W) and a bias b are shared between the first CNN and the second CNN, the bias b is added to each of phishing weight data and patterned weight data for each of the phishing data and each of the patterned data, and an activation function is then applied, so that a first feature vector for the first feature and a second feature vector for the second feature are output; the Euclidean distance between the first feature vector and the second feature vector is calculated to measure the similarity between the phishing data and the patterned data; and a training process is performed, using a loss function that refers to the similarity, in the direction of minimizing the similarity between each of the phishing data and each of the patterned data.

The method of claim 5,
wherein, in step (a),
in a state in which a loss value between the phishing data and the patterned data has been calculated using the loss function that refers to the similarity between the phishing data and the patterned data, backpropagation is performed to update the shared weight (W) and the bias b so that the loss value between the phishing data and the patterned data is minimized, thereby training the Siamese Network.
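
The shared-parameter forward pass and the backpropagation update can be illustrated with a deliberately small stand-in. The sketch below is an assumption-laden simplification: a single fully connected layer with a tanh activation replaces each CNN branch, the loss is the squared Euclidean distance for one matching pair, and the learning rate, initial values, and function names are hypothetical.

```python
import math

def forward(W, b, x):
    """Shared branch: activation(W x + b), with tanh as the assumed activation."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def loss(W, b, x1, x2):
    """Squared Euclidean distance between the two branch outputs."""
    f1, f2 = forward(W, b, x1), forward(W, b, x2)
    return sum((p - q) ** 2 for p, q in zip(f1, f2))

def train_step(W, b, x1, x2, lr=0.1):
    """One gradient-descent update of the shared W and b via backpropagation."""
    f1, f2 = forward(W, b, x1), forward(W, b, x2)
    # Gradient of the loss w.r.t. each branch's pre-activation,
    # using d tanh(z)/dz = 1 - tanh(z)^2.
    dz1 = [2 * (p - q) * (1 - p * p) for p, q in zip(f1, f2)]
    dz2 = [-2 * (p - q) * (1 - q * q) for p, q in zip(f1, f2)]
    # Both branches share W and b, so their gradients are summed.
    new_W = [[w - lr * (g1 * a + g2 * c) for w, a, c in zip(row, x1, x2)]
             for row, g1, g2 in zip(W, dz1, dz2)]
    new_b = [bi - lr * (g1 + g2) for bi, g1, g2 in zip(b, dz1, dz2)]
    return new_W, new_b

# Hypothetical two-dimensional inputs standing in for a phishing/patterned pair.
W = [[0.5, -0.3], [0.2, 0.4]]
b = [0.0, 0.1]
x_phishing, x_patterned = [1.0, 0.0], [0.0, 1.0]

loss_before = loss(W, b, x_phishing, x_patterned)
W, b = train_step(W, b, x_phishing, x_patterned)
loss_after = loss(W, b, x_phishing, x_patterned)
print(loss_before, loss_after)
```

Training on matching pairs alone would let every output collapse to the same point, which is why a practical setup also includes non-matching pairs, as in the threshold-based training of claim 2.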

A method of detecting phishing data using a Siamese Network and a KD-tree algorithm with reference to patterned data obtained through regular expressions for the phishing data, the method comprising:
(a) in a state in which a plurality of patterned data have been obtained by patterning each of a plurality of phishing data based on a regular expression for each of the plurality of phishing data, each of the plurality of phishing data has been input to a first CNN of the Siamese Network and each of the plurality of patterned data has been input to a second CNN of the Siamese Network, first features have been output by the first CNN and second features have been output by the second CNN, and the Siamese Network has been trained so that the distance between the first feature and the second feature is less than or equal to a preset threshold, when a KD-tree device is connected to the output terminal of the Siamese Network, expressing, by the KD-tree device, a target patterned data feature corresponding to target patterned data for target data on a KD-tree;
(b) when new data is input to the first CNN of the Siamese Network and a new data feature is output, mapping, by the KD-tree device, the new data feature onto the KD-tree; and
(c) determining, by the KD-tree device, whether the new data corresponds to phishing data with reference to the distance between the new data feature corresponding to the new data and the target patterned data feature expressed on the KD-tree,
wherein, in step (a),
each normal partial data and each remaining data excluding the normal partial data are extracted from each of the phishing data based on each special character; each remaining data is divided into n abnormal partial data; each normal partial data is kept as is, and the n abnormal partial data are replaced with 1_1-st patterned data to 1_n-th patterned data based on the regular expression for each of the abnormal partial data; and each of the patterned data is generated so as to include the normal partial data and the 1_1-st patterned data to the 1_n-th patterned data.
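
The patterning step can be sketched with the standard `re` module. Everything specific here is an assumption made for illustration: the allow-list of normal parts, the choice of non-alphanumeric characters as the special characters, the three placeholder tokens, and the function names are not taken from the patent.

```python
import re

# Hypothetical allow-list of parts treated as normal partial data.
NORMAL_PARTS = {"paypal", "com"}

def pattern_abnormal(part):
    """Replace one abnormal part with a placeholder chosen by regular expression."""
    if re.fullmatch(r"\d+", part):
        return "<NUM>"
    if re.fullmatch(r"[A-Za-z]+", part):
        return "<STR>"
    return "<MIX>"

def patternize(data):
    """Split on special characters, keep normal parts, and pattern the rest."""
    # With a capturing group, re.split alternates text and delimiter:
    # even indexes are text parts, odd indexes are the special characters.
    pieces = re.split(r"([^A-Za-z0-9]+)", data)
    out = []
    for i, piece in enumerate(pieces):
        if not piece:
            continue
        if i % 2 == 1 or piece in NORMAL_PARTS:
            out.append(piece)          # special characters and normal parts stay as is
        else:
            out.append(pattern_abnormal(piece))
    return "".join(out)

print(patternize("paypal-4821.login.com"))
print(patternize("paypal.secure123.com"))
```

Keeping the special characters in place preserves the positional structure of the original string, so the patterned data stays aligned with the phishing data it was derived from.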

The method of claim 7,
wherein, in step (a),
in generating each of the patterned data from each of the phishing data, each normal partial data and the n abnormal partial data are extracted from each of the phishing data based on each special character, and patterning is performed by determining to which of the following patterns each of the n abnormal partial data corresponds: a first pattern corresponding to integers excluding the normal domain; a second pattern corresponding to integers and characters excluding the normal domain; a third pattern corresponding to the normal domain and integers; a fourth pattern corresponding to the normal domain, integers, and characters; a fifth pattern corresponding to the normal domain and characters; and a sixth pattern corresponding to characters excluding the normal domain.
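
Read this way, the six patterns are the combinations of three yes/no properties of an abnormal part: whether it contains the normal domain, whether it contains integers, and whether it contains characters. The sketch below encodes that reading as a lookup table; the interpretation of "characters" as ASCII letters, the substring test for the normal domain, and the function name are assumptions.

```python
import re

# (contains normal domain, contains digits, contains letters) -> pattern number
PATTERN_TABLE = {
    (False, True, False): 1,   # integers only
    (False, True, True): 2,    # integers and characters
    (True, True, False): 3,    # normal domain and integers
    (True, True, True): 4,     # normal domain, integers, and characters
    (True, False, True): 5,    # normal domain and characters
    (False, False, True): 6,   # characters only
}

def classify_pattern(part, normal_domain):
    """Return the pattern number (1 to 6) for an abnormal part, or None if none applies."""
    has_domain = normal_domain in part
    # Look at what remains once the normal domain itself is set aside.
    rest = part.replace(normal_domain, "") if has_domain else part
    has_digit = re.search(r"\d", rest) is not None
    has_alpha = re.search(r"[A-Za-z]", rest) is not None
    return PATTERN_TABLE.get((has_domain, has_digit, has_alpha))

for sample in ["4821", "login77", "paypal99", "paypal1secure", "paypalsecure", "secure"]:
    print(sample, classify_pattern(sample, "paypal"))
```

A part consisting of the normal domain alone matches none of the six patterns and returns `None`, which fits the claim's treatment of normal partial data as kept unchanged.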

A method of detecting phishing data using a Siamese Network and a KD-tree algorithm with reference to patterned data obtained through regular expressions for the phishing data, the method comprising:
(a) in a state in which a plurality of patterned data have been obtained by patterning each of a plurality of phishing data based on a regular expression for each of the plurality of phishing data, each of the plurality of phishing data has been input to a first CNN of the Siamese Network and each of the plurality of patterned data has been input to a second CNN of the Siamese Network, first features have been output by the first CNN and second features have been output by the second CNN, and the Siamese Network has been trained so that the distance between the first feature and the second feature is less than or equal to a preset threshold, when a KD-tree device is connected to the output terminal of the Siamese Network, expressing, by the KD-tree device, a target patterned data feature corresponding to target patterned data for target data on a KD-tree;
(b) when new data is input to the first CNN of the Siamese Network and a new data feature is output, mapping, by the KD-tree device, the new data feature onto the KD-tree; and
(c) determining, by the KD-tree device, whether the new data corresponds to phishing data with reference to the distance between the new data feature corresponding to the new data and the target patterned data feature expressed on the KD-tree,
wherein, in step (b),
a plurality of the new data are input, and
wherein, in step (c),
the KD-tree device determines, with reference to the distance between each new data feature corresponding to each of the new data and each target patterned data feature expressed on the KD-tree, into which of at least one normal domain each new data feature is clustered.
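
The nearest-feature lookup on a KD-tree can be sketched in plain Python. This is an illustrative reconstruction under assumptions: features are short tuples of floats, each target patterned data feature carries the label of a normal domain, and the distance threshold, labels, and function names are hypothetical.

```python
import math

def build(points, depth=0):
    """Build a KD-tree from (feature_vector, label) pairs, splitting on alternating axes."""
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build(points[:mid], depth + 1),
        "right": build(points[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    """Return (distance, label) of the stored feature closest to query."""
    if node is None:
        return best
    vec, label = node["point"]
    d = math.dist(vec, query)
    if best is None or d < best[0]:
        best = (d, label)
    diff = query[node["axis"]] - vec[node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Search the far side only if the splitting plane is closer than the best match so far.
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best

def assign_domain(tree, feature, threshold):
    """Return the normal domain whose target feature lies within threshold, else None."""
    dist, domain = nearest(tree, feature)
    return domain if dist <= threshold else None

# Hypothetical target patterned data features, one per normal domain.
targets = [((0.0, 0.0), "paypal"), ((5.0, 5.0), "examplebank"), ((10.0, 0.0), "exampleshop")]
tree = build(targets)

for new_feature in [(0.5, 0.2), (5.1, 4.9), (20.0, 20.0)]:
    print(new_feature, assign_domain(tree, new_feature, threshold=1.0))
```

The tree lets each query skip whole regions of feature space, which matters when many new data features are checked against a large set of target features.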

The method of claim 9,
wherein, in step (c),
in a state in which the KD-tree device has converted the distribution of each new data feature corresponding to each of the new data and each target patterned data feature expressed on the KD-tree into two dimensions, when a new data feature is similar to one of the normal domains, the KD-tree device clusters that new data feature into at least one specific normal domain, among the normal domains, corresponding to that new data feature.

A computing device for detecting phishing data using a Siamese Network and a KD-tree algorithm with reference to patterned data obtained through regular expressions for the phishing data, the computing device comprising:
at least one memory storing instructions; and
at least one processor configured to execute the instructions,
wherein the processor performs: (I) a process of, in a state in which a plurality of patterned data have been obtained by patterning each of a plurality of phishing data based on a regular expression for each of the plurality of phishing data, each of the plurality of phishing data has been input to a first CNN of the Siamese Network and each of the plurality of patterned data has been input to a second CNN of the Siamese Network, first features have been output by the first CNN and second features have been output by the second CNN, and the Siamese Network has been trained so that the distance between the first feature and the second feature is less than or equal to a preset threshold, when a KD-tree device is connected to the output terminal of the Siamese Network, expressing a target patterned data feature corresponding to target patterned data for target data on a KD-tree; (II) a process of, when new data is input to the first CNN of the Siamese Network and a new data feature is output, mapping the new data feature onto the KD-tree; and (III) a process of determining whether the new data corresponds to phishing data with reference to the distance between the new data feature corresponding to the new data and the target patterned data feature expressed on the KD-tree, and
wherein, in process (I),
each of the plurality of phishing data is converted to integers based on a predetermined dictionary to generate a numeric phishing data array, and each of the plurality of patterned data is converted to integers to generate a numeric patterned data array; when each phishing data array is input to the first CNN of the Siamese Network and each patterned data array is input to the second CNN, so that the first CNN outputs a phishing data array feature corresponding to the phishing data array and the second CNN outputs a patterned data array feature corresponding to the patterned data array, the Euclidean distance between the phishing data array feature and the patterned data array feature is calculated to measure the similarity between the phishing data and the patterned data; and the Siamese Network is trained, using a loss function that refers to the similarity, so as to minimize the similarity between each of the phishing data and each of the patterned data.

The computing device of claim 11,
wherein, in process (I),
after each of the plurality of phishing data is input to the first CNN of the Siamese Network and each of the plurality of patterned data is input to the second CNN of the Siamese Network, with a weight (W) shared between the first CNN and the second CNN, the Siamese Network is trained so that (i) when a phishing data and a patterned data are determined to be the same data, the distance between the first feature output from the first CNN and the second feature output from the second CNN is less than or equal to a preset threshold, and (ii) when the phishing data and the patterned data are determined to be different data, the distance between the first feature and the second feature exceeds the preset threshold.

(Deleted)

The computing device of claim 11,
wherein, in process (I),
after the plurality of phishing data are acquired and the plurality of patterned data are obtained by patterning each of the plurality of phishing data based on the regular expression for each of the plurality of phishing data, when each of the plurality of phishing data and each of the plurality of patterned data are input to the Siamese Network, each first feature and each second feature are output as feature vectors by the Siamese Network, and the distance between the feature vectors is output by the Siamese Network, the Siamese Network is trained with reference to the distance.

A computing device for detecting phishing data using a Siamese Network and a KD-tree algorithm with reference to patterned data obtained through regular expressions for the phishing data, the computing device comprising:
at least one memory storing instructions; and
at least one processor configured to execute the instructions,
wherein the processor performs: (I) a process of, in a state in which a plurality of patterned data have been obtained by patterning each of a plurality of phishing data based on a regular expression for each of the plurality of phishing data, each of the plurality of phishing data has been input to a first CNN of the Siamese Network and each of the plurality of patterned data has been input to a second CNN of the Siamese Network, first features have been output by the first CNN and second features have been output by the second CNN, and the Siamese Network has been trained so that the distance between the first feature and the second feature is less than or equal to a preset threshold, when a KD-tree device is connected to the output terminal of the Siamese Network, expressing a target patterned data feature corresponding to target patterned data for target data on a KD-tree; (II) a process of, when new data is input to the first CNN of the Siamese Network and a new data feature is output, mapping the new data feature onto the KD-tree; and (III) a process of determining whether the new data corresponds to phishing data with reference to the distance between the new data feature corresponding to the new data and the target patterned data feature expressed on the KD-tree, and
wherein, in process (I),
when each of the phishing data and each of the patterned data are input to the first CNN and the second CNN of the Siamese Network, respectively, in a state in which a weight (W) and a bias b are shared between the first CNN and the second CNN, the bias b is added to each of phishing weight data and patterned weight data for each of the phishing data and each of the patterned data, and an activation function is then applied, so that a first feature vector for the first feature and a second feature vector for the second feature are output; the Euclidean distance between the first feature vector and the second feature vector is calculated to measure the similarity between the phishing data and the patterned data; and a training process is performed, using a loss function that refers to the similarity, in the direction of minimizing the similarity between each of the phishing data and each of the patterned data.

The computing device of claim 15,
wherein, in process (I),
in a state in which a loss value between the phishing data and the patterned data has been calculated using the loss function that refers to the similarity between the phishing data and the patterned data, backpropagation is performed to update the shared weight (W) and the bias b so that the loss value between the phishing data and the patterned data is minimized, thereby training the Siamese Network.

A computing device for detecting phishing data using a Siamese Network and a KD-tree algorithm with reference to patterned data obtained through regular expressions for the phishing data, the computing device comprising:
at least one memory storing instructions; and
at least one processor configured to execute the instructions,
wherein the processor performs: (I) a process of, in a state in which a plurality of patterned data have been obtained by patterning each of a plurality of phishing data based on a regular expression for each of the plurality of phishing data, each of the plurality of phishing data has been input to a first CNN of the Siamese Network and each of the plurality of patterned data has been input to a second CNN of the Siamese Network, first features have been output by the first CNN and second features have been output by the second CNN, and the Siamese Network has been trained so that the distance between the first feature and the second feature is less than or equal to a preset threshold, when a KD-tree device is connected to the output terminal of the Siamese Network, expressing a target patterned data feature corresponding to target patterned data for target data on a KD-tree; (II) a process of, when new data is input to the first CNN of the Siamese Network and a new data feature is output, mapping the new data feature onto the KD-tree; and (III) a process of determining whether the new data corresponds to phishing data with reference to the distance between the new data feature corresponding to the new data and the target patterned data feature expressed on the KD-tree, and
wherein, in process (I),
each normal partial data and each remaining data excluding the normal partial data are extracted from each of the phishing data based on each special character; each remaining data is divided into n abnormal partial data; each normal partial data is kept as is, and the n abnormal partial data are replaced with 1_1-st patterned data to 1_n-th patterned data based on the regular expression for each of the abnormal partial data; and each of the patterned data is generated so as to include the normal partial data and the 1_1-st patterned data to the 1_n-th patterned data.

The computing device of claim 17,
wherein, in process (I),
in generating each of the patterned data from each of the phishing data, each normal partial data and the n abnormal partial data are extracted from each of the phishing data based on each special character, and patterning is performed by determining to which of the following patterns each of the n abnormal partial data corresponds: a first pattern corresponding to integers excluding the normal domain; a second pattern corresponding to integers and characters excluding the normal domain; a third pattern corresponding to the normal domain and integers; a fourth pattern corresponding to the normal domain, integers, and characters; a fifth pattern corresponding to the normal domain and characters; and a sixth pattern corresponding to characters excluding the normal domain.

A computing device for detecting phishing data using a Siamese Network and a KD-tree algorithm with reference to patterned data obtained through regular expressions for the phishing data, the computing device comprising:
at least one memory storing instructions; and
at least one processor configured to execute the instructions,
wherein the processor performs: (I) a process of, in a state in which a plurality of patterned data have been obtained by patterning each of a plurality of phishing data based on a regular expression for each of the plurality of phishing data, each of the plurality of phishing data has been input to a first CNN of the Siamese Network and each of the plurality of patterned data has been input to a second CNN of the Siamese Network, first features have been output by the first CNN and second features have been output by the second CNN, and the Siamese Network has been trained so that the distance between the first feature and the second feature is less than or equal to a preset threshold, when a KD-tree device is connected to the output terminal of the Siamese Network, expressing a target patterned data feature corresponding to target patterned data for target data on a KD-tree; (II) a process of, when new data is input to the first CNN of the Siamese Network and a new data feature is output, mapping the new data feature onto the KD-tree; and (III) a process of determining whether the new data corresponds to phishing data with reference to the distance between the new data feature corresponding to the new data and the target patterned data feature expressed on the KD-tree,
wherein, in process (II),
a plurality of the new data are input, and
wherein the processor,
in process (III),
determines, with reference to the distance between each new data feature corresponding to each of the new data and each target patterned data feature expressed on the KD-tree, into which of at least one normal domain each new data feature is clustered.

The computing device of claim 19,
wherein the processor,
in process (III),
in a state in which the distribution of each new data feature corresponding to each of the new data and each target patterned data feature expressed on the KD-tree has been converted into two dimensions, when a new data feature is similar to one of the normal domains, clusters that new data feature into at least one specific normal domain, among the normal domains, corresponding to that new data feature.