KR20220158533A

KR20220158533A - Malicious site detection method

Info

Publication number: KR20220158533A
Application number: KR1020210066481A
Authority: KR
Inventors: 주형돈; 박범수; 손성훈; 이용재; 임호문; 정이우; 조정인; 최강석
Original assignee: 주식회사 케이티
Priority date: 2021-05-24
Filing date: 2021-05-24
Publication date: 2022-12-01

Abstract

Disclosed is a method for detecting a malicious site. The method for detecting a malicious site according to the present invention comprises the steps of: collecting a training data set including a plurality of types of training data, and extracting features from the training data set; generating an artificial intelligence model by training an artificial neural network to classify normal and malicious using the features as input data; collecting web traffic in real time; obtaining a plurality of types of web collection data by using the web traffic, and extracting features from the plurality of types of web collection data; and providing the features extracted from the web collection data to the artificial intelligence model, and determining whether the web traffic is harmful based on a result value output by the artificial intelligence model. Accordingly, a suspected malicious domain can be automatically classified through the artificial intelligence model.

Description

Malicious site detection method {MALICIOUS SITE DETECTION METHOD}

The present invention relates to a malicious site detection method that extracts features from web collection data and inputs the extracted features to an artificial intelligence model to determine whether web traffic is harmful.

In general, harmful site detection has relied on event analysis by a security device or security system: when an event occurs, it is analyzed against patterns or thresholds matched to a predefined security policy.

With such techniques, it is difficult to detect malicious sites preemptively, before a security incident occurs, and it is difficult to detect harmful sites from web packets and website information alone, without file analysis.

The present invention is intended to solve the above problems. An object of the present invention is to provide a malicious site detection method that extracts features from web collection data and inputs the extracted features to an artificial intelligence model to determine whether web traffic is harmful.

A malicious site detection method according to the present invention includes: collecting a training data set including a plurality of types of training data, and extracting features from the training data set; generating an artificial intelligence model by training an artificial neural network to classify normal and malicious using the features as input data; collecting web traffic in real time; obtaining the plurality of types of web collection data using the web traffic, and extracting features from the plurality of types of web collection data; and providing the features extracted from the web collection data to the artificial intelligence model, and determining whether the web traffic is harmful based on a result value output by the artificial intelligence model.

In this case, the plurality of types of web collection data may include a URL, a URI, HTML, JavaScript, and a host.

In this case, the features extracted from the plurality of types of web collection data may include length information of a component constituting the web collection data, count information of the components, whether a component is in a specific language, the number of times a component exhibits a specific form, the entropy of a component, the number of tags, whether a specific value exists, whether a specific word exists, and the number of functions.
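As an illustration of the length, count, entropy, and specific-word features listed above, the following is a minimal Python sketch. The function names, the chosen subset of features, and the word "login" used as a sensitive word are assumptions made for this example; the specification does not prescribe an implementation.

```python
import math
from collections import Counter
from urllib.parse import urlsplit


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def url_features(url: str) -> dict:
    """Compute a handful of illustrative lexical features for one URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),                        # length information
        "host_length": len(host),
        "dot_count": url.count("."),                   # count information
        "digit_count": sum(ch.isdigit() for ch in url),
        "host_token_count": len([t for t in host.split(".") if t]),
        "url_entropy": shannon_entropy(url),           # entropy of a component
        # Hypothetical sensitive word; the actual word list is not given.
        "has_login_word": int("login" in url.lower()),
    }
```

Each value is numeric, so the resulting dictionary could be flattened into a fixed-order vector before being passed to a model.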

Meanwhile, the method may further include, when the web traffic is collected, determining whether the collected web traffic is a target for feature extraction.

In this case, the determining of whether the collected web traffic is a target for feature extraction may be based on whether a file exists, whether the domain is a whitelisted domain, the form of the response code, and whether the content is an HTML document.

According to the present invention, web traffic is collected in real time, features are extracted so that AI analysis is possible, and suspected malicious domains can be automatically classified through an AI model.

In addition, according to the present invention, even when a file cannot be downloaded, a malicious site can be detected preemptively and quickly using only web packets or website data collected at the network level.

Harmful site information detected in this way can be applied to security services and security devices to prevent cyber security threats, allowing threat information to be acted on preemptively. Through linkage with a security platform, threats can be removed in advance so that customers can use Internet services with confidence. The present invention can also create new value in the information security business.

FIG. 1 is a block diagram illustrating a malicious site detection device according to the present invention.
FIG. 2 is a diagram illustrating a feature extraction method according to the present invention.
FIG. 3 is a diagram illustrating the feature extraction method according to the present invention in detail.
FIG. 4 is a diagram illustrating a method of checking in real time, using web traffic, whether the web traffic is normal, according to the present invention.

Hereinafter, the embodiments disclosed in this specification are described in detail with reference to the accompanying drawings. The same or similar components are given the same reference numerals regardless of the drawing, and redundant descriptions thereof are omitted. The suffixes "module" and "unit" used for components in the following description are assigned or used interchangeably only for ease of drafting the specification, and do not by themselves have distinct meanings or roles. In describing the embodiments disclosed in this specification, a detailed description of related known technology is omitted where it is judged that it would obscure the gist of the embodiments. The accompanying drawings are provided only to aid understanding of the embodiments disclosed in this specification; the technical idea disclosed in this specification is not limited by the accompanying drawings, and should be understood to cover all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention.

Terms including ordinal numbers, such as first and second, may be used to describe various components, but the components are not limited by these terms. The terms are used only to distinguish one component from another.

When a component is referred to as being "connected" or "coupled" to another component, it may be directly connected or coupled to the other component, but it should be understood that intervening components may be present. In contrast, when a component is referred to as being "directly connected" or "directly coupled" to another component, it should be understood that no intervening components are present.

Singular expressions include plural expressions unless the context clearly dictates otherwise. In this application, terms such as "comprise" or "have" are intended to specify the presence of a feature, number, step, operation, component, part, or combination thereof described in the specification, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In implementing the present invention, components may be subdivided for convenience of description, but these components may be implemented in a single device or module, or a single component may be implemented across multiple devices or modules.

FIG. 1 is a block diagram illustrating a malicious site detection device according to the present invention.

The malicious site detection device 100 according to the present invention may include a collection unit 110, a control unit 120, and a memory 130.

The collection unit 110 includes a communication circuit or communication module for communicating with an external device using wired or wireless communication technology, and may transmit data to and receive data from the external device.

Specifically, the collection unit 110 is connected to the network level and may collect, in real time, web traffic passing through the network level.

In addition, the collection unit 110 may collect data by connecting to phishing sites, the Korea Internet & Security Agency (KISA), telecommunications carriers, and the like.

The memory 130 may store programs or other data for the operation of the malicious site detection device.

The control unit 120 may control the overall operation of the malicious site detection device 100.

In addition, an artificial neural network or an artificial intelligence model is installed on the malicious site detection device 100, in which case one or more instructions constituting the artificial neural network or artificial intelligence model may be stored in the memory 130.

FIG. 2 is a diagram illustrating a feature extraction method according to the present invention.

First, in this specification, data collected to train the artificial neural network is referred to as training data.

The control unit 120 may collect a training data set through the collection unit 110.

Specifically, the training data set may include Uniform Resource Identifier (URI) data. The URI data may include public URI data and non-public URI data.

The public URI data may be data collected from phishing sites (OpenPhish, PhishTank, etc.).

The non-public URI data may be data held by the entity operating the malicious site detection device 100 (for example, a telecommunications carrier) or data held by the Korea Internet & Security Agency (KISA).

Meanwhile, a training data set may include a plurality of types of training data. Here, one training data set may include URI data and various types of data related to the URI data. For example, the plurality of types of training data may include a URL, a URI, HTML, JavaScript, and a host. In this case, the control unit 120 may obtain the plurality of types of training data by web crawling the collected URI data.

Meanwhile, the control unit 120 may obtain a correct answer value (label) corresponding to the training data set.

Specifically, when a specific training data set is obtained based on public URI data, the control unit 120 may determine that the label corresponding to the specific training data set is malicious.

Likewise, when a specific training data set is obtained based on non-public URI data held by the Korea Internet & Security Agency (KISA), the control unit 120 may determine that the label corresponding to the specific training data set is malicious.

Meanwhile, when a specific training data set is obtained based on non-public URI data held by the entity operating the malicious site detection device 100 (for example, a telecommunications carrier), the control unit 120 may obtain information on whether the data is normal or malicious through the Google Safe Browsing API. The control unit 120 may then associate this information with the training data set as a label.

Meanwhile, the control unit 120 may extract features from the training data set. In the present invention, 183 features were extracted: 96 URI-based features, 43 HTML-based features, 28 JavaScript-based features, and 16 host-based features.

FIG. 3 is a diagram illustrating the feature extraction method according to the present invention in detail.

The control unit 120 may collect public URI data and non-public URI data (S305).

Meanwhile, as described above, it is possible to first construct the training data set and then label it with the correct answer value, and it is also possible to first label the URI data with the correct answer value and then construct the training data set.

The control unit 120 may label the public URI data and the non-public URI data with correct answer values (S310).

Specifically, the control unit 120 may determine whether the collected URI data is data held by the entity operating the malicious site detection device 100 (for example, a telecommunications carrier) (S315). If so, the control unit 120 may call the Google Safe Browsing API (S320) to obtain information on whether the URI data is normal or malicious.

If the URI data is malicious (S325), the control unit 120 may label the URI data with the correct answer value "malicious" (S330).

On the other hand, if the URI data is normal (S325), the control unit 120 may label the URI data with the correct answer value "normal" (S335).

Meanwhile, if the collected URI data is not data held by the entity operating the malicious site detection device 100 (for example, a telecommunications carrier), the control unit 120 may determine whether the collected URI data is public data (S340).

If the collected URI data is public data, the control unit 120 may label the URI data with the correct answer value "malicious" (S330).

Meanwhile, if the collected URI data is not public data, the control unit 120 may determine whether the collected URI data is data held by the Korea Internet & Security Agency (S345).

If the collected URI data is data held by the Korea Internet & Security Agency, the control unit 120 may label the URI data with the correct answer value "malicious" (S330).
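The labeling flow of steps S315 to S345 can be sketched as a small decision function. This is an illustration only: the source tags ("operator", "public", "kisa") and the `lookup` callable are hypothetical names introduced here. `lookup` stands in for the reputation query of step S320, and no real API call is made.

```python
from typing import Callable, Optional


def label_uri(uri: str, source: str, lookup: Callable[[str], bool]) -> Optional[str]:
    """Return "malicious", "normal", or None for one collected URI.

    source: hypothetical tag for where the URI came from.
    lookup: hypothetical stand-in for the S320 query; returns True if malicious.
    """
    if source == "operator":                               # S315
        return "malicious" if lookup(uri) else "normal"    # S320, S325 -> S330 / S335
    if source == "public":                                 # S340
        return "malicious"                                 # S330
    if source == "kisa":                                   # S345
        return "malicious"                                 # S330
    # The flow described above does not cover other sources; leave unlabeled.
    return None


# Usage with a stub lookup that flags a single known-bad URI.
known_bad = {"http://bad.example/login"}
labels = [
    label_uri("http://bad.example/login", "operator", known_bad.__contains__),
    label_uri("http://ok.example/", "operator", known_bad.__contains__),
    label_uri("http://phish.example/", "public", known_bad.__contains__),
]
```

Note that only operator-held URIs can ever receive the "normal" label in this flow, since public and KISA data are treated as malicious by construction.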

Meanwhile, the control unit 120 may determine whether a training data set including a URI is a target for feature extraction (S350).

Specifically, the control unit 120 may determine whether the training data set is a target for feature extraction based on whether a file exists, whether the domain is a whitelisted domain, the form of the response code, and whether the content is an HTML document.

More specifically, the control unit 120 may extract features from the training data set (S360) when no file can be obtained using the URI data (that is, a file cannot be downloaded) (S351), the URI data does not belong to a whitelisted domain (S352), the response code is of the form 2XX (S353), and the URI data is an HTML document (S354).
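The four checks S351 to S354 amount to a single predicate. The sketch below is one possible reading, under stated assumptions: the `Response` record, its field names, the whitelist contents, and the use of the `Content-Type` header to decide whether the content is an HTML document are all hypothetical and not taken from the specification.

```python
from dataclasses import dataclass

# Hypothetical whitelist; the specification does not list its contents.
WHITELIST = {"example.com", "example.org"}


@dataclass
class Response:
    host: str           # host name taken from the URI
    status: int         # HTTP response code
    content_type: str   # value of the Content-Type header
    has_file: bool      # True if a downloadable file could be obtained


def is_extraction_target(resp: Response) -> bool:
    """Apply the checks S351 to S354 in order; all must pass."""
    if resp.has_file:                        # S351: a file is available
        return False
    if resp.host.lower() in WHITELIST:       # S352: whitelisted domain
        return False
    if not 200 <= resp.status <= 299:        # S353: response code is not 2XX
        return False
    # S354: treat a text/html Content-Type as "is an HTML document" (assumption)
    return resp.content_type.lower().startswith("text/html")
```

Ordering the checks from cheapest to most specific lets the predicate return early for most traffic, although the specification does not state that the order matters.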

The specific types of features are described in detail later.

Meanwhile, the control unit 120 may train the artificial neural network to classify normal and malicious, using the features as input data.

Specifically, an artificial neural network (neural network) models the operating principle of biological neurons and the connections between them, and may be an information processing system in which a number of neurons, called nodes or processing elements, are connected in a layered structure.

An artificial neural network may be trained using input data. Here, training may refer to the process of determining the parameters of the artificial neural network using input data in order to achieve a purpose such as classifying the input data, performing regression on it, or clustering it. Representative examples of the parameters of an artificial neural network are the weights assigned to synapses and the biases applied to neurons.

A loss function may be used as an index (criterion) for determining the optimal model parameters during training of the artificial neural network. In an artificial neural network, learning refers to the process of adjusting the parameters of the network (weights, biases, etc.) to reduce the loss function, and the purpose of learning can be viewed as determining the parameters that minimize the loss function.

The control unit 120 may train the artificial neural network based on a supervised learning algorithm.

In supervised learning, the artificial neural network is trained with labels given for the input data. Here, a label may mean the correct answer (or result value) that the artificial neural network should infer when the input data is input to it.

The control unit 120 may train the artificial neural network by providing the features extracted from a specific training data set to the artificial neural network as input data, and providing the normal or malicious information labeled on that training data set to the artificial neural network as the correct answer.

The control unit 120 may also train the artificial neural network based on the difference between the output produced for the input data (features) and the correct answer. That is, the control unit 120 may compute the loss function from this difference and adjust the parameters of the artificial neural network using the computed loss.

In addition, by repeatedly training the artificial neural network using features extracted from various training data sets and the correct answers corresponding to those training data sets, the control unit 120 may generate an artificial intelligence model in which the parameters of the artificial neural network are optimized.
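To make the supervised training loop concrete, the following sketch trains the smallest possible network, a single sigmoid neuron, by gradient descent. The architecture, the binary cross-entropy loss, the learning rate, and the toy data are all assumptions chosen for brevity; the specification names neither a network structure nor a particular loss function.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x) -> float:
    """Output of the neuron: estimated probability that x is malicious."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(features, labels, epochs=500, lr=0.5):
    """Fit weights and a bias by stochastic gradient descent.

    Assumes binary cross-entropy loss, whose gradient with respect to the
    pre-activation is simply (prediction - label).
    """
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            error = predict(weights, bias, x) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


# Toy data: two made-up features per sample; label 1 = malicious, 0 = normal.
X = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.8]]
y = [0, 0, 1, 1]
w, b = train(X, y)
```

After training, `predict(w, b, x)` can be thresholded (for example at 0.5) to obtain a normal or malicious decision. A practical model would use many more features, hidden layers, and a held-out validation set.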

FIG. 4 is a diagram illustrating a method of checking in real time, using web traffic, whether the web traffic is normal, according to the present invention.

The control unit 120 may be connected to the network level and collect, in real time, web traffic passing through the network level (S410). Here, the web traffic may include at least one of web packets and website data.

The control unit 120 may then determine whether the collected web traffic is a target for feature extraction (S420).

Specifically, the control unit 120 may determine whether the collected web traffic is a target for feature extraction (that is, a target whose normal or malicious status is to be determined using the artificial intelligence model) based on whether a file exists, whether the domain is a whitelisted domain, the form of the response code, and whether the content is an HTML document (S420).

More specifically, the control unit 120 may determine that the collected web traffic is a target for feature extraction (that is, a target whose normal or malicious status is to be determined using the artificial intelligence model) when no file can be obtained using the web traffic (that is, a file cannot be downloaded) (S421), the web traffic does not belong to a whitelisted domain (S422), the response code is of the form 2XX (S423), and the web traffic is an HTML document (S424) (S420).

Meanwhile, if the collected web traffic is a target for feature extraction, the control unit 120 may obtain a plurality of types of web collection data using the web traffic and extract features from the plurality of types of web collection data (S430).
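For the HTML portion of the web collection data, tag-count features can be computed with a streaming parser. The sketch below uses Python's standard `html.parser` module. The class name, the selected tags, and the output keys are assumptions for illustration and cover only a few of the HTML-based features.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags and <meta http-equiv="refresh"> occurrences."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()
        self.meta_refresh = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        self.counts[tag] += 1
        http_equiv = dict(attrs).get("http-equiv") or ""
        if tag == "meta" and http_equiv.lower() == "refresh":
            self.meta_refresh += 1


def html_features(html: str) -> dict:
    """Return a few illustrative HTML-based features for one document."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return {
        "iframe_count": parser.counts["iframe"],
        "script_count": parser.counts["script"],
        "form_count": parser.counts["form"],
        "input_count": parser.counts["input"],
        "meta_refresh_count": parser.meta_refresh,
        "space_count": html.count(" "),   # space characters only
    }
```

Because the parser tolerates malformed markup, this approach also works on the broken HTML that is common on hastily built pages.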

Here, the plurality of types of web collection data may include a URL, a URI, HTML, JavaScript, and host information, and may be obtained using web packets. The control unit 120 may also obtain domain information. For example, the control unit 120 may extract a URI from a web packet and access the extracted URI to obtain the HTML, JavaScript information, and the like of the corresponding web page. As another example, the control unit 120 may extract a URI from a web packet and query a service such as Whois with the extracted URI to obtain domain information (when the domain was created, the time elapsed since its creation, etc.).
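Once a URI has been extracted from a web packet, it can be split into the components that the URI-based features refer to (hostname, directory, file name, extension, query). The sketch below uses Python's standard `urllib.parse`. The function names and output keys are assumptions, and the Whois query itself is not implemented: the domain creation date is passed in as if it had already been obtained.

```python
import posixpath
from datetime import date
from urllib.parse import parse_qsl, urlsplit


def split_uri(uri: str) -> dict:
    """Split a URI into the components used by URI-based features."""
    parts = urlsplit(uri)
    directory, filename = posixpath.split(parts.path)
    _, extension = posixpath.splitext(filename)
    return {
        "scheme": parts.scheme,
        "hostname": parts.hostname or "",   # without the port number
        "port": parts.port,                 # None if no port is given
        "directory": directory,
        "filename": filename,
        "extension": extension.lstrip("."),
        "query_params": parse_qsl(parts.query, keep_blank_values=True),
    }


def domain_age_days(created: date, today: date) -> int:
    """Elapsed days since domain creation (creation date assumed known)."""
    return (today - created).days
```

For example, `split_uri("http://www.example.com:8080/a/b/index.php?x=1&y=")` yields the hostname `www.example.com`, port `8080`, directory `/a/b`, file name `index.php`, and two query parameters.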

In the present invention, 183 features are given as examples: 96 URI-based features, 43 HTML-based features, 28 JavaScript-based features, and 16 host-based features. Examples of the features are shown in Table 1.

Category  Description
URL       URL length excluding the protocol
URL       Number of '-' in the domain
URL       Number of domain tokens
URL       Length of the URL path
URL       Length of the URL filename
URL       Length of the longest domain token
URL       Average token length of the domain tokens
URL       Whether the TLD of the URL is a valid TLD
URL       Length of the domain including the port number (netloc)
URL       Length of the URL hostname excluding the port number (hostname)
URL       Number of dots (.) in the URL
URL       Number of underscores (_) in the URL
URL       Number of equals signs (=) in the URL
URL       Number of slashes (/) in the URL (including the two in 'http://')
URL       Number of dashes (-) in the URL
URL       Number of semicolons (;) in the URL
URL       Number of at signs (@) in the URL
URL       Number of percent signs (%) in the URL
URL       Number of plus signs (+) in the URL
URL       Length of the query in the URL
URL       Number of query parameters in the URL
URL       Whether the URL contains an IP address (IPv4/IPv6)
URL       URL complexity (Shannon entropy)
URL       Number of consonants in the URL
URL       Number of digits (0-9) in the URL
URL       Whether the URL contains Chinese
URL       Whether the URL port is a valid port (None, 80, 443, 8080, 8443)
URL_Path  Number of path tokens
URL_Path  Length of the longest path token
URL_Path  Length of the shortest path token
URL_Path  Number of consonants in the path
URL_Path  Average token length of the path tokens
URI       Length of the URL TLD
URI       Vowels [a i e o u] / total letters [a-z]
URI       Consonants / total letters [a-z]
URI       Number of ldl forms in the URL
URI       Number of ldl forms in the domain
URI       Number of ldl forms in the path
URI       Number of ldl forms in the file name
URI       Number of ldl forms in Arg
URI       Number of dld forms in the URL
URI       Number of dld forms in the domain
URI       Number of dld forms in the path
URI       Number of dld forms in the file name
URI       Number of dld forms in Arg
URI       Length of the subdirectory
URI       Length of fileName
URI       Length of the file extension
URI       Length of Arg
URI       path / url
URI       arg / url
URI       arg / domain
URI       domain / url
URI       path / domain
URI       arg / path
URI       Whether it is an executable file
URI       Whether port 80 is used
URI       Continuity ratio (sum of the longest runs of the letter, digit, and special-character types, divided by the total length)
URI       Value of the longest variable
URI       Number of digits in the hostname
URI       Number of digits in the directory
URI       Number of digits in the filename
URI       Number of digits in the extension
URI       Number of digits in the query
URI       Number of letters in the URL
URI       Number of letters in the host
URI       Number of letters in the directory
URI       Number of letters in the filename
URI       Number of letters in the extension
URI       Number of letters in the query
URI       Length of the longest word among the subdirectories
URI       Length of the longest word among the arguments
URI       Number of sensitive words in the URL
URI       Number of variables in the query
URI       Number of special characters in the URL
URI       Number of delimiters in the domain
URI       Number of delimiters in the path
URI       Number of delimiters in the full URL
URI       Digits / total characters in the URL
URI       Digits / total characters in the domain
URI       Digits / total characters in DirectoryName
URI       Digits / total characters in the filename
URI       Digits / total characters in the extension
URI       Digits / total characters in AfterPath
URI       Number of symbols in the full URL
URI       Number of symbols in the full domain
URI       Number of symbols in the directory name
URI       Number of symbols in the file name
URI       Number of symbols in the file extension
URI       Number of symbols in afterpath
URI       Entropy of the URL
URI       Entropy of the domain
URI       Entropy of directoryname
URI       Entropy of the filename
URI       Entropy of the extension
URI       Entropy of afterpath
HTML      Number of <iFrame> tags in the HTML
HTML      Number of <script> tags in the HTML
HTML      Number of <embed> tags in the HTML
HTML      Number of <object> tags in the HTML
HTML      Number of <div> tags in the HTML
HTML      Number of <head> tags in the HTML
HTML      Number of <body> tags in the HTML
HTML      Number of <form> tags in the HTML
HTML      Number of <a> tags in the HTML
HTML      Number of <small> tags in the HTML
HTML      Number of <span> tags in the HTML
HTML      Number of <input> tags in the HTML
HTML      Number of <applet> tags in the HTML
HTML      Number of <img> tags in the HTML
HTML      Number of <video> tags in the HTML
HTML      Number of <audio> tags in the HTML
HTML      Number of <meta> tags with the 'refresh' attribute in the HTML, e.g. <meta http-equiv="refresh" …
HTML      Length of <script>...</script> in the HTML (excluding external links)
HTML      Number of spaces in the HTML (space characters only)
HTML      Number of tags in the HTML that refer to external addresses (the link does not contain the page's own hostname and the href attribute refers to 'http' or 'ftp')
HTML      Length of the HTML body document
HTML      Number of tags with href, src attributes in script
in script HTMLHTML html 에서 href 속성을 가진 태그 수The number of tags in html with the href attribute HTMLHTML html value에서 공백 기준 토큰화한 단어들의 수The number of tokenized words based on space in the html value. HTMLHTML html value에서 공백 기준 토큰화한 단어들의 수 (중복제거)Number of tokenized words based on space in html value (remove duplicates) HTMLHTML html 에서 라인 수(중복제거)Number of lines in html (remove duplicates) HTMLHTML html 에서 단어 길이의 평균 (총 단어 길이 /총 단어 수)Average word length in html (total word length / total number of words) HTMLHTML html 에서 "log"라는 단어 검사Check the word "log" in html HTMLHTML html 에서 "pay"라는 단어 검사Check the word "pay" in html HTMLHTML html 에서 "free"라는 단어 검사Check for the word "free" in html HTMLHTML html 에서 "access"라는 단어 검사Check the word "access" in html HTMLHTML html 에서 "bonus"라는 단어 검사Check the word "bonus" in html HTMLHTML html 에서 "click"라는 단어 검사Check for the word "click" in html HTMLHTML html 에서 dots(.)의 수Number of dots(.) in html HTMLHTML html 에서 hyphens(-)의 수Number of hyphens(-) in html HTMLHTML html 에서 underscores(_)의 수Number of underscores(_) in html HTMLHTML html 에서 equals(=)의 수Number of equals(=) in html HTMLHTML html 에서 slashes(/)의 수The number of slashes (/) in html HTMLHTML html 에서 hashes(#)의 수Number of hashes (#) in html HTMLHTML html 에서 at(@)의 수number of at(@) in html HTMLHTML html 에서 dollar($)의 수number of dollars($) in html HTMLHTML html 에 <title> 태그가 있는지 검사Check if <title> tag exists in html HTMLHTML html 이 default webpage인지 검사 (apache, nginx 등)Check if html is the default webpage (apache, nginx, etc.) 
JavascriptJavascript script 에 eval() 함수의 개수Number of eval() functions in script JavascriptJavascript script 에 setTimeout() 함수의 개수The number of setTimeout() functions in script JavascriptJavascript script 에 unescape() 함수의 개수The number of unescape() functions in script JavascriptJavascript script 에 escape() 함수의 개수Number of escape() functions in script JavascriptJavascript script 에 parseInt() 함수의 개수The number of parseInt() functions in script JavascriptJavascript script 에 fromCharCode() 함수의 개수The number of fromCharCode() functions in script JavascriptJavascript script 에 ActiveXObject() 함수의 개수Number of ActiveXObject() functions in script JavascriptJavascript script 에 concat() 함수의 개수The number of concat() functions in script JavascriptJavascript script 에 indexOf() 함수의 개수The number of indexOf() functions in script JavascriptJavascript script 에 substring() 함수의 개수The number of substring() functions in script JavascriptJavascript script 에 replace() 함수의 개수The number of replace() functions in script JavascriptJavascript script 에 documentaddEventListener() 함수의 개수The number of documentaddEventListener() functions in script JavascriptJavascript script 에 createElement() 함수의 개수The number of createElement() functions in script JavascriptJavascript script 에 getElementById() 함수의 개수The number of getElementById() functions in script JavascriptJavascript script 에 attachEvent() 함수의 개수The number of attachEvent() functions in script JavascriptJavascript script 에 documentwrite() 함수의 개수The number of documentwrite() functions in script JavascriptJavascript script 문서 길이script document length JavascriptJavascript script 에서 중복 제거한 단어 수Number of deduplicated words in script JavascriptJavascript script 에서 라인 수(중복 제거)Number of lines in script (remove duplicates) JavascriptJavascript script 에서 평균 단어 길이 (총 단어 길이/총 단어 수)Average word length in script (total word length / total number of words) JavascriptJavascript script 에서 dots(.)의 수Number of dots(.) 
in script JavascriptJavascript script 에서 hyphens(-)의 수Number of hyphens(-) in script JavascriptJavascript script 에서 underscores(_)의 수Number of underscores(_) in script JavascriptJavascript script 에서 equals(=)의 수Number of equals(=) in script JavascriptJavascript script 에서 slashes(/)의 수The number of slashes (/) in script JavascriptJavascript script 에서 hashes(#)의 수The number of hashes (#) in script JavascriptJavascript script 에서 at(@)의 수The number of at(@) in script JavascriptJavascript script 에서 dollar($)의 수The number of dollars($) in script HostHost 도메인 생성 이 후 period (Month 기준)period after domain creation (Based on Month) HostHost 도메인 남은 만료일 (Month 기준)Domain remaining expiry date (by month) HostHost 도메인 정보 최신 업데이트 이 후 period(Day 기준)Period after the latest update of domain information (Based on Day) HostHost update date 값 존재 여부Whether update date value exists HostHost create date 값 존재 여부Whether create date value exists HostHost expiration date 값 존재 여부Existence of expiration date value HostHost zipcode 값 존재 여부Whether zipcode value exists HostHost orgraniztion 값 존재 여부whether the orgraniztion value exists HostHost dnssec 서명 여부whether dnssec is signed HostHost city 값 존재 여부Whether city value exists HostHost state 값 존재 여부Whether state value exists HostHost country 값 존재 여부Whether the country value exists HostHost whois 서버 등록 여부Whether to register whois server HostHost referral url 값 존재 여부Whether referral url value exists HostHost email 값 존재 여부Whether email value exists HostHost Alexa 순위 (100,000등 기준)Alexa ranking (by 100,000)

As one example, the features extracted from the plurality of types of web collection data may include length information of a component constituting the web collection data. For example, the features may include the length of a component (the URL excluding the protocol) of the first type (URL) of web collection data. As another example, the features may include the length of a component (fileName) of the second type (URI) of web collection data.
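
The length features described above can be sketched as follows. This is only an illustration using Python's standard `urllib.parse`; the function name `url_length_features` and the dictionary keys are assumptions made for this example and do not appear in the specification.

```python
import posixpath
from urllib.parse import urlsplit


def url_length_features(url: str) -> dict:
    """Return a few length-based features for one URL (illustrative names)."""
    parts = urlsplit(url)
    # Strip the leading "scheme://" so the protocol does not count toward the length.
    prefix = parts.scheme + "://"
    if parts.scheme and url[: len(prefix)].lower() == prefix:
        without_protocol = url[len(prefix):]
    else:
        without_protocol = url
    # The filename is taken as the last segment of the path (an assumption).
    filename = posixpath.basename(parts.path)
    return {
        "url_len_no_protocol": len(without_protocol),
        "path_len": len(parts.path),
        "filename_len": len(filename),
        "netloc_len": len(parts.netloc),            # includes the port number
        "hostname_len": len(parts.hostname or ""),  # excludes the port number
    }


features = url_length_features("http://example.com:8080/a/b/file.php?x=1")
```

Here `netloc` and `hostname` correspond to the "including port number" and "excluding port number" rows of the table, respectively.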

As another example, the features extracted from the plurality of types of web collection data may include information on the number of components constituting the web collection data. For example, the features may include the number of a component (equals signs (=)) of the first type (URL) of web collection data. As another example, the features may include the number of a component (deduplicated words in the script) of the fourth type (JavaScript) of web collection data.

As another example, the features extracted from the plurality of types of web collection data may include whether a component is in a specific language. For example, the features may include whether a component of the first type (URL) of web collection data is in Chinese.

As another example, the features extracted from the plurality of types of web collection data may include the number of times a component exhibits a specific form. For example, the features may include the number of times a component of the second type (URI) of web collection data exhibits the dld form.
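
The specification does not define the ldl and dld forms. The sketch below assumes, purely for illustration, that "ldl" means a letter-digit-letter sequence and "dld" a digit-letter-digit sequence, and counts overlapping occurrences with a lookahead regular expression. The names `LDL`, `DLD`, and `count_form` are hypothetical.

```python
import re

# Assumed meanings: ldl = letter-digit-letter, dld = digit-letter-digit.
# The zero-width lookahead lets overlapping occurrences be counted.
LDL = re.compile(r"(?=[A-Za-z][0-9][A-Za-z])")
DLD = re.compile(r"(?=[0-9][A-Za-z][0-9])")


def count_form(pattern: re.Pattern, text: str) -> int:
    """Count how many positions in text start an occurrence of the pattern."""
    return sum(1 for _ in pattern.finditer(text))


ldl_in_domain = count_form(LDL, "a1b2c.example")
dld_in_domain = count_form(DLD, "a1b2c.example")
```

The same helper could be applied separately to the URL, domain, path, filename, and Arg components to obtain the per-component counts listed in the table.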

As another example, the features extracted from the plurality of types of web collection data may include the entropy of a component. For example, the features may include the entropy of the domain in the second type (URI) of web collection data.
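
As a sketch of the entropy feature, the following computes character-level Shannon entropy, H = -Σ p(c) log2 p(c), over a string such as a domain. The table names Shannon entropy for the URL complexity feature; applying the same character-level definition to the domain, and the function name `shannon_entropy`, are assumptions of this example.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy of text, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    # p(c) is the relative frequency of character c in the string.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


domain_entropy = shannon_entropy("example.com")
```

A string made of one repeated character has entropy 0, while a string whose characters are all distinct reaches the maximum of log2(length), so randomly generated domain names tend to score higher than dictionary-like ones.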

As another example, the features extracted from the plurality of types of web collection data may include the number of tags. For example, the features may include the number of a component (<body> tags) of the third type (HTML) of web collection data. As another example, the features may include the number of a component (tags with an href attribute) of the third type (HTML) of web collection data.
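
The tag-count features can be illustrated with Python's standard `html.parser`. This is a minimal sketch, not the parser used in the specification; the class `TagCounter`, the function `html_tag_features`, and the returned keys are hypothetical names.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags by name and tags that carry an href attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.tags = Counter()
        self.href_tags = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        self.tags[tag] += 1
        if any(name == "href" for name, _ in attrs):
            self.href_tags += 1


def html_tag_features(html: str) -> dict:
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return {
        "body_tags": parser.tags["body"],
        "iframe_tags": parser.tags["iframe"],
        "script_tags": parser.tags["script"],
        "href_tags": parser.href_tags,
    }


page = '<html><body><a href="http://x.test">x</a><iframe src="y"></iframe></body></html>'
tag_features = html_tag_features(page)
```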

As another example, the features extracted from the plurality of types of web collection data may include whether a specific value exists. For example, the features may include whether a specific value (the state value) exists in the fifth type (Host) of web collection data.
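
A presence feature of this kind reduces to a 0/1 flag per field. The sketch below assumes the Host data has already been parsed into a dictionary; the field names in `HOST_FIELDS` and the function `presence_features` are hypothetical and are not taken from any particular whois library.

```python
# Hypothetical field names for an already-parsed Host (whois) record.
HOST_FIELDS = ("update_date", "create_date", "expiration_date",
               "zipcode", "organization", "city", "state", "country", "email")


def presence_features(record: dict) -> dict:
    """Map each field to 1 if it has a non-empty value, else 0."""
    return {f"has_{name}": int(bool(record.get(name))) for name in HOST_FIELDS}


flags = presence_features({"state": "CA", "country": "", "create_date": "2020-01-01"})
```

Treating an empty string the same as a missing key is a design choice of this example; a record that lists a field with no value is counted as absent.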

As another example, the features extracted from the plurality of types of web collection data may include the number of functions. For example, the features may include the number of eval() functions in the fourth type (JavaScript) of web collection data.
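
One simple way to approximate the function-count features is a textual match on `name(`. The sketch below is an assumption about how such counting could be done, not the method of the specification; it does not parse JavaScript, so occurrences inside strings or comments are also counted. `JS_FUNCTIONS` and `count_js_calls` are hypothetical names.

```python
import re

# A subset of the function names listed in the feature table.
JS_FUNCTIONS = ("eval", "setTimeout", "unescape", "escape", "fromCharCode")


def count_js_calls(script: str) -> dict:
    """Count textual occurrences of `name(` for each listed function name."""
    counts = {}
    for name in JS_FUNCTIONS:
        # The negative lookbehind keeps "unescape(" from also counting as "escape(".
        pattern = r"(?<![A-Za-z0-9_$])" + re.escape(name) + r"\s*\("
        counts[name] = len(re.findall(pattern, script))
    return counts


sample = 'eval(unescape("%61")); setTimeout(function(){ eval("x") }, 10);'
call_counts = count_js_calls(sample)
```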

As another example, the features extracted from the plurality of types of web collection data may include whether a specific word exists. For example, the features may include whether a specific word ("free") exists in the third type (HTML) of web collection data.

Meanwhile, once the features are extracted, AI analysis based on the extracted features may be performed (S440).

Specifically, the control unit 120 may provide the features extracted from the web collection data to the artificial intelligence model, and may determine whether the web traffic is harmful based on the result value output by the artificial intelligence model.

More specifically, when the features extracted from the web collection data are input, the artificial intelligence model may output a result value corresponding to the input features. In this case, the artificial intelligence model may output the class (normal or malicious) corresponding to the input features together with the probability corresponding to that class.

When the probability of the malicious class is output as a value equal to or greater than a preset value (for example, 0.5), the control unit 120 may determine that the input features are malicious (S450). In addition, when the input features were extracted from the plurality of types of web collection data, it may be determined that the original web traffic used to collect the plurality of types of web collection data is traffic to a harmful site. Accordingly, the control unit 120 may determine that the web traffic collected in real time is traffic to a harmful site (S460), and may transmit a harmful site notice to the corresponding user.

Meanwhile, when the probability of the normal class is output as a value exceeding a preset value (for example, 0.5), the control unit 120 may determine that the input features are normal (S450). In addition, when the input features were extracted from the plurality of types of web collection data, the control unit 120 may determine that the original web traffic used to collect the plurality of types of web collection data is traffic to a normal site. Accordingly, the control unit 120 may determine that the web traffic collected in real time is traffic to a normal site (S470).
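
The threshold decision in steps S450 to S470 can be sketched as below. The sketch assumes a binary model whose two class probabilities sum to 1, so a malicious probability at or above the preset value and a normal probability above it are complementary cases. The function name `decide` and the returned labels are illustrative only.

```python
def decide(p_malicious: float, threshold: float = 0.5) -> str:
    """Return "malicious" or "normal" from the model's malicious probability."""
    if not 0.0 <= p_malicious <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    # Malicious: probability is equal to or greater than the preset value.
    # Otherwise the normal probability (1 - p_malicious) exceeds the preset value.
    return "malicious" if p_malicious >= threshold else "normal"


verdict = decide(0.73)
```

With the example threshold of 0.5, a probability of exactly 0.5 is classified as malicious, which follows the "equal to or greater than" wording used for the malicious case.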

The present invention described above can be implemented as computer-readable code on a medium on which a program is recorded. The computer-readable medium includes all types of recording devices that store data readable by a computer system. Examples of the computer-readable medium include an HDD (Hard Disk Drive), an SSD (Solid State Disk), an SDD (Silicon Disk Drive), ROM, RAM, CD-ROM, magnetic tape, a floppy disk, and an optical data storage device. The computer may also include a control unit. Accordingly, the above detailed description should not be construed as limiting in any respect and should be considered illustrative. The scope of the present invention should be determined by a reasonable interpretation of the appended claims, and all changes within the equivalent scope of the present invention are included in the scope of the present invention.

100: malicious site detection device 110: collection unit
120: control unit 130: memory

Claims

A malicious site detection method comprising:
collecting a training data set including a plurality of types of training data, and extracting features from the training data set;
generating an artificial intelligence model by training an artificial neural network to classify normal and malicious using the features as input data;
collecting web traffic in real time;
obtaining a plurality of types of web collection data using the web traffic, and extracting features from the plurality of types of web collection data; and
providing the features extracted from the web collection data to the artificial intelligence model, and determining whether the web traffic is harmful based on a result value output by the artificial intelligence model.

The malicious site detection method of claim 1,
wherein the plurality of types of web collection data
include URL, URI, HTML, JavaScript, and Host.

The malicious site detection method of claim 2,
wherein the features extracted from the plurality of types of web collection data include:
length information of a component constituting the web collection data, information on the number of components, whether a component is in a specific language, the number of times a component exhibits a specific form, the entropy of a component, the number of tags, whether a specific value exists, whether a specific word exists, and the number of functions.

The malicious site detection method of claim 1, further comprising:
when the web traffic is collected, determining whether the collected web traffic is a feature extraction target.

The malicious site detection method of claim 4,
wherein the determining of whether the collected web traffic is a feature extraction target comprises:
determining whether the collected web traffic is a feature extraction target based on whether a file exists, whether the traffic corresponds to a whitelisted domain, the type of response code, and whether it is an HTML document.