KR20210085333A

KR20210085333A - Adaptive method, device, computer-readable storage medium and computer program for detecting malware based on machine learning

Info

Publication number: KR20210085333A
Application number: KR1020190178263A
Authority: KR
Inventors: 전제민; 황용석; 김원혁
Original assignee: 주식회사 안랩
Priority date: 2019-12-30
Filing date: 2019-12-30
Publication date: 2021-07-08
Also published as: KR102325293B1

Abstract

According to various embodiments, provided is an adaptive method for detecting a malicious code based on machine learning by a device using a model trained to detect a malicious code. The method includes: an operation of generating a label corresponding to one or more data of the device; an operation of collecting one or more additional data infected with the malicious code; an operation of performing additional training on the model based on the generated label and the collected additional data; and an operation of performing performance evaluation of the additionally trained model.

Description

ADAPTIVE METHOD, DEVICE, COMPUTER-READABLE STORAGE MEDIUM AND COMPUTER PROGRAM FOR DETECTING MALWARE BASED ON MACHINE LEARNING

The present invention relates to an adaptive method, device, computer-readable recording medium, and computer program for detecting malicious code based on machine learning.

Devices such as computers and servers used in most homes, schools, and/or businesses apply a security technology that detects malicious code using a distributed machine learning model and blocks the detected malicious code.

Conventionally, the detection performance for malicious code has degraded because the distribution of the data used to train the machine learning model differs from the distribution of the data in the actual environment of the device that uses the model.

Korean Registered Patent No. 10-2021138, registered on September 5, 2019

Accordingly, to compensate for this difference in data distribution, an adaptive method, device, computer-readable recording medium, and computer program for detecting malicious code based on machine learning can be provided that apply a learning technique performing additional fine tuning of the machine learning model using real data (transfer learning; hereinafter referred to as transfer learning), so that the machine learning model adapts to the changed data distribution.

For example, by training the machine learning model to adapt to the data distribution of the device that applies the model, the difference between the distribution of the data used to train the machine learning model and the distribution of the data in the actual environment of the device using the model can be minimized.

The problems to be solved by the present invention are not limited to those mentioned above, and other problems not mentioned will be clearly understood from the following description by those of ordinary skill in the art to which the present invention pertains.

According to an embodiment, an adaptive method for detecting malicious code based on machine learning, performed by a device using a model trained to detect malicious code, may include: an operation of generating a label corresponding to one or more data of the device; an operation of collecting one or more additional data infected with the malicious code; an operation of performing additional training on the model based on the generated label and the collected additional data; and an operation of performing performance evaluation of the additionally trained model.

According to an embodiment, the method further includes an operation of checking, based on the model trained to detect the malicious code, whether each of the one or more data that has been classified as normal data not infected with the malicious code by an antivirus program for detecting the malicious code, or that is pre-stored and classified as the normal data, is misclassified as data infected with the malicious code. The operation of generating the label for the one or more data of the device may be performed based on the number of data misclassified as data infected with the malicious code being greater than or equal to a preset threshold value.

According to an embodiment, the method further includes an operation of checking, based on the model trained to detect the malicious code, a prediction probability for the prediction of whether each of the one or more data is data infected with the malicious code or normal data not infected with the malicious code. The operation of generating the label for the one or more data of the device may be performed based on the prediction probability being less than a preset threshold value.

According to an embodiment, the operation of generating the label corresponding to the one or more data of the device may include an operation of generating, based on the one or more data having been confirmed by an antivirus program as normal data not infected with the malicious code, a label corresponding to the normal data as the label of the one or more data.

According to an embodiment, the method further includes: an operation of checking whether the one or more data is at least one of (i) data for which an antivirus program has not generated a label indicating whether the data is infected with the malicious code or is normal data not infected with the malicious code, (ii) data whose prediction probability based on the model trained to detect the malicious code is higher than a specified threshold value, and (iii) data predicted to be infected with the malicious code based on the model trained to detect the malicious code; an operation of outputting, using a display unit of the device, information requesting the user to input a label, based on the one or more data being confirmed to be the at least one data; and an operation of receiving a user input in response to the output of the information using the display unit of the device. The operation of generating the label corresponding to the one or more data of the device may be performed based on the user input.

According to an embodiment, the one or more additional data infected with the malicious code may be collected based on at least one of: an operation of receiving the one or more additional data infected with the malicious code from a server; and an operation of generating the one or more additional data, using a generative model, based on characteristics of data that is infected with malicious code a specified number of times or more.

According to an embodiment, the method further includes an operation of randomly dividing a plurality of data into a training data set and a validation data set, the plurality of data including the one or more data that is normal data not infected with the malicious code as predicted based on the generated label and the collected one or more additional data. The operation of performing performance evaluation of the additionally trained model may be performed using the validation data set.

According to an embodiment, the method may further include an operation of transmitting information corresponding to the additional training result based on the result of the performance evaluation.

According to an embodiment, the data used to train the model trained to detect the malicious code and the one or more data may be different.

According to an embodiment, the method may further include an operation of providing a reward in a specified manner based on the user's contribution, which is determined according to the performance evaluation of the additionally trained model.

According to an embodiment, a computer-readable recording medium storing a computer program may include instructions for causing a processor to perform a method including: an operation of generating a label corresponding to one or more data of the device; an operation of collecting one or more additional data infected with the malicious code; an operation of performing additional training on the model based on the generated label and the collected additional data; and an operation of performing performance evaluation of the additionally trained model.

According to an embodiment, a computer program stored in a computer-readable recording medium may include instructions for causing a processor to perform a method including: an operation of generating a label corresponding to one or more data of the device; an operation of collecting one or more additional data infected with the malicious code; an operation of performing additional training on the model based on the generated label and the collected additional data; and an operation of performing performance evaluation of the additionally trained model.

According to an embodiment, an adaptive device for detecting malicious code based on machine learning includes: a processor; and a memory electrically connected to the processor. The memory is configured to store one or more data and a model trained to detect malicious code, and may include instructions that, when executed, cause the processor to generate a label corresponding to the one or more data, collect one or more additional data infected with the malicious code, perform additional training on the model based on the generated label and the collected additional data, and perform performance evaluation of the additionally trained model.

Accordingly, the adaptive method, device, computer-readable recording medium, and computer program for detecting malicious code based on machine learning according to an embodiment of the present invention apply a learning technique that performs additional fine tuning of the machine learning model using real data (transfer learning; hereinafter referred to as transfer learning), so that the machine learning model can adapt to the changed data distribution.

For example, by training the machine learning model to adapt to the data distribution of the device that applies the model (local learning), the difference between the distribution of the data used to train the machine learning model and the distribution of the data in the actual environment of the device using the model can be minimized. Accordingly, a machine learning model with improved performance can be generated.

For example, an incorrect prediction of the machine learning model can be handled immediately on the customer's device that applies the model, without the server of the company that distributes the model having to collect data and/or redistribute the model. When the server of the company that distributes the machine learning model distributes the model to devices of customers whose data distributions differ, the distributed model can become a model optimized for the environment of each customer's device.

FIG. 1 is a block diagram of a system for adaptively detecting malicious code based on machine learning according to an embodiment of the present invention.
FIG. 2 is a flowchart of operations for adaptively detecting malicious code based on machine learning, performed by a system using a model trained to detect malicious code, according to an embodiment of the present invention.
FIG. 3 is a flowchart of operations for adaptively detecting malicious code based on machine learning, performed by a device using a model trained to detect malicious code, according to an embodiment of the present invention.

First, the advantages and features of the present invention, and methods of achieving them, will become clear with reference to the embodiments described below in detail together with the accompanying drawings. The present invention is not limited to the embodiments disclosed below and may be implemented in various different forms; the embodiments are provided by way of example only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the present invention pertains can clearly understand the scope of the invention. The technical scope of the present invention should be defined by the claims.

In addition, in the following description of the present invention, a detailed description of a well-known function or configuration will be omitted if it is determined that it may unnecessarily obscure the gist of the present invention. The terms used below are defined in consideration of their functions in the present invention and may vary depending on the intentions or customs of users, operators, and the like. Therefore, their definitions should be based on the technical idea described throughout this specification.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of a system for adaptively detecting malicious code based on machine learning according to an embodiment of the present invention.

Referring to FIG. 1, the system may include a server 10, a device 100, and a device 1000.

According to an embodiment, the server 10 may include a processor 11, a memory 13, a communication interface 15, and/or an input/output interface 17.

The processor 11 (also referred to as a control unit, control device, or control circuit) may control at least one other connected component of the server 10 (e.g., a hardware component such as the memory 13, the communication interface 15, and/or the input/output interface 17, or a software component) and may perform various data processing and operations.

The memory 13 (also referred to as a database) may store various data used by at least one component of the server 10 (the processor 11, the communication interface 15, and/or the input/output interface 17), for example, software (e.g., a program) and input data or output data for instructions related thereto.

According to an embodiment, the memory 13 may store one or more models trained under the control of the processor 11. For example, the one or more models may be malicious code detection models for checking whether one or more data (or files) (also referred to as examples) executed (or stored) by a specific device (e.g., the device 100 and/or the device 1000) is data infected with malicious code or normal data not infected with the malicious code. For example, the model may be a learning model such as a neural network model.

The communication interface 15 may support establishing a wired or wireless communication channel between the server 10 and an external device, and communicating through the established channel. For example, the communication interface 15 may include a communication module and may use it to communicate with an external device, for example, the device 100 and/or the device 1000.

The input/output interface 17 may, for example, pass a command or data input from a user or another external device to the other component(s) of the server 10, or output a command or data received from the other component(s) of the server 10 to the user or another external device.

According to an embodiment, the processor 11 may use the communication interface 15 or the input/output interface 17 to transmit the one or more models stored in the memory 13 to an external device, for example, the device 100 and/or the device 1000.

According to an embodiment, the device 100 may be a server that is connected to and manages a plurality of user terminals (computers) in a company, school, public institution, or the like, or may be any one of the plurality of user terminals.

According to an embodiment, the device 100 may include a processor 101, a memory 103, a communication interface 105, and/or an input/output interface 107.

The processor 101 (also referred to as a control unit, control device, or control circuit) may control at least one other connected component of the device 100 (e.g., a hardware component such as the memory 103, the communication interface 105, and/or the input/output interface 107, or a software component) and may perform various data processing and operations.

The memory 103 (also referred to as a database) may store various data used by at least one component of the device 100 (the processor 101, the communication interface 105, and/or the input/output interface 107), for example, software (e.g., a program) and input data or output data for instructions related thereto. The memory 103 may include a volatile memory or a non-volatile memory.

According to an embodiment, the memory 103 may store one or more models received through the communication interface 105 or the input/output interface 107. For example, the one or more models may be malicious code detection models for checking whether one or more data (or files) (also referred to as examples) executed (or stored) by the device 100 is data infected with malicious code or normal data not infected with the malicious code. For example, the model may be a learning model such as a neural network model.

The communication interface 105 may support establishing a wired or wireless communication channel between the device 100 and an external device, and communicating through the established channel. For example, the communication interface 105 may include a communication module and may use it to communicate with an external device, for example, the server 10.

The input/output interface 107 may, for example, pass a command or data input from a user or another external device to the other component(s) of the device 100, or output a command or data received from the other component(s) of the device 100 to the user or another external device.

According to an embodiment, the processor 101 may generate a label corresponding to one or more data stored in the memory 103, collect one or more additional data infected with the malicious code, perform additional training on the model stored in the memory 103 based on the generated label and the collected additional data, and perform performance evaluation of the additionally trained model.
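The additional training step can be illustrated with a deliberately small sketch. It is not the implementation described in this document: the text says the model may be a neural network, whereas the example below substitutes a two-feature logistic model so that it stays self-contained. The names `predict_proba` and `fine_tune`, the feature vectors, the learning rate, and the epoch count are all assumptions chosen for illustration. The point shown is only that training continues from the distributed weights, using locally labelled normal data together with additional malicious data.

```python
import math


def predict_proba(weights, bias, features):
    """Probability that a sample is malicious under a logistic model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def fine_tune(weights, bias, samples, labels, lr=0.1, epochs=200):
    """Continue training from the distributed weights on local data.

    labels: 0 = normal (labelled on the device), 1 = malicious (additional data).
    """
    weights = list(weights)  # leave the distributed model untouched
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            error = predict_proba(weights, bias, features) - label
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias


# Hypothetical distributed model that wrongly flags a benign local file.
base_weights, base_bias = [2.0, 0.0], 0.0
local_normal = [[1.0, 0.0], [0.9, 0.1]]     # labelled normal on the device
extra_malicious = [[0.1, 1.0], [0.0, 0.9]]  # additional infected data
samples = local_normal + extra_malicious
labels = [0, 0, 1, 1]

new_weights, new_bias = fine_tune(base_weights, base_bias, samples, labels)
```

Before fine-tuning, the assumed base model scores the benign sample `[1.0, 0.0]` above 0.5; after fine-tuning on the local data, the same sample scores below 0.5 while the additional malicious samples remain above it.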

According to an embodiment, the processor 101 may check, based on the model trained to detect malicious code and stored in the memory 103, whether each of the one or more data that has been classified as normal data not infected with the malicious code by an antivirus program for detecting the malicious code, or that is pre-stored and classified as the normal data, is misclassified as data infected with the malicious code. The processor 101 may generate the label for the one or more data based on the number of data misclassified as data infected with the malicious code being greater than or equal to a preset threshold value.
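The misclassification check above can be sketched as follows. This is a minimal illustration under stated assumptions, not the document's implementation: the function names `count_false_positives` and `labeling_needed`, the string verdicts, and the toy model `toy_model` are hypothetical, and the list of files is assumed to have already been classified as normal by the antivirus program.

```python
def count_false_positives(model_predict, known_normal_files):
    """Count files already known to be normal that the model flags as malicious."""
    return sum(1 for f in known_normal_files if model_predict(f) == "malicious")


def labeling_needed(model_predict, known_normal_files, threshold):
    """Return True when the number of misclassifications reaches the preset threshold."""
    return count_false_positives(model_predict, known_normal_files) >= threshold


# Hypothetical stand-in for the distributed model: flags any name ending in ".tmp".
def toy_model(file_name):
    return "malicious" if file_name.endswith(".tmp") else "normal"


# Files assumed to be classified as normal by the antivirus program.
normal_files = ["report.docx", "cache.tmp", "build.tmp", "notes.txt"]
```

With these assumed inputs, `labeling_needed(toy_model, normal_files, threshold=2)` is true because two known-normal files are flagged, while a threshold of 3 would leave label generation untriggered.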

According to an embodiment, the processor 101 may check, based on the model trained to detect the malicious code, a prediction probability for the prediction of whether each of the one or more data is data infected with the malicious code or normal data not infected with the malicious code. The processor 101 may generate the label for the one or more data based on the prediction probability being less than a preset threshold value.
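One way to read the prediction-probability condition is sketched below. The text does not define how the prediction probability is computed, so this example assumes it is the probability of the predicted class, that is, the larger of p and 1 - p, where p is the model's probability that a sample is malicious. The function name `low_confidence_samples` and the scores are hypothetical.

```python
def low_confidence_samples(probabilities, threshold):
    """Return names of samples whose prediction confidence is below the threshold.

    probabilities maps a sample name to the model's predicted probability
    that the sample is malicious (a value between 0 and 1).
    """
    selected = []
    for name, p_malicious in probabilities.items():
        # Assumed definition: confidence is the probability of the predicted class.
        confidence = max(p_malicious, 1.0 - p_malicious)
        if confidence < threshold:
            selected.append(name)
    return selected


# Hypothetical model outputs for four local files.
scores = {"a.exe": 0.97, "b.dll": 0.55, "c.doc": 0.08, "d.js": 0.40}
uncertain = low_confidence_samples(scores, threshold=0.7)
```

Under this assumption, only the samples the model is unsure about are passed on to label generation, which keeps the labeling effort focused on data near the decision boundary.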

According to an embodiment, the processor 101 may generate, based on the one or more data having been confirmed by an antivirus program as normal data not infected with the malicious code, a label corresponding to the normal data as the label of the one or more data.

According to an embodiment, the processor 101 may check whether the one or more data is at least one of (i) data for which an antivirus program has not generated a label indicating whether the data is infected with the malicious code or is normal data not infected with the malicious code, (ii) data whose prediction probability based on the model trained to detect the malicious code is higher than a specified threshold value, and (iii) data predicted to be infected with the malicious code based on the model trained to detect the malicious code. Based on the one or more data being confirmed to be the at least one data, the processor 101 may output information requesting the user to input a label using the input/output interface 107 (e.g., a display unit). The processor 101 may receive a user input in response to the output of the information using the input/output interface 107. The processor 101 may generate the label corresponding to the one or more data of the device based on the user input.
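The three conditions for asking the user can be sketched as a simple predicate. Everything named here is an assumption for illustration: the `Sample` record, its field names, and the `ask_user` callback, which stands in for showing a request on the display unit and reading the user's input.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    name: str
    antivirus_label: Optional[str]  # None when the antivirus program produced no label
    model_probability: float        # prediction probability reported by the model
    model_prediction: str           # "malicious" or "normal"


def needs_user_label(sample, threshold):
    """True when at least one of the three conditions in the text holds."""
    return (
        sample.antivirus_label is None
        or sample.model_probability > threshold
        or sample.model_prediction == "malicious"
    )


def request_labels(samples, threshold, ask_user):
    """Ask the user for a label for every sample that meets a condition."""
    return {s.name: ask_user(s.name) for s in samples if needs_user_label(s, threshold)}


samples = [
    Sample("tool.exe", None, 0.30, "normal"),           # no antivirus label
    Sample("doc.pdf", "normal", 0.20, "normal"),        # no condition met
    Sample("setup.msi", "normal", 0.95, "malicious"),   # predicted malicious
]
# Stand-in for the user's answer entered through the display unit.
user_labels = request_labels(samples, threshold=0.9, ask_user=lambda name: "normal")
```

Passing the prompt as a callback keeps the selection logic separate from the user interface, so the same predicate could be reused with whatever input mechanism a device provides.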

According to an embodiment, the one or more additional data infected with the malicious code may be collected, under the control of the processor 101, based on at least one of: an operation of receiving the one or more additional data infected with the malicious code from the server 10; and an operation of generating the one or more additional data, using a generative model, based on characteristics of data that is infected with malicious code a specified number of times or more.

According to an embodiment, the processor 101 may randomly divide a plurality of data into a training data set and a validation data set, the plurality of data including the one or more data that is normal data not infected with the malicious code as predicted based on the generated label and the collected one or more additional data. The processor 101 may perform performance evaluation of the additionally trained model using the validation data set.
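The random split and the evaluation on the validation set can be sketched as follows. The split ratio, the fixed seed, the choice of accuracy and false positive rate as metrics, and the names `split_dataset` and `evaluate` are assumptions; the text only states that the data is divided randomly and that the validation set is used for evaluation.

```python
import random


def split_dataset(samples, validation_ratio=0.25, seed=0):
    """Randomly split (features, label) pairs into training and validation sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * validation_ratio))
    return shuffled[n_val:], shuffled[:n_val]


def evaluate(model_predict, validation_set):
    """Return (accuracy, false positive rate); label 0 = normal, 1 = malicious."""
    correct = 0
    false_positives = 0
    normal_total = 0
    for features, label in validation_set:
        predicted = model_predict(features)
        correct += int(predicted == label)
        if label == 0:
            normal_total += 1
            false_positives += int(predicted == 1)
    accuracy = correct / len(validation_set)
    fpr = false_positives / normal_total if normal_total else 0.0
    return accuracy, fpr


# Hypothetical data: one feature per sample, label is the parity of the feature.
data = [((i,), i % 2) for i in range(8)]
train_set, validation_set = split_dataset(data)

# Stand-in for the additionally trained model (perfect on this toy data).
accuracy, fpr = evaluate(lambda features: features[0] % 2, validation_set)
```

Reporting the false positive rate alongside accuracy is a design choice made for this sketch, since the trigger for additional training in this document is normal data being misclassified as infected.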

According to an embodiment, the processor 101 may transmit information corresponding to the result of the performance evaluation to the server 10 using the communication interface 105.

FIG. 2 is a flowchart of operations for adaptively detecting malicious code based on machine learning, performed by a system using a model trained to detect malicious code, according to an embodiment of the present invention.

In operation 201, the server 10 (or the processor 11 of the server 10) may transmit the model to the device 100.

According to an embodiment, the model may be a malicious code detection model for determining whether one or more data (or files) (also referred to as examples) executed (or stored) by a specific device are data infected with malicious code or normal data not infected with malicious code. For example, the model may be a learning model such as a neural network model. The model is trained by the server 10 and may be stored in a memory (also referred to as a database) (not shown) of the server 10.

According to an embodiment, the server 10 may transmit the model to the device 100 upon receiving a distribution request (download request) for the model from the device 100.

In operation 203, the device 100 (or the processor 101 of the device 100) may determine whether transfer learning (also referred to as learning or local learning) is needed for the model.

According to an embodiment, the device 100 may receive the model transmitted by the server 10 and check whether transfer learning is needed for the model.

For example, when the device 100 performs transfer learning on the model, the device 100 consumes resources such as the CPU. Accordingly, when the distribution of the data that the server 10 used for training to generate the model is the same as (or similar to) the distribution of the data on the device 100, the transfer learning may be omitted to avoid unnecessary use of the resources of the device 100. Therefore, in an embodiment of the present invention, whether transfer learning is needed may be evaluated before the device performs the transfer learning.

According to an embodiment, the device 100 may determine whether transfer learning is needed for the model by performing at least one of the following operations.

For example, on the assumption that one or more data of the device 100 (e.g., local files or generated behavior log data) are normal data not infected with malicious code, the device 100 may apply the model to the one or more data and check whether the model misclassifies them as data infected with malicious code. If the number of misclassifications among the one or more data assumed to be normal is equal to or greater than a preset threshold number, the device 100 may determine that transfer learning is needed for the model.

For example, the device 100 may use a pre-stored or received antivirus program (also referred to as an antivirus scanner) for detecting malicious code, for example by performing an antivirus scan (querying the cloud where possible), to determine whether the one or more data of the device 100 are infected with malicious code. When the model, applied to data that the antivirus program has determined to be normal, determines that the data is infected with malicious code, the device 100 may determine that the model has misclassified the data. If the number of data determined to be misclassified is equal to or greater than a preset threshold number, the device 100 may determine that transfer learning is needed for the model.

For example, when the model is applied to the one or more data, the device 100 may check whether the model has determined any of the following to be data infected with malicious code: data held in a database in which normal data has been classified and stored; data that the user has included in a list of exceptions from malicious code diagnosis; data signed as a normal file by an accredited vendor; and/or data with a high reputation score, such as a score based on the number of users. If the model determines such data to be infected with malicious code, the device 100 may determine that the model has misclassified it. If the number of data determined to be misclassified is equal to or greater than a preset threshold number, the device 100 may determine that transfer learning is needed for the model.
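
The misclassification checks above share one pattern: run the model over data that another source already treats as normal, then count the false alarms. The following Python sketch illustrates that pattern only. The function name, the `predict_is_malicious` callable, the `threshold_count` parameter, and the toy model are illustrative assumptions and do not appear in the original text.

```python
from typing import Any, Callable, Iterable


def needs_transfer_learning(
    trusted_benign: Iterable[Any],
    predict_is_malicious: Callable[[Any], bool],
    threshold_count: int,
) -> bool:
    """Return True when the model flags at least `threshold_count` samples
    that a trusted source (scanner, exception list, signature) calls normal."""
    misclassified = sum(1 for sample in trusted_benign if predict_is_malicious(sample))
    return misclassified >= threshold_count


# Hypothetical stand-in for the deployed model: flags names ending in ".tmp".
def toy_model(name: str) -> bool:
    return name.endswith(".tmp")


local_files = ["report.docx", "cache.tmp", "build.tmp", "notes.txt"]
print(needs_transfer_learning(local_files, toy_model, threshold_count=2))
```

The same function could serve all three checks by changing only the source of `trusted_benign`.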

For example, when the model is applied to the one or more data, if the prediction probability (prediction confidence) value (0 to 1) for the prediction of whether the one or more data are infected with malicious code or normal is smaller than a preset threshold value, the device 100 may determine that transfer learning is needed for the model. The device 100 may determine that transfer learning is needed if data with high uncertainty is present when the model is used to predict malicious code infection. The device 100 may calculate a degree of predictive uncertainty for each individual data item and may determine that transfer learning is needed when the number of data items with high predictive uncertainty is equal to or greater than a certain number. The device 100 may also use a rule-based detector (e.g., a weak detector) that detects suspicious indications weaker than outright maliciousness, and may determine that transfer learning is needed when the rule-based detector detects nothing but the uncertainty is high. For example, the determination that the uncertainty is high may be made based on at least one of the following operations (1, 2).

1. When the model is a neural network model, a dropout technique that randomly drops some neurons is used to repeat, a specified number of times, the prediction of whether the one or more data are infected with malicious code or normal. The variance of the resulting prediction probabilities is then measured, and if the measured variance is equal to or greater than a specific threshold value, the uncertainty may be determined to be high.

2. When a plurality of models is used (an ensemble), if the variance of the prediction probabilities obtained by applying each of the plurality of models to the same one or more data is equal to or greater than a specific threshold value, the uncertainty may be determined to be high.
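
Both operations reduce to the same test: gather several prediction probabilities for one data item and compare their variance with a threshold. The sketch below shows that test using only the Python standard library. The noisy predictor merely stands in for a network evaluated with dropout enabled, and all names and numeric values are illustrative assumptions.

```python
import random
from statistics import pvariance
from typing import Any, Callable, List, Sequence


def repeated_predictions(
    stochastic_predict: Callable[[Any], float], sample: Any, passes: int
) -> List[float]:
    """Collect `passes` predictions from a predictor that is random per call
    (for example, a network evaluated with dropout left on)."""
    return [stochastic_predict(sample) for _ in range(passes)]


def is_uncertain(probabilities: Sequence[float], variance_threshold: float) -> bool:
    """Uncertainty is high when the spread of probabilities reaches the threshold."""
    return pvariance(probabilities) >= variance_threshold


# Hypothetical noisy predictor standing in for a dropout-enabled network.
rng = random.Random(0)


def noisy_predict(sample: Any) -> float:
    return min(1.0, max(0.0, 0.5 + rng.uniform(-0.4, 0.4)))


dropout_probs = repeated_predictions(noisy_predict, "sample.bin", passes=20)
ensemble_probs = [0.91, 0.93, 0.90]  # one probability per ensemble member
print(is_uncertain(dropout_probs, 0.01), is_uncertain(ensemble_probs, 0.01))
```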

The threshold value in 1 and 2 above may be determined using precision@K-recall, a metric that indicates the precision obtained at the threshold that achieves a target recall, or may be specified by the user. For example, with the precision@K-recall approach, a target precision@K-recall is set when the model is trained, the threshold value at that point is identified, and the identified threshold value is used as the threshold value in 1 and 2 above.
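
One possible reading of the precision@K-recall approach is sketched below: given scores and ground-truth labels for a labeled set, find the highest threshold whose recall reaches the target and report the precision at that threshold. The original text does not specify which scores or which labeled set are used, so the inputs, the function name, and the sample values here are assumptions.

```python
from typing import Sequence, Tuple


def threshold_for_target_recall(
    scores: Sequence[float], labels: Sequence[int], target_recall: float
) -> Tuple[float, float]:
    """Return (threshold, precision) for the highest score threshold whose
    recall on the labeled data reaches `target_recall`. Labels are 0 or 1."""
    positives = sum(labels)
    if positives == 0:
        raise ValueError("at least one positive label is required")
    for threshold in sorted(set(scores), reverse=True):
        # Labels of every item scored at or above the candidate threshold.
        flagged = [label for score, label in zip(scores, labels) if score >= threshold]
        true_positives = sum(flagged)
        if true_positives / positives >= target_recall:
            return threshold, true_positives / len(flagged)
    raise ValueError("target_recall must not exceed 1.0")


scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [1, 0, 1, 1, 0]
print(threshold_for_target_recall(scores, labels, target_recall=0.6))
```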

For example, when the device 100, using the model, determines that data which was previously determined (diagnosed) to be infected with malicious code and stored in a specific database (e.g., a quarantine) for isolation is normal data, the device 100 may determine that transfer learning is needed for the model. Transfer learning here refers to additionally training the previously trained model using one or more data of the device.

In operation 205, the device 100 may generate labels for one or more data of the device based on the result of the necessity check.

According to an embodiment, operation 205 may be executed when it is determined that transfer learning is needed for the model; otherwise, the operations of this embodiment may be terminated.

According to an embodiment, to perform transfer learning, which is additional training of the model, a label is needed that indicates whether each of the one or more data of the device 100 used as input data of the model is data infected with malicious code or normal data. For example, if the one or more data are files, a label indicating whether each file is malicious is needed; if the one or more data are process behaviors (a list), a label indicating whether the behaviors contain a malicious element is needed.

According to an embodiment, the device 100 may automatically generate (assign) labels for the one or more data.

For example, the device 100 may use a pre-stored or received antivirus program (also referred to as an antivirus scanner) for detecting malicious code, for example by performing an antivirus scan (querying the cloud where possible), and may assign a label indicating normal data to data (files) that are not detected as malicious code.

For example, the device 100 may assign a label indicating normal data when a behavior graph or behavior list contains no malicious element.

According to an embodiment, the device 100 may generate (assign) labels for the one or more data based on user input.

For example, the device 100 may determine, based on a specified condition, whether user input is needed to generate a label for the one or more data and, if so, may generate the label based on the user input. The specified condition may be a predefined condition under which labels are generated from user input for data that the device 100 cannot label automatically or data that requires additional confirmation by the user. Such data may include data that the antivirus program cannot classify, data for which the uncertainty of the model's prediction is higher than a specified threshold value, and/or data that the model predicts to be malicious.
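
The specified condition can be expressed as a single predicate over the three cases listed above. The sketch below is a minimal illustration; the function and parameter names are hypothetical.

```python
def needs_user_label(
    scanner_classified: bool,
    uncertainty: float,
    uncertainty_threshold: float,
    model_predicts_malicious: bool,
) -> bool:
    """Return True when a label should come from the user rather than
    from automatic labeling."""
    return (
        not scanner_classified                   # scanner could not classify the data
        or uncertainty > uncertainty_threshold   # model prediction is too uncertain
        or model_predicts_malicious              # model predicts malicious
    )


print(needs_user_label(True, 0.02, 0.10, False))
print(needs_user_label(True, 0.02, 0.10, True))
```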

For example, when the device 100 determines that user input is needed to generate a label for the one or more data, the device 100 may request a label input from the user. The device 100 may output, through its display unit, information requesting that the user enter a label. For example, the device 100 may issue the label input request at the moment the user attempts to use the data in question (for example, when the device 100 receives a user input to execute the file that is the subject of the label input request).

When the device 100 requests a label input from the user and the device 100 is a personal terminal connected to a specific server (e.g., a corporate server), the specific server may receive, store, and manage the labels that the device 100 has entered for one or more data, so that other personal terminals connected to the specific server do not receive the same label input request for the same data. For example, the specific server may collect, store, and manage the labels entered by the connected personal terminals for one or more data. Accordingly, each personal terminal connected to the specific server may check, before requesting a label input from its user, whether a label for the data is already stored on the specific server. The device 100 likewise checks, before requesting a label input from the user, whether a label for the data in question is stored on the connected specific server, and requests the label input from the user only if no such label is stored. If a label for the data is stored on the specific server, the device 100 may receive the label from the specific server. For example, if, after the device 100 has requested a label input but before the user has entered the label, another personal terminal connected to the same specific server enters a label for that data, the device 100 may receive information to that effect from the specific server and cancel the label input request.
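
The check-before-asking flow described above can be sketched as follows. The `LabelServer` class is an in-memory stand-in for the specific server, and `resolve_label`, `ask_user`, and the data identifier are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, List, Optional


class LabelServer:
    """In-memory stand-in for the server that stores labels entered by terminals."""

    def __init__(self) -> None:
        self._labels: Dict[str, str] = {}

    def lookup(self, data_id: str) -> Optional[str]:
        return self._labels.get(data_id)

    def store(self, data_id: str, label: str) -> None:
        self._labels[data_id] = label


def resolve_label(
    data_id: str, server: LabelServer, ask_user: Callable[[str], str]
) -> str:
    """Reuse a label already stored on the server; otherwise ask the user
    and share the answer so other terminals are not asked again."""
    stored = server.lookup(data_id)
    if stored is not None:
        return stored
    label = ask_user(data_id)
    server.store(data_id, label)
    return label


prompts: List[str] = []


def fake_user(data_id: str) -> str:
    prompts.append(data_id)  # record each time the user is prompted
    return "normal"


server = LabelServer()
first = resolve_label("file-123", server, fake_user)   # terminal A asks its user
second = resolve_label("file-123", server, fake_user)  # terminal B reuses the label
print(first, second, len(prompts))
```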

In operation 207, the device 100 may collect one or more additional data infected with malicious code.

According to an embodiment, to make the transfer learning of the model more precise, the device 100 may collect (secure) one or more additional data infected with malicious code through at least one of the following operations.

According to an embodiment, when the server 10 that provided the model prepares one or more additional data actually infected with malicious code and transmits data corresponding to the additional data, the device 100 may receive the data corresponding to the additional data.

The data corresponding to the one or more additional data actually infected with malicious code may be data that the server 10 has converted into numeric vector form. The server 10 may select, from among a plurality of data, one or more additional data infected with malicious code that contribute strongly to model training, and may provide the corresponding data to the device 100. For example, the additional data with a high training contribution may be data infected with malicious code that has a high loss. The server 10 may provide the device 100 with data corresponding to additional data infected with malicious code that the server 10 acquired (collected) after transmitting the model to the device 100. The server 10 may also evaluate the model and, according to the evaluation result, provide the device 100 with data corresponding to additional data infected with malicious code that the model detects at a rate equal to or lower than a threshold value, and/or data corresponding to additional data infected with preset important malicious code.
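
Selecting additional data by training contribution could, for instance, mean ranking samples by their per-sample loss under the current model and keeping the top few. The sketch below assumes the losses have already been computed; the function name and sample values are illustrative and not taken from the original text.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def select_high_loss(samples: Sequence[T], losses: Sequence[float], count: int) -> List[T]:
    """Return the `count` samples with the largest loss, highest first.
    Ties keep their original order because Python's sort is stable."""
    order = sorted(range(len(samples)), key=lambda i: losses[i], reverse=True)
    return [samples[i] for i in order[:count]]


vectors = ["vec_a", "vec_b", "vec_c", "vec_d"]
losses = [0.1, 2.0, 0.7, 0.05]
print(select_high_loss(vectors, losses, count=2))
```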

According to an embodiment, the device 100 may receive, from among the feature items of data that the server 10 providing the model has collected and transmitted (distributed), the feature items of data that is frequently infected with malicious code, and may generate data infected with malicious code through preset sampling. When transmitting the feature items, the server 10 may also transmit the associated probability of malicious code infection to the device 100 so that the actual distribution is reflected.
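
The original text does not define the "preset sampling". One simple interpretation, sketched below, treats each received feature item as a binary feature that is switched on with its transmitted probability. The feature names, probabilities, and function name are hypothetical.

```python
import random
from typing import Dict


def sample_infected_features(
    feature_probabilities: Dict[str, float], rng: random.Random
) -> Dict[str, int]:
    """Draw one synthetic infected sample: each feature is set to 1 with the
    probability the server sent for it, otherwise 0."""
    return {
        name: int(rng.random() < probability)
        for name, probability in feature_probabilities.items()
    }


# Hypothetical feature items and probabilities received from the server.
received = {"packed_section": 0.8, "writes_autorun_key": 0.6, "signed_binary": 0.05}
rng = random.Random(42)
synthetic_samples = [sample_infected_features(received, rng) for _ in range(3)]
print(synthetic_samples)
```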

According to an embodiment, the device 100 may use a generative model, from among the models used for classification, to generate features of data that is infected with malicious code at least a specified number of times, and may sample (combine) those features to generate data infected with malicious code.

In operation 209, the device 100 may perform transfer learning (also referred to as local learning or learning) on the model and evaluate the performance of the transfer-learned model.

According to an embodiment, the device 100 may perform transfer learning on the model using the labels generated and the plurality of data obtained through the operations described above. The obtained plurality of data may include one or more normal data of the device 100 determined (presumed), based on the label generation described above, not to be infected with malicious code, and/or the one or more additional data infected with malicious code collected through the operations described above.

According to an embodiment, the device 100 may split the obtained plurality of data into a training data set and a validation data set and then perform transfer learning on the model.

For example, the split may be performed randomly, and the infected data and the normal data may be distributed between the training data set and the validation data set so that each set preserves the ratio of normal data to infected additional data found in the obtained plurality of data. For example, if the obtained plurality of data contains 100 normal data determined not to be infected with malicious code and 10 collected additional data infected with malicious code, the split may be performed so that the training set contains 90 normal data and 9 infected additional data, and the validation set contains 10 normal data and 1 infected additional data.
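
The worked example above corresponds to a stratified random split with a validation fraction of 0.1. A minimal sketch using only the standard library follows; the function name, the 0/1 label encoding, and the seed are illustrative assumptions.

```python
import random
from typing import Any, List, Sequence, Tuple

Labeled = Tuple[Any, int]  # (data, label) with 0 = normal, 1 = infected


def stratified_split(
    normal: Sequence[Any],
    infected: Sequence[Any],
    validation_fraction: float,
    seed: int = 0,
) -> Tuple[List[Labeled], List[Labeled]]:
    """Randomly split each class separately so both sets keep the class ratio."""
    rng = random.Random(seed)

    def split(items: Sequence[Any]) -> Tuple[List[Any], List[Any]]:
        shuffled = list(items)
        rng.shuffle(shuffled)
        n_validation = round(len(shuffled) * validation_fraction)
        return shuffled[n_validation:], shuffled[:n_validation]

    normal_train, normal_val = split(normal)
    infected_train, infected_val = split(infected)
    train = [(x, 0) for x in normal_train] + [(x, 1) for x in infected_train]
    validation = [(x, 0) for x in normal_val] + [(x, 1) for x in infected_val]
    return train, validation


normal_data = [f"normal_{i}" for i in range(100)]
infected_data = [f"infected_{i}" for i in range(10)]
train_set, validation_set = stratified_split(normal_data, infected_data, 0.1)
print(len(train_set), len(validation_set))
```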

According to an embodiment, the transfer learning of operation 209 may be performed only after checking the usage of the CPU, memory, and other resources of the device 100 and confirming that the usage is at or below a specified threshold. For example, because transfer learning consumes resources of the device 100, the device 100 may perform operation 209 only when the CPU is idle.

According to an embodiment, when transfer learning of the model is complete, the performance of the transfer-learned model may be evaluated using the validation set.

In operation 211, the device 100 may transmit information corresponding to the result of the performance evaluation, based on that result.

According to an embodiment, when the result of the performance evaluation indicates that the transfer-learned model has passed the performance evaluation criterion, the device 100 may transmit information corresponding to the result of the performance evaluation to the server 10 that provided the model.

In operation 213, the server 10 may update the model based on the information corresponding to the result of the performance evaluation.

According to an embodiment, the server 10 may receive the information corresponding to the result of the performance evaluation transmitted by the device 100 and may use it to update the model stored in the memory (or database) of the server 10.

According to an embodiment, the information corresponding to the result of the performance evaluation may include information corresponding to the weight difference between the previously trained model and the transfer-learned model. For example, when the previously trained model and the transfer-learned model are neural network models, the information corresponding to the result of the performance evaluation may include information corresponding to the change in the weights of the neurons between the previously trained model and the transfer-learned model.
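
A weight difference of this kind can be computed element by element between the two sets of weights. The sketch below assumes, for illustration only, that weights are stored as a mapping from layer name to a flat list of floats; the layer name and values are hypothetical.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def weight_delta(pretrained: Weights, transfer_learned: Weights) -> Weights:
    """Return, per layer, the change in each weight after transfer learning."""
    return {
        layer: [new - old for old, new in zip(pretrained[layer], transfer_learned[layer])]
        for layer in pretrained
    }


before = {"dense_1": [1.0, 2.0, -0.5]}
after = {"dense_1": [1.5, 1.0, -0.5]}
print(weight_delta(before, after))
```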

According to an embodiment, the server 10 may use the information corresponding to the weight difference to improve the performance of the stored model. For example, when the model is provisionally updated using the weight difference information received from the device 100 and/or one or more other devices, the performance of the provisionally updated model may be evaluated. For example, the performance evaluation (also referred to as performance verification) of the provisionally updated model may be performed using a preset internal evaluation set of the server 10. If the evaluation shows that performance does not meet a specified threshold criterion, the weight difference information may be deleted and the provisionally updated model may not be stored. If the evaluation shows that performance meets the specified threshold criterion, the provisionally updated model may be stored and/or transmitted (distributed) to one or more devices.
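
The provisional update followed by an evaluation gate can be sketched as below. The original text does not say how weight differences from several devices are combined; averaging them is an assumption made here for illustration, as are the function names, the `evaluate` callable, and the score threshold.

```python
from typing import Callable, Dict, List, Tuple

Weights = Dict[str, List[float]]


def average_deltas(deltas: List[Weights]) -> Weights:
    """Element-wise mean of the weight differences reported by devices."""
    return {
        layer: [sum(values) / len(deltas) for values in zip(*(d[layer] for d in deltas))]
        for layer in deltas[0]
    }


def apply_delta(weights: Weights, delta: Weights) -> Weights:
    return {
        layer: [w + d for w, d in zip(weights[layer], delta[layer])]
        for layer in weights
    }


def update_if_accepted(
    weights: Weights,
    deltas: List[Weights],
    evaluate: Callable[[Weights], float],
    min_score: float,
) -> Tuple[Weights, bool]:
    """Build a provisional model and keep it only if it meets the criterion
    on the server's internal evaluation; otherwise keep the stored model."""
    candidate = apply_delta(weights, average_deltas(deltas))
    if evaluate(candidate) >= min_score:
        return candidate, True
    return weights, False


stored = {"w": [1.0, 1.0]}
reported = [{"w": [1.0, 0.0]}, {"w": [3.0, 2.0]}]
toy_score = lambda model: model["w"][0]  # hypothetical stand-in for an evaluation
print(update_if_accepted(stored, reported, toy_score, min_score=2.5))
```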

For example, if the performance evaluation carried out with the information received from the device 100 shows that performance does not meet the specified threshold criterion, the server 10 may judge subsequent performance evaluation results received from the device 100 to be unreliable and may not use them to update the model.

Meanwhile, when the device confirms, in accordance with operation 201 of the embodiment of FIG. 2 described above, that a malicious file with low uncertainty exists, the device may be presumed to be infected with malicious code, and the administrator of the device may be notified so that action can be taken.

In addition, when new data such as a new file is received by the device after the operations of the embodiment of FIG. 2 described above, the device may use the model trained according to the embodiment of FIG. 2 to predict (determine) whether the new data is infected with malicious code. If the number of new data items with high uncertainty is equal to or greater than a specified threshold number, the device may notify its administrator and perform the transfer learning operation described above.

In addition to the embodiment of FIG. 2 described above, when, in operation 203 described above, the device requests a label input from the user, receives the label, and performs transfer learning on the previously trained model so that the performance of the model improves, the company that provided the model may provide a reward through the external device in various ways, such as issuing Internet coupons or virtual points. For example, the company that provided the model may provide the reward through the external device when the number of labels received from the user is equal to or greater than a specified threshold value. Alternatively, the reward may be provided in proportion to the degree to which the labels provided by the user improve the performance of the model.

FIG. 3 is a flowchart of operations for machine learning-based adaptive malicious code detection by a device (e.g., the device 100 or the processor 101 of the device 100) that uses a model trained to detect malicious code, according to an embodiment of the present invention.

In operation 301, the device may generate a label corresponding to one or more data of the device.

According to an embodiment, the device may check, based on the model trained to detect the malicious code, whether each of the one or more data that has been classified as normal data not infected with the malicious code by a vaccine (antivirus) program for detecting the malicious code, or that is pre-stored and classified as the normal data, is misclassified as data infected with the malicious code. The device may generate the label for the one or more data of the device based on the number of data misclassified as infected with the malicious code being equal to or greater than a preset threshold.
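
The misclassification check in this embodiment can be sketched as below. This is an illustrative sketch under assumptions: the model is represented by a hypothetical `predict` callable that returns `"malicious"` or `"normal"`, and the function names and the default threshold are not taken from the source.

```python
from typing import Callable, Iterable


def count_false_positives(predict: Callable[[str], str], known_normal: Iterable[str]) -> int:
    """Count items already known to be normal that the model flags as malicious."""
    return sum(1 for item in known_normal if predict(item) == "malicious")


def should_generate_labels(predict: Callable[[str], str],
                           known_normal: Iterable[str],
                           threshold: int = 2) -> bool:
    """Label generation is triggered once the false-positive count reaches the threshold."""
    return count_false_positives(predict, known_normal) >= threshold


# Stand-in for the deployed model: it wrongly flags two in-house tools.
wrongly_flagged = {"tool.exe", "build.dll"}


def stub_predict(path: str) -> str:
    return "malicious" if path in wrongly_flagged else "normal"


known_normal_files = ["tool.exe", "build.dll", "notes.txt"]
print(should_generate_labels(stub_predict, known_normal_files))
```

A high false-positive count on data that the vaccine program already cleared is a sign that the local data distribution differs from the training distribution, which is the situation the additional training is meant to address.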

According to an embodiment, the device may check, based on the model trained to detect the malicious code, a prediction probability for the prediction of whether each of the one or more data is data infected with the malicious code or normal data not infected with the malicious code. The device may generate the label for the one or more data of the device based on the prediction probability being less than a preset threshold.
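
As a rough sketch of this probability check, the snippet below selects the items whose prediction probability falls below a threshold. It assumes that "prediction probability" means the probability the model assigns to whichever class it predicts, that is `max(p, 1 - p)` for a binary model; the source does not define it further. The function name and the threshold of 0.7 are hypothetical.

```python
def low_confidence_items(p_malicious_by_name, threshold=0.7):
    """Return the items whose prediction probability is below the threshold.

    p_malicious_by_name: dict mapping an item name to the model's probability
    that the item is infected. The confidence of the predicted class is
    max(p, 1 - p), an assumed interpretation of "prediction probability".
    """
    selected = {}
    for name, p in p_malicious_by_name.items():
        confidence = max(p, 1.0 - p)
        if confidence < threshold:
            selected[name] = confidence
    return selected


scores = {"a.exe": 0.95, "b.dll": 0.55, "c.doc": 0.40}
print(sorted(low_confidence_items(scores)))  # items that would receive a label
```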

According to an embodiment, the device may generate the label of the one or more data as a label corresponding to normal data, based on the one or more data having been identified by a vaccine program as normal data not infected with the one or more malicious codes.

According to an embodiment, the device may check whether the one or more data is at least one of: data for which a vaccine program has not generated a label indicating whether it is data infected with the malicious code or normal data not infected with the malicious code; data whose prediction probability based on the model trained to detect the malicious code is higher than a specified threshold; and data predicted, based on the model trained to detect the malicious code, to be infected with the malicious code. Based on the one or more data being identified as the at least one data, the device may output, using a display unit of the device, information requesting the user to input a label. The device may receive a user input in response to the output of the information using the display unit. The device may generate the label corresponding to the one or more data of the device based on the user input.
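
The selection of data for a user label request can be sketched as follows. This is a simplified illustration: the `ScannedItem` fields, the `ask_user` callback (standing in for the display unit and user input), and the threshold value are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ScannedItem:
    name: str
    av_label: Optional[str]        # "normal", "malicious", or None if the vaccine program gave no label
    predicted_label: str           # model prediction: "normal" or "malicious"
    prediction_probability: float  # prediction probability reported by the model


def needs_user_label(item: ScannedItem, probability_threshold: float = 0.9) -> bool:
    """True if the item meets at least one of the three conditions in the text."""
    return (
        item.av_label is None
        or item.prediction_probability > probability_threshold
        or item.predicted_label == "malicious"
    )


def collect_user_labels(items: List[ScannedItem],
                        ask_user: Callable[[ScannedItem], str],
                        probability_threshold: float = 0.9) -> Dict[str, str]:
    """Ask the user about each qualifying item and record the answer as its label."""
    labels = {}
    for item in items:
        if needs_user_label(item, probability_threshold):
            labels[item.name] = ask_user(item)
    return labels


items = [
    ScannedItem("a.exe", None, "normal", 0.60),          # no vaccine label
    ScannedItem("b.dll", "normal", "normal", 0.95),      # probability above threshold
    ScannedItem("c.doc", "normal", "malicious", 0.70),   # predicted malicious
    ScannedItem("d.txt", "normal", "normal", 0.80),      # none of the conditions
]
# A real device would prompt on its display; here every answer is "normal".
print(collect_user_labels(items, ask_user=lambda item: "normal"))
```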

In operation 303, the device may collect one or more additional data infected with the malicious code.

According to an embodiment, the one or more additional data infected with the malicious code may be collected based on at least one of: an operation of receiving the one or more additional data infected with the malicious code from a server; and an operation of generating the one or more additional data using a generative model, based on characteristics of data infected with malicious code a specified number of times or more.
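
The source does not specify which generative model is used. As a deliberately simple stand-in, the sketch below fits an independent Gaussian to each feature of known infected samples and draws synthetic feature vectors from it. The function names, the feature representation, and the choice of a Gaussian are assumptions for illustration only; the "specified number of times" criterion is not modelled here.

```python
import random
import statistics


def fit_feature_distribution(feature_vectors):
    """Estimate (mean, standard deviation) per feature from infected samples.

    Requires at least two samples, because a sample standard deviation is used.
    """
    columns = list(zip(*feature_vectors))
    return [(statistics.mean(col), statistics.stdev(col)) for col in columns]


def generate_additional_samples(params, count, seed=0):
    """Draw synthetic feature vectors from the fitted per-feature Gaussians."""
    rng = random.Random(seed)
    return [[rng.gauss(mu, sigma) for mu, sigma in params] for _ in range(count)]


# Hypothetical feature vectors extracted from known infected files.
infected_features = [[1.0, 10.0], [3.0, 14.0], [2.0, 12.0]]
params = fit_feature_distribution(infected_features)
synthetic = generate_additional_samples(params, count=5)
print(len(synthetic), "synthetic infected samples generated")
```

A practical system would more likely use a richer generative model, but the role in the workflow is the same: supply infected examples so that the additional training set is not made up of normal data only.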

In operation 305, the device may perform additional training on the model based on the generated label and the collected additional data.

In operation 307, the device may evaluate the performance of the additionally trained model.

According to an embodiment, the device may randomly split a plurality of data, comprising the one or more data predicted to be normal data not infected with the malicious code based on the generated label and the collected one or more additional data, into a training data set and a validation data set. For example, the device may perform the additional training on the model using the training data set. For example, the device may evaluate the performance of the additionally trained model using the validation data set.
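
The random split and the evaluation step can be sketched as below. This is a minimal sketch under assumptions: samples are `(features, label)` pairs with label 0 for normal and 1 for infected, accuracy is used as the performance metric (the source does not name one), and `validation_ratio` and `seed` are hypothetical parameters. The additional training itself is omitted and represented by a stand-in `predict` function.

```python
import random


def split_train_validation(samples, validation_ratio=0.25, seed=0):
    """Randomly split labelled samples into (training set, validation set)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to split")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, int(len(shuffled) * validation_ratio))
    return shuffled[n_validation:], shuffled[:n_validation]


def accuracy(predict, labeled_samples):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for features, label in labeled_samples if predict(features) == label)
    return correct / len(labeled_samples)


# Normal data labelled on the device (label 0) plus collected infected data (label 1).
normal = [((i,), 0) for i in range(6)]
infected = [((i,), 1) for i in range(6, 8)]
train_set, validation_set = split_train_validation(normal + infected)

# Stand-in for the additionally trained model.
def predict(features):
    return 0 if features[0] < 6 else 1

print("validation accuracy:", accuracy(predict, validation_set))
```

Holding out the validation set before the additional training keeps the evaluation from being measured on the same data the model was just fitted to.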

In addition to the embodiment of FIG. 3 described above, the device may transmit information corresponding to the result of the performance evaluation to the server (e.g., the server 10) that distributed the model trained to detect the malicious code.

In addition to the embodiment of FIG. 3 described above, the device may provide a reward in a specified manner based on the user's contribution, determined according to the performance evaluation of the additionally trained model.

Various embodiments of this document may be implemented as software (e.g., a program) including instructions stored in machine-readable storage media (e.g., the memory 113 (internal memory or external memory)) readable by a machine (e.g., a computer). The machine is a device capable of calling the stored instructions from the storage medium and operating according to the called instructions, and may include an electronic device (e.g., the electronic device 100) according to the disclosed embodiments. When the instructions are executed by a control unit (e.g., the control unit 101) (or a processor), the control unit may perform the function corresponding to the instructions either directly or by using other components under its control. The instructions may include code generated or executed by a compiler or an interpreter. The machine-readable storage medium may be provided in the form of a non-transitory storage medium. Here, 'non-transitory' means only that the storage medium does not include a signal and is tangible; it does not distinguish between data being stored semi-permanently and data being stored temporarily on the storage medium.

According to an embodiment, the method according to the various embodiments disclosed in this document may be provided as part of a computer program product.

The above description merely illustrates the technical idea of the present invention, and a person of ordinary skill in the art to which the present invention pertains will readily appreciate that various substitutions, modifications, and changes are possible without departing from the essential characteristics of the present invention. That is, the embodiments disclosed herein are intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments.

Accordingly, the scope of protection of the present invention should be construed according to the claims set forth below, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of rights of the present invention.

Claims

An adaptive method for detecting malicious code based on machine learning by a device using a model trained to detect malicious code, the method comprising:
generating a label corresponding to one or more data of the device;
collecting one or more additional data infected with the malicious code;
performing additional learning on the model based on the generated label and the collected additional data; and
performing a performance evaluation of the additionally trained model.

The method of claim 1,
further comprising checking, based on the model trained to detect the malicious code, whether each of the one or more data that is classified as normal data not infected with the malicious code by a vaccine program for detecting the malicious code, or that is pre-stored and classified as the normal data, is misclassified as data infected with the malicious code,
wherein generating the label for the one or more data of the device is performed based on the number of data misclassified as data infected with the malicious code being equal to or greater than a preset threshold.

The method of claim 1,
further comprising checking, based on the model trained to detect the malicious code, a prediction probability for the prediction of whether each of the one or more data is data infected with the malicious code or normal data not infected with the malicious code,
wherein generating the label for the one or more data of the device is performed based on the prediction probability being less than a preset threshold.

The method of claim 1, wherein generating the label corresponding to one or more data of the device comprises:
generating the label of the one or more data as a label corresponding to normal data, based on the one or more data being identified by a vaccine program as normal data not infected with the one or more malicious codes.

5. The method of claim 4,
further comprising: determining whether the one or more data is at least one of data for which no label has been generated by the vaccine program as to whether it is data infected with the malicious code or normal data not infected with the malicious code, data whose prediction probability based on the model trained to detect the malicious code is higher than a specified threshold, and data predicted as data infected with the malicious code based on the model trained to detect the malicious code;
outputting, using a display unit of the device, information requesting a user to input a label, based on the one or more data being identified as the at least one data; and
receiving a user input in response to the output of the information using the display unit of the device,
wherein generating the label corresponding to the one or more data of the device is performed based on the user input.

The method of claim 1, wherein the one or more additional data infected with the malicious code is collected based on at least one of:
an operation of receiving the one or more additional data infected with the malicious code from a server; and an operation of generating the one or more additional data using a generative model, based on characteristics of data infected with malicious code a specified number of times or more.

The method of claim 1,
further comprising randomly dividing a plurality of data, including the one or more data predicted to be normal data not infected with the malicious code based on the generated label and the collected one or more additional data, into a training data set and a validation data set,
wherein performing the performance evaluation of the additionally trained model is performed using the validation data set.

The method of claim 1,
further comprising transmitting information corresponding to a result of the additional learning.

The method of claim 1,
wherein the data used for training the model trained to detect the malicious code is different from the one or more data.

The method of claim 1,
further comprising providing a reward in a specified manner based on the user's contribution determined according to the performance evaluation of the additionally trained model.

A computer-readable recording medium storing a computer program, the computer program comprising instructions for causing a processor to perform a method comprising:
generating a label corresponding to one or more data of a device;
collecting one or more additional data infected with malicious code;
performing additional learning on a model trained to detect the malicious code, based on the generated label and the collected additional data; and
performing a performance evaluation of the additionally trained model.

A computer program stored in a computer-readable recording medium, the computer program comprising instructions for causing a processor to perform a method comprising:
generating a label corresponding to one or more data of a device;
collecting one or more additional data infected with malicious code;
performing additional learning on a model trained to detect the malicious code, based on the generated label and the collected additional data; and
performing a performance evaluation of the additionally trained model.

An adaptive device for detecting malicious code based on machine learning, comprising:
a processor; and
a memory electrically coupled to the processor and configured to store one or more data and a model trained to detect malicious code, wherein the memory stores instructions that, when executed, cause the processor to:
generate a label corresponding to the one or more data;
collect one or more additional data infected with the malicious code;
perform additional training on the model based on the generated label and the collected additional data; and
perform a performance evaluation of the additionally trained model.