KR20200063960A

KR20200063960A - Discount factor auto adjusting type reinforcement learning method

Info

Publication number: KR20200063960A
Application number: KR1020190069058A
Authority: KR
Inventors: 오성택; 고웅; 김미주; 이재혁; 박준형
Original assignee: 한국인터넷진흥원
Priority date: 2019-06-12
Filing date: 2019-06-12
Publication date: 2020-06-05
Also published as: KR102120049B1

Abstract

Provided is a reinforcement learning method by automatically adjusting a discount factor, capable of adjusting a direction of reinforcement learning as a discount factor of reinforcement learning is automatically changed according to a change in an environment. According to one embodiment of the present invention, the method includes repeatedly training a reinforcement learning model, which determines an evaluation result of input data, by using the input data. In this case, the repeatedly training includes: obtaining first result data which is an output result obtained by inputting the input data to the reinforcement learning model; obtaining second result data which is a result of evaluating the input data by using a first evaluation model; obtaining a first return value which is a summed result obtained by applying the discount factor to a first reward given in consideration of whether the first result data matches the second result data; training the reinforcement learning model by using the first return value; and automatically adjusting the discount factor in consideration of the second result data.

Description

{DISCOUNT FACTOR AUTO ADJUSTING TYPE REINFORCEMENT LEARNING METHOD}

The present invention relates to a method of training, based on reinforcement learning, a model that evaluates input data, and to a computing device to which the method is applied. More specifically, it relates to a reinforcement learning method in which the discount factor used in training the model is adjusted automatically during the training process, and to a computing device to which the method is applied.

Reinforcement learning is a learning method concerned with an agent that interacts with an environment to achieve a goal. The agent selects actions sequentially as time steps elapse. The agent receives a reward based on the effect each action has on the environment, and the probability distribution over the agent's actions is adjusted on the basis of that reward so that the actions the agent takes earn the maximum reward.

Meanwhile, the reward at each time step needs to be discounted more heavily the further in the future it lies relative to the current time. A discount factor is introduced to reflect this discounting.

With the discount factor applied, the total reward received by the agent can be calculated. The sum G_t of all discounted rewards after time step t can be obtained by Equation 1 below, and the value thus obtained is referred to as the return.
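For reference, the usual textbook form of the discounted return is sketched below. The symbols γ (the discount factor) and R_{t+k+1} (the reward received k+1 steps after time step t) are assumptions, since the patent's own notation for Equation 1 is not reproduced in this text.

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots
    = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}, \qquad 0 \le \gamma \le 1
```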

That is, the probability distribution over the agent's actions is adjusted so that this return is maximized, and this process can be understood as the agent's learning process.

The discount factor is usually set to a real number between 0 and 1. The closer it is to 0, the more weight is placed on the immediate reward; the closer it is to 1, the more nearly present and future rewards are treated as having equal value. The discount factor is generally set to a specific value at the start of reinforcement learning so as to reflect the intended learning direction of the model.
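The effect of the discount factor on the return can be illustrated with a minimal sketch. The function name `discounted_return` and the sample reward sequence are hypothetical and serve only to show how values near 0 and near 1 weight later rewards.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum the rewards after a time step, weighting the k-th one by gamma**k."""
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total


# Four equal rewards received at successive time steps (illustrative data).
rewards = [1.0, 1.0, 1.0, 1.0]

print(discounted_return(rewards, 0.0))  # only the immediate reward counts
print(discounted_return(rewards, 0.5))  # later rewards count progressively less
print(discounted_return(rewards, 1.0))  # all rewards count equally
```

With a discount factor of 0 only the first reward contributes, while a discount factor of 1 yields the plain sum of all rewards.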

Korean Registered Patent No. 1877243

The technical problem to be solved by the present invention is to provide a reinforcement learning method in which the discount factor is changed automatically as the environment changes, so that the direction of reinforcement learning is adjusted accordingly, and a device that applies the method to train a model that interprets or evaluates input data.

Another technical problem to be solved by the present invention is to provide a reinforcement learning method that continuously and autonomously switches between two learning directions, one that accurately identifies known patterns and one that detects suspicious patterns even when they are unknown, depending on whether the input data conforms to a known pattern or is data of an unknown pattern, and a device that applies the method to train a model that interprets or evaluates input data.

Yet another technical problem to be solved by the present invention is to provide a reinforcement learning method that can run on a low-level computing device with limited computing power while still being able to learn the latest data patterns, and a device that applies the method to train a model that interprets or evaluates input data.

Yet another technical problem to be solved by the present invention is to provide a low-level computing device that collects packets of devices connected to the same internal network and detects the threat indicated by those packets, and a threat detection system including the low-level computing device.

Yet another technical problem to be solved by the present invention is to provide a low-level computing device that performs reinforcement learning on known threats under the influence of a built-in known-pattern evaluation model, and on unknown patterns not detected by that model under the influence of an evaluation model that is continuously updated on a server, and a threat detection system including the low-level computing device and the server.

The technical problems of the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art from the description below.

A reinforcement learning method with an automatically adjusted discount factor according to one embodiment of the present invention, for solving the above technical problems, includes repeatedly training, using input data, a reinforcement learning model that determines an evaluation result for the input data. The repeating may include: obtaining first result data, which is the output produced by inputting the input data to the reinforcement learning model; obtaining second result data, which is the result of evaluating the input data using a first evaluation model; obtaining a first return value, which is the sum obtained by applying a discount factor to a first reward given in consideration of whether the first result data and the second result data match; training the reinforcement learning model using the first return value; and automatically adjusting the discount factor in consideration of the second result data.
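The sequence of steps in this embodiment can be sketched roughly as follows. Everything named here is hypothetical: `ToyModel`, `train`, the +1/-1 reward magnitudes, the way the first return value is accumulated over the rewards collected so far, and the placeholder `adjust_gamma` rule are illustrative assumptions, not the patent's implementation.

```python
from typing import Callable, List, Tuple


class ToyModel:
    """Hypothetical stand-in for the reinforcement learning model."""

    def __init__(self) -> None:
        self.updates: List[Tuple[float, float]] = []

    def predict(self, x: float) -> bool:
        # First result data: True means "pattern found".
        return x > 0.5

    def update(self, x: float, first_return: float) -> None:
        # A real model would adjust its policy here; the toy only records the call.
        self.updates.append((x, first_return))


def train(inputs: List[float],
          model: ToyModel,
          first_eval: Callable[[float], bool],
          adjust_gamma: Callable[[float, bool], float],
          gamma: float = 0.5) -> float:
    rewards: List[float] = []
    for x in inputs:
        first_result = model.predict(x)    # output of the RL model
        second_result = first_eval(x)      # output of the first evaluation model
        # First reward: +1 on a match, -1 on a mismatch (assumed magnitudes).
        rewards.append(1.0 if first_result == second_result else -1.0)
        # First return value: discounted sum over the rewards collected so far.
        first_return = sum((gamma ** k) * r for k, r in enumerate(rewards))
        model.update(x, first_return)
        # Adjust the discount factor in consideration of the second result data.
        gamma = adjust_gamma(gamma, second_result)
    return gamma


model = ToyModel()
final_gamma = train(
    inputs=[0.9, 0.1, 0.6],
    model=model,
    first_eval=lambda x: x > 0.8,  # toy known-pattern detector
    adjust_gamma=lambda g, found: max(0.0, g - 0.05) if found else g,  # placeholder
)
print(final_gamma, len(model.updates))
```

Passing the adjustment rule as a callable keeps the loop independent of any particular policy for changing the discount factor.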

In one embodiment, automatically adjusting the discount factor may include reducing the adjustment width of the discount factor as the input data is received more frequently.
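One way to picture this relationship is a step size that shrinks as the input rate grows. The inverse relation used below, and the names `adjustment_step`, `inputs_per_second`, and `base_step`, are assumptions; the text only states the direction of the dependence.

```python
def adjustment_step(inputs_per_second: float, base_step: float = 0.1) -> float:
    """Return a smaller discount-factor step when input arrives more often."""
    # Hypothetical rule: the step is inversely related to the input rate.
    return base_step / (1.0 + inputs_per_second)


for rate in (0.0, 9.0, 99.0):
    print(rate, adjustment_step(rate))
```

With such a rule, a burst of frequent inputs moves the discount factor in many small steps rather than a few large ones.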

In one embodiment, the repeating may further include obtaining a value, which is the expected value of a second return value, the second return value being the sum obtained by applying the discount factor to a second reward given on the basis of whether the first result data and third result data match, where the third result data is the result of evaluating the input data using a second evaluation model. In this case, training the reinforcement learning model includes training it using the first return value and the value, and automatically adjusting the discount factor may include adjusting it in further consideration of whether the first result data and the third result data match. Here, the first evaluation model may be a model that detects a plurality of known patterns, and the value may be obtained only when the first result data indicates that none of the plurality of known patterns has been found. Alternatively, the first evaluation model may be a model that detects a plurality of known patterns and consists of data downloaded to the computing device, and the second evaluation model may be a model that detects at least some of the plurality of known patterns together with new patterns not included in them, is stored on a server device connected to the computing device over a network, and is updated periodically or aperiodically to reflect learning results. In this case, the second evaluation model may be a model that underwent initial learning using at least some of the plurality of known patterns and was then further trained by transfer learning.

In one embodiment, the first evaluation model is a model that detects a plurality of known patterns collected from past data, and automatically adjusting the discount factor may include adjusting it in further consideration of whether the second result data indicates that one or more of the plurality of known patterns has been found. Specifically, when the second result data indicates that one or more of the plurality of known patterns has been found, the discount factor may be automatically decreased. Further, the second evaluation model may be a model that detects at least some of the plurality of known patterns together with new patterns not included in them and is updated periodically or aperiodically to reflect the results of machine learning on data collected periodically, aperiodically, or in real time; in that case, when the first result data indicates that no pattern has been found and the third result data indicates that a pattern has been found, the discount factor may be automatically increased.
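The two adjustment conditions in this embodiment can be sketched as a single function. The fixed `step`, the clamping to the range 0 to 1, and the precedence given to the "known pattern found" condition are assumptions made for illustration; the function and parameter names are hypothetical.

```python
from typing import Optional


def adjust_discount(gamma: float,
                    first_found: bool,
                    second_found: bool,
                    third_found: Optional[bool] = None,
                    step: float = 0.05) -> float:
    """Nudge the discount factor according to the evaluation results."""
    if second_found:
        # The first evaluation model found a known pattern: decrease.
        gamma -= step
    elif not first_found and third_found is True:
        # The RL model missed a pattern that the second evaluation model found: increase.
        gamma += step
    # Keep the discount factor within [0, 1].
    return min(1.0, max(0.0, gamma))


print(adjust_discount(0.5, first_found=True, second_found=True))
print(adjust_discount(0.5, first_found=False, second_found=False, third_found=True))
print(adjust_discount(0.5, first_found=False, second_found=False, third_found=False))
```

When neither condition holds, the discount factor is left unchanged in this sketch.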

In one embodiment, the input data is at least one of an inbound packet and an outbound packet of a device connected to the computing device over an internal network. The first evaluation model is a model that detects a plurality of known patterns collected from past packet data and is stored on the computing device. The second evaluation model is a model that detects at least some of the plurality of known patterns together with new patterns not included in them, is updated periodically or aperiodically to reflect the results of machine learning on data collected periodically, aperiodically, or in real time, and is stored on an external device connected to the computing device over an external network. The data constituting the reinforcement learning model may be stored on the computing device.

In one embodiment, the input data is time-series data of sensor values transmitted from an IoT sensor connected to the computing device over an internal network. The first evaluation model is a model that detects a plurality of known patterns collected from past sensor-value time-series data and is stored on the computing device. The second evaluation model is a model that detects at least some of the plurality of known patterns together with new patterns not included in them, is updated periodically or aperiodically to reflect the results of machine learning on data collected periodically, aperiodically, or in real time, and is stored on an external device connected to the computing device over an external network. The data constituting the reinforcement learning model may be stored on the computing device.

In one embodiment, the second evaluation model is a model that detects at least some of the plurality of known patterns together with new patterns not included in them and is updated periodically or aperiodically to reflect the results of machine learning on data collected periodically, aperiodically, or in real time, and the second reward may be given a negative value when the first result data and the third result data do not match. In this case, the absolute value of the second reward when the first result data indicates that a pattern was found and the third result data indicates that no pattern was found may be greater than the absolute value of the second reward when the first result data indicates that no pattern was found and the third result data indicates that a pattern was found.

In one embodiment, the first evaluation model is a model that detects a plurality of known patterns collected from past data, and the first reward may be given a negative value when the first result data and the second result data do not match. In this case, the absolute value of the first reward when the first result data indicates that no pattern was found and the second result data indicates that a pattern was found may be greater than the absolute value of the first reward when the first result data indicates that a pattern was found and the second result data indicates that no pattern was found.
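The asymmetric penalties described for the first and second rewards in the two preceding embodiments can be sketched as follows. The concrete magnitudes (+1, -1, -2) and the function names are assumptions; the text fixes only the sign on a mismatch and which mismatch carries the larger absolute value.

```python
def first_reward(first_found: bool, second_found: bool) -> float:
    """Reward from comparing the RL model with the first (known-pattern) model."""
    if first_found == second_found:
        return 1.0
    # Missing a known pattern is penalized more than a false alarm.
    return -2.0 if (not first_found and second_found) else -1.0


def second_reward(first_found: bool, third_found: bool) -> float:
    """Reward from comparing the RL model with the second (updated) model."""
    if first_found == third_found:
        return 1.0
    # Here a detection that the second model does not confirm is penalized more.
    return -2.0 if (first_found and not third_found) else -1.0


for found_by_rl in (False, True):
    for found_by_eval in (False, True):
        print(found_by_rl, found_by_eval,
              first_reward(found_by_rl, found_by_eval),
              second_reward(found_by_rl, found_by_eval))
```

Note that the heavier penalty falls on opposite kinds of mismatch for the two rewards.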

A gateway device according to another embodiment of the present invention includes: a network interface that is connected to a plurality of devices and collects input data, the input data being at least one of inbound packets and outbound packets of the plurality of devices; storage that stores data constituting a first evaluation model, which detects a plurality of known patterns collected from past packet data, and data constituting a reinforcement learning model; a memory into which a plurality of instructions, the data constituting the first evaluation model, and the data constituting the reinforcement learning model are loaded; and a processor that executes the plurality of instructions. The plurality of instructions include: an instruction for obtaining first result data, which is the output produced by inputting the input data to the reinforcement learning model; an instruction for obtaining second result data, which is the result of evaluating the input data using the first evaluation model; an instruction for obtaining a first return value, which is the sum obtained by applying a discount factor to a first reward given in consideration of whether the first result data and the second result data match; an instruction for training the reinforcement learning model using the first return value; an instruction for automatically adjusting the discount factor in consideration of the second result data; and an instruction for outputting the first result data.

In one embodiment, the plurality of instructions may further include an instruction for transmitting a request signal to a server device through the network interface when the first result data indicates that none of the plurality of known patterns has been found. The request signal is for obtaining a value, which is the expected value of a second return value, the second return value being the sum obtained by applying the discount factor to a second reward given on the basis of whether the first result data and third result data match. In this case, the instruction for training the reinforcement learning model may include an instruction for training the reinforcement learning model using the first return value and the value. In addition, the instruction for automatically adjusting the discount factor may include an instruction for adjusting the discount factor in further consideration of whether the first result data and the third result data match. Here, the third result data is the result of evaluating the input data using a second evaluation model, and the request signal may include the input data and discount factor data. The instruction for adjusting the discount factor in further consideration of whether the first result data and the third result data match may include an instruction for automatically increasing the discount factor when the first result data indicates that no pattern has been found and the third result data indicates that a pattern has been found.

In one embodiment, the instruction for automatically adjusting the discount factor in consideration of the second result data may include an instruction for automatically decreasing the discount factor when the second result data indicates that one or more of the plurality of known patterns has been found.

A reinforcement learning method according to yet another embodiment of the present invention includes repeatedly training, using input data, a reinforcement learning model that determines an evaluation result for the input data. The repeating may include: obtaining first result data, which is the output produced by inputting the input data to the reinforcement learning model; obtaining second result data, which is the result of evaluating the input data using a first evaluation model; obtaining third result data, which is the result of evaluating the input data using a second evaluation model; determining an evaluation model to be followed according to a result of comparing the second result data with the third result data; and training the reinforcement learning model using a reward determined according to a result of comparing the result data of the evaluation model to be followed with the first result data.

FIG. 1 is a diagram for explaining the improved reinforcement learning concept applied to some embodiments of the present invention.
FIG. 2 is a diagram for explaining an example in which the reinforcement learning concept described with reference to FIG. 1 is implemented.
FIG. 3 is a diagram for explaining an example in which, when the reinforcement learning described with reference to FIG. 2 is implemented, its functions are divided among a plurality of devices.
FIG. 4 is a block diagram of a threat detection system according to an embodiment of the present invention.
FIG. 5 is a block diagram of a gateway device according to another embodiment of the present invention.
FIG. 6 is a flowchart of a reinforcement learning method according to yet another embodiment of the present invention.
FIG. 7 is a flowchart for explaining a case in which the method described with reference to FIG. 6 is partially modified.
FIG. 8 is a diagram for explaining, case by case, the results of the method described with reference to FIGS. 6 and 7.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. The advantages and features of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the invention pertains are fully informed of the scope of the invention, which is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

Unless otherwise defined, all terms (including technical and scientific terms) used in this specification have the meanings commonly understood by those of ordinary skill in the art to which the present invention pertains. Terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless explicitly and specifically defined. The terminology used in this specification is for describing the embodiments and is not intended to limit the present invention. In this specification, the singular form includes the plural form unless the context specifically states otherwise. Hereinafter, some embodiments of the present invention are described with reference to the drawings.

First, the improved reinforcement learning concept applied to some embodiments of the present invention is described with reference to FIG. 1.

As is widely known, reinforcement learning is a process in which, when an agent performs an action and the state changes from a first state (State) to a second state (State'), a reward for the action is given, and the existing model is corrected to reflect the reward so that the agent selects a better action next time. The agent may select the action that yields the best immediate reward, or it may select the action whose summed reward is best when later time points are also taken into account. It is known that a discount factor may be applied to the rewards given at those later time points.

The reinforcement learning applied to some embodiments of the present invention differs from conventional reinforcement learning in the following respects. Note that, depending on the embodiment, only some of the features below may be applied.

First feature: the existing model is corrected by reflecting two kinds of environmental feedback on the action performed by the agent. That is, the existing agent model may be corrected by reflecting both the return (G_t), which is the sum of the discounted rewards at later time points, and the value of the value function.

Here, the reward used in computing the return is given in consideration of whether the agent's action matches the result data output by the first evaluation model, whereas the reward used in computing the value of the value function is given in consideration of whether the agent's action matches the result data output by the second evaluation model. In other words, the environmental feedback that the agent refers to when correcting the existing model consists of the return (G_t), which reflects whether the result data output by the first evaluation model matches the action selected by the agent, and the function value of the value function (hereinafter, the 'value'), which reflects whether the result data output by the second evaluation model matches the action selected by the agent.

In reinforcement learning, the value function (usually written v(s)) is a function that gives the expectation of the return. It is well known that value functions are divided into state-value functions and action-value functions. The value of an action-value function is also called a Q-value. Various references may be consulted for the meaning and examples of value functions.
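
As a minimal illustrative sketch, the return can be computed as a discounted sum of later rewards. The code below assumes the standard textbook definition (each reward weighted by the discount factor raised to its delay); the function name `discounted_return` and the example reward values are hypothetical, and Equation 1 of this document may differ in detail.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], discount: float) -> float:
    """Return sum_k discount**k * rewards[k] (textbook discounted return).

    `rewards[0]` is the reward that follows the action immediately;
    later entries are rewards received at later time steps.
    """
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount factor must lie in [0, 1]")
    total = 0.0
    # Accumulate from the last reward backwards: G = r + discount * G_next.
    for reward in reversed(rewards):
        total = reward + discount * total
    return total


if __name__ == "__main__":
    # With a discount factor of 0 only the immediate reward counts;
    # with a discount factor of 1 all rewards count equally.
    print(discounted_return([1.0, 1.0, 1.0], 0.0))
    print(discounted_return([1.0, 1.0, 1.0], 1.0))
```

Under this definition, a discount factor near '0' makes the return depend almost entirely on the immediate reward, while a discount factor near '1' gives later rewards nearly full weight.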

Second feature: the discount factor is adjusted automatically during reinforcement learning to reflect the environment. In addition, the return (G_t) described in the first feature and the value are computed using the same discount factor. The discount factor may be adjusted so that the agent follows the result data of the first evaluation model (toward '0') or so that it follows the result data of the second evaluation model (toward '1').

In one embodiment, the 'environment' reflected when the discount factor is automatically adjusted may mean, for example, which of the result data of the first evaluation model and the result data of the second evaluation model is more accurate. That is, if the first evaluation model yields better results than the second evaluation model in that 'environment', the discount factor is automatically adjusted so that the reinforcement learning model is corrected to follow the result data of the first evaluation model more closely than that of the second evaluation model.

In another embodiment, under the 'environment', the first evaluation model may be selected as the model to follow when neither the result data of the first evaluation model nor that of the second evaluation model is problematic, and the second evaluation model may be selected as the model to follow only when the result data of the first evaluation model is inaccurate. This embodiment is an efficient reinforcement learning strategy when learning that follows the first evaluation model is faster, or consumes fewer computing or networking resources, than learning that follows the second evaluation model.

The second feature is described by way of example. Suppose that the first evaluation model is a lightweight model that can detect only known patterns but can run even on a low-end computing device, and that the second evaluation model is a model that keeps collecting data and learning so that it can detect newly emerging patterns as well as known ones, and that can run only on a high-end computing device, such as a network-connected cloud computing environment, and not on a low-end computing device. In this case, the discount factor may be automatically adjusted to follow whichever evaluation model performs better in the 'environment', and to follow the first evaluation model in an environment where the two evaluation models perform equally well. Adjusting the discount factor in this way provides resource efficiency, since reinforcement learning can proceed on a low-end computing device. It also provides environmental adaptability: as the 'environment' shifts to one in which the second evaluation model gives more accurate results than the first, the result of the second evaluation model, which requires high-end computing resources, is also reflected in reinforcement learning.

The improved reinforcement learning concept applied to some embodiments of the present invention has been described above with reference to FIG. 1. Next, an example implementation of this concept is described with reference to FIG. 2.

The reinforcement learning model 20 shown in FIG. 2 is trained by reinforcement learning on the input data 10. That is, in the example of FIG. 2, the 'environment' in reinforcement learning terms is the input data 10. The reinforcement learning model 20 may be implemented as any of various models, such as Q-learning or a Deep Q Network (DQN).

The reinforcement learning model 20 receives the input data 10 and outputs first result data. That is, the 'action' in reinforcement learning terms is the first result data. Hereinafter, for ease of understanding, the first result data is assumed to indicate an evaluation result for the input data 10. For example, the first result data may be a determination of whether a security threat exists in the input data 10.

The input data 10 may also be provided to the first evaluation model 30. The first evaluation model 30 receives the input data 10 and outputs second result data. The second result data is likewise assumed to be a determination of whether a security threat exists in the input data 10. The first evaluation model 30 may be a model trained by machine learning to detect a plurality of known patterns that indicate the presence of a security threat. Because its machine learning is complete, the first evaluation model 30 may fail to detect new, unknown patterns. For the same reason, it can operate on its own once downloaded to a particular computing device. The first evaluation model 30 may be implemented on the basis of an artificial neural network, or as any of various other artificial-intelligence-based models for performing classification, clustering, and the like.

The input data 10 may also be provided to the second evaluation model 40. The second evaluation model 40 receives the input data 10 and outputs third result data. The third result data is likewise assumed to be a determination of whether a security threat exists in the input data 10. The second evaluation model 40 may be a model that detects at least some of the plurality of known patterns as well as new patterns not included in the plurality of known patterns. The second evaluation model 40 may be stored on a server device connected to the computing device over a network and updated periodically or aperiodically to reflect learning results. The second evaluation model 40 may be implemented on the basis of an artificial neural network, or as any of various other artificial-intelligence-based models for performing classification, clustering, and the like.

The second evaluation model 40 may be generated or updated as a result of collecting various data and learning from training datasets that are built, periodically or aperiodically, from the collected data. The data used to build the training datasets may be collected from various devices in real time or near real time.

In one embodiment, so that the learning speed and the accuracy of the trained model can be improved, the second evaluation model 40 may be a model that first undergoes initial learning using at least some of the plurality of known patterns and is then further trained by transfer learning.

Hereinafter, the logic 50 for training the reinforcement learning model 20 is described.

As shown in FIG. 2, the reinforcement learning model training logic (hereinafter, the 'training logic') 50 refers to the first result data and the second result data. To correct the reinforcement learning model 20 (correcting may be understood as synonymous with training or learning), the training logic 50 refers to a first return value (G_t). The first return value may be calculated, for example, using Equation 1. The first return value is computed by the first return value generation logic 51 and may be provided to the reinforcement learning model correction logic 52.

The first reward needed to calculate the first return value is given in consideration of whether the first result data and the second result data match. As already described, the reinforcement learning model 20 is trained to follow the first evaluation model 30 or the second evaluation model 40. Accordingly, the first reward will have a positive value if the first result data and the second result data point to the same conclusion, and a negative value if they point to different conclusions. However, once reinforcement learning has progressed to some extent, the first result data may be correct and the second result data wrong. In addition, when the pattern contained in the input data 10 is too easy to detect, it may be necessary to set the reward to '0' even if the first result data and the second result data point to the same conclusion. The statement that the first reward is given in consideration of whether the first result data and the second result data match should be understood as covering all of these situations.
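
The reward assignment described above can be sketched as follows. This is an illustrative assumption only: the reward magnitudes (+1, -1, 0) and the `trivial` flag that marks an overly easy pattern are hypothetical, and the actual criteria are those described later with reference to FIG. 8.

```python
def first_reward(first_result: bool, second_result: bool,
                 trivial: bool = False) -> float:
    """Hypothetical first reward.

    first_result:  verdict of the reinforcement learning model
                   (True = threat present).
    second_result: verdict of the first evaluation model.
    trivial:       assumed flag meaning the pattern was too easy to detect.
    """
    if first_result == second_result:
        # Matching verdicts earn a positive reward unless detection was trivial.
        return 0.0 if trivial else 1.0
    # Diverging verdicts earn a negative reward.
    return -1.0


if __name__ == "__main__":
    print(first_reward(True, True))
    print(first_reward(True, False))
    print(first_reward(True, True, trivial=True))
```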

For threat detection, the criteria by which the first reward is given, in consideration of whether the first result data and the second result data match, are described later with reference to FIG. 8. Because the first reward reflects whether the model correctly determines whether a known pattern is detected, it may be understood as having the character of a current reward for the action selected by the agent.

In addition, the discount factor (λ) needed to calculate the first return value may be provided by the discount factor management logic 53, which automatically adjusts the discount factor. The discount factor management logic 53 automatically adjusts the discount factor in consideration of the second result data.

The discount factor management logic 53 may vary the adjustment step of the discount factor according to the situation. For example, the more frequently the input data 10 is received, the smaller the adjustment step may be made, which prevents the discount factor from being driven too quickly to either extreme ('0' or '1'). Conversely, when the input data 10 remains consistent, the adjustment step may be increased so that the model followed by the reinforcement learning model changes quickly from the first evaluation model to the second evaluation model, or from the second to the first. For example, if the discount factor has been increased, or decreased, consecutively at least a predetermined threshold number of times, the adjustment step may be increased. Such a run means that, for whatever reason, the patterns contained in the input data 10 have shifted to conventional known patterns or to new, unknown patterns, and if that situation has persisted for some time, it is desirable to reflect the change quickly. For threat detection, the criteria by which the discount factor is automatically adjusted are described later with reference to FIG. 8.
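
One way to picture the streak-based widening of the adjustment step is the sketch below. It is a hedged illustration, not the disclosed implementation: the class name `DiscountFactorManager`, the step sizes, and the streak threshold are all assumed values, and the frequency-based narrowing of the step is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class DiscountFactorManager:
    """Hypothetical stand-in for the discount factor management logic 53."""
    discount: float = 0.5        # assumed starting value
    base_step: float = 0.05      # assumed normal adjustment step
    boosted_step: float = 0.15   # assumed step after a long same-direction run
    streak_threshold: int = 3    # assumed "predetermined threshold"
    _last_direction: int = 0
    _streak: int = 0

    def adjust(self, direction: int) -> float:
        """Move the discount factor up (+1) or down (-1) and return it."""
        if direction not in (-1, 1):
            raise ValueError("direction must be +1 or -1")
        # Count how many consecutive adjustments went the same way.
        if direction == self._last_direction:
            self._streak += 1
        else:
            self._last_direction = direction
            self._streak = 1
        # Widen the step once the run reaches the threshold.
        if self._streak >= self.streak_threshold:
            step = self.boosted_step
        else:
            step = self.base_step
        # Keep the discount factor inside [0, 1].
        self.discount = min(1.0, max(0.0, self.discount + direction * step))
        return self.discount


if __name__ == "__main__":
    manager = DiscountFactorManager()
    for d in (1, 1, 1, -1):
        print(round(manager.adjust(d), 2))
```

In this sketch a change of direction resets the run, so the wider step applies only while the input data keeps pushing the discount factor the same way.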

In addition to the first result data and the second result data, the training logic 50 may further refer to third result data to correct the reinforcement learning model and automatically adjust the discount factor.

In some embodiments, if the first evaluation model 30 is a model that detects a plurality of known patterns, the third result data may be further referred to only when the first result data indicates that one or more of the plurality of known patterns have not been found. This is because, if the first evaluation model 30 has detected a particular pattern, that result can be trusted, and there is no need to refer to the result data of the second evaluation model 40, which entails the consumption of computing resources, networking resources, and time.

To this end, the first result data, the input data 10, and the current discount factor managed by the discount factor management logic 53 may be shared with the second evaluation model 40. Using the shared discount factor, the first result data, and the shared input data, the computing environment that operates the second evaluation model 40 provides a value (the output of the value function), which is the expectation of a second return value obtained by summing the second reward with the shared discount factor applied. Here, the second reward is given on the basis of whether the first result data and the third result data match.

For threat detection, the criteria by which the second reward is given, in consideration of whether the first result data and the third result data match, are described later with reference to FIG. 8. Because the second reward makes up the expectation of the return value, it may be understood as having the character of a future reward for the action selected by the agent.

When the reinforcement learning model is corrected by further referring to the third result data in addition to the first result data and the second result data, the reinforcement learning model correction logic 52 trains the reinforcement learning model 20 using the first return value (G_t) and the value, and the discount factor management logic 53 may adjust the discount factor by further considering, in addition to the second result data, whether the first result data and the third result data match. For threat detection, the criteria by which the discount factor is automatically adjusted are described later with reference to FIG. 8.

In some embodiments, as shown in FIG. 3, a first device 100 may execute the first evaluation model 30, the reinforcement learning model 20, and the training logic 50, and a second device 200 may execute the second evaluation model 40. Here, the first device 100 may be a computing device with lower performance than the second device 200. Because the first evaluation model 30 only detects known patterns, it is a lightweight model that a low-performance computing device can run without difficulty. The reinforcement learning model 20 and its training logic 50 likewise do not require server-class performance. In some embodiments, however, unlike what is shown in FIG. 3, the computing device that executes the first evaluation model 30 and the computing device that executes the reinforcement learning model 20 and its training logic 50 may be physically separate devices.

A system for detecting security threats arising in a plurality of devices may be configured in which the first device 100 is connected to the plurality of devices, builds the input data 10 by collecting at least one of the inbound packets and outbound packets of the plurality of devices, executes the first evaluation model 30, the reinforcement learning model 20, and its training logic 50, and exchanges requests and responses with the second device 200, which executes the second evaluation model 40. The configuration and operation of such a threat detection system according to an embodiment of the present invention are described with reference to FIG. 4.

The threat detection system according to this embodiment includes a server device 200b and a gateway device 100a. The gateway device 100a is connected to a plurality of devices through an internal network 300 and collects at least one of the inbound packets and outbound packets of the plurality of devices. Such packets are easily collected when the gateway device 100a is directly connected to the plurality of devices on the internal network 300, or when the gateway device 100a is a device that relays the inbound and outbound packets of the plurality of devices. However, the connection of the plurality of devices to the gateway device 100a on the internal network 300 as shown in FIG. 4 is merely one embodiment, and the present invention should not be limited to that arrangement.

FIG. 4 shows the gateway device 100a relaying the inbound and outbound packets of various devices in a smart home, such as IoT sensors, digital door locks, and smart lamps, as they are exchanged with an external network (for example, the Internet). The gateway device 100a may be any of various computing devices, for example a smart speaker, an access point (AP), or an IoT hub. As already described, the gateway device 100a may be a device with low-end computing specifications.

The hardware configuration of one implementation of the gateway device according to this embodiment is described with reference to FIG. 5.

As shown in FIG. 5, the gateway device 100 according to this embodiment includes a processor 104, a memory 106, a storage 108, and a network interface 110.

The network interface 110 is connected to a plurality of devices and collects input data, which is at least one of the inbound packets and outbound packets of the plurality of devices.

The storage 108 stores data 180a constituting a first evaluation model that detects a plurality of known patterns collected from past packet data, and data 182a constituting a reinforcement learning model. The storage 108 may of course also store a software binary 184a for performing the methods according to some embodiments of the present invention.

The memory 106 loads a plurality of instructions 184b formed as a result of loading the software binary 184a, data 180b constituting the first evaluation model, and data 182b constituting the reinforcement learning model.

The processor 104 executes the plurality of instructions 184b.

The plurality of instructions may include: an instruction for obtaining first result data, which is the result output when the input data is input to the reinforcement learning model; an instruction for obtaining second result data, which is the result of evaluating the input data using the first evaluation model; an instruction for obtaining a first return value, which is the result of summing a first reward, given in consideration of whether the first result data and the second result data match, with a discount factor applied; an instruction for training the reinforcement learning model using the first return value; an instruction for automatically adjusting the discount factor in consideration of the second result data; and an instruction for outputting the first result data.

The plurality of instructions may further include an instruction for transmitting, when the first result data indicates that one or more of the plurality of known patterns have not been found, a request signal to the server device through the network interface in order to obtain a value, which is the expectation of a second return value obtained by summing a second reward, given on the basis of whether the first result data and the third result data match, with the discount factor applied. In this case, the instruction for training the reinforcement learning model may include an instruction for training the reinforcement learning model using the first return value and the value. The instruction for automatically adjusting the discount factor may include an instruction for adjusting the discount factor by further considering whether the first result data and the third result data match. The request signal may include the input data and the discount factor data.

In one embodiment, the instruction for adjusting the discount factor by further considering whether the first result data and the third result data match may include an instruction for automatically increasing the discount factor when the first result data indicates that no pattern has been found and the third result data indicates that a pattern has been found.

In one embodiment, the instruction for automatically adjusting the discount factor in consideration of the second result data may include an instruction for automatically decreasing the discount factor when the second result data indicates that one or more of the plurality of known patterns have been found.
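
The two adjustment rules just described can be combined in a short sketch. The function name `adjustment_direction`, the boolean encoding of the result data, and the order in which the two rules are checked are assumptions made for illustration; the disclosure does not fix how the rules are prioritized.

```python
from typing import Optional


def adjustment_direction(first_found: bool, second_found: bool,
                         third_found: Optional[bool] = None) -> int:
    """Return -1 to decrease, +1 to increase, 0 to leave the discount factor.

    first_found:  reinforcement learning model reports a pattern.
    second_found: first evaluation model reports a known pattern.
    third_found:  second evaluation model reports a pattern, or None when
                  the second evaluation model was not consulted.
    """
    # Rule 1: a known pattern was found by the first evaluation model,
    # so move toward '0' (follow the first evaluation model).
    if second_found:
        return -1
    # Rule 2: the reinforcement learning model missed a pattern that the
    # second evaluation model found, so move toward '1'.
    if third_found is not None and not first_found and third_found:
        return 1
    return 0


if __name__ == "__main__":
    print(adjustment_direction(first_found=True, second_found=True))
    print(adjustment_direction(first_found=False, second_found=False,
                               third_found=True))
```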

Hereinafter, a reinforcement learning method according to another embodiment of the present invention is described with reference to FIGS. 6 and 7. For ease of understanding, matters already described are not repeated, but the technical configurations and ideas of the embodiments described above may also be applied to the reinforcement learning method according to this embodiment. The method according to this embodiment may be performed by a computing device. The computing device may be, for example, the gateway device 100a of FIG. 4. Note, however, that in some embodiments the method may be performed in a distributed manner by a plurality of physically separate computing devices.

In step S100, the collection of input data is detected. If the input data is an inbound or outbound packet of a device, a single packet may form one item of input data, or a predetermined number of consecutive packets may together form one item of input data. Exemplary collection targets are shown in Table 1 below.
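
The two ways of forming input data from packets can be sketched with a small generator. The name `group_packets` and the choice to hold back a trailing incomplete group are assumptions for illustration; packets are represented simply as `bytes`.

```python
from typing import Iterable, Iterator, List


def group_packets(packets: Iterable[bytes],
                  group_size: int = 1) -> Iterator[List[bytes]]:
    """Yield one item of input data per `group_size` consecutive packets.

    group_size=1 treats every packet as its own item of input data.
    A trailing group with fewer than `group_size` packets is not yielded
    (assumed behavior; it would wait for more packets in a live system).
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    batch: List[bytes] = []
    for packet in packets:
        batch.append(packet)
        if len(batch) == group_size:
            yield batch
            batch = []


if __name__ == "__main__":
    stream = [b"p1", b"p2", b"p3", b"p4", b"p5"]
    for item in group_packets(stream, group_size=2):
        print(item)
```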

In step S102, the input data is preprocessed and its features are extracted. Through long-term research, the inventors of the present invention selected a total of 41 packet features suitable for threat detection. The 41 features are shown in Tables 2 to 5 below.

In step S104, the input data is input to the reinforcement learning model to obtain first result data, and in step S106, the input data is input to the first evaluation model to obtain second result data. Next, in step S108, a first return value is obtained, which is the result of summing, with the discount factor applied, the first reward given in consideration of whether the first result data and the second result data match.
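
As a minimal sketch of what "summing with the discount factor applied" usually means in reinforcement learning, the return can be computed as the sum of rewards weighted by increasing powers of the discount factor. This passage does not spell out the formula, so the standard definition is assumed here; the function name `discounted_return` and the restriction of the discount factor to the range 0 to 1 are likewise assumptions.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], discount_factor: float) -> float:
    """Return r_0 + g*r_1 + g^2*r_2 + ... for discount factor g.

    Assumes the conventional definition of a discounted return; the
    description only states that rewards are summed with the discount
    factor applied.
    """
    if not 0.0 <= discount_factor <= 1.0:
        raise ValueError("discount_factor is assumed to lie in [0, 1]")
    total = 0.0
    # Accumulate from the last reward backwards so each step adds one
    # more factor of the discount to everything that follows it.
    for reward in reversed(rewards):
        total = reward + discount_factor * total
    return total


if __name__ == "__main__":
    # Three consecutive +1 rewards under two different discount factors.
    print(discounted_return([1.0, 1.0, 1.0], 0.5))
    print(discounted_return([1.0, 1.0, 1.0], 1.0))
```

A smaller discount factor makes the return depend mostly on the immediate reward, which is why lowering it pushes learning toward following the current evaluation result.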

In step S120, the reinforcement learning model is trained using the first return value. In step S122, the discount factor is adjusted in consideration of the second result data. The adjustment of the discount factor is described in detail below with reference to FIG. 8. In step S124, the first result data is output as the result of analyzing or evaluating the input data.

In some embodiments of the present invention, as shown in FIG. 7, reinforcement learning may additionally take into account the result data of a second evaluation model for the input data. To this end, when the second result data of the first evaluation model indicates that no known pattern was found in the input data (S110), a signal requesting evaluation of the input data may be transmitted to the computing device that executes the second evaluation model (S112). The signal may include the first result data, which is the reinforcement learning model's evaluation result for the input data, the current discount factor, and the input data. In step S114, a value may be provided, in response to the signal, by the computing device that executes the second evaluation model.
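
The content of the request signal of steps S110 to S112 can be sketched as a small data structure. The three fields follow the description above; everything else (the class name `EvaluationRequest`, the function `build_request`, and the string labels `"threat"` and `"normal"`) is hypothetical and chosen only for illustration. The transport to the remote computing device is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical labels for evaluation results; the description does not
# specify how result data is encoded.
THREAT = "threat"
NORMAL = "normal"


@dataclass(frozen=True)
class EvaluationRequest:
    """Signal sent to the device that runs the second evaluation model."""
    first_result: str        # reinforcement learning model's evaluation
    discount_factor: float   # current discount factor
    input_data: bytes        # the input data to be evaluated


def build_request(first_result: str, second_result: str,
                  discount_factor: float,
                  input_data: bytes) -> Optional[EvaluationRequest]:
    """Build a request only when the first evaluation model found no known pattern."""
    if second_result != NORMAL:
        # A known pattern was found, so the second model is not consulted.
        return None
    return EvaluationRequest(first_result, discount_factor, input_data)


if __name__ == "__main__":
    print(build_request(THREAT, NORMAL, 0.9, b"\x00\x01"))
    print(build_request(THREAT, THREAT, 0.9, b"\x00\x01"))
```

Returning `None` when a known pattern is found mirrors the branch at S110, where the request is skipped.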

Thereafter, in step S121, the reinforcement learning model is trained using the first return value and the value, and in step S123, the discount factor may be adjusted in consideration of the second result data and of whether the first result data and the third result data (the second evaluation model's evaluation of the input data) match. In step S124, the first result data is output as the result of analyzing or evaluating the input data.

Hereinafter, the method according to the present embodiment is described more concretely with reference to FIG. 8.

FIG. 8 explains, case by case, the first reward, the second reward, and the automatic adjustment of the discount factor, on the premise that the method according to the present embodiment is used for threat detection through analysis of packet data. Here, the first evaluation model is a model that detects known patterns, and the second evaluation model is a model produced by threat-pattern-detection machine learning that is continuously updated on a cloud server or the like. That is, for the input data to be evaluated by the second evaluation model, the gateway device must send a request to the cloud server. In contrast, the first evaluation model and the model undergoing reinforcement learning can be understood as being embedded in the gateway device.

When the first evaluation model evaluates the input data as a detected threat, this result is reliable, so the second evaluation model need not be involved. In an environment that produces input data that the first evaluation model evaluates as a threat, it is desirable for reinforcement learning to proceed so as to follow the results of the first evaluation model, so the discount factor is adjusted downward. In this case, the first reward is positive (+) when the first result data, which is the reinforcement learning model's evaluation of the input data, matches the second result data, which is the first evaluation model's evaluation of the same input data, and negative (-) when they do not match. However, if the first evaluation model evaluated the input data as a threat but the reinforcement learning model judged it normal, the absolute value of the negative reward may be made larger so that the reinforcement learning model does not miss known patterns in the future.

That is, in some embodiments, the discount factor is adjusted in consideration of whether the second result data of the first evaluation model indicates a detected threat, and it is adjusted downward when the first evaluation model evaluates the input data as a threat.

When the first evaluation model evaluates the input data as normal, either the input data is actually normal or the input data contains an unknown threat pattern. It is therefore necessary to additionally request an evaluation of the input data from the second evaluation model and to carry out reinforcement learning reflecting that result. If the second evaluation model also evaluates the input data as normal, the discount factor need not change. The second reward is then given according to whether the first result data and the third result data match. If they match, a positive (+) second reward is given. Conversely, if they do not match, that is, if both the first and second evaluation models evaluate the input data as normal but only the reinforcement learning model evaluates it as a threat, a negative (-) second reward is given. In this case, the absolute value of the negative reward may be made larger so that the reinforcement learning model does not later misjudge a normal pattern as a threat.

When the first evaluation model evaluates the input data as normal but the second evaluation model evaluates it as a threat, the input data can be understood to contain a threat of an unknown pattern. In this case, it is desirable for the discount factor to be automatically adjusted upward so that the reinforcement learning model is trained to detect unknown patterns well. The second reward is again given according to whether the first result data and the third result data match: if the reinforcement learning model, like the second evaluation model, evaluated the input data as a threat, a positive (+) reward is given, and if the reinforcement learning model evaluated it as normal, a negative (-) reward is given.
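
The three cases of FIG. 8 described above can be collected into one decision function. This is a hedged sketch rather than the patented implementation: the direction of each adjustment and the sign of each reward follow the description, while the function name `adjust_and_reward`, the labels `"threat"` and `"normal"`, the step size, the reward magnitudes, and the clamping of the discount factor to the range 0 to 1 are all assumptions made for illustration.

```python
from typing import Optional, Tuple

THREAT = "threat"   # hypothetical label for "threat detected"
NORMAL = "normal"   # hypothetical label for "no threat found"


def adjust_and_reward(rl_result: str,
                      first_model_result: str,
                      second_model_result: Optional[str],
                      discount_factor: float,
                      step: float = 0.05,
                      reward: float = 1.0,
                      large_penalty: float = 2.0) -> Tuple[float, float]:
    """Return (new discount factor, reward) for one item of input data."""
    if first_model_result == THREAT:
        # Case 1: known pattern found. Follow the first evaluation model,
        # lower the discount factor, and penalize a miss more heavily.
        new_factor = max(0.0, discount_factor - step)
        r = reward if rl_result == THREAT else -large_penalty
        return new_factor, r

    if second_model_result is None:
        raise ValueError("second model result is required when no known pattern is found")

    if second_model_result == NORMAL:
        # Case 2: both evaluation models say normal. Keep the discount
        # factor and penalize a false alarm more heavily.
        r = reward if rl_result == NORMAL else -large_penalty
        return discount_factor, r

    # Case 3: unknown threat pattern. Raise the discount factor.
    new_factor = min(1.0, discount_factor + step)
    r = reward if rl_result == THREAT else -reward
    return new_factor, r


if __name__ == "__main__":
    print(adjust_and_reward(NORMAL, THREAT, None, 0.5, step=0.25))
    print(adjust_and_reward(THREAT, NORMAL, NORMAL, 0.5, step=0.25))
    print(adjust_and_reward(THREAT, NORMAL, THREAT, 0.5, step=0.25))
```

In the first case the returned reward plays the role of the first reward, and in the other two cases it plays the role of the second reward, since those cases compare the reinforcement learning model against the second evaluation model.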

The methods according to the embodiments of the present invention described so far may be performed by executing a computer program embodied as computer-readable code. The computer program may be transmitted from a first electronic device to a second electronic device over a network such as the Internet and installed on the second electronic device, where it can then be used. The first electronic device and the second electronic device include server devices, physical servers belonging to a server pool for a cloud service, and stationary electronic devices such as desktop PCs.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, a person of ordinary skill in the art to which the present invention pertains will understand that the present invention may be practiced in other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive.

Claims

A discount factor auto-adjusting reinforcement learning method executed by a computing device, the method comprising:
repeatedly training, using input data, a reinforcement learning model that determines an evaluation result for the input data,
wherein the repeated training comprises:
obtaining first result data, which is output by inputting the input data into the reinforcement learning model;
obtaining second result data, which is a result of evaluating the input data using a first evaluation model;
obtaining a first return value, which is the result of summing, with a discount factor applied, a first reward given in consideration of whether the first result data and the second result data match;
training the reinforcement learning model using the first return value; and
automatically adjusting the discount factor in consideration of the second result data.

The discount factor auto-adjusting reinforcement learning method of claim 1,
wherein automatically adjusting the discount factor comprises
reducing the adjustment width of the discount factor as the input data is received more frequently.

The discount factor auto-adjusting reinforcement learning method of claim 1,
wherein the repeated training further comprises
obtaining a value, which is an expected value of a second return value, the second return value being the result of summing, with the discount factor applied, a second reward given based on whether the first result data and third result data match, the third result data being a result of evaluating the input data using a second evaluation model,
wherein training the reinforcement learning model comprises
training the reinforcement learning model using the first return value and the value, and
wherein automatically adjusting the discount factor comprises
adjusting the discount factor by further considering whether the first result data and the third result data match.

The discount factor auto-adjusting reinforcement learning method of claim 3,
wherein the first evaluation model is
a model that detects a plurality of known patterns, and
wherein obtaining the value comprises
obtaining the value only when the first result data indicates that one or more of the plurality of known patterns was not found.

The discount factor auto-adjusting reinforcement learning method of claim 3,
wherein the first evaluation model is
a model that detects a plurality of known patterns and is composed of data downloaded to the computing device, and
wherein the second evaluation model is
a model that detects at least some of the plurality of known patterns and new patterns not included in the plurality of known patterns, is stored in a server device connected to the computing device over a network, and is updated periodically or aperiodically to reflect learning results.

The discount factor auto-adjusting reinforcement learning method of claim 5,
wherein the second evaluation model is
a model that undergoes initial training using at least some of the plurality of known patterns and is then additionally trained by transfer learning.

The discount factor auto-adjusting reinforcement learning method of claim 3,
wherein the first evaluation model is
a model that detects a plurality of known patterns collected from past data, and
wherein automatically adjusting the discount factor comprises
automatically adjusting the discount factor by further considering whether the second result data indicates that one or more of the plurality of known patterns were found.

The discount factor auto-adjusting reinforcement learning method of claim 7,
wherein automatically adjusting the discount factor by further considering whether the second result data indicates that one or more of the plurality of known patterns were found comprises
automatically decreasing the discount factor when the second result data indicates that one or more of the plurality of known patterns were found.

The discount factor auto-adjusting reinforcement learning method of claim 7,
wherein the second evaluation model is
a model that detects at least some of the plurality of known patterns and new patterns not included in the plurality of known patterns, and is updated periodically or aperiodically to reflect the results of machine learning using data collected periodically, aperiodically, or in real time, and
wherein automatically adjusting the discount factor by further considering whether the second result data indicates that one or more of the plurality of known patterns were found comprises
automatically increasing the discount factor when the first result data indicates that a pattern was not found and the third result data indicates that a pattern was found.

The discount factor auto-adjusting reinforcement learning method of claim 3,
wherein the input data is at least one of an inbound packet and an outbound packet of a device connected to the computing device over an internal network,
wherein the first evaluation model is
a model that detects a plurality of known patterns collected from past packet data and is stored in the computing device,
wherein the second evaluation model is
a model that detects at least some of the plurality of known patterns and new patterns not included in the plurality of known patterns, is updated periodically or aperiodically to reflect the results of machine learning using data collected periodically, aperiodically, or in real time, and is stored in an external device connected to the computing device over an external network, and
wherein the data constituting the reinforcement learning model is stored in the computing device.

The discount factor auto-adjusting reinforcement learning method of claim 3,
wherein the input data is time series data of sensor values transmitted from an IoT sensor connected to the computing device over an internal network,
wherein the first evaluation model is
a model that detects a plurality of known patterns collected from past time series data of sensor values and is stored in the computing device,
wherein the second evaluation model is
a model that detects at least some of the plurality of known patterns and new patterns not included in the plurality of known patterns, is updated periodically or aperiodically to reflect the results of machine learning using data collected periodically, aperiodically, or in real time, and is stored in an external device connected to the computing device over an external network, and
wherein the data constituting the reinforcement learning model is stored in the computing device.

The discount factor auto-adjusting reinforcement learning method of claim 3,
wherein the second evaluation model is
a model that detects at least some of the plurality of known patterns and new patterns not included in the plurality of known patterns, and is updated periodically or aperiodically to reflect the results of machine learning using data collected periodically, aperiodically, or in real time,
wherein the second reward
is given a negative value when the first result data and the third result data do not match, and
wherein the absolute value of the second reward when the first result data indicates that a pattern was found and the third result data indicates that no pattern was found is greater than the absolute value of the second reward when the first result data indicates that no pattern was found and the third result data indicates that a pattern was found.

The discount factor auto-adjusting reinforcement learning method of claim 1,
wherein the first evaluation model is
a model that detects a plurality of known patterns collected from past data,
wherein the first reward
is given a negative value when the first result data and the second result data do not match, and
wherein the absolute value of the first reward when the first result data indicates that no pattern was found and the second result data indicates that a pattern was found is greater than the absolute value of the first reward when the first result data indicates that a pattern was found and the second result data indicates that no pattern was found.

A gateway device with threat detection, comprising:
a network interface that is connected to a plurality of devices and collects input data that is at least one of inbound packets and outbound packets of the plurality of devices;
a storage that stores data constituting a first evaluation model, which detects a plurality of known patterns collected from past packet data, and data constituting a reinforcement learning model;
a memory into which a plurality of instructions, the data constituting the first evaluation model, and the data constituting the reinforcement learning model are loaded; and
a processor that executes the plurality of instructions,
wherein the plurality of instructions comprise:
an instruction for obtaining first result data, which is output by inputting the input data into the reinforcement learning model;
an instruction for obtaining second result data, which is a result of evaluating the input data using the first evaluation model;
an instruction for obtaining a first return value, which is the result of summing, with a discount factor applied, a first reward given in consideration of whether the first result data and the second result data match;
an instruction for training the reinforcement learning model using the first return value;
an instruction for automatically adjusting the discount factor in consideration of the second result data; and
an instruction for outputting the first result data.

The gateway device with threat detection of claim 14,
wherein the plurality of instructions further comprise
an instruction for transmitting, to a server device through the network interface when the first result data indicates that one or more of the plurality of known patterns was not found, a request signal for obtaining a value, which is an expected value of a second return value, the second return value being the result of summing, with the discount factor applied, a second reward given based on whether the first result data and third result data match,
wherein the instruction for training the reinforcement learning model comprises
an instruction for training the reinforcement learning model using the first return value and the value,
wherein the instruction for automatically adjusting the discount factor comprises
an instruction for adjusting the discount factor by further considering whether the first result data and the third result data match,
wherein the third result data is a result of evaluating the input data using a second evaluation model, and
wherein the request signal includes the input data and discount factor data.

The gateway device with threat detection of claim 15,
wherein the instruction for adjusting the discount factor by further considering whether the first result data and the third result data match comprises
an instruction for automatically increasing the discount factor when the first result data indicates that a pattern was not found and the third result data indicates that a pattern was found.

The gateway device with threat detection of claim 14,
wherein the instruction for automatically adjusting the discount factor in consideration of the second result data comprises
an instruction for automatically decreasing the discount factor when the second result data indicates that one or more of the plurality of known patterns were found.

A reinforcement learning method executed by a computing device, the method comprising:
repeatedly training, using input data, a reinforcement learning model that determines an evaluation result for the input data,
wherein the repeated training comprises:
obtaining first result data, which is output by inputting the input data into the reinforcement learning model;
obtaining second result data, which is a result of evaluating the input data using a first evaluation model;
obtaining third result data, which is a result of evaluating the input data using a second evaluation model;
determining an evaluation model to follow according to a result of comparing the second result data and the third result data; and
training the reinforcement learning model using a reward determined according to a result of comparing the result data of the evaluation model to follow with the first result data.