KR102458103B1 - Reinforcement learning apparatus and method for multiple classification

Info

Publication number: KR102458103B1
Application number: KR1020220040037A
Authority: KR
Inventors: 이성령; 팜 투옌 르; 김동석
Original assignee: 주식회사 애자일소다
Priority date: 2022-03-31
Filing date: 2022-03-31
Publication date: 2022-10-25

Abstract

Disclosed are a reinforcement learning apparatus and method for multiple classification. The invention increases classification accuracy by maximizing the expected reward in the classification training of an artificial intelligence model through reinforcement learning using a policy gradient, and provides high classification accuracy by improving defense against adversarial attacks. The reinforcement learning apparatus comprises a policy unit and a policy loss function unit.

Description

REINFORCEMENT LEARNING APPARATUS AND METHOD FOR MULTIPLE CLASSIFICATION

The present invention relates to a reinforcement learning apparatus and method for multiple classification and, more particularly, to a reinforcement learning apparatus and method for multiple classification that increases classification accuracy by maximizing the expected reward in the classification training of an artificial intelligence model through reinforcement learning using a policy gradient, and that improves defense against adversarial attacks.

Reinforcement learning is a learning method that deals with an agent that interacts with an environment to achieve a goal. Its purpose is to find out which actions the reinforcement learning agent, the acting subject of the learning, should take in order to receive more reward. The agent selects actions sequentially as time steps pass and receives a reward based on the effect each action has on the environment.

FIG. 1 is a block diagram showing the configuration of a reinforcement learning apparatus according to the prior art. As shown in FIG. 1, the agent 10 learns how to determine an action through training of a reinforcement learning model; each action affects the next state, and the degree of success can be measured as a reward.

The environment 20 comprises all the rules, such as the actions the agent 10 can take and the rewards that follow from them. States, actions, and rewards are all components of the environment, and everything that is determined other than the agent 10 is the environment.

Meanwhile, as public interest in deep learning has grown in recent years and the technology has been widely applied to fields such as autonomous driving and biometrics, two major drawbacks have come to the fore.

The first is the uncertainty inherent in a deep learning model. The second is that, compared with machine learning based on explicitly defined (handcrafted) features, a trained deep learning model behaves like a black box, so its inference results are not explainable.

Adversarial attacks that exploit these drawbacks of deep learning models are being attempted; such attacks can deceive an image classification system by interfering with tasks such as object detection and image classification.

In particular, in multi-class image classification, an adversarial attack generates minimal adversarial noise on a legitimate image. When an adversarial image is produced by adding pixel perturbations to the image, the fewer pixels are perturbed, the more effective the attack technique is.

However, image classification or multi-class image classification training using reinforcement learning according to the prior art learns a policy based on a policy gradient.

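To this end, the reward function may be defined as follows.

$$ r(s_k, a_k) = \begin{cases} w_{k1}, & a_k = y_k \\ -w_{k2}, & a_k \neq y_k \end{cases} $$

Here, k = 1, …, n, a_k is the action, y_k is the label of the state s_k, and w_{k1} and w_{k2} are positive values.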

In addition, the policy loss function may be defined as follows.

[equation not reproduced in the source]

Here, 'a' denotes the one-hot vector of the action a.

In addition, the policy gradient for the above function can be expressed as follows.

[equation not reproduced in the source]

Here, [definition not reproduced in the source].

In this policy gradient, the right-hand side is the cross-entropy loss function used in conventional multi-class classification, and the left-hand side is the policy loss function.

However, the log of the probability is multiplied by a reward (generally positive) when the sample is classified correctly and by a penalty (generally negative) when it is not. In the latter case, the policy gradient updates the policy parameters along the reverse direction of the cross-entropy loss gradient, so learning does not proceed well.

That is, the policy gradient updates the policy parameters either along the cross-entropy loss gradient or along its reverse, which means the parameters are updated in the direction of either increasing accuracy or avoiding a wrong action. The problem is that the policy loss thereby focuses on the action rather than on the target.

In addition, whereas supervised learning continually presents the correct answer for each policy action during training and can therefore adjust the softmax value corresponding to that answer, the policy gradient changes the weights in a direction that negates the action rather than toward the correct answer. As a result, defensive learning against adversarial attacks in particular does not proceed well.

Korean Registered Patent Publication No. 10-2304661 (Title of invention: Adversarial training method independent of attack technique for robust adversarial defense)

To solve these problems, an object of the present invention is to provide a reinforcement learning apparatus and method for multiple classification that increases classification accuracy by maximizing the expected reward in the classification training of an artificial intelligence model through reinforcement learning using a policy gradient, and that improves defense against adversarial attacks.

To achieve the above object, an embodiment of the present invention is a reinforcement learning apparatus for multiple classification that includes an agent 100 which learns how to determine an action through training of a reinforcement learning model, where each action affects the next state and the degree of success is measured as a reward. The agent includes: a policy unit that predicts the correct answer of classification data so that the expected reward is maximized through a policy in multi-class image classification; and a policy loss function unit that trains the policy so that the policy unit predicts the correct answer with the highest probability of expected reward using a policy gradient.

In addition, the policy loss function unit 120 according to the embodiment learns the policy based on a policy gradient using the following formula.

[equation not reproduced in the source]

Here, [definition not reproduced in the source], C is the number of classes, and i = 1, …, C.

In addition, the policy loss function unit 120 according to the embodiment learns the policy so that [quantity not reproduced in the source] is lowered.

In addition, the policy loss function unit according to the embodiment includes: a first loss function unit that performs learning based on a first policy loss function; and a second loss function unit that performs learning based on a second policy loss function once the learning of the first loss function unit is complete.

In addition, the first policy loss function according to the embodiment may use the following formula.

[equation not reproduced in the source]

In addition, the second policy loss function according to the embodiment uses the following formula.

[equation not reproduced in the source]

Here, k = 1, …, n and i = 1, …, C.

In addition, an embodiment of the present invention is a reinforcement learning method for multiple classification, which may include the step of: a) an agent predicting the correct answer of classification data so that the expected reward is maximized through a policy in multi-class image classification, while training the policy so that the correct answer is predicted with the highest probability of expected reward using a policy gradient.

In addition, the agent according to the embodiment uses the following formula.

[equation and its definitions not reproduced in the source]

In addition, the agent according to the embodiment learns the policy so that [quantity not reproduced in the source] is lowered.

In addition, step a) according to the embodiment includes: a-1) the agent performing learning based on a first policy loss function; and a-2) the agent performing learning based on a second policy loss function once the learning based on the first policy loss function is complete.

In addition, the first policy loss function according to the embodiment uses the following formula.

[equation not reproduced in the source]

The present invention has the advantage of increasing classification accuracy by maximizing the expected reward in the classification training of an artificial intelligence model through reinforcement learning using a policy gradient.

In addition, the present invention has the advantage of providing high classification accuracy by improving defense against adversarial attacks.

FIG. 1 is a block diagram showing the configuration of a general reinforcement learning apparatus.
FIG. 2 is a block diagram showing the configuration of a reinforcement learning apparatus for multiple classification according to an embodiment of the present invention.
FIG. 3 is a block diagram showing the policy loss function of the reinforcement learning apparatus for multiple classification according to the embodiment of FIG. 2.
FIG. 4 is a flowchart illustrating a reinforcement learning method for multiple classification according to an embodiment of the present invention.

Hereinafter, the present invention is described in detail with reference to preferred embodiments and the accompanying drawings, on the premise that the same reference numerals in the drawings denote the same elements.

Before describing the specific details for carrying out the invention, it should be noted that components not directly related to the technical gist of the invention have been omitted to the extent that doing so does not obscure that gist.

In addition, the terms and words used in this specification and the claims should be interpreted with meanings and concepts consistent with the technical idea of the invention, based on the principle that an inventor may define the concept of a term appropriately in order to describe the invention in the best way.

In this specification, the statement that a part "includes" an element means that it may further include other elements, not that other elements are excluded.

In addition, terms such as "… unit", "… device", and "… module" refer to a unit that processes at least one function or operation, which may be implemented in hardware, software, or a combination of the two.

In addition, the term "at least one" is defined as covering both the singular and the plural; even where the term is not used, it is self-evident that each element may be present in the singular or the plural and may mean either.

In addition, whether each element is provided in the singular or the plural may vary depending on the embodiment.

Hereinafter, preferred embodiments of the reinforcement learning apparatus and method for multiple classification according to an embodiment of the present invention are described in detail with reference to the accompanying drawings.

FIG. 2 is a block diagram showing the configuration of a reinforcement learning apparatus for multiple classification according to an embodiment of the present invention, and FIG. 3 is a block diagram showing the policy loss function of the reinforcement learning apparatus for multiple classification according to the embodiment of FIG. 2.

Referring to FIGS. 2 and 3, the reinforcement learning apparatus for multiple classification according to an embodiment of the present invention may include an agent 100, which learns how to determine an action through training of a reinforcement learning model, where each action affects the next state and the degree of success is measured as a reward, and an environment 200, which provides the actions the agent 100 can take and the corresponding rewards.

The agent 100 may include a policy unit 110 that predicts the correct answer of classification data so that the expected reward is maximized through a policy in multi-class image classification, and a policy loss function unit 120 that trains the policy so that the policy unit 110 predicts the correct answer with the highest probability of expected reward using a policy gradient.

In addition, the agent 100 may be based on reinforcement learning that makes decisions by learning labeled instances sampled from a data set, where an instance is a concrete object having the attributes and properties defined in its class.

The state (s) is data drawn at random from the same state space as the data set used in supervised learning.

The action of the agent, denoted a_k (k = 1, …, N), is the same as classification in supervised learning: it selects one of N candidate answers and takes a value between 1 and N. The correct answer (label) is the label corresponding to the state; it also takes a value between 1 and N and is denoted y_k (k = 1, …, N).

The reward is a function that gives a positive value w_k1 when the action matches the correct answer and a negative value -w_k2 (where w_k2 is positive) when it does not.
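The reward rule just described can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `reward` and the default weights of 1.0 are hypothetical, and a single pair of weights stands in for the per-sample values w_k1 and w_k2.

```python
def reward(action: int, label: int, w_correct: float = 1.0, w_wrong: float = 1.0) -> float:
    """Return +w_correct when the action matches the label, otherwise -w_wrong.

    w_correct and w_wrong play the role of w_k1 and w_k2 in the text and
    must both be positive. The default values are illustrative only.
    """
    if w_correct <= 0 or w_wrong <= 0:
        raise ValueError("both weights must be positive")
    return w_correct if action == label else -w_wrong


r_match = reward(action=3, label=3)                   # action equals the label
r_miss = reward(action=2, label=3, w_wrong=0.5)       # action differs from the label
```

Keeping the two weights separate makes it possible to penalize a wrong action more or less strongly than a correct one is rewarded.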

The policy is an artificial neural network. Its final output, obtained through a softmax function, is the value with which the correct answer of the data can be predicted with the highest probability, and is denoted π(s).

The policy (π) predicts an action from the given data and the neural network, and the action can be drawn from the label space according to the predicted probabilities for a sample x of the given data.
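Drawing an action from the label space according to the softmax output can be illustrated as follows. This is a sketch using only the Python standard library: the logits are made-up numbers standing in for the output of the policy network, the names `softmax` and `sample_action` are hypothetical, and classes are indexed from 0 here rather than from 1 as in the text.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng):
    """Draw one class index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 0.5, -1.0]   # illustrative network output for one sample
probs = softmax(logits)     # plays the role of pi(s)
rng = random.Random(0)      # seeded generator for reproducibility
action = sample_action(probs, rng)
```

Sampling, rather than always taking the most probable class, is what lets the agent occasionally choose a wrong action and receive the negative reward.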

The policy with which the policy unit 110 predicts the correct answer is an artificial neural network and may be configured as a CNN-based model.

The policy loss function unit 120 learns the policy so that [quantity not reproduced in the source] of the policy loss function is lowered, and the loss may be such that a uniform distribution is provided over the remaining classes, excluding the class corresponding to the wrong action.
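One way to read the statement about a uniform distribution over the remaining classes is as a cross-entropy against a target that puts zero mass on the wrong action and equal mass on every other class. The sketch below is an assumption made purely for illustration, since the formula itself is not reproduced in this text; the name `complement_uniform_loss` and the 0-based class indices are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def complement_uniform_loss(logits, wrong_action):
    """Cross-entropy against a target that is uniform over all classes
    except wrong_action (an assumed surrogate, not the patent's formula)."""
    probs = softmax(logits)
    c = len(logits)
    return -sum(math.log(p) for i, p in enumerate(probs) if i != wrong_action) / (c - 1)


# Mass concentrated on the wrong action (class 2) gives a high loss ...
loss_bad = complement_uniform_loss([0.0, 0.0, 5.0], wrong_action=2)
# ... while mass spread evenly over the other classes gives a low loss.
loss_good = complement_uniform_loss([5.0, 5.0, 0.0], wrong_action=2)
```

Under this reading the loss is smallest when the probability of the wrong action goes to zero and the other classes share the mass equally, in which case it approaches log(C - 1).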

In addition, the policy loss function unit 120 defines the reward function as follows.

$$ r(s_k, a_k) = \begin{cases} w_{k1}, & a_k = y_k \\ -w_{k2}, & a_k \neq y_k \end{cases} $$

Here, k = 1, …, n.

The policy loss function can be defined as follows.

[equation not reproduced in the source]

As a result, the policy gradient can be expressed as follows.

[equation not reproduced in the source]

Here, [definition not reproduced in the source].

Meanwhile, if the action is equal to the label value, the policy gradient is identical to the gradient of the cross-entropy loss; if the action differs from the label value, the policy gradient causes the policy parameters to be updated along the reverse direction of the cross-entropy loss gradient.
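The two cases can be checked numerically. The sketch below assumes the common REINFORCE-style loss -r · log π_a(s), which is an assumption because the policy loss formula is not reproduced in this text; gradients are taken with respect to the logits of a softmax output, and all names are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ce_grad(logits, target):
    """Gradient of -log softmax(logits)[target] with respect to the logits."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]


def policy_grad(logits, action, r):
    """Gradient of the assumed policy loss -r * log softmax(logits)[action]."""
    return [r * g for g in ce_grad(logits, action)]


logits = [1.0, 0.2, -0.5]
label = 0

# Correct action with reward +1: same direction as the cross-entropy gradient.
g_right = policy_grad(logits, action=label, r=1.0)

# Wrong action with reward -1: the cross-entropy gradient for that action, reversed.
g_wrong = policy_grad(logits, action=2, r=-1.0)
```

In the second case the update only pushes probability away from the chosen action; it carries no information about which class is correct, which is the weakness discussed in the text.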

However, the policy gradient updates the policy parameters either along the cross-entropy loss gradient or along its reverse, that is, in the direction of either increasing accuracy or avoiding a wrong action. To prevent this, the policy loss function unit 120 may include a first loss function unit 121 that warms up the loss-based learning.

The first loss function unit 121 may learn a policy that warms up the loss-based learning, based on a policy gradient using the following formula, which is the first policy loss function.

[equation not reproduced in the source]

Here, [definition not reproduced in the source], C is the number of classes, and i = 1, …, C.

That is, a loss function is used in which the policy gradient [expression not reproduced in the source] is replaced by the following formula.

[equation not reproduced in the source]

Intuitively, instead of minimizing [quantity not reproduced in the source], this is equivalent to maximizing [quantity not reproduced in the source].

In addition, once the first loss function unit 121 has completed learning based on the first policy loss function according to a predetermined criterion, the policy loss function unit 120 may continue training through a second loss function unit 122 that performs learning based on a second policy loss function.

Here, the second policy loss function may use the following formula, which may be the existing loss function.

[equation not reproduced in the source]

Here, k = 1, …, n and i = 1, …, C.

Next, a reinforcement learning method for multiple classification according to an embodiment of the present invention is described.

FIG. 4 is a flowchart illustrating a reinforcement learning method for multiple classification according to an embodiment of the present invention.

Referring to FIGS. 2 to 4, in the reinforcement learning method for multiple classification according to an embodiment of the present invention, the agent 100 predicts the correct answer of classification data so that the expected reward is maximized through a policy in multi-class image classification, and performs learning based on a first policy loss function so that the policy is trained to predict with the highest probability of expected reward using a policy gradient (S100).

In step S100, the agent 100 learns the policy based on a policy gradient using the following formula.

[equation not reproduced in the source]

Here, [definition not reproduced in the source], C is the number of classes, and i = 1, …, C.

In addition, in step S100 the agent 100 learns the policy so that [quantity not reproduced in the source] is lowered, and the loss is such that a uniform distribution is provided over the remaining classes, excluding the class corresponding to the wrong action.

That is, in step S100, the agent 100 uses a loss function in which the policy gradient [expression not reproduced in the source] is replaced by the following formula.

[equation not reproduced in the source]

Subsequently, when the learning based on the first policy loss function in step S100 is complete, the agent 100 performs learning based on a second policy loss function (S200).

In step S200, once the learning based on the first policy loss function has been completed according to a predetermined criterion, the agent 100 may perform learning using the second policy loss function given by the following formula.

[equation not reproduced in the source]

Here, k = 1, …, n and i = 1, …, C.
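The two-phase procedure (S100 followed by S200) amounts to switching the loss function once a completion criterion is met. The sketch below shows only that switching logic. It assumes, hypothetically, that the "predetermined criterion" is a fixed number of warm-up steps; `first_loss` and `second_loss` are empty stand-ins, not the patent's formulas.

```python
from typing import Callable, List

# A loss takes (probabilities, action, label) and returns a scalar.
LossFn = Callable[[List[float], int, int], float]


def select_loss(step: int, warmup_steps: int,
                first_loss: LossFn, second_loss: LossFn) -> LossFn:
    """Use first_loss during warm-up (S100) and second_loss afterwards (S200)."""
    return first_loss if step < warmup_steps else second_loss


def first_loss(probs, action, label):
    """Stand-in for the first policy loss function (warm-up phase)."""
    return 0.0


def second_loss(probs, action, label):
    """Stand-in for the second policy loss function."""
    return 0.0


# Which loss would be active at each of five training steps with three warm-up steps.
schedule = [select_loss(t, 3, first_loss, second_loss).__name__ for t in range(5)]
```

A criterion based on validation accuracy or loss plateau could replace the step count without changing the structure of the loop.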

Table 1 shows the test accuracy under a series of adversarial attacks when ε is 0.04.

Dataset: CIFAR-10
Attacks          | clean | FGSM   | PGD
CrossEntropy     | 92.26 | 0.1074 | 0.20
Ours (λ=0.1)     | 92.40 | 0.5937 | 0.3927
Ours (λ=0.25)    | 92.43 | 0.6422 | 0.4444
Ours (λ=0.5)     | 92.48 | 0.6641 | 0.5062
Ours (λ=0.75)    | 92.17 | 0.6521 | 0.4478
Ours (λ=1.0)     | 92.27 | 0.6178 | 0.4078
Ours (λ=1.25)    | 91.85 | 0.5576 | 0.3202
Ours (λ=1.5)     | 91.95 | 0.5294 | 0.3186

Here, λ is the absolute value of [quantity not reproduced in the source], ε is the attack strength, and FGSM (Fast Gradient Sign Method) is the simplest yet efficient white-box attack.
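For readers unfamiliar with FGSM, the attack perturbs each input component by ε in the direction of the sign of the loss gradient with respect to the input. The sketch below applies it to a tiny linear softmax classifier in pure Python; the weights, input, and function names are illustrative assumptions and have nothing to do with the CNN or the CIFAR-10 experiments reported in the table.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(W, b, x):
    """Class probabilities of a linear softmax classifier."""
    logits = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
    return softmax(logits)


def ce_loss(W, b, x, label):
    """Cross-entropy loss of the classifier on one labeled input."""
    return -math.log(predict(W, b, x)[label])


def input_gradient(W, b, x, label):
    """Gradient of the cross-entropy loss with respect to the input x."""
    probs = predict(W, b, x)
    err = [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]
    return [sum(W[i][j] * err[i] for i in range(len(W))) for j in range(len(x))]


def fgsm(W, b, x, label, eps):
    """One FGSM step: x + eps * sign(gradient), clipped to the range [0, 1]."""
    grad = input_gradient(W, b, x, label)
    signs = [(g > 0) - (g < 0) for g in grad]
    return [min(1.0, max(0.0, xi + eps * s)) for xi, s in zip(x, signs)]


W = [[1.0, -1.0], [-1.0, 1.0]]   # illustrative weights for two classes
b = [0.0, 0.0]
x = [0.6, 0.4]                   # illustrative input with values in [0, 1]
x_adv = fgsm(W, b, x, label=0, eps=0.04)
```

Because every component moves by at most ε, the perturbation stays small while the loss on the true label increases, which is why ε serves as the attack strength.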

Table 2 compares using cross-entropy together with the existing policy function against using the replaced loss function together with the existing policy function.

Dataset: CIFAR-10
Attacks                          | Clean | FGSM
CrossEntropy with (3) (λ=1.0)    | 92.21 | 0.1877
(6) with (3) (λ=1.0)             | 92.27 | 0.6178
Use Only (3) (λ=1.0)             | 10.01 | 0.0

Here, λ is the absolute value of [quantity not reproduced in the source].

In Table 2, the ablation results on CIFAR-10 show that the replacement loss function must be used in order to preserve the effect of the original policy loss.

Accordingly, reinforcement learning using a policy gradient makes it possible to increase classification accuracy by maximizing the expected reward in the classification training of an artificial intelligence model, and to provide high classification accuracy by improving defense against adversarial attacks.

Although the invention has been described above with reference to preferred embodiments, those skilled in the art will understand that it can be modified and changed in various ways without departing from the spirit and scope of the invention set forth in the claims below.

In addition, the reference numerals given in the claims are provided only for clarity and convenience of description and are not limiting, and the thickness of the lines and the size of the components shown in the drawings may be exaggerated for clarity and convenience in describing the embodiments.

In addition, the terms used above are defined in consideration of their functions in the present invention and may vary according to the intention or practice of a user or operator, so they should be interpreted on the basis of the content of this specification as a whole.

In addition, even where not explicitly shown or described, it is self-evident that a person of ordinary skill in the art to which the invention pertains can make various modifications embodying the technical idea of the invention from what is described here, and such modifications still fall within the scope of the invention.

In addition, the embodiments described above with reference to the accompanying drawings are given for the purpose of explaining the invention, and the scope of the invention is not limited to these embodiments.

100: Agent
110: Policy unit
120: Policy loss function unit
121: First loss function unit
122: Second loss function unit
200: Environment

Claims

A reinforcement learning apparatus for multiple classification, comprising an agent 100 that learns how to determine an action through training of a reinforcement learning model, wherein each action affects the next state and the degree of success is measured as a reward,
wherein the agent 100 includes: a policy unit 110 that predicts the correct answer of classification data so that an expected reward is maximized through a policy in multi-class image classification; and
a policy loss function unit 120 that trains the policy so that the policy unit 110 predicts the correct answer with the highest probability of expected reward using a policy gradient,
wherein the policy loss function unit 120 learns the policy based on a policy gradient using the following formula:

[equation not reproduced in the source]

where [definition not reproduced in the source], C is the number of classes, and i = 1, …, C.

The apparatus of claim 1,
wherein the policy loss function unit 120 learns the policy so that [quantity not reproduced in the source] is lowered.

The apparatus of claim 1,
wherein the policy loss function unit 120 includes: a first loss function unit 121 that performs learning based on a first policy loss function; and
a second loss function unit 122 that performs learning based on a second policy loss function when the learning of the first loss function unit 121 is complete,
wherein the first policy loss function uses the following formula:

[equation not reproduced in the source]

and the second policy loss function uses the following formula:

[equation not reproduced in the source]

where k = 1, …, n and i = 1, …, C.

A reinforcement learning method for multiple classification performed by a reinforcement learning apparatus for multiple classification, the method comprising the step of:
a) an agent 100 predicting the correct answer of classification data so that an expected reward is maximized through a policy in multi-class image classification,
while training the policy so that the prediction is made with the highest probability of expected reward using a policy gradient,
wherein the agent 100 learns the policy based on a policy gradient using the following formula:

[equation not reproduced in the source]

where [definition not reproduced in the source], C is the number of classes, and i = 1, …, C.

The method of claim 4,
wherein the agent 100 learns the policy so that [quantity not reproduced in the source] is lowered.

The method of claim 4,
wherein step a) includes: a-1) the agent 100 performing learning based on a first policy loss function; and
a-2) the agent 100 performing learning based on a second policy loss function when the learning based on the first policy loss function is complete,
wherein the first policy loss function uses the following formula:

[equation not reproduced in the source]

and the second policy loss function uses the following formula:

[equation not reproduced in the source]

where k = 1, …, n and i = 1, …, C.