KR20210151644A - Apparatus and method for extracting deep learning models

Info

Publication number: KR20210151644A
Application number: KR1020200126332A
Authority: KR
Inventors: 오규삼; 서지현; 민찬호; 윤용근; 조성우
Original assignee: 삼성에스디에스 주식회사
Priority date: 2020-06-05
Filing date: 2020-09-28
Publication date: 2021-12-14

Abstract

Disclosed are an apparatus and method for extracting a deep learning model. According to an embodiment of the present invention, the apparatus includes: one or more first artificial neural networks pre-trained to perform a specific function; N (N≥2) second artificial neural networks that learn to perform the specific function based on the first artificial neural network; and a learning unit that trains the N second artificial neural networks. The learning unit calculates a total loss function based on the ground truth for a specific input value, the probability score of the first artificial neural network, and the probability score of each of the N second artificial neural networks, and trains each of the N second artificial neural networks based on the total loss function. As a result, mutual learning between small networks makes it possible to extract a small network model that performs better than the large network.

Description

Apparatus and method for extracting deep learning models

The disclosed embodiments relate to deep learning model extraction techniques.

Recently, knowledge distillation, which transfers the knowledge of a large pre-trained network to a small network usable on mobile terminals and similar devices, has been actively researched.

However, with the techniques disclosed to date, a small deep learning model extracted using either the same type of network or a heterogeneous network has difficulty exceeding the accuracy of the large deep learning model.

US Patent Publication No. US2015/0356461 (published Dec. 10, 2015)

The disclosed embodiments are intended to provide a method and apparatus for extracting a deep learning model.

An apparatus for extracting a deep learning model according to an embodiment includes: one or more first artificial neural networks trained in advance to perform a specific function; N (N≥2) second artificial neural networks that learn to perform the specific function based on the first artificial neural network; and a learning unit that trains the N second artificial neural networks, wherein the learning unit calculates a total loss function based on a ground truth for a specific input value, a probability score of the first artificial neural network, and a probability score of each of the N second artificial neural networks, and may train each of the N second artificial neural networks based on the total loss function.

The number of layers included in the second artificial neural network may be equal to or less than the number of layers included in the first artificial neural network, or the number of parameters included in the second artificial neural network may be equal to or less than the number of parameters included in the first artificial neural network.

For the n-th (1≤n≤N) second artificial neural network, the learning unit may calculate: a first loss function generated based on the ground truth and the probability score of the n-th second artificial neural network; a second loss function generated based on the probability score of the first artificial neural network and the probability score of the n-th second artificial neural network; and a third loss function generated based on the probability score of the n-th second artificial neural network and a probability score generated using the probability scores of all second artificial neural networks other than the n-th.

The learning unit may calculate the total loss function by summing the first loss function, the second loss function, and the third loss function calculated for each of the N second artificial neural networks.

The second loss function may be generated based on the probability score of the n-th second artificial neural network and at least one of: a probability score generated from the score having the maximum value among the scores output from the last layer of each of the one or more first artificial neural networks; a probability score generated from the probability score having the maximum value among the probability scores of each of the one or more first artificial neural networks; and a probability score generated by averaging the probability scores of each of the one or more first artificial neural networks.

The third loss function may be a conditional probability of a second probability score of the n-th second artificial neural network with respect to a first probability score, the first probability score being generated from the score of whichever second artificial neural network, among all second artificial neural networks other than the n-th, has the largest score output from its last layer.

The third loss function may be generated based on the result of subtracting the entropy of a first probability score of the n-th second artificial neural network from the cross entropy between that first probability score and a second probability score, the second probability score being generated from the scores output from the last layers of all second artificial neural networks other than the n-th.

The second probability score may be generated from the score of whichever second artificial neural network, among all second artificial neural networks other than the n-th, has the largest score output from its last layer.

The learning unit may apply the reverse Kullback-Leibler divergence to optimize the first probability score of the n-th second artificial neural network.
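The subtraction of an entropy from a cross entropy described above is, by a standard identity, a Kullback-Leibler divergence. A short derivation, writing p for the first probability score of the n-th second artificial neural network and q for the second probability score (both symbols are chosen here for illustration and are not taken from the specification):

```latex
H(p, q) - H(p)
  = -\sum_{i} p_i \log q_i + \sum_{i} p_i \log p_i
  = \sum_{i} p_i \log \frac{p_i}{q_i}
  = D_{\mathrm{KL}}(p \,\|\, q)
```

With the network being trained in the first argument, this ordering is what is usually called the reverse direction in distillation settings, which appears consistent with the reverse Kullback-Leibler divergence mentioned here.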

The ground truth may be a one-hot vector, and the probability score may be a probability vector generated by applying a softmax function to the score output from the last layer of the artificial neural network.

The N second artificial neural networks may each be initialized with different initial weight values.

A method for extracting a deep learning model according to an aspect includes: receiving a probability score for a specific input from one or more first artificial neural networks trained in advance to perform a specific function; receiving a probability score for the specific input from N (N≥2) second artificial neural networks that learn to perform the specific function based on the first artificial neural network; and training the N second artificial neural networks, wherein the training calculates a total loss function based on a ground truth for the specific input value, the probability score of the first artificial neural network, and the probability score of each of the N second artificial neural networks, and may train each of the N second artificial neural networks based on the total loss function.

For the n-th (1≤n≤N) second artificial neural network, the training may calculate: a first loss function generated based on the ground truth and the probability score of the n-th second artificial neural network; a second loss function generated based on the probability score of the first artificial neural network and the probability score of the n-th second artificial neural network; and a third loss function generated based on the probability score of the n-th second artificial neural network and a probability score generated using the probability scores of all second artificial neural networks other than the n-th.

The training may calculate the total loss function by summing the first loss function, the second loss function, and the third loss function calculated for each of the N second artificial neural networks.

The second loss function may be generated based on the probability score of the n-th second artificial neural network and at least one of: a probability score generated from the score having the maximum value among the scores output from the last layer of each of the one or more first artificial neural networks; a probability score generated from the probability score having the maximum value among the probability scores of each of the one or more first artificial neural networks; and a probability score generated by averaging the probability scores of each of the one or more first artificial neural networks.

The training may apply the reverse Kullback-Leibler divergence to optimize the first probability score of the n-th second artificial neural network.

According to the disclosed embodiments, a small network can be trained through a large network, and mutual learning between small networks makes it possible to extract a small network model that performs better than the large network.

FIG. 1 is a block diagram of an apparatus for extracting a deep learning model according to an embodiment.
FIG. 2 is a block diagram of an apparatus for extracting a deep learning model according to an embodiment.
FIG. 3 is a flowchart illustrating a method for extracting a deep learning model according to an embodiment.
FIG. 4 is a flowchart illustrating a method for extracting a deep learning model according to an embodiment.
FIG. 5 is a block diagram illustrating a computing environment including a computing device according to an embodiment.

Hereinafter, specific embodiments of the present invention will be described with reference to the drawings. The following detailed description is provided to aid a comprehensive understanding of the methods, apparatuses, and/or systems described herein. However, this is merely an example, and the present invention is not limited thereto.

In describing the embodiments of the present invention, detailed descriptions of known technology related to the present invention are omitted where they could unnecessarily obscure the gist of the invention. The terms used below are defined in consideration of their functions in the present invention and may vary according to the intention or custom of users and operators; their definitions should therefore be based on the content of this specification as a whole. The terminology used in the detailed description is only for describing embodiments of the present invention and should in no way be limiting. Unless clearly used otherwise, expressions in the singular include the plural. In this description, expressions such as "comprising" or "including" indicate certain features, numbers, steps, operations, elements, or parts or combinations thereof, and should not be construed to exclude the presence or possibility of one or more other features, numbers, steps, operations, elements, or parts or combinations thereof beyond those described.

FIG. 1 is a block diagram of an apparatus for extracting a deep learning model according to an embodiment.

Referring to FIG. 1, the deep learning model extraction apparatus 100 may include a first artificial neural network 110 trained in advance to perform a specific function, N (N≥2) second artificial neural networks 121, 123, and 125 that learn to perform the specific function based on the first artificial neural network 110, and a learning unit 130 that trains the N second artificial neural networks 121, 123, and 125.

For convenience of explanation, each of the N (N≥2) second artificial neural networks 121, 123, and 125 may be referred to as the second artificial neural network 120. Accordingly, embodiments described for the second artificial neural network 120 may be interpreted as embodiments for each of the second artificial neural networks 121, 123, and 125.

According to an example, the first artificial neural network 110 may be referred to as a large network (or teacher model), and the second artificial neural network 120 as a small network (or student model).

As an example, the number of layers included in the second artificial neural network 120 may be less than the number of layers included in the first artificial neural network 110. For example, each of the second artificial neural networks 121, 123, and 125 may include three layers, and the first artificial neural network 110 may include four layers.

As another example, the number of parameters included in the second artificial neural network 120 may be less than the number of parameters included in the first artificial neural network 110. For example, each of the second artificial neural networks 121, 123, and 125 may include 100 parameters, and the first artificial neural network 110 may include 150 parameters.

According to an embodiment, the second artificial neural networks 121, 123, and 125 may each be initialized with different initial weight values. Accordingly, the second artificial neural networks 121, 123, and 125 may output different results for the same input data.

According to an embodiment, the learning unit 130 calculates a total loss function based on a ground truth for a specific input value, the probability score of the first artificial neural network 110, and the probability score of each of the N second artificial neural networks 120, and may train each of the N second artificial neural networks based on the total loss function.

According to an example, the learning unit 130 may calculate an error between the ground truth and the output of the second artificial neural network 120. As an example, the error may be a loss function.

According to an example, the learning unit 130 may calculate a probability score, which is a probability vector generated by applying a softmax function to the score output from the last layer of the second artificial neural network 121. As an example, the probability score may be defined as in Equation 1.

[Equation 1]

Here, x is the input data, and y_n is the score of the n-th second artificial neural network for the input data x. In addition, p_n is the probability score generated by applying the softmax function of Equation 2 to the score of the n-th second artificial neural network.

[Equation 2]

Here, I is the number of classes to be classified using the second artificial neural network.
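The step from last-layer scores to a probability score can be sketched as below. This is a minimal illustration assuming the usual softmax definition, p_i = exp(y_i) / Σ_j exp(y_j) over the I classes; the function name `softmax` and the example scores are hypothetical and not taken from the specification.

```python
import math

def softmax(scores):
    """Turn last-layer scores (logits) into a probability vector."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores y_n of one second (student) network for I = 3 classes.
y_n = [2.0, 1.0, 0.1]
p_n = softmax(y_n)  # probability score: non-negative entries that sum to 1
```

The resulting `p_n` preserves the ordering of the scores, so the class with the largest score also receives the largest probability.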

According to an embodiment, the learning unit 130 may generate, for any one of the second artificial neural networks, a first loss function based on the ground truth and the probability score of that second artificial neural network.

According to an example, the first loss function may be a cross entropy loss between the probability score of the n-th second artificial neural network and the ground truth. As an example, the cross entropy loss may be defined as follows.

[Equation 3]

Here, p denotes the ground truth, q denotes the probability score of the n-th second artificial neural network, and i denotes the index of a class to be classified using the second artificial neural network. As an example, the ground truth P may be a one-hot vector. For example, P = (p_1, p_2, …, p_i, …, p_I) = (0, 1, 0, …, 0).
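The first loss can be illustrated with a small sketch, assuming the standard cross entropy H(p, q) = -Σ_i p_i log q_i. The function name `cross_entropy`, the `eps` guard, and the example vectors are assumptions made for illustration.

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i); eps guards against log(0)."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))

# One-hot ground truth P for I = 3 classes (class 2 is correct) and a
# hypothetical probability score q of the n-th second (student) network.
P = [0.0, 1.0, 0.0]
q = [0.2, 0.7, 0.1]
loss_ce = cross_entropy(P, q)
```

Because P is one-hot, only the term of the correct class survives, so the loss reduces to the negative log of the probability the network assigns to that class.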

According to an embodiment, the learning unit 130 may generate a second loss function based on the probability score of the first artificial neural network and the probability score of the n-th second artificial neural network.

According to an example, the second loss function may be a knowledge distillation loss between the probability score of the first artificial neural network for a specific input and the probability score of the n-th second artificial neural network. As an example, the knowledge distillation loss may be defined as follows.

[Equation 4]

In knowledge distillation, applying softmax to the scores of the first artificial neural network can yield a distribution close to one-hot. T_KD is a temperature hyperparameter that smooths this near-one-hot probability score of the first artificial neural network so that it has higher entropy.
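The smoothing role of the temperature can be sketched as follows. The exact form of Equation 4 is not reproduced in the text, so this sketch assumes a commonly used formulation: a Kullback-Leibler divergence between temperature-softened teacher and student distributions, without any additional scaling factor. All function names and the example scores are hypothetical.

```python
import math

def softmax_t(scores, temperature=1.0):
    """Softmax of scores divided by a temperature (higher = smoother)."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy of a probability vector."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two probability vectors of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def kd_loss(teacher_scores, student_scores, t_kd=4.0):
    """Assumed distillation loss: KL between softened teacher and student."""
    p_teacher = softmax_t(teacher_scores, t_kd)
    p_student = softmax_t(student_scores, t_kd)
    return kl_divergence(p_teacher, p_student)

# Hypothetical teacher scores whose plain softmax is close to one-hot.
teacher = [10.0, 2.0, 1.0]
sharp = softmax_t(teacher, 1.0)   # nearly one-hot
smooth = softmax_t(teacher, 4.0)  # softened by the temperature
```

Raising the temperature spreads probability mass onto the non-maximal classes, so the softened teacher output carries more information about relative class similarity than a near-one-hot vector would.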

According to an embodiment, the learning unit 130 may generate a third loss function based on the probability score of the n-th second artificial neural network and a probability score generated using the probability scores of all second artificial neural networks other than the n-th.

According to an example, the third loss function may be a collection loss between the probability score of the n-th second artificial neural network and the probability score generated using the probability scores of all second artificial neural networks other than the n-th.

According to an embodiment, the third loss function may be a conditional probability of a second probability score of the n-th second artificial neural network with respect to a first probability score, the first probability score being generated from the score of whichever second artificial neural network, among all second artificial neural networks other than the n-th, has the largest score output from its last layer.

As an example, the collection loss, which is the third loss function, may be defined as follows.

[Equation 5]

Here, L_KLD is the Kullback-Leibler divergence function, and its argument denotes the conditional probability of the second probability score of the n-th second artificial neural network with respect to the first probability score. The two probability scores are defined as follows, where i denotes the i-th second artificial neural network among the N second artificial neural networks.

[Equation 6]

Here, T_KLD is a temperature hyperparameter.
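One possible reading of the collection loss is sketched below. It follows the description that the peer target comes from the other second artificial neural network with the largest last-layer score and that the loss is a cross entropy minus an entropy. Two points are assumptions of this sketch rather than statements from the specification: "largest score" is taken to mean the peer whose top score is highest, and the same temperature T_KLD is applied to both distributions. All names are hypothetical.

```python
import math

def softmax_t(scores, temperature=1.0):
    """Softmax of scores divided by a temperature."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def collection_loss(student_scores, n, t_kld=2.0, eps=1e-12):
    """Loss of student n against its most confident peer (assumed form)."""
    peers = [s for k, s in enumerate(student_scores) if k != n]
    # Assumption: pick the peer whose largest last-layer score is highest.
    best_peer = max(peers, key=max)
    p_n = softmax_t(student_scores[n], t_kld)
    p_peer = softmax_t(best_peer, t_kld)
    cross_entropy = -sum(a * math.log(b + eps) for a, b in zip(p_n, p_peer))
    entropy = -sum(a * math.log(a + eps) for a in p_n)
    # Cross entropy minus entropy equals D_KL(p_n || p_peer).
    return cross_entropy - entropy

# Hypothetical last-layer scores of N = 3 second (student) networks.
students = [[1.0, 2.0, 0.5], [0.2, 3.5, 0.1], [2.2, 0.3, 0.4]]
loss_0 = collection_loss(students, 0)
```

For student 0 in this example, the selected peer is the second network, since its top score (3.5) exceeds that of the third (2.2); the loss is zero only when the student's softened distribution matches that peer's.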

According to an embodiment, the learning unit 130 may calculate the total loss function by summing the first loss function, the second loss function, and the third loss function calculated for each of the N second artificial neural networks. As an example, the total loss function may be defined as in the following equation.

[Equation 7]

Here, L_n is the loss function of the n-th second artificial neural network and may be defined as follows.

[Equation 8]

Here, the coefficients applied to the individual loss terms are weights for keeping the losses in balance.
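How the per-network losses combine into the total can be sketched as below. The weight symbols of Equation 8 are not recoverable from the text, so the names `alpha`, `beta`, and `gamma`, their assignment to the three terms, and the numeric loss values are all illustrative assumptions.

```python
def student_loss(l_ce, l_kd, l_col, alpha=1.0, beta=1.0, gamma=1.0):
    """L_n: weighted sum of the first, second, and third losses."""
    return alpha * l_ce + beta * l_kd + gamma * l_col

def total_loss(per_student_terms, alpha=1.0, beta=1.0, gamma=1.0):
    """L_total: sum of L_n over all N second (student) networks."""
    return sum(student_loss(ce, kd, col, alpha, beta, gamma)
               for ce, kd, col in per_student_terms)

# Arbitrary illustrative (first, second, third) loss values for N = 3.
terms = [(0.36, 0.10, 0.05), (0.52, 0.20, 0.02), (0.41, 0.15, 0.08)]
l_total = total_loss(terms, alpha=1.0, beta=0.5, gamma=0.5)
```

Summing over all networks before differentiating means a single scalar drives the update of every second artificial neural network, rather than one scalar per network.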

According to an embodiment, when training the second artificial neural networks, the learning unit 130 calculates the gradient of the scalar field given by the loss and backpropagates it. In doing so, rather than training the n-th second artificial neural network individually on L_n, it may train all N second artificial neural networks with L_total.

According to an embodiment, comparing training the n-th second artificial neural network with L_n against training all second artificial neural networks with L_total, training with L_total optimizes both distributions that appear in the Kullback-Leibler divergence term, whereas training with L_n optimizes only one of them.

FIG. 2 is a block diagram of an apparatus for extracting a deep learning model according to an embodiment.

Referring to FIG. 2, the deep learning model extraction apparatus 200 may include M (M≥2) first artificial neural networks 210 trained in advance to perform a specific function, N (N≥2) second artificial neural networks 220 that learn to perform the specific function based on the first artificial neural networks 210, and a learning unit 230 that trains the N second artificial neural networks 220.

For convenience of explanation, each of the M (M≥2) first artificial neural networks 211 and 213 may be referred to as the first artificial neural network 210, and each of the N (N≥2) second artificial neural networks 221, 223, and 225 may be referred to as the second artificial neural network 220. Accordingly, embodiments described for the first artificial neural network 210 may be interpreted as embodiments for each of the first artificial neural networks 211 and 213, and embodiments described for the second artificial neural network 220 may be interpreted as embodiments for each of the second artificial neural networks 221, 223, and 225.

According to an example, the number of layers included in the second artificial neural network 220 may be equal to or less than the number of layers included in the first artificial neural network 210. For example, the first artificial neural network 211 may have three layers, and the second artificial neural network 221 may also have three layers. As another example, the second artificial neural network 221 may have two layers.

According to an example, the number of parameters included in the second artificial neural network 220 may be equal to or less than the number of parameters included in the first artificial neural network 210. For example, the first artificial neural network 211 may have 100 parameters, and the second artificial neural network 221 may also have 100 parameters. As another example, the second artificial neural network 221 may have 80 parameters.

According to an embodiment, the second artificial neural networks 221, 223, and 225 may each be initialized with different initial weight values. Accordingly, the second artificial neural networks 221, 223, and 225 may output different results for the same input data.

According to an embodiment, the learning unit 230 calculates a total loss function based on a ground truth for a specific input value, the probability score of each of the M first artificial neural networks, and the probability score of each of the N second artificial neural networks, and may train each of the N second artificial neural networks based on the total loss function.

According to an example, the learning unit 230 may calculate an error between the ground truth and the output of the second artificial neural network 220. As an example, the error may be a loss function.

According to an example, the learning unit 230 may calculate a probability score, which is a probability vector generated by applying a softmax function to the score output from the last layer of the second artificial neural network 221. As an example, the probability score may be defined as in Equation 1 described with reference to FIG. 1.

According to an embodiment, the learning unit 230 may generate, for the n-th (1≤n≤N) second artificial neural network, a first loss function based on the ground truth and the probability score of the n-th second artificial neural network.

일 예에 따르면, 제 1 손실 함수는 n번째 제 2 인공 신경망의 확률 점수와 정답값 간의 교차 엔트로피 손실(cross entropy loss)일 수 있다. 일 예로, 교차 엔트로피 손실은 수학식 3과 같이 정의될 수 있다.According to an example, the first loss function may be a cross entropy loss between the probability score of the n-th second artificial neural network and the correct value. As an example, the cross entropy loss may be defined as in Equation (3).

According to an embodiment, the learning unit 230 may generate a second loss function based on a probability score generated from the scores output from the last layers of the M first artificial neural networks and the probability score of the n-th second artificial neural network.

According to an embodiment, the probability score generated from the scores output from the last layers of the M first artificial neural networks may be generated based on the score having the maximum value among the scores output from the last layer of each of the M first artificial neural networks. As an example, it may be defined as in the following equation.

[Equation 9]

According to an embodiment, the probability score generated from the scores output from the last layers of the M first artificial neural networks may be generated based on the probability score having the maximum value among the probability scores of the M first artificial neural networks. As an example, it may be defined as in the following equation.

[Equation 10]

According to an embodiment, the probability score generated from the scores output from the last layers of the M first artificial neural networks may be generated by averaging the probability scores of the M first artificial neural networks. As an example, it may be defined as in the following equation.

[Equation 11]

According to an embodiment, the second loss function may be generated based on at least one of the probability scores defined in Equations 9, 10, and 11 and the probability score of the n-th second artificial neural network.
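The three ways of combining the M first artificial neural networks can be sketched as below. Equations 9 to 11 are not reproduced in this text, so each function is only one possible reading of the corresponding description, and all function names and sample scores are hypothetical: the first takes the class-wise maximum of the raw scores before softmax, the second keeps the probability vector of the teacher that contains the largest single probability, and the third averages the probability vectors.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def teacher_from_max_scores(teacher_scores):
    # Assumed reading: class-wise maximum of the last-layer scores, then softmax.
    merged = [max(column) for column in zip(*teacher_scores)]
    return softmax(merged)

def teacher_from_max_probability(teacher_scores):
    # Assumed reading: keep the probability vector of the most confident teacher.
    probs = [softmax(s) for s in teacher_scores]
    return max(probs, key=max)

def teacher_from_average(teacher_scores):
    # Class-wise mean of the teachers' probability vectors.
    probs = [softmax(s) for s in teacher_scores]
    return [sum(column) / len(probs) for column in zip(*probs)]

# Hypothetical scores from M = 2 first networks over 3 classes.
teacher_scores = [[2.0, 0.5, 0.1], [0.3, 3.0, 0.2]]
```

Each function returns a valid probability vector that can then be compared with the probability score of the n-th second artificial neural network.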

According to an embodiment, the learning unit 230 may generate a third loss function based on the probability score of the n-th second artificial neural network and a probability score generated using the probability scores of all second artificial neural networks except the n-th second artificial neural network.

As an example, the collection loss, which is the third loss function, may be defined as in Equation 5.

According to an embodiment, the third loss function may be generated based on the result of subtracting the entropy of a first probability score of the n-th second artificial neural network from the cross entropy of the first probability score and a second probability score, where the second probability score is generated based on the scores output from the last layers of all second artificial neural networks except the n-th second artificial neural network. As an example, the third loss function may be defined as in the following equation.

[Equation 12]

Here, H(p) is the entropy of the probability score p. According to the above equation, when p is fixed, optimizing L_CE with respect to q is equivalent to optimizing L_KLD. Because L_KLD is asymmetric, L_KLD(p, q) and L_KLD(q, p) have different values.
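The relationship stated above, a cross entropy minus an entropy giving a Kullback-Leibler divergence, can be checked numerically. The sketch below assumes the standard identity L_KLD(p, q) = L_CE(p, q) - H(p); the image of Equation 12 is not available here, so this is a reconstruction from the surrounding prose rather than the equation itself. The function names and the sample distributions are hypothetical, and q is assumed to be strictly positive.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

lhs = kl_divergence(p, q)                 # L_KLD(p, q)
rhs = cross_entropy(p, q) - entropy(p)    # L_CE(p, q) - H(p)
swapped = kl_divergence(q, p)             # L_KLD(q, p), generally a different value
```

Because H(p) does not depend on q, the two sides differ only by a constant when p is fixed, which is why minimizing either over q gives the same result.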

As an example, when a loss L(p, q) is given in an optimization task, p is fixed and q is the distribution to be optimized. If q follows a one-hot distribution such as (0, 1, 0, ..., 0), L_KLD(p, q) becomes infinite unless p = q, so optimizing q can generally be regarded as the more reasonable approach.

According to an embodiment, the learning unit 230 may apply the reverse Kullback-Leibler divergence to optimize the first probability score of the n-th second artificial neural network.

According to Equation 12, the Kullback-Leibler divergence can be minimized by increasing the entropy H(p) instead of decreasing L_CE, and the reverse Kullback-Leibler divergence may be applied to increase the entropy H(p). Here, when p is the distribution being optimized, L_KLD(q, p) is referred to as the forward Kullback-Leibler divergence and L_KLD(p, q) as the reverse Kullback-Leibler divergence.
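The difference between the two directions, and the infinite value in the one-hot case, can be seen in a small example. This is a sketch under the naming used above (p is the distribution being optimized); the variable names and sample distributions are assumptions made for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(p || q); returns infinity when q is zero where p is positive."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue              # a term with p = 0 contributes nothing
        if qi == 0.0:
            return math.inf       # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total

student = [0.6, 0.3, 0.1]   # p: the distribution being optimized
peer = [0.0, 1.0, 0.0]      # q: a one-hot distribution

reverse_kl = kl_divergence(student, peer)   # L_KLD(p, q)
forward_kl = kl_divergence(peer, student)   # L_KLD(q, p)
```

Here `reverse_kl` is infinite because q is one-hot and p differs from it, while `forward_kl` stays finite.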

According to an example, instead of updating q in L_KLD(p, q), the reverse L_KLD may be applied as the collection loss. As an example, when the entropy is small, p may resemble a one-hot distribution and carries information about only one class. In contrast, when the entropy is large, p may carry information about one or more classes, which allows the second artificial neural networks to learn from one another using more information.

According to an embodiment, the second probability score may be generated based on the scores of the second artificial neural network that has the largest score among the scores output from the last layers of all second artificial neural networks except the n-th second artificial neural network.
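One way to read this selection rule is sketched below: among all second networks other than the n-th, pick the one whose last-layer output contains the largest single score, and turn its scores into a probability vector. The interpretation of "largest score", the zero-based index `n`, and the names `peer_probability` and `student_scores` are assumptions made for the example.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def peer_probability(all_scores, n):
    """Probability score built from the peer (any network except n) with the largest score."""
    peers = [s for i, s in enumerate(all_scores) if i != n]
    best = max(peers, key=max)   # peer whose score vector has the largest entry
    return softmax(best)

# Hypothetical last-layer scores of N = 3 second networks over 3 classes.
student_scores = [[1.0, 0.2, 0.1], [0.4, 2.5, 0.3], [0.9, 0.8, 3.1]]
q = peer_probability(student_scores, n=2)
```

For `n=2` the candidates are the first two networks, and the second one is chosen because 2.5 is the largest score among them.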

According to an embodiment, the learning unit 230 may calculate the total loss function by summing the first loss function, the second loss function, and the third loss function calculated for each of the N second artificial neural networks. As an example, the total loss function may be defined as in Equation 7.
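The summation can be illustrated for a single second network as follows. This is a sketch under several assumptions: Equations 4, 5, and 7 are not reproduced in this section, so the second loss is taken here to be a plain Kullback-Leibler divergence from the teacher to the student, the third loss a reverse Kullback-Leibler divergence from the student to the peer, and the sum is unweighted. All names and sample vectors are hypothetical, and every probability is assumed to be strictly positive.

```python
import math

def cross_entropy(target, probs):
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_loss(ground_truth, teacher_probs, student_probs, peer_probs):
    first = cross_entropy(ground_truth, student_probs)     # first loss: against the ground truth
    second = kl_divergence(teacher_probs, student_probs)   # second loss: assumed distillation form
    third = kl_divergence(student_probs, peer_probs)       # third loss: assumed reverse KL to the peer
    return first + second + third

y = [0.0, 1.0, 0.0]
teacher = [0.1, 0.8, 0.1]
student = [0.2, 0.6, 0.2]
peer = [0.3, 0.5, 0.2]
loss = total_loss(y, teacher, student, peer)
```

Repeating the calculation for n = 1, ..., N with the corresponding peer score gives one total loss per second network, each of which can drive that network's update.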

According to an embodiment, the ground truth may be a one-hot vector, and the probability score may be a probability vector generated by applying a softmax function to the scores output from the last layer of an artificial neural network.

According to an embodiment, the N second artificial neural networks may each be initialized with different initial weight values.

FIG. 3 is a flowchart illustrating a method for extracting a deep learning model according to an embodiment.

Referring to FIG. 3, the deep learning model extraction apparatus may receive a probability score for a specific input from the first artificial neural network (310). As an example, the first artificial neural network may be an artificial neural network pre-trained to perform a specific function.

According to an embodiment, the deep learning model extraction apparatus may receive a probability score for the specific input from each of N (N≥2) second artificial neural networks (320). As an example, the second artificial neural network may be an artificial neural network that learns to perform the specific function based on the first artificial neural network.

According to an example, the first artificial neural network 110 may be referred to as a large network (or teacher model), and the second artificial neural network 120 as a small network (or student model).

According to an embodiment, the deep learning model extraction apparatus may train the N second artificial neural networks (330).

According to an embodiment, the deep learning model extraction apparatus may calculate a total loss function based on the ground truth for a specific input value, the probability score of the first artificial neural network, and the probability score of each of the N second artificial neural networks, and may train each of the N second artificial neural networks based on the total loss function.

According to an example, the deep learning model extraction apparatus may calculate the error between the ground truth and the output of the second artificial neural network. As an example, the error may be a loss function.

According to an example, the deep learning model extraction apparatus may calculate a probability score, which is a probability vector generated by applying a softmax function to the scores output from the last layer of the second artificial neural network. As an example, the probability score may be defined as in Equation 1 described above.

According to an embodiment, for any one of the second artificial neural networks, the deep learning model extraction apparatus may generate a first loss function based on the ground truth and the probability score of that second artificial neural network.

According to an embodiment, the deep learning model extraction apparatus may generate a second loss function based on the probability score of the first artificial neural network and the probability score of the n-th second artificial neural network.

According to an example, the second loss function may be the knowledge distillation loss between the probability score of the first artificial neural network and the probability score of the n-th second artificial neural network for a specific input. As an example, the knowledge distillation loss may be defined as in Equation 4.

According to an embodiment, the deep learning model extraction apparatus may generate a third loss function based on the probability score of the n-th second artificial neural network and a probability score generated using the probability scores of all second artificial neural networks except the n-th second artificial neural network.

As an example, the collection loss, which is the third loss function, may be defined as in Equation 5.

According to an embodiment, the deep learning model extraction apparatus may calculate the total loss function by summing the first loss function, the second loss function, and the third loss function calculated for each of the N second artificial neural networks. As an example, the total loss function may be defined as in Equation 7.

FIG. 4 is a flowchart illustrating a method for extracting a deep learning model according to an embodiment.

Referring to FIG. 4, the deep learning model extraction apparatus may receive a probability score for a specific input from each of M (M≥2) first artificial neural networks (410). As an example, the first artificial neural network may be an artificial neural network pre-trained to perform a specific function.

According to an embodiment, the deep learning model extraction apparatus may receive a probability score for the specific input from each of N (N≥2) second artificial neural networks (420). As an example, the second artificial neural network may be an artificial neural network that learns to perform the specific function based on the first artificial neural network.

According to an embodiment, the deep learning model extraction apparatus may train the N second artificial neural networks (430).

According to an embodiment, the deep learning model extraction apparatus may calculate a total loss function based on the ground truth for a specific input value, the probability score of each of the M first artificial neural networks, and the probability score of each of the N second artificial neural networks, and may train each of the N second artificial neural networks based on the total loss function.

According to an example, the deep learning model extraction apparatus may calculate a probability score, which is a probability vector generated by applying a softmax function to the scores output from the last layer of the second artificial neural network. As an example, the probability score may be defined as in Equation 1 described with reference to FIG. 1.

According to an embodiment, for the n-th (1≤n≤N) second artificial neural network, the deep learning model extraction apparatus may generate a first loss function based on the ground truth and the probability score of the n-th second artificial neural network.

According to an embodiment, the deep learning model extraction apparatus may generate a second loss function based on a probability score generated from the scores output from the last layers of the M first artificial neural networks and the probability score of the n-th second artificial neural network.

According to an embodiment, the probability score generated from the scores output from the last layers of the M first artificial neural networks may be generated based on the score having the maximum value among the scores output from the last layer of each of the M first artificial neural networks. As an example, this probability score may be defined as in Equation 9.

According to an embodiment, the probability score generated from the scores output from the last layers of the M first artificial neural networks may be generated based on the probability score having the maximum value among the probability scores of the M first artificial neural networks. As an example, this probability score may be defined as in Equation 10.

According to an embodiment, the probability score generated from the scores output from the last layers of the M first artificial neural networks may be generated by averaging the probability scores of the M first artificial neural networks. As an example, it may be defined as in Equation 11.

According to an embodiment, the deep learning model extraction apparatus may generate a third loss function based on the probability score of the n-th second artificial neural network and a probability score generated using the probability scores of all second artificial neural networks except the n-th second artificial neural network.

According to an embodiment, the third loss function may be generated based on the result of subtracting the entropy of a first probability score of the n-th second artificial neural network from the cross entropy of the first probability score and a second probability score, where the second probability score is generated based on the scores output from the last layers of all second artificial neural networks except the n-th second artificial neural network. As an example, the third loss function may be defined as in Equation 12.

According to an embodiment, the deep learning model extraction apparatus may apply the reverse Kullback-Leibler divergence to optimize the first probability score of the n-th second artificial neural network.

FIG. 5 is a block diagram illustrating a computing environment including a computing device according to an embodiment.

In the illustrated embodiment, each component may have functions and capabilities other than those described below, and additional components beyond those described below may be included.

The illustrated computing environment 10 includes a computing device 12. In an embodiment, the computing device 12 may be one or more components included in the deep learning model extraction apparatus 120. The computing device 12 includes at least one processor 14, a computer-readable storage medium 16, and a communication bus 18. The processor 14 may cause the computing device 12 to operate according to the exemplary embodiments described above. For example, the processor 14 may execute one or more programs stored in the computer-readable storage medium 16. The one or more programs may include one or more computer-executable instructions, which, when executed by the processor 14, may be configured to cause the computing device 12 to perform operations according to the exemplary embodiments.

The computer-readable storage medium 16 is configured to store computer-executable instructions or program code, program data, and/or other suitable forms of information. The program 20 stored in the computer-readable storage medium 16 includes a set of instructions executable by the processor 14. In an embodiment, the computer-readable storage medium 16 may be memory (volatile memory such as random access memory, non-volatile memory, or a suitable combination thereof), one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, any other form of storage medium that can be accessed by the computing device 12 and can store the desired information, or a suitable combination thereof.

The communication bus 18 interconnects the various other components of the computing device 12, including the processor 14 and the computer-readable storage medium 16.

The computing device 12 may also include one or more input/output interfaces 22 that provide an interface for one or more input/output devices 24, and one or more network communication interfaces 26. The input/output interface 22 and the network communication interface 26 are connected to the communication bus 18. The input/output device 24 may be connected to other components of the computing device 12 through the input/output interface 22. Exemplary input/output devices 24 may include input devices such as a pointing device (a mouse, a trackpad, or the like), a keyboard, a touch input device (a touchpad, a touchscreen, or the like), a voice or sound input device, various types of sensor devices, and/or an imaging device, and/or output devices such as a display device, a printer, a speaker, and/or a network card. The exemplary input/output device 24 may be included inside the computing device 12 as a component of the computing device 12, or may be connected to the computing device 12 as a separate device distinct from the computing device 12.

Although the present invention has been described in detail through representative embodiments, those of ordinary skill in the art to which the present invention pertains will understand that various modifications may be made to the above-described embodiments without departing from the scope of the present invention. Therefore, the scope of the present invention should not be limited to the described embodiments, but should be defined by the claims below and their equivalents.

10: computing environment
12: computing device
14: processor
16: computer-readable storage medium
18: communication bus
20: program
22: input/output interface
24: input/output device
26: network communication interface
100: deep learning model extraction apparatus
110: first artificial neural network
120, 121, 123, 125: second artificial neural network
130: learning unit
200: deep learning model extraction apparatus
210, 211, 213: first artificial neural network
220, 221, 223, 225: second artificial neural network
230: learning unit

Claims

A deep learning model extraction apparatus comprising:
one or more first artificial neural networks pre-trained to perform a specific function;
N (N≥2) second artificial neural networks that learn to perform the specific function based on the first artificial neural network; and
a learning unit that trains the N second artificial neural networks,
wherein the learning unit calculates a total loss function based on a ground truth for a specific input value, a probability score of the first artificial neural network, and a probability score of each of the N second artificial neural networks, and trains each of the N second artificial neural networks based on the total loss function.

The apparatus of claim 1,
wherein the number of layers included in the second artificial neural network is equal to or less than the number of layers included in the first artificial neural network, or
the number of parameters included in the second artificial neural network is equal to or less than the number of parameters included in the first artificial neural network.

The apparatus of claim 1,
wherein the learning unit calculates,
for the n-th (1≤n≤N) second artificial neural network:
a first loss function generated based on the ground truth and the probability score of the n-th second artificial neural network;
a second loss function generated based on the probability score of the first artificial neural network and the probability score of the n-th second artificial neural network; and
a third loss function generated based on the probability score of the n-th second artificial neural network and a probability score generated using the probability scores of all second artificial neural networks except the n-th second artificial neural network.

The apparatus of claim 3,
wherein the learning unit
calculates the total loss function by summing the first loss function, the second loss function, and the third loss function calculated for each of the N second artificial neural networks.

The apparatus of claim 3,
wherein the second loss function is generated based on the probability score of the n-th second artificial neural network and at least one of:
a probability score generated based on the score having the maximum value among the scores output from the last layer of each of the one or more first artificial neural networks;
a probability score generated based on the probability score having the maximum value among the probability scores of the one or more first artificial neural networks; and
a probability score generated by averaging the probability scores of the one or more first artificial neural networks.

The apparatus of claim 3,
wherein the third loss function is
a conditional probability of a second probability score of the n-th second artificial neural network with respect to a first probability score generated based on the scores of the second artificial neural network having the largest score among the scores output from the last layers of all second artificial neural networks except the n-th second artificial neural network.

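The apparatus of claim 3,
wherein the third loss function is
generated based on the result of subtracting the entropy of a first probability score of the n-th second artificial neural network from the cross entropy of the first probability score and a second probability score generated based on the scores output from the last layers of all second artificial neural networks except the n-th second artificial neural network.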

The apparatus of claim 7,
wherein the second probability score is
generated based on the scores of the second artificial neural network having the largest score among the scores output from the last layers of all second artificial neural networks except the n-th second artificial neural network.

The apparatus of claim 7,
wherein the learning unit
applies a reverse Kullback-Leibler divergence to optimize the first probability score of the n-th second artificial neural network.

The apparatus of claim 1,
wherein the ground truth is a one-hot vector, and the probability score is a probability vector generated by applying a softmax function to the scores output from the last layer of an artificial neural network.

The apparatus of claim 1,
wherein the N second artificial neural networks are each initialized with different initial weight values.

A deep learning model extraction method comprising:
receiving a probability score for a specific input from one or more first artificial neural networks pre-trained to perform a specific function;
receiving a probability score for the specific input from each of N (N≥2) second artificial neural networks that learn to perform the specific function based on the first artificial neural network; and
training the N second artificial neural networks,
wherein the training comprises calculating a total loss function based on a ground truth for a specific input value, the probability score of the first artificial neural network, and the probability score of each of the N second artificial neural networks, and training each of the N second artificial neural networks based on the total loss function.

13. The method of claim 12,
The number of layers included in the second artificial neural network is equal to or less than the number of layers included in the first artificial neural network, or
The number of parameters included in the second artificial neural network is equal to or less than the number of parameters included in the first artificial neural network, a deep learning model extraction method.

13. The method of claim 12,
wherein the learning step comprises calculating, for the n-th (1≤n≤N) second artificial neural network:
a first loss function generated based on the ground truth and the probability score of the n-th second artificial neural network;
a second loss function generated based on the probability score of the first artificial neural network and the probability score of the n-th second artificial neural network; and
a third loss function generated based on the probability score of the n-th second artificial neural network and a probability score generated using the probability scores of all second artificial neural networks other than the n-th second artificial neural network.

15. The method of claim 14,
wherein the learning step comprises calculating the total loss function by summing the first loss function, the second loss function, and the third loss function calculated for each of the N second artificial neural networks.
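The summation over the N second networks can be sketched as follows. The claims do not fix the form of each loss term, so this sketch assumes cross entropy for the first loss, a Kullback-Leibler divergence from the first network's probability score for the second loss, and a Kullback-Leibler divergence against the mean of the other second networks for the third loss. The function names and the mean-of-peers target are illustrative assumptions.

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i)."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))

def kl(p, q, eps=1e-12):
    """KL(p || q), skipping terms where p_i is zero."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def total_loss(ground_truth, teacher_prob, student_probs):
    """Sum the three per-network losses over all N >= 2 second networks."""
    total = 0.0
    for n, p_n in enumerate(student_probs):
        peers = [p for k, p in enumerate(student_probs) if k != n]
        # Assumption: the peer target is the mean of the other networks.
        peer_target = [sum(col) / len(peers) for col in zip(*peers)]
        first = cross_entropy(ground_truth, p_n)  # ground truth vs. network n
        second = kl(teacher_prob, p_n)            # first network vs. network n
        third = kl(p_n, peer_target)              # network n vs. its peers
        total += first + second + third
    return total

# Illustrative values: two classes, one first network, two second networks.
example = total_loss([1.0, 0.0], [0.8, 0.2], [[0.5, 0.5], [0.9, 0.1]])
```

When every second network already matches both the ground truth and the first network, all three terms vanish and the total is approximately zero.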

15. The method of claim 14,
wherein the second loss function is generated based on the probability score of the n-th second artificial neural network and at least one of:
a probability score generated based on the score having the maximum value among the scores output from the last layer of each of the one or more first artificial neural networks;
a probability score generated based on the probability score having the maximum value among the probability scores of each of the one or more first artificial neural networks; and
a probability score generated by averaging the probability scores of each of the one or more first artificial neural networks.
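The three ways of forming a target probability score from several first networks might look like the sketch below. The claim language leaves the selection rule open; this sketch assumes that "maximum value" means choosing the whole output of the first network whose top entry is largest. The function names are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def target_max_score(teacher_scores):
    """Softmax of the first network whose last-layer score is largest."""
    best = max(teacher_scores, key=max)
    return softmax(best)

def target_max_prob(teacher_scores):
    """Probability vector of the first network with the largest probability."""
    probs = [softmax(z) for z in teacher_scores]
    return max(probs, key=max)

def target_mean_prob(teacher_scores):
    """Class-wise average of the first networks' probability vectors."""
    probs = [softmax(z) for z in teacher_scores]
    return [sum(col) / len(probs) for col in zip(*probs)]

# Illustrative last-layer scores from two first networks, three classes.
teachers = [[3.0, 0.0, 0.0], [1.0, 1.0, 5.0]]
by_score = target_max_score(teachers)
by_prob = target_max_prob(teachers)
by_mean = target_mean_prob(teachers)
```

In this example the second network has both the largest score and the largest probability, so the first two targets coincide, while the averaged target blends both networks.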

15. The method of claim 14,
wherein the third loss function is a conditional probability of the second probability score of the n-th second artificial neural network with respect to a first probability score generated based on the score of the second artificial neural network having the largest score among the scores output from the last layers of all second artificial neural networks other than the n-th second artificial neural network.

15. The method of claim 14,
wherein the third loss function is generated based on the result of subtracting the entropy of the first probability score of the n-th second artificial neural network from a cross entropy.
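Subtracting the entropy of a distribution p from the cross entropy H(p, q) yields the Kullback-Leibler divergence KL(p || q). The sketch below assumes that p is the probability score of the n-th second network and q is a probability score derived from the other second networks; that pairing and the numeric values are illustrative assumptions. With the network's own distribution in the first argument, this form is commonly called the reverse Kullback-Leibler divergence.

```python
import math

def entropy(p, eps=1e-12):
    """H(p) = -sum_i p_i * log(p_i)."""
    return -sum(pi * math.log(pi + eps) for pi in p)

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i)."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))

def kl_from_entropies(p, q):
    """KL(p || q) computed as cross entropy minus entropy."""
    return cross_entropy(p, q) - entropy(p)

# Illustrative values: p_n for the n-th network, q_peer for the peer target.
p_n = [0.7, 0.2, 0.1]
q_peer = [0.5, 0.3, 0.2]
third_loss = kl_from_entropies(p_n, q_peer)

# Direct definition of KL(p || q) for comparison.
direct = sum(pi * math.log(pi / qi) for pi, qi in zip(p_n, q_peer))
```

The two computations agree up to the small `eps` used to guard the logarithms, and the result is zero when the two distributions are identical.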

19. The method of claim 18,
wherein the second probability score is generated based on the score of the second artificial neural network having the largest score among the scores output from the last layers of all second artificial neural networks other than the n-th second artificial neural network.
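Selecting the peer with the largest last-layer score while excluding the n-th network can be sketched as below. The helper name `peer_target` and the choice to apply a softmax to the selected peer's full score vector are assumptions made for illustration.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def peer_target(student_scores, n):
    """Probability score from the peer with the largest last-layer score.

    The n-th network itself is excluded from the comparison.
    """
    peers = [z for k, z in enumerate(student_scores) if k != n]
    best = max(peers, key=max)  # peer whose top score is largest
    return softmax(best)

# Illustrative last-layer scores from three second networks, two classes.
scores = [[9.0, 0.0], [1.0, 2.0], [0.0, 4.0]]
target_for_0 = peer_target(scores, 0)  # network 0 is excluded
target_for_2 = peer_target(scores, 2)  # network 0 may now be chosen
```

Network 0 has the largest score overall, yet it never serves as its own target: for n = 0 the target comes from network 2 instead.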

19. The method of claim 18,
wherein the learning step comprises applying a reverse Kullback-Leibler divergence in order to optimize the first probability score of the n-th second artificial neural network.

15. The method of claim 14,
wherein the ground truth is a one-hot vector, and the probability score is a probability vector generated by applying a softmax function to the scores output from the last layer of an artificial neural network.

15. The method of claim 14,
wherein the N second artificial neural networks are each initialized with different initial weight values.