KR20220069225A - Knowledge distillation method for compressing transformer neural network and apparatus thereof

Info

Publication number: KR20220069225A
Application number: KR1020200156143A
Authority: KR
Inventors: 강유; 조익현
Original assignee: 서울대학교산학협력단
Priority date: 2020-11-20
Filing date: 2020-11-20
Publication date: 2022-05-27
Also published as: KR102631406B1

Abstract

A method of training a student network that includes at least one transformer neural network, by using knowledge distillation from a teacher network that includes at least one transformer neural network, comprises: pre-training the student network using a Pretraining with Teacher's Predictions (PTP) scheme, which initializes the student network based on output values of the teacher network previously trained with training data; and training the student network by performing knowledge distillation with the teacher network.

Description

KNOWLEDGE DISTILLATION METHOD FOR COMPRESSING TRANSFORMER NEURAL NETWORK AND APPARATUS THEREOF

Embodiments disclosed herein relate to a knowledge distillation method and apparatus for compressing a transformer-based network and, more particularly, to a knowledge distillation method and apparatus with improved parameter efficiency and accuracy.

With the advent of Bidirectional Encoder Representations from Transformers (BERT), very large models began to appear in the field of natural language processing. BERT is a Transformer-based model that made pre-training and fine-tuning of very large models possible in natural language processing, as in computer vision, and it has shown excellent performance on a variety of tasks.

A drawback of BERT, however, is that the model is very large: BERT-base uses about 110 million parameters, so a very large amount of memory is required to use BERT.

Model compression of BERT has therefore been required, and the present invention relates to model compression of Transformer-based models widely used in natural language processing, in particular BERT.

A need has thus arisen for techniques that compress the model while maintaining the performance of BERT.

One such technique, knowledge distillation (KD), trains a student network by transferring the generalization ability of a larger teacher network to a student network that is generally much smaller.

However, existing knowledge distillation used to compress BERT has two problems.

First, the student network is so small in absolute terms that it lacks model complexity.

Second, because the student network has no initial guide, it is difficult for the student network to faithfully imitate the performance of the teacher network.

A technique for solving these problems is therefore needed.

Meanwhile, the background art described above is technical information that the inventors possessed for deriving the present invention or acquired in the course of deriving it, and it is not necessarily publicly known art disclosed to the general public before the filing of the present application.

An object of the embodiments disclosed herein is to provide a knowledge distillation method and apparatus for compressing a transformer-based network.

According to an embodiment, as a technical means for achieving the above technical object, a method of training a student network including at least one transformer neural network by using knowledge distillation from a teacher network including at least one transformer neural network may include: pre-training the student network using a Pretraining with Teacher's Predictions (PTP) scheme, which initializes the student network based on output values of the teacher network previously trained with training data; and training the student network by performing a knowledge distillation process with the teacher network.

According to another embodiment, a knowledge distillation apparatus for training a student network including at least one transformer neural network by using knowledge distillation from a teacher network including at least one transformer neural network includes: a storage unit that stores a program for performing knowledge distillation; and a control unit including at least one processor. The control unit may pre-train the student network using a Pretraining with Teacher's Predictions (PTP) scheme, which initializes the student network based on output values of the teacher network previously trained with training data, and may train the student network by performing a knowledge distillation process with the teacher network.

According to another embodiment, there is provided a computer-readable recording medium on which a program for causing a computer to execute a knowledge distillation method is recorded. The knowledge distillation method, performed by a knowledge distillation apparatus, trains a student network including at least one transformer neural network by using knowledge distillation from a teacher network including at least one transformer neural network, and may include: pre-training the student network using a Pretraining with Teacher's Predictions (PTP) scheme, which initializes the student network based on output values of the teacher network previously trained with training data; and training the student network by performing a knowledge distillation process with the teacher network.

According to another embodiment, there is provided a computer program that is executed by a knowledge distillation apparatus and stored in a recording medium in order to perform a knowledge distillation method. The knowledge distillation method, performed by the knowledge distillation apparatus, trains a student network including at least one transformer neural network by using knowledge distillation from a teacher network including at least one transformer neural network, and may include: pre-training the student network using a Pretraining with Teacher's Predictions (PTP) scheme, which initializes the student network based on output values of the teacher network previously trained with training data; and training the student network by performing a knowledge distillation process with the teacher network.

According to any one of the means described above, the performance of the student network can be improved by increasing the complexity of the student network.

In addition, by providing an initialization procedure for the student network that did not previously exist, the performance of the student network can be improved. The performance of the teacher network can therefore be maintained even though the network is compressed.

The effects obtainable from the disclosed embodiments are not limited to those mentioned above, and other effects not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the disclosed embodiments belong.

FIG. 1 is an exemplary diagram for explaining a process of training a DNN using knowledge distillation.
FIG. 2 is a block diagram illustrating the configuration of a knowledge distillation apparatus according to an embodiment.
FIG. 3 is a flowchart illustrating a method of performing knowledge distillation to train a student network according to an embodiment.
FIG. 4 is an exemplary diagram of applying the PTP method to a student network according to an embodiment.
FIG. 5 is an exemplary diagram of a PTP label assignment method according to an embodiment.
FIG. 6 is a flowchart illustrating a method of performing knowledge distillation to train a student network according to an embodiment.
FIG. 7 is an exemplary diagram of applying the SPS method to a student network according to an embodiment.

Hereinafter, various embodiments will be described in detail with reference to the accompanying drawings. The embodiments described below may be modified and implemented in various different forms. To describe the features of the embodiments more clearly, detailed descriptions of matters widely known to those of ordinary skill in the art to which the embodiments belong are omitted. In the drawings, parts unrelated to the description of the embodiments are omitted, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a component is said to be "connected" to another component, this includes not only the case of being "directly connected" but also the case of being "connected with another component interposed therebetween". In addition, when a component is said to "include" another component, this means that, unless stated otherwise, it may further include other components rather than excluding them.

According to an embodiment of the present invention, knowledge distillation (KD) is used to train a deep neural network (DNN). Knowledge distillation refers to a method of training a student network by transferring the generalization ability of a larger teacher network to a student network that is generally much smaller. The teacher network provides hard-target information and soft-target information to the student network 30 so that the student network can learn to generalize in a similar way.

In the present invention, the neural network trained by knowledge distillation is a transformer-based neural network and, according to an embodiment, may be Bidirectional Encoder Representations from Transformers (BERT). Here, a transformer follows the encoder-decoder structure of seq2seq models and is implemented with attention alone, without using a recurrent neural network (RNN).

Hereinafter, embodiments will be described in detail with reference to the accompanying drawings.

FIG. 1 is an exemplary diagram for explaining a process of training a DNN using knowledge distillation.

Referring to FIG. 1, a student model is converted into an SPS student model by applying the Shuffled Parameter Sharing (SPS) method, the SPS student model is pre-trained by the Pretraining with Teacher's Predictions (PTP) method, and the PTP-pre-trained SPS student network 30 is then trained by knowledge distillation to imitate the teacher network. The SPS and PTP methods applied to the student network 30 are described in detail below with reference to the other drawings.

FIG. 2 is a block diagram illustrating the configuration of the knowledge distillation apparatus 200 according to an embodiment.

Referring to FIG. 2, the knowledge distillation apparatus 200 according to an embodiment may include an input/output unit 210, a storage unit 220, and a control unit 230.

The input/output unit 210 according to an embodiment may include an input device for receiving input from a user and an output device for displaying information such as the result of a task or the state of the knowledge distillation apparatus 200. For example, the input/output unit 210 may include an input unit that receives a data processing command and an output unit that outputs the result processed according to the received command. According to an embodiment, the input/output unit 210 may include user input means such as a keyboard, a mouse, or a touch panel, and output means such as a monitor or a speaker.

The storage unit 220 may store data for training the student network 30 by knowledge distillation. For example, it may store training data for training the neural networks (the teacher network 20 and the student network 30) and the output data of the teacher network 20 required for the knowledge distillation that trains the student network 30. It may also store various other data and programs required for that knowledge distillation.

The control unit 230 controls the overall operation of the knowledge distillation apparatus 200 and may include a processor such as a CPU. In particular, the control unit 230 may execute a program stored in the storage unit 220, or read data from it, to perform knowledge distillation for training the student network 30. The specific method by which the control unit 230 performs this knowledge distillation is described in detail below with reference to the other drawings.

A specific method of performing knowledge distillation to train the student network 30 is described below.

FIG. 3 is a flowchart illustrating a method of performing knowledge distillation to train the student network 30 according to an embodiment.

Referring to FIG. 3, the control unit 230 pre-trains the student network 30 by the PTP method (S310). After step S310, the control unit 230 trains the student network 30 by performing a knowledge distillation process with the teacher network 20 (S320).

Pretraining with Teacher's Predictions (PTP) method

The Pretraining with Teacher's Predictions (PTP) method initializes the student model before knowledge distillation. It can be regarded as an initialization method specialized for knowledge distillation. Through the PTP method, the student network 30 can acquire additional knowledge about the teacher network 20 as well as about the downstream task.

FIG. 4 is an exemplary diagram of applying the PTP method to the student network 30 according to an embodiment.

Referring to FIG. 4, the PTP method proceeds as follows. The control unit 230 inputs the training data 10 to the teacher network 20 and collects the output values of the teacher network 20. The control unit 230 then assigns a PTP label to each item of the training data 10 based on a confidence defined from the output values of the teacher network 20. The control unit 230 inputs the training data 10 to the student network 30 and trains the student network 30 to predict the PTP label corresponding to each input. The control unit 230 uses the student network 30 trained to predict the PTP labels as the initial state of the knowledge distillation process.

The confidence is defined as the largest of the output values of the teacher network 20.

FIG. 5 is an exemplary diagram of a PTP label assignment method according to an embodiment.

In the PTP label assignment method shown in FIG. 5, the control unit first divides the input data according to whether the teacher network 20 predicts the correct answer, and then, in each case, according to whether the confidence is greater than or equal to a hyperparameter t, so that one of four labels in total is assigned to each input. The hyperparameter t preferably has a value between 0.5 and 1. For example, if t is 0.6, the maximum output value of the teacher network 20 for an input x is 0.7, and the teacher network 20 predicts the correct answer for x, the control unit 230 may assign the label "confidently correct" to x, because the maximum output value of the teacher network 20 for x is greater than t.
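The labeling rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `ptp_label` is hypothetical, the teacher outputs are assumed to be class probabilities, and the three label strings other than "confidently correct" are names chosen here for illustration.

```python
def ptp_label(teacher_probs, true_label, t=0.6):
    """Assign one of four PTP labels to a single training example.

    teacher_probs: list of teacher output values (assumed to be probabilities).
    true_label:    index of the correct class.
    t:             confidence threshold (hyperparameter, preferably in 0.5..1).
    """
    # Confidence is the largest teacher output value.
    confidence = max(teacher_probs)
    predicted = teacher_probs.index(confidence)

    correct = predicted == true_label      # did the teacher get it right?
    confident = confidence >= t            # is the confidence at least t?

    # Label names other than "confidently correct" are assumptions.
    if correct and confident:
        return "confidently correct"
    if correct:
        return "unconfidently correct"
    if confident:
        return "confidently wrong"
    return "unconfidently wrong"
```

With t = 0.6, a teacher output of `[0.3, 0.7]` for an example whose true class is 1 yields "confidently correct", which matches the worked example in the text.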

Through the PTP method, the student network 30 can obtain additional information about the teacher network 20 in advance. Because the knowledge distillation process passes the knowledge of the teacher network 20 to the student network 30, a student network 30 that has been pre-trained on information about the teacher network 20 can receive that knowledge more efficiently.

The control unit 230 pre-trains the student network 30 using the PTP labels according to Equation 1.

[Equation 1]

$$\theta_s^{*} = \arg\min_{\theta_s} L_{CE}\big(\sigma(f_s(x;\theta_s)),\ y_{PTP}\big)$$

Here, $\theta_s$ denotes the parameters of the student network, $x$ denotes the training data, $f_s(x;\theta_s)$ denotes the output of the student network, $\sigma$ denotes the softmax function, $y_{PTP}$ denotes the PTP label, the subscript $s$ denotes the student network, and $L_{CE}$ denotes the cross-entropy loss.

When pre-training of the student network 30 by the PTP method is complete, the control unit may use $\theta_s^{*}$ as the initial state of the student network 30 for knowledge distillation. That is, the control unit starts the knowledge distillation process with the parameters of the student network 30 set to $\theta_s^{*}$.
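The pre-training objective described above, a cross-entropy loss between the softmax of the student network's outputs and the PTP labels, can be sketched as follows. The sketch assumes that the student has one output per PTP label (four in total), that PTP labels are encoded as integers 0 to 3, and that the loss is averaged over a batch; the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw output values."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label_index):
    """Cross-entropy of a probability vector against an integer label."""
    return -math.log(probs[label_index])


def ptp_pretraining_loss(student_logits_batch, ptp_labels):
    """Average cross-entropy between student predictions and PTP labels.

    student_logits_batch: list of per-example student outputs (4 values each).
    ptp_labels:           list of integer PTP labels in 0..3 (assumed encoding).
    """
    losses = [
        cross_entropy(softmax(logits), label)
        for logits, label in zip(student_logits_batch, ptp_labels)
    ]
    return sum(losses) / len(losses)
```

Minimizing this quantity over the student's parameters, with any gradient-based optimizer, would give the parameters used as the starting point of knowledge distillation.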

Referring to FIG. 6, the process of pre-training the student network 30 by the PTP method in step S310 may be further divided into steps S601 and S602.

The control unit 230 may apply the SPS method to the layers of the student network 30 (S601).

Shuffled Parameter Sharing (SPS) method

The SPS method increases the model complexity of the student network 30 without increasing the number of parameters.

The SPS method proceeds as follows. The control unit 230 makes the student network 30 include a plurality of layer pairs that share parameters. When training the student network 30, the control unit divides the layers of the parameter-sharing pairs into two groups, has the first layer group use the shared parameters with certain parameters swapped with each other, and has the second layer group use the shared parameters as they are. Here, training includes both the pre-training stage and the training stage during knowledge distillation.

FIG. 7 is an exemplary diagram of applying the SPS method to the student network 30 according to an embodiment.

Referring to FIG. 7, the student network 30 has three layers in total. The control unit 230 adds layers so that the student network 30 includes a plurality of layer pairs that share parameters. The number of added layers need not equal the number of existing layers. For example, when the student network 30 has six layers in total, only three parameter-sharing layers may be added.

Referring to FIG. 7, lower layer 1 and upper layer 1 share parameters, and likewise lower layer 2 and upper layer 2, and lower layer 3 and upper layer 3, share parameters. The control unit may assign upper layers 1, 2, and 3 to the first group and lower (bottom) layers 1, 2, and 3 to the second group, and then have the first layer group use the parameters Q (query) and K (key) swapped with each other while the second layer group uses Q and K as they are.

As a result, the student network 30 gains higher model complexity, and its performance (accuracy) can therefore improve.
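The parameter sharing and Q/K swap described above can be illustrated with a small pure-Python sketch. It is a simplified illustration under stated assumptions: each layer is reduced to single-head self-attention, the class and function names are hypothetical, and stacking the lower (unswapped) layers before the upper (swapped) layers is an assumption about the ordering.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Apply a numerically stable softmax to each row of a matrix."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(x, w_query, w_key, w_value):
    """Single-head scaled dot-product self-attention."""
    q, k, v = matmul(x, w_query), matmul(x, w_key), matmul(x, w_value)
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    return matmul(softmax_rows(scores), v)


class SharedAttentionLayer:
    """Attention layer that reads its weights from a shared parameter dict."""

    def __init__(self, params, shuffle):
        self.params = params    # shared by reference with the paired layer
        self.shuffle = shuffle  # True: use Q and K swapped

    def forward(self, x):
        w_query, w_key = self.params["Q"], self.params["K"]
        if self.shuffle:
            w_query, w_key = w_key, w_query
        return attention(x, w_query, w_key, self.params["V"])


def build_sps_student(param_sets):
    """Build lower layers (Q, K as they are) and upper layers (Q, K swapped)."""
    lower = [SharedAttentionLayer(p, shuffle=False) for p in param_sets]
    upper = [SharedAttentionLayer(p, shuffle=True) for p in param_sets]
    return lower + upper
```

Passing three parameter sets to `build_sps_student` gives six layers that hold only three distinct sets of weights, so the layer count doubles while the parameter count stays the same.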

After step S601, the control unit 230 pre-trains the SPS student network by the PTP method (S602).

The control unit 230 converts the existing student network 30 into an SPS student network 30 by applying the SPS method, and pre-trains the SPS student network 30 by the PTP method. A student network 30 to which the SPS and PTP methods have been applied achieves significantly higher performance (accuracy) from knowledge distillation than the existing student network 30. Because the SPS and PTP methods are independent of each other, their effects add up when they are applied together, yielding an even larger performance improvement.

After step S310, the control unit 230 trains the student network 30 by performing a knowledge distillation process with the teacher network 20 (S320).

Before the knowledge distillation process begins, the teacher network must be sufficiently trained with the training data 10. For example, the teacher network may be a 12-layer BERT-base model.

The parameters of the trained teacher network are given by Equation 2.

[Equation 2]

$$\theta_t^{*} = \arg\min_{\theta_t} L_{CE}\big(\sigma(f_t(x;\theta_t)),\ y\big)$$

Here, $\theta_t$ denotes the parameters of the teacher network, $\sigma$ denotes the softmax function, $x$ denotes the training data 10, $f_t(x;\theta_t)$ denotes the output of the teacher network, $y$ denotes the true labels, and $L_{CE}$ denotes the cross-entropy loss. Equation 2 may be used in Equation 4, which represents the loss function.

The knowledge distillation process uses Equation 3 and Equation 4.

[Equation 3]

$$\hat{y} = \sigma\!\left(\frac{z}{T}\right)$$

For knowledge distillation, the output of a network is expressed as a softened value by Equation 3. $T$ denotes the temperature of the softmax function, which controls how much the output of the teacher network 20 is softened, and $z$ denotes the output of the network. Equation 3 may be used in Equation 4, which represents the loss function, to soften the output of the teacher network.
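The effect of the softmax temperature can be seen in a short sketch. It assumes the common convention of dividing the network outputs by the temperature before applying softmax; the function name is hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits divided by a temperature T (T > 0).

    A larger T spreads probability more evenly across classes,
    which is what "softening" the output refers to.
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Raising the temperature lowers the probability given to the largest output and raises the others, so the relative ranking of the non-maximal classes becomes more visible to the student.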

[Equation 4: the formula is not present in the source text]

Equation 4 defines the loss function used to train the student network by knowledge distillation. Its symbols denote, in order: the output values of the teacher network 20 and of the student network 30; the index of a specific layer used in patient knowledge distillation (patient KD); the parameters of the student network; the training data 10; the softmax function; the true labels; the cross-entropy loss; the Kullback-Leibler divergence loss; the output logits of the k-th layer; and two hyperparameters. Two further quantities appearing in Equation 4 are defined by Equations 2 and 3, respectively.

The control unit 230 trains the student network 30 using the loss function of Equation 4.
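As a rough illustration of how the ingredients named for the loss function (a cross-entropy loss on the true labels, a Kullback-Leibler divergence loss between softened teacher and student outputs, a per-layer term from patient KD, and two hyperparameters) might be combined, consider the sketch below. The weighting scheme, the use of L2-normalized layer outputs in the patient term, the default values, and all function names are assumptions made here for illustration; this is not the patent's Equation 4.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature (see the softening step)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Cross-entropy against an integer true label."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for probability vectors."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def normalize(v):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def patient_term(student_layers, teacher_layers):
    """Squared distance between normalized outputs of matched layers."""
    total = 0.0
    for zs, zt in zip(student_layers, teacher_layers):
        ns, nt = normalize(zs), normalize(zt)
        total += sum((a - b) ** 2 for a, b in zip(ns, nt))
    return total


def kd_loss(student_logits, teacher_logits, label,
            student_layers, teacher_layers,
            alpha=0.5, beta=10.0, temperature=2.0):
    """Assumed weighted sum of hard-label, soft-label, and patient terms."""
    hard = cross_entropy(softmax(student_logits), label)
    soft = kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    patient = patient_term(student_layers, teacher_layers)
    return (1 - alpha) * hard + alpha * soft + beta * patient
```

When the student reproduces the teacher's logits and layer outputs exactly, the soft and patient terms vanish and only the weighted hard-label term remains, which is a quick sanity check for any implementation along these lines.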

또한, 명세서에 기재된 "…부", "…모듈" 의 용어는 적어도 하나의 기능이나 동작을 처리하는 단위를 의미하며, 이는 하드웨어나 소프트웨어 또는 하드웨어 및 소프트웨어의 결합으로 구현될 수 있다.In addition, the terms “…unit” and “…module” described in the specification mean a unit that processes at least one function or operation, which may be implemented as hardware or software or a combination of hardware and software.

The term '~unit' used in the above embodiments refers to software or a hardware component such as a field programmable gate array (FPGA) or an ASIC, and a '~unit' performs certain roles. However, a '~unit' is not limited to software or hardware. A '~unit' may be configured to reside on an addressable storage medium or may be configured to operate one or more processors. Accordingly, as an example, a '~unit' includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables.

The functions provided in the components and '~units' may be combined into a smaller number of components and '~units' or further separated into additional components and '~units'.

In addition, the components and '~units' may be implemented to operate one or more CPUs in a device or a secure multimedia card.

The knowledge distillation method according to the embodiments described with reference to FIGS. 3 to 7 may also be implemented in the form of a computer-readable medium that stores instructions and data executable by a computer. The instructions and data may be stored in the form of program code and, when executed by a processor, may generate a predetermined program module to perform a predetermined operation. The computer-readable medium may be any available medium that can be accessed by a computer, and includes volatile and non-volatile media and removable and non-removable media. The computer-readable medium may also be a computer recording medium, which may include volatile and non-volatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. For example, the computer recording medium may be a magnetic or solid-state storage medium such as an HDD or an SSD, an optical recording medium such as a CD, DVD, or Blu-ray disc, or a memory included in a server accessible through a network.

In addition, the knowledge distillation method according to the embodiments described with reference to FIGS. 3 to 7 may be implemented as a computer program (or computer program product) including instructions executable by a computer. The computer program includes programmable machine instructions processed by a processor, and may be implemented in a high-level programming language, an object-oriented programming language, an assembly language, a machine language, or the like. The computer program may also be recorded on a tangible computer-readable recording medium (for example, a memory, a hard disk, a magnetic/optical medium, or a solid-state drive (SSD)).

Accordingly, the knowledge distillation method according to the embodiments described with reference to FIGS. 3 to 7 may be implemented by a computing device executing the computer program described above. The computing device may include at least some of a processor, a memory, a storage device, a high-speed interface connected to the memory and a high-speed expansion port, and a low-speed interface connected to a low-speed bus and the storage device. These components are connected to one another using various buses, and may be mounted on a common motherboard or in any other suitable manner.

Here, the processor may process instructions within the computing device. Examples of such instructions include instructions stored in the memory or the storage device for displaying graphic information that provides a graphical user interface (GUI) on an external input/output device, such as a display connected to the high-speed interface. In other embodiments, multiple processors and/or multiple buses may be used, as appropriate, together with multiple memories and types of memory. The processor may also be implemented as a chipset made up of chips that include multiple independent analog and/or digital processors.

The memory stores information within the computing device. As one example, the memory may consist of a volatile memory unit or a set of such units. As another example, the memory may consist of a non-volatile memory unit or a set of such units. The memory may also be another form of computer-readable medium, such as a magnetic or optical disk.

The storage device may provide a large-capacity storage space to the computing device. The storage device may be a computer-readable medium or a configuration including such a medium. For example, it may include devices within a storage area network (SAN) or other configurations, and may be a floppy disk device, a hard disk device, an optical disk device, a tape device, a flash memory, or another similar semiconductor memory device or device array.

The above-described embodiments are provided for illustration, and those of ordinary skill in the art to which the embodiments pertain will understand that the embodiments can easily be modified into other specific forms without changing their technical idea or essential features. Therefore, the above-described embodiments should be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in a combined form.

The scope to be protected through this specification is indicated by the claims set forth below rather than by the above detailed description, and should be construed to include all changes or modifications derived from the meaning and scope of the claims and their equivalents.

10: learning data 20: teacher network
30: student network 200: knowledge distillation device
210: input/output unit 220: storage unit
230: control unit

Claims

A method of training a student network including at least one transformer neural network by using knowledge distillation from a teacher network including at least one transformer neural network, the method comprising:
pre-training the student network by a PTP (Pretraining with Teacher's Predictions) scheme that initializes the student network based on output values of the teacher network trained in advance on learning data; and
training the student network by performing knowledge distillation from the teacher network.

The method of claim 1, wherein pre-training the student network comprises:
inputting the learning data into the teacher network and collecting output values of the teacher network;
assigning a PTP label to the learning data based on a confidence defined from the output values of the teacher network; and
inputting the learning data into the student network and training the student network to predict the PTP label corresponding to the input.

The method of claim 2, wherein the confidence is the largest value among the output values of the teacher model.
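The confidence-based labeling recited above can be sketched as follows. The claims state only that the PTP label is assigned based on a confidence equal to the largest teacher output value; the four label names, the use of softmax probabilities as the output values, the comparison with the true label, and the `threshold` parameter are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ptp_label(teacher_logits, true_label, threshold=0.9):
    """Assign a hypothetical PTP label from the teacher's prediction."""
    probs = softmax(teacher_logits)
    # Confidence: the largest value among the teacher's output values.
    confidence = max(probs)
    predicted = probs.index(confidence)
    correct = predicted == true_label
    confident = confidence > threshold  # threshold is an assumed hyperparameter
    if correct:
        return "confidently correct" if confident else "unconfidently correct"
    return "confidently wrong" if confident else "unconfidently wrong"
```

In such a scheme the student would be pre-trained as a classifier over these labels before the knowledge distillation step, with each label string mapped to a class index.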

The method of claim 1, wherein pre-training the student network comprises:
configuring the student network to include a plurality of layer pairs that share parameters; and
applying, to the student network, an SPS (Shuffled Parameter Sharing) scheme in which, in each layer pair that shares parameters, the first layer uses the shared query parameter and key parameter swapped with each other, and the second layer uses the query parameter and the key parameter as they are.
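The query/key swap recited above can be sketched with plain Python objects. This is a minimal illustration under stated assumptions: `AttentionParams` and `shuffled_pair` are hypothetical names, the matrices are toy values, and the value parameter is assumed to be shared unchanged by both layers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionParams:
    """One set of attention projection matrices (toy nested lists)."""
    query: list
    key: list
    value: list


def shuffled_pair(shared):
    """Return (first_layer, second_layer) views over one shared parameter set."""
    # First layer: the shared query and key parameters are exchanged.
    first = AttentionParams(query=shared.key, key=shared.query, value=shared.value)
    # Second layer: the shared parameters are used as they are.
    second = shared
    return first, second


shared = AttentionParams(
    query=[[1.0, 0.0], [0.0, 1.0]],
    key=[[0.0, 2.0], [2.0, 0.0]],
    value=[[3.0, 0.0], [0.0, 3.0]],
)
first, second = shuffled_pair(shared)
```

Both layers reference the same underlying matrices, so the pair adds no parameters beyond a single layer's worth; only the roles of the query and key matrices differ between the two layers.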

A knowledge distillation apparatus for training a student network including at least one transformer neural network by using knowledge distillation from a teacher network including at least one transformer neural network, the apparatus comprising:
a storage unit storing a program for performing knowledge distillation; and
a control unit including at least one processor,
wherein the control unit pre-trains the student network by a PTP (Pretraining with Teacher's Predictions) scheme that initializes the student network based on output values of the teacher network trained in advance on learning data, and
trains the student network by performing knowledge distillation from the teacher network.

The apparatus of claim 6, wherein the control unit:
inputs the learning data into the teacher network and collects output values of the teacher network;
assigns a PTP label to the learning data based on a confidence defined from the output values of the teacher network; and
inputs the learning data into the student network and trains the student network to predict the PTP label corresponding to the input.

The apparatus of claim 7, wherein the confidence is the largest value among the output values of the teacher model.

The apparatus of claim 6, wherein the control unit:
configures the student network to include a plurality of layer pairs that share parameters; and
applies, to the student network, an SPS (Shuffled Parameter Sharing) scheme in which, in each layer pair that shares parameters, the first layer uses the shared query parameter and key parameter swapped with each other, and the second layer uses the query parameter and the key parameter as they are.

A computer-readable recording medium on which a program for executing the method according to claim 1 on a computer is recorded.

A computer program stored in a medium, the computer program being executed by a knowledge distillation apparatus to perform the method according to claim 1.