KR20240050709A

KR20240050709A - Method and apparatus for self-knowledge distillation using cross entropy

Info

Publication number: KR20240050709A
Application number: KR1020220130420A
Authority: KR
Inventors: 우사이먼성일; 이한빈
Original assignee: 성균관대학교산학협력단
Priority date: 2022-10-12
Filing date: 2022-10-12
Publication date: 2024-04-19

Abstract

A self-knowledge distillation method using cross entropy according to the present invention may include: identifying a target distribution that takes one-hot encoded label information into account; grouping and normalizing each of the label-aware target distribution and an output distribution into at least two classes; and identifying a loss between a first group of the label-aware target distribution and a second group of the output distribution corresponding to the first group.

Description

Method and apparatus for self-knowledge distillation using cross entropy {METHOD AND APPARATUS FOR SELF-KNOWLEDGE DISTILLATION USING CROSS ENTROPY}

The present invention relates to a self-knowledge distillation method using cross entropy and an apparatus therefor, and more specifically to a self-knowledge distillation method, and an apparatus therefor, that takes into account the low-probability portion of the output distribution produced by self-knowledge distillation.

The content described in this section merely provides background information on the present invention and does not constitute prior art.

The core of deep learning, which has driven advances in AI technology, lies in performing complex computations effectively by using deep neural networks to learn high-level abstract knowledge from data. Data-driven AI models acquire the knowledge needed for a task from data; performing a new task requires securing data for learning that task, and demand for such data may grow. Because deep learning learns knowledge more effectively when applied to large-scale data, using more data can improve deep learning performance. However, securing large-scale training data can require considerable cost and time. Accordingly, knowledge distillation techniques have been proposed in which the feature vectors of a model trained on a large-scale dataset (which may also be referred to as a pre-trained model) are recalibrated for use with a new model. With knowledge distillation, a deep learning model applicable to a new model can be trained even when only a relatively small amount of data is available.

Conventionally, many knowledge distillation methods have been proposed regarding where and how to obtain the data needed for model training, but little research has addressed how to transfer the acquired knowledge effectively. In addition, methods such as Kullback-Leibler divergence (KLD) are used to transfer knowledge between a model trained on a large-scale dataset and a new model that receives knowledge distilled from it; however, the loss function applied during the data processing of the two models does not take low-similarity distributions into account, which limits the performance of the trained model.

Japanese Laid-Open Patent Publication JP 2022-516452 A (Data processing method and device, electronic device and storage medium, February 28, 2022)

As described above, an object of the present invention is to provide a self-knowledge distillation method using cross entropy, and an apparatus therefor, that takes even low-similarity distributions into account during learning and can improve deep learning performance by sorting the target distribution and the output distribution according to the same criterion and applying a loss function to each fixed interval.

To achieve the above object, a self-knowledge distillation method using cross entropy according to the present invention may include: identifying a target distribution that takes one-hot encoded label information into account; grouping and normalizing each of the label-aware target distribution and an output distribution into at least two classes; and identifying a loss between a first group of the label-aware target distribution and a second group of the output distribution corresponding to the first group.

According to an embodiment, the grouping and normalizing into at least two classes may include sorting, according to the same criterion, the target distribution with which the label information is linearly combined and the output distribution.
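
The linear combination and shared sorting described here can be illustrated with a minimal sketch. It assumes a mixing weight `alpha` and the helper names `blend_with_label` and `sort_by_target`, none of which appear in the source; the source does not state the weight, nor which distribution defines the sort order (descending target probability is assumed here).

```python
def blend_with_label(teacher_probs, label_index, alpha=0.5):
    """Linearly combine a one-hot label with a target distribution.

    alpha is a hypothetical mixing weight; the source does not specify it.
    """
    n = len(teacher_probs)
    one_hot = [1.0 if i == label_index else 0.0 for i in range(n)]
    return [alpha * h + (1.0 - alpha) * p for h, p in zip(one_hot, teacher_probs)]


def sort_by_target(target, output):
    """Reorder both distributions by descending target probability.

    Using one index order for both keeps class i of the target aligned
    with class i of the output (the "same criterion" is an assumption).
    """
    order = sorted(range(len(target)), key=lambda i: target[i], reverse=True)
    return [target[i] for i in order], [output[i] for i in order], order


teacher = [0.3, 0.6, 0.1]   # e.g. dog, cat, panda
student = [0.5, 0.4, 0.1]   # output distribution for the same input
target = blend_with_label(teacher, label_index=1, alpha=0.5)
sorted_target, sorted_output, order = sort_by_target(target, student)
```

Because both inputs to the blend sum to 1, the blended target is still a probability distribution, and the label class is guaranteed extra mass regardless of what the teacher predicted.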

According to an embodiment, the grouping and normalizing into at least two classes may include identifying a representative value by applying a softmax function to each of the label-aware target distribution and the output distribution grouped into the at least two classes.

According to an embodiment, identifying the loss may include applying a softmax function to each of the label-aware target distribution and the output distribution grouped into the at least two classes and applying a loss function to the resulting representative values, and the loss function may include a Kullback-Leibler divergence (KLD) function.
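
One possible reading of the group-wise softmax and KLD step is sketched below. It assumes that the softmax is applied to the logits inside each group and that groups are given as lists of class indices; the function names and the example grouping are illustrative and not taken from the source.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def grouped_kld(target_logits, output_logits, groups):
    """Normalize each group with its own softmax, then compare group by group."""
    losses = []
    for idx in groups:
        p = softmax([target_logits[i] for i in idx])  # first group (target)
        q = softmax([output_logits[i] for i in idx])  # matching second group (output)
        losses.append(kl_divergence(p, q))
    return losses


target_logits = [4.0, 2.0, 0.5, 0.1]
output_logits = [3.0, 2.5, 0.1, 0.5]
losses = grouped_kld(target_logits, output_logits, groups=[[0, 1], [2, 3]])
```

Normalizing inside each group means the low-logit classes in the second group are compared on their own scale instead of being dwarfed by the dominant classes.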

In addition, a self-knowledge distillation apparatus using cross entropy according to the present invention may include: a first classifier that identifies a target distribution based on previously input data; a second classifier to which a softmax function different from that of the pre-trained classifier is applied; and a control unit that identifies a target distribution that takes one-hot encoded label information into account, groups and normalizes each of the label-aware target distribution and an output distribution into at least two classes, and identifies a loss between a first group of the label-aware target distribution and a second group of the output distribution corresponding to the first group.

According to an embodiment, the control unit may sort, according to the same criterion, the target distribution with which the label information is linearly combined and the output distribution.

According to an embodiment, the control unit may identify a representative value by applying a softmax function to each of the label-aware target distribution and the output distribution grouped into the at least two classes.

According to an embodiment, the control unit may apply a softmax function to each of the label-aware target distribution and the output distribution grouped into the at least two classes and apply a loss function to the resulting representative values, and the loss function may include a Kullback-Leibler divergence (KLD) function.

The self-knowledge distillation method using cross entropy, and the apparatus therefor, provided as an embodiment of the present invention use self-knowledge distillation, which does not require a separate preceding model, and are therefore relatively simple in terms of memory, number of parameters, and structure.

In addition, according to an embodiment of the present invention, the information on the plurality of classes contained in the input data can be learned evenly; as a result, the relationships between classes are clarified and the variance among data within the same class is reduced, yielding a more accurate decision boundary.

Meanwhile, the effects of the present invention are not limited to those mentioned above, and may include various effects within the range apparent to a person skilled in the art from the description below.

FIG. 1 is a diagram for explaining a self-knowledge distillation method using cross entropy according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining a self-knowledge distillation apparatus using cross entropy according to an embodiment of the present invention.
FIG. 3 is a flowchart for explaining a self-knowledge distillation method using cross entropy according to an embodiment of the present invention.
FIGS. 4a to 4d are diagrams for explaining the computation results of the self-knowledge distillation method using cross entropy and the apparatus therefor according to an embodiment of the present invention.

Hereinafter, the terms used in this specification are briefly explained, and the configuration and operation of preferred embodiments are described in detail as the specific means of carrying out the present invention. The advantages and features of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only to make the disclosure complete and to fully convey the scope of the invention to a person of ordinary skill in the art to which the invention pertains; the invention is defined only by the scope of the claims.

The terms used in this specification are for describing the embodiments and are not intended to limit the present invention. In this specification, the singular form also includes the plural unless the phrase specifically states otherwise. As used in the specification, "comprises" and/or "comprising" does not exclude the presence or addition of one or more elements other than those mentioned. Like reference numerals refer to like elements throughout the specification, and "and/or" includes each of the mentioned elements and every combination of one or more of them. Although "first", "second", and the like are used to describe various elements, these elements are not limited by those terms. The terms are used only to distinguish one element from another. Accordingly, a first element mentioned below may also be a second element within the technical idea of the present invention.

The word "exemplary" is used in this specification to mean "serving as an example or illustration". Any embodiment described in this specification as "exemplary" should not necessarily be construed as preferred or as having advantages over other embodiments.

In addition, the term "unit" as used in the specification means a software element or a hardware element such as an FPGA or ASIC, and a "unit" performs certain roles. However, a "unit" is not limited to software or hardware. A "unit" may be configured to reside in an addressable storage medium and may be configured to run on one or more processors. Thus, as an example, a "unit" includes elements such as software elements, object-oriented software elements, class elements, and task elements, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, and variables. The functions provided within elements and "units" may be combined into a smaller number of elements and "units" or further separated into additional elements and "units".

In addition, every "unit" in this specification may be controlled by at least one processor, and the operations performed by a "unit" of the present disclosure may be performed by at least one processor.

The embodiments of this specification may be described in terms of functions or blocks that perform functions. Blocks, which may be referred to as "units" or "modules" in the present disclosure, are physically implemented by analog or digital circuits such as logic gates, integrated circuits, microprocessors, microcontrollers, memories, passive electronic components, active electronic components, optical components, and hardwired circuits, and may optionally be driven by firmware and software.

The embodiments of this specification may be implemented using at least one software program running on at least one hardware device and may perform network management functions to control elements.

Unless otherwise defined, all terms used in this specification (including technical and scientific terms) are used with the meanings commonly understood by a person of ordinary skill in the art to which the present invention pertains. In addition, terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless they are clearly and specifically defined.

Conventionally, various knowledge distillation methods have been proposed, differing in where and how they obtain data useful for model training. These methods typically use loss functions such as cross entropy and Kullback-Leibler divergence (KLD) and do not overcome the problems those loss functions cause. Specifically, most of the loss arises from class information closely tied to the weights that are the final result of deep learning, while information in low-similarity distributions (hereinafter also referred to as the tail distribution) contributes little to the loss function and may not be sufficiently considered during optimization.

Accordingly, in the self-knowledge distillation method using cross entropy and the apparatus therefor according to an embodiment of the present invention, the deep learning model evenly uses all class information contained in the data during training: not only the class information closely tied to the weights that are the final result of deep learning, but also class information far from those weights, that is, in the tail distribution. This can improve the performance of the trained model.

The self-knowledge distillation method using cross entropy and the apparatus therefor according to an embodiment of the present invention may use an ordinary classification model among deep learning models. In such a method and apparatus, input data passes through a deep learning network to produce logit values, and a softmax function is applied to the logits to transform them into a probability distribution that sums to 1. One-hot encoding, which assigns 1 to the class with the highest value and 0 to the remaining classes, may be used to derive the output. In an embodiment of the present invention, assigning predictions consisting only of 1s and 0s is referred to as a hard target; in this case, every class other than the one with the highest probability is equally ignored as 0. For example, if a cat image looks more like a dog than a panda, the model may produce the probability distribution cat (0.6), dog (0.3), panda (0.1). Such a distribution can be regarded as that of a properly trained deep learning model. A properly trained deep learning model may be referred to as a teacher model, and a student model may use all or part of the teacher model's feature extractor or classifier through soft targets. A soft target uses the output of the softmax function as the final output of the model, so probability information is obtained for every category (or class) and information loss is reduced.
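
The difference between a hard target and a soft target can be made concrete with the cat, dog, and panda example. The following is a minimal sketch; the logits are chosen by hand so that the softmax reproduces the probabilities in the text, and `hard_target` is an illustrative helper name.

```python
import math


def softmax(logits):
    """Turn logits into a probability distribution that sums to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hard_target(probs):
    """One-hot vector: 1 for the most probable class, 0 for every other class."""
    top = max(range(len(probs)), key=lambda i: probs[i])
    return [1.0 if i == top else 0.0 for i in range(len(probs))]


classes = ["cat", "dog", "panda"]
# Logits picked so that the softmax gives cat 0.6, dog 0.3, panda 0.1.
logits = [math.log(0.6), math.log(0.3), math.log(0.1)]

soft = softmax(logits)      # keeps "dog is more likely than panda"
hard = hard_target(soft)    # discards everything except the top class
```

The hard target treats dog and panda identically, whereas the soft target preserves the information that the image resembles a dog three times more than a panda.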

In the self-knowledge distillation method using cross entropy and the apparatus therefor according to an embodiment of the present invention, a hyperparameter (t, temperature) that scales the logits may be applied in order to mitigate the tendency of the softmax function to concentrate its outputs near 1 and 0 when the soft target is defined. When the hyperparameter t is 1, the result is identical to the ordinary softmax; the larger t is, the softer the resulting probability distribution.
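
The effect of the temperature can be checked numerically. The sketch below assumes the common formulation in which each logit is divided by t before the softmax; the source describes t only as a scaling factor on the logits, so the exact form is an assumption.

```python
import math


def softmax_with_temperature(logits, t=1.0):
    """Softmax over logits divided by temperature t (t > 0)."""
    scaled = [x / t for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]
plain = softmax_with_temperature(logits, t=1.0)   # ordinary softmax
softer = softmax_with_temperature(logits, t=4.0)  # flatter distribution
```

With these logits the top probability drops from roughly 0.66 at t = 1 to roughly 0.42 at t = 4, so the remaining classes carry a larger share of the distribution that the student learns from.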

Hereinafter, specific embodiments of the present invention are described with reference to the drawings.

FIG. 1 gives an overview of the self-knowledge distillation method using cross entropy, and FIGS. 2 and 3 describe the specific components and the detailed process carried out by each. FIGS. 4a to 4d then describe the effects of the self-knowledge distillation method using cross entropy and the apparatus therefor.

FIG. 1 is a diagram for explaining a self-knowledge distillation method using cross entropy according to an embodiment of the present invention.

Referring to FIG. 1, the self-knowledge distillation method using cross entropy can learn the class information contained in the data evenly, can take the relationships between classes into account, and can sharpen the decision boundary of the applied model by reducing the variance among the data items that correspond to at least one class classified as the same class. Forming a decision boundary in a deep-learning-based classification process can be an important criterion for securing deep learning performance.

The self-knowledge distillation method using cross entropy according to an embodiment may include an ordinary self-knowledge distillation method, in which learning proceeds as knowledge is transferred within a single model, and the teacher model and the student model may be the same model. The single model may include a pre-trained classifier that acts as the teacher model and a fine-tuning classifier that acts as the student model. According to an embodiment, the pre-trained classifier (which may also be referred to as the deepest classifier) may include at least one ResNet block and a softmax function block, and the fine-tuning classifier (which may also be referred to as a shallow classifier) may include at least one fully connected (FC) layer and a softmax function block.

In the self-knowledge distillation method using cross entropy according to an embodiment, the decision boundary may be formed by a combination of the activation boundaries of one or more hidden layers, and taking the decision boundary of each hidden layer into account can ultimately improve the knowledge distillation performance of the pre-trained model.

According to an embodiment, the self-knowledge distillation method using cross entropy may be performed by a previously trained classifier (which may also be referred to as a pre-trained classifier, pre-trained model, or pre-trained network) and a fine-tuning classifier.

According to an embodiment, the pre-trained classifier and the fine-tuning classifier may each include a feature extractor that extracts features from the input data and a classifier that converts the feature extractor's output into probabilities; alternatively, they may receive the same output value from a feature extractor configured separately from both classifiers.

According to an embodiment, receiving the same output value from a feature extractor configured separately from the pre-trained classifier and the fine-tuning classifier can be understood as using the same input data, and because both classifiers are based on the same input, knowledge can be understood as being distilled or transferred between them. This is because the fine-tuning classifier can use the label values (label information) of the pre-trained classifier and the soft target previously derived by the pre-trained classifier.

According to an embodiment, the feature extractor may consist of at least one convolution layer and pooling layer and may have been trained in advance on previously provided input data; the output of the feature extractor may be referred to as a bottleneck or a feature vector.

According to an embodiment, the pre-trained classifier and the fine-tuning classifier may each convert the output of the feature extractor into probability values using its own softmax function and classify the result into at least two categories (or classes).

According to an embodiment, the pre-trained classifier and the fine-tuning classifier may be connected by a fully connected layer and may consist of at least one dense layer; dropout layers, batch normalization layers, and the like may be placed between the dense layers to reduce overfitting.

According to an embodiment, the fine-tuning classifier may adjust the weights of the softmax function applied to the pre-trained classifier, analyze the type and total amount of data, and on that basis either retrain only some of the pre-trained classifier's weights or retrain all of the weights from scratch.

According to an embodiment, the pre-trained classifier and the fine-tuning classifier may be of various types, such as VGG16, ResNet50, MobileNet, and InceptionV3, and at least one of them may be applied depending on the size or type of the input data; however, embodiments of the present invention are not limited to these types.

According to an embodiment, the pre-trained classifier and the fine-tuning classifier may include a layer such as flatten or GlobalAveragePooling2D, which converts the feature map into a one-dimensional vector for building the neural network.
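
The two ways of reducing a feature map to a one-dimensional vector differ in output size, which the following plain-Python sketch illustrates. It mimics the behavior of the flatten and GlobalAveragePooling2D layers on a single height x width x channels feature map; it is an illustration of the operations, not the layer implementations themselves.

```python
def flatten(feature_map):
    """Concatenate every value of an H x W x C map into one vector of length H*W*C."""
    return [v for row in feature_map for pixel in row for v in pixel]


def global_average_pool(feature_map):
    """Average each channel over all spatial positions, giving a vector of length C."""
    height = len(feature_map)
    width = len(feature_map[0])
    channels = len(feature_map[0][0])
    sums = [0.0] * channels
    for row in feature_map:
        for pixel in row:
            for k, v in enumerate(pixel):
                sums[k] += v
    return [s / (height * width) for s in sums]


# A 2 x 2 feature map with 2 channels.
fmap = [
    [[1.0, 10.0], [2.0, 20.0]],
    [[3.0, 30.0], [4.0, 40.0]],
]
flat = flatten(fmap)                 # 8 values, spatial layout preserved
pooled = global_average_pool(fmap)   # [2.5, 25.0], one value per channel
```

Flattening keeps every activation and so leads to a larger dense layer, while global average pooling keeps one value per channel and adds no parameters downstream beyond the smaller input.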

According to an embodiment, cross entropy and Kullback-Leibler divergence (KLD) may be applied for overall optimization between the pre-trained classifier and the fine-tuning classifier.
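
A combined objective of this kind is commonly written as a cross-entropy term against the label plus a KLD term against the teacher's distribution. The sketch below assumes that form; the weighting factor `lam` and the function names are hypothetical, since the source does not say how the two terms are combined.

```python
import math


def cross_entropy(label_index, probs, eps=1e-12):
    """Cross entropy between a one-hot label and predicted probabilities."""
    return -math.log(probs[label_index] + eps)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def total_loss(label_index, target_probs, output_probs, lam=1.0):
    """Label term plus lam times the distillation term (lam is an assumed weight)."""
    ce = cross_entropy(label_index, output_probs)
    kld = kl_divergence(target_probs, output_probs)
    return ce + lam * kld


target_probs = [0.3, 0.6, 0.1]   # from the pre-trained (teacher) classifier
output_probs = [0.5, 0.4, 0.1]   # from the fine-tuning (student) classifier
loss = total_loss(label_index=1, target_probs=target_probs, output_probs=output_probs)
```

The KLD term vanishes when the student reproduces the teacher's distribution exactly, leaving only the cross-entropy against the label.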

According to an embodiment, the self-knowledge distillation method using cross entropy may sort or normalize the probability distributions of the pre-trained classifier and the fine-tuning classifier based on their probability values, and may group the categories into fixed intervals to identify representative values.

FIG. 2 is a diagram illustrating a self-knowledge distillation device using cross-entropy according to an embodiment of the present invention, and FIG. 3 is a flowchart illustrating a self-knowledge distillation method using cross-entropy according to an embodiment of the present invention.

Referring to FIGS. 2 and 3, the self-knowledge distillation device 200 using cross-entropy may include: an input unit 210 that receives input data; a feature extractor 220 that extracts features of the input data and identifies them as an output vector; a pre-trained classifier 230 that outputs a target distribution based on previously learned data; a fine-tuning classifier 240 that receives knowledge distilled or transferred from the pre-trained classifier 230 and outputs an output distribution based on a softmax function different from that of the pre-trained classifier 230; and a control unit 250 that applies a loss function based on the difference between the target distribution and the output distribution.

Depending on the embodiment, the self-knowledge distillation device 200 using cross-entropy may obtain a target distribution from a self-knowledge distillation method based on previous input data, perform overall optimization through a KL-Divergence loss, and, after sorting the probability distributions, divide them at regular intervals and calculate a loss separately for each division.

Depending on the embodiment, when trained through the self-knowledge distillation method, the self-knowledge distillation device 200 using cross-entropy may give even consideration to the information of the one or more categories (or classes) contained in the input data.

Depending on the embodiment, the self-knowledge distillation device 200 using cross-entropy may use a plug-in loss function that further improves performance on top of the self-knowledge distillation method, and may make use of the overall techniques of conventional self-knowledge distillation methods.

Depending on the embodiment, the control unit 250 may identify the target distribution previously learned by the pre-trained classifier 230 through the self-knowledge distillation method (S310).

Depending on the embodiment, the control unit 250 may identify the probability distribution to be used for training through a linear combination of the one-hot encoded label information and the target distribution (S320).

For example, suppose the target distribution over the categories dog, cat, and panda is 0.3, 0.6, 0.1 and the output distribution is 0.5, 0.3, 0.2. The highest probability in the target distribution belongs to the cat class, but the label of the input data may be dog or panda. If the label information is not used, the output distribution follows the incorrect target distribution; through a linear combination of the label information and the target distribution, the actual correct class can have the highest probability. For example, if the actual label is dog, a weight of about 0.1 may be placed on the target distribution and a weight of 0.9 on the label information, with the classes in the order (cat, dog, panda), giving 0.1 * (0.6, 0.3, 0.1) + 0.9 * (0, 1, 0) = (0.06, 0.93, 0.01). If the label information were not used, the distribution would remain (0.6, 0.3, 0.1), in which cat has the largest value.
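The arithmetic of this example can be reproduced with a short sketch. The function name `mix_with_label` and the default `label_weight=0.9` are illustrative assumptions; the description only says that a weight of "about 0.1" may be placed on the target distribution.

```python
def mix_with_label(target, label_index, label_weight=0.9):
    """Linearly combine a target distribution with a one-hot label vector."""
    one_hot = [1.0 if i == label_index else 0.0 for i in range(len(target))]
    target_weight = 1.0 - label_weight
    return [target_weight * t + label_weight * h
            for t, h in zip(target, one_hot)]


classes = ["cat", "dog", "panda"]
target = [0.6, 0.3, 0.1]  # target distribution in (cat, dog, panda) order

mixed = mix_with_label(target, classes.index("dog"))
top_class = classes[mixed.index(max(mixed))]
print([round(v, 2) for v in mixed], top_class)
```

Since both inputs sum to 1 and the two weights sum to 1, the combined vector is again a probability distribution, and for any label weight above 0.5 the labeled class is guaranteed to rank first.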

Depending on the embodiment, the control unit 250 may identify the difference (loss) between the output distribution and the target distribution with which the label information has been linearly combined, and may train the model through KL-Divergence loss optimization.

Depending on the embodiment, the control unit 250 may divide or group, by category and using the same standard, the at least two probabilities included in each of the output distribution and the target distribution with which the label information has been linearly combined, and may then identify the cross-entropy loss between corresponding groups of the grouped target distribution and output distribution (S330). The control unit 250 may sort the label-combined target distribution and the output distribution by the same standard: the probabilities of the target distribution are first sorted in descending order, and the output distribution is then arranged in the resulting order of categories (or classes). For example, if the target distribution over dog, cat, and panda is 0.3, 0.6, 0.1 and the output distribution is 0.5, 0.3, 0.2, the target distribution is first sorted as cat (0.6), dog (0.3), panda (0.1), and the output distribution is then listed in the same class order as cat (0.3), dog (0.5), panda (0.2).
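The sorting step can be sketched as follows, using the dog, cat, panda values from the example (target 0.3, 0.6, 0.1 and output 0.5, 0.3, 0.2). The helper name `sort_by_target` is a hypothetical illustration.

```python
def sort_by_target(classes, target, output):
    """Sort classes by descending target probability and reorder the
    output distribution with the same permutation."""
    order = sorted(range(len(target)), key=lambda i: target[i], reverse=True)
    return (
        [classes[i] for i in order],
        [target[i] for i in order],
        [output[i] for i in order],
    )


classes = ["dog", "cat", "panda"]
target = [0.3, 0.6, 0.1]
output = [0.5, 0.3, 0.2]

sorted_classes, sorted_target, sorted_output = sort_by_target(
    classes, target, output
)
print(sorted_classes, sorted_target, sorted_output)
```

Only the target distribution determines the order; the output distribution is permuted, not sorted, so its values need not be descending afterwards.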

Depending on the embodiment, the control unit 250 may divide each of the label-combined target distribution and the output distribution into groups of a fixed size (window size) at a fixed interval (stride), and may calculate the cross-entropy loss between the corresponding groups of the divided target distribution and output distribution.

Depending on the embodiment, the control unit 250 may normalize the target distribution and the output distribution, and may group the categories into fixed intervals to identify a representative value for each group. For the sorted target distribution and output distribution, the control unit 250 may group N adjacent classes out of a total of M classes, starting from the first class. The control unit 250 may apply the softmax function to each of the two sets of values grouped into N classes. After the softmax function is applied, the values in each group sum to 1, and the softmax outputs may be compared with each other to identify the difference. The control unit 250 may then shift the group sideways by s classes at a time and repeat the comparison up to the last class.

For example, for the cat, dog, and panda classes, with a total of M=3 classes grouped N=2 classes at a time, (cat, dog) are grouped first. Applying the softmax function to the target distribution (0.6, 0.3) and the output distribution (0.3, 0.5) gives (0.5744, 0.4256) and (0.4502, 0.5498), and the two are compared to identify the difference. With s=1, the group moves one position sideways, and the softmax function is applied in the same way to the (dog, panda) classes before the difference is identified.
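The grouping and renormalization in this example can be sketched as follows. The helper names `softmax` and `windows` are illustrative, and the handling of a trailing window shorter than N (it is simply dropped) is an assumption, since the description does not address that case.

```python
import math


def softmax(values):
    """Numerically stable softmax over a short list of values."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def windows(values, size, stride):
    """Slide a window of `size` classes over `values` in steps of `stride`.
    Incomplete trailing windows are dropped (an assumption of this sketch)."""
    return [values[i:i + size]
            for i in range(0, len(values) - size + 1, stride)]


# Sorted distributions in (cat, dog, panda) order, M=3, N=2, s=1.
target_sorted = [0.6, 0.3, 0.1]
output_sorted = [0.3, 0.5, 0.2]

target_groups = [softmax(w) for w in windows(target_sorted, size=2, stride=1)]
output_groups = [softmax(w) for w in windows(output_sorted, size=2, stride=1)]

# First window is (cat, dog), second window is (dog, panda).
print([round(v, 4) for v in target_groups[0]])
print([round(v, 4) for v in output_groups[0]])
```

Note that the softmax is applied to the probability values themselves, as in the worked example, so each group is renormalized to sum to 1 independently of the classes outside it.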

Depending on the embodiment, the control unit 250 may train the model by considering both the difference (loss) between the label-combined target distribution and the output distribution obtained through KL-Divergence loss optimization and the cross-entropy loss identified in S330 (S340). After normalization, the control unit 250 may identify the loss function and may combine the blocks by adding the loss values calculated by applying the KL divergence function; in other words, the loss may be identified using the KL divergence function. For example, continuing the preceding cat, dog, and panda example, when (0.5744, 0.4256) and (0.4502, 0.5498) are applied to the KL divergence function, the value computed as 0.5744 * log(0.4502) + 0.4256 * log(0.5498) may be identified as the loss. To combine the blocks, the control unit 250 may group N classes together, calculate the loss through the KL divergence function, and add to that value the KL divergence value obtained after moving s positions sideways, thereby identifying the loss.
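The block-combination step can be sketched as a sum of per-window losses. Several choices here are assumptions rather than statements of the patent: the standard KL divergence definition is used for each window, `window_weight` is a hypothetical parameter for balancing the two terms in S340, and incomplete trailing windows are dropped. The sketch also evaluates the expression written in the example for the (cat, dog) window, which is the negative of the cross-entropy between the two renormalized groups and therefore comes out negative.

```python
import math


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def windows(values, size, stride):
    return [values[i:i + size]
            for i in range(0, len(values) - size + 1, stride)]


def kl_divergence(p, q):
    """Standard D(p || q) = sum_i p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def windowed_loss(target_sorted, output_sorted, size, stride):
    """Add up the per-window losses over all sliding windows."""
    total = 0.0
    for t_win, o_win in zip(windows(target_sorted, size, stride),
                            windows(output_sorted, size, stride)):
        total += kl_divergence(softmax(t_win), softmax(o_win))
    return total


def total_loss(target_sorted, output_sorted, size, stride, window_weight=1.0):
    """Global KL term plus the windowed term (weighting is hypothetical)."""
    return (kl_divergence(target_sorted, output_sorted)
            + window_weight * windowed_loss(target_sorted, output_sorted,
                                            size, stride))


target_sorted = [0.6, 0.3, 0.1]  # (cat, dog, panda)
output_sorted = [0.3, 0.5, 0.2]

# Expression from the worked example for the first (cat, dog) window.
p = softmax(target_sorted[:2])
q = softmax(output_sorted[:2])
text_expression = sum(pi * math.log(qi) for pi, qi in zip(p, q))

loss = windowed_loss(target_sorted, output_sorted, size=2, stride=1)
combined = total_loss(target_sorted, output_sorted, size=2, stride=1)
print(text_expression, loss, combined)
```

The KL form and the cross-entropy form of each window term differ only by the entropy of the target group, which is constant with respect to the model output, so both lead to the same gradient for the output distribution.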

FIGS. 4A to 4D are diagrams illustrating the computation results of the self-knowledge distillation method and device using cross-entropy according to an embodiment of the present invention.

Referring to FIG. 4A, the figure shows the result of averaging the data of the CIFAR-100 dataset per class in a model trained by the self-knowledge distillation method and device using cross-entropy according to one embodiment, and then computing the Pearson correlation coefficient. For ease of observation, child classes sharing the same parent class may be placed adjacent to one another, and the figure may show that the correlation between classes with different parents is lowered while the correlation between classes with the same parent is raised, increasing the expressiveness of the data.
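A minimal sketch of this analysis is given below, assuming that "the data" averaged per class are per-sample model output vectors. The source does not specify which vectors are averaged, and the toy values and helper names here are hypothetical.

```python
import math


def class_means(vectors, labels):
    """Average the vectors belonging to each class label."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


# Toy per-sample output vectors (assumed values) with their class labels.
vectors = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1], [0.2, 0.7, 0.1], [0.1, 0.1, 0.8]]
labels = ["cat", "cat", "dog", "panda"]

means = class_means(vectors, labels)
names = sorted(means)
# Class-by-class correlation matrix, as would be drawn as a heat map.
matrix = [[pearson(means[a], means[b]) for b in names] for a in names]
print(names)
print(matrix)
```

Ordering `names` so that sibling classes are adjacent, as the description suggests, would make blocks of high within-parent correlation visible along the diagonal of the matrix.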

Referring to FIG. 4B, the figure shows t-SNE results on the CIFAR-100 dataset for a model trained by the self-knowledge distillation method and device using cross-entropy according to one embodiment. For ease of observation, results may be shown for a total of 10 classes, and, for fairness of the experiment, all hyperparameters used for t-SNE may be set identically.

Referring to FIGS. 4C and 4D, FIG. 4C shows the results of the CIFAR-100 experiment, and FIG. 4D shows the results of the Fine-grained Image Classification experiments on CUB-200, Stanford Cars, and FGVC-Aircraft. By using a model trained by the self-knowledge distillation method and device using cross-entropy according to one embodiment, a model that is small in size yet high in performance can be produced, which can improve the user experience on mobile devices or in edge computing environments.

Various embodiments of the present invention may be implemented as software including one or more instructions stored in a storage medium (e.g., a memory) readable by a machine (e.g., a vehicle-generated data recorder or a computer). For example, a processor of the machine may call at least one of the one or more instructions stored in the storage medium and execute it. This enables the machine to be operated to perform at least one function according to the at least one instruction called. The one or more instructions may include code generated by a compiler or code executable by an interpreter. The machine-readable storage medium may be provided in the form of a non-transitory storage medium. Here, 'non-transitory storage medium' only means that the medium is a tangible device and does not include a signal (e.g., an electromagnetic wave); the term does not distinguish between the case where data is stored semi-permanently in the storage medium and the case where data is stored temporarily. For example, a 'non-transitory storage medium' may include a buffer in which data is temporarily stored.

According to one embodiment, methods according to the various embodiments disclosed in this specification may be provided as part of a computer program product. A computer program product may be traded between a seller and a buyer as a commodity. The computer program product may be distributed in the form of a machine-readable storage medium (e.g., compact disc read only memory (CD-ROM)), or may be distributed online (e.g., downloaded or uploaded) through an application store (e.g., Play Store™) or directly between two user devices (e.g., smartphones). In the case of online distribution, at least a portion of the computer program product (e.g., a downloadable app) may be at least temporarily stored in, or temporarily created in, a machine-readable storage medium such as the memory of a manufacturer's server, an application store's server, or a relay server.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, a person of ordinary skill in the art to which the present invention pertains will understand that the present invention may be implemented in other specific forms without changing its technical idea or essential features. Therefore, the embodiments described above should be understood as illustrative in all respects and not restrictive.

Claims

A process of identifying a target distribution considering one-hot encoded label information;
A process of grouping and normalizing each of the target distribution and output distribution in which the label information is considered into at least two or more classes; and
identifying a loss between a first group of target distributions in which the label information is considered and a second group of output distributions corresponding to the first group;
Including,
Self-knowledge distillation method using cross-entropy.

According to claim 1,
The process of grouping and normalizing into at least two or more classes is,
Including a process of sorting the target distribution and the output distribution in which the label information is linearly combined by the same standard,
Self-knowledge distillation method using cross-entropy.

According to claim 1,
The process of grouping and normalizing into at least two or more classes is,
A process of identifying a representative value by taking a softmax function for each of the target distribution and output distribution in which the label information grouped into the at least two classes is considered, comprising:
Self-knowledge distillation method using cross-entropy.

According to claim 3,
The process of identifying the loss is:
A process of applying a loss function to a representative value by taking a softmax function to each of the target distribution and the output distribution in which the label information grouped into the at least two classes is considered,
The loss function includes the Kullback-Leibler divergence (KLD) function,
Self-knowledge distillation method using cross-entropy.

a first classifier that identifies a target distribution based on previously input data;
a second classifier to which a softmax function different from the pre-learning classifier is applied; and
Identify the target distribution considering one-hot encoded label information,
Grouping and normalizing each of the target distribution and output distribution in which the label information is considered into at least two or more classes,
a control unit that identifies loss between a first group of target distributions considering the label information and a second group of output distributions corresponding to the first group;
Including,
Self-knowledge distillation device using cross-entropy.

According to claim 5,
The control unit,
Sorting the target distribution and output distribution in which the label information is linearly combined by the same standard,
Self-knowledge distillation device using cross-entropy.

According to claim 5,
The control unit,
Identifying a representative value by taking a softmax function for each of the target distribution and output distribution in which the label information grouped into the at least two classes is considered,
Self-knowledge distillation device using cross-entropy.

According to claim 7,
The control unit,
A softmax function is taken from each of the target distribution and output distribution in which the label information grouped into at least two classes is considered, and a loss function is applied to the representative value,
The loss function includes the Kullback-Leibler divergence (KLD) function,
Self-knowledge distillation device using cross-entropy.