KR20180134740A

KR20180134740A - Electronic apparatus and method for optimizing of trained model

Info

Publication number: KR20180134740A
Application number: KR1020180010938A
Authority: KR
Inventors: 황성주; 김주용; 김건희; 박유군
Original assignee: 한국과학기술원
Priority date: 2017-06-09
Filing date: 2018-01-29
Publication date: 2018-12-19
Also published as: KR102139740B1; KR102139729B1; KR102102772B1; KR20180134738A; KR20180134739A

Abstract

An electronic apparatus is disclosed. The electronic apparatus includes: a memory storing a trained model composed of a plurality of layers; and a processor that initializes a parameter matrix and a plurality of split variables of the trained model, computes the plurality of split variables and a new parameter matrix having a block-diagonal form for the trained model so as to minimize an objective function that includes a loss function for the trained model, a weight decay regularization term, and a split regularization term defined by the parameter matrix and the plurality of split variables, vertically splits the plurality of layers into groups based on the computed split variables, and reconstructs the trained model using the computed new parameter matrix as the parameters of the vertically split layers.

Description

ELECTRONIC APPARATUS AND METHOD FOR OPTIMIZING OF TRAINED MODEL

The present disclosure relates to an electronic apparatus and a method for optimizing a trained model, and more particularly, to an electronic apparatus and a method that can optimize a trained model by automatically dividing each layer of the trained model into semantically related groups and enabling model parallelization.

A deep neural network is a machine learning technique that has brought large performance gains in fields such as computer vision, speech recognition, and natural language processing. A deep neural network consists of sequential operations over multiple layers, such as fully connected layers and convolutional layers.

Because each layer, expressed as a matrix multiplication, requires a large amount of computation, deep neural networks demand heavy computation and large-capacity model parameters for both training and execution.

However, when the model or task becomes very large, for example when classifying tens of thousands of object classes, or when real-time object detection is required, this heavy computation limits the use of deep neural networks.

Accordingly, conventional approaches have reduced the number of model parameters or accelerated training and execution of the model through data parallelization using distributed machine learning.

However, these approaches either reduce the number of parameters while keeping the network structure or reduce computation time by using many computing devices; they do not improve the intrinsic structure of the deep neural network.

That is, a conventional deep neural network consists of sequential operations over single, large layers. If these operations are divided among multiple computing devices, communication between the devices creates an even greater time bottleneck, so the computation for one input has to be performed on a single device.

Accordingly, an object of the present disclosure is to provide an electronic apparatus and a method for optimizing a trained model that can optimize the trained model by automatically dividing each layer of the trained model into semantically related groups and enabling model parallelization.

To achieve the above object, a method for optimizing a trained model according to the present disclosure includes: initializing a parameter matrix and a plurality of split variables of a trained model composed of a plurality of layers; computing the plurality of split variables and a new parameter matrix having a block-diagonal form for the trained model so as to minimize an objective function that includes a loss function for the trained model, a weight decay regularization term, and a split regularization term defined by the parameter matrix and the plurality of split variables; and vertically splitting the plurality of layers into groups based on the computed split variables and reconstructing the trained model using the computed new parameter matrix as the parameters of the vertically split layers.

In this case, the initializing may include initializing the parameter matrix randomly and initializing the plurality of split variables so that they are not uniform with respect to one another.

Meanwhile, the computing may use stochastic gradient descent so that the objective function is minimized.

Meanwhile, the split regularization term may include a group weight regularization term that suppresses connections between groups and keeps only connections within a group active, a disjoint group regularization term that makes the groups orthogonal to one another, and a balanced group regularization term that prevents any one group from becoming excessively large.

Meanwhile, the method may further include: computing a second new parameter matrix for the reconstructed trained model so as to minimize a second objective function that includes only the loss function for the trained model and the weight decay regularization term; and optimizing the trained model using the computed second new parameter matrix as the parameters of the vertically split layers.

In this case, the method may further include processing each of the vertically split layers in the optimized trained model in parallel using different processors.

Meanwhile, an electronic apparatus according to the present disclosure includes: a memory storing a trained model composed of a plurality of layers; and a processor that initializes a parameter matrix and a plurality of split variables of the trained model, computes the plurality of split variables and a new parameter matrix having a block-diagonal form for the trained model so as to minimize an objective function that includes a loss function for the trained model, a weight decay regularization term, and a split regularization term defined by the parameter matrix and the plurality of split variables, vertically splits the plurality of layers into groups based on the computed split variables, and reconstructs the trained model using the computed new parameter matrix as the parameters of the vertically split layers.

In this case, the processor may initialize the parameter matrix randomly and initialize the plurality of split variables so that they are not uniform with respect to one another.

Meanwhile, the processor may use stochastic gradient descent so that the objective function is minimized.

Meanwhile, the processor may compute a second new parameter matrix for the reconstructed trained model so as to minimize a second objective function that includes only the loss function for the trained model and the weight decay regularization term, and may optimize the trained model using the computed second new parameter matrix as the parameters of the vertically split layers.

Meanwhile, in a computer-readable recording medium containing a program for executing a method for optimizing a trained model in an electronic apparatus according to the present disclosure, the method includes: initializing a parameter matrix and a plurality of split variables of a trained model composed of a plurality of layers; computing the plurality of split variables and a new parameter matrix having a block-diagonal form for the trained model so as to minimize an objective function that includes a loss function for the trained model, a weight decay regularization term, and a split regularization term defined by the parameter matrix and the plurality of split variables; and vertically splitting the plurality of layers into groups based on the computed split variables and reconstructing the trained model using the computed new parameter matrix as the parameters of the vertically split layers.

As described above, according to various embodiments of the present disclosure, the layers of a trained model can be automatically divided into multiple layers, which reduces the amount of computation, reduces the number of parameters, and enables model parallelization.

FIG. 1 is a block diagram illustrating a simple configuration of an electronic apparatus according to an embodiment of the present disclosure;
FIG. 2 is a block diagram illustrating a detailed configuration of an electronic apparatus according to an embodiment of the present disclosure;
FIG. 3 is a diagram for explaining a tree-structured network;
FIG. 4 is a diagram for explaining group assignment and group weight regularization operations;
FIG. 5 is a diagram visualizing weights to which regularization has been applied;
FIG. 6 is a diagram illustrating the algorithm for splitting a trained model according to the present disclosure;
FIG. 7 is a diagram illustrating an example of splitting the output of one group;
FIG. 8 is a diagram illustrating an example of optimization results for three consecutive upper layers;
FIG. 9 is a diagram illustrating benchmarks of a trained model to which the optimization method is applied;
FIG. 10 is a diagram illustrating the effect of balanced group regularization;
FIG. 11 is a diagram illustrating the test error of each of several algorithms on the CIFAR-100 data set;
FIG. 12 is a diagram illustrating a comparison of parameter (or computation) reduction and test error on the CIFAR-100 data set;
FIG. 13 is a diagram illustrating which group the subclasses of 20 superclasses belong to;
FIGS. 14 and 15 are diagrams illustrating a comparison of parameter (or computation) reduction and test error on the ILSVRC2012 data set;
FIG. 16 is a flowchart for explaining a method for optimizing a trained model according to an embodiment of the present disclosure; and
FIG. 17 is a flowchart for explaining a method for splitting a trained model according to an embodiment of the present disclosure.

The terms used in this specification are briefly described below, and then the present disclosure is described in detail.

The terms used in the embodiments of the present disclosure were selected, as far as possible, from general terms currently in wide use, taking into account their functions in the present disclosure; however, they may vary with the intent of those skilled in the art, legal precedent, the emergence of new technology, and the like. In certain cases the applicant has chosen terms arbitrarily, in which case their meaning is set out in detail in the relevant part of the description. Accordingly, the terms used in the present disclosure should be defined based on their meaning and the overall content of the present disclosure, not simply on their names.

The embodiments of the present disclosure may be modified in various ways and may take various forms; specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the scope to particular embodiments, and the disclosure should be understood to cover all modifications, equivalents, and substitutes falling within the disclosed spirit and technical scope. In describing the embodiments, detailed descriptions of related known art are omitted where they would obscure the gist.

Terms such as "first" and "second" may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another.

A singular expression includes the plural unless the context clearly indicates otherwise. In this application, terms such as "include" or "consist of" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In the embodiments of the present disclosure, a 'module' or 'unit' performs at least one function or operation and may be implemented in hardware, in software, or in a combination of the two. A plurality of 'modules' or 'units' may be integrated into at least one module and implemented by at least one processor, except for a 'module' or 'unit' that needs to be implemented in specific hardware.

Embodiments of the present disclosure are described in detail below with reference to the accompanying drawings so that those of ordinary skill in the art to which the present disclosure pertains can readily practice them. The present disclosure may, however, be implemented in many different forms and is not limited to the embodiments described here. In the drawings, parts unrelated to the description are omitted for clarity, and similar parts are given similar reference numerals throughout the specification.

The present disclosure is described in more detail below with reference to the drawings.

FIG. 1 is a block diagram illustrating a simple configuration of an electronic apparatus according to an embodiment of the present disclosure.

Referring to FIG. 1, the electronic apparatus 100 may include a memory 110 and a processor 120. Here, the electronic apparatus 100 may be a PC, a notebook PC, a server, or the like capable of data computation.

The memory 110 stores a trained model composed of a plurality of layers. Here, the trained model is a model trained using an artificial intelligence algorithm and may also be referred to as a network. The artificial intelligence algorithm may be a deep neural network (DNN), a deep convolutional neural network, a residual network, or the like.

The memory 110 may store a training data set for optimizing the trained model, and may also store data to be classified or recognized using the trained model.

In addition, the memory 110 may store a program needed to optimize the trained model, or may store the trained model optimized by that program.

The memory 110 may be implemented as a storage medium inside the electronic apparatus 100 or as an external storage medium, for example a removable disk including a USB memory, a storage medium connected to a host, or a web server accessed over a network.

The processor 120 controls each component in the electronic apparatus 100. Specifically, when a boot command is input by the user, the processor 120 may boot using the operating system stored in the memory 110.

The processor 120 may receive a selection of the trained model to be optimized through the operation input unit 150, described later, and may receive various parameters for optimizing the selected trained model through the operation input unit 150. The parameters received here may be the number of groups to split into, hyperparameters, and the like.

Upon receiving this information, the processor 120 may group the layers of the selected trained model into a plurality of groups based on the input/output features of each layer and reconstruct the model as a trained model having a tree structure.

Specifically, the processor 120 may initialize a parameter matrix and a plurality of split variables of the trained model composed of a plurality of layers. The processor 120 may initialize the parameter matrix randomly and initialize the plurality of split variables close to uniform values. Here, the parameter matrix means the parameter matrix of one layer of the trained model, and the plurality of split variables may include a feature-group split variable and a class-group split variable. The split variables may take a matrix form corresponding to the parameter matrix.
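The initialization described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `init_layer`, the shapes of `P` and `Q` (one row per feature or class, one column per group), the softmax parameterization, and the noise scale are all assumptions made for the example.

```python
import numpy as np

def init_layer(num_features, num_classes, num_groups, seed=0, noise=0.01):
    """Randomly initialise W and start both split variables close to uniform."""
    rng = np.random.default_rng(seed)
    # Parameter matrix of one layer: rows are input features, columns are classes.
    W = rng.normal(0.0, 0.01, size=(num_features, num_classes))

    def near_uniform(n):
        # Softmax over small random logits (an assumed parameterisation):
        # every row is close to 1/G but not exactly uniform.
        logits = noise * rng.standard_normal((n, num_groups))
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    P = near_uniform(num_features)  # feature-group split variable
    Q = near_uniform(num_classes)   # class-group split variable
    return W, P, Q
```

Starting slightly away from the exactly uniform point gives the optimizer a direction in which to break the symmetry between groups.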

The processor 120 may then compute the plurality of split variables and a new parameter matrix having a block-diagonal form for the trained model so as to minimize an objective function that includes a loss function for the trained model, a weight decay regularization term, and a split regularization term defined by the parameter matrix and the plurality of split variables. The processor 120 may use stochastic gradient descent to minimize the objective function. Here, the objective function is a function for optimizing the cross-entropy loss and the group regularization simultaneously, and may be expressed as Equation 1. The details of the objective function are described later with reference to FIG. 3.
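Elsewhere the disclosure states that the split regularization term may combine a group weight term, a disjoint group term, and a balanced group term. The sketch below shows one plausible way such an objective could be assembled. Equation 1 is not reproduced in this part of the text, so the specific formula for each term, the coefficients `g1`, `g2`, `g3`, and the shapes of `P` and `Q` are assumptions chosen only to match the stated purpose of each term.

```python
import numpy as np

def split_regularizer(W, P, Q, g1=1.0, g2=1.0, g3=1.0):
    """Illustrative split regulariser; W is (D, K), P is (D, G), Q is (K, G)."""
    G = P.shape[1]
    # Group weight term (assumed form): penalise weights linking a feature and
    # a class assigned to different groups. same[d, k] = sum_g P[d, g] * Q[k, g].
    same = P @ Q.T
    r_weight = np.sum((1.0 - same) * W ** 2)
    # Disjoint group term (assumed form): penalise overlap between the
    # assignment vectors of different groups, pushing them toward orthogonality.
    r_disjoint = 0.0
    for i in range(G):
        for j in range(i + 1, G):
            r_disjoint += P[:, i] @ P[:, j] + Q[:, i] @ Q[:, j]
    # Balanced group term (assumed form): sum of squared group sizes, which is
    # smallest when all groups have the same size.
    r_balance = np.sum(P.sum(axis=0) ** 2) + np.sum(Q.sum(axis=0) ** 2)
    return g1 * r_weight + g2 * r_disjoint + g3 * r_balance

def objective(loss, W, P, Q, weight_decay=5e-4):
    """Loss + weight decay + split regularisation, as listed in the text."""
    return loss + weight_decay * np.sum(W ** 2) + split_regularizer(W, P, Q)
```

With hard one-hot assignments and a weight matrix that is already block-diagonal under those assignments, the group weight and disjoint terms in this sketch vanish, and only the balance term remains.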

The processor 120 may then vertically split the plurality of layers into groups based on the computed split variables, and reconstruct the trained model using the computed new parameter matrix as the parameters of the vertically split layers.
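The vertical split can be pictured as reading hard group assignments off the split variables and keeping only the within-group block of the parameter matrix for each group. The following is a hedged sketch: `split_layer`, the use of `argmax` to harden the assignments, and the `(rows, cols, block)` return format are assumptions for illustration.

```python
import numpy as np

def split_layer(W, P, Q):
    """Split one layer's (D, K) weights into per-group blocks via hard assignments."""
    feat_group = P.argmax(axis=1)   # group index of each input feature
    class_group = Q.argmax(axis=1)  # group index of each output class
    blocks = []
    for g in range(P.shape[1]):
        rows = np.flatnonzero(feat_group == g)
        cols = np.flatnonzero(class_group == g)
        # Keep only within-group weights; cross-group weights are dropped.
        blocks.append((rows, cols, W[np.ix_(rows, cols)]))
    return blocks
```

If the rows and columns of `W` were reordered group by group, the retained blocks would sit on the diagonal, which corresponds to the block-diagonal form referred to in the text.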

The processor 120 may then compute a second new parameter matrix for the reconstructed trained model so as to minimize a second objective function that includes only the loss function for the trained model and the weight decay regularization term, and may optimize the trained model using the computed second new parameter matrix as the parameters of the vertically split layers.

The processor 120 may perform various kinds of processing, such as vision recognition, speech recognition, and natural language processing, using the optimized trained model. Specifically, if the trained model relates to image classification, the processor 120 may classify an input image using the optimized trained model and the input image.

In doing so, the processor 120 may classify the input image using a plurality of processor cores, or together with another electronic apparatus. Specifically, because a trained model optimized according to the present disclosure has a vertically split tree structure, the computation for one split group can be carried out on one computing device and the computation for another group on a different computing device.
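To illustrate why the split groups can be evaluated independently, the sketch below runs each group's matrix product in its own worker and writes the results into disjoint output columns. Threads stand in for the separate computing devices mentioned in the text; the function name `forward_split` and the `(rows, cols, W_g)` block format are assumptions for the example.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def forward_split(x, blocks, num_classes):
    """Evaluate a split layer on a batch x of shape (N, D), one worker per group.

    blocks is a list of (rows, cols, W_g): the feature indices, class indices,
    and weight block belonging to each group.
    """
    def run(block):
        rows, cols, W_g = block
        # Each group reads only its own features, so no data is exchanged
        # between workers during the computation.
        return cols, x[:, rows] @ W_g

    out = np.zeros((x.shape[0], num_classes))
    with ThreadPoolExecutor(max_workers=len(blocks)) as pool:
        for cols, y in pool.map(run, blocks):
            out[:, cols] = y
    return out
```

The result equals a single product with the full block-diagonal matrix, but each worker touches only its own block.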

As described above, the electronic apparatus 100 according to this embodiment clusters the classes of the trained model into groups, each fitted to its own exclusive set of features. As a result, the computation of a single optimized trained model can be divided among multiple devices without a communication bottleneck, and because the amount of computation and the number of parameters are reduced, computation is faster even on a single device. In addition, because the electronic apparatus 100 uses an objective function such as Equation 1, the splitting is fully integrated into the network training procedure, so the network weights and the split can be learned simultaneously.

In the description of FIG. 1, the number of groups is received from the user and the trained model is split into that number of groups. In an implementation, however, a predetermined algorithm may first be used to find the optimal number of groups for the trained model, and the model may then be split based on the number found.

Only the simple components of the electronic apparatus have been illustrated and described above; in an implementation, various additional components may be provided. These are described below with reference to FIG. 2.

FIG. 2 is a block diagram illustrating a detailed configuration of an electronic apparatus according to an embodiment of the present disclosure.

Referring to FIG. 2, the electronic apparatus 100 may include a memory 110, a processor 120, a communication unit 130, a display 140, and an operation input unit 150.

The operations of the memory 110 and the processor 120 have been described with reference to FIG. 1, so a redundant description is omitted.

The communication unit 130 is connected to another electronic apparatus and may receive a trained model and/or training data from it. The communication unit 130 may also transmit to another electronic apparatus the data needed for distributed computation with that apparatus.

In addition, the communication unit 130 may receive information to be processed using the trained model and may provide the processing result to the corresponding device. For example, if the trained model is a model that classifies images, the communication unit 130 may receive an image to be classified and transmit information on the classification result to the device that sent the image.

The communication unit 130 is configured to connect the electronic apparatus 100 to an external device, and may connect to a terminal device not only through a local area network (LAN) or the Internet but also through a USB (Universal Serial Bus) port or a wireless communication (for example, WiFi 802.11a/b/g/n, NFC, Bluetooth) port.

The display 140 displays various information provided by the electronic apparatus 100. Specifically, the display 140 may display a user interface window for selecting the various functions provided by the electronic apparatus 100. The user interface window may include items for selecting the trained model to be optimized or for entering the parameters to be used in the optimization process.

The display 140 may be a monitor such as an LCD, CRT, or OLED, and may also be implemented as a touch screen that simultaneously performs the function of the operation input unit 150, described later.

In addition, the display 140 may display information on the results of testing with the trained model. For example, if the trained model is a model that classifies images, the display 140 may display the classification result for an input image.

The operation input unit 150 may receive from the user the training data on which optimization is to be performed and the various parameters to be used in the optimization process.

The operation input unit 150 may be implemented as a plurality of buttons, a keyboard, a mouse, or the like, and may also be implemented as a touch screen that simultaneously performs the function of the display 140 described above.

Although FIGS. 1 and 2 are illustrated and described with the electronic apparatus 100 including only one processor, the electronic apparatus may include a plurality of processors, and a GPU may be used as well as a general-purpose CPU. Specifically, the optimization operation described above may be performed using a plurality of GPUs.

Hereinafter, the reason why the learning model can be optimized as described above is explained in detail.

The more image classification tasks there are, the more readily semantically similar classes can be divided into disjoint groups, each of which uses only the same kinds of features.

For example, the high-level features used to classify animal classes such as dogs and cats may differ from those used to classify objects into classes such as trucks and airplanes. Low-level features such as dots, stripes, and colors, however, may be usable by all classes.

This means that an artificial neural network can operate more efficiently with a tree-shaped structure in which the lower layers are shared by all groups and, toward the upper layers, the features are split according to semantically distinct groups of classes. In other words, classes can be clustered into mutually exclusive groups according to the features they use.

Through such clustering, the artificial neural network not only uses fewer parameters but also allows model parallelization, in which each group is executed on a different computing device.

However, if each layer of the network is split into arbitrary groups, that is, if semantically similar classes and features are not grouped together, performance may degrade.

Accordingly, in the present disclosure, the inputs and outputs of each layer are divided into several semantically related groups, and the layer is split vertically on that basis, so that the amount of computation and the number of parameters can be reduced without degrading performance. This also makes model parallelization possible.

To this end, the present disclosure groups the semantically related inputs and outputs of each layer and removes the portions of the layer's parameter matrix that correspond to connections between groups.

For this operation, split variables that assign the inputs and outputs of each layer to groups are newly introduced, together with an additional regularization function that automatically divides semantically similar inputs and outputs into groups while suppressing connections between groups, thereby making it possible to split the layer.

FIG. 3 is a diagram for explaining a tree-structured network.

Referring to FIG. 3, a base learning model 310 consists of a plurality of layers.

To split the plurality of layers, the network weights are optimized together with the class-to-group and feature-to-group assignments. Specifically, split variables are introduced for the input and output nodes of a layer of the base learning model 310, and the loss function of the original task is minimized on the task's training data while three additional regularization terms are introduced, so that their sum is minimized.

Based on the result, the group to which each node belongs is determined, and an optimized learning model 330 can be generated on the basis of that determination.

Given a base network, the final goal of the present disclosure is to obtain a tree-structured network that contains a set of subnetworks, or layers, each associated with a particular class group, such as the optimized learning model 330 of FIG. 3.

Grouping heterogeneous classes together may cause duplicate features to be learned in several groups and thus waste network capacity. An optimization method for splitting classes should therefore ensure that the classes within each group share as many features as possible.

Accordingly, to maximize the utility of the split, classes should be clustered so that each group uses a subset of features completely different from those used by the other groups.

The most direct way to obtain such mutually exclusive class groups is to use a semantic taxonomy, since similar classes are likely to share features. In practice, however, such a taxonomy may not be available, or may not match the actual hierarchical grouping implied by the features each class uses.

Another approach is to perform (hierarchical) clustering on the weights learned by the original network. This approach is also based on actual feature usage. However, the groups are likely to overlap and the network must be trained twice, so the approach is inefficient, and the resulting clusters may not be optimal.

Therefore, the following describes how to assign each class and feature exclusively to disjoint groups while simultaneously learning the network weights within a deep learning framework.

In the following, the data set is assumed to be $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^d$ is an input data instance and $y_i \in \{1, \dots, K\}$ is its class label over K classes.

Learning in an artificial neural network means training a network that has weights $W^{(l)}$ at each layer $l$. Here, $W^{(l)}$ is a block-diagonal matrix, and each block $W_g^{(l)}$ is associated with a class group $g \in \mathcal{G}$, where $\mathcal{G}$ is the set of all groups.

This block-diagonal structure ensures that each disjoint group of classes has its own associated features, which no other group uses. Accordingly, the network can be split into a plurality of class groups for faster computation and parallel processing.

To obtain such a block-diagonal weight matrix $W^{(l)}$, the present disclosure uses a new splitting algorithm that learns the feature-to-group and class-to-group assignments in addition to the network weights. This splitting algorithm is referred to below as SplitNet (or deep split).

The splitting method is first described for the parameters used in a softmax classifier; how to apply it to a DNN is described later.

$p_{gi}$ is a binary variable indicating whether feature $i$ is assigned to group $g$, and $q_{gj}$ is a binary variable indicating whether class $j$ is assigned to group $g$.

$p_g$ is defined as the feature group assignment vector for group $g$, where $p_g \in \{0, 1\}^D$ and D is the dimension of the features.

Similarly, $q_g \in \{0, 1\}^K$ is defined as the class group assignment vector for group $g$. That is, $p_g$ and $q_g$ together define group $g$: $p_g$ indicates the feature dimensions associated with the group, and $q_g$ indicates the set of classes assigned to the group.

It is assumed that the groups do not overlap in either features or classes. That is, $\sum_g p_g = \mathbf{1}_D$ and $\sum_g q_g = \mathbf{1}_K$, where $\mathbf{1}_D$ and $\mathbf{1}_K$ are vectors whose elements are all one.

This assumption imposes a strict rule on group assignment. In return, because each class is assigned to exactly one group and each group depends on a disjoint subset of the features, the weight matrix can be arranged into a block-diagonal matrix. This greatly reduces the number of parameters, and at the same time the multiplication $W^{\top} x_i$ can be decomposed into smaller and faster block matrix multiplications.

The objective function to be optimized in the present disclosure can be defined as Equation 1 below.

$$\min_{W, P, Q} \; \mathcal{L}(W, X, y) + \lambda \lVert W \rVert_2^2 + \Omega(W, P, Q) \qquad (1)$$

Here, $\mathcal{L}(W, X, y)$ is the cross-entropy loss on the training data, $W$ is the weight tensor, $P$ and $Q$ are the feature-to-group and class-to-group assignment matrices, $\lambda \lVert W \rVert_2^2$ is the weight decay regularization term with hyperparameter $\lambda$, and $\Omega(W, P, Q)$ is the regularization term for splitting the network.

The newly introduced regularization term Ω, which finds disjoint group assignments automatically without external semantic information, is described below.

The aim of Equation 1 is to jointly optimize, by gradient descent, the weights of each layer, starting from the full weight matrix, and the unknown group assignment of each weight.

By optimizing the cross-entropy loss and the group regularization together, an appropriate grouping is obtained automatically and the connections between groups are removed.

Once the grouping has been learned, the weight matrix can be explicitly split into block-diagonal matrices to reduce the number of parameters, which allows much faster inference. In the following, the number of groups G into which each layer is split is assumed to be given.

Hereinafter, a method of splitting the weight matrix of a layer into a plurality of groups is described with reference to FIG. 4.

FIG. 4 is a diagram for explaining the group assignment and group weight regularization operations, and FIG. 5 is a diagram visualizing the weights to which the regularization is applied.

Referring to FIG. 4, the regularization that assigns features and classes to a plurality of groups can be expressed as Equation 2 below.

$$\Omega(W, P, Q) = \gamma_1 R_W(W, P, Q) + \gamma_2 R_D(P, Q) + \gamma_3 R_E(P, Q) \qquad (2)$$

Here, $\gamma_1$, $\gamma_2$, and $\gamma_3$ are parameters that control the strength of each term. These parameters may be input by the user.

The first term, R_W, is the group weight regularization term, defined as the (2,1)-norm of the parameters on the connections between groups. Minimizing this term suppresses the connections between groups and leaves only the connections within each group active. R_W is described in more detail below.

The second term, R_D, is the disjoint group assignment term, defined by the inner products between split variables; it drives the split to be mutually exclusive. R_D is described in more detail below.

The third term, R_E, is the balanced group assignment term, defined as the square of the sum of each split variable; it keeps any one group from growing excessively large. R_E is described in more detail below.

The first regularization term, R_W, is described first.

Let the feature-to-group assignment matrix and the class-to-group assignment matrix be $P_g = \mathrm{diag}(p_g)$ and $Q_g = \mathrm{diag}(q_g)$, respectively. Then $P_g W Q_g$ represents the weight parameters associated with group $g$ (that is, the within-group connections between features and classes).

To obtain a block-diagonal weight matrix, the connections between groups must be removed, so the inter-group connections are regularized first. This regularization can be expressed as Equation 3 below.

$$R_W(W, P, Q) = \sum_g \sum_i \left\lVert \big( (I - P_g) W Q_g \big)_{i*} \right\rVert_2 + \sum_g \sum_j \left\lVert \big( P_g W (I - Q_g) \big)_{*j} \right\rVert_2 \qquad (3)$$

Here, $(\cdot)_{i*}$ and $(\cdot)_{*j}$ denote the i-th row and the j-th column of the weight matrix, respectively.

Equation 3 imposes a row-wise and column-wise $\ell_{2,1}$-norm on the inter-group connections. FIG. 5 illustrates this regularization; referring to FIG. 5, the portions of the weights to which the regularization is applied are shown in a color different from the other regions.
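The group weight regularization can be sketched in plain Python under one stated assumption about its form: that R_W sums, over all groups g, the row-wise ℓ2 norms of (I − P_g) W Q_g and the column-wise ℓ2 norms of P_g W (I − Q_g), with P_g and Q_g the diagonal matrices built from the group assignment vectors. The function name `group_weight_reg` and the toy matrices are hypothetical.

```python
import math


def group_weight_reg(W, P, Q):
    """Assumed form of R_W for a D x K matrix W.

    P is a list of G feature assignment vectors (length D);
    Q is a list of G class assignment vectors (length K).
    """
    D, K = len(W), len(W[0])
    total = 0.0
    for p, q in zip(P, Q):
        # Row norms of (I - P_g) W Q_g: features outside g feeding classes inside g.
        for i in range(D):
            total += math.sqrt(sum(((1 - p[i]) * W[i][j] * q[j]) ** 2 for j in range(K)))
        # Column norms of P_g W (I - Q_g): features inside g feeding classes outside g.
        for j in range(K):
            total += math.sqrt(sum((p[i] * W[i][j] * (1 - q[j])) ** 2 for i in range(D)))
    return total


P = [[1, 1, 0, 0], [0, 0, 1, 1]]
Q = [[1, 1, 0, 0], [0, 0, 1, 1]]

block_diag = [
    [1.0, 2.0, 0.0, 0.0],
    [3.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 5.0, 6.0],
    [0.0, 0.0, 7.0, 8.0],
]
leaky = [row[:] for row in block_diag]
leaky[0][2] = 3.0  # one connection from a group-0 feature to a group-1 class

r_clean = group_weight_reg(block_diag, P, Q)
r_leaky = group_weight_reg(leaky, P, Q)
```

With hard assignments the penalty vanishes for a block-diagonal matrix and becomes positive as soon as a single inter-group weight is nonzero, which is the behaviour the text attributes to this term.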

Regularization of this kind yields groups that are quite similar to semantic groups. One point to note is that uniform initialization of the group assignments should be avoided. For example, if $p_i = 1/G$, the objective reduces to a plain row-wise and column-wise $\ell_{2,1}$-norm, and some row or column weight vectors may vanish before any group assignment is made.

The second regularization term, R_D, is described below.

To make the numerical optimization tractable, the binary variables $p_{gi}$ and $q_{gj}$ are first relaxed to take real values in the interval [0, 1], subject to the constraints $\sum_g p_g = \mathbf{1}_D$ and $\sum_g q_g = \mathbf{1}_K$. This sum-to-one constraint can be optimized using a reduced gradient algorithm, which yields sparse solutions.

Alternatively, soft assignment may be performed by re-parameterizing $p_{gi}$ and $q_{gj}$ with the unconstrained variables $\alpha_{gi}$ and $\beta_{gj}$ in softmax form, as in Equation 4 below.

$$p_{gi} = \frac{e^{\alpha_{gi}}}{\sum_{g'} e^{\alpha_{g'i}}}, \qquad q_{gj} = \frac{e^{\beta_{gj}}}{\sum_{g'} e^{\beta_{g'j}}} \qquad (4)$$

Of the two approaches, the softmax form can achieve more semantically meaningful grouping, whereas optimization under the sum-to-one constraint often converges faster than the softmax form.
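The softmax re-parameterization can be sketched as follows, assuming each feature (or class) has one free parameter per group and the softmax is taken across groups. The name `soft_assign` and the example values are illustrative only.

```python
import math


def soft_assign(alpha):
    """Map G x D free parameters to G x D soft assignments.

    For every feature index i, the values out[g][i] lie in [0, 1] and sum to one
    over the groups g, so the sum-to-one constraint holds by construction.
    """
    G, D = len(alpha), len(alpha[0])
    out = [[0.0] * D for _ in range(G)]
    for i in range(D):
        m = max(alpha[g][i] for g in range(G))  # subtract the max for numerical stability
        e = [math.exp(alpha[g][i] - m) for g in range(G)]
        s = sum(e)
        for g in range(G):
            out[g][i] = e[g] / s
    return out


# Hypothetical parameters for G = 2 groups and D = 2 features.
alpha = [[0.0, 2.0],
         [0.0, 0.0]]
p = soft_assign(alpha)
```

Because the constraint is built into the parameterization, ordinary unconstrained gradient descent on the free parameters can be used in place of a constrained solver.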

For the group assignment vectors to be completely mutually exclusive, the vectors of different groups must be orthogonal. That is, $p_i \cdot p_j = 0$ and $q_i \cdot q_j = 0$ must hold for $i \neq j$. An orthogonality regularization term that enforces this is given by Equation 5.

$$R_D(P, Q) = \sum_{i < j} p_i \cdot p_j + \sum_{i < j} q_i \cdot q_j \qquad (5)$$

Here, the inequality $i < j$ avoids counting duplicate inner products. Because the dimensions of $p_i$ and $q_i$ may differ, the cosine similarity between group assignment vectors could be minimized instead. However, under the sum-to-one constraint and the sparsity induced by the regularization, the group assignment vectors have similar scales, so the cosine similarity reduces to the inner product.
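A minimal sketch of the disjointness penalty follows, assuming it is the sum of inner products over all pairs of distinct groups, taken once for the feature vectors and once for the class vectors. The function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def disjoint_reg(P, Q):
    """Sum of pairwise inner products between assignment vectors of different groups."""
    total = 0.0
    for vecs in (P, Q):
        for i in range(len(vecs)):
            for j in range(i + 1, len(vecs)):  # i < j counts each pair once
                total += dot(vecs[i], vecs[j])
    return total


# Hard, mutually exclusive assignment: every feature and class in exactly one group.
hard = disjoint_reg([[1, 1, 0, 0], [0, 0, 1, 1]], [[1, 0], [0, 1]])

# Fully shared soft assignment: every entry is 0.5.
shared = disjoint_reg([[0.5] * 4, [0.5] * 4], [[0.5] * 2, [0.5] * 2])
```

The penalty is zero only when no feature or class has weight in two groups at once, and it grows with the amount of overlap.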

There are some caveats in minimizing the inner products between group assignment vectors. First, the number of inner products scales quadratically with the number of groups. Second, if the values of the group assignment vectors are initialized uniformly, the gradient may be close to zero, which slows down the optimization. For example, for scalars $p_1, p_2 \in [0, 1]$ under the constraint $p_1 + p_2 = 1$, minimizing $p_1 \cdot p_2$ is the same as minimizing $p_1 \cdot (1 - p_1)$. If the variables are initialized at 0.5, the gradient is close to zero.
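The vanishing gradient in this two-variable example can be checked directly. The short derivation below treats $p_1$ as a single free scalar, which is the simplification the example itself makes.

```latex
f(p_1) = p_1 (1 - p_1), \qquad
\frac{d f}{d p_1} = 1 - 2 p_1, \qquad
\left. \frac{d f}{d p_1} \right|_{p_1 = 0.5} = 0 .
```

The point $p_1 = 0.5$ is the maximum of $f$ on $[0, 1]$, while the minima lie at $p_1 = 0$ and $p_1 = 1$, so gradient descent started exactly at the uniform value receives no signal about which way to move.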

The third regularization term, R_E, is described below.

With group splitting based only on Equation 5, one group may come to dominate all the others. That is, one group may contain all the features and classes while the other groups contain none.

Therefore, as in Equation 6 below, the squared sum of the elements of each group assignment vector is regularized so that the group assignment is constrained to be balanced.

$$R_E(P, Q) = \sum_g \left( \Big( \sum_i p_{gi} \Big)^2 + \Big( \sum_j q_{gj} \Big)^2 \right) \qquad (6)$$

Owing to the constraints $\sum_g p_g = \mathbf{1}_D$ and $\sum_g q_g = \mathbf{1}_K$, Equation 6 is minimized when the sum of the elements of each group assignment vector is equal across groups; for example, when each group has the same number of elements. Because the dimensions of the feature and class group assignment vectors may differ, the ratio between the two terms is adjusted with appropriate weights. In addition, when group weight regularization is used following batch normalization, the weights tend to shrink in magnitude while the scale parameters of the BN layer grow. To prevent this effect, the $\ell_2$-normalized weight $W / \lVert W \rVert_2$ may be used in place of $W$ within $R_W$, or the scale parameters of the BN layer may simply be deactivated.
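The balancing effect can be illustrated with a short sketch, assuming the penalty is the sum over groups of the squared total assignment mass, for features and for classes. The function name `balance_reg` and the example assignments are hypothetical.

```python
def balance_reg(P, Q):
    """Sum over groups of (total feature mass)^2 + (total class mass)^2."""
    return sum(sum(v) ** 2 for v in P) + sum(sum(v) ** 2 for v in Q)


# Both assignments satisfy the sum-to-one constraint over groups.
balanced = [[1, 1, 0, 0], [0, 0, 1, 1]]   # two elements per group
dominated = [[1, 1, 1, 1], [0, 0, 0, 0]]  # one group takes everything

r_balanced = balance_reg(balanced, balanced)
r_dominated = balance_reg(dominated, dominated)
```

For a fixed total mass, the sum of squares is smallest when the mass is spread evenly, so the dominated split is penalized twice as heavily as the balanced one in this toy case.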

The effect of the balanced group regularization is described later with reference to FIG. 10.

Hereinafter, a method of applying the objective function described above to a deep neural network is described.

The weight splitting method described above can be applied to a deep neural network (DNN). First, let $W^{(l)}$ denote the weights of the l-th layer (1 ≤ l ≤ L), where L is the total number of layers in the DNN.

A deep neural network can include two types of layers: (1) the input and hidden layers, which generate a feature vector for a given input, and (2) the output fully connected (FC) layer, whose softmax classifier yields the class probabilities.

For the weights of the output FC layer, the splitting method described above can be applied as is to split that layer. The method of the present disclosure can also be extended to multiple consecutive layers or to recursive hierarchical group assignment.

In a deep neural network, the lower-level layers learn basic representations, which can be shared by all classes. Conversely, high-level representations are likely to apply only to particular groups of classes.

Accordingly, a natural way to split a deep neural network is to split the L-th layer first and then progressively split down to the S-th layer (S ≤ l), while keeping the lower layers (l < S) shared among the class groups.

Each layer consists of input nodes and output nodes, with weights $W^{(l)}$ representing the connections between them. $p_g^{(l)}$ and $q_g^{(l)}$ are the feature group assignment vector and the class group assignment vector for the input nodes and output nodes of the l-th layer, respectively. In this sense, $P_g^{(l)} W^{(l)} Q_g^{(l)}$ represents the within-group connections for group $g$ in layer $l$.

Because the output nodes of one layer correspond to the input nodes of the next layer, the group assignment can be shared as $q_g^{(l)} = p_g^{(l+1)}$. As a result, no signal passes between different groups across layers, so forward and backward propagation in each group becomes independent of the processing of the other groups. The computation for each group can therefore be separated and parallelized. To this end, the constraint $q_g^{(l)} = p_g^{(l+1)}$ may be imposed on all layers.

The softmax computation at the output layer involves a normalization over all classes, which requires aggregating the logits across groups. At inference time, however, it is sufficient to identify the class with the largest logit. The class with the largest logit within each group can be determined independently, and computing the maximum among those requires only minimal communication and computation. Thus, except for the softmax computation at the output layer, the computation for each group can be decomposed and processed in parallel.
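The group-wise prediction step can be sketched as follows. The sketch assumes each group has already computed the logits of its own classes independently; the function name `predict_by_group` and the sample logits are made up for illustration.

```python
def predict_by_group(group_logits):
    """Pick the predicted class from per-group logits.

    group_logits is a list with one (class_ids, logits) pair per group.
    Each group reports only its local winner, so just G (logit, class) pairs
    need to be exchanged between devices.
    """
    local_best = [max(zip(logits, class_ids)) for class_ids, logits in group_logits]
    return max(local_best)[1]


# Hypothetical outputs of two groups handling classes {0, 1} and {2, 3}.
group_logits = [
    ([0, 1], [0.2, 1.5]),
    ([2, 3], [3.1, -0.4]),
]
pred = predict_by_group(group_logits)

# Reference: arg max over all logits gathered in one place.
all_pairs = [(v, c) for ids, vals in group_logits for c, v in zip(ids, vals)]
reference = max(all_pairs)[1]
```

Skipping the softmax is safe for prediction because the softmax is monotonic in the logits, so the arg max is unchanged.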

The objective function applied to a deep neural network is the same as Equations 1 and 2 described above, except that it has L sets of $P^{(l)}$ and $Q^{(l)}$, one for each layer. In addition, because the proposed group splitting scheme is similar to the scheme of convolution filters, it can also be applied to a CNN. For example, suppose the weight of a convolution layer is a 4-D tensor $W^c \in \mathbb{R}^{M \times N \times D \times K}$, where M and N are the height and width of each field, and D and K are the number of input convolution filters and the number of output convolution filters. The group $\ell_{2,1}$-norm described above can be applied along the input and output filter dimensions. To do so, the 4-D weight tensor $W^c$ can be reduced to a 2-D matrix $\widetilde{W}^c \in \mathbb{R}^{D \times K}$ using Equation 7 below.

$$\widetilde{W}^c = \{ \tilde{w}_{dk} \} = \left\{ \sqrt{\textstyle\sum_{m,n} w_{mndk}^2} \right\} \qquad (7)$$
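The reduction of a 4-D convolution weight to a 2-D matrix can be sketched as below, assuming each entry of the 2-D matrix is the ℓ2 norm of the spatial kernel that connects one input filter to one output filter. The function name `collapse_conv_weight` and the tiny tensor are illustrative assumptions.

```python
import math


def collapse_conv_weight(Wc):
    """Reduce Wc[m][n][d][k] (M x N x D x K) to a D x K matrix of kernel norms."""
    M, N = len(Wc), len(Wc[0])
    D, K = len(Wc[0][0]), len(Wc[0][0][0])
    return [
        [
            math.sqrt(sum(Wc[m][n][d][k] ** 2 for m in range(M) for n in range(N)))
            for k in range(K)
        ]
        for d in range(D)
    ]


# Hypothetical 1 x 2 kernel with D = 2 input filters and K = 1 output filter.
Wc = [[
    [[3.0], [0.0]],  # spatial position (0, 0): weights indexed [d][k]
    [[4.0], [0.0]],  # spatial position (0, 1)
]]
W2d = collapse_conv_weight(Wc)
```

A zero entry in the collapsed matrix means the whole kernel between that input and output filter is zero, so the group penalty applied to the 2-D matrix removes entire filter-to-filter connections at once.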

Next, the weight regularization for the convolution weights can be obtained by applying Equation 7 to Equation 3 described above.

The method of the present disclosure is also applicable to a residual network, by sharing the group assignment across the nodes connected by a shortcut connection. Specifically, this takes into account that a residual network bypasses two convolution layers with a shortcut connection.

Assume that $W^{(l_1)}$ and $W^{(l_2)}$ are the weights of the two convolution layers, and that $p_g^{(l_1)}, q_g^{(l_1)}, p_g^{(l_2)}, q_g^{(l_2)}$ are the group assignment vectors for each layer, with $q_g^{(l_1)} = p_g^{(l_2)}$. Because the identity shortcut links the input nodes of the first convolution layer to the output nodes of the second convolution layer, the grouping of these nodes can be shared as $p_g^{(l_1)} = q_g^{(l_2)}$.

Hereinafter, hierarchical grouping is described.

A semantic hierarchy of classes often exists. For example, a dog group and a cat group are both subgroups of mammals. In this respect, the deep split described above can be extended to obtain a multi-level hierarchy of categories. For simplicity, consider a two-level tree in which each supergroup contains a set of subgroups; extending this to a hierarchy of arbitrary depth is straightforward.

Assume that the grouping branches at the l-th layer and that the output nodes of the l-th layer are grouped by G supergroup assignment vectors $q_g^{(l)}$ with $\sum_g q_g^{(l)} = \mathbf{1}$.

Assume further that, in the next layer, each supergroup has subgroups $s = 1, \dots, S$ with corresponding subgroup assignment vectors $p_{gs}^{(l+1)}$ satisfying $\sum_{g,s} p_{gs}^{(l+1)} = \mathbf{1}$. As described above, the input nodes of the (l+1)-th layer correspond to the output nodes of the l-th layer. Therefore, $p_g^{(l+1)} = \sum_s p_{gs}^{(l+1)}$ can be defined, which maps the subgroup assignments to the corresponding supergroup assignment. Then, as in the deep split, the constraint $q_g^{(l)} = p_g^{(l+1)}$ is imposed.
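The mapping from subgroup assignments to supergroup assignments can be sketched in a few lines, assuming a supergroup's assignment vector is the element-wise sum of the vectors of its subgroups. The function name `supergroup_assignments` and the example vectors are hypothetical.

```python
def supergroup_assignments(sub):
    """sub[g][s] is the assignment vector of subgroup s inside supergroup g.

    Returns one vector per supergroup, obtained by summing its subgroup vectors.
    """
    return [[sum(col) for col in zip(*subs)] for subs in sub]


# Two supergroups with two subgroups each, over four nodes.
sub = [
    [[1, 0, 0, 0], [0, 1, 0, 0]],  # supergroup 0
    [[0, 0, 1, 0], [0, 0, 0, 1]],  # supergroup 1
]
sup = supergroup_assignments(sub)

# The supergroup vectors still assign every node to exactly one supergroup.
column_sums = [sum(col) for col in zip(*sup)]
```

Tying the previous layer's output assignment to these summed vectors makes every subgroup read only from the nodes of its own supergroup, which gives the branching tree shape.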

Meanwhile, one way to build such a structure in a CNN is to branch each group into two subgroups whenever the number of convolution filters doubles.

Hereinafter, parallelization of SplitNet is described.

The method according to the present disclosure can generate a tree-structured network made up of subnetworks with no connections between groups. This enables model parallelism by assigning each resulting subnetwork to its own processor. In implementation, it suffices to assign the lower layers and the per-group upper layers to each node.

Although this causes redundant computation of the lower layers, the test time for the lower layers is unchanged, so the approach is acceptable. Training time can also be parallelized.

FIG. 6 is a diagram showing the splitting algorithm for the learning model of the present disclosure. FIG. 7 is a diagram showing an example in which the output of one group is split.

Referring to FIG. 6, first, the neural network parameters may be pretrained parameters or may be randomly initialized, and the split variables may be initialized close to the uniform value ($1/G$).

Next, the neural network parameters and the split variables are optimized by stochastic gradient descent so as to minimize the task loss function and the weight decay regularization term together with the regularization terms described above.

The split variables optimized in this way converge to 0 or 1, indicating the group to which each node of each layer belongs. The inter-group connections in the neural network parameters are almost entirely suppressed, and when the parameter matrix is reordered according to the split variables it becomes a block-diagonal matrix. Each block of the parameter matrix corresponds to the connections within one group, and the connections between groups are removed.

Accordingly, as shown in FIG. 7, a layer can be split vertically into several layers according to the groups, with the diagonal blocks of the parameter matrix used as the parameters of the resulting layers.

Specifically, the regularization function described above divides the inputs and outputs of one layer into groups so that the layer can be split vertically. When this is applied to several consecutive layers and the split variables are shared so that the output of one group in one layer leads to the input of the corresponding group in the next layer, the groups are separated across the layers with no connections between them. In addition, when the split variables are shared so that the output of one group is divided into several groups in the next layer, the resulting neural network has a structure in which the groups branch.

Finally, the parameters are fine-tuned with the task loss function and the weight decay regularization term to obtain the final tree-shaped neural network.
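The vertical split step can be sketched as follows. The sketch assumes the learned split variables are hardened by taking, for each node, the group with the largest value, and that the within-group block is then cut out of the weight matrix. The function name `split_layer` and the numbers are illustrative, not taken from the disclosure.

```python
def split_layer(W, P, Q):
    """Cut a D x K weight matrix into one dense block per group.

    P (G x D) and Q (G x K) are soft split variables; each node is assigned to
    the group with the largest value. Returns, per group, the feature indices,
    the class indices, and the corresponding diagonal block of W.
    """
    G = len(P)
    feat_group = [max(range(G), key=lambda g: P[g][i]) for i in range(len(W))]
    cls_group = [max(range(G), key=lambda g: Q[g][j]) for j in range(len(W[0]))]
    blocks = []
    for g in range(G):
        feats = [i for i, fg in enumerate(feat_group) if fg == g]
        classes = [j for j, cg in enumerate(cls_group) if cg == g]
        blocks.append((feats, classes, [[W[i][j] for j in classes] for i in feats]))
    return blocks


# Hypothetical trained layer whose groups are interleaved: nodes {0, 2} and {1, 3}.
W = [
    [1.0, 0.01, 2.0, 0.0],
    [0.0, 3.0, 0.02, 4.0],
    [5.0, 0.0, 6.0, 0.01],
    [0.03, 7.0, 0.0, 8.0],
]
P = [[0.9, 0.1, 0.8, 0.2], [0.1, 0.9, 0.2, 0.8]]
Q = [[0.9, 0.1, 0.8, 0.2], [0.1, 0.9, 0.2, 0.8]]

blocks = split_layer(W, P, Q)
```

The small residual inter-group weights are simply dropped here; the final fine-tuning step described above would then let the remaining within-group weights compensate.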

이하에서는 도 8 내지 도 15를 참조하여, 본 개시에 따른 최적화 방법에 효과를 설명한다. Hereinafter, with reference to Figs. 8 to 15, effects of the optimization method according to the present disclosure will be described.

본 개시에 따른 최적화 방법에 적용된 실험 조건에 대해서 먼저 설명한다. The experimental conditions applied to the optimization method according to the present disclosure will be described first.

도 8 내지 도 15의 실험 결과는 아래에 개시된 바와 같은 두 가지 벤치 데이터 세트를 이용하여 이미지 분류를 하였다. Experimental results of Figures 8-15 illustrate image classification using two sets of bench data as described below.

첫 번째는 CIFAR-100이다. CIFAR-100 데이터 세트는 100개의 일반 객체 분류를 위한 32x32 픽셀 이미지들을 포함하며, 각 분류는 학습을 위한 100의 이미지와 테스트를 위한 100개의 이미지를 포함한다. 이러한 실험에서는 각 분류에 대한 50개의 이미지를 교차 검증을 위한 유효성 검증 세트로 별도로 이용하였다. The first is CIFAR-100. The CIFAR-100 data set includes 32x32 pixel images for 100 common object classifications, each containing 100 images for learning and 100 images for testing. In this experiment, 50 images for each category were separately used as a validation set for cross validation.

두 번째는 ImageNet-1K이다. ImageNet-1K 데이터 세트는 1000개 일반 객체 분류를 위한 1.2백만 이미지로 구성된다. 각 분류에 대해서 표준 절차에 따라 학습을 위한 1~1.3 천개 이미지와 테스트를 위한 50개 이미지가 포함된다. The second is ImageNet-1K. The ImageNet-1K data set consists of 1.2 million images for 1000 common object classifications. For each classification, there are 1 to 1.3 images for learning and 50 images for testing according to standard procedures.

To compare several grouping methods, five classification models were used.

The first is the base network, an ordinary network that keeps the full set of network weights. For the experiments on CIFAR-100, the Wide Residual Network (WRN), one of the state-of-the-art networks for that data set, was used. AlexNet and ResNet-18 were used as the base networks for ILSVRC2012.

The second is SplitNet-Semantic. This is a variant of the SplitNet described above that obtains the class grouping from the semantic taxonomy provided with the data set. Before training, the network is split according to the taxonomy: the layers are divided evenly, a subnetwork is assigned to each group, and training then proceeds from scratch.

The third is SplitNet-Clustering. This is a variant of the second model in which the classes are split by hierarchical spectral clustering of a pretrained base network.

The fourth is SplitNet-Random. This variant uses a random class split.

The fifth is SplitNet. As described above, SplitNet learns the split of the weight matrices automatically.

FIG. 8 is a diagram showing an example of the optimization result for three consecutive upper layers. Specifically, FIG. 8 shows the result of optimizing the split variables together with the neural network parameters in the second step, when AlexNet, a type of deep neural network, is trained on the image classification task of the ImageNet data set and the method is applied to its three consecutive upper layers (FC6, FC7, and FC8). The rows and columns are reordered according to the values of the split variables.

In FIG. 8, black indicates a value of 0 and white indicates a positive value. The parameter matrices show that the connections within each group are active (the diagonal blocks are positive) while the connections between groups are suppressed.

That is, FIG. 8 shows how the split variables and the parameters are divided by the method according to the present disclosure, and confirms that the layers of the deep neural network form a hierarchical structure.
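The reordering of rows and columns mentioned for FIG. 8 can be sketched as follows. The 4x4 matrix and the group labels are hypothetical values chosen for illustration, and `reorder_by_group` is an assumed helper name rather than anything defined in the disclosure.

```python
def reorder_by_group(weights, row_groups, col_groups):
    """Sort rows and columns by group label so that each group's block is contiguous."""
    row_order = sorted(range(len(row_groups)), key=lambda i: row_groups[i])
    col_order = sorted(range(len(col_groups)), key=lambda j: col_groups[j])
    return [[weights[i][j] for j in col_order] for i in row_order]


# Hypothetical parameter matrix whose two groups are interleaved.
weights = [
    [0.9, 0.0, 0.4, 0.0],
    [0.0, 0.7, 0.0, 0.2],
    [0.5, 0.0, 0.8, 0.0],
    [0.0, 0.3, 0.0, 0.6],
]
row_groups = [0, 1, 0, 1]
col_groups = [0, 1, 0, 1]

reordered = reorder_by_group(weights, row_groups, col_groups)
for row in reordered:
    print(row)
```

After reordering, the nonzero entries form two 2x2 blocks on the diagonal and the off-diagonal blocks are zero, which is the pattern the figure is described as showing.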

FIG. 9 is a diagram showing a benchmark of a learning model to which the optimization method is applied. Specifically, FIG. 9 summarizes the runtime performance of SplitNets when model parallelism is used.

Referring to FIG. 9, optimizing a DNN in this way not only reduces the number of parameters but also increases speed, because the split structure can be exploited for model parallelism.

A natural way to apply model parallelism is to assign each split group, together with the shared lower layers, to its own GPU. This causes some redundant computation, but it guarantees that no communication is needed between the GPUs. With this arrangement, the speedup grows to as much as 1.44 times.
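The assignment of one split group plus the shared lower layers to each device can be sketched as below. This is only an illustration under stated assumptions: Python threads stand in for GPUs, `shared_lower` is a made-up stand-in for the shared lower layers, and the group definitions and weights are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def matvec(block, x):
    """Multiply a small weight block (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in block]


def shared_lower(x):
    """Stand-in for the shared lower layers (hypothetical: scale, then ReLU)."""
    return [max(0.0, 2.0 * v) for v in x]


def run_group(x, group):
    """Work done by one worker: recompute the shared part, then apply its own block."""
    h = shared_lower(x)                      # redundant on every worker, no communication
    h_group = [h[i] for i in group["inputs"]]
    return matvec(group["block"], h_group)


# Hypothetical split of four shared features into two groups.
groups = [
    {"inputs": [0, 1], "block": [[1.0, 2.0]]},
    {"inputs": [2, 3], "block": [[0.5, 0.5], [1.0, -1.0]]},
]
x = [1.0, -1.0, 2.0, 3.0]

with ThreadPoolExecutor(max_workers=len(groups)) as pool:
    outputs = list(pool.map(lambda g: run_group(x, g), groups))

print(outputs)
```

Each worker needs only the input and its own block, so no intermediate result is exchanged between workers; the cost is that `shared_lower` is evaluated once per worker.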

FIG. 10 is a diagram showing the effect of the balanced group regularization.

Referring to FIG. 10, a sufficiently large regularization weight makes the group sizes uniform, which is desirable for the parameter reduction and model parallelism of SplitNet.

Relaxing this regularization gives the individual group sizes more flexibility. If its weight is set too small, however, all classes and features are assigned to a single group, which is a trivial solution. For this reason, it is desirable in the experiments to keep all groups of the model balanced for the sake of network reduction and parallelization.
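A short parameter count shows why balanced groups help reduction. The layer size of 4096 x 4096 and the two group-size lists are hypothetical numbers used only for this arithmetic; biases are ignored.

```python
def dense_params(n_in, n_out):
    """Weights in an unsplit fully connected layer (biases ignored)."""
    return n_in * n_out


def split_params(in_sizes, out_sizes):
    """Weights kept after a split: one block per group, no inter-group weights."""
    return sum(i * o for i, o in zip(in_sizes, out_sizes))


full = dense_params(4096, 4096)

# Four equally sized groups versus four groups dominated by one large group.
balanced = split_params([1024] * 4, [1024] * 4)
unbalanced = split_params([3072, 512, 256, 256], [3072, 512, 256, 256])

print(f"balanced:   {balanced} weights, reduction {1 - balanced / full:.1%}")
print(f"unbalanced: {unbalanced} weights, reduction {1 - unbalanced / full:.1%}")
```

With G equally sized groups the weight count falls by a factor of G, whereas a split dominated by one large group keeps most of the original weights, and that large group also becomes the bottleneck when the groups run in parallel.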

FIG. 11 is a diagram showing the test error of each algorithm on the CIFAR-100 data set, and FIG. 12 is a diagram comparing parameter (or computation) reduction and test error on the CIFAR-100 data set.

Referring to FIG. 11, the SplitNet variants that use the semantic taxonomy provided with the data set (-S) or spectral clustering (-C) outperform random grouping (-R), which confirms that an appropriate grouping is important when splitting a DNN.

In particular, SplitNet outperforms all the other variants. SplitNet also has the advantage that, unlike the semantic or clustering splits, it requires neither additional semantic information nor precomputed network weights.

Referring to FIG. 12, the FC split yields only a minimal parameter reduction because of the large number of filters in the convolutional layers. In contrast, the shallow split, which includes five convolutional layers, reduces the network parameters by 32.44% while slightly improving test accuracy.

The deep and hierarchical splits further reduce the parameters and FLOPs at the cost of a minor drop in accuracy.

The shallow split performs considerably better than the other algorithms while having far fewer parameters. This is attributed to the fact that the SplitNet of the present disclosure starts from the full network and learns to remove unnecessary connections between different groups in the internal layers, which has a regularizing effect on those layers. Layer splitting can also be regarded as a form of variable selection: each group in a layer selects only the group of nodes it needs.

In conclusion, when a Wide Residual Network, a type of deep neural network, was trained on the image classification task of the CIFAR-100 data set while its top six layers were split, the number of parameters decreased by 32% and the amount of computation by 15%, while performance improved by 0.3 percentage points on average.

FIG. 13 is a diagram showing which group each subclass of the 20 superclasses belongs to. Specifically, FIG. 13 compares the group assignments learned by FC SplitNet (G = 4) with the semantic categories provided by CIFAR-100.

Referring to FIG. 13, the people category contains five classes: baby, boy, girl, man, and woman. The algorithm according to the present disclosure assigns all of these classes to group 2. In every semantic category, three or more classes are grouped together. As the figure shows, semantically similar superclasses are placed in the same group.

FIGS. 14 and 15 are diagrams comparing parameter (or computation) reduction and test error on the ILSVRC2012 data set.

Referring to FIGS. 14 and 15, SplitNet with AlexNet as the base model greatly reduces the number of parameters, which are concentrated in the FC layers. However, most of the FLOPs arise in the lower convolutional (conv) layers, so the reduction in FLOPs is small.

SplitNet based on AlexNet shows only a minor drop in test accuracy despite a significant parameter reduction. SplitNet based on ResNet-18, on the other hand, loses test accuracy as the split becomes deeper. This is presumed to be because splitting ResNet-18 restricts the width of its convolutional layers, which is at most 512 and already small relative to the number of classes, and thereby impairs the network capacity.

Nevertheless, the proposed SplitNet outperforms SplitNet-Random in all experiments. Specifically, a network obtained by doubling the number of filters in each layer of ResNet-18, a type of deep neural network, from N to M was trained on the image classification task of the ImageNet data set while its top six layers were split. As a result, the number of parameters decreased by 38% and the amount of computation by 12%, while performance improved by 0.1 percentage points on average.

FIG. 16 is a flowchart illustrating a learning model optimization method according to an embodiment of the present disclosure.

Referring to FIG. 16, a parameter matrix and a plurality of split variables of a learning model composed of a plurality of layers are first initialized (S1610). Specifically, the parameter matrix may be initialized randomly, and the plurality of split variables may be initialized so that they are not uniform with respect to one another.
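One possible way to carry out the initialization of step S1610 is sketched below. The text only requires a random parameter matrix and non-uniform split variables; the Gaussian draws, the softmax parameterization of the split variables, and the names `init_parameters` and `init_split_variables` are assumptions made for this illustration.

```python
import math
import random


def init_parameters(n_in, n_out, rng, scale=0.1):
    """Random parameter matrix with n_in rows and n_out columns."""
    return [[rng.gauss(0.0, scale) for _ in range(n_out)] for _ in range(n_in)]


def init_split_variables(n_nodes, n_groups, rng):
    """Return split[g][i]; for each node the values over groups are positive and sum to 1."""
    split = [[0.0] * n_nodes for _ in range(n_groups)]
    for i in range(n_nodes):
        logits = [rng.gauss(0.0, 1.0) for _ in range(n_groups)]
        peak = max(logits)                       # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        for g in range(n_groups):
            split[g][i] = exps[g] / total        # random logits make the values non-uniform
    return split


rng = random.Random(0)                           # fixed seed so the sketch is repeatable
W = init_parameters(n_in=6, n_out=4, rng=rng)
P = init_split_variables(n_nodes=6, n_groups=2, rng=rng)   # input-side split variables
Q = init_split_variables(n_nodes=4, n_groups=2, rng=rng)   # output-side split variables
```

Starting from non-uniform values breaks the symmetry between groups, so that gradient updates can move different nodes toward different groups.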

Next, the plurality of split variables and a new parameter matrix having a block-diagonal form for the learning model are calculated so as to minimize an objective function that includes a loss function for the learning model, a parameter decay regularization term, and a split regularization term defined by the parameter matrix and the plurality of split variables (S1620). The objective function, such as Equation 1, may be minimized using stochastic gradient descent.
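From the three terms listed for step S1620, the objective presumably has the general shape shown below. The symbols are assumptions introduced here for illustration (loss $\mathcal{L}$, parameter matrix $W$, split variables $P$ and $Q$, decay weight $\lambda$, split regularization term $\Omega$), and the usual squared L2 form is assumed for the parameter decay term; Equation 1 as defined earlier in the disclosure takes precedence wherever it differs.

```latex
\min_{W,\,P,\,Q}\;\; \mathcal{L}(W) \;+\; \lambda \,\lVert W \rVert_2^2 \;+\; \Omega(W, P, Q)
```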

Here, the split regularization term may include a group parameter regularization term that suppresses connections between groups and activates only connections within a group, a disjoint group regularization term that makes the groups orthogonal to one another, and a balanced group regularization term that keeps the size of any one group from becoming excessive.
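The three components of the split regularization term can be sketched as follows. The exact formulas are not given in this passage, so the forms below are assumptions chosen to match the verbal description: a sum of row and column norms of the inter-group weights, a sum of inner products between the split vectors of different groups, and a sum of squared group sizes. All function names and example values are hypothetical.

```python
import math


def group_weight_term(W, P, Q):
    """Penalize weights that link an input node and an output node of different groups."""
    n_in, n_out = len(W), len(W[0])
    total = 0.0
    for p, q in zip(P, Q):                      # one (p, q) pair of split vectors per group
        for i in range(n_in):                   # row norms of (I - P_g) W Q_g
            total += math.sqrt(sum(((1 - p[i]) * W[i][j] * q[j]) ** 2 for j in range(n_out)))
        for j in range(n_out):                  # column norms of P_g W (I - Q_g)
            total += math.sqrt(sum((p[i] * W[i][j] * (1 - q[j])) ** 2 for i in range(n_in)))
    return total


def disjoint_term(S):
    """Sum of inner products between split vectors of different groups (0 when orthogonal)."""
    return sum(
        sum(a * b for a, b in zip(S[g], S[h]))
        for g in range(len(S)) for h in range(g + 1, len(S))
    )


def balance_term(S):
    """Sum of squared group sizes; for a fixed node count it is smallest for equal groups."""
    return sum(sum(s) ** 2 for s in S)


P = [[1, 0], [0, 1]]                  # hard input-side assignment, two groups
Q = [[1, 0], [0, 1]]                  # hard output-side assignment
W_block = [[1.0, 0.0], [0.0, 2.0]]    # only intra-group weights
W_leaky = [[1.0, 3.0], [0.0, 2.0]]    # one inter-group weight of 3.0

clean = group_weight_term(W_block, P, Q)
leaky = group_weight_term(W_leaky, P, Q)
print(clean, leaky)
```

In this sketch a block-diagonal matrix incurs no group weight penalty while an inter-group weight does, overlapping soft assignments raise the disjoint term, and an assignment that puts every node in one group raises the balance term.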

Then, the plurality of layers are vertically split by group on the basis of the calculated split variables, and the learning model is reconstructed using the calculated new parameter matrix as the parameters of the vertically split layers.
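The reconstruction step can be sketched as extracting one weight block per group from the new parameter matrix. The helpers `split_layer`, `forward_split`, and `forward_full`, the group labels, and the matrix values are hypothetical; the matrix is assumed to be exactly block-structured, with rows indexing inputs and columns indexing outputs.

```python
def split_layer(weights, in_groups, out_groups, num_groups):
    """Cut a block-structured matrix into one smaller layer per group."""
    blocks = []
    for g in range(num_groups):
        rows = [i for i, gi in enumerate(in_groups) if gi == g]
        cols = [j for j, gj in enumerate(out_groups) if gj == g]
        blocks.append({
            "inputs": rows,
            "outputs": cols,
            "weights": [[weights[i][j] for j in cols] for i in rows],
        })
    return blocks


def forward_split(blocks, x, n_out):
    """Evaluate the split layer: each group reads only its inputs and writes its outputs."""
    y = [0.0] * n_out
    for b in blocks:
        for c, j in enumerate(b["outputs"]):
            y[j] = sum(x[i] * b["weights"][r][c] for r, i in enumerate(b["inputs"]))
    return y


def forward_full(weights, x):
    """Evaluate the original unsplit layer, y_j = sum_i x_i * W[i][j]."""
    return [sum(x[i] * weights[i][j] for i in range(len(x))) for j in range(len(weights[0]))]


weights = [
    [0.9, 0.0, 0.4, 0.0],
    [0.0, 0.7, 0.0, 0.2],
    [0.5, 0.0, 0.8, 0.0],
    [0.0, 0.3, 0.0, 0.6],
]
in_groups = [0, 1, 0, 1]
out_groups = [0, 1, 0, 1]
x = [1.0, 2.0, 3.0, 4.0]

blocks = split_layer(weights, in_groups, out_groups, num_groups=2)
y_split = forward_split(blocks, x, n_out=4)
y_full = forward_full(weights, x)
```

When the inter-group weights are exactly zero, the split layer gives the same output as the full layer while storing only the diagonal blocks (here 8 weights instead of 16).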

After the reconstruction, a second new parameter matrix for the reconstructed learning model may be calculated so as to minimize a second objective function that includes only the loss function for the learning model and the parameter decay regularization term, and the learning model may be optimized using the calculated second new parameter matrix as the parameters of the vertically split layers.
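The fine-tuning update on a single group's block can be sketched as a gradient step with parameter decay. This is a toy illustration: the decay penalty is assumed to be `(decay / 2) * w**2`, the loss is a made-up one-dimensional quadratic, and `sgd_step` is a hypothetical helper rather than the procedure of the disclosure.

```python
def sgd_step(block, grad, lr, decay):
    """One update of a group's block: w <- w - lr * (dLoss/dw + decay * w)."""
    return [
        [w - lr * (g + decay * w) for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(block, grad)
    ]


# Toy check on a 1x1 block with loss 0.5 * (w - 2)^2, whose gradient is (w - 2).
block = [[0.0]]
for _ in range(200):
    grad = [[block[0][0] - 2.0]]
    block = sgd_step(block, grad, lr=0.5, decay=0.1)

w_final = block[0][0]
print(w_final)
```

Since only the blocks of the split layers are updated, no inter-group weights are reintroduced during this stage; the decay term pulls the solution slightly below the loss minimizer of 2, toward 2 / 1.1.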

Accordingly, the learning model optimization method according to this embodiment clusters the classes of the learning model into groups, each of which fits an exclusive set of features. Because it uses an objective function such as Equation 1, the method is fully integrated into the network training process, so the network weights and the split can be learned simultaneously. The optimized learning model allows the computation for a single learning model to be divided among several devices, and because the amount of computation and the number of parameters are reduced, computation is faster even on a single device. The learning model optimization method of FIG. 16 may be executed on an electronic apparatus having the configuration of FIG. 1 or FIG. 2, and may also be executed on an electronic apparatus having another configuration.

In addition, the learning model optimization method described above may be implemented as a program including an executable algorithm that can be run on a computer, and the program may be provided while stored in a non-transitory computer readable medium.

A non-transitory readable medium is not a medium that stores data for a short time, such as a register, a cache, or a memory, but a medium that stores data semi-permanently and can be read by a device. Specifically, the programs for performing the various methods described above may be provided while stored in a non-transitory readable medium such as a CD, a DVD, a hard disk, a Blu-ray disc, a USB device, a memory card, or a ROM.

FIG. 17 is a flowchart illustrating a learning model splitting method according to an embodiment of the present disclosure.

Referring to FIG. 17, the neural network parameters may first be taken from a pretrained neural network or initialized randomly, and the split variables may be initialized close to a uniform value (S1710).

Next, the parameters of the neural network and the values of the split variables are optimized by stochastic gradient descent in the direction that minimizes the task loss function and the parameter decay regularization term together with the split regularization term described above (S1720).

The split variables optimized in this way converge to values of 0 or 1 that indicate which group each node of each layer belongs to. The connections between groups in the neural network parameters are almost entirely suppressed, and the parameter matrix becomes block-diagonal when reordered according to the split variables.
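Reading group memberships off converged split variables can be sketched as below. The near-binary split values, the small residual inter-group weight, and the helper names `hard_assignment` and `inter_group_fraction` are hypothetical and serve only to illustrate the convergence behavior described above.

```python
def hard_assignment(split_vars):
    """Pick, for every node, the group whose split variable is largest."""
    n_groups, n_nodes = len(split_vars), len(split_vars[0])
    return [max(range(n_groups), key=lambda g: split_vars[g][i]) for i in range(n_nodes)]


def inter_group_fraction(W, in_groups, out_groups):
    """Share of total absolute weight that still links nodes of different groups."""
    total = sum(abs(w) for row in W for w in row)
    cross = sum(
        abs(W[i][j])
        for i, gi in enumerate(in_groups)
        for j, gj in enumerate(out_groups)
        if gi != gj
    )
    return cross / total if total else 0.0


# Hypothetical split variables that have almost converged to 0 or 1.
P = [[0.98, 0.03, 0.99], [0.02, 0.97, 0.01]]   # P[g][i] for three input nodes
Q = [[0.01, 0.96], [0.99, 0.04]]               # Q[g][j] for two output nodes
W = [[0.0, 0.8], [0.5, 0.0], [0.01, 0.7]]      # one small residual inter-group weight

in_groups = hard_assignment(P)
out_groups = hard_assignment(Q)
leak = inter_group_fraction(W, in_groups, out_groups)
print(in_groups, out_groups, round(leak, 4))
```

A small leak value indicates that dropping the remaining inter-group weights when the network is split changes the layer very little, which the subsequent fine-tuning can compensate for.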

Next, the neural network may be split using the split variables calculated above (S1730).

Finally, the parameters are fine-tuned using the task loss function and the parameter decay regularization term, which yields the final tree-structured neural network (S1740).

Accordingly, the learning model splitting method according to this embodiment clusters the classes of the learning model into groups, each of which fits an exclusive set of features. Because it uses an objective function such as Equation 1, the method is fully integrated into the network training process, so the network weights and the split can be learned simultaneously. The learning model splitting method of FIG. 17 may be executed on an electronic apparatus having the configuration of FIG. 1 or FIG. 2, and may also be executed on an electronic apparatus having another configuration.

In addition, the learning model splitting method described above may be implemented as a program including an executable algorithm that can be run on a computer, and the program may be provided while stored in a non-transitory computer readable medium.

Although preferred embodiments of the present disclosure have been illustrated and described above, the present disclosure is not limited to the specific embodiments described. Those of ordinary skill in the art to which the disclosure pertains may make various modifications without departing from the gist of the disclosure as claimed in the claims, and such modifications should not be understood separately from the technical idea or prospect of the present disclosure.

100: electronic apparatus 110: memory
120: processor 130: communication interface
140: display 150: operation input unit

Claims

A learning model optimization method comprising:
initializing a parameter matrix and a plurality of split variables of a learning model composed of a plurality of layers;
calculating the plurality of split variables and a new parameter matrix having a block-diagonal form for the learning model so as to minimize an objective function including a loss function for the learning model, a parameter decay regularization term, and a split regularization term defined by the parameter matrix and the plurality of split variables; and
vertically splitting the plurality of layers by group on the basis of the calculated split variables, and reconstructing the learning model using the calculated new parameter matrix as parameters of the vertically split layers.

The method according to claim 1,
wherein the initializing comprises
randomly initializing the parameter matrix and initializing the plurality of split variables so that they are not uniform with respect to one another.

The method according to claim 1,
wherein the calculating comprises
minimizing the objective function using a stochastic gradient descent method.

The method according to claim 1,
wherein the split regularization term includes
a group parameter regularization term that suppresses connections between groups and activates only connections within a group, a disjoint group regularization term that makes the groups orthogonal to one another, and a balanced group regularization term that keeps the size of any one group from becoming excessive.

The method according to claim 1, further comprising:
calculating a second new parameter matrix for the reconstructed learning model so as to minimize a second objective function including only the loss function for the learning model and the parameter decay regularization term; and
optimizing the learning model using the calculated second new parameter matrix as parameters of the vertically split layers.

The method according to claim 5, further comprising
processing the vertically split layers of the optimized learning model in parallel using different processors.

An electronic apparatus comprising:
a memory storing a learning model composed of a plurality of layers; and
a processor configured to initialize a parameter matrix and a plurality of split variables of the learning model, calculate the plurality of split variables and a new parameter matrix having a block-diagonal form for the learning model so as to minimize an objective function including a loss function for the learning model, a parameter decay regularization term, and a split regularization term defined by the parameter matrix and the plurality of split variables, vertically split the plurality of layers by group on the basis of the calculated split variables, and reconstruct the learning model using the calculated new parameter matrix as parameters of the vertically split layers.

The electronic apparatus according to claim 7,
wherein the processor
randomly initializes the parameter matrix and initializes the plurality of split variables so that they are not uniform with respect to one another.

The electronic apparatus according to claim 7,
wherein the processor
minimizes the objective function using a stochastic gradient descent method.

The electronic apparatus according to claim 7,
wherein the split regularization term includes
a group parameter regularization term that suppresses connections between groups and activates only connections within a group, a disjoint group regularization term that makes the groups orthogonal to one another, and a balanced group regularization term that keeps the size of any one group from becoming excessive.

The electronic apparatus according to claim 7,
wherein the processor
calculates a second new parameter matrix for the reconstructed learning model so as to minimize a second objective function including only the loss function for the learning model and the parameter decay regularization term, and optimizes the learning model using the calculated second new parameter matrix as parameters of the vertically split layers.

A computer-readable recording medium containing a program for executing a learning model optimization method in an electronic apparatus,
wherein the learning model optimization method comprises:
initializing a parameter matrix and a plurality of split variables of a learning model composed of a plurality of layers;
calculating the plurality of split variables and a new parameter matrix having a block-diagonal form for the learning model so as to minimize an objective function including a loss function for the learning model, a parameter decay regularization term, and a split regularization term defined by the parameter matrix and the plurality of split variables; and
vertically splitting the plurality of layers by group on the basis of the calculated split variables, and reconstructing the learning model using the calculated new parameter matrix as parameters of the vertically split layers.