KR102139740B1 - Electronic apparatus and method for optimizing of trained model

Info

Publication number: KR102139740B1
Application number: KR1020180010938A
Authority: KR
Inventors: 황성주; 김주용; 김건희; 박유군
Original assignee: 한국과학기술원
Priority date: 2017-06-09
Filing date: 2018-01-29
Publication date: 2020-07-31
Also published as: KR20180134740A; KR102102772B1; KR20180134739A; KR20180134738A; KR102139729B1

Abstract

An electronic apparatus is disclosed. The electronic apparatus includes a memory that stores a trained model composed of a plurality of layers, and a processor that: initializes a parameter matrix of the trained model and a plurality of split variables; calculates the plurality of split variables and a new, block-diagonal parameter matrix for the trained model so as to minimize an objective function that includes a loss function for the trained model, a weight-decay regularization term, and a split regularization term defined by the parameter matrix and the plurality of split variables; vertically splits the plurality of layers into groups based on the calculated split variables; and reconstructs the trained model using the calculated new parameter matrix as the parameters of the vertically split layers.

Description

ELECTRONIC APPARATUS AND METHOD FOR OPTIMIZING OF TRAINED MODEL

The present disclosure relates to an electronic apparatus and a method for optimizing a trained model, and more particularly to an electronic apparatus and a method that can optimize a trained model by automatically dividing each layer of the model into semantically related groups and parallelizing the model.

A deep neural network is a machine learning technique that has brought large performance gains in fields such as computer vision, speech recognition, and natural language processing. Such a network consists of the sequential computation of multiple layers, such as fully connected layers and convolutional layers.

Because each layer, expressed as a matrix multiplication, requires a large amount of computation, deep neural networks demand heavy computation and large-capacity model parameters for both training and execution.

However, when the model or task becomes very large, as when classifying tens of thousands of object classes, or when real-time object detection is required, this computational cost limits the use of deep neural networks.

Accordingly, conventional approaches have reduced the number of model parameters, or accelerated the training and execution of the model through data parallelization using distributed machine learning.

These approaches, however, either reduce the number of parameters while keeping the network structure or cut computation time by using many computing devices; they do not improve the intrinsic structure of the deep neural network.

That is, a conventional deep neural network consists of the sequential computation of single, large layers. If that computation is divided among several computing devices, communication between the devices becomes an even larger time bottleneck, so the computation for one input has had to be carried out on a single device.

Accordingly, an object of the present disclosure is to provide an electronic apparatus and a trained-model optimization method that can optimize a trained model by automatically dividing each layer of the model into semantically related groups and parallelizing the model.

To achieve the above object, a method for optimizing a trained model according to the present disclosure includes: initializing a parameter matrix of a trained model composed of a plurality of layers and a plurality of split variables; calculating the plurality of split variables and a new, block-diagonal parameter matrix for the trained model so as to minimize an objective function that includes a loss function for the trained model, a weight-decay regularization term, and a split regularization term defined by the parameter matrix and the plurality of split variables; and vertically splitting the plurality of layers into groups based on the calculated split variables and reconstructing the trained model using the calculated new parameter matrix as the parameters of the vertically split layers.

In this case, the initializing may randomly initialize the parameter matrix and initialize the plurality of split variables so that they are not uniform with respect to one another.

The calculating may use stochastic gradient descent to minimize the objective function.

The split regularization term may include a group parameter regularization term that suppresses connections between groups and keeps only connections within a group active, a disjoint group regularization term that makes the groups mutually orthogonal, and a balanced group regularization term that keeps any one group from becoming excessively large.
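
The three terms named above can be sketched in code. The following is only an illustrative sketch: Equation 1 is not reproduced in this part of the description, so the concrete formulas, the function names, and the shapes (W as a D x K parameter matrix, P as a G x D feature-group assignment, Q as a G x K class-group assignment) are assumptions chosen to match the stated purpose of each term, not the formulas of the disclosure.

```python
import numpy as np

def group_weight_reg(W, P, Q):
    """Assumed form: penalize weights linking a feature and a class
    that are assigned to different groups."""
    total = 0.0
    for p, q in zip(P, Q):
        # Features outside group g feeding classes inside group g.
        rows = ((1.0 - p)[:, None] * W) * q[None, :]
        # Features inside group g feeding classes outside group g.
        cols = (p[:, None] * W) * (1.0 - q)[None, :]
        total += np.linalg.norm(rows, axis=1).sum()
        total += np.linalg.norm(cols, axis=0).sum()
    return float(total)

def disjoint_group_reg(P, Q):
    """Assumed form: sum of pairwise inner products, which is zero
    when the group assignment vectors are mutually orthogonal."""
    total = 0.0
    num_groups = P.shape[0]
    for i in range(num_groups):
        for j in range(i + 1, num_groups):
            total += P[i] @ P[j] + Q[i] @ Q[j]
    return float(total)

def balanced_group_reg(P, Q):
    """Assumed form: sum of squared group sizes, which is smallest
    when all groups have the same size."""
    return float((P.sum(axis=1) ** 2).sum() + (Q.sum(axis=1) ** 2).sum())
```

Under these assumed forms, a parameter matrix that is already block diagonal with respect to hard, non-overlapping assignments gives zero for the first two terms, and the third term grows as the group sizes become more uneven.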

The method may further include calculating a second new parameter matrix for the reconstructed trained model so as to minimize a second objective function that includes only the loss function for the trained model and the weight-decay regularization term, and optimizing the trained model using the calculated second new parameter matrix as the parameters of the vertically split layers.

In this case, the method may further include processing the vertically split layers of the optimized trained model in parallel using different processors.

An electronic apparatus according to the present disclosure includes a memory that stores a trained model composed of a plurality of layers, and a processor that: initializes a parameter matrix of the trained model and a plurality of split variables; calculates the plurality of split variables and a new, block-diagonal parameter matrix for the trained model so as to minimize an objective function that includes a loss function for the trained model, a weight-decay regularization term, and a split regularization term defined by the parameter matrix and the plurality of split variables; vertically splits the plurality of layers into groups based on the calculated split variables; and reconstructs the trained model using the calculated new parameter matrix as the parameters of the vertically split layers.

In this case, the processor may randomly initialize the parameter matrix and initialize the plurality of split variables so that they are not uniform with respect to one another.

The processor may use stochastic gradient descent to minimize the objective function.

The processor may calculate a second new parameter matrix for the reconstructed trained model so as to minimize a second objective function that includes only the loss function for the trained model and the weight-decay regularization term, and may optimize the trained model using the calculated second new parameter matrix as the parameters of the vertically split layers.

In a computer-readable recording medium containing a program for executing a trained-model optimization method in an electronic apparatus according to the present disclosure, the trained-model optimization method includes: initializing a parameter matrix of a trained model composed of a plurality of layers and a plurality of split variables; calculating the plurality of split variables and a new, block-diagonal parameter matrix for the trained model so as to minimize an objective function that includes a loss function for the trained model, a weight-decay regularization term, and a split regularization term defined by the parameter matrix and the plurality of split variables; and vertically splitting the plurality of layers into groups based on the calculated split variables and reconstructing the trained model using the calculated new parameter matrix as the parameters of the vertically split layers.

As described above, according to various embodiments of the present disclosure, each layer of a trained model can be automatically split into several layers, which reduces the amount of computation, reduces the number of parameters, and makes model parallelization possible.
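
As a rough worked example of where the parameter savings come from, assume (for illustration only; these sizes are not taken from the disclosure) a single layer with D inputs and K outputs that is split into G groups of equal size. Each group then keeps only a (D/G) x (K/G) block of the original matrix:

```latex
\underbrace{D \cdot K}_{\text{original layer}}
\;\longrightarrow\;
\sum_{g=1}^{G} \frac{D}{G} \cdot \frac{K}{G} \;=\; \frac{D \cdot K}{G},
\qquad
\text{e.g. } D = 512,\; K = 100,\; G = 4:\quad 51200 \;\longrightarrow\; 4 \cdot (128 \cdot 25) = 12800 .
```

With unequal group sizes the reduction is smaller, which is one way to read the role of the balanced group regularization term.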

FIG. 1 is a block diagram showing a simple configuration of an electronic apparatus according to an embodiment of the present disclosure;
FIG. 2 is a block diagram showing a detailed configuration of an electronic apparatus according to an embodiment of the present disclosure;
FIG. 3 is a diagram for explaining a tree-structured network;
FIG. 4 is a diagram for explaining the group assignment and group weight regularization operations;
FIG. 5 is a diagram visualizing weights to which regularization has been applied;
FIG. 6 is a diagram showing the splitting algorithm for the trained model of the present disclosure;
FIG. 7 is a diagram showing an example of splitting the output of one group;
FIG. 8 is a diagram showing an example of the optimization result for the top three consecutive layers;
FIG. 9 is a diagram showing a benchmark of a trained model to which the optimization method has been applied;
FIG. 10 is a diagram showing the effect of balanced group regularization;
FIG. 11 is a diagram showing the test error of each of several algorithmic approaches on the CIFAR-100 data set;
FIG. 12 is a diagram comparing parameter (or computation) reduction and test error on the CIFAR-100 data set;
FIG. 13 is a diagram showing which group the subclasses of each of the 20 superclasses belong to;
FIGS. 14 and 15 are diagrams comparing parameter (or computation) reduction and test error on the ILSVRC2012 data set;
FIG. 16 is a flowchart for explaining a trained-model optimization method according to an embodiment of the present disclosure; and
FIG. 17 is a flowchart for explaining a trained-model splitting method according to an embodiment of the present disclosure.

The terms used in this specification are briefly explained first, and the present disclosure is then described in detail.

The terms used in the embodiments of the present disclosure are, as far as possible, general terms that are currently in wide use, selected in view of their function in the present disclosure; they may nevertheless vary with the intent of those skilled in the art, with precedent, or with the emergence of new technology. In certain cases a term has been chosen arbitrarily by the applicant, and in that case its meaning is set out in detail in the corresponding part of the description. The terms used in the present disclosure should therefore be defined on the basis of their meaning and the overall content of the present disclosure, not merely by their names.

The embodiments of the present disclosure may be modified in various ways and may take various forms, so specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the scope to any particular embodiment; the disclosure should be understood to cover all modifications, equivalents, and substitutes that fall within the disclosed spirit and technical scope. In describing the embodiments, a detailed description of related known technology is omitted where it could obscure the subject matter.

Terms such as "first" and "second" may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another.

A singular expression includes the plural unless the context clearly indicates otherwise. In this application, terms such as "include" or "be composed of" are intended to indicate that a feature, number, step, operation, component, part, or combination thereof described in the specification is present, and should be understood as not excluding in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In the embodiments of the present disclosure, a "module" or "unit" performs at least one function or operation, and may be implemented in hardware, in software, or in a combination of hardware and software. In addition, a plurality of "modules" or "units" may be integrated into at least one module and implemented by at least one processor, except for a "module" or "unit" that needs to be implemented in specific hardware.

Embodiments of the present disclosure are described in detail below with reference to the accompanying drawings so that those of ordinary skill in the art to which the present disclosure pertains can readily practice them. The present disclosure may, however, be implemented in many different forms and is not limited to the embodiments described here. To describe the present disclosure clearly, parts unrelated to the description are omitted from the drawings, and similar parts are given similar reference numerals throughout the specification.

The present disclosure is described in more detail below with reference to the drawings.

FIG. 1 is a block diagram showing a simple configuration of an electronic apparatus according to an embodiment of the present disclosure.

Referring to FIG. 1, the electronic apparatus 100 may include a memory 110 and a processor 120. The electronic apparatus 100 may be a PC, a notebook PC, a server, or another device capable of data computation.

The memory 110 stores a trained model composed of a plurality of layers. Here, the trained model is a model trained using an artificial intelligence algorithm, and may also be referred to as a network. The artificial intelligence algorithm may be a deep neural network (DNN), a deep convolutional neural network, a residual network, or the like.

The memory 110 may store a training data set for optimizing the trained model, and may also store data to be classified or recognized using the trained model.

The memory 110 may also store a program required to perform trained-model optimization, or a trained model that has been optimized by that program.

The memory 110 may be implemented as a storage medium inside the electronic apparatus 100 or as an external storage medium, for example a removable disk including a USB memory, a storage medium connected to a host, or a web server accessed over a network.

The processor 120 controls each component of the electronic apparatus 100. Specifically, when a boot command is input by the user, the processor 120 may boot using the operating system stored in the memory 110.

The processor 120 may receive, through the operation input unit 150 described later, a selection of the trained model to be optimized, and may receive through the operation input unit 150 various parameters for optimizing the selected model. The parameters received here may include the number of groups to split into, hyperparameters, and the like.

Upon receiving this information, the processor 120 may group the layers of the selected trained model into a plurality of groups based on the input/output features of each layer, and reconstruct the model as a trained model with a tree structure.

Specifically, the processor 120 may initialize a parameter matrix of a trained model composed of a plurality of layers and a plurality of split variables. The processor 120 may initialize the parameter matrix randomly and initialize the plurality of split variables close to uniform values. Here, the parameter matrix is the parameter matrix of one layer of the trained model, and the plurality of split variables may include feature-group split variables and class-group split variables. The split variables may take a matrix form corresponding to the parameter matrix.
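
The initialization step can be sketched as follows. This is a minimal sketch under stated assumptions: the softmax parameterization, the noise scale, and the names `init_layer`, `softmax`, `P`, and `Q` are illustrative choices, not details given in the description. Small random logits make every assignment close to the uniform value 1/G while keeping the entries from being exactly identical.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def init_layer(num_features, num_classes, num_groups, noise=0.01, seed=0):
    rng = np.random.default_rng(seed)
    # Random initialization of one layer's parameter matrix W (D x K).
    W = rng.normal(0.0, 0.01, size=(num_features, num_classes))
    # Hypothetical unconstrained logits for the split variables.
    alpha = rng.normal(0.0, noise, size=(num_groups, num_features))
    beta = rng.normal(0.0, noise, size=(num_groups, num_classes))
    # Feature-group (P) and class-group (Q) split variables; each column
    # sums to 1 over the groups and starts near 1 / num_groups.
    P = softmax(alpha, axis=0)
    Q = softmax(beta, axis=0)
    return W, P, Q

W, P, Q = init_layer(num_features=6, num_classes=4, num_groups=2)
print(P.round(3))
```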

The processor 120 may then calculate the plurality of split variables and a new, block-diagonal parameter matrix for the trained model so as to minimize an objective function that includes a loss function for the trained model, a weight-decay regularization term, and a split regularization term defined by the parameter matrix and the plurality of split variables. The processor 120 may use stochastic gradient descent to minimize the objective function. Here, the objective function is a function for simultaneously optimizing the cross-entropy loss and the group regularization, and may be expressed as Equation 1. The details of the objective function are described later with reference to FIG. 3.

The processor 120 may then vertically split the plurality of layers into groups based on the calculated split variables, and reconstruct the trained model using the calculated new parameter matrix as the parameters of the vertically split layers.
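
The vertical split of one layer can be illustrated with a short sketch. It assumes, as an illustration only, that each feature and each class is assigned to the group with the largest split-variable value (an argmax rule that the description does not state), and that the learned parameter matrix is already block diagonal with respect to those assignments. The names `split_layer` and `forward_split` are hypothetical.

```python
import numpy as np

def split_layer(W, P, Q):
    """Cut W (D x K) into one sub-matrix per group using hard assignments."""
    feat_group = P.argmax(axis=0)  # group index of each input feature
    cls_group = Q.argmax(axis=0)   # group index of each output class
    blocks = []
    for g in range(P.shape[0]):
        rows = np.flatnonzero(feat_group == g)
        cols = np.flatnonzero(cls_group == g)
        blocks.append((rows, cols, W[np.ix_(rows, cols)]))
    return blocks

def forward_split(x, blocks, num_classes):
    """Evaluate the split layer; each group reads only its own features."""
    y = np.zeros((x.shape[0], num_classes))
    for rows, cols, Wg in blocks:
        y[:, cols] = x[:, rows] @ Wg
    return y

# Toy example: 4 features, 3 classes, 2 groups.
rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1, 0.8, 0.2], [0.1, 0.9, 0.2, 0.8]])
Q = np.array([[0.7, 0.3, 0.1], [0.3, 0.7, 0.9]])
mask = P.argmax(axis=0)[:, None] == Q.argmax(axis=0)[None, :]
W = rng.normal(size=(4, 3)) * mask  # block diagonal up to a permutation
blocks = split_layer(W, P, Q)
x = rng.normal(size=(5, 4))
print(np.allclose(forward_split(x, blocks, 3), x @ W))
```

When the cross-group weights are exactly zero, the split layer gives the same output as the original layer while storing only the per-group blocks.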

The processor 120 may then calculate a second new parameter matrix for the reconstructed trained model so as to minimize a second objective function that includes only the loss function for the trained model and the weight-decay regularization term, and may optimize the trained model using the calculated second new parameter matrix as the parameters of the vertically split layers.

The processor 120 may use the optimized trained model to perform various kinds of processing such as vision recognition, speech recognition, and natural language processing. For example, if the trained model is for image classification, the processor 120 may use the optimized model to classify what an input image shows.

In doing so, the processor 120 may classify the input image using a plurality of processor cores, or together with another electronic apparatus. Specifically, because a trained model optimized according to the present disclosure has a vertically split tree structure, the computation for one split group can be carried out on one computing device and the computation for another group on a different computing device.
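
The per-group independence described above can be sketched with worker threads standing in for separate computing devices. This is only an illustration: the thread pool, the function names, and the `(rows, cols, Wg)` group layout are assumptions, and a real deployment would place each group on its own core, GPU, or machine.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def run_group(x, rows, Wg):
    # Work for one group: needs only that group's features and weights.
    return x[:, rows] @ Wg

def parallel_forward(x, groups, num_outputs, max_workers=2):
    """Evaluate every group on its own worker and gather the outputs."""
    y = np.zeros((x.shape[0], num_outputs))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [(cols, pool.submit(run_group, x, rows, Wg))
                   for rows, cols, Wg in groups]
        for cols, fut in futures:
            y[:, cols] = fut.result()
    return y

rng = np.random.default_rng(1)
x = rng.normal(size=(3, 4))
# Two hypothetical groups: (input feature indices, output indices, weights).
groups = [
    (np.array([0, 2]), np.array([0]), rng.normal(size=(2, 1))),
    (np.array([1, 3]), np.array([1, 2]), rng.normal(size=(2, 2))),
]
y = parallel_forward(x, groups, num_outputs=3)
print(y.shape)
```

No intermediate values are exchanged between the workers; only the final per-group outputs are gathered, which is why the split avoids the communication bottleneck.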

As described above, the electronic apparatus 100 according to this embodiment clusters the classes of a trained model into groups that each match an exclusive set of features. The optimized model therefore allows the computation for a single trained model to be divided among several devices without a communication bottleneck, and because the amount of computation and the number of parameters are reduced, it computes faster even on a single device. In addition, because the electronic apparatus 100 uses an objective function such as Equation 1, the splitting is fully integrated into the network training procedure, so the network weights and the split can be learned simultaneously.

In describing FIG. 1, the number of groups was described as being received from the user, with the trained model split into that number of groups. In an implementation, however, an operation of finding the optimal number of groups for the trained model using a preset algorithm may be performed first, and the model may then be split based on the number of groups found.

Only a simple configuration of the electronic apparatus has been illustrated and described above; in an implementation, various additional components may be provided. These are described below with reference to FIG. 2.

FIG. 2 is a block diagram showing a detailed configuration of an electronic apparatus according to an embodiment of the present disclosure.

Referring to FIG. 2, the electronic apparatus 100 may include a memory 110, a processor 120, a communication unit 130, a display 140, and an operation input unit 150.

The operation of the memory 110 and the processor 120 has been described with reference to FIG. 1, so a duplicate description is omitted.

The communication unit 130 is connected to other electronic apparatuses and may receive a trained model and/or training data from them. The communication unit 130 may also transmit to another electronic apparatus the data required for distributed computation with that apparatus.

The communication unit 130 may also receive information to be processed using the trained model and provide the processing result to the corresponding device. For example, if the trained model is an image classification model, the communication unit 130 may receive an image to be classified and transmit information about the classification result to the device that sent the image.

The communication unit 130 is formed to connect the electronic apparatus 100 to an external device, and may connect to a terminal device not only through a local area network (LAN) or the Internet but also through a Universal Serial Bus (USB) port or a wireless communication port (for example, WiFi 802.11a/b/g/n, NFC, or Bluetooth).

The display 140 displays various information provided by the electronic apparatus 100. Specifically, the display 140 may display a user interface window for selecting the functions provided by the electronic apparatus 100. The user interface window may include items for selecting the trained model to be optimized or for entering parameters to be used in the optimization process.

The display 140 may be a monitor such as an LCD, CRT, or OLED display, or may be implemented as a touch screen that also performs the function of the operation input unit 150 described later.

The display 140 may also display information about test results obtained using the trained model. For example, if the trained model is an image classification model, the display 140 may display the classification result for an input image.

The operation input unit 150 may receive from the user the training data to be used for optimization and various parameters to be used in the optimization process.

The operation input unit 150 may be implemented as a plurality of buttons, a keyboard, a mouse, or the like, or as a touch screen that also performs the function of the display 140 described above.

Although FIGS. 1 and 2 have been illustrated and described with the electronic apparatus 100 including only one processor, the electronic apparatus may include a plurality of processors, and a GPU may be used as well as a general-purpose CPU. Specifically, the optimization operation described above may be performed using a plurality of GPUs.

이하에서는 상술한 바와 같은 학습 모델의 최적화가 가능한 이유에 대해서 자세히 설명한다. Hereinafter, the reason for optimizing the learning model as described above will be described in detail.

As the number of image classification tasks increases, semantically similar classes can be divided into separate groups, each of which uses only the same kind of features.

For example, the high-level features used to classify animal classes such as dogs and cats may differ from those used to classify objects into classes such as trucks and airplanes. Low-level features such as dots, stripes, and colors, however, may be useful for all classes.

This means that an artificial neural network can operate more efficiently as a tree-like structure in which the lower layers are shared by all groups and, toward the upper layers, the features are split according to groups of semantically distinct classes. In other words, classes can be clustered into mutually exclusive groups according to the features they use.

Such clustering not only lets the artificial neural network use fewer parameters but also enables model parallelization, in which each group is executed on a different computing device.

However, if each layer of the artificial neural network is split into arbitrary groups, that is, if semantically similar classes and features are not grouped together, performance may degrade.

Accordingly, in the present disclosure, the inputs and outputs of each layer of the artificial neural network are divided into several semantically related groups, and the layers are split vertically on that basis, which reduces the amount of computation and the number of parameters without degrading performance. This also enables model parallelization.

To this end, the present disclosure groups the inputs and outputs of each layer with semantically related ones and removes the entries of the layer's parameter matrix that correspond to connections between groups.

For this operation, split variables for assigning the inputs and outputs of each layer to groups are newly introduced, together with additional regularization functions that automatically divide semantically similar inputs and outputs into groups while suppressing connections between groups, which makes it possible to split the layer.

FIG. 3 is a diagram illustrating a tree-structured network.

Referring to FIG. 3, the base learning model 310 is composed of a plurality of layers.

To split the plurality of layers, the class-to-group and feature-to-group assignments are optimized together with the network weights. Specifically, split variables are introduced for the input and output nodes of a layer of the base learning model 310, and the loss function of the original task is minimized on the task's training data together with three additional regularization terms, so that their sum is minimized.

Based on the result, the group to which each node belongs is determined, and the optimized learning model 330 can be generated from that determination.

Given a base network, the final goal of the present disclosure is to obtain a tree-structured network that contains a set of subnetworks, or layers, each associated with a specific class group, like the optimized learning model 330 of FIG. 3.

Grouping heterogeneous classes may cause duplicate features to be learned in several groups and thus waste network capacity. An optimization method that splits the classes should therefore make the classes within each group share as many features as possible.

Therefore, to maximize the usefulness of the split, the classes should be clustered so that each group uses a subset of features completely different from those used by the other groups.

The most direct way to obtain such mutually exclusive class groups is to use a semantic taxonomy, since similar classes are likely to share features. In practice, however, such a taxonomy may not be available, or it may not match the actual hierarchical grouping implied by the features each class uses.

Another approach is to perform (hierarchical) clustering on the weights learned by the original network. This approach is also based on actual feature usage. However, the groups are likely to overlap and the network must be trained twice, which is inefficient, and the clustered groups may not be optimal.

Accordingly, the following describes how to assign each class and feature exclusively to disjoint groups while simultaneously learning the network weights in a deep learning framework.

In the following, the data set is assumed to be $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^{d}$ is an input data instance and $y_i \in \{1, \ldots, K\}$ is the class label over $K$ classes.

Learning in the artificial neural network consists of training a network with weights $W^{(l)}$ at each layer $l$. Here $W^{(l)}$ is a block-diagonal matrix, and each block $W^{(l)}_g$ is associated with a class group $g \in \mathcal{G}$, where $\mathcal{G}$ is the set of all groups.

This block-diagonal structure ensures that each disjoint group of classes has its own associated features that are not used by the other groups. The network can therefore be split across the class groups for fast computation and parallel processing.

To obtain such a block-diagonal weight matrix $W^{(l)}$, the present disclosure uses a new splitting algorithm that learns the feature-to-group and class-to-group assignments in addition to the network weights. This splitting algorithm is hereinafter referred to as SplitNet (or deep split).

First, the splitting method is described for the parameters of a softmax classifier; its application to a DNN is described later.

$p_{gi}$ is a binary variable indicating whether feature $i$ is assigned to group $g$, and $q_{gj}$ is a binary variable indicating whether class $j$ is assigned to group $g$.

$p_g \in \{0,1\}^{D}$ is defined as the feature group assignment vector for group $g$, where $D$ is the dimension of the features.

Similarly, $q_g \in \{0,1\}^{K}$ is defined as the class group assignment vector for group $g$. That is, $p_g$ and $q_g$ together define group $g$: $p_g$ indicates the feature dimensions associated with the group, and $q_g$ indicates the set of classes assigned to the group.

It is assumed that there is no overlap between groups in either features or classes, that is, $\sum_g p_g = \mathbf{1}_D$ and $\sum_g q_g = \mathbf{1}_K$, where $\mathbf{1}_D$ and $\mathbf{1}_K$ are vectors whose elements are all one.

Although this assumption imposes a strict rule on group assignment, it allows the weight matrix to be arranged as a block-diagonal matrix, because each class is assigned to exactly one group and each group depends on a disjoint subset of the features. This greatly reduces the number of parameters, and at the same time the multiplication of the weight matrix $W$ with an input $x_i$ can be decomposed into smaller, faster block matrix multiplications.

The objective function to be optimized in the present disclosure may be defined as Equation 1 below.

$$\min_{W,P,Q}\; \mathcal{L}(W, X, y) + \lambda \|W\|_2^2 + \Omega(W, P, Q) \qquad (1)$$

Here, $\mathcal{L}(W, X, y)$ is the cross-entropy loss on the training data, $W$ is the weight tensor, $P$ and $Q$ are the feature-to-group and class-to-group assignment matrices, $\|W\|_2^2$ is the weight decay regularization term with hyperparameter $\lambda$, and $\Omega(W, P, Q)$ is the regularization term for network splitting.

The newly introduced regularization term $\Omega$, which finds a disjoint group assignment automatically without external semantic information, is described below.

The aim of Equation 1 is to optimize jointly, by gradient descent, starting from a full weight matrix for each layer and an unknown group assignment for each weight.

By optimizing the cross-entropy loss and the group regularization together, an appropriate grouping is obtained automatically and the inter-group connections are eliminated.

Once the grouping is learned, the weight matrix can be explicitly split into block-diagonal matrices to reduce the number of parameters, which allows much faster inference. In the following, the number of groups $G$ into which each layer is split is assumed to be given.

A method of separating the weight matrix of a layer into a plurality of groups is described below with reference to FIG. 4.

FIG. 4 is a diagram for explaining the group assignment and group weight regularization operations, and FIG. 5 is a diagram visualizing the weights to which the regularization is applied.

Referring to FIG. 4, the regularization that assigns features and classes to a plurality of groups can be expressed as Equation 2 below.

$$\Omega(W, P, Q) = \gamma_1 R_W(W, P, Q) + \gamma_2 R_D(P, Q) + \gamma_3 R_E(P, Q) \qquad (2)$$

Here, $\gamma_1$, $\gamma_2$, and $\gamma_3$ are parameters that control the strength of each term. These parameters may be input by the user.

The first term, $R_W$, is the group weight regularization term, defined as the (2,1)-norm of the parameters of the inter-group connections. Minimizing this term suppresses connections between groups and leaves only connections within groups active. $R_W$ is described in more detail below.

The second term, $R_D$, is the disjoint group assignment term, defined through the inner products between split variables, which drives the split to be mutually exclusive. $R_D$ is described in more detail below.

The third term, $R_E$, is the balanced group assignment term, defined as the square of the sum of each split variable, which prevents any one group from growing excessively large. $R_E$ is described in more detail below.

The first regularization term, $R_W$, is described first.

Let $P_g = \mathrm{diag}(p_g)$ and $Q_g = \mathrm{diag}(q_g)$ be the feature-group and class-group assignment matrices, respectively. Then $P_g W Q_g$ represents the weight parameters associated with group $g$ (that is, the intra-group connections between features and classes).

To obtain a block-diagonal weight matrix, the inter-group connections must be removed, so the inter-group connections are regularized first. This regularization can be expressed as Equation 3 below.

$$R_W(W, P, Q) = \sum_{g}\sum_{i} \big\| \big((I - P_g)\, W Q_g\big)_{i*} \big\|_2 + \sum_{g}\sum_{j} \big\| \big(P_g W (I - Q_g)\big)_{*j} \big\|_2 \qquad (3)$$

Here, $(W)_{i*}$ and $(W)_{*j}$ denote the $i$-th row and the $j$-th column of the weight $W$.

Equation 3 imposes a row/column-wise $\ell_{2,1}$-norm on the inter-group connections. FIG. 5 illustrates this regularization: the portion of the weights to which the regularization is applied is shown in a different color from the other regions.
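As a non-authoritative illustration, a row/column-wise (2,1)-norm penalty on inter-group connections could be computed as sketched below. The exact form is an assumption reconstructed from the description (row norms of the entries that feed a group's classes from features outside the group, plus column norms of the entries that leave a group's features toward classes outside the group); the function name `group_weight_reg` and the array layout are hypothetical.

```python
import numpy as np

def group_weight_reg(W, P, Q):
    """Assumed form of the group weight regularizer.

    W: (D, K) weights; P: (G, D) feature assignments; Q: (G, K) class assignments.
    """
    total = 0.0
    for p_g, q_g in zip(P, Q):
        # Features outside group g connected to classes inside group g.
        out_in = (1.0 - p_g)[:, None] * W * q_g[None, :]
        # Features inside group g connected to classes outside group g.
        in_out = p_g[:, None] * W * (1.0 - q_g)[None, :]
        total += np.linalg.norm(out_in, axis=1).sum()  # sum of row-wise l2 norms
        total += np.linalg.norm(in_out, axis=0).sum()  # sum of column-wise l2 norms
    return total

P = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])          # two feature groups
Q = np.array([[1.0, 0.0],
              [0.0, 1.0]])                    # two class groups

W_block = np.array([[1.0, 0.0],
                    [2.0, 0.0],
                    [0.0, 3.0],
                    [0.0, 4.0]])              # already block-diagonal
W_dense = np.ones((4, 2))                     # every feature feeds every class

penalty_block = group_weight_reg(W_block, P, Q)
penalty_dense = group_weight_reg(W_dense, P, Q)
```

In this sketch the penalty vanishes for a weight matrix that is already block-diagonal with respect to the assignment and is positive as soon as any inter-group entry is nonzero.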

Regularization of this kind yields groups quite similar to semantic groups. One caveat is that identical initialization of the group assignments should be avoided. For example, if $p_i = 1/G$, the objective reduces to a plain row/column-wise $\ell_{2,1}$-norm, and some row or column weight vectors may vanish before the group assignment is made.

The second regularization term, $R_D$, is described next.

To make the numerical optimization tractable, the binary variables $p_{gi}$ and $q_{gj}$ are first relaxed to take real values in the interval $[0, 1]$, subject to the constraints $\sum_g p_g = \mathbf{1}_D$ and $\sum_g q_g = \mathbf{1}_K$. These sum-to-one constraints can be optimized with a reduced gradient algorithm, which yields sparse solutions.

Alternatively, $p_{gi}$ and $q_{gj}$ can be reparameterized with free variables $\alpha_{gi}$ and $\beta_{gj}$ in softmax form, as in Equation 4 below, to perform a soft assignment.

$$p_{gi} = \frac{e^{\alpha_{gi}}}{\sum_{g'} e^{\alpha_{g'i}}}, \qquad q_{gj} = \frac{e^{\beta_{gj}}}{\sum_{g'} e^{\beta_{g'j}}} \qquad (4)$$
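A softmax reparameterization of this kind can be sketched as follows. The array layout (one row per group, one column per feature) and the name `softmax_assign` are assumptions for illustration; the point is only that unconstrained variables are mapped to assignments that lie in the open interval (0, 1) and sum to one over the groups.

```python
import numpy as np

def softmax_assign(logits):
    """Map free variables of shape (G, n) to soft group assignments.

    Each column (one feature or class) sums to one over the G groups.
    """
    z = logits - logits.max(axis=0, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
alpha = rng.normal(size=(3, 5))   # hypothetical: G = 3 groups, D = 5 features
P = softmax_assign(alpha)
col_sums = P.sum(axis=0)          # sum-to-one holds by construction
```

Because the constraint is satisfied by construction, the free variables can be updated by ordinary gradient descent without a projection step.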

Of the two approaches, the softmax form can achieve a more semantically meaningful grouping, whereas optimization under the sum-to-one constraint often converges faster than the softmax approach.

For the group assignment vectors to be completely mutually exclusive, they must be orthogonal across groups. That is, $p_i \cdot p_j = 0$ and $q_i \cdot q_j = 0$ must hold for all $i \neq j$. The orthogonality regularization term that enforces this is given by Equation 5.

$$R_D(P, Q) = \sum_{i<j} p_i \cdot p_j + \sum_{i<j} q_i \cdot q_j \qquad (5)$$

Here, the inequality $i < j$ avoids counting duplicate inner products. Since the dimensions of $p_i$ and $q_i$ can differ, the cosine similarity between group assignment vectors could be minimized instead. However, under the sum-to-one constraint and the sparsity induced by the regularization, the group assignment vectors have similar scales, and the cosine similarity reduces to the inner product.

A few caveats apply when minimizing the inner products between group assignment vectors. First, the number of inner products scales quadratically with the number of groups. Second, if the values of the group assignment vectors are initialized uniformly, the gradient may be close to zero, which slows down the optimization. For example, minimizing $a_1 \cdot a_2$ over $a_1, a_2 \in \mathbb{R}$ under the constraint $a_1 + a_2 = 1$ is the same as minimizing $a_1 (1 - a_1)$. If the initialization is at 0.5, the gradient is close to zero.
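The two points above can be checked numerically with a short sketch. The pairwise inner-product penalty below is an assumed form based on the description (sum over distinct pairs of group assignment vectors), and `disjoint_reg` and `two_group_grad` are hypothetical helper names.

```python
import numpy as np

def disjoint_reg(P, Q):
    """Assumed disjointness penalty: inner products over group pairs i < j."""
    def pairwise(A):
        gram = A @ A.T                     # gram[i, j] = a_i . a_j
        return np.triu(gram, k=1).sum()    # keep each unordered pair once
    return pairwise(P) + pairwise(Q)

def two_group_grad(a):
    """Derivative of a * (1 - a), the two-group case under sum-to-one."""
    return 1.0 - 2.0 * a

P_hard = np.array([[1.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 1.0]])
Q_hard = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
P_uniform = np.full((2, 4), 0.5)
Q_uniform = np.full((2, 2), 0.5)

penalty_hard = disjoint_reg(P_hard, Q_hard)           # disjoint assignment
penalty_uniform = disjoint_reg(P_uniform, Q_uniform)  # fully overlapping assignment
grad_at_uniform = two_group_grad(0.5)
grad_off_uniform = two_group_grad(0.6)
```

The value 0.5 is a stationary point of $a(1-a)$, so gradient descent makes no progress from an exactly uniform start, whereas a slightly perturbed start already has a nonzero gradient.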

The third regularization term, $R_E$, is described next.

With group splitting using Equation 5 alone, one group may dominate all the others. That is, one group may contain all the features and classes while the other groups contain none.

Therefore, the group assignment can be constrained to be balanced by regularizing the squared sum of the elements of each group assignment vector, as in Equation 6 below.

$$R_E(P, Q) = \sum_{g} \Big( \big(\textstyle\sum_{i} p_{gi}\big)^2 + \big(\textstyle\sum_{j} q_{gj}\big)^2 \Big) \qquad (6)$$

Because of the constraints $\sum_g p_g = \mathbf{1}_D$ and $\sum_g q_g = \mathbf{1}_K$, Equation 6 is minimized when the sums of the elements of the group assignment vectors are equal, for example when every group has the same number of elements. Since the dimensions of the feature and class group assignment vectors can differ, the ratio of the two terms is adjusted with appropriate weights. In addition, when batch normalization is used together with group weight regularization, the weights tend to shrink in magnitude while the scale parameter of the BN layer grows. To prevent this effect, the $\ell_2$-normalized weight $W / \|W\|_2$ can be used in place of $W$ within $R_W$, or the scale parameter of the BN layer can simply be deactivated.
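The effect of a squared-sum balance penalty can be seen in a small sketch. The form used here (the square of each group's total assignment mass, summed over groups) is an assumption based on the description, and `balance_reg` is a hypothetical name.

```python
import numpy as np

def balance_reg(P, Q):
    """Assumed balance penalty: squared sum of each group assignment vector."""
    return (P.sum(axis=1) ** 2).sum() + (Q.sum(axis=1) ** 2).sum()

# Balanced split: each group owns two features and one class.
P_balanced = np.array([[1.0, 1.0, 0.0, 0.0],
                       [0.0, 0.0, 1.0, 1.0]])
Q_balanced = np.array([[1.0, 0.0],
                       [0.0, 1.0]])

# Dominated split: the first group owns everything.
P_dominated = np.array([[1.0, 1.0, 1.0, 1.0],
                        [0.0, 0.0, 0.0, 0.0]])
Q_dominated = np.array([[1.0, 1.0],
                        [0.0, 0.0]])

penalty_balanced = balance_reg(P_balanced, Q_balanced)
penalty_dominated = balance_reg(P_dominated, Q_dominated)
```

Both assignments satisfy the sum-to-one constraint and are perfectly disjoint, so only the balance penalty distinguishes them; it is lower for the balanced split.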

The effect of the balanced group regularization is described later with reference to FIG. 10.

The following describes how to apply the objective function above to a deep neural network.

The weight splitting method described above can be applied to a deep neural network (DNN). Let $W^{(l)}$ denote the weights of the $l$-th layer ($1 \le l \le L$), where $L$ is the total number of layers of the DNN.

A deep neural network may contain two types of layers: (1) the input and hidden layers, which generate a feature vector for a given input, and (2) the output fully connected (FC) layer, on which the softmax classifier produces class probabilities.

For the weights of the output fully connected (FC) layer, the splitting method described above can be applied as is to split that layer. The method of the present disclosure can also be extended to multiple consecutive layers or to recursive hierarchical group assignment.

In a deep neural network, the lower layers learn basic representations that can be shared by all classes. Conversely, high-level representations are more likely to be specific to the classes of a particular group.

Accordingly, a natural way to split a deep neural network is to keep the lower layers ($l < S$) shared among the class groups and to split the layers with $S \le l$ progressively, from the top layer down to the $S$-th layer.

Each layer consists of input nodes and output nodes, which are connected by the weights $W^{(l)}$. $p^{(l)}_g$ and $q^{(l)}_g$ are the feature group assignment vector and the class group assignment vector for the input nodes and the output nodes of the $l$-th layer, respectively. Accordingly, $P^{(l)}_g W^{(l)} Q^{(l)}_g$ represents the intra-group connections for group $g$ in layer $l$.

Because the output nodes of one layer correspond to the input nodes of the next layer, the group assignment can be shared as $q^{(l)}_g = p^{(l+1)}_g$. As a result, no signal passes between different groups across layers, so forward and backward propagation in each group is independent of the processing of the other groups. The computation for each group can therefore be separated and processed in parallel. To this end, the constraint $q^{(l)}_g = p^{(l+1)}_g$ may be imposed on all layers.

The softmax computation at the output layer includes a normalization over all classes, which requires aggregating the logits across groups. During inference, however, it is sufficient to identify the class with the largest logit. The class with the largest logit within each group can be determined independently, and computing the maximum among them requires only minimal communication and computation. Thus, apart from the softmax computation at the output layer, the computation for each group can be decomposed and processed in parallel.
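The group-wise maximum can be sketched as follows: each group reports only its best logit and the corresponding class, and a final comparison over the groups yields the prediction. The function name `grouped_argmax` and the example logits are illustrative assumptions.

```python
import numpy as np

def grouped_argmax(group_logits, group_classes):
    """Return the global class index with the largest logit.

    group_logits[g] holds the logits computed by group g, and
    group_classes[g] holds the global class indices of those logits.
    """
    best_class, best_logit = None, -np.inf
    for logits, classes in zip(group_logits, group_classes):
        local = int(np.argmax(logits))          # computed independently per group
        if logits[local] > best_logit:          # only one scalar per group is compared
            best_logit = float(logits[local])
            best_class = int(classes[local])
    return best_class

group_logits = [np.array([0.2, 1.5]), np.array([2.3, -0.7, 0.1])]
group_classes = [np.array([0, 3]), np.array([1, 2, 4])]
predicted = grouped_argmax(group_logits, group_classes)

# Reference: scatter the logits into one vector and take a plain argmax.
full = np.empty(5)
for logits, classes in zip(group_logits, group_classes):
    full[classes] = logits
reference = int(np.argmax(full))
```

Only one (logit, class) pair per group needs to be exchanged between devices, which is why the communication cost stays small.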

The objective function applied to a deep neural network is the same as Equations 1 and 2 described above, except that there are $L$ sets of $P^{(l)}$ and $Q^{(l)}$, one for each layer. Because the proposed group splitting applies to convolutional filters in a similar way, it can also be applied to a CNN. For example, suppose the weight of a convolutional layer is a 4-D tensor $W_c \in \mathbb{R}^{M \times N \times D \times K}$, where $M$ and $N$ are the height and width of the receptive field and $D$ and $K$ are the numbers of input and output convolutional filters. The group $\ell_{2,1}$-norm described above can be applied along the input and output filter dimensions. To do so, the 4-D weight tensor $W_c$ is reduced to a 2-D matrix $\widetilde{W}_c \in \mathbb{R}^{D \times K}$ using Equation 7 below.

$$\widetilde{W}_c = \{\tilde{w}_{dk}\}, \qquad \tilde{w}_{dk} = \sqrt{\sum_{m,n} w_{mndk}^2} \qquad (7)$$
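One way to reduce a 4-D convolution weight to a 2-D matrix over (input filters, output filters) is to take the l2 norm over the two spatial axes, as sketched below. This specific reduction and the axis order (height, width, input filters, output filters) are assumptions for illustration, and `reduce_conv_weight` is a hypothetical name.

```python
import numpy as np

def reduce_conv_weight(Wc):
    """Collapse a (M, N, D, K) convolution weight to a (D, K) matrix.

    Each entry is the l2 norm of one M x N kernel slice, so a zero entry
    means that input filter d does not influence output filter k at all.
    """
    return np.sqrt((Wc ** 2).sum(axis=(0, 1)))

Wc = np.ones((3, 3, 2, 4))        # hypothetical 3x3 kernels, 2 inputs, 4 outputs
W2d = reduce_conv_weight(Wc)      # every entry equals sqrt(9) = 3
```

The reduced matrix can then be treated like the weight matrix of a fully connected layer when evaluating a group-wise penalty.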

Next, the weight regularization for the convolutional weights can be obtained by applying Equation 7 to the group weight regularization of Equation 3 described above.

The method of the present disclosure is also applicable to residual networks, by sharing the group assignment across nodes connected by a shortcut connection. Specifically, consider a residual block in which two convolutional layers are bypassed by a shortcut connection.

Let $W^{(l_1)}$ and $W^{(l_2)}$ be the weights of the convolutional layers, and let $p^{(l_1)}_g$, $q^{(l_1)}_g$, $p^{(l_2)}_g$, and $q^{(l_2)}_g$ be the group assignment vectors of each layer, with $q^{(l_1)}_g = p^{(l_2)}_g$. Because the identity shortcut connects the input nodes of the first convolutional layer to the output nodes of the second convolutional layer, the grouping of these nodes can be shared as $p^{(l_1)}_g = q^{(l_2)}_g$.

Hierarchical grouping is described next.

Classes often have a semantic hierarchy. For example, the dog group and the cat group are subgroups of mammals. The deep split described above can therefore be extended to obtain a multi-level hierarchy of categories. For simplicity, consider a two-level tree in which each supergroup contains a set of subgroups; extending this to a hierarchy of arbitrary depth is straightforward.

Assume that the grouping branches at the $l$-th layer and that the output nodes of the $l$-th layer are grouped by $G$ supergroup assignment vectors $q^{(l)}_g$ with $\sum_g q^{(l)}_g = \mathbf{1}$.

In the next layer, assume that each supergroup $g = 1, \ldots, G$ has corresponding subgroup assignment vectors $p^{(l+1)}_{gs}$ with $\sum_{g,s} p^{(l+1)}_{gs} = \mathbf{1}$. As described above, the input nodes of the $(l+1)$-th layer correspond to the output nodes of the $l$-th layer. Therefore $p^{(l+1)}_g = \sum_s p^{(l+1)}_{gs}$ can be defined, which maps the subgroup assignments to the corresponding supergroup assignment. Then, as in the deep split, the constraint $q^{(l)}_g = p^{(l+1)}_g$ is imposed.

Meanwhile, one way to build such a structure in a CNN is to branch each group into two subgroups whenever the number of convolutional filters doubles.

Parallelization of SplitNet is described next.

The method according to the present disclosure produces a tree-structured network whose subnetworks have no connections between groups. Each resulting subnetwork can therefore be assigned to a separate processor, which enables model parallelism. In an implementation, it suffices to assign the shared lower layers together with the group-specific upper layers to each node.

Although this introduces redundant computation in the lower layers, the approach is acceptable because the test time for the lower layers is unchanged. Parallelization of training is also possible.

FIG. 6 is a diagram showing the splitting algorithm for a learning model according to the present disclosure. FIG. 7 is a diagram showing an example in which the output of one group is split.

Referring to FIG. 6, the neural network parameters may first be initialized either from pretrained neural network parameters or randomly, and the split variables may be initialized close to the uniform value $1/G$.

Next, the neural network parameters and the split variables are optimized by stochastic gradient descent so as to minimize the task loss function and the weight decay regularization term together with the regularization terms described above.

The split variables optimized in this way converge to 0 or 1, indicating which group each node of each layer belongs to. The inter-group connections of the neural network parameters are almost entirely suppressed, and the parameter matrix becomes block-diagonal when rearranged according to the split variables. Each block of the parameter matrix then corresponds to the connections within one group, and the connections between groups have been removed.

Accordingly, as in FIG. 7, a layer can be split vertically into several layers according to the groups, with the diagonal blocks of the parameter matrix used as the parameters of the resulting layers.
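The extraction of per-group layers from converged split variables can be sketched as follows. Hardening the soft assignments with an argmax over groups is an assumption of this example (the description only states that the variables converge to 0 or 1), and `split_layer` and the example values are hypothetical.

```python
import numpy as np

def split_layer(W, P, Q):
    """Split a (D, K) weight matrix into per-group blocks.

    P: (G, D) input-node assignments, Q: (G, K) output-node assignments.
    Returns one (input_indices, output_indices, block) tuple per group.
    """
    in_group = P.argmax(axis=0)    # hard group of each input node
    out_group = Q.argmax(axis=0)   # hard group of each output node
    blocks = []
    for g in range(P.shape[0]):
        ins = np.flatnonzero(in_group == g)
        outs = np.flatnonzero(out_group == g)
        # Keep only intra-group connections; inter-group entries are dropped.
        blocks.append((ins, outs, W[np.ix_(ins, outs)]))
    return blocks

W = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [3.0, 0.0],
              [0.0, 4.0]])                       # groups are interleaved by row
P = np.array([[0.9, 0.1, 0.8, 0.2],
              [0.1, 0.9, 0.2, 0.8]])
Q = np.array([[0.95, 0.05],
              [0.05, 0.95]])

blocks = split_layer(W, P, Q)
kept_params = sum(block.size for _, _, block in blocks)
```

Each returned block can serve as the parameter matrix of one of the vertically split layers, and the index arrays record which input and output nodes that layer handles.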

구체적으로, 앞서 언급한 정규화 함수를 통해 한 레이어의 입력과 출력을 각각 그룹으로 나누어 수직하게 분할할 수 있다. 이를 연속한 여러 레이어에 적용함으로써 한 레이어의 한 그룹의 출력이 다음 레이어의 해당 그룹의 입력으로 이어질 수 있게끔 분할 변수를 공유하면, 여러 레이어에 걸쳐 그룹들이 상호 간에 연결이 없게끔 나눠지게 된다. 또한, 한 그룹의 출력을 다음 레이어의 여러 출력으로 나뉘게끔 분할 변수를 공유하면 최종적으로 만들어지는 신경만은 그룹이 분기하게 되는 구조를 가지게 된다. Specifically, through the aforementioned normalization function, input and output of one layer can be divided into groups and vertically divided. By applying this to multiple successive layers, by sharing the splitting variable so that the output of one group of one layer can lead to the input of the corresponding group of the next layer, the groups across multiple layers are divided so that there is no connection to each other. In addition, if the split variable is shared to divide the output of one group into multiple outputs of the next layer, only the nerve that is finally created has a structure in which the group branches.

Finally, the parameters are fine-tuned using the task loss function and the parameter decay regularization term, yielding a tree-structured neural network.

Hereinafter, the effects of the optimization method according to the present disclosure are described with reference to FIGS. 8 to 15.

The experimental conditions under which the optimization method was evaluated are described first.

In the experiments of FIGS. 8 to 15, image classification was performed on the two benchmark data sets described below.

The first is CIFAR-100. The CIFAR-100 data set contains 32x32 pixel images of 100 generic object classes, and each class has 500 images for training and 100 images for testing. In these experiments, 50 images per class were set aside as a validation set for cross-validation.
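Setting aside a fixed number of images per class can be sketched as follows. This is an illustrative sketch, not the procedure used in the experiments: it assumes the standard CIFAR-100 training layout of 500 images per class, uses stand-in labels instead of the real data set, and the helper name `holdout_per_class` and the random seed are hypothetical.

```python
import numpy as np

def holdout_per_class(labels, n_val, seed=0):
    """Return (train_idx, val_idx) with n_val validation samples drawn from each class."""
    rng = np.random.default_rng(seed)
    train_idx, val_idx = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        val_idx.extend(idx[:n_val])
        train_idx.extend(idx[n_val:])
    return np.sort(np.array(train_idx)), np.sort(np.array(val_idx))

# Stand-in labels: 100 classes x 500 training images (assumed layout).
labels = np.repeat(np.arange(100), 500)
train_idx, val_idx = holdout_per_class(labels, n_val=50)
print(train_idx.size, val_idx.size)
```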

The second is ImageNet-1K. The ImageNet-1K data set consists of 1.2 million images of 1,000 generic object classes. Following the standard procedure, each class has 1,000 to 1,300 images for training and 50 images for testing.

Five classification models were used to compare different grouping methods.

The first is the base network, an ordinary network that retains all of its weights. For the CIFAR-100 experiments, the Wide Residual Network (WRN), one of the state-of-the-art networks for that data set, was used. AlexNet and ResNet-18 were used as the base networks for ILSVRC2012.

The second is SplitNet-Semantic. This is a variant of the SplitNet described above that obtains its class grouping from the semantic taxonomy provided by the data set. Before training, the network is split according to the taxonomy, with the layers divided equally and a subnetwork assigned to each group, and training then proceeds from scratch.

The third is SplitNet-Clustering. This is a variant of the second method in which the classes are split by hierarchical spectral clustering of the pretrained base network.

The fourth is SplitNet-Random, a variant that uses a random class split.

The fifth is SplitNet, the method described above that learns the weight matrices together with an automatic split.

FIG. 8 shows an example of the optimization result for three consecutive upper layers. Specifically, FIG. 8 shows the result of optimizing the split variables together with the neural network parameters in the second step, when the method is applied to the three consecutive upper layers (FC6, FC7, and FC8) of AlexNet, a type of deep neural network, while it is trained on the image classification task of the ImageNet data set. The rows and columns are reordered according to the values of the split variables.

In FIG. 8, black indicates a value of 0 and white indicates a positive value. It can be seen that, in the parameter matrices, the connections within each group are active (the diagonal blocks are positive) while the connections between groups are suppressed.

That is, FIG. 8 shows how the split variables and the parameters are divided by the method according to the present disclosure, and confirms that the layers of the deep neural network have a hierarchical structure.

FIG. 9 shows a benchmark of a learning model to which the optimization method is applied. Specifically, FIG. 9 summarizes the runtime performance of SplitNets when model parallelism is used.

Referring to FIG. 9, optimizing a DNN in this way not only reduces the number of parameters but also increases speed, because the split structure can be exploited for model parallelism.

A natural way to parallelize the model is to assign each split group, together with the shared lower layers, to its own GPU. This incurs redundant computation, but it guarantees that no communication is needed between GPUs. As a result, a speedup of up to 1.44 times is observed.
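The assignment of one split group per device can be sketched as follows. This is an illustrative sketch only: CPU threads stand in for GPUs, the layer sizes are arbitrary, and the names `shared_W`, `branch_W`, `feature_slices`, and `run_group` are hypothetical. Each worker recomputes the shared lower layer itself, trading redundant computation for the absence of communication between workers.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(2)
shared_W = rng.normal(size=(8, 6))            # shared lower layer (arbitrary size)
feature_slices = [slice(0, 3), slice(3, 6)]   # features used by each group
branch_W = [rng.normal(size=(3, 5)) for _ in feature_slices]  # one branch per group

def run_group(g, x):
    # Each worker recomputes the shared lower layer (redundant computation),
    # so no activations have to be exchanged between workers.
    h = relu(x @ shared_W)
    return h[feature_slices[g]] @ branch_W[g]

x = rng.normal(size=8)
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(run_group, g, x) for g in range(2)]
    outputs = [f.result() for f in futures]

logits = np.concatenate(outputs)   # scores of group 0 followed by group 1
print(logits.shape)
```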

FIG. 10 shows the effect of the balanced group regularization.

Referring to FIG. 10, a sufficiently strong regularization makes the group sizes uniform, which is desirable for both parameter reduction and model parallelism in SplitNet.

Relaxing this regularization gives the individual group sizes more flexibility. If its strength is set too low, however, all classes and features are assigned to a single group, which is a trivial solution. For this reason, it is desirable in the experiments to keep all groups of the model balanced for network reduction and parallelization.
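A short calculation shows why balanced groups favor parameter reduction. After a split, a layer keeps only its diagonal blocks, so its parameter count is the sum over groups of (inputs in the group) x (outputs in the group). The layer width of 512 and the group sizes below are hypothetical values chosen for illustration.

```python
def split_parameter_count(in_sizes, out_sizes):
    """Parameters kept by a split layer: one (inputs x outputs) block per group."""
    return sum(d * k for d, k in zip(in_sizes, out_sizes))

D, K = 512, 512                      # layer width (hypothetical)
full = D * K                         # unsplit layer
balanced = split_parameter_count([128] * 4, [128] * 4)
unbalanced = split_parameter_count([384, 64, 32, 32], [384, 64, 32, 32])
single_group = split_parameter_count([512], [512])   # the trivial solution

print(full, balanced, unbalanced, single_group)
# → 262144 65536 153600 262144
```

With four equal groups the layer keeps one quarter of its parameters, the unbalanced split keeps well over half, and the single-group case saves nothing.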

FIG. 11 shows the test error of each algorithm on the CIFAR-100 data set, and FIG. 12 compares parameter (or computation) reduction against test error on the CIFAR-100 data set.

Referring to FIG. 11, the SplitNet variants that use the semantic taxonomy provided by the data set (-S) or spectral clustering (-C) outperform random grouping (-R), which shows that an appropriate grouping is important for splitting a DNN.

In particular, SplitNet outperforms all of the other variants. SplitNet also has the advantage that, unlike the semantic and clustering splits, it requires neither additional semantic information nor precomputed network weights.

Referring to FIG. 12, because of the large number of filters, the FC split yields only a minimal parameter reduction. In contrast, the Shallow Split, which includes five convolutional layers, reduces the network parameters by 32.44% while slightly improving test accuracy.

The deep and hierarchical splits further reduce the parameters and FLOPs at the cost of a minor loss in accuracy.

The shallow split performs considerably better than the other algorithms while using far fewer parameters. This is attributed to the fact that the SplitNet of the present disclosure starts from the full network and learns to remove unnecessary connections between different groups in the internal layers, which gives the layers a regularization effect. Layer splitting can also be regarded as a form of variable selection: each group of a layer simply selects only the node group it needs.

In conclusion, when the top six layers of a Wide Residual Network, a type of deep neural network, are split while it is trained on the image classification task of the CIFAR-100 data set, the number of parameters is reduced by 32% and the amount of computation by 15%, while performance improves by 0.3 percentage points on average.

FIG. 13 shows which group the subclasses of the 20 superclasses belong to. Specifically, FIG. 13 compares the group assignments learned by the FC SplitNet (G = 4) with the semantic categories provided by CIFAR-100.

Referring to FIG. 13, the people category contains five classes: baby, boy, girl, man, and woman. All of these classes are assigned to group 2 by the algorithm according to the present disclosure. In every semantic category, at least three classes are grouped together. As the figure shows, semantically similar superclasses are placed in the same group.

FIGS. 14 and 15 compare parameter (or computation) reduction against test error on the ILSVRC2012 data set.

Referring to FIGS. 14 and 15, SplitNet with AlexNet as the base model greatly reduces the number of parameters, which are concentrated in the FC layers. However, because most of the FLOPs arise in the lower convolutional (conv) layers, the reduction in FLOPs is only small.

The SplitNet based on AlexNet shows only a minor loss in test accuracy despite a substantial reduction in parameters. The SplitNet based on ResNet-18, in contrast, loses test accuracy as the split becomes deeper. This is presumed to be because splitting ResNet-18 restricts the width of its convolutional layers, which is at most 512 and is small relative to the number of classes, and thereby damages the capacity of the network.

Nevertheless, the proposed SplitNet outperforms SplitNet-Random in all experiments. Specifically, when the top six layers are split in a network obtained from ResNet-18, a type of deep neural network, by doubling the number of filters in each layer from the original N to M, and the network is trained on the image classification task of the ImageNet data set, the number of parameters is reduced by 38% and the amount of computation by 12%, while performance improves by 0.1 percentage points on average.

FIG. 16 is a flowchart illustrating a learning model optimization method according to an embodiment of the present disclosure.

Referring to FIG. 16, a parameter matrix and a plurality of split variables of a learning model composed of a plurality of layers are first initialized (S1610). Specifically, the parameter matrix may be initialized randomly, and the plurality of split variables may be initialized so that they are not uniform with respect to one another.

Then, the plurality of split variables and a new parameter matrix having a block diagonal form for the learning model are calculated so as to minimize an objective function that includes a loss function for the learning model, a parameter decay regularization term, and a split regularization term defined by the parameter matrix and the plurality of split variables (S1620). Here, the objective function of Equation 1 may be minimized using stochastic gradient descent.

Here, the split regularization term may include a group parameter regularization term that suppresses connections between groups and activates only connections within a group, a disjoint group regularization term that makes the groups mutually orthogonal, and a balanced group regularization term that keeps any one group from becoming excessively large.
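The three components of the split regularization term can be sketched in NumPy as follows. The defining equations are not restated in this passage, so the concrete forms used here (row-wise and column-wise l2 norms for the group parameter term, pairwise inner products for the disjoint term, squared soft group sizes for the balanced term) are assumptions chosen to match the stated purpose of each term and should not be read as the claimed definitions. `P` and `Q` are hypothetical names for the input-side and output-side split variables, stored with one row per group.

```python
import numpy as np

def group_weight_term(W, P, Q):
    """Penalize weights linking an input unit of one group to an output unit of another."""
    total = 0.0
    for p_g, q_g in zip(P, Q):
        outside_to_group = (1.0 - p_g)[:, None] * W * q_g[None, :]
        group_to_outside = p_g[:, None] * W * (1.0 - q_g)[None, :]
        total += np.linalg.norm(outside_to_group, axis=1).sum()   # row-wise l2 norms
        total += np.linalg.norm(group_to_outside, axis=0).sum()   # column-wise l2 norms
    return total

def disjoint_term(P, Q):
    """Inner products between indicator vectors of distinct groups (0 when orthogonal)."""
    G = P.shape[0]
    return sum(P[g] @ P[h] + Q[g] @ Q[h]
               for g in range(G) for h in range(g + 1, G))

def balance_term(P, Q):
    """Sum of squared soft group sizes; smallest when the sizes are equal."""
    return (P.sum(axis=1) ** 2).sum() + (Q.sum(axis=1) ** 2).sum()

# Hard, balanced assignment: 4 inputs and 4 outputs in two groups.
P = np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])
Q = P.copy()
W = np.ones((4, 4)) * (P.T @ Q)          # keep intra-group weights only

P_soft = np.full((2, 4), 0.5)            # undecided split variables
P_skew = np.array([[1.0, 1.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]])

print(group_weight_term(W, P, Q), disjoint_term(P, Q))
print(disjoint_term(P_soft, P_soft), balance_term(P, Q), balance_term(P_skew, P_skew))
```

Under these assumed forms, a hard block-diagonal solution incurs no group parameter or disjoint penalty, undecided split variables are penalized by the disjoint term, and a skewed assignment is penalized more than a balanced one by the balanced term.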

Then, the plurality of layers are vertically split according to the groups based on the calculated split variables, and the learning model is reconstructed using the calculated new parameter matrix as the parameters of the vertically split layers.

After the reconstruction, a second new parameter matrix may be calculated for the reconstructed learning model so as to minimize a second objective function that includes only the loss function for the learning model and the parameter decay regularization term, and the learning model may be optimized using the calculated second new parameter matrix as the parameters of the vertically split layers.

Thus, the learning model optimization method according to the present embodiment clusters the classes of the learning model into groups, each fitting an exclusive set of features. Because it uses an objective function such as Equation 1, it is fully integrated into the network training procedure, so the network weights and the split can be learned simultaneously. The computation of the optimized learning model can be divided among several devices, and because the amount of computation and the number of parameters are reduced, computation is faster even on a single device. The learning model optimization method of FIG. 16 may be executed on an electronic device having the configuration of FIG. 1 or FIG. 2, or on an electronic device having another configuration.

The learning model optimization method described above may be implemented as a program containing an algorithm executable on a computer, and the program may be provided stored on a non-transitory computer readable medium.

A non-transitory computer readable medium is a medium that stores data semi-permanently and can be read by a device, as opposed to a medium that stores data only briefly, such as a register, a cache, or a memory. Specifically, the programs for performing the various methods described above may be provided stored on a non-transitory computer readable medium such as a CD, a DVD, a hard disk, a Blu-ray disc, a USB drive, a memory card, or a ROM.

FIG. 17 is a flowchart illustrating a learning model splitting method according to an embodiment of the present disclosure.

Referring to FIG. 17, the neural network parameters may first be set to pretrained neural network parameters or initialized randomly, and the split variables may be initialized close to a uniform value (S1710).
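One way to initialize split variables close to, but not exactly at, a uniform value is sketched below. The softmax parameterization over small random logits is an assumption made for this illustration and is not stated in this passage; the names `init_split_variables` and `noise` are likewise hypothetical.

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def init_split_variables(num_groups, num_units, noise=0.01, seed=0):
    """Split variables near 1/num_groups, with small noise to break the symmetry."""
    rng = np.random.default_rng(seed)
    logits = noise * rng.normal(size=(num_groups, num_units))
    return softmax(logits, axis=0)            # each column sums to 1 over the groups

P = init_split_variables(num_groups=4, num_units=6)
print(P.round(3))
```

Every unit starts with nearly equal membership in all four groups, and the small noise keeps the groups from being exactly identical, so gradient updates can pull them apart.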

Next, the neural network parameters and the split variables are optimized by stochastic gradient descent so as to jointly minimize the task loss function, the parameter decay regularization term, and the split regularization term described above (S1720).

The split variables optimized in this way converge to values of 0 or 1 that indicate which group each node of a layer belongs to. The inter-group connections of the neural network parameters are almost entirely suppressed, so the parameter matrix becomes block diagonal when its rows and columns are reordered according to the split variables.

Next, the neural network may be split using the split variables calculated above (S1730).
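The splitting step can be sketched as turning the nearly converged split variables into hard group assignments and cutting one weight block per group out of the parameter matrix. Taking the argmax over groups is an assumption made for this illustration, and `split_layer`, `P`, and `Q` are hypothetical names; the toy values are not from the experiments.

```python
import numpy as np

def split_layer(W, P, Q):
    """Return (rows, cols, block) per group from split variables P (G x D) and Q (G x K)."""
    in_group = P.argmax(axis=0)     # hard group of each input unit
    out_group = Q.argmax(axis=0)    # hard group of each output unit
    blocks = []
    for g in range(P.shape[0]):
        rows = np.flatnonzero(in_group == g)
        cols = np.flatnonzero(out_group == g)
        blocks.append((rows, cols, W[np.ix_(rows, cols)]))
    return blocks

# Toy layer with 4 inputs, 3 outputs, and 2 groups.
P = np.array([[0.98, 0.03, 0.99, 0.01],
              [0.02, 0.97, 0.01, 0.99]])
Q = np.array([[0.96, 0.02, 0.05],
              [0.04, 0.98, 0.95]])
W = np.arange(12.0).reshape(4, 3)

for rows, cols, block in split_layer(W, P, Q):
    print(rows, cols, block.shape)
```

Each returned block would then serve as the parameter matrix of one vertically split sub-layer, and the weights outside the blocks are discarded.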

Finally, the parameters are fine-tuned using the task loss function and the parameter decay regularization term, yielding a tree-structured neural network (S1740).

Thus, the learning model splitting method according to the present embodiment clusters the classes of the learning model into groups, each fitting an exclusive set of features. Because it uses an objective function such as Equation 1, it is fully integrated into the network training procedure, so the network weights and the split can be learned simultaneously. The learning model splitting method of FIG. 17 may be executed on an electronic device having the configuration of FIG. 1 or FIG. 2, or on an electronic device having another configuration.

The learning model splitting method described above may be implemented as a program containing an algorithm executable on a computer, and the program may be provided stored on a non-transitory computer readable medium.

Although preferred embodiments of the present disclosure have been shown and described above, the present disclosure is not limited to the specific embodiments described. Those of ordinary skill in the art to which the disclosure pertains may make various modifications without departing from the gist of the present disclosure as claimed in the claims, and such modifications should not be understood separately from the technical idea or outlook of the present disclosure.

100: electronic device 110: memory
120: processor 130: communication interface
140: display 150: operation input unit

Claims

A method for optimizing a learning model in an electronic device, the method comprising:
initializing, in the electronic device, a parameter matrix and a plurality of split variables of a learning model composed of a plurality of layers;
calculating, in the electronic device, the plurality of split variables and a new parameter matrix having a block diagonal form for the learning model such that an objective function including a loss function for the learning model, a parameter decay regularization term, and a split regularization term defined by the parameter matrix and the plurality of split variables is minimized; and
vertically splitting, in the electronic device, the plurality of layers according to groups based on the calculated split variables, and reconstructing the learning model using the calculated new parameter matrix as parameters of the vertically split layers.

The method of claim 1,
wherein the initializing comprises
randomly initializing the parameter matrix and initializing the plurality of split variables so that they are not uniform with respect to one another.

The method of claim 1,
wherein the calculating comprises
using stochastic gradient descent so that the objective function is minimized.

The method of claim 1,
wherein the split regularization term comprises
a group parameter regularization term that suppresses connections between groups and activates only connections within a group, a disjoint group regularization term that makes the groups mutually orthogonal, and a balanced group regularization term that keeps any one group from becoming excessively large.

The method of claim 1, further comprising:
calculating, in the electronic device, a second new parameter matrix for the reconstructed learning model such that a second objective function including only the loss function for the learning model and the parameter decay regularization term is minimized; and
optimizing, in the electronic device, the learning model using the calculated second new parameter matrix as parameters of the vertically split layers.

The method of claim 5, further comprising
processing, in the electronic device, the vertically split layers of the optimized learning model in parallel using different processors.

An electronic device comprising:
a memory storing a learning model composed of a plurality of layers; and
a processor configured to initialize a parameter matrix of the learning model and a plurality of split variables, calculate the plurality of split variables and a new parameter matrix having a block diagonal form for the learning model such that an objective function including a loss function for the learning model, a parameter decay regularization term, and a split regularization term defined by the parameter matrix and the plurality of split variables is minimized, vertically split the plurality of layers according to groups based on the calculated split variables, and reconstruct the learning model using the calculated new parameter matrix as parameters of the vertically split layers.

The electronic device of claim 7,
wherein the processor
randomly initializes the parameter matrix and initializes the plurality of split variables so that they are not uniform with respect to one another.

The electronic device of claim 7,
wherein the processor
uses stochastic gradient descent so that the objective function is minimized.

The electronic device of claim 7,
wherein the split regularization term comprises
a group parameter regularization term that suppresses connections between groups and activates only connections within a group, a disjoint group regularization term that makes the groups mutually orthogonal, and a balanced group regularization term that keeps any one group from becoming excessively large.

The electronic device of claim 7,
wherein the processor
calculates a second new parameter matrix for the reconstructed learning model such that a second objective function including only the loss function for the learning model and the parameter decay regularization term is minimized, and optimizes the learning model using the calculated second new parameter matrix as parameters of the vertically split layers.

A computer readable recording medium comprising a program for executing a method for optimizing a learning model in an electronic device,
wherein the learning model optimization method comprises:
initializing a parameter matrix and a plurality of split variables of a learning model composed of a plurality of layers;
calculating the plurality of split variables and a new parameter matrix having a block diagonal form for the learning model such that an objective function including a loss function for the learning model, a parameter decay regularization term, and a split regularization term defined by the parameter matrix and the plurality of split variables is minimized; and
vertically splitting the plurality of layers according to groups based on the calculated split variables, and reconstructing the learning model using the calculated new parameter matrix as parameters of the vertically split layers.