KR102619539B1

KR102619539B1 - Optimization method of neural network for multi-gpu system and optimization system using the same

Info

Publication number: KR102619539B1
Application number: KR1020200112148A
Authority: KR
Inventors: 류기하; 박영준
Original assignee: 한양대학교 산학협력단
Priority date: 2020-06-22
Filing date: 2020-09-03
Publication date: 2023-12-29
Also published as: KR20210157813A

Abstract

This application relates to a method of optimizing a neural network for a multi-GPU system including a main GPU and an auxiliary GPU. An optimization method according to one aspect of the present specification includes: obtaining the neural network, which includes a plurality of layers; a profiling step of collecting base information for optimizing the neural network, the base information including GPU information of the multi-GPU system and neural network information regarding the structure of the neural network; converting the neural network into a first tree-structured neural network, which includes a first main neural network processed on the main GPU and a first auxiliary neural network processed on the auxiliary GPU, by branching the neural network at a first branch point that is one of the points between the plurality of layers; and optimizing the first tree-structured neural network based on the base information.

Description

Neural network optimization method for multi-GPU system and optimization system using the same {OPTIMIZATION METHOD OF NEURAL NETWORK FOR MULTI-GPU SYSTEM AND OPTIMIZATION SYSTEM USING THE SAME}

This application relates to a method of optimizing a neural network for a multi-GPU system and an optimization system using the same. More specifically, it relates to an optimization method, and an optimization system using the same, that can improve inference accuracy and inference performance by optimizing a neural network into a tree-structured neural network in a multi-GPU system including a main GPU and an auxiliary GPU.

A neural network (NN) is a widely used algorithm in the field of machine learning and is used in various applications such as image classification, medical diagnosis, drones, and autonomous vehicles.

A tree-structured neural network (tree-structured NN), based on the divide-and-conquer technique, improves the inference accuracy of a neural network by constructing the network in the form of a tree. Rather than simply increasing the size of the network to improve accuracy, this approach lowers the complexity of the problem by dividing one problem into several subproblems, so accuracy can be improved effectively just by applying the technique to an existing neural network.

Much research has been conducted on accelerating neural networks with GPUs, and recently, as demand for systems using multiple GPUs has increased, much related research and technology development has taken place. Using multiple GPUs requires different techniques from those used for a multi-core CPU, and model parallelization and data parallelization, whose effectiveness has been demonstrated through many studies, are used in various neural network frameworks.

Although it varies by network, a tree-structured neural network is generally larger than the original network and uses more parameters. This means that the number of operations required to process data increases, so the inference speed per data item and the overall throughput decrease. Existing research has focused on improving accuracy and has not sufficiently considered these additional costs.

Of the two representative techniques that use multiple GPUs, model parallelization suffers performance degradation when intermediate results of computation are broadcast to multiple GPUs during data processing, and data parallelization is unsuitable for real-world situations that require inference on data arriving irregularly.

One problem to be solved in this specification is to provide an optimization method and an optimization system that can improve inference accuracy while preserving inference performance when a tree-structured neural network is used in a system with multiple GPUs.

The problems to be solved in this specification are not limited to the problem described above, and problems not mentioned will be clearly understood from this specification and the attached drawings by those of ordinary skill in the art to which this application pertains.

According to one aspect of the present specification, there may be provided a method of optimizing a neural network for a multi-GPU system including a main GPU and an auxiliary GPU, the method including: obtaining the neural network, which includes a plurality of layers; a profiling step of collecting base information for optimizing the neural network, the base information including GPU information of the multi-GPU system and neural network information regarding the structure of the neural network; converting the neural network into a first tree-structured neural network, which includes a first main neural network processed on the main GPU and a first auxiliary neural network processed on the auxiliary GPU, by branching the neural network at a first branch point that is one of the points between the plurality of layers; and optimizing the first tree-structured neural network based on the base information.

The means for solving the problems of this specification are not limited to those described above, and means not mentioned will be clearly understood from this specification and the attached drawings by those of ordinary skill in the art to which this application pertains.

According to an embodiment of the present specification, it is possible to provide an optimization method and an optimization system that use a parallel inference technique adapting the structural characteristics of a tree-structured neural network to a multi-GPU system, so that a developer can select the desired optimal tree-structured neural network by balancing various performance indicators, such as inference throughput, in addition to accuracy.

According to an embodiment of the present specification, it is possible to provide an optimization method and an optimization system that profile and optimize various performance factors that existing tree-structured neural network research overlooked in favor of accuracy, such as throughput, number of parameters, peak GPU memory usage, and inference speed per data item. The constraints that may be encountered in an actual inference environment are thereby considered more realistically, and these performance indicators can be balanced against one another.

According to an embodiment of the present specification, it is possible to provide an optimization method and an optimization system that minimize the computational overhead and the unsuitability for actual inference environments, which were problems of existing neural network parallelization techniques used in multi-GPU environments, through a two-stage pipelining technique that exploits the structural characteristics of a tree-structured neural network.

According to an embodiment of the present specification, it is possible to provide an optimization method and an optimization system that, by performing system-aware optimization, maximize GPU resource utilization in the multi-GPU systems of the various research and industrial settings where tree-structured neural networks can be applied.

According to an embodiment of the present specification, it is possible to provide an optimization method and an optimization system that analyze, based on data obtained through profiling, the stalls that may occur on the main GPU or the auxiliary GPU due to the location of the branch point of the tree-structured neural network, performance differences among the GPUs in the system, and differences in data transfer speed between GPUs, and then minimize those stalls by adjusting the length of the auxiliary neural network.

The effects of the invention of this specification are not limited to those described above, and effects not mentioned will be clearly understood from this specification and the attached drawings by those of ordinary skill in the art to which this application pertains.

Figure 1 is a diagram of an optimization system according to an embodiment of the present specification.
Figure 2 is a diagram of an optimization method according to an embodiment of the present specification.
Figure 3 is a table regarding performance indicators considered by the optimization method and optimization system according to an embodiment of the present specification.
Figure 4 is a flowchart of an optimization method according to an embodiment of the present specification.
Figure 5 is a diagram relating to transformation of a neural network according to an embodiment of the present specification.
Figure 6 is a diagram of a tree-structured neural network branching at different points according to an embodiment of the present specification.
Figure 7 is a diagram of an optimization step according to an embodiment of the present specification.
Figure 8 is a diagram of a neural network and performance indicators according to an embodiment of the present specification.
Figure 9 is a table regarding hyperparameters of the optimization method and optimization system according to an embodiment of the present specification.
Figure 10 is a diagram showing differences between neural networks for performance indicators of an optimization method and an optimization system according to an embodiment of the present specification.

The embodiments described in this specification are intended to clearly explain the spirit of the present invention to those of ordinary skill in the art to which the present invention pertains. The present invention is therefore not limited to the embodiments described in this specification, and the scope of the present invention should be construed to include modifications, variations, equivalents, and substitutes that do not depart from the spirit of the present invention. In describing the present invention, detailed descriptions of well-known structures or functions may be omitted.

The terms used in this specification are, as far as possible, general terms that are currently in wide use, selected in consideration of their function in the present invention, but they may vary depending on the intention of those of ordinary skill in the art, custom, or the emergence of new technology. Where a specific term is defined and used with an arbitrary meaning, however, the meaning of that term will be described separately. The terms used in this specification should therefore be interpreted based on their actual meaning and the overall content of this specification, not merely on their names.

The drawings attached to this specification are intended to make the present invention easy to explain, and the shapes shown in the drawings may be exaggerated as necessary to aid understanding, so the present invention is not limited by the drawings. In addition, ordinal numbers used in the description of this specification (for example, first, second, and so on) are merely identifiers that distinguish one component from another.

Here, the optimizing step may include adjusting at least one of the size and length of the first auxiliary neural network.

Here, the first main neural network includes a first main layer, which is a layer before the first branch point, and a second main layer, which is a layer after the first branch point, and at least one of the size and length of the first auxiliary neural network may be adjusted in consideration of at least one of the computation time of the main GPU for the first main layer, the computation time of the main GPU for the second main layer, and the computation time of the auxiliary GPU for the first auxiliary neural network.

Here, when the computation time of the main GPU for the second main layer is longer than the computation time of the auxiliary GPU for the first auxiliary neural network, at least one of the size and length of the first auxiliary neural network may be increased, and when the computation time of the main GPU for the second main layer is shorter than the computation time of the auxiliary GPU for the first auxiliary neural network, at least one of the size and length of the first auxiliary neural network may be decreased.
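The balancing rule in the preceding paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in the patent: the function name `adjust_auxiliary_length`, the `tolerance` dead band, and the one-layer `step` are hypothetical; the source specifies only the direction of the adjustment.

```python
def adjust_auxiliary_length(post_branch_time: float,
                            auxiliary_time: float,
                            current_length: int,
                            tolerance: float = 0.0,
                            step: int = 1) -> int:
    """Return a new auxiliary-network length, counted in layers.

    post_branch_time: main-GPU time for the layers after the branch point.
    auxiliary_time:   auxiliary-GPU time for the auxiliary network.
    tolerance, step:  hypothetical knobs that do not come from the source.
    """
    gap = post_branch_time - auxiliary_time
    if gap > tolerance:
        # The main GPU takes longer, so the auxiliary GPU has slack:
        # grow the auxiliary network.
        return current_length + step
    if gap < -tolerance:
        # The auxiliary GPU takes longer: shrink the auxiliary network,
        # but keep at least one layer.
        return max(1, current_length - step)
    # The two times already match within the tolerance.
    return current_length
```

In practice the two times would come from the profiling step, and the rule would be applied repeatedly until the lengths stop changing.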

Here, the optimizing step may include adjusting at least one of the size and length of the first auxiliary neural network so that the maximum memory of the auxiliary GPU is not exceeded.
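As a rough sketch of the memory constraint, the helper below finds the largest auxiliary-network length whose footprint still fits on the auxiliary GPU. The name `max_layers_within_memory` is hypothetical, and the footprint model (a plain sum of per-layer byte counts, ignoring activations and framework overhead) is a simplifying assumption rather than anything stated in the source.

```python
from typing import Sequence


def max_layers_within_memory(layer_bytes: Sequence[int], memory_limit: int) -> int:
    """Return how many leading layers fit within memory_limit bytes.

    layer_bytes:  assumed per-layer memory footprint, in layer order.
    memory_limit: maximum memory available on the auxiliary GPU.
    """
    used = 0
    count = 0
    for size in layer_bytes:
        if used + size > memory_limit:
            break  # adding this layer would exceed the GPU memory limit
        used += size
        count += 1
    return count
```

A length chosen by a timing-based rule could then be clamped to this value so that the memory limit always takes precedence.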

Here, the first auxiliary neural network includes a first auxiliary layer corresponding to the layer before the first branch point of the first main neural network and a second auxiliary layer corresponding to the layer after the first branch point of the first main neural network, and the optimizing step may include adjusting at least one of the size and length of at least one of the first auxiliary layer and the second auxiliary layer so that the maximum memory of the auxiliary GPU is not exceeded.

Here, the first auxiliary neural network includes a 1-1 auxiliary neural network and a 1-2 auxiliary neural network, the auxiliary GPU includes a first auxiliary GPU that processes the 1-1 auxiliary neural network and a second auxiliary GPU that processes the 1-2 auxiliary neural network, and the computation of at least one of the 1-1 auxiliary neural network and the 1-2 auxiliary neural network may be stopped based on the computation result of the main GPU for the first main neural network.

Here, some of the parameters of the first auxiliary neural network may be loaded at a time corresponding to the start of the auxiliary GPU's computation for the first auxiliary neural network, and the rest of the parameters may be loaded at a time corresponding to the end of the main GPU's computation for the first main neural network.

Here, the GPU information may include at least one of information about GPU memory size, information about data transfer speed between GPUs, and information about the per-layer processing speed of the neural network on a GPU.

Here, the difference between the time at which the main GPU starts computing the layer immediately after the first branch point of the first main neural network and the time at which the auxiliary GPU starts computing the first auxiliary neural network may be less than a predetermined time.

Here, the optimization method may further include: converting the neural network into a second tree-structured neural network, which includes a second main neural network processed on the main GPU and a second auxiliary neural network processed on the auxiliary GPU, by branching the neural network at a second branch point that is one of the points between the plurality of layers and differs from the first branch point; and optimizing the second tree-structured neural network based on the base information.

Here, the optimization method may further include selecting an optimal tree-structured neural network from among the first tree-structured neural network and the second tree-structured neural network in consideration of at least one of the accuracy and performance of the first tree-structured neural network and the second tree-structured neural network.
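One way to picture the selection step is sketched below. The `Candidate` record, the throughput floor, and the "most accurate, then fastest" ordering are assumptions made for illustration; the source states only that accuracy and performance are taken into account.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Candidate:
    """Hypothetical summary of one trained tree-structured network."""
    name: str
    accuracy: float    # fraction of correct predictions, higher is better
    throughput: float  # samples per second, higher is better


def select_best(candidates: Iterable[Candidate],
                min_throughput: float = 0.0) -> Candidate:
    """Pick the most accurate candidate that meets a throughput floor.

    Ties on accuracy are broken by higher throughput. Both the floor and
    the ordering are assumed policies, not taken from the source.
    """
    eligible = [c for c in candidates if c.throughput >= min_throughput]
    if not eligible:
        raise ValueError("no candidate meets the throughput requirement")
    return max(eligible, key=lambda c: (c.accuracy, c.throughput))
```

A different priority order over the indicators would only change the key function.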

The neural network optimization method and optimization system for a multi-GPU system according to an embodiment of the present specification can convert an existing neural network into a tree structure to improve the inference accuracy of the neural network while performing optimization that takes the multi-GPU system into account.

In this specification, the structure of a tree-structured neural network can be divided into a main neural network and auxiliary neural networks. The main neural network can perform clustering, that is, decide which auxiliary neural network should process the input data. An auxiliary neural network may be a neural network trained specifically for a subset (cluster) of the labels that are the inference results. Based on the clustering result from the main neural network, an auxiliary neural network can receive the input data or an intermediate result of the main neural network and perform the final labeling. The computation of an auxiliary neural network can be performed by branching from a specific layer of the main neural network. In this specification, the point at which this branching occurs is referred to as a branch point. In addition, in a multi-GPU system, the GPU that processes the computation of the main neural network is referred to as the main GPU, and the GPU that processes the computation of an auxiliary neural network is referred to as the auxiliary GPU. The main GPU and the auxiliary GPU may each be any GPU in the system and may each consist of multiple GPUs.
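The division of labor between the main and auxiliary networks can be sketched in plain Python as follows. All names here (`tree_inference`, `pre_branch`, `post_branch`, `auxiliaries`) are hypothetical, and the sketch runs sequentially on one device purely to show the routing logic; it does not model the concurrent execution on separate GPUs that the specification describes.

```python
from typing import Any, Callable, Mapping


def tree_inference(x: Any,
                   pre_branch: Callable[[Any], Any],
                   post_branch: Callable[[Any], Any],
                   auxiliaries: Mapping[Any, Callable[[Any], Any]]) -> Any:
    """Route one input through a tree-structured network.

    pre_branch:  main-network layers before the branch point.
    post_branch: main-network layers after the branch point; returns a cluster id.
    auxiliaries: one auxiliary network per cluster id.
    """
    features = pre_branch(x)         # intermediate result at the branch point
    cluster = post_branch(features)  # main network picks the cluster
    # The auxiliary network trained for that cluster does the final labeling.
    return auxiliaries[cluster](features)
```

In the multi-GPU setting, the auxiliary networks would start from `features` at the same time as `post_branch`, and the cluster id would decide which auxiliary result is kept.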

The neural network optimization method for a multi-GPU system according to an embodiment of the present specification may be performed by an optimization system. Figure 1 is a diagram of an optimization system according to an embodiment of the present specification. Figure 2 is a diagram of an optimization method according to an embodiment of the present specification, showing the flow when the optimization method is applied to a NIN5 neural network.

Referring to Figure 1, the system may include a control module 100 and a memory 200.

The control module 100 can process computations related to neural networks. The control module 100 may be implemented in software, hardware, or a combination thereof. For example, in hardware, the control module 100 may be implemented as a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a semiconductor chip, or various other types of electronic circuits. For convenience of explanation, however, the following description mainly concerns GPUs. As another example, in software, the control module 100 may be implemented as a logic program executed by the hardware described above, or in various computer languages.

The memory 200 can store various data. For example, the memory 200 may store a neural network to be processed by the control module 100. The memory 200 may include flash memory, a hard disk, random access memory (RAM), read-only memory (ROM), magnetic memory, a magnetic disk, an optical disk, and the like.

Referring to Figures 1 and 2, the control module 100 may include a profiler 110, a conversion unit 120, an optimization unit 130, and a training and evaluation unit 140.

According to one embodiment, the profiler 110 may collect base information for optimizing the neural network (profiling). Here, the base information is information about the neural network and the system, and may include GPU information of the multi-GPU system and neural network information regarding the structure of the neural network. As a non-limiting example, the GPU information may include at least one of information about GPU memory size, information about data transfer speed between GPUs, and information about the per-layer processing speed of the neural network on a GPU. Because the computation speed of a neural network does not depend on the input data, the data needed for profiling can be obtained quickly using arbitrary weights and input values, without a time-consuming training process.
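A minimal sketch of per-layer profiling is shown below. It assumes each layer is an ordinary Python callable and measures wall-clock time with `time.perf_counter`; the `BaseInfo` container and its field names are hypothetical stand-ins for the base information described above.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Sequence


@dataclass
class BaseInfo:
    """Hypothetical container for the profiled base information."""
    gpu_memory_bytes: Dict[str, int] = field(default_factory=dict)
    transfer_bytes_per_s: Dict[str, float] = field(default_factory=dict)
    layer_seconds: List[float] = field(default_factory=list)


def profile_layers(layers: Sequence[Callable[[Any], Any]],
                   sample_input: Any,
                   repeats: int = 3) -> List[float]:
    """Return the mean wall-clock seconds per layer.

    Layers run in order, each one fed the previous layer's output. The
    sample input can be arbitrary, as can the weights, because only the
    timing matters here.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    timings = []
    x = sample_input
    for layer in layers:
        start = time.perf_counter()
        for _ in range(repeats):
            y = layer(x)
        timings.append((time.perf_counter() - start) / repeats)
        x = y  # feed this layer's output to the next layer
    return timings
```

On a real GPU, kernel launches are typically asynchronous, so a device synchronization would be needed before each timer read; that step is omitted from this CPU-only sketch.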

According to one embodiment, the conversion unit 120 may convert a neural network into a tree-structured neural network. For example, the conversion unit 120 may branch the neural network at a first branch point to convert it into a first tree-structured neural network, and branch it at a second branch point to convert it into a second tree-structured neural network.

According to one embodiment, the optimization unit 130 may optimize the tree-structured neural network. For example, the optimization unit 130 may perform optimization tailored to two-stage pipelining (2-stage pipelining) based on the base information. Optimizing the tree-structured neural network can improve the accuracy of the neural network and maximize the utilization of the GPUs in the system. The optimization of the tree-structured neural network is described in more detail later.

According to one embodiment, the training and evaluation unit 140 may train the optimized tree-structured neural network. Here, the training and evaluation unit 140 may perform the training using parameter reuse. According to one embodiment, the training and evaluation unit 140 may compare the performance indicators of the trained tree-structured neural networks. Here, the training and evaluation unit 140 may select the optimal neural network that meets the needs of developers and users.

The configuration of the control module described above is only an example, and a control module may be provided in which some of the components described above are omitted or other components are added. In addition, the components of the control module do not necessarily correspond to separate physical components. Furthermore, although Figure 2 shows the optimization method applied to a NIN5 neural network, the neural networks to which the optimization method according to the embodiments of the present specification can be applied are not limited to this, and the method may be applied to various other neural networks.

Figure 3 is a table of the performance indicators considered by the optimization method and optimization system according to an embodiment of the present specification, showing various performance indicators with assigned priorities. The optimization method and optimization system according to an embodiment of the present specification may include a plurality of steps for measuring and improving the values of these performance indicators for an existing neural network. Each step of the optimization method is described below.

Figure 4 is a flowchart of an optimization method according to an embodiment of the present specification. Referring to Figure 4, the optimization method may include obtaining a neural network (S1100), collecting base information for optimizing the neural network (S1200), branching the neural network to convert it into a tree-structured neural network (S1300), and optimizing the tree-structured neural network (S1400).

According to one embodiment, obtaining the neural network (S1100) may include obtaining a neural network that includes a plurality of layers. Here, the plurality of layers may include layers connected to one another in series.

According to one embodiment, branching the neural network to convert it into a tree-structured neural network (S1300) may include branching the neural network at a branch point, which is one of the points between the plurality of layers, thereby converting the neural network into a tree-structured neural network that includes a main neural network and an auxiliary neural network.

FIG. 5 is a diagram relating to the conversion of a neural network according to an embodiment of the present specification. Referring to FIG. 5, the main neural network processed on the main GPU may include a pre-branch, which consists of the layers before the branch point, and a post-branch, which consists of the layers after the branch point. The plurality of auxiliary neural networks processed on the auxiliary GPU are denoted X, Y, and Z.

According to one embodiment, the difference between the time at which the main GPU starts computing the post-branch and the time at which the auxiliary GPU starts computing the auxiliary neural networks may be less than a predetermined time. For example, referring to FIG. 5, the post-branch and the auxiliary neural networks X, Y, and Z may begin to be processed (substantially) simultaneously.

According to one embodiment, at least some of the plurality of auxiliary neural networks may be stopped before their computation is completed. Referring to FIG. 5, among the auxiliary neural networks X, Y, and Z, the auxiliary neural networks Y and Z may be stopped before their computation is completed, and only the auxiliary neural network X may complete its computation. Here, the auxiliary neural networks that are stopped before completion may be determined based on the computation result of the main neural network. For example, an auxiliary neural network that, according to the computation result of the main neural network, does not need to be computed may be stopped after that result has been produced.
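The early-stopping behaviour described above can be illustrated with a small step-by-step simulation. This is only a sketch under stated assumptions: the function name `run_auxiliary`, the notion of a discrete "step" (one layer per step for every auxiliary network), and the `main_done_step` parameter are all hypothetical and are not taken from the specification.

```python
def run_auxiliary(aux_depths, main_done_step, selected):
    """Simulate layer-by-layer execution of auxiliary networks.

    aux_depths:     name -> number of layers in that auxiliary network
    main_done_step: step after which the main network's result is known
    selected:       the auxiliary network that the main result calls for
    Returns name -> number of layers actually executed.

    Assumption: every active auxiliary network advances one layer per step.
    """
    executed = {name: 0 for name in aux_depths}
    # Networks with no layers have nothing to run.
    active = {name for name, depth in aux_depths.items() if depth > 0}
    step = 0
    while active:
        step += 1
        if step > main_done_step:
            # The main result is known: stop every network that is not needed.
            active &= {selected}
        for name in list(active):
            executed[name] += 1
            if executed[name] == aux_depths[name]:
                active.discard(name)  # this network has finished
    return executed
```

For instance, with three four-layer networks and a main result that arrives after step 2 and selects X, only X runs to completion, while Y and Z stop after two layers each.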

According to one embodiment, a stall may occur in a tree-structured neural network that has not been optimized. Referring to the upper part of FIG. 5, a stall may occur because the computation time of the auxiliary GPU for the auxiliary neural network X is longer than the computation time of the main GPU for the pre-branch. Alternatively, referring to the lower part of FIG. 5, a stall may occur because the computation time of the main GPU for the pre-branch is longer than the computation time of the auxiliary GPU for the auxiliary neural network X. That is, if the tree-structured neural network is not optimized, the computation times of the main neural network and the auxiliary neural network are unbalanced, which may cause a stall and thereby increase the total computation time.
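The imbalance described above can be expressed as a simple comparison of two overlapping computation times. The following is a minimal sketch, not the specification's own definition of a stall: `stall_report` and its dictionary keys are hypothetical, and which main-GPU segment is compared against the auxiliary network is left to the caller.

```python
def stall_report(main_time_ms, aux_time_ms):
    """Compare two overlapping computation times and report which GPU waits.

    Assumption: both computations start together, so the elapsed time is the
    longer of the two and the stall is their difference.
    """
    if main_time_ms > aux_time_ms:
        waiting = "auxiliary GPU"   # auxiliary work finishes first and idles
    elif aux_time_ms > main_time_ms:
        waiting = "main GPU"        # main work finishes first and idles
    else:
        waiting = None              # balanced: no stall
    return {
        "stall_ms": abs(main_time_ms - aux_time_ms),
        "waiting": waiting,
        "elapsed_ms": max(main_time_ms, aux_time_ms),
    }
```

In this model a balanced pair of times yields a stall of zero, which is the state the optimization step aims for.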

According to one embodiment, the neural network may be branched at each of a plurality of branch points and converted into different tree-structured neural networks. FIG. 6 is a diagram of tree-structured neural networks branched at different points according to an embodiment of the present specification. For example, referring to the upper part of FIG. 6, the neural network may be branched at a first branch point and converted into a first tree-structured neural network, and referring to the lower part of FIG. 6, the neural network may be branched at a second branch point and converted into a second tree-structured neural network.
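Branching a serial network at different points can be sketched as splitting a list of layers. The names `TreeNetwork`, `branch`, and `candidate_trees` are hypothetical, and the choice to start each auxiliary network as a copy of the post-branch structure is an assumption made only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TreeNetwork:
    pre_branch: list    # main-network layers before the branch point
    post_branch: list   # main-network layers after the branch point
    auxiliary: dict = field(default_factory=dict)  # name -> auxiliary layer list


def branch(layers, branch_point, aux_names=("X", "Y", "Z")):
    """Split a serial layer list at `branch_point` (an index between two layers)."""
    if not 0 < branch_point < len(layers):
        raise ValueError("branch point must lie between two layers")
    pre, post = layers[:branch_point], layers[branch_point:]
    # Assumption: each auxiliary network initially mirrors the post-branch.
    aux = {name: list(post) for name in aux_names}
    return TreeNetwork(pre, post, aux)


def candidate_trees(layers):
    """Build one tree-structured network per possible branch point."""
    return {bp: branch(layers, bp) for bp in range(1, len(layers))}
```

A five-layer serial network has four points between layers, so `candidate_trees` returns four candidates for it.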

According to one embodiment, optimizing the tree-structured neural network (S1400) may include optimizing the tree-structured neural network based on the base information. Here, optimizing the tree-structured neural network may include adjusting at least one of the size and the length of the auxiliary neural network. At least one of the size and the length of the auxiliary neural network may be adjusted in consideration of at least one of the computation time of the main GPU for the pre-branch, the computation time of the main GPU for the post-branch, and the computation time of the auxiliary GPU for the auxiliary neural network. Alternatively, optimizing the tree-structured neural network may include reducing or eliminating stalls. For example, optimizing the tree-structured neural network may include adjusting at least one of the size and the length of the auxiliary neural network to reduce or eliminate stalls. FIG. 7 is a diagram of the optimization step according to an embodiment of the present specification. Referring to FIG. 7, an optimized tree-structured neural network (lower part of FIG. 7) may be generated by increasing at least one of the size and the length of the auxiliary neural network of the tree-structured neural network before optimization (upper part of FIG. 7). Of course, depending on the case, an optimized tree-structured neural network may instead be generated by decreasing at least one of the size and the length of the auxiliary neural network.
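One way to picture the length adjustment is to choose the auxiliary depth whose estimated time best fills the main GPU's time without exceeding it. This is a simplified sketch, not the specification's procedure: `fit_auxiliary_depth` is a hypothetical name, and it assumes that every auxiliary layer costs the same time, so that computation time grows linearly with depth.

```python
def fit_auxiliary_depth(main_time_ms, aux_layer_time_ms, max_depth):
    """Pick the largest auxiliary depth whose estimated time fits in main_time_ms.

    Assumption: auxiliary time = depth * aux_layer_time_ms (uniform layers).
    The result is clamped to the range [1, max_depth].
    """
    if aux_layer_time_ms <= 0:
        raise ValueError("per-layer time must be positive")
    depth = int(main_time_ms // aux_layer_time_ms)
    return max(1, min(depth, max_depth))
```

If the main GPU's time is long relative to the auxiliary network, the depth grows; if it is short, the depth shrinks, which mirrors the increase-or-decrease adjustment described above.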

According to one embodiment, optimizing the tree-structured neural network (S1400) may include limiting the maximum size of the auxiliary neural networks. For example, the sum of the parameters of the auxiliary neural networks loaded onto the auxiliary GPU may be limited so as not to exceed the memory of the auxiliary GPU. Specifically, referring to FIG. 7, the sizes of the auxiliary neural networks X, Y, and Z may be adjusted so that the sum of the parameters of the auxiliary neural networks X, Y, and Z at the time the post-branch after the branch point is computed, and the sum of the parameters of the auxiliary neural network X at the time the pre-branch that follows the post-branch is computed, do not exceed the maximum memory of the auxiliary GPU. Although FIG. 7 shows the auxiliary neural network X as the one whose computation is not stopped, the computation of the auxiliary neural network Y or Z may instead be the one that is not stopped. Accordingly, the sizes of the auxiliary neural networks X, Y, and Z may be adjusted so that the sum of the parameters of the auxiliary neural networks X, Y, and Z at the time the post-branch after the branch point is computed, and the sum of the parameters of the auxiliary neural network Y or Z at the time the pre-branch that follows the post-branch is computed, do not exceed the maximum memory of the auxiliary GPU. This can reduce the memory overhead on the auxiliary GPU that may arise from computing several auxiliary neural networks in parallel.
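The memory limit can be sketched as a check on the total parameter footprint, followed by a resize when the total is too large. The helper names are hypothetical, and scaling every auxiliary network by one common factor is an assumption chosen for simplicity; the specification only requires that the sizes be adjusted so the limit is not exceeded.

```python
def fits_in_auxiliary_memory(param_bytes, gpu_memory_bytes):
    """True if all auxiliary networks loaded together fit on the auxiliary GPU."""
    return sum(param_bytes.values()) <= gpu_memory_bytes


def shrink_to_fit(param_bytes, gpu_memory_bytes):
    """Scale all auxiliary networks by a common factor so their total fits.

    Assumption: sizes can be reduced proportionally. Rounding down guarantees
    that the resulting total never exceeds gpu_memory_bytes.
    """
    total = sum(param_bytes.values())
    if total <= gpu_memory_bytes:
        return dict(param_bytes)  # already fits, nothing to change
    scale = gpu_memory_bytes / total
    return {name: int(size * scale) for name, size in param_bytes.items()}
```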

According to one embodiment, optimizing the tree-structured neural network (S1400) may include arranging for some of the parameters of an auxiliary neural network to be loaded when the auxiliary GPU starts computing the auxiliary neural network and for the remaining parameters to be loaded after the auxiliary neural network has been specified. Here, the time at which the auxiliary neural network is specified may be the time at which the main GPU finishes computing the main neural network. This can reduce the burden on the memory of the auxiliary GPU. In addition, it can be ensured that the parameters of the auxiliary neural networks loaded onto the auxiliary GPU after the branch point are transferred in an amount sufficient for the computation that can be performed before the main GPU starts computing the next data, that is, before the next pipeline stage.
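The memory effect of loading parameters in two phases can be estimated with a short calculation. This sketch rests on assumptions that the specification does not state: the same fraction of every candidate network is preloaded, and the partially loaded parameters of unneeded networks are released once the required network is specified. `peak_memory` and `preload_fraction` are hypothetical names.

```python
def peak_memory(param_bytes, preload_fraction, selected):
    """Estimate peak parameter memory on the auxiliary GPU.

    eager:    every auxiliary network is fully loaded up front.
    deferred: only `preload_fraction` of each network is loaded at first;
              after `selected` is specified, the others are released and the
              selected network is loaded in full (assumption).
    """
    eager_peak = sum(param_bytes.values())
    phase1 = sum(int(size * preload_fraction) for size in param_bytes.values())
    phase2 = param_bytes[selected]
    return {"eager": eager_peak, "deferred": max(phase1, phase2)}
```

Under these assumptions, the deferred peak is bounded by the larger of the preloaded total and the one fully loaded network, rather than by the sum of all networks.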

According to one embodiment, optimizing the tree-structured neural network (S1400) may include building the structure of some of the initial layers of the auxiliary neural network to be identical to the structure of the intermediate layers of the main neural network. This exploits the fact that the main neural network and the auxiliary neural network perform similar inference on the same input data, and it can be used in the training and evaluation step described below to help improve accuracy.

The optimization method according to an embodiment of the present specification may further include a training and evaluation step. Here, the training and evaluation step may include training the tree-structured neural network whose structure has been determined through optimization for each branch point. The training and evaluation step may also include checking the accuracy and performance of the tree-structured neural network, and selecting a tree-structured neural network that strikes a balance among the various performance indicators. The training and evaluation step may further include reusing some of the trained parameters of the main neural network when training the auxiliary neural network, in order to improve the final accuracy of the tree-structured neural network. This makes it possible to obtain an auxiliary neural network with higher accuracy than one trained from random initialization.
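Reusing trained main-network parameters when initializing an auxiliary network can be sketched with plain Python dictionaries. The function name, the flat-list representation of a layer's parameters, and the initialization range are all hypothetical; an actual implementation would use the tensors of the training framework.

```python
import random


def init_auxiliary_params(main_params, aux_layer_sizes, shared_layers, seed=0):
    """Initialize auxiliary parameters, copying layers shared with the main network.

    main_params:     layer name -> list of trained parameter values
    aux_layer_sizes: layer name -> number of parameters in the auxiliary layer
    shared_layers:   names of auxiliary layers built with the same structure
                     as a main-network layer
    Layers that are not shared, or whose size does not match, are initialized
    randomly (assumption: uniform values in [-0.1, 0.1]).
    """
    rng = random.Random(seed)
    aux_params = {}
    for layer, size in aux_layer_sizes.items():
        trained = main_params.get(layer)
        if layer in shared_layers and trained is not None and len(trained) == size:
            aux_params[layer] = list(trained)  # reuse trained values (copy)
        else:
            aux_params[layer] = [rng.uniform(-0.1, 0.1) for _ in range(size)]
    return aux_params
```

Copying rather than aliasing the lists means that later training of the auxiliary network does not modify the main network's parameters.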

The optimization method and optimization system according to the embodiments of the present specification are described below by way of an experimental example. To demonstrate the method according to the embodiments of the present specification, experiments were conducted on a Network-in-Network (NIN) neural network that classifies the CIFAR-10 image data set. As proposed in an existing study on tree-structured neural networks (Deep Decision Network for Multi-Class Image Classification, hereinafter DDN), the classes were divided into the following three clusters for the tree-structured neural network: {airplane, ship}, {automobile, truck}, and {frog, bird, deer, dog, cat, horse}.
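The three clusters listed above can be written down as a lookup table that maps a CIFAR-10 class to its cluster. The cluster keys (`cluster_0` and so on) and the helper `cluster_of` are hypothetical names used only for this sketch; the class groupings themselves are the ones given in the text.

```python
CLUSTERS = {
    "cluster_0": {"airplane", "ship"},
    "cluster_1": {"automobile", "truck"},
    "cluster_2": {"frog", "bird", "deer", "dog", "cat", "horse"},
}


def cluster_of(label):
    """Return the key of the cluster that contains the given class label."""
    for name, members in CLUSTERS.items():
        if label in members:
            return name
    raise KeyError(f"unknown class label: {label}")
```

Together the three clusters cover all ten CIFAR-10 classes without overlap.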

FIG. 8 is a diagram of a neural network and performance indicators according to an embodiment of the present specification, showing the basic structure of the neural network used in the experiments and a relative comparison of several performance indicators. It can be seen from FIG. 8 that applying a tree-structured neural network without appropriate optimization raises accuracy but cannot avoid the performance degradation or throughput reduction caused by the increased number of parameters.

To show that the method according to the embodiments of the present specification can be optimized for various branch points, the number of layers of NIN was increased from the original three to five (hereinafter NIN5). The experiments were performed on a system with two GPUs, and training was carried out using the TensorFlow framework. FIG. 9 is a table of the hyperparameters of the optimization method and optimization system according to an embodiment of the present specification, showing information on the system and programs used in the experiments and the various hyperparameters used to train the neural networks.

FIG. 10 is a diagram showing how the neural networks differ in the performance indicators of the optimization method and optimization system according to an embodiment of the present specification. NIN5 denotes the existing neural network, DDN denotes a tree-structured neural network on which no optimization was performed, and NIN5-X denotes the tree-structured neural networks optimized for the respective branch points. NIN5-0 denotes branching after the first layer, and NIN5 has a total of four branch points. Black bars correspond to the case in which the optimization technique proposed in the present invention is not applied, and gray bars to the case in which it is applied. In the Parameter item, main_trx is the amount of memory required to load parameters onto the main GPU, sub_trx is the peak amount of memory required on the auxiliary GPU when optimization is performed, and param is the amount of memory additionally required on the auxiliary GPU when optimization is not performed.

If the optimization method according to the embodiments of the present specification is not used, accuracy improves, but problems arise at some branch points, such as reduced throughput, memory burden caused by inefficient use of multiple GPUs, an excessive increase in the number of parameters, and slower processing of a single data item. After the optimization method according to the embodiments of the present specification is applied, however, the following gains are obtained: increased throughput (NIN5-0, NIN5-1), reduced GPU memory burden (NIN5-1, NIN5-2, NIN5-3), a reduced total number of parameters (NIN5-0, NIN5-1), and increased processing speed (NIN5-0, NIN5-1). Because the optimization method according to the embodiments of the present specification is used to balance various performance indicators including accuracy, the preferred branch point may differ depending on whether the developer or user prioritizes other performance indicators over accuracy or, conversely, prioritizes accuracy above all other indicators. For example, referring to FIG. 10, NIN5-1 is suitable when one wishes to improve the inference performance indicators while accepting a slight loss of accuracy: compared with before optimization, accuracy dropped by only 0.02%, while throughput improved by 40%, the maximum memory usage of the auxiliary GPU decreased by 50%, the total number of parameters decreased by 25%, and the processing speed per data item improved by 15%.
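Choosing a branch point according to the user's priorities can be pictured as a weighted score over the performance indicators. This is a hedged sketch, not the selection rule of the specification: `select_candidate`, the metric names, and the weighting scheme are hypothetical, and the numbers in the usage note are made-up placeholders, not measurements from FIG. 10.

```python
def select_candidate(candidates, weights):
    """Return the name of the candidate with the highest weighted score.

    candidates: name -> {metric: relative change, where positive means better}
    weights:    metric -> importance assigned by the developer or user
    Metrics without a weight contribute nothing to the score.
    """
    def score(metrics):
        return sum(weights.get(metric, 0.0) * value
                   for metric, value in metrics.items())

    return max(candidates, key=lambda name: score(candidates[name]))
```

As a usage example with placeholder values, take `{"A": {"accuracy": 0.5, "throughput": -10}, "B": {"accuracy": -0.1, "throughput": 30}}`. With an accuracy weight of 100 and a throughput weight of 1, candidate A is selected; with equal weights, candidate B is selected. This reflects how the preferred branch point changes with the user's priorities.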

The optimization method and optimization system according to the embodiments of the present specification optimize an existing neural network so as to balance various performance factors such as accuracy, throughput per unit time, processing speed, and number of parameters, and can therefore be expected to improve overall performance compared with the existing neural network.

The optimization method and optimization system according to the embodiments of the present specification can be applied and used in various ways. Several application examples of the optimization method and optimization system according to the embodiments of the present specification are described below.

The optimization method and optimization system according to the embodiments of the present specification may be applied to a vehicle license plate recognition system. Here, the vehicle license plate recognition system may be a system that recognizes a vehicle's license plate in an image by performing vehicle detection, license plate detection, character detection, and character recognition using a DNN. The image may be an image captured by a camera.

The optimization method and optimization system according to the embodiments of the present specification may be applied to a mobile phone camera face recognition system. Here, the mobile phone camera face recognition system may be a system that uses a DNN to detect and recognize a person's face when a person is photographed with a mobile phone camera.

The optimization method and optimization system according to the embodiments of the present specification may be applied to an automotive camera object recognition system. Here, the automotive camera object recognition system may be a system that recognizes objects observed by the front and rear cameras of a vehicle and provides information helpful for driving.

Beyond these examples, attempts are being made to use artificial intelligence in various engineering fields such as mechanical engineering, computing, medicine, and chemistry. In particular, as artificial intelligence is increasingly used in fields that concern user safety as well as quality of service, more cases call for highly accurate neural networks, and the value of tree-structured neural networks, which are effective at improving accuracy, is rising accordingly. In addition, more systems are being built with multiple GPUs for fast inference on neural networks that grow complex and large in pursuit of high accuracy, and the optimization technique and optimization system proposed in this specification can be applied to such systems.

The method according to the embodiments of the present specification may be implemented in the form of program instructions executable by various computing means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiments, and vice versa.

Although the configuration and features of the present application have been described above with reference to embodiments, the present application is not limited thereto. It will be apparent to those skilled in the art to which the present application pertains that various changes or modifications can be made within the spirit and scope of the present application, and such changes or modifications therefore fall within the scope of the appended claims.

100: control module
110: profiler
120: conversion unit
130: optimization unit
140: training and evaluation unit
200: memory

Claims

A method of optimizing a neural network for a multi-GPU system including a main GPU and an auxiliary GPU, the method comprising:
obtaining the neural network including a plurality of layers, wherein the neural network includes a first main neural network;
a profiling step of collecting base information for optimization of the neural network, wherein the base information includes GPU information of the multi-GPU system and neural network information regarding the structure of the neural network;
branching the first main neural network processed on the main GPU at a first branch point, which is one of the points between the plurality of layers, and converting the neural network into a first tree-structured neural network that includes, after the first branch point, the first main neural network processed on the main GPU and a first auxiliary neural network processed on the auxiliary GPU;
clustering, by the first main neural network processed on the main GPU, input data for the neural network before the first branch point;
based on the clustering, computing first cluster data in the first main neural network processed on the main GPU after the first branch point and computing second cluster data in the first auxiliary neural network processed on the auxiliary GPU, wherein the first auxiliary neural network is trained specifically for the second cluster data;
optimizing the first tree-structured neural network by adjusting, based on the base information, at least one of a sum of parameters of the first auxiliary neural network and a computation time of the first auxiliary neural network; and
verifying the accuracy and performance of the first tree-structured neural network that has undergone the optimizing step, and training the first tree-structured neural network,
wherein the optimizing step reduces the total computation time or adjusts the first auxiliary neural network so that the sum of the parameters of the auxiliary neural network does not exceed the maximum memory, and
wherein the verifying and training step reuses some of the trained parameters of the first main neural network when training the first auxiliary neural network.

delete

The optimization method of claim 1, wherein
the first main neural network includes a first main layer, which is a layer before the first branch point, and a second main layer, which is a layer after the first branch point, and
the optimizing step performs the adjustment in consideration of at least one of the computation time of the main GPU for the first main layer, the computation time of the main GPU for the second main layer, and the computation time of the auxiliary GPU for the first auxiliary neural network.

The optimization method of claim 3, wherein
if the computation time of the main GPU for the second main layer is longer than the computation time of the auxiliary GPU for the first auxiliary neural network, at least one of the sum of the parameters of the first auxiliary neural network and the computation time of the first auxiliary neural network is increased, and
if the computation time of the main GPU for the second main layer is shorter than the computation time of the auxiliary GPU for the first auxiliary neural network, at least one of the sum of the parameters of the first auxiliary neural network and the computation time of the first auxiliary neural network is reduced.

delete

The optimization method of claim 1, wherein
the first auxiliary neural network includes a first auxiliary layer corresponding to a layer before the first branch point of the first main neural network and a second auxiliary layer corresponding to a layer after the first branch point of the first main neural network, and
the optimizing step includes adjusting at least one of the computation time and the sum of parameters of the first auxiliary layer and the second auxiliary layer so as not to exceed the maximum memory of the auxiliary GPU.

The optimization method of claim 1, wherein
the first auxiliary neural network includes a 1-1 auxiliary neural network and a 1-2 auxiliary neural network,
the auxiliary GPU includes a first auxiliary GPU for processing the 1-1 auxiliary neural network and a second auxiliary GPU for processing the 1-2 auxiliary neural network, and
the computation of at least one of the 1-1 auxiliary neural network and the 1-2 auxiliary neural network is stopped based on the computation result of the main GPU for the first main neural network.

The optimization method of claim 1, wherein
each parameter of the first auxiliary neural network is loaded at a time corresponding to the start of the auxiliary GPU's computation of the first auxiliary neural network or at a time corresponding to the end of the main GPU's computation of the first main neural network.

The optimization method of claim 1, wherein
the GPU information includes at least one of information about GPU memory size, information about data transfer speed between GPUs, and information about the processing speed of each layer of the neural network on a GPU.

The optimization method of claim 1, wherein
the difference between the time at which the main GPU starts computing the layer immediately after the first branch point of the first main neural network and the time at which the auxiliary GPU starts computing the first auxiliary neural network is less than a predetermined time.

The optimization method of claim 1, further comprising:
branching the neural network at a second branch point, which is one of the points between the plurality of layers and is different from the first branch point, and converting the neural network into a second tree-structured neural network including a second main neural network processed on the main GPU and a second auxiliary neural network processed on the auxiliary GPU; and
optimizing the second tree-structured neural network based on the base information.

The optimization method of claim 11, further comprising
selecting an optimal tree-structured neural network from among the first tree-structured neural network and the second tree-structured neural network in consideration of at least one of the accuracy and the performance of the first tree-structured neural network and the second tree-structured neural network.