KR102494945B1

KR102494945B1 - Apparatus and method for optimal split size decision in deep learning using multi-GPU and method for learning deep learning model using the same

Info

Publication number: KR102494945B1
Application number: KR1020200058647A
Authority: KR
Inventors: 이재환; 이명성; 최현성
Original assignee: 한국항공대학교산학협력단
Priority date: 2020-05-15
Filing date: 2020-05-15
Publication date: 2023-02-01
Also published as: KR20210141240A

Abstract

Disclosed are an apparatus and a method for determining an optimal split size when training a deep learning model using multi-GPU, and a method for training a deep learning model using the same. A method for determining an optimal split size when training a deep learning model using multi-GPU according to an embodiment of the present application may include: (a) calculating an initial split size based on the number of GPUs included in the multi-GPU and the memory size of the multi-GPU; (b) calculating an initial execution time required for a preset number of training iterations performed based on the initial split size; (c) obtaining, based on the initial split size, the initial execution time, and a relationship set among the (n-1)-th split size, the n-th split size, and the (n+1)-th split size, the n-th split size, an n-th execution time required for the preset number of training iterations performed based on the n-th split size, the (n+1)-th split size, and an (n+1)-th execution time required for the preset number of training iterations performed based on the (n+1)-th split size; and (d) determining the (n+1)-th split size as the optimal split size if the time difference between the n-th execution time and the (n+1)-th execution time is within a preset time difference.

Description

Apparatus and method for determining the optimal split size when training a deep learning model using multi-GPU, and method for training a deep learning model using the same

The present application relates to an apparatus and method for determining an optimal split size when training a deep learning model using multi-GPU, and to a method for training a deep learning model using the same. In particular, the present application relates to an automated solution that uses pipelining in a multi-GPU environment to determine an optimal split size with a view to improving deep learning performance.

To achieve high training accuracy in deep learning, a large model or a large data set can be used. However, as the model grows, limited GPU memory can make it difficult to perform deep learning on a single GPU. To overcome this limitation of a single GPU, parallel deep learning using multiple GPUs can be applied.

The parallel deep learning described above can be divided into model parallelism and data parallelism according to the parallelization method. Model parallelism refers to a method in which one large model is divided across multiple GPUs for training.

FIG. 1 is a conceptual diagram for explaining model parallelization. Referring to FIG. 1, model parallelization can be performed by dividing a model in various ways and assigning each part of the divided model to a GPU. The part of the model assigned to each GPU then depends on the output of the part of the model on another GPU. Each GPU in the multi-GPU is therefore idle until the data it depends on is delivered to it. Pipelining can be applied to overcome this limitation, in which the dependency between the divided model parts prevents the computing resources of the whole multi-GPU from being fully utilized.

FIG. 2 is a conceptual diagram for explaining pipelining. Referring to FIG. 2, the pipelining technique divides the training mini-batch into smaller units for training, and applying it reduces the idle time of each GPU compared with the existing model parallelization method. However, if the mini-batch is not divided into an appropriate size when pipelining is applied, the computing resources of all the GPUs cannot be fully utilized. Properly searching for the split size used to divide the training mini-batch is therefore an important factor when applying pipelining. Google's GPipe is an example that achieved efficient model parallelism through pipelining, but GPipe had the limitation that the size into which the training mini-batch is divided had to be set before training started.
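The effect of pipelining on idle time can be illustrated with a rough model. The sketch below is not taken from the patent: it assumes an idealized forward-only pipeline in which every GPU (stage) needs exactly one time slot per split, and the function name `pipeline_utilization` is hypothetical.

```python
def pipeline_utilization(num_gpus: int, num_splits: int) -> float:
    """Busy fraction of GPU time in an idealized forward-only pipeline.

    Assumption (not from the source): every stage takes one time slot
    per split, and communication cost is ignored.
    """
    if num_gpus < 1 or num_splits < 1:
        raise ValueError("num_gpus and num_splits must be positive")
    # The last split enters the first stage in slot num_splits and leaves
    # the final stage num_gpus - 1 slots later.
    total_slots = num_splits + num_gpus - 1
    # Each GPU is busy for num_splits of those slots.
    return num_splits / total_slots
```

Under these assumptions, four GPUs processing an undivided mini-batch are busy a quarter of the time, and the busy fraction rises as the mini-batch is divided into more splits. The model ignores per-split overhead and the fact that very small splits may underuse each GPU, which is why a suitable split size still has to be searched for.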

In addition, since it is inefficient for the user to search manually for the optimal split size while training is running in order to apply pipelining, a solution that can automatically determine the optimal split size is required for efficient model parallelization.

The background technology of the present application is disclosed in Korean Laid-Open Patent Publication No. 10-2019-0085444.

The present application is intended to solve the above-mentioned problems of the prior art. Its purpose is to provide an apparatus and method for determining an optimal split size when training a deep learning model using multi-GPU, which search for the split size that makes the fullest use of the computing power of each GPU when pipelining is applied in a multi-GPU environment, and a method for training a deep learning model using the same.

However, the technical problem to be achieved by the embodiments of the present application is not limited to the technical problems described above, and other technical problems may exist.

As a technical means for achieving the above technical problem, a method for determining an optimal split size when training a deep learning model using multi-GPU according to an embodiment of the present application may include: (a) calculating an initial split size based on the number of GPUs included in the multi-GPU and the memory size of the multi-GPU; (b) calculating an initial execution time required for a preset number of training iterations performed based on the initial split size; (c) obtaining, based on the initial split size, the initial execution time, and a relationship set among the (n-1)-th split size, the n-th split size, and the (n+1)-th split size, the n-th split size, an n-th execution time required for the preset number of training iterations performed based on the n-th split size, the (n+1)-th split size, and an (n+1)-th execution time required for the preset number of training iterations performed based on the (n+1)-th split size; and (d) determining the (n+1)-th split size as the optimal split size if the time difference between the n-th execution time and the (n+1)-th execution time is within a preset time difference.

In addition, the initial split size corresponds to n = 0 and is denoted S₀, the (n-1)-th split size is denoted S_{n-1}, the n-th split size is denoted S_n, and the (n+1)-th split size is denoted S_{n+1}. When n is 0, S_{n-1} may be 0.

In addition, steps (c) and (d) may be performed repeatedly while n is incremented by 1 starting from 0.

In addition, steps (c) and (d) are performed repeatedly while n is incremented by 1 starting from 0. When, in step (d), the time difference between the n-th execution time and the (n+1)-th execution time is within the preset time difference, the (n+1)-th split size is determined as the optimal split size and the repetition may end.

In addition, the S_n obtained in a given repetition of steps (c) and (d) may be the S_{n+1} of the previous repetition.

In addition, the n-th execution time obtained in a given repetition of steps (c) and (d) may be the (n+1)-th execution time of the previous repetition.
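The repetition of steps (c) and (d) can be sketched as a small loop. This is a minimal sketch under stated assumptions, not the patented implementation: `measure_time` stands in for running the preset number of training iterations and timing them, `next_size` stands in for the relationship among the three consecutive split sizes, the stopping test uses the absolute difference of two consecutive execution times, and `max_rounds` is a safety bound added here. All of these names are hypothetical.

```python
from typing import Callable


def find_optimal_split(
    s0: float,
    measure_time: Callable[[float], float],
    next_size: Callable[[float, float], float],
    tolerance: float,
    max_rounds: int = 50,
) -> float:
    """Search for a split size whose timing stops changing.

    s0           initial split size S_0 from step (a)
    measure_time returns the time for the preset number of iterations
    next_size    maps (S_{n-1}, S_n) to S_{n+1} (assumed interface)
    tolerance    preset time difference used in step (d)
    """
    s_prev, s_cur = 0.0, s0          # S_{-1} = 0 and S_0
    t_cur = measure_time(s_cur)      # step (b): initial execution time t_0
    for _ in range(max_rounds):
        s_next = next_size(s_prev, s_cur)    # step (c): S_{n+1}
        t_next = measure_time(s_next)        # step (c): t_{n+1}
        if abs(t_next - t_cur) <= tolerance:  # step (d)
            return s_next
        # The next repetition reuses S_{n+1} and t_{n+1} as S_n and t_n.
        s_prev, s_cur, t_cur = s_cur, s_next, t_next
    return s_cur  # fallback if the tolerance is never met (assumption)
```

Carrying `s_cur` and `t_cur` into the next round mirrors the statement that each repetition reuses the previous repetition's (n+1)-th values, so every split size is timed only once.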

In addition, in step (a), the initial split size may be calculated based on Equations 1 and 2 below.

In addition, the relationship set in step (c) may be one in which the (n-1)-th split size, the n-th split size, and the (n+1)-th split size satisfy S_{n+1} = (S_{n-1} + S_n) / 2.

In addition, in step (d), the (n+1)-th split size may be determined as the optimal split size if the n-th execution time and the (n+1)-th execution time satisfy the inequality of Equation 3 below.

Meanwhile, a method for training a deep learning model based on an optimal split size using multi-GPU according to an embodiment of the present application may include determining, based on the optimal split size determination method, an optimal split size to be used for training a given deep learning model, and training the deep learning model based on the determined optimal split size.

In addition, the training of the deep learning model may include: allocating, to a first GPU of the multi-GPU, a first split among a plurality of splits obtained by dividing a preset mini-batch based on the optimal split size; transmitting the result of the first GPU's processing of the first split to a second GPU of the multi-GPU; calculating, by the second GPU, a gradient and updating weights based on the processing result; and allocating a second split among the plurality of splits to the first GPU.

In addition, the step of calculating the gradient and updating the weights and the step of allocating the second split may be carried out on the first GPU and the second GPU and started in parallel within a preset time difference of each other.
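The splitting and the overlap between the two GPUs can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the last split is allowed to be shorter than the others (an assumption, since the source does not say how a remainder is handled), and the schedule models just two GPUs with one time step per split.

```python
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_mini_batch(mini_batch: Sequence[T], split_size: int) -> List[Sequence[T]]:
    """Divide a mini-batch into consecutive splits of at most split_size items."""
    if split_size < 1:
        raise ValueError("split_size must be positive")
    return [mini_batch[i:i + split_size]
            for i in range(0, len(mini_batch), split_size)]


def two_gpu_schedule(num_splits: int) -> List[Tuple[Optional[int], Optional[int]]]:
    """Per time step, the split index handled by (first GPU, second GPU).

    None means the GPU is idle in that step. While the second GPU works on
    the result for split k, the first GPU already receives split k + 1.
    """
    schedule = []
    for step in range(num_splits + 1):
        first = step if step < num_splits else None
        second = step - 1 if step >= 1 else None
        schedule.append((first, second))
    return schedule
```

For a ten-sample mini-batch and a split size of four, `split_mini_batch` yields three splits, and `two_gpu_schedule(3)` shows both GPUs busy in every step except the first and the last.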

Meanwhile, an apparatus for determining an optimal split size when training a deep learning model using multi-GPU according to an embodiment of the present application may include an initial split calculation unit and an optimal split search unit. The initial split calculation unit calculates an initial split size based on the number of GPUs included in the multi-GPU and the memory size of the multi-GPU, and calculates an initial execution time required for a preset number of training iterations performed based on the initial split size. The optimal split search unit obtains, based on the initial split size, the initial execution time, and a relationship set among the (n-1)-th split size, the n-th split size, and the (n+1)-th split size, the n-th split size, an n-th execution time required for the preset number of training iterations performed based on the n-th split size, the (n+1)-th split size, and an (n+1)-th execution time required for the preset number of training iterations performed based on the (n+1)-th split size, and determines the (n+1)-th split size as the optimal split size if the time difference between the n-th execution time and the (n+1)-th execution time is within a preset time difference.

In addition, the initial split size corresponds to n = 0 and may be denoted S₀, the (n-1)-th split size may be denoted S_{n-1}, the n-th split size may be denoted S_n, and the (n+1)-th split size may be denoted S_{n+1}.

In addition, the optimal split search unit may repeatedly perform the process of searching for the optimal split size while incrementing n by 1 starting from 0.

In addition, the S_n and the n-th execution time obtained while the optimal split search unit repeats the process may be, respectively, the S_{n+1} and the (n+1)-th execution time of the previous repetition.

The above-described means for solving the problem are merely exemplary and should not be construed as limiting the present application. In addition to the exemplary embodiments described above, additional embodiments may exist in the drawings and the detailed description of the invention.

According to the above-described means, it is possible to provide an apparatus and method for determining an optimal split size when training a deep learning model using multi-GPU, which search for the split size that makes the fullest use of the computing power of each GPU when pipelining is applied in a multi-GPU environment, and a method for training a deep learning model using the same.

According to the above-described means, the user no longer has to search manually for the optimal split size by measuring the training time for every selectable split size. The overhead of the search can also be reduced by determining the initial split size from the number of GPUs and the memory size in the multi-GPU environment and searching from that starting point.

However, the effects obtainable from the present application are not limited to those described above, and other effects may exist.

FIG. 1 is a conceptual diagram for explaining model parallelization.
FIG. 2 is a conceptual diagram for explaining pipelining.
FIG. 3 is a schematic configuration diagram of an apparatus for determining an optimal split size when training a deep learning model using multi-GPU according to an embodiment of the present application.
FIG. 4A is a graph from an experimental example associated with the operation of the apparatus for determining an optimal split size according to an embodiment of the present application, showing the change in throughput according to the split size.
FIG. 4B is a graph from an experimental example associated with the operation of the apparatus for determining an optimal split size according to an embodiment of the present application, comparing the change in throughput according to the mini-batch size when only model parallelization is applied and when the optimal split size determination technique of the present application is applied.
FIG. 5 is an operational flowchart of a method for determining an optimal split size when training a deep learning model using multi-GPU according to an embodiment of the present application.
FIG. 6 is an operational flowchart of a method for training a deep learning model based on an optimal split size using multi-GPU according to an embodiment of the present application.
FIG. 7 is a detailed operational flowchart of the step of training a deep learning model based on the determined optimal split size.

Hereinafter, embodiments of the present application are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present application belongs can readily practice them. However, the present application may be implemented in many different forms and is not limited to the embodiments described herein. In the drawings, parts irrelevant to the description are omitted in order to describe the present application clearly, and similar parts are given similar reference numerals throughout the specification.

Throughout this specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" or "indirectly connected" with another element in between.

Throughout this specification, when a member is said to be located "on", "above", "at the top of", "under", "below", or "at the bottom of" another member, this includes not only the case where the member is in contact with the other member but also the case where a further member is present between the two.

Throughout this specification, when a part is said to "include" a component, this means that it may further include other components, rather than excluding them, unless stated otherwise.

The present application relates to an apparatus and method for determining an optimal split size when training a deep learning model using multi-GPU, and to a method for training a deep learning model using the same. In particular, the present application relates to an automated solution that uses pipelining in a multi-GPU environment to determine an optimal split size with a view to improving deep learning performance.

FIG. 3 is a schematic configuration diagram of an apparatus for determining an optimal split size when training a deep learning model using multi-GPU according to an embodiment of the present application.

Referring to FIG. 3, the apparatus 100 for determining an optimal split size when training a deep learning model using multi-GPU (hereinafter, "optimal split size determination apparatus 100") may include an initial split calculation unit 110 and an optimal split search unit 120.

The initial split calculation unit 110 may calculate the initial split size based on the number of GPUs included in the multi-GPU and the memory size of the multi-GPU. In the following description of the embodiments, the initial split size corresponds to n = 0 and may be denoted S₀.

According to an embodiment of the present application, the initial split calculation unit 110 may determine the initial split size (S₀) by comparing the maximum training mini-batch size (s) that a GPU included in the multi-GPU can handle with the size of the input mini-batch (mini batch).

Specifically, when the input mini-batch is no larger than the maximum training mini-batch size (s) that can be processed in the memory of a single GPU included in the multi-GPU (in other words, if mini batch ≤ s), the initial split calculation unit 110 may determine the initial split size (S₀) as the input mini-batch size (mini batch) divided by the number of GPUs in the multi-GPU (S₀ = mini batch / N). Conversely, if the input mini-batch size exceeds the maximum training mini-batch size (s) that can be processed in the memory of a single GPU (in other words, if mini batch > s), the initial split calculation unit 110 may take s, the maximum training mini-batch size that a single GPU can process, as the initial split size (S₀ = s).
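The two cases above can be written as a short function. This sketch covers only the simple comparison rule of this paragraph, not Equations 1 and 2. The function and parameter names are hypothetical, and the use of integer division with a floor of one sample is an assumption, since the source does not say how non-integer results are handled.

```python
def initial_split_size(mini_batch: int, max_batch_per_gpu: int, num_gpus: int) -> int:
    """Initial split size S_0 from the mini-batch size and GPU limits.

    mini_batch         size of the input mini-batch
    max_batch_per_gpu  s, the largest mini-batch one GPU's memory can hold
    num_gpus           N, the number of GPUs in the multi-GPU
    """
    if mini_batch < 1 or max_batch_per_gpu < 1 or num_gpus < 1:
        raise ValueError("all arguments must be positive")
    if mini_batch <= max_batch_per_gpu:
        # Case mini batch <= s: divide the mini-batch evenly over the GPUs.
        # Integer division and the floor of 1 are assumptions of this sketch.
        return max(1, mini_batch // num_gpus)
    # Case mini batch > s: start from the single-GPU maximum.
    return max_batch_per_gpu
```

For example, with four GPUs and s = 128, a mini-batch of 64 samples gives an initial split size of 16, while a mini-batch of 256 samples gives 128.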

In addition, according to an embodiment of the present application, the initial split calculation unit 110 may calculate the initial split size (S₀) after subtracting a predetermined throughput from the total memory of the multi-GPU, considering that part of the computing capability of the multi-GPU is used in transmitting and receiving processing results between GPUs.

In this regard, the initial split calculation unit 110 may calculate the initial split size (S₀) based on Equations 1 and 2 below, taking into account the predetermined throughput that cannot be used for mini-batch processing because of the data transmission and reception process between GPUs.

[Equation 1]

[Equation 2]

Here, M may be the sum of the memory sizes of all GPUs included in the multi-GPU. In addition, s may be the maximum training mini-batch size that a GPU included in the multi-GPU can handle when the memory size of that GPU is m. In addition, N may be the number of GPUs included in the multi-GPU. The remaining term, whose symbol is an image in the source, may be the model size of the deep learning model to be generated (trained).

Specifically, if M is the sum of the memory sizes of all GPUs in the multi-GPU performing model parallelization, the size of the mini-batch that can be trained by applying model parallelization to the multi-GPU may be expressed in terms of M and a predetermined throughput α (the expression itself is an image in the source). Here, the predetermined throughput α may mean the data throughput required to transmit the processing result of one GPU to another GPU when model parallelization is applied in a multi-GPU environment. That is, when model parallelization is applied, the input/output dependency between GPUs entails a process of transmitting and receiving processing results between them, so not all memory resources of the multi-GPU can be used to process the mini-batch. The initial split calculation unit 110 takes this into account by subtracting the predetermined throughput (α) for inter-GPU data transfer from the total memory size of the multi-GPU when calculating the initial split size.

In this regard, the initial split calculation unit 110 may determine the initial split size (S₀) according to Equations 1 and 2 by dividing the size of the mini-batch that can be trained through model parallelization by the number of GPUs (N).

According to an embodiment of the present application, the initial split calculation unit 110 may calculate the initial split size (S₀) based on Equations 1 and 2 described above when the input mini-batch size (mini batch) exceeds the maximum training mini-batch size (s) that can be processed in the memory of a single GPU of the multi-GPU (mini batch > s), but is not limited thereto.

Once the initial split size (S₀) has been calculated by the initial split calculation unit 110 as described above, the process of searching for the optimal split size, described in detail below, may begin.

In addition, the initial split calculation unit 110 may calculate the initial execution time required for a preset number (i) of training iterations performed based on the calculated initial split size. In the following description of the embodiments, the initial execution time corresponds to n = 0 and may be denoted t₀.
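Measuring the execution time of i training iterations can be sketched with a wall-clock timer. This is an assumption-laden sketch: `train_step` is a hypothetical callable that runs one training iteration with the given split size, and the source does not specify how the time is measured.

```python
import time
from typing import Callable


def time_iterations(train_step: Callable[[int], None],
                    split_size: int,
                    iterations: int) -> float:
    """Wall-clock seconds taken by `iterations` calls of train_step(split_size)."""
    if iterations < 1:
        raise ValueError("iterations must be positive")
    start = time.perf_counter()
    for _ in range(iterations):
        train_step(split_size)
    return time.perf_counter() - start
```

On GPU runtimes that launch work asynchronously, `train_step` would need to block until the devices have finished; otherwise the measured time covers only the launch of the work.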

The optimal split search unit 120 may obtain the n-th split size based on the initial split size (S₀), the initial execution time (t₀), and the relationship set among the (n-1)-th split size, the n-th split size, and the (n+1)-th split size. In the following description of the embodiments, the (n-1)-th split size may be denoted S_{n-1}, the n-th split size S_n, and the (n+1)-th split size S_{n+1}. In addition, S_{n-1} for n = 0, that is S_{-1}, may be 0.

In addition, according to an embodiment of the present application, the optimal split search unit 120 repeatedly performs the process of searching for the optimal split size, and the n-th split size (S_n) that it obtains in the current repetition may be the (n+1)-th split size (S_{n+1}) of the previous repetition.

In addition, the optimal split search unit 120 may obtain the n-th execution time required for the preset number (i) of training iterations performed with the calculated n-th split size (S_n). In the following description of embodiments of the present application, the n-th execution time may be denoted t_n.

In addition, according to an embodiment of the present application, the optimal split search unit 120 repeatedly performs the process of searching for the optimal split size, and the n-th execution time (t_n) that it obtains in the current iteration may be the (n+1)-th execution time (t_{n+1}) of the previous iteration.

In addition, the optimal split search unit 120 may obtain the (n+1)-th split size (S_{n+1}) based on the initial split size (S₀), the initial execution time (t₀), and the relationship set among the (n-1)-th, n-th, and (n+1)-th split sizes.

According to an embodiment of the present application, the relationship set among the (n-1)-th split size (S_{n-1}), the n-th split size (S_n), and the (n+1)-th split size (S_{n+1}) that the optimal split search unit 120 considers may be a relationship satisfying

$$S_{n+1} = \frac{S_{n-1} + S_n}{2}$$

Specifically, in the first iteration (n = 0), the optimal split search unit 120 calculates S_{n+1}, that is S₁, as half of the initial split size (S₁ = S₀/2), because S_{-1} is 0 as described above. From the second iteration onward (n ≥ 1), it calculates S_{n+1} as the arithmetic mean of S_n and S_{n-1}, where the values of S_n and S_{n-1} used in the calculation may be those obtained in the previous iteration.
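As a rough illustration of the update rule just described, the following Python sketch lists the first few candidate split sizes. The function name `split_size_sequence` is hypothetical, and leaving the values as unrounded floating-point numbers is an assumption of this sketch; the description does not state how non-integer sizes are handled.

```python
from typing import List


def split_size_sequence(s0: float, count: int) -> List[float]:
    """Return [S_0, S_1, ..., S_{count-1}] where each new size is the
    arithmetic mean of the two preceding sizes.

    Hypothetical helper: S_{-1} is taken as 0, so S_1 = S_0 / 2.
    Values are left unrounded (an assumption of this sketch).
    """
    sizes: List[float] = []
    s_prev, s_cur = 0.0, float(s0)  # S_{-1} = 0 and S_0
    for _ in range(count):
        sizes.append(s_cur)
        # S_{n+1} = (S_{n-1} + S_n) / 2
        s_prev, s_cur = s_cur, (s_prev + s_cur) / 2
    return sizes


print(split_size_sequence(16, 6))  # [16.0, 8.0, 12.0, 10.0, 11.0, 10.5]
```

Under this rule the candidates oscillate with shrinking amplitude and approach 2·S₀/3, so the search narrows in on a region below the initial split size rather than scanning every size.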

In addition, the optimal split search unit 120 may obtain the (n+1)-th execution time required for the preset number (i) of training iterations performed with the calculated (n+1)-th split size (S_{n+1}). In the following description of embodiments of the present application, the (n+1)-th execution time may be denoted t_{n+1}.

In addition, if the time difference between the n-th execution time (t_n) and the (n+1)-th execution time (t_{n+1}) is within a preset time difference, the optimal split search unit 120 may determine the (n+1)-th split size (S_{n+1}) to be the optimal split size.

Specifically, according to an embodiment of the present application, if the n-th execution time (t_n) and the (n+1)-th execution time (t_{n+1}) satisfy the inequality of Equation 3 below, the optimal split search unit 120 may determine the (n+1)-th split size (S_{n+1}) calculated in that iteration to be the optimal split size.

[Equation 3]

$$|t_n - t_{n+1}| \le T$$

Here, t_n is the n-th execution time, t_{n+1} is the (n+1)-th execution time, and T is the preset time difference. T may be any threshold greater than 0.

Specifically, the optimal split search unit 120 takes two effects into account when a mini-batch is divided into smaller splits for training. If the splits are smaller than necessary, data transfer between GPUs increases and the computing capacity of the GPUs cannot be used efficiently. Conversely, if the splits are larger than necessary, the split size approaches the mini-batch size, and GPU idle time grows as in the case where pipelining is not applied. Considering both effects, the optimal split search unit 120 may operate to determine the optimal split size by varying the split size starting from the initial split size (S₀) and searching for the point at which the processing capacity of the multi-GPU is maximized (in other words, the point at which the time required for computation has decreased by a predetermined level or more).

In addition, the optimal split search unit 120 may repeat the above process of searching for the optimal split size while increasing n from 0 in steps of 1. That is, it repeats the search process for n = 0, 1, 2, and so on, and when the time difference between the n-th execution time (t_n) and the (n+1)-th execution time (t_{n+1}) falls within the preset time difference, it may determine the (n+1)-th split size (S_{n+1}) to be the optimal split size and end the repetition.
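The search loop described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: `find_optimal_split`, the `measure_time` callback (standing in for timing i training iterations at a given split size), and the `max_steps` safety bound are all hypothetical, and "within the preset time difference" is read here as an absolute difference not exceeding the threshold.

```python
from typing import Callable, Tuple


def find_optimal_split(
    s0: float,
    measure_time: Callable[[float], float],
    threshold: float,
    max_steps: int = 50,
) -> Tuple[float, int]:
    """Return (optimal split size, value of n at which the search stopped).

    measure_time is a hypothetical callback returning the time needed for
    the preset number of training iterations at the given split size.
    """
    s_prev, s_cur = 0.0, float(s0)   # S_{-1} = 0 and S_0
    t_cur = measure_time(s_cur)      # t_0
    for n in range(max_steps):       # max_steps is an assumed safety bound
        s_next = (s_prev + s_cur) / 2        # S_{n+1}
        t_next = measure_time(s_next)        # t_{n+1}
        if abs(t_cur - t_next) <= threshold:  # assumed stopping test
            return s_next, n
        # The current (n+1)-th values become the n-th values of the next pass.
        s_prev, s_cur, t_cur = s_cur, s_next, t_next
    raise RuntimeError("no split size met the threshold within max_steps")


def toy_time(split_size: float) -> float:
    """Synthetic timing curve with its minimum at split size 10 (illustration only)."""
    return 100.0 / split_size + split_size


print(find_optimal_split(16, toy_time, threshold=0.2))  # (12.0, 1)
```

In a real run `measure_time` would execute actual training iterations, so each pass of the loop also advances training; the synthetic `toy_time` curve exists only to make the sketch executable.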

The following describes the process by which a deep learning model training system (not shown), which includes the optimal split size determination apparatus 100 and the multi-GPU, trains a given deep learning model using the determined optimal split size.

The deep learning model training system (not shown) may obtain the optimal split size, determined by the optimal split size determination apparatus 100, that is used in training a given deep learning model. Specifically, as described above, the optimal split size determination apparatus 100 of the deep learning model training system (not shown) varies the split size by which the mini-batch is divided and tracks the resulting change in the time (execution time) required for the preset number (i) of training iterations. In this way it may search, during the initial stage of training the given deep learning model, for the optimal split size that maximizes the computing capacity of the multi-GPU.

In summary, the deep learning model training system (not shown) disclosed herein searches for the optimal split size during the initial training stage while the training that builds the deep learning model proceeds at the same time. Once the optimal split size has been determined, subsequent training is performed through pipelining, in which the mini-batch is divided according to the determined optimal split size and the resulting splits are allocated to the GPUs.

Specifically, the deep learning model training system (not shown) may divide a preset mini-batch into a plurality of splits based on the optimal split size determined by the optimal split size determination apparatus 100, and allocate a first split among them to a first GPU of the multi-GPU.
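Dividing a mini-batch into splits of a chosen size can be sketched as below. The helper name `split_mini_batch` is hypothetical, and letting the last split be smaller when the mini-batch size is not a multiple of the split size is an assumption of this sketch; the description does not say how a remainder is handled.

```python
from typing import List, Sequence, TypeVar

Item = TypeVar("Item")


def split_mini_batch(mini_batch: Sequence[Item], split_size: int) -> List[Sequence[Item]]:
    """Cut a mini-batch into consecutive splits of at most split_size items.

    Hypothetical helper; the final split may be shorter (an assumption).
    """
    if split_size <= 0:
        raise ValueError("split_size must be a positive integer")
    return [
        mini_batch[start:start + split_size]
        for start in range(0, len(mini_batch), split_size)
    ]


splits = split_mini_batch(list(range(16)), 3)
print([len(s) for s in splits])  # [3, 3, 3, 3, 3, 1]
```

The first element of the returned list plays the role of the first split that is allocated to the first GPU, and the remaining elements are allocated in order afterwards.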

In addition, the deep learning model training system (not shown) may transmit the result of the first GPU's processing of the first split to a second GPU of the multi-GPU.

In addition, the second GPU of the deep learning model training system (not shown) may calculate the gradient based on the received processing result and update the weights (in other words, the model parameters).

In addition, the deep learning model training system (not shown) may allocate a second split among the plurality of splits to the first GPU.

Here, the process in which the second GPU calculates the gradient and updates the weights and the process of allocating the second split to the first GPU are carried out on the second GPU and the first GPU, respectively, so the two processes may be started in parallel within a preset time difference of each other. In other words, once any GPU of the multi-GPU has finished processing its allocated split, it can begin processing the next split without waiting to receive output from another GPU, so idle time can be greatly reduced.

In addition, according to an embodiment of the present application, the first GPU may refer to the GPU corresponding to the input layer of the deep learning artificial neural network model, and the second GPU to the GPU corresponding to its output layer, although they are not limited thereto. For reference, depending on the implementation, the first GPU may be called the earlier GPU and the second GPU the later GPU.

In a conventional deep learning model training system based on model parallelism, pipelining is not applied. While the first GPU processes the data in its allocated mini-batch, the second GPU must wait for that processing to finish and is therefore idle. After the first GPU has processed the data and passed the result to the second GPU, the first GPU reads a new mini-batch only after the second GPU has finished processing the data, calculating the gradient, and so on, so the first GPU is idle in turn.

In contrast, the deep learning model training system (not shown) disclosed herein trains by dividing the mini-batch into smaller units called splits. After the first split is allocated to the first GPU and the first GPU has processed its data and passed the result to the second GPU, the first GPU does not become idle. Instead, it is promptly allocated the next split, the second split, and processes the data in it. Meanwhile, the second GPU can calculate the gradient and update the weights based on the processing result of the first split received from the first GPU. Training efficiency can thus be improved by using the computing capacity of all GPUs in the multi-GPU environment in parallel.
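The difference in idle time can be made concrete with a toy timing model. The sketch below is an illustration under strong assumptions rather than a model of the disclosed system: it considers only two stages, takes the per-split processing time on each GPU as a fixed hypothetical number, and ignores transfer cost and the backward pass. The function names are hypothetical.

```python
def sequential_time(num_splits: int, t_first: float, t_second: float) -> float:
    """Total time when the two GPUs never overlap (no pipelining)."""
    return num_splits * (t_first + t_second)


def pipelined_time(num_splits: int, t_first: float, t_second: float) -> float:
    """Total time when the first GPU starts the next split as soon as it has
    handed the previous one to the second GPU."""
    first_done = 0.0   # time at which the first GPU finishes its current split
    second_done = 0.0  # time at which the second GPU finishes its current split
    for _ in range(num_splits):
        first_done += t_first
        # The second GPU needs both the input and to be free itself.
        second_done = max(first_done, second_done) + t_second
    return second_done


print(sequential_time(4, 1.0, 1.0))  # 8.0
print(pipelined_time(4, 1.0, 1.0))   # 5.0
```

In this toy model the overlapped schedule leaves each GPU idle only while the pipeline fills and drains, which is the effect the paragraph above attributes to pipelining over splits.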

FIG. 4A is a graph from an experimental example associated with the operation of the apparatus for determining an optimal split size when training a deep learning model using multiple GPUs according to an embodiment of the present application, showing how throughput changes with the split size. FIG. 4B is a graph from the same experimental example, comparing how throughput changes with the mini-batch size when only model parallelism is applied and when the optimal split size determination technique of the present application is applied.

FIGS. 4A and 4B show the results of an experiment evaluating the performance associated with the operation of the apparatus for determining an optimal split size when training a deep learning model using multiple GPUs according to an embodiment of the present application. The experiment was run in a software environment of PyTorch 1.4.0, CUDA 10.0, and NVIDIA driver 418.56, on a multi-GPU environment with two GPUs (two NVIDIA GeForce GTX 1080 Ti cards), and used U-Net as the deep learning model to be trained. U-Net is a fully convolutional network for image segmentation; since it is well known to those skilled in the art, a detailed description is omitted.

FIG. 4A shows how training performance changes with the split size by which the mini-batch is divided. Referring to FIG. 4A, with a training mini-batch size of 16, throughput rises from 9.2 images/sec to 15.69 images/sec as the split size increases from 1 to 3, and falls again once the split size exceeds 3. Specifically, when the split size is increased to 14, throughput is 14.16 images/sec, approximately 10% lower than at a split size of 3 (equivalently, throughput at a split size of 3 is about 11% higher). That is, the optimal split size in this experiment may be determined, by way of example, to be 3. If training is performed with the mini-batch divided into splits smaller or larger than the determined optimal split size, the throughput achievable in the same amount of time decreases, so the execution time required for the preset number of training iterations can be expected to increase.

FIG. 4B shows how training performance changes with the training mini-batch size, comparing the case where model parallelism and pipelining are applied together using the optimal split size determination method of the present application with the case of conventional model parallelism. Referring to FIG. 4B, the optimal split size determination technique disclosed herein yields higher image throughput than the conventional model parallelism technique at every mini-batch size. In particular, at a mini-batch size of 16, the present application achieved 15.66 images/sec whereas conventional model parallelism achieved 13.91 images/sec, so applying pipelining increased image throughput by about 12%.
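The percentages quoted for FIGS. 4A and 4B can be checked with a short calculation. The helper `percent_change` is a hypothetical name introduced for this check; the throughput figures are the ones stated in the text. Note that a relative change depends on which value is taken as the baseline.

```python
def percent_change(baseline: float, value: float) -> float:
    """Relative change of value with respect to baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# FIG. 4A, mini-batch size 16: split size 3 (15.69) versus split size 14 (14.16).
print(round(percent_change(15.69, 14.16), 1))  # -9.8 (baseline: split size 3)
print(round(percent_change(14.16, 15.69), 1))  # 10.8 (baseline: split size 14)

# FIG. 4B, mini-batch size 16: conventional (13.91) versus pipelined (15.66).
print(round(percent_change(13.91, 15.66), 1))  # 12.6
```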

Furthermore, when pipelining is applied based on the optimal split size determination technique disclosed herein, not only image throughput but also the trainable mini-batch size can increase compared with conventional model parallelism. Specifically, in this experiment the maximum mini-batch size trainable on a single GeForce GTX 1080 Ti was 10, and the maximum trainable with conventional model parallelism on two GeForce GTX 1080 Ti cards was 16, an increase of 60%. With the training technique of the present application, which applies the optimal split size search and pipelining, the trainable mini-batch size was 20, twice that of a single GPU and 25% larger than with conventional model parallelism.
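For reference, the ratios behind these mini-batch figures follow directly from the three sizes stated above (10 for one GPU, 16 for conventional model parallelism, 20 for the disclosed technique):

```latex
\frac{16}{10} = 1.6 \;(+60\%), \qquad
\frac{20}{10} = 2 \;(\text{twice}), \qquad
\frac{20}{16} = 1.25 \;(+25\%)
```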

That is, when a deep learning model is trained with the optimal split size search and pipelining technique disclosed herein, the number of images actually processed on a single GPU at a time is the optimal split size, which is smaller than the input training mini-batch size. Each GPU therefore needs less memory for its computation (image processing), which is why a larger mini-batch can be trained than with the conventional model parallelism technique.

Taking the experimental examples of FIGS. 4A and 4B together, training a deep learning model with pipelining and the optimal split size calculated by the technique disclosed herein can increase image throughput by up to about 12% compared with the conventional model parallelism technique. The trainable mini-batch size also increases by 25%, reaching twice the mini-batch size that can be trained with a single GPU.

The operation flow of the present application is briefly reviewed below, based on the details described above.

FIG. 5 is an operational flowchart of a method for determining an optimal split size when training a deep learning model using multiple GPUs according to an embodiment of the present application.

The method shown in FIG. 5 for determining an optimal split size when training a deep learning model using multiple GPUs may be performed by the optimal split size determination apparatus 100 described above. Therefore, even where omitted below, the description given for the optimal split size determination apparatus 100 applies equally to the description of this method.

Referring to FIG. 5, in step S11 the initial split calculation unit 110 may (a) calculate the initial split size (S₀) based on the number of GPUs included in the multi-GPU and the memory size of the multi-GPU.

According to an embodiment of the present application, in step S11 the initial split calculation unit 110 may calculate the initial split size (S₀) based on Equations 1 and 2 described above.

Next, in step S12 the initial split calculation unit 110 may (b) calculate the initial execution time (t₀) required for the preset number of training iterations performed with the calculated initial split size (S₀).

Next, in step S13 the optimal split search unit 120 may (c) obtain, based on the initial split size (S₀), the initial execution time (t₀), and the relationship set among the (n-1)-th, n-th, and (n+1)-th split sizes, the following: the n-th split size (S_n); the n-th execution time (t_n) required for the preset number of training iterations performed with the n-th split size (S_n); the (n+1)-th split size (S_{n+1}); and the (n+1)-th execution time (t_{n+1}) required for the preset number of training iterations performed with the (n+1)-th split size (S_{n+1}).

In other words, in step S13 (step (c)) the optimal split search unit 120 may obtain S_n, t_n, S_{n+1}, and t_{n+1} based on S₀, t₀, and the preset relationship among S_{n-1}, S_n, and S_{n+1}.

In addition, according to an embodiment of the present application, the relationship set in step S13 (in other words, in step (c)) may be one in which the (n-1)-th, n-th, and (n+1)-th split sizes satisfy

$$S_{n+1} = \frac{S_{n-1} + S_n}{2}$$

Next, in step S14 the optimal split search unit 120 may (d) determine the (n+1)-th split size (S_{n+1}) to be the optimal split size if the time difference between the n-th execution time (t_n) and the (n+1)-th execution time (t_{n+1}) is within the preset time difference.

Specifically, in step S14 the optimal split search unit 120 may determine the (n+1)-th split size (S_{n+1}) to be the optimal split size if the n-th execution time (t_n) and the (n+1)-th execution time (t_{n+1}) obtained in step S13 satisfy the inequality of Equation 3 described above.

According to an embodiment of the present application, steps S13 and S14 (that is, steps (c) and (d)) may be repeated while n is increased from 0 in steps of 1. More specifically, referring to FIG. 5, steps S13 and S14 are repeated with n increasing from 0 in steps of 1, and when the condition that the time difference between the n-th execution time (t_n) and the (n+1)-th execution time (t_{n+1}) is within the preset time difference is satisfied in step S14 ('YES' in step S14), the (n+1)-th split size (S_{n+1}) is determined to be the optimal split size and the repetition ends.

Otherwise, if the condition that the time difference between the n-th execution time (t_n) and the (n+1)-th execution time (t_{n+1}) is within the preset time difference is not satisfied in step S14 ('NO' in step S14), the optimal split search unit 120 may increase n by 1 (n = n + 1) and return to step S13 (step (c)) to perform the next iteration of the optimal split size search process for the incremented value of n.

In addition, according to an embodiment of the present application, the n-th split size (S_n) obtained when steps S13 and S14 are repeated may be the (n+1)-th split size (S_{n+1}) of the previous iteration. Likewise, the n-th execution time (t_n) obtained when steps S13 and S14 are repeated may be the (n+1)-th execution time (t_{n+1}) of the previous iteration.

In the above description, steps S11 to S14 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present application. In addition, some steps may be omitted as needed, and the order of the steps may be changed.

FIG. 6 is an operational flowchart of a method for training a deep learning model based on an optimal split size using multiple GPUs according to an embodiment of the present application.

The method shown in FIG. 6 for training a deep learning model based on an optimal split size using multiple GPUs may be performed by the deep learning model training system that includes the optimal split size determination apparatus 100 described above. Therefore, even where omitted below, the description given for the deep learning model training system applies equally to the description of FIG. 6.

Referring to FIG. 6, in step S21 the optimal split size determination apparatus 100 may determine the optimal split size to be used in training a given deep learning model.

Next, in step S22 the deep learning model training system may train the given deep learning model based on the determined optimal split size.

In the above description, steps S21 and S22 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present application. In addition, some steps may be omitted as needed, and the order of the steps may be changed.

FIG. 7 is a detailed operational flowchart of the step of training the deep learning model based on the determined optimal split size.

Referring to FIG. 7, in step S221 the deep learning model training system may divide a preset mini-batch into a plurality of splits based on the determined optimal split size and allocate a first split among them to a first GPU of the multi-GPU.

Next, in step S222 the deep learning model training system may transmit the result of the first GPU's processing of the first split to a second GPU of the multi-GPU.

Next, in step S223 the second GPU of the deep learning model training system may calculate the gradient and update the weights based on the received processing result for the first split.

Next, in step S224 the deep learning model training system may allocate to the first GPU a second split among the plurality of splits divided based on the optimal split size.

Here, the step of calculating the gradient and updating the weights (step S223) and the step of allocating the second split (step S224) are handled by separate GPUs (the first GPU and the second GPU), so they may be started in parallel within a preset time difference of each other.
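Steps S221 to S224 can be pictured as a two-stage pipeline. The following Python sketch is only an illustration, not the implementation described in this application: the helper names `make_splits` and `pipeline_schedule` are hypothetical, the mini-batch is modeled as a plain list of sample indices, and each stage is assumed to take exactly one time slot.

```python
from typing import List, Tuple


def make_splits(mini_batch: List[int], split_size: int) -> List[List[int]]:
    """Divide a mini-batch into consecutive splits of at most split_size samples."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    return [mini_batch[i:i + split_size]
            for i in range(0, len(mini_batch), split_size)]


def pipeline_schedule(num_splits: int) -> List[Tuple[str, str]]:
    """Return, for each time slot, the work assigned to (GPU 1, GPU 2).

    In slot k, GPU 1 processes split k (S221 / S224) while GPU 2 computes
    the gradient and updates the weights for split k-1 (S223), using the
    result GPU 1 sent it (S222). "idle" marks a slot with no work.
    """
    schedule = []
    for slot in range(num_splits + 1):
        gpu1 = f"process split {slot}" if slot < num_splits else "idle"
        gpu2 = f"gradient+update split {slot - 1}" if slot >= 1 else "idle"
        schedule.append((gpu1, gpu2))
    return schedule


# Toy mini-batch of 10 samples and an (assumed) optimal split size of 4.
splits = make_splits(list(range(10)), split_size=4)
for slot, (gpu1, gpu2) in enumerate(pipeline_schedule(len(splits))):
    print(f"slot {slot}: GPU1={gpu1:<16} GPU2={gpu2}")
```

From slot 1 onward both GPUs are busy at the same time, which is the overlap between S223 and S224 that the paragraph above describes.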

In the foregoing description, steps S221 to S224 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present application. In addition, some steps may be omitted as needed, and the order of the steps may be changed.

The method for determining an optimal split size when training a deep learning model using multiple GPUs according to an embodiment of the present application may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the present invention, and vice versa.

In addition, the above-described method for determining an optimal split size when training a deep learning model using multiple GPUs may also be implemented in the form of a computer program or application that is stored in a recording medium and executed by a computer.

The foregoing description of the present application is for illustrative purposes, and those of ordinary skill in the art to which the present application pertains will understand that it can be easily modified into other specific forms without changing the technical idea or essential features of the present application. Therefore, the embodiments described above should be understood as illustrative in all respects and not limiting. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in a combined form.

The scope of the present application is indicated by the claims below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present application.

100: apparatus for determining an optimal split size when training a deep learning model using multiple GPUs
110: initial split calculation unit
120: optimal split search unit

Claims

A deep learning model learning method based on an optimal split size using multi-GPU, the method including:
(a) calculating an initial split size based on the number of GPUs included in the multi-GPU and the memory size of the multi-GPU;
(b) calculating an initial execution time required for a preset number of training iterations performed based on the initial split size;
(c) obtaining, based on the initial split size, the initial execution time, and a relationship established among the (n-1)-th split size, the n-th split size, and the (n+1)-th split size, the n-th split size, the n-th execution time required for the preset number of training iterations performed based on the n-th split size, the (n+1)-th split size, and the (n+1)-th execution time required for the preset number of training iterations performed based on the (n+1)-th split size;
(d) if the time difference between the n-th execution time and the (n+1)-th execution time is within a preset time difference, determining the (n+1)-th split size as the optimal split size; and
training a predetermined deep learning model based on the determined optimal split size,
wherein the step of training the deep learning model includes:
allocating a first split among a plurality of splits, obtained by dividing a preset mini-batch based on the optimal split size, to a first GPU among the multi-GPU;
transmitting a processing result of the first split by the first GPU to a second GPU among the multi-GPU;
calculating, by the second GPU, a gradient based on the processing result and changing a weight; and
allocating a second split among the plurality of splits to the first GPU,
and wherein the step of calculating the gradient and changing the weight and the step of allocating the second split are performed by each of the first GPU and the second GPU and are initiated in parallel within a preset time difference from each other.

According to claim 1,
The initial split size is denoted by S_0, corresponding to n = 0,
The (n-1)-th split size is denoted by S_{n-1},
The n-th split size is denoted by S_n,
The (n+1)-th split size is denoted by S_{n+1},
When n is 0, S_{n-1} is 0;
Wherein step (c) and step (d) are repeatedly performed while increasing n from 0 in increments of 1, deep learning model learning method.

According to claim 2,
Steps (c) and (d) are repeatedly performed while increasing n from 0 in increments of 1, and when, in step (d), the time difference between the n-th execution time and the (n+1)-th execution time satisfies the condition of being within the preset time difference, the (n+1)-th split size is determined as the optimal split size and the repetition ends, deep learning model learning method.

According to claim 2,
S_n obtained in a given repetition of steps (c) and (d) is the S_{n+1} of the previous repetition, deep learning model learning method.

According to claim 4,
The n-th execution time obtained in a given repetition of steps (c) and (d) is the (n+1)-th execution time of the previous repetition, deep learning model learning method.
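The repetition of steps (c) and (d), with each repetition reusing the previous repetition's S_{n+1} and (n+1)-th execution time as its S_n and n-th execution time, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the function name `search_split_size` and its parameters are hypothetical, the relationship among S_{n-1}, S_n, and S_{n+1} is supplied by the caller as `next_size` because the sketch does not assume its exact form, the stopping test is assumed to be |t_n - t_{n+1}| <= T, and `max_steps` is an added safety bound. The recurrence and timing function in the usage lines are toy stand-ins.

```python
from typing import Callable, Tuple


def search_split_size(
    s0: int,
    measure: Callable[[int], float],
    next_size: Callable[[int, int], int],
    tolerance: float,
    max_steps: int = 50,
) -> Tuple[int, float]:
    """Search for a split size whose execution time has stopped changing.

    s0        -- initial split size S_0 from step (a)
    measure   -- runs the preset number of training iterations with a given
                 split size and returns the elapsed time
    next_size -- hypothetical stand-in for the relationship among
                 S_{n-1}, S_n and S_{n+1}; maps (S_{n-1}, S_n) to S_{n+1}
    tolerance -- preset time difference T (assumed test: |t_n - t_{n+1}| <= T)
    """
    s_prev, s_n = 0, s0            # S_{n-1} is 0 when n is 0
    t_n = measure(s0)              # step (b): initial execution time
    for _ in range(max_steps):
        s_next = next_size(s_prev, s_n)      # step (c): S_{n+1}
        t_next = measure(s_next)             # step (c): (n+1)-th time
        if abs(t_n - t_next) <= tolerance:   # step (d), assumed form
            return s_next, t_next
        # The next repetition reuses S_{n+1} and t_{n+1} as its S_n and t_n.
        s_prev, s_n, t_n = s_n, s_next, t_next
    raise RuntimeError("no split size met the tolerance within max_steps")


# Toy usage: an illustrative recurrence and a synthetic timing curve.
best_size, best_time = search_split_size(
    s0=2,
    measure=lambda s: 100.0 / s + 1.0,        # synthetic, not a real benchmark
    next_size=lambda prev, cur: 2 * cur - prev,  # illustrative assumption only
    tolerance=1.0,
)
print(best_size, best_time)
```

Because each repetition carries forward the previous measurement, every split size is timed only once.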

According to claim 1,
In step (a), the initial split size is calculated based on Equations 1 and 2 below,
[Equation 1]

[Equation 2]

Here, S_0 is the initial split size, M is the sum of the memory sizes of all GPUs included in the multi-GPU, s is the size of the largest training mini-batch that can be run on any one GPU included in the multi-GPU when the memory size of that GPU is m, N is the number of GPUs included in the multi-GPU,

Is the size of the deep learning model, the deep learning model learning method.

According to claim 2,
The relationship established in step (c) is that the (n-1) th split size, the n th split size, and the (n+1) th split size

A relationship that satisfies, a deep learning model learning method.

According to claim 2,
In step (d),
If the nth execution time and the (n+1)th execution time satisfy the inequality of Equation 3 below, the (n+1)th split size is determined as the optimal split size,
[Equation 3]

Here, t_n is the n-th execution time, t_{n+1} is the (n+1)-th execution time, and T is the preset time difference, deep learning model learning method.

(Deleted)

A deep learning model learning system including multi-GPU and an apparatus for determining an optimal split size when training a deep learning model,
The apparatus for determining the optimal split size includes:
an initial split calculation unit that calculates an initial split size based on the number of GPUs included in the multi-GPU and the memory size of the multi-GPU, and calculates an initial execution time required for a preset number of training iterations performed based on the initial split size; and
an optimal split search unit that obtains, based on the initial split size, the initial execution time, and a relationship established among the (n-1)-th split size, the n-th split size, and the (n+1)-th split size, the n-th split size, the n-th execution time required for the preset number of training iterations performed based on the n-th split size, the (n+1)-th split size, and the (n+1)-th execution time required for the preset number of training iterations performed based on the (n+1)-th split size, and that determines the (n+1)-th split size as the optimal split size if the time difference between the n-th execution time and the (n+1)-th execution time is within a preset time difference,
and
The deep learning model learning system
allocates a first split among a plurality of splits, obtained by dividing a preset mini-batch based on the optimal split size, to a first GPU among the multi-GPU, transmits a processing result of the first split by the first GPU to a second GPU among the multi-GPU, has the second GPU calculate a gradient based on the processing result and change a weight, and allocates a second split among the plurality of splits to the first GPU, wherein the process of calculating the gradient and changing the weight and the process of allocating the second split to the first GPU are performed by each of the first GPU and the second GPU and are initiated in parallel within a preset time difference from each other, deep learning model learning system.

According to claim 11,
The initial split size is denoted by S_0, corresponding to n = 0,
The (n-1)-th split size is denoted by S_{n-1},
The n-th split size is denoted by S_n,
The (n+1)-th split size is denoted by S_{n+1},
The optimal split search unit repeatedly performs a process of searching for the optimal split size while increasing n from 0 in increments of 1, deep learning model learning system.

According to claim 12,
The S_n and the n-th execution time obtained while the optimal split search unit repeats the process are, respectively, the S_{n+1} and the (n+1)-th execution time of the previous repetition, deep learning model learning system.

A computer-readable recording medium recording a program for executing the method according to any one of claims 1 to 8 in a computer.