KR20210141240A

KR20210141240A - Apparatus and method for optimal split size decision in deep learning using multi-gpu and method for learning deep learning model using the same

Info

Publication number: KR20210141240A
Application number: KR1020200058647A
Authority: KR
Inventors: 이재환; 이명성; 최현성
Original assignee: 한국항공대학교산학협력단
Priority date: 2020-05-15
Filing date: 2020-05-15
Publication date: 2021-11-23
Also published as: KR102494945B1

Abstract

Disclosed are an apparatus and a method for determining an optimal split size when training a deep learning model using a multi-GPU, and a method for training a deep learning model using the same. An initial split size is determined based on the number of GPUs and the memory size in a multi-GPU environment, and a search is performed starting from the initial split size, which reduces the overhead of the search. According to an embodiment of the present disclosure, the method for determining the optimal split size may include the steps of: (a) calculating an initial split size based on the number of GPUs included in the multi-GPU and the memory size of the multi-GPU; (b) calculating an initial execution time required to perform a preset number of training iterations based on the initial split size; (c) acquiring, based on the initial split size, the initial execution time, and a relationship set among an (n-1)-th split size, an n-th split size, and an (n+1)-th split size, the n-th split size, an n-th execution time required to perform the preset number of training iterations based on the n-th split size, the (n+1)-th split size, and an (n+1)-th execution time required to perform the preset number of training iterations based on the (n+1)-th split size; and (d) determining the (n+1)-th split size as the optimal split size when the time difference between the n-th execution time and the (n+1)-th execution time is within a preset time difference.

Description

Apparatus and method for determining optimal split size when training a deep learning model using multi-GPU, and method for training a deep learning model using the same

The present application relates to an apparatus and method for determining an optimal split size when training a deep learning model using a multi-GPU, and to a method for training a deep learning model using the same. In particular, the present application relates to an automated solution that determines an optimal split size with a view to improving deep learning performance by using pipelining in a multi-GPU environment.

To achieve high training accuracy in deep learning, a large model or a large data set may be used. However, as the model grows, the limited memory of a GPU may make it difficult to perform deep learning on a single GPU. To overcome this limitation of a single GPU, parallel deep learning using multiple GPUs may be applied.

Parallel deep learning can be divided into model parallelism and data parallelism according to the parallelization method. Model parallelism refers to a method in which one large model is partitioned across multiple GPUs for training.

FIG. 1 is a conceptual diagram for explaining model parallelism. Referring to FIG. 1, model parallelism may be performed by partitioning a model in various ways and assigning each part of the partitioned model to a respective GPU. In this case, the part of the model placed on each GPU depends on the output of the part of the model placed on another GPU. Accordingly, each GPU in the multi-GPU remains idle until the data it depends on is delivered to it. Pipelining may be applied to overcome this limitation, in which the computing resources of the entire multi-GPU cannot be fully utilized because of the dependencies between the partitioned parts of the model.

FIG. 2 is a conceptual diagram for explaining pipelining. Referring to FIG. 2, the pipelining technique performs training by dividing a training mini-batch into smaller units, and it can be seen that applying pipelining reduces the idle time of each GPU compared with plain model parallelism. However, if the mini-batch is not divided into units of an appropriate size when pipelining is applied, the computing resources of all GPUs cannot be fully utilized. Therefore, properly searching for the split size used to divide the training mini-batch is an important factor when applying pipelining. Google's GPipe is an example of efficient model parallelism achieved through pipelining, but GPipe has the limitation that the size into which the training mini-batch is divided must be set before training starts.
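As a rough illustration of why the number of splits matters, the following sketch computes GPU utilization for an idealized pipeline. It assumes that every GPU needs one equal time unit per split and that transfers between GPUs are free; neither assumption comes from the present application, and the function name `ideal_utilization` is hypothetical. Very small splits can also lower the efficiency of each GPU, which this simple model ignores.

```python
def ideal_utilization(num_gpus: int, num_splits: int) -> float:
    """Fraction of GPU time spent busy in an idealized forward-only pipeline.

    Assumption (not from the source): each GPU needs one time unit per split
    and transfers cost nothing, so the pipeline finishes after
    num_splits + num_gpus - 1 time units while each GPU is busy for
    num_splits of them.
    """
    if num_gpus < 1 or num_splits < 1:
        raise ValueError("num_gpus and num_splits must be positive")
    total_steps = num_splits + num_gpus - 1
    return num_splits / total_steps


if __name__ == "__main__":
    # A single split corresponds to model parallelism without pipelining.
    for splits in (1, 2, 4, 8):
        print(splits, round(ideal_utilization(4, splits), 3))
```

Under these assumptions, a single split on four GPUs keeps each GPU busy only a quarter of the time, and the fraction rises as the mini-batch is divided into more splits.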

In addition, it is inefficient for a user to manually search for the optimal split size during training in order to apply pipelining, so a solution capable of automatically determining the optimal split size is required for efficient model parallelism.

The background art of the present application is disclosed in Korean Patent Laid-Open Publication No. 10-2019-0085444.

The present application is intended to solve the problems of the prior art described above. Its object is to provide an apparatus and method for determining an optimal split size when training a deep learning model using a multi-GPU, which search for an optimal split size that makes the most of the computing power of each GPU when pipelining is applied in a multi-GPU environment, and a method for training a deep learning model using the same.

However, the technical problems to be solved by the embodiments of the present application are not limited to those described above, and other technical problems may exist.

As a technical means for achieving the above technical object, a method for determining an optimal split size when training a deep learning model using a multi-GPU according to an embodiment of the present application may include: (a) calculating an initial split size based on the number of GPUs included in the multi-GPU and the memory size of the multi-GPU; (b) calculating an initial execution time required for a preset number of training iterations performed based on the initial split size; (c) acquiring, based on the initial split size, the initial execution time, and a relationship set among an (n-1)-th split size, an n-th split size, and an (n+1)-th split size, the n-th split size, an n-th execution time required for the preset number of training iterations performed based on the n-th split size, the (n+1)-th split size, and an (n+1)-th execution time required for the preset number of training iterations performed based on the (n+1)-th split size; and (d) determining the (n+1)-th split size as the optimal split size when the time difference between the n-th execution time and the (n+1)-th execution time is within a preset time difference.

In addition, the initial split size, which corresponds to the case where n is 0, may be denoted by S_0, the (n-1)-th split size by S_{n-1}, the n-th split size by S_n, and the (n+1)-th split size by S_{n+1}. When n is 0, S_{n-1} may be 0.

In addition, steps (c) and (d) may be performed repeatedly while n is incremented by 1 starting from 0.

In addition, while steps (c) and (d) are performed repeatedly with n incremented by 1 starting from 0, if the condition in step (d) that the time difference between the n-th execution time and the (n+1)-th execution time is within the preset time difference is satisfied, the (n+1)-th split size may be determined as the optimal split size and the repetition may end.

In addition, S_n acquired in a given repetition of steps (c) and (d) may be S_{n+1} of the previous repetition.

In addition, the n-th execution time acquired in a given repetition of steps (c) and (d) may be the (n+1)-th execution time of the previous repetition.

In addition, in step (a), the initial split size may be calculated based on Equations 1 and 2 below.

In addition, the relationship set in step (c) may be one in which the (n-1)-th split size, the n-th split size, and the (n+1)-th split size satisfy

S_{n+1} = (S_n + S_{n-1}) / 2.

In addition, in step (d), the (n+1)-th split size may be determined as the optimal split size if the n-th execution time and the (n+1)-th execution time satisfy the inequality of Equation 3 below.

Meanwhile, a method for training a deep learning model based on an optimal split size using a multi-GPU according to an embodiment of the present application may include: determining, based on the above method for determining an optimal split size, an optimal split size to be used for training a given deep learning model; and training the deep learning model based on the determined optimal split size.

In addition, the training of the deep learning model may include: assigning, to a first GPU of the multi-GPU, a first split among a plurality of splits obtained by dividing a preset mini-batch based on the optimal split size; transmitting the result of processing the first split by the first GPU to a second GPU of the multi-GPU; calculating, by the second GPU, a gradient based on the processing result and updating weights; and assigning a second split among the plurality of splits to the first GPU.

In addition, the calculating of the gradient and updating of the weights and the assigning of the second split may be performed by each of the first GPU and the second GPU and may be started in parallel within a preset time difference of each other.
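The split assignment described above can be sketched as follows. This is a simplified illustration, not the implementation of the present application: the helper names `make_splits` and `pipeline_schedule` are hypothetical, and the schedule assumes that every GPU takes exactly one step per split and ignores the gradient and weight-update details.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def make_splits(mini_batch: Sequence[T], split_size: int) -> List[Sequence[T]]:
    """Divide a mini-batch into consecutive splits of at most split_size items."""
    if split_size < 1:
        raise ValueError("split_size must be positive")
    return [mini_batch[i:i + split_size]
            for i in range(0, len(mini_batch), split_size)]


def pipeline_schedule(num_splits: int, num_gpus: int) -> List[List[Optional[int]]]:
    """Row t lists, per GPU, the index of the split handled at step t.

    None marks an idle GPU. Assumption: a GPU hands each split to the
    next GPU after exactly one step.
    """
    steps = num_splits + num_gpus - 1
    return [
        [t - g if 0 <= t - g < num_splits else None for g in range(num_gpus)]
        for t in range(steps)
    ]


if __name__ == "__main__":
    splits = make_splits(list(range(10)), 4)
    for row in pipeline_schedule(len(splits), 2):
        print(row)
```

In `pipeline_schedule(3, 2)` the second row is `[1, 0]`: the first GPU starts on the second split while the second GPU works on the result of the first split, which mirrors the parallel start described above.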

Meanwhile, an apparatus for determining an optimal split size when training a deep learning model using a multi-GPU according to an embodiment of the present application may include: an initial split operation unit that calculates an initial split size based on the number of GPUs included in the multi-GPU and the memory size of the multi-GPU, and calculates an initial execution time required for a preset number of training iterations performed based on the initial split size; and an optimal split search unit that acquires, based on the initial split size, the initial execution time, and a relationship set among an (n-1)-th split size, an n-th split size, and an (n+1)-th split size, the n-th split size, an n-th execution time required for the preset number of training iterations performed based on the n-th split size, the (n+1)-th split size, and an (n+1)-th execution time required for the preset number of training iterations performed based on the (n+1)-th split size, and that determines the (n+1)-th split size as the optimal split size when the time difference between the n-th execution time and the (n+1)-th execution time is within a preset time difference.

In addition, the initial split size, which corresponds to the case where n is 0, may be denoted by S_0, the (n-1)-th split size by S_{n-1}, the n-th split size by S_n, and the (n+1)-th split size by S_{n+1}.

In addition, the optimal split search unit may repeatedly perform the process of searching for the optimal split size while n is incremented by 1 starting from 0.

In addition, S_n and the n-th execution time acquired while the optimal split search unit repeats the process may be, respectively, S_{n+1} and the (n+1)-th execution time of the previous repetition.

The means for solving the problems described above are merely exemplary and should not be construed as limiting the present application. In addition to the exemplary embodiments described above, additional embodiments may exist in the drawings and the detailed description.

According to the means described above, it is possible to provide an apparatus and method for determining an optimal split size when training a deep learning model using a multi-GPU, which search for an optimal split size that makes the most of the computing power of each GPU when pipelining is applied in a multi-GPU environment, and a method for training a deep learning model using the same.

According to the means described above, the inconvenience of a user having to manually search for the optimal split size while measuring the training time for every selectable split size is eliminated. The overhead caused by the search can also be reduced by determining an initial split size based on the number of GPUs and the memory size in the multi-GPU environment and performing the search from that initial split size.

However, the effects obtainable from the present application are not limited to those described above, and other effects may exist.

FIG. 1 is a conceptual diagram for explaining model parallelism.
FIG. 2 is a conceptual diagram for explaining pipelining.
FIG. 3 is a schematic configuration diagram of an apparatus for determining an optimal split size when training a deep learning model using a multi-GPU according to an embodiment of the present application.
FIG. 4A is a graph showing the change in throughput according to the split size, as an experimental example associated with the operation of the apparatus for determining an optimal split size when training a deep learning model using a multi-GPU according to an embodiment of the present application.
FIG. 4B is a graph comparing the change in throughput according to the mini-batch size between the case where only model parallelism is applied and the case where the optimal split size determination technique of the present application is applied, as an experimental example associated with the operation of the same apparatus.
FIG. 5 is an operation flowchart of a method for determining an optimal split size when training a deep learning model using a multi-GPU according to an embodiment of the present application.
FIG. 6 is an operation flowchart of a method for training a deep learning model based on an optimal split size using a multi-GPU according to an embodiment of the present application.
FIG. 7 is a detailed operation flowchart of the step of training the deep learning model based on the determined optimal split size.

Hereinafter, embodiments of the present application are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present application pertains can readily implement them. However, the present application may be embodied in several different forms and is not limited to the embodiments described herein. In the drawings, parts irrelevant to the description are omitted in order to describe the present application clearly, and similar reference numerals denote similar parts throughout the specification.

Throughout this specification, when a part is said to be "connected" to another part, this includes not only the case of being "directly connected" but also the case of being "electrically connected" or "indirectly connected" with another element in between.

Throughout this specification, when a member is said to be located "on", "above", "at the top of", "under", "below", or "at the bottom of" another member, this includes not only the case where the member is in contact with the other member but also the case where a further member exists between the two.

Throughout this specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

The present application relates to an apparatus and method for determining an optimal split size when training a deep learning model using a multi-GPU, and to a method for training a deep learning model using the same. In particular, the present application relates to an automated solution that determines an optimal split size with a view to improving deep learning performance by using pipelining in a multi-GPU environment.

FIG. 3 is a schematic configuration diagram of an apparatus for determining an optimal split size when training a deep learning model using a multi-GPU according to an embodiment of the present application.

Referring to FIG. 3, the apparatus 100 for determining an optimal split size when training a deep learning model using a multi-GPU (hereinafter, the "optimal split size determination apparatus 100") may include an initial split operation unit 110 and an optimal split search unit 120.

The initial split operation unit 110 may calculate an initial split size based on the number of GPUs included in the multi-GPU and the memory size of the multi-GPU. In the following description of the embodiments of the present application, the initial split size corresponds to the case where n is 0 and may be denoted by S_0.

According to an embodiment of the present application, the initial split operation unit 110 may determine the initial split size (S_0) in consideration of the relative magnitudes of the maximum training mini-batch size (s) that a GPU included in the multi-GPU can handle and the size of the input mini-batch (mini batch).

Specifically, when a mini-batch whose size is less than or equal to the maximum training mini-batch size (s) that can be processed in the memory of a single GPU included in the multi-GPU is input (in other words, if mini batch ≤ s), the initial split operation unit 110 may determine the initial split size (S_0) as the input mini-batch size (mini batch) divided by the number of GPUs in the multi-GPU (S_0 = mini batch / N). Conversely, when the input mini-batch size exceeds the maximum training mini-batch size (s) that can be processed in the memory of a single GPU (in other words, if mini batch > s), the initial split operation unit 110 may take s, the maximum training mini-batch size that a single GPU can process, as the initial split size (S_0 = s).
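The two-case rule above can be written as a small function. This is a minimal sketch that covers only the simple rule in this paragraph, not Equations 1 and 2; the function name `initial_split_size`, the use of integer division, and the lower bound of 1 are assumptions made for illustration.

```python
def initial_split_size(mini_batch: int, s: int, num_gpus: int) -> int:
    """Initial split size S_0 from the simple two-case rule.

    mini_batch: size of the input mini-batch.
    s: largest mini-batch a single GPU can process in its memory.
    num_gpus: number of GPUs N in the multi-GPU.
    """
    if mini_batch < 1 or s < 1 or num_gpus < 1:
        raise ValueError("all arguments must be positive")
    if mini_batch <= s:
        # S_0 = mini batch / N; rounding down and clamping to at least 1
        # are assumptions, since the source does not discuss rounding.
        return max(1, mini_batch // num_gpus)
    # mini batch > s: fall back to the single-GPU maximum, S_0 = s.
    return s


if __name__ == "__main__":
    print(initial_split_size(mini_batch=64, s=128, num_gpus=4))
    print(initial_split_size(mini_batch=256, s=128, num_gpus=4))
```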

In addition, according to an embodiment of the present application, the initial split operation unit 110 may calculate the initial split size (S_0) after subtracting a predetermined throughput from the total memory of the multi-GPU, considering that part of the computing capacity of the multi-GPU is used when processing results are transmitted and received between GPUs.

In this regard, the initial split operation unit 110 may calculate the initial split size (S_0) based on Equations 1 and 2 below, taking into account the predetermined throughput that cannot be used for mini-batch processing because of the data transmission and reception process between GPUs.

[Equation 1]

[Equation 2]

Here, M may be the sum of the memory sizes of all GPUs included in the multi-GPU. Also, s may be the maximum training mini-batch size that a GPU can handle when the memory size of any one GPU included in the multi-GPU is m. Also, N may be the number of GPUs included in the multi-GPU. Also, [symbol not reproduced] may be the model size of the deep learning model to be created (trained).

Specifically, if M is the sum of the memory sizes of all GPUs in the multi-GPU that performs model parallelism, the size of the mini-batch that can be trained by applying model parallelism to the multi-GPU may be [expression not reproduced]. Here, the predetermined throughput α may refer to the data throughput required to transmit the processing result of one GPU to another GPU when model parallelism is applied in a multi-GPU environment. That is, when model parallelism is applied, the input/output dependencies between GPUs entail a process of transmitting and receiving processing results between GPUs, so not all memory resources of the multi-GPU can be used to process the mini-batch. The initial split operation unit 110 takes this into account by subtracting the predetermined throughput (α) for data transfer between GPUs from the total memory size of the multi-GPU when calculating the initial split size.

In this regard, the initial split operation unit 110 may determine the initial split size (S_0) according to Equations 1 and 2 by dividing [expression not reproduced], the size of the mini-batch that can be trained through model parallelism, by the number of GPUs (N).

According to an embodiment of the present application, the initial split operation unit 110 may calculate the initial split size (S_0) based on Equations 1 and 2 described above when the input mini-batch size (mini batch) exceeds the maximum training mini-batch size (s) that can be processed in the memory of a single GPU of the multi-GPU (mini batch > s), but is not limited thereto.

Once the initial split size (S_0) has been calculated by the initial split operation unit 110 as described above, the process of searching for the optimal split size, described in detail below, may begin.

In addition, the initial split operation unit 110 may calculate the initial execution time required for a preset number (i) of training iterations performed based on the calculated initial split size. In the following description of the embodiments of the present application, the initial execution time corresponds to the case where n is 0 and may be denoted by t_0.
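Measuring the execution time of a preset number of iterations can be sketched as below. The names `timed_iterations` and `train_step` are hypothetical, and `train_step` stands in for one training iteration with the given split size; the source does not specify how the time is measured.

```python
import time
from typing import Callable


def timed_iterations(train_step: Callable[[int], None],
                     split_size: int,
                     iterations: int) -> float:
    """Wall-clock seconds spent running `iterations` training steps.

    train_step is a caller-supplied placeholder that performs one
    training iteration using the given split size.
    """
    if iterations < 1:
        raise ValueError("iterations must be positive")
    start = time.perf_counter()
    for _ in range(iterations):
        train_step(split_size)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Dummy step that only sleeps; a real step would run the model.
    t0 = timed_iterations(lambda s: time.sleep(0.001), split_size=16, iterations=5)
    print(f"t_0 = {t0:.4f} s")
```

When the step launches asynchronous GPU work, the timing is only meaningful if the step waits for that work to finish before returning, for example by synchronizing the device in whichever framework is used.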

최적 스플릿 탐색부(120)는 초기 스플릿 크기(S₀)와 초기 수행시간(t₀), 그리고 (n-1)번째 스플릿 크기와 n번째 스플릿 크기와 (n+1)번째 스플릿 크기 사이에 설정된 관계에 기초하여 n번째 스플릿 크기를 획득할 수 있다. 본원의 실시예에 관한 이하의 설명에서 (n-1)번째 스플릿 크기는 S_n-1로 표시되고, n번째 스플릿 크기는 S_n으로 표시되고, (n+1)번째 스플릿 크기는 S_n+1로 표시될 수 있다. 또한, n이 0인 경우의 S_n-1인 S_-1은 0일 수 있다.The optimal split search unit 120 is set between the initial split size (S ₀ ), the initial execution time (t ₀ ), and the (n-1)-th split size, the n-th split size, and the (n+1)-th split size. An nth split size may be obtained based on the relationship. In the following description of the embodiments of the present application, the (n-1)-th split size is _{denoted by S n-1} , the n-th split size is _{denoted by S n} , and the (n+1)-th split size is _{denoted by S n+ 1} can be displayed. Also, when n is 0, S _{−1 that} _{is S n−1} may be 0.

또한, 본원의 일 실시예에 따르면, 최적 스플릿 탐색부(120)는 최적 스플릿 크기를 탐색하는 프로세스를 반복적으로 수행하되, 최적 스플릿 탐색부(120)가 현재 반복 수행 내에서 획득하는 n번째 스플릿 크기(S_n)는 이전 반복 수행시의 (n+1)번째 스플릿 크기(S_n+1)일 수 있다.Also, according to an embodiment of the present disclosure, the optimal split search unit 120 iteratively performs a process of searching for an optimal split size, but the nth split size that the optimal split search unit 120 acquires within the current iteration. (S _n ) may be the (n+1)-th split size (S _n+1 ) when the previous iteration is performed.

또한, 최적 스플릿 탐색부(120)는, 연산된 n번째 스플릿 크기(S_n)에 기초하여 수행되는 미리 설정된 횟수(i번)만큼의 반복 학습에 소요되는 n번째 수행시간을 획득할 수 있다. 또한, 본원의 실시예에 관한 이하의 설명에서 n번째 수행시간은 t_n으로 표시될 수 있다.Also, the optimal split search unit 120 may acquire the n-th execution time required for repeated learning for a preset number of times (i times) performed based on the _{calculated n-th split size (S n ).} In addition, in the following description of the embodiment of the present application, the nth execution time may be represented by _{t n .}

또한, 본원의 일 실시예에 따르면, 최적 스플릿 탐색부(120)는 최적 스플릿 크기를 탐색하는 프로세스를 반복적으로 수행하되, 최적 스플릿 탐색부(120)가 현재 반복 수행 내에서 획득하는 n번째 수행시간(t_n)은 이전 반복 수행시의 (n+1)번째 수행시간(t_n+1)일 수 있다.In addition, according to an embodiment of the present application, the optimal split search unit 120 iteratively performs the process of searching for the optimal split size, but the nth execution time obtained by the optimal split search unit 120 within the current iteration. (t _n ) may be the (n+1)-th execution time (t _n+1 ) of the previous iteration.

In addition, the optimal split search unit 120 may obtain the (n+1)-th split size (S_{n+1}) based on the initial split size (S₀), the initial execution time (t₀), and the relationship established among the (n-1)-th split size, the n-th split size, and the (n+1)-th split size.

According to an embodiment of the present application, the relationship established among the (n-1)-th split size (S_{n-1}), the n-th split size (S_n), and the (n+1)-th split size (S_{n+1}) that the optimal split search unit 120 considers may be a relationship satisfying

$$S_{n+1} = \frac{S_n + S_{n-1}}{2}$$

Specifically, in the first iteration (n=0), the optimal split search unit 120 calculates S_{n+1}, that is, S₁, as half of the initial split size (S₁ = S₀/2), because S_{-1} is 0 as described above. From the second iteration onward (n≥1), it calculates S_{n+1} as the arithmetic mean of S_n and S_{n-1}, where the values of S_n and S_{n-1} used in the calculation may be the values calculated in the previous iterations.
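The recurrence described above can be traced with a short Python sketch. The function name `split_size_sequence` and the starting value 16 are illustrative assumptions, not taken from the application. The sketch also keeps the candidates as real numbers; how a candidate would be rounded to an integer number of samples is not stated in this passage.

```python
def split_size_sequence(s0, steps):
    """Return [S_0, S_1, ..., S_steps] for S_{n+1} = (S_n + S_{n-1}) / 2 with S_{-1} = 0.

    Hypothetical helper for illustration only.
    """
    prev, cur = 0.0, float(s0)  # S_{-1} = 0 and S_0
    sizes = [cur]
    for _ in range(steps):
        # The new candidate is the arithmetic mean of the two most recent sizes.
        prev, cur = cur, (cur + prev) / 2
        sizes.append(cur)
    return sizes


print(split_size_sequence(16, 5))  # [16.0, 8.0, 12.0, 10.0, 11.0, 10.5]
```

Because each candidate is the mean of the previous two, the sequence oscillates with shrinking amplitude and, as a property of this recurrence with S_{-1} = 0, approaches 2·S₀/3.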

In addition, the optimal split search unit 120 may obtain the (n+1)-th execution time, that is, the time required to perform the preset number (i) of training iterations based on the calculated (n+1)-th split size (S_{n+1}). In the following description of embodiments of the present application, the (n+1)-th execution time may be denoted t_{n+1}.

In addition, if the time difference between the n-th execution time (t_n) and the (n+1)-th execution time (t_{n+1}) is within a preset time difference, the optimal split search unit 120 may determine the (n+1)-th split size (S_{n+1}) as the optimal split size.

Specifically, according to an embodiment of the present application, if the n-th execution time (t_n) and the (n+1)-th execution time (t_{n+1}) satisfy the inequality of Equation 3 below, the optimal split search unit 120 may determine the (n+1)-th split size (S_{n+1}) calculated in that iteration as the optimal split size.

[Equation 3]

Here, t_n is the n-th execution time, t_{n+1} is the (n+1)-th execution time, and T may be the preset time difference. T may be any threshold value greater than 0.

Specifically, when a mini-batch is divided into smaller splits for training, the optimal split search unit 120 takes two effects into account. If the splits are smaller than necessary, data transfer between GPUs increases and the computing power of the GPUs cannot be used efficiently. Conversely, if the splits are larger than necessary, the split size approaches the mini-batch size, so the idle time of the GPUs grows as if pipelining were not applied. In view of these effects, the optimal split search unit 120 may operate to determine the optimal split size by varying the split size starting from the initial split size (S₀) and searching for the point at which the processing capability of the multi-GPU is maximized (in other words, the point at which the time required for computation decreases by a predetermined level or more).

In addition, the optimal split search unit 120 may repeatedly perform the above-described process of searching for the optimal split size while incrementing n by 1 starting from 0. That is, the optimal split search unit 120 repeats the search process while incrementing n by 1 from 0 and, once the condition that the time difference between the n-th execution time (t_n) and the (n+1)-th execution time (t_{n+1}) is within the preset time difference is satisfied, may determine the (n+1)-th split size (S_{n+1}) as the optimal split size and terminate the iteration.
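The iterative search can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: `measure` is a hypothetical callback that returns the time taken by i training iterations at a given split size, the stopping test is assumed to be |t_n − t_{n+1}| ≤ T (the text only says the difference must be within T), and `max_iters` is an added safety bound that the description does not mention.

```python
def find_optimal_split(s0, measure, threshold, max_iters=50):
    """Search for a split size starting from the initial split size s0.

    measure(split_size) -> execution time of i training iterations
    (hypothetical callback supplied by the caller).
    """
    s_prev, s_n = 0.0, float(s0)  # S_{-1} = 0 and S_0
    t_n = measure(s_n)            # t_0, the initial execution time
    for _ in range(max_iters):
        s_next = (s_n + s_prev) / 2  # S_{n+1}
        t_next = measure(s_next)     # t_{n+1}
        # Assumed form of the stopping test: absolute difference within T.
        if abs(t_n - t_next) <= threshold:
            return s_next
        # S_{n+1} and t_{n+1} become S_n and t_n of the next iteration,
        # so every candidate split size is measured only once.
        s_prev, s_n, t_n = s_n, s_next, t_next
    return s_n  # safety fallback when the threshold is never met


def toy_time(split_size):
    # Toy cost curve (assumption), chosen only so the example is deterministic.
    return 100.0 / split_size + split_size


print(find_optimal_split(16, toy_time, threshold=0.1))  # 11.0
```

With the toy curve the candidates visited are 16, 8, 12, 10, and 11; the search stops at 11 because the times at 10 and 11 differ by less than 0.1. In a real training run, `measure` would time actual iterations, and the candidate would have to be converted to an integer split size.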

Hereinafter, a process in which a deep learning model training system (not shown), which includes the optimal split size determination apparatus 100 and a multi-GPU, trains a given deep learning model using the determined optimal split size will be described. That is, the deep learning model training system (not shown) may include the optimal split size determination apparatus 100 and the multi-GPU.

The deep learning model training system (not shown) may obtain the optimal split size, determined by the optimal split size determination apparatus 100, to be used for training the given deep learning model. Specifically, as described above, the optimal split size determination apparatus 100 of the deep learning model training system (not shown) varies the split size that serves as the basis for dividing a mini-batch and tracks how the time (execution time) required for a preset number (i) of training iterations changes with the split size. In this way, it may search, during the initial training stage of the given deep learning model, for the optimal split size that maximizes the computing capability of the multi-GPU.

In summary, in the initial training stage, the deep learning model training system (not shown) disclosed herein searches for the optimal split size while simultaneously carrying out training for generating the deep learning model. In training after the optimal split size has been determined, training may proceed through pipelining, in which the mini-batch is divided based on the determined optimal split size and the resulting splits are allocated to the GPUs. That is, once the optimal split size is determined, the deep learning model training system (not shown) may perform the subsequent training for generating the deep learning model based on the determined optimal split size.

Specifically, the deep learning model training system (not shown) may allocate, to a first GPU of the multi-GPU, a first split among a plurality of splits obtained by dividing a preset mini-batch based on the optimal split size determined by the optimal split size determination apparatus 100.

In addition, the deep learning model training system (not shown) may transmit the result of processing the first split by the first GPU to a second GPU of the multi-GPU.

In addition, the second GPU of the deep learning model training system (not shown) may calculate a gradient based on the received processing result and update the weights (in other words, the model parameters).

In addition, the deep learning model training system (not shown) may allocate a second split among the plurality of splits to the first GPU.

Here, the process in which the second GPU of the deep learning model training system (not shown) calculates the gradient and updates the weights and the process of allocating the second split to the first GPU can be performed by the second GPU and the first GPU, respectively, so the two processes may be started in parallel within a preset time difference of each other. In other words, once any one GPU of the multi-GPU finishes processing its allocated split, it can start processing the next split without waiting to receive an output from another GPU, so idle time can be drastically reduced.

In addition, according to an embodiment of the present application, the first GPU may refer to the GPU corresponding to the input layer of the deep learning artificial neural network model, and the second GPU may refer to the GPU corresponding to the output layer of the model, but the present application is not limited thereto. For reference, depending on the implementation of the present application, the first GPU may be referred to as the preceding GPU and the second GPU as the subsequent GPU.

In a conventional deep learning model training system based on model parallelism, pipelining is not applied. While the first GPU processes the data in its allocated mini-batch, the second GPU must wait for the first GPU to finish, so the second GPU is idle. Once the first GPU has processed the data and handed it to the second GPU, the first GPU reads a new mini-batch only after the second GPU has finished processing the data, calculating the gradient, and so on, so the first GPU is idle in turn.

In contrast, the deep learning model training system (not shown) disclosed herein trains by dividing the mini-batch into smaller units called splits. After the first split is allocated to the first GPU and the first GPU has processed the data in the first split and passed the result to the second GPU, the first GPU does not become idle. Instead, it is promptly allocated the next split, the second split, and processes the data in it. Meanwhile, the second GPU can calculate the gradient and update the weights based on the processing result of the first split received from the first GPU. Training efficiency can therefore be improved by using the computing capability of all GPUs in the multi-GPU environment in parallel.
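The trade-off between too-small and too-large splits can be illustrated with a toy timing model in Python. Everything numeric here is an assumption made for illustration: both GPUs are taken to need the same time per split, that time is modeled as a fixed per-split overhead (standing in for transfer and launch costs) plus a per-sample cost, and only split sizes that divide the mini-batch evenly are tried. None of these values come from the experiments in the application.

```python
def two_stage_makespan(num_splits, stage_time):
    """Finish time of a two-GPU pipeline in which GPU 2 can start a split
    only after GPU 1 has finished processing that split."""
    gpu1_free = 0.0
    gpu2_free = 0.0
    for _ in range(num_splits):
        gpu1_free += stage_time                             # GPU 1 finishes this split
        gpu2_free = max(gpu2_free, gpu1_free) + stage_time  # GPU 2 handles it next
    return gpu2_free


def makespan_for_split_size(mini_batch, split_size, per_sample=0.5, overhead=1.0):
    # Toy cost model (assumption): fixed per-split overhead plus linear per-sample cost.
    num_splits = mini_batch // split_size  # split_size is assumed to divide mini_batch
    stage_time = overhead + per_sample * split_size
    return two_stage_makespan(num_splits, stage_time)


results = {s: makespan_for_split_size(16, s) for s in (1, 2, 4, 8, 16)}
print(results)  # {1: 25.5, 2: 18.0, 4: 15.0, 8: 15.0, 16: 18.0}
```

In this toy model a split size of 16 (no pipelining) leaves each GPU idle half the time, a split size of 1 is dominated by per-split overhead, and intermediate sizes finish soonest, which is the qualitative shape the search is meant to exploit.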

FIG. 4A is a graph of an experimental example associated with the operation of the apparatus for determining an optimal split size when training a deep learning model using a multi-GPU according to an embodiment of the present application, showing the change in throughput as a function of the split size. FIG. 4B is a graph of an experimental example associated with the operation of the same apparatus, comparing the change in throughput as a function of the mini-batch size between the case where only model parallelism is applied and the case where the optimal split size determination technique of the present application is applied.

FIGS. 4A and 4B show the results of an experiment for evaluating the performance associated with the operation of the apparatus for determining an optimal split size when training a deep learning model using a multi-GPU according to an embodiment of the present application. The experiment was performed in a software environment of PyTorch 1.4.0, CUDA 10.0, and NVIDIA driver 418.56, and in a multi-GPU environment with two GPUs (two NVIDIA GeForce GTX 1080 Ti cards), with U-Net as the deep learning model being trained. U-Net is a fully convolutional network for image segmentation; because its details are well known to those skilled in the art, a detailed description is omitted.

FIG. 4A shows how training performance changes with the split size used to divide the mini-batch. Referring to FIG. 4A, with a training mini-batch size of 16, the throughput increases from 9.2 images/sec to 15.69 images/sec as the split size goes from 1 to 3, but decreases again once the split size exceeds 3. Specifically, when the split size is increased to 14, the throughput is 14.16 images/sec, which is 11% lower than when the split size is 3. That is, the optimal split size in this experiment may be determined, by way of example, to be 3. When training is performed with the mini-batch divided into splits smaller or larger than the determined optimal split size, the amount that can be processed in the same time decreases, so it can be predicted that the execution time required for the preset number of training iterations will increase.
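As a quick arithmetic check of the FIG. 4A figures quoted above, the following Python snippet computes the relative changes from the three throughput values in the text. The helper name `percent_change` is an illustrative assumption. The result suggests that the quoted 11% corresponds to taking the ratio relative to the lower value (14.16 images/sec); measured relative to the peak of 15.69 images/sec, the drop is closer to 10%.

```python
def percent_change(new, old):
    """Relative change from old to new, in percent (hypothetical helper)."""
    return (new - old) * 100.0 / old


at_split_1 = 9.2     # images/sec, split size 1
at_split_3 = 15.69   # images/sec, split size 3 (peak)
at_split_14 = 14.16  # images/sec, split size 14

gain_1_to_3 = percent_change(at_split_3, at_split_1)      # about +70.5 %
drop_from_peak = percent_change(at_split_14, at_split_3)  # about -9.8 %
peak_over_14 = percent_change(at_split_3, at_split_14)    # about +10.8 %

print(round(gain_1_to_3, 1), round(drop_from_peak, 1), round(peak_over_14, 1))
```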

FIG. 4B compares how training performance changes with the training mini-batch size between the case where model parallelism and pipelining are applied together using the optimal split size determination method of the present application and the case of conventional model parallelism. Referring to FIG. 4B, the optimal split size determination technique disclosed herein yields higher image throughput than the conventional model parallelism technique for every mini-batch size. In particular, at a mini-batch size of 16, the present application achieved 15.66 images/sec whereas conventional model parallelism achieved 13.91 images/sec, confirming that applying pipelining increases image throughput by about 12%.

Furthermore, when pipelining is applied based on the optimal split size determination technique disclosed herein, not only the image throughput but also the trainable mini-batch size can increase compared with conventional model parallelism. Specifically, in this experiment, the maximum mini-batch size trainable with one GeForce GTX 1080 Ti was 10, and the maximum mini-batch size trainable with conventional model parallelism on two GeForce GTX 1080 Ti cards was 16, an increase of 60%. With the deep learning model training technique of the present application, which applies the optimal split size search and pipelining, the trainable mini-batch size was 20, which is twice that of a single GPU and 25% larger than with conventional model parallelism.

That is, when a deep learning model is trained with the optimal split size search and pipelining technique disclosed herein, the number of images that a single GPU actually processes at a time is the optimal split size, which is smaller than the input training mini-batch size. Each GPU therefore needs less memory to perform its computation (image processing), which is why a larger mini-batch can be trained than with the conventional model parallelism technique.

Taking the experimental examples of FIGS. 4A and 4B together, training a deep learning model with pipelining and the optimal split size calculated by the technique disclosed herein can increase image throughput by up to 12% compared with the conventional model parallelism technique, and also increases the trainable mini-batch size by 25%, so that a mini-batch twice as large as with a single GPU can be trained.
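The summary percentages can be reproduced from the raw figures reported for FIG. 4B and for the maximum trainable mini-batch sizes. The variable names in this short Python check are illustrative assumptions; the numbers are the ones quoted in the description.

```python
# Throughput at mini-batch size 16 (images/sec), as quoted for FIG. 4B.
pipelined_throughput = 15.66
model_parallel_throughput = 13.91

# Maximum trainable mini-batch sizes, as quoted in the description.
single_gpu_batch = 10
model_parallel_batch = 16
pipelined_batch = 20

throughput_gain = (pipelined_throughput - model_parallel_throughput) * 100.0 / model_parallel_throughput
model_parallel_vs_single = (model_parallel_batch - single_gpu_batch) * 100.0 / single_gpu_batch
pipelined_vs_model_parallel = (pipelined_batch - model_parallel_batch) * 100.0 / model_parallel_batch
pipelined_vs_single_ratio = pipelined_batch / single_gpu_batch

print(round(throughput_gain, 1))    # 12.6, consistent with "about 12%"
print(model_parallel_vs_single)     # 60.0
print(pipelined_vs_model_parallel)  # 25.0
print(pipelined_vs_single_ratio)    # 2.0
```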

Hereinafter, the operation flow of the present application will be briefly described based on the details given above.

FIG. 5 is an operation flowchart of a method for determining an optimal split size when training a deep learning model using a multi-GPU according to an embodiment of the present application.

The method for determining an optimal split size when training a deep learning model using a multi-GPU shown in FIG. 5 may be performed by the optimal split size determination apparatus 100 described above. Therefore, even where omitted below, the description given for the optimal split size determination apparatus 100 applies equally to the description of the method.

Referring to FIG. 5, in step S11, the initial split operation unit 110 may (a) calculate the initial split size (S₀) based on the number of GPUs included in the multi-GPU and the memory size of the multi-GPU.

According to an embodiment of the present application, in step S11, the initial split operation unit 110 may calculate the initial split size (S₀) based on Equations 1 and 2 described above.

Next, in step S12, the initial split operation unit 110 may (b) calculate the initial execution time (t₀) required to perform the preset number of training iterations based on the calculated initial split size (S₀).

Next, in step S13, the optimal split search unit 120 may (c) obtain, based on the initial split size (S₀), the initial execution time (t₀), and the relationship established among the (n-1)-th split size, the n-th split size, and the (n+1)-th split size, the following: the n-th split size (S_n); the n-th execution time (t_n) required to perform the preset number of training iterations based on the n-th split size (S_n); the (n+1)-th split size (S_{n+1}); and the (n+1)-th execution time (t_{n+1}) required to perform the preset number of training iterations based on the (n+1)-th split size (S_{n+1}).

In other words, in step S13 (step (c)), the optimal split search unit 120 may obtain S_n, t_n, S_{n+1}, and t_{n+1} based on S₀, t₀, and the preset relationship among S_{n-1}, S_n, and S_{n+1}.

In addition, according to an embodiment of the present application, the relationship established in step S13 (in other words, in step (c)) may be a relationship in which the (n-1)-th split size, the n-th split size, and the (n+1)-th split size satisfy

$$S_{n+1} = \frac{S_n + S_{n-1}}{2}$$

Next, in step S14, the optimal split search unit 120 may (d) determine the (n+1)-th split size (S_{n+1}) as the optimal split size if the time difference between the n-th execution time (t_n) and the (n+1)-th execution time (t_{n+1}) is within the preset time difference.

Specifically, in step S14, if the n-th execution time (t_n) and the (n+1)-th execution time (t_{n+1}) obtained in step S13 satisfy the inequality of Equation 3 described above, the optimal split search unit 120 may determine the (n+1)-th split size (S_{n+1}) as the optimal split size.

According to an embodiment of the present application, steps S13 and S14 described above (that is, steps (c) and (d)) may be performed repeatedly while incrementing n by 1 starting from 0. More specifically, referring to FIG. 5, steps S13 and S14 are repeated while n is incremented by 1 from 0, and when, in step S14, the condition that the time difference between the n-th execution time (t_n) and the (n+1)-th execution time (t_{n+1}) is within the preset time difference is satisfied ('YES' in step S14), the (n+1)-th split size (S_{n+1}) may be determined as the optimal split size and the iteration may end.

Otherwise, if in step S14 the condition that the time difference between the n-th execution time (t_n) and the (n+1)-th execution time (t_{n+1}) is within the preset time difference is not satisfied ('NO' in step S14), the optimal split search unit 120 may increment n by 1 (n = n+1) and return to step S13 (step (c)) to perform the next iteration of the optimal split size search process for the incremented n.

In addition, according to an embodiment of the present application, the n-th split size (S_n) obtained when steps S13 and S14 are repeated may be the (n+1)-th split size (S_{n+1}) of the previous iteration. Likewise, the n-th execution time (t_n) obtained when steps S13 and S14 are repeated may be the (n+1)-th execution time (t_{n+1}) of the previous iteration.

In the above description, steps S11 to S14 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present application. In addition, some steps may be omitted as necessary, and the order of the steps may be changed.

FIG. 6 is an operation flowchart of a method for training a deep learning model based on an optimal split size using a multi-GPU according to an embodiment of the present application.

The method for training a deep learning model based on an optimal split size using a multi-GPU shown in FIG. 6 may be performed by the deep learning model training system that includes the optimal split size determination apparatus 100 described above. Therefore, even where omitted below, the description given for the deep learning model training system applies equally to the description of FIG. 6.

Referring to FIG. 6, in step S21, the optimal split size determination apparatus 100 may determine the optimal split size to be used for training a given deep learning model.

Next, in step S22, the deep learning model training system may train the given deep learning model based on the determined optimal split size.

In the above description, steps S21 and S22 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present application. In addition, some steps may be omitted as necessary, and the order of the steps may be changed.

FIG. 7 is a detailed operation flowchart of the step of training the deep learning model based on the determined optimal split size.

Referring to FIG. 7, in step S221, the deep learning model training system may allocate, to the first GPU of the multi-GPU, a first split among a plurality of splits obtained by dividing a preset mini-batch based on the determined optimal split size.

Next, in step S222, the deep learning model training system may transmit the result of processing the first split by the first GPU to the second GPU of the multi-GPU.

Next, in step S223, the second GPU of the deep learning model training system may calculate a gradient and update the weights based on the received processing result for the first split.

Next, in step S224, the deep learning model training system may allocate, to the first GPU, a second split among the plurality of splits obtained by dividing the mini-batch based on the optimal split size.

Here, the step of calculating the gradient and updating the weights (step S223) and the step of allocating the second split (step S224) are performed by the second GPU and the first GPU, respectively, and may be started in parallel within a preset time difference of each other.
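The allocation order of steps S221 to S224 can be sketched in plain Python. This is a minimal sketch and not the applicant's implementation: `make_splits` and `pipeline_schedule` are hypothetical names, the mini-batch is represented as a list of sample indices, and letting the last split be smaller when the split size does not divide the mini-batch evenly is an assumption, since the description does not say how a remainder is handled.

```python
def make_splits(mini_batch, split_size):
    """Divide a mini-batch (a list of samples) into consecutive splits."""
    return [mini_batch[i:i + split_size]
            for i in range(0, len(mini_batch), split_size)]


def pipeline_schedule(num_splits):
    """Return, per time step, which split each of the two GPUs works on.

    GPU 1 starts split k+1 while GPU 2 computes the gradient for split k.
    """
    schedule = []
    for step in range(num_splits + 1):
        gpu1 = f"split {step + 1}" if step < num_splits else "idle"
        gpu2 = f"split {step}" if step >= 1 else "idle"
        schedule.append((gpu1, gpu2))
    return schedule


splits = make_splits(list(range(16)), 3)  # mini-batch of 16, split size 3
print([len(s) for s in splits])           # [3, 3, 3, 3, 3, 1]
for gpu1, gpu2 in pipeline_schedule(3):
    print(f"GPU 1: {gpu1:8s} GPU 2: {gpu2}")
```

In the printed schedule, both GPUs are busy at every step except the first and the last, which is the overlap between steps S223 and S224 that the description refers to.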

In the above description, steps S221 to S224 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present application. In addition, some steps may be omitted as necessary, and the order of the steps may be changed.

The method for determining an optimal split size when training a deep learning model using a multi-GPU according to an embodiment of the present application may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the art of computer software. Examples of the computer-readable recording medium include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the present invention, and vice versa.

In addition, the above-described method for determining the optimal split size when training a deep learning model using multi-GPU may also be implemented in the form of a computer program or application that is stored on a recording medium and executed by a computer.

The above description of the present application is for illustration, and those of ordinary skill in the art to which the present application pertains will understand that it can be readily modified into other specific forms without changing the technical spirit or essential features of the present application. Therefore, the embodiments described above should be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in a combined form.

The scope of the present application is defined by the following claims rather than by the above detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as being included in the scope of the present application.

100: Apparatus for determining the optimal split size when training a deep learning model using multi-GPU
110: Initial split operation unit
120: Optimal split search unit

Claims

1. A method for determining an optimal split size when training a deep learning model using multi-GPU, the method comprising:
(a) calculating an initial split size based on the number of GPUs included in the multi-GPU and the memory size of the multi-GPU;
(b) calculating an initial execution time required for training repeated a preset number of times based on the initial split size;
(c) obtaining, based on the initial split size, the initial execution time, and a relationship set among an (n-1)-th split size, an n-th split size, and an (n+1)-th split size, the n-th split size, an n-th execution time required for training repeated a preset number of times based on the n-th split size, the (n+1)-th split size, and an (n+1)-th execution time required for training repeated a preset number of times based on the (n+1)-th split size; and
(d) determining the (n+1)-th split size as the optimal split size when the time difference between the n-th execution time and the (n+1)-th execution time is within a preset time difference.

2. The method of claim 1, wherein
the initial split size is denoted by $S_0$, corresponding to n = 0,
the (n-1)-th split size is denoted by $S_{n-1}$,
the n-th split size is denoted by $S_n$,
the (n+1)-th split size is denoted by $S_{n+1}$,
$S_{n-1}$ is 0 when n is 0, and
steps (c) and (d) are performed repeatedly while incrementing n by 1 starting from 0.

3. The method of claim 2,
wherein steps (c) and (d) are performed repeatedly while incrementing n by 1 starting from 0, and, when the time difference between the n-th execution time and the (n+1)-th execution time in step (d) satisfies the condition of being within the preset time difference, the (n+1)-th split size is determined as the optimal split size and the repetition is terminated.

4. The method of claim 2,
wherein the $S_n$ obtained in each repetition of steps (c) and (d) is the $S_{n+1}$ of the previous iteration.

5. The method of claim 4,
wherein the n-th execution time obtained in each repetition of steps (c) and (d) is the (n+1)-th execution time of the previous iteration.

6. The method of claim 1,
wherein, in step (a), the initial split size is calculated based on Equations 1 and 2 below,
[Equation 1]

[Equation 2]

Here, $S_0$ is the initial split size, M is the sum of the memory sizes of all GPUs included in the multi-GPU, s is the maximum training mini-batch size that a GPU can process when the memory size of any one GPU included in the multi-GPU is m, N is the number of GPUs included in the multi-GPU, and

is the size of the deep learning model, the optimal split size determination method.

7. The method of claim 2,
The set relationship in step (c) is that the (n-1)-th split size, the n-th split size, and the (n+1)-th split size are

A method for determining the optimal split size, which is a relationship that satisfies

8. The method of claim 2,
wherein, in step (d),
if the n-th execution time and the (n+1)-th execution time satisfy the inequality of Equation 3 below, the (n+1)-th split size is determined as the optimal split size,
[Equation 3]

Here, $t_n$ is the n-th execution time, $t_{n+1}$ is the (n+1)-th execution time, and T is the preset time difference.
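
The search recited in claims 1 to 5 and 8 can be sketched roughly as follows. This is a minimal illustration, not the claimed implementation. The relationship among the split sizes and the exact form of Equation 3 are not reproduced in this text, so the sketch takes the relationship as a caller-supplied function `next_split` (hypothetical) and assumes the stopping test is `abs(t_n - t_next) <= tolerance`. The names `find_optimal_split`, `measure_time`, and `max_steps` are likewise assumptions introduced for illustration.

```python
from typing import Callable, Tuple


def find_optimal_split(
    initial_split: int,
    measure_time: Callable[[int], float],
    next_split: Callable[[int, int], int],
    tolerance: float,
    max_steps: int = 32,
) -> Tuple[int, float]:
    """Search for a split size whose execution time has stopped changing.

    measure_time(split) is assumed to run the preset number of training
    iterations with that split size and return the elapsed time.
    next_split(s_prev, s_cur) stands in for the relationship among
    S_{n-1}, S_n and S_{n+1}, which is not reproduced in this text.
    """
    s_prev, s_cur = 0, initial_split   # S_{n-1} is 0 when n is 0 (claim 2)
    t_cur = measure_time(s_cur)        # initial execution time, step (b)

    for _ in range(max_steps):         # max_steps is a safety bound (assumption)
        s_next = next_split(s_prev, s_cur)   # step (c): (n+1)-th split size
        t_next = measure_time(s_next)        # step (c): (n+1)-th execution time

        # Step (d): assumed form of the "within a preset time difference" test.
        if abs(t_cur - t_next) <= tolerance:
            return s_next, t_next

        # Claims 4 and 5: the next S_n and t_n are this iteration's
        # S_{n+1} and t_{n+1}, so nothing is measured twice.
        s_prev, s_cur, t_cur = s_cur, s_next, t_next

    raise RuntimeError("search did not converge within max_steps")
```

Because each iteration reuses the previous measurement, the number of timing runs is one for the initial split plus one per search step.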

9. A deep learning model training method based on an optimal split size using multi-GPU, the method comprising:
determining an optimal split size used for training a predetermined deep learning model based on the optimal split size determining method according to claim 1; and
training the deep learning model based on the determined optimal split size.

10. The method of claim 9,
wherein the step of training the deep learning model comprises:
allocating a first split, among a plurality of splits obtained by dividing a preset mini-batch based on the optimal split size, to a first GPU of the multi-GPU;
transmitting a processing result of the first split by the first GPU to a second GPU of the multi-GPU;
calculating, by the second GPU, a gradient based on the processing result and changing a weight; and
allocating a second split among the plurality of splits to the first GPU,
and wherein the step of calculating the gradient and changing the weight and the step of allocating the second split are performed by the first GPU and the second GPU and are started in parallel within a preset time difference of each other.
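
The overlap described in claim 10 can be pictured with a small scheduling sketch. It only models which split each of two GPUs is working on at each step and performs no training. The function names `split_minibatch` and `pipeline_schedule`, and the reduction of the pipeline to discrete steps, are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def split_minibatch(num_samples: int, split_size: int) -> List[Tuple[int, int]]:
    """Divide a mini-batch of num_samples into [start, end) index ranges."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    return [
        (start, min(start + split_size, num_samples))
        for start in range(0, num_samples, split_size)
    ]


def pipeline_schedule(num_splits: int) -> List[Tuple[Optional[int], Optional[int]]]:
    """Return, per step, (split on the first GPU, split on the second GPU).

    At step k the first GPU receives split k while the second GPU computes
    the gradient for split k - 1, so the two start side by side.
    None means that GPU is idle at that step.
    """
    if num_splits <= 0:
        return []
    schedule = []
    for step in range(num_splits + 1):
        on_first = step if step < num_splits else None
        on_second = step - 1 if step >= 1 else None
        schedule.append((on_first, on_second))
    return schedule
```

For example, a mini-batch of 10 samples with a split size of 4 yields three splits, and the schedule has four steps: the first GPU is idle only in the last step and the second GPU only in the first.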

11. An optimal split determination device for deep learning model training using multi-GPU, the device comprising:
an initial split operation unit that calculates an initial split size based on the number of GPUs included in the multi-GPU and the memory size of the multi-GPU, and calculates an initial execution time required for training repeated a preset number of times based on the initial split size; and
an optimal split search unit that obtains, based on the initial split size, the initial execution time, and a relationship set among an (n-1)-th split size, an n-th split size, and an (n+1)-th split size, the n-th split size, an n-th execution time required for training repeated a preset number of times based on the n-th split size, the (n+1)-th split size, and an (n+1)-th execution time required for training repeated a preset number of times based on the (n+1)-th split size, and that determines the (n+1)-th split size as the optimal split size when the time difference between the n-th execution time and the (n+1)-th execution time is within a preset time difference.

12. The device of claim 11, wherein
the initial split size is denoted by $S_0$, corresponding to n = 0,
the (n-1)-th split size is denoted by $S_{n-1}$,
the n-th split size is denoted by $S_n$,
the (n+1)-th split size is denoted by $S_{n+1}$,
and the optimal split search unit repeatedly performs a process of searching for the optimal split size while incrementing n by 1 starting from 0.

13. The device of claim 12,
wherein the $S_n$ and the n-th execution time obtained while the optimal split search unit repeatedly performs the process are, respectively, the $S_{n+1}$ and the (n+1)-th execution time of the previous iteration.

14. A computer-readable recording medium on which is recorded a program for executing, on a computer, the method according to any one of claims 1 to 10.