KR20200109917A - Method for estimating learning speed of gpu-based distributed deep learning model and recording medium thereof

Info

Publication number: KR20200109917A
Application number: KR1020190029742A
Authority: KR
Inventors: 박경수; 황창호; 김태현; 손규호; 신진우
Original assignee: 한국과학기술원
Priority date: 2019-03-15
Filing date: 2019-03-15
Publication date: 2020-09-23

Abstract

The present invention relates to a method of predicting a learning speed of a GPU-based distributed deep learning model and a recording medium thereof, which are utilized to improve performance of a computing resource scheduler. According to the present invention, the method of predicting the learning speed of the GPU-based distributed deep learning model includes: a measuring step of measuring a first time taken to calculate a gradient value of each of model variables of the deep learning model and write the calculated gradient value to a CPU memory from a time point at which data of an input data batch is input in one GPU, and a second time taken to update each of the model variables in a CPU; and a learning time prediction step of predicting a learning time of the deep learning model by using the first time and the second time.

Description

Method and recording medium for predicting the learning speed of a GPU-based distributed deep learning model {METHOD FOR ESTIMATING LEARNING SPEED OF GPU-BASED DISTRIBUTED DEEP LEARNING MODEL AND RECORDING MEDIUM THEREOF}

The present invention relates to a method and a recording medium for predicting the learning speed of a deep learning model, and more particularly, to a method for predicting the learning speed of a GPU-based distributed deep learning model, capable of accurately predicting the learning speed as a function of the number of GPUs of the GPU cluster used to train the deep learning model, and to a recording medium on which a program for performing the method is recorded.

Training jobs for deep learning models typically occupy a very large number of GPUs for a long time. Consequently, to train multiple deep learning models efficiently and concurrently on a single GPU cluster, centralized job scheduling is needed that distributes GPU resources so as to optimize overall training performance.

To optimize the overall training performance of multiple deep learning models on a single GPU cluster, one must first know what learning speed each training job achieves when it is allocated a given number of GPUs. One way to find out is to actually allocate that number of GPUs to each training job and measure the learning speed. However, the trial-and-error computation required for such measurement wastes many GPUs for a very long time and therefore severely hinders the progress of the training jobs themselves.

Non-Patent Document 1 does not take into account that, in training a deep learning model, the backward passes of different model variables can overlap with one another. For example, while one model variable is being exchanged between servers over the network, the gradient of another model variable can be computed at the same time. Because Non-Patent Document 1 ignores this and bases its prediction on the assumption that gradient computation, variable update, and networking for all variables proceed sequentially, the actually measured time was often much shorter than the predicted time.

In addition, when predicting the time needed to exchange data over the network, Non-Patent Document 1 simply divides the amount of data to be sent by the network bandwidth.

As described above, training multiple deep learning models by using several GPUs of a single GPU cluster at the same time requires a large number of GPUs for a long time. Therefore, to use the GPUs efficiently, it is necessary to determine accurately how the learning speed of a model improves with the number of GPUs and to allocate an appropriate number of GPUs accordingly.

Non-Patent Document 1: Yanghua Peng, Yixin Bao, Yangrui Chen, Chuan Wu, and Chuanxiong Guo. Optimus: an efficient dynamic resource scheduler for deep learning clusters. In Proceedings of the Thirteenth EuroSys Conference (EuroSys), 2018.

The problem to be solved by the present invention is to provide a method and a recording medium for predicting the learning speed of a GPU-based distributed deep learning model that can accurately predict the learning speed as a function of the number of GPUs used in training while minimizing actual measurement of the learning speed.

A method for predicting the learning speed of a deep learning model according to an embodiment of the present invention is a method for predicting the learning speed in a GPU cluster having a plurality of GPUs for concurrently training a plurality of deep learning models, and includes: a measuring step of measuring a first time taken, on one GPU, from the time point at which data of an input data batch is input until the gradient value of each model variable of the deep learning model is calculated and written to CPU memory, and a second time taken to update each model variable on the CPU; and a learning time prediction step of predicting a learning time of the deep learning model using the first time and the second time. The learning time prediction step includes: a first prediction step of predicting, for each model variable and based on the first time, the time from the time point at which the data is input until the gradient value of that model variable is written to the CPU memory; a calculation step of calculating, based on the second time, the time taken to update that model variable on the CPU; a second prediction step of predicting the time taken to exchange that model variable and its gradient value between servers over a network; a summation step of calculating a summed time for each model variable by adding the time predicted in the first prediction step, the time calculated in the calculation step, and the time predicted in the second prediction step; and a step of predicting the maximum of the summed times over all model variables as the learning time.

Here, in the measuring step, the first time may be measured while the size of the input data batch is repeatedly doubled, starting from 1.

Here, in the measuring step, the first time for an input data batch of arbitrary size may be predicted by linearly fitting the intervals between the plurality of measured first times.

Here, the time predicted in the second prediction step may be determined by the following equation.

<Equation>

2*S*(n-1)*(1+c*H_{n-1})/(n*W),

where n is the number of servers over which the plurality of GPUs are distributed, W is the network bandwidth between the servers, S is the size of the corresponding model variable, H_{n-1} is the (n-1)-th harmonic number, and c is a correction constant.

Here, the method may further include a correction step of replacing the correction constant with another correction constant when the difference between the predicted learning time and the actual learning time exceeds a preset threshold.

The method and recording medium for predicting the learning speed of a GPU-based distributed deep learning model according to an embodiment of the present invention predict the improvement in learning speed quickly and accurately, which makes it possible to manage a single GPU cluster efficiently and, further, can be used to improve the performance of a computing resource scheduler.

In addition, when the method and recording medium for predicting the learning speed of a GPU-based distributed deep learning model are used to manage a GPU cluster operated by a company that develops deep-learning-based services or a company that provides cloud services for deep learning, a performance improvement of about 4 times can be expected in terms of average job completion time, and the utilization of GPU resources can be greatly increased accordingly.

FIG. 1 is a flowchart illustrating a method for predicting the learning speed of a GPU-based distributed deep learning model according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating the learning time prediction step (S300) shown in FIG. 1.
FIG. 3 is a flowchart illustrating a method for predicting the learning speed of a GPU-based distributed deep learning model according to another embodiment of the present invention.

The following detailed description of the present invention refers to the accompanying drawings, which show, by way of illustration, specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. It should be understood that the various embodiments of the present invention differ from one another but need not be mutually exclusive. For example, a particular shape, structure, or characteristic described herein in connection with one embodiment may be implemented in another embodiment without departing from the spirit and scope of the invention. It should also be understood that the position or arrangement of individual elements within each disclosed embodiment may be changed without departing from the spirit and scope of the invention. The following detailed description is therefore not to be taken in a limiting sense, and the scope of the present invention, if properly described, is limited only by the appended claims together with the full range of equivalents to which those claims are entitled. In the drawings, like reference numerals refer to the same or similar functions throughout the several aspects.

In the method for predicting the learning speed of a GPU-based distributed deep learning model according to embodiments of the present invention, only one GPU is allocated to the training of an arbitrary deep learning model, the learning speed is measured, and, based on that measurement, the learning speed obtained when an arbitrary number of GPUs of the same type are allocated can be accurately predicted.

Here, a 'deep learning model' is a feed-forward network composed of several stages of operations. Input data is fed through a series of operations by the layers of the deep learning model. The final layer computes a loss, which represents the amount of error, that is, the difference between the true value and the model output. Based on the loss value, stochastic gradient descent (SGD) with back-propagation readjusts the parameters of each layer so that the output of the deep learning model approaches the true value. Examples of deep learning models include convolutional neural networks (CNNs), RNNs, and DNNs. Such deep learning models are generally resource intensive, with GPUs serving as the main computation driver. Given that the number of GPUs employed generally determines the baseline performance, the GPU is used as the basic resource to be allocated.

Here, the predicted 'learning speed' of a deep learning model specifically means the time it takes stochastic gradient descent (SGD) or a similar variant (momentum SGD, Adam SGD, etc.) to complete the forward pass, the backward pass, the model variable update, and related steps needed to process (or learn) one input data batch.

In the method for predicting the learning speed of a GPU-based distributed deep learning model according to an embodiment of the present invention, it is assumed that 1) when training uses several GPUs at the same time, the input data batch is divided equally among the GPUs and processed using the bulk synchronous parallel (BSP) scheme, and 2) when the GPUs are distributed over several servers, the network bandwidth between the servers is the same for all servers.
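The two assumptions can be turned into the inputs of the prediction with a short sketch. The function names are hypothetical, and two points are assumptions not stated in the description: that the single-GPU measurement is evaluated at the per-GPU share of the batch, and that GPUs are packed densely onto servers when the server count n is derived.

```python
def per_gpu_batch_size(global_batch, num_gpus):
    """Share of the input data batch handled by each GPU under the
    equal-split BSP assumption (assumption 1)."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return global_batch / num_gpus


def servers_needed(num_gpus, gpus_per_server):
    """Number of servers n spanned by the GPUs, assuming dense packing.
    Dense packing is an illustrative assumption, not part of the method."""
    if gpus_per_server < 1:
        raise ValueError("gpus_per_server must be at least 1")
    return -(-num_gpus // gpus_per_server)  # ceiling division


# Example: a batch of 256 samples on 8 GPUs, 4 GPUs per server.
share = per_gpu_batch_size(256, 8)   # 32.0 samples per GPU
n = servers_needed(8, 4)             # 2 servers
```

Because assumption 2 fixes a single bandwidth W for all server pairs, n and W are the only network inputs the later steps need.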

Specifically, the method for predicting the learning speed of a GPU-based distributed deep learning model according to an embodiment of the present invention is described in detail below with reference to the accompanying drawings.

FIG. 1 is a flowchart illustrating a method for predicting the learning speed of a GPU-based distributed deep learning model according to an embodiment of the present invention.

In the present invention, an equation (hereinafter, the 'prediction equation') that predicts the time a deep learning model takes to learn one input data batch as a function of the number of GPUs is established through mathematical modeling. To obtain the values of the variables required by the prediction equation, a first time (T1) and a second time (T2) are first measured using one GPU (S100).

The first time (T1) is the time taken, when one input data batch is processed using one GPU, from the time point at which the data is input until the gradient value of each model variable is calculated and written to CPU memory. The first time is measured while the size of the input data batch is repeatedly doubled, starting from 1, and the intervals between the measured first times are linearly fitted. This makes it possible to predict, for an input data batch of arbitrary size, the time taken until the gradient value of each model variable is calculated and written to CPU memory.
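The measurement schedule and the fit described above can be sketched as follows for a single model variable. This is a minimal illustration, not the patented implementation: the function names and the timing values are made up, and "linear fitting of the intervals" is read here as piecewise-linear interpolation between neighbouring measurements, implemented with `numpy.interp`.

```python
import numpy as np


def doubling_batch_sizes(max_batch):
    """Batch sizes 1, 2, 4, ... not exceeding max_batch."""
    sizes = []
    b = 1
    while b <= max_batch:
        sizes.append(b)
        b *= 2
    return sizes


def predict_t1(batch_size, measured_sizes, measured_times):
    """Piecewise-linear estimate of T1 for one model variable.
    Note: np.interp clamps to the end values outside the measured range."""
    return float(np.interp(batch_size, measured_sizes, measured_times))


sizes = doubling_batch_sizes(16)              # [1, 2, 4, 8, 16]
times = [0.010, 0.012, 0.016, 0.024, 0.040]   # seconds, illustrative only
t1_at_6 = predict_t1(6, sizes, times)         # halfway between 4 and 8
```

In practice one such table of measurements would be kept per model variable, since T1 is defined separately for each variable.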

The second time (T2) is the time taken to update each model variable on the CPU.

Once the first time (T1) and the second time (T2) have been measured (S100), the learning time is predicted using the measured first time (T1) and second time (T2) (S300). The step of predicting the learning time (S300) is described in detail with reference to FIG. 2.

FIG. 2 is a flowchart illustrating the learning time prediction step (S300) shown in FIG. 1.

In describing the learning time prediction step (S300) shown in FIGS. 1 and 2, the GPUs used for training are assumed to be distributed over a total of n servers (or nodes), and the network bandwidth between servers (or nodes) is denoted W.

In the learning time prediction step (S300) shown in FIG. 1, steps S310, S330, S350, S370, and S390 are performed for each model variable, as shown in FIG. 2.

Step S310 predicts, from the measured first time (T1), the time (t1) from data input until the gradient value of the corresponding model variable is written to CPU memory.

Step S330 calculates, from the measured second time (T2), the time (t2) taken to update the corresponding model variable.

Step S350 predicts the time (t3) taken to exchange the corresponding model variable and its gradient value between servers (or nodes) over the network. Here, the time (t3) may be predicted using <Equation 1> below.

<Equation 1>

t3 = 2*S*(n-1)*(1+c*H_{n-1})/(n*W)

In <Equation 1>, n is the number of servers (or nodes) over which the GPUs used for training are distributed, W is the network bandwidth between servers (or nodes), S is the size of the corresponding model variable, and H_{n-1} is the (n-1)-th harmonic number. In addition, c is a correction constant that is initially set to a sufficiently small value. The correction constant (c) reflects network overhead and may be set to 0.01. The equation for predicting the time (t3) is not limited to <Equation 1>, and it should be understood that other equations that yield the same or an equivalent result are also included.
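As an illustration, <Equation 1> can be evaluated directly. The sketch below is a plain transcription of the formula, t3 = 2*S*(n-1)*(1+c*H_{n-1})/(n*W). The function names are hypothetical, and the units (S in bytes, W in bytes per second) are an assumption, since the description does not fix them.

```python
def harmonic(k):
    """k-th harmonic number H_k = 1 + 1/2 + ... + 1/k, with H_0 = 0."""
    return sum(1.0 / i for i in range(1, k + 1))


def network_time(size, n_servers, bandwidth, c=0.01):
    """t3 from Equation 1 for one model variable.
    size: S, n_servers: n, bandwidth: W, c: correction constant."""
    n = n_servers
    overhead = 1.0 + c * harmonic(n - 1)
    return 2.0 * size * (n - 1) * overhead / (n * bandwidth)


# With a single server nothing crosses the network, so t3 is zero.
t3_single = network_time(100.0, 1, 10.0)
# With two servers, H_1 = 1 and t3 = S * (1 + c) / W.
t3_pair = network_time(100.0, 2, 10.0)
```

The default c=0.01 follows the initial value mentioned in the description; any other default would be a tuning choice.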

Step S370 adds together t1 predicted in step S310, t2 calculated in step S330, and t3 predicted in step S350. Step S370 thus yields a summed time for each model variable.

Step S390 predicts the maximum of the summed times of all model variables as the 'learning time'. Because steps S310 to S370 are performed for each model variable, a summed time is ultimately obtained for every model variable, and the largest of these summed times is taken as the learning time of the deep learning model.
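Steps S310 to S390 can be put together in a compact sketch. The data structure, names, and numbers below are hypothetical. The per-variable t1 and t2 values are taken as already predicted (S310, S330), and t3 is computed with the formula 2*S*(n-1)*(1+c*H_{n-1})/(n*W) given in the description.

```python
from dataclasses import dataclass


@dataclass
class VariableTiming:
    name: str
    t1: float    # S310: input until gradient is written to CPU memory
    t2: float    # S330: time to update the variable on the CPU
    size: float  # S: size of the variable (bytes, assumed unit)


def harmonic(k):
    """k-th harmonic number, with H_0 = 0."""
    return sum(1.0 / i for i in range(1, k + 1))


def network_time(size, n, bandwidth, c=0.01):
    """S350: t3 = 2*S*(n-1)*(1+c*H_{n-1})/(n*W)."""
    return 2.0 * size * (n - 1) * (1.0 + c * harmonic(n - 1)) / (n * bandwidth)


def predict_learning_time(variables, n, bandwidth, c=0.01):
    """S370 sums t1 + t2 + t3 per variable; S390 takes the maximum."""
    totals = {
        v.name: v.t1 + v.t2 + network_time(v.size, n, bandwidth, c)
        for v in variables
    }
    return max(totals.values()), totals


# Illustrative values: the last layer's gradient is ready first (small t1)
# but the variable is large, so its network time dominates.
model = [
    VariableTiming("conv1", t1=0.030, t2=0.001, size=1e6),
    VariableTiming("fc", t1=0.010, t2=0.005, size=4e7),
]
learning_time, per_variable = predict_learning_time(model, n=2, bandwidth=1e9)
```

Returning the per-variable totals alongside the maximum makes it easy to see which variable determines the predicted learning time.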

In the method for predicting the learning speed of a GPU-based distributed deep learning model according to the embodiment of the present invention shown in FIGS. 1 and 2, how the training algorithm of a deep learning model (for example, SGD) operates when it is distributed and processed on a single GPU cluster can be accurately understood at the system level and modeled mathematically.

In comparison, Non-Patent Document 1 does not take into account that, in training a deep learning model, the backward passes of different model variables can overlap with one another. For example, while one model variable is being exchanged between servers over the network, the gradient value of another model variable can be computed at the same time. Because Non-Patent Document 1 ignores this and bases its prediction on the assumption that gradient computation, model variable update, and networking for all model variables proceed sequentially, the actually measured time was often much shorter than the predicted time. In contrast, the method according to the embodiment of the present invention shown in FIGS. 1 and 2 computes the backward pass completion time of each model variable independently and predicts the maximum over all variables as the final learning time, so that the overlap is reflected in the prediction.
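The difference between the two views can be seen on a toy example. The sketch below is one possible reading, with made-up numbers: the "sequential" estimate adds the gradient phase, all updates, and all network transfers one after another, while the overlap-aware estimate takes the maximum of the per-variable sums. How Non-Patent Document 1 actually combines its terms is not reproduced here; the sequential formula is an illustrative assumption.

```python
# (t1, t2, t3) per model variable, in seconds; values are illustrative only.
# t1 is measured from data input, so the gradient phase ends at max(t1).
timings = [
    (0.010, 0.005, 0.040),  # large late-layer variable, gradient ready early
    (0.030, 0.001, 0.001),  # small early-layer variable, gradient ready last
]

# Sequential view: everything happens back to back.
sequential = (
    max(t1 for t1, _, _ in timings)
    + sum(t2 for _, t2, _ in timings)
    + sum(t3 for _, _, t3 in timings)
)

# Overlap-aware view: each variable finishes independently; take the slowest.
overlapped = max(t1 + t2 + t3 for t1, t2, t3 in timings)

print(f"sequential estimate: {sequential:.3f} s")
print(f"overlap-aware estimate: {overlapped:.3f} s")
```

With these numbers the sequential estimate is noticeably larger, which matches the observation that measured times were often much shorter than sequential predictions.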

In addition, whereas Non-Patent Document 1 simply divides the amount of data to be sent by the network bandwidth (W) when predicting the time needed to exchange data over the network, the method according to the embodiment of the present invention shown in FIGS. 1 and 2 adds the term (1+c*H_{n-1}), which estimates the synchronization overhead that grows probabilistically as the number (n) of servers communicating with one another increases, and can therefore predict the networking time much more accurately.
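To make the role of the added term concrete, the network time 2*S*(n-1)*(1+c*H_{n-1})/(n*W) can be split into a factor that depends only on data size and bandwidth and the overhead factor. The labels on the two factors are descriptive only, and the numeric case (n = 4, c = 0.01) is chosen purely for illustration.

```latex
\[
t_3 \;=\;
\underbrace{\frac{2S(n-1)}{nW}}_{\text{data size and bandwidth only}}
\times
\underbrace{\bigl(1 + c\,H_{n-1}\bigr)}_{\text{synchronization overhead}},
\qquad
H_{n-1} = \sum_{k=1}^{n-1} \frac{1}{k}
\]
\[
n = 4,\; c = 0.01:\quad
H_3 = 1 + \tfrac{1}{2} + \tfrac{1}{3} = \tfrac{11}{6} \approx 1.833,\quad
1 + c\,H_3 \approx 1.018,\quad
t_3 \approx 1.018 \times \frac{1.5\,S}{W}
\]
```

Since H_{n-1} grows roughly logarithmically in n, the overhead factor rises slowly as more servers are added.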

FIG. 3 is a flowchart illustrating a method for predicting the learning speed of a GPU-based distributed deep learning model according to another embodiment of the present invention.

The method for predicting the learning speed of a GPU-based distributed deep learning model according to another embodiment of the present invention, shown in FIG. 3, adds step S500 to the method according to the embodiment shown in FIGS. 1 and 2. Therefore, only step S500 is described in detail below, and the description of steps S100 and S300 given above applies.

Referring to FIG. 3, the learning time predicted in step S300 may differ from the actual learning time. For example, when the difference between the predicted learning time and the actual learning time exceeds a preset threshold (for example, 5%), the predicted learning time may be corrected by changing the value of the correction constant (c) in <Equation 1> to another preset value (S500).
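Step S500 can be sketched as a simple check-and-replace routine. The description only states that c is changed to another preset value when the threshold is exceeded. Picking, from a list of preset candidates, the one that minimizes the relative error is an assumption made here for illustration, as are all names and numbers.

```python
def relative_error(predicted, actual):
    """Relative difference between predicted and measured learning time."""
    return abs(predicted - actual) / actual


def recalibrate_c(predict, actual_time, c, candidates, threshold=0.05):
    """Keep c if the prediction is within the threshold; otherwise switch to
    the preset candidate with the smallest relative error (assumed policy).
    predict: callable mapping a correction constant to a predicted time."""
    if relative_error(predict(c), actual_time) <= threshold:
        return c
    return min(candidates, key=lambda cand: relative_error(predict(cand), actual_time))


# Toy predictor: a base time scaled by (1 + c * H) with H = 1.5 (illustrative).
def toy_predict(c):
    return 1.0 * (1.0 + c * 1.5)


presets = [0.01, 0.05, 0.1, 0.2]
new_c = recalibrate_c(toy_predict, actual_time=1.2, c=0.01, candidates=presets)
kept_c = recalibrate_c(toy_predict, actual_time=1.02, c=0.01, candidates=presets)
```

In the first call the initial prediction is off by more than 5%, so a different preset is selected; in the second call the prediction is already within the threshold and c is left unchanged.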

The method according to another embodiment of the present invention shown in FIG. 3 provides, in addition to the effects of the method shown in FIGS. 1 and 2, the advantage of more accurate learning time prediction, because it further includes step S500, in which the correction constant (c) is corrected.

The method for predicting the learning speed of a GPU-based distributed deep learning model according to the embodiments shown in FIGS. 1 to 3 may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

The features, structures, effects, and the like described in the embodiments above are included in at least one embodiment of the present invention and are not necessarily limited to a single embodiment. Furthermore, the features, structures, effects, and the like illustrated in each embodiment may be combined or modified for other embodiments by a person having ordinary skill in the field to which the embodiments belong. Accordingly, matters relating to such combinations and modifications should be construed as falling within the scope of the present invention.

In addition, although the description above has focused on the embodiments, these are merely examples and do not limit the present invention, and a person having ordinary skill in the field to which the present invention pertains will appreciate that various modifications and applications not illustrated above are possible without departing from the essential characteristics of the embodiments. For example, each component specifically shown in the embodiments may be modified and implemented. Differences relating to such modifications and applications should be construed as falling within the scope of the present invention as defined in the appended claims.

Claims

A method for predicting a learning speed in a GPU cluster having a plurality of GPUs for concurrently training a plurality of deep learning models, the method comprising:
a measuring step of measuring a first time taken, on one GPU, from a time point at which data of an input data batch is input until a gradient value of each model variable of the deep learning model is calculated and written to a CPU memory, and a second time taken to update each model variable on a CPU; and
a learning time prediction step of predicting a learning time of the deep learning model using the first time and the second time,
wherein the learning time prediction step comprises:
a first prediction step of predicting, for each model variable and based on the first time, a time from the time point at which the data is input until the gradient value of the corresponding model variable is written to the CPU memory;
a calculation step of calculating, based on the second time, a time taken to update the corresponding model variable on the CPU;
a second prediction step of predicting a time taken to exchange the corresponding model variable and the gradient value of the corresponding model variable between servers over a network;
a summation step of calculating a summed time for each model variable by adding the time predicted in the first prediction step, the time calculated in the calculation step, and the time predicted in the second prediction step; and
a step of predicting a maximum value among the summed times of all model variables as the learning time,
whereby the learning speed of a GPU-based distributed deep learning model is predicted.

The method of claim 1, wherein, in the measuring step,
the first time is measured while the size of the input data batch is repeatedly doubled, starting from 1.

The method of claim 2, wherein, in the measuring step,
the first time for an input data batch of arbitrary size is predicted by linearly fitting the intervals between the plurality of measured first times.

The method of claim 1,
wherein the time predicted in the second prediction step is determined by the following equation:
<Equation>
2*S*(n-1)*(1+c*H_{n-1})/(n*W),
where n is the number of servers over which the plurality of GPUs are distributed, W is the network bandwidth between the servers, S is the size of the corresponding model variable, H_{n-1} is the (n-1)-th harmonic number, and c is a correction constant.

The method of claim 4,
further comprising a correction step of replacing the correction constant with another correction constant when the difference between the predicted learning time and the actual learning time exceeds a preset threshold.

A computer-readable recording medium storing a program for executing the method of claim 1 on a computer.