KR102403476B1

KR102403476B1 - Distributed Deep Learning System and Its Operation Method

Info

Publication number: KR102403476B1
Application number: KR1020190137073A
Authority: KR
Inventors: 제현승; 최현성; 김영랑; 이재환
Original assignee: 주식회사 사피온코리아; 한국항공대학교 산학협력단
Priority date: 2019-10-31
Filing date: 2019-10-31
Publication date: 2022-05-27
Also published as: KR20210051604A

Abstract

A distributed deep learning system and a method of operating it are disclosed.
According to one aspect of the present invention, a distributed deep learning system comprises: a global parameter server that stores and updates the parameters of a neural network; and a plurality of worker nodes having different performance. Each worker node includes a local parameter server and a plurality of workers. The local parameter server forwards the parameters received from the global parameter server to each worker, aggregates the gradients received from the plurality of workers, and transmits the aggregated gradient to the global parameter server. The workers are synchronized so that they use the same parameters, and each worker uses those parameters to generate a gradient for training data. The global parameter server updates the parameters using the aggregated gradient each time it receives an aggregated gradient from a local parameter server, and the local parameter servers of the worker nodes are asynchronous with one another, each independently receiving the updated parameters from the global parameter server.

Description

Distributed Deep Learning System and Its Operation Method

Embodiments of the present invention relate to a distributed deep learning system and a method of operating it, and more particularly to a framework between parameter servers and workers.

The content described in this section merely provides background information on the present invention and does not constitute prior art.

Recently, to accelerate deep learning computation, distributed deep learning frameworks that replicate a learning model and use the multiple replicas have been studied. Deep learning frameworks can be classified along three axes: model parallelism versus data parallelism, synchronous versus asynchronous operation, and the parameter server approach versus the collective communication approach.

Model parallelism is a framework that splits one model into several parts and stores them on multiple devices, whereas data parallelism replicates one model into several identical models and stores them on multiple devices. In general, a distributed deep learning framework refers to data parallelism.

The synchronous method updates the model parameters of the replicated models together in a data-parallel framework. In the asynchronous method, each replicated model generates its own gradient and updates the parameters independently based on that gradient.

The parameter server approach refers to a data-parallel framework composed of a parameter server, which stores and updates the parameters, and a plurality of workers, which use the parameters to generate gradients for training data. Here, a gradient is the quantity used to adjust the parameters of the neural network. The parameter server updates the parameters using the gradients received from the workers and then transmits the updated parameters to each worker, thereby updating the workers. In contrast, the collective communication approach is one in which each worker generates a gradient and the workers share their gradients with one another to synchronize the parameters.

FIG. 1 is a diagram for explaining a data-parallel deep learning framework based on the parameter server approach.

Referring to FIG. 1, the parameter-server framework includes a parameter server 100 and a plurality of workers 110, 112, 114, and 116, and can be trained using training data 120, 122, 124, and 126, each of mini-batch size.

The parameter server 100 is a component that stores the parameters of the deep learning model, transmits the parameters W to the plurality of workers 110, 112, 114, and 116, and updates the parameters W using the gradients dW received from the workers 110, 112, 114, and 116. Here, the gradient dW is the quantity the parameter server 100 needs in order to update the parameters of the model.

The plurality of workers 110, 112, 114, and 116 are components that generate gradients for the training data 120, 122, 124, and 126 using the parameters W of the model. For example, the first worker 110 uses the parameters W to generate a gradient for the mini-batch of training data 120 and transmits the gradient dW to the parameter server 100.

The parameter server 100 receives a gradient from each of the workers 110, 112, 114, and 116, aggregates the gradients, and then updates the parameters of the model using the aggregated gradient. The parameter server 100 transmits the updated parameters to the workers 110, 112, 114, and 116, and in this way the distributed deep learning model is trained.

In the synchronous parameter server method, the parameter server 100 updates the model parameters using all of the gradients only after it has received the gradients from every one of the workers 110, 112, 114, and 116, and then transmits the same updated parameters to all of the workers.
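As an illustration only, one round of the synchronous update described above can be sketched in Python. The names sync_ps_step and toy_grad, the averaging of the gradients, the plain gradient-descent update with learning rate lr, and the scalar least-squares model are all assumptions made for this sketch; the description does not specify them.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Batch = Sequence[Tuple[float, float]]  # (x, y) pairs; hypothetical data layout


def toy_grad(params: Vector, batch: Batch) -> Vector:
    """Gradient of the mean squared error of the toy model y = w * x."""
    w = params[0]
    return [sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)]


def sync_ps_step(
    params: Vector,
    batches: Sequence[Batch],
    grad_fn: Callable[[Vector, Batch], Vector],
    lr: float,
) -> Vector:
    """One synchronous round: wait for every worker, then update once."""
    # Every worker computes its gradient from the SAME parameters.
    grads = [grad_fn(params, batch) for batch in batches]
    # The server aggregates only after all gradients have arrived.
    # Averaging is an assumption; the text only says "aggregate".
    n = len(grads)
    aggregated = [sum(g[i] for g in grads) / n for i in range(len(params))]
    # Plain gradient descent (assumed); the result is sent to all workers.
    return [w - lr * d for w, d in zip(params, aggregated)]
```

Because the list comprehension finishes only when the slowest gradient has been produced, the sketch also shows where the barrier of the synchronous method sits.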

Because the workers 110, 112, 114, and 116 all use the same parameters, the synchronous parameter server method has the advantage that training accuracy is guaranteed. Its disadvantage is low resource utilization in a heterogeneous cluster environment, in which the workers differ from one another. Specifically, when the workers differ in performance and therefore generate gradients for the same training data at different speeds, the fastest worker must wait until the slowest worker finishes generating its gradient. The resource utilization of the synchronous parameter server method is therefore lower than that of the asynchronous parameter server method.
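The waiting cost can be made concrete with a rough calculation. The sketch below ignores communication time and defines utilization as total busy time divided by total available time in one synchronous round; both this definition and the name sync_utilization are illustrative assumptions, not part of the disclosure.

```python
from typing import Sequence


def sync_utilization(step_times: Sequence[float]) -> float:
    """Fraction of time workers are busy in one synchronous round.

    step_times[i] is the (hypothetical) time worker i needs to produce
    its gradient. Every worker is held until the slowest one finishes.
    """
    slowest = max(step_times)
    busy = sum(step_times)
    available = len(step_times) * slowest
    return busy / available
```

For example, four workers needing 1, 1, 2, and 4 time units are busy for 8 of the 16 available worker-time units, so half of the capacity is spent waiting.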

In the asynchronous parameter server method, the parameter server 100 updates the model parameters each time it receives a gradient from one of the workers 110, 112, 114, and 116, and transmits the updated parameters to each worker independently.

The asynchronous parameter server method has the advantage of high resource utilization in a heterogeneous cluster environment that uses workers 110, 112, 114, and 116 of differing performance. For example, a fast worker generates its gradient and receives the updated parameters sooner than a slow worker. That is, while a slow worker is still training on its training data, a fast worker can train on other training data. However, because each of the workers trains using different parameters, the asynchronous parameter server method has the disadvantage of low training accuracy. In addition, because each of the workers communicates with the parameter server 100 individually, an input/output (I/O) bottleneck occurs, which slows down the overall deep learning computation.

Specifically, the fast first worker 110 and the slow second worker 112 each initially generate a gradient using the same parameters. The parameter server 100 updates the parameters using the gradient generated by the first worker 110, and then updates the parameters again using the gradient generated by the second worker 112. At that point, the information in the parameters updated using the first worker's gradient is lost, so the training accuracy of the asynchronous parameter server method decreases.
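One common way to read this effect is as gradient staleness: the second gradient was computed from parameters that the server has already replaced. The following Python sketch illustrates that reading on a toy problem. The quadratic loss, the learning rate, and the names quadratic_grad and async_updates are assumptions chosen for illustration and do not come from the description.

```python
from typing import Callable, List


def quadratic_grad(w: float, target: float = 3.0) -> float:
    """Gradient of the toy loss (w - target) ** 2; purely illustrative."""
    return 2.0 * (w - target)


def async_updates(
    w0: float,
    lr: float,
    num_workers: int,
    grad_fn: Callable[[float], float] = quadratic_grad,
) -> List[float]:
    """Apply gradients one by one, although all were computed from w0."""
    # Every worker read the same initial parameter w0 before any update.
    stale_grads = [grad_fn(w0) for _ in range(num_workers)]
    w = w0
    history = [w]
    for g in stale_grads:
        # The server updates as soon as a gradient arrives, even though
        # later gradients no longer match the current parameter.
        w = w - lr * g
        history.append(w)
    return history
```

With w0 = 0.0, lr = 0.4, and two workers, the first update moves the parameter to about 2.4, and the second, stale update pushes it to about 4.8, past the minimum at 3. A gradient recomputed at 2.4 would instead have moved it to about 2.88.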

FIG. 2 is a diagram for explaining a data-parallel deep learning framework with a synchronous collective communication structure.

Referring to FIG. 2, the synchronous collective communication method involves a plurality of workers 200, 210, 220, and 230. It does not use a parameter server; instead, the workers share the gradients they have each generated with one another, so that the parameters held by the workers 200, 210, 220, and 230 are updated synchronously.

In general, in the synchronous collective communication method, the workers 200, 210, 220, and 230 are connected in a ring structure and generate gradients independently. There are two ways to update the model parameters. In the first, one worker collects all the gradients, updates the parameters, and transmits the updated parameters to the other workers, thereby synchronizing the parameters. In the second, the workers transmit the gradients they have generated to one another, and each worker then updates the parameters itself, thereby achieving synchronization.

Because the synchronous collective communication method does not use a parameter server, it has the advantage that the slowdown caused by an I/O bottleneck is small. However, it shares the drawback of synchronous methods: resource utilization is low in a heterogeneous cluster environment of workers 200, 210, 220, and 230.

Meanwhile, referring again to FIG. 1, for sub-mini-batch training, each of the workers 110, 112, 114, and 116 in the asynchronous parameter server method may include a plurality of GPUs having different performance.

For example, the first worker 110 includes a plurality of GPUs and is allocated training data 120 of mini-batch size. To make use of GPUs of differing performance, the first worker 110 allocates to each GPU a sub-mini-batch whose size is the mini-batch size divided by the number of GPUs. Each of the GPUs in the first worker 110 generates a gradient for its sub-mini-batch of training data.
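The even split described above can be sketched as follows. The function name split_mini_batch and the choice to reject mini-batches that do not divide evenly are assumptions made for this sketch; the description only states that the mini-batch size is divided by the number of GPUs.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_mini_batch(samples: Sequence[T], num_gpus: int) -> List[List[T]]:
    """Split one mini-batch into equal sub-mini-batches, one per GPU."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    if len(samples) % num_gpus != 0:
        # Assumed policy: the sketch only handles evenly divisible sizes.
        raise ValueError("mini-batch size must be divisible by num_gpus")
    size = len(samples) // num_gpus  # sub-mini-batch size
    return [list(samples[i * size:(i + 1) * size]) for i in range(num_gpus)]
```

Note that every GPU receives the same amount of data regardless of its speed, which is why a simple split of this kind does not by itself balance GPUs of differing performance.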

Sub-mini-batch training can improve training accuracy, which is the weakness of the asynchronous parameter server method, but because the mini-batch size is simply divided by the number of GPUs, the improvement is limited. Specifically, increasing the number of GPUs to improve training accuracy is limited by physical space, and increasing the sub-mini-batch size is limited by the memory of each GPU.

Accordingly, there is a need for a data-parallel deep learning framework that can raise resource utilization beyond that of the methods described above without degrading training accuracy. There is also a need for a way to improve the computational performance of sub-mini-batch training without increasing the number of GPUs or the sub-mini-batch size.

A main object of embodiments of the present invention is to provide a distributed deep learning system, and a method of operating it, that combines the asynchronous parameter server method with the synchronous collective communication method by adding local parameter servers, in order to increase resource utilization without reducing the accuracy of distributed deep learning training.

An object of other embodiments of the present invention is to provide a sub-mini-batch training method that can improve the training accuracy of an asynchronous parameter server without increasing the number of GPUs or the sub-mini-batch size.

According to one aspect of the present invention, there is provided a distributed deep learning system comprising: a global parameter server that stores and updates the parameters of a neural network; and a plurality of worker nodes having different performance, wherein each worker node includes a local parameter server and a plurality of workers; the local parameter server forwards the parameters received from the global parameter server to each worker, aggregates the gradients received from the plurality of workers, and transmits the aggregated gradient to the global parameter server; the workers are synchronized so that they use the same parameters, and each worker uses the parameters to generate a gradient for training data; the global parameter server updates the parameters using the aggregated gradient each time it receives an aggregated gradient from a local parameter server; and the local parameter servers included in the plurality of worker nodes are asynchronous with one another, each independently receiving the updated parameters from the global parameter server.

According to another aspect of the present embodiment, there is provided a method of operating a distributed deep learning system that includes a global parameter server and a plurality of worker nodes, each worker node including a local parameter server and a plurality of workers. The method comprises: transmitting, by the global parameter server, the parameters of a neural network to the local parameter server included in each worker node; for each worker node: transmitting, by the local parameter server, the parameters to each of the plurality of workers; generating, by each worker, a gradient for training data using the parameters and transmitting the gradient to the local parameter server, the workers being synchronized so that they use the same parameters; and aggregating, by the local parameter server, the gradients received from the workers; receiving, by the global parameter server, an aggregated gradient from each local parameter server and updating the parameters using the aggregated gradient each time one is received; and transmitting, by the global parameter server, the updated parameters to the local parameter server that transmitted the aggregated gradient.

As described above, according to an embodiment of the present invention, adding local parameter servers to combine the asynchronous parameter server method with the synchronous collective communication method makes it possible to increase resource utilization without reducing the accuracy of distributed deep learning training.

According to another embodiment of the present invention, in a heterogeneous cluster environment, a different number of training iterations is allocated to each worker according to the performance of each of the workers and the number of workers, so that the training accuracy of the asynchronous parameter server can be improved without increasing the number of GPUs or the sub-mini-batch size.

FIG. 1 is a diagram for explaining a data-parallel deep learning framework based on the parameter server approach.
FIG. 2 is a diagram for explaining a data-parallel deep learning framework with a synchronous collective communication structure.
FIG. 3 is a block diagram of a distributed deep learning system according to an embodiment of the present invention.
FIG. 4 is a diagram for explaining a sub-mini-batch training method according to an embodiment of the present invention.
FIG. 5 is a diagram for explaining the parallel operation of a distributed deep learning system according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating the operation of a distributed deep learning system according to an embodiment of the present invention.

Hereinafter, some embodiments of the present invention are described in detail with reference to exemplary drawings. In assigning reference numerals to the components of each drawing, the same components are given the same numerals wherever possible, even when they appear in different drawings. In describing the present invention, a detailed description of a related known configuration or function is omitted where it is judged that it could obscure the gist of the present invention.

In describing the components of the present invention, terms such as first, second, A, B, (a), and (b) may be used. These terms serve only to distinguish one component from another and do not limit the nature, sequence, or order of the components. Throughout the specification, when a part is said to 'include' or 'comprise' a component, this means that it may further include other components, rather than excluding them, unless specifically stated otherwise. In addition, terms such as 'unit' and 'module' used in the specification refer to a unit that processes at least one function or operation, which may be implemented in hardware, software, or a combination of hardware and software.

FIG. 3 is a block diagram of a distributed deep learning system according to an embodiment of the present invention.

Referring to FIG. 3, a distributed deep learning system in a heterogeneous cluster environment according to an embodiment of the present invention includes a global parameter server (GPS) 300 and worker nodes 310, 340, and 370. The worker nodes include a 0th worker node 310 and a first worker node 340, and may further include an nth worker node 370. The 0th worker node 310 includes a 0th local parameter server (LPS) 320 and a plurality of workers 330, 331, 332, and 333, and the first worker node 340 includes a first local parameter server 350 and a plurality of other workers 360, 361, 362, and 363. That is, each worker node includes one local parameter server and a plurality of workers.

Hereinafter, three worker nodes 310, 340, and 370 are shown, but they represent any number of two or more worker nodes. The operation of the system is described with a focus on the worker nodes 310 and 340, namely the 0th worker node 310 and the first worker node 340, but the same applies to the remaining worker nodes.

Meanwhile, each worker may be implemented by a GPU (Graphics Processing Unit), and the terms worker and GPU may be used interchangeably. However, a worker may also be implemented by another data processing device, such as a CPU (Central Processing Unit).

The global parameter server 300 is a component that stores the parameters of the model, transmits them to the worker nodes 310 and 340, receives the aggregated gradients from the worker nodes 310 and 340, and uses them to update the parameters of the model.

Specifically, the global parameter server 300 initially transmits the model parameters to the local parameter servers 320 and 350; thereafter, it may transmit either the same parameters or different parameters to the local parameter servers 320 and 350.

The global parameter server 300 receives the aggregated gradients from the local parameter servers 320 and 350 independently. It may update and store the parameters each time it receives an aggregated gradient from one of the local parameter servers 320 and 350. Depending on the time of the update, the global parameter server 300 may transmit different parameters to the worker nodes 310 and 340. This means that the worker nodes 310 and 340 are asynchronous with one another, which serves to increase their resource utilization.

The local parameter servers 320 and 350 are servers that are asynchronous with one another. Each is a component that forwards the parameters received from the global parameter server 300 to its workers, aggregates the gradients received from the workers, and delivers the aggregated gradient to the global parameter server 300.

Specifically, the 0th local parameter server 320 forwards the parameters received from the global parameter server 300 to the workers 330, 331, 332, and 333. According to an embodiment of the present invention, the 0th local parameter server 320 and the first local parameter server 350 are asynchronous, so each may transmit its aggregated gradient to the global parameter server 300 at a different time. In addition, depending on when the global parameter server 300 updates the parameters, the 0th local parameter server 320 and the first local parameter server 350 may receive different parameters or the same parameters.

Thereafter, the 0th local parameter server 320 receives the gradients generated by the workers 330, 331, 332, and 333 and aggregates them. Here, aggregation according to an embodiment of the present invention may mean computing the sum of the gradients, the product of the gradients, the maximum value among the gradients, or the minimum value among the gradients.
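The four aggregation options named above can be sketched in Python as follows. The function name aggregate, the mode strings, and the choice to apply each reduction element by element across the workers' gradient vectors are assumptions made for this sketch; the description does not state how the reduction is applied to multi-element gradients.

```python
import math
from typing import Callable, Dict, Iterable, List, Sequence

# Each reducer collapses the values that the workers hold for one element.
_REDUCERS: Dict[str, Callable[[Iterable[float]], float]] = {
    "sum": sum,
    "product": math.prod,
    "max": max,
    "min": min,
}


def aggregate(gradients: Sequence[Sequence[float]], mode: str = "sum") -> List[float]:
    """Combine per-worker gradients into a single aggregated gradient."""
    if mode not in _REDUCERS:
        raise ValueError(f"unknown aggregation mode: {mode}")
    reduce_fn = _REDUCERS[mode]
    # zip(*gradients) groups the i-th element of every worker's gradient.
    return [reduce_fn(column) for column in zip(*gradients)]
```

For two workers with gradients [1.0, -2.0] and [3.0, 4.0], the sum mode yields [4.0, 2.0] and the product mode yields [3.0, -8.0].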

The 0th local parameter server 320 transmits the aggregated gradient to the global parameter server 300, receives the updated parameters from the global parameter server 300, and forwards the updated parameters to the workers 330, 331, 332, and 333.
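The exchange between a local parameter server and the global parameter server can be illustrated with a small single-process Python sketch. The class and method names, the summation used for aggregation, the gradient-descent update with learning rate lr, and the version counter are all hypothetical and serve only to make the message flow concrete; an actual system would run the servers and workers on separate devices.

```python
from typing import Callable, List, Sequence

Vector = List[float]


class GlobalParameterServer:
    """Stores the parameters and updates them on every aggregated gradient."""

    def __init__(self, params: Vector, lr: float) -> None:
        self.params = list(params)
        self.lr = lr          # assumed gradient-descent step size
        self.version = 0      # number of updates applied so far

    def push(self, aggregated: Vector) -> Vector:
        # Update immediately; do not wait for other local parameter servers.
        self.params = [w - self.lr * g for w, g in zip(self.params, aggregated)]
        self.version += 1
        # Only the sender receives the updated parameters.
        return list(self.params)


class LocalParameterServer:
    """Relays parameters to its workers and aggregates their gradients."""

    def __init__(self, gps: GlobalParameterServer, worker_data: Sequence,
                 grad_fn: Callable[[Vector, object], Vector]) -> None:
        self.gps = gps
        self.worker_data = worker_data
        self.grad_fn = grad_fn
        self.params = list(gps.params)  # initial parameters from the GPS

    def run_round(self) -> None:
        # All workers of this node use the same parameters (synchronized).
        grads = [self.grad_fn(self.params, data) for data in self.worker_data]
        # Sum aggregation, one of the options the description allows.
        aggregated = [sum(column) for column in zip(*grads)]
        self.params = self.gps.push(aggregated)
```

If two LocalParameterServer objects share one GlobalParameterServer and only one of them calls run_round, that one holds the updated parameters while the other still holds the older ones, which mirrors the asynchrony between worker nodes.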

The operations described above apply equally to the first local parameter server 350 and the other local parameter servers, not only to the 0th local parameter server 320.

The workers 330, 331, 332, and 333 are synchronized with one another. They are components that generate gradients for training data using the parameters received from the 0th local parameter server 320 and transmit the gradients to the 0th local parameter server 320.

The plurality of workers 330, 331, 332, and 333 according to an embodiment of the present invention may use all-reduce, a synchronous collective communication scheme, as the method of delivering gradients to the global parameter server 300. In the all-reduce scheme, the workers 330, 331, 332, and 333 share the gradients they have generated with one another, and the shared gradients are transmitted to the zeroth local parameter server 320. Specifically, the zeroth worker 330 transmits the gradient it generated to the first worker 331, the first worker 331 transmits its gradient to the third worker 333, the third worker 333 to the second worker 332, and the second worker 332 to the zeroth worker 330. The zeroth worker 330 then forwards the gradient it received from the second worker 332 to the first worker 331. Through this process of passing gradients around the ring, each of the workers 330, 331, 332, and 333 comes to hold all four gradients. One of the plurality of workers 330, 331, 332, and 333 then transmits the four gradients to the zeroth local parameter server 320, which minimizes communication between the zeroth local parameter server 320 and the plurality of workers 330, 331, 332, and 333 and thereby reduces the input/output bottleneck.
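The ring exchange described above can be simulated in a few lines. This is a sketch under stated assumptions: the function name `ring_share` and the dictionary layout are hypothetical, the successor order 0, 1, 3, 2 is taken from the text, and the sketch models only the sharing step in which every worker ends up holding all four gradients, not any reduction.

```python
def ring_share(gradients, successor):
    """Pass gradients around a ring until every worker holds all of them.

    gradients: dict mapping worker id -> that worker's own gradient
    successor: dict mapping worker id -> the worker it sends to
    """
    # Each worker starts out holding only its own gradient.
    held = {w: {w: g} for w, g in gradients.items()}
    # What each worker sends next, tagged with the worker that produced it.
    outgoing = {w: (w, g) for w, g in gradients.items()}
    for _ in range(len(gradients) - 1):
        incoming = {successor[w]: msg for w, msg in outgoing.items()}
        for w, (origin, grad) in incoming.items():
            held[w][origin] = grad
        # In the next step, each worker forwards what it just received.
        outgoing = incoming
    return held


# Ring order from the description: 0 -> 1 -> 3 -> 2 -> 0.
successor = {0: 1, 1: 3, 3: 2, 2: 0}
held = ring_share({0: "g0", 1: "g1", 2: "g2", 3: "g3"}, successor)
```

With four workers the loop runs three times, after which any single worker can hand the complete set to the local parameter server.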

Each of the plurality of workers 330, 331, 332, and 333 according to another embodiment of the present invention may use the sub-mini-batch training method to transmit its gradients individually to the zeroth local parameter server 320, and the gradients may be aggregated by the zeroth local parameter server 320. This is described in detail with reference to FIG. 4.

Meanwhile, the plurality of workers 330, 331, 332, and 333 according to an embodiment of the present invention may be synchronized by all receiving, in the same way, the parameters updated by the global parameter server 300 from the zeroth local parameter server 320. Even though the workers 330, 331, 332, and 333 generate their gradients at different times, they all receive the parameters at the same time. Likewise, the plurality of other workers 360, 361, 362, and 363 are synchronized to use the same parameters. However, the plurality of workers 330, 331, 332, and 333 and the plurality of other workers 360, 361, 362, and 363 are not synchronized with each other.

The plurality of workers 330, 331, 332, and 333 according to an embodiment of the present invention may be implemented in a heterogeneous cluster environment or by different GPUs.

Even when a plurality of workers 330, 331, 332, and 333 with different performance are used, combining the asynchronous parameter server scheme with the synchronous collective communication scheme can raise resource utilization without degrading the training accuracy of the plurality of workers 330, 331, 332, and 333.

FIG. 4 is a diagram for explaining a sub-mini-batch training method according to an embodiment of the present invention.

Referring to FIGS. 3 and 4, the operation over time of the zeroth local parameter server 320 and the plurality of workers 330, 331, 332, and 333 is shown.

The plurality of workers 330, 331, 332, and 333 all hold the same parameters and train on training data of the same sub-mini-batch size. In other words, each of the workers 330, 331, 332, and 333 generates a gradient for each unit of training data with a batch size of 64.

Because the plurality of workers 330, 331, 332, and 333 according to an embodiment of the present invention are each implemented by clusters or GPUs of different performance, the time taken to generate a gradient for training data of batch size 64 differs from worker to worker. The zeroth worker 330 computes fastest, and the third worker 333 computes slowest.

In the sub-mini-batch training method according to an embodiment of the present invention, the plurality of local parameter servers 320 and 350 are all allocated training data of the same mini-batch size. The zeroth local parameter server 320 determines the number of training iterations for the plurality of workers 330, 331, 332, and 333 according to the number and performance of those workers.

The plurality of workers 330, 331, 332, and 333 generate gradients for the training data as many times as the number of training iterations determined by the zeroth local parameter server 320, and transmit them to the zeroth local parameter server 320.

For example, assume that training data with a mini-batch size of 512 is allocated to the zeroth local parameter server 320, that the plurality of workers 330, 331, 332, and 333 can generate gradients in units of a sub-mini-batch size of 64, and that the ratio of the computation speeds of the zeroth worker 330 through the third worker 333 is 4:2:1.5:1.

First, the zeroth local parameter server 320 derives 8, the mini-batch size of 512 divided by the sub-mini-batch size of 64. The plurality of workers 330, 331, 332, and 333 therefore generate a total of eight gradients.

Because the plurality of workers 330, 331, 332, and 333 differ in performance, the zeroth local parameter server 320 determines the number of training iterations for each worker based on the number and performance of the workers. The zeroth local parameter server 320 determines the iteration counts so that the zeroth worker 330 through the third worker 333 generate gradients in the ratio 4:2:1:1, which sums to the eight gradients required. That is, the zeroth worker 330 trains on four units of training data of sub-mini-batch size 64, and the third worker 333 trains on one. After training on one sub-mini-batch, the zeroth worker 330, which computes fastest, does not need to wait for the third worker 333 to finish its computation. By training on other training data while the third worker 333 is still computing, the zeroth worker 330 avoids the drop in resource utilization that synchronization would otherwise cause.
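The worked example (512 / 64 = 8 gradients, speed ratio 4:2:1.5:1, iteration counts 4:2:1:1) can be reproduced with a short calculation. The description does not state how the ratio is rounded to whole iterations, so the largest-remainder rounding used here is an assumption, as is the function name `iteration_counts`.

```python
from fractions import Fraction


def iteration_counts(mini_batch, sub_mini_batch, speeds):
    """Split mini_batch // sub_mini_batch iterations in proportion to speed."""
    total = mini_batch // sub_mini_batch
    speeds = [Fraction(s) for s in speeds]
    # Exact proportional share of the iterations for each worker.
    shares = [s * total / sum(speeds) for s in speeds]
    counts = [int(share) for share in shares]  # round every share down
    leftover = total - sum(counts)
    # Assumed rule: hand the remaining iterations to the workers whose
    # shares lost the most in the rounding (largest-remainder method).
    by_remainder = sorted(
        range(len(speeds)),
        key=lambda i: shares[i] - counts[i],
        reverse=True,
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


counts = iteration_counts(512, 64, [4, 2, 1.5, 1])
```

Under this rule the four workers receive 4, 2, 1, and 1 iterations respectively, matching the ratio in the example; a different rounding rule could give a different split for other speed ratios.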

The plurality of workers 330, 331, 332, and 333 generate gradients according to their numbers of training iterations and transmit each gradient to the zeroth local parameter server 320 as soon as it is generated.

The zeroth local parameter server 320 receives gradients from the plurality of workers 330, 331, 332, and 333 and aggregates them asynchronously (asynchronous aggregation). Aggregation begins at the moment a gradient is received from the zeroth worker 330, the best-performing of the plurality of workers 330, 331, 332, and 333.

The zeroth local parameter server 320 transmits the aggregated gradient to the global parameter server 300 only after receiving all eight gradients from the plurality of workers 330, 331, 332, and 333. That is, even when the zeroth local parameter server 320 receives a gradient from the zeroth worker 330, it does not immediately send that gradient to the global parameter server 300. The zeroth local parameter server 320 transmits the aggregated gradient to the global parameter server 300 only after it has received and aggregated every gradient generated by the plurality of workers 330, 331, 332, and 333. This keeps communication between the zeroth local parameter server 320 and the global parameter server 300 efficient.
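The buffering behavior described above (aggregate each gradient as it arrives, but send upstream only once all eight are in) might look like the following. The class and method names are hypothetical, and summation is assumed as the aggregation operation; the description also permits product, maximum, or minimum.

```python
class LocalParameterServer:
    """Accumulates worker gradients and forwards one aggregate upstream."""

    def __init__(self, expected, send_to_global):
        self.expected = expected              # e.g. 512 // 64 = 8 gradients
        self.send_to_global = send_to_global  # callback standing in for the network
        self.received = 0
        self.running_sum = None

    def receive(self, gradient):
        """Fold one gradient into the aggregate; return True if it was sent."""
        if self.running_sum is None:
            self.running_sum = list(gradient)
        else:
            self.running_sum = [a + g for a, g in zip(self.running_sum, gradient)]
        self.received += 1
        if self.received < self.expected:
            return False  # keep waiting; nothing goes upstream yet
        self.send_to_global(self.running_sum)
        self.received, self.running_sum = 0, None  # reset for the next mini-batch
        return True


sent = []
server = LocalParameterServer(expected=8, send_to_global=sent.append)
flags = [server.receive([1.0, 2.0]) for _ in range(8)]
```

Only the eighth call triggers a transmission, so the global parameter server sees one message per mini-batch instead of eight.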

The global parameter server 300 updates the parameters using the aggregated gradient and may deliver the updated parameters to the plurality of workers 330, 331, 332, and 333 at the same time through the zeroth local parameter server 320 (broadcast updated parameter). That is, the plurality of workers 330, 331, 332, and 333 use the same parameters and are synchronized so that their parameters are updated simultaneously.

FIG. 5 is a diagram for explaining the parallel operation of a deep learning distributed learning system according to an embodiment of the present invention.

Referring to FIG. 5, the order of operations of the global parameter server, the local parameter server, and the plurality of workers of the deep learning distributed learning system is illustrated.

In the conventional parameter server scheme, the plurality of workers compute gradients for the training data (compute), aggregate the gradients, reduce the aggregated gradients (reduce), and then transmit the result to the global parameter server. The global parameter server receives the aggregated gradient from the plurality of workers (recv), applies the parameter update using the aggregated gradient (apply), and transmits the updated parameters to each of the plurality of workers (bcast).

Although the operations of the global parameter server and the plurality of workers are performed by different components, each operation starts only after the previous one has finished, so updating the parameters takes a long time. Moreover, the global parameter server waits while the workers operate, and the workers wait while the global parameter server operates, so resource utilization is low.

In contrast, the deep learning distributed learning system according to an embodiment of the present invention can use the local parameter server to parallelize the operations of the global parameter server and the plurality of workers. In other words, communication between the local parameter server and the global parameter server can proceed in parallel with gradient generation by the plurality of workers.

Specifically, an embodiment of the present invention can perform deep learning computation and communication in parallel by using the local parameter server. Unlike the conventional scheme, in which the global parameter server and the plurality of workers communicate directly, according to an embodiment of the present invention the local parameter server takes over both the communication between the global parameter server and the workers and the gradient aggregation. Because the local parameter server handles communication and aggregation, the workers incur less communication overhead and can concentrate on generating gradients. In addition, the global parameter server does not have to communicate with every worker or aggregate gradients itself, so it can update the parameters faster.

Accordingly, because the local parameter server carries out communication in parallel while the workers and the global parameter server operate, the resource utilization of the workers and the global parameter server increases. In addition, because the global parameter server communicates with the local parameter servers rather than with the individual workers, the input/output bottleneck can be reduced.
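The division of labor described above, in which workers only generate gradients while a separate local parameter server collects them and talks to the global parameter server, can be sketched with threads and a queue. Everything here is illustrative: the threads merely stand in for separate devices or processes, gradients are reduced to single integers, and `run_node`, `make_gradient`, and `send_to_global` are hypothetical names.

```python
import queue
import threading


def run_node(iteration_counts, make_gradient, send_to_global):
    """Run one worker node: workers compute while a collector aggregates."""
    inbox = queue.Queue()
    expected = sum(iteration_counts)

    def worker(worker_id, iterations):
        # A worker hands each gradient off and immediately continues computing.
        for step in range(iterations):
            inbox.put(make_gradient(worker_id, step))

    def local_parameter_server():
        # Aggregation starts with the first arrival and runs alongside the workers.
        total = 0
        for _ in range(expected):
            total += inbox.get()
        send_to_global(total)  # single upstream message for the whole node

    threads = [threading.Thread(target=local_parameter_server)]
    threads += [
        threading.Thread(target=worker, args=(i, n))
        for i, n in enumerate(iteration_counts)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


sent = []
# Iteration counts 4:2:1:1; worker i produces the toy gradient i + 1 each time.
run_node([4, 2, 1, 1], lambda worker_id, step: worker_id + 1, sent.append)
```

Because the sum does not depend on arrival order, the result is the same however the threads interleave.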

FIG. 6 is a flowchart for explaining the operation of a deep learning distributed learning system according to an embodiment of the present invention.

Referring to FIG. 6, the global parameter server according to an embodiment of the present invention transmits the model, that is, the parameters of the neural network, to the local parameter server (S600). The local parameter server communicates with the global parameter server on behalf of the plurality of workers. Here, the plurality of local parameter servers included in the plurality of worker nodes are not synchronized with one another.

Each local parameter server included in each worker node transmits the parameters to its plurality of workers (S602). Because the workers are synchronized with one another, they receive the same parameters.

Each worker uses the received parameters to generate a gradient for the training data and transmits it to the local parameter server (S604). In the sub-mini-batch training method according to an embodiment of the present invention, the local parameter server determines the number of training iterations for each worker according to the number and performance of the workers, and each worker generates gradients as many times as its number of training iterations and transmits them to the local parameter server.

The local parameter server aggregates the gradients received from the workers (S606). Here, aggregation means adding all the gradients together, multiplying the gradients together, or selecting the maximum or minimum value among the gradients.

The global parameter server receives the gradient aggregated by the local parameter server and updates the parameters of the model using the aggregated gradient (S608). Here, the global parameter server receives different gradients from different local parameter servers, and may update and store the parameters each time it receives one.

The global parameter server transmits the updated parameters to each local parameter server (S610). Specifically, the global parameter server may transmit the same updated parameters to the local parameter servers, or may transmit different updated parameters to them. The local parameter server then forwards the updated parameters to its plurality of workers, and the workers use the updated parameters to generate gradients for the training data.
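Steps S608 and S610 can be illustrated with a small model of the global parameter server that updates on every arrival. The description does not give the update rule, so the plain gradient-descent step, the `learning_rate` value, and the `version` counter below are assumptions made for illustration, as are the class and method names.

```python
class GlobalParameterServer:
    """Stores parameters and updates them whenever an aggregate arrives."""

    def __init__(self, params, learning_rate=0.5):
        self.params = list(params)
        self.learning_rate = learning_rate  # assumed hyperparameter
        self.version = 0                    # counts updates applied so far

    def pull(self):
        """Return the current version number and a copy of the parameters."""
        return self.version, list(self.params)

    def push(self, aggregated_gradient):
        """Apply one aggregate (S608) and reply with the result (S610)."""
        self.params = [
            p - self.learning_rate * g
            for p, g in zip(self.params, aggregated_gradient)
        ]
        self.version += 1
        return self.pull()


gps = GlobalParameterServer([1.0, 1.0])
# Local parameter server 320 reports first, 350 some time later.
version_a, params_a = gps.push([2.0, 0.0])
version_b, params_b = gps.push([0.0, 2.0])
```

Because the two local parameter servers push at different times, each gets back a different parameter version, which is the case of "different updated parameters" mentioned above; had both pulled after the second update, they would hold the same parameters.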

Although FIG. 6 describes steps S600 to S610 as being executed sequentially, this merely illustrates the technical idea of an embodiment of the present invention. In other words, a person of ordinary skill in the art to which an embodiment of the present invention pertains could, without departing from the essential characteristics of the embodiment, apply various modifications and variations, such as changing the order described in FIG. 6 or executing one or more of steps S600 to S610 in parallel. FIG. 6 is therefore not limited to a time-series order.

Meanwhile, the steps illustrated in FIG. 6 can be implemented as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes every kind of recording device in which data readable by a computer system is stored. That is, the computer-readable recording medium may be a non-transitory medium such as ROM, RAM, CD-ROM, magnetic tape, a floppy disk, or an optical data storage device, and may further include a transitory medium such as a carrier wave (for example, transmission over the Internet) or a data transmission medium. The computer-readable recording medium may also be distributed across network-connected computer systems so that the computer-readable code is stored and executed in a distributed manner.

The above description merely illustrates the technical idea of the present embodiment, and a person of ordinary skill in the art to which the embodiment pertains may make various modifications and variations without departing from its essential characteristics. Accordingly, the present embodiments are intended to explain, not to limit, the technical idea of the present embodiment, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present embodiment should be interpreted according to the claims below, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of rights of the present embodiment.

300: global parameter server 310: zeroth worker node
320: zeroth local parameter server 330: zeroth worker
331: first worker 332: second worker
333: third worker

Claims

A deep learning distributed learning system comprising:
a global parameter server that stores and updates parameters of a neural network; and
a plurality of worker nodes,
wherein each worker node includes a local parameter server and a plurality of workers,
the local parameter server transmits the parameters received from the global parameter server to each worker, aggregates the gradients received from the plurality of workers, and transmits the aggregated gradient to the global parameter server,
each worker is synchronized to use the same parameters, and each worker uses the parameters to generate a gradient for training data,
the global parameter server updates the parameters using the aggregated gradient,
communication between the local parameter server and the global parameter server and the gradient generation of each worker are performed in parallel,
the local parameter server determines the number of training iterations for each worker according to the number and performance of the workers, and
each worker generates the gradient as many times as its number of training iterations.

delete

The system of claim 1,
wherein the workers are implemented by different clusters or different GPUs.

The system of claim 1,
wherein the aggregated gradient is any one of a sum, a product, a maximum value, or a minimum value of the gradients that the local parameter server receives from the respective workers.

A method of operating a deep learning distributed learning system, the system comprising a global parameter server and a plurality of worker nodes, each worker node comprising a local parameter server and a plurality of workers,
the method comprising:
transmitting, by the global parameter server, parameters of a neural network to the local parameter server included in each worker node;
for each worker node:
transmitting, by the local parameter server, the parameters to each of the plurality of workers;
generating, by each worker, a gradient for training data using the parameters and transmitting the gradient to the local parameter server, each worker being synchronized to use the same parameters;
aggregating, by the local parameter server, the gradients received from the respective workers;
receiving, by the global parameter server, an aggregated gradient from each local parameter server, and updating the parameters using the aggregated gradient; and
transmitting, by the global parameter server, the updated parameters to the local parameter server that transmitted the aggregated gradient,
wherein communication between the local parameter server and the global parameter server and the gradient generation of each worker are performed in parallel,
the local parameter server determines the number of training iterations for each worker according to the number and performance of the workers, and
each worker generates the gradient as many times as its number of training iterations.

delete

7. The method of claim 6,
wherein the workers are implemented by different clusters or different GPUs.

7. The method of claim 6,
wherein the aggregated gradient is any one of a sum, a product, a maximum value, or a minimum value of the gradients that the local parameter server receives from the respective workers.