KR102190511B1

KR102190511B1 - Method for distributed deep-learning based on heterogeneous cluster and apparatus using the same

Info

Publication number: KR102190511B1
Application number: KR1020190025730A
Authority: KR
Inventors: 안신영; 박유미; 임은지; 최용석
Original assignee: 한국전자통신연구원
Priority date: 2019-03-06
Filing date: 2019-03-06
Publication date: 2020-12-14
Also published as: KR20200107124A

Abstract

A heterogeneous cluster-based distributed deep learning method and an apparatus therefor are disclosed. A distributed deep learning method according to an embodiment of the present invention includes: sharing, by a plurality of heterogeneous deep learning modules having different deep learning performance, a global parameter and a global learning counter based on a remote shared memory; performing, by the plurality of heterogeneous deep learning modules, distributed deep learning training corresponding to a local learning counter allocated based on the global learning counter, overlapped with a remote shared memory update; and terminating the distributed deep learning process in consideration of the global learning counter updated by the remote shared memory update.

Description

Distributed deep learning method based on heterogeneous clusters and device for the same {METHOD FOR DISTRIBUTED DEEP-LEARNING BASED ON HETEROGENEOUS CLUSTER AND APPARATUS USING THE SAME}

The present invention relates to distributed deep learning technology, and in particular to a technology capable of efficiently performing distributed deep learning in a heterogeneous HPC cluster environment composed of heterogeneous computing modules.

Deep learning is a machine learning method based on artificial neural networks, which mimic biological neurons so that machines can learn. Deep learning models have recently been growing into large-scale models to improve the recognition performance of applications, but a single machine is limited in its ability to process ever-larger models and large-scale training data. Deep learning distributed platform technology is therefore being developed as part of an effort to utilize large-scale distributed computing resources.

Existing distributed deep learning processing usually assumes a cluster of machines with identical specifications and performance. In practice, however, it is uncommon to have a cluster composed entirely of computing servers with identical specifications and performance. It is therefore not easy to perform distributed deep learning efficiently while simultaneously using all of a heterogeneous cluster composed of servers with different specifications.

In general, a synchronous parameter update method is used in a cluster environment composed of homogeneous computers. However, even distributed processes running simultaneously in a homogeneous cluster develop speed differences over time for various reasons, which degrades the efficiency of synchronous training. The asynchronous parameter update method is used as an alternative. In the asynchronous method, the parameter server proceeds with training without synchronizing the parameters that arrive early or late from the distributed computers. Compared to the synchronous method, the asynchronous method has the advantage of training quickly without greatly sacrificing accuracy.

Korean Patent Registration No. 10-1559089, published on October 2, 2015 (title: Communication protocol for sharing memory resources between components of a device)

An object of the present invention is to provide an effective distributed deep learning method capable of reducing communication overhead when performing distributed deep learning using heterogeneous GPUs.

Another object of the present invention is to provide a distributed deep learning method capable of effectively using GPUs with different performance at the same time.

Another object of the present invention is to enable distributed processes with different training speeds to effectively divide the overall training among themselves.

Another object of the present invention is to provide excellent scalability of distributed processing by overlapping computation and communication through delayed updates of learned parameters, thereby maximizing the utilization of each GPU.

To achieve the above objects, a distributed deep learning method according to the present invention includes: sharing, by a plurality of heterogeneous deep learning modules having different deep learning performance, a global parameter and a global learning counter based on a remote shared memory; performing, by the plurality of heterogeneous deep learning modules, distributed deep learning training corresponding to a local learning counter allocated based on the global learning counter, overlapped with a remote shared memory update; and terminating the distributed deep learning process in consideration of the global learning counter updated by the remote shared memory update.

In this case, the distributed deep learning method may further include: generating a local parameter area, a global-local parameter difference area, and a local learning counter area in each of the plurality of heterogeneous deep learning modules; and generating a global parameter area and a global learning counter area in the remote shared memory.

In this case, the distributed deep learning method may further include initializing the global parameter and the global learning counter through a master module, which is any one of the plurality of heterogeneous deep learning modules.

In this case, the performing may include generating, by each of the plurality of heterogeneous deep learning modules, a deep learning training thread for the distributed deep learning training and an update thread for the remote shared memory update.

In this case, the initializing may be performed based on the wake-up time of the update thread.

In this case, the performing may be based on a flow control lock allocated to either the deep learning training thread or the update thread.

In this case, the flow control lock may be allocated first to the deep learning training thread after the distributed deep learning process starts.

In this case, the plurality of heterogeneous deep learning modules may access the remote shared memory over a high-speed network supporting remote direct memory access (RDMA).

In addition, a distributed deep learning apparatus according to an embodiment of the present invention includes: a processor that performs distributed deep learning training corresponding to a local learning counter allocated based on a global learning counter of a remote shared memory, overlapped with a remote shared memory update, and terminates the distributed deep learning process in consideration of the global learning counter updated by the remote shared memory update; and a memory that stores a local parameter, a global-local parameter difference, and the local learning counter.

In this case, the processor may generate a local parameter area, a global-local parameter difference area, and a local learning counter area, and may generate a global parameter area and a global learning counter area in the remote shared memory.

In this case, the processor may initialize the global parameter and the global learning counter.

In this case, the processor may generate a deep learning training thread for the distributed deep learning training and an update thread for the remote shared memory update.

In this case, the processor may perform the initialization based on the wake-up time of the update thread.

In this case, the processor may perform the distributed deep learning training and the remote shared memory update by allocating a flow control lock to either the deep learning training thread or the update thread.

In this case, the processor may access the remote shared memory over a high-speed network supporting remote direct memory access (RDMA).

According to the present invention, it is possible to provide an effective distributed deep learning method capable of reducing communication overhead when performing distributed deep learning using heterogeneous GPUs.

In addition, the present invention can provide a distributed deep learning method capable of effectively using GPUs with different performance at the same time.

In addition, the present invention can enable distributed processes with different training speeds to effectively divide the overall training among themselves.

In addition, the present invention can provide excellent scalability of distributed processing by overlapping computation and communication through delayed updates of learned parameters, thereby maximizing the utilization of each GPU.

FIG. 1 is a flowchart showing a distributed deep learning method according to an embodiment of the present invention.
FIG. 2 is a diagram showing a distributed deep learning system according to an embodiment of the present invention.
FIG. 3 is a flowchart showing in detail a distributed deep learning process according to an embodiment of the present invention.
FIG. 4 is a flowchart showing in detail an example of a process of performing distributed deep learning based on a deep learning training thread according to the present invention.
FIG. 5 is a flowchart showing in detail an example of a process of updating a remote shared memory based on an update thread according to the present invention.
FIG. 6 is a block diagram showing a distributed deep learning apparatus according to an embodiment of the present invention.

The present invention is described in detail below with reference to the accompanying drawings. Repeated descriptions, and detailed descriptions of well-known functions and configurations that could unnecessarily obscure the gist of the present invention, are omitted. The embodiments of the present invention are provided to explain the present invention more completely to those with ordinary knowledge in the art. Accordingly, the shapes and sizes of elements in the drawings may be exaggerated for clarity.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a flowchart showing a distributed deep learning method according to an embodiment of the present invention.

Referring to FIG. 1, in a distributed deep learning method according to an embodiment of the present invention, a plurality of heterogeneous deep learning modules having different deep learning performance share a global parameter and a global learning counter based on a remote shared memory (S110).

In this case, the global parameter and the global learning counter stored in the remote shared memory are data that can be updated exclusively, which allows a plurality of heterogeneous deep learning modules with different processing performance to effectively divide the overall training among themselves.

In this case, the plurality of heterogeneous deep learning modules may access the remote shared memory over a high-speed network supporting remote direct memory access (RDMA). The remote shared memory can thus provide the global parameter and the global learning counter to the plurality of heterogeneous deep learning modules so that they can access them directly.

For example, referring to FIG. 2, the plurality of heterogeneous deep learning modules (210-1 to 210-N) according to an embodiment of the present invention may access the remote shared memory (220) through the RDMA high-speed network (230). As shown in FIG. 2, the plurality of heterogeneous deep learning modules (210-1 to 210-N) may correspond to compute nodes that perform deep learning training, and may include GPGPUs (general-purpose computing on graphics processing units) whose performance differs from one module to another.

Although not shown in FIG. 1, in the distributed deep learning method according to an embodiment of the present invention, each of the plurality of heterogeneous deep learning modules generates a local parameter area, a global-local parameter difference area, and a local learning counter area. For example, as shown in FIG. 2, each of the plurality of heterogeneous deep learning modules (210-1 to 210-N) may include a local parameter, a global-local parameter difference, and a local learning counter.

In this case, the learning counter may be used to count the total number of mini-batches trained by the distributed deep learning processes during deep learning training, and may be used to assign each distributed deep learning process the sequence number of the mini-batch it should train. The learning counter value allocated to each deep learning module in this way may be stored as a local learning counter. The total number of mini-batches to be performed by the distributed processes included in the deep learning modules may be specified by the user of the distributed deep learning process.

In addition, in the distributed deep learning method according to an embodiment of the present invention, the plurality of heterogeneous deep learning modules perform distributed deep learning training corresponding to a local learning counter allocated based on the global learning counter, overlapped with a remote shared memory update (S120).

In this case, each of the heterogeneous deep learning modules may train its own local parameter through distributed deep learning training, and the global learning counter kept in the remote shared memory may be updated by compare-and-swap: it is compared with the local learning counter stored in each module and changed only if the two are equal.
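The compare-and-swap update of the global learning counter can be sketched as follows. This is a minimal single-process illustration under stated assumptions, not the patented implementation: `SharedCounter`, `claim_next`, and `run_workers` are hypothetical names, and a Python lock stands in for the atomic compare-and-swap that would be issued against the remote shared memory over the RDMA network.

```python
import threading


class SharedCounter:
    """Stand-in for the global learning counter in remote shared memory (assumption)."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()  # emulates the atomicity of a remote CAS

    def read(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        """Set the counter to `new` only if it still equals `expected`."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False


def claim_next(counter, target):
    """Return the next mini-batch sequence number, or None once `target` is reached."""
    while True:
        local = counter.read()            # copy kept as the local learning counter
        if local >= target:
            return None
        if counter.compare_and_swap(local, local + 1):
            return local + 1              # this module now owns mini-batch `local + 1`
        # Another module changed the counter first; read again and retry.


def run_workers(num_workers, target):
    """Let several threads (hypothetical modules) claim mini-batches until `target`."""
    counter = SharedCounter()
    claimed = [[] for _ in range(num_workers)]

    def worker(index):
        while True:
            number = claim_next(counter, target)
            if number is None:
                break
            claimed[index].append(number)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.read(), claimed
```

Because a claim succeeds only when the counter still holds the value the module last read, no sequence number is handed out twice, regardless of how the modules interleave.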

In general, processes performing distributed deep learning training on a distributed deep learning platform must frequently exchange large-scale parameters with each other, so the communication overhead incurred in this process accounts for a very large share of overall distributed training performance. Effective distributed training therefore requires either reducing the communication time or hiding it by overlapping communication time with computation time.

To solve this problem, the present invention proposes a distributed deep learning method that overlaps computation and communication by delaying the update of learned parameters instead of performing it immediately.

In this case, each of the plurality of heterogeneous deep learning modules may generate a deep learning training thread for distributed deep learning training and an update thread for the remote shared memory update. In general, the main thread of each of the heterogeneous deep learning modules may correspond to the deep learning training thread.

In addition, the distributed deep learning training and the remote shared memory update may be performed based on a flow control lock allocated to either the deep learning training thread or the update thread.

The detailed procedures of the two threads that overlap distributed deep learning training and the remote shared memory update are described below with reference to FIGS. 4 and 5.

First, as shown in FIG. 4, when the deep learning training thread starts (S410), it may first acquire the flow control lock (S420).

In this case, the flow control lock is used for flow control between the deep learning training thread and the update thread, and it may be allocated first to the deep learning training thread after the distributed deep learning process starts.

Thereafter, the local parameter may be updated from the global-local parameter difference (S430). The flow control lock allocated to the deep learning training thread may then be released, and a wake-up signal may be sent to the update thread to wake it (S440). That is, the deep learning training thread wakes the update thread before starting distributed deep learning training.

Thereafter, distributed deep learning training may be performed on one mini-batch of data using the updated local parameter (S450), and the local parameter may be updated based on the training result (S460).

Thereafter, in consideration of the global learning counter stored in the remote shared memory, the deep learning training thread may be repeated if further distributed deep learning training is needed, and the distributed deep learning process may be terminated when the global learning counter has expired, that is, when the number of mini-batches specified by the user has been reached (S470).

Referring to FIG. 5, after being created, the update thread may wait until it is woken by the deep learning training thread or the main thread (S510).

Accordingly, it may be determined whether a wake-up signal has occurred (S515), and when the update thread is woken by a wake-up signal, it may check whether the termination variable is true (S525).

If the termination variable is found to be true in step S525, the update thread may terminate (S570).

If the termination variable is found not to be true in step S525, the flow control lock may be allocated to the update thread (S530).

Thereafter, the global parameter stored in the remote shared memory may be read into a local buffer of the deep learning module, and the difference between the global parameter and the local parameter may be calculated (S540).

Thereafter, the global parameter in the remote shared memory may be updated so as to be incremented using the global-local parameter difference calculated in step S540 (S550), and the flow control lock allocated to the update thread may then be released (S560).

In this case, steps S530 to S560 may be performed repeatedly until distributed deep learning training is complete. Because the training time is generally longer than the remote shared memory update time, the update thread may return to the waiting state after completing the global parameter update.

As described above, in the present invention, the parameter learned in the N-th training step by each of the heterogeneous deep learning modules may be applied to the global parameter during the (N+1)-th training step, and the global parameter read during the N-th training step may be applied to the local parameter before the (N+1)-th training step. Unlike the conventional asynchronous method using a parameter server, there is therefore no need to wait until a new global parameter has been updated, which saves the time otherwise lost to communication. That is, in the present invention, the deep learning training thread and the update thread can overlap training and communication (the global parameter update) using the flow control lock and the wait/wake-up scheme.
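The interplay of the two threads in FIGS. 4 and 5 can be sketched roughly as below. This is only an illustration of the flow control lock and the wait/wake-up scheme: `ModuleSketch` and `run_module` are hypothetical names, the parameter arithmetic of steps S430 and S540 to S550 is reduced to placeholders because the description does not spell it out, and the loop runs for a fixed number of mini-batches instead of consulting a global learning counter.

```python
import threading


class ModuleSketch:
    """One deep learning module with a training thread and an update thread."""

    def __init__(self, total_minibatches):
        self.total = total_minibatches
        self.flow_lock = threading.Lock()   # flow control lock
        self.wake = threading.Condition()   # wait/wake-up channel for the update thread
        self.signalled = False              # a wake-up signal is pending
        self.terminate = False              # termination variable
        self.trained = 0                    # mini-batches trained (placeholder work)
        self.updates = 0                    # remote updates performed (placeholder work)

    def _wake_update_thread(self, terminate=False):
        with self.wake:
            self.signalled = True
            self.terminate = self.terminate or terminate
            self.wake.notify()

    def training_thread(self):
        for _ in range(self.total):
            with self.flow_lock:            # S420: acquire the flow control lock
                pass                        # S430: apply the stored difference (placeholder)
            self._wake_update_thread()      # S440: lock released, wake the update thread
            self.trained += 1               # S450-S460: train one mini-batch (placeholder)
        self._wake_update_thread(terminate=True)

    def update_thread(self):
        while True:
            with self.wake:                 # S510/S515: wait for a wake-up signal
                self.wake.wait_for(lambda: self.signalled)
                self.signalled = False
                if self.terminate:          # S525: termination variable is true
                    return                  # S570: end the update thread
            with self.flow_lock:            # S530: take the flow control lock
                self.updates += 1           # S540-S550: read and update global (placeholder)
            # S560: the lock is released when the `with` block exits


def run_module(total_minibatches):
    """Run both threads to completion and report how much work each did."""
    module = ModuleSketch(total_minibatches)
    updater = threading.Thread(target=module.update_thread)
    trainer = threading.Thread(target=module.training_thread)
    updater.start()
    trainer.start()
    trainer.join()
    updater.join()
    return module.trained, module.updates
```

In this sketch the update thread cannot take the lock before its first wake-up, so the training thread always obtains it first, and wake-up signals that arrive while an update is still in progress are coalesced into a single later update.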

In addition, the distributed deep learning method according to an embodiment of the present invention terminates the distributed deep learning process in consideration of the global learning counter updated by the remote shared memory update (S130).

For example, the deep learning modules (210-1 to 210-N) of the present invention, as shown in FIG. 2, perform distributed deep learning training while exclusively incrementing the global learning counter stored in the remote shared memory, so the distributed deep learning process can be terminated when the global learning counter reaches the value specified by the user.

In this way, a slow GPU trains fewer of the total mini-batches and a fast GPU trains more of them, so when the number of mini-batches specified by the user is reached, the heterogeneous distributed deep learning processes can finish distributed deep learning training at almost the same time.
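The effect of letting each module claim the next mini-batch whenever it becomes idle can be illustrated with a small deterministic simulation. The function name `simulate_sharing` and the per-mini-batch times are hypothetical; communication cost is ignored, so this is a sketch of the load-sharing behavior only.

```python
import heapq


def simulate_sharing(minibatch_times, total_minibatches):
    """Simulate modules that each claim the next mini-batch as soon as they are idle.

    minibatch_times[i] is the (hypothetical) time module i needs per mini-batch.
    Returns (counts, finish_times): mini-batches trained and finishing time per module.
    """
    counts = [0] * len(minibatch_times)
    finish_times = [0] * len(minibatch_times)
    # Heap of (time at which the module becomes idle, module index).
    idle = [(0, i) for i in range(len(minibatch_times))]
    heapq.heapify(idle)
    global_counter = 0
    while global_counter < total_minibatches:
        now, i = heapq.heappop(idle)        # the module that becomes idle first
        global_counter += 1                 # it claims one sequence number
        counts[i] += 1
        finish_times[i] = now + minibatch_times[i]
        heapq.heappush(idle, (finish_times[i], i))
    return counts, finish_times
```

With per-mini-batch times of 1, 2, and 4 time units and a target of 70 mini-batches, `simulate_sharing([1, 2, 4], 70)` assigns 40, 20, and 10 mini-batches to the three modules, and all three finish at time 40.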

Although not shown in FIG. 1, the distributed deep learning method according to an embodiment of the present invention generates a global parameter area and a global learning counter area in the remote shared memory.

For example, the plurality of heterogeneous deep learning modules (210-1 to 210-N) according to an embodiment of the present invention, as shown in FIG. 2, may access the remote shared memory (220) over the RDMA high-speed network (230) and generate the global parameter area and the global learning counter area. In this case, any one of the heterogeneous deep learning modules (210-1 to 210-N) may be designated as the master module, and the global parameter area and the global learning counter area may be generated using the designated master module.

Although not shown in FIG. 1, the distributed deep learning method according to an embodiment of the present invention initializes the global parameter and the global learning counter through a master module, which is any one of the plurality of heterogeneous deep learning modules.

In this case, the global parameter may be initialized in various ways depending on the deep learning model or the deep learning platform, and the global learning counter may be initialized to 0.
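The memory areas and their initialization might be laid out as in the following sketch. All names (`RemoteSharedMemory`, `ModuleState`, `create_areas`, `initialize_by_master`) are hypothetical, plain Python lists stand in for parameter buffers, and the zero-filled starting values are an assumption; the description only fixes that the global learning counter starts at 0.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RemoteSharedMemory:
    """Areas created in the remote shared memory."""
    global_params: List[float] = field(default_factory=list)
    global_counter: int = 0


@dataclass
class ModuleState:
    """Areas created inside each deep learning module."""
    local_params: List[float] = field(default_factory=list)
    param_diff: List[float] = field(default_factory=list)   # global-local difference
    local_counter: int = 0


def create_areas(num_modules, num_params):
    """Allocate the shared areas and one set of local areas per module."""
    shared = RemoteSharedMemory(global_params=[0.0] * num_params)
    modules = [
        ModuleState(local_params=[0.0] * num_params, param_diff=[0.0] * num_params)
        for _ in range(num_modules)
    ]
    return shared, modules


def initialize_by_master(shared, initial_params):
    """Master module writes the initial global parameters and resets the counter."""
    shared.global_params = list(initial_params)
    shared.global_counter = 0
```

A caller would run `create_areas` once, then have the designated master module call `initialize_by_master` with whatever initial values the chosen model or platform prescribes.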

In addition, this initialization may be performed based on the time at which the update thread is first woken.

Although not shown in FIG. 1, the distributed deep learning method according to an embodiment of the present invention may store the various information generated during the distributed deep learning process described above.

Such a heterogeneous cluster-based distributed deep learning method can reduce communication overhead when performing distributed deep learning using heterogeneous GPUs.

In addition, it can provide a distributed deep learning method capable of effectively using GPUs with different performance at the same time.

In addition, it can enable distributed processes with different training speeds to effectively divide the overall training among themselves.

In addition, it can provide excellent scalability of distributed processing by overlapping computation and communication through delayed updates of learned parameters, thereby maximizing the utilization of each GPU.

FIG. 3 is a flowchart showing in detail a distributed deep learning process according to an embodiment of the present invention.

Referring to FIG. 3, the distributed deep learning process according to an embodiment of the present invention first creates, in each of the plurality of heterogeneous deep learning modules, areas for a local parameter, a global-local parameter difference, and a local learning counter (S310).

Thereafter, a global parameter area and a global learning counter area are created in the remote shared memory through the plurality of heterogeneous deep learning modules (S320).

Thereafter, the global parameter and the global learning counter are initialized through a master module, which is any one of the plurality of heterogeneous deep learning modules (S330).

Thereafter, each of the plurality of heterogeneous deep learning modules creates a deep learning training thread for distributed deep learning training and an update thread for updating the remote shared memory (S340).

Thereafter, before performing distributed deep learning training in the deep learning training thread, each of the plurality of heterogeneous deep learning modules generates a wake-up signal to wake the update thread (S350).

Thereafter, each of the plurality of heterogeneous deep learning modules performs the distributed deep learning training and the update of the remote shared memory so that they overlap (S360).

Thereafter, it is determined whether the global learning counter updated in the remote shared memory is greater than or equal to a target end counter (S365), and if it is, the distributed deep learning process is terminated (S370).

If the determination in step S365 shows that the global learning counter is less than the target end counter, the process repeats from step S360 so that distributed deep learning training continues.
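
The flow of FIG. 3 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation described in the patent: `RemoteSharedMemory` is a hypothetical in-process stand-in for the RDMA-accessible remote shared memory, a `threading.Lock` stands in for exclusive remote updates, each thread plays the role of one deep learning module, rank 0 is assumed to be the master module, the training/update overlap of S340 to S360 is collapsed into a single loop body, and the fixed `diff` value is a placeholder for real training.

```python
import threading


class RemoteSharedMemory:
    """Hypothetical in-process stand-in for the remote shared memory."""

    def __init__(self):
        self.lock = threading.Lock()   # stands in for exclusive remote updates
        self.global_param = None       # global parameter area (S320)
        self.global_counter = None     # global learning counter area (S320)


def run_module(rank, shm, target, barrier, trained):
    # S310: local areas owned by this module.
    local_param, diff, local_counter = 0.0, 0.0, 0

    # S330: only the master module (rank 0 here, by assumption) initializes.
    if rank == 0:
        with shm.lock:
            shm.global_param = 0.0
            shm.global_counter = 0
    barrier.wait()  # nobody trains before initialization is done

    while True:
        with shm.lock:
            # S365/S370: stop once the global counter reaches the target.
            if shm.global_counter >= target:
                break
            shm.global_counter += 1
            local_counter = shm.global_counter  # mini-batch assigned to us
        # S360 (collapsed): "train" one mini-batch, then publish the change.
        diff = 0.01                    # placeholder for a real training step
        local_param += diff
        with shm.lock:
            shm.global_param += diff
        trained[rank] += 1


def run_cluster(num_modules, target):
    shm = RemoteSharedMemory()
    barrier = threading.Barrier(num_modules)
    trained = [0] * num_modules
    threads = [
        threading.Thread(target=run_module, args=(r, shm, target, barrier, trained))
        for r in range(num_modules)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shm, trained
```

Calling `run_cluster(3, 50)` returns the shared memory stand-in and the number of mini-batches each module trained; the per-module counts always add up to the target, but how they are split depends on thread scheduling.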

FIG. 6 is a block diagram showing a distributed deep learning device according to an embodiment of the present invention.

Referring to FIG. 6, the distributed deep learning device according to an embodiment of the present invention includes a processor 610 and a memory 620.

The processor 610 shares the global parameter and the global learning counter based on the remote shared memory.

Here, the global parameter and the global learning counter stored in the remote shared memory are data that can be updated exclusively, which allows a plurality of heterogeneous distributed deep learning devices with different processing performance to effectively divide the overall training among themselves.

Here, the remote shared memory may be accessed over a high-speed network that supports remote direct memory access (RDMA). The remote shared memory can therefore allow the processor 610 to access the global parameter and the global learning counter directly.

In addition, the processor 610 creates areas for a local parameter, a global-local parameter difference, and a local learning counter. For example, each of the plurality of heterogeneous deep learning devices according to an embodiment of the present invention may include a local parameter, a global-local parameter difference, and a local learning counter.

Here, the learning counter can be used to count the total number of mini-batches trained by the distributed deep learning processes during distributed deep learning training, and to assign each distributed deep learning process the sequence number of the mini-batch it should train next. A learning counter value assigned to a distributed deep learning device in this way may be stored as its local learning counter. The total number of mini-batches to be trained by the distributed processes included in the distributed deep learning devices may be specified by the user of the distributed deep learning process.

In addition, the processor 610 performs the distributed deep learning training corresponding to the local learning counter, which is allocated based on the global learning counter of the remote shared memory, and the remote shared memory update so that they overlap.

Here, the local parameter can be trained independently through the distributed deep learning training, and the global learning counter kept in the remote shared memory may be updated by compare-and-swap: it is compared with the local learning counter stored in the memory 620 and changed only if the two are equal.
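
The compare-and-swap update of the global learning counter can be illustrated as follows. This is a minimal sketch under stated assumptions: `GlobalLearningCounter`, `claim_next_minibatch`, and `claim_many` are hypothetical names, and a `threading.Lock` emulates the atomicity that a real system would obtain from an atomic operation on the remote shared memory. Each worker reads the counter into its local learning counter, then tries to advance the global counter by one; the attempt succeeds only if no other worker changed the counter in the meantime.

```python
import threading


class GlobalLearningCounter:
    """Hypothetical stand-in for the counter held in remote shared memory."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()  # emulates the atomicity of a remote CAS

    def read(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        """Set the counter to `new` only if it still equals `expected`."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False


def claim_next_minibatch(counter):
    """Return the sequence number of the mini-batch this worker should train."""
    while True:
        local_counter = counter.read()
        # Succeeds only if no other worker advanced the counter in between.
        if counter.compare_and_swap(local_counter, local_counter + 1):
            return local_counter + 1


def claim_many(counter, how_many, out):
    """Claim `how_many` mini-batches and record their sequence numbers."""
    for _ in range(how_many):
        out.append(claim_next_minibatch(counter))
```

Because a failed compare-and-swap is simply retried with a fresh read, no two workers can ever be assigned the same mini-batch sequence number, regardless of how fast each one runs.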

Here, the processor 610 may create a deep learning training thread for the distributed deep learning training and an update thread for the remote shared memory update. In general, the main thread may serve as the deep learning training thread.

Thus, in the present invention, the parameter learned by the distributed deep learning device in the N-th training step can be written to the global parameter during the (N+1)-th training step, and the global parameter read during the N-th training step can be applied to the local parameter before the (N+1)-th training step begins. Unlike the conventional asynchronous method that uses a parameter server, there is therefore no need to wait until a new global parameter has been updated, which saves the time otherwise lost to communication. That is, in the present invention, the deep learning training thread and the update thread use a flow control lock and a wait/wake-up scheme to overlap training and communication (the global parameter update).
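
The hand-off between the two threads can be sketched with a condition variable. This is an illustrative sketch only, with several assumptions that are not taken from the patent text: `run_worker` and its internal names are hypothetical, a `threading.Condition` plays the role of the flow control lock and the wait/wake-up mechanism, a single float stands in for the parameter tensor, `delta = 0.5 * step` is a placeholder for one mini-batch of training, and the exact arithmetic used to merge the fetched global value into the local parameter is assumed. The point it shows is that the result of step N is published while step N+1 is already being computed.

```python
import threading


def run_worker(num_steps):
    cond = threading.Condition()       # plays the role of the flow control lock
    state = {"pending": None, "fetched": 0.0, "stop": False}
    shared = {"global_param": 0.0}     # stand-in for the remote shared memory
    pushed = []                        # order in which results were published

    def update_thread():
        while True:
            with cond:
                # Sleep until woken with a result to publish, or told to stop.
                cond.wait_for(lambda: state["pending"] is not None or state["stop"])
                if state["pending"] is None:
                    return
                step, delta = state["pending"]
            # Communication runs outside the lock, so it overlaps with the
            # training thread's next mini-batch.
            shared["global_param"] += delta
            fetched = shared["global_param"]
            pushed.append(step)
            with cond:
                state["fetched"] = fetched
                state["pending"] = None
                cond.notify_all()

    def training_thread():
        local_param = 0.0
        for step in range(1, num_steps + 1):
            delta = 0.5 * step         # placeholder for training one mini-batch
            local_param += delta
            with cond:
                # Wait only if the previous result is still being published.
                cond.wait_for(lambda: state["pending"] is None)
                # Assumption: adopt the fetched global value plus the change
                # that has not been published yet.
                local_param = state["fetched"] + delta
                state["pending"] = (step, delta)
                cond.notify_all()      # wake the update thread, keep training
        with cond:
            cond.wait_for(lambda: state["pending"] is None)
            state["stop"] = True
            cond.notify_all()

    updater = threading.Thread(target=update_thread)
    trainer = threading.Thread(target=training_thread)
    updater.start()
    trainer.start()
    trainer.join()
    updater.join()
    return shared["global_param"], pushed
```

The training thread blocks only when the previous publication has not finished yet; otherwise it hands over its result and immediately continues with the next mini-batch, which is where the overlap of computation and communication comes from.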

In addition, the processor 610 terminates the distributed deep learning process in consideration of the global learning counter as updated by the remote shared memory update.

For example, because the processor 610 performs distributed deep learning training while exclusively incrementing the global learning counter stored in the remote shared memory, the distributed deep learning process can be terminated when the global learning counter reaches the value specified by the user.

In this way, a slower GPU trains fewer of the total mini-batches and a faster GPU trains more, so that when the number of mini-batches specified by the user of the distributed deep learning process is reached, the plurality of distributed deep learning devices can finish distributed deep learning training at almost the same time.
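
The load-balancing effect of a shared counter can be checked with a small discrete-event simulation. This is a sketch under stated assumptions, not a measurement: `simulate` is a hypothetical helper, each worker is assumed to need a constant time per mini-batch, and communication cost is ignored. A worker claims the next mini-batch from the global counter whenever it becomes free, and stops claiming once the counter has reached the target.

```python
import heapq


def simulate(costs, target):
    """costs maps a worker name to its time per mini-batch (assumed constant)."""
    global_counter = 0
    trained = {name: 0 for name in costs}
    finished_at = {name: 0 for name in costs}
    events = []  # min-heap of (finish_time, worker_name)

    # Every worker claims its first mini-batch at time 0.
    for name, cost in costs.items():
        if global_counter < target:
            global_counter += 1
            heapq.heappush(events, (cost, name))

    while events:
        now, name = heapq.heappop(events)
        trained[name] += 1
        finished_at[name] = now
        # Claim another mini-batch only while the target is not yet reached.
        if global_counter < target:
            global_counter += 1
            heapq.heappush(events, (now + costs[name], name))

    return trained, finished_at
```

For example, `simulate({"fast": 1, "slow": 3}, 100)` assigns 75 mini-batches to the fast worker and 25 to the slow one, and both finish at time 75. With a fixed equal split of 50 mini-batches each, the slow worker would need 150 time units while the fast one sat idle after 50.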

In addition, the processor 610 creates a global parameter area and a global learning counter area in the remote shared memory.

For example, the processor 610 may access the remote shared memory over an RDMA-based high-speed network to create the global parameter area and the global learning counter area.

In addition, the processor 610 initializes the global parameter and the global learning counter.

Here, the global parameter may be initialized in various ways depending on the distributed deep learning device, and the global learning counter may be initialized to 0.

The memory 620 stores the local parameter, the global-local parameter difference, and the local learning counter.

In addition, the memory 620 stores the various information generated during the heterogeneous cluster-based distributed deep learning process according to an embodiment of the present invention, as described above.

Depending on the embodiment, the memory 620 may be configured independently of the distributed deep learning device and support functions for performing distributed deep learning. In this case, the memory 620 may operate as separate mass storage and may include control functionality for carrying out its operations.

Meanwhile, the distributed deep learning device may be equipped with a memory and store information within the device. In one implementation, the memory is a computer-readable medium. In one implementation, the memory may be a volatile memory unit, and in another implementation, the memory may be a non-volatile memory unit. In one implementation, the storage device is a computer-readable medium. In various different implementations, the storage device may include, for example, a hard disk device, an optical disk device, or some other mass storage device.

By using such a distributed deep learning device, communication overhead can be reduced when distributed deep learning is performed using heterogeneous GPUs.

As described above, the heterogeneous cluster-based distributed deep learning method and the device therefor according to the present invention are not limited to the configurations and methods of the embodiments described above; rather, all or part of the embodiments may be selectively combined so that various modifications can be made.

210-1 to 210-N: deep learning modules
220: remote shared memory
230: RDMA high-speed network
610: processor
620: memory

Claims

1. A distributed deep learning method in which each step is performed by a plurality of distributed deep learning devices, the method comprising:
sharing, by the plurality of distributed deep learning devices having different deep learning performance, a global parameter and a global learning counter based on a remote shared memory;
performing, by each of the plurality of distributed deep learning devices, a distributed deep learning process in which distributed deep learning training corresponding to a local learning counter allocated based on the global learning counter is performed using a result of a difference calculation between the global parameter and a local parameter previously stored in each of the plurality of distributed deep learning devices, and in which the global parameter is updated using the difference calculation result while the distributed deep learning training is being performed; and
updating, by the plurality of distributed deep learning devices, the global learning counter based on the local learning counter corresponding to the number of times the distributed deep learning training has been performed, and terminating the distributed deep learning process by determining whether the updated global learning counter is greater than or equal to a preset end counter.

2. The method of claim 1,
wherein, in the performing step,
each of the plurality of distributed deep learning devices updates the local parameter using the difference calculation result, performs the distributed deep learning training using the updated local parameter, and updates the updated local parameter again using a result of performing the distributed deep learning training.

3. The method of claim 1,
further comprising:
creating, in each of the plurality of distributed deep learning devices, areas for the local parameter, the difference calculation result, and the local learning counter;
creating a global parameter area and a global learning counter area in the remote shared memory; and
initializing the global parameter and the global learning counter by any one of the plurality of distributed deep learning devices.

4. The method of claim 1,
wherein the performing step
comprises creating, by each of the plurality of distributed deep learning devices, a deep learning training thread for the distributed deep learning training and an update thread for updating the remote shared memory.

5. (Deleted)

6. The method of claim 4,
wherein the performing step
is carried out based on a flow control lock that is assigned to either the deep learning training thread or the update thread.

7. The method of claim 6,
wherein the flow control lock is assigned first to the deep learning training thread after the distributed deep learning process starts.

8. The method of claim 1,
wherein the plurality of distributed deep learning devices access the remote shared memory over a high-speed network supporting remote direct memory access (RDMA).

9. A distributed deep learning device comprising:
a processor that performs distributed deep learning training, corresponding to a local learning counter allocated based on a global learning counter of a remote shared memory, using a result of a difference calculation between a global parameter and a previously stored local parameter,
performs a distributed deep learning process in which the global parameter is updated using the difference calculation result while the distributed deep learning training is being performed,
updates the global learning counter based on the local learning counter corresponding to the number of times the distributed deep learning training has been performed, and terminates the distributed deep learning process by determining whether the updated global learning counter is greater than or equal to a preset end counter; and
a memory that stores the local parameter, the difference calculation result between the global parameter and the local parameter, and the local learning counter.

10. The device of claim 9,
wherein the processor
updates the local parameter using the difference calculation result, performs the distributed deep learning training using the updated local parameter, and updates the updated local parameter again using a result of performing the distributed deep learning training.