KR20230121303A

KR20230121303A - Method and apparatus for distributed deep learning based on different mini batch size

Info

Publication number: KR20230121303A
Application number: KR1020220017995A
Authority: KR
Inventors: 안신영
Original assignee: 한국전자통신연구원
Priority date: 2022-02-11
Filing date: 2022-02-11
Publication date: 2023-08-18

Abstract

An embodiment of the present invention relates to a distributed deep learning method performed by a plurality of distributed deep learning devices, the method comprising: calculating the total number of training samples; obtaining GPU memory size information for the GPU assigned to each of the plurality of distributed deep learning devices; exchanging the GPU memory size information among the distributed deep learning devices; calculating an adjusted mini-batch size by adjusting the mini-batch size of each GPU based on the GPU memory size information; calculating an adjusted learning rate by adjusting the learning rate based on the adjusted mini-batch size; reading the current value of a global learning counter; comparing the global learning counter value with the total number of training samples; if the global learning counter value is smaller than the total number of training samples, increasing the global learning counter based on the adjusted mini-batch size and then updating a local learning counter; performing training on a mini-batch of the adjusted size; and updating local weights based on the adjusted learning rate. A distributed deep learning method and apparatus are thereby provided that can effectively utilize GPUs with different memory sizes.

Description

Distributed deep learning method and apparatus based on heterogeneous mini batch size {METHOD AND APPARATUS FOR DISTRIBUTED DEEP LEARNING BASED ON DIFFERENT MINI BATCH SIZE}

The present invention relates to a distributed deep learning method, and more particularly to a distributed deep learning method and apparatus based on heterogeneous mini-batch sizes.

Deep learning is a machine learning method based on artificial neural networks, which imitate biological neurons so that machines can learn. Deep learning models have recently been growing into large-scale models to improve the recognition performance of applications, but there is a limit to how far ever-larger models and large-scale training data can be processed on a single machine. Distributed deep learning platform technology is therefore being developed as one way of utilizing large-scale distributed computing resources.

In conventional distributed deep learning, when data-parallel distributed training is performed with the same batch size on a heterogeneous GPU (Graphics Processing Unit) cluster, the batch must fit on the GPU with the smallest memory, so training is forced to use a relatively small mini-batch size. As a result, the computational capacity of the GPUs cannot be fully utilized.

Korean Patent Registration No. 10-2190511 (published on September 16, 2020)

An object of the present invention is to provide a distributed deep learning method and apparatus for effectively utilizing GPUs having different memory sizes.

To achieve the above object, the present invention provides a distributed deep learning method performed by a plurality of distributed deep learning devices, in which each of the plurality of distributed deep learning devices may perform the steps of: calculating the total number of training samples; obtaining GPU memory size information for the GPU assigned to that device; exchanging the GPU memory size information with the other distributed deep learning devices; calculating an adjusted mini-batch size by adjusting the mini-batch size of each GPU based on the GPU memory size information; calculating an adjusted learning rate by adjusting the learning rate based on the adjusted mini-batch size; reading the current value of the global learning counter; comparing the global learning counter value with the total number of training samples; if the global learning counter value is smaller than the total number of training samples, increasing the global learning counter based on the adjusted mini-batch size and then updating a local learning counter; performing training on a mini-batch of the adjusted size; and updating local weights based on the adjusted learning rate.

By using heterogeneous mini-batch sizes, the present invention maximizes the utilization of GPUs with different performance and thereby enables faster distributed learning.

FIG. 1 is a block diagram showing a distributed deep learning system according to an embodiment of the present invention.
FIG. 2 is a block diagram showing a distributed deep learning apparatus according to an embodiment of the present invention.
FIG. 3 is a diagram comparing the conventional way of using mini-batch sizes with the way mini-batch sizes are used according to an embodiment, in asynchronous distributed learning.
FIG. 4 is a diagram comparing the conventional way of using mini-batch sizes with the way mini-batch sizes are used according to an embodiment, in hybrid distributed learning.
FIG. 5 is a diagram showing a proposed method of training with different mini-batch sizes and different numbers of GPUs in hybrid distributed learning.
FIG. 6 is a flowchart illustrating a distributed deep learning method according to an embodiment of the present invention.
FIG. 7 is a diagram showing how the mini-batch size and learning rate are adjusted when performing asynchronous distributed learning according to an embodiment of the present invention.
FIG. 8 is a diagram showing how the mini-batch size and learning rate are adjusted when performing hybrid distributed learning according to an embodiment of the present invention.
FIG. 9 is a block diagram showing the configuration of a computer system according to an embodiment of the present invention.

The advantages and features of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only to make the disclosure of the present invention complete and to fully convey the scope of the invention to those of ordinary skill in the art to which the present invention belongs, and the present invention is defined only by the scope of the claims. Like reference numerals designate like elements throughout the specification.

Although terms such as "first" and "second" are used to describe various elements, these elements are not limited by such terms. The terms are used only to distinguish one element from another. Therefore, a first element mentioned below may also be a second element within the technical spirit of the present invention.

The terms used in this specification are for describing the embodiments and are not intended to limit the present invention. In this specification, singular forms also include plural forms unless the context specifically states otherwise. As used herein, "comprises" or "comprising" means that a stated element or step does not preclude the presence or addition of one or more other elements or steps.

Unless otherwise defined, all terms used herein may be interpreted with the meanings commonly understood by those of ordinary skill in the art to which the present invention belongs. In addition, terms defined in commonly used dictionaries are not to be interpreted in an idealized or excessive manner unless explicitly and specifically defined.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. In the description, the same or corresponding elements are given the same reference numerals, and redundant descriptions of them are omitted.

FIG. 1 is a block diagram showing a distributed deep learning system according to an embodiment of the present invention.

Referring to FIG. 1, a distributed deep learning system according to an embodiment may include a plurality of distributed deep learning devices 100. The plurality of distributed deep learning devices 100 may access a remote shared memory unit 200 through an RDMA high-speed network 300.

The plurality of distributed deep learning devices 100 may correspond to compute nodes that perform deep learning training, and may include distributed deep learning devices 100 whose performance differs from one another.

Each of the plurality of distributed deep learning devices 100 may create areas for local parameters, the global-local parameter difference, and a local learning counter.

Each of the plurality of distributed deep learning devices 100 may perform, in an overlapping manner, the distributed deep learning training corresponding to the local learning counter allocated based on the global learning counter and the update of the remote shared memory unit 200.

Each of the plurality of distributed deep learning devices 100 may train its local parameters on its own through distributed deep learning training, and the global learning counter stored in the remote shared memory unit 200 may be updated in a mutually exclusive manner among the plurality of distributed deep learning devices 100.

FIG. 2 is a block diagram showing a distributed deep learning apparatus according to an embodiment of the present invention.

Referring to FIG. 2, a distributed deep learning device 100 according to an embodiment may include a memory 110 and a processor 120.

The memory 110 may store various data for overall operation, such as a control program for performing the distributed deep learning method according to the embodiment. Specifically, the memory may store a plurality of application programs running on the distributed deep learning device, as well as data and instructions for operating the distributed deep learning device.

The memory 110 may include, but is not limited to, magnetic storage media or flash storage media.

The processor 120 is a kind of central processing unit and may control the overall operation of the distributed deep learning device 100.

The processor 120 may include any type of device capable of processing data. Here, a "processor" may refer to a data processing device embedded in hardware that has circuitry physically structured to perform functions expressed as code or instructions contained in a program. Examples of a "processor" may include a GPU or an accelerator.

Examples of such a data processing device embedded in hardware include a microprocessor, a central processing unit (CPU), a processor core, a multiprocessor, an application-specific integrated circuit (ASIC), and a field programmable gate array (FPGA), but the present invention is not limited thereto.

Hereinafter, the distributed deep learning method performed by the processor 120 of the distributed deep learning device 100 will be described in detail.

FIG. 3 is a diagram comparing the conventional way of using mini-batch sizes with the way mini-batch sizes are used according to an embodiment, in asynchronous distributed learning.

Referring to FIG. 3, conventional distributed deep learning devices are all set to the same mini-batch size regardless of GPU memory size. This is because the mini-batch size of the distributed deep learning device with the smallest GPU memory is applied uniformly to all distributed deep learning devices.

To maximize computational efficiency, the distributed deep learning devices according to the embodiment may be adjusted so that each device has a different batch size, in proportion to its GPU memory size relative to the minimum GPU memory size. For example, taking the smallest mini-batch size of 30 as the baseline, the mini-batch sizes may be adjusted to 32, 43, and 65.

FIG. 4 is a diagram comparing the conventional way of using mini-batch sizes with the way mini-batch sizes are used according to an embodiment, in hybrid distributed learning.

In hybrid distributed learning, synchronous distributed learning is performed among the multiple GPUs within a node, and a representative distributed deep learning device for each node then performs a second, asynchronous stage of distributed learning (updating the global parameters) with the remote shared memory unit 200, which acts as a parameter server.

Referring to FIG. 4, when the nodes have different numbers of GPUs, the conventional method trains using only as many GPUs per node as the node with the fewest GPUs has (here, only two GPUs per node).

The method according to the embodiment can perform distributed learning effectively even on nodes that have different numbers of GPUs with different memory sizes. The total mini-batch size per node for the distributed deep learning device group of each node can be calculated by Equation 1.

[Equation 1]

Total mini-batch size per node = adjusted mini-batch size per GPU x number of GPUs

As a result, the distributed deep learning device groups according to the embodiment have a different total mini-batch size for each node. For example, in Node 1 the adjusted GPU mini-batch size is 32 and the number of GPUs is 2, so the total mini-batch size for the node is 64.

As described above, to perform distributed learning with different batch sizes, the progress of distributed learning must be controlled by the number of training data samples processed rather than by the number of mini-batches trained.

In addition, the learning rate of each distributed deep learning device (for asynchronous distributed learning) or each distributed deep learning device group (for hybrid distributed learning) must also be set differently, in proportion to the adjusted GPU batch size (or the total mini-batch size per node).

The embodiment assumes the GPU environments shown in FIGS. 3 and 4. If the GPUs within one node are of different types, the asynchronous distributed learning of FIG. 3 is performed; if the GPU types differ between nodes but are the same within each node, hybrid distributed learning can be performed.

FIG. 5 is a diagram showing a proposed method of training with different mini-batch sizes and different numbers of GPUs in hybrid distributed learning.

As shown in FIG. 5, in hybrid distributed learning a distributed deep learning device group need not be limited to a single node: several nodes equipped with the same type of GPU can perform distributed learning together as one distributed deep learning device group.

That is, distributed deep learning device group 1 may include node 1 and node 5, in which GPU1 is installed. Group 1 may include 6 GPUs across 2 nodes.

Distributed deep learning device group 2 may include node 2, in which GPU2 is installed. Group 2 may include 6 GPUs in 1 node.

Distributed deep learning device group 3 may include node 3, node 4, and node 7, in which GPU3 is installed. Group 3 may include 12 GPUs across 3 nodes.

Distributed deep learning device group 4 may include node 6, in which GPU5 is installed. Group 4 may include 4 GPUs in 1 node.

Distributed deep learning device group 5 may include node 8, in which GPU4 is installed. Group 5 may include 4 GPUs in 1 node.

The adjusted mini-batch size per GPU can be calculated as the product of the initial GPU mini-batch size and the relative GPU memory ratio of the group.

The total mini-batch size per group can be calculated as the product of the adjusted GPU mini-batch size and the number of GPUs in the group.

The adjusted learning rate can be calculated as the product of the initial learning rate and (total mini-batch size per group / initial GPU mini-batch size).
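The grouping of FIG. 5 and the three group-level calculations above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the per-node GPU counts, and the memory sizes in the usage note are hypothetical, and integer floor division stands in for "multiply by the relative memory ratio, then truncate".

```python
from collections import defaultdict


def build_groups(nodes):
    """Merge nodes that share a GPU type into one device group.

    nodes: {node_name: (gpu_type, num_gpus, gpu_mem)} with integer gpu_mem.
    Returns {gpu_type: {"nodes": [...], "num_gpus": int, "gpu_mem": int}}.
    """
    groups = defaultdict(lambda: {"nodes": [], "num_gpus": 0, "gpu_mem": 0})
    for name, (gpu_type, num_gpus, gpu_mem) in nodes.items():
        group = groups[gpu_type]
        group["nodes"].append(name)
        group["num_gpus"] += num_gpus
        group["gpu_mem"] = gpu_mem  # same GPU type implies same memory size
    return dict(groups)


def group_settings(groups, init_batch, init_lr):
    """Return {gpu_type: (per_gpu_batch, group_batch, group_lr)}."""
    min_mem = min(g["gpu_mem"] for g in groups.values())
    settings = {}
    for gpu_type, g in groups.items():
        # initial batch x relative memory ratio, fractional part dropped
        per_gpu_batch = init_batch * g["gpu_mem"] // min_mem
        group_batch = per_gpu_batch * g["num_gpus"]
        # the learning rate stays a real number (no truncation)
        group_lr = init_lr * (group_batch / init_batch)
        settings[gpu_type] = (per_gpu_batch, group_batch, group_lr)
    return settings
```

For instance, with hypothetical nodes `{"node1": ("GPU1", 2, 16), "node5": ("GPU1", 4, 16), "node2": ("GPU2", 6, 24)}` and an initial batch size of 32, `build_groups` yields a GPU1 group of 6 GPUs and a GPU2 group of 6 GPUs, and `group_settings` assigns them per-GPU batch sizes of 32 and 48.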

Hereinafter, the distributed deep learning method will be described in more detail with reference to FIG. 6.

FIG. 6 is a flowchart illustrating a distributed deep learning method according to an embodiment of the present invention.

Referring to FIG. 6, each of the plurality of distributed deep learning devices according to the embodiment may calculate the total number of training samples (S100).

The total number of training samples can be calculated as in Equation 2.

[Equation 2]

Total number of training samples = total number of training epochs x number of training samples per epoch

Here, the number of training samples per epoch is either given in advance or, if the initial mini-batch size and the number of steps per epoch are provided, calculated as their product.
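Equation 2 and its fallback can be written as a short helper. This is only a sketch; the function and parameter names are illustrative and do not come from the specification.

```python
def total_training_samples(epochs, samples_per_epoch=None,
                           init_batch=None, steps_per_epoch=None):
    """Equation 2: total samples = epochs x samples per epoch.

    If samples_per_epoch is not given, it is derived as
    init_batch x steps_per_epoch, as the description allows.
    """
    if samples_per_epoch is None:
        if init_batch is None or steps_per_epoch is None:
            raise ValueError(
                "need samples_per_epoch, or both init_batch and steps_per_epoch"
            )
        samples_per_epoch = init_batch * steps_per_epoch
    return epochs * samples_per_epoch
```

With 10 epochs, an initial mini-batch size of 64, and 100 steps per epoch, the helper returns 64,000 samples.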

Each of the plurality of distributed deep learning devices may obtain, from the GPU driver, the memory size information of the GPU assigned to that device (S110).

Each of the plurality of distributed deep learning devices may share the GPU memory size information among all distributed deep learning devices using a collective communication technique such as MPI Allreduce (S120).
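The specification names MPI Allreduce but does not say how the reduction is arranged. One common arrangement, assumed here, is for each rank to contribute a vector that is zero everywhere except at its own index, so that a sum reduction leaves every rank holding the full list of memory sizes. The sketch below simulates that in pure Python; `allreduce_sum` is a stand-in for a real MPI call, not an MPI API.

```python
def allreduce_sum(contributions):
    """Simulated sum-Allreduce: every rank receives the element-wise sum.

    contributions: one equal-length vector per rank.
    Returns one result vector per rank (all identical).
    """
    reduced = [sum(column) for column in zip(*contributions)]
    return [list(reduced) for _ in contributions]


def share_memory_sizes(local_sizes):
    """Give every rank the GPU memory sizes of all ranks.

    local_sizes[rank] is the memory size that rank read from its driver.
    Each rank places its size in its own slot and zeros elsewhere.
    """
    world_size = len(local_sizes)
    contributions = []
    for rank, size in enumerate(local_sizes):
        vector = [0] * world_size
        vector[rank] = size
        contributions.append(vector)
    return allreduce_sum(contributions)
```

After the exchange each rank can take the minimum of the shared list locally, which is the baseline the later adjustment steps need. In a real deployment the simulated function would be replaced by the collective of whichever MPI binding is in use.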

Each of the plurality of distributed deep learning devices may take the initial mini-batch size, GPU memory size, and initial learning rate of the GPU with the smallest GPU memory as the baseline, and adjust the mini-batch size and learning rate of each distributed deep learning device or each distributed deep learning device group in proportion to the GPU memory size and, for hybrid distributed learning, the number of GPUs (S130, S140).

Here, the step of calculating the learning rate may differ depending on whether hybrid distributed learning is performed (S133).

When hybrid distributed learning is performed, after the adjusted mini-batch size per GPU is calculated, the total mini-batch size per node of the distributed deep learning device group is calculated (S220), and an adjusted learning rate proportional to the adjusted mini-batch size can then be calculated (S140).

When asynchronous distributed learning is performed, on the other hand, after the adjusted mini-batch size per GPU is calculated, an adjusted learning rate proportional to the adjusted mini-batch size can be calculated directly (S140).

FIG. 7 is a diagram showing how the mini-batch size and learning rate are adjusted when performing asynchronous distributed learning according to an embodiment of the present invention.

As shown in FIG. 7, when there are four GPUs with different memory sizes, a relative memory ratio can be calculated for each GPU, taking GPU1, which has the smallest GPU memory, as the baseline.

Using the calculated relative memory ratio of each GPU, the GPU mini-batch size can be calculated as in Equation 3.

[Equation 3]

Adjusted GPU mini-batch size = initial GPU mini-batch size x relative memory ratio per GPU

For example, if the predefined initial GPU mini-batch size is 64, the adjusted GPU mini-batch size may be calculated by multiplying the initial GPU mini-batch size by the relative memory ratio of each GPU and truncating the fractional part.

[Equation 4]

Adjusted learning rate = initial learning rate x (adjusted GPU mini-batch size / initial GPU mini-batch size)

Then, as in Equation 4, the learning rate may also be adjusted by the ratio of the adjusted GPU mini-batch size to the initial GPU mini-batch size. The adjusted learning rate keeps its real value; its fractional part is not truncated.
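Equations 3 and 4 can be sketched together for the asynchronous case. The memory sizes in the usage note are hypothetical, since FIG. 7 is not reproduced here; the initial mini-batch size of 64 follows the example in the text. Integer floor division is used so that the truncation is exact when memory sizes are integers.

```python
def adjust_async(mem_sizes, init_batch, init_lr):
    """Per-GPU mini-batch size and learning rate for asynchronous learning.

    mem_sizes: integer GPU memory size of each device (any common unit).
    Returns a list of (adjusted_batch, adjusted_lr), one entry per device.
    """
    min_mem = min(mem_sizes)
    adjusted = []
    for mem in mem_sizes:
        # Equation 3: initial batch x (mem / min_mem), fractional part dropped
        batch = init_batch * mem // min_mem
        # Equation 4: scale the learning rate by the same batch ratio,
        # keeping the result as a real number
        lr = init_lr * (batch / init_batch)
        adjusted.append((batch, lr))
    return adjusted
```

With hypothetical memory sizes of 12, 16, 24, and 32 GB and an initial mini-batch size of 64, the adjusted mini-batch sizes are 64, 85, 128, and 170; the second and fourth values show the truncation (85.33 and 170.67 before rounding down).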

FIG. 8 is a diagram showing how the mini-batch size and learning rate are adjusted when performing hybrid distributed learning according to an embodiment of the present invention.

All GPUs in a node are assumed to be of the same type.

[Equation 5]

Adjusted mini-batch size per GPU = initial GPU mini-batch size x relative GPU memory ratio per node

As in Equation 5, the adjusted GPU mini-batch size can be calculated by deriving the relative GPU memory ratio of each node from the ratio of the nodes' GPU memory sizes, multiplying that value by the initial GPU batch size, and truncating the fractional part.

[Equation 6]

Total mini-batch size per node = adjusted GPU mini-batch size x number of GPUs per node

Because gradients are aggregated among the GPUs of a node, the total mini-batch size per node can be calculated, as in Equation 6, by multiplying the node's adjusted GPU batch size by the number of GPUs in the node.

[Equation 7]

Adjusted learning rate per node = initial learning rate x (total mini-batch size per node / initial GPU mini-batch size)

As in Equation 7, the initial learning rate can be scaled by the ratio of the total mini-batch size per node to the initial GPU mini-batch size. Here too, the adjusted learning rate keeps its real value; its fractional part is not truncated.
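Equations 5 to 7 chain together per node, which the following sketch makes explicit. The node names, memory sizes, and dictionary layout are assumptions for illustration; the first node reproduces the Node 1 figures given earlier in the description (adjusted GPU mini-batch size 32, 2 GPUs, node total 64).

```python
def adjust_hybrid(nodes, init_batch, init_lr):
    """Per-node settings for hybrid distributed learning.

    nodes: {node_name: (gpu_mem, num_gpus)}; all GPUs in a node share a type.
    Returns {node_name: {"per_gpu_batch", "node_batch", "lr"}}.
    """
    min_mem = min(mem for mem, _ in nodes.values())
    settings = {}
    for name, (mem, num_gpus) in nodes.items():
        per_gpu_batch = init_batch * mem // min_mem  # Equation 5 (truncated)
        node_batch = per_gpu_batch * num_gpus        # Equation 6
        lr = init_lr * (node_batch / init_batch)     # Equation 7 (real-valued)
        settings[name] = {
            "per_gpu_batch": per_gpu_batch,
            "node_batch": node_batch,
            "lr": lr,
        }
    return settings
```

Note that under Equation 7 the learning rate grows with the number of GPUs as well as with memory size, because the numerator is the node total rather than the per-GPU batch size.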

Returning to FIG. 6, each of the plurality of distributed deep learning devices may read the value of the global learning counter (S150).

Each of the plurality of distributed deep learning devices may end training if the global learning counter value is greater than or equal to the total number of training samples (S170).

If, on the other hand, the global learning counter value is less than the total number of training samples, each of the plurality of distributed deep learning devices may exclusively increase the global learning counter by the mini-batch size adjusted for that device. In the case of hybrid distributed learning, the global learning counter may be exclusively increased by the total mini-batch size per node.

Once the global learning counter has been successfully increased, the local learning counter is set to that global learning counter value (S180), and training can be performed on data of the adjusted GPU mini-batch size (S190). In the case of hybrid distributed learning, gradient aggregation among the distributed deep learning devices within the same node may additionally be performed.

After training, each of the plurality of distributed deep learning devices may update its local weights using the adjusted learning rate (S200).

Each of the plurality of distributed deep learning devices may read the latest global weights and calculate the difference between the global and local weights. Each device may then apply a second update to the local weights and additionally update the global weights, completing the distributed deep learning method (S210).
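The counter-driven loop of steps S150 to S200 can be simulated with threads standing in for devices and a lock standing in for the exclusive update on the remote shared memory. This is a sketch under several assumptions: the read, compare, and increment steps are merged into one atomic `claim` call, the local counter is taken to be the counter value after the increment, and the training and weight-update steps are replaced by a log entry. All names are illustrative.

```python
import threading


class GlobalCounter:
    """Stand-in for the global learning counter in remote shared memory."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def claim(self, batch, total):
        """Exclusively add `batch` if fewer than `total` samples are claimed.

        Returns the increased counter value, or None when training is done.
        """
        with self._lock:
            if self._value >= total:
                return None
            self._value += batch
            return self._value


def worker(counter, batch, total, log):
    while True:
        local_counter = counter.claim(batch, total)
        if local_counter is None:
            break  # counter reached the total number of training samples
        # Placeholder for: train on `batch` samples, then update the local
        # weights with this device's adjusted learning rate.
        log.append((batch, local_counter))


def run(batch_sizes, total):
    """Run one simulated device per batch size; return (final_count, logs)."""
    counter = GlobalCounter()
    logs = [[] for _ in batch_sizes]
    threads = [
        threading.Thread(target=worker, args=(counter, b, total, log))
        for b, log in zip(batch_sizes, logs)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.read(), logs
```

Because progress is measured in samples rather than steps, devices with larger mini-batches simply claim larger slices of the same budget. The final counter value can overshoot the total by less than one mini-batch of the largest device, since a claim is allowed whenever the counter is still below the total.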

FIG. 9 is a block diagram showing the configuration of a computer system according to an embodiment of the present invention.

The distributed deep learning device according to the embodiment may be implemented in a computer system, such as a computer-readable recording medium.

Referring to FIG. 9, a computer system 1000 according to an embodiment may include one or more processors 1010, a memory 1030, a user interface input device 1040, a user interface output device 1050, and storage 1060, which communicate with one another through a bus 1020. In addition, the computer system 1000 may further include a network interface 1070 connected to a network.

The processor 1010 may be a central processing unit or a semiconductor device that executes programs or processing instructions stored in the memory or the storage. The memory 1030 and the storage 1060 may be storage media including at least one of a volatile medium, a non-volatile medium, a removable medium, a non-removable medium, a communication medium, and an information delivery medium. For example, the memory 1030 may include a ROM 1031 or a RAM 1032.

According to one embodiment, a computer-readable recording medium storing a computer program may include instructions for causing a processor to perform a method comprising: an operation of calculating the total number of training samples; an operation of obtaining the GPU memory size assigned to each of the plurality of distributed deep learning devices; an operation of exchanging the GPU memory sizes among the distributed deep learning devices; an operation of calculating an adjusted mini-batch size by adjusting the mini-batch size for each GPU based on the GPU memory size; an operation of calculating an adjusted learning rate by adjusting the learning rate based on the adjusted mini-batch size; an operation of reading the current global learning counter value; an operation of comparing the global learning counter value with the total number of training samples; an operation in which each of the plurality of distributed deep learning devices, when the global learning counter value is smaller than the total number of training samples, increments the global learning counter based on the adjusted mini-batch size and then updates a local learning counter; an operation in which each of the plurality of distributed deep learning devices performs training with the adjusted mini-batch size; and an operation in which each of the plurality of distributed deep learning devices updates local weights based on the adjusted learning rate.

The specific implementations described in the present invention are embodiments and do not limit the scope of the present invention in any way. For brevity of the specification, descriptions of conventional electronic components, control systems, software, and other functional aspects of such systems may be omitted. In addition, the connecting lines or connecting members between components shown in the drawings illustrate functional connections and/or physical or circuit connections by way of example; in an actual device they may be represented as replaceable or additional functional connections, physical connections, or circuit connections. Furthermore, unless a component is specifically described with terms such as "essential" or "important", it may not be a necessary component for applying the present invention.

Therefore, the spirit of the present invention should not be limited to the embodiments described above, and not only the claims set forth below but also all scopes equivalent to, or equivalently modified from, those claims fall within the scope of the spirit of the present invention.

100: distributed deep learning device
200: remote shared memory unit
300: network

Claims

A distributed deep learning method performed by a plurality of distributed deep learning devices, the method comprising:
calculating the total number of training samples;
obtaining GPU memory size information assigned to each of the plurality of distributed deep learning devices;
exchanging the GPU memory size information between the distributed deep learning devices;
calculating an adjusted mini-batch size by adjusting a mini-batch size for each GPU based on the GPU memory size information;
calculating an adjusted learning rate by adjusting a learning rate based on the adjusted mini-batch size;
reading a global learning counter value;
comparing the global learning counter value with the total number of training samples;
when the global learning counter value is smaller than the total number of training samples, incrementing the global learning counter based on the adjusted mini-batch size and then updating a local learning counter;
performing training with the adjusted mini-batch size; and
updating local weights based on the adjusted learning rate.