KR102555268B1

KR102555268B1 - Asymmetric centralized training for distributed deep learning

Info

Publication number: KR102555268B1
Application number: KR1020200174036A
Authority: KR
Inventors: 김상욱; 고윤용; 최기봉
Original assignee: 한양대학교 산학협력단
Priority date: 2020-12-14
Filing date: 2020-12-14
Publication date: 2023-07-13
Also published as: WO2022131525A1; KR20220084508A

Abstract

A parameter server-based asymmetric distributed learning technique is disclosed. A distributed learning system based on asymmetric communication according to an embodiment may include: a plurality of workers, each of which updates its local parameters by taking a weighted average of the worker's local parameters and global parameters that were updated based on learning results calculated by training a learning model with local data; and a parameter server that transmits the global parameters, updated based on the learning results calculated by the plurality of workers, to the plurality of workers through asymmetric communication according to a predefined update rule.

Description

Parameter server-based asymmetric distributed learning technique {ASYMMETRIC CENTRALIZED TRAINING FOR DISTRIBUTED DEEP LEARNING}

The following description relates to an asymmetric distributed learning technique for resolving the server bottleneck problem.

Existing parameter server-based distributed learning techniques have the problem that the parameter server becomes the training bottleneck as the number of workers and the number of model parameters grow. This bottleneck arises because the workers communicate symmetrically with the parameter server. Symmetric communication between a worker and the parameter server means that the worker sends its learning result to the parameter server and then receives the latest parameters back from it.

FIG. 1 illustrates existing parameter server-based distributed learning schemes. FIG. 1(a) shows synchronous distributed learning and FIG. 1(b) shows asynchronous distributed learning. In both cases, each worker sends its learning result to the parameter server (Send.) and waits for the parameter server to return the latest parameters (Wait. & Recv.).

Because of this symmetric communication, a worker that has finished one training step (Iter. t) cannot start the next step (Iter. t+1) right away, since it is waiting for the parameter server. The problem can become more severe as the number of model parameters and the number of workers increase. To confirm this problem of the prior art, the training time of the workers may be analyzed while the number of workers is increased. FIG. 2 breaks down the time spent in each training step by number of workers, and shows that the overhead caused by symmetric communication grows as the number of workers increases.

Having identified symmetric communication as the cause of the parameter server bottleneck in existing parameter server-based distributed learning methods, a new distributed learning method and system based on asymmetric communication can be provided to resolve it.

A parameter update strategy can be provided that minimizes the loss of model accuracy that asymmetric communication may cause.

An asymmetric distributed learning system may include: a plurality of workers, each of which updates its local parameters by taking a weighted average of the worker's local parameters and global parameters that were updated based on learning results calculated by training a learning model with local data; and a parameter server that transmits the global parameters, updated based on the learning results calculated by the plurality of workers, to the plurality of workers through asymmetric communication according to a predefined update rule.

The plurality of workers may calculate gradients by training the learning model with the local data, and may update the local parameters of each of the plurality of workers with the calculated gradients.

When the updated global parameters are transmitted from the parameter server, the plurality of workers may update the worker's local parameters by taking a weighted average of the updated global parameters and the worker's local parameters.

When the updated global parameters are not transmitted from the parameter server, the plurality of workers may proceed to the next learning process.

The parameter server may collect the gradients calculated by the plurality of workers and update the global parameters using the collected gradients.

The parameter server may use the gradients collected from the plurality of workers to search for global parameters that minimize a loss function in distributed learning, and may transmit the found global parameters to the plurality of workers at every update period.

An asymmetric distributed learning method performed by a worker may include: calculating a learning result by training a learning model with local data; transmitting the calculated learning result to a parameter server; and receiving updated global parameters from the parameter server through asymmetric communication based on a predefined parameter update rule, wherein, in the receiving, the worker's local parameters may be updated by taking a weighted average of the updated global parameters and the worker's local parameters.

The calculating may include, in a first learning process, calculating gradients as the plurality of workers train the learning model with local data, and updating the local parameters of each of the plurality of workers with the calculated gradients.

The receiving may include: when the updated global parameters are received from the parameter server, updating the worker's local parameters by taking a weighted average of the updated global parameters and the worker's local parameters; and when the updated global parameters are not transmitted from the parameter server, proceeding to a second learning process.

An asymmetric distributed learning method performed by a parameter server may include: collecting learning results calculated by a plurality of workers training a learning model with local data; updating global parameters using the collected learning results; and transmitting the updated global parameters to the plurality of workers through asymmetric communication based on a preset update rule, wherein the local parameters of the plurality of workers may be updated by taking a weighted average of the updated global parameters transmitted to the plurality of workers and the local parameters of the plurality of workers.

The updating may include collecting the gradients calculated by the plurality of workers and using the collected gradients to search for global parameters that minimize a loss function in distributed learning.

The transmitting may include transmitting, at every update period, the global parameters found using the collected gradients to minimize the loss function in distributed learning to the plurality of workers.

With asymmetric communication-based learning, workers can continue training without waiting for the parameter server, regardless of whether the parameter server is a bottleneck. The learning model can therefore be trained to high accuracy in a short time.

Updating the global parameters based on the parameter update rule minimizes the model loss that asymmetric communication may cause, thereby preventing performance degradation.

FIG. 1 is a diagram illustrating existing parameter server-based distributed learning schemes.
FIG. 2 is a graph analyzing the time spent in each training step by number of workers.
FIG. 3 is a diagram illustrating an asymmetric communication-based learning operation in a distributed learning system according to an embodiment.
FIG. 4 is an example of an algorithm illustrating the operation of a worker in a distributed learning system according to an embodiment.
FIG. 5 is an example of an algorithm illustrating the operation of a parameter server in a distributed learning system according to an embodiment.
FIG. 6 is a diagram illustrating the configuration of a distributed learning system according to an embodiment.
FIGS. 7 and 8 are diagrams illustrating an asymmetric communication-based learning method in a distributed learning system according to an embodiment.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings.

FIG. 3 is a diagram illustrating an asymmetric communication-based learning operation in a distributed learning system according to an embodiment.

Whether training is synchronous or asynchronous, each worker sends its learning result to the parameter server and waits for the parameter server to return the updated parameters. In either case it is important to reduce worker idle time. The following therefore describes a learning operation based on asymmetric communication between the workers and the parameter server, in which a worker does not wait to receive updated parameters from the parameter server and so trains without idle time.

Asymmetric communication-based learning, shown in FIG. 3, is a scheme in which workers send the learning result calculated in each learning process to the parameter server and proceed to the next learning process without waiting for the parameter server. Compared with the synchronous learning of FIG. 1(a) and the asynchronous learning of FIG. 1(b), the asymmetric communication-based learning of FIG. 3 lets each worker train wait-free, regardless of whether the parameter server is a bottleneck.

In addition, to guarantee that the gap between each worker's model parameters and the parameter server's global parameters does not grow beyond a certain level, a parameter server-driven update strategy is proposed in which the parameter server periodically updates the workers' model parameters. To perform this parameter server-driven update accurately for each worker, the fundamental problem of distributed deep learning is defined, and the rule for updating a worker's model parameters can be defined from that problem.

The distributed learning problem is to find the global parameters that minimize a loss function, and can be defined as in Equation 1.

Equation 1:

Here, n is the number of workers, and the remaining symbols in Equation 1 denote, in order, the global parameters, the local parameters of worker i, the data held by worker i, and a constant.

Applying gradient descent to Equation 1 from the standpoint of each worker (for example, worker i) yields Equation 2. Based on the problem definition of Equation 2, a parameter update rule for the worker can be defined.

Equation 2:

Defining two shorthand terms from Equation 2 then yields the final update rule for worker i.

Equation 3:

Equation 3 means that a worker can update its local parameters by taking a weighted average of the local parameters and the global parameters. For example, the weight in Equation 3 may be set from a constant k and the number of workers n so that it decreases as the number of workers increases.
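The formulas for Equations 1 to 3 are not reproduced in this text. One form consistent with the surrounding definitions is sketched here, assuming an elastic-averaging-style quadratic penalty. The symbols $\tilde{x}$ (global parameters), $x_i$ (local parameters of worker $i$), $\xi_i$ (data of worker $i$), $\rho$ (constant), $\eta$ (learning rate), $\alpha$ (weight) and the loss $f$ are assumed notation and are not taken from the original.

```latex
% Equation 1 (assumed form): distributed learning objective
\min_{x_1,\dots,x_n,\;\tilde{x}} \;
  \sum_{i=1}^{n} \left[ f(x_i;\xi_i)
  + \frac{\rho}{2}\,\lVert x_i - \tilde{x} \rVert^{2} \right]

% Equation 2 (assumed form): one gradient descent step for worker i
x_i \leftarrow x_i - \eta\,\nabla f(x_i;\xi_i) - \eta\rho\,(x_i - \tilde{x})

% Equation 3 (assumed form): let x_i' = x_i - \eta\,\nabla f(x_i;\xi_i)
% and \alpha = \eta\rho, and apply the pull toward \tilde{x} to x_i'
x_i \leftarrow x_i' - \alpha\,(x_i' - \tilde{x})
     = (1-\alpha)\,x_i' + \alpha\,\tilde{x}
```

Applying the penalty term after the local gradient step, rather than in the same step, is an assumption made so that the result is the weighted average the text describes. A schedule such as $\alpha = k/n$ would be one way, among others, to make the weight decrease with the number of workers.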

To update the workers' local parameters, the parameter server may define a periodic update rule (strategy) under which it transmits the global parameters to the workers only after it has collected the gradients of all workers. For example, when the update period is large, each worker trains mostly on its own local data, whereas when the update period is small, the local parameters are updated by the global parameters more often.
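The weight and the update period described above can be illustrated with a few small helpers. This is a minimal sketch: the schedule `alpha = k / n`, the function names, and the convention that a broadcast happens on every step divisible by the period are all assumptions made for illustration, not details given in the text.

```python
from typing import List


def mixing_weight(k: float, n_workers: int) -> float:
    """Hypothetical schedule alpha = k / n, clipped to [0, 1]."""
    if n_workers <= 0:
        raise ValueError("n_workers must be positive")
    return max(0.0, min(1.0, k / n_workers))


def weighted_average(local: List[float], global_params: List[float],
                     alpha: float) -> List[float]:
    """Element-wise (1 - alpha) * local + alpha * global."""
    return [(1.0 - alpha) * l + alpha * g
            for l, g in zip(local, global_params)]


def broadcast_steps(total_steps: int, period: int) -> List[int]:
    """Steps (1-based) at which the server would send global parameters."""
    return [t for t in range(1, total_steps + 1) if t % period == 0]


# More workers give a smaller weight on the global parameters.
alpha_4 = mixing_weight(k=2.0, n_workers=4)
alpha_16 = mixing_weight(k=2.0, n_workers=16)

# Pull a worker's parameters toward the global parameters.
mixed = weighted_average([1.0, 3.0], [3.0, 1.0], alpha_4)

# A small period means frequent pulls; a large period means mostly local training.
frequent = broadcast_steps(total_steps=10, period=2)
rare = broadcast_steps(total_steps=10, period=5)
```

With a period of 2 the worker is pulled toward the global parameters five times in ten steps, while with a period of 5 it is pulled only twice and otherwise follows its local data.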

FIG. 4 is an example of an algorithm illustrating the operation of a worker in a distributed learning system according to an embodiment.

The worker may compute a gradient (line 2) and update its local parameters with the computed gradient (line 3). The worker may send the computed gradient to the parameter server (line 4). If updated global parameters have been transmitted from the parameter server, the worker may update its local parameters by taking a weighted average of the updated global parameters and its local parameters; if not, it may proceed to the next learning process (lines 5-7). In other words, the worker proceeds to the next iteration without waiting for the parameter server. With this asymmetric learning, the worker can train the learning model regardless of whether the parameter server is a bottleneck.
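The worker steps described above can be sketched in Python. This is not the algorithm of FIG. 4 itself, which is not reproduced here; it is a toy version under stated assumptions: a scalar model with loss 0.5 * (x - target)^2, an in-process `queue.Queue` standing in for the network, and hypothetical names (`Worker`, `inbox`, `step`). The key point is the non-blocking check for global parameters.

```python
import queue


class Worker:
    """Toy worker; `target` stands in for the worker's local data."""

    def __init__(self, x0: float, target: float, lr: float, alpha: float):
        self.x = x0
        self.target = target
        self.lr = lr
        self.alpha = alpha
        self.inbox = queue.Queue()  # global parameters pushed by the server

    def step(self, outbox: queue.Queue) -> bool:
        grad = self.x - self.target      # compute the gradient
        self.x -= self.lr * grad         # update the local parameter
        outbox.put(grad)                 # send the gradient to the server
        try:
            # Non-blocking check: never wait for the server.
            global_x = self.inbox.get_nowait()
        except queue.Empty:
            return False                 # nothing arrived; go to next step
        # Weighted average of local and global parameters.
        self.x = (1.0 - self.alpha) * self.x + self.alpha * global_x
        return True


to_server = queue.Queue()
worker = Worker(x0=0.0, target=4.0, lr=0.5, alpha=0.5)

pulled_first = worker.step(to_server)   # inbox empty: local update only
x_after_first = worker.x

worker.inbox.put(6.0)                   # server pushes a global parameter
pulled_second = worker.step(to_server)  # local update, then weighted average
x_after_second = worker.x
```

Using `get_nowait` rather than a blocking `get` is what makes the step wait-free: the worker's progress does not depend on how busy the server is.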

FIG. 5 is an example of an algorithm illustrating the operation of a parameter server in a distributed learning system according to an embodiment.

The parameter server may collect the gradients computed by the plurality of workers (line 2). The parameter server may update the global parameters using the collected gradients (line 3). At every update period, the parameter server may transmit the updated global parameters to the plurality of workers (lines 4-6).
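The parameter server steps described above can be sketched in the same toy setting. The class and method names are hypothetical, the global parameter is a scalar, and averaging the collected gradients before applying them is an assumption; the text only says that the collected gradients are used to update the global parameters.

```python
from typing import List, Optional


class ParameterServer:
    """Toy server: update after all workers report, broadcast every `period` rounds."""

    def __init__(self, x0: float, lr: float, n_workers: int, period: int):
        self.x = x0
        self.lr = lr
        self.n_workers = n_workers
        self.period = period
        self.rounds = 0
        self.pending: List[float] = []

    def receive(self, grad: float) -> Optional[float]:
        """Collect one gradient; return the global parameter if a broadcast is due."""
        self.pending.append(grad)
        if len(self.pending) < self.n_workers:
            return None                      # still collecting gradients
        mean_grad = sum(self.pending) / self.n_workers
        self.pending.clear()
        self.x -= self.lr * mean_grad        # update the global parameter
        self.rounds += 1
        if self.rounds % self.period == 0:
            return self.x                    # update period reached: send
        return None                          # otherwise stay silent


server = ParameterServer(x0=0.0, lr=0.5, n_workers=2, period=2)
replies = [server.receive(g) for g in (-4.0, -2.0, -2.0, -2.0)]
```

In this run the server stays silent for the first three gradients and returns a global parameter only after the second complete round, which is the asymmetry the text describes: gradients flow in at every step, parameters flow out only at the update period.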

Because the workers and the parameter server communicate asymmetrically in this way, the parameter server does not need to send updated parameters to the workers every time. The parameter server can collect the gradients of all workers and update the global parameters at once.

FIG. 6 is a diagram illustrating the configuration of a distributed learning system according to an embodiment.

The distributed learning system 100 may include a parameter server 610 and a plurality of workers 620. The asymmetric communication-based distributed learning method between the parameter server 610 and the plurality of workers 620 (hereinafter, "workers") is described with reference to FIGS. 7 and 8.

FIG. 7 illustrates the case where a worker receives updated global parameters from the parameter server, and FIG. 8 illustrates the case where a worker does not receive updated global parameters from the parameter server.

Referring to FIGS. 7 and 8, the worker 620 may train the learning model using local data (701). As the worker 620 trains the learning model with the local data, a learning result is calculated; specifically, the worker 620 may calculate a gradient. The worker 620 may obtain the learning result calculated through training (702). The worker may transmit the calculated learning result to the parameter server 610 (703).

The parameter server 610 may receive the calculated learning result (704). The parameter server 610 may update the global parameters based on a predefined update rule (705). The subsequent operation of the worker depends on whether the parameter server 610 transmits the global parameters to the worker.

In other words, the parameter server 610 may update the global parameters based on the predefined rule, and transmits the updated global parameters at every update period. When the update period is reached, as shown in FIG. 7, the updated global parameters may be transmitted from the parameter server 610 to the worker 620 (706). The worker 620 may receive the updated global parameters (707). The worker 620 may update its local parameters by taking a weighted average of the updated global parameters and the local parameters (708).

When the update period has not been reached, as shown in FIG. 8, the parameter server 610 may not transmit the updated global parameters to the worker 620. When the global parameters are not transmitted from the parameter server 610 to the worker 620, the worker 620 may proceed to the next learning process (801). For example, if the worker 620 was carrying out a first learning process, it may proceed to a second learning process.
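The flow of FIGS. 7 and 8 can be traced end to end with a small simulation. This is a sketch under several assumptions that are not in the text: workers are simulated sequentially in lockstep rather than concurrently, each worker has a scalar parameter and a quadratic loss 0.5 * (x - target_i)^2, the server averages the collected gradients, and all names and default values are hypothetical. The comments map lines to the reference numerals used above.

```python
import queue
from typing import List, Tuple


def run(n_workers: int = 4, steps: int = 200, period: int = 5,
        lr: float = 0.1, alpha: float = 0.5) -> Tuple[float, List[float]]:
    targets = [float(i) for i in range(n_workers)]  # each worker's "local data"
    local = [0.0] * n_workers
    inboxes = [queue.Queue() for _ in range(n_workers)]
    global_x = 0.0
    rounds = 0

    for _ in range(steps):
        grads = []
        for i in range(n_workers):
            grad = local[i] - targets[i]   # 701-702: train, obtain gradient
            local[i] -= lr * grad          # local update with own gradient
            grads.append(grad)             # 703: send gradient to server
            try:
                received = inboxes[i].get_nowait()      # 707: receive if present
                local[i] = (1.0 - alpha) * local[i] + alpha * received  # 708
            except queue.Empty:
                pass                       # 801: proceed to next learning process

        global_x -= lr * sum(grads) / n_workers  # 704-705: update global
        rounds += 1
        if rounds % period == 0:           # update period reached
            for box in inboxes:
                box.put(global_x)          # 706: send global parameter

    return global_x, local


final_global, final_local = run()
```

With these defaults the averaged loss is minimized at the mean of the targets, 1.5, and the global parameter settles close to that value, while each worker's local parameter keeps being pulled between its own target and the global parameter.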

To verify the effect of the asymmetric distributed learning operation in a distributed system according to an embodiment, it may be compared with existing distributed deep learning methods in terms of model accuracy and training time.

Table 1:

Table 1 compares accuracy and training time. Across both learning models (for example, ResNet-50 and VGG-16) and both data sets (for example, CIFAR-10 and ImageNet-1K), the asymmetric distributed learning proposed in the embodiment shows the best results.

In particular, the asymmetric communication-based distributed learning proposed in the embodiment can resolve the training performance degradation of existing parameter server-based distributed deep learning methods and, compared with existing methods, reaches high accuracy in the shortest time. In other words, the asymmetric communication-based distributed learning proposed in the embodiment is shown to outperform the prior art.

The apparatus described above may be implemented with hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, the description sometimes refers to a single processing device, but a person of ordinary skill in the art will recognize that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, computer storage medium, or device in order to be interpreted by the processing device or to provide instructions or data to the processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to the embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine code, such as that produced by a compiler, as well as high-level language code that a computer can execute using an interpreter or the like.

Although the embodiments have been described with reference to limited examples and drawings, a person of ordinary skill in the art can make various modifications and variations from the above description. For example, appropriate results may be achieved even if the described techniques are performed in a different order from the described method, and/or components of the described system, structure, apparatus, circuit, and the like are coupled or combined in a different form from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

A distributed learning system based on asymmetric communication, comprising:
a plurality of workers, each of which updates its local parameters by taking a weighted average of updated global parameters and the local parameters of the worker, the global parameters being updated based on learning results calculated by training a learning model with local data; and
a parameter server that transmits the global parameters, updated based on the learning results calculated by the plurality of workers, to the plurality of workers through asymmetric communication according to a predefined update rule,
wherein the parameter server collects the gradients calculated by the plurality of workers, updates the global parameters using the collected gradients, searches for global parameters that minimize a loss function in distributed learning using the gradients collected from the plurality of workers, and transmits the found global parameters to the plurality of workers according to an update period, and
wherein each of the plurality of workers, depending on whether the updated global parameters are transmitted from the parameter server, either updates its local parameters or proceeds to the next learning process.

The distributed learning system of claim 1,
wherein the plurality of workers calculate gradients by training the learning model with the local data, and update the local parameters of each of the plurality of workers with the calculated gradients.

The distributed learning system of claim 1,
wherein, when the updated global parameters are transmitted from the parameter server, the plurality of workers update the local parameters of the worker by taking a weighted average of the updated global parameters and the local parameters of the worker.

The distributed learning system of claim 1,
wherein, when the updated global parameters are not transmitted from the parameter server, the plurality of workers proceed to the next learning process.

(Deleted)

A distributed learning method based on asymmetric communication, performed by a worker, the method comprising:
calculating a learning result by training a learning model with local data;
transmitting the calculated learning result to a parameter server; and
receiving updated global parameters from the parameter server through asymmetric communication based on a predefined parameter update rule,
wherein the receiving comprises, depending on whether the updated global parameters are received from the parameter server, either updating the local parameters of the worker by taking a weighted average of the updated global parameters and the local parameters of the worker, or proceeding to the next learning process, and
wherein the parameter server collects gradients calculated by a plurality of workers, updates the global parameters using the collected gradients, searches for global parameters that minimize a loss function in distributed learning using the gradients collected from the plurality of workers, and transmits the found global parameters to the plurality of workers according to an update period.

The distributed learning method of claim 7,
wherein the calculating comprises, in a first learning process, calculating gradients as the plurality of workers train the learning model with local data, and updating the local parameters of each of the plurality of workers with the calculated gradients.

The distributed learning method of claim 7,
wherein the receiving comprises: when the updated global parameters are received from the parameter server, updating the local parameters of the worker by taking a weighted average of the updated global parameters and the local parameters of the worker; and when the updated global parameters are not transmitted from the parameter server, proceeding to a second learning process.

An asymmetric distributed learning method performed by a parameter server, the method comprising:
collecting learning results calculated by a plurality of workers training a learning model with local data;
updating global parameters using the collected learning results; and
transmitting the updated global parameters to the plurality of workers through asymmetric communication based on a preset update rule,
wherein the updating comprises collecting the gradients calculated by the plurality of workers and using the collected gradients to search for global parameters that minimize a loss function in distributed learning,
wherein the transmitting comprises transmitting, according to an update period, the global parameters found using the collected gradients to minimize the loss function in distributed learning to the plurality of workers, and
wherein, in the plurality of workers, depending on whether the updated global parameters are transmitted from the parameter server, either the local parameters of the plurality of workers are updated by taking a weighted average of the updated global parameters transmitted to the plurality of workers and the local parameters of the plurality of workers, or the next learning process is carried out.

(Deleted)