KR102376527B1

KR102376527B1 - Method and computer program of processing program for single accelerator using dnn framework on plural accelerators

Info

Publication number: KR102376527B1
Application number: KR1020200029251A
Authority: KR
Inventors: 이재진; 박정호; 김형모
Original assignee: 서울대학교산학협력단
Priority date: 2019-03-11
Filing date: 2020-03-09
Publication date: 2022-03-18
Also published as: KR20200108789A

Abstract

The present disclosure relates to a method of processing a program for a single accelerator, which uses a DNN framework, on a plurality of accelerators. The method includes: receiving a call to a deep learning operation function; receiving a call to an accelerator library function for executing the deep learning operation function on a single accelerator; in response to the call to the accelerator library function, assigning the accelerator library function to each of a plurality of accessible accelerators; receiving, from each of the plurality of accelerators, intermediate result data obtained by processing the accelerator library function; and generating, based on the received intermediate result data, result data for the called accelerator library function.

Description

Method and computer program for processing a program for a single accelerator using the DNN framework in multiple accelerators

The present disclosure relates to a method and a computer program for processing a program for a single accelerator, which uses a deep neural network (DNN) framework, on a plurality of accelerators. More specifically, it relates to a method and a computer program that, upon receiving a call to an accelerator library function for executing a deep learning operation function on a single accelerator, assign the accelerator library function and the data associated with that function to each of a plurality of accessible accelerators.

Widely used DNN frameworks such as TensorFlow and PyTorch can provide users, in the form of functions, with operations built on high-performance accelerator computation libraries such as cuDNN and cuBLAS, which run on a single accelerator. Under such a conventional DNN framework, to use a plurality of accelerators the user must either specify each accelerator device installed in one computer, or specify each computer in a network-connected cluster and separately specify the work to be done on each computer. This requirement can arise because the libraries on which the DNN framework is based target a single accelerator, so work distribution must be performed at the DNN framework level or at the level of the user's application program. In addition, depending on the configuration of the cluster system, the user had to rewrite the program, and depending on the execution environment, code modifications were required, performance degraded, or errors occurred.

In general, memory communication between accelerators and between servers is handled internally by the DNN framework, so the user does not need to be concerned with it. However, for matters such as whether the deep neural network is shared among a plurality of accelerators or computers, which work is assigned to each accelerator, and how the cluster is configured, the framework merely provides related functions so that the user (programmer) can handle them directly at the application level. In other words, the DNN framework internally provides basic functions, such as memory communication, for using a plurality of accelerators installed in a single computer or in a cluster composed of a plurality of computers, but all additional work, such as using those functions to distribute work across the accelerators, must be done by the user.

As the field of deep learning develops and changes rapidly, the network models and DNN frameworks to be implemented change quickly, and as technology advances, the execution environment, such as the configuration or scale of the cluster that performs deep learning, also changes. Code modifications and additions may therefore be unavoidable. The user must either modify the code directly for each execution environment or write the code so that it behaves differently depending on options. This process takes a great deal of time, and the code structure can become very complex, which can make maintenance and modification difficult.

To solve the above problems, the present disclosure provides a method of processing a program for a single accelerator, which uses a DNN framework, on a plurality of accelerators, as well as a computer program stored in a recording medium and a system.

An object of the present invention is to provide a method and a computer program whereby, when a programmer uses a DNN framework to write a deep learning training/inference program targeting one accelerator, the program runs on a computer or cluster system in which a plurality of accelerators are installed, without any modification of the source code (for example, of either the user program or the DNN framework).

The present invention provides a method and a computer program that allow a high-performance accelerator computation library whose source code is not public (for example, cuDNN, cuBLAS, etc.) to operate using a plurality of accelerators included in one computer or cluster system instead of a single accelerator.

The present disclosure may be implemented in a variety of ways, including as a method, an apparatus, a system, a computer program, or a computer-readable storage medium storing instructions.

A method of processing a program for a single accelerator, which uses a DNN framework, on a plurality of accelerators according to an embodiment of the present disclosure includes: receiving a call to a deep learning operation function; receiving a call to an accelerator library function for executing the deep learning operation function on a single accelerator; in response to the call to the accelerator library function, assigning the accelerator library function to each of a plurality of accessible accelerators; receiving, from each of the plurality of accelerators, intermediate result data obtained by processing the accelerator library function; and generating, based on the received intermediate result data, result data for the called accelerator library function.

According to an embodiment, assigning the accelerator library function to each of the plurality of accessible accelerators includes dividing the input data of the accelerator library function into a plurality of partial input data sets, and assigning the accelerator library function and each of the divided partial input data sets to each of the plurality of accelerators. Receiving the intermediate result data from each of the plurality of accelerators then includes receiving intermediate result data obtained by processing each of the partial input data sets with the accelerator library function on each of the plurality of accelerators.
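The split-and-assign step described in this embodiment can be sketched as follows. This is only an illustrative sketch: the names `split_batch`, `run_on_accelerators`, and `relu` are hypothetical, the "accelerators" are simulated by ordinary function calls, and the near-equal contiguous split is an assumption rather than a policy stated in the patent.

```python
from typing import Callable, List, Sequence


def split_batch(batch: Sequence[float], num_accelerators: int) -> List[List[float]]:
    """Divide the input data into contiguous, near-equal partial input data sets."""
    base, extra = divmod(len(batch), num_accelerators)
    parts, start = [], 0
    for i in range(num_accelerators):
        size = base + (1 if i < extra else 0)  # first `extra` parts get one more sample
        parts.append(list(batch[start:start + size]))
        start += size
    return parts


def run_on_accelerators(library_fn: Callable[[List[float]], List[float]],
                        batch: Sequence[float],
                        num_accelerators: int) -> List[List[float]]:
    """Assign the same library function plus one partial set to each (simulated) accelerator."""
    parts = split_batch(batch, num_accelerators)
    # Each call stands in for one accelerator processing its partial input data set.
    return [library_fn(part) for part in parts]


def relu(xs: List[float]) -> List[float]:
    """Hypothetical stand-in for a single-accelerator library function."""
    return [max(0.0, x) for x in xs]


intermediate = run_on_accelerators(relu, [-1.0, 2.0, -3.0, 4.0, 5.0], 2)
```

Here `intermediate` holds one intermediate result per simulated accelerator; combining them into the final result is a separate step.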

According to an embodiment, assigning the accelerator library function and each of the divided partial input data sets to each of the plurality of accelerators includes analyzing a memory access pattern of the accelerator library function and, based on the analyzed access pattern, assigning the input data of the accelerator library function to each of the plurality of accessible accelerators before the accelerator library function is executed.

According to an embodiment, assigning the accelerator library function and each of the divided partial input data sets to each of the plurality of accelerators includes, when the partial input data set assigned to each of the plurality of accelerators comprises n partial input data sets (where n is a natural number of 2 or more), assigning the (m+1)-th partial input data set to each of the plurality of accelerators while the m-th of the n partial input data sets is being processed on each of the plurality of accelerators (where m is a natural number less than n).
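The overlap of processing the m-th partial input data set with transferring the (m+1)-th one resembles double buffering. The sketch below illustrates the idea under stated assumptions: `copy_to_accelerator` and `compute` are hypothetical stand-ins, and a single background thread plays the role of an asynchronous host-to-accelerator copy, which a real system would perform with the accelerator's own transfer mechanism.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def copy_to_accelerator(chunk: List[int]) -> List[int]:
    """Hypothetical stand-in for a host-to-accelerator memory copy."""
    return list(chunk)


def compute(chunk: List[int]) -> int:
    """Hypothetical stand-in for the accelerator library function."""
    return sum(x * x for x in chunk)


def pipelined(chunks: List[List[int]]) -> List[int]:
    """Process chunk m while chunk m+1 is being copied in the background."""
    results = []
    if not chunks:
        return results
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(copy_to_accelerator, chunks[0])
        for m in range(len(chunks)):
            on_device = pending.result()          # wait until chunk m has arrived
            if m + 1 < len(chunks):
                # Start copying chunk m+1 before computing on chunk m.
                pending = copier.submit(copy_to_accelerator, chunks[m + 1])
            results.append(compute(on_device))
    return results


partial_results = pipelined([[1, 2], [3], [4, 5]])
```

The results are the same as for a sequential loop; only the timing changes, since each copy can proceed while the previous chunk is being computed.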

According to an embodiment, assigning the accelerator library function to each of the plurality of accessible accelerators includes dividing the parameter data of the accelerator library function into a plurality of partial parameter data sets, and assigning the accelerator library function and each of the divided partial parameter data sets to each of the plurality of accelerators.
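As one possible illustration of dividing parameter data, the sketch below splits a weight matrix by output columns so that each simulated accelerator computes a slice of a vector-matrix product. The column-wise split and the names `matvec`, `split_columns`, and `partitioned_matvec` are assumptions made for this example; the patent does not prescribe a particular partitioning scheme.

```python
from typing import List

Matrix = List[List[float]]


def matvec(x: List[float], w: Matrix) -> List[float]:
    """Compute x @ w for a row vector x and a matrix w given as a list of rows."""
    cols = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(cols)]


def split_columns(w: Matrix, num_parts: int) -> List[Matrix]:
    """Divide the parameter matrix into contiguous column blocks (partial parameter sets)."""
    cols = len(w[0])
    base, extra = divmod(cols, num_parts)
    blocks, start = [], 0
    for p in range(num_parts):
        width = base + (1 if p < extra else 0)
        blocks.append([row[start:start + width] for row in w])
        start += width
    return blocks


def partitioned_matvec(x: List[float], w: Matrix, num_accelerators: int) -> List[float]:
    """Each simulated accelerator multiplies x by its own column block."""
    partials = [matvec(x, block) for block in split_columns(w, num_accelerators)]
    # Column blocks produce disjoint output slices, so the partial outputs are joined in order.
    return [y for part in partials for y in part]


x = [1.0, 2.0]
w = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 3.0]]
output = partitioned_matvec(x, w, 2)
```

With this split, every accelerator needs the full input vector but only its share of the parameters.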

According to an embodiment, receiving the intermediate result data from each of the plurality of accelerators includes receiving, from each of the plurality of accelerators, intermediate result data obtained by processing the parameter data of the accelerator library function, and generating the result data for the called accelerator library function includes generating result data for the intermediate result data obtained by processing the parameter data of the accelerator library function.

According to an embodiment, generating the result data for the called accelerator library function based on the received intermediate result data includes concatenating the received intermediate result data to generate the result data.

According to an embodiment, generating the result data for the called accelerator library function based on the received intermediate result data includes performing an operation on the received intermediate result data to generate the result data.
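The two ways of producing result data from intermediate results, concatenation and computation, can be contrasted with a small sketch. The function names are hypothetical, and the element-wise sum is only one assumed example of an operation on intermediate results (for instance, combining per-accelerator gradient contributions); the patent text does not fix the operation.

```python
from typing import List


def concatenate(intermediate: List[List[float]]) -> List[float]:
    """Join per-accelerator outputs in accelerator order."""
    return [x for part in intermediate for x in part]


def reduce_sum(intermediate: List[List[float]]) -> List[float]:
    """Combine same-shaped per-accelerator outputs by element-wise addition."""
    return [sum(values) for values in zip(*intermediate)]


# Outputs that cover different samples are concatenated.
activations = concatenate([[0.0, 2.0, 0.0], [4.0, 5.0]])

# Outputs that all describe the same quantity are reduced by an operation.
gradient = reduce_sum([[1.0, 2.0], [10.0, 20.0]])
```

Which of the two applies depends on what the library function returns: per-sample outputs are typically joined, while quantities accumulated over the whole batch are combined by an operation.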

According to an embodiment, the plurality of accelerators are included in one computing device.

According to an embodiment, the plurality of accelerators are included in a cluster system that includes a plurality of computing devices.

According to an embodiment of the present disclosure, a computer program stored in a computer-readable recording medium is provided for executing, on a computer, the above-described method of processing a program for a single accelerator on a plurality of accelerators.

According to some embodiments of the present disclosure, a high-performance accelerator library that is shared by the commonly used DNN frameworks and whose source code is not public can operate on a plurality of accelerators included in one computer or cluster instead of on one accelerator.

According to some embodiments of the present disclosure, the plurality of accelerators are used automatically at the library level, so a single program written for a single accelerator at the level of a user application that uses the DNN framework can be distributed appropriately in the execution environment. Because the computation is performed using a plurality of accelerators, execution time can be reduced. And because the source code of the user application does not need to be changed, the program development process can be greatly shortened, which raises productivity and simplifies maintenance.

According to some embodiments of the present disclosure, the behavior of the library shared by DNN frameworks is changed, so the source code of the DNN framework does not need to be modified. In addition, the library that was previously linked when the DNN framework was compiled may be replaced or intercepted. This approach can easily be applied to many kinds of DNN frameworks. Even when a new DNN framework is developed, the handling of one accelerator and of a plurality of accelerators can be unified.

Embodiments of the present disclosure will be described with reference to the accompanying drawings described below, in which like reference numerals denote like elements, but are not limited thereto.
FIG. 1 is a schematic diagram illustrating a configuration in which a host computing device is communicably connected to a plurality of computing devices in order to provide a method of processing a program for a single accelerator on a plurality of accelerators according to an embodiment of the present disclosure.
FIG. 2 is a block diagram illustrating an internal configuration of a computing device according to an embodiment of the present disclosure.
FIG. 3 is a block diagram illustrating an internal configuration of a processor for providing a method of processing a program for a single accelerator, which uses a DNN framework, on a plurality of accelerators according to an embodiment of the present disclosure.
FIG. 4 is a flowchart illustrating a method of processing a program for a single accelerator, which uses a DNN framework, on a plurality of accelerators according to an embodiment of the present disclosure.
FIG. 5 is a diagram illustrating an example of dividing input data into a plurality of partial input data sets and assigning each of the divided partial input data sets to a plurality of accelerators according to an embodiment of the present disclosure.
FIG. 6 is a diagram illustrating an example of assigning data to a plurality of accelerators when the input data cannot be divided into a plurality of partial input data sets according to an embodiment of the present disclosure.
FIG. 7 is a diagram illustrating an example of generating result data by concatenating intermediate result data received from a plurality of accelerators according to an embodiment of the present disclosure.
FIG. 8 is a diagram illustrating an example of generating result data by performing an operation on intermediate result data received from a plurality of accelerators according to an embodiment of the present disclosure.
FIG. 9 is a diagram illustrating an example of copying a plurality of partial input data sets while a plurality of accelerators perform computation according to an embodiment of the present disclosure.

Hereinafter, specific details for carrying out the present disclosure will be described with reference to the accompanying drawings. In the following description, however, detailed descriptions of well-known functions or configurations are omitted where they might unnecessarily obscure the gist of the present disclosure.

In the accompanying drawings, identical or corresponding components are given the same reference numerals. In the description of the embodiments below, redundant descriptions of identical or corresponding components may be omitted. Even if the description of a component is omitted, however, it is not intended that the component be excluded from any embodiment.

The advantages and features of the disclosed embodiments, and the methods of achieving them, will become apparent from the embodiments described below together with the accompanying drawings. The present disclosure is not limited to the embodiments disclosed below, however, and may be implemented in a variety of different forms. The embodiments are provided only so that the present disclosure is complete and so that it fully conveys the scope of the invention to those of ordinary skill in the art.

The terms used in this specification are described briefly, and the disclosed embodiments are then described in detail. The terms used in this specification were chosen, as far as possible, from general terms in wide current use while taking into account their functions in the present disclosure, but they may vary with the intent of those working in the field, with precedent, or with the emergence of new technology. In certain cases a term has been chosen arbitrarily by the applicant, in which case its meaning is set out in detail in the corresponding part of the description. The terms used in the present disclosure should therefore be defined on the basis of their meaning and the overall content of the present disclosure, not simply by their names.

In this specification, singular expressions include plural expressions unless the context clearly specifies the singular, and plural expressions include singular expressions unless the context clearly specifies the plural. Throughout the specification, when a part is said to "include" a component, this means that it may further include other components, not that it excludes them, unless specifically stated otherwise.

In the present disclosure, a 'cluster system' may include a plurality of computers connected through a network. Under such a cluster system, a client device can use the cluster system as if it were one computer. Because the cluster system lets a connected user use a plurality of computers as a single computer, it can achieve a much higher processing speed than a single computer.

In the present disclosure, a 'batch' refers to the number of samples processed before the network parameters of a model are updated in deep learning. The training data set of a deep learning model can be divided into one or more batches. For example, in a deep learning model that processes images, the batch size is the number of images processed at one time. A batch size of 64 may mean that 64 image files are placed one after another in a contiguous memory space. Data divided in batch units are therefore structurally identical and can be distributed to a plurality of accelerators without any special processing.
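Because the samples of a batch sit back to back in memory, dividing a batch among accelerators reduces to offset arithmetic. The sketch below computes, for each accelerator, the element range it would receive from a flat buffer. The function name `batch_slices` and the 3 x 224 x 224 image shape are assumptions chosen for illustration and do not come from the patent.

```python
from typing import List, Tuple


def batch_slices(batch_size: int, sample_elems: int,
                 num_accelerators: int) -> List[Tuple[int, int]]:
    """Return (start, end) element offsets of each accelerator's share of a flat batch buffer."""
    base, extra = divmod(batch_size, num_accelerators)
    slices, first_sample = [], 0
    for i in range(num_accelerators):
        count = base + (1 if i < extra else 0)       # samples given to accelerator i
        start = first_sample * sample_elems
        end = (first_sample + count) * sample_elems
        slices.append((start, end))
        first_sample += count
    return slices


# Assumed example: 64 images of 3 x 224 x 224 elements each, spread over 4 accelerators.
image_elems = 3 * 224 * 224
ranges = batch_slices(64, image_elems, 4)
```

Each range is a contiguous block, so it can be handed to an accelerator with a single copy and no reshaping.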

In the present disclosure, a 'library wrapper' is a collection of subroutines or classes used in developing software, and may consist of a thin layer of code that converts a library's existing interface into a compatible interface. For example, the library wrapper may include functions configured to change the behavior of a library shared by DNN frameworks, so the source code of the DNN framework itself does not need to be modified. It will be apparent to those skilled in the art that the library wrapper of the present disclosure can therefore be applied to various DNN frameworks in a similar manner.
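The idea of a wrapper that keeps the library's interface while changing its behavior can be illustrated in miniature. In the sketch below, `single_lib` and its `scale` function are hypothetical stand-ins for a closed-source single-accelerator library, and `make_wrapper` is an assumed helper; a real wrapper would intercept native library symbols at link or load time rather than replace a Python attribute.

```python
from types import SimpleNamespace
from typing import Callable, List

# Hypothetical single-accelerator library exposing one function.
single_lib = SimpleNamespace(
    scale=lambda xs, alpha: [alpha * x for x in xs]
)


def make_wrapper(original_fn: Callable[[List[float], float], List[float]],
                 num_accelerators: int) -> Callable[[List[float], float], List[float]]:
    """Build a function with the same signature that fans the call out and joins the results."""
    def wrapper(xs: List[float], alpha: float) -> List[float]:
        step = max(1, -(-len(xs) // num_accelerators))   # ceiling division, at least 1
        parts = [xs[i:i + step] for i in range(0, len(xs), step)]
        # Each call to original_fn stands in for one accelerator running the library function.
        partials = [original_fn(part, alpha) for part in parts]
        return [y for part in partials for y in part]
    return wrapper


# The caller's code is unchanged; only the function behind the name is swapped.
single_lib.scale = make_wrapper(single_lib.scale, num_accelerators=2)
result = single_lib.scale([1.0, 2.0, 3.0, 4.0, 5.0], 10.0)
```

Code that calls `single_lib.scale` does not need to know how many accelerators are involved, which mirrors why neither the user program nor the framework source has to change.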

FIG. 1 is a schematic diagram illustrating a configuration in which a host computing device 120 is communicably connected to a plurality of computing devices 130_1, 130_2, ..., 130_n in order to provide a method of processing a program for a single accelerator on a plurality of accelerators according to an embodiment of the present disclosure. Here, the plurality of accelerators may be included in a cluster system 100 that includes the plurality of computing devices 130_1, 130_2, ..., 130_n. For example, the plurality of accelerators may be included in at least some of the computing devices 130_1, 130_2, ..., 130_n. The cluster system 100 may include a network 110, the host computing device 120, and the plurality of computing devices 130_1, 130_2, ..., 130_n.

The network 110 may be configured to enable communication between the host computing device 120 and the plurality of computing devices 130_1, 130_2, ..., 130_n included in the cluster system 100. Depending on the installation environment, the network 110 may consist of, for example, a wired network such as Ethernet, a wired home network (power line communication), a telephone line communication device, or RS-serial communication; a wireless network such as a mobile communication network, WLAN (wireless LAN), Wi-Fi, Bluetooth, or ZigBee; or a combination thereof. In other words, the communication method is not limited and may include not only methods that use a communication network the network 110 may include (for example, a mobile communication network, the wired Internet, the wireless Internet, a broadcasting network, or a satellite network) but also short-range wireless communication between the host computing device 120 and the plurality of computing devices 130_1, 130_2, ..., 130_n. For example, the network 110 may include any one or more of a personal area network (PAN), a local area network (LAN), a campus area network (CAN), a metropolitan area network (MAN), a wide area network (WAN), a broadband network (BBN), the Internet, and the like. The network 110 may also include any one or more network topologies, including but not limited to a bus network, a star network, a ring network, a mesh network, a star-bus network, and a tree or hierarchical network.

The host computing device 120 may be configured to execute a program for a single accelerator that uses a DNN framework. The host computing device 120 may also be configured to communicate with the plurality of computing devices 130_1, 130_2, ..., 130_n and to control the operation of the plurality of accelerators included in them. For example, the host computing device 120 may assign an accelerator library function and input data to the plurality of computing devices 130_1, 130_2, ..., 130_n, receive from each of them intermediate result data obtained by processing the accelerator library function, and generate, based on the received intermediate result data, result data for the called accelerator library function.

The plurality of computing devices 130_1, 130_2, ..., 130_n are computing devices that perform information processing and communication in the cluster system 100. They may take the form of a terminal such as a computer or a remote processing device. Each of the computing devices 130_1, 130_2, ..., 130_n may perform information processing independently, but may also do so in cooperation with the other computing devices through parallel programming. Each of the computing devices 130_1, 130_2, ..., 130_n may communicate through the network 110 to operate a deep learning application, and may act as a source, a destination, or a relay point for data.

FIG. 2 is a block diagram illustrating an internal configuration of a computing device 200 according to an embodiment of the present disclosure. A plurality of accelerators 230_1, 230_2, ..., 230_n may be included in the computing device 200. According to an embodiment, the computing device 200 may refer to the host computing device 120. In another embodiment, the computing device 200 may refer to each of the plurality of computing devices 130_1, 130_2, ..., 130_n. As shown in FIG. 2, one computing device 200 may include a processor 210, a main memory 220, and the plurality of accelerators 230_1, 230_2, ..., 230_n.

The processor 210 may be configured as a general-purpose processor for arithmetic processing, such as a CPU (Central Processing Unit), and may be connected to the plurality of accelerators 230_1, 230_2, ..., 230_n to control their operation. The processor 210 may also be connected to the main memory 220. For example, the processor 210 may be connected to the plurality of accelerators 230_1, 230_2, ..., 230_n and/or the main memory 220 through a PCI-E (Peripheral Component Interconnect Express) bus, and may transmit and receive data for controlling the plurality of accelerators 230_1, 230_2, ..., 230_n and/or the main memory 220.

According to an embodiment, the main memory 220 should be interpreted broadly to include any electronic component capable of storing electronic information. For example, the main memory 220 may refer to various types of processor-readable media such as random access memory (RAM), read-only memory (ROM), non-volatile random access memory (NVRAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable PROM (EEPROM), flash memory, magnetic or optical data storage, registers, and the like. The main memory 220 is said to be in electronic communication with the processor 210 and/or the plurality of accelerators 230_1, 230_2, ..., 230_n if the processor 210 and/or the accelerators can read information from the memory and/or write information to it. In the present disclosure, the main memory 220 may store any data and/or information (e.g., program execution data, input data, parameter data, result data) associated with programs executed by the processor 210 and/or the plurality of accelerators 230_1, 230_2, ..., 230_n (e.g., a single-accelerator program, a DNN framework, an accelerator library wrapper, and accelerator library functions).

Unlike a general-purpose CPU, the plurality of accelerators 230_1, 230_2, ..., 230_n may be configured as processors specialized for particular patterns of computation. For example, each of the accelerators 230_1, 230_2, ..., 230_n may include a GPU, an FPGA, a DSP, an Intel Xeon Phi, a TPU, an NPU, a multi-core CPU, or the like. In addition, an accelerator memory (not shown), separate from the main memory, may be connected to or included in each of the accelerators 230_1, 230_2, ..., 230_n.

FIG. 3 is a block diagram illustrating the internal configuration of a processor 300 that provides a method of processing, on a plurality of accelerators, a single-accelerator program 310 that uses a DNN framework 320, according to an embodiment of the present disclosure. According to an embodiment, the processor 300 may refer to the processor 210 of FIG. 2. As shown, the single-accelerator program 310, the DNN framework 320, the accelerator library wrapper 330, and the accelerator library 340 may be operated or processed by or on the processor 300.

The single-accelerator program 310 may refer to a parallel programming model for the accelerators 230 of various platforms, used to exploit the computational resources of an accelerator. According to an embodiment, the single-accelerator program 310 may refer to a user application program that targets a single GPU. For example, the single-accelerator program 310 may include any program that supports deep learning operation functions for any accelerator, such as a parallel programming model or program like OpenCL (Open Computing Language) or CUDA (Compute Unified Device Architecture), but is not limited thereto. A deep learning operation function may be called through the single-accelerator program 310.

The DNN framework 320 may include a collection of software built to facilitate writing and running deep learning applications. So that developers can more easily use parallel programming models or programs that otherwise demand a high level of expertise, the DNN framework 320 may apply deep learning processing or deep learning operation functions to accelerators and thereby accelerate the training and inference processes. For example, the DNN framework 320 may include widely used DNN frameworks such as Caffe, TensorFlow, PyTorch, CNTK, and Theano, but is not limited thereto.

The accelerator library 340 may refer to a high-performance library that can be called through the DNN framework 320. The accelerator library 340 may provide a plurality of accelerator library functions to a user or a developer. These accelerator library functions may be configured to run on a single accelerator.

According to an embodiment, the accelerator library 340 may refer to cuDNN, cuBLAS, and the like, but is not limited thereto. According to an embodiment, cuDNN may provide forward propagation and backward propagation for convolutional layers, pooling layers, and the like in the form of an API. For example, the cudnnConvolutionForward() function of cuDNN may receive, in units of a batch, the weights corresponding to a given layer among the network parameters together with the feature map (or image data) of the previous layer, and may output, in units of a batch, the output feature map passed to the next layer.

The accelerator library wrapper 330 may be configured to intercept a called accelerator library function and wrap it so that it can be processed in parallel by the plurality of accelerators 230_1, 230_2, ..., 230_n. Under this structure, code written for a single accelerator 230 at the level of the single-accelerator program 310 and the DNN framework 320 may be executed on the plurality of accelerators 230_1, 230_2, ..., 230_n at the library level. For example, the accelerator library wrapper 330 may refer to a GPU library wrapper. In an embodiment, when the single-accelerator program 310 is processed using the accelerator library wrapper 330, the program 310 may be processed in parallel on the plurality of accelerators 230_1, 230_2, ..., 230_n by modifying only the build process of the DNN framework 320, without modifying the accelerator library functions or the source code (e.g., the user program and the DNN framework 320). For example, the processor 210 may assign an accelerator library function to each of the plurality of accelerators 230_1, 230_2, ..., 230_n.
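The interception described above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation of the disclosure: `conv_forward`, `LIBRARY`, `install_wrapper`, and `framework_layer` are hypothetical names, a dictionary stands in for the library symbols that a modified build process would redirect, and Python threads stand in for accelerators.

```python
from concurrent.futures import ThreadPoolExecutor

def conv_forward(batch, weight):
    """Hypothetical stand-in for a single-accelerator library function."""
    return [x * weight for x in batch]

# Stand-in for the library symbols that the framework is built against.
LIBRARY = {"conv_forward": conv_forward}

def install_wrapper(library, name, num_accelerators):
    """Replace library[name] with a wrapper that fans the call out."""
    original = library[name]

    def wrapper(batch, weight):
        # Split the batch into contiguous chunks, at most one per accelerator.
        size = max(1, -(-len(batch) // num_accelerators))  # ceiling division
        chunks = [batch[i:i + size] for i in range(0, len(batch), size)]
        with ThreadPoolExecutor(max_workers=num_accelerators) as pool:
            # map() preserves chunk order, so results line up with the input.
            partials = list(pool.map(lambda c: original(c, weight), chunks))
        # Concatenate the intermediate results into the final result.
        return [y for part in partials for y in part]

    library[name] = wrapper

def framework_layer(batch, weight):
    """Framework code written for one accelerator; it is never modified."""
    return LIBRARY["conv_forward"](batch, weight)

single = framework_layer([1, 2, 3, 4, 5], 2)
install_wrapper(LIBRARY, "conv_forward", num_accelerators=2)
multi = framework_layer([1, 2, 3, 4, 5], 2)
```

In this sketch `framework_layer` is identical before and after the wrapper is installed, which mirrors the point that neither the user program nor the framework source needs to change.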

FIG. 4 is a flowchart illustrating a method 400 of processing, on a plurality of accelerators, a single-accelerator program that uses a DNN framework, according to an embodiment of the present disclosure. The method 400 may be performed by a processor. As shown, the method 400 may begin with a step S410 of receiving a call to a deep learning operation function. For example, the DNN framework running on the processor may receive a call to a deep learning operation function from the single-accelerator program.

Then, in step S420, a call to an accelerator library function for executing the deep learning operation function on a single accelerator may be received. For example, the accelerator library wrapper running on the processor may receive, from the DNN framework, a call to an accelerator library function for executing the deep learning operation function on a single accelerator.

Next, in step S430, in response to the call to the accelerator library function, the accelerator library function may be assigned to each of a plurality of accessible accelerators. In an embodiment, the accelerator library wrapper on the processor may receive, from the DNN framework, the call to the accelerator library function for executing the deep learning operation function on a single accelerator, and may assign the accelerator library function to each of the plurality of accelerators. In doing so, the accelerator library wrapper may divide the input data of the accelerator library function into a plurality of partial input data sets and assign each partial input data set to a respective one of the plurality of accelerators. For example, when the input data of the accelerator library function is received in units of a batch, the accelerator library wrapper may divide the input data along the batch into one or more data sets and distribute them to the plurality of accelerators.

According to an embodiment, the accelerator library wrapper may assign the accelerator library function and each of the divided partial input data sets to a respective one of the plurality of accelerators. In making this assignment, the memory access pattern of the accelerator library function may be analyzed and, based on the analyzed access pattern, the input data of the accelerator library function may be assigned to each of the accessible accelerators before the accelerator library function is executed. For example, the pattern in which the inputs and outputs of a DNN-related accelerator library function access memory may be analyzed using a compiler analysis technique and/or a profiling technique (e.g., a technique that executes the function in advance to identify its memory input/output pattern). As another example, before the accelerator library function is executed, data may be distributed to the plurality of accelerators in advance using the access pattern obtained by compiler analysis and/or profiling. Because of how a DNN framework operates, the same pattern of work may be repeated over many iterations throughout training or inference. Using this property, once the memory access pattern has been identified for the first iteration or the first few iterations (e.g., three), the memory access pattern of each accelerator library function can be determined for the entire training or inference process.
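The profiling idea above can be sketched as follows. This is a minimal illustration under assumptions, not the analysis used in the disclosure: the class `AccessProfiler`, the function `prefetch_plan`, and the buffer names are hypothetical, and a "memory access pattern" is reduced to the list of buffers each library call reads and writes in one iteration.

```python
class AccessProfiler:
    """Records which buffers each library call reads and writes per iteration."""

    def __init__(self, warmup_iterations=3):
        self.warmup_iterations = warmup_iterations
        self.history = []   # one trace per completed iteration
        self.current = []   # trace of the iteration in progress

    def record(self, function_name, reads, writes):
        self.current.append((function_name, tuple(reads), tuple(writes)))

    def end_iteration(self):
        self.history.append(tuple(self.current))
        self.current = []

    def stable_pattern(self):
        """Return the trace if the last warm-up iterations all match, else None."""
        if len(self.history) < self.warmup_iterations:
            return None
        recent = self.history[-self.warmup_iterations:]
        return recent[0] if all(t == recent[0] for t in recent) else None

def prefetch_plan(pattern):
    """List the input buffers to distribute before each call executes."""
    return [(name, reads) for name, reads, _writes in pattern]

profiler = AccessProfiler(warmup_iterations=3)
for _ in range(3):
    profiler.record("conv_forward", reads=["x", "w1"], writes=["y1"])
    profiler.record("pool_forward", reads=["y1"], writes=["y2"])
    profiler.end_iteration()

plan = prefetch_plan(profiler.stable_pattern())
```

Once `stable_pattern()` returns a trace, a wrapper could use `plan` to copy each call's inputs to the accelerators ahead of time in later iterations; until then it returns `None` and no pre-distribution is attempted.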

Next, in step S440, intermediate result data produced by processing the accelerator library function may be received from each of the plurality of accelerators. According to an embodiment, the accelerator library wrapper may receive the intermediate result data from the plurality of accelerators that processed the accelerator library function. For example, the accelerator library wrapper may receive the intermediate result data obtained when each of the plurality of accelerators processes its partial input data set using the accelerator library function. As another example, when the output data of the accelerator library function is produced in units of a batch, the accelerator library wrapper may receive, in units of a batch, the intermediate result data obtained by processing each of the input data sets from the plurality of accelerators.

Finally, in step S450, result data for the called accelerator library function may be generated based on the received intermediate result data. For example, the accelerator library wrapper may generate the result data based on the intermediate result data received from the plurality of accelerators, as described in more detail below with reference to FIGS. 7 and 8.

The plurality of accelerators 530_1, 530_2, ..., 530_n (where n is a natural number of 2 or more) shown in FIGS. 5 to 8, described below, may be included in a single computing device or in a cluster system that includes a plurality of computing devices. Here, the plurality of accelerators 530_1, 530_2, ..., 530_n may include accelerators accessible by the computing device (node) on which the DNN framework operates.

FIG. 5 is a diagram illustrating an example of dividing input data 510 into a plurality of partial input data sets 520_1, 520_2, ..., 520_n and assigning each of the partial input data sets 520_1, 520_2, ..., 520_n to the plurality of accelerators 530_1, 530_2, ..., 530_n, according to an embodiment of the present disclosure.

According to an embodiment, as shown, the input data 510 of the accelerator library function may be divided into a plurality of partial input data sets 520_1, 520_2, ..., 520_n. Each of the partial input data sets 520_1, 520_2, ..., 520_n, together with the corresponding accelerator library function, may be assigned to a respective one of the plurality of accelerators 530_1, 530_2, ..., 530_n.

According to an embodiment, when the input data 510 is received in units of a batch, the accelerator library wrapper may distribute it evenly across the plurality of accelerators 530_1, 530_2, ..., 530_n. For example, in TensorFlow or PyTorch the memory layout of data corresponding to neurons is generally "NCHW" or "NHWC"; because "N" comes first in both, the batch is the outermost (highest) dimension of the layout. Accordingly, assuming that the cluster system includes two accelerators, the entire data may be divided into two contiguous parts and distributed to the two accelerators.
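Because the batch is the outermost dimension, each accelerator's share is one contiguous range of the flat buffer. The following sketch computes those ranges; `batch_slices` is a hypothetical helper introduced only for illustration, and the same arithmetic applies to "NHWC" since each sample still occupies C*H*W consecutive elements.

```python
def batch_slices(n, c, h, w, num_accelerators):
    """Return (start, stop) element offsets into a flat batch-major buffer."""
    per_sample = c * h * w                      # elements in one sample
    base, extra = divmod(n, num_accelerators)   # spread any remainder evenly
    slices, first = [], 0
    for rank in range(num_accelerators):
        count = base + (1 if rank < extra else 0)
        slices.append((first * per_sample, (first + count) * per_sample))
        first += count
    return slices

slices = batch_slices(n=8, c=3, h=2, w=2, num_accelerators=2)
```

With 8 samples of 12 elements each and two accelerators, the buffer is cut at element 48, so each accelerator receives four whole samples and no sample is split.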

In another embodiment, when the partial input data are processed at different speeds, or when processing performance differs because of the network configuration of the cluster system or the like, the plurality of partial input data sets 520_1, 520_2, ..., 520_n may be distributed unevenly across the plurality of accelerators in consideration of the performance difference.
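One way to realize such an uneven distribution is to size each share in proportion to a measured throughput. The sketch below is an assumption made for illustration (the disclosure does not specify the weighting rule); `weighted_counts` and the throughput figures are hypothetical.

```python
def weighted_counts(batch_size, throughputs):
    """Split batch_size samples in proportion to per-accelerator throughput."""
    total = sum(throughputs)
    exact = [batch_size * t / total for t in throughputs]
    counts = [int(e) for e in exact]            # floor of each share
    # Hand leftover samples to the largest fractional remainders.
    leftover = batch_size - sum(counts)
    by_remainder = sorted(range(len(exact)),
                          key=lambda i: exact[i] - counts[i], reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts

counts = weighted_counts(batch_size=8, throughputs=[2.0, 1.0, 1.0])
```

Here the accelerator that is twice as fast receives half of the batch, and the counts always add up to the full batch size.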

FIG. 6 is a diagram illustrating an example of assigning data to the plurality of accelerators 530_1, 530_2, ..., 530_n when input data 610 cannot be divided into a plurality of partial input data sets, according to an embodiment of the present disclosure.

In an embodiment, when the input data 610 cannot be divided along the batch, the accelerator library wrapper may assign the same input data 610 to each of the plurality of accelerators 530_1, 530_2, ..., 530_n. Input data 610 cannot be divided along the batch when every one of the accelerators 530_1, 530_2, ..., 530_n needs it; in a DNN this mainly applies to the network parameters. For example, in TensorFlow or PyTorch the memory layout of parameter data is generally "CKHW" or "HWKC", which has no batch dimension and therefore cannot be divided along the batch. Accordingly, a copy of the input data 610 may be assigned to each of the accelerators 530_1, 530_2, ..., 530_n so that all of them use the same data.
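The two cases, splitting batched data and replicating parameter data, can be combined in one small sketch. The function `distribute` and the `has_batch_dim` flag are hypothetical; how a real wrapper decides whether an argument carries a batch dimension is outside this illustration.

```python
def distribute(arguments, num_accelerators):
    """Build per-accelerator argument sets.

    arguments maps a name to (data, has_batch_dim). Batched data is split
    into contiguous chunks; everything else is copied to every accelerator.
    """
    per_accelerator = [{} for _ in range(num_accelerators)]
    for name, (data, has_batch_dim) in arguments.items():
        if has_batch_dim:
            base, extra = divmod(len(data), num_accelerators)
            start = 0
            for rank in range(num_accelerators):
                stop = start + base + (1 if rank < extra else 0)
                per_accelerator[rank][name] = data[start:stop]
                start = stop
        else:
            for rank in range(num_accelerators):
                per_accelerator[rank][name] = list(data)  # independent copy
    return per_accelerator

assigned = distribute(
    {"feature_map": ([10, 11, 12, 13], True), "weight": ([0.5, -0.5], False)},
    num_accelerators=2,
)
```

Each entry of `assigned` holds half of the feature map and its own full copy of the weights.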

According to another embodiment, the accelerator library wrapper may divide the parameter data (i.e., weights; not shown) of the accelerator library function into a plurality of partial parameter data sets and provide each partial parameter data set to the plurality of accelerators 530_1, 530_2, ..., 530_n. In this case, the accelerator library wrapper may provide the accelerator library function associated with the divided parameter data sets to the plurality of accelerators 530_1, 530_2, ..., 530_n together with them. The accelerator library wrapper may also provide the input data 610 of the associated accelerator library function to the plurality of accelerators 530_1, 530_2, ..., 530_n along with the divided parameter data sets and the accelerator library function. Here, as shown, the input data 610 need not be divided. Alternatively, the accelerator library wrapper may divide the input data 610 of the associated accelerator library function into a plurality of partial input data sets and provide them to the plurality of accelerators 530_1, 530_2, ..., 530_n together with the divided parameter data sets and the accelerator library function.
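As a hedged illustration of dividing parameter data while leaving the input whole, consider a matrix-vector product whose weight rows are split across accelerators. The choice of operation, the row-wise split, and the concatenation of partial outputs are all assumptions made for this sketch; `matvec` and `split_rows` are hypothetical names.

```python
def matvec(weight_rows, x):
    """Hypothetical single-accelerator function: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight_rows]

def split_rows(weight_rows, num_accelerators):
    """Divide the weight rows into contiguous partial parameter sets."""
    base, extra = divmod(len(weight_rows), num_accelerators)
    parts, start = [], 0
    for rank in range(num_accelerators):
        stop = start + base + (1 if rank < extra else 0)
        parts.append(weight_rows[start:stop])
        start = stop
    return parts

weight = [[1, 0], [0, 1], [1, 1], [2, -1]]
x = [3, 4]  # the undivided input, given in full to every accelerator

# Each accelerator computes the outputs for its own rows.
partials = [matvec(part, x) for part in split_rows(weight, 2)]
result = [y for part in partials for y in part]
```

In this example the joined partial outputs equal the output of the undivided call `matvec(weight, x)`.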

FIG. 7 is a diagram illustrating an example of generating result data 720 by concatenating intermediate result data 710_1, 710_2, ..., 710_n received from the plurality of accelerators 530_1, 530_2, ..., 530_n, according to an embodiment of the present disclosure. According to an embodiment, the accelerator library wrapper may receive the intermediate result data 710_1, 710_2, ..., 710_n from the plurality of accelerators 530_1, 530_2, ..., 530_n and concatenate them to generate the result data 720. For example, for the convolution function among the accelerator library functions provided by cuDNN, the result data 720 may be generated by concatenating the intermediate result data 710_1, 710_2, ..., 710_n received from the plurality of accelerators 530_1, 530_2, ..., 530_n.

In an embodiment, when the intermediate result data 710_1, 710_2, ..., 710_n are parallelized along the batch, the accelerator library wrapper 330 may concatenate them to generate the result data 720. For example, when the intermediate result data 710_1, 710_2, ..., 710_n are parallelized along the batch, they may simply be concatenated and stored as the result data 720.
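Concatenation must follow the order in which the input was split, even if the accelerators finish in a different order. The sketch below shows one way to keep that order; `run_and_concatenate` is a hypothetical helper, and threads again stand in for accelerators.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_and_concatenate(library_fn, chunks):
    """Run library_fn on each chunk and join the results in chunk order."""
    slots = [None] * len(chunks)
    with ThreadPoolExecutor(max_workers=max(1, len(chunks))) as pool:
        futures = {pool.submit(library_fn, c): i for i, c in enumerate(chunks)}
        for future in as_completed(futures):    # completion order is arbitrary
            slots[futures[future]] = future.result()
    # Joining by slot index restores the original batch order.
    return [y for part in slots for y in part]

result = run_and_concatenate(lambda c: [v * v for v in c], [[1, 2], [3], [4, 5]])
```

Each intermediate result is stored in the slot of the chunk it came from, so the joined output lines up with the original batch regardless of which accelerator finishes first.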

FIG. 8 is a diagram illustrating an example of generating result data 820 by performing an operation on intermediate result data 810_1, 810_2, ..., 810_n received from the plurality of accelerators 530_1, 530_2, ..., 530_n, according to an embodiment of the present disclosure. As shown in FIG. 8, the accelerator library wrapper may receive, from each of the plurality of accelerators 530_1, 530_2, ..., 530_n, intermediate result data 810_1, 810_2, ..., 810_n obtained by processing the parameter data of the accelerator library function, and may perform an operation on that intermediate result data to generate the result data 820. This operation may be referred to as a reduction. For example, for max pooling among the accelerator library functions provided by cuDNN, the result data 820 may be generated by performing an operation on the intermediate result data 810_1, 810_2, ..., 810_n received from the plurality of accelerators 530_1, 530_2, ..., 530_n.

According to an embodiment, when the intermediate result data 810_1, 810_2, ..., 810_n are not parallelized along the batch dimension, an operation that produces the gradient data of the network parameters required for training may be used. Each of the accelerators 530_1, 530_2, ..., 530_n may compute a network-parameter gradient from the input data distributed to it, and the accelerators may exchange the intermediate result data 810_1, 810_2, ..., 810_n with one another and sum them.
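Summation is the right reduction for gradients because the gradient of a loss summed over a batch equals the sum of the gradients over its parts. The following worked example checks this for a one-parameter model; the model, the loss, and the name `weight_gradient` are assumptions chosen only to keep the arithmetic visible.

```python
def weight_gradient(w, samples):
    """Gradient of sum(0.5 * (w*x - t)**2) over the samples with respect to w."""
    return sum((w * x - t) * x for x, t in samples)

samples = [(1, 2), (2, 3), (3, 5), (4, 4)]   # (input, target) pairs
w = 1

# Each accelerator sees only its own chunk of the batch.
chunks = [samples[:2], samples[2:]]
partial_gradients = [weight_gradient(w, c) for c in chunks]

# Reduction: the partial gradients are exchanged and summed.
reduced = sum(partial_gradients)
```

The per-sample terms are -1, -2, -6, and 0, so the two partial gradients are -3 and -6 and their sum, -9, matches the gradient computed over the whole batch at once.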

FIG. 9 is a diagram illustrating an example of performing computation on a plurality of accelerators 930_1, ..., 930_n (where n is a natural number of 2 or more) while simultaneously copying a plurality of partial input data sets, according to an embodiment of the present disclosure. According to an embodiment, when the partial input data set 910 assigned to each of the plurality of accelerators 930_1, ..., 930_n includes n partial input data sets (where n is a natural number of 2 or more), the method may include processing the m-th partial input data set 920_1 among the n partial input data sets 910 on each of the plurality of accelerators 930_1, ..., 930_n while simultaneously assigning the (m+1)-th partial input data set 920_2 to each of the plurality of accelerators 930_1, ..., 930_n (where m is a natural number smaller than n). To this end, the memory access pattern of the accelerator library function to be processed may be analyzed before each of the plurality of partial input data sets 910 is assigned to each of the plurality of accelerators 930_1, ..., 930_n.

In an embodiment, based on the analyzed memory access pattern, the accelerator library wrapper 330 may assign the m-th partial input data set 920_1, which is part of the plurality of partial input data sets 910, to the accelerator 930_1 for processing. While the m-th partial input data set 920_1 is being processed on the accelerator 930_1, the accelerator library wrapper 330 may assign the (m+1)-th partial input data set 920_2 to the accelerator 930_1. For example, because of the hardware characteristics of the plurality of accelerators 930_1, ..., 930_n, memory copy operations and compute operations within an accelerator are handled separately; a memory transfer overlapping technique can therefore perform memory copies while computation is in progress on the plurality of accelerators 930_1, ..., 930_n, reducing the total execution time of the deep learning application. Under this mode of operation, as shown, the partial input data sets 920_1 and 920_3 may be assigned to the accelerators 930_1 and 930_n, respectively, and while they are being processed, further partial input data sets 920_2 and 920_4 may be assigned to the accelerators 930_1 and 930_n, respectively.
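The overlap of copying and computing can be sketched as a two-stage pipeline for one accelerator. This is a structural illustration only: `copy_to_accelerator` and `compute` are hypothetical stand-ins, a background thread plays the role of the accelerator's copy engine, and the sketch shows the ordering of operations rather than any real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def copy_to_accelerator(chunk):
    """Hypothetical host-to-device copy; here it only duplicates the list."""
    return list(chunk)

def compute(chunk):
    """Hypothetical library call executed on the accelerator."""
    return [v + 1 for v in chunk]

def pipelined(chunks):
    """Process chunk m while chunk m+1 is being copied in the background."""
    results = []
    if not chunks:
        return results
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(copy_to_accelerator, chunks[0])
        for m in range(len(chunks)):
            on_device = pending.result()          # wait until chunk m has arrived
            if m + 1 < len(chunks):               # start copying chunk m+1
                pending = copier.submit(copy_to_accelerator, chunks[m + 1])
            results.append(compute(on_device))    # runs while that copy proceeds
    return results

out = pipelined([[1, 2], [3, 4], [5]])
```

Only the first copy is exposed; every later copy is issued before the preceding chunk's computation starts, which is the ordering the memory transfer overlapping technique relies on.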

The above-described method of processing, on a plurality of accelerators, a single-accelerator program that uses a DNN framework may also be implemented as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes all types of recording devices in which data readable by a computer system is stored. Examples of computer-readable recording media include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices. The computer-readable recording medium may also be distributed over computer systems connected through a network, so that the computer-readable code is stored and executed in a distributed manner. Functional programs, code, and code segments for implementing the above-described embodiments can be readily inferred by programmers skilled in the art to which the present invention pertains.

The methods, operations, or techniques of the present disclosure may be implemented by various means. For example, these techniques may be implemented in hardware, firmware, software, or a combination thereof. Those of ordinary skill in the art will appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or as software depends on the particular application and on the design requirements imposed on the overall system. Those skilled in the art may implement the described functionality in varying ways for each particular application, but such implementations should not be interpreted as a departure from the scope of the present disclosure.

In a hardware implementation, the processing units used to perform the techniques may be implemented within one or more ASICs, DSPs, digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, electronic devices, other electronic units designed to perform the functions described in the present disclosure, a computer, or a combination thereof.

Accordingly, the various illustrative logical blocks, modules, and circuits described in connection with the present disclosure may be implemented or performed with a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination of those designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, for example, a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.

In a firmware and/or software implementation, the techniques may be implemented as instructions stored on a computer-readable medium such as random access memory (RAM), read-only memory (ROM), non-volatile random access memory (NVRAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable PROM (EEPROM), flash memory, a compact disc (CD), or a magnetic or optical data storage device. The instructions may be executable by one or more processors and may cause the processor(s) to perform certain aspects of the functionality described in the present disclosure.

If implemented in software, the techniques may be stored on, or transmitted over, a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. Storage media may be any available media that can be accessed by a computer. By way of non-limiting example, such computer-readable media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium.

For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. Alternatively, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. Alternatively, the processor and the storage medium may reside as discrete components in a user terminal.

Although the embodiments described above have been described as utilizing aspects of the presently disclosed subject matter in one or more standalone computer systems, the present disclosure is not so limited and may be implemented in connection with any computing environment, such as a network or distributed computing environment. Furthermore, aspects of the subject matter of the present disclosure may be implemented in or across a plurality of processing chips or devices, and storage may similarly be effected across a plurality of devices. Such devices may include PCs, network servers, and portable devices.

Although the present disclosure has been described herein in connection with some embodiments, various modifications and changes may be made without departing from the scope of the present disclosure as understood by those of ordinary skill in the art to which the present disclosure pertains. Such modifications and changes should be considered to fall within the scope of the claims appended hereto.

100: cluster system
110: network
120: host computing device
130, 130_1, 130_2, ..., 130_n: computing device
200: computing device
210: processor
220: main memory
230, 230_1, 230_2, ..., 230_n: accelerator
310: program for single accelerator
320: DNN Framework
330: accelerator library wrapper
340: accelerator library

Claims

1. A method for processing, on a plurality of accelerators, a program for a single accelerator that uses a DNN framework, the method comprising:
receiving a call to a deep learning computation function;
receiving a call to an accelerator library function for executing the deep learning computation function on a single accelerator;
in response to the call to the accelerator library function, assigning the accelerator library function to each of a plurality of accessible accelerators;
receiving, from each of the plurality of accelerators, intermediate result data obtained by processing the accelerator library function; and
generating result data for the called accelerator library function based on the received intermediate result data,
wherein receiving the intermediate result data comprises receiving, from each of the plurality of accelerators, intermediate result data obtained by processing parameter data of the accelerator library function, and
wherein generating the result data for the called accelerator library function comprises generating result data for the intermediate result data obtained by processing the parameter data of the accelerator library function.

2. A method for processing, on a plurality of accelerators, a program for a single accelerator that uses a DNN framework, the method comprising:
receiving a call to a deep learning computation function;
receiving a call to an accelerator library function for executing the deep learning computation function on a single accelerator;
in response to the call to the accelerator library function, assigning the accelerator library function to each of a plurality of accessible accelerators;
receiving, from each of the plurality of accelerators, intermediate result data obtained by processing the accelerator library function; and
generating result data for the called accelerator library function based on the received intermediate result data,
wherein assigning the accelerator library function to each of the plurality of accessible accelerators comprises:
dividing input data of the accelerator library function into a plurality of partial input data sets; and
assigning the accelerator library function and a respective one of the plurality of partial input data sets to each of the plurality of accelerators, and
wherein receiving the intermediate result data comprises receiving result data obtained by processing each of the plurality of partial input data sets with the accelerator library function on each of the plurality of accelerators.

3. The method of claim 2,
wherein assigning the accelerator library function and a respective one of the plurality of partial input data sets to each of the plurality of accelerators comprises:
analyzing a memory access pattern of the accelerator library function; and
allocating, based on the analyzed memory access pattern, the input data of the accelerator library function to each of the plurality of accessible accelerators before the accelerator library function is executed.

4. The method of claim 2,
wherein assigning the accelerator library function and a respective one of the plurality of partial input data sets to each of the plurality of accelerators comprises:
when the partial input data sets allocated to each of the plurality of accelerators comprise n partial input data sets (n being a natural number greater than or equal to 2), assigning an (m+1)-th partial input data set to each of the plurality of accelerators while an m-th partial input data set among the n partial input data sets is being processed on each of the plurality of accelerators, m being a natural number less than n.

5. (Deleted)

6. The method of claim 1,
wherein receiving the intermediate result data comprises receiving, from each of the plurality of accelerators, intermediate result data obtained by processing parameter data of the accelerator library function, and
wherein generating the result data for the called accelerator library function comprises generating result data for the intermediate result data obtained by processing the parameter data of the accelerator library function.

7. The method of claim 1,
wherein generating the result data for the called accelerator library function based on the received intermediate result data comprises concatenating the received intermediate result data to generate the result data.

8. The method of claim 1,
wherein generating the result data for the called accelerator library function based on the received intermediate result data comprises performing a computation on the received intermediate result data to generate the result data.

9. The method of claim 1,
wherein the plurality of accelerators are included in a single computing device.

10. The method of claim 1,
wherein the plurality of accelerators are included in a cluster system comprising a plurality of computing devices.

11. A computer program stored in a computer-readable recording medium for executing, on a computer, the method of processing a program for a single accelerator using a DNN framework on a plurality of accelerators according to any one of claims 1 to 4 and 6 to 10.
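For illustration only, the divide, dispatch, and combine steps recited in the claims above can be sketched in Python as follows. This is a host-side simulation under stated assumptions, not the patented implementation: the helper names `split_input`, `dispatch`, `combine_by_concatenation`, and `combine_by_sum` are hypothetical, worker threads stand in for accelerators, and plain Python callables stand in for accelerator library functions.

```python
from concurrent.futures import ThreadPoolExecutor


def split_input(data, num_accelerators):
    """Divide the input into contiguous partial input data sets, one per accelerator."""
    base, extra = divmod(len(data), num_accelerators)
    parts, start = [], 0
    for i in range(num_accelerators):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first sets
        parts.append(data[start:start + size])
        start += size
    return parts


def dispatch(library_function, data, num_accelerators):
    """Run the same function on every partial input data set and
    return the intermediate results in accelerator order."""
    parts = split_input(data, num_accelerators)
    # Each worker thread is a hypothetical stand-in for one accelerator.
    with ThreadPoolExecutor(max_workers=num_accelerators) as pool:
        return list(pool.map(library_function, parts))


def combine_by_concatenation(intermediate_results):
    # Suits element-wise functions, where each output element depends on one input element.
    return [x for part in intermediate_results for x in part]


def combine_by_sum(intermediate_results):
    # Suits reductions, where each accelerator returns a partial scalar result.
    return sum(intermediate_results)
```

As a usage example, an element-wise function such as `lambda part: [max(0, x) for x in part]` would be combined with `combine_by_concatenation`, whereas a reduction such as `lambda part: sum(x * x for x in part)` would be combined with `combine_by_sum`. Which combination applies depends on the library function being wrapped.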