KR20200108789A - Method and computer program of processing program for single accelerator using DNN framework on plural accelerators

Info

Publication number: KR20200108789A
Application number: KR1020200029251A
Authority: KR
Inventors: 이재진; 박정호; 김형모
Original assignee: 서울대학교산학협력단
Priority date: 2019-03-11
Filing date: 2020-03-09
Publication date: 2020-09-21
Also published as: KR102376527B1

Abstract

The present disclosure relates to a method for processing, on a plurality of accelerators, a program written for a single accelerator using a DNN framework. The method comprises the steps of: receiving a call to a deep learning operation function; receiving a call to an accelerator library function for executing the deep learning operation function on a single accelerator; allocating the accelerator library function to each of a plurality of accessible accelerators in response to the call to the accelerator library function; receiving, from each of the plurality of accelerators, intermediate result data produced by processing the accelerator library function; and generating result data for the called accelerator library function on the basis of the received intermediate result data. According to the present invention, the program development process is greatly shortened, and thus programming productivity is increased.

Description

A method and computer program for processing a single accelerator program using the DNN framework in multiple accelerators {METHOD AND COMPUTER PROGRAM OF PROCESSING PROGRAM FOR SINGLE ACCELERATOR USING DNN FRAMEWORK ON PLURAL ACCELERATORS}

The present disclosure relates to a method and a computer program for processing, on a plurality of accelerators, a program written for a single accelerator using a deep neural network (DNN) framework. More specifically, it relates to a method and a computer program that, upon receiving a call to an accelerator library function for executing a deep learning operation function on a single accelerator, allocate the accelerator library function and the data associated with that function to each of a plurality of accessible accelerators.

Widely used DNN frameworks such as TensorFlow and PyTorch offer users operations, in the form of functions, that are built on high-performance accelerator libraries such as cuDNN and cuBLAS, which run on a single accelerator. To use a plurality of accelerators under such a conventional DNN framework, the user must explicitly specify each accelerator device installed in a computer, or specify each computer in a network-connected cluster and separately specify the work to be done on each one. This requirement arises because the libraries on which the DNN framework is built target a single accelerator, so work distribution has to be performed at the DNN framework level or at the level of the user's application. In addition, depending on the configuration of the cluster system, the user may have to rewrite the program, and depending on the execution environment, code changes may be required, performance may degrade, or errors may occur.

In general, memory communication between accelerators and between servers is handled internally by the DNN framework, so the user need not be concerned with it. However, for matters such as whether the deep neural network is shared among accelerators or computers, which work is assigned to each accelerator, and how the cluster is configured, the framework merely provides related functions so that the user (programmer) can handle them directly at the application level. In other words, the DNN framework provides basic functionality, such as memory communication, for using a plurality of accelerators installed in a single computer or in a cluster of computers, but any additional work, such as using that functionality to distribute tasks across the accelerators, must be done by the user.

As the field of deep learning advances and changes rapidly, the network models and DNN frameworks being implemented change quickly, and the execution environment, such as the configuration and scale of the cluster that performs deep learning, changes as technology develops. Modifying and adding code may therefore be unavoidable. The user has to modify the code by hand for each execution environment or write code that behaves differently depending on options. This takes a great deal of time, can make the code structure very complex, and can make maintenance and modification difficult.

To solve the problems described above, the present disclosure provides a method, a computer program stored in a recording medium, and a system for processing, on a plurality of accelerators, a single-accelerator program that uses a DNN framework.

An object of the present invention is to provide a method and a computer program by which, once a programmer has written a deep learning training/inference program targeting one accelerator using a DNN framework, the program runs on a computer or cluster system equipped with a plurality of accelerators without any modification of source code (for example, of either the user program or the DNN framework).

The present invention also provides a method and a computer program by which a high-performance accelerator library whose source code is not public (for example, cuDNN or cuBLAS) operates using a plurality of accelerators included in one computer or cluster system instead of a single accelerator.

The present disclosure may be implemented in a variety of ways, including as a method, an apparatus, a system, a computer program, or a computer-readable storage medium storing instructions.

A method of processing, on a plurality of accelerators, a single-accelerator program that uses a DNN framework according to an embodiment of the present disclosure includes: receiving a call to a deep learning operation function; receiving a call to an accelerator library function for executing the deep learning operation function on a single accelerator; in response to the call to the accelerator library function, allocating the accelerator library function to each of a plurality of accessible accelerators; receiving, from each of the plurality of accelerators, intermediate result data produced by processing the accelerator library function; and generating result data for the called accelerator library function based on the received intermediate result data.
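As a non-limiting illustration, the steps of this method can be sketched in Python. This is a minimal sketch under stated assumptions: worker threads stand in for accelerators, and the names `library_function`, `split`, and `run_on_accelerators` are hypothetical and do not appear in the disclosure.

```python
from concurrent.futures import ThreadPoolExecutor


def library_function(batch):
    """Hypothetical stand-in for a single-accelerator library call."""
    return [2 * x for x in batch]


def split(data, parts):
    """Split data into `parts` contiguous chunks of near-equal size."""
    base, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        chunks.append(data[start:start + size])
        start += size
    return chunks


def run_on_accelerators(func, data, num_accelerators):
    """Allocate func to every worker, gather intermediate results, and
    build the final result. Each thread stands in for one accelerator."""
    chunks = split(data, num_accelerators)
    with ThreadPoolExecutor(max_workers=num_accelerators) as pool:
        # pool.map preserves the order of the chunks.
        intermediate = list(pool.map(func, chunks))
    result = []
    for part in intermediate:
        result.extend(part)  # here the results are simply concatenated
    return result


result = run_on_accelerators(library_function, list(range(8)), 3)
```

The caller sees a single call and a single result, as it would with one accelerator; the split and the gathering of intermediate results stay hidden behind that call.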

According to an embodiment, allocating the accelerator library function to each of the plurality of accessible accelerators includes dividing the input data of the accelerator library function into a plurality of partial input data sets, and allocating the accelerator library function and each of the divided partial input data sets to each of the plurality of accelerators. In this case, receiving the intermediate result data from each of the plurality of accelerators includes receiving intermediate result data produced when each of the plurality of accelerators processes its partial input data set using the accelerator library function.

According to an embodiment, allocating the accelerator library function and each of the divided partial input data sets to each of the plurality of accelerators includes analyzing a memory access pattern of the accelerator library function and, based on the analyzed access pattern, allocating the input data of the accelerator library function to each of the plurality of accessible accelerators before the accelerator library function is executed.

According to an embodiment, allocating the accelerator library function and each of the divided partial input data sets to each of the plurality of accelerators includes, when the partial input data allocated to each of the plurality of accelerators comprises n partial input data sets (where n is a natural number of 2 or more), allocating the (m+1)-th of the n partial input data sets to each of the plurality of accelerators while the m-th partial input data set is being processed on each of the plurality of accelerators (where m is a natural number less than n).
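The overlap of processing the m-th partial input data set with allocating the (m+1)-th set can be pictured as a simple schedule. The following Python sketch is illustrative only; the function name `pipeline_schedule` and the step layout are assumptions and are not part of the disclosure.

```python
def pipeline_schedule(n):
    """Return a list of (compute, copy) pairs, one per time step, for n
    partial input data sets numbered from 1. `None` means the unit is idle.

    Step 0 only copies set 1. Afterwards, set m is computed while set m+1
    is copied, and the last step only computes set n.
    """
    steps = [(None, 1)]
    for m in range(1, n + 1):
        next_copy = m + 1 if m < n else None
        steps.append((m, next_copy))
    return steps


schedule = pipeline_schedule(3)
```

With n sets the schedule takes n + 1 steps rather than the 2n steps needed if every copy had to finish before its computation started, assuming a copy and a computation take comparable time.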

According to an embodiment, allocating the accelerator library function to each of the plurality of accessible accelerators includes dividing the parameter data of the accelerator library function into a plurality of partial parameter data sets, and allocating the accelerator library function and each of the divided partial parameter data sets to each of the plurality of accelerators.

According to an embodiment, receiving the intermediate result data from each of the plurality of accelerators includes receiving, from each of the plurality of accelerators, intermediate result data produced by processing the parameter data of the accelerator library function, and generating the result data for the called accelerator library function includes generating result data from the intermediate result data produced by processing the parameter data.

According to an embodiment, generating the result data for the called accelerator library function based on the received intermediate result data includes generating the result data by concatenating the received intermediate result data.

According to an embodiment, generating the result data for the called accelerator library function based on the received intermediate result data includes generating the result data by performing an operation on the received intermediate result data.
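The two ways of generating result data from intermediate result data can be contrasted in a short Python sketch. It is a hedged illustration: the function names are hypothetical, and the element-wise sum is only an assumed example of an operation on intermediate results, since the operation itself is not specified here.

```python
def combine_by_concatenation(parts):
    """Join the intermediate results end to end, in accelerator order."""
    out = []
    for part in parts:
        out.extend(part)
    return out


def combine_by_sum(parts):
    """Reduce equally shaped intermediate results element by element.
    The sum is an assumed example of such an operation."""
    return [sum(values) for values in zip(*parts)]


concatenated = combine_by_concatenation([[1, 2], [3, 4]])
reduced = combine_by_sum([[1, 2], [3, 4]])
```

Concatenation suits the case where each accelerator produces a distinct slice of the output, while a reduction suits the case where every accelerator produces a partial value for the same output elements.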

According to an embodiment, the plurality of accelerators are included in a single computing device.

According to an embodiment, the plurality of accelerators are included in a cluster system comprising a plurality of computing devices.

According to an embodiment of the present disclosure, a computer program stored in a computer-readable recording medium is provided for executing, on a computer, the above-described method of processing a single-accelerator program on a plurality of accelerators.

According to some embodiments of the present disclosure, a high-performance accelerator library that is commonly used by the major DNN frameworks and whose source code is not public can operate on a plurality of accelerators included in one computer or cluster instead of on a single accelerator.

According to some embodiments of the present disclosure, a plurality of accelerators are used automatically at the library level, so a program written for a single accelerator at the level of a user application using a DNN framework can be distributed appropriately in the execution environment. In addition, because the operations are performed using a plurality of accelerators, execution time can be reduced. Furthermore, because the source code of the user application does not need to be changed, the program development process is greatly shortened, which increases programming productivity and simplifies maintenance.

According to some embodiments of the present disclosure, the behavior of the library commonly used by DNN frameworks is changed, so the source code of the DNN framework need not be modified. In addition, an approach may be applied in which the library that was previously linked when the DNN framework was compiled is replaced or intercepted. This approach can easily be applied to many kinds of DNN frameworks. Even when a new DNN framework is developed, the handling of a single accelerator and of a plurality of accelerators can be unified.

Embodiments of the present disclosure will be described with reference to the accompanying drawings described below, in which like reference numerals denote like elements, but the embodiments are not limited thereto.
FIG. 1 is a schematic diagram showing a configuration in which a host computing device is communicably connected to a plurality of computing devices in order to provide a method of processing a single-accelerator program on a plurality of accelerators according to an embodiment of the present disclosure.
FIG. 2 is a block diagram showing the internal configuration of a computing device according to an embodiment of the present disclosure.
FIG. 3 is a block diagram showing the internal configuration of a processor for providing a method of processing, on a plurality of accelerators, a single-accelerator program that uses a DNN framework according to an embodiment of the present disclosure.
FIG. 4 is a flowchart showing a method of processing, on a plurality of accelerators, a single-accelerator program that uses a DNN framework according to an embodiment of the present disclosure.
FIG. 5 is a diagram showing an example of dividing input data into a plurality of partial input data sets and allocating each of the divided partial input data sets to a plurality of accelerators according to an embodiment of the present disclosure.
FIG. 6 is a diagram showing an example of allocating data to a plurality of accelerators when the input data cannot be divided into a plurality of partial input data sets according to an embodiment of the present disclosure.
FIG. 7 is a diagram showing an example of generating result data by concatenating intermediate result data received from a plurality of accelerators according to an embodiment of the present disclosure.
FIG. 8 is a diagram showing an example of generating result data by performing an operation on intermediate result data received from a plurality of accelerators according to an embodiment of the present disclosure.
FIG. 9 is a diagram showing an example of performing operations on a plurality of accelerators while simultaneously copying a plurality of partial input data sets according to an embodiment of the present disclosure.

Hereinafter, specific details for implementing the present disclosure are described with reference to the accompanying drawings. In the following description, however, detailed descriptions of well-known functions or configurations are omitted where they might unnecessarily obscure the subject matter of the present disclosure.

In the accompanying drawings, identical or corresponding components are given the same reference numerals. In the description of the following embodiments, repeated descriptions of identical or corresponding components may be omitted. Even where the description of a component is omitted, however, this is not intended to mean that the component is excluded from any embodiment.

The advantages and features of the disclosed embodiments, and the methods of achieving them, will become clear with reference to the embodiments described below together with the accompanying drawings. The present disclosure is not, however, limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only so that the present disclosure is complete and fully conveys the scope of the invention to those skilled in the art.

The terms used in this specification are briefly explained first, and the disclosed embodiments are then described in detail. Where possible, the terms used in this specification are general terms in wide current use, selected in view of their function in the present disclosure, but they may vary according to the intention of those working in the related field, precedent, the emergence of new technologies, and the like. In certain cases the applicant has chosen a term arbitrarily, and in such cases its meaning is set out in detail in the corresponding part of the description. The terms used in the present disclosure should therefore be defined on the basis of their meaning and the overall content of the present disclosure, not merely by their names.

In this specification, singular expressions include plural expressions unless the context clearly specifies the singular. Likewise, plural expressions include singular expressions unless the context clearly specifies the plural. Throughout the specification, when a part is said to "include" a component, this means that it may further include other components, rather than excluding them, unless specifically stated otherwise.

In the present disclosure, a 'cluster system' may include a plurality of computers connected through a network. Under such a cluster system, a client device can use the cluster system as if it were a single computer. Because the cluster system lets a connected user use a plurality of computers as if they were one, it can achieve a processing speed far higher than that of a single computer.

In the present disclosure, a 'batch' refers to the number of samples processed before the network parameters of a model are updated in deep learning. The training data set of a deep learning model may be divided into one or more batches. For example, in a deep learning model that processes images, the batch size is the number of images processed at one time. A batch size of 64 may mean that 64 image files are placed one after another in a contiguous memory region. Data divided in units of batches is therefore structurally identical and can be distributed to a plurality of accelerators without any special processing.

In the present disclosure, a 'library wrapper' is a collection of subroutines or classes used in developing software, and may consist of a thin layer of code that converts a library's existing interface into a compatible interface. For example, a library wrapper may include functions configured to change the behavior of a library commonly used by DNN frameworks, so the source code of the DNN framework itself need not be modified. Accordingly, it will be apparent to those skilled in the art that the library wrapper of the present disclosure can be applied in a similar way to various DNN frameworks.
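The idea of a wrapper that keeps the library's interface while changing its behavior can be illustrated with a small Python sketch. The names `single_lib`, `conv_forward`, and `make_wrapper` are hypothetical, and the chunks are processed sequentially here for simplicity; in a real system each chunk would be sent to a different accelerator.

```python
import types

# Hypothetical single-accelerator library exposing one function.
single_lib = types.SimpleNamespace(
    conv_forward=lambda batch: [x + 1 for x in batch]
)


def make_wrapper(lib, num_accelerators):
    """Return an object with the same interface as `lib` whose function
    splits the batch, calls the original function on each chunk, and
    concatenates the results."""
    original = lib.conv_forward

    def conv_forward(batch):
        # Ceiling division; at least 1 so that an empty batch is handled.
        step = max(1, -(-len(batch) // num_accelerators))
        chunks = [batch[i:i + step] for i in range(0, len(batch), step)]
        out = []
        for chunk in chunks:  # one chunk per (simulated) accelerator
            out.extend(original(chunk))
        return out

    return types.SimpleNamespace(conv_forward=conv_forward)


wrapped_lib = make_wrapper(single_lib, num_accelerators=2)
wrapped_output = wrapped_lib.conv_forward([1, 2, 3, 4, 5])
```

Because the wrapper exposes the same function name and signature, code written against `single_lib` can call `wrapped_lib` unchanged, which mirrors why the framework's own source code need not be modified.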

FIG. 1 is a schematic diagram showing a configuration in which a host computing device 120 is communicably connected to a plurality of computing devices 130_1, 130_2, ..., 130_n in order to provide a method of processing a single-accelerator program on a plurality of accelerators according to an embodiment of the present disclosure. Here, the plurality of accelerators may be included in a cluster system 100 that includes the plurality of computing devices 130_1, 130_2, ..., 130_n. For example, the plurality of accelerators may be included in at least some of the plurality of computing devices 130_1, 130_2, ..., 130_n. The cluster system 100 may include a network 110, the host computing device 120, and the plurality of computing devices 130_1, 130_2, ..., 130_n.

The network 110 may be configured to enable communication between the host computing device 120 and the plurality of computing devices 130_1, 130_2, ..., 130_n included in the cluster system 100. Depending on the installation environment, the network 110 may be configured as a wired network such as Ethernet, power line communication (a wired home network), a telephone line communication device, or RS-serial communication; as a wireless network such as a mobile communication network, a wireless LAN (WLAN), Wi-Fi, Bluetooth, or ZigBee; or as a combination thereof. In other words, the communication method is not limited and may include not only communication over a network that the network 110 may include (for example, a mobile communication network, the wired Internet, the wireless Internet, a broadcasting network, or a satellite network) but also short-range wireless communication between the host computing device 120 and the plurality of computing devices 130_1, 130_2, ..., 130_n. For example, the network 110 may include any one or more of a personal area network (PAN), a local area network (LAN), a campus area network (CAN), a metropolitan area network (MAN), a wide area network (WAN), a broadband network (BBN), the Internet, and the like. In addition, the network 110 may include any one or more network topologies, including a bus network, a star network, a ring network, a mesh network, a star-bus network, and a tree or hierarchical network, but is not limited thereto.

The host computing device 120 may be configured to execute a single-accelerator program that uses a DNN framework. The host computing device 120 may also be configured to communicate with the plurality of computing devices 130_1, 130_2, ..., 130_n and to control the operation of the plurality of accelerators included in the plurality of computing devices 130_1, 130_2, ..., 130_n. For example, the host computing device 120 may allocate an accelerator library function and input data to the plurality of computing devices 130_1, 130_2, ..., 130_n, receive from each of the plurality of computing devices 130_1, 130_2, ..., 130_n intermediate result data produced by processing the accelerator library function, and generate result data for the called accelerator library function based on the received intermediate result data.

The plurality of computing devices 130_1, 130_2, ..., 130_n are computing devices that perform information processing and communication on the cluster system 100. Each may take the form of a terminal such as a computer or a remote processing device. Each of the plurality of computing devices 130_1, 130_2, ..., 130_n may perform information processing independently, but may also do so in cooperation with the other computing devices through parallel programming. Each of the plurality of computing devices 130_1, 130_2, ..., 130_n may communicate through the network 110 for the operation of a deep learning application, and may act as a source, a destination, or a relay point of data.

FIG. 2 is a block diagram showing the internal configuration of a computing device 200 according to an embodiment of the present disclosure. A plurality of accelerators 230_1, 230_2, ..., 230_n may be included in the computing device 200. According to an embodiment, the computing device 200 may refer to the host computing device 120. In another embodiment, the computing device 200 may refer to each of the plurality of computing devices 130_1, 130_2, ..., 130_n. As shown in FIG. 2, one computing device 200 may include a processor 210, a main memory 220, and the plurality of accelerators 230_1, 230_2, ..., 230_n.

The processor 210 may be configured as a general-purpose processor for arithmetic processing, for example a CPU (Central Processing Unit), and may be connected to the plurality of accelerators 230_1, 230_2, ..., 230_n to control their operation. The main memory 220 may also be connected to the processor 210. For example, the processor 210 may be connected to the accelerators 230_1, 230_2, ..., 230_n and/or the main memory 220 through a PCI-E (Peripheral Component Interconnect Express) bus, and may transmit and receive data for controlling the accelerators 230_1, 230_2, ..., 230_n and/or the main memory 220.

According to one embodiment, the main memory 220 should be interpreted broadly to include any electronic component capable of storing electronic information. For example, the main memory 220 may refer to various types of processor-readable media such as random access memory (RAM), read-only memory (ROM), non-volatile random access memory (NVRAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable PROM (EEPROM), flash memory, magnetic or optical data storage, and registers. The main memory 220 is said to be in electronic communication with the processor 210 and/or the plurality of accelerators 230_1, 230_2, ..., 230_n if the processor 210 and/or the accelerators can read information from, or write information to, the memory. In the present disclosure, the main memory 220 may store any data and/or information (e.g., program execution data, input data, parameter data, result data) associated with programs executed by the processor 210 and/or the accelerators 230_1, 230_2, ..., 230_n (e.g., the single accelerator program, the DNN framework, the accelerator library wrapper, and accelerator library functions).

Unlike a general-purpose CPU, the plurality of accelerators 230_1, 230_2, ..., 230_n may be configured as processors specialized for particular patterns of computation. For example, each of the accelerators 230_1, 230_2, ..., 230_n may include a GPU, FPGA, DSP, Intel Xeon Phi, TPU, NPU, multi-core CPU, or the like. In addition, an accelerator memory (not shown), separate from the main memory, may be connected to or included in each of the accelerators 230_1, 230_2, ..., 230_n.

FIG. 3 is a block diagram illustrating the internal configuration of a processor 300 that provides a method of processing, on a plurality of accelerators, a single accelerator program 310 that uses a DNN framework 320, according to an embodiment of the present disclosure. According to one embodiment, the processor 300 may refer to the processor 210 of FIG. 2. As shown, the single accelerator program 310, the DNN framework 320, the accelerator library wrapper 330, and the accelerator library 340 may be operated or processed by, or on, the processor 300.

The single accelerator program 310 may refer to a parallel programming model, or a program written in one, that targets accelerators 230 of various platforms in order to utilize the computational resources of an accelerator. According to one embodiment, the single accelerator program 310 may refer to a user application that targets a single GPU. For example, the single accelerator program 310 may include any program that supports deep learning operation functions for any accelerator, including, without limitation, parallel programming models or programs such as OpenCL (Open Computing Language) and CUDA (Compute Unified Device Architecture). A deep learning operation function may be called through the single accelerator program 310.

The DNN framework 320 may include a collection of software built to make deep learning applications easier to write and run. The DNN framework 320 can accelerate the training and inference processes by applying deep learning processing or deep learning operation functions to an accelerator, so that developers can more easily use parallel programming models or programs that otherwise demand a high level of expertise. For example, the DNN framework 320 may include, without limitation, widely used DNN frameworks such as Caffe, TensorFlow, PyTorch, CNTK, and Theano.

The accelerator library 340 may refer to a high-performance library that can be called through the DNN framework 320. The accelerator library 340 may provide a plurality of accelerator library functions to a user or developer. Such accelerator library functions may be configured to run on a single accelerator.

According to one embodiment, the accelerator library 340 may refer to, without limitation, cuDNN, cuBLAS, and the like. According to one embodiment, cuDNN may provide, in the form of an API, forward propagation and backward propagation for convolutional layers, pooling layers, and the like. For example, the cudnnConvolutionForward() function of cuDNN may receive, in batch units, the weights corresponding to a given layer among the network parameters together with the feature map (or image data) of the previous layer, and may output, in batch units, the output feature map passed to the next layer.

The accelerator library wrapper 330 may be configured to intercept a called accelerator library function and wrap it so that it can be processed in parallel by the plurality of accelerators 230_1, 230_2, ..., 230_n. Under this structure, code written for a single accelerator 230 at the level of the single accelerator program 310 and the DNN framework 320 can be executed on the plurality of accelerators 230_1, 230_2, ..., 230_n at the library level. For example, the accelerator library wrapper 330 may refer to a GPU library wrapper. In one embodiment, when the single accelerator program 310 is processed using the accelerator library wrapper 330, the single accelerator program 310 can be processed in parallel on the plurality of accelerators 230_1, 230_2, ..., 230_n by modifying only the build process of the DNN framework 320, without modifying the accelerator library functions or the source code (for example, the user program and the DNN framework 320). For example, the processor 210 may assign the accelerator library function to each of the plurality of accelerators 230_1, 230_2, ..., 230_n.
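As a non-limiting illustration, the wrapping idea may be sketched in Python as follows. The sketch does not call any real accelerator library: `scale_forward_single` is a hypothetical stand-in for a single-accelerator library function, `wrap_for_accelerators` is a hypothetical helper, and worker threads stand in for accelerators. The point shown is only that the caller keeps a single-accelerator call signature while the wrapper splits, dispatches, and reassembles behind it.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def scale_forward_single(batch: Sequence[float], weight: float) -> List[float]:
    """Hypothetical single-accelerator library function: one output per sample."""
    return [x * weight for x in batch]


def wrap_for_accelerators(
    fn: Callable[[Sequence[float], float], List[float]],
    num_accelerators: int,
) -> Callable[[Sequence[float], float], List[float]]:
    """Return a function with fn's signature that runs fn on several workers."""

    def wrapper(batch: Sequence[float], weight: float) -> List[float]:
        # Split the batch into contiguous chunks, one per simulated accelerator.
        n = len(batch)
        bounds = [n * i // num_accelerators for i in range(num_accelerators + 1)]
        chunks = [batch[bounds[i]:bounds[i + 1]] for i in range(num_accelerators)]
        # Each worker thread stands in for one accelerator running the unmodified fn.
        with ThreadPoolExecutor(max_workers=num_accelerators) as pool:
            partials = list(pool.map(lambda chunk: fn(chunk, weight), chunks))
        # Concatenate the partial outputs in order to restore the batch order.
        return [y for part in partials for y in part]

    return wrapper


# The caller still makes a single-accelerator style call.
scale_forward = wrap_for_accelerators(scale_forward_single, num_accelerators=3)
result = scale_forward([1.0, 2.0, 3.0, 4.0, 5.0], 2.0)
```

In this sketch the wrapped function is passed in explicitly; in the disclosed arrangement the substitution would instead be arranged through the build process of the DNN framework, which is outside what this sketch models.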

FIG. 4 is a flowchart illustrating a method 400 of processing, on a plurality of accelerators, a single accelerator program that uses a DNN framework, according to an embodiment of the present disclosure. The method 400 may be performed by a processor. As shown, the method 400 may begin with receiving a call to a deep learning operation function (S410). For example, the DNN framework running on the processor may receive a call to a deep learning operation function from the single accelerator program.

Then, in step S420, a call to an accelerator library function for executing the deep learning operation function on a single accelerator may be received. For example, the accelerator library wrapper running on the processor may receive, from the DNN framework, a call to an accelerator library function for executing the deep learning operation function on a single accelerator.

Next, in step S430, in response to the call to the accelerator library function, the accelerator library function may be assigned to each of a plurality of accessible accelerators. In one embodiment, the accelerator library wrapper on the processor may receive, from the DNN framework, the call to the accelerator library function for executing the deep learning operation function on a single accelerator, and may assign the accelerator library function to each of the plurality of accelerators. In doing so, the accelerator library wrapper may divide the input data of the accelerator library function into a plurality of partial input data sets and assign each of the partial input data sets to a respective one of the accelerators. For example, when the input data of the accelerator library function is received in batch units, the accelerator library wrapper may split the input data along the batch dimension into one or more data sets and assign them to the respective accelerators.

According to one embodiment, the accelerator library wrapper may assign the accelerator library function and each of the partial input data sets to each of the plurality of accelerators. In doing so, the memory access pattern of the accelerator library function may be analyzed and, based on the analyzed access pattern, the input data of the accelerator library function may be assigned to each of the accessible accelerators before the function is executed. For example, the memory access pattern of the inputs and outputs of a DNN-related accelerator library function may be analyzed using compiler analysis techniques and/or profiling techniques (for example, executing the function in advance to determine its memory input/output pattern). As another example, before the accelerator library function is executed, data may be distributed to the plurality of accelerators in advance using the access pattern obtained by compiler analysis and/or profiling. Owing to the way a DNN framework operates, work of the same pattern may be repeated over many iterations throughout training or inference. Exploiting this property, once the memory access pattern of the first iteration or the first few iterations (for example, three) has been identified, the memory access pattern of each accelerator library function over the entire training or inference process can be determined.
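As a non-limiting illustration of the profiling approach, the following Python sketch records which buffers a library function reads and writes during the first few iterations and trusts the pattern only when every recorded trace is identical. The class `AccessPatternProfiler`, the `(buffer name, mode)` encoding, and the function name `"conv_forward"` are all hypothetical; how a real system would observe the accesses is not modeled here.

```python
from typing import Dict, List, Optional, Tuple

Access = Tuple[str, str]  # (buffer name, "read" or "write"); hypothetical encoding


class AccessPatternProfiler:
    """Records per-iteration access traces of library functions (illustrative only)."""

    def __init__(self, warmup_iterations: int = 3) -> None:
        self.warmup_iterations = warmup_iterations
        self.traces: Dict[str, List[Tuple[Access, ...]]] = {}

    def record(self, function_name: str, accesses: List[Access]) -> None:
        # Keep only the first warmup_iterations traces for each function.
        history = self.traces.setdefault(function_name, [])
        if len(history) < self.warmup_iterations:
            history.append(tuple(accesses))

    def stable_pattern(self, function_name: str) -> Optional[Tuple[Access, ...]]:
        # A pattern is trusted only after enough identical traces have been seen.
        history = self.traces.get(function_name, [])
        if len(history) < self.warmup_iterations:
            return None
        if any(trace != history[0] for trace in history):
            return None
        return history[0]

    def buffers_to_predistribute(self, function_name: str) -> List[str]:
        # Buffers that are read can be copied to the accelerators before the call.
        pattern = self.stable_pattern(function_name)
        if pattern is None:
            return []
        return [name for name, mode in pattern if mode == "read"]


profiler = AccessPatternProfiler(warmup_iterations=3)
for _ in range(3):
    profiler.record(
        "conv_forward",
        [("input", "read"), ("weight", "read"), ("output", "write")],
    )
prefetch = profiler.buffers_to_predistribute("conv_forward")
```

After the warm-up iterations, the buffers returned by `buffers_to_predistribute` are the candidates for distribution ahead of the next call; before that point the method returns an empty list and nothing is distributed in advance.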

Next, in step S440, intermediate result data obtained by processing the accelerator library function may be received from each of the plurality of accelerators. According to one embodiment, the accelerator library wrapper may receive the intermediate result data from the accelerators that processed the accelerator library function. For example, the accelerator library wrapper may receive intermediate result data obtained when each accelerator processes its partial input data set using the accelerator library function. As another example, when the output data of the accelerator library function is produced in batch units, the accelerator library wrapper may receive, in batch units, the intermediate result data obtained by processing each of the input data sets from the plurality of accelerators.

Finally, in step S450, result data for the called accelerator library function may be generated based on the received intermediate result data. For example, the accelerator library wrapper may generate the result data based on the intermediate result data received from the plurality of accelerators, as described in more detail below with reference to FIGS. 7 and 8.

The plurality of accelerators 530_1, 530_2, ..., 530_n (where n is a natural number of 2 or more) shown in FIGS. 5 to 8, described below, may be included in a single computing device or in a cluster system that includes a plurality of computing devices. Here, the accelerators 530_1, 530_2, ..., 530_n may include accelerators accessible by the computing device (node) on which the DNN framework runs.

FIG. 5 is a diagram illustrating an example in which input data 510 is divided into a plurality of partial input data sets 520_1, 520_2, ..., 520_n and each of the partial input data sets 520_1, 520_2, ..., 520_n is assigned to the plurality of accelerators 530_1, 530_2, ..., 530_n, according to an embodiment of the present disclosure.

According to one embodiment, as shown, the input data 510 of the accelerator library function may be divided into a plurality of partial input data sets 520_1, 520_2, ..., 520_n. Each of the partial input data sets 520_1, 520_2, ..., 520_n, together with the accelerator library function, may be assigned to a respective one of the accelerators 530_1, 530_2, ..., 530_n.

According to one embodiment, when the input data 510 is received in batch units, the accelerator library wrapper may distribute it evenly among the plurality of accelerators 530_1, 530_2, ..., 530_n. For example, in TensorFlow or PyTorch the memory layout of data corresponding to neurons is generally "NCHW" or "NHWC"; because "N" comes first, these are layouts in which the highest (outermost) dimension is the batch. Accordingly, assuming that the cluster system contains two accelerators, the entire data can be divided into two contiguous parts and distributed to the two accelerators.
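As a non-limiting illustration, the even split along the batch dimension may be sketched in Python as follows. Because "N" is the outermost dimension in both "NCHW" and "NHWC", a row-major buffer can be cut at multiples of C*H*W without reordering any element. The function names `even_batch_ranges` and `split_flat_nchw` are hypothetical, and a plain Python list stands in for accelerator memory.

```python
from typing import List, Tuple


def even_batch_ranges(batch_size: int, num_accelerators: int) -> List[Tuple[int, int]]:
    """Contiguous [start, end) sample ranges whose sizes differ by at most one."""
    bounds = [batch_size * i // num_accelerators for i in range(num_accelerators + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(num_accelerators)]


def split_flat_nchw(
    flat: List[float], n: int, c: int, h: int, w: int, num_accelerators: int
) -> List[List[float]]:
    """Split a row-major NCHW (or NHWC) buffer along N without reordering elements."""
    sample_size = c * h * w  # elements per sample; identical for NCHW and NHWC
    assert len(flat) == n * sample_size
    return [
        flat[start * sample_size:end * sample_size]
        for start, end in even_batch_ranges(n, num_accelerators)
    ]


# Example: N=4, C=1, H=1, W=2 split across two accelerators.
flat_input = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
parts = split_flat_nchw(flat_input, n=4, c=1, h=1, w=2, num_accelerators=2)
```

When the batch size is not a multiple of the number of accelerators, `even_batch_ranges` lets the chunk sizes differ by one sample; this rounding rule is an assumption of the sketch and is not prescribed by the disclosure.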

In another embodiment, when the distributed partial input data are processed at different speeds, or when processing performance differs because of the network configuration of the cluster system or the like, the plurality of partial input data sets 520_1, 520_2, ..., 520_n may be distributed unevenly among the plurality of accelerators in consideration of the performance difference.
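As a non-limiting illustration, an uneven distribution may be sketched in Python by sizing each chunk in proportion to a relative throughput figure per accelerator. The function `weighted_batch_sizes` and the throughput values are hypothetical; the disclosure does not specify how the performance difference is measured or which proportionality rule is used.

```python
from typing import List, Sequence


def weighted_batch_sizes(batch_size: int, throughputs: Sequence[float]) -> List[int]:
    """Split batch_size into chunk sizes proportional to relative throughputs."""
    total = sum(throughputs)
    sizes: List[int] = []
    cumulative = 0.0
    previous_bound = 0
    for throughput in throughputs:
        # Round the cumulative boundary, not each chunk, so the sizes always
        # add up to batch_size.
        cumulative += throughput
        bound = round(batch_size * cumulative / total)
        sizes.append(bound - previous_bound)
        previous_bound = bound
    return sizes


# An accelerator assumed to be three times faster receives three times the samples.
sizes = weighted_batch_sizes(64, [3.0, 1.0])
```

Rounding cumulative boundaries rather than individual chunk sizes is a design choice of the sketch: it keeps the chunks contiguous and guarantees that no sample is dropped or duplicated.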

FIG. 6 is a diagram illustrating an example of assigning data to the plurality of accelerators 530_1, 530_2, ..., 530_n when the input data 610 cannot be divided into a plurality of partial input data sets, according to an embodiment of the present disclosure.

In one embodiment, when the input data 610 cannot be divided in batch units, the accelerator library wrapper may assign the same input data 610 to each of the plurality of accelerators 530_1, 530_2, ..., 530_n. Input data 610 that cannot be divided in batch units may be data that every one of the accelerators 530_1, 530_2, ..., 530_n needs; in a DNN, this is mainly the network parameters. For example, in TensorFlow or PyTorch the memory layout of parameter data is generally "CKHW" or "HWKC", which has no batch dimension and therefore cannot be divided in batch units. Accordingly, a copy of the input data 610 may be assigned to each of the accelerators 530_1, 530_2, ..., 530_n so that all of them use the same data.
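As a non-limiting illustration, the following Python sketch combines the two cases: data with a batch dimension is split, while parameter data without a batch dimension is replicated so that each accelerator holds its own copy. The function `assign_with_replicated_parameters` and the dictionary keys are hypothetical, and Python lists stand in for accelerator memory.

```python
import copy
from typing import Dict, List


def assign_with_replicated_parameters(
    batch: List[float], parameters: List[float], num_accelerators: int
) -> List[Dict[str, List[float]]]:
    """Give each accelerator its own batch slice plus a private copy of the parameters."""
    n = len(batch)
    bounds = [n * i // num_accelerators for i in range(num_accelerators + 1)]
    return [
        {
            "inputs": batch[bounds[i]:bounds[i + 1]],  # split along the batch
            "parameters": copy.deepcopy(parameters),   # replicated, not split
        }
        for i in range(num_accelerators)
    ]


assignments = assign_with_replicated_parameters([1.0, 2.0, 3.0, 4.0], [0.5, -0.5], 2)
```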

According to another embodiment, the accelerator library wrapper may divide the parameter data (i.e., the weights; not shown) of the accelerator library function into a plurality of partial parameter data sets and provide each of the partial parameter data sets to the plurality of accelerators 530_1, 530_2, ..., 530_n. In this case, the accelerator library wrapper may provide the accelerator library function associated with the divided parameter data sets to the accelerators 530_1, 530_2, ..., 530_n together with those data sets. The accelerator library wrapper may also provide the input data 610 of that accelerator library function to the accelerators 530_1, 530_2, ..., 530_n together with the divided parameter data sets and the accelerator library function. Here, as shown, the input data 610 may be left undivided. Alternatively, the accelerator library wrapper may divide the input data 610 of the accelerator library function into a plurality of partial input data sets and provide the partial input data sets to the accelerators 530_1, 530_2, ..., 530_n together with the divided parameter data sets and the accelerator library function.
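As a non-limiting illustration of dividing the parameter data while leaving the input undivided, the following Python sketch uses a small fully connected layer. The weight matrix is split by rows (one row per output), every simulated accelerator receives the whole input vector, and the partial outputs are joined in order. The choice of a fully connected layer, the row-wise split, and the names `dense_forward` and `split_rows` are assumptions made for the sketch; the disclosure does not state along which dimension the parameters are divided.

```python
from typing import List


def dense_forward(x: List[float], weight_rows: List[List[float]]) -> List[float]:
    """Hypothetical library function: y[k] = sum over c of weight_rows[k][c] * x[c]."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight_rows]


def split_rows(rows: List[List[float]], num_parts: int) -> List[List[List[float]]]:
    """Split the weight rows into contiguous shards, one per accelerator."""
    n = len(rows)
    bounds = [n * i // num_parts for i in range(num_parts + 1)]
    return [rows[bounds[i]:bounds[i + 1]] for i in range(num_parts)]


x = [1.0, 2.0]  # undivided input, sent to every accelerator
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]

# Each shard is processed with the full input; outputs are joined in shard order.
partials = [dense_forward(x, shard) for shard in split_rows(weights, 2)]
y = [value for part in partials for value in part]
```

Under these assumptions the joined output `y` equals what a single call over the full weight matrix would produce, because each output element depends on only one row of the weights.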

FIG. 7 is a diagram illustrating an example of generating result data 720 by concatenating intermediate result data 710_1, 710_2, ..., 710_n received from the plurality of accelerators 530_1, 530_2, ..., 530_n, according to an embodiment of the present disclosure. According to one embodiment, the accelerator library wrapper may receive the intermediate result data 710_1, 710_2, ..., 710_n from the accelerators 530_1, 530_2, ..., 530_n and concatenate them to generate the result data 720. For example, for the convolution function among the accelerator library functions provided by cuDNN, the result data 720 may be generated by concatenating the intermediate result data 710_1, 710_2, ..., 710_n received from the accelerators 530_1, 530_2, ..., 530_n.

In one embodiment, when the intermediate result data 710_1, 710_2, ..., 710_n are parallelized in batch units, the accelerator library wrapper 330 may concatenate the intermediate result data 710_1, 710_2, ..., 710_n to generate the result data 720. In that case no further computation is needed: the intermediate result data 710_1, 710_2, ..., 710_n are simply joined and stored as the result data 720.
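As a non-limiting illustration, the concatenation may be sketched in Python as follows. The sketch assumes that each intermediate result is tagged with the index of the accelerator that produced it, so that the original batch order can be restored even if results arrive out of order; the function name `concatenate_intermediate_results` and the tagging scheme are hypothetical.

```python
from typing import Dict, List


def concatenate_intermediate_results(intermediate: Dict[int, List[float]]) -> List[float]:
    """Join per-accelerator outputs in accelerator-index order."""
    result: List[float] = []
    for accelerator_index in sorted(intermediate):
        result.extend(intermediate[accelerator_index])
    return result


# Results may arrive in any order; the index restores the original batch order.
arrived = {1: [30.0, 40.0], 0: [10.0, 20.0], 2: [50.0]}
result_data = concatenate_intermediate_results(arrived)
```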

FIG. 8 is a diagram illustrating an example of generating result data 820 by performing an operation on intermediate result data 810_1, 810_2, ..., 810_n received from the plurality of accelerators 530_1, 530_2, ..., 530_n, according to an embodiment of the present disclosure. As shown in FIG. 8, the accelerator library wrapper may receive, from each of the accelerators 530_1, 530_2, ..., 530_n, intermediate result data 810_1, 810_2, ..., 810_n obtained by processing the parameter data of the accelerator library function, and may perform an operation on these intermediate result data to generate the result data 820. This operation may be referred to as a reduction. For example, for max pooling among the accelerator library functions provided by cuDNN, the result data 820 may be generated by performing an operation on the intermediate result data 810_1, 810_2, ..., 810_n received from the accelerators 530_1, 530_2, ..., 530_n.

According to one embodiment, when the intermediate result data 810_1, 810_2, ..., 810_n are not parallelized along the batch direction, an operation that produces the gradient data of the network parameters required for training may be involved. Each of the accelerators 530_1, 530_2, ..., 530_n may compute a network parameter gradient from the input data distributed to it, and the accelerators may exchange their intermediate result data 810_1, 810_2, ..., 810_n with one another and sum them.
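As a non-limiting illustration of the summing reduction, the following Python sketch computes the weight gradient of a small linear model with a squared-error loss. Each simulated accelerator computes a partial gradient from its own shard of the batch, and the partial gradients are summed element by element. The model, the loss, and the names `weight_gradient` and `sum_reduce` are assumptions chosen to keep the example short; they are not taken from the disclosure.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[List[float], float]  # (input vector x, target y)


def weight_gradient(w: Sequence[float], samples: Sequence[Sample]) -> List[float]:
    """Gradient of sum over samples of 0.5 * (w . x - y)^2 with respect to w."""
    grad = [0.0] * len(w)
    for x, y in samples:
        error = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i, xi in enumerate(x):
            grad[i] += error * xi
    return grad


def sum_reduce(partials: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise sum of per-accelerator partial gradients."""
    return [sum(values) for values in zip(*partials)]


w = [1.0, 1.0]
samples: List[Sample] = [
    ([1.0, 0.0], 1.0),
    ([0.0, 1.0], 2.0),
    ([1.0, 1.0], 0.0),
    ([2.0, 1.0], 3.0),
]

# Two simulated accelerators, each holding half of the batch.
partial_gradients = [weight_gradient(w, samples[:2]), weight_gradient(w, samples[2:])]
reduced = sum_reduce(partial_gradients)
```

Because the loss is a sum over samples, the gradient over the whole batch is the sum of the gradients over the shards, which is why the results here are added rather than concatenated.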

FIG. 9 is a diagram illustrating an example in which a plurality of accelerators 930_1, ..., 930_n (where n is a natural number of 2 or more) perform computation while, at the same time, partial input data sets are copied to them, according to an embodiment of the present disclosure. According to one embodiment, when the partial input data sets 910 assigned to each of the accelerators 930_1, ..., 930_n comprise n partial input data sets (where n is a natural number of 2 or more), the method may include processing the m-th partial input data set 920_1 among the n partial input data sets 910 on each of the accelerators 930_1, ..., 930_n while simultaneously assigning the (m+1)-th partial input data set 920_2 to each of the accelerators 930_1, ..., 930_n (where m is a natural number less than n). To this end, the memory access pattern of the accelerator library function to be processed may be analyzed before each of the partial input data sets 910 is assigned to the accelerators 930_1, ..., 930_n.

In one embodiment, based on the analyzed memory access pattern, the accelerator library wrapper 330 may assign the m-th partial input data set 920_1, which is one of the partial input data sets 910, to the accelerator 930_1 for processing. While the m-th partial input data set 920_1 is being processed on the accelerator 930_1, the accelerator library wrapper 330 may assign the (m+1)-th partial input data set 920_2 to the accelerator 930_1. For example, because of the hardware characteristics of the accelerators 930_1, ..., 930_n, memory copy operations and compute operations within an accelerator are handled separately; a memory transfer overlapping technique can therefore be used to perform memory copies while the accelerators 930_1, ..., 930_n are computing, which reduces the total execution time of the deep learning application. Under this scheme, as shown, partial input data sets 920_1 and 920_3 may be assigned to the accelerators 930_1 and 930_n, respectively, and while these are being processed, further partial input data sets 920_2 and 920_4 may be assigned to the accelerators 930_1 and 930_n, respectively.
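As a non-limiting illustration of why overlapping helps, the following Python sketch computes the finish time of one accelerator with and without overlap. It is a timing model only and performs no real copies. The model assumes a fixed copy time and compute time per partial input data set, one copy engine and one compute engine that run concurrently, and two buffers on the accelerator, so the copy of a chunk can begin only after the chunk two positions earlier has been computed. These assumptions and the function names are hypothetical.

```python
from typing import List


def serial_time(num_chunks: int, copy_time: float, compute_time: float) -> float:
    """Total time when each chunk is copied and then computed, with no overlap."""
    return num_chunks * (copy_time + compute_time)


def overlapped_time(num_chunks: int, copy_time: float, compute_time: float) -> float:
    """Total time when the copy of chunk m+1 overlaps the computation of chunk m."""
    copy_done: List[float] = []
    compute_done: List[float] = []
    for i in range(num_chunks):
        # The copy engine handles one chunk at a time.
        copy_start = copy_done[i - 1] if i >= 1 else 0.0
        if i >= 2:
            # Two buffers: chunk i reuses the buffer of chunk i-2, so its copy
            # must wait until that chunk has been computed.
            copy_start = max(copy_start, compute_done[i - 2])
        copy_done.append(copy_start + copy_time)
        # Computation needs the chunk's data and a free compute engine.
        compute_start = copy_done[i]
        if i >= 1:
            compute_start = max(compute_start, compute_done[i - 1])
        compute_done.append(compute_start + compute_time)
    return compute_done[-1] if compute_done else 0.0


without_overlap = serial_time(4, copy_time=1.0, compute_time=2.0)
with_overlap = overlapped_time(4, copy_time=1.0, compute_time=2.0)
```

Under these assumptions, only the first copy is exposed when computation is the slower of the two stages; every later copy is hidden behind the computation of the preceding chunk.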

The above-described method of processing, on a plurality of accelerators, a single accelerator program that uses a DNN framework may also be implemented as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes every kind of recording device that stores data readable by a computer system. Examples of computer-readable recording media include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices. The computer-readable recording medium may also be distributed over computer systems connected through a network, so that the computer-readable code is stored and executed in a distributed manner. Functional programs, code, and code segments for implementing the above-described embodiments can be readily inferred by programmers in the technical field to which the present invention belongs.

The methods, operations, or techniques of the present disclosure may be implemented by various means. For example, these techniques may be implemented in hardware, firmware, software, or a combination thereof. Those skilled in the art will appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the present disclosure may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or as software depends on the particular application and the design requirements imposed on the overall system. Those skilled in the art may implement the described functionality in varying ways for each particular application, but such implementations should not be interpreted as departing from the scope of the present disclosure.

In a hardware implementation, the processing units used to perform the techniques may be implemented within one or more ASICs, DSPs, digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, electronic devices, other electronic units designed to perform the functions described in the present disclosure, computers, or a combination thereof.

Accordingly, the various illustrative logical blocks, modules, and circuits described in connection with the present disclosure may be implemented or performed with a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, for example, a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.

In a firmware and/or software implementation, the techniques may be implemented as instructions stored on a computer-readable medium such as random access memory (RAM), read-only memory (ROM), non-volatile random access memory (NVRAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable PROM (EEPROM), flash memory, a compact disc (CD), or a magnetic or optical data storage device. The instructions may be executable by one or more processors and may cause the processor(s) to perform certain aspects of the functionality described in the present disclosure.

When implemented in software, the techniques may be stored on, or transmitted over, a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. Storage media may be any available media that can be accessed by a computer. By way of non-limiting example, such computer-readable media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium.

For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, digital subscriber line, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include CD, laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, whereas discs reproduce data optically using lasers. Combinations of the above should also be included within the scope of computer-readable media.

A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium may be coupled to a processor such that the processor can read information from, and write information to, the storage medium. Alternatively, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. Alternatively, the processor and the storage medium may reside as discrete components in a user terminal.

Although the embodiments described above have been described as utilizing aspects of the presently disclosed subject matter in one or more stand-alone computer systems, the present disclosure is not limited thereto and may be implemented in connection with any computing environment, such as a network or distributed computing environment. Furthermore, aspects of the subject matter of the present disclosure may be implemented in a plurality of processing chips or devices, and storage may be similarly affected across the plurality of devices. Such devices may include PCs, network servers, and portable devices.

Although the present disclosure has been described herein in connection with some embodiments, various modifications and changes may be made without departing from the scope of the present disclosure as understood by those of ordinary skill in the art to which the present disclosure belongs. Such modifications and changes should be considered to fall within the scope of the claims appended hereto.

100: cluster system
110: network
120: host computing device
130, 130_1, 130_2, ..., 130_n: computing device
200: computing device
210: processor
220: main memory
230, 230_1, 230_2, ..., 230_n: accelerator
310: program for a single accelerator
320: DNN framework
330: accelerator library wrapper
340: accelerator library

Claims

A method of processing, on a plurality of accelerators, a program for a single accelerator that uses a DNN framework, the method comprising:
receiving a call to a deep learning operation function;
receiving a call to an accelerator library function for executing the deep learning operation function on a single accelerator;
allocating the accelerator library function to each of a plurality of accessible accelerators in response to the call to the accelerator library function;
receiving, from each of the plurality of accelerators, intermediate result data obtained by processing the accelerator library function; and
generating result data for the called accelerator library function based on the received intermediate result data.

The method of claim 1,
wherein allocating the accelerator library function to each of the plurality of accessible accelerators comprises:
dividing input data of the accelerator library function into a plurality of partial input data sets; and
allocating the accelerator library function and a respective one of the plurality of partial input data sets to each of the plurality of accelerators, and
wherein receiving the intermediate result data from each of the plurality of accelerators comprises receiving result data obtained by processing each of the plurality of partial input data sets using the accelerator library function on each of the plurality of accelerators.

The method of claim 2,
wherein allocating the accelerator library function and a respective one of the plurality of partial input data sets to each of the plurality of accelerators comprises:
analyzing a memory access pattern of the accelerator library function; and
allocating, based on the analyzed memory access pattern, the input data of the accelerator library function to each of the plurality of accessible accelerators before the accelerator library function is executed.

The method of claim 2,
wherein allocating the accelerator library function and a respective one of the plurality of partial input data sets to each of the plurality of accelerators comprises,
when the partial input data sets allocated to each of the plurality of accelerators include n partial input data sets (where n is a natural number of 2 or more), processing an m-th partial input data set of the n partial input data sets on each of the plurality of accelerators while simultaneously allocating an (m+1)-th partial input data set to each of the plurality of accelerators (where m is a natural number less than n).

The method of claim 1,
wherein allocating the accelerator library function to each of the plurality of accessible accelerators comprises:
dividing parameter data of the accelerator library function into a plurality of partial parameter data sets; and
allocating the accelerator library function and a respective one of the plurality of partial parameter data sets to each of the plurality of accelerators.

The method of claim 5,
wherein receiving the intermediate result data from each of the plurality of accelerators comprises receiving, from each of the plurality of accelerators, intermediate result data obtained by processing the parameter data of the accelerator library function, and
wherein generating the result data for the called accelerator library function comprises generating the result data from the intermediate result data obtained by processing the parameter data of the accelerator library function.

The method of claim 1,
wherein generating the result data for the called accelerator library function based on the received intermediate result data comprises generating the result data by concatenating the received intermediate result data.

The method of claim 1,
wherein generating the result data for the called accelerator library function based on the received intermediate result data comprises generating the result data by performing a computation on the received intermediate result data.

The method of claim 1,
wherein the plurality of accelerators are included in a single computing device.

The method of claim 1,
wherein the plurality of accelerators are included in a cluster system comprising a plurality of computing devices.

A computer program stored in a computer-readable recording medium to execute, on a plurality of accelerators, the method of any one of claims 1 to 10 for processing a program for a single accelerator using a DNN framework.
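As an illustration only, and not as part of the claims, the two ways of combining intermediate result data recited above can be sketched in plain Python. All names below (`matmul`, `split`, `run_input_split`, `run_param_split`) are hypothetical. A matrix multiplication stands in for the accelerator library function, and each list comprehension or loop iteration stands in for one accelerator. Splitting the parameters along the inner dimension and summing the partial products is an assumed example of "performing a computation" on the intermediate results; the disclosure does not fix a particular split or computation.

```python
def matmul(a, b):
    # Plain-Python stand-in for an accelerator library function.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split(items, n):
    # Split a list into n contiguous parts whose sizes differ by at most one.
    q, r = divmod(len(items), n)
    parts, start = [], 0
    for i in range(n):
        end = start + q + (1 if i < r else 0)
        parts.append(items[start:end])
        start = end
    return parts


def run_input_split(inputs, weights, n_accel):
    # Each "accelerator" gets a slice of the input rows and all the weights;
    # the intermediate results are concatenated in order.
    partial = [matmul(part, weights) for part in split(inputs, n_accel) if part]
    return [row for block in partial for row in block]


def run_param_split(inputs, weights, n_accel):
    # Each "accelerator" gets a slice of the weight rows (the inner dimension)
    # and the matching input columns; the intermediate results are summed.
    # Assumes weights has at least one row.
    partial = []
    for idx in split(list(range(len(weights))), n_accel):
        if not idx:
            continue
        a_part = [[row[j] for j in idx] for row in inputs]
        w_part = [weights[j] for j in idx]
        partial.append(matmul(a_part, w_part))
    result = partial[0]
    for block in partial[1:]:
        result = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(result, block)]
    return result
```

Both functions return the same matrix as a single call to `matmul(inputs, weights)`, which is the property that lets a program written for one accelerator run unchanged on several.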